% \VignetteIndexEntry{ADaCGH2 Overview}
% \VignetteDepends{ADaCGH2}
% \VignetteKeywords{ADaCGH2 CGH segmentation parallelization}
% \VignettePackage{ADaCGH2}
\synctex=1
\documentclass[a4paper,11pt]{article}
\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{threeparttable}
\usepackage{array}
\usepackage{hyperref}
\usepackage{geometry}
\geometry{verbose,a4paper,tmargin=23mm,bmargin=26mm,lmargin=28mm,rmargin=28mm}
\SweaveOpts{echo=TRUE}
%\usepackage{tikz}
%\usetikzlibrary{arrows,shapes,positioning}
\usepackage[latin1]{inputenc}
\usepackage{datetime}
\newdateformat{mydate}{\THEDAY-\monthname[\THEMONTH]-\THEYEAR}

\title{Analysis of data from aCGH experiments using parallel computing and ff objects}
\date{\mydate\today}
\author{Ramon Diaz-Uriarte$^{1}$, Daniel Rico$^{2}$, and Oscar M.\ Rueda$^{3}$}

%% Remember to add BiocStyle to Suggests
%%
%% I get lots of problems, so will try later.
%% <<>>=
%% BiocStyle::latex()
%% @

\begin{document}

\maketitle

\begin{center}
  1. Department of Biochemistry, Universidad Autonoma de Madrid,
  Instituto de Investigaciones Biomedicas ``Alberto Sols'' (UAM-CSIC), Madrid (SPAIN).

  2. Structural Computational Biology Group, Spanish National Cancer Center (CNIO), Madrid (SPAIN).

  3. Cancer Research UK Cambridge Research Institute, Cambridge, UK.

  {\tt rdiaz02@gmail.com}, {\tt drico@cnio.es}, {\tt Oscar.Rueda@cancer.org.uk}
\end{center}

\tableofcontents

\section{This vignette}

This vignette presents the ADaCGH2 package using:
\begin{itemize}
\item Three fully commented examples that deal with the usage of the different
  parallelization options and types of objects (in particular, \textit{ff}
  objects) available.
\item Examples of using ADaCGH2 with CGHregions, Limma, and snapCGH.
\end{itemize}

All of the runnable examples in this vignette use a small toy data set (they
need to run in a reasonably short time on a variety of machines). In the
vignette called ``ADaCGH2-long-examples'' we list example calls for all
segmentation methods, with different options for the methods, as well as
different options for type of input object and clustering. That other vignette
is provided both as extended help and as a simple way of checking that all the
functions can be run and yield identical results regardless of type of input
and clustering. Finally, the file ``benchmarks.pdf'' presents extensive
benchmarks comparing the current version of ADaCGH2 ($\geq$ 2.3.6) with the
former version (v.\ 1.10, in BioConductor 2.12), as well as some comparisons
with non-parallelized executions and a discussion of recommended patterns of
usage.

% In the vignette ADaCGH2-long (under /inst/doc) there are examples that
% use ADaCGH2 with much larger data sets (millions of probes, hundreds of
% samples).

\section{Overview}

ADaCGH2 is a package for the analysis of CGH data. The main features of
ADaCGH2 are:
\begin{itemize}
\item Parallelization of (several of) the main segmentation/calling algorithms
  currently available, to allow efficient usage of computing clusters.
  Parallelization can use either \textit{forking} (in Unix-like OSs) or
  sockets, MPI, etc., as provided by package \textit{snow}
  (\url{http://cran.r-project.org/web/packages/snow/index.html}). Forking will
  probably be the fastest approach on multicore machines, whereas MPI or
  sockets will be used with clusters made of several independent machines with
  few CPUs/cores each.
\item Optional storage of, and access to, data using the \textit{ff} package
  (\url{http://cran.r-project.org/web/packages/ff/index.html}), making it
  possible to analyze data from very large projects and/or use machines with
  limited memory.
\item Parallelization and \textit{ff} can be used simultaneously. WaviCGH
  \cite{Carro2010} (\url{http://wavi.bioinfo.cnio.es}), a web-server
  application for the analysis and visualization of array-CGH data that uses
  ADaCGH2, constitutes a clear demonstration of the usage of \textit{ff} on a
  computing cluster with shared storage over NFS.
\end{itemize}

ADaCGH2 is a major rewrite of our former package ADaCGH \cite{Diaz-Uriarte2007},
and version 2 of ADaCGH2 is, itself, a major rewrite of the version 1.x series.
Over time, we have improved the parallelization and, especially, completely
changed the data handling routines. The first major rewrite of ADaCGH2
introduced the usage of the \textit{ff} package, which allows ADaCGH2 to
analyze data sets of more than four million probes on machines with no more
than 2 GB of RAM. The second major rewrite reimplemented all the reading
routines, and much of the analysis, which now allow a wider range of options
with increased speed and decreased memory usage, and also allow users to
disable the usage of \textit{ff}. Moreover, in the new version, a large part of
the reading is parallelized and makes use of temporary \textit{ff} objects, and
we allow parallelization of the analysis (and data reading) using forking.
Further details and comparisons between the old and new versions are provided
in the document ``benchmarks.pdf'', included with this package.

\subsection{Terminology}

The following is the meaning of some terms we will use repeatedly.

\begin{description}
\item[\textit{ff} object] An object that uses the \textit{ff} package. A tiny
  part of that object lives in memory, in the R session, but most of the
  object is stored on the hard drive. The part that lives in memory is just a
  pointer to the object that resides on the hard drive.
\item[RAM objects] The ``usual'' R objects (in our case, mainly data frames
  and matrices); these are stored, or live, in memory. Somewhat similarly to
  what the documentation of the \texttt{ff} package does, we refer to these
  objects, which reside in memory, as RAM objects. Technically, a given data
  frame, for instance, need not be in RAM at a particular moment (that actual
  memory page might have been swapped to disk). Regardless, the object is
  accessed as any other object that resides in memory. Likewise, note that
  \textit{ff} objects also have a small part that is in memory, but the data
  themselves are stored on disk.
\item[forking] We copy literally from the vignette of the \textit{parallel}
  package \cite{Rcitation}: ``Fork is a concept from POSIX operating systems,
  and should be available on all R platforms except Windows. This creates a
  new R process by taking a complete copy of the master process, including the
  workspace and state of the random-number stream. However, the copy will (in
  any reasonable OS) share memory pages with the master until modified so
  forking is very fast.'' Forking is, thus, a reasonable way of parallelizing
  jobs on multicore computers. Note, however, that this will not work
  \textbf{across} machines (for instance, across workstations in clusters of
  workstations).
\item[cluster] We use this term here to contrast it with \textit{forking}.
  With a \textit{cluster}, tasks are sent to other R processes using, for
  instance, MPI or any of the other methods provided by package \textbf{snow}
  (e.g., PVM, sockets, or NWS). For example, MPI (for ``Message Passing
  Interface'') is a standardized system for parallel computing, probably the
  most widely used approach for parallelization on distributed memory machines
  (such as clusters of workstations). The package \textbf{Rmpi} (and
  \textbf{snow} on top of \textbf{Rmpi}) uses MPI. In the examples in this
  vignette, however, we will use clusters of type \textit{socket}, as these
  are available in several OSs (including Windows) and do not require
  installation of MPI.

  If we are running Linux, Unix, or another POSIX operating system on a single
  computer with multiple cores, we can use both forking and clusters (e.g.,
  MPI or sockets). In most cases forking will be preferable, as we will avoid
  some communication overheads and it will also probably use less total
  memory. If we are running Windows, however, we will need to use a cluster
  even on a single multicore machine.
\end{description}

\subsection{Suggested usage patterns summary}

The following table provides a simple guide to suggested usage patterns with
small to moderate data sets:

%% this :\begin{tabular}{lp{7.5cm}r}
%% or, if using package array, what I have in proyecto docente.

\begin{center}
\begin{tabular}{>{\centering}p{3.1cm}p{0.01cm}>{\centering}p{3.7cm}p{0.01cm}>{\centering}p{3.2cm}p{0.001cm}}
\hline
 & &Lots of RAM && Little RAM &\\
\hline
\centerline{Single node,} many cores/node &&\centerline{RAM objects (?), forking} \textit{ff} objects (?), forking&& \textit{ff} objects, forking& \\
&&&\\
\centerline{Many nodes,}few cores/node&& \textit{ff} objects, cluster && \textit{ff} objects, cluster&\\
\hline
\end{tabular}
\end{center}

%% \begin{center}
%% \begin{tabular}{p{5cm}p{3cm}p{3cm}}
%% \hline
%%  & Lots of RAM & Little RAM \\
%% \hline
%% Many cores per node and few cores & Local objects, forking & \textit{ff} objects, forking \\
%% Few cores per node and many nodes & \textit{ff} objects, cluster & \textit{ff} objects, cluster\\
%% \hline
%% \end{tabular}
%% \end{center}

The question marks denote not-so-obvious choices, where the best decision will
depend on the actual details of number of nodes, size of data sets, speed of
communication between nodes, etc. For large data sets, the recommended usage
involves always using \textit{ff} objects. Using \textit{ff} objects is
slightly more cumbersome, but it can allow us to analyze very large data sets
on moderate hardware and will often result in faster computation; see details
and discussion in ``benchmarks.pdf''. Of course, what counts as ``lots'',
``many'', and ``large'' will depend on the arrays you analyze and the hardware.

The examples below cover all three possible usage patterns:
\begin{description}
\item[RAM objects, forking]: section \ref{ex-ram-fork}.
\item[ff objects, cluster]: section \ref{ex-ff-cluster}.
\item[ff objects, forking]: section \ref{ex-ff-fork}.
\end{description}

\subsection{Usage: main steps and choices}

ADaCGH2 includes functions that use as input, or produce as output, either
\textit{ff} objects or RAM R objects. Some functions also allow you to choose
between using forking and using other mechanisms for parallelization.

% \begin{description}
%   \item[inputToADaCGH]
%   \begin{description}
%     \item[Input] Local R objects, text file, Limma and snapCGH
%   \end{description}
% \end{description}

% This figure summarizes input, output, and
% parallelization choices.
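As a minimal illustration of the difference between the two mechanisms (this
sketch is not specific to ADaCGH2, is not run as part of this vignette, and
uses only the \textbf{parallel} package), the same toy computation can be
parallelized first via forking and then via a socket cluster:

<<eval=FALSE>>=
library(parallel)

## Forking (POSIX systems only): the children share memory with the
## master, so nothing needs to be exported to them explicitly.
res.fork <- mclapply(1:4, function(i) i^2, mc.cores = 2)

## Socket cluster (also works under Windows): packages and objects
## must be made available on the workers explicitly.
cl <- makeCluster(2, type = "PSOCK")
res.sock <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)
@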
% \begin{figure}
% \begin{tikzpicture}
% \end{tikzpicture}
% \end{figure}

For both interactive and non-interactive executions we will often execute the
following in sequence:
\begin{enumerate}
\item Check the original data and convert them to appropriate objects (e.g.,
  to \textit{ff} objects).
\item Initialize the computing cluster if not using forking.
\item Carry out segmentation and calling.
\item Plot the results.
\end{enumerate}
We cover each of these steps in the examples below and discuss alternative
routes. At the end of this vignette we also discuss why we might want to use
ADaCGH2 instead of just ``doing it manually on our own''.

\section{The data for all the examples}\label{inputdata}

We will use a small, fictitious data set for all the examples, with six
arrays/subjects and five chromosomes. The data are available as an RData file:

% %% FIXME: do I want this?
% <<>>=
% options(width = 70)
% @

<<>>=
library(ADaCGH2)
data(inputEx)
summary(inputEx)
head(inputEx)
@

The data are also available (in the /data subdirectory of the package) as an
ASCII text file in two formats: with columns separated by tabs and with
columns separated by spaces\footnote{These two files are used in the example
  of the help for the \texttt{cutFile} function.}.

\section{Example 1: RAM objects and forking}
\label{ex-ram-fork}

This is the simplest procedure if you are not under Windows. It will work when
the data are small (relative to available RAM) and the number of
cores/processors in the single computing node is large relative to the number
of subjects. However, this will not provide any parallelism under Windows: we
use forking, as provided by the \texttt{mclapply} function in package
\texttt{parallel}, and forking is only available on POSIX operating systems
(and Windows is not one of those).

Using forking can be a good idea because, with fork, creating a new process is
very fast and lightweight, all the child processes share memory pages until
they start modifying the objects, and you do not need to explicitly send
pre-existing objects to the child processes. In contrast, if we use other
types of clusters (e.g., sockets or MPI), we need to make sure packages and R
objects are explicitly sent to the child or slave processes.

If you have lots of RAM (ideally, all you would need is enough memory to hold
one copy of your original CGH data plus the return object), you will also
probably use RAM objects instead of \textit{ff} objects, as the former are
less cumbersome to deal with. But see details in file ``benchmarks.pdf''.

% With
% forking, the R object that holds the CGH data is accessed only for
% reading, and thus can be shared between all processes.
% This should be
% faster than using \textit{ff} (see section \ref{to-ff}), because access to
% disk is several orders of magnitude slower than access to memory
% ---provided, of course, the memory that is used is RAM and not virtual
% memory mapped to disk.

The steps for the analysis are:
\begin{itemize}
\item Read the input data.
\item Carry out the segmentation.
\end{itemize}

\subsection{Reading data and storing as a RAM object (a ``usual'' R object)}
\label{reading-data-storing}

We provide here details on reading data from several different sources. Of
course, in any specific case, you only need to use one route.

\subsubsection{Data available as a data frame in an RData file}
\label{input-as-RData}

% If the data sets are small and/or we have lots of RAM we might want to
% keep the objects in memory inside our R session.
% This should be faster
% than using \textit{ff} (see section \ref{to-ff}), because access to disk
% is several orders of magnitude slower than access to memory ---provided,
% of course, the memory that is used is RAM and not virtual memory mapped to
% disk.

As we said in section \ref{inputdata}, the data are available as an R data
frame (\texttt{inputEx}), which we have saved as an RData file
(\texttt{inputEx.RData}). We will use \texttt{inputToADaCGH} to produce the
objects needed later for the segmentation, and to carry out some checks for
missing values, repeated identifiers and positions, etc.

% To allow the conversion to be carried out using data from previous
% sessions, the conversion takes as input the name of an RData.
%%FIXME

<<>>=
fnameRdata <- list.files(path = system.file("data", package = "ADaCGH2"),
                         full.names = TRUE, pattern = "inputEx.RData")
inputToADaCGH(ff.or.RAM = "RAM",
              RDatafilename = fnameRdata)
@

We need to provide the path to the RData file, which we stored in the object
\texttt{fnameRdata}. This RData file must contain a single data frame. The
first three columns of this data frame are the IDs of the probes, the
chromosome number, and the position, and all remaining columns contain the
data for the arrays, one column per array. The names of the first three
columns do not matter, but the order does. Names of the remaining columns will
be used if present; otherwise, fake array names will be created. Note the
usage of \texttt{ff.or.RAM = "RAM"}, which is different from that in section
\ref{to-ff}.
% , and we use the (default) option of
% \texttt{ff.or.RAM = ``RAM"} for the return.

The call will leave several R objects in the global environment. The names of
the objects can be changed with the argument \texttt{robjnames}. These are
your usual R objects (data frames and vectors); thus, they are RAM objects.

\subsubsection{Data available as an R data frame}\label{input-as-df}

Instead of accessing the RData file, we will directly use the data frame. This
way, we use \texttt{inputToADaCGH} basically for its checks. The first three
columns of the data frame are the IDs of the probes, the chromosome number,
and the position, and all remaining columns contain the data for the arrays,
one column per array.

<<>>=
data(inputEx) ## make inputEx available as a data frame with that name
inputToADaCGH(ff.or.RAM = "RAM",
              dataframe = inputEx)
@

\paragraph{Skipping the call to \texttt{inputToADaCGH}} Since our data are
already available as an R data frame, and if we are not interested in the
checks provided by \texttt{inputToADaCGH}, we do not need to call it. To
prepare the data for later usage with \texttt{pSegment} we can simply do the
following:

<<>>=
data(inputEx)
cgh.dat <- inputEx[, -c(1, 2, 3)]
chrom.dat <- as.integer(inputEx[, 2])
pos.dat <- inputEx[, 3]
@

\subsubsection{Using input data from a text file}
\label{input-as-text}

Our data can also be in a text file, with a format where the first three
columns are ID, chromosome, and position, and the remaining columns are
arrays\footnote{If they are not, utilities such as \texttt{awk}, \texttt{cut},
  etc., might be used for this purpose.}. \texttt{inputToADaCGH} allows this
type of input and, internally, uses \texttt{read.table.ff}; this way, we can
read a very large data set and store it as an \textit{ff} object or a RAM
object without exhausting the available RAM.
<<>>=
fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"),
                       full.names = TRUE, pattern = "inputEx.txt")
## You might want to adapt mc.cores to your hardware
tmp <- inputToADaCGH(ff.or.RAM = "RAM",
                     textfilename = fnametxt,
                     mc.cores = 2)
@

If you will be using a cluster created with \texttt{makeCluster} (see section
\ref{initcluster}), you will not want to use this option. You will need to
create \textit{ff} objects because, when using a cluster, and to minimize
transferring data and possibly exhausting available RAM, we have written the
code so that the slaves do not receive the data themselves, but just pointers
to the data (i.e., names of \textit{ff} objects) that live on disk.

\paragraph{Compressed text files}
\label{sec:compr-text-files}
%% In the package, the text file is actually stored as a compressed
%% ---gzipped--- file. If you do
%% <<>>=
%% fnametxt
%% @
%% you will see that. As far as you are concerned, this is irrelevant here,
%% and
The function \texttt{inputToADaCGH} will work with both compressed and
uncompressed files. However, if you are working with a really large text file
and you start from a compressed file, you will have to add the time it takes
to decompress the file; thus, if you plan on using this file repeatedly as
input, you might want to decompress it, outside R, before you start.

\subsubsection{Using data from Limma or snapCGH}

You can also use data from snapCGH and Limma. See section \ref{snap}.

\subsubsection{Reading data from a directory}
\label{reading-dir-0}

Reading data from a directory is discussed in more detail in section
\ref{reading-dir}, and it is the preferred approach when we have a lot of
data. Since saving the results as a RAM object is not likely to be the way to
go in such cases (we would exhaust available RAM), we do not discuss it here
any further.

\subsection{Carrying out segmentation and calling}
\label{segment-local-fork}

Segmentation and calling are carried out with the \texttt{pSegment} functions.
Here we show just one such example. Many more are available in the second
vignette. Setting argument \texttt{typeParall} to \texttt{fork} is not needed
(it is the default), but we set it here explicitly for clarity.

<<eval=FALSE>>=
help(pSegment)
@

<<>>=
## You might want to adapt mc.cores to your hardware
haar.RAM.fork <- pSegmentHaarSeg(cgh.dat, chrom.dat,
                                 merging = "MAD",
                                 typeParall = "fork",
                                 mc.cores = 2)
@

Since the inputs are RAM objects, the output is also a RAM object (a regular R
object, in this case a list).

<<>>=
lapply(haar.RAM.fork, head)
summary(haar.RAM.fork[[1]])
@

\subsection{Plotting the results}
\label{plot-local-fork}

Plotting produces PNG files for easier sharing over the Internet. The plotting
function takes as main arguments the result from \texttt{pSegment} and the
input objects to \texttt{pSegment} (we will later see, for instance in section
\ref{plot-ff-cluster}, how to use results stored as \textit{ff} objects).
Setting argument \texttt{typeParall} to \texttt{fork} is not needed (it is the
default), but we set it here explicitly for clarity.

<<>>=
## You might want to adapt mc.cores to your hardware
pChromPlot(haar.RAM.fork,
           cghRDataName = cgh.dat,
           chromRDataName = chrom.dat,
           posRDataName = pos.dat,
           probenamesRDataName = probenames.dat,
           imgheight = 350,
           typeParall = "fork",
           mc.cores = 2)
@

\section{Example 2: \textit{ff} objects and cluster}\label{ex-ff-cluster}

This procedure should work even with relatively small amounts of RAM, and it
will also work under Windows.
However, using a cluster involves additional steps. For both interactive and
non-interactive sessions we will often execute the following in sequence:
\begin{enumerate}
\item Check the original data and convert them to appropriate objects (e.g.,
  to \textit{ff} objects).
\item Initialize the computing cluster.% if not using forking.
\item Carry out segmentation and calling.
\item Plot the results.
\end{enumerate}

Compared to section \ref{ex-ram-fork} we introduce here the following new
major topics:
\begin{itemize}
\item Using \textit{ff} objects.
\item Setting up a cluster.
\end{itemize}

% We cover each in turn in the remaining of this section and discuss
% alternative routes.

\textbf{Note for Windows users}: in this vignette, the code that uses
\texttt{ff} objects has been disabled, as it leads to random and
difficult-to-reproduce problems with the automated testing procedure (from
creating socket clusters to removing temporary directories). Therefore, all
remaining code in this vignette is surrounded with
\texttt{if(.Platform\$OS.type != "windows") \{some-code-here\}}. This code,
however, should work when run interactively.

\subsection{Choosing a working directory}
\label{choosewd}

As we will use \textit{ff} objects, we will read and write quite a few files
to the hard drive. The easiest way to organize your work is to create a
separate directory for each project. All plot files and ff data will be stored
in this new directory; at the end of this example, we will remove the
directory. (Just in case, we check for the existence of the directory first.
We also store the current working directory to return to it at the very end.)

<<>>=
if(.Platform$OS.type != "windows") {
    originalDir <- getwd()
    ## make it explicit where we are
    print(originalDir)
}
@

<<>>=
if(.Platform$OS.type != "windows") {
    if(!file.exists("ADaCGH2_vignette_tmp_dir"))
        dir.create("ADaCGH2_vignette_tmp_dir")
    setwd("ADaCGH2_vignette_tmp_dir")
}
@

It is \textbf{very important} to remember that the names of the \textit{ff}
objects that are exposed to the user are always the same (i.e.,
chromData.RData, posData.RData, cghData.RData, probeNames.RData). Therefore,
successive invocations of \texttt{inputToADaCGH}, if they produce \textit{ff}
output (i.e., \texttt{ff.or.RAM = "ff"}), will overwrite these objects (and
make them point to different binary \textit{ff} files on disk). In this
vignette, we keep reusing \texttt{inputToADaCGH}, but note that in all the
cases where we produce \textit{ff} files as output (sections \ref{to-ff},
\ref{RData-to-ff}, \ref{separate-proc}, \ref{input-as-text-to-ff},
\ref{to-ff-ex3}, \ref{RData-to-ff-ex3}, \ref{separate-proc-ex3},
\ref{input-as-text-to-ff-ex3}), the data used as input are the same, so there
is no problem here (although we will leave binary \textit{ff} objects on disk
without a corresponding \textit{ff} RData object in the R session). In
particular, note that when we show the usage of Limma and snapCGH objects as
input (section \ref{snap}), we are using RAM objects (not \textit{ff} objects)
as output, so there is no confusion.

\subsection{Reading data and storing as \textit{ff} objects}
\label{to-ff}

Converting the original data to \textit{ff} objects can be done either before
or after initializing the cluster (section \ref{initcluster}), as it does not
use the computing cluster. The purpose of this step is to write the
\textit{ff} files to disk, so they are available for the segmentation and
plotting functions.
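As an illustration (not run) of what these objects look like once any of the
routes below has created them, we can load one of the RData files listed above
and inspect the \textit{ff} object it contains; only small portions of the
data are read from disk on demand:

<<eval=FALSE>>=
## Not run: assumes inputToADaCGH(ff.or.RAM = "ff", ...) has already been
## called in this working directory, so cghData.RData exists.
load("cghData.RData")  ## makes the ff object 'cghData' available
open(cghData)
dim(cghData)           ## probes x arrays
cghData[1:5, ]         ## only these values are read from disk
close(cghData)
@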
\subsubsection{Data available as a data frame in an RData file}
\label{RData-to-ff}

To allow the conversion to be carried out using data from previous sessions,
the conversion takes as input the name of an RData file that contains plain,
``regular'' R objects (which, if loaded, would be RAM objects).

<<>>=
if(.Platform$OS.type != "windows") {
    fnameRdata <- list.files(path = system.file("data", package = "ADaCGH2"),
                             full.names = TRUE, pattern = "inputEx.RData")
    inputToADaCGH(ff.or.RAM = "ff",
                  RDatafilename = fnameRdata)
}
@

The first command is used in this example to find the complete path of the
example data set. The actual call to the function is the second expression.
Note that we use a path to an RData file, and do not just use a RAM object.

If you are very short of RAM, you might want to do the conversion in a
separate R process that exits once the conversion is done and returns all of
the RAM it used to the operating system. This we cover next in section
\ref{separate-proc}. An alternative approach to try to minimize RAM usage is
available if our data are in a text file, as discussed in section
\ref{input-as-text}.

%% Why? If you have your data already loaded in your main
%% session as an R data frame, it is immediate for you to produce the
%% required data frames from it as input for \texttt{pSegment}, as we did in
%% section \ref{input-as-RData}, and thus you probably do not need
%% \texttt{inputDataToADaCGH}.
%% Because if you are producing \textit{ff} objects, you are most
%% likely short of RAM, and you might not want to load the RData file in your
%% main session; in fact, you might want to do the conversion in a separate R
%% process that exists once the conversion is done and returns all of the RAM
%% it used to the operating system. This we cover next in section
%% \ref{separate-proc}.

\subsubsection{Converting from RData to \textit{ff} objects in a separate process}
\label{separate-proc}

With large data sets, converting from RData to \textit{ff} can be the single
step that consumes the most RAM, since we need to load the original data into
R. Even if, after the conversion to \textit{ff}, we remove the original data
and call \texttt{gc()}, R might not return all of the memory to the operating
system, and this might be inconvenient in multiuser environments and/or
long-running processes.

We can try to deal with the above problems by executing the conversion to
\textit{ff} in a separate R process that is spawned exclusively for the
conversion. For instance, we could use the \texttt{mcparallel} function (from
package \textbf{parallel}) and do:

<<eval=FALSE>>=
mcparallel(inputToADaCGH(ff.or.RAM = "ff",
                         RDatafilename = fnameRdata),
           silent = FALSE)
tableChromArray <- mccollect()
if(inherits(tableChromArray, "try-error")) {
    stop("ERROR in input data conversion")
}
@

That way, the \textit{ff} objects are produced and stored locally on the hard
drive, but the R process where the original data were loaded (and the
conversion to \textit{ff} carried out) dies immediately after the conversion,
freeing the memory back to the operating system.

\subsubsection{Data available as an R data frame}\label{input-as-df-cluster}

Instead of accessing the RData file, we can directly use the data frame, as we
did in section \ref{input-as-df}.
<<>>=
if(.Platform$OS.type != "windows") {
    data(inputEx) ## make inputEx available as a data frame with that name
    inputToADaCGH(ff.or.RAM = "ff",
                  dataframe = inputEx)
}
@

\subsubsection{Using input data from a text file}\label{input-as-text-to-ff}

As in section \ref{input-as-text}, we can read from a text file. In this case,
however, the output will be a set of \textit{ff} objects.

% Our data might be in a text file, with a format where the first three
% columns are ID, chromosome, and position, and the remaining columns are
% arrays. (If they are not, utilities such as \texttt{awk}, \texttt{cut},
% etc, might be used for this purpose). \texttt{inputDataToADaCGH} allows
% this type of input, and uses \texttt{read.table.ff}; this way, we can read
% a very large data set and store it as an \textit{ff} object (or a local R
% object) without exhausting the available RAM.

<<>>=
if(.Platform$OS.type != "windows") {
    fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"),
                           full.names = TRUE, pattern = "inputEx.txt")
    ## You might want to adapt mc.cores to your hardware
    inputToADaCGH(ff.or.RAM = "ff",
                  textfilename = fnametxt,
                  mc.cores = 2)
}
@

\subsubsection{Using data from Limma or snapCGH}

You can also use data from snapCGH and Limma. See section \ref{snap}.

\subsubsection{Reading data from a directory}
\label{reading-dir-1}

See section \ref{reading-dir} for further details. This is the best option
with very large data sets. The initial data reading will use forking and, once
we have saved the objects as \textit{ff} objects, we can apply all the
subsequent analysis steps discussed in the rest of this section.

\subsubsection{Moving a set of \textit{ff} objects}

This is not specific to ADaCGH2, but since this issue can come up frequently,
we explain it here. The paths of the \textit{ff} files are stored in the
object. How can we move this R object with all the \textit{ff} files? First,
we save the R object and all the \textit{ff} files:
\begin{verbatim}
ffsave(cghData, file = "savedcghData", rootpath = "./")
\end{verbatim}
We then take the resulting RData object (possibly a very large object) and
load it in the new location, rerooting the path:
\begin{verbatim}
ffload(file = "pathtofile/savedcghData", rootpath = getwd())
\end{verbatim}

\subsection{Initializing the computing cluster}\label{initcluster}

%%Unless we use forking, we need to set up and initialize a cluster.
Cluster initialization uses the functions provided in \texttt{parallel}. In
the example we will use a socket cluster, since this is likely to run under a
variety of operating systems and should not need any additional software.
Note, however, that MPI can also be used (in fact, that is what we use on our
servers). In this example we will use two nodes (with real data you might want
to use as many nodes as detected cores, e.g., via \texttt{detectCores()}).

<<>>=
if(.Platform$OS.type != "windows") {
    ## Adapt number of nodes to your hardware
    number.of.nodes <- 2 ##detectCores()
    cl2 <- parallel::makeCluster(number.of.nodes, "PSOCK")
    parallel::clusterSetRNGStream(cl2)
    parallel::setDefaultCluster(cl2)
    parallel::clusterEvalQ(NULL, library("ADaCGH2"))
    wdir <- getwd()
    parallel::clusterExport(NULL, "wdir")
    parallel::clusterEvalQ(NULL, setwd(wdir))
}
@

The first two calls create a cluster and initialize the random number
generator\footnote{We use the version from package \textbf{parallel}, instead
  of the one from \textbf{BiocGenerics}, as the latter is still experimental.}.
The third expression sets the cluster just created as the default cluster.
This is important: to simplify function calls, we do not pass the cluster name
around, but rather expect a default cluster to be set up. The fourth line
makes the \textbf{ADaCGH2} package available on all the nodes of the cluster
(notice we did not need to do this with forking, as the child processes shared
memory with the parent). The last three lines make sure the slave processes
use the same directory as the master. Because we created the cluster after
changing directories (section \ref{choosewd}), this step is not really needed
here. But we make it explicit so as to verify it works, and as a reminder that
you will need to do this if you change directories AFTER creating the cluster.
If you run on a multinode cluster, you must ensure that the same directory
exists on all machines. (In this case, we are running on the localhost.)

\subsection{Carrying out segmentation and calling}
\label{segment-ff-cluster}

Segmentation and calling are carried out with the \texttt{pSegment} functions.
Here we show just one such example. Many more are available in the second
vignette.

<<eval=FALSE>>=
help(pSegment)
@

<<>>=
if(.Platform$OS.type != "windows") {
    haar.ff.cluster <- pSegmentHaarSeg("cghData.RData",
                                       "chromData.RData",
                                       merging = "MAD",
                                       typeParall = "cluster")
}
@

We can take a quick look at the output. We first open the \textit{ff} objects
(the output is a list of \textit{ff} objects) and then call \texttt{summary}
on the first component of the list, which contains the results of the wavelet
smoothing:

<<>>=
if(.Platform$OS.type != "windows") {
    lapply(haar.ff.cluster, open)
    summary(haar.ff.cluster[[1]][,])
}
@

\subsection{Plotting the results}
\label{plot-ff-cluster}

The call here is the same as in section \ref{plot-local-fork}, except that we
change the values for the arguments. As we are using \textit{ff} objects, we
also need to first write to disk the (\textit{ff}) object with the results.

<<>>=
if(.Platform$OS.type != "windows") {
    save(haar.ff.cluster, file = "hs_mad.out.RData", compress = FALSE)
    pChromPlot(outRDataName = "hs_mad.out.RData",
               cghRDataName = "cghData.RData",
               chromRDataName = "chromData.RData",
               posRDataName = "posData.RData",
               probenamesRDataName = "probeNames.RData",
               imgheight = 350,
               typeParall = "cluster")
}
@

Finally, we stop the workers and close the cluster:

<<>>=
if(.Platform$OS.type != "windows") {
    parallel::stopCluster(cl2)
}
@

\section{Example 3: \textit{ff} objects and forking}
\label{ex-ff-fork}

This example uses \textit{ff} objects, as in section \ref{ex-ff-cluster}, but
it uses forking, as in section \ref{ex-ram-fork}, instead of a cluster.
Therefore, we will not need to create a cluster, but we will need to read the
data and convert them to \textit{ff} objects. Here we introduce no new major
topics. Working with \textit{ff} objects was covered in section \ref{to-ff}
and forking was covered in section \ref{segment-local-fork}. We simply combine
these workflows.

\subsection{Choosing a working directory}
\label{choosewd-ex3}

As we will use \textit{ff} objects, it will be convenient, as we did in
section \ref{choosewd}, to create a separate directory for each project, to
store all plot files and ff data. Since we already did that above (section
\ref{choosewd}), we do not repeat it here. However, for real work, you might
want to keep different analyses associated with different working directories.

% this would be a mess for me, when deleating, cleaning up,
% and acessing files in the all-examples case.
% To keep the two examples separate, we
% create a new one. (Just in case, we check for its existence first.)
% <<>>=
%if(.Platform$OS.type != "windows") {
% setwd(getwd)
% if(!file.exists("ADaCGH2_vignette_ex3_dir"))
%     dir.create("ADaCGH2_vignette_ex3_dir")
% setwd("ADaCGH2_vignette_ex3_dir")
% @

\subsection{Reading data and storing as \textit{ff} objects}
\label{to-ff-ex3}

We have here the same options as in section \ref{to-ff}. We repeat them
briefly. A key difference with respect to section \ref{to-ff} is that we are
not creating a cluster, so there will be no need to export the current working
directory to slave processes explicitly (in contrast to section
\ref{initcluster}).

\subsubsection{Data available as a data frame in an RData file}
\label{RData-to-ff-ex3}

<<>>=
if(.Platform$OS.type != "windows") {
    fnameRdata <- list.files(path = system.file("data", package = "ADaCGH2"),
                             full.names = TRUE, pattern = "inputEx.RData")
    inputToADaCGH(ff.or.RAM = "ff",
                  RDatafilename = fnameRdata)
}
@

\subsubsection{Converting from RData to \textit{ff} objects in a separate process}
\label{separate-proc-ex3}

Even if we are using forking, we might still want to carry out the conversion
to \textit{ff} objects in a separate process, as we did in section
\ref{separate-proc}, since the conversion to \textit{ff} objects might be the
step that consumes the most RAM in the whole process, and we might want to
make sure we return that memory to the operating system as soon as possible.

<<eval=FALSE>>=
mcparallel(inputToADaCGH(ff.or.RAM = "ff",
                         RDatafilename = fnameRdata),
           silent = FALSE)
tableChromArray <- mccollect()
if(inherits(tableChromArray, "try-error")) {
    stop("ERROR in input data conversion")
}
@

\subsubsection{Data available as an R data frame}\label{input-as-df-ff-fork}

Instead of accessing the RData file, we can directly use the data frame, as we
did in section \ref{input-as-df-cluster}.

\subsubsection{Using input data from a text file}
\label{input-as-text-to-ff-ex3}

<<>>=
if(.Platform$OS.type != "windows") {
    fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"),
                           full.names = TRUE, pattern = "inputEx.txt")
    ## You might want to adapt mc.cores to your hardware
    inputToADaCGH(ff.or.RAM = "ff",
                  textfilename = fnametxt,
                  mc.cores = 2)
}
@

\subsubsection{Using data from Limma or snapCGH}

You can also use data from snapCGH and Limma. See section \ref{snap}.

\subsubsection{Reading data from a directory}
\label{reading-dir}

This is probably the best option for very large input data. We will read
\textbf{all} the files in a given directory (except for those you might
explicitly specify not to). Even if your original file follows the format of
the data file in section \ref{input-as-text-to-ff-ex3}, you might want to
convert it to the format used here (where each column is a file), as the time
it takes to convert the file will be more than compensated by the speed-ups of
reading, in R, each file on its own. With very large files, it is much faster
to read the data this way (we avoid having to loop many times over the file to
read each column). Reading the data is parallelized, which allows us to speed
up the reading process significantly (the parallelization uses forking, and
thus you will see no speed gains under Windows). Finally, to maximize speed
and minimize memory consumption, we use \textit{ff} objects for intermediate
storage.

\subsubsection{Cutting the original file into one-column files}\label{cutfile}

We provide a simple function, \texttt{cutFile}, to do this job.
Here we create a directory where we will place the one-column files (we first
check that the directory does not exist\footnote{If it exists and contains
  files, \texttt{inputToADaCGH} will probably fail, as it is set to read all
  the files in the directory.}). Note that this will probably NOT work under
Windows\footnote{On Macs it might or might not work; on all of the Macs where
  we have tried it, it works, but not on the testing machine at BioC.}, and
thus we skip using \texttt{cutFile} under Windows, and use a directory where
we have stored the files split by column.

% <<>>=
%% if(.Platform$OS.type != "windows") {
%% if( .Platform$OS.type == "unix") {
% fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"),
%                        full.names = TRUE, pattern = "inputEx.txt")
% if(file.exists("cuttedFile")) {
%     stop("The cuttedFile directory already exists.",
%          "Did you run this vignette from this directory before?",
%          "You will not want to do that, unless you modify the arguments",
%          "to inputToADaCGH below")
% } else dir.create("cuttedFile")
% setwd("cuttedFile")
% cutFile(fnametxt, 1, 2, 3, sep = "\t")
% cuttedFile.dir <- getwd()
% setwd("../")
% } else {
%     cuttedFile.dir <- system.file("example-datadir",
%                                   package = "ADaCGH2")
% }
% @

<<>>=
if( (.Platform$OS.type == "unix") && (Sys.info()['sysname'] != "Darwin") ) {
    fnametxt <- list.files(path = system.file("data", package = "ADaCGH2"),
                           full.names = TRUE, pattern = "inputEx.txt")
    if(file.exists("cuttedFile")) {
        stop("The cuttedFile directory already exists. ",
             "Did you run this vignette from this directory before? ",
             "You will not want to do that, unless you modify the arguments ",
             "to inputToADaCGH below")
    } else dir.create("cuttedFile")
    setwd("cuttedFile")
    ## You might want to adapt mc.cores to your hardware
    cutFile(fnametxt, 1, 2, 3, sep = "\t", mc.cores = 2)
    cuttedFile.dir <- getwd()
    setwd("../")
} else {
    cuttedFile.dir <- system.file("example-datadir",
                                  package = "ADaCGH2")
}
@

We create a new directory and carry out the file cutting there because the
upper-level directory is already populated with other files we have been
creating. If we cut the file in the upper directory, we would later need to
specify a lengthy list of files to exclude in the arguments to
\texttt{inputToADaCGH}. To avoid that, we create a directory and leave the
files in the newly created directory. After cutting, we return to the
upper-level directory, so that the new directory contains only the files to be
used as input.

It is important to realize that the previous paragraph, which might seem a
mess, does not reflect the way you would usually work, which would actually be
much simpler, something like the following:
\begin{enumerate}
\item Create a directory for your new project (let's call this directory
  \texttt{d1}).
\item Copy your big text file with the data to \texttt{d1}; let's call this
  file \texttt{afile.txt}.
\item In R, move to \texttt{d1} (for example, \texttt{setwd("~/d1")}).
\item Use \texttt{cutFile}: \texttt{cutFile("afile.txt", 1, 2, 3)}.
\item Call \texttt{inputToADaCGH}: \texttt{inputToADaCGH(ff.or.RAM = "ff",
    path = getwd(), excludefile = "afile.txt")}.
\end{enumerate}
(In this vignette the workflow was not as easy because we are running lots of
different examples, with several different workflows.)

\texttt{cutFile} will run several jobs in parallel to speed up the cutting
process, launching by default as many jobs as cores it can detect, and will
produce files with the required naming conventions of \texttt{inputToADaCGH}.
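Expressed as code, the usual workflow just enumerated would look roughly as
follows (the directory \texttt{d1} and the file \texttt{afile.txt} are, of
course, just the hypothetical names used in the list above):

<<eval=FALSE>>=
## Not run: typical workflow with your own data, as enumerated above
setwd("~/d1")                      ## the project directory
cutFile("afile.txt", 1, 2, 3)      ## ID, chromosome, and position columns
inputToADaCGH(ff.or.RAM = "ff",
              path = getwd(),
              excludefile = "afile.txt")
@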
Note that \texttt{cutFile} is unlikely to work under Windows. If you do not
want to use \texttt{cutFile} you can use utilities provided by your operating
system. The following is a very simple example of using \texttt{cut} under
bash (which is not unlike what we do internally in \texttt{cutFile}) to
produce one-column files from a file called \texttt{Data.txt}, with 77
arrays/subjects, where cutting the data part is parallelized over four
processors:

\begin{verbatim}
cut -f1 Data.txt > ID.txt
cut -f2 Data.txt > Chrom.txt
cut -f3 Data.txt > Pos.txt
for i in {4..20}; do cut -f$i Data.txt > col_$i.txt; done &
for i in {21..40}; do cut -f$i Data.txt > col_$i.txt; done &
for i in {41..60}; do cut -f$i Data.txt > col_$i.txt; done &
for i in {61..80}; do cut -f$i Data.txt > col_$i.txt; done &
\end{verbatim}

After you have cut the file, each resulting file contains one column of data.
Three of the files must be named \texttt{"ID.txt"}, \texttt{"Chrom.txt"}, and
\texttt{"Pos.txt"}. The rest of the files contain the data, one file per array
or subject; their names are irrelevant.

When using \texttt{inputToADaCGH} with a directory, the output can be either
\textit{ff} objects or RAM objects. However, the latter will rarely make sense
(it will be slower and we can run into memory constraints); see the discussion
in file ``benchmarks.pdf''.

%% <<>>=
%% namepath <- system.file("example-datadir", package = "ADaCGH2")
%% inputToADaCGH(ff.or.local = "ff",
%%               path = namepath)
%% @

%% <<>>=
%% namepath <- getwd()
%% inputToADaCGH(ff.or.local = "ff",
%%               path = namepath)
%% @

%% (If we are using Windows, we cannot use \texttt{cutFile}; in this
%% vignette, therefore, we will use a directory where we have placed the
%% files split by column)

<<>>=
if(.Platform$OS.type != "windows") {
    ## You might want to adapt mc.cores to your hardware
    inputToADaCGH(ff.or.RAM = "ff",
                  path = cuttedFile.dir,
                  verbose = TRUE,
                  mc.cores = 2)
}
@

We have used the previously cut files in this example. You can also check the
files that live under the ``example-datadir'' directory: you will see six
files with names starting with ``col'', which are the data files, and the
files \texttt{"ID.txt"}, \texttt{"Chrom.txt"}, and \texttt{"Pos.txt"}. (That
is the directory we would use as input had we used Windows.) Note that, to
provide additional information on what we are doing, we call the function with
the (non-default) \texttt{verbose = TRUE}, which will list all the files we
will be reading.

\textbf{Beware of possible different orderings of files.} When reading from a
directory, and since each column is a file, the order of the columns (and,
thus, subjects or arrays) in the data files that will be created can vary. In
particular, the command \texttt{list.files} (which we use to list the files)
can produce different output (a different order of files) across operating
systems and versions of R. What this means is that, say, column three does not
necessarily refer to the same subject or array. \textbf{Always use the column
  names to identify unambiguously the data and the results}.

\paragraph{What about performing this step in a separate process?} In sections
\ref{separate-proc} and \ref{separate-proc-ex3} we performed the data
preparation in a separate process, to return RAM to the OS right after the
conversion. You can do that here too if you want, but we have not found it
necessary, since the memory consumption when reading column by column is often
small.
See examples with large data sets in section \ref{large-ex-read}.

\subsection{Carrying out segmentation and calling}
\label{segment-ff-fork-ex3}

The call is similar to the one in section \ref{segment-ff-cluster}, except for
the argument \texttt{typeParall}.

<<>>=
if(.Platform$OS.type != "windows") {
    ## You might want to adapt mc.cores to your hardware
    haar.ff.fork <- pSegmentHaarSeg("cghData.RData",
                                    "chromData.RData",
                                    merging = "MAD",
                                    typeParall = "fork",
                                    mc.cores = 2)
}
@

\subsection{Plotting the results}
\label{plot-ff-fork-ex3}

The call here is the same as in section \ref{plot-ff-cluster}, except for
argument \texttt{typeParall}.

<<>>=
if(.Platform$OS.type != "windows") {
    save(haar.ff.fork, file = "haar.ff.fork.RData", compress = FALSE)
    ## You might want to adapt mc.cores to your hardware
    pChromPlot(outRDataName = "haar.ff.fork.RData",
               cghRDataName = "cghData.RData",
               chromRDataName = "chromData.RData",
               posRDataName = "posData.RData",
               probenamesRDataName = "probeNames.RData",
               imgheight = 350,
               typeParall = "fork",
               mc.cores = 2)
}
@

\section{Input and output to/from other packages}

\subsection{\label{snap}Input data from Limma and snapCGH}

Many aCGH studies use pre-processing steps similar to those of gene expression
data. The \texttt{MAList} object, from \textit{Limma}, and the \texttt{SegList}
object, from \textit{snapCGH}, are commonly used to store aCGH information.
The following examples illustrate the usage of the function
\texttt{inputToADaCGH} to convert \texttt{MAList} and \texttt{SegList} data
into a format suitable for \textit{ADaCGH2}.

We will start with objects produced by \textit{snapCGH}. The following code is
copied from the \textit{snapCGH} vignette (pp.\ 2 and 3). Please check the
original vignette for details. In summary, a set of array files is read, the
data are normalized and, finally, averaged over clones. snapCGH uses limma for
the initial import of data and then, with the \texttt{read.clonesinfo}
function, adds additional information such as chromosome and position. The
\texttt{MA} object created is of class \texttt{MAList}, but with added
information (compared to a basic, original, limma \texttt{MAList} object).
\texttt{MA2} is of type \texttt{SegList}.

%% This code is not run now, as it depends on \textit{snapCGH} which is not
%% available at his time (2013-10) for BioC devel.

<<>>=
if(.Platform$OS.type != "windows") {
    require("limma")
    require("snapCGH")
    datadir <- system.file("testdata", package = "snapCGH")
    targets <- readTargets("targets.txt", path = datadir)
    RG1 <- read.maimages(targets$FileName, path = datadir, source = "genepix")
    ## This is snapCGH-specific
    RG1 <- read.clonesinfo("cloneinfo.txt", RG1, path = datadir)
    RG1$printer <- getLayout(RG1$genes)
    types <- readSpotTypes("SpotTypes.txt", path = datadir)
    RG1$genes$Status <- controlStatus(types, RG1)
    RG1$design <- c(-1, -1)
    RG2 <- backgroundCorrect(RG1, method = "minimum") ## class RGList
    MA <- normalizeWithinArrays(RG2, method = "median") ## class MAList
    class(MA)
    ## now obtain an object of class SegList
    MA2 <- processCGH(MA, method.of.averaging = mean, ID = "ID")
    class(MA2)
}
@

%% FIXME: watch out for NAs!
All the information (intensity ratios and location) is available in the
\texttt{MA} and \texttt{MA2} objects. We can directly convert them to
\textit{ADaCGH2} objects (we set \texttt{na.omit = TRUE} as the data contain
missing values). The first call processes the \texttt{MAList} and the second
the \texttt{SegList}.
In this section we use the argument \texttt{robjnames} to produce as output a
set of RAM objects with names different from the default. (Note that we could
also have produced \textit{ff} files as output, using the option
\texttt{ff.or.RAM = "ff"}.)

<<>>=
if(.Platform$OS.type != "windows") {
    tmp <- inputToADaCGH(MAList = MA,
                         robjnames = c("cgh-ma.dat", "chrom-ma.dat",
                                       "pos-ma.dat", "probenames-ma.dat"))
    tmp <- inputToADaCGH(MAList = MA2,
                         robjnames = c("cgh-ma.dat", "chrom-ma.dat",
                                       "pos-ma.dat", "probenames-ma.dat"),
                         minNumPerChrom = 4)
}
@

We need to change the argument to \texttt{minNumPerChrom} because, after the
data processing step in \texttt{processCGH}, chromosome 21 has only four
observations.

The original \texttt{MAList}, as produced directly by \textit{limma}, does not
have chromosome and position information. That is what the
\texttt{read.clonesinfo} function from \textit{snapCGH} added. To allow using
objects directly from \textit{limma} and incorporating position information,
we use an approach that directly mimics that of \textit{snapCGH}. If you use
an \texttt{MAList}, you can also provide a \texttt{cloneinfo} argument; this
can be either the full path to a file with the format required by
\texttt{read.clonesinfo} or the name of an object with (at least) three
columns, named \texttt{ID}, \texttt{Chr}, and \texttt{Position}.

We copy from the limma vignette (section 3.2, p.\ 8), changing the names of
the objects by appending ``.limma''.

<<>>=
if(.Platform$OS.type != "windows") {
    targets.limma <- readTargets("targets.txt", path = datadir)
    RG.limma <- read.maimages(targets.limma, path = datadir, source="genepix")
    RG.limma <- backgroundCorrect(RG.limma, method="normexp", offset=50)
    MA.limma <- normalizeWithinArrays(RG.limma)
}
@

We can add the chromosomal and position information in two different ways.
First, we can provide the name of a file (with the same format as required by
\texttt{read.clonesinfo}). Note that \texttt{fclone} is a path (and, thus, a
character vector).

<<>>=
if(.Platform$OS.type != "windows") {
    fclone <- list.files(path = system.file("testdata", package = "snapCGH"),
                         full.names = TRUE, pattern = "cloneinfo.txt")
    fclone
    tmp <- inputToADaCGH(MAList = MA.limma,
                         cloneinfo = fclone,
                         robjnames = c("cgh-ma.dat", "chrom-ma.dat",
                                       "pos-ma.dat", "probenames-ma.dat"))
}
@

Alternatively, we can provide the name of an object with the additional
information. For illustrative purposes, we can use here the \texttt{genes}
component of the \texttt{MA} object.

<<>>=
if(.Platform$OS.type != "windows") {
    acloneinfo <- MA$genes
    tmp <- inputToADaCGH(MAList = MA.limma,
                         cloneinfo = acloneinfo,
                         robjnames = c("cgh-ma.dat", "chrom-ma.dat",
                                       "pos-ma.dat", "probenames-ma.dat"))
}
@
%$ % the last thing is because the $ screws up the syntax highlighting

\subsection{Using CGHregions}\label{cghregions}

The CGHregions package \cite{CGHregions-manual} is a BioConductor package that
implements a well-known method \cite{VandeWiel2007} for dimension reduction of
aCGH data (see a review of common-regions issues and methods in
\cite{Rueda2010a}). The \texttt{CGHregions} function accepts different types
of input, among them a data frame. The function \texttt{outputToCGHregions}
produces that data frame, ready to be used as input to CGHregions (for the
next example, you will need to have the \textit{CGHregions} package
installed).
Note: \textbf{it is up to you to deal with missing values!!!} In the example
below, we do a simple \texttt{na.omit}, but note that we are now working with
data frames. Extending the usage of this, and other methods, to much larger
data sets, using \textit{ff}, and properly dealing with missing values, is
beyond the scope of this package.

<<>>=
if(.Platform$OS.type != "windows") {
    forcghr <- outputToCGHregions(haar.ff.cluster)
    if(require(CGHregions)) {
        regions1 <- CGHregions(na.omit(forcghr))
        regions1
    }
}
@

Please note that \texttt{outputToCGHregions} does NOT check whether the calls
are something that can be meaningfully passed to CGHregions. In particular,
you probably do NOT want to use this function when \texttt{pSegment} has been
called using \texttt{merging = "none"}.

<<>>=
if(.Platform$OS.type != "windows") {
    ## We are done with the executable code in the vignette.
    ## Restore the directory
    setwd(originalDir)
    print(getwd())
}
@

<<>>=
if(.Platform$OS.type != "windows") {
    ## Remove the tmp dir. Sys.sleep to prevent Windoze problems.
    ## Sys.sleep(1)
    ## What is in that dir?
    dir("ADaCGH2_vignette_tmp_dir")
    unlink("ADaCGH2_vignette_tmp_dir", recursive = TRUE)
    ## Sys.sleep(1)
}
@

\section{Why ADaCGH2 instead of a ``manual'' solution}

It is of course possible to parallelize the analysis (and figure creation)
without using ADaCGH2. To deal with very large data, the key idea is to never
try to load more data than we strictly need for an analysis (which is the
strategy used by ADaCGH2). To examine the simplest scenario, let us suppose we
are already provided with single-column files (as, for instance, we obtain
after using the helper function \texttt{cutFile} ---see section
\ref{cutfile}); if we had a single large file, we would need to think of a way
of reading only specific rows of a single column.

Now, we need to think about how to parallelize the analysis. We will consider
two cases: parallelizing by subject (or array or column) and parallelizing by
subject*chromosome.

Let's first examine the simplest case: we will parallelize by subject, as is
done by ADaCGH2 with HaarSeg and CBS (these methods are very fast, and further
splitting by chromosome is rarely worth it). These are the required steps:
\begin{enumerate}
\item Each R process needs to have access to the chromosome information; this
  probably requires loading a vector with chromosome positions.
\item Each process will carry out these steps until all the columns/subjects
  have been processed:
  \begin{enumerate}
  \item Read the data for a specific column (or subject).
  \item Analyze (segment) those data.
  \item Save the results to disk.
  \item Remove the results and the data from the workspace (and probably call
    the garbage collector).
  \end{enumerate}
\item When all analyses are completed, assemble the results somehow to allow
  easy access to them.\label{assemble}
\end{enumerate}

The steps above need to deal with the following possible problems:
\begin{itemize}
\item We need to consider how to deal with missing values, since a simple
  removal of missing values case by case will result in a ragged array of
  results, which would probably not be acceptable.
\item ``Loading'' and ``saving'' can be time-consuming steps: the direct way
  in R would be to use functions such as \texttt{scan} (for reading) and
  \texttt{save} (for saving), but, when done repeatedly, these are likely to
  be slower than using \textit{ff} objects (e.g., using \texttt{scan} will be
  slower than accessing data from an \textit{ff} object).
\item Much more serious can be step \ref{assemble}, since we need to assemble
  a whole object with the results. If the analysis involves many arrays and/or
  data sets with millions of probes, then we will not be able to load all of
  that into memory. (The approach we use in ADaCGH2, with \textit{ff} objects,
  is to never reload all of the results to assemble the final object, but only
  to assemble a set of pointers to data structures on disk.) Of course, an
  alternative is to leave the results as a large collection of files, and
  never try to assemble a single object with results. This, however, is likely
  to be much more cumbersome than having a single results object with all the
  information available, which can be accessed as needed (e.g., for further
  plotting).
\end{itemize}

Let us now examine the second scenario, where we parallelize by
subject*chromosome. This is done, for instance, with HMM or BioHMM in ADaCGH2.
Why? Because the methods are sufficiently slow that a finer-grained division
is likely to pay off in terms of speed gains. In this case, additional
partitioning and reassembly of the data are required for the segmentation and
merging steps. These are the main steps:
\begin{enumerate}
\item Each R process needs to have access to the chromosome information.
\item Each process will carry out these steps until all the columns/subjects
  have been processed:
  \begin{enumerate}
  \item Read the data for a specific set of positions (those that correspond
    to a specific chromosome) for a given column (or subject).
  \item Analyze (segment) those data.
  \item Save the results to disk.
  \item Remove the results and the data from the workspace.
  \end{enumerate}
\item When all segmentation steps are completed, assemble the results by
  column/subject for the merging step.
\item For the merging step, each process will load the data for a complete
  column/subject and merge them with the corresponding algorithm:
  \begin{enumerate}
  \item Read the data for a column/subject.
  \item Perform merging.
  \item Save the results to disk.
  \item Remove the results and the data from the workspace and do garbage
    collection.
  \end{enumerate}
\item When all analyses are completed, assemble the results somehow to allow
  easy access to them.\label{assemble2}
\end{enumerate}

This process is, of course, more convoluted than when parallelizing only by
subject. As above, we need to consider how to deal with missing values, the
use of repeated ``scan'' and ``save'', and the much more serious problem of
putting together the complete object with all of the results. Finally, if we
were interested in analyzing the data with more than one method, we would need
to modify the code above, since each method is called in a different way
(e.g., some methods require setting up specific objects before segmentation
can be called).

What ADaCGH2 provides is, among other things, a way to eliminate those steps,
automating them for the user, with careful consideration of fast access to
data on disk, and attempts to minimize memory usage in repeated calls to the
same process (which we can do successfully, as can be seen from the benchmarks
for large numbers of arrays with more than 6 million probes ---memory usage
levels out; see the file ``benchmarks.pdf''). Of course, ADaCGH2 provides
other benefits (e.g., facilities for using as input the data from other
packages ---e.g., section \ref{snap}--- or providing output for other packages
---e.g., section \ref{cghregions}).
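To make the ``manual'' route above concrete, the following is a minimal,
purely illustrative sketch of the per-subject loop of the first scenario (not
run): \texttt{someSegmentation} is a placeholder, not a real function, and we
assume one-column data files with a one-line header and a \texttt{chrom.dat}
vector already in memory.

<<eval=FALSE>>=
library(parallel)
column.files <- list.files(pattern = "^col_.*\\.txt$")
res <- mclapply(column.files, function(fname) {
    x <- scan(fname, skip = 1, quiet = TRUE)     ## read one subject/array
    out <- someSegmentation(x, chrom.dat)        ## placeholder segmentation
    save(out, file = paste0(fname, ".out.RData"))
    rm(x, out); gc()                             ## try to return memory
    NULL
}, mc.cores = 2)
## ... and we would still need to assemble the per-subject results somehow
@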
%% \section{(Non-runnable) Examples with large data sets and a comparison %% of approaches} %% \label{very-large-example} %% We discuss here a specific example to give you a rough idea of timings and %% preferred approaches for analyzing very large data sets. %% \subsection{Data set and hardware} %% \label{sec:data-set-hardware} %% We will use a simulated data set that contains 6,067,433 rows and 1000 %% columns of data; these are, thus, data for an array with 6 million probes %% and data for 1000 subjects. In some examples shown below, we use smaller %% subsets of the same data (with only 200 or 50 columns) and in some cases %% we will use a larger data set with 2000 subjects. There are 421,000 %% missing values per data column. The ASCII file with the data for the 1000 %% column data is about 96 GB\footnote{All sizes are computed from the %% reported size in bytes or megabytes, using 1024, or powers of 1024, as %% denominator.} and the RData for those same data is 46 GB (without %% compression; 41 GB with the standard R compression); in a freshly started %% R session, loading the RData will use 46 GB (as reported by %% \texttt{gc()}). We will analyze the full data set with 1000 columns, and %% some smaller versions of that data set (with 200 and 50 data columns). %% The examples below were run on a Dell PowerEdge C6145 chasis with two %% nodes. Each node has 4 AMD Opteron 6276 processors; since each processor %% has 16 cores, each node has 64 cores. One node has 256 GB RAM and the %% other 384 GB of RAM. Both nodes are connected by Infiniband (40Gb/s). For %% the data presented here, when using a single node, the data live on an xfs %% partition of a RAID0 array of SAS disks (15000 rpm) that are local to the %% node doing the computations. When using the two nodes, the data live on %% that same local SAS drive, which is seen by the other node using a simple %% NFS setup (we have also used distributed file systems, such as FhGFS, but %% they have tended to be a lot slower in these experiments; your mileage %% might vary). Therefore, in the table entries below, executions using both %% nodes will indicate ``124 cores'' \footnote{124 is not a typo; it is 124, %% even if the total number of cores is $128 = 64 * 2$. This is due to the %% following documented issue with Open MPI and Infiniband: %% \url{http://www.open-mpi.org/community/lists/users/2011/07/17003.php}, %% and since $128^2 = 16384$, we are hitting the limit, and we have not had %% a chance to correct this problem yet. Regardless, the penalty we would %% pay would be a difference of 4 process out of 124.} %% We will also show some examples run on an HP Z800 workstation, with 2 %% Intel Xeon E5645 processors (each processor has six cores), and 64 GB of %% RAM. The data live on an ext4 partition on a SATA disk (7200 rmp). In the %% tables, this is indicated as ``Coleonyx, 12 cores'' (table with results %% for reading data) or by the number 12 in the column for number of cores. %% In both systems, the operating system is Debian GNU/Linux (a mixture of %% Debian testing and Debian unstable). The Dell PowerEdge nodes were running %% version R-2.15.1 as available from the Debian repository (v.\ 2.15.1-5) %% or, later, R-3.0.1, patched (different releases, as available through May %% and June of 2013), and compiled from sources. The Xeon workstation was %% running R-2.15.1, patched version (2012-10-02 r60861), compiled from %% sources or, later R-3.0.1, patched (different releases, as available %% through May and June of 2013). 
Open MPI is version 1.4.3-2 (as available %% from Debian). %% As a reference for the size of the data set, the RData object with the %% 1000 columns, when loaded into R in the PowerEdges, takes 13 minutes to %% load and uses a total of about 46 GB (45.7 from calls to \texttt{gc} %% before and after loading the object, and adding Ncells and Vcells, or 45.6 %% as reported by \texttt{object.size}). Note that this is not the result of %% the object being a data frame and having a column with identifiers (a %% factor), instead of being a matrix; a similarly sized matrix with just the %% numeric data for the probes (i.e., without the first three columns of ID, %% chromosome, and location) has a size of 45.2 GB (therefore, the difference %% of 300 MB due to the first column, ID, being a factor with the identifiers, is %% minute relative to the size of the matrix). %% %% each column is about 0.045 GB, so a 1003 cols obj would have been %% %% 45.2 + 0.135 = 45.33. Thus, the factor takes an extra 300 MB. %% %% the data frame, without the first three columns, is 45.2 GB. %% % \footnote{A submatrix of those same data with 100 %% % columns of data, i.e., a numeric matrix of, say, columns 101 to 200, %% % occupies 4.52 GB. Thus, a matrix with 1003 columns of that type of data %% % would occupy about 45.34 GB (4.52 * 1003/100 = 45.34); the difference of %% % 300 MB is minute relative to the size of the matrix.}. %% %% used (Mb) gc trigger (Mb) max used (Mb) %% %% Ncells 6272458 335.0 9532693 509.2 6273371 335.1 %% %% Vcells 6102207940 46556.2 6634368466 50616.3 6102370100 46557.4 %% %% 100 data columns (columns 101 to 200) occupy, as data.frame %% %% 4853957368 bytes, so about 4.52 GB. %% %% As a matrix, 4853952408 bytes, so another 4.52 GB. %% % Sizes of objects: %% % - RData-to-local-uncompressed.RData: 46GB %% % - RData-to-local-compressed.RData: 41GB %% % - ff better: a single vector limited by int.machine, etc, but not the %% % whole array. %% For the two tables below, the meaning of columns is as follows: %% \begin{description} %% \item[Wall time] The ``elapsed'' entry returned by the command %% \texttt{system.time}. This is the real elapsed time since the function %% was called. %% It is important to understand that these timings can be variable, and we %% show here a single number.% In other words, repeated executions of the %% % same script can show variations in time; for instance, variations that %% % can be of a few minutes for reading of data sets with 50 to 200 columns %% % (i.e., where the wall time can, in some cases, increase or decrease by %% % about 50\%). %% \item[Max.\ mem (\texttt{gc})] The sum of the two rows of the ``max used'' %% column reported by \texttt{gc()}, in R, at the end of the execution of %% the given function. This number cannot reflect all the memory used by the %% function if the function spawns other R processes. %% \item[Max.\ $\Sigma$ mem.] A simple attempt to measure the memory used by %% all the processes\footnote{Just adding the entries given by \texttt{top} %% or \texttt{ps} will not do, and will overestimate, sometimes by a huge %% amount, the total memory used.}. Right before starting the execution %% of our function, we call the operating system command \texttt{free} and %% record the value reported by the ``-/+ buffers/cache'' row. Then, while %% the function is executing, we record, every 0.5 seconds, that same %% quantity. The largest difference between the successive measures and the %% original one is the largest RAM consumption. 
Note that this is an %% approximation. First, if other process start executing, they will lead %% to an overestimation of RAM usage; this, however, is unlikely to have %% had serious effects (the systems were basically idle, except for light %% weight cron jobs). Second, sampling is carried out every 0.5 seconds, so %% we could miss short peaks in RAM usage (at least one likely example of %% such underestimation shows in the tables below). %% \item[Columns] The number of data columns of the data set. %% \item[Method] The analysis method (either HaarSeg or CBS ---DNAcopy). %% \item[Fork/MPI] Whether forking (via \texttt{mclapply}) or MPI (using %% the facilities provided by package \textbf{Rmpi}, which are called %% from package \textbf{snow}) are used to parallelize execution. The two %% ``NP'' entries refer to non-parallelized execution, using the original %% packages\footnote{The packages are DNAcopy, as available from %% BioConductor, and the HaarSeg package, available from R-forge: %% \url{https://r-forge.r-project.org/projects/haarseg/}; \cite{haar-package}}. %% \item[Cores] Number of cores, and machine. 64: 64 cores in a single node %% of the Dell PowerEdge C6145. 124: 124 cores, using both nodes of the %% Dell PowerEdge. 12: 12 cores in the Z800 HP Workstation. %% \item[ff/RAM] If the data for the analysis had been stored as an %% \textit{ff} object, or as a data frame inside an RData that was loaded %% before the analysis. %% \end{description} %% For the table for reading data, all results, except for the one indicated %% in the table, correspond to using a single node in the Dell PowerEdge %% C6145 (64 cores). %% \clearpage %% \subsection{Reading data}\label{large-ex-read} %% \renewcommand{\arraystretch}{1.8} %% \begin{center} %% \begin{threeparttable} %% \caption{Time and memory usage when reading data} %% \begin{tabular}{p{3.52cm}p{1.8cm}p{1.8cm}p{2.65cm}p{3.4cm}} %% \hline %% &Columns & Wall time (minutes)& Max.\ mem.\ (\texttt{gc}), GB & %% Max.\ $\Sigma$ mem., GB\\ %% \hline %% Directory to \textit{ff} & 2000 & 13.9 & 1.3 & 39.1 \\ %% I get times from 9 %% %% to 34 %% Directory to \textit{ff} & 1000 & 6.1 & 1.3 & 41.21 \\ %% Directory to \textit{ff} & 500 & 4.6 & 1.3 & 39.7 \\ %% can go from 4.5 %% %% to 10 %% %% huge variability depending on how full the disk is and, I guess, %% %% fragmentation and if files are close to each other on disk %% Directory to \textit{ff} (Coleonyx, 12 cores) & 1000 & 88 & 1.3 & 8.4 \\ %% Txt file to \textit{ff} & 1000 & 2630 & 1.3 & NA (not calculated) \\ %% RData to \textit{ff} & 1000 & 29.6 & 169 & 168.3 \\ %% Directory to data frame (RAM object) & 1000 & 22 + 2\tnote{a} & 96 & NA. Output unusable for analysis \\ %% RData to data frame (RAM object) & 1000 & 22 + 2\tnote{b} & 139 & NA. Output unusable for analysis \\ %% Directory to data frame (RAM object) & 200 & 7.7 + 0.4\tnote{c} & 20 & 38 \\ %% Directory to \textit{ff} & 200 & 3.6 & 1.3 & 38 \\ %% Directory to data frame (RAM object) & 100 & 8.7 + 0.3\tnote{d} & 10.5 & 30 \\ %% Directory to \textit{ff} & 100 & 3.6 & 1.3 & 32 \\ %% Directory to \textit{ff} & 50 & 3.2 & 1.3 & 28 \\ %% Directory to data frame (RAM object) & 50 & 5.8 + 0.1\tnote{e} & 5.8 & 27 \\ %% %% I take several measyures in a few cases, and show the representative one %% \hline %% \end{tabular} %% \begin{tablenotes} %% { %% \footnotesize %% \item[a] The 2 reflects the time needed to save the resulting data frame to an RData file. %% \item[b] The 2 reflects the time needed to save the resulting data frame to an RData file. 
%% \item[c] The 0.4 reflects the time needed to save the resulting data frame to an RData file. %% \item[d] The 0.3 reflects the time needed to save the resulting data frame to an RData file. %% \item[e] The 0.1 reflects the time needed to save the resulting data frame to an RData file. %% } %% \end{tablenotes} %% \end{threeparttable} %% \end{center} %% % ``5.8 + 0.1'' or ``22 + 2'': add time to save the RData object. When %% % creating \textit{ff} objects, saving is part of the main reading function. %% % Max.\ mem. (\texttt{gc}) is the sum of the ``max used'' column reported by %% % \texttt{gc}, in R. %% % Notice: ff adjusts usage, thus the figure for COleonyx. %% % Last column: could be a slight underestimate if a peak of %% % consumpt. between two samples. %% \clearpage %% \subsection{Analyzing data} %% \renewcommand{\arraystretch}{1.4} %% %\thispagestyle{empty} %% \begin{center} %% \begin{threeparttable} %% \caption{Time and memory usage of segmentation with default options} %% \begin{tabular}{p{1.5cm}p{1.2cm}p{1.2cm}p{1.2cm}p{1.5cm}p{3.0cm}p{1.8cm}p{1.6cm}} %% \hline %% Method& Fork /MPI & Cores & ff/RAM & Columns & Wall time (min.) & Max.\ mem. (\texttt{gc}), GB & Max.\ $\Sigma$ mem., GB\\ %% \hline %% CBS&MPI &124 & ff & 1000 & 313 & 0.14 & 53 \\ %% CBS&MPI&64& ff & 1000 & 545 & 0.14 & 51 \\ %% CBS & Fork & 64 & ff & 2000 & 705.3 & 0.16 & 45.7 \\ %% CBS & Fork & 12 & ff & 2000 & 2414.4 & 0.15 & 19.3 \\ %% CBS&Fork&64&ff&1000&339 & 0.136 & 41 \\ %% CBS&Fork&12&ff&1000&1184.0 & 0.142 & 13.8 \\ %% CBS&Fork&64&ff&500&185.8 & 0.133 & 41 \\ %% &&&&&&&\\ %% HaarSeg & MPI &124 & ff & 1000 & 12.3 & 0.145 & 38.5 \\ %% HaarSeg & MPI &64 & ff & 1000 & 9.5 & 0.145 & 38.5 \\ %% HaarSeg & MPI &12 & ff & 1000 & 36 & 0.144 & 7.2 \\ %% HaarSeg & Fork &64 & ff & 2000 & 18.7 & 0.15 & 28.4 \\ %% HaarSeg & Fork &12 & ff & 2000 & 57.4 & 0.16 & 10.1 \\ %% HaarSeg & Fork &64 & ff & 1000 & 7.9 & 0.14 & 27 \\ %% HaarSeg & Fork &12 & ff & 1000 & 32 & 0.129 & 7.3 \\ %% HaarSeg & Fork &64 & ff & 200 & 2.3 & 0.135 & 28.6\\ %% HaarSeg & Fork &64 & ff & 500 & 4.6 & 0.13 & 26 \\ %% &&&&&&&\\ %% % %% HaarSeg & Fork &10\tnote{f} & ff & 1000 &33.3 & 0.142 & 5.9 \\ %% % HaarSeg & Fork &20\tnote{f} & ff & 1000 & 17.4 & 0.142 & 10.0 \\ %% % HaarSeg & Fork &40\tnote{f} & ff & 1000 & 10.2 & 0.142 & 17.5 \\ %% % HaarSeg & Fork &50\tnote{f} & ff & 1000 & 8.8 & 0.142 & 21.7 \\ %% % HaarSeg\tnote{a} & Fork &64 &ff & 100 & 1.2 & 0.133 & 24.5 \\ %% % CBS\tnote{a} & Fork &64 &ff & 100 & 55.9 & 0.132 & 35.3 \\ %% % When we just replace NAs by 0: run-sequential-Haar.R %% % HaarSeg & NP &- &RAM & 100 & x & 0.95 + 3.3 + 13.15 & x\tnote{d} \\ %% % HaarSeg & NP &- &RAM & 100 & 0.95 + 22.9 + 1.4\tnote{b} & 12.5 & 12.5\tnote{d} \\ %% % CBS & NP &- &RAM & 100 & 0.95 + 1698 + 6.7\tnote{c} & 40 & 40\tnote{d} \\ %% HaarSeg & Fork &64 & ff & 50 & 0.5 & 0.133 & 20.6\\ %% HaarSeg & Fork &64 & RAM & 50 & 0.7 + 2.5 + 0.9\tnote{e}& 14.4 & 140\\ %% HaarSeg & MPI &64 & ff & 50 & 0.63 & 0.14 & 24.6\\ %% &&&&&&&\\ %% HaarSeg & Fork &64 & RAM & 1000 & NA & NA & \textbf{Cannot allocate memory} \\ %% HaarSeg & Fork &64 & RAM & 200 & NA & NA & \textbf{Cannot allocate memory}\\ %% \hline %% \end{tabular} %% \begin{tablenotes} %% {\footnotesize %% %% \item[a] With option \texttt{merging = "none"} to make it comparable %% %% to the canonical code in the original packages. %% % \item[b] 0.95 + 22.9 + 1.4: load data, analyze, and save results. 
Since %% % there are missing values in the data, and the original HaarSeg code %% % does not deal with missing values, we are forced to remove NAs %% % array-per-array, and make repeated calls to the function. %% % % , both o %% % % which introduce an important penalty. We must do this to make %% % % analysis comparable to those of \texttt{pSegmentHaarSeg} that %% % % \textbf{always} checks, and allows, for missing values. %% % If there are no missing values in this data set, the total time of %% % analysis (i.e., sending the whole matrix at once and not checking %% % for, nor removing, NAs) is 3.3 minutes. %% % \item[c] 0.95 + 1698 + 6.7: load data, analyze, and save results. The %% % analysis involves calling the \texttt{CNA} function to create the CNA object %% % (5.3 min), calling the \texttt{smooth.CNA} function to smooth the data %% % and detect outlier (83.2 minutes), and segmenting the data with the %% % \texttt{segment} function (1609.5 minutes). %% \item[d] Since only one process, same as previous column. %% \item[e] 0.7 + 2.5 + 0.9: load data, analyze, and save results. %% % \item[f] These examples have been run on the Dell Power Edges, but %% % changing the number of cores used. %% } %% \end{tablenotes} %% \end{threeparttable} %% \end{center} %% %% time and memory, haar seg, with increasing num. cols in 64 cores %% %% h2 <- rbind(c(50, 0.5, 20.6), %% %% c(200, 2.3, 28.6), %% %% c(500, 4.6, 26), %% %% c(1000, 7.9, 27), %% %% c(2000, 18.7, 28.4)) %% %% plot(h2[, 2] ~ h2[, 1]) %% %% plot(h2[, 3] ~ h2[, 1]) %% %% time and memory, cbs and haar, with 2000 arrays, as cores change %% %% t.c <- c(717, 857, 1007, 1879, 3771) %% %% m.c <- c(42, 35.3, 29, 16.5, 11.1) %% %% x <- c(64, 50, 40, 20, 10) %% %% plot(t.c ~ x) %% %% plot(m.c ~ x) %% %% t.h <- c(12, 13, 15.4, 26.3, 49.7) %% %% m.h <- c(28, 23, 19.5, 11.9, 8) %% %% plot(m.h ~ x) %% %% plot(t.h ~ x) %% %\clearpage %% \begin{center} %% \begin{threeparttable} %% \caption{Time and memory usage of segmentation without merging and %% comparison with non-parallized executions. These examples have all %% been run on the Dell Power Edges.} %% \begin{tabular}{p{1.5cm}p{1.2cm}p{1.2cm}p{1.2cm}p{1.5cm}p{3.0cm}p{1.8cm}p{1.6cm}} %% \hline %% Method& Fork /MPI & Cores & ff/RAM & Columns & Wall time (min.) & Max.\ mem. 
(\texttt{gc}), GB & Max.\ $\Sigma$ mem., GB\\ %% \hline %% HaarSeg & Fork &64 & ff & 100 & 1.2 & 0.13 & 24.5 \\ %% HaarSeg & Fork &10 & ff & 1000 &23.5 & 0.142 & 5.9 \\ %% HaarSeg & Fork &20 & ff & 1000 & 12.3 & 0.137 & 10.0 \\ %% HaarSeg & Fork &40 & ff & 1000 & 7.4 & 0.142 & 17.6 \\ %% HaarSeg & Fork &50 & ff & 1000 & 6.7 & 0.139 & 21.2 \\ %% HaarSeg & Fork & 64 & ff & 1000 & 6.4 & 0.142 & 26.9 \\ %% &&&&&&&\\ %% HaarSeg & Fork &10 & ff & 2000 &49.7 & 0.16 & 8.0 \\ %% HaarSeg & Fork &20 & ff & 2000 & 26.3 & 0.16 & 11.9 \\ %% HaarSeg & Fork &40 & ff & 2000 & 15.4& 0.16 & 19.5 \\ %% HaarSeg & Fork &50 & ff & 2000 & 13.3& 0.16 & 23.2 \\ %% HaarSeg & Fork &64 & ff & 2000 & 11.9& 0.16 & 28.4 \\ %% &&&&&&&\\ %% CBS & Fork & 64 & ff & 100 & 55.9 & 0.13 & 35.3 \\ %% CBS & Fork &10 & ff & 1000 &1855.4 & 0.135 & 8.6 \\ %% CBS & Fork &20 & ff & 1000 & 939.8 & 0.135 & 14.9 \\ %% CBS & Fork &40 & ff & 1000 & 513.5 & 0.136 & 27.0 \\ %% CBS & Fork &50 & ff & 1000 & 438.6 & 0.142 & 33.1 \\ %% CBS & Fork & 64 & ff & 1000 & 350.3 & 0.142 & 41.3 \\ %% &&&&&&&\\ %% CBS & Fork &10 & ff & 2000 &3770.9 & 0.15 & 11.1 \\ %% CBS & Fork &20 & ff & 2000 & 1878.8 & 0.16 & 16.5 \\ %% CBS & Fork &40 & ff & 2000 & 1007.0 & 0.15 & 28.8 \\ %% CBS & Fork &50 & ff & 2000 & 857.1 & 0.16 & 35.3 \\ %% CBS & Fork & 64 & ff & 2000 & 717.3 & 0.163 & 41.9 \\ %% &&&&&&&\\ %% HaarSeg & NP &- &RAM & 100 & 0.95 + 22.9 + 1.4\tnote{b} & 12.5 & 12.5\tnote{d} \\ %% CBS & NP &- &RAM & 100 & 0.95 + 1698 + 6.7\tnote{c} & 40 & 40\tnote{d} \\ %% \hline %% \end{tabular} %% \begin{tablenotes} %% {\footnotesize %% \item[b] 0.95 + 22.9 + 1.4: load data, analyze, and save results. Since %% there are missing values in the data, and the original HaarSeg code %% does not deal with missing values, we are forced to remove NAs %% array-per-array, and make repeated calls to the function. %% % , both o %% % which introduce an important penalty. We must do this to make %% % analysis comparable to those of \texttt{pSegmentHaarSeg} that %% % \textbf{always} checks, and allows, for missing values. %% If there are no missing values in this data set, the total time of %% analysis (i.e., sending the whole matrix at once and not checking %% for, nor removing, NAs) is 3.3 minutes. %% \item[c] 0.95 + 1698 + 6.7: load data, analyze, and save results. The %% analysis involves calling the \texttt{CNA} function to create the CNA object %% (5.3 min), calling the \texttt{smooth.CNA} function to smooth the data %% and detect outlier (83.2 minutes), and segmenting the data with the %% \texttt{segment} function (1609.5 minutes). %% % \item[f] These examples have been run on the Dell Power Edges, but %% % changing the number of cores used. %% } %% \end{tablenotes} %% \end{threeparttable} %% \end{center} %% \clearpage %% %% \renewcommand{\arraystretch}{1.4} %% %% %\thispagestyle{empty} %% %% \begin{center} %% %% \begin{table} %% %% \caption{Further benchmarks on the Dell Opterons, with varying numbers of %% %% cores used} %% %% \begin{tabular}{p{1.5cm}p{1.2cm}p{1.2cm}p{1.2cm}p{1.5cm}p{3.0cm}p{1.8cm}p{1.6cm}} %% %% \hline %% %% Method& Fork /MPI & Cores & ff/RAM & Columns & Wall time (min.) & Max.\ mem. 
(\texttt{gc}), GB & Max.\ $\Sigma$ mem., GB\\ %% %% \hline %% %% HaarSeg & Fork &64 & ff & 1000 & 8 & 0.136 & 35 \\ %% %% HaarSeg & Fork &12 & ff & 1000 & 32 & 0.129 & 7.3 \\ %% %% \hline %% %% \end{tabular} %% %% \end{table} %% %% \end{center} %% %% \clearpage %% \subsection{Comments and recommended usage patterns} %% \label{commentsend} %% \begin{enumerate} %% \item Reading data and trying to save it as a RAM object, a usual %% in-memory data frame, will quickly exhaust available memory. For these %% data, we were not able to read data sets of 100 or more columns. Part of %% the problem lies on the way memory is handled and freed in the slaves, %% given that we are returning actual lists. In contrast, when saving as %% \textit{ff} objects, the slaves are only returning tiny objects (just %% pointers to a file). %% \item Saving data as RData objects will also not be an option for large %% numbers of columns as we will quickly exhaust available memory when %% trying to analyze them. %% \item In a single machine, and for the same number of cores, analyzing %% data with MPI is often generally slower than using forking, which is not %% surprising. Note also that with MPI there is an overhead of spawning %% slaves and loading packages in the slaves (which, in our case, takes %% about half a minute to a minute). %% These overheads could be amortized if %% %% you continue using MPI for a session (but OpenMPI is not ideal for that %% %% scenario, in contrast to other versions such as LAM MPI). %% \item When using two nodes (i.e., almost doubling the number of cores), %% MPI might, or might not, be faster than using forking on a single %% node. Two main issues affect the speed differences: inter-process %% communication and access to files. In our case, the likely bottleneck %% lies in access to files, which live on a SAS disk which is accessed via %% NFS. With other hardware/software configurations, access to shared files %% might be much faster. Regardless, the MPI costs might not be worth if %% each individual calculation is fast; this is why MPI with HaarSeg does %% not pay off, but it does pay off with CBS. %% \item When using \textit{ff}, the exact same operations in systems with %% different RAM can lead to different amounts of memory usage, as %% \textit{ff} tries autotuning when starting. %% Note, for instance, the %% %% differences in RAM usage between our two machines when reading a %% %% directory with a 1000 column data set. Hey! This is just because we %% %% use 12 or 64 cores! %% You can tune parameters when you load the \textbf{ff} package, but even %% if you don't (and, by default, we don't), defaults are often sensible %% and will play in your favor. %% \item Even for relatively small examples, using \textit{ff} can be faster %% than using RAM objects. Using RAM objects incurs overheads of loading %% and saving the RData objects in memory, but analyses also tend to be %% slightly slower. The later is somewhat surprising: with forking and RAM %% objects, the R object that holds the CGH data is accessed only for %% reading, and thus can be shared between all processes. We expected this %% to be faster than using \textit{ff}, because access to disk is several %% orders of magnitude slower than access to memory ---note that we made %% sure that memory was not virtual memory mapped to disk, as we had %% disabled all swapping. 
We suspect the main difference lies in %% bottlenecks and contention problems that result from accessing data in a %% single data frame simultaneously from multiple processes, compared to %% loading exactly just one column independently in each process, and/or %% repeated cache misses. %% \item \texttt{inputToADaCGH} (i.e., transforming a directory of files into %% \textit{ff} objects) can be severely affected, of course, by other %% processes accessing the disk. More generally, since with %% \texttt{inputToADaCGH} several processes can try to access different %% files at once (we are trying to parallelize the reading of data), issues %% such as type of file system, configuration and type of RAID, number of %% spindles, quality of the RAID card, amount of free space, etc, etc, etc, %% can have an effect on all heavy I/O operations. For example, when %% reading a set of 2000 files, the time differences in a set of benchmarks %% varied between 9 and 34 minutes. The only difference between the runs we %% could find being that in the 34 minute runs, we had added about 60 GB %% worth of files after creating/copying the text files, and before %% reading, and the disk was about 64\% full. Note also that, as a general %% rule, it is better if the newly created \textit{ff} files from %% \texttt{inputToADaCGH} are written to an empty directory, and if the %% working directory for segmentation analysis is another empty directory %% if you are using \texttt{ff} objects. %% \texttt{inputToADaCGH} accepts an argument to reduce the number of cores used, %% which can help with contention issues related to I/O. A multicore %% machine (say, 12 cores) with a single SATA drive might actually %% complete the reading faster if you use fewer than 12 cores for the %% reading. But your mileage might vary. %% \item Reordering data takes time (a lot if you do not use \textit{ff} %% objects) and can use a lot of memory. So it is much better if input data are %% already ordered (by Chromosome and Position within Chromosome). %% \end{enumerate} %% % Moving an ff set of files %% % \begin{verbatim} %% % ffsave(cghData, file = "savedcghData", rootpath = "./") %% % ffload(file = "pathtofile/savedcghData", rootpath = getwd()) %% % \end{verbatim} \section{Session info and packages used} This is the information about the version of R and packages used: <<>>= sessionInfo() @ \newpage \bibliographystyle{apalike} \bibliography{ADaCGH2} <>= ## try(closeAllConnections()) ## nope, breaks things ## Remove this sys.sleep; is this sometimes breaking windoze build in BioC? 
## Sys.sleep(1) ## I hope we do not need this @ \end{document} % To try to get call graphs with proftools % <<>>= % library(proftools) % library(ADaCGH2) % ## Do not use cluster if drawing call graphs % ## snowfallInit(universeSize = 2, typecluster = "SOCK") % fname <- list.files(path = system.file("data", package = "ADaCGH2"), % full.names = TRUE, pattern = "inputEx2") % tableChromArray <- inputToADaCGH(filename = fname) % Rprof(tmp <- tempfile()) % cbs.out <- pSegmentDNAcopy("cghData.RData", % "chromData.RData") % Rprof() % profileCallGraph2Dot(readProfileData(tmp), score = "total", % filename = "pSegment.dot", % rankdir = "TB") % fname <- list.files(path = system.file("data", package = "ADaCGH2"), % full.names = TRUE, pattern = "inputEx1") % Rprof(tmp <- tempfile()) % hmm_mad.out <- pSegmentHMM("cghData.RData", % "chromData.RData", merging = "MAD") % Rprof() % profileCallGraph2Dot(readProfileData(tmp), score = "total", % filename = "pSegmentHMM.dot", % rankdir = "TB") % ## edit, change name, see and produce pdf % ## dotty pSegment-DNAcopy.dot % ## dot -Tpdf pSegment-DNAcopy.dot > foo.pdf % ## xpdf foo.pdf % @ % <<>>= % waves.merge.local.fork <- pSegmentWavelets(cgh.dat, % chrom.dat, merging = "mergeLevels");head(waves.merge.local.fork[[1]]) % @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%% FIXME: MOVE HERE ORIGINAL VIGNETTE %%% Notes for additional text %% It should be noted that reading very large RData files and trying to %% remove NAs can break R. For instance, with messages such as %% [ramon@Lacerta:]~/ADaCGH2-test-files tail -f read-1000cols-RData-ff.Rout %% The following object(s) are masked from 'package:base': %% %in% %% Loading required package: parallel %% > %% > fnameRData <- paste(getwd(), "/Data1000cols.RData", sep = "") %% > %% > unix.time(tableChromArray <- inputToADaCGH(ff.or.local = "ff", %% + RDatafilename = fnameRData, na.omit = TRUE)) %% Error in cbind(deparse.level, ...) : %% resulting vector exceeds vector length limit in 'AnswerType' %% Calls: unix.time ... inputToADaCGH -> is.na -> is.na.data.frame -> do.call -> cbind %% Timing stopped at: 768.444 113.815 892.471 %% Execution halted %% so you are suggested to clean NAs before hand, or use text files as input. %% In windows, it crashes with a known problem %% https://www.stat.math.ethz.ch/pipermail/r-devel/2013-April/066493.html %% but see this %% https://stat.ethz.ch/pipermail/r-devel/2013-March/066126.html %%%%%%%%%%%%%% checking under windoze %%%%%%%%%%%%% I can NOT do it: remember socket clusters do not work under VirtualBox. %% Install the devel version of R %% Install Rtools and Miktex %% Install the devel version of BioC %% Install the dependencies. Either from BioC or from CRAN %% cp -a of the ADaCGH2 dir to the shared place %% inside windoze again, cp that to my home c:\Documents and Settings\Ramon %% Run as "c:\Archivos de programa\R\R-3.0.1\bin\R.exe" CMD build --keep-empty-dirs --no-resave-data ADaCGH2 %% "c:\Archivos de programa\R\R-3.0.1\bin\R.exe" CMD check --force-multiarch --no-vignettes --timings ADaCGH2_2.1.0.tar.gz %% "c:\Archivos de programa\R\R-3.0.1\bin\R.exe" --build --merge-multiarch ADaCGH2_2.1.0.tar.gz