\documentclass{article}
\usepackage{natbib}
\usepackage{graphics}
\usepackage{amsmath}
\usepackage{indentfirst}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
%\VignetteIndexEntry{flowPhyto}
\begin{document}

<<echo=FALSE>>=
options(keep.source = TRUE, width = 60)
foo <- packageDescription("flowPhyto")
@

\title{The flowPhyto Package \\ (Version \Sexpr{foo$Version})}
\author{ Francois Ribalet \\ \texttt{ribalet@u.washington.edu} }
\maketitle

\section{Licensing}

This package is licensed under the Artistic License v2.0: it is therefore free to use and redistribute; however, we, the copyright holders, wish to maintain primary artistic control over any further development. Please be sure to cite us if you use this package in work leading to publication.

\begin{enumerate}
\item{
Ribalet, F., Schruth, D., Armbrust, E.V.
\newblock flowPhyto: enabling automated analysis of microscopic algae from continuous flow cytometric data.
\newblock \emph{Bioinformatics}, 2011, doi: 10.1093/bioinformatics/btr003.
}
\end{enumerate}

\section{Installation}

\subsection{Unix/Linux/Mac}

Building the \Rpackage{flowPhyto} package from source requires a C compiler and all of the prerequisites of the underlying \Rpackage{flowCore} package, namely the GNU Scientific Library (GSL) and the Basic Linear Algebra Subprograms (BLAS). Once these prerequisites are in place, the package can be installed via:

\begin{verbatim}
R CMD INSTALL flowPhyto_x.y.z.tar.gz
\end{verbatim}

After a successful installation the package can be loaded in the normal way: by starting R and invoking the \Rfunction{library} command:

<<>>=
library(flowPhyto)
@

\subsection{Windows}

The \Rpackage{flowPhyto} package is compatible with the Windows version of R, and the same prerequisites apply. However, the \Rfunction{pipeline} function and the downstream file-based functions that deploy the four analysis steps to a cluster are not currently supported.

\section{Introduction}

Flow cytometry is a widely used technique among biologists to study the abundances of populations of microscopic algae living in aquatic environments. A new-generation, high-frequency flow cytometer known as SeaFlow collects up to several hundred samples per day and can run continuously for several weeks (see Ribalet \textit{et al.}, 2010 for more details). Automated computational methods are needed to analyze the different phytoplankton populations present in each sample. Here we describe the \Rpackage{flowPhyto} R package, which performs aggregate statistics on virtually unlimited collections of raw flow cytometry files in a memory-efficient, parallelized fashion.

\section{The SeaFlow Repository}

SeaFlow data are stored in custom binary files (EVT files), one created every 3 minutes, each consisting of eight 16-bit integer channels, namely:

<<>>=
CHANNEL.CLMNS
@

The SeaFlow repository is composed of Julian-day-labeled directories, each containing chronologically ordered EVT files. The following code shows how to read one of these files into memory:

<<>>=
evt.file.path <- system.file("extdata","seaflow_cruise","2011_001", "2.evt",
                             package="flowPhyto")
evt <- readSeaflow(evt.file.path)
@
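The object returned by \Rfunction{readSeaflow} is an ordinary data frame, so it can be inspected with standard R tools before any further processing. The quick checks below are a small illustrative addition (they assume the columns of \texttt{evt} include the channels listed in \texttt{CHANNEL.CLMNS}):

<<>>=
dim(evt)                            # particles (rows) by channels (columns)
all(CHANNEL.CLMNS %in% names(evt))  # are all expected channel columns present?
summary(evt$chl_small)              # distribution of the small chlorophyll channel
@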
\section{Core Functions}

\subsection{OPP Filtration}

Unlike a traditional flow cytometer, SeaFlow directly analyzes a raw stream of seawater using two detectors that determine the position of a particle in the focal region of the instrument's optical system (Swalwell \textit{et al.}, 2009). The \Rfunction{filter} function selects the optimally positioned particles (OPP) in each EVT file, which are used to distinguish the different phytoplankton populations.

<<>>=
opp <- filter(evt, notch=1.1)
@

\subsection{Cluster-Based Classification}

Because the characteristics of each phytoplankton population vary according to environmental conditions and instrument settings, a table of customizable parameters (pop.def.tab) is used to define the pre-gating regions and statistical priors of the phytoplankton population clusters.

<<>>=
opp.path <- system.file("extdata","seaflow_cruise","2011_001", "2.evt.opp",
                        package="flowPhyto")
pop.def.path <- system.file("extdata","seaflow_cruise","pop.def.tab",
                            package="flowPhyto")
opp <- readSeaflow(opp.path)
def <- readPopDef(pop.def.path)
def
@

Above we can see the default population definition table with the two-dimensional pre-gating ranges and the parameters passed to the statistical clustering methods of the \Rpackage{flowClust} package (Lo \textit{et al.}, 2009). Below, the \Rfunction{classify} function uses these pre-defined parameters and takes one or more OPP files as input (3 by default) to classify individual phytoplankton cells into the different populations.

<<>>=
pop <- classify(x=opp, pop.def=def, func=2)
table(pop$pop)
@

The \Rfunction{plotCytogram} function outputs a series of customizable 2-D cytograms to visualize the phytoplankton populations identified by the \Rfunction{classify} function.

<<plotcytogram,eval=FALSE>>=
plotCytogram(pop, "fsc_small","chl_small", pop.def=def, add.legend=TRUE, cex=1)
@

\begin{figure}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
<<plotcytogram>>
@
\end{center}
\caption{The above 2-D cytogram depicts the phytoplankton populations present in the sample.}
\label{fig:one}
\end{figure}

\newpage

\subsection{Consensus and Census}

\Rfunction{classify} outputs class vector files (class.vct) that contain the population identification of the cells. \Rfunction{classify} is run in single-file increments to provide multiple passes over each cell and strengthen the clustering analysis. During the \Rfunction{census} step, these multiple-pass vector files are collapsed into one consensus vector, which represents the most likely population classification of the different phytoplankton cells. In addition, \Rfunction{census} produces a one-row census tab file that contains the number of cells per population for each file. The concatenation of these census tab files is used to create a per-population resampling scheme that calculates the number of OPP files necessary for a sufficient number of cells (500 by default) to be present in the resampled population.

<<>>=
vct.paths <- sapply(c(1,439,440), function(i)
    system.file("extdata","seaflow_cruise","2011_001",
                paste("1.evt.opp.",i,'-class.vct',sep=''), package="flowPhyto"))
mat <- do.call(cbind, lapply(vct.paths, read.delim))
consen.df <- consensus(mtrx=mat)
table(consen.df$pop)
aggregate(consen.df$support, list(consen.df$pop), mean)
@

Above is a cross-tabulation of counts per population from the generated consensus vector, together with a corresponding table of the average 'support' counts. The support column in the output of \Rfunction{consensus} records how many of the multiple-pass classification vectors assigned an event to that population.
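Because the support count is retained per event, its distribution gives a quick check of how stable the classification was across passes. The tabulation below is a small illustrative addition using the \texttt{consen.df} object created above:

<<>>=
table(consen.df$support)   # events by number of concordant classification passes
@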
Compare the above per-population counts with the output of \Rfunction{census} below.

<<>>=
census(v=pop$pop, pop.def=def)
@

\subsection{Aggregate Statistics}

The \Rfunction{summarize} function performs per-population aggregate statistics (cell concentration plus the mean and standard deviation of the different channels) using the resampling scheme.

<<>>=
filter.df <- readSeaflow(opp.path, add.yearday.file=TRUE)
classed <- cbind.data.frame(filter.df, consen.df)
names(opp.path) <- getFileNumber(opp.path)
class.jn <- joinSDS(classed, opp.path)
nrow.opp <- sapply(opp.path, function(p) readSeaflow(p, count.only=TRUE))
nrow.evt <- sapply(opp.path, function(p) readSeaflow(sub('.opp','',p),
                                                     count.only=TRUE))
class.jn$opp <- rep(nrow.opp, times=nrow.opp)
class.jn$evt <- rep(nrow.evt, times=nrow.opp)
summarize(class.jn, opp.paths.str=opp.path)
@

The \Rfunction{summarize} function also associates each sample with its corresponding acquisition time and location (latitude and longitude), and outputs a summary table for the entire set of SeaFlow data. The \Rfunction{plotStatMap} function creates customizable plots of the geo-referenced data created by \Rfunction{summarize}. A combination of the different parameters per population, or a single parameter over different populations, can be selected depending on the purpose of the analysis.

<<plotstatmap,eval=FALSE>>=
stat.tab <- system.file("extdata","seaflow_cruise","stats.tab",
                        package="flowPhyto")
stats <- read.delim(stat.tab)
plotStatMap(df=stats, pop='ultra', z.param='conc', margin=0.2,
            zlab=expression(paste('Cell concentration, ',10^6 * cells/L)),
            main="Cell concentration of Ultra-plankton population")
mtext(line=1, side=4, "cell concentration 10^6 cells / L")
@

\begin{figure}
\begin{center}
<<fig=TRUE,echo=FALSE>>=
<<plotstatmap>>
@
\end{center}
\caption{Ultra-plankton concentration for the Puget Sound in November, 2009.}
\label{fig:two}
\end{figure}

\newpage

\section{File-Based Functions, Input Validation, and the Pipeline}

Each of the above core functions has a file-based analog that takes one (or several) paths as its main input parameter and outputs one or more files. For examples of these, please check the man pages for the individual file functions.

\subsection{Directory Initialization}

First, you will need to transfer the data to a location with plenty of extra disk space (around 25 percent more space than the raw EVT files alone). Then you will want to make sure the directory has write access for the user who will be running the pipeline. Changing your working directory to the output directory is also recommended, as many of the cruise-specific job files get written there by default. Additional steps, such as creating a log or plot subdirectory in the repository, or a new record in your cruise database (if you plan on uploading the resulting statistics or SDS information to your database), may be desirable as well.

\subsection{SDS and pop.def.tab Validation}

One of the more important pre-processing steps is to validate the SDS file before running the pipeline, checking for evidence of parsing errors that may have crept into the ship's data stream. The following code demonstrates one way this can be done, namely longitude and latitude checking:

<<>>=
path <- system.file("extdata","seaflow_cruise", package="flowPhyto")
sds <- combineSdsFiles(path)
plot(sds$LON, sds$LAT)
@

Additionally, any externally defined population definition table should be validated using the following function:

<<>>=
validatePopDef(readPopDef(pop.def.path))
@

An external pop.def can be specified by placing a file named 'pop.def.tab' in the cruise's directory. The parameter names and data types should match those found in the POP.DEF object. If such a file is not present, one will be created automatically from the data frame hard-coded into Define.R.
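A custom definition table can be derived from the bundled one: read it, adjust the gating ranges or clustering priors, write it back out, and re-validate. The sketch below is illustrative only (it assumes the on-disk format matches the bundled tab-delimited file) and is not evaluated here:

<<eval=FALSE>>=
my.def <- readPopDef(pop.def.path)   # start from the bundled definitions
## ... edit gating ranges or clustering priors in my.def here ...
write.table(my.def, file='pop.def.tab', sep='\t', quote=FALSE)
validatePopDef(readPopDef('pop.def.tab'))  # re-check before running the pipeline
@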
\subsection{Running the Pipeline}

The pipeline itself is merely a cluster deployment function that executes, in concerted batches, each of the file-based wrapper functions for the four main analysis steps. Many of the sub-function-specific parameters can also be passed through from this upper-level function. The following example copies the very small bundled example data set to the present working directory and runs the pipeline for just step 4, which calculates statistics on a repository that has already undergone analysis steps 1 through 3.

<<>>=
repository.dir <- '.'
output.path <- paste(repository.dir,'/','seaflow_cruise',sep='')
seaflow.path <- system.file("extdata", 'seaflow_cruise', package="flowPhyto")
file.copy(from=seaflow.path, to=repository.dir, recursive=TRUE)
pipeline(repo=repository.dir, cruise.name='seaflow_cruise', steps=4,
         parallel=FALSE, submit.cmd='qsub -l walltime=00:20:00')
unlink(output.path, recursive=TRUE)
@

The most important parameters to set when calling \Rfunction{pipeline} are 'repo', which should be set to the location of your repository, and 'parallel', which tells the function whether to run in serial or in parallel. Currently, parallel jobs are simply submitted via a cluster submission command such as 'qsub' (for Torque/SGE) or 'mosrun' (for MOSIX), as specified by the 'submit.cmd' option. For the purposes of this brief example, 'parallel' has been set to FALSE, but it should almost always, where possible, be set to TRUE (the default) when running the pipeline over realistically sized, day-long or longer data sets. Additionally, the submit.cmd parameter was set to use qsub as a non-functional example (normally parallel=FALSE and submit.cmd would not be used together). Future plans for parallelization include replacing the above 'R CMD BATCH' and 'submit.cmd' based parallelization with a PVM/MPI-based \Rpackage{snow} package implementation.

\subsection{Cleanup}

There are two useful functions that can help to clean up the aftermath of all of the pipelined R CMD BATCH calls. The \Rfunction{cleanupLogs} function deletes log files depending on their error status, and the \Rfunction{clearOutput} function removes any output files from specified steps to clear the way for a re-run of the pipeline, as sketched below.
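Neither function is run by \Rfunction{pipeline} itself, so they are typically invoked afterwards from the repository directory. The calls below are a hypothetical sketch: the argument names are assumptions rather than the documented signatures, so consult the individual man pages before use.

<<eval=FALSE>>=
## Hypothetical invocations: argument names are illustrative assumptions --
## see ?cleanupLogs and ?clearOutput for the actual signatures.
cleanupLogs(repo=repository.dir)            # drop logs from error-free runs
clearOutput(repo=repository.dir, steps=4)   # clear step-4 output before a re-run
@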
\section{Example Dataset}

The examples bundled with this package have been artificially reduced both in size and in number to make the package as lightweight as possible. For a more realistic example, you can visit our website \url{http://seaflow.ocean.washington.edu} to download a copy of the day-long 2009 Puget Sound cruise.

\begin{thebibliography}{}

\bibitem[Lo \textit{et~al}. (2009)]{Lo2009} Lo,K. \textit{et~al}. (2009) flowClust: a Bioconductor package for automated gating of flow cytometry data. \textit{BMC Bioinformatics}, {\bf 10}(1):145.

\bibitem[Ribalet \textit{et~al}. (2010)]{Ribalet2010} Ribalet,F. \textit{et~al}. (2010) Unveiling a phytoplankton hotspot at a narrow boundary between coastal and offshore waters. \textit{Proc. Nat. Acad. Sci.}, {\bf 107}(38):16571--16576.

\bibitem[Ribalet \textit{et~al}. (2011)]{flowPhyto} Ribalet,F., Schruth,D. and Armbrust,E.V. (2011) flowPhyto: enabling automated analysis of microscopic algae from continuous flow cytometric data. \textit{Bioinformatics}, doi: 10.1093/bioinformatics/btr003.

\bibitem[Spidlen \textit{et~al}. (2010)]{Spidlen2010} Spidlen,J. \textit{et~al}. (2010) Data file standard for flow cytometry, version FCS 3.1. \textit{Cytom. Part A}, {\bf 77A}(1):97--100.

\bibitem[Swalwell \textit{et~al}. (2009)]{Swalwell2009} Swalwell,J. \textit{et~al}. (2009) Virtual-core flow cytometry. \textit{Cytom. Part A}, {\bf 75A}(11):960--965.

\end{thebibliography}

\end{document}