% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{Basic Functions for Flow Cytometry Data} %\VignetteDepends{flowCore} %\VignetteKeywords{ncdf,flowCore} %\VignetteEngine{knitr::knitr} %\VignettePackage{ncdfFlow} \documentclass[11pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage{graphicx} %\usepackage{subfigure} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{ncdfFlow: Provides netCDF storage based methods and functions for manipulation of flow cytometry data} \author{Mike Jiang,Greg Finak,N. Gopalakrishnan} \begin{document} <>= library(knitr) opts_chunk$set( concordance=TRUE ) @ \maketitle \begin{abstract} \noindent \textbf{Background} The Bioconductor package \Rpackage{flowCore} is the object model and a collection of standard tools designed for flow cytometry data analysis. The related R packages including data analysis (flowClust, flowMerge, flowMeans,flowTrans ,flowStats), visualization (ggcyto) and quality control (QUALIFIER) use the flowCore infrastructure to deal with flow cytometry data. However the flowFrame or flowSet which represent a single or a set of FCS files are memory-resident data structures and require the entire data elements to remain in memory in order to perform all kinds of the data manipulations. Hundreds or thousands of datasets generated by high throughput instruments can easily hit the memory limit if they are imported as the flowSet or flowFrames in R. It presents a challenge to scientists and bioinformaticians who use the R tools described above to perform statistical data analysis on a regular computer .We propose a new R object model and related functions to address this problem. The new model ncdfFlowSet inherit most of data structures from flowSet. It stores the large volume of event-level data on disk and only keeps the file handler and meta data in memory. Thus the memory consumption is significantly reduced. NetCDF is used as the data formats because it is self-describing, machine-independent and specifically optimized for storing and accessing array-oriented scientific data. With the compression and chunking features introduced by the new release of netCDF4, the new model is able to maintain high performance of data processing. Most of the functions and methods including transformation,compensation,gating and subsetting methods for flowSet are extended to ncdfFlowSet (\Rfunction{spillover},\Rfunction{normalize} and \Rfunction{workflow} methods of \Rpackage{flowCore} are currently not supported yet.). Thus the change of data structure is almost transparent to the users of flowCore-based R packages. \noindent \textbf{keywords} Flow cytometry, high throughput,netCDF, flowSet,ncdfFlowSet \end{abstract} <>= library(ncdfFlow) @ \section{Representing Flow Cytometry Data with ncdfFlowSet} \Rpackage{ncdfFlow} reppresents a flow cytometry data model that is very similar to the \Robject{flowSet} structure.The only difference is that the event-based 2-D data matrices from multiple samples of the same experiment are stored as one single 3D data matrix on disk in ncdf format. Each sample can be accessed efficiently from the 3-D matrix as a data chunk or slice and further manipulated in memory. The basic unit of manipulation in \Rpackage{ncdfFlow} is the \Rclass{ncdfFlowSet}. It inherites all the slots from \Rclass{flowSet}. However, the \Robject{flowFrame} objects stored in the "frames" slot of a \Robject{ncdfFlowSet} object do not host the data matrix.Instead,their the "exprs" lots are kepted empty and the actual data are stored in the 3-D data matrix on disk and only the file name is stored in "file" slot of \Robject{ncdfFlowSet}. Thus \Robject{ncdfFlowSet} reduces the memory requirements and meanwhile ensures the consistent data structure with \Rclass{flowSet}. \section{Creating a \Rclass{ncdfFlowSet}} We provide a function to read FCS files into a \Rclass{ncdfFlowSet} object: <>= path<-system.file("extdata","compdata","data",package="flowCore") files<-list.files(path,full.names=TRUE)[1:3] nc1 <- read.ncdfFlowSet(files=files) nc1 @ As we see,the contructor function is very similar to the \Rclass{flowSet} execpt that it requires a filename for the ncdf file. <>= fs1 <- read.flowSet(files=files) @ Note that an ncdf file that stores the actual data is generated and saved on disk once a \Robject{ncdfFlowSet} is created. Users need to explicitely call the \Rfunction{unlink} method to remove the file before delete the object from memory by \Rfunction{rm}. <>= unlink(nc1) rm(nc1) @ Users can also create an empty ncdfFlowSet first and add data slices later by assigning argument isWriteSlice as FALSE. <>= nc1 <- read.ncdfFlowSet(files=files, isWriteSlice= FALSE) nc1[[1]] @ As we see here,before writing the actual flowFrame by \Rfunction{[[<-}, the flowFrame object returned by \Rfunction{[[} has 0 events. <>= targetSampleName<-sampleNames(fs1)[1] nc1[[targetSampleName]] <- fs1[[1]] nc1[[1]] nc1[[2]] @ Note that it is important to always use sample name to specify the target position in the data matrix where the actual is added. Because the sample name is the identifier used to index the data matrix. Sometime it is helpful to copy the structure from an existing ncdfFlow object and then add the data slice to the empty ncdfFlow cloned by \Rfunction{clone.ncdfFlowSet}. <>= nc2 <- clone.ncdfFlowSet(nc1, isEmpty = TRUE) nc2[[1]] nc2[[sampleNames(fs1)[1]]] <- fs1[[1]] nc2[[1]] @ <>= unlink(nc2) rm(nc2) unlink(nc1) rm(nc1) @ We also provide the coerce function to convert the flowSet to a ncdfFlowSet. <>= data(GvHD) GvHD <- GvHD[pData(GvHD)$Patient %in% 6:7][1:4] nc1<-ncdfFlowSet(GvHD) @ Or coerce a ncdfFlowSet to flowSet <>= fs1<-as.flowSet(nc1,top=2) @ Note that \Rclass{ncdfFlowSet} is designed to store large datasets and it is not recommened to corece the entire ncdfFlowset to flowSet. Usually users want to select a subset from ncdfFlowSet by \Rfunction{[} and convert the subetted data. Sometimes it is helpful to randomly select a cerntain number of flowFrames from the entire datasets represented by by \Rclass{ncdfFlowSet} to have a preview of the data.The arugment "top" can be used here for this purpose. \section{Working with metadata} Like \Rclass{flowSet},\Rclass{ncdfFlowSet} has an associated \Rclass{AnnotatedDataFrame} that provides metadata of experiments. This data frame is accessed and modified via the same methods of \Rpackage{flowCore}. : <>= phenoData(nc1) pData(nc1) varLabels(nc1) varMetadata(nc1) sampleNames(nc1) keyword(nc1,"FILENAME") identifier(nc1) colnames(nc1) colnames(nc1,prefix="s6a01") length(nc1) getIndices(nc1,"s6a01") @ \section{Manipulating a \Rclass{ncdfFlowSet}} You can extract a \Rclass{flowFrame} from a \Rclass{ncdfFlowSet} object in the same way as \Rclass{flowSet} by using the \Rfunction{[[} or \Rfunction{\$} extraction operators. Note that using the \Rfunction{[} extraction operator returns a new \Rclass{ncdfFlowSet} that points to the same ncdf file. SO the original ncdf file serves as a data repository and the \Robject{ncdfFlowSet} works as view of the data in this sense. <>= nm<-sampleNames(nc1)[1] expr1<-paste("nc1$'",nm,"'",sep="") eval(parse(text=expr1)) nc1[[nm]] nm<-sampleNames(nc1)[c(1,3)] nc2<-nc1[nm] summary(nc2) @ \Rclass{flowSet}-specific iterator \Rfunction{fsApply} can also be applied to Robject{ncdfFlowSet}: <>= fsApply(nc1,range) fsApply(nc1, each_col, median) @ However, we recmmend to use another iterator \Rfunction{ncfsApply} designed for the function that returns a flowFrame (such as compensate,transform...). \Rfunction{ncfsApply} works the same as \Rfunction{fsApply} execpt that it returns a ncdfFlowSet object with the acutal data stored in cdf to avoid the huge memory consumption. Note that the return value of the function applied in \Rfunction{ncfsApply} must be a flowFrame object and it should have the same dimensions(channles and events) as the original data. \section{Compensation,Transformation and Gating} \Rfunction{transform} amd \Rfunction{compensate} for \Robject{ncdfFlowSet} work the same as \Robject{flowSet}. <>= cfile <- system.file("extdata","compdata","compmatrix", package="flowCore") comp.mat <- read.table(cfile, header=TRUE, skip=2, check.names = FALSE) comp <- compensation(comp.mat) #compensation summary(nc1)[[1]] nc2<-compensate(nc1, comp) summary(nc2)[[1]] unlink(nc2) rm(nc2) #transformation asinhTrans <- arcsinhTransform(transformationId="ln-transformation", a=1, b=1, c=1) nc2 <- transform(nc1,`FL1-H`=asinhTrans(`FL1-H`)) summary(nc1)[[1]] summary(nc2)[[1]] unlink(nc2) rm(nc2) @ Note that compensation/transformation return the \Robject{ncdfFlowSet} objects that point to the new cdf file containing the compensated/transformed data. \Rfunction{filter} for \Robject{flowSet} also works for \Robject{ncdfFlowSet}: <>= rectGate <- rectangleGate(filterId="nonDebris","FSC-H"=c(200,Inf)) fr <- filter(nc1,rectGate) summary(fr) rg2 <- rectangleGate(filterId="nonDebris","FSC-H"=c(300,Inf)) rg3 <- rectangleGate(filterId="nonDebris","FSC-H"=c(400,Inf)) flist <- list(rectGate, rg2, rg3) names(flist) <- sampleNames(nc1[1:3]) fr3 <- filter(nc1[1:3], flist) summary(fr3[[3]]) @ \section{Subsetting} The \Rfunction{Subset} and \Rfunction{split} methods for \Rclass{ncdfFlowSet}: <>= nc2 <- Subset(nc1,rectGate) summary(nc2[[1]]) @ <>= nc2 <- Subset(nc1, fr) summary(nc2[[1]])###this will cause crashing error see #50 rm(nc2) @ <>= library(flowStats) morphGate <- norm2Filter("FSC-H", "SSC-H", filterId = "MorphologyGate",scale = 2) smaller <- Subset(nc1[c(1,3)], morphGate,c("FSC-H", "SSC-H")) smaller[[1]] nc1[[1]] rm(smaller) @ Note that \Rfunction{Subset} does not create the new ncdf file (the same as extraction operator \Rfunction{[}). So we need to be careful about using \Rfunction{unlink} to delete the subsetted data since it points to the same ncdf file that is also used by the original \Robject{ncdfFlowSet} object. \Rfunction{split} returns a \Robject{ncdfFlowList} object which is a container of multiple \Robject{ncdfFlowSet} objects. <>= ##splitting by a gate qGate <- quadGate(filterId="qg", "FSC-H"=200, "SSC-H"=400) fr<-filter(nc1,qGate) ncList<-split(nc1,fr) ncList nc1[[1]] ncList[[2]][[1]] ncList[[1]][[1]] @ Note that the \Robject{ncdfFlowSet} objects contained in \Robject{ncdfFlowList} by default share the same ncdf file as the original \Robject{ncdfFlowSet}. In order to keep its own data copy, try to set argument \Robject{isNew} to "TRUE" in \Robject{split} method. <>= ncList_new<-split(nc1,fr,isNew=TRUE)###this will cause crashing error see #50 @ \clearpage %\bibliographystyle{plainnat} %\bibliography{cytoref} \end{document}