%\VignetteIndexEntry{Main vignette (complete version): End-to-end analysis of cell-based screens} %\VignetteKeywords{Cell based assays} %\VignettePackage{cellHTS2} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage{color} \definecolor{darkblue}{rgb}{0.0,0.0,0.75} \usepackage[% baseurl={http://www.bioconductor.org},% pdftitle={End-to-end analysis of cell-based screens},% pdfauthor={Wolfgang Huber},% pdfsubject={cellHTS2},% pdfkeywords={Bioconductor},% pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,% pagecolor=darkblue,raiselinks,plainpages,pdftex]{hyperref} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\myincfig}[3]{% \begin{figure}[tp] \begin{center} \includegraphics[width=#2]{#1} \caption{\label{#1}#3} \end{center} \end{figure} } \begin{document} %------------------------------------------------------------ \title{End-to-end analysis of cell-based screens: from raw intensity readings to the annotated hit list} %------------------------------------------------------------ \author{Michael Boutros, L\'igia Br\'as, Florian Hahne and Wolfgang Huber} \maketitle \tableofcontents <>= options(width=70) @ \section{Introduction}\label{sec:Intro} The package \Rpackage{cellHTS2} is a revised and improved version of the \Rpackage{cellHTS} package~\footnote{To convert a S3 class \Rclass{cellHTS} object that was made using the \Rpackage{cellHTS} package into an S4 class \Rclass{cellHTS} object suitable for \Rpackage{cellHTS2}, please see the function \Rfunction{convertOldCellHTS}.}. This report describes the structure of the \Rclass{cellHTS} class, while explaining all the steps necessary to run a complete analysis of a cell-based high-throughput screen (HTS), from raw intensity readings to an annotated hit list. This text has been produced as a reproducible document~\cite{Gentleman2004RepRes}. It contains the actual computer instructions for the methods it describes, and these in turn produce all results, including the figures and tables that are shown here. The computer instructions are given in the language R, thus, in order to reproduce the computations shown here, you will need an installation of R (version 2.3 or greater) together with a recent version of the package \Rpackage{cellHTS2} and of some other add-on packages. Within R, the following commands can be used to install \Rpackage{cellHTS2} along with all dependent packages. To reproduce the computations shown here, you do not need to type them or copy-paste them from the PDF file; rather, you can take the file \textit{cellhts2Complete.Rnw} in the \textit{scripts} directory of the package, open it in a text editor, run it using the R command \Rfunction{Sweave}, and modify it to your needs. First, we load the package. % <>= library("cellHTS2") @ % <>= ## working path: workPath <- getwd() ## check if bib file exists if (!("cellhts.bib" %in% dir()) ) system(sprintf("cp %s/cellhts.bib .", system.file("doc", package="cellHTS2"))) ## for debugging: options(error=recover) ## for software development, when we do not want to install ## the package after each minor change: ## for(f in dir("~/huber/projects/Rpacks/cellHTS2/R", full.names=TRUE, pattern=".R$"))source(f) @ % %------------------------------------------------------------ \section{Reading the intensity data} \label{sec:read} %------------------------------------------------------------ We consider a cell-based screen that was conducted in microtiter plate format, where a library of double-stranded RNAs was used to target the corresponding genes in cultured \textit{Drosophila} $Kc_{167}$ cells~\cite{Boutros2004}. Each of the wells in the plates contains either a gene-specific probe, a control, or it can be empty. The experiments were done in duplicate, and the viability of the cells after treatment was recorded by a plate reader measuring luciferase activity. The promoter has been chosen such that the luciferase activity is indicative of ATP levels. Although this set of example data corresponds to a single-channel screening assay, the \Rpackage{cellHTS2} package can also deal with cases where there are readings from more channels, corresponding to different reporters. Usually, the measurements from each replicate and each channel come in individual result files. The set of available result files and the information about them (which plate, which replicate, which channel) is contained in a spreadsheet, which we call the \emph{plate list file}. This file should contain at least the following columns: \emph{Filename}, \emph{Plate} and \emph{Replicate}. The last two columns should be integer numbers, with values ranging from 1 to the maximum number of plates or replicates, respectively. An optional \emph{Batch} column can be used to provide batch information about the experiment, i.e., changes in reagents, or days for a multi-day experiment. See Section~ref{sec:multPlateConfs} for more details. The first few lines of an example plate list file are shown in Table~\ref{tab:platelist}, while Table~\ref{tab:signalintensity} shows the first few lines from one of the plate result files listed in the plate list file. \input{cellhts2-platelist} \input{cellhts2-signalintensity} The first step of the analysis is to read the plate list file, to read all the intensity files, and to assemble the data into a single R object that is suitable for subsequent analyses. The main component of that object are arrays with the intensity readings of all plates, channels, and replicates. We demonstrate the R instructions for this step. First we define the path where the input files can be found. % <>= experimentName <- "KcViab" dataPath <- system.file(experimentName, package="cellHTS2") @ % In this example, the input files are in the \Robject{\Sexpr{experimentName}} directory of the \Rpackage{cellHTS2} package. To read your own data, modify \Robject{dataPath} to point to the directory where they reside. We show the names of 12 files from our example directory: % <>= dataPath rev(dir(dataPath))[1:12] @ % and read the data into the object \Robject{x} % <>= x <- readPlateList("Platelist.txt", name=experimentName, path=dataPath) @ % <>= x @ % The plate format used in the screen (96-well or 384-well plate design) is automatically determined from the raw intensity files when calling the \Rfunction{readPlateList} function. %% Create the example table: %% it would have been nice to use the "xtable" package but it insists on %% adding row numbers, which we don't like. <>= cellHTS2:::tableOutput(file.path(dataPath, "Platelist.txt"), "plate list") cellHTS2:::tableOutput(file.path(dataPath, names(intensityFiles(x))[1]), "signal intensity", header=FALSE) @ %------------------------------------------------------------ \subsection{Importing intensity data files with other formats} \label{sec:read2} %------------------------------------------------------------ The function \Rfunction{readPlateList} has the argument \Robject{importFun} that can be used to provide a different import function to read plate result files with a format different from that shown in Table~\ref{tab:signalintensity}. For example, to import plate data files from an EnVision plate reader, set \Robject{importFun=getEnVisionRawData} or \Robject{importFun=getEnvisionCrosstalkCorrectedData} when calling \Rfunction{readPlateList}. Please see the help page of function \Rfunction{getEnVisionRawData} for an example. Another import function (``importData.R'') is given together with the example data set for a enhancer-supressor screen in the directory called \emph{TwoWayAssay} of this package.\\ While for the above data sets, the measurements from each replicate and channel come in separate result files, this is not the case when measurement files are provided in the HTanalyst format. In this case, each output file contains meta-experimental data together with intensity readings (in a matrix-like layout) of a set of plates made for the same replicate or screen. Thus, there is no need to have a plate list file, and instead of using \Rfunction{readPlateList} we should call the function \Rfunction{readHTAnalystData}. Please see the help page for this function (\Robject{? readHTAnalystData}), where we illustrate how it can be applied to import HTAnalyst data files. %------------------------------------------------------------ \section{The \Rclass{cellHTS} class and reports} %------------------------------------------------------------ The basic data structure of the package is the class \Rclass{cellHTS}, which is a container for cell-based high-throughput RNA interference assays (data and experimental meta-data) performed in multi-plate format. This class extends the class \Rclass{NChannelSet} of the \Rpackage{Biobase} package~\cite{BioC2004}. The data can be thought of as being organised in a two- or three-dimensional array as follows: \begin{enumerate} \item The first dimension corresponds to reagents (e.g. siRNAs, chemical compounds) that were used in the assays. For example, if the screen used 100 plates of 384 wells (24 columns, 16 rows), then the first dimension has size 38,400, and the \Rclass{cellHTS} object keeps track of plate ID, row, and column associated with each element. For historic reasons, and because we are using infrastructure that was developed for microarray experiments, the following terms are used synonymously for the elements of the first dimension: \emph{reagents}, \emph{features}, \emph{probes}, \emph{genes}. \item The second dimension corresponds to assays, including replicates and different experimental conditions (cell type, treatment, genetic background). A potentially confusing terminology is that the data structure that annotates the second dimension is called \Robject{phenoData} (see below). This is because we are using infrastructure, namely the \Rclass{NChannelSet} class from the \Rpackage{Biobase} package, that uses this unfortunate term for this purpose. Sometimes, the elements of this second dimension are also called \emph{samples}. \item The (optional) third dimension corresponds to different channels (e.g. different luminescence reporters) \end{enumerate} \noindent The main structures contained in a \Rclass{cellHTS} object are: \begin{description} \item[assayData] an object of class \Rclass{AssayData}, usually an environment containing a set of matrices of identical size. Each matrix represents a single channel. In each matrix, the rows correspond to features or reporters (e.\,g.\ siRNAs, dsRNAs) and the colums to samples (different conditions and/or replicates). \item[phenoData] a dataframe (more precisely, an object of class \Rclass{AnnotatedDataFrame}) containing information about the screens, such as the replicate number and the type of biological assay. It must have the following columns in its data component: \begin{description} \item[replicate] a vector of integers giving the replicate number \item[assay] a character vector giving the name of the biological assay or condition \end{description} Both of these columns have the same length as the number of samples in the \Robject{assayData} slot. The choice of name \emph{phenoData} for this structure is unfortunate, it has nothing to do with phenotypes. \item[featureData] a dataframe (more precisely, an object of class \Rclass{AnnotatedDataFrame}) that contains information about the reagents used in the experiment. There are three mandatory columns, in addition there can be an arbitrary number of additional columns, for example a target gene identifier. The mandatory columns are: \begin{description} \item[plate] integers specifying the plate number (e.\,g.\ $1, 2, \ldots$) \item[well] alphanumeric character strings giving the well ID within the plate (e.g. A01, B01, ..., P24). \item[controlStatus] a factor specifying the annotation for each well with possible levels: \emph{empty}, \emph{other}, \emph{neg}, \emph{sample}, \emph{pos}. Other levels besides \emph{pos} and \emph{neg} may be employed for the positive and negative controls. \end{description} \item[plateData] a list of dataframes, where each list item contains plate-specific information. The structure of the individual dataframes is fixed: rows are supposed to be plates and columns are supposed to be samples (i.e., the number of columns has to be the same as the number of columns in the \Robject{assayData} matrices. By default, the only available list item is \Robject{Batch} containing the information about experimental batches in the experiment. Unless explicitely specified in the plate list file, these will all be set to 1. \item[experimentData] an object of class \Rclass{MIAME} containing descriptions of the experiment. \end{description} \noindent The \Rclass{cellHTS} class also includes additional slots that are used to store the input files used to assemble the \Rclass{cellHTS} object. These are: \begin{description} \item[plateList] a dataframe containing names and metadata about input measurement data files, plus a column named \emph{status}, of type \Rclass{character}, whose elements are the string "OK" if the data import appeared to have gone well, and the respective error or warning message otherwise. \item[intensityFiles] a list whose components are copies of the imported input data files. Its length corresponds to the number of rows of \Robject{plateList}. \item[plateConf] a data frame containing what was read from the configuration file for the experiment (except the first two header rows). \item[screenLog] a data frame containing what was read from the screen log file for the experiment (in case there was one). \item[screenDesc] a character containing what was read from the description file of the experiment. \end{description} Other slots are \Robject{rowcol.effects}, \Robject{overall.effects} and \Robject{annotation}. For a detailed description of this class, please type \Robject{class ? cellHTS}. In Section~\ref{sec:read}, we created the object \Robject{x}, which is an instance of the \Rclass{cellHTS} class. The measurements intensities are stored in the slot \Robject{assayData} of \Robject{x}. The slot called \Robject{state} helps to keep track of the preprocessing state of our \Rclass{cellHTS} object. This slot can be accessed through the \Rfunction{state}, as shown below: % <>= state(x) @ % It contains a logical vector of length \Sexpr{length(state(x))} representing the processing status of the object. It should have the names "configured", "normalized", "scored" and "annotated". We can thus see that \Robject{x} has not been configured, annotated, normalized or scored yet. These 4 main stages are explained along this vignette and can be briefly described as follows: \begin{description} \item[Configuration] involves annotating the experiment with information about the screen (e.\,g.\ title, when and how it was performed, which organism, which library, type of assay, etc.), annotating the measured values with information on controls and flagging invalid measurements. This step is covered in Section~\ref{sec:screenconf} and is prerequisite for preprocessing the experimental screening data (i.\,e.\ for normalization and scoring). \item[Annotation] involves annotating the features (reagents) with information about, for example, their target genes. This step, detailed in Section~\ref{sec:annotation}, is not essential for preprocessing, but this information can be used for the HTML quality reports generated by \Rpackage{cellHTS2} and in further analyses (see the complete report, as explained in Section~\ref{sec:Intro}). \item[Normalization] involves removing systematic variations, making measurements comparable across plates and within plates, enhancing the biological signal and eventually transforming the data to a scale suitable for subsequent analyses. \item[Replicates scoring and summarization] involves standardizing the values for each replicate and then summarizing replicate values in order to obtain a single-value per probe (for example, a robust $z$-score). \end{description} The steps of data normalization, replicate scoring and summarization constitute the preprocessing work-flow of the screening data. They alter the contents of \Robject{assayData} slot and even the number of channels of the \Rclass{cellHTS} object. The complete analysis project is contained in a set of \Rclass{cellHTS} objects that reflect different preprocessing stages and that can be shared with others and stored for subsequent computational analyses. The \Rpackage{cellHTS2} package offers export functions for generating human-readable reports, which consist of linked HTML pages with tables and plots. The final scored hit list is written as a tab-delimited format suitable for reading by spreadsheet programs. Returning to our example data set, we create a report using the function \Rfunction{writeReport}. Since until now we only have unnormalized experimental data, the \Rclass{cellHTS} object should be given to the function as the \Robject{raw} argument: % <>= out <- writeReport(raw=x) @ <>= out <- writeReport(raw=x, force=TRUE, outdir=tempdir()) @ % This will write the report into the directory given by the argument \Robject{outdir}. The option \Robject{force=TRUE} tells the function to delete and overwrite a possibly already existing directory of the same name. This option needs to be used with caution and is only accepted if \Robject{outdir} is explicitely specified. If the argument \Robject{outdir} is not specified, the default is a subdirectory of the current working directory with name given by \Robject{name(x)} In order to keep a reference to the R commands used to create the HTML output, one can provide the path to an ASCII script file using the \Robject{mainScriptFile} argument. This comes back to the notion of reproducible research mentioned before, and we strongly suggest to make use of this feature. In fact, the function will issue a warning if no script file is supplied. This said, we will not make use of the \Robject{mainScriptFile} arguments, since we don't really have scripting files at hand; all commands are contained in this document. It can take a while to run this function, since it creates a large number of graphics files. After this function has finished, the index page of the report will be in the file indicated by the variable \Robject{out}, % <>= out @ % and you can view it by directing a web browser to that file. % <>= if (interactive()) browseURL(out) @ %------------------------------------------------------------ \section{Screen configuration: annotating the plate results} \label{sec:screenconf} %------------------------------------------------------------ \input{cellhts2-plateconfiguration} \input{cellhts2-screenlog} The next step of the analysis involves reading and processing three input files specific of the screening experiment: \begin{itemize} \item \emph{Screen description file} contains a general description of the screen, its goal, the conditions under which it was performed, references, and any other information that is pertinent to the biological interpretation of the experiments. In \Rpackage{cellHTS2} package we provide a function that creates a template description file whose entries can be edited and completed by the user. Type \Robject{? templateDescriptionFile}. This file contains the entries compliant with the \Rclass{MIAME} class and also additional fields specific for the \Rclass{cellHTS} class. \item \emph{Plate configuration file} is used to annotate the measured data with information on controls. The content of this file for the example data set analysed here is shown in Table~\ref{tab:plateconfiguration} and the expected format for this file is explained in Section~\ref{sec:plateconf}. \item \emph{Screen log file} (optional) is used to flag individual measurements as invalid. The first 5 lines of this file are shown in Table~\ref{tab:screenlog}, and the layout for the \emph{screen log file} is detailed in Section~\ref{sec:screenlog}.\\ \end{itemize} To apply the information contained in these three files in our \Rclass{cellHTS} object, we call: <>= x <- configure(x, descripFile="Description.txt", confFile="Plateconf.txt", logFile="Screenlog.txt", path=dataPath) @ % Note that the function \Rfunction{configure}\footnote{More precisely, \Rfunction{configure} is a method for the S4 class \Rclass{cellHTS}.} takes \Robject{x}, the result from Section~\ref{sec:read}, as an argument, and we then overwrite \Robject{x} with the result of this function. If no screen log file is available for the experiment, the argument \Robject{logFile} of the function \Rfunction{configure} should be omitted. %% %% Create the example table for plateConf and screenLog <>= cellHTS2:::tableOutputWithHeaderRows(file.path(dataPath, "Plateconf.txt"), "plate configuration", selRows=NULL) cellHTS2:::tableOutput(file.path(dataPath, "Screenlog.txt"), "screen log", selRows=1:3) @ %%-------------------------------------------------- \subsection{Format of the plate configuration file} \label{sec:plateconf} %%-------------------------------------------------- The software expects this to be a rectangular table in a tabulator delimited text file, with mandatory columns \emph{Plate}, \emph{Well}, \emph{Content}, plus two additional header lines that give the total number of wells and plates (see Table~\ref{tab:plateconfiguration} for an example). The content of this file (except the two header lines) are stored in slot \Robject{plateConf} of \Robject{x}.\newline As the name suggests, the \emph{Content} column provides the content of each well in the plate (here referred to as the \textit{well annotation}). Mainly, this annotation falls into four categories: empty wells, wells targeting genes of interest, control wells, and wells containing other things that do not fit in the previous categories. The first two types of wells should be indicated in the \textit{Content} column of the plate configuration file by \textit{empty} and \textit{sample}, respectively, while the last type of wells should be indicated by \textit{other}. The designation for the control wells in the \textit{Content} column is more flexible. By default, the software expects them to be indicated by \textit{pos} (for positive controls), or \textit{neg} (for negative controls). However, other names are allowed, given that they are specified by the user whenever necessary (for example, when calling the \Rfunction{writeReport} function). This versatility for the control wells' annotation is justified by the fact that, sometimes, multiple positive and/or negative controls can be employed in a given screen, making it useful to give different names to the distinct controls in the \textit{Content} column. Moreover, this versatility might be required in multi-channel screens for which we frequently have reporter-specific controls. The \emph{Well} column contains the name of each well of the plate in alphanumeric format (in this case, \Robject{A01} to \Robject{P24}), while column \emph{Plate} gives the plate number (1, 2, ...). These two columns are also allowed to contain regular expressions. In the plate configuration file, each well and plate should be covered by a rule, and in case of multiple definitions only the last one is considered. For example, in the file shown in Table~\ref{tab:plateconfiguration}, the rule specified by the first line after the column header indicates that all of the wells in each of the 57 assay plate contain ``sample''. However, a following rule indicate that the content of wells \Robject{A01}, \Robject{A02} and \Robject{B01} and \Robject{B02} differ from ``sample'', containing other material (in this case, ``other'' and controls). Note that the well annotations mentioned above are used by the software in the normalization, quality control, and gene selection calculations. Data from wells that are annotated as \textit{empty} are ignored, i.\,e.\ they are set to \Robject{NA}. Here we look at the frequency of each well annotation in the example data: % <<>>= table(wellAnno(x)) @ % We can also use the function \Rfunction{configurationAsScreenPlot} to see the plate configuration of all of the plates as a screen plot: % <>= png("cellhts2Complete-configurationplot.png", width=324, height=324) configurationAsScreenPlot(x) dev.off() @ <>= configurationAsScreenPlot(x) @ The result is shown in Figure~\ref{fig:configurationplot}. This can be useful to verify the correctness of the plate configuration when a complex set of rules with regular expressions are used. \begin{figure}[tp] \begin{center} \includegraphics[width=324pt,height=324pt]{cellhts2Complete-configurationplot.png} \caption{\label{fig:configurationplot}Screen plot of the plate configuration.} \end{center} \end{figure} A special case of well annotation is when different types of positive controls are used for the screening, that is \emph{enhancer} and \emph{suppressor} compounds. The vignette \textit{Analysis of screens with enhancer and suppressor controls} accompanying this package explains how such controls can be handled during the screen analysis using \Rpackage{cellHTS2} package. %-------------------------------------------------- \subsubsection{Multiple plate configurations} \label{sec:multPlateConfs} %%-------------------------------------------------- Although it is good practice to use the same plate configuration for the whole experiment, sometimes this does not work out, and there are different parts of the experiment with different plate configurations. The use of regular expressions in columns \emph{Plate} and \emph{Well} of the plate configuration file allow therefore to specify different configurations within and between assay plates. The two header rows of this file ascertain that all of the plates and wells are covered in the plate configuration file. %% Note about 'batches' Note that in contrast to the \Rpackage{cellHTS} package, in \Rpackage{cellHTS2} package the concept of \emph{batch} is separated from the concept of having multiple plate configurations. So, for example, different replicate of the same plate can be set as to belong to different batches (even though they are required to have the same plate configuration!) - since readouts were performed on different days, or due to the use of different lots of reagents, etc. Batch information (expressed in terms of integer values giving the batch number: 1, 2, ...) can go into a particular slot called \Robject{plateData}. This is expected to be a dataframe) of integer values giving the batch number (1, 2, ...) for each plate and sample. Its first dimension corresponds to the number of individual plates, and its second dimension correspond to the number of columns of each matrix stored in \Robject{assayData} slot (the samples). Batch information can be filled in by the user in case s/he wants to take into account this information in the analysis (for example, see \Rfunction{normalizePlates} function, which allows to adjust the data variance on a per-batch basis). It can either be provided as an optional column in the plate list file, or using the \Rfunction{batch} accessor method. %%-------------------------------------------------- \subsection{Format of the screen log file} \label{sec:screenlog} %%-------------------------------------------------- The screen log file is a tabulator delimited file with mandatory columns \emph{Plate}, \emph{Well}, \emph{Flag}. If there are multiple samples (replicates or conditions), a column called \emph{Sample} should also be present; a column named \emph{Channel} must also be provided when there are multiple channels. In addition, it can contain arbitrary optional columns. Each row corresponds to one flagged measurement, identified by the plate number (and possible sample number and channel number) and the well identifier (alphanumeric identifier). The type of flag is specified in the column \emph{Flag}. Most commonly, this will have the value ``NA'', indicating that the measurement should be discarded and regarded as missing.\\ For those users that have been using \Rpackage{cellHTS} package and need to migrate their projects to \Rpackage{cellHTS2} package, we explain how this can be smoothly performed in Appendix~\ref{sec:migratingProjects}. %------------------------------------------------------------ \section{Normalization, scoring and summarization of replicates} \label{sec:norm} %------------------------------------------------------------ The data normalization, replicates scoring and summarization functions available in \Rpackage{cellHTS2} package work on the data stored in the \Robject{assayData} slot of the \Rclass{cellHTS} object. They create a copy of their input \Rclass{cellHTS} object, with the data replaced by the transformed values. For a list of the available functions, type \Robject{? cellHTS2}. The preprocessing work-flow of a typical RNAi screen using \Rpackage{cellHTS2} package is the following: \begin{description} \item[(a)] Per-plate normalization to remove plate and/or edge effects. This can be done using the function \Rfunction{normalizePlates}. \item[(b)] Scoring of measurements (for example, compute, for each replicate, $z$-scores). This can be done using the function \Rfunction{scoreReplicates}. \item[(c)] Summarization of replicates (for example, take the median value). This can be done using the \Rfunction{summarizeReplicates}. \end{description} Note that this work-flow is suitable for single-channel data. For dual-channel data, further steps are required, as explained in the vignette \textit{Analysis of multi-channel cell-based screens}, accompanying this package. The function \Rfunction{normalizePlates} can be called to perform per-plate data transformation, normalization and variance adjustment. The normalization is performed in a plate-by-plate fashion and has three components: \begin{description} \item[Data transformation] logarithmic transformation (optional, this can be advisable if the data are on multiplicative scale) \item[Per-plate normalization] location adjustment for possible plate effects and/or possible spatial effects, using a choice of methods that you need to adapt to your data \item[Variance adjustment] plate-specific scale (variance) adjustment (optional) \end{description} For more details about this function and available normalization methods, please type \Robject{? normalizePlates}. To provide the means to perform the above steps, \Rfunction{normalizePlates} has the arguments \Robject{scale}, \Robject{log}, \Robject{method} and \Robject{varianceAdjust}. The argument \Robject{scale} allows to define the scale of the data (``additive'' or ``multiplicative''), which will then control the subsequent data transformation and normalization steps. Namely, when data are in multiplicative scale, the function allows to perform $\log_2$ data transformation. For that we need to set the function's argument \Robject{log} to \Robject{TRUE}. Log transformation will then change the scale of the data to be ``additive''. Per-plate median normalization is one of the methods available in \Rfunction{normalizePlates} and can be chosen by setting the argument \Robject{method="median"}. In this case, plate effects are corrected by dividing (if the current scale of the data is multiplicative) each measurement by the median value across wells annotated as sample, for each plate and replicate. If data are in additive scale, the per-plate median values are subtracted from each plate measurement instead. All of the available normalization methods are described in Appendix~\ref{sec:normalizationMethods}. The variance of normalized intensities can be adjusted according to argument \Robject{varianceAdjust}, as follows: \begin{itemize} \item \Robject{varianceAdjust="byPlate"}: per plate normalized intensities are divided by the per-plate median absolute deviations (MAD) in "sample" wells. This is done separately for each replicate and channel; \item \Robject{varianceAdjust="byBatch"}: using the content of slot \Robject{batch}, plates are split according to assay batches and the individual normalized intensities in each group of plates (batch) are divided by the per-"batch of plates" MAD values (calculated based on "sample" wells). This is done separately for each replicate and channel; \item \Robject{varianceAdjust="byExperiment"}: each normalized measurement is divided by the overall MAD of normalized values in wells containing "sample". This is done separately for each replicate and channel. \end{itemize} If \Robject{varianceAdjust="none"}, no variance adjustment is performed (default). As explained above, the parameter \Robject{method} of \Rfunction{normalizePlates} allows to choose between different types of per-plate normalization methods. Returning to our example data set, we choose to apply \emph{plate median scaling}: \begin{eqnarray} x'_{ki} &=& \frac{x_{ki}}{M_i}\quad\quad\forall k,i \label{eq:normalizePlateMedian}\\ M_{i}&=&\mathop{\operatorname{median}}_{m\in\,\mbox{\scriptsize samples}} x_{mi} \label{eq:plateMedian} \end{eqnarray} where $x_{ki}$ is the raw intensity for the $k$-th well in the $i$-th replicate file, and $x'_{ki}$ is the corresponding normalized intensity. The median is calculated across the wells annotated as \textit{sample} in the $i$-th result file. This is achieved by calling: % <>= xn <- normalizePlates(x, scale="multiplicative", log=FALSE, method="median", varianceAdjust="none") @ after which we obtain a \Rclass{cellHTS} object with the normalized intensities stored in the slot \Robject{assayData}. In the previous call to \Rfunction{normalizePlates} function, we have chosen not to adjust the data variance (default behaviour of \Rfunction{normalizePlates}). For example, we can use function \Rfunction{compare2cellHTS} provided with the package to check whether these two \Rclass{cellHTS} objects, \Robject{x} and \Robject{xn} belong to the same experiment: <>= compare2cellHTS(x, xn) @ After normalizing the data, we standardize the values for each replicate experiment using Equation~\eqref{eq:defz}. In this equation, $\hat{\mu}$ and $\hat{\sigma}$ are estimators of location and scale of the distribution of $x'_{ki}$ taken across all plates and wells of a given replicate experiment. This function uses robust estimators, namely, median and median absolute deviation (MAD). Moreover, it only considers the wells containing ``sample'' for estimating $\hat{\mu}$ and $\hat{\sigma}$. %As the values $x'_{ki}$ were obtained using plate median %normalization~\eqref{eq:normalizePlateMedian}, it holds that %$\hat{\mu}=1$. The symbol $\pm$ indicates that we allow for either plus or minus sign in Equation~\eqref{eq:defz}; the minus sign can be useful in the application to an inhibitor assay, where an effect results in a decrease of the signal and we may want to see this represented by a large score. This is done by calling the \Rfunction{scoreReplicates} function, where arguments \Robject{sign} and \Robject{method} define the sign and the scoring method to apply (robust $z$-scores, in this case), respectively: % <>= xsc <- scoreReplicates(xn, sign="-", method="zscore") @ % After data standardization, we summarize the replicates, calculating a single score for each gene. This is performed using the \Rfunction{summarizeReplicates} function, where we use its argument \Robject{summary} to select the summary to apply. One option is \emph{rms}, which corresponds to take the root mean square of the replicates values, and is shown in Equation~\eqref{eq:scoresSummary}. The chosen summary is taken over all the $n_{\mbox{\scriptsize rep}_k}$ replicates of probe $k$. \begin{eqnarray} z_{ki} &=& \pm \frac{x'_{ki}-\hat{\mu}}{\hat{\sigma}} \label{eq:defz} \\ z_{k} &=& \sqrt{\frac{1}{n_{\mbox{\scriptsize rep}_k}} \sum_{r=1}^{n_{\mbox{\scriptsize rep}_k}} z_{kr}^2} \label{eq:scoresSummary}. \end{eqnarray} Depending on the intended stringency of the analysis, other plausible choices of summary function between replicates are the minimum, the maximum, the mean or the median. In the first case, the analysis would be particularly conservative: all replicate values have to be high in order for $z_{k}$ to be high. For the cases where both sides of the distribution of $z$-score values are of interest, alternative summary options for the replicates are to select the value closest to zero (conservative approach) by setting \Robject{summary="closestToZero"} or the value furthest from zero ( \Robject{summary="furthestFromZero"} ). In order to compare our results with those obtained in the paper of Boutros \textit{et al.}~\cite{Boutros2004}, we choose to consider the mean as a summary: % <>= xsc <- summarizeReplicates(xsc, summary="mean") @ % after which we obtain a \Rclass{cellHTS} object with the resulting single $z$-score value per probe stored in \Robject{assayData} slot. Boxplots of the $z$-scores for the different types of probes are shown in Figure~\ref{cellhts2Complete-boxplotzscore}. % <>= scores <- Data(xsc) ylim <- quantile(scores, c(0.001, 0.999), na.rm=TRUE) boxplot(scores ~ wellAnno(x), col="lightblue", outline=FALSE, ylim=ylim) @ % \myincfig{cellhts2Complete-boxplotzscore}{0.5\textwidth}{Boxplots of $z$-scores for the different types of probes.} % In the \Rpackage{cellHTS2} package, we provide a further transformation of the $z$-score values to obtain the so-called \emph{calls}. This involves applying a sigmoidal transformation to the $z$-score values, with parameters $z_0$ and $\lambda$ ($>0$), given by: \begin{equation} y_k = \frac{1}{1+e^{-\lambda \,\left(z_k-z_0\right)}} \label{eq:sigtransf} \end{equation} This transformation maps the $z$-score values to the interval $\left[0,\,1\right]$ and is intended to expand the scale of $z$-scores with intermediate values and shrink the ones showing extreme values, therefore making the difference between intermediate phenotypes larger. The parameter $z_0$ defines the centre of the sigmoidal transformation, while $\lambda$ controls the smoothness of the transformation. This transformation can be done by calling the function \Rfunction{scores2calls}, as shown in Figure~\ref{cellhts2Complete-callvalues.png}. % <>= y <- scores2calls(xsc, z0=1.5, lambda=2) png("cellhts2Complete-callvalues.png") plot(Data(xsc), Data(y), col="blue", pch=".", xlab="z-scores", ylab="calls", main=expression(1/(1+e^{-lambda *(z-z[0])}))) dev.off() @ <>= y <- scores2calls(xsc, z0=1.5, lambda=2) plot(Data(xsc), Data(y), col="blue", pch=".", xlab="z-scores", ylab="calls", main=expression(1/(1+e^{-lambda *(z-z[0])}))) @ \myincfig{cellhts2Complete-callvalues.png}{180px}{A sigmoidal transformation that can be used for obtaining call values.} % However, for the purpose of the present analysis, we will consider the $z$-score values instead of the call values. %------------------------------------------------------------ \section{Probe annotation} \label{sec:annotation} %------------------------------------------------------------ \input{cellhts2-geneID} Up to now, the assayed genes have been identified solely by the identifiers of the plate and the well that contains the probe for them. The \emph{annotation file} contains additional annotation, such as the probe sequence, references to the probe sequence in public databases, the gene name, gene ontology annotation, and so forth. Mandatory columns of the annotation file are \textit{Plate}, \textit{Well}, and \textit{GeneID}, and it has one row for each well. The content of the \textit{GeneID} column will be species- or project-specific. The first 5 lines of the example file are shown in Table~\ref{tab:geneID}, where we have associated each probe with CG-identifiers for the genes of \textit{Drosophila melanogaster}. % <>= xsc <- annotate(xsc, geneIDFile="GeneIDs_Dm_HFA_1.1.txt", path=dataPath) @ %% Create the example table: <>= cellHTS2:::tableOutput(file.path(dataPath, "GeneIDs_Dm_HFA_1.1.txt"), "gene ID", selRows = 3:6) @ % An optional column named \textit{GeneSymbol} can be included in the \emph{annotation file}, and its content will be displayed by the tooltips added to the plate plots and screen-wide plot in the HTML quality report (see Section~\ref{sec:report}). %--------------------------------------------------------------- \subsection{Adding additional annotation from public databases} \label{sec:additAnno} %--------------------------------------------------------------- For the analysis of the RNAi screening results, we usually want to consider gene annotation information such as Gene Ontology, chromosomal location, gene function summaries, homology. The package \Rpackage{biomaRt} can be used to obtain such annotation from public databases~\cite{biomaRt2005}. However, there are also numerous alternative methods to annotate a list of gene identifiers with public annotation -- pick your favourite one. This section demonstrates how to do it with the package \Rpackage{biomaRt}. It is optional, you can move on to Section \ref{sec:report} if you do not have the \Rpackage{biomaRt} package or do not want to use it. If you do skip this section, then for the purpose of this vignette, please load a cached version of the gene annotation: <>= data("bdgpbiomart") fData(xsc) <- bdgpbiomart fvarMetadata(xsc)[names(bdgpbiomart), "labelDescription"] <- sapply(names(bdgpbiomart), function(i) sub("_", " ", i) ) @ %--------------------------- \subsubsection{Installation} \label{sec:install} %--------------------------- The installation of the \Rpackage{biomaRt} package can be a little bit tricky, since it relies on the two packages \Rpackage{RCurl} and \Rpackage{XML}, which in turn rely on the presence of the system libraries \textit{libcurl} and \textit{libxml2} on your computer. If you are installing the precompiled R packages (for example, this is what most people do on Windows), then you need to make sure that the system libraries on your computer are version-compatible with those on the computer where the R packages were compiled, and that they are found. If you are installing the R packages from source, then you need to make sure that the library header files are available and that the headers as well as the actual library is found by the compiler and linker. Please refer to the \textit{Writing R Extensions} manual and to the FAQ lists on www.r-project.org. %------------------------------------------------------------------ \subsubsection{Using biomaRt to annotate the target genes online} %------------------------------------------------------------------ In the remainder of this section, we will demonstrate how to obtain the dataframe \Robject{bdgpbiomart} by querying the online webservice \textit{BioMart} and through it the Ensembl genome annotation database~\cite{Ensembl2006}. % <>= rnwPath <- system.file("doc/Rnw", package="cellHTS2") setwd(rnwPath) system(sprintf("cp biomart.tex %s", workPath)) setwd(workPath) @ % <>= cat("Weaving the biomart vignette part...") setwd(rnwPath) Sweave("biomart.Rnw") setwd(workPath) stop() @ % \input{Rnw/biomart.tex} %\input{biomart.tex} % %------------------------------------------------------------ \section{Report} \label{sec:report} %------------------------------------------------------------ We have now completed the analysis tasks: the dataset has been read, configured, normalized, scored, and annotated: % <>= xsc @ % We can now save the scored data set to a file. % <>= save(xsc, file=paste(experimentName, ".rda", sep="")) @ % The data set can be loaded again for subsequent analysis, or passed on to others. To produce a comprehensive report, we can call the function \Rfunction{writeReport} again, this time specifying the three \Rclass{cellHTS} objects as separate function arguments: ``raw'', ``normalized'' and ``scored''. We also alter some of the default output settings using the \Rfunction{setSettings} functions (more details are given in section \ref{settings}) <>= setSettings(list(plateList=list(reproducibility=list(include=TRUE, map=TRUE), intensities=list(include=TRUE, map=TRUE)), screenSummary=list(scores=list(range=c(-4, 8), map=TRUE)))) out <- writeReport(raw=x, normalized=xn, scored=xsc, force=TRUE) @ % and use a web browser to view the resulting report. % <>= if (interactive()) browseURL(out) @ % The report contains a quality report for each plate, and also for the whole screening assays. The per-plate HTML reports display the scatterplot between duplicated plate measurements, the histogram of the normalized signal intensities for each replicate, and plate plots representing, in a false color scale, the normalized values of each replicate, and the standard deviation between replicate measurements at each plate position. It also reportes different measures of agreement between replicate measurements, such as the repeatability standard deviation between replicate plates and the Spearman correlation coefficient between duplicates. The per-plate reports also show the dynamic range, calculated as the ratio between the geometric means of the positive and negative controls. These measures can also be obtained independently from \Rfunction{writeReport} function, by using the functions \Rfunction{getMeasureRepAgreement} and \Rfunction{getDynamicRange} provided in \Rpackage{cellHTS2} package. If different positive controls were specified at the configuration step and when calling \Rfunction{writeReport}, the dynamic range is calculated separately for the distinct positive controls, since different positive controls might have different potencies. The experiment-wide HTML report presents, for each replicate, the boxplots with raw and normalized intensities for the different plates, and two plots for the controls: one showing the signal from positive and negative controls at each plate, and another plot displaying the distribution of the signal from positive and negative controls, obtained from kernel density estimates. The latter plot further gives the $Z'$-factor determined for each experiment (replicate) using the negative controls and each different type of positive controls~\cite{Oldenburg1999}, as a measure to quantify the distance between their distributions. This measure can also be obtained by calling the function \Rfunction{getZfactor}. The experiment-wide report also shows a screen-wide plot with the $z$-scores in every well position of each plate. If the argument \Robject{map} of \Rfunction{writeReport} function is set to \Robject{TRUE}, this plot and the plate plots of the per-plate reports contain tooltips (information pop-up boxes) dispaying the annotation information at each position within the plates. If the scored \Rclass{cellHTS} object provided for \Rfunction{writeReport} is not annotated with gene identifiers, the annotation information shown by the tooltips is simply the well identifiers. For an annotated \Rclass{cellHTS} object, if an optional column called \textit{GeneSymbol} was included in the \emph{annotation file} (see Section~\ref{sec:annotation}), and therefore is present in \Robject{featureData} slot of the annotated object, then its content is used for the tooltips. Otherwise, the content of column ``GeneID'' of the \Robject{featureData} slot (which can be accessed via \Robject{geneAnno}) is considered. The screen-wide image plot can also be produced separately using the function \Rfunction{imageScreen} given in the \Rpackage{cellHTS2} package. This might be useful if we want to select the best display for our data, namely, the aspect ratio for the plot and/or the range of $z$-score values to be mapped into the color scale. These can be passed to the function's arguments \Robject{ar} and \Robject{zrange}, respectively. For example, <>= imageScreen(xsc, ar=1, zrange=c(-3,4)) @ It should be noted that the per-plate and per-experiment quality reports are constructed based on the normalized \Rclass{cellHTS} object, in case it is provided to \Rfunction{writeReport} function. Otherwise, it uses the data of the raw \Rclass{cellHTS} object. The quality report produced by \Rfunction{writeReport} function has also a link to a file called \emph{topTable.txt} that contains the list of scored probes ordered by decreasing $z$-score values, when the final scored \Rclass{cellHTS} object is provided. This file has one row for each well and plate, and for the present example data set, it has the following columns: \begin{itemize} \item \verb=plate= plate identifier for each feature; \item \verb=position= gives the position of the well in the plate (runs from 1 to the total number of wells in the plate); \item \verb=well= gives the alphanumeric well identifier for each feature; \item \verb=score= corresponds to the summarized score calculated for each probe (data stored in the scored and summarized object \Robject{xsc}); \item \verb=wellAnno= corresponds to the well annotation (as given by the plate configuration file); \item \verb=finalWellAnno= gives the final well annotation for the scored values. It combines the information given in the plate configuration file with the values in \Robject{assayData} slot of the scored \Rclass{cellHTS} object, in order to have into account the wells that have been flagged either by the screen log file, or manually by the user during the analysis. These flagged wells appear with the annotation \emph{flagged}. \item \verb=raw_r1_ch1= and \verb=raw_r2_ch1= contain the raw intensities for replicate 1 and replicate 2, respectively (data stored in the unnormalized \Rclass{cellHTS} object \Robject{x}). 'ch' refers to channel; \item \verb=median_ch1= corresponds to the median of raw measurements across replicates; \item \verb=diff_ch1= gives the difference between replicated raw measurements (only given if the number of replicates is equal to two); \item \verb=average_ch1= corresponds to the average between replicate raw intensities (only given if the number of replicates is higher than two); \item \verb=raw/PlateMedian_r1_ch1= and \verb=raw/PlateMedian_r2_ch1= give the ratio between each raw measurement and the median intensity in each plate for replicate 1 and replicate 2, respectively. The plate median is determined for the raw intensities, using exclusively the wells annotated as ``sample''. \item \verb=normalized_r1_ch1= and \verb=normalized_r2_ch1= give the normalized intensities for replicate 1 and replicate 2, respectively. This corresponds to the data stored in the normalized \Rclass{cellHTS} object \Robject{xn}. \end{itemize} Additionally, if any of the \Rclass{cellHTS} objects provided in the argument \Robject{cellHTSlist} to \Rfunction{writeReport} has been annotated (as in the present case), it also contains the data given in the content of \Robject{featureData} slot of the annotated object. The above file with the list of scored probes can also be obtained without the need to run \Rfunction{writeReport} by using the function \Rfunction{getTopTable} provided in the package. %------------------------------------------------------------ \subsection{Controlling settings}\label{settings} %------------------------------------------------------------ The \Rfunction{writeReport} function is highly customizable in terms of the resulting HTML output. For most of the graphics that get generated the color scheme, the size, the font size and many other features can be controlled individually. This control is available either through session-wide settings using the \Rfunction{setSettings} function, or for each call of the the \Rfunction{writeReport} function through the optional \Robject{settings} argument. Most of the plots can even be completely supressed by switching the respective \Robject{include} setting to \Robject{FALSE}. Please see \Rfunction{?settings} for more details. %------------------------------------------------------------ \subsection{Exporting data to a tab-delimited file} %------------------------------------------------------------ The \Rpackage{cellHTS2} package contains a function called \Rfunction{writeTab} to save the data stored in \Robject{assayData} slot of a \Rclass{cellHTS} object to a tab-delimited file. The rows of the file are sorted by plate and well, and there is one row for each plate and well. When the \Rclass{cellHTS} object is annotated, the probe information (i.e. the probe identifiers stored in column ``GeneID'' of the \Robject{featureData} slot) is also added. % <>= writeTab(xsc, file="Scores.txt") @ % Since you might be interestered in saving other values to a tab delimited file, below we demonstrate how you can create a matrix with the ratio between each raw measurement and the plate median, together with the gene and well annotation, and export it to a tab-delimited file using the function \Rfunction{write.tabdel}~\footnote{This function is a wrapper of the function \Rfunction{write.table}, whereby you just need to specify the name of the data object and the file} also provided in the \Rpackage{cellHTS2} package. % <>= # determine the ratio between each well and the plate median y <- array(as.numeric(NA), dim=dim(Data(x))) nrWell <- prod(pdim(x)) nrPlate <- max(plate(x)) for(p in 1:nrPlate) { j <- (1:nrWell)+nrWell*(p-1) samples <- wellAnno(x)[j]=="sample" y[j, , ] <- apply(Data(x)[j, , , drop=FALSE], 2:3, function(w) w/median(w[samples], na.rm=TRUE)) } y <- signif(y, 3) out <- y[,,1] out <- cbind(fData(xsc), out) names(out) = c(names(fData(xsc)), sprintf("Well/Median_r%d_ch%d", rep(1:dim(y)[2], dim(y)[3]), rep(1:dim(y)[3], each=dim(y)[2]))) write.tabdel(out, file="WellMedianRatio.txt") @ % At this point we are finished with the basic analysis of the screen. As one example for how one could continue to further mine the screen results for biologically relevant patterns, we demonstrate an application of category analysis. %------------------------------------------------------------ \section{Category analysis} %------------------------------------------------------------ We would like to see whether there are Gene Ontology categories~\cite{GO} overrepresented among the probes with a high score. For this we use the category analysis from Robert Gentleman's \Rpackage{Category} package~\cite{GentlemanCategories}. Similar analyses could be done for other categorizations, for example chromosome location, pathway membership, or categorical phenotypes from other studies. % <>= library("Category") @ % Now we can create the category matrix. Conceptually, this a matrix with one column for each probe and one row for each category. The matrix element \Robject{[i,j]} is \Robject{1} if probe \Robject{j} belongs to the \Robject{j}-th category, and \Robject{0} if not. %In practice, this matrix would be rather large (and perhaps %exhaust the memory of your computer), hence we use a type of sparse %matrix representation, implemented by the \Rclass{graph} object %\Robject{categs}. This bipartite graph's nodes correspond to the rows %and columns of the matrix and there is an edge if the matrix element %is non-zero. % <>= % scores <- as.vector(Data(xsc)) % names(scores) <- geneAnno(xsc) % sel <- !is.na(scores) & (!is.na(fData(xsc)$go)) % goids = strsplit(fData(xsc)$go[sel], ", ") % genes = rep(geneAnno(xsc)[sel], listLen(goids)) % ux = unique(unlist(goids, use.names=FALSE)) % require("GO") % s1 = ux %in% ls(GOMFANCESTOR) % s2 = ux %in% ls(GOBPANCESTOR) % s3 = ux %in% ls(GOCCANCESTOR) % s4 = is.na(ux) % obsolete = sort(ux[which(!(s1|s2|s3|s4))]) % @ <>= obsolete <- c("GO:0005489", "GO:0015997", "GO:0045034", "GO:0005660", "GO:0006118", "GO:0006512", "GO:0045045", "GO:0006125", "GO:0043072", "GO:0006100", "GO:0048740") @ Some distractions are the GO terms \Robject{\Sexpr{paste(obsolete, collapse=", ")}}, which are annotated to some of the genes, but are obsolete. % <>= scores <- as.vector(Data(xsc)) names(scores) <- geneAnno(xsc) sel <- !is.na(scores) & (!is.na(fData(xsc)$go_biological_process_id)) goids <- strsplit(fData(xsc)$go_biological_process_id[sel], ", ") goids <- lapply(goids, function(x) x[!(x %in% obsolete)]) genes <- rep(geneAnno(xsc)[sel], listLen(goids)) categs <- cateGOry(genes, unlist(goids, use.names=FALSE)) @ % We will select only those categories that contain at least 3 and no more than 1000 genes. <>= nrMem <- rowSums(categs) # number of genes per category remGO <- which(nrMem < 3 | nrMem > 1000) categs <- categs[-remGO,,drop=FALSE] ## see if there are genes that don't belong to any category ## after applying the filter nrMem <- rowSums(t(categs)) rem <- which(nrMem==0) if(length(rem)!=0) categs <- categs[,-rem, drop=FALSE] @ % As the statistic for the category analysis we use the $z$-score. First, we need to select the subset of genes that actually have GO annotation: % <>= stats <- scores[ sel & (names(scores) %in% colnames(categs)) ] @ There are some replicated probes in \Robject{stats}. We will handle this by taking the maximum value between replicate probes (non-conservative approach): <>= ## handle duplicated genes in stats: isDup <- duplicated(names(stats)) table(isDup) dupNames <- names(stats)[isDup] sp <- stats[names(stats) %in% dupNames] sp <- split(sp, names(sp)) table(sapply(sp, length)) aux <- stats[!isDup] aux[names(sp)] <- sapply(sp, max) stats <- aux rm(aux) @ % Before calling the category summary functions, we need to order our statistic vector according to the names of the columns of the category matrix. % <>= m <- match(colnames(categs), names(stats)) stats <- stats[m] stopifnot(colnames(categs)==names(stats)) @ % Finally, we are ready to call the category summary functions: % <>= acMean <- applyByCategory(stats, categs) acTtest <- applyByCategory(stats, categs, FUN=function(v) t.test(v, stats)$p.value) acNum <- applyByCategory(stats, categs, FUN=length) isEnriched <- (acTtest<=1e-3) & (acMean>0.5) @ % A volcano plot of the $-\log_{10}$ of the $p$-value \Robject{acTtest} versus the per category mean $z$-score \Robject{acMean} is shown in Figure~\ref{cellhts2Complete-volcano.png}. For a given category, the $p$-value is calculated from the $t$-test against the null hypothesis that there is no difference between the mean $z$-score of all probes and the mean $z$-score of the probes in that category. To select the enriched categories (\Robject{isEnriched}), we considered a significance level of $0.1\%$ for the $t$-test, and a per category mean $z$-score greater than $0.5$. This led to the \Sexpr{sum(isEnriched)} categories marked in red in Figure~\ref{cellhts2Complete-volcano.png} are listed in Table~\ref{tab:enrichedGoCateg}. % \input{cellhts2-enrichedGoCateg} % \myincfig{cellhts2Complete-volcano.png}{180px}{Volcano plot of the $t$-test $p$-values and the mean $z$-values of the category analysis for Gene Ontology categories. The top categories are shown in red.} % <>= png("cellhts2Complete-volcano.png", width=180, height=180) par(mai=c(0.9,0.9,0.1,0.1)) px <- cbind(acMean, -log10(acTtest)) plot(px, main='', xlab=expression(z[mean]), ylab=expression(-log[10]~p), pch=".", col="black") points(px[isEnriched, ], pch=16, col="red", cex=0.7) dev.off() stopifnot(identical(names(acMean), names(acTtest)), identical(names(acMean), names(acNum))) @ % <>= enrichedGOCateg <- names(which(isEnriched)) require("GO.db") res <- data.frame( "$n$" = acNum[isEnriched], "$z_{\\mbox{\\scriptsize mean}}$" = signif(acMean[isEnriched],2), "$p$" = signif(acTtest[isEnriched],2), "GOID" = I(enrichedGOCateg), "Ontology" = I(sapply(enrichedGOCateg, function(x) Ontology(get(x, GOTERM)))), "description" = I(sapply(enrichedGOCateg, function(x) Term(get(x, GOTERM)))), check.names=FALSE) mt <- match(res$Ontology, c("CC", "BP", "MF")) stopifnot(!any(is.na(mt))) res <- res[order(mt, res$"$p$"), ] cellHTS2:::dataframeOutput(res, header=TRUE, caption=sprintf("Top %d Gene Ontology categories with respect to $z$-score.", nrow(res)), label="enrichedGoCateg", gotable=TRUE) @ % We have recently added an experimental feature to the \Rpackage{cellHTS2} package in order to integrate such category analyses into the HTML report framework. It leverages functionality from the \Rpackage{GSEABase} package, which was specifically designed to deal with gene sets. The feature is exemplyfied using the KEGG pathway information available though the \Rpackage{KEGG.db} package. First we have to create an object of class \Rclass{GeneSetCollection} which represents the mapping of genes to KEGG pathways. The geneIDs in the gene set collection are supposed to map to the geneIDs of the assay scores. From the \Rpackage{KEGG.db} package we get a list of pathway mapping to FlyBase identifiers. <>= library(KEGG.db) library(GSEABase) kegg <- as.list(KEGGPATHID2EXTID) kegg <- kegg[grep("dme", names(kegg))] gsc <- GeneSetCollection(mapply(function(keggId, geneId) GeneSet(unique(geneId), geneIdType=EntrezIdentifier(), collectionType=KEGGCollection(keggId), setName=keggId), names(kegg), kegg)) @ % In a subsequent step we want to filter out all the assay scores without proper annotation (e.g., control wells) and also restrict our collection of gene sets to those containing at least 5 genes. The gene annotations of our assay are the old CG identifiers, but we have retrieved a mapping to proper FlyBase identifiers in the previous \Rpackage{biomaRt} step. <>= scores <- as.vector(Data(xsc)) names(scores) <- fData(xsc)$flybase_gene_id sel <- !is.na(scores) & !is.na(names(scores)) scores <- scores[sel] ## Only consider set with more than 5 genes gsc <- gsc[sapply(gsc, length) > 5] gsNames <- mget(gsub("dme", "", names(gsc)), KEGGPATHID2NAME) ann <- data.frame(Name=I(unlist(gsNames))) rownames(ann) <- names(gsc) @ % Finally, we need to bundle up the assay scores and the gene set collection and define one or several summary statistics to show in the HTML report. Each of these statistics is supposed to be created by a separate function, which takes one mandatory and one optional argument. The mandatory argument is simply a vector of assay scores for the respective gene set. The optional second argument is a numeric vector of all available scores. This setup allows to compute simple statistics like means or medians, but also more complicated test statistics, for instance a $t$-test. The container for this information is an object of class \Rclass{gseaModule} and there is a convenient constructor which does most of the job. In the following example, we define four test statistics (mean, $p$-value of a $t$-test, its $t$-statistic and the number of genes in the gene set) and pass them to the constructor as a list. We also add a data.frame with further gene set annotation (in this case more human readable names for the pathways) and the vector of assays scores. Note that if no explicit vector is given, the scores of the \Rclass{cellHTS} object are used. In our case their identifiers don't match the gene sets FlyBase identifiers, and we also wanted to do a couple of extra filtering steps. <>= gmod <- gseaModule(gsc, list("Mean"=function(x,...) mean(x), "P-value"=function(x,y) t.test(x, y)$p.value, "T-statistic"=function(x,y) t.test(x, y)$statistic, "NrGenes"=function(x,...) length(x)), annotation=ann, scores=scores) @ % We can now re-write the \Rclass{cellHTS} report using the familiar Rfunction{writeReport} function with the addition \Robject{gseaModule} argument. <>= ## Now load the cellHTS objects and rewrite the report out <- writeReport(raw=x, normalized=xn, scored=xsc, force=TRUE, outdir=tempdir(), gseaModule=gmod) if (interactive()) browseURL(out) @ %------------------------------------------------------------ \section{Comparison with the results previously reported} %------------------------------------------------------------ In this section we compare the current results obtained using \Rpackage{cellHTS} package, with the ones previously reported in Boutros \textit{et al.}~\cite{Boutros2004}. The file ``Analysis2003.txt'' in the same directory as the input data files, i.\,e.\, in \Robject{\Sexpr{experimentName}} directory of the \Rpackage{cellHTS} package. First, We will load this file: % <>= data2003 <- read.table(file.path(dataPath, "Analysis2003.txt"), header=TRUE, as.is=TRUE, sep="\t") @ % The file contains the columns \Robject{\Sexpr{paste(names(data2003),collapse=", ")}}. The scored values in the \Robject{Scores} column will be compared with the ones obtained in our analysis. For that, I will start by adding to \Robject{data2003}, a column with the corresponding $z$-score values calculated using the \Rpackage{cellHTS} package. % <>= i <- data2003$Position + 384*(data2003$Plate-1) data2003$ourScore <- as.vector(Data(xsc))[i] @ % % \myincfig{cellhts2Complete-scoresComparison.png}{324px}{Scored values obtained in the paper of Boutros \textit{et al.} against the scored values calculated herein. Each panel corresponds to one 384-well plate. Axis labels are not pretty - they overlap with neighboring panels due to space constraints.} % <>= png("cellhts2Complete-scoresComparison.png", width=324, height=324) par(mfrow=c(7,9), mai=c(0,0,0,0)) for(i in 1:max(data2003$Plate)) { sel <- (data2003$Plate==i) plot(data2003$ourScore[sel], data2003$Score[sel], pch=19, cex=0.6) } dev.off() @ % Figure~\ref{cellhts2Complete-scoresComparison.png} shows the scatterplot between Boutros \textit{et al.}'s scores and our scores in each of the 384-well plates. The results between the two analyses are very similar, except for two minor details: use of robust estimators of location and spread (median and MAD instead of mean and standard deviation), and estimation of MAD over the whole experiment instead of plate-by-plate. In fact, Figure~\ref{cellhts2Complete-scoresComparison.png} evidenciates how the scored values exactly agree up to an offset (mean versus median) and scale (standard deviation versus MAD). % %------------------------------------------------------------ \section{Appendix: How to convert \Rpackage{cellHTS} to \Rpackage{cellHTS2} configuration files} \label{sec:migratingProjects} %------------------------------------------------------------ We advise the users of \Rpackage{cellHTS} package to start using the improved package \Rpackage{cellHTS2} described herein, since the latter provides better functionality for working with multi-channel screens and multiple screens. To facilitate this transition and help users to migrate their \Rpackage{cellHTS}-specific projects to \Rpackage{cellHTS2}, we provide in this package a function that converts the old S3 \Robject{cellHTS} object associated with \Rpackage{cellHTS} package into one or several S4 \Robject{cellHTS} defined with the \Rpackage{cellHTS2} package. This function is called \Rfunction{convertOldCellHTS}. However, you might want to migrate an existing project from start, i.\,e.\,, redo all the steps starting by reading the intensity files and configuring the screening data. In this case, you need to update the screen log file (if available), the screen description file and the screen configuration file~\footnote{The expected format of the other input files, namely the raw intensity data files (Section~\ref{sec:read}) and the annotation file (Section~\ref{sec:annotation}) remains unchanged between the two packages.}.\\ % Screen description file Regarding the screen description file, as mentioned in Section~\ref{sec:screenconf}, we provide a function that creates a template screen description file that can be edited and modified by the user. Below we examplify how such file can be created: % <>= out <- templateDescriptionFile("template-Description.txt", force=TRUE) out readLines(out) @ % \input{cellhts2-cellHTSpackage-specificplateconfiguration} \input{cellhts2-cellHTSpackage-specificscreenlog} % Create the example table for the older formats of plate configuration and screen log files <>= cellHTS2:::tableOutput(file.path(dataPath, "old-Plateconf.txt"), "cellHTS package-specific plate configuration", selRows=1:28) cellHTS2:::tableOutput(file.path(dataPath, "old-Screenlog.txt"), "cellHTS package-specific screen log", selRows=1:3) @ % % Screen log file The format of the screen log file compatible with \Rpackage{cellHTS2} package is shown in Table~\ref{tab:screenlog}. Compared to the previous format required for \Rpackage{cellHTS} package (Table~\ref{tab:cellHTSpackage-specificscreenlog}), we note that column \textit{Filename} was replaced by two columns named \textit{Plate} and \textit{Sample}. \\ % Screen configuration file In \Rpackage{cellHTS} package, the concept of \textit{batch} is intrinsically related with the plate configuration, since a change in plate configuration along the experiment had to be handled by setting each distinct plate configuration as corresponding to a different batch. Therefore, in \Rpackage{cellHTS} package, the plate configuration file had three mandatory columns named \textit{Batch}, \textit{Well}, \textit{Content}, where the \textit{Batch} column allowed for different plate configurations. The first 28 lines of such file for the example RNAi screen considered in this report is shown in Table~\ref{tab:cellHTSpackage-specificplateconfiguration}. Thus, in the old format required for the plate configuration file, we had to have a number of rows equal to the product between the total number of batches and the total number of wells per plate. In contrast to \Rpackage{cellHTS} package, in \Rpackage{cellHTS2} package the concept of \emph{batch} and multiple plate configurations were made independent (see Section~\ref{sec:multPlateConfs}). For \Rpackage{cellHTS2} package, Table~\ref{tab:plateconfiguration} shows the required file format, which was discussed in more detail in Section~\ref{sec:plateconf}. Due to the separation between batch and multiple plate configuration, the column \textit{Batch} was removed, and replaced by the column \emph{Plate}. The other two mandatory columns \emph{Well} and \emph{Content} were kept. Additionally, we now require that this file contains two extra header lines giving the total number of wells and plates (Table~\ref{tab:plateconfiguration}). There is also an improvement in the file format related with the fact that \emph{Plate} and \emph{Well} columns now allow the use of regular expressions (see Section~\ref{sec:plateconf} for more specific information), which allows to cover the plate configuration used in the screen with just a few lines. Besides, it allows to specify different configurations within and between assay plates. %------------------------------------------------------------ \section{Appendix: Normalization methods implemented in \Rpackage{cellHTS2} package} \label{sec:normalizationMethods} %------------------------------------------------------------ There are two main normalization methods available with \Rpackage{cellHTS2} package: methods based on the use of reference controls, and distribution-based methods. These methods can be applied using the function \Rfunction{normalizePlates}. %------------------------------------------------------------ \subsection{Controls-based normalization} %------------------------------------------------------------ % POC % NPI % negatives %------------------------------------------------------------ \subsubsection{Percent of control} %------------------------------------------------------------ Percent of control (POC) is a preprocessing method that tries to correct for plate-to-plate variability by normalizing each $k$th compound raw measurements in the $i$th result file, $x_{ki}$, relative to the average of within-plate controls. In an antagonist (or inhibition) type assay, it is defined as: \begin{equation} x_{ki}^{\text{POC}} = \frac{x_{ki}}{\mu_i^{\text{pos}}} \times 100 \end{equation} \noindent where $\mu_i^{\text{pos}}$ is the average of the measurements on the positive controls in the $i$th result file (i.\,e.\,, for a given plate and replicate). In \Rpackage{cellHTS2} package, this method can be applied by setting the argument \Robject{method="POC"} when calling \Rfunction{normalizePlates} function. We also provide in the package a normalization method for \Rfunction{normalizePlates} (\Robject{method="negatives"}) that consists of scaling the plate measurements by the per-plate median of the intensities on the negative controls~\footnote{If the scale of the data is defined as being additive (i.\,e.\,, argument \Robject{scale="additive"}, or arguments \Robject{scale="multiplicate"} and \Robject{log=TRUE}), measurements are subtracted by the median of per-plate negative controls instead.}. %------------------------------------------------------------ \subsubsection{Normalized percent inhibition} %------------------------------------------------------------ If \Rfunction{normalizePlates} is called with \Robject{method="NPI"}, the method known as normalized percent inhibition (NPI) is applied in a per-plate basis to correct for plate effects. For an antagonist assay, this method divides the difference between each measurement in a given result file $i$ ($x_{ki}$) and the average of the positive controls on that plate ($\mu_i^{\text{pos}}$) by the difference between the averages of the measurements on the positive ($\mu_i^{\text{pos}}$) and the negative controls ($\mu_i^{\text{neg}}$): \begin{equation} x_{ki}^{\text{NPI}} = \frac{\mu_i^{\text{pos}}-x_{ki}}{\mu_i^{\text{pos}}-\mu_i^{\text{neg}}} \end{equation} %------------------------------------------------------------ \subsection{Non-controls-based normalization} % or distribution-based normalization %------------------------------------------------------------ % Z score % plate-median normalization % B score % spatial normalization ("loess" or "locfit") There are several normalization method implemented in \Rpackage{cellHTS2} package that make use of the overall distribution of values, instead of relying exclusively on controls. These are described in the following sections. %------------------------------------------------------------ \subsubsection{Z score method} %------------------------------------------------------------ Z score is a simple and widely known normalizing method that is performed in a per-plate basis as follows: \begin{equation} x_{ki}^{\text{Z}} = \frac{x_{ki} - \mu_i}{\sigma_i}, \end{equation} \noindent where $\mu_i$ and $\sigma_i$ are the mean and standard deviation, respectively, of all measurements within the $i$th result file (replicated plate). In the Z score method, measurements are re-scaled relative to within-plate variation by subtracting the average of the plate values and dividing this difference by the standard deviation estimated from all measurements of the plate. In \Rpackage{cellHTS2}, we consider a robust version of this method, where the mean and standard deviation are replaced by the median and the MAD, respectively, calculated at the sample wells. This robust Z score method is performed by calling \Rfunction{normalizePlates} as follows: % <>= xZ <- normalizePlates(x, scale="additive", log=FALSE, method="median", varianceAdjust="byPlate") @ % %------------------------------------------------------------ \subsubsection{Plate median normalization} %------------------------------------------------------------ Plate median normalization involves calculating the relative signal of each well compared to the median of the sample wells in the plate, as shown in Equation (\ref{eq:normalizePlateMedian}) and Equation (\ref{eq:plateMedian}). The median is calculated among the $m$ wells containing \textit{sample} (i.\,e.\,, for wells that contain genes of interest) in result file $i$. Plate median normalization can be chosen by setting \Robject{method="median"} in \Rfunction{normalizePlates}. When applied to data on additive scale, the plate median normalization involves subtracting the plate measuments by the per-plate median instead. We also have two variants of the plate median scaling which consist of using as the per-plate scaling factor $M_i$ the per-plate average intensity on sample wells (\Robject{method="mean"}) or the midpoint of the shorth of the per-plate distribution of values on sample wells (\Robject{method="shorth"}).\\ The next methods are intended to explicitly correct for \emph{spatial effects} within plates, i.\,e.\,, the presence of intensity gradients within the plates. Such signal gradients can be caused by differences in temperature, incubation time or concentration, etc., in different wells across a plate. Typically, these gradients produce repeatitive patterns, which make it possible to distinguish them from real actives that should be more or less randomly dispersed for quasi-randomized collections of compounds. %------------------------------------------------------------ \subsubsection{B score method} %------------------------------------------------------------ In the B score method, row and column biases within each plate are explicitly corrected for by fitting a two-way median polish to the raw data in a per-plate fashion~\cite{Brideau2003}: \begin{eqnarray} e_{rci} = x_{rci} - \hat{x}_{rci} = x_{rci} - \left( \hat{\mu}_i + \hat{R}_{ri} + \hat{C}_{ci}\right) \\ \label{eq:BscoreResiduals} x^{\text{B}}_{rci} = \frac{e_{rci}}{MAD_i} \label{eq:BscoreVarAdj} \end{eqnarray} \noindent Here, $x_{rci}$ is the measurement value in the $r$th row and $c$th column of the plate corresponding to the $i$th result file, $\hat{x}_{rci}$ is the corresponding fitted value defined as the sum between the estimated average of the replicate plate ($\hat{\mu}_i$), the estimated systematic offset for row $r$ ($\hat{R}_{ri}$) and the systematic offset for column $c$ ($\hat{C}_{ci}$) in that replicated plate. In a second step, each of the obtained residual values $e_{rci}$'s of the $i$th result file are divided by their median absolute deviation ($MAD_i$) giving the final B score value -- Equation~(\ref{eq:BscoreVarAdj}). We implemented a method similar to the B score method described in Malo \textit{et al.}~\cite{Malo2006} and Brideau \textit{et al.}~\cite{Brideau2003} using the Tukey's median polish procedure~\cite{Tukey1977} (function \textit{medpolish} of the package \textit{stats}) which fits an additive model to the data according to Equation~(\ref{eq:BscoreResiduals}). The Tukey's median polish algorithm works by alternately removing the row and column medians and continues until the proportional reduction in the sum of absolute residuals is less than $\epsilon$ or until the maximum number of iterations has been reached. The B score method can be applied by defining argument \Robject{method="Bscore"} in \Rfunction{normalizePlates}. Alternatively, the method can be applied by calling a separate function called \Rfunction{Bscore} provided in the \Rpackage{cellHTS2} package.\\ In \textit{cellHTS2} package, we provide two additional spatial normalization methods that fit a polynomial surface to the intensities within each assay plate using local regression and that can be performed via \Rfunction{normalizePlates} or \Rfunction{spatialNormalization} functions, although we advise to apply these methods using the former function. The fit can be performed either using the \textit{loess} procedure or the \textit{locfit.robust} function of package \textit{locfit}. In \Rfunction{normalizePlates}, if \Robject{method="locfit"}, spatial effects are removed by fitting a bivariate local regression to each plate and replicate, while if \Robject{method="loess"}, a loess curve is fitted instead. %------------------------------------------------------------ \section{Appendix: Data transformation} %------------------------------------------------------------ \myincfig{cellhts2Complete-transfplots.png}{324px}{Comparison between untransformed (left) and logarithmically (base 2) transformed (right), normalized data. Upper: histogram of intensity values of replicate 1. Middle: scatterplots of standard deviation versus mean of the two replicates. Bottom: Normal quantile-quantile plots.} An obvious question is whether to do the statistical analyses on the original intensity scale or on a transformed scale such as the logarithmic one. Many statistical analysis methods, as well as visualizations work better if (to sufficient approximation) \begin{itemize} \item replicate values are normally distributed, \item the data are evenly distributed along their dynamic range, \item the variance is homogeneous along the dynamic range~\cite{Huber2002ismb}. \end{itemize} Figure~\ref{cellhts2Complete-transfplots.png} compares these properties for untransformed and log-transformed normalized data, showing that the difference is small. Intuitively, this can be explained by the fact that for small $x$, \[ \log(1+x)\approx x \] and that indeed the range of the untransformed data is mostly not far from 1. Hence, for the data examined here, the choice between original scale and logarithmic scale is one of taste, rather than necessity. % <>= library(vsn) myPlots=function(z, main, plotCol) { z <- as.data.frame(z) colnames(z) <- paste0("Sample", seq_len(ncol(z))) gh <- ggplot2::ggplot(z, ggplot2::aes(x=Sample1))+ ggplot2::geom_histogram(fill="darkblue")+ ggplot2::ggtitle(main) gm <- meanSdPlot(as.matrix(z), plot=FALSE)$gg+ ggplot2::ylim(c(0, quantile(abs(z[,2]-z[,1]), 0.95, na.rm=TRUE)))+ ggplot2::theme(legend.key.size=unit(0.02, "npc"), legend.position="top") gq <- ggplot2::qplot(sample=z$Sample1) print(gh, vp=viewport(layout.pos.row=1, layout.pos.col=plotCol)) print(gm, vp=viewport(layout.pos.row=2, layout.pos.col=plotCol)) print(gq, vp=viewport(layout.pos.row=3, layout.pos.col=plotCol)) } png("cellhts2Complete-transfplots.png", width=400, height=600) grid.newpage() pushViewport(viewport(layout=grid.layout(3, 2))) dv <- Data(xn)[,,1] myPlots(dv, main="untransformed", plotCol=1) xlog <- normalizePlates(x, scale="multiplicative", log=TRUE, method="median", varianceAdjust="byExperiment") dvlog <- Data(xlog)[,,1] myPlots(dvlog, main="log2", plotCol=2) dev.off() @ % %--------------------------------------------------------- \section{Session info} %--------------------------------------------------------- \begin{table*}[h]%[tbp] \begin{minipage}{\textwidth} <>= toLatex(sessionInfo()) @ \end{minipage} \caption{\label{tab:sessioninfo}% The output of \Rfunction{sessionInfo} on the build system after running this vignette.} \end{table*} %------------------------------------------------------------ %Bibliography %------------------------------------------------------------ \bibliography{cellhts} \bibliographystyle{plain} \end{document}