%\VignetteIndexEntry{An introduction to PLGEM} %\VignetteKeywords{Error Model, Microarray, Proteomics} %\VignetteDepends{plgem} %\VignettePackage{plgem} \documentclass[a4paper]{article} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{An introduction to PLGEM} \author{Mattia Pelizzola and Norman Pavelka} \begin{document} \maketitle \section{Introduction} This document serves as a brief tutorial to the \textbf{Power Law Global Error Model} (\textbf{PLGEM}) analysis method \cite{bmc}. \textbf{PLGEM} has so far been shown to faithfully model the variance-versus-mean dependence that exists in a wide variety of genome-wide data sets, including microarray \cite{bmc} and proteomics data \cite{mcp}. The use of \textbf{PLGEM} has furthermore been shown to improve the detection of differentially expressed genes or proteins in these datasets \cite{bmc, mcp}. \section{Running \textbf{PLGEM} in \emph{wrapper} mode} A wrapper function (called \Rfunction{run.plgem}) is provided in the package, which performs all the necessary steps to obtain a list of differentially expressed genes or proteins (DEG), starting from a dataset of class \Rclass{ExpressionSet}. This input dataset is expected to contain either normalized gene expression values obtained from one-channel microarrays (such as Affymetrix GeneChip \cite{bmc}), or normalized spectral counts obtained from mass spectrometry-based proteomics methods (such as MudPIT \cite{mcp}). The wrapper automatically attempts to find the best solution at each step, and requires only modest or no input decisions by the user. For didactic purposes, we will use here a subset of the microarray dataset used in the original publication about \textbf{PLGEM}, containing two replicates of LPS-stimulated dendritic cells (`LPS') and four replicates of untreated dendritic cells (`C'): <>= library(plgem) data(LPSeset) set.seed(123) LPSdegList <- run.plgem(esdata=LPSeset) @ The above obtained object \Robject{LPSdegList} will contain the list of genes or proteins, selected as significantly changing between experimental condition `LPS' and baseline condition `C', at the default significance level 0.001. \section{Running \textbf{PLGEM} in \emph{step-by-step} mode} To provide advanced users with a higher control on the inner workings of the \textbf{PLGEM} pipeline, the individual functions called by \Rfunction{run.plgem} are also described in this tutorial. We will use them now, step by step. \subsection{Fitting of the model to a data set} The first step is to fit the model to the microarray or proteomics dataset. By fitting the model we will obtain a mathematical relationship that will allow us to determine the expected standard deviation associated to a given average gene expression value or average protein abundance level. \textbf{PLGEM} can only be fitted on a set of replicates of a same experimental condition, therefore we first need to choose which condition to use for the fitting step. In our dataset two conditions are provided: `C' and `LPS'. Usually the most replicated one is chosen, in this example we will therefore choose the first condition `C' (i.e. \Rfunarg{fitCondition="C"}), because it contains four replicates. Technically, we may have decided to fit the model also on the two `LPS' samples, as two replicates are required and sufficient to fit a PLGEM. But having 3 or more replicates improves the fitting and is usually recommended \cite{mcp}. Moreover, we can can change the default values of parameters \Rfunarg{p} and \Rfunarg{q}. Briefly, \Rfunarg{p} represents the number of intervals (or bins) used to partition the expression value range of the dataset. We observed that \Rfunarg{p} can be modified over a wide range of values, without any major effects on the final results, except when it was chosen to close to the total number of genes or proteins in the dataset \cite{bmc}. As a rule of thumb, \Rfunarg{p} should be no more than one tenth of the number of genes or proteins. The default of 10 should therefore be appropriate for most microarray experiments, but could be set lower for proteomics data where less than 100 proteins were identified. \Rfunarg{q} is the quantile of the location-dependent spread used to fit the model. The default of \Rfunarg{q} is set to 0.5, because this represents the median value, which is what you are looking for when modeling the variability. We recommend to only modify this parameter for very special purposes, e.g. for the determination of empirical confidence intervals of standard deviation. Moreover it is possible to evaluate the fitting of the model setting the option \Rfunarg{fittingEval}. If this argument is set to \Rfunarg{TRUE}, a multi-panel plot is produced where the residuals of the model are evaluated. A good fit is characterized by a near-normal distribution and an horizontal symmetric plot of the ranked residuals. Finally, setting \Rfunarg{plot.file=TRUE} saves above diagnostic plot to a png file instead of the default device, therefore we won't change its default. \begin{center} <>= LPSfit <- plgem.fit(data=LPSeset, covariate=1, fitCondition='C', p=10, q=0.5, plot.file=FALSE, fittingEval=TRUE, verbose=TRUE) @ \end{center} \subsection{Computation of observed signal-to-noise ratios} The next step is the computation of the signal-to-noise ratio (STN) statistics for the detection of differential expression. The STN is determined using the model-derived spread estimates instead of the data-derived ones. Therefore it is necessary to give to the \Rfunction{plgem.obsStn} function the model parameters \Rfunarg{slope} and \Rfunarg{intercept} determined during the model fitting that are contained in the value returned by \Rfunction{plgem.fit} function. By default, all experimental conditions (according to the values of the \Rfunarg{covariate} defined in the \Rclass{phenoData} slot of the \Rclass{ExpressionSet}) are compared to the first condition (baseline). If the condition to be treated as the baseline is not the first one, we can change this by modifying the argument \Rfunarg{baselineCondition}. A matrix of observed STN is determined, where the number of rows are the number of genes or proteins in the dataset and the number of columns are the number of comparisons that can be performed in the dataset (i.e. the total number of experimental conditions minus one). In this case, the dataset contains only one condition to be compared to the baseline, therefore the matrix will be a one-dimensional array of observed PLGEM-STN values. <>= LPSobsStn <- plgem.obsStn(data=LPSeset, covariate=1, baselineCondition=1, plgemFit=LPSfit, verbose=TRUE) @ \subsection{Computation of resampled signal-to-noise ratios} In order to get an estimate of the distribution of the test statistic under the null hypothesis of no differential expression, a resampled statistic is determined using the method described in the paper \cite{bmc}. The number of iterations of the resampling step should be correlated with the total number of replicates that are present in the data set. If this argument is set to \Rfunarg{"automatic"}, the number of iterations is automatically determined based on the total number of possible combinations. In this case, an upper threshold of 500 iterations is set to avoid excessive computation time. This should be fine for most purposes. <>= set.seed(123) LPSresampledStn <- plgem.resampledStn(data=LPSeset, plgemFit=LPSfit, iterations="automatic", verbose=TRUE) @ \subsection{Computation of p-values} Next, p-values are calculated for each observed STN value via a call to function \Rfunction{plgem.pValue}. Resampled STN values are also required by this function, because they will be used to build empirical cumulative distribution functions of the STN values that can be observed under the null hypothesis: <>= LPSpValues <- plgem.pValue(observedStn=LPSobsStn, plgemResampledStn=LPSresampledStn, verbose=TRUE) head(LPSpValues) @ \subsection{Detection of differentially expressed genes or proteins (DEG)} Finally, DEG are selected at the given significance level \Rfunarg{delta} via a call to function \Rfunction{plgem.deg}. The chosen value of \Rfunarg{delta} can be seen as an estimate of the False Positive Rate (FPR). Therefore, in case of a microarray dataset with 10,000 genes of which not a single one is truly differentially expressed, choosing \Rfunarg{delta=0.001} will select roughly 10 genes by chance: <>= LPSdegList <- plgem.deg(observedStn=LPSobsStn, plgemPval=LPSpValues, delta=0.001, verbose=TRUE) head(LPSdegList$significant[["0.001"]][["LPS_vs_C"]]) @ Above function returns a list with a number of items that is equal to the number of different significance levels \Rfunarg{delta} used as input. In this case the default single value of 0.001 was used, so the list will contain only one item at this level. This item is again a list, whose number of items correspond to the number of performed comparisons, i.e. the number of conditions in the starting \Rclass{ExpressionSet} minus the baseline, in this case again only one. In each list-item the values are the observed STN and the names are the IDs of the significantly changing genes or proteins, as defined in the \Rclass{ExpressionSet}. Finally, the obtained list of DEG can be written to the current working directory using the \Rfunction{plgem.write.summary} function. <>= sessionInfo() @ \begin{thebibliography}{} \bibitem[1]{bmc} Pavelka, N.,~Pelizzola, M.,~Vizzardelli, C.,~Capozzoli, M.,~Splendiani, A.,~ Granucci, F. and P. Ricciardi-Castagnoli (2004). \newblock A power law global error model for the identification of differentially expressed genes in microarray data. \newblock \emph{BMC Bioinformatics}, 5:203. \bibitem[2]{mcp} Pavelka, N.,~Fournier, M., L.,~Swanson, S., K.,~Pelizzola, M.,~ Ricciardi-Castagnoli, P.,~Florens, L. and M. P. Washburn (2008). \newblock Statistical similarities between transcriptomics and quantitative shotgun proteomics data. \newblock \emph{Molecular and Cellular Proteomics}, 7(4):631-44. \end{thebibliography} \end{document}