\documentclass{article} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{edge Package} \usepackage{graphics} \usepackage{amsmath} \usepackage{fullpage} \usepackage{bibentry} \usepackage[section]{placeins} \usepackage[round]{natbib} \usepackage{authblk} \usepackage[parfill]{parskip} \setlength{\parskip}{10pt} %\usepackage{indentfirst} \usepackage[colorlinks=true]{hyperref} \usepackage[utf8]{inputenc} \nobibliography* \newcommand{\edge}{{\tt edge}\ } \newcommand{\edgeNS}{{\tt edge}} \Sexpr{library(knitr); library(splines); library(edge); opts_chunk$set(tidy=TRUE, cache=FALSE, warning=FALSE, message=FALSE,fig.align='center')} \begin{document} <>= library(edge) options(keep.source = TRUE, width = 48) foo <- packageDescription("edge") @ \title{\edgeNS:\\ \underline{E}xtraction of \underline{D}ifferential \underline{G}ene \underline{E}xpression \\ Version \Sexpr{foo$Version}} \author[1,2]{John D. Storey\thanks{\url{http://genomine.org/contact.html}}} \author[3]{Jeffrey T. Leek} \author[1]{Andrew J. Bass} \affil[1]{Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton NJ USA} \affil[2]{Center for Statistics and Machine Learning, Princeton University, Princeton NJ USA} \affil[3]{Department of Biostatistics, John Hopkins University, Baltimore MD USA} %\renewcommand\Authands{ and } \maketitle \Large \begin{center} First edition: October 2005 Most recent edition: March 2015 \end{center} \normalsize \ \ \noindent {\bf Note:} \edge was first released in 2005 and described in the publication \cite{leek2005}. It was an independently released R package by the \href{http://www.genomine.org/}{John Storey Lab}, which included multi-threading and a graphical user interface. However, the current version is now available through Bioconductor as a standard R package. \clearpage \tableofcontents % INTRODUCTION ---------------------------------------------------------------- \section{Introduction} The \edge package implements methods for carrying out differential expression analyses of genome-wide gene expression studies. Significance testing using the optimal discovery procedure and generalized likelihood ratio test (equivalent to F-tests and t-tests) are implemented for general study designs. Special functions are available to facilitate the analysis of common study designs, including time course experiments. Other packages such as {\tt snm}, {\tt sva}, and {\tt qvalue} are integrated in \edge to provide a wide range of tools for gene expression analysis. \edge performs significance analysis by using a framework developed by \cite{storey:2007} called the optimal discovery procedure (ODP). Whereas standard methods employ statistics that are essentially designed for testing one gene at a time (e.g., t-statistics and F-statistics), the ODP-statistic uses information across all genes to test for differential expression. \cite{storey:etal:2007} shows that the ODP is a principled, often times more powerful, approach to multiple hypothesis testing compared to traditional methods. The improvements in power from using the optimal discovery procedure can be substantial; Figure~\ref{fig:Comparison} shows a comparison between \edge and five leading software packages based on the \cite{hedenfalk:2001} breast cancer expression study. \begin{figure}[htb] \centering \includegraphics[scale=.50]{edgecomp.pdf} \caption{Comparison of EDGE to various other leading methods for identifying differential expressed genes in the \cite{hedenfalk:2001} study. This figure is from \cite{leek2005}.} \label{fig:Comparison} \end{figure} {\tt edge} also implements strategies that have been specifically designed for time course experiments and other complex study designs. Specifically, \cite{storey:2005} developed a procedure that simplifies the modelling process for time course experiments. In addition to identifying differentially expressed genes in multi-class, time course, and general study designs, \edge includes implementations of popular packages such as {\tt snm}, {\tt sva} and {\tt qvalue} to help simplify the analysis process for researchers. The rest of the document details how to use \edge in three different case studies: static sampling among $K$ groups, independent time course, and longitudinal time course. For additional information regarding the optimal discovery procedure or the \cite{storey:2005} methodology for time course experiments, see section~\ref{sec:citepackage}. \section{Citing this package} \label{sec:citepackage} {\bf When reporting results involving the estimation of false discovery rate or q-value quantities, please cite:} \bibentry{Storey:2002fc} \bibentry{Storey:2003il} {\bf When reporting results involving the analysis of time course studies, please cite:} \bibentry{storey:2005} % A methodology for analyzing time course microarray data is introduced and % applied to two time course studies on humans. {\bf When reporting results involving use of the optimal discovery procedure ({\tt odp}), please cite:} \bibentry{storey:2007} % Theory paper that introduces the optimal discovery procedure and shows that it % maximizes the expected true positive results for each number of fixed false % positive results. The optimality is closely related to the false discovery rate. \bibentry{storey:etal:2007} % Discusses various ways of estimating the ODP statistic with applications to % microarray experiments. \bibentry{woo:leek:storey:2011} % Previous implementations of the ODP are computationally infeasible for a large % number of hypothesis tests. This paper introduces a computationally efficient % implementation of ODP that this package is based on. {\bf When reporting results involving surrogate variable analysis ({\tt apply\_sva}), please cite:} \bibentry{leek:2007} \bibentry{Leek:2008qf} {\bf When reporting results involving supervised normalization of microarrays ({\tt apply\_snm}), please cite:} \bibentry{mecham:2010} {\bf To cite the {\tt edge} R package itself, please type the following to retrieve the citation:} <>= citation("edge") @ \section{Getting help} Many questions about {\tt qvalue} will hopefully be answered by this documentation and references therein. As with any R package, detailed information on functions, their arguments and values, can be obtained in the help files. To view the help for {\tt qvalue} within R, type <>= help(package="edge") @ \noindent If you identify bugs related to basic usage please contact the authors directly, preferably via GitHub at \url{https://github.com/jdstorey/edge/issues}. Otherwise, any questions or problems regarding {\tt edge} will most efficiently be addressed on the Bioconductor support site, \url{https://support.bioconductor.org/}. \section{Quick start guide} To get started, first load the {\tt kidney} dataset included in the package: <>= library(edge) data(kidney) age <- kidney$age sex <- kidney$sex kidexpr <- kidney$kidexpr @ The {\tt kidney} study is interested in determining differentially expressed genes with respect to age in the kidney. The {\tt age} variable is the age of the subjects and the {\tt sex} variable is whether the subjects are male or female. The expression values for the genes are contained in the {\tt kidexpr} variable. Once the data has been loaded, the user has two options to create the experimental models: {\tt build\_models} or {\tt build\_study}. If the experiment models are unknown to the user, {\tt build\_study} can be used to create the models: <>= de_obj <- build_study(data = kidexpr, adj.var = sex, tme = age, sampling = "timecourse") full_model <- fullModel(de_obj) null_model <- nullModel(de_obj) @ {\tt sampling} describes the type of experiment performed, {\tt adj.var} is the adjustment variable and {\tt time} is the time variable in the study. If the experiment is more complex then type {\tt ?build\_study} for additional arguments. If the full and null models are known to the user then {\tt build\_models} can be used to make an {\tt deSet} object: <>= library(splines) cov <- data.frame(sex = sex, age = age) null_model <- ~sex full_model <- ~sex + ns(age, df=4) de_obj <- build_models(data = kidexpr, cov = cov, null.model = null_model, full.model = full_model) @ {\tt cov} is a data frame of covariates, {\tt null.model} is the null model and {\tt full.model} is the full model. The input {\tt cov} is a data frame with the column names the same as the variables in the full and null models. The {\tt odp} or {\tt lrt} function can be used on {\tt de\_obj} to implement either the optimal discovery procedure or the likelihood ratio test, respectively: <>= # optimal discovery procedure de_odp <- odp(de_obj, bs.its = 50, verbose=FALSE) # likelihood ratio test de_lrt <- lrt(de_obj) @ To access the $\pi_{0}$ estimate, p-values, q-values and local false discovery rates for each gene, use the function {\tt qvalueObj}: <>= qval_obj <- qvalueObj(de_odp) qvals <- qval_obj$qvalues pvals <- qval_obj$pvalues lfdr <- qval_obj$lfdr pi0 <- qval_obj$pi0 @ The following sections of the manual go through various case studies for a more comprehensive overview of the \edge package. \section{Case study: static experiment} \label{sec:gibson} In the static sampling experiment, the arrays have been collected from distinct biological groups without respect to time. The goal is to identify genes that have a statistically significant difference in average expression across these distinct biological groups. The example data set that will be used in this section is the {\tt gibson} data set and it is a random subset of the data from \cite{gibson:2008}. The {\tt gibson} data set provides gene expression measurements in peripheral blood leukocyte samples from three Moroccan Amazigh groups leading distinct ways of life: desert nomadic (DESERT), mountain agrarian (VILLAGE), and coastal urban (AGADIR). We are interested in finding the genes that differentiate the Moroccan Amazigh groups the most. See \cite{gibson:2008} for additional information regarding the data. \subsection{Importing the data} To import the {\tt gibson} data use the {\tt data} function: <>= data(gibson) gibexpr <- gibson$gibexpr batch <- gibson$batch gender <- gibson$gender location <- gibson$location @ There are a few variables in the data set: {\tt batch}, {\tt gibexpr}, {\tt gender}, and {\tt location}. The three covariates of interest are {\tt gender}, {\tt batch} and {\tt location}. The biological variable is the {\tt location} variable, which contains information on where individuals are sampled: ``VILLAGE'', ``DESERT'' or ``AGADIR''. The {\tt gender} variable specifies whether the individual is a male or a female and there are two different {\tt batches} in the study. The {\tt gibexpr} variable contains the gene expression measurements. As an example, the expression values of the first gene are shown in Figure~\ref{fig:gplot}. In the figure, it appears that the individuals from ``VILLAGE'' are more expressed when compared to the other lifestyles. We should stop short of that observation because the data needs to be adjusted with the experimental models. Before that, the full and null models of the study needs to be carefully formulated. \begin{figure}[t] \centering <>= library(ggplot2) gender <- as.factor(as.matrix(gender) ) location = as.factor(as.matrix(location)) batch = as.factor(as.matrix(batch)) qplot(location, gibexpr[1,], geom="point", colour=gender:batch, xlab= "location", ylab="expression") @ \caption{Plot of gene 1 in the {\tt gibson} study.} \label{fig:gplot} \end{figure} \subsection{Creating the full and null models} In order to find deferentially expressed genes, there first needs to be a full and null model for the study. There are two ways to input the experimental models in \edgeNS: {\tt build\_models} and {\tt build\_study}. {\tt build\_study} should be used by users unfamiliar with formulating the full and null models but are familiar with the covariates in the study: <>= de_obj <- build_study(data = gibexpr, adj.var = cbind(gender, batch), grp = location, sampling = "static") @ {\tt adj.var} is for the adjustment variables, {\tt grp} is the variable containing the group assignments for each individual in the study and {\tt sampling} describes the type of experiment. Since {\tt gibson} is a static study, the {\tt sampling} argument will be ``static''. The {\tt grp} variable will be the {\tt location} variable and the adjustment variables are {\tt gender} and {\tt batch}. Alternatively, if the user is familiar with their full and null models in the study then {\tt build\_models} can be used to input the models directly: <>= cov <- data.frame(Gender = gender, Batch = batch, Location = location) null_model <- ~Gender + Batch null_model <- ~Gender + Batch + Location de_obj <- build_models(data = gibexpr, cov = cov, full.model = null_model, null.model = null_model) @ The {\tt cov} argument is a data frame of all the relevant covariates, {\tt full.model} and {\tt null.model} are the full and null models of the experiment, respectively. Notice that the models must be formatted as a formula and contain the same variable names as in the {\tt cov} data frame. The null model contains the {\tt gender} and {\tt batch} covariates and the full model includes the {\tt location} variable. Therefore, we are interested in testing whether the full model improves the model fit of a gene when compared to the null model. If it does not, then we can conclude that there is no significant difference between Moroccan Amazigh groups for this particular gene. The variable {\tt de\_obj} is an {\tt deSet} object that stores all the relevant experimental data. The {\tt deSet} object is discussed further in the next section. \subsection{The {\tt deSet} object} Once either {\tt build\_models} or {\tt build\_study} is used, an {\tt deSet} object is created. To view the slots contained in the object: <>= slotNames(de_obj) @ A description of each slot is listed below: \begin{itemize} \item {\tt full.model}: the full model of the experiment. Contains the biological variables of interest and the adjustment variables. \item {\tt null.model}: the null model of the experiment. Contains the adjustment variables in the experiment. \item {\tt full.matrix}: the full model in matrix form. \item {\tt null.matrix}: the null model in matrix form. \item {\tt individual}: variable that keeps track of individuals (if same individuals are sampled multiple times). \item {\tt qvalueObj}: {\tt qvalue} list. Contains p-values, q-values and local false discovery rates of the significance analysis. See the \href{http://www.bioconductor.org/packages/release/bioc/html/qvalue.html}{{\tt qvalue} package} for more details. \item {\tt ExpressionSet}: inherits the slots from {\tt ExpressionSet} object. \end{itemize} {\tt ExpressionSet} contains the expression measurements and other information from the experiment. The {\tt deSet} object inherits all the functions from an {\tt ExpressionSet} object. As an example, to access the expression values, one can use the function {\tt exprs} or to access the covariates, {\tt pData}: <>= gibexpr <- exprs(de_obj) cov <- pData(de_obj) @ The {\tt ExpressionSet} class is a widely used object in Bioconductor and more information can be found \href{http://www.bioconductor.org/packages/2.14/bioc/html/Biobase.html}{here}. See the section~\ref{sec:advanced} on {\tt ExpressionSet} to get a better understanding of how it integrates into the \edge framework. As an example of how to access the slots of {\tt de\_obj} suppose we are interested in viewing the full and null models. The models can be accessed by: <>= fullModel(de_obj) nullModel(de_obj) @ Next, we can extract the models in matrix form for computational analysis: <>= full_matrix <- fullMatrix(de_obj) null_matrix <- nullMatrix(de_obj) @ See {\tt ?deSet} for additional functions to access different slots of the {\tt deSet} object. \subsection{Fitting the data} \begin{figure}[t] \centering <>= library(ggplot2) library(reshape2) de_obj <- build_study(data = gibexpr, adj.var = cbind(gender, batch), grp = location, sampling = "static") ef_obj <- fit_models(de_obj) fitVals <- fitFull(ef_obj) fitVals0 <- fitNull(ef_obj) gender <- as.factor(gender) location = as.matrix(location) batch = as.factor(batch) df <- data.frame(batch = batch, gender = gender, location = location, null.model = fitVals0[1,], full.model = fitVals[1,], raw = exprs(de_obj)[1,]) df <- melt(data = df, id.vars=c("batch", "location", "gender")) ggplot(df, aes(location, value, color=gender:batch), xlab="location", ylab="expression") + geom_point() + facet_wrap(~variable) @ \caption{Plot of gene 1 in the {\tt gibson} study after applying the full and null model fit. The ``raw'' column are the expression values of the original data.} \label{fig:gplotFit} \end{figure} The {\tt fit\_models} function is an implementation of least squares using the full and null models: <>= ef_obj <- fit_models(de_obj, stat.type="lrt") @ The {\tt stat.type} argument specifies whether you want the {\tt odp} or {\tt lrt} fitted values. The difference between choosing ``odp'' and ``lrt'' is that ``odp'' centers the data by the null model fit which is necessary for downstream analysis in the optimal discovery procedure. {\tt fit\_models} creates another object with the following slots: \begin{itemize} \item {\tt fit.full}: fitted values from the full model. \item {\tt fit.null}: fitted values from null model. \item {\tt res.full}: residuals from the full model. \item {\tt res.null}: residuals from the null model. \item {\tt dH.full}: diagonal elements in the projection matrix for the full model. \item {\tt beta.coef}: the coefficients for the full model. \item {\tt stat.type}: statistic type used, either ``odp'' or ``lrt''. \end{itemize} To access the fitted coefficients of the full model in {\tt ef\_obj}: <>= betaCoef(ef_obj) @ To access the full and null residuals: <>= alt_res <- resFull(ef_obj) null_res <- resNull(ef_obj) @ To access the fitted values: <>= alt_fitted <- fitFull(ef_obj) null_fitted <- fitNull(ef_obj) @ See {\tt ?deFit} for more details on accessing the slots in an {\tt deFit} object. The fitted values of the first gene are shown in Figure~\ref{fig:gplotFit}. The null model fit is the average expression value across the interaction of {\tt batch} and {\tt sex}. The full model fit seems to pick up some differences relative to the null model. Next, we have to test whether the observed differences between the model fits are significant. \subsection{Significance analysis} Interpreting the models in a hypothesis test is very intuitive: Does the full model better fit the data when compared to the null model? For the fitted values of the first gene plotted in Figure~\ref{fig:gplotFit}, it seems that the full model fits the data better than the null model. In order to conclude that it is significant, we need to calculate the p-value. The user can use either the optimal discovery procedure or likelihood ratio test. \subsubsection{Likelihood ratio test} The {\tt lrt} function performs a likelihood ratio test to determine p-values: <>= de_lrt <- lrt(de_obj, nullDistn="normal") @ If the null distribution, {\tt nullDistn}, is calculated using ``bootstrap'' then residuals from the full model are re-sampled and added to the null model to simulate a distribution where there is no differential expression. Otherwise, the default input is ``normal'' and the assumption is that the null statistics follow a F-distribution. See {\tt ?lrt} for additional arguments. \subsubsection{Optimal discovery procedure} {\tt odp} performs the optimal discovery procedure, which is a new approach developed by \cite{storey:etal:2007} for optimally performing many hypothesis tests in a high-dimensional study. When testing a feature, information from all the features is utilized when testing for significance of a feature. It guarantees to maximize the number of expected true positive results for each fixed number of expected false positive results which is related to the false discovery rate. The optimal discovery procedure can be implemented on {\tt de\_obj} by the {\tt odp} function: <>= de_odp <- odp(de_obj, bs.its = 50, verbose = FALSE, n.mods = 50) @ The number of bootstrap iterations is controlled by {\tt bs.its}, {\tt verbose} prints each bootstrap iteration number and {\tt n.mods} is the number of clusters in the k-means algorithm. A k-means algorithm is used to assign genes to groups in order to speed up the computational time of the algorithm. If {\tt n.mods} is equal to the number of genes then the original optimal discovery procedure is used. Depending on the number of genes, this setting can take a very long time. Therefore, it is recommended to use a small {\tt n.mods} value to substantially decrease the computational time. In \cite{woo:leek:storey:2011}, it is shown that assigning {\tt n.mods} to about 50 will cause a negligible loss in power. Type {\tt ?odp} for more details on the algorithm. \subsection{Significance results} \begin{figure}[t] \centering <>= sig_results <- qvalueObj(de_odp) hist(sig_results) @ \caption{Applying the function {\tt hist} to the slot {\tt qvalueObj} in the {\tt gibson} data set. Function is derived from the {\tt qvalue} package.} \label{fig:gqvalHist} \end{figure} The {\tt summary} function can be used on an {\tt deSet} object to give an overview of the analysis: <>= summary(de_odp) @ There are three core summaries: {\tt ExpressionSet} summary, \edge analysis and statistical significance summary. The {\tt ExpressionSet} summary shows a summary of the {\tt ExpressionSet} object. \edge analysis shows an overview of the models used and other information about the data set. The significance analysis shows the proportion of null genes, $\pi_{0}$, and significant genes at various cutoffs in terms of p-values, q-values and local false discovery rates. The function {\tt qvalueObj} can be used on {\tt de\_odp} to extract the significance results: <>= sig_results <- qvalueObj(de_odp) @ The object {\tt sig\_results} is a list with the following slots: <<>>= names(sig_results) @ The key variables are {\tt pi0}, {\tt pvalues}, {\tt lfdr} and {\tt qvalues}. The {\tt pi0} variable provides an estimate of the proportion of null p-values, {\tt pvalues} are the p-values, {\tt qvalues} are the estimated q-values and {\tt lfdr} are the local false discovery rates. Using the function {\tt hist} on {\tt sig\_results} will produce a p-value histogram along with the density curves of q-values and local false discovery rate values: <>= hist(sig_results) @ The plot is shown in Figure~\ref{fig:gqvalHist}. To extract the p-values, q-values, local false discovery rates and the $\pi_{0}$ estimate: <<>>= pvalues <- sig_results$pvalues qvalues <- sig_results$qvalues lfdr <- sig_results$lfdr pi0 <- sig_results$pi0 @ Making significance decisions based on p-values in multiple hypothesis testing problems can lead to accepting a lot of false positives in the study. Instead, using q-values to determine significant genes is recommended because it controls the false discovery rate. Q-values measure the proportion of false positives incurred when calling a particular test significant. For example, to complete our analysis of gene 1 in this example, lets view the q-value estimate: <>= qvalues[1] @ So for this particular gene, the q-value is \Sexpr{qvalues[1]}. If we consider a false discovery rate cutoff of 0.1 then this gene is significant. Therefore, the observed differences observed in Figure~\ref{fig:gplotFit} are significant so this particular gene is differentially expressed between locations. To get a list of all the significant genes at a false discovery rate cutoff of 0.01: <>= fdr.level <- 0.01 sigGenes <- qvalues < fdr.level @ View the \href{http://www.bioconductor.org/packages/release/bioc/html/qvalue.html}{{\tt qvalue} vignette} to get a more thorough discussion in how to use p-values, q-values, $\pi_{0}$ estimate and local false discovery rates to determine significant genes. \section{Case study: independent time course experiment} \label{sec:kidney} In the independent time course study, the arrays have been sampled with respect to time from one biological group and the goal is to identify genes that show ``within-class temporal differential expression'', i.e., genes that show statistically significant changes in expression over time. The example data set used in this section is a kidney data set by \cite{rodwell:2004}. Gene expression measurements, from cortex and medulla samples in the kidney, were obtained from 72 human subjects ranging in age from 27 to 92 years. Only one array was obtained per sample and the age and tissue type of each subject was recorded. See \cite{rodwell:2004} for additional information regarding the data set. \subsection{Importing the data} \begin{figure}[t] \centering <>= library(ggplot2) data(kidney) age <- kidney$age sex <- kidney$sex kidexpr <- kidney$kidexpr qplot(age, kidexpr[5,], geom="point", colour=sex, xlab= "age", ylab="expression") @ \caption{Plot of gene 5 in the {\tt kidney} study.} \label{fig:kplot} \end{figure} To import the {\tt kidney} data use the {\tt data} function: <>= data(kidney) age <- kidney$age sex <- kidney$sex kidexpr <- kidney$kidexpr @ There are a few covariates in this data set: {\tt sex}, {\tt age}, and {\tt kidexpr}. The two main covariates of interest are the {\tt sex} and {\tt age} covariates. The {\tt sex} variable is whether the subject was male or female and the {\tt age} variable is the age of the patients. {\tt kidexpr} contains the gene expression values for the study. As an example of a gene in the study, the expression values of the fifth gene are shown in Figure~\ref{fig:kplot}. It is very difficult to find a trend for this particular gene. Instead, we need to adjust the data with the models in the study which is discussed in the next section. \subsection{Creating the full and null models} In order to find differentially expressed genes, the full and null model for the study need to be formulated. There are two ways to input the experimental models in \edgeNS: {\tt build\_models} and {\tt build\_study}. {\tt build\_study} should be used by users unfamiliar with formulating the full and null models but are familiar with the covariates in the study: <>= de_obj <- build_study(data = kidexpr, adj.var = sex, tme = age, sampling = "timecourse", basis.df = 4) @ {\tt adj.var} is for the adjustment variables, {\tt tme} is the time variable, {\tt basis.df} is the degrees of freedom for the spline fit, and {\tt sampling} describes the type of experiment. Since {\tt kidney} is a time course study, the {\tt sampling} argument will be ``timecourse''. The {\tt tme} variable will be the {\tt age} variable, {\tt basis.df} will be 4 based on previous work by \cite{storey:2005} and the adjustment variable is {\tt sex}. To view the models generated by {\tt build\_study}: <>= fullModel(de_obj) nullModel(de_obj) @ Notice that the difference between the full and null model is the natural spline fit of the {\tt age} variable. If we look at Figure~\ref{fig:kplot}, it becomes evident that a spline curve can be used to approximate the fit of the data, and 4 degrees of freedom is chosen based on previous analysis of the expression patterns. See \cite{storey:2005} for a detailed discussion on modelling in time course studies. Alternatively, if the user is familiar with their full and null models in the study then {\tt build\_models} can be used to input the models directly: <>= library(splines) cov <- data.frame(sex = sex, age = age) null_model <- ~sex null_model <- ~sex + ns(age, df=4) de_obj <- build_models(data = kidexpr, cov = cov, full.model = null_model, null.model = null_model) @ The {\tt cov} argument is a data frame of all the relevant covariates, {\tt full.model} and {\tt null.model} are the full and null models of the experiment, respectively. Notice that the models must be formatted as a formula and contain the same variable names as in the {\tt cov} data frame. The null model contains the {\tt sex} covariate and the full model includes the {\tt age} variable. Therefore, we are interested in testing whether the full model improves the model fit of a gene significantly when compared to the null model. If it does not, then we can conclude that there is no significant difference in the gene as it ages in the cortex. The variable {\tt de\_obj} is an {\tt deSet} object that stores all the relevant experimental data. The {\tt deSet} object is discussed further in the next section. \subsection{The {\tt deSet} object} \label{subsec:deSet} Once either {\tt build\_models} or {\tt build\_study} is used, an {\tt deSet} object is created. To view the slots contained in the object: <<>>= slotNames(de_obj) @ A description of each slot is listed below: \begin{itemize} \item {\tt full.model}: the full model of the experiment. Contains the biological variables of interest and the adjustment variables. \item {\tt null.model}: the null model of the experiment. Contains the adjustment variables in the experiment. \item {\tt full.matrix}: the full model in matrix form. \item {\tt null.matrix}: the null model in matrix form. \item {\tt individual}: variable that keeps track of individuals (if same individuals are sampled multiple times). \item {\tt qvalueObj}: {\tt qvalue} list. Contains p-values, q-values and local false discovery rates of the significance analysis. See the \href{http://www.bioconductor.org/packages/release/bioc/html/qvalue.html}{{\tt qvalue} package} for more details. \item {\tt ExpressionSet}: inherits the slots from {\tt ExpressionSet} object. \end{itemize} {\tt ExpressionSet} contains the expression measurements and other information from the experiment. The {\tt deSet} object inherits all the functions from an {\tt ExpressionSet} object. As an example, to access the expression values, one can use the function {\tt exprs} or to access the covariates, {\tt pData}: <<>>= gibexpr <- exprs(de_obj) cov <- pData(de_obj) @ The {\tt ExpressionSet} class is a widely used object in Bioconductor and more information can be found \href{http://www.bioconductor.org/packages/2.14/bioc/html/Biobase.html}{here}. See the section~\ref{sec:advanced} on {\tt ExpressionSet} to get a better understanding of how it integrates into the \edge framework. As an example of how to access the slots of {\tt de\_obj} suppose we are interested in viewing the full and null models. The models can be accessed by: <>= fullModel(de_obj) nullModel(de_obj) @ Next, we can extract the models in matrix form for computational analysis: <>= full_matrix <- fullMatrix(de_obj) null_matrix <- nullMatrix(de_obj) @ See {\tt ?deSet} for additional functions to access different slots of the {\tt deSet} object. \subsection{Fitting the data} \begin{figure}[t] \centering <>= library(ggplot2) library(reshape2) de_obj <- build_study(data = kidexpr, adj.var = sex, tme = age, sampling = "timecourse", basis.df=4) ef_obj <- fit_models(de_obj) fitVals <- fitFull(ef_obj) fitVals0 <- fitNull(ef_obj) df <- data.frame(age= age, sex=sex, null.model = fitVals0[5,], full.model = fitVals[5,], raw = exprs(de_obj)[5,]) df <- melt(data = df, id.vars=c("age", "sex")) ggplot(df, aes(age, value, color=sex), xlab="age", ylab="expression") + geom_point() + facet_wrap(~variable) @ \caption{Plot of gene 5 in the {\tt kidney} study after applying the full and null model fit. The ``raw'' column are the expression values of the original data.} \label{fig:kplotFit} \end{figure} The {\tt fit\_models} function is an implementation of least squares using the full and null models: <>= ef_obj <- fit_models(de_obj, stat.type="lrt") @ The {\tt stat.type} argument specifies whether you want the {\tt odp} or {\tt lrt} fitted values. The difference between choosing ``odp'' and ``lrt'' is that ``odp'' centers the data by the null model fit which is necessary for downstream analysis in the optimal discovery procedure. {\tt fit\_models} creates another object with the following slots: \begin{itemize} \item {\tt fit.full}: fitted values from the full model. \item {\tt fit.null}: fitted values from null model. \item {\tt res.full}: residuals from the full model. \item {\tt res.null}: residuals from the null model. \item {\tt dH.full}: diagonal elements in the projection matrix for the full model. \item {\tt beta.coef}: the coefficients for the full model. \item {\tt stat.type}: statistic type used, either ``odp'' or ``lrt''. \end{itemize} To access the fitted coefficients of the full model in {\tt ef\_obj}: <>= betaCoef(ef_obj) @ To access the full and null residuals: <>= alt_res <- resFull(ef_obj) null_res <- resNull(ef_obj) @ To access the fitted values: <>= alt_fitted <- fitFull(ef_obj) null_fitted <- fitNull(ef_obj) @ See {\tt ?deFit} for more details on accessing the slots in a {\tt deFit} object. The fitted values of the fifth gene are shown in Figure~\ref{fig:kplotFit}. The null model fit is the average expression. It appears that the full model fits the raw data better than the null model. Next, we have to test whether the observed differences between the model fits are significant. \subsection{Significance analysis} Interpreting the models in a hypothesis test is very intuitive: Does the full model better fit the data when compared to the null model? For the fitted values of the fifth gene plotted in Figure~\ref{fig:kplotFit}, it seems that the full model fits the data better than the null model. In order to conclude it is significant, we need to calculate the p-value. The user can use either the optimal discovery procedure or likelihood ratio test. \subsubsection{Likelihood ratio test} The {\tt lrt} function performs a likelihood ratio test to determine p-values: <>= library(splines) @ <>= de_lrt <- lrt(de_obj, nullDistn="normal") @ If the null distribution, {\tt nullDistn}, is calculated using ``bootstrap'' then residuals from the full model are re-sampled and added to the null model to simulate a distribution where there is no differential expression. Otherwise, the default input is ``normal'' and the assumption is that the null statistics follow a F-distribution. See {\tt ?lrt} for additional arguments. \subsubsection{Optimal discovery procedure} {\tt odp} performs the optimal discovery procedure, which is a new approach developed by \cite{storey:2005} for optimally performing many hypothesis tests in a high-dimensional study. When testing a feature, information from all the features is utilized when testing for significance of a feature. It guarantees to maximize the number of expected true positive results for each fixed number of expected false positive results which is related to the false discovery rate. The optimal discovery procedure can be implemented on {\tt de\_obj} by the {\tt odp} function: <>= library(splines) @ <>= de_odp <- odp(de_obj, bs.its = 50, verbose = FALSE, n.mods = 50) @ The number of bootstrap iterations is controlled by {\tt bs.its}, {\tt verbose} prints each bootstrap iteration number and {\tt n.mods} is the number of clusters in the k-means algorithm. A k-means algorithm is used to assign genes to groups in order to speed up the computational time of the algorithm. If {\tt n.mods} is equal to the number of genes then the original optimal discovery procedure is used. Depending on the number of genes, this setting can take a very long time. Therefore, it is recommended to use a small {\tt n.mods} value to substantially decrease the computational time. In \cite{woo:leek:storey:2011}, it is shown that assigning {\tt n.mods} to about 50 will cause a negligible loss in power. Type {\tt ?odp} for more details on the algorithm. \subsection{Significance results} The {\tt summary} function can be used on an {\tt deSet} object to give an overview of the analysis: <>= summary(de_odp) @ There are three core summaries: {\tt ExpressionSet} summary, \edge analysis and statistical significance summary. The {\tt ExpressionSet} summary shows a summary of the {\tt ExpressionSet} object. \edge analysis shows an overview of the models used and other information about the data set. The significance analysis shows the proportion of null genes, $\pi_{0}$, and significant genes at various cutoffs in terms of p-values, q-values and local false discovery rates. The function {\tt qvalueObj} can be used on {\tt de\_odp} to extract the significance results: <<>>= sig_results <- qvalueObj(de_odp) @ The object {\tt sig\_results} is a list with the following slots: <<>>= names(sig_results) @ The key variables are {\tt pi0}, {\tt pvalues}, {\tt lfdr} and {\tt qvalues}. The {\tt pi0} variable provides an estimate of the proportion of null p-values, {\tt pvalues} are the p-values, {\tt qvalues} are the estimated q-values and {\tt lfdr} are the local false discovery rates. Using the function {\tt hist} on {\tt sig\_results} will produce a p-value histogram along with the density curves of q-values and local false discovery rate values: <>= hist(sig_results) @ \begin{figure}[t] \centering <>= hist(sig_results) @ \caption{Applying the function {\tt hist} to the slot {\tt qvalueObj} in the {\tt kidney} data set. Function is derived from the {\tt qvalue} package.} \label{fig:kqvalHist} \end{figure} The plot is shown in Figure~\ref{fig:kqvalHist}. To extract the p-values, q-values, local false discovery rates and the $\pi_{0}$ estimate: <>= pvalues <- sig_results$pvalues qvalues <- sig_results$qvalues lfdr <- sig_results$lfdr pi0 <- sig_results$pi0 @ Making significance decisions based on p-values in multiple hypothesis testings problems can lead to accepting a lot of false positives in the study. Instead, using q-values to determine significant genes is recommended because it controls the false discovery rate at a level {\tt alpha}. Q-values measure the proportion of false positives incurred when calling a particular test significant. For example, to complete our analysis of gene 5 in this example, lets view the q-value estimate: <>= qvalues[5] @ So for this particular gene, the q-value is \Sexpr{qvalues[5]}. If we consider a false discovery rate cutoff of 0.1 then this gene is not significant. Therefore, the observed differences observed in Figure~\ref{fig:kplotFit} are not significant so this particular gene is not differentially expressed as the kidney ages. To get a list of all the significant genes at a false discovery rate cutoff of 0.1: <>= fdr.level <- 0.1 sigGenes <- qvalues < fdr.level @ View the \href{http://www.bioconductor.org/packages/release/bioc/html/qvalue.html}{{\tt qvalue} vignette} to get a more thorough discussion in how to use p-values, q-values, $\pi_{0}$ estimate and local false discovery rates to determine significant genes. \section{Case study: longitudinal time course experiment} \label{sec:endotoxin} In the longitudinal time course study, the goal is to identify genes that show ``between-class temporal differential expression'', i.e., genes that show statistically significant differences in expression over time between the various groups. The {\tt endotoxin} data set provides gene expression measurements in an endotoxin study where four subjects were given endotoxin and four subjects were given a placebo. Blood samples were collected and leukocytes were isolated from the samples before infusion. Measurements were recorded at times 2, 4, 6, 9, 24 hours. We are interested in identifying genes that vary over time between the endotoxin and control groups. See \cite{calvano:2005} for more details regarding the {\tt endotoxin} dataset. \subsection{Importing the data} To import the {\tt endotoxin} data use the {\tt data} function: <>= data(endotoxin) endoexpr <- endotoxin$endoexpr class <- endotoxin$class ind <- endotoxin$ind time <- endotoxin$time @ There are a few covariates in this data set: {\tt endoexpr}, {\tt class}, {\tt ind}, and {\tt time}. There are 8 individuals in the experiment ({\tt ind}) that were sampled at multiple time points ({\tt time}) that were either ``endotoxin'' or ``control'' ({\tt class}). The {\tt endoexpr} variable contains the expression values of the experiment. To show an example gene, the expression values of the second gene are shown in Figure~\ref{fig:eplot}. It is very difficult to find a trend for this particular gene. Instead, we need to adjust the data with the models in the study. \begin{figure}[t] \centering <>= library(ggplot2) qplot(time, endoexpr[2,], geom="point", colour=class, xlab= "time (hours)", ylab="expression") @ \caption{Plot of gene 2 in the {\tt endotoxin} study.} \label{fig:eplot} \end{figure} \subsection{Creating the full and null models} In order to find differentially expressed genes, there first needs to be an full and null model for the study. There are two ways to input the experimental models in \edgeNS: {\tt build\_models} and {\tt build\_study}. {\tt build\_study} should be used by users unfamiliar with formulating the full and null models but are familiar with the covariates in the study: <>= de_obj <- build_study(data = endoexpr, grp = class, tme = time, ind = ind, sampling = "timecourse") @ {\tt grp} is for the variable which group each individual belongs to, {\tt tme} is the time variable, {\tt ind} is used when individuals are sampling multiple times and {\tt sampling} describes the type of experiment. Since {\tt endotoxin} is a time course study, the {\tt sampling} argument will be ``timecourse''. The {\tt tme} variable will be the {\tt time} variable, {\tt ind} is the {\tt individuals} variable and the {\tt grp} variable is {\tt class}. To view the models created by {\tt build\_study}: <>= fullModel(de_obj) nullModel(de_obj) @ See \cite{storey:2005} for how the models in the {\tt endotoxin} experiment are formed. Alternatively, if the user is familiar with their full and null models in the study then {\tt build\_models} can be used to input the models directly: <>= cov <- data.frame(ind = ind, tme = time, grp = class) null_model <- ~grp + ns(tme, df = 2, intercept = FALSE) null_model <- ~grp + ns(tme, df = 2, intercept = FALSE) + (grp):ns(tme, df = 2, intercept = FALSE) de_obj <- build_models(data = endoexpr, cov = cov, full.model = null_model, null.model = null_model) @ The {\tt cov} argument is a data frame of all the relevant covariates, {\tt full.model} and {\tt null.model} are the full and null models of the experiment, respectively. Notice that the models must be formatted as a formula and contain the same variable names as in the {\tt cov} data frame. We are interested in testing whether the full model improves the model fit of a gene significantly when compared to the null model. If it does not, then we can conclude that there is no significant difference in this gene between the endotoxin and the control as time goes on. The variable {\tt de\_obj} is an {\tt deSet} object that stores all the relevant experimental data. The {\tt deSet} object is discussed further in the next section. \subsection{The {\tt deSet} object} Once either {\tt build\_models} or {\tt build\_study} is used, an {\tt deSet} object is created. To view the slots contained in the object: <>= slotNames(de_obj) @ A description of each slot is listed below: \begin{itemize} \item {\tt full.model}: the full model of the experiment. Contains the biological variables of interest and the adjustment variables. \item {\tt null.model}: the null model of the experiment. Contains the adjustment variables in the experiment. \item {\tt full.matrix}: the full model in matrix form. \item {\tt null.matrix}: the null model in matrix form. \item {\tt individual}: variable that keeps track of individuals (if same individuals are sampled multiple times). \item {\tt qvalueObj}: {\tt qvalue} list. Contains p-values, q-values and local false discovery rates of the significance analysis. See the \href{http://www.bioconductor.org/packages/release/bioc/html/qvalue.html}{{\tt qvalue} package} for more details. \item {\tt ExpressionSet}: inherits the slots from {\tt ExpressionSet} object. \end{itemize} {\tt ExpressionSet} contains the expression measurements and other information from the experiment. The {\tt deSet} object inherits all the functions from an {\tt ExpressionSet} object. As an example, to access the expression values, one can use the function {\tt exprs} or to access the covariates, {\tt pData}: <<>>= gibexpr <- exprs(de_obj) cov <- pData(de_obj) @ The {\tt ExpressionSet} class is a widely used object in Bioconductor and more information can be found \href{http://www.bioconductor.org/packages/2.14/bioc/html/Biobase.html}{here}. See the section~\ref{sec:advanced} on {\tt ExpressionSet} to get a better understanding of how it integrates into the \edge framework. As an example of how to access the slots of {\tt de\_obj} suppose we are interested in viewing the full and null models. The models can be accessed by: <>= fullModel(de_obj) nullModel(de_obj) @ Next, we can extract the models in matrix form for computational analysis: <>= full_matrix <- fullMatrix(de_obj) null_matrix <- nullMatrix(de_obj) @ See {\tt ?deSet} for additional functions to access different slots of the {\tt deSet} object. \subsection{Fitting the data} The {\tt fit\_models} function is an implementation of least squares using the full and null models: <>= ef_obj <- fit_models(de_obj, stat.type="lrt") @ \begin{figure}[t] \centering <>= library(ggplot2) library(reshape2) #endoexpr <- endoexpr - rowMeans(endoexpr) de_obj <- build_study(data = endoexpr, grp = class, tme = time, ind = ind, sampling = "timecourse", basis.df=2) ef_obj <- fit_models(de_obj) fitVals <- fitFull(ef_obj) fitVals0 <- fitNull(ef_obj) id=2 df <- data.frame(class= class, tme=time, ind=ind, null.model = fitVals0[id,], full.model = fitVals[id,]) df <- melt(data = df, id.vars=c("class", "tme", "ind")) ggplot(df, aes(tme, value, color=class), xlab="time (hours)", ylab="expression") + geom_smooth(se=FALSE) + facet_wrap(~variable) @ \caption{Plot of gene 2 in the {\tt endotoxin} study after applying the full and null model fit. The ``raw'' column is the expression values of the original data.} \label{fig:eplotFit} \end{figure} The {\tt stat.type} argument specifies whether you want the {\tt odp} or {\tt lrt} fitted values. The difference between choosing ``odp'' and ``lrt'' is that ``odp'' centers the data by the null model fit which is necessary for downstream analysis in the optimal discovery procedure. {\tt fit\_models} creates another object with the following slots: \begin{itemize} \item {\tt fit.full}: fitted values from the full model. \item {\tt fit.null}: fitted values from null model. \item {\tt res.full}: residuals from the full model. \item {\tt res.null}: residuals from the null model. \item {\tt dH.full}: diagonal elements in the projection matrix for the full model. \item {\tt beta.coef}: the coefficients for the full model. \item {\tt stat.type}: statistic type used, either ``odp'' or ``lrt''. \end{itemize} To access the fitted coefficients of the full model in {\tt ef\_obj}: <>= betaCoef(ef_obj) @ To access the full and null residuals: <>= alt_res <- resFull(ef_obj) null_res <- resNull(ef_obj) @ To access the fitted values: <>= alt_fitted <- fitFull(ef_obj) null_fitted <- fitNull(ef_obj) @ See {\tt ?deFit} for more details on accessing the slots in an {\tt deFit} object. The fitted values of the second gene are shown in Figure~\ref{fig:eplotFit}. It appears that the full model fits a pattern that might be observed in the raw data. Next, we have to test whether the observed differences between the model fits are significant. \subsection{Significance analysis} Interpreting the models in a hypothesis test is very intuitive: Does the full model better fit the data when compared to the null model? For the fitted values of the second gene plotted in Figure~\ref{fig:eplotFit}, it seems that the full model fits the data better than the null model. In order to conclude it is significant, we need to calculate the p-value. The user can use either the optimal discovery procedure or likelihood ratio test. \subsubsection{Likelihood ratio test} The {\tt lrt} function performs a likelihood ratio test to determine p-values: <>= de_lrt <- lrt(de_obj, nullDistn="normal") @ If the null distribution, {\tt nullDistn}, is calculated using ``bootstrap'' then residuals from the full model are re-sampled and added to the null model to simulate a distribution where there is no differential expression. Otherwise, the default input is ``normal'' and the assumption is that the null statistics follow a F-distribution. See {\tt ?lrt} for additional arguments. \subsubsection{Optimal discovery procedure} {\tt odp} performs the optimal discovery procedure, which is a new approach developed by \cite{storey:2005} for optimally performing many hypothesis tests in a high-dimensional study. When testing a feature, information from all the features is utilized when testing for significance of a feature. It guarantees to maximize the number of expected true positive results for each fixed number of expected false positive results which is related to the false discovery rate. The optimal discovery procedure can be implemented on {\tt de\_obj} by the {\tt odp} function: <>= de_odp <- odp(de_obj, bs.its = 50, verbose = FALSE, n.mods = 50) @ The number of bootstrap iterations is controlled by {\tt bs.its}, {\tt verbose} prints each bootstrap iteration number and {\tt n.mods} is the number of clusters in the k-means algorithm. A k-means algorithm is used to assign genes to groups in order to speed up the computational time of the algorithm. If {\tt n.mods} is equal to the number of genes then the original optimal discovery procedure is used. Depending on the number of genes, this setting can take a very long time. Therefore, it is recommended to use a small {\tt n.mods} value to substantially decrease the computational time. In \cite{woo:leek:storey:2011}, it is shown that assigning {\tt n.mods} to about 50 will cause a negligible loss in power. Type {\tt ?odp} for more details on the algorithm. \subsection{Significance results} The {\tt summary} function can be used on an {\tt deSet} object to give an overview of the analysis: <>= summary(de_odp) @ There are three core summaries: {\tt ExpressionSet} summary, \edge analysis and statistical significance summary. The {\tt ExpressionSet} summary shows a summary of the {\tt ExpressionSet} object. \edge analysis shows an overview of the models used and other information about the data set. The significance analysis shows the proportion of null genes, $\pi_{0}$, and significant genes at various cutoffs in terms of p-values, q-values and local false discovery rates. The function {\tt qvalueObj} can be used on {\tt de\_odp} to extract the significance results: <>= sig_results <- qvalueObj(de_odp) @ The object {\tt sig\_results} is a list with the following slots: <>= names(sig_results) @ \begin{figure}[t] \centering <>= hist(sig_results) @ \caption{Applying the function {\tt hist} to the slot {\tt qvalueObj} in the {\tt endotoxin} data set. Function is derived from the {\tt qvalue} package.} \label{fig:eqvalHist} \end{figure} The key variables are {\tt pi0}, {\tt pvalues}, {\tt lfdr} and {\tt qvalues}. The {\tt pi0} variable provides an estimate of the proportion of null p-values, {\tt pvalues} are the p-values, {\tt qvalues} are the estimated q-values and {\tt lfdr} are the local false discovery rates. Using the function {\tt hist} on {\tt sig\_results} will produce a p-value histogram along with the density curves of q-values and local false discovery rate values: <>= hist(sig_results) @ The plot is shown in Figure~\ref{fig:eqvalHist}. To extract the p-values, q-values, local false discovery rates and the $\pi_{0}$ estimate: <>= pvalues <- sig_results$pvalues qvalues <- sig_results$qvalues lfdr <- sig_results$lfdr pi0 <- sig_results$pi0 @ Making significance decisions based on p-values in multiple hypothesis testings problems can lead to accepting a lot of false positives in the study. Instead, using q-values to determine significant genes is recommended because it controls the false discovery rate at a level {\tt alpha}. Q-values measure the proportion of false positives incurred when calling a particular test significant. For example, to complete our analysis of gene 2 in this example, lets view the q-value estimate: <>= qvalues[2] @ So for this particular gene, the q-value is \Sexpr{qvalues[2]}. If we consider a false discovery rate cutoff of 0.1 then this gene is not significant. Therefore, the observed differences observed in Figure~\ref{fig:eplotFit} are not significant so this particular gene is not differentially expressed between {\tt class} as time varies. To get a list of all the significant genes at a false discovery rate cutoff of 0.1: <>= fdr.level <- 0.1 sigGenes <- qvalues < fdr.level @ View the \href{http://www.bioconductor.org/packages/release/bioc/html/qvalue.html}{{\tt qvalue} vignette} to get a more thorough discussion in how to use p-values, q-values, $\pi_{0}$ estimate and local false discovery rates to determine significant genes. \section{{\tt sva}: Surrogate variable analysis} \label{sec:SVA} The {\tt sva} package is useful for removing batch effects or any unwanted variation in an experiment. It does this by forming surrogate variables to adjust for sources of unknown variation. Details on the algorithm can be found in \cite{leek:2007}. \edge uses the {\tt sva} package in the function {\tt apply\_sva}. Suppose we are working with the {\tt kidney} data in~\ref{sec:kidney}, then the first step is to create an {\tt deSet} object by either using {\tt build\_models} or {\tt build\_study}: <>= library(splines) cov <- data.frame(sex = sex, age = age) null_model <- ~sex full_model <- ~sex + ns(age, df=4) de_obj <- build_models(data = kidexpr, cov = cov, full.model = full_model, null.model = null_model) @ To find the surrogate variables and add them to the experimental models in {\tt de\_obj}, use the function {\tt apply\_sva}: <>= de_sva <- apply_sva(de_obj, n.sv = 3, B = 10) @ {\tt n.sv} is the number of surrogate variables and {\tt B} is the number of bootstraps. See {\tt ?apply\_sva} for additional arguments. To see the terms that have been added to the models: <>= fullModel(de_sva) nullModel(de_sva) @ The variables SV1, SV2 and SV3 are the surrogate variables formed by {\tt sva}. To access the surrogate variables: <>= cov <- pData(de_sva) names(cov) surrogate.vars <- cov[, 3:ncol(cov)] @ Now {\tt odp} or {\tt lrt} can be used as in previous examples: <>= de_odp <- odp(de_sva, bs.its = 50, verbose = FALSE) de_lrt <- lrt(de_sva, verbose = FALSE) summary(de_odp) @ And to extract the $\pi_{0}$ estimate, q-values, local false discovery rates and p-values: <>== qval_obj <- qvalueObj(de_odp) qvals <- qval_obj$qvalues lfdr <- qval_obj$lfdr pvals <- qval_obj$pvalues pi0 <- qval_obj$pi0 @ \section{{\tt snm}: Supervised normalization of microarray data} \label{sec:snm} There has been a lot of work done on separating signal from confounding factors, but a lot of algorithms fail to consider both the models of the study and the technical factors such as batch or array processing date. The {\tt snm} package allows for supervised normalization of microarrays on a gene expression matrix. It takes into account both the experimental models and other technical factors in the experiments. Details on the algorithm can be found in \cite{mecham:2010}. The {\tt snm} package is implemented in the {\tt apply\_snm} function. Continuing the analysis on the {\tt kidney} study in~\ref{sec:kidney}: <>= library(splines) cov <- data.frame(sex = sex, age = age) null_model <- ~sex full_model <- ~sex + ns(age, df=4) de_obj <- build_models(data = kidexpr, cov = cov, full.model = full_model, null.model = null_model) @ Now that we have created {\tt de\_obj}, we can adjust for additional array effects, dye effects and other intensity-dependent effects. In this example, we created array effects that are not existent in the real data set in order to show how to use the function: <>= int.var <- data.frame(array.effects = as.factor(1:72)) de_snm <- apply_snm(de_obj, int.var = int.var, num.iter = 2, diagnose = FALSE, verbose = FALSE) @ The {\tt int.var} is where the data frame of intensity-dependent effects are inputted, {\tt diagnose} is a flag to let the software know whether to produce diagnostic plots. Additional arguments can be found by typing {\tt ?apply\_snm}. Once the data has been normalized, we can access the normalized matrix by using {\tt exprs}: <>= norm.matrix <- exprs(de_obj) @ To run the significance analysis, {\tt odp} or {\tt lrt} can be used: <>= de_odp <- odp(de_snm, bs.its = 50, verbose=FALSE) summary(de_odp) @ And to extract the $\pi_{0}$ estimate, q-values, local false discovery rates and p-values: <>== qval_obj <- qvalueObj(de_odp) qvals <- qval_obj$qvalues lfdr <- qval_obj$lfdr pvals <- qval_obj$pvalues pi0 <- qval_obj$pi0 @ \section{{\tt qvalue}: Estimate the q-values} \label{sec:qvalue} After {\tt odp} or {\tt lrt} is used, the user may wish to change some parameters used when calculating the q-values. This can be done by using the {\tt apply\_qvalue} function. Lets review the analysis process for the {\tt kidney} dataset in~\ref{sec:kidney}: create the full and null models and then run {\tt odp} or {\tt lrt} to get significance results. Applying these steps in the {\tt kidney} dataset: <>= # create models library(splines) cov <- data.frame(sex = sex, age = age) null_model <- ~sex full_model <- ~sex + ns(age, df=4) de_obj <- build_models(data = kidexpr, cov = cov, full.model = full_model, null.model = null_model) # run significance analysis de_obj <- odp(de_obj, bs.its = 50, verbose = FALSE) @ Suppose we wanted to estimate $\pi_{0}$ using the ``bootstrap'' method in {\tt qvalue} (see \href{http://www.bioconductor.org/packages/release/bioc/html/qvalue.html}{{\tt qvalue} vignette} for more details): <>= old_pi0est <- qvalueObj(de_obj)$pi0 de_obj <- apply_qvalue(de_obj, pi0.method = "bootstrap") new_pi0est <- qvalueObj(de_obj)$pi0 @ <>= print(data.frame(old_pi0est = old_pi0est, new_pi0est = new_pi0est)) @ In this case, there is a difference between using the ``smoother'' method and ``bootstrap'' method. See {\tt apply\_qvalue} for additional arguments. \section{Advanced topic: Using the ExpressionSet object} \label{sec:advanced} \edge was designed for complementing {\tt ExpressionSet} objects in significance analysis. The {\tt deSet} inherits all the slots from an {\tt ExpressionSet} object and adds vital slots for significance analysis. The rest of this section is for advanced users because it requires knowledge of full and null model creation. To begin, lets create an {\tt ExpressionSet} object from the {\tt kidney} dataset: <>== library(edge) anon_df <- as(data.frame(age=age, sex=sex), "AnnotatedDataFrame") exp_set <- ExpressionSet(assayData = kidexpr, phenoData = anon_df) @ In the {\tt kidney} experiment they were interested in finding the effect of age on gene expression. In this case, we handle the time variable, {\tt age}, by fitting a natural spline curve as done in \cite{storey:2005}. The relevant models for the experiment can be written as <>= library(splines) null_model <- ~1 + sex full_model <- ~1 + sex + ns(age, intercept = FALSE, df = 4) @ where {\tt null\_model} is the null model and {\tt full\_model} is the full model. The {\tt sex} covariate is an adjustment variable while {\tt age} is the biological variable of interest. It is important to note that it is necessary to include the adjustment variables in the formulation of the full models as done above. Having both {\tt expSet} and the hypothesis models, the function {\tt deSet} can be used to create an {\tt deSet} object: <>== de_obj <- deSet(exp_set, full.model = full_model, null.model = null_model) slotNames(de_obj) @ From the slot names, it is evident that the {\tt deSet} object inherits the {\tt ExpressionSet} slots in addition to other slots relating to the significance analysis. See section~\ref{subsec:deSet} for more details on the {\tt deSet} slots. We can now simply run {\tt odp} or {\tt lrt} for significance results: <>= de_odp <- odp(de_obj, bs.its = 50, verbose=FALSE) de_lrt <- lrt(de_obj) summary(de_odp) @ And use the function {\tt qvalueObj} to extract the $\pi_{0}$ estimate, q-values, local false discovery rates and p-values: <>== qval_obj <- qvalueObj(de_odp) qvals <- qval_obj$qvalues lfdr <- qval_obj$lfdr pvals <- qval_obj$pvalues pi0 <- qval_obj$pi0 @ \section*{Acknowledgements} This software development has been supported in part by funding from the National Institutes of Health and the Office of Naval Research. \bibliographystyle{plainnat} \bibliography{edgerefs} \end{document}