% \VignetteIndexEntry{XDE Vignette} % \VignetteKeywords{microarray, differential expression, meta-analysis} % \VignettePackage{XDE} \documentclass{article} \usepackage{amsmath} \usepackage{inconsolata} \usepackage{graphicx} \usepackage{natbib} \usepackage{hyperref} \usepackage[mathcal]{euscript} \parindent 0in % Left justify \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\stan}{\texttt{Stanford}} \newcommand{\xde}{\textit{XDE}} \newcommand{\eset}{\Robject{ExpressionSet}} \newcommand{\esets}{\Robject{ExpressionSet}s} \newcommand{\esetList}{\Robject{ExpressionSetList}} \newcommand{\esetLists}{\Robject{ExpressionSetList}s} \newcommand{\xparam}{\Robject{XdeParameter}} \newcommand{\xmcmc}{\Robject{XdeMcmc}} \newcommand{\R}{\textsf{R}} \newcommand{\bioc}{http://www.bioconductor.org} \textwidth=6.2in \textheight=8.5in \parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \title{XDE: A Bayesian hierarchical model for analysis of differential expression in multiple studies} \author{Robert Scharpf, Andrew Nobel, Giovanni Parmigiani, H\aa kon Tjelmeland} \begin{document} \maketitle <>= options(width=60) @ \section{Introduction} There are many publicly available high throughput gene expression studies that address comparable biological questions with similar patient populations. For economical and practical reasons many of these studies have a relatively small number of biological replicates. To improve the statistical power it is of interest to combine observed data from several microarray studies, potentially measured with different technologies. However, variation in the measured gene expression levels is caused not only by the biological conditions of interest and natural variation in gene expression in different samples of the same type, but also to a great extent by technological differences between studies. To successfully combine observed data from different studies, it is therefore essential to filter out the technological effects. The \R{} package \Rpackage{XDE} fits a Bayesian Hierarchical model for estimation of differential expression in 2 or more studies. Alternative methods for assessing differential expression are provided. Other \R{} packages developed for the cross-study analysis of differential expression are \Rpackage{GeneMeta} \cite{Lusa2007}, \Rpackage{metaArray} \cite{Choi2007}, and \Rpackage{rankProduct} \cite{Hong2006}. After reading this vignette, one should be able to \begin{itemize} \item create an instance of the class \esetList{} \item create an instance of the class \xparam{} that contains default options for the MCMC \item fit the Bayesian hierarchical model to two or more studies stored as an \esetList{} class \item plot MCMC chains to assess convergence \item compute posterior averages for concordant and discordant differential expression \item generate alternative cross-study summaries of differential expression \end{itemize} See \cite{Scharpf2009} for a more detailed discussion of the Bayesian hierarchical model \section{\label{sec:xlist} The \esetList{} Class} Presently the software for fitting the Bayesian hierarchical model requires that each study be represented as an \Robject{ExpressionSet} \cite{Gentleman2004} and that the same features be present in each study (the \Rfunction{featureNames} of the studies must be identical). Therefore, for the meta-analysis of several platforms, platform-specific annotations must be mapped to a common reference identifier. Mapping identifiers in each platform is a non-trivial stop in any cross-study analysis. The \R{} packages \Rpackage{funcBox} (not yet publicly available) and \Rpackage{MergeMaid} have been developed to facilitate the annotation process and the merging of multiple studies, respectively \cite{Cope2004}. For the purposes of this vignette, \xde{} provides an example dataset of a single study that was split into three artificial datasets. <>= library(XDE) data(expressionSetList) xlist <- expressionSetList class(xlist) @ The original study is described in \cite{garb:etal:2001}. The processed data was mapped to unigene identifiers and made available in the experimental data package \Rpackage{lungExpression} (\bioc) to facilitate the reproducibility of the analyses described in \cite{Parmigiani2004a}. The function \Rfunction{validObject} checks that the each element in \esetList{} is a valid \eset{} and that the \Robject{featureNames} are the same in each element. <>= validObject(xlist) @ In order to assess differential expression across multiple studies, one must define a dichotomous covariate in the \Robject{phenoData} of each \eset{} element in the \esetList{} object. This covariate must have an identical name in each study. In this vignette, we will use \xde{} to quantify differential expression between adenocarcinomas and squamous carcinomas. The binary covariate ``adenoVsquamous'' has been defined in each of the \eset{} elements for this purpose. The following statement must evaluate to \texttt{TRUE}: <>= stopifnot(all(sapply(xlist, function(x, label){ label %in% varLabels(x)}, label="adenoVsquamous"))) @ \section{\label{sec:XdeParam} The \xparam{} Class} There are many features in our implementation of the hierarchical Bayesian model that can be modified: \begin{itemize} \item hyperparameters for the Bayesian hierarchical model \item the seed for generating random values in the MCMC \item the starting values for the MCMC chains \item tuning parameters for Metropolis-Hastings proposals \item the number of updates per MCMC iteration (each parameter can have a different number of updates) \item selection of MCMC chains that are to be written to log files \end{itemize} All of the above features are provided in the \xparam{} class. The attributes provided when initializing an instance of class \xparam{} work well in most instances. The \xparam{} vignette provides a more detailed description of how the attributes can be modified, as well as a brief description of the Bayesian model and the algorithm for the MCMC. See \cite{Scharpf2009} for a more detailed description of the Bayesian model. The default values for our \Robject{xlist} object are obtained by initializing an instance of the \xparam{} class: <>= params <- new("XdeParameter", esetList=xlist, phenotypeLabel="adenoVsquamous") params @ This parameterization, presently the default, assumes that a gene is differentially expressed in all studies or in none. An alternative parametrization that allows genes to be differentially expressed in a subset of studies can be obtained by setting the argument \Robject{one.delta} to \Robject{FALSE}. <>= params <- new("XdeParameter", esetList=xlist, phenotypeLabel="adenoVsquamous", one.delta=FALSE) @ Whether one fits the ``$\delta_g$'' (\Robject{one.delta=TRUE}) or the ``$\delta_{gp}$'' model (\Robject{one.delta=FALSE}), the chain written to file is of dimension G x P x I, where G is the number of genes, P is the number of studies, and S is the number of samples (in the case of the $\delta_g$ model, the value written to file for a single gene will be the same for each study. The same is true for the $\xi_p$ parameter.) \section{\label{sec:fit} Fitting the Bayesian hierarchical model} \subsection{Starting values} \paragraph{Randomly simulated starting values} By default, the first iteration of the MCMC chain stored in the slot \texttt{firstMcmc} of the \Robject{params} are simulated randomly from the priors. When the value of \texttt{burnin} is true, the output from the MCMC are not saved to file and the chain can not be monitored for convergence. By default the value of \Robject{burnin} is TRUE and only the last iteration from the chain will be available (the parameters are not written to log files). \paragraph{Empirical starting values} One can use empirical values for starting the chain (or specify your own starting values) by initializing an object of class XdeParameter and then specifying your own values for the first MCMC: <>= params <- new("XdeParameter", esetList=xlist, phenotypeLabel="adenoVsquamous", one.delta=FALSE) empirical <- empiricalStart(xlist, phenotypeLabel="adenoVsquamous") firstMcmc(params) <- empirical @ To run a burnin of 3 iterations starting from the empircal values: <>= iterations(params) <- 3 burnin(params) <- TRUE fit <- xde(params, xlist) @ Only the first and last iterations of the MCMC are available when \texttt{burnin} is \texttt{TRUE}. The output of the \Rfunction{xde} is an object of class \xmcmc{}. The object \texttt{fit} contains the last iteration from the MCMC, as well as a different seed that can be used for initiating the next chain. For instance, to run two additional iterations starting at the last iteration stored in the \texttt{fit} object, one should provide the \Robject{params}, \Robject{xlist}, and \Robject{fit} objects to the \Rfunction{xde} function: <>= iterations(params) <- 2 fit2 <- xde(params, xlist, fit) @ When an object of class \xmcmc{} is supplied as an argument to the \Rfunction{xde} function, the seed and the last iteration from the \Robject{fit} object are used to begin the next chain. Note that the results from the previous call to the \Rfunction{xde} would be identical to the following sequence of commands: <>= firstMcmc(params) <- lastMcmc(fit) seed(params) <- seed(fit) fit2 <- xde(params, xlist) @ One should run several thousand iterations (saving all parameters to file) to monitor convergence. In the following code chunk (not run) we save only the chains for the parameters that are not indexed by genes and/or study by setting the \texttt{output} for these parameters to zero -- this step was taken to keep this package from becoming unnecessarily large. By setting the \texttt{thin} to 2, we only write every other MCMC iteration to file. In total, 1000 iterations are saved. <>= burnin(params) <- FALSE iterations(params) <- 2000 output(params)[c("potential", "acceptance", "diffExpressed", "nu", ##"DDelta", ##"delta", "probDelta", ##"sigma2", "phi")] <- 0 thin(params) <- 2 directory(params) <- "logFiles" xmcmc <- xde(params, xlist) @ <>= ##this function should work postAvg <- calculatePosteriorAvg(xmcmc, NCONC=2, NDIFF=1) save(postAvg, file="~/madman/Rpacks/XDE/inst/logFiles/postAvg.rda") ##put browser in .standardizedDelta to fix tau2R, ... BES <- calculateBayesianEffectSize(xmcmc) save(BES, file="~/madman/Rpacks/XDE/inst/logFiles/BES.rda") save(xmcmc, file="~/madman/Rpacks/XDE/data/xmcmc.RData") q("no") @ See the \xparam{} vignette for a more detailed discussion of the output, burnin, and thin methods. The \Robject{xmcmc} object can be loaded by: <>= data(xmcmc, package="XDE") @ \section{\label{sec:mcmc} MCMC diagnostics} In this section we briefly describe how to access the chains for assessing convergence of the MCMC. We refer the \R user to the package \Rpackage{coda} for more detailed discussion of MCMC diagnostics \cite{Plummer2007}. The following is a list of the chains that were saved in our run with 1000 saved iterations: <>= output(xmcmc)[2:22][output(xmcmc)[2:22] == 1] @ The \$ operator can be used to read in log files. First, we need to update the \texttt{directory} slot in the \Robject{xmcmc} with a character string indicating the path to the log files. (Note that an assignment method for directory is only available for \R{} object of class \xmcmc{} -- typically one would not need to change the directory in the \xmcmc{} object). In the following code chunk, we extract the chain for the $c^2$ parameter: <>= pathToLogFiles <- system.file("logFiles", package="XDE") xmcmc@directory <- pathToLogFiles c2 <- xmcmc$c2 @ <>= par(las=1) plot.ts(c2, ylab="c2", xlab="iterations", plot.type="single") @ Here we extract only the parameters that are not indexed by gene and platform (log files for parameters indexed by gene and platform are typically very large): <>= getLogs <- function(object){ params <- output(object)[output(object) == 1] params <- params[!(names(params) %in% c("nu", "phi", "DDelta", "delta", "sigma2", "diffExpressed"))] names(params) } param.names <- getLogs(xmcmc) params <- lapply(lapply(as.list(param.names), function(name, object) eval(substitute(object$NAME_ARG, list(NAME_ARG=name))), object=xmcmc), as.ts) names(params) <- param.names tracefxn <- function(x, name) plot(x, plot.type="single", col=1:ncol(x), ylab=name) mapply(tracefxn, params, name=names(params)) @ \section{\label{sec:posterior} Posterior probabilities of differential expression} We refer to the posterior mean of the standardized offsets in the hierarchical model as the Bayesian effect size (BES). The BES is calculated as $\frac{\delta_g\Delta_{gp}}{c \tau \sigma^{b_p}}$ and obtained by <>= bayesianEffectSize(xmcmc) <- calculateBayesianEffectSize(xmcmc) @ As the function \Rfunction{calculateBayesianEffectSize} requires that the $\delta, \Delta$, and $\sigma^2$ chains are saved, the above code chunk is not evaluated in this vignette. <>= load(file.path(pathToLogFiles, "BES.rda")) load(file.path(pathToLogFiles, "postAvg.rda")) @ Posterior averages for the probability of differential expression, concordant differential expression, and discordant differential expression are stored in the \Robject{postAvg} object. See \cite{Scharpf2009} for a discussion of how the above posterior average probabilities are computed. <>= par(las=1) hist(postAvg[, "concordant"], breaks=50, xlim=c(0, 1), main="", xlab="posterior probability of concordant differential expression") @ <>= par(las=1) hist(postAvg[, "discordant"], breaks=50, xlim=c(0, 1), main="", xlab="posterior probability of discordant differential expression") @ We may wish to differentially label the study-specific statistics of effect size that have high probabilities of concordant differential expression. To do this, we order the matrix of the BES so that the genes with the highest posterior probabilities are plotted last. The function \Rfunction{symbolsInteresting} returns a list of graphical options to \Rfunction{pairs}. <>= op.conc <- symbolsInteresting(rankingStatistic=postAvg[, "concordant"]) op.disc <- symbolsInteresting(rankingStatistic=postAvg[, "discordant"]) @ <>= par(las=1) graphics:::pairs(BES[op.conc$order, ], pch=op.conc$pch, col=op.conc$col, bg=op.conc$bg, upper.panel=NULL, cex=op.conc$cex) @ or high probabilities of discordant differential expression <>= graphics:::pairs(BES[op.disc$order, ], pch=op.disc$pch, col=op.disc$col, bg=op.disc$bg, upper.panel=NULL, cex=op.disc$cex) @ \section{\label{sec:alternatives} Alternative cross-study summaries of differential expression} Study-specific estimates of effect size, such as SAM or t-statistics, can be useful to check the overall reproducibility between studies. Again using pairwise scatterplots we plot t- and SAM-statistics using different colors and plotting symbols for the genes that show high posterior probabilities of concordant differential expression. <>= ##t <- ssStatistic(statistic="t", phenotypeLabel="adenoVsquamous", esetList=xlist) tt <- rowttests(xlist, "adenoVsquamous", tstatOnly=TRUE) if(require(siggenes)){ sam <- ssStatistic(statistic="sam", phenotypeLabel="adenoVsquamous", esetList=xlist) } if(require(GeneMeta)){ z <- ssStatistic(statistic="z", phenotypeLabel="adenoVsquamous", esetList=xlist) } @ <>= graphics:::pairs(tt[op.conc$order, ], pch=op.conc$pch, col=op.conc$col, bg=op.conc$bg, upper.panel=NULL, cex=op.conc$cex) @ <>= graphics:::pairs(tt[op.conc$order, ], pch=op.conc$pch, col=op.conc$col, bg=op.conc$bg, upper.panel=NULL, cex=op.conc$cex) @ <>= graphics:::pairs(tt[op.disc$order, ], pch=op.disc$pch, col=op.disc$col, bg=op.disc$bg, upper.panel=NULL, cex=op.disc$cex) @ Uncorrelated $t$ and $SAM$ statistics suggest a low level of reproducibility that may be attributable to technological differences in the platforms, probes that align to different transcripts of the same gene, or differences in the study populations. Low or non-existing reproducibility may induce a wrong borrowing of strength in the Bayesian model, whereby concordant differential expression is seen as noise and shrunk to zero. Hence, study-specific estimates of t- and SAM-statistics may be helpful in deciding whether the Bayesian model is likely to be beneficial. If the correlation across studies is low, we suggest an unsupervised approach to gene filtering, such as integrative correlation, to select for genes that show some level of reproducibility across studies. The \R{} packages \Rpackage{MergeMaid} and \Rpackage{genefilter} may be helpful. For evaluating the overall differential expression, we follow the discussion of Garrett-Mayer \citep{Garrett-Mayer2007}, and combine the elements of single study statistics in a linear fashion to obtain a statistic suitable for assessing differential expression. For a more detailed discussion of how these cross-study summaries were generated for evaluating concordant and discordant differential expression, see \cite{Scharpf2009}. <>= tScores <- xsScores(tt, N=nSamples(xlist)) samScores <- xsScores(sam, nSamples(xlist)) zScores <- xsScores(z[, match(names(xlist), colnames(z))], N=nSamples(xlist)) ##Concordant differential expression, we use the combined score from the random effects model directly zScores[, "concordant"] <- z[, "zSco"] @ %\section{Assessing Goodness of Fit} \section{Session Information} The version number of \R{} and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliography{genomics} \bibliographystyle{plain} \end{document}