% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{GeneMeta Vignette}
% \VignetteKeywords{meta analysis, combining data}
% \VignettePackage{GeneMeta}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[12pt]{article}

\usepackage{amsmath,fullpage}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}

\parindent 0in % no paragraph indentation

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}

\bibliographystyle{plainnat}

\title{Meta-analysis for Microarray Experiments}
\author{Robert Gentleman, Markus Ruschhaupt, Wolfgang Huber, and Lara Lusa}

\begin{document}

\maketitle

\section{Introduction}

The use of meta-analysis tools and strategies for combining data from
microarray experiments seems to be a good and practical idea.
\citet{Choi.2003} were among the first to address these issues. When
working with such data it is important to realize that different
experiments have typically been designed to address different
questions. In general, it only makes sense to combine data sets if the
questions are the same, or if some aspects of the experiments are
sufficiently similar that one can hope to make better inference from
the whole than from the experiments separately. The mere fact that two
experiments were run on the same microarray platform is not sufficient
justification for combining them.

In \Rpackage{GeneMeta} we have implemented many of the tools described
by \citet{Choi.2003}. They focused on the combination of data sets
based on two-sample comparisons, so their procedures are largely based
on the $t$-test. It is not clear whether improvements would result if
some of the more popular adjustments to these tests were used instead.

Consider the situation where data from $k$ studies are available and we
want to estimate, for each gene, the mean difference in expression
between two commonly measured phenotypes (here we use the term
phenotype loosely). The setting considered by \citet{Choi.2003} was
that of a tumor versus normal comparison. The general model for this
setting is as follows. Let $\mu$ denote the parameter of interest (the
true difference in means, say), and let $y_i$ denote the measured
effect for study $i$, with $i = 1, \ldots, k$. Then the hierarchical
model is:
\begin{eqnarray*}
y_i &=& \theta_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2) \\
\theta_i &=& \mu + \delta_i, \qquad \delta_i \sim N(0, \tau^2)
\end{eqnarray*}
where $\tau^2$ represents the between-study variability and
$\sigma_i^2$ denotes the within-study variability.

The analysis differs depending on whether a fixed effects model (FEM)
or a random effects model (REM) is deemed appropriate. Under a FEM the
basic presumption is that $\tau^2 = 0$; if this does not hold, a REM
needs to be fit. The estimates of the overall effect $\mu$ differ
depending on which model is used. For the REM, \citet{Choi.2003}
suggest using an estimator of $\tau^2$ due to DerSimonian and Laird.
This estimator is computed by the function \Rfunction{tau2.DL}; the
overall effect $\mu$ and its variance are computed with
\Rfunction{mu.tau2} and \Rfunction{var.tau2} (see below).
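To make the model and the role of these functions concrete, here is a
minimal simulated sketch (not part of the analysis below; the numbers
of genes and studies and the variance values are arbitrary choices for
illustration). It only uses calls that also appear later in this
vignette: \Rfunction{mu.tau2}, \Rfunction{f.Q}, and
\Rfunction{tau2.DL}.

<<toyModel>>=
library(GeneMeta)
set.seed(1)
k <- 3                                    # number of studies
G <- 200                                  # number of genes
tau2   <- 0.05                            # between-study variance
sigma2 <- matrix(0.2, nrow=G, ncol=k)     # within-study variances
mu     <- rnorm(G)                        # true effects, one per gene
theta  <- mu + matrix(rnorm(G*k, sd=sqrt(tau2)), G, k)     # study-level effects
y      <- theta + matrix(rnorm(G*k), G, k) * sqrt(sigma2)  # observed effects
muHat  <- mu.tau2(y, sigma2)              # combined (FEM) estimate of mu
Q      <- f.Q(y, sigma2)                  # Cochran's Q, one value per gene
tau2DL <- tau2.DL(Q, k, my.weights=1/sigma2)  # DerSimonian-Laird estimate of tau^2
summary(tau2DL)
@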
\section*{Simple Usage}

In this vignette we show how these methods can be used to combine data
sets. Matching of identifiers is typically an important component of
such an analysis. We do not address that problem here; instead we split
a single data set into two parts and then combine the two splits,
showing that the combination performs nearly as well as the original,
unsplit data set. This also sidesteps the concern raised above about
only combining experiments that measure the same thing, since here we
know by construction that the same thing has been measured.

\subsection*{Getting the data}

We first load a data set that was reported by \citet{West.2001} and was
collected on patients with breast cancer. \Robject{Nevins} includes
data from 46 hybridizations on hu6800 Affymetrix chips.

<<loadData>>=
library(GeneMeta)
library(RColorBrewer)
#load("~/Bioconductor/Projects/GraphCombine/MetaBreast/data/Nevins.RData")
data(Nevins)
@

We want to look at the estrogen receptor (ER) status and find genes
that have a large $t$-statistic for the difference between ER positive
and ER negative patients. Rather than the $t$-statistic itself we use
the effect size
$$ d = t \cdot \sqrt{\frac{n_1 + n_2}{n_1 \cdot n_2}}, $$
where $t$ is the usual $t$-statistic and $n_1$ and $n_2$ are the sizes
of the two groups.

We create two data sets from the original one by splitting it, making
sure that the same fraction of ER positive cases ends up in each part.

<<splitData>>=
set.seed(1609)
thestatus <- pData(Nevins)[,"ER.status"]
group1 <- which(thestatus=="pos")
group2 <- which(thestatus=="neg")
rrr <- c(sample(group1, floor(length(group1)/2)),
         sample(group2, ceiling(length(group2)/2)))
Split1 <- Nevins[,rrr]
Split2 <- Nevins[,-rrr]
@

For each data set (\Robject{Split1} and \Robject{Split2}) we extract
the ER status and code it as a 0--1 vector.

<<extractClasses>>=
#obtain classes
Split1.ER <- pData(Split1)[,"ER.status"]
levels(Split1.ER) <- c(0,1)
Split1.ER <- as.numeric(as.character(Split1.ER))
Split2.ER <- pData(Split2)[,"ER.status"]
levels(Split2.ER) <- c(0,1)
Split2.ER <- as.numeric(as.character(Split2.ER))
@

\subsection*{Combining the data}

Next we compute the unbiased estimates of the effect
(\Robject{d.adj.Split1} and \Robject{d.adj.Split2}) and their variances
(\Robject{var.d.adj.Split1} and \Robject{var.d.adj.Split2}). Our goal
is to compute Cochran's Q statistic in order to decide whether a fixed
effects or a random effects model is appropriate for the data.

<<effectSizes>>=
#calculate d for Split1
d.Split1 <- getdF(Split1, Split1.ER)
#adjust d value
d.adj.Split1 <- dstar(d.Split1, length(Split1.ER))
var.d.adj.Split1 <- sigmad(d.adj.Split1, sum(Split1.ER==0), sum(Split1.ER==1))

#calculate d for Split2
d.Split2 <- getdF(Split2, Split2.ER)
#adjust d value
d.adj.Split2 <- dstar(d.Split2, length(Split2.ER))
var.d.adj.Split2 <- sigmad(d.adj.Split2, sum(Split2.ER==0), sum(Split2.ER==1))
@

With those in hand we can compute Q and then display a qq-plot
comparing the observed values to a $\chi^2_1$ distribution (the degrees
of freedom are the number of experiments minus one, and we have two
experiments).

<<computeQ>>=
#calculate Q
mymns  <- cbind(d.adj.Split1, d.adj.Split2)
myvars <- cbind(var.d.adj.Split1, var.d.adj.Split2)
my.Q   <- f.Q(mymns, myvars)
mean(my.Q)
@

<<histQ,fig=TRUE>>=
hist(my.Q, breaks=50, col="red")
@

From the histogram and the mean of the $Q$ values we see immediately
that the hypothesis that these values come from a $\chi^2_1$
distribution seems to be valid.
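As a quick sanity check (and assuming that \Rfunction{f.Q} implements
the usual Cochran's Q statistic, $Q = \sum_i w_i (d_i - \bar{d})^2$
with weights $w_i = 1/\sigma_i^2$ and $\bar{d}$ the weighted mean
effect), we can reproduce the value for a single gene by hand:

<<QbyHand>>=
# recompute Q for the first gene from its two effects and variances
w    <- 1/myvars[1,]                  # weights: reciprocal variances
dbar <- sum(w * mymns[1,]) / sum(w)   # weighted mean effect
c(byHand = sum(w * (mymns[1,] - dbar)^2), f.Q = unname(my.Q[1]))
@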
The corresponding qq-plot against the $\chi^2_1$ quantiles tells the
same story.

<<qqplotQ,fig=TRUE>>=
########### graphics ################
num.studies <- 2
#quantiles of the chi-square distribution
chisqq <- qchisq(seq(0, .9999, .001), df=num.studies-1)
tmp <- quantile(my.Q, seq(0, .9999, .001))
qqplot(chisqq, tmp, ylab="Quantiles of Sample", pch="*",
       xlab="Quantiles of Chi square", main="QQ Plot")
lines(chisqq, chisqq, lty="dotted", col="red")
@

Since a FEM appears appropriate, we next compute the estimated effect
sizes. For each gene the combined effect is a weighted average of the
effects in the individual experiments, with weights equal to the
reciprocals of the estimated variances; dividing the combined effect by
its standard error yields a z score.

<<FEM>>=
muFEM <- mu.tau2(mymns, myvars)
sdFEM <- var.tau2(myvars)
ZFEM  <- muFEM/sqrt(sdFEM)
@

A Normal qq-plot of the z scores shows that the presumption of
approximate Normality seems to be appropriate.

<<qqnormFEM,fig=TRUE>>=
qqnorm(ZFEM, pch="*")
qqline(ZFEM, col="red")
@

If instead a REM had to be fit, we would compute the estimated effect
sizes using the DerSimonian--Laird estimator. For that we first have to
estimate the between-experiment variance $\tau^2$.

<<REM>>=
my.tau2.DL <- tau2.DL(my.Q, num.studies, my.weights=1/myvars)
#obtain new variances s^2+tau^2
myvarsDL <- myvars + my.tau2.DL
#compute mu(tau)
muREM <- mu.tau2(mymns, myvarsDL)
#compute var(tau)
varREM <- var.tau2(myvarsDL)
ZREM <- muREM/sqrt(varREM)
@

%<<qqnormREM,fig=TRUE>>=
% qqnorm(ZREM,pch="*")
% qqline(ZREM,col="red")
%@

We can easily compare the two estimates.

<<compareFEMREM,fig=TRUE>>=
plot(muFEM, muREM, pch=".")
abline(0, 1, col="red")
@

We do not see much difference here, because in the REM the variance
$\tau^2$ is estimated as zero for most of the genes.

<<histTau,fig=TRUE>>=
hist(my.tau2.DL, col="red", breaks=50, main="Histogram of tau^2")
@

The procedure described above is also implemented in the function
\Rfunction{zScores} (part of this package) and in
\Rfunction{meta.summaries} from the package \Rpackage{rmeta}. While
\Rfunction{meta.summaries} does the calculation for arbitrary effects
and their variances, \Rfunction{zScores} follows exactly the
calculation of \citet{Choi.2003}. The arguments of this function are a
list of expression sets and a list of class labels. We include our two
splits and also the original data set. By default \Rfunction{zScores}
would combine all expression sets in the list, but we only want to
combine the first two, so we have to set an additional parameter.

<<zScores>>=
esets <- list(Split1, Split2, Nevins)
data.ER <- pData(Nevins)[,"ER.status"]
levels(data.ER) <- c(0,1)
data.ER <- as.numeric(as.character(data.ER))
classes <- list(Split1.ER, Split2.ER, data.ER)
theScores <- zScores(esets, classes, useREM=FALSE, CombineExp=1:2)
@

We get a matrix of the following form.

<<showScores>>=
theScores[1:2,]
@

Here \Robject{Effect\_Ex\_1} and \Robject{Effect\_Ex\_2} are the
unbiased estimates of the effect (\Robject{d.adj.Split1} and
\Robject{d.adj.Split2}), and \Robject{EffectVar\_Ex\_1} and
\Robject{EffectVar\_Ex\_2} are their estimated variances
(\Robject{var.d.adj.Split1} and \Robject{var.d.adj.Split2}).
\Robject{zSco\_Ex\_1} and \Robject{zSco\_Ex\_2} are the unbiased
estimates of the effects divided by their standard deviations. The same
values are also calculated for the complete data set
(\Robject{Effect\_Ex\_3}, \Robject{EffectVar\_Ex\_3}, and
\Robject{zSco\_Ex\_3}). \Robject{Qvals} are the Q statistics
(\Robject{my.Q}) and \Robject{df} is the number of combined experiments
minus one. \Robject{MUvals} and \Robject{MUsds} correspond to
\Robject{muFEM} and \Robject{sdFEM} (the overall mean effect size and
its standard deviation), and \Robject{zSco} contains the z scores
(\Robject{ZFEM}). \Robject{Qpvalues} gives, for each gene, the
probability that a $\chi^2$ distribution with \Robject{df} degrees of
freedom exceeds its Q statistic, and \Robject{Chisq} is the probability
that a $\chi^2$ distribution with one degree of freedom exceeds
$\Robject{zSco}^2$.
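As a small, purely illustrative downstream step (the choice of the five
top-ranked genes is arbitrary and not part of the analysis of
\citet{Choi.2003}), the combined z scores can be used to rank the
genes:

<<topGenes>>=
# order genes by the absolute combined z score and show the strongest five
ord <- order(abs(theScores[,"zSco"]), decreasing=TRUE)
round(theScores[ord[1:5], c("zSco_Ex_1","zSco_Ex_2","zSco_Ex_3","zSco")], 2)
@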
We plot the z scores of the original data set against the z scores of
the combined data set. The good correlation indicates that the
combination of the two data sets works quite well. In the next
paragraphs we examine how big the benefit of combining data sets
really is.

<<plotScores,fig=TRUE>>=
plot(theScores[,"zSco_Ex_3"], theScores[,"zSco"], pch="*",
     xlab="original score", ylab="combined score")
abline(0, 1, col="red")
@

We now look at the integration-driven discovery rate (IDR) plot
described in \citet{Choi.2003}. For a threshold $z_{th}$ this plot
shows the fraction of genes whose combined effect $z$ exceeds the
threshold while none of the experiment-specific effects $z_i$ do, i.e.\
we look for genes with
\begin{eqnarray*}
z & \geq & z_{th} \mbox{ and } \sum_{i=1}^k I(z_i \geq z_{th}) = 0 \mbox{ for } z > 0, \mbox{ or }\\
z & \leq & -z_{th} \mbox{ and } \sum_{i=1}^k I(z_i \leq -z_{th}) = 0 \mbox{ for } z < 0.
\end{eqnarray*}
The IDR is computed separately for $z > 0$ (blue) and $z < 0$ (red).
We can see that we obtain higher z scores by combining the sets.

<<IDRplot,fig=TRUE>>=
IDRplot(theScores, Combine=1:2, colPos="blue", colNeg="red")
@

\subsection*{Estimating the false discovery rate}

Next, \citet{Choi.2003} discuss using a SAM-type analysis
\citep{Tusher.2001} to estimate the false discovery rate (FDR). This is
implemented in the function \Rfunction{zScoreFDR}.

<<zScoreFDR>>=
ScoresFDR <- zScoreFDR(esets, classes, useREM=FALSE, nperm=50, CombineExp=1:2)
@

The result is a list with three elements.

<<namesFDR>>=
names(ScoresFDR)
@

The first element holds the results when the FDR is computed for the
positive scores, the second for the negative scores, and the last one
for the two-sided situation (i.e.\ based on the absolute values of the
z scores). Each element contains a matrix with the values obtained by
\Rfunction{zScores} and, in addition, an FDR for each experiment and
for the combination of experiments.

<<showFDR>>=
ScoresFDR$pos[1:2,]
@

We now plot the number of genes against the corresponding FDR, using
the FDR for the two-sided situation. The result for the combined set is
shown in red and the result for the original set (without splitting) in
blue. It can be seen that the combined data set has a lower FDR than
either split and an FDR about as good as that of the original set.

<<sortFDR>>=
FDRwholeSettwo <- sort(ScoresFDR$"two.sided"[,"FDR"])
experimentstwo <- list()
for(j in 1:3){
  experimentstwo[[j]] <- sort(ScoresFDR$"two.sided"[,paste("FDR_Ex_",j,sep="")])
}
@

<<colors>>=
theNewC <- brewer.pal(3,"Set2")
@

<<plotFDR,fig=TRUE>>=
#####################
#                   #
# two sided z values#
#                   #
#####################
plot(FDRwholeSettwo, pch="*", col="red", ylab="FDR", xlab="Number of genes")
for(j in 1:3)
  points(experimentstwo[[j]], pch="*", col=theNewC[j])
legend(4000, 0.4, c("Combined set", "Split 1", "Split 2", "original set"),
       c("red", theNewC[1:3]))
@
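The same information can also be summarised numerically. For example,
we can count the genes whose two-sided FDR falls below 0.05 (an
arbitrary illustrative cutoff) in each split, in the original set, and
in the combination:

<<countBelowFDR>>=
# number of genes with two-sided FDR below 0.05, per study and for the combination
fdrCols <- c("FDR_Ex_1", "FDR_Ex_2", "FDR_Ex_3", "FDR")
colSums(ScoresFDR$two.sided[, fdrCols] < 0.05, na.rm=TRUE)
@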
If we are more interested in the number of genes that fall below a
given FDR threshold, we can use the function \Rfunction{CountPlot}.
Similar to \Rfunction{IDRplot}, it shows, for each study (indicated by
a different color) and for a range of FDR thresholds (x axis), the
number of genes that are below the threshold in the given study but
above it in all other studies (y axis). The studies that should be
considered (apart from the combined set, which is always present) can
be specified with \Robject{CombineExp}. Here we compare the original
data set (green) against the combined data set (red). It can be seen
that the combined set does quite well.

<<CountPlot,fig=TRUE>>=
#par(mfrow=c(2,2))
#CountPlot(ScoresFDR, Score="FDR", kindof="neg", cols=c("red",theNewC),
#          main="Negative FDR", xlab="FDR threshold",
#          ylab="Number of genes", CombineExp=1:2)
#CountPlot(ScoresFDR, Score="FDR", cols=c("red",theNewC), kindof="pos",
#          main="Positive FDR", xlab="FDR threshold",
#          ylab="Number of genes", Combine=1:2)
CountPlot(ScoresFDR, Score="FDR", kindof="two.sided", cols=c("red",theNewC),
          main="two sided FDR", xlab="FDR threshold",
          ylab="Number of genes", CombineExp=3)
@

\begin{thebibliography}{}

\bibitem[Choi et al.(2003)]{Choi.2003}
Choi JK, Yu U, Kim S, Yoo OJ.
\newblock Combining multiple microarray studies and modeling interstudy variation.
\newblock \textit{Bioinformatics} 19(Suppl.~1):i84--i90 (2003).

\bibitem[West et al.(2001)]{West.2001}
West M, Blanchette C, Dressman H, et al.
\newblock Predicting the clinical status of human breast cancer by using gene expression profiles.
\newblock \textit{Proc Natl Acad Sci U S A} 98(20):11462--11467 (2001).

\bibitem[Tusher et al.(2001)]{Tusher.2001}
Tusher VG, Tibshirani R, Chu G.
\newblock Significance analysis of microarrays applied to the ionizing radiation response.
\newblock \textit{Proc Natl Acad Sci U S A} 98:5116--5121 (2001).

\end{thebibliography}

\end{document}