%\VignetteKeywords{Database} %\VignetteDepends{curatedOvarianData} %\VignettePackage{curatedOvarianData} %\VignetteIndexEntry{curatedOvarianData} \input{preamble.tex} \title{\Huge{{\bf curatedOvarianData}}} \author{ Benjamin Frederick Ganzfried, Markus Riester, Benjamin \\ Haibe-Kains, Thomas Risch, Svitlana Tyekucheva, Ina Jazic,\\ Victoria Xin Wang, Mahnaz Ahmadifar, Michael Birrer, \\ Giovanni Parmigiani, Curtis Huttenhower, Levi Waldron} \begin{document} \date{2013} \maketitle \tableofcontents <>= library(sva) options(width=60) @ \section{curatedOvarianData: Clinically Annotated Data for the Ovarian Cancer Transcriptome} This package represents a manually curated data collection for gene expression meta-analysis of patients with ovarian cancer. This resource provides uniformly prepared microarray data with curated and documented clinical metadata. It allows a computational user to efficiently identify studies and patient subgroups of interest for analysis and to run such analyses immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods, and clinical data formats. Please see http://bcb.dfci.harvard.edu/ovariancancer for alterative versions of this package, differing in how redundant probe sets are dealt with. In this vignette, we give a short tour of the package and will show how to use it efficiently. \section{Load TCGA data} Loading a single dataset is very easy. First we load the package: <>= library(curatedOvarianData) @ To get a listing of all the datasets, use the \texttt{data} function: <>= data(package="curatedOvarianData") @ Now to load the TCGA data, we use the \texttt{data} function again: <>= data(TCGA_eset) TCGA_eset @ The datasets are provided as Bioconductor \texttt{ExpressionSet} objects and we refer to the Bioconductor documentation for users unfamiliar with this data structure. \section{Load datasets based on rules} For a meta-analysis, we typically want to filter datasets and patients to get a population of patients we are interested in. We provide a short but powerful R script that does the filtering and provides the data as a list of \texttt{ExpressionSet} objects. One can use this script within R by first sourcing a config file which specifies the filters, like the minimum numbers of patients in each dataset. It is also possible to filter samples by annotation, for example to remove early stage and normal samples. <>= source(system.file("extdata", "patientselection.config",package="curatedOvarianData")) ls() @ See what the values of these variables we have loaded are. The variable names are fairly descriptive, but note that ``rule.1'' is a character vector of length 2, where the first entry is the name of a clinical data variable, and the second entry is a Regular Expression providing a requirement for that variable. Any number of rules can be added, with increasing identifiers, e.g. ``rule.2'', ``rule.3'', etc. Here strict.checking is FALSE, meaning that samples not annotated for the variables in these rules are allowed to pass the filter. If strict.checking == TRUE, samples missing this annotation will be removed. <>= sapply(ls(), get) @ Now that we have defined the sample filter, we create a list of \texttt{ExpressionSets} by sourcing the \texttt{createEsetList.R} file: <>= source(system.file("extdata", "createEsetList.R", package = "curatedOvarianData")) @ It is also possible to run the script from the command line and then load the R data file within R: \begin{verbatim} R --vanilla "--args patientselection.config ovarian.eset.rda tmp.log" < createEsetList.R \end{verbatim} Now we have \Sexpr{length(esets)} datasets with samples that passed our filter in a list of \texttt{ExpressionSets} called \texttt{esets}: <>= names(esets) @ \section{Association of CXCL12 expression with overall survival} Next we use the list of \Sexpr{length(esets)} datasets from the previous example and test if the expression of the CXCL12 gene is associated with overall survival. CXCL12/CXCR4 is a chemokine/chemokine receptor axis that has previously been shown to be directly involved in cancer pathogenesis. We first define a function that will generate a forest plot for a given gene. It needs the overall survival information as \texttt{Surv} objects, which the \texttt{createEsetList.R} function already added in the \texttt{phenoData} slots of the \texttt{ExpressionSets}, accessible at the \texttt{y} label. The resulting forest plot is shown for the CXCL12 gene in Figure~\ref{fig:example3:forest}. <>= esets[[1]]$y forestplot <- function(esets, y="y", probeset, formula=y~probeset, mlab="Overall", rma.method="FE", at=NULL,xlab="Hazard Ratio",...) { require(metafor) esets <- esets[sapply(esets, function(x) probeset %in% featureNames(x))] coefs <- sapply(1:length(esets), function(i) { tmp <- as(phenoData(esets[[i]]), "data.frame") tmp$y <- esets[[i]][[y]] tmp$probeset <- exprs(esets[[i]])[probeset,] summary(coxph(formula,data=tmp))$coefficients[1,c(1,3)] }) res.rma <- metafor::rma(yi = coefs[1,], sei = coefs[2,], method=rma.method) if (is.null(at)) at <- log(c(0.25,1,4,20)) forest.rma(res.rma, xlab=xlab, slab=gsub("_eset$","",names(esets)), atransf=exp, at=at, mlab=mlab,...) return(res.rma) } @ \begin{figure} \centering <>= res <- forestplot(esets=esets,probeset="CXCL12",at=log(c(0.5,1,2,4))) @ \caption{The database confirms CXCL12 as prognostic of overall survival in patients with ovarian cancer. Forest plot of the expression of the chemokine CXCL12 as a univariate predictor of overall survival, using all \Sexpr{length(esets)} datasets with applicable expression and survival information. A hazard ratio significantly larger than 1 indicates that patients with high CXCL12 levels had poor outcome. The p-value for the overall HR, found in res\$pval, is \Sexpr{signif(res$pval, 2)}. This plot is Figure 3 of the curatedOvarianData manuscript. } \label{fig:example3:forest} \end{figure} We now test whether CXCL12 is an independent predictor of survival in a multivariate model together with success of debulking surgery, defined as residual tumor smaller than 1 cm, and Federation of Gynecology and Obstetrics (FIGO) stage. We first filter the datasets without debulking and stage information: <>= idx.tumorstage <- sapply(esets, function(X) sum(!is.na(X$tumorstage))) > 0 idx.debulking <- sapply(esets, function(X) sum(X$debulking=="suboptimal",na.rm=TRUE)) > 0 @ In Figure~\ref{fig:example4:forest}, we see that CXCL12 stays significant after adjusting for debulking status and FIGO stage. We repeated this analysis for the CXCR4 receptor and found no significant association with overall survival (Figure~\ref{fig:example5:forest}). \begin{figure} \centering <>= res <- forestplot(esets=esets[idx.debulking & idx.tumorstage], probeset="CXCL12",formula=y~probeset+debulking+tumorstage, at=log(c(0.5,1,2,4))) @ \caption{Validation of CXCL12 as an independent predictor of survival. This figure shows a forest plot as in Figure~\ref{fig:example3:forest}, but the CXCL12 expression levels were adjusted for debulking status (optimal versus suboptimal) and tumor stage. The p-value for the overall HR, found in res\$pval, is \Sexpr{signif(res$pval, 2)}.} \label{fig:example4:forest} \end{figure} \begin{figure} \centering <>= res <- forestplot(esets=esets,probeset="CXCR4",at=log(c(0.5,1,2,4))) @ \caption{Up-regulation of CXCR4 is not associated with overall survival. This figure shows again a forest plot as in Figure~\ref{fig:example3:forest}, but here the association of mRNA expression levels of the CXCR4 receptor and overall survival is shown. The p-value for the overall HR, found in res\$pval, is \Sexpr{signif(res$pval, 2)}.} \label{fig:example5:forest} \end{figure} \clearpage \section{Batch correction with ComBat} If datasets are merged, it is typically recommended to remove a very likely batch effect. We will use the ComBat \citep{Johnson:2007} method, implemented for example in the SVA Bioconductor package \citep{sva}. To combine two ExpressionSet objects, we can use the \texttt{combine()} function. This function will fail when the two ExpressionSets have conflicting annotation slots, for example \texttt{annotation} when the platforms differ. We write a simple \texttt{combine2} function which only considers the \texttt{exprs} and \texttt{phenoData} slots: <>= combine2 <- function(X1, X2) { fids <- intersect(featureNames(X1), featureNames(X2)) X1 <- X1[fids,] X2 <- X2[fids,] ExpressionSet(cbind(exprs(X1),exprs(X2)), AnnotatedDataFrame(rbind(as(phenoData(X1),"data.frame"), as(phenoData(X2),"data.frame"))) ) } @ In Figure~\ref{fig:batch:1}, we combined two datasets from different platforms, resulting in a huge batch effect. \begin{figure} <>= data(E.MTAB.386_eset) data(GSE30161_eset) X <- combine2(E.MTAB.386_eset, GSE30161_eset) boxplot(exprs(X)) @ \caption{Boxplot showing the expression range for all samples of two merged datasets arrayed on different platforms. This illlustrates a huge batch effect.} \label{fig:batch:1} \end{figure} Now we apply ComBat and adjust for the batch and show the boxplot after batch correction in Figure~\ref{fig:batch:2}: <>= mod <- model.matrix(~as.factor(tumorstage), data=X) batch <- as.factor(grepl("DFCI",sampleNames(X))) combat_edata <- ComBat(dat=exprs(X), batch=batch, mod=mod) @ \begin{figure} <>= boxplot(combat_edata) @ \caption{Boxplot showing the expression range for all samples of two merged datasets arrayed on different platforms after batch correction with ComBat.} \label{fig:batch:2} \end{figure} \clearpage \bibliographystyle{plainnat} \bibliography{curatedOvarianData_vignette} \appendix \section{Available Clinical Characteristics} <>= rm(list=ls()) source(system.file("extdata", "patientselection_all.config",package="curatedOvarianData")) source(system.file("extdata", "createEsetList.R", package = "curatedOvarianData")) @ \begin{figure}[!htb] \centering <>= .esetsStats <- function(esets) { res <- lapply(varLabels(esets[[1]]), function(covar) unlist(sapply(esets, function(X) sum(!is.na(X[[covar]]))>0))) names(res) <- varLabels(esets[[1]]) do.call(rbind, res) } df.r <- .esetsStats(esets) M <- as.matrix(apply(df.r,c(1,2),ifelse,0,1)) colnames(M) <- gsub("_eset$", "", colnames(M)) # no need to show the sample ids M <- M[-(1:2),] heatmap(M[nrow(M):1,],scale="none",margins=c(8,10),Rowv=NA) @ \caption{Available clinical annotation. This heatmap visualizes for each curated clinical characteristic (rows) the availability in each dataset (columns). Red indicates that the corresponding characteristic is available for at least one sample in the dataset. This plot is Figure 2 of the curatedOvarianData manuscript.} \end{figure} \section{Summarizing the List of ExpressionSets} This example provides a table summarizing the datasets being used, and is useful when publishing analyses based on curatedOvarianData. First, define some useful functions for this purpose: <>= source(system.file("extdata", "summarizeEsets.R", package = "curatedOvarianData")) @ Now create the table, used for Table 1 of the curatedOvarianData manuscript: <>= summary.table <- t(sapply(esets, getEsetData)) rownames(summary.table) <- sub("_eset", "", rownames(summary.table)) @ Optionally write this table to file, for example ( replace myfile <- tempfile() with something like myfile <- ``nicetable.csv'' ) <>= (myfile <- tempfile()) write.table(summary.table, file=myfile, row.names=FALSE, quote=TRUE, sep=",") @ <>= library(xtable) print(xtable(summary.table[, c(2, 3, 4, 5, 7)], caption="Datasets provided by curatedOvarianData. This is an abbreviated version of Table 1 of the manuscript; the full version is written by the write.table command above. Stage column is early/late/unknown, histology column is ser/clearcell/endo/mucinous/other/unknown.", table.placement="p", caption.placement="bottom"), floating.environment='sidewaystable') @ \section{For non-R users} If you are not doing your analysis in R, and just want to get some data you have identified from the curatedOvarianData manual, here is a simple way to do it. For one dataset: <>= library(curatedOvarianData) library(affy) data(GSE30161_eset) write.csv(exprs(GSE30161_eset), file="GSE30161_eset_exprs.csv") write.csv(pData(GSE30161_eset), file="GSE30161_eset_clindata.csv") @ Or for several datasets: <>= data.to.fetch <- c("GSE30161_eset", "E.MTAB.386_eset") for (onedata in data.to.fetch){ print(paste("Fetching", onedata)) data(list=onedata) write.csv(exprs(get(onedata)), file=paste(onedata, "_exprs.csv", sep="")) write.csv(pData(get(onedata)), file=paste(onedata, "_clindata.csv", sep="")) } @ \section{Session Info} <>= toLatex(sessionInfo()) @ \end{document}