%\VignetteIndexEntry{An introduction to AGDEX}
%\VignetteDepends{Biobase}
%\VignetteKeywords{Microarray Association Pattern}
%\VignettePackage{AGDEX}
\documentclass[]{article}

\usepackage{times}
\usepackage{hyperref}
\usepackage{enumerate}

\newcommand{\Rpackage}[1]{{\textit{#1}}}

\title{An Introduction to \Rpackage{AGDEX}}
\author{Stan Pounds, Cuilan Lani Gao}
\date{\Sexpr{date()}}

\begin{document}

\maketitle

<<>>=
options(width=60)
@

\section{Introduction}

A challenging problem in contemporary genomics research is how to integrate and compare gene expression data from studies that utilize different microarray platforms or even different species (e.g. a study of a human disease and a study of an animal model of that disease). We have developed the agreement of differential expression (AGDEX) procedure to integrate differential expression analysis results across two experiments that may utilize different platforms or even different species. AGDEX is able to combine transcriptome information across two experiments that compare expression across two biological conditions. AGDEX was initially used in the study of the pediatric brain tumor ependymoma (Johnson et al., Nature 2010) to characterize the transcriptional similarity of a mouse model to one subtype of human ependymoma.

The AGDEX procedure performs a rigorous differential expression analysis for each two-group comparison and formally evaluates the agreement of differential expression analysis results across the entire transcriptome. Optionally, users may use AGDEX to identify differentially expressed gene-sets for each comparison and evaluate agreement of differential expression analysis results within gene-sets.
In total, the AGDEX procedure performs the following statistical analyses:
\begin{enumerate}
\item identify genes that are differentially expressed in each experiment;
\item identify gene-sets that are differentially expressed in each experiment;
\item integrate results across experiments to identify differentially expressed genes;
\item integrate results across experiments to identify differentially expressed gene-sets;
\item characterize and determine the statistical significance of similarities of differential expression profiles across the two experiments for the entire transcriptome and for specific gene-sets.
\end{enumerate}
The AGDEX method is described in greater detail in the supplementary materials of Johnson et al. (2010) and Gibson et al. (2010).

\section{Pre-requisite Packages}

The AGDEX package depends on the {\em Biobase} and {\em GSEABase} packages. Users must know how to store the expression data as an {\em ExpressionSet} object defined by the {\em Biobase} package. To perform gene-set analyses, users must represent gene-set data as a {\em GeneSetCollection} object defined by the {\em GSEABase} package.

\section{Data Requirements}

Data must be prepared and stored in a specific format for AGDEX analysis. First, data from each experiment must be stored as an {\em ExpressionSet} object defined by the {\em Biobase} package. Second, the data from each experiment must be linked with a definition of the contrast (such as ``tumor - control'') for the differential expression analysis. Optionally, each experiment may have gene-set definitions represented as a {\em GeneSetCollection} object defined by the {\em GSEABase} package. Together, this information provides all the details necessary to perform differential expression analysis of the data from each experiment.
Finally, a data-set that matches the probe-set identifiers from the two experiments is necessary to integrate results and evaluate the agreement of differential expression results across the two experiments.

AGDEX requires that the information needed to perform differential expression analysis of one experiment be provided in the form of a {\em dex.set} list object. The expression and phenotype data are stored as an {\em ExpressionSet} in a component named {\em Eset.data}. Recall that an {\em ExpressionSet} stores the expression data as a genes-by-samples matrix (one row per probe-set, one column per sample) in the component {\em exprs} and the phenotype data as a {\em data.frame} (one row per sample) in the component {\em pData}. The expression data should be normalized log-intensity values. The phenotype data must include one column with group labels to be used for the two-group differential expression analysis comparison. The {\em comp.var} component of the {\em dex.set} list object gives the name or numeric index of the column of the phenotype data with those group labels. The {\em comp.def} component of the {\em dex.set} list object is a string that defines the contrast for the two-group comparison. For example, the {\em comp.def} component may contain the string ``tumor-control'' to indicate that the analysis will compare the expression of those samples with the label ``tumor'' to that of those samples with the label ``control''. Optionally, the {\em dex.set} object may include a {\em GeneSetCollection} object (as defined by the package {\em GSEABase}) in the {\em gset.collection} component. In this way, the {\em dex.set} object contains the data for and the definition of a two-group differential expression analysis. The data and definition for each differential expression analysis must be contained in a {\em dex.set} object.
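
As a minimal sketch of the structure described above, the code below builds a small artificial {\em ExpressionSet} and passes it to {\em make.dex.set.object}. The object names ({\em toy.exprs}, {\em toy.pheno}, {\em toy.eset}, {\em toy.dex.set}) and the simulated values are hypothetical and serve only as an illustration; the chunk is marked so that it is not evaluated when the vignette is built.

<<eval=FALSE>>=
library(Biobase)
library(AGDEX)

set.seed(1)
# Toy log-intensity matrix: rows are probe-sets, columns are samples
toy.exprs <- matrix(rnorm(4 * 6), nrow = 4,
                    dimnames = list(paste0("probe", 1:4),
                                    paste0("sample", 1:6)))

# Phenotype data: one row per sample, in the same order as the
# columns of toy.exprs; the column "grp" holds the group labels
toy.pheno <- data.frame(id = colnames(toy.exprs),
                        grp = rep(c("tumor", "control"), each = 3),
                        row.names = colnames(toy.exprs))

toy.eset <- ExpressionSet(assayData = toy.exprs,
                          phenoData = AnnotatedDataFrame(toy.pheno))

# comp.var names the phenotype column with the group labels;
# comp.def defines the contrast as "tumor" minus "control"
toy.dex.set <- make.dex.set.object(Eset.data = toy.eset,
                                   comp.var = "grp",
                                   comp.def = "tumor-control",
                                   gset.collection = NULL)
names(toy.dex.set)
@

The group labels that appear in {\em comp.def} must match the labels found in the phenotype column named by {\em comp.var}.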

To perform the cross-experiment integration and evaluate the cross-experiment agreement, AGDEX requires information to match the probe-set identifiers of the first differential expression analysis to those of the second differential expression analysis. This information is provided in the form of a {\em map.data} list object. The {\em probe.map} component of the {\em map.data} object is a {\em data.frame} that defines how probe-set identifiers are matched across experiments. As such, {\em probe.map} must include a column with probe-set identifiers from experiment ``A'' and a column with probe-set identifiers from experiment ``B''. The components {\em map.Aprobe.col} and {\em map.Bprobe.col} give the name or numeric index of the columns of the {\em probe.map} component with the probe-set identifiers from experiments ``A'' and ``B'', respectively.

Finally, the user must specify how many permutations are to be performed. AGDEX allows users to utilize an adaptive permutation testing (APT) strategy to reduce computing time for gene-set analyses. APT performs permutations until obtaining {\em min.nperms} permutation-statistics with absolute value greater than that of the observed test-statistic or until performing a maximum of {\em max.nperms} permutations. Pounds et al. (2011) give a more detailed description of APT.

\section{Example}

This example illustrates how users may perform an AGDEX analysis.

\subsection{Prepare the Expression Data as an {\em ExpressionSet} Object}

First, users must prepare the {\em ExpressionSet} for each experiment. The {\em human.data} and {\em mouse.data} {\em ExpressionSet} objects are included in the AGDEX package.

<<>>=
library(AGDEX)
data(human.data)                # Load the human.data ExpressionSet object
head(exprs(human.data)[,1:5])   # Preview the human expression data
head(pData(human.data))         # Preview the human phenotype data
table(pData(human.data)$grp)    # See number in each group
# Check that expression data and phenotype data have samples in the same order
all(rownames(pData(human.data))==colnames(exprs(human.data)))
data(gset.data)                 # A GeneSetCollection for human.data

# Now the same for the mouse.data
data(mouse.data)
head(exprs(mouse.data)[,1:5])
head(pData(mouse.data))
table(pData(mouse.data)$grp)
all(colnames(exprs(mouse.data))==rownames(pData(mouse.data)))
@

\subsection{Form a {\em dex.set} Object for Each Experiment}

Second, for each experiment, information defining the differential expression analysis must be combined with the {\em ExpressionSet} data and stored in a {\em dex.set} object by using {\em make.dex.set.object}, as shown below.

<<>>=
# Create dex.set for the human comparison
dex.set.human <- make.dex.set.object(Eset.data=human.data,
                                     comp.var=2,
                                     comp.def="human.tumor.typeD-other.human.tumors",
                                     gset.collection=gset.data)
# Create dex.set for the mouse comparison (no gene-set definitions)
dex.set.mouse <- make.dex.set.object(mouse.data,
                                     comp.var=2,
                                     comp.def="mouse.tumor-mouse.control",
                                     gset.collection=NULL)
@

In the first statement above, {\em Eset.data=human.data} indicates that the {\em ExpressionSet} object {\em human.data} contains the expression and phenotype data, {\em comp.var=2} indicates that the second column of the phenotype data (i.e., {\em pData(human.data)[,2]}) has the group labels for the differential expression analysis comparison, {\em comp.def=``human.tumor.typeD-other.human.tumors''} indicates that the comparison will be computed as ``human.tumor.typeD'' minus ``other.human.tumors'', and {\em gset.collection=gset.data} indicates that the {\em GeneSetCollection} object {\em gset.data} defines gene-sets for the differential expression analysis.
The second statement above performs an analogous operation for the mouse data, except that it does not provide gene-set definitions for gene-set analyses.

\subsection{Prepare the {\em map.data} Object that Defines How Probe-Sets are Matched Across Experiments}

The {\em map.data} list object includes a component {\em probe.map} with a {\em data.frame} that defines how probe-sets are matched across experiments and components {\em map.Aprobe.col} and {\em map.Bprobe.col} that give the name or numeric index of the columns with the probe-set identifiers from experiments ``A'' and ``B'', respectively. The code segment below illustrates the structure of the {\em map.data} object.

<<>>=
data(map.data)
names(map.data)
head(map.data$probe.map)
map.data$map.Aprobe.col
map.data$map.Bprobe.col
@

\subsection{Perform the AGDEX Analysis}

Now that the {\em dex.set} objects for each experiment and the {\em map.data} object have been prepared, the AGDEX analysis may be performed by a simple call to the function {\em agdex}, as shown below.

<<>>=
agdex.res<-agdex(dex.setA=dex.set.human,
                 dex.setB=dex.set.mouse,
                 map.data=map.data,
                 min.nperms=5,
                 max.nperms=10)
@

This statement performs the AGDEX analysis with the human data considered as experiment ``A'' and the mouse data considered as experiment ``B''. Note that the call to the function {\em agdex} and the {\em map.data} object must label the experiments in the same way. The small values of {\em min.nperms} and {\em max.nperms} used here only keep the example fast; most applications will use much larger values. The classical permutation test can be performed by setting {\em min.nperms} = {\em max.nperms}. See Pounds et al. (2011) for more details on how to set {\em min.nperms} and {\em max.nperms}. The AGDEX procedure will perform exact tests if the total number of possible permutations of the group labels is less than the expected number of permutations under the null hypothesis of exchangeability.
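
The stopping rule behind {\em min.nperms} and {\em max.nperms} can be sketched in a few lines of base R. The function {\em apt.sketch} below is a hypothetical helper written for this illustration and is not part of the AGDEX package; it applies the rule to a single difference-of-means statistic, and the simple proportion it reports as a p-value is an assumption rather than the estimator used by AGDEX (see Pounds et al., 2011, for the actual method).

<<eval=FALSE>>=
# Hypothetical illustration of adaptive permutation testing (APT)
# for one two-group difference of means; not the AGDEX implementation.
apt.sketch <- function(x, grp, min.nperms = 5, max.nperms = 1000) {
  lv <- unique(grp)
  # difference of means: first group label minus second group label
  stat <- function(g) mean(x[g == lv[1]]) - mean(x[g == lv[2]])
  obs <- stat(grp)
  n.exceed <- 0  # permutation statistics larger in absolute value than obs
  n.done <- 0    # permutations performed so far
  # stop after min.nperms exceedances or after max.nperms permutations
  while (n.exceed < min.nperms && n.done < max.nperms) {
    perm.stat <- stat(sample(grp))  # randomly permute the group labels
    n.done <- n.done + 1
    if (abs(perm.stat) > abs(obs)) n.exceed <- n.exceed + 1
  }
  list(obs.stat = obs, n.exceed = n.exceed, n.perms = n.done,
       p.value = n.exceed / n.done)  # simple proportion (an assumption)
}

set.seed(1)
x <- c(rnorm(10, mean = 1), rnorm(10))
grp <- rep(c("tumor", "control"), each = 10)
apt.sketch(x, grp, min.nperms = 5, max.nperms = 1000)
@

In this sketch, a statistic that is easily exceeded under permutation stops after only a few permutations, whereas a strongly significant statistic runs to {\em max.nperms}. Setting {\em min.nperms} equal to {\em max.nperms} always performs exactly {\em max.nperms} permutations, which corresponds to the classical permutation test.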

\subsection{Explore AGDEX Results}

The results of the AGDEX analysis are stored in a list with multiple components. More details are available from {\em help(agdex.result)}. As shown below, several components of the result object echo the input for the group labels and definition of the contrast for each differential expression analysis.

<<>>=
names(agdex.res)
agdex.res$dex.compA        # echoes comp.def of dex.setA
agdex.res$dex.compB        # echoes comp.def of dex.setB
head(agdex.res$dex.asgnA)  # echoes group-labels from dex.setA
head(agdex.res$dex.asgnB)  # echoes group-labels from dex.setB
@

The result object also contains the probe-set level differential expression analysis results for each experiment. These components give the difference of means and p-values for each probe-set in their respective experiments.

<<>>=
head(agdex.res$dex.resA)  # Human results, difference of means and p-values
head(agdex.res$dex.resB)  # Mouse results, difference of means and p-values
@

The {\em meta.dex.res} component contains these results and the meta-analysis z-statistic and p-value for the matched probe-set pairs.

<<>>=
head(agdex.res$meta.dex.res)
@

The function {\em agdex.scatterplot} produces a scatterplot of the difference-of-means statistics for probe-set pairs.

<<>>=
agdex.scatterplot(agdex.res, gset.id=NULL)
@

The results of the genome-wide AGDEX analysis are available in the {\em gwide.agdex.result} component.

<<>>=
agdex.res$gwide.agdex.res
@

The {\em gwide.agdex.result} component is a {\em data.frame} with the cosine and difference-of-proportions statistics and their p-values, obtained by permuting the group labels of experiments ``A'' and ``B''. It also indicates the number of permutations performed for each experiment and whether or not the test is exact (i.e., based on all possible permutations).

The results of gene-set differential expression analysis for each experiment, cross-experiment meta-analysis, and cross-experiment agreement are available in the {\em gset.res} component.

<<>>=
head(agdex.res$gset.res)
@

If there is interest in seeing probe-set level details for the gene-set analysis results, the function {\em get.gset.result.details} may be used to obtain details for a specific gene-set of particular interest or for those gene-sets with p-values less than a specific threshold.

<<>>=
gset.res.stats<-get.gset.result.details(agdex.res, gset.ids = NULL, alpha=0.01)
names(gset.res.stats)
head(gset.res.stats$enrichA.details)
head(gset.res.stats$agdex.details)
dna.cat.process.gset.res<-get.gset.result.details(agdex.res, gset.ids="DNA_CATABOLIC_PROCESS")
head(dna.cat.process.gset.res$agdex.details)
@

\subsection{Store and Report AGDEX Results}

Users may also use the \emph{write.agdex.result} command to save their results in tab-delimited text format for viewing in Microsoft Excel. The command \emph{read.agdex.result} may be used to read the output of \emph{write.agdex.result} back into R. Users may also wish to annotate the genes in each of the above results. Bioconductor annotation packages and annotation databases provide these capabilities for a wide range of gene expression microarrays.

\section{References}

\begin{enumerate}
\item Pounds, S. et al. A procedure to statistically evaluate agreement of differential expression for cross-species genomics. {\em Bioinformatics}, doi: 10.1093/bioinformatics/btr362 (2011).
\item Johnson, R. et al. Cross-species genomics matches driver mutations and cell compartments to model ependymoma. {\em Nature}, 466, 632-6 (2010).
\item Gibson, P. et al. Subtypes of medulloblastoma have distinct developmental origins. {\em Nature}, 468, 1095-99 (2010).
\item Pounds, S. et al. Integrated analysis of pharmacokinetic, clinical, and SNP microarray data using projection onto the most interesting statistical evidence with adaptive permutation testing. {\em International Journal of Data Mining and Bioinformatics}, 5:143-157 (2011).
\end{enumerate} \end{document}