\documentclass[a4paper]{article} \usepackage[OT1]{fontenc} \usepackage{geometry} \geometry{hmargin = 2cm, vmargin = 2cm } \usepackage[plainpages=false,pdfpagelabels]{hyperref} \usepackage{Sweave} \begin{document} % \VignetteIndexEntry{UsingMLP} % \VignetteDepends{mouse4302.db} \title{Using MLP} \author{Nandini Raghavan, An De Bondt, Tobias Verbeke} \maketitle \tableofcontents \section{Introduction} Profiling technologies like gene expression profiling made it possible to quantify and compare relative gene expression profiles across a series of conditions and this for thousands of genes at a time. In order to understand the biology behind the difference between e.g. treatment and control, one might look at the function of individual genes which are differentially expressed. Another approach is to test which biological processes are significantly affected. Genes can be grouped into gene sets e.g. based on the biological process they are involved in. Coordinated differential expression of a set of functionally related genes could be more relevant than differential expression of a few unrelated genes, scattered across multiple gene sets. \\The idea of the MLP methodology is to test whether there are gene sets enriched in small p-values (MLP denotes mean minus log p-value). The method does not require a cut-off value for significance at gene level. Also a distinction between up- or downregulated genes is not needed. Both of these principles taken together makes it possible to find, in one go, affected gene sets consisting of e.g. 50\% inhibitors and 50\% inducers. \\The MLP methodology involves the use of (a) a test statistic to quantify the extent of the differential expression and (b) a resampling scheme to judge whether the difference is possibly real or attributable to chance. This process can be repeated for all gene sets of interest. The starting point for the MLP methodology is a list of p-values, or any similar statistics, that can quantify the degree of differential expression for each gene measured. These can be generated by a variety of methods used for calculating p-values based on gene expression data, several of them have been incorporated in the limma package. \href{www.ncbi.nlm.nih.gov/pubmed/12710670}{Smyth et al., 2003} \href{http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf}{Smyth, 2005} \\In this vignette, we show an example to identify biological processes that are affected in the considered experiment. The \href{http://www.geneontology.org/}{Gene Ontology Consortium} is dealing with the classification of genes based on 3 criteria: \begin{itemize} \item the biological process they are involved in \item the molecular function they have \item their cellular localisation \end{itemize} The analysis behind the results below is focussing on identifying affected gene sets based on biological processes. \section{Example Use} The example data is from an expression profiling experiment with 2 sample groups, comparing wild-type mice with animals of which 1 gene has been knocked out. Each of the groups consist of 6 mice. The expression array used is the Affymetrix' Mouse430\_2. The gene expression measurements have been summarized using GC-RMA (\href{www.ncbi.nlm.nih.gov/pubmed/12925520}{Irizarry et al., 2003} and \href{http://www.bepress.com/jhubiostat/paper1/}{Wu et al., 2004}) based on Entrez Gene probeset definitions \href{www.ncbi.nlm.nih.gov/pmc/articles/PMC1283542/}{Dai et al., 2005}. \subsection{Preliminaries} Load the needed libraries and the preprocessed data. <>= require(MLP) require(limma) pathExampleData <- system.file("exampleFiles", "expressionSetGcrma.rda", package = "MLP") load(pathExampleData) # if (!require(mouse4302mmentrezg.db)) # annotation(expressionSetGcrma) <- "mouse4302" annotation(expressionSetGcrma) <- "mouse4302" @ It is advisable to make use of the annotation packages of the BrainArray group\footnote{See \href{http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp}{here}.} at the University of Michigan, but as these packages are unfortunately not officially part of BioConductor (and on the build servers used to build packages [and corresponding vignettes]), the code above allows for using the sub-optimal Affymetrix annotation. \subsection{Prepare P Values} Estimate the fold changes and standard errors by fitting a linear model for each gene. <>= ### calculation of the statistics values via limma group <- as.numeric(factor(pData(expressionSetGcrma)$subGroup1, levels = c("WT", "KO")))-1 design <- model.matrix(~group) fit <- lmFit(exprs(expressionSetGcrma), design) fit2 <- eBayes(fit) results <- limma:::topTable(fit2, coef = "group", adjust.method = "fdr", number = Inf) pvalues <- results[,"P.Value"] names(pvalues) <- rownames(results) # since we moved towards using "_at", next step should be needed as well names(pvalues) <- sub("_at", "", names(pvalues)) @ \subsection{Prepare Gene Sets} Create an object with the overview of the groups of genes you would like to consider. This object is a list of class \texttt{geneSetMLP}, where the slot names correspond to the gene set identifier and the slot content is a character vector of Entrez Gene identifiers for those genes belonging to that gene set. This object can be created using the \texttt{getGeneSets} function. This function has 3 parameters: \begin{itemize} \item \texttt{species} = a string being 'Human', 'Mouse', 'Rat' or 'Dog' \item \texttt{geneSetSource} = a string or a data.frame (more info below) \item \texttt{entrezIdentifiers} = a character vector of Entrez Gene identifiers for which gene statistics are available \end{itemize} The \texttt{geneSetSource} can be a string, i.e. \texttt{'GOBP'}, \texttt{'GOMF'}, \texttt{'GOCC'}, \texttt{'KEGG'} or \texttt{'REACTOME'}. The downstream analysis in these cases will identify gene sets, publicly available, as defined by the \href{http://www.geneontology.org/}{Gene Ontology Consortium} for the first 3 , the \href{http://www.genome.jp/kegg/}{Kyoto Encyclopedia of Genes and Genomes} and \href{http://www.reactome.org}{REACTOME}. The \texttt{geneSetSource} can also be a data.frame with at least the following 4 columns: \begin{itemize} \item \texttt{PATHWAYID} = identifier of the gene set \item \texttt{PATHWAYNAME} = description of the gene set \item \texttt{TAXID} = taxonomy identifier (9606, 10090, 10116 or 9615 for respectively Human, Mouse, Rat or Dog) \item \texttt{GENEID} = Entrez Gene identifier belonging to that gene set \end{itemize} As an example, submitting the GO biological processes information as a data.drame would look like the data.frame below. In such a data.frame, the details of each gene set (identifier and description), is repeated as many times as the number of genes in that gene set and this for each species enclosed in the database. \begin{verbatim} PATHWAYID TAXID PATHWAYNAME GENEID 1 GO:0000002 10090 mitochondrial genome maintenance 18975 2 GO:0000002 10090 mitochondrial genome maintenance 19819 3 GO:0000002 10090 mitochondrial genome maintenance 27393 4 GO:0000002 10090 mitochondrial genome maintenance 27395 5 GO:0000002 10090 mitochondrial genome maintenance 27397 6 GO:0000002 10090 mitochondrial genome maintenance 57813 7 GO:0000002 10090 mitochondrial genome maintenance 83945 8 GO:0000002 10090 mitochondrial genome maintenance 226153 9 GO:0000002 10090 mitochondrial genome maintenance 382985 10 GO:0000002 10116 mitochondrial genome maintenance 83474 11 GO:0000002 10116 mitochondrial genome maintenance 291824 12 GO:0000002 10116 mitochondrial genome maintenance 298933 13 GO:0000002 10116 mitochondrial genome maintenance 309441 14 GO:0000002 10116 mitochondrial genome maintenance 309762 15 GO:0000002 10116 mitochondrial genome maintenance 360481 \end{verbatim} <>= geneSet <- getGeneSets(species = "Mouse", geneSetSource = "GOCC", entrezIdentifiers = names(pvalues) ) tail(geneSet, 3) @ As mentioned above, the returned object is a list of class \texttt{geneSetMLP}. This object has attributes which are used for the downstream analysis. <>= str(attributes(geneSet)) @ \clearpage \subsection{Run MLP} Run the actual MLP. To retrieve exact reproducible results, the seed is set in advance. The MLP function has 10 parameters, many of them can remain on their default values: \begin{itemize} \item \texttt{geneSet} = object of class \texttt{geneSetMLP} created by \texttt{getGeneSets} \item \texttt{geneStatistic} = named numeric vector corresponding to a gene-specific statistic, such as a p-value. The names are the corresponding Entrez Gene identifiers. This vector has the same length as the character vector submitted as \texttt{entrezIdentifiers} to the getGeneSets function. \item \texttt{minGenes} = minimal number of genes for a gene set to be considered in the analysis, default 5 \item \texttt{maxGenes} = maximal number of genes for a gene set to be considered in the analysis, default 100 \item \texttt{rowPermutations} = logical indicating whether critical values for the \texttt{geneSet} are computed using a permutation of the \texttt{geneStatistics}, default TRUE. The alternative is column permutations which requires a matrix of p-values corresponding to permutations of the original samples to be input. This latter option has not been fully implemented. \item \texttt{nPermutations} = number of permutations to be used for calculating the critical value for a certain \texttt{geneSet}. \item \texttt{smoothPValues} = logical indicating whether smoothing is desirable or not, default TRUE \item \texttt{probabilityVector} = vector of probabilities at which critical value curves for the \texttt{geneSet} are to be calculated. Default is c(0.5, 0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999) corresponding to p-values of respectively 0.5, 0.1, 0.01, 0.001 etc. \item \texttt{df} = degrees of freedom for the smoothing parameter used in \texttt{smoothPValues}. The higher, the more smooth, default 9. \item \texttt{addGeneSetDescription} = logical indicating whether adding gene sets annotation to the MLP output is desirable or not, default TRUE. \end{itemize} <>= set.seed(111) mlpOut <- MLP( geneSet = geneSet, geneStatistic = pvalues, minGenes = 5, maxGenes = 100, rowPermutations = TRUE, nPermutations = 50, smoothPValues = TRUE, probabilityVector = c(0.5, 0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999), df = 9) head(mlpOut) @ Some properties of the MLP procedure , as well as some parts of the implemented procedure, assume that the \texttt{geneStatistic} has a uniform distribution between 0 and 1 under the null hypothesis for a given analysis. \\The returned object is a data.frame of class \texttt{MLP} with at least 4 columns: \begin{itemize} \item \texttt{totalGeneSetSize} = total number of genes in the corresponding gene set \item \texttt{testedGeneSetSize} = number of genes in the corresponding \texttt{geneSet} for which a gene statistic has been submitted \item \texttt{geneSetStatistic} = mean of the -log10 of the genes tested in that \texttt{geneSet} \item \texttt{geneSetPValue} = p-value associated with the geneSetStatistic \end{itemize} This object has attributes which are used for visualising the analysis results. <>= str(attributes(mlpOut)) @ \clearpage \subsection{Visualize MLP Results} Three different types of plots are made available. The type of the plot is indicated with the \texttt{type} argument which can be one of \begin{verbatim} plot(., type = "quantileCurves") plot(., type = "barplot") plot(., type = "GOgraph") \end{verbatim} \subsubsection{Quantile Curves} This visualisation shows the relationship between the \texttt{geneSetStatistic} and the size of the gene sets. It also indicates the quantiles of interest as specified as \texttt{probabilityVector} in the \texttt{MLP} function. The most significant gene sets are plotted above the smooth curve. <>= pdf("mlpQuantileCurves.pdf", width = 10, height = 10) plot(mlpOut, type = "quantileCurves") dev.off() @ \begin{figure}[h!] \centering \includegraphics[width=0.7\textwidth]{mlpQuantileCurves} \caption{Example of a quantile curve for the MLP results. Every dot represents a gene set. Every line represents a smoothing of the quantile per gene set size.} \end{figure} \clearpage \subsubsection{Bar Plot} For this type of plot there are some extra parameters of interest: \begin{itemize} \item \texttt{nRow} = number of gene sets to include in the graph, default is 20 \item \texttt{barColors} = vector of colors, default is a shade of grey per bar. The 3 possible shadings correspond to the \% of genes in a gene set tested as compared to the total number of genes, the darker, the bigger the portion of genes tested. \item \texttt{ylab} =label for the y-axis \end{itemize} <>= pdf("mlpBarplot.pdf", width = 10, height = 10) op <- par(mar = c(30, 10, 6, 2)) plot(mlpOut, type = "barplot", ylab = "-log10(gene set p-value)") par(op) dev.off() @ \begin{figure}[h!] \centering \includegraphics[angle = 270, width = 0.9\textwidth]{mlpBarplot} \caption{Example of a barplot for the MLP results. The height of a bar represents the significance (-log10(\texttt{geneSetPValue})) of the gene set indicated on the x-axis. The number between brackets represent the number of genes within that gene set (number of genes for which a gene statistic has been submitted as well as the total number of genes).} \end{figure} \clearpage \subsubsection{Gene Ontology Graph} As the title indicates, this type of plot is only possible if the \texttt{geneSetSource} was 'GOBP', 'GOMF' or 'GOCC'. Also for this type of plot there is some extra parameters: \begin{itemize} \item \texttt{nRow} = number of gene sets as basis to create the graph, default is low, i.e. 5. The higher this number, the more populated the graph gets. \item \texttt{nCutDescPath} = number of characters at which the pathway description should be cut (depends on figure size) \end{itemize} <>= pdf("mlpGOgraph.pdf", width = 8, height = 6) op <- par(mar = c(0, 0, 0, 0)) plot(mlpOut, type = "GOgraph", nRow = 10, nCutDescPath = 15) par(op) dev.off() @ \begin{figure}[h!] \centering \includegraphics[angle = 90, width = 0.7\textwidth]{mlpGOgraph} \caption{Example of a GOgraph for the MLP results. Every elipse represents a gene set. The color indicates the significance, the more green, the more significant. The connectors indicate the parent - child relationship. The number between brackets represent the number of genes within that gene set (number of genes for which a gene statistic has been submitted as well as the total number of genes)} \end{figure} \clearpage \subsection{Visualize Individual Genes in a Gene Set} To know which genes contribute most to the significance of a gene set or to focus on a certain gene set of interest, you can plot the significance of each gene beloning to that gene set. This plot shows the significance (-log10(\texttt{geneStatistic})) of the genes within the gene set of interest. The \texttt{plotGeneSetSignificance} function needs 4 parameters and there is also one optional parameter: \begin{itemize} \item \texttt{geneSet} = object of class \texttt{geneSetMLP} created by getGeneSets \item \texttt{geneSetIdentifier} = identifier of the gene set of interest \item \texttt{geneStatistic} = named numeric vector which should have a uniform distribution between 0 and 1. The names are the corresponding Entrez Gene identifiers. \item \texttt{annotationPackage} = string representing the annotation package used to retrieve gene symbols and gene descriptions \item \texttt{barColors} = optional color vector \end{itemize} <>= geneSetID <- rownames(mlpOut)[1] pdf("geneSignificance.pdf", width = 10, height = 10) op <- par(mar = c(25, 10, 6, 2)) plotGeneSetSignificance( geneSet = geneSet, geneSetIdentifier = geneSetID, geneStatistic = pvalues, annotationPackage = annotation(expressionSetGcrma), ) par(op) dev.off() @ \begin{figure}[h!] \centering \includegraphics[angle = 270, width = 0.7\textwidth]{geneSignificance} \caption{Example of a gene significance plot for a gene set of interest. The height of a bar represents the significance (-log10(\texttt{geneStatistic})) of the gene indicated on the x-axis.} \end{figure} % %A second plot is showing the signal intensities across the samples %of the genes within the gene set of interest. %The \texttt{plotGeneSetIntensities} function needs 4 parameter: %\begin{itemize} %\item \texttt{eset} = ExpressionSet object %\item \texttt{geneSet} = object created by getGeneSets %\item \texttt{geneSetIdentifier} = identifier of the gene set of interest %\item \texttt{pdfName} = file name for the output of this function (always a pdf) %\end{itemize} % %<>= %plotGeneSetIntensities( % eset = expressionSetGcrma, % geneSet = geneSet, % geneSetIdentifier = geneSet, % pdfName = "geneIntensities.pdf") %@ % % %\begin{figure}[h!] %\includegraphics[width=0.7\textwidth]{geneIntensities} %\caption{Example of a gene intensity plot for a gene set of interest. The number between bracket correspond to the Entrez Gene identifier. The labels along the x-axis are the sample names.} %\end{figure} \section{References} Raghavan N, De Bondt AM, Talloen W, Moechars D, G\"{o}hlmann HW, Amaratunga D. The high-level similarity of some disparate gene expression measures. \emph{Bioinformatics}. 2007 Nov 15;23(22):3032-8. Epub 2007 Sep 24.PMID: 17893087 Raghavan N, Amaratunga D, Cabrera J, Nie A, Qin J, McMillian M. On methods for gene function scoring as a means of facilitating the interpretation of microarray results. \emph{J Comput Biol}. 2006 Apr; 13(3):798-809.PMID: 16706726 \end{document}