\documentclass[a4paper]{article} \usepackage[OT1]{fontenc} \usepackage{Sweave} \usepackage{url} \usepackage{afterpage} \usepackage{hyperref} \usepackage{geometry} \geometry{ hmargin=3cm, vmargin=2.5cm } \usepackage{graphicx} \begin{document} % \VignetteIndexEntry{a4vignette} % \VignetteDepends{ALL, MLP, nlcv} \title{Using the \texttt{a4} package} \author{Willem Talloen, Tobias Verbeke} \maketitle \tableofcontents \pagebreak{} <>= options(width = 50) options(continue=" ") options(prompt="R> ") set.seed(123) @ \section{Introduction} The \texttt{a4} suite of packages is a suite for convenient analysis of Affymetrix microarray experiments which supplements Goehlmann and Talloen (2010). The suite currently consists of several packages which are centered around particular tasks: \begin{itemize} \item \texttt{a4Preproc}: package for preprocessing of microarray data. Currently the only function in the package adds complementary annotation information to the ExpressionSet objects (in function \texttt{addGeneInfo}). Many of the subsequent analysis functions rely on the presence of such information. \item \texttt{a4Core}: package made to allow for easy interoperability with the \texttt{nlcv} package which is currently being developed on R-Forge at \url{http://r-forge.r-project.org/projects/nlcv}. \item \texttt{a4Base}: all basic functionality of the \texttt{a4} suite \item \texttt{a4Classif}: functionality for classification work that has been split off a.o. in order to reduce \texttt{a4Base} loading time \item \texttt{a4Reporting}: a package which provides reporting functionality and defines \texttt{xtable}-methods that are foreseen for tables with hyperlinks to public gene annotation resources. \end{itemize} This document provides an overview of the typical analysis workflow for such microarray experiments using functionality of all of the mentioned packages. \section{Preparation of the Data} First we load the package \texttt{a4} and the example real-life data set \texttt{ALL}. <>= library(a4) require(ALL) data(ALL, package = "ALL") @ For illustrative purposes, simulated data sets can also be very valuable (but not used here). <>= require(nlcv) esSim <- simulateData(nEffectRows=50, betweenClassDifference = 5, nNoEffectCols = 5, withinClassSd = 0.2) @ \subsection{ExpressionSet object} The data are assumed to be in an expressionSet object. Such an object structure combines different sources of information into a single structure, allowing easy data manipulation (e.g., subsetting, copying) and data modelling. The texttt{featureData} slot is typically not yet containing all relevant information about the genes. This interesting extra gene information can be added using \texttt{addGeneInfo}. <>= ALL <- addGeneInfo(ALL) @ <>= # The phenotypic data head(pData(ALL)[, c(1:5, 13, 18, 21)]) # The gene expression data head(exprs(ALL)[, 1:5]) # The feature data fDat <- head(pData(featureData(ALL))) fDat[,"Description"] <- substr(fDat[,"Description"], 1, 30) fDat @ \subsection{Some data manipulation} The \texttt{ALL} data consists out of samples obtained from two types of cells with very distinct expression profiles; B-cells and T-cells. To have a more subtle signal, gene expression will also be compared between the BCR/ABL and the NEG group within B-cells only. To this end, we create the expressionSet \texttt{bcrAblOrNeg} containing only B-cells with BCR/ABL or NEG. <>= Bcell <- grep("^B", as.character(ALL$BT)) # create B-Cell subset for ALL subsetType <- "BCR/ABL" # other subsetType can be "ALL/AF4" bcrAblOrNegIdx <- which(as.character(ALL$mol) %in% c("NEG", subsetType)) bcrAblOrNeg <- ALL[, intersect(Bcell, bcrAblOrNegIdx)] bcrAblOrNeg$mol.biol <- factor(bcrAblOrNeg$mol.biol) @ \pagebreak{} \section{Unsupervised data exploration} Spectral maps are very powerful techniques to get an unsupervised picture of how the data look like. A spectral map of the \texttt{ALL} data set shows that the B- and the T-subtypes cluster together along the x-axis (the first principal component). The plot also indicates which genes contribute in which way to this clustering. For example, the genes located in the same direction as the T-cell samples are higher expressed in these T-cells. Indeed, the two genes at the left (TCF7 and CD3D) are well known to be specifically expressed by T-cells (Wetering 1992, Krissansen 1986). <>= spectralMap(object = ALL, groups = "BT") # optional argument settings # plot.mpm.args=list(label.tol = 12, zoom = c(1,2), do.smoothScatter = TRUE), # probe2gene = TRUE) @ A spectral map of the \texttt{bcrAblOrNeg} data subset does not show a clustering of BCR/ABL or NEG cells. <>= spectralMap(object = bcrAblOrNeg, groups = "mol.biol", probe2gene = TRUE) @ \section{Filtering} The data can be filtered, for instance based on variance and intensity, in order to reduce the high-dimensionality. <>= selBcrAblOrNeg <- filterVarInt(object = bcrAblOrNeg) propSelGenes <- round((dim(selBcrAblOrNeg)[1]/dim(bcrAblOrNeg)[1])*100,1) @ This filter selected \Sexpr{propSelGenes} \% of the genes (\Sexpr{dim(selBcrAblOrNeg)[1]} of the in total \Sexpr{dim(bcrAblOrNeg)[1]} genes). \pagebreak{} \section{Detecting differential expression} \subsection{T-test} <>= tTestResult <- tTest(selBcrAblOrNeg, "mol.biol") @ <>= histPvalue(tTestResult[,"p"], addLegend = TRUE) propDEgenesRes <- propDEgenes(tTestResult[,"p"]) @ Using an ordinary t-test, there are \Sexpr{sum(tTestResult[, "pBH"] < 0.1)} genes significant at a FDR of 10\%. The proportion of genes that are trully differentially expressed is estimated to be around \Sexpr{propDEgenesRes}. The toptable and the volcano plot show that three most significant probe sets all target \texttt{ABL1}. This makes sense as the main difference between BCR/ABL and NEG cells is a mutation in this particular ABL gene. <>= tabTTest <- topTable(tTestResult, n = 10) xtable(tabTTest, caption="The top 5 features selected by an ordinary t-test.", label ="tablassoClass") @ <>= volcanoPlot(tTestResult, topPValues = 5, topLogRatios = 5) @ \subsection{Limma for comparing two groups} In this particular data set, the modified t-test using \texttt{limmaTwoLevels} provides very similar results. This is because the sample size is relatively large. <>= limmaResult <- limmaTwoLevels(selBcrAblOrNeg, "mol.biol") volcanoPlot(limmaResult) # histPvalue(limmaResult) # propDEgenes(limmaResult) @ It is very useful to put lists of genes in annotated tables where the genes get hyperlinks to \href{http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene}{EntrezGene}. <>= tabLimma <- topTable(limmaResult, n = 10, coef = 2) # 1st is (Intercept) @ <>= tabLimmaSel <- tabLimma[,c("SYMBOL", "logFC", "AveExpr","P.Value","adj.P.Val", "GENENAME" )] tabLimmaSel[, "GENENAME"] <- substr(tabLimmaSel[,"GENENAME"], 1, 38) dData <- data.frame(Gene = tabLimmaSel[, 1], tabLimmaSel[,-1], stringsAsFactors = FALSE, row.names = NULL) hData <- data.frame(Gene = generateEntrezIdLinks(tabLimma[,"ENTREZID"])) tabAnnot <- annotationTable(displayData = dData, hrefData = hData) xTabAnnot <- a4Reporting::xtable(tabAnnot, digits = 2, caption = "Top differentially expressed genes between disabled anf functional p53 cell lines.") print(xTabAnnot, include.rownames = FALSE, floating = FALSE) @ \subsection{Limma for linear relations with a continuous variable} Testing for (linear) relations of gene expression with a (continuous) variable is typically done using regression. A modified t-test approach improves the results by penalizing small slopes. The modified regressions can be applied using \texttt{limmaReg}. <>= @ \pagebreak{} \section{Class prediction} There are many classification algorithms with profound conceptual and methodological differences. Given the differences between the methods,there's probably no single classification method that always works best, but that certain methods perform better depending on the characteristics of the data. On the other hand, these methods are all designed for the same purpose, namely maximizing classification accuracy. They should consequently all pick up (the same) strong biological signal when present, resulting in similar outcomes. Personally, we like to apply four different approaches; PAM, RandomForest, forward filtering in combination with various classifiers, and LASSO. All four methods have the property that they search for the smallest set of genes while having the highest classification accuracy. The underlying rationale and algorithm is very different between the four approaches, making their combined use potentially complementary. \subsection{PAM} PAM (Tibshirani 2002) applies univariate and dependent feature selection. <>= resultPam <- pamClass(selBcrAblOrNeg, "mol.biol") plot(resultPam) featResultPam <- topTable(resultPam, n = 15) xtable(head(featResultPam$listGenes), caption = "Top 5 features selected by PAM.") @ \subsection{Random forest} Random forest with variable importance filtering (Breiman 2001, Diaz-Uriarte 2006) applies multivariate and dependent feature selection. Be cautious when interpreting its outcome, as the obtained results are unstable and sometimes overoptimistic. <>= resultRF <- rfClass(selBcrAblOrNeg, "mol.biol") plot(resultRF, which = 2) featResultRF <- topTable(resultRF, n = 15) xtable(head(featResultRF$topList), caption = "Features selected by Random Forest variable importance.") @ <>= plotCombination2genes(probesetId1=rownames(featResultRF$topList)[1], probesetId2=rownames(featResultRF$topList)[2], object = selBcrAblOrNeg, groups = "mol.biol") @ \subsection{Forward filtering with various classifiers} Forward filtering in combination with various classifiers (like DLDA, SVM, random forest, etc.) apply an independent feature selection. The selection can be either univariate or multivariate depending on the chosen selection algorithm; we usually choose Limma as a univariate although random forest variable importance could also be used as a multivariate selection criterium. <>= # nlcvTT <- nlcv(selBcrAblOrNeg, classVar = 'mol.biol', # classdist = "unbalanced", # nRuns = 10, fsMethod = "t.test", verbose = TRUE) data(nlcvTT) @ <>= mcrPlot_TT <- mcrPlot(nlcvTT, plot = TRUE, optimalDots = TRUE, layout = TRUE, main = "t-test selection") @ <>= xtable(summary(mcrPlot_TT),rownames=TRUE, caption = "Optimal number of genes per classification method together with the respective misclassification error rate (mean and standard deviation across all CV loops).", label = "tabmcrPlot_TT") @ <>= scoresPlot(nlcvTT, tech = "svm", nfeat = 2) @ \subsection{Penalized regression} LASSO (Tibshirani 2002) or elastic net (Zou 2005) apply multivariate and dependent feature selection. <>= resultLasso <- lassoClass(object = bcrAblOrNeg, groups = "mol.biol") plot(resultLasso, label = TRUE, main = "Lasso coefficients in relation to degree of penalization.") featResultLasso <- topTable(resultLasso, n = 15) @ <>= lassoTable <- xtable(featResultLasso, label = "tablassoClass", caption = "Features selected by Lasso, ranked from largest to smallest penalized coefficient.") print(lassoTable, include.rownames = FALSE) @ <>= op <- par(mfrow=c(1,2)) plotCombination2genes(geneSymbol1 = featResultLasso$topList[1, 1], geneSymbol2 = featResultLasso$topList[2, 1], object = bcrAblOrNeg, groups = "mol.biol", main = "Combination of\nfirst and second gene", addLegend = TRUE, legendPos = "topright") plotCombination2genes(geneSymbol1 = featResultLasso$topList[1, 1], geneSymbol2 = featResultLasso$topList[3, 1], object = bcrAblOrNeg, groups = "mol.biol", main = "Combination of\nfirst and third gene", addLegend = FALSE) par(op) @ \afterpage{\clearpage} \pagebreak{} \subsection{Logistic regression} Logistic regression is used for predicting the probability to belong to a certain class in binary classification problems. <>= logRegRes <- logReg(geneSymbol = "ABL1", object = bcrAblOrNeg, groups = "mol.biol") @ The obtained probabilities can be plotted with \texttt{ProbabilitiesPlot}. A horizontal line indicates the 50\% threshold, and samples that have a higher probability than 50\% are indicated with blue dots. Apparently, using the expression of the gene \texttt{ABL1}, quite a lot of samples predicted to with a high probability to be NEG, are indeed known to be NEG. <>= probabilitiesPlot(proportions = logRegRes$fit, classVar = logRegRes$y, sampleNames = rownames(logRegRes), main = "Probability of being NEG") @ <>= probabilitiesPlot(proportions = logRegRes$fit, classVar = logRegRes$y, barPlot= TRUE, sampleNames = rownames(logRegRes), main = "Probability of being NEG") @ \subsection{Receiver operating curve} A ROC curve plots the fraction of true positives (TPR = true positive rate) versus the fraction of false positives (FPR = false positive rate) for a binary classifier when the discrimination threshold is varied. Equivalently, one can also plot sensitivity versus (1 - specificity). <>= ROCres <- ROCcurve(geneSymbol = "ABL1", object = bcrAblOrNeg, groups = "mol.biol") @ \section{Visualization of interesting genes} \subsection{Plot the expression levels of one gene} Some potentially interesting genes can be visualized using \texttt{plot1gene}. Here the most significant gene is plotted. <>= plot1gene(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg, groups = "mol.biol", legendPos = "topright") @ There are some variations possible on the default \texttt{plot1gene} function. For example, the labels of x-axis can be changed or omitted. <>= plot1gene(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg, groups = "mol.biol", sampleIDs = "mol.biol", legendPos = "topright") @ Another option is to color the samples by another categorical variable than used for ordering. <>= plot1gene(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg, groups = "mol.biol", colgroups = 'BT', legendPos = "topright") @ The above graphs plot one sample per tickmark in the x-axis. This is very useful to explore the data as one can directly identify interesting samples. If it is not interesting to know which sample has which expression level, one may want to plot in the x-axis not the samples but the groups of interest. It is possible to pass arguments to the boxplot function to custopmize the graph. For example the \texttt{boxwex} argument allows to reduce the width of the boxes in the plot. <>= boxPlot(probesetId = rownames(tTestResult)[1], object = selBcrAblOrNeg, boxwex = 0.3, groups = "mol.biol", colgroups = "BT", legendPos = "topright") @ \subsection{Plot the expression levels of two genes versus each other} <>= plotCombination2genes(geneSymbol1 = featResultLasso$topList[1, 1], geneSymbol2 = featResultLasso$topList[2, 1], object = bcrAblOrNeg, groups = "mol.biol", main = "Combination of\nfirst and second gene", addLegend = TRUE, legendPos = "topright") @ \subsection{Plot expression line profiles of multiple genes/probesets across samples} Multiple genes can be plotted simultaneously on a graph using line profiles. Each line reflects one gene and are colored differenly. As an example, here three probesets that measure the gene \texttt{LCK}, found to be differentially expressed between B- and T-cells. Apparently, one probeset does not measure the gene appropriately. <>= myGeneSymbol <- "LCK" probesetPos <- which(myGeneSymbol == featureData(ALL)$SYMBOL) myProbesetIds <- featureNames(ALL)[probesetPos] profilesPlot(object = ALL, probesetIds = myProbesetIds, orderGroups = "BT", sampleIDs = "BT") @ \afterpage{\clearpage} \pagebreak{} \subsection{Smoothscatter plots} It may be of interest to look at correlations between samples. As each dot represents a gene, there are typically many dots. It is therefore wise to color the dots in a density dependent way. <>= plotComb2Samples(ALL, "11002", "01003", xlab = "a T-cell", ylab = "another T-cell") @ <>= png(filename="plotComb2Samples.png",width=500,height=500) plotComb2Samples(ALL, "11002", "01003", xlab = "a T-cell", ylab = "another T-cell") dev.off() @ \begin{figure}[h] \begin{center} \includegraphics[width=.95\textwidth]{plotComb2Samples.png} \caption{Correlations in gene expression profiles between two T-cell samples (samples 11002 and 01003).} \label{fig:plotComb2Samples} \end{center} \end{figure} \afterpage{\clearpage} \pagebreak{} If there are outlying genes, one can label them by their gene symbol by specifying the expression intervals (X- or Y- axis or both) that contain the genes to be highlighted using \texttt{trsholdX} and \texttt{trsholdY}. <>= plotComb2Samples(ALL, "84004", "01003", trsholdX = c(10,12), trsholdY = c(4,6), xlab = "a B-cell", ylab = "a T-cell") @ <>= png(filename="plotComb2SamplesWithAnnotation.png",width=500,height=500) plotComb2Samples(ALL,"84004", "01003", trsholdX = c(10,12), trsholdY = c(4,6), xlab = "a B-cell", ylab = "a T-cell") dev.off() @ \begin{figure}[h] \begin{center} \includegraphics[width=.95\textwidth]{plotComb2SamplesWithAnnotation.png} \caption{Correlations in gene expression profiles between a B-cell and a T-cell (samples 84004 and 01003). Some potentially interesting genes are indicated by their gene symbol.} \label{fig:plotComb2SamplesWithAnnotation} \end{center} \end{figure} \afterpage{\clearpage} \pagebreak{} One can also show multiple pairwise comparisons in a pairwise scatterplot matrix. <>= plotCombMultSamples(exprs(ALL)[,c("84004", "11002", "01003")]) # text.panel= function(x){x, labels = c("a B-cell", "a T-cell", "another T-cell")}) @ <>= png(filename="plotCombMultipleSamples.png", width=500, height=500) plotCombMultSamples(exprs(ALL)[, c("84004", "11002", "01003")]) dev.off() @ \begin{figure}[h] \begin{center} \includegraphics[width=.95\textwidth]{plotCombMultipleSamples.png} \caption{Correlations in gene expression profiles between a B-cell and two T-cell samples (respectively samples 84004, 11002 and 01003).} \label{fig:plotCombMultipleSamples} \end{center} \end{figure} \afterpage{\clearpage} \pagebreak{} \subsection{Gene lists of log ratios} When analyzing treatments that are primarily interesting relative to a control treatment, it may be of value to look at the log ratios of several treatments (in columns) for a selected list of genes (in rows). <>= ALL$BTtype <- as.factor(substr(ALL$BT,0,1)) ALL2 <- ALL[,ALL$BT != 'T1'] # omit subtype T1 as it only contains one sample ALL2$BTtype <- as.factor(substr(ALL2$BT,0,1)) # create a vector with only T and B # Test for differential expression between B and T cells tTestResult <- tTest(ALL, "BTtype", probe2gene = FALSE) topGenes <- rownames(tTestResult)[1:20] # plot the log ratios versus subtype B of the top genes LogRatioALL <- computeLogRatio(ALL2, reference = list(var="BT", level="B")) a <- plotLogRatio(e = LogRatioALL[topGenes,], openFile = FALSE, tooltipvalues = FALSE, device = "pdf", filename = "GeneLRlist", colorsColumnsBy = "BTtype", main = 'Top 20 genes most differentially between T- and B-cells', orderBy = list(rows = "hclust"), probe2gene = TRUE) @ \begin{figure}[h] \begin{center} \includegraphics[width=.95\textwidth]{GeneLRlist.pdf} \caption{Log ratios of the 20 genes that are most differentially expressed between B-cell and two T-cells.} \label{fig:GeneLRlist} \end{center} \end{figure} The following example demonstrates how to display log ratios for four compounds for which gene expression was measured on four timepoints. <>= load(system.file("extdata", "esetExampleTimeCourse.rda", package = "a4")) logRatioEset <- computeLogRatio(esetExampleTimeCourse, within = "hours", reference = list(var = "compound", level = "DMSO")) # re-order idx <- order(pData(logRatioEset)$compound, pData(logRatioEset)$hours) logRatioEset <- logRatioEset[,idx] # plot LogRatioEset across all cl <- "TEST" compound <- "COMPOUND" shortvarnames <- unique(interaction(pData(logRatioEset)$compound, pData(logRatioEset)$hours)) shortvarnames <- shortvarnames[-grep("DMSO", shortvarnames), drop=TRUE] plotLogRatio(e = logRatioEset, mx = 1, filename = "logRatioOverallTimeCourse.pdf", gene.fontsize = 8, orderBy = list(rows = "hclust", cols = NULL), colorsColumnsBy = c('compound'), within = "hours", shortvarnames = shortvarnames, exp.width = 1, main = paste("Differential Expression (trend at early time points) in", cl, "upon treatment with", compound), reference = list(var = "compound", level = "DMSO"), device = 'pdf') @ \begin{figure}[h] \begin{center} \includegraphics[width=.95\textwidth]{logRatioOverallTimeCourse.pdf} \caption{Log ratios for four compounds at four time points (for 20 genes).} \label{fig:logRatioOverallTimeCourse} \end{center} \end{figure} \afterpage{\clearpage} \pagebreak{} \section{Pathway analysis} \subsection{Minus log p} The MLP method is one method of pathway analysis that is commonly used by the a4 suite user base. Although the method is explained in detail in the MLP package vignette we briefly walk throught the analysis steps using the same example dataset used in the preceding parts of the analysis. In order to detect whether certain gene sets are enriched in genes with low p values, we obtain the vector of p values for the genes and the corresponding relevant gene sets: <>= require(MLP) # create groups labels <- as.factor(ifelse(regexpr("^B", as.character(pData(ALL)$BT))==1, "B", "T")) pData(ALL)$BT2 <- labels # generate p-values limmaResult <- limmaTwoLevels(object = ALL, group = "BT2") pValues <- limmaResult@MArrayLM$p.value pValueNames <- fData(ALL)[rownames(pValues), 'ENTREZID'] pValues <- pValues[,2] names(pValues) <- pValueNames pValues <- pValues[!is.na(pValueNames)] @ <>= geneSet <- getGeneSets(species = "Human", geneSetSource = "GOBP", entrezIdentifiers = names(pValues) ) tail(geneSet, 3) @ Next, we run the MLP analysis: <>= mlpOut <- MLP( geneSet = geneSet, geneStatistic = pValues, minGenes = 5, maxGenes = 100, rowPermutations = TRUE, nPermutations = 50, smoothPValues = TRUE, probabilityVector = c(0.5, 0.9, 0.95, 0.99, 0.999, 0.9999, 0.99999), df = 9) @ The results can be visualized in many ways, but for Gene Ontology based gene set definitions, the following graph may be useful: <>= library(Rgraphviz) library(GOstats) pdf(file = "GOgraph.pdf") plot(mlpOut, type = "GOgraph", nRow = 25) dev.off() @ \begin{figure} \includegraphics{GOgraph} \end{figure} \section{Software used} <>= toLatex(sessionInfo()) @ \end{document}