%\VignetteIndexEntry{Running qusage} \documentclass[12pt]{article} \usepackage{hyperref} \usepackage{amsmath} \textwidth=6.2in \textheight=8.5in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \begin{document} \SweaveOpts{concordance=TRUE} \title{qusage Package Vignette} \author{Christopher Bolen} \date{1 September 2012, Revised 3 October 2015} \maketitle \tableofcontents \clearpage \section{Introduction} This package is an implementation the Quantitative Set Analysis for Gene Expression (qusage) method described in Yaari et al. (Nucleic Acids Res, 2013). This is a novel Gene Set Enrichment-type test, which is designed to provide a faster, more accurate, and easier to understand test for gene expression studies. The package is designed primarily for use with microarray expression data, but can be extended for use with count data as well. qusage accounts for inter-gene correlations using a Variance Inflation Factor technique that extends the method proposed by Wu et al. (Nucleic Acids Res, 2012). In addition, rather than simply evaluating the deviation from a null hypothesis with a single number (a P value), qusage quantifies gene set activity with a complete probability density function (PDF). From this PDF, P values and confidence intervals can be easily extracted. Preserving the PDF also allows for post-hoc analysis (e.g., pair-wise comparisons of gene set activity) while maintaining statistical traceability. Finally, while qusage is compatible with individual gene statistics from existing methods (e.g., LIMMA), a Welch-based method is implemented that is shown to improve specificity. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Getting Started} The qusage package can be downloaded from the Bioconductor website using the following commands: <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("qusage") @ \clearpage \section{A Simple Example} \label{sec:simple} <>= %% Load in data and source files. options(width=70) options(SweaveHooks=list(fig=function() par(mar=c(5.1, 4.1, 1.1, 2.1)))) library(qusage) library(nlme) data("fluExample") data("GeneSets") i = c(which(flu.meta$Hours==0), which(flu.meta$Hours==77)) eset = eset.full[,i] resp = flu.meta$Condition[i] @ The simplest way to run the qusage algorithm is by using the \texttt{qusage} function. This function is a wrapper for the three primary algorithms used in the package, \texttt{makeComparison} \texttt{aggregateGeneSet}, and \texttt{calcVIF}, each of which can be run individually if you want to have greater control over the parameters used in the algorithm. A discussion of each function can be found in section~\ref{sec:algorithm}. With default settings, \texttt{qusage} takes four inputs: \texttt{eset}, \texttt{labels}, \texttt{contrast}, \& \texttt{\mbox{geneSets}} \begin{enumerate} \item \texttt{eset} This variable can be provided in two formats: either as an objet of class \texttt{ExpressionSet} containing log-normalized expression data (as created by the affy and lumi packages), or as matrix of $log_2$ gene expression measurements, with rows of genes and columns of samples. For best results, we recommend that this dataset be trimmed to remove genes with very low expression. For this example, we'll use part of a dataset from a study by Huang Y et al. (GEO ID: GSE30550), which looks at gene expression in 17 patients before and 77 hours after exposure to the influenza virus. This dataset has already been trimmed to remove genes with very low expression. Here is what the first few rows of our expression data look like: <>= dim(eset) eset[1:5,1:5] @ \item \textbf{\texttt{labels}} This is a character vector which contains labels describing the group structure in eset. The vector must have one entry for each column in the eset, and the labels must be legal variable names in R, meaning they can't contain spaces and must start with a letter. Often data will consist of two groups, described using "mock" and "treatment", or "healthy" and "disease". For our example data, we have 17 patients, each with a time point before and after viral exposure, which we will call "t0" and "t1". We generate the labels as follows: <>= labels = c(rep("t0",17),rep("t1",17)) labels @ \item \textbf{\texttt{contrast}} The contrast is a single character string indicating how you want to compare the groups in your data. In almost all cases, this will be of the form \texttt{"treatment-mock"}, which will test whether our gene sets are significantly different in treatment vs mock samples. We could also switch the order of the two groups, i.e. \texttt{"mock-treatment"}, which would result in negative fold changes for pathways that are higher in \texttt{treatment}. In our example, the contrast of interest is simply: <>= contrast = "t1-t0" @ \item \textbf{\texttt{geneSets}} This is either a vector describing a single gene set, or a list of vectors representing a group of gene sets, such as the ones available from Broad's Molecular Signatures Database. In this first example, we will use a single list of IFN-stimulated genes, which looks like this: <>= ISG.geneSet[1:10] @ Notice that both this vector and the rownames of our \texttt{eset} are using gene symbols. We could have just as easily used the original Affymetrix probe IDs or even converted everything to REFSEQ IDs. The only requirement is that the rownames of \texttt{eset} match the symbols used in \texttt{geneSets}. If they are in a different format, you will receive an error stating that all of your gene sets are of size 0. \end{enumerate} With these four parameters, we can run \texttt{qusage} as follows: \begin{footnotesize} <>= qs.results = qusage(eset, labels, contrast, ISG.geneSet) @ \end{footnotesize} The \texttt{qusage} method will return a QSarray object containing statistical data on both the distributions of individual genes and on the pathway itself (for a full description of the data in the QSarray object, refer to section~\ref{sec:QSarrayObj}). While the results of this can be analyzed however you choose, there are a variety of functions available in the qusage package for exploring these results. To find out if our \texttt{geneSet} is significantly enriched for this comparison, we use the \texttt{pdf.pVal} function, which calculates a p-value asking whether the convoluted distribution of our gene set is significantly enriched (i.e. that the mean fold change of the gene set is different from 0). <>= pdf.pVal(qs.results) @ In order to view the probability density function, you can either use the \texttt{plotDensityCurves} function, or directly call \texttt{plot} on your QSarray object. In this case, \texttt{plot} will automatically call \texttt{plotDensityCurves} with default parameters. This will appear as follows: <>= plot(qs.results, xlab="Gene Set Activation") @ \begin{center} <>= <> @ %\caption{Probability Density Curve for average fold change in $t1-t0$. Dotted line at 0 fold change represents point where $t1=t0$} \label{fig:simplePDF} \end{center} The \texttt{plotDensityCurves} function also invisibly returns the x- and y- coordinates of the PDFs being plotted in the form of a list of matrices, which can be used to generate custom plots for the PDFs. For more information on custom plots, refer to section~\ref{sec:plotFns}). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Multiple Gene Sets} \label{sec:geneSetDB} Most often, qusage is not going to be run with a single gene set. Instead, it is most useful when run on a group of gene sets as a tool for hypothesis discovery. To demonstrate this, we will use MsigDB's KEGG database (which can be downloaded at \url{www.broadinstitute.org/gsea/msigdb}). We can read this file in using the \texttt{read.gmt} function. With the resulting object, we run \texttt{qusage} exactly as before: <>= MSIG.geneSets = read.gmt("c2.cp.kegg.v4.0.symbols.gmt") @ <>= summary(MSIG.geneSets[1:5]) MSIG.geneSets[2] qs.results.msig = qusage(eset, labels, contrast, MSIG.geneSets) numPathways(qs.results.msig) @ The \texttt{qs.results.msig} object now contains information for each of the gene sets in \texttt{MSIG.geneSets}. Once again, we can get the p-values for the comparison by using the \texttt{pdf.pVal} function. However, because we have calculated such a large number of p-values, it's generally not valid to look at them without performing some kind of multiple hypothesis testing. Below, we calculate the p-values and use R's built-in \texttt{p.adjust} method to calculate the FDR values. <>= p.vals = pdf.pVal(qs.results.msig) head(p.vals) q.vals = p.adjust(p.vals, method="fdr") head(q.vals) @ All of this data is also presented using the \texttt{qsTable} function, which returns a table ordered according to the significance of each gene set. For our data, the results are as follows: <>= options(width=100) @ \begin{scriptsize} <>= qsTable(qs.results.msig, number=10) @ \end{scriptsize} <>= options(width=70) @ We can also try plotting the data again using a call to \texttt{plot(qs.results.msig)}. <>= par(mar=c(10,4,1,2)) plot(qs.results.msig) @ \begin{center} \includegraphics[width=1.1\textwidth]{qusage-plotCIs.pdf} %\caption{Means and $95\%$ confidence intervals for the pathways in MSIG.geneSets} \label{fig:CIsPlot} \end{center} However, this time the qusage package creates a different kind of plot. This is because, by default, when the QSarray object contains data on more than ten pathways, \texttt{plot} will automatically use the function \texttt{plotCIs}. This function calculates the 95\% confidence interval for each gene set using the \texttt{calcBayesCI} function and plots it along with the mean fold change of each gene set. This is usually a more useful view when there are a large number of gene sets, but a plot of the PDFs can still be generated using a call to \texttt{plotDensityCurves}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Paired Samples} \label{sec:pair} In some cases, samples in a gene expression experiment may be paired with each other. For example, we may have pairs of samples from the same patient both before and after treatment. Because there may be patient-specific effects in the gene expression data, we want to use this information when we calculate pathway activation. In qusage, all we have to do to indicate paired samples is supply a \texttt{pairVector} containing information on which samples should be paired together. The vector should be the same length as \texttt{eset}, and it can be either a numeric vector or a set of factors which uniquely describe each pair. A simpler way to think about the pair vector is to consider it as a vector of patient IDs for each column in eset. Although for simplicity we have been ignoring it thus far, our gene expression data is actually a paired dataset as well. We have a set of 17 patients with two samples each (t0 and t1). For our \texttt{pairVector}, we will use the numbers 1:17 to represent our patients, as follows: <>= pairs = c(1:17,1:17) pairs @ Then, we can simply run the \texttt{qusage} function in a paired manner by adding the argument \texttt{pairVector=pairs} (note: we don't actually show this here, but you can see it in action in the next section). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Two-Way Comparisons} \label{sec:twoWay} The true power of the qusage method comes when you have results from more than one run of \texttt{qusage} that you wish to compare. If a dataset contains more than two groups, it's common to compare these groups in many different ways, and the ablility to compare the results of these comparisons with each other is extremely useful. These two-way comparisons are important when examining differences in the response characteristics of various groups (e.g. to answer the question of "does group A respond to treatment in a different way than group B?"). To illustrate how to make these comparisons, we will once again be using the Influenza response data. Although in the previous examples we have grouped all patients together, the patients can be classified into two distinct groups based on their response to the virus. Of the 17 patients in the study, 9 of them had a symptomatic response, and 8 had an asymptomatic response. While we could compare these two groups of patients directly at the two time points, it might be more interesting to see if the response to virus over the time-course of the experiment is different between these two groups. Before we can compare these groups, however, we will need to remake the \texttt{labels} in order to describe the new groups in our data. For this example, we have the response information stored in a variable called \texttt{resp}, which is just a vector with either "sx" (for symptomatic patients) or "asx" (for asymptomatic patients): <>= resp twoWay.labels = paste(resp, labels, sep=".") twoWay.labels @ Now, we run \texttt{qusage} twice; once for each comparison that we are interested in (note: we're also including information on the sample pairs, which was described in section~\ref{sec:pair}). <>= sx.results = qusage(eset, twoWay.labels, "sx.t1-sx.t0", MSIG.geneSets, pairVector=pairs) asx.results = qusage(eset, twoWay.labels, "asx.t1-asx.t0", MSIG.geneSets, pairVector=pairs) @ We could examine the information contained in these two runs using the functions described above, but what we are most interested in is whether or not there is a significant difference between the responses of these two groups. We can directly compare the results in these two groups using the \texttt{twoCurve.pVal} function. This will return a p-value for the comparison of each gene set in the two objects. <>= p.vals = twoCurve.pVal(sx.results,asx.results) head(p.vals) @ NOTE: In \texttt{qusage} version 2.0, we have added a streamlined version of the same methodology that also calculates the PDF of the difference between these two comparisons. In this case, a more complex comparison can be specified, indicating that the resulting PDF should be the difference between the symptomatic PDF and the asymptomatic PDF. <>= asx.vs.sx = qusage(eset, twoWay.labels, "(sx.t1-sx.t0) - (asx.t1-asx.t0)", MSIG.geneSets, pairVector=pairs) p.vals = pdf.pVal(asx.vs.sx) head(p.vals) @ It turns out that gene set 125 -- the \verb@"CYTOSOLIC DNA SENSING"@ pathway -- has the most significant difference in activation. There are multiple ways we could examine this individual gene set, but perhaps the simplest way is to again use the \texttt{plotDensityCurves} function. Notice here that we are specifying which pathway to plot using the \texttt{path.index} parameter. <>= plotDensityCurves(asx.results,path.index=125,xlim=c(-0.2,0.5), col=3,main="CYTOSOLIC DNA SENSING") plotDensityCurves(sx.results,path.index=125,col=2,add=T) plotDensityCurves(asx.vs.sx, path.index=125,col=1,add=T, lwd=3) legend("topleft", legend=c("asx","sx","asx-sx"),lty=1,col=c(2:3,1), lwd=c(1,1,3)) @ \begin{center} <>= <> @ %\caption{Comparison of PDFs for symptomatic (red) and asymptomatic (black) patients.} \end{center} The results here are pretty interesting, actually. What we're seeing is that, in symptomatic patients, the Cytosolic DNA Sensing pathway is extremely active, while the asymptomatic patients essentially have no activity at all. %%%%%%%%%------- Gene Distribution ------%%%%%%%%%%%%% \subsubsection{Individual Gene Plots} \label{sec:genePlot} Often the next question when an interesting pathway is found is what's happening with the individual genes in that pathway. Looking at the individual genes can give some idea as to how consistent the patterns of regulation in the data are, or whether the significance levels are being driven by only one or two individual genes. Because the gene set PDF in this package is actually a convolution of the PDFs of each individual gene, we can actually look at the distributions of each gene directly to understand where our variation comes from. qusage provides two plotting functions which can be used to visualize this data. The first is the \texttt{plotGeneSetDistributions} function, which plots out the PDF for both the pathway and for that pathway's individual genes. The plot, which can be seen below, includes thin colored curves for each gene (color coded according to their standard deviation), as well as a thicker black curve representing the PDF of the entire dataset. Here we plot the results for both the symptomatic patients and the asymptomatic patients side-by-side. <>= plotGeneSetDistributions(sx.results,asx.results,path.index=125) @ \begin{center} \includegraphics[width=1.1\textwidth]{qusage-plotGSD.pdf} \label{fig:GSDplot} \end{center} Again, we can see that the symptomatic patients have much more activity than the asymptomatic patients, a pattern which seems generally consistent in many of the genes in this pathway. This plot is useful for giving a general idea of how the genes in the pathway act. However, it would also be useful to find out which genes are the most active, and since the \texttt{plotGeneSetDistributions} plot does not provide labels for the individual genes, we must use a separate function: \texttt{plotCIsGenes} to see which genes are most active. This function produces a very similar plot to the \texttt{plotCIs} function, but in this case, it will plot the means and confidence intervals for each individual gene in a pathway. To generate this plot for the CYTOSOLIC DNA SENSING pathway, we will use the following code: <>= par(mar=c(8,4,1,2)) plotCIsGenes(sx.results,path.index=125) @ \begin{center} \includegraphics[width=1.1\textwidth]{qusage-plotCIsGenes.pdf} %\caption{Means and $95\%$ confidence intervals for the pathways in MSIG.geneSets} \label{fig:CIsGenesPlot} \end{center} \subsection{Running qusage with small sample sizes} Although qusage does not require as many samples as permutation-based methods, the current implementation requires at least three degrees of freedom. As the variance of a t distribution is $\frac{\nu}{\nu-2}$, this will diverge for $\nu \rightarrow 2$. Hence, in cases where the degrees of freedom falls below three, QuSAGE artificially set the degrees of freedom to be three and issues a warning message that the results should be interpreted with caution. It is imortant to keep in mind that in this case, the p-values will be inflated, and may not accurately reflect the underlying data. A second problem with small sample sizes is the large range over which the gene set PDF must be sampled. Whereas a t-distribution with 25 degrees of freedom is sampled from $(-10,10)$, a distribution with four degrees of freedom must be sampled from $(-416,416)$. When using the default number of points \texttt{(n.points=4096)}, this can result in inadequate sampling of the distribution and inaccurate estimates of pathway significance. In order to prevent this, you must increase the n.points parameter when running the aggregateGeneSet function (this is done automatically when using the \texttt{qusage} function if the degrees of freedom are below 5). However, note that the run time of aggregateGeneSet increases linearly with the number of points, so increasing this parameter may significantly slow down the algorithm. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%% Chapter 4: The qusage Algorithm %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{The qusage Algorithm} \label{sec:algorithm} The qusage algorithm involves two basic steps: 1) Calculating gene-level comparisons, and 2) Aggregating gene-level data for each gene set. In addition, there is one optional (but highly recommended) step, 3) determining the Variance Inflation Factor (VIF), which corrects for gene-gene correlations in the gene sets. In section~\ref{sec:simple}, we discussed how to use the \texttt{qusage} algorithm to run all three steps together with the default parameters. Below, we discuss each function and their parameters in detail, and look at examples of how to run each function separately. %%%%%%%%%%%%%%%%%%%%%%% \subsection{Calculating Gene Level Comparisons} \label{sec:makeComparison} This section describes the \texttt{makeComparison} function, which is the first step in the qusage algorithm. This function defines the t-distributions for every gene in \texttt{eset} by calculating the fold change and standard deviation between two groups of samples. The function has three required parameters: \texttt{eset}, \texttt{labels}, and \texttt{contrast}, which were described in section~\ref{sec:simple}. A \texttt{pairVector} can also be provided to specify that the comparisons should be done in a paired manner (see section~\ref{sec:pair}). The \texttt{makeComparison} function will return a QSarray object containing information on the means and standard deviations of each gene, which will be used in the subsequent functions to calculate pathway enrichment. \vspace{5mm} \texttt{var.equal} There are two main algorithms used to calculate a t-distribution from two groups of samples, known as the "Pooled Variance" and "Welch's approximate" approach. In the first method (the pooled variance approach), the variances of the two groups are assumed to be equal, and the difference in the mean expression values is modeled as a Student's t distribution. The second method, proposed by B. L. Welch (Biometrika, 1947), is also based on a Student's t distribution, but relaxes the assumption of equal variances. Most recent gene expression studies take the first approach and assume that the variances of the two groups are equal. For example, this assumption underlies the individual gene statistics in the widely used LIMMA package in Bioconductor. The pooled variance approach can slightly improve sensitivity (if the equal variance assumption holds), and is easily compatible with linear models and ANOVA. However, this approach can be extremely biased when the assumption of equal variances is broken, which we find is often the case in many real gene expression data sets. By default, the \texttt{makeComparison} function will assume unequal variances, but this behavior can be changed by specifying the parameter \texttt{var.equal = TRUE}. If \texttt{var.equal} is set to \texttt{TRUE}, the t-distribution is actually calculated using the linear model approach implemented by LIMMA. This may be appealing in some cases, because using LIMMA's linear model framework makes more complicated contrasts possible, while also making it possible to use LIMMA's framework for the Bayesian estimation of gene variances, as described below. \vspace{5mm} \texttt{bayesEstimation} If \texttt{var.equal} is set to \texttt{TRUE} and LIMMA is used to calculate the t-distributions, we can use some of the additional features of LIMMA when calculating the parameters for each gene. The \texttt{bayesEstimation} parameter is used by LIMMA to optionally apply a bayesian step in the estimation of the pooled variances of each gene. This bayesian step is designed to shrink the variances of the genes on the chip towards a common value. In some cases, this may help to give slightly more stable values for the variances, especially in cases where there are very few samples. The Bayesian step also prevents any genes from having a variance of 0, which would otherwise lead to divergence of the t-distribution. For complete details of how the bayesian estimation step works, refer to the documentation of the LIMMA package. \vspace{5mm} \texttt{min.variance.factor} If \texttt{var.equal} is set to \texttt{FALSE}, the \texttt{min.variance.factor} can be used as an alternative to the Bayesian estimation step described above. The min.variance.factor is designed to add a very small value to the variance of each gene on the chip in order to prevent any variances from being set to 0. The default, which is set to $10^-8$, is small enough that it will not impact the variance of expressed genes, but will prevent the t-distribution from diverging on genes with variance 0. %%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Aggregating gene-level data into gene set PDFs} \label{sec:calcPDF} The second step of the \texttt{qusage} algorithm is the aggregation step, which convolutes individual gene t-distributions into a single PDF for each gene set. This step is performed by the \texttt{aggregateGeneSet} function. The function has two required parameters: a \texttt{QSarray} object such as the one returned by \texttt{makeComparison} (see section~\ref{sec:makeComparison}), and a list of \texttt{geneSets}. As mentioned in section~\ref{sec:simple}, the geneSets parameter can either be a character vector containing a single gene set, or a list of character vectors representing multiple different gene sets. This function also contains an option to change the number of points that the convoluted PDF is sampled over. By default the parameter \texttt{n.points} is set to $2^{12}$, or 4096 points, which will give very accurate p-values in most cases. Sampling at more points will increase the accuracy of the resulting p-values, but will also linearly increase the amount of time needed to calculate the result. In many cases, as few as $1/4$ this number of points can be used without seriously affecting the accuracy of the resulting p-values. The PDF for each individual gene set is generated by using numerical convolution applied to the individual gene PDFs. Briefly, a Fast Fourier Transform (FFT) is calculated for each individual gene PDF to arrive at a k-component vector. The product of each component across all of the genes is then taken to arrive at a new k-component vector for the gene set. The real part of the resulting product is then transformed back to a PDF using a reverse FFT, and assured to be normalized and centered around zero. The mean of the combined PDF is simply the mean fold change of the individual genes. The range for sampling is determined by the lowest degrees of freedom of the individual genes, such that at most $10^{-8}$ of the cumulative distribution at the tails are excluded (i.e., assumed to be $0$). For example, when $\nu = 3$, the range is $(-480,480)$, and when $\nu = 120$, the range is $(-6,6)$. Technically, the output of this step is the PDF of the \textit{sum} of differences in expressions over all genes in the gene set under the assumption that the genes are independent. In order to estimate the \textit{mean} differential expression PDF, this distribution is scaled by a factor of $1/N$, where $N$ is the number of genes in the gene set. The resulting PDFs of the input gene sets are stored as a matrix in \texttt{path.PDF} slot of the returned QSarray object. However, the x-coordinates for these PDFs are not stored in the QSarray object, and must be calculated using the \texttt{getXcoords} function. \subsection{The Variance Inflation Factor} \label{sec:calcVIF} The expression measurements of individual genes in a set are almost always correlated. As such, it is important to take these correlations into account when testing for gene set enrichment. In qusage, we build off of a technique proposed by Wu et al. (Nucleic Acids Res, 2012), which calculates a Variance Inflation Factor (VIF) for each gene set based on the correlation of the genes in that set. This method, referred to as CAMERA, uses the linear model framework created by LIMMA to calculate gene-gene correlations, but consequently it must assume equal variance not only between all groups in the dataset, but also across each gene in the gene set. While this assumption leads to a slightly more computationally efficient VIF calculation, it is not valid for most gene sets, and its violation can greatly impact specificity. In the qusage paper (Yaari et al., NAR, 2013), we show that this assumption of equal variances leads to over-estimation of the VIF, and consequently an overall loss of power. Although our recommendation is to use the qusage algorithm for calculating the VIF, the \texttt{calcVIF} function includes the option to use either the CAMERA method (as implemented in the \texttt{limma} package in R) or the qusage method. If the parameter \texttt{useCAMERA=TRUE} is specified when running \texttt{calcVIF}, the CAMERA method will be used when calculating the VIF for each gene set. By default, this will only be used if \texttt{var.equal} was set to \texttt{TRUE} during the original call to \texttt{makeComparison}. \vspace{5mm} \texttt{useAllData} The \texttt{calcVIF} algorithm contains an additional parameter, \texttt{useAllData}, which specifies whether all of the groups in \texttt{eset} should be used when calculating the VIF using the qusage algorithm, or if only the two groups used in the original comparison should be used. By default (\texttt{useAllData=TRUE}), all of the data will be used to calculate the VIF. In most cases, using more data will increase the accuracy of the VIF calculation, especially in cases where individual groups are very small. However, it may not always be valid to include all of the data when calculating gene-gene correlations (e.g. if the data set contains expression values from more than one species), so it may be useful to limit what groups are used in the VIF calculation. Note that this parameter is only used if \texttt{useCAMERA} is \texttt{FALSE}, as the CAMERA algorithm automatically uses all data in the eset provided. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%% Chapter 5: QGen %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Using qusage with Generalized Linear Mixed Models} The qusage methodology was extended by Turner et al (BMC Bioinformatics 2015) to incorporate general linear mixed models available from the nlme package. The function \texttt{qgen} is now available through the qusage package to implement the extension. The main difference lies in that the t-statistic information in qusage are based on two sample t-tests that cannot incorporate additional covariates to adjust the fold change estimates due to potential confounders that can exist within observational studies. The qgen algorithm uses the t-statistic information derived from a linear mixed model and estimates the VIFs using the raw residuals rather than group labelings as this can yield biased VIF estimates. \subsection{The qgen function} Although the general structure of the \texttt{qgen} function is similar to \texttt{qusage}, there are more inputs required in order for the user to incorporate added complexity in the statistical modeling step. With default settings, \texttt{qgen} takes six inputs: \texttt{eset}, \texttt{design},\texttt{fixed}, \texttt{contrast.factor}, \texttt{contrast}, \& \texttt{\mbox{geneSets}}, however, additional inputs are often necessary to handle repeated measures data through \texttt{random} and \texttt{correlations}. \begin{enumerate} \item \texttt{eset} This variable is a matrix of log2 gene expression measurements, with rows of genes and columns of samples. For best results, we recommend that this dataset be trimmed to remove genes with very low expression. For this example, we'll use the entire dataset from the study by Huang Y et al. (GEO ID: GSE30550), which looks at gene expression in 17 patients across a large number of time points in hours before and after exposure to the influenza virus. In addition to time points, each patient's condition is also defined as either "asymptomatic" (asx) or "symptomatic" (sx) to the exposure. <>= dim(eset.full) eset.full[1:5,1:5] @ \item \textbf{\texttt{design}} Rather than using a \texttt{labels} vector describing the group structure in eset, a dataframe must be provided that contains the information needed for the linear mixed model. This can contain the fixed effect factors such as groups, continuous explanatory variables, as well as donor specific identifiers and time points necessary to model repeated measure through random effects or residual side covariance structures. It is strongly recommended that an addition column is included corresponding to the sample names of eset. With this information, qgen checks for any discrepancies that exist between eset and design. Any samples that do not exist in both files are removed. If such an identifier is not specified using the \texttt{design.sampleid} option, qgen assumes that the order in the design file corresponds to the order of the samples given in \texttt{eset}. Here are the first few rows of the design file for FluEset: <>= head(flu.meta) @ \item \textbf{\texttt{fixed}} Fixed is a one sided formula specifying the fixed effects the user wishes to include in the linear model. Typical linear models are of the two sided formula \texttt{y~X1+X2+X3}, but since the response variable is defined by the expression values of eset, only the right hand side is needed and take the general form, \texttt{~X1+X2+X3}. For this example the primary interest is determining if there are changes over time with respect to the baseline values both for the asymptomatic and symptomatic groups. Additional sample information such as age and gender could also be included to account for any additional source of variability or if they were confounded with primary variables of interest. We include fixed effects for subject's condition, time, and their interaction as well as including both age and gender. <>= fixedeffects <- ~Hours+Condition+Condition*Hours+Age+Gender @ \item \textbf{\texttt{contrast.factor}} This one-sided fromula refers to one of the fixed effects components in the fixed input that tells qgen what contrast the user is interested in making a comparison. So if the user wishes to compare symptomatic versus asymptomatic regardless of time, then the user would simply provide $\sim$Hours. If the user would like to compare symptomatic versus asymptomatic at hour 77 then the user would provide the interaction component $\sim$Hours*Condition. For our example we wish to compare changes at Hour 77 versus Hour 0 among the symptomatic group only, this corresponds to: <>= con.fac <- ~Condition*Hours @ \item \textbf{\texttt{contrast}} This is a character string that serves the same purpose in the original \texttt{qusage} function. It specifies which levels of contrast.factor the user wishes to compare. This is usually of the form "TrtA - TrtB". Since our comparison involves the combination of two factor levels, the levels just need to be concatenated in the corresponding order. So to compare Hour 77 vs Hour 0 among the symptomatic group, the corresponding contrast can be written as: <>= contrast = "sx77 - sx0" @ The one-sided formula in contrast.factor, and the corresponding statement in contrast are fed directly to the \texttt{makeContrasts} of the limma package to obtain the appropriate sequence of 0's and 1's for the contrast. One caveat to the naming convention of contrast is that the names in contrast must be valid R names. So if one were to specify the contrast.factor=~Hours*Condition and contrast= "45Symp - 0Symp", an error will return as \texttt{makeContrasts} only allows factor level names to be valid R names and cannot start with a numeric value. \item \textbf{\texttt{geneSets}} This is exactly the same as for the original qusage function, either a vector describing a single gene set, or a list of vectors representing a group of gene sets. For this example, we will continue to use the IFN-stimulated genes which looks like this: <>= ISG.geneSet[1:10] @ Notice that both this vector and the rownames of our \texttt{eset} are using gene symbols. We could have just as easily used the original Affymetrix probe IDs or even converted everything to REFSEQ IDs. The only requirement is that the rownames of \texttt{eset} match the symbols used in \texttt{geneSets}. If they are in a different format, you will receive an error stating that all of your gene sets are of size 0. \end{enumerate} With these 6 parameters specified, qgen will run the analysis using ordinary least squares regression (the default unless additional options are specified). \subsection{Qgen with random effects} The real power of this approach is running models that can handle the repeated measures nature of the study design. This can be done by specifying either the \texttt{random} parameter, the \texttt{correlation} parameter, or a combination of both. Both of these options are directly fed to the linear model functions provided in the nlme package. See \texttt{lme} and \texttt{gls} functions for additional details. In order to generate a model with random effects, an optional formula must be supplied to \texttt{random}. For a simple repeated measures study design such as this, the form is usually ~1|g where g specifies the donor or subject ids. To run the qgen analysis using a random effect, the following command can be called: <>= qusage.result.random<-qgen(eset.full,flu.meta, fixed=fixedeffects, contrast.factor=con.fac, contrast=contrast, geneSets=ISG.geneSet, random=~1|Subject, design.sampleid="SampleID") @ The results of this analysis can then be examined using the same functions described in section~\ref{sec:simple}. <>= qsTable(qusage.result.random) @ \subsection{Qgen with correlations} An optional nlme corStruct object to specify within group correlation structure through the residual covariance matrix and is passed directly to the lme function within the nlme package. There are many modeling options for the user to pick from if this is desired, we strongly urge reading the documentation within the nlme package on CorStruct objects. As an example suppose we assume that the correlations that exist over the repeated measure time points die off over time exponentially. The corresponding model to incorporate this strategy would yield <>= corStruct = corExp(value=3,~Hours.Numeric|Subject) corStruct qusage.result.corstruct<-qgen(eset.full,flu.meta, fixed=fixedeffects, contrast.factor=con.fac, contrast=contrast, geneSets=ISG.geneSet, correlation=corStruct, design.sampleid="SampleID") @ The results of both the correlation analysis and the random effects analysis can be compared directly, revealing that in both cases, there seems to be an approximately $(2^{0.8}=1.75)$-fold increase in the expression of ISGs at day 77 of infection. However, in this case, taking random effects into account leads to a more significant p-value, resulting from lower variance in the estimate. <>= plotDensityCurves(qusage.result.random, main="ISG_geneSet", xlim=c(0,1.2), xlab="Day 77 - Day 0", col="red") plotDensityCurves(qusage.result.corstruct, col="blue",add=T) @ \begin{center} <>= <> @ \end{center} \subsubsection{Additional qgen details} The computational speed of qgen depends on a couple of factors, in particular, the complexity of the linear model, the total number of genes within the gene sets, the total number of genesets, and the total number of samples. To save computational time, qgen only performs the linear model on probes that are directly included within the gene sets as genes not included will not be used in the analysis. For the previous random effect model example, if one would to run qgen with 260 gene sets comprising of ~12,000 genes, the algorithm from start to finish will take approximately 9 minutes. The qgen function provides a series of checks and provides various warnings and errors if any of the above mentioned options are not syntactically correct or if other obvious discrepancies exists. It must be noted, that in some situations, some probes may have little to no variation due to processing or low intensity thresholds. In these cases, the linear mixed model may not converge to the maximum likelihood estimates. Currently, probes that do not convergence are removed from the analysis and a warning is presented. qgen returns a QSarray object, therefore, any additional functions that apply to the original qusage result object such as plot and qsTable can be applied to results obtained from qgen. The VIF estimation techniques are implemented by the method provided by Turner et al. (BMC Bioinformatics 2015). In certain situations when the number of random effect replicates are low, an alternative VIF estimation is preferred to maintain type-I error estimation. qgen checks for these situations and automatically conducts the correct VIF estimation when it can. However, sometimes the technique will not be available due to the study design or the model is more complex than the scenarios provided by the authors. In cases such as this, a warning is produced letting the user know that the VIF estimate will use the original conditional residual approach and the type-I error rate may not be conserved. We suggest a simulation study could provide insight on whether or not there is a concern. Additional approaches similar to that of Turner et al. (BMC Bioinformatics 2015) could be incorporated to handle these situations in future updates. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%% Chapter 6: Meta Analysis %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \clearpage \section{Meta-Analysis with qusage} \label{sec:metaAnalysis} QuSAGE meta-analysis enables gene set analysis from multiple studies to be combined into a single estimation of gene set activity. This is done by using numerical convolution to combine the probability density functions (PDFs) of each data set into a single PDF. <>= data("fluVaccine") @ To illustrate how QuSAGE meta-analysis works, we analyzed gene expression data from young individuals in two Influenza vaccination response data sets (GSE59635 and GSE59654). The goal of the analysis was to detect gene sets associated with successful vaccination responses. For this analysis, we will use the blood transcription modules (BTMs) described in Li et al. (Nat Immunol., 2014). <>= read.gmt("BTM_for_GSEA_20131008.gmt") @ In data set GSE59635, there are 8 young subjects (3 low-responders and 5 high-responders to the vaccine). Each subject has day 0 (pre-vaccination) and day 7 (post-vaccination) blood samples. <>= head(fluVaccine$phenoData[["GSE59635"]]) @ In data set GSE59654, there are 11 young subjects (7 low-responders and 4 high-responders to the vaccine). Similarly, each subject has day 0 (pre-vaccination) and day 7 (post-vaccination) blood samples. <>= head(fluVaccine$phenoData[["GSE59654"]]) @ Using this data, QuSAGE was first used to carry out a two-way comparison (see section~\ref{sec:twoWay} for details) between low-responders and high-responders on each data set independently. We will do this by looping over each dataset with an \texttt{lapply} statement. Note that for this analysis, we're increasing the number of points \texttt{(n.points)} above the default due to low numbers of samples: <>= qs.results.list = lapply(c("GSE59635", "GSE59654"), function(n){ eset = fluVaccine$esets[[n]] phenoData = fluVaccine$phenoData[[n]] labels = paste(phenoData$responder, phenoData$bloodDrawDay, sep = ".") ##run QuSAGE qusage(eset, labels, "(high.d7-high.d0)-(low.d7-low.d0)", BTM.geneSets, pairVector=phenoData$subjectId, n.points=2^14) }) @ Next, QuSAGE meta-analysis was applied on the two resulting QSarray objects (one from each study). Each distrisbution is first weighted by the number of samples in each dataset, and then the two distributions are combined using numerical convolution. <>= combined = combinePDFs(qs.results.list) @ P-values for each module can then be calculated the same way as for any other \texttt{QSarray} object. <>= head(pdf.pVal(combined)) @ It turns out that "plasma cells, immunoglobulins (M156.1)" was one of top ranked gene modules from this QuSAGE meta-analysis. This module is significantly more up-regulated (day 7 vs. day 0) in high-responders compared to low-responders to flu vaccination. To plot this result, we can use the \texttt{plotCombinedPDF} function, as shown below. <>= plotCombinedPDF(combined, path.index=235) legend("topleft", legend=c("GSE59635", "GSE59654", "Meta-analysis"), lty=1, col=c("red", "blue","black")) @ \begin{center} <>= <> @ \end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Chapter 7: Plotting Functions %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \clearpage \section{Plotting Functions} \label{sec:plotFns} The \texttt{qusage} package contains a number of functions designed to easily and intuitively visualize the results stored in a QSarray object. Many of these functions were demonstrated in section~\ref{sec:simple}, but in the following sections we will highlight the variety of options available when plotting the diverse data generated when running the qusage algorithm. \subsection{plotDensityCurves} \label{sec:plotDensityCurves} The \texttt{plotDensityCurves} function is the simplest way to plot the data in a \texttt{QSarray} object. By default, this function generates a very simple plot containing probability density curves for each of the gene sets included in the qusage run. The density (y-axis) data for each PDF is stored in the \texttt{path.pdf} slot of the \texttt{QSarray} object, and the x-coordinates for the PDF are calculated on the fly using the \texttt{getXcoords} function (see below for details on this method). We've seen PDF plots before in section~\ref{sec:simple}, but a slightly more customized version of the plot with multiple PDFs can be seen below: <>= ##select the best 5 gene sets p.vals = pdf.pVal(qs.results.msig) topSets = order(p.vals)[1:5] ##plot plotDensityCurves(qs.results.msig, path.index=topSets, col=1:5, lwd=2, xlab=expression(log[2](Activity))) @ \begin{center} <>= <> @ \end{center} Because this function is simply an extension of the basic \texttt{plot} function, these plots can be heavily customized. This customization can be accomplished both through the basic plotting parameters, such as \texttt{xlab, ylab, main} and the parameters passed on to \texttt{par}, and via some parameters which are specific to \texttt{plotDensityCurves}. The \texttt{add} parameter can also be specified to overlay a second plot over an existing one. The function-specific parameters are listed below. \begin{itemize} \item \textbf{\texttt{path.index}} This parameter controls which PDFs are plotted and the order that they are plotted in. It can be specified as either a numeric vector or a vector of names matching the gene sets. \item \textbf{\texttt{zeroLine}} By default, a vertical dashed line is added to the plot at 0 in order to give a visual indication of how significant the gene set is. In order to suppress the addition of this line, the parameter \texttt{zeroLine=FALSE} can be specified. \item \textbf{\texttt{addVIF}} Much like the other functions in qusage, this function gives you the option to specify whether the VIF should be used in calculating the PDF. By default, if the VIF has been calculated, it will always be included, but this behavior can be suppressed by explicity specifying \texttt{addVIF=FALSE}. While this will improve how significant your PDFs appear, it is strongly reccommended that the VIF is included whenever possible, as it is the only way to account for gene-gene correlations in a gene set. \end{itemize} The \texttt{plotDensityCurves} function is a useful way to visualize the PDF of a gene set, but in some cases it may be necessary to access the x and y coordinates of these PDFs individually. The information required to create these plots is all stored in the QSarray object, but some processing is necessary to get coordinates in the correct format for plotting. Although the y-coordinates can generally be used as-is when taken from the \texttt{path.pdf} matrix, the x-coordinates are not stored in the qusage object. Instead, the \texttt{getXcoords} function calculates x-coordinates for each point n using the following formula: \begin{align*} x_n = (-1+\frac{2(n-1)}{N_{pts}-1}) \times r \times \sqrt{VIF} + \hat{\mu}_{path} \end{align*} where $N_{pts}$ is the total number of points, $r$ is the range used for the numerical convolution (see section~\ref{sec:calcPDF}), $VIF$ is the Variance Inflation Factor, and $\hat{\mu}_{path}$ is the mean fold change of the pathway. Although the x- and y- coordinates can be extracted by hand from a qusage object, by far the easiest way to access this information is with the \texttt{plotDensityCurves} function itself. As mentioned in section~\ref{sec:simple}, the \texttt{plotDensityCurves} function silently returns the x- and y-coordinates of each PDF being plotted. This data is returned as a list of data frames, with each individual data frame containing the x- and y-coordinates for the PDF of a single gene set. An example of how to use this data for a custom plot is shown below. In this case, we will use the results from running qusage on the ISG set, stored in the \texttt{qs.results} object, to generate a Cumulative Distribution Function, or CDF. <>= ##get coordinates coord.list = plotDensityCurves(qs.results, plot=FALSE) coords = coord.list[[1]] x=coords$x ##Calculate the CDF as the integral of the PDF at each point. y = cumsum(coords$y)/sum(coords$y) plot(x,y, type="l", xlim=calcBayesCI(qs.results,low=0.001)[,1]) @ \begin{center} <>= <> @ \label{fig:customCDF} \end{center} \clearpage \subsection{plotGeneSetDistributions} The \texttt{plotGeneSetDistributions} function is a method designed to visualize the PDFs of each individual gene in a gene set. As seen in section \ref{sec:genePlot}, the resulting plot is a useful and intuitive way to view how individual genes contribute to the overall PDF of the gene set. To generate a default plot, \texttt{plotGeneSetDistributions} can be called as shown below: <>= plotGeneSetDistributions(qs.results) @ \begin{center} \includegraphics[width=1.1\textwidth]{qusage-plotISG_GSD.pdf} \end{center} The default gene set distribution plot is composed of two parts: a plot of the gene and gene set PDFs, and a barcode plot representing the mean of the individual genes. Both portions of the plot can be customized using the parameters in \texttt{plotGeneSetDistributions} and with the generic parameters that can be passed on to \texttt{plot}. \vspace{5mm} \textbf{PDFs} The PDF plot is, in essence, an extension of the \texttt{plotDensityCurves} function discussed in section~\ref{sec:plotDensityCurves}. The plot includes the average PDF for the entire gene set, which is calculated in the same way as discussed in section~\ref{sec:plotDensityCurves}, as well as PDFs for each individual gene in the pathway. Because the mean and SD of each gene was calculated using a t-test, the individual gene PDFs are generated as t-distributions with parameters mean, standard deviation, and degrees of freedom defined by the values stored in \texttt{QSarray}. \texttt{plotGeneSetDistributions} contains a number of parameters which are useful for customizing the PDF plots. These parameters are listed below: \begin{itemize} \item \textbf{\texttt{colorScheme}} This parameter is used to establish a color scheme for the individual gene PDFs. The default color scheme, sdHeat, will automatically color-code the gene PDFs by their standard deviations, with hotter colors being used for smaller standard deviations. In contrast, the "rainbow" color scheme will use a repeating series of colors, so that individual curves can be easily distinguished from their neighbors. Although these two options are the only automatically generated options, the \texttt{colorScheme} parameter can also accept a custom color scheme, which can be provided as a vector of colors. For a single gene set distribution plot, this can be useful for highlighting the PDFs of specific genes of interest. However, it is important to note that the order that the colors are used in is not the same as the order of genes in the original gene set. All gene sets are reordered when they are stored in the \texttt{QSarray\$pathways} slot, and the vector provided to \texttt{colorScheme} will be used in this order. \item \textbf{\texttt{alpha}} This parameter controls the alpha channel of the individual gene PDFs. It is only used if the "sdHeat" or "rainbow" colorSchemes are used. Making the PDFs slightly transparent can be especially useful when there are a large number of genes, and the individual pdfs have a significant amount of overlap. \item \textbf{\texttt{normalizePeaks}} This parameter is used to control the heights of the individual gene PDFs. By default, the PDF heights will be scaled to correctly represent the density of the PDF, but in some cases, keeping the PDFs at a fixed height may make the picture easier to understand. If \texttt{normalizePeaks} is set to \texttt{TRUE}, the height of the individual gene PDFs will be scaled to be exactly the same. \item \textbf{\texttt{lwds}} This parameter controls the line widths for the PDFs. The first parameter controls the individual gene PDFs, and the second controls the gene set PDFs. \end{itemize} \vspace{5mm} \textbf{Barcode Plot} The barcode plot is a visual way of marking where the peaks of an individual gene PDF (i.e. the mean fold change of the genes) are located. By default, the barcode will be added as a series of vertical black lines directly below the PDF plot. The appearance of this plot can be controlled by three different parameters: \begin{itemize} \item \texttt{addBarcod} This boolean parameter controls the plotting of the barcode. If \texttt{addBarcode} is set to \texttt{FALSE}, the barcode will not be added to the plot. \item \texttt{barcode.col} This parameter controls the color of the vertical bars. This parameter can either be a single color or a vector with individual colors for each gene in the gene set. As with \texttt{colorScheme}, it is important to note that the order of the genes in the gene set is not the same as the input parameters. All gene sets are reordered when they are stored in the \texttt{QSarray\$pathways} slot, and the vector provided to \texttt{colorScheme} will be used in this order. \item \texttt{barcode.hei} This parameter controls the height of the barcode plot. This can be specified as a single numeric value representing the height of the bar as a fraction of the height of the main plotting region. \end{itemize} \subsubsection{Comparing two gene sets} In cases where you want to compare two gene set distribution plots, the \texttt{plotGeneSetDistributions} function can automatically plot the results from a second gene set directly below the first. This can be done in two ways. First, to compare two different gene sets, a vector of length 2 can be provided to \texttt{path.index}, representing the two gene sets to be compared from \texttt{QSarray1}. Second, to compare two different runs of qusage, a second QSarray object can be provided via the \texttt{QSarray2} parameter. In this case, if \texttt{path.index} is of length 1, the single gene set specified will be pulled from both QSarray objects, making it easy to compare the activity of the gene set between runs. If both QSarray objects are provided and \texttt{path.index} is of length 2, then the first gene set will be pulled from QSarray1, and the second from QSarray2. \vspace{5mm} \subsubsection{Custom Example: Highlighting genes of interest} As an example of customizing the density curves plot, we will once again look at the activity of the \verb@"CYTOSOLIC DNA SENSING"@ pathway in our dataset. Perhaps the gene we are most interested in is CXCL10, and we would like to see how this specific gene compares to the others in the gene set. In order to do this, we will highlight the PDF for CXCL10 in red, and plot the rest of the gene PDFs in grey. <>= path = 125 ##the CYTOSOLIC_DNA_SENSING pathway path.gene.i = sx.results$pathways[[path]] ##the genes are stored as indexes path.genes = rownames(eset)[path.gene.i] cols = rep("#33333399",length(path.genes)) ##make the IFNA genes red, and the IFNB genes blue. cols[path.genes=="CXCL10"] = "#FF0000" plotGeneSetDistributions(sx.results,asx.results,path.index=125, groupLabel=c("Symptomatic","Asymptomatic"), colorScheme=cols, barcode.col=cols) @ \begin{center} \includegraphics[width=1.1\textwidth]{qusage-plotGSDHighlight.pdf} \end{center} \clearpage \subsection{plotCIs} As an alternative to plotting the entire PDF of a pathway, the qusage package also provides a function, \texttt{plotCIs}, which is designed to plot just the mean and $95\%$ confidence interval of a pathway. An example of this function was seen in section~\ref{sec:geneSetDB}, and the resulting plot is re-created below. \begin{center} \includegraphics[width=1.1\textwidth]{qusage-plotCIs.pdf} %\caption{Means and $95\%$ confidence intervals for the pathways in MSIG.geneSets} \label{fig:CIsPlotRepeat} \end{center} The default CI plot displays the mean fold change and $95\%$ confidence interval of all the pathways contained in a \texttt{QSarray} object. The confidence intervals for each pathway are calculated using the \texttt{calcBayesCI} function, and the amount of overlap with the zero line can be used to estimate the significance of the gene set. In addition, each point is color coded by significance using the p-values calculated with \texttt{pdf.pVal}, and both the p-values and the CIs are corrected for multiple hypothesis testing using the methods defined in \texttt{p.adjust}. This plot can also be heavily customized using any of the additional parameters listed below. \begin{itemize} \item \textbf{\texttt{path.index}} This parameter controls which PDFs are plotted and the order that they are plotted in. It can be specified as either a numeric vector or a vector of names matching the gene sets. By default, the order of these pathways does not matter, but if \texttt{sort.by} is set to \texttt{"none"}, the pathways will be plotted in the order provided. \item \textbf{\texttt{sort.by}} This parameter accepts a string (either \texttt{"mean"}, \texttt{"p"}, or \texttt{"none"}) indicating how the pathways should be ordered. If \texttt{sort.by="mean"} or \texttt{"p"}, the pathways will be ordered by their means and p-values, respectively. Else, if \texttt{sort.by="none"}, the pathways will not be rearranged at all, and the order provided in path.index will be retained. The last option is most useful when plotting the results of two qusage runs in one plot, when it is necessary to ensure both plots have the same order. \item \textbf{\texttt{lowerBound, upperBound}} These parameters are passed on to \texttt{calcBayesCI} for use in calculating the confidence interval. The default, \texttt{lowerBound=0.025, upperBound=0.975}, represents a two-sided $95\%$ confidence interval. \item \textbf{\texttt{use.p.colors}} This parameter controls the coloring of the error bars for the plot. By default (\texttt{use.p.colors=TRUE}), the bars will automatically be colored different shades of red and green based on the significance of their respective p-values (e.g. a p-value of 0.05 would result in a dark red bar, and a p-value of 0.0001 would result in a bright red bar). \item \textbf{\texttt{col}} This is a vector of colors which can be provided to define the color of the points. If \texttt{use.p.colors=FALSE} is specified, these colors will also be used for the error bars. \item \textbf{\texttt{p.breaks}} This is a numeric vector indicating where the breaks in the p-value color scheme should be placed. By default, breaks will be at 0.001, 0.005, 0.01, 0.05, \& 0.1, meaning that a p-value between 0.05 and 0.1 will be a darker color than a p-value between 0.01 and 0.05. If you just wanted all significant p-values to be colored the same, you would simply set \texttt{p.breaks=0.05} \item \textbf{\texttt{p.adjust.method}} The method to use to adjust the p-values. Must be one of the methods in \texttt{p.adjust.methods}. \item \textbf{\texttt{addLegend}} This is a logical parameter specifying if a legend for the p-value color scheme be plotted. If true, the values from p.breaks will be plotted alongside their respective colors in the top right hand corner. \item \textbf{\texttt{lowerColorBar}} Options for plotting a color bar below each point. Automatically generated color bars have not yet been implemented, but custom bars can be created using the "lowerColorBar.cols" parameter. \item \textbf{\texttt{lowerColorBar.cols}} This is a vector of colors to plot as a bar along the bottom of the plot. This can be useful if there's other information you wish to include in the plot, such as a p-value for a second comparison, or a visual representation of another test statistic from a different program. \item \textbf{\texttt{addGrid}} This boolean parameter controls the addition of dashed lines behind the plot which help to follow a point down to its x-axis label. \item \textbf{\texttt{x.labels}} This is a character vector of the same length as \texttt{path.index} giving the names of the pathways. By default, \texttt{plotCIs} will use the names stored in \texttt{QSarray}. \item \textbf{\texttt{cex.xaxis}} This parameter controls the character expansion of the x-axis separate from the overall \texttt{cex}. The value supplied will be multiplied by the current plot's \texttt{cex} to determine the size of the x-axis text \item \textbf{\texttt{shift}} This is a number between 0 and 1 decribing the amount to shift points with respects to the guiding lines and axis labels. This is primarily used when \texttt{add=TRUE} and you do not want the two plots to directly overlap. \item \textbf{\texttt{add}} This boolean parameter is used to overlay a second plot over one that was previously generated. If \texttt{add=TRUE}, the bar plots will be added on to the existing plotting region. \end{itemize} \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%% The QSarray Object %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{The QSarray object} \label{sec:QSarrayObj} The QSarray object is designed to hold the results of the various functions and algorithms that make up the qusage package. Exactly what information is contained in a QSarray object is dependent upon which functions have been used to create the object. A new QSarray object can also be created by calling the \texttt{QSarray()} function. A list of all possible fields and their descriptions are included below. \begin{itemize} \item \textbf{\texttt{mean, SD, dof}} - Vectors containing the means, standard deviations, and degrees of freedom for each gene as output by \texttt{makeComparison} (see section~\ref{sec:makeComparison}). Together they describe the t-distribution of each gene in the dataset. \item \textbf{\texttt{n.samples}} - The number of samples used in the comparison. If the QSarray object is the result of a call to \texttt{combinePDF}, this will be a numeric vector detailing the number of samples in each input QSarray. \item \textbf{\texttt{path.PDF}} - This is a matrix which contains the density values for the PDF of each gene set provided to \texttt{aggregateGeneSet}. The number of rows of \texttt{path.PDF} is determined by the \texttt{n.points} parameter, which by default is 4096. \item \textbf{\texttt{path.mean}} - A numeric vector containing the mean fold change of each gene set provided to \texttt{aggregateGeneSet}. \item \textbf{\texttt{ranges}} - The (uncorrected) range that each PDF was calculated over. If the VIF is not used to correct the range, the x-coordinates of the PDF are the sequence of n.points from \texttt{path.mean-ranges} to \texttt{path.mean+ranges} \item \textbf{\texttt{path.size}} - A vector containing the number of features in each pathway that mapped to the input data. \item \textbf{\texttt{vif}} - A vector containing the Variance Inflation Factor for each pathway, as calculated by calcVIF. \item \textbf{\texttt{pathways}} - A version of the input \texttt{geneSets}, represented as a list of numeric vectors which map to the rows of \texttt{eset} which contain features in that gene set. \item \textbf{\texttt{labels, contrast, pairVector, var.method, n.points}} - Copies of the input parameters which are used in future calculations. \item \texttt{design} - A design matrix for the linear model fitting in LIMMA. Only created if limma is used in \texttt{makeComparison} (see section~\ref{sec:makeComparison}) \item \textbf{\texttt{QSlist}} - If the QSarray is the result of a call to \texttt{combinePDF} (see chapter~\ref{sec:metaAnalysis}), the QSlist will contain the list of QSarray objects used during the meta-analysis. \end{itemize} \end{document}