%\VignetteIndexEntry{sigPathway} %\VignetteDepends{hgu133a.db} %\VignetteKeywords{Pathway Analysis, Gene Sets} %\VignettePackage{sigPathway} \documentclass[11pt]{article} \SweaveOpts{echo=FALSE,engine=R} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage[margin=1in]{geometry} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \begin{document} \title{\bf sigPathway: Pathway Analysis with Microarray Data} \author{Weil Lai$^{1}$, Lu Tian$^{2}$, and Peter Park$^{1,3}$} \maketitle \begin{center} 1. Harvard-Partners Center for Genetics and Genomics, 77 Avenue Louis Pasteur, Boston, MA 02115 2. Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, 680 North Lake Shore Drive, Chicago, IL 60611 3. Children's Hospital Informatics Program, 300 Longwood Avenue, Boston, MA 02115 \\ \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} \Rpackage{sigPathway} is an R package that performs pathway (gene set) analysis on microarray data. It calculates two gene set statistics, the $NT_{k}$ (Q1) and $NE_{k}$ (Q2), by permutation, ranks the pathways based on the magnitudes of the two statistical tests, and estimates q-values for each pathway \citep{Tian2005}. The program permutes the rows and columns of the expression matrix for $NT_{k}$ and $NE_{k}$, respectively. In this vignette, we demonstrate how the user can use this package to identify statistically significant pathways in their data and export the results to HTML for browsing. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Data} In \cite{Tian2005}, microarray data from patients with diabetes, inflammatory myopathies, and Alzheimers' data sets were analyzed. To save disk space, a small portion of the inflammatory myopathies data set has been included with \Rpackage{sigPathway} as an example data set. Expression values and annotations for this data set are stored in the \Robject{MuscleExample} workspace. This workspace contains the following R objects: \begin{description} \item[tab] a filtered numeric matrix containing expression values from 7/13 normal (NORM) and 8/23 inclusion body myositis (IBM) samples. The row and column names of the matrix correspond to Affymetrix probe set IDs and sample IDs, respectively. The 5000 probe sets in this matrix represent the most variable probe sets (by expression value) in the 15 arrays. \item[phenotype] a character vector with {\tt 0\_NORM} to represent NORM and {\tt 1\_IBM} to represent IBM \item[G] a pathway annotation list containing the pathway's source, title, and associated probe set IDs \end{description} \noindent To load this data set, type 'data(MuscleExample)' after loading the \Rpackage{sigPathway} package. The pathways annotated in \Robject{G} were curated from Gene Ontology, KEGG, BioCarta, BioCyc, and SuperArray. Each element {\it within} \Robject{G} is a list describing a pathway with the following sub-elements: \begin{description} \item[src] a character vector containing either the pathway ID (for Gene Ontology) or the name of the pathway database \item[title] a character vector containing the pathway name \item[probes] a character vector containing probe set IDs that are associated with the pathway (by mapping them to Entrez Gene IDs) \end{description} \noindent The full inflammatory myopathway data set and pathway annotations for other, selected Affymetrix microarray platforms are available at \url{http://www.chip.org/~ppark/Supplements/PNAS05.html}. For example, the more comprehensive pathway annotation list for the Affymetrix HG-U133A platform is called {\it GenesetsU133a}. For arrays not listed on the website (or for scenarios such as linkage analysis), the user can make his/her own pathway annotations and use them in \Rpackage{sigPathway} as long as the pathway annotations are arranged in the above format. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Example} In this section, we show the R code necessary to conduct pathway analysis with \Rpackage{sigPathway} on an example data set. First, we load \Rpackage{sigPathway} and the example data set into memory. If we are dealing with the full data set, we could remove probe sets that have expression values less than the trimmed mean in all of the arrays. We assume that the probe sets with lower expression values across all arrays are not of interest. The trimmed mean was used as the filtering criteron in \cite{Tian2005}. The probe sets in the example data set were selected for their variance across 15 arrays (not shown). <>= library(sigPathway) data(MuscleExample) ls() @ For microarray data, the convention is to use rows and columns to represent probe sets and individual arrays, respectively. To tell the program which column in \Robject{tab} belongs to which phenotype, we have created a character vector with {\tt 0\_NORM} to represent NORM and {\tt 1\_IBM} to represent IBM. Because {\tt 0\_NORM} comes before {\tt 1\_IBM} in alphanumeric order, the program internally treats NORM as {\tt 0} and IBM as {\tt 1}. Alternatively, we could have simply used the numerals {\tt 0} and {\tt 1} to represent NORM and IBM. Note that the row names for \Robject{tab} are probe set IDs. <>= dim(tab) print(tab[501:504, 1:3]) table(phenotype) @ How much do IBM and NORM samples differ? Let us plot the unadjusted p-values for each probe set from the 2 group (sample) t-test, assuming unequal variances and using the Welch approximation to estimate the appropriate degrees of freedom. \begin{center} <>= statList <- calcTStatFast(tab, phenotype, ngroups = 2) hist(statList$pval, breaks = seq(0,1,0.025), xlab = "p-value", ylab = "Frequency", main = "") @ \end{center} The two different types of samples are certainly very different by the probe set level, but what pathways are driving the differences? With our pathway annotations, we calculate the $NT_{k}$ and $NE_{k}$ statistics for each gene set, and rank the top pathways based on the magnitude of the two statistics. The result is stored in a list (\Robject{res.muscle}), of which we will later use to write results to HTML. <>= set.seed(1234) res.muscle <- runSigPathway(G, 20, 500, tab, phenotype, nsim = 1000, weightType = "constant", ngroups = 2, npath = 25, verbose = FALSE, allpathways = FALSE, annotpkg = "hgu133a.db", alwaysUseRandomPerm = FALSE) @ The \Rfunction{set.seed} function is used here only for the purpose of getting the exact results when regenerating this vignette from its source files. Because there can be many thousands of pathways represented in the pathway annotations, we have chosen to analyze pathways that contain at least 20 probe sets as represented in \Robject{tab}. We also exclude pathways represented by more than 500 probe sets because larger pathways tend to be non-specific. These two values were the ones used in \cite{Tian2005}. To save space, our pathway annotation list has already been filtered with the above criteria. So, all of the \Sexpr{length(G)} pathways in \Robject{G} will be considered in the calculations. The run time of the $NT_{k}$ and $NE_{k}$ is approximately linearly proportional to \Robject{nsim}, or the maximum number of permutations. When \Robject{alwaysUseRandomPerm} is set to {\tt FALSE} (the default value), the program will use a smaller \Robject{nsim} for the $NE_{k}$ calculations and switch to using complete permutation if the total number of unique permutations for the phenotype is less than \Robject{nsim}. We are setting \Robject{weightType} to 'constant' because of the additional time required to calculate variable weights for $NE_{k}$. If the histogram of unadjusted p-values (of the probe sets) is nearly horizontal, and we later observe high q-values (i.e., approaching 1) for the top ranked pathways, then setting \Robject{weightType} to 'variable' would help lower some of the $NE_{k}$ q-values. To rank the pathways, the program adds up the ranks corresponding to the magnitudes of $NT_{k}$ and $NE_{k}$. When \Robject{npath} is set to 25 and \Robject{allpathways} to FALSE, the program considers the top 25 pathways for each gene set statistic before summing the individual ranks. If \Robject{allpathways} is set to {\tt TRUE}, then all pathways are ranked for each gene set statistic before summing the individual ranks. Here, \Robject{allpathways} is set to {\tt FALSE} because we are interested in observing pathways that are consistently highly ranked for each gene set statistic. Also, please note that out of the numerous input parameters to \Rfunction{runSigPathway}, \Robject{annotpkg} is optional because it refers to a Bioconductor metadata package that may not already be present on your installation of R. In our example, 'hgu133a.db' refers to the BioConductor metadata package of the Affymetrix HG-U133A platform. By specifying 'hgu133a.db' for \Robject{annotpkg}, \Rfunction{runSigPathway} will include the accession number, Entrez Gene ID, gene symbol, and gene name of probe sets associated with each pathway in the list of top pathways. Printed below is a table of the top 10 pathways, the set size, the $NT_{k}$ and $NE_{k}$ statistics, and the statistics' ranks and q-values. This table is accessible through the following command: <>= print(res.muscle$df.pathways[1:10, ]) @ The positive signs on the gene set statistics indicate that the corresponding pathways are more highly expressed in IBM compared to NORM. Had we defined {\tt 1} for NORM and {\tt 0} for IBM, the interpretation would remain the same, but we would expect the signs for the gene set statistics to be flipped. Detailed information about each probe set in each pathway on the list of top pathways are stored in the \Robject{list.gPS}, an element within \Robject{res.muscle}. \Robject{list.gPS} is a list containing data frames describing the probe sets for each top pathway. For example, let us view the annotations and test statistics for 10 probe sets in the {\it MHC class I receptor activity} pathway. <>= print(res.muscle$list.gPS[[7]][1:10, ]) @ A much more intuitive method to browse through the results is to write the results to HTML, which can then be read by an Internet browser program (e.g., Mozilla Firefox, Microsoft Internet Explorer). Writing the results can be achieved with the \Rfunction{writeSigPathway} function. Please refer to the help file of \Rfunction{writeSigPathway} for more details on how to save to results to a specific directory. \noindent Figures \ref{fig:toppathways} and \ref{fig:mhc1} show examples of the HTML output after running \Rfunction{writeSigPathway} and opening the corresponding HTML file in an Internet browser. \begin{figure} \includegraphics[width=\columnwidth]{TopPathwaysTable} \caption{List of Top Pathways in Inclusion Body Myositis versus Normal} \label{fig:toppathways} \begin{center} \end{center} \end{figure} \begin{figure} \includegraphics[width=\columnwidth]{MHC1} \caption{MHC class I receptor activity} \label{fig:mhc1} \begin{center} \end{center} \end{figure} \section{Notes} This vignette was compiled with the following settings: <>= print(sessionInfo()) @ \bibliographystyle{plainnat} \bibliography{sigPathway-vignette} \end{document}