%\VignetteIndexEntry{PCpheno Vignette} %\VignetteDepends{} %\VignetteKeywords{Phenotype} %\VignettePackage{PCpheno} \documentclass[11pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage[small]{caption} \usepackage{array} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \title{PCpheno: assessing the role cellular organizational units in determining phenotype} \author{\small{Nolwenn Le Meur and Robert Gentleman}} \begin{document} \maketitle \section{Introduction} We propose computational methods and statistical paradigms to explore the relationships between phenotypic data and cellular organizational units, such as multi-protein complexes or pathways. Indeed, while proteins are often the primary unit used by cells to carry out the many different functions that the cell requires for life, they seldom accomplish important tasks alone, but rather assemble into organizational units. Recent studies suggest that some control of phenotype can be usefully attributed to multi-protein complexes rather than genes \citep{Deutschbauer2005a, Spirin2006} and hence may help provide elucidation of the underlying roles or mechanisms that directly control changes in phenotype. \section{Data sources} In this package, we currently present yeast phenotypic datasets and use of the yeast cellular organizational units defined in the Bioconductor package \textit{ScISI} package and the KEGG pathways listed in the \textit{org.Sc.sgd.db} package (formerly the \textit{YEAST} package).\\ Nevertheless our methods can easily be applied to other species and other estimates of organizational units within the genome, or proteome, and in no way rely on the particular choices we have made here. \subsection{Phenotypic data} We currently propose 7 Yeast phenotypic datasets, downloaded from the the literature and the \textit{Saccharomyces Genome Database} (\href{http://www.yeastgenome.org/}{http://www.yeastgenome.org/}). \begin{itemize} \item \cite{Giaever2002} collection of single gene-deletion mutants under 6 different experimental conditions. \item \cite{Dudley2005} collection of single gene-deletion mutants under 21 different experimental conditions. \item \cite{Deutschbauer2005a} collection of haploinsufficient genes. \item \cite{Lesage2005} network of genetic interactions. \item \cite{Kastenmayer2005} collection single gene-deletion mutants under 5 different experimental conditions. \item \cite{Osterberg2006} collection of overexpression ~600 C-terminal tagged integral membrane under 4 different experimental conditions. \item \cite{SGD} list of phenotypes and associated genes from several published experiments. \end{itemize} While our approach focuses on understanding the functional roles that underly phenotypic changes when manipulating single genes, we hope that these methods will also form the basis for the analysis of more complex gene manipulation experiments.\newline To illustrate this vignette, we will use the data by \cite{Dudley2005}. <>= library(PCpheno) data(DudleyPhenoM) ##Number of genes sensitive at each condition colSums(DudleyPhenoM) ##Retrieve the name of the sensitive genes in each condition DudleyPhenoL <- apply(DudleyPhenoM,2,function(x) names(which(x==1))) DudleyPhenoL[1] @ \subsection{Cellular organizational units} As previously mentioned, here we are interested in yeast datasets and the yeast cellular organizational units defined in the Bioconductor package \textit{ScISI} package and the KEGG pathways relevant to the yeast genome and available in the \textit{YEAST} annotation package. However our methods can easily be applied to other species and other estimates of organizational units within the genome, or proteome. The cellular organizational units should be represented as an adjacency matrix. The row names are the gene names and the column names the cellular organizational units. A 1 means that this particular gene belongs to this particular organizational units. Below is an example of yeast KEGG pathways. The \textit{org.Sc.sgd.db} annotation package contains KEGG pathways annotation for the yeast genes and the \Rfunction{PWAmat}, available in the \Rpackage{annotate} package, allows to build the adjacency matrix. <>= library(org.Sc.sgd.db) ## new YEAST annotation package ##library(annotate) KeggMat <- PWAmat("org.Sc.sgd") KeggMat[1:5, 1:5] @ To build such interactome for a particular species, one should first have an annotation package for its species of interest. For instance, one can create this annotation package using the Bioconductor package \textit{AnnBuilder} which retrieves, among other annotation, the KEGG pathways associated with the genome of interest. For protein complexes, it might be slightly more complicated but one can use the GO categories that refer to complexes and create a similar binary matrix. In the case of the yeast genome, we use the interactome available in the Bioconductor package \textit{ScISI}. The \Rpackage{ScISI} package or \textit{In Silico Interactome for Saccharomyces cerevisiae} provides an interactome built for computational experimentation. The \Robject{ScISI} is binary incidence matrix where the rows are indexed by the gene locus names and the columns are indexed by the identification codes for the protein complexes based on the repository from where they are obtained. This interactome is currently built from the Intact, Gene Ontology and Mips curated databases, and estimated protein complexes from the \Rpackage{apComplex} package. In this vignette, we will make use of a subset of the \Robject{ScISI} interactome, the \Robject{ScISIC} data, that only contains the data from the curated databases. <>= library(ScISI) data(ScISIC) ScISIC[1:5, 1:5] @ \section{Computational and Statistical Methods} In order to test for association between 2 datasets or 2 phenomenon, one has to define a null hypothesis. In our case, our null hypothesis is that there is no association between the collection of genes that induce the phenotypic change and the organizational units (\textit{e.g.}, multi-protein complexes, pathways). To test this hypothesis we consider a multi-faceted approach. First, for any level of organization, we use a hypothesis test designed to determine whether there is an effect that can be attributed to that specific groupings of genes, without testing which cellular organizational units are involved. Then, if we reject the null hypothesis of no association between the collection of genes that induce the phenotypic change and the organizational units, the next step is to identify those specific organizational units. \subsection{Global testing} We currently have devised two different methods of performing the omnibus test. One test is based on density estimation \citep{Silverman} and provides valuable visual information, but for which we do not have an explicit $P$-value. The second approach is based on the permutation of graphs \citep{Balasubramanian2004} and while it provides an explicit $P$-value, it provides little insight into the reasons for rejecting, or not, the hypothesis. See below for more details. \subsubsection{Density Estimation} For each cellular organizational unit, we compute the proportion of genes that affect the phenotype. We then compute the smoothed histogram of the proportions and compare it to a reference distribution. Our reference distribution is obtained by randomly permuting 1,000 times the gene labels for the interactome and computing, for each permutation, the new (simulated) proportions of genes that affects the phenotype and the associated smoothed histograms.\newline As example, we test whether the genes sensitive to paraquat in the \cite{Dudley2005} experiment are randomly distributed among multi-protein complexes. <<>>= perm <- 10 paraquat <- DudleyPhenoL[["Paraq"]] parDensity <- densityEstimate(genename=paraquat, interactome=ScISIC, perm=perm) @ Then, we can visualize the result of this test using the \Rfunction{plot} function. \begin{figure}[htbp] \centering <>= plot(parDensity, main="Effect of paraquat on S. cerevisiae genes") @ \caption{\label{fig:Figure001-densityEstimate}\textbf{Genes sensitive to paraquat are not well represented in our interactome.} A high frequency of protein complexes have zero "paraquat-sensitive" genes. Grey lines represent the permutation data and the black line is the observed data.} \end{figure} \subsubsection{Graph Theory} The graph theory procedure is based on the permutation of graphs proposed by \cite{Balasubramanian2004}. Two distinct graphs, $G_i = (V, E_i)$, $i=1,2$, are formed. The nodes, $V$, are in our case the \textit{S. cerevisiae} genes, and they are common to both graphs. In one graph $G_1$ two proteins have an edge between them if, and only if, they are co-members of one, or more, cellular organizational units. In the second graph $G_2$ edges are created between all proteins that are associated with a phenotype of interest, so that if there are $k$ genes associated with the phenotype of interest then there is $k(k-1)/2$ edges. We exclude self-loops in both graphs. We then compute the intersection of these two graphs and count the edges in common. To test whether the number of edges in the third graph is unexpectedly large, a permutation analysis is performed. A reference distribution is computed by permuting $n$ times the labels on either $G_1$ or $G_2$ and counting the number of edges in common obtained. A $p$-value can be obtained by comparing the observed test statistic to the observed distribution of the counts of intersecting edges from the permutations. <<>>= parGraph <- graphTheory(genename=paraquat, interactome=ScISIC, perm=perm) @ Then, we can visualize the result of this test using the \Rfunction{plot} function. \begin{figure}[htbp] \centering <>= plot(parGraph, main="Effect of paraquat S. cerevisiae genes") @ \caption{\label{fig:Figure001-densityEstimate}\textbf{Genes sensitive to paraquat are not randomly distributed in our interactome.} The "paraquat-sensitive" genes gather among complexes more than expected by chance. the grey histogram represent the observed distributions and the red dashed line the observed data.} \end{figure} \subsection{Hypergeometric Test} %%\subsubsection{Hypergeometric Test} A Hypergeometric test can be used to assess whether a cellular organizational unit contains more genes that affect the phenotype than expected by chance. We rank the multi-protein complexes by their $p$-value and classify a complex as being associated with the phenotype if the Hypergeometric $p$-value was less than a threshold, \textit{e.g.,} $0.01$. The Hypergeometric test is the equivalent of Fisher's exact test for two-by-two tables and the function repot the $p$-value, expected values and odds ratio for all tests. <<>>= params <- new("CoHyperGParams", geneIds=paraquat, universeGeneIds=rownames(ScISIC), annotation="org.Sc.sgd", categoryName="ScISIC", pvalueCutoff=0.01, testDirection="over") paraquat.complex <- hyperGTest(params) @ We can display the results using the \Rfunction{summary} function (Note that for visualization purposes we only show the first 6 columns). <<>>= summary(paraquat.complex)[,1:6] @ Finally, you can classify the complexes as significant or not and annotate them if they are a GO, MIPS or KEGG term. <<>>= status <- complexStatus(data=paraquat.complex, phenotype=paraquat, interactome=ScISIC, threshold=0.01) descr <- getDescr(status$A, database= c("GO","MIPS")) data.frame( descr,"pvalues"=paraquat.complex@pvalues[status$A]) @ Those results are in concordance with the known effect of paraquat on H+-transporting. Indeed, the mechanisms of the toxic effects of paraquat are largely the result of a metabolically catalyzed single-electron reduction-oxidation reaction, resulting in depletion of cellular NADPH and the generation of potentially toxic forms of oxygen such as the superoxide radical. It also highlight the critical role of the ESCRT complexes (Endosomal Sorting Complex Required for Transport). \section{Conclusion} This package offers computational methods and statistical paradigms to explore the relationships between phenotype and cellular organizational units. We demonstrated its usefulness in \textit{S. cerevisiae} using \cite{Dudley2005} dataset and multi-protein complexes. While this is one example, we believe that those approaches are powerful enough to investigate many other phenotypes and estimates of organizational units within the genome, or proteome. \bibliographystyle{plainnat} \bibliography{reference} \end{document}