%\VignetteIndexEntry{RRHO} \documentclass[11pt]{article} \usepackage[utf8]{inputenc} \usepackage{natbib} \usepackage{amsthm} \newtheorem{remark}{Remark}[section] \title{RRHO package} \author{Jonathan Rosenblatt \\ Dept. of Mathematics and Computer Science,\\ The Weizmann Institute of Science \\ Israel} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \section{Introduction} Consider the problem of comparing the degree and significance of overlap between two lists of same length. In the following, we will assume that one list consists of differential expression statistics between two conditions for each gene. A second list consists of differential expression statistics between two different conditions for the same genes. A possible way of comparison would be to test the hypothesis that the ordering of lists by their differential expression statistics is arbitrary. The test can be performed against a one sided hypothesis (an over-enrichment hypothesis), or a two sided hypothesis (looking for under- or over-enrichment). This is the purpose of this package, based on the work of~\citet{plaisier_rankrank_2010}. The proposed approach is to count the number of common genes in the first $i \times stepsize$ and $j \times stepsize$ elements of the first and second list respectively, where $stepsize$ is an arbitrary user inputted number. As the count of common elements could be driven by chance, the significance of the observed count is computed assuming the hypothesis of completely random list orderings. As this is performed for all $i \times stepsize$ and $j \times stepsize$, correction for multiple comparisons is necessary. The package offers both FWER control\footnote{ For a general introduction to multiple testing error rates, see~\citep{rosenblatt_practitioners_2013}.} using permutation testing and FDR control using the B-Y procedure~\citep{benjamini_control_2001} as proposed in the original work by~\citet{plaisier_rankrank_2010}. \begin{remark} FDR or FWER? For brevity, $i$ and $j$ will denote $i \times stepsize$ and $j \times stepsize$ respectively. \citet{plaisier_rankrank_2010} recommend the control of the FDR over the different $i$s and $j$s. Each $i,j$ combination tests the null hypothesis of ``arbitrary rankings of the two lists'', versus an alternative of ``non-arbitrary ranking in the first $i$ and $j$ elements of the first and second list respectively''. FDR control is thus appropriate if concerned with the number of false $i,j$ statements made. If only concerned with the existence of any non-arbitrariness, without claiming at which part of the lists it resides, than FWER control is more appropriate. \end{remark} \section{Comparing Two Lists} We start with a sketch of the workflow. The details follow. \begin{itemize} \item Compute the marginal significance of the gene overlap for all $i$ and $j$ first elements of the two lists. \item Correct the marginal significance levels for the multiple $i$s and $j$s. \item Report findings using the exported significance matrices and accompanying Venn diagrams. \end{itemize} <>= library(RRHO) # Create "gene" lists: list.length <- 100 list.names <- paste('Gene',1:list.length, sep='') gene.list1<- data.frame(list.names, sample(100)) gene.list2<- data.frame(list.names, sample(100)) # Compute overlap and significance RRHO.example <- RRHO(gene.list1, gene.list2, BY=TRUE, alternative='enrichment') @ <>= # Examine Nominal (-log) pvalues lattice::levelplot(RRHO.example$hypermat) # Note: If lattice is available try: # levelplot(RRHO.example$hypermat) @ <<>>= # FWER corrected pvalues using 50 random permutations: pval.testing <- pvalRRHO(RRHO.example, 50) pval.testing$pval # The sampling distribution of the minimum # of the (-log) nominal p-values: xs<- seq(0, 10, length=100) plot(Vectorize(pval.testing$FUN.ecdf)(xs)~xs, xlab='-log(pvalue)', ylab='ECDF', type='S') @ <>= # Examine B-Y corrected pvalues # Note: probably nothing will be rejected in this # example as the data is generated from the null. lattice::levelplot(RRHO.example$hypermat.by) @ \begin{remark} As of version $1.4.0$ a two-sided hypothesis test is now possible. The computation of the p-values differ from that described in~\cite{plaisier_rankrank_2010}. The algorithm propsed in~\citep[Section Hypergeometric probability distributions]{plaisier_rankrank_2010} does not control the type $I$ error as demonstrated in the following simulation: <<>>= m<- 100 ; n<- 100; k<- 50 data<- rhyper(1000, m, n, k) pvals<- pmin(phyper(data,m,n,k, lower.tail=TRUE), phyper(data,m,n,k, lower.tail=FALSE)) alpha<- 0.05 prop.table(table(pvals>= getPval<- function(count,m,n,k){ the.mean<- k*m/(m+n) if(count>= size<- 500 list1<- data.frame( GeneIdentifier=paste('gen',1:size, sep=''), RankingVal=-log(runif(size))) list2<- data.frame( GeneIdentifier=paste('gen',1:size, sep=''), RankingVal=-log(runif(size))) list3<- data.frame( GeneIdentifier=paste('gen',1:size, sep=''), RankingVal=-log(runif(size))) rrho.comparison<- RRHOComparison(list1,list2,list3, stepsize=10, labels=c("list1", "list2", "list3"), plots=FALSE, outputdir=temp.dir); @ <>= ## The standard RRHO map between list1 and list 3. lattice::levelplot(rrho.comparison$hypermat1) @ <>= ## The p-value of the difference between # (list1-list3)-(list2-list3). lattice::levelplot(rrho.comparison$Pdiff) @ \bibliography{RRHO} \bibliographystyle{plainnat} \end{document}