%\VignetteIndexEntry{Polyfit} \documentclass[12pt]{article} \textwidth=6.2in \textheight=8.5in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \begin{document} \SweaveOpts{concordance=TRUE} \title{Polyfit Vignette} \author{Conrad Burden} \date{\today} \maketitle Polyfit~\cite{Burden14} is an add-on to the negative-binomial based package {\tt DESeq}~\cite{Anders10} for two-class detection of differential expression. Its purpose is to ensure the p-value distribution is close to uniform over the interval [0, 1] for the subset of genes satisfying the null hypothesis of no differential expression. The first component of Polyfit is the function {\tt pfNbinomTest} which replaces the function {\tt nbinomTest} in {\tt DESeq}. Its purpose is to smooth point singularities (or `flagpoles'), particularly one at $p = 1$, in the p-value distribution caused by calculating calculating p-values from a discrete distribution. The output from this function should then be passed to the second component, the function {\tt levelPValues}. Its purpose is to apply a variant of the Storey-Tibshirani procedure~\cite{Storey03} to shift the p-values so that those corresponding to the null hypothesis have a unform distribution, and to calculate corresponding q-values (or `adjusted p-values') for controlling errors via the false discovery rate. To load and attach Polyfit, type <<>>= library(Polyfit) @ at the R prompt. edgeR and DESeq are dependencies and will be automatically loaded. \section*{Removing the flagpoles} When calculating p-values, DESeq assumes as a null hypothesis that the total number of counts $K_A$ and $K_B$ summed over replicates in each of conditions A and B is distributed according to a negative binomial distribution with parameters estimated from the data. The distribution of $K_A$, conditonal on the observed value $k_A + k_B$ of the sum of counts in both conditions is thus a discrete distribution. DESeq calculates p-values as the sum of probabilities from this distribution less than or equal to the probability of the observed counts $(k_A, k_B)$ (see Fig.~\ref{fig:polyfitFig1}). This method invariably causes a spikes in the histogram of generated p-values, the most notable one being at $p = 1$ from those observed counts which happen to hit the mode of the null hypothesis distribution. \begin{figure}[t] \begin{center} \includegraphics[width=0.9\textwidth]{polyfitFig1.pdf} \caption{Calculation of p-values by the DESeq functions {\tt nbinomTest} (left) and the replacement Polyfit function {\tt pfNbinomTest} (right).} \label{fig:polyfitFig1} \end{center} \end{figure} Because it needs to fit a smooth polynomial to the p-value histogram, Polyfit redistributes the spikes by replacing the discrete distribution with a ``squared off'' continuous distribution, as shown in the right half of Fig.~\ref{fig:polyfitFig1}. If the data are generated according to the postulated null hypothesis, the p-values will then have a uniform distribution on $[0, 1]$. In the following example we use simulated data generated by the DESeq function {\tt makeExampleCountDataSet()}. To generate p-values with the flagploes removed, replace the DESeq function {\tt nbinomTest} with its Polyfit replacement {\tt pfNbinomTest}. <<>>= cds <- makeExampleCountDataSet() cds <- estimateSizeFactors( cds ) cds <- estimateDispersions( cds ) nbT <- nbinomTest( cds, "A", "B" ) pValuesDESeq <- nbT$pval # <-- Original DESeq code nbTPolyfit <- pfNbinomTest( cds, "A", "B" ) pValuesPFDESeq <- nbTPolyfit$pval # <-- Polyfit replacement @ Histograms of the resulting p-values are shown in Fig.~\ref{fig:polyfitFig2}. <>= oldpar <- par(mfrow=c(1,2)) hist(pValuesDESeq,breaks=seq(0,1,by=0.01), xlab="P-value", main="DESeq") hist(pValuesPFDESeq,breaks=seq(0,1,by=0.01), xlab="P-value", main="polyfit-DESeq") par(oldpar) @ \begin{center} \begin{figure} <>= <> @ \caption{Histogram of p-values generated by {\tt nbinomTest} and {\tt pfNbinomTest}} \label{fig:polyfitFig2} \end{figure} \end{center} \section*{Levelling the p-value histogram} Because the parameters of the negative binomial distribution for each gene must be estimated from the available count data, p-values reported by DESeq for those genes which are not differentially expressed may not be uniformly distributed. A fairly extreme case is shown in Fig.~\ref{fig:polyfitFig4}(a) generated from DESeq using synthetic data after the flagpole has been removed as described above. \begin{figure}[t] \begin{center} \includegraphics[width=0.9\textwidth]{polyfitFig4.pdf} \caption{(a) Example p-value spectrum calculated by DESeq for synthetic data RNA-seq with 15\% genes up- or down-regulated after removal of `flagpoles' (taken from ref.~\cite{Burden14}). The shaded histogram is the 85\% of transcripts which are unregulated. (b) Schematic representation of the Storey-Tibshirani procedure for estimating false discovery rates, assuming correctly calculated p-values. (c) Schematic representation of the analogous Polyfit procedure. (TP = true positives, FP = false positives, FN = false negatives and TN = true negatives at a specified significance point $\alpha$.)} \label{fig:polyfitFig4} \end{center} \end{figure} If p-values have been calculated exactly, the Storey-Tibshirani procedure calculates q-values, that is, estimates of the false discovery rate, essentially by fitting a uniform distribution to the right hand part of the p-value histogram (see Fig.~\ref{fig:polyfitFig4}(b)). Polyfit replaces the uniform distribution fit with a polynomial fit (see Fig.~\ref{fig:polyfitFig4}(c)), and estimates p-values and q-values for each gene by the formulae $$ \mbox{corrected p-value} = \frac{\mbox{FP}}{\mbox{FP} + \mbox{TN}} \: , \qquad \mbox{corrected q-value} = \frac{\mbox{FP}}{\mbox{FP} + \mbox{TP}} \: . $$ The function {\tt levelPValues} generates the levelled p-values, estimated q-values calculated from the adapted Storey-Tibshirani procedure and, for comparison, also reports q-values calculated from the levelled p-values using the Benjamini-Hochberg procedure. The following code calculates the corrected p-values and q-values from the nominal p-values generated in the DESeq example above: <<>>= lP <- levelPValues(pValuesPFDESeq) outTable <- cbind(origPval= pValuesPFDESeq, levelledPval=lP$pValueCorr, levelledQval=lP$qValueCorr, BH_Qval=lP$qValueCorrBH) head(outTable) @ If the option {\tt plot=TRUE} is used a diagnostic plot in the format of Fig.~\ref{fig:polyfitFig5} showing the p-value distribution before and after levelling is produced. The top left hand panel plots estimates of the fraction ($\pi_0$) of genes not DE obtained by fitting a quadratic to the nominal p-value historgam (without flagpoles) over the interval $[\lambda, 1]$. Beneath this is a density plot of obtained estimates $\hat\pi_0$. The optimal $\lambda$ (the red dot) is obtained by choosing the $\hat\pi_0(\lambda)$ closest to the mode of the $\hat\pi_0$ density. The mode is also indicated by the dotted line in the top left panel. The original and corrected p-value histograms are shown on the right, together with optimally fitted quadratic (upper plot) and its image after correction (lower plot). The red part of the quadratic and its image below correspond to the interval over which the quadratic is fitted. <>= lP <- levelPValues(pValuesPFDESeq, plot=TRUE) @ \begin{center} \begin{figure}[h] <>= <> @ \caption{Diagnostic plot produced by {\tt levelPValues}} \label{fig:polyfitFig5} \end{figure} \end{center} \begin{thebibliography}{3} \bibitem{Anders10} {Anders, S.\ and Huber, W.\ Differential expression analysis for sequence count data. {\it Genome Biology}, 11(10):R106 (2010) } \bibitem{Burden14} {Burden, C.J., Qureshi, S.\ and Wilson, S.R. Error estimates for the analysis of differential expression from RNA-seq count data. {\it PeerJ PrePrints 2:e400v1}, {\tt http://dx.doi.org/10.7287/peerj.preprints.400v1} (2014) } \bibitem{Storey03} {Storey, J.\ and Tibshirani, R. Statistical significance for genomewide studies. {\it Proceedings of the National Academy of Science}, 100(16):9440--9445 (2003) } \end{thebibliography} \end{document}