%% LyX 1.6.9 created this file. For more info, see http://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[english]{article} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{babel} \usepackage[unicode=true,pdfusetitle, bookmarks=true,bookmarksnumbered=false,bookmarksopen=false, breaklinks=true,pdfborder={0 0 0},backref=false,colorlinks=false] {hyperref} \usepackage{breakurl} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. %\VignetteIndexEntry{Introduction to the les package: Identifying Differential Effects in Tiling Microarray Data with the Loci of Enhanced Significance Framework} %\VignettePackage{les} %\usepackage{fancyvrb} %\fvset{listparameters={\setlength{\topsep}{0pt}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \makeatother \begin{document} \title{Introduction to the \Rpackage{les} package:\\ Identifying Differential Effects in\\ Tiling Microarray Data with the\emph{}\\ \emph{Loci of Enhanced Significance} Framework} \author{Julian Gehring, Clemens Kreutz} \maketitle \SweaveOpts{width=7,height=5} <>= set.seed(1) options(width=65, SweaveHooks=list(fig=function() par(mar=c(5.1, 5.1, 2.1, 1.1)))) @ \section{Introduction} Tiling microarrays have become an important platform for the investigation of regulation in expression, DNA modification, and DNA-protein interaction. Since unbiased in relation to annotation they provide an powerful tool for biological research. Beside the analysis of single conditions, the investigation of regulation between multiple experimental conditions is an important part of current research. A common approach consists in assessing the differential effect on the level of individual probes and thereby computing p-values $p_{i}$ for each probe $i$ with a suitable statistical test, for example an F-test, independently. Since the targets exhibiting differential effects cover multiple probes, a reasonable next step involves combining information from neighboring probes. While several approaches for this purpose exist, most of them lack a statistical interpretation of the results or are specific for certain platforms and statistical tests. The approach used here utilizes the fact that p-values in regions without an differential effect are uniformly distributed. In the case that regulation is present the distribution of $p_{i}$ is shifted towards zero; such regions are therefore referred to as \emph{Loci of Enhanced Significance (LES)}. By assessing the uniform component of the p-value distribution within a sliding window, the local degree of significant probes along the genome is estimated. The \Rpackage{les} package implements the method for detecting \emph{LES} in tiling microarray data sets, independent of the underlying statistical test and hence is suitable for a wide range of applications. \section{Data set and statistic testing on probe level} The data set used in this vignette is part of a quality control study for tiling microarrays \cite{johnson_systematic_2008}, in which spike-ins were used to assess the influence of microarray platforms, preparation procedures, and analysis algorithms on the accuracy and reproducibility of ChIP-chip experiments. Here, the expression intensities of one region from the \emph{undiluted} data set investigated with Affymetrix tiling microarrays is selected, consisting of 452 probes and two conditions with three replicates each. The data has been normalized using quantile normalization and probe positions remapped to a common reference. Normalized expression intensities are stored in the matrix \Robject{exprs}, with rows representing probes and columns arrays. The corresponding names contain the position of the probes and the condition of the samples, respectively. The properties of the spike-in DNA, yielding a region with differential effect, are described in the \Robject{reference} data frame. <>= library(les) data(spikeInData) head(exprs) dim(exprs) pos <- as.integer(rownames(exprs)) condition <- as.integer(colnames(exprs)) reference region <- as.vector(reference[ ,c("start", "end")]) @ In order to assess the differential effect between the two conditions on the level of individual probes, we use a modified t-test provided by the \Rpackage{limma} package \cite{smyth_limma_2005}. This offers an increased statistical power compared to a classical Student's t-test for small sample sizes, as present in most tiling microarray experiments. <>= library(limma) design <- cbind(offset=1, diff=condition) fit <- lmFit(exprs, design) fit <- eBayes(fit) pval <- fit$p.value[, "diff"] @ <>= plot(pos, pval, pch=20, xlab="Probe position", ylab=expression(p)) abline(v=region) @ \begin{figure} \centering \includegraphics{les-plotProbeLevelStatistics} \caption{\label{fig:plotProbeLevelStatistics}Probe-level p-values} \end{figure} \section{Incorporation of neighboring probes} The accumulation of small p-values indicates a \emph{LES} at the location of the spike-in, whereas the response of individual probes may be not reliable (Figure \ref{fig:plotProbeLevelStatistics}). Therefore, an incorporation of neighboring probes should be beneficial and yield improved results in the detection of the differential effect. For each probe $i$ the surrounding p-values $p_{j}$ get weights assigned, corresponding to the distance of probe $i$ and $j$. Being free in the definition of the weighting function, the spatial relationship of probes in observing the same effect can be accounted for. A weighted cumulative density is computed and the fraction of significant $p$ is estimated by an iterative regression. The method is based on the fact that p-values are uniformly distributed under the null hypothesis $H_{0}$ whereas p-values violating $H_{0}$ are shifted towards smaller values \cite{bartholom_estimation_2009}, which is closely related to the estimation of a false discovery rate. This results in $\Lambda_{i}$, constituting an estimate of the fraction of p-values violating $H_{0}$ and therefore the degree of significance in the local surrounding of the evaluated position. For the analysis, we first store all relevant data in an object of class \Rclass{Les} by calling the \Rmethod{Les} method. The only data required are the positions of the probes $i$, the corresponding p-values $p_{i}$ from the statistical test, and optionally their chromosomal location. <>= res <- Les(pos, pval) @ Then we can compute our estimate $\Lambda_{i}$ for which we have to specify a weighting window. Here, we use a triangular window taken as default with a size of 200bp. The power of detecting a region will be high if the window matches approximately the properties of the regulated region. In many experiments a rough prior knowledge is available which can be sufficient for choosing a suitable window. Further, optimal weighting windows can be estimated from the data directly. <>= res <- estimate(res, win=200) @ All data, results and parameters are stored in the \Robject{res} object of class \Rclass{Les}. We can get an overview of the results with the \Rmethod{print}, \Rmethod{summary}, or \Rmethod{plot} method (Figure \ref{fig:showPlotLes}). <>= res summary(res) plot(res) abline(v=region) @ \begin{figure} \centering \includegraphics{les-showPlotLes} \caption{\label{fig:showPlotLes}Results with triangular window.} \end{figure} For comparison we analyze and plot the same data with a rectangular weighting window (Figure \ref{fig:showPlotLes2}). In this example the rectangular window leads to better results. The \Rpackage{les} package includes four predefined windows; custom functions can also be used, as described in Section \ref{sec:Specification-of-own}. <>= res2 <- estimate(res, win=200, weighting=rectangWeight) res2 plot(res2) abline(v=region) @ \begin{figure} \centering \includegraphics{les-showPlotLes2} \caption{\label{fig:showPlotLes2}Results with rectangular window.} \end{figure} \section{Parameter estimation} To turn the continuous $\Lambda_{i}$ into distinct regions of interest we define a threshold $\Theta$. It can be derived from the data by estimating the number of probes with a significant effect on the whole array. <>= res2 <- threshold(res2, grenander=TRUE, verbose=TRUE) @ Given $\Theta$ we can look for regions that have a continuous $\Lambda_{i}\geq\Theta$. The \Rmethod{regions} method takes by default the estimated $\Theta$ from the previous step. We can also pass our own estimate for $\Theta$ with the \Rfunarg{limit} argument. Further restrictions can be imposed on the regions such as the minimal length of a region and the maximum gap allowed between probes in one region. The \Rmethod{[} method allows to access any data slot of an object of class \Rclass{Les}. Here, we use it to extract the data frame with the estimated regions. <>= res2 <- regions(res2, verbose=TRUE) res2 res2["regions"] @ <>= plot(res2, region=TRUE) abline(v=region) @ \begin{figure} \centering \includegraphics{les-plotRegions} \caption{\label{fig:plotRegions}Results with estimates for regions.} \end{figure} \section{Calculation of confidence intervals} By bootstrapping probes in each window, confidence intervals for the statistic $\Lambda_{i}$ can be computed. Since confidence intervals are primarily interesting in regions of interest and bootstrapping is by its nature computationally demanding, we can restrict the calculation to a subset of probes. <>= subset <- pos >= 5232400 & pos <= 5233100 res2 <- ci(res2, subset, nBoot=50, alpha=0.1) plot(res2, error="ci", region=TRUE) @ \begin{figure} \centering \includegraphics{les-plotCi} \caption{\label{fig:plotCi}Results with confidence intervals and estimates for regions.} \end{figure} \section{Visualization} The \Rmethod{plot} method uses a special system in order to customize the graphical elements of the figure. It allows to refer to all its components with the name of the additional input argument; its value is a list containing named graphical parameters for the underlying plot function. As an example, we plot smaller region of the chromosome with confidence intervals, estimated region, and the probe density. Further, we adapt several parameters changing the graphical representation. For details, please refer to the help of the \Rpackage{les} package. <>= plot(res2, error="ci", region=TRUE, rug=TRUE, xlim=c(5232000, 5233000), sigArgs=list(col="firebrick4"), plotArgs=list(main="LES results", yaxp=c(0, 1, 2)), limitArgs=list(lty=2, lwd=3), regionArgs=list(col="black", density=20), probeArgs=list(col="dodgerblue4", type="p")) @ \begin{figure} \centering \includegraphics{les-plotOptions} \caption{\label{fig:plotOptions}Results with customized graphical parameters.} \end{figure} \section{Exporting results to external software} With the \Rmethod{export} method the estimated regions as well as $\Lambda_{i}$ can be saved to standard files formats, in order to facilitate the export of the results to other software and genome browsers. The region estimates can be exported to the \emph{bed} and \emph{gff} formats, the test statistic $\Lambda_{i}$ to the \emph{wig} format. <>= bedFile <- paste(tempfile(), "bed", sep=".") gffFile <- paste(tempfile(), "gff", sep=".") wigFile <- paste(tempfile(), "wig", sep=".") export(res2, bedFile) export(res2, gffFile, format="gff") export(res2, wigFile, format="wig") @ \section{Specification of custom weighting windows\label{sec:Specification-of-own}} With the \Rfunction{triangWeight}, \Rfunction{rectangWeight}, \Rfunction{epWeight}, and \Rfunction{gaussWeight} functions, four weighting windows are included in the \Rpackage{les} package, providing a triangular, rectangular, Epanechnikov, and Gaussian window, respectively. We can also specify custom window functions and pass it as \Rfunarg{weighting} argument in the \Rmethod{estimate} method. They have to be specified as a function depending on the distance of the probes (\Rfunarg{distance}) and the window size (\Rfunarg{win}), as illustrated here with a triangular weighting. <>= weightFoo <- function(distance, win) { weight <- 1 - distance/win return(weight) } resFoo <- estimate(res, 200, weighting=weightFoo) @ <>= regions <- res["regions"] winsize <- seq(100, 300, by=20) res2 <- chi2(res2, winsize, regions, offset=2500) plot(winsize, x["chi2"], type="b") @ \newpage{} \bibliographystyle{plain} \bibliography{ref_les} \section*{Session information} <>= toLatex(sessionInfo(), locale=FALSE) @ \end{document}