\documentclass{report}
\usepackage{tikz}

%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{User's guide}

<>=
BiocStyle::latex(relative.path=TRUE)
@

<>=
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE, cache=TRUE)
@

\bioctitle[\Biocpkg{csaw} User's Guide]{\Biocpkg{csaw}: ChIP-seq analysis with windows \\ User's Guide}
%% also: \bioctitle{Title used for both header and title page}
%% or... \title{Title used for both header and title page}
\author[1]{Aaron T. L. Lun}
\affil[1]{The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia}

\usepackage{framed,color}
\newenvironment{combox}
{
\definecolor{shadecolor}{RGB}{255, 240, 240}
\begin{shaded}\begin{center}\begin{minipage}[t]{0.95\textwidth}
}
{
\end{minipage}\end{center}\end{shaded}
\definecolor{shadecolor}{RGB}{240,240,240}
}

\usepackage{bibentry}
\nobibliography*

\begin{document}
\maketitle

\begin{abstract}
This document contains instructions on how to use \Biocpkg{csaw} for differential binding analyses of ChIP-seq data with windows.
It covers read counting into windows, filtering for windows of interest, normalization of sample-specific biases,
variance modelling and hypothesis testing, summarization of windows into regions,
and visualization and annotation of detected regions.

\vspace{0.05in}
\textit{First edition:} 15 August 2012 \\
\textit{Last revised:} 2 June 2019 \\
\textit{Last compiled:} \Sexpr{format(Sys.Date(), "%d %B %Y")}
\end{abstract}

\packageVersion{\Sexpr{BiocStyle::pkg_ver("csaw")}}
\tableofcontents
\newpage

\chapter{Introduction}

\section{Scope}
This document describes the \Bioconductor{} package \Biocpkg{csaw} for detecting differential binding (DB) in ChIP-seq experiments.
\Biocpkg{csaw} uses sliding windows to identify significant changes in binding patterns for transcription factors (TFs) or histone marks across different biological conditions \cite{lun2016csaw}.
However, it can also be applied to any sequencing technique where reads represent coverage of enriched genomic regions.

The descriptions in this user's guide are detailed and will explore the theoretical and practical motivations behind each step of a \Biocpkg{csaw} analysis.
While all users are welcome to read it from start to finish, new users may prefer to examine the case studies presented in the \Biocpkg{chipseqDB} package \cite{lun2015from}, which provides the important information in a more concise format.
Experienced users (or those looking for some nighttime reading!) are more likely to benefit from the in-depth discussions in this document.
The statistical methods described here are based upon those in the \Biocpkg{edgeR} package \cite{robinson2010}.
Knowledge of \Biocpkg{edgeR} is useful but not a prerequisite for reading this guide.

\section{How to get help}
Most questions about \Biocpkg{csaw} should be answered by the documentation.
Every function mentioned in this guide has its own help page.
For example, a detailed description of the arguments and output of the \Rfunction{windowCounts} function can be obtained by typing \Rcode{?windowCounts} or \Rcode{help(windowCounts)} at the \R{} prompt.
Further detail on the methods or the underlying theory can be found in the references at the bottom of each help page.
The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation.
The same goes for well-considered suggestions for improvements.
Other questions about how to use \Biocpkg{csaw} are best sent to the Bioconductor support site at \url{https://support.bioconductor.org}.
Please send requests for general assistance and advice to the support site, rather than to the individual authors.
Users posting to the support site for the first time may find it helpful to read the posting guide at \url{http://www.bioconductor.org/help/support/posting-guide}.

\section{How to cite this package}
Most users of \Biocpkg{csaw} should cite the following in any publications:

\begin{quote}
\bibentry{lun2016csaw}
\end{quote}

Anyone who uses the Bioconductor \href{https://www.bioconductor.org/help/workflows/chipseqDB/}{workflow} to construct their analyses should also cite:

\begin{quote}
\bibentry{lun2015from}
\end{quote}

For people interested in combined $p$-values, their use in DB analyses was proposed in:

\begin{quote}
\bibentry{lun2014}
\end{quote}

The DB analyses shown here use methods from the \Biocpkg{edgeR} package, which has its own citation recommendations.
See the appropriate section of the \Biocpkg{edgeR} user's guide for more details.

\section{Quick start}
A typical ChIP-seq analysis in \Biocpkg{csaw} would look something like that described below.
This assumes that a vector of file paths to sorted and indexed BAM files is provided in \Robject{bam.files} and a design matrix is supplied in \Robject{design}.
The code is split across several steps:

<>=
library(chipseqDBData)
tf.data <- NFYAData()
tf.data <- head(tf.data, -1) # skip the input.
bam.files <- tf.data$Path

cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", tf.data$Description)
design <- model.matrix(~factor(cell.type))
colnames(design) <- c("intercept", "cell.type")
@

\begin{enumerate}
\item Loading in data from BAM files.
<<>>=
library(csaw)
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=110, width=10, param=param)
@

\item Filtering out uninteresting regions.
<<>>=
library(edgeR)
keep <- aveLogCPM(asDGEList(data)) >= -1
data <- data[keep,]
@

\item Calculating normalization factors.
<<>>=
binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
data <- normFactors(binned, se.out=data)
@

\item Identifying DB windows.
<<>>=
y <- asDGEList(data)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit)
@

\item Correcting for multiple testing.
<<>>=
merged <- mergeResults(data, results$table, tol=1000L)
@
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Converting reads to counts}
\label{chap:count}

\begin{combox}
Hello, reader.
A little box like this will be present at the start of each chapter.
The idea is to list the objects from previous chapters that are needed to run the code in the current chapter.
Hopefully, it'll provide a nice segue between chapters for tired eyes.
\end{combox}

\section{Types of input data}
Sorted and indexed BAM (i.e., binary SAM) files \cite{li2009} are required as input into the read counting functions in \Biocpkg{csaw}.
Sorting should be performed on the genomic coordinates of the mapped reads.
Each read should only have one alignment in the file, i.e., secondary alignments should not be present.
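If the alignments are not already coordinate-sorted and indexed, this can be done within \R{} using the \Biocpkg{Rsamtools} package (or outside of \R{} with \software{samtools}).
The code below is only a minimal sketch and is not run here; the file name is hypothetical.

<<eval=FALSE>>=
library(Rsamtools)
# 'my.bam' is a hypothetical unsorted BAM file, used purely for illustration.
sorted <- sortBam("my.bam", destination="my_sorted") # creates 'my_sorted.bam'
indexBam(sorted) # creates 'my_sorted.bam.bai'
@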
For a given BAM file named \file{xxx.bam}, the corresponding index file should be named \file{xxx.bam.bai} and placed in the same directory.
The index is not checked for consistency with its BAM file prior to counting.
A common mistake is to replace or update the BAM file without updating the index.
This will cause \Biocpkg{csaw} to return incorrect results when it attempts to load alignments from the new BAM file.

In this guide, the behavior of each step will be demonstrated with some publicly available data from the \Biocpkg{chipseqDBData} package.
The dataset below focuses on changes in the binding profile of the NF-YA transcription factor between embryonic stem cells and terminal neurons \cite{tiwari2012}.
This will be used as a case study for most of the code examples throughout the guide.

<<>>=
library(chipseqDBData)
tf.data <- NFYAData()
tf.data
bam.files <- head(tf.data$Path, -1) # skip the input.
bam.files
@

\section{Counting reads into windows}

\subsection{Overview}
The \Rfunction{windowCounts} function uses a sliding window approach to count fragments for a set of libraries.
For single-end data, the fragment corresponding to a read is imputed by directionally extending each read to the average fragment length.
We define a window as a fixed-width genomic interval, and we count the number of fragments overlapping that window in each library.
This is repeated after sliding the window along the genome to a new position.
A count is then obtained for each window in each library, thus quantifying protein binding intensity across the genome.

<<>>=
frag.len <- 110
win.width <- 10
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param)
@

The function returns a \Rclass{RangedSummarizedExperiment} object where the matrix of counts is stored as the first entry in the \Rcode{assays} slot.
Each row corresponds to a genomic window while each column corresponds to a library.
The coordinates of each window are stored in the \Rcode{rowRanges} slot.
The total number of reads in each library (also referred to as the library size) is stored as \Rcode{totals} in the \Rcode{colData} slot.

<<>>=
head(assay(data))
head(rowRanges(data))
data$totals
@

For single-end data, we estimate the average fragment length from a cross-correlation plot (see Section~\ref{sec:ccf}) for use as \Rcode{ext}.
Alternatively, the length can be estimated from diagnostics during ChIP or library preparation, e.g., post-fragmentation gel electrophoresis images.
Typical values range from 100 to 300 bp, depending on the efficiency of sonication and the use of size selection steps in library preparation.

\begin{center}
\begin{tikzpicture}[font=\sffamily]
% Filling the left region.
\filldraw[fill=black!20] (0,0) rectangle (10,1);
\draw (0,-0.1)--(0,-0.5)--(10,-0.5)--(10,-0.1);
\draw (5,-0.5)--(5,-1);
\node[below] at (5, -1) {width};

% Adding left read.
\draw[line width=3pt,color=red!30] (-2,1.5)--(2.5,1.5);
\draw[line width=3pt,color=red] (-2,1.5)--(-0.5,1.5)--(-0.8,1.8);
\draw (-2,1.8)--(-2,2)--(2.5,2)--(2.5,1.8);
\node[below] at (-2,1.45) {\begin{tabular}{c} forward \\ read \end{tabular}};

% Adding the right read
\draw[line width=3pt,color=blue!30] (12,1.8)--(7.5,1.8);
\draw[line width=3pt,color=blue] (12,1.8)--(10.5,1.8)--(10.8,2.1);
\draw (12,2.1)--(12,2.3)--(7.5,2.3)--(7.5,2.1);
\node[below] at (12,1.75) {\begin{tabular}{c} reverse \\ read \end{tabular}};

\draw (0,2)--(0,3.2)--(3.1,3.2);
\draw (10,2.3)--(10,3.2)--(6.9,3.2);
\node[above] at (5,2.8) {fragment length (ext)};
\end{tikzpicture}
\end{center}

The \Rcode{width} argument defines the window size, which we interpret as the width of the binding site for the target protein.
This is user-specified and has important implications for the power and resolution of a DB analysis, which are discussed in Section~\ref{sec:windowsize}.
For TF analyses with small windows, the choice of spacing interval will also be affected by the choice of window size -- see Section~\ref{sec:efficiency} for more details.

\subsection{Filtering out low-quality reads}
Read extraction from the BAM files is controlled with the \Rcode{param} argument in \Rfunction{windowCounts}.
This takes a \Rclass{readParam} object that specifies a number of extraction parameters.
The idea is to define the \Rclass{readParam} object once in the entire analysis pipeline, which is then reused for all relevant functions.
This ensures that read loading is consistent throughout the analysis.

<<>>=
param
@

In the example above, reads are filtered out based on the minimum mapping score with the \Rcode{minq} argument.
Low mapping scores are indicative of incorrectly and/or non-uniquely aligned sequences.
Removal of these reads is highly recommended as it will ensure that only the reliable alignments are supplied to \Biocpkg{csaw}.
The exact value of the threshold depends on the range of scores provided by the aligner.
The \software{subread} program \cite{liao2013} was used to align the reads in this dataset, so a value of 20 might be appropriate.

Reads mapping to the same genomic position can be marked as putative PCR duplicates using software like the \software{MarkDuplicates} program from the \software{Picard} suite.
Marked reads in the BAM file can be ignored during counting by setting \Rcode{dedup=TRUE} in the \Rclass{readParam} object.
This reduces the variability caused by inconsistent amplification between replicates, and avoids spurious duplicate-driven DB between groups.
An example of counting with duplicate removal is shown below, where fewer reads are used from each library relative to \Rcode{data\$totals}.

<<>>=
dedup.param <- readParam(minq=20, dedup=TRUE)
demo <- windowCounts(bam.files, ext=frag.len, width=win.width, param=dedup.param)
demo$totals
@

Duplicate removal is generally not recommended for routine DB analyses.
This is because it caps the number of reads at each position, reducing DB detection power in high-abundance regions.
Spurious differences may also be introduced when the same upper bound is applied to libraries of varying size.
However, it may be unavoidable in some cases, e.g., involving libraries generated from low quantities of DNA.
Duplicate removal is also acceptable for paired-end data, as exact overlaps for both paired reads are required to define duplicates.
This greatly reduces the probability of incorrectly discarding read pairs from non-duplicate DNA fragments (assuming that a pair-aware method was used during duplicate marking).

\subsection{Avoiding problematic genomic regions}
Read extraction and counting can be restricted to particular chromosomes by specifying the names of the chromosomes of interest in \Rcode{restrict}.
This avoids the need to count reads on unassigned contigs or uninteresting chromosomes, e.g., the mitochondrial genome for ChIP-seq studies targeting nuclear factors.
Alternatively, it allows \Rfunction{windowCounts} to work on huge datasets or in limited memory by analyzing only one chromosome at a time.

<<>>=
restrict.param <- readParam(restrict=c("chr1", "chr10", "chrX"))
@

Reads lying in certain regions can also be removed by specifying the coordinates of those regions in \Rcode{discard}.
This is intended to remove reads that are wholly aligned within known repeat regions but were not removed by the \Rcode{minq} filter.
Repeats are problematic as changes in repeat copy number or accessibility between conditions can lead to spurious DB.
Removal of reads within repeat regions can avoid detection of these irrelevant differences.

<<>>=
repeats <- GRanges("chr1", IRanges(3000001, 3041000)) # telomere
discard.param <- readParam(discard=repeats)
@

Coordinates of annotated repeats can be obtained from several different sources.
A curated blacklist of problematic regions is available from the ENCODE project \cite{dunham2012}, and can be obtained at \url{https://sites.google.com/site/anshulkundaje/projects/blacklists}.
This list is constructed empirically from the ENCODE datasets and includes obvious offenders like telomeres, microsatellites and some rDNA genes.
Alternatively, repeats can be predicted from the genome sequence using software like \software{RepeatMasker}.
These calls are available from the UCSC website (e.g., \url{hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromOut.tar.gz} for mouse) or they can be extracted from an appropriate masked \Rclass{BSgenome} object.

Using \Rcode{discard} is more appropriate than simply ignoring windows that overlap the repeat regions.
For example, a large window might contain both repeat and non-repeat regions.
Discarding the window because of the former will compromise detection of DB features in the latter.
Of course, any DB sites within the discarded regions will be lost from downstream analyses.
Some caution is therefore required when specifying the regions of disinterest.
For example, many more repeats are called by \software{RepeatMasker} than are present in the ENCODE blacklist, so the use of the former may result in loss of potentially interesting features.

\subsection{Additional notes about parameter specification}
Users can modify an existing \Rclass{readParam} object using the \Rfunction{reform} method.
The example below copies \Robject{param} and replaces \Rcode{minq} and \Rcode{discard} with new values.
This is safer than directly modifying the slots, as appropriate type/value checking of each class member is performed.

<<>>=
another.param <- reform(param, minq=20, discard=repeats)
another.param
@

Users are encouraged to construct their own \Rclass{readParam} objects and apply them consistently throughout their analyses.
A good measure of synchronisation between \Rfunction{windowCounts} calls is to check that the values of \Rcode{...\$totals} are identical between calls.
This suggests that the same reads are being extracted from the BAM files in each call.
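For example, a quick consistency check along these lines might look like the sketch below (not run), assuming that \Robject{data} and \Robject{demo} were generated by separate \Rfunction{windowCounts} calls using the same \Rclass{readParam} object.

<<eval=FALSE>>=
# Sketch only: the totals should be identical if read extraction was consistent.
stopifnot(identical(data$totals, demo$totals))
@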
\subsection{Increasing speed and memory efficiency}
\label{sec:efficiency}
The \Rcode{spacing} parameter controls the distance between adjacent windows in the genome.
By default, this is set to 50 bp, i.e., sliding windows are shifted 50 bp forward at each step.
Using a higher value will reduce computational work as fewer features need to be counted.
This may be useful when machine memory is limited.
Of course, spatial resolution is lost with larger spacings.
Adjacent positions are not counted and thus cannot be distinguished.

<<>>=
demo <- windowCounts(bam.files, spacing=100, ext=frag.len, width=win.width, param=param)
head(rowRanges(demo))
@

For analyses with large windows, we suggest increasing the \Rcode{spacing} to a fraction of the specified \Rcode{width}.
This reduces the computational work by decreasing the number of windows and extracted counts.
Any loss in spatial resolution due to a larger spacing interval is negligible compared to that already lost by using a large window size.
Conversely, \Rcode{spacing} should not be larger than \Rcode{ext/2} for analyses with small windows.
This ensures that a narrow binding site will not be overlooked if it falls between two windows.
If \Rcode{ext} is also very small, \Rcode{spacing} should be set to \Rcode{width} to avoid loading too many small windows.

Windows that are overlapped by few fragments are filtered out based on the \Rcode{filter} argument.
A window is removed if the sum of counts across all libraries is below \Rcode{filter}.
This improves memory efficiency by discarding the majority of low-abundance windows corresponding to uninteresting background regions.
The default value of the filter threshold is 10, though it can be raised to reduce memory usage for large libraries.
More sophisticated filtering is recommended and should be applied later (see Chapter~\ref{chap:filter}).

<<>>=
demo <- windowCounts(bam.files, ext=frag.len, width=win.width, filter=30, param=param)
head(assay(demo))
@

Users can parallelize read counting and several other functions by setting the \Rcode{BPPARAM} argument.
This will load and process reads from multiple BAM files simultaneously.
The number of workers and type of parallelization can be specified using \Rclass{BiocParallelParam} objects.
By default, parallelization is turned off (i.e., set to a \Rclass{SerialParam} object) because it provides little benefit for small files or on systems with I/O bottlenecks.

\subsection{Assigning reads into bins}
Setting \Rcode{bin=TRUE} will direct \Rfunction{windowCounts} to count reads into contiguous bins across the genome.
Here, \Rcode{spacing} is set to \Rcode{width} such that each window forms a bin.
Only the 5\ensuremath{'} end of each read is used for counting into bins, without any directional extension.
(For paired-end data, the midpoint of the originating fragment is used -- see below.)

<<>>=
demo <- windowCounts(bam.files, width=1000, bin=TRUE, param=param)
head(rowRanges(demo))
@

The \Rcode{filter} argument is automatically set to 1, which means that counts will be returned for each non-empty genomic bin.
Users should set \Rcode{width} to a reasonably large value, to avoid running out of memory with a large number of small bins.
We can also force \Rfunction{windowCounts} to return counts for \textit{all} bins, including empty ones, by setting \Rcode{filter=0} manually.
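As an unevaluated sketch, the code below combines the two features just described -- bin-level counting of all bins via \Rcode{filter=0}, and parallelization across BAM files via \Rcode{BPPARAM}.
The worker count is an arbitrary choice for illustration.

<<eval=FALSE>>=
library(BiocParallel)
# Sketch only: bin-level counts for all bins, processing BAM files in parallel.
demo <- windowCounts(bam.files, bin=TRUE, width=1000, filter=0,
    param=param, BPPARAM=MulticoreParam(workers=2))
@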
\section{Experiments involving paired-end data}
\label{data:pet}
ChIP experiments with paired-end sequencing are accommodated by setting \Rcode{pe="both"} in the \Robject{param} object supplied to \Rfunction{windowCounts}.
Read extension is not required as the genomic interval spanned by the originating fragment is explicitly defined as that between the 5\ensuremath{'} positions of the paired reads.
The number of fragments overlapping each window is then counted as previously described.
By default, only proper pairs are used in which the two paired reads are on the same chromosome, face inward and are no more than \Rcode{max.frag} apart.

<<>>=
# Using the BAM file in Rsamtools as an example.
pe.bam <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)

pe.param <- readParam(max.frag=400, pe="both")
demo <- windowCounts(pe.bam, ext=250, param=pe.param)
demo$totals
@

A suitable value for \Rcode{max.frag} is chosen by examining the distribution of fragment sizes from the \Rfunction{getPESizes} function.
In this example, we might use a value of around 400 bp as it is larger than the vast majority of fragment sizes.
The plot can also be used to examine the quality of the PE sequencing procedure.
The location of the mode should be consistent with the fragmentation and size selection steps in library preparation.

<>=
out <- getPESizes(pe.bam)
frag.sizes <- out$sizes[out$sizes<=800]
hist(frag.sizes, breaks=50, xlab="Fragment sizes (bp)", ylab="Frequency", main="", col="grey80")
abline(v=400, col="red")
@

The number of fragments exceeding the maximum size is recorded for quality control.
The \Rfunction{getPESizes} function also returns the number of single reads, pairs with one unmapped read, improperly orientated pairs and inter-chromosomal pairs.
A non-negligible proportion of these reads may be indicative of problems with paired-end alignment or sequencing.

<<>>=
c(out$diagnostics, too.large=sum(out$sizes > 400))
@

Note that all of the paired-end methods in \Biocpkg{csaw} depend on correct mate information for each alignment.
This is usually enforced by the aligner in the output BAM file.
Any file manipulations that might break the synchronisation should be corrected (e.g., with the \software{FixMateInformation} program from the \software{Picard} suite) prior to read counting.

Paired-end data can also be treated as single-end by specifying \Rcode{pe="first"} or \Rcode{"second"} in the \Rfunction{readParam} constructor.
This will only use the first or second read of each read pair, regardless of the validity of the pair or the relative quality of the alignments.
This setting may be useful for contrasting paired- and single-end analyses, or in disastrous situations where paired-end sequencing has failed, e.g., due to ligation between DNA fragments.

<<>>=
first.param <- readParam(pe="first")
demo <- windowCounts(pe.bam, param=first.param)
demo$totals
@

\section{Estimating the average fragment length}
\label{sec:ccf}

\subsection{Using cross-correlation plots}
Cross-correlation plots are generated directly from BAM files using the \Rfunction{correlateReads} function.
This provides a measure of the immunoprecipitation (IP) efficiency of a ChIP-seq experiment \cite{kharchenko2008}.
Efficient IP should yield a smooth peak at a delay distance corresponding to the average fragment length.
This reflects the strand-dependent bimodality of reads around narrow regions of enrichment, e.g., TF binding sites.
<>=
max.delay <- 500
dedup.on <- reform(param, dedup=TRUE)
x <- correlateReads(bam.files, max.delay, param=dedup.on)
plot(0:max.delay, x, type="l", ylab="CCF", xlab="Delay (bp)")
@

The location of the peak is used as an estimate of the fragment length for read extension in \Rfunction{windowCounts}.
An estimate of \textasciitilde{}110 bp is obtained from the plot above.
We can do this more precisely with the \Rfunction{maximizeCcf} function, which returns a similar value.

<<>>=
maximizeCcf(x)
@

A sharp spike may also be observed in the plot at a distance corresponding to the read length.
This is thought to be an artifact, caused by the preference of aligners towards uniquely mapped reads.
Duplicate removal is typically required here (i.e., set \Rcode{dedup=TRUE} in \Rfunction{readParam}) to reduce the size of this spike.
Otherwise, the fragment length peak will not be visible as a separate entity.
The size of the smooth peak can also be compared to the height of the spike to assess the signal-to-noise ratio of the data \cite{landt2012}.
Poor IP efficiency will result in a smaller or absent peak as bimodality is less pronounced.

Cross-correlation plots can also be used for fragment length estimation of narrow histone marks such as histone acetylation and H3K4 methylation.
However, they are less effective for regions of diffuse enrichment where bimodality is not obvious (e.g., H3K27 trimethylation).

<>=
n <- 1000

# Using more data sets from 'chipseqDBData'.
acdata <- H3K9acData()
h3k9ac <- correlateReads(acdata$Path[1], n, param=dedup.on)

k27data <- H3K27me3Data()
h3k27me3 <- correlateReads(k27data$Path[1], n, param=dedup.on)

k4data <- H3K4me3Data()
h3k4me3 <- correlateReads(k4data$Path[1], n, param=dedup.on)

plot(0:n, h3k9ac, col="blue", ylim=c(0, 0.1), xlim=c(0, 1000),
    xlab="Delay (bp)", ylab="CCF", pch=16, type="l", lwd=2)
lines(0:n, h3k27me3, col="red", pch=16, lwd=2)
lines(0:n, h3k4me3, col="forestgreen", pch=16, lwd=2)
legend("topright", col=c("blue", "red", "forestgreen"),
    c("H3K9ac", "H3K27me3", "H3K4me3"), pch=16)
@
\label{data:ccf}

\subsection{Variable fragment lengths between libraries}
\label{sec:coercelen}
The \Rfunction{windowCounts} function also supports the use of library-specific fragment lengths.
For example, libraries with less efficient fragmentation will have larger fragment lengths and wider peaks.
Single-end reads in the peaks of such libraries will require more directional extension to impute a fragment interval that covers the binding site.
However, some work is required to avoid detecting irrelevant DB from differences in peak widths.
This is done by resizing the inferred fragments to the same length in all libraries.
(Consider a bimodal peak, present in several libraries that have different fragment lengths.
Resizing ensures that the subpeak on the forward strand is centered at the same location in each library.
The same applies for the subpeak on the reverse strand.)
Thus, the effect of differences in peak width between libraries can be largely mitigated.

Variable read extension is performed in \Rfunction{windowCounts} by setting \Rcode{ext} to a list with two elements.
The first element is a vector where each entry specifies the average fragment length to be used for the corresponding library.
The second specifies the final length to which the inferred fragments are to be resized.
If the second element is set to \Rcode{NA}, no rescaling is performed and the library-specific fragment sizes are used directly.
This also works for analyses with paired-end data, though the first element of \Rcode{ext} will be ignored as directional extension is not performed.
The example below rescales all fragments to 200 bp in all libraries.
Extension information is stored in the \Rclass{RangedSummarizedExperiment} object for later use.

<<>>=
multi.frag.lens <- list(c(100, 150, 200, 250), 200)
demo <- windowCounts(bam.files, ext=multi.frag.lens, filter=30, param=param)
demo$ext
metadata(demo)$final
@

In general, use of different extension lengths is unnecessary in well-controlled datasets.
Differences in lengths between libraries are usually smaller than 50 bp.
This is less than the inherent variability in fragment lengths within each library (see the histogram for the paired-end data in Section~\ref{data:pet}).
The effect of within-library variability in lengths on the coverage profile will likely mask the effect of small between-library differences in the average lengths.
Thus, an \Rcode{ext} list should only be specified for datasets that exhibit large differences in the average fragment sizes between libraries.

\section{Choosing an appropriate window size}
\label{sec:windowsize}
We interpret the window size as the width of the binding ``footprint'' for the target protein, where the protein residues directly contact the DNA.
TF analyses typically use a small window size, e.g., 10--20 bp, which maximizes spatial resolution for optimal detection of narrow regions of enrichment.
For histone marks, widths of at least 150 bp are recommended \cite{humburg2011}.
This corresponds to the length of DNA wrapped up in each nucleosome, which is the smallest relevant unit for histone mark enrichment.
We consider diffuse marks as chains of adjacent histones, for which the combined footprint may be very large (e.g., 1--10 kbp).

The choice of window size controls the compromise between spatial resolution and count size.
Larger windows will yield larger read counts that can provide more power for DB detection.
However, spatial resolution is also lost for large windows whereby adjacent features can no longer be distinguished.
Reads from a DB site may be counted alongside reads from a non-DB site (e.g., non-specific background) or even those from an adjacent site that is DB in the opposite direction.
This will result in the loss of DB detection power.

We might expect to be able to infer the optimal window size from the data, e.g., based on the width of the enriched regions.
However, in practice, a clear-cut choice of distance/window size is rarely found in real datasets.
For many non-TF targets, the widths of the enriched regions can be highly variable, suggesting that no single window size is optimal.
Indeed, even if all enriched regions were of constant width, the width of the DB events occurring \textit{within} those regions may be variable.
This is especially true of diffuse marks where the compromise between resolution and power is more arbitrary.

We suggest performing an initial DB analysis with small windows to maintain spatial resolution.
The widths of the final merged regions (see Section~\ref{sec:cluster}) can provide an indication of the appropriate window size.
Alternatively, the analysis can be repeated with a series of larger windows, and the results combined (see Section~\ref{sec:bin_integrate}).
This examines a spread of resolutions for more comprehensive detection of DB regions.
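One unevaluated sketch of such a multi-resolution approach is shown below, where the counting step is simply repeated for several window widths (the particular widths are arbitrary choices) for later integration of the results.

<<eval=FALSE>>=
# Sketch only: generate window counts at a spread of resolutions.
win.sizes <- c(50, 150, 500)
multi.counts <- lapply(win.sizes, function(w)
    windowCounts(bam.files, ext=frag.len, width=w, param=param)
)
@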
\section{Miscellaneous functions for non-standard counting}

\subsection{Counting over manually specified regions}
The \Biocpkg{csaw} package focuses on counting reads into windows.
However, it may occasionally be desirable to use the same conventions (e.g., duplicate removal, quality score filtering) when counting reads into pre-specified regions.
This can be performed with the \Rfunction{regionCounts} function, which is largely a wrapper for \Rfunction{countOverlaps} from the \Biocpkg{GenomicRanges} package.

<<>>=
my.regions <- GRanges(c("chr11", "chr12", "chr15"),
    IRanges(c(75461351, 95943801, 21656501),
        c(75461610, 95944810, 21657610)))
reg.counts <- regionCounts(bam.files, my.regions, ext=frag.len, param=param)
head(assay(reg.counts))
@

\subsection{Strand-specific counting}
Techniques like CLIP-seq, MeDIP-seq or CAGE provide strand-specific sequence information.
The \Biocpkg{csaw} package can analyze these datasets through strand-specific counting.
This can be done by manually setting the \Rcode{forward} slot in the \Rclass{readParam} object to \Rcode{TRUE} or \Rcode{FALSE}, to count only forward- or reverse-strand reads respectively in \Rfunction{windowCounts} or \Rfunction{regionCounts}.

Alternatively, the \Rfunction{strandedCounts} wrapper function can be used to obtain strand-specific counts for each window or region.
The strand of each output range indicates the strand on which reads were counted for that row.
Up to two rows can be generated for each window or region, depending on filtering.

<<>>=
ss.param <- reform(param, forward=NULL)
ss.counts <- strandedCounts(bam.files, ext=frag.len, width=win.width, param=ss.param)
strand(rowRanges(ss.counts))
@

Note that \Rfunction{strandedCounts} operates internally by calling \Rfunction{windowCounts} (or \Rfunction{regionCounts}) twice with different settings for \Rcode{param\$forward}.
Any value for \Rcode{forward} in the input \Robject{param} object will be ignored.
In fact, the function will \textit{only} accept a \Rcode{NULL} value for this slot.
This is intended to protect the user, as any attempt to re-use the \Rcode{ss.param} object in functions that are not designed for strand specificity will (appropriately) raise an error.

<>=
rm(demo, ss.counts)
gc()
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Filtering out uninteresting windows}
\label{chap:filter}

\begin{combox}
This chapter will require \Robject{bam.files}, \Robject{frag.len}, \Robject{param} and \Robject{data} from the last chapter.
We'll also need the \Rfunction{aveLogCPM} function from \Biocpkg{edgeR}, so load the package if you haven't done so already.
\end{combox}

\section{Independent filtering for count data}
Many of the low abundance windows in the genome correspond to background regions in which DB is not expected.
Indeed, windows with low counts will not provide enough evidence against the null hypothesis to obtain sufficiently low $p$-values for DB detection.
Similarly, some approximations used in the statistical analysis will fail at low counts.
Removing such uninteresting or ineffective tests reduces the severity of the multiple testing correction, increases detection power amongst the remaining tests and reduces computational work.
Filtering is valid so long as it is independent of the test statistic under the null hypothesis \cite{bourgon2010}.
In the negative binomial (NB) framework, this (probably) corresponds to filtering on the overall NB mean.
The DB $p$-values retained after filtering on the overall mean should be uniform under the null hypothesis, by analogy to the normal case.
Row sums can also be used for datasets where the effective library sizes are not very different, or where the counts are assumed to be Poisson-distributed between biological replicates.

In \Biocpkg{edgeR}, the log-transformed overall NB mean is referred to as the average abundance.
This is computed with the \Rfunction{aveLogCPM} function, as shown below for each window.

<<>>=
abundances <- aveLogCPM(asDGEList(data))
summary(abundances)
@

For demonstration purposes, an arbitrary threshold of -1 is used here to filter the window abundances.
This restricts the analysis to windows with abundances above this threshold.

<<>>=
keep.simple <- abundances > -1
filtered.data <- data[keep.simple,]
summary(keep.simple)
@

The exact choice of filter threshold may not be obvious.
In particular, there is often no clear distinction in abundances between genuine binding and background events, e.g., due to the presence of many weak but genuine binding sites.
A threshold that is too small will be ineffective, whereas a threshold that is too large may decrease power by removing true DB sites.
Arbitrariness is unavoidable when balancing these opposing considerations.
Nonetheless, several strategies for defining the threshold are described below.
Users should start by choosing \textbf{one} of these filtering approaches to implement in their analyses.
Each approach yields a logical vector that can be used in the same way as \Robject{keep.simple}.

\section{By count size}
The simplest approach is to filter directly on the count size.
This removes windows for which the counts are too low for modelling and hypothesis testing.
The code below retains windows with (library size-adjusted) average counts greater than 5.

<<>>=
keep <- abundances > aveLogCPM(5, lib.size=mean(data$totals))
summary(keep)
@

However, a count-based filter becomes less effective as the library size increases.
More windows will be retained with greater sequencing depth, even in uninteresting background regions.
This increases both computational work and the severity of the multiplicity correction.
The threshold may also be inappropriate when library sizes are very different.

\section{By proportion}
One approach is to assume that only a certain proportion -- say, 0.1\% -- of the genome is genuinely bound.
This corresponds to the top proportion of high-abundance windows.
The total number of windows is calculated from the genome length and the \Rcode{spacing} interval used in \Rfunction{windowCounts}.
The \Rfunction{filterWindowsProportion} function returns the ratio of the rank of each window to this total, where higher-abundance windows have larger ranks.
Users can then retain those windows with rank ratios above the unbound proportion of the genome.

<<>>=
keep <- filterWindowsProportion(data)$filter > 0.999
sum(keep)
@

This approach is simple and has the practical advantage of maintaining a constant number of windows for the downstream analysis.
However, it may not adapt well to different datasets where the proportion of bound sites can vary.
Using an inappropriate percentage of binding sites will result in the loss of potential DB regions or inclusion of background regions.
\section{By global enrichment}
\label{sec:global_filter}
An alternative approach involves choosing a filter threshold based on the fold change over the level of non-specific enrichment.
The degree of background enrichment is estimated by counting reads into large bins across the genome.
Binning is necessary here to increase the size of the counts when examining low-density background regions.
This ensures that precision is maintained when estimating the background abundance.

<<>>=
bin.size <- 2000L
binned <- windowCounts(bam.files, bin=TRUE, width=bin.size, param=param)
@

The median of the average abundances across all bins is computed and used as a global estimate of the background coverage.
This global background is then compared to the window-based abundances.
This determines whether a window is driven by background enrichment and is thus unlikely to be interesting.
However, some care is required as the sizes of the regions used for read counting are different between bins and windows.
The average abundance of each bin must be scaled down to be comparable to those of the windows.

The \Rfunction{filterWindowsGlobal} function returns the increase in the abundance of each window over the global background.
Windows are filtered by setting some minimum threshold on this increase.
The aim is to eliminate the majority of uninteresting windows prior to further analysis.
Here, a fold change of 3 is necessary for a window to be considered as containing a binding site.
This approach has an intuitive and experimentally relevant interpretation that adapts to the level of non-specific enrichment in the dataset.

<<>>=
filter.stat <- filterWindowsGlobal(data, background=binned)
keep <- filter.stat$filter > log2(3)
sum(keep)
@

The effect of filtering can also be visualized with a histogram.
This allows users to confirm that the bulk of (assumed) background bins are discarded upon filtering.
Note that bins containing genuine binding sites will usually not be visible on such plots.
This is due to the dominance of the background-containing bins throughout the genome.

<>=
hist(filter.stat$back.abundances, xlab="Adjusted bin log-CPM", breaks=100, main="",
    col="grey80", xlim=c(min(filter.stat$back.abundances), 0))
global.bg <- filter.stat$abundances - filter.stat$filter
abline(v=global.bg[1], col="red", lwd=2)
abline(v=global.bg[1]+log2(3), col="blue", lwd=2)
legend("topright", lwd=2, col=c('red', 'blue'), legend=c("Background", "Threshold"))
@

Of course, the pre-specified minimum fold change may be too aggressive when binding is weak.
For TF data, a large cut-off works well as narrow binding sites will have high read densities and are unlikely to be lost during filtering.
Smaller minimum fold changes are recommended for diffuse marks where the difference from background is less obvious.

\section{By local enrichment}

\subsection{Mimicking single-sample peak callers}
Local background estimators can also be constructed, which avoids inappropriate filtering when there are differences in background coverage across the genome.
Here, the 2 kbp region surrounding each window will be used as the ``neighborhood'' over which a local estimate of non-specific enrichment for that window can be obtained.
The counts for these regions are first obtained with the \Rfunction{regionCounts} function.
This should be synchronized with \Rfunction{windowCounts} by using the same \Robject{param}, if any non-default settings were used.
<<>>=
surrounds <- 2000
neighbor <- suppressWarnings(resize(rowRanges(data), surrounds, fix="center"))
wider <- regionCounts(bam.files, regions=neighbor, ext=frag.len, param=param)
@

We apply \Rfunction{filterWindowsLocal} to compute enrichment values, i.e., the increase in the abundance of each window over its neighborhood.
In this function, counts for each window are subtracted from the counts for its neighborhood.
This ensures that any enriched regions or binding sites inside the window will not interfere with estimation of its local background.
The width of the window is also subtracted from that of its neighborhood, to reflect the effective size of the latter after subtraction of counts.
Based on the fold-differences in widths, the abundance of the neighborhood is scaled down for a valid comparison to that of the corresponding window.
Enrichment values are subsequently calculated from the differences in scaled abundances.

<<>>=
filter.stat <- filterWindowsLocal(data, wider)
summary(filter.stat$filter)
@

Filtering can then be performed using a quantile- or fold change-based threshold on the enrichment values.
In this scenario, a 3-fold increase in enrichment over the neighborhood abundance is required for retention of each window.
This roughly mimics the behavior of single-sample peak-calling programs such as \software{MACS} \cite{zhang2008}.

<<>>=
keep <- filter.stat$filter > log2(3)
sum(keep)
@

Note that this procedure also assumes that no other enriched regions are present in each neighborhood.
Otherwise, the local background will be overestimated and windows may be incorrectly filtered out.
This may be problematic for diffuse histone marks or TFBS clusters where enrichment may be observed in both the window and its neighborhood.

If this seems too complicated, an alternative is to identify locally enriched regions using peak-callers like \software{MACS}.
Filtering can then be performed to retain only windows within called peaks.
However, peak calling must be done independently of the DB status of each window.
If libraries are of similar size or biological variability is low, reads can be pooled into one library for single-sample peak calling \cite{lun2014}.
This is equivalent to filtering on the average count and avoids loss of the type I error control from data snooping.

\subsection{Identifying local maxima}
\label{sec:localmax}
Another strategy is to use the \Rfunction{findMaxima} function to identify local maxima in the read density across the genome.
The code below will determine if each window is a local maximum, i.e., whether it has the highest average abundance within 1 kbp on either side.
The data can then be filtered to retain only these locally maximal windows.
This can also be combined with other filters to ensure that the retained windows have high absolute abundance.

<<>>=
maxed <- findMaxima(rowRanges(data), range=1000, metric=abundances)
summary(maxed)
@

This approach is very aggressive and should only be used (sparingly) in datasets where binding is sharp, simple and isolated.
Complex binding events involving diffuse enrichment or adjacent binding sites will not be handled well.
For example, DB detection will fail if a low-abundance DB window is ignored in favor of a high-abundance non-DB neighbor.

\subsection{With negative controls}
Negative controls for ChIP-seq refer to input or IgG libraries where the IP step has been skipped or compromised with an irrelevant antibody, respectively.
This accounts for sequencing/mapping biases in ChIP-seq data.
IgG controls also quantify the amount of non-specific enrichment throughout the genome.
These controls are mostly irrelevant when testing for DB between ChIP samples.
However, they can be used to filter out windows where the average abundance across the ChIP samples is below the abundance of the control.
To illustrate, let us add an input library to our NF-YA data set in the code below.

<<>>=
tf.data <- NFYAData()
with.input <- tf.data$Path
in.demo <- windowCounts(with.input, ext=frag.len, param=param)
chip <- in.demo[,1:4] # All ChIP libraries
control <- in.demo[,5] # All control libraries
@

Some additional work is required to account for composition biases that are likely to be present when comparing ChIP to negative control samples (see Section~\ref{sec:compo_norm}).
A simple strategy for normalization involves counting reads into large bins, which are used in \Rfunction{scaleControlFilter} to compute a normalization factor.

<<>>=
in.binned <- windowCounts(with.input, bin=TRUE, width=10000, param=param)
chip.binned <- in.binned[,1:4]
control.binned <- in.binned[,5]
scale.info <- scaleControlFilter(chip.binned, control.binned)
@

We use the \Rfunction{filterWindowsControl} function to compute the enrichment of the ChIP counts over the control counts for each window.
This uses \Robject{scale.info} to adjust for composition biases between ChIP and control samples.
A larger \Rcode{prior.count} of 5 is also used to compute the average abundance.
This protects against inflated log-fold changes when the count for the window in the control sample is near zero.
By comparison, the global and local background estimates require less protection (\Rcode{prior.count=2}, by default) as they are derived from larger bins with more counts.

<<>>=
filter.stat <- filterWindowsControl(chip, control, prior.count=5, scale.info=scale.info)
@

The log-fold enrichment of the ChIP sample over the control is then computed for each window, after normalizing for composition bias with the binned counts.
The example below requires a 3-fold or greater increase in abundance over the control to retain each window.

<<>>=
keep <- filter.stat$filter > log2(3)
sum(keep)
@

As an aside, the \Biocpkg{csaw} pipeline can also be applied to search for ``DB'' between ChIP libraries and control libraries.
The ChIP and control libraries can be treated as separate groups, in which most ``DB'' events are expected to be enriched in the ChIP samples.
If this is the case, the filtering procedure described above is inappropriate as it will select for windows with differences between ChIP and control samples.
This compromises the assumption of the null hypothesis during testing, resulting in loss of type I error control.

% This is the best way to deal with GC biases, as you can just grab any old input off the web and use it (citation?).
% Alternatively, you could estimate the expected background from a GC content vs. abundance curve on the ChIP data, but that's susceptible to inflation from genuine binding.

\section{By prior information}
When only a subset of genomic regions is of interest, DB detection power can be improved by removing windows lying outside of these regions.
Such regions could include promoters, enhancers, gene bodies or exons.
Alternatively, sites could be defined from a previous experiment or based on the genome sequence, e.g., TF motif matches.
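As an illustration of the last point, the unevaluated sketch below imports pre-computed motif matches from a hypothetical BED file and restricts the analysis to windows that overlap them; the file name and the use of motif matches are purely for demonstration.

<<eval=FALSE>>=
# Sketch only: 'nfya_motif_matches.bed' is a hypothetical file of motif coordinates.
library(rtracklayer)
motif.regions <- import("nfya_motif_matches.bed")
keep.motif <- overlapsAny(rowRanges(data), motif.regions)
@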
The example below retrieves the coordinates of the broad gene bodies from the mouse genome, including the 3 kbp region upstream of the TSS that represents the putative promoter region for each gene.

<<>>=
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
broads <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
broads <- resize(broads, width(broads)+3000, fix="end")
head(broads)
@

Windows can be filtered to retain only those that overlap the regions of interest.
Discerning users may wish to distinguish between full and partial overlaps, though this should not be a significant issue for small windows.
This could also be combined with abundance filtering to retain windows that contain putative binding sites in the regions of interest.

<<>>=
suppressWarnings(keep <- overlapsAny(rowRanges(data), broads))
sum(keep)
@

Any information used here should be independent of the DB status under the null in the current dataset.
For example, DB calls from a separate dataset and/or independent annotation can be used without problems.
However, using DB calls from the same dataset to filter regions would violate the null assumption and compromise type I error control.

In addition, this filter is unlike the others in that it does not operate on the abundance of the windows.
It is possible that the set of retained windows may be very small, e.g., if no non-empty windows overlap the pre-defined regions of interest.
Thus, it may be better to apply this filter before the multiplicity correction but after DB testing.
This ensures that there are sufficient windows for stable estimation of the downstream statistics.

\section{Some final comments about filtering}
It should be stressed that these filtering strategies do not eliminate subjectivity.
Some thought is still required in selecting an appropriate proportion of bound sites or minimum fold change above background for each method.
Rather, these filters provide a relevant interpretation for what would otherwise be an arbitrary threshold on the abundance.

As a general rule, users should filter less aggressively if there is any uncertainty about the features of interest.
In particular, the thresholds shown in this chapter for each filtering statistic are fairly mild.
This ensures that more potentially DB windows are retained for testing.
Use of an aggressive filter risks the complete loss of detection for such windows, even if power is improved among those that are retained.
Low numbers of retained windows may also lead to unstable estimates during downstream steps such as normalization and variance modelling.

Different filters can also be combined in more advanced applications, e.g., by running \Rcode{data[keep1 \& keep2,]} for filter vectors \Robject{keep1} and \Robject{keep2}.
Any benefit will depend on the type of filters involved.
The greatest effect is observed for filters that operate on different principles.
For example, the low-count filter can be combined with others to ensure that all retained windows surpass some minimum count.
This is especially relevant for the local background filters, where a large enrichment value does not guarantee a large count.
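An unevaluated sketch of such a combination is shown below, assuming that \Robject{filter.stat} contains local enrichment values from \Rfunction{filterWindowsLocal} for the rows of \Robject{data}.

<<eval=FALSE>>=
# Sketch only: require both local enrichment and a minimum average count.
keep.local <- filter.stat$filter > log2(3)
keep.count <- abundances > aveLogCPM(5, lib.size=mean(data$totals))
combined.data <- data[keep.local & keep.count,]
@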
<>=
rm(binned, in.demo, wider, neighbor, chip, control, filter.stat,
    in.binned, chip.binned, control.binned)
gc()
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Calculating normalization factors}
\label{chap:norm}

\begin{combox}
This chapter will need to use the \Robject{bam.files} vector again (from the introduction).
We'll also need the \Robject{param} object that we defined in Chapter~\ref{chap:count}, as well as the \Robject{filtered.data} object that we constructed in Chapter~\ref{chap:filter}.
You'll notice that character vectors containing paths to other BAM files are defined throughout this chapter.
However, these are only present for demonstration purposes and aren't necessary for the main NF-YA example.
\end{combox}

\section{Overview}
The complexity of the ChIP-seq technique gives rise to a number of different biases in the data.
For a DB analysis, library-specific biases are of particular interest as they can introduce spurious differences between conditions.
This includes composition biases, efficiency biases and trended biases.
Thus, normalization between libraries is required to remove these biases prior to any statistical analysis.
Several normalization strategies are presented here, though users should only pick \textbf{one} to use for any given analysis.
Advice on choosing the most appropriate method is scattered throughout the chapter, so read carefully.

\section{Eliminating composition biases}
\label{sec:compo_norm}

\subsection{Using the TMM method on binned counts}
As the name suggests, composition biases are formed when there are differences in the composition of sequences across libraries.
Highly enriched regions consume more sequencing resources and thereby suppress the representation of other regions.
Differences in the magnitude of suppression between libraries can lead to spurious DB calls.
Scaling by library size fails to correct for this as composition biases can still occur in libraries of the same size.

To remove composition biases in \Biocpkg{csaw}, reads are counted into large bins and the counts are used for normalization with the \Rfunction{normFactors} wrapper function.
This uses the trimmed mean of M-values (TMM) method \cite{oshlack2010} to correct for any systematic fold change in the coverage of the bins.
The assumption here is that most bins represent non-DB background regions, so any consistent difference across bins must be technical bias.

<<>>=
binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
filtered.data <- normFactors(binned, se.out=filtered.data)
filtered.data$norm.factors
@

The TMM method trims away putative DB bins (i.e., those with extreme M-values) and computes normalization factors from the remainder to use in \Biocpkg{edgeR}.
The size of each library is scaled by the corresponding factor to obtain an effective library size for modelling.
A larger normalization factor results in a larger effective library size and is conceptually equivalent to scaling each individual count downwards, given that the ratio of that count to the (effective) library size will be smaller.
Check out the \Biocpkg{edgeR} user's guide for more information.
In the above code, the \Rfunction{normFactors} call computes normalization factors from the bin-level counts in \Robject{binned} (see Section~\ref{sec:normbinsize}).
The \Rcode{se.out} argument directs the function to return a modified version of \Robject{filtered.data}, where the normalization factors are stored alongside the \textit{window}-level counts for further analysis.
Composition biases affect both bin- and window-level counts, so computing normalization factors from the former and applying them to the latter is valid -- provided that the library sizes are the same between the two sets of counts, as the factors are interpreted with respect to the library sizes.
(In \Biocpkg{csaw}, separate calls to \Rfunction{windowCounts} with the same \Rclass{readParam} object will always yield the same library sizes in \Rcode{totals}.)

Note that \Rfunction{normFactors} skips the precision weighting step in the TMM method.
Weighting aims to increase the contribution of bins with high counts, as these yield more precise M-values.
However, high-abundance bins are more likely to contain binding sites and thus are more likely to be DB compared to background regions.
If any DB regions should survive trimming, upweighting them would be counterproductive.

% By default, the top 5% of most abundant elements are already removed by TMM.
% You can ask for more removal, which could help; but in general, binding sites
% are so negligible in quantity compared to the background regions, it doesn't
% really matter too much, so long as weighting isn't in play.

\subsection{Motivating the use of large bins instead of windows}
\label{sec:normbinsize}
By definition, read coverage is low for background regions of the genome.
This can result in a large number of zero counts and undefined M-values when reads are counted into small windows.
Adding a prior count is only a superficial solution as the chosen prior will have undue influence on the estimate of the normalization factor when many counts are low.
The variance of the fold change distribution is also higher for low counts, which reduces the effectiveness of the trimming procedure.
These problems can be overcome by using large bins to increase the size of the counts, thus improving the precision of TMM normalization.
The normalization factors computed from the bin-level counts are then applied to the window-level counts of interest.

Of course, this strategy requires the user to supply a bin size.
If the bins are too large, background and enriched regions will be included in the same bin.
This makes it difficult to trim away bins corresponding to enriched regions.
On the other hand, the counts will be too low if the bins are too small.
Testing multiple bin sizes is recommended to ensure that the estimates are robust to any changes.
A value of 10 kbp is usually suitable for most datasets.

<<>>=
demo <- windowCounts(bam.files, bin=TRUE, width=5000, param=param)
normFactors(demo, se.out=FALSE)
demo <- windowCounts(bam.files, bin=TRUE, width=15000, param=param)
normFactors(demo, se.out=FALSE)
@

We use \Rcode{se.out=FALSE} to instruct \Rfunction{normFactors} to return the normalization factors directly.
This is more convenient than returning a \Rclass{RangedSummarizedExperiment} object and extracting the normalization factors with \Rcode{...\$norm.factors}.
Here, the factors are consistently close to unity, which suggests that composition bias is negligible in this dataset.
See Section~\ref{sec:eff_norm_ma} for some examples with greater bias.
\subsection{Visualizing normalization outcomes with MA plots}
The effectiveness of normalization can be examined using an MA plot.
A single main cloud of points should be present, consisting primarily of background regions.
Separation into multiple discrete points indicates that the counts are too low and that larger bin sizes should be used.
Composition biases manifest as a vertical shift in the position of this cloud.
Ideally, the log-ratios of the corresponding normalization factors should pass through the centre of the cloud.
This indicates that undersampling has been identified and corrected.

% Genuine binding sites are mixed in with the background in this example, as
% there's no clear distinction between the two. As one might expect, the bins
% containing binding sites tend to be those with higher A-values. Most of the
% bins should still be free of binding, though (8000/260000 bins, based on
% those that overlap globally filtered windows in Chapter 4).

<>=
par(mfrow=c(1, 3), mar=c(5, 4, 2, 1.5))
adj.counts <- cpm(asDGEList(binned), log=TRUE)
normfacs <- filtered.data$norm.factors
for (i in seq_len(length(bam.files)-1)) {
    cur.x <- adj.counts[,1]
    cur.y <- adj.counts[,1+i]
    smoothScatter(x=(cur.x+cur.y)/2+6*log2(10), y=cur.x-cur.y,
        xlab="A", ylab="M", main=paste("1 vs", i+1))
    all.dist <- diff(log2(normfacs[c(i+1, 1)]))
    abline(h=all.dist, col="red")
}
@

\section{Eliminating efficiency biases}
\label{sec:eff_norm}
\subsection{Using the TMM method on high-abundance regions}
\label{data:norm}
Efficiency biases in ChIP-seq data refer to fold changes in enrichment that are introduced by variability in IP efficiencies between libraries.
These technical differences are not biologically interesting and must be removed.
This can be achieved by assuming that high-abundance windows contain binding sites.
Consider the following H3K4me3 data set, where reads are counted into 150 bp windows.

<<>>=
k4data <- H3K4me3Data()
k4data
me.files <- k4data$Path[c(1,3)]
me.demo <- windowCounts(me.files, width=150, param=param)
@

High-abundance windows are chosen using a global filtering approach described in Section~\ref{sec:global_filter}.
Here, the binned counts in \Robject{me.bin} are only used for defining the background abundance, \emph{not} for computing normalization factors.

<<>>=
me.bin <- windowCounts(me.files, bin=TRUE, width=10000, param=param)
keep <- filterWindowsGlobal(me.demo, me.bin)$filter > log2(3)
filtered.me <- me.demo[keep,]
@

The TMM method is then applied to eliminate systematic differences across those windows.
This also assumes that most binding sites in the genome are not DB.
Thus, any systematic differences in coverage among the high-abundance windows must be caused by differences in IP efficiency between libraries or some other technical issue.
Scaling by the normalization factors will subsequently remove these biases between libraries.

<<>>=
filtered.me <- normFactors(filtered.me, se.out=TRUE)
me.eff <- filtered.me$norm.factors
me.eff
@

The above process seems rather involved, but this is only because we need to work our way through counting and normalization for a new data set.
Only \Rfunction{normFactors} is actually performing the normalization step.
We set \Rcode{se.out=TRUE} so that the normalization factors are stored alongside the window counts for use in the downstream \Biocpkg{edgeR} analysis.
As a demonstration, we repeat this procedure on another data set involving H3 acetylation.
<<>>= acdata <- H3K9acData() acdata ac.files <- acdata$Path[c(1,2)] ac.demo <- windowCounts(ac.files, width=150, param=param) ac.bin <- windowCounts(ac.files, bin=TRUE, width=10000, param=param) keep <- filterWindowsGlobal(ac.demo, ac.bin)$filter > log2(5) summary(keep) filtered.ac <- ac.demo[keep,] filtered.ac <- normFactors(filtered.ac, se.out=TRUE) ac.eff <- filtered.ac$norm.factors ac.eff @ Normalization for efficiency biases assumes that most binding sites are not DB. Genuine biological differences may be removed when the assumption of a non-DB majority does not hold, e.g., overall binding is truly lower in one condition. In such cases, it is safer to normalize for composition biases -- see Section~\ref{sec:normchoice} for a discussion of the choice between normalization methods. \subsection{Filtering windows prior to normalization} Normalization for efficiency biases is performed on window-level counts instead of bin counts. This is possible as filtering ensures that we only retain the high-abundance windows, i.e., those with counts that are large enough for stable calculation of normalization factors. It is not necessary to use larger windows or bins, and indeed, direct use of the windows of interest ensures removal of systematic differences in those windows prior to downstream analyses. The filtering procedure needs to be stringent enough to avoid retaining windows from background regions. These will interfere with calculation of normalization factors from binding sites. This is due to the lower coverage for background regions, as well as the fact that they are not affected by efficiency bias (and cannot contribute to its estimation). Conversely, attempting to use the factors computed from high-abundance windows on windows from background regions will result in incorrect normalization of the latter. Thus, it is usually better to err on the side of caution and filter aggressively to ensure that background regions are not retained in downstream analyses. Obviously, though, retaining too few windows will result in unstable estimates of the normalization factors. \subsection{Checking normalization with MA plots} \label{sec:eff_norm_ma} We again visualize the effect of normalization with MA plots. Plots are constructed using counts for 10 kbp bins, rather than with those from the windows. This is useful as the behavior of the entire genome can be examined, rather than just that of the high-abundance windows. It also allows calculation of and comparison to the factors for composition bias. <<>>= me.comp <- normFactors(me.bin, se.out=FALSE) me.comp ac.comp <- normFactors(ac.bin, se.out=FALSE) ac.comp @ The clouds at low and high A-values represent the background and bound regions, respectively. The normalization factors from removal of composition bias (dashed) pass through the former, whereas the factors to remove efficiency bias (full) pass through the latter. A non-zero M-value location for the high A-value cloud represents a systematic difference between libraries for the bound regions, either due to genuine DB or variable IP efficiency. This also induces composition bias, leading to a non-zero M-value for the background cloud. 
<>= par(mfrow=c(1,2)) for (main in c("H3K4me3", "H3ac")) { if (main=="H3K4me3") { bins <- me.bin comp <- me.comp eff <- me.eff } else { bins <- ac.bin comp <- ac.comp eff <- ac.eff } adjc <- cpm(asDGEList(bins), log=TRUE) smoothScatter(x=rowMeans(adjc), y=adjc[,1]-adjc[,2], xlab="A", ylab="M", main=main) abline(h=log2(eff[1]/eff[2]), col="red") abline(h=log2(comp[1]/comp[2]), col="red", lty=2) } @ \section{Choosing between normalization strategies} \label{sec:normchoice} The normalization strategies for composition and efficiency biases are mutually exclusive, as only one set of normalization factors will ultimately be used in \Biocpkg{edgeR}. The choice between the two methods depends on whether one assumes that the systematic differences at high abundances represent genuine DB events. If so, the binned TMM method from Section~\ref{sec:compo_norm} should be used to remove composition bias. This will preserve the assumed DB, at the cost of ignoring any efficiency biases that might be present. Otherwise, if the systematic differences are not genuine DB, they must represent efficiency bias and should be removed by applying the TMM method on high-abundance windows (Section~\ref{sec:eff_norm}). Some understanding of the biological context is useful in making this decision, e.g., comparing a wild-type against a knock-out for the target protein should result in systematic DB, while overall levels of histone marking are expected to be consistent in most conditions. For the main NF-YA example, there is no expectation of constant binding between cell types. Thus, normalization factors will be computed to remove composition biases. This ensures that any genuine systematic changes in binding will still be picked up. In general, normalization for composition bias is a good starting point for any analysis. This can be considered as the ``default'' strategy unless there is evidence for a confounding efficiency bias. \section{Scaling normalization with spike-in chromatin} Several studies have used spike-in chromatin for scaling normalization of ChIP-seq data \cite{bonhoure2014quantifying,orlando2014quantitative}. Briefly, a constant amount of chromatin from a different species is added to each sample at the start of the ChIP-seq protocol. The mixture is processed and sequenced in the usual manner, using an antibody that can bind epitopes of interest from both species. The coverage of the spiked-in foreign chromatin is then quantified in each library. As the quantity of foreign chromatin should be constant in each sample, the coverage of binding sites on the foreign genome should also be the same between libraries. Any difference in coverage between libraries represents some technical bias that should be removed by scaling. This normalization strategy can be implemented in \Biocpkg{csaw} with some work. Assuming that the reference genome includes appropriate sequences from the foreign genome, coverage is quantified across genomic windows with \Rfunction{windowCounts}. Filtering is performed to select for high-abundance windows in the foreign genome (see Section~\ref{sec:eff_norm}), yielding counts for all enriched spike-in regions. (The filtered object is named \Robject{spike.data} in the code below.) Normalization factors are computed by applying the TMM method on these counts via \Rfunction{normFactors}. This aims to identify the fold-change in coverage between samples that is attributable to technical effects. <<>>= # Pretend chr1 is a spike-in, for demonstration purposes only! 
is.1 <- seqnames(rowRanges(data))=="chr1"
spike.data <- data[is.1,]
endog.data <- data[!is.1,]
endog.data <- normFactors(spike.data, se.out=endog.data)
@

In the code above, the spike-in normalization factors are returned in a modified copy of \Robject{endog.data} for further analysis of the endogenous windows.
We assume that the library sizes in \Rcode{totals} are the same between \Robject{spike.data} and \Robject{endog.data}, which should be the case if they were formed by subsetting the output of a single \Rfunction{windowCounts} call.
This ensures that the normalization factors computed from the spike-in windows are applicable to the endogenous windows.

Compared to the previous normalization methods, the spike-in approach does not distinguish between composition and efficiency biases.
Instead, it uses the fold-differences in the coverage of spiked-in binding sites to empirically measure and remove the net bias between libraries.
This avoids the need for assumptions regarding the origin of any systematic differences between libraries.

That said, spike-in normalization involves some strong assumptions of its own.
In particular, the ratio of spike-in chromatin to endogenous chromatin is assumed to be the same in each sample.
This requires accurate quantitation of the chromatin in each sample, followed by precise addition of small spike-in quantities.
Furthermore, the spike-in chromatin is assumed to behave in the same manner as endogenous chromatin throughout the ChIP-seq protocol.
Whether these assumptions are reasonable will depend on the experimenter and the nature of the spike-in chromatin.

\section{Dealing with trended biases}
\subsection{Applying loess-based normalization to the counts}
In more extreme cases, the bias may vary with the average abundance to form a trend.
One possible explanation is that changes in IP efficiency will have little effect at low-abundance background regions and more effect at high-abundance binding sites.
Thus, the magnitude of the bias between libraries will change with abundance.
The trend cannot be corrected with scaling methods as no single scaling factor will remove differences at all abundances.
Rather, non-linear methods are required, such as cyclic loess or quantile normalization.

One such implementation of a non-linear normalization method is provided in \Rfunction{normOffsets}.
This is based on the fast loess algorithm \cite{ballman2004} with minor adaptations to handle low counts.
A matrix is produced that contains an offset term for each bin/window in each library.
This offset matrix can then be directly used in \Biocpkg{edgeR}, assuming that the bins or windows used in normalization are also the ones to be tested for DB.

We demonstrate this procedure below, using filtered counts for 2 kbp windows in the H3 acetylation data set.
(This window size is chosen purely for aesthetics in this demonstration, as the trend is less obvious at smaller widths.
Obviously, users should pick a more appropriate value for their analysis.)

<<>>=
ac.demo2 <- windowCounts(ac.files, width=2000L, param=param)
filtered <- filterWindowsGlobal(ac.demo2, ac.bin)
keep <- filtered$filter > log2(4)
ac.demo2 <- ac.demo2[keep,]
ac.demo2 <- normOffsets(ac.demo2, se.out=TRUE)
ac.off <- assay(ac.demo2, "offset")
head(ac.off)
@

When \Rcode{se.out=TRUE}, the offsets are stored in the \Rclass{RangedSummarizedExperiment} object as an \Rcode{"offset"} entry in the \Rcode{assays} slot.
Each offset represents the log-transformed scaling factor that needs to be applied to the corresponding entry of the count matrix for its normalization. Any operations like subsetting that are applied to modify the object will also be applied to the offsets, allowing for synchronised processing. \subsection{Characteristics of loess normalization} Loess normalization of trended biases is quite similar to TMM normalization for efficiency biases described in Section~\ref{sec:eff_norm}. Both methods assume a non-DB majority across features, and will not be appropriate if there is a change in overall binding. Loess normalization involves a slightly stronger assumption of a non-DB majority at every abundance, not just across all bound regions. This is necessary to remove trended biases but may also discard genuine changes, such as a subset of DB sites at very high abundances. Compared to TMM normalization, the accuracy of loess normalization is less dependent on stringent filtering. This is because the use of a trend accommodates changes in the bias between high-abundance binding sites and low-abundance background regions. Nonetheless, some filtering is still necessary to avoid inaccuracies in loess fitting at low abundances. Any filter statistic for the windows should be based on the average abundance from \Rfunction{aveLogCPM}, such as those calculated using \Rfunction{filterWindowsGlobal} or equivalents. An average abundance threshold will act as a clean vertical cutoff in the MA plots above. This avoids introducing spurious trends at the filter boundary that might affect normalization. \subsection{Checking normalization with MA plots} We examine the MA plots to determine whether normalization was successful. Any abundance-dependent trend in the M-values should be eliminated after applying the offsets to the log-counts. This is done by subtraction, though note that the offsets are base $e$ while most log-values in \Biocpkg{edgeR} are reported as base 2. <>= par(mfrow=c(1,2)) # MA plot without normalization. ac.y <- asDGEList(ac.demo2) lib.size <- ac.y$samples$lib.size adjc <- cpm(ac.y, log=TRUE) abval <- aveLogCPM(ac.y) mval <- adjc[,1]-adjc[,2] fit <- loessFit(x=abval, y=mval) smoothScatter(abval, mval, ylab="M", xlab="Average logCPM", main="Raw", ylim=c(-2,2), xlim=c(0, 7)) o <- order(abval) lines(abval[o], fit$fitted[o], col="red") # Repeating after normalization. re.adjc <- log2(assay(ac.demo2)+0.5) - ac.off/log(2) mval <- re.adjc[,1]-re.adjc[,2] fit <- loessFit(x=abval, y=mval) smoothScatter(abval, re.adjc[,1]-re.adjc[,2], ylab="M", xlab="Average logCPM", main="Normalized", ylim=c(-2,2), xlim=c(0, 7)) lines(abval[o], fit$fitted[o], col="red") @ \section{A word on other biases} No normalization is performed to adjust for differences in mappability or sequencability between different regions of the genome. Region-specific biases are assumed to be constant between libraries. This is generally reasonable as the biases depend on fixed properties of the genome sequence such as GC content. Thus, biases should cancel out during DB comparisons. Any variability between samples will just be absorbed into the dispersion estimate. That said, explicit normalization to correct these biases can improve results for some datasets. Procedures like GC correction could decrease the observed variability by removing systematic differences between replicates. Of course, this also assumes that the targeted differences have no biological relevance. Detection power may be lost if this is not true. 
For example, differences in the GC content distribution can be driven by technical bias as well as biology, e.g., when protein binding is associated with a specific GC composition. <>= rm(demo, ac.demo, filtered.ac, ac.bin, me.demo, filtered.me, me.bin, ac.demo2, filtered, keep, adjc) gc() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \chapter{Testing for differential binding} \label{chap:stats} \begin{combox} Here, we'll need the \Robject{filtered.data} generated in Chapter~\ref{chap:filter} (and modified in Chapter~\ref{chap:norm}), and the \Robject{design} matrix from the introduction. We'll also need the original \Robject{data} from Chapter~\ref{chap:count}, but just for a demonstration. Finally, we'll be executing a number of \Biocpkg{edgeR} functions, so make sure the \Biocpkg{edgeR} package is loaded if you've been skipping chapters. \end{combox} \section{Introduction to \Biocpkg{edgeR}} \subsection{Overview} Low counts per window are typically observed in ChIP-seq datasets, even for genuine binding sites. Any statistical analysis to identify DB sites must be able to handle discreteness in the data. Count-based models are ideal for this purpose. In this guide, the quasi-likelihood (QL) framework in the \Biocpkg{edgeR} package is used \cite{lund2012}. Counts are modelled using NB distributions that account for overdispersion between biological replicates \cite{robinson2008}. Each window can then be tested for significant DB between conditions. Of course, any statistical method can be used if it is able to accept a count matrix and a vector of normalization factors (or more generally, a matrix of offsets). The choice of \Biocpkg{edgeR} is primarily motivated by its performance relative to some published alternatives \cite{law2014}. This author's desire to increase his h-index may also be a factor \cite{chen2014}. \subsection{Setting up the data} A \Rclass{DGEList} object is first constructed from the count matrix in \Robject{filtered.data}. If normalization factors or offsets are present in the \Rclass{RangedSummarizedExperiment} object -- see Chapter~\ref{chap:norm} -- they will automatically be extracted and used to construct the \Rclass{DGEList} object. Otherwise, they can be manually passed to the \Rfunction{asDGEList} function. If offsets are available, they will generally override the normalization factors in the downstream \Biocpkg{edgeR} analysis. <<>>= y <- asDGEList(filtered.data) @ The experimental design is described by a design matrix. In this case, the only relevant factor is the cell type of each sample. A generalized linear model (GLM) will be fitted to the counts for each window using the specified design \cite{mccarthy2012}. This provides a general framework for the analysis of complex experiments with multiple factors. Readers are referred to the user's guide in \Biocpkg{edgeR} for more details on parametrization. 
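For example, a more complex design might include a blocking factor for the batch of origin in addition to the cell type.
The chunk below is only a sketch of what this could look like -- the \Robject{batch} vector is invented for illustration and does not exist for this dataset.

<<eval=FALSE>>=
# Hypothetical sketch: blocking on a batch factor in addition to the cell type,
# where 'batch' is an invented factor specifying the batch for each sample.
design.batch <- model.matrix(~factor(batch) + factor(cell.type))
@

Returning to the NF-YA example, the design matrix is constructed from the cell type alone.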
<<>>=
tf.data <- NFYAData()
cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", head(tf.data$Description, -1))
cell.type
design <- model.matrix(~factor(cell.type))
colnames(design) <- c("intercept", "cell.type")
design
@

\section{Estimating the dispersions}
\subsection{Stabilising estimates with empirical Bayes}
\label{sec:dispest}
Under the QL framework, both the QL and NB dispersions are used to model biological variability in the data \cite{lund2012}.
The former ensures that the NB mean-variance relationship is properly specified with appropriate contributions from the Poisson and Gamma components.
The latter accounts for variability and uncertainty in the dispersion estimate.
However, limited replication in most ChIP-seq experiments means that each window does not contain enough information for precise estimation of either dispersion.

% Both parameters need to be estimated for optimal performance; using too high
% a value for the NB dispersion means that the QL dispersion can't recover (as
% it's very sensitive to the former). Also, using a constant value (e.g. 0, a
% la quasi-poisson) puts a lot of pressure on the trend fitting as you're
% trying to shoehorn a NB mean-variance relationship into a QL mean-variance
% relationship (asymptotically the same, but different at low counts).

This problem is overcome in \Biocpkg{edgeR} by sharing information across windows.
For the NB dispersions, a mean-dispersion trend is fitted across all windows to model the mean-variance relationship \cite{mccarthy2012}.
The raw QL dispersion for each window is estimated after fitting a GLM with the trended NB dispersion.
Another mean-dependent trend is fitted to the raw QL estimates.
An empirical Bayes (EB) strategy is then used to stabilize the raw QL dispersion estimates by shrinking them towards the second trend \cite{lund2012}.
The ideal amount of shrinkage is determined from the variability of the dispersions.

<<>>=
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$var.post)
@

The effect of EB stabilisation can be visualized by examining the biological coefficient of variation (for the NB dispersion) and the quarter-root deviance (for the QL dispersion).
These plots can also be used to decide whether the fitted trend is appropriate.
Sudden irregularities may be indicative of an underlying structure in the data which cannot be modelled with the mean-dispersion trend.
Discrete patterns in the raw dispersions are indicative of low counts and suggest that more aggressive filtering is required.

<>=
par(mfrow=c(1,2))
o <- order(y$AveLogCPM)
plot(y$AveLogCPM[o], sqrt(y$trended.dispersion[o]), type="l", lwd=2,
    ylim=c(0, 1), xlab=expression("Ave."~Log[2]~"CPM"),
    ylab=("Biological coefficient of variation"))
plotQLDisp(fit)
@

A strong trend may also be observed where the NB dispersion drops sharply with increasing average abundance.
This is due to the disproportionate impact of artifacts such as mapping errors and PCR duplicates at low counts.
It is difficult to accurately fit an empirical curve to these strong trends.
As a consequence, the dispersions at high abundances may be overestimated.
Filtering of low-abundance regions (as described in Chapter~\ref{chap:filter}) provides some protection by removing the strongest part of the trend.
Users can compare raw and filtered results to see whether it makes any difference.
Filtering has an additional benefit of removing those tests that have low power due to the magnitude of the dispersions.
<>= relevant <- rowSums(assay(data)) >= 20 # weaker filtering than 'filtered.data' yo <- asDGEList(data[relevant], norm.factors=normfacs) yo <- estimateDisp(yo, design) oo <- order(yo$AveLogCPM) plot(yo$AveLogCPM[oo], sqrt(yo$trended.dispersion[oo]), type="l", lwd=2, ylim=c(0, max(sqrt(yo$trended))), xlab=expression("Ave."~Log[2]~"CPM"), ylab=("Biological coefficient of variation")) lines(y$AveLogCPM[o], sqrt(y$trended[o]), lwd=2, col="grey") legend("topright", c("raw", "filtered"), col=c("black", "grey"), lwd=2) @ \subsection{Modelling variable dispersions between windows} Any variability in the dispersions across windows is modelled in \Biocpkg{edgeR} by the prior degrees of freedom (d.f.). A large value for the prior d.f. indicates that the variability is low. This means that more EB shrinkage can be performed to reduce uncertainty and maximize power. However, strong shrinkage is not appropriate if the dispersions are highly variable. Fewer prior degrees of freedom (and less shrinkage) are required to maintain type I error control. <<>>= summary(fit$df.prior) @ On occasion, the estimated prior degrees of freedom will be infinite. This is indicative of a strong batch effect where the dispersions are consistently large. A typical example involves uncorrected differences in IP efficiency across replicates. In severe cases, the trend may fail to pass through the bulk of points as the variability is too low to be properly modelled in the QL framework. This problem is usually resolved with appropriate normalization. Note that the prior degrees of freedom should be robustly estimated \cite{phipson2016}. Obviously, this protects against large positive outliers (e.g., highly variable windows) but it also protects against near-zero dispersions at low counts. These will manifest as large negative outliers after a log transformation step during estimation \cite{smyth2004}. Without robustness, incorporation of these outliers will inflate the observed variability in the dispersions. This results in a lower estimated prior d.f. and reduced DB detection power. % If you've forgotten, you get near-zero dispersions because counts can be exactly equal. \section{Testing for DB windows} The effect of specific factors can be tested to identify windows with significant differential binding. In the QL framework, $p$-values are computed using the QL F-test \cite{lund2012}. This is more appropriate than using the likelihood ratio test as the F-test accounts for uncertainty in the dispersion estimates. Associated statistics such as log-fold changes and log-counts per million are also computed for each window. <<>>= results <- glmQLFTest(fit, contrast=c(0, 1)) head(results$table) @ The null hypothesis here is that the cell type has no effect. The \Rcode{contrast} argument in the \Rfunction{glmQLFTest} function specifies which factors are of interest. In this case, a contrast of \Rcode{c(0, 1)} defines the null hypothesis as \Rcode{0*intercept + 1*cell.type = 0}, i.e., that the log-fold change between cell types is zero. DB windows can then be identified by rejecting the null. Specification of the contrast is explained in greater depth in the \Biocpkg{edgeR} user's manual. Once the significance statistics have been calculated, they can be stored in row metadata of the \Rclass{RangedSummarizedExperiment} object. This ensures that the statistics and coordinates are processed together, e.g., when subsetting to select certain windows. 
<<>>= rowData(filtered.data) <- cbind(rowData(filtered.data), results$table) @ \section{What to do without replicates} Designing a ChIP-seq experiment without any replicates is strongly discouraged. Without replication, the reproducibility of any findings cannot be determined. Nonetheless, it may be helpful to salvage some information from datasets that lack replicates. This is done by supplying a ``reasonable'' value for the NB dispersion during GLM fitting (e.g., 0.05 - 0.1, based on past experience). DB windows are then identified using the likelihood ratio test. <<>>= fit.norep <- glmFit(y, design, dispersion=0.05) results.norep <- glmLRT(fit.norep, contrast=c(0, 1)) head(results.norep$table) @ Obviously, this approach has a number of pitfalls. The lack of replicates means that the biological variability in the data cannot be modelled. Thus, it becomes impossible to gauge the sensibility of the supplied NB dispersions in the analysis. Another problem is spurious DB due to inconsistent PCR duplication between libraries. Normally, inconsistent duplication results in a large QL dispersion for the affected window, such that significance is downweighted. However, estimation of the QL dispersion is not possible without replicates. This means that duplicates may need to be removed to protect against false positives. \section{Examining replicate similarity with MDS plots} As a quality control measure, the window counts can be used to examine the similarity of replicates through multi-dimensional scaling (MDS) plots. The distance between each pair of libraries is computed as the square root of the mean squared log-fold change across the top set of bins with the highest absolute log-fold changes. A small top set visualizes the most extreme differences whereas a large set visualizes overall differences. Checking a range of \Rcode{top} values may be useful when the scope of DB is unknown. Again, counting with large bins is recommended as fold changes will be undefined in the presence of zero counts. <>= par(mfrow=c(2,2), mar=c(5,4,2,2)) adj.counts <- cpm(y, log=TRUE) for (top in c(100, 500, 1000, 5000)) { out <- plotMDS(adj.counts, main=top, col=c("blue", "blue", "red", "red"), labels=c("es.1", "es.2", "tn.1", "tn.2"), top=top) } @ Replicates from different groups should form separate clusters in the MDS plot, as observed above. This indicates that the results are reproducible and that the effect sizes are large. Mixing between replicates of different conditions indicates that the biological difference has no effect on protein binding, or that the data is too variable for any effect to manifest. Any outliers should also be noted as their presence may confound the downstream analysis. In the worst case, outlier samples may need to be removed to obtain sensible results. <>= rm(adj.counts) gc() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \chapter{Correction for multiple testing} \begin{combox} All right, we're almost there. This chapter needs the \Robject{results} object from the last chapter, as well as \Robject{filtered.data} from Chapter~\ref{chap:filter}. 
A few other things are also required for some of the optional sections below -- namely, \Robject{broads} from Chapter~\ref{chap:filter}, and \Robject{ac.files} and \Robject{param} from Chapters~\ref{chap:norm} and \ref{chap:count}, respectively. \end{combox} \section{Problems with false discovery rate control} \subsection{Overview} The false discovery rate (FDR) is usually the most appropriate measure of error for high-throughput experiments. Control of the FDR can be provided by applying the Benjamini-Hochberg (BH) method \cite{benjamini1995} to a set of $p$-values. This is less conservative than the alternatives (e.g., Bonferroni) yet still provides some measure of error control. The most obvious approach is to apply the BH method to the set of $p$-values across all windows. This will control the FDR across the set of putative DB windows. However, the FDR across all detected windows is not necessarily the most relevant error rate. Interpretation of ChIP-seq experiments is more concerned with regions of the genome in which (differential) protein binding is found, rather than the individual windows. In other words, the FDR across all detected DB regions is usually desired. This is not equivalent to that across all DB windows as each region will often consist of multiple overlapping windows. Control of one will not guarantee control of the other \cite{lun2014}. To illustrate this difference, consider an analysis where the FDR across all window positions is controlled at 10\%. In the results, there are 18 adjacent window positions in one region and 2 windows in a separate region. The first set of windows is a truly DB region whereas the second set is a false positive. A window-based interpretation of the FDR is correct as only 2 of the 20 window positions are false positives. However, a region-based interpretation results in an actual FDR of 50\%. % The BH method is particularly popular as it is simple to apply and robust to % correlations \cite{reiner2003,kim2008}. Simes' is also pretty robust to % correlations \cite{samuel1996,sarkar1997}, in the same respect as the FDR. % Say you control the FDR within a cluster using the BH method, so % E(FDR)<=0.05. Now, the probability of all false positives (i.e. FDR=1) must % be under 0.05 as well. So, if the BH method works, so does Simes' method. To avoid misinterpretation of the FDR, \Biocpkg{csaw} provides a number of strategies to obtain region-level results. This involves defining the regions of interest - possibly from the windows themselves - and converting per-window statistics into a $p$-value for each region. Application of the BH method to the per-region $p$-values will then control the relevant FDR across regions. These strategies are demonstrated below using the NF-YA data. \section{Grouping windows into regions} \subsection{Quick and dirty clustering} \label{sec:cluster} The \Rfunction{mergeWindows} function provides a simple single-linkage algorithm to cluster windows into regions. Windows that are less than \Rcode{tol} apart are considered to be adjacent and are grouped into the same cluster. The chosen \Rcode{tol} represents the minimum distance at which two binding events are treated as separate sites. Large values (500 - 1000 bp) reduce redundancy and favor a region-based interpretation of the results, while smaller values (\textless{} 200 bp) allow resolution of individual binding sites. 
<<>>=
merged <- mergeWindows(filtered.data, tol=1000L)
merged$regions
@

If many adjacent windows are present, very large clusters may be formed that are difficult to interpret.
We perform a simple check below to determine whether most clusters are of an acceptable size.
Huge clusters indicate that more aggressive filtering from Chapter~\ref{chap:filter} is required.
This mitigates chaining effects by reducing the density of windows in the genome.

% Note that several large clusters may still be present due to high coverage within long tandem repeat loci.
% In general, chaining isn't as bad as single-linkage on the reads themselves,
% because windows that survive weak filtering should have reasonably high read counts.

<<>>=
summary(width(merged$regions))
@

Alternatively, chaining can be limited by setting \Rcode{max.width} to restrict the size of the merged intervals.
Clusters substantially larger than \Rcode{max.width} are split into several smaller subclusters of roughly equal size.
The chosen value should be small enough so as to separate DB regions from unchanged neighbors, yet large enough to avoid misinterpretation of the FDR.
Any value from 2000 to 10000 bp is recommended.
This parameter can also be interpreted as the maximum distance at which two binding sites are considered part of the same event.

<<>>=
merged.max <- mergeWindows(filtered.data, tol=1000L, max.width=5000L)
summary(width(merged.max$regions))
@

\subsection{Using external information}
Another approach is to group together windows that overlap with a pre-specified region of interest.
The most obvious source of pre-specified regions is that of annotated features such as promoters or gene bodies.
Alternatively, called peaks can be used provided that sufficient care has been taken to avoid loss of error control from data snooping \cite{lun2014}.
Regardless of how they are specified, each region of interest corresponds to a group that contains all overlapping windows, as identified by the \Rfunction{findOverlaps} function from the \Biocpkg{GenomicRanges} package.

<<>>=
olap <- findOverlaps(broads, rowRanges(filtered.data))
olap
@

At this point, one might imagine that it would be simpler to just collect and analyze counts over the pre-specified regions.
This is a valid strategy but will yield different results.
Consider a promoter containing two separate sites that are identically DB in opposite directions.
Counting reads across the promoter will give equal counts for each condition so changes within the promoter will not be detected.
Similarly, imprecise peak boundaries can lead to loss of detection power due to ``contamination'' by reads in background regions.
Window-based methods may be more robust as each interval of the promoter/peak region is examined separately \cite{lun2014}, avoiding potential problems with peak-calling errors and incorrect/incomplete annotation.

\section{Obtaining per-region $p$-values}
\subsection{Combining window-level $p$-values}
We compute a combined $p$-value for each region based on the $p$-values of the constituent windows \cite{simes1986}.
This tests the joint null hypothesis for each region, i.e., that no enrichment is observed across any of its windows.
Any DB within the region will reject the joint null and yield a low $p$-value for the entire region.
The combined $p$-values are then adjusted using the BH method to control the region-level FDR.
<<>>=
tabcom <- combineTests(merged$ids, results$table)
is.sig.region <- tabcom$FDR <= 0.05
summary(is.sig.region)
@

Summarizing the direction of DB for each cluster requires some care as the direction of DB can differ between constituent windows.
The \Rcode{num.up.tests} and \Rcode{num.down.tests} fields contain the number of windows that change in each direction, and can be used to gauge whether binding increases or decreases across the cluster.
A complex DB event may be present if both \Rcode{num.up.tests} and \Rcode{num.down.tests} are non-zero (i.e., opposing changes within the region) or if the total number of windows is much larger than either number (e.g., interval of constant binding adjacent to the DB interval).

Alternatively, the \Rcode{direction} field specifies which DB direction contributes to the combined $p$-value.
If \Rcode{"up"}, the combined $p$-value for this cluster is driven by $p$-values of windows with positive log-fold changes.
If \Rcode{"down"}, the combined $p$-value is driven by windows with negative log-fold changes.
If \Rcode{"mixed"}, windows with both positive and negative log-fold changes are involved.
This allows the dominant DB in significant clusters to be quickly summarized, as shown below.

<<>>=
table(tabcom$direction[is.sig.region])
@

For pre-specified regions, the \Rfunction{combineOverlaps} function will combine the $p$-values for all windows in each region.
This is a wrapper around \Rfunction{combineTests} for \Rclass{Hits} objects.
It returns a single combined $p$-value (and its BH-adjusted value) for each region.
Regions that do not overlap any windows have values of \Rcode{NA} in all fields for the corresponding rows.

<<>>=
tabbroad <- combineOverlaps(olap, results$table)
head(tabbroad[!is.na(tabbroad$PValue),])
is.sig.gene <- tabbroad$FDR <= 0.05
table(tabbroad$direction[is.sig.gene])
@

\subsection{Based on the most significant window}
\label{sec:mostsig}
Another approach is to use the single window with the strongest DB as a representative of the entire region.
This is useful when a log-fold change is required for each cluster, e.g., for plotting.
(In contrast, taking the average log-fold change across all windows in a region will understate the magnitude of DB, especially if the region includes some non-DB background intervals of the genome.)
Identification of the most significant (i.e., ``best'') window is performed using the \Rfunction{getBestTest} function.
This reports the index of the window with the lowest $p$-value in each cluster as well as the associated statistics.

<<>>=
tab.best <- getBestTest(merged$ids, results$table)
head(tab.best)
@

A Bonferroni correction is applied to the $p$-value of the best window in each region, based on the number of constituent windows in that region.
This is necessary to account for the implicit multiple testing across all windows in each region.
The corrected $p$-value is reported as \Rcode{PValue} in \Robject{tab.best}, and can be used for correction across regions using the BH method to control the region-level FDR.

In addition, it is often useful to report the start location of the best window within each cluster.
This allows users to easily identify a relevant DB subinterval in large regions.
For example, the sequence of the DB subinterval can be extracted for motif discovery.
<<>>= tabcom$rep.start <- start(rowRanges(filtered.data))[tab.best$rep.test] head(tabcom[,c("rep.logFC", "rep.start")]) @ The same approach can be applied to the overlaps between windows and pre-specified regions, using the \Rfunction{getBestOverlaps} wrapper function. This is demonstrated below for the broad gene body example. As with \Rfunction{combineOverlaps}, regions with no windows are assigned \Rcode{NA} in the output table, but these are removed here to show some actual results. <<>>= tab.best.broad <- getBestOverlaps(olap, results$table) tabbroad$rep.start <- start(rowRanges(filtered.data))[tab.best.broad$rep.test] head(tabbroad[!is.na(tabbroad$PValue),c("rep.logFC", "rep.start")]) @ \subsection{Wrapper functions} For convenience, the steps of merging windows and computing statistics are implemented into a single wrapper function. This simply calls \Rcode{mergeWindows} followed by \Rcode{combineTests} and \Rcode{getBestTest}. <<>>= merge.res <- mergeResults(filtered.data, results$table, tol=100, merge.args=list(max.width=5000)) names(merge.res) @ An equivalent wrapper function is also available for handling overlaps to pre-specified regions. This simply calls \Rcode{findOverlaps} followed by \Rcode{combineOverlaps} and \Rcode{getBestOverlaps}. <<>>= broad.res <- overlapResults(filtered.data, regions=broads, tab=results$table) names(broad.res) @ \section{Squeezing out more detection power} \subsection{Integrating results from multiple window sizes} \label{sec:bin_integrate} Repeating the analysis with different window sizes may uncover new DB events at different resolutions. Multiple sets of DB results are integrated by clustering adjacent windows together (even if they differ in size) and combining $p$-values within each of the resulting clusters. The example below uses the H3 acetylation data from Chapter~\ref{chap:norm}. Some filtering is performed to avoid excessive chaining in this demonstration. Corresponding tables of DB results should also be obtained -- for brevity, mock results are used here. <<>>= ac.small <- windowCounts(ac.files, width=150L, spacing=100L, filter=25, param=param) ac.large <- windowCounts(ac.files, width=1000L, spacing=500L, filter=35, param=param) # Mocking up results for demonstration purposes. ns <- nrow(ac.small) mock.small <- data.frame(logFC=rnorm(ns), logCPM=0, PValue=runif(ns)) nl <- nrow(ac.large) mock.large <- data.frame(logFC=rnorm(nl), logCPM=0, PValue=runif(nl)) @ The \Rfunction{mergeResultsList} function merges windows of all sizes into a single set of regions, and computes a combined $p$-value from the associated $p$-values for each region. Equal contributions from each window size are enforced by setting \Rcode{equiweight=TRUE}, which uses a weighted version of Simes' method \cite{benjamini1997}. The weight assigned to each window is inversely proportional to the number of windows of that size in the same cluster. This avoids the situation where, if a cluster contains many small windows, the DB results for the analysis with the small window size contribute most to the combined $p$-value. This is not ideal when results from all window sizes are of equal interest. <<>>= cons.res <- mergeResultsList(list(ac.small, ac.large), tab.list=list(mock.small, mock.large), equiweight=TRUE, tol=1000) cons.res$regions cons.res$combined @ Similarly, the \Rfunction{overlapResultsList} function is used to merge windows of varying size that overlap pre-specified regions. 
<<>>=
cons.broad <- overlapResultsList(list(ac.small, ac.large),
    tab.list=list(mock.small, mock.large), equiweight=TRUE, region=broads)
cons.broad$regions
cons.broad$combined
@

In this manner, DB results from multiple window widths can be gathered together and reported as a single set of regions.
Consolidation is most useful for histone marks and other analyses involving diffuse regions of enrichment.
For such studies, the ideal window size is not known or may not even exist, e.g., if the widths of the enriched regions or DB subintervals are variable.

\subsection{Weighting windows on abundance}
Windows that are more likely to be DB can be upweighted to improve detection power.
For example, in TF ChIP-seq data, the window of highest abundance within each enriched region probably contains the binding site.
It is reasonable to assume that this window will also have the strongest DB.
To improve power, the weight assigned to the most abundant window is increased relative to that of other windows in the same cluster.
This means that the $p$-value of this window will have a greater influence on the final combined $p$-value.

Weights are computed in a manner to minimize conservativeness relative to the optimal unweighted approaches in each possible scenario.
If the strongest DB event is at the most abundant window, the weighted approach will yield a combined $p$-value that is no larger than twice the $p$-value of the most abundant window.
(Here, the optimal approach would be to use the $p$-value of the most abundant window directly as a proxy for the $p$-value of the cluster.)
If the strongest DB event is \emph{not} at the most abundant window, the weighted approach will yield a combined $p$-value that is no larger than twice the combined $p$-value without weighting (which is optimal as all windows have equal probabilities of containing the strongest DB).
All windows have non-zero weights, which ensures that any DB events in the other windows will still be considered when the $p$-values are combined.

The application of this weighting scheme is demonstrated in the example below.
First, the \Rfunction{getBestTest} function with \Rcode{by.pval=FALSE} is used to identify the most abundant window in each cluster.
Window-specific weights are then computed using the \Rfunction{upweightSummit} function, and supplied to \Rcode{combineTests} to use in computing combined $p$-values.

<<>>=
tab.ave <- getBestTest(merged$id, results$table, by.pval=FALSE)
weights <- upweightSummit(merged$id, tab.ave$rep.test)
head(weights)
tabcom.w <- combineTests(merged$id, results$table, weight=weights)
head(tabcom.w)
@

The weighting approach can also be applied to the clusters from the broad gene body example.
This is done by replacing the call to \Rfunction{getBestTest} with one to \Rfunction{getBestOverlaps}, as before.
Similarly, \Rfunction{upweightSummit} can be replaced with \Rfunction{summitOverlaps}.
These wrappers are designed to minimize book-keeping problems when one window overlaps multiple regions.

<<>>=
broad.best <- getBestOverlaps(olap, results$table, by.pval=FALSE)
head(broad.best[!is.na(broad.best$PValue),])
broad.weights <- summitOverlaps(olap, region.best=broad.best$rep.test)
tabbroad.w <- combineOverlaps(olap, results$table, o.weight=broad.weights)
@

\subsection{Filtering after testing but before correction}
Most of the filters in Chapter~\ref{chap:filter} are applied before the statistical analysis.
However, some of the approaches may be too aggressive, e.g., filtering to retain only local maxima or based on pre-defined regions. In such cases, it may be preferable to initially apply one of the other, milder filters. This ensures that sufficient windows are retained for stable normalization and/or EB shrinkage. The aggressive filters can then be applied after the window-level statistics have been calculated, but before clustering into regions and calculation of cluster-level statistics. This is still beneficial as it removes irrelevant windows that would increase the severity of the BH correction. It may also reduce chaining effects during clustering. \section{FDR control in difficult situations} \subsection{Clustering only on DB windows for diffuse marks} The clustering procedures described above rely on independent filtering to remove irrelevant windows. This ensures that the regions of interest are reasonably narrow and can be easily interpreted, which is typically the case for most protein targets, e.g., TFs, narrow histone marks. However, enriched regions may be very large for more diffuse marks. Such regions may be difficult to interpret when only the DB subinterval is of interest. To overcome this, a \textit{post-hoc} analysis can be performed whereby only significant windows are used for clustering. <<>>= postclust <- clusterWindows(rowRanges(filtered.data), results$table, target=0.05, tol=100, max.width=1000) postclust$FDR postclust$region @ This will define and cluster significant windows in a manner that controls the cluster-level FDR at 5\%. The clustering step itself is performed using \Rfunction{mergeWindows} with the specified parameters. Each cluster consists entirely of DB windows and can be directly interpreted as a DB region or a DB subinterval of a larger enriched region. This reduces the pressure on abundance filtering to obtain well-separated regions prior to clustering, e.g., for diffuse marks or in data sets with weak IP signal. That said, users should be aware that calculation of the cluster-level FDR is not entirely rigorous. As such, independent clustering and FDR control via Simes' method should be considered as the default for routine analyses. \subsection{Using the empirical FDR for noisy data} Some analyses involve comparisons of ChIP samples to negative controls. In such cases, any region exhibiting enrichment in the negative control over the ChIP samples must be a false positive. The number of significant regions that change in the ``wrong'' direction can be used as an estimate of the number of false positives at any given $p$-value threshold. Division by the number of discoveries changing in the ``right'' direction yields an estimate of the FDR, i.e., the empirical FDR \cite{zhang2008}. This strategy is implemented in the \Rfunction{empiricalFDR} function, which controls the empirical FDR across clusters based on their combined $p$-values. Its use is demonstrated below, though the output is not meaningful in this situation as genuine changes in binding can be present in both directions. <<>>= empres <- empiricalFDR(merged$id, results$table) @ The empirical FDR is useful for analyses of noisy data with high levels of non-specific binding. This is because the estimate of the number of false positives adapts to the observed number of regions exhibiting enrichment in the negative controls. In contrast, the standard BH method in \Rfunction{combineTests} relies on proper type I error control during hypothesis testing. 
As non-specific binding events tend to be condition-specific, they are indistinguishable from DB events and assigned low $p$-values, resulting in loss of FDR control.
Thus, for noisy data, use of the empirical FDR may be more appropriate to control the proportion of ``experimental'' false positives.
However, calculation of the empirical FDR is not as statistically rigorous as that of the BH method, so users are advised to only apply it when necessary.

\subsection{Detecting complex DB}
Complex DB events involve changes to the shape of the binding profile, not just a scaling increase/decrease to binding intensity.
Such regions may contain multiple sites that change in binding strength in opposite directions, or peaks that change in width or position between conditions.
This often manifests as DB in opposite directions in different subintervals of a region.
Some of these events can be identified using the \Rfunction{mixedTests} function.

<<>>=
tab.mixed <- mixedTests(merged$ids, results$table)
tab.mixed
@

\Rfunction{mixedTests} converts the $p$-value for each window into two one-sided $p$-values.
The one-sided $p$-values in each direction are combined using Simes' method, and the two one-sided combined $p$-values are themselves combined using an intersection-union test \cite{berger1996bioequivalence}.
The resulting $p$-value is only low if a region contains strong DB in both directions.

\Rfunction{combineTests} also computes some statistics for informal detection of complex DB.
For example, the \Rcode{num.up.tests} and \Rcode{num.down.tests} fields can be used to identify regions with changes in both directions.
The \Rcode{direction} field will also label some regions as \Rcode{"mixed"}, though this is not comprehensive.
Indeed, regions labelled as \Rcode{"up"} or \Rcode{"down"} in the \Rcode{direction} field may also correspond to complex DB events, but will not be labelled as \Rcode{"mixed"} if the significance calculations are dominated by windows changing in only one direction.

\subsection{Enforcing a minimal number of DB windows}
On occasion, we may be interested in genomic regions that contain at least a minimal number or proportion of DB windows.
This is motivated by the desire to avoid detecting DB regions where only a small subinterval exhibits a change, instead favoring more systematic changes throughout the region that are easier to interpret.
We can identify these regions using the \Rfunction{minimalTests} function.

<<>>=
tab.min <- minimalTests(merged$ids, results$table, min.sig.n=3, min.sig.prop=0.5)
tab.min
@

\Rfunction{minimalTests} applies a Holm-Bonferroni correction to all windows in the same cluster and picks the $x$\textsuperscript{th}-smallest adjusted $p$-value (where $x$ is defined from \Rcode{min.sig.n} and \Rcode{min.sig.prop}).
This tests the joint null hypothesis that the per-window null hypothesis is false for fewer than $x$ windows in the cluster.
If the $x$\textsuperscript{th}-smallest $p$-value is low, this provides strong evidence against the joint null for that cluster.

As an aside, this function also has some utility outside of ChIP-seq contexts.
For example, we might want to obtain a single $p$-value for a gene set based on the presence of a minimal percentage of differentially expressed genes.
Alternatively, we may be interested in ranking genes in loss-of-function screens based on a minimal number of shRNA/CRISPR guides that exhibit a significant effect.
These problems are equivalent to that of identifying a genomic region with a minimal number of DB windows.
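To sketch the gene set use case, suppose we had a vector \Robject{set.ids} that assigns each gene to a set and a data frame \Robject{gene.results} of per-gene statistics containing a \Rcode{PValue} column.
Both objects are hypothetical and serve only to illustrate the idea.

<<eval=FALSE>>=
# Hypothetical: one p-value per gene set, requiring at least 5 genes
# (and at least 10% of the set) to show evidence of differential expression.
set.stats <- minimalTests(set.ids, gene.results, min.sig.n=5, min.sig.prop=0.1)
head(set.stats)
@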
\section{Further points on data management}
Technically, \Rcode{results\$table} was not required in any of the calls performed in this chapter.
Recall that the per-window significance statistics were stored in the row metadata of \Robject{filtered.data}.
Thus, \Rcode{rowData(filtered.data)} could be used instead of \Rcode{results\$table}.
The former may be more convenient as it avoids the need to keep track of a separate object.

On a similar note, it is possible to store the results for each region in the per-element metadata of the corresponding \Rclass{GRanges} object.
This synchronises the storage and handling of the statistics and coordinates for each region.
Thus, data management throughout the rest of the analysis can be simplified.
This is demonstrated below for \Robject{tabcom} and \Robject{tabbroad}.

<<>>=
mcols(broads) <- tabbroad
mcols(merged$region) <- tabcom
@

<>=
rm(ac.small, ac.large)
gc()
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Post-processing steps}

\begin{combox}
This is where we bring it all together.
We'll need the \Robject{merged} list from the previous chapter -- oh, and the \Robject{org.Mm.eg.db} object that we loaded in Chapter~\ref{chap:filter}.
There's a bit about visualization at the end where we need the \Robject{data} and \Robject{param} objects from Chapter~\ref{chap:count}, along with the original \Robject{bam.files} that we started with in the introduction.
\end{combox}

\section{Adding gene-based annotation}
Annotation can be added to a given set of regions using the \Rfunction{detailRanges} function.
This will identify overlaps between the regions and annotated genomic features such as exons, introns and promoters.
Here, the promoter region of each gene is defined as some interval 3 kbp up- and 1 kbp downstream of the TSS for that gene.
Any exonic features within \Rcode{dist} on the left or right side of each supplied region will also be reported.

<<>>=
library(org.Mm.eg.db)
anno <- detailRanges(merged$region, txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
    orgdb=org.Mm.eg.db, promoter=c(3000, 1000), dist=5000)
head(anno$overlap)
head(anno$left)
head(anno$right)
@

Character vectors of compact string representations are provided to summarize the features overlapped by each supplied region.
Each pattern contains \Rcode{GENE|STRAND|TYPE} to describe the strand and overlapped features of that gene.
Exons are labelled as \Rcode{E}, promoters are \Rcode{P} and introns are \Rcode{I}.
For \Rcode{left} and \Rcode{right}, \Rcode{TYPE} is replaced by \Rcode{DISTANCE}.
This indicates the gap (in base pairs) between the supplied region and the closest non-overlapping exon of the annotated feature.
All of this annotation can be stored in the metadata of the \Rclass{GRanges} object for later use.

<<>>=
merged$region$overlap <- anno$overlap
merged$region$left <- anno$left
merged$region$right <- anno$right
@

While the string representation saves space in the output, it is not easy to work with.
If the annotation needs to be manipulated directly, users can obtain it from the \Rfunction{detailRanges} command by not specifying the regions of interest.
This can then be used for interactive manipulation, e.g., to identify all genes where the promoter contains DB sites.
<<>>= anno.ranges <- detailRanges(txdb=TxDb.Mmusculus.UCSC.mm10.knownGene, orgdb=org.Mm.eg.db) anno.ranges @ \section{Checking bimodality for TF studies} For TF experiments, a simple measure of strand bimodality can be reported as a diagnostic. Given a set of regions, the \Rfunction{checkBimodality} function will return the maximum bimodality score across all base positions in each region. The bimodality score at each base position is defined as the minimum of the ratio of the number of forward- to reverse-stranded reads to the left of that position, and the ratio of the reverse- to forward-stranded reads to the right. A high score is only possible if both ratios are large, i.e., strand bimodality is present. <<>>= spacing <- metadata(data)$spacing expanded <- resize(merged$region, fix="center", width=width(merged$region)+spacing) sbm.score <- checkBimodality(bam.files, expanded, width=frag.len) head(sbm.score) @ In the above code, all regions are expanded by \Rcode{spacing}, i.e., 50 bp. This ensures that the optimal bimodality score can be computed for the centre of the binding site, even if that position is not captured by a window. The \Rcode{width} argument specifies the span with which to count reads for the score calculation. This should be set to the average fragment length. If multiple \Rcode{bam.files} are provided, they will be pooled during counting. For typical TF binding sites, bimodality scores can be considered to be ``high'' if they are larger than 4. This allows users to distinguish between genuine binding sites and high-abundance artifacts such as repeats or read stacks. However, caution is still required as some high scores may be driven by the stochastic distribution of reads. Obviously, the concept of strand bimodality is less relevant for diffuse targets like histone marks. \section{Saving the results to file} It is a simple matter to save the results for later perusal. This is done here in the \file{*.tsv} format where all detail is preserved. Compression is used to reduce the file size. <<>>= ofile <- gzfile("clusters.tsv.gz", open="w") write.table(as.data.frame(merged$region), file=ofile, row.names=FALSE, quote=FALSE, sep="\t") close(ofile) @ Of course, other formats can be used depending on the purpose of the file. For example, significantly DB regions can be exported to BED files through the \Biocpkg{rtracklayer} package for visual inspection with genomic browsers. A transformed FDR is used here for the score field. <<>>= is.sig <- merged$region$FDR <= 0.05 library(rtracklayer) test <- merged$region[is.sig] test$score <- -10*log10(merged$region$FDR[is.sig]) names(test) <- paste0("region", 1:sum(is.sig)) export(test, "clusters.bed") head(read.table("clusters.bed")) @ Alternatively, the \Rclass{GRanges} object can be directly saved to file and reloaded later for direct manipulation in the R environment, e.g., to find overlaps with other regions of interest. <<>>= saveRDS(merged$region, "ranges.rds") @ \section{Simple visualization of genomic coverage} Visualization of the read depth around interesting features is often desired. This is facilitated by the \Rfunction{extractReads} function, which pulls out the reads from the BAM file. The returned \Rclass{GRanges} object can then be used to plot the sequencing coverage or any other statistic of interest. Note that the \Rfunction{extractReads} function also accepts a \Rclass{readParam} object. This ensures that the same reads used in the analysis will be pulled out during visualization. 
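Before extracting any reads, a region of interest must be chosen.
This can be done programmatically -- as a purely illustrative sketch, the code below takes the cluster with the smallest FDR -- although the example that follows simply uses a fixed interval on chromosome 18.

<<>>=
# Illustrative only: coordinates of the cluster with the smallest FDR.
best <- which.min(merged$region$FDR)
merged$region[best]
@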
<<>>= cur.region <- GRanges("chr18", IRanges(77806807, 77807165)) extractReads(bam.files[[1]], cur.region, param=param) @ Here, coverage is visualized as the number of reads covering each base pair in the interval of interest. Specifically, the reads-per-million is shown to allow comparisons between libraries of different size. The plots themselves are constructed using methods from the \Biocpkg{Gviz} package. The blue and red tracks represent the coverage on the forward and reverse strands, respectively. Strong strand bimodality is consistent with a genuine TF binding site. For paired-end data, coverage can be similarly plotted for fragments, i.e., proper read pairs. <>= library(Gviz) collected <- vector("list", length(bam.files)) for (i in seq_along(bam.files)) { reads <- extractReads(bam.files[[i]], cur.region, param=param) adj.total <- data$totals[i]/1e6 pcov <- as(coverage(reads[strand(reads)=="+"])/adj.total, "GRanges") ncov <- as(coverage(reads[strand(reads)=="-"])/adj.total, "GRanges") ptrack <- DataTrack(pcov, type="histogram", lwd=0, fill=rgb(0,0,1,.4), ylim=c(0,1.1), name=tf.data$Name[i], col.axis="black", col.title="black") ntrack <- DataTrack(ncov, type="histogram", lwd=0, fill=rgb(1,0,0,.4), ylim=c(0,1.1)) collected[[i]] <- OverlayTrack(trackList=list(ptrack,ntrack)) } gax <- GenomeAxisTrack(col="black") plotTracks(c(gax, collected), from=start(cur.region), to=end(cur.region)) @ <>= rm(anno) gc() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \chapter{Epilogue} \begin{combox} Congratulations on getting to the end. Here's a poem for your efforts. \begin{quote} There once was a man named Will \\ Who never ate less than his fill. \\ He ate meat and bread \\ Until he was fed \\ But died when he saw the bill. \end{quote} \end{combox} \section{Session information} <<>>= sessionInfo() @ \bibliography{ref_ug} \end{document}