%\VignetteIndexEntry{wavClusteR} %\VignettePackage{wavClusteR} \documentclass{article} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{hyperref,graphicx} \usepackage{color} %\SweaveOpts{keep.source=TRUE,eps=TRUE,width=8,height=8.5} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\ext}[1]{\texttt{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\disc}[1]{\color{red} #1 \color{black}} \renewcommand{\floatpagefraction}{0.7} \setkeys{Gin}{width=0.8\textwidth} \author{Federico Comoglio and Cem Sievers\\ Department of Biosystems Science and Engineering\\ ETH Z\"urich, Basel, Switzerland\\ \texttt{federico.comoglio@bsse.ethz.ch}\\ \texttt{cem.sievers@bsse.ethz.ch}} \title{\textsf{\textbf{The \Rpackage{wavClusteR} package \\{\large (version 2.0)}}}} \begin{document} \maketitle \begin{abstract} Different recently developed next-generation sequencing based methods (e.g. PAR-CLIP or Bisulphite sequencing) specifically induce nucleotide substitutions within the short reads with respect to the reference genome. This package provides functions for the analysis of the data obtained by such methods - with a major focus on PAR-CLIP - and exploits the experimentally induced substitutions in order to identify high confidence signals, such as RNA-binding sites, in the data. The workflow consists of two steps; (i) the estimation of a non-parametri two-component mixture model, identifying substitution frequencies most affected by the experimental procedure; (ii) a binding sites (clusters) identification algorithm which resolves clusters at high resolution. Key functions support multicore computing, if available. For a detailed description of the method see \cite{cemo, cf}. \end{abstract} \tableofcontents %-------------------------------------------------- \section{Preparing the input} %-------------------------------------------------- Starting with a fastq file, a commonly used short read format, the short reads should be aligned to the reference genome using a short read aligner, e.g. Bowtie \cite{Langmead:2009p2762}. The output file (e.g. in SAM format) should then be converted to BAM format (e.g. using \ext{samtools view}, \url{http://samtools.sourceforge.net/samtools.shtml}) and sorted (e.g. using \ext{samtools sort}). Since \Rpackage{wavClusteR} requires an indexed BAM file containing the short read alignments, an index file (\texttt{.bai}) should be generated from the sorted BAM file (see, e.g. \ext{samtools index}). \newline The following code provides an example of the steps described above using the samtools toolkit. The first line is pseudo code. Please replace it with the aligner specific syntax. \begin{verbatim} ALIGN: sample.fastq -> sample.sam CONVERT: samtools view -b -S sample.sam -o sample.bam SORT: samtools sort sample.bam sample_sorted INDEXING: samtools index sample_sorted.bam \end{verbatim} %%-------------------------------------------------- \subsection{Example dataset} %%-------------------------------------------------- In this vignette, we consider as an example a chunk of a published Argonaute 2 (AGO2) PAR-CLIP data set obtained from human HEK293 cells \cite{Kishore:2011p4559}. This chunk contains reads mapping to chromosome X in the interval: 23996166 - 24023263. This data set is provided in \Rpackage{wavClusteR} 2.0. % %%-------------------------------------------------- \subsection{Importing short reads into the R session} %%-------------------------------------------------- % An indexed BAM file can be loaded into the R session using the \Rfunction{readSortedBam} function. This calls \Rfunction{scanBam} from \Rpackage{Rsamtools}~\cite{rsamtools} and extracts the mismatch MD field and the read sequence from the BAM file, returning a \Robject{GRanges} object. % <>= library(wavClusteR) filename <- system.file( "extdata", "example.bam", package = "wavClusteR" ) Bam <- readSortedBam(filename = filename) Bam @ % %%-------------------------------------------------- \subsection{Extracting the informative positions used for model parameter estimation} %%-------------------------------------------------- % To estimate the mixture model both mixing coefficients and density functions (components) have to be estimated from the data. To this purpose, genome-wide substitutions are first identified and filtered according to a minimum coverage value at substitutions. The minimum coverage, which should be chosen to account for variables such as sequencing depth, provides a way to select the positions used for parameter estimation. Hence, it can be used to tune the stringency of the analysis. There is no obvious theoretical justification to find the optimal minimum coverage. However since relative substitution frequencies have to be computed for parameter estimation, the minimum coverage will influence the variance of the estimate. The lowest minimum coverage used for our analysis was 10.\\ The \Rfunction{getAllSub} function identifies the genomic positions that show at least one substitution and satisfy the minimum coverage requirement. It returns a \Robject{GRanges} object specifying the genomic position, the strand, the observed substitution (e.g. "TC" implies a T in the reference genome and a C in the read), the strand-specific coverage and the number of observed substitutions at the specific position. % <>= countTable <- getAllSub( Bam, minCov = 10 ) head( countTable ) @ % Once all substitutions are computed, the corresponding substitution profile can be plotted with \Rfunction{plotSubstitutions}. This function returns a barplot showing the total number of genomic positions that exhibit a given type of substitution and highlights the substitution type that is expected to be generated by the experimental procedure. In addition, the percentage of substitution of this type with respect to all identified substitutions is indicated. This plot conveys information about the quality of the data and can be used to compare different data sets generated by the same experimental conditions. % \begin{center} \setkeys{Gin}{width=0.6\textwidth} <>= plotSubstitutions( countTable, highlight = "TC" ) @ \end{center} % %%-------------------------------------------------- \section{Estimating the non-parametric mixture model} %%-------------------------------------------------- The genomic positions identified above (see \Rfunction{getAllSub}) are used to estimate the mixture model densities and mixing coefficients. The estimation is performed by the function \Rfunction{fitMixtureModel} for the substitution of interest. The function returns a list containing: \begin{itemize} \item the two mixing coefficients (\Robject{l1} and \Robject{l2}) \item the two individual components (\Robject{p1} and \Robject{p2}). \item the full density (\Robject{p}) \end{itemize} Given an observed relative substitution frequency, the model is used to compute the posterior probability that it was obtained by experimental induction.\\ Given the \Robject{GRanges} object \Robject{countTable} the model can be estimated as follows: <>= model <- fitMixtureModel(countTable, substitution = "TC") #not run @ The small size of the example dataset would not allow a reliable model estimation. Therefore, the mixture model for the entire AGO2 dataset has been precomputed and is provided in \Rpackage{wavClusteR} for convenience. The model can be loaded as % <>= data(model) str(model) @ % Once the mixture model is estimated, the model fit can be inspected and the RSF interval most likely to be affected by the experimental procedure can be identified using the \Rfunction{getExpInterval} function. Besides then high-confidence RSF interval, two plots are returned. The first one illustrates the estimated densities $p, p_1$ and $p_2$ and ensuing log odds $o$ $$ o=\rm{log}\frac{p(k=2|x)}{p(k=1|x)} $$ whereas the second plot shows the resulting posterior class probability, i.e. the probability that a given relative substitution frequency (RSF, horizontal axis) has been experimentally induced. The area under the curve corresponding to the returned RSF interval is colored, and the RSF interval indicated. By default, \Rfunction{getExpInterval} returns the RSF interval according to the Bayes classifier, i.e. RSF values having probability larger than or equal to 0.5. \begin{center} \setkeys{Gin}{width=0.85\textwidth} <>= (support <- getExpInterval( model, bayes = TRUE ) ) @ \end{center} However, the user can modify the stringency of the analysis and determine a custom RSF interval in two ways: \begin{enumerate} \item By setting the \Robject{rightProb} and \Robject{leftProb} parameters to a desired posterior probability cutoff, e.g. \setkeys{Gin}{width=0.85\textwidth} <>= (support <- getExpInterval( model, bayes = FALSE, leftProb = 0.9, rightProb = 0.9 ) ) @ \item By inspecting the posterior class probability density and directly enter the RSF interval boundaries when calling high-confidence substitutions (see call to \Rfunction{getHighConfSub} function in the next section) \end{enumerate} Finally, the model can be used to produce further diagnostic plots. Particularly, besides the barplot returned by \Rfunction{plotSubstitutions} (see previous section), the total number of reads carrying a given substitution and an RSF-based partitioning of genomic positions with substitutions is returned by % \begin{center} \setkeys{Gin}{width=1\textwidth} <>= plotSubstitutions( countTable, highlight = "TC", model ) @ \end{center} %%-------------------------------------------------- \section{Identifying protein binding sites (clusters)} %%-------------------------------------------------- %%-------------------------------------------------- \subsection{Filtering high confidence signal sites} %%-------------------------------------------------- % High-confidence transitions are identified by the \Rfunction{getHighConfSub} function. The RSF interval returned by \Rfunction{getExpInterval} (see previous section) can either directly enter a \Rfunction{getHighConfSub} function call as <>= highConfSub <- getHighConfSub( countTable, support = support, substitution = "TC" ) @ or, alternatively, the interval can be specified by the user as <>= highConfSub <- getHighConfSub( countTable, supportStart = 0.2, supportEnd = 0.7, substitution = "TC" ) @ The function returns a \Robject{GRanges} object with genomic position, strand, strand-specific coverage (\Robject{coverage}), occurence (\Robject{count}), and relative substitution frequency (\Robject{rsf}) for each identified high-confidence substitution. <>= head( highConfSub ) @ %%-------------------------------------------------- \subsection{Identifying cluster boundaries and computing cluster statistics} %%-------------------------------------------------- Binding sites (referred to as clusters) can be identified by the function \Rfunction{getClusters}. This function takes as input high-confidence substitution sites and the overall coverage across the genome, which can be computed using \Rpackage{GenomicRanges} as % <>= coverage <- coverage( Bam ) coverage$chrX @ % Cluster boundaries in \Rpackage{wavClusteR} 2.0 are resolved by default using the Mini-Rank Norm (MRN) algorithm. Briefly, this algorithm finds an optimal cluster boundary for each high-confidence substitution by solving an optimization problem that integrates prior knowledge on the geometry of PAR-CLIP clusters. The algorithm first considers differences in the coverage function. Then, it removes background fluctuations via learning of a local background threshold or hard thresholding (default). This choice is controlled by the \Robject{threshold} parameter. The MRN algorithm proceeds by evaluating all ensuing possible cluster boundaries and computes ranking of boundary signals and cluster widths, which are finally used to find the optimum cluster boundary for each high-confidence substitution. Please see~\cite{cf} for further details. Please notice that the MRN algorithm is strongly recommended as computationally faster (up to 10x) and more sensitive than the previously adopted cluster identification algorithm based on continuous wavelet transform (CWT) of the coverage function (see~\cite{cemo}). The latter computes the continuous wavelet transform (CWT) of the coverage function on a 1 kb window centered at a high-confidence substitution site. The minimum required signal-to-noise ratio can be specified with the parameter \Robject{snr} (default \Robject{snr}=3). Since multiple high-confidence substitution sites can localize in close proximity, the step size (controlled by the\Robject{step} parameter) can be set to values larger than 1 (default), such that the CWT is computed only if the subsequent high-confidence substitution is located further than the specified value from the previously considered position. Starting from the peak positions the cluster boundaries are then expanded. This algorithm is still maintained in \Rpackage{wavClusteR} 2.0 and can be called by setting \Robject{method = "cwt"} in the \Rfunction{getClusters} function.\\ Clusters can be computed by default as <>= clusters <- getClusters( highConfSub = highConfSub, coverage = coverage, sortedBam = Bam, method = "mrn", threshold = 1, cores = 1 ) clusters @ by using the MRN algorithm. Two options are available here: \begin{enumerate} \item Hard thresholding, based on a globally applied threshold determining the extent of noise in the coverage function. Empirically, 10\% of the required \Robject{minCov} at high-confidence substitutions worked well in practice on all tested datasets (e.g. a value of 1 in this example where \Robject{minCov = 10}. Alternatively, 10\% of the mode of the coverage distribution at high-confidence substitutions can represent a valuable choice. \item Local thresholding, based on a global estimation of background levels via a Gaussian mixture model. Omitting the \Robject{threshold} parameter in the call to \Rfunction{getClusters} enables local thresholding, e.g. <>= clusters <- getClusters( highConfSub = highConfSub, coverage = coverage, sortedBam = Bam, method = "mrn", cores = 1 ) clusters @ \end{enumerate} Once clusters are identified, the reported genomic regions can be merged in a strand-specific manner and statistics for each resulting cluster, which we call a wavCluster, can be computed using the \Rfunction{filterClusters} function, which takes as input the following elements: \begin{itemize} \item The identified clusters \item The high-confidence substitution sites \item The genome-wide coverage \item The mixture model \item A \Rpackage{BSgenome}~\cite{BSgenome} object containing the correct reference sequence \item The reference base expected to be converted by the experimental procedure \item The minimum required width of a wavCluster \end{itemize} The function can be called as follows: <>= require(BSgenome.Hsapiens.UCSC.hg19) wavclusters <- filterClusters( clusters = clusters, highConfSub = highConfSub, coverage = coverage, model = model, genome = Hsapiens, refBase = "T", minWidth = 12) @ <>= wavclusters @ The call returns a \Robject{GRanges} object where for each wavCluster: \begin{itemize} \item the number of high-confidence transitions (\Robject{Ntransitions}) \item the the mean coverage (\Robject{MeanCov}) \item the number of bases in the reference genome of the same type as the specified \Robject{refBase} (\Robject{NbasesInRef}) \item the estimated cross-linking efficiency (\Robject{CrossLinkEff}), i.e. the ratio between \Robject{Ntransitions} and \Robject{NbasesInRef} \item the genomic sequence (\Robject{Sequence}) \item the sum of the log odds (\Robject{SumLogOdds}), contributed by each high-confidence transition within the cluster \item the relative log odds (\Robject{RelLogOdds}), i.e. the ratio between \Robject{SumLogOdds} and \Robject{Ntransitions} \end{itemize} is returned. Notice that the relative log odds can be used to rank clusters according to statistical confidence. %%-------------------------------------------------- \section{Output post-processing} %%-------------------------------------------------- \Rpackage{wavClusteR} 2.0 contains a variety of functions (summarized in Table~1) that help the user in the post-processing of the identified wavClusters. \begin{table}[htdp] \begin{center} \begin{tabular}{|p{7cm}|p{3.8cm}|p{3.6cm}|} \hline \bf Task & \bf Function & \bf Output format\\ \hline Export all identified substitutions or high-confidence substitutions & \Rfunction{exportHighConfSub} & BED (for UCSC~\cite{UCSC})\\ Export clusters & \Rfunction{exportClusters} & BED (for UCSC~\cite{UCSC})\\ Export coverage function & \Rfunction{exportCoverage} & BigWig (for UCSC~\cite{UCSC})\\ \hline Visualize the size distribution of wavClusters & \Rfunction{plotSizeDistribution} & histogram\\ \hline Annotate clusters with respect to genomic features (e.g. CDS, introns, 3'-UTRs, 5'-UTRs) in a strand-specific manner based on the \Rpackage{GenomicFeatures} package & \Rfunction{annotateClusters} & dot chart, vector\\ \hline Compute metagene profiles of wavClusters, where the density of wavClusters is represented as a function of a reference genomic coordinates & \Rfunction{getMetaGene}& line plot, vector\\ \hline Compute metaTSS profiles based on all aligned reads in the input BAM file & \Rfunction{getMetaTSS}& line plot, vector \\ \hline Visualize wavClusteR statistics and meta data to learn pairwise relationships between variables & \Rfunction{plotStatistics} & pairs plot\\ \hline \end{tabular} \end{center} \caption{Summary of post-processing functions in \Rpackage{wavClusteR} 2.0.} \label{default} \end{table}% %%-------------------------------------------------- \subsection{Exporting substitutions, wavClusters and coverage function} %%-------------------------------------------------- The identified high-confidence substitutions can be exported as <>= exportHighConfSub( highConfSub = highConfSub, filename = "hcTC.bed", trackname = "hcTC", description = "hcTC" ) @ where \Robject{trackname} and \Robject{description} correspond to the very same attributes in the UCSC BED file format specification and define the name of the BED track and its description, respectively. Notice that by replacing \Robject{highConfSub} with another set of substitutions (e.g. all identified substitutions of a given type), those can be exported and visualized using the same function call.\\ % Similarly, wavClusters can be exported as <>= exportClusters( clusters = wavclusters, filename = "wavClusters.bed", trackname = "wavClusters", description = "wavClusters" ) @ % and the coverage function can be exported as <>= exportCoverage( coverage = coverage, filename = "coverage.bigWig" ) @ %%-------------------------------------------------- \subsection{wavClusters annotation} %%-------------------------------------------------- The identified wavClusters can be annotated with respect to known genomic features using the \Rfunction{annotateClusters} function, which generates a strand-specific dot chart representing wavClusters annotation. The function takes as an input the wavClusters and a \Robject{transcriptDB} object containing all transcript annotations. The latter can either be generated a priori using the \Rfunction{makeTranscriptDbFromUCSC} (from the \Rpackage{GenomicFeatures} package) or is automatically fetch and built by \Rfunction{annotateClusters} if not provided. If multiple calls to \Rfunction{annotateClusters} are planned, the recommended solution is to build the object once as <>= txDB <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "ensGene") @ Then, the \Rfunction{annotateClusters} can be called as follows <>= annotateClusters( clusters = wavclusters, txDB = txDB, plot = TRUE, verbose = TRUE) @ \setkeys{Gin}{width=0.6\textwidth} \centerline{\includegraphics{AGO2annotation.pdf}} Four dot charts are returned by the function. The first plot (top) represents the percentage of clusters mapping to different transcript features localized on the same strand as the identified clusters. Please note that the dot chart above was produced by providing wavClusters identified on the entire AGO2 dataset. Multiple hits, i.e. wavClusters that overlap with more than one genomic feature, are reported as "multiple", whereas wavClusters that map outside of the considered features are labeled as "other". The latter are then annotated with respect to features on the antisense strand and the results are represented in the second plot. The third plot represents the relative sequence length of different compartments relative to the total transcriptome length of the organism being considered (clearly, this plot does not depend on the PAR-CLIP data). These ratios are then used to normalize the counts on the sense strand in order to produce the fourth (bottom) plot, which can be used to show enrichments or depletion of clusters in the different functional compartments. %%-------------------------------------------------- \subsection{Computing metagene profiles} %%-------------------------------------------------- A metagene profile, namely a graphical representation of the density of wavClusters as a function of a binning of genomic coordinates across all annotated genes, can be obtained with a call to the \Rfunction{getMetaGene} function as follows: <>= getMetaGene( clusters = wavclusters, txDB = txDB, upstream = 1e3, downstream = 1e3, nBins = 40, nBinsUD = 10, minLength = 1, plot = TRUE, verbose = TRUE ) @ \setkeys{Gin}{width=0.8\textwidth} \centerline{\includegraphics{AGO2metagene.pdf}} Please note that the line plot above was produced by providing wavClusters identified on the entire AGO2 dataset. In the function call above, genes were divided in 40 bins (\Robject{nBins}) and an upstream/downstream region spanning 1kb was considered (width controlled by \Robject{upstream} and \Robject{downstream} parameters). This, in turn, was subdivided in 10 bins (\Robject{nBinsUD}). No restriction on gene length was applied (\Robject{minLength}). The numeric vector of length \Robject{nBins} + 2\Robject{nBinsUD} with normalized counts is returned by the function and therefore can be used, for example, to compare the distribution of wavClusters across several PAR-CLIP samples.\\ In addition to metagene profiles, metaTSS profiles based on all aligned reads in the input BAM file can be generated using the \Rfunction{getMetaTSS} function. A default function call is as follows: % <>= getMetaTSS( sortedBam = Bam, txDB = txDB, upstream = 1e3, downstream = 1e3, nBins = 40, unique = FALSE, plot = TRUE, verbose = TRUE ) @ % where the \Robject{upstream} and \Robject{downstream} parameters control the width of the window centered on the transcription start site (TSS) to be considered, \Robject{nBins} determines the resolution of the profile. If \Robject{unique} is enabled, then overlapping TSSs are discarded. The numeric vector of length \Robject{nBins} with normalized read counts is returned by the function and therefore can be used, for example, to compare several PAR-CLIP samples.\\ %%-------------------------------------------------- \subsection{Visualizing the size distribution of wavClusters} %%-------------------------------------------------- The size distribution of wavClusters is visualized as a histogram and returned by the following function call \begin{center} \setkeys{Gin}{width=0.6\textwidth} <>= plotSizeDistribution( clusters = wavclusters, col = "skyblue2" ) @ \end{center} where additional parameters of the \Rfunction{hist} function can be passed in the function call. Finally, if \Robject{showCov=TRUE}, a scatter plot of average cluster coverage vs. cluster length is returned \begin{center} \setkeys{Gin}{width=0.6\textwidth} <>= plotSizeDistribution( clusters = wavclusters, showCov = TRUE, col = "skyblue2" ) @ \end{center} %%%-------------------------------------------------- %\subsection{Visualizing wavClusteR statistics and meta data} %%%-------------------------------------------------- % %The size distribution of wavClusters is visualized as a histogram and returned by the following function call %\begin{center} %\setkeys{Gin}{width=0.8\textwidth} %<>= %plotStatistics( clusters = wavclusters, % corMethod = "spearman", % lower = panel.smooth ) %@ %\end{center} \section{Session Info} <>= sessionInfo() @ % \begin{thebibliography}{} \bibitem{cemo} Sievers, C., Schlumpf, T., Sawarkar, R., Comoglio, F. \& Paro, R. (2012) Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data. \textit{Nucleic Acids Res} {\bf 40(2)}: e160 \bibitem{cf} Comoglio, F., Sievers, C. \& Paro, R. wavClusteR: an R package for PAR-CLIP data analysis, submitted \bibitem{Langmead:2009p2762} Langmead,B., Trapnell,C., Pop,M. \& Salzberg,S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. \textit{Genome Biol}, \textbf{10}, R25 \bibitem{Kishore:2011p4559} Kishore, S. et al. (2011) A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. \textit{Nature Methods}, \textbf{8(7)}, 559-564 \bibitem{rsamtools} Morgan, M. \& Pages, H. Rsamtools: Binary alignment (BAM), variant call (BCF), or tabix file import, http://bioconductor.org/packages/release/bioc/html/Rsamtools. html \bibitem{BSgenome} Pages, H., BSgenome: Infrastructure for Biostrings-based genome data packages \bibitem{UCSC} Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. (2002) The human genome browser at UCSC. \textit{Genome Res.} \textbf{12(6)}, 996-1006. \end{thebibliography} \end{document}