---
title: "wavClusteR: a workflow for PAR-CLIP data analysis"
author: "Federico Comoglio and Cem Sievers"
date: "`r Sys.Date()`"
package: "`r pkg_ver('wavClusteR')`"
vignette: >
    %\VignetteIndexEntry{wavClusteR: a workflow for PAR-CLIP data analysis}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document
---

#Abstract

A number of recently developed next-generation sequencing based methods (e.g. PAR-CLIP, Bisulphite sequencing, RRBS) specifically induce nucleotide substitutions within the short reads with respect to the reference genome. **wavClusteR** provides functions for the analysis of the data obtained by such methods with a major focus on PAR-CLIP. The package leverages on experimentally induced substitutions to identify high confidence signals (e.g. RNA-binding sites) in the data. A **wavClusteR** workflow consists of two steps:

1. The estimation of a non-parametric two-component mixture model to identify substitution frequencies that are likely to be affected by the experimental procedure
2. The identification of binding sites (clusters).

The package supports multicore computing. For a detailed description of the method please refer to [Sievers et al., 2012](http://www.ncbi.nlm.nih.gov/pubmed/22844102); [Comoglio et al., 2015](http://www.ncbi.nlm.nih.gov/pubmed/25638391).

#Preparing the input

**wavClusteR** expects a sorted and indexed BAM file as input. A streamlined workflow to generate the required input from a fastq file is outlined below (line 1 is pseudo code, replace it with the aligner specific syntax).

```{r eval=FALSE}
	#ALIGN: 
		sample.fastq -> sample.sam
	#CONVERT: 
		samtools view -b -S sample.sam -o sample.bam
	#SORT: 
		samtools sort sample.bam sample_sorted
	#INDEXING: 
		samtools index sample_sorted.bam
```

1. Short read alignment to the reference genome using relaxed alignment parameters. Experimentally-induced transitions will otherwise impede alignment of informative reads. Currently, **wavClusteR** has been tested with bowtie/Bowtie2 [Langmead et al., 2009](http://www.ncbi.nlm.nih.gov/pubmed/19261174); [Langmead and Salzberg, 2012](http://www.ncbi.nlm.nih.gov/pubmed/22388286)
2. Conversion of the alignment file from SAM to BAM format, e.g. using ```samtools view``` from [SAMtools](http://samtools.sourceforge.net)
3. Sorting of the BAM file, e.g. using ```samtools sort```
4. Indexing of the sorted BAM file, e.g. with ```samtools index```.

##Example dataset

**wavClusteR** provides a chunk of a human Argonaute 2 (AGO2) PAR-CLIP data set from [Kishore et al., 2011](http://www.ncbi.nlm.nih.gov/pubmed/?term=kishore+2011+argonaute). Reads in the chunk map to the genomic interval chrX:23996166-24023263. This data set is used below to illustrate a workflow for PAR-CLIP data analysis.

#A workflow for PAR-CLIP data analysis

##Reading sorted BAM files

A sorted and indexed BAM file can be loaded into the R session with **readSortedBam**. This function extracts aligned reads, sequences and the mismatch MD field, and returns a GRanges object. 

```{r eval=TRUE}
library(wavClusteR)
filename <- system.file( "extdata", "example.bam", package = "wavClusteR" )
Bam <- readSortedBam(filename = filename)
Bam
```

##Extracting informative genomic positions

Prior to model fitting, genome-wide substitutions need to be identified and filtered based on a coverage threshold. The **getAllSub** function identifies all genomic positions exhibiting at least one substitution and coverage above this threshold.

```{r eval=TRUE}
countTable <- getAllSub( Bam, minCov = 10 )
head( countTable )
```

The function returns a GRanges object specifying genomic position, strand, substitution type (e.g. "TC": T in the reference genome; C in the read), strand-specific coverage and number of observed substitutions at the position.

###How to choose the coverage threshold?

The coverage threshold ```minCov``` defines the genomic positions to be used for parameter estimation, thus providing a means to tune the stringency of the analysis. Currently, **wavClusteR** does not allow to learn this threshold from the data. However, since the model is based on relative substitution frequencies, the value of ```minCov``` will influence the variance of the estimated parameters. Therefore, values smaller than default (```minCov=10```) are not recommended. 

##Inspecting the substitutions profile (diagnostic I)

Once all substitutions are computed, a diagnostic substitution profile can be plotted with **plotSubstitutions**. 

```{r fig.width=5, fig.height=5, fig.align='center', eval=TRUE}
plotSubstitutions( countTable, highlight = "TC" )
```

The function returns a barplot showing the total number of genomic positions that exhibit a given type of substitution and highlights the substitution type that is expected to be generated by the experimental procedure. The percentage of substitution of this type is also shown. This plot can be readily used to assess the quality of the sequenced libraries and can be used to compare different data sets generated under the same experimental condition.

##Estimating the model

**wavClusteR** uses the identified genomic positions to learn a non-parametric mixture model discriminating PAR-CLIP-specific from extrinsic transitions. Model parameters are estimated by **fitMixtureModel**.

```{r eval=FALSE}
model <- fitMixtureModel(countTable, substitution = "TC")
```

The function returns a list of:

+ the two mixing coefficients (**l1** and **l2**)
+ the two model components (**p1** and **p2**)
+ the full density (**p**)

As the small size of our example data set does not to estimate the model reliably, the mixture model for the entire AGO2 dataset has been precomputed and is provided by the package.

```{r eval=TRUE}
data(model)
str(model)
```

##Inspecting the model fit (diagnostic II)

The model fit can be inspected using **getExpInterval**. 

```{r fig.width=7, fig.height=4.5, fig.align='center', eval=TRUE}
(support <- getExpInterval( model, bayes = TRUE ) )
```

The function returns two diagnostic plots. The first plot illustrates the estimated densities \\(p\\), \\(p_1\\) and \\(p_2\\), and log odds

\\[
	o=\rm{log}\frac{p(k=2|x)}{p(k=1|x)}.
\\]

The second plot shows the resulting posterior class probability, i.e. the probability that a given relative substitution frequency (RSF, horizontal axis) has been induced by PAR-CLIP. The area under the curve corresponding to the returned RSF interval is colored, and the RSF interval indicated. By default, **getExpInterval** returns the RSF interval according to the Bayes classifier, i.e. a posterior probability cutoff of 0.5. However, the user can specify a custom RSF interval:

1. By means of the **rightProb** and **leftProb** parameters, e.g.

	```{r fig.width=7, fig.height=4.5, fig.align='center', eval=TRUE}
	(support <- getExpInterval( model, bayes = FALSE, leftProb = 0.9, rightProb = 	0.9 ) )
	```

2. By inspecting the posterior class probability density and passing the RSF interval boundaries when calling high-confidence substitutions (see point 6). 

Finally, the model can be used to produce further diagnostic plots. Particularly,  the total number of reads exhibitng a given substitution and an RSF-based partitioning of genomic positions with substitutions can be obtained by

```{r fig.width=7, fig.height=5, fig.align='center', eval=TRUE}
plotSubstitutions( countTable, highlight = "TC", model )
```

##Selecting high-confidence PAR-CLIP induced transitions

The RSF support is used to filter all observed transitions to define a set of high-confidence, PAR-CLIP induced transitions. These are identified by **getHighConfSub**. The function returns a GRanges object with genomic position, strand, strand-specific coverage, occurence (count), and relative substitution frequency (rsf) for each high-confidence substitution. In a call to **getHighConfSub**, the RSF interval returned by **getExpInterval** can be supplied as ```support``` argument directly

```{r eval=TRUE}
highConfSub <- getHighConfSub( countTable, 
                               support = support, 
                               substitution = "TC" )
head( highConfSub )                               
```

or, alternatively, by specifying ```supportStart``` and ```supportEnd```, which define the range of RSF of interest.

```{r eval=FALSE}
highConfSub <- getHighConfSub( countTable, 
                               supportStart = 0.2, 
                               supportEnd = 0.7, 
                               substitution = "TC" )
head( highConfSub )
```

##Identifying protein binding sites (clusters) 

Binding sites (clusters) are identified by **getClusters**. The function takes high-confidence substitution sites and the coverage function

```{r eval=TRUE}
coverage <- coverage( Bam )
coverage$chrX
```

as an input. From package version 2.0, cluster boundaries are resolved using the Mini-Rank Norm (MRN) [Comoglio et al., 2015](http://www.ncbi.nlm.nih.gov/pubmed/25638391), which is up to 10x faster and more sensitive than the previously adopted  algorithm based on continuous wavelet transform of the coverage function [Sievers et al., 2012](http://www.ncbi.nlm.nih.gov/pubmed/22844102). Briefly, the MRN algorithm finds an optimal cluster boundary for each high-confidence substitution by solving an optimization problem that integrates prior knowledge on the geometry of PAR-CLIP clusters. Two options are available:

1. Hard thresholding, i.e. the coverage function is denoised using a global threshold. Empirically, ```minCov=1``` worked well on all tested datasets for which ```minCov = 10```. Alternatively, 10% of the mode of the coverage distribution at high-confidence substitutions represents a possible choice.

	```{r eval=TRUE}
	clusters <- getClusters( highConfSub = highConfSub,
                         		coverage = coverage,
                         		sortedBam = Bam,
                         		threshold = 1,
                         		cores = 1 )
	clusters
	```

2. Local thresholding, based on a global estimation of background levels via a Gaussian mixture model. Omitting the threshold parameter in the call to getClusters enables local thresholding

	```{r eval=TRUE}
	clusters <- getClusters( highConfSub = highConfSub,
                         		coverage = coverage,
                         		sortedBam = Bam,
                         		cores = 1 )
	clusters
	```

##Merging clusters

Once the clusters are identified, the corresponding genomic regions can be merged in a strand-specific manner. Statistics for each merged cluster (a wavCluster), can be computed using **filterClusters**. 

```{r eval=TRUE}
require(BSgenome.Hsapiens.UCSC.hg19)

wavclusters <- filterClusters( clusters = clusters, 
                   highConfSub = highConfSub,
                   coverage = coverage, 
                   model = model, 
                   genome = Hsapiens, 
                   refBase = "T", 
                   minWidth = 12)

wavclusters
```

The function takes as input the following elements:

+ clusters
+ high-confidence substitution sites
+ genome-wide coverage
+ mixture model
+ a BSgenome object containing the correct reference sequence
+ reference base expected to be converted by the experimental procedure
+ minimum required width of a wavCluster

and it returns a GRanges object with the following metadata:

+ number of high-confidence transitions (**Ntransitions**)
+ mean coverage (**MeanCov**)
+ number of bases in the reference genome of the same type as the specified ```refBase``` (**NbasesInRef**)
+ the estimated cross-linking efficiency (**CrossLinkEff**), i.e. the ratio between **Ntransitions** and **NbasesInRef**
+ the genomic sequence (**Sequence**)
+ the sum of the log odds (**SumLogOdds**), contributed by each high-confidence transition within the cluster
+ the relative log odds (**RelLogOdds**), i.e. the ratio between **SumLogOdds** and **Ntransitions**

The relative log odds can be used to rank clusters according to statistical confidence.

##Post-processing of the binding sites 

**wavClusteR** provides a number of functions (summarized in the table below) for post-processing of the identified binding sites.

Task  | Function | Output format |
------------- | ------------- | ------------- |
Export all identified substitutions or high-confidence substitutions | exportHighConfSub | BED 
Export clusters | exportClusters | BED 
Export coverage function | exportCoverage | BigWig
Visualize the size distribution of wavClusters | plotSizeDistribution | histogram
Annotate clusters with respect to genomic features (e.g. CDS, introns, 3'-UTRs, 5'-UTRs) in a strand-specific manner | annotateClusters | dot chart, vector
Compute metagene profiles of wavClusters, where the density of wavClusters is represented as a function of a reference genomic coordinates | getMetaGene | line plot, vector
Compute metaTSS profiles based on all aligned reads in the input BAM file | \Rfunction{getMetaTSS} | line plot, vector
Visualize wavClusteR statistics and meta data to learn pairwise relationships between variables | plotStatistics | pairs plot

###Exporting substitutions, wavClusters and coverage function 

High-confidence substitutions can be exported with

```{r eval=FALSE}
exportHighConfSub( highConfSub = highConfSub, 
                   filename = "hcTC.bed", 
                   trackname = "hcTC", 
                   description = "hcTC" )
```

where ```trackname``` and ```description``` set the corresponding attributes in the UCSC BED file. By replacing ```highConfSub``` with another set of substitutions (e.g. all identified substitutions of a given type), those can be exported likewise.

wavClusters can be exported with

```{r eval=FALSE}
exportClusters( clusters = wavclusters, 
                filename = "wavClusters.bed", 
                trackname = "wavClusters", 
                description = "wavClusters" )
```

and the coverage function can be exported with

```{r eval=FALSE}
exportCoverage( coverage = coverage, filename = "coverage.bigWig" )
```

###Annotating binding sites

wavClusters can be annotated with respect to genomic features using **annotateClusters**. This function generates a strand-specific dot chart representing wavClusters annotation. **annotateClusters** takes as an input the wavClusters and a transcriptDB object containing transcript annotations. The latter can be generated using **makeTxDbFromUCSC** (GenomicFeatures package) 

```{r eval=FALSE}
txDB <- makeTxDbFromUCSC(genome = "hg19", tablename = "ensGene")
```

and is automatically downloaded by **annotateClusters** if missing. 

Then, the annotateClusters can be called as follows

```{r eval=FALSE}
annotateClusters( clusters = wavclusters, 
              txDB = txDB, 
              plot = TRUE, 
              verbose = TRUE)
```

Four dot charts are returned by the function showing the percentage of clusters mapping to different transcript features localized on the same (sense) or on the opposite (antisense) strand, the relative sequence length of different compartments relative to the total transcriptome length and the normalized enrichment of clusters across functional compartments.

Note: multiple hits, i.e. wavClusters that overlap with more than one genomic feature, are reported as "multiple", whereas wavClusters that map outside of the considered features are labeled as "other". The latter are then annotated with respect to features on the antisense strand. 

###Computing cluster metagene profiles

A graphical representation of the density of wavClusters as a function of a binning of genomic coordinates across all annotated genes can be obtained using the **getMetaGene** function.

```{r eval=FALSE}
getMetaGene( clusters = wavclusters, 
             txDB = txDB, 
             upstream = 1e3, 
             downstream = 1e3, 
             nBins = 40, 
             nBinsUD = 10, 
             minLength = 1, 
             plot = TRUE,
             verbose = TRUE ) 
```

In this example, genes were divided in 40 bins (```nBins```) and an ```upstream```/```downstream``` region spanning 1kb was considered. This region was subdivided in 10 bins (```nBinsUD```). No restriction on gene length was applied (```minLength```). **getMetaGene** returns a numeric vector of length ```nBins``` + 2*```nBinsUD``` with normalized counts, which can be used, for instance, to compare the distribution of wavClusters across several PAR-CLIP samples.

In addition to metagene profiles, meta transcription start site (TSS) profiles based on all mapped reads can be generated using **getMetaTSS**. 

```{r eval=FALSE}
getMetaTSS( sortedBam = Bam, 
            txDB = txDB, 
            upstream = 1e3, 
            downstream = 1e3, 
            nBins = 40, 
            unique = FALSE, 
            plot = TRUE, 
            verbose = TRUE ) 
```

Here, the ```upstream``` and ```downstream``` parameters control the width of the window to be considered, and ```nBins``` controls the resolution of the profile. If ```unique=TRUE```, then overlapping windows are discarded. **getMetaTSS** returns a numeric vector of length ```nBins``` with normalized read counts.

###Computing the size distribution of wavClusters

The size distribution of wavClusters can be visualized by

```{r fig.width=5, fig.height=5, fig.align='center', eval=TRUE}
plotSizeDistribution( clusters = wavclusters, showCov = TRUE, col = "skyblue2" )
```

###Visualizing wavCluster statistics and meta data

Cluster statistics can be plotted as pairs plot using

```{r fig.width=5, fig.height=5, fig.align='center', eval=FALSE}
plotStatistics( clusters = wavclusters, 
                corMethod = "spearman", 
                lower = panel.smooth )
```

#Session Info
```{r eval=TRUE}
sessionInfo()
```

#References

Comoglio, F., Sievers, C. & Paro, R. (2015) Sensitive and highly resolved inidentification of RNA-protein interaction sites in PAR-CLIP data. BMC Bioinformatics, 16, 32

Langmead,B., Trapnell,C., Pop,M. \& Salzberg,S.L. (2009)
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25

Kishore, S. et al. (2011)
A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nature Methods 8(7), 559-564

Sievers, C., Schlumpf, T., Sawarkar, R., Comoglio, F. & Paro, R. (2012) Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data. Nucleic Acids Res 40(2), e160