---
title: "Detecting differential binding of CBP in mouse fibroblasts"
author:
- name: Aaron T. L. Lun
  affiliation: 
  - &WEHI The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC 3052, Melbourne, Australia
  - Department of Medical Biology, The University of Melbourne, Parkville, VIC 3010, Melbourne, Australia
- name: Gordon K. Smyth
  affiliation: 
  - *WEHI
  - Department of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Melbourne, Australia
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{3. Differential binding of CBP in fibroblasts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    toc_float: yes
    titlecaps: false
bibliography: ref.bib
---

```{r style, echo=FALSE, results='hide', message=FALSE}
library(BiocStyle)
library(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE, cache=TRUE)
```

# Overview

Here, we perform a window-based DB analysis to identify differentially bound (DB) regions for CREB-binding protein (CBP).
This provides an example of how to use `r Biocpkg("csaw")` for transcription factor (TF) data, 
complementing our `r Biocpkg("chipseqDB", "h3k9ac.html", "previous analysis")` on a histone mark.
We use CBP ChIP-seq data from a study comparing wild-type (WT) and CBP knock-out (KO) animals [@kasper2014genomewide], with two biological replicates for each genotype.
BAM files and indices are downloaded using `r Biocpkg("chipseqDBData")` and cached for later use.

```{r}
library(chipseqDBData)
cbpdata <- CBPData()
cbpdata
```

# Pre-processing checks

We check some mapping statistics for the CBP dataset with `r Biocpkg("Rsamtools")`, as previously described.

```{r}
library(Rsamtools)
diagnostics <- list()
for (b in seq_along(cbpdata$Path)) {
    bam <- cbpdata$Path[[b]]
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[b]] <- c(Total=total, Mapped=mapped, Marked=marked)
}

diag.stats <- data.frame(do.call(rbind, diagnostics))
rownames(diag.stats) <- cbpdata$Name
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats
```

We construct a `readParam` object to standardize the parameter settings in this analysis.
The ENCODE blacklist is again used^[Assuming you ran the previous workflow, this will be retrieved from cache rather than being downloaded again.] to remove reads in problematic regions [@encode2012encode].

```{r}
library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("https://www.encodeproject.org",
    "files/ENCFF547MET/@@download/ENCFF547MET.bed.gz"))

library(rtracklayer)
blacklist <- import(black.path)
```

We set the minimum mapping quality score to 10 to remove poorly or non-uniquely aligned reads.

```{r}
library(csaw)
param <- readParam(minq=10, discard=blacklist)
param
```

# Computing the average fragment length

The average fragment length is estimated by maximizing the cross-correlation function (Figure \@ref(fig:ccfplot)), as previously described.
Generally, cross-correlations for TF datasets are sharper than for histone marks as the TFs typically contact a smaller genomic interval.
This results in more pronounced strand bimodality in the binding profile.

```{r}
x <- correlateReads(cbpdata$Path, param=reform(param, dedup=TRUE))
frag.len <- maximizeCcf(x)
frag.len
```

```{r ccfplot, fig.cap="Cross-correlation function (CCF) against delay distance for the CBP dataset. The delay with the maximum correlation is shown as the red line."}
plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")
```

# Counting reads into windows

Reads are then counted into sliding windows using `r Biocpkg("csaw")` [@lun2015csaw].
For TF data analyses, smaller windows are necessary to capture sharp binding sites.
A large window size will be suboptimal as the count for a particular site will be "contaminated" by non-specific background in the neighbouring regions.
In this case, a window size of 10 bp is used.

```{r}
win.data <- windowCounts(cbpdata$Path, param=param, width=10, ext=frag.len)
win.data
```

The default spacing of 50 bp is also used here.
This may seem inappropriate given that the windows are only 10 bp.
However, reads lying in the interval between adjacent windows will still be counted into several windows.
This is because reads are extended to the value of `frag.len`, which is substantially larger than the 50 bp spacing^[Smaller spacings can be used but will provide little benefit given that each extended read already overlaps multiple windows.].

# Normalization for composition biases

Composition biases are introduced when the amount of DB in each condition is unbalanced [@robinson2010scaling; @lun2014denovo].
More binding in one condition means that more reads are sequenced at the binding sites, leaving fewer reads for the rest of the genome.
This suppresses the genomic coverage at non-DB sites, resulting in spurious differences between samples. 
We expect unbalanced DB in this dataset as CBP function should be compromised in the KO cells,
such that most - if not all - of the DB sites should exhibit increased CBP binding in the WT condition.

To remove this bias, we assign reads to large genomic bins and assume that most bins represent non-DB background regions.
Any systematic differences in the coverage of those bins is attributed to composition bias and is normalized out.
Specifically, the trimmed mean of M-values (TMM) method [@robinson2010scaling] is applied to compute normalization factors from the bin counts.
These factors are stored in `win.data`^[See the `se.out=` argument.] so that they will be applied during the DB analysis with the window counts.

```{r}
bins <- windowCounts(cbpdata$Path, bin=TRUE, width=10000, param=param)
win.data <- normFactors(bins, se.out=win.data)
(normfacs <- win.data$norm.factors)
```

We visualize the effect of normalization with mean-difference plots between pairs of samples (Figure \@ref(fig:compoplot)).
The dense cloud in each plot represents the majority of bins in the genome.
These are assumed to mostly contain background regions.
A non-zero log-fold change for these bins indicates that composition bias is present between samples. 
The red line represents the log-ratio of normalization factors and passes through the centre of the cloud in each plot,
indicating that the bias has been successfully identified and removed.

```{r compoplot, fig.width=12, fig.asp=0.5, fig.cap="Mean-difference plots for the bin counts, comparing sample 4 to all other samples. The red line represents the log-ratio of the normalization factors between samples."}
bin.ab <- scaledAverage(bins)
adjc <- calculateCPM(bins, use.norm.factors=FALSE)

par(cex.lab=1.5, mfrow=c(1,3))
smoothScatter(bin.ab, adjc[,1]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 4)")
abline(h=log2(normfacs[1]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,2]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (2 vs 4)")
abline(h=log2(normfacs[2]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,3]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (3 vs 4)")
abline(h=log2(normfacs[3]/normfacs[4]), col="red")
```

Note that this normalization strategy is quite different from that in the H3K9ac analysis.
Here, systematic DB in one direction is expected between conditions, given that CBP function is lost in the KO genotype.
This means that the assumption of a non-DB majority (required for non-linear normalization of the `r Biocpkg("chipseqDB", "h3k9ac.html#normalizing-for-sample-specific-trended-biases", "H3K9ac data")`) is not valid.
No such assumption is made by the binned-TMM approach described above, which makes it more appropriate for use in the CBP analysis.

# Filtering of low-abundance windows

Removal of low-abundance windows is performed as `r Biocpkg("chipseqDB", "h3k9ac.html#filtering-windows-by-abundance", "previously described")`, by computing the coverage in each window relative to a global estimate of background enrichment.
The majority of windows in background regions are filtered out upon applying a modest fold-change threshold.
This leaves a small set of relevant windows for further analysis.

```{r}
filter.stat <- filterWindowsGlobal(win.data, bins)
min.fc <- 3
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
filtered.data <- win.data[keep,]
```

Note that the 10 kbp bins are used here for filtering, while smaller 2 kbp bins were used in the corresponding step for the H3K9ac analysis.
This is purely for convenience -- the 10 kbp counts for this dataset were previously loaded for normalization, and can be re-used during filtering to save time.
Changes in bin size will have little impact on the results, so long as the bins (and their counts) are large enough for precise estimation of the background abundance.
While smaller bins provide greater spatial resolution, this is irrelevant for quantifying coverage in large background regions that span most of the genome.

# Statistical modelling of biological variability

Counts for each window are modelled using `r Biocpkg("edgeR")` [@mccarthy2012differential; @robinson2010edger].
We convert our `RangedSummarizedExperiment` object into a `DGEList`.

```{r}
library(edgeR)
y <- asDGEList(filtered.data)
summary(y)
```

We then construct a design matrix for our experimental design.
Again, we have a simple one-way layout with two groups of two replicates.

```{r}
genotype <- cbpdata$Description
genotype[grep("wild-type", genotype)] <- "wt"
genotype[grep("knock-out", genotype)] <- "ko"

genotype <- factor(genotype)
design <- model.matrix(~0+genotype)
colnames(design) <- levels(genotype)
design
```

We estimate the negative binomial (NB) and quasi-likelihood (QL) dispersions for each window [@lund2012ql].
The estimated NB dispersions (Figure \@ref(fig:bcvplot)) are substantially larger than those observed in the H3K9ac dataset.
They also exhibit an unusual increasing trend with respect to abundance.

```{r bcvplot, fig.cap="Abundance-dependent trend in the biological coefficient of variation (i.e., the root-NB dispersion) for each window, represented by the blue line. Common (red) and tagwise estimates (black) are also shown."}
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
plotBCV(y)
```

The estimated prior d.f. is also infinite, meaning that all the QL dispersions are equal to the trend (Figure \@ref(fig:qlplot)).

```{r qlplot, fig.cap="Effect of EB shrinkage on the raw QL dispersion estimate for each window (black) towards the abundance-dependent trend (blue) to obtain squeezed estimates (red). Quarter-root estimates are shown for greater dynamic range."}
fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$df.prior)
plotQLDisp(fit)
```

These results are consistent with the presence of a systematic difference in CBP enrichment between the WT replicates.
An increasing trend in Figure \@ref(fig:bcvplot) is typical after normalization for composition biases,
where replicates exhibit some differences in efficiency that manifest as increased dispersions at high abundance.
The dispersions for all windows are inflated to a similarly large value by this difference, 
manifesting as low variability in the dispersions across windows.
This effect is illustrated in Figure \@ref(fig:mdsplot) where the WT samples are clearly separated in both dimensions.

```{r mdsplot, fig.cap="MDS plot with two dimensions for all samples in the CBP dataset. Samples are labelled and coloured according to the genotype. A larger top set of windows was used to improve the visualization of the genome-wide differences between the WT samples."}
plotMDS(cpm(y, log=TRUE), top=10000, labels=genotype,
    col=c("red", "blue")[as.integer(genotype)])
```

The presence of a large batch effect between replicates is not ideal.
Nonetheless, we can still proceed with the DB analysis - albeit with some loss of power due to the inflated NB dispersions - 
given that there are strong differences between genotypes in Figure \@ref(fig:mdsplot),

# Testing for DB

We test for a significant difference in binding between genotypes in each window using the QL F-test.

```{r}
contrast <- makeContrasts(wt-ko, levels=design)
res <- glmQLFTest(fit, contrast=contrast)
```

Windows less than 100 bp apart are clustered into regions [@lun2014denovo] with a maximum cluster width of 5 kbp.
We then control the region-level FDR by combining per-window $p$-values using Simes' method [@simes1986].

```{r}
merged <- mergeResults(filtered.data, res$table, tol=100, 
    merge.args=list(max.width=5000))
merged$regions
tabcom <- merged$combined
is.sig <- tabcom$FDR <= 0.05
summary(is.sig)
```

All significant regions have increased CBP binding in the WT genotype.
This is expected given that protein function should be lost in the KO genotype.

```{r}
table(tabcom$direction[is.sig])

# Direction according the best window in each cluster.
tabbest <- merged$best
is.sig.pos <- (tabbest$rep.logFC > 0)[is.sig]
summary(is.sig.pos)
```

We save the results to file in the form of a serialized R object for later inspection.

```{r}
out.ranges <- merged$regions
mcols(out.ranges) <- DataFrame(tabcom,
    best.pos=mid(ranges(rowRanges(filtered.data[tabbest$rep.test]))),
    best.logFC=tabbest$rep.logFC)
saveRDS(file="cbp_results.rds", out.ranges)
```

# Annotation and visualization

Annotation for each region is added using the `detailRanges()` function,
as `r Biocpkg("chipseqDB", "h3k9ac.html#using-the-detailranges-function", "previously described")`.

```{r}
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
anno <- detailRanges(out.ranges, orgdb=org.Mm.eg.db,
    txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
mcols(out.ranges) <- cbind(mcols(out.ranges), anno)
```

We will visualize one of the top-ranked DB regions here.
This corresponds to a simple DB event as all windows are changing in the same direction, i.e., up in the WT.
The binding region is also quite small relative to some of the H3K9ac examples, 
consistent with sharp TF binding to a specific recognition site.

```{r}
o <- order(out.ranges$PValue)    
cur.region <- out.ranges[o[2]]
cur.region
```

```{r, results="hide", echo=FALSE}
if (!overlapsAny(cur.region, GRanges("chr16", IRanges(70313851, 70314860)), type="equal")) {
        warning("first region does not match expectations")
}
```

We use `r Biocpkg("Gviz")` [@hahne2016visualizing] to plot the results.
As in the `r Biocpkg("chipseqDB", "h3k9ac.html#visualizing-db-results", "H3K9ac analysis")`, 
we set up some tracks to display genome coordinates and gene annotation.

```{r}
library(Gviz)
gax <- GenomeAxisTrack(col="black", fontsize=15, size=2)
greg <- GeneRegionTrack(TxDb.Mmusculus.UCSC.mm10.knownGene, showId=TRUE,
    geneSymbol=TRUE, name="", background.title="transparent")
symbols <- unlist(mapIds(org.Mm.eg.db, gene(greg), "SYMBOL",
    "ENTREZID", multiVals = "first"))
symbol(greg) <- symbols[gene(greg)]
```

We visualize two tracks for each sample -- one for the forward-strand coverage, another for the reverse-strand coverage.
This allows visualization of the strand bimodality that is characteristic of genuine TF binding sites.
In Figure \@ref(fig:tfplot), two adjacent sites are present at the *Gbe1* promoter, both of which exhibit increased binding in the WT genotype.
Coverage is also substantially different between the WT replicates, consistent with the presence of a batch effect.

```{r tfplot, fig.width=8, fig.asp=0.75, fig.cap="Coverage tracks for TF binding sites that are differentially bound in the WT (top two tracks) against the KO (last two tracks). Blue and red tracks represent forward- and reverse-strand coverage, respectively, on a per-million scale (capped at 5 in SRR1145788, for visibility)."}
collected <- list()
lib.sizes <- filtered.data$totals/1e6

for (i in seq_along(cbpdata$Path)) {
    reads <- extractReads(bam.file=cbpdata$Path[[i]], cur.region, param=param)
    pcov <- as(coverage(reads[strand(reads)=="+"])/lib.sizes[i], "GRanges")
    ncov <- as(coverage(reads[strand(reads)=="-"])/-lib.sizes[i], "GRanges")
    ptrack <- DataTrack(pcov, type="histogram", lwd=0, ylim=c(-5, 5),
        name=cbpdata$Description[i], col.axis="black", col.title="black",
        fill="blue", col.histogram=NA)
    ntrack <- DataTrack(ncov, type="histogram", lwd=0, ylim=c(-5, 5),
        fill="red", col.histogram=NA)
    collected[[i]] <- OverlayTrack(trackList=list(ptrack, ntrack))
}

plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)),
    from=start(cur.region), to=end(cur.region))
```

# Session information

```{r}
sessionInfo()
```

# References