---
title: "Detecting differential enrichment of H3K27me3 in the mouse lung epithelium"
author:
- name: Aaron T. L. Lun
  affiliation: 
  - &WEHI The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC 3052, Melbourne, Australia
  - Department of Medical Biology, The University of Melbourne, Parkville, VIC 3010, Melbourne, Australia
- name: Aliaksei Holik
  affiliation:
  - *WEHI
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{4. Differential enrichment of H3K27me3 in lung epithelium}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    toc_float: yes
    titlecaps: false
bibliography: ref.bib
---

```{r style, echo=FALSE, results='hide', message=FALSE}
library(BiocStyle)
library(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE, cache=TRUE)
```

# Overview

Here, we perform a window-based DB analysis to identify regions of differential H3K27me3 enrichment in mouse lung epithelium.
H3K27me3 is associated with transcriptional repression and is usually observed with broad regions of enrichment.
The aim of this workflow is to demonstrate how to analyze these broad marks with `r Biocpkg("csaw")`,
especially at variable resolutions with multiple window sizes.
We use H3K27me3 ChIP-seq data from a study comparing wild-type (WT) and _Ezh2_ knock-out (KO) animals [@galvis2015repression],
contains two biological replicates for each genotype.
We download BAM files and indices using `r Biocpkg("chipseqDBData")`.

```{r}
library(chipseqDBData)
h3k27me3data <- H3K27me3Data()
h3k27me3data
```

# Pre-processing checks

We check some mapping statistics with `r Biocpkg("Rsamtools")`.

```{r}
library(Rsamtools)
diagnostics <- list()
for (b in seq_along(h3k27me3data$Path)) {
    bam <- h3k27me3data$Path[[b]]
    total <- countBam(bam)$records
    mapped <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE)))$records
    marked <- countBam(bam, param=ScanBamParam(
        flag=scanBamFlag(isUnmapped=FALSE, isDuplicate=TRUE)))$records
    diagnostics[[b]] <- c(Total=total, Mapped=mapped, Marked=marked)
}

diag.stats <- data.frame(do.call(rbind, diagnostics))
rownames(diag.stats) <- h3k27me3data$Name
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$Marked/diag.stats$Mapped*100
diag.stats
```

We construct a `readParam` object to standardize the parameter settings in this analysis.
For consistency with the original analysis by @galvis2015repression,
we will define the blacklist using the predicted repeats from the RepeatMasker software.

```{r}
library(BiocFileCache)
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("http://hgdownload.cse.ucsc.edu",
    "goldenPath/mm10/bigZips/chromOut.tar.gz"))
tmpdir <- tempfile()
dir.create(tmpdir)
untar(black.path, exdir=tmpdir)

# Iterate through all chromosomes.
collected <- list()
for (x in list.files(tmpdir, full=TRUE)) {
    f <- list.files(x, full=TRUE, pattern=".fa.out")
    to.get <- vector("list", 15)
    to.get[[5]] <- "character"
    to.get[6:7] <- "integer"
    collected[[length(collected)+1]] <- read.table(f, skip=3, 
        stringsAsFactors=FALSE, colClasses=to.get)
}

collected <- do.call(rbind, collected)
blacklist <- GRanges(collected[,1], IRanges(collected[,2], collected[,3]))
blacklist
```

We set the minimum mapping quality score to 10 to remove poorly or non-uniquely aligned reads.

```{r}
library(csaw)
param <- readParam(minq=10, discard=blacklist)
param
```

# Counting reads into windows

Reads are then counted into sliding windows using `r Biocpkg("csaw")` [@lun2015csaw].
At this stage, we use a large 2 kbp window to reflect the fact that H3K27me3 exhibits broad enrichment.
This allows us to increase the size of the counts and thus detection power,
without having to be concerned about loss of genomic resolution to detect sharp binding events.

```{r}
win.data <- windowCounts(h3k27me3data$Path, param=param, width=2000,
    spacing=500, ext=200)
win.data
```

We use `spacing=500` to avoid redundant work when sliding a large window across the genome.
The default spacing of 50 bp would result in many windows with over 90% overlap in their positions,
increasing the amount of computational work without a meaningful improvement in resolution.
We also set the fragment length to 200 bp based on experimental knowledge of the size selection procedure.
Unlike the previous analyses, the fragment length cannot easily estimated here due to weak strand bimodality of diffuse marks.

# Normalization for composition biases

As in the CBP example, we normalize for composition biases resulting from imbalanced DB between conditions [@lun2014denovo].
We expect systematic DB in one direction as Ezh2 function (and thus some H3K27me3 deposition activity) is lost in the KO genotype.
We apply the TMM method [@robinson2010scaling] to counts for large 10 kbp bins,
and store the resulting normalization factors back in `win.data` for use in the DB analysis with the window counts.

```{r}
bins <- windowCounts(h3k27me3data$Path, bin=TRUE, width=10000, param=param)
win.data <- normFactors(bins, se.out=win.data)
(normfacs <- win.data$norm.factors)
```

Figure \@ref(fig:compoplot) shows the effect of normalization on the relative enrichment between pairs of samples. 
We see that log-ratio of normalization factors passes through the centre of the cloud of background regions in each plot,
indicating that the bias has been successfully identified and removed.

```{r compoplot, fig.width=12, fig.asp=0.5, fig.cap="Mean-difference plots for the bin counts, comparing sample 1 to all other samples. The red line represents the log-ratio of the normalization factors between samples."}
bin.ab <- scaledAverage(bins)
adjc <- calculateCPM(bins, use.norm.factors=FALSE)

par(cex.lab=1.5, mfrow=c(1,3))
smoothScatter(bin.ab, adjc[,1]-adjc[,2], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 2)")
abline(h=log2(normfacs[1]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,1]-adjc[,3], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 3)")
abline(h=log2(normfacs[2]/normfacs[4]), col="red")

smoothScatter(bin.ab, adjc[,1]-adjc[,4], ylim=c(-6, 6),
    xlab="Average abundance", ylab="Log-ratio (1 vs 4)")
abline(h=log2(normfacs[3]/normfacs[4]), col="red")
```

# Filtering of low-abundance windows

We estimate the global background and remove low-abundance windows that are not enriched above this background level.
To retain a window, we require it to have at least 2-fold more coverage than the average background.
This is less stringent than the thresholds used in previous analyses, owing the weaker enrichment observed for diffuse marks.

```{r}
filter.stat <- filterWindowsGlobal(win.data, bins)
min.fc <- 2 
```

Figure \@ref(fig:bghistplot) shows that chosen threshold is greater than the abundances of most bins in the genome,
presumably those corresponding to background regions.
This suggests that the filter will remove most windows lying within background regions.

```{r bghistplot, fig.cap="Histogram of average abundances across all 10 kbp genomic bins. The filter threshold is shown as the red line."}
hist(filter.stat$back.abundances, main="", breaks=50,
    xlab="Background abundance (log2-CPM)")
threshold <- filter.stat$abundances[1] - filter.stat$filter[1] + log2(min.fc)
abline(v=threshold, col="red")
```

We filter out the majority of windows in background regions upon applying a modest fold-change threshold.
This leaves a small set of relevant windows for further analysis.

```{r}
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
filtered.data <- win.data[keep,]
```

# Statistical modelling of biological variability

Counts for each window are modelled using `r Biocpkg("edgeR")` [@mccarthy2012differential; @robinson2010edger].
We first convert our `RangedSummarizedExperiment` object into a `DGEList`.

```{r}
library(edgeR)
y <- asDGEList(filtered.data)
str(y)
```

We then construct a design matrix for our experimental design.
Here, we use a simple one-way layout with two groups of two replicates.

```{r}
genotype <- h3k27me3data$Description
genotype[grep("control", genotype)] <- "wt"
genotype[grep("knock-out", genotype)] <- "ko"

genotype <- factor(genotype)
design <- model.matrix(~0+genotype)
colnames(design) <- levels(genotype)
design
```

We estimate the negative binomial (NB) and quasi-likelihood (QL) dispersions for each window [@lund2012ql].
We again observe an increasing trend in the NB dispersions with respect to abundance (Figure \@ref(fig:bcvplot)).

```{r bcvplot, fig.cap="Abundance-dependent trend in the BCV for each window, represented by the blue line. Common (red) and tagwise estimates (black) are also shown."}
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
plotBCV(y)
```

The QL dispersions are strongly shrunk towards the trend (Figure \@ref(fig:qlplot)), 
indicating that there is little variability in the dispersions across windows.

```{r qlplot, fig.cap="Effect of EB shrinkage on the raw QL dispersion estimate for each window (black) towards the abundance-dependent trend (blue) to obtain squeezed estimates (red)."}
fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$df.prior)
plotQLDisp(fit)
```

These results are consistent with the presence of systematic differences in enrichment between replicates,
as discussed in the `r Biocpkg("chipseqDB", "cbp.html#statistical-modelling-of-biological-variability", "CBP analysis")`.
Nonetheless, the samples separate by genotype in the MDS plot (Figure \@ref(fig:mdsplot)),
which suggests that the downstream analysis will be able to detect DB regions.

```{r mdsplot, fig.cap="MDS plot with two dimensions for all samples in the H3K27me3 data set. Samples are labelled and coloured according to the genotype. A larger top set of windows was used to improve the visualization of the genome-wide differences between the WT samples."}
plotMDS(cpm(y, log=TRUE), top=10000, labels=genotype,
    col=c("red", "blue")[as.integer(genotype)])
```

We then test for DB between conditions in each window using the QL F-test.

```{r}
contrast <- makeContrasts(wt-ko, levels=design)
res <- glmQLFTest(fit, contrast=contrast)
```

# Consolidating results from multiple window sizes

Consolidation allows the analyst to incorporate information from a range of different window sizes,
each of which has a different trade-off between resolution and count size.
This is particularly useful for broad marks where the width of an enriched region can be variable,
as can the width of the differentially bound interval of an enriched region.
To demonstrate, we repeat the entire analysis using 500 bp windows.
Compared to our previous 2 kbp analysis, this provides greater spatial resolution at the cost of lowering the counts.

```{r}
# Counting into 500 bp windows.
win.data2 <- windowCounts(h3k27me3data$Path, param=param, width=500,
    spacing=100, ext=200)

# Re-using the same normalization factors.
win.data2$norm.factors <- win.data$norm.factors

# Filtering on abundance.
filter.stat2 <- filterWindowsGlobal(win.data2, bins)
keep2 <- filter.stat2$filter > log2(min.fc)
filtered.data2 <- win.data2[keep2,]

# Performing the statistical analysis.
y2 <- asDGEList(filtered.data2)
y2 <- estimateDisp(y2, design)
fit2 <- glmQLFit(y2, design, robust=TRUE)
res2 <- glmQLFTest(fit2, contrast=contrast)
```

We consolidate the 500 bp analysis with our previous 2 kbp analysis using the `mergeWindowsList()` function.
This clusters both sets of windows together into a single set of regions.
To limit chaining effects, each region cannot be more than 30 kbp in size.

```{r}
merged <- mergeResultsList(list(filtered.data, filtered.data2), 
    tab.list=list(res$table, res2$table),
    equiweight=TRUE, tol=100, merge.args=list(max.width=30000))
merged$regions
```

We compute combined $p$-values using Simes' method for region-level FDR control [@simes1986; @lun2014denovo].
This is done after weighting the contributions from the two sets of windows
to ensure that the combined $p$-value for each region is not dominated by the analysis with more (smaller) windows.

```{r}
tabcom <- merged$combined
is.sig <- tabcom$FDR <= 0.05
summary(is.sig)
```

Ezh2 is one of the proteins responsible for depositing H3K27me3, 
so we might expect that most DB regions have increased enrichment in the WT condition.
However, the opposite seems to be true here, which is an interesting result that may warrant further investigation. 

```{r}
table(tabcom$direction[is.sig])
```

We also obtain statistics for the window with the lowest $p$-value in each region.
The signs of the log-fold changes are largely consistent with the `direction` numbers above.

```{r}
tabbest <- merged$best
is.sig.pos <- (tabbest$rep.logFC > 0)[is.sig]
summary(is.sig.pos)
```

Finally, we save these results to file for future reference.

```{r}
out.ranges <- merged$regions
mcols(out.ranges) <- data.frame(tabcom, best.logFC=tabbest$rep.logFC)
saveRDS(file="h3k27me3_results.rds", out.ranges)
```

# Annotation and visualization

We add annotation for each region using the `detailRanges()` function, 
as `r Biocpkg("chipseqDB", "h3k9ac.html#using-the-detailranges-function", "previously described")`.

```{r}
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
anno <- detailRanges(out.ranges, orgdb=org.Mm.eg.db,
    txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
mcols(out.ranges) <- cbind(mcols(out.ranges), anno)
```

We visualize one of the DB regions overlapping the _Cdx2_ gene to reproduce the results in @holik2015transcriptome.

```{r}
cdx2 <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)["12591"] # Cdx2 Entrez ID
cur.region <- subsetByOverlaps(out.ranges, cdx2)[1]
cur.region
```

```{r, results="hide", echo=FALSE}
if (cur.region$FDR > 0.05 || cur.region$best.logFC < 0) {
    stop("Cdx2 should be significantly upregulated in WT")
}
```

We use `r Biocpkg("Gviz")` [@hahne2016visualizing] to plot the results.
As in the H3K9ac analysis, we set up some tracks to display genome coordinates and gene annotation.

```{r}
library(Gviz)
gax <- GenomeAxisTrack(col="black", fontsize=15, size=2)
greg <- GeneRegionTrack(TxDb.Mmusculus.UCSC.mm10.knownGene, showId=TRUE,
    geneSymbol=TRUE, name="", background.title="transparent")
symbols <- unlist(mapIds(org.Mm.eg.db, gene(greg), "SYMBOL",
    "ENTREZID", multiVals = "first"))
symbol(greg) <- symbols[gene(greg)]
```

In Figure \@ref(fig:tfplot), we see enrichment of H3K27me3 in the WT condition at the _Cdx2_ locus.
This is consistent with the known regulatory relationship between Ezh2 and _Cdx2_.

```{r tfplot, fig.width=8, fig.asp=0.75, fig.cap="Coverage tracks for a region with H3K27me3 enrichment in KO (top two tracks) against the WT (last two tracks)."}
collected <- list()
lib.sizes <- filtered.data$totals/1e6
for (i in seq_along(h3k27me3data$Path)) {
    reads <- extractReads(bam.file=h3k27me3data$Path[[i]], cur.region, param=param)
    cov <- as(coverage(reads)/lib.sizes[i], "GRanges")
    collected[[i]] <- DataTrack(cov, type="histogram", lwd=0, ylim=c(0,1),
        name=h3k27me3data$Description[i], col.axis="black", col.title="black",
        fill="darkgray", col.histogram=NA)
}

plotTracks(c(gax, collected, greg), chromosome=as.character(seqnames(cur.region)),
    from=start(cur.region), to=end(cur.region))
```

In contrast, if we look at a constitutively expressed gene, such as _Col1a2_, we find little evidence of H3K27me3 deposition.
Only a single window is retained after filtering on abundance,
and this window shows no evidence of changes in H3K27me3 enrichment between WT and KO samples.

```{r}
col1a2 <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)["12843"] # Col1a2 Entrez ID
cur.region <- subsetByOverlaps(out.ranges, col1a2)
cur.region
```

```{r, results="hide", echo=FALSE}
if (cur.region$FDR < 0.05) {
    stop("Col1a2 should not be significantly DB")
}
```

# Session information

```{r}
sessionInfo()
```

# References