---
title: "Processing statistics for ChIP-seq datasets"
author:
- name: Aaron Lun
  affiliation: Cancer Research UK Cambridge Institute, Cambridge, UK
date: "Revised: 6 December 2018"
output:
    BiocStyle::html_document:
        toc_float: yes
package: chipseqDBData 
vignette: >
    %\VignetteIndexEntry{File manifest and statistics}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}    
bibliography: ref.bib
---

```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

# Introduction

This package contains several ChIP-seq datasets for use in differential binding (DB) analyses:

- H3K9ac and H3K4me3 ChIP-seq in murine pro-B and mature B cells [@domingo2012bcell]
- CREB binding protein (CBP) ChIP-seq in wild-type and CBP-knockout mouse embryonic fibroblasts [@kasper2014genomewide]
- Nuclear transcription factor subunit gamma alpha (NF-YA) ChIP-seq in mouse terminal neurons and embryonic stem cells [@tiwari2011chromatin]
- H3K27me3 ChIP-seq in mouse wild-type and Ezh2-knockout lung epithelium [@galvis2015repression]

These datasets are mainly used in the `r Biocpkg("chipseqDB")` workflow [@lun2015reads] and the `r Biocpkg("csaw")` user's guide [@lun2016csaw].
This vignette will briefly demonstrate how to obtain each dataset and investigate some of the processing statistics.

# Obtaining each dataset

We obtain the H3K9ac dataset from `r Biocpkg("ExperimentHub")` using the `H3K9acData()` function.
This downloads sorted and indexed BAM files to a local cache, along with the associated index files.
The function returns a `DataFrame` of file paths and sample descriptions to further use in workflows.

```{r}
library(chipseqDBData)
h3k9ac.paths <- H3K9acData()
h3k9ac.paths
```

Note that the time-consuming download only occurs upon the first use of the function.
Later uses will simply re-use the same files, thus avoiding the need to re-download these large files.
(Some readers may notice that the paths point to the temporary directory, which is destroyed at the end of each R session.
Here, the temporary directory contains only soft-links to the persistent BAM files in the local cache.
This is a low-cost illusion to ensure that the index files have the same prefixes as the BAM files.)

The same approach is used for all of the other datasets, e.g., `CBPData()`, `NFYAData()`.
Be aware that the initial download time will depend on the size and number of the BAM files in each dataset.

# Investigating mapping statistics

We use functions from the `r Biocpkg("Rsamtools")` package to examine the mapping statistics.
This includes the number of mapped reads, the number of marked reads (i.e., potential PCR duplicates) and the number of high-quality alignments with high mapping scores.

```{r}
library(Rsamtools)
diagnostics <- list()
for (i in seq_len(nrow(h3k9ac.paths))) {
    stats <- scanBam(h3k9ac.paths$Path[i], 
        param=ScanBamParam(what=c("mapq", "flag")))
    flag <- stats[[1]]$flag
    mapq <- stats[[1]]$mapq

    mapped <- bitwAnd(flag, 0x4)==0
    diagnostics[[h3k9ac.paths$Name[i]]] <- c(
        Total=length(flag), 
        Mapped=sum(mapped),
        HighQual=sum(mapq >= 10 & mapped),
        DupMarked=sum(bitwAnd(flag, 0x400)!=0)
    )
}
diag.stats <- data.frame(do.call(rbind, diagnostics))
diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100
diag.stats$Prop.marked <- diag.stats$DupMarked/diag.stats$Mapped*100
diag.stats
```

More comprehensive quality checks are beyond the scope of this document, but can be performed with other packages such as `r Biocpkg("ChIPQC")`.

# Session information

```{r}
sessionInfo()
```

# References