--- title: "Processing statistics for ChIP-seq datasets" author: - name: Aaron Lun affiliation: Cancer Research UK Cambridge Institute, Cambridge, UK date: "Revised: 6 December 2018" output: BiocStyle::html_document: toc_float: yes package: chipseqDBData vignette: > %\VignetteIndexEntry{File manifest and statistics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: ref.bib --- ```{r, echo=FALSE, results="hide", message=FALSE} require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) ``` # Introduction This package contains several ChIP-seq datasets for use in differential binding (DB) analyses: - H3K9ac and H3K4me3 ChIP-seq in murine pro-B and mature B cells [@domingo2012bcell] - CREB binding protein (CBP) ChIP-seq in wild-type and CBP-knockout mouse embryonic fibroblasts [@kasper2014genomewide] - Nuclear transcription factor subunit gamma alpha (NF-YA) ChIP-seq in mouse terminal neurons and embryonic stem cells [@tiwari2011chromatin] - H3K27me3 ChIP-seq in mouse wild-type and Ezh2-knockout lung epithelium [@galvis2015repression] These datasets are mainly used in the `r Biocpkg("chipseqDB")` workflow [@lun2015reads] and the `r Biocpkg("csaw")` user's guide [@lun2016csaw]. This vignette will briefly demonstrate how to obtain each dataset and investigate some of the processing statistics. # Obtaining each dataset We obtain the H3K9ac dataset from `r Biocpkg("ExperimentHub")` using the `H3K9acData()` function. This downloads sorted and indexed BAM files to a local cache, along with the associated index files. The function returns a `DataFrame` of file paths and sample descriptions to further use in workflows. ```{r} library(chipseqDBData) h3k9ac.paths <- H3K9acData() h3k9ac.paths ``` Note that the time-consuming download only occurs upon the first use of the function. Later uses will simply re-use the same files, thus avoiding the need to re-download these large files. (Some readers may notice that the paths point to the temporary directory, which is destroyed at the end of each R session. Here, the temporary directory contains only soft-links to the persistent BAM files in the local cache. This is a low-cost illusion to ensure that the index files have the same prefixes as the BAM files.) The same approach is used for all of the other datasets, e.g., `CBPData()`, `NFYAData()`. Be aware that the initial download time will depend on the size and number of the BAM files in each dataset. # Investigating mapping statistics We use functions from the `r Biocpkg("Rsamtools")` package to examine the mapping statistics. This includes the number of mapped reads, the number of marked reads (i.e., potential PCR duplicates) and the number of high-quality alignments with high mapping scores. ```{r} library(Rsamtools) diagnostics <- list() for (i in seq_len(nrow(h3k9ac.paths))) { stats <- scanBam(h3k9ac.paths$Path[i], param=ScanBamParam(what=c("mapq", "flag"))) flag <- stats[[1]]$flag mapq <- stats[[1]]$mapq mapped <- bitwAnd(flag, 0x4)==0 diagnostics[[h3k9ac.paths$Name[i]]] <- c( Total=length(flag), Mapped=sum(mapped), HighQual=sum(mapq >= 10 & mapped), DupMarked=sum(bitwAnd(flag, 0x400)!=0) ) } diag.stats <- data.frame(do.call(rbind, diagnostics)) diag.stats$Prop.mapped <- diag.stats$Mapped/diag.stats$Total*100 diag.stats$Prop.marked <- diag.stats$DupMarked/diag.stats$Mapped*100 diag.stats ``` More comprehensive quality checks are beyond the scope of this document, but can be performed with other packages such as `r Biocpkg("ChIPQC")`. # Session information ```{r} sessionInfo() ``` # References