---
title: "Common Downstream ChIP-seq Analysis Workflows using ChIPpeakAnno"
author: "Jianhong Ou, Jun Yu, Lihua Julie Zhu"
output: 
    BiocStyle::html_document:
        css: custom.css
date: "`r doc_date()`"
package: "`r pkg_ver('ChIPpeakAnno')`"
bibliography: bibliography.bib
csl: nature.csl
vignette: >
  %\VignetteIndexEntry{ChIPpeakAnno Annotation Pipeline}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
    library(ChIPpeakAnno)
    library(rtracklayer)
    library(EnsDb.Hsapiens.v75)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(org.Hs.eg.db)
    library(reactome.db)
    library(BSgenome.Hsapiens.UCSC.hg19)
})
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
i <- 0
generateFigureCaption <- function(cap){
    i <<- i+1
    return(paste0("Figure ", i, ". ", cap))
}
```

In this guide, we illustrate here two common downstream analysis workflows for ChIP-seq experiments, one is for comparing and combining peaks for single transcription factor (TF) with replicates, and the other is for comparing binding profiles from ChIP-seq experiments with multiple TFs.

# Workflow for ChIP-seq experiments of single transcription factor with replicates

This workflow shows how to convert BED/GFF files to GRanges, find overlapping peaks between two peak sets, and visualize the number of common and specific peaks with Venn diagram.

## Import data and obtain overlapping peaks from replicates

The input for **ChIPpeakAnno**[@zhu2010, @zhu2013] is a list of called peaks identified 
from ChIP-seq experiments or any other experiments that yield a set of chromosome coordinates. Although peaks are represented as GRanges
in **ChIPpeakAnno**, other common peak formats such as BED, GFF and MACS can be converted to GRanges easily using a conversion `toGRanges` method. For detailed information on how to use this method, please type ?`toGRanges`.

The following examples illustrate the usage of this method to convert BED and GFF file to GRanges, add metadata from orignal peaks to the overlap GRanges using function `addMetadata`, and visualize the overlapping using function `makeVennDiagram`.

```{r workflow1, fig.cap=generateFigureCaption("Venn diagram of overlaps for replicated experiments"), fig.width=6, fig.height=6}
library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE) 
## one can also try import from rtracklayer
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
## must keep the class exactly same as gr1$score, i.e., numeric.
gr2$score <- as.numeric(gr2$score) 
ol <- findOverlapsOfPeaks(gr1, gr2)
## add metadata (mean of score) to the overlapping peaks
ol <- addMetadata(ol, colNames="score", FUN=mean) 
ol$peaklist[["gr1///gr2"]][1:2]
makeVennDiagram(ol, fill=c("#009E73", "#F0E442"), # circle fill color
                col=c("#D55E00", "#0072B2"), #circle border color
                cat.col=c("#D55E00", "#0072B2")) # label color, keep same as circle border color
```

## Prepare annotation data

Annotation data should be an object of GRanges. Same as import peaks, we use the method `toGRanges`, which can return an object of GRanges, to represent the annotation data. An annotation data be constructed from not only BED, GFF or user defined readable text files, but also EnsDb or TxDb object,  by calling the `toGRanges` method. Please type ?`toGRanges` for more information.


```{r annoData}
library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
```

## Visualize binding site distribution relative to features
After finding the overlapping peaks, the distribution of the distance of overlapped peaks to the nearest feature such as the transcription start sites (TSS) can be plotted by `binOverFeature` function. 
The sample code here plots the distribution of peaks around the TSS.

```{r workflow2,fig.cap=generateFigureCaption("Distribution of peaks around transcript start sites."),fig.width=8,fig.height=6}
overlaps <- ol$peaklist[["gr1///gr2"]]
binOverFeature(overlaps, annotationData=annoData,
               radius=5000, nbins=20, FUN=length, errFun=0,
               ylab="count", 
               main="Distribution of aggregated peak numbers around TSS")
```

In addition, `assignChromosomeRegion` can be used to summarize the distribution of  peaks over different type of features such as exon, intron, enhancer, proximal promoter, 5' UTR and 3' UTR. 
This distribution can be summarized in peak centric or nucleotide centric view using 
the function `assignChromosomeRegion`. 
Please note that one peak might span multiple type of features, leading to the number of annotated features greater than the total number of input peaks. At the peak centric view, precedence will dictate the annotation order when peaks span multiple type of features.

```{r workflow4,fig.cap=generateFigureCaption("Peak distribution over different genomic features."),fig.width=10,fig.height=4}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
aCR<-assignChromosomeRegion(overlaps, nucleotideLevel=FALSE, 
                           precedence=c("Promoters", "immediateDownstream", 
                                         "fiveUTRs", "threeUTRs", 
                                         "Exons", "Introns"), 
                           TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene)
barplot(aCR$percentage, las=3)
```

## Annotate peaks
As shown from the distribution of aggregated peak numbers around TSS and 
the distribution of peaks in different of chromosome regions,
most of the peaks locate around TSS. Therefore, it is reasonable to use `annotatePeakInBatch` or `annoPeaks` 
to annotate the peaks to the promoter regions of Hg19 genes. 
Promoters can be specified with bindingRegion. For the following example, promoter region is defined as upstream 2000 and downstream 500 from TSS (bindingRegion=c(-2000, 500)).

```{r workflow3}
overlaps.anno <- annotatePeakInBatch(overlaps, 
                                     AnnotationData=annoData, 
                                     output="nearestBiDirectionalPromoters",
                                     bindingRegion=c(-2000, 500))
library(org.Hs.eg.db)
overlaps.anno <- addGeneIDs(overlaps.anno,
                            "org.Hs.eg.db",
                            IDs2Add = "entrez_id")
head(overlaps.anno)
write.csv(as.data.frame(unname(overlaps.anno)), "anno.csv")
```

The distribution of the common peaks around features can be visualized using a pie chart.

```{r workflow20,fig.cap=generateFigureCaption("Pie chart of the distribution of common peaks around features.")}
pie1(table(overlaps.anno$insideFeature))
```


## Obtain enriched GO terms and Pathways
The following example shows how to use `getEnrichedGO` to obtain a list of enriched GO terms with annotated peaks. For pathway analysis, please use function `getEnrichedPATH` with reactome or KEGG database. Please note that by default  _feature\_id\_type_ is set as “ensembl\_gene\_id”. If you are using **TxDb** as annotation data, please set it to “entrez\_id”.
```{r enrichment}
over <- getEnrichedGO(overlaps.anno, orgAnn="org.Hs.eg.db", 
                     maxP=.05, minGOterm=10, 
                     multiAdjMethod="BH", condense=TRUE)
head(over[["bp"]][, -c(3, 10)])
library(reactome.db)
path <- getEnrichedPATH(overlaps.anno, "org.Hs.eg.db", "reactome.db", maxP=.05)
head(path)
```

## Obtain the sequences surrounding the peaks
Here is an example to get the sequences of the peaks plus 20 bp upstream and downstream for PCR validation or motif discovery.

```{r fasta}
library(BSgenome.Hsapiens.UCSC.hg19)
seq <- getAllPeakSequence(overlaps, upstream=20, downstream=20, genome=Hsapiens)
write2FASTA(seq, "test.fa")
```

## Output a summary of consensus in the peaks
Here is an example to get the Z-scores for short oligos[@leung1996over,@van2000statistical].
```{r consensus,fig.cap=generateFigureCaption("Histogram of Z-score of 6-mer"),fig.height=6,fig.width=6}
## summary of the short oligos
freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3, 
                   quickMotif=FALSE, freqs=freqs)
## plot the results
zscore <- sort(os$zscore)
h <- hist(zscore, breaks=100, xlim=c(-50, 50), main="Histogram of Z-score")
text(zscore[length(zscore)], max(h$counts)/10, 
     labels=names(zscore[length(zscore)]), adj=1)
```
```{r simulation, fig.cap=generateFigureCaption("Histogram of Z-score of simulation data"), fig.width=6, fig.height=6}
## We can also try simulation data
seq.sim.motif <- list(c("t", "g", "c", "a", "t", "g"), 
                      c("g", "c", "a", "t", "g", "c"))
set.seed(1)
seq.sim <- sapply(sample(c(2, 1, 0), 1000, replace=TRUE, prob=c(0.07, 0.1, 0.83)), 
                  function(x){
    s <- sample(c("a", "c", "g", "t"), 
                sample(100:1000, 1), replace=TRUE)
    if(x>0){
        si <- sample.int(length(s), 1)
        if(si>length(s)-6) si <- length(s)-6
        s[si:(si+5)] <- seq.sim.motif[[x]]
    }
    paste(s, collapse="")
})
os <- oligoSummary(seq.sim, oligoLength=6, MarkovOrder=3, 
                   quickMotif=TRUE)
zscore <- sort(os$zscore, decreasing=TRUE)
h <- hist(zscore, breaks=100, main="Histogram of Z-score")
text(zscore[1:2], rep(5, 2), 
     labels=names(zscore[1:2]), adj=0, srt=90)
```
```{r simulation.motif, fig.cap=generateFigureCaption("Motif of simulation data"), fig.width=6, fig.height=6}
## generate the motifs
library(motifStack)
pfms <- mapply(function(.ele, id)
    new("pfm", mat=.ele, name=paste("SAMPLE motif", id)), 
    os$motifs, 1:length(os$motifs))
motifStack(pfms[[1]])
```

## Find peaks with bi-directional promoters

Bidirectional promoters are the DNA regions located between TSS 
of two adjacent genes that are transcribed on opposite directions 
and often co-regulated by this shared promoter region[@robertson2007]. 
Here is an example to find peaks near bi-directional promoters.

```{r peaksNearBDP16}
bdp <- peaksNearBDP(overlaps, annoData, maxgap=5000)
c(bdp$percentPeaksWithBDP, 
  bdp$n.peaks, 
  bdp$n.peaksWithBDP)
bdp$peaksWithBDP[1:2]
```

## Find possible enhancers with DNA interaction data

There are several techniques available to determine the spatial organization of chromosomes at high resolution such as 3C, 5C and HiC[@lieberman2009comprehensive]. These techniques make it possible to search peaks binding to the potential enhancer regions. Here is an example to find peaks binding to the potential enhancer regions. 

```{r findEnhancers}
DNA5C <- system.file("extdata", 
                     "wgEncodeUmassDekker5CGm12878PkV2.bed.gz",
                     package="ChIPpeakAnno")
DNAinteractiveData <- toGRanges(gzfile(DNA5C))
findEnhancers(overlaps, annoData, DNAinteractiveData)
```

# Workflow for comparing binding profiles from multiple transcription factors (TFs)
Given two or more peak lists from different TFs, one may be interested in finding 
whether DNA binding profile of those TFs are correlated, and if correlated, 
what is the common binding pattern. The workflow here shows how to test the correlation of binding profiles of three TFs and how to discover the common binding pattern.

## Import data
```{r importData}
path <- system.file("extdata", package="ChIPpeakAnno")
files <- dir(path, "broadPeak")
data <- sapply(file.path(path, files), toGRanges, format="broadPeak")
names(data) <- gsub(".broadPeak", "", files)
```

## Determine if there is a significant overlap among multiple sets of peaks
### Hypergeometric test
When we test the association between two sets of data based on hypergeometric 
distribution, the number of all potential binding sites is required. 
The parameter _totalTest_ in the function `makeVennDiagram` indicates how many 
potential peaks in total will be used in the hypergeometric test. It should be 
larger than the largest number of peaks in the peak list. The smaller it is 
set, the more stringent the test is. The time used to calculate p-value does not
depend on the value of the _totalTest_. 
For practical guidance on how to choose _totalTest_, 
please refer to the [post](https://stat.ethz.ch/pipermail/bioconductor/2010-November/036540.html).
The following example makes an assumption that there are 3% of coding region plus promoter region. Because the sample data is only a subset of chromosome 2, 
we estimate that the total binding sites is 1/24 of possible binding region in the genome.

```{r vennDiagram, fig.cap=generateFigureCaption("Venn diagram of overlaps."), fig.width=6, fig.height=6}
ol <- findOverlapsOfPeaks(data, connectedPeaks="keepAll")
averagePeakWidth <- mean(width(unlist(GRangesList(ol$peaklist))))
tot <- ceiling(3.3e+9 * .03 / averagePeakWidth / 24)
makeVennDiagram(ol, totalTest=tot, connectedPeaks="keepAll", 
                fill=c("#CC79A7", "#56B4E9", "#F0E442"), # circle fill color
                col=c("#D55E00", "#0072B2", "#E69F00"), #circle border color
                cat.col=c("#D55E00", "#0072B2", "#E69F00"))
## see the difference if we set connectedPeaks to "keepFirstListConsistent"
## set connectedPeaks to keepFirstListConsistent will show consistent total 
## number of peaks for the first peak list.
makeVennDiagram(ol, totalTest=tot, connectedPeaks="keepFirstListConsistent", 
                fill=c("#CC79A7", "#56B4E9", "#F0E442"),
                col=c("#D55E00", "#0072B2", "#E69F00"),
                cat.col=c("#D55E00", "#0072B2", "#E69F00"))

```

### Permutation test
The above hypergeometric test requires users to input an estimate of 
the total potential binding sites for a given TF. To circumvent this requirement, we implemented a permutation test called `peakPermTest`. 
Before performing a permutation test, users need to generate random peaks using the distribution discovered from the input peaks for a given feature type (transcripts or exons), to make sure the binding positions relative to features, such as TSS and geneEnd, and the width of the random peaks follow the distribution of that of the input peaks.

Alternatively, a peak pool representing all potential 
binding sites can be created with associated binding probabilities for random peak sampling using `preparePool`. Here is an example to build a peak pool for human genome using the transcription factor binding site clusters (V3) (see ?`wgEncodeTfbsV3`) 
downloaded from [ENCODE](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRegTfbsClustered/wgEncodeRegTfbsClusteredV3.bed.gz) with the HOT spots (?`HOT.spots`) removed. HOT spots are the genomic regions with high probability of being bound by many TFs in ChIP-seq experiments[@yip2012]. 
We suggest remove those HOT spots from the peak lists before performing 
permutation test to avoid the overestimation of the association between the two input 
peak lists. Users can also choose to remove [ENCODE blacklist](https://sites.google.com/site/anshulkundaje/projects/blacklists) for a given species.
The blacklists were constructed by identifying consistently problematic regions over independent cell lines and types of experiments for each species in the ENCODE and modENCODE datasets[@encode2012integrated]. 
Please note that some of the blacklists may need to be converted to the correct genome assembly using liftover utility.

Following are the sample codes to do the permutation test using `permTest`:

```{r peakPermTest1, fig.cap=generateFigureCaption("permutation test for YY1 and TEAD4")}
    data(HOT.spots)
    data(wgEncodeTfbsV3)
    hotGR <- reduce(unlist(HOT.spots))
    removeOl <- function(.ele){
        ol <- findOverlaps(.ele, hotGR)
        if(length(ol)>0) .ele <- .ele[-unique(queryHits(ol))]
        .ele
    }
    TAF <- removeOl(data[["TAF"]])
    TEAD4 <- removeOl(data[["Tead4"]])
    YY1 <- removeOl(data[["YY1"]])
    # we subset the pool to save demo time
    set.seed(1)
    wgEncodeTfbsV3.subset <- 
        wgEncodeTfbsV3[sample.int(length(wgEncodeTfbsV3), 2000)]
    pool <- new("permPool", grs=GRangesList(wgEncodeTfbsV3.subset), N=length(YY1))
    pt1 <- peakPermTest(YY1, TEAD4, pool=pool, seed=1, force.parallel=FALSE)
    plot(pt1)
```
```{r peakPermTest2, fig.cap=generateFigureCaption("permutation test for YY1 and TAF")}
    pt2 <- peakPermTest(YY1, TAF, pool=pool, seed=1, force.parallel=FALSE)
    plot(pt2)
```


## Visualize and compare the binding pattern
The binding pattern around a genome feature could be visualized and compared using a side-by-side heatmap and density plot using the binding ranges of overlapping peaks.
```{r heatmap,fig.cap=generateFigureCaption("Heatmap of aligned features sorted by signal of TAF"),fig.width=4,fig.height=6}
features <- ol$peaklist[[length(ol$peaklist)]]
feature.recentered <- reCenterPeaks(features, width=4000)
## here we also suggest importData function in bioconductor trackViewer package 
## to import the coverage.
## compare rtracklayer, it will save you time when handle huge dataset.
library(rtracklayer)
files <- dir(path, "bigWig")
if(.Platform$OS.type != "windows"){
    cvglists <- sapply(file.path(path, files), import, 
                       format="BigWig", 
                       which=feature.recentered, 
                       as="RleList")
}else{## rtracklayer can not import bigWig files on Windows
    load(file.path(path, "cvglist.rds"))
}
names(cvglists) <- gsub(".bigWig", "", files)
feature.center <- reCenterPeaks(features, width=1)
sig <- featureAlignedSignal(cvglists, feature.center, 
                            upstream=2000, downstream=2000)
##Because the bw file is only a subset of the original file,
##the signals are not exists for every peak.
keep <- rowSums(sig[[2]]) > 0
sig <- sapply(sig, function(.ele) .ele[keep, ], simplify = FALSE)
feature.center <- feature.center[keep]
heatmap <- featureAlignedHeatmap(sig, feature.center, 
                                 upstream=2000, downstream=2000,
                                 upper.extreme=c(3,.5,4))
```
```{r sortHeatmapByHcluster,fig.cap=generateFigureCaption("Heatmap of aligned features sorted by hclut"),fig.width=4,fig.height=6}
sig.rowsums <- sapply(sig, rowSums, na.rm=TRUE)
d <- dist(sig.rowsums)
hc <- hclust(d)
feature.center$order <- hc$order
heatmap <- featureAlignedHeatmap(sig, feature.center, 
                                 upstream=2000, downstream=2000,
                                 upper.extreme=c(3,.5,4),
                                 sortBy="order")
```
```{r distribution,fig.cap=generateFigureCaption("Distribution of aligned features"),fig.width=6,fig.height=6}
featureAlignedDistribution(sig, feature.center, 
                           upstream=2000, downstream=2000,
                           type="l")
```