---
title: "Similarity between two ChIP-Seq profiles"
output:
    BiocStyle::html_document:
        toc: true
vignette: >
  %\VignetteIndexEntry{Similarity between two ChIP-Seq profiles}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignettePackage{similaRpeak}
  \usepackage[utf8]{inputenc}
---

```{r style, echo = FALSE, results = 'asis', message = FALSE, warning=FALSE}
BiocStyle::markdown()
library(knitr)
```
<br>
**Package**: `r Rpackage("similaRpeak")`<br />
**Authors**: `r packageDescription("similaRpeak")[["Author"]]`<br />
**Version**: `r packageDescription("similaRpeak")$Version`<br />
**Compiled date**: `r Sys.Date()`<br />
**License**: `r packageDescription("similaRpeak")[["License"]]`<br />


Metrics which estimate similarity between two ChIP-Seq profiles
====================================================================

Astrid Deschenes, Elsa Bernatchez, Charles Joly Beauparlant, 
Fabien Claude Lamaze, Rawane Samb, Pascal Belleau and Arnaud Droit.

This package and the underlying similaRpeak code are distributed 
under the Artistic license 2.0. You are free to use and redistribute 
this software. 

## Introduction

The **similaRpeak** package calculates metrics to estimate the level 
of similarity between two ChIP-Seq profiles.

The metrics are:

* **RATIO_AREA**: The ratio between the profile areas. 
The first profile is always divided by the second profile. `NA` is returned 
if minimal area threshold is not respected for at least one of the profiles.

* **DIFF_POS_MAX**: The difference between the maximal peaks positions. The 
difference is always the first profile value minus the second profile value. 
`NA` is returned if minimal peak value is not respected. A profile can have 
more than one position with the maximum value. In that case, the median 
position is used. A threshold argument can be set to consider all positions 
within a certain range of the maximum value. A threshold argument can also be 
set to ensure that the distance between two maximum values is not too wide. 
When this distance is not respected, it is assumed that more than one peak 
is present in the profile and `NA` is returned. 

* **RATIO_MAX_MAX**: The ratio between the peaks values 
in each profile. The first profile is always divided by the second profile. 
`NA` if minimal peak value threshold is not respected for at least one of the 
profiles.

* **RATIO_INTERSECT**: The ratio between the intersection area and the total 
area of the two profiles. `NA` if minimal area threshold is not respected for 
the intersection area.

* **RATIO_NORMALIZED_INTERSECT**: The ratio between the intersection area and 
the total area of two normalized profiles. The profiles are normalized by 
dividing them by their average value. `NA` if minimal area threshold is not 
respected for the intersection area.

* **SPEARMAN_CORRELATION**: The Spearman's rho statistic between profiles.


![alt text](metrics_small.gif "Metrics")


## Loading the similaRpeak package

```{r loadingPackage, warning=FALSE, message=FALSE} 
library(similaRpeak)
```


## Inputs

### ChIP-Seq profiles vectors

ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel 
DNA sequencing to identify the binding sites of DNA-associated proteins. For a
specific region, the read count of aligned sequences at each position of the 
region is used to generate the ChIP-Seq profile for the region.

To estimate the level of similarity between two ChIP-Seq profiles for a 
specific region, two `vector` containing the profiles values, as reported in 
reads per million (RPM) for each position of the selected region, are needed. 
Both `vector` should have the same length and should not contain any negative 
value.

Aligned sequences are usually stored in BAM files. As example, a slimmed BAM 
file (align1.bam) is selected as well as a specific region 
(chr1:172081530-172083529). Using BAM file and 
region information, represented by as a `GRanges` object, the coverage for the 
specified region is extracted using specialized Bioconductor packages. 

```{r bam_extract_coverage, collapse=TRUE} 
suppressMessages(library(GenomicAlignments))
suppressMessages(library(rtracklayer))
suppressMessages(library(Rsamtools))

bamFile01 <- system.file("extdata/align1.bam", package = "similaRpeak")

region <- GRanges(Rle(c("chr1")), IRanges(start = 172081530, end = 172083529), 
                strand= Rle(strand(c("*"))))

param <- ScanBamParam(which = region)

alignments01 <- readGAlignments(bamFile01, param = param)

coveragesRegion01 <- coverage(alignments01)[region]
coveragesRegion01
```

The `coverages01` can 
easily be transformed to a vector of numerical value to obtain the raw
ChIP-Seq profile for the selected region.

```{r iranges_to_vector, collapse=TRUE}
coveragesRegion01 <- as.numeric(coveragesRegion01$chr1)
length(coveragesRegion01)
summary(coveragesRegion01)
```

To convert the raw coverage to reads per million (RPM), the total number of 
reads present in the BAM file is needed to assign a weight at the `coverage` 
function.

```{r bam_count, collapse=TRUE}
param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery=FALSE))
count01 <- countBam(bamFile01, param = param)$records
coveragesRPMRegion01 <- coverage(alignments01, weight=1000000/count01)[region]
coveragesRPMRegion01 <- as.numeric(coveragesRPMRegion01$chr1)
length(coveragesRPMRegion01)
summary(coveragesRPMRegion01)
```

The read per millions values are quite low for the `coveragesRPMRegion01` 
because the original BAM file has been reduced in size to simplify the example.

Other examples are available on the worflows section of the 
[Bioconductor website](http://www.bioconductor.org/help/workflows/high-throughput-sequencing/ 
"high-throughput-sequencing").

Finally, the 
[metagene package](http://bioconductor.org/packages/release/bioc/html/metagene.html)
, available on [Bioconductor](http://bioconductor.org), can also be used to 
generate ChIP-Seq profiles. An example is available on 
[metagene wiki](https://github.com/CharlesJB/metagene/wiki/Extract-ChIP-Seq-profiles-using-metagene). 

## Metrics

### Metric versus Pseudometric

Mathematically, a metric is considered as a function that quantifies the 
similarity between two objects.The function must return zero when the two 
objects are perfectly similar (identity of indiscernibles) and a non-negative 
value when are dissimilar. 

The metrics present in the **similaRpeak** package do not strictly respect this
standard but can all be translated to pseudometrics. A pseudometric is a 
function d which satisfies the axioms for a metric, except that instead of the
identity of indiscernibles axiom, the metric must only return zero when it 
compare an object to itself.

By using the absolute value of the **DIFF_POS_MAX** metric, the definition of 
a pseudometric is formally respected. However, the respective position of the 
maximum peak of profiles is lost.

$$ |DIFF\_POS\_MAX| $$

By using the absolute value of the logarithm of the 
**RATIO_AREA**, **RATIO_MAX_MAX**, **RATIO_INTERSECT** and 
**RATIO_NORMALIZED_INTERSECT** metrics, the definition of a pseudometric is 
formally respected.

$$ |\log(RATIO)| $$ 
            
<br>

## Metrics Presentation

To ease comparison, the same ChIP-Seq profiles are used in each metric 
description section. Those are ChIP-Seq profiles of two histone 
post-transcriptional modifications linked to highly active enhancers H3K27ac 
(DCC accession: ENCFF000ASG) and H3K4me1 (DCC accession: ENCFF000ARY) from 
the Encyclopedia of DNA Elements (ENCODE) data (Dunham I et al. 2012).

Here is how to load the `demoProfiles` data used in following sections. The 
ChIP-Seq profiles of the enhancers H3K27ac and H3K4me1 for 4 specifics regions
are in reads per million (RPM).

```{r demo_profiles_loading, collapse=T}
data(demoProfiles)
str(demoProfiles)
```

<br>

### RATIO_AREA

The **RATIO_AREA** metric is the ratio between the profile areas. The first 
profile (`profile1` parameter) is always divided by the second profile 
(`profile2` parameter). `NA` is returned if minimal area threshold 
(`ratioAreaThreshold` parameter, default = 1) is not respected for at least 
one of the profiles.

The **RATIO_AREA** metric can be useful to detect regions with similar 
coverage even if the profiles are different. 

```{r ratio_area_graph, echo=FALSE, fig.width=11, fig.height=8}
par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(250, 17, "RATIO_AREA=1.03", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(375, 1000, "RATIO_AREA=0.06", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(285, 80, "RATIO_AREA=2.23", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 225, "RATIO_AREA=0.12", cex=1.5, col="red")
```


&nbsp; 

<p style="text-align: center; color: darkblue">
**similarity(profile1, profile2, ratioAreaThreshold = 1)**
</p>

```{r ratio_area_calculation, collapse=T}
metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_AREA

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                        demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_AREA

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                        demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_AREA

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                        demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_AREA
```

<br>

### DIFF_POS_MAX

The **DIFF_POS_MAX** metric is the difference between the maximal peaks 
positions. The difference is always the first profile value 
(`profile1` parameter) minus the second profile value 
(`profile2` parameter). `NA` is returned if minimal peak value is not 
respected. A profile can have more than one position with the maximum value. 
In that case, the median position is used. A threshold (`diffPosMaxTolerance` 
parameter) can be set to consider all positions within a certain range of the 
maximum value. A threshold (`diffPosMaxThresholdMaxDiff` parameter) can also 
be set to ensure that the distance between two maximum values is not too wide. 
When this distance is not respected, it is assumed that more than one peak is 
present in the profile and `NA` is returned. Finally, a threshold 
(`diffPosMaxThresholdMinValue` parameter) can be set to ensure that the 
maximum value is egal or superior to a minimal value. When this minimum value 
is not respected, it is assumed that no peak is present in the profile and 
`NA` is returned. 

The **DIFF_POS_MAX** metric can be useful to detect regions with shifted peaks.

```{r diff_pos_max, echo=FALSE, fig.width=11, fig.height=8}

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(250, 17, "DIFF_POS_MAX=-20", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(375, 1000, "DIFF_POS_MAX=-0.5", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(285, 80, "DIFF_POS_MAX=2.5", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 225, "DIFF_POS_MAX=2.5", cex=1.5, col="red")
```

<p style="text-align: center; color: darkblue">
**similarity(profile1, profile2, diffPosMaxTolerance = 0.01, diffPosMaxThresholdMaxDiff = 100, diffPosMaxThresholdMinValue = 1)**
</p>

```{r diff_pos_max_calculation, collapse=T}
metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$DIFF_POS_MAX

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                        demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$DIFF_POS_MAX

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                        demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$DIFF_POS_MAX

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                        demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$DIFF_POS_MAX
```

<br>

### RATIO_MAX_MAX 

The **RATIO_MAX_MAX** metric is the ratio between the peaks values 
in each profile. The first profile (`profile1` parameter) is always divided by 
the second profile (`profile2` parameter). `NA` if minimal peak value threshold 
(`ratioMaxMaxThreshold` parameter, default = 1) is not respected for at least 
one of the profiles.

The **RATIO_MAX_MAX** metric can be useful to detect regions with peaks with 
similar (or dissimilar) amplitude.

```{r ratio_max_max, echo=FALSE, fig.width=11, fig.height=8}

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(250, 17, "RATIO_MAX_MAX=0.95", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(380, 1000, "RATIO_MAX_MAX=0.06", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(290, 80, "RATIO_MAX_MAX=2.5", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(280, 225, "RATIO_MAX_MAX=0.12", cex=1.5, col="red")
```

&nbsp;

<p style="text-align: center; color: darkblue">
**similarity(profile1, profile2, ratioMaxMaxThreshold = 1)**
</p>

```{r ratio_max_max_calculation, collapse=T}
metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_MAX_MAX

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                        demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_MAX_MAX

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                        demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_MAX_MAX

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                        demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_MAX_MAX
```

<br>

### RATIO_INTERSECT 

The **RATIO_INTERSECT** metric is the ratio between the intersection area and 
the total area of the two profiles. `NA` if minimal area threshold 
(`ratioIntersectThreshold` parameter, default = 1) is not respected for the 
intersection area.

The **RATIO_INTERSECT** metric can be useful to detect regions with possibily 
similar profiles.

```{r ratio_intersect, echo=FALSE, fig.width=11, fig.height=8}

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(242, 17, "RATIO_INTERSECT=0.63", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(375, 1000, "RATIO_INTERSECT=0.06", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(288, 87, "RATIO_INTERSECT=", cex=1.5, col="red")
text(288, 75, "0.43", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage in reads per million (RPM)",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 225, "RATIO_INTERSECT=0.12", cex=1.5, col="red")
```

&nbsp;

<p style="text-align: center; color: darkblue">
**similarity(profile1, profile2, ratioIntersectThreshold = 1)**
</p>

```{r ratio_intersect_calculation, collapse=T}
metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                      demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_INTERSECT

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                      demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_INTERSECT

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                      demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_INTERSECT

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                      demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_INTERSECT
```

<br>

### RATIO_NORMALIZED_INTERSECT

The **RATIO_NORMALIZED_INTERSECT** metric is the ratio between the 
intersection area and the total area of the two normalized profiles.  The 
profiles are normalized by divinding them by their average value (total area 
of the profile divided by profile lenght). `NA` if minimal area threshold 
(`ratioNormalizedIntersectThreshold` parameter, default = 1) is not respected 
for the intersection area.

The **RATIO_NORMALIZED_INTERSECT** metric can be useful to detect regions with 
possibily similar profiles even if their have different amplitude.


```{r ratio_normalized_intersect, echo=FALSE, fig.width=11, fig.height=8}

par(mar = c(6,4,2,2))
par(mfrow = c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac*
        length(demoProfiles$chr2.70360770.70361098$H3K27ac)/
        sum(demoProfiles$chr2.70360770.70361098$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 3),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1*
        length(demoProfiles$chr2.70360770.70361098$H3K4me1)/
        sum(demoProfiles$chr2.70360770.70361098$H3K4me1, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 3))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(245, 2, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(245, 1.75, "INTERSECT=0.63", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac*
        length(demoProfiles$chr8.43092918.43093442$H3K27ac)/
        sum(demoProfiles$chr8.43092918.43093442$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 12),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1*
        length(demoProfiles$chr8.43092918.43093442$H3K4me1)/
        sum(demoProfiles$chr8.43092918.43093442$H3K4me1, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 12))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(370, 8, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(370, 7, "INTERSECT=0.89", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac*
        length(demoProfiles$chr3.73159773.73160145$H3K27ac)/
        sum(demoProfiles$chr3.73159773.73160145$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 8),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1*
        length(demoProfiles$chr3.73159773.73160145$H3K4me)/
        sum(demoProfiles$chr3.73159773.73160145$H3K4me, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 8))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(288, 5.5, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(288, 4, "INTERSECT=0.78", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac*
         length(demoProfiles$chr19.27739373.27739767$H3K27ac)/
         sum(demoProfiles$chr19.27739373.27739767$H3K27ac, na.rm=TRUE),
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 12),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1*
        length(demoProfiles$chr19.27739373.27739767$H3K4me1)/
        sum(demoProfiles$chr19.27739373.27739767$H3K4me1, na.rm=TRUE),
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)",  ylim=c(0, 12))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(275, 7, "RATIO_NORMALIZED_", cex=1.5, col="red")
text(275, 6, "INTERSECT=0.84", cex=1.5, col="red")
```

&nbsp;

<p style="text-align: center; color: darkblue">
**similarity(profile1, profile2, ratioNormalizedIntersectThreshold = 1)**
</p>

```{r ratio_normalized_intersect_calculation, collapse=TRUE}
metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                      demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                      demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                      demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                      demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$RATIO_NORMALIZED_INTERSECT
```

<br>

### SPEARMAN_CORRELATION

The **SPEARMAN_CORRELATION** metric is the Spearman's rho statistic calculated
usign the two profiles. `NA` if minimal standard deviation
(`spearmanCorrSDThreashold` parameter, default = 1e-8) is not respected for 
at least one of the profiles.

The **SPEARMAN_CORRELATION** assesses how well the relationship between the two
ChIP-Seq profiles can be described using a monotonic function. As the 
**RATIO_NORMALIZED_INTERSECT** metric, the **SPEARMAN_CORRELATION** 
metric can be useful to detect regions with possibily similar profiles even if 
their have different amplitude.


```{r spearman_corr_graphs, echo=FALSE, fig.width=11, fig.height=8}

par(mar=c(6,4,2,2))
par(mfrow=c(2,2)) 

plot(demoProfiles$chr2.70360770.70361098$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 25),
        main="chr2:70360770-70361098")
par(new=TRUE)
plot(demoProfiles$chr2.70360770.70361098$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 25))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(225, 17, "SPEARMAN_CORRELATION", cex=1.5, col="red")
text(225, 14, "=0.06", cex=1.5, col="red")

plot(demoProfiles$chr8.43092918.43093442$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 1500),
        main="chr8:43092918-43093442")
par(new=TRUE)
plot(demoProfiles$chr8.43092918.43093442$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 1500))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(325, 1000, "SPEARMAN_CORRELATION", cex=1.5, col="red")
text(325, 850, "=0.82", cex=1.5, col="red")

plot(demoProfiles$chr3.73159773.73160145$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 125),
        main="chr3:73159773-73160145")
par(new=TRUE)
plot(demoProfiles$chr3.73159773.73160145$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 125))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(288, 92, "SPEARMAN_", cex=1.5, col="red")
text(288, 78, "CORRELATION=0.95", cex=1.5, col="red")

plot(demoProfiles$chr19.27739373.27739767$H3K27ac,
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 350),
        main="chr19:27739373-27739767")
par(new=TRUE)
plot(demoProfiles$chr19.27739373.27739767$H3K4me1,
        type="l", col="darkgreen", xlab="Position", 
        ylab="Coverage",  ylim=c(0, 350))
legend("topright", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
text(270, 225, "SPEARMAN_CORRELATION", cex=1.5, col="red")
text(270, 192, "=0.62", cex=1.5, col="red")

```

&nbsp;

<p style="text-align: center; color: darkblue">
**similarity(profile1, profile2, spearmanCorrSDThreashold = 1e-8)**
</p> 

```{r spearman_correlation_calculation, collapse=TRUE}
metrics <- similarity(demoProfiles$chr2.70360770.70361098$H3K27ac, 
                      demoProfiles$chr2.70360770.70361098$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION

metrics <- similarity(demoProfiles$chr8.43092918.43093442$H3K27ac, 
                      demoProfiles$chr8.43092918.43093442$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION

metrics <- similarity(demoProfiles$chr3.73159773.73160145$H3K27ac, 
                      demoProfiles$chr3.73159773.73160145$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION

metrics <- similarity(demoProfiles$chr19.27739373.27739767$H3K27ac, 
                      demoProfiles$chr19.27739373.27739767$H3K4me1)
metrics$metric$SPEARMAN_CORRELATION
```

<br>

## Using similaRpeak on real ChIP-Seq profiles

Highly active enhancer regions are thought to be important for the cell fate 
(Andersson et al. 2014, FANTOM5 consortium, Hnisz et al. 2013). Highly active 
enhancers regions have been selected in GM12878 cells. Similarity of ChIP-seq 
profiles has been tested using two histone post-transcriptional modifications 
linked to highly active enhancers H3K27ac (DCC accession: ENCFF000ASG) and 
H3K4me1 (DCC accession: ENCFF000ARY) from the Encyclopedia of DNA Elements 
(ENCODE) data (Dunham I et al. 2012). Accordingly with the literature, 
similarity between the profiles of these two histone marks has been identified.

First, the `similaRpeak` package must be loaded.

```{r libraryLoad}
suppressMessages(library(similaRpeak))
```

A region, chr7:61968807-61969730, shows interesting profiles for both histones.
Let's load the data for this region.

```{r profiles, collapse=TRUE}
data(chr7Profiles)
str(chr7Profiles)
```

H3K27ac and H3K4me1 profiles have those shapes:

```{r graphProfiles, echo=FALSE, fig.align='center', fig.height=6 }
plot(chr7Profiles$chr7.61968807.61969730$H3K27ac, type="l", col="blue", 
        xlab="", ylab="", ylim=c(0, 700), main="chr7:61968807-61969730")
par(new=TRUE)
plot(chr7Profiles$chr7.61968807.61969730$H3K4me1, type="l", col="darkgreen", 
        xlab="Position", ylab="Coverage in reads per million (RPM)", 
        ylim=c(0, 700))
legend("topleft", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
```

The metrics are calculated using the `similarity` function which takes as 
arguments the two ChIP-Seq profiles vectors and the threshold values.

```{r metricCalculation}
metrics <- similarity(chr7Profiles$chr7.61968807.61969730$H3K27ac, 
                            chr7Profiles$chr7.61968807.61969730$H3K4me1, 
                            ratioAreaThreshold=5, 
                            ratioMaxMaxThreshold=2, 
                            ratioIntersectThreshold=5, 
                            ratioNormalizedIntersectThreshold=2,
                            diffPosMaxThresholdMinValue=10, 
                            diffPosMaxThresholdMaxDiff=100, 
                            diffPosMaxTolerance=0.01)
```

The `similarity` function returns a list which contains the general 
information about both ChIP-Seq profiles and a list of all calculated metrics. 

```{r metricReturn, collapse=TRUE}
metrics
```

Each specific information can be directly accessed. Some examples:

```{r getInfo, collapse=TRUE}
metrics$areaProfile1
metrics$areaProfile2
metrics$metrics$RATIO_INTERSECT
```

The **RATIO_INTERSECT** value of `r round(metrics$metrics$RATIO_INTERSECT, 2)` 
and the **RATIO_MAX_MAX** value of `r round(metrics$metrics$RATIO_MAX_MAX, 2)` 
are quite low. Both values can be explained by the large difference in 
coverage between profiles. Those values could be interpreted as two profiles 
with low level of similarity. However, the **RATIO_NORMALIZED_INTERSECT** of 
`r round(metrics$metrics$RATIO_NORMALIZED_INTERSECT, 2)` is much closer to 1. 
It could be a sign that the profiles, once normalized, are quite similar.
This hypothesis can be validated by looking at a graph of the normalized
profiles :

```{r graphProfilesNorm, echo=FALSE, fig.align='center', fig.height=6 }
plot(chr7Profiles$chr7.61968807.61969730$H3K27ac*
        length(chr7Profiles$chr7.61968807.61969730$H3K27ac)/
        sum(chr7Profiles$chr7.61968807.61969730$H3K27ac, na.rm=TRUE), 
        type="l", col="blue", xlab="", ylab="", ylim=c(0, 3.5))
par(new=TRUE)
plot(chr7Profiles$chr7.61968807.61969730$H3K4me1*
        length(chr7Profiles$chr7.61968807.61969730$H3K4me1)/
        sum(chr7Profiles$chr7.61968807.61969730$H3K4me1, na.rm=TRUE), 
        type="l", col="darkgreen", xlab="Position", 
        ylab="Normalized Coverage (Coverage/Mean Coverage)", 
        ylim=c(0, 3.5))
legend("topleft", c("H3K27ac","H3K4me1"), cex=1.2, 
        col=c("blue","darkgreen"), lty=1)
```


## Metrics calculation using a MetricFactory object
 
It is possible to create only one selected metric by using the 
**MetricFactory** object (with the possibility of specifying the thresholds) 
and by passing the name of the metric to create 
(**RATIO_AREA**, **DIFF_POS_MAX**, **RATIO_MAX_MAX**, **RATIO_INTERSECT** or 
**RATIO_NORMALIZED_INTERSECT**):

```{r factory}
factory = MetricFactory$new(diffPosMaxTolerance=0.04)
```

The factory has to be iniatized only once and can be used as many times as 
necessary. It can calculate the same metrics but with different profiles or 
different metrics with same profiles as long as the thresholds values 
remain the same:

```{r factoryDemo, collapse=TRUE }
ratio_max_max <- factory$createMetric(metricType="RATIO_MAX_MAX", 
                        profile1=demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        profile2=demoProfiles$chr2.70360770.70361098$H3K4me1)

ratio_max_max

ratio_normalized_intersect <- factory$createMetric(
                        metricType="RATIO_NORMALIZED_INTERSECT",
                        profile1=demoProfiles$chr2.70360770.70361098$H3K27ac, 
                        profile2=demoProfiles$chr2.70360770.70361098$H3K4me1)

ratio_normalized_intersect
```


## Session info

Here is the output of `sessionInfo()` on the system on which this document was 
compiled:

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

## References

Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, et al. (2014) 
An atlas of active enhancers across human cell types and tissues. 
Nature, 507(7493), 455-461.

Dunham I, Kundaje A, Aldred SF, et al. An integrated encyclopedia of DNA 
elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74.

Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, et al. (2014) A 
promoter-level mammalian expression atlas. Nature, 507(7493):462-470.

Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-Andr&eacute; V, et al. (2013) 
Super-enhancers in the control of cell identity and disease. Cell, 155(4), 
934-947.