---
title: "Case study I: CTCF occupancy"
author: "Eric S. Davis"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
output:
  rmarkdown::html_document:
    highlight: tango
    toc: true
    toc_float: true	
    fig_width: 5
    fig_height: 3
vignette: |
  %\VignetteIndexEntry{2. Case study I: CTCF occupancy}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

In this vignette we demonstrate generating covariate-matched,
null-hypothesis GRanges using the `matchRanges()` function to test for
the occupancy of CCCTC-binding factor (CTCF) at chromatin loop
anchors. 

## Background

One of the fundamental principles of chromatin-looping suggests that
most loops are bound at both ends by the CTCF transcription factor
(TF). CTCF-bound loops can be formed by loop-extrusion, where the
ring-like cohesin complex extrudes chromatin until stopped by bound
CTCF. By this mechanism, we expect most loop anchors will be bound by
CTCF. 

While we could test this hypothesis by simple overlap or permutation
testing, these approaches fail to account for non-uniformly
distributed covariate genomic features. For example, loop anchors are
commonly bound by CTCF and located in open chromatin regions. We can
use `matchRanges()` to test for CTCF occupancy at loop anchors
controlling for open chromatin regions. 

Here, we generate a set of null-hypothesis GRanges to more rigorously
test CTCF occupancy at loop anchors independently from open chromatin
regions. We use the `hg19_10kb_bins` dataset from the
`nullrangesData` package, which contains ranges for every 10Kb bin
along the genome with CTCF, DNase, and loop feature annotations from
GM12878 (see `?nullrangesData::hg19_10kb_bins`). 

```{r, message=FALSE, warning=FALSE, echo=FALSE, fig.width=8.5, fig.height=6.5}
## Define colors
colors <- c("#e19995", "#adaf64", "#4fbe9b", "#6eb3d9", "#d098d7")

## Create artificial GRanges
library(GenomicRanges)
set.seed(5)
pool <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = sample(1:800, 120, replace = TRUE),
                                 width = sample(25:200, 120, replace = TRUE)),
                color = sample(1:5, 120, replace = TRUE))
focal <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = sample(1:800, 16, replace = TRUE),
                                  width = sample(25:200, 16, replace = TRUE)),
                 color = sample(1:5, 16, replace = TRUE))

## Add width to metadata
pool$length <- width(pool)
focal$length <- width(focal)

## Match ranges
library(nullranges)
set.seed(123)
x <- matchRanges(focal = focal,
                 pool = pool,
                 covar = ~color + length,
                 method = 'n', replace = TRUE)

## Visualize sets
library(plotgardener)
library(grid)
set.seed(123)
pageCreate(width = 8.5, height = 6.5, showGuides = FALSE, xgrid = 0, ygrid = 0)

## Define common parameters
p <- pgParams(chrom = "chr1", chromstart = 1, chromend = 1000)

## Pool set
poolSet <- plotRanges(data = pool, params = p,
                         x = 1, y = 1, width = 2.5, height = 2.5,
                         fill = colors,
                         colorby = colorby("color"))
annoGenomeLabel(plot = poolSet, x = 1, y = 3.55)
plotText(label = "Pool Set",
            x = 2.25, y = 0.9,
            just = c("center", "bottom"),
            fontcolor = "#33A02C",
            fontface = "bold",
            fontfamily = 'mono')

## Focal set
focalSet <- plotRanges(data = focal, params = p,
                          x = 5, y = 1, width = 2.5, height = 1,
                          fill = colors,
                          colorby = colorby("color"))
annoGenomeLabel(plot = focalSet, x = 5, y = 2.05)
plotText(label = "Focal Set",
            x = 6.25, y = 0.9,
            just = c("center", "bottom"),
            fontcolor = "#1F78B4",
            fontface = "bold",
            fontfamily = 'mono')


## Matched set
matchedSet <- plotRanges(data = matched(x), params = p,
                            x = 5, y = 2.5, width = 2.5, height = 1,
                            fill = colors,
                            colorby = colorby("color"))
annoGenomeLabel(plot = matchedSet, x = 5, y = 3.55)
plotText(label = "Matched Set",
            x = 6.25, y = 2.75,
            just = c("center", "bottom"),
            fontcolor = "#A6CEE3",
            fontface = "bold",
            fontfamily = 'mono')


## Arrow and matchRanges label
plotSegments(x0 = 3.5, y0 = 3,
                x1 = 5, y1 = 3,
                arrow = arrow(type = "closed", length = unit(0.1, "inches")),
                fill = "black", lwd = 2)
plotText(label = "matchRanges()", fontfamily = 'mono',
            x = 4.25, y = 2.9, just = c("center", "bottom"))


## Matching plots
library(ggplot2)
smallText <- theme(legend.title = element_text(size=8),
                   legend.text=element_text(size=8),
                   title = element_text(size=8),
                   axis.title.x = element_text(size=8),
                   axis.title.y = element_text(size=8))

plot1 <-
  plotPropensity(x, sets=c('f','m','p')) +
  smallText +
  theme(legend.key.size = unit(0.5, 'lines'),
        title = element_blank())

plot2 <-
  plotCovariate(x=x, covar=covariates(x)[1], sets=c('f','m','p')) +
  smallText +
  theme(legend.text = element_blank(),
        legend.position = 'none')
  
plot3 <-
  plotCovariate(x=x, covar=covariates(x)[2], sets=c('f','m','p'))+
  smallText + 
  theme(legend.key.size = unit(0.5, 'lines'))


## Propensity scores
plotText(label = "plotPropensity()",
            x = 1.10, y = 4.24,
            just = c("left", "bottom"),
            fontface = "bold",
            fontfamily = 'mono')
plotText(label = "~color + length",
            x = 1.25, y = 4.5,
            just = c("left", "bottom"),
            fontsize = 10,
            fontfamily = "mono")
plotGG(plot = plot1,
          x = 1, y = 4.5, width = 2.5, height = 1.5,
          just = c("left", "top"))

## Covariate balance
plotText(label = "plotCovariate()",
            x = 3.75, y = 4.24,
            just = c("left", "bottom"),
            fontface = "bold",
            fontfamily = "mono")
plotText(label = covariates(x),
            x = c(4, 5.9), y = 4.5,
            just = c("left", "bottom"),
            fontsize = 10,
            fontfamily = "mono")
plotGG(plot = plot2,
          x = 3.50, y = 4.5, width = 1.8, height = 1.5,
          just = c("left", "top"))

plotGG(plot = plot3,
          x = 5.30, y = 4.5, width = 2.75, height = 1.5,
          just = c("left", "top"))
```

## Matching with `matchRanges()`

Before we generate our null ranges, let's take a look at our example dataset:

```{r, message=FALSE, warning=FALSE}
library(nullrangesData)

## Load example data
bins <- hg19_10kb_bins()

bins
```

`matchRanges()` works by selecting a set of covariate-matched controls
from a pool of options based on an input focal set of interest. Here,
we define `focal` as bins that contain a loop anchor, `pool` as bins
that don't contain a loop anchor, and `covar` as DNase signal and
number of DNase sites per bin: 

```{r}
library(nullranges)

## Match ranges
set.seed(123)
mgr <- matchRanges(focal = bins[bins$looped],
                   pool = bins[!bins$looped],
                   covar = ~dnaseSignal + n_dnase_sites)
mgr
```

When the focal and pool arguments are `GRanges` objects,
`matchRanges()` returns a `MatchedGRanges` object. The
`MatchedGRanges` class extends `GRanges`, so all of the same
operations can be applied: 

```{r, message=FALSE, warning=FALSE}
library(GenomicRanges)
library(plyranges)
library(ggplot2)

## Summarize ctcfSignal by n_ctcf_sites
mgr %>%
  group_by(n_ctcf_sites) %>%
  summarize(ctcfSignal = mean(ctcfSignal)) %>%
  as.data.frame() %>%
  ggplot(aes(x = n_ctcf_sites, y = ctcfSignal)) +
    geom_line() +
    geom_point(shape = 21, stroke = 1,  fill = 'white') +
    theme_minimal() +
    theme(panel.border = element_rect(color = 'black',
                                      fill = NA))
```

Here, we utilize
the [`plyranges` package](https://sa-lee.github.io/plyranges/) which
provides a set of "tidy" verbs for manipulating genomic ranges for a
seamless and integrated genomic analysis workflow. 

## Assessing quality of matching

We can get a quick summary of the matching quality with `overview()`:

```{r}
overview(mgr)
```

For continuous covariates (such as `dnaseSignal`), `overview()` shows
the mean and standard deviation between each matched set. For
categorical covariates, such as `n_dnase_sites`, `overview()` reports
the number of observations per category and matched set. The bottom
section shows the mean and s.d (or n, for factors) difference between
focal and matched sets. 

`overview()` also summarizes the propensity scores for each set to
give a quick idea of overall matching quality. 

### Visualizing matching results

Let's visualize overall matching quality by plotting propensity scores
for the focal, pool, and matched sets: 

```{r, message=FALSE}
plotPropensity(mgr, sets = c('f', 'p', 'm'), type = 'ridges')
```

From this plot, it is clear that the matched set is much closer to the focal set than the pool set.

We can ensure that covariate distributions have been matched
appropriately by using the `covariates()` function to extract matched
covariates along with `patchwork` and `plotCovarite` to visualize all
distributions: 

```{r, message=FALSE, warning=FALSE, fig.height=6, fig.width=5}
library(patchwork)
plots <- lapply(covariates(mgr), plotCovariate, x=mgr, sets = c('f', 'm', 'p'))
Reduce('/', plots)
```

## Compare CTCF sites

Using our matched ranges, we can compare CTCF occupancy in bins that
1) contain a loop anchor (i.e. looped), 2) don't contain a loop anchor
(i.e. unlooped), or 3) don't contain a loop anchor, but are also
matched for the strength and number of DNase sites (i.e. matched). In
this case, we calculate CTCF occupancy as the percent of bins that
contain CTCF among our 3 sets by using the `focal()` and `pool()`
accessor functions: 

```{r fig.width=4.5, fig.height=5}
## Percent of bins with CTCF
g1 <- (sum(focal(mgr)$n_ctcf_sites >= 1) / length(focal(mgr))) * 100
g2 <- (sum(pool(mgr)$n_ctcf_sites >= 1) / length(pool(mgr))) * 100
g3 <- (sum(mgr$n_ctcf_sites >= 1) / length(mgr)) * 100

## Visualize
barplot(height = c(g1, g2, g3),
        names = c('looped\n(focal)', 'unlooped\n(pool)', 'unlooped\n(matched)'),
        ylab = "CTCF occupied bins (%)",
        col = c('#1F78B4', '#33A02C', '#A6CEE3'),
        main = 'CTCF occupancy',
        border = NA,
        las = 1)
```

# Session information

```{r}
sessionInfo()
```