--- title: "Case study I: CTCF occupancy" author: "Eric S. Davis" date: "`r format(Sys.Date(), '%m/%d/%Y')`" output: rmarkdown::html_document: highlight: tango toc: true toc_float: true fig_width: 5 fig_height: 3 vignette: | %\VignetteIndexEntry{2. Case study I: CTCF occupancy} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- In this vignette we demonstrate generating covariate-matched, null-hypothesis GRanges using the `matchRanges()` function to test for the occupancy of CCCTC-binding factor (CTCF) at chromatin loop anchors. ## Background One of the fundamental principles of chromatin-looping suggests that most loops are bound at both ends by the CTCF transcription factor (TF). CTCF-bound loops can be formed by loop-extrusion, where the ring-like cohesin complex extrudes chromatin until stopped by bound CTCF. By this mechanism, we expect most loop anchors will be bound by CTCF. While we could test this hypothesis by simple overlap or permutation testing, these approaches fail to account for non-uniformly distributed covariate genomic features. For example, loop anchors are commonly bound by CTCF and located in open chromatin regions. We can use `matchRanges()` to test for CTCF occupancy at loop anchors controlling for open chromatin regions. Here, we generate a set of null-hypothesis GRanges to more rigorously test CTCF occupancy at loop anchors independently from open chromatin regions. We use the `hg19_10kb_bins` dataset from the `nullrangesData` package, which contains ranges for every 10Kb bin along the genome with CTCF, DNase, and loop feature annotations from GM12878 (see `?nullrangesData::hg19_10kb_bins`). ```{r, message=FALSE, warning=FALSE, echo=FALSE, fig.width=8.5, fig.height=6.5} ## Define colors colors <- c("#e19995", "#adaf64", "#4fbe9b", "#6eb3d9", "#d098d7") ## Create artificial GRanges library(GenomicRanges) set.seed(5) pool <- GRanges(seqnames = "chr1", ranges = IRanges(start = sample(1:800, 120, replace = TRUE), width = sample(25:200, 120, replace = TRUE)), color = sample(1:5, 120, replace = TRUE)) focal <- GRanges(seqnames = "chr1", ranges = IRanges(start = sample(1:800, 16, replace = TRUE), width = sample(25:200, 16, replace = TRUE)), color = sample(1:5, 16, replace = TRUE)) ## Add width to metadata pool$length <- width(pool) focal$length <- width(focal) ## Match ranges library(nullranges) set.seed(123) x <- matchRanges(focal = focal, pool = pool, covar = ~color + length, method = 'n', replace = TRUE) ## Visualize sets library(plotgardener) library(grid) set.seed(123) pageCreate(width = 8.5, height = 6.5, showGuides = FALSE, xgrid = 0, ygrid = 0) ## Define common parameters p <- pgParams(chrom = "chr1", chromstart = 1, chromend = 1000) ## Pool set poolSet <- plotRanges(data = pool, params = p, x = 1, y = 1, width = 2.5, height = 2.5, fill = colors, colorby = colorby("color")) annoGenomeLabel(plot = poolSet, x = 1, y = 3.55) plotText(label = "Pool Set", x = 2.25, y = 0.9, just = c("center", "bottom"), fontcolor = "#33A02C", fontface = "bold", fontfamily = 'mono') ## Focal set focalSet <- plotRanges(data = focal, params = p, x = 5, y = 1, width = 2.5, height = 1, fill = colors, colorby = colorby("color")) annoGenomeLabel(plot = focalSet, x = 5, y = 2.05) plotText(label = "Focal Set", x = 6.25, y = 0.9, just = c("center", "bottom"), fontcolor = "#1F78B4", fontface = "bold", fontfamily = 'mono') ## Matched set matchedSet <- plotRanges(data = matched(x), params = p, x = 5, y = 2.5, width = 2.5, height = 1, fill = colors, colorby = colorby("color")) annoGenomeLabel(plot = matchedSet, x = 5, y = 3.55) plotText(label = "Matched Set", x = 6.25, y = 2.75, just = c("center", "bottom"), fontcolor = "#A6CEE3", fontface = "bold", fontfamily = 'mono') ## Arrow and matchRanges label plotSegments(x0 = 3.5, y0 = 3, x1 = 5, y1 = 3, arrow = arrow(type = "closed", length = unit(0.1, "inches")), fill = "black", lwd = 2) plotText(label = "matchRanges()", fontfamily = 'mono', x = 4.25, y = 2.9, just = c("center", "bottom")) ## Matching plots library(ggplot2) smallText <- theme(legend.title = element_text(size=8), legend.text=element_text(size=8), title = element_text(size=8), axis.title.x = element_text(size=8), axis.title.y = element_text(size=8)) plot1 <- plotPropensity(x, sets=c('f','m','p')) + smallText + theme(legend.key.size = unit(0.5, 'lines'), title = element_blank()) plot2 <- plotCovariate(x=x, covar=covariates(x)[1], sets=c('f','m','p')) + smallText + theme(legend.text = element_blank(), legend.position = 'none') plot3 <- plotCovariate(x=x, covar=covariates(x)[2], sets=c('f','m','p'))+ smallText + theme(legend.key.size = unit(0.5, 'lines')) ## Propensity scores plotText(label = "plotPropensity()", x = 1.10, y = 4.24, just = c("left", "bottom"), fontface = "bold", fontfamily = 'mono') plotText(label = "~color + length", x = 1.25, y = 4.5, just = c("left", "bottom"), fontsize = 10, fontfamily = "mono") plotGG(plot = plot1, x = 1, y = 4.5, width = 2.5, height = 1.5, just = c("left", "top")) ## Covariate balance plotText(label = "plotCovariate()", x = 3.75, y = 4.24, just = c("left", "bottom"), fontface = "bold", fontfamily = "mono") plotText(label = covariates(x), x = c(4, 5.9), y = 4.5, just = c("left", "bottom"), fontsize = 10, fontfamily = "mono") plotGG(plot = plot2, x = 3.50, y = 4.5, width = 1.8, height = 1.5, just = c("left", "top")) plotGG(plot = plot3, x = 5.30, y = 4.5, width = 2.75, height = 1.5, just = c("left", "top")) ``` ## Matching with `matchRanges()` Before we generate our null ranges, let's take a look at our example dataset: ```{r, message=FALSE, warning=FALSE} library(nullrangesData) ## Load example data bins <- hg19_10kb_bins() bins ``` `matchRanges()` works by selecting a set of covariate-matched controls from a pool of options based on an input focal set of interest. Here, we define `focal` as bins that contain a loop anchor, `pool` as bins that don't contain a loop anchor, and `covar` as DNase signal and number of DNase sites per bin: ```{r} library(nullranges) ## Match ranges set.seed(123) mgr <- matchRanges(focal = bins[bins$looped], pool = bins[!bins$looped], covar = ~dnaseSignal + n_dnase_sites) mgr ``` When the focal and pool arguments are `GRanges` objects, `matchRanges()` returns a `MatchedGRanges` object. The `MatchedGRanges` class extends `GRanges`, so all of the same operations can be applied: ```{r, message=FALSE, warning=FALSE} library(GenomicRanges) library(plyranges) library(ggplot2) ## Summarize ctcfSignal by n_ctcf_sites mgr %>% group_by(n_ctcf_sites) %>% summarize(ctcfSignal = mean(ctcfSignal)) %>% as.data.frame() %>% ggplot(aes(x = n_ctcf_sites, y = ctcfSignal)) + geom_line() + geom_point(shape = 21, stroke = 1, fill = 'white') + theme_minimal() + theme(panel.border = element_rect(color = 'black', fill = NA)) ``` Here, we utilize the [`plyranges` package](https://sa-lee.github.io/plyranges/) which provides a set of "tidy" verbs for manipulating genomic ranges for a seamless and integrated genomic analysis workflow. ## Assessing quality of matching We can get a quick summary of the matching quality with `overview()`: ```{r} overview(mgr) ``` For continuous covariates (such as `dnaseSignal`), `overview()` shows the mean and standard deviation between each matched set. For categorical covariates, such as `n_dnase_sites`, `overview()` reports the number of observations per category and matched set. The bottom section shows the mean and s.d (or n, for factors) difference between focal and matched sets. `overview()` also summarizes the propensity scores for each set to give a quick idea of overall matching quality. ### Visualizing matching results Let's visualize overall matching quality by plotting propensity scores for the focal, pool, and matched sets: ```{r, message=FALSE} plotPropensity(mgr, sets = c('f', 'p', 'm'), type = 'ridges') ``` From this plot, it is clear that the matched set is much closer to the focal set than the pool set. We can ensure that covariate distributions have been matched appropriately by using the `covariates()` function to extract matched covariates along with `patchwork` and `plotCovarite` to visualize all distributions: ```{r, message=FALSE, warning=FALSE, fig.height=6, fig.width=5} library(patchwork) plots <- lapply(covariates(mgr), plotCovariate, x=mgr, sets = c('f', 'm', 'p')) Reduce('/', plots) ``` ## Compare CTCF sites Using our matched ranges, we can compare CTCF occupancy in bins that 1) contain a loop anchor (i.e. looped), 2) don't contain a loop anchor (i.e. unlooped), or 3) don't contain a loop anchor, but are also matched for the strength and number of DNase sites (i.e. matched). In this case, we calculate CTCF occupancy as the percent of bins that contain CTCF among our 3 sets by using the `focal()` and `pool()` accessor functions: ```{r fig.width=4.5, fig.height=5} ## Percent of bins with CTCF g1 <- (sum(focal(mgr)$n_ctcf_sites >= 1) / length(focal(mgr))) * 100 g2 <- (sum(pool(mgr)$n_ctcf_sites >= 1) / length(pool(mgr))) * 100 g3 <- (sum(mgr$n_ctcf_sites >= 1) / length(mgr)) * 100 ## Visualize barplot(height = c(g1, g2, g3), names = c('looped\n(focal)', 'unlooped\n(pool)', 'unlooped\n(matched)'), ylab = "CTCF occupied bins (%)", col = c('#1F78B4', '#33A02C', '#A6CEE3'), main = 'CTCF occupancy', border = NA, las = 1) ``` # Session information ```{r} sessionInfo() ```