---
title: "Introduction to nullranges"
output:
  rmarkdown::html_document
bibliography: library.bib
vignette: |
  %\VignetteIndexEntry{0. Introduction to nullranges}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The *nullranges* package contains functions for generation of feature
sets (genomic regions) for exploring the null hypothesis of overlap or 
colocalization of two observed feature sets.

The package has two branches of functionality: *matching* or
*bootstrapping* to generate null feature sets. The decision about
which approach to use is ultimately up to the bioinformatics analyst.
Here we describe the two different approaches briefly. For a listing
of all the vignettes in the package, one can type:

```{r eval=FALSE}
vignette(package="nullranges")
```

## Related work

For general considerations of generation of null feature sets or
segmentation for enrichment or colocalization analysis, consider the
papers of @de_2014, @haiminen_2007,
@huen_2010, and @kanduri_2019 (with links in references below).
Other Bioconductor packages that offer randomization techniques for 
enrichment analysis include 
[LOLA](https://bioconductor.org/packages/LOLA) [@LOLA] and 
[regioneR](https://bioconductor.org/packages/regioneR) [@regioneR]. 
Methods implemented outside of Bioconductor include 
[GAT](https://github.com/AndreasHeger/gat) [@GAT],
[GSC](https://www.encodeproject.org/software/gsc/) [@bickel_2010],
[GREAT](http://bejerano.stanford.edu/great/public/html/) [@GREAT],
[GenometriCorr](https://github.com/favorov/GenometriCorr) [@GenometriCorr],
or [ChIP-Enrich](http://chip-enrich.med.umich.edu/) [@ChIP-Enrich].
We note that our block bootstrapping approach closely follows that of 
[GSC](https://www.encodeproject.org/software/gsc/), while offering
additional features/visualizations, and is re-implemented within
R/Bioconductor with efficient vectorized code for operation on 
*GRanges* objects [@granges].

## Brief description of methods

Suppose we want to examine the significance of overlaps
of genomic sets of features $x$ and $y$. To test the significance of
this overlap, we calculate the overlap expected under the null by
generating a null feature set $y'$ (potentially many times). The null
features in $y'$ may be characterized by:

1. Drawing from a larger pool $z$ ($y' \subset z$), such that $y$ and
   $y'$ have a similar distribution over one or more covariates. This
   is the "matching" case. Note that the features in $y'$ are original
   features, just drawn from a different pool than y.
2. Generating a new set of genomic features $y'$, constructing them
   from the original set $y$ by selecting blocks of the genome with
   replacement, i.e. such that features can be sampled more than once.
   This is the "bootstrapping" case. Note that, in this case, $y'$ is an
   artificial feature set, although the re-sampled features can retain
   covariates such as score from the original feature set $y$.

## In other words

1. Matching -- drawing from a pool of features but controlling for 
   certain characteristics
2. Bootstrapping -- placing a number of artificial features in the 
   genome but controlling for their spatial distribution

## Options and features

We provide a number of vignettes to describe the different matching
and bootstrapping use cases. In the matching case, we have implemented
a number of options, including nearest neighbor matching or
rejection sampling based matching. In the bootstrapping case, we have
implemented options for bootstrapping across or within chromosomes, and
bootstrapping only within states of a segmented genome. We also
provide a function to segment the genome by density of features. For
example, supposing that $x$ is a subset of genes, we may want to
generate $y'$ from $y$ such that features are re-sampled in blocks
from segments across the genome with similar gene density.  
In both cases, we provide a number of functions for performing quality
control via visual inspection of diagnostic plots.

## Consideration of excluded regions

Finally, we recommend to incorporate list of regions where artificial
features should *not* be placed, including the ENCODE Exclusion List
[@encode_exclude]. This and other excluded ranges are made available
in the *excluderanges* Bioconductor package by Mikhail Dozmorov. 
Use of excluded ranges is demonstrated in the segmented block bootstrap
vignette.

# References