---
title: "ODER: Optimising the Definition of Expressed Regions"
author: 
  - name: Emmanuel Olagbaju
    affiliation:
    - UCL
    email: e.olagbaju@ucl.ac.uk
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('ODER')`"
vignette: >
  %\VignetteIndexEntry{Introduction to ODER}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
) # Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ODER = citation("ODER")[1]
)
```

# Basics

## Install `ODER`

`R` is an open-source statistical environment which can be easily modified to 
enhance its functionality via packages. `r Biocpkg("ODER")` is a `R` package 
available via the [Bioconductor](http://bioconductor.org) repository for 
packages. `R` can be installed on any operating system from 
[CRAN](https://cran.r-project.org/) after which you can install 
`r Biocpkg("ODER")` by using the following commands in your `R` session:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("eolagbaju/ODER")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```

## Required knowledge

The expected input of `r Biocpkg("ODER")` is coverage in the form of BigWig 
files as well as, depending on the functionalility required by a specific user,
junctions in form of a `RangedSummarizedExperiments`. 

`r Biocpkg("ODER")` is based on many other packages and in particular in those 
that have implemented the infrastructure needed for dealing with RNA-seq data. 
The `r Biocpkg("GenomicRanges")` package is heavily used in `r Biocpkg("ODER")` 
while other packages like `r Biocpkg("SummarizedExperiment")` and 
`r Biocpkg("derfinder")`. Previous experience with these packages will help in 
the comprehension and use of `r Biocpkg("ODER")`.

If you are asking yourself the question "Where do I start using Bioconductor?" 
you might be interested in 
[this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). 
If you find the structure of a `r Biocpkg("SummarizedExperiment")` unclear, you 
might consider checking out [this manual](http://master.bioconductor.org/help/course-materials/2019/BSS2019/04_Practical_CoreApproachesInBioconductor.html).

## Asking for help

As package developers, we try to explain clearly how to use our packages and in
which order to use the functions. But `R` and `Bioconductor` have a steep 
learning curve so it is critical to learn where to ask for help. The blog post 
quoted above mentions some but we would like to highlight the 
[Bioconductor support site](https://support.bioconductor.org/) as the main 
resource for getting help: remember to use the `ODER` tag and check 
[the older posts](https://support.bioconductor.org/t/ODER/). Other alternatives 
are available such as creating GitHub issues and tweeting. However, please note
that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is 
particularly critical that you provide a small reproducible example and your 
session information so package developers can track down the source of the error.

## Citing `ODER`

We hope that `r Biocpkg("ODER")` will be useful for your research. Please use 
the following information to cite the package and the overall approach. Thank you!

```{r "citation"}
## Citation info
citation("ODER")
```

# Background 

`ODER` is a packaged form of the method developed in the Zhang et al. 2020 publication: 
[Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders](https://advances.sciencemag.org/content/6/24/eaay8299.full). 
The overarching aim of `ODER` is use RNA-sequencing data to define regions of 
unannotated expression (ERs) in the genome, optimise their definition, then link
them to known genes. 

`ODER` builds upon the method defined in `r Biocpkg("derfinder")` by improving 
the definition of ERs in a few ways. Firstly, rather than being a fixed value, 
the coverage cut off is optimised based on a set of user-inputted, gold-standard
exons for a given set of samples. Secondly, `ODER` introduces a second 
optimisation parameter, the max region gap, which determines the number of 
base-pairs of gap between which ERs are merged. Thirdly, ERs can be connected to
known genes through junction reads. This aids the interpretation of ERs and also
allows their definition to be refined further using the intersection between ERs
and junctions. For more detailed methods, please see the methods section of the 
[original publication](https://www.science.org/doi/10.1126/sciadv.aay8299). 

Unannotated ERs can provide evidence for the existence and location of novel 
exons, which are yet to be added within reference annotation. Improving the 
completeness of reference annotation can aid the interpretation of genetic 
variation. For example, the output of `ODER` can help to better interpret 
non-coding genetics variants that are found in the genome of Mendelian disease 
patients, poetentially leading to improvements in diagnosis rates.

# Quick start to using to `ODER`

From the top-level ODER consists of 4 core functions, which are broken down 
internally into several smaller helper functions. These functions are expected 
to be run sequentially in the order presented below:

1. `ODER()` - Takes as input coverage in the form of BigWig files. Uses 
`r Biocpkg("derfinder")` to call contigous blocks of expression that we term 
expressed regions or ERs. ER definitions are optimised across a pair of 
parameters the mean coverage cut-off (MCC) and the max region gap (MRG) with 
respect to a user-inputted set of gold standard exons. The set of ERs for the 
optimised MCC and MRG are returned. 
2. `annotatER()` - Takes as input the optimised set of ERs and a set of 
junctions. Finds overlaps between the ERs and junctions, thereby annotating ERs 
with the gene associated with it's corresponding junction. Also categorises ERs 
into "exon", "intron", "intergenic" or any combination of these three categories
depending on the ERs overlap with existing annotation. 
3. `refine_ers()` - Takes as input the optimised set of ERs annotated with 
junctions. Refines the ER definition based on the intersection between the ER 
and it's overlapping junction. 
4. `get_count_matrix()` - Takes as input any set of `GenomicRanges` and a set of
`BigWig` files. Returns a `RangedSummarizedExperiment` with `assays` containing 
the average coverage across each range. This function is intended to obtain the
coverage across ERs to allow usage in downstream analyses such as differential 
expression. 

## Example

This is a basic example to show how you can use ODER. First, we need to download
the example `BigWig` data required as input for `ODER`. 

```{r load_data, eval = requireNamespace('ODER')}

library("ODER")
library("magrittr")

# Download recount data in the form of BigWigs
gtex_metadata <- recount::all_metadata("gtex")

gtex_metadata <- gtex_metadata %>%
    as.data.frame() %>%
    dplyr::filter(project == "SRP012682")

url <- recount::download_study(
    project = "SRP012682",
    type = "samples",
    download = FALSE
)

# file_cache is an internal ODER function to cache files for
# faster repeated loading
bw_path <- file_cache(url[1])

bw_path
```

To get the optimum set of ERs from a BigWig file we can use the `ODER()` 
function.This will obtain the optimally defined ERs by finding the combination
of MCC and MRG parameters that gives the lowest exon delta between the ERs and 
the inputted gold-standard exons. The MCC is minimum read depth that a base pair
needs to have to be considered expressed. The MRG is the maximum number of base 
pairs between reads that fall below the MCC before you would not include it as 
part of the expressed region. Internally, gold-standard exons are obtained by 
finding the non-overlapping exons from the inputted reference annotation. 

In this example, we demonstrate `ODER()` on a single unstranded `Bigwig`. 
However, in reality, it is likely that you will want to optimise the ER 
definitions across multiple `BigWigs`. It is worth noting that the arguments 
`bw_pos` and `bw_neg` in `ODER()` allow for the input of stranded `BigWigs`. 

```{r ODER}

# load reference annotation from Ensembl
gtf_url <- paste0(
    "http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens",
    "/Homo_sapiens.GRCh38.103.chr.gtf.gz"
)
gtf_path <- file_cache(gtf_url)
gtf_gr <- rtracklayer::import(gtf_path)
# As of rtracklayer 1.25.16, BigWig is not supported on Windows.
if (!xfun::is_windows()) {
    opt_ers <- ODER(
        bw_paths = bw_path, auc_raw = gtex_metadata[["auc"]][1],
        auc_target = 40e6 * 100, chrs = c("chr21"),
        genome = "hg38", mccs = c(2, 4, 6, 8, 10), mrgs = c(10, 20, 30),
        gtf = gtf_gr, ucsc_chr = TRUE, ignore.strand = TRUE,
        exons_no_overlap = NULL, bw_chr = "chr"
    )
}
```

Once we have the obtained the optimised set of ERs, we may consider plotting the
exon delta across the various MCCs and MRGs. This can be useful to check the 
error associated with the definition of the set of optimised ERs. This error is 
measured through two metrics; the median exon delta and the number of ERs with 
exon delta equal to 0. The median exon delta represents the overall accuracy of 
all ER definitions, whereas the number of ERs with exon delta equal to 0 
indicates the extent to which ER definitions precisely match overlapping 
gold-standard exon boundaries. 

```{r plot_example}
# visualise the spread of mccs and mrgs
if (!xfun::is_windows()) { # uses product of ODER
    plot_ers(opt_ers[["deltas"]], opt_ers[["opt_mcc_mrg"]])
}
```

Next, we will use `annotatERs()` to find the overlap between the ERs and 
junctions. Furthermore, `annotatERs()` will also classify ERs by their overlap
with existing reference annotation into the categories; "exon", "intron" and 
"intergenic". This can be helpful for two reasons. Primarily, junctions can be 
used to inform which gene the ER is associated to. This gene-level association 
can be useful multiple downstream applications, such as novel exon discovery. 
Furthermore, the category of ER, in terms of whether it overlaps a exonic, 
intronic or intergenic region, can help determine whether the ER represents 
novel expression. For example, ERs solely overlapping intronic or intergenic 
regions and associated with a gene can be the indication of the expression of 
an unannotated exon. 

To note, it is recommended that the inputted junctions are derived from the same
samples or tissue as the `BigWig` files used to define ERs.

```{r annotatER_example}
library(utils)
# running only chr21 to reduce runtime
chrs_to_keep <- c("21")

# prepare the txdb object to create a genomic state
# based on https://support.bioconductor.org/p/93235/
hg38_chrominfo <- GenomeInfoDb::getChromInfoFromUCSC("hg38")

new_info <- hg38_chrominfo$size[match(
    chrs_to_keep,
    GenomeInfoDb::mapSeqlevels(hg38_chrominfo$chrom, "Ensembl")
)]

names(new_info) <- chrs_to_keep
gtf_gr_tx <- GenomeInfoDb::keepSeqlevels(gtf_gr,
    chrs_to_keep,
    pruning.mode = "tidy"
)

GenomeInfoDb::seqlengths(gtf_gr_tx) <- new_info
GenomeInfoDb::seqlevelsStyle(gtf_gr_tx) <- "UCSC"
GenomeInfoDb::genome(gtf_gr_tx) <- "hg38"

ucsc_txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf_gr_tx)
genom_state <- derfinder::makeGenomicState(txdb = ucsc_txdb)

# convert UCSC chrs format to Ensembl to match the ERs
ens_txdb <- ucsc_txdb
GenomeInfoDb::seqlevelsStyle(ens_txdb) <- "Ensembl"

# lung_junc_21_22 is an example data set of junctions
# obtained from recount3, derived from the lung tissue
# run annotatERs()
data(lung_junc_21_22, package = "ODER")
if (!xfun::is_windows()) { # uses product of ODER
    annot_ers <- annotatERs(
        opt_ers = head(opt_ers[["opt_ers"]], 100),
        junc_data = lung_junc_21_22,
        genom_state = genom_state,
        gtf = gtf_gr,
        txdb = ens_txdb
    )

    # print first 5 ERs
    annot_ers[1:5]
}
```

After we have annotated ERs with the overlapping junctions, optionally we can 
use `refine_ers()` to refine the starts and ends of the ERs based on the 
overlapping junctions. This will filter ERs for those which have either a single
or two non-intersecting junctions overlapping. For the remaining ERs, 
`refine_ers()` will shave the ER definitions to the exon boundaries matching the
overlapping junctions. This can be useful for downstream applications whereby 
the accuracy of the ER definition is extremely important. For example, the 
interpretion of variants in diagnostic setting. 

```{r refine_ERs_example}
if (!xfun::is_windows()) { # uses product of ODER
    refined_ers <- refine_ERs(annot_ers)

    refined_ers
}
```

Finally, we can generate an ER count matrix with `get_count_matrix()`. This 
function can flexibly be run at any stage of the `ODER` pipeline and it requires
a set of `GenomicRanges` and `BigWig` paths as input. `get_count_matrix()` will 
return a `RangedSummarizedExperiment` which has `assays` filled with the mean 
coverage across each inputted range. Internally, `get_count_matrix()` relies on 
`r Biocpkg("megadepth")` to obtain coverage from `BigWigs` therefore 
`megadepth::install_megadepth()` must be executed at least once on the user's 
system before `get_count_matrix()`. 

```{r get_count_matrix_example}

# create sample metadata containing identifiers for each BigWig
run <- gtex_metadata[["run"]][[1]]
col_info <- as.data.frame(run)

# install megadepth
megadepth::install_megadepth()
if (!xfun::is_windows()) { # uses product of ODER
    er_count_matrix <- get_count_matrix(
        bw_paths = bw_path, annot_ers = annot_ers,
        cols = col_info
    )

    er_count_matrix
}
```

# Reproducibility

The `r Biocpkg("ODER")` package `r Citep(bib[["ODER"]])` was made possible 
thanks to:

-   R `r Citep(bib[["R"]])`
-   `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
-   `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`
-   `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])`
-   `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])`
-   `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])`
-   `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])`

This package was developed using `r BiocStyle::Biocpkg("biocthis")`.

Date the vignette was generated.

```{r reproduce1, echo=FALSE}
## Date the vignette was generated
Sys.time()
```

Wallclock time spent generating the vignette.

```{r reproduce2, echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)
```

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```