--- title: "ODER: Optimising the Definition of Expressed Regions" author: - name: Emmanuel Olagbaju affiliation: - UCL email: e.olagbaju@ucl.ac.uk output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('ODER')`" vignette: > %\VignetteIndexEntry{Introduction to ODER} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ) # Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], ODER = citation("ODER")[1] ) ``` # Basics ## Install `ODER` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg("ODER")` is a `R` package available via the [Bioconductor](http://bioconductor.org) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg("ODER")` by using the following commands in your `R` session: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("eolagbaju/ODER") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Required knowledge The expected input of `r Biocpkg("ODER")` is coverage in the form of BigWig files as well as, depending on the functionalility required by a specific user, junctions in form of a `RangedSummarizedExperiments`. `r Biocpkg("ODER")` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. The `r Biocpkg("GenomicRanges")` package is heavily used in `r Biocpkg("ODER")` while other packages like `r Biocpkg("SummarizedExperiment")` and `r Biocpkg("derfinder")`. Previous experience with these packages will help in the comprehension and use of `r Biocpkg("ODER")`. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). If you find the structure of a `r Biocpkg("SummarizedExperiment")` unclear, you might consider checking out [this manual](http://master.bioconductor.org/help/course-materials/2019/BSS2019/04_Practical_CoreApproachesInBioconductor.html). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `ODER` tag and check [the older posts](https://support.bioconductor.org/t/ODER/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `ODER` We hope that `r Biocpkg("ODER")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r "citation"} ## Citation info citation("ODER") ``` # Background `ODER` is a packaged form of the method developed in the Zhang et al. 2020 publication: [Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders](https://advances.sciencemag.org/content/6/24/eaay8299.full). The overarching aim of `ODER` is use RNA-sequencing data to define regions of unannotated expression (ERs) in the genome, optimise their definition, then link them to known genes. `ODER` builds upon the method defined in `r Biocpkg("derfinder")` by improving the definition of ERs in a few ways. Firstly, rather than being a fixed value, the coverage cut off is optimised based on a set of user-inputted, gold-standard exons for a given set of samples. Secondly, `ODER` introduces a second optimisation parameter, the max region gap, which determines the number of base-pairs of gap between which ERs are merged. Thirdly, ERs can be connected to known genes through junction reads. This aids the interpretation of ERs and also allows their definition to be refined further using the intersection between ERs and junctions. For more detailed methods, please see the methods section of the [original publication](https://www.science.org/doi/10.1126/sciadv.aay8299). Unannotated ERs can provide evidence for the existence and location of novel exons, which are yet to be added within reference annotation. Improving the completeness of reference annotation can aid the interpretation of genetic variation. For example, the output of `ODER` can help to better interpret non-coding genetics variants that are found in the genome of Mendelian disease patients, poetentially leading to improvements in diagnosis rates. # Quick start to using to `ODER` From the top-level ODER consists of 4 core functions, which are broken down internally into several smaller helper functions. These functions are expected to be run sequentially in the order presented below: 1. `ODER()` - Takes as input coverage in the form of BigWig files. Uses `r Biocpkg("derfinder")` to call contigous blocks of expression that we term expressed regions or ERs. ER definitions are optimised across a pair of parameters the mean coverage cut-off (MCC) and the max region gap (MRG) with respect to a user-inputted set of gold standard exons. The set of ERs for the optimised MCC and MRG are returned. 2. `annotatER()` - Takes as input the optimised set of ERs and a set of junctions. Finds overlaps between the ERs and junctions, thereby annotating ERs with the gene associated with it's corresponding junction. Also categorises ERs into "exon", "intron", "intergenic" or any combination of these three categories depending on the ERs overlap with existing annotation. 3. `refine_ers()` - Takes as input the optimised set of ERs annotated with junctions. Refines the ER definition based on the intersection between the ER and it's overlapping junction. 4. `get_count_matrix()` - Takes as input any set of `GenomicRanges` and a set of `BigWig` files. Returns a `RangedSummarizedExperiment` with `assays` containing the average coverage across each range. This function is intended to obtain the coverage across ERs to allow usage in downstream analyses such as differential expression. ## Example This is a basic example to show how you can use ODER. First, we need to download the example `BigWig` data required as input for `ODER`. ```{r load_data, eval = requireNamespace('ODER')} library("ODER") library("magrittr") # Download recount data in the form of BigWigs gtex_metadata <- recount::all_metadata("gtex") gtex_metadata <- gtex_metadata %>% as.data.frame() %>% dplyr::filter(project == "SRP012682") url <- recount::download_study( project = "SRP012682", type = "samples", download = FALSE ) # file_cache is an internal ODER function to cache files for # faster repeated loading bw_path <- file_cache(url[1]) bw_path ``` To get the optimum set of ERs from a BigWig file we can use the `ODER()` function.This will obtain the optimally defined ERs by finding the combination of MCC and MRG parameters that gives the lowest exon delta between the ERs and the inputted gold-standard exons. The MCC is minimum read depth that a base pair needs to have to be considered expressed. The MRG is the maximum number of base pairs between reads that fall below the MCC before you would not include it as part of the expressed region. Internally, gold-standard exons are obtained by finding the non-overlapping exons from the inputted reference annotation. In this example, we demonstrate `ODER()` on a single unstranded `Bigwig`. However, in reality, it is likely that you will want to optimise the ER definitions across multiple `BigWigs`. It is worth noting that the arguments `bw_pos` and `bw_neg` in `ODER()` allow for the input of stranded `BigWigs`. ```{r ODER} # load reference annotation from Ensembl gtf_url <- paste0( "http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens", "/Homo_sapiens.GRCh38.103.chr.gtf.gz" ) gtf_path <- file_cache(gtf_url) gtf_gr <- rtracklayer::import(gtf_path) # As of rtracklayer 1.25.16, BigWig is not supported on Windows. if (!xfun::is_windows()) { opt_ers <- ODER( bw_paths = bw_path, auc_raw = gtex_metadata[["auc"]][1], auc_target = 40e6 * 100, chrs = c("chr21"), genome = "hg38", mccs = c(2, 4, 6, 8, 10), mrgs = c(10, 20, 30), gtf = gtf_gr, ucsc_chr = TRUE, ignore.strand = TRUE, exons_no_overlap = NULL, bw_chr = "chr" ) } ``` Once we have the obtained the optimised set of ERs, we may consider plotting the exon delta across the various MCCs and MRGs. This can be useful to check the error associated with the definition of the set of optimised ERs. This error is measured through two metrics; the median exon delta and the number of ERs with exon delta equal to 0. The median exon delta represents the overall accuracy of all ER definitions, whereas the number of ERs with exon delta equal to 0 indicates the extent to which ER definitions precisely match overlapping gold-standard exon boundaries. ```{r plot_example} # visualise the spread of mccs and mrgs if (!xfun::is_windows()) { # uses product of ODER plot_ers(opt_ers[["deltas"]], opt_ers[["opt_mcc_mrg"]]) } ``` Next, we will use `annotatERs()` to find the overlap between the ERs and junctions. Furthermore, `annotatERs()` will also classify ERs by their overlap with existing reference annotation into the categories; "exon", "intron" and "intergenic". This can be helpful for two reasons. Primarily, junctions can be used to inform which gene the ER is associated to. This gene-level association can be useful multiple downstream applications, such as novel exon discovery. Furthermore, the category of ER, in terms of whether it overlaps a exonic, intronic or intergenic region, can help determine whether the ER represents novel expression. For example, ERs solely overlapping intronic or intergenic regions and associated with a gene can be the indication of the expression of an unannotated exon. To note, it is recommended that the inputted junctions are derived from the same samples or tissue as the `BigWig` files used to define ERs. ```{r annotatER_example} library(utils) # running only chr21 to reduce runtime chrs_to_keep <- c("21") # prepare the txdb object to create a genomic state # based on https://support.bioconductor.org/p/93235/ hg38_chrominfo <- GenomeInfoDb::getChromInfoFromUCSC("hg38") new_info <- hg38_chrominfo$size[match( chrs_to_keep, GenomeInfoDb::mapSeqlevels(hg38_chrominfo$chrom, "Ensembl") )] names(new_info) <- chrs_to_keep gtf_gr_tx <- GenomeInfoDb::keepSeqlevels(gtf_gr, chrs_to_keep, pruning.mode = "tidy" ) GenomeInfoDb::seqlengths(gtf_gr_tx) <- new_info GenomeInfoDb::seqlevelsStyle(gtf_gr_tx) <- "UCSC" GenomeInfoDb::genome(gtf_gr_tx) <- "hg38" ucsc_txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf_gr_tx) genom_state <- derfinder::makeGenomicState(txdb = ucsc_txdb) # convert UCSC chrs format to Ensembl to match the ERs ens_txdb <- ucsc_txdb GenomeInfoDb::seqlevelsStyle(ens_txdb) <- "Ensembl" # lung_junc_21_22 is an example data set of junctions # obtained from recount3, derived from the lung tissue # run annotatERs() data(lung_junc_21_22, package = "ODER") if (!xfun::is_windows()) { # uses product of ODER annot_ers <- annotatERs( opt_ers = head(opt_ers[["opt_ers"]], 100), junc_data = lung_junc_21_22, genom_state = genom_state, gtf = gtf_gr, txdb = ens_txdb ) # print first 5 ERs annot_ers[1:5] } ``` After we have annotated ERs with the overlapping junctions, optionally we can use `refine_ers()` to refine the starts and ends of the ERs based on the overlapping junctions. This will filter ERs for those which have either a single or two non-intersecting junctions overlapping. For the remaining ERs, `refine_ers()` will shave the ER definitions to the exon boundaries matching the overlapping junctions. This can be useful for downstream applications whereby the accuracy of the ER definition is extremely important. For example, the interpretion of variants in diagnostic setting. ```{r refine_ERs_example} if (!xfun::is_windows()) { # uses product of ODER refined_ers <- refine_ERs(annot_ers) refined_ers } ``` Finally, we can generate an ER count matrix with `get_count_matrix()`. This function can flexibly be run at any stage of the `ODER` pipeline and it requires a set of `GenomicRanges` and `BigWig` paths as input. `get_count_matrix()` will return a `RangedSummarizedExperiment` which has `assays` filled with the mean coverage across each inputted range. Internally, `get_count_matrix()` relies on `r Biocpkg("megadepth")` to obtain coverage from `BigWigs` therefore `megadepth::install_megadepth()` must be executed at least once on the user's system before `get_count_matrix()`. ```{r get_count_matrix_example} # create sample metadata containing identifiers for each BigWig run <- gtex_metadata[["run"]][[1]] col_info <- as.data.frame(run) # install megadepth megadepth::install_megadepth() if (!xfun::is_windows()) { # uses product of ODER er_count_matrix <- get_count_matrix( bw_paths = bw_path, annot_ers = annot_ers, cols = col_info ) er_count_matrix } ``` # Reproducibility The `r Biocpkg("ODER")` package `r Citep(bib[["ODER"]])` was made possible thanks to: - R `r Citep(bib[["R"]])` - `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` - `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` - `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` - `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` - `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])` - `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])` This package was developed using `r BiocStyle::Biocpkg("biocthis")`. Date the vignette was generated. ```{r reproduce1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproduce2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ```