--- title: "recount3 quick start guide" author: - name: Leonardo Collado-Torres affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus - &ccb Center for Computational Biology, Johns Hopkins University email: lcolladotor@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('recount3')`" vignette: > %\VignetteIndexEntry{recount3 quick start guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r 'knitr_options', include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("knitcitations") ## Load knitcitations with a clean bibliography cleanbib() cite_options(hyperlink = "to.doc", citation_format = "text", style = "html") # Note links won't show for now due to the following issue # https://github.com/cboettig/knitcitations/issues/63 ## Write bibliography information bib <- c( R = citation(), BiocFileCache = citation("BiocFileCache")[1], BiocStyle = citation("BiocStyle")[1], knitcitations = citation("knitcitations")[1], knitr = citation("knitr")[3], recount3 = citation("recount3")[1], recount3paper = citation("recount3")[2], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], SummarizedExperiment = RefManageR::BibEntry( bibtype = "manual", key = "SummarizedExperiment", author = "Martin Morgan and Valerie Obenchain and Jim Hester and Hervé Pagès", title = "SummarizedExperiment: SummarizedExperiment container", year = 2019, doi = "10.18129/B9.bioc.SummarizedExperiment" ), testthat = citation("testthat")[1], covr = citation("covr")[1], RefManageR = citation("RefManageR")[1], pryr = citation("pryr")[1], interactiveDisplayBase = citation("interactiveDisplayBase")[1], rtracklayer = citation("rtracklayer")[1], S4Vectors = citation("S4Vectors")[1], httr = citation("httr")[1], data.table = citation("data.table")[1], R.utils = citation("R.utils")[1], Matrix = citation("Matrix")[1], GenomicRanges = citation("GenomicRanges")[1], recount2paper = citation("recount")[1], recount2workflow = citation("recount")[2], recount1paper = citation("recount")[5], recountbrain = citation("recount")[6], recount2fantom = citation("recount")[7], bioconductor2015 = RefManageR::BibEntry( bibtype = "Article", key = "bioconductor2015", author = "Wolfgang Huber and Vincent J Carey and Robert Gentleman and Simon Anders and Marc Carlson and Benilton S Carvalho and Hector Corrada Bravo and Sean Davis and Laurent Gatto and Thomas Girke and Raphael Gottardo and Florian Hahne and Kasper D Hansen and Rafael A Irizarry and Michael Lawrence and Michael I Love and James MacDonald and Valerie Obenchain and Andrzej K Oles and Hervé Pagès and Alejandro Reyes and Paul Shannon and Gordon K Smyth and Dan Tenenbaum and Levi Waldron and Martin Morgan", title = "Orchestrating high-throughput genomic analysis with Bioconductor", year = 2015, doi = "10.1038/nmeth.3252", journal = "Nature Methods", journaltitle = "Nat Methods" ) ) write.bibtex(bib, file = "quickstartRef.bib") ``` # Overview The `r Biocpkg('recount3')` R/Bioconductor package is an interface to the recount3 project. recount3 provides uniformly processed RNA-seq data for hundreds of thousands of samples. The R package makes it possible to easily retrieve this data in standard Bioconductor containers, including _RangedSummarizedExperiment_. The sections on terminology and available data contains more detail on those subjects. The **main documentation website** for all the `recount3`-related projects is available at [**recount.bio**](https://LieberInstitute.github.io/recount3-docs). Please check that website for more information about how this R/Bioconductor package and other tools are related to each other. # Basics ## Installing `recount3` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('recount3')` is a `R` package available via the [Bioconductor](http://bioconductor/packages/recount3) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('recount3')` by using the following commands in your `R` session: ```{r 'install', eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("recount3") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` You can install the development version from [GitHub](https://github.com/) with: ```{r 'install_dev', eval = FALSE} BiocManager::install("LieberInstitute/recount3") ``` ## Required knowledge `r Biocpkg('recount3')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. A `r Biocpkg('recount3')` user will benefit from being familiar with `r Biocpkg('SummarizedExperiment')` to understand the objects `r Biocpkg('recount3')` generates. It might also prove to be highly beneficial to check the * `r Biocpkg('DESeq2')` and `r Biocpkg("limma")` packages for performing differential expression analysis with the _RangedSummarizedExperiment_ objects, * `r Biocpkg('DEFormats')` package for switching the objects to those used by other differential expression packages such as `r Biocpkg('edgeR')`, * `r Biocpkg('derfinder')` package for performing annotation-agnostic differential expression analysis. * `r Biocpkg('recount')` package which provided an R/Bioconductor interface to the data in the `recount2` project `r citep(list(bib[["recount2paper"]], bib[["recount2workflow"]]))`. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `recount3` tag and check [the older posts](https://support.bioconductor.org/tag/recount3/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `r Biocpkg('recount3')` We hope that `r Biocpkg('recount3')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r 'citation'} ## Citation info citation("recount3") ``` # Quick start After installing `r Biocpkg('recount3')` `r citep(bib[['recount3paper']])`, we need to load the package, which will automatically load the required dependencies. ```{r 'start', message=FALSE} ## Load recount3 R package library("recount3") ``` If you have identified a **study** of interest and want to access the gene level expression data, use `create_rse()` as shown below. `create_rse()` has arguments that will allow you to specify the **annotation** of interest for the given organism, and whether you want to download **gene**, **exon** or **exon-exon junction** expression data. ```{r 'quick_example'} ## Find all available human projects human_projects <- available_projects() ## Find the project you are interested in, ## here we use SRP009615 as an example proj_info <- subset( human_projects, project == "SRP009615" & project_type == "data_sources" ) ## Create a RangedSummarizedExperiment (RSE) object at the gene level rse_gene_SRP009615 <- create_rse(proj_info) ## Explore that RSE object rse_gene_SRP009615 ``` You can also interactively choose your study of interest ```{r "interactive_display", eval = FALSE} ## Note that you can interactively explore the available projects proj_info_interactive <- interactiveDisplayBase::display(human_projects) ## Select a single row, then hit "send". The following code checks this. stopifnot(nrow(proj_info_interactive) == 1) ## Then create the RSE object rse_gene_interactive <- create_rse(proj_info_interactive) ``` Once you have a RSE file, you can use `transform_counts()` to transform the raw coverage counts. ```{r "tranform_counts"} ## Once you have your RSE object, you can transform the raw coverage ## base-pair coverage counts using transform_counts(). ## For RPKM, TPM or read outputs, check the details in transform_counts(). assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615) ``` Now you are ready to continue with downstream analysis software. `r Biocpkg('recount3')` also supports accessing the BigWig raw coverage files as well as specific study or collection sample **metadata**. Please continue to the users guide for more detailed information. # Users guide `r Biocpkg('recount3')` `r citep(bib[['recount3paper']])` provides an interface for downloading the [recount3 raw files](https://LieberInstitute.github.io/recount3-docs/docs/raw-files.html) and building Bioconductor-friendly R objects `r citep(list(bib[["bioconductor2015"]], bib[["SummarizedExperiment"]]))` that can be used with many downstream packages. To achieve this, the raw data is organized by **study** from a specific **data source**. That same study can be a part of one or more **collections**, which is a manually curated set of studies with collection-specific sample metadata (see the [Data source vs collection](https://LieberInstitute.github.io/recount3-docs/docs/raw-files.html#data-source-vs-collection) for details). To get started with `r Biocpkg('recount3')`, you will need to identify the ID for the study of interest from either **human** or **mouse** for a particular **annotation** of interest. Once you have identified study, data source or collection, and annotation, `r Biocpkg('recount3')` can be used to build a `RangedSummarizedExperiment` object `r citep(bib[["SummarizedExperiment"]])` for either **gene**, **exon** or **exon-exon junction** expression feature data. Furthermore, `r Biocpkg('recount3')` provides access to the coverage BigWig files that can be quantified for custom set of genomic regions using `r Biocpkg('megadepth')`. Furthermore, `r Biocpkg('snapcount')` allows fast-queries for custom exon-exon junctions and other custom input. ## Available data `r Biocpkg('recount3')` provides access to most of the [`recount3` raw files](https://LieberInstitute.github.io/recount3-docs/docs/raw-files.html) in a form that is R/Bioconductor-friendly. As a summary of the data provided by the `recount3` project (Figure \@ref(fig:recountWorkflowFig1)), the main data files provided are: * **metadata**: information about the samples in `recount3`, which can come from sources such as the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) as well as `recount3` [quality metrics](https://LieberInstitute.github.io/recount3-docs/docs/quality-check-fields.html). * **gene**: RNA expression data quantified at the gene annotation level. This information is provided for multiple annotations that are organism-specific. Similar to the `recount2` project, the `recount3` project provides counts at the base-pair coverage level `r citep(bib[["recount2workflow"]])`. * **exon**: RNA expression data quantified at the exon annotation level. The data is also annotation-specific and the counts are also at the base-pair coverage level. * **exon-exon junctions**: RNA expression data quantified at the exon-exon junction level. This data is annotation-agnostic (it does not depend on the annotation) and is variable across each study because different sets of exon-exon junctions are measured in each study. * **bigWig**: RNA expression data in raw format at the base-pair coverage resolution. This raw data when coupled with a given annotation can be used to generate gene and exon level counts using software such as `r Biocpkg("megadepth")`. It enables exploring the RNA expression landscape in an annotation-agnostic way. ```{r "recountWorkflowFig1", out.width="100%", fig.align="center", fig.cap = "Overview of the data available in recount2 and recount3. Reads (pink boxes) aligned to the reference genome can be used to compute a base-pair coverage curve and identify exon-exon junctions (split reads). Gene and exon count matrices are generated using annotation information providing the gene (green boxes) and exon (blue boxes) coordinates together with the base-level coverage curve. The reads spanning exon-exon junctions (jx) are used to compute a third count matrix that might include unannotated junctions (jx 3 and 4). Without using annotation information, expressed regions (orange box) can be determined from the base-level coverage curve to then construct data-driven count matrices. DOI: < https://doi.org/10.12688/f1000research.12223.1>.", echo = FALSE} knitr::include_graphics("https://raw.githubusercontent.com/LieberInstitute/recountWorkflow/master/vignettes/Figure1.png") ``` ## Terminology Here we describe some of the common terminology and acronyms used throughout the rest of the documentation. `r Biocpkg('recount3')` enables creating `RangedSummarizedExperiment` objects that contain expression quantitative data (Figure \@ref(fig:recountWorkflowFig2)). As a quick overview, some of the main terms are: * **rse**: a `RangedSummarizedExperiment` object from `r Biocpkg("SummarizedExperiment")` `r citep(bib[["SummarizedExperiment"]])` that contains: * **counts**: a matrix with the expression feature data (either: gene, exon, or exon-exon junctions) and that can be accessed using `assays(counts)`. * **metadata**: a table with information about the samples and quality metrics that can be accessed using `colData(rse)`. * **annotation**: a table-like object with information about the expression features which can be annotation-specific (gene and exons) or annotation-agnostic (exon-exon junctions). This information can be accessed using `rowRanges(rse)`. ```{r recountWorkflowFig2, out.width="100%", fig.align="center", fig.cap = "recount2 and recount3 provide coverage count matrices in RangedSummarizedExperiment (rse) objects. Once the rse object has been downloaded and loaded into R (using `recount3::create_rse()` or `recount2::download_study()`), the feature information is accessed with `rowRanges(rse)` (blue box), the counts with assays(rse) (pink box) and the sample metadata with `colData(rse)` (green box). The sample metadata stored in `colData(rse)` can be expanded with custom code (brown box) matching by a unique sample identifier such as the SRA Run ID. The rse object is inside the purple box and matching data is highlighted in each box. DOI: < https://doi.org/10.12688/f1000research.12223.1>.", echo = FALSE} knitr::include_graphics("https://raw.githubusercontent.com/LieberInstitute/recountWorkflow/master/vignettes/Figure2.png") ``` `r Biocpkg('recount3')` enables accessing data from multiple reference organisms from public projects. To identify these projects, the key terms we use are: * **project**: the project ID, which is typically the same ID that is used externally to identify that project in databases like the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) ^[GTEx and TCGA are broken up by tissue as described later in this vignette.] * **project_home**: the location for where the project is located at in the main `recount3` data host: IDIES SciServer. We have two types of project homes: * **data_source**: this identifies where the raw data was downloaded from for creating `recount3`. * **collection**: this identifies a human-curated set of projects or samples. The main difference compared to a *data_source* is that a *collection* has collection-specific metadata. For example, someone might have created a table of information about samples in several projects by reading the papers describing these studies as in `recount-brain` `r citep(bib[["recountbrain"]])`. * **organism**: the reference organism (species) for the RNA-seq data, which is either _human_ (hg38) or _mouse_ (mm10). Many of the [`recount3` raw files](https://LieberInstitute.github.io/recount3-docs/docs/raw-files.html) include three columns that are used to identify each sample and that allow merging the data across these files. Those are: * **rail_id**: an internal ID used by the `recount` team. The name stems from `Rail-RNA` which was the aligner used for generated the data in `recount2` `r citep(bib[["recount2paper"]])`. * **external_id**: the external ID for the samle, such as the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) `run` ID. * **study**: the **project** ID. ## Find a study In order to access data from `r Biocpkg('recount3')`, the first step is to identify a **project** that you are interested in working with. Most of the project IDs are the ones you can find on the Sequence Read Archive ([SRA](https://www.ncbi.nlm.nih.gov/sra)). For example, [SRP009615](https://www.ncbi.nlm.nih.gov/sra/?term=SRP009615) which we use in the examples in this vignette.The exceptions are the Genotype-Expression and The Cancer Genome Atlas human studies, commonly known as [GTEx](https://gtexportal.org/home/) and [TCGA](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). Both GTEx and TCGA are available in `r Biocpkg('recount3')` by tissue. ### Through `available_projects()` While you can use external websites to find a study of interest, you can also use `available_projects()` to list the projects that are available in `r Biocpkg('recount3')` as shown below. This will return a `data.frame()` object that lists the unique project IDs. ```{r "human_projects"} human_projects <- available_projects() dim(human_projects) head(human_projects) ## Select a study of interest project_info <- subset( human_projects, project == "SRP009615" & project_type == "data_sources" ) project_info ``` Let's say that you are interested in the GTEx projects, you could then filter by `file_source`. We'll focus only on those entries that from a data source, and not from a collection for now. ```{r "gtex_projects"} subset(human_projects, file_source == "gtex" & project_type == "data_sources") ``` Note that one of the projects for GTEx is `STUDY_NA`, that's because in `r Biocpkg('recount3')` we processed all GTEx samples, including some that had no tissue assigned and were not used by the GTEx consortium. If you prefer to view the list of available studies interactively, you can do so with `r Biocpkg('interactiveDisplayBase')` as shown below. You'll want to assign the output of `interactiveDisplayBase::display()` to an object so you can save your selections and use them later. By doing so, you'll be able to select a study of interest, and save the information for later use after you hit the "send" button. ```{r "select_interactively", eval = FALSE} ## Alternatively, interactively browse the human projects, ## select one, then hit send selected_study <- interactiveDisplayBase::display(human_projects) ``` Ultimately, we need three pieces of information in order to download a specific dataset from `r Biocpkg('recount3')`. Those are: * **project**: the project ID, * **organism**: either human or mouse, * **project_home**: where the data lives which varies between data sources and collections. ```{r "display_proj_info"} project_info[, c("project", "organism", "project_home")] ``` ## Available annotations Now that we have identified our project of interest, the next step is to choose an annotation that we want to work with. The annotation files available depend on the organism. To facilitate finding the specific names we use in `r Biocpkg('recount3')`, we have provided the function `annotation_options()`. ```{r "annotations"} annotation_options("human") annotation_options("mouse") ``` The main sources are: * [GENCODE](https://www.gencodegenes.org/) project * [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/): NCBI Reference Sequence Database * FANTOM6_CAT which was first used with `recount2` `r citep(bib[["recount2fantom"]])` * [ERCC](https://www.nist.gov/programs-projects/external-rna-controls-consortium): External RNA Controls Consortium ^[These are 92 control spike-in sequences that are commonly used in bulk RNA-seq projects.], commercially available from [ThermoFisher Scientific](https://www.thermofisher.com/order/catalog/product/4456740#/4456740) * [SIRV](https://www.biorxiv.org/content/10.1101/080747v1) ^[69 controls sequences.], commercially available from [Lexogen](https://www.lexogen.com/sirvs/) In `r Biocpkg('recount3')` we have provided multiple annotations, which is different from `r Biocpkg('recount')` (`recount2` R/Bioconductor package) where all files were computed using GENCODE version 25. However, in both, you might be interested in quantifying your annotation of interest, as described further below in the BigWig files section. ## Build a RSE Once you have chosen an annotation and a project, you can now build a `RangedSummarizedExperiment` object `r citep(list(bib[["bioconductor2015"]], bib[["SummarizedExperiment"]]))`. To do so, we recommend using the `create_rse()` function as shown below for GENCODE v26 (the default annotation for human files). `create_rse()` internally uses several other `r Biocpkg('recount3')` functions for locating the necessary raw files, downloading them, reading them, and building the `RangedSummarizedExperiment` (RSE) object. `create_rse()` shows several status message updates that you can silence with `suppressMessages(create_rse())` if you want to. ```{r "rse_gencode_v26"} ## Create a RSE object at the gene level rse_gene_SRP009615 <- create_rse(project_info) ## Explore the resulting RSE gene object rse_gene_SRP009615 ``` ## Explore the RSE Because the RSE object is created at run-time in `r Biocpkg('recount3')`, unlike the static files provided by `r Biocpkg('recount')` for `recount2`, `create_rse()` stores information about how this RSE object was made under `metadata()`. This information is useful in case you share the RSE object and someone else wants to be able to re-make the object with the latest data ^[This new design allows us to couple the expression data with metadata on the fly, as well as have flexibility in case we uncover an error in the files.]. ```{r "metadata_rse"} ## Information about how this RSE object was made metadata(rse_gene_SRP009615) ``` The SRP009615 study was composed of 12 samples, for which we have 63,856 genes in GENCODE v26. The annotation-specific information is available through `rowRanges()` as shown below with the `gene_id` column used to identify genes in each of the annotations ^[Although ERCC and SIRV are technically not genes.]. ```{r "row_ranges"} ## Number of genes by number of samples dim(rse_gene_SRP009615) ## Information about the genes rowRanges(rse_gene_SRP009615) ``` ## Sample metadata The sample metadata provided in `r Biocpkg('recount3')` is much more extensive than the one in `r Biocpkg('recount')` for the `recount2` project because it's includes for [**quality control metrics**](https://LieberInstitute.github.io/recount3-docs/docs/quality-check-fields.html), predictions, and information used internally by `r Biocpkg('recount3')` functions such as `create_rse()`. All individual metadata tables include the columns **rail_id**, **external_id** and **study** which are used for merging the different tables. Finally, **BigWigUrl* provides the URL for the BigWig file for the given sample. ```{r "recount3_meta_cols"} ## Sample metadata recount3_cols <- colnames(colData(rse_gene_SRP009615)) ## How many are there? length(recount3_cols) ## View the first few ones head(recount3_cols) ## Group them by source recount3_cols_groups <- table(gsub("\\..*", "", recount3_cols)) ## Common sources and number of columns per source recount3_cols_groups[recount3_cols_groups > 1] ## Unique columns recount3_cols_groups[recount3_cols_groups == 1] ``` ```{r "explore_all", eval = FALSE} ## Explore them all recount3_cols ``` For studies from SRA, we can further extract the SRA attributes using `expand_sra_attributes()` as shown below. ```{r "expand attributes"} rse_gene_SRP009615_expanded <- expand_sra_attributes(rse_gene_SRP009615) colData(rse_gene_SRP009615_expanded)[, ncol(colData(rse_gene_SRP009615)):ncol(colData(rse_gene_SRP009615_expanded))] ``` ## Counts The counts in `r Biocpkg('recount3')` are raw base-pair coverage counts, similar to those in `r Biocpkg('recount')`. To further understand them, check the `r Biocpkg('recountWorkflow')` DOI [10.12688/f1000research.12223.1](https://doi.org/10.12688/f1000research.12223.1). To highlight that these are raw base-pair coverage counts, they are stored in the "raw_counts" assay ```{r "raw_counts"} assayNames(rse_gene_SRP009615) ``` Using `transform_counts()` you can scale the counts and assign them to the "counts" assays slot to use them in downstream packages such as `r Biocpkg('DESeq2')` and `r Biocpkg('limma')`. ```{r "scaling_counts"} ## Once you have your RSE object, you can transform the raw coverage ## base-pair coverage counts using transform_counts(). ## For RPKM, TPM or read outputs, check the details in transform_counts(). assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615) ``` Just like with `r Biocpkg('recount')` for `recount2`, you can transform the raw base-pair coverage counts `r citep(bib[["recount2workflow"]])` to read counts with `compute_read_counts()`, RPKM with `recount::getRPKM()` or TPM values with `recount::getTPM()`. Check `transform_counts()` from `r Biocpkg('recount3')` for more details. ## Exon `r Biocpkg('recount3')` provides an interface to [raw files](https://LieberInstitute.github.io/recount3-docs/docs/raw-files.html) that go beyond gene counts, as well as other features you might be interested in. For instance, you might want to study expression at the **exon** expression level instead of **gene**. To do so, use the `type` argument in `create_rse()` as shown below. ```{r "rse_exon"} ## Create a RSE object at the exon level rse_exon_SRP009615 <- create_rse( proj_info, type = "exon" ) ## Explore the resulting RSE exon object rse_exon_SRP009615 ## Explore the object dim(rse_exon_SRP009615) ``` Each exon is shown in this output, so, you might have to filter the exons of interest. Unlike in `recount2`, these are actual exons and not disjoint exons ^[Check the BigWig files section further below.]. ```{r "exon_ranges"} ## Exon annotation information rowRanges(rse_exon_SRP009615) ## Exon ids are repeated because a given exon can be part of ## more than one transcript length(unique(rowRanges(rse_exon_SRP009615)$exon_id)) ``` Because there are many more exons than genes, this type of analysis uses more computational resources. Thus, for some large projects you might need to use a high performance computing environment. To help you proceed with caution, `create_rse()` shows how many features and samples it's trying to access. So if you get an out of memory error, you'll know why that happened. ```{r "exon_memory"} ## Check how much memory the gene and exon RSE objects use pryr::object_size(rse_gene_SRP009615) pryr::object_size(rse_exon_SRP009615) ``` ## Exon-exon junctions In `r Biocpkg('recount3')` we have also provided the option to create RSE files for exon-exon junctions. Unlike the gene/exon RSE files, only the junctions present in a given project are included in the files, so you'll have to be more careful when merging exon-exon junction RSE files. Furthermore, these are actual read counts and not raw base-pair counts. Given the sparsity of the data, the counts are provided using a sparse matrix object from `r CRANpkg("Matrix")`. Thus exon-exon junction files can be less memory demanding than the exon RSE files. ```{r "rse_jxn"} ## Create a RSE object at the exon-exon junction level rse_jxn_SRP009615 <- create_rse( proj_info, type = "jxn" ) ## Explore the resulting RSE exon-exon junctions object rse_jxn_SRP009615 dim(rse_jxn_SRP009615) ## Exon-exon junction information rowRanges(rse_jxn_SRP009615) ## Memory used pryr::object_size(rse_jxn_SRP009615) ``` ## BigWig files Internally we used `GenomicFeatures::exonicParts()` when processing all annotations in `r Biocpkg('recount3')` instead of `GenomicRanges::disjoin()` that was used in `recount2`. We then re-assembled the counts for each exon/gene to create the count files provided in `r Biocpkg('recount3')`. However, you might want to exclude the overlapping exonic parts from the counts. If that's the case or if you are interested in specific regions of the `hg38`/`mm10` genomes, you might want to access the coverage BigWig files. ```{r "BigWigURLs"} ## BigWig file names ## The full URL is stored in BigWigUrl basename(head(rse_gene_SRP009615$BigWigURL)) ``` These BigWig files can be accessed using `rtracklayer::import.bw()` from R, or other tools such as [bwtool](https://github.com/CRG-Barcelona/bwtool/wiki) that we've used in the past ^[For example in *[recount.bwtool](https://github.com/LieberInstitute/recount.bwtool)* .]. Using them, you can compute a coverage matrix for a given set of regions. One new software we developed is `r Biocpkg("megadepth")` for which we have provided an R/Bioconductor package interface. `r Biocpkg("megadepth")` is faster at accessing BigWig files and is the software we used internally for generating the `recount3` data. `r Biocpkg("megadepth")` provides convenient to use functions for quantifying a set of regions, which might be of interest for co-expression analyses where double counting exonic parts can be problematic. You can also use `r Biocpkg("derfinder")` and `r Biocpkg("derfinderPlot")` if you are interested in visualizing the base-pair coverage data for a specific region using these BigWig coverage files. ## Local files `r Biocpkg('recount3')` depends on `r Biocpkg('BiocFileCache')` `r citep(bib[['BiocFileCache']])` for organizing the raw files and caching them, such that if you use the same file more than once, you only have to download it once. `r Biocpkg('BiocFileCache')` will automatically update the file if it detects that the file has changed in the source If you want to inspect which files you have downloaded or even delete them, them you'll want to use `recount3_cache_files()` and `recount3_cache_rm()` as illustrated below. ```{r "BiocFileCache"} ## List the URLs of the recount3 data that you have downloaded recount3_cache_files() ``` ```{r "remove_files", eval = FALSE} ## Delete the recount3 files from your cache recount3_cache_rm() ## Check that no files are listed recount3_cache_files() ``` ## Your own mirror `r Biocpkg('recount3')` functions such as `create_rse()` have a `recount3_url` argument that can be changed to point to a mirror or to a path in your computing system. This argument enables using `r Biocpkg('recount3')` with data stored in other locations, or even generated using the same processing pipeline that was used for `r Biocpkg('recount3')` but for your own/private data. The main documentation website documents how the raw files should be organized in your mirror or for your own data. You can inspect the structure of the data by checking the internals of `locate_url()` and `locate_url_ann()`. Both functions can be used to get the full list of URLs. In addition, for a given mirror, `available_projects()` will show the local data sources and collections. Finally, `file_retrieve()` won't cache the data if it detects that the data already exists in the local disk. In particular, this functionality can be useful if you want to access the data through [Registry of Open Data on AWS](https://registry.opendata.aws/recount/) or at [IDIES](https://idies.jhu.edu/) using [SciServer Compute](https://apps.sciserver.org/compute/), which are the two official mirrors for `recount3` data. ## Teams involved The `ReCount` family involves the following teams: * [Ben Langmead's lab](http://www.langmead-lab.org/) at JHU Computer Science * [Kasper Daniel Hansen's lab](https://www.hansenlab.org/) at JHBSPH Biostatistics Department * [Leonardo Collado-Torres](http://lcolladotor.github.io/) and [Andrew E. Jaffe](http://aejaffe.com/) from [LIBD](https://www.libd.org/) * [Abhinav Nellore's lab](http://nellore.bio/) at OHSU * [Jeff Leek's lab](http://jtleek.com/) at JHBSPH Biostatistics Deparment * Data hosted by the [Registry of Open Data on AWS](https://registry.opendata.aws/recount/) and [SciServer from IDIES at JHU](https://www.sciserver.org/) through a load balancer called [duffel](https://github.com/nellore/digitalocean-duffel). ## Project history To clarify the relationship between the R/Bioconductor packages and the phases of `ReCount` `r citep(bib[["recount1paper"]])` please check the table below: | Year | Phase | Main references | R/Bioconductor | | --- | --- | --- | --- | | 2011 | [`ReCount`](http://bowtie-bio.sourceforge.net/recount/) | DOI: [10.1186/1471-2105-12-449](https://doi.org/10.1186/1471-2105-12-449) | none | | 2017 | [`recount2`](https://jhubiostatistics.shinyapps.io/recount/) | DOI: [10.1038/nbt.3838](https://doi.org/10.1038/nbt.3838) [10.12688/f1000research.12223.1](https://doi.org/10.12688/f1000research.12223.1) | `r Biocpkg('recount')` | | 2021 | [`recount3`](https://LieberInstitute.github.io/recount3-docs) | DOI: [10.1186/s13059-021-02533-6](https://doi.org/10.1186/s13059-021-02533-6) | `r Biocpkg('recount3')` | ## Other related tools The `ReCount` team has worked on several software solutions and projects that complement each other and enable you to re-use public RNA-seq data. Another Bioconductor package that you might be highly interested in is `r Biocpkg("snapcount")`, which allows you to use the [Snaptron web services](http://snaptron.cs.jhu.edu/). In particular, `r Biocpkg("snapcount")` is best for queries over a particular subset of genes or intervals across all or most of the samples in `recount2`/`Snaptron`. We remind you that the **main documentation website** for all the `recount3`-related projects is available at [**recount.bio**](https://LieberInstitute.github.io/recount3-docs). Please check that website for more information about how this R/Bioconductor package and other tools are related to each other. **Thank you!** P.S. An [alternative version of this vignette is available](https://LieberInstitute.github.io/recount3/) that was made using `r CRANpkg("pkgdown")`. # Reproducibility The `r Biocpkg('recount3')` package `r citep(bib[['recount3']])` was made possible thanks to: * R `r citep(bib[['R']])` * `r Biocpkg('BiocFileCache')` `r citep(bib[['BiocFileCache']])` * `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])` * `r Githubpkg("lcolladotor/biocthis")` * `r CRANpkg('covr')` `r citep(bib[['covr']])` * `r CRANpkg('data.table')` `r citep(bib[['data.table']])` * `r Biocpkg('GenomicRanges')` `r citep(bib[['GenomicRanges']])` * `r CRANpkg('httr')` `r citep(bib[['httr']])` * `r Biocpkg('interactiveDisplayBase')` `r citep(bib[['interactiveDisplayBase']])` * `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])` * `r CRANpkg('knitr')` `r citep(bib[['knitr']])` * `r CRANpkg('Matrix')` `r citep(bib[['Matrix']])` * `r CRANpkg('pryr')` `r citep(bib[['pryr']])` * `r CRANpkg('RefManageR')` `r citep(bib[['RefManageR']])` * `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])` * `r Biocpkg('rtracklayer')` `r citep(bib[['rtracklayer']])` * `r CRANpkg('R.utils')` `r citep(bib[['R.utils']])` * `r Biocpkg('S4Vectors')` `r citep(bib[['S4Vectors']])` * `r CRANpkg('sessioninfo')` `r citep(bib[['sessioninfo']])` * `r Biocpkg('SummarizedExperiment')` `r citep(bib[['SummarizedExperiment']])` * `r CRANpkg('testthat')` `r citep(bib[['testthat']])` Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library("rmarkdown") system.time(render("recount3-quickstart.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("recount3-quickstart.Rmd", tangle = TRUE) ``` ```{r createVignette2} ## Clean up file.remove("quickstartRef.bib") ``` Date the vignette was generated. ```{r reproduce1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproduce2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])` with `r CRANpkg('knitr')` `r citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])` running behind the scenes. Citations made with `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`. ```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography bibliography() ```