--- title: "regutools: an R package for the extraction of gene regulatory networks from RegulonDB" author: - name: Joselyn Chávez affiliation: - &ibt Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, MX - &first Equal contribution email: joselynchavezf@gmail.com - name: Carmina Barberena Jonas affiliation: - &langebio LANGEBIO, Cinvestav, Irapuato, Guanajuato, MX - *first email: car.barjon@gmail.com - name: Jesus Emiliano Sotelo-Fonseca affiliation: - *langebio - *first email: jemiliano@gmail.com - name: Jose Alquicira Hernandez affiliation: - &ccg Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, MX - University of Queensland, Brisbane, QLD, AU - name: Heladia Salgado affiliation: - *ccg email: heladia@ccg.unam.mx - name: Leonardo Collado-Torres affiliation: - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, US - &ccb Center for Computational Biology, Johns Hopkins University, Baltimore, MD, US - &corres Co-corresponding author email: lcolladotor@gmail.com - name: Alejandro Reyes affiliation: - Dana Farber Cancer Institute, Boston, MA, US - *corres email: alejandro.reyes.ds@gmail.com date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('regutools')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{regutools: an R package for data extraction from RegulonDB} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## For links library("BiocStyle") ## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("knitcitations") ## Load knitcitations with a clean bibliography cleanbib() cite_options(hyperlink = "to.doc", citation_format = "text", style = "html") # Note links won't show for now due to the following issue # https://github.com/cboettig/knitcitations/issues/63 ## Write bibliography information bib <- c( R = citation(), AnnotationDbi = citation("AnnotationDbi"), AnnotationHub = citation("AnnotationHub"), BiocFileCache = citation("BiocFileCache"), BiocStyle = citation("BiocStyle"), Biostrings = citation("Biostrings"), DBI = citation("DBI"), GenomicRanges = citation("GenomicRanges"), Gviz = citation("Gviz"), IRanges = citation("IRanges"), knitcitations = citation("knitcitations"), knitr = citation("knitr")[3], RCy3 = citation("RCy3"), regutools = citation("regutools")[1], regutoolsPaper = citation("regutools")[2], rmarkdown = citation("rmarkdown")[1], RSQLite = citation("RSQLite"), S4Vectors = citation("S4Vectors"), sessioninfo = citation("sessioninfo"), testthat = citation("testthat") ) ``` # Basics ## Install `regutools` `R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('regutools')` is a `R` package available via Bioconductor. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('regutools')` by using the following commands in your `R` session: ```{r 'install', eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("regutools") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` ## Required knowledge `r Biocpkg('regutools')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with genomic and sequence data. That is, packages like `r Biocpkg('Biostrings')` that allow you to work with sequences and `r Biocpkg('GenomicRanges')` for data on genomic coordinates. A `r Biocpkg('regutools')` user is not expected to deal with those packages directly but will need to be familiar with them to understand the results `r Biocpkg('regutools')` generates. Furthermore, it'll be useful for the user to know the syntax of `r Biocpkg('AnnotationHub')` `r citep(bib[['AnnotationHub']])` in order to query and load the data provided by this package. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU). ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help regarding Bioconductor. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `regutools` We hope that `r Biocpkg('regutools')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r 'citation'} ## Citation info citation("regutools") ``` # Overview *Escherichia coli K-12 (E. coli)* is the best bacterial organism studied to date. Thousands of papers have been published using *E. coli* as a model system asking how genes are regulated. The throughput of these experiments range from single-gene studies to whole-genome approaches. Twenty years ago, the database [*RegulonDB*](http://regulondb.ccg.unam.mx/) started collecting, curating and organizing this information into a centralized resource. Thanks to this huge efforts, researchers have had an easy way to access all these data in a database, facilitating the advancements of complete fields, such as *systems biology*. The analysis of high-throughput experiments -such as RNA-seq or ChIP-seq- often requires the integration of databases such as *RegulonDB* in order to give biological interpretations to these data. The *regutools* package is designed to facilitate such integration by providing programmatic access to *RegulonDB* within the R environment `r citep(bib[['regutoolsPaper']])`. The package retrieves information from the *RegulonDB* database into *Bioconductor* objects, ready for downstream analyses. The package defines the object **regulondb**, which is a data structure that contains the path to a *SQLite* database retrieved from *RegulonDB* along with metadata such as database version and reference genome. The function `connect_database()` will retrieve the latest version of the database and connect to it. It will download the database using `r Biocpkg('AnnotationHub')` or a backup mechanism if necessary. The *regutools* package contains functions with the most popular queries to *regutools*, such as retrieving information of which gene targets are regulated by a transcription factor. But users can also design queries that are specific to their analyses. This vignette describes how to use the provided functions and how to design programmatic queries to *regutools*. The general syntax of the function calls of this package is `result <- functionCalled( regulondb, arguments )`. # The **regulondb** object The **regulondb** object is an extension of an **SQLiteConnection** class that host a connection to a database with the table structure defined in the *RegulonDB* database. It contains additional slots that specify the organism, genome version and database version. The function `regulondb()` is the constructor function of **regulondb** objects. This function receives as input a file path to the database file as well as information about the annotation as character vectors. ```{r 'connect_db', echo = TRUE, message=FALSE} library("regutools") ## Other packages used library("Biostrings") ## Connect to the RegulonDB database regulondb_conn <- connect_database() ## Build a regulondb object e_coli_regulondb <- regulondb( database_conn = regulondb_conn, organism = "E.coli", database_version = "1", genome_version = "1" ) e_coli_regulondb ``` In order to get an overview of the tables present in a *regulondb* object, we can use the function `list_datasets()`. This function will output all the available tables (datasets) that can be used to build queries. ```{r 'list_datasets', echo = TRUE} list_datasets(e_coli_regulondb) ``` For each table in the database, users can explore the fields (or attributes) of each table using the function `list_attributes`. ```{r 'list_attr'} head(list_attributes(e_coli_regulondb, "GENE"), 8) ``` # Retrieving data Since the *regulondb* object is an extension of the *SQLiteConnection*, users can retrieve data from a *regulondb* object using the function `dbGetQuery()`. Additionally, this package provides a wrapper function to build queries to the database. This function is called `get_dataset()` and has a very similar syntax to the `getBM()` function from the biomaRt package. The main arguments of the `get_dataset()` function are a *regulondb* object, a dataset (or table) of the database, the fields of the dataset to retrieve (attributes) and filters to specify what information to get. The code below shows an example where three attributes from the dataset **"GENE"** for the genes *araC*, *crp* and *lacI*. Note that if the `filters=` parameter is empty, the function will retrieve all the hits it find in the database. ```{r 'get_dataset', echo = TRUE} get_dataset( regulondb = e_coli_regulondb, dataset = "GENE", attributes = c("posleft", "posright", "strand", "name"), filters = list("name" = c("araC", "crp", "lacI")) ) ``` Some of the filters, such as *posright* or *posleft*, can be filtered by specifying intervals. For example, the code below indicates that all the start positions *posright" should be between position 1 and position 5000 of the genome. The parameter `inverval=` is used to specify that the filter for that field will be defined by an interval rather than a exact match. ```{r 'get_dataset_interval', echo = TRUE} get_dataset( e_coli_regulondb, attributes = c("posright", "name"), filters = list("posright" = c(1, 5000)), interval = "posright", dataset = "GENE" ) ``` # The **regulondb_result** object and integration into the *BioC* ecosystem By default, the function `get_dataset()` outputs a **regulondb_result** object, which is an extension of a **DataFrame** that stores information about the query used to generate this object. This additional information includes the organism name, the database and genome versions, and the table (or dataset) of the *regulondb* object that was queried by `get_dataset()`. ```{r 'get_dataset_DF'} res <- get_dataset( regulondb = e_coli_regulondb, dataset = "GENE", attributes = c("posleft", "posright", "strand", "name"), filters = list("name" = c("araC", "crp", "lacI")) ) slotNames(res) ``` To enable integration with other Bioconductor packages, we provide the function `convert_to_granges()` which converts a **regulondb_result** object into a **GRanges** object whenever possible. For example, the result stored in the variable `res` has genomic coordinates and it is thus possible convert `res` into a **GRanges** object. ```{r 'convert_granges'} convert_to_granges(res) ``` An alternative way to get to the same result is to use the parameter `output_format=` directly in the function `get_dataset()`. ```{r 'get_dataset_GRanges'} get_dataset( regulondb = e_coli_regulondb, dataset = "GENE", attributes = c("posleft", "posright", "strand", "name"), filters = list("name" = c("araC", "crp", "lacI")), output_format = "GRanges" ) ``` In a similar manner, the function `convert_to_biostrings()` converts **regulondb** objects into objects from the `r BiocStyle::Biocpkg("Biostrings")` package. Possible outputs of `convert_to_biostrings()` are **DNAStringSet** objects if `seq_type="DNA"` or a **BStringSet** if `seq_type="product"`. ```{r 'dnastring_res'} res_dnastring <- get_dataset( regulondb = e_coli_regulondb, dataset = "GENE", attributes = c("posleft", "posright", "strand", "name", "dna_sequence"), filters = list("name" = c("araC", "crp", "lacI")) ) res_dnastring <- convert_to_biostrings(res_dnastring, seq_type = "DNA") res_dnastring mcols(res_dnastring) ``` ```{r 'product_seq'} res_prodstring <- get_dataset( regulondb = e_coli_regulondb, dataset = "GENE", attributes = c("posleft", "posright", "strand", "name", "product_sequence"), filters = list("name" = c("araC", "crp", "lacI")) ) res_prodstring <- convert_to_biostrings(res_prodstring, seq_type = "product") mcols(res_prodstring) ``` As with the **GRanges** output mentioned above, it is possible for the output of `get_dataset()` to be a **DNAStringSet** object by specifying the parameter `output_format="DNAStringSet"` or a **BStringSet** object by specifying `output_format="BStringSet"`. Note that the functions to convert **regulondb_result** objects will throw errors if there is insufficient information for the coercion to occur. For example, we will get an error if we try to convert into a **GRanges** object when genomic coordinates are missing from the **regulondb_result** object. # Building your own queries In the *regutools* package, we have implemented features that are commonly used when querying data from databases: filtering results by partial matching, filtering by numeric intervals, and building complex queries. ## Partial matching The code below illustrates the concept of partial matching, in which by setting the parameter `partialmatch=` to `"name"`, the query returns all the gene name in which the word *ara* is contained. ```{r 'partial_match', echo = TRUE} get_dataset( e_coli_regulondb, attributes = c("posright", "name"), filters = list("name" = "ara"), partialmatch = "name", dataset = "GENE" ) ``` Note that setting the parameter `partialmatch=` to `NULL` will only return genes where the name string is identical to *ara*. ## Filtering by numeric intervals In addition to partial matching, queries can be filtered by numeric intervals. For example, in the code below, the parameter `interv=` is set to `"posright"`. By doing this assignment, we are specifying that the values for `"posright"` must lie between the values of `posright` specified in the `filter=` parameter. Thus, the result of this query will be genes whose right positions lie between the coordinates 2000 and 4000000. Note that the use of the `interv=` parameter in the code below is equivalent to setting the parameter `output_format=` to `"GRanges"` and further subsetting the *GRanges* object using the function `subsetByOverlaps()`. ```{r 'position_interval', echo = TRUE} get_dataset( e_coli_regulondb, attributes = c("name", "strand", "posright", "product_name"), dataset = "GENE", filters = list(posright = c("2000", "4000000")), interval = "posright" ) ``` ## Retrieving genomic elements Based on genomic coordinates, the code below retrieves all genomic elements whose positions lie between the coordinates provided as a *GRanges* object. If no aditional parameters are provided, the result will retrieve genes that relies within the first 5000pb from the E. coli genome. ```{r 'genomic_elements', echo = TRUE } get_dna_objects(e_coli_regulondb) ``` Especific genomic positions can be provided within the parameter `grange`. It is important to provide a genomic range that covers as minimal the length of one genomic element. ```{r 'especific_ranges', echo = TRUE } grange <- GenomicRanges::GRanges( "chr", IRanges::IRanges(5000, 10000) ) get_dna_objects(e_coli_regulondb, grange) ``` Aditional genomic elements such as "-35 promoter box", "gene", "promoter", "Regulatory Interaction", "sRNA interaction", or "terminator" can be selected. ```{r 'aditional_elements', echo = TRUE} grange <- GenomicRanges::GRanges( "chr", IRanges::IRanges(5000, 10000) ) get_dna_objects(e_coli_regulondb, grange, elements = c("gene", "promoter")) ``` Evenmore, the genomic elements retrieved can be observed in a Genome Browser-like plot. The genomic elements are annotated using a UCSC genome as reference, it is important to provide a valid chromosome name for annotation purpose. ```{r 'plot_elements', echo = TRUE} e_coli_regulondb <- regulondb( database_conn = regulondb_conn, organism = "chr", database_version = "1", genome_version = "1" ) grange <- GenomicRanges::GRanges("chr", IRanges::IRanges(5000, 10000)) plot_dna_objects(e_coli_regulondb, grange, elements = c("gene", "promoter")) ``` ## Complex filters The examples so far have considered queries in which the results are filtered according to single fields from tables. In order to build queries with filters from more than one field, several filters names can be passed as a list to the `filters=` argument. Additionally the `and=` argument is used to specify whether the filtering conditions of the result of the query must be satisfied (`and=TRUE`) or if satisfying a single condition is enough (`and=FALSE`). For example, the code below extracts the genes where either `name` or `product_name` contain the word *Ara* or *Ara*, respectively, if the gene is in the forward strand or if the right position of the gene is between 2000 and 40000000. ```{r 'complex_filter'} nrow( get_dataset( e_coli_regulondb, attributes = c("name", "strand", "posright", "product_name"), dataset = "GENE", filters = list( name = c("ARA"), product_name = c("Ara"), strand = c("forward"), posright = c("2000", "4000000") ), and = FALSE, partialmatch = c("name", "product_name"), interval = "posright" ) ) ``` The query below, which is identical to the query above except the `and=` is set to `TRUE`, returns the genes where all of the specified conditions are satisfied. ```{r 'complex_filter_2'} nrow( get_dataset( e_coli_regulondb, attributes = c("name", "strand", "posright", "product_name"), dataset = "GENE", filters = list( name = c("ARA"), product_name = c("Ara"), strand = c("forward"), posright = c("2000", "4000000") ), and = TRUE, partialmatch = c("name", "product_name"), interval = "posright" ) ) ``` # Functions with implement popular queries The regutools package provides functions that encode the most frequent queries that are submitted to the *RegulonDB* web resource. ## Extracting regulatory networks One of the most important information that the *RegulonDB* database contains are manually curated regulatory networks. The *regutools* package exports the function `get_gene_regulators()`, which inputs a vector of gene names and outputs information about what are the transcription factors that regulate such genes together with the regulatory effect (activator, repressor or dual). ```{r 'gene_regulators', echo = TRUE} get_gene_regulators(e_coli_regulondb, c("araC", "fis", "crp")) ``` Similarly, the `get_regulatory_network()` function retrieves all the regulatory network from a **regulondb** object. By default, the output will be a list of transcription factor-gene pairs, indicating which transcription factor regulates which gene. ```{r 'regulatory_network', echo = TRUE} head(get_regulatory_network(e_coli_regulondb)) ``` But users can also set the parameter `type=` to `"GENE-GENE"` or `"TF-GENE"` to retrieve gene-gene regulatory networks or transcription factor-transcription factor regulatory networks, respectively. Users can also use the `get_regulatory_summary()` to retrieve a summary of the transcription factor regulated the expression of a set of given genes. The parameter `gene_regulators` can receive either a vector of gene names or the output of a call to the function `get_gene_regulators()`. The resulting output is a *regulondb_result* object in which each row shows a transcription factor, and the columns display information about the number, percentage and regulatory activity exerted to the genes. ```{r 'regulatory_summary'} get_regulatory_summary(e_coli_regulondb, gene_regulators = c("araC", "modB") ) ``` ## Visualizing networks using cytoscape Software tools such as [Cytoscape](https://cytoscape.org/index.html) are often useful for interactive exploration of data and visualization of networks. The function `get_regulatory_network()` has a parameter `cytograph=` that if set to `TRUE`, it will visualize the network in a cytoscape session. Of note, this feature will only work if the user has Cytoscape open in their computer. ```{r 'prep_cyto', eval=FALSE} get_regulatory_network(e_coli_regulondb, cytograph = TRUE) ``` ## Transcription factor binding sites Another common request to the *RegulonDB* database is to obtain the genomic coordinates and sequences of the DNA binding sites for a given transcription factor. This query is implemented in the function `get_binding_sites()`, in which the results are formatted according to the parameter `output_format=` as either a *GRanges* object or a *Biostrings* object. ```{r 'binding_sites'} get_binding_sites(e_coli_regulondb, transcription_factor = "AraC") get_binding_sites(e_coli_regulondb, transcription_factor = "AraC", output_format = "Biostrings" ) ``` # A note about CDSB This was a project accomplished by members of the [Community of Bioinformatics Software Developers](https://comunidadbioinfo.github.io/) (CDSB in Spanish). In part CDSB was formed to help R users in Latin America become R/[Bioconductor](http://bioconductor.org/) developers. For more information about CDSB, the CDSB workshops or its online community, please check the [CDSB website](https://comunidadbioinfo.github.io/) which is available in both Spanish and English. # Reproducibility The `r Biocpkg('regutools')` package `r citep(bib[['regutools']])` was made possible thanks to: * R `r citep(bib[['R']])` * `r Biocpkg('AnnotationDbi')` `r citep(bib[['AnnotationDbi']])` * `r Biocpkg('AnnotationHub')` `r citep(bib[['AnnotationHub']])` * `r Biocpkg('BiocFileCache')` `r citep(bib[['BiocFileCache']])` * `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])` * `r Biocpkg('Biostrings')` `r citep(bib[['Biostrings']])` * `r CRANpkg('DBI')` `r citep(bib[['DBI']])` * `r Biocpkg('GenomicRanges')` `r citep(bib[['GenomicRanges']])` * `r Biocpkg('Gviz')` `r citep(bib[['Gviz']])` * `r Biocpkg('IRanges')` `r citep(bib[['IRanges']])` * `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])` * `r CRANpkg('knitr')` `r citep(bib[['knitr']])` * `r Biocpkg('RCy3')` `r citep(bib[['RCy3']])` * `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])` * `r CRANpkg('RSQLite')` `r citep(bib[['RSQLite']])` * `r Biocpkg('S4Vectors')` `r citep(bib[['S4Vectors']])` * `r CRANpkg('sessioninfo')` `r citep(bib[['sessioninfo']])` * `r CRANpkg('testthat')` `r citep(bib[['testthat']])` Code for creating the vignette ```{r createVignette, eval=FALSE} ## Create the vignette library("rmarkdown") system.time(render("regutools.Rmd")) ## Extract the R code library("knitr") knit("regutools.Rmd", tangle = TRUE) ``` Date the vignette was generated. ```{r reproduce1, echo=FALSE} ## Date the vignette was generated Sys.time() ``` Wallclock time spent generating the vignette. ```{r reproduce2, echo=FALSE} ## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3) ``` `R` session information. ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])`, `r CRANpkg('knitr')` `r citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])` running behind the scenes. Citations were made with `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`. ```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography bibliography() ```