--- title: "The biomaRt users guide" author: "Steffen Durinck, Wolfgang Huber, Mike Smith" package: "`r pkg_ver('biomaRt')`" output: BiocStyle::html_document: md_extensions: "-autolink_bare_uris" css: style.css vignette: > %\VignetteIndexEntry{The biomaRt users guide} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, echo = FALSE} knitr::opts_chunk$set(error = TRUE, cache = FALSE, eval = TRUE) ``` # Introduction In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis is needed for comprehensive bioinformatics data analysis. The `r Biocpkg("biomaRt")` package, provides an interface to a growing collection of databases implementing the [BioMart software suite](http://www.biomart.org). The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, Uniprot and HapMap. These major databases give `r Biocpkg("biomaRt")` users direct access to a diverse set of data and enable a wide range of powerful online queries from R. # Selecting a BioMart database and dataset Every analysis with `r Biocpkg("biomaRt")` starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function `listMarts()` will display all available BioMart web services ```{r annotate,echo=FALSE} options(width=100) ``` ```{r biomaRt} library("biomaRt") listMarts() ``` The `useMart()` function can now be used to connect to a specified BioMart database, this must be a valid name given by `listMarts()`. In the next example we choose to query the Ensembl BioMart database. ```{r ensembl1} ensembl <- useMart("ensembl") ``` BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function `listDatasets()`. *Note: use the function `head()` to display only the first 5 entries as the complete list has `r nrow(listDatasets(ensembl))` entries.* ```{r listDatasets} datasets <- listDatasets(ensembl) head(datasets) ``` To select a dataset we can update the `Mart` object using the function `useDataset()`. In the example below we choose to use the hsapiens dataset. ```{r ensembl2, eval=TRUE} ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl) ``` Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by: ```{r ensembl3} ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") ``` # How to build a biomaRt query The `getBM()` function has three arguments that need to be introduced: filters, attributes and values. *Filters* define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter *chromosome_name* can be used with value 'X'. The `listFilters()` function shows you all available filters in the selected dataset. ```{r filters} filters = listFilters(ensembl) filters[1:5,] ``` *Attributes* define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The `listAttributes()` function displays all available attributes in the selected dataset. ```{r attributes} attributes = listAttributes(ensembl) attributes[1:5,] ``` The `getBM()` function is the main query function in `r Biocpkg("biomaRt")`. It has four main arguments: * `attributes`: is a vector of attributes that one wants to retrieve (= the output of the query). * `filters`: is a vector of filters that one wil use as input to the query. * `values`: a vector of values for the filters. In case multple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument (see examples below). * `mart`: is an object of class `Mart`, which is created by the `useMart()` function. *Note: for some frequently used queries to Ensembl, wrapper functions are available: `getGene()` and `getSequence()`. These functions call the `getBM()` function with hard coded filter and attribute names.* Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a `r Biocpkg("biomaRt")` query. Let's make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings. The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the `listAttributes()` and `listFilters()` function respectively. Let's now run the query: ```{r getBM1, echo=TRUE, eval=TRUE} affyids=c("202763_at","209310_s_at","207500_at") getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene_id'), filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl) ``` ## Searching for datasets, filters and attributes The functions `listDatasets()`, `listAttributes()`, and `listFilters()` will return every available option for their respective types. However, this can be unwieldy when the list of results is long, involving much scrolling to find the entry you are interested in. `r Biocpkg("biomaRt")` also provides the functions `searchDatasets()`, `searchAttributes()`, and `searchFilters()` which will try to find any entries matching a specific term or pattern. For example, if we want to find the details of any datasets in our `ensembl` mart that contain the term '*hsapiens*' we could do the following: ```{r searchDatasets, echo = TRUE, eval = TRUE} searchDatasets(mart = ensembl, pattern = "hsapiens") ``` You can search in a simlar fashion to find available attributes and filters that you may be interested in. The example below returns the details for all attributes that contain the pattern '*hgnc*'. ```{r searchAttributes, echo = TRUE, eval = TRUE} searchAttributes(mart = ensembl, pattern = "hgnc") ``` For advanced use, note that the *pattern* argument takes a regular expression. This means you can create more complex queries if required. Imagine, for example, that we have the string *ENST00000577249.1*, which we know is an Ensembl ID of some kind, but we aren't sure what the appropriate filter term is. The example shown next uses a pattern that will find all filters that contain both the terms '*ensembl*' and '*id*'. This allows use to reduced the list of filters to only those that might be appropriate for our example. ```{r searchFilters, echo = TRUE, eval = TRUE} searchFilters(mart = ensembl, pattern = "ensembl.*id") ``` From this we can see that *ENST00000577249.1* is a Transcript ID with version, and the appropriate filter value to use with it is `ensembl_transcript_id_version`. ## Using predefined filter values Many filters have a predefined list of values that are known to be in the dataset they are associated with. An common example would be the names of chromosomes when searching a dataset at Ensembl. In this online interface to BioMart these available options are displayed as a list as shown in Figure \@ref(fig:filtervalues). ```{r filtervalues, fig.cap='The options available to the Chromosome/Scaffold field are limited to a pretermined list based on the values in this dataset.', echo = FALSE} knitr::include_graphics('filtervalues.png') ``` You can list this in an R session with the function `listFilterValues()`, passing it a mart object and the name of the filter. For example, to list the possible chromosome names you could run the following: ```{r chromosomeNames, results = FALSE} listFilterValues(mart = ensembl, filter = "chromosome_name") ``` It is also possible to search the list of available values via `searchFilterValues()`. In the examples below, the first returns all chromosome names starting with "*GL*", which the second will find any phenotype descriptions that contain the string "*Crohn*". ```{r searchFilterValues, results = FALSE} searchFilterValues(mart = ensembl, filter = "chromosome_name", pattern = "^GL") searchFilterValues(mart = ensembl, filter = "phenotype_description", pattern = "Crohn") ``` ## Attribute Pages For large BioMart databases such as Ensembl, the number of attributes displayed by the `listAttributes()` function can be very large. In BioMart databases, attributes are put together in pages, such as sequences, features, homologs for Ensembl. An overview of the attributes pages present in the respective BioMart dataset can be obtained with the `attributePages()` function. ```{r attributePages} pages = attributePages(ensembl) pages ``` To show us a smaller list of attributes which belong to a specific page, we can now specify this in the `listAttributes()` function. *The set of attributes is still quite long, so we use `head()` to show only the first few items here.* ```{r listAttributes} head(listAttributes(ensembl, page="feature_page")) ``` We now get a short list of attributes all of which are related to the region where the genes are located. # Result Caching To save time and computing resources `r Biocpkg("biomaRt")` will attempt to identify when you are re-running a query you have executed before. Each time a new query is run, the results will be saved to a cache on your computer. If a query is identified as having been run previously, rather than submitting the query to the server, the results will be loaded from the cache. You can get some information on the size and location of the cache using the function `biomartCacheInfo()`: ```{r cacheInfo} biomartCacheInfo() ``` The cache can be deleted using the command `biomartCacheClear()`. This will remove all cached files. # Using a BioMart other than Ensembl There are a small number of non-Ensembl databases that offer a BioMart interface to their data. The `r Biocpkg("biomaRt")` package can be used to access these in a very similar fashion to Ensembl. The majority of `r Biocpkg("biomaRt")` functions will work in the same manner, but the construction of the initial Mart object requires slightly more setup. In this section we demonstrate the setting requires to query [Wormbase ParaSite](https://parasite.wormbase.org/index.html) and [Phytozome](https://phytozome.jgi.doe.gov/pz/portal.html) ## Wormbase To demonstrate the use of the `r Biocpkg("biomaRt")` package with non-Ensembl databases the next query is performed using the Wormbase ParaSite BioMart. In this example, we use the `listMarts()` function to find the name of the available marts, given the URL of Wormbase. We use this to connect to Wormbase BioMart using the `useMart()` function. *Note that we use the `https` address and must provide the port as `443`. Queries to WormBase will fail without these options.* ```{r wormbase, echo=TRUE, eval=TRUE} listMarts(host = "parasite.wormbase.org") wormbase <- useMart(biomart = "parasite_mart", host = "https://parasite.wormbase.org", port = 443) ``` We can then use functions described earlier in this vignette to find and select the gene dataset, and print the first 6 available attributes and filters. Then we use a list of gene names as filter and retrieve associated transcript IDs and the transcript biotype. ```{r wormbase-2, echo=TRUE, eval=TRUE} listDatasets(wormbase) wormbase <- useDataset(mart = wormbase, dataset = "wbps_gene") head(listFilters(wormbase)) head(listAttributes(wormbase)) getBM(attributes = c("external_gene_id", "wbps_transcript_id", "transcript_biotype"), filters = "gene_name", values = c("unc-26","his-33"), mart = wormbase) ``` ## Phytozome ### Version 12 The Phytozome BioMart can be accessed with the following settings. *Note that we use the `https` address. Queries to Phytozome will fail without specifying this.* ```{r, phytozome-1, echo = TRUE, eval = TRUE} listMarts(host = "https://phytozome.jgi.doe.gov") phytozome <- useMart(biomart = "phytozome_mart", host = "https://phytozome.jgi.doe.gov") listDatasets(phytozome) wormbase <- useDataset(mart = phytozome, dataset = "phytozome") ``` Once this is set up the usual `r Biocpkg("biomaRt")` functions can be used to interogate the database options and run queries. ```{r, pytozome-2} getBM(attributes = c("organism_name", "gene_name1"), filters = "gene_name_filter", values = "82092", mart = phytozome) ``` ### Version 13 Version 13 of Phyotozome can be found at https://phytozome-next.jgi.doe.gov/ and if you wish to query that version the URL used to create the Mart object must reflect that. ```{r, phytozome-13, echo = TRUE, eval = TRUE} phytozome_v13 <- useMart(biomart = "phytozome_mart", dataset = "phytozome", host = "https://phytozome-next.jgi.doe.gov") ``` # Local BioMart databases The `r Biocpkg("biomaRt")` package can be used with a local install of a public BioMart database or a locally developed BioMart database and web service. In order for `r Biocpkg("biomaRt")` to recognize the database as a BioMart, make sure that the local database you create has a name conform with ` database_mart_version ` where database is the name of the database and version is a version number. No more underscores than the ones showed should be present in this name. A possible name is for example ` ensemblLocal_mart_46 `. ## Minimum requirements for local database installation More information on installing a local copy of a BioMart database or develop your own BioMart database and webservice can be found on Once the local database is installed you can use `r Biocpkg("biomaRt")` on this database by: ```{r localCopy, eval = FALSE} listMarts(host="www.myLocalHost.org", path="/myPathToWebservice/martservice") mart=useMart("nameOfMyMart",dataset="nameOfMyDataset",host="www.myLocalHost.org", path="/myPathToWebservice/martservice") ``` For more information on how to install a public BioMart database see: http://www.biomart.org/install.html and follow link databases. # Using `select()` In order to provide a more consistent interface to all annotations in Bioconductor the `select()`, `columns()`, `keytypes()` and `keys()` have been implemented to wrap some of the existing functionality above. These methods can be called in the same manner that they are used in other parts of the project except that instead of taking a `AnnotationDb` derived class they take instead a `Mart` derived class as their 1st argument. Otherwise usage should be essentially the same. You still use `columns()` to discover things that can be extracted from a `Mart`, and `keytypes()` to discover which things can be used as keys with `select()`. ```{r columnsAndKeyTypes} mart <- useMart(dataset="hsapiens_gene_ensembl",biomart='ensembl') head(keytypes(mart), n=3) head(columns(mart), n=3) ``` And you still can use `keys()` to extract potential keys, for a particular key type. ```{r keys1} k = keys(mart, keytype="chromosome_name") head(k, n=3) ``` When using `keys()`, you can even take advantage of the extra arguments that are available for others keys methods. ```{r keys2} k = keys(mart, keytype="chromosome_name", pattern="LRG") head(k, n=3) ``` Unfortunately the `keys()` method will not work with all key types because they are not all supported. But you can still use `select()` here to extract columns of data that match a particular set of keys (this is basically a wrapper for `getBM()`). ```{r select} affy=c("202763_at","209310_s_at","207500_at") select(mart, keys=affy, columns=c('affy_hg_u133_plus_2','entrezgene_id'), keytype='affy_hg_u133_plus_2') ``` So why would we want to do this when we already have functions like `getBM()`? For two reasons: 1) for people who are familiar with select and it's helper methods, they can now proceed to use `r Biocpkg("biomaRt")` making the same kinds of calls that are already familiar to them and 2) because the select method is implemented in many places elsewhere, the fact that these methods are shared allows for more convenient programmatic access of all these resources. An example of a package that takes advantage of this is the `r Biocpkg("OrganismDbi")` package. Where several packages can be accessed as if they were one resource. # Connection troubleshooting Note: if the function `useMart()` runs into proxy problems you should set your proxy first before calling any `r Biocpkg("biomaRt")` functions. You can do this using the Sys.putenv command: ```{r putenv, eval = FALSE} Sys.setenv("http_proxy" = "http://my.proxy.org:9999") ``` Some users have reported that the workaround above does not work, in this case an alternative proxy solution below can be tried: ```{r rCurlOptions, eval = FALSE} options(RCurlOptions = list(proxy="uscache.kcc.com:80",proxyuserpwd="------:-------")) ``` # Session Info ```{r sessionInfo} sessionInfo() warnings() ```