---
title: "An introduction to biodbUniprot"
author: "Pierrick Roger"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('biodbUniprot')`"
vignette: |
  %\VignetteIndexEntry{Introduction to the biodbUniprot package.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: false
  BiocStyle::pdf_document: default
bibliography: references.bib
---

# Introduction

biodbUniprot is a *biodb* extension package that implements a connector to the
*UniProt* database.

The *UniProt* Knowledge Base [@uniprotConsortium2016UniProtKB] can be searched
using its *search* web service. This vignette shows how to query this web
service with the package.

# Installation

Install using Bioconductor:
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbUniprot')
```

# Initialization

The first step in using *biodbUniprot* is to create an instance of the biodb
class `BiodbMain` from the main *biodb* package. This is done by calling the
constructor of the class:
```{r, results='hide'}
mybiodb <- biodb::newInst()
```
During this step the configuration is set up, the cache system is initialized
and extension packages are loaded.

We will see at the end of this vignette that the *biodb* instance needs to be
terminated with a call to the `terminate()` method.

# Creating a connector to the UniProt database

In *biodb* the connection to a database is handled by a connector instance that
you can get from the factory.
biodbUniprot implements a connector to a remote database.
Here is the code to instantiate a connector:
```{r}
conn <- mybiodb$getFactory()$createConn('uniprot')
```

# Getting entries

To download entries, run the `getEntry()` method, which returns a list of
`BiodbEntry` objects:
```{r}
entries <- conn$getEntry(c('P01011', 'P09237'))
```

To print the information contained in the entry objects as a data frame, run
the `entriesToDataframe()` method attached to the `BiodbMain` instance:
```{r}
mybiodb$entriesToDataframe(entries)
```

# Using the *search* web service

The method `wsSearch()` (`wsQuery()` is now deprecated) implements the request
to the *search* web service and the parsing of its output.

To get the raw results returned by the *UniProt* server, run the following
code:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606',
    fields=c('accession', 'id'), size=2, retfmt='plain')
```
The first parameter is the query, as required by the web service.
To learn how to write a query for *UniProt*, see the description of the web
service in the *UniProt* documentation.
The `fields` parameter lists the fields to return for each entry found by the
database.
The `size` parameter is the maximum number of entries the server may return.
The `retfmt` parameter controls the type of output. Here `"plain"` states that
we want the raw output from the server.

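
Since the query is a plain character string, it can be assembled
programmatically. Below is a minimal sketch: `buildQuery()` is a hypothetical
helper written for this illustration (it is not part of *biodbUniprot*), and it
assumes the terms are simply combined with `AND`, as in the query used above:

```{r}
# Hypothetical helper, not part of biodbUniprot:
# join individual query terms into one string using the AND operator.
buildQuery <- function(...) {
    terms <- c(...)
    stopifnot(is.character(terms), length(terms) > 0)
    paste(terms, collapse=' AND ')
}

# Rebuild the query used in the call above
query <- buildQuery('reviewed:true', 'organism_id:9606')
query
```

The resulting string can be passed as the first argument of `wsSearch()`.
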

To get the output parsed by *biodb* and returned as a data frame, run:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606',
    fields=c('accession', 'id'), size=2, retfmt='parsed')
```

To get only the list of *UniProt* identifiers, run:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606',
    fields=c('accession', 'id'), size=2, retfmt='ids')
```

And if you are curious to see the URL request that is sent to the server, run:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606',
    fields=c('accession', 'id'), size=2, retfmt='request')
```

# Conversion of gene symbols to *UniProt* IDs

The method `geneSymbolToUniprotIds()` uses `wsSearch()` to search for *UniProt*
entries that reference particular gene symbols.

For instance, if you want to get the *UniProt* entries that have the gene
symbol **G-CSF**, just run:
```{r}
ids <- conn$geneSymbolToUniprotIds('G-CSF')
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot',
    fields=c('accession', 'gene.symbol'))
```

If you also want to match **GCSF** (no minus sign character), then run:
```{r}
ids <- conn$geneSymbolToUniprotIds('G-CSF', ignore.nonalphanum=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot',
    fields=c('accession', 'gene.symbol'))
```

If you want to match **G-CSFa2** too, run:
```{r}
ids <- conn$geneSymbolToUniprotIds('G-CSF', partial.match=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot',
    fields=c('accession', 'gene.symbol'))
```

This method works by running `wsSearch()` to get a first set of entry
identifiers, then downloading each entry and filtering the entries.
Downloading the entries may take quite long: `wsSearch()` potentially returns
thousands of entries, each entry is downloaded with a separate request, and
the frequency limit is 3 requests per second.
Entries already in cache or memory will not be downloaded again, so running
the same request a second time will be faster, as is usually the case with
*biodb*.

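
To get a feel for the cost, here is a rough back-of-the-envelope sketch.
The helper `estimateDownloadTime()` is hypothetical (it is not part of *biodb*
or *biodbUniprot*). It assumes one request per entry, no entry already in
cache, and that the frequency limit of 3 requests per second is the only
bottleneck, so the result is only a lower bound:

```{r}
# Hypothetical helper, not part of biodb: lower bound on the time (in seconds)
# needed to download n uncached entries at a given request frequency
# (in requests per second).
estimateDownloadTime <- function(n, freq=3) {
    stopifnot(is.numeric(n), n >= 0, is.numeric(freq), freq > 0)
    n / freq
}

# 3000 uncached entries at 3 requests per second, expressed in minutes
minutes <- estimateDownloadTime(3000) / 60
minutes
```

Under these assumptions, 3000 entries would need at least about 17 minutes on
a first run, which is why the cache matters for repeated requests.
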

# Closing the biodb instance

When you are done with your *biodb* instance, you have to terminate it in
order to ensure the release of resources (file handles, database connections,
etc.):
```{r}
mybiodb$terminate()
```

# Session information

```{r}
sessionInfo()
```

# References