---
title: "An introduction to biodbUniprot"
author: "Pierrick Roger"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('biodbUniprot')`"
vignette: |
  %\VignetteIndexEntry{Introduction to the biodbUniprot package.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: false
  BiocStyle::pdf_document: default
bibliography: references.bib
---

# Introduction

biodbUniprot is a *biodb* extension package that implements a connector to
Uniprot database.

The *UniProt* Knowledge Base [@uniprotConsortium2016UniProtKB] can be searched
using its *search* web service.

We present here the way to contact this web service with this package.

# Installation

Install using Bioconductor:
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbUniprot')
```

# Initialization

The first step in using *biodbUniprot*, is to create an instance of the biodb
class `BiodbMain` from the main *biodb* package. This is done by calling the
constructor of the class:
```{r, results='hide'}
mybiodb <- biodb::newInst()
```
During this step the configuration is set up, the cache system is initialized
and extension packages are loaded.

We will see at the end of this vignette that the *biodb* instance needs to be
terminated with a call to the `terminate()` method.

# Creating a connector to Uniprot database

In *biodb* the connection to a database is handled by a connector instance that
you can get from the factory.
biodbUniprot implements a connector to a remote database.
Here is the code to instantiate a connector:
```{r}
conn <- mybiodb$getFactory()$createConn('uniprot')
```

# Getting entries

To download entries, run the `getEntry()`, which returns a list of `BiodbEntry`
objects:
```{r}
entries <- conn$getEntry(c('P01011', 'P09237'))
```

To print the information contained in the entry objects as a data frame, run
the `entriesToDataframe()` method attached to the `BiodbMain` instance:
```{r}
mybiodb$entriesToDataframe(entries)
```

# Using the *search* web service

The method `wsSearch()` (`wsQuery()` is now deprecated) implements the request
to the *search* web service, and the parsing of its output.

To get the raw results returned by the *UniProt* server, run the following code:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
    size=2, retfmt='plain')
```

The first parameter is the query, as required by the web service.
To learn how to write a query for *UniProt*, see a description of the *query*
web service at <http://www.uniprot.org/help/api_queries>.

The `fields` parameter is the fields you want back for each entry
returned by the database.

The `size` parameter is the maximum number of entries the server must
return.

The `retfmt` parameter controls the type of output desired.
Here `"plain"` states that we want the raw output from the server.

To get the output parsed by *biodb* and get a data frame, run:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
    size=2, retfmt='parsed')
```

To get only the list of *UniProt* identifiers, run:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
    size=2, retfmt='ids')
```

And if you are curious to see the URL request that is sent to the server,
run:
```{r}
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'),
    size=2, retfmt='request')
```

# Conversion of gene symbols to *UniProt* IDs

The method `geneSymbolToUniprotIds()` uses `wsSearch()` to search for *UniProt*
entries that reference particular gene symbols.

For instance, if you want to get the UniProt entries that have the gene symbol
**G-CSF**, just run:
```{r}
ids <- conn$geneSymbolToUniprotIds('G-CSF')
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession',
    'gene.symbol'))
```

If you want to match also **GCSF** (no minus sign character), then run:
```{r}
ids <- conn$geneSymbolToUniprotIds('G-CSF', ignore.nonalphanum=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession',
    'gene.symbol'))
```

If you want to match **G-CSFa2** too, run:
```{r}
ids <- conn$geneSymbolToUniprotIds('G-CSF', partial.match=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession',
    'gene.symbol'))
```

The way this method works is by running `wsSearch()` to get a first set of entry
identifiers, and then download each entry and apply a filtering on them.
The downloading of the entries may quite long, `wsSearch()` returning
potentially thousands of entries, each entry being downloaded with a single
separate request and the frequency limit being 3 request per second.
Entries already in cache or memory will not be downloaded again, so running the
same request a second time will be faster, as it is usually the case with
*biodb*.

# Closing biodb instance

When done with your *biodb* instance you have to terminate it, in order to
ensure release of resources (file handles, database connection, etc):
```{r}
mybiodb$terminate()
```

# Session information

```{r}
sessionInfo()
```

# References