---
title: "Querying protein features"
author: "Johannes Rainer"
graphics: yes
package: ensembldb
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Querying protein features}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{ensembldb,EnsDb.Hsapiens.v86,BiocStyle}
---

```{r  biocstyle, echo = FALSE, results = "asis", message = FALSE }
library(BiocStyle)
library(ensembldb)
BiocStyle::markdown() 
```


# Introduction

From Bioconductor release 3.5 on, `EnsDb` databases/packages created by the
`ensembldb` package contain also, for transcripts with a coding regions, mappings
between transcripts and proteins. Thus, in addition to the RNA/DNA-based
features also the following protein related information is available:

-   `protein_id`: the Ensembl protein ID. This is the primary ID for the proteins
    defined in Ensembl and each (protein coding) Ensembl transcript has one
    protein ID assigned to it.
-   `protein_sequence`: the amino acid sequence of a protein.
-   `uniprot_id`: the Uniprot ID for a protein. Note that not every Ensembl
    `protein_id` has an Uniprot ID, and each `protein_id` might be mapped to several
    `uniprot_id`. Also, the same Uniprot ID might be mapped to different `protein_id`.
-   `uniprot_db`: the name of the Uniprot database in which the feature is
    annotated. Can be either *SPTREMBL* or *SWISSPROT*.
-   `uniprot_mapping_type`: the type of the mapping method that was used to assign
    the Uniprot ID to the Ensembl protein ID.
-   `protein_domain_id`: the ID of the protein domain according to the
    source/analysis in/by which is was defined.
-   `protein_domain_source`: the source of the protein domain information, one of
    *pfscan*, *scanprosite*, *superfamily*, *pfam*, *prints*, *smart*, *pirsf* or *tigrfam*.
-   `interpro_accession`: the Interpro accession ID of the protein domain (if
    available).
-   `prot_dom_start`: the start of the protein domain within the sequence of
    the protein.
-   `prot_dom_start`: the end position of the protein domain within the
    sequence of the protein.

Thus, for protein coding transcripts, these annotations can be fetched from the
database too, given that protein annotations are available. Note that only `EnsDb`
databases created through the Ensembl Perl API contain protein annotation, while
databases created using `ensDbFromAH`, `ensDbFromGff`, `ensDbFromGRanges` and
`ensDbFromGtf` don't.

```{r  doeval, echo = FALSE, results = "hide" }
## Globally switch off execution of code chunks
evalMe <- TRUE
haveProt <- FALSE
## evalMe <- .Platform$OS.type == "unix"
 
```

```{r  loadlib, message = FALSE, eval = evalMe }
library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
## Evaluate whether we have protein annotation available
hasProteinData(edb)
 
```

```{r  restrict9, message = FALSE, echo = FALSE }
## silently subsetting to chromosome 11
edb <- filter(edb, filter = ~ seq_name == "11") 
```

If protein annotation is available, the additional tables and columns are also
listed by the `listTables` and `listColumns` methods.

```{r  listCols, message = FALSE, eval = evalMe }
listTables(edb)
 
```

In the following sections we show examples how to 1) fetch protein annotations
as additional columns to gene/transcript annotations, 2) fetch protein
annotation data and 3) map proteins to the genome.

```{r  haveprot, echo = FALSE, results = "hide", eval = evalMe }
## Use this to conditionally disable eval on following chunks
haveProt <- hasProteinData(edb) & evalMe
 
```


# Fetch protein annotation for genes and transcripts

Protein annotations for (protein coding) transcripts can be retrieved by simply
adding the desired annotation columns to the `columns` parameter of the e.g. `genes`
or `transcripts` methods.

```{r  a_transcripts, eval = haveProt }
## Get also protein information for ZBTB16 transcripts
txs <- transcripts(edb, filter = GeneNameFilter("ZBTB16"),
		   columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs
 
```

The gene ZBTB16 has protein coding and non-coding transcripts, thus, we get the
protein ID for the coding- and `NA` for the non-coding transcripts. Note also that
we have a transcript targeted for nonsense mediated mRNA-decay with a protein ID
associated with it, but no Uniprot ID.

```{r  a_transcripts_coding_noncoding, eval = haveProt }
## Subset to transcripts with tx_biotype other than protein_coding.
txs[txs$tx_biotype != "protein_coding", c("uniprot_id", "tx_biotype",
					  "protein_id")]
 
```

While the mapping from a protein coding transcript to a Ensembl protein ID
(column `protein_id`) is 1:1, the mapping between `protein_id` and `uniprot_id` can be
n:m, i.e. each Ensembl protein ID can be mapped to 1 or more Uniprot IDs and
each Uniprot ID can be mapped to more than one `protein_id` (and hence
`tx_id`). This should be kept in mind if querying transcripts from the database
fetching Uniprot related additional columns or even protein ID features, as in
such cases a redundant list of transcripts is returned.

```{r  a_transcripts_coding, eval = haveProt }
## List the protein IDs and uniprot IDs for the coding transcripts
mcols(txs[txs$tx_biotype == "protein_coding",
	  c("tx_id", "protein_id", "uniprot_id")])
 
```

Some of the n:m mappings for Uniprot IDs can be resolved by restricting either
to entries from one Uniprot database (*SPTREMBL* or *SWISSPROT*) or to mappings of a
certain type of mapping method. The corresponding filters are the
`UniprotDbFilter` and the `UniprotMappingTypeFilter` (using the `uniprot_db` and
`uniprot_mapping_type` columns of the `uniprot` database table). In the example
below we restrict the result to Uniprot IDs with the mapping type *DIRECT*.

```{r  a_transcripts_coding_up, eval = haveProt }
## List all uniprot mapping types in the database.
listUniprotMappingTypes(edb)

## Get all protein_coding transcripts of ZBTB16 along with their protein_id
## and Uniprot IDs, restricting to protein_id to uniprot_id mappings based
## on "DIRECT" mapping methods.
txs <- transcripts(edb, filter = list(GeneNameFilter("ZBTB16"),
				      UniprotMappingTypeFilter("DIRECT")),
		   columns = c("protein_id", "uniprot_id", "uniprot_db"))
mcols(txs)
 
```

For this example the use of the `UniprotMappingTypeFilter` resolved the multiple
mapping of Uniprot IDs to Ensembl protein IDs, but the Uniprot ID *Q05516* is
still assigned to the two Ensembl protein IDs *ENSP00000338157* and
*ENSP00000376721*.

All protein annotations can also be added as *metadata columns* to the
results of the `genes`, `exons`, `exonsBy`, `transcriptsBy`, `cdsBy`, `fiveUTRsByTranscript`
and `threeUTRsByTranscript` methods by specifying the desired column names with
the `columns` parameter. For non coding transcripts `NA` will be reported in the
protein annotation columns.

In addition to retrieve protein annotations from the database, we can also use
protein data to filter the results. In the example below we fetch for example
all genes from the database that have a certain protein domain in the protein
encoded by any of its transcripts.

```{r  a_genes_protdomid_filter, eval = haveProt }
## Get all genes encoded on chromosome 11 which protein contains 
## a certain protein domain.
gns <- genes(edb, filter = ~ prot_dom_id == "PS50097" & seq_name == "11")
length(gns)

sort(gns$gene_name)
 
```

So, in total we got 152 genes with that protein domain. In addition to the
`ProtDomIdFilter`, also the `ProteinidFilter` and the `UniprotidFilter` can be used to
query the database for entries matching conditions on their protein ID or
Uniprot ID.


# Use methods from the `AnnotationDbi` package to query protein annotation

The `select`, `keys` and `mapIds` methods from the `AnnotationDbi` package can also be
used to query `EnsDb` objects for protein annotations. Supported columns and
key types are returned by the `columns` and `keytypes` methods.

```{r  a_2_annotationdbi, message = FALSE, eval = haveProt }
## Show all columns that are provided by the database
columns(edb)

## Show all key types/filters that are supported
keytypes(edb)
 
```

Below we fetch all Uniprot IDs annotated to the gene *ZBTB16*.

```{r  a_2_select, message = FALSE, eval = haveProt }
select(edb, keys = "ZBTB16", keytype = "GENENAME",
       columns = "UNIPROTID")
 
```

This returns us all Uniprot IDs of all proteins encoded by the gene's
transcripts. One of the transcripts from ZBTB16, while having a CDS and being
annotated to a protein, does not have an Uniprot ID assigned (thus `NA` is
returned by the above call). As we see below, this transcript is targeted for
non sense mediated mRNA decay.

```{r  a_2_select_nmd, message = FALSE, eval = haveProt }
## Call select, this time providing a GeneNameFilter.
select(edb, keys = GeneNameFilter("ZBTB16"),
       columns = c("TXBIOTYPE", "UNIPROTID", "PROTEINID"))
 
```

Note also that we passed this time a `GeneMameFilter` with the `keys` parameter.


# Retrieve proteins from the database

Proteins can be fetched using the dedicated `proteins` method that returns, unlike
DNA/RNA-based methods like `genes` or `transcripts`, not a `GRanges` object by
default, but a `DataFrame` object. Alternatively, results can be returned as a
`data.frame` or as an `AAStringSet` object from the `Biobase` package. Note that this
might change in future releases if a more appropriate object to represent
protein annotations becomes available.

In the code chunk below we fetch all protein annotations for the gene *ZBTB16*.

```{r  b_proteins, message = FALSE, eval = haveProt }
## Get all proteins and return them as an AAStringSet
prts <- proteins(edb, filter = GeneNameFilter("ZBTB16"),
		 return.type = "AAStringSet")
prts
 
```

Besides the amino acid sequence, the `prts` contains also additional annotations
that can be accessed with the `mcols` method (metadata columns). All additional
columns provided with the parameter `columns` are also added to the `mcols`
`DataFrame`.

```{r  b_proteins_mcols, message = FALSE, eval = haveProt }
mcols(prts)
 
```

Note that the `proteins` method will retrieve only gene/transcript annotations of
transcripts encoding a protein. Thus annotations for the non-coding transcripts
of the gene *ZBTB16*, that were returned by calls to `genes` or `transcripts` in the
previous section are not fetched.

Querying in addition Uniprot identifiers or protein domain data will result at
present in a redundant list of proteins as shown in the code block below.

```{r  b_proteins_prot_doms, message = FALSE, eval = haveProt }
## Get also protein domain annotations in addition to the protein annotations.
pd <- proteins(edb, filter = GeneNameFilter("ZBTB16"),
	       columns = c("tx_id", listColumns(edb, "protein_domain")),
	       return.type = "AAStringSet")
pd
 
```

The result contains one row/element for each protein domain in each of the
proteins. The number of protein domains per protein and the `mcols` are shown
below.

```{r  b_proteins_prot_doms_2, message = FALSE, eval = haveProt }
## The number of protein domains per protein:
table(names(pd))

## The mcols
mcols(pd)
 
```

As we can see each protein can have several protein domains with the start and
end coordinates within the amino acid sequence being reported in columns
`prot_dom_start` and `prot_dom_end`. Also, not all Ensembl protein IDs, like
`protein_id` *ENSP00000445047* are mapped to an Uniprot ID or have protein domains.


# Map peptide features within proteins to the genome

The *coordinate-mapping.Rmd* vignette provides a detailed description of all
functions that allow to map between genomic, transcript and protein coordinates.


# Session information

```{r  sessionInfo }
sessionInfo() 
```