---
title: "Accessing EuPathDB Resources using AnnotationHub"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Accessing EuPathDB Resources using AnnotationHub}
  %\VignetteEngine{knitr::rmarkdown}
  % \VignetteKeyword{eupathdb, annotations}
  \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results='asis', message=FALSE}
BiocStyle::markdown()
```

**Authors**: [V. Keith Hughitt](mailto:keith.hughitt@nih.gov)
**Modified**: `r file.info("EuPathDB.Rmd")$mtime`
**Compiled**: `r date()`

# Overview

This tutorial describes how to query and make use of annotations retrieved from
[EuPathDB: The Eukaryotic Pathogen Genomics Resource](http://eupathdb.org/eupathdb/) using
[AnnotationHub](http://bioconductor.org/packages/release/bioc/html/AnnotationHub.html).

For more information on using AnnotationHub, check out the AnnotationHub vignettes:

- [AnnotationHub: Access the AnnotationHub Web Service](http://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html)
- [AnnotationHub How-To’s](http://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html)

The resources described in this tutorial were generated using GFF files and web
API requests made to the various EuPathDB databases (TriTrypDB, ToxoDB, etc.).
Only organisms with annotated genomes (those for which GFF files are available)
are accessible through AnnotationHub.

The two main resources provided are:

- [OrgDb](https://www.bioconductor.org/help/workflows/annotation/annotation/#OrgDb)
- [GRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)

OrgDb objects for an organism include basic gene-level information such as:

- Gene ID
- Gene description
- Chromosome number
- GO terms associated with the gene
- KEGG pathways associated with the gene
- Etc.

For some organisms, [InterPro](https://www.ebi.ac.uk/interpro/) protein domain
information is also available. In some cases, however, even though InterPro
domain information is available through EuPathDB, it is too large to be
included in the current AnnotationHub resources.

For more information about working with Bioconductor annotation resources, see:

- [Genomic Annotation Resources in Bioconductor](https://www.bioconductor.org/help/workflows/annotation/annotation/)

# Installation

If you don't already have AnnotationHub installed on your system, use
`BiocManager::install` to install the package:

```{r eval = FALSE}
install.packages("BiocManager")
BiocManager::install("AnnotationHub")
```

# Getting started

To begin, let's create a new `AnnotationHub` connection and use it to query
AnnotationHub for all EuPathDB resources.

```{r}
library('AnnotationHub')

# create an AnnotationHub connection
ah <- AnnotationHub()

# search for all EuPathDB resources
meta <- query(ah, "EuPathDB")

length(meta)
head(meta)

# types of EuPathDB data available
table(meta$rdataclass)

# distribution of resources by specific databases
table(meta$dataprovider)

# list of organisms for which resources are available
length(unique(meta$species))
head(unique(meta$species))
```

# Working with EuPathDB OrgDb resources

Next, we will see how you can query AnnotationHub for EuPathDB OrgDb resources.

To begin, create an AnnotationHub connection, if you have not already done so,
as shown in the section above.

You can now use the `query` function to search for your organism of interest
and store the result as follows:

```{r}
res <- query(ah, c('Leishmania major strain Friedlin', 'OrgDb', 'EuPathDB'))
res
```

The result includes a single record, "AH65089".
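
The identifier printed for your query may differ from the one shown in this
vignette, depending on the AnnotationHub snapshot in use. As a minimal sketch
(not evaluated here, and assuming the query returns at least one record), the
identifiers and some descriptive metadata can be looked up from the query
result itself with `names()` and `mcols()`. The variable name `record_ids` is
illustrative only.

```{r eval = FALSE}
library('AnnotationHub')

# repeat the connection and query so that this sketch can be run on its own
ah <- AnnotationHub()
res <- query(ah, c('Leishmania major strain Friedlin', 'OrgDb', 'EuPathDB'))

# record identifiers ("AH...") are the names of the query result
record_ids <- names(res)
record_ids

# a few metadata columns describing each matching record
mcols(res)[, c('title', 'species', 'rdataclass')]

# retrieve the first matching record without hard-coding its identifier;
# the guard avoids an error when the query returns no records
if (length(record_ids) > 0) {
    orgdb <- res[[record_ids[1]]]
}
```

Looking the identifier up this way keeps the remaining steps working even if
the record has been re-issued under a different identifier.
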
The record can be accessed from the result variable using list-like indexing:

```{r}
orgdb <- res[['AH65089']]
class(orgdb)
```

We can see that we now have an OrgDb instance, and as such, we can use the
usual methods available for working with OrgDb objects, including:

- `columns()`
- `keys()`
- `select()`

```{r}
# list available fields to retrieve
columns(orgdb)

# create a vector containing all gene ids for the organism
gids <- keys(orgdb, keytype='GID')
head(gids)

# retrieve the chromosome, description, and biotype for each gene
dat <- select(orgdb, keys=gids, keytype='GID',
              columns=c('CHR', 'TYPE', 'GENEDESCRIPTION'))

head(dat)

table(dat$TYPE)
table(dat$CHR)

# create a gene / GO term mapping
gene_go_mapping <- select(orgdb, keys=gids, keytype='GID',
                          columns=c('GO_ID', 'GO_TERM_NAME', 'ONTOLOGY'))
head(gene_go_mapping)

# retrieve KEGG, etc. pathway annotations
gene_pathway_mapping <- select(orgdb, keys=gids, keytype='GID',
                               columns=c('PATHWAY', 'PATHWAY_SOURCE'))
table(gene_pathway_mapping$PATHWAY_SOURCE)
head(gene_pathway_mapping)
```

# Working with EuPathDB GRanges resources

In addition to retrieving gene annotations, AnnotationHub can also be used to
query GenomicRanges (`GRanges`) objects containing information about gene and
transcript structure.

```{r}
# query AnnotationHub
res <- query(ah, c('Leishmania major strain Friedlin', 'GRanges', 'EuPathDB'))
res

# retrieve a GRanges instance associated with the result record
gr <- res[['AH65354']]
gr
```

The resulting `GRanges` object can then be interacted with using the
[standard GRanges functions](https://bioconductor.org/packages/3.7/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf),
including:

- `seqnames`
- `strand`
- `width`

```{r}
# chromosome names
seqnames(gr)

# strand information
strand(gr)

# feature widths
width(gr)
```

Some information can be retrieved directly from the object's metadata columns
using the `$` operator:

```{r}
# list of location types in the resource
table(gr$type)
```

To subset the GRanges instance, you can use the standard `[` operator:

```{r}
# get the first three ranges
gr[1:3]

# get all gene entries on chromosome 4
gr[gr$type == 'gene' & seqnames(gr) == 'LmjF.04']
```

# Session Information

```{r}
sessionInfo()
```