---
title: "Introduction to AnnotationHubData"
author: "Valerie Obenchain"
date: "Modified: October 2016. Compiled: `r format(Sys.Date(), '%d %b %Y')`"
output:
  BiocStyle::html_document:
    toc: true
---

# Overview

The `AnnotationHubData` package provides tools to acquire, annotate, convert and store data for use in Bioconductor's `AnnotationHub`. BED files from the ENCODE project, GTF files from Ensembl, and annotation tracks from UCSC are examples of data that can be downloaded, described with metadata, transformed to standard `Bioconductor` data types, and stored so that they can be served on demand to users via the `AnnotationHub` client. While data are often manipulated into a more R-friendly form, the data themselves retain their raw content and are not filtered or curated like those in [ExperimentHub](http://www.bioconductor.org/packages/3.4/bioc/html/ExperimentHub.html). Each resource has associated metadata that can be searched through the `AnnotationHub` client interface.

# New resources

## Family of resources

Multiple, related resources are added to `AnnotationHub` by creating a software package similar to the existing annotation packages. The package itself does not contain data but serves as a lightweight wrapper around scripts that generate metadata for the resources added to `AnnotationHub`.

At a minimum the package should contain a man page describing the resources. Vignettes and additional `R` code for manipulating the objects are optional.

Creating the package involves the following steps:

1. Notify `Bioconductor` team member:

    Man page and vignette examples in the software package will not work until the data are available in `AnnotationHub`. Adding the data to AWS S3 and the metadata to the production database requires assistance from a `Bioconductor` team member. If you are interested in submitting a package, please send an email to packages@bioconductor.org so a team member can work with you through the process.

2. Building the software package:

    Below is an outline of package organization. The files listed are required unless otherwise stated.

    * inst/extdata/

        - metadata.csv: This file contains the metadata, one row per resource to be added to the `AnnotationHub` database. The file should be generated by the code in inst/scripts/make-metadata.R, where the final data are written out with write.csv(..., row.names=FALSE). The required column names and data types are specified in `AnnotationHubData::readMetadataFromCsv()`. See ?`readMetadataFromCsv` for details.

    * inst/scripts/

        - make-data.R: A script describing the steps involved in making the data object(s). This includes where the original data were downloaded from, any pre-processing, and how the final R object was made. Include a description of any steps performed outside of `R` with third party software. Data objects should be serialized with save(), using the .rda extension on the filename.

        - make-metadata.R: A script to make the metadata.csv file located in inst/extdata of the package. See ?`readMetadataFromCsv` for a description of expected fields and data types. `readMetadataFromCsv()` can be used to validate the metadata.csv file before submitting the package.

    * vignettes/

        OPTIONAL vignette(s) describing analysis workflows.

    * R/

        - make-metadata.R: Code that assembles metadata for all resources and calls `AnnotationHubData::AnnotationHubMetadata()`. The output should be a list of `AnnotationHubMetadata` objects, one for each resource. Example functions can be found in the `AnnotationHubData` source code with names of the form make*ToAHM().

        - make-data.R: Code that downloads and manipulates (if necessary) the data; outputs are files on disk ready to be pushed to S3. If data are to be hosted on a personal web site instead of S3, this file should explain any manipulation of the data prior to hosting on the web site. For data hosted on a public web site with no prior manipulation this file is not needed.

        - OPTIONAL functions to enhance data exploration.

    * man/

        - package man page: The package man page serves as a landing point and should briefly describe all resources associated with the package. There should be an \alias entry for each resource title, either on the package man page or on individual man pages.

        - resource man pages: OPTIONAL. Man page(s) should describe the resource (raw data source, processing, QC steps) and demonstrate how the data can be loaded through the `AnnotationHub` interface. For example, replace "SEARCHTERM*" below with one or more search terms that uniquely identify resources in your package.

            ```
            library(AnnotationHub)
            hub <- AnnotationHub()
            ## query() takes the search terms as a single character vector
            myfiles <- query(hub, c("SEARCHTERM1", "SEARCHTERM2"))
            myfiles[[1]]  ## load the first resource in the list
            ```

    * DESCRIPTION / NAMESPACE

        The package should depend on and fully import `AnnotationHub`. Package authors are encouraged to use the `AnnotationHub::listResources()` and `AnnotationHub::loadResource()` functions in their man pages and vignette. These helpers are designed to facilitate data discovery within a specific package rather than within all of `AnnotationHub`.

3. Data objects:

    Data are not formally part of the software package and are stored separately in AWS S3 buckets. The author should make the data available via Dropbox, an FTP site or another mutually accessible application, and a member of the `Bioconductor` team will upload it to S3.

4. Package review:

    When the data and metadata are ready, a `Bioconductor` team member will push the data to AWS S3 and add the metadata to the production database. At this point the package man pages and vignette can be finalized. When the package passes R CMD build and check, it can be submitted to the [package tracker](https://github.com/Bioconductor/Contributions) for review.
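The metadata step in the outline above can be sketched roughly as follows. This is a minimal, hypothetical version of inst/scripts/make-metadata.R that uses base `R` only. The column names and values are placeholders chosen for illustration and are not the complete required set; ?`readMetadataFromCsv` remains the authoritative reference for required columns and their types. The file is written to a temporary directory here so the sketch runs anywhere; in a real package it would be written to inst/extdata/metadata.csv.

```r
## Hypothetical sketch of inst/scripts/make-metadata.R.
## Column names and values are illustrative placeholders only;
## see ?readMetadataFromCsv for the required columns and types.
meta <- data.frame(
    Title = c("Example resource A", "Example resource B"),
    Description = c("Placeholder description of resource A",
                    "Placeholder description of resource B"),
    Species = c("Homo sapiens", "Mus musculus"),
    TaxonomyId = c(9606L, 10090L),
    Genome = c("hg38", "mm10"),
    stringsAsFactors = FALSE
)

## One row per resource; row.names=FALSE keeps R's row names out of the file.
## Assumption: tempdir() stands in for the package's inst/extdata directory.
outfile <- file.path(tempdir(), "metadata.csv")
write.csv(meta, file = outfile, row.names = FALSE)

## Read the file back as a quick sanity check of its shape.
check <- read.csv(outfile, stringsAsFactors = FALSE)
```

Reading the file back only confirms its shape. It does not replace validating the finished metadata.csv with `readMetadataFromCsv()` before submission.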
## Individual resources

Individual objects of a standard class can be added to the hub by providing only the data and metadata files or by creating a package as described in the `Family of resources` section.

OrgDb, TxDb and BSgenome objects are well-defined `Bioconductor` classes, and methods to download and process these objects already exist in `AnnotationHub`. When adding only one or two objects, the overhead of creating a package may be unnecessary. The goal of the package is to provide structure for metadata generation; it makes sense when there are plans to update versions or add new organisms in the future.

Make sure the OrgDb, TxDb or BSgenome object you want to add does not already exist here: [Bioconductor annotation repository](http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData)

Providing just data and metadata files involves the following steps:

1. Notify `Bioconductor` team member:

    Adding the data to AWS S3 and the metadata to the production database requires assistance from a `Bioconductor` team member. Please send an email to packages@bioconductor.org so a team member can work with you through the process.

2. Prepare the data:

    In the case of an OrgDb object, only the sqlite file is stored in S3. See makeOrgPackageFromNCBI() and makeOrgPackage() in the `AnnotationForge` package for help creating the sqlite file. BSgenome objects should be made according to the steps outlined in the [BSgenome vignette](http://www.bioconductor.org/packages/3.4/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf). TxDb objects will be made on-the-fly from a GRanges with GenomicFeatures::makeTxDbFromGRanges() when the resource is downloaded from `AnnotationHub`. Data for a TxDb should therefore be provided as a GRanges object. See GenomicRanges::makeGRangesFromDataFrame() or rtracklayer::import() for help creating the GRanges.

3. Generate metadata:

    Prepare a .R file that generates metadata for the resource(s) by calling the `AnnotationHubData::AnnotationHubMetadata()` constructor. Argument details are found on the ?`AnnotationHubMetadata` man page.

    As an example, this piece of code generates the metadata for the Vitis vinifera TxDb that Timothée Flutre contributed to `AnnotationHub`:

    ```{r, TxDb_Metadata, eval=FALSE}
    metadata <- AnnotationHubMetadata(
        Description="Gene Annotation for Vitis vinifera",
        Genome="IGGP12Xv0",
        Species="Vitis vinifera",
        SourceUrl="http://genomes.cribi.unipd.it/DATA/V2/V2.1/V2.1.gff3",
        SourceLastModifiedDate=as.POSIXct("2014-04-17"),
        SourceVersion="2.1",
        RDataPath="community/tflutre/",
        TaxonomyId=29760L,
        Title="Vvinifera_CRIBI_IGGP12Xv0_V2.1.gff3.Rdata",
        BiocVersion=package_version("3.3"),
        Coordinate_1_based=TRUE,
        DataProvider="CRIBI",
        Maintainer="Timothée Flutre