--- title: "SeSAMe Data User Guide" shorttitle: "sesameData guide" package: sesameData output: rmarkdown::html_vignette fig_width: 6 fig_height: 5 vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{sesameData User Guide} %\VignetteEncoding{UTF-8} --- # Introduction `sesameData` package provides associated data for sesame package. This includes example data for testing and instructional purpose, as we ll as probe annotation for different Infinium platforms. ```{r, message=FALSE, warning=FALSE} library(sesameData) library(GenomicRanges) ``` # Data from ExperimentHub Titles of all the available data can be shown with: ```{r} head(sesameDataList()) ``` ## Local caching Each sesame datum from ExperimentHub is accessible through the `sesameDataGet` interface. It should be noted that all data must be pre-cached to local disk before they can be used. This design is to prevent conflict in annotation data caching and remove internet dependency. Caching needs only be done once per sesame/sesameData installation. One can cache data using ```{r, include = FALSE} sesameDataCacheExample() # examples ``` ```{r} sesameDataCache() ``` Once a data object is loaded, it is stored to a tempoary cache, so that the data doesn't need to be retrieved again next time we call `sesameDataGet`. This design is meant to speeed up the run time. For example, the annotation for HM27 can be retrieved with the title: ```{r} HM27.address <- sesameDataGet('HM27.address') ``` ## In-memory caching It's worth noting that once a data is retrieved through the `sesameDataGet` inferface (below), it will stay in memory so next time the object will be returned immediately. This design avoids repeated disk/web retrieval. In some rare situation, one may want to redo the download/disk IO, or empty the cache to save memory. This can be done with: ```{r} sesameDataGet_resetEnv() ``` ## Transcript models Sesame provides some utility functions to process transcript models, which can be represented as data.frame, GRanges and GRangesList objects. For example, `sesameData_getTxnGRanges` calls `sesameDataGet("genomeInfo.mm10")$txns` to retrieve a transcript-centric GRangesList object from GENCODE including its gene annotation, exon and cds (for protein-coding genes). It is then turned into a simple GRanges object of transcript: ```{r} txns_gr <- sesameData_getTxnGRanges("mm10") txns_gr ``` The returned GRanges object does not contain the exon coordinates. We can further collapse different transcripts of the same gene (isoforms) to gene level. Gene start is the minimum of all isoform starts and end is the maximum of all isoform ends. ```{r} genes_gr <- sesameData_txnToGeneGRanges(txns_gr) genes_gr ``` # Annotate probes One can annotate given probe ID using any genomic features stored in GRanges objects. For example, the following demonstrate the annotation of 500 random Mammal40 probes for gene promoters. ```{r} probes <- names(sesameData_getManifestGRanges("Mammal40"))[1:500] head(probes) # our input txns <- sesameData_getTxnGRanges("hg38") pm <- promoters(txns, upstream = 1500, downstream = 1500) pm <- pm[pm$transcript_type == "protein_coding"] sesameData_annoProbes(probes, pm, column = "gene_name") ``` # Manifest from Github Sesame provides access to array manfiest stored as GRanges object. These GRanges object are converted from the raw tsv files on our [array annotation website](http://zwdzwd.github.io/InfiniumAnnotation) using the [conversion code](https://tinyurl.com/8bt9dssc) ```{r} gr <- sesameData_getManifestGRanges("HM450") length(gr) ``` Note that by default the GRanges object exclude decoy sequence probes (e.g., _alt, and _random contigs). To include them, we need to use the `decoy = TRUE` option in sesameData_getManifestDF. # Subset probes One can directly get probes from different parts of the genome. ```{r message = FALSE, warning = FALSE} library(GenomicRanges) ``` ```{r} regs <- GRanges('chr5', IRanges(135313937, 135419936)) sesameData_getProbesByRegion(regs, platform = 'Mammal40') sesameData_getProbesByChromosome('chrX', platform = 'Mammal40') sesameData_getAutosomeProbes("Mammal40") sesameData_getProbesByGene('DNMT3A', "Mammal40", upstream=500) sesameData_getProbesByTSS('DNMT3A', "Mammal40") ``` One can also get all TSS probes by ```{r} TSSprobes = sesameData_getProbesByTSS(NULL, "Mammal40") ``` # Get nearby genes ```{r} sesameData_getGenesByProbes(c("cg14620903","cg22464003"), max_distance = 10000) ``` # Installation From Bioconductor ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("sesameData") ``` Development version can be installed from github ```{r, eval=FALSE} BiocManager::install("zwdzwd/sesameData") ``` ```{r} sessionInfo() ```