--- title: "Mapping between different Genome references" output: BiocStyle::html_document: toc: true --- ```{r style, echo = FALSE, results = 'asis', message=FALSE} BiocStyle::markdown() ``` **Package:** [`Pbase`](http://bioconductor.org/packages/devel/bioc/html/Pbase.html)
**Author:** [Laurent Gatto](http://cpu.sysbiol.cam.ac.uk/)
**Last compiled:** `r date()`
**Last modified:** `r file.info("ensucsc.Rmd")$mtime` This vignette described how to convert coordinates between different genome references. We will use transcript `ENST00000373316` and GRCh38 and GRCh37 as working example. ```{r bm} tr <- "ENST00000373316" ``` ## GRCh38 Here is use `r Biocpkg("Gviz")` to query the latest Ensembl biomart and extract the transcript of interest. ```{r h38} suppressMessages(library("Gviz")) suppressMessages(library("biomaRt")) h38 <- useMart("ensembl", "hsapiens_gene_ensembl") tr38 <- BiomartGeneRegionTrack(biomart = h38, transcript = tr) tr38 <- split(tr38, transcript(tr38)) tr38 <- ranges(tr38[[tr]]) tr38 ``` Note the starting position of the transcript is `r min(start(tr38))`. ## GRCh37 Below, I repeat the same operation without using my own ens Mart instance. As far as I understand, Gviz queries the UCSC genome reference by default. ```{r h37} h37 <- useMart(host = "feb2014.archive.ensembl.org", biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl") tr37 <- BiomartGeneRegionTrack(biomart = h37, transcript = tr) tr37 <- split(tr37, transcript(tr37)) tr37 <- ranges(tr37[[tr]]) tr37 ``` Note the starting position of the transcript is `r min(start(tr37))`. These differences seem to stem from different genome builds. **Ensembl release 78** uses **GRCh38**, while **UCSC** uses **GRCh37**. Indeed, `r Biocpkg("Gviz")` sets the Ensembl biomart server to `Feb.2014` `GRCh37.p13`. ## Coordinates conversion We will use the coordinate mapping infrastructure described in the [January 2015 Bioconductor Newletter](http://www.bioconductor.org/help/newsletters/2015_January/#coordinate-mapping) and the [Changing genomic coordinate systems with rtracklayer::liftOver](http://bioconductor.org/help/workflows/liftOver/) workflow. First, we query `r Biocpkg("AnnotationHub")` for a chain file to perform the operation we want. ```{r chain} library("AnnotationHub") hub <- AnnotationHub() query(hub, 'hg19ToHg38') chain <- query(hub, 'hg19ToHg38')[[1]] ``` The `liftOver` function from the `r Biocpkg("rtracklayer")` package will use the chain and translate the coordinates of a `GRanges` object into a new `GRangesList` object. ```{r} library("rtracklayer") res <- liftOver(tr37, chain) res <- unlist(res) ## set annotation names(res) <- NULL genome(res) <- "hsapiens_gene_ensembl" all.equal(res, tr38) ``` ## Session information ```{r si} sessionInfo() ```