---
title: "ORFik Overview"
author: "Haakon Tjeldnes & Kornel Labun"
date: "`r BiocStyle::doc_date()`"
package: "`r pkg_ver('ORFik')`"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{ORFik Overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Welcome to the `ORFik` package. This vignette walks you through the main package usage with examples.
`ORFik` is an R package containing various functions for analysis of RiboSeq, RNASeq and CageSeq data.
`ORFik` currently supports:

1. Finding Open Reading Frames (very fast) in the genome of interest or on a set of transcripts/sequences.
2. Automatic estimation of RiboSeq footprint shifts.
3. Utilities for metaplots of RiboSeq coverage over gene START and STOP codons, which make the shift easy to spot.
4. Shifting functions for RiboSeq data.
5. Finding new Transcription Start Sites with the use of CageSeq data.
6. Various measurements of gene identity, e.g. FLOSS, coverage, ORFscore and entropy, recreated from scientific publications.
7. Utility functions that extend GenomicRanges for faster grouping, splitting, tiling etc.

# Finding Open Reading Frames

In molecular genetics, an Open Reading Frame (ORF) is the part of a reading frame that has the ability to be translated. This does not mean that every ORF is translated or functional, but to find novel genes we must first be able to identify potential ORFs.

To find all Open Reading Frames (ORFs) and possibly map them to genomic coordinates, `ORFik` gives you three main functions:

* `findORFs` - find ORFs in sequences of interest,
* `findMapORFs` - find ORFs and map them to their respective genomic coordinates
* `findORFsFasta` - find ORFs in a Fasta file or `BSgenome` (supports circular genomes!)

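To make the definition of an ORF concrete, the following is a minimal base-R sketch of a naive ORF search on a single strand: it takes every start codon and reads in-frame codons until the first stop codon. The helper `find_orfs_naive` and its arguments are hypothetical names used only for illustration; this is not how `findORFs` is implemented, and it ignores options such as a minimum length.

```r
# Hypothetical helper, not part of ORFik: naive ORF search on one strand.
find_orfs_naive <- function(dna, start_codon = "ATG",
                            stop_codons = c("TAA", "TAG", "TGA")) {
  dna <- toupper(dna)
  # All positions where the start codon occurs
  starts <- as.integer(gregexpr(start_codon, dna, fixed = TRUE)[[1]])
  starts <- starts[starts > 0]  # gregexpr returns -1 when nothing matches
  orfs <- lapply(starts, function(s) {
    # Codon start positions in the same reading frame as s
    pos <- seq(s, nchar(dna) - 2, by = 3)
    codons <- substring(dna, pos, pos + 2)
    hit <- which(codons %in% stop_codons)
    if (length(hit) == 0) return(NULL)  # no in-frame stop codon: not an ORF
    # The ORF ends at the last base of the first in-frame stop codon
    data.frame(start = s, end = pos[hit[1]] + 2)
  })
  orfs <- Filter(Negate(is.null), orfs)
  if (length(orfs) == 0) {
    return(data.frame(start = integer(0), end = integer(0)))
  }
  do.call(rbind, orfs)
}

orfs <- find_orfs_naive("ATGAAATGAAGTAAATCAAAACAT")
orfs
```

Every start codon with an in-frame stop is reported separately, so overlapping ORFs in different frames both appear. The `ORFik` functions listed above do the same kind of search much faster and can also map the result back to genomic coordinates.
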
## Example of finding ORFs in the 5' UTRs of hg19

```{r eval = TRUE, echo = TRUE, message = FALSE}
library(ORFik)
library(GenomicFeatures)
```

After loading the libraries, load the example data from `GenomicFeatures`. We load the annotation as a txdb. We will extract the 5' leaders to find all upstream open reading frames.

```{r eval = TRUE, echo = TRUE}
txdbFile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                        package = "GenomicFeatures")
txdb <- loadTxdb(txdbFile)
fiveUTRs <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fiveUTRs
```

As we can see, we have extracted the 5' UTRs for the hg19 annotations. Now we can load the `BSgenome` version of the human genome (hg19). If you don't have this package installed, you will not see the result from the code below. You might have to install `BSgenome.Hsapiens.UCSC.hg19` and run the code yourself, as we don't install this package together with `ORFik`.

```{r eval = TRUE, echo = TRUE, message = FALSE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {

  # Extract sequences of fiveUTRs.
  # Either you import a fasta file of ranges, or you have some BSgenome.
  tx_seqs <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19::Hsapiens,
                                   fiveUTRs)

  # Find all ORFs on those transcripts and get their genomic coordinates
  fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs)
  fiveUTR_ORFs
}
```

In the example above you can see that fiveUTR_ORFs are grouped by transcript; the first group is from transcript "uc010ogz.1". The meta-column `names` contains the name of the transcript and the identifier of the ORF, separated by "_". When an ORF spans two exons it appears twice, like the first ORF with name "uc010ogz.1_1". The first ORF will always be the one most upstream for the "+" strand, and least upstream for the "-" strand.

# CageSeq data for 5' UTR re-annotation

In the previous example we used the reference annotation of the 5' UTRs from the package GenomicFeatures. Here we will take advantage of CageSeq data to set new Transcription Start Sites (TSS) and re-annotate the 5' UTRs.

```{r eval = TRUE, echo = TRUE}
# path to example CageSeq data from hg19 heart sample
cageData <- system.file("extdata", "cage-seq-heart.bed.bgz",
                        package = "ORFik")
# get new Transcription Start Sites using CageSeq dataset
newFiveUTRs <- reassignTSSbyCage(fiveUTRs, cageData)
newFiveUTRs
```

You will now see that most of the transcription start sites have changed. Depending on the species, regular annotations might be incomplete or not specific enough for your purposes.

NOTE: If you want to edit the whole txdb / gtf file, use `reassignTxDbByCage` and save the result to get a new gtf with leaders re-annotated by CAGE.

# RiboSeq footprints automatic shift detection and shifting

In RiboSeq data, ribosomal footprints are restricted to their p-site positions and shifted with respect to the shifts visible over the start and stop codons. `ORFik` has multiple functions for processing RiboSeq data. We will go through an example of processing RiboSeq data below.

Load example raw RiboSeq footprints (unshifted).

```{r eval = TRUE, echo = TRUE}
bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
footprints <- GenomicAlignments::readGAlignments(bam_file)
```

Investigate which footprint lengths are present in our data.

```{r eval = TRUE, echo = TRUE}
table(readWidths(footprints))
```

For the sake of this example we will focus only on the most abundant length, 29.

```{r eval = TRUE, echo = TRUE}
footprints <- footprints[readWidths(footprints) == 29]
footprints
```

Restrict footprints to their 5' starts (after shifting, this position will be the p-site).

```{r eval = TRUE, echo = TRUE}
footprintsGR <- ORFik:::convertToOneBasedRanges(footprints,
                                                addSizeColumn = TRUE)
footprintsGR
```

Now, let's prepare the annotations and focus on START and STOP codons.

```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE}
gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik")
txdb <- loadTxdb(gtf_file)
tx <- GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE)
cds <- cdsBy(txdb, by = "tx", use.names = TRUE)
trailers <- threeUTRsByTranscript(txdb, use.names = TRUE)
cds[1]
```

Keep only the transcripts whose leader, cds and trailer all have some minimum length. Then get the start and stop codons with an extra window of 30bp around them.

```{r eval = TRUE, echo = TRUE, warning = FALSE}
txNames <- filterTranscripts(txdb)
tx <- tx[txNames]
cds <- cds[txNames]
trailers <- trailers[txNames]
windowsStart <- startRegion(cds[txNames], tx, TRUE, upstream = 30, 29)
windowsStop <- startRegion(trailers, tx, TRUE, upstream = 30, 29)
windowsStart
```

Calculate meta-coverage over the start and stop windowed regions.

```{r eval = TRUE, echo = TRUE, warning = FALSE}
hitMapStart <- metaWindow(footprintsGR, windowsStart, withFrames = TRUE)
hitMapStop <- metaWindow(footprintsGR, windowsStop, withFrames = TRUE)
```

Plot start/stop windows for length 29.

```{r eval = TRUE, echo = TRUE, warning = FALSE}
ORFik:::pSitePlot(hitMapStart)
```

```{r eval = TRUE, echo = TRUE, warning = FALSE}
ORFik:::pSitePlot(hitMapStop, region = "stop")
```

From the plots, a reasonable conclusion is to shift length 29 by 12. We can also detect the RiboSeq shifts automatically using the code below; the result agrees with the plots.

```{r eval = TRUE, echo = TRUE, warning = FALSE}
shifts <- detectRibosomeShifts(footprints, txdb, stop = TRUE)
shifts
```

`ORFik` has a function that shifts footprints by the desired offsets. Check the documentation for more details.

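To make the idea of a shift concrete, here is a minimal base-R sketch of what shifting a read by 12 means in terms of coordinates. It assumes unspliced reads stored as a plain data frame; the object `reads` and the helper `psite_position` are hypothetical and only illustrate the arithmetic, not the `ORFik` implementation.

```r
# Toy reads as plain genomic intervals (hypothetical data, unspliced reads only)
reads <- data.frame(
  start  = c(100, 200),
  end    = c(128, 228),   # both reads are 29 bases wide
  strand = c("+", "-")
)

# Hypothetical helper: move `offset` bases from the 5' end of each read.
psite_position <- function(reads, offset) {
  ifelse(reads$strand == "+",
         reads$start + offset,  # 5' end is the leftmost base on "+"
         reads$end - offset)    # 5' end is the rightmost base on "-"
}

psites <- psite_position(reads, offset = 12)
psites
```

The direction of the shift depends on the strand, because the 5' end of a read on the "-" strand is its highest genomic coordinate.
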
```{r eval = TRUE, echo = TRUE, warning = FALSE}
shiftedFootprints <- shiftFootprints(footprints, shifts)
shiftedFootprints
```

# Gene identity functions for ORFs or genes

`ORFik` contains gene identity functions that can be used to predict which ORFs are potentially coding and functional.

There are 2 main categories:

- Sequence features (kozak, gc-content, etc.)
- Read features (reads as: Ribo-seq, RNA-seq, TCP-seq, shape-seq etc)

The features include:

- FLOSS `floss`,
- coverage `coverage`,
- ORFscore `orfScore`,
- entropy `entropy`,
- translational efficiency `translationalEff`,
- inside outside score `insideOutsideScore`,
- distance between orfs and cds' `distToCds`,
- other

Each feature is implemented from a scientific article published in a peer-reviewed journal. `ORFik` supports seamless calculation of all available features. See the example below.

```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  library(GenomicFeatures)

  # Extract sequences of fiveUTRs.
  fiveUTRs <- fiveUTRs[1:10]
  faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
  tx_seqs <- extractTranscriptSeqs(faFile, fiveUTRs)

  # Find all ORFs on those transcripts and get their genomic coordinates
  fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs)
  unlistedORFs <- unlistGrl(fiveUTR_ORFs)
  # group GRanges by ORFs instead of Transcripts, use 4 first ORFs
  fiveUTR_ORFs <- groupGRangesBy(unlistedORFs, unlistedORFs$names)[1:4]

  # make some toy ribo seq and rna seq data
  starts <- unlist(ORFik:::firstExonPerGroup(fiveUTR_ORFs), use.names = FALSE)
  RFP <- promoters(starts, upstream = 0, downstream = 1)
  score(RFP) <- rep(29, length(RFP)) # the original read widths

  # transcript database (reload the hg19 sample, since txdb was
  # reassigned to the gtf annotation in the shifting example)
  txdb <- loadTxdb(txdbFile)

  # set RNA seq to duplicate transcripts
  RNA <- unlist(exonsBy(txdb, by = "tx", use.names = TRUE), use.names = TRUE)

  dt <- computeFeatures(fiveUTR_ORFs, RFP, RNA, txdb, faFile,
                        orfFeatures = TRUE)
  dt
}
```

You will now get a data.table with one column per score. The columns are named after the different scores, and you can go further with prediction or plotting.

# Calculating Kozak sequence score for ORFs

Instead of getting all features, we can also extract single features.

To understand how strong the binding affinity of an ORF promoter region might be, we can use the kozak sequence score. The kozak function supports several species. In the first example we use the human kozak sequence, then we make a self-defined kozak sequence.

```{r eval = TRUE, echo = TRUE}
# In this example we will find kozak score of cds'
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {

  cds <- cdsBy(txdb, by = "tx", use.names = TRUE)[1:10]
  tx <- exonsBy(txdb, by = "tx", use.names = TRUE)[names(cds)]
  faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens

  kozakSequenceScore(cds, tx, faFile, species = "human")

  # A few species are pre supported, if not, make your own input pfm.

  # here is an example where the human pfm is sent in again, even though
  # it is already supported.
  pfm <- t(matrix(as.integer(c(29,26,28,26,22,35,62,39,28,24,27,17,
                               21,26,24,16,28,32,5,23,35,12,42,21,
                               25,24,22,33,22,19,28,17,27,47,16,34,
                               25,24,26,25,28,14,5,21,10,17,15,28)),
                  ncol = 4))

  kozakSequenceScore(cds, tx, faFile, species = pfm)
}
```

# GRanges and GRangesList utilities

`ORFik` contains a couple of functions that can be used to speed up your coding. Check the documentation for these functions: `unlistGrl`, `sortPerGroup`, `strandBool`, `tile1`.

## Grouping ORFs

Sometimes you want a GRangesList of ORFs grouped by transcript, or you might want each ORF as its own group in the GRangesList. To do this more easily you can use the function `groupGRangesBy`.

```{r eval = TRUE, echo = TRUE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # the orfs are now grouped by orfs. If we want to go back to transcripts we do:
  unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
  unlisted_ranges
  test_ranges <- groupGRangesBy(unlisted_ranges, names(unlisted_ranges))

  # test_ranges is now grouped by transcript, but we want them grouped by ORFs:
  # we use the orfs exon column called ($names) to group, it is made by ORFik.
  unlisted_ranges <- unlistGrl(test_ranges)
  test_ranges <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
}
```

## Filtering example

Let's say you found some ORFs, and you want to filter out some of them. ORFik provides several functions for filtering. A problem with the original GenomicRanges container is that filtering on GRanges objects is much easier than on GRangesList objects; ORFik tries to fix this.

In this example we will filter out ORFs as follows:

1. First group GRangesList by ORFs
2. width < 60
3. number of exons < 2
4. strand is negative

```{r eval = TRUE, echo = TRUE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # let's use the fiveUTR_ORFs

  #1. Group by ORFs
  unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
  ORFs <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
  length(ORFs)

  #2.
  # Remove ORFs with width < 60
  ORFs <- ORFs[widthPerGroup(ORFs) >= 60]
  length(ORFs)

  #3. Keep only ORFs with at least 2 exons
  ORFs <- ORFs[numExonsPerGroup(ORFs) > 1]
  length(ORFs)

  #4. Keep only positive ORFs
  ORFs <- ORFs[strandPerGroup(ORFs) == "+"]
  # all remaining ORFs were on the positive strand, so no change
  length(ORFs)
}
```

## ORF interest regions

Specific parts of the ORF are usually of interest, like the start and stop codons. Here we run an example to show what ORFik can do for you.

```{r eval = TRUE, echo = TRUE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # let's use the ORFs from the previous examples

  #1. Find the start and stop sites
  startSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)
  stopSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)

  #2. Let's find the start and stop codons,
  # this takes care of potential 1 base exons etc.
  starts <- startCodons(fiveUTR_ORFs, is.sorted = TRUE)
  starts
  stops <- stopCodons(fiveUTR_ORFs, is.sorted = TRUE)
  stops

  #3. Let's get the bases of the start and stop codons from the fasta file
  # It's very important to check that ORFs are sorted here, else you could get
  # the end of the ORF instead of the beginning etc.
  txSeqsFromFa(starts, faFile, is.sorted = TRUE)
  txSeqsFromFa(stops, faFile, is.sorted = TRUE)
}
```

Many more manipulation operations are also supported.

# When to use which ORF finding function

ORFik supports multiple ORF finding functions; here we describe their specific use.

If you have a DNAStringSet or a character vector, use `findORFs`. DNAStringSet is safer since all characters are forced to uppercase.
`findORFs` only searches in the 5' to 3' direction, so if you want both directions of a double-stranded sequence, you can do:

```{r eval = TRUE, echo = TRUE}
library(Biostrings)
library(S4Vectors)
# strand with ORFs in both directions
seqs <- "ATGAAATGAAGTAAATCAAAACAT"
# positive strands
pos <- findORFs(seqs, startCodon = "ATG", minimumLength = 0)
# negative strands
neg <- findORFs(reverseComplement(DNAStringSet(seqs)),
                startCodon = "ATG", minimumLength = 0)
# make GRanges since we want strand information
pos <- GRanges(pos, strand = "+")
neg <- GRanges(neg, strand = "-")
# as GRanges
res <- c(pos, neg)
# or merge together and make GRangesList
res <- split(res, seq.int(1, length(pos) + length(neg)))
res
```

Note that `findORFsFasta` automatically finds (-) strand ORFs, since it is normally used for genomes. If you have transcriptomes, you don't want the (-) strand. If you get both (+/-) strands and only want (+) ORFs, do:

```{r eval = TRUE, echo = TRUE}
res[strandBool(res)]
```

## Finding ORFs in spliced transcripts

If you want to find ORFs in spliced transcripts, use `findMapORFs`. It supports automatic exon splitting; see the example above.

## Prokaryote and Circular Genomes

If you want to find ORFs on circular genomes, use `findORFsFasta`.

## Input conclusion:

- Eukaryote splicing: `findMapORFs`, GRangesList (exons) and char (splice joined)
- Prokaryote/circular: `findORFsFasta`, fasta file
- Direct ORFs from character vector: `findORFs`, char vector

# Using ORFik in your package or scripts

The development focus of ORFik is to be a Swiss army knife for transcriptomics. If you need functions for splicing, getting windows of exons per transcript, periodic windows of exons, specific parts of exons etc., ORFik can help you with this.

Let's do an example where ORFik shines.

Objective: We have three transcripts, and we also have a Ribo-seq library. This library was treated with cycloheximide, so we know Ribo-seq reads can stack up close to the stop codon of the CDS.
Let's say we only want to keep transcripts where the cds stop region (defined as the last 9 bases of the cds) has at most 33% of the reads, so that we only keep transcripts with a good spread of reads over the CDS. How would you make this filter?

```{r eval = TRUE, echo = TRUE}
cds <- GRanges("chr1",
               IRanges(c(1, 10, 20, 30, 40, 50, 60, 70, 80),
                       c(5, 15, 25, 35, 45, 55, 65, 75, 85)),
               "+")
names(cds) <- c(rep("tx1", 3), rep("tx2", 3), rep("tx3", 3))
cds <- groupGRangesBy(cds)
ribo <- GRanges("chr1",
                c(1, rep.int(23, 4), 30, 34, 34, 43, 60, 64, 71, 74),
                "+")
# We could do a simplification and use the ORFik entropy function
entropy(cds, ribo) # <- spread of reads
```

We see that ORF 1 has a low (bad) entropy, but we do not know where the reads are stacked up. So let's make a new filter using ORFik's utility functions:

```{r eval = TRUE, echo = TRUE}
tile <- tile1(cds, FALSE, FALSE) # tile them to 1 based positions
tails <- tails(tile, 9) # last 9 bases of each cds
stopOverlap <- countOverlaps(tails, ribo)
allOverlap <- countOverlaps(cds, ribo)
fractions <- (stopOverlap + 1) / (allOverlap + 1) # pseudocount 1
# the 33% threshold with pseudocounts: (1 + 1) / (3 + 1) = 1 / 2
cdsToRemove <- fractions > 1 / 2
cdsToRemove
```

We have now made a stop codon filter for our coding sequences.

# Coverage plots made easy with ORFik

When investigating ORFs or other regions of interest, ORFik can help you make coverage plots from reads of Ribo-seq, RNA-seq, CAGE-seq, TCP-seq etc.

Let's make 3 plots of Ribo-seq focused on CDS regions.

```{r eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # Load data as shown before and pshift the Ribo-seq
  # Get the annotation
  txdb <- loadTxdb(gtf_file)
  # Let's take all valid transcripts, with size restrictions:
  # leader > 100 bases, cds > 100 bases, trailer > 100 bases
  txNames <- filterTranscripts(txdb, 100, 100, 100) # valid transcripts
  leaders <- fiveUTRsByTranscript(txdb, use.names = TRUE)[txNames]
  cds <- cdsBy(txdb, "tx", use.names = TRUE)[txNames]
  trailers <- threeUTRsByTranscript(txdb, use.names = TRUE)[txNames]
  tx <- exonsBy(txdb, by = "tx", use.names = TRUE)
  # Ribo-seq
  bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
  reads <- GenomicAlignments::readGAlignments(bam_file)
  shiftedReads <- shiftFootprints(reads, detectRibosomeShifts(reads, txdb))
}
```

```{r eval = TRUE, echo = TRUE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  library(data.table)
  # Create meta coverage per part
  leaderCov <- metaWindow(shiftedReads, leaders, scoring = NULL,
                          returnAs = "data.table", feature = "leaders")
  cdsCov <- metaWindow(shiftedReads, cds, scoring = NULL,
                       returnAs = "data.table", feature = "cds")
  trailerCov <- metaWindow(shiftedReads, trailers, scoring = NULL,
                           returnAs = "data.table", feature = "trailers")
  # bind together
  dt <- rbindlist(list(leaderCov, cdsCov, trailerCov))
  # Now set info column
  dt[, `:=` (fraction = "Ribo-seq")]
  # NOTE: All of this is done in one line in function: windowPerTranscript

  # zscore gives shape, a good starting plot
  windowCoveragePlot(dt, scoring = "zscore", title = "Ribo-seq metaplot")
}
```

Z-score is good at showing the overall shape. As the windows show, each region (leader, cds and trailer) is scaled to 100.

Let's use a median scoring to find the median counts per position of each meta window.

```{r eval = TRUE, echo = TRUE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  windowCoveragePlot(dt, scoring = "median", title = "Ribo-seq metaplot")
}
```

We see a big spike close to the start of the CDS, called the TIS. The median counts by transcript are close to 50 here. Let's look at the TIS region using the pshifting plot, separated into the 3 frames.

```{r eval = TRUE, echo = TRUE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # size 100 window: 50 upstream, 49 downstream of TIS
  windowsStart <- startRegion(cds, tx, TRUE, upstream = 50, 49)
  hitMapStart <- metaWindow(shiftedReads, windowsStart, withFrames = TRUE)
  ORFik:::pSitePlot(hitMapStart, length = "meta coverage")
}
```

Since these reads are p-shifted, it is not unexpected that the maximum number of reads is at position 0. We also see a clear pattern in the Ribo-seq.

To see how the different read lengths distribute over the region, we make a heatmap, where the colors represent the zscore of counts per position.

```{r eval = TRUE, echo = TRUE, message = FALSE}
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
  # size 25 window (default): 5 upstream, 20 downstream of TIS
  hitMap <- windowPerReadLength(cds, tx, shiftedReads)
  ORFik:::coverageHeatMap(hitMap)
}
```

In the heatmap you can see that read length 30 has the strongest peak on the TIS, while read length 28 has some reads in the leaders (the negative positions).

Our hope is that ORFik can simplify your analysis when you focus on ORFs / transcripts, especially in combination with sequence libraries like RNA-seq, Ribo-seq etc.