Ribostan 0.99.10
Ribo-seq is a specific form of RNA-seq expression assay in which the sequenced fragments are footprints of actively translating ribosomes. Ribo-seq experiments necessarily have much shorter read lengths than RNA-seq experiments, which can complicate quantification. In principle, Ribo-seq experiments provide nucleotide-resolution information about the location of ribosomes, and can thus be used to elucidate the dynamics of ribosomal elongation and initiation. In practice, determining where the ribosome’s A/P-site lies in relation to the footprint is complicated by random variation in footprint size and location, which ‘blurs’ the positions of ribosomes. Ribostan is a collection of tools for the analysis of Ribo-seq data, which includes isoform-aware quantification, P/A-site alignment via several methods, and uORF identification via a multitaper periodicity test (similar to ORFquant and Ribotaper).
#first, load Ribostan and set up some test data
library(Ribostan)
anno_file <- here::here('test.gc32.gtf')
if(!file.exists(anno_file)){
  library(AnnotationHub)
  ah <- AnnotationHub()
  gencode32 <- ah[['AH75191']]
  seqlevels(gencode32)<-'chr22'
  rtracklayer::export(gencode32, anno_file)
}
## Warning in .local(object, con, format, ...): The phase information is missing. The written file will contain CDS
## with no phase information.
fafile <- here::here('chr22.fa')
library(BSgenome.Hsapiens.UCSC.hg38)
if(!file.exists(fafile)){
  seq <- Biostrings::DNAStringSet(BSgenome.Hsapiens.UCSC.hg38[['chr22']])
  names(seq) <- 'chr22'
  Biostrings::writeXStringSet(seq, fafile)
}
#now load Ribo-seq data, reading in the transcript-space alignments
testbam <- system.file('extdata', 'nchr22.bam', package='Ribostan',
mustWork=TRUE)
Ribostan includes functionality for a) filtering ORFs in an existing annotation, and b) finding potential uORFs. By default, load_annotation keeps only those ORFs whose length is a multiple of 3bp, and which begin with a start codon and end with a stop codon. Ribostan will also search for uORFs for those ORFs, by default allowing uORFs that are as short as 2bp. The function load_annotation carries out these filtering steps and creates an object with an attached fasta file for use by other Ribostan functions. The function get_readgr loads ribosomal footprint alignments (including multimappers).
#now load our annotation, filter out suspect ORFs, and find potential uORFs
chr22_anno <- load_annotation(anno_file, fafile, add_uorfs=TRUE)
## reformatting gtf to include transcript_id etc in mcols
## removing 0 non empty seqlevels that are absent from the fasta
## removing 710 non empty seqlevels that are absent from the fasta
## filtered out 444 ORFs for not being multiples of 3bp long
## filtered out 185 ORFs not ending with *
## filtered out 77 ORFs not starting with M
## 1493 ORFs left
## adding uORFs..
## Warning in call_fun_in_txdbmaker("makeTxDbFromGRanges", ...): makeTxDbFromGRanges() has moved to the txdbmaker package. Please call
## txdbmaker::makeTxDbFromGRanges() to get rid of this warning.
## starting to filter out ourfs...
## finished filtering ourfs
## uORFs found
rpfs <- get_readgr(testbam, chr22_anno)
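The default ORF filtering criteria described above can be illustrated with a small base-R sketch. This is not Ribostan's implementation: the sequences, the helper is_valid_orf and the assumption that the only accepted start codon is ATG are all hypothetical and chosen purely for illustration.

```r
# Toy ORF sequences (hypothetical, not taken from the test annotation)
orfs <- c(
  ok       = "ATGGCTTAA",  # 9 nt, starts with ATG, ends with TAA
  bad_len  = "ATGGCTTA",   # 8 nt, not a multiple of 3
  no_start = "CTGGCTTAA",  # does not start with ATG
  no_stop  = "ATGGCTGCT"   # does not end with a stop codon
)

stop_codons <- c("TAA", "TAG", "TGA")

# Hypothetical helper: TRUE if the sequence passes all three checks
# (assumes ATG is the only accepted start codon)
is_valid_orf <- function(seq) {
  n <- nchar(seq)
  n %% 3 == 0 &&
    substr(seq, 1, 3) == "ATG" &&
    substr(seq, n - 2, n) %in% stop_codons
}

keep <- vapply(orfs, is_valid_orf, logical(1))
valid_orfs <- orfs[keep]
print(valid_orfs)
```

Only the sequence named ok survives all three checks; each of the others fails exactly one of them, mirroring the three "filtered out" messages printed by load_annotation.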
An RPF alignment is ambiguous with respect to the position of the underlying ribosome because a) it may be multimapping, so that its actual origin is uncertain, and b) the stochastic processes underlying footprint size and location mean that the precise location of the P-site must be inferred. Various methods exist to do this; Ribostan makes use of the method described by Ahmed et al 2019, in which, for each phase and read length, an offset is chosen that maximizes the number of reads within the CDS. The function get_offsets creates a data frame describing the optimal offsets found by this process, and the function get_psite_gr applies these offsets to the alignments, annotating each with its ORF of origin (which is chosen randomly in the rare case where a footprint’s P-site plausibly overlaps more than one ORF, since more than one phase/offset is possible).
#determine offsets by maximum CDS occupancy
offsets_df <- get_offsets(rpfs, chr22_anno)
## Adding missing grouping variables: `readlen`
## Adding missing grouping variables: `readlen`
#use our offsets to determine p-site locations
psites <- get_psite_gr(rpfs, offsets_df, chr22_anno)
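The idea of choosing the offset that maximizes CDS occupancy can be sketched in base R as follows. This is a simplified illustration and not the code behind get_offsets: the coordinates, the candidate offset range 9:15 and the helper count_in_cds are assumptions, and for brevity the sketch groups reads by read length only and ignores phase.

```r
# Toy data: one CDS in transcript coordinates and footprint 5' ends
# (all values are hypothetical)
cds_start <- 101
cds_end   <- 400
reads <- data.frame(
  start   = c(89, 150, 200, 388, 90, 160, 389),
  readlen = c(29, 29, 29, 29, 28, 28, 28)
)
candidate_offsets <- 9:15  # assumed search range

# Number of reads whose inferred P-site (5' end + offset) lies in the CDS
count_in_cds <- function(starts, offset) {
  psite <- starts + offset
  sum(psite >= cds_start & psite <= cds_end)
}

# For each read length, keep the candidate offset with the highest count
best_offsets <- do.call(rbind, lapply(split(reads, reads$readlen), function(df) {
  n_in <- vapply(candidate_offsets,
                 function(o) count_in_cds(df$start, o), integer(1))
  data.frame(readlen  = df$readlen[1],
             p_offset = candidate_offsets[which.max(n_in)],
             n_in_cds = max(n_in))
}))
print(best_offsets)
```

In this toy example the reads closest to the CDS boundaries determine the result: too small an offset leaves the first read upstream of the start codon, while too large an offset pushes the last read past the stop codon.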
# Verifying offsets with KL-divergence
Most methods of determining P-site offsets, including the one above, are vulnerable to error when unusual patterns of footprints exist at the start/end of the CDS. An orthogonal means of determining A/P-site offsets is to plot KL divergence in ‘metacodon’ profiles. KL divergence measures the degree to which the underlying codon predicts footprint density at a given location relative to it, and will typically show two large peaks due to cut-site bias at 0 and -read_length, along with a peak between these corresponding to the influence of codon-specific dwell time at the A and P sites (see O’Connor et al 2016). get_metacodon_profs derives average, normalized profiles around each codon, and get_kl_df derives the KL divergence per read_length/sample/location. Plotting these provides an orthogonal means of verifying P-site offsets.
covgrs = list(sample1=rpfs)
metacodondf <- get_metacodon_profs(covgrs, chr22_anno)
## getting codon positions...
## ......
kl_df<-get_kl_df(metacodondf, chr22_anno)
kl_offsets <- select_offsets(kl_df)
#write the KL-derived offsets to a table (columns: p_offset, length)
readr::write_tsv(
  dplyr::select(kl_offsets, p_offset, length=nreadlen),
  'offsets_rustvar.tsv')
#
kl_div_plot <- plot_kl_dv(kl_df, kl_offsets)
#
allcodondt <- export_codon_dts(metacodondf, kl_offsets)
knitr::kable(allcodondt)
sample | codon |
---|---|
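To make the KL-divergence check more concrete, here is a minimal base-R sketch. It assumes one common formulation, in which the distribution of codons observed under footprints at a given position is compared with a background codon distribution. The helper kl_divergence, the four-codon alphabet and all numbers are hypothetical, and get_kl_df may compute the quantity differently.

```r
# KL divergence (in bits) between an observed and an expected distribution
kl_divergence <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  nz <- p > 0  # terms with p == 0 contribute nothing
  sum(p[nz] * log2(p[nz] / q[nz]))
}

# Hypothetical background codon frequencies (toy four-codon alphabet)
expected <- c(AAA = 0.25, GCT = 0.25, GAA = 0.25, CTG = 0.25)

# Hypothetical codon frequencies under footprints, one row per position
# relative to the read's 5' end, for a 29 nt read
observed <- rbind(
  c(0.60, 0.20, 0.10, 0.10),  # -29: cut-site bias
  c(0.25, 0.25, 0.25, 0.25),  # -20: no codon effect
  c(0.40, 0.20, 0.20, 0.20),  # -12: dwell-time effect
  c(0.25, 0.25, 0.25, 0.25),  #  -5: no codon effect
  c(0.70, 0.10, 0.10, 0.10)   #   0: cut-site bias
)
rownames(observed) <- c("-29", "-20", "-12", "-5", "0")

kl_by_pos <- apply(observed, 1, kl_divergence, q = expected)

# Ignore the two edge peaks and take the strongest interior position
interior <- kl_by_pos[c("-20", "-12", "-5")]
print(round(kl_by_pos, 3))
print(names(which.max(interior)))
```

Positions where the codon identity carries no information give a divergence of zero, so the interior peak stands out against a flat background once the edge peaks are set aside.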
ORF periodicity is a good diagnostic tool for identifying bona fide translation (non-periodic noise, such as RPFs with ribosome-like footprints, can generate Ribo-seq signal in untranslated regions). The function periodicity_filter_uORFs filters the uORFs found by load_annotation by searching for periodicity in their P-sites. get_ritpms then optimizes ribosome densities and, similarly to programs like salmon or RSEM, performs isoform-aware quantification. Multimapping P-sites can then be sampled according to these TPMs to give an estimate of true ribosome locations.
#use our p-sites to filter the annotation
chr22_anno <- periodicity_filter_uORFs( psites, chr22_anno, remove=TRUE)
## running multitaper tests, this will be slow for a full dataset...
#now, use Stan to estimate normalized p-site densities for our data
ritpms <- get_ritpms(psites, chr22_anno)
## optimizing...
#and get these at the gene level (ignoring uORFs)
gritpms = gene_level_expr(ritpms, chr22_anno)
#and remove multimapped alignments from the psites
psites <- psites[psites$orf%in%names(chr22_anno$trspacecds)]
psites <- sample_cov_gr(psites, chr22_anno, ritpms)
knitr::kable(tibble::enframe(head(ritpms)))
name | value |
---|---|
TxID:214634 | 2928.9943445 |
TxID:214633 | 31.3654540 |
TxID:214633_1 | 2.0025508 |
TxID:214646 | 1.1432730 |
TxID:214647 | 44.5695414 |
TxID:214648 | 0.9397825 |
knitr::kable(head(gritpms))
gene_id | expr |
---|---|
ENSG00000008735.14 | 1422.627 |
ENSG00000015475.18 | 0.000 |
ENSG00000025708.14 | 0.000 |
ENSG00000025770.19 | 6288.334 |
ENSG00000040608.14 | 0.000 |
ENSG00000054611.14 | 1349.704 |
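The intuition behind the periodicity filter can be shown with a plain FFT periodogram. Note that this swaps in a much simpler technique than the multitaper test used by periodicity_filter_uORFs, and that the helper period3_power_fraction and the simulated P-site counts are hypothetical.

```r
# Fraction of (non-constant) spectral power found at a period of 3 nt.
# A plain periodogram, used here as a simple stand-in for a multitaper test.
period3_power_fraction <- function(counts) {
  n <- length(counts)
  stopifnot(n %% 3 == 0)
  spec <- Mod(fft(counts - mean(counts)))^2
  # fft element j corresponds to frequency (j - 1) / n
  half <- spec[2:(floor(n / 2) + 1)]
  spec[n / 3 + 1] / sum(half)
}

# Simulated P-site counts along a 90 nt ORF (hypothetical data)
periodic <- rep(c(10, 2, 1), times = 30)  # strong frame preference
set.seed(1)
aperiodic <- rpois(90, lambda = 4)        # no frame preference

frac_periodic  <- period3_power_fraction(periodic)
frac_aperiodic <- period3_power_fraction(aperiodic)
print(c(periodic = frac_periodic, aperiodic = frac_aperiodic))
```

A translated ORF concentrates its signal at the 3 nt frequency, whereas unstructured noise spreads its power across all frequencies. A multitaper test additionally attaches a significance level to that peak, which this sketch does not attempt.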
With P-site locations determined, this nucleotide-resolution information on ribosome positioning can be used to estimate which codons show high or low occupancy, which, given relative independence between codons, will be linearly proportional to dwell time in a given sample. These codon-level occupancies can be averaged over all codons in an ORF to predict which ORFs are fast or slow.
#get codon-level occupancies using RUST (at least 1000 should be used
#for n_genes on a real run)
rust_codon_occ_df <- get_codon_occs(psites, offsets_df, chr22_anno,
n_genes=20, method='RUST_glm')
#get predicted mean elongation rates for ORFs
orf_elong = get_orf_elong(chr22_anno, rust_codon_occ_df)
#and at the gene level, ignoring uORFs
gn_elong = gene_level_elong(orf_elong, ritpms, chr22_anno)
knitr::kable(head(gn_elong))
gene_id | mean_occ |
---|---|
ENSG00000008735.14 | 0.0348736 |
ENSG00000015475.18 | 0.0183829 |
ENSG00000025708.14 | 0.0372108 |
ENSG00000025770.19 | 0.0660976 |
ENSG00000040608.14 | 0.0679560 |
ENSG00000054611.14 | -0.0240666 |
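Averaging codon-level occupancies over an ORF can be sketched in a few lines of base R. The occupancy values, the toy sequences and the helpers split_codons and orf_mean_occ are hypothetical, and the sketch assumes occupancies are expressed relative to an average codon (so that negative values mean faster than average). It is not the code behind get_orf_elong.

```r
# Hypothetical per-codon occupancies relative to an average codon
codon_occ <- c(ATG = 0.00, GCT = -0.10, GAA = 0.20, AAA = 0.30, TAA = 0.00)

# Split a coding sequence into consecutive codons
split_codons <- function(seq) {
  starts <- seq(1, nchar(seq), by = 3)
  substring(seq, starts, starts + 2)
}

# Mean occupancy over all codons of one ORF
orf_mean_occ <- function(seq, occ) {
  mean(occ[split_codons(seq)])
}

# Two toy ORFs: one rich in low-occupancy codons, one in high-occupancy codons
orfs <- c(fast = "ATGGCTGCTTAA", slow = "ATGAAAGAATAA")
mean_occ <- vapply(orfs, orf_mean_occ, numeric(1), occ = codon_occ)
print(mean_occ)
```

The ORF built from low-occupancy codons receives the lower mean, which is the sense in which an ORF is predicted to be fast or slow.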
sessionInfo()