%\VignetteIndexEntry{An Introduction to the BUMHMM pipeline}
%\VignetteKeywords{BUMHMM, structure probing, RNA}
%\VignettePackage{BUMHMM}
%\VignetteEngine{knitr::knitr}
\documentclass{article}

<<style, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\bioctitle[BUMHMM: Computational pipeline for modelling structure probing data]{BUMHMM: Probabilistic computational pipeline for modelling RNA structure probing data}
\author{Alina Selega\footnote{alina.selega@ed.ac.uk or alina.selega@gmail.com}}
\date{Modified: October 25, 2016. Compiled: \today}

\begin{document}

\maketitle
\tableofcontents

\section{Introduction}

RNA structure is known to be a key regulator of many important mechanisms, such as RNA stability, transcription, and mRNA translation. RNA structural regulatory elements are interrogated with chemical and enzymatic structure probing \cite{kubota2015progress}. In these experiments, a chemical agent reacts with the RNA molecule in a structure-dependent way, cleaving or otherwise modifying its flexible parts. These modified positions can then be detected, providing valuable structural information that can be used for structure prediction \cite{wu2015improved}.

Specifically, chemical modification terminates the reverse transcription reaction, resulting in the reverse transcriptase (RT) dropping off at the modified positions. These positions of drop-off can then be mapped back to the reference sequence. However, the challenge lies in the stochasticity of this process, as the RT can also drop off randomly. To address this, a complementary control experiment is routinely performed to monitor random RT drop-offs when no reagent is used.

Let us consider a toy example of data obtained in a paired-end sequencing structure probing experiment (Fig.~1). We will focus on a particular nucleotide G and analyse the data from a control experiment (with no reagent added) and a treatment experiment (with RNAs modified by the reagent). In control conditions, we mapped 5 fragments overlapping with the nucleotide G, one of which also terminated at that position. Thus, this nucleotide had a coverage of 5 and a drop-off count of 1 (the number of times the RT dropped off immediately after this position), giving it a \textit{drop-off rate} of \( \frac{1}{5} \) (formally defined below). In treatment conditions, more fragments terminated at this position and we measured a drop-off rate of \( \frac{4}{5} \). This seems to suggest that the next nucleotide T has been modified by the reagent and perhaps corresponds to a flexible site within the molecule. However, would our conclusion remain the same had we observed a higher drop-off rate in control conditions to start with? In fact, how high would this control drop-off rate have to be for us to dismiss the drop-off rate of \( \frac{4}{5} \) as a noisy measurement of random drop-off rather than an indication of real modification?

\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{toyEx_structureProbing.png}
\caption{Toy example of structure probing data.}
\end{figure}

This question reinforces the need to decide statistically whether the drop-off rate in treatment conditions is significantly higher than the drop-off rate in control. To do this, we must understand how much noise can be expected in control conditions. If the treatment drop-off rate falls outside this range of drop-off rate variability, we could deem it significantly higher.
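The \textit{drop-off rate} used above is simply the drop-off count divided by the coverage. As a minimal illustration, using plain R arithmetic with the toy counts quoted above:

<<toyRates>>=
## Drop-off rate r = k / n for the toy example
kCtrl <- 1;  nCtrl <- 5   # control: 1 drop-off out of 5 fragments
kTreat <- 4; nTreat <- 5  # treatment: 4 drop-offs out of 5 fragments
c(control = kCtrl / nCtrl, treatment = kTreat / nTreat)
@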
We developed the Beta-Uniform Mixture hidden Markov model (\verb|BUM-HMM|) \cite{selega2016robust}, a statistical framework for modelling reactivity scores from an RNA structure probing experiment such as SHAPE \cite{spitale2013rna} or ChemModSeq \cite{hector2014snapshots}. \verb|BUM-HMM| implements the intuition outlined above by utilising data from multiple experimental replicates and quantifying the variability of the RT drop-offs. \verb|BUM-HMM| also provides empirical strategies to correct intrinsic biases in the data. It generates a probabilistic output, measuring the probability of modification for each nucleotide transcriptome-wide.

The \verb|BUMHMM| package implements the functionality of the \verb|BUM-HMM| model. This vignette provides an example workflow for using the \verb|BUMHMM| package on the structure probing data set for the yeast ribosomal RNA 18S, obtained in an experiment with random priming and paired-end sequencing (available in the Gene Expression Omnibus under accession number GSE52878).

\section{Data format}

The \verb|BUMHMM| pipeline requires three data sets for all nucleotide positions:

\begin{itemize}
\item the coverage (or the number of reads overlapping with this position),
\item the drop-off count (or the number of times the RT dropped off at the next nucleotide),
\item and the drop-off rate at this position.
\end{itemize}

The coverage and the drop-off counts are the data obtained in a structure probing experiment with paired-end sequencing. The drop-off rate $r$ at each nucleotide position is computed as the ratio between its drop-off count $k$ and the coverage $n$: \( r = \frac{k}{n} \). Such data sets can be easily stored in a \Biocpkg{SummarizedExperiment} object, commonly used to represent data from sequencing-based experiments such as RNA-Seq.

The key strength of the \verb|BUM-HMM| model is accounting for the biological variability of the data; it therefore requires data sets available in multiple replicates. The data set \texttt{se} provided with this package (accession number GSE52878) is available in triplicate and was obtained in a structure probing experiment on the 18S ribosomal RNA using the DMS chemical probing agent \cite{wells200032}.

<<>>=
suppressPackageStartupMessages({
    library(BUMHMM)
    library(Biostrings)
    library(SummarizedExperiment)
})
se
@

We see that 18S has 1,800 nucleotides (represented as rows in \texttt{se}) and that the data set has 6 replicates: 3 control experiments followed by 3 treatment experiments (represented as columns in \texttt{se}). The assays correspond to the different data sets, namely, the coverage, drop-off count, and drop-off rate information for each nucleotide. One can quickly access the coverage information for the control experimental replicates (labelled 'C1', 'C2', 'C3') as follows:

<<>>=
controls <- se[, se$replicate == "control"]
head(assay(controls, 'coverage'))
@

We also provide the associated genomic sequence of 18S, accessible through the \Rfunction{rowData} function, which stores information about the rows, in this case corresponding to the nucleobases:

<<>>=
rowData(controls)[1:4,]
@

Similarly, the function \Rfunction{colData} stores the description of the columns, which correspond to the experimental replicates:

<<>>=
colData(controls)
@

For transcriptome-wide experiments, the data over different chromosomes should be concatenated row-wise.
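For users assembling their own input, the required object can be constructed along the following lines. This is a minimal, not-evaluated sketch: the matrices \texttt{cov} and \texttt{doc} (per-nucleotide coverage and drop-off counts, with one column per replicate) are hypothetical placeholders for your own data.

<<buildSE, eval=FALSE>>=
## Hypothetical example: assembling the three required assays into a
## SummarizedExperiment; cov and doc are nucleotide x replicate matrices
library(SummarizedExperiment)
mySe <- SummarizedExperiment(
    assays = list(coverage      = cov,
                  dropoff_count = doc,
                  dropoff_rate  = doc / cov),
    colData = DataFrame(replicate = rep(c("control", "treatment"), each = 3))
)
@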
To briefly illustrate the data set, let us examine the 300th nucleotide:

<<>>=
pos <- 300
assay(controls, 'coverage')[pos, 1]
assay(controls, 'dropoff_count')[pos, 1]
assay(controls, 'dropoff_rate')[pos, 1]
@

We see that it had a coverage of \Sexpr{assay(controls, 'coverage')[pos, 1]} in the first control experimental replicate, with the reverse transcription randomly terminating at that position \Sexpr{assay(controls, 'dropoff_count')[pos, 1]} times, giving it a drop-off rate of \Sexpr{signif(assay(controls, 'dropoff_rate')[pos, 1], 3)}.

<<>>=
treatments <- se[, se$replicate == "treatment"]
assay(treatments, 'coverage')[pos, 1]
assay(treatments, 'dropoff_count')[pos, 1]
assay(treatments, 'dropoff_rate')[pos, 1]
@

In the presence of a chemical probe (in the first treatment replicate), the coverage and drop-off count at that position were higher, but the drop-off rate remained roughly similar, \Sexpr{signif(assay(treatments, 'dropoff_rate')[pos, 1], 3)}.

\section{Overview of the pipeline}

The logic of structure probing experiments associates the binding accessibility of a nucleotide with its structural flexibility, i.e.\ double-stranded nucleotides or those otherwise protected (e.g.\ by a protein interaction) will not be available for interaction with the chemical reagent. In contrast, those nucleotides located in flexible parts of the molecule could be chemically modified by the reagent and will therefore correspond to the positions at which the RT drops off. Thus, we expect the nucleotides immediately downstream from the modification sites within the transcript to have a high drop-off rate in the presence of a reagent; higher than what we observe in control conditions.

To quantify the variability in drop-off rate measured in control conditions, the \verb|BUM-HMM| method compares the drop-off rates at each nucleotide position between two \textit{control} experimental replicates, $C_i$ and $C_j$:
\[ \log \Big( \frac{r_{C_i}}{r_{C_j}} \Big) \]

If the drop-off rates $r_{C_i}$ and $r_{C_j}$ are similar in a pair of control replicates, the above log-ratio will be close to 0, indicating little to no variability in drop-off rate. In contrast, different drop-off rates will result in a large log-ratio (in absolute value). Computing these per-nucleotide log-ratios for all pairs of control experimental replicates defines a \textit{null distribution}, which quantifies how much variability in drop-off rate we can observe between two experimental replicates simply by chance, in the absence of any reagent. (Note that, due to the log transform, drop-off rates $r = 0$ are not allowed.)

We now compute this log-ratio between the drop-off rates in all pairs of \textit{treatment} and \textit{control} experimental replicates, $T_i$ and $C_j$:
\[ \log \Big( \frac{r_{T_i}}{r_{C_j}} \Big) \]

We expect the neighbour of a modified nucleotide to have a much larger drop-off rate in a treatment experiment compared to control conditions, generating a large log-ratio for this pair of experimental replicates. By comparing each treatment-control log-ratio to the null distribution, we can find those nucleotide positions that demonstrate differences in drop-off rate larger than what can be expected by chance.
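To make this concrete, the following chunk computes the raw log-ratios for the first pair of control replicates of our data set. This is only an illustration; the pipeline computes these log-ratios internally and additionally corrects them for biases, as described below.

<<rawLogRatios>>=
## Raw (uncorrected) control-control log-ratios for replicates C1 and C2;
## positions with a zero drop-off rate in either replicate are excluded
ctrlRates <- assay(controls, 'dropoff_rate')
ok <- which(ctrlRates[, 1] > 0 & ctrlRates[, 2] > 0)
summary(log(ctrlRates[ok, 1] / ctrlRates[ok, 2]))
@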
The next section goes through the steps of the \verb|BUMHMM| pipeline, using the provided data set \texttt{se} as an example.

\section{BUMHMM pipeline steps}

\subsection{Selecting pairs of nucleotides}

We first need to select the nucleotide positions in each experimental replicate for which we will compute the log-ratios. This is implemented with the function \Rfunction{selectNuclPos}. This function requires the coverage and drop-off count information stored in \texttt{se}, the numbers of control and treatment experimental replicates (\texttt{Nc} and \texttt{Nt}, respectively), and a user-specified coverage threshold \texttt{t}. Nucleotides with coverage $n < t$ will not be considered.

In our data set, we have 3 control and 3 treatment replicates, so if we set the minimum allowed coverage to $t = 1$, we can make the following function call:

<<>>=
Nc <- Nt <- 3
t <- 1
nuclSelection <- selectNuclPos(se, Nc, Nt, t)
List(nuclSelection)
@

The function \Rfunction{selectNuclPos} returns a list with two elements:

\begin{itemize}
\item \texttt{analysedC} is a list where each element corresponds to a control-control replicate comparison. Each element holds the indices of nucleotides that have coverage $n \geq t$ and a drop-off count $k > 0$ in both replicates of that comparison. Thus, each element stores those nucleotide positions for which we can compute the log-ratio for the corresponding pair of control replicates.
\item \texttt{analysedCT} is a list where each element corresponds to a treatment-control replicate comparison. Again, each element holds the indices of nucleotides that have coverage $n \geq t$ and a drop-off count $k > 0$ in both replicates of that comparison.
\end{itemize}

The pairwise control replicate comparisons are enumerated with the function \Rfunction{combn} from the \verb|utils| package:

<<>>=
t(combn(Nc, 2))
@

Thus, the first element of \texttt{analysedC} corresponds to comparing control replicate 1 with control replicate 2. The comparisons between treatment and control replicates are enumerated similarly.

<<>>=
length(nuclSelection$analysedC[[1]])
length(nuclSelection$analysedCT[[1]])
@

We select \Sexpr{length(nuclSelection$analysedC[[1]])} nucleotide positions for computing the log-ratios for the first control-control comparison and \Sexpr{length(nuclSelection$analysedCT[[1]])} positions for the first treatment-control comparison.

\subsection{Scaling the drop-off rates across replicates}

Because \verb|BUMHMM| works with data collected in multiple experimental replicates, it is important to ensure that the drop-off rates do not differ dramatically between replicates. Thus, the second step of the pipeline scales the drop-off rates of the nucleotides selected for pairwise comparisons to a common median value. This is implemented with the function \Rfunction{scaleDOR}, which requires the data container \texttt{se}, the output of the function \Rfunction{selectNuclPos} (described above), and the numbers of replicates. It returns the updated drop-off rates such that the selected positions have the same median drop-off rate in all replicates:

<<>>=
## Medians of original drop-off rates in each replicate
apply(assay(se, 'dropoff_rate'), 2, median)

## Scale drop-off rates
assay(se, "dropoff_rate") <- scaleDOR(se, nuclSelection, Nc, Nt)

## Medians of scaled drop-off rates in each replicate
apply(assay(se, 'dropoff_rate'), 2, median)
@

After scaling, the medians are much more similar across replicates. (They are not exactly equal when computed this way because most, but not all, nucleotides were selected for the pairwise comparisons.)
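For intuition, the effect of median scaling can be sketched as follows. This is a conceptual, not-evaluated illustration only: \Rfunction{scaleDOR} computes the medians over the positions selected by \Rfunction{selectNuclPos}, and its exact implementation may differ.

<<scalingSketch, eval=FALSE>>=
## Conceptual sketch of median scaling (not the scaleDOR internals):
## divide each replicate by its own median drop-off rate, then multiply
## by a common target value so that all replicate medians coincide
rates <- assay(se, 'dropoff_rate')
meds <- apply(rates, 2, median)
sweep(rates, 2, meds, '/') * mean(meds)
@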
\subsection{Computing stretches of nucleotide positions}

The next step in the \verb|BUM-HMM| modelling approach enforces a smoothness assumption over the state of nucleotides: chemical modification does not randomly switch along the chromosome; rather, continuous stretches of RNA are either flexible or not. This is captured with a hidden Markov model (HMM) with a binary latent state corresponding to the true state of each nucleotide: modified or unmodified.

The observations of the HMM are the empirical $p$-values associated with each nucleotide. These $p$-values arise from comparing the treatment-control log-ratios corresponding to each nucleotide position with the null distribution:
\[ \textrm{$p$-value} = 1 - \textrm{closest percentile of null distribution} \]

If the difference between the drop-off rates in treatment and control replicates (as measured by the treatment-control log-ratio) is well within the range of the drop-off rate variability that we observed in control conditions (as summarised by the null distribution), then this log-ratio will be assigned a fairly large $p$-value. However, those log-ratios that are much larger than most values in the null distribution will be close to its right tail (e.g.\ the 90\textsuperscript{th} percentile). They will then receive a small $p$-value ($1 - 0.9 = 0.1$ in this case). Thus, the $p$-value can be thought of as the probability that the treatment-control log-ratio belongs to the null distribution. We are interested in those log-ratios that are unlikely to belong to it, as they could indicate a real RT drop-off signal. Note that we expect the drop-off rate in treatment conditions to be higher than in control, which is why we restrict our attention to the right tail of the null distribution.

Modelling the $p$-values directly enabled us to define the emission distribution of the HMM as a Beta-Uniform mixture model. Briefly, the unmodified state of a nucleotide corresponds to the null hypothesis, and the associated $p$-values are modelled with the Uniform distribution. In the modified state we expect to see large log-ratios and small associated $p$-values, which are modelled with a Beta distribution. Further details and theoretical justifications can be found in \cite{selega2016robust}.

To run the HMM, we first find the uninterrupted stretches of nucleotides over which posterior probabilities will be computed. Posterior probabilities will be computed for those nucleotides with at least the minimum allowed coverage in all experimental replicates and a non-zero drop-off count in at least one treatment replicate. This is achieved with the function \Rfunction{computeStretches}, which takes \texttt{se} and the threshold \texttt{t} as parameters.

<<>>=
stretches <- computeStretches(se, t)
@

The function returns an \Biocpkg{IRanges} object in which each element corresponds to a stretch of nucleotides, each stretch being at least 2 nucleotides long. The HMM will be run separately on each stretch.

<<>>=
head(stretches)
assay(se, 'dropoff_count')[1748,]
@

On this data set, we will compute posterior probabilities for all nucleotides but one, at position 1748. This is because, at this position, all treatment replicates (and in fact, all replicates) had a drop-off count of 0.
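Returning to the empirical $p$-values defined above, the following minimal illustration (not part of the package interface; \Rfunction{computeProbs} performs this computation internally) maps an observed log-ratio to a $p$-value against a small artificial null sample:

<<empiricalPval>>=
## Empirical p-value: the fraction of the null distribution exceeding the
## observed treatment-control log-ratio (artificial values for illustration)
empPval <- function(ldr, null) 1 - ecdf(null)(ldr)
nullSample <- c(-0.3, -0.1, 0, 0.1, 0.2, 0.5, 0.8, 1, 1.2, 1.5)
empPval(1.2, nullSample)  # 1.2 sits at the 90th percentile, so p = 0.1
@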
\subsection{Bias correction}

Using a transcriptome-wide data set, we identified sequence and coverage as factors that influence the log-ratios in control conditions \cite{selega2016robust}. We would therefore like to transform the log-ratios so that these biases are eliminated and the comparisons are not confounded.

\subsubsection{Coverage bias}

The coverage bias is addressed by a variance stabilisation strategy, implemented by the \Rfunction{stabiliseVariance} function. This function aims to find a functional relationship between the log-ratios in the null distribution and the average coverage in the corresponding pair of control replicates. This relationship is modelled under the assumption that the drop-off count is a binomially distributed random variable (see \cite{selega2016robust} for details) and is fitted to the data with a non-linear least squares technique. All log-ratios (both for control-control and treatment-control comparisons) are then transformed accordingly, so that this dependency on coverage is eliminated or at least reduced.

The function requires the data container \texttt{se}, the positions of nucleotides selected for pairwise comparisons, and the numbers of replicates. It returns a list with two elements (\texttt{LDR} stands for ``log drop-off rate ratio''):

\begin{itemize}
\item \texttt{LDR\_C} is a matrix with the transformed log-ratios for the control-control comparisons.
\item \texttt{LDR\_CT} is a matrix with the transformed log-ratios for the treatment-control comparisons.
\end{itemize}

Both matrices have one row per nucleotide position and one column per pairwise comparison. Thus, \texttt{LDR\_C} has as many columns as there are control-control comparisons (3 comparisons for 3 control replicates) and \texttt{LDR\_CT} has as many columns as treatment-control comparisons (9 comparisons for 3 control and 3 treatment replicates).

<<>>=
varStab <- stabiliseVariance(se, nuclSelection, Nc, Nt)
LDR_C <- varStab$LDR_C
LDR_CT <- varStab$LDR_CT

hist(LDR_C, breaks = 30, main = 'Null distribution of LDRs')
@

The histogram shows the null distribution after the transformation.
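As an optional sanity check, one could plot the transformed log-ratios against coverage to verify that the dependency has indeed been reduced. This is a sketch only (not evaluated here), assuming that \texttt{LDR\_C} has one row per nucleotide position, as described above:

<<coverageCheck, eval=FALSE>>=
## Diagnostic sketch: the spread of the transformed control-control
## log-ratios should no longer depend strongly on the coverage
avgCov <- rowMeans(assay(se, 'coverage')[, 1:2])  # replicates C1 and C2
plot(avgCov, LDR_C[, 1],
     xlab = 'Average coverage (C1, C2)',
     ylab = 'Transformed log-ratio')
@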
\subsubsection{Sequence bias}

The sequence-dependent bias is addressed by computing different null distributions of log-ratios for different sequence patterns of nucleotides. One could consider sequences of three nucleotides, reflecting the assumption that the immediate neighbours of a nucleotide on both sides could affect its accessibility; patterns of other lengths could also be considered. The function \Rfunction{nuclPerm} returns a vector of all permutations of the four nucleobases (A, T, G, and C) of length $n$:

<<>>=
nuclNum <- 3
patterns <- nuclPerm(nuclNum)
patterns
@

Considering patterns of length $n = 3$ will result in computing \Sexpr{length(patterns)} different null distributions of log-ratios, each corresponding to one sequence pattern. To do this, we first need to find all occurrences of each pattern within the sequence. This is implemented with the function \Rfunction{findPatternPos}, which takes the list of patterns, a string containing the sequence (e.g.\ a \verb|DNAString| object), and a parameter indicating whether we are dealing with the sense (+) or anti-sense (-) DNA strand. For transcriptome-wide experiments, when searching for pattern occurrences on the anti-sense strand, the patterns are converted to their complementary sequences (the function always expects the sense-strand sequence as its second parameter).

<<>>=
## Extract the DNA sequence
sequence <- subject(rowData(se)$nucl)
sequence

nuclPosition <- findPatternPos(patterns, sequence, '+')
patterns[[1]]
head(nuclPosition[[1]])
@

The function returns a list with an element corresponding to each pattern generated by \Rfunction{nuclPerm}. Each element holds the indices of the middle nucleotide of that pattern's occurrences in the genomic sequence. Thus, we will separately consider the drop-off rates of the nucleotides that occur in the context of each pattern. For instance, the null distribution specific to the pattern ``\Sexpr{patterns[[1]]}'' will be constructed from the log-ratios for the nucleotide positions \Sexpr{head(nuclPosition[[1]])}, and so on.

\subsection{Computing posterior probabilities with the HMM}

We are now ready for the final step of the pipeline, which computes the posterior probabilities of modification with the HMM. Due to the short length of the 18S molecule (only 1,800 nucleotides), we will omit the sequence bias correction step, which is primarily designed for transcriptome studies. Instead, we will use all nucleotide positions for constructing a single null distribution quantifying the drop-off rate variability. The \texttt{nuclPosition} list should therefore have one element, corresponding to the single stretch that we will run the HMM on. This stretch contains all nucleotide positions:

<<>>=
nuclPosition <- list()
nuclPosition[[1]] <- 1:nchar(sequence)

## Start of the stretch
nuclPosition[[1]][1]
## End of the stretch
nuclPosition[[1]][length(nuclPosition[[1]])]
@

The HMM is run separately on each stretch of nucleotides (of which we have only one in this particular case). However, in transcriptome studies it could be useful to only select some stretches of interest, e.g.\ those overlapping with particular genes.

The function \Rfunction{computeProbs} computes the posterior probabilities of modification for all nucleotides in the specified stretches. The function requires the matrices with transformed log-ratios \texttt{LDR\_C} and \texttt{LDR\_CT}, the numbers of replicates, the strand indicator, the lists of positions for computing the null distribution(s) (stored in \texttt{nuclPosition}) and of pairwise comparisons (stored in \texttt{nuclSelection}), and the stretches on which to run the HMM:

<<>>=
posteriors <- computeProbs(LDR_C, LDR_CT, Nc, Nt, '+', nuclPosition,
                           nuclSelection$analysedC, nuclSelection$analysedCT,
                           stretches)
@

The function \Rfunction{computeProbs} compares the log-ratios to the null distribution(s) and computes the empirical $p$-values. These are then passed as observations to the HMM, which computes posterior probabilities for each selected nucleotide of being in the unmodified (first column in \texttt{posteriors}) and modified state (second column in \texttt{posteriors}).
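Had we kept the sequence bias correction, the pattern-specific positions computed by \Rfunction{findPatternPos} would be passed as the null-distribution positions instead. A hypothetical, not-evaluated sketch of such a call, with all other arguments unchanged:

<<patternSpecific, eval=FALSE>>=
## Hypothetical transcriptome-wide variant: one null distribution per
## sequence pattern, using the positions found by findPatternPos
nuclPosition <- findPatternPos(patterns, sequence, '+')
posteriors <- computeProbs(LDR_C, LDR_CT, Nc, Nt, '+', nuclPosition,
                           nuclSelection$analysedC, nuclSelection$analysedCT,
                           stretches)
@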
\section{BUMHMM output}

<<>>=
head(posteriors)
@

We see that the model assigns very large probabilities of being unmodified by the chemical probe to the first few nucleotides. As the modified positions within a transcript are the ones at which the reverse transcriptase drops off, the last position of the corresponding cDNA fragment that is detected (and for which we consequently increment the drop-off count) is the nucleotide next to the modification site. Thus, we need to shift the probabilities up by one position:

<<>>=
shifted_posteriors <- matrix(NA, nrow = nrow(posteriors), ncol = 1)
shifted_posteriors[1:(length(shifted_posteriors) - 1)] <-
    posteriors[2:nrow(posteriors), 2]
@

We can now plot the probabilities to see the \verb|BUMHMM| output for the DMS structure probing data for the yeast rRNA 18S.

<<>>=
plot(shifted_posteriors, xlab = 'Nucleotide position',
     ylab = 'Probability of modification',
     main = 'BUMHMM output for 18S DMS data set')
@

We see that most nucleotides are predicted to be in the unmodified state, with a probability of modification close to 0 (these could be double-stranded or protected by a protein interaction). However, the model also identified some modified regions, which could correspond to accessible parts of the molecule. It should be noted that the DMS probe preferentially reacts with ``A'' and ``C'' nucleotides, which effectively makes only a subset of the structural state of the molecule accessible to probing.

The \verb|BUMHMM| package also provides an option to optimise the shape parameters of the Beta distribution, which defines the HMM emission model for the modified state. To optimise the parameters with the EM algorithm, the \Rfunction{computeProbs} function should be called with the last parameter, \texttt{optimise}, set to the desired tolerance. Once the previous and current estimates of the parameters are within this tolerance, the EM algorithm stops (unless it has already reached the maximum number of iterations). Further details can be found in \cite{selega2016robust}.

<<eval=FALSE>>=
## Call the function with the additional tolerance parameter
posteriors <- computeProbs(LDR_C, LDR_CT, Nc, Nt, '+', nuclPosition,
                           nuclSelection$analysedC, nuclSelection$analysedCT,
                           stretches, 0.001)
@

By default, the last parameter is set to \texttt{NULL}. In our experiments, this optimisation appeared to be vulnerable to local minima; thus, the current version of the \verb|BUMHMM| pipeline does not use it.

%\clearpage

\section{Session Info}

This vignette was compiled using:

<<>>=
sessionInfo()
@

\section{Acknowledgements}

This package was developed at the School of Informatics, University of Edinburgh, with support from Sander Granneman and Guido Sanguinetti. The manuscript describing the \verb|BUM-HMM| computational model can be found in \cite{selega2016robust}. This study was supported in part by grants from the UK Engineering and Physical Sciences Research Council, the Biotechnology and Biological Sciences Research Council, and the UK Medical Research Council to the University of Edinburgh Doctoral Training Centre in Neuroinformatics and Computational Neuroscience (EP/F500385/1 and BB/F529254/1).

\bibliography{vignette}

\end{document}