---
title: "RNAmodR.ML: detecting patterns of post-transcriptional modifications using machine learning"
author: "Felix G.M. Ernst and Denis L.J. Lafontaine"
date: "`r Sys.Date()`"
package: RNAmodR.ML
output:
BiocStyle::html_document:
toc: true
toc_float: true
df_print: paged
vignette: >
%\VignetteIndexEntry{RNAmodR.ML}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: references.bib
---
# Introduction
Post-transcriptional modifications can be found abundantly in rRNA and tRNA and
can be detected classically via several strategies. However, difficulties arise
if the identity and the position of the modified nucleotides is to be determined
at the same time. Classically, a primer extension, a form of reverse
transcription (RT), would allow certain modifications to be accessed by blocks
during the RT changes or changes in the cDNA sequences. Other modification would
need to be selectively treated by chemical reactions to influence the outcome of
the reverse transcription.
With the increased availability of high throughput sequencing, these classical
methods were adapted to high throughput methods allowing more RNA molecules to
be accessed at the same time. However, patterns of some modification cannot be
detected by accessing small number of parameters. For these cases machine
learning models can be trained on data from positions known to be modified
in order to detect additional modified positions.
To extend the functionality of the `RNAmodR` package and classical detection
strategies used for RiboMethSeq or AlkAnilineSeq (see `RNAmodR.RiboMethSeq` and
`RNAmodR.AlkAnilineSeq` packages) towards detection through machine learning
models, `RNAmodR.ML` provides classes and an example workflow.
# Using RNAmodR.ML
```{r, echo = FALSE}
suppressPackageStartupMessages({
library(rtracklayer)
library(GenomicRanges)
library(RNAmodR.ML)
library(RNAmodR.Data)
})
```
```{r, eval = FALSE}
library(rtracklayer)
library(GenomicRanges)
library(RNAmodR.ML)
library(RNAmodR.Data)
```
The `ModifierML` class extends the `Modifier` class from the `RNAmodR` package
and adds one slot, `mlModel`, a getter/setter `getMLModel`/`setMLModel`, an
additional `useMLModel` function to be called from the `aggregate` function.
The slot `mlModel` can either be an empty character or contain a name of a
`ModifierMLModel` class, which is loaded upon creation of a `ModifierML` object,
and serves as a wrapper around a machine learning model. For different types of
machine learning models `ModifierMLModel` derived classes are available, which
currently are:
* `ModifierMLranger` for models generated with the `ranger` package
[[@Wright.2017]](#References)
* `ModifierMLkeras` for models generated with the `keras` package
[[@Allaire.2018]](#References)
To illustrate how to develop a machine learning model with help from the
`RNAmodR.ML` package, an example is given below.
# Development of new `Modifier` class
As an example for this vignette, we will try to detect D positions in
AlkAnilineSeq data. First define a specific `ModifierML` class loading pileup
and coverage data. In this example, the RNA specific `RNAModifierML` class is
used.
```{r}
setClass("ModMLExample",
contains = c("RNAModifierML"),
prototype = list(mod = c("D"),
score = "score",
dataType = c("PileupSequenceData",
"CoverageSequenceData"),
mlModel = character(0)))
# constructor function for ModMLExample
ModMLExample <- function(x, annotation = NA, sequences = NA, seqinfo = NA,
...){
RNAmodR:::Modifier("ModMLExample", x = x, annotation = annotation,
sequences = sequences, seqinfo = seqinfo, ...)
}
setClass("ModSetMLExample",
contains = "ModifierSet",
prototype = list(elementType = "ModMLExample"))
# constructor function for ModSetMLExample
ModSetMLExample <- function(x, annotation = NA, sequences = NA, seqinfo = NA,
...){
RNAmodR:::ModifierSet("ModMLExample", x, annotation = annotation,
sequences = sequences, seqinfo = seqinfo, ...)
}
```
Since the `mlModel` slot contains an empty character, the creation of the object
will not automatically trigger a search for modifications. However, it will
aggregate the data in the format we want to use. The `aggregate_example`
function is just an example and the aggregation of the data is part of the
model building. (See (#Summary))
```{r}
setMethod(
f = "aggregateData",
signature = signature(x = "ModMLExample"),
definition =
function(x){
aggregate_example(x)
}
)
```
# Getting training data
To gather training data, we just create a `ModMLExample` object and let it do
its job.
```{r,include=FALSE}
annotation <- GFF3File(RNAmodR.Data.example.gff3())
sequences <- RNAmodR.Data.example.fasta()
files <- list("wt" = c(treated = RNAmodR.Data.example.bam.1(),
treated = RNAmodR.Data.example.bam.2(),
treated = RNAmodR.Data.example.bam.3()))
```
```{r}
me <- ModMLExample(files[[1]], annotation, sequences)
```
Afterwards we need to load/create coordinates for positions known to be modified
as well as positions known to be unmodified.
```{r}
data("dmod",package = "RNAmodR.ML")
# we just select the next U position from known positions
nextUPos <- function(gr){
nextU <- lapply(seq.int(1L,2L),
function(i){
subseq <- subseq(sequences(me)[dmod$Parent], start(dmod)+3L)
pos <- start(dmod) + 2L +
vapply(strsplit(as.character(subseq),""),
function(y){which(y == "U")[i]},integer(1))
ans <- dmod
ranges(ans) <- IRanges(start = pos, width = 1L)
ans
})
nextU <- do.call(c,nextU)
nextU$mod <- NULL
unique(nextU)
}
nondmod <- nextUPos(dmod)
nondmod <- nondmod[!(nondmod %in% dmod)]
coord <- unique(c(dmod,nondmod))
coord <- coord[order(as.integer(coord$Parent))]
```
With these coordinates the aggregated data of the `ModMLExample` can be subset
to a training data set using the function `trainingData`.
```{r}
trainingData <- trainingData(me,coord)
trainingData <- unlist(trainingData, use.names = FALSE)
# converting logical labels to integer
trainingData$labels <- as.integer(trainingData$labels)
```
# Training a model
How a specific model can be trained or what kind of strategies can be employed
to successfully train a model, is out of scope for the vignette. For this
example a random forest is trained using the functionality provided by the
`ranger` package.
```{r}
library(ranger)
model <- ranger(labels ~ ., data.frame(trainingData))
```
# Constructing a 'ModifierMLModel'
Now, the trained model can be used to create a new `ModifierMLModel` class and
object.
```{r}
setClass("ModifierMLexample",
contains = c("ModifierMLranger"),
prototype = list(model = model))
ModifierMLexample <- function(...){
new("ModifierMLexample")
}
mlmodel <- ModifierMLexample()
```
To be able to use the model via the `ModifierMLModel` class, we also need to
define an accessor to the predictions made by the model. This function is called
`useModel` and is already prefined for the `ModifierMLModel` classes mentioned
above in secion [Using RNAmodR.ML](#RNAmodR.ML).
```{r}
getMethod("useModel", c("ModifierMLranger","ModifierML"))
```
If the results of these function is not usable for a specific model, it can
redefined for your needs. As defined by `RNAmodR.ML` the function returns a
`NumericList` along the aggregated data of the `ModifierML` object.
# Setting and using the model
The generated `ModifierMLexample` wrapper can now be set for the `ModifierML`
object using the `setMLModel` function. (If a model is saved as part of a
package, this step ist not necessary, because it can be part of the class
definition)
```{r}
setMLModel(me) <- mlmodel
```
In order for the prediction data to be usable, another function has to be
implemented to save the predictions in the aggregated data. The function is
called `useMLModel`.
```{r}
setMethod(f = "useMLModel",
signature = signature(x = "ModMLExample"),
definition =
function(x){
predictions <- useModel(getMLModel(x), x)
data <- getAggregateData(x)
unlisted_data <- unlist(data, use.names = FALSE)
unlisted_data$score <- unlist(predictions)
x@aggregate <- relist(unlisted_data,data)
x
}
)
```
By re-running the `aggregate` function and force an update of the data, the
predictions are made and used to populate the `score` column as outlined above.
```{r}
me <- aggregate(me, force = TRUE)
```
# Performance
During the model building phase some form of a performance measurement usually
is included. In addition to these model specific measurements, `RNAmodR.ML`
includes the functionality from the `ROCR` package inherited from the `RNAmodR`
package. With this the performance of a model can evaluted over the training set
or any coordinates.
```{r plot1, fig.cap="Performance of the maching learning model to distinguish unmodified from modified nucleotides.", fig.asp=1.5, dev="png"}
plotROC(me, dmod)
```
# Using a `ModifierML` class
Since we want to use the `ModifierML` object to detect modifications, we also
need to define the `findMod` function. Based on the information on the
performance, we set a threshold of `0.8` for the prediction score for detecting
D modifications. In the example below this threshold is hardcoded in the
`find_mod_example` function, but can also be implemented using the `settings`
function.
```{r}
setMethod(
f = "findMod",
signature = signature(x = "ModMLExample"),
definition =
function(x){
find_mod_example(x, 25L)
}
)
```
Now we can redfine the `ModMLExample` class with the slot `mlModel` already set
to the `ModifierMLexample` class. The `ModMLExample` is now complete.
```{r}
rm(me)
setClass("ModMLExample",
contains = c("RNAModifierML"),
prototype = list(mod = c("D"),
score = "score",
dataType = c("PileupSequenceData",
"CoverageSequenceData"),
mlModel = "ModifierMLexample"))
me <- ModMLExample(files[[1]], annotation, sequences)
```
The detected modifications can be access using the `modifications` function.
```{r}
mod <- modifications(me)
mod <- split(mod, factor(mod$Parent,levels = unique(mod$Parent)))
mod
```
# Refining a model
Some of the modification found look reasonable. However, some of the positions
seem to be noise.
```{r}
options(ucscChromosomeNames=FALSE)
```
```{r plot2, fig.cap="Visualization of sequence data", dev="png"}
plotDataByCoord(sequenceData(me),mod[["4"]][1])
```
Several options exist to improve the model: Either the threshold applied to the
prediction score can be raised to a higher value, like `0.9` or the model can
maybe retrained by including these position in another training data set. In
addition, the training data might be improved in general by higher sequencing
depth.
```{r}
nonValidMod <- mod[c("1","4")]
nonValidMod[["18"]] <- nonValidMod[["18"]][2]
nonValidMod[["26"]] <- nonValidMod[["26"]][2]
nonValidMod <- unlist(nonValidMod)
nonValidMod <- nonValidMod[,"Parent"]
coord <- unique(c(dmod,nondmod,nonValidMod))
coord <- coord[order(as.integer(coord$Parent))]
```
As an example, a new model is trained including the wrongly identified
positions from the previous model as unmodified positions.
```{r}
trainingData <- trainingData(me,coord)
trainingData <- unlist(trainingData, use.names = FALSE)
trainingData$labels <- as.integer(trainingData$labels)
```
```{r}
model2 <- ranger(labels ~ ., data.frame(trainingData), num.trees = 2000)
setClass("ModifierMLexample2",
contains = c("ModifierMLranger"),
prototype = list(model = model2))
ModifierMLexample2 <- function(...){
new("ModifierMLexample2")
}
mlmodel2 <- ModifierMLexample2()
me2 <- me
setMLModel(me2) <- mlmodel2
me2 <- aggregate(me2, force = TRUE)
```
After updating the `ModifierMLexample` class and aggregating the data again
the performance looks a bit better...
```{r plot3, fig.cap="Performance aggregation of multiple samples and strategies."}
plotROC(me2, dmod, score="score")
```
... and leads to a better result for detecting D modifications. Some positions
are not detected anymore, which suggest that this model is still not an optimal
solution and other factors could and should be improved upon as suggested above.
```{r}
setMethod(
f = "findMod",
signature = signature(x = "ModMLExample"),
definition =
function(x){
find_mod_example(x, 25L)
}
)
me2 <- modify(me2, force = TRUE)
modifications(me2)
```
In addition to training a single model, several models can be trained and
combined to a `ModifierSet`.
```{r}
mse <- ModSetMLExample(list(one = me, two = me2))
```
An overall performance over several models can be analyzed or the individual
performance compaired.
```{r plot4, fig.cap="Performance average across models", dev="png"}
plotROC(mse, dmod, score= "score",
plot.args = list(avg = "threshold", spread.estimate = "stderror"))
```
If several models are trained and each provides useful information, these can be
package into a single `ModifierMLModel` class to combine the output of several
models. Some of the functions outlined above need, e.g. `useMLModel` and/or
`useModel`, to be modified to provide one or more scores for detecting a
modification.
# Packaging
If the created models can be saved to file, they can also be included in a
package. This would allow others to use the models and the models can
potentially be improved upon with increasing version numbers.
# Summary
`RNAmodR.ML` provides the interface for machine learning models to be used
with `RNAmodR` to detect modified nucleotides in high throughput sequencing
data. For your own work defining a working model might take some time. We hope
that by using `RNAmodR.ML` the steps surrounding this crucial step might become
a bit easier.
However, if some steps or design choices made for `RNAmodR.ML` do not suit your
need, let us know. Contributions are always welcome as well.
# Hints
We also want to provide some additional hints and suggestions for developing
machine learning models.
1. the aggregate function is used in the example above as a feature engineering
tool. You might want to skip this step, if you want to use a deep learning
model for example with `keras`.
2. If you don't want to engage in a feature enginering step and just want to
aggregate the sequence data as is, just do a custom `cbind` on the data from
the `SequenceData` objects (`cbind` works on `SequenceData` objects, if they
are of the same type. Convert each of them to a `CompressedSplitDataFrameList`
using `as(x,"CompressedSplitDataFrameList")`).
3. in a deep learning context, if a coordinate is selected without a flanking
value, e.g. when using `trainingData`, a 2D tensor is returned (sample,
values). This can be converted into a 3D tensor by providing a flanking value.
# Sessioninfo
```{r sessioninfo}
sessionInfo()
```
# References