---
title: "miRNA affinity models and the KdModel class"
author:
- name: Pierre-Luc Germain
  affiliation:
    - D-HEST Institute for Neuroscience, ETH
    - Lab of Statistical Bioinformatics, UZH
- name: Michael Soutschek
  affiliation: Lab of Systems Neuroscience, D-HEST Institute for Neuroscience, ETH
- name: Fridolin Gross
  affiliation: Lab of Systems Neuroscience, D-HEST Institute for Neuroscience, ETH
package: scanMiR
output:
  BiocStyle::html_document
abstract: |
  This vignette introduces the KdModel and KdModelList classes used for storing
  miRNA 12-mer affinities and predicting the dissociation constant of specific
  sites.
vignette: |
  %\VignetteIndexEntry{2_Kdmodels}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include=FALSE}
library(BiocStyle)
```
# KdModels
The `KdModel` class stores the 12-mer sequence affinities of a given miRNA. It
is meant to compress the dissociation constant (Kd) predictions from
[McGeary, Lin et al. (2019)](https://dx.doi.org/10.1126/science.aav1741) and
make them easy to manipulate. We can take a look at the example `KdModel`:
```{r}
library(scanMiR)
data(SampleKdModel)
SampleKdModel
```
In addition to the information necessary to predict the binding affinity to any
given 12-mer sequence, the model contains, minimally, the name and sequence of
the miRNA. Since the `KdModel` class extends the list class, any further
information can be stored:
```{r}
SampleKdModel$myVariable <- "test"
```
An overview of the binding affinities can be obtained with the following plot:
```{r}
plotKdModel(SampleKdModel, what="seeds")
```
The plot gives the -log(Kd) values of the top 7-mers (including both canonical
and non-canonical sites), with or without the final "A" vis-à-vis the first
miRNA nucleotide.

To predict the dissociation constant (and binding type, if any) of a given
12-mer sequence, you can use the `assignKdType` function:
```{r}
assignKdType("ACGTACGTACGT", SampleKdModel)
# or using multiple sequences:
assignKdType(c("CTAGCATTAAGT","ACGTACGTACGT"), SampleKdModel)
```
The `log_kd` column contains log(Kd) values multiplied by 1000 and stored as
integers (which is more economical when dealing with millions of sites). In the
example above, `r (lkd <- assignKdType("CTAGCATTAAGT", SampleKdModel)$log_kd)`
corresponds to a log(Kd) of `r lkd/1000`, or a dissociation constant of
`r exp(lkd/1000)`. The smaller the value, the stronger the relative affinity.
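
The conversion described above can be spelled out in a few lines of base R. This is only a minimal sketch: the integer values and site names below are hypothetical placeholders rather than actual model output, and the natural logarithm is assumed, consistent with the use of `exp()` above.

```{r}
# Hypothetical scaled log_kd values for two sites (NOT actual model output)
lkd_scaled <- c(siteA = -5000L, siteB = -2000L)

# Undo the x1000 scaling to recover log(Kd)
log_kd <- lkd_scaled / 1000

# Back-transform to dissociation constants (natural logarithm assumed)
kd_values <- exp(log_kd)
kd_values

# Fold difference in Kd between the two sites; siteA has the smaller Kd,
# i.e. the stronger relative affinity
fold <- kd_values[["siteB"]] / kd_values[["siteA"]]
fold

# Memory footprint of one million values stored as integers vs. doubles
int_size <- object.size(integer(1e6))
dbl_size <- object.size(numeric(1e6))
c(integer = as.numeric(int_size), double = as.numeric(dbl_size))
```

Because the scaling is a simple multiplication, differences between scaled `log_kd` values can be compared directly; only the back-transformation to Kd requires dividing by 1000 first.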
## KdModelLists
A `KdModelList` object is simply a collection of `KdModel` objects. We can
build one in the following way:
```{r}
# we create a copy of the KdModel, and give it a different name:
mod2 <- SampleKdModel
mod2$name <- "dummy-miRNA"
kml <- KdModelList(SampleKdModel, mod2)
kml
summary(kml)
```
Beyond operations typically performed on a list (e.g. subsetting), some
specific slots of the respective KdModels can be accessed, for example:
```{r}
conservation(kml)
```
# Creating a KdModel object
`KdModel` objects are meant to be created from a table assigning log_kd
values to 12-mer target sequences, as produced by the CNN from McGeary, Lin et
al. (2019). For the purpose of this example, we create such a dummy table:
```{r}
kd <- dummyKdData()
head(kd)
```
A `KdModel` object can then be created with:
```{r}
mod3 <- getKdModel(kd=kd, mirseq="TTAATGCTAATCGTGATAGGGGTT", name = "my-miRNA")
```
Alternatively, the `kd` argument can also be the path to the output file of the
CNN (and if `mirseq` and `name` are in the table, they can be omitted).
# Common KdModel collections
The [scanMiRData](https://github.com/ETHZ-INS/scanMiRData) package contains
`KdModel` collections corresponding to all human, mouse and rat miRBase miRNAs.
# Under the hood
When calling `getKdModel`, the dissociation constants are stored as a
lightweight, overfitted linear model. The model has a base Kd coefficient
(stored as an integer in `object$mer8`) for each of the 1024 partially-matching
8-mers (i.e. those with at least 4 consecutive matching nucleotides), to which
is added an 8-mer-specific coefficient (stored in `object$fl`) multiplied by a
flanking score derived from the flanking di-nucleotides. The flanking score is
calculated based on the di-nucleotide effects experimentally measured by
McGeary, Lin et al. (2019). To save space, the actual 8-mer sequences are not
stored but generated when needed in a deterministic fashion. The 8-mers can be
obtained, in the right order, with the `getSeed8mers` function.
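
As an illustration of the structure described above, the following sketch combines a base coefficient with a flank-dependent term for a few 8-mers. All of it is hypothetical: the coefficient values, the choice of 8-mers, the flanking scores and the helper `predictLogKd` are made up for the example, and the actual scaling and flanking-score computation used by the package may differ.

```{r}
# Hypothetical stand-ins for object$mer8 and object$fl (illustrative values only)
mer8 <- c(AGCATTAA = -5200L, AGCATTAC = -4100L, ACGTACGT = -300L)
fl   <- c(AGCATTAA = 0.9,    AGCATTAC = 0.7,    ACGTACGT = 0.1)

# Hypothetical helper: base coefficient of the 8-mer, plus its 8-mer-specific
# coefficient multiplied by the flanking score of the site
predictLogKd <- function(site, flankScore) {
  unname(mer8[site] + fl[site] * flankScore)
}

# The same 8-mer with two different (made-up) flanking scores
predictLogKd("AGCATTAA", flankScore = -150)
predictLogKd("AGCATTAA", flankScore = 200)

# Vectorized over several sites
predictLogKd(c("AGCATTAA", "ACGTACGT"), flankScore = c(-150, 200))
```

In this toy setup, only two numbers per 8-mer need to be stored, while the contribution of the flanking di-nucleotides is computed at prediction time.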
# Session info {.unnumbered}
```{r sessionInfo, echo=FALSE}
sessionInfo()
```