---
title: "Structstrings"
author: "Felix G.M. Ernst"
date: "`r Sys.Date()`"
package: Structstrings
abstract: >
  Classes for RNA sequences with secondary structure informations
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    df_print: paged
vignette: >
  %\VignetteIndexEntry{Structstrings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown(css.files = c('custom.css'))
```

# Introduction

The `Structstrings` package implements the widely used dot bracket annotation 
to store base pairing information in structured RNA. For example it is used
in the ViennaRNA package [@Lorenz.2011], the tRNAscan-SE software [@Lowe.1997] 
and the tRNAdb [@Juhling.2009].

`Structstrings` uses the infrastructure provided by the
[Biostrings](#References) package [@Pages] and derives the class
`DotBracketString` and related classes from the `BString` class. From these base
pair tables can be produced for in depth analysis, for which the
`DotBracketDataFrame` class is derived from the `DataFrame` class. In addition,
the loop indices of the base pairs can be retrieved as a `LoopIndexList`, a
derivate if the `IntegerList` class. Generally, all classes check automatically
for the validity of the base pairing information.

The conversion of the `DotBracketString` to the base pair table and the loop 
indices is implemented in C for efficiency. The C implementation to a large 
extent inspired by the [ViennaRNA](https://www.tbi.univie.ac.at/RNA/) package.

This package was developed as an improvement for the `tRNA` package. However,
other projects might benefit as well, so it was split of and improved upon.

# Creating and accessing structure information

```{r package, echo=FALSE}
suppressPackageStartupMessages({
  library(Structstrings)
})
```

The package is installed from Bioconductor and loaded.

```{r package2, eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install("Structstrings")
library(Structstrings)
```

`DotBracketString` objects can be created from character as any other `XString`.
The validity of the structure information is checked upon creation or 
modification of the object.

```{r creation, error=TRUE, purl=FALSE}
# Hairpin with 4 base pairs
dbs <- DotBracketString("((((....))))")
dbs
# a StringSet with four hairpin structures, which are all equivalent
dbs <- DotBracketStringSet(c("((((....))))",
                             "<<<<....>>>>",
                             "[[[[....]]]]",
                             "{{{{....}}}}"))
dbs
# StringSetList for storing even more structure annotations
dbsl <- DotBracketStringSetList(dbs,rev(dbs))
dbsl
# invalid structure
DotBracketString("((((....)))")
```

Annotations can be converted using the `convertAnnotation` function.

```{r annotation_convert}
dbs[[2L]] <- convertAnnotation(dbs[[2L]],from = 2L, to = 1L)
dbs[[3L]] <- convertAnnotation(dbs[[3L]],from = 3L, to = 1L)
dbs[[4L]] <- convertAnnotation(dbs[[4L]],from = 4L, to = 1L)
# Note: convertAnnotation checks for presence of annotation and stops
# if there is any conflict.
dbs
```

The dot bracket annotation can be turned into a base pairing table, which allows
the base pairing information to be queried more easily. For example, the `tRNA` 
package makes uses this to identify the structural elements for tRNAs.

For this purpose the class `DotBracketDataFrame` is derived from `DataFrame`.
This special `DataFrame` can only contain 5 columns, `pos`, `forward`, `reverse`
`character`, `base`. The first three are obligatory, whereas the last two are
optional.

```{r base_pairing}
# base pairing table
dbdfl <- getBasePairing(dbs)
dbdfl[[1L]]
```

The types of each column are also fixed as shown in the example above. The fifth 
column not shown above must be an `XStringSet` object.

Additionally, loop indices can be generated for the individual annotation types. 
These information can also be used to distinguish structure elements.

```{r loopindices}
loopids <- getLoopIndices(dbs, bracket.type = 1L)
loopids[[1L]]
# can also be constructed from DotBracketDataFrame and contains the same 
# information
loopids2 <- getLoopIndices(dbdfl, bracket.type = 1L)
all(loopids == loopids2)
```

# Creating a dot bracket annotation from base pairing information

The dot bracket annotation can be recreated from a `DotBracketDataFrame` object
with the function `getDotBracket()`. If the `character` column is present, this 
informations is just concatenated and used to create a `DotBracketString`. If
it is not present or `force.bracket` is set to `TRUE`, the dot bracket string
is created from the base pairing information.

```{r dotbracket}
rec_dbs <- getDotBracket(dbdfl)
dbdf <- unlist(dbdfl)
dbdf$character <- NULL
dbdfl2 <- relist(dbdf,dbdfl)
# even if the character column is not set, the dot bracket string can be created
rec_dbs2 <- getDotBracket(dbdfl2)
rec_dbs3 <- getDotBracket(dbdfl, force = TRUE)
rec_dbs[[1L]]
rec_dbs2[[1L]]
rec_dbs3[[1L]]
```

Please be aware that `getDotBracket()` might return a different output than 
original input, if this information is turned around from a `DotBracketString` 
to `DotBracketDataFrame` and back to a `DotBracketString`. First the `()` 
annotation is used followed by `<>`, `[]` and `{}` in this order.

For a `DotBracketString` containing only one type of annotation this might not 
mean much, except if the `character` string itself is evaluated. However,
if pseudoloops are present, this will lead potentially to a reformated and
simplified annotation.

```{r pseudoloop}
db <- DotBracketString("((((....[[[))))....((((....<<<<...))))]]]....>>>>...")
db
getDotBracket(getBasePairing(db), force = TRUE)
```

# Storing sequence and structure in one object

To store a nucleotide sequence and a structure in one object, the classes
`StructuredRNAStringSet` are implemented.

```{r structured_rna_string}
data("dbs", package = "Structstrings")
data("nseq", package = "Structstrings")
sdbs <- StructuredRNAStringSet(nseq,dbs)
sdbs[1L]
# subsetting to element returns the sequence
sdbs[[1L]]
# dotbracket() gives access to the DotBracketStringSet
dotbracket(sdbs)[[1L]]
```

The base pair table can be directly accessed using `getBasePairing()`. The 
`base` column is automatically populated from the nucleotide sequence. This is a
bit slower than just creating the base pair table. Therefore this step can be 
omitted by setting `return.sequence` to `FALSE`.

```{r structured_rna_string_base_pairing}
dbdfl <- getBasePairing(sdbs)
dbdfl[[1L]]
# returns the result without sequence information
dbdfl <- getBasePairing(sdbs, return.sequence = TRUE)
dbdfl[[1L]]
```

# Session info

```{r}
sessionInfo()
```

<a name="References"></a>

# References