---
title: "Structstrings"
author: "Felix G.M. Ernst"
date: "`r Sys.Date()`"
package: Structstrings
abstract: >
  Classes for RNA sequences with secondary structure information
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    df_print: paged
vignette: >
  %\VignetteIndexEntry{Structstrings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown(css.files = c('custom.css'))
```

# Introduction

The `Structstrings` package implements the widely used dot bracket annotation
for storing base pairing information in structured RNA. This annotation is
used, for example, by the ViennaRNA package [@Lorenz.2011], the tRNAscan-SE
software [@Lowe.1997] and the tRNAdb [@Juhling.2009].

`Structstrings` uses the infrastructure provided by the
[Biostrings](#References) package [@Pages] and derives the class
`DotBracketString` and related classes from the `BString` class. From these,
base pair tables can be produced for in-depth analysis, for which the
`DotBracketDataFrame` class is derived from the `DataFrame` class. In addition,
the loop indices of the base pairs can be retrieved as a `LoopIndexList`, a
derivative of the `IntegerList` class. All classes automatically check the
validity of the base pairing information.

The conversion of a `DotBracketString` to the base pair table and the loop
indices is implemented in C for efficiency. The C implementation is to a large
extent inspired by the [ViennaRNA](https://www.tbi.univie.ac.at/RNA/) package.

This package was developed as an improvement for the `tRNA` package. However,
other projects might benefit as well, so it was split off and improved upon.

# Creating and accessing structure information

```{r package, echo=FALSE}
suppressPackageStartupMessages({
  library(Structstrings)
})
```

The package is installed from Bioconductor and loaded.

```{r package2, eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Structstrings")
library(Structstrings)
```

`DotBracketString` objects can be created from a character vector like any
other `XString`. The validity of the structure information is checked upon
creation or modification of the object.

```{r creation, error=TRUE, purl=FALSE}
# Hairpin with 4 base pairs
dbs <- DotBracketString("((((....))))")
dbs
# a StringSet with four hairpin structures, which are all equivalent
dbs <- DotBracketStringSet(c("((((....))))",
                             "<<<<....>>>>",
                             "[[[[....]]]]",
                             "{{{{....}}}}"))
dbs
# StringSetList for storing even more structure annotations
dbsl <- DotBracketStringSetList(dbs,rev(dbs))
dbsl
# invalid structure
DotBracketString("((((....)))")
```

Annotations can be converted using the `convertAnnotation` function.

```{r annotation_convert}
dbs[[2L]] <- convertAnnotation(dbs[[2L]],from = 2L, to = 1L)
dbs[[3L]] <- convertAnnotation(dbs[[3L]],from = 3L, to = 1L)
dbs[[4L]] <- convertAnnotation(dbs[[4L]],from = 4L, to = 1L)
# Note: convertAnnotation checks for presence of annotation and stops
# if there is any conflict.
dbs
```

The dot bracket annotation can be turned into a base pairing table, which
allows the base pairing information to be queried more easily. For example, the
`tRNA` package uses this to identify the structural elements of tRNAs.

For this purpose the class `DotBracketDataFrame` is derived from `DataFrame`.
This special `DataFrame` can only contain five columns: `pos`, `forward`,
`reverse`, `character` and `base`. The first three are mandatory, whereas the
last two are optional.

```{r base_pairing}
# base pairing table
dbdfl <- getBasePairing(dbs)
dbdfl[[1L]]
```

The types of each column are also fixed as shown in the example above. The
fifth column, not shown above, must be an `XStringSet` object.

Additionally, loop indices can be generated for the individual annotation types.
This information can also be used to distinguish structure elements.

```{r loopindices}
loopids <- getLoopIndices(dbs, bracket.type = 1L)
loopids[[1L]]
# can also be constructed from DotBracketDataFrame and contains the same
# information
loopids2 <- getLoopIndices(dbdfl, bracket.type = 1L)
all(loopids == loopids2)
```

# Creating a dot bracket annotation from base pairing information

The dot bracket annotation can be recreated from a `DotBracketDataFrame` object
with the function `getDotBracket()`. If the `character` column is present, this
information is simply concatenated and used to create a `DotBracketString`. If
it is not present or `force.bracket` is set to `TRUE`, the dot bracket string is
created from the base pairing information.

```{r dotbracket}
rec_dbs <- getDotBracket(dbdfl)
dbdf <- unlist(dbdfl)
dbdf$character <- NULL
dbdfl2 <- relist(dbdf,dbdfl)
# even if the character column is not set, the dot bracket string can be created
rec_dbs2 <- getDotBracket(dbdfl2)
rec_dbs3 <- getDotBracket(dbdfl, force = TRUE)
rec_dbs[[1L]]
rec_dbs2[[1L]]
rec_dbs3[[1L]]
```

Please be aware that `getDotBracket()` might return output that differs from
the original input when a `DotBracketString` is converted to a
`DotBracketDataFrame` and back to a `DotBracketString`. The `()` annotation is
used first, followed by `<>`, `[]` and `{}`, in this order.

For a `DotBracketString` containing only one type of annotation this might not
mean much, unless the `character` string itself is evaluated. However, if
pseudoloops are present, this can lead to a reformatted and simplified
annotation.

```{r pseudoloop}
db <- DotBracketString("((((....[[[))))....((((....<<<<...))))]]]....>>>>...")
db
getDotBracket(getBasePairing(db), force = TRUE)
```

# Storing sequence and structure in one object

To store a nucleotide sequence and a structure in one object, the
`StructuredRNAStringSet` class is implemented.
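
As a minimal sketch of how such an object might be assembled by hand, the chunk
below pairs a single made-up sequence with a hairpin structure of the same
length. It assumes that `StructuredRNAStringSet()` accepts a
`Biostrings::RNAStringSet` as its first argument and that sequence and
structure must have matching widths. The sequence and the object names
(`seq_sketch`, `str_sketch`, `sdbs_sketch`) are purely illustrative, and the
chunk is not evaluated when the vignette is built.

```{r structured_rna_sketch, eval=FALSE}
library(Structstrings)
# hypothetical 12 nt sequence; not taken from the package data
seq_sketch <- Biostrings::RNAStringSet("GGGGAAAACCCC")
# hairpin with 4 base pairs, one structure character per nucleotide
str_sketch <- DotBracketStringSet("((((....))))")
# assumption: sequence and structure are combined element by element
sdbs_sketch <- StructuredRNAStringSet(seq_sketch, str_sketch)
# structure accessor and base pair table for the combined object
dotbracket(sdbs_sketch)
getBasePairing(sdbs_sketch)[[1L]]
```

The example data shipped with the package is combined in the same way.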

```{r structured_rna_string}
data("dbs", package = "Structstrings")
data("nseq", package = "Structstrings")
sdbs <- StructuredRNAStringSet(nseq,dbs)
sdbs[1L]
# subsetting to element returns the sequence
sdbs[[1L]]
# dotbracket() gives access to the DotBracketStringSet
dotbracket(sdbs)[[1L]]
```

The base pair table can be directly accessed using `getBasePairing()`. The
`base` column is automatically populated from the nucleotide sequence. This is a
bit slower than just creating the base pair table, so this step can be skipped
by setting `return.sequence` to `FALSE`.

```{r structured_rna_string_base_pairing}
dbdfl <- getBasePairing(sdbs)
dbdfl[[1L]]
# returns the result without sequence information
dbdfl <- getBasePairing(sdbs, return.sequence = FALSE)
dbdfl[[1L]]
```

# Session info

```{r}
sessionInfo()
```

# References