--- title: "RNAmodR: creating classes for additional modification detection from high throughput sequencing." author: "Felix G.M. Ernst" date: "`r Sys.Date()`" package: RNAmodR output: BiocStyle::html_document: toc: true toc_float: true df_print: paged vignette: > %\VignetteIndexEntry{RNAmodR - creating new classes for a new detection strategy} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown(css.files = c('custom.css')) ``` # Introduction For users interested in the general aspect of any `RNAmodR` based package please have a look at the [main vignette](RNAmodR.html) of the package. This vignette is aimed at developers and researchers, who want to use the functionality of the `RNAmodR` package to develop a new modification strategy based on high throughput sequencing data. ```{r, echo=FALSE} suppressPackageStartupMessages({ library(RNAmodR) }) ``` ```{r, eval=FALSE} library(RNAmodR) ``` Two classes have to be considered to establish a new analysis pipeline using `RNAmodR`. These are the `SequenceData` and the `Modifier` class. # A new `SequenceData` class First, the `SequenceData` class has to be considered. Several classes are already implemented, which are: * `End5SequenceData` * `End3SequenceData` * `EndSequenceData` * `ProtectedEndSequenceData` * `CoverageSequenceData` * `PileupSequenceData` * `NormEnd5SequenceData` * `NormEnd3SequenceData` If these cannot be reused, a new class can be implemented quite easily. First the DataFrame class, the Data class and a constructor has to defined. The only value, which has to be provided, is a default `minQuality` integer value and some basic information. ```{r} setClass(Class = "ExampleSequenceDataFrame", contains = "SequenceDFrame") ExampleSequenceDataFrame <- function(df, ranges, sequence, replicate, condition, bamfiles, seqinfo){ RNAmodR:::.SequenceDataFrame("Example",df, ranges, sequence, replicate, condition, bamfiles, seqinfo) } setClass(Class = "ExampleSequenceData", contains = "SequenceData", slots = c(unlistData = "ExampleSequenceDataFrame"), prototype = list(unlistData = ExampleSequenceDataFrame(), unlistType = "ExampleSequenceDataFrame", minQuality = 5L, dataDescription = "Example data")) ExampleSequenceData <- function(bamfiles, annotation, sequences, seqinfo, ...){ RNAmodR:::SequenceData("Example", bamfiles = bamfiles, annotation = annotation, sequences = sequences, seqinfo = seqinfo, ...) } ``` Second, the `getData` function has to be implemented. This is used to load the data from a bam file and must return a named list `IntegerList`, `NumericList` or `CompressedSplitDataFrameList` per file. ```{r} setMethod("getData", signature = c(x = "ExampleSequenceData", bamfiles = "BamFileList", grl = "GRangesList", sequences = "XStringSet", param = "ScanBamParam"), definition = function(x, bamfiles, grl, sequences, param, args){ ### } ) ``` Third, the `aggregate` function has to be implemented. This function is used to aggregate data over replicates for all or one of the conditions. The resulting data is passed on to the `Modifier` class. ```{r} setMethod("aggregateData", signature = c(x = "ExampleSequenceData"), function(x, condition = c("Both","Treated","Control")){ ### } ) ``` # A new `Modifier` class A new `Modifier` class is probably the main class, which needs to be implemented. Three variable have to be set. `mod` must be a single element from the `Modstrings::shortName(Modstrings::ModRNAString())`. `score` is the default score, which is used for several function. A column with this name should be returned from the `aggregate` function. `dataType` defines the `SequenceData` class to be used. `dataType` can contain multiple names of a `SequenceData` class, which are then combined to form a `SequenceDataSet`. ```{r} setClass("ModExample", contains = c("RNAModifier"), prototype = list(mod = "X", score = "score", dataType = "ExampleSequenceData")) ModExample <- function(x, annotation, sequences, seqinfo, ...){ RNAmodR:::Modifier("ModExample", x = x, annotation = annotation, sequences = sequences, seqinfo = seqinfo, ...) } ``` `dataType` can also be a `list` of `character` vectors, which leads then to the creation of `SequenceDataList`. However, for now this is a hypothetical case and should only be used, if the detection of a modification requires bam files from two or more different methods to be used to detect one modification. The `settings<-` function can be amended to save specifc settings ( `.norm_example_args` must be defined seperatly to normalize input arguments in any way one sees fit). ```{r} setReplaceMethod(f = "settings", signature = signature(x = "ModExample"), definition = function(x, value){ x <- callNextMethod() # validate special setting here x@settings[names(value)] <- unname(.norm_example_args(value)) x }) ``` The `aggregateData` function is used to take the aggregated data from the `SequenceData` object and to calculate the specific scores, which are then stored in the `aggregate` slot. ```{r} setMethod(f = "aggregateData", signature = signature(x = "ModExample"), definition = function(x, force = FALSE){ # Some data with element per transcript } ) ``` The `findMod` function takes the aggregate data and searches for modifications, which are then returned as a GRanges object and stored in the `modifications` slot. ```{r} setMethod("findMod", signature = c(x = "ModExample"), function(x){ # an element per modification found. } ) ``` ## A new `ModifierSet` class The `ModifierSet` class is implemented very easily by defining the class and the constructor. The functionality is defined by the `Modifier` class. ```{r} setClass("ModSetExample", contains = "ModifierSet", prototype = list(elementType = "ModExample")) ModSetExample <- function(x, annotation, sequences, seqinfo, ...){ RNAmodR:::ModifierSet("ModExample", x = x, annotation = annotation, sequences = sequences, seqinfo = seqinfo, ...) } ``` # Visualization functions Additional functions, which need to be implemented, are `getDataTrack` for the new `SequenceData` and new `Modifier` classes and `plotData`/`plotDataByCoord` for the new `Modifier` and `ModifierSet` classes. `name` defines a transcript name found in `names(ranges(x))` and `type` is the data type typically found as a column in the `aggregate` slot. ```{r} setMethod( f = "getDataTrack", signature = signature(x = "ExampleSequenceData"), definition = function(x, name, ...) { ### } ) setMethod( f = "getDataTrack", signature = signature(x = "ModExample"), definition = function(x, name, type, ...) { } ) setMethod( f = "plotDataByCoord", signature = signature(x = "ModExample", coord = "GRanges"), definition = function(x, coord, type = "score", window.size = 15L, ...) { } ) setMethod( f = "plotData", signature = signature(x = "ModExample"), definition = function(x, name, from, to, type = "score", ...) { } ) setMethod( f = "plotDataByCoord", signature = signature(x = "ModSetExample", coord = "GRanges"), definition = function(x, coord, type = "score", window.size = 15L, ...) { } ) setMethod( f = "plotData", signature = signature(x = "ModSetExample"), definition = function(x, name, from, to, type = "score", ...) { } ) ``` If unsure, how to modify these functions, have a look a the code in the `Modifier-Inosine-viz.R` file of this package. # Summary As suggested directly above, for a more detailed example have a look at the `ModInosine` class source code found in the `Modifier-Inosine-class.R` and `Modifier-Inosine-viz.R` files of this package. # Sessioninfo ```{r} sessionInfo() ```