---
title: "Introduction to derfinderHelper"
author: 
  - name: Leonardo Collado-Torres
    affiliation:
    - &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
    - &ccb Center for Computational Biology, Johns Hopkins University
    email: lcolladotor@gmail.com
output:
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('derfinderHelper')`"
vignette: >
  %\VignetteIndexEntry{Introduction to derfinderHelper}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

# Basics

## Install `r Biocpkg('derfinderHelper')`

`R` is an open-source statistical environment which can be easily modified to enhance its functionality via packages. `r Biocpkg('derfinderHelper')` is a `R` package available via the [Bioconductor](http://bioconductor/packages/derfinderHelper) repository for packages. `R` can be installed on any operating system from [CRAN](https://cran.r-project.org/) after which you can install `r Biocpkg('derfinderHelper')` by using the following commands in your `R` session:

```{r 'installDer', eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("derfinderHelper")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```

## Required knowledge

`r Biocpkg('derfinderHelper')` is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. A `r Biocpkg('derfinderHelper')` user is not expected to deal with those packages directly but will need to be familiar with `r Biocpkg('derfinder')`.

If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](http://lcolladotor.github.io/2014/10/16/startbioc/#.VkOKbq6rRuU).

## Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: remember to use the `derfinder` or `derfinderHelper` tags and check [the older posts](https://support.bioconductor.org/t/derfinder/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

## Citing `r Biocpkg('derfinderHelper')`

We hope that `r Biocpkg('derfinderHelper')` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

```{r 'citation'}
## Citation info
citation('derfinderHelper')
```

# Introduction to `r Biocpkg('derfinderHelper')`

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library('knitcitations')

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html')
# Note links won't show for now due to the following issue
# https://github.com/cboettig/knitcitations/issues/63

## Write bibliography information
bib <- c(knitcitations = citation('knitcitations'),
    derfinderHelper = citation('derfinderHelper')[1], 
    BiocStyle = citation('BiocStyle'),
    knitr = citation('knitr')[3],
    rmarkdown = citation('rmarkdown'),
    BiocParallel = citation('BiocParallel'),
    R = citation(),
    IRanges = citation('IRanges'),
    Matrix = citation('Matrix'),
    S4Vectors = RefManageR::BibEntry(bibtype = 'manual', key = 'S4Vectors',
        author = 'Hervé Pagès and Michael Lawrence and Patrick Aboyoun',
        title = "S4Vectors: S4 implementation of vector-like and list-like objects",
        year = 2017, doi = '10.18129/B9.bioc.S4Vectors'),
    sessioninfo = citation('sessioninfo'),
    testthat = citation('testthat'))
write.bibtex(bib, file = 'derfinderHelperRef.bib')
```


`r Biocpkg('derfinderHelper')` `r citep(bib[['derfinderHelper']])` is a small package that was created to speed up the F-statistics approach implemented in the parent package `r Biocpkg('derfinder')`. It contains a single function, `fstats.apply()`, which is used to calculate the F-statistics for a given data matrix, null and an alternative models.

The data is generally arranged in an matrix where the rows ($n$) are the genomic features of interest (gene-level summaries, exon-level summaries, or base-level data) and the columns ($m$) represent the samples. The other two main arguments for `fstats.apply()` are the null and alternative model matrices which are $m \times p_0$ and $m \times p$  where $p_0$ is the number of covariates in the null model and $p$ is the number of covariates in the alternative model. The models have to be nested and thus by definition $p > p_0$. The end result is a vector of F-statistics with length $n$, which is run length encoded for memory saving purposes.

Other arguments of `fstats.apply()` are related to flow in `r Biocpkg('derfinder')` such as the scaling factor (`scalefac`) used, whether to subset the data (`index`), and if the data was separated into chunks and saved to disk to lower the memory load (`lowMemDir`).

Implementation-wise, `adjustF` is useful when the denominator of the F-statistic calculation is too small. Finally, `method` controls how will the F-statistics be calculated. 

* `Matrix` is the recommended option because it uses around half the memory load of `regular` and can be faster. Specially if the data was saved in this format previously by `r Biocpkg('derfinder')`.
* `Rle` uses the least amount of memory but gets very slow as the number of samples increases. Thus making it less than ideal in several cases.
* `regular` uses base `R` to calculate the F-statistics and can require a large amount of memory. This is noticeable when using several cores to run `fstats.apply()` on different portions of the data.

The F-statistics for each feature $i$ are calculated using the following formula:

$$ F_i = \frac{ (\text{RSS0}_i - \text{RSS1}_i)/(\text{df}_1 - \text{df}_0) }{ \text{adjustF} + (\text{RSS1}_i / (p - p_0 - \text{df_1}))} $$


# Example

The following section walks through an example. However, in practice, you will probably not use this package directly and it will be used via `r Biocpkg('derfinder')`.

## Data

First lets create an example data set where we have information for 1000 features and 16 samples where samples 1 to 4 are from group A, 5 to 8 from group B, 9 to 12 from group C, and 13 to 16 from group D. 

```{r 'createData'}
## Create some toy data
suppressPackageStartupMessages(library('IRanges'))
set.seed(20140923)
toyData <- DataFrame(
    'sample1' = Rle(sample(0:10, 1000, TRUE)),
    'sample2' = Rle(sample(0:10, 1000, TRUE)),
    'sample3' = Rle(sample(0:10, 1000, TRUE)),
    'sample4' = Rle(sample(0:10, 1000, TRUE)),
    'sample5' = Rle(sample(0:15, 1000, TRUE)),
    'sample6' = Rle(sample(0:15, 1000, TRUE)),
    'sample7' = Rle(sample(0:15, 1000, TRUE)),
    'sample8' = Rle(sample(0:15, 1000, TRUE)),
    'sample9' = Rle(sample(0:20, 1000, TRUE)),
    'sample10' = Rle(sample(0:20, 1000, TRUE)),
    'sample11' = Rle(sample(0:20, 1000, TRUE)),
    'sample12' = Rle(sample(0:20, 1000, TRUE)),
    'sample13' = Rle(sample(0:100, 1000, TRUE)),
    'sample14' = Rle(sample(0:100, 1000, TRUE)),
    'sample15' = Rle(sample(0:100, 1000, TRUE)),
    'sample16' = Rle(sample(0:100, 1000, TRUE))
)

## Lets say that we have 4 groups
group <- factor(rep(toupper(letters[1:4]), each = 4))

## Note that some groups have higher coverage, we can adjust for this in the model
sampleDepth <- sapply(toyData, sum)
sampleDepth
```

## Models

Next we create the model matrices for our example data set. Lets say that we want to calculate F-statistics comparing the alternative hypothesis that the group coefficients are not 0 versus the null hypothesis that they are equal to 0, when adjusting for the sample depth.

To do so, we create the nested models.

```{r 'createModels'}
## Build the model matrices
mod <- model.matrix(~ sampleDepth + group)
mod0 <- model.matrix(~ sampleDepth)

## Explore them
mod
mod0
```

## Get F-statistics

Finally, we can calculate the F-statistics using `fstats.apply()`.

```{r 'calculateFstats'}
library('derfinderHelper')
fstats <- fstats.apply(data = toyData, mod = mod, mod0 = mod0, scalefac = 1)
fstats
```

We can then proceed to use this information in `r Biocpkg('derfinder')` or in any way you like.

# Details

We created `r Biocpkg('derfinderHelper')` for calculating F-statistics using `SnowParam()` from `r Biocpkg('BiocParallel')` `r citep(bib[['BiocParallel']])`. Using this form of parallelization requires loading the necessary packages in the child processes. Because `r Biocpkg('derfinder')` takes a long time to load, we shipped off `fstats.apply()` to its own package to improve the speed of the calculations while retaining the memory advantages of `SnowParam()` over `MulticoreParam()`.

Note that transforming the data from a `DataFrame` to a `dgCMatrix` takes some time, so the most efficient performance is achieved when the data is converted at the beginning instead of at every permutation calculation. This is done in `derfinder::preprocessCoverage()` when `lowMemDir` is specified.


# Reproducibility

This package was made possible thanks to:

* R `r citep(bib[['R']])`
* `r Biocpkg('IRanges')` `r citep(bib[['IRanges']])`
* `r CRANpkg('Matrix')` `r citep(bib[['Matrix']])`
* `r Biocpkg('S4Vectors')` `r citep(bib[['S4Vectors']])`
* `r CRANpkg('sessioninfo')` `r citep(bib[['sessioninfo']])`
* `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`
* `r CRANpkg('knitr')` `r citep(bib[['knitr']])`
* `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])`
* `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])`
* `r CRANpkg('testthat')` `r citep(bib[['testthat']])`

Code for creating the vignette

```{r createVignette, eval=FALSE}
## Create the vignette
library('rmarkdown')
system.time(render('derfinderHelper.Rmd', 'BiocStyle::html_document'))

## Extract the R code
library('knitr')
knit('derfinderHelper.Rmd', tangle = TRUE)
```

```{r createVignette2}
## Clean up
file.remove('derfinderHelperRef.bib')
```


Date the vignette was generated.

```{r reproducibility1, echo=FALSE}
## Date the vignette was generated
Sys.time()
```

Wallclock time spent generating the vignette.

```{r reproducibility2, echo=FALSE}
## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits=3)
```

`R` session information.

```{r reproducibility3, echo=FALSE}
## Session info
library('sessioninfo')
options(width = 120)
session_info()
```

# Bibliography

This vignette was generated using `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])`
with `r CRANpkg('knitr')` `r citep(bib[['knitr']])` and `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])` running behind the scenes.

Citations made with `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`.

```{r vignetteBiblio, results = 'asis', echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
bibliography()
```