---
title: "regsplice workflow"
author: 
  - name: Lukas M. Weber
    affiliation: 
      - &id1 "Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland"
      - &id2 "SIB Swiss Institute of Bioinformatics, Zurich, Switzerland"
  - name: Mark D. Robinson
    affiliation: 
      - *id1
      - *id2
package: regsplice
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{regsplice workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```


# Introduction

The `regsplice` package implements statistical methods for the detection of differential exon usage (differential splicing) in RNA sequencing (RNA-seq) and exon microarray data sets.

The `regsplice` methods are based on the use of the lasso (L1-regularization) to improve the power of standard generalized linear models. A key advantage of `regsplice` is that runtimes are fast compared to other leading approaches. We anticipate that similar regularization-based methods may also have applications in other settings.

The detailed statistical methodology and performance comparisons with other methods will be described in an upcoming paper.


## Example workflow

This vignette demonstrates an example workflow for the `regsplice` package, using a small simulated RNA-seq data set.

There are two options for running `regsplice`: you can run a complete workflow in one step using the wrapper function `regsplice()`; or you can run the individual functions for each step in sequence, which provides additional flexibility and insight into the methodology. Both options are demonstrated below.


## Data set

The data set used for the example workflow consists of exon-level read counts for a subset of 100 genes from a simulated human RNA-seq data set, consisting of 6 biological samples, with 3 samples in each of 2 conditions.

The original data set is from the paper:

> Soneson, Matthes et al. (2016), *Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage*, Genome Biology, [available here](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0862-3)

Original data files from this paper, containing the simulated RNA-seq reads (FASTQ and BAM files), are available from ArrayExpress at accession code [E-MTAB-3766](http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3766/).

Exon bin counts were generated with the Python counting scripts provided with the [DEXSeq](http://bioconductor.org/packages/release/bioc/html/DEXSeq.html) package, using the option to exclude exons from overlapping genes instead of aggregating them into multi-gene complexes (see Soneson et al. 2016, Supplementary Material).

For this example workflow, we have selected a subset consisting of the first 100 genes from this simulated data set. The exon-level read counts and the true differential splicing status labels for these 100 genes are saved in the text files `vignette_counts.txt` and `vignette_truth.txt` in the `extdata/` directory in the `regsplice` package source code.


## Exon microarray data

The `regsplice` methods are designed to work with both RNA-seq read counts and exon microarray intensities.

If you are using exon microarray data, the main steps in the workflow are the same as shown below for RNA-seq data. However, the following adjustments to the workflow are required:

- Instead of RNA-seq read counts, a matrix or data frame of exon microarray intensities is provided to the `counts` input argument. The name of the argument is still `counts`, regardless of the input data type.

- Exon microarray intensities should be log2-transformed externally, before they are provided to `regsplice`. This is usually done during pre-processing of microarray data, and may be done automatically depending on your software.

- Filtering of zero-count and low-count exon bins should be disabled, by setting the arguments `filter_zeros = FALSE` and `filter_low_counts = FALSE`.

- Calculation of normalization factors should be disabled, by setting `normalize = FALSE`.

- Calculation of `limma-voom` transformation and weights should be disabled, by setting `voom = FALSE`.


# Workflow

## Load data and create condition vector

Load the vignette example data file, which contains simulated RNA-seq read counts for 100 genes across 6 biological samples. From the raw data, extract the table of counts (`counts`), gene IDs (`gene_IDs`), and number of exon bins per gene (`n_exons`).

Then create the `condition` vector, which specifies the experimental conditions or treatment groups for each biological sample.

\vspace{6pt}

```{r}
# load data
file_counts <- system.file("extdata/vignette_counts.txt", package = "regsplice")
data <- read.table(file_counts, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

head(data)

# extract counts, gene_IDs, and n_exons
counts <- data[, 2:7]
tbl_exons <- table(sapply(strsplit(data$exon, ":"), function(s) s[[1]]))
gene_IDs <- names(tbl_exons)
n_exons <- unname(tbl_exons)

dim(counts)

length(gene_IDs)

head(gene_IDs)

length(n_exons)

sum(n_exons)

# create condition vector
condition <- rep(c("untreated", "treated"), each = 3)

condition
```


## Run workflow with wrapper function

The `regsplice()` wrapper function runs the analysis pipeline with a single command. The required input format for the wrapper function is a `RegspliceData` object, which is created with the `RegspliceData()` constructor function.

The results of a `regsplice` analysis consist of a set of multiple testing adjusted p-values (Benjamini-Hochberg false discovery rates, FDR) quantifying the statistical evidence for differential exon usage (DEU) for each gene. The adjusted p-values are used to rank the genes in the data set according to their evidence for DEU, and a significance threshold can be specified to generate a list of genes with statistically significant evidence for DEU.

The wrapper function returns gene names, fitted model results, raw p-values, multiple testing adjusted p-values, likelihood ratio (LR) test statistics, and degrees of freedom of the LR tests.

The required inputs for the `RegspliceData()` constructor function are `counts` (matrix or data frame of RNA-seq read counts or exon microarray intensities), `gene_IDs` (vector of gene IDs), `n_exons` (vector of exon lengths, i.e. number of exon bins per gene), and `condition` (vector of experimental conditions for each biological sample).

Alternatively, the inputs can be provided as a `SummarizedExperiment` object, which will be parsed to extract these components. This may be useful when running `regsplice` as part of a pipeline with other packages.

Note that we have used `suppressWarnings()` to hide warning messages related to the small number of observations per gene in this example data set. For most data sets, these warning messages should not be present.

See `?RegspliceData` and `?regsplice` for additional details, including other available inputs and options. The `seed` argument is used to generate reproducible results.

\vspace{6pt}

```{r}
library(regsplice)

rs_data <- RegspliceData(counts, gene_IDs, n_exons, condition)

rs_results <- suppressWarnings(regsplice(rs_data, seed = 123))
```


### Summary table of results

The function `summaryTable()` is used to generate a summary table of the results.

The results are displayed as a data frame of the top `n` most highly significant genes, ranked according to either the false discovery rate (FDR) or raw p-values, up to a specified significance threshold (e.g. FDR < 0.05).

The argument `rank_by` chooses whether to rank by FDR or raw p-values.

To display results for all genes up to the significance threshold, set the argument `n = Inf`. To display results for all genes in the data set, set both `n = Inf` and `threshold = 1`.

For more details, see `?summaryTable`.

\vspace{6pt}

```{r}
summaryTable(rs_results)
```


## Run workflow using functions for individual steps

Alternatively, the `regsplice` analysis pipeline can be run using the individual functions for each step, which provides additional flexibility and insight into the statistical methodology. The steps are described below.


### Create RegspliceData object

The first step is to create a `RegspliceData` object, which contains data in the format required by the functions in the `regsplice` analysis pipeline.

The `RegspliceData` format is based on the `SummarizedExperiment` container from Bioconductor. The main advantage of this format is that subsetting operations automatically keep data and meta-data for rows and columns in sync, which helps avoid errors caused by selecting incorrect row or column indices.

Initially, the `RegspliceData` objects contain raw data along with meta-data for rows (genes and exon bins) and columns (biological samples). During subsequent steps in the `regsplice` analysis pipeline, the data values are modified, and additional data and meta-data are added. Final results are stored in a `RegspliceResults` object.

The required inputs are `counts` (matrix or data frame of RNA-seq read counts or exon microarray intensities), `gene_IDs` (vector of gene IDs), `n_exons` (vector of exon lengths, i.e. number of exon bins per gene), and `condition` (vector of experimental conditions for each biological sample).

Alternatively, the inputs can be provided as a `SummarizedExperiment` object, which will be parsed to extract these components. This may be useful when running `regsplice` as part of a pipeline with other packages.

Note that the warning messages are due to the small number of observations per gene in the data set used in this vignette. In most data sets, these warning messages should not be present.

For more details, see `?RegspliceData`.

\vspace{6pt}

```{r}
library(regsplice)

rs_data <- RegspliceData(counts, gene_IDs, n_exons, condition)
```


### Filter zero-count exon bins

Next, use the function `filterZeros()` to filter exon bins (rows) with zero counts in all biological samples (columns).

Any remaining single-exon genes are also removed (since differential splicing requires multiple exons).

If you are using exon microarray data, this step should be skipped.

For more details, see `?filterZeros`.

\vspace{6pt}

```{r}
rs_data <- filterZeros(rs_data)
```


### Filter low-count exons

Filter low-count exon bins with `filterLowCounts()`.

The arguments `filter_min_per_exon` and `filter_min_per_sample` control the amount of filtering. Default values are provided; however, these should be adjusted depending on the total number of samples and the number of samples per condition.

Any remaining single-exon genes are also removed.

If you are using exon microarray data, this step should be skipped.

For more details, see `?filterLowCounts`.

\vspace{6pt}

```{r}
rs_data <- filterLowCounts(rs_data)
```


### Calculate normalization factors

The function `runNormalization()` calculates normalization factors, which are used to scale library sizes.

By default, `runNormalization()` uses the TMM (trimmed mean of M-values) normalization method (Robinson and Oshlack, 2010), implemented in the `edgeR` package. For more details, see the documentation for `calcNormFactors()` in the `edgeR` package.

This step should be done after filtering. The normalization factors are then used by `limma-voom` in the next step.

If you are using exon microarray data, this step should be skipped.

For more details, see `?runNormalization`.

```{r}
rs_data <- runNormalization(rs_data)
```


### 'voom' transformation and weights

The next step is to use `limma-voom` to transform the counts and calculate exon-level weights. This is done with the `runVoom()` function.

The `limma-voom` methodology transforms counts to log2-counts per million (logCPM), and calculates exon-level weights based on the observed mean-variance relationship. This is required because raw or log-transformed counts do not fulfill the statistical assumptions required for linear modeling (i.e. equal variance). After the `limma-voom` transformation and weights have been calculated, linear modeling methods can be used.

For more details, see the following paper, which introduced `voom`; or the [limma User's Guide](http://bioconductor.org/packages/release/bioc/html/limma.html) (section "Differential splicing") available on Bioconductor.

- Law et al. (2014), *voom: precision weights unlock linear model analysis tools for RNA-seq read counts*, Genome Biology, [available here](http://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29)

Note that `voom` assumes that exon bins (rows) with zero or low counts have already been removed, so this step should be done after filtering with `filterZeros()` and `filterLowCounts()`.

If normalization factors are available (from previous step with `runNormalization()`), they will be used by `voom` to calculate normalized library sizes. If they are not available, `voom` will use non-normalized library sizes (columnwise total counts) instead.

If you are using exon microarray data, this step should be skipped.

For more details, see `?runVoom`.

\vspace{6pt}

```{r}
rs_data <- runVoom(rs_data)

# view column meta-data including normalization factors and normalized library sizes
colData(rs_data)
```


### Initialize RegspliceResults object

The `initializeResults()` function creates a `RegspliceResults` object, which will contain the results of the analysis. This object will be populated in the subsequent steps.

For more details, see `?initializeResults`.

\vspace{6pt}

```{r}
rs_results <- initializeResults(rs_data)
```


### Fit models

There are three model fitting functions: 

- `fitRegMultiple()` fits regularized (lasso) models containing an optimal subset of exon:condition interaction terms for each gene. The model fitting procedure penalizes the interaction terms only, so that the main effect terms for exons and samples are always included. This ensures that the null model is nested, allowing likelihood ratio tests to be calculated.

- `fitNullMultiple()` fits the null models, which do not contain any interaction terms.

- `fitFullMultiple()` fits "full" models, which contain all exon:condition interaction terms for each gene.

The fitting functions fit models for all genes in the data set.

Note that we have used `suppressWarnings()` to hide warning messages related to the small number of observations per gene in this example data set. For most data sets, these warning messages should not be present.

For more details, see `?fitRegMultiple`, `?fitNullMultiple`, or `?fitFullMultiple`.

\vspace{6pt}

```{r}
# set random seed for reproducibility
seed <- 123

# fit regularized models
rs_results <- suppressWarnings(fitRegMultiple(rs_results, rs_data, seed = seed))

# fit null models
rs_results <- fitNullMultiple(rs_results, rs_data, seed = seed)

# fit "full" models (not required if 'when_null_selected = "ones"' in next step)
rs_results <- fitFullMultiple(rs_results, rs_data, seed = seed)
```


### Calculate likelihood ratio tests

The function `LRTests()` calculates likelihood ratio (LR) tests between the fitted models and null models.

If the fitted regularized (lasso) model contains at least one exon:condition interaction term, the LR test compares the lasso model against the nested null model. However, if the lasso model contains zero interaction terms, then the lasso and null models are identical, so the LR test cannot be calculated. The `when_null_selected` argument lets the user choose what to do in these cases: either set p-values equal to 1 (`when_null_selected = "ones"`); or calculate a LR test using the "full" model containing all exon:condition interaction terms (`when_null_selected = "full"`), which reduces power due to the larger number of terms, but allows the evidence for differential exon usage among these genes to be distinguished. You can also return NAs for these genes (`when_null_selected = "NA"`).

The default option is `when_null_selected = "ones"`. This simply calls all these genes non-significant, which in most cases is sufficient since we are more interested in genes with strong evidence for differential exon usage. However, if it is important to rank the low-evidence genes in your data set, use the `when_null_selected = "full"` option. If `when_null_selected = "ones"` or `when_null_selected = "NA"`, the "full" fitted models are not required.

The results object contains gene names, fitted model results, raw p-values, multiple testing adjusted p-values (Benjamini-Hochberg false discovery rates, FDR), likelihood ratio (LR) test statistics, and degrees of freedom of the LR tests.

For more details, see `?LRTests`.

\vspace{6pt}

```{r}
rs_results <- LRTests(rs_results)
```


### Summary table of results

The function `summaryTable()` is used to generate a summary table of the results.

The results are displayed as a data frame of the top `n` most highly significant genes, ranked according to either the false discovery rate (FDR) or raw p-values, up to a specified significance threshold (e.g. FDR < 0.05).

The argument `rank_by` chooses whether to rank by FDR or raw p-values.

To display results for all genes up to the significance threshold, set the argument `n = Inf`. To display results for all genes in the data set, set both `n = Inf` and `threshold = 1`.

For more details, see `?summaryTable`.

\vspace{6pt}

```{r}
summaryTable(rs_results)
```


# Analyze results

For the simulated data set in this vignette, the true differential splicing status of each gene is known. In this section, we show how to analyze the results and calculate a contingency table showing the number of true positives, true negatives, false positives, and false negatives.


## Summary of all significant genes

As shown in the workflow above, we can use the `summaryTable()` function with argument `n = Inf` to display a list of all genes with significant evidence for differential exon usage (DEU).

\vspace{6pt}

```{r}
summaryTable(rs_results, n = Inf)
```


The total number of genes with significant evidence for DEU at a given threshold can also be calculated.

Note that we are using the multiple testing adjusted p-values (Benjamini-Hochberg false discovery rates, FDRs) for this calculation. A standard threshold of FDR < 0.05 implies that 5% of genes in the list are expected to be false discoveries.

\vspace{6pt}

```{r}
sum(p_adj(rs_results) < 0.05)

table(p_adj(rs_results) < 0.05)
```


## Contingency table

As mentioned above, the true differential splicing (DS) status is known for each gene, since this is a simulated data set. Therefore, we can calculate contingency tables comparing the true and predicted DS status for each gene at a given significance threshold. Increasing the significance threshold returns more genes, at the expense of a larger number of false positives.

\vspace{6pt}

```{r}
# load true DS status labels
file_truth <- system.file("extdata/vignette_truth.txt", package = "regsplice")
data_truth <- read.table(file_truth, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

str(data_truth)

# remove genes that were filtered during regsplice analysis
data_truth <- data_truth[data_truth$gene %in% gene_IDs(rs_results), ]

dim(data_truth)

length(gene_IDs(rs_results))

# number of true DS genes in simulated data set
sum(data_truth$ds_status == 1)

table(data_truth$ds_status)

# contingency table comparing true and predicted DS status for each gene
# (significance threshold: FDR < 0.05)
table(true = data_truth$ds_status, predicted = p_adj(rs_results) < 0.05)

# increasing the threshold detects more genes, at the expense of more false positives
table(true = data_truth$ds_status, predicted = p_adj(rs_results) < 0.99)
```


# Additional information


## Additional user options

Additional user options not discussed in the workflow above include:

- `alpha`: Elastic net parameter for `glmnet` model fitting functions. The value of `alpha` must be between 0 (ridge regression) and 1 (lasso). The default value is 1, which fits a lasso model. See `glmnet` package documentation for more details.

<br/>

- `lambda_choice`: Parameter to select which optimal `lambda` value to choose from the `cv.glmnet` cross validation fit. Available choices are `"lambda.min"` (model with minimum cross-validated error) and `"lambda.1se"` (most regularized model with cross-validated error within one standard error of minimum). The default value is `"lambda.min"`. See `glmnet` package documentation for more details.

<br/>

For further details, including a complete list and description of all available user options, refer to the documentation for the `regsplice()` wrapper function, which can be accessed with `?regsplice` or `help(regsplice)`.


## Design matrices

The function `createDesignMatrix()` creates the model design matrix for each gene. This function is called automatically by the model fitting functions, so does not need to be used directly. In this section, we demonstrate how it works for a single gene, and show an example design matrix, in order to provide further insight into the statistical methodology.

The design matrix includes main effect terms for each exon and each sample, and interaction terms between the exons and conditions.

Note that the design matrix does not include main effect terms for the conditions, since these are absorbed into the main effect terms for the samples. In addition, the design matrix does not include an intercept column, since it is simpler to let the model fitting functions add an intercept term later.

For more details, see `?createDesignMatrix`.

\vspace{6pt}

```{r}
# gene with 3 exons
# 4 biological samples; 2 samples in each of 2 conditions
design_example <- createDesignMatrix(condition = rep(c(0, 1), each = 2), n_exons = 3)

design_example
```