---
title: 'satuRn - vignette'
author: "Jeroen Gilis"
bibliography: satuRn.bib
date: "14/07/2022"
output: 
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{satuRn - vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
library(knitr)
```

# Introduction

`satuRn` is an R package for performing differential transcript usage analyses 
in bulk and single-cell transcriptomics datasets. 
The package has three main functions. 

1. The first function, `fitDTU`, is used to model transcript usage profiles by 
means of a quasi-binomial generalized linear model. 

2. Second, the `testDTU` function tests for differential usage of transcripts
between certain groups of interest (e.g. different treatment groups or
cell types). 

3. Finally, the `plotDTU` can be used to visualize the usage profiles of 
selected transcripts in different groups of interest. 

All details about the `satuRn` model and statistical tests are described 
in our preprint [@Gilis2021].

In this vignette, we analyze a small subset of the data from [@Tasic2018]. 
More specifically, an expression matrix and the corresponding metadata of 
the  subset data has been provided with the `satuRn` package. We will adopt 
this dataset to showcase the different functionalities of `satuRn`.

# Package installation

`satuRn` (version 1.4.0) can be installed from Bioconductor with:

```{r, eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("satuRn")
```

Alternatively, the development version of `satuRn` can be downloaded with:
 
```{r, eval=FALSE}
devtools::install_github("statOmics/satuRn")
```

*! for this vignette, we use the development version of satuRn.*
The vignette for the officially released satuRn version 1.4.0 can be viewed from 
[Bioconductor](https://bioconductor.org/packages/release/bioc/vignettes/satuRn/inst/doc/Vignette.html)

# Load libraries

```{r, message=FALSE, warning=FALSE}
library(satuRn)
library(AnnotationHub)
library(ensembldb)
library(edgeR)
library(SummarizedExperiment)
library(ggplot2)
library(DEXSeq)
library(stageR)
```

# Load data

The following data corresponds to a small subset of the dataset 
from [@Tasic2018] and is readily available from the `satuRn` package.
To check how the subset was generate, please check `?Tasic_counts_vignette`.

```{r}
data(Tasic_counts_vignette) # transcript expression matrix
data(Tasic_metadata_vignette) # metadata
```

# Data pre-processing

We start the analysis from scratch, in order to additionally showcase some of
the prerequisite steps for performing a DTU analysis.

# Import transcript information

First, we need an object that links transcripts to their corresponding genes. 
We suggest using the BioConductor R packages `AnnotationHub` and `ensembldb` 
for this purpose.  

```{r, message=FALSE, warning=FALSE}
ah <- AnnotationHub() # load the annotation resource.
all <- query(ah, "EnsDb") # query for all available EnsDb databases
ahEdb <- all[["AH75036"]] # for Mus musculus (choose correct release date)
txs <- transcripts(ahEdb)
```

## Data wrangling

Next, we perform some data wrangling steps to get the data in a format that is
suited for satuRn. First, we create a `DataFrame` or `Matrix` linking 
transcripts to their corresponding genes.

! Important: `satuRn` is implemented such that the columns with transcript 
identifiers is names `isoform_id`, while the column containing gene identifiers 
should be named `gene_id`. In addition, following chunk removes transcripts 
that are the only isoform expressed of a certain gene, as they cannot be used 
in a DTU analysis.

```{r}
# Get the transcript information in correct format
txInfo <- as.data.frame(matrix(data = NA, nrow = length(txs), ncol = 2))
colnames(txInfo) <- c("isoform_id", "gene_id")
txInfo$isoform_id <- txs$tx_id
txInfo$gene_id <- txs$gene_id
rownames(txInfo) <- txInfo$isoform_id

# Remove transcripts that are the only isoform expressed of a certain gene
rownames(Tasic_counts_vignette) <- sub("\\..*", "", 
                                       rownames(Tasic_counts_vignette)) 
# remove transcript version identifiers

txInfo <- txInfo[txInfo$isoform_id %in% rownames(Tasic_counts_vignette), ]
txInfo <- subset(txInfo, 
                 duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE))

Tasic_counts_vignette <- Tasic_counts_vignette[which(
  rownames(Tasic_counts_vignette) %in% txInfo$isoform_id), ]
```

## Filtering

Here we perform some feature-level filtering. For this task, we adopt 
the filtering criterion that is implemented in the R package `edgeR`. 
Alternatively, one could adopt the `dmFilter` criterion from the `DRIMSeq` R 
package, which provides a more stringent filtering when both methods are run 
in default settings. After filtering, we again remove transcripts that are 
the only isoform expressed of a certain gene.

```{r}
filter_edgeR <- filterByExpr(Tasic_counts_vignette,
    design = NULL,
    group = Tasic_metadata_vignette$brain_region,
    lib.size = NULL,
    min.count = 10,
    min.total.count = 30,
    large.n = 20,
    min.prop = 0.7
) # more stringent than default to reduce run time of the vignette

table(filter_edgeR)
Tasic_counts_vignette <- Tasic_counts_vignette[filter_edgeR, ]

# Update txInfo according to the filtering procedure
txInfo <- txInfo[which(
  txInfo$isoform_id %in% rownames(Tasic_counts_vignette)), ]

# remove txs that are the only isoform expressed within a gene (after filtering)
txInfo <- subset(txInfo, 
                 duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE))
Tasic_counts_vignette <- Tasic_counts_vignette[which(rownames(
  Tasic_counts_vignette) %in% txInfo$isoform_id), ]

# satuRn requires the transcripts in the rowData and 
# the transcripts in the count matrix to be in the same order.
txInfo <- txInfo[match(rownames(Tasic_counts_vignette), txInfo$isoform_id), ]
```

## Create a design matrix

Here we set up the design matrix of the experiment. The subset of the dataset 
from [@Tasic2018] contains cells of several different cell types 
(variable `cluster`) in two different areas of the mouse neocortex 
(variable `brain_region`). As such, we can model the data with 
a factorial design, i.e. by generating a new variable `group` that encompasses 
all different cell type - brain region combinations. 

```{r}
Tasic_metadata_vignette$group <- paste(Tasic_metadata_vignette$brain_region, 
                                       Tasic_metadata_vignette$cluster, 
                                       sep = ".")
```

## Generate SummarizedExperiment

All three main functions of `satuRn` require a `SummarizedExperiment` object as 
an input class. See the SummarizedExperiment vignette [@SummarizedExperiment] 
for more information on this object class.  

For the sake of completeness, it is advised to include the design matrix formula 
in the SummarizedExperiment as indicated below.

```{r, message=FALSE}
sumExp <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = Tasic_counts_vignette),
    colData = Tasic_metadata_vignette,
    rowData = txInfo
)

# for sake of completeness: specify design formula from colData
metadata(sumExp)$formula <- ~ 0 + as.factor(colData(sumExp)$group)
sumExp
```

# Fit quasi-binomial generalized linear models models

The `fitDTU` function of `satuRn` is used to model transcript usage in different
groups of samples or cells. Here we adopt the default settings of the function.
Without parallelized execution, this code runs for approximately 15 seconds 
on a 2018 macbook pro laptop. 

```{r}
system.time({
sumExp <- satuRn::fitDTU(
    object = sumExp,
    formula = ~ 0 + group,
    parallel = FALSE,
    BPPARAM = BiocParallel::bpparam(),
    verbose = TRUE
)
})
```

The resulting model fits are now saved into the `rowData` of 
our SummarizedExperiment object under the name `fitDTUModels`. 
These models can be accessed as follows:

```{r}
rowData(sumExp)[["fitDTUModels"]]$"ENSMUST00000037739"
```

The models are instances of the `StatModel` class as defined in 
the `satuRn` package. These contain all relevant information for 
the downstream analysis. For more details, read the StatModel documentation 
with ?satuRn::`StatModel-class`.

# Test for DTU

Here we test for differential transcript usage between select groups 
of interest. In this example, the groups of interest are the three different 
cell types that are present in the dataset associated with this vignette.

## Set up contrast matrix

First, we set up a contrast matrix. This allows us to test for differential 
transcript usage between groups of interest. The `group` factor in this 
toy example contains three levels; (1) ALM.L5_IT_ALM_Tmem163_Dmrtb1, 
(2) ALM.L5_IT_ALM_Tnc, (3) VISp.L5_IT_VISp_Hsd11b1_Endou. 
Here we show to assess DTU between cells of the groups 1 and 3
and between cells of groups 2 and 3.

The contrast matrix can be constructed manually;

```{r}
group <- as.factor(Tasic_metadata_vignette$group)
design <- model.matrix(~ 0 + group) # construct design matrix
colnames(design) <- levels(group)

L <- matrix(0, ncol = 2, nrow = ncol(design)) # initialize contrast matrix
rownames(L) <- colnames(design)
colnames(L) <- c("Contrast1", "Contrast2")

L[c("VISp.L5_IT_VISp_Hsd11b1_Endou","ALM.L5_IT_ALM_Tnc"),1] <-c(1,-1)
L[c("VISp.L5_IT_VISp_Hsd11b1_Endou","ALM.L5_IT_ALM_Tmem163_Dmrtb1"),2] <-c(1,-1)
L # contrast matrix
```

This can also be done automatically with the `makeContrasts` function of 
the `limma` R package.

```{r}
group <- as.factor(Tasic_metadata_vignette$group)
design <- model.matrix(~ 0 + group) # construct design matrix
colnames(design) <- levels(group)

L <- limma::makeContrasts(
    Contrast1 = VISp.L5_IT_VISp_Hsd11b1_Endou - ALM.L5_IT_ALM_Tnc,
    Contrast2 = VISp.L5_IT_VISp_Hsd11b1_Endou - ALM.L5_IT_ALM_Tmem163_Dmrtb1,
    levels = design
)
L # contrast matrix
```

## Perform the test

Next we can perform differential usage testing using `testDTU`. We again adopt 
default settings. For more information on the parameter settings, please consult
the help file of the `testDTU` function. Note that the arguments `diagplot1` and
`diagplot2` were not yet implemented in `satuRn` v1.1.1.

```{r}
sumExp <- satuRn::testDTU(
    object = sumExp,
    contrasts = L,
    diagplot1 = TRUE,
    diagplot2 = TRUE,
    sort = FALSE
)
```

When set to TRUE, the `diagplot1` and `diagplot2` arguments generate a 
diagnostic plot.

For `diagplot1`, the histogram of the z-scores (computed from p-values) is 
displayed using the locfdr function of the `locfdr` package. The blue dashed 
curve is fitted to the mid 50% of the z-scores, which are assumed to originate 
from null transcripts, thus representing the estimated empirical null component 
densities. The maximum likelihood estimates (MLE) and central matching estimates 
(CME) of this estimated empirical null distribution are given below the plot. 
If the MLE estimates for delta and sigma deviate from 0 and 1 respectively, the 
downstream inference will be influenced by the empirical adjustment implemented 
in satuRn (see below).

For `diagplot2`, a plot of the histogram of the "empirically adjusted" test 
statistics and the standard normal distribution will be displayed. Ideally, 
the majority (mid portion) of the adjusted test statistics should follow 
the standard normal. If this is not the case, the inference may be 
untrustworthy and results should be treated with care. One potential solution
is to include (additional) potential covariates in the analysis.

The test results are now saved into the `rowData` of our SummarizedExperiment 
object under the name `fitDTUResult_` followed by the name of the contrast 
of interest (i.e. the column names of the contrast matrix). 
The results can be accessed as follows:

```{r}
head(rowData(sumExp)[["fitDTUResult_Contrast1"]]) # first contrast
```

```{r}
head(rowData(sumExp)[["fitDTUResult_Contrast2"]]) # second contrast
```

The results for are, for each contrast, a dataframe with 8 columns:

- `estimates`: The estimated log-odds ratios (log base e). In the most simple 
case, an estimate of +1 would mean that the odds of picking that transcript from 
the pool of transcripts within its corresponding gene is exp(1) = 2.72 times 
larger in condition B than in condition A.
- `se`: The standard error on this estimate.
- `df`: The posterior degrees of freedom for the test statistic.
- `t`: The student's t-test statistic, computed with a Wald test given
`estimates` and `se`.
- `pval`: The "raw" p-value given `t` and `df`.
- `FDR`: The false discovery rate, computed using the multiple testing 
correction of Benjamini and Hochberg on `pval`.
- `empirical_pval`: An "empirical" p-value that is computed by estimating the 
null distribution of the test statistic empirically. For more details, see our 
publication.
- `empirical_FDR`: The false discovery rate, computed using the multiple testing 
correction of Benjamini and Hochberg on `pval_empirical`.

**!Note: based on the benchmarks in our publication, we always recommend using**
**the empirical p-values from column 7 over the "raw" p-value from column 5.**
When the MLE estimates for the mean and standard deviation (delta and sigma) of
the empirical null density (blue dashed curve in `diagplot1`) deviate from 0 and
1 respectively, there will be a discrepancy between the "raw" and "empirically
adjusted" p-values. A deviation in the standard deviation only affects the 
magnitude of the p-values, whereas a deviation in the mean also affects the
ranking of transcripts according to their p-value.

# Visualize DTU

Finally, we may visualize the usage of select transcripts in select groups 
of interest.

```{r, warning=FALSE}
group1 <- colnames(sumExp)[colData(sumExp)$group == 
                             "VISp.L5_IT_VISp_Hsd11b1_Endou"]
group2 <- colnames(sumExp)[colData(sumExp)$group == 
                             "ALM.L5_IT_ALM_Tnc"]

plots <- satuRn::plotDTU(
    object = sumExp,
    contrast = "Contrast1",
    groups = list(group1, group2),
    coefficients = list(c(0, 0, 1), c(0, 1, 0)),
    summaryStat = "model",
    transcripts = c("ENSMUST00000081554", 
                    "ENSMUST00000195963", 
                    "ENSMUST00000132062"),
    genes = NULL,
    top.n = 6
)

# to have same layout as in our paper
for (i in seq_along(plots)) {
    current_plot <- plots[[i]] +
        scale_fill_manual(labels = c("VISp", "ALM"), values = c("royalblue4", 
                                                                "firebrick")) +
        scale_x_discrete(labels = c("Hsd11b1_Endou", "Tnc"))

    print(current_plot)
}
```

# Optional post-processing of results: Two-stage testing procedure with stageR

`satuRn` returns transcript-level p-values for each of the specified contrasts.
While we have shown that `satuRn` is able to adequately control the false 
discovery rate (FDR) at the transcript level [@Gilis2021], 
[@VandenBerge2017] argued that it is often desirable to control the FDR at 
the gene level. This boosts statistical power and eases downstream
biological interpretation and validation, which typically occur at 
the gene level. 

To this end, [@VandenBerge2017] developed a testing procedure that is 
implemented in the BioConductor R package `stageR`. The procedure consists of 
two stages; a screening stage and a confirmation stage. 

In the screening stage, gene-level FDR-adjusted p-values are computed, 
which aggregate the evidence for differential transcript usage over all 
transcripts within the gene. Only genes with an FDR below the desired 
nominal level are further considered in the second stage. 
In the confirmation stage, transcript-level p-values are adjusted for 
those genes, using a FWER-controlling method on the FDR-adjusted 
significance level.

In its current implementation, `stageR` can only perform stage-wise testing if 
only one contrast is of interest in a DTU setting. An analogous correction for 
the assessment of multiple contrasts for multiple transcripts per gene 
has not yet been implemented. 

Below, we demonstrate how the transcript-level p-values for the first contrast 
as returned by `satuRn` can be post-processed using `stageR`. 
We rely on the `perGeneQValue` function:

```{r, stage-wise testing}
# transcript level p-values from satuRn
pvals <- rowData(sumExp)[["fitDTUResult_Contrast1"]]$empirical_pval

# compute gene level q-values
geneID <- factor(rowData(sumExp)$gene_id)
geneSplit <- split(seq(along = geneID), geneID)
pGene <- sapply(geneSplit, function(i) min(pvals[i]))
pGene[is.na(pGene)] <- 1
theta <- unique(sort(pGene))

# gene-level significance testing
q <- DEXSeq:::perGeneQValueExact(pGene, theta, geneSplit) 
qScreen <- rep(NA_real_, length(pGene))
qScreen <- q[match(pGene, theta)]
qScreen <- pmin(1, qScreen)
names(qScreen) <- names(geneSplit)

# prepare stageR input
tx2gene <- as.data.frame(rowData(sumExp)[c("isoform_id", "gene_id")])
colnames(tx2gene) <- c("transcript", "gene")

pConfirmation <- matrix(matrix(pvals),
    ncol = 1,
    dimnames = list(rownames(tx2gene), "transcript")
)

# create a stageRTx object
stageRObj <- stageR::stageRTx(
    pScreen = qScreen,
    pConfirmation = pConfirmation,
    pScreenAdjusted = TRUE,
    tx2gene = tx2gene
)

# perform the two-stage testing procedure
stageRObj <- stageR::stageWiseAdjustment(
    object = stageRObj,
    method = "dtu",
    alpha = 0.05,
    allowNA = TRUE
)

# retrieves the adjusted p-values from the stageRTx object
padj <- stageR::getAdjustedPValues(stageRObj,
    order = TRUE,
    onlySignificantGenes = FALSE
)
head(padj)
```

# Session

```{r}
sessionInfo()
```

# References