---
title: "User guide to the `dearseq` R package"
author: "Marine Gauthier, Denis Agniel, Boris Hejblum"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{dearseqUserguide} 
  %\VignetteEngine{knitr::rmarkdown} 
  %\VignetteEncoding{UTF-8}
---
  
```{r knitrsetup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = TRUE)
knitr::knit_hooks$set(small.mar = function(before, options, envir) {
    if (before) par(mar = c(0, 0, 0, 0))  # no margin
})
```

![](../man/figures/logo.svg){width=139px}

# Overview of `dearseq`

`dearseq` is an `R` package that performs **D**ifferential **E**xpression 
**A**nalysis of **R**NA-**seq** data, while guaranteeing a *sound control of 
false positives* (Gauthier *et al.*, 2019). It relies on variance component 
score test accounting for data heteroscedasticity through precision weights, 
not unlike `voom` (Law *et al.*, 2014). It can perform either 

 - **gene-wise** differential analysis, or
 - **gene set analysis**
 
In addition, `dearseq` can deal with various and complex experimental designs 
such as:

 - **multiple biological conditions**, 
 - **repeated or longitudinal measurements**

More details are provided in this 
[*bioRxiv* preprint 635714](https://www.biorxiv.org/content/10.1101/635714v1).


# Using `dearseq` for a gene-wise analysis 

## Getting started using `dear_seq()`

2 inputs are required to run `dear_seq()`:

 - The gene expression matrix
 - The data.frame containing the variables of interested to be tested (e.g. 
 the group factor)

Both can be supplied in the formed of a `SummarizedExperiment` object for 
instance.
A third optional input is the covariates that will be adjusted on (such as 
age for instance).


## An example analysis

Tuberculosis (TB) is a disease caused by a bacterium called Mycobacterium tuberculosis. Bacteria typically infect the lungs, but they can also affect other parts of the body. Tuberculosis can remain in a quiescent state in the body. It is called latent tuberculosis infection (LTBI). It is an infection without clinical signs, bacteriological and radiological disease.

*Berry et al.* first identified a whole blood 393 transcript signature for active TB using microarray analysis. Then, *Singhania et al.* found a 373-genes signature of active tuberculosis using RNA-Seq, confirming microarray results, that discriminates active tuberculosis from latently infected and healthy individuals.

We sought to investigate how many of the 373 genes *Singhania et al.* found using edgeR might actually be false positives. As a quick example, we propose a partial re-analysis (see *Gauthier et al.* for the complete analysis) of the *Singhania et al.* dataset:  we proceed to the Differential Expression Analysis (DEA) of the Active TB group against the Control one, omitting LTBI/Control and ActiveTB/LTBI comparisons. We focused on the Berry London dataset.

It results in 54 patients of whom 21 are active TB patients, 21 are latent TB patients and 12 are healthy control particiants.

For more details, check the article from *Berry  et al., 2010* [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3492754/).


### Data preparation

Data from *Singhania et al.* inverstigating active TB are publicly available on [GEO website](https://www.ncbi.nlm.nih.gov/geo/) with GEO access number 'GSE107995'. Thanks to the `GEOquery` package to get the data files from GEO website (see appendix for more details on `GEOquery`)


#### Design data matrix

```{r import, message=FALSE, results='hide'}
library(GEOquery)
GSE107991_metadata <- GEOquery::getGEO("GSE107991", GSEMatrix = FALSE)

get_info <- function(i){
    name <- GEOquery::Meta(GSMList(GSE107991_metadata)[[i]])$source_name_ch1
    name <- gsub("Active_TB", "ActiveTB", name)
    name <- gsub("Test_set", "TestSet", name)
    c(unlist(strsplit(name, split="_")), 
             GEOquery::Meta(GSMList(GSE107991_metadata)[[i]])$title)
}

infos <- vapply(X = 1:length(GSMList(GSE107991_metadata)), 
                FUN = get_info,
                FUN.VALUE = c("a", "b", "c", "d", "e"))
infos_df <- cbind.data.frame("SampleID" = names(GSMList(GSE107991_metadata)), 
                             t(infos), stringsAsFactors=TRUE)
rownames(infos_df) <- names(GEOquery::GSMList(GSE107991_metadata))
colnames(infos_df)[-1] <- c("Cohort", "Location", "Set", "Status", "SampleTitle")
```


#### Gene expression matrix

This matrix contains the gene expression (in cells) for each gene (in rows) of 
each sample (in columns) gathered from RNA-seq measurements. The gene expression 
could already be normalized, or not (e.g. raw counts or pseudo-counts straight 
from the alignment). In the latter case, the gene expression is 
normalized then into log(counts) per million (i.e. *log-cpm*). 

The gene expression matrix (called `London`) is including in the package `dearseq`. 
It was extracted from one of the GEO supplementary files (namely the 
"GSE107991_edgeR_normalized_Berry_London.xlsx" file). 

**NB:** These expression data have already been already normalized using `edgeR`

```{r London-download, message=FALSE, results='hide'}
library(readxl)
GEOquery::getGEOSuppFiles('GSE107991', 
                          filter_regex="edgeR_normalized_Berry_London")
London <- readxl::read_excel("GSE107991/GSE107991_edgeR_normalized_Berry_London.xlsx")
genes <- London$Genes
London <- as.matrix(London[, -c(1:3)])
rownames(London) <- genes
```

We have `nrow(London)` genes in rows and `ncol(London)` samples in columns
 * rownames are Ensembl IDs `ENSGxxxxxxxxxxx` for each gene (saved in `genes`)
 * colnames are the name of each sample `Berry_London_Samplex`
 * each data-cell contains the normalized gene expression
 
The vector `genes` contains all the gene identifiers.

### Identifying differentially expressed genes (DEG) from `dearseq` variance component score test

The main arguments to the `dear_seq()` function are:

 * `exprmat`: a numeric matrix containing the raw RNA-seq counts or preprocessed expressions
 * `covariates`: a numeric matrix containing the model covariates that needs to be adjusted. Usually, its first column is the intercept (full of 1s).
 * `variables2test`: a numeric design matrix containing the variables of interest (with whom the expression association is tested)
 * `which_test`: a character string indicating which method to use to approximate the variance component score test, either "permutation" or "asymptotic".
 * `preprocessed`: a logical flag indicating whether the expression data have already been preprocessed (e.g. log2 transformed). Default is FALSE, in which case y is assumed to contain raw counts and is normalized into log(counts) per million.

We use the permutation test because of the relatively small sample size ($n=33$).

```{r LR_ST_echo, eval=TRUE, echo=TRUE, message=FALSE}
library(dearseq)
library(SummarizedExperiment)
col_data <- data.frame("Status" = infos_df$Status)
col_data$Status <- stats::relevel(col_data$Status, ref="Control")
rownames(col_data) <- infos_df$SampleTitle
se <- SummarizedExperiment(assays = London, 
                            colData = col_data)

res_dearseq <- dearseq::dear_seq(object = se[, se$Status != "LTBI"],
                                 variables2test = "Status",
                                 which_test='asymptotic',
                                 preprocessed=TRUE)
```


`res_dearseq` is a list containing: 

 - the input values for `which_test`, `nperm` and `preprocessed`
 - the gene-wise p-values (raw and adjusted - default is correction for multiple 
 testing with the Benjamini-Hochberg procedure)

**NB:** We excluded the LTBI patients to focus on the comparison between Active 
TB versus Control patients.

# Using `dearseq` for gene-set analysis

Here we will use a subsample of the RNA-seq data from Baduel *et al.* (2016) 
studying Arabidopsis Arenosa physiology, that is included in the `dearseq` 
package. 

We investigates 2 gene sets that were derived in a data-driven manner By Baduel 
*et al.* (2016).


```{r baduel, eval=TRUE, echo=TRUE}
library(dearseq)
library(SummarizedExperiment)
library(BiocSet)
data('baduel_5gs')
se2 <- SummarizedExperiment(assay = log2(expr_norm_corr+1), 
                            colData = design)
genes_non0var_ind <- which(matrixStats::rowVars(expr_norm_corr)!=0)

KAvsTBG <- dearseq::dgsa_seq(object = se2[genes_non0var_ind, ],
                             covariates=c('Vernalized', 'AgeWeeks', 'Vernalized_Population', 'AgeWeeks_Population'),
                             variables2test = 'PopulationKA',
                             genesets=baduel_gmt$genesets[c(3,5)],
                             which_test = 'permutation', which_weights = 'loclin',
                             n_perm=1000, preprocessed = TRUE,
                             verbose = FALSE, parallel_comp = FALSE)
KAvsTBG$pvals

Cold <- dearseq::dgsa_seq(object = se2[genes_non0var_ind, ],
                          covariates = c('AgeWeeks', 'PopulationKA', 'AgeWeeks_Population'),
                          variables2test = c('Vernalized', 'Vernalized_Population'),
                          genesets = baduel_gmt$genesets[c(3,5)],
                          which_test = 'permutation', which_weights = 'loclin',
                          n_perm = 1000, preprocessed = TRUE, 
                          verbose = FALSE, parallel_comp = FALSE)
Cold$pvals
```


# Bibliography

Agniel D & Hejblum BP (2017). Variance component score test for time-course gene set analysis of longitudinal RNA-seq data. *Biostatistics* 18(4):589-604. [DOI: 10.1093/biostatistics/kxx005](https://doi.org/10.1093/biostatistics/kxx005).

Baduel P, Arnold B, Weisman CM, Hunter B & Bomblies K (2016). Habitat-Associated Life History and Stress-Tolerance Variation in Arabidopsis Arenosa. *Plant Physiology*, 171(1):437-51.
[DOI: 10.1104/pp.15.01875](https://doi.org/10.1104/pp.15.01875).

Berry MP, Graham CM, McNab FW, Xu Z, Bloch SAA, Oni T, Wilkinson KA, Banchereau R, Skinner J, Wilkinson RJ, Quinn C, Blankenship D, Dhawan R, Cush JJ, Mejias A, Ramilo O, Kon OM, Pascual V, Banchereau J, Chaussabel D & O’Garra A (2010). An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. *Nature* 466:973–977. [DOI: 10.1038/nature09247](https://doi.org/10.1038/nature09247)

Gauthier M, Agniel D, Thiébaut R & Hejblum BP (2019). dearseq: a
variance component score test for RNA-Seq differential analysis that
effectively controls the false discovery rate. *bioRxiv* 635714. [DOI: 10.1101/635714v1](https://www.biorxiv.org/content/10.1101/635714v1).


Singhania A, Verma R, Graham CM, Lee J, Tran T, Richardson M, Lecine P, Leissner P, Berry MPR, Wilkinson RJ, Kaiser K, Rodrigue M, Woltmann G, Haldar P, O’Garra A, (2018). A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection. _Nature communications_ 9(1):2308. [DOI: 10.1038/s41467-018-04579-w](https://doi.org/10.1038/s41467-018-04579-w)


# Appendix

## `GEOquery` package

In case the data you want to analyze is publicly available through [Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/), you can access it with the `GEOquery` package, that can be installed with the following commands:
```{r dl_GEOquery, eval=FALSE, echo=TRUE}
if (!requireNamespace("GEOquery", quietly = TRUE)) {
    if (!requireNamespace("BiocManager", quietly = TRUE)){
        install.packages("BiocManager")
    }
    BiocManager::install("GEOquery")
}
```


## `readxl`

The `readxl` package allows to easily imports data from Excel format and can be similarly installed with the following commands:
```{r dl_readxl, echo=TRUE, eval=FALSE}
if (!requireNamespace("readxl", quietly = TRUE)) {
    install.packages("readxl")
}
```

More details can be found on [Bioconductor](https://bioconductor.org/packages/release/bioc/html/GEOquery.html) and in Davis S, Meltzer P, (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and Bioconductor _Bioinformatics_ 14:1846-1847.

## Session Info
```{r, echo=FALSE}
utils::sessionInfo()
```