---
title: 'An introduction to benchdamic'
author: "Matteo Calgaro"
output:
  BiocStyle::html_document:
    toc: yes
vignette: >
  %\VignetteEncoding{UTF-8}
  \usepackage[utf8]{inputenc}
bibliography: bib_intro.json
csl: genome-biology.csl
---

```{r options, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(warning = FALSE, error = FALSE, message = FALSE)
```

# Introduction

This vignette provides an introductory example on how to work with the analysis framework proposed in [@calgaro2020]. The package is named benchdamic, an acronym for "BENCHmarking of Differential Abundance detection methods for MICrobial data". Not only does the package structure allow users to test a variety of commonly used methods for differential abundance analysis, but it also enables them to set up benchmarks, including custom methods, on their own datasets. The performance of each method is evaluated with respect to i) suitability of distributional assumptions, ii) ability to control false discoveries, iii) concordance of the findings, and iv) enrichment of differentially abundant microbial species in specific conditions. Each step of the assessment is flexible when it comes to the choice of differential abundance methods, their parameters, and input data types. Various graphical outputs lead the users to an informed decision when evaluating the most suitable method to use for their data.

# Installation

The recommended way to install the `benchdamic` package is

```{r, eval=FALSE}
devtools::install_github("mcalgaro93/benchdamic")
```

Then, let's load some packages for basic functions and data.

```{r load_packs}
library(benchdamic)
# Data management
library(phyloseq)
library(plyr)
# Graphics
library(ggplot2)
library(cowplot)
```

# Goodness of Fit

## Data loading

We consider a homogeneous group of samples (e.g., only samples from a specific experimental condition, phenotype, treatment, body site...). In this demonstrative example, datasets are downloaded from the `HMP16SData` Bioconductor package.

```{r dataloading}
data("ps_stool_16S")
ps_stool_16S
```

```{r 16S_stool_download, eval=FALSE}
## 16S HMP data download
library(HMP16SData)
# Extracting V3-V5 16S sequenced regions' count data
ps_stool_16S_raw <- V35() %>%
    subset(select = HMP_BODY_SUBSITE == "Stool" & # Only fecal samples
        RUN_CENTER == "BI" & # Only sequenced at the BI RUN CENTER
        SEX == "Male" & # Only male subjects
        VISITNO == 1 & # Only the first visit
        !duplicated(RSID)) %>% # Duplicated SubjectID removal
    as_phyloseq()
# Remove low depth samples
ps_stool_16S_pruned <- prune_samples(sample_sums(ps_stool_16S_raw) >= 10^3,
    ps_stool_16S_raw)
# Remove features with zero counts
ps_stool_16S_filtered <- filter_taxa(ps_stool_16S_pruned, function(x) {
    sum(x > 0) > 0
}, 1)
# Collapse counts to the genus level
ps_stool_16S <- tax_glom(ps_stool_16S_filtered, taxrank = "GENUS")
```

## GOF structure

As different methods rely on different statistical distributions to perform DA analysis, we assess the goodness of fit (GOF) of the statistical models underlying each method on a 16S dataset. For each model, we evaluate its ability to correctly estimate the average counts and the proportion of zeroes by taxon.
We consider five distributions: (1) the negative binomial (NB) used in edgeR and DESeq2 [@edger; @deseq2], (2) the zero-inflated negative binomial (ZINB) used in ZINB-WaVE [@zinbwave], (3) the truncated Gaussian Hurdle model of MAST [@mast], (4) the zero-inflated Gaussian (ZIG) mixture model of metagenomeSeq [@zig], and (5) the Dirichlet-Multinomial (DM) distribution underlying ALDEx2 [@aldex2]. The relationships between the functions used in this section are explained by the following diagram:

![Goodness of Fit - package structure](GOF_structure.svg)

## Model estimation

### Negative Binomial and Zero-Inflated Negative Binomial Models

For any $\mu \ge 0$ and $\theta > 0$, let $f_{NB}(\cdot;\mu,\theta)$ denote the probability mass function (PMF) of the negative binomial (NB) distribution with mean $\mu$ and inverse dispersion parameter $\theta$, namely:

$$
f_{NB}(y;\mu,\theta) = \frac{\Gamma(y+\theta)}{\Gamma(y+1)\Gamma(\theta)}\left(\frac{\theta}{\mu+\theta} \right)^\theta\left(\frac{\mu}{\mu+\theta} \right)^y, \quad \forall y \in \mathbb{N}
$$

Note that another parametrization of the NB PMF is in terms of the dispersion parameter $\psi = \theta^{-1}$ (although $\theta$ is also sometimes called dispersion parameter in the literature). In both cases, the mean of the NB distribution is $\mu$ and its variance is:

$$
\sigma^2 = \mu + \frac{\mu^2}{\theta} = \mu+\psi\mu^2
$$

In particular, the NB distribution boils down to a Poisson distribution when $\psi=0 \iff \theta=+\infty$.

For any $\pi\in[0,1]$, let $f_{ZINB}(\cdot;\mu,\theta,\pi)$ be the PMF of the ZINB distribution given by:

$$
f_{ZINB}(y;\mu,\theta,\pi) = \pi\delta_0(y)+(1-\pi)f_{NB}(y;\mu,\theta), \quad \forall y\in\mathbb{N}
$$

where $\delta_0(\cdot)$ is the Dirac function. Here, $\pi$ can be interpreted as the probability that a 0 is observed instead of the actual count, resulting in an inflation of zeros compared to the NB distribution, hence the name ZINB. The packages we rely on to estimate these distributions on real count data are edgeR [@edger] and ZINB-WaVE [@zinbwave], but we can easily call the benchdamic functions `fitNB` and `fitZINB`.

### Zero-Inflated Gaussian Model

The raw count for sample $j$ and feature $i$ is denoted by $c_{ij}$. The zero-inflated model is defined for the continuity-corrected logarithm of the raw count data, $y_{ij} = \log_2(c_{ij}+1)$, as a mixture of a point mass at zero $I_{0}(y)$ and a count distribution $f_{count}(y;\mu,\sigma^2) \sim N(\mu,\sigma^2)$. Given mixture parameters $\pi_j$, the density of the ZIG distribution for feature $i$, in sample $j$ with $s_j$ total counts, is:

$$
f_{ZIG}(y_{ij};s_j,\beta,\mu_i,\sigma^2_i) = \pi_j(s_j)\cdot I_{0}(y_{ij})+(1-\pi_j(s_j))\cdot f_{count}(y_{ij};\mu,\sigma^2)
$$

The mean model is specified as:

$$
E(y_{ij})=\pi_{j} + (1-\pi_j)\cdot\left(b_{i0}+\eta_i\log_2\left( \frac{s_j^{\hat{l}}}{N}+1 \right) \right)
$$

In this case, parameter $b_{i0}$ is the intercept of the model, while the term including the logged normalization factor $\log_2\left(\frac{s_j^{\hat{l}}}{N}+1 \right)$ captures feature-specific normalization factors through parameter $\eta_i$. In detail, $s_j^{\hat{l}}$ is the median scaling factor resulting from the Cumulative Sum Scaling (CSS) normalization procedure. $N$ is a constant fixed by default at 1000, but it should be a number close to the scaling factors to be used as a reference; for this reason, a good choice could be the median of the scaling factors.
The mixture parameters $\pi_j(s_j)$ are modeled as a binomial process:

$$
\log\frac{\pi_j}{1-\pi_j} = \beta_0+\beta_1\cdot \log(s_j)
$$

The package we rely on to estimate this distribution on real count data is metagenomeSeq [@zig], but we can easily call the benchdamic function `fitZIG`.

### Truncated Gaussian Hurdle Model

The original field of application of this method was single-cell RNAseq data, where the $y = \log_2(TPM+1)$ expression matrix was modeled as a two-part generalized regression model [@mast]. In microbiome data, that starting point translates to $y_{ij} = \log_2\left(counts_{ij}\cdot\frac{10^6}{libSize_{j}}+1 \right)$ or $\log_2\left(counts_{ij}\cdot\frac{median(libSize)}{libSize_{j}}+1\right)$. The taxon presence rate is modeled using logistic regression and, conditioning on a sample with the taxon, the transformed abundance level is modeled as Gaussian.

Given normalized, possibly thresholded, abundance $y_{ij}$, the rate of presence and the level of abundance for the samples where the taxon is present are modeled conditionally independently for each taxon $i$. Define the indicator $z_{ij}$, indicating whether taxon $i$ is present in sample $j$ (i.e., $z_{ij} = 0$ if $y_{ij} = 0$ and $z_{ij} = 1$ if $y_{ij} > 0$). We fit logistic regression models for the discrete variable $Z$ and a Gaussian linear model for the continuous variable $(Y|Z=1)$ independently, as follows:

$$
logit(Pr(Z_{ij}=1))=X_j\beta_i^D
$$

$$
P(Y_{ij}=y|Z_{ij}=1) \sim N(X_j\beta^C_i,\sigma^2_i)
$$

The package we rely on to estimate this distribution on real count data is MAST [@mast], but we can easily call the benchdamic function `fitHURDLE`.

### Dirichlet-Multinomial Mixture Model

The probability mass function of an $n$-dimensional multinomial sample $y = (y_1,...,y_n)^T$ with library size $libSize = \sum_{i=1}^ny_i$ and parameter $p=(p_1,...,p_n)$ is:

$$
f(y;p)= {libSize\choose y}\prod_{i=1}^np_i^{y_i}
$$

The mean-variance structure of the MN model doesn't allow over-dispersion, which is common in real data. The DM distribution models the probability parameter $p$ in the MN model by a Dirichlet distribution. The probability mass of an $n$-category count vector $y$ over $libSize$ trials under DM with parameter $\alpha=(\alpha_1,...,\alpha_n)$, $\alpha_i>0$, and proportion vector $p \in \Delta_n=\{(p_1,...,p_n):p_i\ge0,\sum_ip_i=1 \}$ is:

$$
f(y|\alpha)={libSize\choose y}\frac{\prod_{i=1}^n(\alpha_i)_{y_i}}{\left(\sum_i\alpha_i\right)_{libSize}}
$$

where $(a)_k$ denotes the ascending factorial $a(a+1)\cdots(a+k-1)$. The mean value for the $i^{th}$ taxon and $j^{th}$ sample of the count matrix is given by $libSize_j\cdot \frac{\alpha_{ij}}{\sum_i \alpha_{ij}}$. The package we rely on to estimate this distribution on real count data is MGLM [@kim2018], but we can easily call the benchdamic function `fitDM`.

## Comparing Estimated and Observed counts

The goodness of fit for several distributions is assessed by comparing estimated and observed values. On one side we can compare, for each taxon, the observed mean abundance with the estimated one, obtaining the Mean Difference (MD). On the other side we can compare the proportion of samples with zero counts for a specific taxon with the estimated probability of observing a zero count, obtaining the Zero Probability Difference (ZPD). To easily compare estimated and observed mean values, the natural logarithm transformation with continuity correction ($\log(counts+1)$) is well suited: it reduces the count range, making the differences more stable.
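Before focusing on the special cases below, here is a minimal sketch of this comparison for the NB model; it assumes `fitNB()` accepts the data through the same `counts` argument as `fitHURDLE()`, shown next, and uses the `prepareObserved()` and `meanDifferences()` helpers introduced in this section:

```{r exampleNB, eval=FALSE}
# Minimal sketch (assumption: fitNB() shares the counts interface of
# fitHURDLE() below). Estimated mean counts and zero probabilities:
example_NB <- fitNB(counts = ps_stool_16S)
# Observed mean log(counts + 1) values and zero proportions per taxon
observed <- prepareObserved(counts = ps_stool_16S)
# MD and ZPD between estimated and observed values
head(meanDifferences(estimated = example_NB, observed = observed))
```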
Except for `fitHURDLE`, which performs a CPM transformation on the counts (or the one with the median library size), and `fitZIG`, which models the $\log_2(counts+1)$ values, the other methods, `fitNB`, `fitZINB`, and `fitDM`, model the $counts$ directly. For these reasons, `fitHURDLE`'s output should not be compared directly to the observed $\log(counts+1)$ mean values as for the other methods. Instead, the logarithm of the observed CPM (or the one with the median library size) should be used.

```{r}
example_HURDLE <- fitHURDLE(
    counts = ps_stool_16S,
    scale = "median"
)
head(example_HURDLE)
```

The values above are the `fitHURDLE` estimated values (the *NA* value in the second row is due to the sparsity of that taxon). The function used internally to prepare the observed counts is `prepareObserved()`, specifying the `scale` parameter if the HURDLE model is considered.

```{r}
observed_hurdle <- prepareObserved(
    counts = ps_stool_16S,
    scale = "median")
head(observed_hurdle)
```

These are different from the non-scaled observed values:

```{r}
head(prepareObserved(counts = ps_stool_16S))
```

The function to compute Mean Differences (MD) and Zero Probability Differences (ZPD) between estimated and observed values is `meanDifferences()` (also used internally).

```{r}
head(meanDifferences(
    estimated = example_HURDLE,
    observed = observed_hurdle
))
```

A wrapper function to simultaneously perform the estimates and the mean differences is `fitModels()`.

```{r fitting}
GOF_stool_16S <- fitModels(
    counts = ps_stool_16S,
    models = c("NB", "ZINB", "DM", "ZIG", "HURDLE"),
    scale_ZIG = c("median", "default"),
    scale_HURDLE = c("median", "default"),
    verbose = FALSE
)
```

Exploiting the internal structure of the `fitModels()` output, we can obtain the Root Mean Squared Error (RMSE) values for the MD values (the lower, the better):

```{r RMSE_MD}
plotRMSE(GOF_stool_16S, difference = "MD", plotIt = FALSE)
```

and for the ZPD values:

```{r RMSE_ZPD}
plotRMSE(GOF_stool_16S, difference = "ZPD", plotIt = FALSE)
```

## Visualization

### Mean Differences

To plot estimated and observed values we use the function `plotMD`, based on `ggplot2`.

```{r plotGOF_MD, fig.width=15, fig.height=4}
plotMD(
    data = GOF_stool_16S,
    difference = "MD",
    split = TRUE
)
```

For the *HURDLE_default* model, a quite different range of mean difference values is displayed because of the excessive 'default' scaling (1 million). It is also possible to plot only a subset of the estimated models:

```{r plotGOF_MD_noHurdleDefault, fig.width=15, fig.height=4}
plotMD(
    data = GOF_stool_16S[1:6],
    difference = "MD",
    split = TRUE
)
```

And the mean differences for their Zero Probability/Proportion:

```{r plotGOF_ZPD, fig.width=15, fig.height=4}
plotMD(
    data = GOF_stool_16S[1:6],
    difference = "ZPD",
    split = TRUE
)
```

MDs and ZPDs are also available in this other output layout:

```{r plotGOF_MD_collapsed, fig.width=10, fig.height=5}
plot_grid(plotMD(data = GOF_stool_16S[1:6], difference = "MD", split = FALSE),
    plotMD(data = GOF_stool_16S[1:6], difference = "ZPD", split = FALSE),
    ncol = 2
)
```

### RMSE

A summary statistic of the overall performance is visible through:

```{r plotGOF_RMSE, fig.width=10,fig.height=7}
plot_grid(plotRMSE(GOF_stool_16S, difference = "MD"),
    plotRMSE(GOF_stool_16S, difference = "ZPD"),
    ncol = 2
)
```

# Type I Error Control

We next evaluate the Type I Error rate control of each differential abundance detection method, i.e., the probability that the statistical test calls a feature DA when it is not.
To do so, we consider mock comparisons on HMP Stool samples in which no true DA is present. Briefly, we randomly assign each sample to one of two experimental groups and perform DA analysis between these groups, repeating the process 10 times (at least 1000 repetitions are suggested). In this setting, the p-values of a perfect test should be uniformly distributed between 0 and 1, and the false positive rate (FPR or observed `α`), which is the observed proportion of significant tests, should match the nominal value (e.g., `α=0.05`).

## TIEC structure

The Type I Error Control (TIEC) analysis is summarized by the following diagram. The main functions are `createMocks()`, which randomly generates the mock groups, `runMocks()`, which relies on the `runDA()` function and performs the DA analysis on the mock comparisons, and `createTIEC()`, which produces the data.frames used for plotting the results.

![TIEC - package structure](TIEC_structure.svg)

Here, the DA framework decided by the user is composed of two main steps:

- normalization;
- differential abundance.

These steps can be performed either directly, using the `norm_*()` and `DA_*()` methods for normalization and differential abundance respectively, or indirectly, using the `setNormalizations()` and `set_*()` functions. This gives the user more flexibility: the instructions for normalization or DA analysis can be set once, at the beginning. If some modifications are needed, the user can re-set the methods or modify the list of instructions directly.

![Normalization and Differential Abundance structures](norm_DA_structure.svg)

## Create mock comparisons

Using the `createMocks()` function we can randomly group the samples, `N = 10` times. A higher `N` is suggested (at least 1000), but in that case a longer running time is required.

```{r createMocks}
set.seed(123)
my_mocks <- createMocks(
    nsamples = phyloseq::nsamples(ps_stool_16S),
    N = 10
) # At least N = 1000 is suggested
```

## Differential abundance

Once the mocks have been generated, we perform DA analysis. Firstly, we add to the `phyloseq` object some scaling factors, such as *TMM* from `edgeR` and *CSS* from `metagenomeSeq`, and some normalization factors, such as *poscounts* from `DESeq2`. This can be done manually, using the `norm_edgeR()`, `norm_DESeq2()`, and `norm_CSS()` methods:

```{r normalizationManual, eval=FALSE}
ps_stool_16S <- norm_edgeR(
    object = ps_stool_16S,
    method = "TMM"
)
ps_stool_16S <- norm_DESeq2(
    object = ps_stool_16S,
    method = "poscounts"
)
ps_stool_16S <- norm_CSS(
    object = ps_stool_16S,
    "median"
)
```

Or, more quickly, using the `setNormalizations()` and `runNormalizations()` methods:

```{r setNormalization}
my_normalizations <- setNormalizations(fun = c("norm_edgeR", "norm_DESeq2",
    "norm_CSS"), method = c("TMM", "poscounts", "median"))
ps_stool_16S <- runNormalizations(normalization_list = my_normalizations,
    object = ps_stool_16S, verbose = TRUE)
```

We also compute the zero-inflated negative binomial weights using the `weights_ZINB` function.

```{r weights}
zinbweights <- weights_ZINB(
    object = ps_stool_16S,
    K = 0,
    design = "~ 1"
)
```

For each row of the `my_mocks` data frame we run a set of DA methods.
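The mock groupings can be quickly inspected. A minimal sketch, assuming `createMocks()` returns one row of group labels per mock comparison (as implied by its row-wise use in `runMocks()`):

```{r inspectMocks, eval=FALSE}
# Assumed layout: one row per mock comparison, one column per sample
dim(my_mocks)
# Group labels randomly assigned to the first samples of comparison 1
my_mocks[1, 1:5]
```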
In this demonstrative example we use:

- `edgeR` with *TMM* scaling factors;
- `edgeR` with *TMM* scaling factors and *ZINB* weights;
- `DESeq2` with *poscounts* normalization factors;
- `DESeq2` with *poscounts* normalization factors and *ZINB* weights;
- `limma-voom` with *TMM* scaling factors;
- `limma-voom` with *TMM* scaling factors and *ZINB* weights;
- `limma-voom` with *CSS* scaling factors.

```{r set_Methods}
my_edgeR <- set_edgeR(
    pseudo_count = FALSE,
    group_name = "group",
    design = ~ group,
    robust = FALSE,
    coef = 2,
    norm = "TMM",
    weights_logical = c(TRUE, FALSE),
    expand = TRUE
)
my_DESeq2 <- set_DESeq2(
    pseudo_count = FALSE,
    design = ~ group,
    contrast = c("group", "grp2", "grp1"),
    norm = "poscounts",
    weights_logical = c(TRUE, FALSE),
    alpha = 0.05,
    expand = TRUE
)
my_limma <- set_limma(
    pseudo_count = FALSE,
    design = ~ group,
    coef = 2,
    norm = c("TMM", "TMM", "CSSmedian"),
    weights_logical = c(FALSE, TRUE, FALSE),
    expand = FALSE)
my_methods <- c(my_edgeR, my_DESeq2, my_limma)
```

```{r runMocks}
# Random grouping each time
Stool_16S_mockDA <- runMocks(mocks = my_mocks,
    method_list = my_methods,
    object = ps_stool_16S,
    weights = zinbweights,
    verbose = FALSE)
```

The structure of the output in this example is the following:

- *Comparison1* to *Comparison10* on the first level, each containing:
    - *Method1* to *Method7* output lists on the second level:
        - `pValMat`, which contains the matrix of raw p-values and adjusted p-values in the *rawP* and *adjP* columns respectively;
        - `statInfo`, which contains the matrix of summary statistics for each feature, such as the *logFC*, standard errors, test statistics, and so on;
        - `dispEsts`, which contains the dispersion estimates for methods like *edgeR* and *DESeq2*;
        - `name`, which contains the complete name of the used method.

Be sure to create a similar result structure (respecting the names and types of the objects) if a custom method is added: the `dispEsts` vector is optional, while the matrix `pValMat` and the character `name` are mandatory for the "Type I Error Control" results' visualization (see the example below for more information). The `statInfo` matrix is also very important for results visualization; however, it will be more useful later, when the direction of the differential abundance is investigated.

```{r customExample, eval=FALSE}
DA_yourMethod <- function(object, parameters) # others
{
    ### your method code ###

    ### extract important statistics ###
    vector_of_pval <- NA # contains the p-values
    vector_of_adjusted_pval <- NA # contains the adjusted p-values
    name_of_your_features <- NA # contains the OTU, or ASV, or other feature
                                # names. Usually extracted from the rownames
                                # of the count data
    vector_of_logFC <- NA # contains the logFCs
    vector_of_statistics <- NA # contains other statistics

    ### prepare the output ###
    pValMat <- data.frame("rawP" = vector_of_pval,
        "adjP" = vector_of_adjusted_pval)
    statInfo <- data.frame("logFC" = vector_of_logFC,
        "statistics" = vector_of_statistics)
    name <- "write.here.the.name"
    # Be sure that your method hasn't changed the order of the features. If it
    # happens, you'll need to re-establish the original order.
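    # Hypothetical sketch (not a benchdamic helper): if the order changed,
    # match the results back to the original feature order before returning,
    # e.g., with original_feature_order = rownames of the input count data.
    # ord <- match(original_feature_order, name_of_your_features)
    # pValMat <- pValMat[ord, , drop = FALSE]
    # statInfo <- statInfo[ord, , drop = FALSE]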
    rownames(pValMat) <- rownames(statInfo) <- name_of_your_features

    # Return the output as a list
    return(list("pValMat" = pValMat, "statInfo" = statInfo, "name" = name))
} # END - function: DA_yourMethod
```

Once the custom method is set, it is very simple to add it to the framework:

- direct way - just run the method using the function `DA_yourMethod()`;
- indirect way - create a list containing one or more instances of the custom method with the desired combinations of parameters and add it to the `my_methods` list.

```{r customExampleInstances, eval=FALSE}
my_custom_method <- list(
    customMethod.1 = list(method = "DA_yourMethod", parameters),
    customMethod.2 = list(method = "DA_yourMethod", parameters)
)
```

In this case, the 'method' field, containing the name of the method to call, is mandatory. Instead, the argument containing the data object is not needed, in order to keep the `my_custom_method` object more lightweight. To run the methods:

```{r customExampleRun, eval=FALSE}
# Add the custom method instances to the others
my_methods <- c(my_edgeR, my_DESeq2, my_limma, my_custom_method)
# Run all the methods on a specific data object...
runDA(method_list = my_methods, object = dataObject)
# ... Or on the mock datasets to investigate TIEC
runMocks(mocks = mock_df, method_list = my_methods, object = dataObject)
```

## Visualization

Since the visualization relies on the `ggplot2` package, we first run the `createTIEC()` function of the `benchdamic` package, in order to produce a list of 4 `data.frame`s:

1. `df_pval` is a `data.frame` with 3 columns and *number_of_features x methods x comparisons* rows. The three columns are called *Comparison*, *pval*, and *method*;
2. `df_FPR` is a `data.frame` with 5 columns and *methods x comparisons* rows. For each pair of method and comparison, the proportions of false positives, considering 3 thresholds (0.01, 0.05, 0.1), are reported;
3. `df_qq` contains the coordinates to draw the QQ-plot comparing the mean observed p-value distribution across comparisons with the theoretical uniform distribution. Indeed, the observed p-values should follow a uniform distribution under the null hypothesis of no differentially abundant features;
4. `df_KS` is a `data.frame` with 5 columns and *methods x comparisons* rows. For each pair of method and comparison, the Kolmogorov-Smirnov test statistics and p-values are reported in the *KS* and *KS_pval* columns respectively.

```{r createTIEC, warning=FALSE}
TIEC_summary <- createTIEC(Stool_16S_mockDA)
```

### False Positive Rate

The false positive rate (FPR or observed *α*), which is the observed proportion of significant tests, should match the nominal value because all the findings are false positives by construction. In this example, `edgeR.TMM`, `edgeR.TMM.weighted`, and `limma.TMM.weighted` appear to lie above all the thresholds, while the `DESeq2.poscounts`, `DESeq2.poscounts.weighted`, `limma.CSSmedian`, and `limma.TMM` methods are close to each nominal threshold.

```{r FPRplot}
cols <- createColors(variable = levels(TIEC_summary$df_pval$Method))
plotFPR(df_FPR = TIEC_summary$df_FPR, cols = cols)
```

### QQ-Plot

The p-value distribution under the null hypothesis is uniform. This is qualitatively summarized in the QQ-plot below, where the bisector represents a perfect correspondence between observed and theoretical quantiles of p-values. For each theoretical quantile, the corresponding observed quantile is obtained by averaging the observed p-values' quantiles from all 10 mock datasets.
The plotting area is zoomed in to clearly show the region between 0 and 0.1. Methods above the bisector show a conservative behavior, while methods below the bisector show a liberal one. The starting point is determined by the total number of features. In our example, the starting point for the theoretical p-values is computed as 1 divided by 71, rounded to the second digit.

```{r QQplot}
plotQQ(df_QQ = TIEC_summary$df_QQ, zoom = c(0, 0.1), cols = cols)
```

### Kolmogorov-Smirnov test

Departure from uniformity is evaluated through the Kolmogorov-Smirnov test, which is reported for each method across all mock datasets using the `plotKS` function.

```{r KSplot}
plotKS(df_KS = TIEC_summary$df_KS, cols = cols)
```

# Concordance

To measure the ability of each method to produce replicable results in independent data, and the amount of concordance of each method's results on two random subsets of the same dataset, we perform the Between Method Concordance (BMC) and the Within Method Concordance (WMC) analyses. We consider the samples from the supragingival and subgingival plaques of 30 subjects.

```{r dataloading_concordance}
data("ps_plaque_16S")
```

```{r 16S_plaque_download, eval=FALSE}
ps_plaque_16S_raw <- V35() %>% # Extracting V3-V5 16S sequenced regions' count data
    subset(select = HMP_BODY_SUBSITE %in% c("Supragingival Plaque",
        "Subgingival Plaque") & # Only gingival plaque samples
        RUN_CENTER == "WUGC" & # Only sequenced at the WUGC RUN CENTER
        SEX == "Male" & # Only male subjects
        VISITNO == 1) %>% # Only the first visit
    as_phyloseq()
# Only paired samples
paired <- names(which(table(sample_data(ps_plaque_16S_raw)[, "RSID"]) == 2))
ps_plaque_16S_paired <- subset_samples(ps_plaque_16S_raw, RSID %in% paired)
# Remove low depth samples
ps_plaque_16S_pruned <- prune_samples(sample_sums(ps_plaque_16S_paired) >= 10^3,
    ps_plaque_16S_paired)
# Remove features with zero counts
ps_plaque_16S_filtered <- filter_taxa(ps_plaque_16S_pruned,
    function(x) sum(x > 0) > 0, 1)
# Collapse counts to the genus level
ps_plaque_16S <- tax_glom(ps_plaque_16S_filtered, taxrank = "GENUS")
```

## Concordance structure

Several functions are involved in the concordance analysis. First of all, the normalization methods need to be set up using the `setNormalizations()` function. The differential abundance framework is set up using the `set_edgeR()`, `set_DESeq2()`, `set_limma()`, and related functions. Once all these instructions are set up, the user can call `runSplits()`, which relies on the `runNormalizations()` and `runDA()` functions. Finally, the `createConcordance()` and `plotConcordance()` functions are used to compute and plot the results.

![Concordance structure](concordance_structure.svg)

## Split datasets

Using the `createSplits()` function, the dataset is randomly divided in half. In this particular demonstrative dataset, samples are paired: 1 sample for the supragingival plaque and 1 sample for the subgingival plaque are considered for each subject. The *paired* parameter is passed to the method and contains the name of the variable describing the subject IDs. In this specific case, the two groups of samples are balanced between conditions, reflecting the starting dataset. However, if the starting dataset had been unbalanced, the *balanced* option would have allowed the user to choose whether to keep the two splits unbalanced or to balance them.
```{r createSplits}
set.seed(123)
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- factor(sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE)
my_splits <- createSplits(
    object = ps_plaque_16S,
    varName = "HMP_BODY_SUBSITE",
    paired = "RSID",
    balanced = TRUE,
    N = 10
) # At least 100 is suggested
```

The structure produced by the `createSplits()` function consists of a list of two matrices: *Subset1* and *Subset2*. Each matrix contains the randomly chosen sample IDs. The number of rows of both matrices is equal to the number of comparisons/splits.

## Differential abundance

For some of the methods implemented in this package it is possible to perform differential abundance testing for repeated measures experimental designs (e.g., by adding the subject ID to the model formula of `DESeq2`). However, for the sake of simplicity, we consider the samples as independent between body sub-sites.

Once again, the `set_*()` methods can be exploited to set the differential abundance methods to use. For a faster demonstration, differential abundance methods without weighting are used:

```{r set_Methods_noweights}
my_edgeR_noWeights <- set_edgeR(group_name = "HMP_BODY_SUBSITE",
    design = ~ HMP_BODY_SUBSITE, coef = 2, norm = "TMM")
my_DESeq2_noWeights <- set_DESeq2(contrast = c("HMP_BODY_SUBSITE",
    "Supragingival Plaque", "Subgingival Plaque"),
    design = ~ HMP_BODY_SUBSITE, norm = "poscounts")
my_limma_noWeights <- set_limma(design = ~ HMP_BODY_SUBSITE, coef = 2,
    norm = c("TMM", "CSSmedian"))
my_methods_noWeights <- c(my_edgeR_noWeights, my_DESeq2_noWeights,
    my_limma_noWeights)
```

Similarly, the `setNormalizations()` function can be used to set the normalization methods. In this case, it has already been set up for the TIEC analysis.

```{r info_normalizations}
str(my_normalizations)
```

The `runSplits()` function generates the subsets and performs DA analysis:

```{r runSplits}
Plaque_16S_splitsDA <- runSplits(split_list = my_splits,
    method_list = my_methods_noWeights,
    normalization_list = my_normalizations,
    object = ps_plaque_16S, verbose = FALSE)
```

The structure of the output in this example is the following:

- *Subset1* and *Subset2* on the first level, each containing:
    - *Comparison1* to *Comparison10* output lists on the second level:
        - results of 4 methods on the third level: `edgeR` with *TMM* scaling factors, `DESeq2` with *poscounts* normalization factors, `limma-voom` with *TMM* scaling factors, and `limma-voom` with *CSS* scaling factors. They are organized as already described in the "Type I Error Control" chapter:
            - `pValMat`, which contains the matrix of raw p-values and adjusted p-values in the *rawP* and *adjP* columns respectively;
            - `statInfo`, which contains the matrix of summary statistics for each feature, such as the *logFC*, standard errors, test statistics, and so on;
            - `dispEsts`, which contains the dispersion estimates for methods like *edgeR* and *DESeq2*;
            - `name`, which contains the complete name of the used method.

For each pair of methods the concordance is computed by the `createConcordance()` function.
It produces a long format `data.frame` object with several columns:

- *comparison*, which indicates the comparison number;
- *n_features*, which indicates the total number of taxa in the comparison dataset;
- name of *method1*;
- name of *method2*;
- *rank*;
- *concordance*, which is defined as the cardinality of the intersection of the top *rank* elements of each list, divided by *rank*, i.e., $\frac{|L_{1:rank} \cap M_{1:rank}|}{rank}$, where *L* and *M* represent the lists of p-values of *method1* and *method2* respectively.

```{r createConcordance}
concordance <- createConcordance(
    object = Plaque_16S_splitsDA,
    slot = "pValMat",
    colName = "rawP",
    type = "pvalue"
)
head(concordance)
```

The `createConcordance()` method is very flexible. In the example below, the concordances are built using the log fold changes instead of the p-values. To do so, it is necessary to know the column names generated by each differential abundance method in the `statInfo` matrix. Firstly, inspect the method order:

```{r getNames}
names(Plaque_16S_splitsDA$Subset1$Comparison1)
```

Then look at the column names:

```{r logFC_names}
names(Plaque_16S_splitsDA$Subset1$Comparison1$edgeR.TMM$statInfo)
names(Plaque_16S_splitsDA$Subset1$Comparison1$DESeq2.poscounts$statInfo)
names(Plaque_16S_splitsDA$Subset1$Comparison1$limma.CSSmedian$statInfo)
names(Plaque_16S_splitsDA$Subset1$Comparison1$limma.TMM$statInfo)
```

All methods, except for `DESeq2`, contain the log fold change values in the *logFC* column of the `statInfo` matrix. Knowing this, the concordance `data.frame` can be built using:

```{r alternativeConcordance, eval=FALSE}
concordance_alternative <- createConcordance(
    object = Plaque_16S_splitsDA,
    slot = "statInfo",
    colName = c("logFC", "log2FoldChange", "logFC", "logFC"),
    type = "logfc"
)
```

## Visualization

Starting from the table of concordances, the `plotConcordance()` function can produce 2 graphical results:

- the dendrogram of methods, clustered by the area over the concordance bisector, in the `concordanceDendrogram` slot;
- the heatmap of the between and within method concordances in the `concordanceHeatmap` slot.

For each tile of the symmetric heatmap, which corresponds to a pair of methods, the concordance from rank 1 to a threshold rank is drawn. The area between the curve and the bisector is colored to highlight concordant methods (blue) and non-concordant ones (red). The two graphical results should be drawn together for the best experience.

```{r plotConcordance}
pC <- plotConcordance(concordance = concordance, threshold = 30)
cowplot::plot_grid(plotlist = pC, ncol = 2, align = "h", axis = "tb",
    rel_widths = c(1, 3))
```

The WMC and BMC from rank 1 to rank 30 are reported in the plot above. 69 would be the maximum rank, because it is the number of taxa for which all methods have been able to calculate p-values (in all comparisons); however, a custom threshold of 30 was supplied. It is common for WMC values (in red rectangles) to be lower than BMC ones. Indeed, BMC is computed between different methods on the same data, while WMC is computed for a single method, run on different datasets (some taxa are dataset-specific). The `limma.CSSmedian` and `limma.TMM` methods show the highest BMC values. Regarding the WMC, `edgeR.TMM` has the lowest values, while the other methods have comparable ones.

# Enrichment analysis

While mock comparisons and random splits allow the evaluation of type I error and concordance, these analyses do not assess the correctness of the discoveries.
Even the method with the highest WMC could nonetheless consistently identify false positive DA taxa. While the lack of ground truth makes it challenging to assess the validity of DA results in real data, enrichment analysis can provide an alternative solution to rank methods in terms of their ability to identify, as significant, taxa that are known to be differentially abundant between two groups.

## *A priori* knowledge

Here, we leverage the peculiar environment of the gingival site: the supragingival biofilm is directly exposed to the open atmosphere of the oral cavity, favoring the growth of aerobic species. In the subgingival biofilm, however, the atmospheric conditions gradually become strictly anaerobic, favoring the growth of anaerobic species [@plaque_dynamics]. From the comparison of the two sites, we thus expect to find an abundance of aerobic microbes in the supragingival plaque and of anaerobic bacteria in the subgingival plaque. DA analysis should reflect this difference by finding an enrichment of aerobic (anaerobic) bacteria among the DA taxa with a positive (negative) log-fold-change.

Firstly, the microbial metabolism information is necessary. These data come from the GitHub repository of the [@cigarettes] paper, but they can be loaded using `data("microbial_metabolism")`.

```{r priorKnowledge}
data("microbial_metabolism")
head(microbial_metabolism)
```

The microbial genus and its type of metabolism are specified in the first and second columns respectively. The next chunk of code can be used to match each taxon of the phyloseq object to its type of metabolism.

```{r exampleOfIntegration}
# Extract genera from the phyloseq tax_table slot
genera <- tax_table(ps_plaque_16S)[, "GENUS"]
# Genera as rownames of microbial_metabolism data.frame
rownames(microbial_metabolism) <- microbial_metabolism$Genus
# Match OTUs to their metabolism
priorInfo <- data.frame(genera,
    "Type" = microbial_metabolism[genera, "Type"])
unknown_metabolism <- is.na(priorInfo$Type)
priorInfo[unknown_metabolism, "Type"] <- "Unknown"
# Relabel 'F Anaerobic' to 'F_Anaerobic' to remove the space
priorInfo$Type <- factor(priorInfo$Type,
    levels = c("Aerobic", "Anaerobic", "F Anaerobic", "Unknown"),
    labels = c("Aerobic", "Anaerobic", "F_Anaerobic", "Unknown"))
# Add a more informative names column
priorInfo[, "newNames"] <- paste0(rownames(priorInfo), "|",
    priorInfo[, "GENUS"])
```

## Differential abundance

Prepare some normalization/scaling factors.

```{r setNormalizations_enrichment}
none_normalization <- setNormalizations(fun = "norm_edgeR", method = "none")
my_normalizations_enrichment <- c(my_normalizations, none_normalization)
ps_plaque_16S <- runNormalizations(normalization_list = my_normalizations_enrichment,
    object = ps_plaque_16S, verbose = FALSE)
```

Differently from the Type I Error Control and Concordance analyses, the enrichment analysis relies on a single `phyloseq` object. For this reason, many methods can be assessed without computational trade-offs.
```{r set_Methods_enrichment}
my_metagenomeSeq <- set_metagenomeSeq(design = ~ HMP_BODY_SUBSITE, coef = 2,
    norm = "CSSmedian")
my_ALDEx2 <- set_ALDEx2(conditions = "HMP_BODY_SUBSITE", test = "t",
    norm = "none")
my_corncob <- set_corncob(formula = ~ HMP_BODY_SUBSITE,
    phi.formula = ~ HMP_BODY_SUBSITE,
    formula_null = ~ 1,
    phi.formula_null = ~ HMP_BODY_SUBSITE,
    test = "Wald", coefficient = "HMP_BODY_SUBSITESupragingival Plaque",
    norm = "none")
my_MAST <- set_MAST(rescale = "median", design = ~ HMP_BODY_SUBSITE,
    coefficient = "HMP_BODY_SUBSITESupragingival Plaque",
    norm = "none")
my_Seurat <- set_Seurat(test.use = "wilcox",
    contrast = c("HMP_BODY_SUBSITE", "Supragingival Plaque",
        "Subgingival Plaque"),
    norm = "none")
my_methods_enrichment <- c(
    my_methods_noWeights,
    my_metagenomeSeq,
    my_ALDEx2,
    my_corncob,
    my_MAST,
    my_Seurat
)
```

```{r runDA_enrichment, message=FALSE}
# Convert to factor
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- factor(sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE)
# Reference level = "Subgingival Plaque"
sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE <- relevel(
    x = sample_data(ps_plaque_16S)$HMP_BODY_SUBSITE,
    ref = "Subgingival Plaque"
)
Plaque_16S_DA <- runDA(method_list = my_methods_enrichment,
    object = ps_plaque_16S, weights = zinbweights)
```

## Visualization

The `Plaque_16S_DA` object contains the results for 9 methods. The `createEnrichment()` function can be used to extract p-values, the optional direction of DA (DA vs non-DA, or UP Abundant vs DOWN Abundant), and to add any *a priori* information. In the *direction* argument, which is set to `NULL` by default, the column name of each method's `statInfo` matrix containing the direction (e.g., *logfc*, *logFC*, *logFoldChange*...) must be supplied. Firstly, the order of methods should be investigated:

```{r info_DA}
names(Plaque_16S_DA)
```

Following the methods' order, the *direction* parameter is supplied together with other parameters:

- *threshold_pvalue*, *threshold_logfc*, and *top* (optional), to set differential abundance thresholds;
- *slot*, *colName*, and *type*, which specify where to apply the above thresholds;
- *priorKnowledge*, *enrichmentCol*, and *namesCol*, to add enrichment information to the DA analysis.

```{r createEnrichment}
enrichment <- createEnrichment(
    object = Plaque_16S_DA,
    priorKnowledge = priorInfo,
    enrichmentCol = "Type",
    namesCol = "newNames",
    slot = "pValMat",
    colName = "adjP",
    type = "pvalue",
    direction = c("logFC", # edgeR
        "log2FoldChange", # DESeq2
        "logFC", # limma
        "logFC", # limma
        "HMP_BODY_SUBSITESupragingival Plaque", # metagenomeSeq
        "effect", # ALDEx2
        "Estimate", # corncob
        "logFC", # MAST
        "avg_log2FC"), # Seurat
    threshold_pvalue = 0.1,
    threshold_logfc = 0,
    top = NULL,
    alternative = "greater",
    verbose = TRUE
)
```

The produced *enrichment* object consists of a list with as many elements as the number of tested methods:

- the `data` slot contains information for each feature:
  p-values, adjusted p-values (or other statistics) in the *stats* column, log fold changes (or other statistics, if specified) in the *direction* column, differential abundance information in the *DA* column (according to the thresholds), the variable of interest for the enrichment analysis, and the name of the feature in the *feature* column;
- in the `tables` slot, a maximum of *2 x (levels of the enrichment variable)* contingency tables (*2x2*) are present;
- in the `tests` slot, the list of Fisher exact tests produced by the `fisher.test()` function is saved for each contingency table;
- in the `summaries` slot, the first elements of the contingency tables and the respective p-values are collected for graphical purposes.

Considering the most liberal method of our example, *metagenomeSeq.CSSmedian*, we obtain 8 contingency tables. Both UP Abundant and DOWN Abundant taxa are found, and the enrichment variable has Aerobic, Anaerobic, F_Anaerobic, and Unknown levels. For each level, we can build 2 contingency tables: one for DOWN Abundant vs non.DOWN Abundant features and one for UP Abundant vs non.UP Abundant features. The enrichment is tested using Fisher's exact test. The `plotContingency()` function summarizes all this information.

```{r plotContingency}
plotContingency(enrichment = enrichment,
    levels_to_plot = c("Aerobic", "Anaerobic"),
    method = "metagenomeSeq.CSSmedian")
```

To summarize the enrichment analysis for all the methods simultaneously, the `plotEnrichment()` function can be used. Only the Aerobic and Anaerobic levels are plotted:

```{r plotEnrichment, fig.width=7, fig.height=7}
plotEnrichment(enrichment = enrichment, enrichmentCol = "Type",
    levels_to_plot = c("Aerobic", "Anaerobic"))
```

Since "Subgingival Plaque" is the reference level for each method, the coefficients extracted from the methods refer to the "Supragingival Plaque" class. Five out of nine methods identify, as expected, a statistically significant ($0.001 < p \le 0.05$) amount of DOWN Abundant Anaerobic features in the Supragingival Plaque. Moreover, *metagenomeSeq.CSSmedian* finds some UP Abundant Aerobic genera in the Supragingival Plaque but, together with *edgeR.TMM*, also some Anaerobic genera as UP Abundant, which are probably false positives. *Seurat.none.wilcox*, *ALDEx2.none.iqlr.t*, and *limma.CSSmedian* have instead a more conservative behavior, but they still find some DOWN Abundant genera in the expected direction. Finally, *MAST.none* doesn't find any DA feature.

To investigate the DA features, the `plotMutualFindings()` function can be used. While the *levels_to_plot* argument allows the choice of which levels of the enrichment variable to plot, the *n_methods* argument allows the extraction of only those features which are mutually found as DA by more than 1 method.

```{r plotMutualFindings, fig.width=6, fig.height=6}
plotMutualFindings(enrichment, enrichmentCol = "Type",
    levels_to_plot = c("Aerobic", "Anaerobic"),
    n_methods = 1)
```

In this example, only Anaerobic genera are found as DA by more than 1 method simultaneously. Among them, the *Prevotella*, *Fusobacterium*, *Catonella*, and *Dialister* genera are found as DOWN Abundant in the Supragingival Plaque by the majority of methods. Only the Bacteroides genus is considered UP Abundant in the Supragingival Plaque by *edgeR.TMM*. However, the same feature is found as DOWN Abundant by *metagenomeSeq.CSSmedian*.

## True and False Positives

To evaluate the overall performances, we can study a statistic based on the difference between putative True Positives (TP) and putative False Positives (FP).
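In formula, writing $\#TP_{1:top}$ and $\#FP_{1:top}$ for the numbers of top-ranked features of a method whose direction respectively matches or contradicts the expected one, the quantity evaluated for each method at a given *top* threshold is simply their difference (a plain restatement of the procedure described below):

$$
\text{Positives}(top) = \#TP_{1:top} - \#FP_{1:top}
$$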
To build the matrix to plot, `createPositives()` can be used. In detail, the correctness of the DA features is evaluated by comparing the direction of the top-ranked features to the expected direction supplied by the user in the *TP* and *FP* lists. The procedure is performed for several values of the *top* parameter, in order to observe a trend, if present.

```{r createPositives}
positives <- createPositives(
    object = Plaque_16S_DA,
    priorKnowledge = priorInfo,
    enrichmentCol = "Type",
    namesCol = "newNames",
    slot = "pValMat",
    colName = "rawP",
    type = "pvalue",
    direction = c("logFC", # edgeR
        "log2FoldChange", # DESeq2
        "logFC", # limma
        "logFC", # limma
        "HMP_BODY_SUBSITESupragingival Plaque", # metagenomeSeq
        "effect", # ALDEx2
        "Estimate", # corncob
        "logFC", # MAST
        "avg_log2FC"), # Seurat
    threshold_pvalue = 1,
    threshold_logfc = 0,
    top = seq.int(from = 0, to = 50, by = 5),
    alternative = "greater",
    verbose = FALSE,
    TP = list(c("DOWN Abundant", "Anaerobic"), c("UP Abundant", "Aerobic")),
    FP = list(c("DOWN Abundant", "Aerobic"), c("UP Abundant", "Anaerobic"))
)
head(positives)
```

Finally, the `plotPositives()` function can be used to summarize the methods' performances. Higher values usually represent better performances. In our example, all methods show similar values of the statistic from the top 5 to the top 20 ranked features.

```{r plotPositives}
plotPositives(positives)
```

The *MAST.none.median* and *ALDEx2.none.iqlr.t* methods are clearly very conservative and are located in the lower part of the plot. Very liberal methods, such as *metagenomeSeq.CSSmedian* and *edgeR.TMM*, show instead the highest values. This means that their findings are in line with the *a priori* knowledge supplied by the user; however, we are considering a toy example with only 88 features.

# Session Info

```{r sessionInfo}
sessionInfo()
```

# References