---
title: "Introduction to the scFeatureFilter package"
author: "Angeles Arzalluz-Luque, Guillaume Devailly, Anagha Joshi"
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Introduction to scFeatureFilter}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Preamble
## Package aim
One of the steps in single cell RNA-seq analysis is the filtering of features (i.e. genes or transcripts)
to retain only the most expressed ones. This allows to focus on features less affected by technical
variations (relatively), and often increases the statistical power of the subsequent analyses.

This step is usually done by applying one or more arbitrary thresholds
(i.e. mean TPM > 1, retention of features 
frequently detected across cells, etc.), or by using the data from spike-in RNA.
This package aims to help the threshold decision, especially 
when spike-in controls are not available.

## Method summary
Highly expressed features are less affected by technical variability than lowly expressed ones,
and thus capture biological information more successfully.
We propose to use highly expressed features as a reference to 
to estimate the drowning of biological variation in bins 
of features of decreasing expression. 
In this scenario, it is expected that features from major transcription programs
will be correlated with each other across bins, unless the technical variation 
component dominates the expression results.

One of the package's first steps is to select this reference bin of features 
(or 'top window' of features).
We propose to use the 100 features with the highest mean expression
across cells, and a reasonably low
fraction of drop-outs (zero-expression values) as reference.
Using more features will increase the noise in the biological variation
reflected in this reference bin (as well as increase computation time),
while using less features might 
result in insufficient capture of the biological complexity of the samples.
Subsequent bins of features of decreasing mean expression across cells are
then defined (by default, 1000 features
per bin). Note that the size of the subsequent feature bins matters less
than the size of the reference bin.

For comparison, several negative control bins are defined by randomly shuffling values of the top bin expression matrix,
building a window that is expected to be completely uncorrelated. Then, every feature in each bin is correlated 
to every feature in the reference bin and the control bins to obtain correlation coefficient distributions.
Ultimately, the variation of the distributions of correlation coefficients can be used to inform the threshold decision.

## Installation
The stable version of `scFeatureFilter` can be installed from Bioconductor:
```{r, eval = FALSE}
## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("scFeatureFilter")
```


Once installed, the package is ready to be loaded:
```{r, message=FALSE, warning=FALSE, collapse=TRUE}
library(scFeatureFilter)

library(ggplot2)
library(cowplot) # multipanel figures + nice theme
```


# Integrated usage
A single-function call option is available to the user, in which the entire package functionality is
executed. This will filter the expression matrix and keep only the most informative highly expressed features,
using default options:
```{r, collapse=TRUE}
# example dataset included with the package:
scData_hESC

# filtering of the dataset with a single function call:
sc_feature_filter(scData_hESC)
```


# Detailed usage
## Loading data
*scFeatureFilter* uses an expression matrix (either as a `matrix`, a `data.frame`, a `tibble` or a
`SingleCellExperiment` R object) as input. Support of Bioconductor's expression set objects will arrive shortly.
Note that features should be in rows, while cells should be in columns. 


We recommend providing normalized expression values where at least library size is accounted for, 
such as TPM or FPKM, rather than raw counts.


An example dataset is supplied with the package:
```{r, collapse=TRUE}
scData_hESC
```
*The example dataset was processed by [Mantsoki et al. (2016)](https://www.ncbi.nlm.nih.gov/pubmed/26951854),  data from human Embryonic Stem cells was generated by [Yan et al. (2013)](https://www.ncbi.nlm.nih.gov/pubmed/23934149).*

## Binning data
Firstly, a mean expression and a coefficient of variation column is added to the expression matrix provided
using the `calculate_cvs` function:
```{r, collapse=TRUE}
calculate_cvs(scData_hESC)
```

We can explore the mean - variance relationship of the dataset with the `plot_mean_variance` function:
```{r, collapse=TRUE}
library(magrittr) # to use the pipe %>%

calculate_cvs(scData_hESC) %>%
    plot_mean_variance(colourByBin = FALSE)
```

The top window (or reference bin) is then defined with the `define_top_genes` function, while the rest of bins
are created with the `bin_scdata` function:
```{r, collapse=TRUE}
scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 1000)
```

Then, we can plot the resulting bin definition using the same `plot_mean_variance` function, setting the 
`colourByBin` argument to `TRUE`:
```{r collapse=TRUE}
myPlot <- scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 1000) %>%
    plot_mean_variance(colourByBin = TRUE, density_color = "blue")

myPlot
```

Note that the plotting function outputs are `ggplot2` objects that can be 
further customized. See an example here:
```{r, collapse=TRUE}
myPlot + annotation_logticks(sides = "l")
```

## Correlation of the bins
After binning the data, correlations (by default, Pearson's) can be calculated. Every feature in each window is correlated 
to every other feature in the reference bin (as well as in the randomized control bins) using the 
function `correlate_windows`:
```{r, collapse=TRUE}
corDistrib <- scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 1000) %>%
    correlate_windows(n_random = 3)
```

For visualization of the correlation values, probability distributions can be computed using 
and plotted using the `correlations_to_densities` and `plot_correlations_distributions` functions:
```{r, collapse=TRUE}
corDens <- correlations_to_densities(corDistrib, absolute_cc = TRUE)
plot_correlations_distributions(corDens, facet_ncol = 5) +
    scale_x_continuous(breaks = c(0, 0.5, 1), labels = c("0", "0.5", "1"))
```

The coloured lines indicate the correlation coefficient distributions for each bin compared to 
the reference bin (bin 1 shows the autocorrelation of the reference bin). The thicker blue 
lines with grey area (barely visible) represents the correlation coefficient distributions for
each bin compared to the randomized control bins.


Notice that the correlation distributions shift from high to low values as
the bin number increases,
i.e. as the  mean expression level of the features decreases, 
indicating a higher influence of 
technical variability in the expression results.

*Note: the `absolute_cc = TRUE` option, which is the default, instructs the package to work
with the absolute value of the correlation coefficient. Therefore, highly negatively correlated features
will count as highly correlated. Setting this option as `TRUE` also produces clearer plots, and reduces 
the emphasis on non-symmetrical, near 0, shifts of correlation values that are not interpretable.*

We can quantify the steady decrease of correlation values as feature expression goes down using the
`get_mean_median` function, improving interpretability of the extent of technical variability in the data:
```{r, collapse=TRUE}
metrics <- get_mean_median(corDistrib)
metrics
plot_correlations_distributions(corDens, metrics = metrics, facet_ncol = 5) +
    scale_x_continuous(breaks = c(0, 0.5, 1), labels = c("0", "0.5", "1"))
```

The added dashed lines are the mean values of the distributions. The `plot_metric` function focalized on these means of distributions:
```{r, collapse=TRUE}
plot_metric(metrics, show_ctrl = FALSE, show_threshold = FALSE)
```

The bar represents the mean value of the correlation coefficient distribution of each bin 
against the reference bin, while the dot and whiskers are against the randomized control windows.
In our example dataset, the mean of the correlation distributions decreases steadily, 
and reaches a plateau after bin 11.

## Choosing a threshold
On the previous plot, a background reference can be added as a dashed line, by taking the mean value of the 
correlations of every bin to the randomized control bins:
```{r, collapse=TRUE}
plot_metric(metrics, show_ctrl = TRUE, show_threshold = FALSE)
```

Then, the user can decide to plot only bins for which the mean correlation value against the reference window 
is higher than, for example, twice the background:
```{r, collapse=TRUE}
plot_metric(metrics, show_ctrl = TRUE, show_threshold = TRUE, threshold = 2)
```

To further assist thresholding, the function `determine_bin_cutoff` will return the number of the first bin 
of features that falls below the set threshold:
```{r, collapse=TRUE}
determine_bin_cutoff(metrics, threshold = 2)
```

Finally, using this information, the function `filter_expression_table` can be used to obtain a filtered 
expression matrix from the original data:
```{r, collapse=TRUE}
binned_data <- scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 1000)
metrics <- correlate_windows(binned_data, n_random = 3) %>%
    get_mean_median

filtered_data <- filter_expression_table(
    binned_data,
    bin_cutoff = determine_bin_cutoff(metrics)
)

dim(scData_hESC)
dim(filtered_data)
filtered_data
```

*Note: most of the features are filtered at the `calculate_cvs` step because
hey have a too high fraction of drop-outs or zero-expression.
The maximum fraction of 0 expression tolerated by scFeatureFilter is 75%,
by default.*

## Working with SingleCellExperiment objects

This package accpets `SingleCellExperiment` as inputs for `sc_feature_filter`
and `calculate_cvs` functions. However, it does not retun such objects at the moment.
One can retrieve a filtered `SingleCellExperiment` objects like this:
```{r, message=FALSE, warning=FALSE, collapse=TRUE}
library(SingleCellExperiment)
library(scRNAseq) # example datasets

sce_allen <- ReprocessedAllenData()

# sce_allen is an SingleCellExperiment object
sce_allen

filtered_allen <- sc_feature_filter(sce_allen, sce_assay = "rsem_tpm")
is.matrix(filtered_allen) # filtered_allen is a tibble

sce_filtered_allen <- sce_allen[rownames(filtered_allen), ]
sce_filtered_allen
```

# Parameter values
Several parameters are used in `sc_feature_filter` or in step-by-step functions.
Their default values are set using mouse an human data and
should work for eukaryote species with multi-exonic
genes (metazoans & plants).

## `max_zeros`
The `max_zeros` parameter (default value `0.75`) is used as a first filtering
step.
Features that present zero-expression in a proportion of cells higher than the 
set threshold will be
removed. Zero expression can reflect both an absence of expression in the cell
of origin, or a drop-out event occuring during library preparation.
Ocurrence of these technical
drop-outs can be quite high in scRNAseq protocols. More stringent values for
this parameter could be `0.5` or `0.25`. This parameter will mostly impact the
total number of features, and therefore the number of bins that the data 
is divided into. This filtering step will remove low-expression,
highly affected by noise features before the method is applied.

## `top_window_size`
The `top_window_size` parameter (default to `100`) can have a significant impact
on the rest of the pipeline. It sets the number of features with the highest mean
expression that will be used as a reference bin. If the reference window is too
small (i.e. `10`), it will not capture enough of the biological variation.
On the contrary, a too large top window (i.e. `2000`) might include features
mildly affected by technical variation, diluting the biological component 
of the variation too much to be of any use.
We therefore suggest keeping it at `100` or near that value.

The function `plot_top_window_autocor` can be used to visually explore the mean
auto-correlation value of features in the top bin with themselves depending on
the number of features in the top bin:
```{r, collapse=TRUE}
plot_top_window_autocor(calculate_cvs(scData_hESC))
```


## `other_window_size`
The `other_window_size` parameter (default to `1000`) is the number of 
features included in the bins of features (outside the top window).
It is a more flexible parameter, and can safely be decreased to
`500` or `250`, (which will result in more bins) without significantly 
altering the proposed filtering threshold:
```{r, collapse=TRUE}
metrics_bigBins <- scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 1000) %>%
    correlate_windows(n_random = 3) %>%
    get_mean_median

metrics_smallBins <- scData_hESC %>%
    calculate_cvs %>%
    define_top_genes(window_size = 100) %>%
    bin_scdata(window_size = 500) %>%
    correlate_windows(n_random = 3) %>%
    get_mean_median

plot_grid(
    plot_metric(metrics_bigBins) +
        labs(title = "1000 genes per bin"),
    plot_metric(metrics_smallBins) +
        labs(title = "500 genes per bin")
)
```

## `n_random`
The background correlation values used to determine a threshold are 
calculated by
randomly shuffling the expression values of features in the top window.
`n_random` (default to 3) sets the number of control bins that will
be generated. 
Note that increasing this value will also increase the computation time.
and given that results differ only marginally among the randomised windows,
there is not much need to increase that value.
Results differ only marginally between the different random windows, therefore
there is not much need to generate more than 3 control windows.

## `threshold`
One can decide which bins of features to keep by looking at the
`plot_metric` or `plot_correlations_distributions` plots,
by visually exploring their correlation to the top bin of features.
We also provide an automatic threshold decision method, which keeps only windows
of features whose mean correlation value to the top window is higher than 
`threshold` (default to 2) times the background level.
The background level of correlation is estimated using the randomised control
bins. Depending on the intended downstream analyses, one
can increase the `threshold` value to be more stringent in the feature
selection, keeping less features, but ensuring that they will contain
a sufficiently high amount of biological information.