---
title: "Cepo for differential stability analysis of scRNA-seq data"
author:
- name: Hani Jieun Kim
  affiliation: The University of Sydney
  email: hani.kim127@gmail.com
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
vignette: >
  %\VignetteIndexEntry{Cepo method for differential stability analysis of scRNA-seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: 
  BiocStyle::html_document
---

# Introduction
We introduce *Cepo*, a method to determine genes governing cell identity from scRNA-seq data. We propose a biologically motivated metric—differential stability (DS)—to define cell identity. Our motivation is driven by the hypothesis that stable gene expression is a key component of cell identity. This hypothesis implies that genes marking a cell type should be (i) expressed and (ii) stable in its expression relative to other cell types. We translate these criteria into a computational framework where, using predefined cell-type labels, we compute a cell-type-specific score to prioritise genes that are differential stably expressed against other cell types between all cell-type pair comparisons. 

*Cepo* is therefore distinct from most methods for differential analysis (e.g., differential expression) that prioritise differences in the mean abundance between cell types. *Cepo* is able to capture subtle variations in distribution that does not necessarily involve changes in mean. *Cepo* is particularly suitable for large atlas data as it is computationally efficient and fast. Moreover, *Cepo* can perform differential stability analysis for multi-group comparisons in single-cell data.

To access the R code used in the vignettes, type:
```
browseVignettes("Cepo")
```

Questions relating to *Cepo* should be reported as a new issue at *[BugReports](https://github.com/PYangLab/Cepo/issues)*.

To cite *Cepo*, type:
```
citation("Cepo")
```

## Package installation
The development version of *Cepo* can be installed with the following command:
```
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Cepo")
```

# Differential stability analysis using Cepo
The differential stability analysis in *Cepo* aims to investigate differential stability patterns between cells of different cell types. To use *Cepo* one needs data with cell type labels (or cluster labels). If no cell-type labels are provided, cells first need to be clustered and classified in groups via some form of clustering algorithms. *Cepo* can then be applied to identify differentially stable genes between cell types.

## Example data
Load the example dataset, a small and randomly sampled subset of the [Cellbench](https://github.com/LuyiTian/sc_mixology) dataset consisting of 3 cell types 895 cells and 894 genes.
```{r load-example-data, message = FALSE}
library(Cepo)
library(SingleCellExperiment)
data("cellbench", package = "Cepo")
cellbench
cellbench = cellbench[!duplicated(rownames(cellbench)),]
```

Columns of the `colData` indicate the individual id and various metadata for each cell. `colData` contains `celltype` labels, which will be required to run *Cepo*. Differential stability analysis performed on the entire cell type repertoire.
```{r visualize colData}
colData(cellbench)[1:5,]
```

Note that, if cell-type labels are unknown, we would need to cluster cells into groups via some clustering algorithm. In the example dataset, we have 3 cell types, H1975, H2228 and HCC827, all of which are commonly used cell lines of lung adenocarcinomas.
```{r check_cell_types}
unique(cellbench$celltype)
```

## Run Cepo to generate list of cell identity genes

### Main arguments of Cepo
There are two main arguments to *Cepo*:
  1) `exprsMat` is the input data, which should be normalized data, such as counts per million (CPM) or log2-CPM (e.g., `logcounts` as created via `scater::logNormCounts`).
  2) `cellTypes` receives as input a vector of cell-type labels. Note that the cell-type labels should be equal in length and ordered the same as the column names in `exprsMat`.

```{r differential-stability-analysis}
ds_res = Cepo(exprsMat = logcounts(cellbench),
              cellType = cellbench$celltype)
```

The `Cepo` function returns a list of two elements by default. The first element is a `DataFrame` of DS statistics. In this `DataFrame`, each column corresponds to the DS statistics for that celltype across all genes. A higher DS statistic value denotes a gene that is more prioritized as a differentially stable gene in that given cell type. In the output DataFrame, the columns correspond to each cell type and each row correspond to a gene. 

```{r}
ds_res
```

### Filtering 
In many cases, it is beneficial to perform filtering of lowly expressed genes prior to differential analysis. The parameter `exprsPct` specifies the threshold for filtering of lowly expressed genes should be performed. By default, this is set of `NULL`. A value between 0 and 1 should be provided. Whilst there is no set rule to the threshold, we recommend a value between `0.05` and `0.07`, which will keep any genes that are expressed in 5-7% in at least one cell type, for microfluidic-based data.

```{r filtering-lowly-expressed-genes}
ds_res_zprop = Cepo::Cepo(exprsMat = logcounts(cellbench),
                          cellTypes = cellbench$celltype,
                          exprsPct = 0.5)
```

The parameter `logfc` specifies minimum log fold-change in gene expression. A value of `0.2` will keep any genes that show at least `abs(0.2)`  log fold change in gene expression in at least one cell type. By default, this value is `NULL`.

```{r filtering-logfc}
ds_res_logfc = Cepo(exprsMat = logcounts(cellbench),
                    cellTypes = cellbench$celltype,
                    logfc = 1)
```

`Cepo` outputs some useful stats, including the number of genes `nrow` and gene names `rownames`. By checking `nrow`, we can see that as expected with filtering the number of genes included in the `Cepo` run becomes fewer.
```{r}
nrow(ds_res$stats)
nrow(ds_res_zprop$stats)
nrow(ds_res_logfc$stats)
```

### Computing p-values
The second element of the `Cepo` function is a `DataFrame` of p-values associated with the DS statistics. In this `DataFrame`, each column corresponds to the p-values associated with the DS statistics obtained through bootstrap on the cells. By default, the `Cepo` function do not compute these p-values and requires the users to set the `computePvalue` argument to a number of bootstrap runs required (we recommend this to be at least 1000). 

```{r return-p-values}
ds_res_pvalues = Cepo(exprsMat = logcounts(cellbench),
                      cellType = cellbench$celltype, 
                      computePvalue = 100) # we use a value of 100 for demonstration purposes

ds_res_pvalues
```

## Visualizing results
We can visualize the overlap of differential stability genes between cell types.
```{r upset-plot}
library(UpSetR)
res_name = topGenes(object = ds_res, n = 500)
upset(fromList(res_name), nsets = 3)
```

Density plot of two genes from each cell type.
```{r plot-densities}
plotDensities(x = cellbench,
              cepoOutput = ds_res,
              nGenes = 2,
              assay = "logcounts",
              celltypeColumn = "celltype")
```

We can also specify the genes to be plotted.
```{r plot-densities-genes}
plotDensities(x = cellbench,
              cepoOutput = ds_res,
              genes = c("PLTP", "CPT1C", "MEG3", "SYCE1", "MICOS10P3", "HOXB7"),
              assay = "logcounts",
              celltypeColumn = "celltype")
```

# Running Cepo in a pipeline
## Example data
We will load an example dataset, a small, randomly subsampled subset of the human pancreas datasets from the [scMerge paper](https://www.pnas.org/content/116/20/9775) consisting of 3 batches, 2 cell types, 528 cells, and 1358 genes.
```{r}
data("sce_pancreas", package = "Cepo")
sce_pancreas
```

Given the presences of batches, we will visualize the data for any batch effect. Clearly these is separation of the data points by batch. 
```{r fig.height=5, fig.width=5}
library(scater)
sce = sce_pancreas
sce = scater::logNormCounts(sce)
sce = scater::runPCA(sce)

scater::plotPCA(
  sce, 
  colour_by = "cellTypes", 
  shape_by = "batch")
```

## scMerge to remove batch effect
We can run the analysis on batch corrected data. For this, we can implement batch correction methods on the data suing batch correction methods such as scMerge.
```{r}
library(scMerge)
data("segList", package = "scMerge")
head(segList$human$human_scSEG)

corrected <- scMerge(
  sce_combine = sce,
  ctl = segList$human$human_scSEG,
  kmeansK = c(2, 2),
  assay_name = "scMerge",
  cell_type = sce$cellTypes)
```

Let us visualise the corrected data.
```{r fig.height=5, fig.width=5}
corrected = runPCA(corrected,
                            exprs_values = "scMerge")

scater::plotPCA(
  corrected,
  colour_by = "cellTypes",
  shape_by = "batch")
```

```{r}
ds_res = Cepo::Cepo(exprsMat = assay(corrected, "scMerge"),
              cellType = corrected$cellTypes)
```

## Running Cepo by batch
Rather than running *Cepo* on the corrected values, we can run the differential analysis independently on individual batches using the `block` argument. By default, the `block` argument is set to `NULL`, ignoring batch information. If batches are present and the data is not corrected for batch effect, ensure you run the analyses by block. 
```{r}
ds_res_batches = Cepo::Cepo(exprsMat = logcounts(sce),
                            cellTypes = sce$cellTypes,
                            block = sce$batch,
                            minCelltype = 2)
```

Note that the resulting output in a list of `Cepo` class objects where each slot denotes the individual results for the three batches, as well as the averaged results saved as `average`.
```{r}
names(ds_res_batches)
```

We can confirm that the `Cepo` statistics from across batches demonstrate a strong correlation. The clustered correlation heatmap below shows that there is high correlation between the scores of the same cell type across batches.
```{r fig.height=10, fig.width=10}
idx = Reduce(intersect, lapply(ds_res_batches, function(x) names(x$stats[, 1])))

combinedRes = as.data.frame(do.call(cbind, lapply(ds_res_batches, function(x) 
  x$stats[idx,]
)))

batches = rep(names(ds_res_batches), sapply(ds_res_batches, function(x) length(x$stats)))
cty = unlist(lapply(ds_res_batches, function(x) names(x$stats)), use.name = FALSE)
colnames(combinedRes) = gsub("[.]", "_", colnames(combinedRes))
  
annot = data.frame(
  batch = batches,
  celltype = cty
)
rownames(annot) = colnames(combinedRes)

pheatmap::pheatmap(cor(combinedRes),
                   annotation = annot)
```

## Downstream analyses using Cepo genes

### Marker gene identification and visualisation 
One of the useful applications of `Cepo` is to find marker genes or cell identity genes on clustered data. We can visualise the top three marker genes for beta and ductal cells on the PCA.
```{r marker-genes, fig.height=10, fig.width=15}
cepo_genes = Cepo::topGenes(ds_res_batches$average, n = 3)

markersPlot = lapply(cepo_genes, function(x) {
  pp = lapply(x, function(gene) {
  p = scater::plotPCA(
  corrected,
  colour_by = gene,
  shape_by = "cellTypes")
  return(p)
  })
  pp = patchwork::wrap_plots(pp, ncol = 3) + patchwork::plot_layout(guides = "auto")
  return(pp)
})
patchwork::wrap_plots(markersPlot, nrow = 2)
```

### Gene set enrichment analysis
We can also perform a plethora of downstream analyses, from gene set enrichment analyses to deconvolution of bulk RNA-seq, with the cell identity gene scores generated from the `Cepo` package. As an example, we will perform gene set enrichment analysis using the `fgsea` and  `escape` package.
```{r gsea, message = FALSE}
library(escape)
library(fgsea)
hallmark <- getGeneSets(species = "Homo sapiens", 
                           library = "H")
hallmarkList <- geneIds(hallmark)

fgseaRes <- fgsea(pathways = hallmarkList, 
                  stats    = sort(ds_res_batches$average$stats[,"beta"]),
                  minSize = 15,
                  maxSize = 500)

enriched_beta <- -log10(fgseaRes[order(pval), "padj"][[1]])
names(enriched_beta) <- fgseaRes[order(pval), "pathway"][[1]]
```

Note the top 5 enriched pathways for `beta cells`.
```{r}
enriched_beta[1:5]
```

Finally, we can visualise the enrichment using the `plotEnrichment` function from the `fgsea` package.
```{r fig.height=5, fig.width=5}
plotEnrichment(hallmarkList[["HALLMARK_PANCREAS_BETA_CELLS"]],
               sort(ds_res_batches$average$stats[, "beta"])) + labs(title="HALLMARK_PANCREAS_BETA_CELLS")
```


# Running out-of-memory computation with Cepo
To facilitate analysis of high-throughput atlas data consisting of millions of cells, `Cepo` also enables out-of-memory and parallel computation.

The `Cepo` function naturally handles matrices under the `DelayedArray` wrapper. Briefly, `DelayedArray` is a wrapper around many matrix classes in `R`, including `matrix`, `sparseMatrix` and `HDF5Array`. The last of which allows for out-of-memory computation, which means the computation is done outside of RAM. This will inevitably slow down the computational speed, but the major gain in doing this is that we can perform computations on data much larger than what our RAM can store at once.

```{r cepo-delayed}
library(DelayedArray)
library(HDF5Array)
da_matrix = DelayedArray(realize(logcounts(cellbench), "HDF5Array"))
class(da_matrix)
class(seed(da_matrix))

da_output = Cepo(exprsMat = da_matrix, cellType = cellbench$celltype)
```

Even though out-of-memory computation is slow, one way that we can speed up the computation is through parallel processing. This requires some configurations of the `DelayedArray` package via the `setAutoBPPARAM` function. `BiocParallel` package uses the `MulticoreParam` parameter for Linux/Mac and `SnowParam` for Windows.

```{r cepo-parallel}
library(BiocParallel)

BPPARAM = if (.Platform$OS.type == "windows") {
  BiocParallel::SnowParam(workers = 2)
} else {
  BiocParallel::MulticoreParam(workers = 2)
}

DelayedArray::setAutoBPPARAM(BPPARAM = BPPARAM) ## Setting two cores for computation

da_output_parallel = Cepo(exprsMat = da_matrix, cellTypes = cellbench$celltype)

DelayedArray::setAutoBPPARAM(BPPARAM = SerialParam()) ## Revert back to only one core
```

# Session info
```{r sessionInfo}
sessionInfo()
```