---
title: "mastR: Markers Automated Screening Tool in R"
author:
  - name: Jinjin Chen
    affiliation:
      - Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia
      - Department of Medical Biology, University of Melbourne, Parkville, VIC 3010, Australia
    email: chen.j@wehi.edu.au
  - name: Ahmed Mohamed
    affiliation:
      - Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia
    email: mohamed.a@wehi.edu.au
  - name: Chin Wee Tan
    affiliation:
      - Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3052, Australia
    email: cwtan@wehi.edu.au
date: "`r format(Sys.time(), '%d %b %Y')`"
output: 
  BiocStyle::html_document:
    toc: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{mastR_Demo}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 5
)
```

# Introduction

------------------------------------------------------------------------

**Why do we need group markers**

In biological and clinical research, the identification of biomarkers specific to a disease, tissue, or cell type is a critical step in advancing our understanding and application of this knowledge.

Specific biomarkers can be identified through various methods, such as transcriptomics, proteomics, or metabolomics, which can provide a global view of the molecular landscape of the system being studied.

To denote these biomarkers, we use the term "group markers" in a generic sense, referring to any molecular feature that is specific to a particular group or condition of interest.

Such markers can provide insight into disease diagnosis, prognosis, and treatment options, and can help to differentiate between diseased and healthy states. Thus the identification of the group markers is crucial in various biological and medical applications, as it allows us to distinguish between different cell types or disease states.

For example, in cancer research, identifying marker genes that are differentially expressed between cancer cells and normal cells can help diagnose and monitor the progression of the disease, as well as identify potential therapeutic targets. Similarly, in developmental biology, identifying marker genes that are specific to certain cell types or stages can help us understand the underlying mechanisms of differentiation and development.

Furthermore, marker genes can also be used in diagnostic assays to detect specific diseases or monitor treatment responses. For instance, the presence of certain marker genes in a patient's blood or tissue sample can indicate the presence or progression of a disease.

**Importance of immune cells signature in tumor micro-environment (TME)**

In this demo, we will be using the example of immune cell signature in the context of the TME as a practical case study to showcase the application of our approach.

TME is made up of a diverse range of cell types (fibroblasts, epithelial cells, endothelial cells, and immune cells) as well as various extracellular components (collagens, growth factors, hormones, cytokines, etc.). Tumor immune micro-environment (TIME) is reported to be highly associated with prognosis and various treatment response to many kinds of cancers.

Recent studies have highlighted the role of immune components in the TME in modulating tumor progression, making them attractive therapeutic targets. These components make up TIME, which is a subset of the TME. Subsequently, it's proved that tumor infiltrating lymphocytes (TILs) play an essential role in tumor progression, metastasis and treatment response.

This drives TILs to be a strong prognostic indicator for better precision therapy of cancer patients. The identification of these TILs are the subject of extensive research interest due to the roles of specific subset of immune cells acting on different tissues types.

Identification of these cell types are based on flow cytometry in the past, which has limited granularity. But now with the advance of single cell RNA-seq and spatial transcriptomics technologies, more detailed and novel cell types or subtypes can be identified. One of the key advantages of scRNA-seq and spatial technologies is the ability to investigate cell types in the TIME and understand cellular heterogeneity in the TME.

To quantify TILs infiltration (in bulk), estimate cellular composition (in bulk), annotate single cell types (in scRNA) or identify sample states/activities, many computational methods like cell deconvolution (CIBERSORTx), marker based annotation (CelliD) and sample scoring (singscore) are developed. But to distinguish between closely related cell types, estimate cell composition and states, a refined signature is required which typically requires manual curation by domain expert. With the increasing large amount of data and cell types, this is gradually getting to be untenable and will require a different approach to automatically screen such signatures.

`mastR` is a software package specifically designed to automatically screen signatures of interest for particular research questions. This automated process saves significant time and effort that would otherwise be required for manual labor.

**What this package does**

**mastR**, **M**arkers **A**utomated **S**creening **T**ool in **R**, is a package to automatically derive signature for specific group of interest in specific tissue.

With `mastR`, users can simply input their expression data containing the groups of interest, along with group labels, to obtain a list of marker genes (signature) for the target group.

While there are many tools available for generating cell type-specific markers, they are primarily designed for scRNA-seq data and often rely on machine learning algorithms to select relevant features for classification. However, these methods may lack robustness and consistency across datasets when compared to using statistical methods like empirical Bayesian in `r BiocStyle::Biocpkg("limma")`. Furthermore, some of these tools may return a signature even when the data does not contain any, leading to potential inaccuracies.

Although differential expression (DE) analysis can also be done on scRNA data, like `Seurat::FindMarkers()`, it's reported that DE analysis done on pseudo-bulk scRNA data is more robust and reliable than directly done on scRNA data.

Thus, `mastR` is designed to generate a more refined list of signature genes from multiple group comparisons based on the results from `r BiocStyle::Biocpkg("edgeR")` and `r BiocStyle::Biocpkg("limma")` DE analysis workflow. The final signature is selected by rank product score test for picking up genes with high ranks in the most of comparisons. The rank can be ordered by any gene statistic generated by `r BiocStyle::Biocpkg("limma")` analysis. Signature can be further refined by keeping the top n DEGs in the specified comparison(s), which can help to improve the discrimination between fairly similar cell types.

Another unique advantage of `mastR` is that it takes into account the background expression of tissue-specificity, which is often ignored by other marker generation tools. Unlike most tools that only consider the genes' contribution to the classification, `mastR` also offers a background expression removal function. This function is designed to remove parts of the marker genes that are highly expressed in specific tissues or cancer cells. These signals from the background, regarded as background expression, can cause potential ambiguity when applying the markers to specific tissues due to background or sample purity issues. This feature makes `mastR` apart and enhances the accuracy and specificity of the generated markers.

Furthermore, `mastR` allows users to build a markers pool before signature screening, which might contain the potential markers of interest to the users. The final signature will be integrated with this pool (by intersection), which can help to constrain the final signature within the interested pathway-related genes or functional gene-sets. People can borrow the published knowledge to build this.

The motivation of this package arises from the importance and necessity of identifying specific genes that are differentially expressed in different groups or tissues, as these genes can serve as biomarkers for diagnosis, prognosis, and therapeutic targeting. However, identifying marker genes can be a challenging and time-consuming task, and the presence of background expression can lead to erroneous results. Our package simplifies and streamlines this process, allowing researchers to focus on their analyses and interpretations without the burden of manual marker gene selection and background expression removal.

This report demonstrates the main functions and applications of `mastR` `r packageVersion("mastR")`.

This report demonstrates the signature screening workflow of NK cells in colorectal cancer (CRC), assessing the results by using in-built visualization functions.

**mastR screen the signature using the following 3 key steps:**

  step 1. build markers pool  
  step 2. identify signature from the pool  
  step 3. refine signature by background expression  
  step 4. visualize signature performance

**Applications**

- score samples
- estimate cellular proportion
- single cell annotation

# Installation

------------------------------------------------------------------------

`mastR` R package can be installed from Bioconductor or [GitHub](https://github.com/DavisLaboratory/mastR).

The most updated version of `mastR` is hosted on GitHub and can be installed using `devtools::install_github()` function provided by [devtools](https://cran.r-project.org/package=devtools).

```{r installation, eval=FALSE}
# if (!requireNamespace("devtools", quietly = TRUE)) {
#   install.packages("devtools")
# }
# if (!requireNamespace("mastR", quietly = TRUE)) {
#   devtools::install_github("DavisLaboratory/mastR")
# }

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("mastR", quietly = TRUE)) {
  BiocManager::install("mastR")
}
packages <- c(
  "BiocStyle",
  "clusterProfiler",
  "ComplexHeatmap",
  "depmap",
  "enrichplot",
  "ggrepel",
  "gridExtra",
  "jsonlite",
  "knitr",
  "rmarkdown",
  "RobustRankAggreg",
  "rvest",
  "singscore",
  "UpSetR"
)
for (i in packages) {
  if (!requireNamespace(i, quietly = TRUE)) {
    install.packages(i)
  }
  if (!requireNamespace(i, quietly = TRUE)) {
    BiocManager::install(i)
  }
}
```

```{r lib, message=FALSE}
library(mastR)
library(edgeR)
library(ggplot2)
library(GSEABase)
```

# Step 1. Build Markers Pool

------------------------------------------------------------------------

The first step is to define the original markers pool this analysis will be based on.

The final signature will only be the intersected genes with this markers pool. The whole gene list of the data will be regarded as the markers pool if no preliminary result or interested genes are provided.

***Note:*** `markers = NULL` won't keep any special genes if they fail the filtration by `r BiocStyle::Biocpkg("edgeR")`.

If users have any preliminary knowledge about the target group type (cell type), they can build the markers pool of interest from the available datasets or build from MSigDB, PanglaoDB or LM7/LM22 using our in-built functions. All genes in the pool will be reserved for DE analysis even even if they are filtered out during the filtration of `r BiocStyle::Biocpkg("edgeR")`.

The standard pool building process involves the following:

1)  generate from sources  
    i)  LM7/22 signature matrix for CIBERSORT  
    ii)  MSigDB  
    iii)  PanglaoDB  
    iv)  Customized gene list  
2)  merge gene-sets together

`mastR` allows the following markers to be conveniently loaded:

- `lm7` and `lm22`: CIBERSORT (LM22) and Tosolini et al. (LM7) gene sets for leukocytes, get more details using `?mastR::lm7` or `?mastR::lm22`.

- `msigdb_gobp_nk`: a `GeneSetCollection` object, containing gene sets with gene-set name matched to 'NATURAL_KILLER' in GO:BP MSigDB v7.4 database. More details `?mastR::msigdb_gobp_nk`.

- `nk_markers`: a combination of CIBERSORT LM7, LM22 and Huntington gene list with 114 genes. ([Cursons et al. 2019](https://cancerimmunolres.aacrjournals.org/content/7/7/1162.long)).

All markers generation functions will return `GeneSet` or `GeneSetCollection` object.

## Generate Markers from Sources

To generate the immune cell signatures, firstly users need to define a pool of markers of interest. `mastR` allows users to build markers pool from multiple resources.

In this demo, we will load some some publicly available example datasets.

### i) Leukocyte gene signature Matrix (LM)

`lm7/lm22` are immune cells signature matrices from CIBERSORT, contains 7 or 22 immune cell types.

Users can use function `get_lm_sig()` to generate immune cells markers from LM7 and/or lm22.

```{r LM markers}
data("lm7", "lm22")

## collect both LM7 and LM22
LM <- get_lm_sig(lm7.pattern = "^NK", lm22.pattern = "NK cells")
LM
```

Function `gsc_plot()` allows visualization of an `r BiocStyle::CRANpkg("UpSetR")` plot across all gene-sets.

```{r gsc plot}
## show upset diagram
gsc_plot(LM)
```

### ii) MSigDB

The Molecular Signatures Database ([**MSigDB**](https://www.gsea-msigdb.org/gsea/msigdb/)) is a database of annotated gene sets, typically used for pathway analysis.

Users can use `get_gsc_sig()` to generate a collection of gene sets for the biological subjects of interest.

In this case, NK cell relevant gene-sets from `msigdb_gobp_nk` are collected.

```{r MSigDB gene sets, warning=FALSE}
data("msigdb_gobp_nk")
MSig <- get_gsc_sig(
  gsc = msigdb_gobp_nk,
  pattern = "NATURAL_KILLER_CELL_MEDIATED"
)
MSig
```

***Note***: `data = "msigdb"` allows searching of MsigDB without loading into the environment. `species` or `version` are optional parameters.

Similarly, using `gsc_plot()` users can visualize the overlapping gene sets.

As the gene-set names are too long, for better visualization the gene-set names are replaced with letters and shown below.

```{r show overlap, warning=FALSE}
## cut gene set name within 11 characters
gsn <- setNames(names(MSig), LETTERS[seq_along(MSig)])
for (i in seq_along(MSig)) {
  setName(MSig[[i]]) <- LETTERS[i]
}

## show upset diagram of collected gene-sets
gsc_plot(MSig)
gsn[c("A", "M", "D")] ## show gene-set names of top 3
```

There are `r length(MSig)` gene-sets in `MSig`, which is too many for visualization. Thus, we use function `merge_markers()` to merge all gene-sets into one `GeneSet` object. (*Note*: Users can directly use the un-merged object for subsequent analyses.)

```{r merge MSigDB}
## merge all gene sets into one
MSig <- merge_markers(MSig)
setName(MSig) <- "MSigDB"
```


### iii) PanglaoDB

[PanglaoDB](https://panglaodb.se/) is a database of single cell RNA sequencing experiments with cell type markers.

Users can use `get_panglao_sig()` to generate markers based on the required organs and cell types.

Functions `list_panglao_organs()` and `list_panglao_types()` show all available organs and cell types in PanglaoDB.

***Note:*** This requires real-time connection to PanglaoDB, thus not run in this demo.

```{r PanglaoDB markers, eval=FALSE}
## show available organs on PanglaoDB
list_panglao_organs()

## show available cell types of interest organ on PanglaoDB
list_panglao_types(organ = "Immune system")

## collect all "NK cells" markers from PanglaoDB website
Panglao <- get_panglao_sig(type = "NK cells")

Panglao
```

### iv) Customized gene list

Customized markers can be imported as a `GeneSet` object.

Here we demonstrate this by loading the in-built `nk_markers` markers dataset and convert it into `GeneSet` object.

```{r nk_markers}
## show what nk_markers looks like:
data("nk_markers")
nk_markers

## convert NK markers into `GeneSet` object
nk_m <- GeneSet(nk_markers$HGNC_Symbol,
  geneIdType = SymbolIdentifier(),
  setName = "NK_markers"
)
```

## Pool Markers from Sources

With multiple lists of markers from different sources, use `merge_markers()` to merge them into one `GeneSet` object.

```{r integrate markers}
gsc <- GeneSetCollection(c(nk_m, LM, MSig)) ## add Panglao if you run it
Markers <- merge_markers(gsc)

## upset plot
gsc_plot(gsc)

Markers
```

The summary of the merged list is saved as `longDescription` in the output.

```{r source table}
## to show the table summary of merged list
head(jsonlite::fromJSON(GSEABase::longDescription(Markers)))
```

Now with a merged markers pool, we refine the signature genes for group specificity and tissue background removal.

# Step 2. Signature Identification for Target Group

------------------------------------------------------------------------

For group specificity, there are 3 main steps:

  a)  differential expression (DE) analysis: filtration, normalization, sample weighting and linear model fit;  
  b)  feature selection: select differentially expressed genes based on rank product score;  
  c)  constrain selected genes within the markers pool.

***Note***: External data is accepted in different formats (e.g. `DGEList`, `eSet`, `matrix`). Input data must be raw counts or log-transformed expression data.

In this demo, we use the imported example data `im_data_6` from [GSE60424](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60424) (Download using `GEOquery::getGEO()`), consisting of immune cells from healthy individuals.

`im_data_6` is a `eSet` object, containing RNA-seq TMM normalized counts data of 6 sorted immune cell types each with 4 samples. More details in `?mastR::im_data_6`.

```{r im_data_6}
data("im_data_6")
im_data_6
```

## a) data processing {#process}

To screen signature for the group specificity, the data needs to be pre-processed.

`process_data()` in `mastR` does the 'End-to-End' differential expression analysis using `r BiocStyle::Biocpkg("edgeR")` and `r BiocStyle::Biocpkg("limma")` pipeline as a single function call.

Processes under the hood:

- Filter data by the given cutoff, genes with low expression will be removed by `r BiocStyle::Biocpkg("edgeR")`.

- If data is raw counts (`normalize = TRUE`), normalize data by 'TMM' and fit it using `limma::voom()`. Otherwise apply `trend` of `r BiocStyle::Biocpkg("limma")` on normalized data.

- Fit linear model.

- Compute gene statistic for differential expression analysis using `limma::treat()`.

`im_data_6` has 6 immune cell types:

```{r type names}
table(im_data_6$`celltype:ch1`)
```

`process_data()` requires an expression matrix and group labels of the samples. This returns a `DGEList` object with processed data.

To keep consistent with `Markers`, use param `gene_id` to convert ENSEMBL IDs of `im_data_6` to SYMBOLs for intersection.

Refer to `help(proc_data)` for optional parameters.

```{r process data}
proc_data <- process_data(
  data = im_data_6,
  group_col = "celltype:ch1",
  target_group = "NK",
  markers = geneIds(Markers),
  gene_id = "ENSEMBL" ## rownames of im_data_6 is ENSEMBL ID
)

attributes(proc_data)
```

For ease of process later in this demo, add the expression matrix `E` fitted by `limma::voom()` as 'voomE' into `proc_data`.

```{r save voom E}
## add voom fitted expression as a new list of proc_data for use
proc_data$voomE <- proc_data$vfit$E
```

***Note:***

`mastR` provides visualization functions to compare the data before and after `process_data()` and assess the quality control (QC) by using `plot_diagnostics()` and `plot_mean_var()`.

To assess the removal of the low quality genes, use `plot_diagnostics()` to show the expression distribution, RLE and MDS plots.


## b) signature selection based on differential expression {#selection}

For selection of group specific signature, pass `proc_data` to `select_sig()` function. Genes with high rank product score in DE results are selected.

`mastR` automatically determines if feature selection is required, the default approach (`feature_selection = "auto"`) performs rank product scoring to select genes. But if the numbers of the result are < 5, no feature selection will be conducted (switch to `feature_selection = "none"`).

```{r select sig}
## get the same result as there's permutation test for rank product
set.seed(123)
sig_ct <- select_sig(proc_data, feature_selection = "auto")

head(sig_ct)
```

***Note:***

`mastR` also implements a feature to further optimize the discriminative power of the signature between fairly similar groups by using params `keep.top` and `keep.group`. More details in `help(select_sig)`.


***Tips:***

All above steps from \@ref(process) to \@ref(selection) for refining group signature can be done using an integrated function `get_degs()`.

It will return a list of a processed data `proc_data` and a `GeneSetCollection` `DEGs` with `UP` and `DOWN` regulated genes.

QC plots can also be shown by setting optional param `plot = TRUE`.


## c) constrain signature within markers pool {#intersect-pool}

To constrain the final signature within the interested gene list, intersect the signature genes `sig_ct` with the markers pool `Markers`. In this demo, only the `UP` regulated genes are kept.

For consistency with `Markers` type, convert ENSEMBL IDs of `im_data_6` to gene SYMBOLs.

```{r markers for cell type, fig.width=8, fig.height=6}
## convert ensembl IDs into symbols to match markers pool
deg_up <- mapIds(
  org.Hs.eg.db::org.Hs.eg.db,
  geneIds(sig_ct[["UP"]]),
  "SYMBOL", "ENSEMBL"
)
deg_up <- na.omit(deg_up)
## markers specific for NK cells
m_ct <- intersect(geneIds(Markers), deg_up)
names(m_ct) <- names(deg_up)[match(m_ct, deg_up)] ## set ensembl ID as names for downstream visualization

head(m_ct)
```

Users can use `pca_matrix_plot()` to assess the group separation.

Looking at the variance of PC1 shown below, it's clear the intersected genes explain more variance in PC1 compared with all `UP` DEGs.

```{r PCA on sig}
## PCA shows clear separation of NK cells
## after intersection
pca_matrix_plot(proc_data,
  features = m_ct,
  group_by = "celltype.ch1",
  slot = "voomE",
  n = 3,
  gene_id = "ENSEMBL"
) +
  patchwork::plot_annotation("Intersected UP DEGs")
## before intersection
pca_matrix_plot(proc_data,
  features = as.vector(deg_up),
  group_by = "celltype.ch1",
  slot = "voomE",
  n = 3,
  gene_id = "ENSEMBL"
) +
  patchwork::plot_annotation("All UP DEGs")
```

***Note***:

Wrapper function `filter_subset_sig()` can complete all processes in Step 2 using a single function. The result would be the same.

`filter_subset_sig()` is helpful when users have multiple datasets (more details in Section \@ref(multi-data)). It can output the final signature after aggregating signature lists from all datasets.


# Step 3. Signature Refinement by Background Expression in Tissue

------------------------------------------------------------------------

Further refinement can be achieved by eliminating genes with strong signal in the cancer cells or tissues, regarded as background expression.

`mastR` utilizes a signal-to-noise ratio (SNR) approach to filter out genes with low SNR values that have low discriminative power between the group of interest and the background. This removes tumor purity as a factor in the identified signature markers.

Here the signal is the expression of the signature in signal data and the noise is the expression of the signature in background data.

Two input datasets are required:

- signal data (`proc_data` from Step 2)
- background data (target context)

***Note***: Both of signal data and background data must be log-normalized.

The signal data should be the result generated by `process_data()` function.

The refinement consists of:

  I) data subset
  II) data filtration
  III) markers removal
  IV) combination with the signature

## I) data subsetting

To remove genes with background expression, subset the samples based on the context. In this demo, we are looking to remove CRC-specific signal.

Here we choose to load in-built `ccle_crc_5` as the background data. The CRC adherent cell lines data is extracted (`bg_mat`).

The signal data is in `voomE` of `proc_data` and NK cells data is extracted (`sig_mat`).

```{r subset customized bg data}
data("ccle_crc_5")

## subset CRC cell lines of bg data
bg_mat <- ccle_crc_5$counts[, ccle_crc_5$samples$cancer == "CRC"]
## subset all NK cells of sig data
sig_mat <- proc_data$voomE[, proc_data$samples$celltype.ch1 == "NK"]
```

***Note:*** 

`ccle_crc_5` is a `DGEList` object contains 5 CRC cell line samples from [CCLE](https://depmap.org/portal/download/). More details for `help(ccle_crc_5)`.

[DepMap CCLE](https://depmap.org/portal/download/) is a large database containing RSEM quantified TPM data of more than 1,000 cancer cell lines.

## II) data filtration

Because SNR is computed using scaled expression across genes, low-expression genes must be removed.

In this case, `sig_mat` from `proc_data` has already been filtered, therefore background data (`bg_mat`) needs to be filtered.

`bg_mat` has only 5 samples, genes with logTPM > 1 within more than 2 samples are kept.

```{r bg data filtration}
keep <- rowSums(bg_mat > 1, na.rm = TRUE) > 2
bg_mat <- bg_mat[keep, ]
```

## III) markers pool refinement

To refine the markers pool, use `remove_bg_exp_mat()` to keep genes from `Markers` with high expression in `sig_mat` and low expression in `bg_mat`.

This ensures the specificity of the markers minimizing the context effect.

Similarly, use `gene_id` to convert gene IDs of input to the same type (SYMBOL). As `sig_mat` uses ENSEMBL IDs, `gene_id` types have to be defined in the order: `sig_mat`, `bg_mat`.

```{r remove bg in customized}
m_ccl <- remove_bg_exp_mat(
  sig_mat = sig_mat,
  bg_mat = bg_mat,
  markers = geneIds(Markers),
  gene_id = c("ENSEMBL", "SYMBOL")
)

head(m_ccl)
```

## use CCLE as background data (optional)

Users can also use [CCLE](https://depmap.org/portal/download/) downloaded by `depmap::depmap_TPM()` for the refinement.

The entire CCLE data is quite large, thus use `ccle_crc_5` to construct a small pseudo-CCLE data `ccle` with similar format.

```{r ccle construction}
ccle <- data.frame(ccle_crc_5$counts,
  gene_name = rownames(ccle_crc_5),
  primary_disease = "CRC"
) |>
  tidyr::pivot_longer(-c(gene_name, primary_disease),
    names_to = "depmap_id",
    values_to = "rna_expression"
  )

ccle
```

Because `ccle` is a long data, subsetting needs to be done before converting it into matrix.

In this demo, we still extract CRC adherent cell lines data from it.

```{r subset sig and bg data}
## subset CRC cell lines of bg data
ccle <- ccle[ccle$primary_disease == "CRC", ]
```

As `remove_bg_exp_mat()` can only accept matrix in wide format, users can use `ccle_2_wide()` to convert `ccle` into wide matrix.

```{r ccle to wide mat}
ccle <- ccle_2_wide(ccle = ccle)

ccle[1:3, 1:3]
```

The same filtration cutoff used in `bg_mat` above is used for `ccle`.

```{r ccle filtration}
keep <- rowSums(ccle > 1, na.rm = TRUE) > 2
ccle <- ccle[keep, ]
```

Use `remove_bg_exp_mat()` to refine the markers pool based on `ccle`. As we use `ccle_crc_5` to construct `ccle`, the result should be the same.

```{r remove bg in ccle}
m_ccl <- remove_bg_exp_mat(
  sig_mat = sig_mat,
  bg_mat = ccle,
  markers = geneIds(Markers),
  gene_id = c("ENSEMBL", "SYMBOL")
)

head(m_ccl)
```

***Note:***

Wrapper function `remove_bg_exp()` can complete Step 3. I) II) III) all using one single function.

Users can use the entire CCLE data from DepMap without loading it by using param `bg_data = 'CCLE'`. More details in `help(remove_bg_exp)`.


## IV) combination with Signature

To get the final signature with high group specificity and low background expression, intersect the result `m_ccl` with the signature `m_ct` we get from Step 2.

```{r final signatures}
sig_NK_CRC <- intersect(m_ct, m_ccl)
head(sig_NK_CRC)
```

Here, if we directly use `m_ct` instead of `Markers` as input `markers` for `remove_bg_exp_mat()` or `remove_bg_exp()`, `m_ccl` will already be the intersection result of the original `m_ccl` and `m_ct`, then we can skip this.


# Step 4. Visualization of Final Results

------------------------------------------------------------------------

To assess how well the refined signature can discriminate our target group from others, `mastR` provides the following visualization functions for different purposes:

- `pca_matrix_plot()`: matrix of PCA plots with top n PCs.
    Show if the target group can be separated from other groups and how well the separation is. Users can also know how much variance has been explained by PC1 with the signature.

- `sig_heatmap()`: signature heatmap across groups and its comparison with un-refined markers pool.  
    Validate if the signature shows clear and differential expression pattern between the target group and others. Show if the pattern is clearer and cleaner after the refinement compared with un-refined markers pool.

- `sig_rankdensity_plot()`: rank density distribution of signature genes across groups.  
    Show if the signature genes are highly expressed within each sample of the target group while lowly expressed in others.

- `sig_boxplot()`: boxplot of signature expression or singscore across groups.  
    Test the overall performance of the whole signature when applying it for scoring, singscore is used to compute the rank score using the whole signature. The boxplot across groups can be used to test if the whole signature is powerful enough to distinguish the target group from others. Median expression of each signature gene can also be plotted.

- `sig_scatter_plot()`: scatter plot of signature scaled expression of the target group vs all other groups respectively.  
    Visualize the correlation of the signature between the target and other groups, the row-scaled gene expression is computed to make the scatter plot. Based on it, the co-expression correlation and signature's specificity can be assessed.

- `sig_gseaplot()`: gene set enrichment analysis (GSEA) result display.  
    Validate the enrichment significance of signature in our target group compared with other groups, using `clusterProfiler::GSEA()` analysis. Users can display `dotplot` or `gseaplot`.


## Heatmap

Use `sig_heatmap()` function to compare the expression pattern between the original markers pool and the final signature in `proc_data`.

- `sigs`: final signature (or list of multiple signatures).
- `markers`: original markers pool (optional).

```{r heatmap, warning=FALSE, fig.width=10, fig.height=7}
sig_heatmap(
  data = proc_data,
  sigs = sig_NK_CRC,
  group_col = "celltype.ch1",
  markers = geneIds(Markers),
  gene_id = "ENSEMBL",
  slot = "voomE",
  scale = "row",
  show_column_den = FALSE,
  show_row_den = FALSE,
  show_column_names = FALSE,
  show_row_names = FALSE
)
```

The expression pattern of the signature is more distinguishable in NK cells.


## Signature Score Boxplot

To see if the final signature can distinguish NK cells from other immune cells, use `sig_boxplot()` function to make a boxplot of NK scores using `sig_NK_CRC`. Scores are calculated by `r BiocStyle::Biocpkg("singscore")` package.

***Note***:

Boxplot of median expression of each gene can be plotted using param `type = "expression"`.

```{r score boxplot, warning=FALSE}
sig_boxplot(
  data = proc_data,
  sigs = sig_NK_CRC,
  group_col = "celltype.ch1",
  target_group = "NK",
  gene_id = "ENSEMBL",
  slot = "voomE"
)
```

It's clear that the signature shows significantly higher score level in NK cells.


## Signature Abundance Scatter Plot

To validate the specificity of the final signature to NK cells, use `sig_scatter_plot()` function to make the scatter plot of the signature z-scored expression in `proc_data`.

If the signature is highly specific to NK cells, most of the signature genes should appear in top left region (scaled expression > 1 in NK, < 1 in others). The genes in top right region should represent the co-expression in both groups.

```{r abundance scatter plot, warning=FALSE, fig.width=8, fig.height=5}
## before refinement
sig_scatter_plot(
  data = proc_data,
  sigs = geneIds(Markers),
  group_col = "celltype.ch1",
  target_group = "NK",
  gene_id = "ENSEMBL",
  slot = "voomE"
) + ggtitle("Before Refinement")

## after refinement
sig_scatter_plot(
  data = proc_data,
  sigs = sig_NK_CRC,
  group_col = "celltype.ch1",
  target_group = "NK",
  gene_id = "ENSEMBL",
  slot = "voomE"
) + ggtitle("After Refinement")
```

Almost all of non-specific genes are filtered by `mastR`.

Further optimization can be performed by removing signature genes present in the bottom right or bottom left region based on the scatter plot.

## Signature GSEA plot

To assess the enrichment significance of the final signature in NK cells, use `sig_gseaplot()` function to display GSEA result. Dotplot can be plotted by using `method = "dotplot"`.

```{r gseaplot, warning=FALSE, message=FALSE, fig.width=10, fig.height=7}
## gseaplot
sig_gseaplot(
  data = proc_data,
  sigs = list(sig = sig_NK_CRC, markers = geneIds(Markers)),
  group_col = "celltype.ch1",
  target_group = "NK",
  gene_id = "ENSEMBL",
  slot = "voomE",
  method = "gseaplot"
)
```

The refined signature is enriched in all comparisons and `mastR` successfully removed the trailing tail from the markers pool.


# Working with Extension Data Input

## Multiple Datasets {#multi-data}

------------------------------------------------------------------------

The standard workflow of `mastR` usage is as described above.

The above is a show case of single dataset, `mastR` can also be applied to multiple datasets.

After the Step 2c (Section \@ref(intersect-pool)), for:

1) single dataset  
    Returns the final result as the signature.

2) multiple datasets  
    After Step 2c, choose one combination method (`union` default) to aggregate signature lists generated from different datasets. The method Robust Rank Aggregation ("RRA") can also be selected in `mastR`, which detects genes that are ranked consistently better than expected under null hypothesis of uncorrelated inputs and assigns a significance score for each gene.

Instead of integrating datasets into one larger dataset and fit it into one linear model, `mastR` refines signature in each dataset independently and the aggregation method can be defined using param `comb` (e.g. `comb = "RRA"`).

The advantage of this is that it avoids over-normalizing or mis-correcting when carrying out data integration. More robust and conserved signature across datasets can be obtained. In some cases, it's better than using "Remove Batch Effect".

Simple combination methods like "union" and "intersect" are also supported (e.g. `comb = union`).

It's recommended to use `union` when only a few genes are refined for each dataset, while `RRA` is better for robust gene selection from large DEG lists, `intersect` is proper for highly overlapping gene lists.

***Note:***

All signature visualization functions in `mastR` support multiple datasets. More details in `help()`.

To demonstrate the usage on multiple datasets, `im_data_6` is used as an example.

```{r refine multiple datasets, eval=TRUE}
## In the demo, we just repeatedly use im_data_6 as a show case
set.seed(123)
m_ct_m <- filter_subset_sig(
  data = list(A = im_data_6, B = im_data_6),
  markers = geneIds(Markers),
  group_col = "celltype:ch1",
  target_group = "NK",
  feature_selection = "auto",
  dir = "UP", ## specify to keep "UP" or "DOWN" regulated genes
  gene_id = "ENSEMBL",
  comb = union,
  summary = FALSE
)
```

The signature `m_ct_m` is the same as `m_ct` as `union` is set as combination method.

```{r union sig, eval=TRUE}
## we will get exactly the same list
## if we choose 'union' or 'intersect' as combination
setequal(m_ct_m, m_ct)
```

By using `comb = "RRA"`, only the genes at the top rank across signature lists are kept. The cutoff of `RRA` can be set with param `s_thres`.

```{r rra sig, eval=TRUE}
## but we will only get the genes appear at top rank across gene lists
## if we choose 'RRA', s_thres is to determine the threshold for ranking score
set.seed(123)
m_ct_m <- filter_subset_sig(
  data = list(A = im_data_6, B = im_data_6),
  markers = geneIds(Markers),
  group_col = "celltype:ch1",
  target_group = "NK",
  feature_selection = "auto",
  dir = "UP", ## specify to keep "UP" or "DOWN" regulated genes
  gene_id = "ENSEMBL",
  comb = "RRA", ## change this to use different strategy, default is "union"
  s_thres = 0.5, ## only work when comb = "RRA", set a threshold for ranking score
  summary = FALSE
)

## fewer signature genes this time
m_ct_m
```

## Single Cell RNA-sequencing Data

------------------------------------------------------------------------

Although `mastR` is designed for bulk RNA-seq data, in order to support usage of abundant scRNA resources, `mastR` provides `pseudo_samples()` function to help convert scRNA-seq data into pseudo-bulk data, which can then be used in `mastR` workflow.

`pseudo_samples()` aggregates cells into pseudo-bulk samples according to the factor(s) provided using param `by`. Then use `filter_subset_sig()` to get the signature from scRNA-seq data.

```{r pseudo-bulk, eval=TRUE}
## create a test scRNA object of 100 genes x 100 cells
counts <- matrix(abs(rpois(10000, 10)), 100)
rownames(counts) <- 1:100
colnames(counts) <- 1:100
meta <- data.frame(
  subset = rep(c("A", "B"), 50),
  level = rep(1:4, each = 25)
)
rownames(meta) <- 1:100
pb <- pseudo_samples(counts, by = meta)

pb <- edgeR::DGEList(counts = pb, group = gsub("\\..*", "", colnames(pb)))

filter_subset_sig(pb, group_col = "group", target_group = "A")

## Seurat or SCE object are also accepted
# scRNA <- SeuratObject::CreateSeuratObject(counts = counts, meta.data = meta)
# pseudo_samples(scRNA, by = c("subset","level"))
```

# Applications {.tabset .tabset-fade .tabset-pills}

Users can apply the signature in multiple ways. In this short demo, we will show 3 simple applications.

***Note:***

These examples are not run in this demo.

Firstly, simulate a scRNA data using `r BiocStyle::Biocpkg("splatter")` package. 

```{r simulation, include=FALSE, eval=TRUE}
if (!requireNamespace("splatter", quietly = TRUE)) {
  # BiocManager::install("splatter")
  stop("Install 'splatter'!")
}

library(splatter)
## set seed for reproduce as there's permutation inside
set.seed(123)

sim_params <- newSplatParams(
  nGenes = 1e3,
  batchCells = 3000,
  group.prob = seq(0.1, 0.4, length.out = 4),
  de.prob = 0.02,
  # de.downProb = 0,  ## only set up-regulated genes for each group
  de.facLoc = 0.5,
  de.facScale = 0.4
)

data_sim <- splatSimulate(sim_params, method = "groups")
```

Markers list for each group can be obtained from the simulated scRNA data.

```{r get markers list, eval=TRUE}
markers_list <- lapply(
  rowData(data_sim)[, paste0("DEFacGroup", 1:4)],
  \(x) rownames(data_sim[x > 1])
)
```

Use `pseudo_samples()` to get pseudo-bulk data and create a `DGEList` object.

```{r pseudo bulking, eval=TRUE}
## aggregate into pseudo-bulk samples
pb <- pseudo_samples(data_sim,
  by = c("Batch", "Group"),
  min.cells = 50, max.cells = 100
)
dge <- edgeR::DGEList(
  counts = pb,
  samples = data.frame(
    group = gsub(".*\\.(.*)_.*", "\\1", colnames(pb)),
    Batch = gsub("(.*)\\..*", "\\1", colnames(pb)),
    sampleID = gsub("(.*)_.*", "\\1", colnames(pb))
  )
)
```

Use `mastR` wrapper function `filter_subset_sig()` to get the signature for each group. In this case, there are no markers pool of interest and no background.

```{r get sigs, eval=TRUE}
set.seed(123)
sig_ls <- lapply(paste0("Group", 1:4), \(x) {
  filter_subset_sig(
    data = dge,
    markers = NULL,
    group_col = "group",
    target_group = x
  )
})
names(sig_ls) <- paste0("Group", 1:4)
```

Show the overlap of the signature lists generated by `mastR` and the markers lists from the simulation.

```{r venn plot, eval=TRUE}
if (!requireNamespace("ggvenn", quietly = TRUE)) {
  # install.packages("ggvenn")
  stop("Install 'ggvenn'!")
}

## venn plot
p <- lapply(1:4, \(i) ggvenn::ggvenn(
  list(
    sig = sig_ls[[i]],
    marker = markers_list[[i]]
  ),
  show_percentage = FALSE
) +
  ggtitle(names(sig_ls)[i]))
patchwork::wrap_plots(p)
```

Use `sig_heatmap()` to compare the expression of signatures and markers lists. 

```{r app heatmap, eval=TRUE}
## heatmap
sig_heatmap(
  edgeR::cpm(dge, log = TRUE),
  sigs = c(sig_ls, list("TP53")), ## add a real gene to pass gene check
  group_col = dge$samples$group,
  scale = "row",
  show_column_den = FALSE,
  show_row_den = FALSE,
  cluster_column_slices = FALSE,
  cluster_row_slices = FALSE
)
```

Randomly aggregate scRNA data to generate pseudo bulk data for scoring and deconvolution.

```{r random pseudo-bulk, include=FALSE, eval=TRUE}
library(dplyr)
## randomly generate aggregating idx
set.seed(123)
data_sim$rand_idx <- sample.int(30, ncol(data_sim), replace = TRUE)

## aggregate into pseudo-bulk samples based on rand_idx
pb_r <- pseudo_samples(data_sim, by = c("Batch", "rand_idx"))
dge_r <- edgeR::DGEList(
  counts = pb_r,
  samples = data.frame(
    group = gsub(".*\\.(.*)_.*", "\\1", colnames(pb_r)),
    Batch = gsub("(.*)\\..*", "\\1", colnames(pb_r)),
    sampleID = gsub("(.*)_.*", "\\1", colnames(pb_r))
  )
)
```

Cellular proportion can be added into the random pseudo bulk data.

```{r add cellular prop, include=FALSE, eval=TRUE}
library(tidyr)
## append cellular composition
tmp <- as.data.frame(colData(data_sim)) |>
  group_by(rand_idx, Group) |>
  summarise(count = n()) |>
  pivot_wider(names_from = Group, values_from = count) |>
  mutate(rand_idx = factor(rand_idx))
tmp[, -1] <- signif(tmp[, -1] / rowSums(tmp[, -1]), 2)
dge_r$samples <- left_join(dge_r$samples, tmp, by = c("group" = "rand_idx"))

## data process
keep <- filterByExpr(dge_r)
dge_r <- dge_r[keep, , keep.lib.sizes = FALSE]
dge_r <- calcNormFactors(dge_r, method = "TMM")
```

As shown in the following three applications, the `mastR` signatures outperform or match the simulated markers list in various metrics. In terms of score and deconvolution, `mastR` signatures exhibit high agreement with the true cell type proportions.

## Score

Multiple scores are calculated by `r BiocStyle::Biocpkg("singscore")` package using signatures and markers lists. Make scatter plot to compare the correlation between the score and the actual cellular proportion.

For signatures generated by `mastR`:

```{r score on signatures, eval=TRUE}
library(singscore)

rank_data <- rankGenes(edgeR::cpm(dge_r, log = TRUE))

## score based on sig_ls
scores <- multiScore(rank_data, upSetColc = gls2gsc(sig_ls))

tmp <- pivot_longer(cbind(Sample = colnames(dge_r), dge_r$samples[, 6:9]),
  -Sample,
  names_to = "Group", values_to = "Prop"
)

tmp <- t(scores$Scores) |>
  data.frame(Sample = colnames(scores$Scores)) |>
  pivot_longer(-Sample, names_to = "Group", values_to = "Score") |>
  left_join(tmp)

ggplot(tmp, aes(x = Prop, y = Score, col = Group)) +
  geom_point() +
  facet_wrap(~Group, scales = "free") +
  ggpubr::stat_cor() +
  theme_classic()
```

For markers list from simulation:

```{r score on markers, eval=TRUE}
## score based on markers_list
scores <- multiScore(rank_data, upSetColc = gls2gsc(markers_list))

tmp <- pivot_longer(cbind(Sample = colnames(dge_r), dge_r$samples[, 6:9]),
  -Sample,
  names_to = "Group", values_to = "Prop"
)
tmp$Group <- paste0("DEFac", tmp$Group)

tmp <- t(scores$Scores) |>
  data.frame(Sample = colnames(scores$Scores)) |>
  pivot_longer(-Sample, names_to = "Group", values_to = "Score") |>
  left_join(tmp)

ggplot(tmp, aes(x = Prop, y = Score, col = Group)) +
  geom_point() +
  facet_wrap(~Group, scales = "free") +
  ggpubr::stat_cor() +
  theme_classic()
```

## Deconvolution

Use the `r BiocStyle::CRANpkg("BisqueRNA")` package for cellular deconvolution.

Cellular proportion can be estimated from this.

```{r deconv, eval=TRUE}
if (!requireNamespace("BisqueRNA", quietly = TRUE)) {
  # install.packages("BisqueRNA")
  stop("Install 'BisqueRNA'!")
}

bulk_eset <- ExpressionSet(edgeR::cpm(dge_r, log = TRUE))

## deconv on sig_ls
mark_d <- stack(sig_ls)
colnames(mark_d) <- c("gene", "cluster")

res_sig <- BisqueRNA::MarkerBasedDecomposition(
  bulk.eset = bulk_eset,
  markers = mark_d
)

## deconv on markers_ls
mark_d <- stack(markers_list)
colnames(mark_d) <- c("gene", "cluster")

res_mar <- BisqueRNA::MarkerBasedDecomposition(
  bulk.eset = bulk_eset,
  markers = mark_d
)
rownames(res_mar$bulk.props) <- gsub("DEFac", "", rownames(res_mar$bulk.props))
```

Similarly, use scatter plot to visualize the correlation between estimated proportions and actual proportions.

```{r vis deconv sigs, eval=TRUE}
tmp <- pivot_longer(cbind(Sample = colnames(dge_r), dge_r$samples[, 6:9]),
  -Sample,
  names_to = "Group", values_to = "Prop"
)

tmp <- t(res_sig$bulk.props) |>
  data.frame(Sample = colnames(res_sig$bulk.props)) |>
  pivot_longer(-Sample, names_to = "Group", values_to = "Estimate") |>
  left_join(tmp)

ggplot(tmp, aes(x = Prop, y = Estimate, col = Group)) +
  geom_point() +
  facet_wrap(~Group, scales = "free") +
  ggpubr::stat_cor() +
  theme_classic()
```

```{r vis deconv markers, eval=TRUE}
tmp <- pivot_longer(cbind(Sample = colnames(dge_r), dge_r$samples[, 6:9]),
  -Sample,
  names_to = "Group", values_to = "Prop"
)

tmp <- t(res_mar$bulk.props) |>
  data.frame(Sample = colnames(res_mar$bulk.props)) |>
  pivot_longer(-Sample, names_to = "Group", values_to = "Estimate") |>
  left_join(tmp)

ggplot(tmp, aes(x = Prop, y = Estimate, col = Group)) +
  geom_point() +
  facet_wrap(~Group, scales = "free") +
  ggpubr::stat_cor() +
  theme_classic()
```

## Annotation

Use `r BiocStyle::Biocpkg("singscore")` as a marker-based annotation method. Calculate the scores based on the signatures and markers lists, then assign each cell with the cell type having the highest score. Compare the results.

Using signatures:

```{r sig annotation, eval=TRUE}
if (!requireNamespace("scuttle", quietly = TRUE)) {
  # BiocManager::install("scuttle")
  stop("Install 'scuttle'!")
}
library(singscore)
## normalization
data_sim <- scuttle::computePooledFactors(data_sim, clusters = data_sim$Group)
data_sim <- scuttle::logNormCounts(data_sim)
## use singscore for annotation
rank_data <- rankGenes(logcounts(data_sim))

## score using sig_ls
scores <- multiScore(rank_data, upSetColc = gls2gsc(sig_ls))
data_sim$Pred <- paste0("Group", apply(scores$Scores, 2, which.max))
table(data_sim$Pred == data_sim$Group)
```

Using markers lists:

```{r markers annotation, eval=TRUE}
## score using markers_list
scores <- multiScore(rank_data, upSetColc = gls2gsc(markers_list))
data_sim$Pred <- paste0("Group", apply(scores$Scores, 2, which.max))
table(data_sim$Pred == data_sim$Group)
```

# Session Info

------------------------------------------------------------------------

```{r session info}
sessionInfo()
```