---
title: "MEB Tutorial"
author: "Yan Zhou, Jiadi Zhu"
package: MEB
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{MEB Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{SummarizedExperiment, SingleCellExperiment}
---


# Introduction
This package includes two methods for differentially expressed genes (DEGs) 
detection in RNA-seq and scRNA-seq datasets, respectively. The first method is
the SFMEB that is used to identify DEGs in the same or different species 
RNA-seq dataset. Given that non-DE genes have some similarities in features, 
the SFMEB covers those non-DE genes in feature space, then those DE genes, 
which are enormously different from non-DE genes, being regarded as outliers 
and rejected outside the ball. The method on this package are described in the 
article *A scaling-free minimum enclosing ball method to detect differentially 
expressed genes for RNA-seq data* by Zhou, Y., Yang, B., Wang, J. et al. 
BMC Genomics, 22, 479 (2021). The second method is the scMEB which is the 
extension of the SFMEB. The scMEB is a novel and fast method for detecting 
single-cell DEGs without prior cell clustering results. The details about the 
scMEB could be refered to the article *scMEB: A fast and clustering-independent 
method for detecting differentially expressed genes in single-cell RNA-seq 
data* by Zhu, J.D and Yang, Y.L. (2023, pending publication)


# The steps of the SFMEB method
The SFMEB method is developed for detecting differential expression genes in 
the same or different species. 
Compared with existing methods, it is no need to normalize data in advance. 
Besides, the SFMEB method could be easily applied to the same or different 
species data and without changing too much. We have implemented the SFMEB 
method via an R function `NIMEB()`. The method consists three steps.

**Step 1**: Data Pre-processing;  

**Step 2**: Training a model for the training genes;

**Step 3**: Discriminating a gene whether a DE gene.

We employ a simulation and real dataset for the same and different species to 
illustrate the usage of the SFMEB method.

## Preparations
To install the MEB package into your R environment, start R and
enter:
```{r, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("MEB")
```

Then, the MEB package is ready to load.
```{r}
library(MEB)
```


## Data format
In order to show the usage of SFMEB method, we introduce the example data sets, 
which includes the simulation and real data for the same and different species. 
The next we will show the introduction of datasets in the package.

There are six datasets in the data subdirectory of MEB package, in which four 
datasets are linked to the SFMEB method. To consistent 
with standard Bioconductor representations, we transform the format of dataset
as *SummarizedExperiment*, please refer R package *SummarizedExperiment* for 
details. The four datasets are **sim_data_sp**, **sim_data_dsp**, 
**real_data_sp** and **real_data_dsp**. 

**real_data_sp** is a real dataset for the same species, which comes from 
*RNA-seq: an assessment of technical reproducibility and comparisonwith gene 
expression arrays* by Marioni J.C., Mason C.E., et al. (2008). Genome Res. 
18(9), 1509–1517. 

**real_data_dsp** is a real dataset for the different 
species, which comes from *The evolution of gene expression levels in 
mammalian organs* by Brawand, D., Soumillon, M., 
Necsulea, A. and Julien, P. et al. (2011). Nature, 478, 343-348. 


**sim_data_sp** and **sim_data_dsp** are two simulation datasets for the same 
and different species, respectively. Refering *A scaling-free minimum enclosing 
ball method to detect differentially 
expressed genes for RNA-seq data* by Zhou, Y., Yang, B., Wang, J. et al. 
BMC Genomics, 22, 479 (2021) for the generation procedure. 


```{r}
data(sim_data_sp)
sim_data_sp
```

**sim_data_sp.RData** includes 2 columns,  

*   the first column is the RNA-seq short read counts for the first sample;  

*   the second column is the RNA-seq short read counts for the second sample; 

*   each row represents a gene, and the first 1000 genes are housekeeping 
genes.  

```{r}
data(real_data_sp)
real_data_sp
```

**real_data_sp** includes 10 columns,

*   there are two samples about kidney and liver, and each with five biological 
replicates;

*   each row represents a gene, and the first 530 genes are housekeeping genes.

```{r}
data(sim_data_dsp)
sim_data_dsp
```

**sim_data_dsp.RData** includes 4 columns, 

* the first and the third columns are the gene length for two species;

* the second and the fouth columns are the RNA-seq short read counts for two
species;

* each row represents an orthologous gene, and the first 1000 genes are the 
conserved genes.


```{r}
data(real_data_dsp)
real_data_dsp
```


**real_data_dsp.RData** includes 4 columns, 

* the first and the third columns are the gene length for human and mouse;

* the second and the fouth columns are the RNA-seq short read counts for human
and mouse;

* each row represents an orthologous gene, and the first 143 genes are the 
conserved genes.


## Training a model for the training genes
Based on a part of known housekeeping and conserved genes, we can train our 
model for the above four datasets. The next we will show how to use the 
`NIMEB()` function to train a model.


1. Simulation data for the same species
```{r, message = FALSE, warning = FALSE}
library(SummarizedExperiment)
```


```{r}
data(sim_data_sp)
gamma <- seq(1e-06,5e-05,1e-06)
sim_model_sp <- NIMEB(countsTable=assay(sim_data_sp), train_id=1:1000, gamma,
nu = 0.01, reject_rate = 0.05, ds = FALSE)
```


2. Real data for the same species
```{r}
data(real_data_sp)
gamma <- seq(1e-06,5e-05,1e-06)
real_model_sp <- NIMEB(countsTable=assay(real_data_sp), train_id=1:530,
gamma, nu = 0.01, reject_rate = 0.1, ds = FALSE)
```


3. Simulation data for the different species
```{r}
data(sim_data_dsp)
gamma <- seq(1e-07,2e-05,1e-06)
sim_model_dsp <- NIMEB(countsTable=assay(sim_data_dsp), train_id=1:1000, gamma,
nu = 0.01, reject_rate = 0.1, ds = TRUE)
```


4. Real data for the different species
```{r}
data(real_data_dsp)
gamma <- seq(5e-08,5e-07,1e-08)
real_model_dsp <- NIMEB(countsTable=assay(real_data_dsp), train_id=1:143, 
                        gamma, nu = 0.01, reject_rate = 0.1, ds = TRUE)
```

The output for *NIMEB()* includes "*model*", "*gamma*" and *train_error*. 
*model* is the model we used to discriminate a new gene, *gamma* represents the
selected gamma parameters in model NIMEB, *train_error* represents the
corresponding train_error when the value of gamma changed.


## Discriminating a gene whether a DE gene
Giving the model, we could predict a gene and find out whether DE gene. For 
example, in *sim_data_sp* data, we predict the discrimination results as 
follows:


```{r}
sim_model_sp_pred <- predict(sim_model_sp$model, assay(sim_data_sp))
summary(sim_model_sp_pred)
```

Based on the model we trained, we could discriminate each genes whether DE 
gene, if the discrimination result is *TRUE*/*FALSE*, the gene is non-DE/DE 
gene.


# The usage of the scMEB method
We add a new function `scMEB()` for detecting differential expressed genes in 
scRNA-seq data without prior clustering results. There is a example to 
introduce the usage of this function:

1. Load the package and example scRNA-seq data
```{r, message = FALSE, warning = FALSE}
library(SingleCellExperiment)
```

The simulation data is generated by splatter package (Zappia L, et al. 2017).
The data include 5,000 genes and 100 cells.
```{r}
data(sim_scRNA_data)
sim_scRNA_data
```

We randomly sample 1,000 stable genes from the simulation data. 
```{r}
data(stable_gene)
head(stable_gene)
length(stable_gene)
```

2. Training a model for the simulation scRNA-seq data
```{r}
sim_scRNA <- scMEB(sce=sim_scRNA_data, stable_idx=stable_gene, 
filtered = TRUE, gamma = seq(1e-04,0.001,1e-05), nu = 0.01, 
reject_rate = 0.1)
```

Predict a gene and find out whether DE gene. For *sim_data_sp* data, we predict 
the discrimination results as 
follows:
```{r}
sim_scRNA_pred <- predict(sim_scRNA$model, sim_scRNA$dat_pca)
summary(sim_scRNA_pred)
```
The discrimination result *TRUE*/*FALSE* correspond that gene is non-DE/DE 
gene.


scMEB also provides a metric for ranking the genes, that is, the distance 
between the gene and the sphere of the ball in the feature space. And the 
larger the distance is, the more likely it is that the gene is a DEG.
```{r}
table(sim_scRNA$dist>0)
```

```{r}
sim_scRNA_dist <- data.frame(Gene=rownames(sim_scRNA_data),
                             Distance=sim_scRNA$dist)
head(sim_scRNA_dist)
```


```{r}
sessionInfo()
```