---
title: "Simulating Single-Cell Multi-Omics Data with MOSim"
author: 
  - name: "Carolina Monzó"
  - name: "Ángeles Arzalluz-Luque"
  - name: "Arianna Febbo"
  - name: "Sonia Tarazona"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{Wiki of how to use sc_mosim}
  %\usepackage[utf8]{inputenc}
---

# Introduction

Welcome to the `MOSim` package, a versatile tool for simulating bulk and
single-cell multi-omics data. In this vignette, we will explore how to
create synthetic single-cell data, focusing on single-cell RNA-seq
(scRNA-seq) and single-cell ATAC-seq (scATAC-seq) data. Using the
`MOSim` package, you can generate custom multi-omics datasets for
various experimental conditions, making it an essential resource for
testing and validating analysis methods, or creating benchmark datasets.

# Installation

Before we dive into the exciting world of data simulation, you'll need
to install the `MOSim` package. You can easily obtain it from bioconductor using
the following commands:

```{r installing, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) 
    install.packages("BiocManager")

BiocManager::install("MOSim")

# For the latest development version
install.packages("devtools")
devtools::install_github("ConesaLab/MOSim")

```

# Simulating Single-Cell Multi-Omics Data

The core of data simulation lies in the sc_mosim function, which allows
you to create synthetic single-cell multi-omics data. Let's explore a
typical example of its usage, using the default dataset loaded in the package:

```{r test_run, eval = FALSE}
library(MOSim)

# Create a list of omics data types (e.g., scRNA-seq and scATAC-seq)
omicsList <- sc_omicData(list("scRNA-seq", "scATAC-seq"), 
                         data = NULL)

# Define cell types for your experiment
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
      'Memory_B' = c(31:40))

# Load an association list containing peak IDs related to gene names
associationList <- data(associationList)

# Simulate multi-omics data with specific parameters
testing_groups <- sc_mosim(
  omicsList,
  cell_types,
  numberReps = 2,
  numberGroups = 3,
  diffGenes = list(c(0.2, 0.3), c(0.2, 0.3)),
  minFC = 0.25,
  maxFC = 4,
  numberCells = NULL,
  mean = NULL,
  sd = NULL,
  regulatorEffect = list(c(0.1, 0.2), c(0.1, 0.2), c(0.1, 0.2)),
  associationList = associationList,
  TF = FALSE
)
```

In the example above, we load omics data types, specify experimental
conditions and cell types, and load an association list. The sc_mosim
function lets us simulate multi-omics data with various parameters, such
as the number of replicates, differentially expressed genes, and
regulatory effects.

# Data Preparation

Before diving into simulation, it's essential to have your data ready.
The `sc_omicData` function aids in preparing your data for simulation.
It accepts the following inputs:

-   omics_types: A list of omics data types, which can be "scRNA-seq" or
    "scATAC-seq."

-   data (optional): A user-inputted list of matrices with features as
    rows and cells as columns. If data is NULL, the default data from
    10 cells for 4 celltypes is loaded. 

## Providing Custom Data

`sc_mosim` also allows you to simulate data resembling characteristics of
a dataset of your choice. To do so, you need to format your data using
the `sc_omicData` function. Supported input formats include:

-   Count matrix
-   Seurat array  
For example, to format your data extracted from the original Seurat object,  
you can follow these steps:

```{r custom data, eval = FALSE}
# This is done to get a dataset to extract a matrix (for example purposes)
scRNA <- MOSim::sc_omicData("scRNA-seq", data = NULL)
count <- scRNA[["scRNA-seq"]]
options(Seurat.object.assay.version = "v3")
Seurat_obj <- Seurat::CreateAssayObject(counts = count, assay = 'RNA')
# To format the data into sc_mosim-ready format, we pass the seurat
# object containing the count data we extracted into sc_omicData
omic_list_user <- sc_omicData(c("scRNA-seq"), data = c(Seurat_obj))
```

The resulting omic_list_user is a named list with "scRNA-seq" as the
name and your count matrix as the value.

# Running the Simulation: sc_mosim

## Default sc_mosim Simulation

`sc_mosim` can simulate scRNA, scATAC and TF count matrices without providing
any additional arguments. For a basic simulation, you only need to input
the omics list and cell types. Here's how it's done:

```{r default, eval = FALSE}
omic_list <- sc_omicData(list("scRNA-seq"))
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
      'Memory_B' = c(31:40))

sim <- sc_mosim(omic_list, cell_types, TF = TRUE)
```

This will result in simulated raw count matrices for scRNA and TF.

## Customizing the sc_mosim Simulation

The `sc_mosim` function offers a range of parameters to fine-tune your
simulation:

-   omics: A named list specifying the omics data types.
-   cellTypes: A list specifying the cell types.
-   numberReps (optional): The number of biological replicates.
-   numberGroups (optional): The number of different biological groups
    (experimental conditions).
-   diffGenes (optional): To simulate differentially expressed genes
    between groups.
-   minFC (optional): Threshold for downregulated genes.
-   maxFC (optional): Threshold for upregulated genes.
-   numberCells (optional): A vector specifying the number of cells per
    cell type.
-   mean and sd (optional): Vectors for mean and standard deviation of
    expression per cell type.
-   noiseRep (optional): Standard deviation between biological
    replicates.
-   noiseGroup (optional): Standard deviation between groups.
-   feature_no (optional): Total number of features to distribute
    between co-expression clusters.
-   clusters (optional): Number of co-expression patterns to simulate.
-   cluster_size (optional): Number of features in each co-simulation
    cluster.  
-   TF (optional): wether to extract TF dataframe as sim_TF in simulation object.
-   TFdf (optional): If an association matrix of TF and Target_gene is given the 
    TF expression values are extracted. If no data.frame is given, using the 
    association of human TF from {https://tflink.net/}
For example, to simulate data with specific settings, you can use the following  
code:

```{r testing, eval = FALSE}
omic_list <- sc_omicData(c("scRNA-seq", "scATAC-seq"))
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
      'Memory_B' = c(31:40))
sim <- sc_mosim(omic_list, cell_types, numberReps = 2, 
               numberGroups = 2, diffGenes = list(c(0.2, 0.3)), feature_no = 8000, 
               clusters = 3, mean = c(2*10^6, 1*10^6,2*10^6, 1*10^6), 
               sd = c(5*10^5, 2*10^5, 5*10^5, 2*10^5), 
               regulatorEffect = list(c(0.1, 0.2), c(0.1, 0.2)), TF = FALSE)
```

# Working with Simulation Results

## The sc_mosim Simulation Object

The result of your simulation is stored in a named list with 'sim_sc +
omic name' as names and Seurat objects as values. Each Seurat object
contains the synthetic count matrices for your experiment. Other
relevant information included in the object are:

-   cellTypes: A list specifying the columns in each simulated matrix
    that correspond to each cell type.

-   patterns: Matrix of co-expression patterns affecting the genes
    throughout cell-types.

-   FC: list of Fold Changes applied to each gene to simulate
    differential expression between experimental groups.

-   AssociationMatrices: gene/peak association matrices including
    differential expression and regulatory relationships.

-   Variability: added variability matrices to add dispersion to
    experimental groups and biological replicates.

## Retrieving Simulation Settings

To access simulation settings and other constraints for simulation, you can use the
`sc_omicSettings` function. This provides information about the
relationship between genes and peaks, differentially expressed genes,
regulator types, expression patterns, and fold changes for each gene and
peak compared to group 1.

```{r settings, eval = FALSE}
settings <- sc_omicSettings(sim)
```

If you run the sc_mosim simulation including TF, you can also extract the  
association matrix of TF - target genes regulatory relationships.

```{r settingsTF, eval = FALSE}
settings <- sc_omicSettings(sim, TF = TRUE)
```

## Accessing the Count Data Matrices

You can extract the simulated matrices for all experimental conditions
and biological replicates using the `sc_omicResults` function. This
provides you with the synthetic data for further analysis and
visualization.

```{r results, eval = FALSE}
res <- sc_omicResults(sim)
```