--- title: "Introducing mRNA bias into simulations with scaling factors" author: "Alexander Dietrich" date: "11/29/2021" bibliography: references.bib biblio-style: apalike link-citations: yes colorlinks: yes output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introducing mRNA bias into simulations with scaling factors} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Using scaling factors This package allows the user to introduce an mRNA bias into pseudo-bulk RNA-seq samples. Different cell-types contain different amounts of mRNA (Dendritic cells for examples contain much more than Neutrophils); this bias can be added into simulations artificially in different ways. \ The scaling factors will always be applied on the single-cell dataset first, altering its expression profiles accordingly, and then the pseudo-bulk samples are generated by summing up the count data from the sampled cells. ```{r} # Example data counts <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol = 300), sparse = TRUE) tpm <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol = 300), sparse = TRUE) tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm)) colnames(counts) <- paste0("cell_", rep(1:300)) colnames(tpm) <- paste0("cell_", rep(1:300)) rownames(counts) <- paste0("gene_", rep(1:1000)) rownames(tpm) <- paste0("gene_", rep(1:1000)) annotation <- data.frame( "ID" = paste0("cell_", rep(1:300)), "cell_type" = c(rep("T cells CD4", 150), rep("T cells CD8", 150)), "spikes" = runif(300), "add_1" = runif(300), "add_2" = runif(300) ) ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, name = "test_dataset" ) ``` ## Pre-defined scaling factors Some studies have proposed scaling factors for immune cells, such as EPIC [@Racle2017] or quanTIseq [@Finotello2019], deconvolution tools which correct for the mRNA bias internally using these values: ```{r, format="markdown"} epic <- data.frame( type = c( "B cells", "Macrophages", "Monocytes", "Neutrophils", "NK cells", "T cells", "T cells CD4", "T cells CD8", "T helper cells", "T regulatory cells", "otherCells", "default" ), mRNA = c( 0.4016, 1.4196, 1.4196, 0.1300, 0.4396, 0.3952, 0.3952, 0.3952, 0.3952, 0.3952, 0.4000, 0.4000 ) ) epic ``` ```{r, format="markdown"} quantiseq <- data.frame( type = c( "B cells", "Macrophages", "MacrophagesM2", "Monocytes", "Neutrophils", "NK cells", "T cells CD4", "T cells CD8", "T regulatory cells", "Dendritic cells", "T cells" ), mRNA = c( 65.66148, 138.11520, 119.35447, 130.65455, 27.73634, 117.71584, 63.87200, 70.25659, 72.55110, 140.76091, 68.89323 ) ) quantiseq ``` When you want to apply one of these scaling factors into your simulation (therefore in-/decreasing the expression signals for the cell-types), we can use the `scaling_factor` parameter. Note, that these pre-defined scaling factors only offer values for a certain number of cell types, and your annotation in the provided dataset has to match these names 1:1. All cell types from your dataset which are not present in this scaling factor remain unscaled and a warning message will appear. ```{r} sim_epic <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "epic", nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) ``` We can also try out some custom scaling factors, for example increasing the expression levels for a single cell-type (T cells CD8) by 10-fold compared to the rest. All cell-types which are not mentioned in the named list given to `custom_scaling_vector` will be transformed with a scaling factor of 1, meaning nothing changes for them. ```{r} sim_extreme_b <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "custom", custom_scaling_vector = c("T cells CD8" = 10), nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) ``` **Important:** Watch out that the cell-type annotation names in your dataset are the same as in the scaling factor! Otherwise the scaling factor will not be applied or even worse, applied to a different cell-type. ## Dataset specific scaling factors You can also choose to calculate scaling factors, which are depending on your provided single-cell dataset. Compared to the previous section, this will give a unique value for each cell rather than a cell-type, making it possibly more sensitive. ### Reads and genes Two straight forward approaches would be the number of reads or number of expressed genes/features. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation. ```{r} sim_reads <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "read_number", # use number of reads as scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) sim_genes <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "expressed_genes", # use number of expressed genes column as scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) ``` These options would also allow you to use other numerical measurements you have for the single cells as scaling factors, such as weight or size for example. Lets pretend, `add_1` and `add_2` are such measurements. With the `additional_cols` parameter, they can be added to the SimBu dataset and we can use them as scaling factor as well: ```{r, format="markdown"} head(annotation) ``` ```{r} ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, name = "test_dataset", additional_cols = c("add_1", "add_2") ) # add columns to dataset sim_genes <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "add_1", # use add_1 column as scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) ``` ### Spike-ins One other numerical measurement can be spike-ins. Usually the number of reads mapped to spike-in molecules per cell is given in the cell annotations. If this is the case, they can be stored in the dataset annotation using the `spike_in_col` parameter, where you indicate the name of the column from the `annotation` dataframe in which the spike-in information is stored. To calculate a scaling factor from this, the number of reads are also necessary, so we will add this information as well (as above using the `read_number_col` parameter). The scaling factor with spike-ins is calculated as the "% of reads NOT mapped to spike-in reads", or: `(n_reads - n_spike_in)/n_reads` for each cell. We apply it like this: ```{r, eval=FALSE} ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, name = "test_dataset", spike_in_col = "spikes" ) # give the name in the annotation file, that contains spike-in information sim_spike <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "spike_in", # use spike-in scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) ``` ## Census - estimate mRNA counts per cell [Census](http://cole-trapnell-lab.github.io/monocle-release/docs/#census) is an approach which tries to convert TPM counts into relative transcript counts. This basically means, you get the mRNA counts per cell, which can differ between cell-types. \ [@Qiu2017] state in their paper, that it should only be applied to TPM/FPKM normalized data, but I tried it out with raw expression counts as well, which worked as well. \ Census calculates a vector with a scaling value for each cell in a sample. You can switch this feature on, by setting the `scaling_factor` parameter to `census`. ```{r} ds <- SimBu::dataset( annotation = annotation, count_matrix = counts, tpm_matrix = tpm, name = "test_dataset" ) sim_census <- SimBu::simulate_bulk( data = ds, scenario = "random", scaling_factor = "census", # use census scaling factor nsamples = 10, ncells = 100, BPPARAM = BiocParallel::MulticoreParam(workers = 4), run_parallel = TRUE ) ``` In our analysis we found, that Census is basically a complicated way of estimating the number of expressed genes per cell. It will remain to the user to decide if he/she wants to use census or simply the number of expressed genes (as shown above) as scaling factor. ```{r} sessionInfo() ``` # References