---
title: "Inputs and Outputs"
author: "Alexander Dietrich"
bibliography: references.bib
biblio-style: apalike
link-citations: yes
colorlinks: yes
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Inputs and Outputs}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r}
library(Seurat)
library(curl)
library(SimBu)
```

This chapter covers the different input and output options of the package in detail.

# Input

The input for your simulations is always a `SummarizedExperiment` object. You can create this object with different constructor functions, which are explained below. It is also possible to merge multiple dataset objects into one. \
Sfaira is not covered in this vignette, but in ["Public Data Integration"](sfaira_vignette.html).

## Custom data

Using existing count matrices and annotations is already covered in the ["Getting started"](simulator_documentation.html) vignette; this section explains some minor details.\
When generating a dataset with your own data, you need to provide the `count_matrix` parameter of `dataset()`; additionally, you can provide a TPM matrix with the `tpm_matrix` parameter. This will then lead to two simulations, one based on counts and one based on TPMs. In both matrices, genes are located in the rows and cells in the columns.\
Additionally, an annotation table with the cell-type annotations is needed. It has to consist of at least two columns, `ID` and `cell_type`, where `ID` has to be identical to the column names of the provided matrix/matrices. If not all cells appear in the annotation or matrix, the intersection of both is used to generate the dataset.
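
The intersection rule described above can be illustrated with a few lines of base R. This is only a sketch of the idea under simple assumptions, not SimBu's internal code; the objects `toy_counts`, `toy_annotation` and `shared_ids` are hypothetical names introduced for this illustration.

```{r}
# Toy count matrix: 3 genes (rows) x 4 cells (columns)
toy_counts <- matrix(
  1:12,
  nrow = 3,
  dimnames = list(paste0("gene-", 1:3), paste0("cell-", 1:4))
)

# Toy annotation: "cell-1" is not annotated, "cell-5" has no counts
toy_annotation <- data.frame(
  ID = paste0("cell-", 2:5),
  cell_type = c("B cells", "B cells", "NK cells", "NK cells"),
  stringsAsFactors = FALSE
)

# Cells that are present in both the matrix and the annotation
shared_ids <- intersect(colnames(toy_counts), toy_annotation$ID)

# Restrict both inputs to the shared cells, in the same order
toy_counts <- toy_counts[, shared_ids, drop = FALSE]
toy_annotation <- toy_annotation[match(shared_ids, toy_annotation$ID), ]

shared_ids
dim(toy_counts)
```

In this toy case only `cell-2`, `cell-3` and `cell-4` occur in both inputs, so only these three cells would remain.
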

Here is some example data:

```{r}
# simulated counts: 1000 genes (rows) x 300 cells (columns)
counts <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol = 300), sparse = TRUE)
# scale each cell (column) to a total of one million
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell-", 1:300)
colnames(tpm) <- paste0("cell-", 1:300)
rownames(counts) <- paste0("gene-", 1:1000)
rownames(tpm) <- paste0("gene-", 1:1000)
# annotation table with the required `ID` and `cell_type` columns
annotation <- data.frame(
  "ID" = paste0("cell-", 1:300),
  "cell_type" = c(
    rep("T cells CD4", 50),
    rep("T cells CD8", 50),
    rep("Macrophages", 100),
    rep("NK cells", 10),
    rep("B cells", 70),
    rep("Monocytes", 20)
  ),
  row.names = paste0("cell-", 1:300)
)
# Seurat object with one assay for counts and one for TPMs
seurat_obj <- Seurat::CreateSeuratObject(counts = counts, assay = "counts", meta.data = annotation)
tpm_assay <- Seurat::CreateAssayObject(counts = tpm)
seurat_obj[["tpm"]] <- tpm_assay
seurat_obj
```

### Seurat

It is possible to use a Seurat object to build a dataset. Provide the name of the assay that contains the count data in its `counts` slot (`count_assay`), the name of the meta data column with the unique cell IDs (`cell_id_col`) and the name of the meta data column with the cell-type labels (`cell_type_col`). Additionally, you may provide the name of the assay that contains the TPM data in its `counts` slot (`tpm_assay`).

```{r}
ds_seurat <- SimBu::dataset_seurat(
  seurat_obj = seurat_obj,
  count_assay = "counts",
  cell_id_col = "ID",
  cell_type_col = "cell_type",
  tpm_assay = "tpm",
  name = "seurat_dataset"
)
```

### h5ad files

It is possible to use an h5ad file directly, a file format which stores [AnnData](https://anndata.readthedocs.io/en/latest/) objects. As h5ad files can store cell-specific information in the `obs` layer, no additional annotation input for SimBu is needed. \
Note: if you want both counts and TPM data as input, you will have to provide two files; the cell annotation has to match between these two files.

SimBu expects the cells to be in the columns and the genes/features in the rows of the input matrix, but this is not necessarily the case for anndata objects (see [https://falexwolf.de/img/scanpy/anndata.svg](https://falexwolf.de/img/scanpy/anndata.svg)). SimBu can therefore handle h5ad files with cells in either the `obs` or the `var` layer. If your cells are in `obs`, use `cells_in_obs=TRUE`, otherwise use `cells_in_obs=FALSE`. SimBu will then also transpose the matrix automatically. To specify which columns in the cell annotation layer correspond to the cell identifiers and cell-type labels, use the `cell_id_col` and `cell_type_col` parameters, respectively.\
As this function uses the `SimBu` python environment to read the h5ad files and extract the data, it may take some additional time to initialize the conda environment; this happens at the *first* usage only.

```{r}
# example h5ad file, where cell type info is stored in `obs` layer
#h5 <- system.file("extdata", "anndata.h5ad", package = "SimBu")
#ds_h5ad <- SimBu::dataset_h5ad(
#  h5ad_file_counts = h5,
#  name = "h5ad_dataset",
#  cell_id_col = 0, # this will use the rownames of the metadata as cell identifiers
#  cell_type_col = "group", # this will use the 'group' column of the metadata as cell type info
#  cells_in_obs = TRUE # in case your cell information is stored in the var layer, switch to FALSE
#)
```

## Merging datasets

You can merge multiple datasets with the `dataset_merge` function:

```{r}
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  tpm_matrix = tpm,
  name = "test_dataset"
)
ds_multiple <- SimBu::dataset_merge(
  dataset_list = list(ds_seurat, ds),
  name = "ds_multiple"
)
```

# Output

The `simulation` object contains three named entries:

* `bulk`: a `SummarizedExperiment` object with the pseudo-bulk dataset(s) stored in the `assays`.
They can be accessed like this:

```{r}
simulation <- SimBu::simulate_bulk(
  data = ds_multiple,
  scenario = "random",
  scaling_factor = "NONE",
  nsamples = 10,
  ncells = 100
)
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_counts"]])
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_tpm"]])
```

If only the count matrix was given to the dataset initially, only the `bulk_counts` assay is filled.

* `cell_fractions`: a table where rows represent the simulated samples and columns represent the different simulated cell types. Each entry stores the fraction of the respective cell type in the respective sample.
* `scaling_vector`: a named list with the scaling value that was used for each cell from the single-cell dataset.

```{r}
sessionInfo()
```