---
title: "Public Data Integration using Sfaira"
author: "Alexander Dietrich"
bibliography: references.bib
biblio-style: apalike
link-citations: yes
colorlinks: yes
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Public Data Integration using Sfaira}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(SimBu)
```

# Sfaira Integeration

This vignette will cover the integration of the public database Sfaria. \


## Setup

As a public database, sfaira [@Fischer2020] is used, which is a dataset and model repository for single-cell RNA-sequencing
data. It gives access to about multiple datasets from human and mouse with more than 3 million cells in total. 
You can browse them interactively here: [https://theislab.github.io/sfaira-portal/Datasets](https://theislab.github.io/sfaira-portal/Datasets). Note that only annotated datasets will be downloaded! Also there are cases of datasets, which have private URLs and cannot be automatically downloaded; SimBu will skip these datasets. \
In order to use this database, we first need to install it. This can easily be done, by running the `setup_sfaira()` function for the first time. In the background we use the [basilisik](https://bioconductor.org/packages/release/bioc/html/basilisk.utils.html) package to establish a conda environment that has all sfaira dependencies installed. The installation will be only performed one single time, even if you close your R session and call `setup_sfaira()` again. The given directory serves as the storage for all future downloaded datasets from sfaira:


```{r, eval=FALSE}
setup_list <- SimBu::setup_sfaira(basedir = tempdir())
```

## Creating a dataset

We will now create a dataset of samples from human pancreas using the `organisms` and `tissues` parameter.
You can provide a single word (like we do here) or for example a list of tissues
you want to download: `c("pancreas","lung")`. An additional parameter is the `assays` parameter,
where you subset the database further to only download datasets from certain sequencing
assays (for examples `Smart-seq2`). \
The `name` parameter is used later on to give each sample (cell) a unique name.

```{r, eval=FALSE}
ds_pancrease <- SimBu::dataset_sfaira_multiple(
  sfaira_setup = setup_list,
  organisms = "Homo sapiens",
  tissues = "pancreas",
  name = "human_pancreas"
)
```

Currently there are three datasets in sfaira from human pancreas, which have cell-type annotation.
The package will download them for you automatically and merge them together into a single expression
matrix and a streamlined annotation table, which we can use for our simulation. \
It can happen, that some datasets from sfaira are not (yet) ready for the automatic download,
an error message will then appear in R, telling you which file to download and where to put it. \

If you wish to see all datasets which are included in sfaira you can use the following command:
```{r, eval=FALSE}
all_datasets <- SimBu::sfaira_overview(setup_list = setup_list)
head(all_datasets)
```

This allows you to find the specific IDs of datasets, which you can download directly:

```{r, eval=FALSE}
SimBu::dataset_sfaira(
  sfaira_id = "homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x",
  sfaira_setup = setup_list,
  name = "dataset_by_id"
)
```


```{r}
sessionInfo()
```