---
title: "Overview of the scRNAseq dataset collection"
author:
- name: Davide Risso
  affiliation: Division of Biostatistics and Epidemiology, Weill Cornell Medicine
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
date: "Created: May 25, 2016; Compiled: `r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
package: scRNAseq
vignette: >
  %\VignetteIndexEntry{User's Guide}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: "`r system.file('scripts', 'ref.bib', package='scRNAseq')`"
---

```{r style, echo=FALSE}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Introduction

The `r Biocpkg("scRNAseq")` package provides convenient access to several publicly available data sets in the form of `SingleCellExperiment` objects.
The focus of this package is to capture datasets that are not easily read into R with a one-liner from, e.g., `read.csv()`.
Instead, we do the necessary data munging so that users only need to call a single function to obtain a well-formed `SingleCellExperiment`.
For example:

```{r}
library(scRNAseq)
fluidigm <- ReprocessedFluidigmData()
fluidigm
```

Readers are referred to the `r Biocpkg("SummarizedExperiment")` and `r Biocpkg("SingleCellExperiment")` documentation for further information on how to work with `SingleCellExperiment` objects.

# Available data sets

The available data sets can be split into two categories.
The first category contains expression matrices that have been generated by the `r Biocpkg("scRNAseq")` authors from the raw sequencing data for each experiment.
This includes:

- `ReprocessedFluidigmData()` provides 65 cells from @pollen2014lowcoverage.
- `ReprocessedTh2Data()` provides 96 T helper cells from @mahata2014singlecell.
- `ReprocessedAllenData()` provides 379 cells from the mouse visual cortex, which is a subset of the data from @tasic2016adult.

The second category contains expression matrices that were provided by the authors of each study.
No further reprocessing has been performed other than some cross-checks between the count matrix and the sample metadata.

| Study | System | Number of cells | Function |
|-----------------------|------------------|-------------------|---------------------|
|@aztekin2019identification | Xenopus tail | 13199 | `AztekinTailData()` |
|@bach2017differentiation | Mouse mammary gland | 25806 | `BachMammaryData()` |
|@baron2016singlecell | Human pancreas | 8569 | `BaronPancreasData('human')` |
|@baron2016singlecell | Mouse pancreas | 1886 | `BaronPancreasData('mouse')` |
|@buettner2015computational | Mouse embryonic stem cells | 288 | `BuettnerESCData()` |
|@campbell2017molecular | Mouse brain | 21086 | `CampbellBrainData()` |
|@chen2017singlecell | Mouse brain | 14437 | `ChenBrainData()` |
|@grun2016denovo | Mouse haematopoietic stem cells | 1915 | `GrunHSCData()` |
|@grun2016denovo | Human pancreas | 1728 | `GrunPancreasData()` |
|@kolodziejczyk2015singlecell | Mouse embryonic stem cells | 704 | `KolodziejczykESCData()` |
|@lamanno2016molecular | Human embryonic stem cells | 1715 | `LaMannoBrainData('human-es')` |
|@lamanno2016molecular | Human embryonic midbrain | 1977 | `LaMannoBrainData('human-embryo')` |
|@lamanno2016molecular | Human induced pluripotent stem cells | 337 | `LaMannoBrainData('human-ips')` |
|@lamanno2016molecular | Mouse adult dopaminergic neurons | 243 | `LaMannoBrainData('mouse-adult')` |
|@lamanno2016molecular | Mouse embryonic midbrain | 1907 | `LaMannoBrainData('mouse-embryo')` |
|@lawlor2017singlecell | Human pancreas | 638 | `LawlorPancreasData()` |
|@leng2015oscope | Human embryonic stem cells | 460 | `LengESCData()` |
|@lun2017assessing | 416B cells | 192 | `LunSpikeInData('416b')` |
|@lun2017assessing | Mouse trophoblasts | 192 | `LunSpikeInData('tropho')` |
|@macosko2015highly | Mouse retina | 49300 | `MacoskoRetinaData()` |
|@marques2016oligodendrocyte | Mouse brain | 5069 | `MarquesBrainData()` |
|@messmer2019transcriptional | Human embryonic stem cells | 1344 | `MessmerESCData()` |
|@muraro2016singlecell | Human pancreas | 3072 | `MuraroPancreasData()` |
|@nestorowa2016singlecell | Mouse haematopoietic stem cells | 1920 | `NestorowaHSCData()` |
|@paul2015transcriptional | Mouse haematopoietic stem cells | 10368 | `PaulHSCData()` |
|@richard2018tcell | Mouse CD8+ T cells | 572 | `RichardTCellData()` |
|@romanov2017molecular | Mouse brain | 2881 | `RomanovBrainData()` |
|@segerstolpe2016single | Human pancreas | 3514 | `SegerstolpePancreasData()` |
|@shekhar2016comprehensive | Mouse retina | 44994 | `ShekharRetinaData()` |
|@usoskin2015unbiased | Mouse brain | 864 | `UsoskinBrainData()` |
|@tasic2016adult | Mouse brain | 1809 | `TasicBrainData()` |
|@xin2016rna | Human pancreas | 1600 | `XinPancreasData()` |
|@zeisel2015brain | Mouse brain | 3005 | `ZeiselBrainData()` |

# Adding new data sets

Please contact us if you have a data set that you would like to see added to this package.
The only requirement is that your data set has publicly available expression values (ideally counts) and sample annotation.
The more difficult/custom the format, the better, as its inclusion in this package will provide more value for other users in the R/Bioconductor community.

If you have already written code that processes your desired data set in a `SingleCellExperiment`-like form, we would welcome a pull request [here](https://github.com/LTLA/scRNAseq).
The process can be expedited by ensuring that you have the following files:

- `inst/scripts/make-X-Y-data.Rmd`, an R Markdown report that creates all components of a `SingleCellExperiment`.
  `X` should be the last name of the first author of the relevant study while `Y` should be the name of the biological system.
- `inst/scripts/make-X-Y-metadata.R`, an R script that creates a metadata CSV file at `inst/extdata/metadata-X-Y.csv`.
  Metadata files should follow the format described in the `r Biocpkg("ExperimentHub")` documentation.
- `R/XYData.R`, an R source file that defines a function `XYData()` to download the components from ExperimentHub and creates a `SingleCellExperiment` object.

Potential contributors are recommended to examine some of the existing scripts in the package to pick up the coding conventions.
Remember, we're more likely to accept a contribution if it's indistinguishable from something we might have written ourselves!

As a general rule, 10X Genomics data sets are not suitable for inclusion in this package.
They are either easy to load (e.g., with functions from the `r Biocpkg("DropletUtils")` package), or they are more appropriately obtained with dedicated 10X packages like `r Biocpkg("TENxPBMCData")` or `r Biocpkg("TENxBrainData")`.
That said, inclusion will be considered if the format has been sufficiently customized by the original authors.

# References