--- title: Using SingleR to annotate single-cell RNA-seq data author: - name: Aaron Lun email: infinite.monkeys.with.keyboards@gmail.com - name: Jared M. Andrews affiliation: Washington University in St. Louis, School of Medicine, St. Louis, MO, USA - name: Friederike Dündar affiliation: Applied Bioinformatics Core, Weill Cornell Medicine - name: Daniel Bunis affiliation: Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA date: "Revised: June 14th, 2020" output: BiocStyle::html_document: toc_float: true package: SingleR bibliography: ref.bib vignette: > %\VignetteIndexEntry{Annotating scRNA-seq data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, results="hide", message=FALSE} knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) library(BiocStyle) ``` # Introduction `r Biocpkg("SingleR")` is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data [@aran2019reference]. Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference. Thus, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this biological knowledge can be propagated to new datasets in an automated manner. To keep things brief, this vignette only provides a brief summary of the basic capabilities of `r Biocpkg("SingleR")`. However, the package also provides more advanced functionality that includes the use of multiple references simultaneously, manipulating the cell ontology and improving performance on big datasets. Readers are referred to the [book](http://bioconductor.org/books/devel/SingleRBook/) for more details. # Using built-in references The easiest way to use `r Biocpkg("SingleR")` is to annotate cells against built-in references. In particular, the `r Biocpkg("celldex")` package provides access to several reference datasets (mostly derived from bulk RNA-seq or microarray data) through dedicated retrieval functions. Here, we will use the Human Primary Cell Atlas [@hpcaRef], represented as a `SummarizedExperiment` object containing a matrix of log-expression values with sample-level labels. ```{r} library(celldex) hpca.se <- HumanPrimaryCellAtlasData() hpca.se ``` Our test dataset consists of some human embryonic stem cells [@lamanno2016molecular] from the `r Biocpkg("scRNAseq")` package. For the sake of speed, we will only label the first 100 cells from this dataset. ```{r} library(scRNAseq) hESCs <- LaMannoBrainData('human-es') hESCs <- hESCs[,1:100] ``` We use our `hpca.se` reference to annotate each cell in `hESCs` via the `SingleR()` function. This identifies marker genes from the reference and uses them to compute assignment scores (based on the Spearman correlation across markers) for each cell in the test dataset against each label in the reference. The label with the highest score is the assigned to the test cell, possibly with further fine-tuning to resolve closely related labels. ```{r} library(SingleR) pred.hesc <- SingleR(test = hESCs, ref = hpca.se, assay.type.test=1, labels = hpca.se$label.main) ``` Each row of the output `DataFrame` contains prediction results for a single cell. Labels are shown before (`labels`) and after pruning (`pruned.labels`), along with the associated scores. ```{r} pred.hesc # Summarizing the distribution: table(pred.hesc$labels) ``` At this point, it is worth noting that `r Biocpkg("SingleR")` is workflow/package agnostic. The above example uses `SummarizedExperiment` objects, but the same functions will accept any (log-)normalized expression matrix. # Using single-cell references Here, we will use two human pancreas datasets from the `r Biocpkg("scRNAseq")` package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the @muraro2016singlecell dataset to be our reference. ```{r} library(scRNAseq) sceM <- MuraroPancreasData() # One should normally do cell-based quality control at this point, but for # brevity's sake, we will just remove the unlabelled libraries here. sceM <- sceM[,!is.na(sceM$label)] # SingleR() expects reference datasets to be normalized and log-transformed. library(scuttle) sceM <- logNormCounts(sceM) ``` We then set up our test dataset from @grun2016denovo. To speed up this demonstration, we will subset to the first 100 cells. ```{r} sceG <- GrunPancreasData() sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts. sceG <- logNormCounts(sceG) ``` We then run `SingleR()` as described previously but with a marker detection mode that considers the variance of expression across cells. Here, we will use the Wilcoxon ranked sum test to identify the top markers for each pairwise comparison between labels. This is slower but more appropriate for single-cell data compared to the default marker detection algorithm (which may fail for low-coverage data where the median is frequently zero). ```{r} pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox") table(pred.grun$labels) ``` # Annotation diagnostics `plotScoreHeatmap()` displays the scores for all cells across all reference labels, which allows users to inspect the confidence of the predicted labels across the dataset. Ideally, each cell (i.e., column of the heatmap) should have one score that is obviously larger than the rest, indicating that it is unambiguously assigned to a single label. A spread of similar scores for a given cell indicates that the assignment is uncertain, though this may be acceptable if the uncertainty is distributed across similar cell types that cannot be easily resolved. ```{r} plotScoreHeatmap(pred.grun) ``` Another diagnostic is based on the per-cell "deltas", i.e., the difference between the score for the assigned label and the median across all labels for each cell. Low deltas indicate that the assignment is uncertain, which is especially relevant if the cell's true label does not exist in the reference. We can inspect these deltas across cells for each label using the `plotDeltaDistribution()` function. ```{r} plotDeltaDistribution(pred.grun, ncol = 3) ``` The `pruneScores()` function will remove potentially poor-quality or ambiguous assignments based on the deltas. The minimum threshold on the deltas is defined using an outlier-based approach that accounts for differences in the scale of the correlations in various contexts - see `?pruneScores` for more details. `SingleR()` will also report the pruned scores automatically in the `pruned.labels` field where low-quality assignments are replaced with `NA`. ```{r} summary(is.na(pred.grun$pruned.labels)) ``` Finally, a simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset. We extract the identity of the markers from the metadata of the `SingleR()` results and use them in the `plotHeatmap()` function from `r Biocpkg("scater")`, as shown below for beta cell markers. If a cell in the test dataset is confidently assigned to a particular label, we would expect it to have strong expression of that label's markers. At the very least, it should exhibit upregulation of those markers relative to cells assigned to other labels. ```{r} all.markers <- metadata(pred.grun)$de.genes sceG$labels <- pred.grun$labels # Beta cell-related markers library(scater) plotHeatmap(sceG, order_columns_by="labels", features=unique(unlist(all.markers$beta))) ``` # FAQs *How can I use this with my `Seurat`, `SingleCellExperiment`, or `cell_data_set` object?* `r Biocpkg("SingleR")` is workflow agnostic - all it needs is normalized counts. An example showing how to map its results back to common single-cell data objects is available in the [README](https://github.com/LTLA/SingleR/blob/master/README.md). *Where can I find reference sets appropriate for my data?* `r Biocpkg("celldex")` contains many built-in references that were previously in `r Biocpkg("SingleR")` but have been migrated into a separate package for more general use by other Bioconductor packages. `r Biocpkg("scRNAseq")` contains many single-cell datasets, many of which contain the authors' manual annotations. `r Biocpkg("ArrayExpress")` and `r Biocpkg("GEOquery")` can be used to download any of the bulk or single-cell datasets in [ArrayExpress](https://www.ebi.ac.uk/arrayexpress) or [GEO](https://www.ncbi.nlm.nih.gov/geo/), respectively. *Where can I get more help?* It is likely that your questions is already answered by the function documentation (e.g., `?SingleR`). Further explanations on the reasoning behind certain functions can be found in the [book](https://ltla.github.io/SingleRBook/). If this is not sufficient, we recommend posting an issue on the [Bioconductor support site](https://support.bioconductor.org) or [the GitHub repository](https://github.com/LTLA/SingleR) for this package. Be sure to include your session information and a minimal reproducible example. # Session information ```{r} sessionInfo() ``` # References