Chapter 1 Introduction

1.1 Motivation

The Bioconductor package SingleR implements an automatic annotation method for single-cell RNA sequencing (scRNA-seq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it assigns those labels to new cells from a test dataset based on similarities in their expression profiles. This provides a convenient way of transferring biological knowledge across datasets, allowing users to leverage the domain expertise implicit in the creation of each reference. The most common application of SingleR involves predicting cell type (or “state”, or “kind”) in a new dataset, a process that is facilitated by the availability of curated references and compatibility with user-supplied datasets. In this manner, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this knowledge can be propagated to new datasets in an automated manner.

1.2 Method description

SingleR can be considered a robust variant of nearest-neighbors classification, with some tweaks to improve resolution for closely related labels. For each test cell:

  1. We compute the Spearman correlation between its expression profile and that of each reference sample. The use of Spearman’s correlation provides a measure of robustness to batch effects across datasets. The calculation only uses the union of marker genes identified by pairwise comparisons between labels in the reference data, so as to improve resolution of separation between labels.
  2. We define the per-label score as a fixed quantile (by default, 0.8) of the correlations across all samples with that label. This accounts for differences in the number of reference samples for each label, which interferes with simpler flavors of nearest neighbor classification; it also avoids penalizing classifications to heterogeneous labels by only requiring a good match to a minority of samples.
  3. We repeat the score calculation for all labels in the reference dataset. The label with the highest score is used as SingleR’s prediction for this cell.
  4. We optionally perform a fine-tuning step to improve resolution between closely related labels. The reference dataset is subsetted to only include labels with scores close to the maximum; scores are recomputed using only marker genes for the subset of labels, thus focusing on the most relevant features; and this process is iterated until only one label remains.

1.3 Quick start

We will demonstrate the use of SingleR() on a well-known 10X Genomics dataset (Zheng et al. 2017) with the Human Primary Cell Atlas dataset (Mabbott et al. 2013) as the reference.

# Loading test data.
library(TENxPBMCData)
new.data <- TENxPBMCData("pbmc4k")

# Loading reference data with Ensembl annotations.
library(celldex)
ref.data <- HumanPrimaryCellAtlasData(ensembl=TRUE)

# Performing predictions.
library(SingleR)
predictions <- SingleR(test=new.data, assay.type.test=1, 
    ref=ref.data, labels=ref.data$label.main)

table(predictions$labels)
## 
##           B_cell              CMP               DC              GMP 
##              606                8                1                2 
##         Monocyte          NK_cell        Platelets Pre-B_cell_CD34- 
##             1164              217                3               46 
##          T_cells 
##             2293

And that’s it, really.

1.4 Where to get help

Questions on the general use of SingleR should be posted to the Bioconductor support site. Please send requests for general assistance and advice to the support site rather than to the individual authors. Bug reports or feature requests should be made to the GitHub repository; well-considered suggestions for improvements are always welcome.

Session information

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleR_2.8.0               ensembldb_2.30.0           
 [3] AnnotationFilter_1.30.0     GenomicFeatures_1.58.0     
 [5] AnnotationDbi_1.68.0        celldex_1.15.0             
 [7] TENxPBMCData_1.23.0         HDF5Array_1.34.0           
 [9] rhdf5_2.50.0                DelayedArray_0.32.0        
[11] SparseArray_1.6.0           S4Arrays_1.6.0             
[13] abind_1.4-8                 Matrix_1.7-1               
[15] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0
[17] Biobase_2.66.0              GenomicRanges_1.58.0       
[19] GenomeInfoDb_1.42.0         IRanges_2.40.0             
[21] S4Vectors_0.44.0            BiocGenerics_0.52.0        
[23] MatrixGenerics_1.18.0       matrixStats_1.4.1          
[25] BiocStyle_2.34.0            rebook_1.16.0              

loaded via a namespace (and not attached):
 [1] bitops_1.0-9              DBI_1.2.3                
 [3] httr2_1.0.5               CodeDepends_0.6.6        
 [5] rlang_1.1.4               magrittr_2.0.3           
 [7] gypsum_1.2.0              compiler_4.4.1           
 [9] RSQLite_2.3.7             dir.expiry_1.14.0        
[11] DelayedMatrixStats_1.28.0 png_0.1-8                
[13] vctrs_0.6.5               ProtGenerics_1.38.0      
[15] pkgconfig_2.0.3           crayon_1.5.3             
[17] fastmap_1.2.0             dbplyr_2.5.0             
[19] XVector_0.46.0            utf8_1.2.4               
[21] Rsamtools_2.22.0          rmarkdown_2.28           
[23] graph_1.84.0              UCSC.utils_1.2.0         
[25] purrr_1.0.2               bit_4.5.0                
[27] xfun_0.48                 beachmat_2.22.0          
[29] zlibbioc_1.52.0           cachem_1.1.0             
[31] jsonlite_1.8.9            blob_1.2.4               
[33] rhdf5filters_1.18.0       BiocParallel_1.40.0      
[35] Rhdf5lib_1.28.0           irlba_2.3.5.1            
[37] parallel_4.4.1            R6_2.5.1                 
[39] bslib_0.8.0               rtracklayer_1.66.0       
[41] jquerylib_0.1.4           Rcpp_1.0.13              
[43] bookdown_0.41             knitr_1.48               
[45] tidyselect_1.2.1          yaml_2.3.10              
[47] codetools_0.2-20          curl_5.2.3               
[49] lattice_0.22-6            tibble_3.2.1             
[51] withr_3.0.2               KEGGREST_1.46.0          
[53] evaluate_1.0.1            BiocFileCache_2.14.0     
[55] alabaster.schemas_1.6.0   ExperimentHub_2.14.0     
[57] Biostrings_2.74.0         pillar_1.9.0             
[59] BiocManager_1.30.25       filelock_1.0.3           
[61] generics_0.1.3            RCurl_1.98-1.16          
[63] BiocVersion_3.20.0        sparseMatrixStats_1.18.0 
[65] alabaster.base_1.6.0      glue_1.8.0               
[67] alabaster.ranges_1.6.0    lazyeval_0.2.2           
[69] alabaster.matrix_1.6.0    tools_4.4.1              
[71] beachmat.hdf5_1.4.0       BiocIO_1.16.0            
[73] AnnotationHub_3.14.0      BiocNeighbors_2.0.0      
[75] ScaledMatrix_1.14.0       GenomicAlignments_1.42.0 
[77] XML_3.99-0.17             grid_4.4.1               
[79] GenomeInfoDbData_1.2.13   BiocSingular_1.22.0      
[81] restfulr_0.0.15           rsvd_1.0.5               
[83] cli_3.6.3                 rappdirs_0.3.3           
[85] fansi_1.0.6               dplyr_1.1.4              
[87] alabaster.se_1.6.0        sass_0.4.9               
[89] digest_0.6.37             rjson_0.2.23             
[91] memoise_2.0.1             htmltools_0.5.8.1        
[93] lifecycle_1.0.4           httr_1.4.7               
[95] mime_0.12                 bit64_4.5.2              

Bibliography

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol. 20 (2): 163–72.

Mabbott, Neil A., J. K. Baillie, Helen Brown, Tom C. Freeman, and David A. Hume. 2013. “An expression atlas of human primary cells: Inference of gene function from coexpression networks.” BMC Genomics 14. https://doi.org/10.1186/1471-2164-14-632.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.