--- bibliography: ref.bib --- # Advanced options ## Preconstructed indices Advanced users can split the `SingleR()` workflow into two separate training and classification steps. This means that training (e.g., marker detection, assembling of nearest-neighbor indices) only needs to be performed once for any reference. The resulting data structure can then be re-used across multiple classifications with different test datasets, provided the gene annotation in the test dataset is identical to or a superset of the genes in the training set. To illustrate, we will consider the DICE reference dataset [@diceRef] from the *[celldex](https://bioconductor.org/packages/3.18/celldex)* package. ```r library(celldex) dice <- DatabaseImmuneCellExpressionData(ensembl=TRUE) dice ``` ``` ## class: SummarizedExperiment ## dim: 29914 1561 ## metadata(0): ## assays(1): logcounts ## rownames(29914): ENSG00000121410 ENSG00000268895 ... ENSG00000159840 ## ENSG00000074755 ## rowData names(0): ## colnames(1561): TPM_1 TPM_2 ... TPM_101 TPM_102 ## colData names(3): label.main label.fine label.ont ``` ```r table(dice$label.fine) ``` ``` ## ## B cells, naive Monocytes, CD14+ ## 106 106 ## Monocytes, CD16+ NK cells ## 105 105 ## T cells, CD4+, TFH T cells, CD4+, Th1 ## 104 104 ## T cells, CD4+, Th17 T cells, CD4+, Th1_17 ## 104 104 ## T cells, CD4+, Th2 T cells, CD4+, memory TREG ## 104 104 ## T cells, CD4+, naive T cells, CD4+, naive TREG ## 103 104 ## T cells, CD4+, naive, stimulated T cells, CD8+, naive ## 102 104 ## T cells, CD8+, naive, stimulated ## 102 ``` Let's say we want to use the DICE reference to annotate the PBMC dataset from Chapter \@ref(introduction). ```r library(TENxPBMCData) sce <- TENxPBMCData("pbmc3k") ``` We use the `trainSingleR()` function to do all the necessary calculations that are independent of the test dataset. (Almost; see comments below about `common`.) This yields a list of various components that contains all identified marker genes and precomputed rank indices to be used in the score calculation. We can also turn on aggregation with `aggr.ref=TRUE` (Section \@ref(pseudo-bulk-aggregation)) to further reduce computational work. ```r common <- intersect(rownames(sce), rownames(dice)) library(SingleR) set.seed(2000) trained <- trainSingleR(dice[common,], labels=dice$label.fine, aggr.ref=TRUE) ``` We then use the `trained` object to annotate our dataset of interest through the `classifySingleR()` function. As we can see, this yields exactly the same result as applying `SingleR()` directly. The advantage here is that `trained` can be re-used for multiple `classifySingleR()` calls - possibly on different datasets - without having to repeat unnecessary steps when the reference is unchanged. ```r pred <- classifySingleR(sce, trained, assay.type=1) table(pred$labels) ``` ``` ## ## B cells, naive Monocytes, CD14+ ## 344 515 ## Monocytes, CD16+ NK cells ## 187 320 ## T cells, CD4+, TFH T cells, CD4+, Th1 ## 365 222 ## T cells, CD4+, Th17 T cells, CD4+, Th1_17 ## 64 62 ## T cells, CD4+, Th2 T cells, CD4+, memory TREG ## 69 169 ## T cells, CD4+, naive T cells, CD4+, naive TREG ## 115 57 ## T cells, CD8+, naive ## 211 ``` ```r # Comparing to the direct approach. set.seed(2000) direct <- SingleR(sce, ref=dice, labels=dice$label.fine, assay.type.test=1, aggr.ref=TRUE) identical(pred$labels, direct$labels) ``` ``` ## [1] TRUE ``` The big caveat is that the universe of genes in the test dataset must be a superset of that the reference. This is the reason behind the intersection to `common` genes and the subsequent subsetting of `dice`. Practical use of preconstructed indices is best combined with some prior information about the gene-level annotation; for example, we might know that we always use a particular version of the Ensembl gene models, so we would filter out any genes in the reference dataset that are not in our test datasets. ## Parallelization Parallelization is an obvious approach to increasing annotation throughput. This is done using the framework in the *[BiocParallel](https://bioconductor.org/packages/3.18/BiocParallel)* package, which provides several options for parallelization depending on the available hardware. On POSIX-compliant systems (i.e., Linux and MacOS), the simplest method is to use forking by passing `MulticoreParam()` to the `BPPARAM=` argument: ```r library(BiocParallel) pred2a <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine, BPPARAM=MulticoreParam(8)) # 8 CPUs. ``` Alternatively, one can use separate processes with `SnowParam()`, which is slower but can be used on all systems - including Windows, our old nemesis. ```r pred2b <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.fine, BPPARAM=SnowParam(8)) identical(pred2a$labels, pred2b$labels) ``` ``` ## [1] TRUE ``` When working on a cluster, passing `BatchtoolsParam()` to `SingleR()` allows us to seamlessly interface with various job schedulers like SLURM, LSF and so on. This permits heavy-duty parallelization across hundreds of CPUs for highly intensive jobs, though often some configuration is required - see the [vignette](https://bioconductor.org/packages/3.18/BiocParallel/vignettes/BiocParallel_BatchtoolsParam.pdf) for more details. ## Approximate algorithms It is possible to sacrifice accuracy to squeeze more speed out of *[SingleR](https://bioconductor.org/packages/3.18/SingleR)*. The most obvious approach is to simply turn off the fine-tuning with `fine.tune=FALSE`, which avoids the time-consuming fine-tuning iterations. When the reference labels are well-separated, this is probably an acceptable trade-off. ```r pred3a <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.main, fine.tune=FALSE) table(pred3a$labels) ``` ``` ## ## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+ ## 348 705 357 950 340 ``` Another approximation is based on the fact that the initial score calculation is done using a nearest-neighbors search. By default, this is an exact seach but we can switch to an approximate algorithm via the `BNPARAM=` argument. In the example below, we use the [Annoy algorithm](https://github.com/spotify/annoy) via the *[BiocNeighbors](https://bioconductor.org/packages/3.18/BiocNeighbors)* framework, which yields mostly similar results. (Note, though, that the Annoy method does involve a considerable amount of overhead, so for small jobs it will actually be slower than the exact search.) ```r library(BiocNeighbors) pred3b <- SingleR(sce, ref=dice, assay.type.test=1, labels=dice$label.main, fine.tune=FALSE, # for comparison with pred3a. BNPARAM=AnnoyParam()) table(pred3a$labels, pred3b$labels) ``` ``` ## ## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+ ## B cells 348 0 0 0 0 ## Monocytes 0 705 0 0 0 ## NK cells 0 0 357 0 0 ## T cells, CD4+ 0 0 0 950 0 ## T cells, CD8+ 0 0 0 0 340 ``` ## Cluster-level annotation The default philosophy of *[SingleR](https://bioconductor.org/packages/3.18/SingleR)* is to perform annotation of each individual cell in the test dataset. An alternative strategy is to perform annotation of aggregated profiles for groups or clusters of cells. To demonstrate, we will perform a quick-and-dirty clustering of our PBMC dataset with a variety of Bioconductor packages. ```r library(scuttle) sce <- logNormCounts(sce) library(scran) dec <- modelGeneVarByPoisson(sce) sce <- denoisePCA(sce, dec, subset.row=getTopHVGs(dec, n=5000)) library(bluster) colLabels(sce) <- clusterRows(reducedDim(sce), NNGraphParam()) library(scater) set.seed(117) sce <- runTSNE(sce, dimred="PCA") plotTSNE(sce, colour_by="label") ``` By passing `clusters=` to `SingleR()`, we direct the function to compute an aggregated profile per cluster. Annotation is then performed on the cluster-level profiles rather than on the single-cell level. This has the major advantage of being much faster to compute as there are obviously fewer clusters than cells; it is also easier to interpret as it directly returns the likely cell type identity of each cluster. ```r SingleR(sce, dice, clusters=colLabels(sce), labels=dice$label.main) ``` ``` ## DataFrame with 11 rows and 4 columns ## scores labels delta.next pruned.labels ## ## 1 0.1545149:0.261593:0.601408:... T cells, CD4+ 0.0295797 T cells, CD4+ ## 2 0.2095510:0.236086:0.356562:... T cells, CD4+ 0.0449275 T cells, CD4+ ## 3 0.0526260:0.271140:0.727792:... NK cells 0.3343791 NK cells ## 4 0.1590303:0.761772:0.212253:... Monocytes 0.5495187 Monocytes ## 5 0.1450583:0.782498:0.205440:... Monocytes 0.5770585 Monocytes ## 6 0.6409026:0.332129:0.224033:... B cells 0.3087738 B cells ## 7 0.2020630:0.275696:0.412119:... T cells, CD4+ 0.1159168 T cells, CD4+ ## 8 0.2275805:0.602347:0.211547:... Monocytes 0.3747668 Monocytes ## 9 0.1679136:0.753156:0.259175:... Monocytes 0.4939805 Monocytes ## 10 0.2535745:0.258933:0.328738:... T cells, CD4+ 0.0569244 T cells, CD4+ ## 11 0.0713926:0.223101:0.117047:... Monocytes 0.1060540 NA ``` This approach assumes that each cluster in the test dataset corresponds to exactly one reference label. If a cluster actually contains a mixture of multiple labels, this will not be reflected in its lone assigned label. (We note that it would be very difficult to determine the composition of the mixture from the `SingleR()` scores.) Indeed, there is no guarantee that the clustering is driven by the same factors that distinguish the reference labels, decreasing the reliability of the annotations when novel heterogeneity is present in the test dataset. The default per-cell strategy is safer and provides more information about the ambiguity of the annotations, which is important for closely related labels where a close correspondence between clusters and labels cannot be expected. ## Session information {-}
``` R version 4.3.2 Patched (2023-11-13 r85521) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scater_1.30.1 ggplot2_3.4.4 [3] bluster_1.12.0 scran_1.30.0 [5] scuttle_1.12.0 BiocNeighbors_1.20.0 [7] BiocParallel_1.36.0 SingleR_2.4.0 [9] TENxPBMCData_1.20.0 HDF5Array_1.30.0 [11] rhdf5_2.46.1 DelayedArray_0.28.0 [13] SparseArray_1.2.2 S4Arrays_1.2.0 [15] abind_1.4-5 Matrix_1.6-4 [17] SingleCellExperiment_1.24.0 ensembldb_2.26.0 [19] AnnotationFilter_1.26.0 GenomicFeatures_1.54.1 [21] AnnotationDbi_1.64.1 celldex_1.12.0 [23] SummarizedExperiment_1.32.0 Biobase_2.62.0 [25] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1 [27] IRanges_2.36.0 S4Vectors_0.40.2 [29] BiocGenerics_0.48.1 MatrixGenerics_1.14.0 [31] matrixStats_1.1.0 BiocStyle_2.30.0 [33] rebook_1.12.0 loaded via a namespace (and not attached): [1] later_1.3.1 BiocIO_1.12.0 [3] bitops_1.0-7 filelock_1.0.2 [5] tibble_3.2.1 CodeDepends_0.6.5 [7] graph_1.80.0 XML_3.99-0.16 [9] lifecycle_1.0.4 edgeR_4.0.2 [11] lattice_0.22-5 magrittr_2.0.3 [13] limma_3.58.1 sass_0.4.7 [15] rmarkdown_2.25 jquerylib_0.1.4 [17] yaml_2.3.7 metapod_1.10.0 [19] httpuv_1.6.12 cowplot_1.1.1 [21] DBI_1.1.3 zlibbioc_1.48.0 [23] Rtsne_0.16 purrr_1.0.2 [25] RCurl_1.98-1.13 rappdirs_0.3.3 [27] GenomeInfoDbData_1.2.11 ggrepel_0.9.4 [29] irlba_2.3.5.1 dqrng_0.3.2 [31] DelayedMatrixStats_1.24.0 codetools_0.2-19 [33] xml2_1.3.6 tidyselect_1.2.0 [35] farver_2.1.1 ScaledMatrix_1.10.0 [37] viridis_0.6.4 BiocFileCache_2.10.1 [39] GenomicAlignments_1.38.0 jsonlite_1.8.8 [41] ellipsis_0.3.2 tools_4.3.2 [43] progress_1.2.2 snow_0.4-4 [45] Rcpp_1.0.11 glue_1.6.2 [47] gridExtra_2.3 xfun_0.41 [49] dplyr_1.1.4 withr_2.5.2 [51] BiocManager_1.30.22 fastmap_1.1.1 [53] rhdf5filters_1.14.1 fansi_1.0.5 [55] digest_0.6.33 rsvd_1.0.5 [57] R6_2.5.1 mime_0.12 [59] colorspace_2.1-0 biomaRt_2.58.0 [61] RSQLite_2.3.3 utf8_1.2.4 [63] generics_0.1.3 rtracklayer_1.62.0 [65] prettyunits_1.2.0 httr_1.4.7 [67] pkgconfig_2.0.3 gtable_0.3.4 [69] blob_1.2.4 XVector_0.42.0 [71] htmltools_0.5.7 bookdown_0.37 [73] ProtGenerics_1.34.0 scales_1.3.0 [75] png_0.1-8 knitr_1.45 [77] rjson_0.2.21 curl_5.1.0 [79] cachem_1.0.8 stringr_1.5.1 [81] BiocVersion_3.18.1 parallel_4.3.2 [83] vipor_0.4.5 restfulr_0.0.15 [85] pillar_1.9.0 grid_4.3.2 [87] vctrs_0.6.5 promises_1.2.1 [89] BiocSingular_1.18.0 dbplyr_2.4.0 [91] beachmat_2.18.0 xtable_1.8-4 [93] cluster_2.1.6 beeswarm_0.4.0 [95] evaluate_0.23 cli_3.6.1 [97] locfit_1.5-9.8 compiler_4.3.2 [99] Rsamtools_2.18.0 rlang_1.1.2 [101] crayon_1.5.2 labeling_0.4.3 [103] ggbeeswarm_0.7.2 stringi_1.8.2 [105] viridisLite_0.4.2 munsell_0.5.0 [107] Biostrings_2.70.1 lazyeval_0.2.2 [109] dir.expiry_1.10.0 ExperimentHub_2.10.0 [111] hms_1.1.3 sparseMatrixStats_1.14.0 [113] bit64_4.0.5 Rhdf5lib_1.24.0 [115] KEGGREST_1.42.0 statmod_1.5.0 [117] shiny_1.8.0 highr_0.10 [119] interactiveDisplayBase_1.40.0 AnnotationHub_3.10.0 [121] igraph_1.5.1 memoise_2.0.1 [123] bslib_0.6.1 bit_4.0.5 ```