Chapter 5 Analysis overview
5.1 Outline
This chapter provides an overview of the framework of a typical scRNA-seq analysis workflow (Figure 5.1).
In the simplest case, the workflow has the following form:
- We compute quality control metrics to remove low-quality cells that would interfere with downstream analyses. These cells may have been damaged during processing or may not have been fully captured by the sequencing protocol. Common metrics includes the total counts per cell, the proportion of spike-in or mitochondrial reads and the number of detected features.
- We convert the counts into normalized expression values to eliminate cell-specific biases (e.g., in capture efficiency). This allows us to perform explicit comparisons across cells in downstream steps like clustering. We also apply a transformation, typically log, to adjust for the mean-variance relationship.
- We perform feature selection to pick a subset of interesting features for downstream analysis. This is done by modelling the variance across cells for each gene and retaining genes that are highly variable. The aim is to reduce computational overhead and noise from uninteresting genes.
- We apply dimensionality reduction to compact the data and further reduce noise. Principal components analysis is typically used to obtain an initial low-rank representation for more computational work, followed by more aggressive methods like \(t\)-stochastic neighbor embedding for visualization purposes.
- We cluster cells into groups according to similarities in their (normalized) expression profiles. This aims to obtain groupings that serve as empirical proxies for distinct biological states. We typically interpret these groupings by identifying differentially expressed marker genes between clusters.
Subsequent chapters will describe each analysis step in more detail.
5.2 Quick start (simple)
Here, we use the a droplet-based retina dataset from Macosko et al. (2015), provided in the scRNAseq package. This starts from a count matrix and finishes with clusters (Figure 5.2) in preparation for biological interpretation. Similar workflows are available in abbreviated form in later parts of the book.
library(scRNAseq)
sce <- MacoskoRetinaData()
# Quality control (using mitochondrial genes).
library(scater)
is.mito <- grepl("^MT-", rownames(sce))
qcstats <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
filtered <- quickPerCellQC(qcstats, percent_subsets="subsets_Mito_percent")
sce <- sce[, !filtered$discard]
# Normalization.
sce <- logNormCounts(sce)
# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvg <- getTopHVGs(dec, prop=0.1)
# PCA.
library(scater)
set.seed(1234)
sce <- runPCA(sce, ncomponents=25, subset_row=hvg)
# Clustering.
library(bluster)
colLabels(sce) <- clusterCells(sce, use.dimred='PCA',
BLUSPARAM=NNGraphParam(cluster.fun="louvain"))
# Visualization.
sce <- runUMAP(sce, dimred = 'PCA')
plotUMAP(sce, colour_by="label")
5.3 Quick start (multiple batches)
Here we use the pancreas Smart-seq2 dataset from Segerstolpe et al. (2016), again provided in the scRNAseq package. This starts from a count matrix and finishes with clusters (Figure 5.2) with some additional tweaks to eliminate uninteresting batch effects between individuals. Note that a more elaborate analysis of the same dataset with justifications for each step is available in Workflow Chapter 8.
sce <- SegerstolpePancreasData()
# Quality control (using ERCCs).
qcstats <- perCellQCMetrics(sce)
filtered <- quickPerCellQC(qcstats, percent_subsets="altexps_ERCC_percent")
sce <- sce[, !filtered$discard]
# Normalization.
sce <- logNormCounts(sce)
# Feature selection, blocking on the individual of origin.
dec <- modelGeneVar(sce, block=sce$individual)
hvg <- getTopHVGs(dec, prop=0.1)
# Batch correction.
library(batchelor)
set.seed(1234)
sce <- correctExperiments(sce, batch=sce$individual,
subset.row=hvg, correct.all=TRUE)
# Clustering.
colLabels(sce) <- clusterCells(sce, use.dimred='corrected')
# Visualization.
sce <- runUMAP(sce, dimred = 'corrected')
gridExtra::grid.arrange(
plotUMAP(sce, colour_by="label"),
plotUMAP(sce, colour_by="individual"),
ncol=2
)
Session Info
R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] batchelor_1.23.0 bluster_1.17.0
[3] scran_1.35.0 scater_1.35.0
[5] ggplot2_3.5.1 scuttle_1.17.0
[7] scRNAseq_2.21.0 SingleCellExperiment_1.29.1
[9] SummarizedExperiment_1.37.0 Biobase_2.67.0
[11] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
[13] IRanges_2.41.0 S4Vectors_0.45.1
[15] BiocGenerics_0.53.1 generics_0.1.3
[17] MatrixGenerics_1.19.0 matrixStats_1.4.1
[19] BiocStyle_2.35.0 rebook_1.17.0
loaded via a namespace (and not attached):
[1] jsonlite_1.8.9 CodeDepends_0.6.6
[3] magrittr_2.0.3 ggbeeswarm_0.7.2
[5] GenomicFeatures_1.59.1 gypsum_1.3.0
[7] farver_2.1.2 rmarkdown_2.29
[9] BiocIO_1.17.0 zlibbioc_1.53.0
[11] vctrs_0.6.5 DelayedMatrixStats_1.29.0
[13] memoise_2.0.1 Rsamtools_2.23.0
[15] RCurl_1.98-1.16 htmltools_0.5.8.1
[17] S4Arrays_1.7.1 AnnotationHub_3.15.0
[19] curl_6.0.0 BiocNeighbors_2.1.0
[21] Rhdf5lib_1.29.0 SparseArray_1.7.1
[23] rhdf5_2.51.0 sass_0.4.9
[25] alabaster.base_1.7.1 bslib_0.8.0
[27] alabaster.sce_1.7.0 httr2_1.0.6
[29] cachem_1.1.0 ResidualMatrix_1.17.0
[31] GenomicAlignments_1.43.0 igraph_2.1.1
[33] lifecycle_1.0.4 pkgconfig_2.0.3
[35] rsvd_1.0.5 Matrix_1.7-1
[37] R6_2.5.1 fastmap_1.2.0
[39] GenomeInfoDbData_1.2.13 digest_0.6.37
[41] colorspace_2.1-1 AnnotationDbi_1.69.0
[43] dqrng_0.4.1 irlba_2.3.5.1
[45] ExperimentHub_2.15.0 RSQLite_2.3.7
[47] beachmat_2.23.0 labeling_0.4.3
[49] filelock_1.0.3 fansi_1.0.6
[51] httr_1.4.7 abind_1.4-8
[53] compiler_4.5.0 bit64_4.5.2
[55] withr_3.0.2 BiocParallel_1.41.0
[57] viridis_0.6.5 DBI_1.2.3
[59] HDF5Array_1.35.1 alabaster.ranges_1.7.0
[61] alabaster.schemas_1.7.0 rappdirs_0.3.3
[63] DelayedArray_0.33.1 rjson_0.2.23
[65] tools_4.5.0 vipor_0.4.7
[67] beeswarm_0.4.0 glue_1.8.0
[69] restfulr_0.0.15 rhdf5filters_1.19.0
[71] grid_4.5.0 cluster_2.1.6
[73] gtable_0.3.6 ensembldb_2.31.0
[75] metapod_1.15.0 BiocSingular_1.23.0
[77] ScaledMatrix_1.15.0 utf8_1.2.4
[79] XVector_0.47.0 RcppAnnoy_0.0.22
[81] ggrepel_0.9.6 BiocVersion_3.21.1
[83] pillar_1.9.0 limma_3.63.2
[85] dplyr_1.1.4 BiocFileCache_2.15.0
[87] lattice_0.22-6 FNN_1.1.4.1
[89] rtracklayer_1.67.0 bit_4.5.0
[91] tidyselect_1.2.1 locfit_1.5-9.10
[93] Biostrings_2.75.1 knitr_1.49
[95] gridExtra_2.3 bookdown_0.41
[97] ProtGenerics_1.39.0 edgeR_4.5.0
[99] xfun_0.49 statmod_1.5.0
[101] UCSC.utils_1.3.0 lazyeval_0.2.2
[103] yaml_2.3.10 evaluate_1.0.1
[105] codetools_0.2-20 tibble_3.2.1
[107] alabaster.matrix_1.7.0 BiocManager_1.30.25
[109] graph_1.85.0 cli_3.6.3
[111] uwot_0.2.2 munsell_0.5.1
[113] jquerylib_0.1.4 Rcpp_1.0.13-1
[115] dir.expiry_1.15.0 dbplyr_2.5.0
[117] png_0.1-8 XML_3.99-0.17
[119] parallel_4.5.0 blob_1.2.4
[121] AnnotationFilter_1.31.0 sparseMatrixStats_1.19.0
[123] bitops_1.0-9 viridisLite_0.4.2
[125] alabaster.se_1.7.0 scales_1.3.0
[127] crayon_1.5.3 rlang_1.1.4
[129] cowplot_1.1.3 KEGGREST_1.47.0