--- output: html_document bibliography: ref.bib --- # Correction diagnostics ## Motivation Ideally, batch correction would remove the differences between batches while preserving the heterogeneity within batches. In the corrected data, cells of the same type should be intermingled and indistinguishable even if they come from different batches, while cells of different types should remain well-separated. Unfortunately, we rarely have prior knowledge of the underlying types of the cells, making it difficult to unambiguously determine whether differences between batches represent geniune biology or incomplete correction. Indeed, it could be said that all correction methods are at least somewhat incorrect (Section \@ref(sacrificing-differences)), though that not preclude them from being useful. In this chapter, we will describe a few diagnostics that, when combined with biological context, can be used to identify potential problems with the correction. We will recycle the `mnn.out`, `tab.mnn` and `clusters.mnn` objects that were produced in Section \@ref(mnn-correction). For the sake of brevity, we will not reproduce the relevant code - see Chapter \@ref(integrating-datasets) for more details. ```r mnn.out ``` ``` ## class: SingleCellExperiment ## dim: 13429 6791 ## metadata(2): merge.info pca.info ## assays(1): reconstructed ## rownames(13429): ENSG00000239945 ENSG00000228463 ... ENSG00000198695 ## ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## mainExpName: NULL ## altExpNames(0): ``` ```r tab.mnn ``` ``` ## Batch ## Cluster 1 2 ## 1 337 606 ## 2 152 181 ## 3 261 505 ## 4 12 4 ## 5 655 589 ## 6 13 10 ## 7 303 655 ## 8 169 115 ## 9 11 55 ## 10 17 66 ## 11 19 62 ## 12 92 219 ## 13 144 92 ## 14 416 979 ## 15 4 36 ## 16 4 8 ``` ## Mixing between batches The simplest way to quantify the degree of mixing across batches is to test each cluster for imbalances in the contribution from each batch [@buttner2019test]. This is done by applying Pearson's chi-squared test to each row of `tab.mnn` where the expected proportions under the null hypothesis proportional to the total number of cells per batch. Low $p$-values indicate that there are significant imbalances In practice, this strategy is most suited to experiments where the batches are technical replicates with identical population composition; it is usually too stringent for batches with more biological variation, where proportions can genuinely vary even in the absence of any batch effect. ```r library(batchelor) p.values <- clusterAbundanceTest(tab.mnn) p.values ``` ``` ## 1 2 3 4 5 6 7 8 ## 9.047e-02 6.700e-03 1.341e-02 2.627e-03 5.625e-25 7.427e-02 1.555e-05 2.742e-13 ## 9 10 11 12 13 14 15 16 ## 2.801e-04 7.807e-04 5.633e-03 1.356e-03 9.512e-13 4.057e-11 2.197e-04 7.172e-01 ``` We favor a more qualitative approach where we compute the variance in the log-normalized abundances across batches for each cluster. A highly variable cluster has large relative differences in cell abundance across batches; this may be an indicator for incomplete batch correction, e.g., if the same cell type in two batches was not combined into a single cluster in the corrected data. We can then focus our attention on these clusters to determine whether they might pose a problem for downstream interpretation. Of course, a large variance can also be caused by genuinely batch-specific populations, so some prior knowledge about the biological context is necessary to distinguish between these two possibilities. For the PBMC dataset, none of the most variable clusters are overtly batch-specific, consistent with the fact that our batches are effectively replicates. ```r rv <- clusterAbundanceVar(tab.mnn) # Also printing the percentage of cells in each cluster in each batch: percent <- t(t(tab.mnn)/colSums(tab.mnn)) * 100 df <- DataFrame(Batch=unclass(percent), var=rv) df[order(df$var, decreasing=TRUE),] ``` ``` ## DataFrame with 16 rows and 3 columns ## Batch.1 Batch.2 var ## ## 15 0.153315 0.860832 0.934778 ## 13 5.519356 2.199904 0.745173 ## 9 0.421617 1.315160 0.682672 ## 8 6.477578 2.749880 0.665221 ## 10 0.651591 1.578192 0.485170 ## ... ... ... ... ## 7 11.613645 15.662363 0.08913828 ## 2 5.825987 4.328073 0.08199220 ## 3 10.003833 12.075562 0.03496472 ## 1 12.916826 14.490674 0.01318296 ## 16 0.153315 0.191296 0.00689661 ``` ## Preserving biological heterogeneity Another useful diagnostic check is to compare the pre-correction clustering of each batch to the clustering of the same cells in the corrected data. Accurate data integration should preserve population structure within each batch as there is no batch effect to remove between cells in the same batch. This check complements the previously mentioned diagnostics that only focus on the removal of differences between batches. Specifically, it protects us against scenarios where the correction method simply aggregates all cells together, which would achieve perfect mixing but also discard the biological heterogeneity of interest. To illustrate, we will use clustering results from the analysis of each batch of the PBMC dataset:
```r #--- loading ---# library(TENxPBMCData) all.sce <- list( pbmc3k=TENxPBMCData('pbmc3k'), pbmc4k=TENxPBMCData('pbmc4k'), pbmc8k=TENxPBMCData('pbmc8k') ) #--- quality-control ---# library(scater) stats <- high.mito <- list() for (n in names(all.sce)) { current <- all.sce[[n]] is.mito <- grep("MT", rowData(current)$Symbol_TENx) stats[[n]] <- perCellQCMetrics(current, subsets=list(Mito=is.mito)) high.mito[[n]] <- isOutlier(stats[[n]]$subsets_Mito_percent, type="higher") all.sce[[n]] <- current[,!high.mito[[n]]] } #--- normalization ---# all.sce <- lapply(all.sce, logNormCounts) #--- variance-modelling ---# library(scran) all.dec <- lapply(all.sce, modelGeneVar) all.hvgs <- lapply(all.dec, getTopHVGs, prop=0.1) #--- dimensionality-reduction ---# library(BiocSingular) set.seed(10000) all.sce <- mapply(FUN=runPCA, x=all.sce, subset_row=all.hvgs, MoreArgs=list(ncomponents=25, BSPARAM=RandomParam()), SIMPLIFY=FALSE) set.seed(100000) all.sce <- lapply(all.sce, runTSNE, dimred="PCA") set.seed(1000000) all.sce <- lapply(all.sce, runUMAP, dimred="PCA") #--- clustering ---# for (n in names(all.sce)) { g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA') clust <- igraph::cluster_walktrap(g)$membership colLabels(all.sce[[n]]) <- factor(clust) } ```
```r pbmc3k <- all.sce$pbmc3k table(colLabels(pbmc3k)) ``` ``` ## ## 1 2 3 4 5 6 7 8 9 10 ## 475 636 153 476 164 31 159 164 340 11 ``` ```r pbmc4k <- all.sce$pbmc4k table(colLabels(pbmc4k)) ``` ``` ## ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 127 594 518 775 211 394 187 993 55 201 91 36 ``` Ideally, we should see a many-to-1 mapping where the post-correction clustering is nested inside the pre-correction clustering. This indicates that any within-batch structure was preserved after correction while acknowledging that greater resolution is possible with more cells. We quantify this mapping using the `nestedClusters()` function from the *[bluster](https://bioconductor.org/packages/3.19/bluster)* package, which identifies the nesting of post-correction clusters within the pre-correction clusters. Well-nested clusters have high `max` values, indicating that most of their cells are derived from a single pre-correction cluster. ```r library(bluster) tab3k <- nestedClusters(ref=paste("before", colLabels(pbmc3k)), alt=paste("after", clusters.mnn[mnn.out$batch==1])) tab3k$alt.mapping ``` ``` ## DataFrame with 16 rows and 2 columns ## max which ## ## after 1 0.994065 before 9 ## after 10 0.647059 before 6 ## after 11 1.000000 before 6 ## after 12 0.945652 before 2 ## after 13 1.000000 before 7 ## ... ... ... ## after 5 0.697710 before 4 ## after 6 0.692308 before 4 ## after 7 0.996700 before 1 ## after 8 0.923077 before 1 ## after 9 1.000000 before 1 ``` We can visualize this mapping for the PBMC dataset in Figure \@ref(fig:heat-after-mnn). Ideally, each row should have a single dominant entry close to unity. Horizontal stripes are more concerning as these indicate that multiple pre-correction clusters were merged together, though the exact level of concern will depend on whether specific clusters of interest are gained or lost. In practice, more discrepancies can be expected even when the correction is perfect, due to the existence of closely related clusters that were arbitrarily separated in the within-batch clustering. ```r library(pheatmap) # For the first batch: heat3k <- pheatmap(tab3k$proportions, cluster_row=FALSE, cluster_col=FALSE, main="PBMC 3K comparison", silent=TRUE) # For the second batch: tab4k <- nestedClusters(ref=paste("before", colLabels(pbmc4k)), alt=paste("after", clusters.mnn[mnn.out$batch==2])) heat4k <- pheatmap(tab4k$proportions, cluster_row=FALSE, cluster_col=FALSE, main="PBMC 4K comparison", silent=TRUE) gridExtra::grid.arrange(heat3k[[4]], heat4k[[4]]) ```
Comparison between the clusterings obtained before (columns) and after MNN correction (rows). One heatmap is generated for each of the PBMC 3K and 4K datasets, where each entry is colored according to the proportion of cells distributed along each row (i.e., the row sums equal unity).

(\#fig:heat-after-mnn)Comparison between the clusterings obtained before (columns) and after MNN correction (rows). One heatmap is generated for each of the PBMC 3K and 4K datasets, where each entry is colored according to the proportion of cells distributed along each row (i.e., the row sums equal unity).

We use the adjusted Rand index ([Advanced Section 5.3](http://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#comparing-different-clusterings)) to quantify the agreement between the clusterings before and after batch correction. Recall that larger indices are more desirable as this indicates that within-batch heterogeneity is preserved, though this must be balanced against the ability of each method to actually perform batch correction. ```r library(bluster) ri3k <- pairwiseRand(clusters.mnn[mnn.out$batch==1], colLabels(pbmc3k), mode="index") ri3k ``` ``` ## [1] 0.6482 ``` ```r ri4k <- pairwiseRand(clusters.mnn[mnn.out$batch==2], colLabels(pbmc4k), mode="index") ri4k ``` ``` ## [1] 0.8335 ``` We can also break down the ARI into per-cluster ratios for more detailed diagnostics (Figure \@ref(fig:rand-after-mnn)). For example, we could see low ratios off the diagonal if distinct clusters in the within-batch clustering were incorrectly aggregated in the merged clustering. Conversely, we might see low ratios on the diagonal if the correction inflated or introduced spurious heterogeneity inside a within-batch cluster. ```r # For the first batch. tab <- pairwiseRand(colLabels(pbmc3k), clusters.mnn[mnn.out$batch==1]) heat3k <- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE, col=rev(viridis::magma(100)), main="PBMC 3K probabilities", silent=TRUE) # For the second batch. tab <- pairwiseRand(colLabels(pbmc4k), clusters.mnn[mnn.out$batch==2]) heat4k <- pheatmap(tab, cluster_row=FALSE, cluster_col=FALSE, col=rev(viridis::magma(100)), main="PBMC 4K probabilities", silent=TRUE) gridExtra::grid.arrange(heat3k[[4]], heat4k[[4]]) ```
ARI-derived ratios for the within-batch clusters after comparison to the merged clusters obtained after MNN correction. One heatmap is generated for each of the PBMC 3K and 4K datasets.

(\#fig:rand-after-mnn)ARI-derived ratios for the within-batch clusters after comparison to the merged clusters obtained after MNN correction. One heatmap is generated for each of the PBMC 3K and 4K datasets.

## MNN-specific diagnostics For `fastMNN()`, one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the `lost.var` field in the metadata of `mnn.out`, which contains a matrix of the variance lost in each batch (column) at each merge step (row). ```r metadata(mnn.out)$merge.info$lost.var ``` ``` ## [,1] [,2] ## [1,] 0.006529 0.003314 ``` Large proportions of lost variance (>10%) suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace [@haghverdi2018batch]. In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern. Another MNN-related diagnostic involves examining the variance in the differences in expression between MNN pairs. A small variance indicates that the correction had little effect - either there was no batch effect, or any batch effect was simply a constant shift across all cells. On the other hand, a large variance indicates that the correction was highly non-linear, most likely involving subpopulation-specific batch effects. This computation is achieved using the `mnnDeltaVariance()` function on the MNN pairings produced by `fastMNN()`. ```r library(batchelor) common <- rownames(mnn.out) vars <- mnnDeltaVariance(pbmc3k[common,], pbmc4k[common,], pairs=metadata(mnn.out)$merge.info$pairs) vars[order(vars$adjusted, decreasing=TRUE),] ``` ``` ## DataFrame with 13429 rows and 4 columns ## mean total trend adjusted ## ## ENSG00000111796 0.405882 1.493872 0.659707 0.834166 ## ENSG00000257764 0.320745 1.046425 0.548442 0.497983 ## ENSG00000170345 2.542949 1.341653 0.950599 0.391054 ## ENSG00000187118 0.313330 0.917087 0.538035 0.379052 ## ENSG00000177606 1.801539 1.404074 1.035127 0.368946 ## ... ... ... ... ... ## ENSG00000204472 0.794696 0.538555 0.982401 -0.443846 ## ENSG00000011600 1.174530 0.629559 1.074839 -0.445280 ## ENSG00000101439 1.133495 0.628259 1.073971 -0.445712 ## ENSG00000158869 0.926005 0.577047 1.038913 -0.461866 ## ENSG00000105374 1.220880 0.535073 1.074666 -0.539593 ``` Such genes with large variances are particularly interesting as they exhibit complex differences between batches that may reflect real biology. For example, in Figure \@ref(fig:mnn-delta-var-pbmc), the _KLRB1_-positive clusters in the second batch lack any counterpart in the first batch, despite the two batches being replicates. This may represent some kind of batch-specific state in two otherwise identical populations, though whether this is biological or technical in nature is open for interpretation. ```r library(scater) top <- rownames(vars)[order(vars$adjusted, decreasing=TRUE)[1]] gridExtra::grid.arrange( plotExpression(pbmc3k, x="label", features=top) + ggtitle("3k"), plotExpression(pbmc4k, x="label", features=top) + ggtitle("4k"), ncol=2 ) ```
Distribution of the expression of the gene with the largest variance of MNN pair differences in each batch of the the PBMC dataset.

(\#fig:mnn-delta-var-pbmc)Distribution of the expression of the gene with the largest variance of MNN pair differences in each batch of the the PBMC dataset.

## Session Info {-}
``` R version 4.4.0 beta (2024-04-15 r86425) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 22.04.4 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scater_1.32.0 ggplot2_3.5.1 [3] scuttle_1.14.0 HDF5Array_1.32.0 [5] rhdf5_2.48.0 DelayedArray_0.30.0 [7] SparseArray_1.4.0 S4Arrays_1.4.0 [9] abind_1.4-5 Matrix_1.7-0 [11] pheatmap_1.0.12 bluster_1.14.0 [13] batchelor_1.20.0 BiocSingular_1.20.0 [15] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0 [17] Biobase_2.64.0 GenomicRanges_1.56.0 [19] GenomeInfoDb_1.40.0 IRanges_2.38.0 [21] S4Vectors_0.42.0 BiocGenerics_0.50.0 [23] MatrixGenerics_1.16.0 matrixStats_1.3.0 [25] BiocStyle_2.32.0 rebook_1.14.0 loaded via a namespace (and not attached): [1] gridExtra_2.3 CodeDepends_0.6.6 [3] rlang_1.1.3 magrittr_2.0.3 [5] compiler_4.4.0 dir.expiry_1.12.0 [7] DelayedMatrixStats_1.26.0 vctrs_0.6.5 [9] pkgconfig_2.0.3 crayon_1.5.2 [11] fastmap_1.1.1 XVector_0.44.0 [13] labeling_0.4.3 utf8_1.2.4 [15] rmarkdown_2.26 ggbeeswarm_0.7.2 [17] graph_1.82.0 UCSC.utils_1.0.0 [19] xfun_0.43 zlibbioc_1.50.0 [21] cachem_1.0.8 beachmat_2.20.0 [23] jsonlite_1.8.8 highr_0.10 [25] rhdf5filters_1.16.0 Rhdf5lib_1.26.0 [27] BiocParallel_1.38.0 irlba_2.3.5.1 [29] parallel_4.4.0 cluster_2.1.6 [31] R6_2.5.1 bslib_0.7.0 [33] RColorBrewer_1.1-3 limma_3.60.0 [35] jquerylib_0.1.4 Rcpp_1.0.12 [37] bookdown_0.39 knitr_1.46 [39] igraph_2.0.3 tidyselect_1.2.1 [41] yaml_2.3.8 viridis_0.6.5 [43] codetools_0.2-20 lattice_0.22-6 [45] tibble_3.2.1 withr_3.0.0 [47] evaluate_0.23 pillar_1.9.0 [49] BiocManager_1.30.22 filelock_1.0.3 [51] generics_0.1.3 sparseMatrixStats_1.16.0 [53] munsell_0.5.1 scales_1.3.0 [55] glue_1.7.0 metapod_1.12.0 [57] tools_4.4.0 BiocNeighbors_1.22.0 [59] ScaledMatrix_1.12.0 locfit_1.5-9.9 [61] scran_1.32.0 XML_3.99-0.16.1 [63] cowplot_1.1.3 grid_4.4.0 [65] edgeR_4.2.0 colorspace_2.1-0 [67] GenomeInfoDbData_1.2.12 beeswarm_0.4.0 [69] vipor_0.4.7 cli_3.6.2 [71] rsvd_1.0.5 fansi_1.0.6 [73] viridisLite_0.4.2 dplyr_1.1.4 [75] ResidualMatrix_1.14.0 gtable_0.3.5 [77] sass_0.4.9 digest_0.6.35 [79] ggrepel_0.9.5 dqrng_0.3.2 [81] farver_2.1.1 htmltools_0.5.8.1 [83] lifecycle_1.0.4 httr_1.4.7 [85] statmod_1.5.0 ```