--- title: "An introduction to scReClassify package" author: - name: Taiyun Kim affiliation: - School of Mathematics and Statistics, The University of Sydney - Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney - Charles Perkins Centre, The University of Sydney output: BiocStyle::html_document: toc_newpage: true fig_retina: NULL package: BiocStyle vignette: > %\VignetteIndexEntry{An introduction to scReClassify package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction `scReClassify` is a post hoc cell type classification of single-cell RNA-sequencing data to fine-tune cell type annotations generated by any cell type classification procedure. Typically, cell type identification relies on human inspection using combination of prior biological knowledge and computational techniques. Due to incompleteness of our current knowledge and the subjectivity involved in this process, a small amount of cells may be subject to mislabelling. Using semi-supervised learning algorithm, adaSampling, we are able to correct cell type annotations from various degree of noise. ![Overview of scReClassify methods](https://github.com/SydneyBioX/scReClassify/raw/master/img/scReClassify.jpg) # Installation Install the latest development version from GitHub using the `devtools` package: ```{r, eval = FALSE} if (!("devtools" %in% rownames(installed.packages()))) install.packages("devtools") library(devtools) devtools::install_github("SydneyBioX/scReClassify") ``` To install the Bioconductor version of scReClassify, enter the following to your R console. ```{r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("scReClassify") ``` # Loading packages and data ```{r} suppressPackageStartupMessages({ library(scReClassify) library(DT) library(mclust) library(dplyr) library(SummarizedExperiment) library(SingleCellExperiment) }) data("gse87795_subset_sce") dat <- gse87795_subset_sce cellTypes <- gse87795_subset_sce$cellTypes ``` `gse87795_subset_sce` is a `SingleCellExperiment` object of a mouse fetal liver development data deposited at Gene Expression Omnibus respository with accession ID GSE87795. The cell type information can be found on the `colData` of the `SingleCellExperiment` object. ```{r} # Cell types table(cellTypes) # We set the number of clusters nCs <- length(table(cellTypes)) nCs # This demo dataset is already pre-processed dim(dat) ``` There are `r nCs` cell types, `r ncol(dat)` cells and `r nrow(dat)` number of genes. # Part A. scReClassify (Demonstration with synthetic mislabels) ## Dimension reduction Prior to running scReClassify, we perform dimension reduction. `matPCs` is a tool in scReClassify to simplify this process. In this function, a dimension reduced matrix is returned with `n` principal components (PCs), where `n` is the number of principal components (PCs) that by sum explains at least 70% variance. The function accepts either a `matrix` or a `SingleCellExperiment` object. If the `data` parameter is a `SingleCellExperiment` object, an `assay` variable must be specified to perform dimension reduction on the correct assay. If the `SingleCellExperiment` object `data` already has a 'PCA' in `reducedDimNames()`, the 'PCA' matrix of `n` columns are returned. ```{r} reducedDim(dat, "matPCs") <- matPCs(dat, assay = "logNorm", 0.7) ``` ## Synthetic noise (Demonstration purpose) Here in this example, we will synthetically generate varying degree of noise (10-50%) in sample labels. The purpose here is to simulate different level of mislabeling in the data. Given a cell type label `cls.truth`, `noisyCls` function will randomly select a `rho` percentage of cells from a given cell type and relabel to other cell types. Here, we create different degree of noise from 10% to 50%. ```{r} lab <- cellTypes set.seed(1) # Function to create noise in the cell type label noisyCls <- function(dat, rho, cls.truth){ cls.noisy <- cls.truth names(cls.noisy) <- colnames(dat) for(i in seq_len(length(table(cls.noisy)))) { # class label starts from 0 if (i != length(table(cls.noisy))) { cls.noisy[sample(which(cls.truth == names(table(cls.noisy))[i]), floor(sum(cls.truth == names(table(cls.noisy))[i])* rho))] <- names(table(cls.noisy))[i+1] } else { cls.noisy[sample(which(cls.truth == names(table(cls.noisy))[i]), floor(sum(cls.truth == names(table(cls.noisy))[i])* rho))] <- names(table(cls.noisy))[1] } } print(sum(cls.truth != cls.noisy)) return(cls.noisy) } cls.noisy01 <- noisyCls(t(reducedDim(dat, "matPCs")), rho=0.1, lab) cls.noisy02 <- noisyCls(t(reducedDim(dat, "matPCs")), rho=0.2, lab) cls.noisy03 <- noisyCls(t(reducedDim(dat, "matPCs")), rho=0.3, lab) cls.noisy04 <- noisyCls(t(reducedDim(dat, "matPCs")), rho=0.4, lab) cls.noisy05 <- noisyCls(t(reducedDim(dat, "matPCs")), rho=0.5, lab) ``` With `noisyCls` function, we have relabeled `r cls.noisy01`, `r cls.noisy01`, `r cls.noisy01`, `r cls.noisy01` and `r cls.noisy01` number of cells for `rho` equal to 0.1, 0.2, 0.3, 0.4 and 0.5 respectively. ## Use scReClassify to correct mislabeled cell types. Here in this example, we will only use `Support Vector machine (svm)` as base classifier. #### Benchmark evaluation To benchmark scReClassify, we perform scReclassify to all degree of noise with 10 repeats. We measure the accuracy of scReClassify and the Adjusted Rand Index (ARI) to measure the concordance of the reclassified cell type to the true cell type label. ```{r} ################################### # SVM ################################### base <- "svm" set.seed(1) result = lapply(seq_len(10), function(j) { final <- multiAdaSampling(dat, cls.noisy01, reducedDimName = "matPCs", classifier=base, percent=1, L=10)$final ari01 <- mclust::adjustedRandIndex(lab, final) acc01 <- bAccuracy(lab, final) final <- multiAdaSampling(dat, cls.noisy02, reducedDimName = "matPCs", classifier=base, percent=1, L=10)$final ari02 <- mclust::adjustedRandIndex(lab, final) acc02 <- bAccuracy(lab, final) final <- multiAdaSampling(dat, cls.noisy03, reducedDimName = "matPCs", classifier=base, percent=1, L=10)$final ari03 <- mclust::adjustedRandIndex(lab, final) acc03 <- bAccuracy(lab, final) final <- multiAdaSampling(dat, cls.noisy04, reducedDimName = "matPCs", classifier=base, percent=1, L=10)$final ari04 <- mclust::adjustedRandIndex(lab, final) acc04 <- bAccuracy(lab, final) final <- multiAdaSampling(dat, cls.noisy05, reducedDimName = "matPCs", classifier=base, percent=1, L=10)$final ari05 <- mclust::adjustedRandIndex(lab, final) acc05 <- bAccuracy(lab, final) c( acc01 = acc01, acc02 = acc02, acc03 = acc03, acc04 = acc04, acc05 = acc05, ari01 = ari01, ari02 = ari02, ari03 = ari03, ari04 = ari04, ari05 = ari05 ) }) result = do.call(rbind, result) acc = result[,seq_len(5)] colnames(acc) = seq(from=0.1,to=0.5,by=0.1) ari = result[,seq(from= 6, to = 10)] colnames(ari) = seq(from=0.1,to=0.5,by=0.1) ``` We can visualise the performance of the scReClassify. The boxes represent the accuracy and the ARI after scReClassify. The red markers indicate the baseline (prior to scReClassify). ```{r} plot.new() par(mfrow = c(1,2)) boxplot(acc, col="lightblue", main="SVM Accuracy", ylim=c(0.45, 1), xlab = "rho", ylab = "Accuracy") points(x=seq_len(5), y=c( bAccuracy(lab, cls.noisy01), bAccuracy(lab, cls.noisy02), bAccuracy(lab, cls.noisy03), bAccuracy(lab, cls.noisy04), bAccuracy(lab, cls.noisy05)), col="red3", pch=c(2,3,4,5,6), cex=1) boxplot(ari, col="lightblue", main="SVM ARI", ylim=c(0.25, 1), xlab = "rho", ylab = "ARI") points(x=seq_len(5), y=c( mclust::adjustedRandIndex(lab, cls.noisy01), mclust::adjustedRandIndex(lab, cls.noisy02), mclust::adjustedRandIndex(lab, cls.noisy03), mclust::adjustedRandIndex(lab, cls.noisy04), mclust::adjustedRandIndex(lab, cls.noisy05)), col="red3", pch=c(2,3,4,5,6), cex=1) ``` The plot shows that with scReClassify, cell type information have been refined (boxes are higher than the red markers). The scReClassified results show higher accuracy across noise levels 0.1 - 0.4 (i.e. closer to the true label). With the noise level 0.5, it is showing similar accuracy which is as expected because the initial label contains equal amount of true and false information and thus making it difficult for the algorithm to learn the true label. This shows that scReClassify is also robust to noisy cell type labels. # Part B. scReClassify (mislabeled cell type correction) scReClassify has shown promising result with the synthetic noise we have created. Here we will use scReClassify on the actual cell type label from public repository. The data we will use is a mouse fetal liver dataset from GEO with an accession ID GSE87795. ```{r} # PCA procedure reducedDim(dat, "matPCs") <- matPCs(dat, assay = "logNorm", 0.7) # run scReClassify set.seed(1) cellTypes.reclassify <- multiAdaSampling(dat, cellTypes, reducedDimName = "matPCs", classifier = "svm", percent = 1, L = 10) # Verification by marker genes End <- c("ITGA2B", "ITGB3") ``` Below is a table of cell type labels classified to a different cell types after scReClassify. ```{r} # check examples idx <- which(cellTypes.reclassify$final != cellTypes) cbind(original=cellTypes[idx], reclassify=cellTypes.reclassify$final[idx]) %>% DT::datatable() ``` Here, we visualise the expression level of the a cells that is reclassified for demontration purpose. The box plots are the marker gene expression levels grouped by cells types. The expression level of the reclassified cell (Cell ID: E13.5_C14) are highlighted as red marker. ```{r} mat <- assay(dat, "logNorm") c1 <- mat[, which(cellTypes=="Endothelial Cell")] c2 <- mat[, which(cellTypes=="Erythrocyte")] c3 <- mat[, which(cellTypes=="Hepatoblast")] c4 <- mat[, which(cellTypes=="Macrophage")] c5 <- mat[, which(cellTypes=="Megakaryocyte")] c6 <- mat[, which(cellTypes=="Mesenchymal Cell")] cs <- rainbow(length(table(cellTypes))) # (example 1 E13.5_C14) ##### par(mfrow=c(1,2)) marker <- End[1] boxplot(c1[marker,], c2[marker,], c3[marker,], c4[marker,], c5[marker,], c6[marker,], col=cs, main=marker, names=c("Others", "Others", "Others", "Orignal", "Reclassified", "Others"), las=2, xlab = "Labels", ylab = "log2FPKM") points(5, mat[marker, which(colnames(mat) %in% "E13.5_C14")], pch=16, col="red", cex=2) marker <- End[2] boxplot(c1[marker,], c2[marker,], c3[marker,], c4[marker,], c5[marker,], c6[marker,], col=cs, main=marker, names=c("Others", "Others", "Others", "Orignal", "Reclassified", "Others"), las=2, xlab = "Labels", ylab = "log2FPKM") points(5, mat[marker, which(colnames(mat) %in% "E13.5_C14")], pch=16, col="red", cex=2) ``` As shown in the boxplots above, the expression level of the reclassified cell (red dot) is similar to the expression levels of the reclassified cell types in a marker gene. This highlights that the E13.5_C14 cell has a similar expression profiles to the reclassified cell types rather than its originally labeled cell type. Thus, we were able to identify that E13.5_C14 potentially belongs to the reclassified cell type with scReClassify. # SessionInfo ```{r} sessionInfo() ```