--- title: "scds:**s**ingle **c**ell **d**oublet **s**coring: In-silico doublet annotation for single cell RNA sequencing data" # shorttitle: "Introduction to scds" # author: # - name: Abha Bais # affiliation: &id Department of Developmental Biology, University of Pittsburgh School of Medicine # - name: Dennis Kostka # affiliation: *id Department of Computational and Systems Biology, Center for Evolutionary Biology and Medicine, University of Pittsburgh School of Medicine # date: "`r Sys.Date()`" # package: scds output: BiocStyle::html_document: toc_float: false BiocStyle::pdf_document: default vignette: > %\VignetteIndexEntry{Introduction to the scds package} %\VignetteEngine{knitr::rmarkdown} %VignetteEncoding{UTF-8} --- # Introduction In this vignette, we provide an overview of the basic functionality and usage of the `scds` package, which interfaces with ``SingleCellExperiment`` objects. # Installation Install the `scds` package using Bioconductor: ```{r, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("scds", version = "3.9") ``` Or from github: ```{r, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE} library(devtools) devtools::install_github('kostkalab/scds') ``` # Quick start `scds` takes as input a `SingleCellExperiment` object (see here `r BiocStyle::Biocpkg("SingleCellExperiment")`), where raw counts are stored in a ```counts``` assay, i.e. ```assay(sce,"counts")```. An example dataset created by sub-sampling the cell-hashing cell-lines data set (see https://satijalab.org/seurat/hashing_vignette.html) is included with the package and accessible via ```data("sce")```.Note that `scds` is designed to workd with larger datasets, but for the purposes of this vignette, we work with a smaller example dataset. We apply ```scds``` to this data and compare/visualize reasults: ## Example data set Get example data set provided with the package. ```{r prelims, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE} library(scds) library(scater) library(rsvd) library(Rtsne) library(cowplot) set.seed(30519) data("sce_chcl") sce = sce_chcl #- less typing dim(sce) ``` We see it contains 2,000 genes and 2,000 cells, 216 of which are identified as doublets: ```{r doublets, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE} table(sce$hto_classification_global) ``` We can visualize cells/doublets after projecting into two dimensions: ```{r proj, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE} logcounts(sce) = log1p(counts(sce)) vrs = apply(logcounts(sce),1,var) pc = rpca(t(logcounts(sce)[order(vrs,decreasing=TRUE)[1:100],])) ts = Rtsne(pc$x[,1:10],verb=FALSE) reducedDim(sce,"tsne") = ts$Y; rm(ts,vrs,pc) plotReducedDim(sce,"tsne",color_by="hto_classification_global") ``` ## Computational doublet annotation We now run the ```scds``` doublet annotation approaches. Briefly, we identify doublets in two complementary ways: `cxds` is based on co-expression of gene pairs and works with absence/presence calls only, while `bcds` uses the full count information and a binary classification approach using artificially generated doublets. `cxds_bcds_hybrid` combines both approaches, for more details please consult [(this manuscript)](https://doi.org/10.1101/564021). Each of the three methods returns a doublet score, with higher scores indicating more "doublet-like" barcodes. ```{r scds, eval = TRUE, echo = TRUE, message = FALSE, warning = FALSE} #- Annotate doublet using co-expression based doublet scoring: sce = cxds(sce,retRes = TRUE) sce = bcds(sce,retRes = TRUE,verb=TRUE) sce = cxds_bcds_hybrid(sce) par(mfcol=c(1,3)) boxplot(sce$cxds_score ~ sce$doublet_true_labels, main="cxds") boxplot(sce$bcds_score ~ sce$doublet_true_labels, main="bcds") boxplot(sce$hybrid_score ~ sce$doublet_true_labels, main="hybrid") ``` ## Visualizing gene pairs For ```cxds``` we can identify and visualize gene pairs driving doublet annoataions, with the expectation that the two genes in a pair might mark different types of cells ([see manuscript](https://doi.org/10.1101/564021)). In the following we look at the top three pairs, each gene pair is a row in the plot below: ```{r pairplot, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE} scds = top3 = metadata(sce)$cxds$topPairs[1:3,] rs = rownames(sce) hb = rowData(sce)$cxds_hvg_bool ho = rowData(sce)$cxds_hvg_ordr[hb] hgs = rs[ho] l1 = ggdraw() + draw_text("Pair 1", x = 0.5, y = 0.5) p1 = plotReducedDim(sce,"tsne",color_by=hgs[top3[1,1]]) p2 = plotReducedDim(sce,"tsne",color_by=hgs[top3[1,2]]) l2 = ggdraw() + draw_text("Pair 2", x = 0.5, y = 0.5) p3 = plotReducedDim(sce,"tsne",color_by=hgs[top3[2,1]]) p4 = plotReducedDim(sce,"tsne",color_by=hgs[top3[2,2]]) l3 = ggdraw() + draw_text("Pair 3", x = 0.5, y = 0.5) p5 = plotReducedDim(sce,"tsne",color_by=hgs[top3[3,1]]) p6 = plotReducedDim(sce,"tsne",color_by=hgs[top3[3,2]]) plot_grid(l1,p1,p2,l2,p3,p4,l3,p5,p6,ncol=3, rel_widths = c(1,2,2)) ``` # Session Info ```{r sessionInfo} sessionInfo() ```