--- title: Single-cell analysis toolkit for expression in R author: - name: Davis McCarthy affiliation: - EMBL European Bioinformatics Institute - name: Aaron Lun email: infinite.monkeys.with.keyboards@gmail.com date: "Revised: November 2, 2019" package: scater output: BiocStyle::html_document: toc_float: yes vignette: > %\VignetteIndexEntry{Overview of scater functionality} %\VignetteEngine{knitr::rmarkdown} %VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, results="hide"} knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE) library(BiocStyle) set.seed(10918) ``` # Introduction This document gives an introduction to and overview of the quality control functionality of the `r Biocpkg("scater")` package. `r Biocpkg("scater")` contains tools to help with the analysis of single-cell transcriptomic data, focusing on low-level steps such as quality control, normalization and visualization. It is based on the `SingleCellExperiment` class (from the `r Biocpkg("SingleCellExperiment")` package), and thus is interoperable with many other Bioconductor packages such as `r Biocpkg("scran")`, `r Biocpkg("batchelor")` and `r Biocpkg("iSEE")`. **Note:** A more comprehensive description of the use of `r Biocpkg("scater")` (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org. # Setting up the data ## Generating a `SingleCellExperiment` object We assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.). First, we create a `SingleCellExperiment` object containing the data, as demonstrated below with a famous brain dataset. Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell 'omics data. ```{r quickstart-load-data, message=FALSE, warning=FALSE} library(scRNAseq) example_sce <- ZeiselBrainData() example_sce ``` We usually expect (raw) count data to be labelled as `"counts"` in the assays, which can be easily retrieved with the `counts` accessor. Getters and setters are also provided for `exprs`, `tpm`, `cpm`, `fpkm` and versions of these with the prefix `norm_`. ```{r quickstart-add-exprs, results='hide'} str(counts(example_sce)) ``` Row and column-level metadata are easily accessed (or modified) as shown below. There are also dedicated getters and setters for size factor values (`sizeFactors()`); reduced dimensionality results (`reducedDim()`); and alternative experimental features (`altExp()`). ```{r} example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE) colData(example_sce) rowData(example_sce)$stuff <- runif(nrow(example_sce)) rowData(example_sce) ``` Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner. More details about the `SingleCellExperiment` class can be found in the documentation for `r Biocpkg("SingleCellExperiment")` package. ## Other methods of data import Count matrices stored as CSV files or equivalent can be easily read into R session using `read.table()` from _utils_ or `fread()` from the `r CRANpkg("data.table")` package. It is advisable to coerce the resulting object into a matrix before storing it in a `SingleCellExperiment` object. For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the `r CRANpkg("Matrix")` package. This is performed using the `readSparseCounts()` function and reduces memory usage by not explicitly storing zeroes in memory. Data from 10X Genomics experiments can be read in using the `read10xCounts` function from the `r Biocpkg("DropletUtils")` package. This will automatically generate a `SingleCellExperiment` with a sparse matrix, see the documentation for more details. Transcript abundances from the `kallisto` and `Salmon` pseudo-aligners can be imported using methods from the `r Biocpkg("tximeta")` package. This produces a `SummarizedExperiment` object that can be coerced into a `SingleCellExperiment` simply with `as(se, "SingleCellExperiment")`. # Quality control ## Background `r Biocpkg("scater")` provides functionality for three levels of quality control (QC): 1. QC and filtering of cells 2. QC and filtering of features (genes) 3. QC of experimental variables ## Cell-level QC ### Definition of metrics Cell-level metrics are computed by the `perCellQCMetrics()` function and include: * `sum`: total number of counts for the cell (i.e., the library size). * `detected`: the number of features for the cell that have counts above the detection limit (default of zero). * `subsets_X_percent`: percentage of all counts that come from the feature control set named `X`. ```{r} library(scater) per.cell <- perCellQCMetrics(example_sce, subsets=list(Mito=grep("mt-", rownames(example_sce)))) summary(per.cell$sum) summary(per.cell$detected) summary(per.cell$subsets_Mito_percent) ``` It is often convenient to store this in the `colData()` of our `SingleCellExperiment` object for future reference. (In fact, the `addPerCellQC()` function will do this automatically.) ```{r} colData(example_sce) <- cbind(colData(example_sce), per.cell) ``` ### Diagnostic plots Metadata variables can be plotted against each other using the `plotColData()` function, as shown below. We expect to see an increasing number of detected genes with increasing total count. Each point represents a cell that is coloured according to its tissue of origin. ```{r} plotColData(example_sce, x = "sum", y="detected", colour_by="tissue") ``` Here, we have plotted the total count for each cell against the mitochondrial content. Well-behaved cells should have a large number of expressed features and and a low percentage of expression from feature controls. High percentage expression from feature controls and few expressed features are indicative of blank and failed cells. For some variety, we have faceted by the tissue of origin. ```{r plot-pdata-pct-exprs-controls} plotColData(example_sce, x = "sum", y="subsets_Mito_percent", other_fields="tissue") + facet_wrap(~tissue) ``` ### Identifying low-quality cells Column subsetting of the `SingeCellExperiment` object will only retain the selected cells, thus removing low-quality or otherwise unwanted cells. We can identify high-quality cells to retain by setting a fixed threshold on particular metrics. For example, we could retain only cells that have at least 100,000 total counts _and_ at least 500 expressed features: ```{r} keep.total <- example_sce$sum > 1e5 keep.n <- example_sce$detected > 500 filtered <- example_sce[,keep.total & keep.n] dim(filtered) ``` The `isOutlier` function provides a more data-adaptive way of choosing these thresholds. This defines the threshold at a certain number of median absolute deviations (MADs) away from the median. Values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells. Here, we define small outliers (using `type="lower"`) for the _log_-total counts at 3 MADs from the median. ```{r} keep.total <- isOutlier(per.cell$sum, type="lower", log=TRUE) filtered <- example_sce[,keep.total] ``` Detection of outliers can be achieved more conveniently for several common metrics using the `quickPerCellQC()` function. This uses the total count, number of detected features and the percentage of counts in gene sets of diagnostic value (e.g., mitochondrial genes, spike-in transcripts) to identify which cells to discard and for what reason. ```{r} qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent") colSums(as.matrix(qc.stats)) filtered <- example_sce[,!qc.stats$discard] ``` The `isOutlier` approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, cell type. In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system. We refer readers to the `r Biocpkg("simpleSingleCell")` workflow for more details. ## Feature-level QC ### Definition of metrics Feature-level metrics are computed by the `perFeatureQCMetrics()` function and include: * `mean`: the mean count of the gene/feature across all cells. * `detected`: the percentage of cells with non-zero counts for each gene. * `subsets_Y_ratio`: ratio of mean counts between the cell control set named Y and all cells. ```{r} # Pretending that the first 10 cells are empty wells, for demonstration. per.feat <- perFeatureQCMetrics(example_sce, subsets=list(Empty=1:10)) summary(per.feat$mean) summary(per.feat$detected) summary(per.feat$subsets_Empty_ratio) ``` A more refined calculation of the average is provided by the `calculateAverage()` function, which adjusts the counts by the relative library size (or size factor) prior to taking the mean. ```{r} ave <- calculateAverage(example_sce) summary(ave) ``` We can also compute the number of cells expressing a gene directly. ```{r} summary(nexprs(example_sce, byrow=TRUE)) ``` ### Diagnostic plots We look at a plot that shows the top 50 (by default) most-expressed features. Each row in the plot below corresponds to a gene, and each bar corresponds to the expression of a gene in a single cell. The circle indicates the median expression of each gene, with which genes are sorted. By default, "expression" is defined using the feature counts (if available), but other expression values can be used instead by changing `exprs_values`. ```{r plot-highest, fig.asp=1, fig.wide=TRUE} plotHighestExprs(example_sce, exprs_values = "counts") ``` We expect to see the "usual suspects", i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A few spike-in transcripts may also be present here, though if all of the spike-ins are in the top 50, it suggests that too much spike-in RNA was added. A large number of pseudo-genes or predicted genes may indicate problems with alignment. ### Subsetting by row Genes can be removed by row subsetting of the `SingleCellExperiment` object. For example, we can filter out features (genes) that are not expressed in any cells: ```{r filter-no-exprs} keep_feature <- nexprs(example_sce, byrow=TRUE) > 0 example_sce <- example_sce[keep_feature,] dim(example_sce) ``` Other filtering can be done using existing annotation. For example, ribosomal protein genes and predicted genes can be identified (and removed) using regular expressions or biotype information. Such genes are often uninteresting when the aim is to characterize population heterogeneity. ## Variable-level QC Variable-level metrics are computed by the `getVarianceExplained()` function (after normalization, see below). This calculates the percentage of variance of each gene's expression that is explained by each variable in the `colData` of the `SingleCellExperiment` object. ```{r} example_sce <- logNormCounts(example_sce) # see below. vars <- getVarianceExplained(example_sce, variables=c("tissue", "total mRNA mol", "sex", "age")) head(vars) ``` We can then use this to determine which experimental factors are contributing most to the variance in expression. This is useful for diagnosing batch effects or to quickly verify that a treatment has an effect. ```{r} plotExplanatoryVariables(vars) ``` # Computing expression values ## Normalization for library size differences The most commonly used function is `logNormCounts()`, which calculates log~2~-transformed normalized expression values. This is done by dividing each count by its size factor, adding a pseudo-count and log-transforming. The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in `"logcounts"`. ```{r} example_sce <- logNormCounts(example_sce) assayNames(example_sce) ``` By default, the size factor is automatically computed from the library size of each cell using the `librarySizeFactors()` function. This calculation simply involves scaling the library sizes so that they have a mean of 1 across all cells. However, if size factors are explicitly provided in the `SingleCellExperiment`, they will be used by the normalization functions. ```{r} summary(librarySizeFactors(example_sce)) ``` Alternatively, we can calculate counts-per-million using the aptly-named `calculateCPM()` function. The output is most appropriately stored as an assay named `"cpm"` in the assays of the `SingleCellExperiment` object. Related functions include `calculateTPM()` and `calculateFPKM()`, which do pretty much as advertised. ```{r} cpm(example_sce) <- calculateCPM(example_sce) ``` Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay. ```{r} assay(example_sce, "normed") <- normalizeCounts(example_sce, size_factors=runif(ncol(example_sce)), pseudo_count=1.5) ``` ## Visualizing expression values The `plotExpression()` function makes it easy to plot expression values for a subset of genes or features. This can be particularly useful for further examination of features identified from differential expression testing, pseudotime analysis or other analyses. By default, it uses expression values in the `"logcounts"` assay, but this can be changed through the `exprs_values` argument. ```{r plot-expression, fig.wide=TRUE} plotExpression(example_sce, rownames(example_sce)[1:6], x = "level1class") ``` Setting `x` will determine the covariate to be shown on the x-axis. This can be a field in the column metadata or the name of a feature (to obtain the expression profile across cells). Categorical covariates will yield grouped violins as shown above, with one panel per feature. By comparison, continuous covariates will generate a scatter plot in each panel, as shown below. ```{r plot-expression-scatter} plotExpression(example_sce, rownames(example_sce)[1:6], x = rownames(example_sce)[10]) ``` The points can also be coloured, shaped or resized by the column metadata or expression values. ```{r plot-expression-col} plotExpression(example_sce, rownames(example_sce)[1:6], x = "level1class", colour_by="tissue") ``` Directly plotting the gene expression without any `x` or other visual parameters will generate a set of grouped violin plots, coloured in an aesthetically pleasing manner. ```{r plot-expression-many} plotExpression(example_sce, rownames(example_sce)[1:6]) ``` # Dimensionality reduction ## Principal components analysis Principal components analysis (PCA) is often performed to denoise and compact the data prior to downstream analyses. The `runPCA()` function provides a simple wrapper around the base machinery in `r Biocpkg("BiocSingular")` for computing PCs from log-transformed expression values. This stores the output in the `reducedDims` slot of the `SingleCellExperiment`, which can be easily retrieved (along with the percfentage of variance explained by each PC) as shown below: ```{r} example_sce <- runPCA(example_sce) str(reducedDim(example_sce, "PCA")) ``` By default, `runPCA()` uses the top 500 genes with the highest variances to compute the first PCs. This can be tuned by specifying `subset_row` to pass in an explicit set of genes of interest, and by using `ncomponents` to determine the number of components to compute. The `name` argument can also be used to change the name of the result in the `reducedDims` slot. ```{r} example_sce <- runPCA(example_sce, name="PCA2", subset_row=rownames(example_sce)[1:1000], ncomponents=25) str(reducedDim(example_sce, "PCA2")) ``` ## Other dimensionality reduction methods $t$-distributed stochastic neighbour embedding ($t$-SNE) is widely used for visualizing complex single-cell data sets. The same procedure described for PCA plots can be applied to generate $t$-SNE plots using `plotTSNE`, with coordinates obtained using `runTSNE` via the `r CRANpkg("Rtsne")` package. We strongly recommend generating plots with different random seeds and perplexity values, to ensure that any conclusions are robus t to different visualizations. ```{r plot-tsne-1comp-colby-sizeby-exprs} # Perplexity of 10 just chosen here arbitrarily. set.seed(1000) example_sce <- runTSNE(example_sce, perplexity=10) head(reducedDim(example_sce, "TSNE")) ``` A more common pattern involves using the pre-existing PCA results as input into the $t$-SNE algorithm. This is useful as it improves speed by using a low-rank approximation of the expression matrix; and reduces random noise, by focusing on the major factors of variation. The code below uses the first 10 dimensions of the previously computed PCA result to perform the $t$-SNE. ```{r plot-tsne-from-pca} set.seed(1000) example_sce <- runTSNE(example_sce, perplexity=50, dimred="PCA", n_dimred=10) head(reducedDim(example_sce, "TSNE")) ``` The same can be done for uniform manifold with approximate projection (UMAP) via the `runUMAP()` function, itself based on the `r CRANpkg("uwot")` package. ```{r} example_sce <- runUMAP(example_sce) head(reducedDim(example_sce, "UMAP")) ``` ## Visualizing reduced dimensions Any dimensionality reduction result can be plotted using the `plotReducedDim` function. Here, each point represents a cell and is coloured according to its cell type label. ```{r plot-reduceddim-4comp-colby-shapeby} plotReducedDim(example_sce, dimred = "PCA", colour_by = "level1class") ``` Some result types have dedicated wrappers for convenience, e.g., `plotTSNE()` for $t$-SNE results: ```{r plot-pca-4comp-colby-sizeby-exprs} plotTSNE(example_sce, colour_by = "Snap25") ``` The dedicated `plotPCA()` function also adds the percentage of variance explained to the axes: ```{r plot-pca-default} plotPCA(example_sce, colour_by="Mog") ``` Multiple components can be plotted in a series of pairwise plots. When more than two components are plotted, the diagonal boxes in the scatter plot matrix show the density for each component. ```{r plot-pca-4comp-colby-shapeby} example_sce <- runPCA(example_sce, ncomponents=20) plotPCA(example_sce, ncomponents = 4, colour_by = "level1class") ``` We separate the execution of these functions from the plotting to enable the same coordinates to be re-used across multiple plots. This avoids repeatedly recomputing those coordinates just to change an aesthetic across plots. # Transitioning from the `SCESet` class As of July 2017, `scater` has switched from the `SCESet` class previously defined within the package to the more widely applicable `SingleCellExperiment` class. From Bioconductor 3.6 (October 2017), the release version of `scater` will use `SingleCellExperiment`. `SingleCellExperiment` is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets. It should be straight-forward to convert existing scripts based on `SCESet` objects to `SingleCellExperiment` objects, with key changes outlined immediately below. * The functions `toSingleCellExperiment` and `updateSCESet` (for backwards compatibility) can be used to convert an old `SCESet` object to a `SingleCellExperiment` object; * Create a new `SingleCellExperiment` object with the function `SingleCellExperiment` (actually less fiddly than creating a new `SCESet`); * `scater` functions have been refactored to take `SingleCellExperiment` objects, so once data is in a `SingleCellExperiment` object, the user experience is almost identical to that with the `SCESet` class. Users may need to be aware of the following when updating their own scripts: * Cell names can now be accessed/assigned with the `colnames` function (instead of `sampleNames` or `cellNames` for an `SCESet` object); * Feature (gene/transcript) names should now be accessed/assigned with the `rownames` function (instead of `featureNames`); * Cell metadata, stored as `phenoData` in an `SCESet`, corresponds to `colData` in a `SingleCellExperiment` object and is accessed/assigned with the `colData` function (this replaces the `pData` function); * Individual cell-level variables can still be accessed with the `$` operator (e.g. `sce$sum`); * Feature metadata, stored as `featureData` in an `SCESet`, corresponds to `rowData` in a `SingleCellExperiment` object and is accessed/assigned with the `rowData` function (this replaces the `fData` function); * `plotScater`, which produces a cumulative expression, overview plot, replaces the generic `plot` function for `SCESet` objects. # Session information {.unnumbered} ```{r} sessionInfo() ```