In this tutorial we walk through a typical single-cell RNA-seq analysis using Bioconductor packages. We will try to cover data from different protocols, but some of the EDA/QC steps will be focused on the 10X Genomics Chromium protocol.
We start from the output of the Cell Ranger preprocessing software. This is an open source software suite that allows to pre-process the FASTQ files generated by the sequencing platform and perform alignment and quantification.
We will perform exploratory data analysis (EDA) and quality control (QC), focusing on single-cell, and sometimes droplet-specific, issues, such as the detection of empty droplet and of doublets. We will then cover dimensionality reduction, cell type identification, pseudotime inference, and (briefly) multi-sample differential expression.
Most of the steps covered here (and much more!) are described in great details in the Orchestrating Single Cell Analysis (OSCA) book (Amezquita et al. 2020). I encourage everyone to use that as a reference for the most typical single-cell analyses with Bioconductor.
Bioconductor (Huber et al. 2015) has many packages supporting the analysis of single-cell RNA-seq data and their package vignettes constitute an excellent resource.
While not covered in this tutorial, there are packages and software tools for the analysis of single-cell data outside of Bioconductor too. Popular tools include the Seurat R package and the scanpy python package. The Bioconductor single-cell echosystem tries whenever possible to provide data structures and coercion functions that make it easy to interoperate between Bioconductor and external software.
We will use several experimental datasets throughout this tutorial. In particular, we will use:
In particular, the scRNAseq Bioconductor package contains a wide variety of experimental single-cell datasets that can be used for illustration, exploration, and methods development.
One common feature of the datasets that we will use is that they are all stored as objects of the SingleCellExperiment
class.
SingleCellExperiment
is a S4 class that extends SummarizedExperiment
and can be used for efficiently storing and working with single-cell data in R/Bioconductor.
This class serves as the common infrastructure for across 70+ single-cell-related Bioconductor packages and allows users to mix and match packages by different developers. This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, per-cell metadata and per-gene annotation - and manipulate them in a synchronized manner.
A more thorough overview of SingleCellExperiment
can be found in Chapter 4 of OSCA.
In this section we focus on EDA/QC steps that are typical in a single-cell analysis.
We start by considering a dataset on Peripheral Blood Mononuclear Cell (PBMC) data from a healthy donor provided by 10X Genomics. We use the DropletTestFiles Bioconductor package to download the raw (i.e., unfiltered) count matrix that contains the UMI counts of all genes in all droplets.
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)
library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
sce.pbmc
## class: SingleCellExperiment
## dim: 33694 737280
## metadata(1): Samples
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
The read10xCounts
function starts from the output of the Cell Ranger software and imports the data into R as an object of class SingleCellExperiment
.
We can notice that the dimension of the matrix is very big; in fact this matrix includes the UMI that have been detected in all the droplets that have been sequenced, including the empty droplets that may contain only ambient RNA.
This is a very sparse matrix, with a large fraction of zeros; read10xCounts
is aware of this and stores the counts as a sparse matrix, which has a very small memory footprint.
class(counts(sce.pbmc))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
Exercise: Explore the SingleCellExperiment
object and its slots. In particular, the assay
, counts
, colData
, rowData
assessors.
Before starting the analysis, it may be a good idea to store the names of the genes in a more human-friendly ID system. We can also include information on the chromosome location of the genes; this will be useful for e.g. identifying mitochondrial genes.
library(scuttle)
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)
library(EnsDb.Hsapiens.v86)
rowData(sce.pbmc)$location <- mapIds(EnsDb.Hsapiens.v86,
keys=rowData(sce.pbmc)$ID,
column="SEQNAME", keytype="GENEID")
rowData(sce.pbmc)
## DataFrame with 33694 rows and 3 columns
## ID Symbol location
## <character> <character> <character>
## RP11-34P13.3 ENSG00000243485 RP11-34P13.3 1
## FAM138A ENSG00000237613 FAM138A 1
## OR4F5 ENSG00000186092 OR4F5 1
## RP11-34P13.7 ENSG00000238009 RP11-34P13.7 1
## RP11-34P13.8 ENSG00000239945 RP11-34P13.8 1
## ... ... ... ...
## AC233755.2 ENSG00000277856 AC233755.2 KI270726.1
## AC233755.1 ENSG00000275063 AC233755.1 KI270726.1
## AC240274.1 ENSG00000271254 AC240274.1 KI270711.1
## AC213203.1 ENSG00000277475 AC213203.1 KI270713.1
## FAM231B ENSG00000268674 FAM231B KI270713.1
Exercise: compute the proportion of zero for each droplet (column) and draw the distribution across the dataset.
The first step is the identification of droplets that do not contain any live cell.
The reason why these droplets contain some RNA is that there may be some ambient RNA due to some cell leaking or they may contain dead or dying cells.
The barcodeRanks
function can be used to rank the barcodes by number of UMIs and to estimate the knee and inflection point of the distribution.
bcrank <- barcodeRanks(counts(sce.pbmc))
# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex.lab=1.2)
abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)
legend("bottomleft", legend=c("Inflection", "Knee"),
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
There is a sharp distinction between droplets with very high counts, very likely to contain a live cell, and droplets with very low counts, very likely to be empty.
However, it is not straightforward to classify the droplets in the middle of the distribution.
We can apply a statistical test of hypothesis to decide, for each droplet, if its RNA profile is significantly different from the profile of ambient RNA, estimated from the very low counts (Aaron TL Lun et al. 2019).
We use a very low threshold on the False Discovery Rate to have very few false positive cells.
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
summary(e.out$FDR <= 0.001)
## Mode FALSE TRUE NA's
## logical 989 4300 731991
The large majority of droplets are not tested, since by default all droplets with fewer than 100 UMIs are considered empty.
table(colSums(counts(sce.pbmc))>100, e.out$FDR<=0.001, useNA = "ifany")
##
## FALSE TRUE <NA>
## FALSE 0 0 731991
## TRUE 989 4300 0
We can now proceed by removing the empty droplets and keep only the ones identified to be cells.
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
sce.pbmc
## class: SingleCellExperiment
## dim: 33694 4300
## metadata(1): Samples
## assays(1): counts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(4300): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
We will see in more details in the next section how to perform accurate dimensionality reduction. Here, we quickly perform a principal component analysis (PCA) and tSNE embedding to explore the data and highlight possible quality issues for some cells.
These embeddings can be quickly computed using the runPCA
and runTSNE
functions in scater
. It is useful to log transform the data before PCA to more accurately reflect the signal.
The logNormCounts
function divides by the total number of UMIs in addition to log transforming the counts to account for the difference in sequencing depths. This is not the best normalization approach, but for our current purpose of quickly visualizing the data, it is good enough.
We will need to recompute these embeddings after filtering and proper normalization to obtain a more accurate representation of our data.
library(scater)
sce.pbmc <- logNormCounts(sce.pbmc)
sce.pbmc <- runPCA(sce.pbmc)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
sce.pbmc
## class: SingleCellExperiment
## dim: 33694 4300
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(4300): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
Note that it is generally a good strategy to start from PCA to compute the tSNE embedding; this is achieved using the dimred
argument.
Exercise: Explore the logcounts
accessor and the assayNames
, assays
and assay
methods to extract the original counts and the log-normalized counts from the object; explore the reducedDimNames
and reducedDim
methods to explore the PCA and t-SNE embedding.
We can visualize the first two PCs and the tSNE embedding using the following scater
functions.
plotPCA(sce.pbmc)
plotTSNE(sce.pbmc)
The fact that the droplets are not empty does not automatically mean that they should be kept in the analysis. They may contain damaged or dying cells.
Our aim here is to identify and potentially remove those cells that have lower quality. There are several methods to identify such cells; here we use the simple but effective strategy of removing cells with high percentages of mitochondrial reads.
This is justified by the fact that mitochondiral genes are involved in biological processes in response to stress and apoptosis. Hence, high mitochondrial gene expression may be an indication of damaged or stressed cells.
The perCellQCMetrics
function can be used to compute a set of metrics useful to evaluate the quality of the samples. The isOutlier
function uses a data driven threshold to define cells of lower quality compared to the rest of the dataset.
stats <- perCellQCMetrics(sce.pbmc,
subsets=list(Mito=which(rowData(sce.pbmc)$location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
table(high.mito)
## high.mito
## FALSE TRUE
## 3985 315
Finally, it is convenient to include this information in the colData
of the object, e.g. to use it in plotting functions.
colData(sce.pbmc) <- cbind(colData(sce.pbmc), stats)
sce.pbmc$discard <- high.mito
plotColData(sce.pbmc, y="subsets_Mito_percent",
colour_by="discard")
plotColData(sce.pbmc, y="detected",
colour_by="discard")
plotColData(sce.pbmc, x="sum", y="subsets_Mito_percent",
colour_by="discard") + scale_x_log10()
We can also color-code the cells by the QC metrics and/or by the indicator that we defined in the PCA and tSNE plots.
plotTSNE(sce.pbmc, colour_by="subsets_Mito_percent")
plotTSNE(sce.pbmc, colour_by="discard")
We can see that there are a few cells with high mitochondrial reads scattered around two clusters. In addition, there seems to be a cluster made of only high-mitochondrial cells. It is unclear whether this is a technical artifact due to sample processing or a genuine PBMC cell population that expresses more mitochondrial genes.
In this tutorial, we go ahead and remove these cells, but in a real analysis, careful thought must be given to whether these metrics are capturing technical noise or biological signal.
sce.pbmc <- sce.pbmc[,!high.mito]
sce.pbmc
## class: SingleCellExperiment
## dim: 33694 3985
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(10): Sample Barcode ... total discard
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
Here, we use the scran method to normalize the data for differences in sequencing depth. This approach is based on the deconvolution of size factors estimated from pools of cells (Aaron T. Lun, Bach, and Marioni 2016).
Since we have a heterogeneous cell population, we perform a quick clustering to make sure that we pool together cells that are not too different from each other.
library(scran)
cl <- quickCluster(sce.pbmc)
table(cl)
## cl
## 1 2 3 4 5 6 7 8
## 216 523 542 345 499 547 258 1055
sce.pbmc <- computeSumFactors(sce.pbmc, clusters = cl)
summary(sizeFactors(sce.pbmc))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00749 0.71207 0.87490 1.00000 1.09900 12.25412
The function returns a SingleCellExperiment
object with the estimated size factors stored in the colData
. To actually normalize the data, we can use the logNormCounts
function that will use the size factors stored in the object to normalize the data and save the log-normalized counts as a second assay in the object.
sce.pbmc <- logNormCounts(sce.pbmc)
sce.pbmc
## class: SingleCellExperiment
## dim: 33694 3985
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(10): Sample Barcode ... total discard
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
Exercise: Extract the size factors with the colData
and sizeFactors
accessors.
We can check that the estimated library sizes are not too far from the library size factors, estimated from the total number of counts.
plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), xlab="Library size factor",
ylab="Deconvolution size factor", log='xy', pch=16,
col=as.integer(factor(cl)))
abline(a=0, b=1, col="red")
We now have a SingleCellExperiment
object with only the high-quality cells. We may also want to filter the genes to focus on those that are more likely to carry information rather than noise.
In fact, many of the genes that we stored in the object have no UMI in any cell. We can easily verify that by looking at the number of zero counts in each cell.
summary(colMeans(counts(sce.pbmc)==0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8442 0.9559 0.9627 0.9600 0.9676 0.9974
On average, 87% of the genes have 0 UMIs in a given cell.
Some genes will be expressed in a large fraction of cells, while others will almost always be 0.
summary(rowMeans(counts(sce.pbmc)==0))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007528 0.9736512 0.9994981 0.9600243 1.0000000 1.0000000
Hence, we can start off by removing those genes that are 0 across the board.
allzero <- rowMeans(counts(sce.pbmc)==0)==1
table(allzero)
## allzero
## FALSE TRUE
## 19669 14025
sce.pbmc <- sce.pbmc[which(!allzero),]
sce.pbmc
## class: SingleCellExperiment
## dim: 19669 3985
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(19669): RP11-34P13.7 RP11-34P13.8 ... AC004556.1 AC240274.1
## rowData names(3): ID Symbol location
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(10): Sample Barcode ... total discard
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
Note that we have removed more than 40% of the genes.
This loose filter is often not sufficient, and we usually prefer to focus on a set of highly variable genes for the subsequent analysis steps.
The rationale is that if the genes are carrying biologically relevant information, e.g., in distinguishing among cell types, they will not have a constant expression, but rather will vary between cells.
The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values for each gene across all cells. We need to model the mean-variance relationship since the log is not a variance stabilizing transformation.
set.seed(1001)
dec.pbmc <- modelGeneVar(sce.pbmc)
head(dec.pbmc)
## DataFrame with 6 rows and 6 columns
## mean total tech bio p.value
## <numeric> <numeric> <numeric> <numeric> <numeric>
## RP11-34P13.7 0.002166050 0.002227438 0.002227466 -2.81807e-08 0.5000351
## RP11-34P13.8 0.000522431 0.000549601 0.000537244 1.23571e-05 0.4365063
## FO538757.3 0.000256287 0.000261748 0.000263554 -1.80664e-06 0.5189964
## FO538757.2 0.142907236 0.147339117 0.145040566 2.29855e-03 0.4561549
## AP006222.2 0.061806884 0.077278825 0.063525087 1.37537e-02 0.0662234
## RP4-669L17.10 0.002577010 0.002864669 0.002650077 2.14592e-04 0.2868194
## FDR
## <numeric>
## RP11-34P13.7 0.756451
## RP11-34P13.8 0.756451
## FO538757.3 0.756451
## FO538757.2 0.756451
## AP006222.2 0.677945
## RP4-669L17.10 0.756451
At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component. We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation.
plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.pbmc)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
The modelGeneVar
function tests the null hypothesis that the biological component of the variance is 0 for each gene: small p-values indicate a gene with evidence of biological variation.
We can select the genes using a threshold on the adjusted p-values or simply ranking them by biological variation and selecting the top one or two thousand.
top.pbmc <- getTopHVGs(dec.pbmc, n=1000)
head(top.pbmc)
## [1] "LYZ" "S100A9" "S100A8" "HLA-DRA" "CD74" "CST3"
As we have already mentioned, a single droplet may erroneously capture more than one cell; these are called doublets. There are some statistical methods that can be used
To try and detect the doublets, we can use statistical algorithms, many of which are implemented in the scDblFinder
package. We will focus on one technique that uses simulations to identify droplets that are likely to contain more than one cell.
Note that while we use a droplet terminology here, doublets can be observed in data from non-droplet protocols as well, e.g., in Fluidigm C1 Smart-seq2 datasets.
library(scDblFinder)
dbl.dens <- computeDoubletDensity(sce.pbmc, subset.row=top.pbmc,
d=ncol(reducedDim(sce.pbmc)))
sce.pbmc$DoubletScore <- dbl.dens
summary(dbl.dens)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01594 0.43038 0.82091 1.15834 1.41069 18.84108
The function computeDoubletDensity
identifies potential doublet cells based on the local density of simulated doublet expression profiles.
Briefly, the strategy can be described in this way:
These scores can be visualized in a tSNE plot.
plotTSNE(sce.pbmc, colour_by="DoubletScore")
We can define a threshold to identify putative doublets.
dbl.calls <- doubletThresholding(data.frame(score=dbl.dens),
method="griffiths", returnType="call")
summary(dbl.calls)
## singlet doublet
## 3449 536
And finally eliminate them from the dataset before continuing the analysis.
Note that, similarly to what we have said for the removal of low-quality cells, one should be careful in removing putative doublets. Indeed, it may be difficult to distinguish between rare transient cell populations and doublet cells.
sce.pbmc <- sce.pbmc[, dbl.calls == "singlet"]
sce.pbmc
## class: SingleCellExperiment
## dim: 19669 3449
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(19669): RP11-34P13.7 RP11-34P13.8 ... AC004556.1 AC240274.1
## rowData names(3): ID Symbol location
## colnames(3449): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(11): Sample Barcode ... discard DoubletScore
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
Now that we have removed low-quality cells and doublets, and we have identified highly variable genes, we can proceed with the next step of the analysis, dimentionality reduction.
In particular, we will focus on dimensionality reduction techniques useful as a preliminary step for statistical inference (e.g., clustering), such as PCA and count-based methods.
We can always apply visualization techniques, such as tSNE or UMAP, on top of these more formal dimensionality reduction techniques, but it is important that we do not base our inferential steps on them and we use them only for visualization.
We have already seen the simplest dimensionality reduction technique, PCA, in the previous section. Here, we can recalculate the principal components, starting from the data normalized by scran and using only the selected HVGs. This will lead to a better representation of our biological signal.
sce.pbmc <- runPCA(sce.pbmc, subset_row=top.pbmc)
Note that the subset_row
argument ensures that we use the previously selected HVGs to compute the PCs. Also note that by default runPCA
will store the PCs in the PCA
slot, hence overwriting the previous PCs. In this case, this is what we want, as the previous results were based on a quick normalization and prior to removing noisy genes and cells. In other cases, it may be reasonable to keep more than one set of PCs: this can be achieved by specifying a different slot name via the name
argument.
Finally, note that by default scater
will compute the top 50 PCs. This is a reasonable choice, but it may be a good idea to explore the variance explained by each component to decide the number of components to retain. An alternative is to use the denoisePCA
function from scran
that aims at selecting the number of PCs that exlain biological variability.
plotPCA(sce.pbmc)
Again, we can compute the tSNE embedding using the PCs.
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
plotTSNE(sce.pbmc)
One problem of PCA is that it assumes continuous, roughly Gaussian data, while single-cell RNA-seq data are counts.
Computing the PCs on the log-normalized counts helps, but it can be shown that it is not always effective and in some cases technical features, such as sequencing depth, can affect the first PCs (Hicks et al. 2018).
Hence, it may be better to use count-based factor analysis models for dimentionality reduction. One advantage of these methods is that they can be applied directly to count data, without need for normalization and data transformation, which can be challenging.
We will see two alternative approaches:
In both cases, we will start from the HVGs selected previously.
GLM-PCA is implemented in the scry
Bioconductor package; it uses a Poisson model by default, but this can be changed to use negative binomial, multinomial or binomial.
Unlike runPCA
, the GLM-PCA function does not allow for a selection of genes. Hence, we will first create a filtered SingleCellExperiment
object that contains only the HVGs.
Similarly to PCA, GLM-PCA needs the number of components (factors) that need to be calculated. Here, we compute 10 latent factors.
We use the minibatch
argument to use only a random subset of observations to compute the gradient in the optimization algorithm. This speeds up computations and avoids memory problems in big datasets.
library(scry)
set.seed(100000)
filtered <- sce.pbmc[top.pbmc,]
filtered <- GLMPCA(filtered, L=10, minibatch="stochastic")
filtered
## class: SingleCellExperiment
## dim: 1000 3449
## metadata(2): Samples glmpca
## assays(2): counts logcounts
## rownames(1000): LYZ S100A9 ... CD1C PRKCH
## rowData names(3): ID Symbol location
## colnames(3449): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(11): Sample Barcode ... discard DoubletScore
## reducedDimNames(3): PCA TSNE GLMPCA
## mainExpName: NULL
## altExpNames(0):
To visualize the first two components of GLM-PCA, we can use the plotReducedDim
function from scater
.
plotReducedDim(filtered, "GLMPCA")
An alternative method is implemented in the NewWave
package, which implements a negative binomial factor analysis model and uses a penalized likelihood to estimate the latent factors.
Also in this case we can use minibatches, and if multiple CPUs are available parallel computing, to speed up computations. The n_gene_par
, n_cell_par
, and n_gene_disp
arguments allow the user to choose how many observations to use to estimate the gene- and cell-specific parameters. The children
argument allows one to use multiple cores; here we use four, but you should change it depending on how many cores your computer has (see parallel::detectCores()
).
library(NewWave)
set.seed(100000)
filtered <- newWave(filtered, K=10, children=4,
n_gene_disp = 100, n_gene_par = 100,
n_cell_par = 100)
filtered
## class: SingleCellExperiment
## dim: 1000 3449
## metadata(2): Samples glmpca
## assays(2): counts logcounts
## rownames(1000): LYZ S100A9 ... CD1C PRKCH
## rowData names(3): ID Symbol location
## colnames(3449): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
## TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(11): Sample Barcode ... discard DoubletScore
## reducedDimNames(4): PCA TSNE GLMPCA newWave
## mainExpName: NULL
## altExpNames(0):
plotReducedDim(filtered, "newWave")
For visualization, it is also possible to compute the tSNE embedding of the 10-dimensional GLM-PCA or newWave representation.
set.seed(2242)
filtered <- runTSNE(filtered, dimred="GLMPCA")
plotTSNE(filtered)
There are essentially two alternative approaches for the identification of cell types:
Typically, clustering is done on a low dimensional projection of the data, e.g., through PCA or count-based methods.
On the other hand, reference annotation starts from normalized data.
One of the most popular clustering methods for single-cell RNA-seq is the Louvain algorithm. This method starts from a cell network (graph), created by connecting cells that have shared nearest neighbors.
The first step is the creation of the shared nearest neighbor graph. Briefly, we start from the low dimensional representation of the data (e.g., by GLM-PCA) and construct a graph by linking cells that have neighbors in common. This is implemented in the buildSNNgraph
function of the scran
package; by default it uses k=10
neighbors.
g <- buildSNNGraph(filtered, k=10, use.dimred = 'GLMPCA')
g
## IGRAPH b1d7dff U-W- 3449 124895 --
## + attr: weight (e/n)
## + edges from b1d7dff:
## [1] 10--13 2--17 7--18 11--18 8--18 8--22 19--23 10--24 13--24 9--25
## [11] 6--29 22--31 17--33 31--34 34--35 6--37 20--38 34--39 10--43 24--43
## [21] 38--46 20--46 41--47 7--47 3--47 11--48 8--48 18--48 7--48 45--49
## [31] 21--49 40--51 12--53 14--54 33--56 38--56 46--56 52--59 32--60 50--60
## [41] 10--61 24--61 13--61 7--63 39--63 11--63 47--63 41--63 14--63 16--64
## [51] 53--66 12--66 10--67 24--67 61--67 13--67 43--67 53--69 40--71 51--71
## [61] 43--72 10--72 31--73 35--73 29--75 6--75 64--77 36--78 17--78 23--79
## [71] 3--80 47--80 7--80 9--82 25--82 4--83 64--83 55--84 39--84 42--85
## + ... omitted several edges
We can use e.g., the GGally
package to visualize the graph, but this is often not very useful due to the large number of nodes and edges.
library(GGally)
ggnet2(g, size=1)
We can then use the igraph
package to identify communities (clusters) in the graph, e.g., by using the Louvain algorithm.
library(igraph)
clust <- igraph::cluster_louvain(g)
ggnet2(g, color=membership(clust), size=1)
Finally, we store the results into the colData
of our object and visualize the clusters in the t-SNE plot.
filtered$Louvain <- factor(membership(clust))
plotTSNE(filtered, colour_by="Louvain")
Obviously, there are many other algorithms available for the clustering of single-cell data.
Exercise: repeat the clustering with a different graph partition algorithm (e.g. cluster_walktrap
) starting from the same graph and compare the results.
Once we have cluster labels, we need to understand what each cluster represents. Typically, we look for marker genes to label the clusters.
It is not important to perform a formal differential expression test for this step. In fact, the p-values are not valid from an inferential point of view because we are using the data twice: to cluster the data and to test the difference in expression.
For this reason, we use simple methods, like a t-test that compares the average expression of one group versus the average of the others, and use the p-values only as a way to rank potentially interesting genes.
markers <- findMarkers(filtered, filtered$Louvain)
head(markers[[1]])
## DataFrame with 6 rows and 14 columns
## Top p.value FDR summary.logFC logFC.2 logFC.3
## <integer> <numeric> <numeric> <numeric> <numeric> <numeric>
## S100A8 1 9.75195e-139 1.71087e-137 3.40430 -2.1391259 3.184643
## CST3 1 0.00000e+00 4.94066e-324 3.75719 0.4085027 3.743263
## TYROBP 1 0.00000e+00 0.00000e+00 3.73285 0.0762503 3.785303
## FTL 1 0.00000e+00 0.00000e+00 3.12398 0.0661998 3.123977
## FCN1 1 4.75850e-153 1.08148e-151 2.66274 0.0553338 2.682627
## FCGR3A 1 6.53969e-90 5.23175e-89 -2.79094 0.2084705 0.223296
## logFC.4 logFC.5 logFC.6 logFC.7 logFC.8 logFC.9 logFC.10
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## S100A8 2.154038 3.158476 3.206208 3.2413164 3.145760 3.40430 1.686829
## CST3 -0.216490 3.757194 3.751789 3.7845542 3.727404 3.87541 2.202018
## TYROBP 1.605890 3.780443 3.781221 3.7328477 3.818012 1.50887 3.396266
## FTL 2.346775 2.735539 3.258439 3.3302288 3.079368 3.72598 0.460033
## FCN1 1.741848 2.717070 2.666041 2.7000107 2.662739 2.65390 2.808112
## FCGR3A 0.230875 0.225729 0.235568 0.0393192 0.240096 -1.15875 0.258762
## logFC.11
## <numeric>
## S100A8 2.5325371
## CST3 0.3693180
## TYROBP 0.2873956
## FTL -0.0308125
## FCN1 1.3181068
## FCGR3A -2.7909414
We can for instance extract the top 10 genes from each comparison to use as cluster-specific markers.
mm <- unique(unlist(lapply(markers, function(x) rownames(x)[1:10])))
plotGroupedHeatmap(filtered, features=mm, group="Louvain", center=TRUE)
By cross-checking the expression of these genes with databases of known immune cell type markers, we may be able to annotate the clusters in cell types.
Obviously, manually annotating clusters using gene markers is laborious and requires specific expertise. For this reason, algorithms that automatically annotate cells using external, already annotated reference datasets are becoming popular.
Most methods do not rely on a clustering algorighm, but annotate individual cells, by looking at some form of correlation between each cell in the query dataset and all the annotated samples in the reference.
This is the approach implemented in the SingleR
Bioconductor package, which uses a simple yet effective algorithm, based on Spearman correlation.
The celldex
Bioconductor package includes several annotated reference datasets (for mouse and human) that can be used with SingleR
. Specifically, it includes a few immune-specific reference datasets. We choose one of them, namely the Monaco datataset, which consists of 114 bulk RNA-seq samples of sorted immune cell populations.
library(SingleR)
library(celldex)
ref <- MonacoImmuneData()
ref
## class: SummarizedExperiment
## dim: 46077 114
## metadata(0):
## assays(1): logcounts
## rownames(46077): A1BG A1BG-AS1 ... ZYX ZZEF1
## rowData names(0):
## colnames(114): DZQV_CD8_naive DZQV_CD8_CM ... G4YW_Neutrophils
## G4YW_Basophils
## colData names(3): label.main label.fine label.ont
In the colData
of this object, we have the labels of the sorted populations, at two levels of resolution.
colData(ref)
## DataFrame with 114 rows and 3 columns
## label.main label.fine label.ont
## <character> <character> <character>
## DZQV_CD8_naive CD8+ T cells Naive CD8 T cells CL:0000900
## DZQV_CD8_CM CD8+ T cells Central memory CD8 T.. CL:0000907
## DZQV_CD8_EM CD8+ T cells Effector memory CD8 .. CL:0000913
## DZQV_CD8_TE CD8+ T cells Terminal effector CD.. CL:0001062
## DZQV_MAIT T cells MAIT cells CL:0000940
## ... ... ... ...
## G4YW_NK NK cells Natural killer cells CL:0000623
## G4YW_pDC Dendritic cells Plasmacytoid dendrit.. CL:0000784
## G4YW_mDC Dendritic cells Myeloid dendritic ce.. CL:0000782
## G4YW_Neutrophils Neutrophils Low-density neutroph.. CL:0000096
## G4YW_Basophils Basophils Low-density basophils CL:0000043
It is straightforward to annotate our cells, by simply selecting our dataset, the reference dataset and the column that corresponds to the annotation that we want to use.
pred <- SingleR(test = sce.pbmc, ref = ref,
labels = ref$label.main, assay.type.test="logcounts")
head(pred)
## DataFrame with 6 rows and 5 columns
## scores first.labels
## <matrix> <character>
## AAACCTGAGAAGGCCT-1 0.247337:0.175560:0.100281:... Dendritic cells
## AAACCTGAGACAGACC-1 0.294752:0.214352:0.146588:... Monocytes
## AAACCTGAGGCATGGT-1 0.195891:0.131444:0.313910:... CD4+ T cells
## AAACCTGCAAGGTTCT-1 0.216747:0.181529:0.354377:... CD8+ T cells
## AAACCTGCAGGCGATA-1 0.362410:0.269859:0.225483:... Dendritic cells
## AAACCTGGTACATCCA-1 0.356039:0.152885:0.212780:... B cells
## tuning.scores labels pruned.labels
## <DataFrame> <character> <character>
## AAACCTGAGAAGGCCT-1 0.279172:0.277530 Monocytes Monocytes
## AAACCTGAGACAGACC-1 0.344007:0.279909 Monocytes Monocytes
## AAACCTGAGGCATGGT-1 0.121540:0.119775 CD4+ T cells CD4+ T cells
## AAACCTGCAAGGTTCT-1 0.269335:0.265725 T cells T cells
## AAACCTGCAGGCGATA-1 0.519449:0.455961 Dendritic cells Dendritic cells
## AAACCTGGTACATCCA-1 0.356039:0.283158 B cells B cells
The method returns a continuous score, which represents how similar each cell is to the annotated cell populations in the reference. Moreover, it returns the final prediction in the form of a cell-type label.
We can explore the scores, to visualize the uncertainty in the annotation.
plotScoreHeatmap(pred)
We can include the predictions in the SingleCellExperiment
object and visualize the results.
filtered$singler <- factor(pred$pruned.labels)
plotTSNE(filtered, colour_by = "singler")
Exercise: repeat the classification with the finer cell type classification (label.fine
).
We can compare the Louvain clusters with the predicted cell types.
library(pheatmap)
pheatmap(log10(table(filtered$singler, filtered$Louvain)+1), scale = "none")
There are many additional analyses that are useful for single-cell RNA-seq data. Two that are particualrly important are:
While we do not have time to explore these analyses in detail here, these are covered in Chapter 10 of the Advanced OSCA book and in the Multi-sample OSCA book, respectively.
I encourage you to explore these resources by yourselves.
It is good practice to always include a list of the software versions that were used to perform a given analysis, for reproducibility and trouble-shooting purposes. One way of achieving this is via the sessionInfo()
function.
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pheatmap_1.0.12 celldex_1.6.0
## [3] SingleR_1.10.0 igraph_1.3.2
## [5] GGally_2.1.2 NewWave_1.6.0
## [7] scry_1.8.0 scDblFinder_1.10.0
## [9] scran_1.24.0 scater_1.24.0
## [11] ggplot2_3.3.6 EnsDb.Hsapiens.v86_2.99.0
## [13] ensembldb_2.20.1 AnnotationFilter_1.20.0
## [15] GenomicFeatures_1.48.3 AnnotationDbi_1.58.0
## [17] scuttle_1.6.2 DropletUtils_1.16.0
## [19] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [21] Biobase_2.56.0 GenomicRanges_1.48.0
## [23] GenomeInfoDb_1.32.2 IRanges_2.30.0
## [25] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [27] MatrixGenerics_1.8.0 matrixStats_0.62.0
## [29] DropletTestFiles_1.6.0 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 R.utils_2.11.0
## [3] tidyselect_1.1.2 RSQLite_2.2.14
## [5] grid_4.2.0 BiocParallel_1.30.3
## [7] Rtsne_0.16 munsell_0.5.0
## [9] ScaledMatrix_1.4.0 codetools_0.2-18
## [11] statmod_1.4.36 xgboost_1.6.0.1
## [13] withr_2.5.0 colorspace_2.0-3
## [15] filelock_1.0.2 highr_0.9
## [17] knitr_1.39 rstudioapi_0.13
## [19] labeling_0.4.2 GenomeInfoDbData_1.2.8
## [21] bit64_4.0.5 farver_2.1.0
## [23] rhdf5_2.40.0 coda_0.19-4
## [25] vctrs_0.4.1 generics_0.1.2
## [27] xfun_0.31 BiocFileCache_2.4.0
## [29] R6_2.5.1 ggbeeswarm_0.6.0
## [31] rsvd_1.0.5 locfit_1.5-9.5
## [33] bitops_1.0-7 rhdf5filters_1.8.0
## [35] cachem_1.0.6 reshape_0.8.9
## [37] DelayedArray_0.22.0 assertthat_0.2.1
## [39] promises_1.2.0.1 BiocIO_1.6.0
## [41] scales_1.2.0 beeswarm_0.4.0
## [43] gtable_0.3.0 beachmat_2.12.0
## [45] SharedObject_1.10.0 rlang_1.0.2
## [47] rtracklayer_1.56.0 lazyeval_0.2.2
## [49] intergraph_2.0-2 BiocManager_1.30.18
## [51] yaml_2.3.5 httpuv_1.6.5
## [53] tools_4.2.0 bookdown_0.27
## [55] statnet.common_4.6.0 ellipsis_0.3.2
## [57] jquerylib_0.1.4 RColorBrewer_1.1-3
## [59] Rcpp_1.0.8.3 plyr_1.8.7
## [61] sparseMatrixStats_1.8.0 progress_1.2.2
## [63] zlibbioc_1.42.0 purrr_0.3.4
## [65] RCurl_1.98-1.7 prettyunits_1.1.1
## [67] viridis_0.6.2 ggrepel_0.9.1
## [69] cluster_2.1.3 magrittr_2.0.3
## [71] sna_2.7 data.table_1.14.2
## [73] magick_2.7.3 ProtGenerics_1.28.0
## [75] hms_1.1.1 mime_0.12
## [77] evaluate_0.15 xtable_1.8-4
## [79] XML_3.99-0.10 gridExtra_2.3
## [81] compiler_4.2.0 biomaRt_2.52.0
## [83] tibble_3.1.7 crayon_1.5.1
## [85] R.oo_1.25.0 htmltools_0.5.2
## [87] later_1.3.0 DBI_1.1.2
## [89] ExperimentHub_2.4.0 dbplyr_2.2.0
## [91] MASS_7.3-56 rappdirs_0.3.3
## [93] Matrix_1.4-1 cli_3.3.0
## [95] glmpca_0.2.0 R.methodsS3_1.8.2
## [97] parallel_4.2.0 metapod_1.4.0
## [99] pkgconfig_2.0.3 GenomicAlignments_1.32.0
## [101] xml2_1.3.3 vipor_0.4.5
## [103] bslib_0.3.1 dqrng_0.3.0
## [105] XVector_0.36.0 stringr_1.4.0
## [107] digest_0.6.29 Biostrings_2.64.0
## [109] rmarkdown_2.14 edgeR_3.38.1
## [111] DelayedMatrixStats_1.18.0 restfulr_0.0.14
## [113] curl_4.3.2 shiny_1.7.1
## [115] Rsamtools_2.12.0 rjson_0.2.21
## [117] lifecycle_1.0.1 jsonlite_1.8.0
## [119] Rhdf5lib_1.18.2 network_1.17.2
## [121] BiocNeighbors_1.14.0 viridisLite_0.4.0
## [123] limma_3.52.1 fansi_1.0.3
## [125] pillar_1.7.0 lattice_0.20-45
## [127] KEGGREST_1.36.2 fastmap_1.1.0
## [129] httr_1.4.3 interactiveDisplayBase_1.34.0
## [131] glue_1.6.2 png_0.1-7
## [133] bluster_1.6.0 BiocVersion_3.15.2
## [135] bit_4.0.4 stringi_1.7.6
## [137] sass_0.4.1 HDF5Array_1.24.1
## [139] blob_1.2.3 BiocSingular_1.12.0
## [141] AnnotationHub_3.4.0 memoise_2.0.1
## [143] dplyr_1.0.9 irlba_2.3.5