---
title: "Multi-subject scRNA-seq Analysis"
author:
- name: Jason Ratcliff
  affiliation: &iihg Iowa Institute of Human Genetics, Roy J. and Lucille A.
    Carver College of Medicine, University of Iowa, Iowa City, IA 52242, USA
- name: Andrew Thurman
  affiliation: &intMed Department of Internal Medicine, Roy J. and Lucille A.
    Carver College of Medicine, University of Iowa, Iowa City, IA 52242, USA
- name: Michael Chimenti
  affiliation: *iihg
- name: Alejandro Pezzulo
  affiliation: *intMed
link-citations: true
output: BiocStyle::html_document
bibliography: references.bib
csl: bioinformatics.csl
package: aggregateBioVar
abstract: |
    A method for incorporating *biological replication* into differential gene
    expression analysis of single cell RNA sequencing data.
vignette: |
    %\VignetteIndexEntry{Multi-subject scRNA-seq Analysis}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
nocite: |
    @R-cowplot
    @R-ggtext
    @R-magrittr
    @R-ggtext
    @R-dplyr
---

```{r vignette, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
knitr::opts_knit$set(
    eval.after = "fig.cap"
)
```

```{r setup, message=FALSE}
# For analysis of scRNAseq data
library(aggregateBioVar)
library(SummarizedExperiment, quietly = TRUE)
library(SingleCellExperiment, quietly = TRUE)
library(DESeq2, quietly = TRUE)

# For data transformation and visualization
library(magrittr, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(cowplot, quietly = TRUE)
library(ggtext, quietly = TRUE)
```

```{r links, echo=FALSE}
link_sce <-
    paste0(
        "https://bioconductor.org/packages/release/bioc/",
        "vignettes/SingleCellExperiment/inst/doc/intro.html"
    )
link_se <-
    paste0(
        "https://bioconductor.org/packages/release/bioc/",
        "vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html"
    )
link_ncbi <-
    paste0(
        "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/",
        "wwwtax.cgi?mode=Info&id=9823"
    )
link_DESeq2 <-
    paste0(
        "https://bioconductor.org/packages/devel/bioc/",
        "vignettes/DESeq2/inst/doc/DESeq2.html"
    )
```

```{r captions, echo=FALSE, message=FALSE}
cap_metadata <-
    SummarizedExperiment::colData(x = small_airway) %>% tibble::as_tibble() %>%
    dplyr::select("orig.ident", "Genotype") %>%
    dplyr::distinct() %>% dplyr::arrange(.data$orig.ident)

# Figure 1
cap_pheatmanp_cell <-
    paste0(
        "Gene-by-cell Count Matrix. ",
        "Heatmap of *secretory cell* gene expression with log~2~ counts ",
        "per million cells. Includes ",
        length(grep("Secretory cell", small_airway$celltype)),
        " cells from ", length(unique(small_airway$orig.ident)), " subjects",
        " with genotypes `WT` (n=", length(grep("WT", cap_metadata$Genotype)),
        ", cells=",
        length(
            intersect(
                grep("WT", small_airway$Genotype),
                grep("Secretory cell", small_airway$celltype)
            )
        ),
        ") and `CFTRKO` (n=", length(grep("CFTRKO", cap_metadata$Genotype)),
        ", cells=",
        length(
            intersect(
                grep("CFTRKO", small_airway$Genotype),
                grep("Secretory cell", small_airway$celltype)
            )
        ),
        ")."
    )

# Figure 2
cap_pheatmanp_subj <-
    paste0(
        "Gene-by-subject Count Matrix. ",
        "Heatmap of gene counts aggregated by ",
        "subject with `aggregateBioVar()`.",
        "The `SummarizedExperiment` object with the *Secretory cells* subset ",
        "contains gene counts summed by subject. Aggregate gene-by-subject ",
        "counts are used as input for bulk RNA-seq tools."
    )

# Figure 3
cap_volcano <-
    paste0(
        "Differential expression analysis of scRNA-seq data. ",
        "Comparison of differential expression in ",
        "secretory cells from small airway epithelium ",
        "without aggregation (A) and after aggregation of gene counts ",
        "by subject (biological replicates; B). ",
        "Genes with an absolute log~2~ fold change greater than 1 and an ",
        "adjusted P value less than 0.05 are highlighted in red. Aggregation ",
        "of counts by subject reduced the number of differentially expressed ",
        "genes to CD36 and CFTR for the secretory cell subset."
    )

# Figure 4
cap_counts <-
    paste0(
        "Normalized within-subject gene counts. ",
        "Gene counts aggregated by subject for significantly differentially ",
        "expressed genes from the secretory cell subset."
    )
```

# Introduction

Single cell RNA sequencing (scRNA-seq) studies allow gene expression
quantification at the level of individual cells, and these studies introduce
multiple layers of biological complexity. These include variations in gene
expression between cell states within a sample
(*e.g.*, T cells versus macrophages), between samples within a population
(*e.g.*, biological or technical replicates), and between populations
(*e.g.*, healthy versus diseased individuals).
Because many early scRNA-seq studies involved analysis of only a single sample,
many bioinformatics tools operate on the first layer, comparing gene expression
between cells within a sample.
This software is aimed at organizing scRNA-seq data to permit analysis in the
latter two layers, comparing gene expression between samples and between
populations. An example is given with an implementation of differential
gene expression analysis between populations. From scRNA-seq data stored as a
[`SingleCellExperiment`](`r link_sce`) [@R-SingleCellExperiment] object
with pre-defined cell states, `aggregateBioVar()` stratifies data as a list of
[`SummarizedExperiment`](`r link_se`) [@R-SummarizedExperiment] objects,
a standard [Bioconductor](https://bioconductor.org) data structure for
downstream analysis of RNA-seq data.

# Case Study: Small Airway Epithelium in Cystic Fibrosis

To illustrate the utility of *biological replication* for scRNA-seq
sequencing experiments, consider a set of single cell data from
**porcine small airway epithelium**. In this study, small airway (< 2 mm) tissue
samples were collected from newborn pigs ([*Sus scrofa*](`r link_ncbi`)) to
investigate gene expression patterns and cellular composition in a cystic
fibrosis phenotype. Single cell sequencing samples were prepared using a 10X
Genomics Chromium controller and sequenced on an Illumina HiSeq4000. Data
obtained from seven individuals include both non-CF
(**CFTR+/+**; genotype `WT`; n=4) and CFTR-knockout subjects expressing a
cystic fibrosis phenotype (**CFTR-/-**; genotype `CFTRKO`; n=3).
Cell types were determined following a standard scRNA-seq pipeline using
[Seurat](https://satijalab.org/seurat/) [@Seurat2019], including cell count
normalization, scaling, determination of highly variable genes,
dimension reduction via principal components analysis, and shared
nearest neighbor clustering. Both unsupervised marker detection
(via `Seurat::FindMarkers()`) and a list of known marker genes were used to
annotate cell types. The full data set has been uploaded to the
[Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/)
as accession number **GSE150211.**

## The `small_airway` Dataset

A subset of `r nrow(small_airway)` genes from `r ncol(small_airway)` cells
assigned as secretory, endothelial, and immune cell types are available in the
`small_airway` data set.
The data are formatted as a `SingleCellExperiment` class S4 object
[@SingleCellExperiment2020], an extension of the
`RangedSummarizedExperiment` class from the `SummarizedExperiment` package.

```{r singleCellExperiment}
small_airway
```

The primary data in `SingleCellExperiment` objects are stored in the `assays`
slot. Here, a single assay `counts` contains gene counts from the single cell
sequencing data. Each column of the assay count matrix represents a cell and
each row a feature (*e.g.*, gene).
Assay slot data can be obtained by `SummarizedExperiment::assay()`, indicating
the`SingleCellExperiment` object and name of the assay slot (here, `"counts"`).
In the special case of the assay being named `"counts"`, the data can be
accessed with `SingleCellExperiment::counts()`.

```{r assays}
assays(small_airway)

# Dimensions of gene-by-cell count matrix
dim(counts(small_airway))

# Access dgCMatrix with gene counts
counts(small_airway)[1:5, 1:30]
```

```{r subjects, echo=FALSE}
sbj_cf <-
    grep(pattern = "CF", x = unique(small_airway$orig.ident), value = TRUE)
sbj_wt <-
    grep(pattern = "WT", x = unique(small_airway$orig.ident), value = TRUE)
```

## Cell Metadata

`SingleCellExperiment` objects may also include column metadata with
additional information annotating individual cells.
Here, the metadata variable `orig.ident` identifies the biological sample of the
cell while `celltype` indicates the assigned cellular identity.
Of the `r length(unique(small_airway$orig.ident))` individual
subjects, 3 are CF (`r sbj_cf`) and 4 are non-CF (`r sbj_wt`).
`Genotype` indicates the sample genotype, one of `WT` or `CFTRKO` for non-CF and
CF subjects, respectively.
Column metadata from `SingleCellExperiment` objects can be accessed with the `$`
operator, where the length of a metadata column variable is equal to the number
of columns (*i.e.*, cells) in the feature count matrix from the `assays` slot.

```{r metadata}
# Subject values
table(small_airway$orig.ident)

# Cell type values
table(small_airway$celltype)

# Subject genotype
table(small_airway$Genotype)
```

The experiment metadata are included as an S4 `DataFrame` object.
To access the full column metadata from `SingleCellExperiment` objects, use
`colData()` from the `SummarizedExperiment` package.
Here, metadata include
`r ncol(colData(small_airway))` variables for each of
`r nrow(colData(small_airway))` cells.
In addition to the biological sample identifier, cell type, and genotype,
metadata include total unique molecular identifiers (UMIs) and
number of detected features.

```{r columnData}
colData(small_airway)
```

## Aggregating Gene Counts

The main functionality of this package involves two generalizable operations:

1. Aggregate within-subject gene counts
2. Summarize metadata to retain inter-subject variation

A wrapper function `aggregateBioVar()` abstracts away these operations
and applies them on a by-cell type basis. For input, a `SingleCellExperiment`
object containing gene counts should contain metadata variables for the
subject by which to aggregate cells (*e.g.*, biological sample) and the assigned
cell types.

### Gene-by-subject Count Matrix

The first operation involves summing all gene counts by subject.
For each gene, counts from all cells within each subject are combined. A
*gene-by-cell* count matrix is converted into a *gene-by-subject* count matrix.

```{r countsBySubject}
countsBySubject(scExp = small_airway, subjectVar = "orig.ident")
```

### Subject Metadata

The second operation removes metadata variables with intrasubject
variation. This effectively retains *inter-subject metadata* and eliminates
variables with intrasubject (*i.e.*, intercellular) variation
(*e.g.*, feature or gene counts by cell). This summarized metadata is used
for modeling a differential expression design matrix.

```{r subjectMetaData}
subjectMetaData(scExp = small_airway, subjectVar = "orig.ident")
```

### Return `SummarizedExperiment`

Both the gene count aggregation and metadata collation steps are combined in
`summarizedCounts()`. This function returns a `SummarizedExperiment` object with
the gene-by-subject count matrix in the `assays` slot, and the summarized
inter-subject metadata as `colData`. Notice the column names now correspond
to the subject level, replacing the cellular barcodes in the
`SingleCellExperiment` following aggregation of gene counts from
within-subject cells.

```{r summarizedCounts}
summarizedCounts(scExp = small_airway, subjectVar = "orig.ident")
```

## `aggregateBioVar()`

These operations are applied to each cell type subset with `aggregateBioVar()`.
The full `SingleCellExperiment` object is subset by cell type (*e.g.*,
secretory, endothelial, and immune cell), the gene-by-subject aggregate count
matrix and collated metadata are tabulated, and a `SummarizedExperiment`
object for that cell type is constructed. A list of `SummarizedExperiment`
objects output by `summarizedCounts()` is the returned to the user. The first
element contains the aggregate `SummarizedExperiment` across all cells, and
subsequent list elements correspond to the cell type indicated by the metadata
variable `cellVar`:

```{r aggregateBioVar}
aggregateBioVar(scExp = small_airway,
                subjectVar = "orig.ident", cellVar = "celltype")
```

# Application to Differential Gene Expression (DGE)

In this case, we want to test for differential
expression between non-CF and CF pigs in the `Secretory cell` subset.
To do so, `aggregateBioVar()` is run on the `SingleCellExperiment` object
by indicated the metadata variables representing the subject-level
(`subjectVar`) and assigned cell type (`cellVar`).
If multiple assays are included in the input `scExp` object,
the **first assay slot** is used.

```{r DESeq2Example}
# Perform aggregation of counts and metadata by subject and cell type.
aggregate_counts <-
    aggregateBioVar(
        scExp = small_airway,
        subjectVar = "orig.ident", cellVar = "celltype"
    )
```

## Exploratory Data Analysis

To visualize the gene-by-subject count aggregation, consider a function to
calculate log<sub>2</sub> counts per million cells and display a heatmap
of normalized expression using `pheatmap` [@R-pheatmap].
`RColorBrewer` [@R-RColorBrewer] and `viridis` [@R-viridis] are used to
generate discrete and continuous color scales, respectively.

```{r pheatmapFxn}
#' Single-cell Counts `pheatmap`
#'
#' @param sumExp `SummarizedExperiment` or `SingleCellExperiment` object
#'   with individual cell or aggregate counts by-subject.
#' @param logSample Subset of log2 values to include for clustering.
#' @param ... Forwarding arguments to pheatmap
#' @inheritParams aggregateBioVar
#'
scPHeatmap <- function(sumExp, subjectVar, gtVar, logSample = 1:100, ...) {
    orderSumExp <- sumExp[, order(sumExp[[subjectVar]])]
    sumExpCounts <- as.matrix(
        SummarizedExperiment::assay(orderSumExp, "counts")
    )
    logcpm <- log2(
        1e6*t(t(sumExpCounts) / colSums(sumExpCounts)) + 1
    )
    annotations <- data.frame(
        Genotype = orderSumExp[[gtVar]],
        Subject = orderSumExp[[subjectVar]]
    )
    rownames(annotations) <- colnames(orderSumExp)

    singleCellpHeatmap <- pheatmap::pheatmap(
        mat = logcpm[logSample, ], annotation_col = annotations,
        cluster_cols = FALSE, show_rownames = FALSE, show_colnames = FALSE,
        scale = "none", ...
    )
    return(singleCellpHeatmap)
}

```

Without aggregation, bulk RNA-seq methods for differential expression analysis
would be applied at the cell level (here, secretory cells;
Figure \@ref(fig:pheatmapCells)).

```{r pheatmapCells, fig.cap=cap_pheatmanp_cell}
# Subset `SingleCellExperiment` secretory cells.
sumExp <- small_airway[, small_airway$celltype == "Secretory cell"]

# List of annotation color specifications for pheatmap.
ann_colors <- list(
    Genotype = c(CFTRKO = "red", WT = "black"),
    Subject = c(RColorBrewer::brewer.pal(7, "Accent"))
)
ann_names <- unique(sumExp[["orig.ident"]])
names(ann_colors$Subject) <- ann_names[order(ann_names)]

# Heatmap of log2 expression across all cells.
scPHeatmap(
    sumExp = sumExp, logSample = 1:100,
    subjectVar = "orig.ident", gtVar = "Genotype",
    color = viridis::viridis(75), annotation_colors = ann_colors,
    treeheight_row = 0, treeheight_col = 0
)
```

----

Summation of gene counts across all cells creates a "pseudo-bulk" data set
on which a subject-level test of differential expression is applied 
(Figure \@ref(fig:pheatmapSubject)).

```{r pheatmapSubject, fig.cap=cap_pheatmanp_subj}
# List of `SummarizedExperiment` objects with aggregate subject counts.
scExp <-
    aggregateBioVar(
        scExp = small_airway,
        subjectVar = "orig.ident", cellVar = "celltype"
    )

# Heatmap of log2 expression from aggregate gene-by-subject count matrix.
scPHeatmap(
    sumExp = aggregate_counts$`Secretory cell`, logSample = 1:100,
    subjectVar = "orig.ident", gtVar = "Genotype",
    color = viridis::viridis(75), annotation_colors = ann_colors,
    treeheight_row = 0, treeheight_col = 0
)
```

## DGE with DESeq2

To run [`DESeq2`](`r link_DESeq2`) [@DESeq2], a `DESeqDataSet` object can be
constructed using `DESeqDataSetFromMatrix()`.
Here, the aggregate counts and subject metadata from the secretory cell subset
are modeled by the variable `Genotype`.
Differential expression analysis is performed with `DESeq` and a results table
is extracted by `results()` to obtain log~2~ fold changes with p-values and
adjusted p-values.

```{r DESeq2Aggregate}
subj_dds_dataset <-
    DESeqDataSetFromMatrix(
        countData = assay(aggregate_counts$`Secretory cell`, "counts"),
        colData = colData(aggregate_counts$`Secretory cell`),
        design = ~ Genotype
    )

subj_dds <- DESeq(subj_dds_dataset)

subj_dds_results <-
    results(subj_dds, contrast = c("Genotype", "WT", "CFTRKO"))
```

For comparison of differential expression with and without aggregation of
gene-by-subject counts, a subset of all secretory cells is used to construct
a `DESeqDataSet` and analysis of differential expression is repeated.

```{r DESeq2Cells}
cells_secretory <-
    small_airway[, which(
        as.character(small_airway$celltype) == "Secretory cell")]
cells_secretory$Genotype <- as.factor(cells_secretory$Genotype)

cell_dds_dataset <-
    DESeqDataSetFromMatrix(
        countData = assay(cells_secretory, "counts"),
        colData = colData(cells_secretory),
        design = ~ Genotype
    )

cell_dds <- DESeq(cell_dds_dataset)

cell_dds_results <-
    results(cell_dds, contrast = c("Genotype", "WT", "CFTRKO"))
```

Add a new variable with log~10~-transformed adjusted P-values.

```{r logPadj, message=FALSE}
subj_dds_transf <- as.data.frame(subj_dds_results) %>%
    bind_cols(feature = rownames(subj_dds_results)) %>%
    mutate(log_padj = - log(.data$padj, base = 10))

cell_dds_transf <- as.data.frame(cell_dds_results) %>%
    bind_cols(feature = rownames(cell_dds_results)) %>%
    mutate(log_padj = - log(.data$padj, base = 10))
```

## Results

```{r dgeCounts, echo=FALSE}
dge_cells <-
    filter(
        .data = cell_dds_transf,
        abs(.data$log2FoldChange) > 1, .data$padj < 0.05
    )
dge_subj <-
    filter(
        .data = subj_dds_transf,
        abs(.data$log2FoldChange) > 1, .data$padj < 0.05
    )
```

DGE is summarized by volcano plot `ggplot` [@R-ggplot2] to show cell-level
(Figure \@ref(fig:plotVolcano)A)
and subject-level tests (Figure \@ref(fig:plotVolcano)B).
Aggregation of gene counts by subject reduced the number of genes with both
an adjusted p-value < 0.05 and an absolute log~2~ fold change > 1 from
`r nrow(dge_cells)` genes to `r nrow(dge_subj)` (Figure \@ref(fig:plotVolcano)).

```{r plotVolcano, fig.cap=cap_volcano}
# Function to add theme for ggplots of DESeq2 results.
deseq_themes <- function() {
    list(
        theme_classic(),
        lims(x = c(-4, 5), y = c(0, 80)),
        labs(
            x = "log<sub>2</sub> (fold change)",
            y = "-log<sub>10</sub> (p<sub>adj</sub>)"
        ),
        ggplot2::theme(
            axis.title.x = ggtext::element_markdown(),
            axis.title.y = ggtext::element_markdown())
    )
}

# Build ggplots to visualize subject-level differential expression in scRNA-seq
ggplot_full <- ggplot(data = cell_dds_transf) +
    geom_point(aes(x = log2FoldChange, y = log_padj), na.rm = TRUE) +
    geom_point(
        data = filter(
            .data = cell_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange, y = log_padj), color = "red"
    ) +
    deseq_themes()

ggplot_subj <- ggplot(data = subj_dds_transf) +
    geom_point(aes(x = log2FoldChange, y = log_padj), na.rm = TRUE) +
    geom_point(
        data = filter(
            .data = subj_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange, y = log_padj), color = "red"
    ) +
    geom_label(
        data = filter(
            .data = subj_dds_transf,
            abs(.data$log2FoldChange) > 1, .data$padj < 0.05
        ),
        aes(x = log2FoldChange + 0.5, y = log_padj + 5, label = feature)
    ) +
    deseq_themes()

cowplot::plot_grid(ggplot_full, ggplot_subj, ncol = 2, labels = c("A", "B"))
```

----

From the significantly differentially expressed genes CFTR and CD36,
the aggregate counts by subject are plotted in Figure \@ref(fig:plotGeneCounts).

```{r plotGeneCounts, fig.cap=cap_counts}
# Extract counts subset by gene to plot normalized counts.
ggplot_counts <- function(dds_obj, gene) {
    norm_counts <-
        counts(dds_obj, normalized = TRUE)[grepl(gene, rownames(dds_obj)), ]
    sc_counts <-
        data.frame(
            norm_count = norm_counts,
            subject = colData(dds_obj)[["orig.ident"]],
            genotype = factor(
                colData(dds_obj)[["Genotype"]],
                levels = c("WT", "CFTRKO")
            )
        )

    count_ggplot <- ggplot(data = sc_counts) +
        geom_jitter(
            aes(x = genotype, y = norm_count, color = genotype),
            height = 0, width = 0.05
        ) +
        scale_color_manual(
            "Genotype", values = c("WT" = "blue", "CFTRKO" = "red")
        ) +
        lims(x = c("WT", "CFTRKO"), y = c(0, 350)) +
        labs(x = "Genotype", y = "Normalized Counts") +
        ggtitle(label = gene) +
        theme_classic()
    return(count_ggplot)
}

cowplot::plot_grid(
    ggplot_counts(dds_obj = subj_dds, gene = "CFTR") +
        theme(legend.position = "FALSE"),
    ggplot_counts(dds_obj = subj_dds, gene = "CD36") +
        theme(legend.position = "FALSE"),
    cowplot::get_legend(
        plot = ggplot_counts(dds_obj = subj_dds, gene = "CD36")
    ),
    ncol = 3, rel_widths = c(4, 4, 1)
)
```

# References

<div id="refs"></div>

# Session Info

```{r}
sessionInfo()
```