--- title: "Differential discovery with `CATALYST`" date: "`r BiocStyle::doc_date()`" author: - name: Helena L Crowell affiliation: - &IMLS Institute for Molecular Life Sciences, University of Zurich, Switzerland - &SIB SIB Swiss Institute of Bioinformatics, University of Zurich, Switzerland email: helena.crowell@uzh.ch - name: Mark D Robinson affiliation: - *IMLS - *SIB package: "`r BiocStyle::pkg_ver('CATALYST')`" bibliography: "`r file.path(system.file('extdata', package = 'CATALYST'), 'refs.bib')`" vignette: > %\VignetteIndexEntry{"2. Differential discovery"} %\VignettePackage{CATALYST} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document --- ```{r setup, include = FALSE} knitr::opts_chunk$set(cache = TRUE) ``` --- **Most of the pipeline and visualizations presented herein have been adapted from @Nowicka2019-F1000's *"CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets"* available [here](https://f1000research.com/articles/6-748/v4).** ```{r warning = FALSE, message = FALSE} # load required packages library(CATALYST) library(cowplot) library(flowCore) library(diffcyt) library(scater) library(SingleCellExperiment) ``` # Example data - `PBMC_fs`: a `flowSet` holding PBMCs samples from 4 patients, each containing between 500 and 1000 cells. For each sample, the expression of 10 cell surface and 14 signaling markers was measured before (REF) and upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor crosslinking for 30', resulting in a total of 8 samples. This data set represents a subset of data originating from @Bodenmiller2012 that was also used in the [citrus](https://github.com/nolanlab/citrus) paper [@Bruggner2014-CITRUS]. - `PBMC_panel`: a data.frame containing each marker's column name in the FCS file (`fcs_colname` column), its targeted protein marker (`antigen` column), and the `marker_class` ("type" or "state"). - `PBMC_md`: a data.frame where rows correspond to samples, and columns specify each sample's `file_name`, `sample_id`, `condition`, and `patient_id`. ```{r load-data} # load example data data(PBMC_fs, PBMC_panel, PBMC_md) PBMC_fs head(PBMC_panel) head(PBMC_md) ``` The code snippet below demonstrates how to construct a `flowSet` from a set of FCS files. However, we also give the option to directly specify the path to a set of FCS files (see next section). ```{r eval=FALSE} # download exemplary set of FCS files url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow" zip <- "PBMC8_fcs_files.zip" download.file(paste0(url, "/", zip), destfile = zip, mode = "wb") unzip(zip) # read in FCS files as flowSet fcs <- list.files(pattern = ".fcs$") fs <- read.flowSet(fcs, transformation = FALSE, truncate_max_range = FALSE) ``` # Data preparation Data used and returned throughout differential analysis are held in objects of the `r BiocStyle::Biocpkg("SingleCellExperiment")` class. To bring the data into the appropriate format, `prepData()` requires the following inputs: - `x`: a `flowSet` holding the raw measurement data, or a character string that specifies a path to a set of FCS files. - `panel`: a 2 column data.frame that contains for each marker of interest i) its column name in the raw input data, and ii) its targeted protein marker. - `md`: a data.frame with columns describing the experimental design. Optionally, `features` will specify which columns (channels) to keep from the input data. Here, we keep all measurement parameters (default value `features = NULL`). ```{r prepData} (sce <- prepData(PBMC_fs, PBMC_panel, PBMC_md)) ``` We provide flexibility in the way the panel and metadata table can be set up. Specifically, column names are allowed to differ from the example above, and multiple factors (patient ID, conditions, batch etc.) can be specified. Arguments `panel_cols` and `md_cols` should then be used to specify which columns hold the required information. An example is given below: ```{r eval=FALSE} # alter panel column names panel2 <- PBMC_panel colnames(panel2)[1:2] <- c("channel_name", "marker") # alter metadata column names & add 2nd condition md2 <- PBMC_md colnames(md2) <- c("file", "sampleID", "cond1", "patientID") md2$cond2 <- rep(c("A", "B"), 4) # construct SCE prepData(PBMC_fs, panel2, md2, panel_cols = list(channel = "channel_name", antigen = "marker"), md_cols = list(file = "file", id = "sampleID", factors = c("cond1", "cond2", "patientID"))) ``` Note that, independent of the input panel and metadata tables, the constructor will fix the names of mandatory slots for latter data accession (`sample_id` in the `rowData`, `channel_name` and `marker_name` in the `colData`). The `md` table will be stored under `experiment_info` inside the `metadata`. # Clustering ## `cluster`: *FlowSOM* clustering & *ConsensusClusterPlus* metaclustering `r BiocStyle::Biocpkg("CATALYST")` provides a simple wrapper to perform high resolution `FlowSOM` clustering and lower resolution `ConsensusClusterPlus` metaclustering. By default, the data will be initially clustered into `xdim = 10` x `ydim = 10` = 100 groups. Secondly, the function will metacluster populations into 2 through `maxK` (default 20) clusters. To make analyses reproducible, the random seed may be set via `seed`. By default, if the `colData(sce)$marker_class` column is specified, the set of markers with marker class `"type"` will be used for clustering (argument `features = "type"`). Alternatively, the markers that should be used for clustering can be specified manually. ```{r cluster} sce <- cluster(sce, features = "type", xdim = 10, ydim = 10, maxK = 20, verbose = FALSE, seed = 1) ``` Let K = `xdim` x `ydim` be the number of `r BiocStyle::Biocpkg("FlowSOM")` clusters. `cluster` will add information to the following slots of the input `SingleCellExperiment`: - `rowData`: - `cluster_id`: cluster ID as inferred by `r BiocStyle::Biocpkg("FlowSOM")`. One of 1, ..., K. - `colData`: - `marker_class`: factor `"type"` or `"state"`. Specifyies whether a marker has been used for clustering or not, respectively. - `metadata`: - `SOM_codes`: a table with dimensions K x (# type markers). Contains the SOM codes. - `cluster_codes`: a table with dimensions K x (`maxK` + 1). Contains the cluster codes for all metaclusterings. - `delta_area`: a `ggplot` object (see below for details). ## `mergeClusters`: Manual cluster merging Provided with a 2 column data.frame containing `old_cluster` and `new_cluster` IDs, `mergeClusters` allows for manual cluster merging of any clustering available within the input `SingleCellExperiment` (i.e. the `xdim` x `ydim` `r BiocStyle::Biocpkg("FlowSOM")` clusters, and any of the 2-`maxK` `r BiocStyle::Biocpkg("ConsensusClusterPlus")` metaclusters). For latter accession (visualization, differential testing), the function will assign a unique ID (specified with `id`) to each merging, and add a column to the `cluster_codes` inside the `metadata` slot of the input `SingleCellExperiment`. ```{r mergeClusters} data(merging_table) head(merging_table) sce <- mergeClusters(sce, k = "meta20", table = merging_table, id = "merging1") head(cluster_codes(sce))[, seq_len(10)] ``` ## Delta area plot The delta area represents the amount of extra cluster stability gained when clustering into k groups as compared to k-1 groups. It can be expected that high stability of clusters can be reached when clustering into the number of groups that best fits the data. The "natural" number of clusters present in the data should thus corresponds to the value of k where there is no longer a considerable increase in stability (pleateau onset). For more details, the user can refer to the original description of the consensus clustering method [@Monti2003-ConsensusClusterPlus]. ```{r delta-area, fig.width = 5, fig.height = 3} # access & render delta area plot # (equivalent to metadata(sce)$delta_area) delta_area(sce) ``` # Visualization ## `plotCounts`: Number of cells measured per sample The number of cells measured per sample may be plotted with `plotCounts`. This plot should be used as a guide together with other readouts to identify samples where not enough cells were assayed. Here, the grouping of samples (x-axis) is controlled by `group_by`; bars can be colored by a an additional cell metadata variable (argument `color_by`): ```{r plotCounts-1, fig.width = 5, fig.height = 3} plotCounts(sce, group_by = "sample_id", color_by = "condition") ``` As opposed to plotting absolute cell counts, argument `prop` can be used to visualize relative abundances (frequencies) instead: ```{r plotCounts-2, fig.width = 4, fig.height = 3} plotCounts(sce, prop = TRUE, group_by = "condition", color_by = "patient_id") ``` ## `pbMDS`: Pseudobulk-level MDS plot A multi-dimensional scaling (MDS) plot on aggregated measurement values may be rendered with `pbMDS`. Such a plot will give a sense of similarities between cluster and/or samples in an unsupervised way and of key difference in expression before conducting any formal testing. Arguments `by`, `assay` and `fun` control the aggregation strategy, allowing to compute pseudobulks by sample (`by = "sample_id"`), cluster (`by = "cluster_id"`) or cluster-sample instances (`by = "both"`) using the specified `assay` data and summarry statistic (argument `fun`)[^1]. When `by != "sample_id"`, i.e., when aggregating by cluster or cluster-sample, argument `k` specifies the clustering to use. The features to include in the computation of reduced dimensions may be specified via argument `features`. [^1]: By default, median expression values are computed. Arguments `color_by`, `label_by`, `shape_by` can be used to color, label, shape pseudobulk instances by cell metadata variables of interest. Moreover, `size_by = TRUE` will scale point sizes proportional to the number of cells that went into aggregation. Finally, a custom color palette may be supplied to argument `pal.` ### Ex. 2: MDS on sample-level pseudobulks A multi-dimensional scaling (MDS) plot on median marker expression by sample has the potential to reveal global proteomic differences across conditions or other experimental metadata. Here, we color points by condition (to reveal treatment effects) and further shape them by patient (to highlight patient effects). In our example, we can see a clear horizontal (MDS dim. 1) separation between reference (REF) and stimulation condition (BCRXL), while patients are, to a lesser extent, separated vertically (MDS dim. 2): ```{r pbMDS-1, fig.width = 5} pbMDS(sce, shape_by = "patient_id", size_by = FALSE) ``` ### Ex. 1: MDS on pseudobulks by cluster-sample Complementary to the visualize above, we can generate an MDS plot on pseudobulks computed for each cluster-sample instance. Here, we color point by cluster (to highlight similarity between cell subpopulations), and shape them by condition (to reveal subpopulation-specific expression changes across conditions). In this example, we can see that cluster-sample instances of the same cell subpopulations group together. Meanwhile, most subpopulations exhibit a shift between instances where samples come from different treatment groups: ```{r pbMDS-2, fig.width = 5} pbMDS(sce, by = "both", k = "meta12", shape_by = "condition", size_by = TRUE) ``` ## `clrDR`: Reduced dimension plot on CLR of proportions A dimensionality reduction plot on centered log-ratios (CLR) of sample/cluster proportions across clusters/samples can be rendered with `clrDR`. Here, we view each sample (cluster) as a vector of cluster (sample) proportions. Complementary to `pbMDS`, such a plot will give a sense of similarities between samples/clusters in terms of their composition. **Centered log-ratio** Let $s_i=s_1,...,s_S$ denote one of $S$ samples, $k_i=k_1,...,k_K$ one of $K$ clusters, and $p_k(s_i)$ be the proportion of cells from sample $s_i$ in cluster $k$. The centered log-ratio (CLR) on a given sample's cluster composition is then defined as: $$\text{clr}_{sk} = \log p_k(s_i) - \frac{1}{K}\sum_{i=1}^K \log p_k(s_i)$$ Thus, each sample $s$ gives a vector with length $K$ with mean 0, and the CLRs computed across all instances can be represented as a matrix with dimensions $S \times K$. We can embed the CLR matrix into a lower dimensional space in which points represent samples; or embed its transpose, in which case points represent samples. Distances in this lower-dimensional space will then represent the similarity in cluster compositions between samples and the in sample compositions between clusters, respectively. **Dimensionality reduction** In principle, `clrDR` allows any dimension reduction to be applied on the CLRs, with `dims` (default `c(1, 2)`) specifying which dimensions to visualize. The default method `dr = "PCA"` will include the percentage of variance explained by each principal component (PC) in the axis labels. Noteworthily, *distances between points in the lower-dimensional space are meaningful only for linear DR methods* (PCA and MDS), and results obtained from other methods should be interpreted with caution. The output plot's aspect ratio should thus be kept as is for PCA and MDS; meanwhile, non-linear DR methods can use `aspect.ratio = 1`, rendering a squared plot. **Interpreting loadings** For `dr = "PCA"`, PC loadings will be represented as arrows that may be interpreted as follows: 0° (180°) between vectors indicates a strong positive (negative) relation between them, while vectors that are orthogonal to one another (90°) are roughly independent. When a vector points towards a given quadrant, the variability in proportions for the points within this quadrant are largely driven by the corresponding variable. Here, only the relative orientation of loading vectors to each other and to the PC axes is meaningful; however, the sign of loadings (i.e., whether an arrow points left or right) can be flipped when re-computing PCs. **Aesthetics** Cell metadata variables to color points and PC loading arrows by are determined by arguments `point/arrow_col`, with `point/arrow_pal` specifying the color palettes to use for each layer. For example, rather than coloring samples by their unique identifiers, we may choose to use their condition for coloring instead to highlight differences across groups. In addition, point sizes may be scaled by the number of cells in a given sample/cluster (when `by = "sample/cluster_id"`) via setting `size_by = TRUE`. Argument `arrow_len` (default 0.5) controls the length of PC loading vectors relative to the largest absolute xy-coordinate. When specified, PC loading vectors will be re-scaled to improve their visibility: A value of 1 will stretch vectors such that the largest loading will touch on the outer most point. Importantly, while absolute arrow lengths are not interpretable, their relative length is. ### Ex. 1: CLR on cluster proportions across samples We here visualize the first two PCs computed on CLRs of sample proportions across clusters: small distances between samples mean similar cluster compositions between them, while large distances are indicative of differences in cluster proportions. In our example, PC 1 clearly separates treatment groups: ```{r clrDR-1, fig.width = 5} clrDR(sce, by = "sample_id", k = "meta12") ``` ### Ex. 2: CLR on sample proportions across clusters As an alternative to the plot above, we can visualize the first two PCs computed on the CLR matrix' transpose. Here, we can observe that most of the variability in cluster-compositions across samples is driven by BCRXL samples (PC1): ```{r clrDR-2, fig.width = 6} clrDR(sce, by = "cluster_id", arrow_col = "condition", size_by = FALSE) ``` ## `plotExprHeatmap`: Heatmap of aggregated marker expressions {#plotExprHeatmap} `plotExprHeatmap` allows generating heatmaps of aggregated (pseudobulk) marker expressions. Argument `assay` (default `"exprs"`) and `fun` (default `"median"`) control the aggregation. Depending on argument `by`, aggregation will be performed by `"sample_id"`, `"cluster"`, or `"both"`. **Scaling** `plotExprHeatmap` supports various scaling strategies that are controlled by arguments `scale` and `q`, and will greatly alter the visualization and its interpretation[^2]: [^2]: Regardless of the chosen scaling strategy, row and column clusterings will be performed on unscaled data. - When `scale = "first"`, the specified assay data will be scaled between 0 and 1 using lower (`q`) and upper (`1-q`) quantiles as boundaries, where `q = 0.01` by default. This way, while losing information on absolut expression values, marker expressions will stay comparable, and visualization is improved in cases where the expression range varies greatly between markers. - When `scale = "last"`, assay data will be aggregated first and scaled subsequently. Thus, each marker's value range will be [0,1]. While all comparability between markers is lost, such scaling will improve seeing differences across, e.g., samples or clusters. - When `scale = "never"`, no scaling (and quantile trimming) is applied. The resulting heatmap will thus display *raw* pseudobulk data (e.g., median expression values). **Hierarchical clustering** Various arguments control whether rows/columns should be hierarchically clustered (and re-ordered) accordingly (`row/col_clust`), and whether to include the resulting dendrograms in the heatmap (`row/col_dend`). Here, clustering is performed using `dist` followed by `hclust` on the assay data matrix with the specified `method` as distance metric (default `euclidean`) and `linkage` for agglomeration (default `average`). **Cell count annotation** Optionally, argument `bars` specifies whether to include a barplot of cell counts, which can further be annotated with relative abundances (%) via `perc = TRUE`. These will correspong to cell counts by cluster (when `by != "sample_id"`), and by sample otherwise. **Sample and cluster annotations** By default (`row/col_anno = TRUE`), for axes corresponding to samples (y-axis for `by = "sample_id"` and x-axis for `by = "both"`), annotations will be drawn for all non-numeric cell metadata variables. Alternatively, a specific subset of annotations can be included for only a subset of variables by specifying `row/col_anno` to be a character vector in \code{names(colData(x))} (see examples). For axes corresponding to clusters (y-axis for `by = "cluster_id"` and `"both"`), annotations will be drawn for the specified clustering(s) (arguments `k` and `m`). The following examples shall cover the 3 different modes of `plotExprHeatmap`: ```{r plotExprHeatmap-sample, fig.width = 6.5, fig.height = 3.5} # scale each marker between 0 and 1 # (after aggregation & without trimming) plotExprHeatmap(sce, features = "state", scale = "last", q = 0, bars = FALSE) ``` When `by != "sample_id"`, the clustering to use for aggregation is specified via `k`, and an additional metaclusting may be included as an annotation (argument `m`): ```{r plotExprHeatmap-cluster, fig.width = 6, fig.height = 3.5} # medians of scaled & trimmed type-marker expressions by cluster plotExprHeatmap(sce, features = "type", by = "cluster_id", k = "meta12", m = "meta8", scale = "first", q = 0.01, perc = TRUE, col_dend = FALSE) ``` Finally, we can visualize the aggregated expression of specific markers by cluster *and* sample via `by = "both"`[^3]: [^3]: In this case, only a single features is allowed as input. ```{r plotExprHeatmap-both, fig.width = 4, fig.height = 3.5} # raw (not scaled, not trimmed) # median expression by cluster-sample plotExprHeatmap(sce, features = "pS6", by = "both", k = "meta8", scale = "never", col_clust = FALSE, row_anno = FALSE, bars = FALSE) ``` ## `plotPbExprs`: Pseudobulk expression boxplot A combined boxplot and jitter of aggregated marker intensities can be generated via `plotPbExprs()`. Here, argument `features` (default `"state"`, which is equivalent to `state_markers(sce)`) controls which markers to include. `features = NULL` will include all markers (and is equivalent to `rownames(sce)`). The specified `assay` values (default `"exprs"`) will be aggregated using `fun` (default `"median"`) as summary statistic, resulting in one pseudobulk value per sample, or cluster-sample (when one of `facet_by`, `color_by` or `group_by` is set to `"cluster_id"`). In order to compare medians for each cluster, and potentially identify changes across conditions early on, we specify `facet = "cluster_id"`: ```{r plotMedExprs-1, fig.width = 12, fig.height = 5} plotPbExprs(sce, k = "meta8", facet_by = "cluster_id", ncol = 4) ``` Alternatively, we can facet the above plot by `antigen` in order to compare marker expressions calculated over all cells across conditions: ```{r plotMedExprs-2, fig.width = 12, fig.height = 4} plotPbExprs(sce, facet_by = "antigen", ncol = 7) ``` Thirdly, we can investigate how consistent type-markers are expressed across clusters. To this end, we specify `color_by = "cluster_id"` in order to aggregate expression values by both clusters and samples. The resulting plot gives an indication how *good* our selection of type markers is: Ideally, their expression should be fairly specific to a subset of clusters. ```{r plotMedExprs-3, fig.width = 12, fig.height = 4} plotPbExprs(sce, k = "meta10", features = "type", group_by = "cluster_id", color_by = "sample_id", size_by = TRUE, geom = "points", jitter = FALSE, ncol = 5) ``` Finally, we can investigate how variable state-marker expressions are across clusters by setting `group_by = "cluster_id"` (to aggregate by both, samples and clusters) and `color_by = "condition"` (to additionally group samples by treatment). This type of visualization yields similar information as the plot above: We can observe that there are global shifts in expression for a set of markers (e.g., pNFkB, pp38), and rather subpopulation-specific changes for others (e.g., pS6). ```{r plotMedExprs-4, fig.width = 12, fig.height = 4} plotPbExprs(sce, k = "meta6", features = "state", group_by = "cluster_id", color_by = "condition", ncol = 7) ``` ## `plotClusterExprs`: Marker-densities by cluster Distributions of marker intensities (arcsinh-transformed) across cell populations of interest can be plotted with `plotClusterExprs`. We specify `features = "type"` (equivalent to `type_markers(sce)`), to include type-markers only. Here, blue densities (top row) are calculated over all cells and serve as a reference. ```{r plotClusterExprs, message = FALSE, fig.width = 12, fig.height = 8} plotClusterExprs(sce, k = "meta8", features = "type") ``` ## `plotAbundances`: Relative population abundances Relative population abundances for any clustering of interest can be plotted with `plotAbundances`. Argument `by` will specify whether to plot proportions for each sample or cluster; `group_by` determines the grouping within each panel as well ascolor coding. - If `by = "sample_id"`, the function displays each sample's cell type composition, and the size of a given stripe reflects the proportion of the corresponding cell type the given sample. Argument `group_by` then specifies the facetting. - If `by = "cluster_id"`, argument `group_by` then specifies the grouping and color coding. ```{r plotAbundances-1, fig.width = 5, fig.height = 3.5} plotAbundances(sce, k = "meta12", by = "sample_id", group_by = "condition") ``` ```{r plotAbundances-2, fig.width = 6, fig.height = 4} plotAbundances(sce, k = "meta8", by = "cluster_id", group_by = "condition", shape_by = "patient_id") ``` ## `plotFreqHeatmap`: Heatmap of cluster fequencies {#plotFreqHeatmap} Complementary to `plotAbundances`, a heatmap of relative cluster abundances by cluster can be generated with `plotFreqHeatmap`. By default (`normalize = TRUE`), frequencies will we standarized for each cluster, across samples. Analogous to `plotExprHeatmap` (Sec. \@ref(plotExprHeatmap)), arguments `row/col_clust/dend` control hierarchical clustering or rows (clusters) and columns (samples), and whether the resulting dendrograms should be display; `k` specifies the clustering across which to compute abundances, and `m` a secondary clustering for display. Again, `bars` and `perc` can be used to include a labelled cell count barplot. ```{r plotFreqHeatmap-complete, fig.width = 6.5, fig.height = 4.5} # complete example plotFreqHeatmap(sce, k = "meta8", m = "meta5", hm_pal = rev(hcl.colors(10, "RdBu")), k_pal = hcl.colors(7, "Zissou 1"), m_pal = hcl.colors(4, "Temps"), bars = TRUE, perc = TRUE) ``` ```{r plotFreqHeatmap-minimal, fig.width = 3.5, fig.height = 4} # minimal example plotFreqHeatmap(sce, k = "meta10", normalize = FALSE, bars = FALSE, row_anno = FALSE, col_anno = FALSE, row_clust = FALSE, col_clust = FALSE, hm_pal = c("grey95", "black")) ``` ## `plotMultiHeatmap`: Multi-panel Heatmaps `plotMultiHeatmap` provides flexible options to combine expression and cluster frequency heatmaps generated with `plotExprHeatmap` (Sec. \@ref(plotExprHeatmap)) and `plotFreqHeatmap` (Sec. \@ref(plotFreqHeatmap)), respectively, with arguments `hm1` and `hm2` controlling the panel contents. **Panel contents** In its 1st panel, `plotMultiHeatmap` will display pseudobulks by cluster. Here, `hm1` may be used to specify a set of markers (subset of `rownames(sce)`), or `"type"/"state"` for `type/state_markers(sce)` when `marker_classes(sce)` have been specified. The 1st heatmap can be turned off altogether by setting `hm1 = FALSE`. Anologously, for `hm2 = "type"/"state"`, an expression heatmap of type-/state-markers will be displayed as 2nd heatmap (see Sec. \@ref(plotMultiHeatmap-example1)). `hm2 = "abundances"` will render cluster frequencies by sample (see Sec. \@ref(plotMultiHeatmap-example2)). As opposed to argument `hm1`, however, when `hm2` specifies one or multiple marker(s), a separate heatmap of pseudobulks by cluster-sample will be drawn for each marker (see Sec. \@ref(plotMultiHeatmap-example3)). **Row and column annotations** The clustering to aggregate by is specified with argument `k`. Optionally, a metaclustering of interest may be provided via `m`; here, clustering `m` is merely included as an additional annotation (and not used for any computation). Thus, `m` serves to visually inspect the quality a lower-resolution clustering or manual merging. When an x-axis corresponds to samples, `plotMultiHeatmap` will include column annotations for cell metadata variables (columns in `colData`) that map uniquely to each sample (e.g., condition, patient ID). These annotations can be omitted via `col_anno = FALSE`, or reduced (e.g., `col_anno = "condition"` to include only a single annotation); by default (`col_anno = TRUE`), all available metadata is included. **Argument handling** Most of `plotMultiHeatmap`'s arguments take a single value as input. For example, row annotations and dendrograms are automatically removed for all but the 1st panel. Nevertheless, a subset of arguments may be set uniquely for each heatmap; these are: - `scale` and `q` controlling the scaling strategy for expression heatmaps - `col_clust/dend` specifying whether or not to column-cluster each heatmap and draw the resulting dendrograms When any of these arguments is of length one, the specified value will be recycled for both heatmaps. E.g., setting `scale = "never"` will have all expression heatmaps show unscaled data. ### Ex. 1: Type- & state-markers {#plotMultiHeatmap-example1} To demonstrate `plotMultiHeatmap`'s basic functionality and handling of arguments, we plot expression heatmaps for both type- (`hm1 = "type"`) and state-markers (`hm2 = "state"`), and choose to include a column dendrogram for the 2nd heatmap only: ```{r plotMultiHeatmap-1, fig.width = 7.5, fig.height = 4} # both, median type- & state-marker expressions plotMultiHeatmap(sce, hm1 = "type", hm2 = "state", k = "meta12", m = "meta8", col_dend = c(FALSE, TRUE)) ``` ### Ex. 2: CDx markers & cluster frequencies {#plotMultiHeatmap-example2} As a second example, we plot an expession heatmap for a selection of markers (here, those starting with "CD": `hm1 = c("CDx", "CDy", ...`)[^4] next to the relative cluster abundances across samples (`hm2 = "abundances"`). We also add a barplot for the cell counts in each cluster (`bars = TRUE`) along with labels for their relative abundance (`perc = TRUE`): [^4]: `hm1 = NULL` would include *all* markers. ```{r plotMultiHeatmap-2, fig.width = 7.5, fig.height = 3} # 1st: CDx markers by cluster; # 2nd: population frequencies by sample cdx <- grep("CD", rownames(sce), value = TRUE) plotMultiHeatmap(sce, k = "meta6", hm1 = cdx, hm2 = "abundances", bars = TRUE, perc = TRUE, row_anno = FALSE) ``` ### Ex. 3: Selected markers {#plotMultiHeatmap-example3} In this final example, we view a selection of markers side-by-side, and omit the 1st panel (`hm1 = FALSE`). We also retain the ordering of samples (column order) across panels (`col_clust = FALSE`). In this case, `plotMultiHeatmap` will drop column names (sample IDs) for all but the first panel to avoid repeating these labels and overcrowding the plot. Lastly, we set `scale = "never"` to visualize *raw* (unscaled) pseudobulks (= median expressions by cluster-sample): ```{r plotMultiHeatmap-3, fig.width = 9, fig.height = 4} # plot selected markers side-by-side; # omit left-hand side heatmap plotMultiHeatmap(sce, k = "meta8", scale = "never", hm1 = FALSE, hm2 = c("pS6", "pp38", "pBtk"), row_anno = FALSE, col_clust = FALSE, hm2_pal = c("grey95", "black")) ``` # Dimensionality reduction The number of cells in cytometry data is typically large, and for visualization of cells in a two-dimensional space it is often sufficient to run dimension reductions on a subset of the data. Thus, `CATALYST` provides the wrapper function `runDR` to apply any of the dimension reductions available from `BiocStyle::Biocpkg("scater")` using 1. the subset of features specified via argument `features`; either a subset of `rownames(.)` or, e.g., `"type"` for `type_markers(.)` (if `marker_classes(.)` have been specified). 2. the subset of cells specified via argument `cells`; either `NULL` for all cells, or `n` to sample a random subset of n cells per sample. To make results reproducible, the random seed should be set via `set.seed` *prior* to computing reduced dimensions: ```{r runDR} set.seed(1601) sce <- runDR(sce, dr = "UMAP", cells = 500, features = "type") ``` Alternatively, dimension reductions can be computed using one of `r BiocStyle::Biocpkg("scater")`'s `runX` functions (`X = "TSNE", "UMAP", ...`). Note that, by default, `scater` expects expression values to be stored in the `logcounts` assay of the SCE; specification of `exprs_values = "exprs"` is thus required: ```{r runUMAP-scater, eval = FALSE} sce <- runUMAP(sce, exprs_values = "exprs") ``` DRs available within the SCE can be viewed via `reducedDimNames` and accessed with `reducedDim(s)`: ```{r} # view & access DRs reducedDimNames(sce) head(reducedDim(sce, "UMAP")) ``` While `r BiocStyle::Biocpkg("scater")`'s `plotReducedDim` function can be used to visualize DRs, `CATALYST` provides the `plotDR` wrapper, specifically to allow for coloring cells by the various clusterings available, and to support facetting by metadata factors (e.g., experimental condition, sample IDs): ```{r plotDR-1, fig.height = 5} # color by marker expression & split by condition plotDR(sce, color_by = c("pS6", "pNFkB"), facet_by = "condition") ``` ```{r plotDR-2, fig.height = 5} # color by 8 metaclusters & split by sample ID p <- plotDR(sce, color_by = "meta8", facet_by = "sample_id") p$facet$params$ncol <- 4; p ``` # Filtering SCEs constructed with `prepData` can be filtered using the `filterSCE` function, which allows for filtering of both cells and markers according to conditional statements in `dplyr`-style. When filtering on `cluster_id`s, argument `k` specifies which clustering to use (the default `NULL` uses `colData` column `"cluster_id"`). Two examples are given below: ```{r filterSCE, fig.height = 3} u <- filterSCE(sce, patient_id == "Patient1") table(u$sample_id) u <- filterSCE(sce, k = "meta8", cluster_id %in% c(1, 3, 8)) plot_grid( plotDR(sce, color_by = "meta8"), plotDR(u, color_by = "meta8")) ``` # Differental testing with `r BiocStyle::Biocpkg("diffcyt")` `r BiocStyle::Biocpkg("CATALYST")` has been designed to be compatible with the `r BiocStyle::Biocpkg("diffcyt")` package [@Weber2019-diffcyt], which implements statistical methods for differential discovery in high-dimensional cytometry (including flow cytometry, mass cytometry or CyTOF, and oligonucleotide-tagged cytometry) using high-resolution clustering and moderated tests. The input to the `r BiocStyle::Biocpkg("diffcyt")` pipeline can either be raw data, or a `SingleCellExperiment` object. We give an exmaple of the latter below. Please refer to the `r BiocStyle::Biocpkg("diffcyt")` [vignette](https://bioconductor.org/packages/3.7/bioc/vignettes/diffcyt/inst/doc/diffcyt_workflow.html) and R documentation (`??diffcyt`) for more detailed information. ```{r diffcyt, message = FALSE, warning = FALSE, fig.show = "hide"} # create design & constrast matrix design <- createDesignMatrix(ei(sce), cols_design = "condition") contrast <- createContrast(c(0, 1)) # test for # - differential abundance (DA) of clusters # - differential states (DS) within clusters res_DA <- diffcyt(sce, clustering_to_use = "meta10", analysis_type = "DA", method_DA = "diffcyt-DA-edgeR", design = design, contrast = contrast, verbose = FALSE) res_DS <- diffcyt(sce, clustering_to_use = "meta10", analysis_type = "DS", method_DS = "diffcyt-DS-limma", design = design, contrast = contrast, verbose = FALSE) # extract result tables tbl_DA <- rowData(res_DA$res) tbl_DS <- rowData(res_DS$res) ``` ## `plotDiffHeatmap`: Heatmap of differential testing results Differential testing results returned by `r BiocStyle::Biocpkg("diffcyt")` can be displayed with the `plotDiffHeatmap` function. For differential abundance (DA) tests, `plotDiffHeatmap` will display relative cluster abundances by samples; for differential state (DS) tests, `plotDiffHeatmap` will display aggregated marker expressions by sample. **Filtering** The results to retain for visualization can be filtered via - `fdr`: threshold on adjusted p-values *below* which to keep a result - `lfc`: thershold on absolute logFCs *above* which to keep a result The number of top findings to display can be specified with `top_n` (default 20). When `all = TRUE`, significance and logFC filtering will be skipped, and all `top_n` results are shown. **Annotations** Analogous to `plotFreq/Expr/MuliHeatmap`, when `col_anno = TRUE`, `plotDiffHeatmap` will include column annotations for cell metadata variables (columns in `colData`) that map uniquely to each sample (e.g., condition, patient ID). These annotations can be omitted via `col_anno = FALSE`, or reduced (e.g., `col_anno = "condition"` to include only a single annotation). When `row_anno = TRUE`, cluster (DA) and cluster-marker instances (DS) will be marked as *significant* if their adjusted p-value falls below the threshold value specified with `fdr`. A second annotation will be drawn for the logFCs. **Normalization** When `normalize = TRUE`, the heatmap will display Z-score normalized values. For DA, cluster frequencies will be arcsine-square-root scaled prior to normalization. While losing information on absolution frequency/expression values, this option will make differences across samples and conditions more notable. ## Ex. 1: DA testing results We here set `all = TRUE` to display top-20 DA analysis results, without filtering on adjusted p-values and logFCs. Since differential testing was performed on 10 clusters only, this will simply include all available results. By setting `fdr = 0.05` despite not filtering on significance, we can control the right-hand side annotation: ```{r plotDiffHeatmap-da, fig.width = 5, fig.height = 4} plotDiffHeatmap(sce, tbl_DA, all = TRUE, fdr = 0.05) ``` ## Ex. 2: DS testing results Via setting `fdr = 0.05`, we here display the top DS analysis results in terms of significance. Alternative to the example above, we sort these according their logFCs (`sort_by = "lfc"`), and include only a selected sample annotation (`col_anno = "condition"`): ```{r plotDiffHeatmap-ds, fig.width = 5, fig.height = 4} plotDiffHeatmap(sce, tbl_DS, fdr = 0.05, sort_by = "lfc", col_anno = "condition") ``` ## Ex. 3: Filtering results As an alternative to leaving the selection of markers and clusters to their ordering (significance), we can visualize a specific subset of results (e.g., a selected marker or cluster) using `filterSCE`: ```{r plotDiffHeatmap-filter-1, fig.width = 4.5, fig.height = 3.25} # include all results for selected marker plotDiffHeatmap(sce["pp38", ], tbl_DS, all = TRUE, col_anno = FALSE) ``` ```{r plotDiffHeatmap-filter-2, fig.width = 5, fig.height = 4.25} # include all results for selected cluster k <- metadata(res_DS$res)$clustering_name sub <- filterSCE(sce, cluster_id == 8, k = k) plotDiffHeatmap(sub, tbl_DS, all = TRUE, normalize = FALSE) ``` ## Ex. 4: Customizing appearance Heatmap and annotation colors are controlled via arguments `hm_pal` and `fdr/lfc_pal`, respectively. Here's an example how these can be modified: ```{r plotDiffHeatmap-pals, fig.width = 4, fig.height = 3.5} plotDiffHeatmap(sce, tbl_DA, all = TRUE, col_anno = FALSE, hm_pal = c("gold", "white", "navy"), fdr_pal = c("grey90", "grey50"), lfc_pal = c("red3", "grey90", "green3")) ``` # More ## Exporting FCS files Conversion from SCE to `flowFrame`s/`flowSet`, which in turn can be writting to FCS files using `r Biocpkg("flowCore")`'s `write.FCS` function, is not straightforward. It is not recommended to directly write FCS via `write.FCS(flowFrame(t(assay(sce))))`, as this can lead to invalid FCS files or the data being shown on an inappropriate scale in e.g. Cytobank. Instead, `CATALYST` provides the `sce2fcs` function to facilitate correct back-conversion. `sce2fcs` allows specification of a `colData` column to split the SCE by (argument `split_by`), e.g., to split the data by cluster; whether to keep or drop any cell metadata (argument `keep_cd`) and dimension reductions (argument `keep_dr`) available within the object; and which assay data to use (argument `assay`)[^5]: [^5]: Only count-like data should be written to FCS files and is guaranteed to show with approporiate scale in Cytobank! ```{r sce2fcs-split, message = FALSE} # store final clustering in cell metadata sce$mm <- cluster_ids(sce, "merging1") # convert to 'flowSet' with one frame per cluster (fs <- sce2fcs(sce, split_by = "mm")) # split check: number of cells per barcode ID # equals number of cells in each 'flowFrame' all(c(fsApply(fs, nrow)) == table(sce$mm)) # store identifiers (= cluster names) (ids <- c(fsApply(fs, identifier))) ``` Having converted out SCE to a `flowSet`, we can write out each of its `flowFrame`s to an FCS file with a meaningul filename that retains the cluster of origin: ```{r sce2fcs-write, eval = FALSE} for (id in ids) { # subset 'flowFrame' for cluster 'id' ff <- fs[[id]] # specify output name that includes ID fn <- sprintf("manuel_merging_%s.fcs", id) # construct output path fn <- file.path("...", fn) # write frame to FCS write.FCS(ff, fn) } ``` ## Using other clustering algorithms While `r BiocStyle::Biocpkg("FlowSOM")` has proven to perform well in systematic comparisons of clustering algorithms for CyTOF data [@Weber2016-clustering; @Freytag2018-clustering], it is not the only method out there. Here we demonstrate how clustering assignments from another clustering method, say, `r Githubpkg("JinmiaoChenLab", "Rphenograph")`, could be incorporated into the SCE to make use of the visualizations available in `CATALYST`. Analogous to the example below, virtually any clustering algorithm could be applied, however, with the following **limitation**: The `ConsensusClusterPlus` metaclusterings applied to the initial `FlowSOM` clustering by `CATALYST`'s `cluster` function have a hierarchical cluster structure. Thus, clustering IDs can be matched from a higher resolution (e.g. 100 SOM clusters) to any lower resolution (e.g., 2 through 20 metaclusters). This is not guaranteed for other clustering algorithms. Thus, we store only a single resolution in the cell metadata column `cluster_id`, and a single column under `metadata` slot `cluster_codes` containing the unique cluster IDs. Adding additional resolutions to the `cluster_codes` will fail if cluster IDs can not be matched uniquely across conditions, which will be the case for any non-hierarchical clustering method. ```{r other-clusterings, message = FALSE, warning = FALSE} # subset type-marker expression matrix es <- assay(sce, "exprs") es <- es[type_markers(sce), ] # run clustering method X # (here, we just split the cells into # equal chunks according to CD33 expression) cs <- split(seq_len(ncol(sce)), cut(es["CD33", ], nk <- 10)) kids <- lapply(seq_len(nk), function(i) { rep(i, length(cs[[i]])) }) kids <- factor(unlist(kids)) # store cluster IDs in cell metadata & codes in metadata foo <- sce foo$cluster_id[unlist(cs)] <- unlist(kids) metadata(foo)$cluster_codes <- data.frame( custom = factor(levels(kids), levels = levels(kids))) # tabulate cluster assignments table(cluster_ids(foo, "custom")) ``` ## Customizing visualizations Most of `CATALYST`'s plotting functions return `ggplot` objects whose aesthetics can (in general) be modified easily. However, while e.g. theme aesthetics and color scales can simply be added to the plot, certain modifications can be achieved only through overwriting elements stored in the object, and thus require a decent understanding of its structure. Other functions (`plotExprHeatmap`, `plotMultiHeatmap` and `plotDiffHeatmap`) generate objects of the `Heatmap` of `HeatmapList` class from the `r BiocStyle::Biocpkg("ComplexHeatmap")` package, and are harder to modify once created. Therefore, `CATALYST` tries to expose a reasonable amount of arguments to the user that control key aesthetics such as the palettes used for coloring clusters and heatmaps. The examples below serve to illustrate how some less exposed `ggplot` aesthetics can be modified in retrospect, and the effects of different arguments that control visualization of `r BiocStyle::Biocpkg("ComplexHeatmap")` outputs. ### Modifying `ggplot`s ```{r plotMedExprs-custom, fig.width = 8, fig.height = 2} p <- plotMedExprs(sce, k = "meta4", facet_by = "cluster_id") # facetting layout is 2x2; plot all side-by-side instead p$facet$params$nrow <- 1 # remove points p$layers <- p$layers[-1] # overwrite default colors p <- p + scale_color_manual(values = c("royalblue", "orange")) # remove x-axis title, change angle & decrease size of labels (p + labs(x = NULL) + theme(axis.text.x = element_text(angle = 90, size = 8))) ``` ### Modifying `ComplexHeatmap`s ```{r plotMultiHeatmap-custom, fig.width = 6, fig.height = 3} plotMultiHeatmap(sce, k = "meta8", m = "meta4", hm2 = "abundances", # include all dendrograms row_dend = TRUE, col_dend = TRUE, # exclude sample annotations col_anno = FALSE, # primary & merging cluster palettes k_pal = hcl.colors(8, "Vik"), m_pal = hcl.colors(4, "Tropic"), # 1st & 2nd heatmap coloring hm1_pal = c("grey95", "blue"), hm2_pal = c("grey95", "red3")) ``` ```{r plotExprHeatmap-minimal, fig.width = 8, fig.height = 3} # minimal heatmap plotExprHeatmap(sce, row_anno = FALSE, # don't annotate samples row_clust = FALSE, # keep samples in original order col_clust = FALSE, # keep markers in original order bin_anno = FALSE, # don't annotate bins bars = FALSE, # don't include sample sizes scale = "last", # aggregate, then scale hm_pal = hcl.colors(10, "YlGnBu", rev = TRUE)) ``` ```{r plotExprHeatmap-complete, fig.width = 12, fig.height = 4} # complete heatmap plotExprHeatmap(sce, row_anno = TRUE, # annotate samples row_clust = TRUE, col_clust = TRUE, # cluster samples/markers row_dend = TRUE, col_dend = TRUE, # include dendrograms bin_anno = TRUE, # annotate bins with value bars = TRUE, perc = TRUE, # include barplot of sample sizes hm_pal = c("grey95", "orange")) ``` ## Combining `ComplexHeatmap`s While `plotMultiHeatmap` provides a convenient way to combine pseudoublk expression heatmaps across clusters or cluster-sample combinations with heatmaps of relative cluster abundances, `Heatmap` objects can be, in principle, combined arbitrarily with a few notes of caution: 1. each `Heatmap` should use the same clustering for aggregation (argument `k`). 1. each `Heatmap` should have unique identifier (slot `@name`); otherwise, a warning is given. 1. each `Heatmap`'s legend should have a unique title (slot `@matrix_color_mapping@name`); otherwise, legends with a title that is already in use will be dropped. ### Ex. 1: type- & state-markers + cluster frequencies ```{r combine-heatmaps-1, fig.width = 14, fig.height = 4.8} # specify clustering to aggregate by k <- "meta11" # median type-marker expression by cluster p1 <- plotExprHeatmap(sce, features = "type", by = "cluster_id", k = k, m = "meta7") # median state-marker expression by cluster p2 <- plotExprHeatmap(sce, features = "state", by = "cluster_id", k = k, row_anno = FALSE) # relative cluster abundances by sample p3 <- plotFreqHeatmap(sce, k = k, perc = TRUE, row_anno = FALSE, col_clust = FALSE) # make legend titles unique p1@name <- p1@matrix_color_mapping@name <- "type" p2@name <- p2@matrix_color_mapping@name <- "state" p1 + p2 + p3 ``` ### Ex. 2: frequencies + selected markers + all markers ```{r combine-heatmaps-2, fig.width = 14, fig.height = 4} # specify clustering to aggregate by k <- "meta9" # relative cluster abundances by sample p <- plotFreqHeatmap(sce, k = k, bars = FALSE, hm_pal = c("white", "black"), row_anno = FALSE, col_clust = FALSE, col_anno = FALSE) # specify unique coloring cs <- c(pp38 = "maroon", pBtk = "green4") # median expression of selected markers by cluster-sample for (f in names(cs)) { q <- plotExprHeatmap(sce, features = f, by = "both", k = k, scale = "never", row_anno = FALSE, col_clust = FALSE, hm_pal = c("white", cs[f])) # make identifier & legend title unique q@name <- q@matrix_color_mapping@name <- f # remove column annotation names for (i in seq_along(q@top_annotation@anno_list)) q@top_annotation@anno_list[[i]]@name_param$show <- FALSE # remove redundant sample names q@column_names_param$show <- FALSE p <- p + q } # add heatmap of median expression across all features p + plotExprHeatmap(sce, features = NULL, by = "cluster_id", k = k, row_anno = FALSE) ``` # Session information ```{r session-info} sessionInfo() ``` # References