---
title: "BayesSpace"
author: "Edward Zhao, Matt Stone, Xing Ren, and Raphael Gottardo"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{BayesSpace}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width="100%",
  fig.width=7, fig.height=5,
  dpi=300, fig.path="figures/BayesSpace-",
  message=FALSE, warning=FALSE, error=FALSE
)
```

```{r setup}
library(SingleCellExperiment)
library(ggplot2)
library(BayesSpace)
```

## Preparing your experiment for BayesSpace

### Loading data

BayesSpace supports three ways of loading a `SingleCellExperiment` for analysis.

Visium datasets processed with [Space
Ranger](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/what-is-space-ranger)
can be loaded directly via the `readVisium()` function. This function takes only
the path to the Space Ranger output directory (containing the `spatial/` and
`filtered_feature_bc_matrix/` subdirectories) and returns a
`SingleCellExperiment`.

```{r readVisium, eval=FALSE}
sce <- readVisium("path/to/spaceranger/outs/")
```

Second, all datasets analyzed for the BayesSpace manuscript are readily
accessible via the `getRDS()` function. This function takes two arguments - the
name of the dataset, and the name of the sample in the dataset.

```{r download}
melanoma <- getRDS(dataset="2018_thrane_melanoma", sample="ST_mel1_rep2")
```

Finally, `SingleCellExperiment` objects can be constructed manually from a
counts matrix and tables of row and column data. BayesSpace only requires that
spot array coordinates be provided as columns named `row` and `col` in
`colData`. (Note that enhancement of Visium datasets additionally requires the
pixel coordinates of each spot in the tissue image, but in this case the dataset
should be loaded with `readVisium()`, which loads these data automatically.)

```{r manual.sce, eval=FALSE}
library(Matrix)

rowData <- read.csv("path/to/rowData.csv", stringsAsFactors=FALSE)
colData <- read.csv("path/to/colData.csv", stringsAsFactors=FALSE, row.names=1)
counts <- read.csv("path/to/counts.csv.gz",
                   row.names=1, check.names=F, stringsAsFactors=FALSE))

sce <- SingleCellExperiment(assays=list(counts=as(counts, "dgCMatrix")),
                            rowData=rowData,
                            colData=colData)
```

We'll continue with the melanoma sample from the 2018 Spatial Transcriptomics
paper for the remaining examples in this vignette.

### Pre-processing data

BayesSpace requires minimal data pre-processing, but we provide a helper
function to automate it.

`spatialPreprocess()` log-normalizes the count matrix and performs PCA on the
top `n.HVGs` highly variable genes, keeping the top `n.PCs` principal
components. Additionally, the spatial sequencing platform is added as metadata
in the `SingleCellExperiment` for downstream analyses. If you do not wish to
rerun PCA, running `spatialPreprocess()` with the flag `skip.PCA=TRUE` will only
add the metadata BayesSpace requires.

Here, we omit log-normalization as all datasets available through `getRDS()`
already include log-normalized counts.
```{r preprocess}
set.seed(102)
melanoma <- spatialPreprocess(melanoma, platform="ST", 
                              n.PCs=7, n.HVGs=2000, log.normalize=FALSE)
```

## Clustering

### Selecting the number of clusters

We can use the `qTune()` and `qPlot()` functions to help choose `q`, the number
of clusters to use in our analysis.  

* `qTune()` runs the BayesSpace clustering algorithm for multiple specified
values of `q` (by default, 3 through 7) and computes their average
pseudo-log-likelihood. It accepts any arguments to `spatialCluster()`.
* `qPlot()` plots the pseudo-log-likelihood as a function of `q`; we suggest
choosing a `q` around the elbow of this plot.

```{r tuning_q}
melanoma <- qTune(melanoma, qs=seq(2, 10), platform="ST", d=7)
qPlot(melanoma)
```

### Clustering with BayesSpace

The `spatialCluster()` function clusters the spots, and adds the predicted
cluster labels to the `SingleCellExperiment`. Typically, as we did for the
analyses in the paper, we suggest running with at least 10,000 iterations
(`nrep=10000`), but we use 1,000 iteration in this demonstration for the sake of
runtime. (Note that a random seed must be set in order for the results to be
reproducible.)

```{r cluster}
set.seed(149)
melanoma <- spatialCluster(melanoma, q=4, platform="ST", d=7,
                           init.method="mclust", model="t", gamma=2,
                           nrep=1000, burn.in=100,
                           save.chain=TRUE)
```

Both the mclust initialization (`cluster.init`) and the BayesSpace cluster
assignments (`spatial.cluster`) are now available in the SingleCellExperiment's
`colData`.
```{r cluster.results}
head(colData(melanoma))
```

### Visualizing spatial clusters

We can plot the cluster assignments over the spatial locations of the spots with
`clusterPlot()`.
```{r cluster.plot, fig.width=7, fig.height=5}
clusterPlot(melanoma)
```

As `clusterPlot()` returns a `ggplot` object, it can be customized by composing
with familiar `ggplot2` functions. Additionally, the argument `palette` sets the
colors used for each cluster, and `clusterPlot()` takes additional arguments to
`geom_polygon()` such as `size` or `color` to control the aesthetics of the spot
borders.

```{r cluster.plot.customize, fig.width=7, fig.height=5}
clusterPlot(melanoma, palette=c("purple", "red", "blue", "yellow"), color="black") +
  theme_bw() +
  xlab("Column") +
  ylab("Row") +
  labs(fill="BayesSpace\ncluster", title="Spatial clustering of ST_mel1_rep2")
```


## Enhanced resolution

### Clustering at enhanced resolution

The `spatialEnhance()` function will enhance the resolution of the principal
components, and add these PCs as well as predicted cluster labels at subspot
resolution to a new `SingleCellExperiment`. As with our demonstration of
`spatialCluster()` above, we are using fewer iterations for the purpose of this
example (`nrep=1000`) than we recommend in practice (`nrep=100000` or greater).
Note that the `jitter_scale` parameter should be tuned so that proposals for 
updating subspot-level expression are accepted around 30% of the time. This can
be evaluated using `mcmcChain(melanoma.enhanced, "Ychange")`, where the chain 
should stabilize to 0.25-0.40. Typically 1000-2500 iterations are sufficient to 
evaluate if `jitter_scale` should be increased if acceptance is too high or 
decreased if acceptance is too low. After tuning, proceed to a full run of 
`spatialEnhance` with more iterations.

```{r enhance, eval=TRUE}
melanoma.enhanced <- spatialEnhance(melanoma, q=4, platform="ST", d=7,
                                    model="t", gamma=2,
                                    jitter_prior=0.3, jitter_scale=3.5,
                                    nrep=1000, burn.in=100,
                                    save.chain=TRUE)
```

The enhanced `SingleCellExperiment` includes an index to the parent spot in the
original `sce` (`spot.idx`), along with an index to the subspot. It adds the
offsets to the original spot coordinates, and provides the enhanced cluster
label (`spatial.cluster`).
```{r enhance.results}
head(colData(melanoma.enhanced))
```

We can plot the enhanced cluster assignments as above.
```{r enhance.plot, eval=TRUE, fig.width=7, fig.height=5}
clusterPlot(melanoma.enhanced)
```

### Enhancing the resolution of gene expression

BayesSpace operates on the principal components of the gene expression matrix,
and `spatialEnhance()` therefore computes enhanced resolution PC vectors.
Enhanced gene expression is not computed directly, and is instead imputed using
a regression algorithm. For each gene, a model using the PC vectors of each spot
is trained to predict the spot-level gene expression, and the fitted model is
used to predict subspot expression from the subspot PCs. 

Gene expression enhancement is implemented in the `enhanceFeatures()` function.
BayesSpace predicts expression with
[`xgboost`](https://xgboost.readthedocs.io/en/latest/) by default, but linear
and Dirichlet regression are also available via the `model` argument. When using
`xgboost`, we suggest automatically tuning the `nrounds` parameter by setting it
to 0, although this comes at the cost of increased runtime (~4x slower than a
pre-specified `nrounds` in practice).

`enhanceFeatures()` can be used to impute subspot-level expression for all
genes, or for a subset of genes of interest. Here, we'll demonstrate by
enhancing the expression of four marker genes: PMEL (melanoma), CD2 (T-cells),
CD19 (B-cells), and COL1A1 (fibroblasts). 

```{r enhanceFeatures}
markers <- c("PMEL", "CD2", "CD19", "COL1A1")
melanoma.enhanced <- enhanceFeatures(melanoma.enhanced, melanoma,
                                     feature_names=markers,
                                     nrounds=0)
```

By default, log-normalized expression (`logcounts(sce)`) is imputed, although
other assays or arbitrary feature matrices can be specified.
```{r enhanced.logcount}
logcounts(melanoma.enhanced)[markers, 1:5]
```

Diagnostic measures from each predictive model, such as `rmse` when using
`xgboost`, are added to the `rowData` of the enhanced dataset.
```{r enhanced.rmse}
rowData(melanoma.enhanced)[markers, ]
```


### Visualizing enhanced gene expression

Spatial gene expression is visualized with `featurePlot()`.
```{r enhanced.featurePlot}
featurePlot(melanoma.enhanced, "PMEL")
```

Here, we compare the spatial expression of the imputed marker genes.
```{r enhanced.markers, fig.width=12, fig.height=8}
enhanced.plots <- purrr::map(markers, function(x) featurePlot(melanoma.enhanced, x))
patchwork::wrap_plots(enhanced.plots, ncol=2)
```

And we can compare to the spot-level expression.
```{r compare.resolution, fig.width=16, fig.height=8}
spot.plots <- purrr::map(markers, function(x) featurePlot(melanoma, x))
patchwork::wrap_plots(c(enhanced.plots, spot.plots), ncol=4)
```


## Accessing Markov chains

If `save.chain` is set to `TRUE` in either `spatialCluster()` or
`spatialEnhance()`, the chain associated with the respective MCMC run is
preserved to disk as an HDF5 file. The path to this file is stored in the
SingleCellExperiment's metadata at `metadata(sce)$h5.chain`, and can be read
directly using `mcmcChain()`.

The chain is provided as a `coda::mcmc` object, which can be analyzed with
[TidyBayes](https://mjskay.github.io/tidybayes/) or as a matrix. The object has
one row per iteration, with the values of the parameters concatenated across the
row. Columns are named with the parameter name and index (if any).
```{r mcmcChain, eval=TRUE}
chain <- mcmcChain(melanoma)
chain[1:5, 1:5]
```

To remove the HDF5 file from disk and remove its path from the metadata, use
`removeChain()`.