---
title: "Use of SpatialDecon in a large GeoMx dataset with GeomxTools"
output: 
  rmarkdown::html_vignette: 
    toc: true
vignette: >
  %\VignetteIndexEntry{Use of SpatialDecon in a large GeoMx dataset with GeomxTools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<style>
p.caption {
  font-size: 1.5em;
}
</style>

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


### Installation

```{r installation, eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SpatialDecon")
BiocManager::install("GeoMxTools")
```

### Overview


This vignette demonstrates the use of the SpatialDecon package to estimate cell 
abundance in spatial gene expression studies. 

We'll analyze a dense GeoMx dataset from a lung tumor, looking for abundance of
different immune cell types. 
This dataset has 30 ROIs. In each ROI, Tumor and Microenvironment segments have 
been profiled separately. 


### Data preparation

First, we load the package:
```{r setup}
library(SpatialDecon)
library(GeomxTools)
```

Now let's load our example data and examine it:

```{r loaddata}
data("nsclc")

dim(nsclc)
head(pData(nsclc))
nsclc@assayData$exprs[seq_len(5), seq_len(5)]

# better segment names:
sampleNames(nsclc) <-  
  paste0(nsclc$ROI, nsclc$AOI.name)

```


The spatialdecon function takes 3 arguments of expression data:

1. The normalized data.
2. A matrix of expected background for all data points in the normalized 
data matrix.
3. Optionally, either a matrix of per-data-point weights, or the raw data, 
which is used to derive weights (low counts are less statistically stable, 
and this allows spatialdecon to down-weight them.) 


We estimate each data point's expected background from the negative control 
probes from its corresponding observation:
This study has two probesets, and the genes from each probeset will have 
distinct background values. 
The function "derive_GeoMx_background" conveniently estimates background of all data points,
 accounting for which probeset each gene belongs to:

```{r derivebg}
bg = derive_GeoMx_background(norm = nsclc@assayData$exprs_norm,
                             probepool = fData(nsclc)$Module,
                             negnames = c("NegProbe-CTP01", "NegProbe-Kilo"))
```


### Cell profile matrices

A "cell profile matrix" is a pre-defined matrix that specifies the expected 
expression profiles of each cell type in the experiment. 
The SpatialDecon library comes with one such matrix pre-loaded, the "SafeTME" 
matrix, designed for estimation of immune and stroma cells in the tumor 
microenvironment. 
(This matrix was designed to avoid genes commonly expressed by cancer cells; 
see the SpatialDecon manuscript for details.)

Let's take a glance at the safeTME matrix:

```{r showsafetme, fig.height=5, fig.width=10, fig.cap = "The safeTME cell profile matrix"}
data("safeTME")
data("safeTME.matches")

signif(safeTME[seq_len(3), seq_len(3)], 2)

heatmap(sweep(safeTME, 1, apply(safeTME, 1, max), "/"),
        labRow = NA, margins = c(10, 5))

```


For studies of other tissue types, we have provided a library of cell profile
matrices, available on Github and downloadable with the "download_profile_matrix" function. 

For a complete list of matrices, see [CellProfileLibrary GitHub Page](https://github.com/Nanostring-Biostats/CellProfileLibrary/tree/NewProfileMatrices). 

Below we download a matrix of cell profiles derived from scRNA-seq of a mouse 
spleen. 

```{r downloadmatrix, fig.height=7, fig.width=10, fig.cap = "The Mouse Spleen profile matrix", eval=T}
mousespleen <- download_profile_matrix(species = "Mouse",
                                       age_group = "Adult", 
                                       matrixname = "Spleen_MCA")
dim(mousespleen)

mousespleen[1:4,1:4]

head(cellGroups)

metadata

heatmap(sweep(mousespleen, 1, apply(mousespleen, 1, max), "/"),
        labRow = NA, margins = c(10, 5), cexCol = 0.7)

```

For studies where the provided cell profile matrices aren't sufficient or if a specific single cell dataset is wanted, we can make a custom profile matrix using the function create_profile_matrix(). 

This mini single cell dataset is a fraction of the data from Kinchen, J. et al. Structural Remodeling of the Human Colonic Mesenchyme in Inflammatory Bowel Disease. Cell 175, 372-386.e17 (2018).

```{r single cell data}
data("mini_singleCell_dataset")

mini_singleCell_dataset$mtx@Dim # genes x cells

as.matrix(mini_singleCell_dataset$mtx)[1:4,1:4]

head(mini_singleCell_dataset$annots)

table(mini_singleCell_dataset$annots$LabeledCellType)

```

**Pericyte cell** and **smooth muscle cell of colon** will be dropped from this matrix due to low cell count. The average expression across all cells of one type is returned so the more cells of one type, the better reflection of the true gene expression. The confidence in these averages can be changed using the minCellNum filter.

```{r creatematrix, fig.height=7, fig.width=10, fig.cap = "Custom profile matrix"}
custom_mtx <- create_profile_matrix(mtx = mini_singleCell_dataset$mtx,            # cell x gene count matrix
                                    cellAnnots = mini_singleCell_dataset$annots,  # cell annotations with cell type and cell name as columns 
                                    cellTypeCol = "LabeledCellType",  # column containing cell type
                                    cellNameCol = "CellID",           # column containing cell ID/name
                                    matrixName = "custom_mini_colon", # name of final profile matrix
                                    outDir = NULL,                    # path to desired output directory, set to NULL if matrix should not be written
                                    normalize = FALSE,                # Should data be normalized? 
                                    minCellNum = 5,                   # minimum number of cells of one type needed to create profile, exclusive
                                    minGenes = 10,                    # minimum number of genes expressed in a cell, exclusive
                                    scalingFactor = 5,                # what should all values be multiplied by for final matrix
                                    discardCellTypes = TRUE)          # should cell types be filtered for types like mitotic, doublet, low quality, unknown, etc.

head(custom_mtx)

heatmap(sweep(custom_mtx, 1, apply(custom_mtx, 1, max), "/"),
        labRow = NA, margins = c(10, 5), cexCol = 0.7)

```

### Performing basic deconvolution with the spatialdecon function

Now our data is ready for deconvolution. 
First we'll show how to use spatialdecon under the basic settings, omitting 
optional bells and whistles. 


```{r runiss}
res = runspatialdecon(object = nsclc,
                      norm_elt = "exprs_norm",
                      raw_elt = "exprs",
                      X = safeTME,
                      align_genes = TRUE)

str(pData(res))
names(res@assayData)
```

We're most interested in "beta", the matrix of estimated cell abundances. 

```{r plotissres, fig.height = 5, fig.width = 8, fig.cap = "Cell abundance estimates"}
heatmap(t(res$beta), cexCol = 0.5, cexRow = 0.7, margins = c(10,7))
```


### Using the advanced settings of spatialdecon

spatialdecon has several abilities beyond basic deconvolution:

1. If given the nuclei counts for each region/observation, it returns results on
the scale of total cell counts.
2. If given the identities of pure tumor regions/observations, it infers a
handful of tumor-specific expression profiles and appends them to the cell
profile matrix. Doing this accounts for cancer cell-derived expression from any 
genes in the cell profile matrix, removing contaminating signal from cancer 
cells. 
3. If given raw count data, it derives per-data-point weights, using an error 
model derived for GeoMx data. 
4. If given a "cellmatches" argument, it sums multiple closely-related cell 
types into a single score. E.g. if the safeTME matrix is used with the 
cell-matching data object "safeTME.matches", it e.g. sums the "T.CD8.naive" and
"T.CD8.memory" scores into a single "CD8.T.cells" score. 

Let's take a look at an example cell matching object:
```{r showmatches}
str(safeTME.matches)
```


Now let's run spatialdecon:

```{r runisstils}
# vector identifying pure tumor segments:
nsclc$istumor = nsclc$AOI.name == "Tumor"

# run spatialdecon with all the bells and whistles:
restils = runspatialdecon(object = nsclc,
                          norm_elt = "exprs_norm",                # normalized data
                          raw_elt = "exprs",                      # expected background counts for every data point in norm
                          X = safeTME,                            # safeTME matrix, used by default
                          cellmerges = safeTME.matches,           # safeTME.matches object, used by default
                          cell_counts = nsclc$nuclei,      # nuclei counts, used to estimate total cells
                          is_pure_tumor = nsclc$istumor,   # identities of the Tumor segments/observations
                          n_tumor_clusters = 5)                   # how many distinct tumor profiles to append to safeTME

str(pData(restils))
names(restils@assayData)
str(restils@experimentData@other)
```

There are quite a few readouts here. Let's review the important ones:

* beta: the cell abundance scores of the rolled-up/major cell types
* beta.granular: the cell abundance scores of the granular cell types, 
corresponding to the columns of the cell profile matrix
* yhat, resids: the fitted values and log2-scale residuals from the 
deconvolution fit. Can be used to measure each observation's goodness-of-fit, a 
possible QC metric. 
* cell.counts, cell.counts.granular: estimated numbers of each cell type, 
derived using the nuclei count input
* prop_of_nontumor: the beta matrix rescaled to give the proportions of 
non-tumor cells in each observation. 
* SpatialDeconMatrix: the cell profile matrix used, including newly-derived tumor-specific 
columns.

To illustrate the derivation of tumor profiles, let's look at the cell profile 
matrix output by spatialdecon:

```{r shownewX, fig.height=5, fig.width=8, fig.cap = "safeTME merged with newly-derived tumor profiles"}
heatmap(sweep(restils@experimentData@other$SpatialDeconMatrix, 1, apply(restils@experimentData@other$SpatialDeconMatrix, 1, max), "/"),
         labRow = NA, margins = c(10, 5))

```

Note the new tumor-specific columns. 


### Plotting deconvolution results

The SpatialDecon package contains two specialized plotting functions, and a 
default color palette for the safeTME matrix. 

The first function is "TIL_barplot", which is just a convenient way of drawing 
barplots of cell type abundance. 

```{r barplot, fig.width=9, fig.height=6, fig.cap="Barplots of TIL abundance"}
# For reference, show the TILs color data object used by the plotting functions 
# when safeTME has been used:
data("cellcols")
cellcols

# show just the TME segments, since that's where the immune cells are:
o = hclust(dist(t(restils$cell.counts$cell.counts)))$order
layout(mat = (matrix(c(1, 2), 1)), widths = c(7, 3))
TIL_barplot(t(restils$cell.counts$cell.counts[, o]), draw_legend = TRUE, 
            cex.names = 0.5)
# or the proportions of cells:
temp = replace(restils$prop_of_nontumor, is.na(restils$prop_of_nontumor), 0)
o = hclust(dist(temp[restils$AOI.name == "TME",]))$order
TIL_barplot(t(restils$prop_of_nontumor[restils$AOI.name == "TME",])[, o], 
            draw_legend = TRUE, cex.names = 0.75)

```


The second function is "florets", used for plotting cell abundances atop some 
2-D projection. 
Here, we'll plot cell abundances atop the first 2 principal components of the 
data:

```{r florets, fig.width=8, fig.height=6, fig.cap = "TIL abundance plotted on PC space"}
# PCA of the normalized data:
pc = prcomp(t(log2(pmax(nsclc@assayData$exprs_norm, 1))))$x[, c(1, 2)]

# run florets function:
par(mar = c(5,5,1,1))
layout(mat = (matrix(c(1, 2), 1)), widths = c(6, 2))
florets(x = pc[, 1], y = pc[, 2],
        b = t(restils$beta), cex = .5,
        xlab = "PC1", ylab = "PC2")
par(mar = c(0,0,0,0))
frame()
legend("center", fill = cellcols[colnames(restils$beta)], 
       legend = colnames(restils$beta), cex = 0.7)
```

So we can see that PC1 roughly tracks many vs. few immune cells, and PC2 tracks 
the relative abundance of lymphoid/myeloid populations.


### Other functions

The SpatialDecon library includes several helpful functions for further 
analysis/fine-tuning of deconvolution results. 

#### Combining cell types:

When two cell types are too similar, the estimation of their abundances becomes
unstable. However, their sum can still be estimated easily. 
The function "collapseCellTypes" takes a deconvolution results object and 
collapses any closely-related cell types you tell it to:

```{r collapse, fig.width=5, fig.height=5, fig.cap="Cell abundance estimates with related cell types collapsed"}
matching = list()
matching$myeloid = c( "macrophages", "monocytes", "mDCs")
matching$T.NK = c("CD4.T.cells","CD8.T.cells", "Treg", "NK")
matching$B = c("B")
matching$mast = c("mast")
matching$neutrophils = c("neutrophils")
matching$stroma = c("endothelial.cells", "fibroblasts")


collapsed = runCollapseCellTypes(object = restils, 
                                 matching = matching)

heatmap(t(collapsed$beta), cexRow = 0.85, cexCol = 0.75)
```

#### Inferring an expression profile for a cell type omitted from the profile matrix

Sometimes a cell profile matrix will omit a cell type you know to be present in 
a tissue. 
If your data includes any regions that are purely this unmodelled cell type - 
e.g. because you've used the GeoMx platform's segmentation capability to 
specifically select them - then you can infer a profile for that cell type and 
merge it with your cell profile matrix. 
The algorithm clusters all the observations you designate as purely the 
unmodelled cell type, and collapses those clusters into as many profiles of that
cell type as you wish. For cancer cell, it may be appropriate to specify 10 or
more clusters; for highly regular healthy cells, one cluster may suffice.

(Note: this functionality can also be run within the spatialdecon function, as 
is demonstrated further above.)

```{r appendtumor, fig.width = 10, fig.height= 5, fig.cap = "safeTME merged with newly-derived tumor profiles"}
pure.tumor.ids = res$AOI.name == "Tumor"
safeTME.with.tumor = runMergeTumorIntoX(object = nsclc,
                                        norm_elt = "exprs_norm", 
                                        pure_tumor_ids = pure.tumor.ids,
                                        X = safeTME,
                                        K = 3) 

heatmap(sweep(safeTME.with.tumor, 1, apply(safeTME.with.tumor, 1, max), "/"),
        labRow = NA, margins = c(10, 5))

```


#### Reverse deconvolution

Once cell type abundance has been estimated, we can flip the deconvolution 
around, modelling the expression data as a function of cell abundances, and 
thereby deriving:

1. Estimated expression of each gene in each cell type. (Including for genes 
not present in your cell profile matrix)
2. Fitted expression values for each gene based on cell mixing.
3. Residuals of each gene: how does their expression compare to what cell mixing
would predict?
4. Two metrics of how well genes are predicted by/ redundant with cell mixing: 
correlation between observed and fitted expression, and residual SD. 

The function "reversedecon" runs this model.

```{r reverse, fig.height=6, fig.width=6, fig.cap="Residuals from reverseDecon"}
rdecon = runReverseDecon(object = nsclc,
                         norm_elt = "exprs_norm",
                         beta = res$beta)
str(fData(rdecon))
names(rdecon@assayData)

# look at the residuals:
heatmap(pmax(pmin(rdecon@assayData$resids, 2), -2))
```

```{r reverse2, fig.height=6, fig.width=6, fig.cap="Genes' dependency on cell mixing"}
# look at the two metrics of goodness-of-fit:
plot(fData(rdecon)$cors, fData(rdecon)$resid.sd, col = 0)
showgenes = c("CXCL14", "LYZ", "FOXP3")
text(fData(rdecon)$cors[!rownames(fData(rdecon)) %in% showgenes], 
     fData(rdecon)$resid.sd[!rownames(fData(rdecon)) %in% showgenes], 
     setdiff(rownames(fData(rdecon)), showgenes), cex = 0.5)
text(fData(rdecon)$cors[rownames(fData(rdecon)) %in% showgenes], fData(rdecon)$resid.sd[rownames(fData(rdecon)) %in% showgenes], 
     showgenes, cex = 0.75, col = 2)

```

From the above plot, we can see that genes like CXCL14 vary independently of 
cell mixing, genes like LYZ are correlated with cell mixing but still have 
variable expression, and genes like FOXP3 serve as nothing but obtuse readouts 
of cell mixing. 


### Session Info

```{r sessioninfo}
sessionInfo()
```