--- title: "nnSVG Tutorial" author: - name: Lukas M. Weber affiliation: &id1 "Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA" - name: Stephanie C. Hicks affiliation: *id1 package: nnSVG output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{nnSVG Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction `nnSVG` is a method for scalable identification of spatially variable genes (SVGs) in spatially-resolved transcriptomics data. The `nnSVG` method is based on nearest-neighbor Gaussian processes ([Datta et al., 2016](https://www.tandfonline.com/doi/full/10.1080/01621459.2015.1044091), [Finley et al., 2019](https://www.tandfonline.com/doi/full/10.1080/10618600.2018.1537924)) and uses the BRISC algorithm ([Saha and Datta, 2018](https://onlinelibrary.wiley.com/doi/full/10.1002/sta4.184)) for model fitting and parameter estimation. `nnSVG` allows identification and ranking of SVGs with flexible length scales across a tissue slide or within spatial domains defined by covariates. The method scales linearly with the number of spatial locations and can be applied to datasets containing thousands or more spatial locations. `nnSVG` is implemented as an R package within the Bioconductor framework, and is available from [Bioconductor](https://bioconductor.org/packages/nnSVG). More details describing the method are available in our paper, available from [Nature Communications](https://www.nature.com/articles/s41467-023-39748-z). # Installation The following code will install the latest release version of the `nnSVG` package from Bioconductor. Additional details are shown on the [Bioconductor](https://bioconductor.org/packages/nnSVG) page. ```{r, eval=FALSE} install.packages("BiocManager") BiocManager::install("nnSVG") ``` The latest development version can also be installed from the `devel` version of Bioconductor or from [GitHub](https://github.com/lmweber/nnSVG). # Input data format In the examples below, we assume the input data are provided as a [SpatialExperiment](https://bioconductor.org/packages/SpatialExperiment) Bioconductor object. In this case, the outputs are stored in the `rowData` of the `SpatialExperiment` object. Alternatively, the inputs can also be provided as a numeric matrix of normalized and transformed counts (e.g. log-transformed normalized counts, also known as logcounts) and a numeric matrix of spatial coordinates. # Tutorial ## Standard analysis ### Run nnSVG Here we show a short example demonstrating how to run `nnSVG`. For faster runtime in this example, we subsample the dataset and run `nnSVG` on only a small number of genes. For a full analysis, the subsampling step can be skipped. 
# Tutorial

## Standard analysis

### Run nnSVG

Here we show a short example demonstrating how to run `nnSVG`.

For faster runtime in this example, we subsample the dataset and run `nnSVG` on only a small number of genes. For a full analysis, the subsampling step can be skipped.

```{r, message=FALSE}
library(SpatialExperiment)
library(STexampleData)
library(scran)
library(nnSVG)
library(ggplot2)
```

```{r}
# load example dataset from STexampleData package
# see '?Visium_humanDLPFC' for more details
spe <- Visium_humanDLPFC()

dim(spe)
```

```{r}
# preprocessing steps

# keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)

# skip spot-level quality control, since this has been performed previously 
# on this dataset

# filter low-expressed and mitochondrial genes
# using default filtering parameters
spe <- filter_genes(spe)

# calculate logcounts (log-transformed normalized counts) using scran package
# using library size factors
spe <- computeLibraryFactors(spe)
spe <- logNormCounts(spe)
assayNames(spe)
```

```{r}
# select small set of random genes and several known SVGs for 
# faster runtime in this example
set.seed(123)
ix_random <- sample(seq_len(nrow(spe)), 10)

known_genes <- c("MOBP", "PCP4", "SNAP25", "HBB", "IGKC", "NPY")
ix_known <- which(rowData(spe)$gene_name %in% known_genes)

ix <- c(ix_known, ix_random)

spe <- spe[ix, ]
dim(spe)
```

```{r}
# run nnSVG

# set seed for reproducibility
set.seed(123)
# using a single thread in this example
spe <- nnSVG(spe)

# show results
rowData(spe)
```


### Investigate results

The results are stored in the `rowData` of the `SpatialExperiment` object.

The main results of interest are:

- `LR_stat`: likelihood ratio (LR) statistics
- `rank`: rank of top SVGs according to LR statistics
- `pval`: p-values from asymptotic chi-squared distribution with 2 degrees of freedom
- `padj`: p-values adjusted for multiple testing, which can be used to define a cutoff for statistically significant SVGs (e.g. `padj` <= 0.05)
- `prop_sv`: effect size, defined as proportion of spatial variance out of total variance

```{r}
# number of significant SVGs
table(rowData(spe)$padj <= 0.05)
```

```{r}
# show results for top n SVGs
rowData(spe)[order(rowData(spe)$rank)[1:10], ]
```

```{r, fig.small=TRUE}
# plot spatial expression of top-ranked SVG
ix <- which(rowData(spe)$rank == 1)
ix_name <- rowData(spe)$gene_name[ix]
ix_name

df <- as.data.frame(
  cbind(spatialCoords(spe), 
        expr = counts(spe)[ix, ]))

ggplot(df, aes(x = pxl_col_in_fullres, y = pxl_row_in_fullres, color = expr)) + 
  geom_point(size = 0.8) + 
  coord_fixed() + 
  scale_y_reverse() + 
  scale_color_gradient(low = "gray90", high = "blue", 
                       trans = "sqrt", breaks = range(df$expr), 
                       name = "counts") + 
  ggtitle(ix_name) + 
  theme_bw() + 
  theme(plot.title = element_text(face = "italic"), 
        panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())
```
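As a brief extension (not part of the original workflow), the following sketch shows one way to collect the statistically significant SVGs into a single data frame sorted by rank, using the `rowData` columns described above. The object name `df_svgs` is arbitrary.

```{r, eval=FALSE}
# sketch: collect significant SVGs (padj <= 0.05) sorted by rank
# uses the rowData columns described above; 'df_svgs' is an arbitrary name
df_svgs <- as.data.frame(rowData(spe))
df_svgs <- df_svgs[which(df_svgs$padj <= 0.05), ]
df_svgs <- df_svgs[order(df_svgs$rank), ]
head(df_svgs)
```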
## With covariates

### Run nnSVG

`nnSVG` can also be run on datasets with known covariates, such as spatial domains or cell types. In this case, the method will identify SVGs within regions defined by the selected covariates, e.g. within spatial domains.

Here we run a short example to demonstrate this functionality. This example dataset contains several known SVGs within spatial domains. Specifically, we are interested in SVGs with gradients of expression within the CA3 region of the mouse hippocampus. (See dataset help files for more details and references.)

As above, for faster runtime in this example, we subsample a very small subset of the data. For a full analysis, the subsampling step can be skipped.

```{r, message=FALSE}
library(SpatialExperiment)
library(STexampleData)
library(scran)
library(nnSVG)
library(ggplot2)
```

```{r}
# load example dataset from STexampleData package
# see '?SlideSeqV2_mouseHPC' for more details
spe <- SlideSeqV2_mouseHPC()

dim(spe)
```

```{r}
# preprocessing steps

# remove spots with NA cell type labels
spe <- spe[, !is.na(colData(spe)$celltype)]
dim(spe)

# check cell type labels
table(colData(spe)$celltype)

# filter low-expressed and mitochondrial genes
# using adjusted filtering parameters for this platform
spe <- filter_genes(
  spe, 
  filter_genes_ncounts = 1, 
  filter_genes_pcspots = 1, 
  filter_mito = TRUE
)

dim(spe)

# calculate log-transformed normalized counts using scran package
# using library size normalization
spe <- computeLibraryFactors(spe)
spe <- logNormCounts(spe)
assayNames(spe)
```

```{r}
# select small set of random genes and several known SVGs for 
# faster runtime in this example
set.seed(123)
ix_random <- sample(seq_len(nrow(spe)), 10)

known_genes <- c("Cpne9", "Rgs14")
ix_known <- which(rowData(spe)$gene_name %in% known_genes)

ix <- c(ix_known, ix_random)

spe <- spe[ix, ]
dim(spe)
```

```{r}
# run nnSVG with covariates

# create model matrix for cell type labels
X <- model.matrix(~ colData(spe)$celltype)
dim(X)
stopifnot(nrow(X) == ncol(spe))

# set seed for reproducibility
set.seed(123)
# using a single thread in this example
spe <- nnSVG(spe, X = X)

# show results
rowData(spe)
```

Note that if the factor used to create the design matrix contains any empty levels, these can be dropped when creating the design matrix, as follows:

```{r}
# create model matrix after dropping empty factor levels
X <- model.matrix(~ droplevels(as.factor(colData(spe)$celltype)))
```


### Investigate results

```{r}
# number of significant SVGs
table(rowData(spe)$padj <= 0.05)
```

```{r}
# show results for top n SVGs
rowData(spe)[order(rowData(spe)$rank)[1:10], ]
```

```{r, fig.small=TRUE}
# plot spatial expression of top-ranked SVG
ix <- which(rowData(spe)$rank == 1)
ix_name <- rowData(spe)$gene_name[ix]
ix_name

df <- as.data.frame(
  cbind(spatialCoords(spe), 
        expr = counts(spe)[ix, ]))

ggplot(df, aes(x = xcoord, y = ycoord, color = expr)) + 
  geom_point(size = 0.1) + 
  coord_fixed() + 
  scale_color_gradient(low = "gray90", high = "blue", 
                       trans = "sqrt", breaks = range(df$expr), 
                       name = "counts") + 
  ggtitle(ix_name) + 
  theme_bw() + 
  theme(plot.title = element_text(face = "italic"), 
        panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())
```
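A related variation (see also the note on small numbers of spots in the Troubleshooting section below): if the main biological interest is a single domain vs. all the rest, the covariate can be collapsed to a one-vs-rest indicator before building the design matrix. The following is a minimal sketch, assuming the label of interest is stored as "CA3" in `colData(spe)$celltype`.

```{r, eval=FALSE}
# sketch: collapse cell type labels to a one-vs-rest indicator
# (assumes the label of interest is stored as "CA3"; adjust for your dataset)
is_domain <- colData(spe)$celltype == "CA3"
X_onevsrest <- model.matrix(~ is_domain)
dim(X_onevsrest)

# run nnSVG with the simplified design matrix
set.seed(123)
spe <- nnSVG(spe, X = X_onevsrest)
```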
## Multiple samples

For datasets consisting of multiple biological samples (e.g. multiple 10x Genomics Visium capture areas, or multiple tissue samples from other platforms), we can run nnSVG once per sample and combine the results by averaging the ranks of the SVGs identified per sample.

For this example, we use a dataset from the locus coeruleus region in postmortem human brain samples measured with the 10x Genomics Visium platform, which is available in the [WeberDivechaLCdata](https://bioconductor.org/packages/WeberDivechaLCdata) package and is described in our paper [Weber and Divecha et al. 2023](https://elifesciences.org/reviewed-preprints/84628).


### Run nnSVG

```{r, message=FALSE}
library(SpatialExperiment)
library(WeberDivechaLCdata)
library(scran)
library(nnSVG)
library(ggplot2)
```

```{r}
# load example dataset from WeberDivechaLCdata package
# see '?WeberDivechaLCdata_Visium' for more details
spe <- WeberDivechaLCdata_Visium()

dim(spe)

# check sample IDs
table(colData(spe)$sample_id)
```

This dataset has a complex experimental design consisting of multiple samples (Visium capture areas) and multiple parts (contiguous tissue areas) for some of these samples, as described in our paper [Weber and Divecha et al. 2023](https://elifesciences.org/reviewed-preprints/84628). For the simplified example here, we select two samples. For a dataset with more than two samples, this example can be extended by including more samples within the loops.

```{r}
# select sample IDs for example
samples_keep <- c("Br6522_LC_1_round1", "Br6522_LC_2_round1")

# subset SpatialExperiment object to keep only these samples
spe <- spe[, colData(spe)$sample_id %in% samples_keep]

# remove empty levels from factor of sample IDs
colData(spe)$sample_id <- droplevels(colData(spe)$sample_id)

# check
dim(spe)
table(colData(spe)$sample_id)
```

```{r}
# preprocessing steps

# keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)

# skip spot-level quality control, since this has been performed previously 
# on this dataset
```

```{r}
# filter low-expressed genes

# filter out genes with extremely low expression
# using simple threshold on total UMI counts summed across all spots
# approximate threshold of 10 UMI counts per sample
n_umis <- 20
ix_low_genes <- rowSums(counts(spe)) < n_umis
table(ix_low_genes)

spe <- spe[!ix_low_genes, ]
dim(spe)

# filter any new zeros created after filtering low-expressed genes

# remove genes with zero expression
ix_zero_genes <- rowSums(counts(spe)) == 0
table(ix_zero_genes)

if (sum(ix_zero_genes) > 0) {
  spe <- spe[!ix_zero_genes, ]
}

dim(spe)

# remove spots with zero expression
ix_zero_spots <- colSums(counts(spe)) == 0
table(ix_zero_spots)

if (sum(ix_zero_spots) > 0) {
  spe <- spe[, !ix_zero_spots]
}

dim(spe)
```

Run nnSVG once per sample and store the resulting lists of top SVGs.

Note that we perform additional gene filtering for nnSVG individually per sample, and re-calculate logcounts after the gene filtering for each sample.

In this dataset, mitochondrial genes are of specific biological interest (see our paper [Weber and Divecha et al. 2023](https://elifesciences.org/reviewed-preprints/84628) for details), so we do not filter these genes out (`filter_mito = FALSE`). In most datasets, mitochondrial genes are not of biological interest and should be filtered out (`filter_mito = TRUE`).

For faster runtime in this example, we also subsample a small number of genes before running `nnSVG`, as in the examples above. For a full analysis, the subsampling step should be skipped.
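Note that the examples in this vignette use a single thread for model fitting. For larger datasets, runtime can be reduced by fitting across multiple threads; the following is a minimal sketch (e.g. for use inside the per-sample loop below), assuming the `n_threads` argument described in `?nnSVG`.

```{r, eval=FALSE}
# sketch: run nnSVG with multiple threads for faster runtime on larger datasets
# (the 'n_threads' argument name is assumed from '?nnSVG')
spe_sub <- nnSVG(spe_sub, n_threads = 4)
```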
```{r}
# run nnSVG once per sample and store lists of top SVGs

sample_ids <- levels(colData(spe)$sample_id)

res_list <- as.list(rep(NA, length(sample_ids)))
names(res_list) <- sample_ids

for (s in seq_along(sample_ids)) {
  
  # select sample
  ix <- colData(spe)$sample_id == sample_ids[s]
  spe_sub <- spe[, ix]
  
  dim(spe_sub)
  
  # run nnSVG filtering for mitochondrial genes and low-expressed genes
  # (note: set 'filter_mito = TRUE' in most datasets)
  spe_sub <- filter_genes(
    spe_sub, 
    filter_genes_ncounts = 3, 
    filter_genes_pcspots = 0.5, 
    filter_mito = FALSE
  )
  
  # remove any zeros introduced by filtering
  ix_zeros <- colSums(counts(spe_sub)) == 0
  if (sum(ix_zeros) > 0) {
    spe_sub <- spe_sub[, !ix_zeros]
  }
  
  dim(spe_sub)
  
  # re-calculate logcounts after filtering
  spe_sub <- computeLibraryFactors(spe_sub)
  spe_sub <- logNormCounts(spe_sub)
  
  # subsample small set of random genes for faster runtime in this example
  # (same genes for each sample)
  # (note: skip this step in a full analysis)
  if (s == 1) {
    set.seed(123)
    ix_random <- sample(seq_len(nrow(spe_sub)), 30)
    genes_random <- rownames(spe_sub)[ix_random]
  }
  spe_sub <- spe_sub[rownames(spe_sub) %in% genes_random, ]
  
  dim(spe_sub)
  
  # run nnSVG
  set.seed(123)
  spe_sub <- nnSVG(spe_sub)
  
  # store results for this sample
  res_list[[s]] <- rowData(spe_sub)
}
```

Now that we have run nnSVG once per sample, we can combine the results across multiple samples by averaging the ranks of the SVGs. This approach was applied in our recent paper [Weber and Divecha et al. 2023](https://elifesciences.org/reviewed-preprints/84628).

```{r}
# combine results for multiple samples by averaging gene ranks 
# to generate an overall ranking

# number of genes that passed filtering (and subsampling) for each sample
sapply(res_list, nrow)

# match results from each sample and store in matching rows
res_ranks <- matrix(NA, nrow = nrow(spe), ncol = length(sample_ids))
rownames(res_ranks) <- rownames(spe)
colnames(res_ranks) <- sample_ids

for (s in seq_along(sample_ids)) {
  stopifnot(colnames(res_ranks)[s] == sample_ids[s])
  stopifnot(colnames(res_ranks)[s] == names(res_list)[s])
  
  rownames_s <- rownames(res_list[[s]])
  res_ranks[rownames_s, s] <- res_list[[s]][, "rank"]
}

# remove genes that were filtered out in all samples
ix_allna <- apply(res_ranks, 1, function(r) all(is.na(r)))
res_ranks <- res_ranks[!ix_allna, ]

dim(res_ranks)

# calculate average ranks
# note missing values due to filtering for samples
avg_ranks <- rowMeans(res_ranks, na.rm = TRUE)

# calculate number of samples where each gene is within top 100 ranked SVGs 
# for that sample
n_withinTop100 <- apply(res_ranks, 1, function(r) sum(r <= 100, na.rm = TRUE))

# summary table
df_summary <- data.frame(
  gene_id = names(avg_ranks), 
  gene_name = rowData(spe)[names(avg_ranks), "gene_name"], 
  gene_type = rowData(spe)[names(avg_ranks), "gene_type"], 
  overall_rank = rank(avg_ranks), 
  average_rank = unname(avg_ranks), 
  n_withinTop100 = unname(n_withinTop100), 
  row.names = names(avg_ranks)
)

# sort by average rank
df_summary <- df_summary[order(df_summary$average_rank), ]
head(df_summary)

# top n genes
# (note: NAs in this example due to subsampling genes for faster runtime)
top100genes <- df_summary$gene_name[1:100]

# summary table of "replicated" SVGs (i.e. genes that are highly ranked in at 
# least x samples)
df_summaryReplicated <- df_summary[df_summary$n_withinTop100 >= 2, ]

# re-calculate rank within this set
df_summaryReplicated$overall_rank <- rank(df_summaryReplicated$average_rank)

dim(df_summaryReplicated)
head(df_summaryReplicated)

# top "replicated" SVGs
topSVGsReplicated <- df_summaryReplicated$gene_name
```
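As an optional check (not part of the original workflow), it can be useful to inspect how consistent the per-sample rankings are before relying on the averaged ranks, for example via the rank correlation between the two samples in `res_ranks`:

```{r, eval=FALSE}
# optional check: concordance of SVG ranks between the two samples
# (Spearman correlation using genes that were ranked in both samples)
cor(res_ranks[, 1], res_ranks[, 2], 
    method = "spearman", use = "pairwise.complete.obs")
```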
```{r}
# plot one of the top SVGs

head(df_summary, 5)

ix_gene <- which(rowData(spe)$gene_name == df_summary[2, "gene_name"])

df <- as.data.frame(cbind(
  colData(spe), 
  spatialCoords(spe), 
  gene = counts(spe)[ix_gene, ]
))

ggplot(df, aes(x = pxl_col_in_fullres, y = pxl_row_in_fullres, color = gene)) + 
  facet_wrap(~ sample_id, nrow = 1, scales = "free") + 
  geom_point(size = 0.6) + 
  scale_color_gradient(low = "gray80", high = "red", 
                       trans = "sqrt", name = "counts", 
                       breaks = range(df$gene)) + 
  scale_y_reverse() + 
  ggtitle(paste0(rowData(spe)$gene_name[ix_gene], " expression")) + 
  theme_bw() + 
  theme(aspect.ratio = 1, 
        panel.grid = element_blank(), 
        plot.title = element_text(face = "italic"), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())
```


# Troubleshooting

This section describes some common errors and solutions. Please report any additional issues via [GitHub issues](https://github.com/lmweber/nnSVG/issues).


## Zeros (genes or spots)

After filtering out low-expressed genes or low-quality spots, it is possible that new zeros have been introduced, i.e. either rows (genes) or columns (spots) with all zero values. This can create errors in subsequent steps, e.g. when calculating logcounts. The following example code can be used or adapted to check for and remove any zeros.

```{r, eval=FALSE}
# filter any new zeros created after filtering low-expressed genes

# remove genes with zero expression
ix_zero_genes <- rowSums(counts(spe)) == 0
table(ix_zero_genes)

if (sum(ix_zero_genes) > 0) {
  spe <- spe[!ix_zero_genes, ]
}

dim(spe)

# remove spots with zero expression
ix_zero_spots <- colSums(counts(spe)) == 0
table(ix_zero_spots)

if (sum(ix_zero_spots) > 0) {
  spe <- spe[, !ix_zero_spots]
}

dim(spe)
```


## Small numbers of spots

The nnSVG model requires a minimum number of spots for reliable model fitting and parameter estimation. Running nnSVG with too few spots can return errors from the BRISC model fitting functions. As an approximate guide, we recommend at least 100 spots (spatial locations). In datasets with multiple samples, the minimum number of spots applies per sample.

In models with covariates for spatial domains, a minimum number of spots per spatial domain is also required. If the dataset contains too few spots per domain, one possible solution is to combine spatial domains into larger domains, e.g. if the main biological interest is one domain vs. all the rest (where the rest can be combined into a single domain).


## Multiple samples

In datasets with multiple samples, nnSVG can be run once per sample and the results combined by averaging the ranks of the identified SVGs per sample, as demonstrated in the example above. In this case, ensure that:

- the loop is set up correctly (subsetting each sample within the loop)
- gene filtering is performed per sample within the loop
- any new zeros (rows/genes or columns/spots) introduced by the filtering are removed
- logcounts are re-calculated after filtering
- rows (genes) are matched correctly when combining the results from the multiple samples


# Session information

```{r}
sessionInfo()
```