---
title: "Analyzing NanoString nCounter Data with the NanoTube"
author: "Caleb A Class"
output: rmarkdown::html_vignette
bibliography: NanoTube.bib
vignette: >
    %\VignetteIndexEntry{NanoTube Vignette}
    %\VignetteEngine{knitr::rmarkdown}
    \usepackage[utf8]{inputenc}
---


```{r setup, echo = FALSE, results = "hide"}
knitr::opts_chunk$set(tidy = FALSE,
                      cache = FALSE,
                      dev = "png",
                      message = FALSE, error = FALSE, warning = TRUE,
                      fig.height = 4, fig.width = 4)
```


# Abstract

This library provides tools for the processing, normalization, analysis, and
visualization of NanoString nCounter gene expression data. 
Standard NanoString-suggested analysis steps are supported, and functions
are also provided for interoperability with other published NanoString analysis
methods, as well as a pre-ranked gene set analysis method. This vignette 
provides a simple workflow for nCounter data analysis, as well as more detailed
descriptions of NanoTube functions and options.

# A basic workflow
`processNanostringData` allows you to read in nCounter expression data (from
RCC files or in tabular form) and conduct basic normalization and quality
control checks in one step. We use example data from GEO data series 
GSE117751 [@lundy_t_2018]. For this example, RCC files are provided in 
"GSE117751_RAW", while the sample characteristics table is 
"GSE117751_sample_data.csv".

```{r Basic1}

library(NanoTube)

example_data <- system.file("extdata", "GSE117751_RAW", package = "NanoTube")
sample_info <- system.file("extdata", 
                           "GSE117751_sample_data.csv", 
                           package = "NanoTube")

```

A variety of data processing and normalization options are provided in 
`processNanostringData`. A set of default options recommended by nCounter can 
be run automatically, or they can be specified and customized. More details 
are provided in the "Processing Data" section. This function also merges the 
expression data with the sample characteristics table (if provided), and 
outputs as an Biobase ExpressionSet.

```{r Basic2, warning = FALSE}

dat <- processNanostringData(nsFiles = example_data,
                             sampleTab = sample_info, 
                             groupCol = "Sample_Diagnosis")

```

There are three groups of samples being compared in this data set.

```{r BasicGroups}

table(dat$groups)

```

You can then run differential expression analysis using Limma. Functions are
also provided to allow for analysis using NanoStringDiff (See 'Differential 
expression analysis - Using NanoStringDiff'). For this example, we will compare
the two disease states vs. the control group ("None") by setting `base.group` to
"None".

```{r Basic3}

limmaResults <- runLimmaAnalysis(dat, base.group = "None")

```

DE results can be viewed or exported to a text fileusing `makeDiffExprFile`. 
For example, we can convert the differential expression statistics to a table 
for easier Viewing.

```{r Basic3b}

limmaStats <- makeDiffExprFile(limmaResults, filename = NULL, returns = "stats")
limmaStats <- as.data.frame(limmaStats)

# Order by lowest to highest p-value for 'Autoimmune Retinopathy' vs. 'None'
knitr::kable(head(limmaStats[order(limmaStats$`p-val (Autoimmune.retinopathy)`, 
                      decreasing = FALSE), 1:4]), 
      row.names = TRUE, format = "html", align = "c")


```

A volcano plot can also be drawn using deVolcano.

```{r BasicVol}

deVolcano(limmaResults, plotContrast = "Autoimmune.retinopathy")

```

DE results can also be used as input for pre-ranked gene set analysis in the 
`fgsea` package, using `limmaToFGSEA` or `nsdiffToFGSEA`. Gene sets can be 
provided in .gmt, .tab, or .rds (list object) format, or a list can be input 
directly. Plenty of additional options for GSEA analysis are available, 
including leading edge analysis, gene set clustering, and reporting options 
(see 'Gene set analysis').

```{r Basic4}

data("ExamplePathways")

fgseaResults <- limmaToFGSEA(limmaResults, gene.sets = ExamplePathways)

# FGSEA was run separately for Autoimmune Retinopathy vs. None and 
# Retinitis Pigmentosa vs. None
names(fgseaResults)

```

```{r BasicTab}

knitr::kable(head(fgseaResults$Autoimmune.retinopathy[
             order(fgseaResults$Autoimmune.retinopathy$pval, 
                   decreasing = FALSE),]), 
      row.names = TRUE, format = "html", align = "c")


```


# Processing Data
## From RCC files
One Reporter Code Count (RCC) file is generated by the nCounter instrument
for each sample, containing the counts for each gene and control in the 
codeset. It also includes some basic quality control (QC) metrics that can
by imported by NanoTube. Two functions are provided in this package: 
`read_rcc`, which reads in a single RCC file, and `read_merge_rcc` which reads
in a vector of RCC file names and combines the data. 

`processNanostringData` includes the reading of these data files, in addition 
to optional quality control and normalization steps. These are described in the
'Quality Control' and 'Normalization' sections. Normalization can be skipped
in this function using `normalization = "none"`. The housekeeping normalization
step can be skipped (retaining positive control normalization) using
`skip.housekeeping = TRUE`.

```{r proc1}

dat <- processNanostringData(nsFiles = example_data,
                             sampleTab = sample_info, 
                             groupCol = "Sample_Diagnosis",
                             normalization = "nSolver",
                             bgType = "t.test", bgPVal = 0.01,
                             skip.housekeeping = FALSE,
                             output.format = "ExpressionSet")

```

## From Other files
Expression data can also be imported from a table in a .txt or .csv file, 
possibly produced by the RCC Collector tool.

```{r proc2}

csv_data <- system.file("extdata", 
                        "GSE117751_expression_matrix.csv", 
                        package = "NanoTube")

dat2 <- processNanostringData(nsFile = csv_data,
                             sampleTab = sample_info, 
                             idCol = "GEO_Accession", 
                             groupCol = "Sample_Diagnosis",
                             normalization = "none")

```
## Normalization
This package provides three options for data normalization: a standard set of 
steps recommended by NanoString for nCounter data, which normalizes the data to 
different sets of control genes [@nanostring_technologies_ncounter_2011]; the 
RUV (Remove Unwanted Variance) method [@gagnon-bartsch_ruv_2019]; or no 
normalization.

To normalize using standard nSolver steps, set `normalization` to "nSolver". 
The three normalization steps under this method are:

1. Positive control normalization: The geometric mean of positive control probes
is used to calculate a scaling factor for each sample.
2. Background assessment: Negative control probes are used to assess whether
endogenous genes are truly expressed above the level of backgrounds. Two options
are provided for this
    + `bgType = "threshold"` - The background threshold will be defined 
    independently for each sample, using 
    `threshold = bgThreshold * sd(NegativeControls) + mean(NegativeControls)`, 
    where `bgThreshold` is set by the user. For each gene, the proportion of 
    samples with expression above the threshold is calculated, and only genes 
    with a proportion greater than `bgProportion` (set by user) will be 
    retained for analysis. For example, if `bgThreshold` is 2 and `bgProportion`
    is 0.5 (defaults), a gene will be retained for analysis if its expression is
    2 standard deviations above the mean of negative control probes in at least
    half of the samples.
    + `bgType = "t.test"` - Expression of each gene is compared with all of the 
    negative control probes (across samples) using a one-sided, two-sample 
    t test. Genes will be retained for analysis if the endogenous gene 
    expression is greater, with p < `bgPVal`.
    + `bgSubtract` - By setting this to TRUE, the calculated background level 
    (`mean + bgThreshold * sd`) will be subtracted from the expression values 
    for each gene in each sample. After subtraction, genes with negative values
    will be set to zero.
3. Housekeeping normalization: The geometric mean of housekeeping gene 
expression is used to calculate a scaling factor for each sample. Housekeeping
genes (symbols or accessions) can be input using the `housekeeping` option. 
If not provided, genes marked as "Housekeeping" in the RCC files will be used. 
Alternatively, this can be skipped using `skip.housekeeping = TRUE`.


```{r norm1}

# Set housekeeping genes manually (optional)
hk.genes <- c("TUBB", "TBP", "POLR2A", "GUSB", "SDHA")

dat <- processNanostringData(nsFiles = example_data,
                             sampleTab = sample_info, 
                             groupCol = "Sample_Diagnosis",
                             normalization = "nSolver",
                             bgType = "t.test", bgPVal = 0.01,
                             housekeeping = hk.genes, skip.housekeeping = FALSE)

```

# Quality Control
NanoTube provides the quality control metrics recommended for NanoString
nCounter data. The raw NanoString data can be loaded for QC using the 
`output.format = "list"` option of `processNanostringData`.

```{r qc1, warning = FALSE}

dat <- processNanostringData(example_data, output.format = "list",
                             includeQC = TRUE)

```

Basic QC and cartridge data are loaded in from the RCC files if `includeQC` is 
set to TRUE.

```{r qc2}

head(dat$qc)[,1:5]

```

Positive QC statistics can be calculated and presented as a table. This includes
the positive scaling factors and R-squared values for the expected vs. observed
positive control counts. NanoString recommends positive scaling factors between
0.3 and 3, and R-squared values greater than 0.95. Samples with values outside
these recommendations should be investigated further. 

```{r qcPos1}

posQC <- positiveQC(dat)

knitr::kable(head(posQC$tab), 
      row.names = FALSE, format = "html", align = "c", digits = 3)

```


Positive control genes can be plotted for all samples (default), or a specified
subset of samples (specified by column index, or sample names).

```{r qcPos2, fig.width = 7}

posQC2 <- positiveQC(dat, samples = 1:6)

posQC2$plt

```


Standard negative control statistics can be obtained using the `negativeQC` 
function.


```{r qcNeg1}

negQC <- negativeQC(dat, interactive.plot = FALSE)

knitr::kable(head(negQC$tab), 
      row.names = TRUE, format = "html", align = "c")

```

Negative control genes can also be plotted for each sample.

```{r qcNeg2, fig.height = 7, fig.width = 7}

negQC$plt

```

Housekeeping normalization scale factors can also be obtained from the
processed data. Manufacturer recommends additional caution for samples with
scale factors outside the range of 0.1-10.

```{r qcHK1}
head(dat$hk.scalefactors)
```

# Data Analysis
## Principal components analysis

PCA can be conducted after processing and normalization. We provide a standard
plot using `ggplot2` or an interactive plot using `plotly` (use the 
`interactive.plot` option).

```{r pca, warning = FALSE, fig.width = 6}

dat <- processNanostringData(example_data, 
                             sampleTab = sample_info, 
                             groupCol = "Sample_Diagnosis",
                             normalization = "nSolver", 
                             bgType = "t.test", 
                             bgPVal = 0.01)

library(plotly) 
nanostringPCA(dat, interactive.plot = TRUE)$plt

```

We default to plotting the first two principal components, but you can also
choose others.

```{r pca2, warning = FALSE, fig.width = 6}

nanostringPCA(dat, pc1=3, pc2=4, interactive.plot = FALSE)$plt

```

## Differential expression analysis
### Using Limma

After normalization, the data are likely to resemble a normal distribution,
particularly with reasonable filtering of genes with expression 
below background levels. It is a good idea to 
check this before using limma [@ritchie_limma_2015]. See `limma` vignette for
full details and additional analysis options.

Assuming this information was provided in the `processNanostringData` step, your
ExpressionSet will already contain the sample groups.

```{r gps}

table(dat$groups)

```

`runLimmaAnalysis` allows you to conduct group-vs-group comparisons by defining
the `base.group`. In this example, setting `base.group` to "None" will cause
Limma to build a linear model fitting the expression data, where the intercept
is the average log2 expression of "None", and the two other coefficients will
correspond to the log2(FC) of Autoimmune retinopathy vs. None and Retinitis 
pigmentosa vs. None.  This function will return standard Limma analysis results.

```{r deLimma}

limmaResults <- runLimmaAnalysis(dat, base.group = "None")

head(limmaResults$coefficients)

```

You can also directly define a design matrix, instead. For example, if this data
were collected in two batches, you could include a batch term in the analysis.

```{r deLimma2, fig.height = 3.5}

# Generate fake batch labels
batch <- rep(c(0, 1), times = ncol(dat) / 2)

# Reorder groups ("None" first)
group <- factor(dat$groups, 
                levels = c("None", "Autoimmune retinopathy", 
                           "Retinitis pigmentosa"))

# Design matrix including sample group and batch
design <- model.matrix(~group + batch)

# Analyze data
limmaResults2 <- runLimmaAnalysis(dat, design = design)

# We can see there is no batch effect in this data, due to the somewhat uniform
# distribution of p-values. This is good, since the batches are fake.
hist(limmaResults2$p.value[,"batch"], main = "p-values of Batch terms")

```

Results of Limma analyses can be visualized using simple volcano plots.

```{r volcano}

deVolcano(limmaResults, y.var = "p.value")

```

You can add additional ggplot layers as well.

```{r volcano2}

deVolcano(limmaResults, plotContrast = "Autoimmune.retinopathy", 
          y.var = "p.value") +
  geom_hline(yintercept = 2, linetype = "dashed", colour = "darkred") +
  geom_vline(xintercept = 0.5, linetype = "dashed", colour = "darkred") +
  geom_vline(xintercept = -0.5, linetype = "dashed", colour = "darkred")


```


They can also be converted to a simple table or exported to a text file.

```{r deExport}

limmaStats <- makeDiffExprFile(limmaResults, filename = NULL, returns = "stats")
limmaStats <- as.data.frame(limmaStats)

# Order by lowest to highest p-value for 'Autoimmune Retinopathy' vs. 'None'
knitr::kable(head(limmaStats[order(limmaStats$`p-val (Autoimmune.retinopathy)`, 
                                   decreasing = FALSE),]), 
      row.names = TRUE, format = "html", align = "c")


```


### Using NanoStringDiff

NanoStringDiff models the data using a negative binomial approximation, which
has been shown to be generally more accurate for gene expression count data 
[@wang_nanostringdiff_2016]. We have provided a function to convert processed,
unnormalized, data to a `NanoStringSet` for use with NanoStringDiff.

```{r nsDiffConvert}

# Remember to set normalization = "none"
datNoNorm <- processNanostringData(nsFiles = example_data,
                             sampleTab = sample_info, 
                             groupCol = "Sample_Diagnosis",
                             normalization = "none")

nsDiffSet <- makeNanoStringSetFromEset(datNoNorm)

colnames(pData(nsDiffSet))

```

Then you can run as described in the NanoStringDiff vignette (see vignette for
full details). Notably, the glm.LRT step is very slow for sample sizes above 10
(and may still be slow with fewer samples). Interested users could also consider
using DESeq2 if a negative binomial method is desired, although that method 
would not explicitly handle the various control genes [@love_moderated_2014].

```{r nsDiffRun, eval = FALSE}

### This block is not run! ###

# This step is fast
nsDiffSet <- NanoStringDiff::estNormalizationFactors(nsDiffSet)

# This step likely to take multiple hours on standard desktop computers.
result <- NanoStringDiff::glm.LRT(nsDiffSet, 
                                  design.full = as.matrix(pData(nsDiffSet)),
                                  contrast = c(1, -1, 0)) 
                                  #Contrast: Autoimmune retinopathy vs. None

```

## Gene set analysis

Gene Set Enrichment Analysis can be conducted on the Limma results using the 
`fgsea` package [@sergushichev_algorithm_2016]. Before running it, it is 
useful to consider whether Gene Set Enrichment Analysis is appropriate for
your specific data set, based on the number and types of genes represented in
your chip, and whether any of them were actually differentially expressed.

Gene set databases can be loaded as a list object (either directly or in an 
.rds file), or as a .gmt (MSigDB or similar) or .tab (CPDB) file. We have 
included a list of pathways from WikiPathways [@martens_wikipathways_2021].  

The `min.set` option is important, as pathways containing only a few genes 
present in your data set will probably not provide informative enrichment 
statistics. We will discard all gene sets where fewer than `min.set` genes from
that set are present in the analysis. You also have the option to rank genes by
'coefficients' (frequently, the log2FC) or the 't' statistics. `skip.first` 
will skip the first column of the limma design if TRUE (default).  Generally, 
this column is the Intercept of the regression, which is not useful for gene 
set analysis.

The `limmaToFGSEA` function will then conduct preranked analysis using `fgsea`
for each column in the Limma coefficients or t-statistic matrix (possibly 
skipping the first).  The output will be a list object containing the results
for each analysis.

```{r gsea1}
data("ExamplePathways")

fgseaResults <- limmaToFGSEA(limmaResults, gene.sets = ExamplePathways, 
                            min.set = 5, rank.by = "t",
                            skip.first = TRUE)


names(fgseaResults)

```

We can order the pathways by p-value and view the top results. As previously
reported in [@lundy_t_2018], we see that immune pathways are significantly
altered in autoimmune retinopathy patients. We also identify EGFR Signaling as a
potential pathway of interest.

```{r gsea2}

fgsea.ordered <- fgseaResults$Autoimmune.retinopathy[
    order(fgseaResults$Autoimmune.retinopathy$pval, decreasing = FALSE),]

knitr::kable(fgsea.ordered[1:5,], 
      row.names = TRUE, format = "html", align = "c")


```

Another similar function, `nsdiffToFGSEA`, is provided to conduct fgsea on 
`NanoStringDiff` results. This one conducts analysis on a single preranked list.

After analysis, the leading edge genes can be extracted for gene sets (with 
some cutoff for enrichment statistics) using `fgseaToLEdge`.

```{r gsea3}

# Leading edge for pathways with adjusted p < 0.2
leading.edge <- fgseaToLEdge(fgsea.res = fgseaResults,
                             cutoff.type = "padj", cutoff = 0.2) 

```

The nominal p-value or NES (normalized enrichment score) can also be used as 
a cutoff. If NES is used, you can either select all gene sets with 
abs(NES) > cutoff, if `nes.abs.cutoff == TRUE`. Otherwise, you can select gene 
sets with NES > cutoff (if cutoff > 0) or NES < cutoff (if cutoff < 0).

```{r gsea4}

# Leading edge for pathways with abs(NES) > 1
leading.edge.nes <- fgseaToLEdge(fgsea.res = fgseaResults,
                                cutoff.type = "NES", cutoff = 1,
                                nes.abs.cutoff = TRUE) 

# Leading edge for pathways with NES > 1.5
leading.edge.nes <- fgseaToLEdge(fgsea.res = fgseaResults,
                                cutoff.type = "NES", cutoff = 1.5,
                                nes.abs.cutoff = FALSE) 

# Leading edge for pathways with NES < -0.5
leading.edge.nes <- fgseaToLEdge(fgsea.res = fgseaResults,
                                cutoff.type = "NES", cutoff = -0.5,
                                nes.abs.cutoff = FALSE) 


```

A basic leading edge heatmap can then be drawn using the `pheatmap` package.

```{r gsea5, fig.width = 8, fig.height = 3}

pheatmap::pheatmap(t(leading.edge$Autoimmune.retinopathy),
                   legend = FALSE,
                   color = c("white", "black"))

```

You can further cluster pathways by their leading edge genes. This is 
particularly useful when you have lots (tens to hundreds) of significantly 
enriched pathways, as you can prioritize certain ones.

```{r gsea6}

# Group pathways with a binary distance below 0.5
fgsea.grouped <- groupFGSEA(fgseaResults$Autoimmune.retinopathy, 
                                   leading.edge$Autoimmune.retinopathy,
                                   join.threshold = 0.5,
                                   dist.method = "binary")


knitr::kable(fgsea.grouped, digits = 4, row.names = FALSE, align = "c")

```

In this example, only "EGF/EGFR Signaling Pathway" and "Kit receptor signaling
pathway" are sufficiently similar to be clustered together ("Cluster 1"), and
the other four pathways are deemed unique. In the heatmap above, we see that
the Kit receptor pathway contains only one leading edge gene unique from the
EGF/EGFR pathway (PTPN6). Based on a lower p-value and higher NES, you would 
consider EGFR Signaling to be more important in this analysis, while the Kit 
receptor pathway is largely redundant. This is denoted by the "Cluster.Max"
variable, which identifies maximum enrichment in each cluster with an "x".

### Exporting GSEA results

`fgseaPostprocessingXLSX` allows you to output the results of gene set analyses
to an Excel spreadsheet (`fgsePostprocessing` is similar, and provides .txt
files). A summary sheet shows the overall GSEA results, while an additional
table for each separate analysis (A vs. Control, B vs. Control, etc.) shows
the differential expression statistics and expression profiles for leading edge
genes. This step requires input of the FGSEA results, the leading edge results,
and the Limma results. It will cluster the pathways if specified, prior to 
generating results tables.

```{r exportGSEA}

fgseaPostprocessingXLSX(genesetResults = fgseaResults,
                        leadingEdge = leading.edge,
                        limmaResults = limmaResults,
                        join.threshold = 0.5,
                        filename = "analysis.xlsx")

```
You can also use `groupedGSEAtoStackedReport` to generate the gene-level report
for one comparison.

```{r exportGSEA2}

results.AR <- groupedGSEAtoStackedReport(
    fgsea.grouped,
    leadingEdge = leading.edge$Autoimmune.retinopathy,
    de.fit = limmaResults)

# View Cluster 1 gene statistics
results.AR[results.AR$Cluster == 1, 1:6]

```

# References