---
title: Example of a cytometry data analysis with DepecheR
author: 
- name: Jakob Theorell
  affiliation: 
  - Oxford Autoimmune Neurology Group, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, United Kingdom
  - Department of Clinical Neurosciences, Karolinska Institutet, Stockholm, Sweden
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Example of a cytometry data analysis with DepecheR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}    
output: 
  BiocStyle::html_document:
    titlecaps: false
    toc_float: true
---

```{r style, echo=FALSE, results='hide', message=FALSE}
library(BiocStyle)
library(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

In this document, the user is presented with an analysis that DepecheR
has been written to perform. There are lots of tweaks to this general outline, 
so the user is encouraged to read the help files for each function individually
in addition. In cases where bugs are identified, feedback is most welcome, 
primarily on the github site github.com/theorell/DepecheR. Now let us get 
started. 

# Installation
This is how to install the package, if that has not already been done: 
```{r, eval = FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DepecheR")
```

# Example data description

The data used in this example is a semi-simulated dataset, consisting of 1000
cytotoxic lymphocytes from each of 20 individuals. These have been categorized 
into two groups, and after this, alterations have been added to the sizes of 
some cell populations in both groups. This means that the groups can be 
separated based onthe sizes of certain cell types in the data. And this 
excersize will show how to identify these, and tell us what markers that define 
the separating cell types in question. 

Importantly, DepecheR does not provide any pre-processing tools, such as for
compensation/spectral unmixing of flow cytometry files. The clustering function
does have an internal algorithm to detect data with extreme tails, but this does
not circumvent the need to transform flow- or mass cytometry data. This can be 
done using either commercially available software or with R packages, such as 
Biocpkg("flowSpecs"), Biocpkg("flowCore") or Biocpkg("flowVS"). 

```{r}
library(DepecheR)
data('testData')
str(testData)
```

As can be noted here, the expected input format is either a 
dataframe or a matrix with cells as rows and markers/variables as columns. This 
is in accordance with the .fcs file convention. In this case, however, the 
different samples (coming from donors) should be added to the same dataframe, 
and a donor column should specify which cells that belong to which donor.
If you have .fcs files, you can do this conversion easily using the
"flowSet2LongDf" function in Biocpkg("flowSpecs").

# depeche clustering

With the depeche clustering function, all necessary scaling and parameter
selection is performed under the hood, so all we have to do, when we have the 
file of interest in the right format, is to run the function on the variables
that we want to cluster on.

```{r eval=FALSE}
testDataDepeche <- depeche(testData[, 2:15])
```
```{r}
## [1] "Files will be saved to ~/Desktop"
## [1] "As the dataset has less than 100 columns, peak centering is applied."
## [1] "Set 1 with 7 iterations completed in 14 seconds."
## [1] "Set 2 with 7 iterations completed in 6 seconds."
## [1] "Set 3 with 7 iterations completed in 6 seconds."
## [1] "The optimization was iterated 21 times."
```

```{r echo=FALSE}
data("testDataDepeche")
```
```{r}
str(testDataDepeche)
```

As can be seen above, the output from the function is a relatively complex list.
If the names of each list element is not suficiently self explanatory, see 
(?depeche) for information about each slot. 

## depeche function output graphs
Two graphs are part of the output from the depeche function. 

### Adjusted Rand Index as a function of penalty values
This graph shows 
how internally reproducible the results were for each of the tested penalties. 
An Adjusted Rand Index of 1 shows that if any random subset of observations is 
clustered two times, each observation will be assigned to the same cluster both
times. Conversely, an Adjusted Rand Index of 0 indicates the opposite, i.e. 
totally random distribution. The adjustment in "Adjusted" Rand index takes the 
divering probabilities of ending up with a high or low overlap in the special 
cases of very few and very many clusters into consideration. 

![Adjusted Rand Index as a function of penalty values. Each value is based on at least 7 clustering pairs (DepecheR standard in a 8 core machine).](https://github.com/jtheorell/DepecheRVinjettePics/raw/master/Fig_1.png)

### Cluster centers
This graph shows in a heatmap format where the cluster center is located for 
each of the markers that are defined for the cluster in question. A light color 
indicates a high expression, whereas a dark color indicates low or absent 
expression. Grey color, on the other hand, indicates that the cluster in 
question did not contribute to defining the cluster in question. In some cases,
the results might seem strange, as a cluster might have an expression very close
to the center of the full dataset, but this expression still defined the 
cluster. This is due to an internal, and for stability reasons necessary, effect
of the algotihm: a specific penalty will have a larger effect on a cluster with
fewer observations, than on a cluster with many observations.

![Cluster center heatmap. Marker names on the x-axis, cluster numbers on the y-axis.](https://github.com/jtheorell/DepecheRVinjettePics/raw/master/Fig_2.png)

# tSNE/umap generation

To be able to visualize the results, we need to generate a two-dimensional 
representation of the data used to generate the depeche clustering. Any 
sutiable method, such as tSNE or UMAP can be used for this purpose. I would 
today use uwot::umap, mainly as it in its R implementation is considerably 
faster than tSNE, but we will keep the tSNE here, as it well represents the 
data.

```{r eval=FALSE}
library(Rtsne)
testDataSNE <- Rtsne(testData[,2:15], pca=FALSE)
```

# Visualization of depeche clusters on 2D representation

Now, we want to evaluate how the different clusters are distributed on the 2D
representation.
To do this, we need to generate a color vector from the cluster vector in the 
testDataDepeche. This cluster vector is then overlayed over the tSNE, 
and to make things easier to interpret, a separate legend is included as well.
The reason that the legend is in a separate plot is for making it easier to use
the plots for publication purposes. For file size reasons, it has namely been
necessary to use PNG and not PDF for the plot files. 

NB! The resolution of the files normally generated by DepecheR is considerably 
higher than in this vinjette, due to size restrictions. 

```{r echo=FALSE}
    data("testDataSNE")
```

```{r}
dColorPlot(colorData = testDataDepeche$clusterVector, xYData = testDataSNE$Y, 
           colorScale = "dark_rainbow", plotName = "Cluster")
```

![Cluster distribution over the tSNE field.](https://github.com/jtheorell/DepecheRVinjettePics/raw/master/Fig_3.png)


# Visualization of markers on tSNE
Here, we once again use the dColorPlot function, with different settings. Note
that titles are included in this case. As they are then becoming embedded in the
png picture, this is not the standard, for publication reasons. It is also worth
noting, that nothing is printed to screen, but rather all graphics are saved as
separate files. This is done to save some computational time and effort.

```{r}
dColorPlot(colorData = testData[2], xYData = testDataSNE$Y)

```
![Example of one marker (SYK) distributed over the tSNE field.](https://github.com/jtheorell/DepecheRVinjettePics/raw/master/Fig_4.png)


# Density distribution of groups
Now, we are getting into separating the groups from each other. The first thing
we want to do is to visually compare the densities of the groups. This is done 
in the following way. First, the density for all events are plotted. This is 
followed by plotting of the first and second group of individuals, but keeping 
the density contours from the full dataset. 

```{r}
densContour <- dContours(testDataSNE$Y)

dDensityPlot(xYData = testDataSNE$Y, plotName = 'All_events', 
             colorScale="purple3", densContour = densContour)

#Here the data for the first group is plotted
dDensityPlot(xYData = testDataSNE$Y[testData$label==0,], plotName = 'Group_0', 
             colorScale="blue", densContour = densContour)

#And here comes the second group
dDensityPlot(xYData = testDataSNE$Y[testData$label==1,], plotName = 'Group_1', 
             colorScale="red", densContour = densContour)

```
![Distribution of events in the different groups.](https://github.com/jtheorell/DepecheRVinjettePics/raw/master/Fig_5.png)

# Separating groups from each other
Now, we have arrived at a crucial point: we are now going to see which 
clusters that separate the two groups from each other. There are three functions
in the DepecheR package that can help us do this. The first one is the 
dResidualPlot. 

```{r}
dResidualPlot(
    xYData = testDataSNE$Y, groupVector = testData$label,
    clusterVector = testDataDepeche$clusterVector)
```

This function shows the difference on a 
per-group/per-cluster basis. This means that it is non-statistical, and thus
applicable even if the groups consist of only one or a few samples each. 
However, it cannot distinguish between a rare but very pronounced phenotype and
a common difference: i.e., an individual donor can be responsible for the full 
difference noted. To circumvent this, and to get some statistical inferences, 
two other functions are available: the dWilcox and the dSplsda. These functions
have identical input, but where the former performs a Wilcoxon rank-sum test
(also called Mann-whitney U test) on a per-cluster basis and thus results in
multiple comparisons, the latter (based on sparse projection to latent 
structures (aka partial least squares) discriminant analysis) instead identifies
the angle through the multi-dimensional datacloud created by the individual 
donor frequencies in each cluster that most optimally separates the groups. It 
then internally checks how well the groups are separated along this vector, and
plots the clusters that contribute to this separation with colors relative to 
how well the groups are separated. As this method is a sparse version of the 
method, it, like depeche, only identifies clusters that contribute robustly to 
separating the clusters, and has its internal tuning algorithm to define how the
penalty term should be set. For both methods, there are paired alternatives. In
this case, however, the data is not paired, and thus, the normal methods will be 
used. The standard use of the methods are: 

```{r}
dWilcoxResult <- dWilcox(
    xYData = testDataSNE$Y, idsVector = testData$ids,
    groupVector = testData$label, clusterVector = testDataDepeche$clusterVector)
```

```{r eval=FALSE}
sPLSDAObject <- dSplsda(xYData = testDataSNE$Y, idsVector = testData$ids,
                        groupVector = testData$label, 
                        clusterVector = testDataDepeche$clusterVector)
## Saving 3 x 3 in image

## [1] "The separation of the datasets was perfect, with no overlap between 
## the groups"

## [1] "Files were saved at /Users/jakthe/Labbet/GitHub/DepecheR/vignettes"

```
The object rendered by the dSplsda function is inherited from the 
[mixOmics](https://bioconductor.org/packages/release/bioc/html/mixOmics.html)
package. The object rendered by the dWilcox function is a matrix containing
information about the cluster number, the median in each group, the Wilcoxon 
statistic, the p-value and a p-value corrected for multiple comparisons. 
In addition to the graphs, the dWilcox and the dSplsda functions also output
result files. See ?dSplsda and ?dWilcox for more information. 

![Residual, Wilcoxon, sPLS-DA plots.](https://github.com/jtheorell/DepecheRVinjettePics/raw/master/Fig_6.png)

When investigating the results from the sPLS-DA or the dWilcox analysis, it was
clear that half of the clusters were highly significantly different between the
groups in this case.

# Visualization of defining markers for clusters.
This function is especially useful to view the cluster distributions for 
specific clusters of interest, such as the significant clusters from the 
previousstep. dViolins serves as a compliment to the cluster center heatmap 
(see step 2). 
In this function, the sparsityMatrix from the depeche run is used. The function 
will in addition to producing all the graphs also produce a hierarchy of folders 
where the graphs are placed. 
EDIT 2019-09-12: This function currently does not support most CyTOF data, as
the zero-peak is so dominant, which would make the interpretation of the plots
impossible. This will be fixed in the near future, by reducing the negative 
population.  

```{r eval}
dViolins(testDataDepeche$clusterVector, inDataFrame = testData, 
         plotClusters = 3, plotElements = testDataDepeche$essenceElementList)
```

![Example of the distribution of events in the four markers defining cluster 3.](https://github.com/jtheorell/DepecheRVinjettePics/raw/master/Fig_7.png)

# Summary
In this document, a typical analysis of a cytometry dataset is shown. There are
however very many other possibilities with this package. One of the major ones
is that it accuratly classifies scRNAseq data, and in that process reduces
the complexity of the data up to 1000-fold, as very few transcripts actually
define each cluster. For further information on how to do this, the reader
is currently encouraged to read the publication connected to this package: 
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0203247.

# Session information
```{r}
sessionInfo()
```