---
title: "CluMSID --- Clustering of MS^2^ Spectra for Metabolite Identification"
author: "Tobias Depke"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{CluMSID Tutorial}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\VignetteDepends{CluMSIDdata, magrittr, dplyr, readr}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    echo = TRUE
)
```

<style>
p.caption {
    font-size: 80%;
    text-align: left;
}
</style>

```{r captions, include=FALSE}
fig1 <- paste("**Figure 1:**",
                "Multidimensional scaling plot as a visualisation of",
                "MS^2^ spectra similarities",
                "of the example data set.",
                "Red dots signify annotated spectra,",
                "black dots spectra from unknown metabolites.")
fig2 <- paste("**Figure 2:**",
                "Multidimensional scaling plot as a visualisation of",
                "neutral loss similarities",
                "of the example data set.",
                "Red dots signify annotated spectra,",
                "black dots spectra from unknown metabolites.")
fig3 <- paste("**Figure 3:**",
                "Screenshot of the interactive version of the",
                "Multidimensional scaling plot visualising",
                "MS^2^ spectra similarities",
                "of the example data set (cf Figure 1).",
                "Zoomed image section with tooltip displaying",
                "feature information upon mouse-over.")
fig4 <- paste("**Figure 4:**",
                "Reachability distance plot resulting from",
                "OPTICS density based clustering of the",
                "MS^2^ spectra similarities",
                "of the example data set.",
                "Bars represent features in OPTICS order",
                "with heights corresponding to the",
                "reachability distance to the next feature.",
                "The dashed horizontal line marks the reachability threshold",
                "that separates clusters.",
                "The resulting clusters are colour-coded",
                "with black representing noise, i.e. features not assigned",
                "to any cluster.")
fig5 <- paste("**Figure 5:**",
                "Reachability distance plot resulting from",
                "OPTICS density based clustering of the",
                "neutral loss similarities",
                "of the example data set",
                "(cf Figure 4).")
fig6 <- paste("**Figure 6:**",
                "Symmetric heat map of the distance matrix displaying",
                "MS^2^ spectra similarities",
                "of the example data set",
                "along with dendrograms resulting from",
                "hierarchical clustering based on the distance matrix.",
                "The colour encoding is shown in the top-left insert.")
fig7 <- paste("**Figure 7:**",
                "Symmetric heat map of the distance matrix displaying",
                "neutral loss similarities",
                "of the example data set",
                "along with dendrograms resulting from",
                "hierarchical clustering based on the distance matrix.",
                "The colour encoding is shown in the top-left insert.")
fig8 <- paste("**Figure 8:**",
                "Circularised dendrogram as a result of",
                "agglomerative hierarchical clustering with average linkage",
                "as agglomeration criterion based on",
                "MS^2^ spectra similarities",
                "of the example data set.",
                "Each leaf represents one feature and colours encode",
                "cluster affiliation of the features.",
                "Leaf labels display feature IDs, along with",
                "feature annotations, if existent.",
                "Distance from the central point is indicative",
                "of the height of the dendrogram.")
fig9 <- paste("**Figure 9:**",
                "Correlation network plot based on",
                "MS^2^ spectra similarities",
                "of the example data set.",
                "Grey dots indicate non-identified features,",
                "orange dots identified ones.",
                "Labels display feature IDs, along with",
                "feature annotations, if existent.",
                "Edge widths are proportional to spectral similarity",
                "of the connected features.")
fig10 <- paste("**Figure 10:**",
                "Screenshot of the interactive version of the",
                "Correlation network plot based on",
                "MS^2^ spectra similarities",
                "of the example data set (cf Figure 9).",
                "Zoomed image section with tooltip displaying",
                "feature information upon mouse-over.")
fig11 <- paste("**Figure 11:**",
                "Correlation network plot based on",
                "neutral loss similarities",
                "of the example data set (cf Figure 9).")
fig12 <- paste("**Figure 12:**",
                "Correlation network plot based on",
                "similarities of pseudospectra",
                "of the example data set (cf Figure 9).")
```

# Introduction

This tutorial shows how to use the `CluMSID` package to help annotate 
MS^2^ spectra from untargeted LC-MS/MS data. `CluMSID` works with MS^2^ 
data generated by data-dependent acquisition and requires an mzXML file 
(like in this example) or any other file that can be parsed by
`mzR`, like mzML, mzTab or netCDF,
as input. It can be used both stand-alone and together with the XCMS suite 
of preprocessing tools.

`CluMSID` extracts and merges MS^2^ spectra and generates neutral loss 
patterns for each feature. Additionally, it can make use of information 
from the `CAMERA` package to generate pseudospectra from MS^1^ level data. 
The tool uses cosine similarity to generate distance matrices from MS^2^ 
spectra, neutral loss patterns and pseudospectra. 

These distance matrices are the basis for multivariate statistics methods 
such as multidimensional scaling, density-based clustering, hierarchical 
clustering and correlation networks. The `CluMSID` package provides functions 
for these methods including (interactive) visualisation but the 
distance/similarity data can also be analysed with other `R` functions.

For the demonstrations in this tutorial, we will mainly use data from pooled 
*Pseudomonas aeruginosa* cell extracts, measured in ESI-(+) mode with 
auto-MS/MS on a Bruker maxis^HD^ qTOF after reversed phase separation by UPLC. 
For details, please refer to the Depke *et al.* 2017 publication 
(doi: 10.1016/j.jchromb.2017.06.002.).

To be able to access the example data, we also need the related package
`CluMSIDdata`. The packages can be loaded as follows:

```{r load_package_0, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("CluMSIDdata", "CluMSID"))
```

```{r load_package_2, eval=TRUE}
library(CluMSID)
library(CluMSIDdata)
```

# `MS2spectrum` and `pseudospectrum` classes

`CluMSID` uses a custom S4 class named `MS2spectrum` to store spectral 
information in the following slots:

* `id`: a character string similar to the ID used by XCMSonline or the ID 
given in a predefined peak list
* `annotation`: a character string containing a user-defined annotation, 
defaults to empty
* `precursor`: (median) *m/z* of the spectrum's precursor ion
* `rt`: (median) retention time of the spectrum's precursor ion
* `polarity`: the polarity with which the spectrum was recorded,
either `positive` or `negative`
* `spectrum`: the actual MS^2^ spectrum as two-column matrix 
(column 1 is (median) *m/z*, column 2 is (median) 
intensity of the product ions)
* `neutral_losses`: a neutral loss pattern generated by subtracting the 
product ion mass-to-charge ratios from the precursor *m/z* in a matrix 
format analogous to the `spectrum` slot

The `pseudospectrum` class is very similar but it contains no information on 
precursor *m/z* and therefore no neutral loss pattern, either. By default, the 
`id` slot contains the "pcgroup" number assigned by `CAMERA`.

The individual slots of `MS2spectrum` and `pseudospectrum` objects can be 
accessed via the standard S4 way using `object@slot`, e.g. `object@annotation`
or by using an accessor function. These exist for all slots and are called
`accessFoo()`, where `Foo` is the slot name (not exactly, though, because 
`Bioconductor` does not allow to mix `snake_case` and `camelCase` in 
function names): 

* `accessID(object)` 
* `accessAnnotation(object)`
* `accessPrecursor(object)`
* `accessRT(object)`
* `accessPolarity(object)`
* `accessSpectrum(object)`
* `accessNeutralLosses(object)`.

# Extract MS^2^ spectra from *.mzXML file

The first step in the `CluMSID` workflow is to extract MS^2^ spectra from the 
raw data file (in mzXML format). This is done by the `extractMS2spectra` 
function which internally uses several functions from the `mzR` package. 
The function offers the possibility to filter spectra that contain less a 
defined number of peaks and/or do not fall in a defined retention time window. 
Setting the `recalibrate_precursor` argument to `TRUE` activates a correction 
process for uncalibrated precursor *m/z* data that existed in older version of 
Bruker's Compass Xport (*cf.* Depke *et al.* 2017). It is not necessary to use 
it with files generated by other software but does not corrupt the data, either.


*Please be aware that `mzR` often throws warnings concerning the `Rcpp`*
*version that can usually be ignored.*

```{r extractMS2spectra_1, warning=FALSE}
ms2list <- extractMS2spectra(system.file("extdata", 
                                            "PoolA_R_SE.mzXML", 
                                            package = "CluMSIDdata"), 
                                min_peaks = 2, 
                                recalibrate_precursor = TRUE, 
                                RTlims = c(0,25))

```

This operation has now extracted all the MS^2^ spectra from the raw data file 
and stored them in a list. Each list entry is an object of class `MS2spectrum`.
The list is quite long because it still contains a lot of spectra that derive 
from the same chromatographic peak.

```{r extractMS2spectra_2}
length(ms2list)
```

In our example, the first two spectra in the list derive from the same peak 
and thus have the same precursor ion and almost the same retention time.

```{r extractMS2spectra_3}
head(ms2list, 4)
```

From the output above, you also see that the `MS2spectrum` class has a `show()`
generic that summarises the MS^2^ spectrum and neutral loss pattern data. 
To show the default output, use `showDefault()`. Be aware that neutral loss 
patterns have not been calculated in this step.


```{r extractMS2spectra_4}
showDefault(ms2list[[2]])
```

# Merge MS^2^ spectra that derive from the same peak/feature

To reduce the amount of redundant MS^2^ spectra, the `mergeMS2spectra()` 
function is used to generate consensus spectra from the MS^2^ spectra that 
derive from the same precursor. `CluMSID` offers two possibilities to do so: 

## Merge spectra without external peaktable

This possibility is the standard method for stand-alone use of `CluMSID` and 
is equivalent to what has been described in Depke *et al.* 2017. It does not 
need additional input and summarises consecutive spectra that have the same 
precursor *m/z* if their retention time fall within a defined threshold 
(`rt_tolerance`, defaults to 30s). A retention time difference between 
consecutive spectra larger than `rt_tolerance` is interpreted as 
chromatographic separation and respective spectra will be assigned to a new 
feature. The `mz_tolerance` argument should be set according to your 
instruments *m/z* precision, the default is 1 * 10^-5^ (10ppm, equivalent to 
Â±5ppm instrument precision). The `peaktable` and `exclude_unmatched` arguments 
are not used in this method and are to be left at their default.

```{r mergeMS2spectra_1}
featlist <- mergeMS2spectra(ms2list)
```
```{r mergeMS2spectra_2}
length(featlist)
```
```{r mergeMS2spectra_3}
head(featlist, 4)
```

The total amount of spectra was reduced from `r length(ms2list)` to 
`r length(featlist)` and as many other, the redundant spectra #1 and #2 in the 
raw list are now merged to one consensus spectrum (#1 in the merged list).

In this step, neutral loss patterns have been generated that look like this:

```{r mergeMS2spectra_4}
accessNeutralLosses(featlist[[1]])
```

## Merge spectra with external peaktable, e.g. from XCMS

The second possibility is to supply a peaktable, i.e. a list of picked peaks 
with their mass-to-charge ratios and retention times. This is particularly 
useful if you want to annotate a complete metabolomics data set. In our 
example, we have a metabolomics dataset called "TD035" in which we have 
measured a range of samples in MS^1^ mode for relative quantification. 
Additionally, we have measured a pooled QC sample in MS^2^ mode for annotation.
The MS^1^ data were analysed using XCMSonline and we want to group the MS^2^ 
spectra so that they match the XCMSonline peak picking.

The spectra are extracted as shown above:

```{r mergeMS2spectra_5}
ms2list2 <- extractMS2spectra(system.file("extdata", 
                                            "TD035-PoolMSMS2.mzXML", 
                                            package = "CluMSIDdata"), 
                                min_peaks = 2, 
                                recalibrate_precursor = TRUE, 
                                RTlims = c(0,25))
```

The peaklist is imported from the XCMSonline output. The list has to contain 
at least 3 columns:

* column 1: name/identifier of the feature
* column 2: *m/z*
* column 3: retention time

Shown below is an easy way of getting from an XCMSonline annotated diffreport 
to a suitable peaktable using tidyverse functions. Of course, you can achieve 
the same goal with base R functions or even in Excel. Depending on the 
retention time format in your *.mzXML file, you might have to convert from 
minutes to seconds or vice versa. Here, we have minutes in the XCMSonline 
output but seconds in the MS^2^ file, so we multiply by 60.

```{r mergeMS2spectra_6, message=FALSE, warning=FALSE}
require(magrittr)
ptable <- 
    readr::read_delim(file = system.file("extdata", 
                                        "TD035_XCMS.annotated.diffreport.tsv", 
                                        package = "CluMSIDdata"), 
                    delim = "\t") %>%
    dplyr::select(c(name, mzmed, rtmed)) %>%
    dplyr::mutate(rtmed = rtmed * 60)

head(ptable)
```

We can now use this peaktable as an argument for `mergeMS2spectra()`. 
You can choose whether you want to keep or exclude MS^2^ spectra that do not 
match any peak in the peaktable. These can occur in regions of the 
chromatogramm where there are no clear peaks but the auto-MS/MS still 
fragments the most abundant ions. These unmatched spectra are merged following 
the same rules as described above (method without peaktable). In this example, 
we keep the unmatched spectra. We use the default values for *m/z* and 
retention time tolerance and thus do not need to specify them.

```{r mergeMS2spectra_7, warning=FALSE}
featlist2 <- mergeMS2spectra(ms2list2, 
                                peaktable = ptable,
                                exclude_unmatched = FALSE)

head(featlist2, 4)
```

Note that the 2^nd^ entry in `featlist2` is marked with an 'x' which 
means that it could not be assigned to a feature in the peaktable.

For the sake of simplicity, only the data generated from the stand-alone 
procedure will be used for the following examples. Be assured that all of 
them would also work with the data generated with the help of an external 
peaktable (`featlist2`).

# Add annotations

The next step is to add (external) annotations to the list of features, 
e.g. from a spectral library that you curate in-house or one that has been 
supplied by your instrument manufacturer. If you do not (want to) annotate 
your features at all, this step can be skipped completely, leaving the 
`annotation` slot of the `MS2spectrum` objects empty.

## Manual procedure

`CluMSID` offers several possibilities to add annotations to your feature list.
The most basic one first generates a list of features and saves it as *.csv 
file. For that you use the `writeFeaturelist()` function and only have to 
specify your list of spectra and a file name for the output file 
(here: `pre_anno.csv`). You can then manually fill in your annotations in a 
new column in the table, save it (in this example under the name 
`post_anno.csv`) and reload it to `R`:

```{r addAnnotations_0, eval=FALSE}
writeFeaturelist(featlist, "pre_anno.csv")
```
```{r addAnnotations_1}
annotatedSpeclist <- addAnnotations(featlist, system.file("extdata", 
                                                    "post_anno.csv", 
                                                    package = "CluMSIDdata"))
```
`annotatedSpeclist` will then be equivalent to `featlist` with annotations 
added to the `annotation` slot of the list entries.

## Alternative procedures

You can add annotations without leaving the `R` environment, too. 
`addAnnotations()` also accepts objects of class `data.frame` as 
`annolist` argument. Be aware that `addAnnotations()` assigns the 
annotation based on the position in the feature list. I.e., if the order 
of the features in your list of features (`featlist`) and your list of 
annotations (`annolist`) is different, you will get nonsense results.

The savest ways to `addAnnotations()` with a `data.frame` is to use 
`featureList()` to generate a `data.frame` that is formatted in the same 
way as the file output from `writeFeaturelist()` and then match your 
identifications against this `data.frame` and use the result as argument 
for `addAnnotations()`.

```{r addAnnotations_2, include=FALSE}
require(magrittr)
annos <- read.csv(system.file("extdata", 
                                "post_anno.csv", 
                                package = "CluMSIDdata"), 
                    stringsAsFactors = FALSE) %>%
    dplyr::filter(nchar(annotation) > 1) %>%
    dplyr::select(id, annotation)
```

Say you have an object called `annos` that contains feature IDs 
(the same as in `featlist`) and annotations in a two-column `data.frame` 
with "id" and "annotation" as column names. It could look like this:

```{r addAnnotations_3}
str(annos)
head(annos)
```

`addAnnotations(featlist, annos, annotationColumn = 2)` will throw an error 
because `featlist` and `annos` are of different length. 
Instead, you need to do the following:
```{r addAnnotations_4}
fl <- featureList(featlist)

fl_annos <- dplyr::left_join(fl, annos, by = "id")
```
Now, you can annotate your list of spectra using 
`addAnnotations(featlist, fl_annos, annotationColumn = 4)`.

An analogous procedure works if you have your annotations stored 
in a peaktable that you have used for `mergeMSspectra()`. 
As the order of spectra in the list will not be same as the order 
of features in your peaktable, you need to do a matching with the 
output of `featureList()` as well.

# Generate distance matrices

Once we have a list of `MS2spectrum` objects containing all the 
required information with or without annotation, we can generate 
distance matrices from (product ion) MS^2^ spectra as well as from 
neutral loss patterns. These distance matrices serve as the basis for 
further analysis of the data. Both for MS^2^ spectra and neutral loss 
patterns, cosine similarity is used as similarity metric:

$$
cos(\theta) = 
\frac{\sum_{i}a_i \cdot b_i}{\sqrt{\sum_{i}{a_{i}}^2 \cdot \sum_{i}{b_{i}}^2}}
$$

```{r distanceMatrix_0, include=FALSE}
load(file = system.file("extdata", "distmat.RData", package = "CluMSIDdata"))
load(file = system.file("extdata", "nlmat.RData", package = "CluMSIDdata"))
```

## Distance matrix for product ion spectra

For most applications, analysing the similarity of product ion MS^2^ spectra 
will be most useful. The generation of the distance matrix is done by just one 
simple command but it can take some time to calculate.

```{r distanceMatrix_1, eval=FALSE}
distmat <- distanceMatrix(annotatedSpeclist)
```

## Distance matrix for neutral loss patterns

Common neutral losses and neutral loss patterns can convey information about 
structural similarity, as well, e.g. with nucleotides or glykosylated 
secondary metabolites. `CluMSID` offers the possibility to study neutral 
loss patterns independently from product ion spectra. The generation of a 
distance matrix is analogous, you just need to set the 'type' argument to 
"neutral_losses":

```{r distanceMatrix_2, eval=FALSE}
nlmat <- distanceMatrix(annotatedSpeclist, type = "neutral_losses")
```

# Visualise distance/similarity data using multidimensional scaling (MDS)

One rather simple possibility to visually analyse the spectral similarity data 
is multidimensional scaling, a dimension reduction method that simplifies 
distances in *n*-dimensional space to those in two-dimensional space (*n* in 
this case being the number of consensus spectra or neutral loss patterns that 
were used to generate the distance matrix in the previous step). `CluMSID` 
offers a simple function to produce an MDS plot from the distance matrix with 
the option to highlight annotated metabolites and the possibility to 
generate an interactive plot using `plotly`.

Standard MDS plots are generated as follows:

For MS^2^ spectra:
```{r MDSplotplot_1, fig.width=5, fig.asp=1, fig.cap=fig1, message=FALSE}
MDSplot(distmat, highlight_annotated = TRUE)
```

For neutral loss patterns:
```{r MDSplotplot_2, fig.width=5, fig.asp=1, fig.cap=fig2, message=FALSE}
MDSplot(nlmat, highlight_annotated = TRUE)
```

Interactive plots are zoomable and show feature names upon mouse-over. 
They are generated like normal MDS plots and can be viewed within 
RStudio or---after saving as html file using `htmlwidgets`---displayed 
in a normal web browser.

```{r MDSplotplot_3, eval=FALSE}
my_mds <- MDSplot(distmat, interactive = TRUE, highlight_annotated = TRUE)

htmlwidgets::saveWidget(my_mds, "mds.html")
```

This is how it looks like if you open the html file in 
Firefox and mouse over a feature:

```{r MDSplotplot_4, echo=FALSE, out.width="100%", fig.cap=fig3}
knitr::include_graphics(system.file("extdata", 
                                    "interactive_mds.png", 
                                    package = "CluMSIDdata"))
```

# Perform density-based clustering using the OPTICS algorithm

For density-based clustering with `CluMSID`, the 'OPTICS' algorithm and 
its implementation in the `dbscan` package is used. Density-based clustering 
is a useful clustering method that often yields different results than 
hierarchical clustering and can thus provide additional insight into the data. 
`CluMSID` has two functions to perform density-based clustering, one for the 
reachability plot which is the most useful visualisation of OPTICS results 
and one that outputs a `data.frame` containing the cluster assignations for 
every feature.

Both functions require as arguments a distance matrix as well as three 
parameters for the underlying functions `dbscan::optics` and 
`dbscan::extractDBSCAN`: `eps`, `minPts` and `eps_cl`. Lowering the `eps` 
parameter (default is 10000) limits the size of the epsilon neighbourhood 
which from experience has very little effect on the results. `minPts` 
defaults to 3 in `CluMSID`. It defines how many points are considered for 
reachability distance calculation between clusters. The `dbscan::optics` 
default for `minPts` is 5. Users are encourage to experiment with this 
parameter. `eps_cl` is the reachability threshold to identify clusters 
and can be varied based on your data. Lowering `eps_cl` leads to a larger 
number of smaller clusters and vice versa for raising the value. In general, 
it is advisable to chose a higher `eps_cl` for MS^2^ spectra than for 
neutral loss patterns, since the latter tend to show less similarity to 
each other.
For details, please refer to the `dbscan` help for the `dbscan::optics` 
and `dbscan::extractDBSCAN` functions.

If the default parameters are used, the generation of an OPTICS reachability 
plots is very simple, shown here for MS^2^ spectra and neutral loss patterns:

```{r OPTICSplot, fig.width=7, fig.asp=0.6, fig.cap=c(fig4, fig5)}
OPTICSplot(distmat)
OPTICSplot(nlmat, eps_cl = 0.7)
```

In the reachability plots, every line represents a feature and the height of 
the line is the reachability distance to the next feature in the OPTICS order. 
Thus, valleys represent groups of similar spectra or neutral loss patterns. 
The order and the cluster assignment can be studied using the 
`OPTICStbl` function that outputs a three-column `data.frame` with 
feature id, cluster assignment and OPTICS order. The order of features in the 
`data.frame` corresponds to the original order in the input distance matrix. 
Features that were not assigned to a cluster are black in the reachability 
plot and have the cluster ID 0. `OPTICStbl` takes the same arguments 
as `OPTICSplot`. The two functions have to be run with exactly the 
same parameters to assure compatibility of results.

```{r OPTICStbl}
OPTICStbl <- OPTICStbl(distmat)

head(OPTICStbl)
```

# Perform hierarchical clustering

In Depke *et al.* 2017, hierarchical clustering proved the most useful 
method to unveil structural similarities between features. analogous 
to density-based clustering, `CluMSID` offers two functions, one for 
plots and one for a `data.frame` with cluster assignments, both taking
a distance matrix as the only compulsory argument. The other two parameters 
are `h` (defaults to `0.95`), the height where the tree should be cut 
(see `stats::cutree` for details) and `type` that determines the type 
visualisation:

* `heatmap`: a heatmap displaying pairwise similarities/distances along 
with cluster dendrograms
* `dendrogram` (default): a circular dendrogram with colour code for 
cluster assignment

## Create a heatmap

Heatmaps of our example data for MS^2^ and neutral loss pattern similarity 
are created as follows (with reduced label font size by changing
`cexRow` and `cexCol` as well as `margins` of the underlying 
`heatmap.2` function):

```{r HCplot_1, fig.width=7, fig.asp=1, fig.cap=c(fig6, fig7)}
HCplot(distmat, type = "heatmap", 
                cexRow = 0.1, cexCol = 0.1,
                margins = c(6,6))
HCplot(nlmat, type = "heatmap", 
                cexRow = 0.1, cexCol = 0.1,
                margins = c(6,6))
```

Obviously, it makes sense to export the plots to larger pdf or png files 
(e.g. 2000 $\times$ 2000 pixels) to examine them closely. If exported to pdf, 
the feature names remain searchable (`Ctrl+F` in Windows).

## Create a dendrogram

With the dendrogram, too, it is advisable to export is to pdf 
in a large format, e.g. as follows:

```{r HCplot_2, eval=FALSE}
pdf(file = "CluMSID_dendro.pdf", width = 20, height = 20)
HCplot(distmat)
dev.off()
```
The plot from our example data looks like this:

```{r HCplot_3, echo=FALSE, out.width="100%", fig.asp=1, fig.cap=fig8}
knitr::include_graphics(system.file("extdata", 
                                    "CluMSID_dendro2.png", 
                                    package = "CluMSIDdata"))
```

The clusters are colour-coded and if exported to pdf, the tip labels 
containing feature ID and annotation are searchable.The height of the 
dendrogram's branching points serves as another piece of information 
when interpreting the clustered data as it signifies similarity of features.

For a detailed example of how to interpret, please refer to 
Depke *et al.* 2017, where `CluMSID` helped to identify new members 
of several classes of secondary metabolites in *Pseudomonas aeruginosa*.

Like with density-based clustering, it is also possible to generate a 
list of features with respective cluster assignments using `HCtbl`. 
As mentioned above for `CluMSID_OPTISplot` and `OPTICStbl`, it is 
crucial to run `HCplot` and `HCtbl` using the same parameters.

```{r HCtbl}
HCtbl <- HCtbl(distmat)

head(HCtbl)
```

# Generate a correlation network

As a new functionality, `CluMSID` offers the possibility to analyse the 
similarity data using weighted correlation networks. These networks offer 
some advantages with respect to standard clustering methods, most notably 
that they do not strictly assign every feature to a distinct cluster but 
also represent similarities between features that would fall into different 
clusters in hierarchical or density-based clustering. Thus, correlation 
networks potentially contain more useful information for data interpretation. 
On the downside, the interpretation is also complicated by this lack of 
concrete cluster assignments. E.g., we cannot simply look up which features 
belong to the same cluster in order to examine their spectra closely but we 
have to go back to the correlation network visualisation and search for 
connected features manually.

`networkplot` requires some arguments:

* `distmat`: *matrix*; a distance matrix like for all other functions 
described above
* `interactive`: *logical*; Similar to `MDSplotplot`, 
correlation network can be generate as interactive plots that are zoomable 
and display feature IDs on mouse-over. If that is desired, set `interactive` 
to `TRUE` (default is `FALSE`).
* `show_labels`: *logical*; whether to display feature IDs in the 
(non-interactive) plot (default is `FALSE`, ignored if `interacive = TRUE`)
* `label_size`: *numeric*; font size of feature ID labels (default is `1.5`, 
which is way smaller than the default in `GGally::ggnet2`, `4.5`)
* `highlight_annotated`: *logical*; whether to plot dots for features 
with annotation in a different colour (same as in `MDSplotplot`, 
default is `FALSE`)
* `min_similarity`: *numeric*; the minimum similarity (1 -- distance) 
threshold (similarities below this threshold will be ignored, default is `0.1`)
* `exclude_singletons`: *logical*; whether to exclude features from the plot 
that do not have connections to other features, particularly useful with 
data sets containing very dissimilar spectra, e.g. neutral loss patterns 
or MS^1^ pseudospectra (default is `FALSE`)

A standard non-interactive correlation network for the MS^2^ example data 
can be plotted like this:

```{r networkplot_1, message=FALSE, fig.width=7, fig.asp=1, fig.cap=fig9}
networkplot(distmat, highlight_annotated = TRUE, 
                show_labels = TRUE, interactive = FALSE)
```

As you can guess from this plot, it makes sense to use the interactive 
visualisation. Just like with `MDSplotplot`, you can view the 
interactive plot within RStudio or save it as html and view it in web browser.

```{r networkplot_2, eval=FALSE}
my_net <- networkplot(distmat, interactive = TRUE, 
                            highlight_annotated = TRUE)

htmlwidgets::saveWidget(my_net, "net.html")
```

This is how it looks like if you open the html file in Firefox, 
zoom in on a cluster and mouse over a feature:

```{r networkplot_3, echo=FALSE, out.width="95%", fig.cap=fig10}
knitr::include_graphics(system.file("extdata", 
                                    "interactive_net.png", 
                                    package = "CluMSIDdata"))
```

Please be aware that the spatial arrangement of the data points in the plot 
has a random component, i.e. while the relative position of the points 
(the distance to each other) is always the same, the absolute position 
varies and will not be the same even if the same command is executed twice.

The pairwise similarity of spectra or neutral loss patterns of features 
expressed by the cosine score is signified by the width of the line connecting 
the two features. All pairwise similarities greater than `min_similarity` 
result in a connecting line in the plot. The spatial proximity in which the 
features are mapped onto the plot is determined by the multivariate method 
underlying the network generation.

As we have already noticed after inspection of the heatmaps on p.13--14, 
the neutral loss patterns show much less similarity to each other than 
the MS^2^ spectra data. Thus, we expect quite a few neutral loss patterns 
that do not show any similarity to another neutral loss pattern. 
This expectation justifies the exclusion of these 'singletons' from the 
correlation network analysis. 
To do so, just set `exclude_singletons` to `TRUE`:

```{r networkplot_4, message=FALSE, fig.width=7, fig.asp=1, fig.cap=fig11}
networkplot(nlmat, highlight_annotated = TRUE, 
                show_labels = TRUE, exclude_singletons = TRUE)
```

# Additional functionalities

Multidimensional scaling, density-based clustering, 
hierarchical clustering and correlation network analysis are the main 
`CluMSID` tools to analyse MS^2^ spectra or neutral loss pattern similarity 
data, however, the package contains some additional functionalities that 
may facilitate data analysis in some cases and can also be used in other 
contexts with or without the above-mentioned unsupervised methods.

## Access individual spectra from a list of spectra by various slot entries

Accessing S4 objects within lists is not trivial. Therefore, 
`CluMSID` offers a function to access individual or several 
`MS2spectrum` objects by their slot entries. `getSpectrum()` 
requires the following arguments:

* `featlist`: a `list` that contains only objects of class `MS2spectrum`
* `slot`: the slot to be searched 
(invalid `slot` arguments will produce errors):
    + `id`
    + `annotation`
    + `precursor` (*m/z* of precursor ion)
    + `rt` (retention time of precursor)
* `what`: the search term or number, must be *character* for `id` 
and `annotation` and *numeric* for `precursor` and `rt`
* `mz.tol`: the tolerance used for precursor ion *m/z* searches, 
defaults to 1E-05 (10ppm)
* `rt.tol`: the tolerance used for precursor ion retention time searches, 
defaults to 30s; high values can be used to specify retention time ranges 
(see example)

Some examples will demonstrate the use of `getSpectrum()`:

**1. Accessing a spectrum by its ID.** For this, 
the exact feature ID must be known:

```{r getSpectrum_1}
getSpectrum(annotatedSpeclist, "id", "M244.17T796.4")
```

**2. Accessing a spectrum by its annotation.** For this, 
the exact annotation has to be known as well, 
other annotations will produce a message:

```{r getSpectrum_2}
getSpectrum(annotatedSpeclist, "annotation", "HHQ")
```
```{r getSpectrum_3}
getSpectrum(annotatedSpeclist, "annotation", "C7-HQ")
```

**3. Accessing spectra by their precursor ion *m/z*.** If the list contains 
more than one spectrum with a precursor ion *m/z* within the tolerance, 
the output is again a list of `MS2spectrum` objects that meet the specified 
criterion:

```{r getSpectrum_4}
getSpectrum(annotatedSpeclist, "precursor", 286.18, mz.tol = 1E-03)
```

**4. Accessing spectra by their precursor retention time.** Here, too, 
we can extract several `MS2spectrum` objects by setting a larger retention 
time tolerance. If we want to extract the spectra of all compounds that 
elute from 6min (360s) to 8min (480s), we proceed as follows:

```{r getSpectrum_5}
six_eight <- getSpectrum(annotatedSpeclist, "rt", 420, rt.tol = 60)
length(six_eight)
```

## Find spectra that contain a specific fragment or neutral loss

Another pair of accessory functions is `findFragment()` and `findNL()` 
which are used to find spectra that contain a specific fragment ion or 
neutral loss. Analogous to `getSpectrum()`, they need as arguments a list 
of `MS2spectrum` objects, the *m/z* of the fragment or neutral loss of 
interest and the respective *m/z* tolerance in ppm (default is 10ppm).

The two functions can be useful in many situation, e.g. when working with 
lipid data where head groups and fatty acids often give characteristic 
fragments or neutral losses. In the world of *P. aeruginosa* secondary 
metabolites, alkylquinolones (AQs) play an important role and most of the 
AQ MS^2^ spectra contain a signature fragment with an *m/z* of 159.068. 
Based on this fragment *m/z*, we can create a list of putative AQs:

```{r findFragment_1}
putativeAQs <- findFragment(annotatedSpeclist, 159.068)
```

An example for common neutral losses are nucleoside monophospates that 
all loose ribose-5'-monophosphate, resulting in a neutral loss of 212.009 
in ESI-(+). Using `findNL()` we find CMP, UMP, AMP and GMP.

```{r findNeutralLoss_1}
findNL(annotatedSpeclist, 212.009)
```

## Match one spectrum against a set of spectra

If you are mainly interested in one or a few number of spectra or neutral 
loss patterns, it may be sufficient to match one feature at a time against 
a larger set of spectra. This set of spectra can be all spectra contained 
in one mzXML file like in all the examples in this tutorial or they could 
be a spectral library, as long as its format in `R` is a list of 
`MS2spectrum` objects.

The `getSimilarities()` function requires several arguments:

* `spec`: The spectrum to be compared to other spectra. Can be either an 
object of class `MS2spectrum` or a two-column numerical matrix that contains 
fragment mass-to-charge ratios in the first and intensities in the second 
column.
* `speclist`: The set of spectra to which `spec` is to be compared. 
Must be a list where every entry is an object of class `MS2spectrum`.
Can be generated from an mzXML file as shown above or constructed using 
`new("MS2spectrum", ...)` for every list entry (see example).
* `type`: Specifies whether MS^2^ spectra or neutral loss patterns are to 
be compared. Must be either 'spectrum' (default) or 'neutral_losses'.
* `hits_only`: Logical that indicates whether the result should contain only 
similarities greater than zero (see example).

In the first example, we want to find all MS^2^ spectra in our example data 
set that are similar to the spectrum of pyocyanin, an important secondary 
metabolite from *Pseudomonas aeruginosa* and therefore match the pyocyanin 
spectrum against our `annotatedSpeclist`. Because we have already identified 
pyocyanin in the data set, we can use `getSpectrum` to extract the 
`MS2spectrum` object from `annotatedSpeclist`. We do not want to search all 
`r length(annotatedSpeclist)` elements of the result vector, so we set 
`hits_only` to `TRUE` to exclude spectra that have 0 similarity to the 
pyocyanin spectrum.

```{r getSimilarities_1}
pyo <- getSpectrum(annotatedSpeclist, "annotation", "pyocyanin")

sim_pyo <- getSimilarities(pyo, annotatedSpeclist, hits_only = TRUE)
sim_pyo
```

We get `r length(getSimilarities(pyo, annotatedSpeclist, hits_only = TRUE))` 
spectra that have a non-zero similarity to the pyocyanin spectrum, including 
pyocyanin itself with a similarity of `1`. Of course, we can further filter 
the data by subsetting the result vector in order to exclude spectra that 
have only minimal similarity, e.g. `M679.43T1051.39` with a cosine similarity 
of only `0.0008` (the last element in the vector).

In the second example, we generate a new `speclist`, e.g. from a spectral 
library. We look at the unknown feature that has most similarity to pyocyanin. 
As pyocyanin is contained in `annotatedSpeclist` itself, we have to look at 
the second highest similarity. Again, we use `getSpectrum()` to extract the 
object from `annotatedSpeclist`:

```{r getSimilarities_2}
highest_sim <- sort(sim_pyo, decreasing = TRUE)[2]

sim_spec <- getSpectrum(annotatedSpeclist, "id", names(highest_sim))
sim_spec
```

We see that the feature is not annotated. We are interested whether this 
feature also shows similarity to other members of the phenazine family 
of *P. aeruginosa* secondary metabolites. Some phenazines are contained 
in `annotatedSpeclist` but some are not, so we make a new `speclist` called 
`phenazines` and add the missing spectra manually from an in-house library:

```{r getSimilarities_3}
phenazines <- list()
phenazines[[1]] <- getSpectrum(annotatedSpeclist, 
                                "annotation", "pyocyanin")
phenazines[[2]] <- getSpectrum(annotatedSpeclist, 
                                "annotation", "phenazine-1-carboxamide")
phenazines[[3]] <- getSpectrum(annotatedSpeclist, 
                                "annotation", "phenazine-1-carboxylic acid")
phenazines[[4]] <- getSpectrum(annotatedSpeclist, 
                                "annotation", "phenazine-1,6-dicarboxylic acid")
phenazines[[5]] <- new("MS2spectrum", id = "lib_entry_1", 
                        annotation = "1-hydroxyphenazine",
                        spectrum = matrix(c(168.0632, 14,
                                            169.0711, 288,
                                            170.0743, 33,
                                            179.0551, 62,
                                            197.0653, 999),
                                        byrow = TRUE,
                                        ncol = 2))
phenazines[[6]] <- new("MS2spectrum", id = "lib_entry_2", 
                        annotation = "2-hydroxy-phenazine-1-carboxylic acid",
                        spectrum = matrix(c(167.0621, 43,
                                            179.0619, 93,
                                            180.0650, 12,
                                            195.0564, 40,
                                            223.0509, 999,
                                            224.0541, 142,
                                            241.0611, 60),
                                        byrow = TRUE,
                                        ncol = 2))
phenazines[[7]] <- new("MS2spectrum", id = "lib_entry_3", 
                        annotation = "pyocyanin (library spectrum)",
                        spectrum = matrix(c(168.0690, 58,
                                            183.0927, 152,
                                            184.0958, 19,
                                            196.0640, 118,
                                            197.0674, 15,
                                            211.0873, 999,
                                            212.0905, 145),
                                        byrow = TRUE,
                                        ncol = 2))

getSimilarities(sim_spec, phenazines, hits_only = FALSE)
```

As a result, we get the interesting information that the MS^2^ spectra 
similarity of our unknown feature seems to be specific to pyocyanin 
(both the experimental and the library spectrum).

## Convert `MSnbase` objects to class `MS2spectrum`

The `MSnbase` package---which is commonly used for proteomics applications 
and is also associated with XCMS3---has two classes for (MS^2^) spectra, 
`Spectrum` and `Spectrum2` which contain spectra along with metainformation. 
These metainformation differ from those contained in `MS2spectrum` objects and 
are not very well suited for metabolomics applications. Still, it is possible 
to use `CluMSID` functions with objects of those two classes by converting 
them to `MS2spectrum` objects using `as.MS2spectrum()`:

```{r convertSpectrum_1, eval=FALSE}
CluMSID_object <- as.MS2spectrum(MSnbase_object)
# or alternatively
CluMSID_object <- as(MSnbase_object, "MS2spectrum")
```

## Split polarities from polarity-switching runs

As polarity-switching and similar methords are gaining importance 
in LC-MS/MS metabolomics, CluMSID offers the possibility to process 
LC-MS/MS data containing spectra of different polarities.
As spectra from positive and negative ionisation show different
fragmentation mechanisms and patterns, it does not appear to be useful
to compare spectra of different polarity to each other. Therefore,
CluMSID provides a function to separate positive and negative spectra
from each other. This has to be done in the very beginning of the analysis
to not interfere with spectral merging. Positive and negative spectra
can than be processed independently from each other as shown above.

A schematic workflow would like like this:

```{r polarities_1, eval=FALSE}
raw_list_mixedpolarities <- extractMS2spectra("raw_file_mixedpolarities.mzXML")

raw_list_positive <- splitPolarities(raw_list_mixedpolarities, "positive")
raw_list_negative <- splitPolarities(raw_list_mixedpolarities, "negative")

speclist_positive <- mergeMS2spectra(raw_list_positive)
speclist_negative <- mergeMS2spectra(raw_list_negative)
```

... and so on as described in this tutorial.

# Use MS^1^ pseudospectra instead of or in addition to MS^2^ data

MS^1^ pseudospectra are groups of peaks/ions that derive or are assumed to 
derive from the same compound. They consist of peaks for in-source fragment, 
adducts etc. Pseudospectra can contain structural information about analytes, 
e.g. about moieties that easily fragment even in MS^1^ mode without CID. 
Thus, it might sometimes be useful to study similarities between pseudospectra 
analogously to those between MS^2^ spectra. `CluMSID` makes use of the 
`CAMERA` package to assign peaks to pseudospectra. A custom S4 class named 
`pseudospectrum` is used which is very similar to the `MS2spectrum` class. 
For obvious reasons, it does not contain a precursor ion *m/z* slot and thus 
no neutral loss pattern, either. The `pcgroup` defined by `CAMERA` is used 
as ID, an annotation can be added if desired.

## Extract pseudospectra

To extract pseudospectra, you first have to process your data using the 
`CAMERA` package, either in R or via XCMSonline, where this is done 
automatically. There are two possibilities to use the 
`extractPseudospectra()` function in `CluMSID`: either with an 
`xsAnnotate` object which you generate with `CAMERA` in R or with a 
`data.frame` that contains data on *m/z*, retention time, intensity 
and `pcgroup`, e.g. the results table from XCMSonline.

The latter is demonstrated with the XCMSonline results table already 
used to generate a peak table. If the column names are not changed, 
the `data.frame` can be supplied as-is and `intensity_columns` does 
not have to be specified. We want to exclude pseudospectra that have 
only one peak, so we set `min_peaks = 2`.

```{r Pseudospectra_1, message=FALSE, warning=FALSE}
pstable <- 
    readr::read_delim(file = system.file("extdata", 
                                        "TD035_XCMS.annotated.diffreport.tsv", 
                                        package = "CluMSIDdata"), 
                    delim = "\t") 

pseudospeclist <- extractPseudospectra(pstable, min_peaks = 2)
```

As a result, we get a list with `r length(pseudospeclist)` 
pseudospectra that we can now process further.

## Create distance matrix for pseudospectra

The creation of a distance matrix is analogous to the procedure for 
MS^2^ spectra:

```{r Pseudospectra_2, eval=FALSE}
pseudodistmat <- distanceMatrix(pseudospeclist)
```
```{r Pseudospectra_3, include=FALSE}
load(file = system.file("extdata", 
                        "pseudodistmat.RData", 
                        package = "CluMSIDdata"))
```

## Generate a correlation network for pseudospectra

The distance matrix can now be used for MDS, clustering and 
correlation networks just like described above. For demonstration, 
we generate a correlation network:

```{r Pseudospectra_4, , fig.width=5, fig.asp=1, fig.cap=fig12}
networkplot(pseudodistmat, show_labels = TRUE, exclude_singletons = TRUE)
```

With the exclusion of singletons, we get a much less busy plot than for 
MS^2^ data but we still find quite a few connections that may prove 
informative.

# Session Info
```{r session}
sessionInfo()
```