---
title: "Grouping FTICR-MS data with xcms"
author:
- name: Joachim Bargsten
- name: Johannes Rainer
package: xcms
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Grouping FTICR-MS data with xcms}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Mass Spectrometry, MS, Metabolomics, Bioinformatics}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{xcms,msdata,MassSpecWavelet,BiocStyle}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Introduction

```{r echo = FALSE, results = "hide", message = FALSE}
library(BiocStyle)
```

This document describes how to use `r Biocpkg("xcms")` for the analysis of
direct injection mass spec data, including peak detection, calibration and
correspondence (grouping of peaks across samples).

# Peak detection

Prior to any other analysis step, peaks have to be identified in the mass spec
data. In contrast to the typical metabolomics workflow, in which peaks are
identified in the chromatographic (time) dimension, in direct injection mass
spec data sets peaks are identified in the m/z dimension. `r Biocpkg("xcms")`
uses functionality from the `MassSpecWavelet` package to identify such peaks.

Below we load the required packages. For information on the parallel processing
setup please see the `BiocParallel` vignette.

```{r load-libs, message = FALSE, results = "hide"}
library(xcms)
library(MassSpecWavelet)

register(SerialParam())

```

In this documentation we use an example data set from the `r Biocpkg("msdata")`
package. Assuming that `r Biocpkg("msdata")` is installed, we locate the path of
the package and load the data set. We create also a `data.frame` describing the
experimental setup based on the file names.

```{r load-data, message = FALSE, results = "hide"}
mzdata_path <- system.file("fticr", package = "msdata")
mzdata_files <- list.files(mzdata_path, recursive = TRUE, full.names = TRUE)

## Create a data.frame assigning samples to sample groups, i.e. ham4 and ham5.
grp <- rep("ham4", length(mzdata_files))
grp[grep(basename(mzdata_files), pattern = "^HAM005")] <- "ham5"
pd <- data.frame(filename = basename(mzdata_files), sample_group = grp)

## Load the data.
ham_raw <- readMSData(files = mzdata_files,
                      pdata = new("NAnnotatedDataFrame", pd),
                      mode = "onDisk")
```

The data files are from *direct injection* mass spectrometry experiments,
i.e. we have only a single spectrum available for each sample and no retention
times.

```{r}
## Only a single spectrum with an *artificial* retention time is available
## for each sample
rtime(ham_raw)
```

Peaks are identified within each spectrum using the *mass spec wavelet* method.

```{r msw}
## Define the parameters for the peak detection
msw <- MSWParam(scales = c(1, 4, 9), nearbyPeak = TRUE, winSize.noise = 500,
                SNR.method = "data.mean", snthresh = 10)

ham_prep <- findChromPeaks(ham_raw, param = msw)

head(chromPeaks(ham_prep))
```

# Calibration

The `calibrate` method can be used to correct the m/z values of identified
peaks. The currently implemented method requires identified peaks and a list of
m/z values for known calibrants. The identified peaks m/z values are then
adjusted based on the differences between the calibrants' m/z values and the m/z
values of the closest peaks (within a user defined permitted maximal
distance). Note that this method does presently only calibrate identified peaks,
but not the original m/z values in the spectra.

Below we demonstrate the `calibrate` method on one of the data files with
artificially defined calibration m/z values. We first subset the data set to the
first data file, extract the m/z values of 3 peaks and modify the values
slightly.

```{r message = FALSE}
## Subset to the first file.
first_file <- filterFile(ham_prep, file = 1)

## Extract 3 m/z values
calib_mz <- chromPeaks(first_file)[c(1, 4, 7), "mz"]
calib_mz <- calib_mz + 0.00001 * runif(1, 0, 0.4) * calib_mz + 0.0001

```

Next we calibrate the data set using the previously defined *artificial*
calibrants. We are using the `"edgeshift"` method for calibration that adjusts
all peaks within the range of the m/z values of the calibrants using a linear
interpolation and shifts all chromatographic peaks outside of that range by a
constant factor (the difference between the lowest respectively largest
calibrant m/z with the closest peak's m/z). Note that in a *real* use case, the
m/z values would obviously represent known m/z of calibrants and would not be
defined on the actual data.

```{r message = FALSE}
## Set-up the parameter class for the calibration
prm <- CalibrantMassParam(mz = calib_mz, method = "edgeshift",
                          mzabs = 0.0001, mzppm = 5)
first_file_calibrated <- calibrate(first_file, param = prm)

```

To evaluate the calibration we plot below the difference between the adjusted
and raw m/z values (y-axis) against the raw m/z values.

```{r calibrationresult, fig = TRUE, fig.width = 6, fig.height = 5, fig.align = "center"}
diffs <- chromPeaks(first_file_calibrated)[, "mz"] -
    chromPeaks(first_file)[, "mz"]

plot(x = chromPeaks(first_file)[, "mz"], xlab = expression(m/z[raw]),
     y = diffs, ylab = expression(m/z[calibrated] - m/z[raw]))

```


# Correspondence

Correspondence aims to group peaks across samples to define the *features* (ions
with the same m/z values). Peaks from single spectrum, direct injection MS
experiments can be grouped with the *MZclust* method. Below we perform the
correspondence analysis with the `groupChromPeaks` method using default
settings.

```{r correspondence, message = FALSE, results = "hide"}
## Using default settings but define sample group assignment
mzc_prm <- MzClustParam(sampleGroups = ham_prep$sample_group)
ham_prep <- groupChromPeaks(ham_prep, param = mzc_prm)

```

Getting an overview of the performed processings:

```{r}
ham_prep
```

The peak group information, i.e. the *feature* definitions can be accessed with
the `featureDefinitions` method.

```{r}
featureDefinitions(ham_prep)
```

Plotting the raw data for direct injection samples involves a little more
processing than for LC/GC-MS data in which we can simply use the `chromatogram`
method to extract the data. Below we extract the m/z-intensity pairs for the
peaks associated with the first feature. We thus first identify the peaks for
that feature and define their m/z values range. Using this range we can
subsequently use the `filterMz` function to sub-set the full data set to the
signal associated with the feature's peaks. On that object we can then call the
`mz` and `intensity` functions to extract the data.

```{r feature1, fig = TRUE, fig.width = 6, fig.height = 4, fig.align = "center"}
## Get the peaks belonging to the first feature
pks <- chromPeaks(ham_prep)[featureDefinitions(ham_prep)$peakidx[[1]], ]

## Define the m/z range
mzr <- c(min(pks[, "mzmin"]) - 0.001, max(pks[, "mzmax"]) + 0.001)

## Subset the object to the m/z range
ham_prep_sub <- filterMz(ham_prep, mz = mzr)

## Extract the mz and intensity values
mzs <- mz(ham_prep_sub, bySample = TRUE)
ints <- intensity(ham_prep_sub, bySample = TRUE)

## Plot the data
plot(3, 3, pch = NA, xlim = range(mzs), ylim = range(ints), main = "FT01",
     xlab = "m/z", ylab = "intensity")
## Define colors
cols <- rep("#ff000080", length(mzs))
cols[ham_prep_sub$sample_group == "ham5"] <- "#0000ff80"
tmp <- mapply(mzs, ints, cols, FUN = function(x, y, col) {
    points(x, y, col = col, type = "l")
})

```


To access the actual intensity values of each feature in each sample the
`featureValue` method can be used. The setting `value = "into"` tells the
function to return the integrated signal for each peak (one representative peak)
per sample.

```{r}
feat_vals <- featureValues(ham_prep, value = "into")
head(feat_vals)

```

`NA` is reported for features in samples for which no peak was identified at the
feature's m/z value. In some instances there might still be a signal at the
feature's position in the raw data files, but the peak detection failed to
identify a peak. For these cases signal can be recovered using the
`fillChromPeaks` method that integrates all raw signal at the feature's
location. If there is no signal at that location an `NA` is reported.

```{r fillpeaks, message = FALSE}
ham_prep <- fillChromPeaks(ham_prep, param = FillChromPeaksParam())

head(featureValues(ham_prep, value = "into"))
```

# Further analysis

Further analysis, i.e. detection of features/metabolites with significantly
different abundances, or PCA analyses can be performed on the feature matrix
using functionality from other R packages, such as `r Biocpkg("limma")`.