---
title: "New and modified functionality in xcms"
author:
- name: Johannes Rainer
package: xcms
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{New and modified functionality in xcms}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{xcms,RColorBrewer}
bibliography: references.bib
csl: biomed-central.csl
references:
- id: dummy
  title: no title
  author:
  - family: noname
    given: noname
---

```{r  biocstyle, echo = FALSE, results = "asis" }
BiocStyle::markdown() 
```


# New functionality in `xcms`

This document describes new functionality and changes to existing functionality
in the `xcms` package introduced during the update to version *3*.

```{r  message = FALSE, warning = FALSE }
library(xcms)
library(RColorBrewer)
register(SerialParam()) 
```


## Modernized user interface

The modernization of the user interface comprises new classes for data
representation and new data analysis methods. In addition, the core logic for
the data processing has been extracted from the old methods and put into a set
of R functions, the so called core API functions (or `do_` functions). These
functions take standard R data structures as input and return standard R data
types as result and can hence be easily included in other R packages.

The new user interface aims at simplifying and streamlining the `xcms` workflow
while guaranteeing data integrity and performance also for large scale
metabolomics experiments. Importantly, a simplified access to the original raw
data should be provided throughout the whole metabolomics data analysis workflow.

The new interface re-uses objects from the `MSnbase` Bioconductor package, such as
the `OnDiskMSnExp` object. This object is specifically designed for large scale MS
experiments as it initially reads just the scan header information from the mzML
while the mz-intensity value pairs from all or from selected spectra of a file
are read on demand hence minimizing the memory demand. Also, in contrast to
the old `xcmsRaw` object, the `OnDiskMSnExp` contains information from all files of
an experiment. In addition, all data normalization and adjustment methods
implemented in the `MSnbase` package can be directly applied to the MS data
without the need to re-implement such methods in `xcms`. Results from `xcms`
preprocessings, such as chromatographic peak detection or correspondence are
stored into the new `XCMSnExp` object. This object extends the `OnDiskMSnExp` object
and inherits thus all of its methods including raw data access.

Class and method/function names follow also a new naming convention trying tp
avoid the partially confusing nomenclature of the original `xcms` methods (such as
the `group` method to perform the correspondence of peaks across samples). To
distinguish them from mass peaks, the peaks identified by the peak detection in
an LS/GC-MS experiment are referred to as *chromatographic peaks*. The respective
method to identify such peaks is hence called `findChromPeaks` and the identified
peaks can be accessed using the `XCMSnExp` `chromPeaks` method. The results from an
correspondence analysis which aims to match and group chromatographic peaks
within and between samples are called *features*. A feature corresponds to
individual ions with a unique mass-to-charge ratio (mz) and a unique retention
time (rt). The definition of such mz-rt features (i.e. the result from the
`groupChromPeaks` method) can be accessed *via* the `featureDefinitions` method of
the `XCMSnExp` class. Finally, alignment (retention time correction) can be
performed using the `adjustRtime` method.

The settings for any of the new analysis methods are bundled in *parameter*
classes, one class for each method. This encapsulation of the parameters to a
function into a parameter class (such as `CentWaveParam`) avoids busy function
calls (with many single parameters) and enables saving, reloading and reusing
the settings. In addition, the parameter classes are added, along with other
information to the process history of an `XCMSnExp` object thus providing a
detailed documentation of each processing step of an analysis, with the
possibility to recall all settings of the performed analyses at any stage. In
addition, validation of the parameters can be performed within the parameter
object and hence is no longer required in the analysis function.


## New naming convention

Peaks identified in LC/GC-MS metabolomics are referred to as *chromatographic
peaks* where possible to avoid any misconceptions with *mass peaks* identified in
mz dimension.

Methods for data analysis from the original `xcms` code have been renamed to avoid
potential confusions:

-   **Chromatographic peak detection**: `findChromPeaks` instead of `findPeaks`: for new
    functions and methods the term *peak* is avoided as much as possible, as it is
    usually used to describe a mass peak in mz dimension. To clearly distinguish
    between these peaks and peaks in retention time space, the latter are referred
    to as *chromatographic peak*, or `chromPeak`.

-   **Correspondence**: `groupChromPeaks` instead of `group` to clearly indicate what is
    being grouped. Group might be a sample group or a peak group, the latter being
    referred to also by (mz-rt) *feature*.

-   **Alignment**: `adjustRtime` instead of `retcor` for retention time correction. The
    word *cor* in *retcor* might be easily misinterpreted as *correlation* instead of
    correction.


## New data classes


### `OnDiskMSnExp`

This object is defined and documented in the `MSnbase` package. In brief, it is a
container for the full raw data from an MS-based experiment. To keep the memory
footprint low the mz and intensity values are only loaded from the raw data
files when required. The `OnDiskMSnExp` object replaces the `xcmsRaw` object.


### `XCMSnExp`

The `XCMSnExp` class extends the `OnDiskMSnExp` object from the `MSnbase` package and
represents a container for the xcms-based preprocessing results while (since it
inherits all functionality from its parent class) keeping a direct relation to
the (raw) data on which the processing was performed. An additional slot
`.processHistory` in the object allows to keep track of all performed processing
steps. Each analysis method, such as `findChromPeaks` adds an `XProcessHistory`
object which includes also the parameter class passed to the analysis
method. Hence not only the time and type of the analysis, but its exact settings
are reported within the `XCMSnExp` object. The `XCMSnExp` is thus equivalent to the
`xcmsSet` from the original `xcms` implementation, but keeps in addition a link to
the raw data on which the preprocessing was performed.


### `Chromatogram`

The `Chromatogram` class (available in the `MSnbase` package since version 2.3.8)
allows a data representation that is orthogonal to the `Spectrum` class (also
defined in `MSnbase`). The `Chromatogram` class stores retention time and intensity
duplets and is designed to accommodate most use cases, from total ion
chromatogram, base peak chromatogram to extracted ion chromatogram and SRM/MRM
ion traces.

`Chromatogram` objects can be extracted from `XCMSnExp` (and `MSnExp` and
`OnDiskMSnExp`) objects using the `chromatogram` method.

Note that this class is still considered developmental and might thus undergo
some changes in the future.


## Binning and missing value imputation functions

The binning/profile matrix generation functions have been completely
rewritten. The new `binYonX` function replaces the binning of intensity values
into bins defined by their m/z values implemented in the `profBin`, `profBinLin` and
`profBinLinBase` methods. The `binYonX` function provides also additional functionality:

-   Breaks for the bins can be defined based on either the number of desired bins
    (`nBins`) or the size of a bin (`binSize`). In addition it is possible to provide
    a vector with pre-defined breaks. This allows to bin data from multiple files
    or scans on the same bin-definition.

-   The function returns a list with element `y` containing the binned values and
    element `x` the bin mid-points.

-   Values in input vector `y` can be aggregated within each bin with different
    methods: `max`, `min`, `sum` and `mean`.

-   The index of the largest (or smallest for `method` being "min") within each bin
    can be returned by setting argument `returnIndex` to `TRUE`.

-   Binning can be performed on single or multiple sub-sets of the input vectors
    using the `fromIdx` and `toIdx` arguments. This replaces the *M* methods (such as
    `profBinM`). These sub-sets can be overlapping.

The missing value imputation logic inherently build into the `profBinLin` and
`profBinLinBase` methods has been implemented in the `imputeLinInterpol` function.

The example below illustrates the binning and imputation with the `binYtoX` and
`imputeLinInterpol` functions. After binning of the test vectors below some of the
bins have missing values, for which we impute a value using
`imputeLinInterpol`. By default, `binYonX` selects the largest value within each
bin, but other aggregation methods are also available (i.e. min, max, mean,
sum).

```{r  message = FALSE }
## Defining the variables:
set.seed(123)
X <- sort(abs(rnorm(30, mean = 20, sd = 25))) ## 10
Y <- abs(rnorm(30, mean = 50, sd = 30))

## Bin the values in Y into 20 bins defined on X
res <- binYonX(X, Y, nBins = 22)

res 
```

As a result we get a `list` with the bin mid-points (`$x`) and the binned `y` values
(`$y`).

Next we use two different imputation approaches, a simple linear interpolation
and the linear imputation approach that was defined in the `profBinLinBase`
method. The latter performs linear interpolation only considering a certain
neighborhood of missing values otherwise replacing the `NA` with a base value.

```{r binning-imputation-example, fig = TRUE, message = FALSE, fig.width = 10, fig.height = 7, fig.cap = 'Binning and missing value imputation results. Black points represent the input values, red the results from the binning and blue and green the results from the imputation (with method lin and linbase, respectively).' }
## Plot the actual data values.
plot(X, Y, pch = 16, ylim = c(0, max(Y)))
## Visualizing the bins
abline(v = breaks_on_nBins(min(X), max(X), nBins = 22), col = "grey")

## Define colors:
point_colors <- paste0(brewer.pal(4, "Set1"), 80)
## Plot the binned values.
points(x = res$x, y = res$y, col = point_colors[1], pch = 15)

## Perform the linear imputation.
res_lin <- imputeLinInterpol(res$y)

points(x = res$x, y = res_lin, col = point_colors[2], type = "b")

## Perform the linear imputation "linbase"
res_linbase <- imputeLinInterpol(res$y, method = "linbase")
points(x = res$x, y = res_linbase, col = point_colors[3], type = "b", lty = 2) 
```

The difference between the linear interpolation method `lin` and `linbase` is that
the latter only performs the linear interpolation in a pre-defined neighborhood
of the bin with the missing value (`1` by default). The other missing values are
set to a base value corresponding to half of the smallest bin value. Both
methods thus yield same results, except for bins 15-17 (see Figure above).


## Core functionality exposed *via* simple functions

The core logic from the chromatographic peak detection methods
`findPeaks.centWave`, `findPeaks.massifquant`, `findPeaks.matchedFilter` and
`findPeaks.MSW` and from all alignment (`group.*`) and correspondence (`retcor.*`)
methods has been extracted and put into functions with the common prefix
`do_findChromPeaks`, `do_adjustRtime` and `do_groupChromPeaks`, respectively, with the
aim, as detailed in issue [#30](https://github.com/sneumann/xcms/issues/30), to separate the core logic from the analysis
methods invoked by the users to enable also the use these methods using base R
parameters (i.e. without specific classes containing the data such as the
`xcmsRaw` class). This simplifies also the re-use of these functions in other
packages and simplifies the future implementation of the peak detection
algorithms for e.g. the `MSnExp` or `OnDiskMSnExp` objects from the `MSnbase`
Bioconductor package. The implemented functions are:

-   **peak detection methods**:
    -   `do_findChromPeaks_centWave`: peak density and wavelet based peak detection
        for high resolution LC/MS data in centroid mode [@Tautenhahn:2008fx].
    -   `do_findChromPeaks_matchedFilter`: identification of peak in the
        chromatographic domain based on matched filtration [@Smith:2006ic].
    -   `do_findChromPeaks_massifquant`: identification of peaks using Kalman
        filters.
    -   `do_findChromPeaks_MSW`: single spectrum, non-chromatographic peak detection.

-   **alignment methods**:
    -   `do_adjustRtime_peakGroups`: perform sample alignment (retention time
        correction) using alignment of *well behaved* chromatographic peaks that are
        present in most samples (and are expected to have the same retention time).

-   **correspondence methods**:
    -   `do_groupChromPeaks_density`: perform chromatographic peak grouping (within
        and across samples) based on the density distribution of peaks along the
        retention time axis.
    -   `do_groupChromPeaks_nearest`: groups peaks across samples similar to the
        method implemented in mzMine.
    -   `do_groupChromPeaks_mzClust`: performs high resolution correspondence on
        single spectra samples.

One possible drawback from the introduction of this new layer is, that more
objects get copied by R which *could* eventually result in a larger memory demand
or performance decrease (while no such was decrease was observed up to now).


## Usability improvements in the *old* user interface

-   `[` subsetting method for `xcmsRaw` objects that enables to subset an `xcmsRaw`
    object to specific scans/spectra.
-   `profMat` method to extract the *profile* matrix from the `xcmsRaw` object. This
    method should be used instead of directly accessing the `@env$profile` slot, as
    it will create the profile matrix on the fly if it was not pre-calculated (or
    if profile matrix generation settings have been changed).


# Changes due to bug fixes and modified functionality


## Differences in linear interpolation of missing values (`profBinLin`).

From `xcms` version 1.51.1 on the new binning functions are used, thus, the bug
described here are fixed.

Two bugs are present in the `profBinLin` method (reported as issues [#46](https://github.com/sneumann/xcms/issues/46) and [#49](https://github.com/sneumann/xcms/issues/49) on
github) which are fixed in the new `binYonX` and `imputeLinInterpol` functions:

-   The first bin value calculated by `profBinLin` can be wrong (i.e. not being the
    max value within that bin, but the first).
-   If the last bin contains also missing values, the method fails to determine
    a correct value for that bin.

The `profBinLin` method is used in `findPeaks.matchedFilter` if the profile
method is set to "binlin".

The example below illustrates both differences.

```{r   }
## Define a vector with empty values at the end.
X <- 1:11
set.seed(123)
Y <- sort(rnorm(11, mean = 20, sd = 10))
Y[9:11] <- NA
nas <- is.na(Y)
## Do interpolation with profBinLin:
resX <- xcms:::profBinLin(X[!nas], Y[!nas], 5, xstart = min(X),
                          xend = max(X))
resX
res <- binYonX(X, Y, nBins = 5L, shiftByHalfBinSize = TRUE)
resM <- imputeLinInterpol(res$y, method = "lin",
                          noInterpolAtEnds = TRUE)
resM 
```

Plotting the results helps to better compare the differences. The black points
in the figure below represent the actual values of `Y` and the grey vertical lines
the breaks defining the bins. The blue lines and points represent the result
from the `profBinLin` method. The bin values for the first and 4th bin are clearly
wrong. The green colored points and lines represent the results from the `binYonX`
and `imputeLinInterpol` functions (showing the correct binning and interpolation).

```{r profBinLin-problems, fig = TRUE, message = FALSE, fig.align = 'center', fig.width=10, fig.height = 7, fig.cap = "Illustration of the two bugs in profBinLin. The input values are represented by black points, grey vertical lines indicate the bins. The results from binning and interpolation with profBinLin are shown in blue and those from binYonX in combination with imputeLinInterpol in green."}
plot(x = X, y = Y, pch = 16, ylim = c(0, max(Y, na.rm = TRUE)),
     xlim = c(0, 12))
## Plot the breaks
abline(v = breaks_on_nBins(min(X), max(X), 5L, TRUE), col = "grey")
## Result from profBinLin:
points(x = res$x, y = resX, col = "blue", type = "b")
## Results from imputeLinInterpol
points(x = res$x, y = resM, col = "green", type = "b",
       pch = 4, lty = 2)
 
```

Note that by default `imputeLinInterpol` would also interpolate missing values at
the beginning and the end of the provided numeric vector. This can be disabled
(to be compliant with `profBinLin`) by setting parameter `noInterpolAtEnds` to
`TRUE` (like in the example above).


## Differences due to updates in `do_findChromPeaks_matchedFilter`, respectively `findPeaks.matchedFilter`.

The original `findPeaks.matchedFilter` (up to version 1.49.7) had several
shortcomings and bugs that have been fixed in the new
`do_findChromPeaks_matchedFilter` method:

-   The internal iterative processing of smaller chunks of the full data (also
    referred to as *iterative buffering*) could result, for some bin (step) sizes to
    unstable binning results (discussed in issue [#47](https://github.com/sneumann/xcms/issues/47) on github): calculation of
    the breaks, or to be precise, the actually used bin size was performed in each
    iteration and could lead to slightly different sizes between iterations (due
    to rounding errors caused by floating point number representations in C).

-   The iterative buffering raises also a conceptual issue when linear
    interpolation is performed to impute missing values: the linear imputation
    will only consider values within the actually processed buffer and can thus
    lead to wrong or inaccurate imputations.

-   The `profBinLin` implementation contains two bugs, one that can result in
    failing to identify the maximal value in the first and last bin (see issue
    [#46](https://github.com/sneumann/xcms/issues/46)) and one that fails to assign a value to a bin (issue [#49](https://github.com/sneumann/xcms/issues/49)). Both are fixed
    in the `do_findChromPeaks_matchedFilter` implementation.

A detailed description of tests comparing all implementations is available in
issue [#52](https://github.com/sneumann/xcms/issues/52) on github. Note also that in course of these changes also the `getEIC`
method has been updated to use the new binning and missing value imputation
function.

While it is strongly discouraged, it is still possible to use to *old* code (from
1.49.7) by calling `useOriginalCode(TRUE)`.


## Differences in `findPeaks.massifquant`

-   Argument `scanrange` was ignored in the *original* old code (issue [#61](https://github.com/sneumann/xcms/issues/61)).
-   The method returned a `matrix` if `withWave` was `0` and a `xcmsPeaks` object
    otherwise. The updated version returns **always** an `xcmsPeaks` object (issue #60).


## Differences in *obiwarp* retention time correction

Retention time correction using the obiwarp method uses the *profile* matrix
(i.e. intensities binned in discrete bins along the mz axis). Profile matrix
generation uses now the `binYonX` method which fixed some problems in the original
binning and linear interpolation methods. Thus results might be slightly
different.

Also, the `retcor.obiwarp` method reports (un-rounded) adjusted retention times,
but adjusts the retention time of eventually already identified peaks using
rounded adjusted retention times. The new `adjustRtime` method(s) does adjust
identified peaks using the reported adjusted retention times (not rounded). This
guarantees that e.g. removing retention time adjustment/alignment results from
an object restores the object to its initial state (i.e. the adjusted retention
times of the identified peaks are reverted to the retention times before
alignment).
See issue [#122](https://github.com/sneumann/xcms/issues/122) for more details.


## `retcor.peaksgroups`: change in the way how *well behaved* peak groups are ordered

The `retcor.peakgroups` defines first the chromatographic peak groups that are
used for the alignment of all spectra. Once these are identified, the retention
time of the peak with the highest intensity in a sample for a given peak group
is returned and the peak groups are ordered increasingly by retention time
(which is required for the later fitting of either a polynomial or a linear
model to the data). The selection of the retention time of the peak with the
highest intensity within a feature (peak group) and samples, denoted as
*representative* peak for a given feature in a sample, ensures that only the
retention time of a single peak per sample and feature is selected (note that
multiple chromatographic peaks within the same sample can be assigned to a
feature).  In the original code the ordering of the peak groups was however
performed using the median retention time of the complete peak group (which
includes also potential additional peaks per sample). This has been changed and
the features are ordered now by the median retention time across samples of the
representative chromatographic peaks.


## `scanrange` parameter in all `findPeaks` methods

The `scanrange` in the `findPeaks` methods is supposed to enable the peak detection
only within a user-defined range of scans. This was however not performed in
each method. Due to a bug in `findPeaks.matchedFilter`'s original code the
argument was ignored, except if the upper scan number of the user defined range
was larger than the total number of available scans (see issue [#63](https://github.com/sneumann/xcms/issues/63)). In
`findPeaks.massifquant` the argument was completely ignored (see issue [#61](https://github.com/sneumann/xcms/issues/61)) and,
while the argument was considered in `findPeaks.centWave` and feature detection
was performed within the specified scan range, but the original `@scantime` slot
was used throughout the code instead of just the scan times for the specified
scan indices (see issue [#64](https://github.com/sneumann/xcms/issues/64)).

These problems have been fixed in version 1.51.1 by first sub-setting the
`xcmsRaw` object (using the `[` method) before actually performing the feature
detection.


## `fillPeaks` (`fillChromPeaks`) differences

In the original `fillPeaks.MSW`, the mz range from which the signal is to be
integrated was defined using 

```{r  eval = FALSE }
mzarea <- seq(which.min(abs(mzs - peakArea[i, "mzmin"])),
	      which.min(abs(mzs - peakArea[i, "mzmax"])))
 
```

Depending on the data this could lead to the inclusion of signal in the
integration that are just outside of the mz range. In the new `fillChromPeaks`
method signal is integrated only for mz values >= mzmin and <= mzmax thus
ensuring that only signal is used that is truly within the peak area defined by
columns `"mzmin"`, `"mzmax"`, `"rtmin"` and `"rtmax"`.

Also, the `fillPeaks.chrom` method did return `"into"` and `"maxo"` values of `0` if no
signal was found in the peak area. The new method does not integrate any signal
in such cases and does not fill in that peak.

See also issue [#130](https://github.com/sneumann/xcms/issues/130) for more
information.


# Under the hood changes

These changes and updates will not have any large impact on the day-to-day use of
`xcms` and are listed here for completeness.

-   From `xcms` version 1.51.1 on the default methods from the `mzR` package are used
    for data import. Besides ensuring easier maintenance, this enables also data
    import from *gzipped* mzML files.


# Deprecated functions and files

Here we list all of the functions and related files that are deprecated.

-   `xcmsParallelSetup`, `xcmsPapply`, `xcmsClusterApply`: use `BiocParallel` package
    instead to setup and perform parallel processing, either *via* the `BPPARAM`
    parameter to function and methods, or by calling `register` to globally set
    parallel processing.

-   `profBin`, `profBinM`, `profBinLin`, `profBinLinM`, `profBinLinBase`, `profBinLinBaseM`:
    replaced by the `binYonX` and `imputeLinInterpol` functions. Also, to create or
    extract the profile matrix from an `xcmsRaw` object, the `profMat` method.


## Deprecated


### xcms 1.49:

-   `xcmsParallelSetup` (Deprecated.R)
-   `xcmsPapply` (Deprecated.R)
-   `xcmsClusterApply` (Deprecated.R)


### xcms 1.51:

-   `profBin` (c.R)
-   `profBinM` (c.R)
-   `profBinLin` (c.R)
-   `profBinLinM` (c.R)
-   `profBinLinBase` (c.R)
-   `profBinLinBaseM` (c.R)


## Defunct


# References