---
title: "Annotation of MS-based Metabolomics Data"
package: MetaboAnnotation
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{Annotation of MS-based Metabolomics Data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure, Quantitative }
%\VignetteEncoding{UTF-8}
%\VignetteDepends{Spectra,BiocStyle,msdata,MetaboAnnotation,SummarizedExperiment,QFeatures,AnnotationHub,microbenchmark,mzR}
bibliography: references.bib
---
```{r style, echo = FALSE, results = 'asis'}
library(BiocStyle)
BiocStyle::markdown()
```
**Package**: `r BiocStyle::Biocpkg("MetaboAnnotation")`
**Authors**: `r packageDescription("MetaboAnnotation")[["Author"]] `
**Compiled**: `r date()`
```{r message = FALSE, warning = FALSE, echo = FALSE}
library(AnnotationHub)
library(MetaboAnnotation)
library(Spectra)
```
# Introduction
The `r Biocpkg("MetaboAnnotation")` package defines high-level user
functionality to support and facilitate annotation of MS-based metabolomics data
[@rainer_modular_2022].
# Installation
The package can be installed with the *BiocManager* package. To
install `BiocManager` use `install.packages("BiocManager")` and, after that,
`BiocManager::install("MetaboAnnotation")` to install this package.
# General description
*MetaboAnnotation* provides a set of *matching* functions that allow comparison
(and matching) between *query* and *target* entities. These entities can be
chemical formulas, numeric values (e.g. m/z or retention times) or fragment
spectra. The available matching functions are:
- `matchFormula()`: to match chemical formulas.
- `matchSpectra()`: to match fragment spectra.
- `matchValues()` (formerly `matchMz()`): to match numerical values (m/z,
masses, retention times etc).
For each of these matching functions *parameter* objects are available that
allow different types or matching algorithms. Refer to the help pages for a
detailed listing of these (e.g. `?matchFormula`, `?matchSpectra` or
`?matchValues`). As a result, a `Matched` (or `MatchedSpectra`) object is
returned which streamlines and simplifies handling of the potential one-to-many
(or one-to-none) matching.
# Example use cases
The following sections illustrate example use cases of the functionality
provided by the *MetaboAnnotation* package.
```{r, message = FALSE}
library(MetaboAnnotation)
```
## Matching of m/z values
In this section a simple matching of feature m/z values against theoretical m/z
values is performed. This is the lowest level of confidence in metabolite
annotation. However, it gives ideas about potential metabolites that can be
analyzed in further downstream experiments and analyses.
The following example loads the feature table from a lipidomics experiments and
matches the measured m/z values against reference masses from LipidMaps. Below
we use a `data.frame` as *reference* database, but a `CompDb` compound database
instance (as created by the `r Biocpkg("CompoundDb")` package) would also be
supported.
```{r, message = FALSE}
ms1_features <- read.table(system.file("extdata", "MS1_example.txt",
package = "MetaboAnnotation"),
header = TRUE, sep = "\t")
head(ms1_features)
target_df <- read.table(system.file("extdata", "LipidMaps_CompDB.txt",
package = "MetaboAnnotation"),
header = TRUE, sep = "\t")
head(target_df)
```
For reference (target) compounds we have only the mass available. We need to
convert this mass to m/z values in order to match the m/z values from the
features (i.e. the query m/z values) against them. For this we need to define
the *most likely* ions/adducts that would be generated from the compounds based
on the ionization used in the experiment. We assume the most abundant adducts
from the compounds being `"[M+H]+"` and `"[M+Na]+`. We next perform the matching
with the `matchValues()` function providing the query and target data as well as
a parameter object (in our case a `Mass2MzParam`) with the settings for the
matching. With the `Mass2MzParam`, the mass or target compounds get first
converted to m/z values, based on the defined adducts, and these are then
matched against the query m/z values (i.e. the m/z values for the features). To
get a full list of supported adducts the `MetaboCoreUtils::adductNames(polarity
= "positive")` or `MetaboCoreUtils::adductNames(polarity = "negative")` can be
used). Note also, to keep the runtime of this vignette short, we match only the
first 100 features.
```{r}
parm <- Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"),
tolerance = 0.005, ppm = 0)
matched_features <- matchValues(ms1_features[1:100, ], target_df, parm)
matched_features
```
From the tested 100 features 55 were matched against at least one target
compound (all matches are against a single compound). The result object (of type
`Matched`) contains the full query data frame and target data frames as well as
the matching information. We can access the original query data with `query()`
and the original target data with `target()` function:
```{r}
head(query(matched_features))
head(target(matched_features))
```
Functions `whichQuery()` and `whichTarget()` can be used to identify the rows in
the query and target data that could be matched:
```{r}
whichQuery(matched_features)
whichTarget(matched_features)
```
The `colnames` function can be used to evaluate which variables/columns are
available in the `Matched` object.
```{r}
colnames(matched_features)
```
These are all columns from the `query`, all columns from the `target` (the
prefix `"target_"` is added to the original column names in `target`) and
information on the matching result (in this case columns `"adduct"`, `"score"`
and `"ppm_error"`).
We can extract the full matching table with `matchedData()`. This returns a
`DataFrame` with all rows in *query* the corresponding matches in *target* along
with the matching adduct (column `"adduct"`) and the difference in m/z (column
`"score"` for absolute differences and `"ppm_error"` for the m/z relative
differences). Note that if a row in *query* matches multiple elements in
*target*, this row will be duplicated in the `DataFrame` returned by
`matchedData()`. For rows that can not be matched `NA` values are reported.
```{r}
matchedData(matched_features)
```
Individual columns can be simply extracted with the `$` operator:
```{r}
matched_features$target_name
```
`NA` is reported for query entries for which no match was found. See also the
help page for `?Matched` for more details and information. In addition to the
matching of query m/z against target exact masses as described above it would
also be possible to match directly query m/z against target m/z values by using
the `MzParam` instead of the `Mass2MzParam`.
## Matching of m/z and retention time values
If expected retention time values were available for the target compounds, an
annotation with higher confidence could be performed with `matchValues()` and a
`Mass2MzRtParam` parameter object. To illustrate this we randomly assign
retention times from query features to the target compounds adding also 2
seconds difference. In a real use case the target `data.frame` would contain
masses (or m/z values) for standards along with the retention times when ions of
these standards were measured on the same LC-MS setup from which the query data
derives.
Below we subset our data table with the MS1 features to the first 100 rows (to
keep the runtime of the vignette short).
```{r}
ms1_subset <- ms1_features[1:100, ]
head(ms1_subset)
```
The table contains thus retention times of the features in a column named
`"rtime"`.
Next we randomly assign retention times of the features to compounds in our
target data adding a deviation of 2 seconds. As described above, in a real use
case retention times are supposed to be determined by measuring the compounds
with the same LC-MS setup.
```{r}
set.seed(123)
target_df$rtime <- sample(ms1_subset$rtime,
nrow(target_df), replace = TRUE) + 2
```
We have now retention times available for both the query and the target data and
can thus perform a matching based on m/z **and** retention times. We use the
`Mass2MzRtParam` which allows us to specify (as for the
`Mass2MzParam`) the expected adducts, the maximal acceptable m/z relative
and absolute deviation as well as the maximal acceptable (absolute) difference
in retention times. We use the settings from the previous section and allow a
difference of 10 seconds in retention times. The retention times are provided in
columns named `"rtime"` which is different from the default (`"rt"`). We thus
specify the name of the column containing the retention times with parameter
`rtColname`.
```{r}
parm <- Mass2MzRtParam(adducts = c("[M+H]+", "[M+Na]+"),
tolerance = 0.005, ppm = 0,
toleranceRt = 10)
matched_features <- matchValues(ms1_subset, target_df, param = parm,
rtColname = "rtime")
matched_features
```
Less features were matched based on m/z and retention times.
```{r}
matchedData(matched_features)[whichQuery(matched_features), ]
```
## Matching of `SummarizedExperiment` or `QFeatures` objects
Results from LC-MS preprocessing (e.g. by the `r BiocStyle::Biocpkg("xcms")`
package) or generally metabolomics results might be best represented and bundled
as `SummarizedExperiment` or `QFeatures` objects (from the same-named
Bioconductor packages). A `XCMSnExp` preprocessing result from `xcms` can for
example be converted to a `SummarizedExperiment` using the `quantify()` method
from the `xcms` package. The feature definitions (i.e. their m/z and retention
time values) will then be stored in the object's `rowData()` while the assay
(the numerical matrix) will contain the feature abundances across all
samples. Such `SummarizedExperiment` objects can be simply passed as `query`
objects to the `matchValues()` method. To illustrate this, we create below a
simple `SummarizedExperiment` using the `ms1_features` data frame from the
example above as `rowData` and adding a `matrix` with random values as assay.
```{r}
library(SummarizedExperiment)
se <- SummarizedExperiment(
assays = matrix(rnorm(nrow(ms1_features) * 4), ncol = 4,
dimnames = list(NULL, c("A", "B", "C", "D"))),
rowData = ms1_features)
```
We can now use the same `matchValues()` call as before to perform the
matching. Matching will be performed on the object's `rowData`, i.e. each
row/element of the `SummarizedExperiment` will be matched against the target
using e.g. m/z values available in columns of the object's `rowData`:
```{r}
parm <- Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+"),
tolerance = 0.005, ppm = 0)
matched_features <- matchValues(se, target_df, param = parm)
matched_features
```
As `query`, the result contains the full `SummarizedExperiment`, but
`colnames()` and `matchedData()` will access the respective information from the
`rowData` of this `SummarizedExperiment`:
```{r}
colnames(matched_features)
matchedData(matched_features)
```
Subsetting the result object, to e.g. just matched elements will also subset the
`SummarizedExperiment`.
```{r}
matched_sub <- matched_features[whichQuery(matched_features)]
MetaboAnnotation::query(matched_sub)
```
A `QFeatures` object is essentially a container for several
`SummarizedExperiment` objects which rows (features) are related with each
other. Such an object could thus for example contain the full feature data from
an LC-MS experiment as one assay and a compounded feature data in which data
from ions of the same compound are aggregated as an additional
assay. Below we create such an object using our `SummarizedExperiment` as an
assay of name `"features"`. For now we don't add any additional assay to that
`QFeatures`, thus, the object contains only this single data set.
```{r, warning = FALSE}
library(QFeatures)
qf <- QFeatures(list(features = se))
qf
```
`matchValues()` supports also matching of `QFeatures` objects but the user
needs to define the assay which should be used for the matching with the
`queryAssay` parameter.
```{r}
matched_qf <- matchValues(qf, target_df, param = parm, queryAssay = "features")
matched_qf
```
`colnames()` and `matchedData()` allow to access the `rowData` of the
`SummarizedExperiment` stored in the `QFeatures`' `"features"` assay:
```{r}
colnames(matched_qf)
matchedData(matched_qf)
```
## Matching of MS/MS spectra
In this section we match experimental MS/MS spectra against reference
spectra. This can also be performed with functions from the
`r BiocStyle::Biocpkg("Spectra")` package (see
[SpectraTutorials](https://jorainer.github.io/SpectraTutorials/), but the
functions and concepts used here are more suitable to the *end user* as they
simplify the handling of the spectra matching results.
Below we load spectra from a file from a reversed-phase (DDA) LC-MS/MS run of
the Agilent Pesticide mix. With `filterMsLevel()` we subset the data set to only
MS2 spectra. To reduce processing time of the example we further subset the
`Spectra` to a small set of selected MS2 spectra. In addition we assign *feature
identifiers* to each spectrum (again, for this example these are arbitrary IDs,
but in a *real* data analysis such identifiers could indicate to which LC-MS
feature these spectra belong).
```{r, message = FALSE}
library(Spectra)
library(msdata)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML", package = "msdata")
pest_ms2 <- filterMsLevel(Spectra(fl), 2L)
## subset to selected spectra.
pest_ms2 <- pest_ms2[c(808, 809, 945:955)]
## assign arbitrary *feature IDs* to each spectrum.
pest_ms2$feature_id <- c("FT001", "FT001", "FT002", "FT003", "FT003", "FT003",
"FT004", "FT004", "FT004", "FT005", "FT005", "FT006",
"FT006")
## assign also *spectra IDs* to each
pest_ms2$spectrum_id <- paste0("sp_", seq_along(pest_ms2))
pest_ms2
```
This `Spectra` should now represent MS2 spectra associated
with LC-MS features from an untargeted LC-MS/MS experiment that we would like to
annotate by matching them against a spectral reference library.
We thus load below a `Spectra` object that represents MS2 data from a very small
subset of [MassBank](https://massbank.eu/MassBank/) release *2021.03*. This
small `Spectra` object is provided within this package but it would be possible
to use any other `Spectra` object with reference fragment spectra instead (see
also the [SpectraTutorials](https://jorainer.github.io/SpectraTutorials/)
workshop). As an alternative, it would also be possible to use a `CompDb` object
representing a compound annotation database (defined in the
`r Biocpkg("CompoundDb")` package) with parameter `target`. See the
`matchSpectra()` help page or section *Query against multiple reference
databases* below for more details and options to retrieve such annotation
resources from Bioconductor's `r Biocpkg("AnnotationHub")`.
```{r}
load(system.file("extdata", "minimb.RData", package = "MetaboAnnotation"))
minimb
```
We can now use the `matchSpectra()` function to match each of our experimental
*query* spectra against the *target* (reference) spectra. Settings for this
matching can be defined with a dedicated *param* object. We use below the
`CompareSpectraParam` that uses the `compareSpectra()` function from the
`Spectra` package to calculate similarities between each query spectrum and all
target spectra. `CompareSpectraParam` allows to set all individual settings for
the `compareSpectra()` call with parameters `MAPFUN`, `ppm`, `tolerance` and
`FUN` (see the help on `compareSpectra()` in the `r Biocpkg("Spectra")` package
for more details). In addition, we can *pre-filter* the target spectra for each
individual query spectrum to speed-up the calculations. By setting
`requirePrecursor = TRUE` we compare below each query spectrum only to target
spectra with matching precursor m/z (accepting a deviation defined by parameters
`ppm` and `tolerance`). By default, `matchSpectra()` with `CompareSpectraParam`
considers spectra with a similarity score higher than 0.7 as *matching* and
these are thus reported.
```{r}
csp <- CompareSpectraParam(requirePrecursor = TRUE, ppm = 10)
mtches <- matchSpectra(pest_ms2, minimb, param = csp)
mtches
```
The results are reported as a `MatchedSpectra` object which represents the
matching results for all query spectra. This type of object contains all query
spectra, all target spectra, the matching information and the parameter object
with the settings of the matching. The object can be subsetted to e.g. matching
results for a specific query spectrum:
```{r}
mtches[1]
```
In this case, for the first query spectrum, no match was found among the target
spectra. Below we subset the `MatchedSpectra` to results for the second query
spectrum:
```{r}
mtches[2]
```
The second query spectrum could be matched to 4 target spectra. The matching
between query and target spectra can be n:m, i.e. each query spectrum can match
no or multiple target spectra and each target spectrum can be matched to none,
one or multiple query spectra.
Data (spectra variables of either the query and/or the target spectra) can be
extracted from the result object with the `spectraData()` function or with `$`
(similar to a `Spectra` object). The `spectraVariables` function can be used to
list all available spectra variables in the result object:
```{r}
spectraVariables(mtches)
```
This lists the spectra variables from both the *query* **and** the *target*
spectra, with the prefix `"target_"` being used for spectra variable names of
the target spectra. Spectra variable `"score"` contains the similarity score.
Note that by default also an additional column `".original_query_index"` is
added to the `query` `Spectra` object by the `matchSpectra()` function, that
enables an easier mapping of results to the *original* query object used as
input, in particular, if the `MatchedSpectra` object gets further subset. As the
name says, this column contains for each query spectrum the index in the
original `Spectra` object provided with the `query` parameter.
We could thus use `$target_compound_name` to extract the compound name of the
matching target spectra for the second query spectrum:
```{r}
mtches[2]$target_compound_name
```
The same information can also be extracted on the *full* `MatchedSpectra`.
Below we use `$spectrum_id` to extract the query spectra identifiers we added
above from the full result object.
```{r}
mtches$spectrum_id
```
We added this column manually to the query object before the `matchSpectra()`
call, but the automatically added spectra variable `".original_query_index"`
would provide the same information:
```{r}
mtches$.original_query_index
```
And the respective values in the query object:
```{r}
query(mtches)$.original_query_index
```
Because of the n:m mapping between query and target spectra, the number of
values returned by `$` (or `spectraData`) can be larger than the total number of
query spectra. Also in the example above, some of the spectra IDs are present
more than once in the result returned by `$spectrum_id`. The respective spectra
could be matched to more than one target spectrum (based on our settings) and
hence their IDs are reported multiple times. Both `spectraData` and `$` for
`MatchedSpectra` use a *left join* strategy to report/return values: a value
(row) is reported for each query spectrum (even if it does **not** match any
target spectrum) with eventually duplicated values (rows) if the query spectrum
matches more than one target spectrum (each value for a query spectrum is
repeated as many times as it matches target spectra). To illustrate this we
use below the `spectraData()` function to extract specific data from our
result object, i.e. the spectrum and feature IDs for the query spectra we
defined above, the MS2 spectra similarity score, and the target spectra's ID and
compound name.
```{r}
mtches_df <- spectraData(mtches, columns = c("spectrum_id", "feature_id",
"score", "target_spectrum_id",
"target_compound_name"))
as.data.frame(mtches_df)
```
Using the `plotSpectraMirror()` function we can visualize the matching results
for one query spectrum. Note also that an interactive, `shiny`-based, validation
of matching results is available with the `validateMatchedSpectra()`
function. Below we call this function to show all matches for the second
spectrum.
```{r}
plotSpectraMirror(mtches[2])
```
Not unexpectedly, the peak intensities of query and target spectra are on
different scales. While this was no problem for the similarity calculation (the
normalized dot-product which is used by default is independent of the absolute
peak values) it is not ideal for visualization. Thus, we apply below a simple
scaling function to both the query and target spectra and plot the
spectra again afterwards (see the help for `addProcessing()` in the `Spectra`
package for more details on spectra data manipulations). This function will
replace the absolute spectra intensities with intensities relative to the
maximum intensity of each spectrum. Note that functions for `addProcessing()`
should include (like in the example below) the `...` parameter.
```{r}
scale_int <- function(x, ...) {
x[, "intensity"] <- x[, "intensity"] / max(x[, "intensity"], na.rm = TRUE)
x
}
mtches <- addProcessing(mtches, scale_int)
plotSpectraMirror(mtches[2])
```
The query spectrum seems to nicely match the identified target spectra. Below we
extract the compound name of the target spectra for this second query spectrum.
```{r}
mtches[2]$target_compound_name
```
As alternative to the `CompareSpectraParam` we could also use the
`MatchForwardReverseParam` with `matchSpectra()`. This has the same settings and
performs the same spectra similarity search than `CompareSpectraParam`, but
reports in addition (similar to MS-DIAL) to the (*forward*) similarity score
also the *reverse* spectra similarity score as well as the *presence ratio* for
matching spectra. While the default *forward* score is calculated considering
all peaks from the query and the target spectrum (the peak mapping is performed
using an *outer join* strategy), the *reverse score* is calculated only on peaks
that are present in the target spectrum and the matching peaks from the query
spectrum (the peak mapping is performed using a *right join* strategy). The
*presence ratio* is the ratio between the number of mapped peaks between the
query and the target spectrum and the total number of peaks in the target
spectrum. These values are available as spectra variables `"reverse_score"` and
`"presence_ratio"` in the result object). Below we perform the same spectra
matching as above, but using the `MatchForwardReverseParam`.
```{r}
mp <- MatchForwardReverseParam(requirePrecursor = TRUE, ppm = 10)
mtches <- matchSpectra(pest_ms2, minimb, param = mp)
mtches
```
Below we extract the query and target spectra IDs, the compound name and all
scores.
```{r}
as.data.frame(
spectraData(mtches, c("spectrum_id", "target_spectrum_id",
"target_compound_name", "score", "reverse_score",
"presence_ratio")))
```
In these examples we matched query spectra only to target spectra if their
precursor m/z is ~ equal and reported only matches with a similarity higher than
0.7. `CompareSpectraParam`, through its parameter `THRESHFUN` would however also
allow other types of analyses. We could for example also report the *best
matching* target spectrum for each query spectrum, independently of whether the
similarity score is higher than a certain threshold. Below we perform such an
analysis defining a `THRESHFUN` that selects always the best match.
```{r}
select_top_match <- function(x) {
which.max(x)
}
csp2 <- CompareSpectraParam(ppm = 10, requirePrecursor = FALSE,
THRESHFUN = select_top_match)
mtches <- matchSpectra(pest_ms2, minimb, param = csp2)
res <- spectraData(mtches, columns = c("spectrum_id", "target_spectrum_id",
"target_compound_name", "score"))
as.data.frame(res)
```
Note that this whole example would work on any `Spectra` object with MS2
spectra. Such objects could also be extracted from an `xcms`-based LC-MS/MS data
analysis with the `chromPeaksSpectra()` or `featureSpectra()` functions from the
`r Biocpkg("xcms")` package. Note also that retention times could in addition be
considered in the matching by selecting a non-infinite value for the
`toleranceRt` of any of the parameter classes. By default this uses the
retention times provided by the query and target spectra (i.e. spectra variable
`"rtime"`) but it is also possible to specify any other spectra variable for the
additional retention time matching (e.g. retention indices instead of times)
using the `rtColname` parameter of the `matchSpectra(0` function (see
`?matchSpectra` help page for more information).
Matches can be also further validated using an interactive Shiny app by calling
`validateMatchedSpectra()` on the `MatchedSpectra` object. Individual matches
can be set to TRUE or FALSE in this app. By closing the app via the Save & Close
button a filtered `MatchedSpectra` is returned, containing only matches manually
validated.
### Query against multiple reference databases
Getting access to reference spectra can sometimes be a little cumbersome since
it might involve lookup and download of specific resources or eventual
conversion of these into a format suitable for import. `MetaboAnnotation`
provides *compound annotation sources* to simplify this process. These
annotation source objects represent references (links) to annotation resources
and can be used in the `matchSpectra()` call to define the targed/reference
spectra. The annotation source object takes then care, upon request, of
retrieving the annotation data or connecting to the annotation resources.
Also, *compound annotation sources* can be combined to allow
matching query spectra against multiple reference libraries in a single call.
At present `MetaboAnnotation` supports the following types of *compound
annotation sources* (i.e. objects extending `CompAnnotationSource`):
- Annotation resources that provide their data as a `CompDb` database (defined
by the `r BiocStyle::Biocpkg("CompoundDb")`) package. These are supported by
the `CompDbSource` class.
- Annotation resources for which a dedicated `MsBackend` backend is available
hence supporting to access the data *via* a `Spectra` object. These are
supported by the `SpectraDbSource` class.
Various helper functions, specific for the annotation resource, are available to
create such annotation source objects:
- `CompDbSource`: creates a compound annotation source object from the provided
`CompDb` SQLite data base file. This function can be used to integrate an
existing (locally available) `CompDb` annotation database into an annotation
workflow.
- `MassBankSource`: creates a annotation source object for a specific MassBank
release. The desired release can be specified with the `release` parameter
(e.g. `release = "2021.03"` or `release = "2022.06"`). The function will then
download the respective annotation database from Bioconductor's
`r Biocpkg("AnnotationHub")`.
In the example below we create a annotation source for MassBank release
*2022.06*. This call will lookup the requested version in Biocondutor's (online)
`AnnotationHub` and download the data. Subsequent requests for the same
annotation resource will load the locally cached version instead. Upcoming
MassBank database releases will be added to `AnnotationHub` after their official
release and all previous releases will be available as well.
```{r, message = FALSE}
mbank <- MassBankSource("2022.06")
mbank
```
We can now use that annotation source object in the `matchSpectra()` call to
compare the experimental spectra from the previous examples against that release
of MassBank.
```{r}
res <- matchSpectra(
pest_ms2, mbank,
param = CompareSpectraParam(requirePrecursor = TRUE, ppm = 10))
res
```
The result object contains only the matching fragment spectra from the reference
database.
```{r}
target(res)
```
And the names of the compounds with matching fragment spectra.
```{r}
matchedData(res)$target_name
```
### Finding MS2 spectra for selected m/z and retention times
Sometimes it is needed to identify fragment spectra in a `Spectra` object for
selected (precursor) m/z values and retention times. An example would be if
compound quantification was performed with a LC-MS run and in a second LC-MS/MS
run (with the same chromatographic setup) fragment spectra of the same samples
were generated. From the first LC-MS data set *features* (or chromatographic
peaks) would be identified for which it would be necessary to retrieve fragment
spectra matching the m/z and retention times of these from the second, LC-MS/MS
data set (assuming that no big retention time shifts between the measurement
runs are expected). To illustrate this, we below first define a `data.frame`
that should represent a feature table such as defined by an analysis with the
`r Biocpkg("xcms")` package.
```{r}
fts <- data.frame(
feature_id = c("FT001", "FT002", "FT003", "FT004", "FT005"),
mzmed = c(313.43, 256.11, 224.08, 159.22, 224.08),
rtmed = c(38.5, 379.1, 168.2, 48.2, 381.1))
```
We next match the features from this data frame against the `Spectra` object
using an `MzRtParam` to identify fragment spectra with their precursor m/z and
retention times matching (with some tolerance) the values from the features.
```{r}
fts_mtch <- matchValues(fts, pest_ms2, MzRtParam(ppm = 50, toleranceRt = 3),
mzColname = c("mzmed", "precursorMz"),
rtColname = c("rtmed", "rtime"))
fts_mtch
whichQuery(fts_mtch)
```
Thus, we found fragment spectra matching the m/z and retention times for the 2nd
and 5th feature. To extract the `Spectra` matching these features, it would be
best to first reduce the object to features with at least one matching fragment
spectrum. The indices of query elements (in our case features) with matches can
be returned using the `whichQuery()` function. We use these below to subset our
matched result keeping only features for which matches were found:
```{r}
fts_mtched <- fts_mtch[whichQuery(fts_mtch)]
fts_mtched
```
The feature IDs for the matched spectra can be extracted using:
```{r}
fts_mtched$feature_id
```
We next need to extract the matching fragment spectra from the `target`
`Spectra` object. Here we use the `targetIndex()` function, that returns the
indices of the target spectra that were matched to the query.
```{r}
targetIndex(fts_mtched)
```
We extract thus next the fragment spectra matching at least one feature:
```{r}
fts_ms2 <- target(fts_mtched)[targetIndex(fts_mtched)]
fts_ms2
```
While we have now the spectra, we can't relate them (yet) to the features we
used as `query`. Extracting the `"feature_id"` column using the `$` function
from the the matched object would however return, for each match (since we
restricted the matched object to contain only features with matches) the feature
ID (provided in the original data frame). We can thus add this information as an
additional spectra variable to our `Spectra` object:
```{r}
fts_ms2$feature_id <- fts_mtched$feature_id
```
Be aware that extracting the `"feature_id"` column from the matched object
**before** restricting to features with matches would also return the values for
features for which no MS2 spectrum was found:
```{r}
fts_mtch$feature_id
```
Without the initial subsetting of the matched object to features with at least
one matching spectra, the extraction would be a bit more complicated:
```{r}
fts_ms2 <- target(fts_mtch)[targetIndex(fts_mtch)]
fts_ms2$feature_id <- query(fts_mtch)$feature_id[queryIndex(fts_mtch)]
fts_ms2$feature_id
```
This `Spectra` could next be used to match the fragment spectra from the
experiment to e.g. a reference database and with the assigned spectra variable
`"feature_id"` it would allow to map the results back to the quantified feature
matrix from the LC-MS run.
### Performance and parallel processing
Pre-filtering the target spectra based on similar precursor m/z (using
`requirePrecursor = TRUE` generally speeds up the call because a spectra
comparison needs only to be performed on subsets of target spectra. Performance
of the `matchSpectra()` function depends however also on the backend used for
the query and target `Spectra`. For some backends the peaks data (i.e. m/z and
intensity values) might not be already loaded into memory and hence spectra
comparisons might be slower because that data needs to be first loaded. As an
example, for `Spectra` objects, such as our `pest_ms2` variable, that use the
`MsBackendMzR`backend, the peaks data needs to be loaded from the raw data files
before the spectra similarity scores can be calculated. Changing the backend to
an in-memory data representation before `matchSpectra()` can thus improve the
performance (at the cost of a higher memory demand).
Below we change the backends of the `pest_ms2` and `minimb` objects to
`MsBackendMemory` which keeps all data (spectra and peaks data) in memory and we
compare the performance against the originally used `MsBackendMzR` (for
`pest_ms2`) and `MsBackendDataFrame` (for `minimb`).
```{r}
pest_ms2_mem <- setBackend(pest_ms2, MsBackendMemory())
minimb_mem <- setBackend(minimb, MsBackendMemory())
library(microbenchmark)
microbenchmark(compareSpectra(pest_ms2, minimb, param = csp),
compareSpectra(pest_ms2_mem, minimb_mem, param = csp),
times = 5)
```
There is a considerable performance gain by using the `MsBackendMemory` over the
two other backends, that comes however at the cost of a higher memory
demand. Thus, for large data sets (or reference libraries) this might not be an
option. See also [issue
#93](https://github.com/rformassspectrometry/MetaboAnnotation/issues/93) in the
`MetaboAnnotation` github repository for more benchmarks and information on
performance of `matchSpectra()`.
If for `target` a `Spectra` using a SQL database-based backend is used (such as
a `MsBackendMassbankSql`, `MsBackendCompDb` or `MsBackendSql`) and spectra
matching is performed with `requirePrecursorMz = TRUE`, simply *caching* the
precursor m/z values of all target spectra in memory improves the performance of
`matchSpectra` considerably. This can be easily done with e.g.
`target_sps$precursorMz <- precursorMz(target_sps)` where `target_sps` is the
`Spectra` object that uses one of the above mentioned backends. With this call
all precursor m/z values will be cached within `target_sps` and any
`precursorMz(target_sps)` call (which is used by `matchSpectra()` to select the
candidate spectra against which to compare a query spectrum) will not require
a separate SQL call.
Parallel processing can also improve performance, but might not be possible for
all backends. In particular, backends based on SQL databases don't allow
parallel processing because the database connection can not be shared across
different processes.
# Utility functions
*MetaboAnnotation* provides also other utility functions not directly related to
the annotation process. These are presented in this section.
## Creating mixes of standard compounds
The function `createStandardMixes()` allows for grouping of standard compounds
with a minimum difference in m/z based on user input.
```{r}
library(MetaboCoreUtils)
```
### Input format
As an example here I will extract a list of a 100 standard compounds with their
formula from a tab delimited text file provided with the package. Such files
could also be imported from an xlsx sheet using the *readxl* package.
```{r}
standard <- read.table(system.file("extdata", "Standard_list_example.txt",
package = "MetaboAnnotation"),
header = TRUE, sep = "\t", quote = "")
```
We will use functions from the MetaboCoreUtil package to get the mass of each
compounds and the m/z for the adducts wanted.
```{r}
#' Calculate mass based on formula of compounds
standard$mass <- calculateMass(standard$formula)
#' Create input for function
#' Calculate charge for 2 adducts
standard_charged <- mass2mz(standard$mass, adduct = c("[M+H]+", "[M+Na]+"))
#' have compounds names as rownames
rownames(standard_charged) <- standard[ , 1]
#' ensure the input `x` is a matrix
if (!is.matrix(standard_charged))
standard_charged <- as.matrix(standard_charged)
```
The input table for the createStandardMixes should thus look like the one shown
below, i.e. should be a numeric matrix with each row representing one compound.
Columns are expected to contain m/z values for different adducts of that
compound. Importantly, the row names of the matrix should represent the (unique)
compound names (or any other unique identifier for the compound).
```{r}
standard_charged
```
### Using the function
The `createStandardMixes()` function organizes given compounds in such a way
that each compound is placed in a group where all ions (adducts) have a m/z
difference exceeding a user-defined threshold (default: `min_diff = 2`). In
this initial example, we aim to group only a subset of our compound list and
execute the function with default parameters:
```{r}
group_no_randomization <- createStandardMixes(standard_charged[1:20,])
group_no_randomization
```
Let's see the number of compounds per group:
```{r}
table(group_no_randomization$group)
```
The grouping here worked perfectly, but let's now use the entire compound list
and run with the default parameter again:
```{r}
group_no_randomization <- createStandardMixes(standard_charged)
table(group_no_randomization$group)
```
This time we can see that the grouping is less ideal.
In this case we can switch the `iterativeRandomization = TRUE`.
```{r}
group_with_ramdomization <- createStandardMixes(standard_charged,
iterativeRandomization = TRUE)
table(group_with_ramdomization$group)
```
Changing `iterativeRandomization =` from the default `FALSE` to `TRUE` enables
the randomization of input `x` rows until it fits the `min_nstd` parameter. If
the list of compounds is very long or the requirement is hard to fit, this
function can take a bit longer if `iterativeRandomization =` is set to `TRUE.`
What if we want groups of a maximum of 20 and a minimum of 15 compounds, and
with a minimum difference of 2 m/z between compounds of the same group? If you
want to know more about the parameters of this function, look at
`?createStandardMixes`.
```{r}
set.seed(123)
group_with_ramdomization <- createStandardMixes(standard_charged,
max_nstd = 15,
min_nstd = 10,
min_diff = 2,
iterativeRandomization = TRUE)
table(group_with_ramdomization$group)
```
Great ! these groups look good; we can now export. As the function already
returns a `data.frame`, you can directly save is as an Excel file using
`write_xlsx()` from the *writexl* R package or as below in text format that can
also be open in Excel.
```{r eval=FALSE}
write.table(group_with_ramdomization,
file = "standard_mixes.txt", sep = "\t", quote = FALSE)
```
# Session information {-}
```{r sessioninfo, echo=FALSE}
sessionInfo()
```
# References