\documentclass[12pt]{article}

\usepackage{hyperref} \usepackage{framed} \usepackage{lscape}
\usepackage{array}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\textwidth=6.2in \textheight=8.5in
%\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in
\headheight=-.3in

\begin{document} \title{An R package to process LC/MS metabolomic
data: {MAIT} (Metabolite Automatic Identification Toolkit)}
\small{\author{Francesc Fern\'andez-Albert, Rafael Llorach,\\ Cristina
Andr\'es-Lacueva, Alexandre Perera}}
\maketitle %\VignetteIndexEntry{MAIT Vignette}

%% an abstract and keywords
\section{Abstract}
 
Processing metabolomic liquid chromatography and mass spectrometry
(LC/MS) data files is time consuming. Currently available R tools
allow for only a limited number of processing steps and online tools
are hard to use in a programmable fashion. This paper introduces the
metabolite automatic identification toolkit {MAIT} package, which
allows users to perform end-to-end LC/MS metabolomic data
analysis. The package is especially focused on improving the peak
annotation stage and provides tools to validate the statistical
results of the analysis. This validation stage consists of a repeated
random sub-sampling cross-validation procedure evaluated through the
classification ratio of the sample files. {MAIT} also includes
functions that create a set of tables and plots, such as principal
component analysis (PCA) score plots, cluster heat maps or
boxplots. To identify which metabolites are related to statistically
significant features, {MAIT} includes a metabolite database for a
metabolite identification stage.


\section{Introduction}

Liquid Chromatography and Mass Spectrometry (LC/MS) is an analytical
instrument widely used in metabolomics to detect molecules in
biological samples. It breaks the molecules down into pieces, some of
which are detected as peaks in the mass spectrometer. Metabolic
profiling of LC/MS samples basically consists of a peak detection and
signal normalisation step, followed by multivariate statistical
analysis such as principal components analysis (PCA) and univariate
statistical tests such as ANOVA .  \\ As analysing metabolomic data is
time consuming, a wide array of software tools are available,
including commercial tools such as Analyst\textregistered\
software. There are programmatic R packages, such as {XCMS} to detect
peaks or {CAMERA} package and {AStream} , which cover only peak
annotation. Another category of free tools available consists of those
having online access through a graphical user interface (GUI), such as
XCMS Online (\url{http://xcmsonline.scripps.edu}) or MetaboAnalyst,
both extensively used.  \\ These online tools are difficult to use in
a programmable fashion. They are also designed and programmed to be
used step by step with user intervention, making it difficult to set
up metabolomic data analysis workflow. These R packages involve only a
part of the entire metabolomic analysis process. Although there are
specific R packages whose objective is peak annotation, this is still
an issue in analysing LC/MS metabolomic data.\\

We introduce a new R package called metabolite automatic
identification toolkit (MAIT) for automatic LC/MS analysis. The goal
of the {MAIT} package is to provide an array of tools for programmable
metabolomic end-to-end analysis. It consequently has special functions
to improve peak annotation through the processes called
biotransformations. Specifically, {MAIT} is designed to look for
statistically significant metabolites that separate the classes in the
data.




\section{Methodology}

The main processing steps for metabolomic LC/MS data include the
following stages: peak detection, peak annotation and statistical
analysis. In the peak detection stage, the objective is to detect the
peaks in the LC/MS sample files. The peak annotation stage identifies
the metabolites in the metabolomic samples better by increasing the
chemical and biological information in the data set. A statistical
analysis step is essential to obtain significant sample features. All
these 3 steps are covered in the {MAIT} workflow.

\subsection{Peak Detection}

Peak detection in metabolomic LC/MS data sets is a complex issue for
which several approaches have been developed. Two of the most well
established techniques are matched filter and the centWave algorithm
. {MAIT} can use both algorithms through the {XCMS} package.

\subsection{Peak Annotation}
\label{Peak Annotation Methodology} The {MAIT} package uses 3
complementary steps in the peak annotation stage.

\begin{itemize}
\item{The first annotation step uses a peak correlation distance
approach and a retention time window to ascertain which peaks come
from the same source metabolite, following the procedure defined in
CAMERA package. The peaks within each peak group are annotated
following a reference adduct/fragment table and a mass tolerance
window.}


\item{The second step uses a mass tolerance window inside the peak
groups detected in the first step to look for more specific mass
losses called biotransformations. To do this, {MAIT} uses a predefined
biotransformation table where the biotransformations we want to find
are saved. A user-defined biotransformation table can be set as an
input following the procedure defined in Section (\ref{Using MAIT
Biotransformations}).}

\item{The third annotation step is the metabolite identification
stage, in which a predefined metabolite database is mined to search
for the significant masses, also using a tolerance window. This
database is the Human Metabolome Database (HMDB), 2009/07 version.}
\end{itemize}

\subsection{Statistical Analysis}

The objective of analysing metabolomic profiling data is to obtain the
statistically significant features that contain the highest amount of
class-related information. To gather these features, {MAIT} applies
standard univariate statistical tests (ANOVA or Student's t-test) to
every feature and selects the significant set of features by setting
up a user-defined threshold P-value. Bonferroni multiple test
correction can be applied to the resulting P-values.  We propose a
validation test to quantify how well the data classes are separated by
the statistically significant features. The separation is validated
through a repeated random sub-sampling cross-validation using partial
least squares and discriminant analysis (PLS-DA), support vector
machine (SVM) with a radial Kernel and K-nearest neighbours
(KNN). Overall and class-related classification ratios are obtained to
evaluate the class-related information of the significant features.


\section{Using MAIT}
\label{example}
The data files for this example are a subset of the data used in
reference , which are freely distributed through the {XCMS}
package. In these data there are 2 classes of mice: a group where the
fatty acid amide hydrolase gene has been suppressed (class knockout or
KO) and a group of wild type mice (class wild type or WT). There are 6
spinal cord samples in each class. In the following, the {MAIT}
package will be used to read and analyse these samples using the main
functions discussed in Section \ref{workflow}. The significant
features related to each class will be found using statistical tests
and analysed through the different plots that {MAIT} produces.

\subsection{Data Import}

Each sample class file should be placed in a directory with the class
name. All the class folders should be placed under a directory
containing only the folders with the files to be analysed. In this
case, 2 classes are present in the data. An example of correct file
distribution using the example data files is shown in Figure
\ref{sampleTree}.\\

\subsection{Peak Detection}

\begin{figure}
\centering
\includegraphics[width=2.5cm,height=5cm]{./sampleTree.png}
\caption{Example of the correct sample distribution for {MAIT} package
use. Each sample file has to be saved under a folder with its class
name.}
\centering
\label{sampleTree}
\end{figure}


Once the data is placed in 2 subdirectories of a single folder, the
function {sampleProcessing()} is run to detect the peaks, group the
peaks across samples, perform the retention time correction and carry
out the peak filling process. As function {sampleProcessing()} uses
the {XCMS} package to perform these 4 processing steps, this function
exposes {XCMS} parameters that might be modified to improve the peak
detection step. A project name should be defined because all the
tables and plots will be saved in a folder using that name. For
example, typing {project = "project\_Test"}, the output result folder
will be {"Results\_project\_Test"}.\\

By choosing {"MAIT\_Demo"} as the project name, the peak detection
stage can be launched by typing:

 


<<libraryMAIT,echo=TRUE,results=hide>>=
library(MAIT)
@
<<sampleProcessing>>=
library(faahKO)
cdfFiles<-system.file("cdf", package="faahKO", mustWork=TRUE)
MAIT <- sampleProcessing(dataDir = cdfFiles, project = "MAIT_Demo", 
snThres=2,rtStep=0.03)
@

After having launched the {sampleProcessing} function, peaks are
detected, they are grouped across samples and their retention time
values are corrected. A short summary in the R session can be
retrieved by typing the name of the {MAIT-class} object.

<<printMAIT1>>=
MAIT
@

The result is a {MAIT-class} object that contains information about
the peaks detected, their class names and how many files each class
contains. A longer summary of the data is retrieved by performing a
summary of a {MAIT-class} object. In this longer summary version,
further information related to the input parameters of the whole
analysis is displayed. This functionality is especially useful in
terms of traceability of the analysis.

<<summaryMAIT1>>=
summary(MAIT)
@


\subsection{Peak Annotation}


The next step in the data processing is the first peak annotation
step, which is performed through the {peakAnnotation()}. If the input
parameter {adductTable} is not set, then the default {MAIT} table for
positive polarisation will be selected. However, if the {adductTable}
parameter is set to "negAdducts", the default {MAIT} table for
negative fragments will be chosen instead. {peakAnnotation} function
also creates an output table (see Table \ref{table_outputs})
containing the peak mass (in charge/mass units), the retention time
(in minutes) and the spectral ID number for all the peaks detected. A
call of the function {peakAnnotation} may be:\\

<<peakAnnotation>>=
MAIT <- peakAnnotation(MAIT.object = MAIT,corrWithSamp = 0.7, 
corrBetSamp = 0.75,perfwhm = 0.6)
@

Because the parameter {adductTable} was not set in the
{peakAnnotation} call, a warning was shown informing that the default
{MAIT} table for positive polarisation mode was selected. The
{xsAnnotated} object that contains all the information related to
peaks, spectra and their annotation is stored in the {MAIT} object. It
can be retrieved by typing:

<<rawData>>=
rawData(MAIT)
@



\subsection{Statistical Analysis}

Following the first peak annotation stage, we want to know which
features are different between classes. Consequently, we run the
function {spectralSigFeatures()}.

<<warning=FALSE>>=
MAIT<- spectralSigFeatures(MAIT.object = MAIT,pvalue=0.05,
       p.adj="none",scale=FALSE)
summary(MAIT)
@

It is worth mentioning that by setting the {scale} parameter to TRUE,
the data will be scaled to have unit variance. A summary of the
statistically significant features is created and saved in a table
called significantFeatures.csv (see Table \ref{table_outputs}). It is
placed inside the Tables subfolder located in the project folder. This
table shows characteristics of the statistically significant features,
such as their P-value, the peak annotation or the expression of the
peaks across samples. This table can be retrieved at any time from the
{MAIT}-class objects by typing the instruction:

<<>>=
signTable <- sigPeaksTable(MAIT.object = MAIT, printCSVfile = FALSE)
@
\begin{Sinput}
head(signTable)
\end{Sinput}


The number of significant features can be retrieved from the
{MAIT-class} object as follows:

<<>>=
MAIT
@

\subsection{Statistical Plots}

Out of 2,402 features, 106 were found to be statistically
significant. At this point, several {MAIT} functions can be used to
extract and visualise the results of the analysis. Functions
{plotBoxplot}, {plotHeatmap}, {plotPCA} and {plotPLS} automatically
generate boxplots, heat maps and PCA/PLS score plot files in the
project folder when they are applied to a MAIT object (see Table
\ref{table_outputs}).

<<BoxHeatplots,eval=FALSE>>=
plotBoxplot(MAIT)
plotHeatmap(MAIT)
@

<<PLSPCA,warning=FALSE>>=
MAIT<-plotPCA(MAIT,plot3d=FALSE)
MAIT<-plotPLS(MAIT,plot3d=FALSE)
PLSmodel <- model(MAIT, type = "PLS")
PCAmodel <- model(MAIT, type = "PCA")
@
<<showPLSmodel,warning=FALSE>>=
PLSmodel
@

<<showPLSscores,warning=FALSE>>=
pcaScores(MAIT)
@


%scores(x=MAIT,model = "PCA")

The {plotPCA} and {plotPLS} functions produce {MAIT} objects with the
corresponding PCA and PLS models saved inside. The models, loadings
and scores can be retrieved from the {MAIT} objects by using the
functions {model}, {loadings} and {scores}:


All the output figures are saved in their corresponding subfolders
contained in the project folder. The names of the folders for the
boxplots, heat maps and score plots are Boxplots, Heatmaps,
PCA\_Scoreplots and PLS\_Scoreplots respectively. Inside the R
session, the project folder is recovered by typing:

<<resultsPath>>=
resultsPath(MAIT)
@


\subsection{Biotransformations}
\label{Using MAIT Biotransformations}

Before identifying the metabolites, peak annotation can be improved
using the function {Biotransformations} to make interpreting the
results easier. The {MAIT} package uses a default biotransformations
table, but another table can be defined by the user and introduced by
using the {bioTable} function input variable. The biotransformations
table that {MAIT} uses is saved inside the file {MAITtables.RData},
under the name {biotransformationsTable}.

<<Biotransformations,warning=FALSE>>= MAIT <-
Biotransformations(MAIT.object = MAIT, peakPrecision = 0.005) 
@

Building a user-defined biotransformations table from the {MAIT}
default table or adding a new biotransformation is
straightforward. For example, let's say we want to add a new adduct
called "custom\_biotrans" whose mass loss is 105.

<<myBiotransf>>= 
data(MAITtables)
myBiotransformation<-c("custom_biotrans",105.0)
myBiotable<-biotransformationsTable
myBiotable[,1]<-as.character(myBiotable[,1])
myBiotable<-rbind(myBiotable,myBiotransformation)
myBiotable[,1]<-as.factor(myBiotable[,1]) 
tail(myBiotable) 
@

To build an entire new biotransformations table, you only need to
follow the format of the biotransformationsTable, which means writing
the name of the biotransformations as factors in the {NAME} field of
the data frame and their corresponding mass losses in the {MASSDIFF}
field.



\subsection{Metabolite Identification}

Once the biotransformations annotation step is finished, the
significant features have been enriched with a more specific
annotation. The annotation procedure performed by the
{Biotransformations()} function never replaces the peak annotations
already done by other functions. {MAIT} considers the peak annotations
to be complementary; therefore, when new annotations are detected,
they are added to the current peak annotation and the identification
function may be launched to identify the metabolites corresponding to
the statistically significant features in the data.

<<identifyMetabolites,warning=FALSE>>=
MAIT <- identifyMetabolites(MAIT.object = MAIT, peakTolerance = 0.005)
@

By default, the function {identifyMetabolites()} looks for the peaks
of the significant features in the {MAIT} default metabolite
database. The input parameter {peakTolerance} defines the tolerance
between the peak and a database compound to be considered a possible
match. It is set to 0.005 mass/charge units by default. To check the
results easily, function {identifyMetabolites} creates a table
containing the significant feature characteristics and the possible
metabolite identifications. Such a table is recovered from the
{MAIT}-class object using the instruction:

<<metaboliteTable,warning=FALSE>>=
metTable<-metaboliteTable(MAIT)
head(metTable)
@

This table provides useful results about the analysis of the samples,
such as the P-value of the statistical test, its adduct or isotope
annotation and the name of any possible hit in the database. Note that
if no metabolite has been found in the database for a certain feature,
it is labelled as {"unknown"} in the table.

\subsection{Validation}

Finally, we will use the function {Validation()} to check the
predictive value of the significant features. All the information
related to the output of the {Validation()} function is saved in the
project directory in a folder called "Validation". Two boxplots
showing the overall and per class classification ratios are created,
along with every confusion matrix corresponding to each iteration (see
Table \ref{table_outputs}).


<<validation,warning=FALSE>>=
 MAIT <- Validation(Iterations = 20, trainSamples= 3, 
MAIT.object = MAIT)
@

A summary of a MAIT object, which includes the overall classification
values, can be accessed:

<<summaryMAIT2>>=
summary(MAIT)
@

It is also possible to gather the classification ratios per class,
classifier used and iteration number by using the function
{classifRatioClasses()}:

<<classifRatioClasses>>=
classifRatioClasses(MAIT)
@

The classification ratios are 100\%; the set of significant features
separates the samples belonging to these classes.



\subsection{Using External Peak Data}
\label{Using External Peak Data}
Taking advantage of the modularised design of {MAIT}, it is possible
to use the function {MAITbuilder} to import peak data and analyse it
using the {MAIT} statistical functions. As stated in section
\ref{External Peak Data}, there are certain arguments that should be
provided depending on which function is wanted to be launched. In this
section we will show an example of this data importation procedure
using the same data that we have been using in the tutorial so
far. Let's say we have a peak table recorded in positive polarisation
mode with the peak masses and retention time values such as:

<<defData>>=
peaks <- scores(MAIT)
masses <- getPeaklist(MAIT)$mz
rt <- getPeaklist(MAIT)$rt/60
@

We want to perform an annotation stage and metabolite identification
on these data. To that end, we can launch the function {MAITbuilder}
to build a {MAIT-class} object with the data in the table:


<<MAITbuilder>>=
importMAIT <- MAITbuilder(data = peaks, masses = masses, 
              rt = rt,significantFeatures = TRUE, 
              spectraEstimation = TRUE,rtRange=0.2,
              corThresh=0.7)
@

We have selected the option {spectraEstimation} as {TRUE} because we
do not know the grouping of the peaks into spectra. As we want to
annotate and identify all the peaks in the data frame, we set the flag
{significantFeatures} to TRUE. At this point, we can launch the
Biotransformations function:


<<BiotransformationsBuilder>>=
importMAIT <- Biotransformations(MAIT.object = importMAIT, 
              adductAnnotation = TRUE, 
              peakPrecision = 0.005, adductTable = NULL)
@

We set the {adductAnnotation} flag to {TRUE} as we want to perform an
adduct annotation step. The parameter {adductTable} set to {NULL}
implies that a positive polarisation adduct annotation stage will be
performed. To run a negative annotation, the argument should be set to
{negAdducts}. The metabolite identification stage is launched as in
the previous case:

<<identifyMetabolitesBuilder>>=
importMAIT <- identifyMetabolites(MAIT.object = importMAIT, 
              peakTolerance=0.005,polarity="positive")
@



\end{document}