% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{The TargetSearch Package} %\VignetteDepends{TargetSearchData} %\VignetteKeywords{preprocess, GC-MS, analysis} %\VignettePackage{TargetSearch} \documentclass[12pt]{article} \usepackage{hyperref} \usepackage{graphicx} \usepackage[authoryear,round]{natbib} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{{\textit{#1}}} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \SweaveOpts{keep.source=TRUE} \title{The TargetSearch Package} \author{Alvaro Cuadros-Inostroza, Jan Lisec,\\Henning Redestig and Matthew A Hannah\\ \small{Max Planck Institute for Molecular Plant Physiology}\\ \small{Potsdam, Germany}\\ \small{\url{http://www.mpimp-golm.mpg.de/}} } \begin{document} % Sweave options \DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=1em,fontsize=\small} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em,fontsize=\small} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=1em,fontsize=\small} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \maketitle This document describes how to use \Rpackage{TargetSearch} to preprocess GC-MS data. <>= options(width=60) library(TargetSearch) library(TargetSearchData) @ \section{Supplied Files} This section describes the files that have to be prepared before running \Rpackage{TargetSearch}. These comprise a sample definition file, a reference library file and a retention marker definition file. Throughout this manual, we will use the example files provided by the package \Rpackage{TargetSearchData}. \subsection{NetCDF Files and Sample File} \Rpackage{TargetSearch} can currently read only NetCDF files. Many GC-MS software packages are able to convert raw chromatograms to NetCDF. Although some baseline correction functionality is included in \Rpackage{TargetSearch}, it is recommended to baseline correct your chromatograms before exporting to NetCDF to increase processing speed. Please refer to your software documentation for further details. Export the NetCDF files into a convenient location. Then prepare a tab-delimited text file describing your samples. It must be contain (at least) the two columns entitled: ``CDF\_FILE'' and ``MEASUREMENT\_DAY''. Other columns such as sample name, sample group, treatment, etc. may be additionally included to aid sample sub-setting and downstream analyses. An example is shown in table~\ref{sample:file}. \begin{table}[ht] \centering \begin{tabular}{lll} \hline CDF\_FILE & MEASUREMENT\_DAY & TIME\_POINT \\ \hline 7235eg08.cdf & 7235 & 1 \\ 7235eg11.cdf & 7235 & 1 \\ 7235eg26.cdf & 7235 & 1 \\ 7235eg04.cdf & 7235 & 3 \\ 7235eg30.cdf & 7235 & 3 \\ 7235eg32.cdf & 7235 & 3 \\ ...&&\\ \hline \end{tabular} \caption{Sample file example, ``samples.txt''} \label{sample:file} \end{table} To import the sample list into R, use the function \Rfunction{ImportSamples()} specifying the options \Rfunarg{CDFpath} (directory containing the NetCDF files) and \Rfunarg{RIpath} (directory where the transformed cdf files, containing the retention index corrected apex data, also called RI-files, will be saved). <>= library(TargetSearchData) library(TargetSearch) cdf.path <- system.file("gc-ms-data", package = "TargetSearchData") sample.file <- file.path(cdf.path, "samples.txt") samples <- ImportSamples(sample.file, CDFpath = cdf.path, RIpath = ".") @ Alternatively, you could create a \Robject{tsSample} object by using the sample class methods. <>= cdffiles <- dir(cdf.path, pattern="cdf$") # define the RI file names rifiles <- paste("RI_", sub("cdf","txt", cdffiles), sep = "") # take the measurement day info from the cdf file names. days <- sub("^([[:digit:]]+).*$","\\1",cdffiles) # sample names smp_names <- sub("\\.cdf", "", cdffiles) # add some sample info smp_data <- data.frame(CDF_FILE =cdffiles, GROUP = gl(5,3)) # create the sample object samples <- new("tsSample", Names = smp_names, CDFfiles = cdffiles, CDFpath = cdf.path, RIpath = ".", days = days, RIfiles = rifiles, data = smp_data) @ \subsection{Retention Time (RI) Markers} To align different chromatograms using RI markers, a tab-delimited definition file for these markers has to be provided. At a minimum it should contain three columns specifying the search window (lower and upper threshod) and the fixed standard value for each marker (table \ref{rim:file}). Further, a characteristic ion mass ($m/z$) has to be provided either in an additional column or as an argument \Rfunarg{mass} to function \Rfunction{ImportFameSettings()}. <>= rim.file <- file.path(cdf.path, "rimLimits.txt") rimLimits <- ImportFameSettings(rim.file, mass = 87) @ This will import the limits in ``rimLimits.txt'' file and set the marker mass to 87 for all markers. \begin{table}[ht] \centering \begin{tabular}{rrr} \hline LowerLimit & UpperLimit & RIstandard \\ \hline 230 & 280 & 262320\\ 290 & 340 & 323120\\ 350 & 400 & 381020\\ \hline \end{tabular} \caption{Retention time definition file example, ``rimLimits.txt''} \label{rim:file} \end{table} If you do not use RI markers, you can skip this part by setting the parameter \Rfunarg{rimLimits} to \Robject{NULL} in \Rfunction{RIcorrect} function. Please note that in this case, no retention time correction will be performed. \section{Baseline correction, peak identification and RI correction} Initially, \Rpackage{TargetSearch} identifies the local apex intensities in all chromatograms, finds the retention time (RT) of the RI markers and converts RT to RI using linear interpolation \citep{VandenDool1963}. This is done by the function \Rfunction{RIcorrect()}. Parameter to this function are sample and retention time limits objects, the mass range to extract (\Rfunarg{ MassRange = c(85,500)}), the intensity threshhold (\Rfunarg{IntThreshold = 10}), the peak picking method (\Rfunarg{pp.method}) and a \Rfunarg{Window} parameter that will be used by said method. The function will return a matrix with the retention times of all RI markers and create a tab-delimited file (RI file) for each chromatogram in the specified directory. These files contain the extracted peak list of the respective NetCDF file. <>= RImatrix <- RIcorrect(samples, rimLimits, massRange = c(85,500), IntThreshold = 10, pp.method = "smoothing", Window = 15) @ There are two peak picking methods available: ``smoothing'' implements the algorithm used by Tagfinder \citet{Luedemann2008}, while the ``ppc'' algorithm is a adapted based on the function \Rfunction{ppc.peaks} from the package \Rpackage{ppc}. \Rfunction{ppc.peaks} searches for local maxima within a given time window. Therefore the \Rfunarg{Window} parameter sets the window size of the chosen peak picking method. In case the chromatograms were not baseline corrected by the platform specific GC-MS software, it is possible to achieve that in \Rpackage{TargetSearch} by setting \Rfunarg{baseline} to TRUE in function \Rfunction{RIcorrect()} which calls the function \Rfunction{baselineCorrection()}. This algorithm is based on the work of \citet{chang07} and a description is found in the \Rfunction{baselineCorrection()} documentation. Some options can be passed to \Rfunction{baselineCorrection()} via \Rfunarg{baseline.opts}. After that, it is possible to check for outliers in the samples by using the function \Rfunction{FAMEoutliers()}. It creates a PDF report of the RI markers including information about possible outliers that the user can remove or not. Alternatively, RI markers can % how do we remove outliers be checked manually using the function \Rfunction{plotFAME} (Figure~\ref{fig:01}). <>= outliers <- FAMEoutliers(samples, RImatrix, threshold = 3) @ Here, \Rfunarg{threshold} sets the number of standard deviations a value has to be away from the mean before it will be considered an outlier. In this example, however, no outliers were detected. \begin{figure}[htp] \centering <>= plotFAME(samples, RImatrix, 1) @ \caption{Retention Index Marker 1.}\label{fig:01} \end{figure} Note that \Rpackage{TargetSearch} assumes that the RI markers are injected together with the biological samples. A different approach is to inject them separately. If this is the case, please look at the script \texttt{RetentionIndexCorrection.R} for a workaround. \section{Library Search} \subsection{Reference Library File} The ``reference library'' file contains the information of the metabolites or mass spectral tags (MSTs) that will be searched for in the chromatograms. A public spectra database could be found here \url{http://gmd.mpimp-golm.mpg.de/} at \emph{The Golm Metabolome Database} \citep{Kopka2005,Hummel2007,Hummel2013}. Required information is the metabolite name (``Name''), expected retention time index (``RI''), selective masses (``SEL\_MASS''), most abundant masses (``TOP\_MASS''), spectrum (``SPECTRUM'') and RI deviations (``Win\_1'', ``Win\_2'', ``Win\_3''). See example in table~\ref{library:file}. The columns ``Name'' and ``RI'' are mandatory and you have at least to include one of the columns ``SEL\_MASS'', ``TOP\_MASS'' or ``SPECTRUM'' in the file (see below). The RI deviation columns are optional. \begin{table}[ht] \centering {\footnotesize \begin{tabular}{lllll} \hline Name & RI & Win\_1 & SEL\_MASS & SPECTRUM\\ \hline Pyruvic acid & 222767 & 4000 & 89;115;158;174;189 & 85:7 86:14 87:7 88:5 8... \\ Glycine (2TMS) & 228554 & 4000 & 86;102;147;176;204 & 86:26 87:19 88:8 89:4...\\ Valine & 271500 & 2000 & 100;144;156;218;246 & 85:8 86:14 87:6 88:5 8...\\ Glycerol (3TMS) & 292183 & 2000 & 103;117;205;293 & 85:14 86:2 87:16 88:13...\\ Leucine & 306800 & 1500 & 102;158;232;260 & 158:999 159:148 160:45...\\ Isoleucine & 319900 & 1500 & 102;103;158;163;218 & 90:11 91:2 92:1 93:1 9...\\ Glycine & 325000 & 2000 & 86;100;174;248;276 & 85:6 86:245 87:24 88:12...\\ \hline \end{tabular} } \caption{Reference Library example, ``library.txt''} \label{library:file} \end{table} In this file, masses and intensities must be positive integers. RIs and RI deviations can be any positive real number. The selective and most abundant masses list must be delimited by semicolon (;). The spectrum is described by a list of mass and intensity pair. Every mass-intensity pair is separated by colon (:) and different pairs are separated by spaces. The function \Rfunction{ImportLibrary()} imports the reference file. <>= lib.file <- file.path(cdf.path, "library.txt") lib <- ImportLibrary(lib.file, RI_dev = c(2000,1000,200), TopMasses = 15, ExcludeMasses = c(147, 148, 149)) @ Here we set the RI window deviations to 2000, 1000 and 200 RI units. Since ``Win\_1'' column is already in the file, the first value (2000) is ignored. Also, the 15th most abundant masses are taken but excluding the masses 147, 148 and 149 (common confounding masses) \subsection{Library Search Algorithm} The library search is performed in three steps. First, for every metabolite, selective masses are searched in a given time window around the expected RI. This is done by the function \Rfunction{medianRILib()}. This function calculates the median RI of the selective masses and return new library object with the updated RI. The time deviation is given either in the library file (column ``Win\_1'') or when the library is imported (see \Rfunction{ImportLibrary()}). <>= lib <- medianRILib(samples, lib) @ It is also posible to examinate visually the RI deviation of the metabolites by setting the parameter \Rfunarg{makeReport=TRUE}, which creates a pdf report like the one shown in figure~\ref{fig:02}. This may help to set or update the expected RI deviation. \begin{figure}[htp] \centering <>= resPeaks <- FindPeaks(RIfiles(samples), refLib(lib, w = 1, sel = TRUE)) plotRIdev(lib, resPeaks, libId = 1:9) @ \caption{RI deviation of first 9 metabolites in the library.}\label{fig:02} \end{figure} In the second step, the function \Rfunction{sampleRI()} searches the selective masses again, but using the updated RI and the RI deviation defined in the library object (``Win\_2''). After that, the intensities of the selected masses are normalised to the median of the day, and then used to extract other masses with correlated apex profiles. The masses for which the Pearson correlation coefficient is above \Rfunarg{r\_thres} are taken as metabolite markers and their RIs are averaged on a per sample basis. This average RI represents the exact position where the metabolite elutes in the respective sample, which is returned in a matrix form. <>= cor_RI <- sampleRI(samples, lib, r_thres = 0.95, method = "dayNorm") @ The third step will look up for all the masses (selective and most abundant masses) in all the samples. This is done by the function \Rfunction{peakFind()}. It returns a \Rclass{tsMSdata} object with the intensities and RI of every mass (rows) and every sample (columns) that were search for. <>= peakData <- peakFind(samples, lib, cor_RI) @ The intensity and RI slots can be accessed by using the \Rmethod{Intensity} and \Rmethod{retIndex} methods. Each slot is a list, where every component of the list is a matrix, which is related to a metabolite in the library. There should be as many components as metabolites in the library. In every matrix, columns are samples and rows are masses. <>= met.RI <- retIndex(peakData) met.Intensity <- Intensity(peakData) # show the intensity values of the first metabolite. met.Intensity[[1]] @ \section{Metabolite Profile} The function \Rfunction{Profile} makes a profile of the MS data by averaging all the normalised mass intensities whose Pearson coefficient is greater that \Rfunarg{r\_thresh}. <>= MetabProfile <- Profile(samples, lib, peakData, r_thres = 0.95, method = "dayNorm") @ A \Rclass{msProfile} object is returned. The averaged intensities and RI matrices that can be obtained by \Rmethod{Intensity} and \Rmethod{retIndex} methods. The profile information is represented by a \Rclass{data.frame} in the \Rmethod{info} slot (accessible by \Rmethod{profileInfo} method). The columns are: \begin{description} \item[Name] The metabolite/analyte name. \item[Lib\_RI] The expected RI (library RI). \item[Mass\_count] The number of correlating masses. \item[Non\_consecutive\_Mass\_count] Same as above, but not counting the consecutive masses. \item[Sample\_Count\_per\_Mass] The number of samples in which a correlating mass was found. The numbers are separated by semi-colon (;) and each number corresponds to one correlating masses (same order). If all the numbers are the same, only that unique number is shown. \item[Masses] The correlating masses. \item[RI] The average RI. \item[Score\_all\_masses] The similarity score calculated using the average intensity of all the masses that were searched for, regardless of whether they are correlating masses. \item[Score\_cor\_masses] Same as above, but only correlating masses are considered. \end{description} As metabolites with similar selective masses and RIs can be present in metabolite libraries, it is necessary to reduce redundancy. This is performed by the function \Rfunction{ProfileCleanUp} which selects peaks for which the RI gap is smaller than \Rfunarg{timeSplit} and computes the Pearson correlation between them. When two metabolites within such a time-group are tightly correlated (given by \Rfunarg{r\_thres}) only the one with more correlated masses is retained. <>= finalProfile <- ProfileCleanUp(MetabProfile, timeSplit = 500, r_thres = 0.95) @ The function returns a \Rclass{msProfile} object. The \Rmethod{info} slot is similar as described above, but extra columns with a ``Cor\_'' preffix (e.g., ``Cor\_Name'') are included. They provide information about metabolite redundancy. \section{Peaks and Spectra Visualisation} Finally, it may be of interest to check the chromatographic peak of selected metabolites and compare the median spectra of the metabolites, i.e., the median intensities of the selected masses across all the samples, with the reference spectra of the library. There are two functions to do so: \Rfunction{plotPeak} and \Rfunction{plotSpectra}. For example, we can compare the median spectrum of ``Valine'' against its spectrum reference. Here we look for the library index of ``Valine'' and plot the spectra comparison in a ``head-tail'' plot (figure~\ref{fig:03}). \begin{figure}[htp] \centering <>= grep("Valine", libName(lib)) plotSpectra(lib, peakData, libId = 3, type = "ht") @ \caption{Spectra comparison of ``Valine''}\label{fig:03} \end{figure} To look at the chromatographic peak of ``Valine'' in a given sample, we use the functions \Rfunction{peakCDFextraction} to extract the raw chromatogram and \Rfunction{plotPeak} to plot the peak (figure~\ref{fig:04}). \begin{figure}[htp] \centering <>= # we select the first sample sample.id <- 1 cdf.file <- file.path(cdf.path, cdffiles[sample.id]) rawpeaks <- peakCDFextraction(cdf.file, massRange = c(85,500)) # which.met=3 (id of Valine) plotPeak(samples, lib, MetabProfile, rawpeaks, which.smp=sample.id, which.met=3, massRange=c(85,500), corMass=FALSE) @ \caption{Chromatographic peak of Valine.}\label{fig:04} \end{figure} Refer to the documentation of the functions \Rfunction{plotPeak} and \Rfunction{plotSpectra} for further options not covered here. \section{TargetSearch GUI} We provide a grahical user interface intended to facilitate the use of \emph{TargetSearch} for users unfamiliar with R. Many parameters that would be set calling the individual \Rpackage{TargetSearch} functions as described in this document can be set here ``in one go'' before running the complete analysis. A screenshot of the GUI is shown in figure~\ref{fig:05}. \begin{figure}[htp] \centering \includegraphics[scale=0.7]{tsGUI.png} \caption{The \Rpackage{TargetSearch} GUI.}\label{fig:05} \end{figure} This is a descript of all the GUI options. \begin{description} \item[Working Directory]: Use the \emph{Browse}-button to select the folder on your hard drive containing all your GC-MS data files. The output of \Rpackage{TargetSearch} will be written to this folder too. \item[File Import]: Clicking \emph{NetCDF Data} or \emph{Apex Data} radio buttons will open a file select dialog. Choose the files you would like to be processed. You may check your selection pressing the \emph{Show}-button. \item[Baseline Correction]: Clicking \emph{on/off} button will perform baseline correction before peak detection. If selected, the \Rfunarg{threshold} parameter is a numeric value between 0 and 1. A value of one returns a baseline above the noise, 0.5 in the middle of the noise and 0 below the noise. See \Rfunction{baselineCorrection} documentation for further details. \item[Peak Detection]: \emph{Intensity Counts threshold} defines the minimum apex intensity incorporated in the analysis. A value of 1 would include all peaks. \emph{Mass Range} allows to limit the mass values ($m/z$) to be included in the analysis. \emph{Smoothing} averages raw data to eliminate some inherent noise leading to multiple peaks otherwise. \item[Retention Index Correction]: Retention Index Correction is neccessary and applied only if you supply NetCDF Data (Apex Data contain already Retention Indices). You may \emph{Load} or \emph{Create} the search windows for your RI-Markers here. \item[Library]: A Library (to detect metabolites) usable by \Rpackage{TargetSearch} contains at least information about the metabolite `Name', its expected `RI' and the selective masses in its spectrum `SEL\_MASS'. You may \emph{Load} or \emph{Create} one yourself using the respective buttons. A more detailed description of the file formats can be found in \Rfunction{ImportLibrary}. \emph{Search Windows} refers to the allowed RI deviation of your metabolites which are narrowed in 3 consecutive searches. \emph{No. of top masses} is the number of most abundant masses that will be selected from the spectra, besides the selected masses. \emph{ExcludeMasses} is only used when masses are obtained from the spectra. For example, "126,147:149,160" means that the masses 126, 147, 148, 149, and 160 will be excluded from the analysis (unless they are used as selected masses). \item[Normalization]: This selects how the data will be normalized during the metabolite search. Options are \Rfunarg{dayNorm}, a day based median normalization, \Rfunarg{medianNorm}, normalization using the median of all the intensities of a given mass, and \Rfunarg{none}, no normalization at all. \item[Final Profiles]: Here you may set the parameters used by the functions \Rfunction{Profile} and \Rfunction{ProfileCleanUp}. \emph{timesplit} sets an RI window that will be used to look for metabolites that could have been redundanly identified. \emph{correl. thr.} is the correlation threshold and \emph{min. number of correlation samples} is a threshold used to make sure that correlations are computed with at least said number of observations. \emph{prioritization} is used to resolv ambiguities by taking the metabolite with the most correlating masses (\emph{corr. mass}) or the highest similarity \emph{score}. Metabolites with corr. mass and score higher than the thresholds will be suggested above the one that do not pass the thresholds, which are marked as ``unidentified.'' \item[Quantification Mass]: This section is used to generate an intensity matrix based on one quantification mass. The quantification mass can be specified in the library file (column "QUANT\_MASS") or selected automatically by choosing \emph{max intensity} or \emph{max observations}. \emph{max intensity} selects the mass with the highest intensity across the samples, while \emph{max observation} takes the mass with less missing values. \item[Parameters]: You may \emph{Save} the current parameters as an \texttt{*.RData} file or \emph{Load} previously saved parameters to compare the outcome of different settings or just repeat the analysis. \item[Program]: \emph{Run} starts to process all currently selected files using the current parameters and saving output to \emph{Working Directory}. \emph{Quit} closes the GUI. \end{description} \section{Untargeted search} Although \Rpackage{TargetSearch} was designed to be targeted oriented, it is possible to perform untargeted searches. The basic idea is to create a library that contains evenly distributed ``metabolites'' in time and every ``metabolite'' uses the whole range of possible masses as selective masses. An example: <>= metRI <- seq(200000, 300000, by = 5000) metMZ <- 85:250 metNames <- paste("Metab",format(metRI,digits=6), sep = "_") @ Here we define a set of metabolites located every 5000 RI units from each other in the RI range 200000-300000, with selective masses in the range of 85-300, and assign them a name. After that, we create an \Robject{tsLib} object. <>= metLib <- new("tsLib", Name = metNames, RI = metRI, selMass = rep(list(metMZ), length(metRI)), RIdev = c(3000, 1500, 500)) @ Now we can use this library object to perform a targetive search with this library on the \emph{E. coli} samples as we did before. <>= metLib <- medianRILib(samples, metLib) metCorRI <- sampleRI(samples, metLib) metPeakData <- peakFind(samples, metLib, metCorRI) metProfile <- Profile(samples, metLib, metPeakData) metFinProf <- ProfileCleanUp(metProfile, timeSplit = 500) @ The \Robject{metFinProf} object can be used to create a new library by taking only the metabolites that have, for example, more than 5 correlating masses and using the correlating masses as selective ones. <>= sum( profileInfo(metFinProf)$Mass_count > 5) tmp <- profileInfo(metFinProf)$Mass_count > 5 metRI <- profileInfo(metFinProf)$RI[tmp] metNames <- as.character( profileInfo(metFinProf)$Name[tmp] ) metMZ <- sapply(profileInfo(metFinProf)$Masses[tmp], function(x) as.numeric(unlist(strsplit(x,";"))) ) metLib <- new("tsLib", Name = metNames, RI = metRI, selMass = metMZ, RIdev = c(1500, 750, 250)) @ After the new library object is created, the process can be repeated as shown above. Finally, by using the function \Rfunction{writeMSP}, it is possible to export the spectrum of the unknown metabolites to MSP format used by NIST mass spectra search (\url{http://www.nist.gov/srd/mslist.htm}), so the unknown spectra can be search against known metabolite spectra databases. \bibliographystyle{plainnat} \bibliography{biblio} \end{document}