--- title: "MetNet: Inferring metabolic networks from untargeted high-resolution mass spectrometry data" author: - name: Thomas Naake mail: thomasnaake@googlemail.com affiliation: Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany package: MetNet abstract: > A major bottleneck of mass spectrometry-based metabolomic analysis is still the rapid detection and annotation of unknown m/z features across biological matrices. Traditionally, the annotation was done manually imposing constraints in reproducibility and automatization. Furthermore, different analysis tools are typically used at different steps of analyses which requires parsing of data and changing of environments. I present here `MetNet`, a novel `R` package, that is compatible with the output of the `xcms`/`CAMERA` suite and that uses the data-rich output of mass spectrometry metabolomics to putatively link features on their relation to other features in the data set. `MetNet` uses both structural and quantitative information of metabolomics data for network inference that will guide metabolite annotation. output: BiocStyle::html_document: toc_float: true bibliography: MetNet-citations.bib vignette: > %\VignetteIndexEntry{Workflow for high-resolution metabolomics data} %\VignetteEngine{knitr::rmarkdown} %\VignetteKeywords{Mass Spectrometry, MS, Metabolomics, Visualization, Network} %\VignettePackage{MetNet-vignette} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r env, include=FALSE, echo=FALSE, cache=FALSE} library("knitr") opts_chunk$set(stop_on_error = 1L) suppressPackageStartupMessages(library("MetNet")) ``` # Introduction {#sec:intro} Among the main challenges in mass spectrometric metabolomic analysis is the high-throughput analysis of metabolic features, their fast detection and annotation. By contrast to the screening of known, previously characterized, metabolic features in these data, the putative annotation of unknown features is often cumbersome and requires a lot of manual work, hindering the biological information retrieval of these data. High-resolution mass spectrometric data is often very rich in information content and metabolic conversions, and reactions can be derived from structural properties of features [@Breitling2006]. In addition to that, statistical associations between features (based on their intensity values) can be a valuable ressource to find co-synthesised or co-regulated metabolites, which are synthesised in the same biosynthetic pathways. Given that an analysis tool within the `R` framework is still lacking that is integrating the two features of mass spectrometric information commonly acquired with mass spectrometers (m/z and intensity values), I developed `MetNet` to close this gap. The `MetNet` package comprises functionalities to infer network topologies from high-resolution mass spectrometry data. `MetNet` combines information from both structural data (differences in m/z values of features) and statistical associations (intensity values of features per sample) to propose putative metabolic networks that can be used for further exploration. The idea of using high-resolution mass spectrometry data for network construction was first proposed in @Breitling2006 and followed soon afterwards by a Cytoscape plugin, MetaNetter [@Jourdan2007], that is based on the inference of metabolic networks on molecular weight differences and correlation (Pearson correlation and partial correlation). Inspired by the paper of @Marbach2012 different algorithms for network were implemented in `MetNet` to account for biases that are inherent in these statistical methods, followed by the calculation of a consensus adjacency matrix using the differently computed individual adjacency matrices. The two main functionalities of the package include the creation of an adjacency matrix from structual properties, based on losses/addition of functional groups defined by the user, and statistical associations. Currently, the following statistical models are implemented to infer a statistical adjacency matrix: Least absolute shrinkage and selection operator (LASSO, L1-norm regression, [@Tibshirani1994]), Random Forest [@Breiman2001], Pearson and Spearman correlation (including partial and semipartial correlation, see @Steuer2006 for a discussion on correlation-based metabolic networks), context likelihood of relatedness (CLR, [@Faith2007]), the algorithm for the reconstruction of accurate cellular networks (ARACNE, [@Margolin2006]) and constraint-based structure learning (Bayes, [@Scutari2010]). Since all of these methods have advantages and disadvantages, the user has the possibility to select several of these methods, compute adjacency matrices from these models and create a consensus matrix from the different statistical frameworks. After creating the statistical and structural adjaceny matrices these two matrices can be combined to form a consensus matrix that has both information from structural and statistical properties of the data. This can be followed by further network analyses (e.g. calculation of topological parameters), integration with other data sources (e.g. genomic information or transcriptomic data) and/or visualization. # Questions and bugs {-} `MetNet` is currently under active development. If you discover any bugs, typos or develop ideas of improving `MetNet` feel free to raise an issue via [Github](https://github.com/tnaake/MetNet) or send a mail to the developer. # Prepare the environment and load the data {#sec-prepare} To install `MetNet` enter the following to the `R` console ```{r install, eval=FALSE} install.packages("BiocManager") BiocManager::install("MetNet") ``` Before starting with the analysis, load the `MetNet` package. This will also load the required packages `glmnet`, `stabs`, `randomForest`, `rfPermute`, `mpmi`, `parmigene`, `WGCNA` and `bnlearn` that are needed for functions in the statistical adjacency matrix inference. ```{r load_MetNet,eval=TRUE} library(MetNet) ``` The data format that is compatible with the `MetNet` framework is in the `xcms`/`CAMERA` output-like $m~\times~n$ matrix, where columns denote the different samples $n$ and where $m$ features are present. In such a matrix, information about the masses of the features and quantitative information of the features (intensity or concentration values) are needed. The information about the m/z values has to be stored in a vector of length $\vert m \vert$ in the column `"mz"`. `MetNet` does not impose any requirements for data normalization, filtering, etc. However, the user has to make sure that the data is properly preprocessed. These include division by internal standard, `log2` transformation, noise filtering, removal of features that do not represent mass features/metabolites, removal of isotopes, etc. We will load here the object `x_test` that contains m/z values (in the column `"mz"`), together with the corresponding retention time (in the column `"rt"`) and intensity values. We will use here the object `x_test` for guidance through the workflow of `MetNet`. ```{r data,eval=TRUE,echo=TRUE} data("x_test", package = "MetNet") x_test <- as.matrix(x_test) ``` # Creating the structural matrix {#sec-structural} The function `structural` will create the adjacency matrix based on structual properties (m/z values) of the features. The function expects a matrix with a column `"mz"` that contains the mass information of a feature (typically the m/z value). Furthermore, `structural` takes a `data.frame` object as argument `transformations` with the `colnames` `"mass"`, `"name"` and additional columns (e.g. `"formula"`). `structural` looks for transformation (in the sense of additions/losses of functional groups mediated by biochemical, enzymatic reactions) in the data using the mass information. Following the work of [@Breitling2006] and [@Jourdan2007], molecular weight difference w~X~ is defined by $w_X = \vert w_A - w_B \vert$ where w~A~ is the molecular weight of substrate A, and w~B~ is the molecular weight of product B (typically, m/z values will be used as a proxy for the molecular weight since the molecular weight is not directly derivable from mass spectrometric data). As examplified in [@Jourdan2007] specific enzymatic reactions refer to specific changes in the molecular weight, e.g. carboxylation reactions will result in a mass difference of 43.98983 (molecular weight of CO~2~) between metabolic features. The search space for these transformation is adjustable by the `transformation` argument in `structural` allowing to look for specific enzymatic transformations in mind. Hereby, `structural` will take into account the `ppm` value, to adjust for inaccuracies in m/z values due to technical reasons according to the formula $$ppm = \frac{m_{exp} - m_{calc}}{m_{exp}} \cdot 10^{-6}$$ with m~exp~ the experimentally determined m/z value and m~calc~ the calculated accurate mass of a molecule. Within the function, a lower and upper range is calculated depending on the supplied `ppm` value, differences between the m/z feature values are calculated and matched against the `"mass"`es of the `transformations` argument. If any of the additions/losses defined in `transformations` is found in the data, it will be reported as an (unweighted) connection in the returned adjacency matrix. Together with the adjacency matrix the type of connection (derived from the column `"name"` in the `transformations`) will be written to a character matrix. These two matrices will be returned as a list (first entry: numerical adjacency matrix, second entry: character matrix) by the function `structural`. Before calculating the structural matrix, one must define the search space, i.e. these transformation that will be looked for in the mass spectrometric data by creating the `transformations` object. ```{r transformation_example,echo=TRUE,eval=TRUE} ## define the search space for biochemical transformation transformations <- rbind( c("Hydroxylation (-H)", "O", 15.9949146221, "-"), c("Malonyl group (-H2O)", "C3H2O3", 86.0003939305, "+"), c("D-ribose (-H2O) (ribosylation)", "C5H8O4", 132.0422587452, "-"), c("C6H10O6", "C6H10O6", 178.0477380536, "-"), c("Rhamnose (-H20)", "C6H10O4", 146.057910, "-"), c("Monosaccharide (-H2O)", "C6H10O5", 162.0528234315, "-"), c("Disaccharide (-H2O) #1", "C12H20O10", 324.105649, "-"), c("Disaccharide (-H2O) #2", "C12H20O11", 340.1005614851, "-"), c("Trisaccharide (-H2O)", "C18H30O15", 486.1584702945, "-"), c("Glucuronic acid (-H2O)", "C6H8O6", 176.0320879894, "?"), c("coumaroyl (-H2O)", "C9H6O2", 146.0367794368, "?"), c("feruloyl (-H2O)", "C9H6O2OCH2", 176.0473441231, "?"), c("sinapoyl (-H2O)", "C9H6O2OCH2OCH2", 206.0579088094, "?"), c("putrescine to spermidine (+C3H7N)", "C3H7N", 57.0578492299, "?")) ## convert to data frame transformations <- data.frame( group = transformations[, 1], formula = transformations[, 2], mass = as.numeric(transformations[, 3]), rt = transformations[, 4]) ``` The function `structural` will then check for those m/z differences that are stored in the column `"mass"` in the object `transformations`. To create the adjacency matrix derived from these structural information we enter ```{r structure,eval=TRUE,echo=TRUE} struct_adj <- structural(x = x_test, transformation = transformations, ppm = 10) ``` in the `R` console. ## Refining the structural adjacency matrix (optional) {-} Depending on the chemical group added the retention time will differ depending on the chemical group added, e.g. an addition of a glycosyl group will usually result in a lower retentiom time in reverse-phase chromatography- This information can be used in refining the adjacency matrix derived from the structural matrix. The `rtCorrection` does this checking, if predicted transformation correspond to the expected retention time shift, in an automated fashion. It requires information about the expected retention time shift in the `data.frame` passed to the `transformation` argument (in the `"rt"` column). Within this columns, information about retention time shifts is encoded by `"-"`, `"+"` and `"?"`, which means the feature with higher m/z value has lower, higher or unknown retention time than the feature with the lower m/z value. The values for m/z and retention time will be taken from the object passed to the `x` argument. In case there is a discrepancy between the transformation and the retention time shift the adjacency matrix at the specific position will be set to 0. `rtCorrection` will return the updated adjacency matrix and the updated character matrix with the descriptions of the transformation. To account for retention time shifts we enter ```{r rt_correction,eval=TRUE,echo=TRUE} struct_adj <- rtCorrection(structural = struct_adj, x = x_test, transformation = transformations) ``` in the `R` console. # Creating the statistical matrix {#sec-statistical} ## Creating weighted adjacency matrices using `statistical` {#subsec-statistical} The function `statistical` will create the adjacency matrix based on statistical associations. The function will create a list of weighted adjacency matrices using the statistical models defined by the `model` argument. Currently, the models LASSO (using `stabs`, [@Hofner2015;@Thomas2017]), Random Forest (using `GENIE3`, CLR, ARACNE (the two latter using the package `mpmi` to calculate Mutual Information using a nonparametric bias correction by Bias Corrected Mutual Information, and the functions `clr` and `aracne.a` from the `parmigene` package), Pearson and Spearman correlation (based on the `stats` package), partial and semipartial Pearson and Spearman correlation (using the `ppcor` package) and score-based structure learning returning the strength of the probabilistic relationships of the arcs of a Bayesian network, as learned from bootstrapped data (using the `boot.strength` with the Tabu greedy search as default from the `bnlearn` package [@Scutari2010]). For further information on the different models take a look on the respective help pages of `lasso`, `randomForest`, `clr`, `aracne`, `correlation` and/or `bayes`. Arguments that are accepted by the respective underlying functions can be passed directly to the `statistical` function. In addition, arguments that are defined in the functions `lasso`, `randomForest`, `clr`, `aracne`, `correlation` and/or `bayes` can be passed to the functions. ## Creating an unweighted adjacency matrix using `threshold` {#subsec-threshold} From the list of adjacency matrices the function `threshold` will create a unweighted adjacency matrix from the weighted adjacency matrices unifying the information present from all statistical models. The reasoning behind this step is to circumvent disadvantages arising from each model and creating a statistically reliable topology that reflects the actual metabolic relations. `threshold` return an unweighted adjancency matrix with connections inferred from the respective models. There are four different types implemented how the unweighted adjacency matrix can be created: `threshold`, `top1`, `top2`, `mean`. For `type = "threshold"`, threshold values have to be defined for the `args` argument for each respective statistical model, above or below which the each in each weighted adjacency matrix will be reported as a unweighted each. The unweighted adjacency matrices will be passed to the `consensus` function from the `sna` [@Butts2016] to calculate the unweighted consensus adjacency matrix. The arguments that are accepted by this function can be passed to the `threshold` function. Furthermore, in `args` an entry `threshold` needs to be defined to threshold if the value a~i,j~ of the consensus adjacency matrix will be reported as a connection in the returned matrix (if a~i,j~ \( \geq \) `threshold`) or not. In the case of the method `"central.graph"` (default), the argument `threshold` should be set to 1. For the other three types (top1, top2, mean) the ranks per statistical model will be calculated and from each respective link the top1, top 2 or mean rank across statistical models will be calculated (cf. [@Hase2013]). The top n unqique ranks (defined by the entry n in `args`) will be returned as links in the unweighted consensus adjacency matrix. In the following example, we will create a list of unweighted adjacency matrices using Pearson and Spearman correlation using the intensity values as input data. ```{r statistical,eval=TRUE,echo=TRUE} x_int <- x_test[, 3:dim(x_test)[2]] x_int <- as.matrix(x_int) stat_adj_l <- statistical(x_int, model = c("pearson", "spearman")) ``` `threshold` implements four types to obtain an unweighted adjacency matrix. We will create for all types the unweighted consensus adjacency matrices. ```{r threshold,eval=TRUE,echo=TRUE} ## type = "threshold" args_thr <- list("pearson" = 0.95, "spearman" = 0.95, threshold = 1) stat_adj_thr <- threshold(statistical = stat_adj_l, type = "threshold", args = args_thr) ## type = "top1" args_top <- list(n = 40) stat_adj_top1 <- threshold(statistical = stat_adj_l, type = "top2", args = args_top) ## type = "top2" stat_adj_top2 <- threshold(statistical = stat_adj_l, type = "top2", args = args_top) ## type = "mean" stat_adj_mean <- threshold(statistical = stat_adj_l, type = "mean", args = args_top) ``` # Combining the structural and statistical matrix {#sec-combine} After creating the unweighted structural and unweighted statistical adjacency matrices, it is time to combine these two matrices. The function `combine` will combine the matrices to the consensus matrix. The function accepts the arguments `structure` and `statistical` for the list returned by `structural` and the matrix returned by `threshold`, respectively, and the argument `threshold`, that is a numerical value (default = 1). After adding the matrices, the entries will be checked if they are greater or equal than `threshold` and 1 or 0 will be returned, respectively. The argument `threshold` needs to be adjusted by the user if another `method` than `"central.graph"` in `threshold` (type = "threshold") is used. We will use here the unweighted statistical adjacency matrix from `type = "mean"`: ```{r combine,eval=TRUE,echo=TRUE} cons_adj <- combine(structural = struct_adj, statistical = stat_adj_mean) ``` # Visualization and further analyses {#sec-visualization} To display the created consensus adjacency matrix, existing visualization tools available in the `R` framework can be employed or any other visualization tool after exporting the consensus matrix as a text file. In this example We will use the `igraph` [@Csardi2006] package to visualize the adjacency matrix. `combine` returns a list of two adjacency matrices, where the first entry contains the unweighted adjacency matrix and the second entry contains a matrix given information on the type of link between features based on structural information. Only the first entry of the list will be passed to the `graph_from_adjacency_matrix` function: ```{r visualisation,eval=TRUE,echo=TRUE,fig.cap='** _Ab initio_ network inferred from structural and quantitative mass spectrometry data.** Vertices are connected that are separated by given metabolic transformation and statistical association.'} g <- igraph::graph_from_adjacency_matrix(cons_adj[[1]], mode = "undirected") plot(g, edge.width = 5, vertex.label.cex = 0.5, edge.color = "grey") ``` Furthermore, the network can be analysed by network analysis techniques (topological parameters such as centrality, degree, clustering indices) that are implemented in different packages in `R` (e.g. `igraph` or `sna`) or other software tools outside of the `R` environment. # Appendix {-} ## Session information {-} All software and respective versions to build this vignette are listed here: ```{r session,eval=TRUE,echo=FALSE} sessionInfo() ``` ## Transformations {-} The list of transformations is taken from @Breitling2006. The numerical m/z values were calculated by using the structural formula and the Biological Magnetic Resonance Data Bank [web tool](http://www.bmrb.wisc.edu/metabolomics/mol_mass.php). ```{r ttransformations,eval=TRUE,echo=TRUE} transformations <- rbind( c("Alanine", "C3H5NO", "71.0371137878"), c("Arginine", "C6H12N4O", "156.1011110281"), c("Asparagine", "C4H6N2O2", "114.0429274472"), c("Guanosine 5-diphosphate (-H2O)", "C10H13N5O10P2", "425.0137646843"), c("Guanosine 5-monophosphate (-H2O)", "C10H12N5O7P", "345.0474342759"), c("Guanine (-H)", "C5H4N5O", "150.0415847765"), c("Aspartic acid", "C4H5NO3", "115.0269430320"), c("Guanosine (-H2O)", "C10H11N5O4", "265.0811038675"), c("Cysteine", "C3H5NOS", "103.0091844778"), c("Deoxythymidine 5'-diphosphate (-H2O)", "C10H14N2O10P2", "384.01236770"), c("Cystine", "C6H10N2O3S2", "222.0132835777"), c("Thymidine (-H2O)", "C10H12N2O4", "224.0797068840"), c("Glutamic acid", "C5H7NO3", "129.0425930962"), c("Thymine (-H)", "C5H5N2O2", "125.0351024151"), c("Glutamine", "C5H8N2O2", "128.0585775114"), c("Thymidine 5'-monophosphate (-H2O)", "C10H13N2O7P", "304.0460372924"), c("Glycine", "C2H3NO", "57.0214637236"), c("Uridine 5'-diphosphate (-H2O)", "C9H12N2O11P2", "385.9916322587"), c("Histidine", "C6H7N3O", "137.0589118624"), c("Uridine 5'-monophosphate (-H2O)", "C9H11N2O8P", "306.0253018503"), c("Isoleucine", "C6H11NO", "113.0840639804"), c("Uracil (-H)", "C4H3N2O2", "111.0194523509"), c("Leucine", "C6H11NO", "113.0840639804"), c("Uridine (-H2O)", "C9H10N2O5", "226.0589714419"), c("Lysine", "C6H12N2O", "128.0949630177"), c("Acetylation (-H)", "C2H3O2", "59.0133043405"), c("Methionine", "C5H9NOS", "131.0404846062"), c("Acetylation (-H2O)", "C2H2O", "42.0105646863"), c("Phenylalanine", "C9H9NO", "147.0684139162"), c("C2H2", "C2H2", "26.0156500642"), c("Proline", "C5H7NO", "97.0527638520"), c("Carboxylation", "CO2", "43.9898292442"), c("Serine", "C3H5NO2", "87.0320284099"), c("CHO2", "CHO2", "44.9976542763"), c("Threonine", "C4H7NO2", "101.0476784741"), c("Condensation/dehydration", "H2O", "18.0105646863"), c("Tryptophan", "C11H10N2O", "186.0793129535"), c("Diphosphate", "H3O6P2", "160.9404858489"), c("Tyrosine", "C9H9NO2", "163.0633285383"), c("Ethyl addition (-H2O)", "C2H4", "28.0313001284"), c("Valine", "C5H9NO", "99.0684139162"), c("Formic Acid (-H2O)", "CO", "27.9949146221"), c("Acetotacetate (-H2O)", "C4H4O2", "84.0211293726"), c("Glyoxylate (-H2O)", "C2O2", "55.9898292442"), c("Acetone (-H)", "C3H5O", "57.0340397826"), c("Hydrogenation/dehydrogenation", "H2", "2.0156500642"), c("Adenylate (-H2O)", "C10H12N5O6P", "329.0525196538"), c("Hydroxylation (-H)", "O", "15.9949146221"), c("Biotinyl (-H)", "C10H15N2O3S", "243.0803380482"), c("Inorganic phosphate", "P", "30.9737615100"), c("Biotinyl (-H2O)", "C10H14N2O2S", "226.0775983940"), c("Ketol group (-H2O)", "C2H2O", "42.0105646863"), c("Carbamoyl P transfer (-H2PO4)", "CH2ON", "44.0136386915"), c("Methanol (-H2O)", "CH2", "14.0156500642"), c("Co-enzyme A (-H)", "C21H34N7O16P3S", "765.0995583014"), c("Phosphate", "HPO3", "79.9663304084"), c("Co-enzyme A (-H2O)", "C21H33N7O15P3S", "748.0968186472"), c("Primary amine", "NH2", "16.0187240694"), c("Glutathione (-H2O)", "C10H15N3O5S", "289.0732412976"), c("Pyrophosphate", "PP", "61.9475230200"), c("Isoprene addition (-H)", "C5H7", "67.0547752247"), c("Secondary amine", "NH", "15.0108990373"), c("Malonyl group (-H2O)", "C3H2O3", "86.0003939305"), c("Sulfate (-H2O)", "SO3", "79.9568145563"), c("Palmitoylation (-H2O)", "C16H30O", "238.2296655851"), c("Tertiary amine", "N", "14.0030740052"), c("Pyridoxal phosphate (-H2O)", "C8H8NO5P", "229.0140088825"), c("C6H10O5", "C6H10O5", "162.0528234315"), c("Urea addition (-H)", "CH3N2O", "59.0245377288"), c("C6H10O6", "C6H10O6", "178.0477380536"), c("Adenine (-H)", "C5H4N5", "134.0466701544"), c("D-ribose (-H2O) (ribosylation)", "C5H8O4", "132.0422587452"), c("Adenosine (-H2O)", "C10H11N5O3", "249.0861892454"), c("Disaccharide (-H2O) #1", "C12H20O10", "324.105649"), c("Disaccharide (-H2O) #2", "C12H20O11", "340.1005614851"), c("Adenosine 5'-diphosphate (-H2O)", "C10H13N5O9P2", "409.0188500622"), c("Glucose-N-phosphate (-H2O)", "C6H11O8P", "242.0191538399"), c("Adenosine 5'-monophosphate (-H2O)", "C10H12N5O6P", "329.0525196538"), c("Glucuronic acid (-H2O)", "C6H8O6", "176.0320879894"), c("Cytidine 5'-diphosphate (-H2O)", "C9H13N3O10P2", "385.0076166739"), c("Monosaccharide (-H2O)", "C6H10O5", "162.0528234315"), c("Cytidine 5'-monophsophate (-H2O)", "C9H12N3O7P", "305.0412862655"), c("Trisaccharide (-H2O)", "C18H30O15", "486.1584702945"), c("Cytosine (-H)", "C4H4N3O", "110.0354367661")) transformations <- data.frame(name = transformations[, 1], formula = transformations[, 2], mass = as.numeric(transformations[, 3])) ``` ## References