---
title: "Annotation of mixtures of standards"
author: "Gavin Rhys Lloyd"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
    number_sections: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Annotation of mixtures of standards}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.align = "center"
)

# helper to display a data.frame as a compact, scrollable table
.DT <- function(x) {
    dt_options <- list(
        scrollX = TRUE,
        pageLength = 6,
        dom = "t",
        initComplete = DT::JS(
            "function(settings, json) {",
            "$(this.api().table().header()).css({'font-size':'10pt'});",
            "}"
        )
    )
    x %>%
        DT::datatable(options = dt_options, rownames = FALSE) %>%
        DT::formatStyle(
            columns = colnames(x),
            fontSize = "10pt"
        )
}

library(BiocStyle)
```

# Getting Started

The latest versions of `r Biocpkg("struct")` and `MetMashR` that are compatible with your current R version can be installed using BiocManager.

```{r,eval = FALSE, include = TRUE}
# install BiocManager if not present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# install MetMashR and dependencies
BiocManager::install("MetMashR")
```

Once installed you can activate the packages in the usual way:

```{r, eval=TRUE, include=FALSE}
suppressPackageStartupMessages({
    # load the packages
    library(MetMashR)
    library(ggplot2)
    library(structToolbox)
    library(dplyr)
    library(DT)
})
```

```{r, eval=FALSE, include=TRUE}
# load the packages
library(MetMashR)
library(ggplot2)
library(dplyr)
```

# Introduction

Mixtures of standards can be used to build annotation libraries. In LCMS these libraries can be collected using the same chromatography and instrument as the samples. Such a library allows high-confidence (MSI level 1) detection/annotation of MTox metabolites, in comparison to external databases/sources that rely on m/z only or in-silico predictions.

In this vignette we will explore the annotations generated using Compound Discoverer (CD) and LipidSearch (LS) for several mixtures of standards. As the content of the standard mixtures is known, we can assess the ability of CD and LS to annotate these metabolites.

# Input data

The data collected corresponds to mixtures of high-purity standards measured using LCMS, with the intention of building an internally measured library including m/z and retention times for each standard.

Analysis of the standard mixtures resulted in four data tables: HILIC_NEG, HILIC_POS, LIPIDS_POS and LIPIDS_NEG. We will loosely refer to these as "assays".

All four assays were used as input to both Compound Discoverer (CD) and LipidSearch (LS), which are software tools commonly used to annotate LCMS datasets.

# Importing the annotations

`MetMashR` includes `cd_source` and `ls_source` objects. These objects read the output files from CD and LS and parse them into an `annotation_table` object.

Importing the annotation tables isn't always enough; sometimes they need to be cleaned or processed further. We therefore define two `MetMashR` workflows, one for CD and one for LS, in which we import the tables and then apply source-specific cleaning.

## Importing Compound Discoverer annotations

The CD file format expected by `cd_source` is Excel format; see [TODO] for details on how to generate this format from CD.

For the CD `MetMashR` workflow we would like to:

1. Import CD annotations and convert to `annotation_table` format
2. Filter to only include "Full match" annotations
3. Filter to only include annotations with a library ppm difference between -2 and 2
4. Resolve duplicates

When we import the annotations, we add a column indicating which assay the annotations are associated with, and include a `tag` for each row indicating both the source and the assay. This will be useful later when processing the combined tables.

The resolution of duplicates is needed because CD might assign the same metabolite + adduct to multiple peaks. Here we choose the match with the highest mzcloud score using the `select_max` helper function with the `combine_records` object. This object is a wrapper for [dplyr::reframe()].

```{r}
# prepare workflow
M <- import_source() +
    add_labels(
        labels = c(
            assay = "placeholder", # will be replaced later
            source_name = "CD"
        )
    ) +
    filter_labels(
        column_name = "compound_match",
        labels = "Full match",
        mode = "include"
    ) +
    filter_range(
        column_name = "library_ppm_diff",
        upper_limit = 2,
        lower_limit = -2,
        equal_to = FALSE
    ) +
    combine_records(
        group_by = c("compound", "ion"),
        default_fcn = select_max(
            max_col = "mzcloud_score",
            keep_NA = FALSE,
            use_abs = TRUE
        )
    )

# place to store results
CD <- list()
for (assay in c("HILIC_NEG", "HILIC_POS", "LIPIDS_NEG", "LIPIDS_POS")) {
    # prepare source
    AT <- cd_source(
        source = c(
            system.file(
                paste0("extdata/MTox/CD/", assay, ".xlsx"),
                package = "MetMashR"
            ),
            system.file(
                paste0("extdata/MTox/CD/", assay, "_comp.xlsx"),
                package = "MetMashR"
            )
        ),
        tag = paste0("CD_", assay)
    )

    # update labels in workflow
    M[2]$labels$assay <- assay

    # apply workflow to source
    CD[[assay]] <- model_apply(M, AT)
}
```

The `CD` variable is now a list containing the workflow for each
assay. The default output of each workflow is a processed `lcms_table`, which is an extension of `annotation_table` that requires both an m/z and a retention time column to be defined.

A summary of the table for e.g. the HILIC_NEG assay can be displayed on the console:

```{r}
predicted(CD$HILIC_NEG)
```

The HILIC_NEG table is shown below.

```{r,echo=FALSE}
.DT(predicted(CD$HILIC_NEG)$data)
```

`MetMashR` workflows store the output after each step. We can use this to explore the impact of different workflow steps. For example, we can display the different compound matches present before and after filtering as pie charts.

First, we create the pie chart object and specify some parameters.

```{r}
C <- annotation_pie_chart(
    factor_name = "compound_match",
    label_rotation = FALSE,
    label_location = "outside",
    legend = TRUE,
    label_type = "percent",
    centre_radius = 0.5,
    centre_label = ".total"
)
```

Now we create the plots using `chart_plot`, add some additional settings using `ggplot2`, and arrange the plots using `cowplot`. Note that we use square brackets to index the step of the workflow we want to access e.g. `CD$HILIC_NEG[3]`.

```{r}
# plot individual charts
g1 <- chart_plot(C, predicted(CD$HILIC_NEG[3])) +
    ggtitle("Compound matches\nafter filtering") +
    theme(plot.title = element_text(hjust = 0.5))
g2 <- chart_plot(C, predicted(CD$HILIC_NEG[2])) +
    ggtitle("Compound matches\nbefore filtering") +
    theme(plot.title = element_text(hjust = 0.5))

# get legend
leg <- cowplot::get_legend(g2)

# layout
cowplot::plot_grid(
    g2 + theme(legend.position = "none"),
    g1 + theme(legend.position = "none"),
    leg,
    nrow = 1,
    rel_widths = c(1, 1, 0.5)
)
```

It is clear from these plots that the filter has removed all annotations without a "full match".
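
The duplicate-resolution step in the CD workflow can be pictured with a small base R sketch of "keep the highest-scoring record per compound and ion". It does not call `combine_records` or `select_max`; the toy table, its values and the helper `keep_max_per_group()` are illustrative assumptions, not MetMashR code or real CD output.

```r
# toy annotation table (hypothetical values, not real CD output)
toy <- data.frame(
    compound = c("A", "A", "B", "B", "B"),
    ion = c("[M+H]+", "[M+H]+", "[M-H]-", "[M-H]-", "[M+H]+"),
    mzcloud_score = c(80, 95, NA, 60, 70),
    stringsAsFactors = FALSE
)

# hypothetical helper: keep the row with the largest score in each group
keep_max_per_group <- function(df, group_cols, score_col) {
    # drop rows with a missing score (similar in spirit to keep_NA = FALSE)
    df <- df[!is.na(df[[score_col]]), , drop = FALSE]
    # build one grouping key per row
    key <- do.call(paste, c(df[group_cols], sep = "\r"))
    # within each group, find the row position of the maximum score
    idx <- tapply(seq_len(nrow(df)), key, function(i) {
        i[which.max(df[[score_col]][i])]
    })
    out <- df[sort(as.integer(idx)), , drop = FALSE]
    rownames(out) <- NULL
    out
}

resolved <- keep_max_per_group(toy, c("compound", "ion"), "mzcloud_score")
resolved
```

Each compound and ion pair appears once in `resolved`. The record with the missing score is only dropped because another record exists for the same group; how the real helper treats a group made up entirely of missing scores is not covered by this sketch.
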

We can assess the quality of the annotations by examining the histogram of ppm errors between the MS2 peaks and the library. If there is a wide distribution then this may indicate false positives. If the distribution is offset from zero then this indicates some m/z drift is present.

```{r}
C <- annotation_histogram(
    factor_name = "library_ppm_diff", vline = c(-2, 2)
)
G <- list()
# use step 3 for both assays: after the "Full match" filter and
# before the range filter
G$HILIC_NEG <- chart_plot(C, predicted(CD$HILIC_NEG[3]))
G$HILIC_POS <- chart_plot(C, predicted(CD$HILIC_POS[3]))
cowplot::plot_grid(plotlist = G, labels = c("HILIC_NEG", "HILIC_POS"))
```

Note that we plotted the distribution based on the step before filtering by range, and set the vertical red lines equal to the limits of the range filter, so that we can see which parts of the histogram are affected by it.

## Importing LipidSearch annotations

The file format expected by `ls_source` is a `.csv` file. See [TODO] for details on how to generate this format from LS.

For the `MetMashR` workflow we would like to:

1. Import the LS annotations and convert to `annotation_table` format
2. Filter the annotations to only include grades A and B
3. Resolve duplicates

When we import the annotations, we add a column indicating which assay and source the annotations are associated with. We also include a `tag` for the table. This will be useful later when processing the combined tables.

For LS we resolve duplicates by selecting the annotation with the smallest ppm error. We use the `combine_records` object with the `select_min` helper function to do this.
```{r}
# prepare workflow
M <- import_source() +
    add_labels(
        labels = c(
            assay = "placeholder", # will be replaced later
            source_name = "LS"
        )
    ) +
    filter_labels(
        column_name = "Grade",
        labels = c("A", "B"),
        mode = "include"
    ) +
    combine_records(
        group_by = c("LipidIon"),
        default_fcn = select_min(
            min_col = "library_ppm_diff",
            keep_NA = FALSE,
            use_abs = TRUE
        )
    )

# place to store results
LS <- list()
for (assay in c("HILIC_NEG", "HILIC_POS", "LIPIDS_NEG", "LIPIDS_POS")) {
    # prepare source
    AT <- ls_source(
        source = system.file(
            paste0("extdata/MTox/LS/MTox_2023_", assay, ".txt"),
            package = "MetMashR"
        ),
        tag = paste0("LS_", assay)
    )

    # update labels in workflow
    M[2]$labels$assay <- assay

    # apply workflow to source
    LS[[assay]] <- model_apply(M, AT)
}
```

The `LS` variable is now a list containing the workflow for each assay. A summary of the table for e.g. the LIPIDS_NEG assay can be displayed on the console:

```{r}
predicted(LS$LIPIDS_NEG)
```

The LIPIDS_NEG table is shown below.

```{r, echo=FALSE}
.DT(predicted(LS$LIPIDS_NEG)$data)
```

`MetMashR` workflows store the output after each step. We can use this to explore the impact of different workflow steps. For example, we can display the different grades present before and after filtering as pie charts.

```{r}
C <- annotation_pie_chart(
    factor_name = "Grade",
    label_rotation = FALSE,
    label_location = "outside",
    label_type = "percent",
    legend = FALSE,
    centre_radius = 0.5,
    centre_label = ".total"
)

g1 <- chart_plot(C, predicted(LS$LIPIDS_NEG)) +
    ggtitle("Grades after filtering") +
    theme(plot.margin = unit(c(1, 1.5, 1, 1.5), "cm"))
g2 <- chart_plot(C, predicted(LS$LIPIDS_NEG[1])) +
    ggtitle("Grades before filtering") +
    theme(plot.margin = unit(c(1, 1.5, 1, 1.5), "cm"))
cowplot::plot_grid(g2, g1, nrow = 1, align = "v")
```
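
Both workflows lean on the ppm difference between an observed and a library m/z: the CD workflow filters on it and the LS workflow uses it to resolve duplicates. The sketch below assumes the usual definition of ppm error; how CD and LS compute `library_ppm_diff` internally is not stated here, and the m/z values and the `ppm_error()` helper are hypothetical.

```r
# ppm error between an observed and a reference m/z (usual definition)
ppm_error <- function(mz_observed, mz_reference) {
    (mz_observed - mz_reference) / mz_reference * 1e6
}

# hypothetical m/z values for three candidate matches to one reference
mz_reference <- 500
mz_observed <- c(500.0015, 499.9995, 500.0004)
ppm <- ppm_error(mz_observed, mz_reference)

# range filter: keep errors strictly inside (-2, 2) ppm
inside <- ppm > -2 & ppm < 2

# among the retained candidates, pick the smallest absolute error
candidates <- which(inside)
best <- candidates[which.min(abs(ppm[candidates]))]
```

Taking the absolute value before selecting the minimum matters: without it the candidate with the most negative error would be chosen instead of the one closest to zero.
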

# Exploratory analysis of annotation sources

The annotations imported from each source are interesting to explore graphically "within source". We will draw a comparison "between sources" later.

In this example we generate Venn diagrams by providing several `annotation_table` inputs to the `chart_plot` function for an `annotation_venn_chart`. This allows us to compare columns of e.g. compound names present in each table.

Here, we compare the compound names for each assay within each source, to see if there is any overlap, i.e. the same metabolite being detected in several assays.

```{r}
# prepare venn chart object
C <- annotation_venn_chart(
    factor_name = "compound",
    line_colour = "white",
    fill_colour = ".group",
    legend = TRUE,
    labels = FALSE
)

## plot
# get all CD tables
cd <- lapply(CD, predicted)
g1 <- chart_plot(C, cd) + ggtitle("Compounds in CD per assay")

# get all LS tables
C$factor_name <- "LipidName"
ls <- lapply(LS, predicted)
g2 <- chart_plot(C, ls) + ggtitle("Compounds in LS per assay")

# prepare upset object
C2 <- annotation_upset_chart(
    factor_name = "compound",
    n_intersections = 10
)
g3 <- chart_plot(C2, cd)
C2$factor_name <- "LipidName"
g4 <- chart_plot(C2, ls)

# layout
cowplot::plot_grid(g1, g2, g3, g4, nrow = 2)
```

The diagram for CD shows that the largest overlap is between assays with the same ion mode.

For LS the number of annotations is quite small, so the diagram is less informative. However, we are clearly detecting a larger number of annotations in the LIPIDS assays, which is to be expected as the LS software is designed to annotate lipid molecules.

Here the `annotation_venn_chart` and `annotation_upset_chart` objects were used to compare the same column across several `annotation_tables`. The same objects can also compare groups from within the same table, which we will explore later. First, we need to combine the CD and LS tables.
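
The counts shown in a Venn region or an UpSet bar come down to set membership. As a rough base R sketch, using hypothetical compound names rather than the vignette data and not the chart objects themselves, each name can be labelled by the combination of assays in which it appears:

```r
# hypothetical compound names per assay (not the vignette data)
assays <- list(
    HILIC_NEG = c("a", "b", "c", "f"),
    HILIC_POS = c("b", "c", "d"),
    LIPIDS_NEG = c("c", "e")
)

# all unique names across the assays
all_names <- sort(unique(unlist(assays)))

# logical membership matrix: one row per name, one column per assay
membership <- sapply(assays, function(x) all_names %in% x)
rownames(membership) <- all_names

# label each name by the combination of assays it appears in
region <- apply(membership, 1, function(m) {
    paste(colnames(membership)[m], collapse = "/")
})

# size of each region
region_sizes <- table(region)
region_sizes
```

Every name falls into exactly one region, so the region sizes sum to the number of unique names.
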

# Combining Annotation Sources

In this workflow step we combine the imported tables from each assay and annotation source vertically into a single annotation table. The `combine_sources` object can be used for this step.

Combining tables from the same source (e.g. the CD table for each assay) is straightforward as all tables have the same columns.

When combining different sources, the `combine_sources` object provides input parameters that allow you to combine and select columns from different sources into new columns with the same information. For example, the adduct column in CD is called "ion" and in LS it is called "LipidIon"; we can combine these columns into a new column called "adduct".

Here we specify that all columns should be retained from both tables, padding with NA if not present; columns with the same name are automatically merged.

```{r}
# get all the cleaned annotation tables in one list
all_source_tables <- lapply(c(CD, LS), predicted)

# prepare to merge
combine_workflow <- combine_sources(
    source_list = all_source_tables,
    matching_columns = c(
        name = "LipidName",
        name = "compound",
        adduct = "ion",
        adduct = "LipidIon"
    ),
    keep_cols = ".all",
    source_col = "annotation_table",
    exclude_cols = NULL,
    tag = "combined"
)

# merge
combine_workflow <- model_apply(combine_workflow, lcms_table())

# show
predicted(combine_workflow)
```

Now that the tables have been combined we can explore the combined table using charts. For example, we can visualise the number of annotations for each assay from both sources.

```{r}
C <- annotation_pie_chart(
    factor_name = "assay",
    label_rotation = FALSE,
    label_location = "outside",
    label_type = "percent",
    legend = TRUE,
    centre_radius = 0.5,
    centre_label = ".total"
)

chart_plot(C, predicted(combine_workflow)) +
    ggtitle("Annotations per assay") +
    theme(plot.margin = unit(c(1, 1.5, 1, 1.5), "cm")) +
    guides(fill = guide_legend(title = "Assay"))
```

In this next example we compare the number of annotations from each source.
```{r}
# change to plot source_name column
C$factor_name <- "source_name"
chart_plot(C, predicted(combine_workflow)) +
    ggtitle("Annotations per source") +
    theme(plot.margin = unit(c(1, 1.5, 1, 1.5), "cm")) +
    guides(fill = guide_legend(title = "Source"))
```

# Adding identifiers

Ultimately we would like to compare the annotations detected using our software sources to the list of standards we included in the samples.

Comparing the two tables using metabolite names is less than ideal, because different sources might use different synonyms for the same molecular structure. To overcome this it is much better to compare the tables using molecular identifiers such as InChIKey, which are unique to the molecule.

`MetMashR` includes a number of workflow steps that allow us to look up identifiers from various databases, either stored locally or in online databases such as PubChem by using their REST API.

For this vignette we use cached results so that we don't overburden the API; in practice you would create your own (see [TODO]).

```{r}
# import cached results
inchikey_cache <- rds_database(
    source = file.path(
        system.file("cached", package = "MetMashR"),
        "pubchem_inchikey_mtox_cache.rds"
    )
)

id_workflow <- pubchem_property_lookup(
    query_column = "name",
    search_by = "name",
    suffix = "",
    property = "InChIKey",
    records = "best",
    cache = inchikey_cache
)

id_workflow <- model_apply(id_workflow, predicted(combine_workflow))
```

# Improving ID coverage

The IDs obtained in the previous section were obtained by queries based on the molecule name. Molecule names can contain a number of special characters, and can follow different nomenclatures, abbreviations and naming conventions.

It is therefore useful to apply some kind of "molecule name normalisation" to account for these properties of molecule names.

We can use the `normalise_strings` MetMashR object to do this. This object has a `dictionary` parameter that takes the form of a list of lists. Each sub-list contains a pattern to be matched and a replacement.

In the example workflow below we include a number of definitions in our dictionary:

- Update compounds starting with "NP" to start "Compound NP", as this is how they are recorded in PubChem.
- Replace any molecule name containing a ? with NA, as this indicates ambiguity in the annotation.
- Remove abbreviations from molecule names e.g. "adenosine triphosphate (ATP)" should have the "(ATP)" part removed.
- Replace some shorthand names with more formal names that are more likely to result in a match to a PubChem compound.
- Remove optical properties from racemic compounds e.g. D-(+)-Glucose becomes D-Glucose.
- Replace Greek characters with their Romanised names.

Both the Greek and racemic dictionaries are provided by MetMashR for convenience.

In steps 2 and 3 of the workflow we submit these normalised names to the PubChem API and to OPSIN. By utilising several APIs we can maximise the number of molecules we obtain an InChIKey for.

In the final step we merge the three columns of identifiers, giving priority to OPSIN, which is based on deconstructing the molecule name into its component parts. If there is no result from OPSIN then a PubChem search based on the normalised names is prioritised over a PubChem search using the non-normalised names.
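
The dictionary idea described above can be sketched in a few lines of base R. This is only an illustration of applying pattern and replacement pairs in turn, not the implementation of `normalise_strings`; the helper `apply_dictionary()` and the example names are hypothetical.

```r
# small dictionary: each entry has a pattern and a replacement
dictionary <- list(
    list(pattern = "^NP-", replace = "Compound NP-"),
    list(pattern = "?", replace = NA, fixed = TRUE),
    list(pattern = " \\([^\\)]*\\)$", replace = ""),
    list(pattern = "-(+)-", replace = "-", fixed = TRUE)
)

# hypothetical helper: apply each dictionary entry in turn
apply_dictionary <- function(x, dictionary) {
    for (d in dictionary) {
        fixed <- isTRUE(d$fixed)
        if (is.na(d$replace)) {
            # a match turns the whole name into NA
            hit <- !is.na(x) & grepl(d$pattern, x, fixed = fixed)
            x[hit] <- NA_character_
        } else {
            x <- gsub(d$pattern, d$replace, x, fixed = fixed)
        }
    }
    x
}

# hypothetical molecule names
raw <- c(
    "NP-000123",
    "adenosine triphosphate (ATP)",
    "PC(16:0/?)",
    "D-(+)-Glucose"
)
normalised <- apply_dictionary(raw, dictionary)
normalised
```

Entries are applied in order, so a name can be changed by more than one entry. Patterns that contain regular expression metacharacters, such as `?` or `(+)`, are matched literally by setting `fixed = TRUE`.
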
```{r}
# prepare cached results for vignette
inchikey_cache2 <- rds_database(
    source = file.path(
        system.file("cached", package = "MetMashR"),
        "pubchem_inchikey_mtox_cache2.rds"
    )
)
inchikey_cache3 <- rds_database(
    source = file.path(
        system.file("cached", package = "MetMashR"),
        "pubchem_inchikey_mtox_cache3.rds"
    )
)

N <- normalise_strings(
    search_column = "name",
    output_column = "normalised_name",
    dictionary = c(
        # custom dictionary
        list(
            # replace "NP" with "Compound NP"
            list(pattern = "^NP-", replace = "Compound NP-"),
            # replace ? with NA, since this is ambiguous
            list(pattern = "?", replace = NA, fixed = TRUE),
            # remove terms in trailing brackets e.g." (ATP)"
            list(pattern = "\\ \\([^\\)]*\\)$", replace = ""),
            # replace known abbreviations
            list(
                pattern = "(+/-)9-HpODE",
                replace = "9-hydroperoxy-10E,12Z-octadecadienoic acid",
                fixed = TRUE
            ),
            list(
                pattern = "(+/-)19(20)-DiHDPA",
                replace =
                    "19,20-dihydroxy-4Z,7Z,10Z,13Z,16Z-docosapentaenoic acid",
                fixed = TRUE
            )
        ),
        # replace greek characters
        greek_dictionary,
        # remove racemic properties
        racemic_dictionary
    )
) +
    pubchem_property_lookup(
        query_column = "normalised_name",
        search_by = "name",
        suffix = "_norm",
        property = "InChIKey",
        records = "best",
        cache = inchikey_cache2
    ) +
    opsin_lookup(
        query_column = "normalised_name",
        suffix = "_opsin",
        output = "stdinchikey",
        cache = inchikey_cache3
    ) +
    prioritise_columns(
        column_names = c("stdinchikey_opsin", "InChIKey_norm", "InChIKey"),
        output_name = "inchikey",
        source_name = "inchikey_source",
        clean = TRUE
    )

N <- model_apply(N, predicted(id_workflow))
```

We can explore the impact of these workflow steps using Venn and Pie charts to compare the results before/after the workflow.

In this Venn diagram we show the overlap between the InChIKey identifiers obtained from the three queries. It can be seen that normalising the names resulted in 37 identifiers that we were unable to obtain without normalisation. The use of OPSIN then added a further 31 identifiers.
The column of combined identifiers therefore contains identifiers drawn from all three queries.

```{r}
# venn inchikey columns
C <- annotation_venn_chart(
    factor_name = c("InChIKey", "InChIKey_norm", "stdinchikey_opsin"),
    line_colour = "white",
    fill_colour = ".group",
    legend = TRUE,
    labels = FALSE
)
chart_plot(C, predicted(N[3])) +
    guides(
        fill = guide_legend(title = "Source"),
        colour = guide_legend(title = "Source")
    ) +
    theme(plot.margin = unit(c(1, 1.5, 1, 1.5), "cm"))
```

Due to complexity, Venn charts are limited to 7 sets. UpSet plots are an alternative chart that can accommodate many more sets. Each vertical bar in the UpSet plot corresponds to a region in the Venn diagram.

```{r}
# upset inchikey columns
C <- annotation_upset_chart(
    factor_name = c("InChIKey", "InChIKey_norm", "stdinchikey_opsin"),
    min_size = 0,
    n_intersections = 10
)
chart_plot(C, predicted(N[3]))
```

In the next chart we visualise the proportion of annotations from each query, after prioritisation and merging has taken place. Note that in this case some of the identifiers might be the same if e.g. the same molecule is present in multiple rows.

You can see that in the end only a small proportion (12.1%) of the identifiers are from the least reliable query based on the non-normalised name.

```{r}
# pie source of inchikey
C <- annotation_pie_chart(
    factor_name = "inchikey_source",
    label_rotation = FALSE,
    label_location = "outside",
    label_type = "percent",
    legend = TRUE,
    centre_radius = 0.5,
    centre_label = ".total",
    count_na = TRUE
)
chart_plot(C, predicted(N)) +
    guides(
        fill = guide_legend(title = "Source"),
        colour = guide_legend(title = "Source")
    ) +
    theme(plot.margin = unit(c(1, 1.5, 1, 1.5), "cm"))
```

To keep only the records with the highest-confidence identifiers, we can remove the annotations based on the `name` query using the `filter_labels` object.
```{r}
# prepare workflow
FL <- filter_labels(
    column_name = "inchikey_source",
    labels = "InChIKey",
    mode = "exclude"
)

# apply
FF <- model_apply(FL, predicted(N))

# print summary
predicted(FF)
```
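
The prioritisation and filtering steps in this section amount to a "first non-missing value wins" merge followed by a filter on where each value came from. The base R sketch below illustrates that logic with placeholder keys; it is not the implementation of `prioritise_columns` or `filter_labels`, and keeping rows that have no identifier at all is an assumption made for the sketch.

```r
# hypothetical identifier columns from three queries (placeholder keys)
ids <- data.frame(
    stdinchikey_opsin = c("KEY-1", NA, NA, NA),
    InChIKey_norm = c("KEY-1", "KEY-2", NA, NA),
    InChIKey = c(NA, "KEY-2", "KEY-3", NA),
    stringsAsFactors = FALSE
)

# columns listed from highest to lowest priority
priority <- c("stdinchikey_opsin", "InChIKey_norm", "InChIKey")

# start with all NA, then fill from each column in priority order
inchikey <- rep(NA_character_, nrow(ids))
inchikey_source <- rep(NA_character_, nrow(ids))
for (col in priority) {
    fill <- is.na(inchikey) & !is.na(ids[[col]])
    inchikey[fill] <- ids[[col]][fill]
    inchikey_source[fill] <- col
}

# exclude records whose identifier came from the non-normalised name query
keep <- is.na(inchikey_source) | inchikey_source != "InChIKey"
```

Recording the source column alongside the merged identifier is what makes the later filter possible: once the values are merged, the identifier alone no longer says which query produced it.
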

# Comparison with the true mixtures

In this section we compare the annotated features with the table of standards known to be included in each of the mixtures.

## Importing the standard mixture tables

The first step is to import the tables of standards for each mixture. The data has already been saved as an RDS file so we can use `rds_database` to import it.

```{r}
# prepare object
R <- rds_database(
    source = file.path(
        system.file("extdata", package = "MetMashR"),
        "standard_mixtures.rds"
    ),
    .writable = FALSE
)

# read
R <- read_source(R)
```

```{r, echo=FALSE}
.DT(R$data)
```

The standards table contains a list of metabolites and the mixture they were included in. It also contains some manually curated data providing the m/z and retention time of the metabolite, which assay it was observed in, and the adduct.

## Identifiers for the standards

The standards table provides HMDB identifiers for each metabolite, but our preference is to work with InChIKey. So our first task is to obtain an InChIKey for each of the standards.

The standards are based on the MTox700+ database (see [TODO]), so we can import MTox700+ and use it to obtain InChIKey identifiers by matching the HMDB identifiers. An alternative might be to use `hmdb_lookup` and/or `pubchem_property_lookup`.

```{r}
# convert standard mixtures to source, then get inchikey from MTox700+
SM <- import_source() +
    filter_na(
        column_name = "rt"
    ) +
    filter_na(
        column_name = "median_ms2_scans"
    ) +
    filter_na(
        column_name = "mzcloud_id"
    ) +
    filter_range(
        column_name = "median_ms2_scans",
        upper_limit = Inf,
        lower_limit = 0,
        equal_to = TRUE
    ) +
    database_lookup(
        query_column = "hmdb_id",
        database_column = "hmdb_id",
        database = MTox700plus_database(),
        include = "inchikey",
        suffix = "",
        not_found = NA
    ) +
    id_counts(
        id_column = "inchikey",
        count_column = "inchikey_count",
        count_na = FALSE
    )

# apply
SM <- model_apply(SM, R)
```

In this next plot we show the overlap in standards for each assay detected by manual observation.

```{r}
C <- annotation_venn_chart(
    factor_name = "inchikey",
    group_column = "ion_mode",
    line_colour = "white",
    fill_colour = ".group",
    legend = TRUE,
    labels = FALSE
)

## plot
chart_plot(C, predicted(SM))
```
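
Matching identifiers against a reference database, as done for the standards above, is in essence a keyed lookup. The sketch below shows the idea in base R with placeholder identifiers; it is not the implementation of `database_lookup` or `id_counts`, and the tables and values are hypothetical.

```r
# hypothetical standards table and reference database (placeholder values)
standards <- data.frame(
    hmdb_id = c("HMDB_A", "HMDB_B", "HMDB_X", "HMDB_A"),
    stringsAsFactors = FALSE
)
database <- data.frame(
    hmdb_id = c("HMDB_A", "HMDB_B", "HMDB_C"),
    inchikey = c("KEY-A", "KEY-B", "KEY-C"),
    stringsAsFactors = FALSE
)

# look up each query id in the database; unmatched ids give NA
row_in_db <- match(standards$hmdb_id, database$hmdb_id)
standards$inchikey <- database$inchikey[row_in_db]

# count how often each identifier occurs, leaving NA uncounted
tab <- table(standards$inchikey)
standards$inchikey_count <- as.integer(
    tab[match(standards$inchikey, names(tab))]
)
standards
```

An identifier with a count greater than one indicates that the same molecule appears in more than one row of the table, which is worth knowing before comparing tables by identifier.
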

# Comparison of standards and annotations

Now that we have both the annotations and the standards we can look at the overlap between the identifiers in both sources, and begin to assess the ability of the annotation software to annotate the standards.

Here we plot a Venn diagram showing the overlap in identifiers.

```{r,fig.width=4}
# get processed data
AN <- predicted(FF)
AN$tag <- "Annotations"
sM <- predicted(SM)
sM$tag <- "Standards"

# prepare chart
C <- annotation_venn_chart(
    factor_name = "inchikey",
    line_colour = "white",
    fill_colour = ".group"
)

# plot
chart_plot(C, sM, AN) + ggtitle("All assays, all sources")
```

It can be seen that there is a large number of annotations not present in the standards. These are false positives.

In the next plot we show similar Venn diagrams but for each assay individually. These are followed by a 4-set Venn diagram that shows the overlap between annotations from each assay that matched to a standard.

```{r,fig.height = 10, fig.width = 8}
G <- list()
VV <- list()
for (k in c("HILIC_NEG", "HILIC_POS", "LIPIDS_NEG", "LIPIDS_POS")) {
    # annotations for this assay
    wf <- filter_labels(
        column_name = "assay",
        labels = k,
        mode = "include"
    )
    wf1 <- model_apply(wf, AN)

    # standards for this assay
    wf$column_name <- "ion_mode"
    wf2 <- model_apply(wf, sM)

    G[[k]] <- chart_plot(C, predicted(wf2), predicted(wf1))

    # keep the standards that were also annotated
    V <- filter_venn(
        factor_name = "inchikey",
        tables = list(predicted(wf1)),
        levels = "Standards/Annotations",
        mode = "include"
    )
    V <- model_apply(V, predicted(wf2))
    VV[[k]] <- predicted(V)
    VV[[k]]$tag <- k
}
r1 <- cowplot::plot_grid(
    plotlist = G, nrow = 2,
    labels = c("HILIC_NEG", "HILIC_POS", "LIPIDS_NEG", "LIPIDS_POS")
)
cowplot::plot_grid(r1, chart_plot(C, VV), nrow = 2, rel_heights = c(1, 0.5))
```

The majority of the standards were correctly annotated in the HILIC_NEG assay.

In the next plot we compare the overlap in InChIKey for each source.
```{r,fig.height = 3, fig.width = 8}
G <- list()
for (k in c("CD", "LS")) {
    wf <- filter_labels(
        column_name = "source_name",
        labels = k,
        mode = "include"
    )
    wf1 <- model_apply(wf, AN)

    G[[k]] <- chart_plot(C, sM, predicted(wf1))
}
cowplot::plot_grid(plotlist = G, nrow = 1, labels = c("CD", "LS"))
```
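
The regions of the two-set Venn diagrams in this section correspond to simple set operations on the identifier columns. As a rough sketch with placeholder identifiers (not the vignette results), the three regions and two summary fractions could be computed in base R as follows; the variable names are illustrative only.

```r
# hypothetical identifiers (placeholders, not the vignette results)
standards_ids <- c("KEY-A", "KEY-B", "KEY-C", "KEY-D")
annotation_ids <- c("KEY-B", "KEY-C", "KEY-X", "KEY-Y", "KEY-C", NA)

# work with unique, non-missing identifiers
s <- unique(standards_ids[!is.na(standards_ids)])
a <- unique(annotation_ids[!is.na(annotation_ids)])

# the three regions of the Venn diagram
matched <- intersect(s, a) # standards that were annotated
missed <- setdiff(s, a) # standards with no annotation
unexpected <- setdiff(a, s) # annotations that are not standards

# fraction of standards recovered, and fraction of annotated
# identifiers that correspond to a standard
fraction_recovered <- length(matched) / length(s)
fraction_expected <- length(matched) / length(a)
```

Duplicated and missing identifiers are removed first so that each molecule is counted once, which is the same view the Venn diagram gives.
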

# Session Info

```{r}
sessionInfo()
```