Contents

squallms is a Bioconductor package designed to easily label and remove low-quality chromatographic features from LC/GC-MS analysis. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for grouping similar peaks together and labeling them, and editing XCMS objects so only the high-quality peaks remain.

1 Setup

1.1 Installation

squallms can be installed from Bioconductor (from release 3.19 onwards):

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("squallms")

squallms makes extensive use of the tidyverse for data shaping and organization as well as the RaMS package for MS data extraction. It also uses the shiny and plotly packages for interactive data visualization.

library(squallms)

1.2 Peakpicking with XCMS

For this demo, I’ll be performing peakpicking with XCMS but it’s worth noting that any peakpicking software could be used to produce a peak list as long as it can be turned into the proper input format (see “Calculating metrics of peak quality” below).

suppressPackageStartupMessages({
    library(dplyr)
    library(tidyr)
    library(tibble)
    library(ggplot2)
    library(xcms)
    library(RaMS)
})
mzML_files <- system.file("extdata", package = "RaMS") %>%
    list.files(full.names = TRUE, pattern = "[A-F].mzML")

register(BPPARAM = SerialParam())
cwp <- CentWaveParam(snthresh = 0, extendLengthMSW = TRUE, integrate = 2)
obp <- ObiwarpParam(binSize = 0.1, response = 1, distFun = "cor_opt")
pdp <- PeakDensityParam(
    sampleGroups = 1:3, bw = 12, minFraction = 0,
    binSize = 0.001, minSamples = 0
)

## Using XCMS version < 4
library(MSnbase)
msexp <- readMSData(mzML_files, msLevel. = 1, mode = "onDisk")
fpp <- FillChromPeaksParam(ppm = 5)
## Using XCMS version 4+ (exact values may differ)
# library(MsExperiment)
# msexp <- readMsExperiment(mzML_files, msLevel. = 1, mode = "onDisk")
# fpp <- ChromPeakAreaParam()

xcms_filled <- msexp %>%
    findChromPeaks(cwp) %>%
    adjustRtime(obp) %>%
    groupChromPeaks(pdp) %>%
    fillChromPeaks(fpp)

This XCMS object is useful for documenting processing steps and interfacing with R-based pipelines but can be challenging to inspect and interact with. squallms provides a function that turns this S4 object into a flat file containing the feature and peak information.

feat_peak_info <- makeXcmsObjFlat(xcms_filled) %>%
    select(feature, starts_with("mz"), starts_with("rt"), filename, filepath)
feat_peak_info %>%
    mutate(filepath = paste0(substr(filepath, 1, 13), "~")) %>%
    head() %>%
    knitr::kable()
feature mz mzmin mzmax rt rtmin rtmax filename filepath
FT001 90.05549 90.05541 90.05556 11.11593 10.91385 11.41188 LB12HL_AB.mzML.gz /home/pkgbuil~
FT001 90.05549 90.05544 90.05552 11.08192 10.87952 11.41167 LB12HL_CD.mzML.gz /home/pkgbuil~
FT001 90.05549 90.05540 90.05553 11.05225 10.80270 11.39430 LB12HL_EF.mzML.gz /home/pkgbuil~
FT002 90.05543 90.05523 90.05553 11.81833 11.53837 11.98883 LB12HL_CD.mzML.gz /home/pkgbuil~
FT002 90.05546 90.05536 90.05551 11.78373 11.51833 11.98535 LB12HL_EF.mzML.gz /home/pkgbuil~
FT002 90.05547 90.05518 90.05563 11.44277 11.36553 11.70645 LB12HL_AB.mzML.gz /home/pkgbuil~

Although comprehensive, the only information needed is actually the grouping info, the peak bounding boxes, and the filepath (feature, mz*, rt*, and filepath) because we’ll recalculate the necessary metrics from the raw peak data pulled in from the mz(X)ML files. Similarly, other peakpicking algorithms tend to produce data in a similar format and can interface with squallms here.

We also can operate over only a subset of the data, which is quite useful for very large datasets. Quality control files or pooled samples with a similar expected quality can be used in lieu of the entire dataset which may not fit into computer memory.

## Not run:
feat_peak_info <- feat_peak_info %>%
    filter(grepl("AB|CD", filename))

Note, however, that the retention time bounds must be uncorrected. makeXcmsObjFlat does this automatically for XCMS objects, but if corrected retention times are provided then the data extracted from the mz(X)ML files will be incorrect.

2 Calculating metrics of peak quality (extractChromMetrics)

The first step is to obtain some kind of information about the chromatographic features that will allow us to distinguish between good and bad. The most intuitive metrics (in my opinion) are similarity to a bell curve and the signal-to-noise ratio. Although many implementations of these exist, the ones I’ve found most useful are described in a BMC Bioinformatics paper I put out in 2023. The function for this in squallms is extractChromMetrics, as shown below. I read in the raw MS data with RaMS first because this saves a step and can be reused for quality checks.

msdata <- grabMSdata(unique(feat_peak_info$filepath), verbosity = 0)
shape_metrics <- extractChromMetrics(feat_peak_info, ms1_data = msdata$MS1)
knitr::kable(head(shape_metrics))
feature med_mz med_rt med_cor med_snr
FT001 90.05549 11.105383 0.9187332 14.425791
FT002 90.05546 11.756300 0.8074746 3.870094
FT003 93.07429 11.144350 0.9552928 17.061047
FT004 104.07101 10.645483 0.8738360 20.505134
FT005 104.07100 8.301808 0.9474947 18.936980
FT006 104.07098 7.373142 0.7835736 5.283807
shape_metrics %>%
    arrange(desc(med_snr)) %>%
    ggplot() +
    geom_point(aes(x = med_rt, y = med_mz, color = med_cor, size = med_snr), alpha = 0.5) +
    theme_bw()

So far so good. Unfortunately, many of these features are very low-quality, often a result of noise rather than biological signal, and should not be included in the downstream analysis.

set.seed(123)
shape_metrics %>%
    slice_sample(n = 8) %>%
    mutate(mzmin = med_mz - med_mz * 5 / 1e6) %>%
    mutate(mzmax = med_mz + med_mz * 5 / 1e6) %>%
    mutate(rtmin = med_rt - 1) %>%
    mutate(rtmax = med_rt + 1) %>%
    left_join(msdata$MS1, join_by(
        between(y$rt, x$rtmin, x$rtmax),
        between(y$mz, x$mzmin, x$mzmax)
    )) %>%
    qplotMS1data() +
    geom_vline(aes(xintercept = med_rt), color = "red") +
    facet_wrap(~feature, scales = "free", nrow = 2)

3 Labeling

Labeling chromatographic features is usually a slow and painful step. XCMS’s methods for extracting chromatograms are sluggish and difficult to customize, while data entry usually involves having an Excel workbook or some similar data entry object open to record the values. squallms provides two ways of labeling features a little more easily.

3.1 Manual labeling with labelFeatsManual

labelFeatsManual accepts the data frame of peak information constructed above and renders one random feature at a time in a new plotting window. This plot window then detects when arrow keys are pressed which have been bound to specific classifications. Currently, the left arrow key categorizes the feature as “Bad” while the right arrow key categorizes it as “Good”. The up arrow key also classifies it as “Revisit” for later analysis. Once pressed, a new window pops up with a new random feature. This process repeats until the Escape key is pressed or there are no features left to label.

manual_classes <- labelFeatsManual(feat_peak_info, ms1_data = msdata$MS1)

The classification plot looks like this:

Alternatively, this function can be used to double-check existing classifications if the existing_labels argument is provided and the selection is set to “Labeled” (see below).

3.2 Lasso labeling with labelFeatsLasso

However, one of the big strengths of squallms is its implementation of a strategy for classifying many peaks simultaneously. To do this, the raw MS data for a multi-file feature is coerced to a set of shared retention times. This RT x file peak matrix can then be treated as individual “pixels” in a PCA that extracts the major patterns. For chromatographic peak data, the dominant signal is often noise vs real peak.

This signal can be seen visually in the figure below. Good peaks have a distinctly lighter center and darker sides corresponding to high values at the center of the peak and low values at the edges of the window. The pattern’s visible across multiple files as a ridge running vertically down each feature’s small multiple plot. In this case, features FT030, FT040, FT041, and FT044 look like high-quality features. Noise features have no obvious pattern, so the PCA won’t be able to extract signal from them as a dominant component. The other pattern are the features (FT031, FT042, FT048) that are intense at the beginning or at the end, corresponding to the noise peaks that CentWave loves picking out so much.

When the PCA is performed, we then see these features separated from the rest along the first principal component.

Of course, such separation isn’t guaranteed to be consistently to the left or right and isn’t always neatly along one axis. While a line between good and bad could be drawn with complicated math expressions, a much easier way is to simply engage an interactive visualization allowing for arbitrary selection. R’s shiny and plotly packages are perfect for this method, so squallms has a built-in Shiny application that launches in the browser and can be used to select such peaks using the “lasso” tool - thus, the “lasso labeling” part of the package name.

lasso_classes <- labelFeatsLasso(feat_peak_info, ms1_data = msdata$MS1)

In the screenshot below, I’ve moused over the feature IDs in the central plot as they render one at a time in the bottom left. Once I find a cluster where many high-quality features are found, I clicked and dragged to select a region of good features, which rendered the “aggregate” feature in the bottom right. Shortly after, I clicked the “Flag selection as Good” at the bottom of the left sidebar.

Of course, a training set needs examples of low-quality peaks as well. I then browsed other features in the central plot until finding a region of low-quality features that I could similarly select using the lasso tool and flag as Bad. Multiple regions can be selected and combined with additional lassos, and of course the rectangle selection tool can be used instead of the freehand.

In this case, the good/bad signal was so strongly dominant that I had no need for additional principal components and really only needed the first one to capture my internal sense of peak quality. However, other datasets may not identify the quality signal as the first or even second PC, especially if there’s a large biological signal present. To handle this case, I added a k-means clustering algorithm that operates on all PCs and renders their aggregate shape in the upper right. Combined with plotly’s ability to show or hide individual clusters by double-clicking the plot legend, this allows the user to isolate regions in multidimensional space while still only showing the first two PCs in the plot. Both k and the number of PCs used for the k-means clustering can be set in the left sidebar, while a simple barplot provides information about the number of PCs that capture 20, 50, and 80% of the variation in the dataset.

After both good and bad features have been flagged, clicking the “Return to R” button at the bottom of the sidebar or closing the window will return a character vector consisting of “Good”, “Bad”, and NAs named by the feature with which those labels were associated.

head(lasso_classes, 10)
#>  FT001  FT002  FT003  FT004  FT005  FT006  FT007  FT008  FT009  FT010 
#> "Good"  "Bad"     NA     NA "Good" "Good"  "Bad"  "Bad"     NA "Good"

If you’re unsure how well the lasso performed, you can easily pass this vector back to the manual labeling tool with selection = "Labeled" to double-check on the agreement between the lasso and the manual methods.

manual_classes <- labelFeatsManual(feat_peak_info, ms1_data = msdata$MS1, selection = "Labeled", existing_labels = lasso_classes)
suppressPackageStartupMessages(library(caret))
data.frame(manual = manual_classes, lasso = lasso_classes) %>%
    filter(!is.na(manual) & !is.na(lasso))
#>       manual lasso
#> FT007    Bad   Bad
#> FT008    Bad   Bad
#> FT026    Bad   Bad
#> FT039    Bad   Bad
#> FT072    Bad   Bad
#> FT085    Bad   Bad
#> FT087   Good  Good
#> FT092   Good  Good
#> FT111   Good  Good
#> FT120    Bad   Bad
#> FT153   Good  Good
#> FT171    Bad   Bad
#> FT181   Good  Good

data.frame(manual = factor(manual_classes), lasso = factor(lasso_classes)) %>%
    filter(!is.na(manual)) %>%
    with(confusionMatrix(lasso, manual, positive = "Good"))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Bad Good
#>       Bad    8    0
#>       Good   0    5
#>                                      
#>                Accuracy : 1          
#>                  95% CI : (0.7529, 1)
#>     No Information Rate : 0.6154     
#>     P-Value [Acc > NIR] : 0.001815   
#>                                      
#>                   Kappa : 1          
#>                                      
#>  Mcnemar's Test P-Value : NA         
#>                                      
#>             Sensitivity : 1.0000     
#>             Specificity : 1.0000     
#>          Pos Pred Value : 1.0000     
#>          Neg Pred Value : 1.0000     
#>              Prevalence : 0.3846     
#>          Detection Rate : 0.3846     
#>    Detection Prevalence : 0.3846     
#>       Balanced Accuracy : 1.0000     
#>                                      
#>        'Positive' Class : Good       
#> 

If there are differences between the two, we can replace the lasso labels with the (generally) more reliable manual labels.

class_labels <- lasso_classes
class_labels[!is.na(manual_classes)] <- manual_classes[!is.na(manual_classes)]

4 Building a quality model and removing low-quality peaks

Of course, both labeling methods are designed to return partially-labeled datasets. To get quality estimates for the remaining features, we can use a logistic model trained on a few of our metrics to estimate the quality of each feature. The basic idea here is to use the med_cor and med_snr information extracted with extractChromMetrics as the predictive variables and the labels produced by labelFeatsManual or labelFeatsLasso as the predicted variable in a logistic regression. The default formula uses glm(formula=feat_class~med_cor+med_snr, family = binomial) to fit the values but additional metrics that exist in the shape_metrics data frame can be used as predictive variables in the model as well.

pred_probs <- logModelFeatProb(shape_metrics, class_labels)
#> Logistic model regression coefficients:
#> (Intercept)     med_cor     med_snr 
#>  -7.0349466   4.8379451   0.6066241

if (interactive()) {
    ggsave("intro_model_spread.png",
        plot = last_plot(), device = "png",
        path = "vignettes", width = 6, height = 4, units = "in", dpi = 120
    )
}
hist(pred_probs, breaks = 20, main = NULL, xlab = "Logistic likelihood")

The values returned by logModelFeatProb can then be handled manually, or you can use the wrapper function logModelFeatQuality with an associated likelihood_threshold to return the binary good/bad classification. This also then makes it possible to assess the performance of the model on the training data itself:

pred_classes <- logModelFeatQuality(shape_metrics, class_labels, verbosity = 1)
#> Logistic model regression coefficients:
#> (Intercept)     med_cor     med_snr 
#>  -7.0349466   4.8379451   0.6066241 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Bad Good
#>       Bad   61    6
#>       Good   4   35
#>                                           
#>                Accuracy : 0.9057          
#>                  95% CI : (0.8333, 0.9538)
#>     No Information Rate : 0.6132          
#>     P-Value [Acc > NIR] : 1.159e-11       
#>                                           
#>                   Kappa : 0.7993          
#>                                           
#>  Mcnemar's Test P-Value : 0.7518          
#>                                           
#>             Sensitivity : 0.8537          
#>             Specificity : 0.9385          
#>          Pos Pred Value : 0.8974          
#>          Neg Pred Value : 0.9104          
#>              Prevalence : 0.3868          
#>          Detection Rate : 0.3302          
#>    Detection Prevalence : 0.3679          
#>       Balanced Accuracy : 0.8961          
#>                                           
#>        'Positive' Class : Good            
#> 

Finally, once an appropriate threshold has been set and the model’s performance is acceptable the XCMS object itself can be edited to remove the poor-quality peaks. This function also notes squallms in the object’s processing history (albeit poorly).

final_xcms <- updateXcmsObjFeats(shape_metrics, class_labels,
    verbosity = 0,
    likelihood_threshold = 0.5
)

The final XCMS object can then be passed to downstream software with confidence!

5 Vignette diagnostics

sessionInfo()
#> R Under development (unstable) (2024-03-18 r86148)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] caret_6.0-94        lattice_0.22-6      MSnbase_2.29.4     
#>  [4] ProtGenerics_1.35.4 S4Vectors_0.41.5    mzR_2.37.3         
#>  [7] Rcpp_1.0.12         Biobase_2.63.1      BiocGenerics_0.49.1
#> [10] RaMS_1.4.0          xcms_4.1.11         BiocParallel_1.37.1
#> [13] ggplot2_3.5.0       tibble_3.2.1        tidyr_1.3.1        
#> [16] dplyr_1.1.4         squallms_0.99.1     BiocStyle_2.31.0   
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          jsonlite_1.8.8             
#>   [3] MultiAssayExperiment_1.29.1 magrittr_2.0.3             
#>   [5] farver_2.1.1                MALDIquant_1.22.2          
#>   [7] rmarkdown_2.26              fs_1.6.3                   
#>   [9] zlibbioc_1.49.3             vctrs_0.6.5                
#>  [11] base64enc_0.1-3             htmltools_0.5.8.1          
#>  [13] S4Arrays_1.3.6              progress_1.2.3             
#>  [15] pROC_1.18.5                 SparseArray_1.3.4          
#>  [17] mzID_1.41.0                 parallelly_1.37.1          
#>  [19] sass_0.4.9                  bslib_0.7.0                
#>  [21] htmlwidgets_1.6.4           plyr_1.8.9                 
#>  [23] lubridate_1.9.3             impute_1.77.0              
#>  [25] plotly_4.10.4               cachem_1.0.8               
#>  [27] igraph_2.0.3                mime_0.12                  
#>  [29] lifecycle_1.0.4             iterators_1.0.14           
#>  [31] pkgconfig_2.0.3             Matrix_1.7-0               
#>  [33] R6_2.5.1                    fastmap_1.1.1              
#>  [35] future_1.33.2               GenomeInfoDbData_1.2.12    
#>  [37] MatrixGenerics_1.15.0       shiny_1.8.1.1              
#>  [39] clue_0.3-65                 digest_0.6.35              
#>  [41] pcaMethods_1.95.0           colorspace_2.1-0           
#>  [43] GenomicRanges_1.55.4        labeling_0.4.3             
#>  [45] Spectra_1.13.8              timechange_0.3.0           
#>  [47] fansi_1.0.6                 httr_1.4.7                 
#>  [49] abind_1.4-5                 compiler_4.4.0             
#>  [51] proxy_0.4-27                withr_3.0.0                
#>  [53] doParallel_1.0.17           DBI_1.2.2                  
#>  [55] highr_0.10                  lava_1.8.0                 
#>  [57] MASS_7.3-60.2               MsExperiment_1.5.4         
#>  [59] DelayedArray_0.29.9         ModelMetrics_1.2.2.2       
#>  [61] tools_4.4.0                 PSMatch_1.7.2              
#>  [63] httpuv_1.6.15               future.apply_1.11.2        
#>  [65] nnet_7.3-19                 glue_1.7.0                 
#>  [67] nlme_3.1-164                QFeatures_1.13.3           
#>  [69] promises_1.2.1              grid_4.4.0                 
#>  [71] reshape2_1.4.4              cluster_2.1.6              
#>  [73] generics_0.1.3              recipes_1.0.10             
#>  [75] gtable_0.3.4                class_7.3-22               
#>  [77] preprocessCore_1.65.0       data.table_1.15.4          
#>  [79] hms_1.1.3                   MetaboCoreUtils_1.11.2     
#>  [81] xml2_1.3.6                  utf8_1.2.4                 
#>  [83] XVector_0.43.1              stringr_1.5.1              
#>  [85] foreach_1.5.2               pillar_1.9.0               
#>  [87] limma_3.59.6                later_1.3.2                
#>  [89] splines_4.4.0               survival_3.5-8             
#>  [91] tidyselect_1.2.1            knitr_1.45                 
#>  [93] bookdown_0.38               IRanges_2.37.1             
#>  [95] SummarizedExperiment_1.33.3 xfun_0.43                  
#>  [97] statmod_1.5.0               hardhat_1.3.1              
#>  [99] timeDate_4032.109           matrixStats_1.2.0          
#> [101] stringi_1.8.3               lazyeval_0.2.2             
#> [103] yaml_2.3.8                  evaluate_0.23              
#> [105] codetools_0.2-20            MsCoreUtils_1.15.6         
#> [107] BiocManager_1.30.22         cli_3.6.2                  
#> [109] affyio_1.73.0               rpart_4.1.23               
#> [111] xtable_1.8-4                munsell_0.5.1              
#> [113] jquerylib_0.1.4             GenomeInfoDb_1.39.11       
#> [115] MassSpecWavelet_1.69.0      globals_0.16.3             
#> [117] XML_3.99-0.16.1             parallel_4.4.0             
#> [119] gower_1.0.1                 prettyunits_1.2.0          
#> [121] AnnotationFilter_1.27.0     listenv_0.9.1              
#> [123] viridisLite_0.4.2           ipred_0.9-14               
#> [125] e1071_1.7-14                prodlim_2023.08.28         
#> [127] MsFeatures_1.11.0           scales_1.3.0               
#> [129] affy_1.81.0                 ncdf4_1.22                 
#> [131] purrr_1.0.2                 crayon_1.5.2               
#> [133] rlang_1.1.3                 vsn_3.71.1

Vignette last built on 2024-04-04