squallms
is a Bioconductor package designed to easily label and remove low-quality chromatographic features from LC/GC-MS analysis. It does this by calculating a few metrics of peak quality from the raw MS data, providing methods for grouping similar peaks together and labeling them, and editing XCMS objects so only the high-quality peaks remain.
squallms
can be installed from Bioconductor (from release 3.19 onwards):
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("squallms")
squallms
makes extensive use of the tidyverse for data shaping and organization as well as the RaMS
package for MS data extraction. It also uses the shiny
and plotly
packages for interactive data visualization.
library(squallms)
For this demo, I’ll be performing peakpicking with XCMS but it’s worth noting that any peakpicking software could be used to produce a peak list as long as it can be turned into the proper input format (see “Calculating metrics of peak quality” below).
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(xcms)
library(RaMS)
})
mzML_files <- system.file("extdata", package = "RaMS") %>%
list.files(full.names = TRUE, pattern = "[A-F].mzML")
register(BPPARAM = SerialParam())
cwp <- CentWaveParam(snthresh = 0, extendLengthMSW = TRUE, integrate = 2)
obp <- ObiwarpParam(binSize = 0.1, response = 1, distFun = "cor_opt")
pdp <- PeakDensityParam(
sampleGroups = 1:3, bw = 12, minFraction = 0,
binSize = 0.001, minSamples = 0
)
## Using XCMS version < 4
library(MSnbase)
msexp <- readMSData(mzML_files, msLevel. = 1, mode = "onDisk")
fpp <- FillChromPeaksParam(ppm = 5)
## Using XCMS version 4+ (exact values may differ)
# library(MsExperiment)
# msexp <- readMsExperiment(mzML_files, msLevel. = 1, mode = "onDisk")
# fpp <- ChromPeakAreaParam()
xcms_filled <- msexp %>%
findChromPeaks(cwp) %>%
adjustRtime(obp) %>%
groupChromPeaks(pdp) %>%
fillChromPeaks(fpp)
This XCMS object is useful for documenting processing steps and interfacing with R-based pipelines but can be challenging to inspect and interact with. squallms
provides a function that turns this S4 object into a flat file containing the feature and peak information.
feat_peak_info <- makeXcmsObjFlat(xcms_filled) %>%
select(feature, starts_with("mz"), starts_with("rt"), filename, filepath)
feat_peak_info %>%
mutate(filepath = paste0(substr(filepath, 1, 13), "~")) %>%
head() %>%
knitr::kable()
feature | mz | mzmin | mzmax | rt | rtmin | rtmax | filename | filepath |
---|---|---|---|---|---|---|---|---|
FT001 | 90.05549 | 90.05541 | 90.05556 | 11.11593 | 10.91385 | 11.41188 | LB12HL_AB.mzML.gz | /home/pkgbuil~ |
FT001 | 90.05549 | 90.05544 | 90.05552 | 11.08192 | 10.87952 | 11.41167 | LB12HL_CD.mzML.gz | /home/pkgbuil~ |
FT001 | 90.05549 | 90.05540 | 90.05553 | 11.05225 | 10.80270 | 11.39430 | LB12HL_EF.mzML.gz | /home/pkgbuil~ |
FT002 | 90.05543 | 90.05523 | 90.05553 | 11.81833 | 11.53837 | 11.98883 | LB12HL_CD.mzML.gz | /home/pkgbuil~ |
FT002 | 90.05546 | 90.05536 | 90.05551 | 11.78373 | 11.51833 | 11.98535 | LB12HL_EF.mzML.gz | /home/pkgbuil~ |
FT002 | 90.05547 | 90.05518 | 90.05563 | 11.44277 | 11.36553 | 11.70645 | LB12HL_AB.mzML.gz | /home/pkgbuil~ |
Although comprehensive, the only information needed is actually the grouping info, the peak bounding boxes, and the filepath (feature, mz*, rt*, and filepath) because we’ll recalculate the necessary metrics from the raw peak data pulled in from the mz(X)ML files. Similarly, other peakpicking algorithms tend to produce data in a similar format and can interface with squallms
here.
We also can operate over only a subset of the data, which is quite useful for very large datasets. Quality control files or pooled samples with a similar expected quality can be used in lieu of the entire dataset which may not fit into computer memory.
## Not run:
feat_peak_info <- feat_peak_info %>%
filter(grepl("AB|CD", filename))
Note, however, that the retention time bounds must be uncorrected. makeXcmsObjFlat
does this automatically for XCMS objects, but if corrected retention times are provided then the data extracted from the mz(X)ML files will be incorrect.
The first step is to obtain some kind of information about the chromatographic features that will allow us to distinguish between good and bad. The most intuitive metrics (in my opinion) are similarity to a bell curve and the signal-to-noise ratio. Although many implementations of these exist, the ones I’ve found most useful are described in a BMC Bioinformatics paper I put out in 2023. The function for this in squallms
is extractChromMetrics
, as shown below. I read in the raw MS data with RaMS first because this saves a step and can be reused for quality checks.
msdata <- grabMSdata(unique(feat_peak_info$filepath), verbosity = 0)
shape_metrics <- extractChromMetrics(feat_peak_info, ms1_data = msdata$MS1)
knitr::kable(head(shape_metrics))
feature | med_mz | med_rt | med_cor | med_snr |
---|---|---|---|---|
FT001 | 90.05549 | 11.105383 | 0.9187332 | 14.425791 |
FT002 | 90.05546 | 11.756300 | 0.8074746 | 3.870094 |
FT003 | 93.07429 | 11.144350 | 0.9552928 | 17.061047 |
FT004 | 104.07101 | 10.645483 | 0.8738360 | 20.505134 |
FT005 | 104.07100 | 8.301808 | 0.9474947 | 18.936980 |
FT006 | 104.07098 | 7.373142 | 0.7835736 | 5.283807 |
shape_metrics %>%
arrange(desc(med_snr)) %>%
ggplot() +
geom_point(aes(x = med_rt, y = med_mz, color = med_cor, size = med_snr), alpha = 0.5) +
theme_bw()
So far so good. Unfortunately, many of these features are very low-quality, often a result of noise rather than biological signal, and should not be included in the downstream analysis.
set.seed(123)
shape_metrics %>%
slice_sample(n = 8) %>%
mutate(mzmin = med_mz - med_mz * 5 / 1e6) %>%
mutate(mzmax = med_mz + med_mz * 5 / 1e6) %>%
mutate(rtmin = med_rt - 1) %>%
mutate(rtmax = med_rt + 1) %>%
left_join(msdata$MS1, join_by(
between(y$rt, x$rtmin, x$rtmax),
between(y$mz, x$mzmin, x$mzmax)
)) %>%
qplotMS1data() +
geom_vline(aes(xintercept = med_rt), color = "red") +
facet_wrap(~feature, scales = "free", nrow = 2)
Labeling chromatographic features is usually a slow and painful step. XCMS’s methods for extracting chromatograms are sluggish and difficult to customize, while data entry usually involves having an Excel workbook or some similar data entry object open to record the values. squallms
provides two ways of labeling features a little more easily.
labelFeatsManual
accepts the data frame of peak information constructed above and renders one random feature at a time in a new plotting window. This plot window then detects when arrow keys are pressed which have been bound to specific classifications. Currently, the left arrow key categorizes the feature as “Bad” while the right arrow key categorizes it as “Good”. The up arrow key also classifies it as “Revisit” for later analysis. Once pressed, a new window pops up with a new random feature. This process repeats until the Escape key is pressed or there are no features left to label.
manual_classes <- labelFeatsManual(feat_peak_info, ms1_data = msdata$MS1)
The classification plot looks like this:
Alternatively, this function can be used to double-check existing classifications if the existing_labels
argument is provided and the selection is set to “Labeled” (see below).
However, one of the big strengths of squallms
is its implementation of a strategy for classifying many peaks simultaneously. To do this, the raw MS data for a multi-file feature is coerced to a set of shared retention times. This RT x file peak matrix can then be treated as individual “pixels” in a PCA that extracts the major patterns. For chromatographic peak data, the dominant signal is often noise vs real peak.
This signal can be seen visually in the figure below. Good peaks have a distinctly lighter center and darker sides corresponding to high values at the center of the peak and low values at the edges of the window. The pattern’s visible across multiple files as a ridge running vertically down each feature’s small multiple plot. In this case, features FT030, FT040, FT041, and FT044 look like high-quality features. Noise features have no obvious pattern, so the PCA won’t be able to extract signal from them as a dominant component. The other pattern are the features (FT031, FT042, FT048) that are intense at the beginning or at the end, corresponding to the noise peaks that CentWave loves picking out so much.
When the PCA is performed, we then see these features separated from the rest along the first principal component.
Of course, such separation isn’t guaranteed to be consistently to the left or right and isn’t always neatly along one axis. While a line between good and bad could be drawn with complicated math expressions, a much easier way is to simply engage an interactive visualization allowing for arbitrary selection. R’s shiny
and plotly
packages are perfect for this method, so squallms
has a built-in Shiny application that launches in the browser and can be used to select such peaks using the “lasso” tool - thus, the “lasso labeling” part of the package name.
lasso_classes <- labelFeatsLasso(feat_peak_info, ms1_data = msdata$MS1)
In the screenshot below, I’ve moused over the feature IDs in the central plot as they render one at a time in the bottom left. Once I find a cluster where many high-quality features are found, I clicked and dragged to select a region of good features, which rendered the “aggregate” feature in the bottom right. Shortly after, I clicked the “Flag selection as Good” at the bottom of the left sidebar.
Of course, a training set needs examples of low-quality peaks as well. I then browsed other features in the central plot until finding a region of low-quality features that I could similarly select using the lasso tool and flag as Bad. Multiple regions can be selected and combined with additional lassos, and of course the rectangle selection tool can be used instead of the freehand.
In this case, the good/bad signal was so strongly dominant that I had no need for additional principal components and really only needed the first one to capture my internal sense of peak quality. However, other datasets may not identify the quality signal as the first or even second PC, especially if there’s a large biological signal present. To handle this case, I added a k-means clustering algorithm that operates on all PCs and renders their aggregate shape in the upper right. Combined with plotly’s ability to show or hide individual clusters by double-clicking the plot legend, this allows the user to isolate regions in multidimensional space while still only showing the first two PCs in the plot. Both k and the number of PCs used for the k-means clustering can be set in the left sidebar, while a simple barplot provides information about the number of PCs that capture 20, 50, and 80% of the variation in the dataset.
After both good and bad features have been flagged, clicking the “Return to R” button at the bottom of the sidebar or closing the window will return a character vector consisting of “Good”, “Bad”, and NAs named by the feature with which those labels were associated.
head(lasso_classes, 10)
#> FT001 FT002 FT003 FT004 FT005 FT006 FT007 FT008 FT009 FT010
#> "Good" "Bad" NA NA "Good" "Good" "Bad" "Bad" NA "Good"
If you’re unsure how well the lasso performed, you can easily pass this vector back to the manual labeling tool with selection = "Labeled"
to double-check on the agreement between the lasso and the manual methods.
manual_classes <- labelFeatsManual(feat_peak_info, ms1_data = msdata$MS1, selection = "Labeled", existing_labels = lasso_classes)
suppressPackageStartupMessages(library(caret))
data.frame(manual = manual_classes, lasso = lasso_classes) %>%
filter(!is.na(manual) & !is.na(lasso))
#> manual lasso
#> FT007 Bad Bad
#> FT008 Bad Bad
#> FT026 Bad Bad
#> FT039 Bad Bad
#> FT072 Bad Bad
#> FT085 Bad Bad
#> FT087 Good Good
#> FT092 Good Good
#> FT111 Good Good
#> FT120 Bad Bad
#> FT153 Good Good
#> FT171 Bad Bad
#> FT181 Good Good
data.frame(manual = factor(manual_classes), lasso = factor(lasso_classes)) %>%
filter(!is.na(manual)) %>%
with(confusionMatrix(lasso, manual, positive = "Good"))
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Bad Good
#> Bad 8 0
#> Good 0 5
#>
#> Accuracy : 1
#> 95% CI : (0.7529, 1)
#> No Information Rate : 0.6154
#> P-Value [Acc > NIR] : 0.001815
#>
#> Kappa : 1
#>
#> Mcnemar's Test P-Value : NA
#>
#> Sensitivity : 1.0000
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 1.0000
#> Prevalence : 0.3846
#> Detection Rate : 0.3846
#> Detection Prevalence : 0.3846
#> Balanced Accuracy : 1.0000
#>
#> 'Positive' Class : Good
#>
If there are differences between the two, we can replace the lasso labels with the (generally) more reliable manual labels.
class_labels <- lasso_classes
class_labels[!is.na(manual_classes)] <- manual_classes[!is.na(manual_classes)]
Of course, both labeling methods are designed to return partially-labeled datasets. To get quality estimates for the remaining features, we can use a logistic model trained on a few of our metrics to estimate the quality of each feature. The basic idea here is to use the med_cor
and med_snr
information extracted with extractChromMetrics
as the predictive variables and the labels produced by labelFeatsManual
or labelFeatsLasso
as the predicted variable in a logistic regression. The default formula uses glm(formula=feat_class~med_cor+med_snr, family = binomial)
to fit the values but additional metrics that exist in the shape_metrics
data frame can be used as predictive variables in the model as well.
pred_probs <- logModelFeatProb(shape_metrics, class_labels)
#> Logistic model regression coefficients:
#> (Intercept) med_cor med_snr
#> -7.0349466 4.8379451 0.6066241
if (interactive()) {
ggsave("intro_model_spread.png",
plot = last_plot(), device = "png",
path = "vignettes", width = 6, height = 4, units = "in", dpi = 120
)
}
hist(pred_probs, breaks = 20, main = NULL, xlab = "Logistic likelihood")
The values returned by logModelFeatProb
can then be handled manually, or you can use the wrapper function logModelFeatQuality
with an associated likelihood_threshold
to return the binary good/bad classification. This also then makes it possible to assess the performance of the model on the training data itself:
pred_classes <- logModelFeatQuality(shape_metrics, class_labels, verbosity = 1)
#> Logistic model regression coefficients:
#> (Intercept) med_cor med_snr
#> -7.0349466 4.8379451 0.6066241
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Bad Good
#> Bad 61 6
#> Good 4 35
#>
#> Accuracy : 0.9057
#> 95% CI : (0.8333, 0.9538)
#> No Information Rate : 0.6132
#> P-Value [Acc > NIR] : 1.159e-11
#>
#> Kappa : 0.7993
#>
#> Mcnemar's Test P-Value : 0.7518
#>
#> Sensitivity : 0.8537
#> Specificity : 0.9385
#> Pos Pred Value : 0.8974
#> Neg Pred Value : 0.9104
#> Prevalence : 0.3868
#> Detection Rate : 0.3302
#> Detection Prevalence : 0.3679
#> Balanced Accuracy : 0.8961
#>
#> 'Positive' Class : Good
#>
Finally, once an appropriate threshold has been set and the model’s performance is acceptable the XCMS object itself can be edited to remove the poor-quality peaks. This function also notes squallms
in the object’s processing history (albeit poorly).
final_xcms <- updateXcmsObjFeats(shape_metrics, class_labels,
verbosity = 0,
likelihood_threshold = 0.5
)
The final XCMS object can then be passed to downstream software with confidence!
sessionInfo()
#> R Under development (unstable) (2024-03-18 r86148)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] caret_6.0-94 lattice_0.22-6 MSnbase_2.29.4
#> [4] ProtGenerics_1.35.4 S4Vectors_0.41.5 mzR_2.37.3
#> [7] Rcpp_1.0.12 Biobase_2.63.1 BiocGenerics_0.49.1
#> [10] RaMS_1.4.0 xcms_4.1.11 BiocParallel_1.37.1
#> [13] ggplot2_3.5.0 tibble_3.2.1 tidyr_1.3.1
#> [16] dplyr_1.1.4 squallms_0.99.3 BiocStyle_2.31.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.8
#> [3] MultiAssayExperiment_1.29.1 magrittr_2.0.3
#> [5] farver_2.1.1 MALDIquant_1.22.2
#> [7] rmarkdown_2.26 fs_1.6.3
#> [9] zlibbioc_1.49.3 vctrs_0.6.5
#> [11] base64enc_0.1-3 htmltools_0.5.8.1
#> [13] S4Arrays_1.3.6 progress_1.2.3
#> [15] pROC_1.18.5 SparseArray_1.3.4
#> [17] mzID_1.41.0 parallelly_1.37.1
#> [19] sass_0.4.9 bslib_0.7.0
#> [21] htmlwidgets_1.6.4 plyr_1.8.9
#> [23] lubridate_1.9.3 impute_1.77.0
#> [25] plotly_4.10.4 cachem_1.0.8
#> [27] igraph_2.0.3 mime_0.12
#> [29] lifecycle_1.0.4 iterators_1.0.14
#> [31] pkgconfig_2.0.3 Matrix_1.7-0
#> [33] R6_2.5.1 fastmap_1.1.1
#> [35] future_1.33.2 GenomeInfoDbData_1.2.12
#> [37] MatrixGenerics_1.15.0 shiny_1.8.1.1
#> [39] clue_0.3-65 digest_0.6.35
#> [41] pcaMethods_1.95.0 colorspace_2.1-0
#> [43] GenomicRanges_1.55.4 labeling_0.4.3
#> [45] Spectra_1.13.8 timechange_0.3.0
#> [47] fansi_1.0.6 httr_1.4.7
#> [49] abind_1.4-5 compiler_4.4.0
#> [51] proxy_0.4-27 withr_3.0.0
#> [53] doParallel_1.0.17 DBI_1.2.2
#> [55] highr_0.10 lava_1.8.0
#> [57] MASS_7.3-60.2 MsExperiment_1.5.4
#> [59] DelayedArray_0.29.9 ModelMetrics_1.2.2.2
#> [61] tools_4.4.0 PSMatch_1.7.2
#> [63] httpuv_1.6.15 future.apply_1.11.2
#> [65] nnet_7.3-19 glue_1.7.0
#> [67] nlme_3.1-164 QFeatures_1.13.3
#> [69] promises_1.2.1 grid_4.4.0
#> [71] reshape2_1.4.4 cluster_2.1.6
#> [73] generics_0.1.3 recipes_1.0.10
#> [75] gtable_0.3.4 class_7.3-22
#> [77] preprocessCore_1.65.0 data.table_1.15.4
#> [79] hms_1.1.3 MetaboCoreUtils_1.11.2
#> [81] xml2_1.3.6 utf8_1.2.4
#> [83] XVector_0.43.1 stringr_1.5.1
#> [85] foreach_1.5.2 pillar_1.9.0
#> [87] limma_3.59.6 later_1.3.2
#> [89] splines_4.4.0 survival_3.5-8
#> [91] tidyselect_1.2.1 knitr_1.45
#> [93] bookdown_0.38 IRanges_2.37.1
#> [95] SummarizedExperiment_1.33.3 xfun_0.43
#> [97] statmod_1.5.0 hardhat_1.3.1
#> [99] timeDate_4032.109 matrixStats_1.2.0
#> [101] stringi_1.8.3 lazyeval_0.2.2
#> [103] yaml_2.3.8 evaluate_0.23
#> [105] codetools_0.2-20 MsCoreUtils_1.15.6
#> [107] BiocManager_1.30.22 cli_3.6.2
#> [109] affyio_1.73.0 rpart_4.1.23
#> [111] xtable_1.8-4 munsell_0.5.1
#> [113] jquerylib_0.1.4 GenomeInfoDb_1.39.11
#> [115] MassSpecWavelet_1.69.0 globals_0.16.3
#> [117] XML_3.99-0.16.1 parallel_4.4.0
#> [119] gower_1.0.1 prettyunits_1.2.0
#> [121] AnnotationFilter_1.27.0 listenv_0.9.1
#> [123] viridisLite_0.4.2 ipred_0.9-14
#> [125] e1071_1.7-14 prodlim_2023.08.28
#> [127] MsFeatures_1.11.0 scales_1.3.0
#> [129] affy_1.81.0 ncdf4_1.22
#> [131] purrr_1.0.2 crayon_1.5.2
#> [133] rlang_1.1.3 vsn_3.71.1
Vignette last built on 2024-04-04