--- title: "POMA Workflow" author: - name: Pol Castellano-Escuder affiliation: Duke University email: polcaes@gmail.com date: "`r BiocStyle::doc_date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{POMA Workflow} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} %\VignetteEncoding{UTF-8} bibliography: ["POMA.bib"] biblio-style: apalike link-citations: true --- **Compiled date**: `r Sys.Date()` **Last edited**: 2021-12-12 **License**: `r packageDescription("POMA")[["License"]]` ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, # fig.align = "center", comment = ">" ) ``` # Installation Run the following code to install the Bioconductor version of package. ```{r, eval = FALSE} # install.packages("BiocManager") BiocManager::install("POMA") ``` # Load POMA ```{r, warning = FALSE, message = FALSE, comment = FALSE} library(POMA) ``` You can also load some additional packages that will be very useful in this vignette. ```{r, warning = FALSE, message = FALSE, comment = FALSE} library(ggplot2) library(ggraph) library(plotly) ``` # The POMA Workflow `POMA` functions can be divided in three sequential well separated blocks: **Data Preparation**, **Pre-processing** and **Statistical Analysis**. ## Data Preparation The **SummarizedExperiment** Bioconductor package provides a well defined computational data structures to represent omics experiment data types [@SummarizedExperiment]. Since data structures can mean a marked improvement in data analysis, **POMA** functions use **SummarizedExperiment** objects from **SummarizedExperiment** package, allowing the reusability of existing methods for this class and contributing to the improvement of robust and reproducible workflows. The first step of workflow will be load or create a `SummarizedExperiment` object. Often, you will have your data stored in separated matrix and/or data frames and you will want to create your `SummarizedExperiment` object. The `PomaSummarizedExperiment` function makes this step fast and easy building this `SummarizedExperiment` object for you. ```{r, eval = FALSE} # create an SummarizedExperiment object from two separated data frames target <- readr::read_csv("your_target.csv") features <- readr::read_csv("your_features.csv") data <- PomaSummarizedExperiment(target = target, features = features) ``` Alternatively, if your data is already stored in a `SummarizedExperiment` object, you can skip this step and go directly to the pre-processing step. In this vignette we will use the sample data provided in `POMA`. ```{r} # load example data data("st000336") st000336 ``` ### Brief Description of Example Data This example data is composed of 57 samples, 31 metabolites, 1 covariate and 2 experimental groups (Controls and DMD) from a targeted LC/MS study. _Duchenne Muscular Dystrophy (DMD) is an X-linked recessive form of muscular dystrophy that affects males via a mutation in the gene for the muscle protein, dystrophin. Progression of the disease results in severe muscle loss, ultimately leading to paralysis and death. Steroid therapy has been a commonly employed method for reducing the severity of symptoms. This study aims to quantify the urine levels of amino acids and organic acids in patients with DMD both with and without steroid treatment. Track the progression of DMD in patients who have provided multiple urine samples._ This data was collected from [here](https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&DataMode=AllData&StudyID=ST000336&StudyType=MS&ResultType=1#DataTabs). ## Pre Processing This is a critical point in the workflow because all final statistical results will depend on the decisions made here. Again, this block can be divided in 3 steps: **Missing Value Imputation**, **Normalization** and **Outlier Detection**. ### Missing Value Imputation Often, due to biological and technical reasons, some features can not be identified or quantified in some samples in MS [@imputation]. **POMA** offers 7 different imputation methods to deal with this situation. Just run the following line of code to impute your missings! ```{r} imputed <- PomaImpute(st000336, ZerosAsNA = TRUE, RemoveNA = TRUE, cutoff = 20, method = "knn") imputed ``` ### Normalization The next step of this block is the data normalization. Often, some factors can introduce variability in some types of MS data having a critical influence on the final statistical results, making normalization a key step in the workflow [@normalization]. Again, **POMA** offers several methods to normalize the data by running just the following line of code: ```{r} normalized <- PomaNorm(imputed, method = "log_pareto") normalized ``` #### Normalization effect Sometimes, you will be interested in _how the normalization process affect your data_? To answer this question, **POMA** offers two exploratory functions, `PomaBoxplots` and `PomaDensity`, that can help to understand the normalization process. `PomaBoxplots` generates boxplots for all samples or features (depending on the group factor) of a `SummarizedExperiment` object. Here, we can compare objects before and after normalization step. ```{r, message = FALSE, comment = FALSE} PomaBoxplots(imputed, group = "samples", jitter = FALSE) + ggtitle("Not Normalized") + theme(legend.position = "none") # data before normalization ``` ```{r, message = FALSE, comment = FALSE} PomaBoxplots(normalized, group = "samples", jitter = FALSE) + ggtitle("Normalized") # data after normalization ``` On the other hand, `PomaDensity` shows the distribution of all features before and after the normalization process. ```{r, message = FALSE, comment = FALSE} PomaDensity(imputed, group = "features") + ggtitle("Not Normalized") + theme(legend.position = "none") # data before normalization ``` ```{r, message = FALSE, comment = FALSE} PomaDensity(normalized, group = "features") + ggtitle("Normalized") # data after normalization ``` ### Outlier Detection Finally, the last step of this block is the Outlier Detection. Outlers are defined as observations that are not concordant with those of the vast majority of the remaining data points. These values can have an enormous influence on the resultant statistical analysis, being a dangerous ground for all required assumptions in the most commonly applied parametric tests in mass spectrometry as well as for all also required assumptions in many regression techniques and predictive modeling approaches. **POMA** allows the analysis of outliers as well as the possibility to remove them from the analysis using different modulable parameters. Analyze and remove outliers running the following two lines of code. ```{r} PomaOutliers(normalized, do = "analyze")$polygon_plot # to explore pre_processed <- PomaOutliers(normalized, do = "clean") # to remove outliers pre_processed ``` ## Statistical Analysis Once the data have been pre-processed, you can start with the statistical analysis step! **POMA** offers many different statistical methods and possible combinations to compute. However, in this vignette we will comment only some of the most used. ### Univariate Analysis **POMA** allows you to perform all of the most used univariate statistical methods in MS by using only one function! `PomaUnivariate` wrap 4 different univariate methods (ttest, ANOVA and ANCOVA, Wilcoxon test and Kruskal-Wallis Rank Sum Test) that you can perform changing only the "method" argument. #### T-test ```{r} PomaUnivariate(pre_processed, method = "ttest") ``` You can also compute a volcano plot using the T-test results. _Note that we're using the non-normalized object to avoid negative values in our data._ ```{r} PomaVolcano(imputed, pval = "adjusted") ``` #### Wilcoxon Test ```{r, warning = FALSE} PomaUnivariate(pre_processed, method = "mann") ``` ### Limma Other of the wide used statistical methods in many different omics, such as epigenomics or transcriptomics, is **limma** [@limma]. **POMA** provides an easy use implementation of _limma_ you only have to specify the desired contrast to compute. ```{r} PomaLimma(pre_processed, contrast = "Controls-DMD", adjust = "fdr") ``` ### Multivariate Analysis On the other hand, multivariate analysis implemented in **POMA** is quite similar to the univariate approaches. `PomaMultivariate` allows users to compute a PCA, PLS-DA or sPLS-DA by changing only the "method" parameter. This function is based on **mixOmics** package [@mixOmics]. #### Principal Component Analysis ```{r} poma_pca <- PomaMultivariate(pre_processed, method = "pca") ``` ```{r} poma_pca$scoresplot + ggtitle("Scores Plot") ``` #### PLS-DA ```{r, comment = FALSE, warning = FALSE, message = FALSE, results = 'hide'} poma_plsda <- PomaMultivariate(pre_processed, method = "plsda") ``` ```{r} poma_plsda$scoresplot + ggtitle("Scores Plot") ``` ```{r} poma_plsda$errors_plsda_plot + ggtitle("Error Plot") ``` ### Correlation Analysis Often, correlation analysis is used to explore and discover relationships and patterns within our data. `PomaCorr` provides a flexible and easy way to do that providing a table with all pairwise coorelations in the data, a correlogram and a correlation graph. ```{r} poma_cor <- PomaCorr(pre_processed, label_size = 8, coeff = 0.6) poma_cor$correlations poma_cor$corrplot poma_cor$graph ``` Alternatively, if you switch the "corr_type" parameter to "glasso", this function will compute a **Gaussian Graphical Model** using the **glmnet** package [@glasso]. ```{r} PomaCorr(pre_processed, corr_type = "glasso", coeff = 0.6)$graph ``` ### Lasso, Ridge and Elasticnet **POMA** also provides a function to perform a Lasso, Ridge and Elasticnet regression for binary outcomes in a very intuitive and easy way. `PomaLasso` is based on **glmnet** package [@glmnet]. This function allows you to create a test subset in your data, evaluate the prediction of your models and export the model computed (it could be useful to perform prediction models with MS data). If "ntest" parameter is set to NULL, `PomaLasso` will use all observations to create the model (useful for feature selection). ```{r} # alpha = 1 for Lasso PomaLasso(pre_processed, alpha = 1, labels = TRUE)$coefficientPlot ``` ### Random Forest Finally, the random forest algorithm is also implemented in **POMA**. `PomaRandForest` uses the **randomForest** package [@randomForest] to facilitate the implementation of the algorithm and creates automatically both test and train sets to compute and evaluate the resultant models. ```{r} poma_rf <- PomaRandForest(pre_processed, ntest = 10, nvar = 10) poma_rf$error_tree ``` Resultant random forest model confusion matrix for **test** set: ```{r} poma_rf$confusionMatrix$table ``` Gini index plot for the top 10 predictors: ```{r} poma_rf$MeanDecreaseGini_plot ``` # Session Information ```{r} sessionInfo() ``` # References