---
title: "**phenomis**: Postprocessing and univariate statistical analysis of omics data"
author: "Etienne A. Thevenot"
date: "`r doc_date()`"
package: "`r pkg_ver('phenomis')`"
vignette: >
  %\VignetteIndexEntry{phenomis-vignette}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{BatchEffect, Clustering, MassSpectrometry, Metabolomics, Normalization, Proteomics, QualityControl, StatisticalMethod, Transcriptomics}
  %\VignetteEngine{knitr::rmarkdown}
bibliography: "phenomis-vignette.bib"
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
editor_options:
  markdown:
    wrap: 72
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width = 5, fig.height = 5)
```

# Introduction

## Context

The analysis of molecular phenotyping (e.g. metabolomics) data sets (i.e. samples times variables tables of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at **mining the data** (e.g. detecting trends and outliers) and selecting features of predictive value (**biomarker discovery**). It comprises multiple steps including:

- **Post-processing** (quality control, normalization and/or transformation of intensities)

- **Statistical analysis** (univariate hypothesis testing, multivariate modeling, feature selection)

- **Annotation** (physico-chemical properties, biological pathways)

The **`phenomis`** package focuses on the first two steps, and can be combined with other packages for multivariate modeling, feature selection and annotation, such as the [**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls), [**`biosigner`**](https://doi.org/10.18129/B9.bioc.biosigner), and [**`biodb`**](https://www.bioconductor.org/packages/biodb/) Bioconductor packages.

As an example, the tutorial below reproduces the whole workflow described in the `sacurine` study [@thevenot_analysis_2015]. The `phenomis` package has also been used to post-process the preclinical, proteomics and metabolomics data from the ProMetIS study [@Imbert_PrometisDeepPhenotyping_2021]. Proteomics and metabolomics data from this study are provided as a second dataset, named `prometis`. Examples of statistical analyses of this dataset are provided in the "Working with multi-omics datasets" section.

## Methods

![Methods from the **phenomis**, **ropls**, and **biosigner** packages for the analysis of omics datasets; specific parameter values used for the **sacurine** metabolomics dataset described in the 'Hands-on' part below are provided as examples.](figures/phenomis_methods.png)

## Formats

### Managing data and metadata: the `SummarizedExperiment` and `MultiAssayExperiment` formats

The standard `SummarizedExperiment` and `MultiAssayExperiment` Bioconductor formats for single and multi-omics datasets are used by the `phenomis` methods. The methods return updated `SummarizedExperiment` and `MultiAssayExperiment` objects with either modified assay matrices (e.g. when using the `transforming` method) or `rowData` and `colData` enriched with new columns (e.g. fold changes and *p*-values when using the `hypotesting` method). Note that the `phenomis` package also supports the `ExpressionSet` (respectively, `MultiDataSet`) formats, which are previous (respectively, alternative) formats for the management of single (respectively, multiple) datasets.

### Importing from/exporting to tabular files

Input (i.e. preprocessed) data consists of a 'variable times sample' matrix of intensities (**dataMatrix** numeric matrix), in addition to sample and variable metadata (**sampleMetadata** and **variableMetadata** data frames); a minimal toy example of the expected layout is sketched below.

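
As an illustration only, here is a minimal toy sketch of the three tables (the feature names, sample names, and values are hypothetical), showing how their dimension names are expected to match (as detailed below):

```{r tabular_toy, eval = FALSE}
# Toy sketch of the expected tabular layout (hypothetical names and values)
dataMatrix <- matrix(rnorm(6), nrow = 3, ncol = 2,
                     dimnames = list(c("feat1", "feat2", "feat3"),
                                     c("samp1", "samp2")))
sampleMetadata <- data.frame(row.names = c("samp1", "samp2"),
                             sampleType = c("sample", "pool"),
                             injectionOrder = c(1L, 2L),
                             batch = c("B1", "B1"))
variableMetadata <- data.frame(row.names = c("feat1", "feat2", "feat3"),
                               mass_to_charge = c(191.02, 282.12, 347.04))
# The dimension names must be consistent across the three tables
stopifnot(identical(rownames(dataMatrix), rownames(variableMetadata)),
          identical(colnames(dataMatrix), rownames(sampleMetadata)))
```
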
Importantly, the row names of the dataMatrix must be identical to the row names of the variableMetadata (feature names), and the column names of the dataMatrix must be identical to the row names of the sampleMetadata (sample names).

Note that 3 sample metadata columns should be specified when working with the `correcting` method, namely:

- **sampleType** (character): the following types are handled by the algorithms:

    - **sample**: biological sample of interest

    - **blank**: e.g. solvent only in Liquid Chromatography coupled to Mass Spectrometry

    - **pool**: quality control sample generated by pooling an equal volume of each of the samples of interest from the whole study (i.e. from all batches)

    - **poolN** (where *N* is an integer; e.g. pool2, pool4, ...): diluted quality control sample; *N* indicates the dilution factor

- **injectionOrder** (integer): order of the sample injection (e.g. in the LC-MS instrument)

- **batch** (character): name of each batch

## Text and graphical outputs

Text and graphics can be handled with the *phenomis* methods by setting the two arguments:

- **`report.c`**: if set to 'interactive' [default], messages are displayed interactively; if set to a character corresponding to a filename (with the '.txt' extension), messages are diverted to this file by using the *sink* function internally (the same file can be appended by successive methods); if set to 'none', messages are suppressed

- **`figure.c`**: if set to 'interactive' [default], graphics are displayed interactively; if set to a character corresponding to a filename (with the '.pdf' extension), a pdf file with the plot is generated instead; if set to 'none', graphics are suppressed

# Hands-on

## The **sacurine** cohort study

As an example, we will use the **`phenomis`** package to study the **sacurine** human cohort. The study is aimed at characterizing the physiological variations of the human urine metabolome with age, body mass index (BMI), and gender [@thevenot_analysis_2015]. Urine samples from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh performance liquid chromatography (UPLC) coupled to high-resolution mass spectrometry (LTQ-Orbitrap). Raw data are publicly available on the MetaboLights repository ([MTBLS404](https://www.ebi.ac.uk/metabolights/MTBLS404)).

This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):

- **`correcting`**: Correction of the signal drift by local regression (loess) modeling of the intensity trend in pool samples [@dunn_procedures_2011]; Adjustment of offset differences between the two analytical batches by using the average of the pool intensities in each batch [@van_der_kloet_analytical_2009]

- Variable quality control by discarding features with a coefficient of variation above 30% in *pool* samples

- Normalization by the sample osmolality

- **`transforming`**: log10 transformation

- **`inspecting`**: Computing metrics to filter out outlier samples according to the weighted Hotelling's T2 distance [@tenenhaus_approche_1999], the Z-score of one of the intensity distribution deciles [@alonso_astream_2011], and the Z-score of the number of missing values [@alonso_astream_2011].
    A 0.001 threshold for all *p*-values results in the HU_096 sample being discarded

- **`hypotesting`**: Univariate hypothesis testing of significant variations with age, BMI, or between genders (Student's t-test with Benjamini-Hochberg correction)

- PCA exploration of the data set; [**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls) Bioconductor package [@thevenot_analysis_2015]

- **`clustering`**: Hierarchical clustering

- OPLS(-DA) modeling of age, BMI and gender; [**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls) Bioconductor package [@thevenot_analysis_2015]

- Selection of the features which significantly contribute to the discrimination between genders with PLS-DA, Random Forest, or Support Vector Machine classifiers; [**`biosigner`**](https://doi.org/10.18129/B9.bioc.biosigner) Bioconductor package [@rinaudo_biosigner_2016]

A Galaxy version of this analysis is available as [W4M00001 'Sacurine-statistics'](https://doi.org/10.15454/1.4811121736910142E12) on the [workflow4metabolomics.usegalaxy.fr](https://workflow4metabolomics.usegalaxy.fr/) online infrastructure [@guitton_create_2017].

## **`reading`**: reading the data

The **`reading`** function reads the data sets and builds the `SummarizedExperiment` object. Additional information on how to build and handle `SummarizedExperiment` objects (as well as `MultiAssayExperiment`, `ExpressionSet`, and `MultiDataSet`) is provided in the **Appendix**.

```{r reading, message = FALSE}
library(phenomis)
sacurine.se <- reading(system.file("extdata/sacurine", package = "phenomis"))
```

## **`inspecting`**: looking at the data

The `inspecting` method provides a numerical and graphical overview of the data. Furthermore, it computes quality metrics which may subsequently be used to filter out some samples or variables.

- Graphical overview. The data matrix is visualized with a color gradient (top right) and the score plot of the Principal Component Analysis is shown for the first two components (bottom right). The black ellipse corresponds to the area containing 95% of the samples, as computed with Hotelling's T2 test. For each sample, the mean of the variable intensities is shown as a function of the injection order to detect any signal drift and/or batch effect (bottom left). Note that for this coloring to be displayed, the **sampleType**, **injectionOrder** and **batch** columns should be provided in the `colData` of (each of) the dataset(s). Finally, some metrics are computed regarding the proportion of NAs, 0 values, intensity quantiles, and the proportion of features with a coefficient of variation in pool intensities \< 30%.

- Quality metrics (as additional columns in the metadata):

    - **samples** (added in the `colData`):

        - hotel_pval: *p*-value from Hotelling's T2 test in the plane of the first two principal components

        - miss_pval: *p*-value associated with the z-score of the proportion of missing values [@alonso_astream_2011]

        - deci_pval: *p*-value associated with the z-score of intensity deciles [@alonso_astream_2011]

        For each test, low *p*-values highlight samples with extreme behavior.

    - **features** (included in the `rowData`):

        - When the type of samples is available (i.e. the **sampleType** column is included in the input `colData` from the individual datasets), variable metrics are computed: the sample, pool, and blank mean, standard deviation, and coefficient of variation (if the corresponding types are present in the 'sampleType' column), as well as the 'blank' mean / 'sample' mean and 'pool' CV / 'sample' CV ratios

        - If pool dilutions have been used and are indicated in the 'sampleType' column as poolN, where N is an integer indicating the dilution factor (e.g. pool2 for a two-fold dilution of the pool; note that the non-diluted pool remains indicated as 'pool'), the Pearson correlation (and corresponding *p*-value) between the intensity and the dilution factor is computed for each variable

```{r inspecting}
sacurine.se <- inspecting(sacurine.se, report.c = "none")
```

The `filtering` method may be used to discard samples and/or features with a high proportion of missing values and/or a variance of 0. When filtering the features, a sample class may be specified (as the name of a column from the sample metadata): in this case, features will be kept when:

1. the proportion of missing values is below the threshold in at least one class

2. the variance is above 0 in both classes

## Post-processing

### **`correcting`**: Correcting signal drift and batch effect

For each feature, the signal drift and the batch effect are corrected by using a normalization strategy which relies on the measurements of a pooled (or QC) sample injected periodically: for each variable, a regression model is fitted to the values of the pool and subsequently used to adjust the intensities of the samples of interest [@van_der_kloet_analytical_2009; @dunn_procedures_2011]. The sample metadata of each dataset (e.g. the `colData` DataFrame) must contain 3 columns:

1. **sampleType** (character): either 'sample', 'blank', or 'pool'

2. **injectionOrder** (integer): order of injection in the instrument

3. **batch** (character): batch name

```{r correcting, fig.dim = c(6, 5)}
sacurine.se <- correcting(sacurine.se, reference.vc = "pool", report.c = "none")
```

### Variable filtering

- using **`inspecting`** to compute the 'pool_CV' metric

```{r inspecting_variable_filtering}
sacurine.se <- inspecting(sacurine.se, figure.c = "none", report.c = "none")
```

- discarding features with 'pool_CV' \> 0.3

```{r discarding_pool_CV_sup_0.3}
sacurine.se <- sacurine.se[rowData(sacurine.se)[, "pool_CV"] <= 0.3, ]
```

- discarding the 'pool' observations

```{r discarding_pools}
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "sampleType"] != "pool"]
print(sacurine.se)
```

### Normalization

The normalization of each sample by its 'osmolality' value does not require any extra method and relies on the classical methods from the `SummarizedExperiment` package.

```{r normalizing_osmolality}
assay(sacurine.se) <- sweep(assay(sacurine.se), 2, colData(sacurine.se)[, "osmolality"], "/")
```

Note that a `normalizing` method is available in the `phenomis` package to perform Probabilistic Quotient Normalization (PQN) [@dieterleProbabilisticQuotientNormalization2006].
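
To illustrate the PQN principle only (a minimal base-R sketch, not the `phenomis` implementation; it assumes the feature-wise median across all remaining samples as the reference profile), each sample is divided by the median of the feature-wise quotients between its intensities and the reference:

```{r pqn_sketch, eval = FALSE}
# Conceptual PQN sketch (see the 'normalizing' method for the actual implementation)
int.mn <- assay(sacurine.se)                        # features x samples
ref.vn <- apply(int.mn, 1, median, na.rm = TRUE)    # reference profile (feature-wise median)
quot.mn <- sweep(int.mn, 1, ref.vn, "/")            # quotients to the reference
dilu.vn <- apply(quot.mn, 2, median, na.rm = TRUE)  # per-sample dilution factor (median quotient)
pqn.mn <- sweep(int.mn, 2, dilu.vn, "/")            # normalized intensities
```
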

### **`transforming`**: transforming the data intensities

```{r transforming}
sacurine.se <- transforming(sacurine.se, method.vc = "log10")
```

### Sample filtering

```{r sample_filtering}
sacurine.se <- inspecting(sacurine.se, figure.c = "none", report.c = "none")
sacurine.se <- sacurine.se[, colData(sacurine.se)[, "hotel_pval"] >= 0.001 &
                             colData(sacurine.se)[, "miss_pval"] >= 0.001 &
                             colData(sacurine.se)[, "deci_pval"] >= 0.001]
```

Final visual check of the data before performing the statistical analyses:

```{r final_check}
inspecting(sacurine.se, report.c = "none")
```

## **`hypotesting`**: univariate hypothesis testing

The `hypotesting` method is a wrapper of the main R functions for hypothesis testing and correction for multiple testing. The list of available tests includes two-sample tests (the t-test and the Wilcoxon rank test, but also the limma test), analysis of variance (for one and two factors) and the Kruskal-Wallis rank test, and correlation tests (by using either the Pearson or the Spearman correlation).

```{r hypotesting, warning=FALSE}
sacurine.se <- hypotesting(sacurine.se, test.c = "ttest", factor_names.vc = "gender",
                           adjust.c = "BH", adjust_thresh.n = 0.05, report.c = "none")
```

The `phenomis` package can be conveniently complemented with other packages for comprehensive statistical analysis and annotation. In the rest of this tutorial, we illustrate how the [**`ropls`**](https://doi.org/10.18129/B9.bioc.ropls), [**`biosigner`**](https://doi.org/10.18129/B9.bioc.biosigner) and [**`biodb`**](https://www.bioconductor.org/packages/biodb/) Bioconductor packages can be included to perform multivariate modeling [@thevenot_analysis_2015], feature selection [@rinaudo_biosigner_2016], and feature annotation [@roger2022]. All these packages support the `SummarizedExperiment` and `MultiAssayExperiment` formats (but also `ExpressionSet` and `MultiDataSet`), thus providing interoperability.

## Unsupervised analysis

We first focus on unsupervised methods for multivariate analysis, by performing Principal Component Analysis and hierarchical clustering.

### Principal component analysis: PCA

PCA can be performed by using the `opls` function from the `ropls` package. It returns an updated object with scores and loadings included as new columns in the metadata. The PCA model is stored in the metadata of the `SummarizedExperiment` and can be accessed with the `getOpls` method:

```{r PCA}
sacurine.se <- ropls::opls(sacurine.se, info.txt = "none")
sacurine.pca <- ropls::getOpls(sacurine.se)[["PCA"]]
ropls::plot(sacurine.pca, parAsColFcVn = colData(sacurine.se)[, "gender"], typeVc = "x-score")
ropls::plot(sacurine.pca, parAsColFcVn = colData(sacurine.se)[, "age"], typeVc = "x-score")
```

### **`clustering`**: hierarchical clustering

The `clustering` method from the `phenomis` package performs the hierarchical clustering of samples and variables. The default dissimilarity is $1 - cor_{pearson}$.

```{r heatmap}
sacurine.se <- clustering(sacurine.se, clusters.vi = c(5, 3))
```

## Supervised modeling

We now address the multivariate modeling of the response of interest (either age, BMI, or gender), by using Partial Least Squares modeling and feature selection.

### Partial Least Squares modeling: (O)PLS(-DA)

Partial Least-Squares (PLS), which is a latent variable regression method based on the covariance between the predictors and the response, has been shown to efficiently handle datasets with multi-collinear predictors, as in the case of spectrometry measurements [@Wold2001].

The `opls` function from the `ropls` package can be used to perform Partial Least Squares modeling, by specifying the response to be modeled as the second argument:

```{r plsda}
sacurine.se <- ropls::opls(sacurine.se, "gender")
```

The (O)PLS(-DA) model can be retrieved from the metadata, e.g. to generate specific plots:

```{r plsda_scoreplot, eval = FALSE}
sacurine_gender.plsda <- ropls::getOpls(sacurine.se)[["gender_PLSDA"]]
ropls::plot(sacurine_gender.plsda, parAsColFcVn = colData(sacurine.se)[, "gender"], typeVc = "x-score")
```

### Feature selection

The `biosigner` approach is a wrapper method to select the features that significantly contribute to the performance of a binary classifier, namely PLS-DA, Random Forest, or Support Vector Machine [@rinaudo_biosigner_2016]. We refer the interested reader to the vignette of the `biosigner` package.

```{r biosigner, fig.width = 5, fig.height = 5}
sacurine.biosign <- biosigner::biosign(sacurine.se, "gender", bootI = 5)
```

Please note that the `bootI` parameter is set to 5 to speed up the vignette building; we advise keeping the default number of 50 bootstraps to obtain more stable signatures.

## **`annotating`**: MS annotation

This method is based on the [**biodb**](https://www.bioconductor.org/packages/biodb/) package (and its companion [**biodbChebi**](https://www.bioconductor.org/packages/biodbChebi/)), which provides access to standard remote chemical and biological databases (ChEBI, KEGG, HMDB, ...), as well as to in-house local database files (CSV, SQLite), with easy retrieval of entries, access to web services, and search of compounds by mass and/or name [@roger2022].

Viewing the parameters of the **`annotating`** method and their default values:

```{r annotating_parameters}
annotating_parameters()
```

Querying chemical annotation from the Chemical Entities of Biological Interest [ChEBI](https://www.ebi.ac.uk/chebi/) database:

1) by using *mz* values:

```{r chebi_annotation, eval = FALSE}
sacurine.se <- annotating(sacurine.se, database.c = "chebi",
                          param.ls = list(query.type = "mz", query.col = "mass_to_charge",
                                          ms.mode = "neg", mz.tol = 10, mz.tol.unit = "ppm",
                                          max.results = 3, prefix = "chebiMZ."),
                          report.c = "none")
knitr::kable(head(rowData(sacurine.se)[, grep("chebiMZ", colnames(rowData(sacurine.se)))]))
```

2) by using ChEBI identifiers:

```{r chebi_identifier, eval = FALSE}
sacurine.se <- annotating(sacurine.se, database.c = "chebi",
                          param.ls = list(query.type = "chebi.id", query.col = "database_identifier",
                                          prefix = "chebiID."))
knitr::kable(head(rowData(sacurine.se)[, grep("chebiID", colnames(rowData(sacurine.se)))]))
```

Adding chemical information from a local database:

1) loading a local (example) MS database:

```{r localdbDF}
msdbDF <- read.table(system.file("extdata/local_ms_db.tsv", package = "phenomis"),
                     header = TRUE, sep = "\t", stringsAsFactors = FALSE)
```

2) querying the local database:

```{r localms_annotation, message=FALSE, warning=FALSE}
sacurine.se <- annotating(sacurine.se, database.c = "local.ms",
                          param.ls = list(query.type = "mz", query.col = "mass_to_charge",
                                          ms.mode = "neg", mz.tol = 5, mz.tol.unit = "ppm",
                                          local.ms.db = msdbDF, prefix = "localMS."),
                          report.c = "none")
knitr::kable(rowData(sacurine.se)[!is.na(rowData(sacurine.se)[, "localMS.accession"]),
                                  grep("localMS.", colnames(rowData(sacurine.se)), fixed = TRUE)])
```

## **`writing`**: Exporting the results

Finally, the `writing` method can be used to export the `SummarizedExperiment` object as the 3 tabular files (dataMatrix, sampleMetadata, and variableMetadata).
In the case of a `MultiAssayExperiment`, one subfolder containing the 3 files will be created for each dataset.

```{r writing, eval = FALSE}
writing(sacurine.se, dir.c = getwd(), prefix.c = "sacurine")
```

## Graphical User Interface

The main features of the `phenomis` package can also be accessed via a graphical user interface in the **Quality Metrics**, **Batch Correction**, **Univariate**, and **Heatmap** modules from the [workflow4metabolomics.usegalaxy.fr](https://workflow4metabolomics.usegalaxy.fr/) online resource for computational metabolomics, based on the Galaxy environment.

## Working with multi-omics datasets

The `MultiAssayExperiment` format is useful to handle **multi-omics** datasets [@Ramos_SoftwareIntegrationMultiomics_2017]. The `phenomis` package (as well as the `ropls` and `biosigner` packages) can be used to process the data sets in parallel, so that all the methods described above can be directly applied to the `MultiAssayExperiment` object. In several methods, specific parameters for each dataset can be passed as a vector: for instance, with two datasets, if one wishes to log2 transform the intensities of the first dataset but not the second one, the `method.vc = c("log2", "none")` parameter can be used with the `transforming` method (in contrast, if only one value is provided, it will be used for all datasets).

Here we apply the `phenomis` statistical methods to the ProMetIS study, which addresses the molecular phenotyping of knock-out mouse models by combined proteomics and metabolomics analysis [@Imbert_PrometisDeepPhenotyping_2021]. The `prometis` dataset has been selected from the original data (publicly available on GitHub: [IFB-ElixirFr/ProMetIS](https://github.com/IFB-ElixirFr/ProMetIS)) as follows:

- **liver** data

- **WT** and KO mice for the **LAT** gene (Linker for Activation of T cells) only

- the **proteomics** dataset and the **metabolomics** dataset obtained with the ZIC-pHILIC chromatographic column

- **post-processed** data only

- **100 features** only

The multi-omics dataset is saved as text files and can be loaded with the `reading` function as a `MultiAssayExperiment` (or as a `MultiDataSet` with the `output.c = "set"` argument):

```{r pms_read}
prometis.mae <- reading(system.file("extdata/prometis", package = "phenomis"))
```

Let us have a look at the (common) sample metadata:

```{r}
head(colData(prometis.mae))
```

We first explore the datasets:

```{r pms_inspect}
prometis.mae <- inspecting(prometis.mae, report.c = "none")
```

We then perform univariate hypothesis testing with the `limma` approach [@Ritchie_LimmaPowersDifferential_2015]:

```{r pms_limma}
prometis.mae <- hypotesting(prometis.mae, "limma", "gene", report.c = "none")
```

As with single-omics studies, the `ropls` and `biosigner` packages can be directly applied to the `MultiAssayExperiment` object for multivariate analysis:

```{r pms_plsda}
prometis.mae <- ropls::opls(prometis.mae, "gene")
```

and feature selection:

```{r pms_biosign, fig.width = 5, fig.height = 5}
prometis.mae <- biosigner::biosign(prometis.mae, "gene", bootI = 5)
```

The final `MultiAssayExperiment` can be exported as text files with:

```{r pms_writing, eval = FALSE}
writing(prometis.mae, dir.c = getwd(), prefix.c = "prometis")
```

# Appendix

## Additional examples of application to single and multiple omics data sets

### CLL data set

The `CLL` package contains the chronic lymphocytic leukemia (CLL) gene expression data from 24 samples, which were classified as either progressive or stable with regard to disease progression.

We first load the data (which are stored as an `ExpressionSet` object), and restrict them to the first 1,000 features (to speed up computations):

```{r cll_load}
data(sCLLex, package = "CLL")
cll.eset <- sCLLex[seq_len(1000), ]
```

We explore the data:

```{r cll_inspect, eval = FALSE}
cll.eset <- inspecting(cll.eset, report.c = "none")
```

We then investigate whether clusters can be observed (the letter 'p' or 's' is first appended to the sample names to indicate the phenotype: p = progressive, s = stable):

```{r cll_heatmap, eval = FALSE}
Biobase::sampleNames(cll.eset) <- paste0(Biobase::sampleNames(cll.eset), "_",
                                         substr(Biobase::pData(cll.eset)[, "Disease"], 1, 1))
cll.eset <- clustering(cll.eset)
```

and perform hypothesis testing with the limma test (the dots must be removed from the 'progress.' phenotype):

```{r cll_limma, eval = FALSE}
Biobase::pData(cll.eset)[, "Disease"] <- gsub(".", "", Biobase::pData(cll.eset)[, "Disease"], fixed = TRUE)
cll.eset <- hypotesting(cll.eset, "limma", "Disease")
```

All the subsequent multivariate analysis and feature selection steps described for the `sacurine` data set can be performed on the `cll.eset` object in the same way.

### NCI60_4arrays data set

We provide an example based on the `NCI60_4arrays` cancer data set from the `omicade4` package [@mengMultivariateApproachIntegration2014], which has been made available in the `MultiAssayExperiment` format in the `ropls` package. The data consist of gene expression profiles of the NCI-60 cell lines measured on 4 microarray platforms [@Reinhold_CellMinerWebBasedSuite_2012].

Getting the `NCI60` data set as a `MultiAssayExperiment`:

```{r mae}
data(NCI60, package = "ropls")
nci.mae <- NCI60[["mae"]]
```

Let's focus on the **LE**ukemia and **ME**lanoma cancer types, and the **agilent** and **hgu95** arrays:

```{r mae_focus, message=FALSE, warning=FALSE}
library(MultiAssayExperiment)
table(nci.mae$cancer)
nci.mae <- nci.mae[, nci.mae$cancer %in% c("LE", "ME"), c("agilent", "hgu95")]
```

We can now e.g. perform clustering on each data set independently:

```{r mae_clustering, message=TRUE, warning=TRUE, eval = FALSE}
nci.mae <- clustering(nci.mae)
```

and/or perform hypothesis testing on each data set:

```{r mae_hypotesting, eval = FALSE}
nci.mae <- hypotesting(nci.mae, "limma", "cancer", report.c = "none")
```

as well as all the analyses described above for the `prometis` data set.

## Cheat sheets for Bioconductor (multi)omics containers

### `SummarizedExperiment`

The `SummarizedExperiment` format has been designed by the Bioconductor team to handle (single) omics datasets [@Morgan2022]. An example of `SummarizedExperiment` creation and a summary of useful methods are provided below. Please refer to the [package vignettes](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html) for a full description of `SummarizedExperiment` objects.

```{r se_build}
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the SummarizedExperiment (package SummarizedExperiment)
library(SummarizedExperiment)
sac.se <- SummarizedExperiment(assays = list(sacurine = t(data.mn)),
                               colData = samp.df,
                               rowData = feat.df)
# note that colData and rowData are stored as DataFrame objects,
# but data frames are accepted when building the object
stopifnot(validObject(sac.se))
# Viewing the SummarizedExperiment
# ropls::view(sac.se)
```

| [Methods]{.smallcaps} | [Description]{.smallcaps} | [Returned class]{.smallcaps} |
|--------------------------|---------------------------|-------------------|
| [**Constructors**]{.smallcaps} | | |
| **`SummarizedExperiment`** | Create a SummarizedExperiment object | `SummarizedExperiment` |
| **`makeSummarizedExperimentFromExpressionSet`** | Convert an ExpressionSet to a SummarizedExperiment | `SummarizedExperiment` |
| [**Accessors**]{.smallcaps} | | |
| **`assayNames`** | Get or set the names of assay() elements | `character` |
| **`assay`** | Get or set the ith (default first) assay element | `matrix` |
| **`assays`** | Get the list of experimental data numeric matrices | `SimpleList` |
| **`rowData`** | Get or set the row data (features) | `DataFrame` |
| **`colData`** | Get or set the column data (samples) | `DataFrame` |
| **`metadata`** | Get or set the experiment-level metadata | `list` |
| **`dim`** | Get the dimensions (features of interest x samples) | `integer` |
| **`dimnames`** | Get or set the dimension names | `list` of length 2 character/NULL |
| **`rownames`** | Get the feature names | `character` |
| **`colnames`** | Get the sample names | `character` |
| [**Conversion**]{.smallcaps} | | |
| **`as(, "SummarizedExperiment")`** | Convert an ExpressionSet to a SummarizedExperiment | `SummarizedExperiment` |

### `MultiAssayExperiment`

The `MultiAssayExperiment` format has been designed by the Bioconductor team to handle multi-omics datasets [@Ramos_SoftwareIntegrationMultiomics_2017]. An example of `MultiAssayExperiment` creation and a summary of useful methods are provided below. Please refer to the [package vignettes](https://bioconductor.org/packages/MultiAssayExperiment/) or to the original publication for a full description of `MultiAssayExperiment` objects [@Ramos_SoftwareIntegrationMultiomics_2017].

Loading the `NCI60_4arrays` from the `omicade4` package:

```{r mae_build_load}
data("NCI60_4arrays", package = "omicade4")
```

Creating the `MultiAssayExperiment`:

```{r mae_build, message = FALSE, warning=FALSE}
library(MultiAssayExperiment)
# Building the individual SummarizedExperiment instances
experiment.ls <- list()
sampleMap.ls <- list()
for (set.c in names(NCI60_4arrays)) {
  # Getting the data and metadata
  assay.mn <- as.matrix(NCI60_4arrays[[set.c]])
  coldata.df <- data.frame(row.names = colnames(assay.mn),
                           .id = colnames(assay.mn),
                           stringsAsFactors = FALSE)
  # the 'cancer' information will be stored in the main colData of the mae,
  # not in the individual SummarizedExperiments
  rowdata.df <- data.frame(row.names = rownames(assay.mn),
                           name = rownames(assay.mn),
                           stringsAsFactors = FALSE)
  # Building the SummarizedExperiment
  assay.ls <- list(se = assay.mn)
  names(assay.ls) <- set.c
  se <- SummarizedExperiment(assays = assay.ls,
                             colData = coldata.df,
                             rowData = rowdata.df)
  experiment.ls[[set.c]] <- se
  sampleMap.ls[[set.c]] <- data.frame(primary = colnames(se),
                                      colname = colnames(se)) # all datasets use identical sample names
}
sampleMap <- listToMap(sampleMap.ls)
# The sample metadata are stored in the colData data frame (all datasets have the same samples)
stopifnot(all(vapply(NCI60_4arrays,
                     function(data.df) identical(colnames(data.df), colnames(NCI60_4arrays[[1]])),
                     logical(1))))
sample_names.vc <- colnames(NCI60_4arrays[[1]])
colData.df <- data.frame(row.names = sample_names.vc,
                         cancer = substr(sample_names.vc, 1, 2))
nci.mae <- MultiAssayExperiment(experiments = experiment.ls,
                                colData = colData.df,
                                sampleMap = sampleMap)
stopifnot(validObject(nci.mae))
```

| [Methods]{.smallcaps} | [Description]{.smallcaps} | [Returned class]{.smallcaps} |
|-------------------|----------------------------------|-------------------|
| [**Constructors**]{.smallcaps} | | |
| **`MultiAssayExperiment`** | Create a MultiAssayExperiment object | `MultiAssayExperiment` |
| **`ExperimentList`** | Create an ExperimentList from a List or list | `ExperimentList` |
| [**Accessors**]{.smallcaps} | | |
| **`colData`** | Get or set data that describe the patients/biological units | `DataFrame` |
| **`experiments`** | Get or set the list of experimental data objects as original classes | `ExperimentList` |
| **`assays`** | Get the list of experimental data numeric matrices | `SimpleList` |
| **`assay`** | Get the first experimental data numeric matrix | `matrix`, matrix-like |
| **`sampleMap`** | Get or set the map relating observations to subjects | `DataFrame` |
| **`metadata`** | Get or set additional data descriptions | `list` |
| **`rownames`** | Get row names for all experiments | `CharacterList` |
| **`colnames`** | Get column names for all experiments | `CharacterList` |
| [**Subsetting**]{.smallcaps} | | |
| **`mae[i, j, k]`** | Get rows, columns, and/or experiments | `MultiAssayExperiment` |
| **`mae[i, ,]`** | i: GRanges, character, integer, logical, List, list | `MultiAssayExperiment` |
| **`mae[,j,]`** | j: character, integer, logical, List, list | `MultiAssayExperiment` |
| **`mae[,,k]`** | k: character, integer, logical | `MultiAssayExperiment` |
| **`mae[[i]]`** | Get or set object of arbitrary class from experiments | (Varies) |
| **`mae[[i]]`** | i: character, integer, logical | |
| **`mae$column`** | Get or set colData column | `vector` (varies) |
| [**Management**]{.smallcaps} | | |
| **`complete.cases`** | Identify subjects with complete data in all experiments | `vector` (logical) |
| **`duplicated`** | Identify subjects with replicate observations per experiment | `list` of `LogicalLists` |
| **`mergeReplicates`** | Merge replicate observations within each experiment | `MultiAssayExperiment` |
| **`intersectRows`** | Return features that are present for all experiments | `MultiAssayExperiment` |
| **`intersectColumns`** | Return subjects with data available for all experiments | `MultiAssayExperiment` |
| **`prepMultiAssay`** | Troubleshoot common problems when constructing main class | `list` |
| [**Reshaping**]{.smallcaps} | | |
| **`longFormat`** | Return a long and tidy DataFrame with optional colData columns | `DataFrame` |
| **`wideFormat`** | Create a wide DataFrame, one row per subject | `DataFrame` |
| [**Combining**]{.smallcaps} | | |
| **`c`** | Concatenate an experiment | `MultiAssayExperiment` |
| [**Viewing**]{.smallcaps} | | |
| **`upsetSamples`** | Generalized Venn Diagram analog for sample membership | `upset` |
| [**Exporting**]{.smallcaps} | | |
| **`exportClass`** | Exporting to flat files | `csv` or `tsv` files |

### `ExpressionSet`

The `ExpressionSet` format (`Biobase` package) was designed by the Bioconductor team as the first format to handle (single) omics data. It has now been supplanted by the `SummarizedExperiment` format. An example of `ExpressionSet` creation and a summary of useful methods are provided below. Please refer to ['An introduction to Biobase and ExpressionSets'](https://bioconductor.org/packages/release/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf) from the documentation of the [`Biobase`](https://doi.org/doi:10.18129/B9.bioc.Biobase) package for a full description of `ExpressionSet` objects.

```{r eset_build, message = FALSE, warning = FALSE}
# Preparing the data (matrix) and sample and variable metadata (data frames):
data(sacurine, package = "ropls")
data.mn <- sacurine[["dataMatrix"]] # matrix: samples x variables
samp.df <- sacurine[["sampleMetadata"]] # data frame: samples x sample metadata
feat.df <- sacurine[["variableMetadata"]] # data frame: features x feature metadata
# Creating the ExpressionSet (package Biobase)
sac.set <- Biobase::ExpressionSet(assayData = t(data.mn))
Biobase::pData(sac.set) <- samp.df
Biobase::fData(sac.set) <- feat.df
stopifnot(validObject(sac.set))
# Viewing the ExpressionSet
# ropls::view(sac.set)
```

| [Methods]{.smallcaps} | [Description]{.smallcaps} | [Returned class]{.smallcaps} |
|------------------|----------------------------------|--------------------|
| **`exprs`** | 'variable times samples' numeric matrix | `matrix` |
| **`pData`** | sample metadata | `data.frame` |
| **`fData`** | variable metadata | `data.frame` |
| **`sampleNames`** | sample names | `character` |
| **`featureNames`** | variable names | `character` |
| **`dims`** | 2x1 matrix of 'Features' and 'Samples' dimensions | `matrix` |
| **`varLabels`** | Column names of the sample metadata data frame | `character` |
| **`fvarLabels`** | Column names of the variable metadata data frame | `character` |

### `MultiDataSet`

Loading the `NCI60_4arrays` from the `omicade4` package:

```{r mset_build_load}
data("NCI60_4arrays", package = "omicade4")
```

Creating the `MultiDataSet`:

```{r mset_build, message = FALSE, warning=FALSE}
library(MultiDataSet)
# Creating the MultiDataSet instance
nci.mds <- MultiDataSet::createMultiDataSet()
# Adding each dataset as an ExpressionSet instance
for (set.c in names(NCI60_4arrays)) {
  # Getting the data
  expr.mn <- as.matrix(NCI60_4arrays[[set.c]])
  pdata.df <- data.frame(row.names = colnames(expr.mn),
                         cancer = substr(colnames(expr.mn), 1, 2),
                         stringsAsFactors = FALSE)
  fdata.df <- data.frame(row.names = rownames(expr.mn),
                         name = rownames(expr.mn),
                         stringsAsFactors = FALSE)
  # Building the ExpressionSet
  eset <- Biobase::ExpressionSet(assayData = expr.mn,
                                 phenoData = new("AnnotatedDataFrame", data = pdata.df),
                                 featureData = new("AnnotatedDataFrame", data = fdata.df),
                                 experimentData = new("MIAME", title = set.c))
  # Adding to the MultiDataSet
  nci.mds <- MultiDataSet::add_eset(nci.mds, eset, dataset.type = set.c,
                                    GRanges = NA, warnings = FALSE)
}
stopifnot(validObject(nci.mds))
```

| [Methods]{.smallcaps} | [Description]{.smallcaps} | [Returned class]{.smallcaps} |
|---------------------|-------------------------------|--------------------|
| [**Constructors**]{.smallcaps} | | |
| **`createMultiDataSet`** | Create a MultiDataSet object | `MultiDataSet` |
| **`add_eset`** | Add an ExpressionSet to a MultiDataSet object | `MultiDataSet` |
| [**Subsetting**]{.smallcaps} | | |
| **`mset[i, ]`** | i: character, logical (samples to select) | `MultiDataSet` |
| **`mset[, k]`** | k: character (names of datasets to select) | `MultiDataSet` |
| **`mset[[k]]`** | k: character (name of the dataset to select) | `ExpressionSet` |
| [**Accessors**]{.smallcaps} | | |
| **`as.list`** | Get the list of data matrices | `list` |
| **`pData`** | Get the list of sample metadata | `list` |
| **`fData`** | Get the list of variable metadata | `list` |
| **`sampleNames`** | Get the list of sample names | `list` |
| [**Management**]{.smallcaps} | | |
| **`commonSamples`** | Select samples that are present in all datasets | `MultiDataSet` |
| [**Conversion**]{.smallcaps} | | |
| **`mds2mae`** | Convert a MultiDataSet to a MultiAssayExperiment | `MultiAssayExperiment` |

# Session info

Here is the output of `sessionInfo()` on the system on which this document was compiled:

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

# References