--- title: "MultiBaC user's guide" author: - name: Manuel Ugidos affiliation: Gene Expression and RNA Metabolism Laboratory, Instituto de Biomedicina de Valencia, Consejo Superior de Investigaciones Científicas (CSIC), Valencia, Spain. - name: Sonia Tarazona affiliation: Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain - name: Maria J. Nueda affiliation: Department of Mathematics, Alicante Universiy, Spain package: MultiBaC output: BiocStyle::html_document: toc: true number_sections: true fig_crop: true fig_caption: true bibliography: ["references.bib"] biblio-style: apalike link-citations: true csl: biomed-central.csl vignette: > %\VignetteIndexEntry{MultiBaC} %\VignetteEngine{knitr::rmarkdown} header-includes: - \usepackage{color} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.width = 12, fig.height = 7) devtools::load_all(".") library("MultiAssayExperiment") library("ggplot2") library("ropls") library("MultiBaC") ``` # Introduction Simultaneously generating multiple omic measurements (e.g. transcriptomics, metabolomics, proteomics or epigenomics) of the same molecular system for one particular study is not always possible. As a consequence, researchers sometimes combine compatible data generated in different labs or in different batches. In such cases, data will usually be affected by an unwanted effect associated to the experimentation event (lab, batch, technology, etc.) that, especially for high throughput molecular assays, may result in important levels of noise contaminating the biological signal. This unwanted source of variation is commonly known as ``batch effect'' and is very frequently seen as the first source of variability in the omic dataset, standing out over the experimental conditions under study. Removing batch effects becomes then necessary in order to obtain meaningful results from statistical analyses. Provided that the omic experiment has been designed in such a way that batch effects are not confounded with the effects of interest (treatment, disease, cell type, etc.), the so-called Batch Effect Correction Algorithms (BECAs) can be used to remove, or at least mitigate, systematic biases.Therefore these methods are extremely useful to combine data from different laboratories or measured at different times. One of these BECAs is the ARSyN method [@Nueda2012], which relies on the ANOVA-Simultaneous Components Analysis (ASCA) framework to decompose the omic signal into experimental effects and other unwanted effects. ARSyN applies Principal Component Analysis (PCA) to estimate the systematic variation due to batch effect and then removes it from the original data. BECAs have been traditionally applied to remove batch effects from omic data of the same type, as for example gene expression. However, while removing batch effects from a single omic data type with an appropriate experimental design is relatively straightforward, it can become unapproachable when dealing with multiomic datasets. In the multiomic scenario, each omic modality may have been measured by a different lab or at a different moment in time, and so it is obtained within a different batch. When this is the case, the batch effect will be confounded with the ``omic type effect'' and will be impossible to remove from the data. However, in some scenarios, the multiomic batch effect can be corrected. MultiBaC is the first BECA dealing with batch effect correction in multiomic datasets. MultiBaC can remove batch effects across different omics generated within separate batches provided that at least one common omic data type is included in all the batches. The **MultiBaC** package includes two BECAs: the ARSyN method for correcting batch effect from a single omic data type and the MultiBac method, which deals with the batch effect problem on multi-omic assays. # Batch effect correction on a single omic ## About ARSyN ### ARSyN method overview ARSyN (ASCA Removal of Systematic Noise) is a method that combines Analysis of Variance (ANOVA) and Principal Component Analysis (PCA) for the identification of structured variation from the estimated ANOVA models for experimental and unwanted effects on an omic data matrix. ARSyN can remove undesired signals to obtain noise-filtered data for further analysis [@Nueda2012]. In **MultiBaC** package, ARSyN has been adapted for filtering the noise associated to identified or unidentified batch effects. This adaptation has been called ARSyNbac. In the ARSyN method, the ANOVA model separates the signal identified with each one of the factors involved in the experimental design from the residuals. The algorithm can be applied on multi-factorial experimental designs. One of the factors in the model can be the batch each sample belongs to, if this information is known. In such case, the ANOVA model is applied to separate the batch effect from the remaining effects and residuals. The PCA analysis will hence detect the possible existence of a structured variation due to the batch effect, that is identified with the principal components explaining a given proportion of the total variation in the data, which can be set by the user. However, ARSyN can also be applied when the batch factor is not known since the PCA on the residual matrix can detect correlated structure associated to a source of variation not included in the experimental design. We alert of this signal in the residuals when the first eigenvalues of the PCA are noticeably higher than the rest, because if there is not any structure the eigenvalues will be approximately equal. In this case the selection of components is controlled by the beta argument. Components that represent more than beta times the average variability are identified as systematic noise and removed from the original data. ### How to cite ARSyN Nueda MJ, Ferrer A, Conesa A.(2012). [ARSyN: A method for the identification and removal of systematic noise in multifactorial time course microarray experiments](https://doi.org/10.1093/biostatistics/kxr042). *Biostatistics* 13:553-66. ## Example: Yeast expression data The yeast expression example data sets were collected from the Gene Expression Omnibus (GEO) database and from three different laboratories (batches). In all of them, the effect of glucose starvation in yeast was analyzed. Lab A is the Department of Biochemistry and Molecular Biology from Universitat de Valencia (accession number GSE11521) [@JE1, @JE2, @JE3]; Lab B is the Department of Molecular and Cellular Biology from Harvard University (accession number GSE56622) [@RIBO]; and Lab C is the Department of Biology from Johns Hopkins University (accession number GSE43747) [@PARCLIP]. After a proper data pre-processing for each case, a voom transformation (with limma R package) was applied when necessary. Finally TMM normalization was performed on the whole set of samples from all labs. A reduced dataset was obtained by selecting 200 omic variables from each data matrix and just 3 samples from lab A. This yeast multiomic reduced dataset is included in *MutiBaC* package to illustrate the usage of the package. The gene expression matrices can be loaded by using the *data("multiyeast")* instruction. The three studies used equivalent yeast strains and experimental conditions but, as shown in Figure \@ref(fig:init) , the main effect on expression is due to data belonging to different labs, which are the batches in this case. ```{r init, eval = TRUE, echo = FALSE, fig.cap = "PCA plot of original gene expression data (before correction). Batches are completely separated from each other. Plot generated with MultiBaC package (see Visualization of results Section)."} data("multiyeast") data_RNA <- createMbac (inputOmics = list(A.rna, B.rna, C.rna), batchFactor = c("A", "B", "C"), experimentalDesign = list("A" = c("Glu+", "Glu+", "Glu+", "Glu-", "Glu-", "Glu-"), "B" = c("Glu+", "Glu+", "Glu-", "Glu-"), "C" = c("Glu+", "Glu+", "Glu-", "Glu-")), omicNames = "RNA") custom_col <- c("brown2", "dodgerblue", "forestgreen") custom_pch <- c(19,19,19,1,1,1, 19,19,1,1, 19,19,1,1) plot(data_RNA, typeP="pca.org", bty = "L", pch = custom_pch, cex = 3, col.per.group = custom_col, legend.text = c("Color: Batch", names(data_RNA$ListOfBatches), "Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 15, 0), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, 1, 1), "bty" = "n", "cex" = 2)) ``` ## ARSyNbac input data The **MultiBaC** package uses *MultiAssayExperiment* objects, a type of Bioconductor container for multiomic studies, that can be created from a list of matrices or *data.frame* objects. These matrices must have features in rows and samples in columns as shown next for one of the data matrices from the yeast example (A.rna). It is important that all data matrices share the variable space. If the number of omic variables and order are not the same, the *createMbac* function will select the common variables. Hence, it is mandatory that rows are named with the same type of identifiers. ```{r} data("multiyeast") head(A.rna) ``` A *MultiAssayExperiment* object needs to be created for each batch (lab in this example). The *mbac* new data structure is a list of *MultiAssayExperiment* objects and can be easily generated with the *createMbac* function in the package. The resulting *mbac* object will be the *ARSyNbac* input. ```{r} data_RNA<- createMbac (inputOmics = list(A.rna, B.rna, C.rna), batchFactor = c("A", "B", "C"), experimentalDesign = list("A" = c("Glu+", "Glu+", "Glu+", "Glu-", "Glu-", "Glu-"), "B" = c("Glu+", "Glu+", "Glu-", "Glu-"), "C" = c("Glu+", "Glu+", "Glu-", "Glu-")), omicNames = "RNA") ``` These are the arguments for the *createMbac* function: * **inputOmics** A list containing all the matrices or data.frame objects to be analysed. MultiAssayExperiment objects can alternatively be provided. * **batchFactor** Either a vector or a factor indicating the batch were each input matrix belongs to (i.e. study, lab, time point, etc.). If NULL (default) no batch is considered and just ARSyNbac noise reduction mode could be applied. * **experimentalDesign** A list with as many elements as batches. Each element can be a factor, a character vector or a data.frame indicating the experimental conditions for each sample in that batch. When being a data.frame with more than one column (multi-factorial experimental designs), the different columns will be combined into a single one to be used by MultiBaC or ARSyNbac. In any case, the experimental setting must be the same for all batches. In addition, the names of the elements in this list must be the same as declared in *batches* argument. If not (or if NULL), names are forced to be the same in as in *batches* argument and in the same order. * **omicNames** Vector of names for each input matrix. The common omic is required to have the same name across batches. * **commonOmic** Name of the common omic between the batches. It must be one of the names in omicNames argument. If NULL (default), the omic name which is common to all batches is selected as commonOmic. The *mbac* R structure generated by the *createMbac* function is an S3 object that initially contains just one slot, the *ListOfBatches* object. This *mbac* structure will incorporate more elements as they are created when running *ARSyNbac* or *MultiBaC* functions. These new slots are *CorrectedData*, *PLSmodels*, *ARSyNmodels* or *InnerRelation* and are described next: * **ListOfBatches**: A list of MultiAssayExperiment objects (one per batch). * **CorrectedData**: Same structure than ListOfBatches but with the corrected data matrices instead of the original ones. * **PLSmodels**: PLS models created by MultiBaC method (one model per non-common omic data type). Only available for MultiBaC method. * **ARSyNmodels**: ARSyN models created either by ARSyNbac or MultiBaC functions. * **InnerRelation**: Table of class *data.frame* containing the inner correlation (i.e. correlation between the scores of X (t) and Y (u) matrices) for each PLS model across all components, for model validation purposes. Only available for MultiBaC method. * **commonOmic**: Name of the common omic between the batches. In addition to plot, other method is supplied for visualizing mbac objects: summary, which show the structure of the object. ```{r,} summary(data_RNA) ``` ## ARSyNbac correction The function to remove batch effects or unwanted noise from a single omic data matrix in the **MultiBaC** package is the *ARSyNbac* function, which allows for the following arguments: ```{r, eval = FALSE} ARSyNbac (mbac, batchEstimation = TRUE, filterNoise = TRUE, Interaction=FALSE, Variability = 0.90, beta = 2, modelName = "Model 1", showplot = TRUE) ``` * **mbac**: mbac object generated by *createMbac*. * **batchEstimation**: Logical. If TRUE (default) the batch effect is estimated and used to correct the data. Use TRUE when the source of the batch effect is known. * **filterNoise^**: Logical. If TRUE (default) structured noise is removed form residuals. Use this option when there is an unknown source of batch effect in data. * **Interaction**: Logical. Whether to model the interaction between factors or not (FALSE by default). * **Variability**: From 0 to 1. Minimum percent of data variability that must be explained by each model. Used in batch correction mode. By default, 0.90. * **beta**: Numeric. Components that represent more than beta times the average variability are identified as systematic noise in residuals. Used in noise reduction mode. By default, 2. * **modelName**: Name of the model created. This name will be showed if you use the explaine_var plot function. By default, "Model 1". * **showplot**: Logical. If TRUE (default), the explained_var plot is showed. This plot represents the number of components selected for the ARSyN model. Therefore, the *ARSyNbac* function offers three types of analysis: **ARSyNbac batch effect correction**, when the batch information is provided, **ARSyNbac noise reduction**, if the batch information is unknown, and the combination of both modes when there is a known source of batch effect and another possible unknown source of unwanted variability. In the following sections we explain how to proceed with each one of them. ### ARSyNbac batch effect correction When the batch is identified in the *batchFactor* argument of the *mbac* input object, its effect can be estimated and removed by choosing *batchEstimation = TRUE* (considering one source of batch effect only, *filterNoise = FALSE*). Moreover, a possible interaction between the experimental factors and the batch factor can be studied by setting *Interaction=TRUE*. ```{r arsyn1, eval = TRUE, message = FALSE, fig.cap = "Batch correction when Interaction=FALSE. Left: Explained variance plot. Default plot when showplot = TRUE. It represents the number of components (x axis) needed to explain a certain variability (y axis) of the effect of interest (batch effect). The number of components selected in the model is indicated with a triangle symbol. Gray dashed line indicates the threshold given by the Variability argument (in percentage). Right: PCA plot of corrected gene expression data when not considering the interaction batch-condition."} par(mfrow = c(1,2)) arsyn_1 <- ARSyNbac(data_RNA, modelName = "RNA", Variability = 0.95, batchEstimation = TRUE, filterNoise = FALSE, Interaction = FALSE) plot(arsyn_1, typeP="pca.cor", bty = "L", pch = custom_pch, cex = 3, col.per.group = custom_col, legend.text = c("Color: Batch", names(data_RNA$ListOfBatches), "Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 15, 0), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, 1, 1), "bty" = "n", "cex" = 1.2)) ``` According to the left plot in Figure \@ref(fig:arsyn1), two principal components (PCs) have been selected to explain at least 95% of the total variability of batch effect. PCA of corrected data with this analysis is shown in the right panel. Now the main source of variability in the data (PC1) is given by the experimental condition, while samples are not clustered by batches anymore. ```{r arsyn2, message = FALSE, fig.cap = "Batch correction when Interaction = TRUE. Left: Explained variance plot. Default plot when showplot = TRUE. It represents the number of components (x axis) needed to explain a certain variability (y axis) of the effect of interest (batch effect). The number of components selected in the model is indicated with a triangle symbol. Gray dashed line indicates the threshold given by the Variability argument (in percentage). Right: PCA plot of corrected gene expression data considering the interaction batch-condition."} par(mfrow = c(1,2)) arsyn_2 <- ARSyNbac(data_RNA, modelName = "RNA", Variability = 0.95, batchEstimation = TRUE, filterNoise = FALSE, Interaction = TRUE) plot(arsyn_2, typeP="pca.cor", bty = "L", pch = custom_pch, cex = 3, col.per.group = custom_col, legend.text = c("Color: Batch", names(data_RNA$ListOfBatches), "Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 15, 0), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, 1, 1), "bty" = "n", "cex" = 1.2)) ``` In Figure \@ref(fig:arsyn2) (right panel) all the points with negative PC1 correspond to Glu- samples, and the positive PC1 to Glu+ samples, as happened when not including the interaction in the model (Figure \@ref(fig:arsyn1), right panel). However, in this second model, PC1 explains a higher percentage of the variability in the data, indicating a better batch effect correction. In general, we recomend the use of the default argument (Interaction=FALSE), as including part of the signal as batch effect could lead to a dilution of the effect of the signal of interest. However, the interaction between batch and experimental condition is sometimes strong and we should consider to include it in the model in order to get a better correction of the data. The PCA plots shown in Figures \@ref(fig:init), \@ref(fig:arsyn1) and \@ref(fig:arsyn2) have been created by the customized *plot* function in **MultiBaC** package. More details about this function can be found at the "Visualization of ARSyN and MultiBaC results" section, where a complete description of the arguments in *plot* is given. ### ARSyNbac noise reduction When batch is not identified, **ARSyNbac** analyses the existence of systematic noise in the residuals by setting *batchEstimation = FALSE* and *filterNoise = TRUE*. ```{r arsyn3, message = FALSE, fig.cap = "Noise reduction mode. Left: Explained variance plot. Default plot when showplot = TRUE. It represents the percentage of variability in the residuals (y axis) explained by a model with a given number of principal components (x axis). The number of selected components in the final model is indicated with a triangle symbol, and computed to explain beta times the average variability of the residuals. Right: PCA plot of corrected gene expression data."} par(mfrow = c(1,2)) arsyn_3 <- ARSyNbac(data_RNA, modelName = "RNA", beta = 0.5, batchEstimation = FALSE, filterNoise = TRUE) plot(arsyn_3, typeP="pca.cor", bty = "L", pch = custom_pch, cex = 3, col.per.group = custom_col, legend.text = c("Color: Batch", names(data_RNA$ListOfBatches), "Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 15, 0), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, 1, 1), "bty" = "n", "cex" = 1.2)) ``` In Figure \@ref(fig:arsyn3) we can see that even though the batch is considered unidentified (batchEstimation=FALSE), ARSyN has removed the noise from the data by estimating the main source of unwanted variation. In this noise reduction mode, we can modulate the magnitude of the residual noise removal with the beta parameter. Basically, components that represent more than beta times the average variability are identified as systematic noise in residuals (3 components were selected in this model). Thus a greater beta value leads to the selection of a lower number of components in the residuals. ### ARSyNbac both modalities When the source of the batch effect is known but there might be an extra unknown source of unwanted variability, **ARSyNbac** can perform both previous ways by setting *batchEstimation = TRUE* and *filterNoise = TRUE*. Note that this mode could also be useful if the known batch effect does not represent the main source of noise in our data. ```{r arsyn4, message = FALSE, fig.cap = "Both modes together. Left: Explained variance plot. Default plot when showplot = TRUE. It represents the percentage of variability in the residuals (y axis) explained by a model with a given number of principal components (x axis). The number of selected components in the final model is indicated with a triangle symbol, and computed to explain beta times the average variability of the residuals. Right: PCA plot of corrected gene expression data."} par(mfrow = c(1,2)) arsyn_4 <- ARSyNbac(data_RNA, modelName = "RNA", beta = 0.5, batchEstimation = TRUE, filterNoise = TRUE, Interaction = TRUE, Variability = 0.95) plot(arsyn_4, typeP="pca.cor", bty = "L", pch = custom_pch, cex = 3, col.per.group = custom_col, legend.text = c("Color: Batch", names(data_RNA$ListOfBatches), "Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 15, 0), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, 1, 1), "bty" = "n", "cex" = 1.2)) ``` In Figure \@ref(fig:arsyn4) we can see that performing both modalities together, ARSyNbac reaches its maximum PC1 variance explained (specially compared to \@ref(fig:arsyn2)). In this third mode, we can modulate the magnitude of the residual noise removal with the beta parameter and the batch effect associated explained variance. In this example, as shown in \@ref(fig:arsyn3), known batch effect represents the main (and almost unique) source of unwanted variatio. Thus, in other scenarios with more than one batch source, the result with this third mode might be very different form the other two previous ways of operation. # Batch effect correction on a multiomic dataset ## About MultiBaC ### MultiBaC method overview Multiomic data integration has become a popular approach to understand how biological systems work. However, generating this kind of datasets is still costly and time consuming. Consequently, it is quite common that not all the samples or omic data types are produced at the same time or in the same lab, but in different batches. In addition, when research groups cannot produce their own multiomic datasets, they usually collect them from different public repositories and, therefore, from different studies or laboratories (again, from different batches). Thus, in both situations, batch effects need to be previously removed from such datasets for successful data integration. Methods to correct batch effects on a single data type cannot be applied to correct batch effects across omics and, hence, we developed the MultiBaC strategy, which corrects batch effects from multiomic datasets distributed across different labs or data acquisition events. However, there are some requirements for the multi-omic data set in order to remove across-omics batch effects with MultiBaC: * There must be, at least, one common omic data type in all the batches. We may have, for instance, gene expression data measurements in all the batches, and then other different omic data types in each batch. It is not necessary that the commom omic (e.g. gene expression) is measured with the same technology. We could have microarray expression data in one batch and RNA-seq data in another batch, for example. * The omic feature identifiers must be the same for all the common data matrices. We cannot use, for instance, Ensembl identifiers in one batch and RefSeq in another batch. It is not necessary to have the same number of omic features, e.g. genes, in all the batches. MultiBac will extract the common identifiers from all the common data type matrices to perform the analysis. * Within the same batch, the experimental design must be the same for all the omic in that batch, that is, the same experimental groups, with the same number of replicates obtained from the same individuals. All these samples must be in the same order in all the omic data matrices. ### How to cite MultiBaC Ugidos M, Tarazona S, Prats-Montalb\'an JM, Ferrer A, Conesa A.(2020). [MultiBaC: A strategy to remove batch effects between different omic data types](https://doi.org/10.1177/0962280220907365). *Statistical Methods in Medical Research*. ## Example: Yeast multiomic dataset The yeast multiomic dataset comes from the same studies that the yeast expression data described in ARSyNbac section. While the three labs produced gene expression data (RNA), each of them generated a different additional omic data type. Lab A collected transcription rates (GRO, with accession number GSE1002) [@JE1, @JE2, @JE3]. Lab B obtained translation rates (RIBO, with accession number GSE56622) [@RIBO]. Finally, Lab C measured global PAR-CLIP data (PAR-CLIP, with accession number GSE43747) [@PARCLIP]. Therefore, labs have one shared (RNA) and one distinct (GRO, RIBO and PAR-CLIP, respectively) data types. This distributed multiomic scenario represents the type of correction problem MultiBaC addresses. A scheme of the data structure is shown in Figure \@ref(fig:design). ```{r design, echo = FALSE, fig.cap = "Scheme of the yeast example data structure."} knitr::include_graphics("designScheme.png", dpi = 30) ``` This yeast multiomic dataset is included in the **MutiBaC** package to illustrate the usage of the package. The six matrices can be loaded by using the *data("multiyeast")* instruction. ## MultiBaC input data As commented before, the **MultiBaC** package uses *MultiAssayExperiment* objects, a type of Bioconductor container for multiomic studies, that can be created from a list of matrices or *data.frame* objects. These matrices must have features in rows and samples in columns (see example in ARSyNbac section). Since *MultiBaC* computes regression models between omics from the same batch, it is important that matrices from the same batch have the same experimental design: the same number of samples and in the same order. *MultiBaC* relates the commonOmic information from the different batches as well. Thus, it is also important that *commonOmic* matrices share the variable space. In this case, if the number of omic variables and order are not the same, *MultiBaC* will take the common variables for the model. Hence, it is mandatory that rows are named with the same type of identifiers. The *mbac* object that will be the *MultiBaC* function input can be easily generated with the *createMbac* function in the package: ```{r} my_mbac <- createMbac (inputOmics = list(A.rna, A.gro, B.rna, B.ribo, C.rna, C.par), batchFactor = c("A", "A", "B", "B", "C", "C"), experimentalDesign = list("A" = c("Glu+", "Glu+", "Glu+", "Glu-", "Glu-", "Glu-"), "B" = c("Glu+", "Glu+", "Glu-", "Glu-"), "C" = c("Glu+", "Glu+", "Glu-", "Glu-")), omicNames = c("RNA", "GRO", "RNA", "RIBO", "RNA", "PAR")) ``` More details about the *createMbac* function can be found in ARSyNbac input data Section. Note that we do not need to indicate which is the common omic (in commonOmic argument) since there is only one omic in common (RNA) for all the batches (labs) and the function detects it as the common omic. ## MultiBaC correction Once the *mbac* object has been created with the *createMbac* function, it is used as the input data for *MultiBaC* function (*mbac* argument), which is the wrapper function for correction of multiomic batch effects. ```{r, eval = FALSE, echo = TRUE} MultiBaC (mbac, test.comp = NULL, scale = FALSE, center = TRUE, crossval = NULL, Interaction = FALSE, Variability = 0.90, showplot = TRUE, showinfo = TRUE) ``` The arguments of the *MultiBaC* function correspond to the different steps of the MultiBaC method: * **mbac**: mbac object generated by *createMbac*. * **test.comp**: Maximum number of components allowed for PLS models. If NULL (default), the minimal effective rank of the matrices is used as the maximum number of components. * **scale**: Logical. Whether X and Y matrices must be scaled. By default, FALSE. * **center**: Logical. Whether X and Y matrices must be centered. By default, TRUE. * **crossval**: Integer: number of cross-validation segments. The number of samples (rows of 'x') must be at least >= crossvalI. If NULL (default), a leave-one-out crossvalidation is performed. * **Interaction**: Logical. Whether to model the interaction between experimental factors and bacth factor in ARSyN models. By default, FALSE. * **Variability**: From 0 to 1. Minimum percent of data variability that must be explained for each ARSyN model. By default, 0.90. * **showplot**: Logical. If TRUE (default), the Q2 and the explained variance plots are shown. * **showinfo**: Logical. If TRUE (default), the information about the function progress is shown. The usage of MultiBaC function on the yeast example data is shown bellow: ```{r main-multibac2, include = TRUE, echo = TRUE, fig.cap = "Q2 and explained variance plots. Q2 plot (left) shows the number ob components (x) needed to reach a certain Q2 value (y). The number of components selected for each model is indicated with a triangle symbol. Gray dashed line indicates the 0.7 Q2 threshold. Explained variance plot (right) represents the number of components (x) needed to explain a certain varibility (y) of the effect of interest (batch effect). The number of components selected for each model is indicated with a triangle symbol. Gray dashed line indicates the Variability argument in percentage."} my_final_mbac <- MultiBaC (my_mbac, test.comp = NULL, scale = FALSE, center = TRUE, crossval = NULL, Interaction = TRUE, Variability = 0.9, showplot = TRUE, showinfo = TRUE) ``` By default (*showinfo = TRUE*), the table containing the inner correlations of PLS models is displayed in propmt. Moreover, the default plots (*showplot = TRUE*) are "Q2 plot" and "Explained variance plot" (see Figure \@ref(fig:main-multibac2)), which contain important information about *MultiBaC* performance. The "Q2 plot" represents the PLS model prediction capability given by the $Q^2$ score. The x axis indicates the number of components extracted for the PLS models and the y axis the $Q^2$ value. The performance of the MultiBaC method will be better for higher $Q^2$ values, since a high $Q^2$ indicates a good PLS prediction of the missing omics and hence will result in a better estimation of the batch effect. Note that, depending on the rank of the matrices, each PLS model could have a different maximum number of components. The "Explained variance plot" provides the batch effect related variability explained using the ASCA decomposition that ARSyN method provides. The x axis indicates the number of components extracted for the ASCA model and y axis reflects the percentage of explained variance. In this case, a higher explained variance leads to a better batch effect estimation. In both plots, the number of components selected for each model is indicated with a triangule symbol. In the "Q2 plot", the selected number of components is the one that maximize the Q2 value while in the "Explained variance plot", this number is the minimum number of components that reaches a explained variability (y axis) higher than the *Variability* argument of the function (gray dashed line). ## Running MultiBaC step by step All the different steps performed by the *MultiBaC* wrapper function can be independently computed with specific functions, as described next, from the initial *mbac* object. The MultiBaC strategy can be divided into three main steps that will be described next in detail: 1) PLS model fitting, 2) Prediction of missing omics, and 3) Batch effect correction. ### PLS model fitting The *genModelList* function produces the PLS models between distinct and common omic data types. It computes the optimal number of components via a crossvalidation approach. ```{r} my_mbac_2 <- genModelList (my_mbac, test.comp = NULL, scale = FALSE, center = TRUE, crossval = NULL, showinfo = TRUE) ``` The arguments of the *genModelList* function are: * **mbac**: mbac type object. * **test.comp** Maximum number of components allowed in PLS models. If NULL (default), the minimal effective rank of the matrices is used as the maximum number of components. * **scale**: Logical. Whether X and Y matrices must be scaled. By default, FALSE. * **center**: Logical. Whether X and Y matrices must be centered. By default, TRUE. * **crossval** Integer indicating the number of cross-validation segments. The number of samples (rows of 'x') must be at least >= crossvalI. If NULL (default), a leave-one-out crossvalidation is conducted. * **showinfo**: A logical value indicating whether to show the information about the function progress. By default, TRUE. The output of *genModelList* is a *mbac* object with a new slot, *PLSmodels*, a list of the PLS models obtained with the *ropls* package. Each slot of the output corresponds to a batch in *ListOfBatches*. If one batch contains more than one non-common omic, the "batch" element in *genModelList* contains in turn as many elements as non-common omics in that batch, i.e. one PLS model per non-common omic. ### Prediction of missing omics The prediction of the initially missing omics is performed with the *genMissingOmics* function from the output of the *genModelList* function. ```{r} multiBatchDesign <- genMissingOmics(my_mbac_2) ``` The result after running *genMissingOmics* is a list of *MultiAssayExperiment* structures. In this case, each batch contains all the omics introduced in MultiBaC. For instance, if two batches are being studying, "A" and "B", given that "A" contains "RNA-seq" and "GRO-seq" data and "B" contains "RNA-seq" and "Metabolomics" data, after applying *genMissingOmics* function, batch "A" will contain "RNA-seq", "GRO-seq" and also the predicted "Metabolomics" data. ### Batch effect correction ```{r} my_finalwise_mbac <- batchCorrection(my_mbac_2, multiBatchDesign = multiBatchDesign, Interaction = TRUE, Variability = 0.90) ``` As described before, ARSyN applies an ANOVA-like decomposition to the data matrix in order to estimate the batch effect and, next, a PCA is applied on each submatrix. The number of principal components for each PCA is adjusted by the *Variability* argument. The output of this function consists of two different objects: *CorrectedData* and *ARSyNmodels*. The first one has the same structure than *ListOfBatches* slot. However, in this case, all batches contain all the omics introduced in MultiBaC after correcting the batch effect on each omic data type separately. Note that we discard the predicted omic matrices and only use the corrected original matrices for further statistical analyses. The *ARSyNmodels* slot contains the ASCA decomposition models for each omic data type. # Visualization of ARSyN and MultiBaC results As mentioned before, the *ARSyNbac* and *MultiBaC* outputs are *mbac* type objects. Since the *mbac* class incorporates a plotting method, the *plot function* can by applied on *mbac* objects to graphically display additional information about the performance of the methods and the data characteristics. The *plot* function for *mbac* objects accepts several additional arguments: ```{r, eval = FALSE, echo = TRUE} plot (x, typeP = "def", col.by.batch = TRUE, col.per.group = NULL, comp2plot = c(1,2), legend.text = NULL, args.legend = NULL, ...) ``` Description of the arguments: * **x**: Object of class "mbac" returned by *MultiBaC* method. * **typeP**: The type of plot to be displayed. Options are: "def" (default option, "Q2 plot" and "Explained variance plot" in case of MultiBaC and "Explained variance plot" in case of ARSyNbac outputs), "inner" (inner correlation plots for each PLS model acroos the components for MultiBaC output), "pca.org" (PCA plot of original data for MultiBaC or ARSyNbac outputs), "pca.cor" (PCA plot of corrected data for MultiBaC or ARSyNbac outputs), "pca.both" (PCA plots for both original and corrected data for MultiBaC or ARSyNbac outputs), and "batch" ("Batch effect estimation" plot for all the outputs). Remember that PCA plots can only be generated when all the omics share the same variable space (e.g. gene identifiers are provided as names of variables for all data matrices). * **col.by.batch**: Argument for PCA plots. TRUE or FALSE. If TRUE (default), samples are colored according to the batch factor. If FALSE, samples are colored according to the experimental conditions. * **col.per.group**: Argument for PCA plots. Color for each group (given by batches or experimental conditions). If NULL (default), the colors are taken from a predefined pallete. * **comp2plot**: Argument for PCA or InnerRel plot. It indicates which components are to be plotted. The default is c(1,2), which means that, in PCA plots, component 1 is plotted in "x" axis and component 2 in "y" axis, and for InnerRel plots, the inner relation plots of components 1 and 2 are to be shown. If more components are indicated, the function will return as many plots as needed to show all the components. * **legend.text**: A vector of text used to construct a legend for the plot. Argument for PCA plot. If NULL (default) batch or conditions names included in the mbac object are used. * **args.legend**: list of additional arguments to pass to legend(); names of the list are used as argument names. Only used if legend.text is supplied. * **...**: Other graphical arguments. While the *plot* function can generate all the plot types described above, each plot can also be independently generated by its corresponding function: *Q2_plot (mbac)*, *explained_varPlot (mbac)*, *plot_pca (mbac, typeP = c("pca.org", "pca.cor", "pca.both"), col.by.batch, col.per.group, comp2plot, legend.text, args.legend)*, *batchEstPlot (mbac)*, or *inner_relPlot (mbac, comp2plot = c(1,2))*. All these plots are useful to validate and understand *MultiBaC* or *ARSyNbac* performance. All of them can be used with a *MultiBaC* output, while those that show information related to the PLS models are not available for an *ARSyNbac* output (see *typeP* argument). In addition to "Q2 plot" and "Explained variance plot" (Figure \@ref(fig:main-multibac2)), which are the default plots, and were explained in previous sections, next sections are devoted to describe the rest of plots in **MultiBaC** package. ## Inner correlation in PLS models An important aspect to be validated in MultiBaC is the inner correlation between X and Y scores in PLS models. As we indicated before, *MultiBaC* function displays by default this information as numerical output but a visual representation can also be invoked. ```{r, eval = TRUE, echo = FALSE} # just one plot in next chunk my_aux <- my_final_mbac my_final_mbac$PLSmodels <- my_final_mbac$PLSmodels[1] ``` ```{r inner, fig.cap="Plot of inner relations of PLS components. Only results for batch 'A' are shown as example. Each panel represents the inner correlation of one component of the PCA model. Red line indicates the diagonal when the correlation is maximal (1:1)."} plot (my_final_mbac, typeP = "inner", comp2plot = c(1,2)) ``` ```{r, eval = TRUE, echo = FALSE} # restore state my_final_mbac <- my_aux ``` The inner correlation between scores of the PLS model that relates both omic data types in batch "A" is shown in Figure \@ref(fig:inner). While we have only shown the plot for batch "A", running *plot* (typeP = "inner")}, the inner correlation plots for all the PLS models generated during MultiBaC performance are displayed using the tag *"Hit to see next plot:"*, thus requiring user's interaction to show the complete set of plots. The information about the model (batches and omics included) is shown in the R prompt too. The "inner correlation plot" is a pivotal result, since it represents the validation of the PLS model. The correlation between x score (t) and y score (u) (in every component) is suppossed to be linear, as shown in Figure \@ref(fig:inner). If a non-linear correlation is observed, a transformation of data would be desirable. ## Batch effect estimation plot This plot illustrates the magnitude of the estimated batch effects. Tipically, this plot is used before *MultiBaC* or *ARSyNbac* performance since it just requires a basic *mbac* returned by *createMbac* function. ```{r batchest, fig.cap="Batch effect estimation plot. Dashed lines represent theoretical batch magnitudes. Violin plots represent the distribution of batch effect coefficents observed in data."} plot (my_final_mbac, typeP = "batch") ``` Theoretical batch effect magnitudes for the yeast example are displayed in Figure \@ref(fig:batchest). The violin plot shows the distribution of batch effect coefficients along the variable space (genes in case of RNA-seq data). Coefficients with higher values are the one that contribute the most to the existence of a batch effect, thus when the distribution is closer to zero, the batch effect is lower. MultiBaC correction performance has been tested and validated with small and medium batch effect magnitudes while it decreases at high magnitudes. ARSyN is not so much affected by the batch effect magnitude. ## PCA plots The goodness of *ARSyNbac* or *MultiBaC* correction can be assessed with the PCA plots before and after the correction. In the case of *MultiBaC*, this plot can only be generated when all the omic data matrices share the same variable space. In our yeast example, every omic data type was obtained as gene-related information, thus matrices can be merged by variables (genes) and the PCA is feasible. An example of the usage of these PCA plots for the *ARSyNbac* output can be found in the ARSyN section. Here we illustrate how to generate and interpret them for MultiBaC correction. The PCA on the original data (Figure \@ref(fig:pca-org)) and on the corrected data (Figure \@ref(fig:pca-cor)) were obtained with the *plot* function by using either *"typeP = pca.org"* or *"typeP = pca.cor"*, respectively. ```{r pca-org, fig.cap="Default PCA plot on the original data."} plot (my_final_mbac, typeP = "pca.org", cex.axis = 1, cex.lab = 1, cex = 3, bty = "L", cex.main = 1.2, pch = 19) ``` ```{r pca-cor, fig.cap="Default PCA plot on the corrected data."} plot (my_final_mbac, typeP = "pca.cor", cex.axis = 1, cex.lab = 1, cex = 3, bty = "L", cex.main = 1.2, pch = 19) ``` By default, this function takes random colors to represent each group (batches by default). However, it would be useful to display the experimental factors information too. For that, we recommend the use of custom *col.per.group* and *pch* arguments. An example is shown in Figure \@ref(fig:pcaplot2), using typeP = pca.both to show both PCA plots together. We could also plot more than two components indicating the desired number with *comp2plot* argument. The user can also include a custom legend by using two arguments: *legend.text* and *args.legend*. With *legend.text* we indicate the text labels of the legend and the rest of the legend arguments are collected in *args.legend* (x, y, pch, fill, col, bty, etc). If *legend.text* is not provided to the function, *args.legend* is not considered. ```{r, include=FALSE} knitr::opts_chunk$set(fig.width = 15, fig.height = 12) ``` ```{r pcaplot2, fig.cap="Customized PCA plots. Original (left panels) and Corrected (right panels) data. Upper panels show the second principal component (PC) against the first one while panels at the bottom show the third PC against the first one."} custom_col <- c("brown2", "dodgerblue", "forestgreen") custom_pch <- c(19,19,19,1,1,1,15,15,15,0,0,0, # batch A 19,19,1,1,17,17,2,2, # batch B 19,19,1,1,18,18,5,5) # batch C plot(my_final_mbac, typeP = "pca.both", col.by.batch = TRUE, col.per.group = custom_col, comp2plot = 1:3, cex.axis = 1.3, cex.lab = 1.2, cex = 3, bty = "L", cex.main = 1.7, pch = custom_pch, legend.text = c("Color", names(my_final_mbac$ListOfBatches), "Shape", c("RNA", "GRO", "RIBO", "PAR"), "Fill", levels(colData(my_final_mbac$ListOfBatches$A)$tfactor)), args.legend = list("x" = "topright", "pch" = c(NA, 15, 15, 15, NA, 19, 15, 17, 18, NA, 19, 1), "col" = c(NA, "brown2", "dodgerblue", "forestgreen", NA, rep(1, 4), NA, 1, 1), "bty" = "n", "cex" = 2)) ``` In this case, batch effect correction is observable in common data (RNA-seq, dots). Batch effect has been removed as common data is placed all together and after the correction, the components (especially the second and the third component) separate the common data based on the experimental condition instead of separating batches. As shown in the legend, point shape indicates the omic data type, however batch effect correction is only visible in common data. # Session info Here is the output of *sessionInfo()* on the system on which this document was compiled: ```{r} sessionInfo() ``` # References