%\VignetteIndexEntry{missRows}
%\VignetteEngine{knitr::knitr}

\documentclass[11pt]{article}

\usepackage{amssymb, amsmath}
\usepackage[small, bf]{caption}
\usepackage{graphicx}
\usepackage{float}
\usepackage{lmodern}
\usepackage[usenames, dvipsnames]{xcolor}
\usepackage{url}

\setlength{\captionmargin}{50pt}

<>=
BiocStyle::latex()
@

<>=
knitr::opts_chunk$set(message=FALSE, fig.align="center", comment="")
@

\bioctitle[Handling Missing Rows]{\Huge{missRows}\\[0.25cm]
\Huge{\textsf{\textbf{Handling Missing Rows in Multi-Omics\\
Data Integration}}}}
\author{Valentin Voillet and Ignacio Gonz\'alez}
% \date{2018}

\begin{document}
\maketitle

\begin{abstract}
In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose \Rpackage{missRows}, an \R{} package that implements the MI-MFA method published in Voillet \textit{et al.} (2016) \cite{Voillet_2016}, a multiple imputation (MI) approach in a multiple factor analysis (MFA) framework. The MI-MFA method generates multiple imputed datasets from an MFA model, and the resulting analyses are then combined into a single consensus solution. The package provides functions for estimating coordinates of individuals and variables in the presence of missing individuals (rows), graphical outputs to display the results and improve interpretation, and various diagnostic plots to inspect the pattern of missingness and visualize the uncertainty due to missing row values. This vignette explains the use of the package and demonstrates typical workflows.
\end{abstract}

%\newpage
\tableofcontents
%\newpage

%============================================================================
\section{Introduction}
%============================================================================

With the growing volume of available data, integrating large amounts of heterogeneous data is currently one of the major challenges in systems biology. Biological data integration provides scientists with a deeper insight into complex biological processes. However, when dealing with multiple data tables, missing values commonly arise for a variety of reasons. In omics data integration studies, it is common for some individuals to not be present in all data tables, resulting in a specific missing data pattern for multiple tables. Missing row values for a table of variables are challenging to handle because most statistical methods cannot be directly applied to incomplete datasets.

To overcome one of the major issues associated with multiple omics data tables, we propose \Rpackage{missRows}, an \R{} package that implements the MI-MFA method published in \cite{Voillet_2016}, a multiple imputation (MI) approach in a multiple factor analysis (MFA, \cite{Escofier_1994}) framework. Proposed by Rubin (1987) \cite{Rubin_2004}, MI estimates both the parameters of interest and their variability in a missing-data framework, and it relies on the principle that a single value cannot reflect the uncertainty of the estimation of a missing value.

Figure~\ref{overview_method} illustrates the three main steps in MI-MFA: imputation, analysis and combination. \Rpackage{missRows} stores the results of each step in an \Rcode{S4} class: \Rclass{MIDTList}. The analysis starts with observed and incomplete data tables $\boldsymbol{K}$. First, MI is used to generate plausible synthetic data values, called imputations, for missing values in the data.
This step results in a number ($M$) of imputed datasets in which the missing data are replaced by random draws of plausible values according to a specific statistical model. The second step consists of analyzing each imputed dataset using MFA to estimate the parameters of interest. This step results in $M$ analyses (instead of just one) which differ only because the imputations differ. Finally, MI combines all the results to obtain a single consensus estimate, thereby combining variation within and across the $M$ imputed datasets. In \Rpackage{missRows}, these three steps are performed using the function \Rcode{MIMFA}, and are described in detail in our publication (Voillet \textit{et al.} 2016, \cite{Voillet_2016}).

The package provides functions for data exploration and result visualization. The structure of missing values can be explored using the \Rcode{missPattern} function. It gives the user an indication of how much information is missing and how the missingness is distributed. The \Rcode{plotInd} and \Rcode{plotVar} functions provide scatter plots for individual and variable representations respectively.

In the MI-MFA framework, after estimating the configurations from the imputed datasets, a new source of variability due to missing values can be taken into account. The \Rpackage{missRows} package proposes two approaches to visualize the uncertainty of the estimated MFA configurations attributable to missing row values: confidence ellipses and convex hulls. The function \Rcode{plotInd} contains methods implementing these two approaches.

This vignette explains the basics of using \Rpackage{missRows} by working through an example, including advanced material for fine-tuning some options. The vignette also includes a description of the methods behind the package.
\begin{figure}[H]
\begingroup
\centering
\includegraphics[scale=0.9]{fig_overview_MIMFA.pdf}
\caption{\textbf{Overview of the MI-MFA approach to handling missing rows in multi-omics data integration.} The top part of the graphic indicates that analysis starts with observed, incomplete data tables $\boldsymbol{K}$. In a first step, multiple imputation is performed using the hot-deck imputation approach: $M$ imputed versions $\boldsymbol{K}^{(1)},\ldots , \boldsymbol{K}^{(M)}$ of $\boldsymbol{K}$ are obtained by replacing the missing values by plausible data values. These plausible values are drawn from donor pools. The imputed sets are identical for the non-missing data entries, but differ in the imputed values. The second step is to estimate the configuration matrix $\boldsymbol{F}_{\!m}$ for each imputed dataset $\boldsymbol{K}^{(m)}$ using MFA. The estimated configurations differ from each other because their input data differ. The last step is to combine the $M$ estimated configurations $\boldsymbol{F}_{\!1},\ldots , \boldsymbol{F}_{\!M}$ into a compromise configuration $\boldsymbol{F_{\!c}}$ using the STATIS method.}
\label{overview_method}
\endgroup
\end{figure}


\subsection{Citing \Rpackage{missRows}}
%============================================================================

We hope that \Rpackage{missRows} will be useful for your research. Please use the following information to cite \Rpackage{missRows} and the overall approach when you publish results obtained using this package, as such citation is the main means by which the authors receive credit for their work. Thank you!

\medskip

<>=
x <- citation("missRows")
@

\begin{center}
\begin{minipage}{16cm}
Gonz{\'a}lez I., Voillet V. \Sexpr{paste0("(", x$year, "). ", x$title, ". ", x$note, ".")}
\bigskip

Voillet V., Besse P., Liaubet L., San Cristobal M., Gonz{\'a}lez I. (2016). Handling missing rows in multi-omics data integration: Multiple Imputation in Multiple Factor Analysis framework.
\textit{BMC Bioinformatics}, 17(40).
\end{minipage}
\end{center}


\subsection{How to get help for \Rpackage{missRows}}
%============================================================================

Most questions about individual functions will hopefully be answered by the documentation. To get more information on any specific named function, for example \Rcode{MIMFA}, you can bring up the documentation by typing at the \R{} prompt

\medskip

<>=
help("MIMFA")
@

or

\medskip

<>=
?MIMFA
@

The authors of \Rpackage{missRows} always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements. If you've run into a question that isn't addressed by the documentation, or you've found a conflict between the documentation and what the software does, then there is an active community that can offer help. Send your questions or problems concerning \Rpackage{missRows} to the Bioconductor support site at \url{https://support.bioconductor.org}.

Please send requests for general assistance and advice to the support site, rather than to the individual authors. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. Users posting to the support site for the first time will find it helpful to read the posting guide at \url{http://www.bioconductor.org/help/support/posting-guide}.


\subsection{Quick start}
%============================================================================

A typical MI-MFA session can be divided into three steps:

\begin{enumerate}
\item {\it Data preparation:} In this first step, a convenient \R{} object of class \Rclass{MIDTList} is created containing all the information required for the two remaining steps. The user needs to provide the data tables with missing rows, an indicator vector giving the stratum for each individual and optionally the name for each table.
\medskip

\item {\it Performing MI:} Using the object created in the first step, the user can perform MI-MFA to estimate the coordinates of individuals and variables on the MFA components and to impute the missing data values.
\medskip

\item {\it Analysis of the results:} The results obtained in the second step are analyzed using visualization tools.
\bigskip
\end{enumerate}

An analysis might look like the following. Here we assume there are two data tables with missing rows, \Rcode{table1} and \Rcode{table2}, and the stratum for each individual is stored in a data frame \Rcode{df}.

\medskip

<>=
## Data preparation
midt <- MIDTList(table1, table2, colData=df)

## Performing MI
midt <- MIMFA(midt, ncomp=2, M=30)

## Analysis of the results - Visualization
plotInd(midt)
plotVar(midt)
@


%============================================================================
\section{Using \Rpackage{missRows}}
%============================================================================

\subsection{Installation}
%============================================================================

We assume that the user has the \R{} program (see the \R{} project at \url{http://www.r-project.org}) already installed.

The \Rpackage{missRows} package is available from the \Bioconductor{} repository at \url{http://www.bioconductor.org}. To be able to install the package, one first needs to install the core \Bioconductor{} packages. If you have already installed \Bioconductor{} packages on your system then you can skip the two lines below.
\medskip

<>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install()
@
\vspace*{-0.2cm}
\warning{try http:// if https:// URLs are not supported}

Once the core \Bioconductor{} packages are installed, you can install the \Rpackage{missRows} package by

\medskip

<>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("missRows")
@
\vspace*{-0.2cm}
\warning{try http:// if https:// URLs are not supported}

\bigskip

Load the \Rpackage{missRows} package in your \R{} session:

\medskip

<>=
library(missRows)
@

A list of all accessible vignettes and methods is available with the following command:

\medskip

<>=
help.search("missRows")
@


\subsection{Data overview}
%============================================================================

To help demonstrate the functionality of \Rpackage{missRows}, the package includes example datasets of the NCI-60 cancer cell line panel. This panel was developed as part of the Developmental Therapeutics Program (\url{http://dtp.nci.nih.gov/}) of the National Cancer Institute (NCI). Once the package is loaded, these datasets can be used by typing

\medskip

<>=
data(NCI60)
@

\Rcode{NCI60} is a list composed of two components: \Rcode{dataTables} and \Rcode{mae}.

\medskip

<<>>=
names(NCI60)
@

\Rcode{dataTables} contains both transcriptomic and proteomic expression data from NCI-60 without missing data. The transcriptome data was directly retrieved from the \Robject{NCI60\_4arrays} data in the \Biocpkg{omicade4} package \cite{NCI60_4arrays}. This data table contains gene expression profiles generated by the Agilent platform, with only a few hundred genes randomly selected to keep the size of the Bioconductor package small. The proteomic data table was retrieved from the \Biocpkg{rcellminerData} package \cite{rcellminer}. Protein abundance levels are available for 162 proteins.
Both full datasets are accessible at \url{https://discover.nci.nih.gov/cellminer/} \cite{CellMiner}. A specific pattern of missingness was then created in these data tables for illustration purposes.

The component \Rcode{dataTables\$cell.line} contains the cancer type of each cell line: colon (CO), renal (RE), ovarian (OV), breast (BR), prostate (PR), lung (LC), central nervous system (CNS), leukemia (LE) and melanoma (ME).

\medskip

<<>>=
table(NCI60$dataTables$cell.line$type)
@

\Rcode{NCI60\$mae} contains a \Rclass{MultiAssayExperiment} instance built from the NCI60 data, with the transcriptomic and proteomic experiments described in \Rcode{dataTables}.

<<>>=
NCI60$mae
@


\subsection{Data preparation: the \Rclass{MIDTList} object}
%============================================================================

The first step in using the \Rpackage{missRows} package is to create a \Rclass{MIDTList} object. This object stores the data tables and the intermediate quantities estimated during the multiple imputation procedure, and it is the input of the visualization functions. The object \Rclass{MIDTList} will usually be represented in the code here as \Rcode{midt}.

To build such an object the user needs the following:

\begin{description}
\item[the incomplete data:] data tables with missing rows. Two or more objects which can be interpreted as matrices (or data frames). Data tables passed as arguments must be arranged in samples $\times$ variables, with sample order and row names matching in all data tables.
\medskip

\item[strata:] named vector or data frame giving the stratum for each individual. Names (in vector) or row names (in data frame) must match the row names of the data tables.
\medskip

\item[table names:] optionally a character vector giving the name for each table.
\medskip
\end{description}

Here, we demonstrate how to construct a \Rclass{MIDTList} object from \Rcode{NCI60} data.
We first load the data

\medskip

<<>>=
data(NCI60)
@

If the data tables are already available as a list, \Rcode{tableList} say, and \Rcode{cell.line} contains the cancer type of each cell line, then a \Rclass{MIDTList} object can be created by

\medskip

<>=
tableList <- NCI60$dataTables[1:2]
cell.line <- NCI60$dataTables$cell.line

midt <- MIDTList(tableList, colData=cell.line,
                 assayNames=c("trans", "prote"))
@

The function \Rcode{MIDTList} can also build a \Rclass{MIDTList} object directly from separate data tables by

\medskip

<>=
table1 <- NCI60$dataTables$trans
table2 <- NCI60$dataTables$prote
cell.line <- NCI60$dataTables$cell.line

midt <- MIDTList(table1, table2, colData=cell.line,
                 assayNames=c("trans", "prote"))
@

or by

\medskip

<>=
midt <- MIDTList("trans" = table1, "prote" = table2, colData=cell.line)
@

Finally, the function \Rcode{MIDTList} can build a \Rclass{MIDTList} object from a \Rclass{MultiAssayExperiment} instance by

\medskip

<>=
midt <- MIDTList(NCI60$mae)
@

A summary of the \Rcode{midt} object can be seen by typing the object name at the \R{} prompt

\medskip

<<>>=
midt
@


\subsection{Performing MI-MFA}
%============================================================================

The main function for performing MI-MFA is \Rcode{MIMFA}, and it has four main arguments. The first argument is an instance of class \Rclass{MIDTList}. The second and third arguments are integers; they specify the number of components to include in MFA and the number of imputations in MI, respectively. The fourth argument is a boolean: if \Rcode{TRUE}, the number of components used for data imputation is estimated, and the maximum number of components is then given by the second argument, \Rcode{ncomp}.

We then perform MI-MFA on the incomplete data tables contained in \Rcode{midt}, using \Rcode{M=30} imputations

\medskip

<>=
set.seed(123)
midt <- MIMFA(midt, ncomp=50, M=30, estimeNC=TRUE)
@

\Rcode{MIMFA} returns an object of class \Rclass{MIDTList}.
A short summary of this object is shown by typing

\medskip

<<>>=
midt
@


\subsection{Working with the \Rclass{MIDTList} object}
%============================================================================

Once the \Rclass{MIDTList} object is created, the user can use the methods defined for this class to access the information encapsulated in the object.

For example, the table names are accessed by

\medskip

<<>>=
names(midt)
@

The \Rcode{colData} slot is accessed by

\medskip

<<>>=
cell.line <- colData(midt)
@

Multiple imputation calculates \Rcode{M} configurations, one for each imputed dataset. To extract a single configuration, use the \Rcode{configurations} function with the index of the configuration. For the fifth configuration (\Rcode{M=5}),

\medskip

<>=
conf <- configurations(midt, M=5)
dim(conf)
@

The list of ... can be exported by the function \Rcode{function}

\medskip


\subsection{Data exploration and visualization}
%============================================================================

This section presents the available tools for data exploration and visualization of the results in \Rpackage{missRows}. Both the \Rcode{MIDTList} and \Rcode{MIMFA} functions return an object of type \Rclass{MIDTList}, and all of the following functions work with this object.


\subsubsection{Inspect the missing rows pattern}
%----------------------------------------------------------------------------

Before imputation, the structure of missing values can be explored using visualization tools and summary outputs. Looking at the missing data pattern is always useful. It can give you an indication of how much information is missing and how the missingness is distributed.

Inspect the missingness pattern for the \Rcode{NCI60} incomplete data by

\medskip

<>=
patt <- missPattern(midt, colMissing="grey70")
@

\Rcode{missPattern} calculates the amount of missing/available rows in each stratum per data table and plots a missingness map showing where missingness occurs.
Data tables are plotted separately on the same device, showing the pattern of missingness. The individuals are colored according to their stratum, whereas missing rows are colored according to the \Rcode{colMissing} argument.

The object \Rcode{patt} encapsulates information about the structure of missingness, such as the amount of missing/available rows in each stratum per data table and the indicator matrix for the missing rows.

\medskip

<>=
patt
@

The missingness pattern shows that there are 20 missing rows in total: 12 for the \Rcode{trans} table and 8 for the \Rcode{prote} table. Moreover, there are two strata (\Rcode{LC} and \Rcode{ME}) with 4 missing rows, four strata with 2 missing rows, one stratum with 3 missing rows and another one with 1 missing row.


\subsubsection{Individuals plot}
%----------------------------------------------------------------------------

An insightful way of looking at the results of \Rcode{MIMFA} is to investigate how the observed and imputed individuals are distributed on the two-dimensional configuration defined by the MFA components.

The \Rcode{plotInd} function produces a scatter plot of the individuals from the \Rcode{MIMFA} results. Each point corresponds to an individual. The color of each individual reflects its stratum, whereas imputed individuals are colored according to the \Rcode{colMissing} argument.

\medskip

<>=
plotInd(midt, colMissing="white")
@


\subsubsection{Visualizing the uncertainty induced by the missing individuals}
%----------------------------------------------------------------------------

Multiple imputation generates \Rcode{M} imputed datasets, and the between-imputation variance reflects the uncertainty associated with the estimation of the missing row values. The \Rpackage{missRows} package proposes two approaches to visualize and assess the uncertainty due to missing data: confidence ellipses and convex hulls.
The rough idea is to project all the multiple imputed datasets onto the compromise configuration. Each individual is represented by \Rcode{M} points, each corresponding to one of the \Rcode{M} configurations. Confidence ellipses and convex hulls can then be constructed from the \Rcode{M} configurations for each individual. The computed convex hull results in a polygon containing all \Rcode{M} solutions.

Confidence ellipses can be created using the function \Rcode{plotInd} by setting the \Rcode{confAreas} argument to \Rcode{"ellipse"}

\medskip

<>=
plotInd(midt, confAreas="ellipse", confLevel=0.95)
@

The 95\% confidence ellipses show the uncertainty for each individual. For ease of understanding, not all individuals for the \Rcode{M} configurations obtained are plotted.

Convex hulls are plotted by setting the \Rcode{confAreas} argument to \Rcode{"convex.hull"}

\medskip

<>=
plotInd(midt, confAreas="convex.hull")
@

These graphical representations provide scientists with considerable guidance when interpreting the significance of \Rcode{MIMFA} results. The larger the area of an ellipse (convex hull), the more uncertain the exact location of the individual. Thus, when the ellipse or convex hull areas are large, the results should be interpreted with caution.


\subsubsection{Variables plot: correlation circle}
%----------------------------------------------------------------------------

\Rcode{plotVar} produces a ``correlation circle'', \textit{i.e.} the correlations between each variable and the selected components are plotted as a scatter plot, with concentric circles of radius one and of radius given by \Rcode{radIn}. Variables are represented by points (symbols) and colored according to the data table to which they belong.

\medskip

<>=
plotVar(midt, radIn=0.5)
@

This plot is an enlightening tool for data interpretation, as it enables a graphical examination of the relationships between variables.
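As a brief sketch of the underlying geometry (the symbols $\boldsymbol{x}_k$, $\boldsymbol{f}_1$ and $\boldsymbol{f}_2$ are notation introduced here for illustration, and the two plotted components are assumed to be uncorrelated), a variable $\boldsymbol{x}_k$ is placed at the point
$$\left(\mathrm{cor}(\boldsymbol{x}_k, \boldsymbol{f}_1),\; \mathrm{cor}(\boldsymbol{x}_k, \boldsymbol{f}_2)\right), \qquad \mathrm{cor}(\boldsymbol{x}_k, \boldsymbol{f}_1)^2 + \mathrm{cor}(\boldsymbol{x}_k, \boldsymbol{f}_2)^2 \leq 1,$$
so every variable lies inside the circle of radius one, and a variable reaches that circle only when it is entirely explained by the two components.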
The variables or groups of variables that are strongly positively correlated are projected close to each other on the correlation circle. When the correlation is strongly negative, the groups of variables are projected at diametrically opposite places on the correlation circle. The variables or groups of variables that are not correlated are situated 90$^\circ$ from one another on the circle.

Variables located close to the origin may carry information on other axes, so it might be necessary to inspect the correlation circle plot in the subsequent dimensions. More details about the correlation circle interpretation can be found in \cite{Gonzalez_2012}.

In the high dimensional case, the interpretation of the correlation structure between variables from two or more data tables can be difficult, and a \Rcode{cutoff} can be chosen to remove some weaker associations.

\medskip

<>=
plotVar(midt, radIn=0.55, varNames=TRUE, cutoff=0.55)
@


\subsection{How many imputations\,?}
%============================================================================

When using MI, one of the uncertainties concerns the number \Rcode{M} of imputed datasets needed to obtain satisfactory results. The number of imputed datasets in MI depends to a large extent on the proportion of missing data. The greater the missingness, the larger the number of imputations needed to obtain stable results. However, in multiple hot-deck imputation, the number of imputed datasets is limited by the size of the donor pools.

The appropriate number of imputations can be informally determined by carrying out MI-MFA on $N$ replicate sets of $M_l$ imputations for $l=0,1,2,\ldots ,$ with $M_0 < M_1< M_2 < \cdots < M_{max}$, until the estimated compromise configurations are stabilized.

The \Rcode{tuneM} function implements such a procedure. Collections of size \Rcode{N} are generated for each number of imputations \Rcode{M}, with \Rcode{M = seq(inc, Mmax, by=inc)}.
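As a worked instance, assuming the illustrative values \Rcode{inc=10} and \Rcode{Mmax=100}, this grid is
$$M_l = 10\,(l+1), \quad l = 0, 1, \ldots, 9, \qquad \text{that is, } M \in \{10, 20, \ldots, 100\}.$$
Two configurations can be compared with the RV coefficient. In its standard definition, stated here as background rather than as a description of the package internals, for two column-centered configuration matrices $\boldsymbol{F}_{\!a}$ and $\boldsymbol{F}_{\!b}$ with the same rows (notation introduced for this sketch),
$$\mathrm{RV}(\boldsymbol{F}_{\!a}, \boldsymbol{F}_{\!b}) = \frac{\mathrm{tr}\left(\boldsymbol{F}_{\!a}\boldsymbol{F}_{\!a}'\,\boldsymbol{F}_{\!b}\boldsymbol{F}_{\!b}'\right)}{\sqrt{\mathrm{tr}\left((\boldsymbol{F}_{\!a}\boldsymbol{F}_{\!a}')^2\right)\,\mathrm{tr}\left((\boldsymbol{F}_{\!b}\boldsymbol{F}_{\!b}')^2\right)}},$$
which lies between 0 and 1 and equals 1 when the two configurations coincide up to a rotation and a global scaling.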
The stability of the estimated MI-MFA configurations is then determined by calculating the RV coefficient between the configurations obtained using \Rcode{M}$_{\,l}$ and \Rcode{M}$_{\,l+1}$ imputations.

\medskip

<>=
set.seed(1)
tune <- tuneM(midt, ncomp=2, Mmax=100, inc=10, N=20, showPlot=FALSE)
@

<>=
tune <- tuneM(midt, ncomp=2, Mmax=100, inc=10, N=20)
tune
@

<<>>=
tune$stats
@

<>=
tune$ggp
@

The values shown are the mean RV coefficients for the \Rcode{N=20} two-dimensional configurations as a function of the number of imputations. Error bars represent the standard deviation of the RV coefficients.


%============================================================================
\section{Methods behind \Rpackage{missRows}}
%============================================================================

\subsection{The MI-MFA approach}
%============================================================================

To deal with multiple tables with missing individuals, \Rpackage{missRows} implements the MI-MFA method in \cite{Voillet_2016}, a multiple imputation approach in a multiple factor analysis framework.

Following the MI methodology, MI-MFA proceeds as follows. Let $\boldsymbol{K} = \{\boldsymbol{K}_{\!1}, \ldots ,\boldsymbol{K}_{\!J}\}$ be a set of $J$ data tables, where each $\boldsymbol{K}_{\!j}$ corresponds to a table of quantitative variables measured on the same individuals. The following three steps are then carried out:

\smallskip

\begin{center}
\begin{minipage}{15cm}
\begin{itemize}
\item[1.] Imputation: generate $M$ different imputed datasets $\boldsymbol{K}^{(1)},\ldots ,\boldsymbol{K}^{(m)},\ldots , \boldsymbol{K}^{(M)}$ of $\boldsymbol{K}$ using multiple hot-deck imputation \cite{Kalton_1986}.\medskip

\item[2.] MFA analysis: perform an MFA on each $\boldsymbol{K}^{(m)}$ imputed dataset leading to $M$ different configurations $\boldsymbol{F}_{\!1}, \dots ,\boldsymbol{F}_{\!m},\dots , \boldsymbol{F}_{\!M}$.\medskip

\item[3.]
Combination: find a consensus configuration between all $\boldsymbol{F}_{\!1},\dots ,\boldsymbol{F}_{\!M}$ configurations using the STATIS method \cite{Lavit_1994}.\medskip
\end{itemize}
\end{minipage}
\end{center}

These steps are outlined in Figure~\ref{overview_method} and described in detail in our publication (Voillet \textit{et al.} 2016, \cite{Voillet_2016}).


\subsection{Imputation of missing rows and estimation of MFA axes}
%============================================================================

Although the objective of MI-MFA is to estimate the MFA components in spite of missing values, an estimation of the MFA axes and an imputation of the missing data values can also be obtained. Consequently, the MI-MFA approach can also be viewed as a single imputation method. Moreover, since the imputation is based on an MFA model (on the components and axes), it takes into account similarities between individuals and relationships between variables.

Since the core of MFA is a PCA of the weighted data table $\boldsymbol{K}$, the algorithm suggested to estimate MFA axes and impute missing values is inspired by the alternating least squares algorithm used in PCA. This consists of finding matrices $\boldsymbol{F}$ and $\boldsymbol{U}$ which minimize the following criterion:
$$||\boldsymbol{K}-\boldsymbol{M}-\boldsymbol{F}\boldsymbol{U}'||^2 =
\sum_{i}\sum_{k}\left( \boldsymbol{K}_{ik} - \boldsymbol{M}_{ik} -
\sum_{d=1}^D \boldsymbol{F}_{id} \boldsymbol{U}_{kd} \right)^2,$$

\noindent where $\boldsymbol{M}$ is a matrix with each row equal to the vector of variable means and $D$ is the number of dimensions retained in the PCA. The solution is obtained by alternating two multiple regressions until convergence, one for estimating axes (loadings $\hat{\boldsymbol{U}}$) and one for components (scores $\hat{\boldsymbol{F}}$):
$$
\begin{array}{rcl}
\hat{\boldsymbol{U}}' \!\!\!& = &\!\!\!
(\hat{\boldsymbol{F}}'\hat{\boldsymbol{F}})^{-1}
\hat{\boldsymbol{F}}'(\boldsymbol{K}-\hat{\boldsymbol{M}})\\ \\
\hat{\boldsymbol{F}} \!\!\!& = &\!\!\!
(\boldsymbol{K}-\hat{\boldsymbol{M}})\hat{\boldsymbol{U}}
(\hat{\boldsymbol{U}}'\hat{\boldsymbol{U}})^{-1}.
\end{array}
$$

The algorithm to estimate MFA axes and impute missing values works as follows. Let $\boldsymbol{K} = [\boldsymbol{K}_1,\ldots, \boldsymbol{K}_J]$ be the merged matrix containing missing values and $\boldsymbol{F}$ the matrix containing the compromise components of the MI-MFA. The following steps are then carried out:

\smallskip

\begin{center}
\begin{minipage}{15.5cm}
\begin{description}
\item[] Step 0. Initialization $l = 0$: $\boldsymbol{K}^0$ is obtained by substituting missing values with initial values (for example with column means on the non-missing entries or zeros); $\hat{\boldsymbol{M}}^0$ is computed.\medskip

\item[] Step 1. Calculate $(\hat{\boldsymbol{U}}^l)' = (\boldsymbol{F}'\boldsymbol{F})^{-1} \boldsymbol{F}' (\boldsymbol{K}^{l-1} - \hat{\boldsymbol{M}}^{l-1})$.\medskip

\item[] Step 2. Missing values are estimated as $\hat{\boldsymbol{K}}^l = \hat{\boldsymbol{M}}^{l-1} + \boldsymbol{F}(\hat{\boldsymbol{U}}^l)'$.\medskip

\item[] Step 3. The new imputed data set $\boldsymbol{K}^l$ is obtained by replacing the missing values of the original $\boldsymbol{K}$ matrix with the corresponding elements of $\hat{\boldsymbol{K}}^l$, whilst keeping the observed values unaltered.\medskip

\item[] Step 4. $\hat{\boldsymbol{M}}^l$ is computed on $\boldsymbol{K}^l$.\medskip

\item[] Step 5. Steps 1 to 4 are repeated for $l = 1, 2, \ldots$ until convergence.
\end{description}
\end{minipage}
\end{center}


\newpage
%============================================================================
\section{Session information}
%============================================================================

The following is the session info that generated this tutorial:

\medskip

<>=
sessionInfo()
@


\newpage
%============================================================================
% \section{References}
%============================================================================

\bibliography{missRows}

\end{document}