\documentclass[a4paper]{article}
\title{How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures}
\author{Dongxiao Zhu}
% \VignetteIndexEntry{Multivariate Correlation Estimator}
\begin{document}
\maketitle
\section{Introduction}
OMICS data are increasingly available to biomedical researchers, and (biological) replications are more and more affordable for gene microarray and proteomics experiments. The functional relationship between a pair of genes or proteins is often inferred by calculating a correlation coefficient between their expression profiles. Classical correlation estimation techniques, such as the Pearson correlation coefficient, do not explicitly take replicated data into account. As a result, biological replicates are often averaged before correlations are calculated. Averaging is not justified when concordance between replicates is poor or when the replicates have dissimilar variances. Based on our recently proposed multivariate correlation estimator, \texttt{CORREP} implements functions for estimating multivariate correlation from replicated OMICS data, together with statistical inference procedures. In this vignette I demonstrate a non-trivial task accomplished using \texttt{CORREP}. First, let us look at examples of replicated and non-replicated OMICS data. $$x_{11} , x_{12} , \ldots , x_{1n}$$ $$x_{21} , x_{22} , \ldots , x_{2n}$$ $$\vdots$$ $$x_{{m_1}1} , x_{{m_1}2} , \ldots , x_{{m_1}n}$$ $$y_{11} , y_{12} , \ldots , y_{1n}$$ $$y_{21} , y_{22} , \ldots , y_{2n}$$ $$\vdots$$ $$y_{{m_2}1} , y_{{m_2}2} , \ldots , y_{{m_2}n}$$ versus $$x_{1} , x_{2} , \ldots , x_{n}$$ $$y_{1} , y_{2} , \ldots , y_{n}$$ In this toy example, $X$ and $Y$ are a pair of genes or proteins of interest. Using microarrays or mass spectrometry, we profile their expression over $n$ biological conditions, such as knockouts, overexpression or siRNA treatments, either with replication (upper example) or without replication (lower example).
There are $m_1$ and $m_2$ replicates for $X$ and $Y$, respectively. It is often of biological interest how strong the correlation between $X$ and $Y$ is over the $n$ conditions, and whether it is statistically significant. Significant correlation between $X$ and $Y$ often indicates potential functional relevance. In addition, one minus the correlation is commonly used as the distance measure for hierarchical clustering of a set of genes or proteins. Correlation estimation is therefore an important problem in functional genomics research. For non-replicated OMICS data, estimating correlation is relatively straightforward, since there are plenty of established methods, such as the Pearson correlation coefficient and its non-parametric alternatives, Spearman's $\rho$ and Kendall's $\tau$. However, it is not straightforward to estimate correlation from replicated OMICS data. A naive approach is to average over replicates, transforming the replicated data into non-replicated data. This approach may work for relatively ``clean'' data in which the within-replicate correlation is high. Unfortunately, most OMICS data are noisy, as partially reflected by poor within-replicate correlation, and averaging such ``noisy'' data is not justified. In order to properly estimate the correlation of $X$ and $Y$ from replicated data, we must explicitly model the within-replicate correlation together with the between-replicate correlation. The next section briefly describes the models and methods underlying the multivariate correlation estimator. Interested readers may refer to our manuscript \cite{Zhu07} for more technical details, and to \cite{Medvedovic04} for related Bayesian mixture model methods. \section{Methods} \subsection{The model} Instead of averaging, we exploit all the replicated observations by assuming the data are i.i.d.
samples from a multivariate normal distribution with a specified correlation matrix and a mean vector, i.e., $Z_{j}=(x_{j1},\dots,x_{jp},y_{j1},\dots,y_{jq})^{T}$, $j=1,2,\dots, n$, follows a $(p+q)$-variate normal distribution $N(\mu,\Sigma)$, where $\mu=\left[\begin{array}{ll}\mu_{x}e_{m}\\ \mu_{y}e_{m}\end{array}\right]$, $e_{m}=(1,\dots,1)^{T}$ is an $m \times 1$ vector (for balanced data, $p=q=m$), and the correlation matrix $\Sigma$ is a $(p+q) \times (p+q)$ matrix with structure:\\ \begin{equation} \label{eq:model} \Sigma=\left( \begin{array}{llllll} 1&\ldots&\rho_{x}&\rho&\ldots&\rho\\ \vdots& \ddots & \vdots&\vdots&\ddots & \vdots\\ \rho_{x}&\ldots&1&\rho&\ldots&\rho\\ \rho&\ldots&\rho&1&\ldots&\rho_{y}\\ \vdots& \ddots & \vdots&\vdots&\ddots & \vdots\\ \rho&\ldots&\rho&\rho_{y}&\ldots&1\\ \end{array} \right)=\left[ \begin{array}{ll} \Sigma_{x}&\Sigma_{xy}\\ \Sigma_{xy}^{T}&\Sigma_{y} \end{array} \right], \end{equation} where the inter-molecule correlation $\rho$ is the parameter of interest, and the intra-molecule correlations $\rho_{x}$ and $\rho_{y}$ are nuisance parameters. The values of $\rho_{x}$ and $\rho_{y}$ indicate the quality of the replicates: high-quality replicates tend to have high values, and {\it vice versa}. We thus employ three parameters, $\rho$, $\rho_{x}$ and $\rho_{y}$, to model the correlation structure of replicated OMICS data.
\subsection{Multivariate correlation estimator}\label{sec:est} Assuming a multivariate normal model, the Maximum Likelihood Estimate (MLE) of $\rho$ can be derived as follows (see the manuscript for more details): \begin{equation} \label{eq:mux} \hat{\mu}_{x}= \frac{1}{n}\frac{1}{m}\sum_{j=1}^{n}\sum_{i=1}^{m}x_{ij} \end{equation} \noindent Similarly,\\ \begin{equation} \label{eq:muy} \hat{\mu}_{y}= \frac{1}{n}\frac{1}{m}\sum_{j=1}^{n}\sum_{i=1}^{m}y_{ij} \end{equation} \noindent therefore, $\hat{\mu}=\left[\begin{array}{ll}\hat{\mu}_{x}e_{m}\\ \hat{\mu}_{y}e_{m}\end{array}\right]$ \\ \noindent The MLE of $\Sigma$ is \\ \begin{equation} \label{eq:estimator} \hat{\Sigma}= \frac{1}{n}\sum_{j=1}^{n}(Z_{j}-\hat{\mu})(Z_{j}-\hat{\mu})^{T} \end{equation} To derive the MLE of $\rho$, the ideal method would be to obtain the likelihood explicitly as a function of $\rho$. However, this proved to be intractable in practice (see the manuscript for a detailed discussion). Our approach is to use the average of the elements of $\hat{\Sigma}_{xy}$ estimated via Eq. \ref{eq:estimator}:\\ \begin{equation} \label{eq:estimator.avg} \hat\rho=\mathrm{Avg}(\hat{\Sigma}_{xy}) \end{equation} The sample Pearson correlation coefficient can also be written in the following form: \begin{equation} \mathrm{cor}(X,Y) = \frac{\sum_{j=1}^{n}(\bar{x}_{j}-\bar{x})(\bar{y}_{j}-\bar{y})}{(n-1)S_XS_Y}, \end{equation} where $S_X$ and $S_Y$ are the standard deviations of $X$ and $Y$, respectively. When there is no replicate ($m_1=m_2=1$), the correlation matrix $\Sigma$ reduces to a $2 \times 2$ matrix with diagonal elements equal to $1$ and off-diagonal elements equal to $\rho$. It is easy to show from Eq. \ref{eq:estimator} that \begin{equation} \hat\rho = \frac{\sum_{j=1}^{n}(\bar{x}_{j}-\bar{x})(\bar{y}_{j}-\bar{y})}{nS_XS_Y} \end{equation} Hence, when there is no replicate, the two estimators are connected as follows: \begin{equation} \hat\rho = \frac{n-1}{n}\,\mathrm{cor}(X,Y).
\end{equation} \subsection{Statistical inference procedures} For data with very small sample size, e.g., $n < 4$, we recommend using all permutations to approximate the null distribution (see the manuscript for details). For moderate to large sample sizes, we provide a Likelihood Ratio Test (LRT) of the hypothesis that the multivariate correlation $\rho$ vanishes. Under the multivariate normal distribution assumption, $Z_j \sim N(\mu, \Sigma)$, and we test the following hypothesis: \begin{equation} H_0:Z \in N(\mu, \Sigma_0) \; \mbox{ versus } \; H_{\alpha}:Z \in N(\mu, \Sigma_1). \end{equation} Here, both $\Sigma_0$ and $\Sigma_1$ are $(m_1 + m_2) \times (m_1+m_2)$ matrices, where $m_1$ and $m_2$ are the numbers of replicates for biomolecules $X$ and $Y$, and $\Sigma_0= \left( \begin{array} {cc} \Sigma_x & {\bf 0} \\ {\bf 0}^{T} & \Sigma_y \end{array} \right)$, $\Sigma_1= \left( \begin{array} {cc} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^{T} & \Sigma_y \end{array} \right)$, where $\Sigma_x$ and $\Sigma_y$ have unit diagonal elements and all off-diagonal entries equal to $\rho_x$ and $\rho_y$, respectively. ${\bf 0}$ is an $m_1 \times m_2$ zero matrix; that is, under the null hypothesis the inter-molecule correlation $\rho$ vanishes. $\Sigma_{xy}$ is an $m_1 \times m_2$ matrix with all entries equal to $\rho$, and $\Sigma_{xy}^{T}$ is its $m_2 \times m_1$ transpose. The Likelihood Ratio (LR) statistic for testing the two correlation structures can be derived as follows: \begin{equation} \Lambda = \frac{|\hat{\Sigma}_0|^{-n/2}e^{-\frac{1}{2} \sum_{j=1}^{n}(Z_j-\hat{\mu})^{'}(\hat{\Sigma}_0)^{-1}(Z_j-\hat{\mu})}} {|\hat{\Sigma}_1|^{-n/2} e^{-\frac{1}{2} \sum_{j=1}^{n}(Z_j-\hat{\mu})^{'}(\hat{\Sigma}_1)^{-1}(Z_j-\hat{\mu})} }.
\end{equation} Note that for the test to be a true LRT, all the estimated quantities $\hat{(\cdot)}$ in the above formula should be MLEs. In Section \ref{sec:est}, Eqs. \ref{eq:mux}, \ref{eq:muy} and \ref{eq:estimator} give the formulas for the MLEs of the mean vector and the correlation matrix under $H_\alpha$. The MLE of the correlation matrix under $H_0$ is $\hat{\Sigma}_0= \left( \begin{array} {cc} \hat{\Sigma}_x & O \\ O & \hat{\Sigma}_y \end{array} \right)$, where \begin{eqnarray} \hat{\Sigma}_x &=& \frac{1}{n}\sum_{j=1}^{n}(X_{j}-\hat{\mu}_x)(X_{j}-\hat{\mu}_x)',\\ \hat{\Sigma}_y &=& \frac{1}{n}\sum_{j=1}^{n}(Y_{j}-\hat{\mu}_y)(Y_{j}-\hat{\mu}_y)'. \end{eqnarray} The LR statistic, denoted by $G^2$, is therefore: \begin{equation} G^2 = -2\log \Lambda = n \left[\mathrm{tr}\, M -\log |M| -(m_1+m_2)\right], \end{equation} where $M = (\hat{\Sigma}_0)^{-1}\hat{\Sigma}_1$. The LR statistic is asymptotically chi-square distributed with $2m_1m_2$ degrees of freedom under $H_0$. \section{Data Analysis Examples} In this example, we analyze a subset of $205$ genes whose expression was profiled with $4$ replicates under $20$ physiological/genetic conditions \cite{Medvedovic04}. The full data set was originally reported in \cite{Ideker00}. We first estimate all pairwise correlations and then use one minus the correlation as the distance measure to cluster these genes. Medvedovic {\it et~al}. (2004) \cite{Medvedovic04} classified the $205$ genes into $4$ functional groups according to their GO annotations. The four classes are: \begin{itemize} \item Biosynthesis; protein metabolism and modification \item Energy pathways; carbohydrate metabolism; catabolism \item Nucleobase, nucleoside, nucleotide and nucleic acid metabolism \item Transport \end{itemize} The memberships of the $205$ genes are stored in the internal data set ``true.member''.
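Before turning to clustering, note that the $G^2$ statistic of the Methods section can be sketched directly from its definition. The helper below is a purely hypothetical illustration (it is not a function of the \texttt{CORREP} package, and the chunk is not evaluated), assuming $Z$ holds the centered observations $Z_j-\hat{\mu}$ in its rows:
<<eval=FALSE>>=
##Hypothetical sketch of the LR test for one gene pair; Z is an
##n x (m1+m2) matrix whose rows are the centered observations
##(the m1 replicates of X followed by the m2 replicates of Y)
G2.pvalue <- function(Z, m1, m2) {
    n  <- nrow(Z)
    S1 <- crossprod(Z)/n                   ##MLE of Sigma under H_alpha
    S0 <- S1                               ##under H_0 the off-diagonal
    S0[1:m1, (m1 + 1):(m1 + m2)] <- 0      ##blocks Sigma_xy vanish
    S0[(m1 + 1):(m1 + m2), 1:m1] <- 0
    M  <- solve(S0) %*% S1
    G2 <- n * (sum(diag(M)) - log(det(M)) - (m1 + m2))
    pchisq(G2, df = 2 * m1 * m2, lower.tail = FALSE)
}
@
Because $\hat{\Sigma}_0$ shares its diagonal blocks with $\hat{\Sigma}_1$, $\mathrm{tr}\,M = m_1+m_2$ and the statistic reduces to $-n\log|M|$, which is non-negative as a test statistic should be.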
We can compare the performance of the multivariate correlation estimator with that of the Pearson correlation estimator through hierarchical clustering, by comparing the clustering results to the ``external knowledge'' above.
<<>>=
library(CORREP)
data(gal_all)
##Average the 4 replicates within each of the 20 conditions;
##each row of gal_all holds 20 consecutive blocks of 4 replicate values
gal_avg <- apply(gal_all, 1, function(x) colMeans(matrix(unlist(x), nrow = 4)))
##Estimate ordinary correlation based on the averaged profiles
M1 <- cor(gal_avg)
@
The above code calculates a $205 \times 205$ correlation matrix using the Pearson correlation coefficient implemented in the R function \texttt{cor}. Note that we must average over the replicates before the Pearson correlation coefficient can be applied. As mentioned in the paper, the data used in this example are relatively ``clean'' (see the boxplot below for the distribution of within-replicate correlation), which is not a favorable condition for our estimator; nevertheless, the clustering results below show that it still clearly outperforms the Pearson correlation coefficient.
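To illustrate why averaging can be misleading for noisy replicates, here is a small simulated sketch (a toy example of my own, not part of the analysis above; the chunk is not evaluated). Two genes share an underlying profile over $10$ conditions, but each of their $3$ replicates is heavily contaminated with noise, so the Pearson correlation of the averaged profiles can be compared against the output of \texttt{cor.balance}:
<<eval=FALSE>>=
##Toy example: two correlated genes, 3 noisy replicates each, 10 conditions
set.seed(1)
n <- 10
base.x <- rnorm(n)
base.y <- base.x + rnorm(n, sd = 0.5)
X <- t(sapply(1:3, function(i) base.x + rnorm(n, sd = 2)))
Y <- t(sapply(1:3, function(i) base.y + rnorm(n, sd = 2)))
##Pearson correlation after averaging the replicates
cor(colMeans(X), colMeans(Y))
##Multivariate estimator on the standardized replicated data
toy <- t(apply(rbind(X, Y), 1, function(z) z/sd(z)))
cor.balance(toy, m = 3, G = 2)
@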
<<>>=
##Reshape each row (80 values) into a 4 x 20 matrix (replicates by
##conditions) and stack the 205 genes into an 820 x 20 matrix
rawdata <- do.call(rbind, lapply(1:nrow(gal_all),
    function(j) matrix(unlist(gal_all[j, ]), nrow = 4)))
boxplot(cor(rawdata))
##Standardize each row (replicate) to unit variance
stddata <- apply(rawdata, 1, function(x) x/sd(x))
stddata <- t(stddata)
@
The above code reshapes the data into the format expected by the functions in this package: columns correspond to conditions and rows correspond to (replicated) variables; for this data set, an $820 \times 20$ matrix. This step is unnecessary if your data are already in this format. {\bf Moreover, the data must be standardized so that the variance of each row equals $1$. This standardization is essential and must not be omitted.}
<<>>=
M <- cor.balance(stddata, m = 4, G = 205)
@
The above code calculates the $205 \times 205$ correlation matrix using the new multivariate correlation estimator implemented in the R function \texttt{cor.balance}. Note that every gene in this data set has the same number of replicates (a balanced design); for unbalanced data, the R function \texttt{cor.unbalance} should be used instead.
<<>>=
row.names(M) <- row.names(M1)
colnames(M) <- colnames(M1)
@
We next use one minus the correlation as the distance measure to cluster the genes into $4$ groups, and assess the consistency between these groups and the $4$ pre-defined groups.
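For completeness, here is a minimal sketch of the unbalanced case (not evaluated; the row selection is purely illustrative, and I assume \texttt{cor.unbalance} takes the stacked, standardized replicates of a single gene pair together with the two replicate counts):
<<eval=FALSE>>=
##Illustrative only: pretend gene 1 has 4 replicates and gene 2 has 3,
##taking rows from the standardized data matrix constructed above
pair <- stddata[c(1:4, 5:7), ]
cor.unbalance(pair, m1 = 4, m2 = 3)
@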
<<>>=
M.rep <- 1 - M
M.avg <- 1 - M1
d.rep <- as.dist(M.rep)
d.avg <- as.dist(M.avg)
@
The above code computes the distance matrices for hierarchical clustering from both the Pearson correlation coefficient and the multivariate correlation estimator.
<<>>=
library(e1071)
library(cluster)
data(true.member)
g.rep <- diana(d.rep)
g.avg <- diana(d.avg)
@
The above code performs divisive (top-down) hierarchical clustering, i.e., Diana, classifying all $205$ genes into clusters. We recommend the top-down method in this case because we are interested in identifying a few ($4$) large clusters. If you are instead interested in identifying many small clusters, we recommend agglomerative hierarchical clustering methods; see below for an example.
<<>>=
h.rep <- hclust(d.rep, method = "complete")
h.avg <- hclust(d.avg, method = "complete")
@
The R function \texttt{hclust} performs bottom-up (agglomerative) hierarchical clustering. In addition to choosing a proper distance measure, we need to choose a linkage method, such as ``complete'', ``average'' or ``single''; the ``average'' option appears robust in most cases.
<<>>=
member.rep.k4 <- cutree(g.rep, k = 4)
member.avg.k4 <- cutree(g.avg, k = 4)
classAgreement(table(member.avg.k4, as.matrix(true.member)))
classAgreement(table(member.rep.k4, as.matrix(true.member)))
@
The above code retrieves the memberships of the $205$ genes as determined by divisive clustering, and assesses their consistency with the ``true'' cluster memberships (external knowledge) using the adjusted Rand index. The function \texttt{classAgreement}, used to calculate the adjusted Rand index, is from the ``e1071'' package. The clustering results based on the multivariate correlation estimator are clearly more consistent with the external knowledge (higher adjusted Rand index). We can also examine the ``quality'' of the clusters using, for example, a silhouette plot.
Here, quality roughly refers to the ratio of within-cluster to between-cluster deviation. Although good cluster quality indicates good statistical clustering performance, it does not necessarily indicate biological relevance: good cluster quality is a sufficient condition for good statistical clustering performance, but not a necessary one.
<<>>=
sil.rep4 <- silhouette(member.rep.k4, d.rep)
sil.avg4 <- silhouette(member.avg.k4, d.avg)
plot(sil.rep4, nmax = 80, cex.names = 0.5)
plot(sil.avg4, nmax = 80, cex.names = 0.5)
@
\bibliographystyle{plainnat}
\begin{thebibliography}{}
\bibitem[Medvedovic {\it et~al}., 2004]{Medvedovic04} Medvedovic, M., Yeung, K.Y. and Bumgarner, R.E. (2004) Bayesian mixtures for clustering replicated microarray data. {\it Bioinformatics}, {\bf 20}, 1222--1232.
\bibitem[Ideker {\it et~al}., 2000]{Ideker00} Ideker, T., Thorsson, V., Siegel, A.F. and Hood, L.E. (2000) Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. {\it J. Comput. Biol.}, {\bf 7}, 805--817.
\bibitem[Zhu and Li, 2007]{Zhu07} Zhu, D. and Li, Y. (2007) Multivariate Correlation Estimator for Inferring Functional Relationships from Replicated `OMICS' Data. Submitted.
\end{thebibliography}
\end{document}