%\VignetteIndexEntry{1) Tutorial on using variancePartition} %\VignettePackage{variancePartition} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} % To compile this document % cd /Users/gabrielhoffman/workspace/repos % R % library('knitr'); rm(list=ls()); knit('variancePartition/vignettes/variancePartition.Rnw') \documentclass[12pt]{article} \newcommand{\lowtilde}{\raise.17ex\hbox{$\scriptstyle\mathtt{\sim}$}} <>= library("knitr") opts_chunk$set(tidy=FALSE,dev="pdf",fig.show="hide", fig.width=4,fig.height=4.5, message=FALSE, warning=FALSE) @ <>= BiocStyle::latex(width=90) @ \usepackage[position=top]{subfig} \usepackage{blkarray} \bioctitle[Vignette for variancePartition]{ \Biocpkg{variancePartition}:\\Quantifying and interpreting drivers\\of variation in multilevel gene\\expression experiments} \author{Gabriel Hoffman\\ \small{Pamela Sklar Division of Psychiatric Genomics} \\ \small{Icahn Institute for Genomics and Multiscale Biology}\\ \small{Department of Genetics and Genomic Sciences}\\ \small{Icahn School of Medicine at Mount Sinai}} \begin{document} \maketitle \begin{abstract} \noindent Gene expression datasets are complicated and have multiple sources of biological and technical variation. These datasets have recently become more complex as it is now feasible to assay gene expression from the same individual in multiple tissues or at multiple time points. The \Rpackage{variancePartition} package implements a statistical method to quantify the contribution of multiple sources of variation and decouple within/between-individual variation. In addition, \Rpackage{variancePartition} produces results at the gene-level to identity genes that follow or deviate from the genome-wide trend. \vspace{2cm} \noindent \textbf{variancePartition version:} \Sexpr{packageVersion("variancePartition")} \end{abstract} <>= options(digits=3, width=80, prompt=" ", continue=" ") @ \newpage \tableofcontents \pagebreak \section{Overview} The \Rpackage{variancePartition} package provides a general framework for understanding drivers of variation in gene expression in experiments with complex designs. A typical application would consider a dataset of gene expression from individuals sampled in multiple tissues or multiple time points where the goal is to understand variation within versus between individuals and tissues. \Rpackage{variancePartition} use a linear mixed model to partition the variance attributable to multiple variables in the data. The analysis is built on top of the \Rpackage{lme4} package \cite{Bates2015}, and some basic knowledge about linear mixed models will give you some intuition about the behavior of \Rpackage{variancePartition} \cite{Pinheiro2000, Galecki2010} \subsection{Inputs} There are three components to an analysis: \begin{itemize} \item[1)] {\bf Gene expression data:} In general, this is a matrix of normalized gene expression values with genes as rows and experiments as columns. \\ \begin{itemize}[align=left] \item[-- Count-based quantification:] \Rcode{featureCounts} \cite{Liao2014}, \Rcode{HTSeq} \cite{Anders2015}\\ Counts mapping to each gene can be normalized using counts per million (CPM), reads per kilobase per million (RPKM) or fragments per kilobase per million (FPKM). These count results can be processed with \Rpackage{limma}/\Rcode{voom} \cite{Law2014} to model the precision of each observation or \Rpackage{DESeq2} \cite{Love2014}.\\ \item[-- Isoform quantification:] \Rcode{kallisto} \cite{Bray2015}, \Rcode{sailfish} \cite{Patro2014}, \Rcode{salmon} \cite{Patro2015}, \Rcode{RSEM} \cite{Li2011f}, \Rcode{cufflinks} \cite{Trapnell2010}\\ These perform isoform-level quantification using reads that map to multiple transcripts. Quantification values can be read directly into \R{}, or processed with \Rpackage{ballgown} \cite{Frazee2015} or \Rpackage{tximport} \cite{Soneson2015}.\\ \item[-- Microarray data:] any standard normalization such as \Rcode{rma} in the \Rpackage{oligo} \cite{oligoBioc} package can be used. \\ \end{itemize} \item[2)] {\bf Meta-data about each experiment:} A \Rcode{data.frame} with information about each experiment such as patient ID, tissue, sex, disease state, time point, batch, etc.\\ \item[2)] {\bf Formula indicating which meta-data variables to consider:} An R{} formula such as\\ \Rcode{$\sim$ Age + (1|Individual) + (1|Tissue) + (1|Batch) } indicating which meta-data variables should be used in the analysis. \end{itemize} \Rpackage{variancePartition} will assess the contribution of each meta-data variable to variation in gene expression and can report the intra-class correlation for each variable. \pagebreak \section{Running an analysis} A typical analysis with \Rpackage{variancePartition} is only a few lines of \R{} code, assuming the expression data has already been normalized. Normalization is a separate topic, and I address it briefly in Section \ref{sec:normalize_RNA_Seq}. The simulated dataset included as an example contains measurements of 200 genes from 100 samples. These samples include assays from 3 tissues across 25 individuals processed in 4 batches. The individuals range in age from 36 to 73. A typical \Rpackage{variancePartition} analysis will assess the contribution of each aspect of the study design (i.e. individual, tissue, batch, age) to the expression variation of each gene. The analysis will prioritize these axes of variation based on a genome-wide summary and give results at the gene-level to identity genes that follow or deviate from this genome-wide trend. The results can be visualized using custom plots and can be used for downstream analysis. \subsection{Standard application} <>= # load library library('variancePartition') # load simulated data: # geneExpr: matrix of gene expression values # info: information/metadata about each sample data(varPartData) form <- ~ Age + (1|Individual) + (1|Tissue) + (1|Batch) browser() varPart <- fitExtractVarPartModel( geneExpr, form, info ) # vp <- sortCols( varPart ) # plotPercentBars( vp[1:10,] ) # plotVarPart( vp ) @ <>= # load library library('variancePartition') # load simulated data: # geneExpr: matrix of gene expression values # info: information/metadata about each sample data(varPartData) # Specify variables to consider # Age is continuous so model it as a fixed effect # Individual and Tissue are both categorical, # so model them as random effects # Note the syntax used to specify random effects form <- ~ Age + (1|Individual) + (1|Tissue) + (1|Batch) # Fit model and extract results # 1) fit linear mixed model on gene expression # If categorical variables are specified, # a linear mixed model is used # If all variables are modeled as fixed effects, # a linear model is used # each entry in results is a regression model fit on a single gene # 2) extract variance fractions from each model fit # for each gene, returns fraction of variation attributable # to each variable # Interpretation: the variance explained by each variables # after correcting for all other variables # Note that geneExpr can either be a matrix, # and EList output by voom() in the limma package, # or an ExpressionSet varPart <- fitExtractVarPartModel( geneExpr, form, info ) # sort variables (i.e. columns) by median fraction # of variance explained vp <- sortCols( varPart ) # Figure 1a # Bar plot of variance fractions for the first 10 genes plotPercentBars( vp[1:10,] ) # # Figure 1b # violin plot of contribution of each variable to total variance plotVarPart( vp ) @ \Rpackage{variancePartition} includes a number of custom plots to visualize the results. Since \Rpackage{variancePartition} attributes the fraction of total variation attributable to each aspect of the study design, these fractions naturally sum to 1. \Rcode{plotPercentBars} plots the partitioning results for a subset of genes (Figure 1a), and \Rcode{plotVarPart} shows a genome-wide violin plot of the distribution of variance explained by each variable across all genes (Figure 1b). (Note that these plots show results in terms of \underline{percentage} of variance explained, while the results are stored in terms of the \underline{fraction}.) \begin{figure}[hb] \centering \subfloat[]{\includegraphics[width=.45\textwidth]{figure/simResult-1}}\quad \subfloat[]{\includegraphics[width=.45\textwidth]{figure/simResult-2}} \caption{ \Rpackage{variancePartition} example on simulated data} \label{fig:lab} \end{figure} The core functions of \Rpackage{variancePartition} work seemlessly with gene expression data stored as a \Robject{matrix}, \Robject{data.frame}, \Robject{EList} from \Rpackage{limma} or \Robject{ExpressionSet} from \Rpackage{Biobase}. \Rcode{fitExtractVarPartModel()} returns an object that stores the variance fractions for each gene and each variable in the formula specified. These fractions can be accessed just like a \Rcode{data.frame}: <>= # Access first entries head(varPart) # Access first entries for Individual head(varPart$Individual) # sort genes based on variance explained by Individual head(varPart[order(varPart$Individual, decreasing=TRUE),]) @ \subsubsection{Saving plot to file} In order to save the plot to a file, use the \Rcode{ggsave} function: <>= fig <- plotVarPart( vp ) ggsave(file, fig) @ \pagebreak \subsection{Plot expression stratified by other variables} \Rpackage{variancePartition} also includes plotting functions to visualize the variation across a variable of interest. \Rcode{plotStratify} plots the expression of a gene stratified by the specified variable. In the example dataset, users can plot a gene expression trait stratified by Tissue (Figure 2a) or Individual (Figure 2b). <>= # get gene with the highest variation across Tissues # create data.frame with expression of gene i and Tissue # type for each sample i <- which.max( varPart$Tissue ) GE <- data.frame( Expression = geneExpr[i,], Tissue = info$Tissue) # Figure 2a # plot expression stratified by Tissue plotStratify( Expression ~ Tissue, GE, main=rownames(geneExpr)[i]) # # get gene with the highest variation across Individuals # create data.frame with expression of gene i and Tissue # type for each sample i <- which.max( varPart$Individual ) GE <- data.frame( Expression = geneExpr[i,], Individual = info$Individual) # Figure 2b # plot expression stratified by Tissue label <- paste("Individual:", format(varPart$Individual[i]*100, digits=3), "%") main <- rownames(geneExpr)[i] plotStratify( Expression ~ Individual, GE, colorBy=NULL, text=label, main=main) @ \begin{figure}[h] \centering \subfloat[Tissue]{\includegraphics[width=.45\textwidth]{figure/plotStratify-1}}\quad \subfloat[Individual]{\includegraphics[width=.45\textwidth]{figure/plotStratify-2}} \caption{ Plot gene expression stratified by {\bf a)} Tissue and {\bf b)} Individual} \label{fig:Stratify} \end{figure} For gene141, variation across tissues explains 52.9\% of variance in gene expression. For gene43, variation across Individuals explains 91.4\% of variance in gene expression. \subsection{Intuition about the backend} At the heart of \Rcode{variancePartition}, a regression model is fit for each gene separately and summary statistics are extracted and reported to the user for visualization and downstream analysis. For a single model fit, \Rcode{calcVarPart} computes the fraction of variance explained by each variable. \Rcode{calcVarPart} is defined by this package, and computes these statistics from either a fixed effects model fit with \Rcode{lm} or a linear mixed model fit with \Rcode{lmer}. \Rcode{fitExtractVarPart} loops over each gene, fits the regression model and returns the variance fractions reported by \Rcode{calcVarPart}. Fitting the regression model and extracting variance statistics can also be done directly: <>= library('lme4') # fit regression model for the first gene form_test <- geneExpr[1,] ~ Age + (1|Individual) + (1|Tissue) fit <- lmer(form_test, info, REML=FALSE ) # extract variance statistics calcVarPart(fit) @ \pagebreak \section{Interpretation} \Rpackage{variancePartition} fits a linear (mixed) model that jointly considers the contribution of all specified variables on the expression of each gene. It uses a multiple regression model so that the effect of each variable is assessed while jointly accounting for all others\footnote{\href{https://www.r-bloggers.com/2011/03/anova-–-type-iiiiii-ss-explained/}{Standard ANOVA} implemented in R involves refitting the model while dropping terms, but is aimed at hypothesis testing. \Rpackage{variancePartition}'s \Rcode{calcVarPart} is aimed at estimating variance fractions. It uses a single fit of the linear (mixed) model and evaluates the sum of squares of each term and the sum of squares of the total model fit.}. However, we note that like any multiple regression model, high correlation bewtween fixed or random effect variables (see Section \ref{sec:canCor}) can produce unstable estimates and it can be challanging to identify which variable is responsible for the expression variation. The results of \Rpackage{variancePartition} give insight into the expression data at multiple levels. Moreover, a single statistic often has multiple equivalent interpretations while only one is relevant to the biological question. Analysis of the example data in Figure 1 gives some strong interpretations. Considering the median across all genes, \begin{itemize} \item[1)] variation across individuals explains a median of 81.5\% of the variation in expression, after correcting for tissue, batch and age \item[2)] variation across tissues explains a median of 8.2\% of the variation in expression, after correcting for other the variables \item[3)] variation across batches is negligible after correcting for variation due to other variables \item[4)] the effect of age is negligible after correcting for other variables \item[5)] correcting for individual, tissue, batch and age leaves a median of 9.3\% of the total variance in expression. \end{itemize} These statistics also have a natural interpretation in terms of the intra-class correlation (ICC), the correlation between observations made from samples in the same group. Considering the median across across all genes and all experiments, \begin{itemize} \item[1)] the ICC for \underline{individual} is 81.5\%. \item[2)] the ICC for \underline{tissue} is 8.2\%. \item[3)] two randomly selected gene measurements from same \underline{individual}, but regardless of \underline{tissue}, \underline{batch} or \underline{age}, have a correlation of 81.5\%. \item[4)] two randomly selected gene measurements from same \underline{tissue}, but regardless of \underline{individual}, \underline{batch} or \underline{age}, have a correlation of 8.2\%. \item[5)] two randomly selected gene measurements from the same \underline{individual} {\it and} same \underline{tissue}, but regardless of \underline{batch} and \underline{age}, have an correlation of 81.5\% + 8.2\% = 89.7\%. \end{itemize} Note that that the ICC here is interpreted as the ICC after correcting for all other variables in the model. These conclusions are based on the genome-wide median across all genes, but the same type of statements can be made at the gene-level. Moreover, care must be taken in the interpretation of nested variables. For example, \Rcode{Age} is nested within \Rcode{Individual} since the multiple samples from each individual are taken at the same age. Thus the effect of \Rcode{Age} removes some variation from being explained by \Rcode{Individual}. This often arises when considering variation across individuals and across sexes: any cross-sex variation is a component of the cross-individual variation. So the total variation across individuals is the sum of the fraction of variance explained by \Rcode{Sex} and \Rcode{Individual}. This nesting/summing of effects is common for variables that are properties of the individual rather than the sample. For example, sex and ethnicity are always properties of the individual. Variables like age and disease state can be properties of the individual, but could also vary in time-course or longitudinal experiments. The the interpretation depends on the experimental design. The real power of \Rpackage{variancePartition} is to identify specific genes that follow or deviate from the genome-wide trend. The gene-level statistics can be used to identify a subset of genes that are enriched for specific biological functions. For example, we can ask if the 500 genes with the highest variation in expression across tissues (i.e. the long tail for tissue in Figure 1a) are enriched for genes known to have high tissue-specificity. % {\it Note:} The percent variance explained is equivalent to the ICC only when the formula takes the simple form $$ \sim (1|A) + (1|B) + C + ...$$ %For complex formulas using varying coefficient models $$\sim (1|A/B)+...$$ or correlated random intercepts $$\sim (A|B) + ...$$ the percent variance explained is a useful statistical, but does not correspond to ICC. \subsection{Should a variable be modeled as fixed or random effect?} Categorical variables should (almost) always be modeled as a random effect. The difference between modeling a categorical variable as a fixed versus random effect is minimal when the sample size is large compared to the number of categories (i.e. levels). So variables like disease status, sex or time point will not be sensitive to modeling as a fixed versus random effect. However, variables with many categories like \Rcode{Individual} {\it must} be modeled as a random effect in order to obtain statistically valid results. So to be on the safe side, categorical variable should be modeled as a random effect. % \R{} and \Rpackage{variancePartition} handle catagorical variables stored as a \Rcode{factor} very naturally. If categorical variables are stored as an \Rcode{integer} or \Rcode{character}, they must be converted to a \Rcode{factor} before being used with \Rpackage{variancePartition} \Rpackage{variancePartition} fits two types of models: \begin{itemize} \item[1)] linear mixed model where \underline{all} categorical variables are modeled as random effects and all continuous variables are fixed effects. The function \Rcode{lmer} from \Rpackage{lme4} is used to fit this model. \item[2)] fixed effected model, where all variables are modeled as fixed effects. The function \Rcode{lm} is used to fit this model. \end{itemize} \subsection{Which variables should be included?} In my experience, it is useful to include all variables in the first analysis and then drop variables that have minimal effect. However, like all multiple regression methods, \Rpackage{variancePartition} will divide the contribution over multiple variables that are strongly correlated. So, for example, including both sex and height in the model will show sex having a smaller contribution to variation gene expression than if height were omitted, since there variables are strongly correlated. This is a simple example, but should give some intuition about a common issue that arises in analyses with \Rpackage{variancePartition}. \Rpackage{variancePartition} can naturally assess the contribution of both individual and sex in a dataset. As expected, genes for which sex explains a large fraction of variation are located on chrX and chrY. If the goal is to interpret the impact of sex, then there is no issue. But recall the issue with correlated variables and note that individual is correlated with sex, because each individual is only one sex regardless of how many samples are taken from a individual. It follows that including sex in the model reduces the \underline{apparent} contribution of individual to gene expression. In other words, the ICC for individual will be different if sex is included in the model. In general, including variables in the model that do not vary within individual will reduce the apparent contribution of individual as estimated by \Rpackage{variancePartition}. For example, sex and ethnicity never vary between multiple samples from the same individual and will always reduce the apparent contribution of individual. However, disease state and age may or may not vary depending on the study design. In biological datasets technical variability (i.e. batch effects) can often reduce the apparent biological signal. In RNA-seq analysis, it is common for the the impact of this technical variability to be removed before downstream analysis. Instead of including these batch variable in the \Rpackage{variancePartition} analysis, it is simple to complete the expression residuals with the batch effects removed and then feeds these residuals to \Rpackage{variancePartition}. This will increase the fraction of variation explained by biological variables since technical variability is reduced. \subsubsection{Assess correlation between all pairs of variables} \label{sec:canCor} Evaluating the correlation between variables in a important part in interpreting variancePartition results. When comparing two continuous variables, Pearson correlation is widely used. But variancePartition includes categorical variables in the model as well. In order to accommodate the correlation between a continuous and a categorical variable, or two categorical variables we used canonical correlation analysis. Canonical Correlation Analysis (CCA) is similar to correlation between two vectors, except that CCA can accommodate matricies as well. For a pair of variables, canCorPairs assesses the degree to which they co-vary and contain the same information. Variables in the formula can be a continuous variable or a discrete variable expanded to a matrix (which is done in the backend of a regression model). For a pair of variables, canCorPairs uses CCA to compute the correlation between these variables and returns the pairwise correlation matrix. Statistically, let rho be the array of correlation values returned by the standard R function cancor to compute CCA. canCorPairs returns rho / sum(rho) which is the fraction of the maximum possible correlation. Note that CCA returns correlations values between 0 and 1 <>= form <- ~ Individual + Tissue + Batch + Age + Height # Compute Canonical Correlation Analysis (CCA) # between all pairs of variables # returns absolute correlation value C = canCorPairs( form, info) # Plot correlation matrix plotCorrMatrix( C ) @ \begin{figure}[h] \centering \caption{Assess correlation between all pairs of variables} \includegraphics[width=.5\textwidth]{figure/canCorPairs-1} \label{fig:canCorPairs} \end{figure} \pagebreak \section{Advanced analysis} \subsection{Extracting additional information from model fits} Advanced users may want to perform the model fit and extract results in separate steps in order to examine the fit of the model for each gene. Thus the work of \Rcode{fitExtractVarPart} can be divided into two steps: 1) fit the regression model, and 2) extracting variance statistics. \\ <>= form <- ~ Age + (1|Individual) + (1|Tissue) + (1|Batch) # Fit model results <- fitVarPartModel( geneExpr, form, info ) # Extract results varPart <- extractVarPart( results ) @ Note that storing the model fits can use a lot of memory ($\sim$10Gb with 20K genes and 1000 experiments). I do not recommend unless you have a specific need for storing the entire model fit. Instead, \Rcode{fitVarPartModel} can extract any desired information using any function that accepts the model fit from \Rcode{lm}/\Rcode{lmer}. The results are stored in a \Rcode{list} and can be used for downstream analysis. <>= # Fit model and run summary() function on each model fit vpSummaries <- fitVarPartModel( geneExpr, form, info, fxn=summary ) @ <>= # Show results of summary() for the first gene vpSummaries[[1]] @ \subsection{Removing batch effects before fitting model} Gene expression studies often have substantial batch effects, and \Rpackage{variancePartition} can be used to understand the magnitude of the effects. However, we often want to focus on biological variables (i.e. individual, tissue, disease, sex) after removing the effect of technical variables. Depending on the size of the batch effect, I have found it useful to correct for the batch effect first and then perform a \Rpackage{variancePartition} analysis afterward. Subtracting this batch effect can reduce the total variation in the data, so that the contribution of other variables become clearer. Standard analysis: <>= form <- ~ (1|Tissue) + (1|Individual) + (1|Batch) + Age varPart <- fitExtractVarPartModel( geneExpr, form, info ) @ Analysis on residuals: <>= library('limma') # subtract out effect of Batch fit <- lmFit( geneExpr, model.matrix(~ Batch, info)) res <- residuals( fit, geneExpr) # fit model on residuals form <- ~ (1|Tissue) + (1|Individual) + Age varPartResid <- fitExtractVarPartModel( res, form, info ) @ Remove batch effect with linear mixed model <>= # subtract out effect of Batch with linear mixed model modelFit <- fitVarPartModel( geneExpr, ~ (1|Batch), info ) res <- residuals( modelFit ) # fit model on residuals form <- ~ (1|Tissue) + (1|Individual) + Age varPartResid <- fitExtractVarPartModel( res, form, info ) @ If the two-step process requires too much memory, the residuals can be computed more efficiently. Here, run the \Rcode{residuals} function inside the call to \Rcode{fitVarPartModel} to avoid storing the large intermediate results. <>= # extract residuals directly without storing intermediate results residList <- fitVarPartModel( geneExpr, ~ (1|Batch), info, fxn=residuals ) # convert list to matrix residMatrix = do.call(rbind, residList) @ \subsection{Variation within multiple subsets of the data} \label{sec:withinSubset} So far, we have focused on interpreting one variable at a time. But the linear mixed model behind \Rpackage{variancePartition} is a very powerful framework for analyzing variation at multiple levels. We can easily extend the previous analysis of the contribution of individual and tissue on variation in gene expression to examine the contribution of individual {\it within} each tissue. This analysis is as easy as specifying a new formula and rerunning \Rpackage{variancePartition}. Note that is analysis will only work when there are replicates for at least some individuals within each tissue in order to assess cross-individual variance with in a tissue. <>= # specify formula to model within/between individual variance # separately for each tissue # Note that including +0 ensures each tissue is modeled explicitly # Otherwise, the first tissue would be used as baseline form <- ~ (Tissue+0|Individual) + Age + (1|Tissue) + (1|Batch) # fit model and extract variance percents varPart <- fitExtractVarPartModel( geneExpr, form, info, showWarnings=FALSE ) # violin plot plotVarPart( sortCols(varPart), label.angle=60 ) @ \begin{figure}[h] \centering \caption{ Variation across individuals within each tissue} \includegraphics[width=.5\textwidth]{figure/withinTissue-1} \label{fig:withinTissue} \end{figure} This analysis corresponds to a varying coefficient model, where the correlation between individuals varies for each tissue \cite{Pinheiro2000}. Since the variation across individuals is modeled within each tissue, the total variation explained does not sum to 1 as it does for standard application of \Rpackage{variancePartition}. So interpretation as intra-class does not strictly apply and use of \Rcode{plotPercentBars} is no longer applicable. Yet the variables in the study design are still ranked in terms of their genome-wide contribution to expression variation, and results can still be analyzed at the gene level. See Section \ref{sec:variationWithAcross} for statistical details. \subsection{Detecting problems caused by collinearity of variables} \label{sec:collinearity} Including variables that are highly correlated can produce misleading results and overestimate the contribution of variables modeled as fixed effects. This is usually not an issue, but can arise when statistically redundant variables are included in the model. In this case, the model is ``degenerate'' or ``computationally singular'' and parameter estimates from this model are not meaningful. Dropping one or more of the covariates will fix this problem.\\ \\ A check of collinearity is built into \Rcode{fitVarPartModel} and \Rcode{fitExtractVarPartModel}, so the user will be warned if this is an issue.\\ \\ Alternatively, the user can use the \Rcode{colinearityScore} function to evaluate whether this is an issue for a single model fit: <>= form <- ~ (1|Individual) + (1|Tissue) + Age + Height # fit model res <- fitVarPartModel( geneExpr[1:4,], form, info ) @ <>= # evaluate the collinearity score on the first model fit # this reports the correlation matrix between coefficient estimates # for fixed effects # the collinearity score is the maximum absolute correlation value # If the collinearity score > .99 then the variance partition # estimates may be problematic # In that case, a least one variable should be omitted colinearityScore( res[[1]] ) @ \subsection{Including weights computed separately} \Rpackage{variancePartition} automatically used precision weights computed by \Rpackage{voom}, but the user can also specify custom weights using the \Rcode{weightsMatrix} argument. <>= form <- ~ (1|Individual) + (1|Tissue) + Age + Height # Specify custom weights # In this example the weights are simulated from a # uniform distribution and are not meaningful. weights <- matrix(runif(length(geneExpr)), nrow=nrow(geneExpr)) # Specify custom weights res <- fitExtractVarPartModel( geneExpr[1:4,], form, info, weightsMatrix=weights[1:4,] ) @ In addition, setting the \Rcode{useWeights=FALSE} will suppress usage of the weights in all cases, i.e. when the weights are specified manually or implicitly with the results of \Rcode{voom}. \subsection{Including interaction terms} Typical analysis assumes that the effect of each variable on gene expression does not depend on other variables in the model. Sometimes this assumption is too strict, and we want to model an interaction effect whereby the effect of \Rcode{Batch} depends on \Rcode{Tissue}. This can be done easly by specifying an interaction term, \Rcode{(1|Batch:Tissue)}. Since \Rcode{Batch} has 4 categories and \Rcode{Tissue} has 3, this interaction term implicity models a new \Rcode{3*4 = 12} category variable in the analysis. This new interaction term will absorb some of the variance from the \Rcode{Batch} and \Rcode{Tissue} term, so an interaction model should always include the two constituent variables. Here we fit an interaction model, but we observe that interaction between \Rcode{Batch} and \Rcode{Tissue} does not explain much expression variation. <>= form <- ~ (1|Individual) + Age + Height + (1|Tissue) + (1|Batch) + (1|Batch:Tissue) # fit model vpInteraction <- fitExtractVarPartModel( geneExpr, form, info ) plotVarPart( sortCols( vpInteraction ) ) @ \begin{figure}[h] \centering \caption{Fit interaction term} \includegraphics[width=.5\textwidth]{figure/vpInteraction-1} \label{fig:vpInteraction} \end{figure} \pagebreak \section{Applying \Rpackage{variancePartition} to RNA-seq expression data} \label{sec:normalize_RNA_Seq} \Rpackage{variancePartition} works with gene expression data that has already been processed and normalized as for differential expression analysis. \subsection{Gene-level counts} \Rcode{featureCounts} \cite{Liao2014} and \Rcode{HTSeq} \cite{Anders2015} report the number of reads mapping to each gene (or exon). These results are easily read into \R{}. \Rpackage{limma}/\Rcode{voom} and \Rpackage{DESeq2} are widely used for differential expression analysis of gene- and exon-level counts and can be used to process data before analysis with \Rpackage{variancePartition}. This section addresses processing and normalization of gene-level counts, but the analysis is the same for exon-level counts. \subsubsection{\Rpackage{limma}/\Rcode{voom}} Read RNA-seq counts into \R{}, normalize for library size within and between experiments with TMM \cite{Robinson2010c}, estimate precision weights with \Rpackage{limma}/\Rcode{voom}. <>= library('limma') library('edgeR') # identify genes that pass expression cutoff isexpr <- rowSums(cpm(geneCounts)>1) >= 0.5 * ncol(geneCounts) # create data structure with only expressed genes gExpr <- DGEList(counts=geneCounts[isexpr,]) # Perform TMM normalization gExpr <- calcNormFactors(gExpr) # Specify variables to be included in the voom() estimates of # uncertainty. # Recommend including variables with a small number of categories # that explain a substantial amount of variation design <- model.matrix( ~ Batch, info) # Estimate precision weights for each gene and sample # This models uncertainty in expression measurements vobjGenes <- voom(gExpr, design ) # Define formula form <- ~ (1|Individual) + (1|Tissue) + (1|Batch) + Age # variancePartition seamlessly deals with the result of voom() # by default, it seamlessly models the precision weights # This can be turned off with useWeights=FALSE varPart <- fitExtractVarPartModel( vobjGenes, form, info ) @ \subsubsection{\Rpackage{DESeq2}} Process and normalize the gene-level counts before running \Rpackage{variancePartition} analysis. <>= library('DESeq2') # create DESeq2 object from gene-level counts and metadata dds <- DESeqDataSetFromMatrix(countData = geneCounts, colData = info, design = ~ 1) # Estimate library size correction scaling factors dds <- estimateSizeFactors(dds) # identify genes that pass expression cutoff isexpr <- rowSums(fpm(dds)>1) >= 0.5 * ncol(dds) # compute log2 Fragments Per Million # Alternatively, fpkm(), vst() or rlog() could be used quantLog <- log2( fpm( dds )[isexpr,] + 1) # Define formula form <- ~ (1|Individual) + (1|Tissue) + (1|Batch) + Age # Run variancePartition analysis varPart <- fitExtractVarPartModel( quantLog, form, info) @ Note that DESeq2 does not compute precision weights like \Rpackage{limma}/\Rcode{voom}, so they are not used in this version of the analysis. \subsection{Isoform quantification} Other software performs isoform-level quantification using reads that map to multiple transcripts. These include \Rcode{kallisto} \cite{Bray2015}, \Rcode{sailfish} \cite{Patro2014}, \Rcode{salmon} \cite{Patro2015}, \Rcode{RSEM} \cite{Li2011f} and \Rcode{cufflinks} \cite{Trapnell2010}. \subsubsection{\Rpackage{tximport}} Quantifications from \Rcode{kallisto}, \Rcode{salmon}, \Rcode{sailfish} and \Rcode{RSEM} can be read into \R{} and processed with the Bioconductor package \Rpackage{tximport}. The gene- or transcript-level quantifications can be used directly in \Rpackage{variancePartition}. <>= library('tximportData') library('tximport') library('readr') # Get data from folder where tximportData is installed dir <- system.file("extdata", package = "tximportData") samples <- read.table(file.path(dir, "samples.txt"), header = TRUE) files <- file.path(dir, "kallisto", samples$run, "abundance.tsv") names(files) <- paste0("sample", 1:6) tx2gene <- read.csv(file.path(dir, "tx2gene.csv")) # reads results from kallisto txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM") # define metadata (usually read from external source) info_tximport <- data.frame( Sample = sprintf("sample%d", 1:6), Disease=c("case", "control")[c(rep(1, 3), rep(2, 3) )] ) # Extract counts from kallisto y <- DGEList( txi$counts ) # compute library size normalization y <- calcNormFactors(y) # apply voom to estimate precision weights design <- model.matrix( ~ Disease, data = info_tximport) vobj <- voom(y, design) # define formula form <- ~ (1|Disease) # Run variancePartition analysis (on only 10 genes) varPart_tx <- fitExtractVarPartModel( vobj[1:10,], form, info_tximport) @ Code to process results from \Rcode{sailfish}, \Rcode{salmon}, \Rcode{RSEM} is very similar.\\ \\ See tutorial at \url{http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html} for more details. \subsubsection{\Rpackage{ballgown}} Quantifications from Cufflinks/Tablemaker and RSEM can be processed and read into \R{} with the Bioconductor package \Rpackage{ballgown}. <>= library('ballgown') # Get data from folder where ballgown is installed data_directory <- system.file('extdata', package='ballgown') # Load results of Cufflinks/Tablemaker bg <- ballgown(dataDir=data_directory, samplePattern='sample', meas='all') # extract gene-level FPKM quantification # Expression can be convert to log2-scale if desired gene_expression <- gexpr(bg) # extract transcript-level FPKM quantification # Expression can be convert to log2-scale if desired transcript_fpkm <- texpr(bg, 'FPKM') # define metadata (usually read from external source) info_ballgown <- data.frame( Sample = sprintf("sample%02d", 1:20), Batch = rep(letters[1:4], 5), Disease=c("case", "control")[c(rep(1, 10), rep(2, 10) )] ) # define formula form <- ~ (1|Batch) + (1|Disease) # Run variancePartition analysis # Gene-level analysis varPart_gene <- fitExtractVarPartModel( gene_expression, form, info_ballgown) # Transcript-level analysis varPart_transcript <- fitExtractVarPartModel( transcript_fpkm, form, info_ballgown) @ Note that \Rcode{ballgownrsem} can be used for a similar analysis of \Rcode{RSEM} results.\\ \\ See tutorial at \url{http://bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html} for more details. \pagebreak \section{Comparison with other methods on simulated data} Characterizing drivers of variation in gene expression data has typically relied on principal components analysis (PCA) and hierarchical clustering. Here I apply these methods to two simulated datasets to demonstrate the additional insight from an analysis with \Rpackage{variancePartition}. Each simulated dataset comprises 60 experiments from 10 individuals and 3 tissues with 2 biological replicates. In the first dataset, tissue is the major driver of variation in gene expression(Figure \ref{fig:siteDominant}). In the second dataset, individual is the major driver of variation in gene expression (Figures \ref{fig:IndivDominant}). Analysis of simulated data illustrates that PCA identifies the major driver of variation when tissue is dominant and there are only 3 categories. But the results are less clear when individual is dominant because there are now 10 categories. Meanwhile, hierarchical clustering identifies the major driver of variation in both cases, but does not give insight into the second leading contributor. Analysis with \Rpackage{variancePartition} has a number of advantages over these standard methods: \begin{itemize} \item \Rpackage{variancePartition} provides a natural interpretation of multiple variables \begin{itemize} \item figures from PCA/hierarchical clustering allow easy interpretation of only one variable \end{itemize} \vspace{.1in} \item \Rpackage{variancePartition} quantifies the contribution of each variable \begin{itemize} \item PCA/hierarchical clustering give only a visual representation \end{itemize} \vspace{.1in} \item \Rpackage{variancePartition} interprets contribution of each variable to each gene individually for downstream analysis \begin{itemize} \item PCA/hierarchical clustering produces genome-wide summary and does not allow gene-level interpretation \end{itemize} \vspace{2mm} \item \Rpackage{variancePartition} can assess contribution of one variable (i.e. Individual) separately in subset of the data defined by another variable (i.e. Tissue) \end{itemize} <>= library('variancePartition') sim_data = function( n, p, var_indiv, var_site){ info = data.frame(ID = paste("s", 1:n, sep=''), Individual = factor(rep( round(seq(1, n/6, length.out=n/6)),6)), Tissue = factor(sort(rep(toupper(letters[1:3]), n/3))), Age = rpois(n, 50)) geneExpr = matrix(0, nrow=p, ncol=n) for( i in 1:p){ beta_indiv = rnorm(nlevels(info$Individual), 0, sqrt(var_indiv)) beta_site = rnorm(nlevels(info$Tissue), 0, sqrt(var_site)) eta = model.matrix( ~ Individual, info) %*% beta_indiv + model.matrix( ~ Tissue+0, info) %*% beta_site noise = rbeta(1, 10, 100) errVar = var(eta) * (noise) / (1-noise) geneExpr[i,] = eta + rnorm(n, 0, sqrt(errVar)) } colnames(geneExpr) = paste("s", 1:ncol(geneExpr), sep='') rownames(geneExpr) = paste("gene", 1:nrow(geneExpr), sep='') return( list(info=info, geneExpr=geneExpr)) } plotVar = function( geneExpr, info ){ form = ~ (1|Individual) + (1|Tissue) varPart = fitExtractVarPartModel( geneExpr, form, info ) plotVarPart( varPart, col=rainbow(8)[1:3] ) } plotVarCross = function( geneExpr, info, label.angle=30 ){ form = ~ (Tissue+0|Individual) + (1|Tissue) #res = fitVarPartModel( geneExpr, form, info ) #varPart = extractVarPart( res ) varPart = fitExtractVarPartModel( geneExpr, form, info, showWarnings=FALSE ) plotVarPart( varPart, label.angle=label.angle ) } plotPCA = function(geneExpr, col){ dcmp = prcomp(t(geneExpr)) par(mar = c(4, 4, 1, 1) + 0.1) plot(dcmp$x[,1:2], col=col) } plotTree = function(geneExpr, col){ hc = hclust(dist(t(geneExpr))) library('dendextend') hcd = as.dendrogram(hc) labels_colors(hcd) <- col[match(labels(hcd), names(col))] par(mar = c(4, 4, 1, 1) + 0.1, cex=.6) plot( hcd, yaxt='n', horiz=TRUE ) } @ <>= set.seed(1) n = 60 p = 200 data = sim_data( n, p, 1, 4) colTissue = rainbow(nlevels(data$info$Tissue))[data$info$Tissue] names(colTissue) = data$info$ID colIndiv = rainbow(nlevels(data$info$Individual))[data$info$Individual] names(colIndiv) = data$info$ID plotPCA( data$geneExpr, colTissue) plotPCA( data$geneExpr, colIndiv) plotTree( data$geneExpr, colTissue) legend("topleft", legend=levels(data$info$Tissue), fill=rainbow(nlevels(data$info$Tissue))[1:nlevels(data$info$Tissue)], title="Tissue") plotTree( data$geneExpr, colIndiv) legend("topleft", legend=levels(data$info$Individual), fill=rainbow(nlevels(data$info$Individual))[1:nlevels(data$info$Individual)], title="Individual") plotVar( data$geneExpr, data$info ) plotVarCross( data$geneExpr, data$info, label.angle=60 ) @ <>= set.seed(1) n = 60 p = 200 data = sim_data( n, p, 3, 1) colTissue = rainbow(nlevels(data$info$Tissue))[data$info$Tissue] names(colTissue) = data$info$ID colIndiv = rainbow(nlevels(data$info$Individual))[data$info$Individual] names(colIndiv) = data$info$ID plotPCA( data$geneExpr, colTissue) plotPCA( data$geneExpr, colIndiv) plotTree( data$geneExpr, colTissue) legend("topleft", legend=levels(data$info$Tissue), fill=rainbow(nlevels(data$info$Tissue))[1:nlevels(data$info$Tissue)], title="Tissue") plotTree( data$geneExpr, colIndiv) legend("topleft", legend=levels(data$info$Individual), fill=rainbow(nlevels(data$info$Individual))[1:nlevels(data$info$Individual)], title="Individual") plotVar( data$geneExpr, data$info ) plotVarCross( data$geneExpr, data$info, label.angle=60 ) @ <>= # if( "cluster" %in% class(cl) ){ if( exists("cl") ){ library('doParallel') # stop cluster and catch warning if invalid res = tryCatch( {stopCluster(cl)}, warning = function(x) { }, error = function(x) { }, finally={ }) # warning("STOP CLUSTER!!!!!!!!!!!!!!!!!\n") } @ \begin{figure}[h] \centering \caption{ \textbf{Similarity within Tissue is dominant}} \subfloat[PCA - colored by Tissue]{\includegraphics[width=.48\textwidth]{figure/siteDominant-1}}\quad \subfloat[PCA - colored by Individual]{\includegraphics[width=.48\textwidth]{figure/siteDominant-2}}\\ \subfloat[hclust - colored by Tissue]{\includegraphics[width=.48\textwidth]{figure/siteDominant-3}}\quad \subfloat[hclust - colored by Individual]{\includegraphics[width=.48\textwidth]{figure/siteDominant-4}}\\ \subfloat[variancePartition]{\includegraphics[width=.48\textwidth]{figure/siteDominant-5}}\quad \subfloat[variancePartition - within Tissue]{\includegraphics[width=.48\textwidth]{figure/siteDominant-6}} \label{fig:siteDominant} \end{figure} \pagebreak \begin{figure}[tbp] \centering \caption{ \textbf{Similarity within Individual is dominant}} \subfloat[PCA - colored by Tissue]{\includegraphics[width=.48\textwidth]{figure/IndivDominant-1}}\quad \subfloat[PCA - colored by Individual]{\includegraphics[width=.48\textwidth]{figure/IndivDominant-2}}\\ \subfloat[hclust - colored by Tissue]{\includegraphics[width=.48\textwidth]{figure/IndivDominant-3}}\quad \subfloat[hclust - colored by Individual]{\includegraphics[width=.48\textwidth]{figure/IndivDominant-4}}\\ \subfloat[variancePartition]{\includegraphics[width=.48\textwidth]{figure/IndivDominant-5}}\quad \subfloat[variancePartition - within Tissue]{\includegraphics[width=.48\textwidth]{figure/IndivDominant-6}} \label{fig:IndivDominant} \end{figure} \clearpage \section{Statistical details} A \Rpackage{variancePartition} analysis evaluates the linear (mixed) model \begin{eqnarray} y &=& \sum_j X_j\beta_j + \sum_k Z_k \alpha_k + \varepsilon \\ \alpha_k &\sim& \mathcal{N}(0, \sigma^2_{\alpha_k})\\ \varepsilon &\sim& \mathcal{N}(0, \sigma^2_\varepsilon) \end{eqnarray} where $y$ is the expression of a single gene across all samples, $X_j$ is the matrix of $j^{th}$ fixed effect with coefficients $\beta_j$, $Z_k$ is the matrix corresponding to the $k^{th}$ random effect with coefficients $\alpha_k$ drawn from a normal distribution with variance $\sigma^2_{\alpha_k}$. The noise term, $\varepsilon$, is drawn from a normal distribution with variance $\sigma^2_\varepsilon$. Parameters are estimated with maximum likelihood, rather than REML, so that fixed effect coefficients, $\beta_j$, are explicitly estimated. \\ \\ I use the term ``linear (mixed) model'' here since \Rpackage{variancePartition} works seamlessly when a fixed effects model (i.e. linear model) is specified.\\ \\ Variance terms for the fixed effects are computed using the {\it post hoc} calculation \begin{eqnarray} \hat{\sigma}^2_{\beta_j} = var( X_j \hat{\beta}_j). \end{eqnarray} For a fixed effects model, this corresponds to the sum of squares for each component of the model.\\ \\ For a standard application of the linear mixed model, where the effect of each variable is additive, the fraction of variance explained by the $j^{th}$ fixed effect is \begin{eqnarray} \frac{\hat{\sigma}^2_{\beta_j}}{\sum_j \hat{\sigma}^2_{\beta_j} + \sum_k \hat{\sigma}^2_{\alpha_k} + \hat{\sigma}^2_\varepsilon}, \end{eqnarray} by the $k^{th}$ random effect is \begin{eqnarray} \frac{\hat{\sigma}^2_{\alpha_k}}{\sum_j \hat{\sigma}^2_{\beta_j} + \sum_k \hat{\sigma}^2_{\alpha_k} + \hat{\sigma}^2_\varepsilon}, \end{eqnarray} and the residual variance is \begin{eqnarray} \frac{\hat{\sigma}^2_{\varepsilon}}{\sum_j \hat{\sigma}^2_{\beta_j} + \sum_k \hat{\sigma}^2_{\alpha_k} + \hat{\sigma}^2_\varepsilon}. \end{eqnarray} \subsection{Implementation in \R{}} An \R{} formula is used to define the terms in the fixed and random effects, and \Rcode{fitVarPartModel} fits the specified model for each gene separately. If random effects are specified, \Rcode{lmer} from \Rpackage{lme4} is used behind the scenes to fit the model, while \Rcode{lm} is used if there are only fixed effects. \Rcode{fitVarPartModel} returns a list of the model fits, and \Rcode{extractVarPart} returns the variance partition statistics for each model in the list. \Rcode{fitExtractVarPartModel} combines the actions of \Rcode{fitVarPartModel} and \Rcode{extractVarPart} into one function call. \Rcode{calcVarPart} is called behind the scenes to compute variance fractions for both fixed and mixed effects models, but the user can also call this function directly on a model fit with \Rcode{lm}/\Rcode{lmer}. \subsection{Interpretation of percent variance explained} The percent variance explained can be interpreted as the intra-class correlation (ICC) when a special case of Equation 1 is used. Consider the simplest example of the $i^{th}$ sample from the $k^{th}$ individual \begin{eqnarray} y_{i,k} = \mu + Z \alpha_{i,k} + e_{i,k} \end{eqnarray} with only an intercept term, one random effect corresponding to individual, and an error term. In this case ICC corresponds to the correlation between two samples from the same individual. This value is equal to the fraction of variance explained by individual. For example, consider the correlation between samples from the same individual: \begin{eqnarray} ICC &=& cor( y_{1,k}, y_{2,k}) \\ &=& cor( \mu + Z \alpha_{1,k} + e_{1,k}, \mu + Z \alpha_{2,k} + e_{2,k}) \\ &=& \frac{cov( \mu + Z \alpha_{1,k} + e_{1,k}, \mu + Z \alpha_{2,k} + e_{2,k})}{ \sqrt{ var(\mu + Z \alpha_{1,k} + e_{1,k}) var( \mu + Z \alpha_{2,k} + e_{2,k})}}\\ &=& \frac{cov(Z \alpha_{1,k}, Z \alpha_{2,k})}{\sigma^2_\alpha + \sigma^2_\varepsilon} \\ &=& \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\varepsilon} \end{eqnarray} The correlation between samples from different individuals is: \begin{eqnarray} &=& cor( y_{1,1}, y_{1,2}) \\ &=& cor( \mu + Z \alpha_{1,1} + e_{1,1}, \mu + Z \alpha_{1,2} + e_{1,2}) \\ &=& \frac{cov(Z \alpha_{1,1}, Z \alpha_{1,2})}{\sigma^2_\alpha + \sigma^2_\varepsilon} \\ &=& \frac{0}{\sigma^2_\alpha + \sigma^2_\varepsilon} \\ &=& 0 \end{eqnarray} This interpretation in terms of fraction of variation explained (FVE) naturally generalizes to multiple variance components. Consider two sources of variation, individual and cell type with variances $\sigma^2_{id}$ and $\sigma^2_{cell},$ respectively. Applying a generalization of the the previous derivation, two samples are correlated according to: \renewcommand{\arraystretch}{1.8} \begin{center} \begin{tabular}{cccccc} \hline Individual & cell type & variance & Interpretation & Correlation value\\ \hline same & different & $\frac{\sigma^2_{id}}{\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon }$ & FVE by individual & $ICC_{individual}$\\%\\[0.1cm] different & same & $\frac{\sigma^2_{cell}}{\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon }$ & FVE by cell type & $ICC_{cell}$\\%\\[0.1cm] same & same & $\frac{\sigma^2_{id} + \sigma^2_{cell}}{\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon }$ & sum of FVE by individual \& cell type & $ICC_{individual,cell}$\\ %\\[0.1cm] \hline different & different & $\frac{0}{\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon }$ & sample are independent \\%\\[0.1cm] \hline \hline \end{tabular} \end{center} Notice that the correlation between samples from the same individual and same cell type corresponds to the sum of the fraction explained by individual + fraction explained by cell type. This defines ICC for individual and tissue, as well as the combined ICC and relates these values to FVE.\\ \\ In order to illustrate how this FVE and ICC relate to the correlation between samples in multilevel datasets, consider a simple example of 5 samples from 2 individuals and 2 tissues: \begin{center} \begin{tabular}{|c|c|c|} \hline Sample & Individual & Cell type \\ \hline a & 1 & T-Cell \\ b & 1 & T-Cell \\ c & 1 & monocyte \\ d & 2 & T-Cell \\ e & 2 & monocyte\\ \hline \end{tabular} \end{center} Modeling the separate effects of individual and tissue gives the following covariance structure between samples when a linear mixed model is used: \[ \begin{blockarray}{ccccccc} && a & b & c & d & e \\ \begin{block}{cc(ccccc)} &a &\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon & & & &\\ &b & \sigma^2_{id} + \sigma^2_{cell} & \sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon & & &\\ cov(y) = &c & \sigma^2_{id} & \sigma^2_{id} &\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon & &\\ &d & \sigma^2_{cell} & \sigma^2_{cell} &0 &\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon &\\ &e & 0 & 0 &\sigma^2_{cell} & \sigma^2_{id} &\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon \\ \end{block} \end{blockarray} \] The covariance matrix is symmetric so that blank entries take the value on the opposite side of the diagonal. The covariance can be converted to correlation by dividing by $\sigma^2_{id} + \sigma^2_{cell} + \sigma^2_\varepsilon$, and this gives the results from above. This example generalizes to any number of variance components \cite{Pinheiro2000}. \subsection{Variation with multiple subsets of the data} \label{sec:variationWithAcross} The linear mixed model underlying \Rpackage{variancePartition} allows the effect of one variable to depend on the value of another variable. Statistically, this is called a varying coefficient model \cite{Pinheiro2000, Galecki2010}. This model arises in \Rpackage{variancePartition} analysis when the variation explained by individual depends on tissue or cell type.\\ \\ A given sample is only from one cell type, so this analysis asks a question about a subset of the data. The the data is implicitly divided into subsets base on cell type and variation explained by individual is evaluated within each subset. The data is not actually divided onto subset, but the statistical model essentially examples samples with each cell type. This subsetting means that the variance fractions do not sum to 1.\\ \\ Consider a concrete example with variation from across individual and cell types (T-cells and monocytes) with data from the $i^{th}$ sample from the $k^{th}$ individual, sex of $s$ and cell type $c$. Modeling the variation across individuals within cell type corresponds to \begin{eqnarray} y_{i,k,s,c} = \mu + Z^{(sex)} \alpha_{i,s} + Z^{(Tcell| id)} \alpha_{i,k,c} + Z^{(monocyte| id)} \alpha_{i,k,c} + e_{i,k,s,c} \end{eqnarray} with corresponding variance components: \begin{center} \begin{tabular}{cc} \hline Variance component & Interpretation \\ \hline $\sigma^2_{sex}$ & variance across sex (which is the same for all cell types)\\ $\sigma^2_{(Tcell| id)}$ & variance across individuals within T-cells \\ $\sigma^2_{(monocyte| id)}$ & variance across individuals within monocytes \\ $\sigma^2_\varepsilon$ & residual variance \\ \hline \end{tabular} \end{center} Since the dataset is now divided into multiple subsets, direct interpretation of the fraction of variation explained (FVE) as intra-class correlation does not apply. Instead, we compute a ``pseudo-FVE" by approximating the total variance attributable to cell type by using a weighted average of the within cell type variances weighted by the sample size within each cell type. Thus the values of pseudo-FVE do not have the simple interpretation as in the standard application of \Rpackage{variancePartition}, but allows ranking of variables based on genome-wide contribution to variance and analysis of gene-level results. \subsection{Relationship between \Rpackage{variancePartition} and differential expression} Differential expression (DE) is widely used to identify gene which show difference is expression between two subsets of the data (i.e. case versus controls). For a single gene, DE analysis measures the difference in mean expression between the two subsets. (Since expression is usually analyzed on a log scale, DE results are usually shown in terms of log fold changes between the two subsets ). In Figure \ref{fig:DEVP}, consider two simulated examples of a gene whose expression differs between males and females. The mean expression in males is 0 and the mean expression in females is 2 in both cases. Therefore, the fold change is 2 in both cases.\\ \\ However, the fraction of expression variation explained by sex is very different in these two examples. In example A, there is very little variation {\it within} each sex, so that variation {\it between} sexes is very high at 91.1\%. Conversely, example B shows high variation {\it within} sexes, so that variation {\it between} sexes is only 17.8\%.\\ \\ The fact that the fold change or the fraction of variation is significantly different from 0 indicates differential expression between the two sexes. Yet these two statistics have different interpretations. The fold change from DE analysis tests a difference in means between two sexes. The fraction of variation explained compares the variation explained by sex to the total variation.\\ \\ Thus the fraction of variation explained reported by \Rpackage{variancePartition} reflects as different aspect of the data not captured by DE analysis. <>= set.seed(1) library('variancePartition') n = 500 data = data.frame(Sex = factor(c(rep('F', n), rep('M', n)))) data$expression = rnorm(2*n, (as.integer(data$Sex)-1)*2, .3) data$Sex = factor(data$Sex) fit = lm(expression ~ Sex, data) # calcVarPart(fit) # coef(fit)[2] plotStratify( expression ~ Sex, data, ylim=c(-6, 9)) + annotate("text", x = 0.5, y = 9, hjust=0, label = paste("fold change:", format(coef(fit)[2], digits=3)), size=5.5) + annotate("text", x = 0.5, y = 8.2, hjust=0, label = paste("% variance of expression:", format(calcVarPart(fit)[1]*100, digits=3), "%"), size=5.5) n = 500 data = data.frame(Sex = factor(c(rep('F', n), rep('M', n)))) data$expression = rnorm(2*n, (as.integer(data$Sex)-1)*2, 2.01) data$Sex = factor(data$Sex) fit = lm(expression ~ Sex, data) # calcVarPart(fit) # coef(fit)[2] plotStratify( expression ~ Sex, data, ylim=c(-6, 9)) + annotate("text", x = 0.5, y = 9, hjust=0, label = paste("fold change:", format(coef(fit)[2], digits=3)), size=5.5) + annotate("text", x = 0.5, y = 8.2, hjust=0, label = paste("% variance of expression:", format(calcVarPart(fit)[1]*100, digits=3), "%"), size=5.5) @ \begin{figure}[h] \centering \caption{ \textbf{Compare variancePartition and differential expression}} \subfloat[Example A]{\includegraphics[width=.5\textwidth]{figure/DE-1}} \subfloat[Example B]{\includegraphics[width=.5\textwidth]{figure/DE-2}} \label{fig:DEVP} \end{figure} \subsection{Modelling error in gene expression measurements} Uncertainty in the measurement of gene expression can be modeled with precision weights and tests of differentially expression using \Rcode{voom} in \Rpackage{limma} model this uncertainty directly with a heteroskedastic linear regression \cite{Law2014}. \Rpackage{variancePartition} can use these precision weights in a heteroskedastic linear mixed model implemented in \Rpackage{lme4} \cite{Bates2015}. These precision weights are used seamlessly by calling \Rcode{fitVarPartModel} or \Rcode{fitExtractVarPartModel} on the output of \Rcode{voom}. Otherwise the user can specify the weights with the \Rcode{weightsMatrix} parameter. \pagebreak \section{Frequently asked questions} Note that many warnings and errors can be overridden by specifying \newline \Rcode{suppressWarnings=TRUE} for \Rcode{dream()} and \Rcode{showWarnings=FALSE} for \newline\Rcode{fitExtractVarPartModel()} and \Rcode{fitVarPartModel()}. Interpreting warnings and errors from \Rcode{fitVarPartModel} and \Rcode{fitExtractVarPartModel}: \subsection{Warnings} \begin{itemize} \item \Rcode{No Intercept term was specified in the formula:\newline The results will not behave as expected and may be very wrong!!} \end{itemize} An intercept (i.e. mean term) must be specified order for the results to be statistically valid. Otherwise, the variance percentages will be {\it very} overestimated.\\ \begin{itemize} \item \Rcode{Categorical variables modeled as fixed effect:\newline The results will not behave as expected and may be very wrong!!} \end{itemize} If a linear mixed model is used, all categorical variables must be modeled as a random effect. Alternatively, a fixed effect model can be used by modeling all variables as fixed. \\ \begin{itemize} \item \Rcode{Cannot have more than one varying coefficient term:\newline The results will not behave as expected and may be very wrong!!} \end{itemize} Only one varying coefficient term can be specified. For example, the formula $\sim$\Rcode{(Tissue+0|Individual) + (Batch+0|Individual)} contains two varying coefficient terms and the results from this analysis are not easily interpretable. Only a formula with one term like \Rcode{(Tissue+0|Individual)} is allowed. \\ \begin{itemize} \item \Rcode{executing \%dopar\% sequentially: no parallel backend registered} \end{itemize} These functions are optimized to run in parallel using doParallel/doMC. This warning indicates that a parallelization was not enabled. This is not a problem, but analysis will take more time.\\ \subsection{Errors} \begin{itemize} \item \Rcode{Colinear score > .99: Covariates in the formula are so strongly\\correlated that the parameter estimates from this model are not\\ meaningful. Dropping one or more of the covariates will fix this problem} \item \Rcode{Error in asMethod(object) : not a positive definite matrix} \item \Rcode{In vcov.merMod(fit) : Computed variance-covariance matrix problem:\\not a positive definite matrix; returning NA matrix} \item \Rcode{fixed-effect model matrix is rank deficient so dropping 26 columns / coefficients} \end{itemize} Including variables that are highly correlated can produce misleading results (see Section \ref{sec:colinearity}). In this case, parameter estimates from this model are not meaningful. Dropping one or more of the covariates will fix this problem.\\ \begin{itemize} \item \Rcode{Error in checkNlevels(reTrms\$flist, n = n, control) : \newline number of levels of each grouping factor must be < number of observations} \end{itemize} This arises when using a varying coefficient model that examines the effect of one variable inside subsets of the data defined by another: $\sim$\Rcode{(A+0|B)}. See Section \ref{sec:withinSubset}. There must be enough observations of each level of the variable B with each level of variable A. Consider an example with samples from multiple tissues from a set of individual where we are interested in the variation across individuals within each tissue using the formula: $\sim$ \Rcode{(Tissue+0|Individual)}. This analysis will only work if there are multiple samples from the same individual in at least one tissue. If all tissues only have one sample per individual, the analysis will fail and \Rcode{variancePartition} will give this error.\\ \begin{itemize} \item \Rcode{Problem with varying coefficient model in formula: should have form (A+0|B)} \end{itemize} When analyzing the variation of one variable inside another (Section \ref{sec:withinSubset}), the formula most be specified as \Rcode{(Tissue+0|Individual)}. This error occurs when the formula contains \Rcode{(Tissue|Individual)} instead.\\ \begin{itemize} \item \Rcode{fatal error in wrapper code} \item \Rcode{Error in mcfork() : unable to fork, possible reason: Cannot allocate memory} \item \Rcode{Error: cannot allocate buffer} \end{itemize} This error occurs when \Rcode{fitVarPartModel} uses too many threads and takes up too much memory. The easiest solution is to use \Rcode{fitExtractVarPartModel} instead. Occasionally there is an issue in the parallel backend that is out of my control. Using fewer threads or restarting \R{} will solve the problem. \subsubsection{Errors: Problems removing samples with NA/NaN/Inf values} \Rpackage{variancePartition} fits a regression model for each gene and drops samples that have NA/NaN/Inf values in each model fit. This is generally seamless but can cause an issue when a variable specified in the formula no longer varies within the subset of samples that are retained. Consider an example with variables for sex and age where age is NA for all males samples. Dropping samples with invalid values for variables included in the formula will retain only female samples. This will cause \Rpackage{variancePartition} to throw an error because there is now no variation in sex in the retained subset of the data. This can be resolved by removing either age or sex from the formula.\\ \\ This situtation is indicated by the following errors \begin{itemize} \item \Rcode{Error: grouping factors must have > 1 sampled level} \item \Rcode{Error: Invalid grouping factor specification, Individual} \item \Rcode{Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): \\ contrasts can be applied only to factors with 2 or more levels} \item \Rcode{Error in checkNlevels(reTrms\$flist, n = n, control):\\ grouping factors must have > 1 sampled level} \end{itemize} \section*{Session Info} <>= toLatex(sessionInfo()) @ <>= options(prompt="> ", continue="+ ") @ \bibliography{library} \end{document}