% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{clusterStab Overview}
% \VignetteDepends{clusterStab, Biobase, genefilter, fibroEset}
% \VignetteKeywords{Expression Analysis, Postprocessing}
% \VignettePackage{clusterStab}
\documentclass[11pt]{article}
\usepackage{hyperref}

\parindent 0.5in

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\begin{document}

\title{\bf Using clusterStab}
\author{James W. MacDonald}
\maketitle

\section{Overview}

This document provides a brief guide to the \Rpackage{clusterStab}
package, which is intended to be used for two things:
\begin{itemize}
\item Determining the number of clusters in a set of microarray data.
\item Deciding how `good' these clusters are.
\end{itemize}

Clustering microarray data and producing so-called heatmaps is a very
common analysis. However, this technique has some significant
drawbacks. First, clustering is not an inferential technique, so there
are no $p$-values associated with the heatmap to indicate how likely
it is that the results arose simply by chance. Second, most clustering
techniques are quite sensitive to the input data, so it is possible to
get a very different result by simply adding or removing a single
sample (or by adding or removing genes). This, of course, would result
in a completely different interpretation. Third, clustering methods
are intended to reveal previously unknown patterns in high-dimensional
data, but are often used to show \emph{expected} patterns. This can
result in a self-fulfilling prophecy if, for instance, the data are
filtered based on an expected pattern. An example would be filtering
the data based on the difference in expression between two sample
types and then clustering the data. Since the filtered genes are, by
definition, those that differ between the two groups, the heatmap will
clearly show this difference. This is usually an uninteresting result:
permuting the sample labels, re-filtering the data, and re-clustering
will almost always produce a heatmap as striking as the original, yet
one based on sample groups that are not expected to be different.

This third point is quite important. In order to create a heatmap that
is small enough to be useful, it is often necessary to filter out
genes that are not differentially expressed in any samples, without
unintentionally selecting genes that fulfill an \emph{a priori}
assumption about the number of clusters. A good approach is to select
genes that have a high coefficient of variation (CV), which implies
that there is high variation in expression relative to the average
expression for that gene.

When using hierarchical clustering to detect \emph{a priori} unknown
groups in data, two questions thus arise: how many clusters are there,
and, given a number of clusters, how likely is it that they arose
simply by chance? We answer both questions by relying on the fact that
clustering solutions can be sensitive to changes in the input data. To
estimate the likely number of clusters, we repeatedly select subsets
of the samples, cluster them, and then see, for each number of
clusters (from 2 to N), how often we get similar results. If the
results are rarely similar for a given number of clusters, it is not
likely that there are that many clusters in our data. We assess
similarity using a \emph{Jaccard coefficient}, and use histograms to
compare different numbers of clusters.
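To make the similarity measure concrete, the following sketch shows
one common way to compute a pair-counting Jaccard coefficient between
two clustering results on the same samples. This is purely
illustrative: the \Rfunction{jaccard} function below is hypothetical,
not part of \Rpackage{clusterStab}, and not necessarily the exact
computation the package performs internally.

<<eval=FALSE>>=
## Illustrative sketch only; not the clusterStab implementation.
## 'a' and 'b' are cluster labels for the same set of samples.
jaccard <- function(a, b) {
    idx <- lower.tri(matrix(0, length(a), length(a)))
    co.a <- outer(a, a, "==")[idx] # pairs co-clustered under 'a'
    co.b <- outer(b, b, "==")[idx] # pairs co-clustered under 'b'
    ## pairs together in both clusterings, over pairs together in either
    sum(co.a & co.b) / sum(co.a | co.b)
}
jaccard(c(1, 1, 2, 2), c(1, 1, 2, 1))
@

Two identical clusterings give a coefficient of one, and increasingly
dissimilar clusterings give values closer to zero, which is why a
histogram with its mass near one indicates a stable number of
clusters.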
Once we have decided how many clusters we think there are, we can test
how likely it is that this number arose simply by chance. Again, this
is done by selecting subsets of the data and clustering. In this case,
however, we select subsets of genes (using all samples). The basic
idea is the same: repeatedly subset the data and compare the resulting
clustering solutions. If we keep getting the same clusters over and
over, this gives evidence that the cluster is stable, which implies
that it is not likely that this result arose simply by chance.

Note that these functions are based on hierarchical clustering using
the \Rfunction{hclust()} function, although they can be generalized to
most clustering techniques. Incorporation of different clustering
algorithms will be based on user requests.

\section{A Simple Example}

For this example we will use the \Rpackage{fibroEset} package, which
contains an \Robject{ExpressionSet} object with 46 samples and 12625
genes. There are 11 bonobo samples, 12 gorilla samples, and 23 human
samples, so I would expect either two or three clusters, depending on
how well the bonobo and gorilla samples cross-hybridize to Affymetrix
HG-U95Av2 chips. This is of course too many genes to cluster, so we
will start by selecting only those genes that appear to be
differentially expressed. Here I will use those genes that have a CV
greater than 0.1; however, there are many other possibilities that can
be used. See the \Rpackage{genefilter} package for more suggestions.

<<>>=
library(clusterStab)
library(genefilter)
library(fibroEset)
data(fibroEset)
exprs(fibroEset) <- log2(exprs(fibroEset))
filt <- cv(0.1, Inf)
index <- genefilter(fibroEset, filt)
fb <- fibroEset[index, ]
bh <- benhur(fb, 0.7, 6, seednum = 12345) # seednum only used to ensure reproducibility
@

There are \Sexpr{sum(index)} genes for which the CV is greater than
0.1. We will now use these genes to decide how many different sample
types (clusters) there are in our data, using the \Rfunction{benhur}
function. The number of clusters is determined by taking random
subsets of the microarrays and then clustering. Similarity between
pairs of clustering results is estimated using a Jaccard coefficient;
stable clusters will tend to give similar results on repeated
subsampling, whereas unstable clusters will tend to give very
different results. This can be visualized using either histograms or
empirical cumulative distribution function (eCDF) plots. We first look
at the histograms, using the \Rfunction{hist()} function.

\begin{figure}[htbp]
\centering
<<fig=TRUE>>=
hist(bh)
@
\caption{Histograms of the fibroEset data.}
\label{fig:hist}
\end{figure}

Figure~\ref{fig:hist} shows individual histograms for each of the 2--6
possible clusters that were tested. Each histogram shows the
distribution of the Jaccard coefficients for a particular number of
clusters. What we are looking for is the histogram with the majority
of the data at or near one; this indicates the most likely number of
clusters in our data. It appears that the most likely number of
clusters is two, as the other histograms all have data that are not
clustered near one.

\begin{figure}[htbp]
\centering
<<fig=TRUE>>=
plot(hclust(dist(t(exprs(fibroEset[index,])))),
     labels = pData(fibroEset)[, 2], sub = "", xlab = "")
@
\caption{Dendrogram of the fibroEset data.}
\label{fig:clust}
\end{figure}

Figure~\ref{fig:clust} shows the dendrogram for this clustering
result.
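As an informal check on the two-cluster interpretation, one could cut
the dendrogram into two groups and cross-tabulate the result against
the species labels. The sketch below uses only base R functions
(\Rfunction{cutree} and \Rfunction{table}); it is not part of
\Rpackage{clusterStab}.

<<eval=FALSE>>=
## Cut the tree into the two clusters suggested by the histograms and
## compare with the species labels (column 2 of the phenotype data).
hc <- hclust(dist(t(exprs(fb))))
table(cluster = cutree(hc, k = 2), species = pData(fibroEset)[, 2])
@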
The fibroEset data come from Affymetrix HG-U95Av2 chips that were run
on human, bonobo, and gorilla samples (labeled h, b, and g in the
dendrogram). It is not surprising that the bonobo and gorilla samples
cluster together, considering that they were analyzed using a human
chip. In addition, there are at least two smaller clusters within the
human samples.

We can also look at the eCDF plots, using the \Rfunction{ecdf()}
function.

\begin{figure}[htbp]
\centering
<<fig=TRUE>>=
ecdf(bh)
@
\caption{Empirical CDF plots of the fibroEset data.}
\label{fig:ecdf}
\end{figure}

Figure~\ref{fig:ecdf} shows the eCDFs for each number of clusters.
Again, it appears that there are probably only two real clusters here.

Once we have decided how many clusters there are in our data, we can
test to see how stable the clusters are. For this, we use the
\Rfunction{clusterComp} function.

<<>>=
cmp <- clusterComp(fb, 2)
cmp
@
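The second argument to \Rfunction{clusterComp} is the number of
clusters whose stability we want to assess. As a point of comparison
(an illustrative call only; output not shown), one might also test a
three-cluster solution. If the bonobo and gorilla samples do not
separate reliably, its stability estimates should be lower than those
for two clusters.

<<eval=FALSE>>=
## Illustration: assess a three-cluster solution for comparison.
## Lower stability here would support the two-cluster interpretation.
cmp3 <- clusterComp(fb, 3)
cmp3
@

\end{document}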