Cluster analysis concepts
Interactive exploration of clustering
- number of genes; distance between objects (tissues or gene expression time series)
- agglomeration method for tree construction
- number of groups via cutree
- mean bootstrapped Jaccard similarity
- PC ordination with colors
- assignment (mouseover)
Exploring clusters with tissue-of-origin data
Some definitions
Example: Euclidean distance
- High-school analytic geometry: distance between two points in \(R^3\)
- \(p_1 = (x_1, y_1, z_1)\), \(p_2 = (x_2, y_2, z_2)\)
- \(\Delta x = x_1 - x_2\), etc.
- \(d(p_1, p_2) = \sqrt{(\Delta x)^2 + (\Delta y)^2 + (\Delta z)^2}\)
What is the ward.D2 agglomeration method?
- Enables very rapid update upon change of distance or # genes
What is the Jaccard similarity coefficient?
What is the bootstrap distribution of a statistic?
- classic example from 1983: correlation of grades and achievement test scores
- arbitrary replication of multivariate records
What is the bootstrap distribution of a statistic?
- sampling with replacement from the base records
How to use the bootstrap distribution?
- estimate quantiles of the empirical distribution
Bootstrap distributions of Jaccard
- dispersion should be reckoned along with mean
Now that we know the definitions:
Summary
- number of features (and detailed selection of features, not explored here) has impact
- agglomeration procedure has substantial impact
- number of clusters based on tree cutting
- bootstrap distribution of Jaccard index measures cluster stability
- ordination using principal components can be illuminating
- procedure is sensitive but sensible choices recapitulate biology
- there are bivariate outliers in PC1-PC2 view
Road map
- yeast cell cycle
- compute distance to an interpretable prototype
- illustrate silhouette measure against random grouping
- define families of exemplars through trigonometric regression
- use F-statistics and parameter estimates to filter and discriminate patterns
- normalized tissue-specific expression
Yeast cell cycle: phenotypic transitions
- Lee, Rinaldi et al., Science 2002
Yeast cell cycle: regulatory model
- TF binding data added to expression patterns
Raw trajectories for some of the genes in MCM cluster
- consistent with annotation to M/G1
A pattern of interest (“prototype”, but not in the data)
- Define a basal oscillator to be a gene with expression varying with the cell cycle in a specific way
- for alpha-synchronized colony, cell cycle period is about 66 minutes
- Theoretical expression trajectory
Application of the distance concept
- What is the dimension of the space in which a yeast gene expression trajectory resides?
- How can we define the distance between a given gene’s trajectory and that corresponding to the basal oscillator pattern? Assumptions?
Computing distances to basal oscillator pattern
bot = function(tim) sin(2*pi*tim/66)
bo = bot(alp$time)
d2bo = function(x) sqrt(sum((x-bo)^2))
suppressWarnings({ds = apply(uea, 1, d2bo)})
md = which.min(ds)
md
## YMR011W
## 4411
summary(ds)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2 4 4 4 4 6 1689
The nearest gene
plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[md,], lty=2)
legend(38,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[md]))
The distribution of distances
hist(ds, xlab="Euclidean distance to basal oscillator pattern")
- What should this look like if there is clustering?
“Top ten!”
todo = names(sort(ds))[1:10]
plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
for (md in todo) lines(alp$time, uea[md,], lty=2, col="gray")
#legend(38,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[md]))
Is it a cluster?
- Recall the definition:
- Variability in feature measures exhibits structure
- For some grouping, between-group variation is larger than within-group variation
- Can we make this more precise?
Definition from ?silhouette
For each observation i, the _silhouette width_ s(i) is defined as
follows:
Put a(i) = average dissimilarity between i and all other points of
the cluster to which i belongs (if i is the _only_ observation in
its cluster, s(i) := 0 without further calculations). For all
_other_ clusters C, put d(i,C) = average dissimilarity of i to all
observations of C. The smallest of these d(i,C) is b(i) := \min_C
d(i,C), and can be seen as the dissimilarity between i and its
“neighbor” cluster, i.e., the nearest one to which it does _not_
belong. Finally,
s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).
‘silhouette.default()’ is now based on C code donated by Romain
Francois (the R version being still available as
‘cluster:::silhouette.default.R’).
Realizations of an unstructured grouping scheme
- Form some arbitrarily chosen groups of size ten
- The code:
allf = featureNames(alp)
set.seed(2345)
scramble = function(x) sample(x, size=length(x), replace=FALSE)
cands = scramble(setdiff(allf, todo))[1:40]
sc = split(cands, gids <- rep(2:5,each=10))
ml = lapply(sc, function(x) uea[x,])
Trajectories from the arbitrary groups
The silhouette plot
Recap
- A target pattern was defined, using knowledge of the cell cycle period
- Expression patterns were transformed to conform to the dynamic range of the target pattern
- A distance function was defined and genes ordered by proximity to target
- A group of ten genes nearest to target trajectory was identified and compared to arbitrarily formed groups using silhouette widths
- Transformation, Distance, and Comparison are fundamental elements of all cluster analysis
- but a fixed target pattern is not so common in genomics
- compare handwriting or speech waveform analysis
Another exemplar
- Give the mathematical definition of the hyperbasal oscillator pattern, that has value 1 at time 0 and returns to 1 at 66 minutes
- Which gene has expression trajectory closest to that of the hyperbasal oscillator?
- How many steps to determine the mean silhouette width for the ten genes closest to the hyperbasal pattern?
Solution
hbot = function(tim) cos(2*pi*tim/66)
hbo = hbot(alp$time)
d2hbo = function(x) sqrt(sum((x-hbo)^2))
suppressWarnings({cds = apply(uea, 1, d2hbo)})
cmd = which.min(cds)
cmd
## YHR005C
## 2572
summary(cds)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 3 4 4 4 6 1689
The most hyperbasal gene
plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[cmd,], lty=2)
legend(20,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[cmd]))
“Top ten!”
ctodo = names(sort(cds)[1:10])
plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
for (md in ctodo) lines(alp$time, uea[md,], lty=2, col="gray")
Silhouette continuation
Caveats
- We ignored the natural units reported for expression
- The pure sinusoid need not be a reasonable trajectory model for any gene
- We arbitrarily thresholded to achieve groups of size 10
- Can we use the data to define groups of genes with similar expression patterns without so much conceptual intervention?
Hierarchical clustering
- Subdivision of the genome into coexpressed groups is of general interest
- Agglomerative algorithms use the distance measure to combine very similar genes into new entities
- The process repeats until only one entity remains
Filtering
- We would like to look for structure among expression patterns of genes exhibiting oscillatory behavior
- Thus we would like to filter away genes with non-periodic trajectories
- Trigonometric regression can help
- \(t\) is transformed to [0,1]; estimate \(s_{gj}\) and \(c_{gj}\) in \[ X_g(t) = \sum_{j=1}^J s_{gj} \sin 2j\pi t + \sum_{j=1}^J c_{gj} \cos 2j \pi t + e_g(t) \]
- We will cluster genes possessing relatively large \(F\) statistics for this model, fixing \(J=2\)
limma for trigonometric regression fits
options(digits=2)
x = (alp$time %% 66)/66
mm = model.matrix(~sin(2*pi*x)+cos(2*pi*x)+sin(4*pi*x)+cos(4*pi*x)-1)
colnames(mm) = c("s1", "c1", "s2", "c2")
library(limma)
m1 = lmFit(alp, mm)
em1 = eBayes(m1)
topTable(em1, 1:4, n=5)
## s1 c1 s2 c2 AveExpr F P.Value adj.P.Val
## YIL129C -0.55 -0.75 -0.176 -0.1035 1.1e-03 30 2.6e-08 9.5e-05
## YJL159W 0.69 0.91 -0.059 -0.0074 -1.3e-17 30 3.1e-08 9.5e-05
## YMR011W 0.88 0.31 0.208 0.0368 1.7e-03 25 1.2e-07 2.0e-04
## YPL256C 1.20 -0.58 0.156 -0.3544 -5.6e-04 25 1.5e-07 2.0e-04
## YKR013W 1.04 -0.59 0.072 0.0085 -1.1e-03 24 1.7e-07 2.0e-04
Tuning hclust: dendrogram structure
Distance and fusion method selection
Characteristic traces, raw expression data
Summary on clustering
- Choice of object, representation, and distance: allow flexibility when substantive considerations do not dictate
- Choice of algorithm: qualitative distinctions exist (single vs complete linkage, for example)
- “figure of merit” – Jaccard similarity and silhouette are guides; silhouette is distance-dependent
- Consider how to validate or corroborate clustering results with TF binding data from the harbChIP package
- We’ve not considered divisive methods, self-organizing maps; see Hastie, Tibshirani and Friedman