1 Overview

1.1 Basic concepts

Organisms are assayed on multiple features
Variability in feature measures exhibits structure
Clustering:
- For some grouping, between-group variation is larger than within-group variation
- Our goals are to find, evaluate, and interpret such groupings
Classification:
- Organisms are sorted into classes and labeled
- Rules for classification are maps from features to labels
- Our goals are to find, evaluate, and interpret such rules

1.2 Yeast cell cycle: phenotypic transitions (Lee, Rinaldi et al. Science 2002)

How do you expect gene expression time series to cluster?

1.3 Expression clusters

Spellman et al., MBC 1998; dendrogram on left, bottom half labeled MCM

1.4 Species and organ of origin: microarrays and orthologues (McCall et al., NAR 2011)

1.5 Question

Multivariate analysis of the yeast cell cycle uses the gene expression trajectory over time as the data vector
Multivariate analysis of tissue of origin data uses a snapshot of the transcriptome as the data vector
Should the same methods be used for visualization and interpretation? Why or why not?
- roles of convenience and agnosticism
- roles of biological knowledge and potential for corroboration

1.6 Species, organ of origin, and batch: RNA-seq and orthologues (Lin et al., PNAS 2014)

Between-species disparity stronger than within-organ similarity

1.7 Three data analysis problems

Experimental contexts
- Transcriptional cascade in the yeast cell cycle
  - PT Spellman et al., MBC 1998
  - time course on each gene yields 6000+ 18-vectors
  - group genes in search for mechanisms of coregulation
- Distinguishing organ of origin through gene expression patterns
  - McCall et al., NAR 2011
  - adjusted arrays yield 85 22215-vectors
  - cluster or classify samples to identify distinguishing gene sets
- Comparison of human and mouse transcriptomes
  - Lin et al., PNAS 2014
  - mRNA abundance for orthologous genes by RNA-seq, 30 15106-vectors
  - assess similarity of transcriptomes by tissue and by species

1.8 Clustering concept

Clustering: establishes a modular structure in the data
- explanation through contrast, divide and conquer
Methods questions:
- can we rely on mathematical approaches to contrasts and subdivisions among objects to define substantively meaningful distinctions?
- are there internal (data-driven) measures of cluster quality?
- can we simplify assessment using external information (e.g., TF binding data to rationalize assertions of coregulation)?

1.9 Classification methods

Objects are already grouped; module labels and assignments given a priori
- with features \(x\) and labels \(y\), can we discover \(f\) satisfying \(y_i \approx f(x_i)\) for instance \(i\)?
- Over what class of functions shall we search?
- How shall we measure the quality of the approximation?
- Do we require \(f\) to have an interpretation, or are we satisfied with a black-box prediction machine?
These questions are intertwined
- It may be more valuable to have a good estimate of a regression parameter than a “more accurate” but uninterpretable feature processor
- The scope of the search for \(f\) leads to risks of overfitting

1.10 Statistical concepts to master

object representations and distances in high-dimensional spaces
criteria for assigning objects to clusters, or labels to objects
tuning of clustering and classification procedures
measures of cluster quality: silhouette, Jaccard index
approaches to sampling from population models: bootstrap
systematic approaches to reducing bias of model appraisals
- test vs train (single split)
- V-fold cross-validation (V splits)
- leave-one-out cross-validation (N splits)
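
The three schemes above differ only in how the N sample indices are partitioned into held-out sets. A minimal base-R sketch, assuming N = 20 samples and V = 5 folds; these values and the helper name `make_folds` are illustrative and do not come from the slides:

```r
# Hypothetical helper: assign each of n samples to one of v folds at random.
make_folds = function(n, v) sample(rep(seq_len(v), length.out = n))

set.seed(1)
N = 20

# single test/train split: hold out roughly one third of the samples
test_idx = sample(N, size = round(N / 3))
train_idx = setdiff(seq_len(N), test_idx)

# V-fold cross-validation: each sample is held out exactly once
V = 5
folds = make_folds(N, V)
cv_test_sets = split(seq_len(N), folds)

# leave-one-out is the special case V = N
loo_test_sets = split(seq_len(N), make_folds(N, N))
```

A model would be fit on the complement of each held-out set and appraised on the held-out samples only.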

2 Cluster analysis concepts

2.1 Interactive exploration of clustering

number of genes; distance between objects (tissues or gene expression time series)
agglomeration method for tree construction
number of groups via cutree
mean bootstrapped Jaccard similarity
PC ordination with colors
assignment (mouseover)
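
Most of the controls listed above correspond to single base-R calls. The sketch below runs them on simulated data (three shifted groups of ten objects, not real expression data); the bootstrapped Jaccard step is left out, and the code of the interactive tool itself is not shown in these slides:

```r
set.seed(42)
# simulated stand-in for an expression matrix: 30 objects x 50 features,
# three groups of ten with shifted means (not real data)
grp = rep(1:3, each = 10)
x = matrix(rnorm(30 * 50), nrow = 30) + 3 * grp

d = dist(x, method = "euclidean")      # distance between objects (rows)
hc = hclust(d, method = "ward.D2")     # agglomeration method
cl = cutree(hc, k = 3)                 # number of groups via cutree
pc = prcomp(x)                         # principal components for ordination

# PC ordination with colors given by cluster assignment
plot(pc$x[, 1], pc$x[, 2], col = cl, pch = 19, xlab = "PC1", ylab = "PC2")
table(truth = grp, cluster = cl)
```

Changing `method` in `dist()` or `hclust()`, or `k` in `cutree()`, mimics moving the corresponding control.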

2.2 Exploring clusters with tissue-of-origin data

2.3 Some definitions

2.4 Example: Euclidean distance

High-school analytic geometry: distance between two points in \(R^3\)
\(p_1 = (x_1, y_1, z_1)\), \(p_2 = (x_2, y_2, z_2)\)
\(\Delta x = x_1 - x_2\), etc.
\(d(p_1, p_2) = \sqrt{(\Delta x)^2 + (\Delta y)^2 + (\Delta z)^2}\)
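
The formula translates directly to R and applies unchanged in higher dimensions, such as the 18-vectors of the yeast time course. A small sketch with two made-up points:

```r
p1 = c(1, 2, 3)
p2 = c(4, 6, 3)
# direct translation of the formula: square root of summed squared differences
d_manual = sqrt(sum((p1 - p2)^2))
# dist() computes the same quantity for all pairs of rows of a matrix
d_builtin = as.numeric(dist(rbind(p1, p2)))
d_manual    # 5
```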

2.5 What is the ward.D2 agglomeration method?

Enables very rapid update upon change of distance or # genes

2.6 What is the Jaccard similarity coefficient?
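
For two sets A and B, the Jaccard similarity coefficient is the size of the intersection divided by the size of the union. A minimal sketch, with a hypothetical helper `jaccard` and made-up set members:

```r
# Hypothetical helper: Jaccard similarity of two sets (1 = identical, 0 = disjoint)
jaccard = function(a, b) length(intersect(a, b)) / length(union(a, b))

A = c("g1", "g2", "g3", "g4")
B = c("g3", "g4", "g5")
jaccard(A, B)   # 2 shared members out of 5 distinct: 0.4
```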

2.7 What is the bootstrap distribution of a statistic?

classic example from 1983: correlation of grades and achievement test scores
arbitrary replication of multivariate records

2.8 What is the bootstrap distribution of a statistic?

sampling with replacement from the base records

2.9 How to use the bootstrap distribution?

estimate quantiles of the empirical distribution
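
As a sketch of the idea, the code below resamples whole records with replacement, recomputes a correlation each time, and reads off quantiles of the resulting distribution. The records are simulated here (they are not the 1983 data), and n = 15 and B = 2000 are arbitrary choices:

```r
set.seed(1983)
# simulated bivariate records (not the 1983 data): n = 15 pairs with positive correlation
n = 15
score = rnorm(n)
grade = 0.7 * score + rnorm(n, sd = 0.7)
records = data.frame(score, grade)

B = 2000
boot_cor = replicate(B, {
  idx = sample(n, size = n, replace = TRUE)   # resample whole records with replacement
  cor(records$score[idx], records$grade[idx])
})

# quantiles of the bootstrap distribution give a percentile interval
quantile(boot_cor, c(0.025, 0.5, 0.975))
```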

2.10 Bootstrap distributions of Jaccard

dispersion should be reckoned along with mean
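
One way to obtain such a distribution, sketched under assumptions: resample objects, recluster, and for each original cluster record the best Jaccard match among the new clusters. This simplified scheme keeps only the distinct objects in each resample and runs on simulated data; it is not necessarily the procedure behind the interactive display.

```r
set.seed(7)
# simulated data: two well separated groups of 15 objects, 10 features (not real data)
x = rbind(matrix(rnorm(150), 15), matrix(rnorm(150, mean = 4), 15))
k = 2
clust = function(m) cutree(hclust(dist(m), method = "ward.D2"), k = k)
jaccard = function(a, b) length(intersect(a, b)) / length(union(a, b))

orig = clust(x)
B = 100
# rows: bootstrap replicates; columns: original clusters
jac = t(replicate(B, {
  idx = unique(sample(nrow(x), replace = TRUE))   # distinct objects drawn in this resample
  bc = clust(x[idx, ])
  sapply(seq_len(k), function(j) {
    members = intersect(which(orig == j), idx)     # original cluster j, restricted to resample
    max(sapply(seq_len(k), function(l) jaccard(members, idx[bc == l])))
  })
}))

colMeans(jac)        # mean bootstrapped Jaccard per cluster
apply(jac, 2, sd)    # dispersion, to be reckoned along with the mean
```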

2.11 Now that we know the definitions:

2.12 Summary

number of features (and detailed selection of features, not explored here) has impact
agglomeration procedure has substantial impact
number of clusters based on tree cutting
bootstrap distribution of Jaccard index measures cluster stability
ordination using principal components can be illuminating
procedure is sensitive to these choices, but sensible choices recapitulate biology
there are bivariate outliers in PC1-PC2 view

2.13 Road map

yeast cell cycle
- compute distance to an interpretable prototype
- illustrate silhouette measure against random grouping
- define families of exemplars through trigonometric regression
- use F-statistics and parameter estimates to filter and discriminate patterns
normalized tissue-specific expression
- demonstrate

2.14 Yeast cell cycle: phenotypic transitions

Lee, Rinaldi et al., Science 2002

2.15 Yeast cell cycle: regulatory model

TF binding data added to expression patterns

2.16 a data extract: S. cerevisiae colony synchronized with alpha pheromone

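library(yeastCC)   # yeast cell cycle data package (Bioconductor)
data(spYCCES)      # ExpressionSet with several synchronization methods
# keep only the samples synchronized with alpha pheromone
alp = spYCCES[, spYCCES$syncmeth=="alpha"]
# sampling times alongside the first five genes and samples
rbind(time=alp$time[1:5],exprs(alp)[1:5,1:5])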

##         alpha_0 alpha_7 alpha_14 alpha_21 alpha_28
## time       0.00    7.00    14.00    21.00    28.00
## YAL001C   -0.15   -0.15    -0.21     0.17    -0.42
## YAL002W   -0.11    0.10     0.01     0.06     0.04
## YAL003W   -0.14   -0.71     0.10    -0.32    -0.40
## YAL004W   -0.02   -0.48    -0.11     0.12    -0.03
## YAL005C   -0.05   -0.53    -0.47    -0.06     0.11

G=6178 genes comprise rows, N=18 timed samples comprise columns

2.17 Raw trajectories for some of the genes in MCM cluster

consistent with annotation to M/G1

2.18 A pattern of interest (“prototype”, but not in the data)

Define a basal oscillator to be a gene with expression varying with the cell cycle in a specific way
- for alpha-synchronized colony, cell cycle period is about 66 minutes
Theoretical expression trajectory

2.19 Formalism for the basal oscillator prototype

\(t\) denotes time (in minutes) elapsed from synchronization
\(X_g(t)\) denotes reported expression of gene \(g\) at time \(t\)
\(m_g = \min_t X_g(t)\), \(M_g = \max_t X_g(t)\)
\(U_g(t) = 2 \times [\frac{X_g(t) - m_g}{M_g-m_g} - 0.5]\) is signed fractional excursion (sfe) of gene \(g\) at time \(t\)
- if \(g\) is at minimal reported value at \(t\), \(U_g(t) = -1\)
- if \(g\) is at maximal reported value at \(t\), \(U_g(t) = +1\)
if \(g\) is a basal oscillator, what is the form of \(U_g(t)\)?
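
The signed fractional excursion defined above can be sketched in R as follows. The helper name `sfe` and the object `uea_sketch` are illustrative assumptions, not objects taken from the slides:

```r
# Hypothetical helper: signed fractional excursion of one trajectory.
# Without na.rm, any missing value makes the whole transformed trajectory NA.
sfe = function(x) 2 * ((x - min(x)) / (max(x) - min(x)) - 0.5)

x = c(-0.15, -0.15, -0.21, 0.17, -0.42)   # first five alpha values of YAL001C from the extract above
u = sfe(x)
range(u)   # -1 and 1

# applied row-wise to a genes x times matrix; t() restores genes as rows
# (uea_sketch is an assumed analogue of a transformed matrix, not the deck's own object)
m = rbind(x, 2 * x + 1)
uea_sketch = t(apply(m, 1, sfe))
```

Because the transformation rescales each gene to its own range, the two rows of `m`, which differ only by a positive scale factor and a shift, map to the same trajectory.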

2.20 One possible form for \(U_g(t)\) for \(g\) a basal oscillator

\(U_g(t) = \sin(2 \pi t/ 66)\), \(t\) in minutes from synchronization
Why?
- \(\sin\) function has range [-1,1]
- smoothness corresponds to gradual nature of transition
- period is 66 minutes, so the pattern returns to its starting value 0 at multiples of 66 minutes
Drawbacks?
- periodicity is biologically motivated, but the detailed trajectory with various local symmetries is not justified
Survey: How many genes are reasonably modeled by the basal oscillator pattern? 0, 10, 100, 1000?

2.21 Application of the distance concept

What is the dimension of the space in which a yeast gene expression trajectory resides?
How can we define the distance between a given gene’s trajectory and that corresponding to the basal oscillator pattern? Assumptions?

2.22 Computing distances to basal oscillator pattern

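# basal oscillator prototype: one sine cycle per 66 minutes
bot = function(tim) sin(2*pi*tim/66)
bo = bot(alp$time)
# Euclidean distance from a trajectory x to the prototype
d2bo = function(x) sqrt(sum((x-bo)^2))
# uea is assumed to hold the sfe-transformed alpha series (genes in rows);
# its construction is not shown on these slides
suppressWarnings({ds = apply(uea, 1, d2bo)})
md = which.min(ds)   # index of the gene nearest the prototype
md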

## YMR011W 
##    4411

summary(ds)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       2       4       4       4       4       6    1689

2.23 The nearest gene

plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[md,], lty=2)
legend(38,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[md]))

2.24 The distribution of distances

hist(ds, xlab="Euclidean distance to basal oscillator pattern")

What should this look like if there is clustering?

2.25 “Top ten!”

todo = names(sort(ds))[1:10]   # ten genes nearest the basal oscillator
plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
# fresh loop variable, so the index md from which.min(ds) is not overwritten
for (g in todo) lines(alp$time, uea[g,], lty=2, col="gray")

2.26 Is it a cluster?

Recall the definition:
- Variability in feature measures exhibits structure
- For some grouping, between-group variation is larger than within-group variation
- Can we make this more precise?

2.27 Definition from ?silhouette

   For each observation i, the _silhouette width_ s(i) is defined as
     follows:
     Put a(i) = average dissimilarity between i and all other points of
     the cluster to which i belongs (if i is the _only_ observation in
     its cluster, s(i) := 0 without further calculations).  For all
     _other_ clusters C, put d(i,C) = average dissimilarity of i to all
     observations of C.  The smallest of these d(i,C) is b(i) := \min_C
     d(i,C), and can be seen as the dissimilarity between i and its
     “neighbor” cluster, i.e., the nearest one to which it does _not_
     belong.  Finally,

                   s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).         
     
     ‘silhouette.default()’ is now based on C code donated by Romain
     Francois (the R version being still available as
     ‘cluster:::silhouette.default.R’).
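
The definition can be checked by computing s(i) by hand and comparing with `silhouette()` from the cluster package. The sketch below uses simulated points and a fixed two-group labeling; the helper name `sil_width` is illustrative:

```r
library(cluster)   # recommended package distributed with R; provides silhouette()

set.seed(10)
# simulated data: two groups of five points in the plane (not real data)
x = rbind(matrix(rnorm(10), 5), matrix(rnorm(10, mean = 3), 5))
cl = rep(1:2, each = 5)
D = as.matrix(dist(x))

# silhouette width of observation i, following the definition above
sil_width = function(i) {
  own = setdiff(which(cl == cl[i]), i)
  if (length(own) == 0) return(0)                 # singleton cluster
  a = mean(D[i, own])                             # average dissimilarity within own cluster
  others = setdiff(unique(cl), cl[i])
  b = min(sapply(others, function(k) mean(D[i, cl == k])))   # nearest other cluster
  (b - a) / max(a, b)
}
s_manual = sapply(seq_along(cl), sil_width)
s_pkg = silhouette(cl, dist(x))[, "sil_width"]
all.equal(s_manual, unname(s_pkg))
```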

2.28 Realizations of an unstructured grouping scheme

Form some arbitrarily chosen groups of size ten
The code:

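allf = featureNames(alp)
set.seed(2345)
# random permutation of a vector
scramble = function(x) sample(x, size=length(x), replace=FALSE) 
# 40 randomly chosen genes outside the top ten, split into four groups of ten
cands = scramble(setdiff(allf, todo))[1:40]
gids = rep(2:5, each=10)   # group labels 2 to 5
sc = split(cands, gids)
ml = lapply(sc, function(x) uea[x,])   # trajectories for each arbitrary group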

2.29 Trajectories from the arbitrary groups

2.30 The silhouette plot

2.31 Recap

A target pattern was defined, using knowledge of the cell cycle period
Expression patterns were transformed to conform to the dynamic range of the target pattern
A distance function was defined and genes ordered by proximity to target
A group of ten genes nearest to target trajectory was identified and compared to arbitrarily formed groups using silhouette widths
Transformation, Distance, and Comparison are fundamental elements of all cluster analysis
- but a fixed target pattern is not so common in genomics
- compare handwriting or speech waveform analysis

2.32 Another exemplar

Give the mathematical definition of the hyperbasal oscillator pattern that has value 1 at time 0 and returns to 1 at 66 minutes
Which gene has expression trajectory closest to that of the hyperbasal oscillator?
How many steps are needed to determine the mean silhouette width for the ten genes closest to the hyperbasal pattern?

2.33 Solution

hbot = function(tim) cos(2*pi*tim/66)
hbo = hbot(alp$time)
d2hbo = function(x) sqrt(sum((x-hbo)^2))
suppressWarnings({cds = apply(uea, 1, d2hbo)})
cmd = which.min(cds)
cmd

## YHR005C 
##    2572

summary(cds)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1       3       4       4       4       6    1689

2.34 The most hyperbasal gene

plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[cmd,], lty=2)
legend(20,.87, lty=c(1,2), legend=c("hyperbasal osc.", featureNames(alp)[cmd]))

2.35 “Top ten!”

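ctodo = names(sort(cds)[1:10])   # ten genes nearest the hyperbasal oscillator
plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
# fresh loop variable, so the earlier index md is not overwritten
for (g in ctodo) lines(alp$time, uea[g,], lty=2, col="gray")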

2.36 Silhouette continuation

2.37 Caveats

We ignored the natural units reported for expression
The pure sinusoid need not be a reasonable trajectory model for any gene
We arbitrarily thresholded to achieve groups of size 10
Can we use the data to define groups of genes with similar expression patterns without so much conceptual intervention?

2.38 Hierarchical clustering

Subdivision of the genome into coexpressed groups is of general interest
Agglomerative algorithms use the distance measure to combine very similar genes into new entities
The process repeats until only one entity remains
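
The fusion sequence can be inspected directly in the object returned by `hclust()`. A small sketch on simulated trajectories; the data, the object names, and the choice of complete linkage are illustrative:

```r
set.seed(3)
# simulated trajectories: 8 objects observed at 18 time points (not real data)
x = matrix(rnorm(8 * 18), nrow = 8, dimnames = list(paste0("obj", 1:8), NULL))
hc = hclust(dist(x), method = "complete")

# each row of hc$merge is one fusion: negative entries are single objects,
# positive entries refer to entities formed at an earlier step
hc$merge
nrow(hc$merge)     # n - 1 = 7 fusions until one entity remains
hc$height          # dissimilarity at which each fusion occurred
cutree(hc, k = 3)  # cutting the tree yields a grouping
```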

2.39 Filtering

We would like to look for structure among expression patterns of genes exhibiting oscillatory behavior
Thus we would like to filter away genes with non-periodic trajectories
Trigonometric regression can help
- \(t\) is transformed to [0,1]; estimate \(s_{gj}\) and \(c_{gj}\) in \[ X_g(t) = \sum_{j=1}^J s_{gj} \sin 2j\pi t + \sum_{j=1}^J c_{gj} \cos 2j \pi t + e_g(t) \]
We will cluster genes possessing relatively large \(F\) statistics for this model, fixing \(J=2\)
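
To see what the \(F\) statistic measures, the model with \(J=2\) can be fit to a single trajectory by ordinary least squares. The sketch below uses simulated genes and assumes 18 time points spaced 7 minutes apart; it gives the unmoderated F statistic from `lm`, not the moderated statistic that limma reports:

```r
set.seed(66)
tim = seq(0, 119, by = 7)          # 18 time points, assumed 7 minutes apart
x = (tim %% 66) / 66               # time as a fraction of the 66-minute cycle
# harmonic design with J = 2 and no intercept
mm = cbind(s1 = sin(2*pi*x), c1 = cos(2*pi*x), s2 = sin(4*pi*x), c2 = cos(4*pi*x))

# simulated genes (not real data): one periodic, one pure noise
periodic = 0.9 * sin(2*pi*x) + 0.3 * cos(2*pi*x) + rnorm(18, sd = 0.2)
noise = rnorm(18, sd = 0.2)

# overall F statistic for the four harmonic coefficients from ordinary least squares
fstat = function(y) unname(summary(lm(y ~ mm - 1))$fstatistic["value"])
c(periodic = fstat(periodic), noise = fstat(noise))
```

A gene with a pronounced cycle yields a large F; a flat or erratic trajectory does not, which is the basis for filtering.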

2.40 limma for trigonometric regression fits

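options(digits=2)
# time as a fraction of the 66-minute cell cycle
x = (alp$time %% 66)/66
# first two harmonics (J=2), no intercept
mm = model.matrix(~sin(2*pi*x)+cos(2*pi*x)+sin(4*pi*x)+cos(4*pi*x)-1)
colnames(mm) = c("s1", "c1", "s2", "c2")
library(limma)
m1 = lmFit(alp, mm)   # one linear model per gene
em1 = eBayes(m1)      # moderated statistics
# genes ranked by the F statistic for all four coefficients
topTable(em1, 1:4, n=5)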

##            s1    c1     s2      c2  AveExpr  F P.Value adj.P.Val
## YIL129C -0.55 -0.75 -0.176 -0.1035  1.1e-03 30 2.6e-08   9.5e-05
## YJL159W  0.69  0.91 -0.059 -0.0074 -1.3e-17 30 3.1e-08   9.5e-05
## YMR011W  0.88  0.31  0.208  0.0368  1.7e-03 25 1.2e-07   2.0e-04
## YPL256C  1.20 -0.58  0.156 -0.3544 -5.6e-04 25 1.5e-07   2.0e-04
## YKR013W  1.04 -0.59  0.072  0.0085 -1.1e-03 24 1.7e-07   2.0e-04

2.41 Interactive interface

2.42 Tuning hclust: dendrogram structure

2.42.1 Distance and fusion method selection

2.43 Projection with labels

2.44 Characteristic traces, raw expression data

2.45 Summary on clustering

Choice of object, representation, and distance: allow flexibility when substantive considerations do not dictate
Choice of algorithm: qualitative distinctions exist (single vs complete linkage, for example)
“figure of merit” – Jaccard similarity and silhouette are guides; silhouette is distance-dependent
Consider how to validate or corroborate clustering results with TF binding data from the harbChIP package
We’ve not considered divisive methods, self-organizing maps; see Hastie, Tibshirani and Friedman

CSAMA 2015: Clustering and classification

Vince Carey

May 22, 2005

Contents

1 Overview

1.1 Basic concepts

1.2 Yeast cell cycle: phenotypic transitions (Lee, Rinaldi et al. Science 2002)

1.3 Expression clusters

1.4 Species and organ of origin: microarrays and orthologues (McCall et al., NAR 2011)

1.5 Question

1.6 Species, organ of origin, and batch: RNA-seq and orthologues (Lin et al., PNAS 2014)

1.7 Three data analysis problems

1.8 Clustering concept

1.9 Classification methods

1.10 Statistical concepts to master

2 Cluster analysis concepts

2.1 Interactive exploration of clustering

2.2 Exploring clusters with tissue-of-origin data

2.3 Some definitions

2.4 Example: Euclidean distance

2.5 What is the ward.D2 agglomeration method?

2.6 What is the Jaccard similarity coefficient?

2.7 What is the bootstrap distribution of a statistic?

2.8 What is the bootstrap distribution of a statistic?

2.9 How to use the bootstrap distribution?

2.10 Bootstrap distributions of Jaccard

2.11 Now that we know the definitions:

2.12 Summary

2.13 Road map

2.14 Yeast cell cycle: phenotypic transitions

2.15 Yeast cell cycle: regulatory model

2.16 a data extract: S. cerevisiae colony synchronized with alpha pheromone

2.17 Raw trajectories for some of the genes in MCM cluster

2.18 A pattern of interest (“prototype”, but not in the data)

2.19 Formalism for the basal oscillator prototype

2.20 One possible form for \(U_g(t)\) for \(g\) a basal oscillator

2.21 Application of the distance concept

2.22 Computing distances to basal oscillator pattern

2.23 The nearest gene

2.24 The distribution of distances

2.25 “Top ten!”

2.26 Is it a cluster?

2.27 Definition from ?silhouette

2.28 Realizations of an unstructured grouping scheme

2.29 Trajectories from the arbitrary groups

2.30 The silhouette plot

2.31 Recap

2.32 Another exemplar

2.33 Solution

2.34 The most hyperbasal gene

2.35 “Top ten!”

2.36 Silhouette continuation

2.37 Caveats

2.38 Hierarchical clustering

2.39 Filtering

2.40 limma for trigonometric regression fits

2.41 Interactive interface

2.42 Tuning hclust: dendrogram structure

2.42.1 Distance and fusion method selection

2.43 Projection with labels

2.44 Characteristic traces, raw expression data

2.45 Summary on clustering

3 Classification concepts

3.1 On classification methods with genomic data

3.2 BiocViews: StatisticalMethod

3.3 Conceptual basis for methods covered in the talk

3.4 A method on the boundary: linear discriminant analysis

3.5 Notes on LDA

3.6 Other approaches, issues

3.7 Application to the tissue-of-origin data

3.8 On leukemia data

3.9 On leukemia data, 2class

3.10 Summary