SocalBSI 2008:
Clustering Microarray Datasets
Sagar Damle, Ph.D. Candidate, Caltech

Distance Metrics: Measuring similarity using the Euclidean and Correlation distance metrics

Principal Components Analysis: Reducing the dimensionality of microarray data

Clustering Algorithms:
• K-means
• Self-Organizing Maps (SOM)
• Hierarchical Clustering
MATRIX(genes, conditions) = Expression dataset
• Rows = genes, columns = conditions (timepoints or tissues)
• Entry xij = expression of gene i in condition j, for i = 1…m genes and j = 1…n conditions
• The first gene vector (row 1) = (x11, x12, x13, … , x1n)
• The leftmost condition vector (column 1) = (x11, x21, x31, … , xm1)
Similarity measures
Clustering identifies groups of genes with "similar" expression profiles.
How is similarity measured?
• Euclidean distance
• Correlation coefficient
• Others: Manhattan, Chebyshev, squared Euclidean

In an experiment with 10 conditions, the gene expression profiles for two genes X and Y have the form:
X = (x1, x2, x3, …, x10)
Y = (y1, y2, y3, …, y10)
Similarity measure - Euclidean distance
For two genes in a 2-condition experiment, Ga = (x1, x2) and Gb = (y1, y2):
d(Ga, Gb) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 )
In general, if there are M experiments:
X = (x1, x2, x3, …, xm)
Y = (y1, y2, y3, …, ym)
d(X, Y) = sqrt( Σi (xi - yi)^2 )
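
A minimal numeric sketch (added here, not part of the original slides), using the 5-timepoint profiles of genes A and B from the example table later in these notes:

```python
# Euclidean distance between two expression profiles (numpy only)
import numpy as np

gene_a = np.array([0.12, 1.68, 0.99, 1.05, 1.44])   # gene A across 0h-4h
gene_b = np.array([0.47, 1.37, 1.06, 0.91, 1.96])   # gene B across 0h-4h

d = np.sqrt(np.sum((gene_a - gene_b) ** 2))          # sqrt of summed squared differences
print(d)                                             # same result as np.linalg.norm(gene_a - gene_b)
```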
Similarity measure – Pearson Correlation Coefficient
X = (x1, x2, x3, …, xm), Y = (y1, y2, y3, …, ym)
D = 1 - r
r = (1/m) Z(X)·Z(Y), i.e. the mean of the products of the z-scores of X and Y
Geometrically, Z(X)·Z(Y) = |Z(X)| |Z(Y)| cos(θ), so r is the cosine of the angle between the centered vectors
• When two profiles are perfectly correlated, r = 1 and D = 0
• When two profiles are uncorrelated, r = 0 and D = 1
Dot product review: http://mathworld.wolfram.com/DotProduct.html
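
A corresponding sketch (added, not from the slides), computing D = 1 - r directly from z-scores; it should agree with scipy.stats.pearsonr on the same profiles:

```python
# Correlation distance D = 1 - r, with r computed as the mean product of z-scores
import numpy as np

def correlation_distance(x, y):
    zx = (x - x.mean()) / x.std()       # z-score of X (population std)
    zy = (y - y.mean()) / y.std()       # z-score of Y
    r = np.dot(zx, zy) / len(x)         # Pearson correlation coefficient
    return 1.0 - r                      # 0 = perfectly correlated, 1 = uncorrelated

gene_a = np.array([0.12, 1.68, 0.99, 1.05, 1.44])   # gene A from the example table below
gene_b = np.array([0.47, 1.37, 1.06, 0.91, 1.96])   # gene B from the example table below
print(correlation_distance(gene_a, gene_b))
```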
Euclidean vs Pearson Correlation
• Euclidean distance – takes into account the magnitude of the expression.
• Correlation distance – insensitive to the amplitude of expression; takes into account the trends of the change. Common trends are considered biologically relevant; the magnitude is considered less important.
[Figure: profiles of Gene X and Gene Y over the conditions, shown as "what Euclidean distance sees" vs. "what correlation distance sees".]
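
A small illustration of this point (made-up profiles, added here): gene_y follows the same trend as gene_x at twice the amplitude, so the Euclidean distance is large while the correlation distance is essentially zero.

```python
import numpy as np
from scipy.spatial.distance import euclidean, correlation

gene_x = np.array([1.0, 2.0, 1.5, 3.0, 2.5])
gene_y = 2.0 * gene_x                     # same trend, double the magnitude

print(euclidean(gene_x, gene_y))          # large: magnitude matters to Euclidean distance
print(correlation(gene_x, gene_y))        # ~0.0: correlation distance sees only the trend
```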
Principal Components Analysis (PCA)
• A method for projecting microarray data onto a reduced (2- or 3-dimensional), easily visualized space.
• Definition: Principal components – a set of variables that define a projection encapsulating the maximum amount of variation in a dataset, each orthogonal (and therefore uncorrelated) to the previous principal component of the same dataset.
• Example dataset: thousands of genes probed in 5 conditions.
• The expression profile of each gene is represented by the vector of its expression levels: X = (X1, X2, X3, X4, X5)
• Imagine each gene X as a point in a 5-dimensional space; each direction/axis corresponds to a specific condition.
• Genes with similar profiles are close to each other in this space.
• PCA: project this dataset to 2 dimensions, preserving as much information as possible.
[Figure: PCA transformation of a microarray dataset – visual estimation of the number of clusters in the data.]
1-page tutorial on singular value decomposition (PCA)
http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
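
A minimal sketch of this kind of projection (added here; random stand-in data rather than a real microarray), using scikit-learn:

```python
# Project a genes-by-conditions matrix onto its first two principal components
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expr = rng.normal(size=(1000, 10))        # stand-in data: 1000 genes x 10 conditions

pca = PCA(n_components=2)
coords = pca.fit_transform(expr)          # shape (1000, 2): one 2-D point per gene

print(pca.explained_variance_ratio_)      # fraction of variance captured by PC1 and PC2
# coords[:, 0] vs coords[:, 1] can be scatter-plotted to eyeball cluster structure
```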
Cluster analysis → Function
• Places genes with similar expression patterns in groups.
• Sometimes genes of unknown function will be grouped with genes of known function.
• The known functions allow the investigator to hypothesize about the functions of genes not yet characterized.

Examples:
• Identify genes important in cell cycle regulation
• Identify genes that participate in a biosynthetic pathway
• Identify genes involved in a drug response
• Identify genes involved in a disease response
Clustering yeast cell cycle dataset vs. gene tree ordering
How to choose the number of clusters needed to informatively partition the data
• Trial and error: try clustering with different numbers of clusters, and compare the results.
• Criteria for comparison: homogeneity vs. separation
• Use PCA (Principal Component Analysis) to visually determine how well the algorithm grouped genes.
• Calculate the mean distance between all genes within a cluster (it should be small) and compare it to the distance between clusters (which should be large).
Mathematical evaluation of a clustering solution
Merits of a 'good' clustering solution:
• Homogeneity:
  • Genes inside a cluster are highly similar to each other.
  • Measured as the average similarity between a gene and the center (average profile) of its cluster.
• Separation:
  • Genes from different clusters have low similarity to each other.
  • Measured as the weighted average similarity between centers of clusters.
These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation.
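
A rough sketch of these two scores (added here; it uses Euclidean distance rather than similarity, so smaller homogeneity and larger separation are better in this version; `expr` is a genes × conditions numpy array and `labels` holds each gene's cluster assignment):

```python
import numpy as np

def homogeneity(expr, labels):
    """Mean distance between each gene and the centroid of its cluster (smaller = better)."""
    dists = []
    for k in np.unique(labels):
        members = expr[labels == k]
        centroid = members.mean(axis=0)                  # cluster's average profile
        dists.extend(np.linalg.norm(members - centroid, axis=1))
    return np.mean(dists)

def separation(expr, labels):
    """Size-weighted mean distance between cluster centroids (larger = better)."""
    ks = np.unique(labels)
    centroids = {k: expr[labels == k].mean(axis=0) for k in ks}
    sizes = {k: (labels == k).sum() for k in ks}
    num, den = 0.0, 0.0
    for i, a in enumerate(ks):
        for b in ks[i + 1:]:
            w = sizes[a] * sizes[b]                      # weight pairs by cluster sizes
            num += w * np.linalg.norm(centroids[a] - centroids[b])
            den += w
    return num / den
```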
Performance on Yeast Cell Cycle Data
698 genes, 72 conditions (Spellman et al. 1998). Each algorithm was run by its authors in a "blind" test.
[Figure: homogeneity vs. separation for the "true" solution, CAST* (*Ben-Dor, Shamir, Yakhini 1999), CLICK, GeneCluster, and K-means.]
Clustering Algorithms
• K-means
• SOMs
• Hierarchical clustering
K-MEANS
1. The user sets the number of clusters, k.
2. Initialization: each gene is randomly assigned to one of the k clusters.
3. The average expression vector is calculated for each cluster (the cluster's profile).
4. Iterate over the genes:
   • For each gene, compute its similarity to the cluster profiles.
   • Move the gene to the cluster it is most similar to.
   • Recalculate the cluster profiles.
5. Score the current partition: the sum of distances between genes and the profile of the cluster they are assigned to (homogeneity of the solution).
6. Stopping criterion: further shuffling of genes results in only minor improvement in the clustering score.
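
A compact sketch of these steps (added here, numpy only; it assumes no cluster empties out during reassignment):

```python
import numpy as np

def kmeans(expr, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(expr))          # 2) random initial assignment
    prev_score = np.inf
    for _ in range(n_iter):
        # 3) cluster profiles = average expression vector of each cluster
        profiles = np.array([expr[labels == j].mean(axis=0) for j in range(k)])
        # 4) move each gene to the cluster whose profile it is closest to
        dists = np.linalg.norm(expr[:, None, :] - profiles[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 5) score the partition: sum of gene-to-profile distances (homogeneity)
        score = dists.min(axis=1).sum()
        # 6) stop when further shuffling gives only minor improvement
        if prev_score - score < 1e-6:
            break
        prev_score = score
    return labels, profiles
```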
Example dataset (genes × experiments):

            0hrs   1hr    2hr    3hr    4hr
gene A      0.12   1.68   0.99   1.05   1.44
gene B      0.47   1.37   1.06   0.91   1.96
gene C      1.97   0.87   1.84   0.30   1.17
gene D      1.21   1.22   1.71   1.45   1.68
gene E      0.25   0.70   0.66   0.83   1.38
gene F      0.81   0.34   1.18   1.85   1.18
gene G      1.64   0.08   1.03   0.36   1.64
gene H      1.78   1.64   1.71   1.49   0.97
gene I      0.14   0.68   0.88   1.54   0.49
gene J      1.01   0.84   0.06   1.87   1.11
gene K      0.91   1.57   1.49   0.81   1.32
gene L      1.71   1.33   0.27   1.59   0.87
gene M      1.46   0.12   1.60   0.44   0.73
gene N      0.88   1.21   1.44   1.46   1.90
…           1.15   1.30   1.16   1.07   0.23
K-MEANS example: 4 clusters (too many?)
[Figure: the four cluster profiles, each showing the mean profile and the standard deviation in each condition.]

Evaluating K-means
[Figure: Clusters 1-4, with misclassified genes indicated.]

K-means example: 3 clusters (looks right)

K-means clustering: K=2 (too few)
SOMs (Self-Organizing Maps)
Less clustering and more data organizing.
• The user sets the number of clusters in the form of a rectangular grid (e.g., 3x2) – the 'map nodes'.
• Imagine genes as points in (M-dimensional) space.
• Initialization: map nodes are randomly placed in the data space.
[Figure: genes shown as data points, clusters as map nodes.]
SOM - Scheme
• Randomly choose a data point (gene).
• Find its closest map node.
• Move this map node towards the data point.
• Move the neighboring map nodes towards this point, but to a lesser extent (thinner arrows show weaker shifts).
• Iterate over the data points.
• Each successive gene profile (black dot) has less influence on the displacement of the nodes.
• Iterate through all profiles several times (10-100).
• When the positions of the cluster nodes have stabilized, assign each gene to its closest map node (cluster).
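
A bare-bones sketch of this update loop (added here; the grid size, learning rate, and neighborhood width are illustrative choices, not values from the slides):

```python
import numpy as np

def train_som(expr, grid=(3, 2), n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])  # node positions on the grid
    nodes = expr[rng.choice(len(expr), rows * cols)].astype(float)         # init nodes in the data space
    for epoch in range(n_epochs):
        lr = 0.5 * (1 - epoch / n_epochs)                # influence shrinks over successive passes
        sigma = max(1.0 * (1 - epoch / n_epochs), 0.3)   # neighborhood width also shrinks
        for x in expr[rng.permutation(len(expr))]:       # iterate over genes in random order
            best = np.linalg.norm(nodes - x, axis=1).argmin()        # closest map node
            grid_dist = np.linalg.norm(coords - coords[best], axis=1)
            h = np.exp(-grid_dist ** 2 / (2 * sigma ** 2))           # neighbors move less
            nodes += lr * h[:, None] * (x - nodes)                   # pull nodes toward the gene
    # once positions have stabilized, assign each gene to its closest map node
    labels = np.array([np.linalg.norm(nodes - x, axis=1).argmin() for x in expr])
    return labels, nodes
```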
Hierarchical Clustering
Goal #1: Organize the genes into a hierarchical tree structure.
1) Initial step: each gene is regarded as a cluster with one item.
2) Find the 2 most similar clusters and merge them into a common node (red dot).
3) Merge successive nodes until all genes are contained in a single cluster.
[Figure: dendrogram over genes g1-g5, merging {1,2} and {4,5}, then {1,2,3}, up to the root {1,2,3,4,5}.]
Goal #2: Collapse branches to group genes into distinct clusters.
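
A minimal sketch with SciPy (added here; random stand-in data), building the tree and then "collapsing branches" by cutting it into a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 5))                 # stand-in data: 50 genes x 5 conditions

tree = linkage(expr, method='average', metric='correlation')  # merge most-similar clusters first
labels = fcluster(tree, t=3, criterion='maxclust')            # collapse the tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(tree) draws the tree itself
```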
Which genes to cluster?
• Apply filtering prior to clustering – focus the analysis on the 'responding genes' (a small filtering sketch follows below).
• Applying controlled statistical tests to identify 'responding genes' usually ends up with too few genes, which does not allow a global characterization of the response.
• Variance: filter out genes that do not vary greatly among the conditions of the experiment.
  • Non-varying genes skew clustering results, especially when using a correlation coefficient.
• Fold change: choose genes that change by at least M-fold in at least L conditions.
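
A sketch of the two filters just described (added here, numpy only; the thresholds M = 2, L = 2 and the baseline choice are arbitrary illustrations, and the fold-change filter assumes linear-scale values):

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.uniform(0.1, 4.0, size=(1000, 6))     # stand-in data: 1000 genes x 6 conditions

# Variance filter: keep genes whose expression varies across conditions
gene_var = expr.var(axis=1)
keep_var = gene_var > np.percentile(gene_var, 75)

# Fold-change filter: at least M-fold change relative to the gene's first condition
# in at least L conditions (one possible reading of "M-fold in L conditions")
fold = expr / expr[:, [0]]
keep_fold = (np.maximum(fold, 1 / fold) >= 2).sum(axis=1) >= 2

filtered = expr[keep_var & keep_fold]
print(filtered.shape)
```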
Clustering – Tools
• Cluster (Eisen) – hierarchical clustering
  http://rana.lbl.gov/EisenSoftware.htm
• GeneCluster (Tamayo) – SOM
  http://bioinfo.cnio.es/wwwsomtree/
• TIGR MeV – K-Means, SOM, hierarchical, QTC, CAST
  http://www.tm4.org/mev.html
• Expander – CLICK, SOM, K-means, hierarchical
  http://www.cs.tau.ac.il/~rshamir/expander/expander.html
• Many others (e.g. GeneSpring)
  http://www.agilent.com/chem/genespring
Analysis Strategy
1) Transform dataset using PCA
2) Cluster
   Parameters to test:
   • Distance metric
   • Number of clusters
   • Separation & homogeneity
3) Assign biological meaning to clusters
Original presentation created by
Rani Elkon and posted at:
http://www.tau.ac.il/lifesci/bioinfo/teaching/20022003/DNA_microarray_winter_2003.html