Introduction to clustering

Unsupervised learning &
Cluster Analysis: Basic Concepts
and Algorithms
Assaf Gottlieb
Some of the slides are taken from Introduction to Data Mining, by Tan, Steinbach, and Kumar
What is unsupervised learning & Cluster Analysis?
Learning without a priori knowledge about
the classification of samples; learning
without a teacher.

Kohonen (1995), “Self-Organizing Maps”
“Cluster analysis is a set of methods for
constructing a (hopefully) sensible and
informative classification of an initially
unclassified set of data, using the variable
values observed on each individual.”

B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”
What do we cluster?
Features/Variables
Samples/Instances
Applications of Cluster Analysis

Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

Data exploration: get insight into the data distribution and understand patterns in the data.

Summarization: reduce the size of large data sets; often used as a preprocessing step.
Objectives of Cluster Analysis

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Two competing objectives: intra-cluster distances are minimized while inter-cluster distances are maximized.
Notion of a Cluster Can Be Ambiguous

How many clusters? The same set of points can reasonably be grouped into two, four, or six clusters; it depends on the "resolution"!
Prerequisites

Understand the nature of your problem, the type of features, etc.

The metric that you choose for similarity (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.
Similarity/Distance measures

Euclidean distance

$d(\vec{x}, \vec{y}) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$

Highly depends on the scale of the features; may require normalization.
City block (Manhattan) distance

$d_{M1}(\vec{x}, \vec{y}) = \sum_{i=1}^{d} |x_i - y_i|$
[Figure: pairs of expression profiles with their Euclidean distances, e.g. d_euc = 0.58, 1.13, 2.61.] These examples of Euclidean distance match our intuition of dissimilarity pretty well... but what about pairs with d_euc = 1.41 and d_euc = 1.22? What might be going on with the expression profiles on the left? On the right?
Similarity/Distance measures
Cosine

$C_{\text{cosine}}(\vec{x}, \vec{y}) = \frac{\sum_{i=1}^{N} x_i y_i}{\lVert \vec{x} \rVert \, \lVert \vec{y} \rVert}$

Pearson correlation

$C_{\text{pearson}}(\vec{x}, \vec{y}) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\left[\sum_{i=1}^{N} (x_i - m_x)^2\right]\left[\sum_{i=1}^{N} (y_i - m_y)^2\right]}}$
Both are invariant to scaling (Pearson is also invariant to adding a constant).
Spearman correlation: the same idea applied to ranks.
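A minimal sketch of these three measures with NumPy/SciPy (illustrative vectors only; pearsonr and spearmanr also return a p-value, ignored here):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # y = 2x + 1: scaled and shifted copy of x

cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
r_pearson, _ = pearsonr(x, y)        # 1.0: invariant to scaling and addition
rho_spearman, _ = spearmanr(x, y)    # 1.0: depends only on the ranks

print(cosine, r_pearson, rho_spearman)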
Similarity/Distance measures

Jaccard similarity

Used when interested in the size of the intersection relative to the union:

$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
Types of Clusterings

Important distinction between hierarchical and partitional sets of clusters.

Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Partitional Clustering
[Figure: the original points and one possible partitional clustering of them.]
Hierarchical Clustering
[Figure: two different hierarchical clusterings of points p1–p4, each shown as nested clusters and as a dendrogram (Dendrogram 1 and Dendrogram 2).]
Other Distinctions Between Sets of Clustering Methods

Exclusive versus non-exclusive
  In non-exclusive clusterings, points may belong to multiple clusters.
  Can represent multiple classes or 'border' points.

Fuzzy versus non-fuzzy
  In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.
  Weights must sum to 1.

Partial versus complete
  In some cases, we only want to cluster some of the data.

Heterogeneous versus homogeneous
  Clusters of widely different sizes, shapes, and densities.
Clustering Algorithms

Hierarchical clustering
K-means
Bi-clustering
Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: six points (1–6) shown as nested clusters and as the corresponding dendrogram, with merge heights between 0 and 0.2.]
Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters.
  Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.

The clusters may correspond to meaningful taxonomies.
  Examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...).
Hierarchical Clustering

Two main types of hierarchical clustering:

Agglomerative (bottom up):
  Start with the points as individual clusters.
  At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.

Divisive (top down):
  Start with one, all-inclusive cluster.
  At each step, split a cluster until each cluster contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm

More popular hierarchical clustering technique.

Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains

Key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms (see the sketch below).
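A minimal sketch of this procedure using SciPy on made-up 2-D points; the method argument selects the inter-cluster distance discussed on the next slides ('single' = MIN, 'complete' = MAX, 'average' = group average):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.RandomState(0).rand(10, 2)          # 10 points in 2-D
D = pdist(X, metric='euclidean')                  # condensed proximity matrix
Z = linkage(D, method='average')                  # repeatedly merge the two closest clusters

labels = fcluster(Z, t=3, criterion='maxclust')   # 'cut' the dendrogram into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree of merges.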
Starting Situation

Start with clusters of individual points and a proximity matrix. [Figure: individual points p1–p12 and the proximity matrix between them.]
Intermediate Situation

After some merging steps, we have some clusters. [Figure: clusters C1–C5 over points p1–p12 and the proximity matrix between the clusters.]
Intermediate Situation

We want to merge the two closest clusters (C2 and
C5) and update the proximity matrix.
C1
C2
C3
C4
C5
C1
C2
C3
C3
C4
C4
C5
C1
Proximity Matrix
C2
C5
...
p1
p2
p3
p4
p9
p10
p11
p12
After Merging

The question is: how do we update the proximity matrix? [Figure: the merged cluster C2 U C5 and a proximity matrix whose entries involving C2 U C5 are marked '?'.]
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1–p5 with "Similarity?" marked between them, next to the proximity matrix.]

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function
  Ward's Method uses squared error (not discussed further)
Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
  Determined by one pair of points, i.e., by one link in the proximity graph.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Hierarchical Clustering: MIN
[Figure: the nested single-link clusters of points 1–6 and the corresponding dendrogram, with merge heights between 0 and 0.2.]
Strength of MIN
[Figure: original points and the two clusters found by single link.] Can handle non-elliptical shapes.
Limitations of MIN
[Figure: original points and the two clusters found by single link.] Sensitive to noise and outliers.
Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
  Determined by all pairs of points in the two clusters.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Hierarchical Clustering: MAX
[Figure: the nested complete-link clusters of points 1–6 and the corresponding dendrogram, with merge heights up to about 0.4.]
Strength of MAX
[Figure: original points and the two clusters found by complete link.] Less susceptible to noise and outliers.
Limitations of MAX
[Figure: original points and the two clusters found by complete link.] Tends to break large clusters; biased towards globular clusters.
Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

$\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i,\; p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \cdot |\text{Cluster}_j|}$
Need to use average connectivity for scalability since total
proximity favors large clusters
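A minimal sketch of the group-average formula above on two made-up point sets:

import numpy as np
from scipy.spatial.distance import cdist

cluster_i = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_j = np.array([[4.0, 0.0], [5.0, 1.0], [6.0, 0.0]])

pairwise = cdist(cluster_i, cluster_j)                              # all |Ci| x |Cj| distances
group_average = pairwise.sum() / (len(cluster_i) * len(cluster_j))  # equivalently pairwise.mean()
print(group_average)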
      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Hierarchical Clustering: Group Average

[Figure: the nested group-average clusters of points 1–6 and the corresponding dendrogram, with merge heights up to about 0.25.]
Hierarchical Clustering: Group Average

Compromise between single and complete link.

Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN, MAX, and Group Average, showing how the resulting nested clusters differ.]
Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

Different schemes have problems with one or more of the following:
  Sensitivity to noise and outliers
  Difficulty handling different sized clusters and convex shapes
  Breaking large clusters (divisive)

The dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree goes on the left and which on the right.

They impose structure on the data, instead of revealing structure in the data.

How many clusters? (some suggestions later)
K-means Clustering

Partitional clustering approach.
Each cluster is associated with a centroid (center point).
Each point is assigned to the cluster with the closest centroid.
Number of clusters, K, must be specified.
The basic algorithm is very simple (see the sketch below).
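A minimal NumPy sketch of that basic loop (random initial centroids, made-up 2-D data); library implementations such as scikit-learn's KMeans add better initialization and stopping rules:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]    # random initial centroids
    for _ in range(n_iter):
        # assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:                            # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):           # centroids stopped moving: converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5.0])   # two well-separated blobs
labels, centroids = kmeans(X, K=2)
print(centroids)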
K-means Clustering – Details

Initial centroids are often chosen randomly.
  Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.

'Closeness' is most often measured by Euclidean distance (the typical choice), but cosine similarity, correlation, etc. can also be used.

K-means will converge for the common similarity measures mentioned above.
  Most of the convergence happens in the first few iterations.
  Often the stopping condition is relaxed to 'until relatively few points change clusters'.

Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
Evaluating K-means Clusters

Most common measure is the Sum of Squared Error (SSE):
  For each point, the error is the distance to the nearest cluster centroid.
  To get SSE, we square these errors and sum them:

$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$

  x is a data point in cluster C_i and m_i is the representative point for cluster C_i.
  One can show that m_i corresponds to the center (mean) of the cluster.
  Given two clusterings, we can choose the one with the smaller error.

One easy way to reduce SSE is to increase K, the number of clusters.
  However, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
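A minimal sketch of the SSE computation (made-up data); scikit-learn reports the same quantity as inertia_:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

sse = sum(np.sum((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2)
          for i in range(3))
print(sse, km.inertia_)   # the two values agree up to floating-point error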
Issues and Limitations for K-means
How to choose initial centers?
How to choose K?
How to handle outliers?
Clusters differing in shape, density, and size.
Assumes clusters are spherical in vector space.
Sensitive to coordinate changes.
Two different K-means Clusterings
[Figure: the same original points clustered by K-means in two different ways; one run finds the optimal clustering, another a sub-optimal one.]
Importance of Choosing Initial Centroids

[Figure: snapshots (iterations 1–6) of a K-means run converging from one random initialization.]
[Figure: the same run shown step by step, iterations 1–6.]
Importance of Choosing Initial Centroids …

[Figure: a K-means run from a different random initialization, shown after 5 iterations.]
[Figure: the second run shown step by step, iterations 1–5.]
Solutions to the Initial Centroids Problem

Multiple runs (see the sketch below)
Sample and use hierarchical clustering to determine initial centroids
Select more than k initial centroids and then select among these initial centroids
  Select the most widely separated
Bisecting K-means
  Not as susceptible to initialization issues
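A minimal scikit-learn sketch of the "multiple runs" idea (made-up data): n_init repeats the whole run and keeps the solution with the lowest SSE, and the default 'k-means++' initialization spreads the starting centroids apart, in the spirit of picking widely separated centers (k-means++ itself is not on the slide):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 2)
km = KMeans(n_clusters=4, init='k-means++', n_init=20, random_state=0).fit(X)
print(km.inertia_)   # best SSE over the 20 random initializations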
Bisecting K-means

Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering.
Bisecting K-means Example
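A minimal sketch of the bisecting idea (illustrative, not necessarily the exact variant in the slides): repeatedly pick the cluster with the largest SSE and split it with 2-means, using scikit-learn's KMeans for each split:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, seed=0):
    clusters = [np.arange(len(X))]                          # start with one all-inclusive cluster
    while len(clusters) < n_clusters:
        sse = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))             # take the cluster with the largest SSE
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]    # keep both halves
    return clusters

X = np.random.RandomState(0).rand(300, 2)
print([len(part) for part in bisecting_kmeans(X, n_clusters=4)])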
Issues and Limitations for K-means
How to choose initial centers?
How to choose K?
  Depends on the problem + some suggestions later
How to handle outliers?
  Preprocessing
Clusters differing in shape, density, and size
Limitations of K-means: Differing Sizes
[Figure: original points vs. the 3 clusters K-means finds.]
Limitations of K-means: Differing Density

[Figure: original points vs. the 3 clusters K-means finds.]
Limitations of K-means: Non-globular Shapes

[Figure: original points vs. the 2 clusters K-means finds.]
Overcoming K-means Limitations
[Figure: original points vs. K-means clusters.] One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together.
[Figures: two more examples where using many K-means clusters recovers parts of the natural clusters.]
K-means
Pros
  Simple
  Fast for low-dimensional data
  Can find pure sub-clusters if a large number of clusters is specified

Cons
  Cannot handle non-globular data of different sizes and densities
  Will not identify outliers
  Restricted to data which has the notion of a center (centroid)
Biclustering/Co-clustering
Two genes can have similar expression patterns only under some conditions.
Similarly, in two related conditions, some genes may exhibit different expression patterns.

[Figure: an N-genes by M-conditions expression matrix.]
Biclustering

As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a "checkerboard" structure:

[Figure: checkerboard structure of biclusters in the expression matrix.]
Biclustering
In general, a hard task (NP-hard).
Heuristic algorithms described briefly:
  Cheng & Church – deletion of rows and columns; biclusters are discovered one at a time
  Order-Preserving Submatrices (OPSM), Ben-Dor et al.
  Coupled Two-Way Clustering (CTWC), Getz et al.
  Spectral Co-clustering
Cheng and Church

Objective function for the heuristic method (to be minimized), the mean squared residue of the bicluster defined by row set I and column set J:

$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$

where $a_{iJ}$ is the mean of row i over the columns in J, $a_{Ij}$ is the mean of column j over the rows in I, and $a_{IJ}$ is the mean of the whole submatrix.
Greedy method:
  Initialization: the bicluster contains all rows and columns.
  Iteration:
    1. Compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse.
    2. Remove the row or column that gives the maximum decrease of H.
  Termination: when no action will decrease H, or H falls below a chosen threshold.
  Mask this bicluster and continue.

Problem: removing "trivial" biclusters.
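A minimal sketch of the mean squared residue H(I, J) above for a candidate bicluster given as row and column index lists (made-up data matrix):

import numpy as np

def msr(A, rows, cols):
    sub = A[np.ix_(rows, cols)]
    a_iJ = sub.mean(axis=1, keepdims=True)   # row means a_iJ
    a_Ij = sub.mean(axis=0, keepdims=True)   # column means a_Ij
    a_IJ = sub.mean()                        # overall mean a_IJ
    return np.mean((sub - a_iJ - a_Ij + a_IJ) ** 2)

A = np.random.RandomState(0).rand(20, 10)    # genes x conditions
print(msr(A, rows=[0, 2, 5], cols=[1, 3, 4, 7]))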
Ben-Dor et al. (OPSM)

Model:
  For a condition set T and a gene g, the conditions in T can be ordered so that the expression values of g are sorted in ascending order (suppose the values are all unique).
  Submatrix A is a bicluster if there is an ordering (permutation) of T such that the expression values of all genes in G are sorted in ascending order.

Idea of the algorithm: grow partial models until they become complete models.
                      t1   t2   t3   t4   t5
g1                     7   13   19    2   50
g2                    19   23   39    6   42
g3                     4    6    8    2   10
Induced permutation    2    3    4    1    5
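The "induced permutation" row is just the rank of each condition within a gene's profile; a minimal NumPy check using g1 from the table:

import numpy as np

g1 = np.array([7, 13, 19, 2, 50])
induced = np.argsort(np.argsort(g1)) + 1   # rank of each condition, 1 = smallest value
print(induced)                             # [2 3 4 1 5]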
Getz et al. (CTWC)
Idea: repeatedly perform one-way clustering on genes/conditions.
Stable clusters of genes are used as the attributes for condition clustering, and vice versa.
Spectral Co-clustering

Main idea:
  Normalize along both dimensions
  Form a matrix of size m+n (using SVD)
  Use k-means to cluster both types of data

http://adios.tau.ac.il/SpectralCoClustering/
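scikit-learn ships a SpectralCoclustering implementation based on the same SVD idea (a different tool from the one at the URL above); a minimal sketch on a made-up non-negative genes-by-conditions matrix:

import numpy as np
from sklearn.cluster import SpectralCoclustering

A = np.abs(np.random.RandomState(0).randn(50, 30))          # genes x conditions
model = SpectralCoclustering(n_clusters=4, random_state=0).fit(A)
print(model.row_labels_[:10])      # bicluster assignment of the first 10 genes
print(model.column_labels_[:10])   # bicluster assignment of the first 10 conditions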
Evaluating cluster quality

Use known classes (pairwise F-measure, best-class F-measure).
Clusters can be evaluated with "internal" as well as "external" measures:
  Internal measures are related to the inter/intra cluster distances.
  External measures are related to how representative the current clusters are of the "true" classes.
Inter/Intra Cluster Distances
Intra-cluster distance
  (Sum/Min/Max/Avg of) the (absolute/squared) distance between:
    all pairs of points in the cluster, OR
    the centroid and all points in the cluster, OR
    the "medoid" and all points in the cluster

Inter-cluster distance
  Sum the (squared) distance between all pairs of clusters, where the distance between two clusters is defined as:
    the distance between their centroids/medoids (spherical clusters), OR
    the distance between the closest pair of points belonging to the clusters (chain-shaped clusters)
Davies-Bouldin index

A function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between-cluster (i.e. inter-cluster) separation.
Let C = {C_1, ..., C_k} be a clustering of a set of N objects:

$DB = \frac{1}{k} \sum_{i=1}^{k} R_i$   with   $R_i = \max_{j=1,\dots,k,\; j \neq i} R_{ij}$

and

$R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\lVert c_i - c_j \rVert}$

where C_i is the i-th cluster and c_i is the centroid of cluster i.
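scikit-learn's davies_bouldin_score computes this kind of index directly (it uses mean within-cluster distances as the scatter term, which may differ slightly from the variance-based form above); a minimal sketch on made-up data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.RandomState(0).rand(150, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))   # lower values indicate better-separated clusters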
Davies-Bouldin index example

For example, for the clusters shown [in the figure]:

Compute $R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\lVert c_i - c_j \rVert}$ for each pair:
  var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
  The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33
  So R12 = 1, R13 = 0.152, R23 = 0.797

Now compute $R_i = \max_{j \neq i} R_{ij}$:
  R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and R23); R3 = 0.797 (max of R31 and R32)

Finally, compute $DB = \frac{1}{k} \sum_{i=1}^{k} R_i$:
  DB = 0.932
Davies-Bouldin index example (ctd)

For example, for the clusters shown [in the figure]:

Compute $R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\lVert c_i - c_j \rVert}$:
  Only 2 clusters here
  var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while c2 = 18.33
  R12 = 1.26

Now compute $R_i = \max_{j \neq i} R_{ij}$: since we have only 2 clusters, R1 = R12 = 1.26 and R2 = R21 = 1.26.

Finally, compute $DB = \frac{1}{k} \sum_{i=1}^{k} R_i$: DB = 1.26.
Other criteria

Dunn method
  δ(X_i, X_j): inter-cluster distance between clusters X_i and X_j
  Δ(X_k): intra-cluster distance of cluster X_k

$V(U) = \min_{1 \le i \le c} \left\{ \min_{1 \le j \le c,\; j \neq i} \left\{ \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \right\} \right\}$

Silhouette method
  Identifying outliers

C-index
  Compare the sum of distances S over all pairs from the same cluster against the same number of smallest and largest pairwise distances.
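A minimal sketch of using the mean silhouette to compare candidate numbers of clusters with scikit-learn (made-up data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(200, 2)
for c in range(2, 7):
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
    print(c, silhouette_score(X, labels))   # higher mean silhouette is better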
Example dataset
AML/ALL dataset (Golub et al.): Leukemia
  72 patients (samples)
  7129 genes (features)
  4 groups:
    Two major types, ALL & AML
    T & B cells in ALL
    With/without treatment in AML

[Figure: the samples-by-genes expression matrix.]
[Table: validity indices V11–V63 and their averages computed on the AML/ALL dataset for c = 2, 3, 4, 5, 6 clusters.]
AML/ALL dataset

Davies-Bouldin index: c = 4
Dunn method: c = 2
Silhouette method: c = 2
Visual evaluation: coherency
Cluster quality example
Do you see clusters?

[Figures: two example datasets, each clustered with c = 2 to 10; mean silhouette values below.]

 c    Silhouette (dataset 1)    Silhouette (dataset 2)
 2    0.4922                    0.4863
 3    0.5739                    0.5762
 4    0.4773                    0.5957
 5    0.4991                    0.5351
 6    0.5404                    0.5701
 7    0.541                     0.5487
 8    0.5171                    0.5083
 9    0.5956                    0.5311
10    0.6446                    0.5229
Dimensionality Reduction
Map points in high-dimensional space to a lower number of dimensions.
Preserve structure: pairwise distances, etc.
Useful for further processing:
  Less computation, fewer parameters
  Easier to understand, visualize
Dimensionality Reduction
Feature selection vs. feature extraction

Feature selection: select important features
  Pros:
    Meaningful features
    Less work acquiring the data
  Unsupervised criteria: variance, fold change, UFF
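A minimal sketch of unsupervised, variance-based feature selection with scikit-learn's VarianceThreshold (made-up data; the threshold is arbitrary):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.random.RandomState(0).rand(100, 50)
selector = VarianceThreshold(threshold=0.05)   # drop near-constant features
X_reduced = selector.fit_transform(X)
print(X.shape, "->", X_reduced.shape)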
Dimensionality Reduction
Feature extraction: transforms the entire feature set to a lower dimension
  Pros:
    Uses an objective function to select the best projection
    Sometimes single features are not good enough
  Unsupervised methods: PCA, SVD
Principal Components Analysis (PCA)

Approximating a high-dimensional data set with a lower-dimensional linear subspace.

[Figure: a cloud of data points with the original axes and the first and second principal components drawn through it.]
Singular Value Decomposition
Principal Components Analysis (PCA)

Rule of thumb for selecting the number of components:
  "Knee" in the scree plot
  Cumulative percentage of variance explained

[Figure: the same data cloud with its principal components.]
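A minimal sketch of the cumulative-variance rule of thumb with scikit-learn's PCA (made-up data; the 90% cutoff is arbitrary):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 20)
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.90)) + 1   # smallest count explaining >= 90%
print(n_components, cumvar[:n_components])
# A scree plot of pca.explained_variance_ would show the "knee".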
Tools for clustering
Matlab – COMPACT: http://adios.tau.ac.il/compact/
Tools for clustering
Cluster + TreeView (Eisen et al.): http://rana.lbl.gov/eisen/?page_id=42
Summary
Clustering is ill-defined and considered an "art".
In practice, this means you need to:
  Understand your data beforehand
  Know how to interpret the clusters afterwards
The problem determines the best solution (which measure, which clustering algorithm); try to experiment with different options.