Spectral Clustering
Jing Gao
SUNY Buffalo
Motivation
• Complex cluster shapes
– K-means performs poorly because it can only find spherical clusters
• Spectral approach
– Use similarity graphs to encode local neighborhood information
– Data points are vertices of the graph
– Connect points which are “close”
[Figure: 2-D scatter plot (axes from -2 to 2) of data forming complex, non-spherical cluster shapes]
Similarity Graph
• Represent the dataset as a weighted graph G(V, E)
– V = {x_i}: set of n vertices representing the data points
– E = {w_ij}: set of weighted edges indicating pairwise similarity between points
• All vertices which can be reached from each other by a path form a connected component
• If there is only one connected component, the graph is fully connected
[Figure: example graph with vertices x1..x6 and edge weights w12 = 0.8, w13 = 0.6, w15 = 0.1, w23 = 0.8, w34 = 0.2, w45 = 0.8, w46 = 0.7, w56 = 0.8]
Graph Construction
• ε -neighborhood graph
– Identify a threshold value ε and include an edge between two points if their affinity is greater than ε
• k -nearest neighbors
– Insert edges between a node and its k -nearest neighbors
– Each node will be connected to (at least) k nodes
• Fully connected
– Insert an edge between every pair of nodes
– Weight of the edge represents similarity
– Gaussian kernel: $w_{ij} = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$ (see the sketch below)
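As an illustration, here is a minimal NumPy sketch of the fully connected construction with a Gaussian kernel; the function name and default σ are illustrative choices, not from the slides.

```python
import numpy as np

def gaussian_similarity(points, sigma=1.0):
    """Fully connected graph: w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    diff = points[:, None, :] - points[None, :, :]   # pairwise difference vectors
    sq_dist = (diff ** 2).sum(axis=-1)               # squared Euclidean distances
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                         # no self-loops
    return W
```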
ε-neighborhood Graph
• ε -neighborhood
– Compute pairwise distance between any two objects
– Connect each point to all other points which have distance smaller than a threshold ε
• Weighted or unweighted
– Unweighted—There is an edge if one point belongs to the ε –neighborhood of another point
– Weighted—Transform distance to similarity and use similarity as edge weights
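A sketch of the ε-neighborhood construction under the same assumptions (Euclidean distance; the function name is hypothetical). The weighted variant converts distances to Gaussian similarities as above.

```python
import numpy as np

def epsilon_graph(points, eps, weighted=False, sigma=1.0):
    """Connect points whose Euclidean distance is smaller than eps."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adjacent = (dist < eps) & ~np.eye(len(points), dtype=bool)
    if weighted:                                     # distance -> similarity
        return np.where(adjacent, np.exp(-dist ** 2 / (2 * sigma ** 2)), 0.0)
    return adjacent.astype(float)                    # unweighted 0/1 edges
```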
kNN Graph
• Directed graph
– Connect each point to its k nearest neighbors
• kNN graph
– Undirected graph
– An edge between x_i and x_j: there is an edge from x_i to x_j OR from x_j to x_i in the directed graph
• Mutual kNN graph
– Undirected graph
– Its edge set is a subset of that of the kNN graph
– An edge between x_i and x_j: there is an edge from x_i to x_j AND from x_j to x_i in the directed graph
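The three variants can be sketched as follows (illustrative code; a brute-force O(n²) neighbor search is used for clarity):

```python
import numpy as np

def knn_graphs(points, k):
    """Return (directed, knn, mutual_knn) 0/1 adjacency matrices."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    n = len(points)
    directed = np.zeros((n, n))
    for i in range(n):
        directed[i, np.argsort(dist[i])[:k]] = 1.0   # edges to k nearest neighbors
    knn = np.maximum(directed, directed.T)      # i->j OR  j->i
    mutual = np.minimum(directed, directed.T)   # i->j AND j->i
    return directed, knn, mutual
```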
Clustering Objective
Traditional definition of a “good” clustering
• Points assigned to same cluster should be highly similar
• Points assigned to different clusters should be highly dissimilar
Apply this objective to our graph representation
[Figure: the example similarity graph partitioned into {x1, x2, x3} and {x4, x5, x6}]
Minimize the weight of between-group connections
Graph Cuts
• Express the clustering objective as a function of the edge cut of the partition
• Cut: sum of weights of edges with exactly one vertex in each group
• We want to find the minimal cut between groups
$$\mathrm{cut}(C_1, C_2) = \sum_{i \in C_1,\, j \in C_2} w_{ij}$$

[Figure: the example graph partitioned into C_1 = {x1, x2, x3} and C_2 = {x4, x5, x6}; cut(C_1, C_2) = 0.1 + 0.2 = 0.3]
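Computing the cut is just a sum over the block of W that crosses the partition. A sketch on the example graph (vertices x1..x6 are 0-indexed; the weights are reconstructed from the similarity-matrix slide later in the deck):

```python
import numpy as np

def cut_value(W, C1):
    """Sum of weights of edges with exactly one endpoint in C1."""
    in_C1 = np.zeros(W.shape[0], dtype=bool)
    in_C1[list(C1)] = True
    return W[in_C1][:, ~in_C1].sum()

# Example graph (x1..x6 as indices 0..5), edges as in the figure:
W = np.zeros((6, 6))
for (i, j), w in {(0, 1): 0.8, (0, 2): 0.6, (0, 4): 0.1, (1, 2): 0.8,
                  (2, 3): 0.2, (3, 4): 0.8, (3, 5): 0.7, (4, 5): 0.8}.items():
    W[i, j] = W[j, i] = w

print(cut_value(W, [0, 1, 2]))   # 0.3: edges x1-x5 (0.1) and x3-x4 (0.2)
```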
Bi-partitional Cuts
• Minimum (bi-partitional) cut: $\min_{C_1, C_2} \mathrm{cut}(C_1, C_2)$
Example
• Minimum cut
[Figure: example graph with vertices A, B, C, D, E and edge weights 0.08, 0.09, 0.19, 0.2, 0.22, 0.24, 0.45, 1]
Normalized Cuts
• Minimal (bi-partitional) normalized cut:

$$\mathrm{Ncut}(C_1, C_2) = \frac{\mathrm{cut}(C_1, C_2)}{\mathrm{Vol}(C_1)} + \frac{\mathrm{cut}(C_1, C_2)}{\mathrm{Vol}(C_2)}, \qquad \mathrm{Vol}(C) = \sum_{i \in C,\, j \in V} w_{ij}$$
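The volume terms penalize unbalanced partitions. A sketch continuing from the previous snippet (NumPy imported, W defined there):

```python
def ncut_value(W, C1):
    """cut/Vol(C1) + cut/Vol(C2), with Vol(C) = sum_{i in C, j in V} w_ij."""
    in_C1 = np.zeros(W.shape[0], dtype=bool)
    in_C1[list(C1)] = True
    cut = W[in_C1][:, ~in_C1].sum()
    return cut / W[in_C1].sum() + cut / W[~in_C1].sum()
```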
Example
• Normalized minimum cut
[Figure: the same A-E graph, showing the partition chosen by the normalized minimum cut]
Problem
• Finding a minimum normalized cut is NP-hard
• There are efficient approximations using linear algebra
• They are based on the Laplacian matrix, or graph Laplacian
Matrix Representations
• Similarity matrix (W)
– n × n matrix
– w_ij: edge weight between vertices x_i and x_j
[Figure: the example graph with vertices x1..x6]

W =
        x1    x2    x3    x4    x5    x6
  x1   0     0.8   0.6   0     0.1   0
  x2   0.8   0     0.8   0     0     0
  x3   0.6   0.8   0     0.2   0     0
  x4   0     0     0.2   0     0.8   0.7
  x5   0.1   0     0     0.8   0     0.8
  x6   0     0     0     0.7   0.8   0
• Important properties
– Symmetric matrix
• Degree matrix (D)
– n × n diagonal matrix
– d_i = Σ_j w_ij: the diagonal entry for vertex x_i is the total weight of its incident edges
• Used to normalize the adjacency (similarity) matrix

For the example graph:

D =
        x1    x2    x3    x4    x5    x6
  x1   1.5   0     0     0     0     0
  x2   0     1.6   0     0     0     0
  x3   0     0     1.6   0     0     0
  x4   0     0     0     1.7   0     0
  x5   0     0     0     0     1.7   0
  x6   0     0     0     0     0     1.5
• Laplacian matrix (L)
– n × n symmetric matrix
– L = D - W

For the example graph:

L =
        x1    x2    x3    x4    x5    x6
  x1   1.5  -0.8  -0.6   0    -0.1   0
  x2  -0.8   1.6  -0.8   0     0     0
  x3  -0.6  -0.8   1.6  -0.2   0     0
  x4   0     0    -0.2   1.7  -0.8  -0.7
  x5  -0.1   0     0    -0.8   1.7  -0.8
  x6   0     0     0    -0.7  -0.8   1.5

• Important properties
– Eigenvalues are non-negative real numbers
– Eigenvectors are real and orthogonal
– The eigenvalues and eigenvectors provide insight into the connectivity of the graph
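These properties are easy to check numerically; a sketch continuing from the W defined in the earlier cut example:

```python
D = np.diag(W.sum(axis=1))       # degree matrix of the example graph
L = D - W                        # unnormalized graph Laplacian

eigvals = np.linalg.eigvalsh(L)  # eigvalsh: solver for symmetric matrices
print(np.round(eigvals, 1))      # all non-negative; the smallest is 0
```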
Find an Optimal Min-Cut (Hall '70, Fiedler '73)
• Express a bi-partition (C_1, C_2) as a vector f with one component per vertex:

$$f_i = \begin{cases} +1 & \text{if } x_i \in C_1 \\ -1 & \text{if } x_i \in C_2 \end{cases}$$

• We can minimize the cut of the partition by finding a non-trivial vector f that minimizes the function

$$g(f) = \sum_{i,j \in V} w_{ij} (f_i - f_j)^2 = 2\, f^T L f$$

where L is the Laplacian matrix.
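The equality between the pairwise form and the quadratic form follows from L = D - W (with d_i = Σ_j w_ij); a short derivation:

```latex
f^T L f = f^T D f - f^T W f
        = \sum_i d_i f_i^2 - \sum_{i,j} w_{ij} f_i f_j
        = \tfrac{1}{2}\Big(\sum_i d_i f_i^2
          - 2\sum_{i,j} w_{ij} f_i f_j + \sum_j d_j f_j^2\Big)
        = \tfrac{1}{2}\sum_{i,j \in V} w_{ij} (f_i - f_j)^2 .
```

For f_i ∈ {+1, -1}, the term (f_i - f_j)² equals 4 exactly when i and j lie in different clusters and 0 otherwise, so g(f) = 8 · cut(C_1, C_2): minimizing g(f) minimizes the cut.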
Why does this work?
• How does the eigendecomposition of L relate to clustering?
– Eigenvector ↔ cluster assignment f
– Eigenvalue ↔ value of the cluster objective function
• If we let f be a unit-norm eigenvector of L, then f^T L f = λ: each eigenvalue is the value of the cluster objective attained by its eigenvector
Optimal Min-Cut
• The Laplacian matrix L is positive semi-definite
• The Rayleigh theorem shows:
– The minimum of g(f) over non-trivial vectors (those orthogonal to the constant vector) is given by λ_2, the 2nd smallest eigenvalue of the Laplacian L
– The optimal solution for f is the eigenvector corresponding to λ_2, referred to as the Fiedler vector
Spectral Bi-partitioning Algorithm
1. Pre-processing
– Build the Laplacian matrix L of the graph (the 6 × 6 matrix shown above for the example)
2. Decomposition
– Find the eigenvalues Λ and eigenvectors X of the matrix L
– Map each vertex to the corresponding component of the eigenvector for λ_2

For the example graph:

Λ = (0.0, 0.4, 2.2, 2.3, 2.5, 3.0)

The eigenvector corresponding to λ_2 = 0.4 assigns:
  x1: 0.2, x2: 0.2, x3: 0.2, x4: -0.4, x5: -0.7, x6: -0.7
Spectral Bi-partitioning Algorithm
[Table: the full eigenvector matrix of the Laplacian, columns matched to the corresponding eigenvalues in increasing order]
3. Grouping
– Sort the components of the reduced 1-dimensional vector
– Identify clusters by splitting the sorted vector in two at zero
• Cluster C_1 (positive components): x1: 0.2, x2: 0.2, x3: 0.2
• Cluster C_2 (negative components): x4: -0.4, x5: -0.7, x6: -0.7
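The whole bi-partitioning procedure fits in a few lines of NumPy; a sketch on the running example (note the eigenvector is defined only up to sign, so the two labels may come out swapped):

```python
import numpy as np

W = np.zeros((6, 6))                       # running example graph
for (i, j), w in {(0, 1): 0.8, (0, 2): 0.6, (0, 4): 0.1, (1, 2): 0.8,
                  (2, 3): 0.2, (3, 4): 0.8, (3, 5): 0.7, (4, 5): 0.8}.items():
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                    # eigenvector of the 2nd smallest eigenvalue
C1 = np.where(fiedler > 0)[0]              # split at zero
C2 = np.where(fiedler <= 0)[0]
print(C1, C2)                              # {x1,x2,x3} vs {x4,x5,x6}, up to sign
```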
Normalized Laplacian
• Two common normalizations of the Laplacian matrix L:
– Random walk: $L_{rw} = D^{-1}(D - W)$
– Symmetric: $L_{sym} = D^{-0.5}(D - W)D^{-0.5}$
[Figure: the example graph and the corresponding normalized matrix]
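Both normalizations are one-liners given D and W (sketch; assumes every vertex has non-zero degree):

```python
import numpy as np

def normalized_laplacians(W):
    """Return (L_rw, L_sym) for a similarity matrix W with positive degrees."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    L_rw = np.diag(1.0 / d) @ L                 # D^{-1} (D - W)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt         # D^{-1/2} (D - W) D^{-1/2}
    return L_rw, L_sym
```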
K-way Partitioning
• How do we partition a graph into k clusters?
1. Recursive bi-partitioning (Hagen et al., '91)
• Recursively apply the bi-partitioning algorithm in a hierarchical divisive manner
• Disadvantages: inefficient (O(n^3) overall), unstable
2. Cluster multiple eigenvectors (Shi & Malik, '00)
• Build a reduced space from multiple eigenvectors
• Commonly used in recent papers; the preferable approach
Eigenvectors & Eigenvalues
[Figure: visualization of the eigenvalues and eigenvectors of a graph Laplacian]
K -way Spectral Clustering Algorithm
• Pre-processing
– Compute Laplacian matrix L
• Decomposition
– Find the eigenvalues and eigenvectors of L
– Build embedded space from the eigenvectors corresponding to the k smallest eigenvalues
• Clustering
– Apply k-means to the reduced n × k space to produce k clusters
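A compact sketch of the k-way procedure, using SciPy's kmeans2 for the final step (any k-means implementation would do):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, k):
    """Embed vertices into the k smallest eigenvectors of L, then run k-means."""
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)        # columns sorted by ascending eigenvalue
    embedding = eigvecs[:, :k]            # reduced n x k space
    _, labels = kmeans2(embedding, k, minit='++', seed=0)
    return labels
```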
How to Choose k
• Eigengap: the difference between two consecutive eigenvalues
• The most stable clustering is generally given by the value of k that maximizes the eigengap

$$\Delta_k = |\lambda_{k+1} - \lambda_k|$$

[Figure: plot of the first 20 eigenvalues of a sample graph; λ_1 and λ_2 are small and the gap to λ_3 is large, so choose k = 2]
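The heuristic translates directly into code (sketch; λ indices are 1-based in the formula, 0-based in NumPy):

```python
import numpy as np

def choose_k(W, k_max=10):
    """Return the k in [1, k_max] that maximizes |lambda_{k+1} - lambda_k|."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals = np.linalg.eigvalsh(L)[:k_max + 1]   # ascending eigenvalues
    gaps = np.diff(eigvals)                       # gaps[k-1] = lambda_{k+1} - lambda_k
    return int(np.argmax(gaps)) + 1
```

On the running example this picks k = 2, matching the large eigengap between λ_2 ≈ 0.4 and λ_3 ≈ 2.2 from the decomposition slide.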
Summary
• Clustering formulated as a graph-cut problem
• The min-cut is solved approximately by eigendecomposition of the Laplacian matrix
• Bi-partitioning and k-way spectral clustering procedures
Hierarchical Clustering
Jing Gao
SUNY Buffalo
Hierarchical Clustering
• Agglomerative approach (bottom-up)
– Initialization: each object is a cluster
– Iteration: merge the two clusters which are most similar to each other, until all objects are merged into a single cluster
[Figure: objects a, b, c, d, e merged over steps 0 to 4 into clusters {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}]
Hierarchical Clustering
• Divisive approach (top-down)
– Initialization: all objects stay in one cluster
– Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object
[Figure: the same objects split over steps 0 to 4, reversing the merges above]
Dendrogram
• A tree that shows how clusters are merged/split hierarchically
• Each node on the tree is a cluster; each leaf node is a singleton cluster
Dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
Agglomerative Clustering Algorithm
• The more popular hierarchical clustering technique
• The basic algorithm is straightforward:
1. Compute the distance matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains
• The key operation is the computation of the distance between two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms; a SciPy sketch follows
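With SciPy the whole algorithm is a couple of calls (a sketch; the random data and the choice of single linkage are illustrative, and the linkage methods map onto the distance definitions discussed next):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.random.rand(12, 2)                   # illustrative 2-D data
dist = pdist(points)                             # step 1: condensed distance matrix
Z = linkage(dist, method='single')               # steps 2-6: merge closest clusters
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
```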
Starting Situation
• Start with clusters of individual points and a distance matrix
[Figure: points p1 ... p12 and the pairwise distance matrix over p1, p2, p3, p4, p5, ...]
Intermediate Situation
• After some merging steps, we have some clusters
• Choose the two clusters that have the smallest distance (largest similarity) to merge
[Figure: clusters C1 ... C5 over the points, with the current inter-cluster distance matrix]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the distance matrix
[Figure: clusters C1 ... C5 with C2 and C5 highlighted for merging, and the distance matrix]
After Merging
• The question is: how do we update the distance matrix?
[Figure: merged cluster C2 ∪ C5 alongside C1, C3, C4; its distances to the other clusters are marked "?" in the distance matrix]
How to Define Inter-Cluster Distance?
• MIN
• MAX
• Group average
• Distance between centroids
• ...
[Figure: two clusters of points and the distance matrix]
MIN or Single Link
• Inter-cluster distance
– The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.
– Determined by one pair of points, i.e., by one link in the proximity graph

$$d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$$
MIN
[Figure: nested clusters found by single link on six points, with the corresponding dendrogram (merge heights from 0.05 to 0.2)]
Strength of MIN
• Can handle non-elliptical shapes
[Figure: original points and the two clusters found]
Limitations of MIN
• Sensitive to noise and outliers
[Figure: original points and the two clusters found]
MAX or Complete Link
• Inter-cluster distance
– The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters

$$d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$$
MAX
[Figure: nested clusters found by complete link on the same six points, with the corresponding dendrogram (merge heights from 0.05 to 0.4)]
Strength of MAX
• Less susceptible to noise and outliers
[Figure: original points and the two clusters found]
Limitations of MAX
• Tends to break large clusters
• Biased towards globular clusters
[Figure: original points, with MIN (2 clusters) and MAX (2 clusters) results compared]
Group Average or Average Link
• Inter-cluster distance
– The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters
– Determined by all pairs of points in the two clusters

$$d_{\mathrm{avg}}(C_i, C_j) = \underset{p \in C_i,\, q \in C_j}{\mathrm{avg}}\, d(p, q)$$
Group Average
[Figure: nested clusters found by average link, with the corresponding dendrogram (merge heights from 0.05 to 0.25)]
Group Average
• A compromise between single link and complete link
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
Centroid Distance
• Inter-cluster distance
– The distance between two clusters is represented by the distance between the centers of the clusters
– Determined by cluster centroids
$$d_{\mathrm{mean}}(C_i, C_j) = d(m_i, m_j)$$

where m_i and m_j are the centroids of C_i and C_j.
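For reference, the four definitions correspond directly to SciPy linkage methods (sketch; 'centroid' is meaningful for Euclidean distances, so raw observations are passed there):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

points = np.random.rand(12, 2)                 # illustrative data
dist = pdist(points)

Z_min  = linkage(dist, method='single')        # MIN / single link
Z_max  = linkage(dist, method='complete')      # MAX / complete link
Z_avg  = linkage(dist, method='average')       # group average
Z_cent = linkage(points, method='centroid')    # distance between centroids
```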
Summary
• Agglomerative and divisive hierarchical clustering
• Several ways of defining inter-cluster distance
• The properties of the clusters produced by the different approaches depend on the inter-cluster distance definition used