
Spectral Clustering

Jing Gao

SUNY Buffalo


Motivation

• Complex cluster shapes

– K-means performs poorly because it can only find spherical clusters

• Spectral approach

– Use similarity graphs to encode local neighborhood information

– Data points are vertices of the graph

– Connect points which are “close”

[Figure: 2-D scatter plot of a dataset with complex, non-spherical cluster shapes]

Similarity Graph

• Represent the dataset as a weighted graph G(V, E)

– V = {x_i}: set of n vertices representing the data points

– E = {w_ij}: set of weighted edges indicating pairwise similarity between points

• All vertices which can be reached from each other by a path form a connected component

• If there is only one connected component, the graph is fully connected

[Figure: example graph on six vertices x1–x6 with edge weights w12 = 0.8, w13 = 0.6, w23 = 0.8, w15 = 0.1, w34 = 0.2, w45 = 0.8, w46 = 0.7, w56 = 0.8]

Graph Construction

• ε -neighborhood graph

– Identify a threshold value, ε , and include edges if the affinity between two points is greater than ε

• k -nearest neighbors

– Insert edges between a node and its k -nearest neighbors

– Each node will be connected to (at least) k nodes

• Fully connected

– Insert an edge between every pair of nodes

– Weight of the edge represents similarity

– Gaussian kernel: w_ij = exp(−‖x_i − x_j‖² / (2σ²))
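As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the fully connected construction with the Gaussian kernel; the function name and the choice of σ are illustrative assumptions.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Fully connected similarity graph: w_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean distances
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W

X = np.random.rand(8, 2)                      # toy data: 8 points in 2-D
W_full = gaussian_similarity(X, sigma=0.5)    # dense n x n similarity matrix
```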

ε-neighborhood Graph

• ε -neighborhood

– Compute pairwise distance between any two objects

– Connect each point to all other points which have distance smaller than a threshold ε

• Weighted or unweighted

– Unweighted—There is an edge if one point belongs to the ε –neighborhood of another point

– Weighted—Transform distance to similarity and use similarity as edge weights

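A corresponding sketch for the ε-neighborhood graph (again illustrative, not the lecture's code); the value of `eps` and the particular distance-to-similarity transform are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def epsilon_graph(X, eps, weighted=True):
    """Connect points whose pairwise distance is smaller than eps."""
    dists = squareform(pdist(X))                  # n x n distance matrix
    adjacent = (dists < eps) & (dists > 0)        # exclude self-loops
    if not weighted:
        return adjacent.astype(float)             # unweighted version
    sims = np.exp(-dists ** 2 / (2 * eps ** 2))   # one possible distance-to-similarity map
    return np.where(adjacent, sims, 0.0)          # weighted version

W_eps = epsilon_graph(np.random.rand(8, 2), eps=0.4)
```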

kNN Graph

• Directed graph

– Connect each point to its k nearest neighbors

• kNN graph

– Undirected graph

– There is an edge between x_i and x_j if there is an edge from x_i to x_j OR from x_j to x_i in the directed graph

• Mutual kNN graph

– Undirected graph

– There is an edge between x_i and x_j if there is an edge from x_i to x_j AND from x_j to x_i in the directed graph

– Its edge set is a subset of that of the kNN graph
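A small sketch of the two symmetrization rules, starting from a similarity matrix in which higher values mean "nearer"; the helper name is illustrative.

```python
import numpy as np

def knn_graph(W, k, mutual=False):
    """kNN graph (OR rule) or mutual kNN graph (AND rule) from a similarity matrix W."""
    n = W.shape[0]
    directed = np.zeros((n, n), dtype=bool)
    for i in range(n):
        nbrs = np.argsort(W[i])[::-1][:k]      # indices of the k most similar points to i
        directed[i, nbrs] = True
    # OR gives the kNN graph; AND gives the (sparser) mutual kNN graph
    keep = directed & directed.T if mutual else directed | directed.T
    return np.where(keep, W, 0.0)
```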

Clustering Objective

Traditional definition of a “good” clustering

• Points assigned to same cluster should be highly similar

• Points assigned to different clusters should be highly dissimilar

Apply this objective to our graph representation

[Figure: the six-vertex example graph partitioned into two groups]

Minimize the weight of between-group connections

Graph Cuts

• Express the clustering objective as a function of the edge cut of the partition

• Cut: sum of the weights of edges with exactly one endpoint in each group

• We want to find the minimal cut between groups

cut(C1, C2) = Σ_{i ∈ C1, j ∈ C2} w_ij

[Figure: the example graph split into C1 = {x1, x2, x3} and C2 = {x4, x5, x6}; cut(C1, C2) = w15 + w34 = 0.1 + 0.2 = 0.3]
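The cut value on the slide can be checked directly; the similarity matrix below is read off the six-vertex example graph, and the helper function is an illustrative sketch.

```python
import numpy as np

# Similarity matrix of the six-vertex example (rows/columns x1..x6)
W = np.array([[0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
              [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
              [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
              [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
              [0.0, 0.0, 0.0, 0.7, 0.8, 0.0]])

def cut_weight(W, C1, C2):
    """cut(C1, C2): total weight of edges with one endpoint in each group."""
    return W[np.ix_(C1, C2)].sum()

print(cut_weight(W, [0, 1, 2], [3, 4, 5]))   # 0.3, matching the slide
```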

Bi-partitional Cuts

• Minimum (bi-partitional) cut: find the partition (C1, C2) that minimizes cut(C1, C2)

Example

• Minimum Cut

[Figure: example graph on vertices A–E with edge weights between 0.08 and 1; the minimum cut separates the graph along its lowest-weight edges]

Normalized Cuts

• Minimum (bi-partitional) normalized cut:

Ncut(C1, C2) = cut(C1, C2) / Vol(C1) + cut(C1, C2) / Vol(C2)

where Vol(C) = Σ_{i ∈ C, j ∈ V} w_ij is the total weight of edges incident to the vertices in C
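Continuing the sketch above (reusing `W` and `cut_weight`), the normalized cut of the same partition can be computed as follows.

```python
def normalized_cut(W, C1, C2):
    """Ncut(C1, C2) = cut/Vol(C1) + cut/Vol(C2), Vol(C) = sum of w_ij over i in C, j in V."""
    c = cut_weight(W, C1, C2)
    return c / W[C1, :].sum() + c / W[C2, :].sum()

print(normalized_cut(W, [0, 1, 2], [3, 4, 5]))   # 0.3/4.7 + 0.3/4.9 ≈ 0.125
```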

Example

• Normalized Minimum Cut

[Figure: normalized minimum cut on the A–E example graph]

Problem

• Identifying a minimum normalized cut is NP-hard

• There are efficient approximations using linear algebra

• These are based on the Laplacian matrix, or graph Laplacian

Matrix Representations

• Similarity matrix (W)

– n × n matrix

– W_ij: edge weight between vertices x_i and x_j

W for the six-vertex example graph:

      x1    x2    x3    x4    x5    x6
x1    0     0.8   0.6   0     0.1   0
x2    0.8   0     0.8   0     0     0
x3    0.6   0.8   0     0.2   0     0
x4    0     0     0.2   0     0.8   0.7
x5    0.1   0     0     0.8   0     0.8
x6    0     0     0     0.7   0.8   0

• Important properties

– Symmetric matrix

Matrix Representations

• Degree matrix (D)

– n × n diagonal matrix

– D_ii = Σ_j w_ij: total weight of edges incident to vertex x_i

• Used to normalize the adjacency (similarity) matrix

D for the six-vertex example graph:

      x1    x2    x3    x4    x5    x6
x1    1.5   0     0     0     0     0
x2    0     1.6   0     0     0     0
x3    0     0     1.6   0     0     0
x4    0     0     0     1.7   0     0
x5    0     0     0     0     1.7   0
x6    0     0     0     0     0     1.5

Matrix Representations

• Laplacian matrix (L)

– n × n symmetric matrix

– L = D − W

L for the six-vertex example graph:

      x1    x2    x3    x4    x5    x6
x1    1.5  -0.8  -0.6   0    -0.1   0
x2   -0.8   1.6  -0.8   0     0     0
x3   -0.6  -0.8   1.6  -0.2   0     0
x4    0     0    -0.2   1.7  -0.8  -0.7
x5   -0.1   0     0    -0.8   1.7  -0.8
x6    0     0     0    -0.7  -0.8   1.5

• Important properties

– Eigenvalues are non-negative real numbers

– Eigenvectors are real and orthogonal

– The eigenvalues and eigenvectors provide insight into the connectivity of the graph
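Continuing with the example matrix `W` defined in the earlier sketch, the degree matrix and Laplacian can be built directly, and their eigenvalues confirm the properties listed above.

```python
import numpy as np

D = np.diag(W.sum(axis=1))              # degree matrix: diag(1.5, 1.6, 1.6, 1.7, 1.7, 1.5)
L = D - W                               # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)    # symmetric matrix: real eigenvalues, orthogonal eigenvectors
print(np.round(eigvals, 1))             # non-negative; ≈ [0.0, 0.4, 2.2, 2.3, 2.5, 3.0]
```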

Find an Optimal Min-Cut (Hall '70, Fiedler '73)

• Express a bi-partition (C1, C2) as a vector f with

f_i = +1 if x_i ∈ C1,  f_i = −1 if x_i ∈ C2

• We can minimize the cut of the partition by finding a non-trivial vector f that minimizes the function

g(f) = ½ Σ_{i,j ∈ V} w_ij (f_i − f_j)² = f^T L f,  where L is the Laplacian matrix
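For the example partition C1 = {x1, x2, x3}, C2 = {x4, x5, x6}, the quadratic form can be checked numerically (reusing `L` from the previous sketch); with ±1 entries, f^T L f equals 4·cut(C1, C2).

```python
f = np.array([1, 1, 1, -1, -1, -1])   # +1 for C1 = {x1,x2,x3}, -1 for C2 = {x4,x5,x6}
print(f @ L @ f)                      # 1.2 = 4 * cut(C1, C2) = 4 * 0.3
```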

Why Does This Work?

• How does the eigen-decomposition of L relate to clustering?

– In g(f) = f^T L f, the vector f encodes the cluster assignment and g(f) is the cluster objective function

• If we let f be a (unit-norm) eigenvector of L, then the corresponding eigenvalue equals the value of the cluster objective function

Optimal Min-Cut

• The Laplacian matrix L is positive semi-definite

• The Rayleigh theorem shows:

– The minimum non-trivial value of g(f) is given by the 2nd smallest eigenvalue λ2 of the Laplacian L

– The optimal solution for f is given by the corresponding eigenvector, referred to as the Fiedler vector

Spectral Bi-partitioning Algorithm

1. Pre-processing

– Build the Laplacian matrix L of the graph (the 6 × 6 matrix shown above)

2. Decomposition

– Find the eigenvalues Λ and eigenvectors X of the matrix L; for the example, Λ ≈ (0.0, 0.4, 2.2, 2.3, 2.5, 3.0)

– Map each vertex to the corresponding component of the eigenvector of λ2:

x1: 0.2,  x2: 0.2,  x3: 0.2,  x4: -0.4,  x5: -0.7,  x6: -0.7

Spectral Bi-partitioning Algorithm

• The columns of X are the eigenvectors of the Laplacian, ordered by increasing eigenvalue

• The first eigenvector (eigenvalue 0) is the constant vector, with every entry equal to 1/√6 ≈ 0.41

Spectral Bi-partitioning

• Grouping

– Sort the components of the reduced one-dimensional vector (the Fiedler vector)

– Identify clusters by splitting the sorted vector in two (above zero, below zero)

x1: 0.2,  x2: 0.2,  x3: 0.2,  x4: -0.4,  x5: -0.7,  x6: -0.7

– Split at 0

– Cluster C1 (positive points): {x1, x2, x3}

– Cluster C2 (negative points): {x4, x5, x6}
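Putting the steps together for the example graph (reusing `W`), a minimal sketch computes the Fiedler vector and splits at zero. Note that the sign of an eigenvector is arbitrary, so the two cluster labels may come out swapped.

```python
import numpy as np

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues returned in ascending order
fiedler = eigvecs[:, 1]                  # eigenvector of the 2nd smallest eigenvalue

C1 = np.where(fiedler > 0)[0]            # split at zero
C2 = np.where(fiedler <= 0)[0]
print(C1, C2)                            # {x1, x2, x3} and {x4, x5, x6} (0-indexed)
```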

Normalized Laplacian

• Two normalized versions of the Laplacian matrix:

– Random-walk normalized: L_rw = D^(-1) (D − W)

– Symmetric normalized: L_sym = D^(-0.5) (D − W) D^(-0.5)

[Table: normalized Laplacian of the six-vertex example graph]
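Both normalized variants follow directly from D and W; a sketch, reusing the example `W` from earlier.

```python
d = W.sum(axis=1)                      # vertex degrees
L = np.diag(d) - W                     # unnormalized Laplacian
L_rw = np.diag(1.0 / d) @ L            # D^(-1) (D - W), random-walk normalization
D_half = np.diag(1.0 / np.sqrt(d))
L_sym = D_half @ L @ D_half            # D^(-0.5) (D - W) D^(-0.5), symmetric normalization
```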

K-Way Spectral Clustering

• How do we partition a graph into k clusters?

1. Recursive bi-partitioning (Hagen et al., '91)

• Recursively apply the bi-partitioning algorithm in a hierarchical divisive manner

• Disadvantages: inefficient (O(n³)), unstable

2. Cluster multiple eigenvectors (Shi & Malik, '00)

• Build a reduced space from multiple eigenvectors

• Commonly used in recent papers

• A preferable approach

Eigenvectors & Eigenvalues


K-way Spectral Clustering Algorithm

• Pre-processing

– Compute Laplacian matrix L

• Decomposition

– Find the eigenvalues and eigenvectors of L

– Build embedded space from the eigenvectors corresponding to the k smallest eigenvalues

• Clustering

– Apply k -means to the reduced n x k space to produce k clusters

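A compact sketch of the three steps; scikit-learn's KMeans is assumed to be available for the last step, but any k-means implementation would do.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn is installed

def spectral_clustering(W, k):
    d = W.sum(axis=1)
    L = np.diag(d) - W                        # pre-processing: Laplacian matrix
    eigvals, eigvecs = np.linalg.eigh(L)      # decomposition
    U = eigvecs[:, :k]                        # n x k embedding from the k smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)   # clustering in the reduced space

print(spectral_clustering(W, 2))              # e.g. [0 0 0 1 1 1] for the example graph
```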

How to Select k?

• Eigengap: the difference between two consecutive eigenvalues

• The most stable clustering is generally given by the value of k that maximizes the eigengap

Δ_k = |λ_k − λ_{k−1}|

• In the example below, the maximum gap is Δ_2 = |λ_2 − λ_1|, so choose k = 2

[Figure: eigenvalue plot for an example dataset; the largest gap lies between λ1 and λ2, indicating k = 2]
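One common way to code the eigengap heuristic is sketched below (indexing conventions differ between presentations; this version picks the k after which the ascending eigenvalues jump).

```python
def choose_k(eigvals, k_max=10):
    """Return the k that maximizes the gap between consecutive (ascending) eigenvalues."""
    gaps = np.diff(eigvals[:k_max])
    return int(np.argmax(gaps)) + 1

eigvals = np.linalg.eigvalsh(np.diag(W.sum(axis=1)) - W)
print(choose_k(eigvals))    # 2 for the six-vertex example (eigenvalues ≈ 0.0, 0.4, 2.2, ...)
```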

Summary

• Clustering formulated as graph cut problem

• Solve min-cut by eigen decomposition of Laplacian matrix

• Bipartition and multi-partition spectral clustering procedure


Hierarchical Clustering

Jing Gao

SUNY Buffalo


Hierarchical Clustering

• Agglomerative approach (bottom-up)

– Initialization: each object is a cluster

– Iteration: merge the two clusters which are most similar to each other

– Until all objects are merged into a single cluster

[Figure: objects a, b, c, d, e merged step by step (Step 0 through Step 4)]

Hierarchical Clustering

• Divisive approach (top-down)

– Initialization: all objects stay in one cluster

– Iteration: select a cluster and split it into two sub-clusters

– Until each leaf cluster contains only one object

[Figure: objects a, b, c, d, e split step by step (Step 4 through Step 0)]

Dendrogram

• A tree that shows how clusters are merged/split hierarchically

• Each node on the tree is a cluster; each leaf node is a singleton cluster


Dendrogram

• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster


Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique

• Basic algorithm is straightforward

1. Compute the distance matrix

2. Let each data point be a cluster

3. Repeat

4.   Merge the two closest clusters

5.   Update the distance matrix

6. Until only a single cluster remains

• Key operation is the computation of the distance between two clusters

– Different approaches to defining the distance between clusters distinguish the different algorithms

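A minimal sketch of this algorithm using SciPy's hierarchical clustering routines (assumed to be available); the `method` argument selects the inter-cluster distance definition discussed in the following slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(12, 2)                        # toy data: 12 points in 2-D
dists = pdist(X)                                 # step 1: pairwise distances (condensed form)
Z = linkage(dists, method='single')              # steps 2-6: repeatedly merge the closest clusters
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
print(labels)
```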

Starting Situation

• Start with clusters of individual points and a distance matrix

[Figure: individual points p1, p2, ..., p12 and their pairwise distance matrix]

Intermediate Situation

• After some merging steps, we have some clusters (C1, ..., C5)

• Choose the two clusters that have the smallest distance (largest similarity) to merge

[Figure: current clusters C1–C5 and their distance matrix]

Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update the distance matrix

[Figure: clusters C1–C5 before merging C2 and C5]

After Merging

• The question is: "How do we update the distance matrix?"

[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the distances from C2 ∪ C5 to the other clusters are still unknown ("?")]

How to Define Inter-Cluster Distance?

• MIN

• MAX

• Group Average

• Distance Between Centroids

• ...

MIN or Single Link

• Inter-cluster distance

– The distance between two clusters is the distance of the closest pair of data objects belonging to different clusters

– Determined by one pair of points, i.e., by one link in the proximity graph

d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} d(p, q)

MIN

[Figure: nested single-link clusters for six points and the corresponding dendrogram]

Strength of MIN

• Can handle non-elliptical shapes

[Figure: original points and the two clusters found by single link]

Limitations of MIN

• Sensitive to noise and outliers

[Figure: original points and the two clusters found by single link]

MAX or Complete Link

• Inter-cluster distance

– The distance between two clusters is the distance of the farthest pair of data objects belonging to different clusters

d_max(C_i, C_j) = max_{p ∈ C_i, q ∈ C_j} d(p, q)

MAX

[Figure: nested complete-link clusters for six points and the corresponding dendrogram]

Strength of MAX

• Less susceptible to noise and outliers

[Figure: original points and the two clusters found by complete link]

Limitations of MAX

• Tends to break large clusters

[Figure: original points]

Limitations of MAX

• Biased towards globular clusters

[Figure: MIN (2 clusters) vs. MAX (2 clusters) on the same data]

Group Average or Average Link

• Inter-cluster distance

– The distance between two clusters is the average distance of all pairs of data objects belonging to different clusters

– Determined by all pairs of points in the two clusters

d_avg(C_i, C_j) = avg_{p ∈ C_i, q ∈ C_j} d(p, q)

Group Average

[Figure: nested average-link clusters for six points and the corresponding dendrogram]

Group Average

• Compromise between Single Link and Complete Link

• Strengths

– Less susceptible to noise and outliers

• Limitations

– Biased towards globular clusters


Centroid Distance

• Inter-cluster distance

– The distance between two clusters is the distance between the centers (centroids) of the clusters

– Determined by the cluster centroids m_i and m_j

d_mean(C_i, C_j) = d(m_i, m_j)
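The four inter-cluster distances above can be written as small NumPy helpers; an illustrative sketch in which `A` and `B` hold the points of the two clusters as row vectors.

```python
import numpy as np
from scipy.spatial.distance import cdist

def d_min(A, B):  return cdist(A, B).min()    # single link: closest pair
def d_max(A, B):  return cdist(A, B).max()    # complete link: farthest pair
def d_avg(A, B):  return cdist(A, B).mean()   # group average: mean over all pairs
def d_mean(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid distance

A = np.random.rand(5, 2); B = np.random.rand(4, 2)
print(d_min(A, B), d_max(A, B), d_avg(A, B), d_mean(A, B))
```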

Summary

• Agglomerative and divisive hierarchical clustering

• Several ways of defining inter-cluster distance

• The properties of the clusters produced by different approaches depend on which inter-cluster distance definition is used
