Clustering Lecture 3: Hierarchical Methods Jing Gao

advertisement
Clustering
Lecture 3: Hierarchical Methods
Jing Gao
SUNY Buffalo
1
Outline
• Basics
– Motivation, definition, evaluation
• Methods
–
–
–
–
–
Partitional
Hierarchical
Density-based
Mixture model
Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering,
etc.
2
Hierarchical Clustering
• Agglomerative approach
Initialization:
Each object is a cluster
a
Iteration:
ab
Merge two clusters which are
b
abcde
c
cde
d
e
Step 0
most similar to each other;
Until all objects are merged
into a single cluster
de
Step 1
Step 2 Step 3 Step 4
bottom-up
3
Hierarchical Clustering
• Divisive Approaches
Initialization:
All objects stay in one cluster
Iteration:
a
Select a cluster and split it into
ab
b
abcde
c
cde
d
two sub clusters
Until each leaf cluster contains
only one object
de
e
Step 4
Step 3
Step 2 Step 1 Step 0
Top-down
4
Dendrogram
• A tree that shows how clusters are merged/split
hierarchically
• Each node on the tree is a cluster; each leaf node is a
singleton cluster
5
Dendrogram
• A clustering of the data objects is obtained by cutting
the dendrogram at the desired level, then each
connected component forms a cluster
6
Agglomerative Clustering Algorithm
•
More popular hierarchical clustering technique
•
Basic algorithm is straightforward
1.
2.
3.
4.
5.
6.
•
Compute the distance matrix
Let each data point be a cluster
Repeat
Merge the two closest clusters
Update the distance matrix
Until only a single cluster remains
Key operation is the computation of the distance between
two clusters
–
Different approaches to defining the distance between clusters
distinguish the different algorithms
7
Starting Situation
• Start with clusters of individual points and a distance matrix
p1 p2
p3 p4 p5
...
p1
p2
p3
p4
p5
.
.
Distance Matrix
.
...
p1
p2
p3
p4
p9
p10
p11
p12
8
Intermediate Situation
• After some merging steps, we have some clusters
• Choose two clusters that has the smallest C1 C2
C1
distance (largest similarity) to merge
C3
C4 C5
C2
C3
C4
C3
C4
C5
Distance Matrix
C1
C2
C5
...
p1
p2
p3
p4
p9
p10
p11
p12
9
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update
the distance matrix.
C1 C2 C3 C4 C5
C1
C2
C3
C4
C3
C4
C5
Distance Matrix
C1
C2
C5
...
p1
p2
p3
p4
p9
p10
p11
p12
10
After Merging
• The question is “How do we update the distance matrix?”
C1
C1
?
C2 U C5
C3
C4
C2
U
C5 C3
?
?
C3
?
C4
?
C4
?
?
Distance Matrix
C1
C2 U C5
...
p1
p2
p3
p4
p9
p10
p11
p12
11
How to Define Inter-Cluster Distance
p1 p2
Distance?
p3
p4 p5
...
p1
p2
p3
p4





MIN
MAX
Group Average
Distance Between Centroids
……
p5
.
.
Distance Matrix
.
12
MIN or Single Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the closest pair of data objects belonging to
different clusters.
– Determined by one pair of points, i.e., by one link in the
proximity graph
d min (Ci , C j )  min d ( p, q)
pCi , qC j
13
MIN
1
5
3
5
0.2
2
1
2
3
0.15
6
0.1
0.05
4
4
Nested Clusters
0
3
6
2
5
4
1
Dendrogram
14
Strength of MIN
Original Points
Two Clusters
• Can handle non-elliptical shapes
15
Limitations of MIN
Original Points
Two Clusters
• Sensitive to noise and outliers
16
MAX or Complete Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the farthest pair of data objects belonging to
different clusters
d min (Ci , C j )  max d ( p, q)
pCi , qC j
17
MAX
4
1
2
5
5
0.4
0.35
2
0.3
0.25
3
3
6
0.2
0.15
1
4
0.1
0.05
0
Nested Clusters
3
6
4
1
2
5
Dendrogram
18
Strength of MAX
Original Points
Two Clusters
• Less susceptible to noise and outliers
19
Limitations of MAX
•Tends to break large clusters
Original Points
20
Limitations of MAX
MIN (2 clusters)
MAX (2 clusters)
•Biased towards globular clusters
21
Group Average or Average Link
• Inter-cluster distance
– The distance between two clusters is represented by the
average distance of all pairs of data objects belonging to
different clusters
– Determined by all pairs of points in the two clusters
d min (Ci , C j )  avg d ( p, q)
pCi , qC j
22
Group Average
5
4
1
2
5
0.25
0.2
2
0.15
3
6
1
4
3
Nested Clusters
0.1
0.05
0
3
6
4
1
2
5
Dendrogram
23
Group Average
•
Compromise between Single and Complete
Link
•
Strengths
– Less susceptible to noise and outliers
•
Limitations
– Biased towards globular clusters
24
Centroid Distance
• Inter-cluster distance
– The distance between two clusters is represented by the
distance between the centers of the clusters
– Determined by cluster centroids


d mean (Ci , C j )  d (mi , m j )
25
Ward’s Method
• Similarity of two clusters is based on the increase
in squared error when two clusters are merged
– Similar to group average if distance between points is
squared distance
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of K-means
– Can be used to initialize K-means
26
Comparison
1
3
5
5
4
5
2
2
5
1
2
1
MIN
3
2
MAX
6
3
3
4
4
5
5
2
4
1
4
1
5
6
4
1
2
Ward’s Method
2
3
3
6
5
2
Group Average
3
1
4
6
1
4
3
27
Time and Space Requirements
• O(N2) space since it uses the distance matrix
– N is the number of points
• O(N3) time in many cases
– There are N steps and at each step the size, N2,
distance matrix must be updated and searched
– Complexity can be reduced to O(N2 log(N) ) time
for some approaches
28
Strengths
• Do not have to assume any particular number
of clusters
– Any desired number of clusters can be obtained
by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful
taxonomies
– e.g., shopping websites—electronics (computer,
camera, ..), furniture, groceries
29
Problems and Limitations
• Once a decision is made to combine two clusters,
it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or
more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and
irregular shapes
– Breaking large clusters
30
Take-away Message
• Agglomerative and divisive hierarchical clustering
• Several ways of defining inter-cluster distance
• The properties of clusters outputted by different
approaches based on different inter-cluster distance
definition
• Pros and cons of hierarchical clustering
31
Download