AMCS/CS229: Machine Learning
Clustering 2
Xiangliang Zhang
King Abdullah University of Science and Technology
Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning
The quality of Clustering
• For supervised classification we have a variety of measures
to evaluate how good our model is
– Accuracy, precision, recall
• For cluster analysis, the analogous question is: how do we
evaluate the “goodness” of the resulting clusters?
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
 To avoid finding patterns in noise
 To compare clustering algorithms
 To compare two sets of clusters
 To compare two clusters
Measures of Cluster Validity
Numerical measures applied to judge various aspects of
cluster validity are classified into the following two types:
 External Index: Used to measure the extent to which
cluster labels match externally supplied class labels.
• Purity, Normalized Mutual Information
 Internal Index: Used to measure the goodness of a
clustering structure without respect to external
information.
• Sum of Squared Error (SSE)
• Cophenetic correlation coefficient, silhouette
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
Cluster Validity: External Index
The class labels are externally supplied (q classes)
Purity:
Larger purity values indicate better clustering solutions.
• Purity of each cluster Cr of size nr
  P(C_r) = \frac{1}{n_r} \max_i n_r^i
where n_r^i is the number of points in C_r that belong to class i.
• Purity of the entire clustering
  Purity(C) = \sum_{r=1}^{k} \frac{n_r}{n} P(C_r) \quad \text{or} \quad Purity(C) = \frac{1}{k} \sum_{r=1}^{k} P(C_r)
Cluster Validity: External Index
Purity:

[Figure: example clustering of 17 points into 3 clusters; the majority class in each cluster has 5, 4, and 3 points respectively]

  Purity(C) = \sum_{r=1}^{k} \frac{n_r}{n} P(C_r) = \frac{1}{n} \sum_{r=1}^{k} \max_i n_r^i

  Purity = \frac{1}{17} \times (5 + 4 + 3) = \frac{12}{17}
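The purity computation above can be sketched in a few lines of Python. The contingency counts below are made-up toy numbers, chosen so the majority classes have sizes 5, 4, and 3 over 17 points, matching the worked example:

```python
# Toy contingency table (assumed numbers): counts[r][i] is the number of
# points of class i that fall in cluster r. Row maxima are 5, 4, 3 as in
# the worked example above.
counts = [
    [5, 1, 0],   # cluster 1, size 6, majority class has 5 points
    [1, 4, 1],   # cluster 2, size 6, majority class has 4 points
    [2, 0, 3],   # cluster 3, size 5, majority class has 3 points
]

n = sum(sum(row) for row in counts)   # total number of points: 17

# Purity of each cluster: fraction of its points in its majority class.
cluster_purity = [max(row) / sum(row) for row in counts]

# Purity of the whole clustering, weighted by cluster size:
# sum_r (n_r / n) * P(C_r) = (1/n) * sum_r max_i n_r^i.
purity = sum(max(row) for row in counts) / n
print(purity)  # 12/17 ≈ 0.706
```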
Cluster Validity: External Index
The class labels are externally supplied (q classes)
NMI (Normalized Mutual Information):

  NMI(C,T) = \frac{I(C,T)}{(H(C) + H(T))/2}

where I is the mutual information and H is the entropy:

  H(C) = -\sum_{r=1}^{k} \frac{|C_r|}{N} \log \frac{|C_r|}{N}

  H(T) = -\sum_{l=1}^{q} \frac{|T_l|}{N} \log \frac{|T_l|}{N}
Cluster Validity: External Index
NMI (Normalized Mutual Information) :
Larger NMI values indicate better clustering solutions.
  NMI(X,Y) = \frac{I(X,Y)}{(H(X) + H(Y))/2}
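A minimal sketch of the NMI computation from a cluster/class contingency table, following the definitions above (the counts are hypothetical toy numbers):

```python
from math import log

# counts[r][l] = number of points in cluster r with class label l (toy data).
counts = [
    [5, 1, 0],
    [1, 4, 1],
    [2, 0, 3],
]
N = sum(sum(row) for row in counts)
cluster_sizes = [sum(row) for row in counts]           # |C_r|
class_sizes = [sum(col) for col in zip(*counts)]       # |T_l|

def entropy(sizes, total):
    # H = -sum p log p over the size distribution.
    return -sum(s / total * log(s / total) for s in sizes if s > 0)

H_C = entropy(cluster_sizes, N)
H_T = entropy(class_sizes, N)

# Mutual information I(C, T) from the joint and marginal distributions.
I_CT = 0.0
for r, row in enumerate(counts):
    for l, n_rl in enumerate(row):
        if n_rl > 0:
            I_CT += (n_rl / N) * log(N * n_rl / (cluster_sizes[r] * class_sizes[l]))

NMI = I_CT / ((H_C + H_T) / 2)
print(round(NMI, 3))
```

Since I(C,T) ≤ min(H(C), H(T)), normalizing by the average entropy keeps NMI in [0, 1].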
Internal Measures: SSE
Internal Index: Used to measure the goodness of a
clustering structure without respect to external information
SSE is good for comparing two clustering results
• average SSE
• SSE curves w.r.t. various K
Can also be used to estimate the number of clusters
[Figure: left, scatter plot of the data; right, curve of SSE as a function of K]
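As a concrete sketch (with made-up points and cluster assignments), SSE is the sum of squared distances from each point to its cluster centroid:

```python
# Toy data and assignments, purely for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (8.5, 8.0)]
labels = [0, 0, 1, 1, 1]
K = 2

def centroid(pts):
    # Component-wise mean of a list of points.
    return tuple(sum(c) / len(pts) for c in zip(*pts))

centroids = [centroid([p for p, l in zip(points, labels) if l == k])
             for k in range(K)]

# SSE: sum over points of the squared distance to the assigned centroid.
sse = sum(
    sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[l]))
    for p, l in zip(points, labels)
)
print(sse)
```

Repeating this for several values of K and plotting the resulting SSE values gives the SSE-vs-K curve shown in the figure.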
Internal Measures: Cophenetic correlation coefficient
Cophenetic correlation coefficient:
 a measure of how faithfully a dendrogram preserves the
pairwise distances between the original data points.
 Compare two hierarchical clusterings of the data
[Figure: dendrogram over points A, B, C, D, E, F with merge heights 0.5, 0.71, 1.00, 1.41, 2.50]

Compute the correlation coefficient between Dist (the original
pairwise distances) and CP (the cophenetic distances):

  r_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}
Matlab functions: cophenet
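The coefficient is just the Pearson correlation between the two lists of pairwise distances. A minimal sketch (the Dist and CP values below are hypothetical; in Matlab, `cophenet` computes both the cophenetic distances and the correlation):

```python
from math import sqrt

# Hypothetical pairwise distances between points (Dist) and the heights at
# which each pair is first merged in the dendrogram (CP).
dist = [0.5, 1.2, 2.4, 1.0, 2.5, 2.2]
cp   = [0.5, 1.41, 2.5, 1.0, 2.5, 2.5]

def pearson(x, y):
    # Sample Pearson correlation: cov(x, y) / (std(x) * std(y)).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

c = pearson(dist, cp)
print(round(c, 3))  # close to 1 → the dendrogram preserves the distances well
```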
Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Internal Measures: Cohesion and Separation
• Cluster cohesion measures how closely related the objects in a
cluster are
= SSE, or the sum of the weights of all links within a cluster.
• Cluster separation measures how distinct or well-separated a
cluster is from other clusters
= the sum of the weights of links between nodes in the cluster and nodes
outside the cluster.

[Figure: graph with within-cluster links illustrating cohesion and between-cluster links illustrating separation]
Internal Measures: Silhouette Coefficient
• Silhouette Coefficient combines ideas of both cohesion and
separation
• For an individual point i:
 Calculate a_i = average distance of i to the points in its own cluster
 Calculate b_i = min (average distance of i to the points in another cluster)
 The silhouette coefficient for the point is then given by

  S_i = 1 - \frac{a_i}{b_i} \quad \text{or} \quad S_i = \frac{b_i - a_i}{\max(a_i, b_i)}
o Typically between 0 and 1.
o The closer to 1 the better.
• Can calculate the Average Silhouette width for a cluster or a
clustering
Matlab functions: silhouette
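The per-point definition above can be sketched directly (the points and labels below are toy data; in Matlab the built-in `silhouette` function does this):

```python
from math import dist  # Euclidean distance, Python 3.8+

# Toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 0, 1, 1]

def silhouette(i, points, labels):
    # a: average distance from point i to the other points in its cluster.
    own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
    a = sum(dist(points[i], p) for p in own) / len(own)
    # b: smallest average distance from point i to the points of any other cluster.
    b = min(
        sum(dist(points[i], p) for j, p in enumerate(points) if labels[j] == k)
        / labels.count(k)
        for k in set(labels) if k != labels[i]
    )
    return (b - a) / max(a, b)

scores = [silhouette(i, points, labels) for i in range(len(points))]
print(sum(scores) / len(scores))  # average silhouette width of the clustering
```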
Determine number of clusters by Silhouette Coefficient
Compare different clusterings by their average silhouette values:
• K=3: mean(silh) = 0.526
• K=4: mean(silh) = 0.640
• K=5: mean(silh) = 0.527
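Given the average silhouette value for each candidate K (the numbers from the comparison above), picking K is a one-line argmax:

```python
# Average silhouette value per candidate K, taken from the comparison above.
mean_silh = {3: 0.526, 4: 0.640, 5: 0.527}

# Choose the K whose clustering has the largest average silhouette value.
best_k = max(mean_silh, key=mean_silh.get)
print(best_k)  # 4
```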
Determine the number of clusters
1. Select the number K of clusters as the one maximizing the
average silhouette value of all points
2. Optimizing an objective criterion
– Gap statistic on the decrease of SSE w.r.t. K
3. Model-based methods:
• optimizing a global criterion (e.g., the maximum likelihood of the data)
4. Try clustering methods that do not require setting K,
e.g., DBSCAN
5. Prior knowledge…
Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Clustering VS Classification
Problems and Challenges
• Considerable progress has been made in scalable clustering
methods
 Partitioning: k-means, k-medoids, CLARANS
 Hierarchical: BIRCH, ROCK, CHAMELEON
 Density-based: DBSCAN, OPTICS, DenClue
 Grid-based: STING, WaveCluster, CLIQUE
 Model-based: EM, SOM
 Spectral clustering
 Affinity Propagation
 Frequent pattern-based: Bi-clustering, pCluster
• Current clustering techniques do not address all of these
requirements adequately; clustering remains an active area of research
Cluster Analysis
Open issues in clustering
1. Clustering quality evaluation
2. How to decide the number of clusters?
What you should know
• What is clustering?
• How does k-means work?
• What is the difference between k-means and k-medoids?
• What is the EM algorithm? How does it work?
• What is the relationship between k-means and EM?
• How to define inter-cluster similarity in hierarchical
clustering? What options do you have?
• How does DBSCAN work?
What you should know
• What are the advantages and disadvantages of DBSCAN?
• How to evaluate the clustering results?
• How to decide the number of clusters in practice?
• What are the main differences between clustering and
classification?