AMCS/CS229: Machine Learning
Clustering 2
Xiangliang Zhang
King Abdullah University of Science and Technology

Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary

The Quality of Clustering
• For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
• For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters?
• But "clusters are in the eye of the beholder"!
• Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters

Measures of Cluster Validity
Numerical measures applied to judge various aspects of cluster validity fall into two types:
• External index: measures the extent to which cluster labels match externally supplied class labels.
  - Purity, Normalized Mutual Information (NMI)
• Internal index: measures the goodness of a clustering structure without reference to external information.
  - Sum of Squared Error (SSE)
  - Cophenetic correlation coefficient, silhouette
See: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

Cluster Validity: External Index — Purity
The class labels are externally supplied (q classes). Larger purity values indicate better clustering solutions.
• Purity of a cluster $C_r$ of size $n_r$:
  $P(C_r) = \frac{1}{n_r} \max_i n_r^i$,
  where $n_r^i$ is the number of points of class $i$ assigned to cluster $C_r$.
• Purity of the entire clustering (k clusters, n points in total):
  $\mathrm{Purity}(C) = \sum_{r=1}^{k} \frac{n_r}{n} P(C_r)$  or  $\mathrm{Purity}(C) = \frac{1}{k} \sum_{r=1}^{k} P(C_r)$
• Example: 17 points in three clusters whose majority classes contain 5, 4 and 3 points respectively:
  $\mathrm{Purity} = \frac{1}{17}(5 + 4 + 3) = \frac{12}{17} \approx 0.71$

Cluster Validity: External Index — NMI (Normalized Mutual Information)
Larger NMI values indicate better clustering solutions.
$\mathrm{NMI}(C,T) = \frac{I(C,T)}{(H(C) + H(T))/2}$
where $I$ is the mutual information between the clustering $C$ and the class labels $T$, and $H$ is entropy:
$H(C) = -\sum_{r=1}^{k} \frac{|C_r|}{N} \log \frac{|C_r|}{N}$,  $H(T) = -\sum_{l=1}^{q} \frac{|T_l|}{N} \log \frac{|T_l|}{N}$,
$I(C,T) = \sum_{r=1}^{k} \sum_{l=1}^{q} \frac{|C_r \cap T_l|}{N} \log \frac{N\,|C_r \cap T_l|}{|C_r|\,|T_l|}$

Internal Measures: SSE
Internal index: measures the goodness of a clustering structure without reference to external information.
• SSE is good for comparing two clustering results: the average SSE, or SSE curves w.r.t. various K.
• It can also be used to estimate the number of clusters (look for the "elbow" of the SSE-vs-K curve).
[Figure: scatter plot of the data and the corresponding SSE-vs-K curve.]

Internal Measures: Cophenetic Correlation Coefficient
• The cophenetic correlation coefficient measures how faithfully a dendrogram preserves the pairwise distances between the original data points; it can also be used to compare two hierarchical clusterings of the same data.
[Figure: dendrogram over points A–F with merge heights 0.5, 0.71, 1.00, 1.41, 2.50.]
• Compute the correlation coefficient between the original distances (Dist) and the cophenetic distances (CP):
  $r_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$
Matlab functions: cophenet

Internal Measures: Cohesion and Separation
• Cluster cohesion measures how closely related the objects in a cluster are, e.g., the SSE or the sum of the weights of all links within the cluster.
• Cluster separation measures how distinct or well-separated a cluster is from other clusters, e.g., the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: graph illustrating within-cluster links (cohesion) and between-cluster links (separation).]
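The validity indices above (purity, NMI, the cophenetic correlation coefficient) are easy to compute with standard libraries. The following is a minimal sketch assuming Python with NumPy, scikit-learn and SciPy rather than the MATLAB functions named on the slides; the synthetic data set and the choice of k-means and single linkage are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of the cluster-validity indices discussed above.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Toy data with known class labels T (the "externally supplied" labels)
X, T = make_blobs(n_samples=300, centers=3, random_state=0)
C = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# --- External indices ---
# Purity: for each cluster take the count of its majority class, sum, divide by n.
cm = contingency_matrix(T, C)            # rows = classes, columns = clusters
purity = cm.max(axis=0).sum() / cm.sum()

# NMI with the (H(C)+H(T))/2 normalization shown on the slides.
nmi = normalized_mutual_info_score(T, C, average_method="arithmetic")

# --- Internal index: cophenetic correlation coefficient ---
# Correlation between the original pairwise distances (Dist) and the
# cophenetic distances (CP) induced by a hierarchical clustering.
Z = linkage(X, method="single")
ccc, _ = cophenet(Z, pdist(X))

print(f"purity = {purity:.3f}, NMI = {nmi:.3f}, cophenetic corr = {ccc:.3f}")
```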
Internal Measures: Silhouette Coefficient
• The silhouette coefficient combines the ideas of cohesion and separation.
• For an individual point i:
  - Calculate $a_i$ = the average distance of i to the points in its own cluster.
  - Calculate $b_i$ = the minimum, over the other clusters, of the average distance of i to the points in that cluster.
  - The silhouette coefficient of the point is then
    $s_i = 1 - \frac{a_i}{b_i}$ (when $a_i < b_i$), or equivalently $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
  - It is typically between 0 and 1; the closer to 1, the better.
• The average silhouette width can be computed for a cluster or for a whole clustering.
Matlab functions: silhouette

Determine the Number of Clusters by the Silhouette Coefficient
Compare different clusterings by their average silhouette values (a code sketch is given at the end of these notes), e.g.:
• K = 3: mean(silh) = 0.526
• K = 4: mean(silh) = 0.640
• K = 5: mean(silh) = 0.527
Here K = 4 gives the largest average silhouette and would be selected.

Determine the Number of Clusters
1. Select the number K of clusters as the one maximizing the average silhouette value over all points.
2. Optimize an objective criterion, e.g., the gap statistic on the decrease of SSE w.r.t. K.
3. Model-based methods: optimize a global criterion (e.g., the maximum likelihood of the data).
4. Use clustering methods that do not require K to be set, e.g., DBSCAN.
5. Prior knowledge.

Clustering vs. Classification

Problems and Challenges
• Considerable progress has been made in scalable clustering methods:
  - Partitioning: k-means, k-medoids, CLARANS
  - Hierarchical: BIRCH, ROCK, CHAMELEON
  - Density-based: DBSCAN, OPTICS, DenClue
  - Grid-based: STING, WaveCluster, CLIQUE
  - Model-based: EM, SOM
  - Spectral clustering
  - Affinity propagation
  - Frequent pattern-based: bi-clustering, pCluster
• Current clustering techniques do not address all the requirements adequately; this is still an active area of research.

Open Issues in Clustering
1. Clustering quality evaluation
2. How to decide the number of clusters?

What You Should Know
• What is clustering?
• How does k-means work?
• What is the difference between k-means and k-medoids?
• What is the EM algorithm? How does it work?
• What is the relationship between k-means and EM?
• How do you define inter-cluster similarity in hierarchical clustering? What kinds of options do you have?
• How does DBSCAN work?
• What are the advantages and disadvantages of DBSCAN?
• How do you evaluate clustering results?
• How do you usually decide the number of clusters?
• What are the main differences between clustering and classification?
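As referenced above, here is a minimal sketch of selecting K by the average silhouette value. It assumes Python with scikit-learn rather than the MATLAB `silhouette` function named on the slides, and the synthetic data set and candidate range of K are illustrative assumptions only.

```python
# Choose K by maximizing the mean silhouette over a range of candidate values.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all points
    print(f"K = {k}: mean silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Selected K = {best_k} (largest average silhouette)")
```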