Cluster Validity

CSE 881: Data Mining
Lecture 20: Cluster Validation

Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  - Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters

[Figure: Clusters found in random data. Four scatter plots (x and y from 0 to 1): Random Points, DBSCAN, K-means, Complete Link.]

Measures of Cluster Validity
• Internal Index (Unsupervised): Used to measure the goodness of a clustering structure without respect to external information.
  - Sum of Squared Error (SSE)
• External Index (Supervised): Used to measure the extent to which cluster labels match externally supplied class labels.
  - Entropy
• Relative Index: Used to compare two different clusterings or clusters.
  - Often an external or internal index is used for this function, e.g., SSE or entropy

Unsupervised Cluster Validation
• Cluster evaluation based on proximity matrix
  - Correlation between proximity and incidence matrices
  - Visualize proximity matrix
• Cluster evaluation based on cohesion and separation

Measuring Cluster Validity Via Correlation
• Two matrices
  - Proximity matrix
  - “Incidence” matrix
    · One row and one column for each data point
    · An entry is 1 if the associated pair of points belongs to the same cluster
    · An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the proximity and incidence matrices
  - Since the matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated.
• A correlation of large magnitude indicates that points that belong to the same cluster are close to each other (the sign is negative when the proximity matrix holds distances, since same-cluster pairs then have incidence 1 and small distance).
• Not a good measure for some density-based or contiguity-based clusters

Measuring Cluster Validity Via Correlation
• Example: correlation of incidence and proximity matrices for the K-means clusterings of two data sets.
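A minimal sketch of the correlation computation described above, in plain Python. It assumes the proximity matrix holds Euclidean distances, so a good clustering should yield a strongly negative correlation. The function name `incidence_proximity_correlation` and the toy points are hypothetical, chosen only for illustration.

```python
import math
from itertools import combinations


def incidence_proximity_correlation(points, labels):
    """Pearson correlation between pairwise distances and same-cluster indicators.

    Only the n(n-1)/2 pairs above the diagonal are used, since both
    matrices are symmetric. Undefined (division by zero) if all pairs
    share one incidence value or all distances are equal.
    """
    dists, same = [], []
    for i, j in combinations(range(len(points)), 2):
        dists.append(math.dist(points[i], points[j]))        # proximity entry (a distance)
        same.append(1.0 if labels[i] == labels[j] else 0.0)  # incidence entry
    n = len(dists)
    mean_d = sum(dists) / n
    mean_s = sum(same) / n
    cov = sum((d - mean_d) * (s - mean_s) for d, s in zip(dists, same))
    var_d = sum((d - mean_d) ** 2 for d in dists)
    var_s = sum((s - mean_s) ** 2 for s in same)
    return cov / math.sqrt(var_d * var_s)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0.10, 0.10), (0.12, 0.08), (0.09, 0.13),
          (0.90, 0.90), (0.88, 0.93), (0.92, 0.89)]
labels = [0, 0, 0, 1, 1, 1]
corr = incidence_proximity_correlation(points, labels)
n_pairs = len(points) * (len(points) - 1) // 2
print(f"pairs used: {n_pairs}, corr = {corr:.4f}")
```

With a similarity matrix instead of a distance matrix, the same function would be expected to return a positive value for a good clustering.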
[Figure: the two data sets as scatter plots (x and y from 0 to 1). Left: Corr = -0.9235. Right: Corr = -0.5810.]

Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually
[Figure: scatter plot of 100 points (x and y from 0 to 1) beside the 100 × 100 similarity matrix (Points versus Points, similarity scale 0 to 1).]

Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: similarity matrix (Points versus Points, similarity scale 0 to 1) and scatter plot of random data clustered by DBSCAN.]

Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: similarity matrix (Points versus Points, similarity scale 0 to 1) and scatter plot of random data clustered by K-means.]

Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: similarity matrix (Points versus Points, similarity scale 0 to 1) and scatter plot of random data clustered by Complete Link.]

Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clustering with clusters labeled 1 to 7, beside the similarity matrix for about 3000 points (similarity scale 0 to 1).]

Internal Measures: SSE
• Internal Index: Used to measure the goodness of a clustering structure without respect to external information
  - SSE
• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters
[Figure: scatter plot of a sample data set beside a plot of SSE versus K, for K from 2 to 30.]

Internal Measures: SSE
• SSE curve for a more complicated data set
[Figure: data set with clusters labeled 1 to 7, and the SSE of clusters found using K-means.]

Unsupervised Cluster Validity Measure
• More generally, given K clusters:
  $\text{overall validity} = \sum_{i=1}^{K} w_i \, \text{validity}(C_i)$
  - validity(C_i): a function of cohesion, separation, or both
  - Weight w_i associated with each cluster i
  - For SSE:
    · $w_i = 1$
    · $\text{validity}(C_i) = \sum_{x \in C_i} (x - \mu_i)^2$, where $\mu_i$ is the centroid of cluster $C_i$

Internal Measures: Cohesion and Separation
• Cluster Cohesion:
  - Measures how closely related the objects in a cluster are
• Cluster Separation:
  - Measures how distinct or well-separated a cluster is from other clusters

Graph-based versus Prototype-based Views
[Figure not recovered.]

Graph-based View
• Cluster Cohesion: Measures how closely related the objects in a cluster are
  $\text{Cohesion}(C_i) = \sum_{x \in C_i,\, y \in C_i} \text{proximity}(x, y)$
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
  $\text{Separation}(C_i, C_j) = \sum_{x \in C_i,\, y \in C_j} \text{proximity}(x, y)$

Prototype-Based View
• Cluster Cohesion:
  $\text{Cohesion}(C_i) = \sum_{x \in C_i} \text{proximity}(x, c_i)$
  - Equivalent to SSE if proximity is the square of the Euclidean distance
• Cluster Separation:
  $\text{Separation}(C_i, C_j) = \text{proximity}(c_i, c_j)$
  $\text{Separation}(C_i) = \text{proximity}(c_i, c)$

Unsupervised Cluster Validity Measures
[Slide content not recovered.]

Prototype-based vs Graph-based Cohesion
• For SSE and points in Euclidean space, it can be shown that the average pairwise squared distance between points in a cluster is equivalent to the SSE of the cluster:
  $\sum_{x \in C_i} \text{dist}(x, c_i)^2 = \frac{1}{2 m_i} \sum_{x \in C_i} \sum_{y \in C_i} \text{dist}(x, y)^2$

Total Sum of Squares (TSS)
  $\text{TSS} = \sum_{x} \text{dist}(x, c)^2$
  $\text{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \text{dist}(x, c_i)^2$
  $\text{SSB} = \sum_{i=1}^{k} m_i \, \text{dist}(c_i, c)^2$
  - c: overall mean
  - c_i: centroid of each cluster C_i
  - m_i: number of points in cluster C_i
[Figure: three clusters with centroids c1, c2, c3 and overall mean c.]

Total Sum of Squares (TSS)
[Figure: number line with points at 1, 2, 4, and 5; overall mean m = 3; cluster means m1 = 1.5 and m2 = 4.5.]
• K = 1 cluster:
  $\text{TSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
  $\text{SSE} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
  $\text{SSB} = 4 \times (3-3)^2 = 0$
• K = 2 clusters:
  $\text{TSS} = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$
  $\text{SSE} = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$
  $\text{SSB} = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$
• TSS = SSE + SSB

Total Sum of Squares (TSS)
• TSS = SSE + SSB
• Given a data set, TSS is fixed
• A clustering with large SSE has small SSB, while one with small SSE has large SSB
• Goal is to minimize SSE and maximize SSB

Internal Measures: Silhouette Coefficient
• The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
• For an individual point i:
  - Calculate a = average distance of i to the points in its cluster
  - Calculate b = min (average distance of i to points in another cluster)
  - The silhouette coefficient for a point is then given by $s = 1 - a/b$ if $a < b$ (or $s = b/a - 1$ if $a \ge b$, not the usual case)
  - Typically between 0 and 1.
  - The closer to 1 the better.
• Can calculate the average silhouette width for a cluster or a clustering
[Figure: a point with its distances a (own cluster) and b (nearest other cluster).]

Unsupervised Evaluation of Hierarchical Clustering
[Figure: distance matrix and single link dendrogram (leaf order 3, 6, 2, 5, 4, 1; heights 0 to 0.2).]

Unsupervised Evaluation of Hierarchical Clustering
[Figure: single link dendrogram (leaf order 3, 6, 2, 5, 4, 1; heights 0 to 0.2) and the cophenetic distance matrix for single link.]
• CPCC (CoPhenetic Correlation Coefficient)
  - Correlation between the original distance matrix and the cophenetic distance matrix

Unsupervised Evaluation of Hierarchical Clustering
[Figure: single link dendrogram (leaf order 3, 6, 2, 5, 4, 1; heights 0 to 0.2) and complete link dendrogram (leaf order 3, 6, 4, 1, 2, 5; heights 0 to 0.4).]

Supervised Cluster Validation: Entropy and Purity
[Slide content not recovered.]

Supervised Cluster Validation: Precision and Recall
  - Cluster i contains m_i1 points of class 1 and m_i2 points of class 2
  - The overall data contain m_1 points of class 1 and m_2 points of class 2
• Precision for cluster i w.r.t. class j:
  $\frac{m_{ij}}{\sum_k m_{ik}}$
• Recall for cluster i w.r.t. class j:
  $\frac{m_{ij}}{\sum_k m_{kj}} = \frac{m_{ij}}{m_j}$

Supervised Cluster Validation: Hierarchical Clustering
• Hierarchical F-measure: [formula not recovered]

Framework for Cluster Validity
• Need a framework to interpret any measure.
  - For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity
  - The more “atypical” a clustering result is, the more likely it represents valid structure in the data
  - Can compare the values of an index that result from random data or clusterings to those of a clustering result.
    · If the value of the index is unlikely, then the cluster results are valid
• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
  - However, there is the question of whether the difference between two index values is significant

Statistical Framework for SSE
• Example: 2-d data with 100 points
[Figure: scatter plot of the 100 points (x and y from 0 to 1).]
  - Suppose a clustering algorithm produces SSE = 0.005
  - Does it mean that the clusters are statistically significant?

Statistical Framework for SSE
  - Generate 500 sets of random data points of size 100 distributed over the range 0.2 to 0.8 for x and y values
  - Perform clustering with k = 3 clusters for each data set
  - Plot histogram of SSE (compare with the value 0.005)
[Figure: one random data set (x and y from 0 to 1) beside a histogram of SSE (Count versus SSE, SSE from 0.016 to 0.034).]

Statistical Framework for Correlation
• Example: correlation of incidence and proximity matrices for the K-means clusterings of two data sets.
[Figure: the two data sets as scatter plots (x and y from 0 to 1). Left: Corr = -0.9235 (statistically significant). Right: Corr = -0.5810 (not statistically significant).]

Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes
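As a numerical check of the TSS = SSE + SSB worked example from the Total Sum of Squares slides (points 1, 2, 4, 5), a short sketch in Python. It handles only 1-D data, and `sum_of_squares` is a hypothetical helper name.

```python
def sum_of_squares(clusters):
    """Return (TSS, SSE, SSB) for a list of 1-D clusters."""
    all_points = [x for cluster in clusters for x in cluster]
    c = sum(all_points) / len(all_points)          # overall mean
    tss = sum((x - c) ** 2 for x in all_points)
    sse = 0.0
    ssb = 0.0
    for cluster in clusters:
        c_i = sum(cluster) / len(cluster)          # centroid of this cluster
        sse += sum((x - c_i) ** 2 for x in cluster)
        ssb += len(cluster) * (c_i - c) ** 2       # m_i * dist(c_i, c)^2
    return tss, sse, ssb


# K = 1 and K = 2 clusterings of the slide's four points.
for clusters in ([[1, 2, 4, 5]], [[1, 2], [4, 5]]):
    tss, sse, ssb = sum_of_squares(clusters)
    print(f"K={len(clusters)}: TSS={tss}, SSE={sse}, SSB={ssb}")
```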
