Fourth German Stata Users Group Meeting
New Tools for Evaluating the Results of Cluster Analyses
Hilde Schaeper
Higher Education Information System (HIS), Hannover/Germany
schaeper@his.de
Mannheim, March 31st, 2006

Main features of cluster analysis

Basic idea
- to form groups of similar objects (observations or variables) such that the classification objects are homogeneous within groups/clusters and heterogeneous between clusters

Type of analysis
- heuristic tool of discovery lacking an underlying coherent body of statistical theory

Range of methods
- cluster analysis is a family of more or less closely related techniques

Steps and decisions in cluster analysis

I    Selection of a sample (outliers may influence the results)
II   Selection and transformation of variables (irrelevant and correlated variables can bias the classification; cluster analysis requires the variables to have equal scales)
III  Choice of the basic approach (in particular: agglomerative hierarchical vs. partitioning cluster analysis)
IV   Choice of a particular clustering technique
V    Selection of a dissimilarity or similarity measure (depends partly on the measurement level of the variables and the clustering technique chosen)
VI   Choice of the initial partition in the case of partitioning methods
VII  Evaluation and validation (number of clusters, interpretation, stability, validity)

Criteria for a good classification

Interpretability
- Clusters should be substantively interpretable.

Internal validity (internal homogeneity and external heterogeneity)
- Objects that belong to the same cluster should be similar.
- Objects of different clusters should be different.
- The clusters should be well isolated from each other.
- The classification should fit the data and should be able to explain the variation in the data.
Reasonable number and size of clusters (additional)
- The number of clusters should be as small as possible.
- The size of the clusters should not be too small.

Stability
- Small modifications in data and methods should not change the results.

Criteria for a good classification (cont.)

External validity
- Clusters should correlate with external variables that are known to be correlated with the classification and that are not used for clustering.

Relative validity
- The classification should be better than the null model, which assumes that no clusters are present.
- The classification should be better than other classifications.

Tools for decision making and evaluation
- Tools for determining the number of clusters
- Tools for testing the stability of a classification
- (Tools for assessing the internal validity of a classification)

Determining the number of clusters: hierarchical methods

(Visual) inspection of the fusion/agglomeration levels
- dendrogram (official Stata program)
- scree diagram (easy to produce)
- agglomeration schedule (new program)

Determining the number of clusters: agglomeration schedule

Syntax
    cluster stop [clname], rule(schedule) [laststeps(#)]

Description
    cluster stop, rule(schedule) displays the agglomeration schedule of a hierarchical agglomerative cluster analysis and computes the differences between the stages of the clustering process.

Additional options
    laststeps(#) specifies the number of steps to be displayed.
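The increase column of the schedule is plain differencing of successive fusion levels. The bookkeeping can be illustrated with a short Python sketch (illustrative only, not the actual Stata program; `heights` is a hypothetical list of agglomeration levels in merge order):

```python
def agglomeration_schedule(heights):
    """Turn agglomeration levels (one value per merge, in merge order;
    n observations give n-1 merges) into rows of
    (stage, number of clusters, fusion value, increase),
    listed from the last merge downwards as in the command's output."""
    n = len(heights) + 1                      # number of observations
    rows = []
    for stage in range(len(heights), 0, -1):  # last merge first
        value = heights[stage - 1]
        # increase relative to the previous (finer) stage; a sharp jump
        # suggests stopping before that merge
        increase = value - heights[stage - 2] if stage >= 2 else None
        rows.append((stage, n - stage, value, increase))
    return rows
```

Reading the rows from the top, the number of clusters before the first conspicuous increase is a candidate solution.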
Determining the number of clusters: agglomeration schedule

Example: cluster analysis of 799 observations, using Ward's linkage and squared Euclidean distances

    cluster stop ward, rule(schedule) last(15)

            Number of       Fusion
    Stage    clusters        value     Increase
    -------------------------------------------
      798           1    1529.7205     834.5939
      797           2     695.1265      15.2987
      796           3     679.8278     414.1430
      795           4     265.6848      60.3970
      794           5     205.2878      32.0320
      793           6     173.2559      12.1593
      792           7     161.0966      22.5605
      791           8     138.5361      29.6152
      790           9     108.9209       3.4233
      789          10     105.4976      14.2701
      788          11      91.2275       6.7869
      787          12      84.4405       2.2950
      786          13      82.1455       1.5409
      785          14      80.6046      14.8871
      784          15      65.7175       3.2681

Determining the number of clusters: dendrogram

[Dendrogram of the Ward's-linkage cluster analysis]

Determining the number of clusters: hierarchical methods

(Visual) inspection of the fusion/agglomeration levels
- dendrogram (official Stata program)
- scree diagram (easy to produce)
- agglomeration schedule (new program)

Statistical measures/tests for the number of clusters
- Duda and Hart's stopping rule / Caliński and Harabasz's stopping rule (official Stata program)
- Mojena's stopping rules (new program)

Determining the number of clusters: Stata's stopping rules

The Caliński and Harabasz index

    +---------------------------+
    |             |  Calinski/  |
    |  Number of  |   Harabasz  |
    |   clusters  |   pseudo-F  |
    |-------------+-------------|
    |           2 |      277.39 |
    |           3 |      239.32 |
    |           4 |      254.86 |
    |           5 |      228.46 |
    |           6 |      210.01 |
    |           7 |      197.16 |
    |           8 |      189.27 |
    |           9 |      183.03 |
    |          10 |      176.34 |
    |          11 |      171.95 |
    |          12 |      167.85 |
    |          13 |      164.64 |
    |          14 |      162.64 |
    |          15 |      161.78 |
    +---------------------------+

The Duda and Hart index

    +--------------------------------------+
    |          |         Duda/Hart         |
    |   Number |-------------+-------------|
    |       of |             |     pseudo  |
    | clusters | Je(2)/Je(1) |   T-squared |
    |----------+-------------+-------------|
    |        1 |      0.7418 |      277.39 |
    |        2 |      0.7094 |      192.14 |
    |        3 |      0.6606 |      167.46 |
    |        4 |      0.6393 |       91.42 |
    |        5 |      0.7744 |       76.93 |
    |        6 |      0.7798 |       57.31 |
    |        7 |      0.7183 |       72.95 |
    |        8 |      0.7640 |       50.05 |
    |        9 |      0.5660 |       81.29 |
    |       10 |      0.7380 |       42.61 |
    |       11 |      0.5678 |       61.64 |
    |       12 |      0.7669 |       42.26 |
    |       13 |      0.6579 |       41.09 |
    |       14 |      0.7274 |       36.73 |
    |       15 |      0.5697 |       46.82 |
    +--------------------------------------+

Determining the number of clusters: Mojena's stopping rules

Model I
- assumes that the agglomeration levels are normally distributed with a particular mean and standard deviation
- tests at level k whether level k+1 comes from the aforementioned distribution
- suggests the choice of the k-cluster solution when the null hypothesis has to be rejected for the first time (i.e. when a sharp increase/decrease of the fusion levels occurs)

Model I modified
- assumes that the agglomeration levels up to level k are normally distributed

Model II
- assumes that the agglomeration levels up to step k can be described by a linear regression line
- tests at level k whether the fusion value of level k+1 equals the predicted value
- suggests setting the number of clusters equal to k when the null hypothesis has to be rejected for the first time

Determining the number of clusters: Mojena's stopping rules

Syntax
    cluster stop [clname], rule(mojena) [laststeps(#) m1only]

Description
    cluster stop, rule(mojena) calculates Mojena's test statistics (Mojena I, Mojena I modified, and Mojena II) for determining the number of clusters after hierarchical agglomerative clustering, together with the corresponding significance levels.

Additional options
    laststeps(#) specifies the number of steps to be displayed.
    m1only suppresses the calculation of Mojena I modified and Mojena II.
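The idea behind Model I can be illustrated with a small sketch: standardize each fusion level against the mean and standard deviation of all levels and flag the first sharp jump. This is a simplified Python illustration only, not the cluster stop, rule(mojena) implementation (whose exact t and p computations are not reproduced here); the critical value 1.25 is one of several thresholds discussed in the literature:

```python
from statistics import mean, stdev

def mojena_z(heights):
    """Standardized agglomeration levels (the idea behind Mojena's
    Model I): z_k = (alpha_k - mean(alpha)) / sd(alpha).
    A large z at a merge signals an unusually high fusion level."""
    m, s = mean(heights), stdev(heights)
    return [(h - m) / s for h in heights]

def suggested_stage(heights, crit=1.25):
    """First merge (1-based stage) whose standardized level exceeds the
    critical value, or None if no level does.  crit=1.25 is one value
    proposed in the literature; Mojena originally discussed larger ones."""
    for stage, z in enumerate(mojena_z(heights), start=1):
        if z > crit:
            return stage
    return None
```

If the first flagged merge is stage j, the solution just before that merge is the candidate classification.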
Determining the number of clusters: Mojena's stopping rules

    cluster stop ward, rule(mojena) last(15)

            No. of      Mojena I        Mojena I mod.       Mojena II
    Stage  clusters     t        p        t        p        t        p
    ------------------------------------------------------------------
      798         1     .        .        .        .        .        .
      797         2  22.9003  0.0000  39.2306  0.0000  38.8261  0.0000
      796         3  10.3453  0.0000  22.8581  0.0000  22.4229  0.0000
      795         4  10.1152  0.0000  36.7300  0.0000  36.1526  0.0000
      794         5   3.8851  0.0001  16.4988  0.0000  15.8908  0.0000
      793         6   2.9765  0.0015  14.2385  0.0000  13.6099  0.0000
      792         7   2.4946  0.0064  13.2516  0.0000  12.6058  0.0000
      791         8   2.3117  0.0105  13.6952  0.0000  13.0275  0.0000
      790         9   1.9723  0.0245  12.9355  0.0000  12.2483  0.0000
      789        10   1.5268  0.0636  10.8525  0.0000  10.1556  0.0000
      788        11   1.4753  0.0703  11.3345  0.0000  10.6247  0.0000
      787        12   1.2607  0.1039  10.4254  0.0000   9.7065  0.0000
      786        13   1.1586  0.1235  10.2615  0.0000   9.5338  0.0000
      785        14   1.1240  0.1307  10.6825  0.0000   9.9431  0.0000
      784        15   1.1009  0.1356  11.3061  0.0000  10.5505  0.0000

Determining the number of clusters: partitioning methods

Measures using the error sum of squares
- Explained variance (Eta2): specifies to what extent a particular solution improves on the one-cluster solution (new program)
- Proportional reduction of error (PRE): compares a k-cluster solution with the previous (k-1)-cluster solution (new program)
- F-max statistic: corrects for the fact that more clusters automatically result in a higher explained variance (new program)
- Beale's F statistic: tests the null hypothesis that a solution with k clusters is not improved by a solution with more clusters; a conservative test that provides convincing results only if the clusters are well separated (new program)
- Caliński and Harabasz's stopping rule (official Stata program)

Determining the number of clusters: Stata's stopping rule

Example: cluster analysis of 799 observations, using the kmeans partition method
and squared Euclidean distances (the Calinski/Harabasz pseudo-F values from the separate cluster stop calls for the six kmeans solutions are collected here into one table)

    +---------------------------+
    |             |  Calinski/  |
    |  Number of  |   Harabasz  |
    |   clusters  |   pseudo-F  |
    |-------------+-------------|
    |           3 |      322.13 |
    |           4 |      274.31 |
    |           5 |      279.87 |
    |           6 |      252.44 |
    |           7 |      228.73 |
    |           8 |      210.20 |
    +---------------------------+

Determining the number of clusters: Eta2, PRE, F-max, Beale's F

Syntax
    clnumber varlist, maxclus(#) [kmeans_options]

Description
    clnumber performs kmeans cluster analyses with the variables specified in varlist and computes Eta2, the PRE coefficient, the F-max statistic, Beale's F values, and the corresponding p-values.

Options
    maxclus(#) is required and specifies the maximum number of clusters for which cluster analyses are performed. maxclus(4), for example, requests cluster analyses for two, three, and four clusters.
    kmeans_options specify the options allowed with kmeans cluster analysis, except for k(#) and start(group(varname)).
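The three SSE-based measures reduce to simple arithmetic on the within-cluster error sums of squares. A Python sketch of the formulas as commonly defined (illustrative only; `sse` and `n` are hypothetical inputs, with `sse[0]` the one-cluster, i.e. total, SSE):

```python
def eta2_pre_fmax(sse, n):
    """sse[k-1] = within-cluster error sum of squares of the k-cluster
    solution (sse[0] is the one-cluster, i.e. total, SSE); n = number
    of observations.  Returns parallel lists indexed by k-1."""
    eta2, pre, fmax = [0.0], [None], [None]
    for i in range(1, len(sse)):
        k = i + 1                        # number of clusters
        e = 1 - sse[i] / sse[0]          # explained variance (Eta2)
        eta2.append(e)
        # proportional reduction of error vs. the (k-1)-cluster solution
        pre.append((sse[i - 1] - sse[i]) / sse[i - 1])
        # F-max penalizes the extra clusters (pseudo-F form)
        fmax.append((e / (k - 1)) / ((1 - e) / (n - k)))
    return eta2, pre, fmax
```

Note that for k = 2 the PRE coefficient equals Eta2, and F-max in this pseudo-F form coincides with the Caliński/Harabasz index, which is consistent with the clnumber output shown below.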
Determining the number of clusters: Eta2, PRE, F-max, Beale's F

    clnumber v1-v7, max(8) start(prandom(154698))

First part of the output

Eta square, PRE coefficient, F-max value

    A[8,3]
                 Eta2         Pre        F-max
    cl_1            0           .            .
    cl_2    .27878797   .27878797    308.08417
    cl_3    .44732155   .23368104    322.12939
    cl_4    .50863156   .11093251    274.31017
    cl_5    .58504803   .15551767    279.86862
    cl_6    .61414929   .07013162     252.4398
    cl_7    .63407795   .05164863    228.73256
    cl_8    .65036945   .04452178     210.1983

Determining the number of clusters: Eta2, PRE, F-max, Beale's F

Second part of the output

Upper triangle: Beale's F statistic; lower triangle: probability

    B[8,8]
               c1         c2         c3         c4         c5         c6         c7         c8
    r1          0  1.7527399  2.1746911  2.1056324  2.3824281  2.3440412  2.2895238  2.2479901
    r2  .09228984          0   2.454542  2.1062746  2.4264587  2.3137652  2.2097109  2.1372527
    r3   .0067297  .01637451          0  1.4336405  2.0737441  1.9334281  1.8205698  1.7502537
    r4  .00225687  .00910208  .18682639          0  2.7415018  2.1763191  1.9278474  1.8003233
    r5  .00005713  .00028111  .01048918  .00768272          0  1.3762712  1.2922468   1.261865
    r6  .00001341  .00010317  .00641878  .00668153  .21053567          0  1.1750857  1.1717306
    r7  4.554e-06  .00005409  .00517453  .00663361  .20309237   .3133238          0  1.1590454
    r8  1.600e-06  .00002861  .00408524  .00598863  .18868905  .28969417  .32290457          0

Testing the stability of a classification

Stability
- is a precondition of validity
- refers to the property of a cluster solution that it is not affected by small modifications of data and methods
- can be measured by comparing two classifications and computing the proportion of consistent allocations

Testing the stability of a classification: the Rand index

Original Rand index (Rand 1971)
- ranges between 0 and 1, with 1 = perfect agreement
- values greater than 0.7 are considered sufficient

Adjusted Rand index (Hubert & Arabie 1985)
- accounts for chance agreement
- offers a solution to the problem that the expected value of the Rand index does not take a constant value
- has a maximum value of 1 and an expected value of zero if the classifications are selected randomly
- usually yields much smaller values than the Rand index

Testing the stability of a classification: the Rand index

Syntax
    clrand groupvar1 groupvar2

Description
    clrand compares two classifications with respect to the (in)consistency of the assignments of the classification objects to clusters and computes the Rand index and the adjusted Rand index proposed by Hubert & Arabie. The command requires the specification of two grouping variables obtained from previous cluster analyses.
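Both indices can be computed from the contingency table of the two classifications. A self-contained Python sketch of the standard formulas (illustrative only; the clrand program itself works on Stata grouping variables):

```python
from collections import Counter
from math import comb

def rand_indices(labels1, labels2):
    """Rand index and Hubert & Arabie's adjusted Rand index for two
    cluster assignments of the same objects (assumes neither partition
    is degenerate, i.e. the ARI denominator is nonzero)."""
    n = len(labels1)
    pairs = comb(n, 2)
    # contingency table of joint cluster memberships
    cells = Counter(zip(labels1, labels2))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in Counter(labels1).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels2).values())
    # Rand index: share of object pairs treated consistently
    rand = 1 + (2 * sum_cells - sum_rows - sum_cols) / pairs
    # adjusted Rand index: corrected for chance agreement
    expected = sum_rows * sum_cols / pairs
    max_index = (sum_rows + sum_cols) / 2
    ari = (sum_cells - expected) / (max_index - expected)
    return rand, ari
```

Relabeling the clusters does not change either index, which is why the comparison works even though different start options number the clusters differently.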
Output

    clrand groupvar1 groupvar2

    Comparison of two classifications
    Grouping variables: "groupvar1" and "groupvar2"

    Rand index:                               0.9695
    Adjusted Rand index (Hubert & Arabie):    0.9320

Testing the stability of a classification: the Rand index

Comparisons of the 3-cluster solutions using different start options (adj. Rand)

    Start option   prandom  krandom   firstk    lastk   random  everykth
    --------------------------------------------------------------------
    krandom         0.9320
    firstk          0.4234   0.3888
    lastk           0.4234   0.3888   1.0000
    random          0.9320   1.0000   0.3888   0.3888
    everykth        0.9320   1.0000   0.3888   0.3888   1.0000
    segment         0.9895   0.9222   0.4290   0.4290   0.9222    0.9222

    average adjusted Rand index: 0.6948

Comparisons of the 5-cluster solutions using different start options (adj. Rand)

    Start option   prandom  krandom   firstk    lastk   random  everykth
    --------------------------------------------------------------------
    krandom         0.7160
    firstk          0.9815   0.7064
    lastk           0.9442   0.7182   0.9266
    random          0.7056   0.7896   0.7108   0.6896
    everykth        0.8606   0.7788   0.8445   0.8347   0.7540
    segment         0.9164   0.7534   0.9000   0.9483   0.6962    0.8800

    average adjusted Rand index: 0.8122

Outlook
- speeding up the program for calculating Mojena's stopping rules
- improvement of clnumber
- improvement of clrand
- new program for checking whether a local minimum has been found with kmeans or kmedians cluster analysis
- new programs for calculating additional statistics (e.g.
homogeneity measures, measures for the fit of a dendrogram)

Basic idea: examples

Finding groups of observations / finding groups of variables
[Left: scatterplot of cases on variable 1 and variable 2 showing groups of observations; right: dendrogram grouping the variables var 1 to var 6]

Consequences of decision making: example

Comparison of two kmeans cluster analyses using different initial group centres: starting centres obtained from the "quick clustering algorithm" (SPSS) vs. starting centres set to the means of four randomly selected partitions

                Cluster 1  Cluster 2  Cluster 3  Cluster 4     Total
    Cluster 1       1,144        137          9        434     1,724
    Cluster 2           2      1,629        296          6     1,933
    Cluster 3           1          5        757        827     1,590
    Cluster 4         848        142        198         88     1,276
    Total           1,995      1,913      1,260      1,355     6,523

Determining the number of clusters: inverse scree test

[Inverse scree diagram of the agglomeration levels]
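The cross-tabulation in the "Consequences of decision making" example can be produced from two membership vectors with a few lines. A minimal Python sketch (illustrative only; in Stata a simple tabulate of the two grouping variables does the same job):

```python
from collections import Counter

def crosstab(labels_a, labels_b):
    """Cross-tabulate two cluster assignments of the same observations:
    rows follow the sorted cluster labels of the first solution,
    columns those of the second."""
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    # Counter returns 0 for combinations that never occur
    return [[counts[(r, c)] for c in cols] for r in rows]
```

Large off-diagonal counts, as in the example above, show that the two start specifications lead to substantially different partitions.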