Advanced Statistical Methods for Research
Math 736/836
Cluster Analysis Part 1: Hierarchical Clustering
Watch out for the occasional paper clip!
©2009 Philip J. Ramsey, Ph.D.

Yet another important multivariate exploratory method is referred to as Cluster Analysis. Once again we are studying a multivariate method that is by itself the subject of numerous textbooks, websites, and academic courses. The Classification Society of North America (http://www.classification-society.org/csna/csna.html) deals extensively with Cluster Analysis and related topics. The Journal of Classification (http://www.classificationsociety.org/csna/joc.html) publishes numerous research papers on Cluster Analysis. Cluster Analysis, like PCA, is a method of analysis we refer to as unsupervised learning. A good text is Massart, D. L. and Kaufman, L. (1983), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, New York: John Wiley & Sons.

Supervised Learning refers to the scenario where we have a predetermined structure among the variables or observations. As an example, some of the variables are considered responses and the remaining variables are predictors or inputs to a model. Multiple Regression analysis is a good example of this type of supervised learning. Another type of supervised learning occurs when we have a predetermined classification structure among the observations and we attempt to develop a classification model that accurately predicts that structure as a function of covariates (variables) in the dataset. Logistic regression, Discriminant Analysis, and CART modeling are examples of this type of supervised learning. In Cluster Analysis we attempt to estimate an empirical classification scheme among the observations, the variables, or both.

Clustering is a multivariate technique for grouping together observations that are considered similar in some manner, usually based upon a distance measure. Clustering can incorporate any number of variables for N observations. The variables must be numeric variables for which numerical differences or distances make sense, although hierarchical clustering in JMP allows nominal and ordinal variables under certain conditions. The common situation is that the N observations are not scattered uniformly throughout the multidimensional variable space, but rather form clumps, or locally dense areas, or modes, or clusters. The goal of Cluster Analysis is the identification of these naturally occurring clusters, which helps to characterize the distribution of the N observations.

Basically, clustering consists of a set of algorithms to explore hidden structure among the observations. The goal is to separate the observations into groups or clusters such that observations within a cluster are as homogeneous as possible and the different groups are as heterogeneous as possible. Often we have no a priori hypotheses about the nature of the possible clusters and rely on the algorithms to define the clusters. Identifying a meaningful set of groupings from cluster analysis is as much or more a subject matter task as a statistical task. Generally no formal methods of inference are used in cluster analysis; it is strictly exploratory, although some t and F tests may be used informally. In some applications of cluster analysis, experts may have a predetermined number of clusters that should exist; however, the algorithm determines the composition of the clusters.
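Because the groupings are driven by a distance measure on numeric variables, it can help to see that measure concretely. Below is a minimal Python sketch (not part of the original slides) that standardizes the columns and computes the pairwise Euclidean distance matrix; the data values are made up purely for illustration.

```python
import numpy as np

# Toy data: 5 observations on 2 numeric variables (values made up for illustration).
X = np.array([[2.0, 1.0],
              [9.0, 8.0],
              [2.5, 2.0],
              [8.0, 9.0],
              [3.0, 1.5]])

# Standardize each column so that no variable dominates the distance through its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Pairwise Euclidean distance matrix: D[i, j] = ||z_i - z_j||
diff = Z[:, None, :] - Z[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))
print(np.round(D, 2))  # observations 1, 3, 5 sit close together, as do 2 and 4
```

Small within-group distances and large between-group distances are exactly the "homogeneous within, heterogeneous between" structure the clustering algorithms try to uncover.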
Cluster Analysis techniques implemented in JMP generally fall into two broad categories: Hierarchical Clustering, where we have no preconceived notion of how many natural clusters may exist (it is a combining process), and K-means Clustering, where we have a predetermined idea of the number of clusters that may exist. Everitt and Dunn refer to the latter as "Optimization Methods." A subset of K-means Clustering is referred to as mixture models analysis (the models are multivariate probability distributions, usually assumed to be Normal). If one has a very large dataset, say > 2000 records (depending on computing resources), then the K-means approach might be used, due to the large number of possible classification groups that hierarchical clustering would otherwise have to consider.

The cluster structures have 4 basic forms:
- Disjoint Clusters, where each object can be in only one cluster; K-means clustering falls in this category.
- Hierarchical Clusters, where one cluster may be contained entirely within a superior cluster.
- Overlapping Clusters, where objects can belong simultaneously to two or more clusters; often constraints are placed on the number of overlapping objects in clusters.
- Fuzzy Clusters, which are defined by probabilities of membership in clusters for each object; the clusters can be any of the three types listed above.
The most common types of clusterings used in practice are disjoint or hierarchical.

Hierarchical clustering is also known as agglomerative hierarchical clustering because we start with a set of N single-member clusters and then begin combining them based upon various distance criteria. The process ends when we have a final, single cluster containing N members. The result of hierarchical clustering is presented graphically by way of a dendrogram or tree diagram. One problem with the hierarchical method is that a large number of possible classification schemes are developed and the researcher has to decide which of the schemes is most appropriate. Two-way hierarchical clustering can also be performed, where we simultaneously cluster on the observations and the variables. Clustering among variables is typically based upon correlation measures.

Clustering of observations is typically based upon Euclidean distance measures between the clusters. We typically try to find clusters of observations such that the distances (dissimilarities) between the clusters are maximized for a given set of clusters; that is, the clusters are as different as possible. There are a number of different methods by which the distances between clusters are computed, and the methods usually give different results in terms of the cluster compositions.

Example: The dataset BirthDeath95.JMP contains statistics on 25 nations from 1995. We will introduce hierarchical clustering using the JMP Cluster platform, which is located in the "Multivariate Methods" submenu. The platform provides most of the popular clustering algorithms.

Example continued: The procedure begins with 25 individual clusters and combines observations into clusters until finally only a single cluster of 25 observations exists. The researcher must determine how many clusters are most appropriate. In JMP the user can dynamically select the number of clusters by clicking and dragging the diamond above or below the dendrogram (see picture to the right) and see how the memberships change in order to come to a final set of clusters.
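The same workflow (standardize, cluster agglomeratively, draw the dendrogram) can be sketched outside JMP. The Python sketch below is an illustration only; it assumes the BirthDeath95 data have been exported to a CSV, and the file name, the label column Country, and the exact variable names are assumptions rather than the actual JMP table layout.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumed CSV export of BirthDeath95.JMP; column names follow the example slides
# but are not guaranteed to match the JMP table exactly.
df = pd.read_csv("BirthDeath95.csv")
cols = ["Birth Rate", "Death Rate", "Baby Mort", "Literacy"]

# Standardize the variables, as the JMP Cluster platform does by default.
Z = df[cols].apply(zscore, ddof=1)

# Agglomerative clustering with Ward's method (the JMP default), then the tree diagram.
merge_tree = linkage(Z, method="ward")
dendrogram(merge_tree, labels=df["Country"].tolist(), leaf_rotation=90)
plt.tight_layout()
plt.show()
```

Each row of `merge_tree` records one combining step: which two clusters were joined and the distance that was bridged, which is exactly the information in JMP's clustering history.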
In the dendrogram to the right, the countries are assigned markers according to their membership in one of the 4 clusters designated.

Example continued: Below is the clustering history from JMP. Notice that Greece and Italy are the first cluster formed, followed by Australia and USA. At the 9th stage Australia, USA, and Argentina combine into a cluster. Eventually Greece and Italy join that cluster. The countries combine into fewer clusters at each stage until there exists only 1 cluster at the top of the dendrogram or tree.

Clustering History
Number of Clusters   Distance        Leader        Joiner
24                   0.141569596     Greece        Italy
23                   0.204865881     Australia     USA
22                   0.215801094     Philippines   Vietnam
21                   0.216828958     Cameroon      Nigeria
20                   0.226494960     Egypt         India
19                   0.370451385     Ethiopia      Somalia
18                   0.415752930     Chile         Costa Rica
17                   0.518773384     China         Indonesia
16                   0.574383932     Argentina     Australia
15                   0.609473010     Chile         Mexico
14                   0.637642141     Cameroon      Kenya
13                   0.701263019     Chile         Thailand
12                   0.739182206     China         Philippines
11                   0.744788865     Argentina     Greece
10                   0.877722286     Bolivia       Nicaragua
9                    0.894878623     Haiti         Zambia
8                    1.073430799     Bolivia       Egypt
7                    1.135510496     Chile         Kuwait
6                    1.595721560     Bolivia       Cameroon
5                    1.829760843     Chile         China
4                    2.246948815     Ethiopia      Haiti
3                    2.714981909     Argentina     Chile
2                    3.296092971     Bolivia       Ethiopia
1                    7.702060221     Argentina     Bolivia

Example continued: The cluster designations can be saved to the data table for further analysis (an option under the red arrow in the Report window). Graphical analysis is often very important to understanding the cluster structure. In this case we will use a new feature in JMP 8 called Graph Builder, located as the first option in the Graph menu. Graph Builder basically allows the user to construct trellis graphics. The interface for Graph Builder allows the user to drag and drop variables from the Select Columns window to various areas of the Graph Builder template. The user simply tries various combinations until a desired graph is constructed. By right clicking on the graph it is possible to control what type of display appears on the graph. As an example, one may prefer box plots or histograms for each cell of the plot. The next slide shows a Graph Builder display for the 4 clusters.

Example continued: Graph Builder display of the clusters. From the graph can you see how the four variables define the four clusters? As an example, how do the clusters vary for Baby Mort?

Example continued: Below we show a Fit Y by X plot of Birth Rate vs. Death Rate with 90% density ellipses for each cluster.

[Figure: Bivariate Fit of Death Rate by Birth Rate with 90% normal density ellipses for Clusters 1-4.]

Example continued: Below is a Bubble plot of the data. Note that the circles are colored by cluster number.

Hierarchical clustering, as mentioned, determines the clusters based upon distance measures. JMP supports the five most common types of distance measures used to create clusters. The goal at each stage of clustering is to combine the clusters that are most similar in terms of the distance between the clusters.
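Continuing the earlier Python sketch (still an illustration, not JMP itself), the tree can be cut at 4 clusters and the labels saved back to the table, much like the Save Clusters option in the report, and then plotted by group in the spirit of the Graph Builder and Fit Y by X displays. It reuses `df` and `merge_tree` from the previous sketch.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster

# Cut the tree at 4 clusters (the number chosen with the diamond in JMP) and save the
# labels to the table, analogous to the Save Clusters option in the report window.
df["Cluster"] = fcluster(merge_tree, t=4, criterion="maxclust")

# A quick grouped scatter plot of two of the variables, colored by cluster.
for k, grp in df.groupby("Cluster"):
    plt.scatter(grp["Birth Rate"], grp["Death Rate"], label=f"Cluster {k}")
plt.xlabel("Birth Rate")
plt.ylabel("Death Rate")
plt.legend()
plt.show()
```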
The different measures of inter-cluster distance can arrive at very different clustering sequences.

Average Linkage: the distance between two clusters is the average distance between pairs of observations, one in each cluster. Average linkage tends to join clusters with small variances and is slightly biased toward producing clusters with the same variance. The distance formula is

$d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$

Centroid Method: the distance between two clusters is defined as the squared Euclidean distance between their means. The centroid method is more robust to outliers than most other hierarchical methods, but in other respects might not perform as well as Ward's method or Average Linkage. Distance for the centroid method is

$d_{AB} = \left\| \bar{X}_A - \bar{X}_B \right\|^2$

Ward's Method: For each of k clusters let $ESS_k$ be the sum of squared deviations of each item in the cluster from the centroid of the cluster. If there are currently k clusters, then the total $ESS = ESS_1 + ESS_2 + \cdots + ESS_k$. At each stage all possible unions of cluster pairs are tried and the two clusters providing the least increase in ESS are combined. Initially ESS = 0 for the N individual clusters, and for the final single cluster

$ESS = \sum_{i=1}^{N} (x_i - \bar{x})'(x_i - \bar{x})$

At each stage, the method is biased toward creating clusters with the same numbers of observations. For Ward's method the distance between clusters is calculated as

$d_{AB} = \dfrac{\left\| \bar{X}_A - \bar{X}_B \right\|^2}{\frac{1}{n_A} + \frac{1}{n_B}}$

Single Linkage: the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. Clusters with the smallest distance are joined at each stage. Single linkage has many desirable theoretical properties, but has performed poorly in Monte Carlo studies (Milligan 1980). The inter-cluster distance measure is

$d_{AB} = \min_{i \in A,\, j \in B} d_{ij}$

As mentioned, single linkage has done poorly in Monte Carlo studies; however, it is the only clustering method that can detect long, string-like clusters often referred to as chains. It is also very good at detecting irregular, non-ellipsoidal shaped clusters. Ward's method, for example, assumes that the underlying clusters are approximately ellipsoidal in shape.

Complete Linkage: the distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. At each stage the pair of clusters with the smallest distance is joined. Complete linkage is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers (Milligan 1980). Distance for the complete linkage method is

$d_{AB} = \max_{i \in A,\, j \in B} d_{ij}$
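The formulas above translate directly into code. The short Python sketch below (an illustration, not part of the slides) implements the single, complete, average, centroid, and Ward inter-cluster distances for two small clusters of points; the example coordinates are arbitrary.

```python
import numpy as np

def pair_dists(A, B):
    """All pairwise Euclidean distances d_ij between points of cluster A and cluster B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def single_link(A, B):    return pair_dists(A, B).min()    # d_AB = min over i in A, j in B of d_ij
def complete_link(A, B):  return pair_dists(A, B).max()    # d_AB = max over i in A, j in B of d_ij
def average_link(A, B):   return pair_dists(A, B).mean()   # d_AB = (1 / (n_A n_B)) * sum of d_ij
def centroid_dist(A, B):  return ((A.mean(0) - B.mean(0)) ** 2).sum()   # squared distance between means
def ward_dist(A, B):      return centroid_dist(A, B) / (1.0 / len(A) + 1.0 / len(B))

# Two arbitrary clusters of two points each, just to show the measures disagree.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0]])
for f in (single_link, complete_link, average_link, centroid_dist, ward_dist):
    print(f.__name__, round(f(A, B), 3))
```

Because each measure summarizes the same set of pairwise distances differently, the merge sequence (and hence the dendrogram) can differ from method to method, as the next example shows for complete linkage.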
Example: The following is a simple example with 5 observations to illustrate the idea of clustering. We will use the complete linkage method. We start with a 5 by 5 symmetric matrix of Euclidean distances between the 5 observations (lower triangle shown):

$D = \begin{pmatrix} 0 & & & & \\ 9 & 0 & & & \\ 3 & 7 & 0 & & \\ 6 & 5 & 9 & 0 & \\ 11 & 10 & 2 & 8 & 0 \end{pmatrix}$

At stage 1 we combine #3 and #5 to form a cluster, since they are the closest ($d_{35} = 2$). At stage 2 we compute a new distance matrix and join #2 and #4, since they are now the closest. The complete-linkage distances from the new cluster (35) are

$d_{(35)1} = \max_{i \in A,\, j \in B} d_{ij} = 11, \qquad d_{(35)2} = 10, \qquad d_{(35)4} = 9,$

which gives the new distance matrix (rows and columns ordered (35), 1, 2, 4):

$\begin{pmatrix} 0 & & & \\ 11 & 0 & & \\ 10 & 9 & 0 & \\ 9 & 6 & 5 & 0 \end{pmatrix}$

Example continued: At stage 3 we compute a new distance matrix. Since cluster (24) and #1 are closest, they are joined into a new cluster. The complete-linkage distances are

$d_{(35)(24)} = \max_{i \in A,\, j \in B} d_{ij} = 10, \qquad d_{(24)1} = \max_{i \in A,\, j \in B} d_{ij} = 9,$

and the stage 3 distance matrix (rows and columns ordered (35), (24), 1) is

$\begin{pmatrix} 0 & & \\ 10 & 0 & \\ 11 & 9 & 0 \end{pmatrix}$

Finally, at the last stage cluster (241) is joined with cluster (35) to create the final cluster of 5 observations. The clustering stages can easily be visualized in this simple case without a dendrogram.

We next work through the use of JMP for hierarchical clustering. As mentioned, Clustering is one of the platforms for Multivariate Methods and is located under that submenu. For Hierarchical, select the distance measure that is desired. Ward is the default.

Select the columns containing the data on which the hierarchical clustering will be performed. If the rows of the data matrix are identified by a label column, then put that variable in the Label box. If you do not want the data standardized prior to clustering, then deselect the Standardize Data default. The clusters are very scale dependent, so many experts advise standardization if the scales are not commensurate.

In the report window the dendrogram is displayed and many analysis options exist under the hotspot at the top of the report. Click and drag the red diamond on the dendrogram to change the number of clusters you wish to display. JMP selects a default number of clusters, but it is not necessarily optimal. Alternatively, one can use the "Number of Clusters" option in the report menu. The scree plot at the bottom displays the distance that was bridged in order to join clusters at each stage.

Once the number of clusters has been decided, it is a good idea to save the clusters to the data table and mark them. Simply select the options from the menu or right click at the bottom of the dendrogram. If you decide to change the number of clusters, JMP will update the markers and cluster designations. As shown earlier, we often wish to save the clusters to the data table for further analysis in other JMP platforms.

If you mouse click on a branch of the dendrogram, then all of the observations in that branch are highlighted on the dendrogram and selected in the data table.

[Figure: two-way clustered dendrogram of the 25 countries with a color map of the variables Literacy, Baby Mort, Birth Rate, and Death Rate.]

A color map can be added to the dendrogram to help understand the relationships between the observations and the variables in the columns. The map contains a progressive color code from the smallest value to the largest value. As part of the color map, a two-way clustering can be performed, where a cluster analysis of the variables is added to the bottom of the observation dendrogram. Clustering of the variables is based on correlation, with negative correlations indicating dissimilarity.

If a color map is desired, it can be advantageous to select a display order column, which will order the observations based on the values of the specified column. A good candidate for an ordering column is obtained by performing PCA and then saving only the first PC. This can then be specified as the ordering column in the launch window.
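Creating such an ordering column can be sketched as follows, again continuing the earlier illustrative Python example (it reuses `df` and the standardized matrix `Z`; the Country label column is an assumption). The first principal component is taken from the correlation matrix of the standardized variables, analogous to saving Prin1 from JMP's Multivariate platform.

```python
import numpy as np

# First principal component score of the standardized variables, used as a display-order
# column for the color map; Z and df come from the earlier sketch.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
df["Prin1"] = Z.to_numpy() @ eigvecs[:, np.argmax(eigvals)]

# Sorting the rows by Prin1 gives the ordered color map / dendrogram display.
ordered = df.sort_values("Prin1")
print(ordered[["Country", "Prin1"]].head())
```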
The color map on the left is ordered, the one on the right is not.

[Figure: two color-mapped dendrograms of the 25 countries; the dendrogram on the left is ordered by the first principal component, the one on the right is not.]

Example: We use the dataset CerealBrands.JMP. The dataset contains information on 43 breakfast cereals. We have also saved the first principal component score as an ordering column.

[Figure: two-way clustered dendrogram of the 43 cereals with a color map of Calories, Sugar, Fat, Sodium, Carbohydrates, Protein, Fiber, and Potassium.]

Example: Using two-way clustering with Prin1 as an ordering variable we get the following set of clusters. We select 5 as the desired number of clusters. Do the clusters seem logical? Which variables seem important to the cluster development?

Example: Below is a Fit Y by X plot of Carbohydrates vs. Calories with the clusters displayed.

[Figure: Bivariate Fit of Carbohydrates by Calories with the cereals labeled by brand and colored by cluster.]

Example: To the right is an analysis using single linkage. Notice that the clustering sequence is significantly different.

[Figure: single-linkage dendrogram of the 43 cereals.]

An obvious question is which hierarchical clustering method is preferred. Unfortunately, over the decades numerous simulation studies have been performed to attempt to answer this question, and the overall results tend to be inconsistent and confusing, to say the least. In the studies, Ward's method and average linkage have generally tended to perform the best in finding the correct clusters, while single linkage has tended to perform the worst. A problem in evaluating clustering algorithms is that each tends to favor clusters with certain characteristics such as size, shape, or dispersion. Therefore, a comprehensive evaluation of clustering algorithms requires that one look at artificial clusters with various characteristics. For the most part this has not been done.
Most evaluation studies have tended to use compact clusters of equal variance and size; often the clusters are based on a multivariate normal distribution. Ward's method is biased toward clusters of equal size and approximately spherical shape, while average linkage is biased toward clusters of equal variance and spherical shape. Therefore, it is not surprising that Ward's method and average linkage tend to be the winners in simulation studies. In fact, most clustering algorithms are biased toward regularly shaped regions and may perform very poorly if the clusters are irregular in shape. Recall that single linkage does well if one has elongated, irregularly shaped clusters. In practice, one has no idea about the characteristics of the clusters.

If the natural clusters are well separated from each other, then any of the clustering algorithms is likely to perform very well. To illustrate, we use the artificial dataset WellSeparateCluster.JMP and apply both Ward's method and single linkage clustering. The data consist of three very distinct (well separated) clusters.

Below are the clustering results for both Ward's method, displayed on the left, and single linkage, displayed on the right. Notice that both methods easily identify the three clusters. Note that the clusters are multivariate normal with equal sizes.

If the clusters are not well separated, then the various algorithms will perform quite differently. We illustrate with the dataset PoorSeparateCluster.JMP; below is a plot of the clusters.

The plot on the left is for Ward's method and the plot on the right is for single linkage using Proc Cluster in SAS; the highlighted points are observations for which Proc Cluster could not determine cluster membership. Ward's method has done quite well, while single linkage has done poorly and could not determine cluster membership for quite a few observations.

Next we look at multivariate normal clusters, but this time they have different sizes and dispersions. The dataset UnequalCluster.JMP contains the results. On the left is Ward's method and on the right is the average linkage method using Proc Cluster in SAS. Ward's method and average linkage produced almost identical results. However, they both tended toward clusters of equal size and assigned too many observations to the smallest cluster.

Next we look at two elongated clusters. We will compare Ward's method to single linkage clustering. Generally, single linkage is supposed to be superior for elongated clusters. The data are in the file ElongateCluster.JMP.

On the left below is Ward's method and to the right is the single linkage method using Proc Cluster in SAS. Ward's method finds two clusters of approximately equal size, but classifies poorly. The single linkage method correctly identifies the two elongated clusters.

When one has elongated clusters, this indicates correlation or covariance structure among the variables used to form the clusters. Sometimes transformations on the variables can generate more spherical clusters that are more easily detected by Ward's method or similar methods.
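This kind of comparison can be mimicked with simulated data. The Python sketch below generates two elongated (strongly correlated) multivariate normal clusters lying side by side and scores how well Ward's method and single linkage recover them. The data are simulated here, not the datasets from the slides, so the exact outcome depends on the shapes and separation chosen.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two simulated elongated clusters: strongly correlated bivariate normals, shifted so
# they lie side by side across their short axis (values chosen only for illustration).
rng = np.random.default_rng(2009)
cov = [[1.0, 0.95], [0.95, 1.0]]
A = rng.multivariate_normal([0.0, 0.0], cov, size=100)
B = rng.multivariate_normal([2.0, -2.0], cov, size=100)
X = np.vstack([A, B])
truth = np.repeat([1, 2], 100)

# Compare how well each method recovers the two known clusters (up to label switching).
for method in ["ward", "single"]:
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    agreement = max(np.mean(labels == truth), np.mean(labels == 3 - truth))
    print(f"{method:7s} agreement with the true clusters: {agreement:.2f}")
```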
In theory the method is straightforward; however, if one does not know the number of clusters or the covariance structure within each cluster, they have to be approximated from the data, which is not always easy to do well in practice. Proc Aceclus in SAS can be used to perform such transformations prior to clustering with Proc Cluster. We can perform a rough approximation to the method in JMP by converting the original variables to principal components and then clustering on the principal component scores.

To illustrate, we will use the elongated cluster data, perform a PCA on correlations in the Multivariate platform, and save the two principal component scores to the data table. Next we will perform clustering, using Ward's method, on the principal component scores. Below is a scatter plot based on the PC's. Although not perfect spheres, the PC's are more spherical in shape than the original variables and more easily separated in the Prin2 direction. Proc Aceclus produced similar results and is not shown.

Below are the results of clustering on the PC's using Ward's method. Notice that the two clusters are perfectly classified by Ward's method on the PC's, while the method did not fare well on the original variables.

To illustrate the clustering, we can use Graph Builder in JMP to show that indeed the clusters are primarily determined by the difference in Prin2 between the two clusters.

We examine one more scenario where we have nonconvex, elongated clusters. Because of the cluster shape, the PCA approach will not work in this case. The data are contained in the file NonConvexCluster.JMP. Below is a plot of the two clusters.

Below are the results of clustering using Ward's method and single linkage. Ward's method misclassifies some of the observations in cluster 2, while single linkage identifies the two clusters almost exactly.
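A rough Python sketch of the PCA-then-cluster idea is below. It computes principal component scores from the correlation matrix and then runs Ward's method on the standardized scores; standardizing the scores (which the Standardize Data default would also do when clustering on the saved components in JMP) is what rescales the long direction and makes the elongated clusters more spherical. The file name and the column names X1 and X2 are placeholders, and this is only an approximation to the Aceclus-style transformation, as described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed CSV export of ElongateCluster.JMP with two numeric columns (placeholder names).
df = pd.read_csv("ElongateCluster.csv")
Z = df[["X1", "X2"]].apply(zscore, ddof=1).to_numpy()

# Principal components of the correlation matrix; the scores are the rotated data.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
scores = Z @ eigvecs[:, np.argsort(eigvals)[::-1]]          # Prin1, Prin2 scores

# Standardizing the component scores rescales the elongated (large-variance) direction,
# which is what lets Ward's method see roughly spherical clusters.
S = zscore(scores, ddof=1)

df["Cluster"] = fcluster(linkage(S, method="ward"), t=2, criterion="maxclust")
print(df["Cluster"].value_counts())
```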