Cluster Analysis

Some of the examples on cluster analysis will come from Base SAS.

What Is Cluster Analysis?
Cluster analysis is an exploratory data analysis tool. It classifies records (people, things, events, etc.) into groups, or clusters, in a manner that produces a strong degree of association between members of the same cluster and a weak degree of association between members of different clusters. Cluster analysis can illuminate associations and structure in the data that were not previously evident but are logical and useful once found.

When Do We Use Cluster Analysis?
Whenever you want to classify people or things into groups/clusters — so there are many situations in which cluster analysis can be helpful. It is often useful in marketing. Why marketing? In marketing it is often desirable to group customers with similar likes and dislikes together, which enables more effective marketing campaigns, and it is often difficult to decide by hand how to place customers into categories.

Measures of Distance
Two main types of distance are used for cluster analysis:
1. Euclidean distance
   d(r, s) = sqrt( (x_r − x_s)′ (x_r − x_s) ),
   where x_r is the vector of clustering variables for point r and x_s is the vector of clustering variables for point s.
2. Standardized Euclidean distance
   Standardize the data first, then calculate the Euclidean distance:
   z_i = (x_i − μ) / σ.
   Typically we use the sample mean and sample standard deviation for μ and σ, respectively.

An Example of the Power of Cluster Analysis
1. Generate data (code to generate random data in which we expect 3 clusters):

   DATA body1;                        *the data represent the heights and weights of 300 women;
     KEEP height1 weight1;
     N=100;                           *N = observations per cluster;
     SCALE=1;                         *SCALE = within-cluster standard deviation;
     MX=150; MY=45; LINK GENERATE;    *mean of x,y of cluster 1;
     MX=155; MY=50; LINK GENERATE;    *mean of x,y of cluster 2;
     MX=160; MY=40; LINK GENERATE;    *mean of x,y of cluster 3;
     STOP;
   GENERATE:
     DO I=1 TO N;
       height1=RANNOR(0)*SCALE+MX;    *generates from a random normal distribution;
       weight1=RANNOR(0)*SCALE+MY;    *generates from a random normal distribution;
       OUTPUT;
     END;
     RETURN;
   RUN;

2. How will we group the women looking at height and weight? First, let us see the data:

   proc print data=body1; run;

3. Let us also see some descriptive statistics for more information:

   proc univariate data=body1;
     var height1 weight1;
   run;

   (PROC UNIVARIATE output: descriptive statistics on height and on weight.)

The descriptive statistics were informative, but it is still very difficult to know how to group the women using their heights and weights.

4. Cluster analysis:

   PROC CLUSTER DATA=body1 OUTTREE=TREE METHOD=SINGLE PRINT=20;
   run;

   DATA= and OUTTREE= name the input (body1) and output (tree) data sets, respectively. METHOD=SINGLE requests the single linkage clustering method.

   PROC TREE data=tree NOPRINT OUT=OUT N=3;
     COPY height1 weight1;
   run;

   COPY copies the variables height1 and weight1 to the output data set called out. NOPRINT prevents printing of the tree. N= specifies the number of desired clusters; for this example it is 3.

   PROC GPLOT;
     PLOT height1*weight1=CLUSTER;
   RUN;

   In PLOT height1*weight1, height1 and weight1 go on the vertical (y) and horizontal (x) axes, respectively.
Specifying =CLUSTER makes the plot distinguish between the clusters produced.

An Example of the Power of Cluster Analysis
The results of clustering using SAS: the plot of height1 by weight1 shows the three clusters.

An Example of the Power of Cluster Analysis

PROC CLUSTER DATA=body1 OUTTREE=TREE METHOD=SINGLE PRINT=20;
run;

The clustering method used in the example was the single linkage method. In SAS, the options for METHOD= are:
1. AVE or AVERAGE for the average linkage method.
2. CEN or CENTROID for the centroid method.
3. COM or COMPLETE for the complete linkage method.
4. DEN or DENSITY for density linkage methods using nonparametric probability density estimation.
5. EML for maximum-likelihood hierarchical clustering; used for coordinate data and much slower than the other clustering methods.
6. FLE or FLEXIBLE for the Lance-Williams flexible-beta method.
7. MCQ or MCQUITTY for McQuitty's similarity analysis.
8. MED or MEDIAN for Gower's median method.
9. SIN or SINGLE for the single linkage method.
10. TWO or TWOSTAGE for two-stage density linkage.
11. WAR or WARD for Ward's minimum variance method.
A lot of possible methods! The most common are single, average, complete, centroid, and Ward's.

Cluster Analysis: Two Main Procedures in SAS
1. PROC CLUSTER, which is hierarchical and is used together with PROC TREE.
2. PROC FASTCLUS, which is nonhierarchical.

Cluster Analysis: Hierarchical
1. PROC CLUSTER
Hierarchical: the data points are placed into clusters in a nested sequence of clusterings. The most efficient hierarchical clustering methods are the linkage methods. Types of linkage clustering:
• Nearest neighbor (single linkage) method
• Furthest neighbor (complete linkage) method
• Many others
The nearest neighbor method tends to create fewer clusters than the furthest neighbor method, and most other methods tend to give results somewhere between those two. It is good to try more than one method: if multiple methods produce consistent/similar results, then your results are more credible.
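The SAS example above can be sketched in Python as well — a hypothetical analog using numpy and scipy rather than Base SAS. It generates the same three clusters of heights and weights, computes the two distance measures from the Measures of Distance slide, and cuts a single-linkage tree into three clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Three clusters of 100 (height, weight) pairs, mirroring the SAS DATA step
# (SCALE there plays the role of the standard deviation used here).
means = [(150, 45), (155, 50), (160, 40)]
data = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 2)) for m in means])

# Euclidean distance between points r and s: sqrt((x_r - x_s)'(x_r - x_s)).
d = np.sqrt(np.sum((data[0] - data[1]) ** 2))

# Standardized Euclidean distance: z-score each variable first, then measure.
z = (data - data.mean(axis=0)) / data.std(axis=0)
d_std = np.sqrt(np.sum((z[0] - z[1]) ** 2))

# Single linkage (METHOD=SINGLE), then cut the tree into 3 clusters
# (the PROC TREE N=3 step).
tree = linkage(data, method="single")
labels = fcluster(tree, t=3, criterion="maxclust")
print(sorted(set(labels.tolist())))   # → [1, 2, 3]
```

A scatter plot of the points colored by `labels` would play the role of the PROC GPLOT step.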
Cluster Analysis: Nonhierarchical
2. PROC FASTCLUS
Nonhierarchical: choose a set of cluster seed points and build clusters around the seeds, using a dissimilarity measure to gauge the distance between data points and the cluster seed points. An example of a dissimilarity measure is Euclidean distance.
Three main disadvantages:
1. You must initially guess the number of clusters that exist.
2. The results are greatly influenced by the location of the cluster seed points.
3. It may not be computationally feasible to explore the multitude of possible choices for the number of clusters and the locations of the cluster seeds.

Reference: Much of the next few slides is taken from the SPSS Help/Tutorial.

SPSS’s TwoStep Cluster Analysis
The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:
• The ability to create clusters based on both categorical and continuous variables.
• Automatic selection of the number of clusters.
• The ability to analyze large data files efficiently.

In order to handle categorical and continuous variables, the TwoStep Cluster Analysis procedure uses a likelihood distance measure which assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

The two steps of the TwoStep Cluster Analysis procedure's algorithm:
Step 1. The procedure begins with the construction of a Cluster Features (CF) tree.
The tree begins by placing the first case at the root of the tree, in a leaf node that contains variable information about that case. Each successive case is then added to an existing node or forms a new node, based upon its similarity to existing nodes, using the distance measure as the similarity criterion. A node that contains multiple cases contains a summary of variable information about those cases. Thus, the CF tree provides a capsule summary of the data file.
Step 2. The leaf nodes of the CF tree are then grouped using an agglomerative clustering algorithm. The agglomerative clustering can be used to produce a range of solutions. To determine which number of clusters is "best", each of these cluster solutions is compared using Schwarz's Bayesian Criterion (BIC) or the Akaike Information Criterion (AIC) as the clustering criterion.

SPSS’s TwoStep Cluster Analysis: Example
Car manufacturers need to be able to appraise the current market to determine the likely competition for their vehicles. If cars can be grouped according to available data, this task can be largely automated using cluster analysis. Information on various makes and models of motor vehicles is contained in car_sales.sav. Use the TwoStep Cluster Analysis procedure to group automobiles according to their prices and physical properties.

In the TwoStep Cluster Analysis dialog box:
1. Select Vehicle type as a categorical variable.
2. Select Price in thousands through Fuel efficiency as continuous variables.
3. Click Plots.
In the Plots dialog box:
1. Select Rank of variable importance.
2. Select By variable in the Rank Variables group.
3. Select Confidence level.
4. Click Continue.
5. Then click Output in the TwoStep Cluster Analysis dialog box.
In the Output dialog box:
1. Select Information criterion (AIC or BIC) in the Statistics group.
2. Click Continue.
3. Finally, click OK in the TwoStep Cluster Analysis dialog box.
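The output that follows includes an auto-clustering table of BIC values by number of clusters. As a sketch of how its derived columns are computed, the BIC Change and Ratio of BIC Changes columns can be recomputed in Python from the reported BIC values (the Ratio of Distance Measures column depends on internal quantities the table does not report, so it is not reproduced here):

```python
# BIC values from the auto-clustering table (clusters 1 through 15).
bic = [1214.377, 974.051, 885.924, 897.559, 931.760, 968.073, 1026.000,
       1086.815, 1161.740, 1237.063, 1316.271, 1396.192, 1477.199,
       1559.230, 1644.366]

# "BIC Change": change from the previous number of clusters (footnote a).
change = [b - a for a, b in zip(bic, bic[1:])]

# "Ratio of BIC Changes": each change relative to the change for the
# two-cluster solution (footnote b).
ratio = [c / change[0] for c in change]

# The "best" solution here is the one with the smallest BIC.
best = 1 + min(range(len(bic)), key=lambda i: bic[i])
print(best)                                      # → 3
print(round(change[0], 3), round(ratio[1], 3))   # → -240.326 0.367
```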
SPSS’s TwoStep Cluster Analysis

Auto-Clustering
Clusters   Schwarz's Bayesian   BIC          Ratio of BIC   Ratio of Distance
           Criterion (BIC)      Change (a)   Changes (b)    Measures (c)
 1         1214.377
 2          974.051             -240.326      1.000          1.829
 3          885.924              -88.128       .367          2.190
 4          897.559               11.635      -.048          1.368
 5          931.760               34.201      -.142          1.036
 6          968.073               36.313      -.151          1.576
 7         1026.000               57.927      -.241          1.083
 8         1086.815               60.815      -.253          1.687
 9         1161.740               74.926      -.312          1.020
10         1237.063               75.323      -.313          1.239
11         1316.271               79.207      -.330          1.046
12         1396.192               79.921      -.333          1.075
13         1477.199               81.008      -.337          1.076
14         1559.230               82.030      -.341          1.301
15         1644.366               85.136      -.354          1.044
a. The changes are from the previous number of clusters in the table.
b. The ratios of changes are relative to the change for the two-cluster solution.
c. The ratios of distance measures are based on the current number of clusters against the previous number of clusters.

1. The Auto-Clustering table summarizes the process by which the number of clusters is chosen.
2. The clustering criterion (in this case the BIC) is computed for each potential number of clusters. Smaller values of the BIC indicate better models, and in this situation, the "best" cluster solution has the smallest BIC.
However, there are clustering problems in which the BIC will continue to decrease as the number of clusters increases, but the improvement in the cluster solution, as measured by the BIC Change, is not worth the increased complexity of the cluster model, as measured by the number of clusters. In such situations, the changes in BIC and the changes in the distance measure are evaluated together to determine the "best" cluster solution. A good solution will have a reasonably large Ratio of BIC Changes and a large Ratio of Distance Measures.

SPSS’s TwoStep Cluster Analysis

Cluster Distribution
                  N     % of Combined   % of Total
Cluster 1         62        40.8%          39.5%
Cluster 2         39        25.7%          24.8%
Cluster 3         51        33.6%          32.5%
Combined         152       100.0%          96.8%
Excluded Cases     5                        3.2%
Total            157                      100.0%

The cluster distribution table shows the frequency of each cluster. Of the 157 total cases, 5 were excluded from the analysis due to missing values on one or more of the variables.
Of the 152 cases assigned to clusters, 62 were assigned to the first cluster, 39 to the second, and 51 to the third.

SPSS’s TwoStep Cluster Analysis

Centroids, Mean (Std. Deviation)
                     Cluster 1            Cluster 2             Cluster 3             Combined
Price in thousands   19.61671 (7.644070)  26.56182 (10.185175)  37.29980 (17.381187)  27.33182 (14.418669)
Engine size          2.194 (.4238)        3.559 (.9358)         3.700 (.9493)         3.049 (1.0498)
Horsepower           143.24 (30.259)      187.92 (39.049)       232.96 (54.408)       184.81 (56.823)
Wheelbase            102.595 (4.0799)     112.972 (9.6537)      109.022 (5.7644)      107.414 (7.7178)
Width                68.539 (1.9366)      72.744 (4.1781)       72.924 (2.1855)       71.089 (3.4647)
Length               178.235 (9.6534)     191.110 (14.4415)     194.688 (10.3512)     187.059 (13.4712)
Curb weight          2.83742 (.310867)    3.96759 (.671766)     3.57890 (.297204)     3.37618 (.636593)
Fuel capacity        14.979 (1.8699)      22.064 (4.2894)       18.443 (2.0445)       17.959 (3.9376)
Fuel efficiency      27.24 (3.578)        19.51 (2.910)         23.02 (2.060)         23.84 (4.305)

The centroids show that the clusters are well separated by the continuous variables. Motor vehicles in cluster 1 are cheap, small, and fuel efficient. Motor vehicles in cluster 2 are moderately priced, heavy, and have a large gas tank, presumably to compensate for their poor fuel efficiency. Motor vehicles in cluster 3 are expensive, large, and moderately fuel efficient.

SPSS’s TwoStep Cluster Analysis

Vehicle Type by Cluster
             Automobile            Truck
             Frequency  Percent    Frequency  Percent
Cluster 1        61      54.5%         1        2.5%
Cluster 2         0        .0%        39       97.5%
Cluster 3        51      45.5%         0         .0%
Combined        112     100.0%        40      100.0%

The cluster frequency table by Vehicle type further clarifies the properties of the clusters. (Go to the SPSS output for the rest.)

SPSS’s TwoStep Cluster Analysis
Using the TwoStep Cluster Analysis procedure, you have separated the motor vehicles into three fairly broad categories. In order to obtain finer separations within these groups, you should collect information on other attributes of the vehicles. For example, you could note the crash test performance or the options available.

The TwoStep Cluster Analysis procedure is useful for finding natural groupings of cases or variables. It works well with categorical and continuous variables, and it can analyze very large data files. If you have a small number of cases and want to choose between several methods for cluster formation, variable transformation, and measuring the dissimilarity between clusters, try the Hierarchical Cluster Analysis procedure. The Hierarchical Cluster Analysis procedure also allows you to cluster variables instead of cases.
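Clustering variables rather than cases amounts to choosing a distance between variables. One common choice — an illustrative sketch with hypothetical data, not the SPSS implementation — is one minus the absolute correlation, fed to an agglomerative method:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)

# Six hypothetical variables in two related groups: three "size" variables
# and three "economy" variables (stand-ins for length/width/weight vs. mpg).
n = 200
size = rng.normal(size=n)
economy = rng.normal(size=n)
X = np.column_stack(
    [size + 0.1 * rng.normal(size=n) for _ in range(3)]
    + [economy + 0.1 * rng.normal(size=n) for _ in range(3)]
)

# Distance between variables: 1 - |correlation|, so strongly correlated
# variables are close together.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)

# Average linkage on the condensed distance matrix, cut into two groups.
tree = linkage(squareform(dist, checks=False), method="average")
groups = fcluster(tree, t=2, criterion="maxclust")
print(groups.tolist())   # the size variables share one label, the economy variables the other
```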
The K-Means Cluster Analysis procedure is limited to scale (continuous) variables, but it can analyze large data files and allows you to save the distance from the cluster center for each object.
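A minimal sketch of the K-means idea in Python — hypothetical data and a plain Lloyd's algorithm with deterministic farthest-point seeding, not the SPSS implementation — including the per-object distance to its cluster center that the procedure can save:

```python
import numpy as np

rng = np.random.default_rng(2)

# Three well-separated groups of continuous ("scale") measurements.
centers = np.array([(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(60, 2)) for c in centers])

def kmeans(X, k, iters=100):
    # Deterministic farthest-point seeding, then plain Lloyd's iterations.
    seeds = [0]
    while len(seeds) < k:
        gap = np.min(((X[:, None] - X[seeds]) ** 2).sum(-1), axis=1)
        seeds.append(int(np.argmax(gap)))
    centroids = X[seeds].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            centroids[j] = X[labels == j].mean(axis=0)
    # Final assignments and each object's distance from its cluster center
    # (the quantity the K-Means procedure can save for each case).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    center_dist = np.sqrt(d2[np.arange(len(X)), labels])
    return labels, centroids, center_dist

labels, centroids, center_dist = kmeans(X, 3)
print(np.bincount(labels).tolist())   # about 60 objects in each of the 3 clusters
```

Note the first disadvantage listed for nonhierarchical clustering above: k must be chosen in advance, and the seeding rule strongly influences the result.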