
Stat 407 Lab 10
Cluster Analysis
Fall 2001
SOLUTION

In this lab we examine how various clustering algorithms break up the Australian crabs data into groups. The crabs data has 4 visible clusters, corresponding to the species and sex groups. There are strong linear dependencies among the 5 physical measurements, which means that the clusters are long and narrow.

1. Based on what we have covered in lecture, which clustering method would you guess might do best at finding the 4 clusters? Explain your answer.

The cluster structure corresponding to the 4 classes is shaped like 4 long pencils that are partially connected at one end. k-means clearly won’t capture this structure because it tends to find spherical clusters. Single linkage hierarchical clustering might chain enough to extract the clusters. Ward’s linkage is unlikely to find them because it also favors spherical clusters. Model-based clustering may capture the structure using an elongated elliptical variance-covariance matrix.

2. Using agglomerative hierarchical clustering with Euclidean distance and Ward’s linkage, cluster the crabs data based on the 5 physical measurements. Plot the dendrogram. How many clusters are suggested by the dendrogram?

2 or 3 clusters are suggested, based on the fusion heights.

3. Run it again, but this time force the tree to be cut at 4 clusters (4 because we know there are 4 clusters in the data), saving the cluster id to a variable. Append this cluster id to the data file. Cross-tabulate the cluster id with the true Sp.Sex variable. How well are the Sp.Sex groups captured by the clustering?

Sp.Sex |cluster.id
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      |24     | 7     |14     | 5     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
2      |19     | 4     |17     |10     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
3      |28     |12     | 7     | 3     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
4      |21     |24     | 5     | 0     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
ColTotl|92     |47     |43     |18     |200    |
       |0.46   |0.24   |0.22   |0.09   |       |
-------+-------+-------+-------+-------+-------+

The Sp.Sex groups are not captured at all by the clustering. There is considerable confusion in the cross-tabulation: every Sp.Sex group is spread across several clusters, and no cluster is dominated by a single group.

4. Plot the 5 variables using a scatterplot matrix, and then use the cluster id to color the cases. Describe how agglomerative hierarchical clustering with Euclidean distance and Ward’s linkage grouped the cases into clusters.

Hierarchical clustering with Ward’s linkage carves up the data into 4 groups along the direction of maximum variance.

5. Repeat the above, except for generating the dendrogram, using fuzzy partitioning.

Sp.Sex |cluster.id1
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      | 7     |13     |20     |10     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
2      |12     |16     |18     | 4     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
3      | 4     |15     |12     |19     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
4      | 2     | 6     |18     |24     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
ColTotl|25     |50     |68     |57     |200    |
       |0.12   |0.25   |0.34   |0.28   |       |
-------+-------+-------+-------+-------+-------+

There is very little agreement between the true class structure and the clustering produced by fuzzy partitioning, as seen in the cross-tabulation. The plot shows that this method, too, carves the data into 4 groups along the direction of maximum variation.

6. In the usual cluster analysis setting we do not know the correct group of a case.
It is therefore common to run several analyses and then compare the results. Cross-tabulate the cluster ids from the agglomerative hierarchical clustering (Euclidean distance, Ward’s linkage) with those from the fuzzy partitioning. How do they compare?

cluster.id|cluster.id1
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      | 0     |14     |68     |10     |92     |
       |       |       |       |       |0.46   |
-------+-------+-------+-------+-------+-------+
2      | 0     | 0     | 0     |47     |47     |
       |       |       |       |       |0.24   |
-------+-------+-------+-------+-------+-------+
3      | 7     |36     | 0     | 0     |43     |
       |       |       |       |       |0.22   |
-------+-------+-------+-------+-------+-------+
4      |18     | 0     | 0     | 0     |18     |
       |       |       |       |       |0.09   |
-------+-------+-------+-------+-------+-------+
ColTotl|25     |50     |68     |57     |200    |
       |0.12   |0.25   |0.34   |0.28   |       |
-------+-------+-------+-------+-------+-------+

There is a fair amount of agreement between the two methods, as evidenced by the large number of empty cells. Roughly, the clusters from the two methods correspond as follows:

Hierarch  1  2  3  4
Fuzzy     3  4  2  1
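
The correspondence between the two clusterings can be checked directly from the cross-tabulation above. The following is a minimal Python sketch, not part of the original lab (which does not show its code). It assumes a simple matching rule: pair each hierarchical cluster with the fuzzy cluster that holds most of its cases. The function names `match_clusters` and `agreement` are hypothetical, chosen for illustration.

```python
# Counts copied from the cross-tabulation above:
# rows = hierarchical (Ward) cluster ids 1-4, columns = fuzzy cluster ids 1-4.
crosstab = [
    [0, 14, 68, 10],
    [0, 0, 0, 47],
    [7, 36, 0, 0],
    [18, 0, 0, 0],
]


def match_clusters(table):
    """Map each row label to the column label holding most of its cases.

    Assumption: a row-wise maximum is a good enough matching rule.
    It is not guaranteed to be one-to-one for every table.
    """
    mapping = {}
    for i, row in enumerate(table, start=1):
        best_col = max(range(len(row)), key=lambda j: row[j])
        mapping[i] = best_col + 1  # labels are 1-based, as in the table
    return mapping


def agreement(table, mapping):
    """Share of all cases that fall in the matched cells."""
    total = sum(sum(row) for row in table)
    matched = sum(table[i - 1][j - 1] for i, j in mapping.items())
    return matched / total


mapping = match_clusters(crosstab)
share = agreement(crosstab, mapping)
print(mapping)          # {1: 3, 2: 4, 3: 2, 4: 1}
print(round(share, 3))  # 0.845
```

Under this rule the matched cells hold 68 + 47 + 36 + 18 = 169 of the 200 cases, which is consistent with the "fair amount of agreement" noted above. When the row-wise maxima collide on the same column, a proper one-to-one assignment method would be needed instead.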