Stat 407 Lab 10 Cluster Analysis Fall 2001 SOLUTION

In this lab we examine how various clustering algorithms break up the Australian crabs data into groups. The
crabs data has 4 visible clusters, corresponding to the species and sex groups. There are strong linear
dependencies between the 5 physical measurements, which means that the clusters are very long and narrow.
1. Based on what we have covered in lecture material, which clustering method would you guess might do the
best at finding the 4 clusters? Explain your answer.
The cluster structure corresponding to the 4 classes is shaped like 4 long pencils that are partially connected
at one end. k-means clearly won’t capture this structure because it finds spherical clusters. Single linkage
hierarchical clustering might chain enough to extract the clusters. Ward’s linkage is unlikely to find the
structure because it also fits spherical clusters. Model-based clustering may capture the structure using an
elongated elliptical variance-covariance matrix.
2. Using agglomerative hierarchical clustering with Euclidean distance and Ward’s linkage, cluster
the crabs data based on the 5 physical measurements. Plot the dendrogram. How many clusters are
suggested by the dendrogram?
2 or 3 clusters are suggested, based on the fusion heights.
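The lab itself does not show the commands used, so the following is only a minimal sketch of Ward's linkage in Python with SciPy. The data are a hypothetical stand-in for the crabs measurements (4 groups of 50 cases, 5 variables driven by one common size factor), not the real data, and all variable names are assumptions made for illustration. It prints the largest fusion heights, which are what one reads off the dendrogram, and then cuts the tree at 4 clusters as the next question asks.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Hypothetical stand-in for the crabs data (NOT the real measurements):
# 4 groups of 50 cases, 5 variables that all follow one "size" factor,
# so every group is long and narrow.
n_per_group, n_vars = 50, 5
group = np.repeat(np.arange(1, 5), n_per_group)        # true group ids 1..4
size = rng.uniform(10, 40, size=(4 * n_per_group, 1))  # common size factor
loadings = np.linspace(0.8, 1.2, n_vars)               # shared long direction
offsets = rng.normal(0, 1.0, size=(4, n_vars))         # small group shifts
noise = rng.normal(0, 0.3, size=(4 * n_per_group, n_vars))
X = size * loadings + offsets[group - 1] + noise

# Agglomerative clustering with Euclidean distance and Ward's linkage.
Z = linkage(X, method="ward", metric="euclidean")

# Column 2 of Z holds the fusion heights; the last rows are the final merges.
print("largest fusion heights:", np.round(Z[-5:, 2][::-1], 1))

# Cut the tree so that at most 4 clusters remain.
labels = fcluster(Z, t=4, criterion="maxclust")

# Cross-tabulate true group (rows) against cluster id (columns).
table = np.zeros((4, 4), dtype=int)
np.add.at(table, (group - 1, labels - 1), 1)
print(table)
```

A large gap between successive fusion heights is what suggests a number of clusters; `scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself if matplotlib is available.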
3. Run it again, but this time force it to cut the tree at 4 clusters (4 because we know there are 4 clusters in
the data), saving the cluster id to a variable. Append this cluster id to the data file. Cross-tabulate the
cluster id with the true Sp.Sex variable. How well are the Sp.Sex groups captured by the clustering?
Sp.Sex |cluster.id
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      |24     | 7     |14     | 5     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
2      |19     | 4     |17     |10     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
3      |28     |12     | 7     | 3     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
4      |21     |24     | 5     | 0     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
ColTotl|92     |47     |43     |18     |200    |
       |0.46   |0.24   |0.22   |0.09   |       |
-------+-------+-------+-------+-------+-------+
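The Sp.Sex groups are not captured at all by the clustering. There is considerable confusion in the cross-tabulation: every group is spread over at least three clusters, and cluster 1 alone takes between 19 and 28 cases from each group.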
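One way to put a number on this confusion is cluster purity, the fraction of cases that belong to the majority true group of their cluster. Purity is not part of the lab; it is used here only as an illustrative summary, and the function name is an assumption. The counts are copied from the cross-tabulation above.

```python
# Counts copied from the cross-tabulation above:
# rows are Sp.Sex groups 1-4, columns are Ward cluster ids 1-4.
ward_table = [
    [24, 7, 14, 5],
    [19, 4, 17, 10],
    [28, 12, 7, 3],
    [21, 24, 5, 0],
]


def purity(table):
    """Fraction of cases falling in the majority true group of their cluster."""
    n_total = sum(sum(row) for row in table)
    # zip(*table) iterates over columns, i.e. over clusters.
    majority = sum(max(column) for column in zip(*table))
    return majority / n_total


ward_purity = purity(ward_table)
print(f"purity = {ward_purity:.3f}")
```

The majority counts are 28, 24, 17 and 10, so the purity works out to 79/200 = 0.395. With 4 equally sized groups a purity of 0.25 is the lowest possible value, so this is only a little better than an uninformative partition.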
4. Plot the 5 variables using a scatterplot matrix, and then use the cluster id to color cases. Describe how
agglomerative hierarchical clustering with Euclidean distance and Ward’s linkage grouped the cases into clusters.
Hierarchical with Ward’s linkage carves up the data into 4 groups along the maximum variance direction.
5. Repeat the above, except for generating the dendrogram, using fuzzy partitioning.
Sp.Sex |cluster.id1
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      | 7     |13     |20     |10     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
2      |12     |16     |18     | 4     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
3      | 4     |15     |12     |19     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
4      | 2     | 6     |18     |24     |50     |
       |       |       |       |       |0.25   |
-------+-------+-------+-------+-------+-------+
ColTotl|25     |50     |68     |57     |200    |
       |0.12   |0.25   |0.34   |0.28   |       |
-------+-------+-------+-------+-------+-------+
There is very little agreement between the true class structure and the clustering made with fuzzy partitioning,
as seen in the cross-tabulation. The plot shows that this method too carves the data into 4 groups along the
line of maximum variation.
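The lab does not say which fuzzy partitioning routine was used or how the single cluster id was derived from it. As a rough illustration only, the sketch below implements fuzzy c-means, a related but possibly different algorithm, and turns the membership matrix into a cluster id by taking the largest membership for each case. The function name, the parameters, and the two-blob demo data are all assumptions made for this example.

```python
import numpy as np


def fuzzy_c_means(X, k, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch (assumed stand-in, not the lab's routine).

    Returns the membership matrix U (cases x clusters, rows sum to 1)
    and the cluster centers.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], k))
    U /= U.sum(axis=1, keepdims=True)          # random initial memberships
    for _ in range(n_iter):
        W = U ** m                              # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every case to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers


# Tiny made-up demo: two compact, well-separated blobs in 2 dimensions.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(5.0, 0.5, size=(30, 2)),
])
U, centers = fuzzy_c_means(X_demo, k=2)

# Harden the fuzzy memberships into a single cluster id per case.
cluster_id = U.argmax(axis=1) + 1
print(np.round(U[:3], 3))
print(cluster_id)
```

Because the update uses plain Euclidean distance to each center, this kind of method also favors compact, roughly spherical groups, which is consistent with the way the fuzzy partition slices the elongated crabs clusters along the direction of maximum variation.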
6. In the usual cluster analysis setting we do not know the correct group of a case, so it is common
to run several analyses and then compare the results. Cross-tabulate the cluster ids for the agglomerative
hierarchical Euclidean Ward’s method with those for the fuzzy partitioning. How do they compare?
cluster.id|cluster.id1
       |1      |2      |3      |4      |RowTotl|
-------+-------+-------+-------+-------+-------+
1      | 0     |14     |68     |10     |92     |
       |       |       |       |       |0.46   |
-------+-------+-------+-------+-------+-------+
2      | 0     | 0     | 0     |47     |47     |
       |       |       |       |       |0.24   |
-------+-------+-------+-------+-------+-------+
3      | 7     |36     | 0     | 0     |43     |
       |       |       |       |       |0.22   |
-------+-------+-------+-------+-------+-------+
4      |18     | 0     | 0     | 0     |18     |
       |       |       |       |       |0.09   |
-------+-------+-------+-------+-------+-------+
ColTotl|25     |50     |68     |57     |200    |
       |0.12   |0.25   |0.34   |0.28   |       |
-------+-------+-------+-------+-------+-------+
There is a fair amount of agreement between the two methods, as evidenced by the large number of empty
cells. Roughly the assignment of clusters between the two methods corresponds as follows:
Hierarch  1  2  3  4
Fuzzy     3  4  2  1
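The correspondence can be read off mechanically by taking, for each hierarchical cluster, the fuzzy cluster that receives most of its cases. The short sketch below does this with the counts from the cross-tabulation above; the variable names are assumptions for illustration.

```python
# Counts copied from the cross-tabulation above:
# keys are hierarchical (Ward) cluster ids, lists are fuzzy cluster ids 1-4.
agreement_table = {
    1: [0, 14, 68, 10],
    2: [0, 0, 0, 47],
    3: [7, 36, 0, 0],
    4: [18, 0, 0, 0],
}

mapping = {}
matched = 0
for hier_id, counts in agreement_table.items():
    # Position of the largest count in this row (0-based).
    best = max(range(len(counts)), key=counts.__getitem__)
    mapping[hier_id] = best + 1      # convert to a 1-based fuzzy cluster id
    matched += counts[best]

total = sum(sum(counts) for counts in agreement_table.values())
print(mapping)          # {1: 3, 2: 4, 3: 2, 4: 1}
print(matched / total)  # 0.845
```

Under this matching, 169 of the 200 cases (68 + 47 + 36 + 18) are placed in corresponding clusters by the two methods. Taking the row-wise maximum happens to give a one-to-one mapping here; that is not guaranteed in general.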