HW 4 Answers

1. Consider the xy coordinates of 7 points shown in Table 1.
(a) Construct the distance matrix using Euclidean distance and perform single and complete link hierarchical clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged.
(b) Following (a), compute the cophenetic correlation coefficient for the derived dendrograms.

(a) The distance matrix:

         p1    p2    p3    p4    p5    p6    p7
  p1    0.00  0.23  0.22  0.37  0.34  0.24  0.19
  p2          0.00  0.14  0.19  0.14  0.24  0.06
  p3                0.00  0.16  0.28  0.10  0.17
  p4                      0.00  0.28  0.22  0.25
  p5                            0.00  0.39  0.15
  p6                                  0.00  0.26
  p7                                        0.00

Single link:

Step 0: every point is its own cluster (the matrix above).

Step 1: merge p2 and p7 at distance 0.06.

           p1    p2,p7  p3    p4    p5    p6
  p1      0.00  0.19   0.22  0.37  0.34  0.24
  p2,p7         0.00   0.14  0.19  0.14  0.24
  p3                   0.00  0.16  0.28  0.10
  p4                         0.00  0.28  0.22
  p5                               0.00  0.39
  p6                                     0.00

Step 2: merge p3 and p6 at 0.10.

           p1    p2,p7  p3,p6  p4    p5
  p1      0.00  0.19   0.22   0.37  0.34
  p2,p7         0.00   0.14   0.19  0.14
  p3,p6                0.00   0.16  0.28
  p4                          0.00  0.28
  p5                                0.00

Step 3: there is a tie at 0.14 between d({p2,p7}, {p3,p6}) and d({p2,p7}, p5), so two merge orders are possible. Case 1 merges p5 into {p2,p7} first:

              p1    p2,p5,p7  p3,p6  p4
  p1         0.00  0.19      0.22   0.37
  p2,p5,p7         0.00      0.14   0.19
  p3,p6                      0.00   0.16
  p4                                0.00

Step 4: merge {p2,p5,p7} and {p3,p6} at 0.14.

                    p1    p2,p3,p5,p6,p7  p4
  p1               0.00  0.19            0.37
  p2,p3,p5,p6,p7         0.00            0.16
  p4                                     0.00

Step 5: merge in p4 at 0.16.

              p1    p2,p3,p4,p5,p6,p7
  p1         0.00  0.19
  p2,...,p7        0.00

Step 6: merge in p1 at 0.19.

Two possible dendrograms for single link hierarchical clustering:
  Case 1 (merge p5 into {p2,p7} first): leaf order 2, 7, 5, 3, 6, 4, 1.
  Case 2 (merge {p3,p6} into {p2,p7} first): leaf order 2, 7, 3, 6, 5, 4, 1.
In both cases the merge heights are 0.06, 0.10, 0.14, 0.14, 0.16, 0.19.

(b) The cophenetic distance matrix for single link clustering (the same for both cases):

         p1    p2    p3    p4    p5    p6    p7
  p1    0.00  0.19  0.19  0.19  0.19  0.19  0.19
  p2          0.00  0.14  0.16  0.14  0.14  0.06
  p3                0.00  0.16  0.14  0.10  0.14
  p4                      0.00  0.16  0.16  0.16
  p5                            0.00  0.14  0.14
  p6                                  0.00  0.14
  p7                                        0.00

(a) The dendrogram for complete link clustering: merge {p2,p7} at 0.06, {p3,p6} at 0.10, p5 into {p2,p7} at 0.15, p4 into {p3,p6} at 0.22, p1 into {p2,p5,p7} at 0.34, and the final two clusters at 0.39; leaf order 2, 7, 5, 1, 3, 6, 4.

(b) The cophenetic distance matrix for complete link clustering:

         p1    p2    p3    p4    p5    p6    p7
  p1    0.00  0.34  0.39  0.39  0.34  0.39  0.34
  p2          0.00  0.39  0.39  0.15  0.39  0.06
  p3                0.00  0.22  0.39  0.10  0.39
  p4                      0.00  0.39  0.22  0.39
  p5                            0.00  0.39  0.15
  p6                                  0.00  0.39
  p7                                        0.00

The cophenetic correlation coefficient asked for in (b) is the correlation between the entries of each cophenetic distance matrix and the corresponding entries of the original distance matrix.

2. Consider the four faces shown in Figure 2. Again, darkness or number of dots represents density. Lines are used only to distinguish regions and do not represent points.
(a) For each figure, could you use single link to find the patterns represented by the nose, eyes, and mouth? Explain.
(b) For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth? Explain.

Ans:
(a) Only for (b) and (d). For (b), the points in the nose, eyes, and mouth are much closer together than the points between these areas. For (d), there is only empty space between these regions.
(b) Only for (b) and (d). For (b), K-means would find the nose, eyes, and mouth, but the lower-density points would also be included. For (d), K-means would find the nose, eyes, and mouth straightforwardly as long as the number of clusters was set to 4.

3. Compute the entropy and purity for the confusion matrix in Table 2.
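The single and complete link results in problem 1 can be reproduced mechanically from the distance matrix. A minimal sketch, assuming numpy and scipy are available; `cophenet` also returns the cophenetic correlation coefficient asked for in 1(b):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform

# Full symmetric distance matrix from problem 1.
D = np.array([
    [0.00, 0.23, 0.22, 0.37, 0.34, 0.24, 0.19],
    [0.23, 0.00, 0.14, 0.19, 0.14, 0.24, 0.06],
    [0.22, 0.14, 0.00, 0.16, 0.28, 0.10, 0.17],
    [0.37, 0.19, 0.16, 0.00, 0.28, 0.22, 0.25],
    [0.34, 0.14, 0.28, 0.28, 0.00, 0.39, 0.15],
    [0.24, 0.24, 0.10, 0.22, 0.39, 0.00, 0.26],
    [0.19, 0.06, 0.17, 0.25, 0.15, 0.26, 0.00],
])
d = squareform(D)  # condensed form expected by linkage()

for method in ("single", "complete"):
    Z = linkage(d, method=method)
    # c is the cophenetic correlation coefficient for this dendrogram;
    # coph holds the pairwise cophenetic distances (condensed form).
    c, coph = cophenet(Z, d)
    print(method, "merge heights:", Z[:, 2])
    print(method, "CPCC:", round(c, 2))
    print(squareform(coph).round(2))  # cophenetic distance matrix
```

For single link the merge heights come out as 0.06, 0.10, 0.14, 0.14, 0.16, 0.19, and for complete link as 0.06, 0.10, 0.15, 0.22, 0.34, 0.39, matching the dendrograms above.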
Notation: m_ij is the number of objects of class j in cluster i, m_i is the number of objects in cluster i, m is the total number of data points, K is the number of clusters, and L is the number of classes (ground truth, given). Then p_ij = m_ij/m_i is the probability that a member of cluster i belongs to class j.

- Purity of cluster i: purity_i = max_j p_ij. Overall purity: sum_i (m_i/m) purity_i.
- Entropy of cluster i: e_i = -sum_{j=1..L} p_ij log2 p_ij. Overall entropy: e = sum_{i=1..K} (m_i/m) e_i.

Purity (cluster #1): 676/693 = 0.98
Purity (cluster #2): 827/1562 = 0.53
Purity (cluster #3): 465/949 = 0.49
Purity (total): (693/3204)(0.98) + (1562/3204)(0.53) + (949/3204)(0.49) = 0.61

Entropy (cluster #1):
  -[(1/693)log2(1/693) + (1/693)log2(1/693) + (11/693)log2(11/693) + (4/693)log2(4/693) + (676/693)log2(676/693)] ≈ 0.20

Entropy (cluster #2):
  -[(27/1562)log2(27/1562) + (89/1562)log2(89/1562) + (333/1562)log2(333/1562) + (827/1562)log2(827/1562) + (253/1562)log2(253/1562) + (33/1562)log2(33/1562)] ≈ 1.84
  (253 = 1562 - 27 - 89 - 333 - 827 - 33, so that the class counts sum to the cluster size.)

Entropy (cluster #3):
  -[(326/949)log2(326/949) + (465/949)log2(465/949) + (8/949)log2(8/949) + (105/949)log2(105/949) + (16/949)log2(16/949) + (29/949)log2(29/949)] ≈ 1.70

Entropy (total):
  e = sum_i (m_i/m) e_i = (693/3204)(0.20) + (1562/3204)(1.84) + (949/3204)(1.70) ≈ 1.44

4. Using the distance matrix in Table 3, compute the silhouette coefficient for each point, each of the two clusters, and the overall clustering. (Cluster 1 contains {P1, P2} and Cluster 2 contains {P3, P4}.)

The silhouette coefficient combines the ideas of cohesion and separation, and is defined for individual points as well as for clusters and clusterings. For an individual point i:
- a = average distance of i to the points in its own cluster (intra-cluster average);
- b = minimum, over the other clusters, of the average distance of i to the points of that cluster (nearest other-cluster average);
- s = 1 - a/b if a < b (the usual case), or s = b/a - 1 if a >= b.
s typically lies between 0 and 1, and the closer to 1 the better. The average silhouette width of a cluster, or of the whole clustering, is the average of s over its points.

With Cluster 1 = {P1, P2} and Cluster 2 = {P3, P4}, apply these formulas to the Table 3 distances to obtain s for each point; the cluster and overall values are the corresponding averages.
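Since Table 3 is not reproduced in this answer sheet, here is a sketch of the computation pattern for the two clusters above; the `dist` values below are made-up placeholders, not the Table 3 data:

```python
# Silhouette sketch for clusters {P1, P2} and {P3, P4}.
# NOTE: hypothetical distances standing in for Table 3.
dist = {
    ("P1", "P2"): 0.10, ("P1", "P3"): 0.60, ("P1", "P4"): 0.50,
    ("P2", "P3"): 0.70, ("P2", "P4"): 0.80, ("P3", "P4"): 0.20,
}

def d(p, q):
    """Symmetric lookup into the pairwise distance table."""
    return 0.0 if p == q else dist.get((p, q), dist.get((q, p)))

clusters = {"C1": ["P1", "P2"], "C2": ["P3", "P4"]}

def silhouette(point, own):
    # a = average distance to the other points in the same cluster
    a = sum(d(point, q) for q in clusters[own] if q != point) / (len(clusters[own]) - 1)
    # b = minimum over other clusters of the average distance to that cluster
    b = min(
        sum(d(point, q) for q in pts) / len(pts)
        for name, pts in clusters.items() if name != own
    )
    return 1 - a / b if a < b else b / a - 1

scores = {p: silhouette(p, c) for c, pts in clusters.items() for p in pts}
for p in sorted(scores):
    print(p, round(scores[p], 3))
print("overall:", round(sum(scores.values()) / len(scores), 3))
```

Cluster averages are computed the same way, averaging s over each cluster's points.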
5. Given the set of cluster labels and similarity matrix shown in Tables 4 and 5, respectively, compute the correlation between the similarity matrix and the ideal similarity matrix, i.e., the matrix whose ij-th entry is 1 if two objects belong to the same cluster, and 0 otherwise.

Ideal similarity matrix:

  1 1 0 0
  1 1 0 0
  0 0 1 1
  0 0 1 1

Taking the six off-diagonal pairs in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4):

  y = <1, 0, 0, 0, 0, 1>                  (ideal similarity matrix)
  x = <0.8, 0.65, 0.55, 0.7, 0.6, 0.9>    (similarity matrix, Table 5)

Sample statistics, with sigma = sqrt(sum_i (x_i - x_bar)^2 / (n - 1)) (note: take the square root to obtain sigma):

  x_bar = 0.7,  sigma_x = 0.13
  y_bar = 0.33, sigma_y = 0.52

Corr = [sum_i (x_i - x_bar)(y_i - y_bar) / (n - 1)] / (sigma_x sigma_y)
     = [(0.8 - 0.7)(1 - 0.33) + (0.65 - 0.7)(0 - 0.33) + (0.55 - 0.7)(0 - 0.33)
        + (0.7 - 0.7)(0 - 0.33) + (0.6 - 0.7)(0 - 0.33) + (0.9 - 0.7)(1 - 0.33)] / (5 × 0.13 × 0.52)
     = 0.3 / 0.338 ≈ 0.89

(The covariance must be normalized by n - 1 = 5, the same normalization used for the sample standard deviations above; dividing by n = 6 instead gives the inconsistent value 0.74.)

6. Compute the hierarchical F-measure for the eight objects {p1, p2, p3, p4, p5, p6, p7, p8} and the hierarchical clustering shown in Figure 3. Class A contains points p1, p2, and p3, while p4, p5, p6, p7, and p8 belong to class B.

For class i and cluster j, F(i, j) is the F-measure computed from precision P(i, j) and recall R(i, j). The hierarchical F-measure is

  F = sum_i (m_i/m) max_j F(i, j),

where the maximum is taken over all clusters j at all levels of the hierarchy, m_i is the number of objects in class i, and m is the total number of objects.

Class A: {p1, p2, p3}; Class B: {p4, p5, p6, p7, p8}.

Class B: R(B, 1) = 5/5 = 1 and P(B, 1) = 5/8 = 0.625, so F(B, 1) = 2(1)(0.625)/(1 + 0.625) ≈ 0.77, which is the best F-value for class B.

Overall clustering:
  F = sum_i (m_i/m) max_j F(i, j) = (3/8) F(A) + (5/8) F(B) = (3/8)(0.8) + (5/8)(0.77) ≈ 0.78,
where F(A) = 0.8 is the best F-value attained by class A over the clusters of Figure 3.

7. Figure 4 shows a clustering of a two-dimensional point data set with two clusters: the leftmost cluster, whose points are marked by asterisks, is somewhat diffuse, while the rightmost cluster, whose points are marked by circles, is compact. To the right of the compact cluster, there is a single point (marked by an arrow) that belongs to the diffuse cluster, whose center is farther away than that of the compact cluster. Explain why this is possible with EM clustering but not with K-means clustering.

Ans: In EM clustering, we compute the probability that a point belongs to a cluster. In turn, this probability depends on both the distance from the cluster center and the spread (variance) of the cluster.
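A one-dimensional sketch with made-up centers and spreads (not the actual Figure 4 data) illustrates the effect:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical 1-D stand-ins for the two clusters: a diffuse cluster
# centered at 0 (sigma = 3.0) and a compact one at 5 (sigma = 0.5).
point = 6.5
p_diffuse = gaussian_pdf(point, 0.0, 3.0)
p_compact = gaussian_pdf(point, 5.0, 0.5)

print(abs(point - 5.0) < abs(point - 0.0))  # True: closer to the compact center
print(p_diffuse > p_compact)                # True: yet more likely under the diffuse one
```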
Hence, a point that is closer to the centroid of one cluster than to another can still have a higher membership probability for the more distant cluster if that cluster has the larger spread. K-means, by contrast, takes only the distance to the closest centroid into account when assigning points to clusters; it is equivalent to an EM approach in which all clusters are assumed to have the same variance.
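As a numerical check on problem 5, the correlation can be computed directly from the two stated vectors; normalizing the covariance and the standard deviations consistently by n - 1 gives r ≈ 0.89:

```python
import math

# Problem 5 vectors: off-diagonal similarities and ideal (same-cluster) indicators.
x = [0.8, 0.65, 0.55, 0.7, 0.6, 0.9]
y = [1, 0, 0, 0, 0, 1]
n = len(x)

mx = sum(x) / n
my = sum(y) / n
# Sample statistics, all with the same n - 1 normalization.
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))

r = cov / (sx * sy)
print(round(mx, 2), round(sx, 2), round(my, 2), round(sy, 2))  # 0.7 0.13 0.33 0.52
print(round(r, 2))  # 0.89
```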