HW 4 Answers
1.
Consider the xy coordinates of 7 points shown in Table 1.
(a) Construct the distance matrix using Euclidean distance, and
perform single and complete link hierarchical clustering. Show your
results by drawing a dendrogram. The dendrogram should
clearly show the order in which the points are merged.
(b) Following (a), compute the cophenetic correlation coefficient for
the derived dendrograms.
(a) The distance matrix:

Step 0:
       p1    p2    p3    p4    p5    p6    p7
p1    0.00  0.23  0.22  0.37  0.34  0.24  0.19
p2          0.00  0.14  0.19  0.14  0.24  0.06
p3                0.00  0.16  0.28  0.10  0.17
p4                      0.00  0.28  0.22  0.25
p5                            0.00  0.39  0.15
p6                                  0.00  0.26
p7                                        0.00

Single link. Step 1 (merge p2 and p7 at 0.06):
        p1    p2,p7  p3    p4    p5    p6
p1     0.00  0.19   0.22  0.37  0.34  0.24
p2,p7        0.00   0.14  0.19  0.14  0.24
p3                  0.00  0.16  0.28  0.10
p4                        0.00  0.28  0.22
p5                              0.00  0.39
p6                                    0.00

Step 2 (merge p3 and p6 at 0.10):
        p1    p2,p7  p3,p6  p4    p5
p1     0.00  0.19   0.22   0.37  0.34
p2,p7        0.00   0.14   0.19  0.14
p3,p6               0.00   0.16  0.28
p4                         0.00  0.28
p5                               0.00

At this point there is a tie at 0.14: {p2,p7} is equally close to p5 and
to {p3,p6}, so two merge orders are possible. Taking Case 1 first:

Step 3 (merge p5 into {p2,p7} at 0.14):
           p1    p2,p5,p7  p3,p6  p4
p1        0.00  0.19      0.22   0.37
p2,p5,p7        0.00      0.14   0.19
p3,p6                     0.00   0.16
p4                               0.00

Step 4 (merge {p2,p5,p7} and {p3,p6} at 0.14):
                 p1    p2,p3,p5,p6,p7  p4
p1              0.00  0.19            0.37
p2,p3,p5,p6,p7        0.00            0.16
p4                                    0.00

Step 5 (merge p4 at 0.16):
                   p1    p2,p3,p4,p5,p6,p7
p1                0.00  0.19
p2,p3,p4,p5,p6,p7       0.00

Finally, p1 joins at 0.19.
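As a quick check, the single-link merge order can be reproduced with SciPy (a minimal sketch; SciPy is not part of the original assignment). `linkage` takes the upper-triangle distances in condensed form; note that SciPy breaks the tie at 0.14 in one fixed way, so it reports only one of the two possible dendrograms.

```python
from scipy.cluster.hierarchy import linkage

# Condensed distance vector for p1..p7: the upper triangle of the
# distance matrix above, read row by row.
d = [0.23, 0.22, 0.37, 0.34, 0.24, 0.19,  # p1 vs p2..p7
     0.14, 0.19, 0.14, 0.24, 0.06,        # p2 vs p3..p7
     0.16, 0.28, 0.10, 0.17,              # p3 vs p4..p7
     0.28, 0.22, 0.25,                    # p4 vs p5..p7
     0.39, 0.15,                          # p5 vs p6..p7
     0.26]                                # p6 vs p7

# Each row of Z records one merge: (cluster a, cluster b, height, size),
# with leaves 0..6 standing for p1..p7.
Z = linkage(d, method="single")
print(Z)  # merge heights 0.06, 0.10, 0.14, 0.14, 0.16, 0.19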
Two possible dendrograms for single link hierarchical clustering:

(a) Case 1: merge p5 into {p2,p7} first.
[Dendrogram, leaf order 2, 7, 5, 3, 6, 4, 1]

(a) Case 2: merge {p3,p6} with {p2,p7} first.
[Dendrogram, leaf order 2, 7, 3, 6, 5, 4, 1]
(b) The cophenetic distance matrix for single link clustering (the
cophenetic correlation coefficient is the correlation between these
entries and the corresponding entries of the original distance matrix):
      p1    p2    p3    p4    p5    p6    p7
p1   0.00  0.19  0.19  0.19  0.19  0.19  0.19
p2         0.00  0.14  0.16  0.14  0.14  0.06
p3               0.00  0.16  0.14  0.10  0.14
p4                     0.00  0.16  0.16  0.16
p5                           0.00  0.14  0.14
p6                                 0.00  0.14
p7                                       0.00
(a) The dendrogram for complete link clustering:
[Dendrogram, leaf order 2, 7, 5, 1, 3, 6, 4: p2,p7 merge at 0.06; p3,p6
at 0.10; p5 joins {p2,p7} at 0.15; p4 joins {p3,p6} at 0.22; p1 joins
{p2,p5,p7} at 0.34; the final merge is at 0.39.]
(b) The cophenetic distance matrix for complete link clustering:
      p1    p2    p3    p4    p5    p6    p7
p1   0.00  0.34  0.39  0.39  0.34  0.39  0.34
p2         0.00  0.39  0.39  0.15  0.39  0.06
p3               0.00  0.22  0.39  0.10  0.39
p4                     0.00  0.39  0.22  0.39
p5                           0.00  0.39  0.15
p6                                 0.00  0.39
p7                                       0.00
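Both cophenetic matrices, and the cophenetic correlation coefficients asked for in (b), can be checked with `scipy.cluster.hierarchy.cophenet` (a sketch using the same condensed distances as above; the exact CPCC values are produced by SciPy rather than quoted from this answer key).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Same condensed upper-triangle distances for p1..p7 as above.
d = np.array([0.23, 0.22, 0.37, 0.34, 0.24, 0.19,
              0.14, 0.19, 0.14, 0.24, 0.06,
              0.16, 0.28, 0.10, 0.17,
              0.28, 0.22, 0.25,
              0.39, 0.15,
              0.26])

for method in ("single", "complete"):
    Z = linkage(d, method=method)
    # With the original distances supplied, cophenet returns the
    # cophenetic correlation coefficient and the cophenetic distances.
    cpcc, coph = cophenet(Z, d)
    print(method, "link CPCC:", round(cpcc, 3))
    print(squareform(coph).round(2))  # full cophenetic distance matrix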
2.
Consider the following four faces shown in Figure 2. Again,
darkness or number of dots represents density. Lines are used
only to distinguish regions and do not represent points.
(a) For each figure, could you use single link to find the patterns
represented by the nose, eyes, and mouth? Explain.
(b) For each figure, could you use K-means to find the patterns
represented by the nose, eyes, and mouth? Explain.
Ans:
(a) Only for (b) and (d).
For (b), the points in the nose, eyes, and mouth are much closer
together than the points between these areas.
For (d), there is only empty space between these regions.
(b) Only for (b) and (d).
For (b), K-means would find the nose, eyes, and mouth, but the
lower-density points would also be included.
For (d), K-means would find the nose, eyes, and mouth
straightforwardly as long as the number of clusters was set to 4.
3. Compute the entropy and purity for the confusion matrix in Table 2.
(Rows of the confusion matrix are clusters i; columns are classes j.)

- The purity of a cluster: $p_i = \max_j p_{ij}$
- The overall purity: $\mathrm{purity} = \sum_{i=1}^{K} \frac{m_i}{m} p_i$

Purity (cluster #1): 676/693 ≈ 0.98
Purity (cluster #2): 827/1562 ≈ 0.53
Purity (cluster #3): 465/949 ≈ 0.49
Purity (total): (693/3204)(0.98) + (1562/3204)(0.53) + (949/3204)(0.49) ≈ 0.61
• Entropy
– $p_{ij}$: the probability that a member of cluster i belongs to class j,
$p_{ij} = m_{ij}/m_i$
  • $m_{ij}$: the number of objects of class j in cluster i
  • $m_i$: the number of objects in cluster i
– The entropy of a cluster: $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$
  • L: the number of classes (ground truth, given)
– The entropy of a clustering is the weighted total entropy:
$e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$
  • m: the total number of data points
  • K: the number of clusters
Entropy (cluster #1):
$e_1 = -\frac{1}{693}\log_2\frac{1}{693} - \frac{1}{693}\log_2\frac{1}{693} - \frac{11}{693}\log_2\frac{11}{693} - \frac{4}{693}\log_2\frac{4}{693} - \frac{676}{693}\log_2\frac{676}{693} \approx 0.20$

Entropy (cluster #2):
$e_2 = -\frac{27}{1562}\log_2\frac{27}{1562} - \frac{89}{1562}\log_2\frac{89}{1562} - \frac{333}{1562}\log_2\frac{333}{1562} - \frac{827}{1562}\log_2\frac{827}{1562} - \frac{253}{1562}\log_2\frac{253}{1562} - \frac{33}{1562}\log_2\frac{33}{1562} \approx 1.84$
(The 253 term is the remaining class count: 1562 − 27 − 89 − 333 − 827 − 33 = 253.)

Entropy (cluster #3):
$e_3 = -\frac{326}{949}\log_2\frac{326}{949} - \frac{465}{949}\log_2\frac{465}{949} - \frac{8}{949}\log_2\frac{8}{949} - \frac{105}{949}\log_2\frac{105}{949} - \frac{16}{949}\log_2\frac{16}{949} - \frac{29}{949}\log_2\frac{29}{949} \approx 1.70$

Entropy (total):
$e = \sum_{i=1}^{K}\frac{m_i}{m}e_i = \frac{693}{3204}(0.20) + \frac{1562}{3204}(1.84) + \frac{949}{3204}(1.70) \approx 1.44$
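The purity and entropy arithmetic can be verified with a few lines of NumPy (a sketch; the 253 in cluster 2 is the inferred remainder noted above):

```python
import numpy as np

# Confusion matrix from Table 2: rows are clusters, columns are classes.
# The 253 in row 2 is inferred from the row total 1562 (see note above).
M = np.array([[  1,   1,   0,  11,   4, 676],
              [ 27,  89, 333, 827, 253,  33],
              [326, 465,   8, 105,  16,  29]], dtype=float)

m_i = M.sum(axis=1)                 # cluster sizes: 693, 1562, 949
m = M.sum()                         # 3204

purity_i = M.max(axis=1) / m_i      # 0.98, 0.53, 0.49
purity = (m_i / m) @ purity_i       # ~0.61

P = M / m_i[:, None]                # p_ij = m_ij / m_i
P_safe = np.where(P > 0, P, 1)      # log2(1) = 0 handles empty classes
e_i = -(P * np.log2(P_safe)).sum(axis=1)   # 0.20, 1.84, 1.70
e = (m_i / m) @ e_i                 # ~1.44

print(purity_i.round(2), round(purity, 2))
print(e_i.round(2), round(e, 2))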
4.
Using the distance matrix in Table 3, compute the silhouette
coefficient for each point, each of the two clusters, and the overall
clustering. (Cluster 1 contains {P1, P2} and Cluster 2 contains
{P3, P4}.)

Internal Measures: Silhouette Coefficient
• The silhouette coefficient combines ideas of both cohesion and
separation, for individual points as well as for clusters and clusterings.
• For an individual point i:
– Calculate a = the average distance of i to the points in its own cluster
(the within-cluster average).
– Calculate b = min(average distance of i to the points in another cluster)
(the smallest between-cluster average).
– The silhouette coefficient for the point is then
s = 1 − a/b if a < b
(or s = b/a − 1 if a ≥ b, which is not the usual case).
– It typically lies between 0 and 1; the closer to 1, the better.
• The average silhouette width can then be calculated for a cluster or for
a whole clustering.

Cluster 1: {P1, P2}
Cluster 2: {P3, P4}
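Table 3 itself is not reproduced in this answer sheet, so the sketch below uses a stand-in 4×4 distance matrix (hypothetical values) purely to show the mechanics; substituting the real Table 3 distances gives the required per-point, per-cluster, and overall values. scikit-learn computes s = (b − a)/max(a, b), which equals 1 − a/b in the usual a < b case.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Stand-in for Table 3 (hypothetical values): a symmetric 4x4 distance
# matrix over P1..P4. Replace with the actual distances from Table 3.
D = np.array([[0.00, 0.10, 0.65, 0.55],
              [0.10, 0.00, 0.70, 0.60],
              [0.65, 0.70, 0.00, 0.30],
              [0.55, 0.60, 0.30, 0.00]])
labels = np.array([0, 0, 1, 1])  # Cluster 1 = {P1, P2}, Cluster 2 = {P3, P4}

s = silhouette_samples(D, labels, metric="precomputed")  # per point
print("per point:", s.round(3))
print("cluster 1:", s[labels == 0].mean().round(3))      # cluster averages
print("cluster 2:", s[labels == 1].mean().round(3))
print("overall:  ", round(silhouette_score(D, labels, metric="precomputed"), 3))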
5. Given the set of cluster labels and similarity matrix shown in Tables
4 and 5, respectively, compute the correlation between the similarity
matrix and the ideal similarity matrix, i.e., the matrix whose ij-th
entry is 1 if two objects belong to the same cluster, and 0
otherwise.

Ideal similarity matrix (P1 and P2 in one cluster, P3 and P4 in the other):
1 1 0 0
1 1 0 0
0 0 1 1
0 0 1 1

Taking the entries above the diagonal in row order gives
y = <1, 0, 0, 0, 0, 1>
and the corresponding similarities
x = <0.8, 0.65, 0.55, 0.7, 0.6, 0.9>.

$\bar{x} = 0.7$, $\sigma_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}} \approx 0.13$;
$\bar{y} = 0.33$, $\sigma_y \approx 0.52$.
(Note: σ is the square root of the variance.)

$\mathrm{Corr} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n\,\sigma_x\,\sigma_y}$
$= \frac{(0.8-0.7)(1-0.33) + (0.65-0.7)(0-0.33) + (0.55-0.7)(0-0.33) + (0.7-0.7)(0-0.33) + (0.6-0.7)(0-0.33) + (0.9-0.7)(1-0.33)}{6 \times 0.13 \times 0.52} \approx 0.74$
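A quick NumPy check of the arithmetic (a sketch that mirrors the convention used above: sample standard deviations combined with an n-denominator covariance):

```python
import numpy as np

x = np.array([0.8, 0.65, 0.55, 0.7, 0.6, 0.9])     # similarities
y = np.array([1, 0, 0, 0, 0, 1], dtype=float)      # ideal similarities

n = len(x)
cov = ((x - x.mean()) * (y - y.mean())).sum() / n  # n-denominator covariance
corr = cov / (x.std(ddof=1) * y.std(ddof=1))       # sample std devs
print(round(corr, 2))                              # 0.74, as above

# np.corrcoef normalizes the covariance and the standard deviations
# consistently, so it returns a somewhat larger value (~0.89).
print(round(np.corrcoef(x, y)[0, 1], 2))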
6.
Compute the hierarchical F-measure for the eight objects {p1, p2,
p3, p4, p5, p6, p7, p8} and the hierarchical clustering shown in Figure
3. Class A contains points p1, p2, and p3, while p4, p5, p6, p7,
and p8 belong to class B.

F-measure for class i and cluster j:
precision $P(i,j) = m_{ij}/m_j$, recall $R(i,j) = m_{ij}/m_i$, and
$F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j)+R(i,j)}$

Hierarchical F-measure:
$F = \sum_i \frac{m_i}{m} \max_j F(i,j)$
• The maximum is taken over all clusters j at all levels.
• $m_i$ is the number of objects in class i; m is the total number of objects.

Class A: {p1, p2, p3}
Class B: {p4, p5, p6, p7, p8}

Class B: against cluster 1 (the root cluster containing all eight points),
R(B,1) = 5/5 = 1 and P(B,1) = 5/8 = 0.625, so
F(B,1) = 2(0.625)(1)/(0.625 + 1) ≈ 0.77, which is the maximum over all
clusters: F(B) = 0.77.
Class A: from Figure 3, the best cluster for class A gives F(A) = 0.8.

Overall clustering:
$F = \sum_i \frac{m_i}{m} \max_j F(i,j) = \frac{3}{8}F(A) + \frac{5}{8}F(B) = \frac{3}{8}(0.8) + \frac{5}{8}(0.77) \approx 0.78$
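The computation packages neatly into a small helper (a sketch; `f_measure` is not from the original answer key, and the F(A) = 0.8 value is read from Figure 3 as noted above):

```python
def f_measure(m_ij: float, m_i: float, m_j: float) -> float:
    """F(i, j): harmonic mean of precision m_ij/m_j and recall m_ij/m_i."""
    p, r = m_ij / m_j, m_ij / m_i
    return 2 * p * r / (p + r)

f_b = f_measure(5, 5, 8)                 # class B vs the root cluster: ~0.77
f_a = 0.8                                # best F for class A, from Figure 3
overall = 3 / 8 * f_a + 5 / 8 * f_b
print(round(f_b, 2), round(overall, 2))  # 0.77 0.78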
7. Figure 4 shows a clustering of a two-dimensional point data set with
two clusters: The leftmost cluster, whose points are marked by
asterisks, is somewhat diffuse, while the rightmost cluster, whose
points are marked by circles, is compact. To the right of the compact
cluster, there is a single point (marked by an arrow) that belongs to the
diffuse cluster, whose center is farther away than that of the compact
cluster. Explain why this is possible with EM clustering, but not K-means clustering.
Ans:
In EM clustering, we compute the probability that a point belongs to
a cluster. In turn, this probability depends on both the distance from
the cluster center and the spread (variance) of the cluster. Hence, a
point that is closer to the centroid of one cluster than another can
still have a higher probability with respect to the more distant cluster
if that cluster has a higher spread than the closer cluster. K-means
only takes into account the distance to the closest cluster when
assigning points to clusters. This is equivalent to an EM approach
where all clusters are assumed to have the same variance.
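A one-dimensional toy example (hypothetical centers and variances, not taken from Figure 4) makes this concrete: a point can be closer to the compact cluster's center yet more likely under the diffuse cluster's density.

```python
from scipy.stats import norm

# Hypothetical 1-D clusters: a diffuse one at 0 (sigma = 3) and a
# compact one at 5 (sigma = 0.5).
diffuse = norm(loc=0.0, scale=3.0)
compact = norm(loc=5.0, scale=0.5)

x = 6.5  # 6.5 away from the diffuse center, only 1.5 from the compact one
print(diffuse.pdf(x))  # ~0.0127
print(compact.pdf(x))  # ~0.0089: lower, despite the smaller distance

# EM's soft assignment follows these densities, so x leans toward the
# diffuse cluster; K-means, using distance alone, assigns it to the
# compact one.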