Cluster Validity

CSE 881: Data Mining
Lecture 20: Cluster Validation
Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall

• For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters?

• But "clusters are in the eye of the beholder"!

• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters
[Figure: Clusters found in Random Data — four panels of points in the unit square: Random Points, and the partitions imposed on them by DBSCAN, K-means, and Complete Link. Each algorithm finds apparent clusters even though the points are random.]
Measures of Cluster Validity

• Internal Index (Unsupervised):
  Used to measure the goodness of a clustering structure without respect to external information.
  – Sum of Squared Error (SSE)

• External Index (Supervised):
  Used to measure the extent to which cluster labels match externally supplied class labels.
  – Entropy

• Relative Index:
  Used to compare two different clusterings or clusters.
  – Often an external or internal index is used for this purpose, e.g., SSE or entropy.
Unsupervised Cluster Validation

• Cluster evaluation based on the proximity matrix
  – Correlation between proximity and incidence matrices
  – Visualization of the proximity matrix

• Cluster evaluation based on cohesion and separation
Measuring Cluster Validity Via Correlation

• Two matrices
  – Proximity Matrix
  – "Incidence" Matrix
      One row and one column for each data point
      An entry is 1 if the associated pair of points belongs to the same cluster
      An entry is 0 if the associated pair of points belongs to different clusters

• Compute the correlation between the proximity and incidence matrices
  – Since the matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated.

• High correlation (negative, when the proximity is a distance) indicates that points that belong to the same cluster are close to each other.

• Not a good measure for some density- or contiguity-based clusters
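A minimal sketch of this computation in Python (the blob data, k = 3, and seed are invented for illustration; scikit-learn and SciPy are assumed):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: three well-separated Gaussian blobs (illustrative only)
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Condensed upper-triangular forms: n(n-1)/2 entries each
proximity = pdist(X)                      # pairwise Euclidean distances
incidence = pdist(labels[:, None],        # 1 if same cluster, else 0
                  lambda a, b: float(a[0] == b[0]))

# Pearson correlation between the two matrices; strongly negative when
# same-cluster pairs tend to have small distances
corr = np.corrcoef(proximity, incidence)[0, 1]
print(f"correlation = {corr:.4f}")
```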
Measuring Cluster Validity Via Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: left — three well-separated clusters, Corr = -0.9235; right — random points, Corr = -0.5810.]
‹#›
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually

[Figure: left — scatter plot of three well-separated clusters; right — the corresponding 100 x 100 similarity matrix, sorted by cluster label, showing three bright blocks along the diagonal.]
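A minimal sketch of the sorted-similarity-matrix visualization (toy blob data invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Convert distances to similarities in [0, 1], then reorder rows and
# columns so points in the same cluster are adjacent
D = pairwise_distances(X)
S = 1 - D / D.max()
order = np.argsort(labels)
S_sorted = S[np.ix_(order, order)]

plt.imshow(S_sorted, cmap="viridis")   # crisp clusters => diagonal blocks
plt.colorbar(label="similarity")
plt.show()
```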
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp

[Figure: DBSCAN clustering of random points (left) and the corresponding sorted 100 x 100 similarity matrix (right); the diagonal block structure is weak.]
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp

[Figure: K-means clustering of random points (left) and the corresponding sorted 100 x 100 similarity matrix (right); the diagonal block structure is weak.]
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp

[Figure: Complete Link clustering of random points (left) and the corresponding sorted 100 x 100 similarity matrix (right); the diagonal block structure is weak.]
Using Similarity Matrix for Cluster Validation
[Figure: sorted similarity matrix for the seven clusters found by DBSCAN on a more complicated data set of 3000 points; blocks labeled 1–7 appear along the diagonal.]
Internal Measures: SSE
• Internal Index: Used to measure the goodness of a clustering structure without respect to external information
  – SSE

• SSE is good for comparing two clusterings or two clusters (average SSE).

• Can also be used to estimate the number of clusters

[Figure: left — a data set with ten well-separated clusters; right — SSE vs. number of clusters K (K = 2 to 30), with a distinct knee near K = 10.]
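A sketch of the standard elbow procedure (toy data with ten blobs, invented for illustration; KMeans.inertia_ is exactly the SSE objective that k-means minimizes):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, random_state=0)

# SSE for each candidate number of clusters
ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("K"); plt.ylabel("SSE")
plt.show()   # look for the knee in the curve
```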
Internal Measures: SSE
• SSE curve for a more complicated data set

[Figure: a data set with seven clusters (labeled 1–7) and the SSE of the clusters found using K-means.]
Unsupervised Cluster Validity Measure
• More generally, given K clusters:

    overall validity = \sum_{i=1}^{K} w_i \, \mathrm{validity}(C_i)

  – validity(C_i): a function of cohesion, separation, or both
  – Weight w_i associated with each cluster i
  – For SSE:
      w_i = 1
      validity(C_i) = \sum_{x \in C_i} (x - c_i)^2
Internal Measures: Cohesion and Separation
• Cluster Cohesion:
  – Measures how closely related the objects in a cluster are

• Cluster Separation:
  – Measures how distinct or well-separated a cluster is from other clusters
Graph-based versus Prototype-based Views
Graph-based View
• Cluster Cohesion: Measures how closely related the objects in a cluster are

    \mathrm{Cohesion}(C_i) = \sum_{x \in C_i,\; y \in C_i} \mathrm{proximity}(x, y)

• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters

    \mathrm{Separation}(C_i, C_j) = \sum_{x \in C_i,\; y \in C_j} \mathrm{proximity}(x, y)
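A sketch of these two graph-based measures, assuming X and labels are NumPy arrays such as those produced in the earlier sketches:

```python
from sklearn.metrics import pairwise_distances

def graph_cohesion(X, labels, i):
    """Sum of proximities over all pairs x, y in cluster i (as on the
    slide, every ordered pair is counted; proximity(x, x) = 0 here)."""
    pts = X[labels == i]
    return pairwise_distances(pts).sum()

def graph_separation(X, labels, i, j):
    """Sum of proximities between every x in cluster i and y in cluster j."""
    return pairwise_distances(X[labels == i], X[labels == j]).sum()
```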
Prototype-Based View
• Cluster Cohesion:

    \mathrm{Cohesion}(C_i) = \sum_{x \in C_i} \mathrm{proximity}(x, c_i)

  – Equivalent to SSE if proximity is the square of the Euclidean distance

• Cluster Separation:

    \mathrm{Separation}(C_i, C_j) = \mathrm{proximity}(c_i, c_j)
    \mathrm{Separation}(C_i) = \mathrm{proximity}(c_i, c)
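The prototype-based counterparts, under the same assumptions (squared Euclidean distance used as the proximity, so cohesion equals the cluster SSE):

```python
def prototype_cohesion(X, labels, i):
    """Sum of squared distances to the cluster centroid (= cluster SSE)."""
    pts = X[labels == i]
    c_i = pts.mean(axis=0)
    return ((pts - c_i) ** 2).sum()

def prototype_separation(X, labels, i, j):
    """Squared distance between the two cluster centroids."""
    c_i = X[labels == i].mean(axis=0)
    c_j = X[labels == j].mean(axis=0)
    return ((c_i - c_j) ** 2).sum()
```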
Unsupervised Cluster Validity Measures
Prototype-based vs Graph-based Cohesion
• For SSE and points in Euclidean space, the two views agree: the SSE of a cluster equals the sum of squared pairwise distances between its points, divided by twice the cluster size

    \sum_{x \in C_i} \|x - c_i\|^2 = \frac{1}{2 m_i} \sum_{x \in C_i} \sum_{y \in C_i} \|x - y\|^2
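A quick numerical check of this identity (random points invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))          # one cluster of 50 points
m = len(pts)

sse = ((pts - pts.mean(axis=0)) ** 2).sum()
pairwise = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum()
print(np.isclose(sse, pairwise / (2 * m)))   # True
```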
Total Sum of Squares (TSS)
    TSS = \sum_{x} \mathrm{dist}(x, c)^2

    SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, c_i)^2

    SSB = \sum_{i=1}^{k} m_i \, \mathrm{dist}(c_i, c)^2

  c: overall mean
  c_i: centroid of each cluster C_i
  m_i: number of points in cluster C_i

[Figure: three clusters with centroids c_1, c_2, c_3 and the overall mean c.]
Total Sum of Squares (TSS)
[Figure: points 1, 2, 4, 5 on a number line; m marks the overall mean (3), m_1 and m_2 the two cluster means (1.5 and 4.5).]

K = 1 cluster:

    TSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
    SSE = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
    SSB = 4 \cdot (3-3)^2 = 0

K = 2 clusters, {1, 2} and {4, 5}:

    TSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
    SSE = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
    SSB = 2 \cdot (3-1.5)^2 + 2 \cdot (4.5-3)^2 = 9

In both cases, TSS = SSE + SSB = 10.
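The same arithmetic, checked in code:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
labels = np.array([0, 0, 1, 1])           # K = 2: {1, 2} and {4, 5}

c = x.mean()                              # overall mean = 3
tss = ((x - c) ** 2).sum()                # 10

sse = sum(((x[labels == i] - x[labels == i].mean()) ** 2).sum()
          for i in (0, 1))                # 1
ssb = sum((labels == i).sum() * (x[labels == i].mean() - c) ** 2
          for i in (0, 1))                # 9

print(tss, sse + ssb)                     # 10.0 10.0
```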
Total Sum of Squares (TSS)
• TSS = SSE + SSB

• Given a data set, TSS is fixed
  – A clustering with large SSE has small SSB, while one with small SSE has large SSB

• The goal is to minimize SSE and maximize SSB; since TSS is fixed, the two are equivalent
Internal Measures: Silhouette Coefficient
• The Silhouette Coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings

• For an individual point i:
  – Calculate a = average distance of i to the points in its cluster
  – Calculate b = min (average distance of i to points in another cluster)
  – The silhouette coefficient for the point is then

      s = 1 - a/b   if a < b
      (or s = b/a - 1 if a >= b, not the usual case)

  – Typically between 0 and 1.
  – The closer to 1 the better.

• Can calculate the average silhouette width for a cluster or a clustering
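A sketch using scikit-learn's built-in silhouette functions (toy data invented for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_per_point = silhouette_samples(X, labels)   # one coefficient per point
s_overall = silhouette_score(X, labels)       # average silhouette width
print(f"average silhouette width = {s_overall:.3f}")
```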
Unsupervised Evaluation of Hierarchical Clustering
• Distance Matrix:

[Figure: single-link dendrogram over points 3, 6, 2, 5, 4, 1, with merge heights between 0 and 0.2.]
Unsupervised Evaluation of Hierarchical Clustering
[Figure: the single-link dendrogram from the previous slide.]

• Cophenetic Distance Matrix for Single Link

• CPCC (CoPhenetic Correlation Coefficient)
  – Correlation between the original distance matrix and the cophenetic distance matrix
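A sketch of computing the CPCC with SciPy (random toy data; scipy.cluster.hierarchy.cophenet returns the coefficient along with the cophenetic distances):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((20, 2))                 # toy data

d = pdist(X)                            # original condensed distance matrix
Z = linkage(d, method="single")         # single-link hierarchy
cpcc, coph_d = cophenet(Z, d)           # CPCC and cophenetic distances
print(f"CPCC = {cpcc:.3f}")
```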
Unsupervised Evaluation of Hierarchical Clustering
[Figure: single-link dendrogram (merge heights up to 0.2) and complete-link dendrogram (merge heights up to 0.4) over the same six points.]
Supervised Cluster Validation: Entropy and Purity
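The slide's table did not survive extraction; the following sketch restates the standard per-cluster definitions (p_ij is the fraction of points in cluster i belonging to class j; the 2x2 count table is hypothetical):

```python
import numpy as np

def entropy_and_purity(counts):
    """counts[i, j] = number of points of class j in cluster i."""
    m_i = counts.sum(axis=1, keepdims=True)        # cluster sizes
    p = counts / m_i                               # p_ij within each cluster
    logp = np.zeros_like(p)
    np.log2(p, out=logp, where=p > 0)              # avoid log(0)
    entropy_i = -(p * logp).sum(axis=1)            # per-cluster entropy
    purity_i = p.max(axis=1)                       # per-cluster purity
    w = m_i.ravel() / counts.sum()                 # weight by cluster size
    return (w * entropy_i).sum(), (w * purity_i).sum()

counts = np.array([[45.0, 5.0], [10.0, 40.0]])     # hypothetical table
print(entropy_and_purity(counts))
```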
Supervised Cluster Validation: Precision and Recall
Cluster i contains m_{i1} points of class 1 and m_{i2} points of class 2; the overall data contains m_1 points of class 1 and m_2 points of class 2.

    Precision of cluster i w.r.t. class j = \frac{m_{ij}}{\sum_k m_{ik}}

    Recall of cluster i w.r.t. class j = \frac{m_{ij}}{\sum_k m_{kj}} = \frac{m_{ij}}{m_j}
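These two formulas amount to row- and column-normalizing the cluster-by-class count table; a sketch with a hypothetical table:

```python
import numpy as np

# counts[i, j] = number of points of class j in cluster i (hypothetical)
counts = np.array([[45, 5],
                   [10, 40]])

precision = counts / counts.sum(axis=1, keepdims=True)  # row-normalized
recall = counts / counts.sum(axis=0, keepdims=True)     # column-normalized
print(precision[0, 0], recall[0, 0])   # cluster 0 w.r.t. class 0
```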
Supervised Cluster Validation: Hierarchical Clustering
• Hierarchical F-measure:
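The formula itself did not survive extraction; the definition commonly used (a reconstruction, not verified against the original slide) combines per-cluster precision and recall and, for each class, takes the best F-score over all clusters in the hierarchy, weighted by class size:

```latex
F(i, j) = \frac{2 \,\mathrm{prec}(i, j)\,\mathrm{rec}(i, j)}
               {\mathrm{prec}(i, j) + \mathrm{rec}(i, j)},
\qquad
F = \sum_{j} \frac{m_j}{m} \, \max_{i} F(i, j)
```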
Framework for Cluster Validity
• Need a framework to interpret any measure.
  – For example, if our measure of evaluation has the value 10, is that good, fair, or poor?

• Statistics provide a framework for cluster validity
  – The more "atypical" a clustering result is, the more likely it represents valid structure in the data
  – Can compare the values of an index that result from random data or clusterings to those of a clustering result.
      If the value of the index is unlikely, then the cluster results are valid

• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
  – However, there is the question of whether the difference between two index values is significant
Statistical Framework for SSE
• Example: 2-d data with 100 points

[Figure: scatter plot of 100 random points in the unit square.]

  – Suppose a clustering algorithm produces SSE = 0.005
  – Does it mean that the clusters are statistically significant?
Statistical Framework for SSE
– Generate 500 sets of random data points of size 100, distributed over the range 0.2 – 0.8 for x and y values
– Perform clustering with K = 3 clusters for each data set
– Plot the histogram of SSE values (compare with the value 0.005)

[Figure: left — one such random data set; right — histogram of the 500 SSE values, which range from roughly 0.016 to 0.034, far above the observed 0.005.]
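A sketch of this Monte Carlo procedure (seeds and bin count invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# SSE of k-means (K = 3) on 500 random data sets of 100 points in [0.2, 0.8]^2
sse = [KMeans(n_clusters=3, n_init=10, random_state=0)
       .fit(rng.uniform(0.2, 0.8, size=(100, 2))).inertia_
       for _ in range(500)]

plt.hist(sse, bins=20)
plt.axvline(0.005, color="red")   # observed SSE: far below the null values
plt.xlabel("SSE"); plt.ylabel("Count")
plt.show()
```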
Statistical Framework for Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: left — three well-separated clusters, Corr = -0.9235 (statistically significant); right — random points, Corr = -0.5810 (not statistically significant).]
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

— Algorithms for Clustering Data, Jain and Dubes