Chapter Four
Basic techniques for cluster
detection
Data Mining Techniques and Applications, 1st edition
Hongbo Du
ISBN 978-1-84480-891-5 © 2010 Cengage Learning
Chapter Overview
• The problem of cluster detection
• Measuring proximity between data objects
• The K-means cluster detection method
• The agglomeration cluster detection method
• Performance issues of the basic methods
• Cluster evaluation and interpretation
• Undertaking a clustering task in Weka
Problem of Cluster Detection
• What is cluster detection?
– Cluster: a group of objects known as members
– The centre of a cluster is known as the centroid
– Members of a cluster are similar to each other
– Members of different clusters are different
– Clustering is a process of discovering clusters
[Figure: scatter plot of example clusters with their centroids marked]
Problem of Cluster Detection
• Outputs of cluster detection process
– Assigned cluster tag for members of a cluster
– Cluster summary: size, centroid, variations, etc.
SubjectID   Body Height   Body Weight   Cluster Tag
s1          125           61            2
s2          178           90            1
s3          178           92            1
s4          180           83            1
s5          167           85            1
s6          170           89            1
s7          173           98            1
s8          135           40            2
s9          120           35            2
s10         145           70            2
s11         125           50            2

Cluster 1: Size: 6; Centroid: (174, 90); Variation: bodyHeight = 5.16, bodyWeight = 5.32
Cluster 2: Size: 5; Centroid: (130, 51); Variation: bodyHeight = 10, bodyWeight = 14.48

[Figure: scatter plot of Body Weight against Body Height showing the two clusters and their summaries]
Problem of Cluster Detection
• Basic elements of a clustering solution
– A sensible measure for similarity, e.g. Euclidean
– An effective and efficient clustering algorithm, e.g. K-means
– A goodness-of-fit function for evaluating the quality of resulting clusters, e.g. SSE
[Figure: candidate clusterings annotated with internal variation and inter-cluster distance, asking "Good or Bad?"]
Problem of Cluster Detection
• Requirements for clustering solutions
– Scalability
– Able to deal with different types of attributes
– Able to discover clusters of arbitrary shapes
– Minimal requirements for domain knowledge to determine input parameters
– Able to deal with noise and outliers
– Insensitive to the order of input data records
– Able to deal with high dimensionality
– Incorporation of user-specified constraints
– Interpretability and usability
Measures of Proximity
• Basics
– Proximity between two data objects is represented by either similarity or dissimilarity
– Similarity: a numeric measure of the degree of alikeness; dissimilarity: a numeric measure of the degree of difference between two objects
– Similarity and dissimilarity measures are often convertible; normally dissimilarity is preferred
– Measuring dissimilarity involves:
  • Measuring the difference between values of the corresponding attributes
  • Combining the measures of the differences
Measures of Proximity
• Distance function
– Metric properties of function d:
• d(x, y) ≥ 0 and d(x, x) = 0, for all data objects x and y
• d(x, y) = d(y, x), for all data objects x and y
• d(x, y) ≤ d(x, z) + d(z, y), for all data objects x, y and z
– Difference of values for a single attribute is directly
related to the domain type of the attribute.
– It is important to consider which operations are
applicable.
– Some measure is better than no measure at all.
Measures of Proximity
• Difference between Attribute Values
– Difference between nominal values
  • If two names are the same, the difference is 0; otherwise it is the maximum,
    e.g. diff("John", "John") = 0, diff("John", "Mary") = the maximum
  • The same applies to binary values,
    e.g. diff(Yes, No) = the maximum
– Difference between ordinal values
  • Different degrees of proximity can be compared,
    e.g. diff(A, B) < diff(A, D)
  • Convert ordinal values to consecutive integers,
    e.g. A: 5, B: 4, C: 3, D: 2, E: 1, so diff(A, B) = 1 and diff(A, D) = 3
– Distance measures apply to interval and ratio attributes
– Difference between values that may be unknown,
    e.g. diff(NULL, v) = |v|, diff(NULL, NULL) = the maximum
Measures of Proximity
• Distance between data objects
– Ratio of mismatched features for nominal attributes
Given two data objects i and j of p nominal attributes. Let m
represent the number of attributes where the values of the two
objects match.
d(i, j) = \frac{p - m}{p}

e.g.

Body Weight   Body Height   Blood Pressure   Blood Sugar   Habit       Class
heavy         short         high             3             smoker      P
heavy         short         high             1             nonsmoker   P
normal        tall          normal           3             nonsmoker   N
heavy         tall          normal           2             smoker      N
low           medium        normal           2             nonsmoker   N
low           tall          normal           1             nonsmoker   P
normal        medium        high             3             smoker      P
low           short         high             2             smoker      P
heavy         tall          high             2             nonsmoker   P
low           medium        normal           3             smoker      P
heavy         medium        normal           3             nonsmoker   N

d(row1, row2) = (6 − 4)/6 = 1/3
d(row1, row3) = (6 − 1)/6 = 5/6
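To make the mismatch ratio concrete, here is a minimal Python sketch (illustrative, not code from the book) that computes d(i, j) = (p − m)/p for the two comparisons above; the row tuples simply restate rows 1–3 of the table.

```python
# Ratio of mismatched features between two records of nominal attributes:
# d(i, j) = (p - m) / p, where p is the number of attributes and m the number of matches.

def mismatch_ratio(row_i, row_j):
    p = len(row_i)
    m = sum(1 for a, b in zip(row_i, row_j) if a == b)  # number of matching attributes
    return (p - m) / p

# Rows 1-3 of the table: (BodyWeight, BodyHeight, BloodPressure, BloodSugar, Habit, Class)
row1 = ("heavy", "short", "high", 3, "smoker", "P")
row2 = ("heavy", "short", "high", 1, "nonsmoker", "P")
row3 = ("normal", "tall", "normal", 3, "nonsmoker", "N")

print(mismatch_ratio(row1, row2))  # 2/6 ≈ 0.33
print(mismatch_ratio(row1, row3))  # 5/6 ≈ 0.83
```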
Measures of Proximity
• Distance between data objects
– Minkowski function for interval/ratio attributes
d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}

Special cases:
• Manhattan distance (q = 1):
  d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|
• Euclidean distance (q = 2):
  d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}
• Supremum/Chebyshev distance (q = ∞):
  d(i, j) = \max_t |x_{it} - x_{jt}|
Measures of Proximity
• Distance between data objects
– Minkowski function for interval/ratio attributes
(example)
customerID   No of Trans   Revenue   Tenure (Months)
101          30            1000      20
102          40            400       30
103          35            300       30
104          20            1000      35
105          50            500       1
106          80            100       10
107          10            1000      2

d1(cust101, cust102) = |30 − 40| + |1000 − 400| + |20 − 30| = 620            (Manhattan)
d2(cust101, cust102) = sqrt((30 − 40)² + (1000 − 400)² + (20 − 30)²) ≈ 600.17   (Euclidean)
dmax(cust101, cust102) = |1000 − 400| = 600                                  (Chebyshev)

[Figure: customers plotted by No. of Trans, Revenue and Tenure, illustrating the Manhattan, Euclidean and Chebyshev distances between cust101 and cust102]
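The three special cases can be checked with a short, illustrative Python sketch (not the book's code); the tuples restate cust101 and cust102 from the table.

```python
# Minkowski distances between two customers described by
# (No of Trans, Revenue, Tenure), for q = 1, q = 2 and q = "infinity".

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

cust101 = (30, 1000, 20)
cust102 = (40, 400, 30)

print(minkowski(cust101, cust102, 1))   # Manhattan: 620
print(minkowski(cust101, cust102, 2))   # Euclidean: about 600.17
print(chebyshev(cust101, cust102))      # Chebyshev: 600
```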
Measures of Proximity
• Distance between data objects
– For binary attributes
• Given two data objects i and j of p binary attributes,
– f00 : the number of attributes where i is 0 and j is 0
– f01 : the number of attributes where i is 0 and j is 1
– f10 : the number of attributes where i is 1 and j is 0
– f11 : the number of attributes where i is 1 and j is 1
• Simple mismatch coefficient (SMC) for symmetric values:
  SMC(i, j) = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}}
• Jaccard coefficient (JC), defined for asymmetric values:
  JC(i, j) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}}
Measures of Proximity
• Distance between data objects
– For binary attributes (example)
DocumentID   Query   Database   Programming   Interface   User   Usability   Network   Web   GUI   HTML
d1           1       1          0             0           0      0           0         0     0     0
d2           0       1          1             0           0      0           0         0     0     0
d3           0       1          0             0           0      0           0         0     0     0

SMC(d1, d2) = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{1 + 1}{7 + 1 + 1 + 1} = \frac{2}{10}
JC(d1, d2) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}} = \frac{1 + 1}{1 + 1 + 1} = \frac{2}{3}
SMC ⇒ not that different; JC ⇒ very different: a two-word (out of 3) difference

SMC(d1, d3) = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{1}{8 + 1 + 1} = \frac{1}{10}
JC(d1, d3) = \frac{f_{01} + f_{10}}{f_{01} + f_{10} + f_{11}} = \frac{1}{1 + 1} = \frac{1}{2}
SMC ⇒ very similar; JC ⇒ still quite different: a one-word (out of 2) difference
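The same SMC and JC calculations as an illustrative Python sketch; the binary vectors restate the document-term table above.

```python
# Simple mismatch coefficient and Jaccard coefficient (as dissimilarities)
# for binary document-term vectors, following the definitions above.

def smc(x, y):
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    return (f01 + f10) / len(x)          # denominator = f00 + f01 + f10 + f11

def jc(x, y):
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return (f01 + f10) / (f01 + f10 + f11)

# Attribute order: Query, Database, Programming, Interface, User,
#                  Usability, Network, Web, GUI, HTML
d1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
d2 = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(smc(d1, d2), jc(d1, d2))   # 0.2, ≈ 0.667
print(smc(d1, d3), jc(d1, d3))   # 0.1, 0.5
```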
Measures of Proximity
• Similarity between data objects
– Cosine similarity function
• Treating two data objects as vectors
• Similarity is measured as the angle θ between the two vectors
• Similarity is 1 when θ = 0° and 0 when θ = 90°
• Similarity function:
  \cos(i, j) = \frac{i \cdot j}{\|i\| \, \|j\|}, \quad \text{where} \quad
  i \cdot j = \sum_{k=1}^{n} i_k j_k, \quad
  \|i\| = \sqrt{\sum_{k=1}^{n} i_k^2} \ \text{(similarly for } \|j\|\text{)}
Measures of Proximity
• Similarity between data objects
– Cosine similarity function (illustrated)
Given two data objects:
x = (3, 2, 0, 5), and y = (1, 0, 0, 0)
Since,
x · y = 3×1 + 2×0 + 0×0 + 5×0 = 3
||x|| = sqrt(3² + 2² + 0² + 5²) ≈ 6.16
||y|| = sqrt(1² + 0² + 0² + 0²) = 1
Then, the similarity between x and y: cos(x, y) = 3/(6.16 × 1) ≈ 0.49
The dissimilarity between x and y: 1 − cos(x, y) ≈ 0.51
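A minimal Python sketch (illustrative) reproducing the worked cosine example:

```python
# Cosine similarity between two vectors, matching the worked example above.
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (3, 2, 0, 5)
y = (1, 0, 0, 0)

sim = cosine(x, y)
print(round(sim, 2))        # similarity: 0.49
print(round(1 - sim, 2))    # dissimilarity: 0.51
```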
Measures of Proximity
• Distance between data objects
– Combining heterogeneous attributes
• Based on the principle of ratio of mismatched features
• For the kth attribute, compute the dissimilarity d_k in [0, 1]
• Set the indicator variable δ_k as follows:
  – δ_k = 0, if the kth attribute is an asymmetric binary attribute and both objects have value 0 for the attribute
  – δ_k = 1, otherwise
• Compute the overall distance between i and j as:
  d(i, j) = \frac{\sum_{k=1}^{n} \delta_k \, d_k}{\sum_{k=1}^{n} \delta_k}
Measures of Proximity
• Distance between data objects
– Attribute scaling
• When scaling is needed:
  – on the same attribute, when data from different data sources are merged
  – on different attributes, when data objects are projected into the n-dimensional space
• Normalising variables into comparable ranges:
  – divide each value by the mean
  – divide each value by the range
  – z-score: subtract the mean and divide by the standard deviation (see the sketch below)
– Attribute weighting
  • The weighted overall dissimilarity function:
    d(i, j) = \frac{\sum_{k=1}^{n} w_k \, \delta_k \, d_k}{\sum_{k=1}^{n} \delta_k}
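As an illustration of z-score scaling (a sketch only, using the Revenue and Tenure columns from the earlier customer example), attributes with very different ranges are brought into comparable units before distances are computed:

```python
# z-score normalisation: (value - mean) / standard deviation, per attribute.
import statistics

def z_scores(values):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)      # sample standard deviation
    return [(v - mean) / stdev for v in values]

revenue = [1000, 400, 300, 1000, 500, 100, 1000]   # large range of values
tenure = [20, 30, 30, 35, 1, 10, 2]                # much smaller range

print([round(z, 2) for z in z_scores(revenue)])
print([round(z, 2) for z in z_scores(tenure)])
# After scaling, both attributes contribute comparably to a distance function.
```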
K-means, a Basic Clustering Method
• Outline of main steps
1. Define the number of clusters (k)
2. Choose k data objects randomly to serve as the initial
centroids for the k clusters
3. Assign each data object to the cluster represented by
its nearest centroid
4. Find a new centroid for each cluster by calculating the
mean vector of its members
5. Undo the memberships of all data objects. Go back to
Step 3 and repeat the process until cluster
membership no longer changes or a maximum
number of iterations is reached.
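The steps above translate almost directly into code; the following is a minimal, illustrative Python sketch using Euclidean distance (not Weka's SimpleKMeans implementation), run on the body height/weight data from earlier:

```python
# A minimal K-means sketch: assign each object to its nearest centroid,
# recompute centroids as mean vectors, and repeat until membership is stable.
import math, random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(data, k, max_iterations=100, seed=0):
    random.seed(seed)
    centroids = random.sample(data, k)                 # step 2: initial centroids
    assignment = [None] * len(data)
    for _ in range(max_iterations):
        # step 3: assign each object to the cluster of its nearest centroid
        new_assignment = [min(range(k), key=lambda c: euclidean(x, centroids[c]))
                          for x in data]
        if new_assignment == assignment:               # step 5: membership stable
            break
        assignment = new_assignment
        # step 4: recompute each centroid as the mean vector of its members
        for c in range(k):
            members = [x for x, a in zip(data, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Body height / weight data from the earlier example
data = [(125, 61), (178, 90), (178, 92), (180, 83), (167, 85), (170, 89),
        (173, 98), (135, 40), (120, 35), (145, 70), (125, 50)]
print(k_means(data, k=2))
```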
K-means, a Basic Clustering Method
• Illustration of the method:
K-means, a Basic Clustering Method
• Strengths & weaknesses
– Strengths
• Simple and easy to implement
• Quite efficient
– Weaknesses
• Need to specify the value of k, but we may not
know what the value should be beforehand
• Sensitive to the choice of initial k centroids: the
result can be non-deterministic
• Sensitive to noise
• Applicable only when mean is meaningful to
the given data set
K-means, a Basic Clustering Method
• Overcoming the weaknesses:
– Using cluster quality to determine the value of k
– Improving how the initial k centroids are chosen
• Running the clustering a number of times and selecting the result with the highest quality
• Using hierarchical clustering to locate the centres
• Finding centres that are farther apart
– Dealing with noise
• Removing outliers before clustering?
• K-medoid method, using the nearest data object to the
virtual centre as the centroid.
– When mean cannot be defined,
• K-mode method, calculating mode instead of mean for
the centre of the cluster.
K-means, a Basic Clustering Method
• Value of k and cluster quality
[Figure: scree plot of cluster errors (e.g. SSE) against the number of clusters]
K-means, a Basic Clustering Method
• Choosing initial k centroids
– Running the clustering many times (only trial and
error)
– Using hierarchical clustering to locate the centres
(why partition based?)
– Finding centres that are farther apart
K-means, a Basic Clustering Method
• K-medoid:
• Bisecting K-means
The Agglomeration Method
• Outline of main steps
1. Take all n data objects as individual clusters and build an n × n dissimilarity matrix. The matrix stores the distance between every pair of data objects.
2. While the number of clusters > 1 do:
   i.  Find the pair of data objects/clusters with the minimum distance
   ii. Merge the two data objects/clusters into a bigger cluster
   iii. Replace the entries in the matrix for the original clusters or objects by the cluster tag of the newly formed cluster
   iv. Re-calculate the relevant distances and update the matrix
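The loop above can be sketched in a few lines of illustrative Python; this version recomputes single-link distances between clusters directly rather than maintaining an explicit dissimilarity matrix:

```python
# Agglomerative clustering sketch: start with each object as its own cluster
# and repeatedly merge the closest pair (single-link distance shown here).
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(cluster_a, cluster_b):
    # distance between the two closest members of the two clusters
    return min(euclidean(x, y) for x in cluster_a for y in cluster_b)

def agglomerate(data):
    clusters = [[x] for x in data]          # step 1: every object is a cluster
    merges = []
    while len(clusters) > 1:                # step 2
        # step 2i: find the pair of clusters with the minimum distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        # steps 2ii-iv: merge the pair and record the merge
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges                           # the merge sequence defines the dendrogram

data = [(125, 61), (178, 90), (178, 92), (135, 40), (120, 35)]
for a, b in agglomerate(data):
    print(a, "+", b)
```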
The Agglomeration Method
• Illustration of the method
The Agglomeration Method
• Illustration of the method (dendrogram)
[Figure: dendrogram of the agglomeration process; the axis shows the number of clusters (1 to 10) at each level of the hierarchy]
The Agglomeration Method
• Agglomeration schemes
– Single link: the distance between the two closest points
– Complete link: the distance between the two farthest points
– Group average: the average of all pair-wise distances
– Centroids: the distance between the centroids
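The four schemes differ only in how the distance between two clusters is derived from pair-wise point distances; an illustrative Python sketch (the two example clusters are small subsets of the earlier height/weight data):

```python
# Inter-cluster distance under the four agglomeration schemes.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(a, b):     # distance between the two closest points
    return min(euclidean(x, y) for x in a for y in b)

def complete_link(a, b):   # distance between the two farthest points
    return max(euclidean(x, y) for x in a for y in b)

def group_average(a, b):   # average of all pair-wise distances
    return sum(euclidean(x, y) for x in a for y in b) / (len(a) * len(b))

def centroid_link(a, b):   # distance between the centroids
    ca = tuple(sum(col) / len(a) for col in zip(*a))
    cb = tuple(sum(col) / len(b) for col in zip(*b))
    return euclidean(ca, cb)

c1 = [(178, 90), (180, 83), (167, 85)]
c2 = [(125, 61), (135, 40)]
for scheme in (single_link, complete_link, group_average, centroid_link):
    print(scheme.__name__, round(scheme(c1, c2), 2))
```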
The Agglomeration Method
• Strengths and weaknesses
– Strengths
• Deterministic results
• Multiple possible versions of clustering
• No need to specify the value of k beforehand
• Can create clusters of arbitrary shapes (single-link)
– Weaknesses
• Does not scale up to large data sets
• Cannot undo membership decisions, unlike K-means
• Problems with agglomeration schemes (see Chapter 5)
Cluster Evaluation & Interpretation
• Cluster quality
– Principle:
  • High level of similarity / low level of variation within a cluster
  • High level of dissimilarity between clusters
– The measures:
  • Cohesion: sum of squared errors within a cluster (SSE), and the sum of SSEs over all clusters (WC)
  • Separation: sum of squared distances between cluster centroids (BC)
  • Combining cohesion and separation, the ratio BC/WC is a good indicator of overall quality
C_k: cluster k;  r_k: the centroid of C_k

SSE(C_k) = \sum_{x \in C_k} d(x, r_k)^2

WC = \sum_{k=1}^{K} SSE(C_k)

BC = \sum_{1 \le j < k \le K} d(r_j, r_k)^2

Q = \frac{BC}{WC}
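The measures can be computed directly from the definitions; a short, illustrative Python sketch using the body height/weight clusters from the earlier example:

```python
# Cohesion and separation measures: SSE per cluster, WC, BC and Q = BC / WC.

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def sse(cluster):
    r = centroid(cluster)
    return sum(sq_dist(x, r) for x in cluster)         # squared distances to centroid

c1 = [(178, 90), (178, 92), (180, 83), (167, 85), (170, 89), (173, 98)]
c2 = [(125, 61), (135, 40), (120, 35), (145, 70), (125, 50)]
clusters = [c1, c2]

wc = sum(sse(c) for c in clusters)                      # within-cluster: sum of SSEs
centroids = [centroid(c) for c in clusters]
bc = sum(sq_dist(centroids[j], centroids[k])            # between-cluster: squared
         for j in range(len(clusters))                  # centroid distances
         for k in range(j + 1, len(clusters)))

print(round(sse(c1), 2), round(sse(c2), 2))             # 274.83, 1238.8
print(round(wc, 2), round(bc, 2), round(bc / wc, 3))    # 1513.63, 3432.33, 2.268
```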
Cluster Evaluation & Interpretation
• Cluster quality illustrated
SubjectID   Body Height   Body Weight   Cluster Tag
s1          125           61            2
s2          178           90            1
s3          178           92            1
s4          180           83            1
s5          167           85            1
s6          170           89            1
s7          173           98            1
s8          135           40            2
s9          120           35            2
s10         145           70            2
s11         125           50            2

[Figure: scatter plot of Body Weight against Body Height showing cluster c1 and cluster c2]

SSE(C1) = 274.83
SSE(C2) = 1238.8
WC = 274.83 + 1238.8 = 1513.63
⇒ C1 is a better quality cluster than C2.
BC = 3432.3
Q = BC/WC = 3432.3/1513.63 ≈ 2.268
Cluster Evaluation & Interpretation
• Using cluster quality for clustering
– With K-means (see the sketch below):
  • Add an outer loop for different values of K (from low to high)
  • At each iteration, conduct K-means clustering using the current K
  • Measure the overall cluster quality and decide whether the resulting quality is acceptable
  • If not, increase the value of K by 1 and repeat the process
– With agglomeration:
  • Traverse the hierarchy level by level from the root
  • At each level, evaluate the overall quality of the clusters
  • If the quality is acceptable, take the clusters at that level as the final result; if not, move to the next level and repeat the process
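A sketch of the outer loop over K, assuming scikit-learn is available (its KMeans reports the within-cluster sum of squared errors as inertia_, which plays the role of WC here); deciding what counts as acceptable quality is left to the analyst:

```python
# Run K-means for increasing values of K and record the cluster quality,
# using the body height/weight data from the earlier example.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([(125, 61), (178, 90), (178, 92), (180, 83), (167, 85), (170, 89),
              (173, 98), (135, 40), (120, 35), (145, 70), (125, 50)])

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))   # plot these values to find an acceptable K (the "elbow")
```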
Cluster Evaluation & Interpretation
• Cluster tendency
– Cluster tendency: do clusters really exist?
– Measures for tendency:
• Quality measure: when BC and WC are similar, it means
clusters do not exist.
• Use Hopkins statistic
H(P, S) = \frac{\sum_{p,\, t_p \in S} d(p, t_p)}{\sum_{m,\, t_m \in P} d(m, t_m) + \sum_{p,\, t_p \in S} d(p, t_p)}

P: a set of n randomly generated data points
S: a sample of n data points from the data set
t_p: the nearest neighbour of point p in S
t_m: the nearest neighbour of point m in P
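A sketch of the Hopkins statistic in its commonly used form (an assumption here: both nearest-neighbour distances are measured against the data set itself, with a sampled point excluded from its own search), following the same ratio structure as the formula above:

```python
# Hopkins statistic sketch: compare nearest-neighbour distances of uniformly
# random points (P) and of sampled data points (S) against the data set.
import math, random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_distance(point, others):
    return min(euclidean(point, o) for o in others)

def hopkins(data, n, seed=0):
    random.seed(seed)
    dims = list(zip(*data))
    lows, highs = [min(d) for d in dims], [max(d) for d in dims]
    # P: n uniformly random points within the bounding box of the data
    P = [tuple(random.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in range(n)]
    # S: n points sampled from the data set
    S = random.sample(data, n)
    u = sum(nearest_distance(p, data) for p in P)
    w = sum(nearest_distance(m, [x for x in data if x != m]) for m in S)
    return u / (u + w)       # near 0.5 for random data, closer to 1 for clustered data

data = [(125, 61), (178, 90), (178, 92), (180, 83), (167, 85), (170, 89),
        (173, 98), (135, 40), (120, 35), (145, 70), (125, 50)]
print(round(hopkins(data, n=5), 2))
```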
Cluster Evaluation & Interpretation
• Cluster interpretation
– Within cluster
• How values of the clustering attributes are distributed
• How values of supplementary attributes are distributed
– Outside cluster
• Exceptions and anomalies
– Between cluster
• Comparative view
[Figure: histograms comparing value distributions for a cluster against value distributions for the whole population]
K-means & Agglomeration in Weka
• Clustering in Weka: Preprocess page
Specify “No Class”
Specify all attributes for clustering
K-means & Agglomeration in Weka
• Clustering in Weka: Cluster page
1. Choose a clustering solution
2. Set parameters
3. Execute the chosen solution
4. Observe results
5. Select “Visualise Cluster Assignment”
K-means & Agglomeration in Weka
• Clustering in Weka: SimpleKMeans
Specify the distance function used
Specify the value of K
Specify the max. number of iterations
Specify the random seed affecting the initial random selection of K centroids
K-means & Agglomeration in Weka
• Clustering in Weka: SimpleKMeans
Save membership into a file
Visualise cluster membership
K-means & Agglomeration in Weka
• Clustering in Weka: Agglomeration
Tree-shaped dendrogram
Select Cobweb
Chapter Summary
• A clustering solution must provide a sensible proximity
function, effective algorithm and a cluster evaluation function
• Proximity is normally measured by a distance function that
combines measures of value differences upon attributes
• The K-Means method continues to refine prototype partitions
until membership changes no longer occur
• The agglomeration method constructs all possible groupings
of individual data objects into a hierarchy of clusters
• Good clustering results mean high similarity among members
of a cluster and low similarity between members of different
clusters
• The normal procedure for clustering in Weka is explained
References
Read Chapter 4 of Data Mining Techniques and Applications.

Useful further references:
• Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapters 2 (Section 2.4) and 8.