Clustering
Stat 430
Fall 2011
Outline
• Distance Measures
• Linkage
• Hierarchical Clustering
• KMeans
Data set: Letters
• from the UCI repository: Letters Data
• 20,000 instances of letters
• Variables:
1. lettr capital letter (26 values from A to Z)
2. x-box horizontal position of box (integer)
3. y-box vertical position of box (integer)
4. width width of box (integer)
5. high height of box (integer)
6. onpix total # on pixels (integer)
7. x-bar mean x of on pixels in box (integer)
8. y-bar mean y of on pixels in box (integer)
9. x2bar mean x variance (integer)
10. y2bar mean y variance (integer)
11. xybar mean x y correlation (integer)
12. x2ybr mean of x * x * y (integer)
13. xy2br mean of x * y * y (integer)
14. x-ege mean edge count left to right (integer)
15. xegvy correlation of x-ege with y (integer)
16. y-ege mean edge count bottom to top (integer)
17. yegvx correlation of y-ege with x (integer)
Data set: Letters
Data Set Information (UCI repository):
Objective: identify each of a large number of black-and-white rectangular
pixel displays as one of the 26 capital letters in the English alphabet.
The character images were based on 20 different fonts; each letter
within these 20 fonts was randomly distorted to produce a file of 20,000
unique stimuli.
Each stimulus was converted into 16 primitive numerical attributes
(statistical moments and edge counts) which were then scaled to fit into
a range of integer values from 0 through 15.
We typically train on the first 16000 items and then use the resulting
model to predict the letter category for the remaining 4000. See the
article cited above for more details.
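As a minimal sketch, the data can be read straight from the UCI
repository in R; the URL is the repository's customary path for this
data set (an assumption worth checking), and the column names follow
the variable list above.

# Read the letters data from the UCI repository; no header row in the
# raw file, so supply the variable names from the list above.
url <- paste0("https://archive.ics.uci.edu/ml/machine-learning-databases/",
              "letter-recognition/letter-recognition.data")
uci <- read.csv(url, header = FALSE,
                col.names = c("lettr", "x.box", "y.box", "width", "high",
                              "onpix", "x.bar", "y.bar", "x2bar", "y2bar",
                              "xybar", "x2ybr", "xy2br", "x.ege", "xegvy",
                              "y.ege", "yegvx"))
dim(uci)   # 20000 rows, 17 columns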
Clustering
• part of unsupervised classification (i.e. we do
not use or have a dependent variable)
• classification is based on object similarity
• What is similar?
Distances
• three objects from the letters data, one ‘T’, one ‘I’ and
one ‘M’ (rows 1, 2 and 10), with their 16 numeric variables:

        ( 2  8  3  5  1  8 13  0  6  6 10  8  0  8  0  8 )   ‘T’
    X = ( 5 12  3  7  2 10  5  5  4 13  3  9  2  8  4 10 )   ‘I’
        (11 15 13  9  7 13  2  6  2 12  1  9  8  1  1  8 )   ‘M’

Before we can begin finding groups of cases that are similar, we need to
decide on a definition of similarity. How is similarity defined? Consider
a dataset with three cases and four variables, described in matrix
format as

        ( X1 )   ( 7.3 7.6 7.7 8.0 )
    X = ( X2 ) = ( 7.4 7.2 7.3 7.2 )
        ( X3 )   ( 4.1 4.6 4.6 4.8 )

The Euclidean distance between two cases Xi and Xj (rows of the matrix)
is defined as

    dEuc(Xi, Xj) = ||Xi − Xj||,   i, j = 1, . . . , n,

where ||Xi|| = √(Xi1² + Xi2² + . . . + Xip²). For example, the Euclidean
distance between cases 1 and 2 in the above data is

    √((7.3 − 7.4)² + (7.6 − 7.2)² + (7.7 − 7.3)² + (8.0 − 7.2)²) ≈ 1.0

For the three letters, the interpoint Euclidean distance matrix is

> dist(uci[c(1,2,10),-1])
          1        2
2  15.81139
10 26.68333 17.32051

i.e. I is closer to T than to M (by a sliver), while M and T are quite
far apart.
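To verify the arithmetic, a quick sketch rebuilding the toy 3 x 4
matrix and comparing the hand computation with R's dist():

# Toy matrix from above; d(X1, X2) by hand and via dist().
X <- rbind(X1 = c(7.3, 7.6, 7.7, 8.0),
           X2 = c(7.4, 7.2, 7.3, 7.2),
           X3 = c(4.1, 4.6, 4.6, 4.8))
sqrt(sum((X[1, ] - X[2, ])^2))   # 0.9849, the approx. 1.0 quoted above
dist(X)                          # all three interpoint distances at once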
Linkage
• Compute a distance matrix of all objects
• Now we start to connect the closest objects
• The first step is always the same: combine the two
closest objects into one cluster
• Next step: combine the two next closest objects
or clusters
• How do we define the distance to a cluster?
This is called linkage (see the sketch below)
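As a small illustration, hclust() records exactly these merge steps;
a sketch using the toy matrix X defined above:

# Each row of $merge is one agglomeration step: negative entries are
# original objects, positive entries refer to clusters formed earlier.
hc <- hclust(dist(X), method = "single")
hc$merge    # first merge: objects 1 and 2; then object 3 joins them
hc$height   # the distances at which the merges happened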
Single Linkage
• Depending on the mechanism used for
linkage, we find very different clusters
• Single Linkage:
Distance between a cluster and an object
or another cluster is the minimal distance
between any of the elements in the cluster.
“My friend’s friend is my friend”
This leads to long and stringy clusters
Complete Linkage
• The distance between two clusters V and
W is defined as the maximum distance
between any of the cluster elements
• This leads to very tight clusters
Average Linkage


• The distance between two clusters V and
W is defined as the average distance
between any of the cluster elements:

    D(V, W) = 1/(|V| |W|) Σ_{x∈V, y∈W} |x − y|
• Cluster size is a compromise between
single and complete linkage
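A direct transcription of the formula into R (a sketch; avg_linkage is
a hypothetical helper for illustration, in practice one would call
hclust(..., method = "average")):

# Average linkage: mean of all pairwise Euclidean distances between the
# elements of clusters V and W, each given as a matrix of rows.
avg_linkage <- function(V, W) {
  total <- 0
  for (i in seq_len(nrow(V)))
    for (j in seq_len(nrow(W)))
      total <- total + sqrt(sum((V[i, ] - W[j, ])^2))
  total / (nrow(V) * nrow(W))   # divide by |V| * |W|
}
avg_linkage(X[1:2, , drop = FALSE], X[3, , drop = FALSE])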



Ward’s Method
• Distance between two clusters V and W is
defined as the increase in the error sum of
squares (i.e. variance) if the two clusters
are merged:

    ESS(X) = Σ_{i=1,...,n} |Xi − X̄|²

    D(V, W) = ESS(V ∪ W) − (ESS(V) + ESS(W))

• This results in spherical clusters
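To see these different cluster shapes in practice, a sketch comparing
the four linkage methods on a small random subsample of the letters
data (method names as in current R, where Ward's method is "ward.D2";
older versions spelled it "ward"):

# Distance matrices grow quadratically, so subsample before clustering.
set.seed(430)                 # arbitrary seed, for reproducibility only
idx <- sample(nrow(uci), 200)
d   <- dist(uci[idx, -1])     # drop the letter label (column 1)
for (m in c("single", "complete", "average", "ward.D2")) {
  cl <- cutree(hclust(d, method = m), k = 26)
  cat(m, ": largest of 26 clusters holds", max(table(cl)), "cases\n")
}

Single linkage typically chains nearly everything into one stringy
cluster here, while complete linkage and Ward's method give much more
balanced cluster sizes.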
KMeans
• KMeans is a non-hierarchical clustering approach:
• Randomly find K centers of the data (seeds)
• Each data record is assigned to the closest
center
• Centers are re-calculated based on members
(centroids are means or medians)
• Repeat these steps until cluster assignments no
longer change
• KMeans can deal with quite large data sets;
problematic: K needs to be known in advance
(see the sketch below)
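A minimal KMeans sketch on the letters data (assuming uci from above);
here K = 26 is known only because the true letter labels exist, which
is exactly the luxury one usually does not have:

# k-means with K = 26; nstart restarts from several random seedings,
# since the result depends on the initial centers.
km <- kmeans(uci[, -1], centers = 26, nstart = 10, iter.max = 50)
km$size                                 # cases per cluster
table(km$cluster, uci$lettr)[1:4, 1:6]  # clusters vs. true letters (corner)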