Clustering Methods: Part 2a
K-means algorithm
Pasi Fränti
Speech and Image Processing Unit
School of Computing
University of Eastern Finland
K-means overview
• Well-known clustering algorithm
• Number of clusters must be chosen in advance
• Strengths:
1. Vectors can flexibly change clusters during the process.
2. Always converges to a local optimum.
3. Quite fast for most applications.
• Weaknesses:
1. Quality of the output depends on the initial codebook.
2. Global optimum solution not guaranteed.
K-means pseudo code

Input:
X: a set of N data vectors (the data set)
CI: k initialized cluster centroids (number of clusters k, random initial centroids)
Output:
C: the cluster centroids of the k-clustering
P = {p(i) | i = 1, …, N}: the cluster label of each vector of X

KMEANS(X, CI) → (C, P)
C ← CI;
REPEAT
  Cprevious ← C;
  FOR all i ∈ [1, N] DO                       // generate new optimal partitions
    p(i) ← arg min_{1 ≤ j ≤ k} d(xi, cj);
  FOR all j ∈ [1, k] DO                       // generate optimal centroids
    cj ← average of the xi whose p(i) = j;
UNTIL C = Cprevious
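The pseudocode translates almost line for line into Python. Below is a minimal sketch under that reading; the function name kmeans and the guard for empty clusters are illustrative additions, not part of the slide.

```python
from math import dist  # Euclidean distance, Python >= 3.8

def kmeans(X, CI):
    """X: list of N data vectors, CI: list of k initial centroids.
    Returns (C, P): final centroids and the cluster label of each vector."""
    C = list(CI)
    while True:
        C_previous = list(C)
        # Generate new optimal partitions: label each vector with the
        # index of its nearest centroid.
        P = [min(range(len(C)), key=lambda j: dist(x, C[j])) for x in X]
        # Generate optimal centroids: each centroid becomes the average of
        # the vectors assigned to it (an empty cluster keeps its centroid).
        for j in range(len(C)):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                C[j] = tuple(sum(v) / len(members) for v in zip(*members))
        if C == C_previous:   # UNTIL C = Cprevious
            return C, P
```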
K-means example (1/4)

Data set
X: a set of N data vectors, N = 6
[Figure: the six data points plotted on the plane; A = (1,1), B = (2,1), C = (4,5), D = (5,5), E = (5,6), F = (8,5)]

Number of clusters, random initial centroids
CI: initialized k cluster centroids, k = 3
Initial codebook: c1 = C, c2 = D, c3 = E
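For reference, the data set and initial codebook of this example can be written out as follows; the coordinates are the ones read off the figure, and they match the distance matrices and centroid computations on the next slides.

```python
# Coordinates read off the example figure.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 5),
          "D": (5, 5), "E": (5, 6), "F": (8, 5)}

X = list(points.values())                      # N = 6 data vectors
CI = [points["C"], points["D"], points["E"]]   # k = 3, c1 = C, c2 = D, c3 = E
```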
K-means example (2/4)

Generate optimal partitions
Distance matrix (Euclidean distance):

       A    B    C    D    E    F
c1     5    4.5  0    1    1.4  4
c2     5.7  5    1    0    1    3
c3     6.4  5.8  1.4  1    0    3.2

Nearest centroid: A, B, C → c1; D, F → c2; E → c3.
After 1st iteration: MSE = 9.0

Generate optimal centroids
c1 = ((1+2+4)/3, (1+1+5)/3) = (2.3, 2.3)
c2 = ((5+8)/2, (5+5)/2) = (6.5, 5)
c3 = (5, 6)
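One iteration of the example can be reproduced with a short script. With the coordinates above it prints the same distance matrix, MSE = 9.0 and new centroids as on this slide (a sketch; the variable names are illustrative).

```python
from math import dist

points = {"A": (1, 1), "B": (2, 1), "C": (4, 5),
          "D": (5, 5), "E": (5, 6), "F": (8, 5)}
X = list(points.values())
C = [points["C"], points["D"], points["E"]]    # initial codebook c1, c2, c3

# Distance matrix: one row per centroid, one column per data vector.
for j, c in enumerate(C, start=1):
    print(f"c{j}:", [round(dist(x, c), 1) for x in X])

# Optimal partition: index of the nearest centroid for each vector.
P = [min(range(len(C)), key=lambda j: dist(x, C[j])) for x in X]

# MSE of this partition with the current centroids.
mse = sum(dist(x, C[p]) ** 2 for x, p in zip(X, P)) / len(X)
print("MSE =", round(mse, 2))                  # 9.0

# Optimal centroids: average of the vectors in each cluster.
new_C = []
for j in range(len(C)):
    members = [x for x, p in zip(X, P) if p == j]
    new_C.append(tuple(sum(v) / len(members) for v in zip(*members)))
print(new_C)   # approximately (2.3, 2.3), (6.5, 5), (5, 6) as above
```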
K-means example (3/4)

Generate optimal partitions
Distance matrix (Euclidean distance):

       A    B    C    D    E    F
c1     1.9  1.4  3.1  3.8  4.5  6.3
c2     6.8  6    2.5  1.5  1.8  1.5
c3     6.4  5.8  1.4  1    0    3.2

Nearest centroid: A, B → c1; F → c2; C, D, E → c3.
After 2nd iteration: MSE = 1.78

Generate optimal centroids
c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c2 = (8, 5)
c3 = ((4+5+5)/3, (5+5+6)/3) = (4.7, 5.3)
K-means example (4/4)

Generate optimal partitions
Distance matrix (Euclidean distance):

       A    B    C    D    E    F
c1     0.5  0.5  4.7  5.3  6.1  7.6
c2     8.1  7.2  4    3    3.2  0
c3     5.7  5.1  0.7  0.5  0.7  3.3

After 3rd iteration: MSE = 0.31
No object moves - stop.
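Running the full loop on the example data reproduces the whole sequence of slides. The script below (a sketch, using the coordinates read off the figure) prints MSE = 9.00, 1.78 and 0.31 and then stops because the centroids no longer move.

```python
from math import dist

points = {"A": (1, 1), "B": (2, 1), "C": (4, 5),
          "D": (5, 5), "E": (5, 6), "F": (8, 5)}
X = list(points.values())
C = [points["C"], points["D"], points["E"]]   # c1 = C, c2 = D, c3 = E

iteration = 0
while True:
    iteration += 1
    # Partition step, plus the MSE reported on the slides.
    P = [min(range(len(C)), key=lambda j: dist(x, C[j])) for x in X]
    mse = sum(dist(x, C[p]) ** 2 for x, p in zip(X, P)) / len(X)
    print(f"After iteration {iteration}: MSE = {mse:.2f}")
    # Centroid step.
    new_C = []
    for j in range(len(C)):
        members = [x for x, p in zip(X, P) if p == j]
        new_C.append(tuple(sum(v) / len(members) for v in zip(*members)))
    if new_C == C:        # no centroid (and hence no object) moved -> stop
        break
    C = new_C
# Prints: MSE = 9.00, then 1.78, then 0.31.
```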
Counter example

Initial codebook: c1 = A, c2 = B, c3 = C
[Figure: the same six data points with this initialization. K-means converges to a poor local optimum: c1 and c2 keep only A and B as singleton clusters, while C, D, E and F all fall into the cluster of c3.]
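The effect of this initialization can be checked with the same kind of script (a sketch, again using the coordinates read off the earlier figure): the run converges, but to a clearly worse local optimum than the MSE = 0.31 obtained above.

```python
from math import dist

points = {"A": (1, 1), "B": (2, 1), "C": (4, 5),
          "D": (5, 5), "E": (5, 6), "F": (8, 5)}
X = list(points.values())
C = [points["A"], points["B"], points["C"]]   # c1 = A, c2 = B, c3 = C

while True:
    P = [min(range(len(C)), key=lambda j: dist(x, C[j])) for x in X]
    new_C = []
    for j in range(len(C)):
        members = [x for x, p in zip(X, P) if p == j]
        new_C.append(tuple(sum(v) / len(members) for v in zip(*members)))
    if new_C == C:
        break
    C = new_C

mse = sum(dist(x, C[p]) ** 2 for x, p in zip(X, P)) / len(X)
print(P)             # [0, 1, 2, 2, 2, 2]: A and B stay as singleton clusters
print(f"{mse:.2f}")  # about 1.6, far worse than the 0.31 reached above
```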
Two ways to improve k-means
• Repeated k-means
– Try several random initializations and take the best result (see the sketch after this list).
– Multiplies the processing time.
– Works for easier data sets.
• Better initialization
– Use a better heuristic to allocate the initial distribution of the code vectors.
– Designing a good initialization is not any easier than designing a good clustering algorithm in the first place!
– K-means can (and should) anyway be applied as fine-tuning of the result of another method.
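A minimal sketch of repeated k-means, judging each run by its MSE; the helper functions and the default of 10 restarts are illustrative choices, not from the slides.

```python
import random
from math import dist

def kmeans(X, CI):
    """Plain k-means as in the pseudocode slide."""
    C = list(CI)
    while True:
        C_previous = list(C)
        P = [min(range(len(C)), key=lambda j: dist(x, C[j])) for x in X]
        for j in range(len(C)):
            members = [x for x, p in zip(X, P) if p == j]
            if members:                      # empty clusters keep their centroid
                C[j] = tuple(sum(v) / len(members) for v in zip(*members))
        if C == C_previous:
            return C, P

def mse(X, C, P):
    """Mean squared distance from each vector to its centroid."""
    return sum(dist(x, C[p]) ** 2 for x, p in zip(X, P)) / len(X)

def repeated_kmeans(X, k, repeats=10):
    """Repeated k-means: run from several random initial codebooks, keep the best."""
    best = None
    for _ in range(repeats):
        C, P = kmeans(X, random.sample(X, k))   # random initial centroids
        if best is None or mse(X, C, P) < mse(X, *best):
            best = (C, P)
    return best
```

On the six-point example above, a handful of restarts is typically enough to reach the MSE = 0.31 solution rather than the poor local optimum of the counter example.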
References
1. Forgy, E. W. (1965) Cluster analysis of multivariate data:
efficiency vs interpretability of classifications. Biometrics 21,
768-769.
2. MacQueen, J. (1967) Some methods for classification and
analysis of multivariate observations. In Proceedings of the
Fifth Berkeley Symposium on Mathematical Statistics and
Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281-297.
Berkeley, CA: University of California Press.
3. Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering
algorithm. Applied Statistics 28, 100-108.
4. Xu, M.: K-Means Based Clustering And Context Quantization.
University of Joensuu, Computer Science, Academic
Dissertation, 2005.