Clustering Methods: Part 2a
K-means algorithm

Pasi Fränti
Speech and Image Processing Unit
School of Computing
University of Eastern Finland

K-means overview

• Well-known clustering algorithm
• Number of clusters must be chosen in advance
• Strengths:
  1. Vectors can flexibly change clusters during the process.
  2. Always converges to a local optimum.
  3. Quite fast for most applications.
• Weaknesses:
  1. Quality of the output depends on the initial codebook.
  2. Global optimum solution not guaranteed.

K-means pseudo code

X: a set of N data vectors (the data set)
CI: k initial cluster centroids (random initial codebook; k is the number of clusters)
C: the cluster centroids of the k-clustering
P = {p(i) | i = 1, …, N}: the cluster label of each vector x_i

KMEANS(X, CI) → (C, P)
  C ← CI;
  REPEAT
    C_previous ← C;
    FOR all i ∈ [1, N] DO                          // generate optimal partitions
      p(i) ← arg min_{1 ≤ j ≤ k} d(x_i, c_j);
    FOR all j ∈ [1, k] DO                          // generate optimal centroids
      c_j ← average of the x_i for which p(i) = j;
  UNTIL C = C_previous

(A runnable sketch of this pseudo code is given at the end of the section.)

K-means example (1/4)

X: a set of N data vectors, N = 6:
A = (1, 1), B = (2, 1), C = (4, 5), D = (5, 5), E = (5, 6), F = (8, 5)
Number of clusters: k = 3
CI: initial codebook chosen at random: c1 = C, c2 = D, c3 = E
[Figure: the six data points and the three initial centroids in the plane.]

K-means example (2/4)

Generate optimal partitions.
Distance matrix (Euclidean distance):

        A     B     C     D     E     F
  c1   5.0   4.5   0.0   1.0   1.4   4.0
  c2   5.7   5.0   1.0   0.0   1.0   3.0
  c3   6.4   5.8   1.4   1.0   0.0   3.2

Partition: A, B, C → c1; D, F → c2; E → c3.
After 1st iteration: MSE = 9.0
(MSE is the mean of the squared distances from each vector to its assigned centroid.)

Generate optimal centroids:
c1 = ((1+2+4)/3, (1+1+5)/3) = (2.3, 2.3)
c2 = ((5+8)/2, (5+5)/2) = (6.5, 5)
c3 = (5, 6)

K-means example (3/4)

Generate optimal partitions.
Distance matrix (Euclidean distance):

        A     B     C     D     E     F
  c1   1.9   1.4   3.1   3.8   4.5   6.3
  c2   6.8   6.0   2.5   1.5   1.8   1.5
  c3   6.4   5.8   1.4   1.0   0.0   3.2

Partition: A, B → c1; F → c2; C, D, E → c3.
After 2nd iteration: MSE = 1.78

Generate optimal centroids:
c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c2 = (8, 5)
c3 = ((4+5+5)/3, (5+5+6)/3) = (4.7, 5.3)

K-means example (4/4)

Generate optimal partitions.
Distance matrix (Euclidean distance):

        A     B     C     D     E     F
  c1   0.5   0.5   4.7   5.3   6.1   7.6
  c2   8.1   7.2   4.0   3.0   3.2   0.0
  c3   5.7   5.1   0.7   0.5   0.7   3.3

Partition: A, B → c1; F → c2; C, D, E → c3 (unchanged).
After 3rd iteration: MSE = 0.31
No object moves; stop.

Counter example

Initial codebook: c1 = A, c2 = B, c3 = C
[Figure: k-means iterations with this initial codebook.]
With this initialization, k-means converges to a poor local optimum: A and B each keep their own cluster while C, D, E and F end up in a single cluster (MSE = 1.63, compared with 0.31 above).

Two ways to improve k-means

• Repeated k-means
  – Try several random initializations and take the best result.
  – Multiplies the processing time.
  – Works for easier data sets.
• Better initialization
  – Use some better heuristic to allocate the initial distribution of code vectors.
  – Designing a good initialization is not any easier than designing a good clustering algorithm in the first place!
  – K-means can (and should) anyway be applied as fine-tuning of the result of another method.
(A sketch of repeated k-means is given at the end of the section.)

References

1. Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21, 768-769.
2. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297. Berkeley, CA: University of California Press.
3. Hartigan, J. A. & Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100-108.
4. Xu, M. (2005). K-Means Based Clustering and Context Quantization. Academic dissertation, University of Joensuu, Computer Science.
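
Example implementation (sketch)

To make the pseudo code concrete, here is a minimal Python sketch of the same algorithm, run on the six-point example data set with the initial codebook c1 = C, c2 = D, c3 = E. The function names (kmeans, mse) and the empty-cluster handling are illustrative choices, not part of the original pseudo code.

import math

def kmeans(X, CI):
    """KMEANS(X, CI) -> (C, P): the k-means iteration from the pseudo code."""
    C = [list(c) for c in CI]          # current centroids, initialized from CI
    k, N = len(C), len(X)
    P = [0] * N                        # cluster label p(i) for each vector x_i
    while True:
        C_previous = [c[:] for c in C]
        # Generate optimal partitions: assign each vector to its nearest centroid.
        for i, x in enumerate(X):
            P[i] = min(range(k), key=lambda j: math.dist(x, C[j]))
        # Generate optimal centroids: average of the vectors in each cluster.
        for j in range(k):
            members = [X[i] for i in range(N) if P[i] == j]
            if members:                # keep the old centroid if a cluster is empty
                C[j] = [sum(coords) / len(members) for coords in zip(*members)]
        if C == C_previous:            # converged: no centroid moved
            return C, P

def mse(X, C, P):
    """Mean squared error of a clustering."""
    return sum(math.dist(x, C[P[i]]) ** 2 for i, x in enumerate(X)) / len(X)

# Example data from the slides: A..F, initial codebook c1 = C, c2 = D, c3 = E.
X  = [(1, 1), (2, 1), (4, 5), (5, 5), (5, 6), (8, 5)]
CI = [(4, 5), (5, 5), (5, 6)]
C, P = kmeans(X, CI)
print(C, P, mse(X, C, P))

Run as-is, the sketch converges in three iterations to the centroids (1.5, 1), (8, 5), (4.7, 5.3) with MSE ≈ 0.31, matching the worked example above.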
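
Repeated k-means (sketch)

The first improvement listed above, repeated k-means, can be sketched on top of the kmeans() and mse() functions from the previous block. The helper below is hypothetical: it draws each initial codebook as a random sample of the data vectors and keeps the run with the lowest MSE.

import random

def repeated_kmeans(X, k, repeats=10, seed=0):
    """Run k-means from several random initial codebooks; keep the best run."""
    rng = random.Random(seed)
    best_err, best_C, best_P = float("inf"), None, None
    for _ in range(repeats):
        CI = [list(x) for x in rng.sample(X, k)]   # random initial codebook
        C, P = kmeans(X, CI)
        err = mse(X, C, P)
        if err < best_err:                         # keep the lowest-MSE result
            best_err, best_C, best_P = err, C, P
    return best_C, best_P, best_err

print(repeated_kmeans(X, k=3, repeats=10))

Each repeat multiplies the processing time, as the slide notes, but for a small data set like this one a handful of restarts is typically enough to escape the poor local optimum of the counter example.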