Clustering Methods: Part 2a
K-means algorithm

Pasi Fränti
Speech and Image Processing Unit
School of Computing
University of Eastern Finland

K-means overview

• Well-known clustering algorithm
• Number of clusters must be chosen in advance
• Strengths:
  1. Vectors can flexibly change clusters during the process.
  2. Always converges to a local optimum.
  3. Quite fast for most applications.
• Weaknesses:
  1. Quality of the output depends on the initial codebook.
  2. Global optimum solution not guaranteed.

K-means pseudo code

X: a set of N data vectors (the data set)
CI: k initial cluster centroids (random initial codebook; k is the number of clusters)
C: the cluster centroids of the k-clustering
P = {p(i) | i = 1, …, N}: the cluster label of each vector x_i

KMEANS(X, CI) → (C, P)
  C ← CI;
  REPEAT
    C_previous ← C;
    FOR all i ∈ [1, N] DO                          // generate optimal partitions
      p(i) ← arg min_{1 ≤ j ≤ k} d(x_i, c_j);
    FOR all j ∈ [1, k] DO                          // generate optimal centroids
      c_j ← average of the x_i for which p(i) = j;
  UNTIL C = C_previous

(A runnable sketch of this pseudo code is given at the end of the section.)

K-means example (1/4)

X: a set of N data vectors, N = 6:
A = (1, 1), B = (2, 1), C = (4, 5), D = (5, 5), E = (5, 6), F = (8, 5)
Number of clusters: k = 3
CI: initial codebook chosen at random: c1 = C, c2 = D, c3 = E
[Figure: the six data points and the three initial centroids in the plane.]

K-means example (2/4)

Generate optimal partitions.
Distance matrix (Euclidean distance):

        A     B     C     D     E     F
  c1   5.0   4.5   0.0   1.0   1.4   4.0
  c2   5.7   5.0   1.0   0.0   1.0   3.0
  c3   6.4   5.8   1.4   1.0   0.0   3.2

Partition: A, B, C → c1; D, F → c2; E → c3.
After 1st iteration: MSE = 9.0
(MSE is the mean of the squared distances from each vector to its assigned centroid.)

Generate optimal centroids:
c1 = ((1+2+4)/3, (1+1+5)/3) = (2.3, 2.3)
c2 = ((5+8)/2, (5+5)/2) = (6.5, 5)
c3 = (5, 6)

K-means example (3/4)

Generate optimal partitions.
Distance matrix (Euclidean distance):

        A     B     C     D     E     F
  c1   1.9   1.4   3.1   3.8   4.5   6.3
  c2   6.8   6.0   2.5   1.5   1.8   1.5
  c3   6.4   5.8   1.4   1.0   0.0   3.2

Partition: A, B → c1; F → c2; C, D, E → c3.
After 2nd iteration: MSE = 1.78

Generate optimal centroids:
c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c2 = (8, 5)
c3 = ((4+5+5)/3, (5+5+6)/3) = (4.7, 5.3)

K-means example (4/4)

Generate optimal partitions.
Distance matrix (Euclidean distance):

        A     B     C     D     E     F
  c1   0.5   0.5   4.7   5.3   6.1   7.6
  c2   8.1   7.2   4.0   3.0   3.2   0.0
  c3   5.7   5.1   0.7   0.5   0.7   3.3

Partition: A, B → c1; F → c2; C, D, E → c3 (unchanged).
After 3rd iteration: MSE = 0.31
No object moves; stop.

Counter example

Initial codebook: c1 = A, c2 = B, c3 = C
[Figure: k-means iterations with this initial codebook.]
With this initialization, k-means converges to a poor local optimum: A and B each keep their own cluster while C, D, E and F end up in a single cluster (MSE = 1.63, compared with 0.31 above).

Two ways to improve k-means

• Repeated k-means
  – Try several random initializations and take the best result.
  – Multiplies the processing time.
  – Works for easier data sets.
• Better initialization
  – Use some better heuristic to allocate the initial distribution of code vectors.
  – Designing a good initialization is not any easier than designing a good clustering algorithm in the first place!
  – K-means can (and should) anyway be applied as fine-tuning of the result of another method.
(A sketch of repeated k-means is given at the end of the section.)

References

1. Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21, 768-769.
2. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297. Berkeley, CA: University of California Press.
3. Hartigan, J. A. & Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100-108.
4. Xu, M. (2005). K-Means Based Clustering and Context Quantization. Academic dissertation, University of Joensuu, Computer Science.
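
Example implementation (sketch)

To make the pseudo code concrete, here is a minimal Python sketch of the same algorithm, run on the six-point example data set with the initial codebook c1 = C, c2 = D, c3 = E. The function names (kmeans, mse) and the empty-cluster handling are illustrative choices, not part of the original pseudo code.

import math

def kmeans(X, CI):
    """KMEANS(X, CI) -> (C, P): the k-means iteration from the pseudo code."""
    C = [list(c) for c in CI]          # current centroids, initialized from CI
    k, N = len(C), len(X)
    P = [0] * N                        # cluster label p(i) for each vector x_i
    while True:
        C_previous = [c[:] for c in C]
        # Generate optimal partitions: assign each vector to its nearest centroid.
        for i, x in enumerate(X):
            P[i] = min(range(k), key=lambda j: math.dist(x, C[j]))
        # Generate optimal centroids: average of the vectors in each cluster.
        for j in range(k):
            members = [X[i] for i in range(N) if P[i] == j]
            if members:                # keep the old centroid if a cluster is empty
                C[j] = [sum(coords) / len(members) for coords in zip(*members)]
        if C == C_previous:            # converged: no centroid moved
            return C, P

def mse(X, C, P):
    """Mean squared error of a clustering."""
    return sum(math.dist(x, C[P[i]]) ** 2 for i, x in enumerate(X)) / len(X)

# Example data from the slides: A..F, initial codebook c1 = C, c2 = D, c3 = E.
X  = [(1, 1), (2, 1), (4, 5), (5, 5), (5, 6), (8, 5)]
CI = [(4, 5), (5, 5), (5, 6)]
C, P = kmeans(X, CI)
print(C, P, mse(X, C, P))

Run as-is, the sketch converges in three iterations to the centroids (1.5, 1), (8, 5), (4.7, 5.3) with MSE ≈ 0.31, matching the worked example above.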
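
Repeated k-means (sketch)

The first improvement listed above, repeated k-means, can be sketched on top of the kmeans() and mse() functions from the previous block. The helper below is hypothetical: it draws each initial codebook as a random sample of the data vectors and keeps the run with the lowest MSE.

import random

def repeated_kmeans(X, k, repeats=10, seed=0):
    """Run k-means from several random initial codebooks; keep the best run."""
    rng = random.Random(seed)
    best_err, best_C, best_P = float("inf"), None, None
    for _ in range(repeats):
        CI = [list(x) for x in rng.sample(X, k)]   # random initial codebook
        C, P = kmeans(X, CI)
        err = mse(X, C, P)
        if err < best_err:                         # keep the lowest-MSE result
            best_err, best_C, best_P = err, C, P
    return best_C, best_P, best_err

print(repeated_kmeans(X, k=3, repeats=10))

Each repeat multiplies the processing time, as the slide notes, but for a small data set like this one a handful of restarts is typically enough to escape the poor local optimum of the counter example.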