Clustering Methods: Part 2d
Swap-based algorithms

Pasi Fränti
31.3.2014
Speech & Image Processing Unit, School of Computing, University of Eastern Finland, Joensuu, FINLAND

Part I: Random Swap algorithm

P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.

Pseudo code of Random Swap

RandomSwap(X) → C, P
    C ← SelectRandomRepresentatives(X);
    P ← OptimalPartition(X, C);
    REPEAT T times
        (Cnew, j) ← RandomSwap(X, C);
        Pnew ← LocalRepartition(X, Cnew, P, j);
        Cnew, Pnew ← Kmeans(X, Cnew, Pnew);
        IF f(Cnew, Pnew) < f(C, P) THEN
            (C, P) ← Cnew, Pnew;
    RETURN (C, P);

Demonstration of the algorithm

[Figure sequence: (1) initial solution with one centroid but two clusters in one area, and two centroids but only one cluster in another; (2) centroid swap, made from the centroid-rich area to the centroid-poor area; (3) local repartition; (4) fine-tuning by K-means, shown for iterations 1 to 3 and 16 to 19; (5) final result after 25 iterations.]

Implementation of the swap

1. Random swap:
   $c_j \leftarrow x_i$, where $j \leftarrow \mathrm{random}(1, M)$ and $i \leftarrow \mathrm{random}(1, N)$
2. Re-partition vectors from old cluster:
   $p_i \leftarrow \arg\min_{1 \le k \le M} d(x_i, c_k)^2 \quad \forall i : p_i = j$
3. Create new cluster:
   $p_i \leftarrow \arg\min_{k = j \,\lor\, k = p_i} d(x_i, c_k)^2 \quad \forall i \in [1, N]$

Random swap as local search

• Study neighbor solutions.
• Select one and move.

Role of K-means

• Fine-tune solution by hill-climbing technique!
• Consider only local optima!
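The Random Swap pseudo code and the three swap steps from Part I can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`sq_dist`, `nearest`, `mse`, `kmeans`, `random_swap`), the use of mean squared error as the cost f, and the choice of two K-means iterations per swap are all assumptions made for this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, C, candidates=None):
    """Index of the centroid nearest to x, optionally among given indices."""
    ks = range(len(C)) if candidates is None else candidates
    return min(ks, key=lambda k: sq_dist(x, C[k]))


def mse(X, C, P):
    """Cost f(C, P): mean squared distance of each vector to its centroid."""
    return sum(sq_dist(x, C[p]) for x, p in zip(X, P)) / len(X)


def kmeans(X, C, P, iterations=2):
    """A few K-means iterations: update centroids, then repartition."""
    C = [list(c) for c in C]
    P = list(P)
    for _ in range(iterations):
        for k in range(len(C)):
            members = [x for x, p in zip(X, P) if p == k]
            if members:  # an empty cluster keeps its previous centroid
                C[k] = [sum(col) / len(members) for col in zip(*members)]
        P = [nearest(x, C) for x in X]
    return C, P


def random_swap(X, M, T, rng=None):
    """Random Swap sketch: T trial swaps, each accepted only if the cost drops."""
    rng = rng or random.Random()
    C = [list(x) for x in rng.sample(X, M)]   # SelectRandomRepresentatives
    P = [nearest(x, C) for x in X]            # OptimalPartition
    best = mse(X, C, P)
    for _ in range(T):
        C_new = [list(c) for c in C]
        j = rng.randrange(M)                  # centroid to be replaced
        C_new[j] = list(rng.choice(X))        # step 1: c_j <- x_i
        # Step 2: vectors of the old cluster j search among all centroids.
        P_new = [nearest(x, C_new) if p == j else p for x, p in zip(X, P)]
        # Step 3: every vector chooses between its current centroid and c_j.
        P_new = [nearest(x, C_new, (p, j)) for x, p in zip(X, P_new)]
        C_new, P_new = kmeans(X, C_new, P_new)
        cost = mse(X, C_new, P_new)
        if cost < best:                       # keep the swap only if f improves
            C, P, best = C_new, P_new, cost
    return C, P
```

Because a trial solution is accepted only when its cost is strictly lower, the returned solution is never worse than the initial random one; calling `random_swap(X, M=15, T=5000)` on a list of tuples `X` would mirror the setting used in the slides, although running times of this pure-Python sketch are not comparable to the reported ones.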
Role of swap: reduce search space

[Figure: effective search space.]

Chain reaction by K-means after swap

[Figure: chain reaction caused by K-means after a swap.]

Independence of initialization

Results for T = 5000 iterations.

[Figure: bar chart for Bridge, MSE axis from 150 to 190, comparing initial solutions (worst and best) with the results after Random Swap (RS). Bar labels: K-means, Random + RS, K-means + RS, Split + RS, Ward + RS. Values printed on the chart: 176.53, 163.93, 163.63, 163.51, 163.08.]

Part II: Efficiency of Random Swap

Probability of good swap

• Select a proper centroid for removal:
  – There are M clusters in total: $p_{removal} = 1/M$.
• Select a proper new location:
  – There are N choices: $p_{add} = 1/N$
  – Only M are significantly different: $p_{add} = 1/M$
• In total:
  – $M^2$ significantly different swaps.
  – Probability of each different swap is $p_{swap} = 1/M^2$
  – Open question: how many of these are good?

Number of neighbors

Open question: what is the size of neighborhood (α)?

[Figure: two panels, "Voronoi neighbors" and "Neighbors by distance".]

Observed number of neighbors

[Figure: histogram for data set S2, frequency (0 % to 45 %) versus number of neighbors (1 to 9); average = 3.9.]

Average number of neighbors

| Data set | From data set | From clustering solution: Initial (T=0) | Early stage (T=5) | Final (T=5000) |
|---|---|---|---|---|
| Bridge | 68.8 | 12.2 | 7.3 | 6.2 |
| House | 14.4 | 5.3 | 7.1 | 7.1 |
| Miss America | 345 | 51.0 | 26.1 | 17.1 |
| Europe | (4.0) | 3.8 | 5.3 | 5.3 |
| BIRCH1 | 4.0 | 3.7 | 4.5 | 4.5 |
| BIRCH2 | (3.7) | 2.2 | 2.1 | 2.0 |
| BIRCH3 | (3.9) | 3.0 | 3.8 | 4.0 |
| S1 | 3.8 | 2.5 | 3.1 | 3.2 |
| S2 | 3.9 | 2.8 | 3.3 | 3.7 |
| S3 | 3.9 | 2.7 | 3.6 | 3.3 |
| S4 | 3.9 | 2.7 | 3.7 | 4.0 |

Expected number of iterations

• Probability of not finding good swap:
  $q = \left(1 - \frac{\alpha^2}{M^2}\right)^T$
• Estimated number of iterations:
  $\log q = T \cdot \log\left(1 - \frac{\alpha^2}{M^2}\right) \;\Rightarrow\; T = \frac{\log q}{\log\left(1 - \frac{\alpha^2}{M^2}\right)}$

Estimated number of iterations depending on T

Observed q-values:

| Data set | q=10% | q=1% | q=0.1% | Expected |
|---|---|---|---|---|
| S1 | 19% | 3.1% | 0.1% | 72 |
| S2 | 14% | 1.2% | 0.1% | 56 |
| S3 | 22% | 1.0% | 0.2% | 55 |
| S4 | 22% | 3.6% | 1.1% | 48 |

Estimated iterations (T):

| Data set | q=10% | q=1% | q=0.1% | Expected |
|---|---|---|---|---|
| S1 | 53 | 106 | 159 | 23 |
| S2 | 47 | 93 | 140 | 21 |
| S3 | 39 | 78 | 117 | 17 |
| S4 | 37 | 74 | 111 | 16 |

Observed = number of iterations needed in practice.
Estimated = estimate of the number of iterations needed for a given q.

Probability of success (p) depending on T

[Figure: p (0 to 100) versus iterations (0 to 300).]

Probability of failure (q) depending on T

[Figure: q on a logarithmic axis (1 down to 0.000000001) versus iterations (0 to 300).]

Observed probabilities depending on dimensionality

[Figure: q on a logarithmic axis (0.01 % to 100 %) versus dimensionality (16 to 1024); curves: observed for q = 10 %, observed for q = 1 %, observed for q = 0.10 %.]

Bounds for the number of iterations

Upper limit:
$T = \frac{\ln q}{\ln\left(1 - \alpha^2/M^2\right)} \le \frac{-\ln q}{\alpha^2/M^2} = -\ln q \cdot \frac{M^2}{\alpha^2}$

Lower limit similarly; resulting in:
$T \approx -\ln q \cdot \frac{M^2}{\alpha^2}$

Multiple swaps (w)

Probability for performing less than w swaps:
$q = \sum_{i=0}^{w-1} \binom{T}{i} \left(\frac{\alpha^2}{M^2}\right)^{i} \left(1 - \frac{\alpha^2}{M^2}\right)^{T-i}$

Expected number of iterations:
$\hat{T} = \sum_{i=1}^{w} \frac{1}{i^2} \cdot \frac{M^2}{\alpha^2}$

Number of swaps needed

Example from image quantization.

[Figure: K-means clustering result (3 swaps needed) and final clustering result, with the removed and added centroids marked.]

Efficiency of the random swap

Total time to find correct clustering:
  – Time per iteration × Number of iterations

Time complexity of a single step:
  – Swap: O(1)
  – Remove cluster: 2M·N/M = O(N)
  – Add cluster: 2N = O(N)
  – Centroids: 2·(2N/M) + 2 + 2 = O(N/M)
  – (Fast) K-means iteration: 4αN = O(αN)*

*See Fast K-means for analysis.

Time complexity and the observed number of steps

| Step | Time complexity | Observed steps at iteration 50 | at iteration 100 | at iteration 500 |
|---|---|---|---|---|
| Centroid swap | 2 | 2 | 2 | 2 |
| Cluster removal | 2N | 7,526 | 8,448 | 10,137 |
| Cluster addition | 2N | 8,192 | 8,192 | 8,192 |
| Update centroids | 4N/M + 2 + 1 | 53 | 61 | 60 |
| K-means iterations | 4αN | 300,901 | 285,555 | 197,327 |
| Total | O(αN) | 316,674 | 302,258 | 215,718 |

Time spent by K-means iterations

[Figure: Bridge; stacked curves for local repartition, K-means 1st iteration and K-means 2nd iteration; vertical axis 0 to 140, horizontal axis 0 to 500.]

Effect of K-means iterations

[Figure: Bridge.]

The version with one iteration seems to be the weakest all the time.
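The iteration estimate from the "Expected number of iterations" and "Bounds for the number of iterations" slides can be checked numerically with a small sketch. The function names and the example values M = 15 and α = 3 are assumptions chosen for illustration only; they are not taken from the experiments reported in the tables.

```python
import math


def failure_probability(T, M, alpha):
    """q = (1 - alpha^2 / M^2)^T: probability that T trial swaps contain no good swap."""
    return (1.0 - alpha ** 2 / M ** 2) ** T


def iterations_needed(q, M, alpha):
    """Solve q = (1 - alpha^2 / M^2)^T for T."""
    return math.log(q) / math.log(1.0 - alpha ** 2 / M ** 2)


def iterations_upper_bound(q, M, alpha):
    """Upper limit -ln(q) * M^2 / alpha^2 on the number of iterations."""
    return -math.log(q) * M ** 2 / alpha ** 2


# Illustrative values (assumed): M = 15 clusters, alpha = 3 neighbors, q = 10 %.
T = iterations_needed(0.1, 15, 3.0)          # roughly 56.4 iterations
bound = iterations_upper_bound(0.1, 15, 3.0)  # roughly 57.6 iterations
```

Squaring q (for example going from 10 % to 1 %) doubles the estimate, which is the logarithmic dependency on q visible in the "Estimated iterations (T)" table, where each row grows in the ratio 1 : 2 : 3.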
[Figure, continued: curves for 1, 2, 3, 4 and 5 K-means iterations on Bridge; axes: Error (MSE) versus Time (s), logarithmic time axis from 0.1 to 100.]

Versions with other numbers of iterations are pretty even.

Total time complexity

Time complexity of a single step (t):
$t = O(\alpha N)$

Number of iterations needed (T):
$T \approx -\ln q \cdot \frac{M^2}{\alpha^2}$

Total time:
$T(N, M) \approx -\ln q \cdot \frac{M^2}{\alpha^2} \cdot \alpha N = \frac{-\ln q \cdot N M^2}{\alpha}$

Time complexity: conclusions

$T(N, M) \approx \frac{-\ln q \cdot N M^2}{\alpha}$

1. Logarithmic dependency on q
2. Linear dependency on N
3. Quadratic dependency on M
   (With a large number of clusters, the method can be too slow.)
4. Inverse dependency on α (worst case α = 2)
   (The higher the dimensionality and the higher the cluster overlap, the faster the method.)

Time-distortion performance

[Figures: time-distortion curves (MSE versus time, logarithmic time axis) comparing Repeated k-means and Random Swap on six data sets: Bridge, Missa1, Birch1, Birch2, Europe and KDD-Cup04 Bio.]

References

Random swap algorithm:
• P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
• P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.

Pseudo code:
• http://cs.joensuu.fi/sipu/soft/

Efficiency of Random swap algorithm:
• P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf.
on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.

Part III: Example when 4 swaps needed

[Figure sequence with the MSE values shown at each stage:]

| Stage | MSE values shown |
|---|---|
| 1st swap | $4.2 \times 10^9$, $3.4 \times 10^9$ |
| 2nd swap | $3.1 \times 10^9$, $3.0 \times 10^9$ |
| 3rd swap | $2.3 \times 10^9$, $2.1 \times 10^9$ |
| 4th swap | $1.9 \times 10^9$, $1.7 \times 10^9$ |
| Final result | $1.3 \times 10^9$ |

Part IV: Deterministic Swap

Deterministic swap

From where to where?

[Figure: clustering with 15 numbered clusters; one area has one centroid but two clusters, another has two centroids but only one cluster.]

Costs for the swap:

| Cluster | Removal | Addition |
|---|---|---|
| 1 | 0.80 | 0.39 |
| 2 | 1.04 | 0.64 |
| 3 | 5.48 | 1.09 |
| 4 | 5.66 | 0.92 |
| 5 | 6.50 | 0.76 |
| 6 | 7.67 | 1.01 |
| 7 | 8.47 | 0.45 |
| 8 | 9.10 | 0.75 |
| 9 | 9.90 | 1.42 |
| 10 | 11.09 | 1.26 |
| 11 | 11.47 | 0.61 |
| 12 | 12.17 | 4.70 |
| 13 | 14.61 | 0.94 |
| 14 | 16.41 | 0.93 |
| 15 | 16.68 | 1.41 |

Cluster removal

• Merge two existing clusters [Frigui 1997, Kaukoranta 1998] following the spirit of agglomerative clustering.
• Local optimization: remove the prototype that increases the cost function value least [Fritzke 1997, Likas 2003, Fränti 2006].
• Smart swap: find two nearest prototypes, and remove one of them randomly [Chen, 2010].
• Pairwise swap: locate a pair of inconsistent prototypes in two solutions [Zhao, 2012].

Cluster addition

1. Select an existing cluster
   – Depending on strategy: 1..M choices.
   – Each choice takes O(N) time to test.
2. Select a location within this cluster
   – Add new prototype
   – Consider only existing points

Select the cluster

• Cluster with the biggest MSE
  – Intuitive heuristic [Fritzke 1997, Chen 2010]
  – Computationally demanding:
• Local optimization
  – Try all clusters for the addition [Likas et al, 2003]
  – Computationally demanding: O(NM)-O(N²)

Select the location

1. Current prototype + ε [Fritzke 1997]
2. Furthest vector [Fränti et al 1997]
3. Any other split heuristic [Fränti et al, 1997]
4. Random location
5. Every possible location [Likas et al, 2003]

Complexity of swaps

| Variant | Select cluster | Select location | Time complexity |
|---|---|---|---|
| Full search | Try all | Try all | O(N²) |
| Optimal cluster | Try all | Any heuristic | O(NM) |
| Heuristic | Greatest MSE | Any heuristic | O(N) |
| Random | Random | Random | O(1) |

Furthest point in cluster

[Figure: the removed prototype, the cluster where the new prototype is added, and the furthest point selected in that cluster.]

Smart swap

• Initialization: O(MN)
• Swap iteration
  – Finding nearest pair: O(M²)
  – Calculating distortion: O(N)
  – Sorting clusters: O(M·log M)
  – Evaluation of result: O(N)
  – Repartition and fine-tuning: O(N)
  Total: O(MN + M² + I·N)
• Expected number of iterations: < 2·M
• Estimated total time: O(2M²N)

[Figure: the cluster with the largest distortion and the pair of nearest prototypes.]

Smart swap pseudo code

SmartSwap(X, M) → C, P
    C ← InitializeCentroids(X);
    P ← PartitionDataset(X, C);
    Maxorder ← log2 M;
    order ← 1;
    WHILE order < Maxorder
        ci, cj ← FindNearestPair(C);
        S ← SortClustersByDistortion(P, C);
        cswap ← RandomSelect(ci, cj);
        clocation ← s_order;
        Cnew ← Swap(cswap, clocation);
        Pnew ← LocalRepartition(P, Cnew);
        KmeansIteration(Pnew, Cnew);
        IF f(Cnew) < f(C), THEN
            order ← 1;
            C ← Cnew;
        ELSE
            order ← order + 1;
    KmeansIteration(P, C);

Pairwise swap

[Figure: prototypes of two solutions. Paired prototypes are nearest neighbors of each other. A prototype whose nearest neighbor in the other set is further away than its nearest neighbor in the same set is subject to swap. The remaining prototypes are unpaired.]

Combinations of random and deterministic swap

| Variant | Removal | Addition |
|---|---|---|
| RR | Random | Random |
| RD | Random | Deterministic |
| DR | Deterministic | Random |
| DD | Deterministic | Deterministic |
| D2R | Deterministic + data update | Random |
| D2D | Deterministic + data update | Deterministic |

Summary of the time complexities

| Step | RR | RD | DR | DD | D2R | D2D |
|---|---|---|---|---|---|---|
| Removal | O(1) | O(1) | O(MN) | O(MN) | O(αN) | O(αN) |
| Addition | O(1) | O(N) | O(1) | O(N) | O(1) | O(N) |
| Repartition | O(N) | O(N) | O(N) | O(N) | O(N) | O(N) |
| K-means | O(αN) | O(αN) | O(αN) | O(αN) | O(αN) | O(αN) |
| Total | O(αN) | O(αN) | O(MN) | O(MN) | O(αN) | O(αN) |

RR and RD use random removal; DR, DD, D2R and D2D use deterministic removal.

Profiles of the processing time

[Figure: time (s) per iteration for the variants RR, RD, DR, DD, D2R and D2D, broken down into K-means, Swap, Repartition and Others; one panel is labelled Bridge.]

Test data sets

| Data set | Type of data set | Number of data vectors (N) | Number of clusters (M) | Dimension of data vector (d) |
|---|---|---|---|---|
| Bridge | Gray-scale image | 4096 | 256 | 16 |
| House* | RGB image | 34112 | 256 | 3 |
| Miss America | Residual vectors | 6480 | 256 | 16 |
| Europe | Differential coordinates | 169673 | | 2 |
| BIRCH1-BIRCH3 | Synthetically generated | 100000 | 100 | 2 |
| S1-S4 | Synthetically generated | 5000 | 15 | 2 |
| Dim32-1024 | Synthetically generated | 1000 | 256 | 32 – 1024 |

[Figures: scatter plots of data sets S1, S2, S3 and S4, and of the Birch data sets Birch1, Birch2 and Birch3.]

Experiments

[Figure: Bridge, Error (MSE) from 160 to 185 versus Time (s) from 1 to 100; curves for RR (Random Swap), DR, RD and DD.]

[Figure: Bridge, MSE from 165 to 190 versus Time from 0.1 to 100; curves for Repeated k-means, Random Swap, DR and D2R.]

[Figure: Birch2, Error (MSE, scale ×10^6) from 2 to 5 versus Time (s); curves for RR (Random Swap), DR, RD and DD.]

[Figure: Miss America (Missa1), MSE from 5.3 to 6.5 versus Time from 1 to 100; curves for Repeated k-means, Random Swap, DR and D2R.]

Quality comparisons (MSE) with 10 second time constraint

| Method | Bridge | House | Miss America | Europe (×10^7) | Birch1 (×10^8) | Birch2 (×10^6) |
|---|---|---|---|---|---|---|
| Repeated Random | 251.32 | 12.12 | 8.34 | 2.37 | 13.10 | 22.35 |
| Repeated K-means | 177.66 | 6.58 | 5.92 | 1.52 | 5.49 | 4.10 |
| Random Swap | 174.08 | 6.41 | 5.85 | 1.26 | 5.70 | 4.43 |
| RD-variant | 171.20 | 6.10 | 5.58 | 1.02 | 5.11 | 2.78 |
| Average speed-up from RR to RD | 2:1 | 4:1 | 5:1 | 6:1 | 4:1 | 18:1 |

Literature

1. P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
2. P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
3. P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.
4. P. Fränti, M. Tuononen and O. Virmajoki, "Deterministic and randomized local search algorithms for clustering", IEEE Int. Conf. on Multimedia and Expo (ICME'08), Hannover, Germany, 837-840, June 2008.
5. P. Fränti and O. Virmajoki, "On the efficiency of swap-based clustering", Int. Conf. on Adaptive and Natural Computing Algorithms (ICANNGA'09), Kuopio, Finland, LNCS 5495, 303-312, April 2009.
6. J. Chen, Q. Zhao and P. Fränti, "Smart swap for more efficient clustering", Int. Conf. Green Circuits and Systems (ICGCS'10), Shanghai, China, 446-450, June 2010.
7. B. Fritzke, "The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks", Neural Processing Letters, 5 (1), 35-45, 1997.
8. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
9. T. Kaukoranta, P. Fränti and O. Nevalainen, "Iterative split-and-merge algorithm for VQ codebook generation", Optical Engineering, 37 (10), 2726-2732, October 1998.
10. H. Frigui and R. Krishnapuram, "Clustering by competitive agglomeration", Pattern Recognition, 30 (7), 1109-1119, July 1997.
11. A. Likas, N. Vlassis and J.J. Verbeek, "The global k-means clustering algorithm", Pattern Recognition, 36, 451-461, 2003.
12. PAM (Kaufman and Rousseeuw, 1987)
13. CLARA (Kaufman and Rousseeuw, 1990)
14. CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994)
15. R.T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining", IEEE Transactions on Knowledge and Data Engineering, 14 (5), September/October 2002.
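As a supplement to the heuristic cluster addition of Part IV (select the cluster with the greatest distortion, then take its furthest vector as the location of the new prototype), the two choices can be sketched as follows. The function names are hypothetical, total squared error is assumed as the distortion measure, and every cluster is assumed to be non-empty; the sketch illustrates the O(N) heuristic row of the "Complexity of swaps" table rather than any published implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_distortions(X, C, P):
    """Total squared error of each cluster, computed in one pass over the data."""
    totals = [0.0] * len(C)
    for x, p in zip(X, P):
        totals[p] += sq_dist(x, C[p])
    return totals


def heuristic_addition(X, C, P):
    """Return (cluster index, data vector) proposing where to add a prototype.

    The cluster is the one with the greatest distortion; the location is the
    vector of that cluster furthest from its current prototype.
    """
    totals = cluster_distortions(X, C, P)
    k = max(range(len(C)), key=lambda c: totals[c])
    members = [x for x, p in zip(X, P) if p == k]  # assumed non-empty
    return k, max(members, key=lambda x: sq_dist(x, C[k]))


# Small usage example with two clusters; the second one is clearly too wide.
X = [(0, 0), (1, 0), (10, 0), (10, 1), (10, 9)]
C = [(0.5, 0.0), (10.0, 10.0 / 3)]
P = [0, 0, 1, 1, 1]
k, location = heuristic_addition(X, C, P)  # k == 1, location == (10, 9)
```

Both passes are linear in N, so combining this addition step with random removal corresponds to the RD row of the variant table in terms of per-swap cost.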