Evolutionary algorithms for clustering

Pasi Fränti
Dept. of Computer Science, University of Joensuu, FINLAND

Codebook generation for VQ

[Figure: the mapping function P assigns each of the N training vectors in X a scalar index in [1..M]; the codebook C consists of M code vectors, each a K-dimensional vector.]

Data:
  X = {x_1, x_2, ..., x_N}   set of N input vectors
  P = {p_1, p_2, ..., p_N}   partition defining the M clusters; each p_i is an index in [1..M]
  C = {c_1, c_2, ..., c_M}   cluster centroids

Optimization task: find C minimizing the distortion f(P, C).

Problem size:
  N   number of data vectors
  M   number of clusters
  K   dimension of the vectors
Typical problem size: N = 4000, M = 256, K = 16.

Example data set

[Figure: example data set.]

Representation of solution

[Figure: a solution is represented by its partition and its codebook.]

K-means algorithm

  K-means(X, P, C) returns (P, C):
  REPEAT
    FOR i := 1 TO N DO p_i <- FindNearestCentroid(x_i, C)
    FOR j := 1 TO M DO c_j <- CalculateCentroid(X, P, j)
  UNTIL no improvement.

Partition step:
  p_i <- arg min_{1 <= j <= M} d(x_i, c_j)^2 ,   i = 1, ..., N

Centroid step:
  c_j <- ( sum_{i : p_i = j} x_i ) / |{ i : p_i = j }| ,   j = 1, ..., M

(Runnable sketches of this loop and of the RLS algorithm below follow the RLS results.)

Literature

LOCAL SEARCH:
P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 2000 (in press).
P. Fränti, H.H. Gyllenberg, M. Gyllenberg, J. Kivijärvi, T. Koski, T. Lund and O. Nevalainen, "Minimizing stochastic complexity using GLA and local search with applications to classification of bacteria", Biosystems, 57 (1), 37-48, June 2000.

GENETIC ALGORITHM:
P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.

Structure of Local Search

  Generate initial solution.
  REPEAT
    Generate a set of new solutions.
    Evaluate the new solutions.
    Select the best solution.
  UNTIL stopping criterion met.

Neighborhood function using random swap:
  c_j <- x_i ,   j = random(1, M), i = random(1, N)

Object rejection (vectors of the replaced cluster j are repartitioned):
  p_i <- arg min_{1 <= k <= M} d(x_i, c_k)^2 ,   for all i such that p_i = j

Object attraction (any vector may move to the new cluster j):
  p_i <- arg min_{k in {p_i, j}} d(x_i, c_k)^2 ,   i = 1, ..., N

Randomized local search

  RLS algorithm 1:
    C <- SelectRandomDataObjects(M)
    P <- OptimalPartition(C)
    REPEAT T times
      C_new <- RandomSwap(C)
      P_new <- LocalRepartition(P, C_new)
      C_new <- OptimalRepresentatives(P_new)
      IF f(P_new, C_new) < f(P, C) THEN (P, C) <- (P_new, C_new)

  RLS algorithm 2:
    C <- SelectRandomDataObjects(M)
    P <- OptimalPartition(C)
    REPEAT T times
      C_new <- RandomSwap(C)
      P_new <- LocalRepartition(P, C_new)
      K-means(P_new, C_new)
      IF f(P_new, C_new) < f(P, C) THEN (P, C) <- (P_new, C_new)

Random swap

[Figure: before the swap, one cluster is missing and another is unnecessary; after the swap, a centroid has been removed from the crowded region and added to the uncovered one.]

Local fine-tuning

[Figure: after local refinement the new cluster appears and the obsolete cluster disappears; after K-means the remaining centroid moves into place.]

[Bar chart (RLS-2): final MSE on Bridge for K-means alone (176.53) and for RLS started from different initializations: Random + RLS 163.93, K-means + RLS 163.63, Splitting + RLS 163.51, Ward + RLS 163.08.]

[Line plot (Bridge): MSE as a function of the number of iterations (0-5000) for RLS-1 and RLS-2.]
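The K-means loop above translates almost directly into code. The following is a minimal NumPy sketch, not the authors' implementation: the stopping rule follows the slide, iterating until the total squared error no longer improves, and an empty cluster simply keeps its previous centroid.

    import numpy as np

    def kmeans(X, C, max_iters=100):
        # Alternate the partition step and the centroid step
        # until f(P, C) no longer improves ("UNTIL no improvement").
        prev_f = np.inf
        P = np.zeros(len(X), dtype=int)
        for _ in range(max_iters):
            # Partition step: p_i <- arg min_j d(x_i, c_j)^2
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            P = d2.argmin(axis=1)
            f = d2[np.arange(len(X)), P].sum()
            if f >= prev_f:
                break
            prev_f = f
            # Centroid step: c_j <- mean of the vectors with p_i = j
            for j in range(len(C)):
                members = X[P == j]
                if len(members):          # empty clusters keep their centroid
                    C[j] = members.mean(axis=0)
        return P, C

For a task of the typical size quoted above (N = 4000, M = 256, K = 16) the distance matrix d2 is only 4000 x 256, so the vectorized form is practical.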
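RLS variant 2 can then be sketched on top of the kmeans function above. One simplification to note: the slide's LocalRepartition applies object rejection and attraction only to the affected clusters, whereas this sketch lets two full K-means iterations do the repartitioning; the acceptance test is the same, keeping a swap only if it lowers f(P, C).

    import numpy as np

    def distortion(X, C, P):
        # f(P, C): mean squared error of the quantized data.
        return ((X - C[P]) ** 2).sum() / len(X)

    def rls(X, M, T=5000, seed=0):
        rng = np.random.default_rng(seed)
        # C <- SelectRandomDataObjects(M); P <- OptimalPartition(C)
        C = X[rng.choice(len(X), size=M, replace=False)].astype(float)
        P, C = kmeans(X, C)
        best = distortion(X, C, P)
        for _ in range(T):
            # RandomSwap: replace a random centroid by a random data object.
            C_new = C.copy()
            C_new[rng.integers(M)] = X[rng.integers(len(X))]
            # Fine-tune by two K-means iterations (RLS variant 2).
            P_new, C_new = kmeans(X, C_new, max_iters=2)
            f_new = distortion(X, C_new, P_new)
            if f_new < best:              # accept only improving swaps
                P, C, best = P_new, C_new, f_new
        return P, C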
Structure of Genetic Algorithm

  Genetic algorithm:
    Generate S initial solutions.
    REPEAT T times
      Generate new solutions.
      Sort the solutions.
      Store the best solution.
    END-REPEAT
    Output the best solution found.

  Generate new solutions:
    REPEAT S times
      Select a pair for crossover.
      Cross the selected solutions.
      Mutate the new solution.
      Fine-tune the new solution by GLA.
    END-REPEAT

Pseudo code for the GA (1/2)

  CrossSolutions(C1, P1, C2, P2) returns (C_new, P_new):
    C_new <- CombineCentroids(C1, C2)
    P_new <- CombinePartitions(P1, P2)
    C_new <- UpdateCentroids(C_new, P_new)
    RemoveEmptyClusters(C_new, P_new)
    PerformPNN(C_new, P_new)

  CombineCentroids(C1, C2) returns C_new:
    C_new <- C1 ∪ C2

  CombinePartitions(C_new, P1, P2) returns P_new:
    FOR i := 1 TO N DO
      IF d(x_i, c^1_{p^1_i})^2 < d(x_i, c^2_{p^2_i})^2 THEN
        p^new_i <- p^1_i
      ELSE
        p^new_i <- p^2_i
    END-FOR

Pseudo code for the GA (2/2)

  UpdateCentroids(C_new, P_new) returns C_new:
    FOR j := 1 TO |C_new| DO
      c^new_j <- CalculateCentroid(P_new, j)

  PerformPNN(C_new, P_new):
    FOR i := 1 TO |C_new| DO q_i <- FindNearestNeighbor(c_i)
    WHILE |C_new| > M DO
      a <- FindMinimumDistance(Q)
      b <- q_a
      MergeClusters(c_a, p_a, c_b, p_b)
      UpdatePointers(Q)
    END-WHILE

(A code sketch of this crossover follows the performance comparison below.)

Combining existing solutions

[Figure 3.11: four panels labelled "Random 1", "Random 2", "Random 1 + Random 2", and "Final result".] Illustration of the use of the PNN method as a deterministic crossover method in the genetic algorithm for the data set S2. The top left and top right panels show two initial codebooks, generated randomly among the data vectors (M = 15). The bottom left panel shows the codebook after combining the two initial codebooks (M = 30). The bottom right panel shows the final codebook after the 15 merge steps of the PNN method (M = 15).

According to the experiments in [F00], the genetic algorithm with the PNN crossover method outperforms all the comparative methods, including the earlier variants of the genetic algorithm. The only other method reported to give better results is the self-adaptive genetic algorithm (SAGA) [KFN03], which still uses the PNN crossover as its key component. The PNN method as a deterministic crossover also converges fast with a rather small population size, which makes the algorithm remarkably faster than any of the previously reported genetic algorithms.

Performance comparison of GA

[Line plot (Bridge): distortion vs. number of iterations (0-50) for Random crossover + GLA, Mutations + GLA, PNN crossover, and PNN crossover + GLA.]

[Line plot (Miss America): distortion vs. number of iterations (0-50) for the same four GA variants.]

[Line plot: MSE vs. running time (1 s to 100 000 s, logarithmic scale) for repeated K-means, PNN, IS, RLS, GAIS, and SAGA.]

Performance comparison

  Method         Bridge   Miss America   House
  Random         251.32        8.34      12.12
  K-means        179.68        5.96       7.81
  PNN            169.15        5.52       6.36
  Local search   164.64        5.28       5.96
  Tabu search    164.23        5.22       5.94
  GA             162.09        5.18       5.92
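The crossover pseudocode above can be sketched as follows, reusing NumPy and the conventions of the earlier sketches. This is illustrative rather than the paper's implementation: the nearest-neighbor pointer table Q is replaced by a plain O(|C|^2) scan per merge step, and the merge cost is the usual PNN (Ward-style) cost n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2.

    import numpy as np

    def pnn_crossover(X, C1, P1, C2, P2, M):
        # CombineCentroids: union of the two codebooks.
        C = np.vstack([C1, C2]).astype(float)
        # CombinePartitions: each vector keeps the nearer of its two centroids.
        d1 = ((X - C1[P1]) ** 2).sum(axis=1)
        d2 = ((X - C2[P2]) ** 2).sum(axis=1)
        P = np.where(d1 < d2, P1, P2 + len(C1))
        # RemoveEmptyClusters + UpdateCentroids.
        sizes = np.bincount(P, minlength=len(C))
        keep = sizes > 0
        P = (np.cumsum(keep) - 1)[P]          # renumber the surviving clusters
        C = np.array([X[P == j].mean(axis=0) for j in range(keep.sum())])
        sizes = np.bincount(P)
        # PerformPNN: merge the cheapest pair until M clusters remain.
        while len(C) > M:
            best, pair = np.inf, None
            for a in range(len(C)):
                for b in range(a + 1, len(C)):
                    cost = sizes[a] * sizes[b] / (sizes[a] + sizes[b]) \
                           * ((C[a] - C[b]) ** 2).sum()
                    if cost < best:
                        best, pair = cost, (a, b)
            a, b = pair
            # Merge cluster b into a: size-weighted centroid, reassign, drop b.
            C[a] = (sizes[a] * C[a] + sizes[b] * C[b]) / (sizes[a] + sizes[b])
            sizes[a] += sizes[b]
            P[P == b] = a
            P[P > b] -= 1
            C = np.delete(C, b, axis=0)
            sizes = np.delete(sizes, b)
        return C, P

In the GA loop this crossover would be followed by mutation and a few GLA (K-means) iterations, as in the "Generate new solutions" pseudocode above.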