k-medoid clustering with genetic algorithms
Wei-Ming Chen, 2012.12.06

Outline
- k-medoids clustering
- Famous works
- GCA: clustering with the aid of a genetic algorithm
- A clustering genetic algorithm that also determines the number of clusters
- Conclusion

What is k-medoid clustering?
- Proposed in 1987 by L. Kaufman and P. J. Rousseeuw
- There are N points in the space, and k of them are chosen as centers (medoids)
- The other points are classified into k groups, each point joining its nearest medoid
- The question: which k points should be chosen to minimize the sum of the distances from each point to its medoid?

Difficulty
- The problem is NP-hard
- Genetic algorithms can be applied

Famous works

Partitioning Around Medoids (PAM)
- Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley.
- Groups the N data points into k sets
- In every iteration, consider every pair (Oi, Oj) where Oi is a medoid and Oj is not; if replacing Oi by Oj would reduce the total distance, perform the replacement (see the code sketch after this section)
- Computation time: O(k(N − k)²) per iteration

Clustering LARge Applications (CLARA)
- Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley.
- Reduces the computation time by working on only s of the original N data points
- s = 40 + 2k seems a good choice
- Computation time: O(ks² + k(N − k)) per iteration

Clustering Large Applications based upon RANdomized Search (CLARANS)
- Ng, R., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile (pp. 144–155).
- Does not try all pairs (Oi, Oj); instead tries max(0.0125 · k(N − k), 250) different Oj for each Oi
- Computation time: O(N²) per iteration
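To make the objective and PAM's swap step concrete, here is a minimal Python sketch. It is my own illustration, not code from the book: `points` is assumed to be a list of coordinate tuples, distances are Euclidean, and the stopping rule is simply "no improving swap".

```python
import math
import random

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid,
    the quantity k-medoid clustering tries to minimize."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k, max_iter=100):
    """One reading of the PAM loop: repeatedly apply the best improving
    (medoid, non-medoid) swap until no swap reduces the total cost."""
    medoids = set(random.sample(range(len(points)), k))
    cost = total_cost(points, medoids)
    for _ in range(max_iter):
        best = None
        for i in list(medoids):                 # every medoid Oi
            for j in range(len(points)):        # every non-medoid Oj
                if j in medoids:
                    continue
                candidate = (medoids - {i}) | {j}
                c = total_cost(points, candidate)
                if c < cost:
                    cost, best = c, candidate
        if best is None:                        # local optimum reached
            break
        medoids = best
    return medoids, cost
```

Each pass examines k(N − k) swaps. This naive sketch re-scores every candidate from scratch; PAM's incremental cost update, which evaluates a swap in O(N − k), is what yields the O(k(N − k)²) per-iteration bound quoted above.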
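CLARA's sampling idea can then be sketched on top of the `pam` and `total_cost` functions above. Again this is only an illustration: the sample size follows the 40 + 2k rule of thumb from the slide, while the number of samples drawn (`n_samples`) is a free parameter of my own.

```python
def clara(points, k, n_samples=5):
    """Run PAM on small random samples and keep the medoid set that
    scores best on the FULL data set."""
    s = 40 + 2 * k                              # sample size suggested above
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        idx = random.sample(range(len(points)), min(s, len(points)))
        sample = [points[i] for i in idx]
        sample_medoids, _ = pam(sample, k)
        # map sample-local medoid indices back to indices into `points`
        medoids = {idx[m] for m in sample_medoids}
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```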
GCA: clustering with the aid of a genetic algorithm

GCA
- Lucasius, C. B., Dane, A. D., & Kateman, G. (1993). On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison. Analytica Chimica Acta, 282, 647–669.

Chromosome encoding
- N data points, clustered into k groups
- Problem size = k (the number of groups)
- Each location of the string is an integer (1–N) naming a medoid

Initialization
- Each string in the population uniquely encodes a candidate solution of the target problem
- The candidates are chosen at random

Selection
- Select the M worst individuals in the population and discard them

Crossover
- Select some individuals to reproduce M new members of the population
- Building-block-like crossover (worked example and code sketch below)

Mutation
- Performed via the "add new material" step of the crossover (step 2 below), which injects fresh random medoids

Crossover example
For example, k = 3, p1 = 2 3 7, p2 = 4 8 2 (subscripts mark which parent an element came from):
1. Mix p1 and p2: Q = 2₁ 3₁ 7₁ 4₂ 8₂ 2₂, then randomly scramble: Q = 4₂ 2₂ 2₁ 8₂ 7₁ 3₁
2. Add new material; only the first k elements may be changed: Q = 5 2₂ 7 8₂ 7₁ 3₁
3. Randomly scramble again: Q = 2₂ 7₁ 7 3₁ 5 8₂
4. The offspring are read off from the left and from the right, skipping duplicates: C1 = 2 7 3, C2 = 8 5 3
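A minimal Python sketch of these four steps, reconstructed from the example above. The `new_material_rate` parameter and the duplicate-skipping rule in step 4 are my own assumptions, not details quoted from the paper.

```python
import random

def gca_crossover(p1, p2, n_points, new_material_rate=0.3):
    """Building-block-like crossover: mix both parents, scramble, inject
    some fresh random medoids into the first k slots, scramble again,
    then read one child from each end of the string."""
    k = len(p1)
    q = list(p1) + list(p2)                        # 1. mix p1 and p2
    random.shuffle(q)                              #    and scramble
    for i in range(k):                             # 2. add new material: only
        if random.random() < new_material_rate:    #    the first k slots change
            q[i] = random.randrange(1, n_points + 1)
    random.shuffle(q)                              # 3. scramble again

    def take_k_unique(seq):                        # 4. child = first k distinct
        child = []                                 #    medoids seen from this end
        for g in seq:                              # (assumes q holds at least
            if g not in child:                     #  k distinct values)
                child.append(g)
            if len(child) == k:
                break
        return child

    return take_k_unique(q), take_k_unique(reversed(q))
```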
Experiments
- Budget: NFE (number of function evaluations) < 100,000
- N = 1000, k = 15
- Figures (omitted): GCA versus random search; GCA versus CLARA with k = 15; GCA versus CLARA with k = 50

Paper's conclusions
- GCA can handle both large and small values of k
- GCA outperforms CLARA, especially when k is large
- GCA lends itself excellently to parallelization
- GCA can be combined with CLARA to obtain a hybrid search system with better performance

Clustering genetic algorithm: also determining the number of clusters

Motivation
- In some cases we do not actually know the number of clusters
- What if we only know an upper limit?
- Hruschka, E. R., & Ebecken, N. F. F. (2003). A genetic algorithm for cluster analysis. Intelligent Data Analysis, 7, 15–25.

Fitness function
- a(i): the average distance from individual i to the individuals in its own cluster (L members): a(i) = ( Σ_{j=1..L} d_ij ) / (L − 1)
- d(i, C): the average distance from individual i to the individuals in a different cluster C (M members): d(i, C) = ( Σ_{j=1..M} d_ij ) / M
- b(i): the smallest of these averages: b(i) = min over C of d(i, C)

Fitness function: silhouette
- s(i) = (b(i) − a(i)) / max{a(i), b(i)}
- fitness = s = Σ_{i=1..N} s(i)
- This value is high when the a(i) are small and the b(i) are high, i.e. when clusters are compact and well separated
- (A code sketch of this fitness appears in the backup material at the end.)

Chromosome encoding
- N data points, clustered into at most k groups
- Problem size = N + 1: each of the first N positions is an integer (1–k) telling which cluster the object belongs to, and the last position stores the number of clusters
- Genotype 1: 22345123453321454552 5
- Cluster labels are arbitrary, which breaks naive crossover. Genotype 2 (2|2222|111113333344444 4) and Genotype 3 (4|4444|333335555511111 4) describe the same partition with different labels, yet exchanging the marked segment yields the meaningless children Child 2: 2 4444 111113333344444 4 and Child 3: 4 2222 333335555511111 5
- Consistent algorithm: rename the clusters canonically; applied to Genotype 1 it gives 11234512342215343441 5 (a code sketch appears in the backup material at the end)

Initialization
- Population size = 20
- The first genotype represents two clusters, the second three clusters, the third four clusters, ..., and the last one 21 clusters

Selection
- Roulette-wheel selection
- Since −1 ≤ s(i) ≤ 1, the values are shifted to 0 ≤ s(i) + 1 ≤ 2 so that all selection weights are non-negative

Crossover
- Uniform crossover does not work (because of the labeling problem above)
- Instead, use the crossover of the Grouping Genetic Algorithm (GGA), proposed by Falkenauer (1998)
- First, two strings are selected:
  A: 1123245125432533424
  B: 1212332124423221321
- Randomly select groups of A to preserve, for example groups 2 and 3, and copy them into child C:
  C: 0023200020032033020
- Find the groups of B left untouched by the copied positions and place them in C (here, group 4 of B):
  C: 0023200024432033020
- The other child D is formed from the groups of B, minus those already placed in C:
  D: 1212332120023221321
- The groups of A left untouched are then placed in D
- Finally, the remaining objects (whose alleles are zeros) are assigned to the cluster with the nearest centroid

Mutation
- Two operators, each changing the genotype in the smallest possible way:
  1. Merge: randomly choose a group and move all of its objects to the remaining cluster with the nearest centroid
  2. Split: divide a randomly selected group into two new ones

Experiments
- Four test problems (N = 75, 200, 699, 150)
- Figure (omitted): results on the Ruspini data (N = 75)

Paper's conclusions
- The number of groups does not need to be known in advance
- The algorithm found the correct clustering on all four test problems
- It needs only a small population size

Conclusion
- Genetic algorithms are an acceptable method for clustering problems
- The crossover operator needs to be designed carefully
- Maybe EDAs (estimation-of-distribution algorithms) could be applied
- Some theses? Or final projects!
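Backup: a minimal Python sketch of the silhouette fitness above. This is my own illustration, not the paper's code; treating singleton clusters as contributing s(i) = 0 is a common convention that the slides do not spell out.

```python
import math

def silhouette_fitness(points, labels):
    """fitness = sum over all points of s(i) = (b - a) / max(a, b), where
    a = mean distance from point i to the other members of its own cluster,
    b = smallest mean distance from point i to any other cluster."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1 or len(clusters) == 1:
            continue                    # convention: s(i) = 0 in these cases
        # d_ii = 0, so summing over the whole cluster and dividing by L - 1
        # gives the mean distance to the L - 1 other members
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in grp) / len(grp)
                for lbl, grp in clusters.items() if lbl != c)
        total += (b - a) / max(a, b)
    return total
```

To score a whole genotype of the encoding above, `labels` would be its first N genes, with the final gene (the cluster count) ignored.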
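Backup: a sketch of the consistent renaming step, assuming (consistently with the example above) that clusters are renamed in order of first appearance.

```python
def canonical_labels(genotype):
    """Rename clusters in order of first appearance, so that genotypes that
    describe the same partition map to the same canonical string."""
    mapping = {}
    out = []
    for g in genotype:
        if g not in mapping:
            mapping[g] = len(mapping) + 1   # next unused canonical label
        out.append(mapping[g])
    return out
```

Applied to Genotype 1, `canonical_labels([2,2,3,4,5,1,2,3,4,5,3,3,2,1,4,5,4,5,5,2])` returns exactly 1 1 2 3 4 5 1 2 3 4 2 2 1 5 3 4 3 4 4 1, matching the slide.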