k-medoid clustering with genetic algorithm
WEI-MING CHEN
2012.12.06
Outline
 k-medoids clustering
 famous works
 GCA : clustering with the aid of a genetic algorithm
 Clustering genetic algorithm : also determines the number of clusters
 Conclusion
What is k-medoid clustering?
 Proposed in 1987 (L. Kaufman and P.J. Rousseeuw)
 There are N points in the space
 k of the points are chosen as centers (medoids)
 All other points are classified into k groups, each joining its nearest medoid
 Which k points should be chosen to minimize the sum of distances from each point to its medoid? (see the sketch below)
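As a quick illustration, a minimal sketch of this objective in Python (0-based indices; all names here are illustrative, not from any paper):

```python
import numpy as np

def kmedoid_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest chosen medoid."""
    # (N, k) matrix of distances from each point to each medoid
    dists = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=2)
    # Each point joins its closest medoid; the objective is the total distance
    return dists.min(axis=1).sum()
```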
Difficulty
 NP-hard
 Genetic algorithms can be applied
 k-medoids clustering
 famous works
 GCA : clustering with the aid of a genetic algorithm
 Clustering genetic algorithm : also determines the number of clusters
 Conclusion
Partitioning Around Medoids (PAM)
 Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley
 Groups N data points into k sets
 In every iteration, consider every pair (Oi, Oj), where Oi is a medoid and Oj is not; if replacing Oi with Oj would reduce the total distance, perform the swap (sketched below)
 Computation time : O(k(N−k)²) [one iteration]
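A sketch of one PAM pass, reusing the kmedoid_cost() helper above (a greedy first-improvement variant; the original algorithm's bookkeeping is more refined):

```python
def pam_iteration(points, medoids):
    """Try every (O_i, O_j) swap; keep any swap that lowers the total cost."""
    medoids = list(medoids)
    best = kmedoid_cost(points, medoids)
    for i in range(len(medoids)):                 # O_i: each current medoid slot
        for oj in range(len(points)):             # O_j: each non-medoid point
            if oj in medoids:
                continue
            trial = medoids.copy()
            trial[i] = oj                         # tentatively replace O_i by O_j
            cost = kmedoid_cost(points, trial)
            if cost < best:                       # the swap reduces the distance
                best, medoids = cost, trial
    return medoids, best
```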
Clustering LARge Applications (CLARA)
 Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley
 Reduces the calculation time
 Runs the search on a sample of only s points drawn from the original N points (sketched below)
 s = 40+2k seems a good choice
 Computation time : O(ks²+k(N−k)) [one iteration]
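A hedged sketch of the sampling idea, built on the helpers above (the number of samples and the crude PAM start are assumptions, not the paper's exact settings):

```python
import numpy as np

def clara(points, k, n_samples=5, rng=None):
    """Run PAM on small random samples; keep the medoids that score best
    on the full data set."""
    rng = rng or np.random.default_rng()
    s = 40 + 2 * k                                   # sample size from the slide
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        sample = rng.choice(len(points), size=min(s, len(points)), replace=False)
        med, _ = pam_iteration(points[sample], list(range(k)))  # crude PAM start
        candidate = sample[med]                      # map back to original indices
        cost = kmedoid_cost(points, candidate)       # evaluate on ALL N points
        if cost < best_cost:
            best_cost, best_medoids = cost, candidate
    return best_medoids, best_cost
```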
Clustering Large Applications based upon RANdomized Search (CLARANS)
 Ng, R., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th international conference on very large databases, Santiago, Chile (pp. 144–155)
 Does not try all pairs of (Oi, Oj)
 Tries max(0.0125(k(N−k)), 250) different Oj for each Oi (sketched below)
 Computation time : O(N²) [one iteration]
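One CLARANS move could look like this sketch, which tests a bounded number of random (O_i, O_j) swaps instead of all pairs, again assuming kmedoid_cost() from above:

```python
import numpy as np

def clarans_step(points, medoids, rng=None):
    """Accept the first improving random neighbour, else declare a local minimum."""
    rng = rng or np.random.default_rng()
    n, k = len(points), len(medoids)
    max_neighbors = max(int(0.0125 * k * (n - k)), 250)   # bound from the slide
    cost = kmedoid_cost(points, medoids)
    for _ in range(max_neighbors):
        i = int(rng.integers(k))                          # random medoid slot O_i
        oj = int(rng.integers(n))                         # random candidate O_j
        if oj in medoids:
            continue
        trial = list(medoids)
        trial[i] = oj
        c = kmedoid_cost(points, trial)
        if c < cost:
            return trial, c                               # move to better neighbour
    return list(medoids), cost                            # no improvement found
```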
 k-medoids clustering
 famous works
 GCA : clustering with the aid of a genetic algorithm
 Clustering genetic algorithm : also determines the number of clusters
 Conclusion
GCA
 Lucasius, C. B., Dane, A. D., & Kateman, G. (1993). On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility and comparison. Analytica Chimica Acta, 282, 647–669.
Chromosome encoding
 N data points, clustered into k groups
 Problem size = k (the number of groups)
 Each location of the string is an integer (1~N) naming one medoid
Initialization
 Each string in the population uniquely encodes a candidate solution of the target problem
 Randomly choose the initial candidates (sketched below)
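A minimal initialization sketch (1-based indices to match the encoding above; names are illustrative):

```python
import random

def init_population(pop_size, n_points, k, seed=None):
    """Each chromosome is a string of k distinct integers in 1..N,
    each naming one medoid."""
    rng = random.Random(seed)
    return [rng.sample(range(1, n_points + 1), k) for _ in range(pop_size)]
```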
Selection
 Select the M worst individuals in the population and discard them
Crossover
 Select some individuals to reproduce M new offspring
 Building-block-like crossover
 Mutation
Crossover
 For example, k = 3, p1 = 2 3 7, p2 = 4 8 2
 1. Mix p1 and p2 (subscripts mark the parent of origin)
 Q = 2₁ 3₁ 7₁ 4₂ 8₂ 2₂
 Randomly scramble : Q = 4₂ 2₂ 2₁ 8₂ 7₁ 3₁
 2. Add new material : the first k elements may be replaced
 Q = 5 2₂ 7 8₂ 7₁ 3₁
 3. Randomly scramble again
 Q = 2₂ 7₁ 7 3₁ 5 8₂
 4. The offspring are read off from the left and from the right, skipping duplicates
 C1 = 2 7 3 , C2 = 8 5 3
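A sketch of this operator in Python; the probability of injecting new material in step 2 is an assumption, not a value from the paper:

```python
import random

def gca_crossover(p1, p2, n_points, p_new=0.3, rng=None):
    """Building-block-like crossover: mix, scramble, inject, scramble,
    then read one child from each end."""
    rng = rng or random.Random()
    k = len(p1)
    q = list(p1) + list(p2)              # step 1: mix both parents
    rng.shuffle(q)                       #         and scramble randomly
    for i in range(k):                   # step 2: first k slots may be replaced
        if rng.random() < p_new:
            q[i] = rng.randrange(1, n_points + 1)   # fresh random medoid
    rng.shuffle(q)                       # step 3: scramble again

    def pick(seq):                       # step 4: scan, skipping duplicates
        child = []
        for gene in seq:
            if gene not in child:
                child.append(gene)
            if len(child) == k:
                break
        return child

    return pick(q), pick(q[::-1])        # one child from the left, one from the right
```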
Experiment
 Under the limit of NFE < 100000
 N = 1000, k = 15
Experiment
 GCA versus Random search
Experiment
 GCA versus CLARA (k = 15)
Experiment
 GCA versus CLARA (k = 50)
Paper’s conclusion
 GCA can handle both large and small values of k
 GCA outperforms CLARA, especially when k is large
 GCA lends itself excellently to parallelization
 GCA can be combined with CLARA to obtain a hybrid search system with better performance
 k-medoids clustering
 famous works
 GCA : clustering with the aid of a genetic algorithm
 Clustering genetic algorithm : also determines the number of clusters
 Conclusion
Motivation
 In some cases, we do not actually know the number of clusters
 What if we only know an upper limit?
 Hruschka, E.R. and N.F.F. Ebecken. (2003). “A Genetic Algorithm for Cluster Analysis.” Intelligent Data Analysis 7, 15–25.
Fitness function
 a(i) : the average distance from individual i to the other individuals in its own cluster (of size L)
 a(i) = ( Σ_{j=1..L} d_ij ) / (L − 1)
 d(i, C) : the average distance from individual i to the individuals in a different cluster C (of size M)
 d(i, C) = ( Σ_{j=1..M} d_ij ) / M
 b(i) : the smallest of the d(i, C) values
 b(i) = min_C d(i, C)
Fitness function
 Silhouette s(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }
 fitness = s̄ = ( Σ_{i=1..N} s(i) ) / N
 This value will be high when…
 small a(i) values
 high b(i) values
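A direct transcription of the fitness into Python (pairwise distances precomputed; singleton clusters get s(i) = 0 by convention, and at least two clusters are assumed):

```python
import numpy as np

def silhouette_fitness(dist, labels):
    """Mean silhouette over all N individuals, used as the GA fitness."""
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        if same.sum() < 2:
            continue                                   # singleton: s(i) stays 0
        a = dist[i, same].sum() / (same.sum() - 1)     # a(i): d_ii = 0 drops out
        b = min(dist[i, labels == c].mean()            # b(i): closest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```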
Chromosome encoding
 N data points, clustered into at most k groups
 Problem size = N+1 (one gene per data point, plus one gene storing the number of clusters)
 Each of the first N locations is an integer (1~k) telling which cluster that point belongs to
 Genotype1: 22345123453321454552 5
 The same partition can be written with different labels, so naive crossover mixes incompatible labelings. To avoid the following problems:
 Genotype2: 2|2222|111113333344444 4
 Genotype3: 4|4444|333335555511111 4
 Child2: 2 4444 111113333344444 4
 Child3: 4 2222 333335555511111 5
 A consistent renumbering algorithm relabels Genotype1 by order of first appearance: 11234512342215343441 5
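The renumbering step is easy to sketch; relabeling clusters in order of first appearance reproduces the example above:

```python
def renumber(genotype):
    """Relabel clusters consistently: 22345... becomes 11234..., etc."""
    mapping, out = {}, []
    for g in genotype:
        if g not in mapping:
            mapping[g] = len(mapping) + 1    # next unused label
        out.append(mapping[g])
    return out, len(mapping)                 # relabeled string + cluster count
```

For example, renumber([2,2,3,4,5,1,2,3,4,5,3,3,2,1,4,5,4,5,5,2]) yields 11234512342215343441 with 5 clusters, as on the slide.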
Initialization
 Population size = 20
 The first genotype represents two clusters,
the second genotype represents three clusters,
the third genotype represents four clusters, . . . , and
the last one represents 21 clusters
Selection
 Roulette wheel selection
 −1 ≤ s(i) ≤ 1
 Shift the fitness by +1 so that it lies in [0, 2] (roulette wheel needs non-negative values)
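A minimal sketch of the shifted roulette wheel:

```python
import random

def roulette_select(population, fitnesses, rng=None):
    """Draw a new population with probability proportional to s(i) + 1,
    since roulette-wheel selection needs non-negative weights."""
    rng = rng or random.Random()
    weights = [f + 1.0 for f in fitnesses]    # shift [-1, 1] into [0, 2]
    return rng.choices(population, weights=weights, k=len(population))
```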
Crossover
 Uniform crossover does not work (cluster labels are not consistent across parents)
 Use the crossover of the Grouping Genetic Algorithm (GGA), proposed by Falkenauer (1998)
 First, two strings are selected
 A − 1123245125432533424
 B − 1212332124423221321
 Randomly select groups to preserve in A
 (For example, groups 2 and 3)
Crossover
 A − 1123245125432533424
 B − 1212332124423221321
 C − 0023200020032033020 (the preserved groups of A)
 Find the groups of B that are unchanged (none of their positions already filled in C) and place them in C
 C − 0023200024432033020
 The other child D is formed from the groups of B, excluding those actually placed in C
 D − 1212332120023221321
Crossover
 Likewise, the unchanged groups of A are placed in D
 The remaining objects (whose alleles are zeros) are assigned to the nearest cluster
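A sketch of the group-transplant part of this crossover; reattaching the zeroed objects needs the actual data (nearest-cluster assignment), so it is only indicated by a comment, and renumber() from earlier can restore consistent labels afterwards:

```python
import random

def gga_crossover(a, b, rng=None):
    """Copy randomly chosen groups of A into child C, transplant the groups
    of B that fit the untouched positions, and zero them out of B for child D."""
    rng = rng or random.Random()
    groups_a = sorted(set(a))
    keep = rng.sample(groups_a, k=max(1, len(groups_a) // 2))  # groups kept from A
    c = [g if g in keep else 0 for g in a]     # child C: preserved groups of A
    d = list(b)
    for grp in sorted(set(b)):
        pos = [i for i, g in enumerate(b) if g == grp]
        if all(c[i] == 0 for i in pos):        # B's group fits into untouched slots
            for i in pos:
                c[i] = grp                     # place it unchanged in C...
                d[i] = 0                       # ...and leave a hole in D
    # Symmetrically, unchanged groups of A go into D; then every remaining
    # zero allele is assigned to the nearest cluster (omitted: needs the data).
    return c, d
```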
Mutation
 Two mutation operators (both sketched below)
 1. Randomly choose a group and place all of its objects into the remaining cluster with the nearest centroid
 2. Divide a randomly selected group into two new ones
 Each operator changes the genotype in the smallest possible way
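Both moves are easy to sketch on numeric data (the 50/50 choice between them is an assumption, not from the paper):

```python
import numpy as np

def mutate(labels, points, rng=None):
    """Either merge a random cluster into the remaining cluster with the
    nearest centroid, or split a random cluster in two."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels).copy()
    clusters = sorted(set(labels.tolist()))
    victim = clusters[int(rng.integers(len(clusters)))]
    if rng.random() < 0.5 and len(clusters) > 2:
        # Merge: move the victim's objects to the nearest remaining centroid
        others = [c for c in clusters if c != victim]
        vc = points[labels == victim].mean(axis=0)          # victim centroid
        nearest = min(others, key=lambda c: np.linalg.norm(
            points[labels == c].mean(axis=0) - vc))
        labels[labels == victim] = nearest
    else:
        # Split: partition the victim around two randomly chosen seed points
        idx = np.flatnonzero(labels == victim)
        if len(idx) >= 2:
            s0, s1 = rng.choice(idx, size=2, replace=False)
            new_label = max(clusters) + 1
            for i in idx:
                if (np.linalg.norm(points[i] - points[s1])
                        < np.linalg.norm(points[i] - points[s0])):
                    labels[i] = new_label
    return labels
```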
Experiment
 4 test problems (N = 75, 200, 699, 150)
Experiment
 Ruspini data (N = 75)
Paper’s conclusion
 Does not need to know the number of groups in advance
 Finds the correct clustering on all four test problems
 But only demonstrated with a small population size
 k-medoids clustering
 famous works
 GCA : clustering with the aid of a genetic algorithm
 Clustering genetic algorithm : also determines the number of clusters
 Conclusion
Conclusion
 Genetic algorithms are an acceptable method for clustering problems
 The crossover operator needs to be designed carefully
 Maybe EDAs can be applied
 Some theses? Or final projects!