PCS '94 seminar presentation

Evolutionary algorithms for clustering
Pasi Fränti
Dept. of Computer Science
University of Joensuu
FINLAND
Codebook generation for VQ
[Figure: Codebook generation for VQ. A training set X of N K-dimensional vectors is mapped by the partition function P (a scalar index in [1..M] for each vector) onto a codebook C of M code vectors.]
Data:
X = {x1, x2, …, xN}: set of N input vectors.
P = {p1, p2, …, pN}: partition defining M clusters, pi ∈ [1..M].
C = {c1, c2, …, cM}: cluster centroids.

Optimization:
Find C minimizing f(P, C).

Problem size:
N: number of data vectors.
M: number of clusters.
K: dimension of the vectors.

Example of a typical problem size: N = 4000, M = 256, K = 16.
Example data set
Representation of solution: the partition P and the codebook C.
K-means algorithm
K-means(X, P, C) returns (P, C):
REPEAT
  FOR i := 1 TO N DO
    pi ← FindNearestCentroid(xi, C)
  FOR j := 1 TO M DO
    cj ← CalculateCentroid(X, P, j)
UNTIL no improvement.
Partition step:
$$p_i \leftarrow \underset{1 \le j \le M}{\arg\min}\; d(x_i, c_j)^2, \qquad \forall i \in [1, N]$$

Centroid step:
$$c_j \leftarrow \frac{\sum_{i:\, p_i = j} x_i}{\sum_{i:\, p_i = j} 1}, \qquad \forall j \in [1, M]$$
Literature

LOCAL SEARCH:
P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 2000 (in press).
P. Fränti, H.H. Gyllenberg, M. Gyllenberg, J. Kivijärvi, T. Koski, T. Lund and O. Nevalainen, "Minimizing stochastic complexity using GLA and local search with applications to classification of bacteria", Biosystems, 57 (1), 37-48, June 2000.

GENETIC ALGORITHM:
P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.
Structure of Local Search
Generate initial solution.
REPEAT
  Generate a set of new solutions.
  Evaluate the new solutions.
  Select the best solution.
UNTIL stopping criterion met.
Neighborhood function using random swap:
$$c_j \leftarrow x_i, \qquad j = \mathrm{random}(1, M),\; i = \mathrm{random}(1, N)$$

Object rejection:
$$p_i \leftarrow \underset{1 \le k \le M}{\arg\min}\; d(x_i, c_k)^2, \qquad \forall i:\, p_i = j$$

Object attraction:
$$p_i \leftarrow \underset{k \in \{j,\; p_i\}}{\arg\min}\; d(x_i, c_k)^2, \qquad \forall i \in [1, N]$$
Randomized local search

RLS algorithm 1:
C ← SelectRandomDataObjects(M)
P ← OptimalPartition(C)
REPEAT T times
  Cnew ← RandomSwap(C)
  Pnew ← LocalRepartition(P, Cnew)
  Cnew ← OptimalRepresentatives(Pnew)
  IF f(Pnew, Cnew) < f(P, C) THEN
    (P, C) ← (Pnew, Cnew)

RLS algorithm 2:
C ← SelectRandomDataObjects(M)
P ← OptimalPartition(C)
REPEAT T times
  Cnew ← RandomSwap(C)
  Pnew ← LocalRepartition(P, Cnew)
  K-means(Pnew, Cnew)
  IF f(Pnew, Cnew) < f(P, C) THEN
    (P, C) ← (Pnew, Cnew)
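Putting the pieces together, RLS-2 is a thin loop over random swap, local repartition, and a couple of K-means iterations. A sketch reusing the kmeans and local_repartition functions above; the mse helper is a hypothetical addition of mine:

```python
import numpy as np

def mse(X, P, C):
    """Mean squared error of partition P under codebook C (hypothetical helper)."""
    return ((X - C[P]) ** 2).sum(axis=1).mean()

def rls2(X, M, T, seed=0):
    """RLS algorithm 2: random swap + K-means fine-tuning (a sketch)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=M, replace=False)].astype(float)
    P = ((X[:, None] - C[None]) ** 2).sum(axis=2).argmin(axis=1)
    best = mse(X, P, C)
    for _ in range(T):
        Cnew = C.copy()
        j = rng.integers(M)                    # random swap: replace centroid j
        Cnew[j] = X[rng.integers(len(X))]
        Pnew = local_repartition(X, P, Cnew, j)
        # Fine-tune with two K-means iterations (whose first partition step
        # also redoes the repartition globally, subsuming the local one).
        Pnew, Cnew = kmeans(X, Cnew, max_iter=2)
        f = mse(X, Pnew, Cnew)
        if f < best:                           # accept only improving swaps
            P, C, best = Pnew, Cnew, f
    return P, C
```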
Random swap
[Figure: Effect of a random swap. Before swap: one region has a missing cluster, another an unnecessary cluster. After swap: a centroid is removed from the over-represented region and added where a cluster was missing. Local refinement: the new cluster appears and the obsolete cluster disappears. After K-means: the clusters fine-tune into place.]
[Figure: MSE reached by RLS-2 from different initial solutions: K-means 176.53; Random + RLS 163.93; K-means + RLS 163.63; Splitting + RLS 163.51; Ward + RLS 163.08.]
[Figure: Bridge — MSE as a function of the number of iterations (0 to 5000) for RLS-1 and RLS-2.]
Structure of Genetic Algorithm

Genetic algorithm:
Generate S initial solutions.
REPEAT T times
  Generate new solutions.
  Sort the solutions.
  Store the best solution.
END-REPEAT
Output the best solution found.

Generate new solutions:
REPEAT S times
  Select pair for crossover.
  Cross the selected solutions.
  Mutate the new solution.
  Fine-tune the new solution by GLA.
END-REPEAT
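As a loop, this might look like the sketch below. All helpers (random_solution, fitness, cross_solutions, mutate, fine_tune_with_gla) are hypothetical placeholders, and the truncation-style parent selection is one plausible choice rather than the slides' exact scheme:

```python
import random

def genetic_algorithm(X, M, S, T):
    """Elitist GA skeleton (a sketch; all helper functions are hypothetical)."""
    # Population of S candidate solutions, each a (C, P) pair.
    population = [random_solution(X, M) for _ in range(S)]
    population.sort(key=lambda s: fitness(X, *s))
    for _ in range(T):
        offspring = []
        for _ in range(S):
            # Select a pair for crossover, favoring the better half.
            a, b = random.sample(population[:max(2, S // 2)], 2)
            child = cross_solutions(*a, *b)       # e.g. the PNN crossover
            child = mutate(child)                 # e.g. a random swap
            child = fine_tune_with_gla(X, child)  # a few K-means iterations
            offspring.append(child)
        # Survivor selection: keep the best S of parents and offspring.
        population = sorted(population + offspring,
                            key=lambda s: fitness(X, *s))[:S]
    return population[0]                          # best solution found
```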
Pseudo code for the GA (1/2)

CrossSolutions(C1, P1, C2, P2) → (Cnew, Pnew)
  Cnew ← CombineCentroids(C1, C2)
  Pnew ← CombinePartitions(Cnew, P1, P2)
  Cnew ← UpdateCentroids(Cnew, Pnew)
  RemoveEmptyClusters(Cnew, Pnew)
  PerformPNN(Cnew, Pnew)

CombineCentroids(C1, C2) → Cnew
  Cnew ← C1 ∪ C2

CombinePartitions(Cnew, P1, P2) → Pnew
  FOR i := 1 TO N DO
    IF $d(x_i, c^1_{p_i^1})^2 \le d(x_i, c^2_{p_i^2})^2$ THEN
      $p_i^{new} \leftarrow p_i^1$
    ELSE
      $p_i^{new} \leftarrow p_i^2$
  END-FOR
Pseudo code for the GA (2/2)

UpdateCentroids(Cnew, Pnew) → Cnew
  FOR j := 1 TO |Cnew| DO
    cj_new ← CalculateCentroid(Pnew, j)

PerformPNN(Cnew, Pnew)
  FOR i := 1 TO |Cnew| DO
    qi ← FindNearestNeighbor(ci)
  WHILE |Cnew| > M DO
    a ← FindMinimumDistance(Q)
    b ← qa
    MergeClusters(ca, pa, cb, pb)
    UpdatePointers(Q)
  END-WHILE
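A compact toy version of the whole crossover: take the union of the parents' centroids, let each vector keep the closer of its two assignments, drop empty clusters, then merge pairwise-nearest clusters back down to M. The function name is mine, and this naive version recomputes all merge costs on every round (roughly O(M²) per merge); the cited papers instead maintain the nearest-neighbor pointers Q incrementally:

```python
import numpy as np

def pnn_crossover(X, C1, P1, C2, P2, M):
    """Deterministic PNN crossover of two (codebook, partition) parents."""
    # CombineCentroids: union of the parents' codebooks (2M centroids).
    C = np.vstack([C1, C2]).astype(float)
    # CombinePartitions: each vector keeps the closer of its two assignments.
    d1 = ((X - C1[P1]) ** 2).sum(axis=1)
    d2 = ((X - C2[P2]) ** 2).sum(axis=1)
    P = np.where(d1 <= d2, P1, P2 + len(C1))
    # UpdateCentroids + RemoveEmptyClusters.
    keep = np.where(np.bincount(P, minlength=len(C)) > 0)[0]
    remap = {int(old): new for new, old in enumerate(keep)}
    P = np.array([remap[int(p)] for p in P])
    C = np.array([X[P == j].mean(axis=0) for j in range(len(keep))])
    sizes = np.bincount(P)
    # PerformPNN: repeatedly merge the cluster pair with the smallest
    # merge cost n_a*n_b/(n_a+n_b) * ||c_a - c_b||^2 until M remain.
    while len(C) > M:
        w = sizes[:, None] * sizes[None, :] / (sizes[:, None] + sizes[None, :])
        cost = w * ((C[:, None] - C[None]) ** 2).sum(axis=2)
        np.fill_diagonal(cost, np.inf)
        a, b = sorted(np.unravel_index(cost.argmin(), cost.shape))
        C[a] = (sizes[a] * C[a] + sizes[b] * C[b]) / (sizes[a] + sizes[b])
        sizes[a] += sizes[b]
        C = np.delete(C, b, axis=0)
        sizes = np.delete(sizes, b)
        P[P == b] = a          # relabel merged cluster
        P[P > b] -= 1          # close the gap left by cluster b
    return C, P
```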
Combining existing solutions
Figure 3.11. Illustration of the PNN method used as a deterministic crossover in the genetic algorithm for the data set S2. The top-left and top-right panels show two initial codebooks generated randomly among the data vectors (M=15). The bottom-left panel shows the codebook after the two initial codebooks are combined (M=30). The bottom-right panel shows the final codebook after the 15 merge steps of the PNN method (M=15).
According to the experiments in [F00], the genetic algorithm with the PNN crossover outperforms all comparative methods, including the earlier variants of the genetic algorithm. The only method reported to give better results is the self-adaptive genetic algorithm (SAGA) [KFN03], which itself uses the PNN crossover as its key component. Used as a deterministic crossover, the PNN method also achieves fast convergence with a rather small population size; the algorithm is therefore considerably faster than any previously reported genetic algorithm.
Performance comparison of GA
[Figure: Bridge — distortion vs. number of iterations (0 to 50) for Random crossover + GLA, Mutations + GLA, PNN crossover, and PNN crossover + GLA.]
[Figure: Miss America — distortion vs. number of iterations (0 to 50) for the same four GA variants.]
[Figure: MSE vs. running time in seconds (log scale, up to 100000 s) for repeated K-means, PNN, IS, RLS, GAIS, and SAGA.]
Performance comparison
(MSE; lower is better)

Method          Bridge    Miss America    House
Random          251.32      8.34          12.12
K-means         179.68      5.96           7.81
PNN             169.15      5.52           6.36
Local search    164.64      5.28           5.96
Tabu Search     164.23      5.22           5.94
GA              162.09      5.18           5.92