Clustering Methods: Part 2d
Swap-based algorithms
Pasi Fränti
31.3.2014
Speech & Image Processing Unit
School of Computing
University of Eastern Finland
Joensuu, FINLAND
Part I:
Random Swap algorithm
P. Fränti and J. Kivijärvi
Randomised local search algorithm for the clustering problem
Pattern Analysis and Applications, 3 (4), 358-369, 2000.
Pseudo code of Random Swap
RandomSwap(X) → C, P
   C ← SelectRandomRepresentatives(X);
   P ← OptimalPartition(X, C);
   REPEAT T times
      (Cnew, j) ← RandomSwap(X, C);
      Pnew ← LocalRepartition(X, Cnew, P, j);
      Cnew, Pnew ← Kmeans(X, Cnew, Pnew);
      IF f(Cnew, Pnew) < f(C, P) THEN
         (C, P) ← (Cnew, Pnew);
   RETURN (C, P);
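The loop above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the helper names (`sq_dist`, `partition`, `update_centroids`, `mse`), the `kmeans_iters` and `seed` parameters, and the use of a full repartition in place of LocalRepartition are all assumptions made to keep the example short.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def partition(X, C):
    """Optimal partition: index of the nearest centroid for every vector."""
    return [min(range(len(C)), key=lambda k: sq_dist(x, C[k])) for x in X]

def update_centroids(X, P, C):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    new_C = []
    for k, old in enumerate(C):
        members = [x for x, p in zip(X, P) if p == k]
        if members:
            new_C.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_C.append(old)
    return new_C

def mse(X, C, P):
    """Objective f(C, P): mean squared distance to the assigned centroid."""
    return sum(sq_dist(x, C[p]) for x, p in zip(X, P)) / len(X)

def random_swap(X, M, T, kmeans_iters=2, seed=0):
    rng = random.Random(seed)
    C = rng.sample(X, M)                  # SelectRandomRepresentatives
    P = partition(X, C)                   # OptimalPartition
    for _ in range(T):
        C_new = list(C)
        C_new[rng.randrange(M)] = rng.choice(X)   # swap: c_j <- x_i
        P_new = partition(X, C_new)       # full repartition (simplification)
        for _ in range(kmeans_iters):     # fine-tuning by K-means
            C_new = update_centroids(X, P_new, C_new)
            P_new = partition(X, C_new)
        if mse(X, C_new, P_new) < mse(X, C, P):   # accept only improvements
            C, P = C_new, P_new
    return C, P
```

Here `X` is a list of equal-length tuples. Because a trial solution is accepted only when it lowers f(C, P), the returned error can never exceed that of the initial random solution.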
Demonstration of the algorithm
One centroid, but two clusters.
Two centroids, but only one cluster.
Centroid swap
Swap is made from
centroid rich area to
centroid poor area.
Local repartition
Fine-tuning by K-means
[Figures: clustering after the 1st, 2nd, 3rd, 16th, 17th, 18th and 19th K-means iteration, and the final result after 25 iterations.]
Implementation of the swap
1. Random swap:
\[ c_j \leftarrow x_i, \quad j = \mathrm{random}(1, M), \; i = \mathrm{random}(1, N) \]
2. Re-partition vectors from old cluster:
\[ p_i \leftarrow \arg\min_{1 \le k \le M} d(x_i, c_k)^2 \quad \forall i : p_i = j \]
3. Create new cluster:
\[ p_i \leftarrow \arg\min_{k = j \,\lor\, k = p_i} d(x_i, c_k)^2 \quad \forall i \in [1, N] \]
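Steps 2 and 3 can be sketched as a single pass over the data. The function below is an illustrative assumption (the name `local_repartition` and the in-place update of `P` are not taken from the slides): it expects that `C[j]` has already been replaced by the new location and that `P[i] == j` still marks the members of the removed cluster.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def local_repartition(X, C, P, j):
    """Update partition P in place after centroid C[j] was swapped."""
    M = len(C)
    for i, x in enumerate(X):
        if P[i] == j:
            # Step 2: vectors of the old cluster search all M centroids.
            P[i] = min(range(M), key=lambda k: sq_dist(x, C[k]))
        elif sq_dist(x, C[j]) < sq_dist(x, C[P[i]]):
            # Step 3: other vectors only compare their current centroid
            # against the new one.
            P[i] = j
    return P
```

Only the vectors of the removed cluster pay for a full search over M centroids; every other vector needs two distance calculations, which is what keeps the step at O(N).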
Random swap as local search
Study neighbor solutions
Select one and move
Role of K-means
Fine-tune solution by hill-climbing technique!
Consider only local optima!
Role of swap: reduce search space
Effective search space
Chain reaction by K-means after swap
Independency of initialization
Results for T = 5000 iterations
[Figure: Bridge. MSE (axis 150 to 190) for K-means alone and for Random Swap (RS) started from Random, K-means, Split and Ward initializations. Chart labels: Worst, Best, Initial. Values printed in the chart: 176.53, 163.93, 163.63, 163.51, 163.08.]
Part II:
Efficiency of Random Swap
Probability of good swap
• Select a proper centroid for removal:
– There are M clusters in total: p_removal = 1/M.
• Select a proper new location:
– There are N choices: p_add = 1/N
– Only M are significantly different: p_add = 1/M
• In total:
– M² significantly different swaps.
– Probability of each different swap is p_swap = 1/M²
– Open question: how many of these are good?
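As a worked instance of the counting argument above, take M = 15 purely as an illustration (an assumed value, chosen because it is the cluster count listed for the S data sets later in the slides):

```latex
\[
p_{\text{swap}} = p_{\text{removal}} \cdot p_{\text{add}}
              = \frac{1}{M} \cdot \frac{1}{M}
              = \frac{1}{M^2},
\qquad
M = 15 \;\Rightarrow\; p_{\text{swap}} = \frac{1}{225} \approx 0.44\,\%
\]
```

So any one specific swap is rare, and the overall success rate depends on how many of the M² candidates count as good.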
Number of neighbors
Open question: what is the size of the neighborhood (α)?
[Figure: two ways of defining neighbors: Voronoi neighbors and neighbors by distance.]
Observed number of neighbors
[Figure: Data set S2. Histogram of the number of neighbours (x-axis 1 to 9) against frequency (y-axis 0 % to 45 %). Average = 3.9.]
Average number of neighbors
                From        From clustering solution
Data set        data set    Initial (T=0)   Early stage (T=5)   Final (T=5000)
Bridge          68.8        12.2            7.3                 6.2
House           14.4        5.3             7.1                 7.1
Miss America    345         51.0            26.1                17.1
Europe          (4.0)       3.8             5.3                 5.3
BIRCH1          4.0         3.7             4.5                 4.5
BIRCH2          (3.7)       2.2             2.1                 2.0
BIRCH3          (3.9)       3.0             3.8                 4.0
S1              3.8         2.5             3.1                 3.2
S2              3.9         2.8             3.3                 3.7
S3              3.9         2.7             3.6                 3.3
S4              3.9         2.7             3.7                 4.0
Expected number of iterations
• Probability of not finding good swap:
\[ q = \left(1 - \frac{\alpha^2}{M^2}\right)^{T} \]
• Estimated number of iterations:
\[ \log q = T \cdot \log\left(1 - \frac{\alpha^2}{M^2}\right) \]
\[ T = \frac{\log q}{\log\left(1 - \frac{\alpha^2}{M^2}\right)} \]
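The estimate is easy to evaluate numerically. The sketch below assumes that the probability of a good swap in one iteration is (α/M)²; the function name `iterations_needed` and the example value α = 3.1 are hypothetical and chosen only for illustration.

```python
import math

def iterations_needed(q, M, alpha):
    """Iterations T after which the failure probability drops to q,
    assuming a good swap has probability (alpha / M)**2 per iteration."""
    p = (alpha / M) ** 2
    return math.log(q) / math.log(1.0 - p)

# Example with assumed values: M = 15 clusters, neighborhood size alpha = 3.1.
for q in (0.1, 0.01, 0.001):
    print(q, math.ceil(iterations_needed(q, 15, 3.1)))
```

With these assumed values the estimate for q = 10% comes out just under 53 iterations, and each further factor of ten in q adds the same number of iterations again, which is the logarithmic dependency on q.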
Estimated number of iterations depending on T

Estimated iterations (T)
        q=10%   q=1%   q=0.1%   Expected
S1      53      106    159      23
S2      47      93     140      21
S3      39      78     117      17
S4      37      74     111      16

Observed q-values
            S1      S2      S3      S4
q=10%       19%     14%     22%     22%
q=1%        3.1%    1.2%    1.0%    3.6%
q=0.1%      0.1%    0.1%    0.2%    1.1%
Expected:   72      56      55      48

Observed = Number of iterations needed in practice.
Estimated = Estimate of the number of iterations needed for given q.
Probability of success (p) depending on T
[Figure: p (axis 0 to 100) as a function of the number of iterations (0 to 300).]
Probability of failure (q) depending on T
[Figure: q on a logarithmic axis (1 down to 0.000000001) as a function of the number of iterations (0 to 300).]
Observed probabilities depending on dimensionality
[Figure: observed q on a logarithmic axis (0.01 % to 100 %) as a function of dimensionality (16 to 1024), with curves "Observed for q = 10%", "Observed for q = 1%" and "Observed for q = 0.1%".]
Bounds for the number of iterations
Upper limit:
\[ T = \frac{\ln q}{\ln\left(1 - \alpha^2/M^2\right)} \le \frac{-\ln q}{\alpha^2/M^2} = -\ln q \cdot \frac{M^2}{\alpha^2} \]
Lower limit similarly; resulting in:
\[ T = \Theta\left(-\ln q \cdot \frac{M^2}{\alpha^2}\right) \]
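The upper limit follows from a standard inequality for the logarithm. A short derivation, written with the shorthand x = α²/M² (a symbol introduced here only for readability), might read:

```latex
\begin{align*}
\ln(1 - x) &\le -x
  && \text{for } 0 < x < 1,\quad x = \alpha^2 / M^2 \\
\frac{1}{-\ln(1 - x)} &\le \frac{1}{x}
  && \text{both sides positive} \\
T = \frac{\ln q}{\ln(1 - x)} = \frac{-\ln q}{-\ln(1 - x)}
  &\le \frac{-\ln q}{x} = -\ln q \cdot \frac{M^2}{\alpha^2}
  && \text{since } -\ln q > 0 \text{ for } q < 1
\end{align*}
```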

Multiple swaps (w)
Probability for performing less than w swaps:
\[ q = \sum_{i=0}^{w-1} \binom{T}{i} \left(\frac{\alpha^2}{M^2}\right)^{i} \left(1 - \frac{\alpha^2}{M^2}\right)^{T-i} \]
Expected number of iterations:
\[ \hat{T} = \sum_{i=1}^{w} \frac{1}{i} \cdot \frac{M^2}{\alpha^2} \]
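The probability of performing fewer than w good swaps in T iterations is a binomial tail and can be evaluated directly. This sketch assumes each iteration succeeds independently with probability (α/M)²; the function name and the example arguments are hypothetical.

```python
import math

def prob_fewer_than_w_swaps(T, w, M, alpha):
    """Probability that fewer than w good swaps occur in T iterations,
    assuming independent iterations with success probability (alpha/M)**2."""
    p = (alpha / M) ** 2
    return sum(math.comb(T, i) * p**i * (1.0 - p) ** (T - i) for i in range(w))

# Example with assumed values: T = 200 iterations, M = 15, alpha = 3.1.
for w in (1, 2, 3, 4):
    print(w, prob_fewer_than_w_swaps(200, w, 15, 3.1))
```

For w = 1 the sum has a single term and reduces to the single-swap failure probability (1 - α²/M²)^T.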
Number of swaps needed
Example from image quantization
[Figure: K-means clustering result (3 swaps needed) and the final clustering result, with the swapped centroids marked "Removed" and "Added".]
Efficiency of the random swap
Total time to find correct clustering:
– Time per iteration × Number of iterations
Time complexity of a single step:
– Swap: O(1)
– Remove cluster: 2M·N/M = O(N)
– Add cluster: 2N = O(N)
– Centroids: 2·(2N/M) + 2 + 2 = O(N/M)
– (Fast) K-means iteration: 4αN = O(αN)*
*See Fast K-means for analysis.
Time complexity and the observed number of steps
                                          Observed number of steps at iteration:
Step                  Time complexity     50         100        500
Centroid swap         2                   2          2          2
Cluster removal       2N                  7,526      8,448      10,137
Cluster addition      2N                  8,192      8,192      8,192
Update centroids      4N/M + 2 + 1        53         61         60
K-means iterations    ≤ 4αN               300,901    285,555    197,327
Total                 O(αN)               316,674    302,258    215,718
Time spent by K-means iterations
[Figure: Bridge. Curves for local repartition, k-means 1st iteration and k-means 2nd iteration over iterations 0 to 500; vertical axis 0 to 140.]
Effect of K-means iterations
[Figure: Bridge. Error (MSE) versus time (s, logarithmic axis) for versions using 1, 2, 3, 4 and 5 K-means iterations.]
The version with one iteration is the weakest throughout.
Versions with more iterations perform about equally.
Total time complexity
Time complexity of a single step (t):
t = O(αN)
Number of iterations needed (T):
\[ T = \Theta\left(-\ln q \cdot \frac{M^2}{\alpha^2}\right) \]
Total time:
\[ T(N, M) = O\left(-\ln q \cdot \frac{M^2}{\alpha^2} \cdot \alpha N\right) = O\left(\frac{-\ln q \cdot N M^2}{\alpha}\right) \]
Time complexity: conclusions
\[ T(N, M) = O\left(\frac{-\ln q \cdot N M^2}{\alpha}\right) \]
1. Logarithmic dependency on q
2. Linear dependency on N
3. Quadratic dependency on M
(With a large number of clusters, can be too slow)
4. Inverse dependency on α (worst case α = 2)
(The higher the dimensionality and the higher the cluster overlap, the faster the method)
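The four dependencies can be checked numerically with a small estimator. This is a sketch under stated assumptions: the function name is hypothetical, constant factors are dropped, and the example values of N, M, α and q are illustrative rather than measured.

```python
import math

def estimated_total_steps(N, M, alpha, q):
    """Rough total work: iterations T times work per iteration t."""
    iterations = -math.log(q) * M**2 / alpha**2   # T, from the bound on q
    per_iteration = alpha * N                     # t = O(alpha * N), constant dropped
    return iterations * per_iteration

base = estimated_total_steps(N=4096, M=256, alpha=4.0, q=0.01)
print(estimated_total_steps(8192, 256, 4.0, 0.01) / base)   # doubling N
print(estimated_total_steps(4096, 512, 4.0, 0.01) / base)   # doubling M
print(estimated_total_steps(4096, 256, 8.0, 0.01) / base)   # doubling alpha
```

Doubling N doubles the estimate, doubling M quadruples it, and doubling α halves it, matching conclusions 2 to 4.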
Time-distortion performance
[Figures: MSE as a function of time (logarithmic time axis) for Repeated k-means and Random Swap on six data sets: Bridge, Missa1, Birch1, Birch2, Europe and KDD-Cup04 Bio.]
References
Random swap algorithm:
• P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
• P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
Pseudo code:
• http://cs.joensuu.fi/sipu/soft/
Efficiency of Random swap algorithm:
• P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.
Part III:
Example when 4 swaps needed
1st swap
MSE = 4.2 × 10^9
MSE = 3.4 × 10^9
2nd swap
MSE = 3.1 × 10^9
MSE = 3.0 × 10^9
3rd swap
MSE = 2.3 × 10^9
MSE = 2.1 × 10^9
4th swap
MSE = 1.9 × 10^9
MSE = 1.7 × 10^9
Final result
MSE = 1.3 × 10^9
Part IV:
Deterministic Swap
From where to where?
One centroid, but two clusters.
Two centroids, but only one cluster.
[Figure: example data set with clusters numbered 1 to 15.]
Costs for the swap:
Cluster   Removal   Addition
1         0.80      0.39
2         1.04      0.64
3         5.48      1.09
4         5.66      0.92
5         6.50      0.76
6         7.67      1.01
7         8.47      0.45
8         9.10      0.75
9         9.90      1.42
10        11.09     1.26
11        11.47     0.61
12        12.17     4.70
13        14.61     0.94
14        16.41     0.93
15        16.68     1.41
Cluster removal
• Merge two existing clusters [Frigui 1997, Kaukoranta
1998] following the spirit of agglomerative clustering.
• Local optimization: remove the prototype that
increases the cost function value least [Fritzke 1997,
Likas 2003, Fränti 2006].
• Smart swap: find two nearest prototypes, and remove
one of them randomly [Chen, 2010].
• Pairwise swap: locate a pair of inconsistent prototypes
in two solutions [Zhao, 2012].
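The smart swap rule from the list above (find the two nearest prototypes, remove one of them at random) can be sketched as follows. The function names are hypothetical, and the brute-force pair search is an assumption chosen for clarity.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_pair(C):
    """Indices (i, j) of the two closest prototypes, by brute force in O(M^2)."""
    best = None
    for i in range(len(C)):
        for j in range(i + 1, len(C)):
            d = sq_dist(C[i], C[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

def pick_prototype_to_remove(C, rng=random):
    """Smart-swap style removal: one of the two nearest prototypes, at random."""
    return rng.choice(nearest_pair(C))
```

The intuition is the one shown in the demonstration slides: two prototypes that sit very close together are likely sharing one cluster, so either of them is a reasonable candidate for removal.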
Cluster addition
1. Select an existing cluster
– Depending on strategy: 1..M choices.
– Each choice takes O(N) time to test.
2. Select a location within this cluster
– Add new prototype
– Consider only existing points
Select the cluster
• Cluster with the biggest MSE
– Intuitive heuristic [Fritzke 1997, Chen 2010]
– Computationally demanding:
• Local optimization
– Try all clusters for the addition [Likas et al, 2003]
– Computationally demanding: O(NM) to O(N²)
Select the location
1. Current prototype + ε [Fritzke 1997]
2. Furthest vector [Fränti et al 1997]
3. Any other split heuristic [Fränti et al, 1997]
4. Random location
5. Every possible location [Likas et al, 2003]
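One deterministic addition strategy combines the "cluster with the biggest MSE" heuristic with option 2, the furthest vector. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it ranks clusters by their total squared error rather than by a per-vector mean.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def furthest_point_in_worst_cluster(X, C, P):
    """Return (cluster index, data vector): the cluster with the largest
    total squared error and its member furthest from the centroid."""
    distortion = [0.0] * len(C)
    for x, p in zip(X, P):
        distortion[p] += sq_dist(x, C[p])
    worst = max(range(len(C)), key=lambda k: distortion[k])
    members = [x for x, p in zip(X, P) if p == worst]
    return worst, max(members, key=lambda x: sq_dist(x, C[worst]))
```

Both passes over the data are linear, so this selection costs O(N), in line with the heuristic variant listed in the complexity table.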
Complexity of swaps
Variant           Select cluster   Select location   Time complexity
Full search       Try all          Try all           O(N²)
Optimal cluster   Try all          Any heuristic     O(NM)
Heuristic         Greatest MSE     Any heuristic     O(N)
Random            Random           Random            O(1)
Furthest point in cluster
[Figure labels: prototype removed; cluster where added; furthest point selected.]
Smart swap
• Initialization: O(MN)
• Swap iteration
– Finding nearest pair: O(M²)
– Calculating distortion: O(N)
– Sorting clusters: O(M∙log M)
– Evaluation of result: O(N)
– Repartition and fine-tuning: O(αN)
Total: O(MN + M² + I∙N)
• Expected number of iterations: < 2∙M
• Estimated total time: O(2M²N)
Smart swap
[Figure labels: cluster with largest distortion; nearest prototypes.]
Smart swap pseudo code
SmartSwap(X, M) → C, P
   C ← InitializeCentroids(X);
   P ← PartitionDataset(X, C);
   Maxorder ← log2(M);
   order ← 1;
   WHILE order < Maxorder
      ci, cj ← FindNearestPair(C);
      S ← SortClustersByDistortion(P, C);
      cswap ← RandomSelect(ci, cj);
      clocation ← s_order;
      Cnew ← Swap(cswap, clocation);
      Pnew ← LocalRepartition(P, Cnew);
      KmeansIteration(Pnew, Cnew);
      IF f(Cnew) < f(C) THEN
         order ← 1;
         C ← Cnew;
      ELSE
         order ← order + 1;
   KmeansIteration(P, C);
Pairwise swap
– Nearest neighbors of each other
– Nearest neighbor of the other set further than in the same set → subject to swap
– Unpaired prototypes
Combinations of random and deterministic swap
Variant   Removal                       Addition
RR        Random                        Random
RD        Random                        Deterministic
DR        Deterministic                 Random
DD        Deterministic                 Deterministic
D2R       Deterministic + data update   Random
D2D       Deterministic + data update   Deterministic
Summary of the time complexities
              Random removal     Deterministic removal
              RR       RD        DR       DD       D2R      D2D
Removal       O(1)     O(1)      O(MN)    O(MN)    O(αN)    O(αN)
Addition      O(1)     O(N)      O(1)     O(N)     O(1)     O(N)
Repartition   O(N)     O(N)      O(N)     O(N)     O(N)     O(N)
K-means       O(αN)    O(αN)     O(αN)    O(αN)    O(αN)    O(αN)
Total         O(αN)    O(αN)     O(MN)    O(MN)    O(αN)    O(αN)
Profiles of the processing time
[Figure: Bridge. Time (s) per iteration for the variants RR, RD, DR, DD, D2R and D2D, broken down into Swap, Repartition, K-means and Others.]
Test data sets
Data set        Type of data set           Number of data   Number of      Dimension of
                                           vectors (N)      clusters (M)   data vector (d)
Bridge          Gray-scale image           4096             256            16
House*          RGB image                  34112            256            3
Miss America    Residual vectors           6480             256            16
Europe          Differential coordinates   169673           ?              2
BIRCH1-BIRCH3   Synthetically generated    100000           100            2
S1-S4           Synthetically generated    5000             15             2
Dim32-1024      Synthetically generated    1000             256            32 – 1024
[Figures: scatter plots of data sets S1, S2, S3 and S4, and of the Birch data sets Birch1, Birch2 and Birch3.]
Experiments
[Figure: Bridge. Error (MSE) versus time (s) for the RR (Random Swap), RD, DR and DD variants.]
[Figure: Bridge. MSE versus time for Repeated k-means, Random Swap, DR and D2R.]
[Figure: Birch2. Error (MSE) versus time (s) for the RR (Random Swap), RD, DR and DD variants.]
[Figure: Miss America (Missa1). MSE versus time for Repeated k-means, Random Swap, DR and D2R.]
Quality comparisons (MSE) with 10 second time constraint
                    Bridge    House    Miss America   Europe (×10^7)   Birch1 (×10^8)   Birch2 (×10^6)
Repeated Random     251.32    12.12    8.34           2.37             13.10            22.35
Repeated K-means    177.66    6.58     5.92           1.52             5.49             4.10
Random Swap         174.08    6.41     5.85           1.26             5.70             4.43
RD-variant          171.20    6.10     5.58           1.02             5.11             2.78
Average speed-up
from RR to RD       2:1       4:1      5:1            6:1              4:1              18:1
Literature
1. P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
2. P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
3. P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.
4. P. Fränti, M. Tuononen and O. Virmajoki, "Deterministic and randomized local search algorithms for clustering", IEEE Int. Conf. on Multimedia and Expo (ICME'08), Hannover, Germany, 837-840, June 2008.
5. P. Fränti and O. Virmajoki, "On the efficiency of swap-based clustering", Int. Conf. on Adaptive and Natural Computing Algorithms (ICANNGA'09), Kuopio, Finland, LNCS 5495, 303-312, April 2009.
6. J. Chen, Q. Zhao and P. Fränti, "Smart swap for more efficient clustering", Int. Conf. Green Circuits and Systems (ICGCS'10), Shanghai, China, 446-450, June 2010.
7. B. Fritzke, "The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks", Neural Processing Letters, 5 (1), 35-45, 1997.
8. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
9. T. Kaukoranta, P. Fränti and O. Nevalainen, "Iterative split-and-merge algorithm for VQ codebook generation", Optical Engineering, 37 (10), 2726-2732, October 1998.
10. H. Frigui and R. Krishnapuram, "Clustering by competitive agglomeration", Pattern Recognition, 30 (7), 1109-1119, July 1997.
11. A. Likas, N. Vlassis and J.J. Verbeek, "The global k-means clustering algorithm", Pattern Recognition, 36, 451-461, 2003.
12. PAM (Kaufman and Rousseeuw, 1987)
13. CLARA (Kaufman and Rousseeuw, 1990)
14. CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994)
15. R.T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining", IEEE Transactions on Knowledge and Data Engineering, 14 (5), September/October 2002.