Color clustering
[Figure: sources of the RGB data vectors and a Red-Green plot of the vectors (axes R, G).]
Example of clustering
Clustering for vector quantization
[Figure: starting point (996 data vectors) and clustering result (256 clusters).]
Goals of clustering and classification
1. Supervised classification:
Partition the input set so that data vectors that originate from
the same source belong to the same group.
- Training data available with known classification.
- Typical solutions:
o statistical methods.
o neural networks
2. Clustering:
Partition the input set so that similar vectors are grouped together
and dissimilar vectors into different groups. No training data is
available: the classes are unknown and the model is fitted to the data.
- Problems to solve:
o Find the number of clusters
o Find the locations of the clusters
- Typical solutions:
o clustering algorithms
o other statistical methods
3. Vector quantization:
Generate a codebook that approximates the input data.
- Number of clusters defined by the user
- Codebook generated by clustering algorithms
Vector quantization
Data:
- Set of N input vectors X = {x1, x2, ..., xN}
- Partition into M clusters P = {p1, p2, ..., pM}
- Cluster centroids C = {c1, c2, ..., cM}
Goal:
Find C and P that minimize f(C, P).
Error function:
f(P, C) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2
[Figure: the mapping function P assigns each of the N training vectors in X a scalar index in [1..M]; the codebook C contains the M code vectors, each a K-dimensional vector.]
Representation of solution
- Partition
- Codebook
Main approaches
1. Hierarchical methods
- Build the clustering structure stepwise:
- Splitting approach (top-down):
o Increase clusters by adding new ones
o For example: divide the largest cluster
- Merge-based approach (bottom-up):
o Decrease clusters by removing existing ones
o For example: merge existing clusters
2. Iterative methods
- Take any initial solution, e.g. random clustering
- Make small changes to the existing solution by:
o Descent method (apply rules that improve the solution)
o Local search (trial-and-error approach)
Generalized Lloyd algorithm (GLA)
Partition step:
p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad \forall i \in [1, N]

Centroid step:
c_j = \frac{\sum_{p_i = j} x_i}{\sum_{p_i = j} 1}, \quad \forall j \in [1, M]
GLA(X, P, C) → (P, C)
REPEAT
    FOR i := 1 TO N DO
        pi ← FindNearestCentroid(xi, C)
    FOR j := 1 TO M DO
        cj ← CalculateCentroid(X, P, j)
UNTIL no improvement.
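A minimal NumPy sketch of the same two alternating steps (vectorized nearest-centroid search and mean computation stand in for the FindNearestCentroid and CalculateCentroid helpers; the stopping test on the distortion is an assumption of the sketch):

    import numpy as np

    def gla(X, C, max_iter=100):
        prev_error = np.inf
        for _ in range(max_iter):
            # Partition step: assign every vector to its nearest centroid.
            dists = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)
            P = np.argmin(dists, axis=1)
            error = np.mean(np.min(dists, axis=1))
            # Centroid step: recompute each centroid as the mean of its cluster.
            for j in range(len(C)):
                if np.any(P == j):
                    C[j] = X[P == j].mean(axis=0)
            if error >= prev_error:           # no improvement -> stop
                break
            prev_error = error
        return P, C

    # Usage: initial codebook = M randomly chosen data vectors.
    X = np.random.rand(1000, 3)
    C0 = X[np.random.choice(len(X), 8, replace=False)].copy()
    P, C = gla(X, C0)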
Splitting approach
Split:
Put all vectors in one cluster;
REPEAT
    Select the cluster to be split;
    Split the cluster;
UNTIL the desired number of clusters is reached;
Median cut algorithm
(example)
Color samples (x, y):
(0, 0), (0, 5), (0, 15), (1, 10), (4, 4), (4, 12), (5, 4), (6, 6), (15, 0), (15, 14)
[Figure: distribution of the colors in the (x, y) plane, with the cut made at each stage.]

Regions after each stage (maximum dimension in parentheses):
Initial:  A = [0..15, 0..15] (0..15)
Stage 1:  A = [0..4, 0..15] (0..15),  B = [5..15, 0..14] (0..14)
Stage 2:  A = [0..4, 0..5] (0..5),  B = [5..15, 0..14] (0..14),  C = [0..4, 10..15] (10..15)
Stage 3:  A = [0..4, 0..5] (0..5),  B = [5..15, 0..4] (5..15),  C = [0..4, 10..15] (10..15),  D = [6..15, 6..14] (6..15)
Stage 4:  A = [0..4, 0..5] (0..5),  B = [5..5, 4..4] (5..5),  C = [0..4, 10..15] (10..15),  D = [6..15, 6..14] (6..15),  E = [15..15, 0..0] (15..15)
Final color palette: (1, 3), (6, 4), (2, 12), (11, 10), (15, 0)
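A compact sketch of the median-cut idea on the same (x, y) samples. It always cuts the box with the widest coordinate range at the median of that coordinate; the exact region bookkeeping of the worked example above is simplified, so the resulting palette will not match it point for point.

    import numpy as np

    def median_cut(points, m):
        boxes = [points]
        while len(boxes) < m:
            # Pick the box with the largest coordinate range and cut it there.
            ranges = [box.max(axis=0) - box.min(axis=0) for box in boxes]
            b = int(np.argmax([r.max() for r in ranges]))
            box = boxes.pop(b)
            dim = int(np.argmax(box.max(axis=0) - box.min(axis=0)))
            order = box[:, dim].argsort()
            half = len(box) // 2
            boxes += [box[order[:half]], box[order[half:]]]
        # One representative color per box: the box centroid.
        return np.array([box.mean(axis=0) for box in boxes])

    samples = np.array([(0, 0), (0, 5), (0, 15), (1, 10), (4, 4),
                        (4, 12), (5, 4), (6, 6), (15, 0), (15, 14)], dtype=float)
    print(median_cut(samples, 5).round(1))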
Median cut + GLA
(example)
Color samples (x, y):
(0, 0), (0, 5), (0, 15), (1, 10), (4, 4), (4, 12), (5, 4), (6, 6), (15, 0), (15, 14)
[Figure: distribution of the colors in the (x, y) plane.]
Median cut segmentation:
[Figure: the five regions shown in the (x, y) plane.]
Region | Color     | Samples in the region    | Square error
A      | ( 1,  3)  | (0, 0) (0, 5) (4, 4)     | 25
B      | ( 5,  4)  | (5, 4)                   | 0
C      | ( 2, 12)  | (0, 15) (1, 10) (4, 12)  | 22
D      | (11, 10)  | (6, 6) (15, 14)          | 48
E      | (15,  0)  | (15, 0)                  | 0
In total: 95
After first iteration:
[Figure: the updated representatives in the (x, y) plane.]
Original color | Colors mapped to the representative | New color | Square error
( 1,  3)       | (0, 0) (0, 5)                       | ( 0,  3)  | 13
( 5,  4)       | (4, 4) (5, 4) (6, 6)                | ( 5,  5)  | 5
( 2, 12)       | (0, 15) (1, 10) (4, 12)             | ( 2, 12)  | 22
(11, 10)       | (15, 14)                            | (15, 14)  | 0
(15,  0)       | (15, 0)                             | (15,  0)  | 0
In total: 40
After second iteration:
[Figure: the representatives in the (x, y) plane.]
Original color | Colors mapped to the representative | New color | Square error
( 0,  3)       | (0, 0) (0, 5)                       | ( 0,  3)  | 13
( 5,  5)       | (4, 4) (5, 4) (6, 6)                | ( 5,  5)  | 5
( 2, 12)       | (0, 15) (1, 10) (4, 12)             | ( 2, 12)  | 22
(15, 14)       | (15, 14)                            | (15, 14)  | 0
(15,  0)       | (15, 0)                             | (15,  0)  | 0
In total: 40 (unchanged, so the GLA has converged)
PCA-based splitting
1. Calculate the principal axis.
2. Select the dividing point P on the principal axis.
3. Partition according to hyperplane.
4. Calculate two centroids of the two subclusters.
[Figure: the principal axis of the cluster, the dividing point on it, and the dividing hyperplane; the two resulting subclusters and their centroids.]
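A sketch of a single PCA-based split in NumPy, assuming the principal axis is taken from the covariance matrix and the dividing point is the projected mean (the variant in the figure may choose the dividing point differently):

    import numpy as np

    def pca_split(cluster):
        mean = cluster.mean(axis=0)
        cov = np.cov(cluster, rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)
        axis = eigvecs[:, -1]                    # principal axis (largest eigenvalue)
        proj = (cluster - mean) @ axis           # position along the principal axis
        left, right = cluster[proj <= 0], cluster[proj > 0]   # hyperplane at the mean
        return left, right, left.mean(axis=0), right.mean(axis=0)

    # Example: split 200 three-dimensional (RGB-like) vectors into two subclusters.
    rng = np.random.default_rng(0)
    left, right, c1, c2 = pca_split(rng.normal(size=(200, 3)))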
Time complexity of splitting
- Assume clusters of n vectors with K values (K = 3 for RGB).
- The principal axis is calculated in O(nK²) time.
- The dividing point is selected in O(n log n) time.
- Assume that the largest cluster is always split into equal halves: n → n/2.
- The total number of vectors processed is then:

\sum_i n_i = N + \left(\frac{N}{2} + \frac{N}{2}\right) + \left(\frac{N}{4} + \frac{N}{4} + \frac{N}{4} + \frac{N}{4}\right) + \cdots + \left(\frac{2N}{M} + \cdots + \frac{2N}{M}\right) = N \log_2 M

- Total time complexity is O(NK² log M) + O(N log N).
Splitting experiments
With partition refinement: [figure]
[Figure: quality-time comparison (MSE vs. time in seconds) of Random, R+GLA, Split-1, Split-2, S+GLA, SLR, SLR+GLA and SGLA; existing and new methods marked.]
Merge-based approach: PNN algorithm
PNN:
Put each vector in its own cluster;
REPEAT
    (a, b) ← SearchClusterPair;
    MergeClusters(a, b);
UNTIL the desired number of clusters is reached;
[Figure: before and after a cluster merge (clusters S1..S5). x = training vectors of the two clusters to be merged, + = other training vectors; the code vectors are shown before and after the merge.]
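A plain (unoptimized) NumPy sketch of the PNN loop. The slides only name SearchClusterPair and MergeClusters, so the merge cost used here, the size-weighted centroid distance n_a n_b / (n_a + n_b) * ||c_a - c_b||², is an assumption:

    import numpy as np

    def pnn(X, M):
        centroids = [x.astype(float) for x in X]   # every vector starts as its own cluster
        sizes = [1] * len(X)
        while len(centroids) > M:
            best, pair = np.inf, (0, 1)
            for a in range(len(centroids)):         # SearchClusterPair (exhaustive)
                for b in range(a + 1, len(centroids)):
                    cost = (sizes[a] * sizes[b] / (sizes[a] + sizes[b])
                            * np.sum((centroids[a] - centroids[b]) ** 2))
                    if cost < best:
                        best, pair = cost, (a, b)
            a, b = pair                             # MergeClusters: size-weighted centroid
            centroids[a] = (sizes[a] * centroids[a] + sizes[b] * centroids[b]) / (sizes[a] + sizes[b])
            sizes[a] += sizes[b]
            del centroids[b], sizes[b]
        return np.array(centroids)

    codebook = pnn(np.random.rand(50, 3), 8)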
Iterative shrinking
IS(X, M) → S
FOR i ← 1 TO N DO
    si ← {xi};
REPEAT
    sa ← SearchClusterToBeRemoved(S);
    RepartitionCluster(S, sa);
UNTIL |S| = M;
[Figure: before and after a cluster removal (clusters S1..S5). x = training vectors of the cluster to be removed, + = other training vectors; the removed cluster's vectors are repartitioned to the remaining clusters.]
Results using merge-based approaches (PNN, IS)
[Figure: PNN vs. IS after the third and fourth merge.]
[Figure: MSE vs. run time for PNN (original), PNN (improved), GLA-PNN-GLA (improved) and GA-PNN (improved).]
Split and merge
Generate an initial codebook by any algorithm.
REPEAT
    Select a cluster to be split.
    Split the selected cluster.
    Select two clusters to be merged.
    Merge the selected clusters.
UNTIL no improvement is achieved.
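A self-contained sketch of one split+merge step, assuming simple selection rules (split the cluster with the largest squared error along its widest coordinate, merge the two nearest centroids); the actual selection criteria are a design choice of the method:

    import numpy as np

    def split_merge_step(X, P, C):
        M = len(C)
        # Split: pick the cluster with the largest total squared error.
        errors = [np.sum((X[P == j] - C[j]) ** 2) for j in range(M)]
        members = X[P == int(np.argmax(errors))]
        dim = int(np.argmax(members.max(axis=0) - members.min(axis=0)))
        cut = np.median(members[:, dim])
        left, right = members[members[:, dim] <= cut], members[members[:, dim] > cut]
        if len(left) == 0 or len(right) == 0:
            return P, C                      # degenerate cluster, nothing to split
        C = np.vstack([np.delete(C, int(np.argmax(errors)), axis=0),
                       left.mean(axis=0), right.mean(axis=0)])
        # Merge: join the two closest centroids (unweighted average for simplicity).
        d = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=2)
        np.fill_diagonal(d, np.inf)
        a, b = np.unravel_index(np.argmin(d), d.shape)
        C = np.vstack([np.delete(C, [a, b], axis=0), (C[a] + C[b]) / 2])
        # Repartition to the updated codebook.
        P = np.argmin(np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2), axis=1)
        return P, C

In the full algorithm this step is repeated until f(P, C) no longer improves.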
[Figure: Split-Merge vs. Merge-Split; the number of clusters varies between M-h, M, M+h and N.]
Comparison of Split and Merge
[Figure: MSE vs. time (seconds) for GLA, Split, SLR, PNN, SM, SGLA and SMG.]
Structure of Local Search
Generate an initial solution.
REPEAT
    Generate a set of new solutions.
    Evaluate the new solutions.
    Select the best solution.
UNTIL stopping criterion met.

Neighborhood function using random swap:
c_j \leftarrow x_i, \quad j = \mathrm{random}(1, M), \; i = \mathrm{random}(1, N)

Object rejection (re-map the vectors of the replaced cluster j):
p_i = \arg\min_{1 \le k \le M} d(x_i, c_k)^2, \quad \forall i : p_i = j

Object attraction (each vector chooses between its current cluster and the new one):
p_i = \arg\min_{k = j \,\vee\, k = p_i} d(x_i, c_k)^2, \quad \forall i \in [1, N]
Randomized local search
RLS algorithm 1:
C ← SelectRandomDataObjects(M).
P ← OptimalPartition(C).
REPEAT T times
    Cnew ← RandomSwap(C).
    Pnew ← LocalRepartition(P, Cnew).
    Cnew ← OptimalRepresentatives(Pnew).
    IF f(Pnew, Cnew) < f(P, C) THEN
        (P, C) ← (Pnew, Cnew)

RLS algorithm 2:
C ← SelectRandomDataObjects(M).
P ← OptimalPartition(C).
REPEAT T times
    Cnew ← RandomSwap(C).
    Pnew ← LocalRepartition(P, Cnew).
    K-means(Pnew, Cnew).
    IF f(Pnew, Cnew) < f(P, C) THEN
        (P, C) ← (Pnew, Cnew)
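A NumPy sketch of RLS variant 1, with the LocalRepartition step replaced by a full repartition for brevity (that shortcut and the helper names are assumptions of the sketch):

    import numpy as np

    def rls(X, M, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        N = len(X)
        C = X[rng.choice(N, M, replace=False)].copy()

        def repartition(C):
            d = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)
            return np.argmin(d, axis=1), np.mean(np.min(d, axis=1))

        P, error = repartition(C)
        for _ in range(T):
            C_new = C.copy()
            C_new[rng.integers(M)] = X[rng.integers(N)]      # random swap: c_j <- x_i
            P_new, _ = repartition(C_new)
            for j in range(M):                               # optimal representatives
                if np.any(P_new == j):
                    C_new[j] = X[P_new == j].mean(axis=0)
            P_new, new_error = repartition(C_new)
            if new_error < error:                            # accept only improvements
                C, P, error = C_new, P_new, new_error
        return P, C

    P, C = rls(np.random.rand(1000, 2), M=16, T=200)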
Random swap
[Figure: before swap (missing and unnecessary clusters), after swap (centroid added, centroid removed), local refinement / fine-tuning (new cluster appears, obsolete cluster disappears), after k-means (cluster moves down).]
Genetic algorithm
[Figure: MSE vs. number of iterations (0 to 5000) for RLS-1 and RLS-2 on Bridge, and a comparison of K-means and of RLS initialized by Random, K-means, Splitting and Ward (MSE values 176.53, 163.93, 163.63, 163.51, 163.08).]
Structure of Genetic Algorithm
Genetic algorithm:
Generate S initial solutions.
REPEAT T times
    Generate new solutions.
    Sort the solutions.
    Store the best solution.
END-REPEAT
Output the best solution found.

Generate new solutions:
REPEAT S times
    Select pair for crossover.
    Cross the selected solutions.
    Mutate the new solution.
    Fine-tune the new solution by GLA.
END-REPEAT
Pseudo code for the GA (1/2)
CrossSolutions(C1, P1, C2, P2) → (Cnew, Pnew)
    Cnew ← CombineCentroids(C1, C2)
    Pnew ← CombinePartitions(P1, P2)
    Cnew ← UpdateCentroids(Cnew, Pnew)
    RemoveEmptyClusters(Cnew, Pnew)
    PerformPNN(Cnew, Pnew)

CombineCentroids(C1, C2) → Cnew
    Cnew ← C1 ∪ C2

CombinePartitions(Cnew, P1, P2) → Pnew
    FOR i ← 1 TO N DO
        IF \lVert x_i - c_{p_i^1} \rVert^2 \le \lVert x_i - c_{p_i^2} \rVert^2 THEN
            p_i^{new} ← p_i^1
        ELSE
            p_i^{new} ← p_i^2
    END-FOR
Pseudo code for the GA (2/2)
UpdateCentroids(Cnew, Pnew) → Cnew
    FOR j ← 1 TO |Cnew| DO
        cj ← CalculateCentroid(Pnew, j)

PerformPNN(Cnew, Pnew)
    FOR i ← 1 TO |Cnew| DO
        qi ← FindNearestNeighbor(ci)
    WHILE |Cnew| > M DO
        a ← FindMinimumDistance(Q)
        b ← qa
        MergeClusters(ca, pa, cb, pb)
        UpdatePointers(Q)
    END-WHILE
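A sketch of the crossover above: take the union of the two parent codebooks, keep for each vector the better of its two parent mappings, recompute centroids, and shrink back to M clusters with a simple PNN-style merge (the size-weighted merge cost is an assumption of the sketch):

    import numpy as np

    def cross_solutions(X, C1, P1, C2, P2, M):
        # CombineCentroids: union of the parent codebooks (size 2M).
        C = np.vstack([C1, C2])
        # CombinePartitions: keep whichever parent mapping is closer for each vector.
        d1 = np.sum((X - C1[P1]) ** 2, axis=1)
        d2 = np.sum((X - C2[P2]) ** 2, axis=1)
        P = np.where(d1 <= d2, P1, P2 + len(C1))
        # UpdateCentroids + RemoveEmptyClusters.
        used = np.unique(P)
        C = np.array([X[P == j].mean(axis=0) for j in used])
        sizes = np.array([np.sum(P == j) for j in used], dtype=float)
        # PerformPNN: merge the cheapest pair until M clusters remain.
        while len(C) > M:
            na, nb = sizes[:, None], sizes[None, :]
            cost = na * nb / (na + nb) * np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=2)
            np.fill_diagonal(cost, np.inf)
            a, b = np.unravel_index(np.argmin(cost), cost.shape)
            C[a] = (sizes[a] * C[a] + sizes[b] * C[b]) / (sizes[a] + sizes[b])
            sizes[a] += sizes[b]
            C, sizes = np.delete(C, b, axis=0), np.delete(sizes, b)
        P = np.argmin(np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2), axis=1)
        return C, P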
Combining existing solutions
Performance comparison of GA
[Figure: distortion vs. number of iterations (0 to 50) on Bridge for random crossover + GLA, mutations + GLA, PNN crossover, and PNN crossover + GLA.]