Color clustering

[Figure: sources of the RGB data vectors, and a red–green plot of the vectors.]

Example of clustering: vector quantization

[Figure: starting point (996 data vectors) and clustering result (256 clusters).]

Goals of clustering and classification

1. Supervised classification: Partition the input set so that data vectors that originate from the same source belong to the same group.
   - Training data with a known classification is available.
   - Typical solutions:
     o statistical methods
     o neural networks

2. Clustering: Partition the input set so that similar vectors are grouped together and dissimilar vectors into different groups.
   - No training data is available: the classes are unknown, and the model is fitted to the data.
   - Goals to solve:
     o find how many clusters there are
     o find the locations of the clusters
   - Typical solutions:
     o clustering algorithms
     o other statistical methods

3. Vector quantization: Generate a codebook that approximates the input data.
   - The number of clusters is defined by the user.
   - The codebook is generated by clustering algorithms.

Vector quantization

Data:
- Set of N input vectors X = {x_1, x_2, ..., x_N}
- Partition P = {p_1, p_2, ..., p_N} defining M clusters, where p_i ∈ [1..M] is the cluster index of x_i
- Cluster centroids C = {c_1, c_2, ..., c_M}

Goal: find C and P that minimize f(P, C).

Error function:

$$f(P, C) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$$

Mapping function P

[Figure: the mapping function P assigns to each of the N training vectors of the training set X a scalar in [1..M] pointing to a code vector; the codebook C contains M code vectors (M ≪ N), each a K-dimensional vector.]

Representation of solution

[Figure: a solution can be represented by its partition or by its codebook.]

Main approaches

1. Hierarchical methods
   - Build the clustering structure stepwise.
   - Splitting approach (top-down):
     o increase the number of clusters by adding new ones
     o for example: divide the largest cluster
   - Merge-based approach (bottom-up):
     o decrease the number of clusters by removing existing ones
     o for example: merge existing clusters

2. Iterative methods
   - Take any initial solution, e.g. a random clustering.
   - Make small changes to the existing solution by:
     o a descent method (apply rules that improve the solution)
     o local search (trial-and-error approach)

Generalized Lloyd algorithm (GLA)

Partition step:

$$p_i = \arg\min_{1 \le j \le M} d(x_i, c_j)^2, \quad i \in [1, N]$$

Centroid step:

$$c_j = \frac{\sum_{p_i = j} x_i}{\sum_{p_i = j} 1}, \quad j \in [1, M]$$

GLA(X, P, C) returns (P, C):
  REPEAT
    FOR i := 1 TO N DO p_i ← FindNearestCentroid(x_i, C);
    FOR j := 1 TO M DO c_j ← CalculateCentroid(X, P, j);
  UNTIL no improvement.

Splitting approach

Split:
  Put all vectors in one cluster;
  REPEAT
    Select the cluster to be split;
    Split the cluster;
  UNTIL the final codebook size is reached;

Median cut algorithm (example)

Color samples (x, y):
(0,0) (0,5) (0,15) (1,10) (4,4) (4,12) (5,4) (6,6) (15,0) (15,14)

[Figure: distribution of the colors in the (x, y) plane, and the regions after each stage.]

At each stage, the region with the largest dimension is split at the median of that dimension:

Stage    Regions                 Maximum dimension
Initial  A = [0..15, 0..15]      0..15
1.       A = [0..4,  0..15]      0..15
         B = [5..15, 0..14]      0..14
2.       A = [0..4,  0..5]       0..5
         B = [5..15, 0..14]      0..14
         C = [0..4,  10..15]     10..15
3.       A = [0..4,  0..5]       0..5
         B = [5..15, 0..4]       5..15
         C = [0..4,  10..15]     10..15
         D = [6..15, 6..14]      6..15
4.       A = [0..4,  0..5]       0..5
         B = [5..5,  4..4]       5..5
         C = [0..4,  10..15]     10..15
         D = [6..15, 6..14]      6..15
         E = [15..15, 0..0]      15..15

Final color palette: (1,3) (6,4) (2,12) (11,10) (15,0)
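To make the median cut procedure concrete, here is a minimal Python sketch. The name median_cut and the count-based median split are choices of this sketch, not of the original slides; the worked example above splits coordinate ranges, so the exact regions it produces can differ slightly.

```python
import numpy as np

def median_cut(points, m):
    """Median cut: repeatedly take the region whose bounding box has
    the largest extent and split it at the median of that coordinate,
    until m regions exist.  The palette is the region centroids."""
    regions = [np.asarray(points, dtype=float)]
    while len(regions) < m:
        # Select the region with the largest bounding-box dimension.
        extents = [r.max(axis=0) - r.min(axis=0) for r in regions]
        i = max(range(len(regions)), key=lambda k: extents[k].max())
        region = regions.pop(i)
        dim = int(np.argmax(region.max(axis=0) - region.min(axis=0)))
        # Split at the median of the chosen coordinate.
        order = np.argsort(region[:, dim])
        half = len(region) // 2
        regions += [region[order[:half]], region[order[half:]]]
    return [r.mean(axis=0) for r in regions]

samples = [(0, 0), (0, 5), (0, 15), (1, 10), (4, 4), (4, 12),
           (5, 4), (6, 6), (15, 0), (15, 14)]
print(median_cut(samples, 5))   # five representative colors
```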
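The GLA pseudocode given above translates almost line for line into NumPy. The following sketch (the function name gla is mine) assumes squared Euclidean distance, consistent with the error function f(P, C); the Median cut + GLA example below applies exactly this kind of refinement to a median cut palette.

```python
import numpy as np

def gla(X, C):
    """Generalized Lloyd algorithm: alternate the partition step and
    the centroid step until f(P, C) no longer improves."""
    X, C = np.asarray(X, float), np.asarray(C, float).copy()
    prev_f = np.inf
    while True:
        # Partition step: p_i = argmin_j d(x_i, c_j)^2.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P = d2.argmin(axis=1)
        f = d2[np.arange(len(X)), P].mean()   # error function f(P, C)
        if f >= prev_f:                       # UNTIL no improvement
            return P, C
        prev_f = f
        # Centroid step: c_j = mean of the vectors with p_i = j.
        for j in range(len(C)):
            if np.any(P == j):
                C[j] = X[P == j].mean(axis=0)
```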
Median cut + GLA (example)

Color samples (x, y):
(0,0) (0,5) (0,15) (1,10) (4,4) (4,12) (5,4) (6,6) (15,0) (15,14)

[Figure: distribution of the colors, and the median cut segmentation.]

Median cut segmentation:

Region  Color    Colors mapped to the representative  Square error
A       (1,3)    (0,0) (0,5) (4,4)                    25
B       (5,4)    (5,4)                                0
C       (2,12)   (0,15) (1,10) (4,12)                 22
D       (11,10)  (6,6) (15,14)                        48
E       (15,0)   (15,0)                               0
                                          In total:   95

After the first iteration of the GLA:

Original color  Colors mapped to the representative  New color  Square error
(1,3)           (0,0) (0,5)                          (0,3)      13
(5,4)           (4,4) (5,4) (6,6)                    (5,5)      5
(2,12)          (0,15) (1,10) (4,12)                 (2,12)     22
(11,10)         (15,14)                              (15,14)    0
(15,0)          (15,0)                               (15,0)     0
                                          In total:  40

After the second iteration the mapping, the new colors (0,3) (5,5) (2,12) (15,14) (15,0), and the total square error (40) are unchanged, so the GLA has converged.

PCA-based splitting

1. Calculate the principal axis.
2. Select the dividing point on the principal axis.
3. Partition according to the dividing hyperplane.
4. Calculate the centroids of the two subclusters.

[Figure: principal axis, dividing point, and dividing hyperplane of a cluster, and the subclusters produced by recursive splitting.]

Time complexity of splitting

- Assume a cluster of n vectors with K values each (K = 3 for RGB):
  o the principal axis is calculated in O(nK²) time
  o the dividing point is selected in O(n log n) time
- Assume that the largest cluster is always split into two equal halves. The total number of vectors processed over all splits is then

$$\sum_i n_i = N + \left(\frac{N}{2} + \frac{N}{2}\right) + \left(\frac{N}{4} + \frac{N}{4} + \frac{N}{4} + \frac{N}{4}\right) + \cdots = N \log_2 M$$

- The total time complexity is O(NK² log M) + O(N log N).

Splitting experiments

[Figure: splitting with partition refinement (random vs. existing vs. new method), and a quality–time comparison (MSE vs. time in seconds) of Split-1, Split-2, R+GLA, S+GLA, SLR, SLR+GLA, and SGLA.]

Merge-based approach: PNN algorithm

PNN:
  Put all vectors in their own clusters;
  REPEAT
    (a, b) ← SearchClusterPair;
    MergeClusters(a, b);
  UNTIL the final codebook size is reached;

[Figure: code vectors and training vectors before and after a cluster merge (clusters S1–S5); the training vectors of the two clusters to be merged are highlighted.]

Iterative shrinking

IS(X, M) returns S:
  FOR i := 1 TO N DO s_i ← {x_i};
  REPEAT
    s_a ← SearchClusterToBeRemoved(S);
    RepartitionCluster(S, s_a);
  UNTIL |S| = M;

[Figure: code vectors and training vectors before and after a cluster removal (clusters S1–S5); the training vectors of the removed cluster are repartitioned among the remaining clusters.]

Results using merge-based approaches

[Figure: PNN and IS after the third and the fourth merge; MSE vs. run time for PNN (original), PNN (improved), GLA-PNN-GLA (improved), and GA-PNN (improved).]

Split and merge

Generate an initial codebook by any algorithm;
REPEAT
  Select a cluster to be split;
  Split the selected cluster;
  Select two clusters to be merged;
  Merge the selected clusters;
UNTIL no improvement achieved;
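The split operation, used both by the pure splitting approach and by the split-and-merge loop above, can be sketched along the lines of the PCA-based splitting described earlier. Caveats: pca_split is my name, the cluster must contain at least two vectors, and the dividing point is taken simply at the median projection rather than selected optimally as in the O(n log n) step of the slides.

```python
import numpy as np

def pca_split(cluster):
    """PCA-based splitting: 1) principal axis, 2) dividing point,
    3) partition by the dividing hyperplane, 4) two new centroids."""
    X = np.asarray(cluster, float)
    # 1. Principal axis: eigenvector of the covariance matrix with the
    #    largest eigenvalue (eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    axis = eigvecs[:, -1]
    # 2.-3. Project onto the axis and cut at the median projection.
    proj = (X - X.mean(axis=0)) @ axis
    order = np.argsort(proj)
    half = len(X) // 2
    left, right = X[order[:half]], X[order[half:]]
    # 4. Centroids of the two subclusters.
    return (left, left.mean(axis=0)), (right, right.mean(axis=0))
```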
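The merge operation can be sketched in the same spirit. The slides leave SearchClusterPair abstract; this sketch assumes the standard PNN merge cost n_a·n_b/(n_a+n_b)·‖c_a − c_b‖² and scans all pairs naively, which is simple but slow (practical PNN variants maintain nearest-neighbor pointers instead).

```python
import numpy as np

def pnn(X, M):
    """Pairwise nearest neighbor: start from N singleton clusters and
    merge the cheapest pair until M clusters remain."""
    clusters = [[x] for x in np.asarray(X, float)]
    while len(clusters) > M:
        best, pair = np.inf, None
        for a in range(len(clusters)):            # SearchClusterPair
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                ca = np.mean(clusters[a], axis=0)
                cb = np.mean(clusters[b], axis=0)
                cost = na * nb / (na + nb) * ((ca - cb) ** 2).sum()
                if cost < best:
                    best, pair = cost, (a, b)
        a, b = pair                               # MergeClusters(a, b)
        clusters[a] += clusters.pop(b)
    return [np.mean(c, axis=0) for c in clusters]
```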
[Figure: split–merge and merge–split strategies keep the codebook size oscillating between M−h and M+h around the target size M, between 0 and N.]

Comparison of split and merge

[Figure: MSE vs. time (seconds) for GLA, Split, SLR, PNN, SM, SGLA, and SMG.]

Structure of local search

Generate an initial solution.
REPEAT
  Generate a set of new solutions;
  Evaluate the new solutions;
  Select the best solution;
UNTIL stopping criterion met.

Neighborhood function using random swap:

$$c_j \leftarrow x_i, \quad j = \mathrm{random}(1, M), \; i = \mathrm{random}(1, N)$$

Object rejection (the vectors of the replaced cluster are re-assigned):

$$p_i = \arg\min_{1 \le k \le M} d(x_i, c_k)^2, \quad \forall i : p_i = j$$

Object attraction (each vector may move to the new cluster j):

$$p_i = \arg\min_{k \in \{j,\, p_i\}} d(x_i, c_k)^2, \quad i \in [1, N]$$

Randomized local search

RLS algorithm 1:
  C ← SelectRandomDataObjects(M);
  P ← OptimalPartition(C);
  REPEAT T times
    C_new ← RandomSwap(C);
    P_new ← LocalRepartition(P, C_new);
    C_new ← OptimalRepresentatives(P_new);
    IF f(P_new, C_new) < f(P, C) THEN (P, C) ← (P_new, C_new);

RLS algorithm 2:
  C ← SelectRandomDataObjects(M);
  P ← OptimalPartition(C);
  REPEAT T times
    C_new ← RandomSwap(C);
    P_new ← LocalRepartition(P, C_new);
    K-means(P_new, C_new);
    IF f(P_new, C_new) < f(P, C) THEN (P, C) ← (P_new, C_new);

Random swap example

[Figure: before the swap the solution has a missing cluster and an unnecessary cluster; the swap removes one centroid and adds another; local fine-tuning makes the new cluster appear and the obsolete cluster disappear; k-means then fine-tunes the result further.]

[Figure: convergence of RLS-1 and RLS-2 on the Bridge data set (MSE vs. iterations, up to 5000), and the final MSE of RLS combined with different initializations (random, K-means, splitting, Ward).]

Genetic algorithm

Structure of the genetic algorithm:

Generate S initial solutions.
REPEAT T times
  Generate new solutions;
  Sort the solutions;
  Store the best solution;
END-REPEAT
Output the best solution found.

Generate new solutions:
REPEAT S times
  Select a pair for crossover;
  Cross the selected solutions;
  Mutate the new solution;
  Fine-tune the new solution by GLA;
END-REPEAT

Pseudo code for the GA (1/2)

CrossSolutions(C1, P1, C2, P2) returns (C_new, P_new):
  C_new ← CombineCentroids(C1, C2)
  P_new ← CombinePartitions(P1, P2)
  C_new ← UpdateCentroids(C_new, P_new)
  RemoveEmptyClusters(C_new, P_new)
  PerformPNN(C_new, P_new)

CombineCentroids(C1, C2) returns C_new:
  C_new ← C1 ∪ C2

CombinePartitions(C_new, P1, P2) returns P_new:
  FOR i := 1 TO N DO
    IF ‖x_i − c_{p_i¹}‖² ≤ ‖x_i − c_{p_i²}‖² THEN
      p_i^new ← p_i¹
    ELSE
      p_i^new ← p_i²
  END-FOR

Pseudo code for the GA (2/2)

UpdateCentroids(C_new, P_new) returns C_new:
  FOR j := 1 TO |C_new| DO
    c_j^new ← CalculateCentroid(P_new, j)

PerformPNN(C_new, P_new):
  FOR i := 1 TO |C_new| DO q_i ← FindNearestNeighbor(c_i)
  WHILE |C_new| > M DO
    a ← FindMinimumDistance(Q)
    b ← q_a
    MergeClusters(c_a, p_a, c_b, p_b)
    UpdatePointers(Q)
  END-WHILE

Combining existing solutions

[Figure: performance comparison of GA variants on Bridge (distortion vs. number of iterations, up to 50): random crossover + GLA, mutations + GLA, PNN crossover, and PNN crossover + GLA.]
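Returning to the RLS pseudocode above: algorithm 2 is compact enough to sketch directly. This version reuses the gla() routine from the earlier sketch as the fine-tuning step, and it repartitions from scratch instead of the cheaper LocalRepartition (object rejection + attraction), which keeps the code short at the cost of speed; rls and mse are names of the sketch.

```python
import numpy as np

def mse(X, P, C):
    """Error function f(P, C): mean squared distance to the assigned centroid."""
    return ((X - C[P]) ** 2).sum(axis=1).mean()

def rls(X, M, T=1000, seed=0):
    """Randomized local search: random swap + K-means fine-tuning,
    accepting the new solution only if f(P, C) improves."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    C = X[rng.choice(len(X), size=M, replace=False)].copy()
    P, C = gla(X, C)                    # initial solution + partition
    for _ in range(T):
        Cnew = C.copy()
        # Random swap: c_j <- x_i with random j and i.
        Cnew[rng.integers(M)] = X[rng.integers(len(X))]
        Pnew, Cnew = gla(X, Cnew)       # fine-tune by K-means (GLA)
        if mse(X, Pnew, Cnew) < mse(X, P, C):
            P, C = Pnew, Cnew           # accept only improvements
    return P, C
```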
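Finally, the CrossSolutions pseudocode can be sketched by combining the pieces already shown: take the union of the parents' code vectors, keep for each vector the better of the two parent assignments, update the centroids, drop empty clusters, and merge PNN-style back down to M clusters. As before, the function name and the Ward-style merge cost are assumptions of this sketch, not of the slides.

```python
import numpy as np

def cross_solutions(X, C1, P1, C2, P2, M):
    """Sketch of GA crossover following the CrossSolutions pseudocode."""
    X = np.asarray(X, float)
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    P1, P2 = np.asarray(P1), np.asarray(P2)
    # CombineCentroids + CombinePartitions: per vector, keep the parent
    # assignment whose code vector is nearer; parent 2's cluster
    # indices are offset by |C1| in the combined codebook.
    d1 = ((X - C1[P1]) ** 2).sum(axis=1)
    d2 = ((X - C2[P2]) ** 2).sum(axis=1)
    P = np.where(d1 <= d2, P1, P2 + len(C1))
    # UpdateCentroids + RemoveEmptyClusters: keep only clusters that
    # actually received vectors.
    clusters = [X[P == j] for j in np.unique(P)]
    # PerformPNN: merge the cheapest pair until M clusters remain.
    while len(clusters) > M:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                diff = clusters[a].mean(axis=0) - clusters[b].mean(axis=0)
                cost = na * nb / (na + nb) * (diff ** 2).sum()
                if cost < best:
                    best, pair = cost, (a, b)
        a, b = pair
        clusters[a] = np.vstack([clusters[a], clusters.pop(b)])
    return np.array([c.mean(axis=0) for c in clusters])
```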