Clustering Algorithms, Part 2c: Agglomerative Clustering (AC)
Pasi Fränti, 25.3.2014
Speech & Image Processing Unit, School of Computing, University of Eastern Finland, Joensuu, FINLAND

Agglomerative clustering: categorization by cost function
• Single link: minimize the distance between the two nearest vectors.
• Complete link: minimize the distance between the two furthest vectors.
• Ward's method (our focus here): minimize the mean square error. In vector quantization it is known as the pairwise nearest neighbor (PNN) method.

Pseudo code:

    PNN(X, M) → C, P
      s_i ← {x_i} for all i ∈ [1, N];
      m ← N;
      REPEAT
        (s_a, s_b) ← NearestClusters();
        MergeClusters(s_a, s_b);
        m ← m − 1;
        UpdateDataStructures();
      UNTIL m = M;

The same algorithm annotated with time complexities:

    PNN(X, M) → C, P
      FOR i ← 1 TO N DO p[i] ← i; c[i] ← x[i];   // O(N)
      m ← N;
      REPEAT                                      // N − M times
        (a, b) ← FindSmallestMergeCost();         // O(N²)
        MergeClusters(a, b);
        m ← m − 1;
      UNTIL m = M;                                // T(N) = O(N³) in total

Ward's method [Ward 1963: Journal of the American Statistical Association]

Merge cost of clusters a and b with sizes n_a, n_b and centroids c_a, c_b:

    d(a, b) = (n_a · n_b) / (n_a + n_b) · ||c_a − c_b||²

Local optimization strategy: always merge the pair with the smallest cost,

    (a, b) = arg min_{i, j ∈ [1, N], i ≠ j} d(i, j)

The nearest neighbor search therefore involves two tasks:
1. Find the cluster pair to be merged.
2. Update the NN pointers.

Example of distance calculations (n_a = 1, n_b = 9, n_c = 3; ||c_a − c_b||² = 36, ||c_b − c_c||² = 25):

    MergeCost(a, b) = (1 · 9) / (1 + 9) · 36 = 32.40
    MergeCost(b, c) = (9 · 3) / (9 + 3) · 25 = 56.25

Example of the overall process
[Figure: snapshots of the clustering shrinking from m = 5000 clusters through m = 4999, 4998, …, 50, …, 16, 15.]

Detailed example of the process: merging from 25 clusters down to 15, the MSE increases at every merge:

    25 clusters: MSE ≈ 1.01·10⁹      20 clusters: MSE ≈ 1.16·10⁹
    24 clusters: MSE ≈ 1.03·10⁹      19 clusters: MSE ≈ 1.19·10⁹
    23 clusters: MSE ≈ 1.06·10⁹      18 clusters: MSE ≈ 1.23·10⁹
    22 clusters: MSE ≈ 1.09·10⁹      17 clusters: MSE ≈ 1.26·10⁹
    21 clusters: MSE ≈ 1.12·10⁹      16 clusters: MSE ≈ 1.30·10⁹
                                     15 clusters: MSE ≈ 1.34·10⁹

Storing the distance matrix
• Maintain the distance matrix and update only the rows of the changed cluster.
• The number of distance calculations per step then reduces from O(N²) to O(N).
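Ward's merge cost and the straightforward O(N³) loop above can be sketched in Python. This is a minimal illustration, not the authors' implementation; the function names and the (size, centroid) cluster representation are my own:

```python
import numpy as np

def merge_cost(na, ca, nb, cb):
    """Ward's merge cost: d(a, b) = na*nb / (na+nb) * ||ca - cb||^2."""
    return na * nb / (na + nb) * np.sum((ca - cb) ** 2)

def pnn(X, M):
    """Naive O(N^3) PNN: repeatedly merge the cheapest cluster pair."""
    clusters = [(1, np.asarray(x, dtype=float)) for x in X]  # (size, centroid)
    while len(clusters) > M:
        # O(N^2) exhaustive search for the pair with the smallest merge cost
        _, i, j = min((merge_cost(*clusters[i], *clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        (na, ca), (nb, cb) = clusters[i], clusters[j]
        # the merged centroid is the size-weighted mean of the two centroids
        merged = (na + nb, (na * ca + nb * cb) / (na + nb))
        clusters[j] = clusters[-1]   # remove cluster j (note j > i)
        clusters.pop()
        clusters[i] = merged
    return clusters
```

With the numbers of the distance-calculation example, merge_cost(1, c_a, 9, c_b) with ||c_a − c_b||² = 36 gives 32.4, matching the slide.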
• Searching for the minimum pair still takes O(N²) time per step, so the total remains O(N³).
• It also requires O(N²) memory.

Heap structure for fast search [Kurita 1991: Pattern Recognition]
[Figure: the pairwise merge costs stored in a binary heap; the minimum is always at the root.]
• The search for the minimum reduces from O(N²) to O(log N).
• In total: O(N² log N).

Store nearest neighbor (NN) pointers [Fränti et al. 2000: IEEE Trans. Image Processing]
[Figure: each cluster a–g keeps a pointer to its current nearest neighbor.]
• The time complexity reduces from O(N³) towards the lower bound Ω(N²).

    PNN(X, M) → C, P
      FOR i ← 1 TO N DO p[i] ← i; c[i] ← x[i];          // O(N)
      FOR i ← 1 TO N DO NN[i] ← FindNearestCluster(i);  // O(N²)
      REPEAT
        a ← SmallestMergeCost(NN);                       // O(N)
        b ← NN[a];
        MergeClusters(C, P, NN, a, b);                   // O(N)
        UpdatePointers(C, NN);
      UNTIL m = M;

(An implementation is available at http://cs.uef.fi/pages/franti/research/pnn.txt.)

Example with NN pointers [Virmajoki 2004: Pairwise Nearest Neighbor Method Revisited]

Input data: [Figure: seven points a–g on a 10×10 grid.] Initial merge-cost table:

    cluster    a     b     c     d     e     f     g   | min  | NN
    a         --    2.0  16.0   2.5   4.0   8.0  16.0  |  2.0 |  b
    b         2.0   --   10.0   2.5  10.0  18.0  26.0  |  2.0 |  a
    c        16.0  10.0   --   22.5  36.0  40.0  32.0  | 10.0 |  b
    d         2.5   2.5  22.5   --    4.5  14.5  30.5  |  2.5 |  a
    e         4.0  10.0  36.0   4.5   --    4.0  20.0  |  4.0 |  a
    f         8.0  18.0  40.0  14.5   4.0   --    8.0  |  4.0 |  e
    g        16.0  26.0  32.0  30.5  20.0   8.0   --   |  8.0 |  f

Step 1: merge the cheapest pair (a, b), cost 2.0. Only the row of the new cluster ab and the pointers that referenced a or b need recomputing:

    cluster   ab     c     d     e     f     g   | min  | NN
    ab        --   16.7   2.7   8.7  16.7  27.3  |  2.7 |  d
    c        16.7   --   22.5  36.0  40.0  32.0  | 16.7 | ab
    d         2.7  22.5   --    4.5  14.5  30.5  |  2.7 | ab
    e         8.7  36.0   4.5   --    4.0  20.0  |  4.0 |  f
    f        16.7  40.0  14.5   4.0   --    8.0  |  4.0 |  e
    g        27.3  32.0  30.5  20.0   8.0   --   |  8.0 |  f

Step 2: merge (ab, d), cost 2.7:

    cluster  abd     c     e     f     g   | min  | NN
    abd       --   23.1   8.1  19.1  35.1  |  8.1 |  e
    c        23.1   --   36.0  40.0  32.0  | 23.1 | abd
    e         8.1  36.0   --    4.0  20.0  |  4.0 |  f
    f        19.1  40.0   4.0   --    8.0  |  4.0 |  e
    g        35.1  32.0  20.0   8.0   --   |  8.0 |  f

Step 3: merge (e, f), cost 4.0:

    cluster  abd     c    ef     g   | min  | NN
    abd       --   23.1  19.3  35.1  | 19.3 | ef
    c        23.1   --   49.3  32.0  | 23.1 | abd
    ef       19.3  49.3   --   17.3  | 17.3 |  g
    g        35.1  32.0  17.3   --   | 17.3 | ef
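The NN-pointer bookkeeping shown in the tables above can be sketched as follows. This is an illustrative sketch only, not the code linked above; it assumes Ward's merge cost and a dict of (size, centroid) clusters, and all names are my own:

```python
import numpy as np

def merge_cost(na, ca, nb, cb):
    # Ward's merge cost
    return na * nb / (na + nb) * np.sum((ca - cb) ** 2)

def pnn_nn_pointers(X, M):
    """PNN with nearest-neighbor pointers."""
    clusters = {i: (1, np.asarray(x, dtype=float)) for i, x in enumerate(X)}
    next_id = len(clusters)

    def nearest(i):
        # O(N) scan: cheapest merge partner of cluster i
        return min((merge_cost(*clusters[i], *clusters[j]), j)
                   for j in clusters if j != i)

    nn = {i: nearest(i) for i in clusters}       # O(N^2) initialization
    while len(clusters) > M:
        a = min(nn, key=lambda i: nn[i][0])      # O(N): smallest merge cost
        b = nn[a][1]
        (na, ca), (nb, cb) = clusters.pop(a), clusters.pop(b)
        del nn[a], nn[b]
        new = next_id
        next_id += 1
        clusters[new] = (na + nb, (na * ca + nb * cb) / (na + nb))
        if len(clusters) > 1:
            nn[new] = nearest(new)
            # only pointers that referenced a or b become stale
            # (tau clusters on average)
            for i, (_, j) in list(nn.items()):
                if j in (a, b):
                    nn[i] = nearest(i)
    return list(clusters.values())
```

Each merge step costs O(N) for the pair search plus O(τ·N) for recomputing the stale pointers, giving the O(τ·N²) total discussed above.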
Step 4: merge (ef, g), cost 17.3:

    cluster  abd     c   efg   | min  | NN
    abd       --   23.1  30.8  | 23.1 |  c
    c        23.1   --   48.7  | 23.1 | abd
    efg      30.8  48.7   --   | 30.8 | abd

Final step: merge (abd, c), cost 23.1. The final clustering consists of the two clusters abcd = {a, b, c, d} and efg = {e, f, g}.

Time complexities of the variants

    Phase                     Original   With heap     With NN pointers
    Initialization            O(N)       O(N²)         O(N²)
    Single merge phase:
      Find two nearest        O(N²)      O(1)          O(N)
      Merge the clusters      O(1)       O(1)          O(1)
      Recalculate distances   O(1)       O(N)          O(N)
      Update data structures  O(1)       O(N log N)    O(N)
    Merge phases in total     O(N³)      O(N² log N)   O(N²)
    Algorithm in total        O(N³)      O(N² log N)   O(N²)

Number of neighbors (τ)
[Figure: distribution of the number of NN pointers per cluster; average τ is 5.1 for BIRCH1, 7.0 for House and 12.1 for Bridge.]

Processing time comparison
[Figure: processing time in seconds (log scale) versus training set size N = 512…4096; the NN-pointer method is orders of magnitude faster than the original method.]

Algorithm: Lazy-PNN
T. Kaukoranta, P. Fränti and O. Nevalainen, "Vector quantization by lazy pairwise nearest neighbor method", Optical Engineering, 38 (11), 1862-1868, November 1999.

Monotony property of the merge cost [Kaukoranta et al., Optical Engineering, 1999]

The centroid of a merged cluster is the size-weighted mean of its parts:

    c_{a+b} = (n_a · c_a + n_b · c_b) / (n_a + n_b)

Merge cost values are monotonically increasing: if

    d(S_a, S_b) ≤ d(S_a, S_c) ≤ d(S_b, S_c)

then

    d(S_a, S_c) ≤ d(S_{a+b}, S_c)

Lazy variant of the PNN
• Store the merge costs in a heap.
• Update a merge cost value only when it appears at the top of the heap.
• Processing time reduces by about 35%.

    Method            Ref.  Time complexity      Additional data structure  Space
    Trivial PNN       [10]  O(d·N³)              --                         O(N)
    Distance matrix   [6]   O(d·N² + N³)         distance matrix            O(N²)
    Kurita's method   [5]   O(d·N² + N²·log N)   dist. matrix + heap        O(N²)
    τ-PNN             [1]   O(τ·d·N²)            NN table                   O(N)
    Lazy-PNN          [4]   O(τ·d·N²)            NN table                   O(N)

Combining PNN and K-means
[Figure: hybrid scheme over codebook sizes from N down to M. The N training vectors are first reduced to an intermediate codebook of size M₀ by random selection, the standard PNN then shrinks it from M₀ to the final size M, and K-means (GLA) fine-tunes the result.]

Algorithm: Iterative shrinking
P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.

Agglomerative clustering based on merging
[Figure: before and after a cluster merge among clusters S1…S5. The data vectors of the two clusters to be merged are joined and their two code vectors are replaced by a single centroid; the remaining vectors are unaffected.]

Agglomeration based on cluster removal [Fränti and Virmajoki, Pattern Recognition, 2006]
[Figure: before and after a cluster removal. Instead of merging two clusters, one code vector is removed and its data vectors are repartitioned among the remaining clusters.]

Merge versus removal
[Figure: PNN and IS compared after the third and fourth step; removal can distribute the orphaned vectors over several clusters instead of a single merge partner.]

Pseudo code of iterative shrinking (IS):

    IS(X, M) → C, P
      m ← N;
      FOR i ← 1, m: c_i ← x_i; p_i ← i; n_i ← 1;
      FOR i ← 1, m: q_i ← FindSecondNearestCluster(C, x_i);
      REPEAT
        CalculateRemovalCosts(C, P, Q, d);
        a ← SelectClusterToBeRemoved(d);
        RemoveCluster(P, Q, a);
        UpdateCentroids(C, P, a);
        UpdateSecondaryPartitions(C, P, Q, a);
        m ← m − 1;
      UNTIL m = M;

Cluster removal in practice

Find the secondary cluster of each vector:

    q_i = arg min_{1 ≤ j ≤ m, j ≠ p_i}  (n_j / (n_j + 1)) · ||x_i − c_j||²

Calculate the removal cost for every vector of the cluster a to be removed:

    d_i = (n_{q_i} / (n_{q_i} + 1)) · ||x_i − c_{q_i}||²  −  ||x_i − c_a||²

The removal cost of cluster a is the sum of d_i over the vectors it contains.

Partition updates
[Figure: clusters S1…S13 after removing one code vector. The orphaned data vectors are repartitioned; three update strategies (minimum, standard and extensive) differ in how many surrounding clusters are re-examined.]

Complexity analysis

The expected number of vectors per cluster when m clusters remain is N/m. Summing over the removal steps from m = N down to m = M:

    N/N + N/(N−1) + … + N/M = N · (1/M + 1/(M+1) + … + 1/N)

If we iterate until M = 1, this harmonic sum gives

    N · (1 + 1/2 + … + 1/N) = O(N log N)
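The secondary-cluster and removal-cost formulas of the IS method above translate directly into code. A sketch under an illustrative representation (partition array p, per-cluster centroids and sizes; all names are my own, not from the paper):

```python
import numpy as np

def removal_costs(X, p, centroids, sizes):
    """Per-cluster removal costs for iterative shrinking (IS).

    For each vector x_i, the secondary cluster q_i minimizes
        n_j / (n_j + 1) * ||x_i - c_j||^2   over j != p_i,
    and the vector's removal cost d_i is that value minus its current
    distortion ||x_i - c_{p_i}||^2.  A cluster's removal cost is the
    sum of d_i over the vectors it owns.
    """
    m = len(centroids)
    D = np.zeros(m)                      # removal cost per cluster
    q = np.zeros(len(X), dtype=int)      # secondary cluster per vector
    for i, x in enumerate(X):
        a = p[i]
        best, best_j = np.inf, -1
        for j in range(m):
            if j == a:
                continue
            cost = sizes[j] / (sizes[j] + 1) * np.sum((x - centroids[j]) ** 2)
            if cost < best:
                best, best_j = cost, j
        q[i] = best_j
        D[a] += best - np.sum((x - centroids[a]) ** 2)
    return D, q
```

SelectClusterToBeRemoved is then simply the argmin over D; after removing cluster a, every vector with p[i] = a moves to its secondary cluster q[i], and the affected centroids and secondary pointers are updated.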
Adding the processing time per vector, the total number of reassignment operations over all removal steps is of the order N · log(N/M).

Algorithm: PNN with kNN graph
P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.

Agglomerative clustering with a kNN graph:

    AgglomerativeClustering(X, M) → S
      Step 1: Construct the k-NN graph.
      Step 2: REPEAT
        Step 2.1: Find the best pair (s_a, s_b) to be merged.
        Step 2.2: Merge the pair (s_a, s_b) → s_ab; resolve the k-NN list of s_ab.
        Step 2.3: Remove the obsolete cluster (s_b).
        Step 2.4: Find the neighbors of s_ab.
        Step 2.5: Update the distances for the neighbors of s_ab.
      UNTIL |S| = M;

[Figure: examples of a 2-NN and a 4-NN graph over the clusters a–k.]
[Figure: the graph stored as doubly linked lists; merging a and b replaces their two lists by the list of the new cluster a+b.]

Effect on the calculations (theoretical cost per step over N steps, and observed operation counts):

    Stage          |  τ-PNN    Single link          Double link          |  τ-PNN   Single   Double
    Find pair      |  N        1                    1                    |  8 357        3        3
    Merge          |  N        k² + log N           k² + k + log N       |  8 367      200      305
    Remove last    |  N        k + log N            log N                |  8 349      102       45
    Find neighbors |  N        k·N                  k                    |  8 357   41 769      204
    Update costs   |  τ·N      τ(1+τ) + τ/k·log N   τ(1+τ) + τ/k·log N   | 48 538      198      187
    TOTAL          |  O(τN²)   O(kN²)               O(τN log N)          | 81 970   42 274      746

Processing time as a function of k
[Figure: processing time in seconds on Bridge versus the number of graph neighbors k = 2…20, split into graph creation (by divide-and-conquer) and agglomeration.]

Time-distortion comparison
[Figure: MSE versus running time on Miss America for fast K-means, Graph-PNN (1: graph created by MSP; 2: graph created by D-n-C), τ-PNN (229 s) and the trivial PNN (>9999 s); MSE = 5.36 marked for reference.]

Conclusions
• Simple to implement, good clustering quality.
• The straightforward algorithm is slow: O(N³).
• A fast exact (yet simple) algorithm exists: O(τN²).
• Beyond this, O(τ·N·log N) complexity is possible, at the price of a complicated graph data structure that compromises the exactness of the merge.

Literature
1. P. Fränti, T. Kaukoranta, D.-F. Shen and K.-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000.
2. P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.
3. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
4. T. Kaukoranta, P. Fränti and O. Nevalainen, "Vector quantization by lazy pairwise nearest neighbor method", Optical Engineering, 38 (11), 1862-1868, November 1999.
5. T. Kurita, "An efficient agglomerative clustering algorithm using a heap", Pattern Recognition, 24 (3), 205-209, 1991.
6. J. Shanbehzadeh and P.O. Ogunbona, "On the computational complexity of the LBG and PNN algorithms", IEEE Trans. on Image Processing, 6 (4), 614-616, April 1997.
7. O. Virmajoki, P. Fränti and T. Kaukoranta, "Practical methods for speeding-up the pairwise nearest neighbor method", Optical Engineering, 40 (11), 2495-2504, November 2001.
8. O. Virmajoki and P. Fränti, "Fast pairwise nearest neighbor based algorithm for multilevel thresholding", Journal of Electronic Imaging, 12 (4), 648-659, October 2003.
9. O. Virmajoki, Pairwise Nearest Neighbor Method Revisited, PhD thesis, Computer Science, University of Joensuu, 2004.
10. J.H. Ward, "Hierarchical grouping to optimize an objective function", Journal of the American Statistical Association, 58, 236-244, 1963.