Optimality, Scalability and Stability Study of Partitioning and Placement Algorithms
Jason Cong, Michail Romesis, Min Xie
UCLA Computer Science Department
This work is partially supported by Semiconductor Research Corporation and National Science Foundation

Overview
- Motivation and related work
- Our contribution
  - Construction of Partitioning Examples with Known Upper Bound
  - Construction of Placement Examples with Known Upper Bound
  - Optimality, Scalability and Stability study
- Conclusions and future work

Motivation: Partitioning
[Figure: bar chart, vertical axis 0 to 120, comparing FM (1982), PANZA (1995), CLIP (1996), LSR (1997) and hMetis (1997). Other labels on the slide: MCNC, ISPD.]
- Significant progress in partitioning during the mid-to-late 90's
- No significant improvement in the last 5 years
- Have we reached a plateau?

Motivation: Placement
- Lack of significant progress in wirelength reduction
  - Rate of reduction is about 5-10% every 2-3 years
- Latest developments in placement differ mainly in runtime
  - Capo [A. Caldwell et al, 2000]
  - Dragon [M. Wang et al, 2000]
  - Mongrel [S. Hur et al, 2000]
  - mPL [T. Chan et al, 2000]
  - mPG [C. Chang et al, 2002]
- How much room is there for further improvement?

Motivation
- Most work compares only with known heuristics
  - Using benchmarks based on real designs
    - ISPD98 [C. Alpert 1998]
    - WSI [D. Ghosh et al, 1997]
  - Using synthetic benchmarks
    - circ and gen [M. D. Hutton et al, 1998]
    - gnl [D. Stroobandt et al, 2000]
- Little understanding of the divergence from the optimal

Related Work
Quantified Suboptimality of VLSI Layout Heuristics [L. Hagen et al, 1995]
- Construct a scaled instance with a known upper bound from an initial problem
- Over 10% area suboptimality in TimberWolf
- Notable wirelength suboptimality in GORDIAN-L
- Significant improvement was possible for placement and partitioning
- But the test cases are small; the largest netlist is less than 40K

Related Work
Optimality and Scalability of Existing Placement Algorithms [C. Chang et al, 2003]
- Construct instances with a known optimal solution using the characteristics of the original problem
- Existing placement algorithms can be 70% to 150% away from the optimal
- Average solution quality deteriorates by an additional 4% to 25% when the problem size increases by a factor of 10
- All the connections are local; there are no global connections

BEKU Construction Example
Input: t = 16, D = {12, 8}, B = 5
[Figure: example netlist split into partitions P1 and P2, with labels A, B, C and D]
- Create two partitions of size 8
- Generate 9 2-pin nets that do not cross the partition line
- Generate 3 2-pin nets that cross the partition line
- Generate 6 3-pin nets that do not cross the partition line
- Generate 2 3-pin nets that cross the partition line
- Cutsize = 5
- Cutsize improved to 4 after FM

Construction of Multiway Partitioning Examples with Known Upper Bounds (MEKU)
- Divide the nodes into m partitions of equal size
- Create B nets that cross at least two partitions
- The remaining nets each stay within one partition
- Improve the upper bound by running multiway FM

BEKU and MEKU Suite

# of nodes   # of nets   # of parts   Upper bound
   500,000     530,705            2        92,343
   500,000     530,705            2       111,873
 1,000,000   1,061,410            2       184,714
 1,000,000   1,061,410            2       223,520
 1,500,000   1,592,114            2       276,670
 1,500,000   1,592,114            2       335,242
 2,000,000   2,122,819            2       369,526
 2,000,000   2,122,819            2       447,781
   500,000     530,705            8       139,943
   500,000     530,705            8       160,163
 1,000,000   1,061,410            8       279,975
 1,000,000   1,061,410            8       320,457
 1,500,000   1,592,114            8       420,279
 1,500,000   1,592,114            8       479,971
 2,000,000   2,122,819            8       560,275
 2,000,000   2,122,819            8       640,459

- In the 2-way examples, each partition occupies 45-55% of the total area
- In the 8-way examples, each partition occupies 11.8-13.3% of the total area
URL: http://cadlab.cs.ucla.edu/~pubbench/partitioning/

Tested Three State-of-the-Art Partitioning Tools
- hMetis [G. Karypis et al, 1997]
  - Based on multilevel framework
  - MHEC and FC clustering algorithms
  - Variations of FM for refinement at each level
- MLPart [A. Caldwell et al, 2000]
  - Based on multilevel framework
  - Different algorithms for coarsening (PinEC) and refinement (VRW)
- Flare [J. Cong et al, 2000]
  - Two-level hierarchy created by the ESC clustering algorithm
  - Based on the LR bipartitioning engine and the PM multiway partitioning framework

Experimental Results on BEKU
[Figure: Quality Ratio (0.9 to 1.4) versus Bound (% of nets, 15% to 25%) for MLPart, hMetis and Flare]
- MLPart produces the best results (very close to our estimated upper bound), and Flare the worst
- The value of the bound (as a percentage of nets) influences the quality of hMetis and Flare

Experimental Results on BEKU
[Figure: runtime in minutes (0 to 40) versus circuit size (500,000 to 2,000,000) for hMetis, MLPart and Flare]
- The runtimes scale well (almost linearly)
- Flare runs out of memory when the problem size exceeds 1M nodes

Experimental Results on MEKU
[Figure: Quality Ratio (0 to 2) versus Bound (% of nets, 30% and 35%) for hMetis and Flare]
- hMetis is worse than the upper bound by only 2% when the initial bound is 30%, but the gap increases to 18% for a bound of 35%
- MLPart does not support multiway partitioning

Placement Examples with Global Connections

circuit   height   width   WL of longest net   WL contribution of longest 10%
ibm01       8158    4530                7148   51%
ibm02       8158    6430               14224   46%
ibm03       8158    6740               10624   58%
ibm04       8158    9140               15171   53%
ibm05       8158   11055               19064   47%
ibm06       8158    8715               13966   61%
ibm07       8158   14605               14051   51%
ibm08       8158   15895               16142   60%
ibm09       8158   16395               13780   55%
ibm10       8158   27890               30755   53%
ibm11      16350   10925               19234   59%
ibm12      16350   15545               26748   52%
ibm13      16350   12230               19539   59%
ibm14      16350   25475               26370   61%
ibm15      16350   23785               27284   63%
ibm16      16350   34015               42860   59%
ibm17      16283   38895               45686   56%
ibm18      16350   37065               52846   64%

Produced by Dragon on ISPD98
- The wirelength contribution from global connections can be significant!
- Need to consider the impact of global connections

Placement Examples with Global Connections Only
- Each net connects either a row or a column
- Obvious upper bound: sum the length of each row and column
- Similar to datapath examples

Placement Examples with Non-local Connections
- Extend PEKO [C. Chang 2003] by introducing non-local nets to mimic global connections
- All the modules are of equal size, and there is no space between rows or between adjacent modules
- For the d_i nets of degree i, a fixed fraction of them are generated by randomly connecting i modules; the rest are generated optimally as in PEKO

Placement Examples with Non-local Connections
Input: t = 64, D = {d2=34, d3=20, d4=7, d5=4, d6=2, d7=1}, fraction of non-local nets = 0.2
- Generate 28 2-pin nets optimally
- Generate 6 2-pin nets randomly
- Generate 16 3-pin nets optimally
- Generate 4 3-pin nets randomly
- Generate 6 4-pin nets optimally
- Generate 1 4-pin net randomly
- Generate 4 5-pin nets optimally
- Generate 2 6-pin nets optimally
- Generate 1 7-pin net optimally
- Total WL = 160

G-PEKU Suite
Module number extracted from ISPD98

circuit    #cell    #net   #row   UB
GPeku01    12506     224    113   7.93E+05
GPeku05    28146     336    169   1.79E+06
GPeku10    68685     525    263   4.38E+06
GPeku15   161187     803    402   1.03E+07
GPeku18   210341     918    460   1.34E+07

URL: http://cadlab.cs.ucla.edu/~pubbench/peku.htm

PEKU Suite
- Module number t and NDVs extracted from ISPD98
- Remove connections with pads
- Percentage of non-local nets varies from 0 to 10%
- 15% white space by expanding one dimension of the chip

PEKU Suite

% non-local nets   circuit    #cell     #net   #row   Row utilization   LB         UB
0                  Peku01     12506    14111    113   85%               8.14E+05   8.14E+05
0                  Peku05     28146    28446    169   85%               1.91E+06   1.91E+06
0                  Peku10     68685    75196    263   85%               4.73E+06   4.73E+06
0                  Peku15    161187   186608    402   85%               1.15E+07   1.15E+07
0                  Peku18    210341   201920    460   85%               1.32E+07   1.32E+07
0.25%              Peku01     12506    14111    113   85%               8.14E+05   9.23E+05
0.25%              Peku05     28146    28446    169   85%               1.91E+06   2.24E+06
0.25%              Peku10     68685    75196    263   85%               4.73E+06   6.17E+06
0.25%              Peku15    161187   186608    402   85%               1.15E+07   1.71E+07
0.25%              Peku18    210341   201920    460   85%               1.32E+07   2.01E+07
0.50%              Peku01     12506    14111    113   85%               8.14E+05   1.02E+06
0.50%              Peku05     28146    28446    169   85%               1.91E+06   2.63E+06
0.50%              Peku10     68685    75196    263   85%               4.73E+06   7.52E+06
0.50%              Peku15    161187   186608    402   85%               1.15E+07   2.30E+07
0.50%              Peku18    210341   201920    460   85%               1.32E+07   2.75E+07
… (up to 10%)

URL: http://cadlab.cs.ucla.edu/~pubbench/peku.htm

Tested Four State-of-the-Art Placers
- Capo [A. Caldwell et al, 2000]
  - Based on multilevel partitioner
  - Aims to enhance the routability
- Dragon [M. Wang et al, 2000]
  - Uses hMetis for initial partition
  - SA with bin-based swapping
- mPL [T. Chan et al, 2000]
  - Nonlinear programming on the coarsest level
  - Goto-based relaxation
- mPG [C. Chang et al, 2002]
  - Uses FC clustering and hierarchical density control
  - Incremental A-tree for routability

Experimental Results on G-PEKU

circuit   Dragon v.2.20 QR   Capo v.8.5 QR   mPG v.1.0 QR   mPL v.2.0 QR
GPeku01   1.98               1.56            1.91           1.69
GPeku05   2.01               1.69            1.97           1.83
GPeku10   2.02               1.72            1.98           1.94
GPeku15   1.99               1.79            1.97           1.97
GPeku18   2.02               1.78            1.98           1.98

- In the worst case for each placer, the gap between its solution and the upper bound varies between 79% and 102%
- Another validation that there is significant room for improvement in the placement problem

Experimental Results on PEKU
[Figure: Quality Ratio (1 to 2.2) versus % of non-local nets (0.00%, 0.25%, 0.50%, 0.75%, 1.00%, 2.00%, 5.00%, 10.00%) for Capo v.8.5, Dragon v.2.20, mPG v.1.0 and mPL v.2.0]
- mPL's QR increases when the percentage of non-local nets is increased from 0 to 0.75%, while for the other three placers the QRs steadily decrease
- The absolute values of the QRs may not be meaningful, but they help to identify the technique that works best under each scenario

Conclusions
- Bipartitioning techniques seem fairly mature
  - The best available algorithms perform and scale very well on examples built by our construction
- The best available multiway partitioning algorithms do not perform equally well
  - The worst divergence from the upper bound is 18%, by hMetis
- There is still significant room for improvement in circuit placement
  - Existing placement algorithms may produce solutions far away from the optimal (or upper bound)
  - Their effectiveness depends heavily on the characteristics of the circuits

Future Work
- Construction of more synthetic examples
  - Measure routability optimality
  - Measure timing optimality
- Understand the deficiencies of existing algorithms using these examples
- Guide the development of new VLSI CAD algorithms

Acknowledgement
- Prof. I. Markov for providing Capo's latest version
- Prof. S. Lim for providing Flare's latest version
- X. Yuan for providing the data of mPG
- J. Shinnerl and K. Sze for providing the experimental data of mPL

THE END
THANK YOU
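
Backup: the BEKU construction example (t = 16, D = {12, 8}, B = 5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' generator: the names build_beku_like and cutsize are hypothetical, the proportional split of crossing nets across net degrees is an assumption inferred from the example (3 of 12 two-pin nets, 2 of 8 three-pin nets), and the final FM pass that tightens the bound is omitted.

```python
import random


def build_beku_like(t, degree_counts, bound, seed=0):
    """Build a netlist on t nodes whose planted bipartition cuts `bound` nets.

    degree_counts maps pin count -> number of nets, e.g. {2: 12, 3: 8}.
    Returns (nets, side), where side[v] is 0 or 1 for each node v.
    """
    rng = random.Random(seed)
    nodes = list(range(t))
    rng.shuffle(nodes)
    half = t // 2
    parts = (nodes[:half], nodes[half:])  # two equal-size partitions
    side = {v: 0 for v in parts[0]}
    side.update({v: 1 for v in parts[1]})

    total = sum(degree_counts.values())
    assert 0 <= bound <= total
    degrees = sorted(degree_counts)
    # Assumption: crossing nets are shared out in proportion to net counts.
    crossing = {d: bound * degree_counts[d] // total for d in degrees}
    leftover = bound - sum(crossing.values())
    for d in degrees:  # hand out any rounding remainder, one net per degree
        if leftover > 0 and crossing[d] < degree_counts[d]:
            crossing[d] += 1
            leftover -= 1

    nets = []
    for d in degrees:
        for i in range(degree_counts[d]):
            if i < crossing[d]:
                # Crossing net: at least one pin on each side of the cut.
                k = rng.randint(1, d - 1)
                pins = rng.sample(parts[0], k) + rng.sample(parts[1], d - k)
            else:
                # Local net: all pins inside one randomly chosen partition.
                pins = rng.sample(rng.choice(parts), d)
            nets.append(tuple(pins))
    return nets, side


def cutsize(nets, side):
    """Number of nets with pins on both sides of the bipartition."""
    return sum(1 for net in nets if len({side[v] for v in net}) > 1)


nets, side = build_beku_like(16, {2: 12, 3: 8}, 5)
print(len(nets), cutsize(nets, side))  # 20 5
```

The planted partition gives an upper bound on the optimal cutsize, because a partitioner may still find a smaller cut, as FM does in the example above (5 improved to 4).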