Power-Aware Placement

Yongseok Cheon, Pei-Hsin Ho
Advanced Technology Group, Synopsys, Inc.
{cheon,pho}@synopsys.com

Andrew B. Kahng, Sherief Reda and Qinke Wang
UCSD CSE Department
{abk,sreda,qiwang}@cs.ucsd.edu

Outline
• Introduction
• Activity-based register clustering
• Activity-based net weighting
• Experiments
• Conclusions

IC Power Consumption
• Switching power
  – largest source of power dissipation
  – usually accounts for 40% to 80% of total power
  – switching power of a net is proportional to the product of net capacitance and signal switching rate
• Short-circuit power
  – power dissipation due to the short-circuit current that flows briefly during the switching of a CMOS gate
• Leakage power
  – power dissipation due to spurious currents in the non-conducting state of a transistor

Clock Power Consumption
• Clock net
  – a major contributor to dynamic power
  – much larger capacitance than most signal nets
  – highest switching activity
  – typically consumes up to 40% of total dynamic power across a variety of design types
• Traditional placement methodologies treat registers no differently than combinational cells
  – this leads to sub-optimal placements in terms of power

Power-Aware Placement Method
• Activity-based register clustering
  – reduces the capacitance of clock nets, hence clock power
• Activity-based net weighting
  – reduces the capacitance of high-activity signal nets, hence total net switching power

Outline
• Introduction
• Activity-based register clustering
• Activity-based net weighting
• Experiments
• Conclusions

Large Weight for Clock Net?
• Not a good idea
• May only affect registers close to boundaries
• Introduce hot spots and highly congested areas

Distribution of Clock Tree Capacitance
• Observation: most of the clock tree capacitance (e.g., 80%) is at the leaf level
[Figure: Clock-Tree Capacitance Distribution on a Customer Design; capacitance (pF), split into wire cap and pin cap, versus clock-tree level 0 to 9]

Register Clustering
• Goal: reduce capacitance of a clock net
• Method: clumping the registers within the same leaf cluster of the clock tree into a smaller area
• Result: reduced leaf-level clock tree capacitance and potentially clock skew

Flow of Register Clustering
1. Quick CTS algorithm: group registers into clusters such that each cluster can become a leaf cluster of the actual clock tree
2. Group Bounds: constrain the placement of a cluster of registers within smaller bounding box

Quick Clock-Tree Synthesis Algorithm
• Decide a scope of target cluster size heuristically based on
  – size of the clock net
  – design rule constraints: max fanout and max load
  – user configuration
• Perform clustering for each direction from left, right, top and down and each target cluster size
• Select the clustering with the best CTS objective
  – e.g., minimum clock skew, minimum clock delay, minimum # clock buffers, etc.
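The clustering-and-selection loop of the quick CTS algorithm can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the slides: the `Pin` tuple layout, the function names, and in particular the cost function, which uses the summed half-perimeter of the cluster bounding boxes as a stand-in for a real CTS objective such as skew, delay or buffer count. The growth rule (seed from the extreme unclustered pin, then add the pin nearest in Manhattan distance to the capacitance-weighted centroid) follows the slides.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pin = Tuple[float, float, float]  # hypothetical layout: (x, y, pin capacitance)

# Sort keys that put the "extreme" pin for each sweep direction first.
_DIRECTION_KEYS: Dict[str, Callable[[Pin], float]] = {
    "left": lambda p: p[0],
    "right": lambda p: -p[0],
    "bottom": lambda p: p[1],
    "top": lambda p: -p[1],
}


def weighted_centroid(cluster: Sequence[Pin]) -> Tuple[float, float]:
    """Capacitance-weighted centroid of the pins in a cluster."""
    total_cap = sum(c for _, _, c in cluster)
    cx = sum(x * c for x, _, c in cluster) / total_cap
    cy = sum(y * c for _, y, c in cluster) / total_cap
    return cx, cy


def grow_clusters(pins: Sequence[Pin], target_size: int, direction: str) -> List[List[Pin]]:
    """Greedily grow clusters of at most target_size pins from one direction."""
    remaining = sorted(pins, key=_DIRECTION_KEYS[direction])
    clusters: List[List[Pin]] = []
    while remaining:
        # Seed with the extreme unclustered pin (list stays sorted after removals).
        cluster = [remaining.pop(0)]
        while remaining and len(cluster) < target_size:
            cx, cy = weighted_centroid(cluster)
            nearest = min(remaining, key=lambda p: abs(p[0] - cx) + abs(p[1] - cy))
            remaining.remove(nearest)
            cluster.append(nearest)
        clusters.append(cluster)
    return clusters


def half_perimeter(cluster: Sequence[Pin]) -> float:
    """Half-perimeter of the cluster bounding box (assumed stand-in cost)."""
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def quick_cts(pins: Sequence[Pin], target_sizes: Sequence[int]) -> List[List[Pin]]:
    """Try every direction and target size; keep the lowest-cost clustering."""
    best_cost, best_clusters = None, []
    for size in target_sizes:
        for direction in ("left", "right", "top", "bottom"):
            clusters = grow_clusters(pins, size, direction)
            cost = sum(half_perimeter(c) for c in clusters)
            if best_cost is None or cost < best_cost:
                best_cost, best_clusters = cost, clusters
    return best_clusters
```

Note that the stand-in cost always favors smaller clusters, so it is only meaningful for comparing directions at a fixed target size; a realistic objective would also account for the number of clock buffers and the design-rule limits on fanout and load.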
Quick CTS Algorithm (contd)
• Start with the leftmost (rightmost, highest or lowest) unclustered clock pin
• Add the clock pin with the shortest Manhattan distance to the capacitance-weighted centroid of the current cluster
• Grow until the target cluster size is reached
• Repeat growing clusters until all clock pins are clustered

Group Bounds
• Control the bounding box of a cluster and reduce it while still fitting the registers
• Compute the current bounding box of the registers
• Shrink the bounding box proportionally
• Shrink ratio p is determined by
  – the specified shrinking factor p0
  – the switching rate SR of the clock net and the maximum switching rate MSR

Aspect Ratio of Bounding Box
• Close to the original bounding box aspect ratio ARold when shrink ratio p is close to 1
  – avoids a serious increase in signal net length
• Close to square when shrink ratio p is close to 0
  – reduced clock skew
• Linear function of original aspect ratio ARold and shrink ratio p

Outline
• Introduction
• Activity-based register clustering
• Activity-based net weighting
• Experiments
• Conclusions

Pros and Cons of Register Clustering
• Effectively reduces the capacitance of the leaf-level clock tree
• Increases the length of some signal nets
  – can cancel out the clock power reduction

Activity-Based Net Weighting
• Goal: reduce capacitance of signal nets
• Assign larger weights to signal nets with higher switching rates
• Combining register clustering and activity-based net weighting further reduces the total net switching power

Activity-Based Net Weighting
• Assign larger weights to nets with higher switching rates
  – T: threshold for selecting high-activity nets
  – MSSR: maximum signal net switching rate
  – W: controls the scope of power weights
[Figure: power weight versus switching rate; y-axis marks 1 and 1+W, x-axis marks T, MSSR and MSR]

Compatibility with Timing Weights
• Linear combination of power and timing net weights
• Power ratio α: 0 ~ 1
  – controls the ratio of power weight
  – knob for trade-off between timing and power

Outline
• Introduction
• Activity-based register clustering
• Activity-based net weighting
• Experiments
• Conclusions

Experimental Setup
• Implemented on Synopsys IC Compiler
• Eight industry circuits:
  – #cells: 20k ~ 186k
  – #registers: 2.3k ~ 44.2k
  – clock power: 32% of total power
  – net switching power: 39% of total power
• Power-aware placement
  – shrink ratio and power ratio around 0.8

Experimental Flow
• Commercial IC implementation flow
[Flow diagram: IC Compiler; Place, CTS, Route, Extract RC, STA, Power Analysis]
• Power analysis:
  – specified switching rates of primary inputs
  – net switching rates estimated by probabilistic simulation

Clock Net Switching Power
[Figure: bar chart of clock net switching power for designs D1 to D8, annotated with 11.2%]

Total Net Switching Power
[Figure: bar chart of total net switching power for designs D1 to D8, annotated with 25.4%]

Results
Design  #Cells  #Regs   Methods    Clock Switching  Total Switching  Total     Clock     Clock    WNS      Cell       CPU
                                   Power            Power            Power     WL        Skew              Area
D1      186K    44244   Reference  153.29           319.86           908.51    879529    0.156    0.34     4619435
                        low_power  100.43           182.59           737.71    843626    0.110    0.13     5041092
                        imp %      34.48%           42.92%           18.80%    4.08%     1.38%    6.31%    -9.13%     -25.22%
D2      49K     5621    Reference  313.12           408.99           425.66    84601     0.028    0.01     1087912
                        low_power  288.96           364.01           380.22    77964     0.023    0.41     1161019
                        imp %      7.72%            11.00%           10.68%    7.85%     0.10%    -7.69%   -6.72%     -24.75%
D3      134K    43528   Reference  168.61           302.57           1127.23   1492789   1.180    4.61     42612408
                        low_power  150.87           224.95           1054.08   1266024   0.427    3.78     43121730
                        imp %      10.52%           25.65%           6.49%     15.19%    5.02%    5.53%    -1.20%     -51.29%
D4      172K    23372   Reference  102.32           258.53           789.27    484661    0.095    0.46     4871915
                        low_power  100.65           218.94           717.80    482264    0.088    0.54     4646738
                        imp %      1.63%            15.31%           9.06%     0.49%     0.18%    -2.00%   4.62%      21.32%
D5      116K    9071    Reference  20.74            37.74            127.27    173554    0.169    3.73     2444381
                        low_power  18.49            32.42            117.92    143063    0.174    4.10     2433401
                        imp %      10.85%           14.10%           7.35%     17.57%    -0.03%   -2.47%   0.45%      6.96%
D6      20K     2315    Reference  1.64             3.00             10.64     46130     0.031    0.00     535949
                        low_power  1.54             2.44             9.58      39254     0.030    0.00     447993
                        imp %      6.10%            18.67%           10.03%    14.91%    0.00%    0.00%    16.41%     -32.57%
D7      126K    12864   Reference  21.54            46.87            133.28    260509    0.222    0.15     3136603
                        low_power  19.31            33.52            113.58    252242    0.249    0.48     3471139
                        imp %      10.35%           28.48%           14.78%    3.17%     -0.68%   -8.25%   -10.67%    -9.02%
D8      138K    8727    Reference  3.18             6.35             21.97     114542    0.178    3.26     1603950
                        low_power  2.94             3.36             18.84     95760     0.285    3.38     1701496
                        imp %      7.55%            47.09%           14.24%    16.40%    -0.54%   -0.60%   -6.08%     16.23%
Total AVG imp %                    11.15%           25.40%           11.43%    9.96%     0.68%    -1.15%   -1.54%     -12.29%

Summary
• Reduction
  – clock net switching power: 11.3% (1.6% ~ 34.5%)
  – total net switching power: 25.3% (10.5% ~ 47.1%)
  – total power: 11.4% (6.5% ~ 18.8%)
  – clock WL: 10.1%
  – clock skew: random
• Impact
  – WNS (worst negative slack): 2.0%
  – total cell area: 1.2%
  – runtime: 11.5%

Power-Timing Trade-Off with Power Ratio
[Figure: total switching power (series "Power") and WNS (series "Timing") versus power net weighting ratio, from 0.2 to 1]

Power-Timing Trade-Off with Shrink Ratio
[Figure: clock switching power (series "Power") and WNS (series "Timing") versus shrink ratio, from 0.95 down to 0.2]

Conclusions
• We have presented a power-aware placement method that performs activity-based net weighting and register clustering to reduce the capacitance of high-activity signal and clock nets
• We have evaluated the method on eight real designs through a complete industrial physical design flow
• Our approach achieved average reductions of 25.3% in net switching power and 11.4% in total power, at a cost of 2.0% in timing, 1.2% in total cell area and 11.5% in runtime
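As a supplementary illustration of the activity-based net weighting slides, the sketch below shows one way the power weight and its combination with a timing weight might be computed. The slides give only the parameters (T, MSSR, W, α) and a plot, so the exact shapes here are assumptions: the power weight is taken to be 1 up to the threshold T, rising linearly to 1+W at MSSR and staying there beyond it, and the combined weight is taken to be the convex combination α·w_power + (1−α)·w_timing. The function names are hypothetical.

```python
def power_weight(switching_rate: float, threshold: float, mssr: float, w: float) -> float:
    """Assumed piecewise-linear power weight.

    1 for nets at or below the threshold T, 1 + W at or above the maximum
    signal net switching rate (MSSR), linear in between.
    """
    if switching_rate <= threshold:
        return 1.0
    if switching_rate >= mssr:
        return 1.0 + w
    return 1.0 + w * (switching_rate - threshold) / (mssr - threshold)


def combined_weight(w_power: float, w_timing: float, alpha: float) -> float:
    """Assumed linear combination of power and timing weights.

    alpha is the power ratio in [0, 1]: 1 means power weights only,
    0 means timing weights only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * w_power + (1.0 - alpha) * w_timing


# Example: a net halfway between T and MSSR, combined with a timing weight of 1.
wp = power_weight(0.5, threshold=0.25, mssr=0.75, w=4.0)
net_weight = combined_weight(wp, w_timing=1.0, alpha=0.5)
```

Under these assumptions, raising α toward 1 lets high-activity nets dominate the placement objective, which matches the role the slides give the power ratio as the knob for trading timing against power.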