A Fully Polynomial Time Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion Professor Shiyan Hu, Ph.D. Department of Electrical and Computer Engineering Michigan Technological University Moore’s law Twice the number of transistors, approximately every two years 2 Interconnect Delay Dominates Gate Delay 3 Technology Scaling 130nm 65nm Global interconnect lengths does not shrink Local interconnect lengths shrink Delay ∝ RC Resistance R = rL/S, where S is reduced Capacitance C slightly changes 4 Interconnect Delay Scaling Scaling factor s=0.7 per generation Emore Delay of a wire of length l tint = (rl)(cl)/2= rcl2/2 (first order) Local interconnects tint : (r/s2)(c)(ls)2/2 = rcl2/2 – Local interconnect delay is roughly unchanged Global interconnects tint : (r/s2)(c)(l)2/2= rcl2 – Global interconnect delay doubles which is unsustainable Interconnect delay increasingly more dominant 5 Timing Driven Buffer Insertion 6 Buffers Reduce RC Wire Delay x x/2 R rx/2 cx/4 cx/4 x/2 C R rx/2 cx/4 cx/4 C ∆t ∆t = t_buf – t_unbuf = RC + tb – rcx2/4 x 7 Intuitive Analysis L Interconnect Elmore delay = rcL2/2 l=2 l Interconnect D elay l l 1 L /2 2 1 rc 2 2 rc 2 L rcL 2 Since there are L/2 buffers (Of course, we need to consider buffer delay) 8 Detailed Analysis The delay of a wire of length L is T=rcL2/2 L r,c – Resistance, cap. per unit length Rd – On resistance of inverter Cg – Gate input capacitance l Assume N identical buffers with equal inter-buffer length l (L = Nl). To minimize delay T N R d C g cl rl C g cl / 2 1 L rcl / 2 rC g R d c R d C g l dT dl 0 rc Rd C g L 2 0 2 l opt l opt 2 Rd C g rc 9 Quadratic Delay -> Linear Delay Substituting lopt back into the interconnect delay expression: T opt 1 R d C g L rcl opt rC g R d c l opt L rc 2 Rd C g rc rC g R d c Rd C g 2 Rd C g rc T opt L 2 R d C g rc rC g R d c Delay grows linearly with L instead of quadratically. This is why buffer insertion is highly effective and thus widely used for reducing circuit delay. 10 25% Gates are Buffers Saxena, et al. [TCAD 2004] 11 ITRS Projections 12 Problem Formulation 1. Steiner Tree 2. n candidate buffer locations T Minimal cost (area/power) solution 13 Solution Characterization To model effect to downstream, a candidate solution is associated with • v: a node • C: downstream capacitance • Q: required arrival time • W: cumulative buffer cost 14 Candidate Buffering Solutions 15 Dynamic Programming (DP) Start from sinks Candidate solutions are generated Three operations – Add Wire – Insert Buffer – Merge Candidate solutions are propagated toward the source Solution Pruning 16 Solution Propagation: Add Wire (v2, c2, w2, q2) x (v1, c1, w1, q1) c2 = c1 + cx q2 = q1 - (rcx2/2 + rxc1) r: wire resistance per unit length c: wire capacitance per unit length 17 Solution Propagation: Insert Buffer (v1, c1b, w1b, q1b) (v1, c1, w1, q1) q1b = q1 - d(b) c1b = C(b) w1b = w1 + w(b) d(b): buffer delay 18 Solution Propagation: Merge (v, cl , wl , ql) (v, cr, wr, qr) cmerge = cl + cr wmerge = wl + wr qmerge = min(ql , qr) 19 Example of Solution Propagation 2 2 Add wire (v2, 3, 16, 0) Add wire (v, C, Q, W) • r = 1, c = 1 (v1, 1, 20, 0) • Rb = 1, Cb = 1, tb = 1 • Rd = 1 (v2, 1, 12, 1) v1 v1 Insert buffer (v3, 5, 8, 0) (v3, 3, 8, 1) Add wire v1 slack = 3 Add driver v1 slack = 5 Add driver 20 Solution Propagation (1) (2) (3) 21 Exponential Runtime 16 solutions 8 solutions 4 solutions 2 solutions n candidate buffer locations lead to 2n solutions 22 Too Many Solutions Needs solution pruning for acceleration Two candidate solutions – (v, c1, q1,w1) – (v, c2, q2,w2) Solution 1 is inferior to Solution 2 if – c1 c2 : larger load – and q1 q2 : tighter timing – and w1 w2: larger cost 23 Car Race - Speed END Car Speed <=> RAT 24 Car Race - Load Load <=> Load Capacitance 25 Faster & Smaller Load Faster & smaller load (larger RAT, smaller capacitance): Good END Slower & larger load (smaller RAT, larger capacitance): Inferior 26 Faster & Larger Load: Result 1 END 27 Faster & Larger Load: Result 2 END Who will be the winner? Cannot tell at this moment, so keep both of them. 28 Pruning (Q1,C1,W1) inferior/dominated if C1 C2,W1 W2 and Q1 Q2 Non-dominated solutions are maintained: for the same Q and W, pick min C (Q2,C2,W2depends ) # of solutions on # of distinct W and Q, but not their values 29 Generating Candidates (1) (2) (3) 30 Pruning Candidates (3) (a) (b) Both (a) and (b) look the same to the source. Remove the one with the worse slack and cost (4) 31 Candidate Example Continued (4) (5) 32 Candidate Example Continued After pruning (5) At driver, compute the candidate solution satisfying the timing target with minimum cost. The result is optimal. 33 Branch Merge Left Candidates Right Candidates 34 Pruning During Branch Merge (n1n2) solutions With after pruning each branch merge. Worst-case ((n/m)m) solutions. 35 Selected Milestone Works on Timing Buffering Is it possible to design a provably good algorithm running in polynomial time with theoretical guarantee on the error to the optimal solution? 1990 1991 ……. 1996 ……. 2003 2004 ……. 2008 2009 This is a major open problem for a decade! 36 Bridging The Gap We are bridging the gap! A Fully Polynomial Time Approximation Scheme (FPTAS) Provably good Computes a solution with cost at most (1+ɛ) of the optimal cost for any ɛ>0 Runs in time polynomial in n (nodes), b (buffer types) and 1/ɛ Best solution for an NP-hard problem in theory Highly practical 37 The Rough Picture W*: the cost of optimal solution Make guess on W* Not Good Check it Good (close to W*) Return the solution Key 1: Efficient checking Key 2: Smart guess 38 Key 1: Efficient Checking Benefit of guess Only maintain the solutions with cost no greater than the guessed cost This is the first reason for acceleratation 39 The Oracle Oracle (x): the checker, able to decide whether x>W* or not – Without knowing W* – Answer efficiently 40 Construction of Oracle(x) Dynamic Programming Only interested in whether there is a solution with cost up to x satisfying timing constraint Scale and round each buffer cost 𝑤 𝑤= 𝑥𝜀/𝑛 Perform DP to scaled problem with cost upper bound n/ɛ. Time polynomial in n/ɛ 41 Scaling and Rounding 0 xɛ/n 2xɛ/n Buffer cost 3xɛ/n 4xɛ/n 42 Scaling and Rounding Rounding error at each buffer # distinct buffer costs xɛ/n, total rounding error isxɛ. at mostxɛ/n: O(n/ε) since • Larger larger error,only fewer distinct costs faster solutions with W and bounded • Smaller xɛ/n: smaller error, by n/ɛ are propagated. more distinct costs and slower • Rounding is the second reason for acceleration 0 1 2 3 4 Buffer cost 43 Oracle Construction Run dynamic programming with cost n/ɛ Yes, there is a solution satisfying timing constraint No, no such solution With cost rounded and scaled back, the solution has cost at most n/ɛ • xɛ/n + xɛ= (1+ɛ)x > W* With cost rounded and scaled back, the solution has cost at least n/ɛ • xɛ/n = x W* 44 Rounding on Q # solutions bounded by # distinct W and Q # W = O(n/ɛ1), ɛ1 is used for W – Rounding before DP #Q – Round up Q to nearest value in {0, ɛ2T/m , 2ɛ2T/m, 3ɛ2T/m,…,T }, in branch merge (m is # sinks) – Rounding during DP – # Q = O(m/ɛ2), ɛ2 is used for Q – Rounding error bounded by ɛ2T/m per branch merge, by ɛ2T for the whole tree # non-dominated solutions is O(mn/ɛ1ɛ2) 0 ɛ2T/m 2ɛ2T/m 3ɛ2T/m 4ɛ2T/m 45 Q-W Rounding Before Branch Merge Q T 4ɛ2T/m 3ɛ2T/m 2ɛ2T/m ɛ2T/m 0 1 2 3 4 n/ɛ1 W 46 Buffer Insertion Runtime O( O( mn 1 2 n 1 ) solutions after a branch merge ) W - bins A buffer insertion At most O ( O( mnb 1 2 mn 1 2 2 n b 1 introduces O( nb 1 ) non - dominated solutions with b buffer typ es 2 n b 1 ) non - dominated solutions in a single branch 2 ) time for each node. No cross W - bin pruning. 47 Branch Merge Runtime - 1 When merging Wl=2 with Wr=1, previously we need to try quadratic # of combinations, now only linear # of combinations. Target Q=0 48 Branch Merge Runtime - 2 Target Q= ɛ2T/m 49 Branch Merge Runtime - 3 Target Q= 2ɛ2T/m 50 Branch Merge Runtime - 4 For merged W a, try all W l W r a, where each takes For a 0,1,..., Including O( mn 1 2 n 1 , total runtime time for putting ) solutions is O ( solutions mn am 2 ) time 2 2 2 1 O( ) into bins, it is O ( mn 1 2 2 n b 1 mn 2 2 2 1 ) after a branch merge 51 Timing-Cost Approximate DP Lemma: a buffering solution with cost at most (1+ɛ1)W* and with timing at most (1+ɛ2)T can be computed in time 2 O( m n 1 2 2 mn b 1 2 m n 2 1 2 2 2 mn b 1 2 3 n b 1 2 ) 52 Key 2: Geometric Sequence Based Guess U (L): upper (lower) bound on W* Naive binary search style approach Set U and L on W* x=(U+L)/2 Oracle (x) W*<(1+ɛ)x U= (1+ɛ)x W* x L= x Runtime (# iterations) depends on the initial bounds U and L 53 Adapt ɛ1 Rounding factor xɛ1/n for W Larger ɛ1: faster with rough estimation Smaller ɛ1: slower with accurate estimation Adapt ɛ1 according to U and L 54 U/L Related Scale and Round Buffer cost U/L 0 xɛ/n xɛ/n 55 Conceptually Begin with large ɛ1 and progressively reduce it (towards ɛ) according to U/L as x approaches W* Fix ɛ2=ɛ in rounding T for limiting timing violation • Set ɛ1 as a geometric sequence of …, 8, 4, 2, 1, 1/2, …, ɛ • Suppose that one run of DP takes O(n/ɛ1) time. Total runtime is bounded by the last run as O(… + n/8 + n/4 + n/2 + … + n/ɛ) = O(n/ɛ). 56 Oracle Query Till U/L<2 W u ,i W l ,i * i ' W * W l ,i 1 W * W l ,i 2 m n 2 2 O( m n 2 m n 1 2 * u ,i 1 1 i t ' i ' W * W u ,i * l ,i * ) O( 2 2 1 i t 1 / 2 ( 4 / 3 ) 2 m n m n 1 2 2 W * l ,i W * u ,i W * l ,i * W u ,i 2 ) O( 2 2 4/3 m n 2 W * W u ,t * l ,t W W * 1 i t u ,i * l ,i ( 4 / 3) t i 1 / 2 ( 4 / 3 ) t i ) j ) O( 2 * l ,i W u ,i 2 mn b 1 W W W * 0 jt u ,i * 1 i 3/4 * l ,i 2 O( 1, x * W l ,i * u ,i 1 O( * W u ,i mn b 1 2 m n 2 3 n b 1 0 . 59 0 jt 1 / 2 ( 4 / 3 ) j ) O ( 2 m n 2 ) 2 ) 57 Mathematically 58 When U/L<2 Scale and round each cost by Lɛ/n W=2n/ɛ Run DP At least one feasible solution, otherwise no solution with cost 2n/ɛ • Lɛ/n = 2L U Lɛ/n rounding error per buffer and Lɛ in a solution A single DP runtime Pick min cost solution satisfying timing at driver 59 The Algorithmic Flow Set U and L of W* Adapting ɛ1 =[U/L-1]1/2 Update U or L Set x=[UL/(1+ ɛ1)]1/2 Oracle (x) U/L<2 Compute final solution 60 Main Theorem Theorem: a (1+ ɛ) approximation to the timing constrained minimum cost buffering problem can be computed in O(m2n2b/ɛ3+ n3b2/ɛ) time for 0<ɛ<1 and in O(m2n2b/ɛ+mn2b+n3b) time for ɛ 1 61 Experiments Experimental Setup – 1000 industrial nets – 48 industrial buffer types including non-inverting buffers and inverting buffers Compared to Dynamic Programming which is the state of the art technique and is widely used in industry 62 Cost Ratio Compared to DP FPTAS FPTAS Buffer Cost Ratio 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0 0.01 0.05 0.1 0.2 0.3 0.4 0.5 Approximation 𝜀 63 Speedup Compared to DP FPTAS FPTAS 6 Speedup 5 4 3 2 1 0 0.01 0.05 0.1 0.2 0.3 0.4 0.5 Approximation 𝜀 64 Observations FPTAS always achieves the theoretical guarantee Larger ɛ leads to more speedup On average about 5x faster than dynamic programming Can run 4.6x faster with 0.57% solution degradation <5% nets with timing violations which can be fixed by a simple timing recovery procedure 65 Our Bridge NP-Hardness Complexity Exponential Time Algorithm 66 Conclusion Propose a (1+ ɛ) approximation for timing constrained minimum cost buffering for any ɛ > 0 (DAC’09) – Runs in O(m2n2b/ɛ3+ n3b2/ɛ) time – Timing-cost approximate dynamic programming – Double-ɛ geometric sequence based oracle search – 5x speedup in experiments – Few percent additional buffers as guaranteed theoretically The first provably good approximation algorithm on this problem which is a major open problem in the field 67 Thanks