Development and Application of Tree Synthesis Algorithms
John Lillis
University of Illinois Chicago

Overview
• Part I: Buffer tree synthesis
    • Formulations
    • S/P/SP-tree
• Part II: Fanin tree embedding/replication
    • Optimization across gate boundaries
    • Interaction with placement

Part I: Buffer Tree Synthesis

Premises of Work
• MAIN PREMISE: Powerful Buffer Tree Synthesis is a Core for Modern Design
• Conservation of Resources Crucial
    • Estimate: 700-800K Buffers/Chip in Near Future
    • Cost-Performance Tradeoffs
    • General Cost Model
• Topology / Embedding / Buffering Spaces Should be Explored Simultaneously
    • 2-Phase Approach Not Robust / Predictable
    • Particularly Troublesome in Presence of Blockages

Max Slack Weakness
• Overoptimized subtrees
[Figure: slack vs. cost]

Problem Formulation
• Given:
    • Location of Driver and Sinks
    • Technology Parameters
    • Timing Requirements
    • Buffer Library
    • Target Routing Graph (Blockages)
• Find:
    • Topology in corresponding space, its Embedding and Buffer Assignment
    • Minimizing Cost s.t. Timing Constraints

Philosophy of Constraint Imposition
• Goals:
    • Predictable Behavior
    • Absence of ad-hoc Heuristics
• Main Idea: Optimally Solve Constrained Variant of the Problem
• Well-Designed Constraints Produce:
    • Large Flexible Solution Space
    • Tractability

Constraints: Topology Space
[Figure: full space vs. constrained space]

Topology Embedding Flexibility
[Figure: alternative embeddings of a topology with driver s and sinks a, b, c]

Target Routing Graph Construction
[Figure: routing graph with driver s and sinks a, b, c, showing a routing blockage and a buffer blockage]

Algorithmic Description
• Timing-Driven Maze Routing
• Topology Embedding
• S-Tree
• P-Tree
• SP-Tree

Core Subroutine: Timing-Driven Maze Routing
• Generalization of [Hur et al.; TCAD Feb 2000]
• Single Target, Multiple Sources
• Finds non-dominated paths
• Simultaneous Buffer Insertion
• Handling of Blockages in Topology Synthesis
[Figure: one target and several sources]

Topology Embedding
• Goal: Obtain timing-feasible embedding / buffering of a given topology, minimizing cost
• Solution: Dynamic Programming (bottom-up)
• Solution sets: A(u,v) represents the set of solutions that correspond to
    • Vertex u in Topology
    • Vertex v in Target Graph
• A1b = Join(A1.left, A1.right)
• A1 = GenDijkstra(A1b)
[Figure: topology vertex u mapped to target graph vertex v, labeled A(u,v)]

S-Tree
• Notion of localities:
    • Spatial
    • Temporal
    • Polarity
• Partition sinks into 2 sets based on:
    • estimated timing criticality
    • signal polarity requirements
    • some other criteria...
• Subtrees can break the topology and "stitch" at a different place

S-Tree Topology Space
• Sink partition: {a,c,d} {b}
[Figure: topologies over driver s and sinks a, b, c, d for this partition]

S-Tree Recurrence
• A1b = Join(A1.left, A1.right)
• A1 = GenDijkstra(A1b)
• A2b = Join(A2.left, A2.right)
• A2 = GenDijkstra(A2b)
• A12b = Join(A12.left, A12.right) + Join(A1, A2)
• A12 = GenDijkstra(A12b)

S-Tree Topology Space
• Initial topology
[Figure: initial topology and derived topologies over driver s and sinks a, b, c, d, e, f]

Incorporating polarity
• 4 sets:
    • critical & positive signal polarity
    • critical & negative
    • non-critical & positive
    • non-critical & negative
• Other partitioning schemes...

P-Tree Topology Space
• All Permutation-Constrained Topologies
[Figure: topologies over driver s and the sink order a, b, c, d, e]

Limitations of P-Tree Space
• Isolation of Critical / Non-Critical Subtrees: "Temporal-Locality"
• Min WL May Not Produce Min Cost
[Figure: two trees from the driver to critical and non-critical sinks]

SP-Tree
• Combine everything said so far...
• From P-Tree
    • Spatial locality
    • Robustness
• From S-Tree
    • Temporal locality
    • Polarity locality
    • Ability to fix "topology problems" by "stitching"

Solution Space
[Figure: solution spaces labeled Entire space, SP-Tree, P-Tree, S-Tree, Fixed topo.]

Experiments
• Randomly generated nets
    • Non-uniform required arrival time
    • Non-uniform sink input capacitance
• Buffer-biased cost
• Interested in:
    • Min cost feasible solution
    • Max slack solution for verification
    • Runtime
• More details in the paper...

Algorithms for Experiments
• S-Tree
• P-Tree
• SP-Tree
• RMP [Cong, Yuan; DAC 2000]
• RMP-Quick [Cong, Yuan; DAC 2000]

Results: Net2-06
[Chart: RMP, RMP-Qck, S-Tree, P-Tree and SP-Tree compared for the min cost feasible and max slack solutions; axis labels: Wire, Buf, Cost, Slack, Max Slack, Runtime, # buffers]

Results: Net2-08
[Chart: same layout as Net2-06]

Results: Net2-12
[Chart: same layout as Net2-06]

SP-Tree vs. P-Tree

Conclusions
• Key Concepts:
    • General Cost Models
        • Routing Congestion
        • Buffer Congestion
    • Orthogonal Separation of Spatial and Temporal Locality
    • Polarity Requirements
    • Routing and Buffer Blockages
• Targets: Small-to-Medium Sized Signal Nets
• Results Summary
    • Highly Cost-Efficient, High Performance Solutions
    • Substantially Outperforms Prior Approaches in Solution Quality and Runtime

Part II: Fanin Tree Embedding/Replication

Replication Overview
• Hrkic, Lillis, Beraudo (DAC04, IWLS04)
• Concept: Netlist structure limits potential of timing-driven placement
• Difficult for top-down synthesis to fix
• Main issue: inherently non-monotone paths
• Approach (Hrkic, Lillis; DAC04) touches on placement, synthesis (netlist perturbation) and routing.
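The "non-monotone paths" issue named in the Replication Overview can be made concrete with a small sketch. It assumes a path is given as placed cell coordinates and that distance is Manhattan; under that assumption a path is monotone exactly when its hop lengths add up to the distance between its endpoints. The function names (`detour`, `is_monotone`, `local_violations`) are illustrative and not taken from the cited papers.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) location of a placed cell


def manhattan(a: Point, b: Point) -> int:
    """Manhattan distance between two placed cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def detour(path: List[Point]) -> int:
    """Extra length of the path over the straight endpoint-to-endpoint distance."""
    hops = sum(manhattan(p, q) for p, q in zip(path, path[1:]))
    return hops - manhattan(path[0], path[-1])


def is_monotone(path: List[Point]) -> bool:
    """A path is monotone when it never moves away from its endpoint."""
    return detour(path) == 0


def local_violations(path: List[Point]) -> List[int]:
    """Indices of middle cells whose 3-cell window (prev, cell, next) has a detour."""
    return [
        i
        for i in range(1, len(path) - 1)
        if detour(path[i - 1 : i + 2]) > 0
    ]


# A path that doubles back in x: both interior cells violate local monotonicity.
zigzag = [(0, 0), (2, 1), (1, 3), (4, 4)]
print(detour(zigzag), local_violations(zigzag))

# Every 3-cell window is monotone, yet the whole path still has a detour.
u_turn = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(local_violations(u_turn), is_monotone(u_turn))
```

The second example shows why checking 3-cell windows alone is not sufficient: each window is monotone, but the path as a whole is not.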
Logic Replication
• Duplicate logic cell
• Preserve functionality
• Improve timing
    • Place / Move cells
    • Adjust connections
[Figure: netlist with cells A, B, C, D, E, shown again with replica CR added]

Early Work
• Use replication to straighten I/O paths
• Local monotonicity [Beraudo, Lillis, DAC 2003]
    • Sequence of 3 cells on the path
• Incremental framework
[Figure: cells A, B, C, D, E, shown again with replica CR added]

Limitations of Local Monotonicity
• Local monotonicity satisfied
• Still many non-monotone paths
[Figure: cells A, B, C, D, E, F]

Replication Tree Approach [Hrkic et al., DAC04]
• Identify critical sink
• Extract critical fan-in tree (Replication Tree)
• Optimize fan-in tree (Fan-in Tree Embedding)
• Legalize placement

Slowest Paths Tree
• Focus on slowest paths
• Find slowest paths tree from critical sink
• Include paths within epsilon of current critical delay
• Focus on most critical portions of fan-in cone

Replication Tree
• Most circuits do not contain large fan-in trees due to reconvergence
• Given a critical tree, temporarily replicate the entire tree
• Assign connections:
    • if (u,v) is a tree edge, connect uR to vR
    • else connect u to vR
[Figure: cells A, B, C, D, E, F with replicas AR, BR, CR, DR, FR]

Placement cost
• Replication is temporary
• Placement cost is crucial
• Cost discount for placing a cell over its logical equivalent
    • low cost for placing DR over D
    • in that case actual replication never occurs
    • multiple low-cost locations possible
[Figure: cells A, B, C, D, E, F with replicas AR, BR, CR, DR, FR]

Fan-in Tree Embedding
• Given:
    • Fan-in tree
    • Placement of sink and inputs
    • Arrival times at inputs
    • Placement and routing graph
• Find:
    • Placement of internal tree nodes (Gates)
    • Minimizing Cost s.t. Timing Constraints
    • cost / delay tradeoff

Fan-in Tree Embedding Example
[Figure: two embeddings of gates A, B, C and the sink; one with higher delay and lower cost, one with lower delay and higher cost]

Fan-out and Fan-in Tree
[Figure: a fan-out tree from the source (bottom-up) and a fan-in tree to the sink (top-down), each with nodes A, B, C]

Fan-in Tree Embedding
• Adaptation of S-Tree algorithm [Hrkic, Lillis, DAC 2002]
• Keep:
    • Graph Model for Embedding Target
    • Modified Timing-Driven Maze Routing [S. Hur, J. Lillis, IEEE TCAD 2000]
        • multiple sources, multiple targets
        • at each vertex keep a list of non-dominated solutions
• Modify:
    • Top-down vs. Bottom-up
    • Solution signature (c,t): c is cost, t is signal arrival time
    • Gate placement cost p(x,y)

Fan-in Tree Embedding
• Non-binary tree: multiple gate inputs
• Top-Down Dynamic Programming
    • Maze Routing to populate solutions
        • deferred backtracking
    • Join Solutions
        • Modified maze routing
        • $c = p(x,y) + c_1 + \dots + c_n$
        • $t = \max(t_1, \dots, t_n)$
• Bottom-Up solution extraction
    • backtrack to extract maze route
    • extract gate placement
[Figure: Join]

Aside: Legalization
• Use Modified Gain-Graph approach [Hur, Lillis; ICCAD00]
• Modified to incorporate timing information

Optimization Flow
• Identify critical sink (static timing analysis)
• Extract Fan-in Tree
    • Replication Tree
    • epsilon-Slowest Paths Tree
• Embed Fan-in Tree
• Decide which cells to Replicate / Unify
• Legalize placement
• Repeat while there is improvement

Enhancements
• Post-process unification
    • some cells are placed close to their logical equivalents
    • no automatic unification
    • if one of the paths is non-critical, it is possible to unify without degrading performance
• Unification in legalizer
    • during ripple-move a cell may be placed on top of its replica
    • unify them and stop legalization
• epsilon-Slowest Paths Tree
    • no randomization
    • dynamically modify the value of epsilon to enlarge the fan-in cone

Experiments
• Algorithms
    • Timing-Driven VPR (Versatile Place and Route) [http://www.eecg.toronto.edu/~vaughn/vpr/vpr.html]
    • Local Replication [Beraudo, Lillis, DAC-03]
    • RT-Embedding
• 20 MCNC Benchmark Circuits
• Interested in:
    • Critical delay
    • Amount of replication
    • Wire usage
• Tests performed in FPGA domain
• Promising results

Experimental Setup
• Obtain valid placement with Timing-Driven VPR placer
• Local Replication
• Replication Tree Embedding
• Route and Evaluate with Timing-Driven VPR router

Results
Average values over all 20 circuits, normalized to VPR

              critical path delay   critical path delay   wire
              (W inf)               (W low-stress)        length   blocks
Local Repl    0.925                 0.927                 1.020    1.003
RT-Embed      0.858                 0.869                 1.084    1.004

• Delay improved for all circuits
• Best improvement for circuit pdc: 0.641
• Runtime penalty under 5% on the VPR flow

Replication Statistics
• Circuit ex1010: 38 replications, 12 unifications

Ongoing Work
• Generalize to ASICs
• Include simultaneous buffering
• Mitigation of legalization noise
    • Preventing (some) overlaps in embedding
    • More sophisticated placement cost
• Reconvergence - arborescence approach
• Simultaneous technology (re-)mapping
    • Explore multiple Tree Topologies simultaneously (Universal Tree solver engine: U-Tree)

Review
• Trees are everywhere!
    • Even in places where they seem to be absent
• Tree-based algorithms can be very strong in generality of formulation and predictability
    • Enable connection to general placement/routing target
    • Can capture tradeoffs between complex objectives
• Can sometimes be applied to drive optimization of graph structures.

References: http://cs.uic.edu/~jlillis/pubs.html
S/P/SP-tree executables: http://eda.cs.uic.edu/software.html

Thank you

Timing-Driven Placement Legalization
• After embedding, cells could overlap in the placement
• Moving cells on critical path may harm timing
• Ripple-move strategy [Hur, Lillis, ICCAD 2000]
    • Modified to include both timing and wiring information
• Identify overlap
• Identify up to 4 closest empty slots (one in each quadrant)
• Construct gain graph
    • monotone paths from congested to free slots
    • edges: gain of moving a cell to a neighboring slot
    • wire and timing gain
• Find max-gain path and perform ripple-move
    • gain could be negative
[Figure: placement grid with an overlap slot and the nearby empty slots]
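The gain-graph step of the legalization procedure can be sketched as a small dynamic program. This is only an illustration under stated assumptions, not the method of [Hur, Lillis, ICCAD 2000]: slots are integer grid coordinates, a monotone path takes unit steps toward the empty slot, and `gain(u, v)` is a hypothetical caller-supplied function returning the combined wire and timing gain of moving the cell in slot `u` to neighboring slot `v`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Slot = Tuple[int, int]  # (x, y) grid coordinate of a placement slot


def max_gain_path(
    src: Slot, dst: Slot, gain: Callable[[Slot, Slot], float]
) -> Tuple[float, List[Slot]]:
    """Best monotone path from the overlap slot `src` to the empty slot `dst`."""
    sx = 1 if dst[0] >= src[0] else -1  # step direction in x
    sy = 1 if dst[1] >= src[1] else -1  # step direction in y
    # best[slot] = (total gain of best path src -> slot, predecessor slot)
    best: Dict[Slot, Tuple[float, Optional[Slot]]] = {src: (0.0, None)}
    for x in range(src[0], dst[0] + sx, sx):
        for y in range(src[1], dst[1] + sy, sy):
            u = (x, y)
            if u == src:
                continue
            cands = []
            if x != src[0]:  # arrive by a step in x
                p = (x - sx, y)
                cands.append((best[p][0] + gain(p, u), p))
            if y != src[1]:  # arrive by a step in y
                p = (x, y - sy)
                cands.append((best[p][0] + gain(p, u), p))
            best[u] = max(cands, key=lambda c: c[0])
    # Backtrack from the empty slot to the overlap slot.
    path: List[Slot] = []
    node: Optional[Slot] = dst
    while node is not None:
        path.append(node)
        node = best[node][1]
    path.reverse()
    return best[dst][0], path


def ripple_move(occupancy: Dict[Slot, List[str]], path: List[Slot]) -> None:
    """Shift one cell from each slot on the path into the next slot, in place."""
    for i in reversed(range(len(path) - 1)):
        cell = occupancy[path[i]].pop()
        occupancy[path[i + 1]].append(cell)


# Hypothetical gains: one favorable move, every other move costs 1.
def example_gain(u: Slot, v: Slot) -> float:
    return 0.5 if (u, v) == ((0, 0), (0, 1)) else -1.0


total, path = max_gain_path((0, 0), (2, 1), example_gain)
occ = {(0, 0): ["a", "b"], (0, 1): ["c"], (1, 1): ["d"], (2, 1): []}
ripple_move(occ, path)
print(total, path, occ)
```

With up to four candidate empty slots, one per quadrant, the same routine would be run once per candidate and the path with the largest total gain kept. As in the example, the best total can still be negative.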