xPilot A Platform-Based Behavioral Synthesis System Prof. Jason Cong Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang August, 2005 Supported by NSF, GSRC, Altera, Xilinx. Outline Motivation xPilot system framework Overview of the synthesis engine Scheduling Resource binding Experimental results 2 Motivation (1) Design Complexity is outgrowing the traditional RTL method Feasible to build SoC device with 500M transistors; Billiontransistor chips are on the horizon Behavioral synthesis a critical technology for enabling the move to higher level of abstraction Reasons for previous failures • Lack of a compelling reason: design complexity is still manageable a • • decade of ago Lack of a solid RTL foundation Lack of consideration of physical reality 3 Motivation (2) Behavioral Synthesis provides combined advantages Better complexity management • Code size: RTL design ~300KL Behavioral design 40KL [NEC, ASPDAC04] Shorter verification/simulation cycle • Simulation speed 100X faster than RTL-based method Rapid system exploration • Quick evaluation of different hardware/software boundaries • Fast exploration of multiple micro-architecture alternatives Higher quality of results • Full consideration of physical reality 4 xPilot: Platform-Based Behavioral to RTL Synthesis Flow Behavioral spec. in C/SystemC Platform description Frontend compiler FPGAs/ASICs Loop unrolling/shifting Strength reduction / Tree height reduction Bitwidth analysis Memory analysis … Core synthesis optimizations Scheduling Resource binding, e.g., functional unit binding register/port binding SSDM RTL Presynthesis optimizations Arch-generation & RTL/constraints generation Verilog/VHDL/SystemC FPGAs: Altera, Xilinx ASICs: Magma, Synopsys, … 5 System-level Synthesis Data Model SSDM (System-level Synthesis Data Model) Hierarchical netlist of concurrent processes and communication channels Each leaf process contains a sequential program which is represented by an extended LLVM IR with hardware-specific semantics • Port / IO interfaces, bit-vector manipulations, cycle-level notations 6 Platform Modeling & Characterization Target platform specification High-level resource library with delay/latency/area/power curve for various input/bitwidth configurations • Functional units: adders, ALUs, multipliers, comparators, etc. • Connectors: mux, demux, etc. • Memories: registers, synchronous memories, etc. Chip layout description • On-chip resource distributions • On-chip interconnect delay/power estimation 7 Scheduling Goals A highly versatile scheduling engine Applicable to a wide range of application domains • Computation-intensive, data/memory-intensive, control-intensive, etc. • Mixed behavioral & RTL Amenable to a rich set of scheduling constraints • Data dependency constraints • Resource constraints: IO ports constraints, memory ports constraints, • • Functional unit constraints, etc. Timing constraints: Frequency constraint, Latency constraints, etc. Relative IO timing constraints: Cycle-fixed mode, superstate-fixed mode, free-floating mode, etc. Retargetable to a variety of design objectives • High performance, small area, low power, etc. 8 Scheduling Optimization Capabilities Offers a variety of optimization techniques in a unified framework Combinational/Sequential non-pipelined/pipelined multi-cycle operation Unconditional/Conditional operation chaining Relative scheduling Considerations of branching probabilities and repetitions Multi-cycle communication (under development) Code motion & speculation (under development) Functional / loop pipelining (under development) Physical layout integration (to be supported) 9 Scheduling Current Status Design objective Focus on high-performance designs Overall approach Use a system of pairwise difference constraints to express all kinds of scheduling constraints Represent the design objective in a linear function The system is immediately solvable via any linear programming solver with integral solutions 10 Scheduling Design Framework CDFG xPilot scheduler Constraint equations generation Userspecified design constraints& assignments Relative timing constraints Dependency constraints Frequency constraints Resource constraints … Objective function generation System of pairwise difference constraints Target platform modeling (resource library & chip layout) Linear programming solver LP solution interpretation STG (State Transition Graph) 11 Example : Greatest Common Divisor GCD C description BB1 x = inport1; y = inport2; while (x != y) { if ( x > y ) x = x – y; else y = y – x; } *outport = x; x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); T BB2 x_1 = (x_0, x_1, x_2); y_1 = (y_0, y_1, y_2); cond2 = (x_1 > y_1); T BB3 T BB4 x_2 = x1 – y1; cond3 = (x_2 != y_1); y_2 = y1 – x1; cond4 = (x_1 != y_2); T BB5 x_3 = (x_0, x_1, x_2); *outport = x_3; 12 Constraints Generation Data dependency constraint Operation v is data dependent on operation u, i.e., (u, v)E s(v) – s(u) 0 where schedule variable s(v) represents the relative schedule of node v u: x_1 = (x_0, x_1, x_2); v: cond2 = (x_1 > y_1); Other constraints can be represented in a similar way … The constraint equations form a system of pairwise difference constraints Matrix A is totally unimodular Feasibility check can be formulated as a single-source shortest path problem Optimizations can be performed via any LP solver; the dual problem is equivalent to a min-cost network flow problem 13 Solution by LP Solver Scheduling are performed across the basic block boundaries BB1 x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); 0 BB2 T x_1 = (x_0, x_1, x_2); y_1 = (y_0, y_1, y_2); cond2 = (x_1 > y_1); T BB3 T BB4 x_2 = x1 – y1; cond3 = (x_2 != y_1); 1 y_2 = y1 – x1; cond4 = (x_1 != y_2); T BB5 x_3 = (x_0, x_1, x_2); *outport = x_3; 14 Schedule Interpretation x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); x_1 = (x_0, x_1, x_2); y_1 = (y_0, y_1, y_2); cond2 = (x_1 > y_1); x_2 = x1 - y1; cond3 = (x_2 != y_1); y_2 = y1 - x1; cond4 = (x_1 != y_2); x_3 = (x_0, x_1, x_2); *outport = x_3; x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); if (cond1) { x_1 = (x_0, x_1, x_2); y_1 = (y_0, y_1, y_2); cond2 = (x_1 > y_1); if (cond2) { x_2 = x1 - y1; cond3 = (x_2 != y_1); } else { y_2 = y1 - x1; cond4 = (x_1 != y_2); } } if (!cond1 || !cond3&&!cond4) { x_3 = (x_0, x_1, x_2); 15 *outport = x_3; } Deriving State Transition Graph Final STG for GCD x_0 = inport1; y_0 = inport2; cond1 = (x_0 != y_0); if (cond1) { x_1 = (x_0, x_1, x_2); y_1 = (y_0, y_1, y_2); cond2 = (x_1 > y_1); if (cond2) { x_2 = x1 - y1; cond3 = (x_2 != y_1); } else { y_2 = y1 - x1; cond4 = (x_1 != y_2); } } if (!cond1 || !cond3&&!cond4) { x_3 = (x_0, x_1, x_2); *outport = x_3; } cond3 || cond4 16 Unified Resource Binding Provides an unified resource sharing framework to optimize for various design objectives Simultaneous functional unit binding, register binding and port binding Equipped with advanced techniques to optimized the interconnect and steering logic networks Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc. Extendable to exploit physical layout information 17 An FU/Register binding Example R1 R2 R3 F1 R4 R1 R3 MUX F2 MUX R4 MUX F1 R5 Case 1 R5 Case 2 (a) R1 R1 R2 F1 R2 R2 Note: Assume all operations and variables are compatible for sharing F2 MUX F1 R3 Case 1 R3 Case 2 (b) Observations: Binding has large impact to the resulting performance and cost Functional unit and register binding are highly correlated 18 Drawbacks of Previous Work Many existing algorithms focus on functional-unit- or register“number” minimization Technology advances – interconnect effect increasing • 51% of the total dynamic power of a microprocessor in 0.13um tech. • Up to 80% of the dynamic power in future technologies May generate larger amount of multiplexers and interconnects Unfavorable performance and cost results Optimization for unrealistic goals Minimize “number” of FUs, registers, or multiplexors • Should have detailed datapath models to guide the optimization No technology specific consideration • Should have platform-specific characterizations 19 Resource Binding in xPilot STG (State Transition Graph) Baseline Register Binding Userspecified design constraints FU Allocation/Binding Register Allocation/Binding Target platform (resource library & chip layout) xPilot architecture exploration Iteration Yes Datapath model for performance-cost estimation Improved?? No STG + Best Datapath Models 20 Design Space Exploration Exploring Node 2: • (1) (2) two mul • (1, 2) one mul < …. C1 ’ 2*, 3* 3* 4* 5* 2* 4* C2 5* < C2’ 6+ Compatible Graphs A State Transition Graph (STG) power Exploring Node 4: • • • • • • • > C1 6+ 1* Exploring Node 3: • (1) (2) (3) three mul • (1, 2) (3) two mul • (1, 3) (2) two mul 1* > Exploration phases: (1) (2) (3) (4) pruned (1, 2, 4) (3) (1, 2) (3, 4) MUL MUL (1, 2) (3) (4) (1, 3, 4) (2) (1, 3) (2, 4) (1, 3) (2) (4) Datapath for solution (1, 2, 4) (3) Datapath Model delay Curve for Design 21 Space Pruning Experimental Results Benchmark Suite Benchmark suite PR, MCM: • DSP kernels: pure additions/subtractions and multiplications CACHE • Cache controller: control-intensive designs with cycle-accurate I/O operations MOTION: • Motion compensation algorithm for MPEG-1 decoder: control-intensive with modest amount of computations IDCT: • JPEG inverse discrete cosine transform: computation intensive DWT: • JPEG2000 discrete wavelet transform: computation intensive with modest control flow EDGELOOP: • Extracted from H.264 decoder: a very complex design, features a mix of computation, control, and memory accesses 22 Experimental Results Code Size Reduction 23 Experimental Results Comparison with SPARK On Scheduling SPARK [UCI/UCSD, 2004], a state of the art academic highlevel synthesis tool Synthesis Designs MOTION PR IDCT CACHE Tool/Flow Altera Quartus II report Report state# reg# fmax (MHz) LE register mem dsp spark 13 18 170.8 666 367 0 4 xpilot 24 11 161.2 888 266 0 4 spark 13 36 130.6 508 491 0 32 xpilot 13 40 178.7 1,349 783 0 0 spark 176 ~400 72.01 10,847 4,547 0 138 xpilot 141 413 105.53 11,481 5,627 0 64 xpilot-mem 334 451 162.9 9,351 6,098 1,024 64 spark xpilot-mem Memory unsupported 47 16 161.6 371 265 3072 0 24 Experimental Results Comparison with SPARK On Binding SPARK xPilot LE COMB LonelyReg CombReg DSP (MHz) LE COMB LonelyReg CombReg DSP (MHz) Fmax Ratio xPilot/ SPARK PR 1108 815 0 293 0 123.53 1349 713 84 552 0 178.7 1.45 WANG 1217 942 0 275 0 118.89 1105 527 62 516 8 166.11 1.40 LEE 1367 1052 0 315 0 119.32 1585 691 207 687 4 166.61 1.40 MCM 2808 2248 0 560 0 74.87 2402 981 73 1348 0 152.56 2.04 DIR 2425 2034 0 391 6 69.38 3489 1752 110 1627 4 146.8 2.12 FEIG 16170 13136 0 3034 6 37.17 10539 2295 240 8004 4 173.49 4.67 Total 25095 20227 0 4868 12 543.16 20469 6959 776 12734 20 984.27 1.81 Ave Ratio 1 1 1 1 1 1 1.17 0.65 n/a* 2.96 n/a* 2.48 2.48 Resource Usage Designs Fmax Resource Usage Fmax On average, xPilot resource binding achieves designs with similar area, and 2.48x higher frequency over Spark 25 Synthesis Results for DWT (JPEG2000) Settings Target platform: Altera Stratix RTL synthesis & place-and-route: Altera QuartusII v5.0 Simulation: Mentor ModelSim SE6.0 Design alternatives Target cycle time State# fmax(MHz) Cycle# Latency (ns) LE# DSP# 9ns 34 123.56 4830 39.1 1777 128 7ns 36 147.28 5211 35.4 1862 128 5.5ns 51 183.62 6926 37.8 1926 128 26 Experimental Results: ASIC Flow Magma RTL to GDSII flow Technology Tradeoff library: Cadence Generic Standard Cell Library 0.18um study: 1st column: delay constraint enforced in xPilot 2nd column: control step count of xPilot generated RTL 3rd-5th column: data reported after mapping by Magma tool DIR State # Cell count Area(u2) Delay(ps) Fmax(MHz) Latency(ps) 5ns 5 17555 1256584 2111 473.71 10555 10ns 3 23077 1332203 2139 467.51 6417 15ns 2 28381 1486487 2181 458.51 4362 20ns 2 27189 1394451 2514 397.77 5028 30ns 1 27797 1401642 2725 366.97 2725 27 Experimental Results: ASIC Flow (cont.) LEE State# Cell count Area(u2) Delay(ps) Fmax(MHz) Latency(ps) 5ns 8 8242 509807 2066 484.03 16528 10ns 4 15989 708870 2254 443.66 9016 15ns 2 16698 703381 3423 292.14 6846 20ns 2 15256 656147 4226 236.63 8452 30ns 1 16085 697363 5070 197.24 5070 State# Cell count Motion Area(u2) Delay(ps) Fmax(MHz) Latency(ps) 10ns 35 16474 909721 2107 474.61 73745 15ns 30 15695 847262 2358 424.09 70740 20ns 28 16463 867898 2498 400.32 69944 30ns 28 15807 852573 2563 390.17 71764 28