xPilot: A Platform-Based System-Level Synthesis for Reconfigurable SOCs
Prof. Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department

Motivation
• Design complexity is outgrowing the traditional RTL method even in current CMOS technologies
• Nanotechnology will enable a 10-100x increase in device density and degree of integration
• Need to enable a higher level of design abstraction
    • Start from behavior descriptions (e.g. C or SystemC)
    • Use and/or re-use more complex functional units (e.g. processor cores instead of standard cells)

ESL Tools – A Lot of Interest …

xPilot: Platform-Based Synthesis System
[Figure: xPilot flow. Inputs: SystemC/C; Platform Description & Constraints. xPilot Front End; Profiling, Analysis, and Mapping; SSDM (System-Level Synthesis Data Model). Synthesis engines and their outputs: Processor & Architecture Synthesis (Processor Cores + Executables); Interface Synthesis (Drivers + Glue Logic); Behavioral Synthesis (Custom Logic). Target: FPSoC.]
Uniqueness of xPilot
• Platform-based synthesis and optimization
• Communication-centric synthesis with interconnect optimization

xPilot: Behavioral-to-RTL Synthesis Flow
Inputs: behavioral spec. in C/SystemC; platform description
Frontend compiler -> SSDM
Presynthesis optimizations
• Loop unrolling/shifting
• Strength reduction / tree height reduction
• Bitwidth analysis
• Memory analysis …
Core synthesis optimizations
• Scheduling
• Resource binding, e.g., functional unit binding, register/port binding
Arch-generation & RTL/constraints generation
• Verilog/VHDL/SystemC
• FPGAs: Altera, Xilinx
• ASICs: Magma, Synopsys, …
Output: RTL + constraints -> FPGAs/ASICs

System-Level Exploration Using xPilot for Heterogeneous MPSoC Platforms
Heterogeneous MPSoC exploration
• Processors
    • Heterogeneous vs. homogeneous
    • General-purpose vs. application-specific
• On-chip communication architecture (OCA)
    • Bus (e.g. AMBA, CoreConnect), packet-switching network (e.g. Alpha 21364)
• Memory hierarchy
[Figure: MPSoC platform. Processing nodes (μPs running tasks, OS, and drivers; IP; FPGA; DSP) attach through Network Interfaces to a Communication Network.]

Outline
• xPilot Overview
    • Behavior-level synthesis in xPilot
    • System-level synthesis in xPilot
• Recent Progress in xPilot
    • Interface synthesis
    • Resource binding based on distributed register architecture
• Conclusions

Advantage of Behavior Synthesis
• Shorter verification/simulation cycle
• Better complexity management, faster time to market
• Rapid system exploration
    • Quick evaluation of different hardware/software boundaries
    • Fast exploration of multiple micro-architecture alternatives
• Higher quality of results
    • Platform-based synthesis & optimization
    • Full consideration of physical reality

Example: Better Complexity Management
• Shorter verification/simulation cycle
    • Simulation speed 100X faster than RTL-based method [NEC, ASPDAC04]
• Significant code size reduction
    • RTL design ~300KL vs. behavioral design 40KL [NEC, ASPDAC04]
    • VHDL code generated by UCLA xPilot targeting the Altera Stratix platform
    • Over 10x code size reduction can be achieved

Unique Features of xPilot (1): Platform-based Synthesis & Optimization
Platform-based synthesis & optimization
• The quality of an RTL design is platform-dependent
• Designers often lack complete and detailed knowledge of the target platform

Resource          Area           Delay (ns)
ADDSUB-24b        25 LUTs        2.27
ADDSUB-32b        33 LUTs        2.61
MUX8to1-24b       120 LUTs       2.92
MUX16to1-24b      264 LUTs       4.658
DSPMUL-18bx18b    2 DSP Blocks   3.833
DSPMUL-24bx24b    8 DSP Blocks   7.688

Platform: Altera Stratix
RTL synthesis & place-and-route: Altera QuartusII v5.0

3x3 delay matrix (corner labels (0,0) and (95,61)):
0.58   1.8   2.8
2.0    2.9   3.7
2.8    3.8   4.7

Unique Features of xPilot (2): Communication-Centric Synthesis & Optimization
• System performance & power is dominated by interconnect
• It is difficult for designers to consider physical layout at the RT level
• Layout-aware performance optimization
    • Overlap computation with communication
• Layout-aware power optimization
[Figure: CDFG with a conditional (T/F branches, comparisons > and <), adders add1 and add2, multiplications 2* to 6*, and a data transfer, bound onto multipliers mul1 and mul2.]
• Binding solution 1: mul1 (2,4,5), mul2 (3,6). Both multipliers stay active.
• Binding solution 2: mul1 (2,5,6), mul2 (3,4). mul2 can be powered off when the false branch is taken.

Unique Features of xPilot (3): Highly Scalable and Optimized Synthesis Algorithms
Use of highly scalable and optimized synthesis algorithms for best quality of results
• Interface synthesis: simultaneous data and communication scheduling for latency minimization
• Scheduling: a unified framework for multi-constraint and multi-objective scheduling based on the system of difference constraints (SDC)
• Resource binding: use of distributed register architectures for interconnect/communication optimization
• Power optimization: optimal functional module and voltage binding
• …

Behavior and Communication Co-Optimization for Systems with SCM
• SCM: Sequential Communication Media
    • FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
    • Data must be read and written in the same order
• Order may have a dramatic impact on performance
    • The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission

Code that fills data[8], which is sent over channel C:
    for (int i = 0; i < 8; i++) {
    S1: data[i] = …;
    }
Code that reads data[8] from channel C:
    int s07 = data[0] + data[7];
    int s16 = data[1] + data[6];
    …
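To see why the transmission order matters for the DCT fragment above, here is a minimal sketch of a toy timing model. Everything in it is an assumption made for illustration and is not taken from xPilot: the producer pushes one item per cycle into the FIFO, the consumer has a single adder that handles its sums in program order at one cycle each, and the pairs (2,5) and (3,4) stand in for the statements the slide elides. The function name `consumer_latency` is hypothetical.

```python
def consumer_latency(send_order, pairs):
    """Return the cycle at which the consumer's last addition finishes.

    Toy model (all assumptions): item k of send_order becomes readable at
    time k + 1, and one adder processes `pairs` in program order, taking
    one cycle per addition.
    """
    arrival = {item: k + 1 for k, item in enumerate(send_order)}
    t = 0  # time at which the adder becomes free
    for a, b in pairs:
        # An addition starts once the adder is free and both operands arrived.
        start = max(t, arrival[a], arrival[b])
        t = start + 1
    return t


# (0,7) and (1,6) come from the slide; (2,5) and (3,4) are assumed.
PAIRS = [(0, 7), (1, 6), (2, 5), (3, 4)]

natural = list(range(8))                          # data[0], data[1], ..., data[7]
reordered = [i for pair in PAIRS for i in pair]   # 0, 7, 1, 6, 2, 5, 3, 4

print(consumer_latency(natural, PAIRS))
print(consumer_latency(reordered, PAIRS))
```

Under this model the natural order finishes at cycle 12, because the first sum has to wait for data[7], which is sent last. Sending the operands pair by pair finishes at cycle 9, since every addition can start as soon as its second operand arrives.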
[Figure: DCT example. Labels: P1, P2, FIFO, Custom Logic 1, Custom Logic 2, PE1, PE2.]

SCM Co-Optimization Problem Formulation
Given:
• A set of processes P connected by a set of channels C
• A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj
Goal:
• Find the optimal transmission order of each process, so that the overall latency of the process network is minimized subject to the given design constraints and platform specifications
• At the same time, generate the drivers and glue logic for each process automatically

Proposed SCM Co-Optimization Design Flow
[Figure: design flow. Inputs: Process Network; Platform Description & Constraints. Front End -> System-Level Synthesis Data Model -> SCOOP (SCM CO-Optimization) -> Drivers + Glue Logic; Process Behavior.]
SCOOP steps:
• Communication order detection
• Code transformation and interface generation
• Indices compression for loop reordering

Communication Order Detection
Step 1. Construct a global CDFG by merging the individual CDFGs of each process
Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG
[Figure: Process 1 and Process 2 with + and * operations and FIFO transfers T1, T2, T3 (Ti: FIFO), shown for two transmission orders. Latency = 5 cycles vs. latency = 7 cycles.]

Loop Indices Compression
• Given the optimal order, we try to generate restructured loops for code compression
• i.e., given the original iteration order and the reordered iteration order, find the minimum number of linear intervals that represent the new iteration space
Original order: (0,0), (0,1), (1,0), (1,1)
After reordering: (0,0), (1,0), (0,1), (1,1)
Need to solve the linear system
$$\begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{pmatrix} \begin{pmatrix} i \\ j \\ 1 \end{pmatrix}$$
Solution: i' = j, j' = i (that is, a_1 = 0, b_1 = 1, c_1 = 0 and a_2 = 1, b_2 = 0, c_2 = 0)

Preliminary Experimental Results
Experimental setting
• Target communication model: two-process producer-consumer model
• Behavioral synthesizer: UCLA xPilot
• RTL simulator: Mentor ModelSim

           Total latency (cycles)            RAs Compress
Design     Trad.    SCOOP    Reduction      Before    After
DCT1       325      290      10.77%         0         0
Haar       142      134      5.63%          0         0
DWT        689      617      10.45%         0         0
Mat_mul    408      339      16.91%         96        20
DCT2       483      419      13.25%         80        64
Masking    620      420      32.26%         192       0
Dot        1903     1084     43.04%         300       0

An average of 26% improvement in total latency can be achieved.

Advantage of Register-File Microarchitectures
[Figure: (a) A scheduled DFG with register binding indicated on each variable. (b) Binding using discrete registers. (c) Binding using a register file.]

Distributed Register-File Microarchitecture
[Figure: FP-SoC partitioned into islands (Island A, Island B, Island C). Island labels: Local Register File (on-chip memory blocks), Data-Routing Logic, Input Buffers, MUX, Functional Unit Pool (FUP) with units such as MUL, ALU, ALU'.]

On-chip RAM resource (Virtex II and Stratix)

Xilinx XC-2V      2000    3000    4000    6000     8000
#18Kb BRAM        56      96      120     144      168
Dist. RAM (Kb)    336     448     720     1,056    1,456

Altera EP1        S25     S30     S40     S60      S80
#M512 (512b)      224     295     384     574      767
#M4K (4Kb)        138     171     183     292      364
#M-RAM (512Kb)    2       4       4       6        9

Resource Binding for DRFM
Facts under simplified assumptions
• Operations bound onto an island form a chain in the given scheduled DFG
• Inter-chain data transfers may share a physical inter-island connection
• The number of inter-island connections is crucial to the QoR of a DRFM instance
[Figure: scheduled DFG with operations v1 to v10 over control steps 1 to 4, bound onto islands A, B, C, D.]
Inter-island connections in the example:
• (A,B) = (A,D) = 1
• (A,C) = 1, two data transfers share one connection
• (C,D) = 2

Resource Binding Problem for DRFM
• General DRFM binding problem: given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the quality of B is optimized.
    • Hard to characterize the quality of binding solution B
    • The problem is too ad hoc
• Relaxed problem, DRFM Binding for Minimizing Inter-Island Connections: given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the total number of inter-island connections of B is minimized.
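To make the relaxed objective concrete, the sketch below counts the inter-island connections that a given binding needs. It is an illustration under stated assumptions, not xPilot's cost function: connections are treated as directed island pairs, and transfers between the same pair are assumed to share one physical connection only when they happen in different control steps. The function name `inter_island_connections`, the operation names, and the example data are all hypothetical.

```python
from collections import defaultdict


def inter_island_connections(transfers, binding):
    """Count the connections needed per (source island, destination island).

    transfers: list of (src_op, dst_op, step), where step is the control
               step in which the value moves from src_op to dst_op.
    binding:   dict mapping each operation to an island name.
    Assumption: transfers on the same island pair share a connection only
    if they occur in different control steps, so the count for a pair is
    the largest number of transfers in any single step.
    """
    per_step = defaultdict(lambda: defaultdict(int))
    for src, dst, step in transfers:
        a, b = binding[src], binding[dst]
        if a != b:  # transfers inside one island need no inter-island wire
            per_step[(a, b)][step] += 1
    return {pair: max(steps.values()) for pair, steps in per_step.items()}


# Hypothetical binding and transfers, chosen to echo the slide's cases.
binding = {"v1": "A", "v2": "A", "v3": "C", "v4": "C", "v5": "D"}
transfers = [
    ("v1", "v3", 2),  # A -> C in step 2
    ("v2", "v4", 3),  # A -> C in step 3, can reuse the same connection
    ("v3", "v5", 4),  # C -> D in step 4
    ("v4", "v5", 4),  # C -> D in step 4 as well, needs a second connection
]

connections = inter_island_connections(transfers, binding)
print(connections, sum(connections.values()))
```

Here (A,C) needs one connection because its two transfers fall in different steps, while (C,D) needs two, for a total of three. A binding algorithm for the relaxed problem would search for the binding that minimizes this total.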
• Solution: bind one control step at a time, using min-cost bipartite matching

Three Experimental Flows for Comparison
[Figure: xPilot behavioral synthesis system. xPilot Frontend -> SSDM/CDFG -> Scheduling algorithms -> Scheduled CDFG (STG) -> one of the three binding flows below -> RTL generation -> Xilinx Virtex II.]
1) Binding on Discrete-Register Microarchitecture
2) Baseline (Random) DRFM Binding
3) DRFM Binding for Minimizing Inter-Island Connections

Experimental Results
• Xilinx ISE 7.1; Virtex II; target clock period: 8ns
• The baseline DRFM binding achieves a 46.70% slice reduction over the discrete-register approach
• Optimized DRFM binding reduces slices by a further 12.21%
• Overall, more than 2X logic slice reduction with a better clock period (7.8%)
[Figure: two bar charts comparing Discrete-Reg, DRF-Random, and DRF-Opt on the benchmarks PR, LEE, CHEN, and DIR. Left: area (slices; DRF solutions use on-chip RAM blocks). Right: clock period (ns).]

Conclusions
• xPilot can automatically synthesize behavior-level C or SystemC representations to RTL code with the necessary design constraints
• Platform-based synthesis with physical planning provides
    • Shorter verification/simulation cycle
    • Better complexity management, faster time to market
    • Rapid system exploration
    • Higher quality of results
• xPilot can help to explore the efficient use of (multiple) on-chip processors
• xPilot can efficiently optimize the software for reconfigurable processors
• We are interested in engaging with selected industrial partners to further validate and enhance the technology

Acknowledgements
We would like to thank the following for their support:
• National Science Foundation (NSF)
• Gigascale Systems Research Center (GSRC)
• Semiconductor Research Corporation (SRC)
• Industrial sponsors under the California MICRO programs (Altera, Xilinx)
Team members: Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang