Platform-Based Behavior-Level and System-Level Synthesis
Prof. Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department

Outline
- Motivation
- xPilot system framework
- Behavior-level synthesis in xPilot
  • Advantages of behavioral synthesis
  • Scheduling
  • Resource binding
- System-level synthesis in xPilot
  • Synthesis for ASIP platforms
  • Design exploration for heterogeneous MPSoCs
- Conclusions

ASIC SOC Example: Philips Nexperia
[Figure: Philips Nexperia SoC platform for high-end digital video (DVP system silicon). A MIPS CPU (PRxxxx) and a TriMedia CPU (TM-xxxx), each with I$ and D$, connect to device IP blocks over PI buses; SDRAM and the MMI sit on the DVP memory bus. Other blocks are labeled MPEG, VIDEO, MSP, and ACCESS CTL. Courtesy Philips]
- General-purpose scalable RISC processor (MIPS): 50 to 300+ MHz; 32-bit or 64-bit
- Scalable VLIW media processor (TriMedia): 100 to 300+ MHz; 32-bit or 64-bit
- Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB, …
- Nexperia system buses: 32-128 bit

Field-Programmable SOC Example: Xilinx Virtex-4 FPGA
[Figure: Virtex-4 FPGA with PowerPC and MicroBlaze processors, IP blocks, and H.264/AVC hardware blocks on the IBM CoreConnect bus. Courtesy Xilinx]
- MicroBlaze soft-core processor: 180 MHz, 166 DMIPS, < ~1300 LUTs
- PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)

IC Design Steps
[Figure: IC design flow with stages labeled System-Level Specification, Behavior-Level Description, RT-Level Description, Generic Logic Description, Gate/Circuit Design, Placed & Routed Design, Fabrication, and Packaging, and transformations labeled Synthesis, Technology Mapping, and Physical Design. ©Sherwani]
- Example logic equations shown in the figure:
  • X = (AB*CD) + (A+D) + (A(B+C))
  • Y = (A(B+C) + AC + D + A(BC+D))

xPilot: Platform-Based Synthesis System
[Figure: xPilot flow. Inputs are SystemC/C and Platform Description & Constraints. The xPilot Front End, with Profiling, builds the SSDM (System-Level Synthesis Data Model); Analysis and Mapping then drive three synthesis paths that together target an Embedded SoC]
- Processor & Architecture Synthesis → Processor Cores + Executables
- Interface Synthesis → Drivers + Glue Logic
- Behavioral Synthesis → Custom Logic

Uniqueness of xPilot
- Platform-based synthesis and optimization
- Communication-centric synthesis with
  interconnect optimization

xPilot: Behavioral-to-RTL Synthesis Flow
- Inputs: behavioral spec. in C/SystemC; platform description
- Frontend compiler → SSDM
- Presynthesis optimizations
  • Loop unrolling/shifting
  • Strength reduction / Tree height reduction
  • Bitwidth analysis
  • Memory analysis
  • …
- Core synthesis optimizations
  • Scheduling
  • Resource binding, e.g., functional unit binding, register/port binding
- Arch-generation & RTL/constraints generation
  • Verilog/VHDL/SystemC
- Output: RTL + constraints for FPGAs/ASICs
  • FPGAs: Altera, Xilinx
  • ASICs: Magma, Synopsys, …

Advantages of Behavioral Synthesis
- Shorter verification/simulation cycle
  • 100X speedup with behavior-level simulation
- Better complexity management, faster time to market
  • A 10M-gate design may require 700K lines of RTL code
- Rapid system exploration
  • Quick evaluation of different hardware/software boundaries
  • Fast exploration of multiple micro-architecture alternatives
- Higher quality of results
  • Platform-based synthesis & optimization
  • Full consideration of physical reality

Behavior Synthesis Has Been Tried and Failed – Why?
- Reasons for previous failures
  • Lack of a compelling reason: design complexity was still manageable a decade ago
  • Lack of a solid RTL foundation
  • Lack of consideration of physical reality
  • Lack of widely accepted behavior models
- xPilot advantages
  • Advanced algorithms for platform-based, communication-centric optimization
  • Platform-based behavior and system synthesis
  • Communication/interconnect-centric approach
  • Complete validation through final P&R on FPGAs

Platform Modeling & Characterization
- Target platform specification
  • High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
    - Functional units: adders, ALUs, multipliers, comparators, etc.
    - Connectors: mux, demux, etc.
    - Memories: registers, synchronous memories, etc.
  • Chip layout description
    - On-chip resource distributions
    - On-chip interconnect delay/power estimation
- [Figure: two binding solutions for the same behavior, built from MUXes and ALUs]
  • Which one is better? The answer is platform-dependent: how large/fast are the MUX and ALU?
- 3x3 delay matrix for Stratix-EP1S40:
    0.58   1.8   2.8
    2.0    2.9   3.7
    2.8    3.8   4.7

Advanced Behavior System Algorithms: Example: Versatile Scheduling Algorithm Based on SDC
- The scheduling problem in behavioral synthesis is NP-complete under general design constraints
- ILP-based solutions are versatile but very inefficient
  • Exponential time complexity
- [Figure: example DFG with operations *1, +2, +3, +4, *5 scheduled into control steps CS0 and CS1]

Existing Scheduling Techniques for Behavioral Synthesis
- Heuristic approach: fast, but ad hoc (efficiency limited to specific applications)
  • Data-flow-based scheduling (targets data-flow-intensive designs, e.g., DSP applications, image processing applications, etc.)
  • Control-flow-based scheduling (targets control-flow-intensive designs, e.g., controllers, network protocol processors, etc.)
- Exact approach: versatile, but inefficient (poor scalability)
  • ILP-based scheduling, e.g., [Huang et al., TCAD’91], etc.
  • BDD-based symbolic scheduling, e.g., [Radivojevic and Brewer, TCAD’96]
  • …

Scheduling: Our Approach
- Overall approach
  • Current objective: high performance
  • Use a system of integer difference constraints to express all kinds of scheduling constraints
  • Represent the design objective as a linear function
- [Figure: example DFG with operations v1 to v5 (additions and multiplications)]
- Platform characterization
  • Adder (+/–): 2 ns
  • Multiplier (*): 5 ns
- Target cycle time: 10 ns
- Resource constraint: only ONE multiplier is available
- Dependency constraints
  • $v_1 \to v_3$: $x_3 - x_1 \ge 0$
  • $v_2 \to v_3$: $x_3 - x_2 \ge 0$
  • $v_3 \to v_4$: $x_4 - x_3 \ge 0$
  • $v_4 \to v_5$: $x_5 - x_4 \ge 0$
- Frequency constraint
  • $\langle v_2, v_5 \rangle$: $x_5 - x_2 \ge 1$
- Resource constraint
  • $\langle v_2, v_3 \rangle$: $x_3 - x_2 \ge 1$
- In matrix form, $A x \le b$ (row 2 carries the tighter resource bound for $v_2, v_3$, which subsumes their dependency constraint):
  \[
  \begin{bmatrix}
  1 & 0 & -1 & 0 & 0 \\
  0 & 1 & -1 & 0 & 0 \\
  0 & 0 & 1 & -1 & 0 \\
  0 & 0 & 0 & 1 & -1 \\
  0 & 1 & 0 & 0 & -1
  \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
  \le
  \begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ -1 \end{bmatrix}
  \]
- $A$ is a totally unimodular matrix: this guarantees integral solutions

UPS Scheduling: Overall Framework
- Inputs
  • CDFG
  • User-specified design constraints & assignments
  • Target platform modeling (resource library & chip layout)
- xPilot scheduler
  • Constraint equations generation: relative timing constraints, dependency constraints, frequency constraints, resource constraints, …
  • Objective function generation
  • System of pairwise difference constraints → linear programming solver → LP solution interpretation
- Output: STG (State Transition Graph)

UPS vs. SPARK: Results on SPARK’s Benchmarks
- Mult (*): 2 cycles; Div (/): 5 cycles; rest: one cycle
- Target clock period: 7.5 ns

  Benchmark        SPARK               UPS                 UPS / SPARK
                   State#  W. Cycle#   State#  W. Cycle#
  MPEG2-dpframe    32      424         35      352         0.83
  GIMP-tiler       27      2234        32      1877        0.84
  ADPCM-decoder    15      327         13      278         0.85
  ADPCM-encoder    16      133         13      112         0.84
  Average ratio                                            0.84

- UPS achieves a 16% cycle count reduction over SPARK

Platform-Based Interface Synthesis
- Focus on sequential communication media (SCM)
  • FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
  • Order may have a dramatic impact on performance
  • The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
- Interface synthesis for SCM
  • Consider both behavior and communication to determine the optimal transmission order
- DCT example: process P1 (PE1, custom logic 1) sends data[8] over FIFO channel C to process P2 (PE2, custom logic 2)
    P1:  for (int i = 0; i < 8; i++) { S1: data[i] = …; }
    P2:  int s07 = data[0] + data[7];
         int s16 = data[1] + data[6];
         …..

SCM Co-Optimization: Problem Formulation
- Given:
  • A set of processes P connected by a set of channels in C
  • A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj
- Goal:
  • Find the optimal transmission order of each process, so that the overall latency of the process network is minimized subject to the given design constraints and platform specifications
  • At the same time, automatically generate the drivers and glue logic for each process

SystemC/C-to-RTL Design Flow
- Inputs: SystemC/C specification; platform description & constraints
- xPilot behavioral synthesis, built on the SSDM (System-Level Synthesis Data Model)
  • Front-end compiler → SSDM/CDFG
  • Behavioral synthesis → SSDM/FSMD
  • RTL generation → FSM with datapath in VHDL, plus floorplan and/or multicycle path constraints
- RTL synthesis → ASICs/FPGAs platform

Preliminary Results of xPilot: Better Complexity Management
- Significant code size reduction
  • RTL design → behavioral design: 10x code size reduction
  • VHDL code generated by UCLA xPilot targeting the Altera Stratix platform

Design Exploration for Heterogeneous MPSoC Platforms
- Heterogeneous MPSoCs exploration
  • Processors
    - Heterogeneous vs. homogeneous
    - General-purpose vs. application-specific
  • On-chip communication architecture (OCA)
    - Bus (e.g. AMBA, CoreConnect), packet switching network (e.g.
      Alpha 21364)
  • Memory hierarchy
- [Figure: heterogeneous MPSoC in which μPs (running tasks on an OS with drivers), an IP block, an FPGA, and a DSP each attach through network interfaces to a communication network]

Configurable SoC Platforms
- General-purpose processor cores + programmable fabric
  • Tight integration using extended instructions (ASIPs)
    - Example: Altera Nios / Nios II
  • Loose integration using FIFOs/buses for communication
    - Example: Xilinx MicroBlaze, etc.
- [Figures: custom instruction logic for Nios II (source: www.altera.com); Xilinx MicroBlaze (source: www.xilinx.com)]

ASIP Compilation: Problem Statement
- Given:
  • CDFG G(V, E)
  • The basic instruction set I
  • Pattern constraints:
    - Number of inputs: $|PI(p_i)| \le N_{in}$
    - Number of outputs: $|PO(p_i)| = 1$
    - Total area: $\sum_{1 \le i \le N} \mathrm{area}(p_i) \le A$
- Objective:
  • Generate a pattern library P
  • Map G to the extended instruction set $I \cup P$, so that the total execution time is minimized
- Example (multiply: 2 clock cycles; add: 1 clock cycle; ext-inst1 is MAC1, 2 cycles; ext-inst2 is MAC2, 2 cycles)
    Original code:            With extended instructions:
    t1 = a * b;               t4 = ext-inst1(a, b, c);
    t2 = b * c;               t5 = ext-inst2(b, c, d, e);
    t3 = d * e;               t6 = t4 + t5;
    t4 = t1 + t2;
    t5 = t2 + t3;
    t6 = t5 + t4;
  • Original: 3 multiplies x 2 cycles + 3 adds x 1 cycle = 9 cycles; extended: 2 + 2 + 1 = 5 cycles
  • Performance speedup = 9 / 5 = 1.8X

Target Core Processor Model
- Core processor model
  • Classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back)
    - The number of input and output operands of an instruction is predetermined
    - An instruction reads the core register file during the execute stage, and commits the result during the write-back stage
- [Figure: five-stage pipeline datapath (PC, Inst Cache, IF/ID, Reg File with RS1/RS2, ID/EX, ALU with OP1/OP2, EX/MEM, Memory, MEM/WB, MUX, Result) with custom logic attached to the core processor]

ASIP Compilation Flow
- C code → front-end compilation → CDFG → steps 1 to 3 below → optimized CDFG → backend compilation → optimized assembly
- Pattern generation
  • Satisfying input/output constraints (arch constraint)
- Steps
  1. Pattern generation
  2.
     Pattern selection (produces the pattern library)
  3. Application mapping & graph covering
- Pattern selection
  • Select a subset to maximize the potential speedup while satisfying the resource constraint
- Application mapping
  • Graph covering to minimize the total execution time

Experimental Results on Altera Nios
- Altera Nios is used for ASIP implementation
  • 5 extended instruction formats
  • Up to 2048 instructions for each format
- Small DSP applications are taken as benchmarks

  Benchmark  Extended   Speedup             Resource overhead
             instr. #   Estimation  Nios    LE             Memory           DSP block
  fft_br     9          3.28        2.65    408   6.06%    65,536  9.79%    16
  iir        7          3.18        3.73    255   3.79%    4,736   0.71%    40
  fir        2          2.40        2.14    51    0.76%    1,024   0.15%    8
  pr         2          1.57        1.75    71    1.05%    0       0.00%    14
  dir        2          3.28        3.02    54    0.80%    0       0.00%    16
  mcm        4          4.75        3.22    186   2.76%    0       0.00%    56
  Average    -          3.08        2.75    -     2.54%    -       1.77%    -

Architecture Extension for ASIPs
- Data bandwidth problem
  • Limited register file bandwidth (two read ports, one write port)
  • ~40% of the ideal performance speedup will be lost
- Shadow-register-based architectural extension
  • Core registers are augmented by an extra set of shadow registers
    - Conditionally written during the write-back stage
    - Low power/area overhead
  • Novel shadow-register binding algorithms are developed
- [Figure: the core pipeline extended with shadow registers SR1 … SRK and a hashing unit computing k = hash(j), feeding the custom logic]

Ongoing Work: Mapping for Heterogeneous Integration with Multiple Processing Cores
- Given:
  • A library of processing cores P and a communication library C
  • Task graph G(V, E)
    - For each v in V, execution time t(v, pi) on pi
    - For each (u, v) in E, communication data size s(u, v)
  • Throughput constraint
- Problem:
  • Select and instantiate the processing elements and communication channels from P and C respectively
  • Map the tasks onto the processing elements and the communications onto the channels so that
    - The optimal latency is achieved subject to the throughput
      constraint
    - The implementation cost is minimized

Preliminary Results on Motion-JPEG Example
- [Figure: Motion-JPEG pipeline from RAW images to encoded JPEG images: Preprocess → DCT → Quant → Huffman, with Table Modification; OR the same pipeline with HW-DCT in place of DCT]
- Model #1: 5 MicroBlazes, FSL-based communication
- Model #2: 4 MicroBlazes + DCT on FPGA fabric
- Platform: Xilinx XUP board

  System     Cycle#          Fmax (MHz)   Exe time (ms)   Area (Slice#)
  Model #1   23812           126          0.189           4306
  Model #2   14800 (-38%)    126          0.117           6345

Conclusions
- xPilot has a fairly mature and advanced behavior synthesis capability, from C or SystemC to RTL code with the necessary design constraints
- xPilot advantages include
  • Platform-based behavior and system synthesis
  • Communication/interconnect-centric approach
  • Advanced algorithms for platform-based, communication-centric optimization
  • Promising results demonstrated on available FPGAs
- xPilot system synthesis capabilities
  • Performance simulation of multi-processor systems
  • Exploration of the efficient use of (multiple) on-chip processors
  • Compilation and optimization for reconfigurable processors

Acknowledgements
- We would like to acknowledge the support of
  • Gigascale Systems Research Center (GSRC)
  • National Science Foundation (NSF)
  • Semiconductor Research Corporation (SRC)
  • Industrial sponsors under the California MICRO programs (Altera, Xilinx)
- Team members:
  • Yiping Fan
  • Guoling Han
  • Wei Jiang
  • Zhiru Zhang

Electronic System-Level (ESL) Design Automation
- Modeling
  • SystemC (open source)
  • SystemVerilog
- Simulation and verification
  • Behavior-level simulation & verification
  • System-level simulation & verification
  • SystemC provides behavior-level and system-level simulation capabilities for free and is rapidly gaining popularity
- Synthesis
  • Behavior-level synthesis: from behavior specification (e.g.
    C, SystemC, or Matlab) to RTL or netlists
  • System-level synthesis: from system specification to system implementation

ESL Tools: A Lot of Interest …

Communication- and Interconnect-Centric Synthesis: Example: Use of Distributed Register-File Architectures
- [Figure: a scheduled DFG with register binding indicated on each variable (assuming a one-functional-unit constraint), shown next to two implementations]
  • Binding using discrete registers
  • Binding using a register file: more efficient design!
- Distributed register-file micro-architecture:
  • Efficiently uses on-chip embedded memories
  • Fully explores operation and data-transfer parallelism

Distributed Register-File Microarchitecture
- [Figure: islands A, B, and C, each containing a local register file (on-chip memory blocks), data-routing logic, input buffers, and a functional unit pool (FUP: MUX plus MUL, ALU, ALU’)]
- On-chip RAM resources on Virtex II:

  Xilinx XC-2V      2000   3000   4000   6000    8000
  #18Kb BRAM        56     96     120    144     168
  Dist. RAM (Kb)    336    448    720    1,056   1,456

Resource Binding for DRF-Microarchitecture
- [Figure: scheduled DFG over c-steps 1 to 4 with operations v1 to v10 bound to islands (chains) A, B, C, and D, distinguishing intra-island from inter-island transfers]
- Inter-island connections = 5
  • (A,B) = (A,D) = 1
  • (A,C) = 1, two data transfers share one connection
  • (C,D) = 2
- Facts under simplified assumptions
  • Operations bound onto an island form a chain in the given scheduled DFG
  • Inter-chain data transfers may share a physical inter-island connection
  • The number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance

DRFM Binding Solution
- [Figure: the scheduled DFG with islands (chains) A to D, next to a weighted bipartite graph between operations v3, v9 and the islands]
- C-steps 1 and 2 handled.
- For c-step 3:
  • Construct a weighted bipartite graph: edge weight = number of newly introduced inter-island connections (IIC)
  • Min-weight matching → optimal binding in this step
- Overview: use weighted bipartite matching to solve each step optimally, in step-by-step fashion
- Solution of this step:
  • Matching: v3 → Island A; v9 → Island C
  • Newly introduced IIC # = 0
- Final inter-island connections = 4

DRF Experimental Results: Three Experimental Flows for Comparison
- Common front half (xPilot behavioral synthesis system): xPilot frontend → SSDM/CDFG → scheduling algorithms → scheduled CDFG (STG)
- Three binding flows:
  1) Binding on discrete-register microarchitecture
  2) Baseline (random) DRF binding
  3) DRF binding for minimizing inter-island connections
- Common back half: RTL generation → Xilinx Virtex II

DRF Experimental Results
- Xilinx ISE 7.1; Virtex II; target clock period: 8 ns
- The baseline DRF binding achieves a 46.70% slice reduction over the discrete-register approach
- Optimized DRF binding reduces slices by a further 12.21%
- Overall, more than 2X logic slice reduction with a better clock period (7.8%).
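Returning to the step-by-step binding on the DRFM Binding Solution slide: the sketch below shows one way a single c-step could be bound by minimum-weight assignment, where the weight of binding an operation to an island is the number of inter-island connections it would newly introduce. This is an illustrative Python sketch under stated assumptions, not xPilot's implementation: connections are modeled as directed (source island, destination island) pairs, a brute-force search over assignments stands in for a proper min-weight bipartite matching solver, and the function names and example data are hypothetical.

```python
from itertools import permutations


def new_connections(op_sources, island, existing):
    """Inter-island connections that binding an op to `island` would add.

    op_sources: islands that hold the op's operands (assumed known from
    the bindings of earlier c-steps). Transfers that reuse an existing
    connection cost nothing, mirroring connection sharing on the slide.
    """
    return {(src, island) for src in op_sources if src != island} - existing


def bind_step(ops, islands, existing):
    """Bind each op of one c-step to a distinct island, minimizing new IICs.

    ops: dict mapping op name -> set of islands holding its operands.
    Brute force over assignments; fine for a handful of ops per step,
    but a real tool would use a polynomial-time matching algorithm.
    """
    names = sorted(ops)
    best_cost, best_binding = None, None
    for perm in permutations(sorted(islands), len(names)):
        cost = sum(len(new_connections(ops[n], isl, existing))
                   for n, isl in zip(names, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_binding = cost, dict(zip(names, perm))
    return best_binding, best_cost


# Hypothetical data loosely modeled on the slide's c-step 3.
existing = {("A", "B"), ("A", "C")}
ops = {"v3": {"A"}, "v9": {"A", "C"}}
binding, cost = bind_step(ops, {"A", "B", "C", "D"}, existing)
print(binding, cost)
```

Note that summing per-operation weights double-counts a new connection that two operations in the same step would share; the sketch accepts that simplification to keep the assignment formulation.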
[Figure: bar charts comparing Discrete-Reg, DRF-Random, and DRF-Opt on benchmarks PR, LEE, CHEN, and DIR. Left: area in slices (DRF solutions use on-chip RAM blocks). Right: clock period (ns)]

Preliminary Result of xPilot: Better QoR (Comparison with UCI/UCSD SPARK)

  Designs     SPARK                                    xPilot                                   Fmax ratio
              Slice  LUT    FF    DSP  Fmax (MHz)      Slice  LUT    FF    DSP  Fmax (MHz)      xPilot/SPARK
  PR          588    981    247   0    92.85           331    416    564   16   146.84          1.58
  WANG        660    1157   265   0    109.29          357    464    588   15   133.51          1.22
  LEE         574    996    220   0    109.17          356    484    659   19   131.93          1.21
  MCM         1062   1857   479   0    99.40           887    1207   1282  30   110.38          1.11
  DIR         1323   2256   494   3    79.30           979    1002   1732  56   98.81           1.25
  Ave ratio   1      1      1     1    1.00            0.66   0.48   2.74  n/a  1.27            1.27

- Device setting: Xilinx Virtex-II Pro (xc2v4000 -6)
- Target frequency: 200 MHz

Proposed SCM Co-Optimization Design Flow
- Inputs: platform description & constraints; process network
- Front end → System-Level Synthesis Data Model
- SCOOP (SCM CO-Optimization)
  • Communication order detection
  • Code transformation and interface generation
  • Indices compression for loop reordering
- Outputs: drivers + glue logic; process behavior

Communication Order Detection
- Step 1. Construct a global CDFG by merging the individual CDFGs of each process
- Step 2.
  Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG
- [Figure: Process 1 and Process 2 exchanging transfers T1, T2, T3 (Ti: FIFO) in two different orders; one order gives latency = 5 cycles, the other latency = 7 cycles]

Loop Indices Compression
- Given the optimal order, we try to generate restructured loops for code compression
  • I.e., given the original iteration and the reordered iteration, find the minimum number of linear intervals to represent the new iteration space
- Original order: (0,0), (0,1), (1,0), (1,1)
- After reordering: (0,0), (1,0), (0,1), (1,1)
- Need to solve the linear system
  \[
  \begin{bmatrix} i' \\ j' \end{bmatrix}
  =
  \begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{bmatrix}
  \begin{bmatrix} i \\ j \\ 1 \end{bmatrix}
  \]
- Solution: $i' = j$, $j' = i$

Initial Results of Interface Synthesis
- Target: sequential communication channels
  • In particular, FSL in Virtex-II
- Consider two communicating processes

  Designs    Total latency (Cycle#)          RAs Compress
             Trad.   SCOOP   Reduction       Before   After
  DCT1       325     290     10.77%          0        0
  Haar       142     134     5.63%           0        0
  DWT        689     617     10.45%          0        0
  Mat_mul    408     339     16.91%          96       20
  DCT2       483     419     13.25%          80       64
  Masking    620     420     32.26%          192      0
  Dot        1903    1084    43.04%          300      0

- An average of 26% improvement in total latency can be achieved.

MPEG-4 Simple Profile Decoder: Architecture Profiling
- C specification overview

  Module name          Orig. C source file      Orig. C line #
  Copy Controller      copyControl.c            287
  Display Controller   displayControl.c         358
  Motion Comp.         MotionCompensation.c     312
  Parser/VLD           parser.c                 1092
                       texture_vld.c            508
  Texture/IDCT         texture_idct.c           1901
  Texture Update       textureUpdate.c          220

- Runtime profiling (PowerPC/XUP board)
  • Parser/VLD: 59.0%
  • Texture/IDCT: 18.1%
  • Motion Comp.: 15.7%
  • Copy Controller: 3.6%

MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation
- HW block integrated with the PowerPC single-process design
  • Software blocks running on PowerPC
  • 15% speed improvement

MPEG-4 Simple Profile Decoder: Alternate Implementations

                                   Single     Single PowerPC w/    Single    7-uBlaze
                                   PowerPC    HW Motion Comp.      uBlaze
  Throughput (frames per second)   0.59       1.18                 3.06      3.53
  Improvement                      -          +209%                +68.4%    +15.3%

- xPilot synthesis report of HW blocks

  Block            Line counts                   Slices (FFs, LUTs)    MUL   Clock period (ns)   Latency (cycles)
                   C     RTL SystemC  RTL VHDL
  Motion Comp.     210   9903         5655       986 (1111, 1017)      2     7.97                505
  Block IDCT       200   9534         2731       1877 (2376, 2438)     26    7.963               280
  Texture Update   160   8227         4475       1551 (1696, 1931)     4     7.913               335

Advantages of Our Scheduling Algorithm
- A highly versatile scheduling engine (UPS)
- Supports a wide spectrum of applications with high complexity
  • Data-intensive, control-intensive, memory-intensive, mixed, etc.
- Honors a rich set of design constraints
  • Resource constraints, relative timing constraints, frequency constraints, latency constraints, etc.
- Offers a variety of optimization techniques
  • Operation chaining, pipelined multi-cycle operation, awareness of repetitions, behavioral templates, speculation, functional/loop pipelining, multi-cycle communication
- Accounts for physical reality
  • Optimizes communications simultaneously with computations

Preliminary Results of xPilot: Rapid System Exploration
- Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries
- Example: Motion-JPEG implementation
  • All-HW implementation
  • All-SW implementation (using embedded processors)
  • SW/HW co-design: optimal partitioning?
  • Repeated manual RTL coding is not a solution!
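As a small illustration of the difference-constraint view of scheduling that UPS builds on, the sketch below computes the earliest control step for each operation of the five-operation example from the scheduling slides, using the constraints x3 - x1 >= 0, x3 - x2 >= 1 (one multiplier), x4 - x3 >= 0, x5 - x4 >= 0, and x5 - x2 >= 1 (cycle-time limit). It is a sketch under assumptions, not the UPS engine: the slides describe solving a linear program with a linear objective, whereas this code swaps in Bellman-Ford-style relaxation, which yields the as-soon-as-possible solution only. The function name `solve_sdc` and the 0-based indexing are choices made for this example.

```python
def solve_sdc(num_vars, constraints):
    """Earliest integer solution of x[j] - x[i] >= c with all x >= 0.

    constraints: list of (i, j, c) meaning x[j] - x[i] >= c (0-based).
    Returns the list of control steps, or None if the constraints are
    infeasible (the constraint graph contains a positive-weight cycle).
    """
    x = [0] * num_vars  # every operation may start at step 0 at the earliest
    for _ in range(num_vars):
        changed = False
        for i, j, c in constraints:
            if x[i] + c > x[j]:  # constraint violated: push x[j] later
                x[j] = x[i] + c
                changed = True
        if not changed:
            return x
    return None  # still changing after num_vars passes: infeasible


# Example from the scheduling slides, with v1..v5 mapped to indices 0..4.
example = [
    (0, 2, 0),  # dependency v1 -> v3
    (1, 2, 1),  # resource: v2 and v3 share the single multiplier
    (2, 3, 0),  # dependency v3 -> v4
    (3, 4, 0),  # dependency v4 -> v5
    (1, 4, 1),  # frequency: path v2..v5 exceeds the 10 ns cycle time
]
print(solve_sdc(5, example))  # [0, 0, 1, 1, 1]
```

Under these constraints v1 and v2 land in the first control step and v3, v4, v5 in the second. Other objectives over the same constraint system would need the LP formulation that the slides describe.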

Preliminary Results of xPilot: Shorter Simulation/Verification Cycle
- From other projects:
  • Simulation on a behavior model is 100X faster than the RTL-based method [NEC, ASPDAC04]
- Our experience: motion-compensation module in an MPEG-4 decoder
  • Behavior-level (C language) simulation: less than 1 second per frame
  • RTL SystemC simulation: about 310 seconds per frame

Ongoing Work: Design Exploration for MPSoCs
- A scalable architecture simulation infrastructure for architecture evaluation & performance/power estimation
  • Need for structural abstraction of processors and interconnects
    - Recent work such as Liberty is an effort in this direction
  • Complete structural abstraction makes the simulation very slow
    - Liberty is about 10X slower than SimpleScalar on an Itanium model
  • Hybrid approach
    - Tradeoff between accuracy and simulation time
    - Interconnect modeled accurately using SystemC (for accuracy)
    - Cores modeled using SimpleScalar (for simulation speed)
- Communication network synthesis
  • Automatic interface synthesis is required
  • Physical planning is needed for interconnect latency/power estimation