High-Level Synthesis of High-Performance Microprocessor Blocks
The SPARK High-Level Synthesis System

Sumit Gupta, Nick Savoiu, Nikil Dutt, Rajesh Gupta, Alex Nicolau,
Timothy Kam, Michael Kishinevsky, Steve Haynal, Abdallah Tabbara

Center for Embedded Computer Systems, University of California, Irvine
http://www.cecs.uci.edu/~spark

Strategic CAD Labs, Design Technologies, Intel, Hillsboro
http://www.intel.com/research/scl

Supported by the Semiconductor Research Corporation and Intel
August 31, 2001


Overview
- Brief background
- The SPARK high-level synthesis framework
- Previous work in the SPARK framework
- High-level synthesis for microprocessor blocks
  - Instruction Length Decoder
    - Design behavior
    - Steps involved in synthesis
- Work done this summer at SCL
- Future plans


High-Level Synthesis
[Figure: from C, to a control/data flow graph (CDFG), to an architecture.]


Scheduling with Given Resource Allocation
[Figure: scheduling a CDFG under resource constraints (e.g., one adder, one comparator).]


The SPARK High-Level Synthesis Framework
[Figure: overview of the SPARK framework.]


Limitations of High-Level Synthesis Targeted by SPARK
- Quality of synthesis results is severely affected by complex control flow
  - The control-flow style affects the effectiveness of optimizations
  - Nested ifs and loops are not handled, or are handled poorly
- Poor understanding (much less integration) of the interaction between source-level and fine-grain "compiler" transformations
- No comprehensive synthesis framework
  - Few and scattered optimizations
  - Results presented only for scheduling; effects on logic synthesis not understood
  - Small, synthetic benchmarks


Generalized Code Motions
[Figure: operations moved across hierarchical blocks and if nodes.]
- Across hierarchical blocks
- Speculation
- Reverse speculation
- Conditional speculation


Characteristics of ASIC Design
- Large designs, such as MPEG
- Multi-cycle implementation
- Resource constrained
- Implications for the transformations applied:
  - Extraction of parallelism is constrained by area limitations
  - Speculation may lead to additional registers
  - More conservative with transformations such as loop unrolling


Characteristics of Microprocessor Blocks
- Smaller designs
- Single- or dual-cycle implementation
- High performance: extract maximal parallelism
- Area constraints are more relaxed
- Implications for the transformations applied:
  - Operations within the behavior are chained together with no latching
  - All loops can be unrolled


Simplified Instruction Length Decoder
[Figure: bytes 0-3 of the instruction buffer; each byte produces a Length Contribution, and bytes 0-2 each produce a NeedNextByte flag for the following byte.]


Simplified Instruction Length Decoder (continued)
[Figure: the same structure, with the bytes of the first instruction marked.]


Behavioral Description in C
```c
NextStartByte = 0;
for (i = 0; i < n; i++) {
    len[i] = CalculateLength(i);
    if (i == NextStartByte) {
        NextStartByte = len[i];
        Mark[i] = 1;
    }
}  /* for (i = 0; i < n; i++) */

int CalculateLength(int i)
{
    lc1 = LengthContribution(i);
    need1 = need_next_byte(i);
    if (need1) {
        lc2 = LengthContribution(i+1);
        need2 = need_next_byte(i+1);
        if (need2) {
            lc3 = LengthContribution(i+2);
            need3 = need_next_byte(i+2);
            if (need3) {
                lc4 = LengthContribution(i+3);
                Length = lc1 + lc2 + lc3 + lc4;
            } else
                Length = lc1 + lc2 + lc3;
        } else
            Length = lc1 + lc2;
    } else
        Length = lc1;
    return Length;
}
```
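The helpers LengthContribution() and need_next_byte() are left abstract above. A minimal, purely hypothetical way to stub them out (table-driven placeholders; the real x86 opcode/prefix tables are not part of this example) so the behavioral description can be compiled and simulated stand-alone:

```c
/* Hypothetical stubs only -- the real ILD consults x86 opcode/prefix
 * tables.  These placeholders just make the behavioral C above a
 * self-contained, simulatable translation unit.                       */
#define BUF_SIZE 16

static unsigned char InstrBuffer[BUF_SIZE];  /* bytes being decoded     */
static unsigned char length_table[256];      /* per-byte length part    */
static unsigned char need_table[256];        /* "need next byte?" flag  */

int LengthContribution(int i)
{
    return length_table[InstrBuffer[i]];     /* table-driven lookup     */
}

int need_next_byte(int i)
{
    return need_table[InstrBuffer[i]];       /* table-driven lookup     */
}
```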
Speculate Maximally
The driver loop is unchanged; CalculateLength becomes:
```c
int CalculateLength(int i)
{
    /* Data calculation: everything is computed speculatively */
    lc1 = LengthContribution(i);
    need1 = need_next_byte(i);
    lc2 = LengthContribution(i+1);
    need2 = need_next_byte(i+1);
    lc3 = LengthContribution(i+2);
    need3 = need_next_byte(i+2);
    lc4 = LengthContribution(i+3);
    TempLength1 = lc1 + lc2 + lc3 + lc4;
    TempLength2 = lc1 + lc2 + lc3;
    TempLength3 = lc1 + lc2;

    /* Control logic: select the correct length */
    if (need1) {
        if (need2) {
            if (need3) {
                Length = TempLength1;
            } else
                Length = TempLength2;
        } else
            Length = TempLength3;
    } else
        Length = lc1;
    return Length;
}
```


Inlining (Done Earlier)
CalculateLength is split into its data-calculation and control-logic halves, which are then substituted into the loop body:
```c
NextStartByte = 0;
for (i = 0; i < n; i++) {
    Results(i) = DataCalculation(i, i+1, i+2, i+3);
    Length(i)  = ControlLogic(Results(i));
    len[i] = Length(i);
    if (i == NextStartByte) {
        NextStartByte = len[i];
        Mark[i] = 1;
    }
}  /* for (i = 0; i < n; i++) */
```


Unroll Loop Completely (shown for only 2 unrolls)
```c
NextStartByte = 0;
i = 0;

Results(i) = DataCalculation(i, i+1, i+2, i+3);
Length(i)  = ControlLogic(Results(i));
len[i] = Length(i);
if (i == NextStartByte) {
    NextStartByte = len[i];
    Mark[i] = 1;
}

Results(i+1) = DataCalculation(i+1, i+2, i+3, i+4);
Length(i+1)  = ControlLogic(Results(i+1));
len[i+1] = Length(i+1);
if (i+1 == NextStartByte) {
    NextStartByte = len[i+1];
    Mark[i+1] = 1;
}
```


Propagate Constant: Loop Index
```c
NextStartByte = 0;

Results(0) = DataCalculation(0, 1, 2, 3);
Length(0)  = ControlLogic(Results(0));
len[0] = Length(0);
if (0 == NextStartByte) {
    NextStartByte = len[0];
    Mark[0] = 1;
}

Results(1) = DataCalculation(1, 2, 3, 4);
Length(1)  = ControlLogic(Results(1));
len[1] = Length(1);
if (1 == NextStartByte) {
    NextStartByte = len[1];
    Mark[1] = 1;
}
```


Maximally Parallelize/Compact
```c
/* Data calculation: all bytes in parallel */
Results(0) = DataCalculation(0, 1, 2, 3);
Results(1) = DataCalculation(1, 2, 3, 4);
...
Results(n) = DataCalculation(n, n+1, n+2, n+3);

/* Control logic: all bytes in parallel */
Length(0) = ControlLogic(Results(0));
Length(1) = ControlLogic(Results(1));
...
Length(n) = ControlLogic(Results(n));

len[0] = Length(0);
len[1] = Length(1);
...
len[n] = Length(n);

/* Ripple control logic */
NextStartByte = 0;
if (0 == NextStartByte) {
    NextStartByte = len[0];
    Mark[0] = 1;
}
if (1 == NextStartByte) {
    NextStartByte = len[1];
    Mark[1] = 1;
}
...
if (n == NextStartByte) {
    NextStartByte = len[n];
    Mark[n] = 1;
}
```


Final Design Architecture
The three groups of the compacted code map directly onto the hardware: the DataCalculation calls form the data-calculation stage, the ControlLogic calls form the control-logic stage, and the chained NextStartByte checks form the ripple control logic.
[Figure: the Instruction Buffer feeding the Data Calculation, Control Logic, and Ripple Logic stages.]
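For intuition about why a single-cycle schedule is possible: after maximal speculation, the per-byte control logic is nothing but a combinational select over the pre-computed sums. A hedged C rendering of that select, reusing the variable names from the speculated CalculateLength shown earlier (a sketch only, not SPARK output):

```c
/* lc1, need1..need3 and TempLength1..TempLength3 all come from the
 * speculated data-calculation code; nothing new is computed here,
 * the chained conditionals only select among existing values.        */
Length = !need1 ? lc1              /* only byte i contributes           */
       : !need2 ? TempLength3      /* lc1 + lc2                         */
       : !need3 ? TempLength2      /* lc1 + lc2 + lc3                   */
       :          TempLength1;     /* lc1 + lc2 + lc3 + lc4             */
```

This is the shape that chaining across conditional boundaries exploits: with no operations left inside the branches, the whole if-structure can be scheduled in one cycle.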
ILD Tasks Achieved This Summer
- Chaining of operations across conditional boundaries
  - Enables single-cycle schedules
  - Useful as a general high-level synthesis transformation as well
  - Had implications elsewhere, such as VHDL generation
- Complete unrolling of loops (implemented previously)
- Constant propagation
  - Useful for propagating the loop index after unrolling


Other Interaction within SCL
- Interfacing with the HLD team via XML
  - Implemented an XML generation pass
  - Creates a path from C to NexSiS and the rest of the HLD flow
  - Driven by requirements from Abdallah
- Analyzed some other designs
  - Whitney: 3-D design
  - FAX: Willamette floating-point unit


Future Plans
- Continue working on the ILD with the more complicated (complete) design
- Look at similar designs, e.g., detecting the first 3 zeros in a 32-bit vector
- Develop a set of transformations targeted at such high-performance blocks
- Expand interaction with the HLD design flow
  - Apply some transformations before handing the CDFG over via XML to symbolic scheduling
  - For example: transformations that lead to node duplication, source-to-source transformations, some loop transformations


Additional Slides


Spark's Methodology
- Applies coarse- and fine-grain compiler optimizations
  - Targets control-flow transformations
  - "Fine grain" loop optimization techniques for multiple and nested loops
- Mixed IR suitable for fine- and coarse-grain compiler transformations (similar to other systems such as SUIF)
- Synthesis from C provides
  - A flow from architecture design to synthesis
  - The opportunity to apply coarse-grain optimizations
- Compiler transformations modified to target HLS
  - Multiple mutually exclusive operations can be scheduled on the same resource in the same cycle


Spark's Methodology (continued)
- Customizable, extensible scheduler
  - Range of transformations in a modular toolbox: percolation, trailblazing, loop pipelining (RDLP), inlining
  - Selected under heuristics and/or user control: code motions, loop transformations
- Ability to generate synthesizable RTL VHDL
  - Integrates with current IC design flows
- Code generation at various levels: behavioral C, behavioral VHDL, structural VHDL


Generalized Code Motions
- Hierarchical code motions: operations are moved across entire conditional structures
- Speculation: improves resource utilization; has to be controlled to limit the impact on the number of registers
- Reverse speculation: moves operations down into conditional branches
- Early condition execution: evaluates a conditional as soon as the corresponding operation has been executed
- Conditional speculation: duplicates operations up into conditional branches


Scheduling Results on the MPEG Prediction Block
[Figure: scheduling results.]


Scheduling Results on the ADPCM Encoder
[Figure: scheduling results.]


Scheduling Results
- Synthesis results after scheduling by Spark show:
  - Considerable gains in execution cycles
  - The critical path decreases marginally
  - Area can increase significantly
- The benchmarks used are large, well-written, real-life applications; the gains are not due to sloppy code


Interconnect Minimization by Resource Binding
- Minimize the complexity of steering logic (multiplexers and demultiplexers)
- Bind operations with the same inputs and outputs to the same functional units
- Bind variables that are inputs/outputs of the same functional units to the same registers
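To picture the effect of this binding heuristic, consider a hypothetical source-level fragment (for illustration only, not SPARK output):

```c
/* The two additions are mutually exclusive (different branches) and
 * share their first input, so both can be bound to one adder.         */
if (cond) x = a + b;     /* op1 */
else      y = a + c;     /* op2 */
/* Binding op1 and op2 to the same adder needs a multiplexer only on
 * the second input port (b vs. c); the shared input a needs none.
 * Binding x and y to the same register likewise avoids steering logic
 * on the adder's output.                                              */
```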
Results after Binding
[Figure: results after binding.]


Results after Binding: ADPCM
[Figure: results after binding for the ADPCM encoder.]


Future Plans
- Synthesis for high-performance microprocessor blocks: single-cycle behavioral descriptions
- Timing analysis and time budgeting: introducing time-constrained synthesis
- Loop transformations: parallelizing compiler transformations such as loop interchange, exchange, splitting, and fusion
- Resource versus throughput analysis
- Cost models for code motions


The Intermediate Representation
- The Hierarchical Task Graph (HTG) is the main structure in the intermediate representation (IR)
- Maintains information on:
  - Code structure (ifs, loops)
  - Loop bounds and type (FOR, WHILE)
- Array accesses are not lowered to an address calculation followed by a memory access
- Is complete: the input C code can be regenerated
[Figure: IR layers -- the EDG AST and the HTG/CDFG.]


IR Examples
[Figure: IR examples.]

[Figure: C code with its HTG and CDFG.]


Scheduling
[Figure: scheduling example.]


The Scheduler Framework
- Scheduler framework philosophy
  - Modular and reusable
  - Allow a designer to write new scheduling algorithms with minimal effort
- Toolbox approach
  - Core transformations: percolation, trailblazing, RDLP
  - Heuristics decide which transformations are to be applied


The Scheduler Framework (continued)
- Designed to be completely customizable in terms of the scheduling algorithms and heuristics used
- An instance of a scheduling algorithm consists of a set of
  - IR traversal algorithms
  - Code motion algorithms
  - Scheduling heuristics
- The designer can use predefined algorithms and heuristics, or design new ones; this is enabled by the toolbox approach
[Figure: an algorithm built from IR walkers, scheduling heuristics, a candidate provider, and candidate validators.]


Extracting Parallelism with Speculation
[Figure: speculation example.]


Reverse Speculation
- Moves operations into conditionals
- Only moves operations to the branches that require their results
- Moves operations with lower priority


Early Condition Execution
- Evaluates conditions as soon as possible
- Moves all unscheduled operations into conditionals
- Uses reverse speculation to achieve this


Conditional Speculation
[Figure: conditional speculation example.]


RDLP Example
Loop body:  A: i = i + 1;   B: j = i + h;   C: k = i + g;   D: l = j + 1;
Schedules (operations joined by ":" execute in the same step):
- Original loop:       A; B; C; D
- Compact:             A; B:C; D
- Unroll and compact:  A; B:C; D:A; B:C; D
- Shift and pipeline:  A before the loop; loop body = B:C; D:A
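A hedged C sketch of the shift-and-pipeline step, using the toy loop body from the example above (the iteration count N is assumed for illustration; an explicit epilogue is added so the operation counts match the original loop):

```c
i = i + 1;                     /* prologue: the shifted copy of A       */
for (t = 0; t < N - 1; t++) {
    j = i + h;                 /* B \  same step ("B:C")                */
    k = i + g;                 /* C /                                   */
    l = j + 1;                 /* D \  same step ("D:A"): A of the next */
    i = i + 1;                 /* A /  iteration overlaps D of this one */
}
j = i + h;                     /* epilogue: B:C and D of the            */
k = i + g;                     /* last iteration                        */
l = j + 1;
```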