High-Level Synthesis with LegUp: A Crash Course for Users and Researchers
Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi
11 February 2013, ACM FPGA Symposium, Monterey, CA
Dept. of Electrical and Computer Engineering, University of Toronto

Tutorial Outline
• Overview of LegUp and its algorithms (60 min)
• Labs ("hands on" via VirtualBox)
– Lab 1: Using the LegUp framework (30 min)
– Break
– Lab 2: Adding resource constraints (30 min)
– Lab 3: Changing how LegUp implements hardware (30 min)

Project Motivation
• Hardware design has advantages over software:
– Speed
– Energy efficiency
• Hardware design is difficult and skills are rare:
– 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers
*US Bureau of Labor Statistics '08

Top-Level Vision
• A self-profiling MIPS processor runs the compiled program on the FPGA and gathers profiling data: execution cycles, power, cache misses.
• The profiler suggests program segments to target to hardware; high-level synthesis hardens those segments into accelerators in the FPGA fabric.
• The software binary is altered to call the hardware accelerators.
• Running example throughout the tutorial:

int FIR(int ntaps, int sum) {
  int i;
  for (i = 0; i < ntaps; i++)
    sum += h[i] * z[i];
  return sum;
}

LegUp: Key Features
• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• MIPS processor (Tiger)
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
– Like ABC (synthesis) or VPR (place & route)
– 600+ downloads since March 2011
– http://legup.eecg.utoronto.ca

System Architecture
• Target: Altera DE2 (Cyclone II) or DE4 (Stratix IV) board
• A MIPS processor and LegUp-generated hardware accelerators (each with local memory) communicate over the Avalon interface
• A shared memory controller with an on-chip cache fronts the off-chip memory

High-Level Synthesis Framework
• Leverages the LLVM compiler infrastructure:
– Language support: C/C++
– Standard compiler optimizations
– More on this shortly
• We support a large subset of ANSI C:

Supported:    functions, arrays, structs, global variables,
              pointer arithmetic, floating point
Unsupported:  dynamic memory, recursion

Hardware Profiler Architecture
• Monitor the instruction bus to detect function calls and returns.
• Call: hash (in hardware) the call's target address to a function index; push the index onto a call stack.
• Return: pop the function index from the stack.
• The function index selects a counter, associating profiling data (e.g., cycles, power) with each function.
• The address hash computed in hardware:

tAddr += V1
tAddr += (tAddr << 8)
tAddr ^= (tAddr >> 4)
b = (tAddr >> B1) & B2
a = (tAddr + (tAddr << A1)) >> A2
fNum = a ^ tab[b]

• See paper: IEEE ASAP'11
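The hash above can be modelled in C. Below is a minimal sketch; the parameter values V1, B1, B2, A1, A2 and the contents of tab[] are hypothetical stand-ins (in the real profiler they are tuned so that every function address in the program maps to a unique index).

#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameter values; the real profiler tunes these per
   program so each function address maps to a unique counter index. */
#define V1 0x9E3779B9u   /* additive constant (assumed) */
#define B1 4u            /* table-index shift (assumed) */
#define B2 0xFFu         /* table-index mask: 256-entry table (assumed) */
#define A1 2u            /* scaling shift (assumed) */
#define A2 10u           /* final shift (assumed) */

static uint32_t tab[256]; /* correction table, filled in offline */

/* C model of the slide's hardware hash: maps a call's target
   address to a small function number used to select a counter. */
static uint32_t hash_function_index(uint32_t tAddr) {
    tAddr += V1;
    tAddr += tAddr << 8;
    tAddr ^= tAddr >> 4;
    uint32_t b = (tAddr >> B1) & B2;            /* table index */
    uint32_t a = (tAddr + (tAddr << A1)) >> A2; /* hashed address */
    return a ^ tab[b];                          /* function number */
}

int main(void) {
    printf("index of 0x00400120 -> %u\n",
           (unsigned)hash_function_index(0x00400120u));
    return 0;
}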
Processor/Accelerator Hybrid Flow
• Start from a pure-software program running on the MIPS processor:

int main() {
  ...
  sum = dotproduct(N);
  ...
}

int dotproduct(int N) {
  ...
  for (i = 0; i < N; i++) {
    sum += A[i] * B[i];
  }
  return sum;
}

• Mark the function for hardware in the LegUp configuration:

set_accelerator_function "dotproduct"

• HLS compiles dotproduct into a hardware accelerator, and LegUp replaces the software function with a wrapper that drives the accelerator through memory-mapped registers:

#define dotproduct_DATA (volatile int *) 0xf0000000
#define dotproduct_STATUS (volatile int *) 0xf0000008
#define dotproduct_ARG1 (volatile int *) 0xf000000C

int legup_dotproduct(int N) {
  *dotproduct_ARG1 = (volatile int) N;  /* pass the argument */
  *dotproduct_STATUS = 1;               /* start the accelerator */
  return *dotproduct_DATA;              /* read back the result */
}

• The call site becomes sum = legup_dotproduct(N); the rest of the program still runs in software on the MIPS processor.

How Does LegUp Handle Memory and Pointers?
• LegUp stores each array in a separate FPGA BRAM
• The BRAM data width matches the width of the array elements
• Each BRAM is identified by a 9-bit tag
• Addresses consist of the RAM tag and an array index:
– bits 31–23: 9-bit tag; bits 22–0: 23-bit index
• A shared memory controller uses the tag bits to determine which BRAM to read or write
• The array index is the address passed to the BRAM

Pointer Example
• We have two arrays in the C function:
– int A[100], B[100]
• Tag 0 is reserved for NULL pointers
• Tag 1 is reserved for off-chip memory
• Assign tag 2 to array A and tag 3 to array B
• Address of A[3]: tag = 2, index = 3
• Address of B[7]: tag = 3, index = 7

Shared Memory Controller
• Arrays A and B each occupy a 100-element BRAM (tags 2 and 3)
• A load from a pointer with tag = 2 and index = 13 is steered by the tag to A's BRAM, which returns A[13]
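As an illustration of this address format, here is a small C sketch of the tag/index packing; the helper names are ours for illustration, not LegUp's API.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers illustrating the address format above:
   a 9-bit BRAM tag in bits [31:23], a 23-bit index in bits [22:0]. */
#define INDEX_BITS 23
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

static uint32_t make_addr(uint32_t tag, uint32_t index) {
    return (tag << INDEX_BITS) | (index & INDEX_MASK);
}
static uint32_t addr_tag(uint32_t addr)   { return addr >> INDEX_BITS; }
static uint32_t addr_index(uint32_t addr) { return addr & INDEX_MASK; }

int main(void) {
    uint32_t a3 = make_addr(2, 3);  /* &A[3]: tag 2, index 3 */
    uint32_t b7 = make_addr(3, 7);  /* &B[7]: tag 3, index 7 */
    /* The shared memory controller switches on the tag to pick a
       BRAM and passes the index as that BRAM's address. */
    printf("A[3]: tag=%u index=%u\n",
           (unsigned)addr_tag(a3), (unsigned)addr_index(a3));
    printf("B[7]: tag=%u index=%u\n",
           (unsigned)addr_tag(b7), (unsigned)addr_index(b7));
    return 0;
}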
Core Benchmarks (+Many More)
• 12 CHStone benchmarks (JIP'09) and Dhrystone
– Too large/complex for most academic HLS tools
– 64-bit double precision is not supported by academic tools
• Include golden input/output test vectors

Category     Benchmarks                                     Lines of C code
Arithmetic   64-bit double precision: add, mult, div, sin   376 – 755
Encryption   AES, Blowfish, SHA                             716 – 1,406
Processor    MIPS processor                                 232
Media        JPEG decoder, Motion, GSM, ADPCM               393 – 1,692
General      Dhrystone                                      491

Experimental Results
• LegUp 1.0 (2011) targeting Cyclone II, five implementations:
1. Pure software on MIPS
Hybrid (software/hardware):
2. Second most compute-intensive function (and descendants) in hardware
3. Same as 2, but also the most compute-intensive function
4. Pure hardware using LegUp
5. Pure hardware using eXCite (commercial tool)
(Charts: geometric-mean execution time and number of LEs across the five implementations.)

Comparison: LegUp vs. eXCite
• Benchmarks compiled to pure hardware
• eXCite: commercial high-level synthesis tool
• eXCite could not compile Dhrystone

Geomean        Runtime (μs)   Logic Elements   Area-Delay Product
LegUp          292            15,646           4.57M
eXCite         357            13,101           4.68M
LegUp/eXCite   0.82 (1.22x)   1.19             0.98

Energy Consumption
• (Chart: geometric-mean energy in μJ for each implementation.)
• Hardware uses 18x less energy than software

Current Release: LegUp 3.0
• Loop pipelining
• Dual- and multi-ported memory support
• Bitwidth minimization
• Multi-pumping DSP units for area reduction
• Alias analysis for dependency checks
• Parallel accelerators via Pthreads & OpenMP
• Results now considerably better than the LegUp 1.0 release

LegUp 3.0 vs. LegUp 1.0
• Geometric-mean ratios over the CHStone benchmark circuits:
– Wall-clock time: 16% better
– Cycle latency: 31% better
– Fmax: 18% worse
– LEs (area): 28% better

LLVM Compiler and HLS Algorithms

LLVM Compiler
• Open-source compiler framework
– http://llvm.org
• Used by Apple, NVIDIA, AMD, and others
• Competitive quality with gcc
• LegUp HLS is a "back-end" of LLVM
• LLVM: low-level virtual machine

LLVM Compiler
• LLVM compiles C code into a control flow graph (CFG) of basic blocks (e.g., BB0, BB1, BB2 for the FIR example)
• LLVM performs standard optimizations
– 50+ different optimizations in LLVM

Control Flow Graph
• The control flow graph is composed of basic blocks
• Basic block: a sequence of instructions terminated with exactly one branch
– Each basic block can be represented by an acyclic data flow graph of its operations (loads, adds, stores, ...)

LLVM Details
• Instructions in basic blocks are primitive computational operations:
– shift, add, divide, xor, and, etc.
• Or control-flow operations:
– branch, call, etc.
• The CDFG is represented in LLVM's intermediate representation (IR)
– The IR is machine-independent assembly code

High-Level Synthesis Flow
• C program → C compiler (LLVM) → optimized LLVM IR
• Allocation → Scheduling → Binding → RTL generation → synthesizable Verilog
• Additional inputs: target hardware characterization and user constraints (timing, resource)

Scheduling
• Scheduling is the task of assigning operations to clock cycles, implemented with a finite state machine (e.g., loads in states 0–1, adds in states 1–2, a store in state 3)

Binding
• Binding is the task of assigning scheduled operations to functional units in the datapath (e.g., sharing a 2-port RAM and adders across states)

High-Level Synthesis: Scheduling

SDC Scheduling
• SDC: System of Difference Constraints
– Cong, Zhang, "An efficient and versatile scheduling algorithm based on SDC formulation", DAC 2006: 433-438.
• Basic idea: formulate scheduling as a mathematical optimization problem
– Linear objective function + linear constraints (==, <=, >=)
• The problem is a linear program (LP)
– Solvable in polynomial time with standard solvers

Define Variables
• For each operation i to schedule, create a variable t_i
• The t_i's will hold the cycle # in which each operation is scheduled
• For a DFG with an add, a shift, and a subtract, we have: t_add, t_shift, t_sub
• The data flow graph (DFG) is already accessible in LLVM

Dependency Constraints
• Example: the subtract can only happen after the add and the shift
• t_sub − t_add >= 0
• t_sub − t_shift >= 0
• Hence the name difference constraints

Handling Clock Period Constraints
• Target period: P (e.g., 10 ns)
• For each chain of dependent operations in the DFG, estimate the path delay D (using LegUp's delay models)
– E.g.: D from mod → or = 23 ns
• Compute: R = ceiling(D/P) − 1
– E.g.: R = 2
• Add the difference constraint: t_or − t_mod >= 2
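To make the formulation concrete, here is a toy C sketch that finds the minimal (as-early-as-possible) solution of a few difference constraints by iterative relaxation. This is only an illustration: LegUp actually hands the constraints, with the ASAP objective described later, to an LP solver. The mod → xor → shr → or chain follows the slides; the 4-op DFG and relaxation loop are our stand-in.

#include <stdio.h>

/* Toy ASAP solve of SDC constraints by iterative relaxation.  Each
   constraint has the form t[v] - t[u] >= d.  Starting from t = 0 and
   repeatedly raising t[v] to t[u] + d reaches the component-wise
   minimal schedule for an acyclic DFG.  Illustration only: LegUp
   formulates an LP and calls a standard solver. */

enum { MOD, XOR, SHR, OR, NUM_OPS };        /* chain from the slides */

typedef struct { int u, v, d; } Constraint; /* t[v] - t[u] >= d */

int main(void) {
    Constraint c[] = {
        { MOD, XOR, 0 },  /* dependency: xor after mod (chaining OK) */
        { XOR, SHR, 0 },  /* dependency: shr after xor */
        { SHR, OR,  0 },  /* dependency: or after shr */
        { MOD, OR,  2 },  /* timing: 23 ns path, 10 ns period -> R = 2 */
    };
    int n = sizeof c / sizeof c[0], t[NUM_OPS] = { 0 };

    for (int changed = 1; changed; ) {      /* relax to a fixpoint */
        changed = 0;
        for (int i = 0; i < n; i++)
            if (t[c[i].v] < t[c[i].u] + c[i].d) {
                t[c[i].v] = t[c[i].u] + c[i].d;
                changed = 1;
            }
    }
    printf("mod@%d xor@%d shr@%d or@%d\n", t[MOD], t[XOR], t[SHR], t[OR]);
    return 0;  /* prints: mod@0 xor@0 shr@0 or@2 */
}

Note how the pure dependency constraints allow chaining (all in cycle 0), while the timing constraint forces the or two cycles after the mod.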
Resource Constraints
• Restriction on the # of operations of a given type that can execute in a cycle
• Why do we need this?
– We want to use dual-port RAMs in the FPGA
• Allow up to 2 load/store operations per cycle
– Floating point
• We do not want to instantiate many FP cores of a given type, probably just one
• Scheduling must honour the # of FP cores available

Resource Constraints in SDC
• Resource-constrained scheduling is NP-hard
• LegUp implements the approach in [Cong & Zhang, DAC 2006]
• Example: a DFG of adders A–H, to be scheduled with only 2 adders in the hardware (lab #2)

Add SDC Constraints
• Generate a topological ordering of the resource-constrained operations:
A B C E F D G H
• Say we are constrained to 2 adders in hardware
• Starting at C (the third op) in the ordering, create a constraint: t_C − t_A > 0
• Next consider E, and add the constraint: t_E − t_B > 0
• Continue to the end: each constrained operation must be scheduled at least one cycle after the operation two positions earlier in the ordering
• The resulting schedule will have <= 2 adds per cycle

ASAP Objective Function
• Minimize the sum of the variables:

minimize f = Σ t_i, summed over all operations i ∈ Ops

• Operations will be scheduled as early as possible, subject to the constraints
• The LP is solvable in polynomial time

High-Level Synthesis: Binding
• Weighted bipartite matching-based binding
– Huang, Chen, Lin, Hsu, "Data path allocation based on bipartite weighted matching", DAC 1990: 499-504.
• A bipartite graph connects operations on one side to hardware functional units on the other, with edge costs
• Find the minimum-weight matching of the bipartite graph at each step
– Solved using the Hungarian method (polynomial time)

Binding Example
• Bind a scheduled program with multiply operations in states 0–3
• Resource sharing: the schedule requires 3 multipliers
• Bind one cycle at a time: match each state's multiplies to the 3 functional units, updating the edge costs as units accumulate operations
• After all four states, the three multipliers are shared by 3, 2, and 2 operations respectively, which determines the multiplexing required on their inputs
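To illustrate one step of the matching, the sketch below brute-forces the minimum-cost assignment of one cycle's three multiply operations to the three multipliers. The cost matrix is made up for illustration (e.g., a lower cost where an operand is already wired to that unit); LegUp uses the Hungarian method cited above, which solves the same problem in polynomial time.

#include <stdio.h>
#include <limits.h>

/* Toy per-cycle binder: exhaustively search the min-cost assignment
   of this cycle's operations to functional units.  A stand-in for
   the Hungarian method on a tiny instance; costs are illustrative. */

#define N 3  /* 3 multiply ops in this cycle, 3 multipliers */

static int cost[N][N] = {   /* cost[op][fu], e.g., extra mux inputs */
    { 0, 1, 1 },
    { 1, 0, 1 },
    { 1, 1, 0 },
};

static int best_cost = INT_MAX, best[N], cur[N], used[N];

static void search(int op, int sum) {
    if (op == N) {                       /* all ops assigned */
        if (sum < best_cost) {
            best_cost = sum;
            for (int i = 0; i < N; i++) best[i] = cur[i];
        }
        return;
    }
    for (int fu = 0; fu < N; fu++)       /* try each free unit */
        if (!used[fu]) {
            used[fu] = 1; cur[op] = fu;
            search(op + 1, sum + cost[op][fu]);
            used[fu] = 0;
        }
}

int main(void) {
    search(0, 0);
    for (int i = 0; i < N; i++)
        printf("op %d -> multiplier %d\n", i, best[i]);
    printf("total cost: %d\n", best_cost);
    return 0;
}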
High-Level Synthesis: Challenges
• It is easy to extract instruction-level parallelism using dependencies within a basic block
• But C code is inherently sequential, and it is difficult to extract higher-level parallelism
• Coarse-grained parallelism:
– function pipelining
• Fine-grained parallelism:
– loop pipelining

Loop Pipelining Motivating Example

for (int i = 0; i < N; i++) {
  sum[i] = a + b + c + d;
}

• Scheduled serially, the three additions take 3 cycles per iteration:
– Cycles: 3N
– Adders: 3
– Utilization: 33%
• Pipelined, a new iteration starts every cycle; iterations i, i+1, i+2 overlap, and after a 2-cycle ramp-up the pipeline reaches steady state:
– Cycles: N + 2 (~1 cycle per iteration)
– Adders: 3
– Utilization: 100% in steady state

Loop Pipelining Example

for (int i = 0; i < N; i++) {
  a[i] = b[i] + c[i];
}

• Each iteration requires:
– 2 loads from memory
– 1 store
• No dependencies between iterations
• Cycle latency of operations:
– Load: 2 cycles
– Store: 1 cycle
– Add: 1 cycle
• Single memory port

LLVM Instructions
• The loop body in LLVM IR:

%i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
%scevgep5 = getelementptr %b, %i.04
%0 = load %scevgep5
%scevgep6 = getelementptr %c, %i.04
%1 = load %scevgep6
%2 = add nsw i32 %1, %0
%scevgep = getelementptr %a, %i.04
store %2, %scevgep
%3 = add %i.04, 1
%exitcond = eq %3, 100
br %exitcond, %bb2, %bb

Scheduling LLVM Instructions
• With a single memory port, the two loads and the store cannot issue freely: placing two memory operations in the same cycle causes a memory port conflict
• Scheduled sequentially, the loop requires 6 cycles per iteration

Loop Pipelining: Initiation Interval
• Initiation interval (II): the constant time interval between starting successive iterations of the loop
• The sequential schedule gives II = 6
• Can we do better?

Minimum Initiation Interval
• Resource minimum II, due to the limited # of functional units:

ResMII = ceiling(uses of a functional unit / # of that functional unit)

• Recurrence minimum II (RecMII), due to loop-carried dependencies
• Minimum II = max(ResMII, RecMII)

Resource Constraints
• Assume unlimited functional units (adders, ...)
• The only constraint: the single-ported memory controller
• Build a reservation table of memory-port uses per cycle
• With 3 memory operations per iteration and 1 port, the resource minimum initiation interval is 3

Iterative Modulo Scheduling
• There are no loop-carried dependencies, so minimum II = ResMII = 3
• Iterative: it is not always possible to schedule the loop at the minimum II:

II = minII
attempt to modulo schedule the loop with II
  success → done
  fail → II = II + 1, try again

• Operations in the loop that execute in cycle i must also execute in cycles i + k·II, for k = 0 to N−1
• Therefore, to detect resource conflicts, look in the reservation table under slot: (i − 1) mod II + 1
• Hence the name "modulo scheduling"
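The sketch below walks the example loop through a modulo reservation table (MRT) for II = 3: each memory operation placed in cycle c claims slot (c − 1) mod II + 1, and the single port allows one memory op per slot. The greedy placement is ours for illustration; it hits the same kind of slot conflict the next slide describes, though the exact cycles differ from the slides' schedule.

#include <stdio.h>

int main(void) {
    const int II = 3;      /* minII = ResMII: 3 memory ops / 1 port */
    int mrt[4] = { 0 };    /* modulo reservation table, slots 1..3 */
    int c;

    /* load b: ready at cycle 1; claim the first free modulo slot */
    c = 1;
    while (mrt[(c - 1) % II + 1]) c++;
    int t_ldb = c; mrt[(c - 1) % II + 1] = 1;   /* cycle 1, slot 1 */

    /* load c: also ready at cycle 1; slot 1 is taken, so cycle 2 */
    c = 1;
    while (mrt[(c - 1) % II + 1]) c++;
    int t_ldc = c; mrt[(c - 1) % II + 1] = 1;   /* cycle 2, slot 2 */

    /* add: needs both loads (2-cycle load latency); no memory port */
    int t_add = (t_ldb > t_ldc ? t_ldb : t_ldc) + 2;   /* cycle 4 */

    /* store: after the add (1-cycle latency); cycle 5 maps to slot 2,
       already taken by a load, so the store slides to cycle 6, slot 3 */
    c = t_add + 1;
    while (mrt[(c - 1) % II + 1]) c++;
    int t_st = c; mrt[(c - 1) % II + 1] = 1;

    printf("II=%d: load b@%d, load c@%d, add@%d, store@%d\n",
           II, t_ldb, t_ldc, t_add, t_st);
    return 0;  /* all 3 memory ops land in distinct slots, so one
                  iteration can start every 3 cycles */
}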
New Pipelined Schedule: Modulo Reservation Table
• In the slides' schedule, the store could not be scheduled in cycle 6:
– Slot = (6 − 1) mod 3 + 1 = 3
– That slot was already taken by an earlier load, so the store moves to a later cycle

Iterative Modulo Scheduling
• Now we have a valid schedule for II = 3
• We need to construct the loop kernel, prologue, and epilogue
• The loop kernel is what executes when the pipeline is in steady state
– The kernel is executed every II cycles
• First we divide the schedule into stages of II cycles each

Pipelined Loop Iterations
• With 3-cycle stages (stages 1, 2, 3), iterations i = 0, 1, 2, ... overlap: while iteration i is in stage 3, iteration i+1 is in stage 2 and iteration i+2 is in stage 1
• Prologue: the pipeline fills; kernel: steady state; epilogue: the pipeline drains

Loop Dependencies

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    a[j] = b[i] + a[j-1];  /* depends on the previous iteration */

• Loop-carried dependencies may cause a non-zero recurrence minimum II
• Several papers in FPGA 2013 deal with discovering/optimizing loop dependencies

Limitations and Current Research

LegUp HLS Limitations
• HLS will likely do better on the datapath-oriented parts of a design
• Results are likely quite sensitive to how loops are structured in your C code
• It is difficult for HLS to "beat" optimized, structured hardware design

FPGA/Altera-Specific Aspects of LegUp
• Memory
– On-chip (AltSyncRAM), off-chip (DDR2/SDRAM controller)
• IP cores
– Divider, floating-point units
• On-chip SoC interconnect
– Avalon interface
• LegUp-generated Verilog is fairly FPGA-agnostic:
– Not difficult to migrate to target ASICs

Current Research Work
• Impact of compiler optimizations on HLS
• Enhanced parallel accelerator support
– Combining Pthreads + OpenMP
• Smaller processor
• Improved loop pipelining
• Software fallback for bitwidth-optimized accelerators
• Enhanced GUI to display the CDFG connected with the schedule

Current Work: PCIe Support
• Enable use of LegUp-generated accelerators in an HPC environment
– Communicating with an x86 processor via PCIe
• Message passing or memory transfers
– Software API for fpga_malloc, fpga_free, send, receive
• DE4 / Stratix IV support in the next LegUp release

On to the Labs!