LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems
Students: Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona
Faculty: Jason Anderson, Stephen Brown
Industrial Advisors: Tom Czajkowski

Motivation
• Hardware design has advantages over software:
  – Speed
  – Energy-efficiency
• Hardware design is difficult and skills are rare:
  – 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers
*US Bureau of Labour Statistics ‘08

Top-Level Vision
    int FIR(int ntaps, int sum) {
        int i;
        for (i=0; i < ntaps; i++)
            sum += h[i] * z[i];
        return (sum);
    }
    ....
[Diagram labels: Program code; C Compiler; Self-Profiling Processor (MIPS); Profiling Data: Execution Cycles, Power, Cache Misses; Suggested program segments to target to HW; High-level synthesis; Hardened program segments; FPGA fabric; Altered SW binary (calls HW accelerators)]
[Names on diagram: Mark Aldham, Jongsok Choi, Andrew Canis, Victor Zhang, Ahmed Kammoona]

LegUp: Key Features
• C to Verilog high-level synthesis
• 13 C code benchmarks
• MIPS processor
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
  – Like ABC (Synthesis) or VPR (Place & Route)

System Architecture
[Diagram labels: FPGA; MIPS Processor; Hardware Accelerator; Hardware Accelerator; AVALON BUS; On-Chip Memory; Memory Controller; Off-Chip Memory]

High-Level Synthesis Framework
• Leverage LLVM compiler infrastructure:
  – Language support: C/C++
  – Standard compiler optimizations
• We support a large subset of ANSI C:

Supported             Unsupported
Functions             Dynamic Memory
Arrays, Structs       Floating Point
Global Variables      Recursion
Pointer Arithmetic

LLVM-Based High-Level Synthesis
[Diagram: User Constraints, Target H/W Characterization; Allocation → Scheduling → Binding → Generate Verilog]
• Flexible compiler pass architecture
  – Passes can be swapped for alternate algorithms

High-Level Synthesis Framework
• Scheduler: As Soon As Possible
  – Operator chaining
  – Multi-cycle operations: divide, multiply
• Binding: Weighted Bipartite Matching
  – Multiplexers are expensive on an FPGA
    • Only share dividers and multipliers
  – FPGA is register-rich
    • No register sharing

13 C Benchmarks
• 12 CHStone Benchmarks (JIP’09) and Dhrystone
  – Too large/complex for academic HLS tools
    • Not supported by academic tools
• Include golden input/output test vectors

Category     Benchmarks                                    Lines of C code
Arithmetic   64-bit double precision: add, mult, div, sin  376 – 755
Encryption   AES, Blowfish, SHA                            716 – 1,406
Processor    MIPS processor                                232
Media        JPEG decoder, Motion, GSM, ADPCM              393 – 1,692
General      Dhrystone                                     491

Experimental Results
1. Pure software on MIPS
Hybrid (software/hardware):
2. Second most compute-intensive function (and descendants) in H/W
3. Same as 2 but with most compute-intensive
4. Pure hardware using LegUp
5. Pure hardware using eXcite (commercial tool)
[Figure: Execution time (geometric mean) and # of LEs (geometric mean) for the five configurations above]

Experimental Results
[Figure: Energy Consumption; Energy (μJ) (geometric mean)]
• 18x less energy than software

Comparison: LegUp vs eXcite
• Benchmarks compiled to hardware
• eXcite: Commercial high-level synthesis tool
  • Couldn’t compile Dhrystone

Geomean        Circuit Runtime (μs)   Logic Elements   Area-Delay Product
LegUp          292                    15,646           4.57M
eXcite         357                    13,101           4.68M
LegUp/eXcite   0.82 (1.22x)           1.19             0.98

Performance: LegUp vs eXcite
(Ratio = LegUp/eXcite)

               Cycles                   Fmax (MHz)              Time (μs)
Circuit        LegUp     eXcite  Ratio  LegUp  eXcite  Ratio    LegUp   eXcite  Ratio
adpcm         36,795     21,992   1.67     46      29   1.59      804      761   1.06
aes           14,022     55,679   0.25     61      51   1.20      231    1,093   0.21
blowfish     209,866    209,614   1.00     65      36   1.81    3,208    5,845   0.55
dfadd          2,330        370   6.30    124      25   4.96       19       15   1.27
dfdiv          2,144      2,029   1.06     75      44   1.70       29       46   0.63
dfmul            347        223   1.56     86      49   1.76        4        5    0.8
dfsin         67,466     49,709   1.36     63      40   1.58    1,077    1,241   0.87
gsm            6,656      5,739   1.16     59      42   1.40      113      137   0.82
jpeg       5,861,516  3,248,488   1.80     47      23   2.04  124,475  143,358   0.87
mips           6,443      4,344   1.48     90      76   1.18       72       57   1.26
motion         8,578      2,268   3.78     92      43   2.14       93       53   1.75
sha          247,738    238,009   1.04     87      62   1.40    2,850    3,809   0.75
Geomean       20,854     14,594   1.43     72      41   1.76      292      357   0.82

Circuit Runtime: LegUp vs eXcite
[Figure: LegUp/eXcite runtime ratio per circuit (adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, gsm, jpeg, mips, motion, sha); Geomean: 0.82]

Comparison: Software vs Hardware
• Software: Benchmarks run on MIPS
• Hardware: LegUp flow (targeting 100% HW)

Geomean                  LegUp     MIPS      LegUp/MIPS
Benchmark Runtime (μs)   292       2,334     0.12 (8x)
Logic Elements           15,646    12,243    1.28
Multipliers              12        16        0.75
Memory Bits              28,822    226,009   0.13

Benchmark Runtime: LegUp vs MIPS
[Figure: Speedup per benchmark (adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, gsm, jpeg, mips, motion, sha, dhrystone); Geomean: 8x]

Ongoing Work
• Architecture
  – Memory hierarchy
  – Multiple clock domains
• High-level synthesis
  – Modulo Scheduling for loop pipelining
  – Refactoring code for release in March
• Profiling
  – Automatically detect functions to move to H/W
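
Illustration: As Soon As Possible scheduling
The scheduler slide names three ideas: As Soon As Possible scheduling, operator chaining, and multi-cycle operations such as divide and multiply. The sketch below shows one way these could fit together. It is not LegUp's scheduler. The Op struct, the asap_schedule function, the per-operation delay figures, and the rule that a multi-cycle unit starts on a cycle boundary are all assumptions made for illustration.

```c
#include <assert.h>

#define MAX_OPS   64
#define MAX_PREDS 4

/* Hypothetical operation record (not a LegUp data structure). */
typedef struct {
    int    latency;          /* 0 = combinational (chainable), >0 = multi-cycle */
    double delay_ns;         /* combinational delay, used when latency == 0 */
    int    n_preds;          /* number of data dependencies */
    int    preds[MAX_PREDS]; /* indices of earlier operations */
} Op;

/*
 * ASAP scheduling over operations given in topological order.
 * Each operation starts in the earliest cycle in which all inputs are ready.
 * Chaining: a combinational op shares a cycle with its producer while the
 * accumulated delay fits within clock_ns; otherwise it moves to the next cycle.
 * Assumption: a multi-cycle op starts on a cycle boundary.
 * Writes start_cycle[i] and returns the total number of cycles.
 */
int asap_schedule(const Op *ops, int n, double clock_ns, int *start_cycle)
{
    int    ready_cycle[MAX_OPS]; /* cycle in which the result is available */
    double ready_ns[MAX_OPS];    /* arrival time inside that cycle */
    int    total = 0;

    assert(n >= 0 && n <= MAX_OPS);
    for (int i = 0; i < n; i++) {
        int    cycle = 0;
        double ns = 0.0;
        int    end;

        assert(ops[i].n_preds >= 0 && ops[i].n_preds <= MAX_PREDS);
        /* Latest-arriving input decides the earliest possible start. */
        for (int k = 0; k < ops[i].n_preds; k++) {
            int p = ops[i].preds[k];
            assert(p >= 0 && p < i);
            if (ready_cycle[p] > cycle ||
                (ready_cycle[p] == cycle && ready_ns[p] > ns)) {
                cycle = ready_cycle[p];
                ns = ready_ns[p];
            }
        }

        if (ops[i].latency == 0) {
            /* Chain if the path still fits in the clock period. */
            if (ns > 0.0 && ns + ops[i].delay_ns > clock_ns) {
                cycle++;
                ns = 0.0;
            }
            start_cycle[i] = cycle;
            ready_cycle[i] = cycle;
            ready_ns[i] = ns + ops[i].delay_ns;
            end = cycle + 1;
        } else {
            /* Inputs arriving mid-cycle are registered first. */
            if (ns > 0.0)
                cycle++;
            start_cycle[i] = cycle;
            ready_cycle[i] = cycle + ops[i].latency;
            ready_ns[i] = 0.0;
            end = cycle + ops[i].latency;
        }
        if (end > total)
            total = end;
    }
    return total;
}
```

For a chain add, add, multiply (2 cycles), add with 3 ns adders, an 8 ns clock lets the first two adds share cycle 0, while a 5 ns clock forces them into separate cycles. A shorter clock period therefore trades extra cycles for a higher Fmax, which is the tension visible in the Cycles and Fmax columns of the performance table.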
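
Illustration: how the comparison metrics relate
The comparison tables report runtime, logic elements, area-delay product, and LegUp/eXcite ratios. The sketch below shows the arithmetic that appears to connect them, assuming runtime is cycles divided by Fmax and area-delay product is logic elements times runtime. The function names are hypothetical and chosen only for this example.

```c
#include <assert.h>

/* Runtime in microseconds: cycles / MHz (1 MHz is one cycle per microsecond). */
double runtime_us(double cycles, double fmax_mhz)
{
    assert(fmax_mhz > 0.0);
    return cycles / fmax_mhz;
}

/* Area-delay product: logic elements times runtime in microseconds.
 * Lower is better. */
double area_delay_product(double logic_elements, double runtime)
{
    return logic_elements * runtime;
}

/* Ratio a/b, as used in the LegUp/eXcite and LegUp/MIPS columns. */
double ratio(double a, double b)
{
    assert(b != 0.0);
    return a / b;
}
```

With the geomean figures from the tables, area_delay_product(15646, 292) is about 4.57M and area_delay_product(13101, 357) is about 4.68M, giving a ratio of about 0.98. Likewise ratio(292, 357) is about 0.82, and ratio(2334, 292) is about 8, matching the reported 8x speedup over software. Per-circuit times computed as cycles / Fmax differ slightly from the table (for aes, 14,022 / 61 is about 230 versus 231), presumably because the Fmax values shown are rounded.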