LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems
Students: Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona
Faculty: Jason Anderson, Stephen Brown
Industrial Advisors: Tom Czajkowski
Motivation
• Hardware design has advantages over software:
– Speed
– Energy-efficiency
• Hardware design is difficult and skills are rare:
– 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers
*US Bureau of Labour Statistics ‘08
Top-Level Vision
[Figure: design flow. Program code -> C Compiler -> Self-Profiling Processor (MIPS) -> Profiling Data (Execution Cycles, Power, Cache Misses) -> Suggested program segments to target to HW -> High-level synthesis -> Hardened program segments. The FPGA fabric holds the processor (μP) and the hardened program segments; the altered SW binary calls the HW accelerators.]
Names shown on the figure: Mark Aldham, Jongsok Choi, Andrew Canis, Victor Zhang, Ahmed Kammoona
Program code shown in the figure:
    int FIR(int ntaps, int sum) {
        int i;
        for (i = 0; i < ntaps; i++)
            sum += h[i] * z[i];
        return (sum);
    }
    ....
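The FIR fragment on this slide uses h and z without declaring them. A minimal compilable sketch follows, assuming both are fixed-size global int arrays; MAX_TAPS and the array contents are made up for illustration and do not come from the slides.

```c
#include <assert.h>

#define MAX_TAPS 8  /* assumed capacity; not from the slide */

/* Assumed globals: the slide references h and z but does not declare them. */
static int h[MAX_TAPS] = {1, 2, 3, 4, 0, 0, 0, 0};  /* filter coefficients (made-up values) */
static int z[MAX_TAPS] = {4, 3, 2, 1, 0, 0, 0, 0};  /* delayed input samples (made-up values) */

/* Multiply-accumulate over the first ntaps taps, starting from sum. */
int FIR(int ntaps, int sum) {
    int i;
    assert(ntaps >= 0 && ntaps <= MAX_TAPS);  /* host-side bounds check only */
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}
```

With these made-up values, FIR(4, 0) accumulates 1*4 + 2*3 + 3*2 + 4*1.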
LegUp: Key Features
• C to Verilog high-level synthesis
• 13 C code benchmarks
• MIPS processor
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
– Like ABC (Synthesis) or VPR (Place & Route)
System Architecture
[Figure: block diagram of an FPGA containing a MIPS Processor and two Hardware Accelerators, connected over the AVALON BUS to On-Chip Memory and a Memory Controller; the Memory Controller connects to Off-Chip Memory]
High-Level Synthesis Framework
• Leverage LLVM compiler infrastructure:
– Language support: C/C++
– Standard compiler optimizations
• We support a large subset of ANSI C:
Supported             Unsupported
Functions             Dynamic Memory
Arrays, Structs       Floating Point
Global Variables      Recursion
Pointer Arithmetic
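As an illustration of code that stays inside the supported subset listed above, here is a small sketch that uses a function, an array, a struct, a global variable, and pointer arithmetic, and avoids dynamic memory, floating point, and recursion. All names (samples, Stats, compute_stats, factorial) are hypothetical and not taken from LegUp or its benchmarks.

```c
#include <assert.h>

#define N 4

/* Global array (global variables are listed as supported). */
int samples[N] = {3, -1, 4, 1};

/* Struct (listed as supported). */
struct Stats {
    int min;
    int max;
    int sum;
};

/* Walks the array with pointer arithmetic and fills in *out. */
void compute_stats(const int *data, int n, struct Stats *out) {
    const int *p;
    assert(n > 0);  /* host-side sanity check only */
    out->min = out->max = *data;
    out->sum = 0;
    for (p = data; p < data + n; p++) {
        if (*p < out->min) out->min = *p;
        if (*p > out->max) out->max = *p;
        out->sum += *p;
    }
}

/* Iterative factorial: a loop replaces the usual recursive form,
   since recursion is listed as unsupported. */
int factorial(int n) {
    int r = 1;
    while (n > 1)
        r *= n--;
    return r;
}
```

Calling compute_stats(samples, N, &s) on a local struct Stats s fills in the minimum, maximum, and sum without any heap allocation.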
LLVM-Based High-Level Synthesis
[Figure: flow with inputs User Constraints and Target H/W Characterization, then Allocation -> Scheduling -> Binding -> Generate Verilog]
• Flexible compiler pass architecture
– Passes can be swapped for alternate algorithms
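The idea of swappable passes can be sketched as a table of function pointers that all operate on one shared design state. This is only an analogy written in C: it is not LLVM's pass API or LegUp's code, and the Design struct, the pass names, and their toy cost models are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical design state threaded through the passes. */
struct Design {
    int num_ops;     /* operations in the function being synthesized */
    int num_states;  /* filled in by a scheduling pass */
    int num_units;   /* filled in by a binding pass */
};

typedef void (*Pass)(struct Design *);

/* Two interchangeable scheduling passes (toy models, not real algorithms). */
void schedule_one_op_per_state(struct Design *d) { d->num_states = d->num_ops; }
void schedule_two_ops_per_state(struct Design *d) { d->num_states = (d->num_ops + 1) / 2; }

/* A toy binding pass: one functional unit per operation. */
void bind_one_unit_per_op(struct Design *d) { d->num_units = d->num_ops; }

/* Run each pass in order over the same design state. */
void run_pipeline(const Pass *passes, size_t count, struct Design *d) {
    size_t i;
    assert(passes != NULL && d != NULL);
    for (i = 0; i < count; i++)
        passes[i](d);
}
```

Swapping the scheduler then means changing one entry in the pass array, while the rest of the pipeline stays untouched.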
High-Level Synthesis Framework
• Scheduler: As Soon As Possible
– Operator chaining
– Multi-cycle operations: divide, multiply
• Binding: Weighted Bipartite Matching
– Multiplexers are expensive on an FPGA
    • Only share dividers and multipliers
– FPGA is register-rich
    • No register sharing
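As-soon-as-possible scheduling with multi-cycle operations can be sketched as follows. This is a generic textbook formulation, not LegUp's scheduler: it assumes the operations are given in topological order, it does not model operator chaining, and the Op struct and the latency values used with it are illustrative assumptions.

```c
#include <assert.h>

#define MAX_OPS 16
#define MAX_PREDS 4

/* One operation in a data-flow graph. Operations must be listed in
   topological order: every predecessor index is smaller than the op's own. */
struct Op {
    int latency;           /* cycles the op occupies (more than 1 for multi-cycle ops) */
    int num_preds;         /* number of valid entries in preds */
    int preds[MAX_PREDS];  /* indices of operations this one depends on */
};

/* ASAP: start each op in the earliest cycle at which all of its
   predecessors have finished. Writes start cycles into start[] and
   returns the total schedule length in cycles. */
int asap_schedule(const struct Op *ops, int n, int *start) {
    int i, j, length = 0;
    assert(n >= 0 && n <= MAX_OPS);
    for (i = 0; i < n; i++) {
        start[i] = 0;
        for (j = 0; j < ops[i].num_preds; j++) {
            int p = ops[i].preds[j];
            int ready;
            assert(p >= 0 && p < i);  /* enforces topological order */
            ready = start[p] + ops[p].latency;
            if (ready > start[i])
                start[i] = ready;
        }
        if (start[i] + ops[i].latency > length)
            length = start[i] + ops[i].latency;
    }
    return length;
}
```

For example, two independent 2-cycle multiplies feeding a 1-cycle add, followed by a divide assumed to take 4 cycles, start at cycles 0, 0, 2, and 3.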
13 C Benchmarks
• 12 CHStone Benchmarks (JIP’09) and Dhrystone
– Too large/complex for academic HLS tools
• Include golden input/output test vectors
Category     Benchmarks                                     Lines of C code
Arithmetic   64-bit double precision: add, mult, div, sin   376 – 755
Encryption   AES, Blowfish, SHA                             716 – 1,406
Processor    MIPS processor                                 232
Media        JPEG decoder, Motion, GSM, ADPCM               393 – 1,692
General      Dhrystone                                      491
• Not supported by academic tools
Experimental Results
1. Pure software on MIPS
Hybrid (software/hardware):
2. Second most compute-intensive function (and descendants) in H/W
3. Same as 2, but with the most compute-intensive function
4. Pure hardware using LegUp
5. Pure hardware using eXcite (commercial tool)
[Figure: bar chart of # of LEs (geometric mean, axis 0 to 40,000) and execution time (geometric mean, axis 0 to 2,500) for the five configurations above]
Experimental Results
[Figure: bar chart of energy consumption, Energy (μJ) (geometric mean), axis 0 to 600]
18x less energy than software
Comparison: LegUp vs eXcite
• Benchmarks compiled to hardware
• eXcite: Commercial high-level synthesis tool
– Couldn’t compile Dhrystone
Geomean         Circuit Runtime (μs)   Logic Elements   Area-Delay Product
LegUp           292                    15,646           4.57M
eXcite          357                    13,101           4.68M
LegUp/eXcite    0.82 (1.22x)           1.19             0.98
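The area-delay product and the ratio row can be rechecked with a few lines of arithmetic. The sketch below assumes the area-delay product is simply logic elements multiplied by circuit runtime in μs, which matches the geomean figures on this slide; the function names are illustrative.

```c
#include <assert.h>

/* Area-delay product: logic elements multiplied by circuit runtime in microseconds. */
double area_delay(double logic_elements, double runtime_us) {
    assert(logic_elements >= 0.0 && runtime_us >= 0.0);
    return logic_elements * runtime_us;
}

/* Ratio of a LegUp metric to the same eXcite metric;
   a value below 1.0 means LegUp's number is the smaller one. */
double ratio(double legup, double excite) {
    assert(excite > 0.0);
    return legup / excite;
}
```

With the slide's numbers, area_delay(15646, 292) is 4,568,632 (about 4.57M) and area_delay(13101, 357) is 4,677,057 (about 4.68M), and their ratio rounds to 0.98.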
Performance: LegUp vs eXcite
Circuit    LegUp       eXcite      LegUp/   LegUp   eXcite  LegUp/   LegUp     eXcite    LegUp/
           Cycles      Cycles      eXcite   Fmax    Fmax    eXcite   Time      Time      eXcite
                                            (MHz)   (MHz)            (μs)      (μs)
adpcm      36,795      21,992      1.67     46      29      1.59     804       761       1.06
aes        14,022      55,679      0.25     61      51      1.20     231       1,093     0.21
blowfish   209,866     209,614     1.00     65      36      1.81     3,208     5,845     0.55
dfadd      2,330       370         6.30     124     25      4.96     19        15        1.27
dfdiv      2,144       2,029       1.06     75      44      1.70     29        46        0.63
dfmul      347         223         1.56     86      49      1.76     4         5         0.8
dfsin      67,466      49,709      1.36     63      40      1.58     1,077     1,241     0.87
gsm        6,656       5,739       1.16     59      42      1.40     113       137       0.82
jpeg       5,861,516   3,248,488   1.80     47      23      2.04     124,475   143,358   0.87
mips       6,443       4,344       1.48     90      76      1.18     72        57        1.26
motion     8,578       2,268       3.78     92      43      2.14     93        53        1.75
sha        247,738     238,009     1.04     87      62      1.40     2,850     3,809     0.75
Geomean    20,854      14,594      1.43     72      41      1.76     292       357       0.82
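The time columns follow from the cycle counts and clock frequencies. A minimal sketch of that relationship, assuming the Fmax columns are in MHz and the time columns in μs (the slide does not label the units, but the numbers are consistent with this reading):

```c
#include <assert.h>

/* Runtime in microseconds for a circuit that needs `cycles` clock cycles
   at a clock frequency of `fmax_mhz` MHz (1 MHz is one cycle per microsecond). */
double runtime_us(double cycles, double fmax_mhz) {
    assert(fmax_mhz > 0.0);
    return cycles / fmax_mhz;
}
```

For adpcm under LegUp, runtime_us(36795, 46) gives roughly 800 μs against the 804 μs in the table; the small gap is presumably because the Fmax values on the slide are rounded.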
Circuit Runtime: LegUp vs eXcite
[Figure: bar chart of the LegUp/eXcite circuit runtime ratio per benchmark (adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, gsm, jpeg, mips, motion, sha), axis 0.0 to 2.0]
Geomean: 0.82
Comparison: Software vs Hardware
• Software: Benchmarks run on MIPS
• Hardware: LegUp flow (targeting 100% HW)
Geomean       Benchmark Runtime (μs)   Logic Elements   Multipliers   Memory Bits
LegUp         292                      15,646           12            28,822
MIPS          2,334                    12,243           16            226,009
LegUp/MIPS    0.12 (8x)                1.28             0.75          0.13
Benchmark Runtime: LegUp vs MIPS
[Figure: bar chart of Speedup per benchmark (adpcm, aes, blowfish, dfadd, dfdiv, dfmul, dfsin, gsm, jpeg, mips, motion, sha, dhrystone), axis 0 to 40]
Geomean: 8x
Ongoing Work
• Architecture
– Memory hierarchy
– Multiple clock domains
• High-level synthesis
– Modulo Scheduling for loop pipelining
– Refactoring code for release in March
• Profiling
– Automatically detect functions to move to H/W