High-Level Synthesis with LegUp
A Crash Course for Users and Researchers
Jason Anderson, Stephen Brown,
Andrew Canis, Jongsok (James) Choi
11 February 2013
ACM FPGA Symposium
Monterey, CA
Dept. of Electrical and Computer Engineering
University of Toronto
Tutorial Outline
• Overview of LegUp and its algorithms (60 min)
• Labs (“hands on” via VirtualBox)
– Lab 1: Using the LegUp Framework (30 min)
– Break
– Lab 2: Adding resource constraints (30 min)
– Lab 3: Changing how LegUp implements hardware (30 min)
Project Motivation
• Hardware design has advantages over software:
– Speed
– Energy-efficiency
• Hardware design is difficult and skills are rare:
– 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware
design for software engineers
*US Bureau of Labor Statistics, 2008
Top-Level Vision
    int FIR(int ntaps, int sum) {
        int i;
        for (i = 0; i < ntaps; i++)
            sum += h[i] * z[i];
        return sum;
    }
[Flow diagram: the program code is compiled and runs on a self-profiling MIPS processor, which gathers profiling data (execution cycles, power, cache misses). Profiling suggests program segments to target to hardware; high-level synthesis hardens those segments into accelerators on the FPGA fabric, and the software binary is altered to call the hardware accelerators.]
LegUp: Key Features
• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• MIPS processor (Tiger)
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
– Like ABC (Synthesis) or VPR (Place & Route)
– 600+ downloads since March 2011
– http://legup.eecg.utoronto.ca
System Architecture
[Block diagram: on the FPGA (Cyclone II or Stratix IV), a MIPS processor and one or more hardware accelerators (each with local memory) communicate over the AVALON INTERFACE with a memory controller, which manages an on-chip cache and off-chip memory on the Altera DE2 or DE4 board.]
High-Level Synthesis Framework
• Leverage LLVM compiler infrastructure:
– Language support: C/C++
– Standard compiler optimizations
– More on this shortly
• We support a large subset of ANSI C:
– Supported: functions; arrays and structs; global variables; pointer arithmetic; floating point
– Unsupported: dynamic memory; recursion
Hardware Profiler Architecture
[Block diagram: the profiler watches the MIPS processor's PC and instruction bus. An op decoder detects call/return instructions; a hardware address hash maps each call's target address to a function number (fNum); a call stack tracks the current function across calls and returns; and a counter plus per-function counter storage memory accumulate profiling data.]

Address hash (computed in hardware; V1, A1, A2, B1, B2 are constants and tab is a small lookup table):

    tAddr += V1
    tAddr += (tAddr << 8)
    tAddr ^= (tAddr >> 4)
    b = (tAddr >> B1) & B2
    a = (tAddr + (tAddr << A1)) >> A2
    fNum = a ^ tab[b]
• Monitor the instruction bus to detect function calls/returns.
• Call: hash (in hardware) from the function address to an index; push it onto the stack.
• Return: pop the function index from the stack.
• Use function indexes to associate profiling data (e.g., cycles, power) with counters.
• See our IEEE ASAP'11 paper for details. A compilable model of the hash follows.
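For reference, a compilable C model of the hash above; the constant values and table contents here are illustrative assumptions, not the profiler's actual parameters:

    #include <stdint.h>

    /* Illustrative constants; the real profiler picks these per design. */
    #define V1 0x9E3779B9u
    #define A1 3
    #define A2 20
    #define B1 24
    #define B2 0xFFu

    static uint8_t tab[256];   /* lookup table; contents design-specific */

    /* Map a call's target address to a function number (sketch). */
    uint32_t hash_func_addr(uint32_t tAddr) {
        tAddr += V1;
        tAddr += tAddr << 8;
        tAddr ^= tAddr >> 4;
        uint32_t b = (tAddr >> B1) & B2;
        uint32_t a = (tAddr + (tAddr << A1)) >> A2;
        return a ^ tab[b];
    }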
Processor/Accelerator Hybrid Flow
    int main() {
        ...
        sum = dotproduct(N);
        ...
    }

    int dotproduct(int N) {
        ...
        for (i = 0; i < N; i++) {
            sum += A[i] * B[i];
        }
        return sum;
    }
Processor/Accelerator Hybrid Flow
    int main() {
        ...
        sum = dotproduct(N);
        ...
    }

    #define dotproduct_DATA (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1 (volatile int *) 0xf000000C

    int dotproduct(int N) {
        ...
        for (i = 0; i < N; i++) {
            sum += A[i] * B[i];
        }
        return sum;
    }

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }
Processor/Accelerator Hybrid Flow
    int main() {
        ...
        sum = dotproduct(N);
        ...
    }

The HLS script command set_accelerator_function “dotproduct” directs high-level synthesis to generate a hardware accelerator for dotproduct.
Processor/Accelerator Hybrid Flow
    int main() {
        ...
        sum = legup_dotproduct(N);   /* replaces the call to dotproduct(N) */
        ...
    }

    #define dotproduct_DATA (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1 (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }
Processor/Accelerator Hybrid Flow
#define dotproduct_DATA (volatile int *) 0xf0000000
#define dotproduct_STATUS (volatile int *) 0xf0000008
#define dotproduct_ARG1 (volatile int *) 0xf000000C
    int main() {
        ...
        sum = legup_dotproduct(N);
        ...
    }

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }

Both main and the wrapper run as software (SW) on the MIPS processor: the writes to the memory-mapped registers pass the argument and start the hardware accelerator, and the read returns its result.
How Does LegUp Handle Memory and Pointers?
• LegUp stores each array in a separate FPGA BRAM
• The BRAM data width matches the width of the array elements
• Each BRAM is identified by a 9-bit tag
• Addresses consist of the RAM tag and the array index:
– bits 31–23: 9-bit tag; bits 22–0: 23-bit index
• A shared memory controller uses the tag bits to determine which BRAM to read or write
• The array index is the address passed to the BRAM (a sketch of the encoding follows)
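A minimal C sketch of this address encoding, assuming the bit layout above (the macro and function names are illustrative, not LegUp's):

    #include <stdint.h>

    /* bits 31-23: 9-bit tag, bits 22-0: 23-bit index */
    #define INDEX_BITS 23
    #define INDEX_MASK ((1u << INDEX_BITS) - 1)

    static inline uint32_t make_addr(uint32_t tag, uint32_t index) {
        return (tag << INDEX_BITS) | (index & INDEX_MASK);
    }
    static inline uint32_t addr_tag(uint32_t addr)   { return addr >> INDEX_BITS; }
    static inline uint32_t addr_index(uint32_t addr) { return addr & INDEX_MASK; }

With the tag assignments on the next slide, the address of A[3] is make_addr(2, 3) and the address of B[7] is make_addr(3, 7).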
Pointer Example
• We have two arrays in the C function:
– int A[100], B[100]
• Tag 0 is reserved for NULL pointers
• Tag 1 is reserved for off-chip memory
• Assign tag 2 to array A and tag 3 to array B
• Address of A[3]: Tag=2, Index=3
• Address of B[7]: Tag=3, Index=7
Shared Memory Controller
• Both arrays A and B are stored in 100-element BRAMs
• Load from pointer D with Tag=2, Index=13: the tag selects the BRAM for A (tag 2), index 13 addresses the element, and the controller returns the 32-bit value A[13]

[Figure: the tag steers the request between the two BRAMs (tag 2 holds A[0..99], tag 3 holds B[0..99]); a multiplexer selects the 32-bit output by tag.]
Core Benchmarks (+Many More)
• 12 CHStone Benchmarks (JIP’09) and Dhrystone
– Too large/complex for academic HLS tools
• Include golden input/output test vectors
Category     Benchmarks                                     Lines of C code
Arithmetic   64-bit double-precision add, mult, div, sin    376 – 755
             (not supported by academic tools)
Encryption   AES, Blowfish, SHA                             716 – 1,406
Processor    MIPS processor                                 232
Media        JPEG decoder, Motion, GSM, ADPCM               393 – 1,692
General      Dhrystone                                      491
Experimental Results
LegUp 1.0 (2011) for Cyclone II
1. Pure software on MIPS
Hybrid (software/hardware):
2. Second most compute-intensive function (and its descendants) in H/W
3. Same as 2, but with the most compute-intensive function also in H/W
4. Pure hardware using LegUp
5. Pure hardware using eXCite (commercial tool)
[Bar chart: geometric means of execution time and # of LEs across the five implementations above.]
Experimental Results
Comparison: LegUp vs eXCite
• Benchmarks compiled to hardware
• eXCite: Commercial high-level synthesis tool
• eXCite could not compile Dhrystone
Geomean                 LegUp     eXCite    LegUp/eXCite
Circuit runtime (μs)    292       357       0.82 (1.22x)
Logic elements          15,646    13,101    1.19
Area-delay product      4.57M     4.68M     0.98
Energy Consumption

[Bar chart: geometric-mean energy (μJ) per implementation; the hardware implementations use 18x less energy than software.]
Current Release: LegUp 3.0
• Loop pipelining
• Dual- and multi-ported memory support
• Bitwidth minimization
• Multi-pumping DSP units for area reduction
• Alias analysis for dependency checks
• Parallel accelerators via Pthreads & OpenMP
• Results now considerably better than the LegUp 1.0 release
LegUp 3.0 vs. LegUp 1.0
[Chart: LegUp 3.0 / LegUp 1.0 ratios per CHStone benchmark circuit for wall-clock time, cycles, Fmax, and LEs.]

• Wall-clock time: 16% better
• Cycle latency: 31% better
• Fmax: 18% worse
• LEs (area): 28% better
LLVM Compiler and HLS Algorithms
LLVM Compiler
• Open-source compiler framework.
– http://llvm.org
• Used by Apple, NVIDIA, AMD, others.
• Competitive quality with gcc.
• LegUp HLS is a “back-end” of LLVM.
• LLVM: low-level virtual machine.
LLVM Compiler
• LLVM will compile C code into a
control flow graph (CFG)
• LLVM will perform standard optimizations
– 50+ different optimizations in LLVM
C Program:

    int FIR(int ntaps, int sum) {
        int i;
        for (i = 0; i < ntaps; i++)
            sum += h[i] * z[i];
        return sum;
    }

[Figure: the LLVM compiler turns the C program into a CFG of basic blocks BB0, BB1, BB2.]
Control Flow Graph
• A control flow graph is composed of basic blocks
• A basic block is a sequence of instructions terminated by exactly one branch
– It can be represented by an acyclic data flow graph:
[Figure: a CFG with basic blocks BB0, BB1, BB2; the data flow graph inside a block contains loads, adds, and a store.]
LLVM Details
• Instructions in basic blocks are either primitive computational operations:
– shift, add, divide, xor, and, etc.
• or control-flow operations:
– branch, call, etc.
• The CDFG is represented in LLVM’s
intermediate representation (IR)
– IR is machine-independent assembly code.
High-Level Synthesis Flow
C Program → C Compiler (LLVM) → Optimized LLVM IR → Allocation → Scheduling → Binding → RTL Generation → Synthesizable Verilog

• Target H/W characterization and user constraints (timing, resource) inform allocation, scheduling, and binding.
Scheduling
• Scheduling is the task of assigning operations to clock cycles, implemented as a finite state machine (FSM)

[Figure: a data flow graph's loads, adds, and store scheduled across FSM states 0–3.]
Binding
• Binding is the task of assigning scheduled operations to functional units in the datapath

[Figure: the scheduled operations mapped onto datapath resources: a 2-port RAM serves the loads and store, and a shared adder with a register (FF) serves the adds.]
High-Level Synthesis: Scheduling
SDC Scheduling
• SDC = System of Difference Constraints
– Cong & Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation”, DAC 2006, pp. 433-438.
• Basic idea: formulate scheduling as a
mathematical optimization problem
– Linear objective function + linear constraints
(==, <=, >=).
• The problem is a linear program (LP)
– Solvable in polynomial time with standard solvers
Define Variables
[Figure: a small data flow graph with an add, a shift, and a subtract that consumes their results.]

• For each operation i to schedule, create a variable t_i
• The t_i's hold the cycle number in which each operation is scheduled
• Here we have: t_add, t_shift, t_sub
• The data flow graph (DFG) is already accessible in LLVM
Dependency Constraints
[Figure: the DFG; the subtract depends on the add and the shift.]

• In this example, the subtract can only happen after the add and the shift
• t_sub - t_add >= 0
• t_sub - t_shift >= 0
• Hence the name difference constraints
Handling Clock Period Constraints
• Target clock period: P (e.g., 10 ns)
• For each chain of dependent operations in the DFG, estimate the path delay D using LegUp's delay models
– E.g., for the chain mod → xor → shr → or: D = 23 ns
• Compute: R = ceiling(D/P) - 1
– E.g., R = ceiling(23/10) - 1 = 2
• Add the difference constraint: t_or - t_mod >= 2 (a one-function sketch follows)
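A one-function sketch of computing the constraint's right-hand side (the function name is illustrative):

    #include <math.h>

    /* Minimum cycle separation between the endpoints of a combinational
       chain of delay D (ns) under clock period P (ns): ceil(D/P) - 1.
       E.g., D = 23, P = 10 gives 2, yielding t_or - t_mod >= 2. */
    int min_cycle_sep(double D, double P) {
        return (int)ceil(D / P) - 1;
    }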
Resource Constraints
• Restriction on # of operations of a given type
that can execute in a cycle
• Why do we need it?
– Want to use dual-port RAMs in FPGA
• Allow up to 2 load/store operations in a cycle
– Floating point
• Do not want to instantiate many FP cores of a given
type, probably just one
• Scheduling must honour # of FP cores available
Resource Constraints in SDC
• Res-constrained scheduling is NP-hard.
• LegUp implements the approach of [Cong & Zhang, DAC 2006]

[Figure: a data flow graph of eight add operations, A through H.]

• Say we want to schedule with only 2 adders in the hardware (lab #2)
Add SDC Constraints
• Generate a topological ordering of the resource-constrained operations:
  A B C E F D G H
• Say we are constrained to 2 adders in hardware
• Starting at C in the ordering, create the constraint: t_C - t_A > 0
• Next, consider E and add the constraint: t_E - t_B > 0
• Continue to the end: each op is constrained against the op two positions earlier (a sketch follows below)
• The resulting schedule will have <= 2 adds per cycle
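A minimal sketch of this constraint generation (not LegUp's actual code); for R available units, each op in the topological order is constrained against the op R positions earlier:

    #include <stdio.h>

    /* Emit SDC resource constraints for ops (labeled by letters) in
       topological order, given R available units of the resource. */
    void emit_resource_constraints(const char *order, int n, int R) {
        for (int k = R; k < n; k++)
            printf("t_%c - t_%c > 0\n", order[k], order[k - R]);
    }

    int main(void) {
        /* topological order A B C E F D G H, 2 adders */
        emit_resource_constraints("ABCEFDGH", 8, 2);
        return 0;
    }

The first two constraints printed are exactly those on the slide: t_C - t_A > 0 and t_E - t_B > 0.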
ASAP Objective Function
• Minimize the sum of the variables:

    minimize f = Σ_{i ∈ Ops} t_i

• Operations will be scheduled as early as possible, subject to the constraints
• The LP is solvable in polynomial time (a toy sketch follows)
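LegUp hands this LP to a standard solver. Purely as a toy illustration, the sketch below computes the same ASAP answer for the add/shift/sub example by iteratively relaxing the difference constraints, which converges because the constraint graph is acyclic. It assumes each operation takes one cycle, so dependents are separated by at least 1 (the chaining-friendly constraints on the earlier slide use >= 0 instead):

    #include <stdio.h>

    enum { ADD, SHIFT, SUB, NOPS };

    /* t[to] - t[from] >= sep */
    typedef struct { int from, to, sep; } Constraint;

    int main(void) {
        /* sub depends on add and shift; 1-cycle ops => separation 1 */
        Constraint c[] = { {ADD, SUB, 1}, {SHIFT, SUB, 1} };
        int nc = 2, t[NOPS] = {0};

        /* relax until fixed point: pushes each op as early as allowed */
        for (int changed = 1; changed; ) {
            changed = 0;
            for (int i = 0; i < nc; i++)
                if (t[c[i].to] < t[c[i].from] + c[i].sep) {
                    t[c[i].to] = t[c[i].from] + c[i].sep;
                    changed = 1;
                }
        }
        /* prints t_add=0 t_shift=0 t_sub=1 */
        printf("t_add=%d t_shift=%d t_sub=%d\n", t[ADD], t[SHIFT], t[SUB]);
        return 0;
    }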
High-Level Synthesis: Binding
• Weighted bipartite matching-based binding
– Huang, Chen, Lin, Hsu, “Data path allocation based on bipartite weighted matching”, DAC 1990, pp. 499-504.
• Finds the minimum-weight matching of a bipartite graph at each step
– Solved with the Hungarian method (polynomial time); a toy sketch follows

[Figure: a bipartite graph with operations on one side, hardware functional units on the other, and edge costs on the connections.]
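To make the cost-based matching concrete, here is a toy sketch that binds three operations to three functional units. For clarity it uses exhaustive search over assignments rather than the Hungarian method the slides reference, and the edge costs are made up:

    #include <stdio.h>
    #include <limits.h>

    /* cost[i][j]: cost of binding operation i to functional unit j
       (e.g., extra mux inputs created by the assignment) */
    #define N 3
    int cost[N][N] = { {4, 1, 3}, {2, 0, 5}, {3, 2, 2} };

    int best_cost = INT_MAX, best[N], perm[N], used[N];

    /* try every one-to-one assignment of ops to units */
    void search(int op, int acc) {
        if (op == N) {
            if (acc < best_cost) {
                best_cost = acc;
                for (int i = 0; i < N; i++) best[i] = perm[i];
            }
            return;
        }
        for (int u = 0; u < N; u++)
            if (!used[u]) {
                used[u] = 1; perm[op] = u;
                search(op + 1, acc + cost[op][u]);
                used[u] = 0;
            }
    }

    int main(void) {
        search(0, 0);
        for (int i = 0; i < N; i++)
            printf("op %d -> unit %d\n", i, best[i]);
        printf("total cost %d\n", best_cost);   /* 5 for these costs */
        return 0;
    }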
Binding
• Bind the scheduled program one cycle at a time
• Resource sharing: the schedule requires only 3 multipliers

[Animated figure: the operations of states 0–3 are bound cycle by cycle onto the three functional units; operations from different states that share a unit require multiplexing on the unit's inputs (“Required Multiplexing”).]
High-Level Synthesis: Challenges
• It is easy to extract instruction-level parallelism using dependencies within a basic block
• But C code is inherently sequential, and it is difficult to extract higher-level parallelism
• Coarse-grained parallelism:
– function pipelining
• Fine-grained parallelism:
– loop pipelining
Loop Pipelining
Motivating Example
    for (int i = 0; i < N; i++) {
        sum[i] = a + b + c + d;
    }

[Figure: the chain (a + b) + c + d uses three adders over cycles 1–3 of every iteration.]

• Cycles: 3N
• Adders: 3
• Utilization: 33%
Loop Pipelining
[Figure: the pipelined loop. A new iteration starts every cycle: while iteration i is in its third add, iteration i+1 is in its second and iteration i+2 in its first, so all three adders are busy in steady state; the last iteration (i = N-1) drains by cycle N+2.]

• Cycles: N+2 (~1 cycle per iteration)
• Adders: 3
• Utilization: 100% in steady state
Loop Pipelining Example
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

• Each iteration requires:
– 2 loads from memory
– 1 store
• No dependencies between iterations
Loop Pipelining Example
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

• Cycle latency of operations:
– Load: 2 cycles
– Store: 1 cycle
– Add: 1 cycle
• Single memory port
LLVM Instructions
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

Corresponding LLVM IR for the loop body:

    %i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
    %scevgep5 = getelementptr %b, %i.04
    %0 = load %scevgep5
    %scevgep6 = getelementptr %c, %i.04
    %1 = load %scevgep6
    %2 = add nsw i32 %1, %0
    %scevgep = getelementptr %a, %i.04
    store %2, %scevgep
    %3 = add %i.04, 1
    %exitcond = eq %3, 100
    br %exitcond, %bb2, %bb
Scheduling LLVM Instructions

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

[Figure: the IR operations placed into a cycle-by-cycle schedule; with a single memory port, the two loads and the store contend for the port (memory port conflict).]

• Each iteration requires:
– 2 loads from memory
– 1 store
• There are no dependencies between iterations
Loop Pipelining Example
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

• Initiation interval (II): the constant number of cycles between starting successive iterations of the loop
• This schedule requires 6 cycles per iteration (II = 6)
• Can we do better?
Minimum Initiation Interval
• Resource-minimum II (ResMII):
– Due to the limited number of functional units
– ResMII = ceiling(uses of a functional unit / # of such units)
• Recurrence-minimum II (RecMII):
– Due to loop-carried dependencies
• Minimum II = max(ResMII, RecMII) (a sketch of these bounds follows)
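These bounds are simple to compute; a minimal sketch in C (function names illustrative):

    /* ResMII for one resource class: ceil(uses / units). */
    int res_mii(int uses, int units) {
        return (uses + units - 1) / units;   /* integer ceiling */
    }

    /* Minimum II is the larger of the resource and recurrence bounds. */
    int min_ii(int res, int rec) {
        return res > rec ? res : rec;
    }
    /* e.g., 3 memory ops on a single-ported memory: res_mii(3, 1) == 3 */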
Resource Constraints
• Assume unlimited functional units (adders, …)
• The only constraint: a single-ported memory controller
• Reservation table: marks which cycle slot the memory port is used in
• The resource-minimum initiation interval is 3 (three memory operations per iteration, one port)
Iterative Modulo Scheduling
• There are no loop-carried dependencies, so Minimum II = ResMII = 3
• Iterative: it is not always possible to schedule the loop at the minimum II

[Flow diagram: II = minII; attempt to modulo schedule the loop with this II; on failure, II = II + 1 and retry; on success, done.]
Iterative Modulo Scheduling
• An operation scheduled in cycle i also executes in cycles i + k*II, for k = 0 to N-1 (once per iteration)
• Therefore, to detect resource conflicts, look in the reservation table under slot: (i - 1) mod II + 1
• Hence the name “modulo scheduling”
New Pipelined Schedule
Modulo Reservation Table
• Store couldn’t be scheduled in cycle 6
• Slot = (6-1) mod 3 + 1 = 3
• Already taken by an earlier load
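A minimal sketch of the conflict check against the modulo reservation table, using the slide's 1-indexed slots (names illustrative):

    #include <stdbool.h>

    #define II 3
    static bool mem_port_busy[II + 1];   /* slots 1..II, single memory port */

    /* Map a 1-indexed cycle onto its modulo reservation table slot. */
    int mrt_slot(int cycle) { return (cycle - 1) % II + 1; }

    /* Try to reserve the memory port in the given cycle. */
    bool try_reserve(int cycle) {
        int s = mrt_slot(cycle);
        if (mem_port_busy[s]) return false;  /* conflict: cycle 6 -> slot 3 */
        mem_port_busy[s] = true;
        return true;
    }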
Iterative Modulo Scheduling
• Now we have a valid schedule for II=3
• We need to construct the loop kernel,
prologue, and epilogue
• The loop kernel is what is executed when the
pipeline is in steady state
– The kernel is executed every II cycles
• First we divide the schedule into stages of II
cycles each
Pipeline Stages
[Figure: the II = 3 schedule divided into three pipeline stages of II cycles each.]

Pipelined Loop Iterations

[Figure: iterations i = 0..4 start 3 cycles apart, so stages 1–3 of successive iterations overlap. The ramp-up forms the prologue, the fully overlapped steady state is the kernel, and the ramp-down is the epilogue.]
Loop Dependencies
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            a[j] = b[i] + a[j-1];   /* depends on the previous iteration */

• This may cause a non-zero recurrence minimum II
• Several papers in FPGA 2013 deal with discovering/optimizing loop dependencies
Limitations and
Current Research
LegUp HLS Limitations
• HLS will likely do better for datapath-oriented
parts of a design.
• Results likely quite sensitive to how loops are
structured in your C code.
• Difficult for HLS to “beat” optimized
structured HW design.
FPGA/Altera-Specific
Aspects of LegUp
• Memory
– On-chip (AltSyncRAM),
off-chip (DDR2/SDRAM controller)
• IP cores
– Divider, floating point units
• On-chip SOC interconnect
– Avalon interface
• LegUp-generated Verilog is fairly FPGA-agnostic:
– Not difficult to migrate it to target ASICs
Current Research Work
• Impact of compiler optimizations on HLS
• Enhanced parallel accelerator support
– Combining Pthreads+OpenMP
• Smaller processor
• Improved loop pipelining
• Software fallback for bitwidth-optimized
accelerators
• Enhanced GUI to display CDFG connected
with the schedule
Current Work: PCIe Support
• Enable use of LegUp-generated accelerators in an HPC environment
– Communicating with an x86 processor via PCIe
• Message passing or memory transfers
– Software API for fpga_malloc, fpga_free, send, receive (a hypothetical usage sketch follows)
• DE4 / Stratix IV support in the next LegUp release
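A sketch of how such an API might be used; only the function names come from the slides, so the prototypes below are assumptions for illustration (it will not link without the actual LegUp PCIe library):

    #include <stddef.h>

    /* Assumed prototypes; only the names appear in the slides. */
    void *fpga_malloc(size_t bytes);
    void  fpga_free(void *fpga_ptr);
    void  send(void *fpga_dst, const void *host_src, size_t bytes);
    void  receive(void *host_dst, const void *fpga_src, size_t bytes);

    void run_accelerator(const int *host_in, int *host_out, size_t n) {
        int *buf = fpga_malloc(n * sizeof(int)); /* allocate FPGA memory */
        send(buf, host_in, n * sizeof(int));     /* host -> FPGA over PCIe */
        /* ... launch the LegUp-generated accelerator on buf ... */
        receive(host_out, buf, n * sizeof(int)); /* FPGA -> host over PCIe */
        fpga_free(buf);
    }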
On to the Labs!