ECE 565
High-Level Synthesis—An Introduction
Shantanu Dutt
ECE Dept., UIC
HLS Flow
• Code/Algorithm → Architecture: interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects
Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization)
HLS Flow (contd)
(Binding)
Allocation: simple counting of FUs after the above 2 stages
Simple HLS Examples
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ X delay of 2 cc’s and + delay of 1 cc
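(a) Scheduling
i) Non-overlapped pipelined scheduling
Operations: (c1) x ← a × b; (c2) y ← c + d; (c3) z ← x + y
[Figure: schedule over cc’s 1-6 for two iterations, where ck(j) denotes operation ck of iteration j. X: c1(1), c1(2). +: c2(1), c3(1), c2(2), c3(2).]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X unit, one + unit, mux1 and mux2 (inputs I0, I1) selecting the operands of the + unit, and a demux (outputs O0, O1) routing its result to y or z.]
(c) Controller FSM Synthesis
[Figure: controller FSM entered on Reset, with states labeled cc 3i, cc 3(i+1) and cc 3(i+2).]
Control signals asserted per state:
• (c2) [y ← c + d]: mux1=0, mux2=0, demux=0, ldy=1
• (c1) [x ← a × b]: ldx=1
• (c3) [z ← x + y]: lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
Note: Unspecified control signals have either an inactive value or, if such a concept does not exist for the control signal (cs), the don’t-care value.
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted.
[Figure: timing example showing lda = 1 followed by reg. “a” loaded.]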
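The Note on unspecified control signals can be made concrete with a small sketch. It takes the per-state signal values listed for the non-overlapped controller and fills in everything a state leaves unspecified: load signals get their inactive value (0), and select signals, which have no inactive value, get a don't-care. The state names S_c2, S_c1, S_c3, the "X" marker for don't-care, and the helper control_word are assumptions made for this illustration only.

```python
# Control signals of the datapath in this example (names taken from the slide).
LOADS = ["lda", "ldb", "ldc", "ldd", "ldx", "ldy", "ldz"]  # have an inactive value (0)
SELECTS = ["mux1", "mux2", "demux"]                        # no inactive value: don't care

# Signals each controller state specifies explicitly (non-overlapped schedule).
# The state names are hypothetical labels for this sketch.
STATES = {
    "S_c2": {"mux1": 0, "mux2": 0, "demux": 0, "ldy": 1},    # y <- c + d
    "S_c1": {"ldx": 1},                                      # x <- a * b
    "S_c3": {"lda": 1, "ldb": 1, "ldc": 1, "ldd": 1,
             "mux1": 1, "mux2": 1, "demux": 1, "ldz": 1},    # z <- x + y
}

def control_word(state):
    """Return the full control word of a state, filling in unspecified signals."""
    spec = STATES[state]
    word = {}
    for sig in LOADS:
        word[sig] = spec.get(sig, 0)      # unspecified load: inactive
    for sig in SELECTS:
        word[sig] = spec.get(sig, "X")    # unspecified select: don't care
    return word

for name in STATES:
    print(name, control_word(name))
```

Leaving the selects as don't-cares in the state that only asserts ldx gives logic synthesis freedom to simplify the controller's output logic.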
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont’d)
(a) Scheduling
ii) Overlapped pipelined scheduling
[Figure: schedule over cc’s 1-6. X: c1(1), c1(2). +: c2(1), c3(1), c2(2), c3(2).]
(b) Arch. Synthesis
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one X unit, one + unit, mux1 and mux2 (inputs I0, I1), and a demux.]
(c) Controller FSM Synthesis
[Figure: controller FSM entered on Reset, with states labeled cc 3i and cc 3(i+1).]
Control signals asserted per state:
• (c1, c2) [y ← c + d, x ← a × b]: lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1
• (c3) [z ← x + y]: ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
• For 4 iterations, the overlapped schedule takes 9 cc’s versus 12 cc’s for the non-overlapped schedule.
• Overlapped schedule: time for n iterations = 2n+1 cc’s; throughput = n/(2n+1) ≈ 0.5 outputs/cc
• Non-overlapped schedule: time for n iterations = 3n cc’s; throughput = n/3n ≈ 0.33 outputs/cc
• ⇒ ≈ 50% throughput improvement using an overlapped schedule (0.5 vs. 0.33 outputs/cc), i.e., ≈ 33% fewer cc’s per output
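The cycle counts above can be checked with a few lines of arithmetic. This is a minimal sketch that only encodes the two formulas from the slide (3n and 2n+1); the function names are chosen for this illustration.

```python
def nonoverlapped_cycles(n):
    """Each iteration occupies 3 cc's and iterations do not overlap."""
    return 3 * n

def overlapped_cycles(n):
    """A new iteration completes every 2 cc's, plus 1 cc to fill the pipeline."""
    return 2 * n + 1

def throughput(n, cycles):
    """Outputs per clock cycle for n iterations finished in `cycles` cc's."""
    return n / cycles

for n in (4, 100, 10_000):
    t_non = nonoverlapped_cycles(n)
    t_ovl = overlapped_cycles(n)
    ratio = throughput(n, t_ovl) / throughput(n, t_non)
    print(n, t_non, t_ovl, round(ratio, 3))
```

For small n the fill cycle is visible in the ratio; as n grows, the ratio of the two throughputs approaches 3/2.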
Simple HLS Examples (contd)
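• Some DFG control operation nodes:
Selector: two data inputs (in1, in2) on its T and F ports, a Condition (T/F) input, and one output (out)
Distributor: one data input (in), a Condition (T/F) input, and two outputs (out1, out2) on its T and F ports
• Conditional code:
If (a > b) then
c ← a-b;
Else
c ← b-a;
• Possible DFGs corresponding to the above conditional code:
[Figure: DFGs for the conditional code, built from the selector and distributor nodes.]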
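The behavior of the two control nodes can be sketched in software. The code below models a selector and a distributor as plain functions and builds the conditional code in two ways: computing both branches and selecting one, or steering the operands to a single branch. These two forms are assumptions about what the DFGs could look like, not a transcription of the slide's figure, and modelling an unused distributor output as None is a choice made for this sketch.

```python
def selector(cond, in_t, in_f):
    """Selector node: forwards the T input when cond is true, else the F input."""
    return in_t if cond else in_f

def distributor(cond, value):
    """Distributor node: routes value to the T output or to the F output.
    The unused output is modelled as None (no value produced)."""
    return (value, None) if cond else (None, value)

def abs_diff_selector(a, b):
    # Both branches are computed; the selector keeps one result.
    return selector(a > b, a - b, b - a)

def abs_diff_distributor(a, b):
    # Operands are steered to one branch only; a selector merges the results.
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    c_t = a_t - b_t if cond else None
    c_f = b_f - a_f if not cond else None
    return selector(cond, c_t, c_f)

print(abs_diff_selector(7, 3), abs_diff_distributor(3, 7))
```

The selector-only form needs two subtractions per evaluation, while the distributor form activates only one branch, which matters when FUs are shared.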
Simple HLS Examples (contd)
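• Iterative code: while (a > b)
a ← a-b;
(a) Scheduling (using only 1 adder/sub)
[Figure: DFG with a selector (T sel F), a distributor (T dist F), a > node and a - node, with steps labeled c1 and c2 and an annotation “Initialized to F”. Scheduling & binding: the single + unit executes c1, c2, c1, c2, ... in successive cc’s.]
(b) Arch. Synthesis
[Figure: datapath with registers a, b and r1 (load signals lda, ldb, ldr1), a “final a” register (load signal ldfina), muxes and demuxes around one adder, and a condition signal going to the FSM.]
Subtraction on the adder: a - b = a + b’ + 1, with cin = 1 (b’ + 1 is the 2’s-complement representation of -b)
Sign of the result: s xor ovfl = 1 → -ve; = 0 → +ve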
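The adder-based subtraction and sign test used in this datapath can be sketched as follows. The sketch assumes an 8-bit word (WIDTH is an assumption), and it assumes that one step evaluates the loop condition by computing b - a and testing its sign while the other step computes a - b; the slide's figure does not make this split fully recoverable, so treat it as one plausible reading. The loop terminates only for b > 0.

```python
WIDTH = 8                       # assumed word width for this sketch
MASK = (1 << WIDTH) - 1

def add_sub(a, b, sub):
    """One adder used as adder/subtractor: a + (b' if sub else b) + cin, cin = sub.
    Returns (result bits, negative flag), where negative = sign bit XOR overflow."""
    x = a & MASK
    y = (~b if sub else b) & MASK           # b' is the bitwise complement of b
    res = (x + y + (1 if sub else 0)) & MASK
    s = res >> (WIDTH - 1)                  # sign bit of the result
    sx, sy = x >> (WIDTH - 1), y >> (WIDTH - 1)
    ovfl = 1 if (sx == sy and s != sx) else 0
    return res, s ^ ovfl

def to_signed(bits):
    """Interpret WIDTH result bits as a 2's-complement number."""
    return bits - (1 << WIDTH) if bits >> (WIDTH - 1) else bits

def iterate(a, b):
    """while (a > b): a <- a - b, using the single adder once per step."""
    steps = 0
    while True:
        _, neg = add_sub(b, a, sub=True)    # condition: b - a < 0  <=>  a > b
        steps += 1
        if not neg:
            return a, steps
        res, _ = add_sub(a, b, sub=True)    # loop body: a <- a - b
        a = to_signed(res)
        steps += 1

print(iterate(17, 5))
```

Using s xor ovfl rather than the sign bit alone keeps the comparison correct even when the subtraction overflows the word width.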
Delay Nodes in DFGs
A delay node is generally implemented as a register; a delay node thus becomes a state variable.
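A short sketch may help show why a delay node behaves like a state variable. The Delay class below models a register whose output is the value loaded one step earlier. The example DFG, y[n] = x[n] + x[n-1], and all names are hypothetical and chosen only for illustration.

```python
class Delay:
    """Delay node modelled as a register: outputs the value stored one step earlier."""

    def __init__(self, init=0):
        self.state = init            # register contents, i.e. the state variable

    def step(self, value):
        out = self.state             # read the old value
        self.state = value           # load the new value at the clock edge
        return out

def running_pair_sum(samples):
    """Hypothetical DFG: y[n] = x[n] + x[n-1], with x[-1] = 0."""
    d = Delay()
    return [x + d.step(x) for x in samples]

print(running_pair_sum([1, 2, 3, 4]))
```

The only information carried from one iteration to the next is the register contents, which is exactly what makes the delay node a state variable.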
Delay Nodes in DFGs (contd)
[Figure: transformation in the DFG and mapping to the architecture, where the delay node maps to a register.]
Detailed HLS Example
Detailed HLS Example (contd)
Different paths (i/p → o/p) in the DFG
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those operations u whose “sibling” o/ps (o/ps going to the same children) that are available, or will be available at u’s earliest finish, will have the largest lifetime at that point.
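The main rule of this heuristic (highest delay to o/p first) can be sketched as a resource-constrained list scheduler. The DFG below is hypothetical, since the slide's DFG is not recoverable from the text, and ties are broken by operation name rather than by the sibling-lifetime rule, which is not implemented here. FUs are assumed to be non-pipelined.

```python
DELAY = {"mul": 2, "add": 1}      # cc's per operation type (as on the slide)
UNITS = {"mul": 1, "add": 1}      # one multiplier and one adder

# Hypothetical DFG: operation name -> (type, predecessors)
DFG = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "a1": ("add", []),
    "a2": ("add", ["m1", "a1"]),
    "a3": ("add", ["a2", "m2"]),
}

def delay_to_output(dfg):
    """Longest delay from the start of each operation to the o/p."""
    succs = {u: [] for u in dfg}
    for v, (_, preds) in dfg.items():
        for p in preds:
            succs[p].append(v)
    memo = {}
    def longest(u):
        if u not in memo:
            memo[u] = DELAY[dfg[u][0]] + max((longest(s) for s in succs[u]), default=0)
        return memo[u]
    return {u: longest(u) for u in dfg}

def list_schedule(dfg):
    prio = delay_to_output(dfg)
    start, finish = {}, {}                               # op -> first / last cc
    busy_until = {t: [0] * n for t, n in UNITS.items()}  # per FU: last busy cc
    cc = 1
    while len(start) < len(dfg):
        ready = [u for u in dfg if u not in start
                 and all(p in finish and finish[p] < cc for p in dfg[u][1])]
        ready.sort(key=lambda u: (-prio[u], u))          # highest delay to o/p first
        for u in ready:
            t = dfg[u][0]
            for i, last in enumerate(busy_until[t]):
                if last < cc:                            # FU i is free in this cc
                    start[u], finish[u] = cc, cc + DELAY[t] - 1
                    busy_until[t][i] = finish[u]
                    break
        cc += 1
    return start, finish

start, finish = list_schedule(DFG)
print(start, "latency:", max(finish.values()))
```

Replacing the name-based tie-break with a lifetime-aware one is where the register count can change without affecting latency.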
(a) Scheduling w/ one X (2 cc’s) & one + (1 cc); goal: min. latency
(b) Reg. alloc. for o/p of operations [figure annotation: for WAR constraint]
(c) Arch. synthesis
Note: It is not clear how register allocation has been done. It is sub-optimal (4 non-primary i/p regs. needed).
The synthesized architecture
Detailed HLS Example (contd)
Detailed HLS Example—Register Allocation
Detailed HLS Example—Register Allocation (contd)
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to o/p is the highest, breaking ties in favor of those operations u whose “sibling” o/ps (o/ps going to the same children) that are available, or will be available at u’s earliest finish, will have the largest lifetime at that point.
3 non-primary i/p regs. needed
• In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)
• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general
• The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes)
• Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing)
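Register allocation by interval-graph coloring can be sketched with a left-edge style pass. The version below is a first-fit variant: variables are sorted by the start of their lifetime and each goes into the first register whose previous occupant is already dead. The lifetimes are hypothetical, and treating a lifetime as the inclusive range of cc's start..end (so a variable ending in cc t cannot share a register with one starting in cc t) is an assumption of this sketch. The simple inner loop is kept for readability and is not the linear-time formulation.

```python
def left_edge(lifetimes):
    """lifetimes: name -> (start, end), live in cc's start..end inclusive.
    Returns name -> register index, using as few registers as possible."""
    order = sorted(lifetimes, key=lambda v: lifetimes[v][0])   # sort by left edge
    last_end = []              # per register: end of the last lifetime placed in it
    assignment = {}
    for v in order:
        s, e = lifetimes[v]
        for r, end in enumerate(last_end):
            if end < s:                    # no overlap: reuse register r
                last_end[r] = e
                assignment[v] = r
                break
        else:
            last_end.append(e)             # every register is live: add a new one
            assignment[v] = len(last_end) - 1
    return assignment

# Hypothetical lifetimes of five operation outputs
LIFETIMES = {"A": (1, 3), "B": (2, 4), "C": (4, 6), "D": (5, 7), "E": (3, 5)}
regs = left_edge(LIFETIMES)
print(regs, "registers used:", len(set(regs.values())))
```

In this example A, B and E are all live in cc 3, so no allocation can use fewer than 3 registers, which is what the pass produces.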
Detailed HLS Example—Register Allocation (contd)
3 non-primary i/p regs. needed
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to o/p is the highest, breaking ties arbitrarily. B’s lifetime increases, but D’s (a dependent of B) decreases similarly; the heuristic should be based on more global information.