ECE 565
High-Level Synthesis: An Introduction
Shantanu Dutt, ECE Dept., UIC

HLS Flow
• Code/Algorithm -> Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects)
• Classically, these 3 stages were performed sequentially, but currently they are performed together (which leads to better optimization)

HLS Flow (contd)
[Figures not recoverable; surviving label: (Binding)]
• Allocation: Simple counting of FUs after the above 2 stages

Simple HLS Examples
[Figure not recoverable]

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ X delay of 2 cc’s and + delay of 1 cc
Operations: (c1) x <- a * b; (c2) y <- c+d; (c3) z <- x+y
i) Non-overlapped pipelined scheduling
(a) Scheduling [Figure: schedule of c1(1), c2(1), c3(1), c1(2), c2(2), c3(2) on X and + over cc’s 1 to 6, where ck(j) is operation ck of iteration j]
(b) Arch. Synthesis [Figure: components named are registers a, b, c, d, x, y, z with load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; mux1 and mux2 (inputs I0, I1); a demux (outputs O0, O1); X and +]
(c) Controller FSM Synthesis [Figure: FSM entered on Reset, with states labelled cc 3i, cc 3(i+1), cc 3(i+2)]
Signal settings and register transfers named in the FSM figure:
  - [x <- a * b] (c1)
  - ldx=1
  - mux1=0, mux2=0, demux=0, ldy=1; [y <- c+d] (c2)
  - lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1; [z <- x+y] (c3)
Note: Unspecified control signals have either an inactive value or, if such a concept doesn’t exist for the control signal, the don’t-care value.
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted (e.g., lda = 1, then reg. “a” is loaded).

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont’d)
ii) Overlapped pipelined scheduling
(a) Scheduling [Figure: schedule of c1(1), c2(1), c3(1), c1(2), c2(2), c3(2) on X and + over cc’s 1 to 6]
(b) Arch. Synthesis [Figure: same named components as in the non-overlapped case]
(c) Controller FSM Synthesis [Figure: FSM entered on Reset, with states labelled cc 3i and cc 3(i+1)]
  - lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1; [y <- c+d, x <- a * b] (c1, c2)
  - ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1; [z <- x+y] (c3)
• For 4 iterations, the overlapped schedule takes 9 cc’s versus 12 cc’s for the non-overlapped schedule.
• Overlapped sched.: Time for n iterations = 2n+1 cc’s; throughput = n/(2n+1) ~ 0.5 outputs/cc
• Non-overlapped sched.: Time for n iterations = 3n cc’s; throughput = n/(3n) ~ 0.33 outputs/cc
• The overlapped schedule gives ~50% higher throughput (0.5 vs. 0.33 outputs/cc); equivalently, the time for n iterations drops by ~33% for large n.

Simple HLS Examples (contd)
• Some DFG control operation nodes:
  - Selector: inputs in1 (T) and in2 (F), a condition input (T/F), one output out
  - Distributor: one input in, a condition input (T/F), outputs out1 (T) and out2 (F)
• Conditional code: If (a > b) then c <- a-b; Else c <- b-a;
• Possible DFGs corresponding to the above conditional code: [Figures not recoverable]

Simple HLS Examples (contd)
• Iterative code: while (a > b) a <- a-b;
[Figure: DFG with a selector (sel), a subtraction (-), a comparison (>), and a distributor (dist); operations labelled c1 and c2; labels “Initialized to F” and “To fsm”]
(a) Scheduling (using only 1 adder/sub). Scheduling & binding: c1 and c2 alternate on the single + unit in successive cc’s (c1, c2, c1, c2, ...)
(b) Arch. Synthesis [Figure: components named are registers a, b, r1, and final a with load signals lda, ldb, ldr1, ldfina; a mux and a demux; an adder (+) with sum output s and carry-in cin]
Annotations in the figure: b’+1 = 2’s compl. of b (i.e., -b); s xor ovfl = 1 means -ve, = 0 means +ve

Delay Nodes in DFGs
• A delay node is generally implemented as a register; a delay node thus becomes a state variable.

Delay Nodes in DFGs (contd)
[Figure: transformation of a delay node in the DFG and its mapping to a register in the architecture]

Detailed HLS Example
[Figure: DFG, with the different paths (i/p -> o/p) in the DFG]
• Scheduling heuristic: Among available opers, schedule on available FUs those whose delay to o/p is the highest, breaking ties in favor of those opers u whose “sibling” o/ps (o/ps to the same children) that are avail. or will be avail. at u’s earliest finish will have the largest lifetime at that point.
(a) Scheduling w/ one X (2 cc’s) & one + (1 cc); goal: min. latency
(b) Reg. alloc. for o/p of operations [figure annotation: “For WAR constraint”]
(c) Arch. synthesis [Figure: the synthesized architecture]
• Note: It is not clear how register allocation has been done. It is sub-optimal (4 non-primary i/p regs. needed).

Detailed HLS Example: Register Allocation
• Scheduling heuristic: as above (ties broken by sibling-o/p lifetime). 3 non-primary i/p regs. needed.
• In the conflict graph (one per FU), there is an edge between 2 var. nodes if their lifetimes overlap (indicating that different registers need to be allocated to them).
• Graph coloring (using the min. # of colors to color nodes s.t. connected node pairs have different colors) is in general NP-hard.
• The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes).
• Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing).

Detailed HLS Example: Register Allocation (contd)
• Scheduling heuristic: Among available opers, schedule on available FUs those whose delay to o/p is the highest, breaking ties arbitrarily. 3 non-primary i/p regs. needed.
• B’s lifetime increases, but D’s (dep. of B) decreases similarly; the heuristic should be based on more global information.
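
A sketch of left-edge register allocation:
The register-allocation slides state that the conflict graph of variable lifetimes is an interval graph and that it can be colored optimally with the left-edge algorithm. The following is a minimal sketch of that idea, not the course's implementation. It assumes each lifetime is a half-open interval [start, end) in cc's, so a variable that dies at cc t can share a register with one that is born at cc t. The variable names and lifetimes in the example are hypothetical and are not taken from the DFG in the slides. The sketch is written for clarity and is not tuned to meet the linear-time bound.

```python
def left_edge(lifetimes):
    """Assign registers to variables with the left-edge algorithm.

    lifetimes: dict mapping variable name -> (start, end), read as the
               half-open interval [start, end) (an assumption of this sketch).
    Returns a dict mapping variable name -> register index (0, 1, ...).
    """
    # Sort by left edge (start time); ties are broken by end time.
    pending = sorted(lifetimes.items(), key=lambda kv: kv[1])
    assignment = {}
    reg = 0
    while pending:
        last_end = None   # end of the last lifetime packed into this register
        leftover = []     # lifetimes that did not fit into this register
        for name, (start, end) in pending:
            if last_end is None or start >= last_end:
                assignment[name] = reg   # fits after the previous lifetime
                last_end = end
            else:
                leftover.append((name, (start, end)))
        pending = leftover               # still sorted by left edge
        reg += 1                         # open the next register
    return assignment


# Hypothetical lifetimes (in cc's), for illustration only.
lifetimes = {"A": (1, 3), "B": (2, 5), "C": (3, 6), "D": (5, 7), "E": (6, 8)}
regs = left_edge(lifetimes)
print(regs)
print("registers used:", len(set(regs.values())))
```

In this example no more than two lifetimes overlap at any point, and the sketch uses two registers, which matches the lower bound set by the maximum overlap.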