ECE 565
High-Level Synthesis: An Introduction
Shantanu Dutt, ECE Dept., UIC

HLS Flow
• Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects)
• Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization)

HLS Flow (contd)
• (Binding) Allocation: simple counting of FUs after the above 2 stages

Simple HLS Examples

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ an X delay of 2 cc's and a + delay of 1 cc

(a) Scheduling
i) Non-overlapped pipelined scheduling
[Figure: schedule over cc's 1 to 6. X row: c1(1), c1(2). + row: c2(1), c3(1), c2(2), c3(2).]

(b) Arch. Synthesis
[Figure: registers a, b, c, d, x, y, z with load signals lda, ldb, ldc, ldd, ldx, ldy, ldz; the X and + units; mux1 and mux2 (inputs I0, I1); a demux (outputs O0, O1).]

(c) Controller FSM Synthesis
Controller FSM (entered on Reset; states labelled cc 3i, cc 3(i+1), cc 3(i+2) on the slide):
• [x ← a x b] (c1): ldx=1
• [y ← c+d] (c2): mux1=0, mux2=0, demux=0, ldy=1
• [z ← x+y] (c3): lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1

Note: Unspecified control signals have either an inactive value or, if such a concept doesn't exist for the control signal, the don't-care value.
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted. E.g., lda = 1 in one cc → reg. "a" loaded at the next such edge.

Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 (X) and 1 (+) (cont'd)

(a) Scheduling
ii) Overlapped pipelined scheduling
[Figure: schedule over cc's 1 to 6. X row: c1(1), c1(2). + row: c2(1), c3(1), c2(2), c3(2).]

(b) Arch. Synthesis
[Figure: same registers, muxes, and demux as in the non-overlapped case.]

(c) Controller FSM Synthesis
Controller FSM (entered on Reset; states labelled cc 3i and cc 3(i+1) on the slide):
• [y ← c+d, x ← a x b] (c1, c2): lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1
• [z ← x+y] (c3): ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1

For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.
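The 9-versus-12 cc comparison above can be reproduced with a small sketch. This is not taken from the slides: the function names are hypothetical, and the exact cc in which c2 is placed is an assumption (it only has to avoid clashing with c3 on the single adder). The sketch builds both schedules for x ← a x b (c1), y ← c+d (c2), z ← x+y (c3), checks that no FU is used twice in the same cc, and reports the total cycle count.

```python
# Sketch of the two pipelined schedules for:
#   x <- a*b (c1), y <- c+d (c2), z <- x+y (c3)
MUL_DELAY = 2  # cc's for the X unit, from the slide's constraints
ADD_DELAY = 1  # cc's for the + unit, from the slide's constraints


def build_schedule(n, overlapped):
    """Return {op: (unit, first_cc, last_cc)} for n iterations (cc's start at 1).

    Assumption: a new iteration starts every 2 cc's when overlapped and
    every 3 cc's otherwise, with c2 placed in the multiplier's second cc.
    """
    period = 2 if overlapped else 3
    sched = {}
    for k in range(n):
        base = k * period
        sched[f"c1({k + 1})"] = ("X", base + 1, base + MUL_DELAY)
        sched[f"c2({k + 1})"] = ("+", base + 2, base + 1 + ADD_DELAY)
        sched[f"c3({k + 1})"] = ("+", base + 3, base + 2 + ADD_DELAY)
    return sched


def resource_conflict_free(sched):
    """True if no FU is used by two operations in the same cc."""
    busy = set()
    for unit, first, last in sched.values():
        for cc in range(first, last + 1):
            if (unit, cc) in busy:
                return False
            busy.add((unit, cc))
    return True


def total_cycles(sched):
    """Last cc in which any operation is still executing."""
    return max(last for _, _, last in sched.values())


if __name__ == "__main__":
    for overlapped in (False, True):
        s = build_schedule(4, overlapped)
        print(overlapped, resource_conflict_free(s), total_cycles(s))
```

In the overlapped case, c3 of one iteration runs on the adder during the first cc of the next iteration's multiplication, which is why only the last iteration pays the extra cc.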
• Overlapped sched.: time for n iterations = 2n+1 cc's; throughput = n/(2n+1) ≈ 0.5 outputs/cc
• Non-overlapped sched.: time for n iterations = 3n cc's; throughput = n/(3n) ≈ 0.33 outputs/cc
→ For large n, the overlapped schedule raises throughput from ~0.33 to ~0.5 outputs/cc. The gain of ~0.17 outputs/cc is ~34% of the overlapped throughput, i.e., a ~50% improvement relative to the non-overlapped schedule.

Simple HLS Examples (contd)
• Some DFG control operation nodes:
  Selector: inputs in1 (T), in2 (F), and Condition (T/F); output out
  Distributor: inputs in and Condition (T/F); outputs out1 (T) and out2 (F)
• Conditional code: if (a > b) then c ← a-b; else c ← b-a;
• Possible DFGs corresponding to the above conditional code: [Figure]

Simple HLS Examples (contd)
• Iterative code: while (a > b) a ← a-b;

(a) Scheduling (using only 1 adder/sub)
[Figure: DFG with sel (T/F), -, >, and dist (T/F) nodes operating on a and b; conditions c1 and c2; the loop condition is initialized to F.]
Scheduling & binding: [Figure: + unit over cc's: c1, c2, c1, c2.]

(b) Arch. Synthesis
[Figure: registers a (lda), b (ldb), r1 (ldr1), and final a (ldfina); a mux and a demux; the adder computes a + b' + 1 with cin = 1, where b'+1 is the 2's complement of b, i.e., -b; s xor ovfl gives the sign of the result (1 → -ve, 0 → +ve) and goes to the FSM.]

Delay Nodes in DFGs
• A delay node is generally implemented as a register; a delay node thus becomes a state variable.

Delay Nodes in DFGs (contd)
[Figure: transformation in the DFG; mapping to the architecture (register).]

Detailed HLS Example

Detailed HLS Example (contd)
• Different paths (i/p → o/p) in the DFG
• Scheduling heuristic: among available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties in favor of those operations u whose "sibling" o/ps (o/ps going to the same children) that are available, or will be available at u's earliest finish, will have the largest lifetime at that point.
(a) Scheduling w/ one X (2 cc's) & one + (1 cc); goal: min. latency
(b) Reg. alloc. for o/p of operations (w/ a WAR constraint)
(c) Arch. synthesis: the synthesized architecture
Note: Not clear how register allocation has been done. It is sub-optimal (4 non-primary i/p regs. needed).

Detailed HLS Example (contd)

Detailed HLS Example: Register Allocation

Detailed HLS Example: Register Allocation (contd)
• Scheduling heuristic: as above, breaking ties by sibling-o/p lifetime. Result: 3 non-primary i/p regs. needed.
• In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)
• Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is NP-hard in general
• The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes)
• Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing)

Detailed HLS Example: Register Allocation (contd)
• Scheduling heuristic: among available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties arbitrarily. Result: 3 non-primary i/p regs. needed.
• B's lifetime increases, but D's (D is a dependent of B) decreases similarly; the heuristic should be based on more global information.
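The interval-graph coloring step from the register-allocation slides can be sketched as a left-edge style pass over variable lifetimes. This is a minimal sketch under stated assumptions, not the slides' own procedure: the function name, the variable names v1 to v4, and their lifetimes are hypothetical, and lifetimes are treated as half-open intervals [start, end), so a register freed at cc t can be reused by a variable born at cc t. This simple version sorts the lifetimes and scans registers first-fit, so it is not tuned for linear running time.

```python
def left_edge_allocate(lifetimes):
    """Assign registers to variables whose lifetimes form an interval graph.

    lifetimes: {var: (start, end)}, treated as half-open [start, end)
               (an assumption of this sketch).
    Returns {var: register_index}.
    """
    reg_free_at = []  # reg_free_at[r] = cc at which register r becomes free
    assignment = {}
    # Process variables in order of their left edge (lifetime start).
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, free_at in enumerate(reg_free_at):
            if free_at <= start:  # register r is free again: reuse it
                reg_free_at[r] = end
                assignment[var] = r
                break
        else:
            # Every existing register is still live: allocate a new one.
            reg_free_at.append(end)
            assignment[var] = len(reg_free_at) - 1
    return assignment


if __name__ == "__main__":
    # Hypothetical lifetimes, not taken from the slides' example.
    example = {"v1": (1, 3), "v2": (2, 4), "v3": (3, 5), "v4": (4, 6)}
    alloc = left_edge_allocate(example)
    print(alloc)
    print("registers used:", len(set(alloc.values())))
```

A new register is opened only when every existing one holds a variable that is still live, so the number of registers used equals the largest number of lifetimes that overlap at any one point, which is the minimum possible for interval lifetimes.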
