TDTS01 Lecture Notes, Lecture 7: High-Level Synthesis II
Zebo Peng, Embedded Systems Laboratory, IDA, Linköping University
2013-02-03

Lecture 7
• Allocation and binding
• Control unit synthesis
• Advanced HLS issues

Allocation and Binding
• Allocation (unit selection): determine the type and number of resources required, including:
  • Functional units
  • Storage elements
  • Buses
• Binding: assignment to resource instances:
  • Operations to functional unit instances
  • Values to be stored to instances of storage elements
  • Data transfers to bus instances

Allocation and Binding (Cont'd)
[Figure: four additions o1 to o4 on the values a to h, scheduled in control steps s1 and s2, and a datapath in which one adder executes +1 and +3 and a second adder executes +2 and +4.]
• Resource sharing: allow multiple non-concurrent operations to share the same hardware as much as possible.
• Optimization goal:
  • Minimize the total cost of functional units, registers, bus drivers, and multiplexers.
  • Minimize the total interconnection length (placement information needed).
  • Respect the constraint on critical path delay.

Allocation/Binding: Approach 1
• Constructive: start with an empty datapath and add functional, storage, and interconnection components as needed.
• Greedy algorithms: perform allocation/binding for one control step at a time.
• Rule-based: used to select the type and number of functional units, especially prior to scheduling.
[Figure: a schedule over control steps 1 to 3 with additions a1 to a4 and multiplications m1 and m2, and the resulting datapath in which a1, a3, a4 share one adder, a2 uses a second adder, and m1, m2 share one multiplier.]

Allocation/Binding: Approach 2
• Graph-theoretical formulations: sub-tasks are mapped into well-defined problems in graph theory.
  • Clique partitioning
  • Left-edge algorithm
  • Graph coloring

Clique Partitioning
• G = (V, E) is an undirected graph with a set V of vertices and a set E of edges.
• A clique is a set of vertices that form a complete subgraph of G.
• The clique partitioning problem: partition G into a minimal number of cliques such that each vertex belongs to exactly one clique.
[Figure: a clique partitioning example on the vertices a1 to a4, with two cliques marked.]

Allocation as Clique Partitioning
Functional unit allocation:
• Each vertex represents an operation.
• An edge connects two vertices iff:
  • the two operations are scheduled into different control steps, and
  • there exists a functional unit that is capable of carrying out both operations.
[Figure: the three-step schedule with a1 to a4, m1, m2 and the corresponding graph over these operations.]

Clique Partitioning (Cont'd)
• Storage allocation as a clique partitioning problem:
  • Each value that needs to be stored is mapped to a vertex.
  • Two vertices are connected iff the lifetimes of the two values do not intersect.
• The clique partitioning problem is NP-complete.
• Efficient heuristics must be developed.
  • Example: Tseng developed a polynomial-time algorithm, based on step-wise grouping, which generates very good results.

Tseng's Algorithm
1. A super-graph is derived from the original graph.
2. Find two connected super-nodes that have the maximum number of common neighbors.
3. Merge the two nodes and repeat from the first step, until no more merges can be carried out.
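The three steps of Tseng's algorithm can be sketched in Python as follows. This is a minimal reading of the steps, not Tseng's original formulation: the function name `tseng_clique_partition`, the edge-list input format, and the tie-breaking rule (first pair found in scan order) are assumptions of this sketch. After a merge, the new super-node stays connected only to the common neighbors of the two merged nodes, so every super-node remains a clique. The graph used below is the five-vertex example from the lecture.

```python
def tseng_clique_partition(vertices, edges):
    """Partition a graph into cliques by repeatedly merging connected super-nodes."""
    # Step 1: derive the super-graph; every super-node starts as a single vertex.
    nodes = [frozenset([v]) for v in vertices]
    adj = {n: set() for n in nodes}
    for u, v in edges:
        a, b = frozenset([u]), frozenset([v])
        adj[a].add(b)
        adj[b].add(a)

    while True:
        # Step 2: find the connected pair with the most common neighbors.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if b in adj[a]:
                    common = len(adj[a] & adj[b])
                    if best is None or common > best[0]:
                        best = (common, a, b)  # ties: keep the first pair (assumption)
        if best is None:
            break  # no edges left, so no more merges are possible

        # Step 3: merge the pair; only common neighbors stay connected.
        _, a, b = best
        merged = a | b
        keep = adj[a] & adj[b]
        for n in adj[a] | adj[b]:
            adj[n].discard(a)
            adj[n].discard(b)
        del adj[a], adj[b]
        adj[merged] = set(keep)
        for n in keep:
            adj[n].add(merged)
        nodes = [n for n in nodes if n not in (a, b)] + [merged]

    return sorted(sorted(n) for n in nodes)


# The lecture's example graph.
EDGES = [("V1", "V3"), ("V1", "V4"), ("V2", "V3"),
         ("V2", "V5"), ("V3", "V4"), ("V4", "V5")]
cliques = tseng_clique_partition(["V1", "V2", "V3", "V4", "V5"], EDGES)
print(cliques)  # [['V1', 'V3', 'V4'], ['V2', 'V5']]
```

Because several pairs tie in this example, the merge order can differ from the one shown on the slides, but the resulting partition into {V1, V3, V4} and {V2, V5} is the same.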
[Figure: example graph with the vertices V1 to V5.]

Edge       Common neighbors
(V1,V3)    1
(V1,V4)    1
(V2,V3)    0
(V2,V5)    0
(V3,V4)    1
(V4,V5)    0

Tseng's Algorithm (Cont'd)
After merging V1 and V3 into the super-node S1-3:

Edge         Common neighbors
(S1-3,V4)    0
(V2,V5)      0
(V4,V5)      0

After merging S1-3 and V4 into the super-node S1-3-4:

Edge       Common neighbors
(V2,V5)    0

Left-Edge (LE) Algorithm
• Used in channel routing to minimize the number of tracks used to connect points (layout design):
  • To minimize the number of needed tracks.
  • To reduce wire lengths.
  • To avoid wire crossings.

LE Algorithm for Reg. Allocation
• Map the birth time of a value to the left (top) edge, and its death time to the right (bottom) edge of a wire.
[Figure: a scheduled data-flow graph with the inputs i1 to i5, the intermediate values a to g, and the outputs o1 to o3, next to the lifetimes of these values drawn as vertical wires.]

The LE Algorithm
1. The values are sorted in increasing order of their birth time.
2. The first value is assigned to the first register.
3. The list is then scanned for the next value whose birth time is larger than or equal to the death time of the previous value.
4. This value is assigned to the current register.
5. The list is scanned until no more values can share the same register.
6. A new register is then introduced to hold the next value in the sorted list, and the algorithm iterates from step 3.

LE Algorithm Example
[Figure: the lifetimes of i1 to i5, a to g, and o1 to o3 before allocation, and the same values packed into the registers R1 to R5.]

LE Algorithm Discussions
The algorithm is guaranteed to allocate the minimum number of registers.
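As an illustration, the six LE steps can be sketched in Python as follows. The function name `left_edge`, the dictionary input format, and the lifetimes in the example are hypothetical; they are not the values of the lecture's example figure. A value may reuse a register when its birth time is larger than or equal to the death time of the value stored before it, as in step 3.

```python
def left_edge(lifetimes):
    """Assign values to registers with the left-edge algorithm.

    lifetimes maps a value name to a (birth, death) pair.
    Returns one list of value names per register.
    """
    # Step 1: sort the values in increasing order of birth time.
    pending = sorted(lifetimes, key=lambda name: lifetimes[name][0])
    registers = []
    while pending:
        current = []        # Steps 2 and 6: open a new register
        last_death = None
        remaining = []
        # Steps 3 to 5: scan the list and pack every value that fits.
        for name in pending:
            birth, death = lifetimes[name]
            if last_death is None or birth >= last_death:
                current.append(name)
                last_death = death
            else:
                remaining.append(name)
        registers.append(current)
        pending = remaining
    return registers


# Hypothetical lifetimes (birth, death), for illustration only.
LIFETIMES = {"a": (1, 3), "b": (2, 4), "c": (3, 5), "d": (4, 6), "e": (5, 7)}
allocation = left_edge(LIFETIMES)
print(allocation)  # [['a', 'c', 'e'], ['b', 'd']]
```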
However, it has two disadvantages:
• Not every lifetime table can be interpreted as a set of intersecting intervals on a line, for example in the presence of:
  • Loops
  • Conditional branches
• The assignment is neither unique nor necessarily optimal with respect to other costs, for example the number of multiplexers.

Allocation/Binding: Approach 3
• Transformational allocation: starting from an initial allocation and binding, a final design is obtained by successive transformations.
  • Usually it starts with a maximal allocation (each operation has its own dedicated physical unit).
  • The design is then improved by merging physical units step by step, so that hardware resources are shared as much as possible.
[Figure: two adders, labeled Si and Sj, merged into one shared adder labeled Si,j.]

Control-Unit Synthesis
• Two basic approaches are widely used:
  • Microcode based.
  • Hard-wired.
• The basic assumptions:
  • A synchronous controller is used.
  • A schedule is given together with the set of activation signals.
    • E.g., enable, multiplexer input selection, and bus control.
  • The controller is modeled as a finite-state machine.

Microcoded Control Synthesis
• The control information is stored in an organized fashion.
• A microcode ROM of size λ is used, where λ is the number of schedule steps.
• The ROM must have ⌈log₂ λ⌉ address bits (note: ⌈x⌉ denotes the ceiling function).
• A synchronous counter with a reset signal is used to address the ROM.
• The counter is clocked by the system clock.
• The ROM contents can be implemented as horizontal or vertical microcode.

Horizontal Microcode
• Each activation signal is associated with one bit of the microcode word.
[Figure: a counter with Reset and Clock inputs addresses a ROM with λ = 4 words; the bits of the selected microword drive the activation signals.]

Address    Microword
00         11000101010
01         00100010101
10         00010000000
11         00001000000

• The word length is usually much larger than λ, so the ROM is wider than it is high.
• Each bit is connected directly to an activation signal: high performance.
• There are many zeros: wasted storage resources.

Vertical Microcode
• A fully vertical microcode encodes the n activation signals with ⌈log₂ n⌉ bits to reduce the width of the ROM.
• Several words may be needed for a schedule step.
[Figure: the n = 11 activation signals are numbered 1 to 11. The four horizontal words 11000101010, 00100010101, 00010000000, and 00001000000 are replaced by a ROM of 4-bit words, one code per active signal: 0001 0010 0110 1000 1010 (step 1), 0011 0111 1001 1011 (step 2), 0100 (step 3), 0101 (step 4). A decoder converts each code into its activation signal.]

Vertical Microcode Issues
• A decoder is needed, which can also be implemented by another ROM to form a two-stage control store.
• Operation concurrency may not be fully supported.
  • Reserve code-words for concurrent operations.
    • E.g., using "1100" to denote activation of the first group of activation signals.
• Vertical control schemes can be implemented by:
  • Lengthening the schedule, or
  • Reading multiple ROM words in each step.

Microcode Optimization
• To find the shortest encoding of the words such that full concurrency is preserved: the microcode compaction (MC) problem, which is intractable.
• MC can be approached by partitioning the operations into groups such that at most one operation is active in each group at any step, so that vertical encoding can be used within each group.
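The grouping idea can be sketched in Python as follows. The four horizontal microwords are the ones used in the lecture; the function name `encode_groups` and the grouping (1,3,4), (2), (6,7,5), (8,9), (10,11) are assumptions of this sketch. Each group gets its own vertically encoded field, with the all-zero code reserved for "no signal of this group is active".

```python
def encode_groups(microwords, groups):
    """Encode horizontal microwords as one vertical field per group.

    microwords: bit strings; character i (0-based) belongs to signal i + 1.
    groups: lists of 1-based signal numbers; at most one signal per group
            may be active in any word.
    Returns the encoded words and the total encoded width in bits.
    """
    # A group of k signals needs k + 1 codes (k signals plus "none"),
    # i.e. ceil(log2(k + 1)) bits, which equals k.bit_length().
    widths = [len(group).bit_length() for group in groups]
    encoded = []
    for word in microwords:
        fields = []
        for group, width in zip(groups, widths):
            active = [s for s in group if word[s - 1] == "1"]
            if len(active) > 1:
                raise ValueError(f"signals {active} are concurrent in one group")
            code = group.index(active[0]) + 1 if active else 0
            fields.append(format(code, f"0{width}b"))
        encoded.append(" ".join(fields))
    return encoded, sum(widths)


# Horizontal microwords from the lecture (11 activation signals, 4 steps).
WORDS = ["11000101010", "00100010101", "00010000000", "00001000000"]
# Assumed grouping: no two signals of a group are active in the same step.
GROUPS = [[1, 3, 4], [2], [6, 7, 5], [8, 9], [10, 11]]

encoded_words, total_width = encode_groups(WORDS, GROUPS)
for w in encoded_words:
    print(w)
print(total_width)  # 9 bits per word instead of 11
```

With this grouping every step still activates all of its signals in a single word, so concurrency is preserved while the word shrinks from 11 to 9 bits.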
[Figure: the 11 activation signals are reordered as 1 3 4 2 6 7 5 8 9 10 11 and partitioned into the groups A to E. Each group is encoded separately, and the decoders D1 to D4 regenerate the activation signals.]

Step   Horizontal word   A (1,3,4)   B (2)   C (6,7,5)   D (8,9)   E (10,11)
1      11000101010       01          1       01          01        01
2      00100010101       10          0       10          10        10
3      00010000000       11          0       00          00        00
4      00001000000       00          0       11          00        00

Microcode Compaction
• To minimize the number of groups.
• Construct a conflict graph, where the vertices correspond to the operations and the edges represent concurrency.
[Figure: a conflict graph on the vertices 1 to 6, before and after coloring.]
• A minimum coloring of this graph yields the minimum number of groups needed.
• Note: this does not necessarily lead to the minimum number of word bits (e.g., 10 signals can be divided as 5+5, which needs 3+3 = 6 bits, or as 7+3, which needs 3+2 = 5 bits).

Hard-Wired Control Synthesis
• Generate a Moore-type finite-state machine from a schedule.
[Figure: the four microwords 11000101010, 00100010101, 00010000000, and 00001000000 turned into an FSM with the states S1 to S4, entered at S1 on Reset. S1 activates the signals 1, 2, 6, 8, 10; S2 activates 3, 7, 9, 11; S3 activates 4; S4 activates 5.]
• Synthesize the FSM model.

Advanced Issues of HLS
• Many-to-many mapping between operations and physical components.
[Figure: the operations +, x, and - mapped to the components Adder, Mult, ALU, and Subs.]
• Bit-width compatibility.
• Re-use of previous designs (partial structure).
• Synthesis with commercially available subsystems, IP-based synthesis.
• HLS with testability consideration.

Summary
• High-level synthesis is one of the most important design steps in the design flow of electronic systems.
• The use of efficient HLS tools has led to great improvements in design productivity.
• The two most important tasks are scheduling and allocation/binding, which are interdependent.
• The HLS tasks are usually formulated as optimization problems, and heuristic algorithms are used to solve them.