Graph Transformations for Improved Tree Height Reduction

Mangalam G.N., Sanjiv Narayan, Paul van Besouw, LaNae Avra, Anmol Mathur, Sanjeev Saluja
Cadence Design Systems
Plot Nos. 57 A&B, Noida Export Processing Zone, Noida, U.P., India
{mangalam, narayan, paulvb, lavra, mathura, ssaluja}@cadence.com

Abstract - Tree height reduction helps in minimizing the critical path delay and area in datapath rich designs during synthesis. In this paper, we introduce the necessary conditions to identify height reducible arithmetic expressions and three graph transformations that make Tree Height Reduction more efficient: (a) Bit-width matching - a technique in which input signals that match in their bit-widths are grouped together so that smaller width arithmetic nodes are created in the graph. (b) Carry / Borrow Optimization - a graph transformation by which an optimum number of single bit inputs are distributed as carry / borrow to the add / subtract nodes in the graph. (c) Constant grouping - a graph transformation in which constant inputs are grouped together to form a sub-tree of constants. Experiments on industrial designs with these graph transformations coupled with Tree Height Reduction have shown significant improvement in critical path delay and area.

I. INTRODUCTION

Meeting the timing constraints of high performance system-on-chip designs is a challenging task for any synthesis tool. Many of these designs have large and complex arithmetic operations in them. Front-end optimizations that keep these operations off the critical path as much as possible are essential to meet the constraints set by chip designers. Tree Height Reduction (THR) is a well-known graph optimization technique for minimizing critical path length during synthesis. Digital filters, array computations and any other computation that has a chain of adders, subtractors or multipliers benefit from THR.
An arithmetic expression gets converted to a tree of arithmetic operations during synthesis. A tree of arithmetic operations is a connected acyclic graph with the following properties:
1. The nodes in the graph represent inputs, outputs and arithmetic operators. We represent inputs and outputs as literals and operator nodes using +, - or *.
2. The edges in the graph represent the flow of data between the operations.
3. Associated with every node (edge) is a positive integer value called the width of the node (edge). For an input (output) node, it represents the bit-width of the input (output) signal. The width of an operator node is the width of the arithmetic operator. The width of an edge is the number of least significant bits of its source node which are used as input by the node at the destination of the edge.

THR takes advantage of the associative and commutative properties of arithmetic operations. It attempts to transform skewed arithmetic expression trees to balanced expression trees, thus reducing the critical path length of circuits from O(n) to O(log n) where n is the number of arithmetic operations in the tree [1]. A tree is said to be balanced if the difference in height between any two of its sub-trees is not greater than one.

[Fig. 1. Tree Height Reduction]

In this paper, we introduce the necessary conditions to identify height reducible expression trees and three graph transformations (Bit-width Matching, Carry / Borrow Optimization, and Constant Grouping) that improve THR. Experiments conducted on industrial test designs demonstrate significant improvement in Quality of Results (QOR - area, critical path delay and run time) obtained from such transformations when they are coupled with traditional THR. In Section 2, we formally state the problem with a motivating example. Section 3 describes the previous work done and the algorithm for THR proposed in [2] since it forms the basis of our work.
Section 4 describes our approach to overcome the shortcomings of this algorithm. In Section 5, we describe three graph transformations that make THR more efficient. Section 6 summarizes experiments and the results obtained.

II. PROBLEM STATEMENT

The main optimization goal of THR is to reduce the delay of a circuit by minimizing the critical path length. Given an arithmetic expression tree T with n operator nodes, the problem is to re-structure the expression tree so that it results in minimum critical path delay and / or area in the hardware implementation. To motivate the need for THR, consider the following expression in HDL.

    O = A + B + C + D

    for all nodes in expression tree {
        if (associative(curNode) && (outedgeCount(curNode) == 1)) {
            succNode = successorNode(curNode);
            if (typeOfNode(curNode) == typeOfNode(succNode)) {
                inputsOfNode(succNode) = inputsOfNode(curNode) + inputsOfNode(succNode);
                delete outedge(curNode);
                delete curNode;
            } //end if
        } //end if
    } //end for

Fig. 2. Algorithm For Collapsing Expression Tree

[Fig. 3. Iterative Splitting]

0-7695-1868-0/03/$17.00 (C) 2003 IEEE

One implementation of this expression, which is the default in most commercial synthesis tools, is the skewed graph shown in Figure 1(a). Assuming that all the input signals in the expression arrive at the same time and have the same width, the critical path is the longest path in the tree, as shown by the thick lines in Figure 1(a). It includes the delay of all three adders in the graph. Consider the balanced graph in Figure 1(b). The critical path shown by the thick lines includes the delay of only two adders. This shows that the critical path delay of a balanced tree is less than that of a skewed tree. In full precision arithmetic, the skewed tree has adders of width 8 bits, 9 bits and 10 bits and the width of the final output O is 11 bits, whereas the balanced tree has adders of width 8 bits, 8 bits and 9 bits with the width of the final output O as 10 bits.
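The adder widths and path lengths quoted above can be checked with a short Python sketch. It assumes the full precision rule implied by the text (an adder is as wide as its wider input, and its result is one bit wider) and four 8-bit inputs; the function names are illustrative and not part of any synthesis tool.

```python
def add(a, b, adders):
    """Combine two operands given as (width, depth); record the adder width."""
    width = max(a[0], b[0])                   # adder is as wide as its wider input
    adders.append(width)
    return (width + 1, max(a[1], b[1]) + 1)   # result grows by one carry bit


def skewed(widths):
    """Left-to-right chain: ((A + B) + C) + D."""
    adders = []
    acc = (widths[0], 0)
    for w in widths[1:]:
        acc = add(acc, (w, 0), adders)
    return adders, acc


def balanced(widths):
    """Pair neighbours level by level: (A + B) + (C + D)."""
    adders = []
    level = [(w, 0) for w in widths]
    while len(level) > 1:
        nxt = [add(level[i], level[i + 1], adders)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                    # odd operand passes through to next level
            nxt.append(level[-1])
        level = nxt
    return adders, level[0]


# Each call returns (adder widths, (output width, adders on the critical path)).
print(skewed([8, 8, 8, 8]))    # ([8, 9, 10], (11, 3))
print(balanced([8, 8, 8, 8]))  # ([8, 8, 9], (10, 2))
```

Under these assumptions the sketch reproduces the figures in the text: three adders on the critical path and an 11 bit output for the skewed tree, against two adders and a 10 bit output for the balanced tree.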
The smaller adders and the narrower output reduce the area of the design after THR.

III. PREVIOUS WORK

THR is a widely researched topic and several approaches have been proposed. In [1], an incremental THR technique for parallelization of application programs is presented. In [3], a THR technique that enables re-timing has been proposed. Potkonjak [4] has proposed an optimal application of all algebraic transformations, but the application domain was restricted to linear computations. Hartley [2] has proposed a simple and efficient algorithm for height reducing adder / multiplier trees. We describe this algorithm here since it forms the basis of our work.

The algorithm has two main steps. The first step is to iteratively collapse an expression tree consisting of arithmetic nodes into a multiple input arithmetic node using the algorithm in Figure 2. The expression tree can consist of adjacent add nodes or adjacent multiplier nodes. In the second step, we iteratively split the collapsed tree, which is a composite multiple input operator node as shown in Figure 3(a). Since all operations are associative and commutative, we split off any two inputs and feed them to a new operator node. The output from the new operator node is fed to the composite multiple input operator node. As a result, the composite operator node we started with has one input fewer. In Figure 3(b), inputs A and B are split off and fed to a new operator node. The output from this node is fed to the composite operator node. So, the composite operator node which had 5 inputs when we started now has only 4 inputs. Now, two more inputs are split off and this process continues till the initial composite node has only two inputs. We finally get a balanced tree as shown in Figure 3(d).

It is quite simple to extend this algorithm to handle subtractions also. During the first step, while collapsing subtract nodes, we need to keep track of the polarity (+ or -) of the inputs to the composite multiple input node. Then, during the step of iterative splitting, the correct operator (add or subtract) must be chosen based on the polarity of the inputs. Further, the correct polarity has to be propagated to the output according to the following rules:
- If both the inputs are of polarity +, the operator is an adder and the polarity of output is +.
- If one input is of polarity + and the other -, the operator is a subtractor and the polarity of output is +.
- If both the inputs are of polarity -, the operator is an adder and the polarity of output is -.

Adjacent add and subtract nodes can be collapsed if we take care of the above conditions to propagate the correct polarity. For example, consider the following expression with subtract operations.

    O = A + B - C - D

The graph generated before THR is shown in Figure 4(a). Operands C and D have a polarity -. So during THR, they generate an adder and the polarity of the output from this adder is -. The graph after THR is shown in Figure 4(b).

[Fig. 4. THR of Expressions with Subtractions]

IV. IDENTIFYING SAFELY HEIGHT REDUCIBLE EXPRESSION TREES

The algorithm described above is simple and efficient, but it does not address all the issues that must be considered to preserve the functionality of the expression after THR. To ensure functionality preservation, we need to identify safely height reducible sub-trees within an expression tree. To identify such sub-trees, we traverse the expression tree and identify breaknodes. A breaknode is an arithmetic node in the expression tree which is a boundary for a safely height reducible sub-tree. A breaknode has to be either an associative node or a subtract node. It also has to satisfy at least one of the following conditions:

1. The out edge from the node is a lossy edge. Consider the following piece of Verilog code to understand the notion of lossy edges.

    module sum(P, Q, R, S, O);
      input [3:0] P, Q, R;
      input [4:0] S;
      output [5:0] O;
      wire [2:0] T = P + Q;
      assign O = T + R + S;
    endmodule

The graph generated for the expression in the RTL without THR is shown in Figure 5(a). The add operation between P and Q results in a 5 bit result in full precision arithmetic. The 5 bit result is truncated to 3 bits, which is then extended to 4 bits for the addition with R. Due to the extension of a truncated result, the output O cannot be directly expressed as a sum of P, Q, R and S. Re-balancing the expression tree in Figure 5(a) as a whole would give a functionally incorrect result due to loss of information content. We call edges which are truncated first and then extended for the following operation lossy edges, and the node driving such an edge is a breaknode for THR. In Figure 5, edge T is a lossy edge and it partitions the expression tree in Figure 5(a) into two safely height reducible sub-trees G1 and G2, as shown in Figure 5(b).

[Fig. 5. Safely Height Reducible Sub-trees]

2. The node has more than one out edge. The output of such a node is a value that is used in more than one sub-expression and can be treated as a common sub-expression.
3. The operator type of the successor node is multiply and the node's own type is add / subtract, or vice versa.
4. The adder / multiplier architecture set on the successor node is different from the one set on the node itself.
5. The successor node belongs to a different parenthesized sub-expression. (This condition is valid only if parentheses in an expression are honoured.)

At the end of this procedure, the expression tree is partitioned into one or more safely height reducible sub-trees and each of these sub-trees has a unique breaknode as its root node. We can now apply the algorithm described in [2] to each of the safely height reducible sub-trees.

V. GRAPH TRANSFORMATIONS

In this section, we introduce three graph transformations that improve the QOR of THR. These graph transformations, when combined with traditional THR, give a better critical path delay and area, and a lower runtime of the logic optimization tool.

A. Bit-width Matched THR

Bit-width Matching exploits the varying widths of operands in arithmetic expressions by feeding similar-width operands to the same arithmetic operation. Reducing variances in operand sizes during THR results in arithmetic components with smaller area. For example, consider the following expression (the number in brackets is the width of the signal):

    O(10) = A(4) + B(8) + C(4) + D(8)

[Fig. 6. Bit-width Matched THR]

Figure 6(a) shows the expression tree generated after THR but without any bit-width matching. For full precision arithmetic, adders of width 8 bits, 8 bits and 9 bits are created. In Bit-width Matched THR, we choose operand pairs in ascending order of their width while splitting the collapsed tree. The expression tree obtained after Bit-width Matched THR is shown in Figure 6(b). We now require adders that are of width 4 bits, 8 bits and 9 bits. These smaller width adders result in less area.

B. Carry / Borrow Optimization

Expressions often have single-bit operands. Carry / Borrow Optimization distributes such operands as the carry / borrow input for other multi-bit addition / subtraction nodes, resulting in fewer arithmetic nodes in the expression tree. For example, consider the expression given below.

    O = A + B - C + I1 + I2 + I3 - I4

where A, B, C are multi-bit inputs and I1, I2, I3, I4 are 1-bit wide. The THR'ed expression tree without carry / borrow optimization, with 4 adders and 2 subtractors, is shown in Figure 7(a). The expression tree after THR with carry / borrow optimization is given in Figure 7(b): inputs I1, I3 and I4 have been fed to the carry / borrow inputs of other nodes.
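The node counts in this example follow from a simple counting argument, sketched below in Python. It is a sketch under stated assumptions rather than the authors' implementation: every add / subtract node is assumed to have two regular inputs and at most one carry / borrow input, and the sketch does not check whether the operand polarities allow a particular placement. The name `count_nodes` is hypothetical.

```python
def count_nodes(multi_bit, single_bit, use_carry):
    """Return (nodes, carry_fed) for an expression with the given operand counts.

    A plain node turns 2 operands into 1; a node that also takes a carry /
    borrow input turns 3 operands into 1. With k nodes and c carry-fed
    operands this gives total - k - c == 1, subject to c <= k.
    """
    total = multi_bit + single_bit
    if total < 2:
        return 0, 0
    # Only single-bit operands may use a carry / borrow input (assumption above).
    carry_fed = min(single_bit, (total - 1) // 2) if use_carry else 0
    nodes = total - 1 - carry_fed
    return nodes, carry_fed


# O = A + B - C + I1 + I2 + I3 - I4: three multi-bit and four 1-bit operands
print(count_nodes(3, 4, use_carry=False))  # (6, 0)
print(count_nodes(3, 4, use_carry=True))   # (3, 3)
```

For the expression above this gives six nodes without the optimization (the 4 adders and 2 subtractors of Figure 7(a)) and three nodes once three of the single-bit operands are moved to carry / borrow inputs.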
The resulting expression tree not only has fewer nodes but also a reduced critical path length. The following lemmas establish how the single bit inputs in an expression can be distributed optimally as carry.

[Fig. 7. Carry / Borrow Optimization]

Lemma 1: If there are m multi-bit inputs and n single bit inputs in an expression, assuming that all inputs have a positive polarity, the number of add nodes in the expression tree due to multi-bit inputs would be (m - 1). If n ≤ (m - 1), all the single bit inputs can be distributed among the (m - 1) add nodes as carry.

Lemma 2: If n > (m - 1), $\lceil (n + m - 2)/2 \rceil$ single bit inputs can be used as carry.

Proof: The number of add nodes in the expression tree due to the m multi-bit inputs would be (m - 1). So (m - 1) single bit inputs can be distributed among these nodes as carry. The remaining single bit inputs are (n - m + 1). We can form a sub-tree using the remaining (n - m + 1) single bit inputs. Let h be the height of the sub-tree consisting of just single bit inputs. Height of a tree is defined as the number of nodes in the longest path in the tree. For the adder tree of single bit inputs, the leaf nodes are equivalent to inputs and the internal nodes are equivalent to add nodes. Applying the properties of a complete binary tree, it can have a maximum of $2^{h-1}$ leaf nodes or inputs and hence $2^{h-1} - 1$ add nodes, out of which $2^{h-2}$ add nodes would be at level one of the tree. Each of these $2^{h-2}$ add nodes at level one in the tree of single bit inputs can take 3 inputs (2 regular inputs and 1 carry input). Therefore, the tree can have a maximum of $2^{h-2} \cdot 3$ inputs at the leaf level. The number of add nodes remaining in the tree other than the ones at level one is $2^{h-2} - 1$. Each of these add nodes can take a carry input. Therefore,

$2^{h-2} \cdot 3 + 2^{h-2} - 1 = n - m + 1$
$2^{h} - 1 = n - m + 1$
$2^{h} = n - m + 2$

In the sub-tree of single bit inputs under our consideration, $2^{h-1}$ single bit inputs can be used as regular inputs to the adders. The remaining single bit inputs can be used as carry inputs. Therefore,

Number of single bit inputs that can be used as carry
$= n - 2^{h-1}$
$= n - \frac{2^{h}}{2}$
$= \frac{2n - 2^{h}}{2}$
$= \frac{2n - (n - m + 2)}{2}$
$= \frac{n + m - 2}{2}$

We have proved the result for adders. The same result can be extended for subtractors also.

C. Optimal Grouping of Constants

Constant Grouping groups constants together to form a sub-tree of constants while THR'ing an expression tree. Consider the following expression:

    O = A + 2 + B + 3

Synthesis tools would typically generate the expression tree of Figure 8(a), resulting in an implementation with 3 adders. The expression tree after THR with Constant Grouping is shown in Figure 8(b): constants 2 and 3 are fed to the same arithmetic node. Constant propagation [5] can work on this expression tree and reduce it to the expression tree shown in Figure 8(c), resulting in an implementation with only 2 adders.

[Fig. 8. Optimal Grouping of Constants]

To get the optimum result using this optimization, we follow a few guidelines. They are given below.
1. For multiply operations between constants, we always group constants together.
2. For add and subtract operations between constants, if there is at least one multi-bit constant in the expression, we always group the constants together.
3. Suppose there are only one bit constants. Let n be the number of positive single bit constants and m be the number of add nodes due to the multi-bit inputs in the expression. We group the constants together only if n > m. If n ≤ m, the single bit constants can be used as carry inputs to the add nodes. The same result applies to subtract nodes and negative single bit constants. This results in fewer add / sub nodes in the expression tree.

VI. EXPERIMENTS AND RESULTS

We have implemented THR and the graph transformations above as part of front-end HDL synthesis.
In addition to improving the critical path delay and area of the synthesized netlist, these graph transformations result in reduced runtime of the timing-driven logic optimization tool. We summarize the results obtained using these graph transformations in Tables 1, 2 and 3, compared to traditional THR without any of these transforms. The test designs considered are part of industrial designs. The critical path delay and area reported here are those obtained after front-end HDL synthesis but before any timing-driven logic optimization, so as to highlight the true benefits of these transformations. The arrival time of inputs and the required time of outputs for each test design have been set to 0. We have evaluated each graph transformation on datapath-only test designs using LSI's lca300k cell library on the following three metrics:
1. Critical path delay of the netlist obtained after synthesis but before any timing-driven logic optimization.
2. Area of the netlist obtained after synthesis but before any timing-driven logic optimization.
3. Runtime in CPU seconds of the timing-driven logic optimization tool on the netlist obtained after synthesis.

Table 1 summarizes the improvement in delay, area and runtime obtained with Bit-width Matched THR when compared to traditional THR. The expression trees with and without Bit-width Matched THR corresponding to test design sum are shown in Figure 9. Each of the five inputs in this design has a different width, showing the real benefit of Bit-width Matched THR. The smaller width inputs A and C are grouped together, which results in a 3 bit adder, instead of A and B, which would result in a 6 bit adder. Figure 10 shows the expression trees with and without Bit-width Matched THR corresponding to test design diff. This design shows Bit-width Matched THR applied to expressions consisting of add and subtract operations.
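The effect of Bit-width Matching on adder widths can be sketched in Python using the earlier example O(10) = A(4) + B(8) + C(4) + D(8). This is only one plausible reading of "choose operand pairs in ascending order of their width": operands are sorted by width at every level and neighbours are paired, which keeps the tree balanced. The width rule (an adder is as wide as its wider input and its result is one bit wider) and the name `split_levels` are assumptions made for illustration.

```python
def split_levels(widths, match_widths):
    """Build a balanced adder tree; return (adder_widths, output_width)."""
    adders = []
    level = list(widths)
    while len(level) > 1:
        if match_widths:
            level.sort()                      # narrow operands meet each other first
        nxt = []
        for i in range(0, len(level) - 1, 2):
            adder = max(level[i], level[i + 1])
            adders.append(adder)
            nxt.append(adder + 1)             # full precision: one extra carry bit
        if len(level) % 2:
            nxt.append(level[-1])             # unpaired operand moves up a level
        level = nxt
    return adders, level[0]


operands = [4, 8, 4, 8]                       # widths of A, B, C, D
print(split_levels(operands, match_widths=False))  # ([8, 8, 9], 10)
print(split_levels(operands, match_widths=True))   # ([4, 8, 9], 10)
```

With matching enabled, the first adder shrinks from 8 bits to 4 bits while the output width stays at 10 bits, which agrees with the adder widths given for Figure 6.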
[Fig. 9. Bit-width Matched THR for Design sum. Panels: w/o Bit-width Matched THR; with Bit-width Matched THR]

[Fig. 10. Bit-width Matched THR for Design diff. Panels: w/o Bit-width Matched THR; with Bit-width Matched THR]

TABLE I
IMPROVEMENT IN CRITICAL PATH DELAY, AREA AND RUNTIME WITH BIT-WIDTH MATCHED THR

Test     | THR                      | Bit-width Matched THR    | % Improvement in
Design   | Dly     Area    RTime    | Dly     Area    RTime    | Dly     Area    RTime
sum      | 4.85    320     8.5      | 4.26    293     7.9      | 12.28   8.44    7.06
diff     | 3.34    184     9.6      | 3.19    163     9.1      | 4.55    11.41   5.21

In Table 2, we summarize the improvement in delay, area and runtime obtained with Carry / Borrow Optimization when compared to THR without Carry / Borrow Optimization. The RTL in VHDL for test design cnt in Table 2 is given in Figure 11. The testcase adds all bits of a 64 bit vector to a register bit by bit. This design, when synthesized with traditional THR, has a balanced tree of 64 adders as shown in Figure 12(a), and the timing-driven logic optimization tool took approximately 4.5 hours to finish. When synthesized with carry / borrow optimization, the design has only 32 adders in a balanced tree as shown in Figure 12(b), and the logic optimizer took only 20 minutes to meet constraints. With carry / borrow optimization, every adder has a carry input in addition to the two regular inputs. This results in only 32 adders as compared to the 64 adders without the optimization.

    entity CNT is
      port (CLK, RESET : in std_logic;
            A : in unsigned(63 downto 0);
            ACCUM : out unsigned(31 downto 0));
    end CNT;

    architecture A of CNT is
      signal bits : unsigned(63 downto 0);
      signal act : unsigned(31 downto 0);
    begin
      process(CLK, RESET)
        variable sum : unsigned(31 downto 0);
      begin
        if (RESET = '1') then
          act <= (others => '0');
        elsif (CLK'event and CLK = '1') then
          sum := (others => '0');
          for i in 63 downto 0 loop
            sum := sum + bits(i);
          end loop;
          act <= act + sum;
        end if;
      end process;
      bits <= A;
      ACCUM <= act;
    end A;

Fig. 11. RTL in VHDL for Design cnt

[Fig. 12. Carry/Borrow Optimization for Design cnt. Panel (b): Tree of 32 adders with C/B Opt]

The expression trees with and without carry / borrow optimization corresponding to test design carry are shown in Figure 13. This design illustrates the use of positive single bit inputs as carry and negative single bit inputs as borrow. Single bit input I, which has a + polarity, has been used as carry for the add node between A and B, and single bit input J, which has a - polarity, has been used as borrow for the subtract operation between C and D. The implementation with carry / borrow optimization has only two adders and one subtractor, whereas the implementation without carry / borrow optimization has three adders and two subtractors.

[Fig. 13. Carry/Borrow Optimization for Design carry. Panels: w/o Carry/Borrow Opt; with Carry/Borrow Opt]

TABLE II
IMPROVEMENT IN CRITICAL PATH DELAY, AREA AND RUNTIME WITH CARRY/BORROW OPTIMIZATION

Test     | THR                      | THR with Carry/Borrow Opt | % Improvement in
Design   | Dly     Area    RTime    | Dly     Area    RTime     | Dly     Area    RTime
cnt      | 10.86   5288    15429.1  | 5.96    5179    1176.4    | 45.12   2.06    92.38
carry    | 4.65    294     13.8     | 3.31    260     13.2      | 28.97   11.56   4.35

Table 3 shows the improvement in delay, area and runtime obtained with Constant Grouping. The expression trees with and without constant grouping for test design stat are shown in Figure 14. In this design, constant 1 is added to a variable five times. The synthesized netlist has three adders without constant grouping, and after constant grouping and constant propagation it has only one adder. The test design cnstcry in Figure 15 shows how single bit constants used as carry / borrow can be grouped with other multi-bit constants to obtain a netlist with less area and better critical path delay. The expression tree before constant grouping has two nodes with carry / borrow inputs. But after constant grouping and constant propagation, there are no nodes in the expression tree with carry / borrow inputs. This is the reason for better QOR with the optimized expression tree even though it has the same number of add / subtract nodes as the unoptimized expression tree.

[Fig. 14. Constant Grouping for Design stat. Panels: Before Const Grouping; After Const Grouping; After Const Propagation]

[Fig. 15. Constant Grouping for Design cnstcry. Panels: Before Const Grouping; After Const Grouping; After Const Propagation]

TABLE III
IMPROVEMENT IN CRITICAL PATH DELAY, AREA AND RUNTIME WITH CONSTANT GROUPING

Test     | THR                      | THR with Constant Grouping | % Improvement in
Design   | Dly     Area    RTime    | Dly     Area    RTime      | Dly     Area    RTime
stat     | 5.22    671     58.2     | 2.44    286     15.5       | 53.18   57.38   73.37
cnstcry  | 4.28    322     7.5      | 3.81    314     6.0        | 10.93   2.48    20

VII. CONCLUSIONS

We have formulated the conditions necessary to identify safely height reducible sub-trees so that the functionality of the expression is preserved even after THR. We have also introduced in this paper three graph transformation techniques that make Tree Height Reduction more efficient in terms of QOR, resulting in a better critical path delay and area for the circuit than would be obtainable by doing traditional THR. In addition to reducing the critical path delay and area, these transformations give a better starting point for the logic / timing optimization tool, leading to a significant reduction in the run-time needed to meet timing goals. An interesting extension to this work would be to apply these graph transformations taking into consideration the arrival times of inputs.

REFERENCES

[1] Alexandru Nicolau and Roni Potasman, Incremental Tree Height Reduction for High Level Synthesis, ACM/IEEE Design Automation Conference, 1991.
[2] Richard Hartley and Albert Casavant, Tree Height Minimization in Pipelined Architectures, ICCAD, 1989.
[3] Zia Iqbal, Miodrag Potkonjak, Sujit Dey and Alice Parker, Critical Path Minimization Using Retiming and Algebraic SpeedUp, Technical Report 93-COO3-4-5510-1, NEC USA, 1993.
[4] Shan-Hsi Huang and Jan M. Rabaey, Maximizing the Throughput of High Performance DSP Applications Using Behavioral Transformations, ICCAD, 1994.
[5] A. V. Aho, R. Sethi and J. D. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley Publishing Company, 1986.