Graph Transformations for Improved Tree Height Reduction

Mangalam G.N., Sanjiv Narayan, Paul van Besouw, LaNae Avra, Anmol Mathur, Sanjeev Saluja
Cadence Design Systems
Plot Nos.57 A&B, Noida Export Processing Zone
Noida, U.P., India
{mangalam, narayan, paulvb, lavra, mathura, ssaluja}@cadence.com
Abstract - Tree height reduction helps in minimizing the critical path delay and area in datapath rich designs during synthesis. We introduce in this paper the necessary conditions to identify height reducible arithmetic expressions and three graph transformations that make Tree Height Reduction more efficient:
(a) Bit-width matching - a technique in which input signals that match in their bit-widths are grouped together so that smaller width arithmetic nodes are created in the graph.
(b) Carry / Borrow Optimization - a graph transformation by which an optimum number of single bit inputs are distributed as carry / borrow to the add / subtract nodes in the graph.
(c) Constant grouping - a graph transformation in which constant inputs are grouped together to form a sub-tree of constants.
Experiments on industrial designs with these graph transformations coupled with Tree Height Reduction have shown significant improvement in critical path delay and area.
I. INTRODUCTION
Meeting the timing constraints of high performance
system-on-chip designs is a challenging task for any synthesis tool. Many of these designs have large and complex
arithmetic operations in them. Optimizations in the front end that keep these operations off the critical path as much
as possible are essential to meet the constraints set by chip
designers. Tree Height Reduction (THR) is a well-known
graph optimization technique for minimizing critical path
length during synthesis. Digital Filters, array computations and any other computation that has a chain of adders
or subtractors or multipliers benefit from THR.
An arithmetic expression gets converted to a tree of
arithmetic operations during synthesis. A tree of arithmetic operations is a connected acyclic graph with the following properties:
1. The nodes in the graph represent inputs, outputs and
arithmetic operators. We represent inputs and outputs as
literals and operator nodes using +, - or *.
2. The edges in the graph represent the flow of data between the operations.
3. Associated with every node (edge) is a positive integer
value called the width of the node (edge). For an input (output) node, it represents the bit-width of the input (output)
signal. The width of an operator node is the width of the
arithmetic operator. The width of an edge is the number
of least significant bits of its source node which are used as
input by the node at the destination of the edge.
THR takes advantage of the associative and commutative properties of arithmetic operations. It attempts to
transform skewed arithmetic expression trees to balanced
expression trees, thus reducing the critical path length of
circuits from O(n) to O(log n) where n is the number of
arithmetic operations in the tree [1]. A tree is said to be
balanced if the difference in height between any two of its
sub-trees is not greater than one.
Fig. 1. Tree Height Reduction: (a) skewed tree, (b) balanced tree
In this paper, we introduce the necessary conditions to
identify height reducible expression trees and three graph
transformations (Bit-Width matching, Carry / Borrow Optimization, and Constant Grouping) that improve THR.
Experiments conducted on industrial test designs demonstrate significant improvement in Quality of Results (QOR
- area, critical path delay and run time) obtained from
such transformations when they are coupled with traditional THR.
In Section 2, we formally state the problem with a motivating example. Section 3 describes the previous work done
and the algorithm for THR proposed in [2] since it forms
the basis of our work. Section 4 describes our approach to
overcome the shortcomings of this algorithm. In Section 5,
we describe three graph transformations that make THR
more efficient. Section 6 summarizes experiments and the
results obtained.
II. PROBLEM STATEMENT
The main optimization goal of THR is to reduce the
delay of a circuit by minimizing the critical path length.
Given an arithmetic expression tree T with n operator
nodes, the problem is to re-structure the expression tree
so that it results in minimum critical path delay and /
or area in the hardware implementation.
To motivate the need for THR, consider the following expression in HDL.
O = A + B + C + D
0-7695-1868-0/03/$17.00 (C) 2003 IEEE
for all nodes curNode in expression tree {
    if (associative(curNode) && (outedgeCount(curNode) == 1)) {
        succNode = successorNode(curNode);
        if (typeOfNode(curNode) == typeOfNode(succNode)) {
            inputsOfNode(succNode) = inputsOfNode(curNode) + inputsOfNode(succNode);
            delete outedge(curNode);
            delete curNode;
        } //end if
    } //end if
} //end for
Fig. 2. Algorithm For Collapsing Expression Tree
Fig. 3. Iterative Splitting
One implementation of this expression, which is the default in most commercial synthesis tools, is the skewed
graph shown in Figure 1(a). Assuming that all the input
signals in the expression arrive at the same time and have
the same width, the critical path is the longest path in
the tree as shown by the thick lines in Figure 1(a). It includes the delay of all three adders in the graph. Consider
the balanced graph in Figure 1(b). The critical path shown
by the thick lines includes the delay of only two adders.
This shows that the critical path delay due to a balanced
tree is less than that due to a skewed tree. In full precision arithmetic, the skewed tree has adders of width 8 bits,
9 bits and 10 bits and the width of the final output O is 11
bits, whereas the balanced tree has adders of width 8 bits,
8 bits and 9 bits with the width of the final output O as 10
bits. The smaller width adders and output result in less
area of the design after THR.
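The adder widths quoted above can be checked with a small sketch. It assumes four 8-bit inputs (inferred from the stated adder widths), full precision arithmetic in which an adder is as wide as its widest input and its result is one bit wider, and two hypothetical helper names, `skewed` and `balanced`, that are not part of the original work.

```python
def skewed(widths):
    """Left-to-right chain ((A + B) + C) + D under full precision arithmetic."""
    acc = widths[0]
    adders = []
    for w in widths[1:]:
        adders.append(max(acc, w))   # adder is as wide as its widest input
        acc = adders[-1] + 1         # result grows by one bit
    # adder widths, output width, number of adders on the critical path
    return adders, acc, len(adders)


def balanced(widths):
    """Pair neighbouring operands level by level, as in a balanced tree."""
    level = list(widths)
    adders = []
    depth = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            adders.append(max(level[i], level[i + 1]))
            nxt.append(adders[-1] + 1)
        if len(level) % 2:           # odd operand passes to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return adders, level[0], depth


print(skewed([8, 8, 8, 8]))    # ([8, 9, 10], 11, 3)
print(balanced([8, 8, 8, 8]))  # ([8, 8, 9], 10, 2)
```

Under these assumptions the sketch reproduces the 8/9/10-bit adders with an 11-bit output for the skewed tree and the 8/8/9-bit adders with a 10-bit output for the balanced tree.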
III. PREVIOUS WORK
THR is a widely researched topic and several approaches
have been proposed. In [1], an incremental THR technique
for parallelization of application programs is presented. In
[3], a THR technique that enables re-timing has been proposed. Potkonjak [4] has proposed an optimal application
of all algebraic transformations, but the application domain
was restricted to linear computations.
Hartley [2] has proposed a simple and efficient algorithm
for height reducing adder / multiplier trees. We describe
this algorithm here since it forms the basis of our work.
The algorithm has mainly two steps. The first step is to
iteratively collapse an expression tree consisting of arithmetic nodes into a multiple input arithmetic node using
the algorithm in Figure 2. The expression tree can consist
of adjacent add nodes or adjacent multiplier nodes.
In the second step, we iteratively split the collapsed tree
which is a composite multiple input operator node as shown
in Figure 3(a). Since all operations are associative and
commutative, we split off any two inputs and feed them
to a new operator node. The output from the new operator node is fed to the composite multiple input operator
node. As a result, the composite operator node we started
with has one input fewer. In Figure 3(b), inputs A and B
are split off and fed to a new operator node. The output
from this node is fed to the composite operator node. So,
the composite operator node which had 5 inputs when we
started, now has only 4 inputs. Now, two more inputs are
split off and this process continues till the initial composite
node has only two inputs. We finally get a balanced tree
as shown in Figure 3(d).
It is quite simple to extend this algorithm to handle subtractions also. During the first step, while collapsing subtract nodes, we need to keep track of the polarity (+ or -)
of the inputs to the composite multiple input node. Then,
during the step of iterative splitting, the correct operator
(add or subtract) must be chosen based on the polarity of
the inputs. Further, the correct polarity has to be propagated to the output according to the following rules:
• If both the inputs are of polarity +, the operator is an
adder and the polarity of output is +.
• If one input is of polarity + and the other -, the operator
is a subtractor and the polarity of output is +.
• If both the inputs are of polarity -, the operator is an
adder and the polarity of output is -.
Adjacent add and subtract nodes can be collapsed if we
take care of the above conditions to propagate the correct
polarity. For example, consider the following expression
with subtract operations.
O = A + B - C - D
The graph generated before THR is shown in Figure 4(a).
Operands C and D have a polarity -. So during THR, they
generate an adder and the polarity of the output from this
adder is -. The graph after THR is shown in Figure 4(b).
Fig. 4. THR of Expressions with Subtractions
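The iterative splitting step with polarity tracking can be sketched as follows. This is an illustrative sketch, not the implementation from [2]: the function name `balance`, the tuple representation of operands, and the first-in-first-out choice of which two inputs to split off are assumptions made here.

```python
from collections import deque


def balance(operands):
    """Split a collapsed add/subtract node into a balanced tree.

    operands: list of (name, polarity) pairs, polarity being '+' or '-'.
    Returns (expression, polarity, height) for the root of the tree.
    A root polarity of '-' means the caller still has to negate the result.
    """
    queue = deque((name, pol, 0) for name, pol in operands)
    while len(queue) > 1:
        a, pa, ha = queue.popleft()
        b, pb, hb = queue.popleft()
        if pa == pb:
            # both '+' or both '-': adder, polarity is inherited
            expr, pol = f"({a} + {b})", pa
        elif pa == '+':
            # one '+' and one '-': subtractor, output polarity '+'
            expr, pol = f"({a} - {b})", '+'
        else:
            expr, pol = f"({b} - {a})", '+'
        # feed the new operator's output back into the composite node
        queue.append((expr, pol, max(ha, hb) + 1))
    return queue[0]


print(balance([("A", "+"), ("B", "+"), ("C", "-"), ("D", "-")]))
# ('((A + B) - (C + D))', '+', 2)
```

For O = A + B - C - D the sketch pairs C and D in an adder of polarity - and subtracts its output from A + B, which matches the structure described for Figure 4(b).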
IV. IDENTIFYING SAFELY HEIGHT REDUCIBLE EXPRESSION TREES
The above described algorithm is simple and efficient.
But it does not address all the issues to be considered to
preserve functionality of the expression even after THR. To
ensure functionality preservation, we need to identify safely
height reducible sub-trees within an expression tree. To
identify such sub-trees, we traverse the expression tree and
identify breaknodes. A breaknode is an arithmetic node in
the expression tree which is a boundary for a safely height
reducible sub-tree. A breaknode has to be either an associative node or a subtract node. It also has to satisfy at
least one of the following conditions:
1. The out edge from the node is a lossy edge. Consider
the following piece of Verilog code to understand the notion
of lossy edges.
module sum(P, Q, R, S, O);
input [3:0] P, Q, R;
input [4:0] S;
output [5:0] O;
wire [2:0] T = P + Q;
assign O = T + R + S;
endmodule
The graph generated for the expression in the RTL without
THR is shown in Figure 5(a). The add operation between
P and Q results in a 5 bit result in full precision arithmetic. The 5 bit result is truncated to 3 bits which is
then extended to 4 bits for the addition with R. Due to
extension of a truncated result, the output O cannot be
directly expressed as a sum of P, Q, R and S. Re-balancing
the expression tree in Figure 5(a) as a whole would give
a functionally incorrect result due to loss of information
content. We call edges which are truncated first and then
extended for the following operation lossy edges, and the
node driving such an edge is a breaknode for THR. In Figure 5,
edge T is a lossy edge and it partitions the expression tree
in Figure 5(a) into two safely height reducible sub-trees G1
and G2, as shown in Figure 5(b).
2. The node has more than one out edge. The
output of such a node is a value that is used in more than
one sub-expression and can be treated as a common subexpression.
3. The operator type of the successor node is multiply
and its own type is add / subtract, or vice versa.
4. The adder / multiplier architecture set on the successor
node is different from the one set on itself.
5. The successor node belongs to a different parenthesized
sub-expression. (This condition is valid only if parentheses
in an expression are honoured).
Fig. 5. Safely Height Reducible Sub-trees
At the end of this procedure, the expression tree is partitioned into one or more safely height reducible sub-trees
and each of these sub-trees has a unique breaknode as
its root node. We can now apply the algorithm described
in [2] to height reduce each of the safely height reducible sub-trees.
V. GRAPH TRANSFORMATIONS
In this section, we introduce three graph transformations
that improve the QOR of THR. These graph transformations, when combined with traditional THR, give a better
critical path delay, area and runtime of the logic optimization tool.
A. Bit-width Matched THR
Bit-width Matching exploits the varying widths of
operands in arithmetic expressions by feeding similar-width
operands to the same arithmetic operation. Reducing variances in operand sizes during THR results in arithmetic
components with smaller area. For example, consider the
following expression (number in brackets is the width of
the signal) :
O(10) = A(4) + B(8) + C(4) + D(8)
Fig. 6. Bit-width Matched THR
Figure 6(a) shows the expression tree generated after
THR but without any bit-width matching. For full precision arithmetic, adders of width 8 bits, 8 bits and 9
bits are created. In Bit-width matched THR, we choose
operand pairs in ascending order of their width while splitting the collapsed tree. The expression tree obtained after
Bit-width matched THR is shown in Figure 6(b). We now
require adders that are of width 4 bits, 8 bits and 9 bits.
These smaller width adders result in less area.
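A minimal sketch of the pairing rule follows. It assumes that splitting proceeds level by level, that an adder is as wide as its widest input with a result one bit wider, and that sorting each level by width is an adequate stand-in for "choosing operand pairs in ascending order of their width"; the helper name `adder_widths` is hypothetical.

```python
def adder_widths(operand_widths, match_widths):
    """Return the widths of the adders created while splitting a collapsed add node.

    match_widths=False pairs operands in the order given;
    match_widths=True pairs them in ascending order of width.
    """
    level = list(operand_widths)
    adders = []
    while len(level) > 1:
        if match_widths:
            level.sort()
        nxt = []
        for i in range(0, len(level) - 1, 2):
            width = max(level[i], level[i + 1])
            adders.append(width)
            nxt.append(width + 1)    # full precision result is one bit wider
        if len(level) % 2:           # odd operand passes to the next level
            nxt.append(level[-1])
        level = nxt
    return adders


# O(10) = A(4) + B(8) + C(4) + D(8)
print(adder_widths([4, 8, 4, 8], match_widths=False))  # [8, 8, 9]
print(adder_widths([4, 8, 4, 8], match_widths=True))   # [4, 8, 9]
```

With these assumptions the sketch gives adders of 8, 8 and 9 bits without matching and 4, 8 and 9 bits with matching, the same widths as in the example above.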
B. Carry / Borrow Optimization
Expressions often have single-bit operands. Carry / Borrow Optimization distributes such operands as the carry
/ borrow input for other multi-bit addition / subtraction
nodes, resulting in fewer arithmetic nodes in the expression
tree. For example, consider the expression given below.
O = A + B - C + I1 + I2 + I3 - I4
where A, B, C are multi-bit inputs and I1, I2, I3, I4
are 1-bit wide. The THR’ed expression tree without carry
/ borrow optimization with 4 adders and 2 subtractors is
shown in Figure 7(a). The expression tree after THR with
carry / borrow optimization is given in Figure 7(b) - inputs
I1, I3, I4 have been fed to the carry/borrow inputs of other
nodes. The resulting expression tree not only has fewer
nodes but also a reduced critical path length.
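The reduction in node count can be sketched with a simple capacity argument. The helper below is hypothetical and is not the authors' implementation: it assumes that every add / subtract node takes two regular operands plus one single-bit carry / borrow, it ignores polarity constraints, and it relies on the fact that a binary tree with k operator nodes has k + 1 regular inputs and therefore room for at most k carry / borrow inputs.

```python
def node_counts(multi_bit, single_bit):
    """Return (nodes without optimization, nodes with carry/borrow optimization)."""
    total = multi_bit + single_bit
    without = total - 1
    # c single-bit inputs can be absorbed only if c <= single_bit and
    # c <= (total - c) - 1, i.e. c <= (total - 1) // 2
    absorbed = min(single_bit, (total - 1) // 2)
    return without, total - absorbed - 1


print(node_counts(3, 4))    # A, B, C plus I1..I4 -> (6, 3)
print(node_counts(1, 64))   # one multi-bit input plus 64 single bits -> (64, 32)
```

The first call corresponds to the expression above, where six nodes shrink to three once three single-bit inputs are absorbed. The second call uses one multi-bit and 64 single-bit operands and yields 64 versus 32 nodes.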
Fig. 7. Carry / Borrow Optimization
The following lemmas show how the single bit inputs in
an expression can be distributed optimally as carry.
Lemma 1: If there are m multi-bit inputs and n single
bit inputs in an expression, assuming that all inputs have a
positive polarity, the number of add nodes in the expression
tree due to multi-bit inputs would be (m - 1). If n <= (m - 1), all the single bit inputs can be distributed among
the (m - 1) add nodes as carry.
Lemma 2: If n > (m - 1), $\lceil (n + m - 2)/2 \rceil$ single bit inputs can
be used as carry.
Proof: The number of add nodes in the expression tree
due to the m multi-bit inputs would be (m - 1). So (m - 1)
single bit inputs can be distributed among these nodes as
carry. The remaining single bit inputs are (n - m + 1).
We can form a sub-tree using the remaining (n - m + 1)
single bit inputs. Let h be the height of the sub-tree consisting of just single bit inputs. Height of a tree is defined
as the number of nodes in the longest path in the tree. For
the adder tree of single bit inputs, the leaf nodes are equivalent to inputs and the internal nodes are equivalent to add
nodes. Applying the properties of a complete binary tree,
it can have a maximum of $2^{h-1}$ leaf nodes or inputs and
hence $2^{h-1} - 1$ add nodes out of which $2^{h-2}$ add nodes
would be at level one of the tree. Each of these $2^{h-2}$ add
nodes at level one in the tree of single bit inputs can take
3 inputs (2 regular inputs and 1 carry input). Therefore,
the tree can have a maximum of $2^{h-2} \cdot 3$ inputs at the leaf
level.
The number of add nodes remaining in the tree other
than the ones at level one is $2^{h-2} - 1$. Each of these add
nodes can take a carry input. Therefore,
$$2^{h-2} \cdot 3 + 2^{h-2} - 1 = n - m + 1$$
$$2^{h} - 1 = n - m + 1$$
$$2^{h} = n - m + 2$$
In the sub-tree of single bit inputs under our consideration, $2^{h-1}$ single bit inputs can be used as regular inputs
to the adders. The remaining single bit inputs can be used
as carry inputs. Therefore,
Number of single bit inputs that can be used as carry
$$= n - 2^{h-1} = n - \frac{2^{h}}{2} = \frac{2n - 2^{h}}{2} = \frac{2n - (n - m + 2)}{2} = \frac{n + m - 2}{2}$$
We have proved the result for adders. The same result
can be extended for subtractors also.
C. Optimal Grouping of Constants
Constant Grouping groups constants together to form
a sub-tree of constants while THR’ing an expression tree.
Consider the following expression:
O = A + 2 + B + 3
Synthesis tools would typically generate the expression
tree of Figure 8(a), resulting in an implementation with
3 adders. The expression tree after THR with Constant
Grouping is shown in Figure 8(b) - constants 2 and 3 are fed
to the same arithmetic node. Constant propagation [5] can
work on this expression tree and reduce it to the expression
tree shown in Figure 8(c) resulting in an implementation
with only 2 adders.
Fig. 8. Optimal Grouping of Constants
To get the optimum result using this optimization, we
follow a few guidelines. They are given below.
1. For multiply operations between constants, we always
group constants together.
2. For add and subtract operations between constants, if
there is at least one multi-bit constant in the expression,
we always group the constants together.
3. Suppose there are only one bit constants. Let n be the
number of positive single bit constants and m be the number of add nodes due to the multi-bit inputs in the expression. We group the constants together only if n > m. If
n <= m, the single bit constants can be used as carry inputs to the add nodes. The same result applies to subtract
nodes and negative single bit constants. This results in a
smaller number of add / sub nodes in the expression tree.
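For the additive example O = A + 2 + B + 3, grouping followed by constant propagation can be sketched as below. The sketch handles only sums of positive terms, represents signals as strings and constants as integers, and uses the hypothetical helper names `group_constants` and `adder_count`; none of these choices come from the original work.

```python
def group_constants(terms):
    """Group the constant operands of a sum and fold them into one constant.

    terms: operands of a collapsed add node, each a signal name (str)
    or an integer constant.
    """
    signals = [t for t in terms if isinstance(t, str)]
    constants = [t for t in terms if isinstance(t, int)]
    # grouping puts all constants under one sub-tree; constant
    # propagation then replaces that sub-tree by a single value
    return signals + ([sum(constants)] if constants else [])


def adder_count(terms):
    """A sum of k operands needs k - 1 two-input adders."""
    return max(len(terms) - 1, 0)


terms = ["A", 2, "B", 3]
folded = group_constants(terms)
print(adder_count(terms), folded, adder_count(folded))  # 3 ['A', 'B', 5] 2
```

The adder count drops from 3 to 2, in line with the reduction described for Figures 8(a) and 8(c).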
VI. EXPERIMENTS AND RESULTS
We have implemented THR and the graph transformations above as part of front-end HDL synthesis. In addition
to improving the critical path delay and area of the synthesized netlist, these graph transformations result in reduced
runtime of the timing-driven logic optimization tool. We
summarize the results obtained using these graph transformations in Tables 1, 2 and 3 when compared to traditional THR without any of these transforms. The test designs considered are part of industrial designs. The critical
path delay and area reported here are those obtained after front-end HDL synthesis but before any timing-driven
logic optimization so as to highlight the true benefits of
these transformations.
The arrival time of inputs and the required time of outputs for each test design have been set
to 0.
We have evaluated each graph transformation on datapath-only test designs using LSI’s lca300k cell library,
on the following three metrics:
1. Critical path delay of the netlist obtained after synthesis
but before any timing-driven logic optimization.
2. Area of the netlist obtained after synthesis but before
any timing-driven logic optimization.
3. Runtime in CPU seconds of the timing-driven logic optimization tool on the netlist obtained after synthesis.
Table 1 summarizes the improvement in delay, area and
runtime obtained with Bit-width Matched THR when compared to traditional THR. The expression trees with and
without Bit-width Matched THR corresponding to test design sum are shown in Figure 9. Each of the five inputs in
this design have different widths showing the real benefit
of Bit-width Matched THR. Smaller width inputs A and C
are grouped together which result in a 3 bit adder instead
of A and B which result in a 6 bit adder. Figure 10 shows
the expression trees with and without Bit-width Matched
THR corresponding to test design diff. This design shows
Bit-width Matched THR applied to expressions consisting
of add and subtract operations.
Fig. 9. Bit-width Matched THR for Design sum (panels: w/o Bit-width Matched THR, with Bit-width Matched THR)
Fig. 10. Bit-width Matched THR for Design diff (panels: w/o Bit-width Matched THR, with Bit-width Matched THR)
In Table 2, we summarize the improvement in delay, area
and runtime obtained with Carry / Borrow Optimization
when compared to THR without Carry / Borrow Optimization. The RTL in VHDL for test design cnt in Table 2 is
given in Figure 11.
entity CNT is
  port (CLK, RESET : in std_logic;
        A : in unsigned(63 downto 0);
        ACCUM : out unsigned(31 downto 0));
end CNT;
architecture A of CNT is
  signal bits : unsigned(63 downto 0);
  signal acc : unsigned(31 downto 0);
begin
  process(CLK, RESET)
    variable sum : unsigned(31 downto 0);
  begin
    if (RESET = '1') then
      acc <= (others => '0');
    elsif (CLK'event and CLK = '1') then
      sum := (others => '0');
      for i in 63 downto 0 loop
        sum := sum + bits(i);
      end loop;
      acc <= acc + sum;
    end if;
  end process;
  bits <= A;
  ACCUM <= acc;
end A;
Fig. 11. RTL in VHDL for Design cnt
The testcase adds all bits of a 64 bit vector to a register
bit by bit. This design when synthesized with traditional
THR has a balanced tree of 64 adders as shown in Figure
12(a) and the timing-driven logic optimization tool took
approximately 4.5 hours to finish. When synthesized with
carry / borrow optimization, the design has only 32 adders
in a balanced tree as shown in Figure 12(b) and the logic
optimizer took only 20 minutes to meet constraints. With
carry / borrow optimization, every adder has a carry input in addition to the two regular inputs. This results in
only 32 adders as compared to the 64 adders without the
optimization. The expression trees with and without carry
/ borrow optimization corresponding to test design carry
are shown in Figure 13. This design illustrates the use of
positive single bit inputs as carry and negative single bit
inputs as borrow. Single bit input I which has a + polarity
has been used as carry for the add node between A and B
and single bit input J which has a - polarity has been used
as borrow for the subtract operation between C and D. The
implementation with carry / borrow optimization has only
two adders and one subtractor, whereas the implementation without carry / borrow optimization has three adders
and two subtractors.
Fig. 12. Carry/Borrow Optimization for Design cnt (panel (b): Tree of 32 adders with C/B Opt)
Fig. 13. Carry/Borrow Optimization for Design carry (panels: w/o Carry/Borrow Opt, with Carry/Borrow Opt)
Table 3 shows the improvement in delay, area and runtime obtained with Constant Grouping.
The expression
trees with and without constant grouping for test design
stat are shown in Figure 14. In this design, constant 1
is added to a variable five times. The synthesized netlist
has three adders without constant grouping and after constant grouping and constant propagation, it has only one
adder. The test design cnstcry in Figure 15 shows how
single bit constants used as carry / borrow can be grouped
with other multi-bit constants to obtain a netlist with lesser
area and better critical path delay. The expression tree before constant grouping has two nodes with carry / borrow
inputs. But after constant grouping and constant propagation, there are no nodes in the expression tree with carry
/ borrow inputs. This is the reason for better QOR with
the optimized expression tree even though it has the same
number of add / subtract nodes as the unoptimized expression tree.
Fig. 14. Constant Grouping for Design stat (panels: Before Const Grouping, After Const Grouping, After Const Propagation)
Fig. 15. Constant Grouping for Design cnstcry (panels: Before Const Grouping, After Const Grouping, After Const Propagation)

TABLE I
IMPROVEMENT IN CRITICAL PATH DELAY, AREA AND RUNTIME WITH BIT-WIDTH MATCHED THR
Test      THR                     Bit-width Matched THR    % Improvement in
Design    Dly     Area   RTime    Dly     Area   RTime     Dly     Area    RTime
sum       4.85    320    8.5      4.26    293    7.9       12.28   8.44    7.06
diff      3.34    184    9.6      3.19    163    9.1       4.55    11.41   5.21

TABLE II
IMPROVEMENT IN CRITICAL PATH DELAY, AREA AND RUNTIME WITH CARRY/BORROW OPTIMIZATION
Test      THR                       THR with Carry/Borrow Opt   % Improvement in
Design    Dly     Area   RTime      Dly     Area   RTime        Dly     Area    RTime
cnt       10.86   5288   15429.1    5.96    5179   1176.4       45.12   2.06    92.38
carry     4.65    294    13.8       3.31    260    13.2         28.97   11.56   4.35

TABLE III
IMPROVEMENT IN CRITICAL PATH DELAY, AREA AND RUNTIME WITH CONSTANT GROUPING
Test      THR                     THR with Constant Grouping   % Improvement in
Design    Dly     Area   RTime    Dly     Area   RTime         Dly     Area    RTime
stat      5.22    671    58.2     2.44    286    15.5          53.18   57.38   73.37
cnstcry   4.28    322    7.5      3.81    314    6.0           10.93   2.48    20

VII. CONCLUSIONS
We have formulated the conditions necessary to identify
safely height reducible sub-trees so that the functionality
of the expression is preserved even after THR. We have
also introduced in this paper three graph transformation
techniques that make Tree Height Reduction more efficient
in terms of QOR - resulting in a better critical path delay
and area for the circuit than would be obtainable by doing
traditional THR. In addition to reducing the critical path
delay and area, these transformations give a better starting
point for logic / timing optimization tool, leading to significant reduction in run-times to meet timing goals. An
interesting extension to this work would be to apply these
graph transformations taking into consideration the arrival
times of inputs.
REFERENCES
[1] Alexandru Nicolau and Roni Potasman, Incremental Tree Height Reduction for High Level Synthesis, ACM/IEEE Design Automation Conference, 1991.
[2] Richard Hartley and Albert Casavant, Tree Height Minimization in Pipelined Architectures, ICCAD, 1989.
[3] Zia Iqbal, Miodrag Potkonjak, Sujit Dey and Alice Parker, Critical Path Minimization Using Retiming and Algebraic Speed-Up, Technical Report 93-COO3-4-5510-1, NEC USA, 1993.
[4] Shan-Hsi Huang and Jan M. Rabaey, Maximizing the Throughput of High Performance DSP Applications Using Behavioral Transformations, ICCAD, 1994.
[5] A. V. Aho, R. Sethi and J. D. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley Publishing Company, 1986.