Wire Delay Models for Global Placement of ASICs

Wire Delay Models for Global Placement of ASICs
By
Krassimir Paskalev
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Computer Science
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
January 17, 2002
Copyright 2002 Krassimir Paskalev. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
Author______________________________________________________________________
Krassimir Paskalev
January 17, 2002
Certified by________________________________________________________________
Michael Fu, R&D Manager, Ph.D.
Synopsys Thesis Supervisor
Certified by________________________________________________________________
Jacob K. White
M.I.T. Thesis Supervisor
Accepted by_________________________________________________________________
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
Wire Delay Models for Global Placement of ASICs
By
Krassimir Paskalev
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Computer Science
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
January 17, 2002
Copyright 2002 Krassimir Paskalev. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis
and to grant others the right to do so.
ABSTRACT
A new model for the propagation delay between two logic gates for timing-driven global placement is
proposed. The model is a function of the number of pins on the net, the half perimeter of the bounding
box enclosing the net, and the half perimeter of the bounding box enclosing the driving pin and the sink
pin. On a training set of two designs and testing set of another two, the proposed model is 31% more
accurate than the current state-of-the-art model and has comparable computational complexity.
Thesis Supervisor: Jacob White
Title: Associate Director, Research Laboratory of Electronics
-2-
Introduction
When solving real-world optimization problems, an engineer is faced with a tradeoff between the
computational complexity of the algorithm and optimality of the final solution. Modeling wire delay in
global ASIC placement is an example of an engineering problem where such tradeoff exists.
The goal of wire delay modeling in global placement is to give the placer an easy to compute,
accurate estimate of the final (post-routing) delay of each wire. If the model is not accurate enough, the
placer will place some of the logic cells away from each other, because its model would indicate that the
delay between them is small. And if the actual delay through the wires connecting the cells turns out to be
drastically larger, the final speed of the chip will be significantly decreased. Therefore, wire delay
estimation during placement is extremely important, especially when having a high clock speed is the
most desired feature of the final design.
Today, the state-of-the-art wire delay models use statistical estimates based on the size of the
bounding box of all endpoints of the net1 and the number of endpoints of each wire.
My thesis is that more accurate wire delay estimation models for global2 placement exist, which
also have comparable runtime to the current state-of-the art models. My models estimate wire delay based
on relative position of the driver and the sink, in addition to the variables the current state-of-the-art
models use.
Background on Design of ASICs
This section provides background information about chip design (ASICs in particular) and how
the problem I am solving fits into the overall problem of designing modern chips.
1
Net – the wire connecting two or more cell input and output pins. Also referred to as interconnect, etc.
-3-
ASIC stands for Application-Specific Integrated Circuit and, as the name suggests, ASICs are
semiconductor devices that perform specialized tasks – control a microwave, pacemaker, mp3 player, etc.
As any integrated circuit, an ASIC contains a huge amount of transistors, which are interconnected in a
complex pattern to create a unique electric network. Electric current propagates through the chip, carrying
digital information in the form of zeroes and ones (high voltage and low voltage). The functionality of the
circuit3 depends on the specific way the transistors are interconnected, their number, and size.
Several steps are necessary to convert the idea of a new chip in the designer’s mind into a piece of
silicon that performs the desired functions. Today, that process is to a large extent automated and relies on
a set of software programs, called EDA tools. In the description below I will use a cellular phone
controller as an example of an ASIC to be designed. A summary of the major steps is presented in Fig 1.
2
3
Global placement – also known as coarse
Circuit – also referred to as “the design” by ASIC designers
-4-
Figure 1: Overview of the digital design flow
The first step in the ASIC design flow is to write down the specification of the device (phone
controller) in a computer-readable format. That is done in Hardware Description Language (HDL), like
Verilog or VHDL. A program in HDL specifies how the circuit should react to various input signals – the
phone should ring when there is an incoming call, the call should end when the “Cancel” button is pressed
etc.
The next step is to synthesize the HDL description into a set of simple logic operations like AND,
OR, NOT, and others. This step implements the HDL description (program) of the cellular phone into a
set of logic elements (called logic gates) and is done by a synthesis tool. For example, an HDL description
saying “if there is an incoming call and the line is not busy then ring” could be implemented as the
-5-
“incoming call” signal and “line busy” signal being the inputs of an AND logic gate, with the “ring”
signal being the output. However, it is possible to implement the same functionality in a different way
with a NAND (not-AND) gate, followed by a NOT gate. Even though the second approach might seem
silly, it is sometimes preferred for making the chip faster, consuming less power, or using less area.
Another reason why the second implementation might not work is because the AND logic gate might not
exist in the library of possible gates available to the ASIC tool (cell library).
The cell library is a set of logic gates, which the manufacturer requires that the digital
designers/synthesis tools use. These logic elements (also called library cells) are designed to be very fast,
efficient, and simplify the work of the designer. In a way, the library cells to the synthesis tool are like
LEGO blocks to the kid who wants to build a toy cellular phone – only certain library cells/LEGO blocks
are manufactured and could be used.
The next step after synthesis builds the network of logic cells is global placement. This is the step
where the wire delay models I research in this thesis will be used, so I will go over it in more detail. The
global placer decides where each cell should be placed on the chip. This decision is influenced by several
goals, which the placer attempts to achieve. First, the placer attempts not to overlap any cells, for only one
cell can occupy a certain location on the chip. Second, it attempts to place the cells such that the wires
connecting them later will not be too long. The reason why this is important is due to the limited number
of wires that can go through a location on the chip. Therefore if most pairs of cells that are connected to
each other are separated in distant corners of the chip, the corresponding wires will be long and it will
lead to excess number of wires in the center of chip. Due to the limited wiring capacity, some wires will
have to detour around the congested center and become very long. In turn long wires will increase the
power consumption and slow down the chip’s speed.
-6-
There are other placement goals, yet the most important one from the perspective of this thesis is
to increase the chip’s speed (or to decrease the chip timing/signal propagation delay). To achieve better
timing, the placer needs to model the propagation delay across the wire between placed cells, so that it can
determine which placement results in the fastest chip.
The way a numerical global placer works is similar to many search algorithms. It starts with any
placement of the gates and tries to improve the current placement until all placement goals are met. At
each step a new placement is generated and checked whether it is closer to the placement goals than the
current placement. If it is closer, the search continues from the new placement. If the new placement is
worse (away from the placement goals), the search continues from the current placement.
After global placement finds a satisfactory placement of all cells, the resulting cell locations have
to be legalized (detail-placed). The reason why there is a need for detailed placement is due to the
placement topology enforced by the manufacturer. Today’s ASICs have row-based topology where the
entire chip area is divided into rows of cells. The global placer, however, does not follow the specific
placement topology and places the cells on any location of the chip area. Therefore, the task of the
detailed placer is to snap all cells to valid row locations.
The last step in the chip design is routing. Routing specifies the exact paths which the wires follow
to connect the cells. Wires are routed on special grid-like tracks, which run horizontally and vertically
across the chip4 . The router attempts to find the shortest possible path for each wire. Because of the
limited routing resources on each track, the router has to determine which wires to route directly and
which to detour, making them slower.
4
In fact, the wire routing structure is more complex – the routing grid consists of alternating horizontal and vertical metal
layers – wires on each metal can be routed in one direction only. This, however, is irrelevant detail from the global placer’s
perspective.
-7-
After routing completes, the chip is tested whether its speed meets the timing requirements set by
the ASIC designer.A timing analysis tool looks at the detailed physical information – where the cells are
placed and where the wires are routed, and determines the speed of the chip much more accurately than
the previous stages of the design [Maheshwari N.].
In reality, there are many other steps in the design flow, which are often repeated over and over
until the design goals (high speed, low power, and/or small area) are achieved. If the timing/power/area
analyzer reports that the chip is not fast/low-power/small enough, the design is optimized, and some of
the steps in the flow could be repeated. [Nekoogar F.]
Problem Statement
There is a need for estimation of the wire delay during placement, for the actual delay cannot be
calculated precisely. Precise computation of the propagation delay is impossible, since the detailed
routing topology is not known during placement.
The only information available to the delay estimating function is the number and type and
location of the cells connected to the net, as well as the particular pin (or pins) to which the net is
attached.
In addition, the wire delay estimating function has to be very fast to compute. This requirement
comes from the global placer, which will use the model on each net of the design, for each explored cell
placement. In a modern chip the number of nets is in the order of hundreds of thousands, which results in
billions of evaluations of the wire delay modeling function.
Therefore the goal of wire delay estimation is to find wire delay models, which estimate the delay
as accurately as possible, in order to minimize the delay through the circuit.
-8-
Related Work
The problem of wire delay modeling is inherent not only to coarse placement, but almost to any
other stage of the ASIC design flow. Synthesis needs to model the wire delay, in order to optimize the
logic of the design. Routing faces the same problem, for it has to determine which routes need to be faster
than others. Finally, verification needs to model the wire delay, in order to ensure the functionality of the
chip.
The simplest wire delay model used is to ignore wire delay completely. The benefit of using this
model is that the placer is not doing any wire-delay-related computations. This model has been popular in
the early days of digital design, since the wire delay has been negligible compared to that of the logic
elements. However, in today’s deep sub-micron designs (>0.5um) this is no longer true. As the designs
become smaller and smaller, interconnect delay becomes larger and larger part of the total delay.
Another common delay model is the “wire delay table model”. This model comes from the
synthesis domain, for it predicts wire delay without any physical information (meaning that it is not using
the cell’s coordinates, which are known only during and after placement). The wire load model uses a
lookup table for the delay (among other properties) of a net, solely based on the cardinality of its fanout 5
[Edwards T.]. The values for fanouts that are not in the wire load model table are linearly interpolated.
Fig. 2 contains a common description of a wire load model table. Apart for its use in context where
physical information is not available, this model is useful for its simplicity and relatively high accuracy in
localized blocks. If wire load models are created for specific sub-blocks of the design, the wire delay
could be very close to the actual one, for the model can account for the specific topology of the sub-block.
The drawback of the model is that it is not general enough and either requires knowledge of the design
5
Fanout – the capacitive load on a particular net, roughly equivalent to the number of logic gates connected to it.
-9-
specifics or lengthy simulations. Therefore, the model can be very inaccurate if not used on the proper
scale.
Figure 2: Sample Wire Delay/Load/Resistance/Length/Area Table. The name 10K_WLM implies that
the particular wire load model is appropriate for blocks of approximately 10,000 instances. The delay
is calculated as the Capacitance times the Resistance of the net. The values for the missing fanouts can
be interpolated. For example a simple linear interpolation for the length of a 10-fanout net: Length(10)
would be = 1/4*Length(7)+3/4*Length(11) = 1/4*0.033+3/4*0.054 = 0.0487
Wire_delay_table(10K_WLM)
Fanout
1
2
3
4
7
11
Length
0.002
0.005
0.013
0.022
0.033
0.054
Capacitance Resistance
0.002
0.005
0.005
0.005
0.013
0.139
0.022
0.276
0.033
0.550
0.054
0.785
Area
0.500
1.000
1.500
2.000
3.500
5.500
The most complex class of wire delay models used in global placers is based on the half
perimeter bounding box of the net. It is half the length of the minimum rectangle, which encloses
all endpoints (pins) of a net. For nets with 2 or 3 pins the half-perimeter estimate is equal to the
optimal routing and is close to the optimal for nets with 4 or more pins. [Smith M.] The half
perimeter is fairly simple to calculate for each net6 . To get the net delay, the half perimeter is
multiplied by a constant. A more advanced version of that model is to represent the constant as a
function of the number of pins [Benkoski J.].
In general, there is an inherent tradeoff between modeling accuracy and computational
complexity in designing a wire delay model. Each model could be represented as a point in a
two-dimensional plot, where the two axes are modeling accuracy and computational complexity
6
The minimum of the pin coordinates in each dimension is taken and subtracted from the maximum of pin
coordinates in each dimension to get the perimeter/bounding box of the net. The half perimeter is half of that.
- 10 -
(Figure 3). The ideal wire delay model would be located in the lower left part of the graph – a
very accurate and computationally simple model. Naturally, such model probably does not exist
and all realizable models either require some computational effort, or suffer loss of accuracy.
This inherent tradeoff is seen in the approximate state-of-the-art curve, formed by the different
wire delay models (red points) in Figure 3. Note that there exist models located to the left and
above of the tradeoff curve. Such algorithms are inefficient, because there would be an algorithm
on the tradeoff curve, which is both simpler and more accurate. On the other hand, if there are
algorithms below the current, red-dotted, tradeoff curve, such algorithms would be either faster
and/or more accurate then one or more of the state-of-the-art algorithms. The goal of this thesis
is to find for such algorithms/models.
- 11 -
Figure 3: The Tradeoff Curve between modeling accuracy and computational complexity.
The units are unspecified, for it is hard to precisely quantify the two measures7 . The
theoretical tradeoff curve describes the most efficient/simple models, which could, in theory,
be achieved for a given level of accuracy. The theoretical curve is jittery since its shape is not
known.
7
One way to quantify accuracy and complexity is to pick a fixed set of designs and compare the total prediction
error made by different models and total CPU time on given machine.
- 12 -
Data Collection
Data Extraction
The first step in creating models is to collect a large number of data samples. To generate
them, I run each design through a typical physical design flow – the design is placed, routed, and
timed.
Next, I extract all nets from each design and obtain several characteristic parameters. For
each net, I store the net delay (as reported by the timing analyzer), half perimeter (of the
bounding box around all pins of the net8 ), driver-to-pin half perimeter (the Manhattan distance
from the driver to the sink i.e. the half perimeter of the bounding box around the driver and the
sink), fanout (the number of sinks on the net).
Figure 4: Net parameters uses in the delay models.
Net Delay
DELAY
Half Perimeter
HP
Driver To Sink Half Perimeter
DPHP
Fanout
FANOUT
Delay
The delay of each net is obtained as the net timing arc reported by the timing analyzer.
8
The pin locations are not exactly precise. Pins have irregular shapes and a net could connect to several different
parts of the pin. However I use the simplified (center of mass) location.
- 13 -
Half Perimeter and Driver-to-pin half perimeter
To calculate the half perimeter of a set of pins, the precise location of the pins relative to
the cell location, and the cell location itself need to be obtained4 . The locations of each cell
instance can be obtained from a post-placement-generated output file (.PDEF), or a post-routinggenerated output file (.DEF). The physical library contains the offset locations of each pin for
each library cell. The combination of cell locations and relative pin locations allow me to
calculate the two half perimeters.
Fanout
Net’s fanout can be obtained from at any step after synthesis (.DEF, .PDEF, or .DSPF
file).
Real tools
The tests I ran would work on any set of EDA tools. For completeness, I mention that I
used the 2001.08 versions of Synopsys’s Physical Compiler and PrimeTime.
Real data
The models in this thesis are based on four real-world chip designs (Figure 5). Two of the
designs are small in size, while the other two are medium sized. For confidentiality, I will only
refer to the designs as Design 1-4. All designs use an Artisan library for TSMC’s 6 layer, 0.18
µm process. The lumped (averaged across all metal layers) resistance and capacitance per unit
length are9 : Rhorizontal=Rvertical=0.0000023 kΩ/0.01µm, Chorizontal=Cvertical=0.0000023 pF/0.01µm.
9
Obtained from the library setup scripts
- 14 -
Figure 5: Relative size of the four designs used.
Design Number
Number of net timing arcs
Design 1
2025
Design 2
15126
Design 3
46246
Design 4
45862
Elmore delay decomposition
An important part of developing the models I created was decomposing the Elmore delay
model into four sub-components. This section defines each of the four components and explains
why such decomposition is useful.
The Elmore delay model is an approximate method of calculating the propagation delay
over transmission line (wire). It considers the first moment of the impulse response of the wire
and assumes inductance and coupling capacitance can be ignored. [Rubinstein J., Wayne W.]
For a generic RC tree, the Elmore delay through a path (from the driver pin to a given
sink) is calculated as follows: The delay through each on-path resistor is defined as its resistance
times its downstream capacitance. The total path delay is equal to the sum of the delays through
all on-path resistors. (See the example below)
Each path through an RC tree could be decomposed into four independent sub-delays:
•
Self delay of the on-path wire – the resistance of the wire driving its own capacitive load
along the path.
- 15 -
•
Sink delay – the delay through the on-path wire driving the capacitive load of the sink.
•
Off-path wire load delay – the sum of delays through wire segments starting from the
driver and ending at off-path wire branch points due to the capacitive load of the off-path
wiring.
•
Off path pin load delay- the sum of delays through wire segments starting from the driver
and ending at off-path wire branch points due to the capacitive load of other sinks.
For example, consider the red/orange colored net on Figure 6. Its Elmore equivalent RC
tree would look like the one on Figure 7.
Figure 6: Sample net. D is the driver and the example looks at the delay from D to the sink S.
- 16 -
Figure 7: The corresponding RC tree of the net from the circuit in Figure 6.
Let’s further consider the propagation delay from the driver D to the sink S. If the
capacitive loads of the sinks are CP1, CP2, and CS (not shown on Figure 7), the Elmore delay
for that path10 would be: Delay(D-S) = R5*(C1+C2+C3+C4+C5+CP1+CP2+CS) +
R4*(C1+C2+C4+CP2+CS) + R1(C1+CS)
The four sub-components of the delay would be:
•
Self delay = R5 * (C1+C4+C5) + R4 * (C4 + C1) + R1 * C1
•
Sink delay = (R5 + R4 + R1) * CS
•
Off-path wire load delay = R5 * C2 + (R5 + R4) * C3
•
Off path pin load = R5 * CP2 + (R5 + R4) * CP2
As expected, the sum of the four components is equal to the total delay.
10
The propagation delay from a cell output pin (driver) to a cell input pin (sink) is often referred to as “net timing
arc”.
- 17 -
Purpose of Elmore delay decomposition
The purpose of this decomposition of the Elmore delay model into four independent
components helps me to better understand and model the dependencies between the net
parameters (half perimeter, driver to sink half perimeter, fanout).
Since the resistance and capacitance of a wire are linearly proportional to its length, it is
easy to transform the problem of modeling delay into modeling wire length (i.e. routing
topology). The following section – Data Analysis will describes the exact process.
Data Analysis
Once I collect the sample data, I try to analyze it and create different wire delay models.
Instead of modeling the propagation delay as a whole I approach each of the four sub-delays
individually. From now on, I will use Rµm and Cµm for the resistance and capacitance per
micron and assume all wire lengths are in microns.
Transformations of the four sub-delays
•
Self delay – this delay is equivalent of the delay on a RC transmission line with
length DR2PIN11 . On theory, [Anderson, M.] this delay should be equal to
Self delay = ½ * Rµm * Cµm * DR2PIN^2
•
Sink delay – this delay is equivalent to a DR2PIN resistor driving a Csink capacitor.
Therefore,
Sink delay = Rµm * DR2PIN * Csink
11
DR2PIN - The length of the path over the routed wire from the driver to the sink.
- 18 -
•
Off-path wire load delay – the number of off-path wires is not necessarily equal to
FANOUT12 -1, since there could be multiple off-path pins connected to the same offpath wire. However, if I assume the entire off-path wire branches at the average wire
branch distance then:
Off-path wire load delay = Rµm *AVERAGE_WIRE_BRANCH_DIST * Cµm *
TOTAL_OFF_PATH_WIRELEN
I can further rewrite TOTAL_OFF_PATH_WIRELEN = TOTAL_WIRE_LEN –
DR2PIN and substitute:
Off-path wire load delay = Rµm *AVERAGE_WIRE_BRANCH_DIST * Cµm *
(TOTAL_WIRE_LEN-DR2PIN)
•
Off-path pin capacitive load – this delay is the sum of FANOUT-1 delays, each of
which is the delay through a wire segment from the driver to a point where the
corresponding off-path wire branches. If the average distance to off-path wire
branching is AVERAGE_PIN_LOAD_BRANCH_DIST then:
Off-path pin load delay = (FANOUT-1) * Rµm
*AVERAGE_PIN_LOAD_BRANCH_ DIST * Csink(i)
Theoretical models of total wire length
There are two extreme cases, which have the lowest and highest ratio of total wire length
to half perimeter. However, I could not find a formal proof why these particular configurations
are the extreme cases.
12
FANOUT – The number of pins on the net, excluding the driver. Thus FANOUT-1 is the number of off-path
sinks.
- 19 -
Best Case
TOTAL_WIRE_LEN(HP,FANOUT) = HP
In the best case, all pins are located in a straight line. As a result, the total wire length of
the net is equal to its half perimeter.
Worst Case
TOTAL_WIRE_LEN (HP,FANOUT) = HP+(SQRT(FANOUT)-1)*HP/2
All pins fill a square-shaped grid. Each side of the square grid has sqrt(FANOUT)13 pins.
As a result the total wire length of the net is (SQRT(FANOUT)+1)*HP/2 =
SQRT(FANOUT)*HP/2+HP
Figure 8: Worst case for total wire length/half perimeter.
Theoretical models of driver to sink path length
Similarly to total wire length there are two extreme cases of the driver to sink path length.
13
Here I assume that SQRT(FANOUT) is an integral number. The derived upper bound is valid for any integer
FANOUT.
- 20 -
Best Case
DR2PIN(HP,FANOUT,DPHP) = DPHP
In the best case, the driver to sink path length is the shortest path possible and is equal to
the half perimeter of the bounding box around the driver and the sink.
Worst Case
DR2PIN(HP,FANOUT,DPHP) = HP+(SQRT(FANOUT)-1)*HP/2
The worst case of driver to sink path length is equal to the worst total wire length case
(Figure 9). In such case the path from the driver to the sink goes through all other pins on the net.
Fortunately, this case is quite unlikely to happen.
Figure 9: Worst case driver to sink path length. The routed wire between the two red pins
passes through all other pins on the net, resulting in driver to sink path length of
HP+(SQRT(FANOUT)-1)*HP/2
Theoretical models of average distance to pin branching point
The average point at which a pin load is located along the driver to sink path should be
halfway from the driver to the sink. I assume that the distribution of the driver and sink pins is
- 21 -
random. If that is the case, then for each net and each path from the driver to a sink, there is an
equivalent net, which has the same topology except that the locations of the driver and the sink
are exchanged. These two nets have the property that if there is a pin load on the driver to sink
path at distance X on one of the two nets, the same load is located at distance DR2PIN – X on
the other net. Since there is such “dual” net for any possible net topology, then for a fixed
DR2PIN distance the average distance to pin load branching point is ½ DR2PIN.
Theoretical models of average distance to wire branching point
The reasoning why average distance to pin branching point is ½ DR2PIN applies to
average distance to wire branching point – assuming random driver/sink distribution for each
distance to wire branching point X, there is an equivalent network with wire branching point
distance equal to DR2PIN – X.
Combining theoretical results to obtain a functional form
Combining the theoretical results above into the total delay I get:
Total Delay = (k1 * DR2PIN^2) +
// Self delay
(k2 * DR2PIN) +
// Sink delay
(k3 * DR2PIN * (TOTAL_WIRE_LEN – DR2PIN)) + // Off-path wire
(k4 * (FANOUT-1) * DR2PIN)
// Off-path pin
(rearranging) = K1 * DR2PIN + K2 * DR2PIN^2 + K3 * DR2PIN * TOTAL_WIRE_LEN + K4 *
FANOUT * DR2PIN
(Where each Ki is some linear combination of ki)
- 22 -
Verification
I have randomly selected two designs of the four to be the training set and the remaining
two the testing set14 . All models are analyzed based on data from the training set and the
accuracy of the models is verified on the test set.
Confidentiality
Finally, I will not reveal the optimal coefficients for each functional form, as they could
potentially be used in production code. However, anyone with a good solver and a set of
extracted net timing arcs can reproduce similar coefficients.
Results
Finally, I find the coefficients for each functional form, which minimize the error in the
training set. These coefficients, along with the corresponding functional form are tested on the
test designs. The error specified below each model is the sum of the squared errors between the
actual and predicted net delays for each net in the test designs. Therefore the lower the total error
of each model, the higher the accuracy.
The computational complexity of each of these functions is comparable – within each net,
they all take linear time in FANOUT to compute for all timing arcs. HP is a max/min of the pin
coordinates and therefore takes linear time in FANOUT to compute. Since it is fixed for all
sinks, it needs to be computed only once for each net. DPHP takes constant time to compute
(∆x+∆y of the coordinates of the driver and the sink), but has to be computed for each sink.
Therefore the FANOUT different DPHP’s require, too, linear time in FANOUT to compute.
14
Design-1 and Design-3 are in the training set, while Design-2 and Design-4 are in the testing set.
- 23 -
Therefore the computational complexity difference between the models is in the dominating
constants.
I tested the following functional forms:
•
DELAY(HP,FANOUT)1 =
c1+c2*(HP*SQRT(FANOUT))2 +c3*HP*SQRT(FANOUT)+c4*HP
Error = 1130.4
This functional form is the current state-of-the-art model used in coarse placers. I
included it in the test in order for the error comparison to be fair. Therefore, the
error above is based on coefficients, which are based only on the two training
designs. These are not the actual coefficients from a state-of-the-art placer, since
they are probably based on more designs. (Using the actual coefficients from the
state-of-the-art placer lowered the prediction error to 877.0. However, in order to
fairly compare the modeling accuracy of the functional forms, they have to be
based on the same training set)
•
DELAY(HP,FANOUT)2 =
c1*DR2PIN(HP,FANOUT)^2+c2*DR2PIN(HP,FANOUT)*(HP*SQRT(FANOUT
+c4))
,where DR2PIN is :
DR2PIN(HP,FANOUT) = (k1 + k2/(FANOUT+k3) + k4*log(FANOUT))*HP
Error = 1008.1
This functional form is based on a set of experiments with a min-spanning tree
router on randomly-generated nets. The accuracy is improved, but the functional
- 24 -
does not have any theoretical support – I picked the functional form, which best
fitted the data.
•
DELAY(HP,DPHP,FANOUT)1 =
c1+c2*DPHP*SQRT(FANOUT)^2+c3*DPHP*SQRT(FANOUT)*(HP*SQRT(FA
NOUT))+c4*DPHP*SQRT(FANOUT)+c5*DPHP*SQRT(FANOUT)
Error = 914.6
With this functional I attempted to model DR2PIN as DPHP*SQRT(FANOUT),
similarly to the total wire length.
•
DELAY(HP,DPHP,FANOUT)2 =c1 + c2*DPHP^2 +
c3*DPHP*(HP*SQRT(FANOUT)) + c4*DPHP + c5*DPHP*FANOUT
Error = 542.4
The best model of total wire delay turned out to be when DR2PIN was modeled
as const*DPHP.
The DELAY(HP,DPHP,FANOUT)2 improves the accuracy of the current state-of-the-art
model by 31% (Since the total error is the sum of squared differences between actual and
predicted delays, the square root of the ratio of two total errors is the average per sample
increase/decrease in error. Therefore DELAY(HP,DPHP,FANOUT)2 ’s predicted delay has, on
average, an error which is sqrt(542.4/1130.4)=0.69 times as big as the error of
DELAY(HP,FANOUT)1 – an improvement of 31%).
Remaining source of error
Invariably, some errors in the predictions of the models remain. Most of that error comes
from one or more of the circuit characteristics below:
•
Exact Routing (layers used by routing and exact routing topology)
- 25 -
•
Congestion
•
Fringing capacitance
•
Coupling capacitance
Unfortunately, some of these unknowns are hardly predictable without detailed routes or
a more sophisticated flow. Still, a percentage of the error could, in theory, be reduced by
future research.
Future Work
In this section I have included possible projects which could extend the present work and
hopefully lead to the development of more accurate wire delay models for global placement.
More, larger designs
Ideally, I wanted to use more and larger designs, so that the delay models would be more
representative. However, since I wrote all parsing scripts in Perl, they take hours to parse the
data files even for the small designs. Even worse, once the Perl interpreter runs out of physical
memory, its performance decays further, because it uses the disks as virtual memory. Therefore,
I could not use any large designs. If the parsing/analyzing scripts are wisely re-written in C/C++
it will be possible to obtain models for large (>100K gates) designs.
More libraries
I have used only one library, since it took me about a week to incorporate the library data
into my script flow. As part of a different project, Peter Moceyunas (Synopsys) had collected net
data of several designs based on a ST Microelectronics physical library with 0.18µm process and
- 26 -
from the plots I looked at it seemed the dependency of delay on the three modeling variables is
the same.
Still, on highly customized libraries I would not be surprised if the TSMC delay models I
developed are not that accurate.
Congestion
One area that I could not explore was including congestion information in predicting net
delay. Since routers tend to create longer routes over congested regions, I believe that adding
congestion estimates would make the wire delay models even more accurate. Such model,
however, would likely be slower to compute, unless congestion is already being estimated in the
global placer. (As is the case for most global placers)
Vertical/Horizontal RC, layers
The next step in accurate delay prediction could be looking at the technology
characteristics of the design. Since each layer has different capacitance, for some libraries the
lumped vertical and horizontal RC’s might differ, allowing for better delay models, which
distinguish between horizontal and vertical routes. In such case, the models would need to be a
function of one or two additional variables, since VerticalHP and HorizontalHP (i.e. the width
and the height of the bounding box) would replace HP, while VerticalDPHP and
HorizontalDPHP replace DPHP.
Implement the new wire delay model
As mentioned in the Results section the HP-DPHP-FANOUT-2 model has improved
accuracy without any significant increase in runtime vs. the current best HP-FANOUT-1 model.
- 27 -
However, its impact on the final timing of the designs is yet to be tested in a complete physical
design flow.
Special Thanks To:
Will Naylor and Ross Donelly for the long discussions on theoretical models and data
analysis
Michael Fu for the many suggestions on the writing style and content of the thesis
Peter Moceyunas for his work based on the ST library
Brent Gregory for writing the DEF parser
Placement Technology Team at Synopsys for having me as a 6A intern
References
Anderson E., Electric Transmission Line Fundamentals, 1985 Reston Publishing Company Inc.
Benkoski J., Strojwas A., The Role of Timing Verification in Layout Synthesis, in 28th
Proceedings of the ACM/IEEE Design Automation Conference, 1991, Internet:
http://www.sigda.org/Archives/ProceedingArchives/Dac/Dac91/papers/1991/dac91/37_1/37_1.ht
m
Edwards T., Steer M., Foundation of Interconnect and Microstrip Design, ISBN 0-471-60701-0,
LCID TK7876.E35
Johnson R., Wichern D., Business Statistics: Decision Making with Data, 1997 John Wiley &
Sons, ISBN 0-471-59213-7, LCID HD30.215.J64
Maheshwari N., Sapatnekar S., Timing Analysis and Optimization of Sequential Circuits, ISBN
0-7923-8321-4, LCID TK7874.75.M35
Nekoogar F., Timing Verification of Application-Specific Integrated Circuits, ISBN 0-13794348-2, LCID TK7874.6.N45
- 28 -
Rubinstein J., et.al, Signal Delay in RC Tree Networks, IEEE Transactions on CAD, vol. CAD2, 1983, Internet:
http://infopad.eecs.berkeley.edu/~icdesign/ee241_s98/PAPERS/archive/sig_del_rc_net.pdf
Smith M., Application-Specific Integrated Circuits ISBN: 0201500221, 1997, Internet:
http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/ASICs.htm#anchor11320
Swartz W., Sechen C., Timing Driven Placement for Large Standard Cell Circuits, in 32nd
Proceedings of the ACM/IEEE Design Automation Conference 1995 13.4
Wayne W., Modern VLSI Design, 1998 Prentice Hall, ISBN 0139896902
Youssef, H., R.-B. Lin, and E. Shragowitz. 1992. Bounds on net delays for VLSI circuits. IEEE
Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 39, no. 11
- 29 -