Wire Delay Models for Global Placement of ASICs

By Krassimir Paskalev

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Computer Science and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

January 17, 2002

Copyright 2002 Krassimir Paskalev. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Krassimir Paskalev, January 17, 2002
Certified by: Michael Fu, R&D Manager, Ph.D., Synopsys Thesis Supervisor
Certified by: Jacob K. White, M.I.T. Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

ABSTRACT

A new model for the propagation delay between two logic gates for timing-driven global placement is proposed.
The model is a function of the number of pins on the net, the half perimeter of the bounding box enclosing the net, and the half perimeter of the bounding box enclosing the driving pin and the sink pin. On a training set of two designs and a testing set of another two, the proposed model is 31% more accurate than the current state-of-the-art model and has comparable computational complexity.

Thesis Supervisor: Jacob White
Title: Associate Director, Research Laboratory of Electronics

Introduction

When solving real-world optimization problems, an engineer faces a tradeoff between the computational complexity of the algorithm and the optimality of the final solution. Modeling wire delay in global ASIC placement is an example of an engineering problem where such a tradeoff exists.

The goal of wire delay modeling in global placement is to give the placer an easy-to-compute, accurate estimate of the final (post-routing) delay of each wire. If the model is not accurate enough, the placer may place some logic cells far from each other because the model indicates that the delay between them is small. If the actual delay through the wires connecting those cells turns out to be drastically larger, the final speed of the chip will be significantly reduced. Therefore, wire delay estimation during placement is extremely important, especially when a high clock speed is the most desired feature of the final design.

Today, the state-of-the-art wire delay models use statistical estimates based on the size of the bounding box of all endpoints of the net [1] and the number of endpoints of each wire. My thesis is that more accurate wire delay estimation models for global [2] placement exist, with runtime comparable to the current state-of-the-art models. My models estimate wire delay based on the relative position of the driver and the sink, in addition to the variables the current state-of-the-art models use.
Background on Design of ASICs

This section provides background information about chip design (ASICs in particular) and how the problem I am solving fits into the overall problem of designing modern chips.

[1] Net: the wire connecting two or more cell input and output pins. Also referred to as interconnect.

ASIC stands for Application-Specific Integrated Circuit and, as the name suggests, ASICs are semiconductor devices that perform specialized tasks, such as controlling a microwave oven, a pacemaker, or an mp3 player. Like any integrated circuit, an ASIC contains a huge number of transistors, which are interconnected in a complex pattern to create a unique electric network. Electric current propagates through the chip, carrying digital information in the form of zeroes and ones (low voltage and high voltage). The functionality of the circuit [3] depends on the specific way the transistors are interconnected, their number, and their size.

Several steps are necessary to convert the idea of a new chip in the designer's mind into a piece of silicon that performs the desired functions. Today, that process is to a large extent automated and relies on a set of software programs called EDA tools. In the description below I will use a cellular phone controller as an example of an ASIC to be designed. A summary of the major steps is presented in Figure 1.

[2] Global placement: also known as coarse placement.
[3] Circuit: also referred to as "the design" by ASIC designers.

Figure 1: Overview of the digital design flow

The first step in the ASIC design flow is to write down the specification of the device (the phone controller) in a computer-readable format. That is done in a Hardware Description Language (HDL), such as Verilog or VHDL. A program in HDL specifies how the circuit should react to various input signals: the phone should ring when there is an incoming call, the call should end when the "Cancel" button is pressed, etc.
The next step is to synthesize the HDL description into a set of simple logic operations like AND, OR, NOT, and others. This step maps the HDL description (program) of the cellular phone onto a set of logic elements (called logic gates) and is done by a synthesis tool. For example, an HDL description saying "if there is an incoming call and the line is not busy then ring" could be implemented with the "incoming call" signal and the inverted "line busy" signal as the inputs of an AND logic gate, with the "ring" signal as the output. However, it is possible to implement the same functionality in a different way, with a NAND (not-AND) gate followed by a NOT gate. Even though the second approach might seem silly, it is sometimes preferred because it makes the chip faster, consumes less power, or uses less area. Another reason to prefer the second implementation is that the AND logic gate might not exist in the library of gates available to the ASIC tool (the cell library).

The cell library is a set of logic gates that the manufacturer requires the digital designers and synthesis tools to use. These logic elements (also called library cells) are designed to be very fast and efficient, and they simplify the work of the designer. In a way, the library cells are to the synthesis tool what LEGO blocks are to a kid who wants to build a toy cellular phone: only certain library cells/LEGO blocks are manufactured and can be used.

The next step, after synthesis builds the network of logic cells, is global placement. This is the step where the wire delay models I research in this thesis will be used, so I will go over it in more detail. The global placer decides where each cell should be placed on the chip. This decision is influenced by several goals, which the placer attempts to achieve. First, the placer attempts not to overlap any cells, for only one cell can occupy a certain location on the chip.
Second, it attempts to place the cells such that the wires connecting them later will not be too long. This is important because only a limited number of wires can pass through any location on the chip. If most pairs of connected cells are placed in distant corners of the chip, the corresponding wires will be long and will lead to an excess number of wires in the center of the chip. Due to the limited wiring capacity, some wires will have to detour around the congested center and become very long. In turn, long wires will increase the power consumption and slow down the chip.

There are other placement goals, yet the most important one from the perspective of this thesis is to increase the chip's speed (or, equivalently, to decrease the signal propagation delay). To achieve better timing, the placer needs to model the propagation delay across the wire between placed cells, so that it can determine which placement results in the fastest chip.

A numerical global placer works like many search algorithms. It starts with an arbitrary placement of the gates and tries to improve the current placement until all placement goals are met. At each step a new placement is generated and checked to see whether it is closer to the placement goals than the current placement. If it is closer, the search continues from the new placement. If the new placement is worse (further from the placement goals), the search continues from the current placement.

After global placement finds a satisfactory placement of all cells, the resulting cell locations have to be legalized (detail-placed). Detailed placement is needed because of the placement topology enforced by the manufacturer. Today's ASICs have a row-based topology where the entire chip area is divided into rows of cells. The global placer, however, does not follow the specific placement topology and may place cells at any location in the chip area.
Therefore, the task of the detailed placer is to snap all cells to valid row locations.

The last step in the chip design is routing. Routing specifies the exact paths which the wires follow to connect the cells. Wires are routed on special grid-like tracks, which run horizontally and vertically across the chip [4]. The router attempts to find the shortest possible path for each wire. Because of the limited routing resources on each track, the router has to determine which wires to route directly and which to detour, making them slower.

[4] In fact, the wire routing structure is more complex: the routing grid consists of alternating horizontal and vertical metal layers, and wires on each metal layer can be routed in one direction only. This, however, is an irrelevant detail from the global placer's perspective.

After routing completes, the chip is checked to determine whether its speed meets the timing requirements set by the ASIC designer. A timing analysis tool looks at the detailed physical information (where the cells are placed and where the wires are routed) and determines the speed of the chip much more accurately than the previous stages of the design [Maheshwari N.].

In reality, there are many other steps in the design flow, which are often repeated over and over until the design goals (high speed, low power, and/or small area) are achieved. If the timing/power/area analyzer reports that the chip is not fast/low-power/small enough, the design is optimized, and some of the steps in the flow may be repeated [Nekoogar F.].

Problem Statement

Wire delay must be estimated during placement, for the actual delay cannot be calculated precisely. Precise computation of the propagation delay is impossible, since the detailed routing topology is not known during placement.
The only information available to the delay estimating function is the number, type, and location of the cells connected to the net, as well as the particular pin (or pins) to which the net is attached.

In addition, the wire delay estimating function has to be very fast to compute. This requirement comes from the global placer, which will use the model on each net of the design, for each explored cell placement. In a modern chip the number of nets is on the order of hundreds of thousands, which results in billions of evaluations of the wire delay modeling function.

Therefore, the goal of wire delay estimation is to find wire delay models that estimate the delay as accurately as possible, so that the placer can minimize the delay through the circuit.

Related Work

The problem of wire delay modeling is inherent not only to coarse placement, but to almost every other stage of the ASIC design flow. Synthesis needs to model the wire delay in order to optimize the logic of the design. Routing faces the same problem, for it has to determine which routes need to be faster than others. Finally, verification needs to model the wire delay in order to ensure the functionality of the chip.

The simplest wire delay model is to ignore wire delay completely. The benefit of this model is that the placer does not perform any wire-delay-related computations. This model was popular in the early days of digital design, when wire delay was negligible compared to that of the logic elements. However, in today's deep sub-micron designs (<0.5 µm) this is no longer true. As the designs become smaller and smaller, interconnect delay becomes a larger and larger part of the total delay.

Another common delay model is the "wire delay table model", also called the wire load model. This model comes from the synthesis domain, for it predicts wire delay without any physical information (meaning that it does not use the cells' coordinates, which are known only during and after placement).
The wire load model uses a lookup table for the delay (among other properties) of a net, based solely on the cardinality of its fanout [5] [Edwards T.]. The values for fanouts that are not in the wire load model table are linearly interpolated. Figure 2 contains a common description of a wire load model table. Apart from its use in contexts where physical information is not available, this model is useful for its simplicity and relatively high accuracy in localized blocks. If wire load models are created for specific sub-blocks of the design, the estimated wire delay can be very close to the actual one, for the model can account for the specific topology of the sub-block. The drawback of the model is that it is not general enough and requires either knowledge of the design specifics or lengthy simulations. Therefore, the model can be very inaccurate if not used on the proper scale.

[5] Fanout: the capacitive load on a particular net, roughly equivalent to the number of logic gates connected to it.

Figure 2: Sample Wire Delay/Load/Resistance/Length/Area Table. The name 10K_WLM implies that the particular wire load model is appropriate for blocks of approximately 10,000 instances. The delay is calculated as the Capacitance times the Resistance of the net. The values for the missing fanouts can be interpolated. For example, a simple linear interpolation for the length of a 10-fanout net would be:

Length(10) = 1/4*Length(7) + 3/4*Length(11) = 1/4*0.033 + 3/4*0.054 = 0.04875

Wire_delay_table(10K_WLM)
Fanout   Length   Capacitance   Resistance   Area
1        0.002    0.002         0.005        0.500
2        0.005    0.005         0.005        1.000
3        0.013    0.013         0.139        1.500
4        0.022    0.022         0.276        2.000
7        0.033    0.033         0.550        3.500
11       0.054    0.054         0.785        5.500

The most complex class of wire delay models used in global placers is based on the half perimeter of the bounding box of the net. It is half the perimeter of the minimum rectangle that encloses all endpoints (pins) of a net.
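As an illustration of the half-perimeter measure just defined, the following is a minimal Python sketch. It assumes pins are given as (x, y) coordinate pairs; the function name `half_perimeter` and the sample coordinates are hypothetical and are not taken from any placer.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def half_perimeter(pins: Iterable[Point]) -> float:
    """Half perimeter (width + height) of the bounding box around the pins."""
    pts = list(pins)
    if not pts:
        raise ValueError("a net needs at least one pin")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    # Width and height of the smallest axis-aligned rectangle enclosing all pins.
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return width + height


# Hypothetical three-pin net: bounding box is 30 wide and 25 high.
net = [(0.0, 0.0), (30.0, 10.0), (12.0, 25.0)]
hp = half_perimeter(net)
```

The cost is one pass over the pins of the net, which is why this family of models stays cheap enough to evaluate for every net at every placement step.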
For nets with 2 or 3 pins the half-perimeter estimate is equal to the wire length of the optimal routing, and it is close to the optimal wire length for nets with 4 or more pins [Smith M.]. The half perimeter is fairly simple to calculate for each net [6]. To get the net delay, the half perimeter is multiplied by a constant. A more advanced version of that model represents the constant as a function of the number of pins [Benkoski J.].

[6] The minimum of the pin coordinates in each dimension is subtracted from the maximum of the pin coordinates in that dimension to get the width and the height of the bounding box of the net. The half perimeter is the sum of the two, i.e. half of the bounding box perimeter.

In general, there is an inherent tradeoff between modeling accuracy and computational complexity in designing a wire delay model. Each model can be represented as a point in a two-dimensional plot, where the two axes are modeling accuracy and computational complexity (Figure 3). The ideal wire delay model would be located in the lower left part of the graph: a very accurate and computationally simple model. Naturally, such a model probably does not exist, and all realizable models either require some computational effort or suffer a loss of accuracy. This inherent tradeoff is seen in the approximate state-of-the-art curve, formed by the different wire delay models (red points) in Figure 3. Note that there exist models located to the left and above of the tradeoff curve. Such algorithms are inefficient, because there would be an algorithm on the tradeoff curve which is both simpler and more accurate. On the other hand, if there are algorithms below the current, red-dotted tradeoff curve, such algorithms would be faster and/or more accurate than one or more of the state-of-the-art algorithms. The goal of this thesis is to find such algorithms/models.

Figure 3: The Tradeoff Curve between modeling accuracy and computational complexity. The units are unspecified, for it is hard to precisely quantify the two measures [7].
The theoretical tradeoff curve describes the most efficient/simple models which could, in theory, be achieved for a given level of accuracy. The theoretical curve is drawn jittery since its shape is not known.

[7] One way to quantify accuracy and complexity is to pick a fixed set of designs and compare the total prediction error made by different models and the total CPU time on a given machine.

Data Collection

Data Extraction

The first step in creating models is to collect a large number of data samples. To generate them, I run each design through a typical physical design flow: the design is placed, routed, and timed. Next, I extract all nets from each design and obtain several characteristic parameters. For each net, I store the net delay (as reported by the timing analyzer), the half perimeter (of the bounding box around all pins of the net [8]), the driver-to-pin half perimeter (the Manhattan distance from the driver to the sink, i.e. the half perimeter of the bounding box around the driver and the sink), and the fanout (the number of sinks on the net).

Figure 4: Net parameters used in the delay models.
Parameter                        Name
Net Delay                        DELAY
Half Perimeter                   HP
Driver To Sink Half Perimeter    DPHP
Fanout                           FANOUT

Delay

The delay of each net is obtained as the net timing arc reported by the timing analyzer.

[8] The pin locations are not exact. Pins have irregular shapes and a net could connect to several different parts of the pin. However, I use the simplified (center of mass) location.

Half Perimeter and Driver-to-pin half perimeter

To calculate the half perimeter of a set of pins, the precise location of the pins relative to the cell location, and the cell location itself, need to be obtained [4]. The location of each cell instance can be obtained from a post-placement-generated output file (.PDEF) or a post-routing-generated output file (.DEF). The physical library contains the offset locations of each pin for each library cell.
The combination of cell locations and relative pin locations allows me to calculate the two half perimeters.

Fanout

A net's fanout can be obtained at any step after synthesis (from a .DEF, .PDEF, or .DSPF file).

Real tools

The tests I ran would work on any set of EDA tools. For completeness, I mention that I used the 2001.08 versions of Synopsys's Physical Compiler and PrimeTime.

Real data

The models in this thesis are based on four real-world chip designs (Figure 5). Two of the designs are small in size, while the other two are medium sized. For confidentiality, I will only refer to the designs as Design 1-4. All designs use an Artisan library for TSMC's 6-layer, 0.18 µm process. The lumped (averaged across all metal layers) resistance and capacitance per unit length are [9]:

R_horizontal = R_vertical = 0.0000023 kΩ per 0.01 µm
C_horizontal = C_vertical = 0.0000023 pF per 0.01 µm

[9] Obtained from the library setup scripts.

Figure 5: Relative size of the four designs used.
Design      Number of net timing arcs
Design 1    2025
Design 2    15126
Design 3    46246
Design 4    45862

Elmore delay decomposition

An important part of developing the models I created was decomposing the Elmore delay model into four sub-components. This section defines each of the four components and explains why such a decomposition is useful.

The Elmore delay model is an approximate method of calculating the propagation delay over a transmission line (wire). It considers the first moment of the impulse response of the wire and assumes inductance and coupling capacitance can be ignored [Rubinstein J., Wayne W.].

For a generic RC tree, the Elmore delay through a path (from the driver pin to a given sink) is calculated as follows: the delay through each on-path resistor is defined as its resistance times its downstream capacitance. The total path delay is equal to the sum of the delays through all on-path resistors.
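The rule above (each on-path resistance times its downstream capacitance, summed along the path) can be sketched in a few lines of Python. This is only an illustrative sketch: the tree encoding, the function names `downstream_cap` and `elmore_delay`, and the numeric values are all assumptions made for the example, not part of the extraction flow used in this thesis.

```python
from typing import Dict, Optional, Tuple

# node name -> (parent node, resistance of the segment from the parent,
#               capacitance lumped at this node)
Tree = Dict[str, Tuple[Optional[str], float, float]]


def downstream_cap(tree: Tree, node: str) -> float:
    """Total capacitance at `node` and in the whole subtree below it."""
    total = tree[node][2]
    for child, (parent, _, _) in tree.items():
        if parent == node:
            total += downstream_cap(tree, child)
    return total


def elmore_delay(tree: Tree, sink: str) -> float:
    """Sum of R * (downstream C) over every segment on the driver-to-sink path."""
    delay = 0.0
    node = sink
    while tree[node][0] is not None:  # walk up until the driver (root) is reached
        parent, resistance, _ = tree[node]
        delay += resistance * downstream_cap(tree, node)
        node = parent
    return delay


# Hypothetical net: driver D feeds branch point a, which feeds sinks s and t.
example: Tree = {
    "D": (None, 0.0, 0.0),
    "a": ("D", 2.0, 1.0),
    "s": ("a", 3.0, 2.0),
    "t": ("a", 1.0, 4.0),
}
delay_to_s = elmore_delay(example, "s")  # 2*(1+2+4) + 3*2
delay_to_t = elmore_delay(example, "t")  # 2*(1+2+4) + 1*4
```

Note that the shared segment D-a sees the capacitance of both branches, so loading one sink more heavily slows down the path to the other sink as well.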
(See the example based on Figures 6 and 7 below.)

Each path through an RC tree can be decomposed into four independent sub-delays:

• Self delay of the on-path wire: the resistance of the wire driving its own capacitive load along the path.
• Sink delay: the delay through the on-path wire driving the capacitive load of the sink.
• Off-path wire load delay: the sum of delays through wire segments starting from the driver and ending at off-path wire branch points, due to the capacitive load of the off-path wiring.
• Off-path pin load delay: the sum of delays through wire segments starting from the driver and ending at off-path wire branch points, due to the capacitive load of other sinks.

For example, consider the red/orange colored net in Figure 6. Its Elmore equivalent RC tree would look like the one in Figure 7.

Figure 6: Sample net. D is the driver and the example looks at the delay from D to the sink S.

Figure 7: The corresponding RC tree of the net from the circuit in Figure 6.

Let's further consider the propagation delay from the driver D to the sink S. If the capacitive loads of the sinks are CP1, CP2, and CS (not shown in Figure 7), the Elmore delay for that path [10] would be:

Delay(D-S) = R5*(C1+C2+C3+C4+C5+CP1+CP2+CS) + R4*(C1+C2+C4+CP2+CS) + R1*(C1+CS)

The four sub-components of the delay would be:

• Self delay = R5*(C1+C4+C5) + R4*(C4+C1) + R1*C1
• Sink delay = (R5+R4+R1)*CS
• Off-path wire load delay = R5*C3 + (R5+R4)*C2
• Off-path pin load delay = R5*CP1 + (R5+R4)*CP2

As expected, the sum of the four components is equal to the total delay.

[10] The propagation delay from a cell output pin (driver) to a cell input pin (sink) is often referred to as a "net timing arc".

Purpose of Elmore delay decomposition

Decomposing the Elmore delay model into four independent components helps me to better understand and model the dependencies between the net parameters (half perimeter, driver to sink half perimeter, fanout).
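As a concrete check of the decomposition, the short Python sketch below plugs made-up resistance and capacitance values into the D-to-S example and confirms that the four components add up to the total Elmore delay. The numeric values are arbitrary, and the position of each off-path branch is an assumption inferred from which capacitances sit behind each resistor in the Delay(D-S) expression (C3 and CP1 behind R5 only; C2 and CP2 behind both R5 and R4).

```python
import math

# Arbitrary illustrative values (no particular units).
R1, R4, R5 = 1.0, 2.0, 3.0
C1, C2, C3, C4, C5 = 0.5, 0.25, 0.75, 1.0, 1.5
CP1, CP2, CS = 2.0, 4.0, 8.0

# Total Elmore delay from D to S: each on-path resistor times everything downstream.
total = (R5 * (C1 + C2 + C3 + C4 + C5 + CP1 + CP2 + CS)
         + R4 * (C1 + C2 + C4 + CP2 + CS)
         + R1 * (C1 + CS))

# The four sub-delays.
self_delay = R5 * (C1 + C4 + C5) + R4 * (C4 + C1) + R1 * C1  # on-path wire capacitance
sink_delay = (R5 + R4 + R1) * CS                             # capacitance of the sink S
off_path_wire = R5 * C3 + (R5 + R4) * C2                     # off-path wire capacitance
off_path_pin = R5 * CP1 + (R5 + R4) * CP2                    # off-path sink capacitance

components_sum = self_delay + sink_delay + off_path_wire + off_path_pin
decomposition_holds = math.isclose(total, components_sum)
```

With these particular values the sink and off-path pin terms dominate, which hints at why fanout appears explicitly in the delay models.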
Since the resistance and capacitance of a wire are linearly proportional to its length, it is easy to transform the problem of modeling delay into one of modeling wire length (i.e. routing topology). The following section, Data Analysis, describes the exact process.

Data Analysis

Once I collect the sample data, I analyze it and create different wire delay models. Instead of modeling the propagation delay as a whole, I approach each of the four sub-delays individually. From now on, I will use Rµm and Cµm for the resistance and capacitance per micron and assume all wire lengths are in microns.

Transformations of the four sub-delays

• Self delay: this delay is equivalent to the delay on an RC transmission line with length DR2PIN [11]. In theory [Anderson E.], this delay should be equal to

Self delay = ½ * Rµm * Cµm * DR2PIN^2

• Sink delay: this delay is equivalent to a resistor of length DR2PIN driving a capacitor Csink. Therefore,

Sink delay = Rµm * DR2PIN * Csink

[11] DR2PIN: the length of the path over the routed wire from the driver to the sink.

• Off-path wire load delay: the number of off-path wires is not necessarily equal to FANOUT [12] - 1, since there could be multiple off-path pins connected to the same off-path wire. However, if I assume the entire off-path wire branches at the average wire branch distance, then:

Off-path wire load delay = Rµm * AVERAGE_WIRE_BRANCH_DIST * Cµm * TOTAL_OFF_PATH_WIRELEN

I can further rewrite TOTAL_OFF_PATH_WIRELEN = TOTAL_WIRE_LEN - DR2PIN and substitute:

Off-path wire load delay = Rµm * AVERAGE_WIRE_BRANCH_DIST * Cµm * (TOTAL_WIRE_LEN - DR2PIN)

• Off-path pin capacitive load: this delay is the sum of FANOUT-1 delays, each of which is the delay through a wire segment from the driver to a point where the corresponding off-path wire branches.
If the average distance to off-path wire branching is AVERAGE_PIN_LOAD_BRANCH_DIST then:

Off-path pin load delay = (FANOUT-1) * Rµm * AVERAGE_PIN_LOAD_BRANCH_DIST * Csink(i)

Theoretical models of total wire length

There are two extreme cases, which have the lowest and the highest ratio of total wire length to half perimeter. However, I could not find a formal proof that these particular configurations are the extreme cases.

[12] FANOUT: the number of pins on the net, excluding the driver. Thus FANOUT-1 is the number of off-path sinks.

Best Case

TOTAL_WIRE_LEN(HP,FANOUT) = HP

In the best case, all pins are located on a straight line. As a result, the total wire length of the net is equal to its half perimeter.

Worst Case

TOTAL_WIRE_LEN(HP,FANOUT) = HP + (SQRT(FANOUT)-1)*HP/2

All pins fill a square-shaped grid. Each side of the square grid has SQRT(FANOUT) [13] pins. As a result, the total wire length of the net is (SQRT(FANOUT)+1)*HP/2 = SQRT(FANOUT)*HP/2 + HP/2, which is the same expression as above.

Figure 8: Worst case for total wire length/half perimeter.

Theoretical models of driver to sink path length

As with the total wire length, there are two extreme cases of the driver to sink path length.

[13] Here I assume that SQRT(FANOUT) is an integer. The derived upper bound is valid for any integer FANOUT.

Best Case

DR2PIN(HP,FANOUT,DPHP) = DPHP

In the best case, the driver to sink path is the shortest path possible and its length is equal to the half perimeter of the bounding box around the driver and the sink.

Worst Case

DR2PIN(HP,FANOUT,DPHP) = HP + (SQRT(FANOUT)-1)*HP/2

The worst case of driver to sink path length is equal to the worst total wire length case (Figure 9). In such a case the path from the driver to the sink goes through all other pins on the net. Fortunately, this case is quite unlikely to happen.

Figure 9: Worst case driver to sink path length.
The routed wire between the two red pins passes through all other pins on the net, resulting in a driver to sink path length of HP + (SQRT(FANOUT)-1)*HP/2.

Theoretical models of average distance to pin branching point

The average point at which a pin load is located along the driver to sink path should be halfway from the driver to the sink. I assume that the distribution of the driver and sink pins is random. If that is the case, then for each net and each path from the driver to a sink, there is an equivalent net which has the same topology except that the locations of the driver and the sink are exchanged. These two nets have the property that if there is a pin load on the driver to sink path at distance X on one of the two nets, the same load is located at distance DR2PIN - X on the other net. Since there is such a "dual" net for any possible net topology, for a fixed DR2PIN distance the average distance to the pin load branching point is ½ DR2PIN.

Theoretical models of average distance to wire branching point

The reasoning why the average distance to the pin branching point is ½ DR2PIN also applies to the average distance to the wire branching point: assuming a random driver/sink distribution, for each distance to wire branching point X there is an equivalent network with wire branching point distance equal to DR2PIN - X.

Combining theoretical results to obtain a functional form

Combining the theoretical results above into the total delay I get:

Total Delay = (k1 * DR2PIN^2)                            // Self delay
            + (k2 * DR2PIN)                              // Sink delay
            + (k3 * DR2PIN * (TOTAL_WIRE_LEN - DR2PIN))  // Off-path wire
            + (k4 * (FANOUT-1) * DR2PIN)                 // Off-path pin

Rearranging:

Total Delay = K1 * DR2PIN + K2 * DR2PIN^2 + K3 * DR2PIN * TOTAL_WIRE_LEN + K4 * FANOUT * DR2PIN

(where each Ki is a linear combination of the ki: K1 = k2 - k4, K2 = k1 - k3, K3 = k3, K4 = k4)

Verification

I have randomly selected two designs of the four to be the training set and the remaining two to be the testing set [14].
All models are analyzed based on data from the training set, and the accuracy of the models is verified on the test set.

[14] Design-1 and Design-3 are in the training set, while Design-2 and Design-4 are in the testing set.

Confidentiality

I will not reveal the optimal coefficients for each functional form, as they could potentially be used in production code. However, anyone with a good solver and a set of extracted net timing arcs can reproduce similar coefficients.

Results

Finally, I find the coefficients for each functional form which minimize the error on the training set. These coefficients, along with the corresponding functional form, are tested on the test designs. The error specified below each model is the sum of the squared errors between the actual and predicted net delays for each net in the test designs. Therefore, the lower the total error of a model, the higher its accuracy.

The computational complexity of each of these functions is comparable: within each net, they all take linear time in FANOUT to compute for all timing arcs. HP is a max/min of the pin coordinates and therefore takes linear time in FANOUT to compute. Since it is fixed for all sinks, it needs to be computed only once for each net. DPHP takes constant time to compute (∆x+∆y of the coordinates of the driver and the sink), but has to be computed for each sink. Therefore, the FANOUT different DPHPs also require linear time in FANOUT to compute. The difference in computational complexity between the models is thus only in the dominating constants.

I tested the following functional forms:

• DELAY(HP,FANOUT)_1 = c1 + c2*(HP*SQRT(FANOUT))^2 + c3*HP*SQRT(FANOUT) + c4*HP
Error = 1130.4
This functional form is the current state-of-the-art model used in coarse placers. I included it in the test in order for the error comparison to be fair. Therefore, the error above is based on coefficients, which are based only on the two training designs.
These are not the actual coefficients from a state-of-the-art placer, since those are probably based on more designs. (Using the actual coefficients from the state-of-the-art placer lowered the prediction error to 877.0. However, in order to fairly compare the modeling accuracy of the functional forms, they have to be based on the same training set.)

• DELAY(HP,FANOUT)_2 = c1*DR2PIN(HP,FANOUT)^2 + c2*DR2PIN(HP,FANOUT)*(HP*SQRT(FANOUT +c4)), where DR2PIN is:
DR2PIN(HP,FANOUT) = (k1 + k2/(FANOUT+k3) + k4*log(FANOUT))*HP
Error = 1008.1
This functional form is based on a set of experiments with a min-spanning tree router on randomly-generated nets. The accuracy is improved, but the functional form does not have any theoretical support: I picked the functional form which best fitted the data.

• DELAY(HP,DPHP,FANOUT)_1 = c1 + c2*DPHP*SQRT(FANOUT)^2 + c3*DPHP*SQRT(FANOUT)*(HP*SQRT(FANOUT)) + c4*DPHP*SQRT(FANOUT) + c5*DPHP*SQRT(FANOUT)
Error = 914.6
With this functional form I attempted to model DR2PIN as DPHP*SQRT(FANOUT), similarly to the total wire length.

• DELAY(HP,DPHP,FANOUT)_2 = c1 + c2*DPHP^2 + c3*DPHP*(HP*SQRT(FANOUT)) + c4*DPHP + c5*DPHP*FANOUT
Error = 542.4
The best model of total wire delay turned out to be the one in which DR2PIN was modeled as const*DPHP.

The DELAY(HP,DPHP,FANOUT)_2 model improves the accuracy of the current state-of-the-art model by 31%. Since the total error is the sum of squared differences between actual and predicted delays over the same test nets, the square root of the ratio of two total errors is the ratio of the root-mean-square prediction errors of the two models. Therefore the predicted delay of DELAY(HP,DPHP,FANOUT)_2 has, on average, an error which is sqrt(542.4/1130.4) = 0.69 times as big as the error of DELAY(HP,FANOUT)_1, an improvement of 31%.

Remaining source of error

Invariably, some errors in the predictions of the models remain.
Most of that error comes from one or more of the circuit characteristics below:
• Exact routing (layers used by routing and exact routing topology)
• Congestion
• Fringing capacitance
• Coupling capacitance
Unfortunately, some of these unknowns can hardly be predicted without detailed routes or a more sophisticated flow. Still, part of the error could, in theory, be reduced by future research.

Future Work

In this section I have included possible projects that could extend the present work and hopefully lead to the development of more accurate wire delay models for global placement.

More, larger designs

Ideally, I wanted to use more and larger designs, so that the delay models would be more representative. However, since I wrote all parsing scripts in Perl, they take hours to parse the data files even for the small designs. Even worse, once the Perl interpreter runs out of physical memory, its performance degrades further, because it starts using the disk as virtual memory. Therefore, I could not use any large designs. If the parsing and analysis scripts are carefully rewritten in C/C++, it will be possible to obtain models for large (>100K gates) designs.

More libraries

I have used only one library, since it took me about a week to incorporate the library data into my script flow. As part of a different project, Peter Moceyunas (Synopsys) had collected net data for several designs based on an ST Microelectronics physical library with a 0.18µm process, and from the plots I looked at, the dependency of delay on the three modeling variables seemed to be the same. Still, I would not be surprised if the TSMC delay models I developed turn out to be less accurate on highly customized libraries.

Congestion

One area that I could not explore was including congestion information in predicting net delay. Since routers tend to create longer routes over congested regions, I believe that adding congestion estimates would make the wire delay models even more accurate.
Such a model, however, would likely be slower to compute, unless congestion is already being estimated in the global placer (as is the case for most global placers).

Vertical/Horizontal RC, layers

The next step in accurate delay prediction could be looking at the technology characteristics of the design. Since each layer has a different capacitance, for some libraries the lumped vertical and horizontal RCs might differ, allowing for better delay models that distinguish between horizontal and vertical routes. In such a case, the models would need to be a function of one or two additional variables, since VerticalHP and HorizontalHP (i.e. the height and the width of the bounding box) would replace HP, while VerticalDPHP and HorizontalDPHP would replace DPHP.

Implement the new wire delay model

As mentioned in the Results section, the HP-DPHP-FANOUT-2 model has improved accuracy without any significant increase in runtime vs. the current best HP-FANOUT-1 model. However, its impact on the final timing of the designs is yet to be tested in a complete physical design flow.

Special Thanks To:

Will Naylor and Ross Donelly for the long discussions on theoretical models and data analysis
Michael Fu for the many suggestions on the writing style and content of the thesis
Peter Moceyunas for his work based on the ST library
Brent Gregory for writing the DEF parser
Placement Technology Team at Synopsys for having me as a 6A intern

References

Anderson E., Electric Transmission Line Fundamentals, 1985 Reston Publishing Company Inc.
Benkoski J., Strojwas A., The Role of Timing Verification in Layout Synthesis, in 28th Proceedings of the ACM/IEEE Design Automation Conference, 1991, Internet: http://www.sigda.org/Archives/ProceedingArchives/Dac/Dac91/papers/1991/dac91/37_1/37_1.htm
Edwards T., Steer M., Foundation of Interconnect and Microstrip Design, ISBN 0-471-60701-0, LCID TK7876.E35
Johnson R., Wichern D., Business Statistics: Decision Making with Data, 1997 John Wiley & Sons, ISBN 0-471-59213-7, LCID HD30.215.J64
Maheshwari N., Sapatnekar S., Timing Analysis and Optimization of Sequential Circuits, ISBN 0-7923-8321-4, LCID TK7874.75.M35
Nekoogar F., Timing Verification of Application-Specific Integrated Circuits, ISBN 0-13-794348-2, LCID TK7874.6.N45
Rubinstein J., et al., Signal Delay in RC Tree Networks, IEEE Transactions on CAD, vol. CAD-2, 1983, Internet: http://infopad.eecs.berkeley.edu/~icdesign/ee241_s98/PAPERS/archive/sig_del_rc_net.pdf
Smith M., Application-Specific Integrated Circuits, 1997, ISBN 0201500221, Internet: http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/ASICs.htm#anchor11320
Swartz W., Sechen C., Timing Driven Placement for Large Standard Cell Circuits, in 32nd Proceedings of the ACM/IEEE Design Automation Conference, 1995, 13.4
Wolf W., Modern VLSI Design, 1998 Prentice Hall, ISBN 0139896902
Youssef H., Lin R.-B., Shragowitz E., Bounds on Net Delays for VLSI Circuits, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 39, no. 11, 1992
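
Illustrative sketch

The feature computation and error comparison described in the Results section can be sketched as follows. This is a minimal Python illustration, not the Perl flow used for the thesis. The function names, the example pin coordinates, and the coefficient values are hypothetical (the fitted coefficients are deliberately not published), and FANOUT is assumed here to be the number of sinks on the net.

```python
import math


def half_perimeter(pins):
    """Half perimeter of the bounding box enclosing a list of (x, y) pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def net_features(driver, sinks):
    """Return (HP, FANOUT, list of DPHP) for one net.

    HP is computed once per net in time linear in FANOUT.
    DPHP is constant time per sink, so linear in FANOUT overall.
    Assumption: FANOUT is taken as the number of sinks.
    """
    hp = half_perimeter([driver] + list(sinks))
    fanout = len(sinks)
    dphps = [abs(sx - driver[0]) + abs(sy - driver[1]) for sx, sy in sinks]
    return hp, fanout, dphps


def delay_hp_dphp_fanout_2(coeffs, hp, dphp, fanout):
    """DELAY(HP,DPHP,FANOUT)2 from the Results section.

    c1 + c2*DPHP^2 + c3*DPHP*(HP*SQRT(FANOUT)) + c4*DPHP + c5*DPHP*FANOUT
    """
    c1, c2, c3, c4, c5 = coeffs
    return (c1
            + c2 * dphp ** 2
            + c3 * dphp * (hp * math.sqrt(fanout))
            + c4 * dphp
            + c5 * dphp * fanout)


def sum_squared_error(actual, predicted):
    """Total error as used in the Results section."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rms_error_ratio(sse_new, sse_old):
    """Ratio of RMS errors of two models evaluated on the same test nets."""
    return math.sqrt(sse_new / sse_old)


if __name__ == "__main__":
    # Hypothetical net: one driver and three sinks.
    driver = (0.0, 0.0)
    sinks = [(3.0, 4.0), (10.0, 2.0), (6.0, 8.0)]
    hp, fanout, dphps = net_features(driver, sinks)

    # Placeholder coefficients, for illustration only.
    coeffs = (1.0, 0.01, 0.002, 0.05, 0.003)
    for dphp in dphps:
        print(dphp, delay_hp_dphp_fanout_2(coeffs, hp, dphp, fanout))

    # Total errors reported in the Results section.
    print(round(rms_error_ratio(542.4, 1130.4), 2))
```

The last line reproduces the 0.69 ratio quoted in the Results section, i.e. the 31% improvement. Because every functional form in the list except DELAY(HP,FANOUT)2 is linear in its coefficients c1..c5, those forms could be fitted with any linear least-squares solver. The thesis does not name the solver it used.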