IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 6, JUNE 2000 635 Clock Skew Verification in the Presence of IR-Drop in the Power Distribution Network Resve Saleh, Member, IEEE, Syed Zakir Hussain, Member, IEEE, Steffen Rochel, Member, IEEE, and David Overhauser, Member, IEEE Abstract—Clocks are perhaps the most important circuits in high-speed digital systems. The design of clock circuitry and the quality of clock signals directly impact the performance of a very large scale integrated chip. Clock skew verification requires high accuracy and is typically performed using circuit simulators. However, in high-performance deep-submicrometer digital circuits, clocks are running at higher frequencies and are driving more gates than ever, thus presenting a higher current load on the power distribution network with the potential for substantial power grid voltage (IR)-drop. This IR-drop affects the clock timing and must be taken into account in the verification process. Since IR-drop is a full-chip phenomenon, the use of standard circuit simulation on both the clock circuitry and the power-grid is not practical. In this paper, we present a new methodology for the verification of clock delay and skew. An iterative technique is presented for clock simulation in the presence of full-chip dynamic IR-drop. The effect of IR-drop on the timing of clock signals is quantified on a small example, and demonstrated on a large chip. Fig. 1. Simple Buffered Clock Tree. Index Terms—Circuit simulation, CMOS integrated circuits, integrated circuit interconnections, matrix decomposition, power distribution, relaxation methods, very large scale integration. I. INTRODUCTION T HE CLOCK is the fastest and most active signal in a digital integrated circuit. It spans the entire chip and is the most critical control signal in the design [1]. In the early days of very large scale integration technology, clock circuitry was comprised of a few buffers in a simple tree structure, as shown in Fig. 1. A single clock tree drove the entire chip and was relatively easy to design. The clock skew, defined as the maximum difference between the arrival times of the clock signal to the latches, was easily controlled and it was possible to obtain nearly zero-skew clocks through an automatic clock synthesis and routing process [2]. Minimization of the clock skew is one of the key performance goals of clock design [3]. Today, the clock circuit is expected to drive thousands of latches and the design goal is to ensure that the clock signals all reach the latch inputs within a specified time bound. In order to reduce clock delay and skew, the clock Manuscript received January 31, 2000. This paper was recommended by Associate Editor E. Charbon. R. Saleh is with Simplex Solutions, Inc., Sunnyvale, CA 94086 USA (e-mail: res@simplex.com). S. Z. Hussain is with Simplex Solutions, Inc., Sunnyvale, CA 94086 USA (e-mail: szh@simplex.com). S. Rochel is with Simplex Solutions, Inc., Sunnyvale, CA 94086 USA (e-mail: steffen@simplex.com). D. Overhauser is with Simplex Solutions, Inc., Sunnyvale, CA 94086 USA (e-mail: ovee@simplex.com). Publisher Item Identifier S 0278-0070(00)05340-9. Fig. 2. Interconnect delay dominates gate delay in DSM. is composed of large buffers driving long lines over the entire chip. As a result, the clock consumes large amounts of power, often exceeding 25% of the total chip power [4]. With the advent of deep-submicrometer (DSM) technology, designing high-speed clock networks is becoming increasingly complex. Typically, DSM refers to line widths of 0.35 m and below. Mainstream industry has reached and surpassed this feature size, with 0.25 and 0.18 m already in production. As the line width decreases, the number of devices continues to grow according to Moore’s Law. However, it is not the increase in device count alone that is making chip design difficult. Rather, it is the fact that interconnect effects are now dominating the performance of the chip. Parasitic effects such as interconnect resistance and three-dimensional (3-D) capacitance have greatly increased the design complexity, and made clock design a considerable challenge. Fig. 2 illustrates the delay effects observed in DSM: gate delay is decreasing while interconnect delay is increasing with a crossover at roughly 0.35–0.25 m [5]. This type of graph is 0278–0070/00$10.00 © 2000 IEEE 636 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 6, JUNE 2000 II. EFFECT OF IR-DROP ON TIMING Fig. 3. The cause of IR-drop effects. cited in virtually every paper on timing in DSM. In practice, the curves apply to long nets in the design, and clock nets tend to be in this category. The main reasons for the shift from gate dominance to interconnect dominance is the increase in resistance and coupling capacitances. The resistance of the lines continues to increase as the line width decreases and this introduces a significant resistance-capacitance (RC) delay component. Coupling capacitances arise due to 3-D effects of tall, thin metal lines that are closely spaced together. Unfortunately, this makes it difficult to estimate the interconnect loads in advance. Most clock synthesis tools rely on accurate loading information and assume that gate delay is dominant, as shown in Fig. 2. Therefore, it is no longer possible to use these clock synthesis tools to generate zero-skew circuits. Another key issue adding to the complexity of clock design is the IR drop in the power supply [6], [7]. The causes of IR drop are the resistance in the power distribution system, as shown in Fig. 3, and increases in the current flowing through the power grid. In the past, resistance in the power grid could be safely ignored. But with the narrow metal lines used in DSM, resistance is a major factor in the power distribution system. Often, wider lines are used in power busses to reduce resistance, but the speed of today’s clock designs require very large buffers which draw large currents. When these currents flow from the supply to the drivers, any resistance encountered in the power bus causes the voltage to drop. Therefore, the far-end inverter in Fig. 3 experiences a lower supply voltage than the first inverter. Ironically, the clock circuit itself is one of the major causes of IR drop. Of course, large bus drivers, memory decoder drivers and input–output buffers can also lead to significant IR drop during the operation of the chip. In this paper, we focus on the effect of IR drop on clock timing. The IR drop varies dynamically at each buffer as it is related to the switching activity of the clock signal. It is clear that the current delivered by the buffers will vary depending on the level of the IR drop. This fluctuation of the supply voltage causes variations in the delay, skew, and slew rates. These effects can only be captured using detailed transient analysis. For simulation purposes, the clock circuit under analysis must incorporate all the effects shown in Fig. 4. The nonlinear transistors, RC invariations must be terconnect, coupling capacitances and represented to obtain correct timing information. The coupling capacitances can be grounded for the clock simulation since most of the adjacent nets are stable during the rapid clock transition. The circuit model of Fig. 4 is used in our clock verification methodology. Note that each buffer has a different value of that varies as a function of time. Before describing the verification techniques, it is important to understand and quantify the effects of IR drop on timing using a representative design. Since IR drop effectively reduces the supply voltage, the buffers connected to the power grid become weaker and the propagation delay through the clock increases. In this section, we present a quantitative view of exactly how much variation can be expected. For this purpose, two small clock circuits, presented in Fig. 5, were simulated to observe the impact of dynamic IR-drop. The circuits were deliberately kept small enough so that SPICE simulations could be performed to generate the results. The circuit of Fig. 5(a) contains a five-stage clock configured power grid. The clock tree as a tree and connected to the consists of 23 inverters connected together with distributed RC interconnect. The circuit of Fig. 5(b) is an H-tree [9] configuration of the clock which attempts to evenly distribute the clock to all parts of a chip. The power grids, shown as dotted lines in Fig. 5, each consist of a mesh of 10 000 nodes, 20 200 resistors and 10 000 capacitors lumped to ground. Two 1.8-V sources are and . This connected to the power grid at nodes asymmetric connection is used to create a variable dynamic IR drop in the power grid. In Fig. 6, simulation results of the circuit in Fig. 5(a) using a commercial SPICE program are presented. A clock input ramp with a 0.1-ns rise time is applied at node CLK. The waveforms at the output of all five stages are shown along with an IR-drop waveform from the power grid. The simulation results show the transient behavior of the power grid voltage with a maximum drop associated with the switching of stages four and five. The highest IR-drop value of 45 mV or 2.5%, is observed at the power grid tap point of inverter G14. Similar results were obtained for Fig. 5(b) but are not shown here. An instantaneous IR-drop plot for the whole power grid is ps where the most shown in Fig. 7 at the timepoint significant voltage drop occurs, based on the results of Fig. 6. The voltage drop is shaded from very light to very dark as the voltage drop transitions from a lower value to a higher value. The contour plot shows a transition from a very small IR drop in the top corners to a noticeable drop at the positions of buffers G12, G13, G14, and G15. This is expected since the voltage supplies are connected in the top corners and the largest effective resistance occurs from the supplies to these four buffers. In order to assess the effect of IR drop on timing, the power grid was removed and an ideal voltage source was attached to all 23 buffers. The voltage source waveform was a time-varying function that was similar to the IR-drop waveform of Fig. 6, except that the level of IR drop was adjusted to be 0%, 5%, in five successive runs. A scatter 10%, 15%, and 20% of plot of the percentage increase in delay for each case is shown in Fig. 8(a). The H-tree of Fig. 5(b) was simulated using the same experiments to obtain Fig. 8(b). The results show that a given variation, or IR drop, translates directly to percentage of the same percentage of delay variation (note that this is a change in the delay as opposed to a change in skew). For example, a 10% voltage drop causes at most a 10% increase in delay. More accurately, a 10% IR drop causes an increase in delay of about 5% to 10%. In fact, our results show that an % IR drop causes SALEH et al.: CLOCK SKEW VERIFICATION IN THE PRESENCE OF IR-DROP Fig. 4. 637 Buffers and RC interconnect model for clocks. (a) Fig. 5. (b) Clock circuits connected to the power-grid. (a) Simple Clock Tree. (b) Simple H-tree. a delay change of to % and this serves as a useful rule-ofthumb for quantifying the impact of IR drop on timing. To support our claim, we examined the normalized sensitivity . In general, the normalized sensiof delay to changes in tivity of a variable to a variable is given by the standard equation (1) In our case, is the switching delay, , and is the power . The percentage change in the delay is supply voltage, simply the normalized sensitivity multiplied by the percentage change in (2) Now, consider the delay associated with the inverter in Stage 1 of Fig. 4. If the input switches low, then the output of the inverter will be charged to a high value. A simplified expression for the delay of the gate switching from low-to-high is given by (3) is the total capacitance at the output, is the where -channel charging current and the resistance is set to zero. Assuming short channel devices, we expect that the transistor would be velocity saturated for most of the transition. A simple model for the transistor in the saturation region is given in [8] (4) where locity, is the channel width, is the carrier saturation veis the oxide capacitance, is the gate-to-source 638 Fig. 6. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 6, JUNE 2000 Circuit simulation results with dynamic IR-drop. lationship between and delay. It is also consistent with the experimental results of Fig. 8. To see this, consider the nomV in our examples. The sensitivity inal value of at this value is approximately 0.75. This implies that a 10% would produce a 7.5% increase in delay. As decrease in the voltage drops, the delay becomes more sensitive to changes , which further supports the expermental results that a in produces a 5%–10% increase in the delay. 10% decrease in Based on this analysis, we believe that our rule-of-thumb can be used for quantifying the impact of IR drop on timing. Note that the sensitivity of delay to IR drop gets even worse as the power supply is reduced further. III. CLOCK VERIFICATION METHODOLOGY Fig. 7. IR drop at t = 200 ps for power grid of Fig. 5(a). voltage, is the drain-to-source voltage and is the device threshold voltage. We can compute the current in saturation by to obtain [8] setting (5) is the critical electric field and is the channel length. where , if we apply (5) to (3), we can compute After setting the normalized sensitivity from (1) to obtain (6) in Fig. 9 with Equation (6) is plotted as a function of V and V. The sensitivity values lie ranging in the range of approximately 0.9 to 0.5 for from 2.5 V to 4 V. This is expected since there is an inverse re- Because of the importance of clock circuits, designers have been verifying clocks in detail for years. However, a manual approach was used until recently. As DSM technology became prevalent, the interconnect issues and IR-drop problems made manual clock verification obsolete. Designers began to pursue new clock topologies and power routing schemes to avoid these interconnect issues. For example, an H-tree [9] has a central buffer that drives the entire clock tree. It is usually implemented as a custom buffered tree (i.e., multiple buffers distributed throughout the tree) to distribute the clock more evenly throughout the chip. However, skew is more difficult to manage. An alternative to this is the clock grid [10], which reduces local clock skew, but is not area efficient. A third alternative is a hybrid of the H-tree and the clock grid [11], where the advantages of the two basic approaches are exploited, while minimizing the disadvantages. To mitigate potential IR-drop problems, designers use wide metal lines in the topmost metal layer to reduce the resistance. They will often add or remove “straps” (metal wires) to re-route current and minimize IR drop [12]. Other methods include the use of decoupling capacitors and ball-grid arrays [12]. It is apparent that the complexity of the verification increases significantly when using any of these advanced design techniques. SALEH et al.: CLOCK SKEW VERIFICATION IN THE PRESENCE OF IR-DROP 639 (a) Fig. 8. (b) Scatter plot of %Delay Change versus%IR drop Change for two clock structures (a) Clock tree of Fig. 5(a) (b) Simple H-tree from Fig. 5(b). A. Parasitic Extraction Fig. 9. Sensitivity of Delay to V Variations. The manual process is both time consuming and error prone on simple clock trees, but completely impractical on the advanced structures. In fact, the volume of data is too large, and the clock tracing is too complex for a manual approach. Furthermore, the results cannot be patched together since the IR drop must be analyzed across the whole chip. A more reliable approach is to take the final layout and extract the transistor netlist, interconnect and power grid. The next step is to automatically trace the clock from its source node to the latches in the design. Once extracted, the clock circuit can be simulated at the transistor level with its interconnect represented as reduced-order models [13]. If the supply voltage were held constant, a “best-case” clock skew can be determined for each receiver. During the simulation, if the tap current information of the buffers is captured for a power grid IR-drop analysis, the IR-drop analysis can be folded back into the clock simulation to obtain a “worst-case” clock skew. The two clock simulations place a bound on the clock skew in the presence of IR drop. This is the basic strategy used for clock verification described in the rest of this section. The first problem in clock verification is to perform the parasitic extraction. While this is not the subject of this paper, it is important to describe the extraction process and its potential issues. Typically, only resistances and capacitances are extracted for interconnects. In the immediate future, inductance will have to be considered in the clock signals [14], [15] as the speeds continue to increase and as the interconnect technology moves from aluminum to copper [5]. In this paper, we consider only RC effects, although the analysis can be equally applied to interconnect with inductive effects. The emphasis in signal parasitic extraction is on the 3-D capacitance [16]. The interconnect between clock buffers is extracted by taking into account the surrounding interconnect. Both the self-capacitance and the coupling capacitances must be extracted using methods that account for 3-D effects of neighboring lines. During the clock analysis, the coupled capacitances would be grounded since the switching of the clock occurs so rapidly that other signals look like ac ground. In addition to capacitance, the resistances of each metal segment must be extracted using one–dimensional and two-dimensional techniques. It is important to include via resistance in the interconnect extraction. Via capacitance is typically small and therefore negligible (although it is necessary to include their effects if there are a large number of vias along a line). The extraction of the power grid is focused more on resistance extraction since it is the primary source of voltage drop along the power grid. Resistance extraction is performed in the same manner as described above for signals. Extraction of the individual grounded capacitances associated with the power grid can be done in a specialized manner, or in the same way as the interconnect extraction. Either way, the entire full-chip power grid must be extracted for IR-drop analysis. The power grid for and should be extracted for a complete analysis. both Typically, the IR drop in a supply grid is worse than the IR-drop in the ground grid due to the presence of substrate capacitance. power grid is analyzed. However, the In this paper, only the 640 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 6, JUNE 2000 Fig. 10. Simulation of nonlinear devices with constant tap voltages. techniques presented are equally applicable to IR drop in both supply and ground grids. Finally, the actual devices, specifically the MOS transistors, must be extracted. A clock tracing algorithm is required to extract the clock circuit from the transistor netlist to improve the efficiency of the simulation. The tracing algorithm starts at the root node of the clock and searches until latches or flip-flop circuits are encountered. The transistors and interconnect along these paths are tagged as being part of the clock circuit. The loading effects of the final gates driven by the clock buffers can be modeled as constant capacitances. This assumption is made because the size of clock buffers is considerably larger than the devices in the driven gates and latches. Moreover, the linear capacitance in the wiring interconnect shields the nonlinear loading capacitances during the period of fastest signal transitions. The main issue in this extraction step is the size of the resulting data. Storage of the data must be done efficiently, typically in a binary format, and in a data file organization that allows rapid access by downstream tools. A typical power grid and clock for a million-transistor circuit consists of several million metal segments, and thousands of transistors [10]. We have performed the extraction of many designs containing over 5 million transistors in a five-metal layer process and produced approximately 30–40 million resistors and 100 K–200 K transistors. Conventional circuit or timing simulation techniques are unable to handle the volume of data required for dynamic IR-drop simulation. Therefore, alternative methods must be pursued to verify clock circuits in the presence of IR drop. B. Iterative IR-drop Simulation Relaxation methods have been used in the past to decompose large problems into smaller ones before applying an iterative method to obtain a solution [17], [18]. Linear, nonlinear and waveform relaxation methods have all been successfully applied to MOS circuits. We also make use of a relaxation-based approach, except that the partitioning and iteration counts are different from the previous approaches, and specialized methods are used to solve each partition. In previous relaxation-based circuit simulators, the power supply variations were ignored and the circuit was partitioned into subcircuits composed of source-drain connected MOS transistors. These subcircuits were all nonlinear and could easily be solved using standard direct methods. The characteristics of our problem are different and so a different circuit partitioning is required. Specifically, the circuit is partitioned into a very large linear portion and a relatively small nonlinear portion. Given this partitioning, a simulator must be designed with a focus on solving an extremely large sparse linear problem that is generated by the power-grid. A second simulator is required for the nonlinear circuit to provide tap currents for nonlinear devices that are connected to the power grid. In the relaxation process, the power grid and clock are solved separately and their results are synchronized at regular intervals during dynamic simulation. First, the partitioning approach is illustrated in more detail. In Fig. 10, a two-stage inverter circuit that represents the clock is connected to a series of resistors that represent the power grid. For this example, the results of partitioning the problem into the nonlinear and linear portions are shown in Figs. 11 and 12, respectively. While solving the nonlinear circuit of Fig. 11, the tap connections of the -channel voltages are applied at the transistors. The resulting tap currents of these devices, along with their capacitive loadings, are used in the solution of the linear circuit in Fig. 12. Given this partitioning, the success of this approach relies on two factors: the ability to build specialized simulators for each partition and the ability to converge to the correct solution in a few iterations. 1) Convergence: The next issue to consider is the convergence property. Linear relaxation is the easiest to understand. To simplify the problem for illustrative purposes, one can treat the nonlinear devices of Fig. 11 as linear devices by linearizing the transistors. It is easy to prove that linear relaxation would converge to the correct solution if all nodes had grounded capacitors and the timestep was sufficiently small [18]. The speed of convergence would depend on the tightness of the coupling between the two blocks of Figs. 11 and 12. Since the source nodes of the p-channel transistors are connected to the resistive power grid, the coupling tends to be fairly tight, which leads to a large number of iterations to reach convergence. In the case of nonlinear relaxation, sometimes referred to as “timepoint” iteration, each iteration of the nonlinear circuit would involve linearization using Newton’s method. The requirements for convergence are nearly identical to linear relaxation, with the additional requirement that the initial guess must be close enough to the final solution [18]. Since the final solu, it is tion to these types of problems are typically around . Of easy to satisfy this condition: the initial guess is always SALEH et al.: CLOCK SKEW VERIFICATION IN THE PRESENCE OF IR-DROP Fig. 11. 641 Inverter chain with a distributed power grid. Fig. 12. Power-grid with piecewise-constant tap currents and device capacitances. course, if this guess is quite far from the final solution, convergence may be difficult to achieve. But then, the circuit is unlikely . to work anyway because the power supply is not close to So “timepoint” or nonlinear relaxation is a viable approach in this respect, but also suffers from the issue of tight coupling between the two partitions. Finally, if a waveform relaxation (WR) method is used to compute the solution, then, in addition to the requirements for linear relaxation, the window sizes of the iteration must be small enough to achieve convergence [18]. If the window sizes were equal to the timepoints used in nonlinear relaxation, the two algorithms would be identical. However, the use of a larger window size equal to the clock period is more advantageous. Consider the results of two successive WR iterations. In the first iteration, the clock is simulated with zero IR-drop on the power grid. This configuration is depicted in Fig. 11, where and are set to voltage. The resulting both solution of the power grid of Fig. 12 generates the worst-case voltage drops. On the second iteration, all devices in the clock of Fig. 11 encounter large IR-drop voltages at their tap points and therefore draw less current. The next iteration of the power grid of Fig. 12 generates best-case voltage drops. Additional iterations will improve the accuracy, but if the worst-case results are within the specifications of the design, further iterations will be unnecessary. In fact, rather than iterating to convergence, a weighted average of the first two iterations produces a relatively accurate result. From the standpoint of verification, the two-step WR-based method is the most appropriate approach, and avoids any convergence issues. 2) Solution to Linear and Nonlinear Partitions: The partitioning and analysis approach described above has several advantages. By keeping the linear and nonlinear solvers separate, one can design two separate tools and optimize their individual performances. The speed of simulation is greatly increased, as the synchronization between the tools takes place at the end of each simulation period. The worst-case IR-drop obtained after the first iteration provides a good measure of the quality of the power grid and the clock design. The only issue that remains is the solution of each partition. For the nonlinear problem represented by the clock circuit, there are a variety of published approaches that could be used such as EMU [19], SPECS2 [20], iSPLICE3 [21], etc. These methods all use a form of timing simulation and are therefore 10–100 times faster than SPICE. Furthermore, they can all handle the capacity of a typical clock circuit. For interconnect associated with clock signals, a reduction method [13] is needed to generate a lower order model for simulation. If a clock grid structure is used a reduction method and a direct solver must be embedded in the timing simulator because the clock grids tend to generate very large and sparse matrices. One important requirement for the simulation is that it must be accurate to within a few percent of SPICE to produce meaningful results for clock skew. The linear problem generated by the power grid can be solved in a number of ways. Typically, the matrix is symmetric and positive definite [22], and is extremely large and sparse. Therefore, the choices for the solution can be linear relaxation, a pre-conditioned conjugate gradient, or a direct method based on a Cholesky decomposition [22]. While any one of these methods is applicable, large problems dictate the use of a conjugate gradient method with incomplete Cholesky preconditioning, while smaller problems with low fillins can be solved by an efficient implementation of a Cholesky decomposition [7]. These methods are typically 100–1000 times faster than SPICE for large problems. In fact, SPICE cannot even read in the data associated with the power grids of chips containing a million transistors, let alone solve the problem. IV. RESULTS We now present results using the two-step iterative technique described in the previous section. The first example circuit was shown in Fig. 5(a), a simple clock tree connected to a power grid. In Fig. 13, the IR-drop computed using waveform iterations is compared with the exact solution using SPICE simulation. The first iteration results in an overestimate of the IR drop, whereas the second iteration underestimates the value. Further iterations converge toward the exact solution which is indicated by the 642 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 6, JUNE 2000 Fig. 13. Convergence of IR-drop waveform iterations compared to SPICE simulation. solid line. In practice, the iterations converge if the IR-drop is within few percent of the supply voltage. However, this technique can become unstable if the IR-drop is large. This is most likely due to the tight coupling and the fact that the window size is relatively large (we set the window size equal to one clock cycle). Note that the first two iterations bound the exact solution. Instead of continuing with iterations, the final value of the IR-drop can be estimated by taking a weighted average of the first two iterations. This approach produces surprisingly accurate results, as shown in Fig. 13, and once the IR drop is accurate, the skew can be computed accurately. As described earlier, for verification purposes, the best approach is to use a two-step iteration followed by a weighted average of the results. Typical weighting factors are 10% of the first iteration and 90% of the second iteration, but these are adjustable parameters. The second example circuit is a Sparc Core containing 100 K transistors and a powergrid with 190 K resistors. This circuit is used to illustrate the convergence behavior of clock-powergrid iterations. Table I contains the results of the iterations at the output of the fifth stage of the clock tree. It provides the maximum delay and skew values along with the differences between two successive iterations. In this case, convergence occurs when the delta delay and delta skew values are less than a picosecond. The results demonstrate that the first two iterations tend to bound the correct solution. The subsequent iterations converge quickly to the final solution. In this case, we see that the actual solution is closer to the second iteration but it lies between the two iterations. A third example circuit that contained 21 176 transistors and 52 000 RC components in a clock grid was simulated and compared to SPICE. The power grid contained 753 844 resisive elements. The run time for the clock simulation was 1600 s in our nonlinear simulation tool and over 100 h with a commercial SPICE tool. The power grid simulation required less than one 1 h in our linear simulation tool but SPICE could not read the RESULTS FOR TABLE I STAGE 5 RISING CLOCK EDGE entire power grid into memory so a runtime comparison could not be performed. The fourth and final example shown in Fig. 14 illustrates the waveform results for a large chip using the two-step simulation technique. The clock network of this industrial example consists of 30 000 transistors. The power grid itself is modeled by a network of 400 000 resistors and 250 000 capacitors. SPICE was unable to complete the simulation of the power grid or the clock circuit. Our approach required approximately 55 min for the two-step simulations. There are three sets of waveforms in this plot of voltage versus time. Each set shows the actual waveforms from different settings of the IR drop. The skew for each set of waveforms is defined as the width of the waveform envelope measured at the 50% transition point, as illustrated for the waveforms labeled (b) in Fig. 14. The delay here is defined as the time required to reach the midpoint of the skew value, as illustrated in Fig. 14(b). The fastest waveforms, labeled (a), are the results of the simulation without IR drop, which is the same as the first iteration of our method. The results of a second iteration followed by averaging is labeled (b). Note that both the delay and skew of (b) increase relative to (a) because the proper IR drop has been taken into account. In this case, there is a 30% change in the SALEH et al.: CLOCK SKEW VERIFICATION IN THE PRESENCE OF IR-DROP Fig. 14. 643 Simulation results for large chip (a) without IR-drop, (b) dynamic IR-drop, (c) fixed IR-drop. skew value and a 0.2% change in the delay value. Clearly, an error of 30% in the skew calculation cannot be tolerated by the designer. The curves denoted by 14(c) deserve a separate discussion. Often, a designer uses a single worst-case IR drop for all buffers during a SPICE simulation to try to bound the total delay and skew. Our contention is that the results of this approach can be very misleading. For illustrative purposes, we used a single worst case IR-drop value on all buffers and re-ran the transient simulation as shown in Fig. 14(c). The skew value is almost the same as the wrong result of 14(a) and the delay value increased by about 0.5%. While the approach does provide an upper bound on the delay, it masks the 30% change in the skew value which is the more critical metric. Therefore, using a single worst-case static IR drop is misleading and should be avoided. V. CONCLUSION Verification of clocks in the presence of full-chip IR drop is a difficult problem due to conflicting requirements of speed, accuracy and problem size. High-performance clock signals require the highest accuracy because their skews are in the range of picoseconds. However, circuit simulators do not have the speed and memory capacity to handle full-chip power-grid simulation. In this paper, it is demonstrated that the effect of IR drop cannot be ignored in deep submicrometer design. For example, a 10% drop in the supply voltage can lead to a 5%–10% change in the clock timing, and it is only getting worse as the supply voltage is scaled. It has a more pronounced effect on the clock skew, as illustrated by an industrial example which exhibited a 30% change in the skew due to IR drop. A new two-step waveform-based algorithm with averaging was shown to be an effective technique for delay and skew verification of a clock design in the presence of IR drop. Furthermore, the use of a single worst case IR drop for clock skew should be avoided at all costs as it can produce misleading skew information. ACKNOWLEDGMENT The authors would like to thank D. Divekar for assistance with the simulation results in this paper. They would also like to thank the reviewers for providing excellent feedback to improve the overall quality of the paper. REFERENCES [1] E. G. Friedman, Ed., Clock Distribution Networks in VLSI Circuits and Systems. Piscataway, NJ: IEEE Press, 1995. [2] A. Kahng and C. W. Tsao, “Planar-DME: A single-layer zero-skew clock tree router,” IEEE Trans. Computer-Aided Design, vol. 15, Jan. 1996. [3] G. Tellez and M. Sarrafzadeh, “Minimal buffer insertion in clock trees with skew and slew rate contraints,” IEEE Trans. Computer-Aided Design, vol. 16, Apr. 1997. [4] M. K. Gowan, L. L. Biro, and D. B. Jackson, “Power considerations in the design of the alpha 21 264 microprocessor,” in Proc. Design Automation Conf., June 1998, pp. 726–731. [5] “Technology Roadmap for Semiconductors,” Semiconductor Industry Association, San Jose, CA, 1997. [6] G. Steele, D. Overhauser, S. Rochel, and S. Z. Hussain, “Full-chip verification methods for DSM power distribution systems,” in Proc. Design Automation Conf., June 1998, pp. 744–749. [7] A. Dharchoudhury, R. Panda, D. Blaauw, and R. Vaidyanathan, “Design and analysis of power distribution networks in PowerPC microprocessors,” in Proc. Design Automation Conf., June 1998, pp. 738–743. [8] K. Toh, P. Ko, and R. Meyer, “An empirical model for short-channel MOS devices,” IEEE J. Solid-State Circuits, vol. 23, pp. 950–957, Aug. 1988. [9] A. Vittal and M. Marek-Sadowska, “Low-power buffered clocktree design,” IEEE Trans. Computer-Aided Design, vol. 16, pp. 965–975, Sept. 1997. [10] M. P. Desai, R. Cvijetic, and J. Jensen, “Sizing of clock distribution networks for high performance CPU chips,” in Proc. 33rd ACM/IEEE Design Automation Conf., June 1996, pp. 389–394. [11] K. M. Carrig, N. T. Gargiulo, R. P. Gregor, D. R. Menard, and H. E. Reindel, “A new direction in ASIC high-performance clock methodology,” in Proc. Custom Integrated Circuits Conf., May 1998, pp. 27.3.1–27.3.4. [12] R. Saleh, M. Benoit, and P. McCrorie, “Power distribution planning,” in Proc. Design, Automation and Test in Europe Conf., Paris, France, Feb. 1998, pp. 265–270. [13] A. Odabasioglu, M. Celik, and L. T. Pileggi, “PRIMA: Passive reducedorder interconnect macromodeling algorithm,” IEEE Trans. ComputerAided Design, vol. 17, pp. 645–654, Aug. 1998. [14] P. Restle, A. Ruehli, and S. Walker, “Dealing with Inductance in highspeed chip design,” in Proc. Design Automation Conf., June 1999, pp. 904–909. 644 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 6, JUNE 2000 [15] Y. Massoud, S. Majors, T. Bustami, and J. White, “Layout techniques for minimizing on-chip interconnect self-inductance,” in Proc. Design Automation Conf., June 1998, pp. 566–571. [16] N. Arora, K. Raol, R. Schumann, and L. Richardson, “Modeling and extraction of interconnect capacitances for multilayer VLSI circuits,” IEEE Trans. Computer-Aided Design, vol. 15, pp. 58–67, Jan. 1996. [17] R. Saleh and A. R. Newton, “The exploitation of latency and multirate behavior using nonlinear relaxation for circuit simulation,” IEEE Trans. Computer-Aided Design, Dec. 1989. [18] J. White and A. Sangiovanni-Vincentelli, Relaxation-Based Techniques for the Simulation of MOS VLSI Circuits. Norwell, MA: Kluwer Academic, 1986. [19] B. Ackland and N. Weste, “Functional verification in an interactive symbolic IC design environment,” in Proc. 2nd Caltech Conf. VLSI, Jan. 1981, pp. 285–298. [20] C. Visveswariah and R. Rohrer, “SPECS2: An integrated circuit timing simulatorw,” in Proc. ICCAD, Nov. 1987, pp. 94–97. [21] R. A. Saleh, S.-J. Jou, and A. R. Newton, Mixed-Mode Simulation and Analog Multilevel Simulation. Norwell, MA: Kluwer Academic, 1994. [22] G. Golub and C. Van Loan, Matrix Computations. Baltimore, MD: Johns Hopkins Univ. Press, 1989. Resve Saleh (S’78–M’79) received the B.S. degree in electrical engineering from Carleton University, Ottawa, ON, Canada, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California, Berkeley. In 1995, he co-founded Simplex Solutions, Inc., Sunnyvale, CA, to address deep submicrometer-integrated circuit verification and currently serves as Chairman. Prior to starting Simplex, he spent ten years as a Professor in the Department of Electrical and Computer Engineering and the Department of Computer Science at the University of Illinois, Urbana. He has also worked for Toshiba Corporation in Japan, Tektronix in Beaverton, OR, and Mitel Corporation in Ottawa, Canada. He began his career as a Software Engineer with Bell-Northern Research in Ottawa, Canada. He was granted a patent for nonlinear frequency domain analysis in 1997. He has written two books on analog and mixed-mode simulation and published over 50 conference papers and journal articles. Dr. Saleh has served as technical program chair, conference chair, and general chair for the Custom Integrated Circuits Conference in 1993, 1994, and 1995, respectively. He has served on the technical program committees of the Design Automation Conference, International Conference on Computer Design, and, most recently, the International Symposium on Quality in Electronic Design. From 1992-1995, he held the position of chairman of the IEEE Standards Coordinating Committee 30 on Analog Hardware Description Languages (AHDL).He also received an inventor recognition award from the Semiconductor Research Corporation in 1991. He is currently an Associate Editor for the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS. He received a Presidential Young Investigator Award in 1990 from the National Science Foundation. Syed Zakir Hussain (M’97) received the B.Sc. degree with honors in mechanical engineering from North Western Frontier Province (NWFP) University of Engineering and Technology, Peshawar, Pakistan, in 1987. He received the M.S. degree in mechanical engineering in 1992 and the Ph.D. degree in electrical engineering in 1995, from Duke University, Durham, NC. From 1987 to 1988, he worked as a design engineer at National Engineering Services of Pakistan (NESPAK). He is currently a Staff Engineer with Simplex Solutions, Inc., San Jose, CA. His research interests include design and development of simulation tools in the area of VLSI design verification. Steffen Rochel (M’95) received the Ph.D. degree in electrical engineering from the University of Technology at Ilmenau, Ilmenau, Germany, in 1992. In 1992, he joined Anacad in Germany, where he was engaged in research and development of analog and mixed-signal behavioral modeling and simulation techniques. Since 1996, he has been with Simplex Solutions, Inc., San Jose, CA, working on the interconnect verification and analysis products. His research interests include circuit simulation, reliability assessment, and timing and power analysis. He has served on the Custom Integrated Circuits Committee since 1997. David Overhauser (S’86–M’89) received the B.S. degree in math and computer science from Purdue University, West Lafayette, IN, in 1983 and the M.S. and Ph.D. degrees in electrical engineering from the University of Illinois, Urbana, in 1985 and 1989, respectively. From 1989 to 1995, he was an Assistant Professor in Electrical and Computer Engineering at Duke University, Durham, NC, where he founded the Design Automation Technology Center. He has published two books and over 20 technical papers in the area of timing simulation, macromodeling, mixed-signal simulation, and design verification. He is a founder and is currently Vice President at Simplex Solutions, Inc., San Jose, CA. Dr. Overhauser has served on the technical program committees of the Custom Integrated Circuits Conference and the International Symposium for Quality Electronic Design.