IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 25, NO. 6, JUNE 2006 1087

SILCA: SPICE-Accurate Iterative Linear-Centric Analysis for Efficient Time-Domain Simulation of VLSI Circuits With Strong Parasitic Couplings

Zhao Li and C.-J. Richard Shi, Fellow, IEEE

Abstract—A new circuit analysis method, named SPICE-accurate iterative linear-centric analysis (SILCA), is proposed for the efficient and accurate time-domain simulation of deep submicron very large scale integrated (VLSI) circuits with strong parasitic couplings. SILCA consists of two key linear-centric techniques applied to time-domain nonlinear circuit simulation. For numerical integration, explicit-formula substitution and iterative-formula transformation are presented to convert implicit variable time-step integration to fixed leading coefficient (FLC) variable time-step integration. This paper characterizes both convergence and stability properties of the resulting FLC integration formulae. For nonlinear iteration, a successive variable chord (SVC) method is used as an alternative to the Newton–Raphson method. Further, the low-rank update technique is implemented for fast LU factorization. With these techniques, the number and cost of required LU factorizations are reduced dramatically. Experimental results on nonlinear circuits coupled with substrate and power/ground networks have demonstrated that SILCA achieves more than an order of magnitude speedup over SPICE3 in terms of both the cost of LU factorization and the overall CPU time. SILCA is suitable for efficient SPICE-like time-domain simulation of parasitic-coupled VLSI circuits, where the number of linear parasitic elements dominates the number of nonlinear devices.

Index Terms—Circuit simulation, time-domain analysis.

Manuscript received May 4, 2003; revised February 12, 2004, December 23, 2004, March 7, 2005, and April 28, 2005. This work was supported in part by the U.S. Defense Advanced Research Projects Agency NeoCAD Program under Grant 66001-01-1-8920, in part by the National Science Foundation (NSF) CAREER Award under Grant 9985507, and in part by the NSF/Semiconductor Research Corporation Joint Mixed-Signal Initiative under Grant CCR0120371. This paper was recommended by Associate Editor S. Sapatnekar. Z. Li was with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 USA. He is now with Cadence Design Systems, Inc., San Jose, CA 95134 USA (e-mail: zhaoli@cadence.com). C.-J. R. Shi is with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 USA (e-mail: cjshi@ee.washington.edu). Digital Object Identifier 10.1109/TCAD.2005.855943

I. INTRODUCTION

With increasing operation frequency, lower supply voltage, and smaller device feature size, parasitic coupling effects are becoming more and more important for modern deep submicron very large scale integrated (VLSI) circuit designs [24]. The increasing demand to integrate digital, analog, and radio frequency (RF) circuits into a single chip requires accurate analysis of VLSI circuits together with extracted parasitic elements arising from interconnect lines, common substrate, power/ground networks, etc. [1], [20], [24], [30], [32]. Meanwhile, on-chip and packaging inductances can no longer be ignored for accurate circuit analysis [8]. For such purposes, as well as for coupled circuit and electromagnetic modeling [33], SPICE-like simulators are desirable for accurate transistor-level time-domain simulation. However, efficient simulation of such systems presents a complexity challenge to SPICE [21].
For time-domain circuit simulation, SPICE uses numerical integration formulae [2], [19] to form companion models for capacitors and inductors at each time point, and applies the Newton–Raphson method [19] to linearize nonlinear devices. The circuit is then simulated at each time point by iteratively solving a system of linearized equations of the form $Ax = b$, where $A$ is typically the so-called modified nodal analysis (MNA) circuit matrix [19], [21], which is a Jacobian matrix. It is known that device evaluation dominates the simulation of small- to medium-size circuits, and its cost can be reduced with device bypass [14], [21], table lookup [1], parallel computation techniques [13], etc. However, for a system with strong parasitic couplings, the per-iteration cost of SPICE time-domain simulation is dominated by LU factorization [19] of the circuit matrix $A$. In practice, the cost of LU factorization by sparse matrix solvers [12] is $O(n^{1.1})$ to $O(n^{1.5})$ for sparse circuits, where $n$ is the circuit matrix size. However, strong parasitic couplings present in deep submicron circuits can make the circuit matrix much denser; even with model order reduction [22], [28], the cost of LU factorization can approach its worst case of $O(n^3)$ [24].

One key idea to improve the efficiency of SPICE-like circuit simulation is to keep the circuit matrix as constant as possible during the entire time-domain simulation and, therefore, reduce the number of LU factorizations required. This has been implemented in both the numerical integration and nonlinear iteration stages. For numerical integration, several strategies have been proposed for reformulating the backward differentiation formulae (BDF) [2], [19] to keep the leading coefficient constant, since it is the leading coefficient that contributes to the Jacobian matrix.
These include fixed coefficient methods [9] (e.g., in LSODE [25]), fixed leading coefficient (FLC) methods [10] (e.g., in DASSL [23]), overdetermined polynomial methods (ODPM) [5], etc. All these methods have been shown to be effective in bypassing Jacobian matrix factorization. However, the stability of fixed coefficient and FLC methods is worse than that of variable coefficient methods [10]. Furthermore, for fixed coefficient and FLC methods, interpolation must be performed at each time point, which introduces extra errors and increases simulation cost.

0278-0070/$20.00 © 2006 IEEE

The overdetermined polynomial method [5] overcomes the interpolation problem by introducing an extra coefficient in the BDF. However, its stability is worse, and the ODPM-1 formula [5] has been shown to be stable only when the present time step-size $h_n$ is less than or equal to the predefined basis time step-size $h$. Therefore, a large basis time step-size $h$ has been adopted in [5] to keep the ratio $h_n/h$ below 1. The efficiency of the overdetermined polynomial method is thus limited.

To reduce the number of LU factorizations during the nonlinear iteration process, quasi-Newton methods [6], [29] have been studied extensively and applied in circuit simulation [1] and mixed-mode circuit and device simulation [35]. The successive chord method [19] has been explored for fast transistor-level gate-delay calculation [1], where each transistor is modeled as a fixed linear resistor (called a chord) combined with a variable nonlinear current source. Since a fixed chord is used, the circuit matrix does not change during nonlinear iteration, and only one LU factorization is required overall if a fixed time step-size is used.
Unfortunately, there are two principal difficulties that restrict the successful use of this linear-centric idea in the simulation of general VLSI circuits.
1) Most VLSI circuits have widely distributed time constants and require variable time step-size control for simulation efficiency and accuracy. With variable step-sizes, the circuit matrix is no longer constant across time points unless an FLC numerical integration formula, as discussed before, is used.
2) The successive chord method may need an excessive number of iterations to converge, which offsets the gain from the reduction of LU factorizations.

Recently, in the contexts of power grid analysis [4], substrate analysis [24], and parasitic extraction [11], Krylov-subspace-based iterative methods such as the conjugate gradient algorithm, the generalized minimum residual (GMRES) algorithm, etc., have been shown to be more efficient than the method of LU factorization and forward/backward substitution (named the direct method). However, there is no report in the literature of successful and robust applications of iterative methods to classical time-domain nonlinear circuit simulation.

This paper presents SPICE-accurate iterative linear-centric analysis (SILCA), a new direct method capable of analyzing VLSI circuits containing strong parasitic coupling effects with SPICE-like accuracy yet orders of magnitude faster. SILCA consists of applying the linear-centric principle to both numerical integration and nonlinear iteration to keep circuit matrices as constant as possible during variable step-size time-domain nonlinear circuit simulation.
• Two general techniques, namely explicit-formula substitution and iterative-formula transformation, are presented to convert implicit integration formulae in SPICE-like simulators to FLC integration formulae. These formulae lead to constant equivalent conductance in capacitor/inductor companion models.
• The successive variable chord (SVC) method, a variant of the successive chord method, is introduced to keep the linearized conductance of nonlinear devices constant over a larger voltage/current range by incorporating device-related behavioral knowledge. With the SVC method, a piecewise weakly nonlinear (PWNL) MOSFET model is introduced for the calculation of Jacobian matrices. The low-rank update technique is further applied for fast LU factorization, exploiting the fact that only a few nonlinear devices switch PWNL operating regions at any single time point.

With these, the number of required LU factorizations can be reduced by orders of magnitude with a moderate increase in iterations. Thus, rather than solving a newly linearized system by another costly LU factorization, we are able to achieve the same accurate results by several efficient forward/backward substitutions on a previously linearized system. The entire method is robust and accurate, and has been implemented in SPICE3. Further, the proposed method is compatible with other circuit analysis methods, such as model order reduction [22], [28], to achieve even greater simulation speedup.

Some preliminary results of this paper were presented in [16]. The rest of this paper is organized as follows. Section II presents new FLC integration schemes, the analysis of their stability and convergence properties, and methods for adaptive step-size control. Section III presents the SVC method and the low-rank update technique. The SILCA algorithm is described in Section IV. Section V shows experimental results on substrate and power/ground coupling analyses. Section VI concludes the paper.

II. FLC INTEGRATION SCHEMES

In this section, we present and characterize two general techniques that take any implicit integration formula and derive from it an integration formula that yields a constant circuit matrix for variable step-size time-domain circuit simulation. Mathematically, let $x_n, x_{n-1}, \ldots, x_0$ and $\dot{x}_n, \dot{x}_{n-1}, \ldots, \dot{x}_0$ be the values and first-order time derivatives of variable $x$ at time points $t_n, t_{n-1}, \ldots, t_0$. Then any linear multistep numerical integration formula implemented in SPICE-like simulators can be written in the general form

$$\dot{x}_n = \sum_{i=0}^{k} a_i x_{n-i} + \sum_{j=1}^{l} b_j \dot{x}_{n-j} \tag{1}$$

where $a_i$, $i = 0, 1, \ldots, k$, and $b_j$, $j = 1, 2, \ldots, l$, are coefficients of the integration formula, the leading coefficient $a_0$ is nonzero (hence implicit), and $h_n = t_n - t_{n-1}$ is the current time step. Let $h$ be a basis time step-size; the current time step-size can then be rewritten as $h_n = \alpha h$, where $\alpha$ is a positive real number. Notice that only the leading coefficient $a_0$ contributes to the circuit matrix. In general, $a_0$ is a function of $\alpha h$. Since $\alpha$ changes with time points, the circuit matrix would change. To keep the circuit matrix constant, we split the leading coefficient and rewrite the integration formula above as

$$\dot{x}_n = a_0(h)\,x_n + \left[a_0(\alpha h) - a_0(h)\right] x_n + \sum_{i=1}^{k} a_i x_{n-i} + \sum_{j=1}^{l} b_j \dot{x}_{n-j} \tag{2}$$

where $a_0(h)$ is independent of $\alpha$. Then, we would like to substitute $x_n$ in the second term by known values from the previous time points. The first technique is to replace $x_n$ in the second term using an explicit integration formula. This is called explicit-formula substitution. The second technique is to replace $x_n$ in the second term using an initial guess and then iterate to convergence. This is called iterative-formula transformation. With these, the resulting formulae have an FLC and are referred to as FLC integration formulae, following the convention of Jackson and Sacks-Davis [10]. In the following subsections, we use the standard trapezoid formula as an example to derive FLC integration formulae based on explicit-formula substitution and iterative-formula transformation.

Fig. 1. Capacitor companion model using the mixed trapezoid FE formula.
We characterize both stability and convergence properties of the resulting integration formulae. We note that these derivations and analyses can be applied to any implicit integration formula used in a circuit simulator. Furthermore, we present how the resulting formulae can be used in a way similar to the classical predictor-corrector integration scheme, how to adaptively control the basis time step-size, and how to control stability.

A. FLC Integration by Explicit-Formula Substitution

With $h_n = \alpha h$, the standard trapezoid formula can be rewritten as

$$
\begin{aligned}
\dot{x}_n &\approx \frac{2}{h_n}(x_n - x_{n-1}) - \dot{x}_{n-1} \\
&= \frac{2}{\alpha h}(x_n - x_{n-1}) - \dot{x}_{n-1} \\
&= \frac{2\alpha - (2\alpha - 2)}{\alpha h}(x_n - x_{n-1}) - \dot{x}_{n-1} \\
&= \frac{2}{h}x_n - \frac{2}{h}x_{n-1} - \frac{2\alpha - 2}{\alpha h}(x_n - x_{n-1}) - \dot{x}_{n-1}.
\end{aligned} \tag{3}
$$

Now we would like to substitute $x_n$ in the third term by using any explicit Adams–Bashforth formula [19]. The simplest is the forward Euler (FE) formula with a step-size $\alpha h$, defined as

$$x_n \approx x_{n-1} + \alpha h\,\dot{x}_{n-1}. \tag{4}$$

After the $x_n$ in the third term in (3) is approximated by (4), the mixed trapezoid FE formula with a time step-size $h_n = \alpha h$ is obtained as

$$\dot{x}_n = \frac{2}{h}x_n - \frac{2}{h}x_{n-1} - (2\alpha - 1)\dot{x}_{n-1}. \tag{5}$$

The mixed trapezoid FE formula is an implicit integration formula. When $\alpha = 1$, it reduces to the standard trapezoid formula. When $\alpha = 1/2$, it represents the backward Euler (BE) formula with a step-size $h/2$.

To see the circuit interpretation of the mixed trapezoid FE formula (5), the companion model of a linear capacitor is shown in Fig. 1. Note that even though the actual time step-size is $\alpha h$, the equivalent conductance of the companion model is constant as long as the basis time step-size $h$ is constant.

The local truncation error (LTE) measures how closely a numerical integration formula approximates the differential operator. We can prove the following result.

Theorem 1: The LTE $\varepsilon$ of the mixed trapezoid FE formula (5) with time step-size $\alpha h$ is given by

$$\varepsilon = \left(1 - \frac{1}{\alpha}\right)\frac{\ddot{x}_{\xi}}{2}(\alpha h)^2 + \left(1 - \frac{1.5}{\alpha}\right)\frac{\dddot{x}_{\xi}}{6}(\alpha h)^3 \tag{6}$$

where $t_{\xi}$ is between $t_n$ and $t_{n-1}$.
Proof: The proof is similar to the LTE estimation for the standard trapezoid formula [19].

According to Theorem 1, when $\alpha = 1$, the LTE of the mixed trapezoid FE formula reduces to that of the standard trapezoid formula. When $\alpha = 1/2$, it represents the LTE of the BE formula using time step-size $h/2$. The mixed trapezoid FE formula is a second-order integration formula only if $\alpha = 1$ and degenerates to a first-order formula if $\alpha \neq 1$.

In contrast to LTE, stability is a global property related to the growth or decay of the local error introduced at each time point and propagated to the following time points. “Absolute stability” requires that $|\varepsilon_n| < |\varepsilon_{n-1}|$. It is often studied with the use of an RC test circuit as shown in [19, Fig. 5.1]. The stability property of the mixed trapezoid FE formula can be proved as below.

Theorem 2: The absolute stability region of the mixed trapezoid FE formula (5) with time step-size $\alpha h$ is defined by

$$\left|\frac{1 + (2\alpha - 1)z}{1 - z}\right| < 1 \tag{7}$$

where $z = -h/(2\tau)$ and $\tau$ is the time constant of the RC test circuit.

Proof: The proof is similar to the stability analysis for the standard trapezoid formula [19].

The absolute stability regions for $\alpha = 0.625$ and $\alpha = 2.5$ are shown in Fig. 2(a) and (b), respectively.

Fig. 2. Absolute stability regions of the mixed trapezoid FE formula for (a) $\alpha = 0.625$ and (b) $\alpha = 2.5$.

From Theorem 2, two observations can be made on the stability of the mixed trapezoid FE formula.
• When $\alpha > 1$, the absolute stability region moves closer to that of the FE formula. The mixed trapezoid FE formula is not A-stable [19] or stiff stable [9], [19], and cannot be used as a variable time step-size control scheme when $\alpha > 1$.
• When $\alpha < 1$, the absolute stability region includes the open left half of the complex $z$-plane and the mixed trapezoid FE formula is A-stable. When $\alpha$ approaches 1/2, the absolute stability region approaches that of the BE formula.

Further, the smaller $\alpha$, the better the stability. However, according to Theorem 1, for a fixed time step-size $h_n = \alpha h$, a small $\alpha$ will unfortunately result in a large LTE. Therefore, there exists a tradeoff between the stability and the LTE.

Due to the mentioned LTE and stability problems, FLC integration formulae derived from explicit-formula substitution are not suggested for analog circuit simulation with high accuracy requirements. However, they can be used to enhance timing analysis of digital circuits, for example, TETA [1].

B. FLC Integration by Iterative-Formula Transformation

The LTE and stability problems of the mixed trapezoid FE formula come from the replacement of $x_n$ in the third term of (3) by an approximate $x_n$ defined using the explicit FE formula (4). In this subsection, rather than using explicit integration formulae, $x_n$ in the third term of (3) is replaced by the $(k-1)$th iteration solution $x_n^{(k-1)}$ at the present time point, and a new $k$th iteration solution $x_n^{(k)}$ is obtained by solving (3), where $k$ is the iteration number. This leads to the iterative-formula transformation of (3), called the iterative trapezoid formula, written as

$$
\begin{aligned}
\dot{x}_n^{(k)} &= \frac{2}{h}x_n^{(k)} - \frac{2}{h}x_{n-1} - \frac{2\alpha - 2}{\alpha h}\left(x_n^{(k-1)} - x_{n-1}\right) - \dot{x}_{n-1} \\
&= \frac{2}{h}\left(x_n^{(k)} - x_n^{(k-1)}\right) + \frac{2}{\alpha h}\left(x_n^{(k-1)} - x_{n-1}\right) - \dot{x}_{n-1}.
\end{aligned} \tag{8}
$$

The final solution is said to be converged if $|x_n^{(k)} - x_n^{(k-1)}|$ is less than a predefined error tolerance. If the iterative trapezoid formula (8) converges, its LTE will approach that of the standard trapezoid formula. Next, we characterize both convergence and stability properties of the iterative integration formula.
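As a concrete illustration of (8), the iteration can be applied to the scalar test equation $\dot{x} = \lambda x$. The following is a minimal sketch under stated assumptions, not the SILCA implementation: the function names `trapezoid_step` and `iterative_trapezoid_step`, the tolerance, and the numerical values of $\lambda$, $h$, and $\alpha$ are chosen here for illustration only. The point it shows is that the quantity playing the role of the circuit matrix (`lead`) depends on the basis step-size $h$ but not on $\alpha$, while the converged result agrees with a standard trapezoid step of size $\alpha h$.

```python
def trapezoid_step(x_prev, lam, step):
    """Standard trapezoid step of size `step` for dx/dt = lam * x."""
    return x_prev * (1.0 + 0.5 * step * lam) / (1.0 - 0.5 * step * lam)


def iterative_trapezoid_step(x_prev, lam, h, alpha, tol=1e-12, max_iter=200):
    """Advance dx/dt = lam * x by alpha * h using the iterative formula (8).

    Returns the converged value and the number of iterations used.
    """
    dx_prev = lam * x_prev
    # Leading coefficient: depends on the basis step h only, never on alpha.
    lead = 2.0 / h - lam
    x_iter = x_prev  # initial guess x_n^(0) = x_{n-1}
    for k in range(1, max_iter + 1):
        # Right-hand side built from the previous iterate and known history.
        rhs = ((2.0 / h) * (1.0 - 1.0 / alpha) * x_iter
               + (2.0 / (alpha * h)) * x_prev
               + dx_prev)
        x_new = rhs / lead
        if abs(x_new - x_iter) < tol:
            return x_new, k
        x_iter = x_new
    raise RuntimeError("iterative trapezoid formula did not converge")


x_n, iterations = iterative_trapezoid_step(x_prev=1.0, lam=-1.0, h=0.1, alpha=2.0)
x_ref = trapezoid_step(1.0, -1.0, 2.0 * 0.1)
print(abs(x_n - x_ref), iterations)
```

In a circuit setting, the division by `lead` would correspond to a forward/backward substitution with an already factorized matrix, so changing $\alpha$ costs extra iterations rather than a new factorization.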
To study the convergence property, let us write the linear(ized) circuit equation, as used in [22], as

$$Gx + C\dot{x} = b \tag{9}$$

where $G$ and $C$ represent the conductance and capacitance (susceptance) matrices, and $b$ is the vector due to input sources and nonlinear devices. Replacing first-order time derivatives by the iterative trapezoid formula (8), we have

$$\left(G + \frac{2C}{h}\right)x_n^{(k)} = \frac{2C}{h}\left(1 - \frac{1}{\alpha}\right)x_n^{(k-1)} + \frac{2C}{\alpha h}x_{n-1} + C\dot{x}_{n-1} + b. \tag{10}$$

Clearly, the iterative trapezoid formula converges if

$$\rho\!\left(\left(G + \frac{2C}{h}\right)^{-1}\frac{2C}{h}\left(1 - \frac{1}{\alpha}\right)\right) < 1 \tag{11}$$

where $\rho(\cdot)$ represents the spectral radius of the iteration matrix. The above (11) can be rewritten as

$$\left|\frac{1 - \frac{1}{\alpha}}{1 - z}\right| < 1 \tag{12}$$

where $z = -h/(2\tau)$ and $\tau$ is an eigenvalue of the matrix $G^{-1}C$; $\tau$ represents the time constant of the RC test circuit. With this, we can show the following generalized convergence property.

Theorem 3: The convergence region of the iterative trapezoid formula (8) with a time step-size $\alpha h$ is defined by (12).

From (12), to ensure that the iterative trapezoid formula converges for any decaying or stable oscillatory system ($\mathrm{Re}(z) \le 0$), i.e., to have the convergence region include all of the left half of the complex $z$-plane, we must choose $\alpha > 0.5$. To speed up the convergence, $0.625 < \alpha < 2.5$ is used in our implementation and experiments. The convergence region for $\alpha = 0.625$ and 2.5 is shown in Fig. 3. It represents the worst-case convergence region for $0.625 < \alpha < 2.5$.

Fig. 3. Convergence region of the iterative trapezoid formula for $\alpha = 0.625$ and 2.5.

In practice, a maximum iteration number limit is set for each iteration step. If the iteration number exceeds the maximum limit (due to either slow convergence or nonconvergence), the solution process with the same time step-size is attempted one more time with the standard trapezoid formula before the time step-size is decreased.
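The condition (12) and the bound $\alpha > 0.5$ can be spot-checked numerically. The sketch below is illustrative only: the function names, the sampling extent, and the grid density are assumptions, and sampling a grid is a sanity check rather than a proof. It evaluates the factor $|1 - 1/\alpha| / |1 - z|$ on sample points with $\mathrm{Re}(z) \le 0$, including $z = 0$ where the factor is largest.

```python
def convergence_factor(alpha, z):
    """Magnitude of the iteration factor in (12): |1 - 1/alpha| / |1 - z|."""
    return abs(1.0 - 1.0 / alpha) / abs(1.0 - z)


def converges_on_left_half_plane(alpha, extent=10.0, points=41):
    """Check (12) on a sample grid of z with Re(z) <= 0 (spot check, not a proof)."""
    step = extent / (points - 1)
    for i in range(points):
        for j in range(-(points - 1), points):
            z = complex(-i * step, j * step)  # i = 0, j = 0 gives z = 0
            if convergence_factor(alpha, z) >= 1.0:
                return False
    return True


for alpha in (0.4, 0.5, 0.625, 1.0, 2.5):
    print(alpha, convergence_factor(alpha, 0.0), converges_on_left_half_plane(alpha))
```

At $z = 0$ the factor equals $|1 - 1/\alpha|$, which is why values of $\alpha$ at or below 0.5 fail the check while both ends of the range $0.625 < \alpha < 2.5$ give a factor of about 0.6.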
Theoretically, an iterative implicit integration formula has the same stability as the corresponding original implicit formula if the iterative implicit integration formula is solved exactly (iterated to infinity). However, in practice, the iteration is terminated either when the iteration number exceeds a predefined maximum limit or when the convergence criterion is met with the predefined error tolerance. In such a case, the stability of the iterative formula can deteriorate. The stability of the iterative trapezoid formula (8) can be characterized by the following theorem.

Theorem 4: The absolute stability region of the iterative trapezoid integration formula (8), starting with $x_n^{(0)} = x_{n-1}$, with a time step-size $\alpha h$ is defined by

$$\left|\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{k}\frac{2z}{z - \frac{1}{\alpha}} + \frac{\frac{1}{\alpha} + z}{\frac{1}{\alpha} - z}\right| < 1 \tag{13}$$

where $z = -h/(2\tau)$ and $\tau$ is the time constant of the RC test circuit.

Proof: The proof is given in the Appendix.

The absolute stability regions for $\alpha = 0.625$ and $\alpha = 2.5$ with $k = 2$ are shown in Fig. 4(a) and (b), respectively, which can be proven to satisfy the “stiff stability” requirements suggested by Gear [9]. For a fixed iteration number $k$, the absolute stability region of the iterative trapezoid formula approaches that of the standard trapezoid integration formula as $\alpha \to 1$.

Fig. 4. Absolute stability regions of the iterative trapezoid formula with $k = 2$ for (a) $\alpha = 0.625$ and (b) $\alpha = 2.5$.

Furthermore, we give the following stability property of the iterative trapezoid formula.

Theorem 5: When $k \to +\infty$, the absolute stability region of the iterative trapezoid formula (8) includes the entire open left half of the complex $z$-plane and excludes the entire right half of the complex $z$-plane.
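The expression inside (13) can be cross-checked by running $k$ iterations of (8) on the test equation, where each iteration reads $(1 - z)\,x_n^{(k)} = (1 - 1/\alpha)\,x_n^{(k-1)} + (1/\alpha + z)\,x_{n-1}$. The sketch below is a numerical check under assumptions, not a proof: the function names and the sample point $z$ are chosen for illustration. It compares the closed form with the explicit iteration and shows that for large $k$ the ratio tends to the amplification factor $(1 + \alpha z)/(1 - \alpha z)$ of the standard trapezoid formula with step-size $\alpha h$.

```python
def iteration_rate(z, alpha):
    """Per-iteration factor (1 - 1/alpha) / (1 - z) appearing in (12) and (13)."""
    return (1.0 - 1.0 / alpha) / (1.0 - z)


def amplification_closed_form(z, alpha, k):
    """Ratio x_n^(k) / x_{n-1} given by the expression inside (13)."""
    inv = 1.0 / alpha
    return (iteration_rate(z, alpha) ** k * (2.0 * z) / (z - inv)
            + (inv + z) / (inv - z))


def amplification_by_iteration(z, alpha, k):
    """Same ratio, obtained by running k iterations of (8) on the test equation."""
    inv = 1.0 / alpha
    ratio = 1.0 + 0.0j  # x_n^(0) = x_{n-1}
    for _ in range(k):
        ratio = ((1.0 - inv) * ratio + (inv + z)) / (1.0 - z)
    return ratio


def trapezoid_amplification(z, alpha):
    """Amplification factor of the standard trapezoid formula with step alpha*h."""
    return (1.0 + alpha * z) / (1.0 - alpha * z)


def is_absolutely_stable(z, alpha, k):
    """Membership test for the region defined by (13)."""
    return abs(amplification_closed_form(z, alpha, k)) < 1.0


z_sample = complex(-0.3, 0.4)
for alpha in (0.625, 2.5):
    print(alpha,
          abs(amplification_closed_form(z_sample, alpha, 2)
              - amplification_by_iteration(z_sample, alpha, 2)),
          abs(amplification_closed_form(z_sample, alpha, 200)
              - trapezoid_amplification(z_sample, alpha)))
```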
Theorem 5 can be interpreted by noting the following fact: (8) is mathematically equivalent to applying the standard trapezoidal method with basis time step-size $h$ to the differential equation, and then using a quasi-Newton method [6], [29] to solve the resulting equation. Therefore, when (8) is solved exactly ($k \to +\infty$), the absolute stability region of the iterative trapezoid formula is the same as that of the standard trapezoid formula. This can also be verified by setting $k \to +\infty$ in (13).

C. FLC Integration by Predictor-Corrector Scheme

In SILCA, we first apply (5) and then apply (8) in a way similar to the classical predictor and corrector procedure [9]. Noting how (5) is derived, we can see that applying (5) as a predictor and (8) as an iterative corrector with $k$ iterations is mathematically equivalent to applying an explicit predictor (the FE formula in this case) and (8) as an iterative corrector with $k + 1$ iterations. Using (5) to predict an initial guess for the iterative trapezoidal formula (8) can lead to faster convergence than using the previous time-point value as the initial guess for (8).

Very often, we may choose to carry (8) for one or a finite number of iterations, and then use the LTE to adjust time step-sizes, similar to what is done in the classical predictor and corrector procedure. In this case, the predictor–corrector use leads to a stability region worse than that of applying only (8). We can prove the following result.

Theorem 6: The absolute stability region of applying (5) as a predictor and (8) as an iterative corrector with iteration number $k$ is defined by

$$\left|\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{k+1}\frac{2\alpha z^2}{z - \frac{1}{\alpha}} + \frac{\frac{1}{\alpha} + z}{\frac{1}{\alpha} - z}\right| < 1 \tag{14}$$

where $z = -h/(2\tau)$ and $\tau$ is the time constant of the RC test circuit.

Proof: The proof is given in the Appendix.

The absolute stability regions for $\alpha = 0.625$ and 2.5 with $k = 1$ are shown in Fig. 5(a) and (b). Compared to Fig. 4, it can be seen that when $\alpha = 0.625$, the absolute stability region of the predictor–corrector scheme is larger than that of applying (8) alone. However, when $\alpha = 2.5$, the predictor–corrector scheme becomes less stable than applying (8) alone, and it is no longer even stiff stable. Therefore, in SILCA, to ensure stability, (5) is applied as a predictor only if $\alpha < 1$.

Fig. 5. Absolute stability regions of the predictor–corrector scheme with $k = 1$ for (a) $\alpha = 0.625$ and (b) $\alpha = 2.5$.

D. Illustration of Basis Time Step-Size ($h$) Control

As discussed in Section II-B, to satisfy the convergence property defined by Theorem 3, $0.5 < \alpha < +\infty$ is required. The limited $\alpha$ range means that it is impossible to use only one single basis time step-size during transient simulation in our framework. When $h_n/h$ is out of the $\alpha$ range, a new basis time step-size has to be chosen (i.e., the present time step-size $h_n$), which means the circuit matrix has to be updated and a new LU factorization is required. In this sense, a large $\alpha$ range is preferred to decrease the total number of LU factorizations. However, according to Theorem 3, the linear convergence rate of the iterative trapezoid formula is related to $|1 - 1/\alpha|$, and a smaller $|1 - 1/\alpha|$ means a faster convergence rate. Obviously, a small $\alpha$ range is ideal to reduce the total number of iterations. Therefore, in practice, $0.625 < \alpha < 2.5$ ($|1 - 1/\alpha| < 0.6$) is chosen to achieve a balance between the number of LU factorizations and the number of iterations, which will reduce the error to less than 5% of the original error after six iteration steps. Considering that SPICE3 needs at least two iteration steps to converge at a time point, the number of iteration steps with SILCA is approximately 3× that with SPICE3 for general circuits.
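The figures quoted above follow from simple arithmetic on the worst-case linear rate $|1 - 1/\alpha|$. The short sketch below reproduces them; the helper names are hypothetical, and the calculation assumes the bound at $z = 0$, which is the least favorable point of the left half-plane in (12).

```python
import math


def worst_case_rate(alpha_min, alpha_max):
    """Largest |1 - 1/alpha| over [alpha_min, alpha_max], assuming alpha_min < 1 < alpha_max.

    |1 - 1/alpha| decreases on (0, 1] and increases on [1, inf), so the
    maximum over the interval is attained at one of its end points.
    """
    return max(abs(1.0 - 1.0 / alpha_min), abs(1.0 - 1.0 / alpha_max))


def iterations_for_reduction(rate, target):
    """Smallest k with rate**k <= target, for a linear rate strictly between 0 and 1."""
    return math.ceil(math.log(target) / math.log(rate))


rate = worst_case_rate(0.625, 2.5)
print(rate, iterations_for_reduction(rate, 0.05), rate ** 6)
```

With a rate of 0.6, five iterations leave about 7.8% of the initial error and six leave about 4.7%, which is consistent with the six iteration steps and the roughly 3× iteration count relative to two SPICE3 iterations stated in the text.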
The detailed basis time step-size control scheme will be described in Algorithm I of Section IV, and a linear circuit example is shown in Section V to illustrate the efficiency and validity of the iterative trapezoid formula.

It should be noted that SILCA could be combined with fixed time step-size methods to enlarge the range of $\alpha$. As shown in the Appendix, when $\alpha > 1$, the first iteration with the iterative trapezoid formula is equivalent to applying the standard trapezoid formula with the basis time step-size $h$. Then, one idea is to apply the standard trapezoid formula with the basis time step-size $h$ for multiple iterations if $\alpha$ is large. For example, if $\alpha = 3.5$, we could apply the standard trapezoid formula with the basis time step-size $h$ for the first two iterations and then the iterative trapezoid formula for the remaining iterations with $\alpha = 3.5 - 1 = 2.5$. In this way, the convergence and stability properties are not affected, since the range of $\alpha$ for the iterative trapezoid formula is kept unchanged ($\alpha = 2.5$ for the previous example). The extra cost is that more iterations will be required, with more steps performed by a fixed time step-size method.

Fig. 6. Linear RCL circuit example.

E. Illustration of Stability Control

As discussed in Section II-B, the iterative trapezoid formula satisfies the stiff stability [9], [19] and is applicable to stiff circuits [19], [34], such as RC circuits, as long as circuit poles are not too close to the imaginary axis of the complex $z$-plane. For oscillatory circuits with poles close to the imaginary axis in the complex $z$-plane, according to Theorem 3, the absolute stability region of the iterative trapezoid formula with a finite iteration number $k$ will become worse than that of the standard trapezoid formula. This can be illustrated using the linear RCL circuit example in Fig. 6.
The transfer function of $V_{out}$ for the RCL circuit shown in Fig. 6 can be written as

$$H(s) = \frac{V_{out}}{V_{in}} = \frac{1}{RC^2Ls^3 + CLs^2 + 2RCs + 1}. \tag{15}$$

There are three poles for $V_{out2}$: $-0.5689$ and $-0.2151 \pm j1.3071$, and three poles for $V_{out1}$: $-999999$ and $-0.5 \pm j1000$. Noting that $z = (h\lambda)/2$, among these six poles, $-0.5689$, $-999999$, and $-0.2151 \pm j1.3071$ are on or close to the negative real axis of the complex $z$-plane; therefore, they will not cause stability problems, since the iterative trapezoid formula has the stiff stability property. However, the remaining two poles $-0.5 \pm j1000$ are far away from the negative real axis and close to the imaginary axis of the complex $z$-plane, and may not be covered by the absolute stability region when convergence is achieved in a small iteration number $k$ (i.e., the blank region in the left half of the complex $z$-plane as shown in Fig. 4(a) and (b) for $k = 2$). The simulation results with SPICE3 and SILCA (without the stability control) are shown in Fig. 7. Unstable simulation results are observed with SILCA, and the number of iteration steps with SILCA is 2× that with SPICE3. It should be noted that BDF [19] of order larger than two have the same stability problem for oscillatory circuits.

Fig. 7. Time-domain output waveform of $V_{out1}$ for a linear RCL circuit example.

This can be explained by comparing Figs. 3 and 4, which show that the stability region is smaller than the convergence region. In other words, the stability region might not cover all circuit poles upon convergence if the stability requirement (Theorem 4 or Theorem 6) is stricter than the convergence requirement (i.e., the user-specified error tolerance for convergence justification). A tighter error tolerance for convergence will help alleviate the stability problem. However, more iterations and/or more time points have to be simulated. Therefore, SILCA is not recommended for highly oscillatory circuits.

III. SVC METHOD

SPICE-like circuit simulators use the Newton–Raphson method to solve a set of nonlinear equations. Typically, for each Newton–Raphson iteration, a new LU factorization is required. This can be extremely costly for a circuit with strong parasitic coupling effects or with reduced dense linear networks. The successive chord method [19] always uses a fixed chord as the first-order derivative during nonlinear iteration. Hence, at each time point, only one LU factorization is needed for nonlinear iteration. But it is often hard to choose a single fixed chord for a (strongly) nonlinear curve that always ensures a good convergence rate. In general, a chord that ensures global convergence will unfortunately lead to a slow convergence rate.

To achieve a good balance between the number of LU factorizations and that of iterations, we propose the SVC method. The basic idea is to divide a nonlinear curve into different segments, each of which represents a weakly nonlinear curve, and the same (local) chord is used for the same segment during nonlinear iteration, the so-called PWNL analysis. As shown in Fig. 8, the nonlinear curve is divided into three PWNL segments with three local chords, each of which represents the maximum derivative for the corresponding segment. A new LU factorization is performed only if the nonlinear curve enters a different PWNL segment, where a new local chord is used. With this method, convergence speed and accuracy similar to those of the Newton–Raphson method can be achieved while the number of LU factorizations is decreased. We emphasize that the PWNL model of a nonlinear device is used only for the calculation of first-order derivatives, while the nonlinear function is still evaluated using the original nonlinear device model.

Fig. 8. PWNL example implemented with the SVC method.
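The PWNL idea can be sketched on a single node where a voltage source drives a diode through a resistor. This is only an illustrative sketch, not the SILCA implementation: the diode parameters, the segment boundaries in `BOUNDS`, the function names, and in particular the rule that clamps an iterate to a segment boundary when a step crosses it are assumptions made for this example and are not taken from the paper. A scalar division stands in for the LU factorization and forward/backward substitution of the circuit matrix, and the counter `factorizations` is incremented only when the local chord changes.

```python
import math

I_SAT = 1e-14   # assumed diode saturation current (A)
V_T = 0.025852  # assumed thermal voltage (V)
R = 1000.0      # assumed series resistance (ohm)
V_S = 1.0       # assumed source voltage (V)

# Assumed PWNL segment boundaries for the diode voltage (V).
BOUNDS = [0.0, 0.4, 0.55, 0.65, 0.8]


def diode_current(v):
    return I_SAT * (math.exp(v / V_T) - 1.0)


def diode_conductance(v):
    return I_SAT / V_T * math.exp(v / V_T)


# Local chord per segment: the maximum derivative, reached at the upper bound
# because the diode conductance increases monotonically with voltage.
CHORDS = [diode_conductance(hi) for hi in BOUNDS[1:]]


def kcl_residual(v):
    """Current leaving the node, evaluated with the original nonlinear model."""
    return (v - V_S) / R + diode_current(v)


def svc_solve(tol=1e-12, max_iter=500):
    """Chord iteration with one fixed chord per segment, starting from v = 0."""
    v, seg = BOUNDS[0], 0
    factorizations = 1  # one "factorization" for the initial chord
    for iteration in range(1, max_iter + 1):
        lo, hi = BOUNDS[seg], BOUNDS[seg + 1]
        # Derivative comes from the chord; the residual uses the full model.
        v_new = v - kcl_residual(v) / (1.0 / R + CHORDS[seg])
        if v_new > hi and seg < len(CHORDS) - 1:
            v_new, seg = hi, seg + 1      # assumed clamp: enter next segment
            factorizations += 1
        elif v_new < lo and seg > 0:
            v_new, seg = lo, seg - 1      # assumed clamp: enter previous segment
            factorizations += 1
        elif abs(v_new - v) < tol:
            return v_new, factorizations, iteration
        v = v_new
    raise RuntimeError("SVC iteration did not converge")


v_sol, n_fact, n_iter = svc_solve()
print(v_sol, n_fact, n_iter)
```

In this toy case the chord changes only a few times while many inexpensive iterations are performed, whereas a Newton–Raphson loop would update the derivative, and hence refactorize, at every iteration. The trade-off between segment count and iteration count depends on the assumed boundaries.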
The PWNL idea implemented with the SVC method can be very effective for the following reasons.

1) Since MOSFETs in analog applications generally operate linearly around their operating points, only weakly nonlinear properties may be present. A fixed chord representing the gm, gmbs, and gds of MOSFETs at their operating points is generally sufficient. A linear-centric harmonic balance analysis method has been proposed in [15].

2) MOSFETs in digital applications reside in one of two regions at most time points: the cutoff region and the well-conducted linear region with a very small source-to-drain voltage. Both regions have relatively constant gm, gmbs, and gds. The only situation where gm, gmbs, and gds change significantly is when MOSFETs switch from the cutoff region through the saturation region to the linear region (or vice versa). This process occupies only a small fraction of the total simulation time for a MOSFET in a large-scale digital circuit. Hence, a fixed chord for these situations will not significantly affect the total iteration process.

With the above considerations, five MOSFET PWNL operating regions for digital circuit applications are defined as shown in Fig. 9, and gm, gmbs, and gds for the different operating regions are listed in Table I. In Table I, Reg#0 represents the cutoff region, Reg#1 and Reg#3 are saturation regions, and Reg#2 and Reg#4 are linear regions. gm−max and gmbs−max are the maximum values over all the regions (defined by Vdd), and gds−i is defined (generally as the maximum value) for each region to ensure convergence. It should be noted that, theoretically, the convergence rate of the SVC method is linear, but in practice it can be kept close to that of the Newton–Raphson method by using more PWNL regions if needed.

Fig. 9. PWNL operating regions of MOSFETs for digital applications.

TABLE I gm, gmbs, AND gds FOR DIFFERENT MOSFET PWNL REGIONS
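The remark on the linear convergence rate can be made concrete with a short worked derivation. This is a sketch for the scalar case only, assuming a differentiable, monotonically increasing f with root x*; the symbols c (chord) and e_k (error) are introduced here for illustration and do not appear in the original text.

```latex
% Scalar chord iteration with a fixed chord c for f(x) = 0, exact root x^{*}:
x_{k+1} = x_k - \frac{f(x_k)}{c}, \qquad e_k = x_k - x^{*}

% A first-order expansion f(x_k) \approx f'(x^{*})\, e_k gives the error recursion
e_{k+1} \approx \left( 1 - \frac{f'(x^{*})}{c} \right) e_k

% Linear convergence requires
\left| 1 - \frac{f'(x^{*})}{c} \right| < 1

% Choosing c as the maximum of f' > 0 over the segment keeps the factor in [0, 1).
% Newton--Raphson corresponds to c = f'(x_k), for which the first-order factor vanishes.
```

Narrower segments bring the local chord closer to the true derivative at the solution, which shrinks the contraction factor; this is consistent with the observation that more PWNL regions bring the rate closer to that of the Newton–Raphson method.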
Another advantage of the SVC method is that chords can be precalculated and stored before simulation, so that no derivative calculation is required during nonlinear iteration, unlike in the Newton–Raphson method. This can lead to a significant saving in device loading time. Furthermore, table lookup models can be implemented more easily in SILCA than in SPICE, since lookup tables for first-order derivatives are not needed.

The proposed SVC method is more accurate and effective than the ad hoc device bypass techniques utilized in modern circuit simulators [14], [21], where device evaluations are bypassed when the terminal voltages of a nonlinear device stay almost constant for a few consecutive nonlinear iteration steps. It has been reported that the voltage/current range of device bypass has to be kept small enough to avoid incorrect simulation results [21]. For high-frequency deep submicron circuit applications, however, the efficiency of bypass is limited, since the terminal voltages of most nonlinear devices are not completely constant but change slowly. The SVC method defines PWNL segments and local chords based on the behavior of the specific nonlinear devices under study; therefore, it can keep the circuit matrix constant over a larger voltage/current range and requires far fewer LU factorizations.

With the above MOSFET PWNL operating region definition, only five sets of gm, gmbs, and gds are used during time-domain simulation of digital systems. We further make the following observations.

1) At any one time point, most MOSFETs in a large digital system stay in their PWNL operating regions as defined above, while only a few may switch from one region to another.

2) For a switching MOSFET, the update of gm, gmbs, and gds is regionwise. In other words, the change of gm, gmbs, and gds from Reg#i to Reg#j is fixed.
Therefore, when a small number of MOSFETs change their PWNL operating regions, we can compute the new L and U matrices directly from the old L and U matrices using the low-rank update technique [7], [31] rather than performing a costly LU factorization of the entire circuit matrix. Suppose that the previous circuit matrix is Y and one MOSFET is now switching from Reg#1 to Reg#2. The new circuit matrix Y' for the next iteration can be expressed by

$$ Y' = Y + c\,r^{T} \tag{16} $$

where c and r are sparse column vectors representing values of updated elements. In this case, c = r = [0, ..., 0, e, 0, ..., 0, −e, 0, ...]^T, and e = |gds−2 − gds−1|. Noting that only four elements differ between the matrices Y and Y', the new L and U matrices for Y' can be updated efficiently from the previous ones for Y with the low-rank update technique.

TABLE II ALGORITHM FOR SILCA TIME-DOMAIN SIMULATION

The worst-case cost of m low-rank updates for a dense matrix is O(mn²), where m is the number of updated elements and n is the matrix size. If m is much less than n, the low-rank update is much faster than a regular LU factorization, whose worst-case cost is O(n³) for a dense matrix. With the introduced MOSFET PWNL definition, m is kept small at any time point, since the number of MOSFETs whose terminal voltages change so abruptly that the operating region is switched is generally small. Furthermore, the low-rank update cost can be decreased dramatically by exploiting sparse matrix techniques [7], [31] and nonlinear/linear circuit partitioning, which places the matrix elements due to nonlinear devices at the bottom-right corner of the circuit matrix [7]:

$$ Y = \begin{bmatrix} Y_{lin} & Y_{coup} \\ Y_{coup}^{T} & Y_{non} \end{bmatrix} \tag{17} $$

where the entries of the sparse block Ynon are marked ×, ⊗, or ⊕ according to how they are affected by an update.
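The benefit of reusing existing factors after a rank-one change such as (16) can be sketched with the Sherman–Morrison identity. Note the substitution: this is a related, simpler technique and not the direct L/U factor update of [7], [31] used in the paper. The matrix values, the conductance change `dg`, and all function names are hypothetical. To keep the product c rᵀ equal to the intended conductance change, the sketch places the magnitude in `c` and uses a ±1 pattern in `r`.

```python
# Solve (Y + c r^T) x = b using only the LU factors of the old matrix Y.

def lu_factor(a):
    """Doolittle LU without pivoting (adequate for diagonally dominant Y)."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def lu_solve(lu, b):
    """Forward/backward substitution with packed factors (unit-diagonal L)."""
    n = len(lu)
    x = b[:]
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def solve_rank_one_updated(lu, c, r, b):
    """Sherman-Morrison: two substitutions with the old factors, no refactoring."""
    y = lu_solve(lu, b)
    w = lu_solve(lu, c)
    denom = 1.0 + dot(r, w)
    if abs(denom) < 1e-12:
        raise ValueError("updated matrix is (numerically) singular")
    scale = dot(r, y) / denom
    return [yi - scale * wi for yi, wi in zip(y, w)]


def demo():
    y_old = [[3.0, -1.0, -1.0], [-1.0, 3.0, -1.0], [-1.0, -1.0, 3.0]]
    b = [1.0, 0.0, 0.0]
    dg = 0.5                          # hypothetical change of a branch conductance
    c = [dg, -dg, 0.0]                # touches four entries between nodes 0 and 1
    r = [1.0, -1.0, 0.0]
    x_fast = solve_rank_one_updated(lu_factor(y_old), c, r, b)
    y_new = [[y_old[i][j] + c[i] * r[j] for j in range(3)] for i in range(3)]
    x_ref = lu_solve(lu_factor(y_new), b)   # reference: full refactorization
    return x_fast, x_ref


if __name__ == "__main__":
    print(demo())
```

Each updated solve costs two forward/backward substitutions with the old factors, which is the same kind of saving that motivates updating L and U in place instead of refactoring.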
In this way, only the matrix elements whose values need to be updated are recomputed, while all other matrix elements are kept the same as before, as indicated in (17). For example, the circuit matrix Y in (17) is partitioned into the linear part Ylin, the nonlinear/linear coupling part Ycoup, and the nonlinear part Ynon. Whenever a nonlinear device changes its operating region (affecting four matrix elements of Ynon in this example, marked by ⊗), Ylin and Ycoup are kept the same. For the sparse matrix Ynon, only nine matrix elements need to be updated (marked by ⊗ and ⊕) and the other eight are kept unchanged (marked by ×). Therefore, the matrix sparsity can be fully exploited by the low-rank update technique. SILCA utilizes the sparse matrix solver package SPARSE1.3 [12], in which a sparse low-rank update algorithm has been implemented successfully. In practice, if the value of a diagonal element (Lii) becomes smaller than a predefined threshold during low-rank updates, the diagonal element is not suitable for the following steps. In this case, SILCA falls back to a regular LU factorization.

IV. SILCA ALGORITHM

The basic algorithm for SILCA time-domain simulation is shown in Table II. Practical considerations, such as processing breakpoints [21], are not included for clarity. A new LU factorization is only required if the standard implicit integration scheme is used. When only the local chords of nonlinear devices change, a low-rank update is performed in place of a full LU factorization. No LU factorization is needed in any other case.

Nonlinear capacitors can be handled in SILCA by combining the proposed iterative trapezoid integration formula and the proposed SVC method, as illustrated by

$$
\begin{aligned}
\dot{Q}_n^{(k)} &= \frac{2}{h_n}\left(Q_n^{(k)} - Q_{n-1}\right) - \dot{Q}_{n-1} \\
&\approx \frac{2}{h_n}\left[Q_n^{(k-1)} + C_n^{(k-1)}\left(V_n^{(k)} - V_n^{(k-1)}\right) - Q_{n-1}\right] - \dot{Q}_{n-1} \\
&= \frac{2C_n^{(k-1)}}{h_n}V_n^{(k)} - \frac{2C_n^{(k-1)}}{h_n}V_n^{(k-1)} + \frac{2}{h_n}\left(Q_n^{(k-1)} - Q_{n-1}\right) - \dot{Q}_{n-1} \\
&\approx \frac{2C}{h}V_n^{(k)} - \frac{2C}{h}V_n^{(k-1)} + \frac{2}{h_n}\left(Q_n^{(k-1)} - Q_{n-1}\right) - \dot{Q}_{n-1}.
\end{aligned}
\tag{18}
$$

In the above derivation, a linearized capacitance C is introduced to represent the PWNL definition of the nonlinear capacitor. The basis step-size h is used as previously. For clarity, the nonlinear charge is assumed to be a function of a single voltage. In practical MOSFET models, nonlinear charges are generally affected by three voltages: Vgs, Vds, and Vbs. For example, the nonlinear charge between the drain and the source of a MOSFET is Qds = Q(Vgs, Vds, Vbs). Suppose that both the linearized capacitors of Qds (Cds, Cm, and Cmbs) and the linearized conductors of Ids (gds, gm, and gmbs) in a MOSFET need to be updated because of a switch of PWNL regions. The contribution of the MOSFET to the circuit matrix is then as follows:

$$
\begin{array}{c|cccc}
 & D & G & S & B \\ \hline
D & \Delta G_{ds} & \Delta G_{m} & -\Delta G_{ds} - \Delta G_{m} - \Delta G_{mbs} & \Delta G_{mbs} \\
S & -\Delta G_{ds} & -\Delta G_{m} & \Delta G_{ds} + \Delta G_{m} + \Delta G_{mbs} & -\Delta G_{mbs}
\end{array}
=
\begin{bmatrix} |a| \\ -|a| \end{bmatrix}
\begin{bmatrix} \dfrac{\Delta G_{ds}}{|a|} & \dfrac{\Delta G_{m}}{|a|} & -\dfrac{a}{|a|} & \dfrac{\Delta G_{mbs}}{|a|} \end{bmatrix}
$$

$$
\Delta G_{ds} = \frac{2\Delta C_{ds}}{h} + \Delta g_{ds}, \quad
\Delta G_{m} = \frac{2\Delta C_{gs}}{h} + \Delta g_{m}, \quad
\Delta G_{mbs} = \frac{2\Delta C_{bs}}{h} + \Delta g_{mbs}, \quad
a = \Delta G_{ds} + \Delta G_{m} + \Delta G_{mbs}. \tag{19}
$$

There are a total of eight matrix entries to be updated. With the above representation, the rank-one update algorithm [7], [31] can be used to realize fast LU factorization. If more matrix entries (at most 16 for a MOSFET) are affected by the switch of PWNL regions, a series of rank-one or rank-m updates [3] is required to perform fast LU factorization. In this case, the efficiency of low-rank updates may be reduced.

Fig. 10. Linear RCL circuit example.

TABLE III SIMULATION RESULTS OF A LINEAR CIRCUIT EXAMPLE

V. EXPERIMENTAL RESULTS

Four sets of experiments are reported to demonstrate the validity and efficiency of the introduced linear-centric techniques. The first test uses a simple linear RLC circuit to demonstrate the proposed predictor–corrector integration scheme.
The second test uses a variety of relatively small analog, digital, and RF circuits to evaluate the effectiveness of the SVC method implemented with the low-rank update technique. The last two examples are circuits coupled with substrate and power/ground networks, which are used to demonstrate the scalability of SILCA on larger circuits, where a substantial portion of the circuit consists of linear parasitic devices. The level 1 model of MOSFETs is implemented with the proposed PWNL idea in SILCA, and nonlinear capacitors in MOSFETs are simplified as linear ones in both SILCA and SPICE3. To make a fair evaluation of the benefits of the proposed linear-centric techniques, no table lookup models of MOSFETs are used and no RC(L) model order reduction algorithm is utilized.

Fig. 11. Histogram of the number of iterations for a linear circuit example.

A. Evaluation of Predictor–Corrector Integration Scheme

The efficiency of the predictor–corrector integration scheme can be illustrated with the simple linear circuit example shown in Fig. 10. It includes two RCL circuits with different time constants. The input is a pulse signal (initially at the low voltage level of 0 V) with a 50% duty ratio and an 80-s period. The simulation length is set to 160 s. Since the minimum time constant is 0.01 s for the left half RCL circuit, at least 16 000 time points are required for a fixed time step-size simulation. The simulation results are shown in Table III, where #Total points represents the total number of simulated time points and #Accepted points represents the number of accepted time points. The rejected time points are those violating the LTE requirement or exceeding the maximum iteration limit.
Since SILCA applies an adaptive time step control scheme similar to that in SPICE3, based on the LTE requirement, it can be seen from Table III that SILCA and SPICE3 achieve similar #Total points and #Accepted points, which are far fewer than those required by a fixed time step-size method. Furthermore, the number of LU factorizations used by SILCA decreases to 1.14% of that of SPICE3 (an 87.63× LU factorization cost saving). The number of iterations increases to about 2.5×. Fig. 11 shows the histogram of the number of iteration steps, from which it can be seen that most of the iterations converge in two to six steps. Fig. 12 shows the distribution of actual time step-sizes (hn = αh) during SILCA simulation. It can be seen that most simulated time step-sizes are between 0.05 and 0.2 s, centering around 0.08 s. Recall that since we choose 0.625 < α < 2.5, it is possible that fewer basis time step-sizes are required. This is confirmed by Fig. 13, which shows the distribution of basis time step-sizes (h) during SILCA simulation. It can be seen that most basis time step-sizes are the same and near 0.08 s. In SILCA, it is the basis time step-size that is used for circuit matrix construction. Therefore, SILCA keeps the circuit matrix constant as long as the basis time step-size is constant. The histogram of basis time step-sizes with SILCA is shown in Fig. 14. Comparing it with Fig. 13, it can be concluded that most basis time step-sizes are near 0.08 s and constant during the following time intervals: 10–40, 45–80, 80–120, and 120–160 s.

Fig. 12. Distribution of actual time step-sizes for a linear circuit example.

Fig. 13. Distribution of basis time step-sizes for a linear circuit example.

Fig. 14. Histogram of basis time step-sizes for a linear circuit example.

TABLE IV SIMULATION RESULTS OF NONLINEAR TEST CIRCUITS∗
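The effect of the basis time step-size on the number of matrix rebuilds can be sketched as follows. This is an illustrative model only: the rule for picking a new basis (simply adopting the current step-size) is an assumption, since the text here does not specify it, and the step-size sequence and function names are hypothetical.

```python
ALPHA_MIN, ALPHA_MAX = 0.625, 2.5   # range of alpha = hn / h quoted in the text


def count_basis_rebuilds(step_sizes):
    """Count how often a new basis step-size h (and thus a new matrix) is needed.

    Assumption: when hn / h leaves (ALPHA_MIN, ALPHA_MAX), the new basis is
    taken to be the current step-size hn.
    """
    basis = None
    rebuilds = 0
    for hn in step_sizes:
        if basis is None or not (ALPHA_MIN < hn / basis < ALPHA_MAX):
            basis = hn
            rebuilds += 1
    return rebuilds


def count_step_changes(step_sizes):
    """Rebuild count if the matrix depended on every actual step-size hn."""
    return sum(
        1 for i, hn in enumerate(step_sizes) if i == 0 or hn != step_sizes[i - 1]
    )


if __name__ == "__main__":
    # Hypothetical accepted step-sizes in seconds.
    steps = [0.08, 0.1, 0.06, 0.15, 0.19, 0.3, 0.25, 0.01, 0.012]
    print(count_basis_rebuilds(steps), count_step_changes(steps))
```

For this hypothetical sequence, tying the matrix to the basis step-size needs far fewer rebuilds than tying it to each actual step-size, which is the mechanism behind the reduced LU factorization count reported in Table III.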
∗ For each circuit, the first row is the SPICE3 result and the second row is the SILCA result.

It should be noted that SILCA is mainly designed to speed up circuit simulation when most of the time step-sizes hn are close to the basis time step-size h, i.e., 0.625 < (hn/h) < 2.5 in our experiments. In general, for transient simulation of parasitic-sensitive circuits, most time step-sizes are close to the basis time step-size over a relatively long time interval when the transient behavior of the circuit does not change significantly, i.e., when it stays in either the logical "0" state or the logical "1" state. When the time step-sizes hn change abruptly, a new basis time step-size h is chosen. However, based on our experience, such cases are rare (i.e., near breakpoints). Further, SILCA can be combined with fixed time step-size methods to enlarge the range of (hn/h), as discussed in Section II-D.

B. Evaluation of SVC and Low-Rank Update

To illustrate the efficiency of the SVC method and low-rank update techniques, simulations of several analog, digital, and RF circuits have been performed, and the results are shown in Table IV. It can be seen that the number of iterations generally increases to 1.5–2.5× that with SPICE3, but the number of LU factorizations used by the SVC method with low-rank update decreases to 3%–20% of that used by SPICE3. The saving in LU factorization with low-rank update is greater for larger circuits, such as a 20-stage inverter chain, a ring oscillator, and a voltage-controlled oscillator (VCO).

Fig. 15. Substrate coupling example.
In general, a low-rank update technique is more efficient for simulating a nonlinear circuit with a large-scale (potentially dense) network of linear elements, since only the L and U matrices for the sparse nonlinear part need to be updated during nonlinear iteration and the dense linear part remains unchanged.

It should be pointed out that although the number of LU factorizations is reduced dramatically with the SVC method and low-rank update techniques, the speedup for the circuits in Table IV is not dramatic, since the simulation cost is dominated by device evaluation. As a relaxed direct method, SILCA has to perform more device evaluations than SPICE3, since more iteration steps are required. For the Opamp follower example (including 32 MOSFETs, eight capacitors, and four current sources), SPICE3 runs for 17.87 s, of which 12.06 s is spent on device loading, while SILCA requires 18.65 s, with 13.89 s spent on device loading. In this case, SILCA is more costly than SPICE3 because the simulation time is dominated by device loading. Therefore, SILCA is more suitable for parasitic-coupled VLSI circuits, where the number of linear parasitic elements dominates the number of nonlinear devices.

C. Coupled Circuit and Substrate Analysis

The third example is a simple substrate network, as shown in Fig. 15, coupled with two inverters with pulse inputs at different operating frequencies: the first inverter operates at a low frequency and the second inverter operates at a high frequency. The bulk contacts of nMOSFETs are directly connected to P-substrate ports, and those of pMOSFETs are connected to P-substrate ports through a capacitor between the N-well and the P-substrate [27]. There are four other P-substrate ports connected to the ground, and the backplane of the substrate is also connected to the ground. RCL loads are added at the output of each inverter (not shown in Fig. 15).

Fig. 16. Transient waveform of Vout1 for the substrate coupling example.
The substrate is modeled as a network consisting of a three-dimensional (3-D) dense resistor mesh with multiple layers [32]. In Fig. 15, a one-layer resistor network is illustrated to model the substrate part among four inverter bulk contacts. Although simplified truncated substrate models have been proposed to capture the dominant coupling conductance [20], [27], they are likely to underestimate coupling effects in circuit systems designed to be noise immune [24]. Furthermore, the accuracy of simplified substrate models may not be sufficient. Therefore, accurate analysis of a circuit with a fully modeled substrate is desirable for high-fidelity circuit design and verification.

Fig. 16 shows the time-domain output waveform of the first inverter when the output signal is a digital "1" (the high voltage level). First, the result from SILCA matches that from SPICE3. Second, it can be seen that high-frequency feedthrough signals from the second inverter are present in Fig. 16. Such feedthrough is an important cause of first-pass design failures in deep submicron digital and analog circuit designs, and it is often not captured by simplified substrate analysis.

TABLE V SIMULATION RESULTS OF SUBSTRATE COUPLING EXAMPLES

Fig. 17. Runtime comparison of the substrate coupling example.

Fig. 18. Power/ground network coupling example.

Table V shows the statistics of running SILCA on a number of substrate coupling examples with varying substrate network complexity, compared to SPICE3. In our experiments, the number of layers and the number of resistors per layer are changed to vary the total number of circuit elements. A maximum 38.69× LU factorization cost saving and a 17.30× overall speedup (with about 35 000 elements) are achieved for this simple substrate coupling analysis example, and the cost of forward/backward substitution is increased to 2.5–2.75×.
No low-rank update technique is used for this example. The run time comparison is shown in Fig. 17.

Several observations are as follows.

1) The larger the circuit is (and therefore the larger the LU/FBS cost ratio), the more overall speedup can be achieved with SILCA. SILCA is very suitable for deep submicron VLSI circuits with strong parasitic coupling effects.

2) The device load cost with SILCA is decreased in proportion to the LU factorization cost saving. The reason is that in SILCA, device loads are only performed when circuit matrix elements need to be updated due to nonlinear devices and/or capacitors/inductors. For the substrate coupling examples, since most devices are resistors, their device loads are only performed when a new LU factorization is required.

3) The greater the savings on LU factorization, the more iterations are required, which means more cost for forward/backward substitution and device evaluation. Therefore, there exists a tradeoff between the cost of LU factorization and that of forward/backward substitution and device evaluation. The maximum overall speedup will approach the LU factorization speedup for large strongly coupled systems.

We also compare SILCA with a fast SPICE-like circuit simulator, HSIM 1.3 [36], and the results are also collected in Fig. 17. HSIM 1.3 uses the BE integration formula, a table lookup MOS level 2 model, and device bypass techniques. Further, HSIMSPEED = 1 is set in HSIM 1.3 so that the total number of simulated time points is close to that of SPICE3 and SILCA, to achieve the same accuracy. It can be seen from Fig. 17 that the larger the circuit is, the more speedup can be achieved with SILCA. Note that SILCA does not use table lookup MOSFET models.

Fig. 19. Transient waveform of Vout for the power/ground network example.

TABLE VI SIMULATION RESULTS OF POWER/GROUND NETWORK COUPLING EXAMPLES

TABLE VII SIMULATION RESULTS OF POWER/GROUND NETWORK COUPLING EXAMPLES WITH THE GMRES SOLVER (ε = 1e−8)

D. Coupled Circuit and Power/Ground Network Analysis

The fourth example is a power/ground network, as shown in Fig. 18. The power and ground supply networks are modeled as two RCL meshes (parasitic coupling capacitors are not shown in Fig. 18). Between these two layers is a 20-stage inverter chain, whose inverters are connected to different power/ground nodes. RCL loads are added to each inverter to model interconnect lines between stages. Fig. 19 shows the time-domain output waveform of the inverter chain when the output signal is a digital "1" (the high voltage level). The "1" signal is disturbed by the IR drop and L dI/dt effects of the power/ground network (Vdd is 3.3 V).

Table VI shows the simulation results with varied numbers of elements modeling the power/ground network. In our experiments, the sizes of the two RCL meshes are changed to vary the number of elements. We can see that SILCA achieves more speedup for larger circuits. The number of iterations increases to 3.5–4.2× with SILCA. It is worth noting that the maximum LU factorization cost saving and overall speedup reach 88.50× and 14.00× (with about 60 000 elements), respectively, with the rank-one update technique, compared with 19.82× and 8.86×, respectively, with only the SVC method.

For comparison purposes, we have implemented a coupled iterative/direct solver for nonlinear circuits with large-scale power/ground networks [17]. In this coupled solver, power/ground networks are formulated with a nodal analysis (NA) circuit matrix [19], which is symmetric positive definite, and solved by the conjugate gradient method with an incomplete Cholesky decomposition preconditioner [4].
Nonlinear circuits are formulated with an MNA circuit matrix and solved by the direct method based on LU factorization and Newton–Raphson iteration, as in SPICE. The iterative method and the direct method are coupled together by a Gauss–Seidel relaxation scheme [34]. Experimental results on the above power/ground coupling examples show that the coupled iterative/direct solver achieves a speedup over SPICE3 similar to that of SILCA. However, it should be noted that the coupled iterative/direct solver is efficient only if there exists a good partition with only a few boundary nodes between the linear and nonlinear circuit parts, and the coupling effects between those two parts are weak.

Very recently, we developed a new GMRES solver with an LU factorization preconditioning scheme [18] for time-domain simulation of nonlinear circuits with large-scale power/ground networks. The basic idea is to apply the same time step-size control scheme as that used in SILCA. Whenever the time step-sizes change abruptly, a new basis time step-size is chosen and a regular LU factorization is performed. If the time step-sizes change within the range 0.625 < α < 2.5, then rather than using the linear-centric analysis methods of SILCA, a GMRES solver is applied with the previously factorized L and U matrices as the preconditioner. Meanwhile, to make a fair comparison with SILCA, low-rank update is applied to the preconditioning L and U matrices whenever a nonlinear device switches its operating region. The GMRES solver is implemented following the left-preconditioned GMRES algorithm in [26]. The simulation results with the new GMRES solver (ε = 1e−8) are shown in Table VII. The average number of GMRES iterations ((#GMRES Iter)/(#GMRES)) with the LU factorization preconditioner is about 3–3.5 per GMRES solving process, which shows that the preconditioner is very efficient. Table VII also shows that the speedup over SPICE3 with the GMRES solver is less than that with SILCA.
The main reason is that the number of forward/backward substitutions with the GMRES solver (#Precond in Table VII) is generally larger than that with SILCA (#Iter in Table VI). Furthermore, the GMRES solving process incurs extra costs due to matrix–vector product operations. It can be expected that the simulation cost will increase if the error tolerance of the GMRES solver is made tighter. It should be noted that the number of nonlinear iterations (#Tran Iter) is less than that with SILCA, since no FLC integration scheme is required for capacitors/inductors. However, the number of nonlinear iterations is larger than that with SPICE3 due to the PWNL definition of MOSFETs.

VI. CONCLUSION

In this paper, a new nonlinear time-domain circuit simulation method called SILCA has been proposed for deep submicron VLSI circuit design and verification, which requires accurate modeling of parasitic coupling effects. New variable time step-size FLC numerical integration formulae are developed to ensure constant equivalent conductance for capacitor/inductor companion models. We have characterized the convergence and stability properties of the newly introduced integration formulae. As an alternative to the Newton–Raphson method, an SVC method is proposed for nonlinear circuit simulation, and the low-rank update technique has been implemented for efficient LU factorization. With these techniques, SILCA can dramatically reduce the number of costly LU factorizations for time-domain simulation. Experimental results on coupled circuit, substrate, and power/ground network analysis have demonstrated that SILCA can achieve SPICE-like accuracy with more than an order of magnitude speedup over SPICE3.
Future research includes handling of nonlinear capacitors, optimum PWNL model generation for nonlinear device models, exploiting incomplete LU preconditioners [26] for GMRES, and applications of SILCA to coupled electrical, electromagnetic, and thermal simulation.

APPENDIX
PROOF OF THEOREM 4 AND THEOREM 6

Proof: Applying the iterative trapezoid formula (8) to an RC test example, the iterative relationship can be derived (τ = RC), i.e.,

$$
\begin{aligned}
&\tau \dot{x}_n^{(k)} + x_n^{(k)} = 0 \\
&\frac{2\tau}{h}x_n^{(k)} - \frac{2\tau}{h}x_n^{(k-1)} + \frac{2\tau}{\alpha h}\left(x_n^{(k-1)} - x_{n-1}\right) - \tau\dot{x}_{n-1} + x_n^{(k)} = 0 \\
&\frac{2\tau}{h}x_n^{(k)} - \frac{2\tau}{h}x_n^{(k-1)} + \frac{2\tau}{\alpha h}\left(x_n^{(k-1)} - x_{n-1}\right) + x_{n-1} + x_n^{(k)} = 0 \\
&x_n^{(k)} = \frac{1 - \frac{1}{\alpha}}{1 - z}\,x_n^{(k-1)} + \frac{\frac{1}{\alpha} + z}{1 - z}\,x_{n-1}, \qquad z = -\frac{h}{2\tau}.
\end{aligned}
\tag{A.1}
$$

Since the proof concerns the stability property of the iterative trapezoid formula, the initial guess $x_n^{(0)}$ of $x_n$ is the solution of the previous time point, $x_n^{(0)} = x_{n-1}$. Then, the derivation can be carried out as

$$
\begin{aligned}
k = 1:\quad x_n^{(1)} &= \frac{1 - \frac{1}{\alpha}}{1 - z}\,x_{n-1} + \frac{\frac{1}{\alpha} + z}{1 - z}\,x_{n-1} = \frac{1 + z}{1 - z}\,x_{n-1} \\
k = 2:\quad x_n^{(2)} &= \frac{1 - \frac{1}{\alpha}}{1 - z}\,x_n^{(1)} + \frac{\frac{1}{\alpha} + z}{1 - z}\,x_{n-1}
= \left[\frac{1 - \frac{1}{\alpha}}{1 - z}\cdot\frac{1 + z}{1 - z} + \frac{\frac{1}{\alpha} + z}{1 - z}\right]x_{n-1} \\
&\;\;\vdots \\
k = m:\quad x_n^{(m)} &= \frac{1 - \frac{1}{\alpha}}{1 - z}\,x_n^{(m-1)} + \frac{\frac{1}{\alpha} + z}{1 - z}\,x_{n-1} \\
&= \left[\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m-1}\frac{1 + z}{1 - z} + \left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m-2}\frac{\frac{1}{\alpha} + z}{1 - z} + \cdots + \frac{\frac{1}{\alpha} + z}{1 - z}\right]x_{n-1}.
\end{aligned}
\tag{A.2}
$$

Noting that the terms in the square bracket of (A.2) form a geometric series except for the first term, it can be written further in the following format if $(1 - 1/\alpha)/(1 - z) \neq 1$, i.e.,

$$
x_n^{(m)} = \left[\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m-1}\frac{1 + z}{1 - z} + \frac{1 - \left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m-1}}{1 - \frac{1 - \frac{1}{\alpha}}{1 - z}}\cdot\frac{\frac{1}{\alpha} + z}{1 - z}\right]x_{n-1}.
\tag{A.3}
$$

According to (A.2), it is easy to check that the absolute stability condition cannot be satisfied if $(1 - 1/\alpha)/(1 - z) = 1$. Therefore, the absolute stability region of the iterative trapezoid formula is expressed by the inequality

$$
\left|\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m-1}\frac{1 + z}{1 - z} + \frac{1 - \left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m-1}}{1 - \frac{1 - \frac{1}{\alpha}}{1 - z}}\cdot\frac{\frac{1}{\alpha} + z}{1 - z}\right| < 1.
\tag{A.4}
$$

Finally, we have the result

$$
\left|\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m}\frac{2z}{z - \frac{1}{\alpha}} + \frac{\frac{1}{\alpha} + z}{\frac{1}{\alpha} - z}\right| < 1.
\tag{A.5}
$$

This completes the proof of Theorem 4.

If the mixed trapezoid FE formula is applied as an integration predictor when α < 1, the absolute stability region of the iterative trapezoid formula is derived as

$$
\begin{aligned}
x_n^{(0)} &= \frac{1 + (2\alpha - 1)z}{1 - z}\,x_{n-1} \\
k = 1:\quad x_n^{(1)} &= \frac{1 - \frac{1}{\alpha}}{1 - z}\,x_n^{(0)} + \frac{\frac{1}{\alpha} + z}{1 - z}\,x_{n-1} \\
&\;\;\vdots \\
k = m:\quad x_n^{(m)} &= \frac{1 - \frac{1}{\alpha}}{1 - z}\,x_n^{(m-1)} + \frac{\frac{1}{\alpha} + z}{1 - z}\,x_{n-1} \\
&= \left[\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m}\frac{1 + (2\alpha - 1)z}{1 - z} + \left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m-1}\frac{\frac{1}{\alpha} + z}{1 - z} + \cdots + \frac{\frac{1}{\alpha} + z}{1 - z}\right]x_{n-1}.
\end{aligned}
\tag{A.6}
$$

Therefore, the absolute stability region of the iterative trapezoid formula is expressed by the inequality

$$
\left|\left(\frac{1 - \frac{1}{\alpha}}{1 - z}\right)^{m+1}\frac{2\alpha z^{2}}{z - \frac{1}{\alpha}} + \frac{\frac{1}{\alpha} + z}{\frac{1}{\alpha} - z}\right| < 1.
\tag{A.7}
$$

This completes the proof of Theorem 6.

ACKNOWLEDGMENT

The authors would like to thank Prof. K. Mayaram of Oregon State University and Dr. J. Rockway of the Naval Space and Warfare System Center, San Diego, for several helpful discussions. The authors are also grateful to the anonymous reviewers for their detailed and constructive comments that greatly enhanced this paper.

REFERENCES

[1] E. Acar, F. Dartu, and L. T. Pileggi, “TETA: Transistor-level waveform evaluation for timing analysis,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 21, no. 5, pp. 605–616, May 2002.
[2] U. M. Ascher and L. R. Petzold, Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. Philadelphia, PA: SIAM, 1998.
[3] H. W. Buurman, “From circuit to signal-development of a piecewise linear simulator,” Ph.D. dissertation, Dept. Elect. Eng., Eindhoven Univ. Technol., Eindhoven, The Netherlands, Jan. 1993.
[4] T. Chen and C. C.-P.
Chen, “Efficient large-scale power grid analysis based on preconditioned Krylov-subspace iterative methods,” in Proc. IEEE/ACM Design Automation Conf., Las Vegas, NV, Jun. 2001, pp. 559–562.
[5] P. F. Cox, R. G. Burch, P. Yang, and D. E. Hocevar, “New implicit integration method for efficient latency exploration in circuit simulation,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 8, no. 10, pp. 1051–1064, Oct. 1989.
[6] J. E. Dennis and J. J. Moré, “Quasi-Newton methods, motivation and theory,” SIAM Rev., vol. 19, no. 1, pp. 46–89, Jan. 1977.
[7] T. Fujisawa, E. S. Kuh, and T. Ohtsuki, “A sparse matrix method for analysis of piecewise-linear resistive networks,” IEEE Trans. Circuit Theory, vol. CT-19, no. 6, pp. 571–584, Nov. 1972.
[8] K. Gala, V. Zolotov, R. Panda, B. Young, J. Wang, and D. Blaauw, “On-chip inductance modeling and analysis,” in Proc. IEEE/ACM Design Automation Conf., Los Angeles, CA, Jun. 2000, pp. 63–68.
[9] C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations. Upper Saddle River, NJ: Prentice-Hall, 1971.
[10] K. R. Jackson and R. Sacks-Davis, “An alternative implementation of variable step-size multistep formulas for stiff ODEs,” ACM Trans. Math. Softw., vol. 6, no. 3, pp. 295–318, Sep. 1980.
[11] S. Kapur and D. E. Long, “Large-scale capacitance calculation,” in Proc. IEEE/ACM Design Automation Conf., Los Angeles, CA, Jun. 2000, pp. 744–749.
[12] K. S. Kundert and A. Sangiovanni-Vincentelli, Sparse User’s Guide: A Sparse Linear Equation Solver, Version 1.3a. Berkeley: Univ. California, Apr. 1988.
[13] P. M. Lee, S. Ito, T. Hashimoto, J. Sato, T. Touma, and G. Yokomizo, “A parallel and accelerated circuit simulator with precise accuracy,” in Proc. Int. Conf. VLSI Design, Bangalore, India, Jan. 2002, pp. 213–218.
[14] E. Lelarasmee and A. Sangiovanni-Vincentelli, “RELAX: A new circuit simulator for large scale MOS integrated circuits,” in Proc. IEEE/ACM Design Automation Conf., Las Vegas, NV, Jun. 1982, pp. 682–690.
[15] P. Li and L. Pileggi, “A linear-centric modeling approach to harmonic balance analysis,” in Proc. IEEE/ACM Design, Automation and Test Eur. Conf., Paris, France, Mar. 2002, pp. 634–639.
[16] Z. Li and C.-J. R. Shi, “SILCA: Fast-yet-accurate time-domain simulation of VLSI circuits with strong parasitic coupling effects,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, Nov. 2003, pp. 793–799.
[17] Z. Li and C.-J. R. Shi, “A coupled iterative/direct method for efficient time-domain simulation of nonlinear circuits with power/ground networks,” in Proc. IEEE Int. Symp. Circuits and Systems, Vancouver, Canada, May 2004, vol. 5, pp. 165–168.
[18] Z. Li and C.-J. R. Shi, “An efficiently preconditioned GMRES method for fast parasitic-sensitive deep-submicron VLSI circuit simulation,” in Proc. IEEE/ACM Design, Automation and Test Eur. Conf., Munich, Germany, Mar. 2005, vol. 2, pp. 752–757.
[19] W. J. McCalla, Fundamentals of Computer-Aided Circuit Simulation. Boston, MA: Kluwer, 1988.
[20] M. Nagata, J. Nagai, K. Hijikata, T. Morie, and A. Iwata, “Physical design guides for substrate noise reduction in CMOS digital circuits,” IEEE J. Solid-State Circuits, vol. 36, no. 3, pp. 539–549, Mar. 2001.
[21] L. W. Nagel, “SPICE: A computer program to simulate semiconductor circuits,” Univ. California, Berkeley, Tech. Rep. UCB/ERL M520, May 1975.
[22] A. Odabasioglu, M. Celik, and L. T. Pileggi, “PRIMA: Passive reduced-order interconnect macromodeling algorithm,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 17, no. 8, pp. 645–654, Aug. 1998.
[23] L. R. Petzold, “A description of DASSL: A differential/algebraic system solver,” in IMACS Transactions on Scientific Computing, vol. 1, R. Stepleman et al., Eds. Amsterdam, The Netherlands: North-Holland, 1983, pp. 65–68.
[24] J. R. Phillips and L. M. Silveira, “Simulation approaches for strongly coupled interconnect systems,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, Nov. 2001, pp. 430–437.
[25] K. Radhakrishnan and A. C. Hindmarsh, “Description and use of LSODE, the Livermore solver for ordinary differential equations,” Lawrence Livermore Nat. Lab., Livermore, CA, LLNL Tech. Rep. UCRL-ID-113855, 1993.
[26] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, PA: SIAM, 2003.
[27] A. Samavedam, A. Sadate, K. Mayaram, and T. S. Fiez, “A scalable substrate noise coupling model for design of mixed-signal IC’s,” IEEE J. Solid-State Circuits, vol. 35, no. 6, pp. 895–904, Jun. 2000.
[28] B. N. Sheehan, “TICER: Realizable reduction of extracted RC circuits,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, Nov. 1999, pp. 200–203.
[29] A. H. Sherman, “On Newton-iterative methods for the solution of systems of nonlinear equations,” SIAM J. Numer. Anal., vol. 15, no. 4, pp. 755–771, 1978.
[30] H. Su, K. H. Gala, and S. S. Sapatnekar, “Fast analysis and optimization of power/ground networks,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, Nov. 2000, pp. 477–480.
[31] J. T. J. van Eijndhoven and M. T. van Stiphout, “Latency exploitation in circuit simulation by sparse matrix techniques,” in Proc. IEEE Int. Symp. Circuits and Systems, Espoo, Finland, Jun. 1988, pp. 623–626.
[32] N. K. Verghese, T. J. Schmerbeck, and D. J. Allstot, Simulation Techniques and Solutions for Mixed-Signal Coupling in Integrated Circuits. Norwell, MA: Kluwer, 1995.
[33] Y. Wang, V. Jandhyala, and C.-J. R. Shi, “Coupled electromagnetic-circuit simulation of arbitrarily-shaped conducting structures,” in Proc. IEEE Conf. Electrical Performance Electronic Packaging, Cambridge, MA, Oct. 2001, pp. 233–236.
[34] J. K. White and A. Sangiovanni-Vincentelli, Relaxation Techniques for the Simulation of VLSI Circuits. Norwell, MA: Kluwer, 1987.
[35] M. Zwoliński and R. W. Allen, “Practical algorithms for fully decoupled mixed-mode simulation of electronic circuits,” in Proc. IEEE Int. Symp. Circuits and Systems, Sydney, Australia, May 2001, pp. 451–454.
[36] User Guide: HSIM Version 1.3, Nassda Corp., Santa Clara, CA, Apr. 2001.

Zhao Li received the B.S. degree in electronics and the M.S. degree in microelectronics and solid-state electronics from Tsinghua University, Beijing, China, in 1998 and 2000, respectively, and the Ph.D. degree in electrical engineering from the University of Washington, Seattle, in 2005. He is currently with Cadence Design Systems, Inc., San Jose, CA. His research interests include mixed-signal and deep submicron circuit simulation, symbolic analysis, behavioral modeling for analog/RF circuit application, device modeling, and optimization algorithms.

C.-J. Richard Shi (M’91–SM’99–F’06) received the Ph.D. degree in computer science from the University of Waterloo, Waterloo, ON, Canada, in 1994. From 1994 to 1998, he was with Analogy, Rockwell Semiconductor Systems, and the University of Iowa. In 1998, he joined the University of Washington, Seattle, WA, where he is currently a Professor in electrical engineering. His research interests include several aspects of the computer-aided design and test of integrated circuits and systems, with particular emphasis on analog/mixed-signal and deep submicron circuit modeling, simulation, and design automation. He is a key contributor to the IEEE Std. 1076.1-1999 (VHDL-AMS) standard for the description and simulation of mixed-signal circuits and systems. He founded the IEEE International Workshop on Behavioral Modeling and Simulation (BMAS) in 1997. Dr. Shi was an Associate Editor, as well as a Guest Editor, of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING. Since 1999, he has been the Associate Editor of the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS.
He has received several awards for his research including a Doctoral Prize from the Natural Sciences and Engineering Research Council of Canada (1995), a Best Paper Award from the 1998 IEEE VLSI Test Symposium, a Best Paper Award from the 1999 IEEE/ACM Design Automation Conference, a National Science Foundation CAREER Award (2000), and an SRC-TECHCON Best Paper Award (2003).