advertisement

9th International Symposium on Quality Electronic Design Accelerating Clock Mesh Simulation Using Matrix-Level Macromodels and Dynamic Time Step Rounding Xiaoji Ye Department of ECE Texas A&M University Rajendran Panda Min Zhao Peng Li, Jiang Hu Freescale Semiconductor, Inc. Austin, TX Magma Design Automation, Inc. [email protected] Department of ECE Texas A&M University [email protected], [email protected] ABSTRACT … … Clock drivers Clock sinks/FFs Figure 1: Clock distribution network using mesh structures. SPICE simulation to these massive networks, existing model order reduction algorithms [3, 4, 5, 6] are only applicable to passive networks with limited number of I/O ports. The efficiency of the standard model order reduction algorithms degrades quickly with the number of I/O ports increases, which is the case for the linear subsystem of a clock mesh. Several research works have been proposed to solve this massive network simulation problem from different directions [7, 8, 9, 10, 11]. In [7, 8, 9], techniques have been proposed to reduce the complexity of the standard model order reduction by means of port compaction and merging. In [10], a sliding window approach is proposed to analyze the clock mesh by a heuristic based divide-and-conquer. More recently, in [11], a harmonic-weighted model order reduction algorithm which utilizes the specific design knowledge of the clock mesh network is presented to further improve the efficiency of the projection-based model order reduction algorithm when applied to mesh structures with a large number of I/Os. The approach in [11] tackles the overall simulation task by further decomposing the complete network into smaller subproblems via a port sliding scheme, where a reduced order model is generated for computing the voltage response at each port of the passive subnetwork one at a time. In this work, we address the analysis of large clock meshes from a simulation perspective. As illustrated in Fig. 2, our new simulation approach breaks the clock mesh into linear and nonlinear parts. At each transient analysis time step, the linear subsystem is modeled as a compact multiport (real) admittance macromodel whose I/V transfer characteristic is identical to that of the linear subsystem. The size of the macromodel is merely determined by the number of INTRODUCTION Due to the inherent redundancy introduced by non-tree topology, clock meshes have excellent performance (e.g. low clock skews) and immunity to PVT (process, voltage and temperature) variations [1, 2]. However, the high complexity of these clock networks, contributed by the large mesh structure and their tightly coupled interactions with a large number of clock drivers, presents a very difficult circuit analysis problem. A typical topology of mesh-based clock distribution networks is shown in Fig. 1. The top-level clock distribution is routed through a tree and this tree drives a large mesh spanning the whole chip. The mesh is driven by a large number of mesh drivers at the leaves of the tree and distributes clock inputs to many bottom-level clock drivers or flip-flops. In industrial chip designs, an accurate mesh circuit model may consist of millions of circuit unknowns and a few hundred nonlinear clock drivers. While it is practically very challenging to apply standard ∗ This work was supported in part by SRC under contract 2006-TJ-1416. 0-7695-3117-2/08 $25.00 © 2008 IEEE DOI 10.1109/ISQED.2008.18 … 1. … Clock meshes have found increasingly wide applications in today’s high-performance IC designs. The inherent routing redundancies associated with clock meshes lead to improved clock skews and reliability. However, the high complexity of clock meshes in modern chip designs has made its verification very challenging. A typical clock distribution network may consist of millions of coupled/interconnected linear elements and hundreds of nonlinear clock drivers attached at different locations on the mesh. Such a large network is often too complex for feasible SPICE-like simulation. In this paper, we present a new simulation methodology which decomposes a clock mesh into linear and nonlinear parts. By exploiting the special matrix property of the linear subsystem resulting from modified nodal analysis (MNA) formulation, the linear subsystem is represented as a matrix-level macromodel, which greatly simplifies the overall simulation task. These macromodels can be efficiently computed using Cholesky factorization and significantly speedup the nonlinear Newton-Raphson iterations used in the transient simulation for the complete clock mesh. Furthermore, a dynamic time step rounding technique is proposed to limit the number of passive macromodels needed in the entire transient simulation which further improves the efficiency of the proposed approach. 627 where G ∈ Rn×n is the conductance matrix from resistive elements; C ∈ Rn×n is the capacitance matrix formed by the capacitive and inductive elements; x(t) ∈ Rn is the timevarying vector of node voltages, inductor and voltage source branch currents; L ∈ Rn×m is the output matrix; b(t) ∈ Rn is the vector of independent time-varying input excitations; y(t) ∈ Rm is the vector of outputs. If Backward Euler (BE) method with a small time step h is used to discretize the above differential equation in time, (1) can be converted into an algebraic equation as ports in the linear part of the clock mesh. When formulated properly, the linear MNA equations corresponding to the linear subsystem have a special matrix structure, i.e., it is SPD (symmetric positive definite). As a result, the macromodel can be computed efficiently using fast Cholesky factorization [12]. This is not possible if the linear and nonlinear portions of the mesh are solved together, which typically leads to asymmetric matrix structures. In practice, a typical Cholesky solver is at least a few times faster than a general LU solver. Once computed, this macromodel is combined with nonlinear clock drivers forming a much more compact set of nonlinear circuit equations. Then, nonlinear N-R iterations are applied to this much reduced set of equations, but not the original equations, resulting in good runtime speedups. Hence, separating the linear part from the nonlinear part allows us to use fast matrix solvers to tackle the large linear subsystem which dominates the complexity of the clock mesh. „ Macromodel « · x(t) = b(t) + C · x(t − h). h (2) 2.2 Macromodel generation N on lin ear p art The desired macromodel of the linear subnetwork for a given time point (or time step h) is shown in Fig. 3 [12]. The model describes the input-and-output characteristics of the linear network at the ports Figure 2: Macromodel based simulation. While the above approach can speedup the transient analysis at each time step, the linear subnetwork macromodel needs to be updated whenever the time step changes. This is indeed the case if a dynamic time step control algorithm, e.g. by using LTE (local truncation error) control [13], is adopted. To limit the total number of large Cholesky factorizations needed in the entire transient simulation, we further propose a dynamic time step rounding scheme in which linear macromodels are pre-computed at a discrete set of time points. During the transient analysis, time steps generated by the dynamic time step control algorithm are rounded to the above discrete set such that no new Cholesky factorizations are needed. These discrete time points are properly spaced to have the best tradeoff between the number of Cholesky factorizations needed and the amount of time step reduction due to rounding. We demonstrate excellent performance of the proposed techniques through extensive experiments. 2. C h Care can be taken to form matrices G and C in a particular way. Since (2) is used to describe the input/output characteristics of the linear subnetwork at the ports, there exists freedom to choose what to be the inputs and outputs. Here the inputs b(t) are chosen to be the port currents and the outputs y(t) are chosen to be the port voltages, respectively. Instead of using self and partial inductances, we further use susceptances (reluctances) to model inductive effects. Under these choices, it can be shown that (G + C/h) is symmetric and positive definite (SPD) [14, 15]. The symmetric and positive definiteness of the matrix allows us to use efficient Cholesky factorization to factorize the linear system. The same matrix property is exploited for DC analysis of linear power/ground distribution networks in [12]. L in ear p art N on lin ear p art G+ I =A·V +S (3) where A ∈ Rm×m is the port admittance matrix; V ∈ Rm is the vector of port voltages; I ∈ Rm is the vector of port currents; S ∈ Rm is the vector of current sources connected between each port and ground; m is the number of ports of the linear part. Vector S essentially moves all the known internal current sources to the ports. These current sources are due to the application of numerical integration in transient analysis as in (2). I Macromodel (A) MATRIX-LEVEL MACROMODELING 2.1 MNA formulation for linear subnetwork To compute a macromodel for the linear subnetwork, let us consider the circuit equations corresponding to the linear portion of the clock mesh. In the modified nodal analysis (MNA) equations, the linear part can be described by the following equation: S Figure 3: Macromodel of the linear subnetwork. Note that we have purposely chosen the macromodel to be an admittance type model for ease of integration with nonlinear circuit elements. This admittance representation is obtained by solving the impedance like circuit formation in 0 G · x(t) + C · x (t) = b(t) y(t) = LT x(t) (1) 628 and, (2), which is chosen to guarantee the symmetric and positive definiteness of the matrix. Without loss of generality, in the following we assume the linear subnetwork is modeled as an RC network for simplicity of presentation. As such, (2) can be further simplified into the form G̃ · U = J, G̃ ∈ Rn×n , U ∈ Rn , J ∈ Rn S (4) = GT12 G−1 11 J1 − J2 “ ”−1 T = L21 L11 LT11 L−1 11 J1 − J2 = L21 LT11 J1 − J2 . (14) where n is the number of nodes; G̃ = G + C/h; J = b(t) + C/h · x(t − h); U is the vector of node voltages. (4) can be split into the form » –» – » – G11 G12 U1 J1 = (5) T V J2 + I G12 G22 Compared with (9), the use of (13) greatly reduces the computation cost since L22 normally is much smaller in size than G11 and G12 . Also, since L11 , L21 and L22 are triangular matrices, both A and S can be very efficiently computed using these matrices. where U1 is the vector of internal node voltages; V is the vector of port voltages; J1 and J2 are the vectors of current sources resulted from the application of numerical integration (e.g. equivalent current sources of the capacitance companion models) connected at the internal nodes and the ports, respectively; I is the vector of port current sources, which are treated as the inputs to the linear subnetwork; G12 is the portion of the admittance matrix that links the internal nodes and the ports; G11 contains the admittances between the internal nodes; G22 contains the admittances matrix between the ports. The first set of the equations in (5) can be rewritten as 3. NONLINEAR CLOCK MESH SIMULATION USING MACROMODELS U1 = G−1 11 (J1 − G12 V ) . We now present how these linear subnetwork macromodels can be utilized to speedup the nonlinear transient analysis of the complete clock mesh. The N-R iteration used at each time point in nonlinear transient simulation is given by Substituting (6) into (7) leads to ” “ ” “ T −1 I = G22 − GT12 G−1 11 G12 V + G12 G11 J1 − J2 . (7) (8) Comparing (8) with the desired macromodel in (3), we can get: (9) S = GT12 G−1 11 J1 − J2 (10) Although the macromodel can be built by directly computing (9) and (10), this turns out not to be the best approach. If we use the fact that the coefficient matrix G̃ is symmetric and positive definite, we can use Cholesky factorization to simplify the computation for A and S . The Cholesky factorization of G̃ is given as: T » G11 GT12 G12 G22 G̃ = L · L – » –» T – L11 L11 LT21 0 = L21 L22 0 LT22 – » T T L11 L21 L11 L11 = T T T L21 L11 L21 L21 + L22 L22 A = = E Vp -S + Ip = F B Ue x t Ie x t /Ve x t (12) Figure 4: Matrix representation of the reduced N-R iteration. The matrix form of the reduced N-R iteration using A and S is illustrated in Fig. 4. In Fig. 4, A is stamped into the Jacobian matrix; E and F contain entries corresponding to the elements which are connected between the port nodes G22 − GT12 G−1 11 G12 “ ”−1 L21 LT21 + L22 LT22 − L21 LT11 L11 LT11 L11 LT21 L22 LT22 , (16) (11) where L11 , L21 , L22 are the submatrices of the Cholesky factor L of G̃, they have the same dimensions as G11 , GT12 and G22 , respectively. Substituting (12) into (9) and (10), we obtain A = Jk ∆vk = −F(vk ) where F(v) = 0 is the set of nonlinear algebraic equations formed at the time point; J is the Jacobian matrix of F; v k and vk+1 are the solution guesses at the k-th and k + 1-th iterations, respectively; At each time point of the transient simulation, contributions from the linear circuit elements and linearized nonlinear elements are stamped into J, equivalent current sources are stamped into F(vk ).1 If a multi-port macromodel is computed for the large linear subnetwork, the entire linear part of the mesh is replaced by the macromodel in the N-R iteration. Matrix A and vector S instead of individual linear circuit elements are stamped into the above N-R iteration. This results in a somewhat denser but much smaller set of equations. The detailed simulation flow is shown in Algorithm 1. It is worth noting that the simulation result from Algorithm 1 is exact compared with standard SPICE simulation. This is because the macromodel exactly describes the I/V characteristic at the ports of the linear part. (6) A = G22 − GT12 G−1 11 G12 (15) or The second set of the equations in (5) can be rewritten as I = GT12 U1 + G22 V − J2 Jk vk+1 = Jk vk − F(vk ) 1 Readers may refer to [13] for more detailed procedure of how different circuit elements are stamped into the N-R iteration formulation. (13) 629 Algorithm 1 Transient simulation flow using macromodel for instance, as a result of dynamic time step control, the macromodel needs to be recomputed. As such, potentially a large number of Cholesky factorizations are needed during the entire nonlinear transient analysis. To limit the number of Cholesky factorizations, we propose a dynamic time step rounding scheme. Before we describe this scheme, we briefly review the LTE based dynamic time step control. Before transient simulation: 1: Stamp in all the linear circuit elements of the linear subnetwork through a proper MNA formulation. 2: Factorize the SPD matrix G̃ using Cholesky factorization, save the submatrices L11 , L21 , L22 of the Cholesky factor L. 3: Compute A matrix by using (13). Transient simulation starts: 4: while transient simulation is not over do 5: Recover all the internal node voltages U1 of the linear subnetwork. 6: Obtain J1 and J2 from U1 . 7: Compute S by using (14). 8: Stamp in A and S into (15) or (16). 9: Stamp in nonlinear mesh drivers and external voltage/current sources into (15) or (16). 10: while not converge do 11: Update the entries of the nonlinear mesh drivers in the Jacobian matrix, update the right-hand-side vector. 12: Solve the system by LU factorization. 13: Update the solution guess: vk+1 = vk + ∆vk . 14: end while 15: Advance the time. 16: end while 4.1 Dynamic time step control by using local truncation error estimation Dynamic time step control is adopted by most SPICE simulators today to enhance the simulation efficiency and accuracy. Local truncation error (LTE) estimation is typically used with a numerical method such as Backward Euler or Trapezoidal method to control the time step [13]. For Backward Euler, which is what we use in this work, local truncation error estimates for capacitances and inductances are given by „ « 1 ∆tn + C ≤ − [ic (t− (18) n+1 ) − ic (tn )] 2 C and L ≤ − and the circuit nodes in the nonlinear portion of the clock mesh; B contains entries contributed by the nonlinear subnetwork; VP is the vector of port voltages, which is the same as vector V in (5); Uext is the vector of the circuit unknowns of the nonlinear subnetwork. On the right hand side, S is stamped as current sources connected at the ports; IP are equivalent current sources contributed by the nonlinear elements at the ports; Iext /Uext are current sources/voltage sources inside the nonlinear subnetwork. „ ∆tn L « + [vL (t− n+1 ) − vL (tn )] 4.2 Dynamic time step rounding One important step of our new simulation flow is to update the internal node voltages of the linear part of the clock mesh (step 5 of Algorithm 1). Since the macromodel is a multi-port linear admittance model, only the node voltages at its ports are solved in the N-R iteration. But vector S of the macromodel depends on the current sources connected at the internal nodes of the linear part according to equation (14). Those current sources are equivalent current sources in the companion models of capacitances and susceptances (or reluctances). In order to compute the values of those equivalent current sources, internal node voltages of the linear part need to be recovered at each time point. Although we can use (6) to solve the internal node voltage vector U1 directly, it is not the most efficient way to do so. Since the Cholesky facor L11 of G11 is a submatrix of the Cholesky factor L of G̃, L11 can be ”extracted” from L during the macromodel generation step. So U1 can be rather efficiently solved by Forward and Backward substitutions: As pointed out before, the linear subnetwork macromodel needs to be recomputed whenever the time step changes. Therefore, it would be inefficient if we allow arbitrary time step change during the transient analysis. Our basic idea is to constrain the time step to a properly chosen discrete set of time points, as described below. First, the range of time steps generated by the dynamic time step control algorithm is estimated: Hint = [hmin , hmax ]. This can be done by monitoring the transient simulation of certain test circuits and the accuracy of this estimation is not very critical. Then, this range is divided using a set of unevenly distributed discrete time points {hmin , h1 , h2 , · · · , hmax }. A linear macromodel is pre-computed for each time point hi in the above set. One good way is to divide the whole time step range using a set of geometrically spaced time steps as shown in Fig. 5. Here, the minimum time step is hmin , and the following time steps are 2hmin , 4hmin and so on until the maximum time step hmax is reached. The total number of discrete time steps is (approximately) given by (17) 1 + dlog2 (hmax /hmin )e 4. (19) where C and L are local truncation errors; ∆tn is the time step between tn and tn+1 ; C and L are values of the capac− itor and inductor; ic (t+ n ) and ic (tn+1 ) are the currents of C − + at tn and tn+1 ; vL (tn ) and vL (tn+1 ) are the voltages of L at tn and tn+1 . By setting a user-defined error limit, we can use (18) and (19) to calculate the maximum tolerable value of the next time step for every energy storage element. The actual next time step for the complete circuit is chosen as the minimum of these maximum tolerable time step values. 3.1 Update of internal node voltages in the linear subnetwork U1 = (LT11 )−1 L−1 11 (J1 − G12 V ) . 1 2 DYNAMIC TIME STEP ROUNDING (20) In this case, only about 10 discrete time points are needed to cover a 1000X span of time steps. During the transient simulation, if the predicted next time step calculated by LTE falls between two adjacent time steps in our predefined set, we round the next time step to its nearest smaller value in the predefined set such that the The macromodeling based approach presented in the previous section can be employed to speedup the nonlinear transient analysis at each individual time point. It shall be noted that the macromodel of the linear subnetwork depends on the choice of the time step (h). Hence, whenever h changes, 630 Act ual t i mest ep Roundi ng Ti mest ep pr edi ct ed byLTE based simulation approach. 5.2 Dynamic time step rounding …… hmin 2hmin 4hmin 8hmin We now compare the macromodel based simulation approach with SPICE more realistically when dynamic time step control is used on the same set of circuits. For the macromodel based method, the predefined set of time steps for the dynamic time step rounding technique covers the interval of [1e − 12, 1e − 10]. So there are 1 + dlog2 100e = 8 predefined time steps. The comparison results are shown in Table 2. We can see from Table 2 that the average runtime speedup compared with SPICE with dynamic time step control is smaller than the average runtime speedup in Table 1. This is because dynamic time step rounding always rounds down the actual time step, so macromodel based simulation always uses a time step no more than what is predicted by LTE in SPICE. Despite the fact the macromodel based simulation uses smaller time steps, it is still considerably faster than SPICE with dynamic time step control. Fig. 6 illustrates how dynamic time step rounding works during the macromodel based simulation for a test circuit. We show the predicted time steps by LTE and the actual rounded time steps for 20 consecutive time points. Dynamic time step rounding always rounds down the actual time step so that the local truncation error is guaranteed to be smaller than the user-defined error limit. And since we have 1 + dhmax /hmin e time steps in the predefined set, the time step reduction by rounding is no more than 2x. hmax Figure 5: Dynamic time step rounding. corresponding pre-computed Cholesky factorization can be utilized. In this way, the LTE constraint is always satisfied. Two main advantages brought by the dynamic time step rounding technique are: first, the number of Cholesky factorizations needed is well bounded. For most cases, 10 factorizations are more than adequate; second, using a set of geometrically distributed time point set will bring no more than 2X reduction to the time step of what is predicted by LTE. 5. EXPERIMENTAL RESULTS We demonstrate our macromodel based simulation algorithm and dynamic time step rounding technique on a set of clock meshes with different sizes and number of mesh drivers. Comparisons are made between our approach and the standard SPICE simulation with fixed/dynamic time step control. Several numerical packages are used in the implementation: Sparse 1.3 solver package [16] is used for the LU factorization in the N-R iterations; Cholmod solver package [17] is used for the Cholesky factorization during the macromodeling generation step. We also compare the runtime of our proposed algorithm with the Harmonic-weighted MOR and port sliding schemes in [11]. Algorithms are implemented in C++. All the experiments are conducted on a Linux server with 3GHz CPU and 4GB memory. −11 3.5 x 10 3 Predicted time steps Actual time steps t(s) 2.5 5.1 Macromodel based simulation with fixed time step 2 1.5 1 0.5 First, we compare the proposed macromodel based simulation using fixed time step with our implementation of SPICE-like simulation with the same fixed time step. In this case, we only need to compute one macromodel for the linear subnetwork. Table 1 shows the detailed results. We have tested our proposed approach on six clock meshes. For each circuit, we first compare the runtime between one LU factorization and one Cholesky factorization for matrix G̃. This is a solver level comparison and gives us an understanding how much benefit we can get merely from Cholesky factorization. Then, the runtime for one full N-R iteration in SPICE and the runtime for one reduced N-R iteration using macromodel are compared. Since the problem size is greatly reduced by the macromodel in the reduced N-R iteration, we can achieve 2 orders of magnitude speedup in each reduced N-R iteration. Since the internal node voltages of the linear part need to be updated at each time point in our proposed approach, we also include the runtime for updating U1 in Table 1. It is interesting to see that even for relatively large circuits, recovering the node voltages for the entire linear part can be done rather quickly. This is due to the fact that we use triangular matrices to compute U1 in (17), which is computationally efficient. Finally, the overall runtime for the SPICE simulation and macromodel based approach are compared. We can see from the last column of Table 1 that 10 to 45 times speedup can be achieved by the macromodel 0 0 5 10 15 Index of time steps 20 Figure 6: Dynamic time step rounding for 20 consecutive time steps. To demonstrate how much benefit dynamic time step rounding can bring to the macromodel based simulation approach, we also compare the runtime between macromodel based simulation with fixed time steps and dynamic time step rounding in Table. 3. The circuit examples are the same as shown in Table. 1 and 2. For the simulation with fixed time step, we use 1e − 12s as the time step, which is also hmin for the dynamic time step rounding. We can see that dynamic time step rounding can bring average 1.5 to 2 times speedup to the macromodel based simulation approach. To compare with the approach in [11], we test our macromodel based approach with fixed time step and the serial version of the harmonic-weighted MOR with driver merging scheme on circuit 4 and 5. The runtimes of harmonicweighted MOR with driver merging for these two circuits are 870.23s and 1836.48s while runtimes of macromodel based simulation are 201.75s and 973.98s, respectively. In harmonicweighted MOR with driver merging approach, the driving point waveform of each clock driver is obtained individu- 631 Table 1: Runtime(s) comparison between macromodel based simulation and SPICE with fixed time step Ckt # nodes # elements # drivers ckt1 ckt2 ckt3 ckt4 ckt5 ckt6 400 1000 2500 6500 40k 70k 1200 3000 7500 20k 120k 200k 2 6 20 30 100 150 One LU(s) 0.00599 0.027 0.363 3.105 124.402 139.15 One Chol.(s) 0.002 0.004 0.115 0.387 25.514 34.15 One full N-R iter.(s) 0.026 0.115 1.335 9.540 150.246 185.16 One Redu. N-R iter.(s) 0.0009 0.002 0.0099 0.038 0.564 0.915 One Update of U1 0.0009 0.0009 0.016 0.064 1.118 1.815 Runtime SPICE(s) 24.779 109.53 1220.63 2h31min 6h15min 9h21min Runtime Macro.(s) 2.335 6.843 50.71 201.75 973.98 1247.93 Speedup 10.6 16.0 24.07 44.75 23.11 26.97 Table 2: Runtime(s) comparison between macromodel based simulation and SPICE with dynamic time steps Ckt # nodes # elements # drivers ckt1 ckt2 ckt3 ckt4 ckt5 ckt6 400 1000 2500 6500 40k 70k 1200 3000 7500 20k 120k 200k 2 6 20 30 100 150 SPICE w. dynamic time step control 8.736 63.885 727.24 1h18min 4h21min 7h30min ckt1 ckt2 ckt3 ckt4 ckt5 ckt6 Macro. w. fixed time steps 2.335 6.843 50.71 201.75 973.98 1247.93 Macro. w. dynamic time step rounding 1.008 4.654 34.18 114.84 778.16 1149.91 Speedup 2.32 1.47 1.48 1.76 1.25 1.09 ally. Therefore, the runtime increases proportionally with the number of drivers. But in macromodel based simulation approach, all the node voltage waveforms are computed in one simulation. This is one of the reasons why it is faster than harmonic-weighted MOR with driver merging for these test circuits. On the other hand, the approach in [11] is parallelizable since the coupled analysis problem is decomposed into independent small pieces. Furthermore, the use of dynamic time step rounding requires pre-computation of multiple large Cholesky factorizations and these Cholesky factorizations must be loaded to main memory or stored on disks during the simulation. Considering all these factors, we expect the approach in [11] to become more attractive for larger clock mesh designs. Accuracy wise, there is no approximation during the macromodel generation in this work while approximation does exist in the harmonic-weighted MOR approach [11]. 6. CONCLUSION In this paper, we propose to use the matrix-level macromodel based simulation approach to accelerate the verification of clock mesh. The difficulty of solving extremely large size N-R iteration problem has been significantly relaxed by the use of the macromodel. Also, dynamic time step rounding technique is jointly applied with the macromodel to achieve even better runtime performance. 7. Speedup 8.67 13.73 21.28 40.66 20.15 23.48 [2] P. J. Restle et al. The clock distribution of the power4 microprocessor. In IEEE ISSCC, pages 144–145, February 2002. [3] L. Pillage and R. Rohrer. Asymptotic waveform evaluation for timing analysis. IEEE Trans. Computer-Aided Design, 9:352–366, April 1990. [4] P. Feldmann and R. Freund. Efficient linear circuit analysis by padé approximation via the lanczos process. IEEE Trans. Computer-Aided Design, 14:639–649, May 1995. [5] L. Silveira, M. Kamon, and J. White. Efficient reduced-order modeling of frequency-dependent coupling inductances associated with 3-d interconnect structures. In Proc. IEEE/ACM Design Automation Conf., June 1995. [6] A. Odabasioglu, M. Celik, and L. Pileggi. Prima: passive reduced-order interconnect macromodeling algorithm. IEEE Trans. Computer-Aided Design, 17(8):645–654, August 1998. [7] P. Feldmann and F. Liu. Sparse and efficient reduced order modeling of linear subcircuits with large number of terminals. In IEEE/ACM Intl. Conf. on CAD, November 2004. [8] P. Liu et al. An efficient method for terminal reduction of interconnect circuits considering delay variations. In IEEE/ACM Intl. Conf. on CAD, November 2005. [9] P. Li and W. Shi. Model order reduction of linear networks with massive ports via frequency-dependent port packing. In Proc. IEEE/ACM Design Automation Conf., pages 267–272, July 2006. [10] H. Chen et al. A sliding window scheme for accurate clock mesh analysis. In Proc. IEEE/ACM Intl. Conf. on CAD, pages 939–946, November 2005. [11] X. Ye, P. Li, M. Zhao, R. Panda, and J. Hu. Analysis of large clock meshes via harmonic-weighted model order reduction and port sliding. In Proc. IEEE/ACM Intl. Conf. on CAD, November 2007. [12] M. Zhao, R. Panda, S. Sapatnekar, and D. Blaauw. Hierarchical analysis of power distribution networks. IEEE Trans. Computer-Aided Design, 21(2):159–168, February 2002. [13] L. Pillage, R. Rohrer, and C. Visweswariah. Electronic Circuit and System Simulation Methods. McGraw-Hill, Inc., 1995. [14] A. Devgan, J. Hao, and W. Dai. How to efficiently capture on-chip inductance effects: introducing a new circuit element K. In Proc. IEEE/ACM Intl. Conf. on CAD, pages 150–155, November 2000. [15] H. Zheng and L. Pileggi. Modeling and analysis of regular symmetrically structured power/ground distribution networks. In Proc. IEEE/ACM Design Automation Conf., pages 395–398, June 2002. [16] K. Kundert. Sparse Matrix Techniques In Circuit Analysis, Simulation and Design. North-Holland, 1986. [17] T.A. Davis and W.W. Hager. Modifying a sparse cholesky factorization. SIAM Journal on Matrix Analysis and Applications, 20(3):606–627, 1999. Table 3: Runtime(s) comparison between macromodel based simulation with fixed time steps and dynamic time step rounding Ckt Macro. w. dynamic time step rounding 1.008 4.654 34.18 114.84 778.16 1149.91 REFERENCES [1] P. J. Restle et al. A clock distribution network for microprocessors. IEEE J. of Solid-State Circuits, 36(5):792–799, May 2001. 632