Accelerating Clock Mesh Simulation

9th International Symposium on Quality Electronic Design
Accelerating Clock Mesh Simulation Using Matrix-Level
Macromodels and Dynamic Time Step Rounding
Xiaoji Ye
Department of ECE
Texas A&M University
Rajendran Panda
Min Zhao
Peng Li, Jiang Hu
Freescale
Semiconductor, Inc.
Austin, TX
Magma Design
Automation, Inc.
[email protected]
Department of ECE
Texas A&M University
[email protected],
[email protected]
ABSTRACT
Clock meshes have found increasingly wide application in today’s high-performance IC designs.
The inherent routing redundancy of clock meshes leads to improved clock skew and reliability.
However, the high complexity of clock meshes in modern chip designs makes their verification
very challenging. A typical clock distribution network may consist of millions of
coupled/interconnected linear elements and hundreds of nonlinear clock drivers attached at
different locations on the mesh. Such a large network is often too complex for feasible
SPICE-like simulation. In this paper, we present a new simulation methodology which decomposes
a clock mesh into linear and nonlinear parts. By exploiting the special matrix property of the
linear subsystem resulting from the modified nodal analysis (MNA) formulation, the linear
subsystem is represented as a matrix-level macromodel, which greatly simplifies the overall
simulation task. These macromodels can be efficiently computed using Cholesky factorization and
significantly speed up the nonlinear Newton-Raphson iterations used in the transient simulation
of the complete clock mesh. Furthermore, a dynamic time step rounding technique is proposed to
limit the number of passive macromodels needed in the entire transient simulation, which
further improves the efficiency of the proposed approach.

∗ This work was supported in part by SRC under contract 2006-TJ-1416.
0-7695-3117-2/08 $25.00 © 2008 IEEE
DOI 10.1109/ISQED.2008.18

1. INTRODUCTION
Due to the inherent redundancy introduced by their non-tree topology, clock meshes have
excellent performance (e.g. low clock skews) and immunity to PVT (process, voltage and
temperature) variations [1, 2]. However, the high complexity of these clock networks,
contributed by the large mesh structure and its tightly coupled interactions with a large
number of clock drivers, presents a very difficult circuit analysis problem. A typical
topology of mesh-based clock distribution networks is shown in Fig. 1. The top-level clock
distribution is routed through a tree and this tree drives a large mesh spanning the whole
chip. The mesh is driven by a large number of mesh drivers at the leaves of the tree and
distributes clock inputs to many bottom-level clock drivers or flip-flops. In industrial chip
designs, an accurate mesh circuit model may consist of millions of circuit unknowns and a few
hundred nonlinear clock drivers.

Figure 1: Clock distribution network using mesh structures (labels: clock drivers; clock sinks/FFs).

While it is practically very challenging to apply standard SPICE simulation to these massive
networks, existing model order reduction algorithms [3, 4, 5, 6] are only applicable to
passive networks with a limited number of I/O ports. The efficiency of the standard model
order reduction algorithms degrades quickly as the number of I/O ports increases, which is the
case for the linear subsystem of a clock mesh. Several approaches have been proposed to solve
this massive network simulation problem from different directions [7, 8, 9, 10, 11]. In
[7, 8, 9], techniques are proposed to reduce the complexity of standard model order reduction
by means of port compaction and merging. In [10], a sliding window approach analyzes the clock
mesh by a heuristic-based divide-and-conquer scheme. More recently, [11] presents a
harmonic-weighted model order reduction algorithm which utilizes specific design knowledge of
the clock mesh network to further improve the efficiency of projection-based model order
reduction when applied to mesh structures with a large number of I/Os. The approach in [11]
tackles the overall simulation task by further decomposing the complete network into smaller
subproblems via a port sliding scheme, where a reduced order model is generated for computing
the voltage response at each port of the passive subnetwork one at a time.

In this work, we address the analysis of large clock meshes from a simulation perspective. As
illustrated in Fig. 2, our new simulation approach breaks the clock mesh into linear and
nonlinear parts. At each transient analysis time step, the linear subsystem is modeled as a
compact multiport (real) admittance macromodel whose I/V transfer characteristic is identical
to that of the linear subsystem. The size of the macromodel is merely determined by the number of
ports in the linear part of the clock mesh. When formulated properly, the linear MNA equations
corresponding to the linear subsystem have a special matrix structure, i.e., the system matrix
is SPD (symmetric positive definite). As a result, the macromodel can be computed efficiently
using fast Cholesky factorization [12]. This is not possible if the linear and nonlinear
portions of the mesh are solved together, which typically leads to asymmetric matrix
structures. In practice, a typical Cholesky solver is at least a few times faster than a
general LU solver. Once computed, this macromodel is combined with the nonlinear clock drivers
to form a much more compact set of nonlinear circuit equations. Nonlinear N-R iterations are
then applied to this much reduced set of equations rather than to the original equations,
resulting in good runtime speedups. Hence, separating the linear part from the nonlinear part
allows us to use fast matrix solvers to tackle the large linear subsystem which dominates the
complexity of the clock mesh.

Figure 2: Macromodel based simulation (the linear part is replaced by a macromodel attached to the nonlinear part).

While the above approach can speed up the transient analysis at each time step, the linear
subnetwork macromodel needs to be updated whenever the time step changes. This is indeed the
case if a dynamic time step control algorithm, e.g. one using LTE (local truncation error)
control [13], is adopted. To limit the total number of large Cholesky factorizations needed in
the entire transient simulation, we further propose a dynamic time step rounding scheme in
which linear macromodels are pre-computed at a discrete set of time points. During the
transient analysis, time steps generated by the dynamic time step control algorithm are
rounded to this discrete set such that no new Cholesky factorizations are needed. These
discrete time points are spaced to give the best tradeoff between the number of Cholesky
factorizations needed and the amount of time step reduction due to rounding. We demonstrate
excellent performance of the proposed techniques through extensive experiments.

2. MATRIX-LEVEL MACROMODELING
2.1 MNA formulation for linear subnetwork
To compute a macromodel for the linear subnetwork, let us consider the circuit equations
corresponding to the linear portion of the clock mesh. In the modified nodal analysis (MNA)
equations, the linear part can be described by the following equation:
$$G \cdot x(t) + C \cdot x'(t) = b(t), \qquad y(t) = L^{T} x(t) \qquad (1)$$
where $G \in \mathbb{R}^{n \times n}$ is the conductance matrix from resistive elements;
$C \in \mathbb{R}^{n \times n}$ is the capacitance matrix formed by the capacitive and
inductive elements; $x(t) \in \mathbb{R}^{n}$ is the time-varying vector of node voltages and
of inductor and voltage source branch currents; $L \in \mathbb{R}^{n \times m}$ is the output
matrix; $b(t) \in \mathbb{R}^{n}$ is the vector of independent time-varying input excitations;
$y(t) \in \mathbb{R}^{m}$ is the vector of outputs. If the Backward Euler (BE) method with a
small time step $h$ is used to discretize the above differential equation in time, (1) can be
converted into an algebraic equation as
$$\left(G + \frac{C}{h}\right) \cdot x(t) = b(t) + \frac{C}{h} \cdot x(t - h). \qquad (2)$$
Matrices G and C can be formed in a particular way. Since (2) is used to describe the
input/output characteristics of the linear subnetwork at the ports, we are free to choose the
inputs and outputs. Here the inputs b(t) are chosen to be the port currents and the outputs
y(t) are chosen to be the port voltages. Instead of using self and partial inductances, we
further use susceptances (reluctances) to model inductive effects. Under these choices, it can
be shown that (G + C/h) is symmetric and positive definite (SPD) [14, 15]. The symmetry and
positive definiteness of the matrix allow us to use efficient Cholesky factorization to
factorize the linear system. The same matrix property is exploited for DC analysis of linear
power/ground distribution networks in [12].

2.2 Macromodel generation
The desired macromodel of the linear subnetwork for a given time point (or time step h) is
shown in Fig. 3 [12]. The model describes the input-and-output characteristics of the linear
network at the ports:
$$I = A \cdot V + S \qquad (3)$$
where $A \in \mathbb{R}^{m \times m}$ is the port admittance matrix; $V \in \mathbb{R}^{m}$
is the vector of port voltages; $I \in \mathbb{R}^{m}$ is the vector of port currents;
$S \in \mathbb{R}^{m}$ is the vector of current sources connected between each port and
ground; m is the number of ports of the linear part. Vector S essentially moves all the known
internal current sources to the ports. These current sources are due to the application of
numerical integration in transient analysis as in (2).

Figure 3: Macromodel of the linear subnetwork (port current I, macromodel admittance A, port current source S).

Note that we have purposely chosen the macromodel to be an admittance type model for ease of
integration with nonlinear circuit elements. This admittance representation is obtained by
solving the impedance-like circuit formulation in (2).
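As a minimal sketch of the Backward Euler step in (2), assuming a small hypothetical RC chain (the helper names `rc_chain`, `cholesky`, `solve_spd` and `be_step` and all element values are illustrative, not taken from the paper), the SPD matrix G + C/h can be factorized once and then reused at every time point with the same step h:

```python
import math

def rc_chain(n, r=1.0, c=1.0):
    """G and C for n nodes in a chain: node 0 ties to ground through r,
    neighbours are linked by r, and every node has c to ground."""
    g = 1.0 / r
    G = [[0.0] * n for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    G[0][0] += g                      # grounding resistor at node 0
    for i in range(1, n):             # series resistors between neighbours
        G[i][i] += g
        G[i - 1][i - 1] += g
        G[i][i - 1] -= g
        G[i - 1][i] -= g
    for i in range(n):
        C[i][i] = c
    return G, C

def cholesky(A):
    """Lower-triangular L with A = L L^T; raises ValueError if A is not SPD."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def solve_spd(L, rhs):
    """Solve L L^T x = rhs by forward then backward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def be_step(L, C, h, b, x_old):
    """One Backward Euler step: (G + C/h) x_new = b + (C/h) x_old."""
    n = len(b)
    rhs = [b[i] + sum(C[i][k] * x_old[k] for k in range(n)) / h for i in range(n)]
    return solve_spd(L, rhs)

n, h = 4, 0.1
G, C = rc_chain(n)
G_tilde = [[G[i][j] + C[i][j] / h for j in range(n)] for i in range(n)]
L = cholesky(G_tilde)                 # factorize once for this time step h
x = [0.0] * n
b = [0.0, 0.0, 0.0, 1.0]              # constant current injected at the last node
for _ in range(2000):                 # march towards the DC solution G x = b
    x = be_step(L, C, h, b, x)
```

The factorization is the expensive part; each subsequent time point costs only two triangular solves, which is why a change of h (and hence a new factorization) is what the later rounding scheme tries to avoid.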
The formulation in (2) is chosen to guarantee the symmetry and positive definiteness of the
matrix. Without loss of generality, in the following we assume the linear subnetwork is
modeled as an RC network for simplicity of presentation. As such, (2) can be further
simplified into the form
$$\tilde{G} \cdot U = J, \qquad \tilde{G} \in \mathbb{R}^{n \times n},\ U \in \mathbb{R}^{n},\ J \in \mathbb{R}^{n} \qquad (4)$$
where n is the number of nodes; G̃ = G + C/h; J = b(t) + C/h · x(t − h); U is the vector of
node voltages. Equation (4) can be split into the form
$$\begin{bmatrix} G_{11} & G_{12} \\ G_{12}^{T} & G_{22} \end{bmatrix} \begin{bmatrix} U_1 \\ V \end{bmatrix} = \begin{bmatrix} J_1 \\ J_2 + I \end{bmatrix} \qquad (5)$$
where $U_1$ is the vector of internal node voltages; $V$ is the vector of port voltages; $J_1$
and $J_2$ are the vectors of current sources resulting from the application of numerical
integration (e.g. equivalent current sources of the capacitance companion models) connected at
the internal nodes and the ports, respectively; $I$ is the vector of port current sources,
which are treated as the inputs to the linear subnetwork; $G_{12}$ is the portion of the
admittance matrix that links the internal nodes and the ports; $G_{11}$ contains the
admittances between the internal nodes; $G_{22}$ contains the admittances between the ports.
The first set of the equations in (5) can be rewritten as
$$U_1 = G_{11}^{-1} \left(J_1 - G_{12} V\right). \qquad (6)$$
The second set of the equations in (5) can be rewritten as
$$I = G_{12}^{T} U_1 + G_{22} V - J_2. \qquad (7)$$
Substituting (6) into (7) leads to
$$I = \left(G_{22} - G_{12}^{T} G_{11}^{-1} G_{12}\right) V + \left(G_{12}^{T} G_{11}^{-1} J_1 - J_2\right). \qquad (8)$$
Comparing (8) with the desired macromodel in (3), we get
$$A = G_{22} - G_{12}^{T} G_{11}^{-1} G_{12} \qquad (9)$$
$$S = G_{12}^{T} G_{11}^{-1} J_1 - J_2 \qquad (10)$$
Although the macromodel can be built by directly computing (9) and (10), this turns out not to
be the best approach. Since the coefficient matrix G̃ is symmetric and positive definite, we
can use Cholesky factorization to simplify the computation of A and S. The Cholesky
factorization of G̃ is given as
$$\tilde{G} = L \cdot L^{T} \qquad (11)$$
$$\begin{bmatrix} G_{11} & G_{12} \\ G_{12}^{T} & G_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} L_{11}^{T} & L_{21}^{T} \\ 0 & L_{22}^{T} \end{bmatrix} = \begin{bmatrix} L_{11} L_{11}^{T} & L_{11} L_{21}^{T} \\ L_{21} L_{11}^{T} & L_{21} L_{21}^{T} + L_{22} L_{22}^{T} \end{bmatrix} \qquad (12)$$
where $L_{11}$, $L_{21}$, $L_{22}$ are the submatrices of the Cholesky factor L of G̃; they
have the same dimensions as $G_{11}$, $G_{12}^{T}$ and $G_{22}$, respectively. Substituting
(12) into (9) and (10), we obtain
$$A = G_{22} - G_{12}^{T} G_{11}^{-1} G_{12} = L_{21} L_{21}^{T} + L_{22} L_{22}^{T} - L_{21} L_{11}^{T} \left(L_{11} L_{11}^{T}\right)^{-1} L_{11} L_{21}^{T} = L_{22} L_{22}^{T} \qquad (13)$$
and
$$S = G_{12}^{T} G_{11}^{-1} J_1 - J_2 = L_{21} L_{11}^{T} \left(L_{11} L_{11}^{T}\right)^{-1} J_1 - J_2 = L_{21} L_{11}^{-1} J_1 - J_2. \qquad (14)$$
Compared with (9), the use of (13) greatly reduces the computation cost since $L_{22}$
normally is much smaller in size than $G_{11}$ and $G_{12}$. Also, since $L_{11}$ and $L_{22}$
are triangular matrices (and $L_{21}$ is a block of the same factor), both A and S can be very
efficiently computed using these matrices.

3. NONLINEAR CLOCK MESH SIMULATION USING MACROMODELS
We now present how these linear subnetwork macromodels can be utilized to speed up the
nonlinear transient analysis of the complete clock mesh. The N-R iteration used at each time
point in nonlinear transient simulation is given by
$$J_k \Delta v_k = -F(v_k) \qquad (15)$$
or
$$J_k v_{k+1} = J_k v_k - F(v_k) \qquad (16)$$
where F(v) = 0 is the set of nonlinear algebraic equations formed at the time point; J is the
Jacobian matrix of F; $v_k$ and $v_{k+1}$ are the solution guesses at the k-th and (k+1)-th
iterations, respectively. At each time point of the transient simulation, contributions from
the linear circuit elements and linearized nonlinear elements are stamped into J, and
equivalent current sources are stamped into $F(v_k)$.¹
¹ Readers may refer to [13] for a more detailed procedure of how different circuit elements
are stamped into the N-R iteration formulation.

If a multi-port macromodel is computed for the large linear subnetwork, the entire linear part
of the mesh is replaced by the macromodel in the N-R iteration. Matrix A and vector S, instead
of individual linear circuit elements, are stamped into the above N-R iteration. This results
in a somewhat denser but much smaller set of equations. The detailed simulation flow is shown
in Algorithm 1. It is worth noting that the simulation result from Algorithm 1 is exact with
respect to standard SPICE simulation. This is because the macromodel exactly describes the
I/V characteristic at the ports of the linear part.

Figure 4: Matrix representation of the reduced N-R iteration.
$$\begin{bmatrix} A & E \\ F & B \end{bmatrix} \begin{bmatrix} V_p \\ U_{ext} \end{bmatrix} = \begin{bmatrix} -S + I_p \\ I_{ext}/V_{ext} \end{bmatrix}$$

The matrix form of the reduced N-R iteration using A and S is illustrated in Fig. 4. In
Fig. 4, A is stamped into the Jacobian matrix; E and F contain entries corresponding to the
elements which are connected between the port nodes and the circuit nodes in the nonlinear
portion of the clock mesh.
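The macromodel construction in (13) and (14), the reduced N-R iteration, and the recovery of internal node voltages from the Cholesky block $L_{11}$ can be sketched together on a toy problem. This is only an illustration under stated assumptions: the 3-node matrix `G_tilde`, the sources `J1` and `J2`, and the tanh-shaped driver `i_drv` are hypothetical and not taken from the paper, and a single port is used so that A and S are scalars.

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L^T (A is assumed SPD)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub_t(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# Hypothetical G~ = G + C/h: two internal nodes (0, 1) and one port node (2).
G_tilde = [[4.0, -1.0, -1.0],
           [-1.0, 3.0, -1.0],
           [-1.0, -1.0, 5.0]]
J1, J2 = [0.2, 0.1], 0.05             # companion-model current sources
m = 2                                 # number of internal nodes

L = cholesky(G_tilde)
L11 = [row[:m] for row in L[:m]]
L21 = L[m][:m]                        # single port, so L21 is one row
L22 = L[m][m]

A = L22 * L22                                         # eq. (13)
w = forward_sub(L11, J1)                              # w = L11^{-1} J1
S = sum(L21[k] * w[k] for k in range(m)) - J2         # eq. (14)

# Hypothetical nonlinear driver injecting current into the port.
VDD, I0, VT = 1.0, 0.5, 0.3

def i_drv(v):
    return I0 * math.tanh((VDD - v) / VT)

def di_drv(v):
    return -(I0 / VT) / math.cosh((VDD - v) / VT) ** 2

# Reduced N-R iteration on F(V) = A V + S - i_drv(V) = 0, cf. (3) and (15).
V = 0.0
for _ in range(50):
    F = A * V + S - i_drv(V)
    if abs(F) < 1e-12:
        break
    V -= F / (A - di_drv(V))          # Jacobian of F is A - d(i_drv)/dV

# Recover internal node voltages by forward and backward substitution with L11.
G12 = [G_tilde[i][m] for i in range(m)]
U1 = backward_sub_t(L11, forward_sub(L11, [J1[i] - G12[i] * V for i in range(m)]))
```

The nonlinear solve touches only the port unknown, while the internal voltages come from two triangular solves afterwards; in a real mesh the same split keeps the N-R system at the size of the port set rather than the full node count.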
In Fig. 4, E and F couple the port nodes with the circuit nodes in the nonlinear portion of
the clock mesh; B contains entries contributed by the nonlinear subnetwork; $V_P$ is the
vector of port voltages, which is the same as vector V in (5); $U_{ext}$ is the vector of the
circuit unknowns of the nonlinear subnetwork. On the right hand side, S is stamped as current
sources connected at the ports; $I_P$ are equivalent current sources contributed by the
nonlinear elements at the ports; $I_{ext}/V_{ext}$ are current sources/voltage sources inside
the nonlinear subnetwork.

Algorithm 1 Transient simulation flow using macromodel
Before transient simulation:
1: Stamp in all the linear circuit elements of the linear subnetwork through a proper MNA formulation.
2: Factorize the SPD matrix G̃ using Cholesky factorization; save the submatrices L11, L21, L22 of the Cholesky factor L.
3: Compute matrix A by using (13).
Transient simulation starts:
4: while transient simulation is not over do
5:   Recover all the internal node voltages U1 of the linear subnetwork.
6:   Obtain J1 and J2 from U1.
7:   Compute S by using (14).
8:   Stamp A and S into (15) or (16).
9:   Stamp nonlinear mesh drivers and external voltage/current sources into (15) or (16).
10:  while not converged do
11:    Update the entries of the nonlinear mesh drivers in the Jacobian matrix; update the right-hand-side vector.
12:    Solve the system by LU factorization.
13:    Update the solution guess: v_{k+1} = v_k + ∆v_k.
14:  end while
15:  Advance the time.
16: end while

3.1 Update of internal node voltages in the linear subnetwork
One important step of our new simulation flow is to update the internal node voltages of the
linear part of the clock mesh (step 5 of Algorithm 1). Since the macromodel is a multi-port
linear admittance model, only the node voltages at its ports are solved in the N-R iteration.
But vector S of the macromodel depends on the current sources connected at the internal nodes
of the linear part according to equation (14). Those current sources are equivalent current
sources in the companion models of capacitances and susceptances (or reluctances). In order to
compute the values of those equivalent current sources, internal node voltages of the linear
part need to be recovered at each time point.
Although we can use (6) to solve for the internal node voltage vector $U_1$ directly, it is
not the most efficient way to do so. Since the Cholesky factor $L_{11}$ of $G_{11}$ is a
submatrix of the Cholesky factor L of G̃, $L_{11}$ can be "extracted" from L during the
macromodel generation step. So $U_1$ can be solved rather efficiently by forward and backward
substitutions:
$$U_1 = \left(L_{11}^{T}\right)^{-1} L_{11}^{-1} \left(J_1 - G_{12} V\right). \qquad (17)$$

4. DYNAMIC TIME STEP ROUNDING
The macromodeling based approach presented in the previous section can be employed to speed up
the nonlinear transient analysis at each individual time point. It shall be noted that the
macromodel of the linear subnetwork depends on the choice of the time step (h). Hence,
whenever h changes, for instance, as a result of dynamic time step control, the macromodel
needs to be recomputed. As such, potentially a large number of Cholesky factorizations are
needed during the entire nonlinear transient analysis. To limit the number of Cholesky
factorizations, we propose a dynamic time step rounding scheme. Before we describe this
scheme, we briefly review the LTE based dynamic time step control.

4.1 Dynamic time step control by using local truncation error estimation
Dynamic time step control is adopted by most SPICE simulators today to enhance simulation
efficiency and accuracy. Local truncation error (LTE) estimation is typically used with a
numerical integration method such as the Backward Euler or Trapezoidal method to control the
time step [13]. For Backward Euler, which is what we use in this work, local truncation error
estimates for capacitances and inductances are given by
$$\varepsilon_C \le -\frac{1}{2} \left(\frac{\Delta t_n}{C}\right) \left[i_c(t_{n+1}^{-}) - i_c(t_n^{+})\right] \qquad (18)$$
and
$$\varepsilon_L \le -\frac{1}{2} \left(\frac{\Delta t_n}{L}\right) \left[v_L(t_{n+1}^{-}) - v_L(t_n^{+})\right] \qquad (19)$$
where $\varepsilon_C$ and $\varepsilon_L$ are local truncation errors; $\Delta t_n$ is the
time step between $t_n$ and $t_{n+1}$; C and L are values of the capacitor and inductor;
$i_c(t_n^{+})$ and $i_c(t_{n+1}^{-})$ are the currents of C at $t_n$ and $t_{n+1}$;
$v_L(t_n^{+})$ and $v_L(t_{n+1}^{-})$ are the voltages of L at $t_n$ and $t_{n+1}$. By setting
a user-defined error limit, we can use (18) and (19) to calculate the maximum tolerable value
of the next time step for every energy storage element. The actual next time step for the
complete circuit is chosen as the minimum of these maximum tolerable time step values.

4.2 Dynamic time step rounding
As pointed out before, the linear subnetwork macromodel needs to be recomputed whenever the
time step changes. Therefore, it would be inefficient to allow arbitrary time step changes
during the transient analysis. Our basic idea is to constrain the time step to a properly
chosen discrete set of time points, as described below.
First, the range of time steps generated by the dynamic time step control algorithm is
estimated: $H_{int} = [h_{min}, h_{max}]$. This can be done by monitoring the transient
simulation of certain test circuits, and the accuracy of this estimation is not very critical.
Then, this range is divided using a set of unevenly distributed discrete time points
$\{h_{min}, h_1, h_2, \cdots, h_{max}\}$. A linear macromodel is pre-computed for each time
point $h_i$ in the above set. One good way is to divide the whole time step range using a set
of geometrically spaced time steps as shown in Fig. 5. Here, the minimum time step is
$h_{min}$, and the following time steps are $2h_{min}$, $4h_{min}$ and so on until the maximum
time step $h_{max}$ is reached. The total number of discrete time steps is (approximately)
given by
$$1 + \left\lceil \log_2 \left(h_{max}/h_{min}\right) \right\rceil \qquad (20)$$
In this case, only about 10 discrete time points (11 by (20)) are needed to cover a 1000X span
of time steps.
During the transient simulation, if the predicted next time step calculated by LTE falls
between two adjacent time steps in our predefined set, we round the next time step to its
nearest smaller value in the predefined set such that the corresponding pre-computed Cholesky
factorization can be utilized.
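The rounding rule can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `lte_step_limit` inverts the Backward Euler bound of (18)/(19) while treating the current or voltage change as fixed, and the element values, the error limit `eps` and all function names are hypothetical.

```python
import math
from bisect import bisect_right

def predefined_steps(h_min, h_max):
    """Geometric grid h_min, 2*h_min, 4*h_min, ... capped at h_max."""
    count = 1 + math.ceil(math.log2(h_max / h_min))      # eq. (20)
    return [min(h_min * 2 ** k, h_max) for k in range(count)]

def lte_step_limit(value, delta, eps):
    """Largest step keeping (1/2) * (dt / value) * |delta| within eps,
    where value is C or L and delta the current or voltage change."""
    if delta == 0.0:
        return math.inf
    return 2.0 * value * eps / abs(delta)

def round_down(h_pred, steps):
    """Largest predefined step not exceeding h_pred (clamped to steps[0])."""
    idx = bisect_right(steps, h_pred) - 1
    return steps[max(idx, 0)]

steps = predefined_steps(1e-12, 1e-10)
# Hypothetical per-element data: (C or L value, change in current or voltage).
elements = [(1e-13, 4e-5), (2e-13, 1e-5), (5e-14, 1e-5)]
eps = 1e-3
# The circuit-level step is the minimum of the per-element limits.
h_pred = min(lte_step_limit(val, d, eps) for val, d in elements)
h_next = round_down(h_pred, steps)    # one macromodel per entry of `steps`
```

With these numbers the predicted step is about 5e-12 s and it is rounded down to the grid value 4e-12 s, so the factorization pre-computed for that step is reused and the step shrinks by less than a factor of two.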
Because the time step is only ever rounded down, to a value whose Cholesky factorization has
been pre-computed, the LTE constraint is always satisfied. Two main advantages brought by the
dynamic time step rounding technique are: first, the number of Cholesky factorizations needed
is well bounded (for most cases, 10 factorizations are more than adequate); second, using a
geometrically distributed set of time steps brings no more than a 2X reduction of the time
step relative to what is predicted by LTE.

Figure 5: Dynamic time step rounding (the time step predicted by LTE is rounded down to the actual time step on the grid h_min, 2h_min, 4h_min, 8h_min, ..., h_max).

5. EXPERIMENTAL RESULTS
We demonstrate our macromodel based simulation algorithm and dynamic time step rounding
technique on a set of clock meshes with different sizes and numbers of mesh drivers.
Comparisons are made between our approach and standard SPICE simulation with fixed/dynamic
time step control. Several numerical packages are used in the implementation: the Sparse 1.3
solver package [16] is used for the LU factorization in the N-R iterations; the Cholmod solver
package [17] is used for the Cholesky factorization during the macromodel generation step. We
also compare the runtime of our proposed algorithm with the harmonic-weighted MOR and port
sliding schemes in [11]. Algorithms are implemented in C++. All the experiments are conducted
on a Linux server with a 3GHz CPU and 4GB memory.

5.1 Macromodel based simulation with fixed time step
First, we compare the proposed macromodel based simulation using a fixed time step with our
implementation of SPICE-like simulation with the same fixed time step. In this case, we only
need to compute one macromodel for the linear subnetwork. Table 1 shows the detailed results.
We have tested our proposed approach on six clock meshes. For each circuit, we first compare
the runtime between one LU factorization and one Cholesky factorization for matrix G̃. This is
a solver level comparison and gives us an understanding of how much benefit we can get merely
from Cholesky factorization. Then, the runtime for one full N-R iteration in SPICE and the
runtime for one reduced N-R iteration using the macromodel are compared. Since the problem
size is greatly reduced by the macromodel in the reduced N-R iteration, we achieve roughly 30x
to 270x speedup (up to 2 orders of magnitude) in each reduced N-R iteration. Since the
internal node voltages of the linear part need to be updated at each time point in our
proposed approach, we also include the runtime for updating $U_1$ in Table 1. It is
interesting to see that even for relatively large circuits, recovering the node voltages for
the entire linear part can be done rather quickly. This is due to the fact that we use
triangular matrices to compute $U_1$ in (17), which is computationally efficient. Finally, the
overall runtimes for the SPICE simulation and the macromodel based approach are compared. We
can see from the last column of Table 1 that 10 to 45 times speedup can be achieved by the
macromodel based simulation approach.

5.2 Dynamic time step rounding
We now compare the macromodel based simulation approach with SPICE more realistically, with
dynamic time step control used on the same set of circuits. For the macromodel based method,
the predefined set of time steps for the dynamic time step rounding technique covers the
interval [1e−12, 1e−10], so there are $1 + \lceil \log_2 100 \rceil = 8$ predefined time
steps. The comparison results are shown in Table 2. We can see from Table 2 that the average
runtime speedup compared with SPICE with dynamic time step control is smaller than the average
runtime speedup in Table 1. This is because dynamic time step rounding always rounds down the
actual time step, so macromodel based simulation always uses a time step no larger than what
is predicted by LTE in SPICE. Despite the fact that the macromodel based simulation uses
smaller time steps, it is still considerably faster than SPICE with dynamic time step control.
Fig. 6 illustrates how dynamic time step rounding works during the macromodel based simulation
for a test circuit. We show the time steps predicted by LTE and the actual rounded time steps
for 20 consecutive time points. Dynamic time step rounding always rounds down the actual time
step so that the local truncation error is guaranteed to be smaller than the user-defined
error limit. And since the $1 + \lceil \log_2 (h_{max}/h_{min}) \rceil$ time steps in the
predefined set are geometrically spaced by a factor of 2, the time step reduction by rounding
is no more than 2x.

Figure 6: Dynamic time step rounding for 20 consecutive time steps (predicted and actual time steps, t(s) on a scale of 10^-11, versus index of time steps).
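The 2x bound and the count of predefined time steps quoted in Section 5.2 follow directly from the geometric spacing. As a short worked derivation (the notation $h_k = 2^k h_{min}$ for the k-th predefined step and $h_{\mathrm{LTE}}$ for the step predicted by LTE is introduced here only for illustration):

```latex
h_k \le h_{\mathrm{LTE}} < h_{k+1} = 2 h_k
\;\Longrightarrow\;
1 \le \frac{h_{\mathrm{LTE}}}{h_k} < 2 ,
\qquad
N = 1 + \left\lceil \log_2 \frac{h_{max}}{h_{min}} \right\rceil
  = 1 + \lceil \log_2 100 \rceil = 1 + 7 = 8
\quad \text{for } [h_{min}, h_{max}] = [10^{-12}, 10^{-10}].
```

The last interval, which is capped at $h_{max}$, is narrower than a factor of two, so the bound still holds there.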
To demonstrate how much benefit dynamic time step rounding can bring to the macromodel based
simulation approach, we also compare the runtime between macromodel based simulation with
fixed time steps and with dynamic time step rounding in Table 3. The circuit examples are the
same as in Tables 1 and 2. For the simulation with fixed time step, we use 1e−12 s as the time
step, which is also h_min for the dynamic time step rounding. We can see that dynamic time
step rounding brings a 1.09x to 2.32x speedup (about 1.6x on average) to the macromodel based
simulation approach.
To compare with the approach in [11], we test our macromodel based approach with fixed time
step and the serial version of the harmonic-weighted MOR with driver merging scheme on
circuits 4 and 5. The runtimes of harmonic-weighted MOR with driver merging for these two
circuits are 870.23s and 1836.48s while the runtimes of macromodel based simulation are
201.75s and 973.98s, respectively. In the harmonic-weighted MOR with driver merging approach,
the driving point waveform of each clock driver is obtained individually. Therefore, the
runtime increases proportionally with the number of drivers. In the macromodel based
simulation approach, however, all the node voltage waveforms are computed in one simulation.
This is one of the reasons why it is faster than harmonic-weighted MOR with driver merging for
these test circuits. On the other hand, the approach in [11] is parallelizable since the
coupled analysis problem is decomposed into independent small pieces. Furthermore, the use of
dynamic time step rounding requires pre-computation of multiple large Cholesky factorizations,
and these Cholesky factorizations must be loaded into main memory or stored on disk during the
simulation. Considering all these factors, we expect the approach in [11] to become more
attractive for larger clock mesh designs. Accuracy-wise, there is no approximation during the
macromodel generation in this work while approximation does exist in the harmonic-weighted MOR
approach [11].

Table 1: Runtime(s) comparison between macromodel based simulation and SPICE with fixed time step
Ckt   # nodes  # elements  # drivers  One LU(s)  One Chol.(s)  One full N-R iter.(s)  One redu. N-R iter.(s)  One update of U1  Runtime SPICE(s)  Runtime Macro.(s)  Speedup
ckt1  400      1200        2          0.00599    0.002         0.026                  0.0009                  0.0009            24.779            2.335              10.6
ckt2  1000     3000        6          0.027      0.004         0.115                  0.002                   0.0009            109.53            6.843              16.0
ckt3  2500     7500        20         0.363      0.115         1.335                  0.0099                  0.016             1220.63           50.71              24.07
ckt4  6500     20k         30         3.105      0.387         9.540                  0.038                   0.064             2h31min           201.75             44.75
ckt5  40k      120k        100        124.402    25.514        150.246                0.564                   1.118             6h15min           973.98             23.11
ckt6  70k      200k        150        139.15     34.15         185.16                 0.915                   1.815             9h21min           1247.93            26.97

Table 2: Runtime(s) comparison between macromodel based simulation and SPICE with dynamic time steps
Ckt   # nodes  # elements  # drivers  SPICE w. dynamic time step control  Macro. w. dynamic time step rounding  Speedup
ckt1  400      1200        2          8.736                               1.008                                 8.67
ckt2  1000     3000        6          63.885                              4.654                                 13.73
ckt3  2500     7500        20         727.24                              34.18                                 21.28
ckt4  6500     20k         30         1h18min                             114.84                                40.66
ckt5  40k      120k        100        4h21min                             778.16                                20.15
ckt6  70k      200k        150        7h30min                             1149.91                               23.48

Table 3: Runtime(s) comparison between macromodel based simulation with fixed time steps and dynamic time step rounding
Ckt   Macro. w. fixed time steps  Macro. w. dynamic time step rounding  Speedup
ckt1  2.335                       1.008                                 2.32
ckt2  6.843                       4.654                                 1.47
ckt3  50.71                       34.18                                 1.48
ckt4  201.75                      114.84                                1.76
ckt5  973.98                      778.16                                1.25
ckt6  1247.93                     1149.91                               1.09

6. CONCLUSION
In this paper, we propose a matrix-level macromodel based simulation approach to accelerate
the verification of clock meshes. The difficulty of solving extremely large N-R iteration
problems is significantly relaxed by the use of the macromodel. In addition, a dynamic time
step rounding technique is applied jointly with the macromodel to achieve even better runtime
performance.

7. REFERENCES
[1] P. J. Restle et al. A clock distribution network for microprocessors. IEEE J. of Solid-State Circuits, 36(5):792–799, May 2001.
[2] P. J. Restle et al. The clock distribution of the Power4 microprocessor. In IEEE ISSCC, pages 144–145, February 2002.
[3] L. Pillage and R. Rohrer. Asymptotic waveform evaluation for timing analysis. IEEE Trans. Computer-Aided Design, 9:352–366, April 1990.
[4] P. Feldmann and R. Freund. Efficient linear circuit analysis by Padé approximation via the Lanczos process. IEEE Trans. Computer-Aided Design, 14:639–649, May 1995.
[5] L. Silveira, M. Kamon, and J. White. Efficient reduced-order modeling of frequency-dependent coupling inductances associated with 3-D interconnect structures. In Proc. IEEE/ACM Design Automation Conf., June 1995.
[6] A. Odabasioglu, M. Celik, and L. Pileggi. PRIMA: passive reduced-order interconnect macromodeling algorithm. IEEE Trans. Computer-Aided Design, 17(8):645–654, August 1998.
[7] P. Feldmann and F. Liu. Sparse and efficient reduced order modeling of linear subcircuits with large number of terminals. In IEEE/ACM Intl. Conf. on CAD, November 2004.
[8] P. Liu et al. An efficient method for terminal reduction of interconnect circuits considering delay variations. In IEEE/ACM Intl. Conf. on CAD, November 2005.
[9] P. Li and W. Shi. Model order reduction of linear networks with massive ports via frequency-dependent port packing. In Proc. IEEE/ACM Design Automation Conf., pages 267–272, July 2006.
[10] H. Chen et al. A sliding window scheme for accurate clock mesh analysis. In Proc. IEEE/ACM Intl. Conf. on CAD, pages 939–946, November 2005.
[11] X. Ye, P. Li, M. Zhao, R. Panda, and J. Hu. Analysis of large clock meshes via harmonic-weighted model order reduction and port sliding. In Proc. IEEE/ACM Intl. Conf. on CAD, November 2007.
[12] M. Zhao, R. Panda, S. Sapatnekar, and D. Blaauw. Hierarchical analysis of power distribution networks. IEEE Trans. Computer-Aided Design, 21(2):159–168, February 2002.
[13] L. Pillage, R. Rohrer, and C. Visweswariah. Electronic Circuit and System Simulation Methods. McGraw-Hill, Inc., 1995.
[14] A. Devgan, J. Hao, and W. Dai. How to efficiently capture on-chip inductance effects: introducing a new circuit element K. In Proc. IEEE/ACM Intl. Conf. on CAD, pages 150–155, November 2000.
[15] H. Zheng and L. Pileggi. Modeling and analysis of regular symmetrically structured power/ground distribution networks. In Proc. IEEE/ACM Design Automation Conf., pages 395–398, June 2002.
[16] K. Kundert. Sparse Matrix Techniques In Circuit Analysis, Simulation and Design. North-Holland, 1986.
[17] T.A. Davis and W.W. Hager. Modifying a sparse Cholesky factorization. SIAM Journal on Matrix Analysis and Applications, 20(3):606–627, 1999.