CSE241 VLSI Digital Circuits Winter 2003 Lecture 03:ASIC prototyping

CSE241
VLSI Digital Circuits
Winter 2003
Lecture 06: Timing
CSE241 L3 ASICs.1
Kahng & Cichy, UCSD ©2003
This Class + Logistics

Timing


Reading


Storage elements, Clock distribution, Clock tree synthesis
Whitepapers/datasheets on STA; papers on clock tree synthesis
Schedule

MT in one week (lab/recitation fair game); Lab #2 due Mon 1/27

HW #9: As a block’s layout is compacted down to fit into a smaller and smaller
region, the timing of the block at first improves, but then worsens. Explain.

HW #10: Hold time violations mean that the chip doesn’t work at any frequency.
Propose several distinct methods for fixing hold time violations (guided by post-routing static timing analysis), and explain the pros and cons of each.

HW #11: Compare DEC’s first Alpha and first StrongArm processors (look up
transistor counts, supply voltage, frequency, etc.). (a) How much of
StrongArm’s power efficiency can be attributed to process, supply, and
frequency scaling? (b) What factors might contribute to the remainder?
Slide courtesy of S. P. Levitan, U. Pittsburgh
Review

Static timing analysis (Lecture 4)

Pin-based timing graph

Directed acyclic graph (DAG) of timing arcs

Longest path in DAG → time linear in #arcs (edges)

Slack = required arrival time – actual arrival time (long path analysis)
Logic synthesis (Lecture 5)
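The review points on the pin-based timing graph can be illustrated with a minimal Python sketch. It propagates late-mode arrival times through a DAG of timing arcs in topological order, so every arc is relaxed exactly once. The function names and the four-arc netlist are hypothetical and are not taken from the lecture figures.

```python
from collections import defaultdict, deque


def arrival_times(arcs, input_at):
    """Late-mode arrival times over a DAG of timing arcs.

    arcs: iterable of (src_pin, dst_pin, delay).
    input_at: arrival times at source pins (sources not listed default to 0).
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    pins = set()
    for src, dst, delay in arcs:
        fanout[src].append((dst, delay))
        indegree[dst] += 1
        pins.update((src, dst))

    # Pins with no incoming arc are the sources of the timing graph.
    at = {p: input_at.get(p, 0) for p in pins if indegree[p] == 0}
    ready = deque(at)
    while ready:  # topological traversal: each arc is relaxed exactly once
        pin = ready.popleft()
        for dst, delay in fanout[pin]:
            at[dst] = max(at.get(dst, float("-inf")), at[pin] + delay)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return at


def slack(rat, at):
    """Late-mode slack: required arrival time minus actual arrival time."""
    return rat - at


# Hypothetical four-arc netlist, not the one drawn on the slides.
arcs = [("a", "x", 2), ("b", "x", 1), ("x", "z", 3), ("c", "z", 1)]
at = arrival_times(arcs, {"a": 0, "b": 1, "c": 0})
print(at["z"], slack(4, at["z"]))  # prints: 5 -1
```

With a required arrival time of 4 at z, the slack is negative, which flags the path through x as too slow.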
Static Analysis vs. Dynamic Analysis

Why static analysis when dynamic simulation is more accurate?

Drawbacks of simulation

Requires input vectors (stimuli for circuit)
Long runtimes
Exponential explosion with number of possible design input states

Example: calculate worst-case rising delay from a to z
[Figure: gate with inputs a, b, c and output z; the a-z delay depends on the state of side inputs b and c]
        b=0          b=1
c=0     a-z delay1   a-z delay3
c=1     a-z delay2   a-z delay4
STA Terminology

(Actual) arrival time (AAT, or AT) = time at which a pin
switches state


Usually 50% point on voltage curve, i.e., AT = t50
Slew time = time over which signal switches

Usually difference between 10% and 90% on voltage curve, i.e.,
tslew = t90 – t10

Required arrival time (RAT) = time at which a signal must
arrive in order to avoid a chip fail

Slack = RAT – AAT

Positive slack good (= margin), negative slack bad
[Figure: signal voltage vs. time, rising to Vdd, with the 10%, 50%, and 90% points marked]
Example: What is slack at PO?
[Figure: timing graph annotated with gate delays (d) and arrival times (at); at the PO, at=11 and rat=10, so Slack = -1]
Example: Incremental Timing Analysis
[Figure: the timing graph after an incremental change, annotated with gate delays (d) and updated arrival times (at); at the PO, rat=10 and Slack = 0]
Amount of work is bounded by sizes of fanin, fanout cones of logic
Early-Mode Analysis

Definitions change as follows

RAT = lower bound on arrival time
Propagate shortest possible instead of longest possible delays
Slack = Arrival – Required
Example: negative slack because ATc is too small (early)

[Figure: circuit with inputs a, b, c, internal node y, and output x; each gate has delay 1]
ATa = 0, ATb = 1, ATc = 0
ATy = 1, ATx = 1, RATx = 2
SLa = 0 – 0 = 0
SLb = 1 – 0 = 1
SLy = 1 – 1 = 0
SLc = 0 – 1 = –1
SLx = 1 – 2 = –1
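The early-mode example can be checked with a short Python sketch. It assumes one reading of the figure: a and b feed a delay-1 gate with output y, and y and c feed a delay-1 gate with output x. The function name and the arc-list encoding are illustrative only.

```python
def early_mode_slacks(arcs, input_at, output_rat):
    """Early-mode slack (Arrival - Required) for every pin.

    arcs must be listed in topological order as (src, dst, delay).
    """
    at = dict(input_at)
    for src, dst, delay in arcs:            # forward pass: earliest arrival
        at[dst] = min(at.get(dst, float("inf")), at[src] + delay)
    rat = dict(output_rat)
    for src, dst, delay in reversed(arcs):  # backward pass: RAT is a lower bound
        rat[src] = max(rat.get(src, float("-inf")), rat[dst] - delay)
    return {pin: at[pin] - rat[pin] for pin in at}


# Assumed structure of the slide's circuit (see lead-in).
arcs = [("a", "y", 1), ("b", "y", 1), ("y", "x", 1), ("c", "x", 1)]
slacks = early_mode_slacks(arcs, {"a": 0, "b": 1, "c": 0}, {"x": 2})
print(slacks)
```

Under that assumed structure the sketch gives slacks of 0, 1, -1, 0, and -1 for a, b, c, y, and x, matching the values on the slide.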
Enhancements of STA
 Incremental timing analysis
 Nanometer-scale process effects – variation (→ probabilistic timing analysis)
 Interference – crosstalk
 Multiple inputs switching
 Conservatism of delay propagation

(Old: HW #8: Suppose you change the size of one (combinational) gate in
your design, thus invalidating the previous timing analysis. How much work
must be done to regain a correct timing analysis?)
Courtesy K. Keutzer et al. UCB
Timing Correction

Driven by STA
  “Incremental performance analysis backplane”
Fix electrical violations
  Resize cells
  Buffer nets
  Copy (clone) cells
Fix timing problems
  Local transforms (bag of tricks)
  Path-based transforms
DAC-2002, Physical Chip Implementation
Local Synthesis Transforms




Resize cells



Move critical signals forward
Buffer or clone to reduce load on critical nets
Decompose large cells
Swap connections on commutative pins or among equivalent nets
Pad early paths
Area recovery
Transform Example
[Figure: double inverter removal; path delay drops from 4 to 2]
Resizing
[Figure: resizing example; delay d vs. load curves for cell sizes A, B, C (load axis 0 to 1, delay axis 0 to 0.05), applied to a gate driving loads of 0.2, 0.2, and 0.3; marked delays are 0.035 and 0.026]
Cloning
[Figure: cloning example; delay d vs. load curves for cell sizes A, B, C; a gate driving five loads of 0.2 each (d, e, f, g, h) is duplicated so that the loads are divided between the copies]
Buffering
[Figure: buffering example; delay d vs. load curves for cell sizes A, B, C; a buffer B is inserted to drive part of the fanout (loads of 0.2 each) and reduce the load on the original driver]
Redesign Fan-in Tree
Arr(a)=4, Arr(b)=3, Arr(c)=1, Arr(d)=0; each gate has delay 1
[Figure: fan-in tree computing e from a, b, c, d; the original tree gives Arr(e)=6, the redesigned tree gives Arr(e)=5]
Redesign Fan-out Tree
[Figure: fan-out tree before and after redesign; Longest Path = 5 before, Longest Path = 4 after]
Slowdown of buffer due to load
Decomposition
Swap Commutative Pins
[Figure: gate with inputs a, b, c and differing pin delays, before and after swapping the connections on its commutative pins]
Simple sorting on arrival times and delay works
Outline
 Clocking
 Storage elements
 Clocking metrics and methodology
 Clock distribution
 Package and useful-skew degrees of freedom
 Clock power issues
 Gate timing models
Why Clocks?
 Clocks provide the means to synchronize

By allowing events to happen at known timing boundaries, we
can sequence these events
 Greatly simplifies building of state machines
 No need to worry about variable delay through
combinational logic (CL)

All signals delayed until clock edge (clock imposes the worst
case delay)
[Figure: FSM (combinational logic with a state register in feedback) and dataflow (combinational logic between registers)]
Clock Cycle Time
 Cycle time is determined by the delay through the CL


Signal must arrive before the latching edge
If too late, it waits until the next cycle
- Synchronization and sequential order become incorrect
 tcycle > tprop_delay + toverhead
 Can change circuit architecture to obtain smaller tcycle
Pipelining
 For dataflow:



Instead of a long critical path, split the critical path into chunks
Insert registers to store intermediate results
This allows 2 waves of data to coexist within the CL
 Can we extend this ad infinitum?

Overhead eventually limits the pipelining
- E.g., 1.5 to 2 gate delays for latch or FF

Granularity limits as well
- Minimum time quantum: delay of a gate
[Figure: unpipelined datapath; register, CL computing A+B with delay tpd, register]
tcycle > tpd + toverhead
[Figure: pipelined datapath; register, CL A with delay tpd1, register, CL B with delay tpd2, register]
tcycle > max(tpd1, tpd2) + toverhead
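A small Python sketch of the cycle-time bound may help show why overhead eventually limits pipelining. The delays are hypothetical, expressed in gate delays, and the function name is illustrative.

```python
def min_cycle_time(stage_delays, t_overhead):
    """Lower bound on tcycle: slowest stage delay plus register overhead."""
    return max(stage_delays) + t_overhead


T_OVERHEAD = 2  # hypothetical latch/FF overhead, in gate delays

unpipelined = min_cycle_time([20], T_OVERHEAD)    # one CL block, tpd = 20
two_stage = min_cycle_time([12, 8], T_OVERHEAD)   # unbalanced split: tpd1 = 12, tpd2 = 8
ten_stage = min_cycle_time([2] * 10, T_OVERHEAD)  # overhead now equals the logic delay
print(unpipelined, two_stage, ten_stage)  # prints: 22 14 4
```

The unbalanced two-stage split is limited by its slower stage, and at ten stages half of each cycle is spent on register overhead.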
FO4 INV Delays Per Clock Period
[Figure: number of FO4 inverter delays per clock period (y-axis, 0 to 120) vs. year (1982 to 2004) for the 386, 486, DX2, DX4, Pentium, Pentium MMX, Pentium Pro, Pentium II, Celeron, Pentium III, and Pentium 4]
FO4 INV = inverter driving 4 identical inverters (no interconnect)
Half of frequency improvement has been from reduced logic stages, i.e., pipelining
Outline
 Clocking
 Storage elements
 Clocking metrics and methodology
 Clock distribution
 Package and useful-skew degrees of freedom
 Clock power issues
 Gate timing models
Clock Skew
Most “high-profile” of clock network metrics

Skew: maximum difference in arrival times of clock signal to any 2 latches/FF’s fed by the network

Skew = max | t1 – t2 |
[Figure: clock source (ex. PLL) driving CLK1 and CLK2, which arrive at times t1 and t2 after the clock latency; skew is the difference between the two arrivals. Fig. from Zarkesh-Ha]
Sylvester / Shepard, 2001
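The skew definition can be written out as a minimal Python sketch. The sink names and arrival times (in ps) are hypothetical.

```python
from itertools import combinations


def clock_skew(arrivals):
    """Maximum |t1 - t2| over all pairs of clock sinks."""
    times = arrivals.values()
    return max(abs(t1 - t2) for t1, t2 in combinations(times, 2))


# Hypothetical clock arrival times at three flip-flops, in ps.
arrivals = {"ff1": 410, "ff2": 395, "ff3": 432}
skew = clock_skew(arrivals)
print(skew)  # prints: 37
```

The pairwise maximum always equals the latest arrival minus the earliest arrival, so a production tool would compute it in a single pass.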
Clock Skew Causes

Designed (unavoidable) variations – mismatch in buffer load
sizes, interconnect lengths

Process variation – process spread across die yielding
different Leff, Tox, etc. values

Temperature gradients – changes MOSFET performance
across die

IR voltage drop in power supply – changes MOSFET
performance across die

Note: Delay from clock generator to fan-out points (clock
latency) is not important by itself

BUT: increased latency leads to larger skew for same amount of
relative variation
Clock Jitter

Clock network delay uncertainty



From one clock cycle to the next, the period is not exactly the same
each time
Maximum difference in phase of clock between any two periods is
jitter
Must be considered in max path (setup) timing; typically O(50ps) for
high-end designs
Clock Jitter Causes


PLL oscillation frequency
Various noise sources affecting clock generation and
distribution

E.g., power supply noise dynamically alters drive strength of
intermediate buffer stages

Jitter reduced by minimizing IR and L*(di/dt) noise
Courtesy Cypress Semi
Clocking Methodology (Edge-Triggered)
[Figure: flip-flop, combinational logic, flip-flop, clocked with period tper]
 Max(tpd) < tper – tsu – tc2q – tskew

Otherwise, delay is too long for data to be captured
 Min(tpd) > th – tc2q + tskew

Otherwise, delay is too short and data can race through, skipping a state
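The two edge-triggered constraints can be turned into margins, as in the Python sketch below. It uses the inequalities exactly as written on the slide; the function name and all numeric values (in ps) are hypothetical.

```python
def timing_margins(tpd_max, tpd_min, t_per, t_su, t_h, t_c2q, t_skew):
    """Return (setup_margin, hold_margin); a negative margin is a violation."""
    setup_margin = (t_per - t_su - t_c2q - t_skew) - tpd_max
    hold_margin = tpd_min - (t_h - t_c2q + t_skew)
    return setup_margin, hold_margin


# Hypothetical flip-flop and path parameters, in ps.
params = dict(tpd_max=800, tpd_min=20, t_su=50, t_h=30, t_c2q=80)

ok = timing_margins(t_per=1000, t_skew=40, **params)
skewed = timing_margins(t_per=1000, t_skew=80, **params)
slower = timing_margins(t_per=2000, t_skew=80, **params)
print(ok, skewed, slower)
```

In this example, raising the skew from 40 to 80 ps makes both margins negative. Doubling the period repairs the setup margin but leaves the hold margin unchanged, which is why a hold violation cannot be fixed by changing frequency.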
Example of tpdmax Violation

Suppose there is skew between the registers in a dataflow
(regA after regB)



“i” gets its input values from regA at transition in Ck’
CL output “o” arrives after Ck transition due to skew
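To correct this problem, can increase cycle time
[Figure: regA (clocked by Ck’) feeds “i” into the combinational logic, whose output “o” goes to regB (clocked by Ck); in the timing diagram Ck’ lags Ck by tskew, and “o” settles tpdmax after Ck’, too late for the capturing Ck edge]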
Example of tpdmin Violation: Race Through




Suppose clock skew causes regA to be clocked before regB
“i” passes through the CL with little delay (tpdmin)
“o” arrives before the rising Ck’ causes the data to be latched
Cannot be fixed by changing frequency → have rock instead of chip
[Figure: regA (clocked by Ck) feeds “i” into the combinational logic, whose output “o” goes to regB (clocked by Ck’); in the timing diagram Ck’ lags Ck by tskew, and “o” changes tpdmin after Ck, too early relative to the Ck’ edge]
Outline
 Clocking
 Storage elements
 Clocking metrics and methodology
 Clock distribution
 Package and useful-skew degrees of freedom
 Clock power issues
 Gate timing models
Clock Distribution
 General goal of clock distribution


Deliver clock to all memory elements with acceptable skew
Deliver clock edges with acceptable sharpness
 Clocking network design is one of the greatest challenges
in the design of a large chip
 Clocks generally distributed via wiring trees (and meshes)
 Low-resistance interconnect to minimize delay
 Multiple drivers to distribute driver requirements


Use optimal sizing principles to design buffers
Clock lines can create significant crosstalk
Clock Distribution Problem Statement

Objective







Minimum skew (performance and hold time issues)
Minimum cell area and metal use
(sometimes) minimal latency
(sometimes) particular latency
(sometimes) intermixed gating for power reduction
(sometimes) hold to particular duty cycle: e.g. 50:50 ± 1 percent
Subject to:






Process variation from lot-to-lot
Process variation across the die
Radically different loading (ff density) around the die
Metal variation across the die
Power variation across the die (both static IR and dynamic)
Coupling (same and other layers)
Issues in Clock Distribution Network Design
 Skew




Process, voltage, and temperature
Data dependence
Noise coupling
Load balancing
 Power, CV²f – (no ½ or α)

Clock gating
 Flexibility/Tunability

Compactness – fit into existing layout/design
 Reliability

Electromigration
Skew: Clock Delay Varies With Position
Clock Distribution Methods
 RC-Tree
  Less capacitance
  More accuracy
  Flexible wiring
 Grids
  Reliable
  Less data dependency
  Tunable (late in design)
Shown here for final stage drivers driving F/F loads
RC-Trees
[Figure: H-tree, X-tree, and binary-tree topologies]
Asymmetric trees can be and are used due to uneven sink distribution, hard macros in floorplan (→ hierarchical clock distribution), etc.; the basic goal is to have even RC delays
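One common way to estimate "even RC delays" in a tree is the Elmore delay, which a later slide on meshes also mentions. The Python sketch below is a minimal version under stated assumptions: the driver resistance is ignored, each node carries a lumped capacitance, and nodes are numbered so that every parent precedes its children. The tree and its values are hypothetical.

```python
def elmore_delays(parent, r, c):
    """Elmore delay from the root (node 0) to every node of an RC tree.

    parent[i]: index of the parent of node i (parent[0] is None).
    r[i]: resistance of the wire from parent[i] to i (r[0] unused).
    c[i]: capacitance lumped at node i.
    Assumes parent[i] < i for every non-root node.
    """
    n = len(parent)
    downstream = list(c)
    for i in range(n - 1, 0, -1):   # bottom-up: total capacitance below each wire
        downstream[parent[i]] += downstream[i]
    delay = [0] * n
    for i in range(1, n):           # top-down: add R * downstream C for each wire
        delay[i] = delay[parent[i]] + r[i] * downstream[i]
    return delay


# Hypothetical asymmetric tree: sink 2 hangs off the root, sinks 3 and 4 off node 1.
parent = [None, 0, 0, 1, 1]
r = [0, 1, 1, 2, 2]
c = [0, 1, 1, 1, 1]
delay = elmore_delays(parent, r, c)
sink_skew = max(delay[2:]) - min(delay[2:])
print(delay, sink_skew)  # prints: [0, 3, 1, 5, 5] 4
```

Sinks 3 and 4 are balanced against each other, but sink 2 is much faster, so this tree would need a dummy load, a longer wire, or a different topology to even out its delays.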
Grids


Gridded clock distribution common on earlier DEC Alpha microprocessors
Advantages:
  Skew determined by grid density, not too sensitive to load position
  Clock signals available everywhere
  Tolerant to process variations
  Usually yields extremely low skew values
Disadvantages:
  Huge amount of wiring and power
  To minimize such penalties, need to make grid pitch coarser → lose the grid advantage
[Figure: predrivers driving a global grid]
Trees

H-tree (Bakoglu)
  One large central driver, recursive structure to match wirelengths
  Halve wire width at branching points to reduce reflections
Disadvantages
  Slew degradation along long RC paths
  Unrealistically large central driver
  - Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30° C)
  Non-uniform load distribution
  Inherently non-scalable (wire R growth)
Partial solution: intermediate buffers at branching points
[Figure courtesy of P. Zarkesh-Ha]
Buffered Tree
[Figure: buffered clock tree; the PLL feeds global buffers NGBuf, SGBuf, EGBuf, and WGBuf, followed by L2 and L3 buffer levels; a final-level buffer drives all clock loads within its region, and the other branches serve other regions of the chip]
Buffered H-tree
 Advantages




Ideally zero-skew
Can be low power (depending on skew requirements)
Low area (silicon and wiring)
CAD tool friendly (regular)
 Disadvantages

Sensitive to process variations
- Devices → Want same size buffers at each level of tree
- Wires → Want similar segment lengths on each layer in each source-sink path !!!

Local clocking loads inherently non-uniform
Tree Balancing
Some techniques:
a) Introduce dummy loads
b) Snaking of wirelength to match delays
Con: Routing area often more valuable than silicon
Examples From Processor Chips
H-Tree, Asymmetric RC-Tree (IBM)
Grids: DEC [Alphas]
Serpentines: Intel x86 [Young ISSCC97]
Examples From Processor Chips
[Figures: DEC-Alpha 21064 clock spines; DEC-Alpha 21064 RC delays; DEC-Alpha 21164 RC local delays; DEC-Alpha 21164 RC delays for global distribution (spine + grid)]
ReShape Clocks Example (High-End ASIC)
 Balanced, shielded H-tree for pre-clock distribution
 Mesh for block level distribution
Pre-clock 2 Level H-tree

All routes 5-6u M6/5, shielded with 1u grounds
~10 buffers per node
  E.g., ganged BUFx20’s
Output mesh must hit every sub-block
[Figure: 2-level H-tree driving the output mesh]
Block Level Mesh (.18u)
Clumps of 1-6 clock buffers, surrounded by capacitor pads
Shielded input and output m6 shorting straps
Pre-clock connects to input shorting straps
1u m5 ribs every 20 - 30 u (4 to 6 rows)
Max 600u stride
Problems with Meshes


Burn more power at low frequencies

Difficult for ‘spare’ clock domains that will not tolerate
regioning


Post placement (and routing) tuning required
Blocks more routing resources (solution, integrated
power distribution with ribs can provide shielding for
‘free’)
No ‘beneficial skew’ possible
Problems with Meshes (#2)
 Clock gating only easy at root
 Fighting tools to do analysis:



Clumped buffers a problem in Static Timing Analysis tools
Large shorted meshes a problem for STA tools
What does Elmore delay calculation look like for a non-tree?
 Need full extractions and spice-like simulation (e.g.
Avant! Star-Sim) to determine skew
Benefits of Meshes (#3)

Deterministic since shielded all the way down to rib
distribution

No ECO placement required: all buffers preplaced
before block placement

Low latency since uses shorted (= ganged, parallel)
drivers, therefore lower skew

ECO placements of FFs later do not require rebalance of
tree

“Idealized” clocking environment for concurrent RTL
design and timing convergence dance
Mesh Example
 ~ 100k flops
 6 blocks
Clock Skew Thermal Map
 Pre-tuning
Clock Skew Thermal Map #2
 50ps block/ 100ps global skew, post tuning
Alternative Clock Network Strategy


Globally – Tree
Power requirements reduced relative to global grid
  Smaller routing requirements, frees up global tracks
Trees balanced easily at global level
  Keeps global skew low (with minimal process variation)
Vertex Locations in a Bounded-Skew Tree

Given a skew bound, where can internal nodes of the given topology (e.g., a, b, v) be placed?
[Figure: topology with source s0, internal nodes v, a, b, and sinks s1 s2 s3 s4; the feasible locations of a, b, and v are drawn as regions labeled with skew values from 0 to 6]
Deferred-Merge Embedding (DME) Algorithm
Bottom-Up: build tree of merging regions corresponding to given topology
Special case: skew = 0 → merging segments
[Figure: topology with source s0, internal nodes v, a, b, and sinks s1 s2 s3 s4; merging regions mr(a), mr(b), mr(v) built bottom-up for skew bound B=4]
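For the zero-skew special case, each bottom-up merge must decide how to split the connecting wire between the two subtrees. The Python sketch below does this under the Elmore delay model, which is an assumption here since the slides do not state a delay model. It balances the two branch delays in closed form; all names and values are hypothetical.

```python
def branch_delay(t_sub, c_sub, wire_len, r_unit, c_unit):
    """Elmore delay through a wire of wire_len into a subtree (delay t_sub, cap c_sub)."""
    return t_sub + r_unit * wire_len * (c_unit * wire_len / 2 + c_sub)


def zero_skew_split(t1, c1, t2, c2, length, r_unit, c_unit):
    """Fraction x of the wire given to subtree 1 so that both branch delays match.

    A result outside [0, 1] means no point on the direct wire balances the
    delays, and the wire to the faster subtree would have to be lengthened.
    """
    wire_r = r_unit * length
    wire_c = c_unit * length
    return (t2 - t1 + wire_r * (c2 + wire_c / 2)) / (wire_r * (c1 + c2 + wire_c))


# Hypothetical subtrees: subtree 2 is already slower by 2 time units.
x = zero_skew_split(t1=0, c1=1, t2=2, c2=1, length=4, r_unit=1, c_unit=1)
d1 = branch_delay(0, 1, x * 4, 1, 1)
d2 = branch_delay(2, 1, (1 - x) * 4, 1, 1)
print(x, d1, d2)
```

The merge point moves away from the faster subtree (x is above 0.5), and the set of all points at these two wire distances from the child merging segments forms the parent's merging segment.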
Top-Down Embedding Phase of DME
Top-Down: choose embedding points within merging regions
[Figure: embedding of s0, v, a, and b chosen inside their merging regions for the topology over sinks s1 s2 s3 s4, with B=4]
Zero-Skew Example (555 sinks, 40 obstacles)
Outline
 Clocking
 Storage elements
 Clocking metrics and methodology
 Clock distribution
 Package and useful-skew degrees of freedom
 Clock power issues
 Gate timing models
Skew Reduction Using Package
• Most clock network latency occurs at global level (largest distances spanned)
• Latency → Skew
• With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming
Skew Reduction Using Package
[Figure: μP/ASIC flip-chip mounted on a package substrate through solder bumps; the system clock is routed in the substrate]
 Incorporate global clock distribution into the package
 Flip-chip packaging allows for high density, low parasitic access from substrate to IC
• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring
• Global skew reduced
• Lower capacitance → lower power
• Opens up global routing tracks
• Results not yet conclusive
Useful Skew (= cycle-stealing)
Zero skew
• Global skew constraint
• All skew is bad
Useful skew
• Local skew constraints
• Shift slack to critical paths
[Figure: three FFs connected by a fast path and a slow path, shown under zero skew and under useful skew, with the hold and setup timing slacks of each path]
W. Dai, UC Santa Cruz
Skew = Local Constraint

Timing is correct as long as the signal arrives in the permissible skew range

-d + thold < Skew < Tperiod - D - tsetup

D : longest path
d : shortest path
[Figure: skew axis between two FFs; race condition below -d + thold, safe (permissible range) in between, cycle time violation above Tperiod - D - tsetup]
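The permissible range can be computed directly from the inequality above. The Python sketch below is illustrative only; the function names and the numeric values (in ns) are hypothetical.

```python
def permissible_skew_range(d_short, d_long, t_period, t_setup, t_hold):
    """Return (lower, upper) bounds on skew between two FFs."""
    lower = -d_short + t_hold              # below this: race condition
    upper = t_period - d_long - t_setup    # above this: cycle time violation
    return lower, upper


def classify_skew(skew, lower, upper):
    if skew <= lower:
        return "race condition"
    if skew >= upper:
        return "cycle time violation"
    return "safe"


# Hypothetical path pair: shortest path 2 ns, longest path 6 ns, 10 ns clock.
lower, upper = permissible_skew_range(d_short=2, d_long=6, t_period=10,
                                      t_setup=1, t_hold=1)
print(lower, upper, classify_skew(0, lower, upper))  # prints: -1 3 safe
```

Zero skew is safe here but sits only 1 ns from the race boundary; the midpoint of the range (1 ns) would leave 2 ns of margin on each side.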
Skew Scheduling for Design Robustness
 Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge
 Can solve a linear program to maximize robustness = determine prescribed sink skews
[Figure: three FFs connected by paths of 2 ns and 6 ns, with T = 6 ns; skew schedule “0 0 0” is at verge of violation, while “2 0 2” gives more safety margin]
Potential Advantages of Useful Skew

Reduce peak current consumption by distributing the FF switch point in the range of permissible skew
[Figure: CLK current waveforms for 0-skew vs. U-skew]
Affords extra margin to increase clock frequency or reduce sizing (= power)
Conventional Zero-Skew Flow
Synthesis
Placement
0-Skew Clock Synthesis
Clock Routing
Signal Routing
Extraction & Delay Calculation
Static Timing Analysis
Useful-Skew Flow
Existing Placement
U-Skew Clock Synthesis
Clock Routing
Signal Routing
Extraction & Delay Calculation
Static Timing Analysis
Detailed steps within U-skew clock synthesis and clock routing: permissible range generation, initial skew scheduling, clock tree topology synthesis, clock net routing, clock timing verification
Outline
 Clocking
 Storage elements
 Clocking metrics and methodology
 Clock distribution
 Package and useful-skew degrees of freedom
 Clock power issues
 Gate timing models
Clock Power
 Power consumption in clocks due to:



Clock drivers
Long interconnections
Large clock loads – all clocked elements (latches, FF’s) are driven
 Different components dominate


Depending on type of clock network used
Ex. Grid – huge pre-drivers & wire cap. drown out load cap.
Clock Power Is LARGE
P = α C Vdd² f
Not only is the clock capacitance large, it switches every cycle!
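A quick Python sketch of the power formula shows why the clock dominates. The capacitance, supply, frequency, and the 0.1 activity factor for data nets are hypothetical values chosen only for illustration.

```python
def dynamic_power(activity, capacitance, vdd, freq):
    """P = alpha * C * Vdd^2 * f, in watts for SI inputs."""
    return activity * capacitance * vdd ** 2 * freq


C_TOTAL = 1e-9   # hypothetical 1 nF of switched capacitance
VDD = 1.5        # volts
FREQ = 1e9       # 1 GHz

clock_power = dynamic_power(1.0, C_TOTAL, VDD, FREQ)   # clock: alpha = 1
data_power = dynamic_power(0.1, C_TOTAL, VDD, FREQ)    # assumed data activity
print(clock_power, data_power)
```

For the same capacitance, the clock net burns about ten times the power of a data net with activity 0.1, roughly 2.25 W versus 0.225 W in this example.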
Low-Power Clocking

Gated clocks



Prevent switching in areas of chip not being used
Easier in static designs
Edge-triggered flops in ARM rather than transparent latches
in Alpha


Reduced load on clock for each latch/flop
Eliminated spurious power-consuming transitions during latch flow-through (transparency)
Clock Area

Clock networks consume silicon area (clock drivers, PLL,
etc.) and routing area


Routing area is most vital
Top-level metals are used to reduce RC delays




These levels are precious resources (unscaled)
Power routing, clock routing, key global signals
Reducing area also reduces wiring capacitance and power
Typical #’s: Intel Itanium – 4% of M4/5 used in clock routing
Clock Slew Rates

To maintain signal integrity and latch performance, minimum
slew rates are required



Too slow – clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew (ps)], more short-circuit power for large clock drivers
Too fast – burns too much power, overdesigned network, enhanced ground bounce
Rule-of-thumb: Trise and Tfall of clock are each between 10-20% of clock period (10% - aggressive target)

1 GHz clock; Trise = Tfall = 100-200ps
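The rule of thumb and the example setup-time model can be combined in a short Python sketch. The function names are illustrative; the 200 + 0.33 * Tslew expression is the one quoted on the slide.

```python
def clock_edge_budget_ps(freq_hz, lo=0.10, hi=0.20):
    """Target range for Trise and Tfall as a fraction of the clock period."""
    period_ps = 1e12 / freq_hz
    return lo * period_ps, hi * period_ps


def setup_time_ps(t_slew_ps):
    """Setup time model quoted on the slide: Tsetup = 200 + 0.33 * Tslew (ps)."""
    return 200 + 0.33 * t_slew_ps


fast_edge, slow_edge = clock_edge_budget_ps(1e9)
print(fast_edge, slow_edge, setup_time_ps(fast_edge), setup_time_ps(slow_edge))
```

At 1 GHz this reproduces the 100-200 ps range, and under the quoted model the slower edge costs about 33 ps of additional setup time.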
Example: Alpha 21264
Grid + H-tree approach
Power = 32% of total
Wire usage = 3% of metals 3 & 4
4 major clock quadrants, each with a large driver connected to local grid structures
Alpha 21264 Skew Map
Ref: Compaq, ASP-DAC00
Power vs. Skew


Fundamental design decision
Meeting skew requirements is easy with unlimited power budget

Wide wires reduce RC product but increase total C
Driver upsizing reduces latency (→ reduces skew as well) but increases buffer cap
SOC context: plastic package → power limit is 2-3 W
Clock Distribution Trends

Timing



Clock period dropping fast, skew must follow
Slew rates must also scale with cycle time
Jitter – PLL’s get better with CMOS scaling but other sources of noise increase
- Power supply noise more important
- Switching-dependent temperature gradients

Materials

Cu reduces RC slew degradation, potential skew
Low-k decreases power, improves latency, skew, slews

Power

Complexity, dynamic logic, pipelining → more clock sinks
Larger chips → bigger clock networks
Outline
 Clocking
 Storage elements
 Clocking metrics and methodology
 Clock distribution
 Package and useful-skew degrees of freedom
 Clock power issues
 Gate timing models
Gate Timing Characterization
[Figure: gate with inputs A, B, D and output F, driving load capacitance CL]
 “Extract” exact transistor characteristics from layout


Transistor width, length, junction area and perimeter
Local wire length and inter-wire distance
 Compute all transistor and wire capacitances
Cell Timing Characterization
 Delay tables generated using a detailed transistor-level
circuit simulator SPICE (differential-equations solver)
 For a number of different input slews and load
capacitances simulate the circuit of the cell


Propagation time (50% Vdd at input to 50% at output)
Output slew (10% Vdd at output to 90% Vdd at output)
[Figure: input and output voltage waveforms vs. time (up to Vdd), marking the propagation time tpd and the output slew tslew]
Delay and Transition Measurement
[Figure: waveforms with 20%, 50%, and 80% thresholds marked; Transition is measured between the 20% and 80% points, Cell Delay at the 50% points]
Non-linear effects reflected in tables
 DG = f (CL, Sin) and Sout = f (CL, Sin)
  Non-linear
 Interpolate between table entries
 Interpolation error is usually below 10% of SPICE
[Figure: two lookup tables indexed by output capacitance and input slew; one gives the delay at the gate (including intrinsic delay), the other gives the output slew of the resulting waveform]
Timing Library Example (.lib)
library(my_lib) {
  delay_model : table_lookup;
  library_features (report_delay_calculation);
  /* units */
  time_unit : "1ns";
  voltage_unit : "1V";
  current_unit : "1mA";
  leakage_power_unit : 1uW;
  capacitive_load_unit(1,pf);
  pulling_resistance_unit : "1kohm";
  /* library-wide defaults */
  default_fanout_load : 1.0;
  default_inout_pin_cap : 1.0;
  default_input_pin_cap : 1.0;
  default_output_pin_cap : 0.0;
  default_cell_leakage_power : 0.0;
  /* nominal (characterization) conditions */
  nom_voltage : 1.08;
  nom_temperature : 125.0;
  nom_process : 1.0;
  slew_derate_from_library : 0.500000;
  operating_conditions("slow_125_1.08") {
    process : 1.0 ;
    temperature : 125 ;
    voltage : 1.08 ;
    tree_type : "worst_case_tree" ;
  }
  default_operating_conditions : slow_125_1.08 ;
  /* 2-D table template: input transition x output load */
  lu_table_template("load") {
    variable_1 : input_net_transition;
    variable_2 : total_output_net_capacitance;
    index_1( "1, 2, 3, 4" );
    index_2( "1, 2, 3, 4" );
  }
  cell("INV") {
    pin(A) {
      max_transition : 1.500000;
      direction : input;
      rise_capacitance : 0.0739000;
      fall_capacitance : 0.0703340;
      capacitance : 0.07278646;
    }
    pin(Z) {
      direction : output;
      function : "!A";
      max_transition : 1.500000;
      max_capacitance : 5.1139;
      timing() {
        related_pin : "A";
        cell_rise(load) {
          index_1( "0.0375, 0.2329, 0.6904, 1.5008" );
          index_2( "0.0010, 0.9788, 2.2820, 5.1139" );
          values ( \
            "0.013211, 0.071051, 0.297500, 0.642340", \
            "0.028657, 0.110849, 0.362620, 0.707070", \
            "0.053289, 0.165930, 0.496550, 0.860400", \
            "0.091041, 0.234440, 0.661840, 1.091700" );
        }
        cell_fall(load) {
          index_1( "0.0326, 0.1614, 0.5432, 1.5017" );
          index_2( "0.0010, 0.4249, 3.6538, 8.1881" );
          values ( \
            "0.009472, 0.072284, 0.317370, 0.688390", \
            "0.009992, 0.095862, 0.360530, 0.731610", \
            "0.009994, 0.126620, 0.477260, 0.867670", \
            "0.009996, 0.144150, 0.644140, 1.127700" );
        }
        fall_transition(load) {
          index_1( "0.0326, 0.1614, 0.4192, 1.5017" );
          index_2( "0.0010, 0.4249, 2.1491, 8.1881" );
          values ( \
            "0.011974, 0.071668, 0.317800, 1.189560", \
            "0.033212, 0.101182, 0.328540, 1.189562", \
            "0.059282, 0.155052, 0.389900, 1.202360", \
            "0.162830, 0.317380, 0.628160, 1.441260" );
        }
        rise_transition(load) {
          index_1( "0.0375, 0.1650, 0.5455, 1.5078" );
          index_2( "0.0010, 0.4449, 1.7753, 5.1139" );
          values ( \
            "0.016690, 0.115702, 0.418200, 1.189060", \
            "0.038256, 0.139336, 0.422960, 1.189081", \
            "0.076248, 0.213280, 0.491820, 1.203700", \
            "0.170992, 0.353120, 0.694740, 1.384760" );
        }
      }
    }
  }
}
Delay Calculation
[Figure: cell with input transition times of 0.1ns and 0.12ns driving a 1.0pf load; output fall transition 0.147ns]

Cell Fall
Cap\Tr   0.05   0.2    0.5
0.01     0.02   0.16   0.30
0.5      0.04   0.32   0.60
2.0      0.08   0.64   1.20

Cell Rise
Cap\Tr   0.05   0.2    0.5
0.01     0.03   0.18   0.33
0.5      0.06   0.36   0.66
2.0      0.09   0.72   1.32

Fall Transition
Cap\Tr   0.05   0.2    0.5
0.01     0.01   0.09   0.15
0.5      0.03   0.27   0.45
2.0      0.06   0.54   0.90

Fall delay = 0.178ns
Rise delay = 0.261ns
Fall transition = 0.147ns
Rise transition = …
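The results on this slide can be reproduced by bilinear interpolation between the four surrounding table entries, as in the Python sketch below. Pairing the 0.1ns input transition with the falling output and the 0.12ns input transition with the rising output is an inference: it is the assignment that reproduces the slide's numbers at a 1.0pf load. The function and variable names are illustrative.

```python
from bisect import bisect_right

CAPS = [0.01, 0.5, 2.0]   # table rows: output load (pf)
TRS = [0.05, 0.2, 0.5]    # table columns: input transition (ns)

CELL_FALL = [[0.02, 0.16, 0.30], [0.04, 0.32, 0.60], [0.08, 0.64, 1.20]]
CELL_RISE = [[0.03, 0.18, 0.33], [0.06, 0.36, 0.66], [0.09, 0.72, 1.32]]
FALL_TRANSITION = [[0.01, 0.09, 0.15], [0.03, 0.27, 0.45], [0.06, 0.54, 0.90]]


def bracket(axis, value):
    """Indices of the two axis points around value, plus the fractional position."""
    hi = min(max(bisect_right(axis, value), 1), len(axis) - 1)
    lo = hi - 1
    return lo, hi, (value - axis[lo]) / (axis[hi] - axis[lo])


def lookup(table, cap, tr):
    """Bilinear interpolation; values outside the table are linearly extrapolated."""
    i0, i1, u = bracket(CAPS, cap)
    j0, j1, v = bracket(TRS, tr)
    row0 = table[i0][j0] + v * (table[i0][j1] - table[i0][j0])
    row1 = table[i1][j0] + v * (table[i1][j1] - table[i1][j0])
    return row0 + u * (row1 - row0)


fall_delay = round(lookup(CELL_FALL, cap=1.0, tr=0.1), 3)
rise_delay = round(lookup(CELL_RISE, cap=1.0, tr=0.12), 3)
fall_transition = round(lookup(FALL_TRANSITION, cap=1.0, tr=0.1), 3)
print(fall_delay, rise_delay, fall_transition)  # prints: 0.178 0.261 0.147
```

Production delay calculators may use other interpolation schemes, so this is a sketch of the idea and not a description of any particular tool.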
PVT (Process, Voltage, Temperature) Derating
Actual cell delay = Original delay x KPVT
PVT Derating: Example + Min/Typ/Max Triples
Proc_var (0.5:1.0:1.3)
Voltage (5.5:5.0:4.5)
Temperature (0:20:50)
KP = 0.80 : 1.00 : 1.30
KV = 0.93 : 1.00 : 1.08
KT = 0.80 : 1.07 : 1.35
KPVT = 0.60 : 1.07 : 1.90
Cell delay = 0.261ns
Derated delay = 0.157 : 0.279 : 0.496 {min : typical : max}
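The example follows from multiplying the three factors, KPVT = KP x KV x KT, and scaling the cell delay. The Python sketch below reproduces the slide's numbers; rounding KPVT to two decimals before scaling is an assumption made because it matches the values shown. The names are illustrative.

```python
def k_pvt(kp, kv, kt):
    """Combined derating factor, rounded to two decimals as on the slide."""
    return round(kp * kv * kt, 2)


CELL_DELAY_NS = 0.261

# (KP, KV, KT) for the min, typical, and max corners from the slide.
corners = {
    "min": (0.80, 0.93, 0.80),
    "typical": (1.00, 1.00, 1.07),
    "max": (1.30, 1.08, 1.35),
}

factors = {name: k_pvt(*ks) for name, ks in corners.items()}
derated = {name: round(CELL_DELAY_NS * k, 3) for name, k in factors.items()}
print(factors)
print(derated)
```

Without the intermediate rounding, the min and max delays would come out about 1 ps lower than the slide's figures.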
Conservatism of Gate Delay Modeling
 True gate delay depends on input arrival time
patterns


STA will assume that only 1 input is switching
Will use worst slope among several inputs
[Figure: gate with inputs A, B, D, output F, and load CL; two voltage vs. time plots (up to Vdd) show tpd to F, one with both A and B switching and one with only A switching]