Lecture_10_Timing_129

Timing Issues
Mohammad Sharifkhani
Reading
• Textbook II, Chapter 10
• Textbook I, Chapters 12 and 13
Motivation
• Time is the essence!
– We do things in order, and so do processors
• Procedural dependency
• Resource Reusability
• Synchronous architectures are preferred
– Ease of implementation
– Predictability
– Compatibility with well known arithmetic algorithms
• A reference clock plays a key role
– We usually neglect the non-idealities in the clock in the design
cycle
Timing
Clock frequency
Two signals
Signals that can only transition at
predetermined times with respect to a
clock signal are called
synchronous, mesochronous, or plesiochronous.
An asynchronous signal can transition
at any arbitrary time.
Definitions
data passed between two different clock domains
Mesochronous Timing
Mesochronous Timing
unknown interconnect delay
Plesiochronous
• two interacting modules have independent
clocks generated from separate crystal
oscillators
Asynchronous Interconnect
•No clock is needed
•Speed is determined by job completion
Hand Shaking
• The four-phase handshake is
level-sensitive, while the two-phase handshake is edge-triggered (fewer transitions at
the expense of edge-triggered
circuitry).
• System A places data on the
bus. It then raises Req to
indicate that the data is valid.
• System B samples the data
when it sees a high value on
Req and raises Ack to indicate
that the data has been captured.
System A lowers Req, then
system B lowers Ack.
Req is not synchronous to clkB, so a
synchronizer is needed
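The four-phase sequence described above can be traced step by step. The following is a minimal sketch, not a circuit model: the function name `four_phase_transfer` and the data value are illustrative assumptions, and the synchronizer delay on Req is ignored.

```python
def four_phase_transfer(data):
    """Trace one four-phase (return-to-zero) transfer from system A to system B.

    Returns the value captured by B and the list of (event, Req, Ack) steps.
    """
    req, ack = 0, 0
    steps = []

    bus = data                      # A places data on the bus
    req = 1                         # A raises Req: data is valid
    steps.append(("A raises Req", req, ack))

    captured = bus                  # B samples the data while Req is high
    ack = 1                         # B raises Ack: data has been captured
    steps.append(("B raises Ack", req, ack))

    req = 0                         # A lowers Req
    steps.append(("A lowers Req", req, ack))

    ack = 0                         # B lowers Ack: both wires are back at 0
    steps.append(("B lowers Ack", req, ack))
    return captured, steps


captured, steps = four_phase_transfer(0x5A)
for event, req, ack in steps:
    print(f"{event:14s} Req={req} Ack={ack}")
```

Each transfer costs four signal transitions and always ends with Req and Ack back at 0.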
Hand Shaking (Cont’)
Synchronous Timing
CLK
In
R1
Cin
Combinational
Logic
Cout
R2
Out
A quick look
Timing Definitions and
Basics
Latch Parameters
Transparent
Opaque
T
Clk
PWm
D
Q
tsu
thold
tc-q
td-q
Delays can be different for rising and falling data transitions
Register Parameters
T
Clk
thold
D
tsu
Q
tc-q
Delays can be different for rising and falling data transitions
Clock Uncertainties
Sources of clock uncertainty
1 Clock Generation
2 Devices
3 Interconnect
4 Power Supply
5 Temperature
6 Capacitive Load
7 Coupling to Adjacent Lines
Clock Nonidealities
• Clock skew
– Spatial variation in temporally equivalent clock edges;
deterministic + random, tSK
• Clock jitter
– Temporal variations in consecutive edges of the clock
signal; modulation + random noise
– Cycle-to-cycle (short-term) tJS
– Long term tJL
• Variation of the pulse width
– Important for level sensitive clocking
Clock Skew and Jitter
Clk
tSK
Clk
tJS
• Both skew and jitter affect the effective cycle time
• Only skew affects the race margin
Clock Skew and Jitter
Clk
tSK
Clk
tJS
• Do not touch the clock signal if not necessary!
– Sometimes the simplest architecture is the safest
– But not necessarily the lowest power! 
Clock skew and Jitter
• Data and state
independent clock
distribution is desired
• Enabled FF is a popular
choice in the design
• Consider clock load on
power!
Clock Skew
[Figure: histogram of clock arrival times (# of registers vs. Clk delay). The earliest Clk edge occurs at nominal – δ/2 and the latest at nominal + δ/2; the insertion delay and the maximum Clk skew δ are marked.]
Positive and Negative Skew
[Figure: register pipeline In, R1, combinational logic, R2, combinational logic, R3, with local clocks tCLK1, tCLK2, tCLK3 derived from CLK through delay elements. (a) Positive skew. (b) Negative skew.]
Positive Skew
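[Figure: CLK1 and CLK2 waveforms with period TCLK and clock edges 1 to 4; CLK2 is delayed by δ relative to CLK1; the intervals TCLK + δ and δ + th are marked.]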
Launching edge arrives before the receiving edge
Negative Skew
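[Figure: CLK1 and CLK2 waveforms with period TCLK and clock edges 1 to 4; CLK2 arrives earlier than CLK1 by |δ|; the interval TCLK + δ is marked.]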
Receiving edge arrives before the launching edge
Timing Constraints (positive skew)
[Figure: In, R1 (clocked by tCLK1), combinational logic, R2 (clocked by tCLK2). Register parameters: tc-q, tc-q,cd, tsu, thold. Logic delays: tlogic, tlogic,cd.]
Minimum cycle time:
T + δ > tc-q + tsu + tlogic
Positive δ leaves more time to process the data
Worst case is when the receiving edge arrives early (negative δ)
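As a quick numeric check of the cycle-time constraint T + δ > tc-q + tsu + tlogic, the sketch below computes the bound on T for a given skew. The function name and the delay values (in ps) are hypothetical, chosen only for illustration.

```python
def min_clock_period(t_cq, t_logic, t_su, skew):
    """Lower bound on T from T + skew > t_cq + t_logic + t_su.

    All arguments share one time unit. A positive skew (receiving edge late)
    relaxes the bound; a negative skew tightens it.
    """
    return t_cq + t_logic + t_su - skew


# Hypothetical delays in ps
print(min_clock_period(t_cq=100, t_logic=800, t_su=50, skew=0))    # 950
print(min_clock_period(t_cq=100, t_logic=800, t_su=50, skew=50))   # 900
print(min_clock_period(t_cq=100, t_logic=800, t_su=50, skew=-50))  # 1000
```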
Timing Constraints (positive skew)
[Figure: In1, R1 (clocked by tCLK1), combinational logic, R2 (clocked by tCLK2). Register parameters: tc-q, tc-q,cd, tsu, thold. Logic delays: tlogic, tlogic,cd.]
Hold time constraint:
t(c-q, cd) + t(logic, cd) > thold + δ
Otherwise it can not latch In1
before it changes after CLK1 edge
Worst case is when receiving edge arrives late
Race between data and clock (positive skew)
δ < t(c-q, cd) + t(logic, cd) – thold, independent of T
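The race condition can be checked numerically: the hold constraint t(c-q, cd) + t(logic, cd) > thold + δ bounds the skew and does not involve the clock period. A minimal sketch with hypothetical function names and delay values (in ps):

```python
def max_positive_skew(t_cq_cd, t_logic_cd, t_hold):
    """Upper bound on skew from t_cq_cd + t_logic_cd > t_hold + skew."""
    return t_cq_cd + t_logic_cd - t_hold


def has_race(skew, t_cq_cd, t_logic_cd, t_hold):
    """True if the hold constraint is violated; the clock period plays no role."""
    return skew >= max_positive_skew(t_cq_cd, t_logic_cd, t_hold)


# Hypothetical contamination delays in ps
print(max_positive_skew(60, 40, 30))   # 70
print(has_race(50, 60, 40, 30))        # False
print(has_race(80, 60, 40, 30))        # True
print(has_race(-50, 60, 40, 30))       # False
```

With these example numbers a negative skew never triggers the race, since the bound of 70 ps is positive.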
Considerations
• δ > 0—This corresponds to a clock routed
in the same direction as the flow of the
data through the pipeline. The skew has to
be strictly controlled. If this constraint is
not met, the circuit malfunctions
independent of the clock period.
Question
• Would there be any race if the skew is
negative?
• What would you do to avoid race?
Negative Skew
• δ < 0—When the clock is routed in the
opposite direction of the data, the skew is
negative and the condition to avoid a race is
unconditionally met. The circuit operates
correctly independent of the skew. The
skew reduces the time available for actual
computation so that the clock period, T,
has to be increased by |δ|.
If race (hold time) is a problem, route the clock in the opposite direction
Impact of Jitter
[Figure: clock waveform with period TCLK and edge uncertainty from –tjitter to +tjitter; registers (tc-q, tc-q,cd, tsu, thold, tjitter) clocked by CLK and combinational logic (tlogic, tlogic,cd) fed from In.]
Both skew and jitter should be accounted for in feedback structures
Longest Logic Path in
Edge-Triggered Systems
TSU
Clk
TClk-Q
Latest point
of launching
considering jitter
TLM
T
Earliest arrival
of next cycle
TJI + δ
Clock Constraints in
Edge-Triggered Systems
If the launching edge is late and the receiving edge is early, the data will not be too late if:
Tc-q + TLM + TSU < T – TJI,1 – TJI,2 – δ
Minimum cycle time is determined by the maximum delays through the logic:
Tc-q + TLM + TSU + δ + 2 TJI < T
Skew can be either positive or negative
Shortest Path
Earliest point
of launching
Clk
Clk
TClk-Q TLm
TH
Nominal
clock edge
Data must not arrive
before this time
Clock Constraints
in Edge-Triggered Systems
If the launching edge is early and the receiving edge is late:
Tc-q + TLm – TJI,1 > TH + TJI,2 + δ
Minimum logic delay:
Tc-q + TLm > TH + 2 TJI + δ
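The two edge-triggered bounds with skew δ and jitter TJI can be evaluated side by side. The sketch below assumes the same jitter on the launching and receiving edges (TJI,1 = TJI,2 = TJI), as in the simplified forms above. Function names and numbers (in ps) are hypothetical.

```python
def min_cycle_time(t_cq, t_lm_max, t_su, skew, t_ji):
    """Bound on T from Tc-q + TLM + TSU + skew + 2*TJI < T (longest path)."""
    return t_cq + t_lm_max + t_su + skew + 2 * t_ji


def min_logic_delay(t_cq, t_h, skew, t_ji):
    """Bound on the shortest logic delay from Tc-q + TLm > TH + 2*TJI + skew."""
    return t_h + 2 * t_ji + skew - t_cq


# Hypothetical values in ps
print(min_cycle_time(t_cq=100, t_lm_max=800, t_su=50, skew=40, t_ji=20))  # 1030
print(min_logic_delay(t_cq=100, t_h=30, skew=40, t_ji=20))                # 10
```

In both bounds the skew and twice the jitter add directly to the requirement, which is why they are budgeted together.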
False path
Path 1 (5 tgate) never exercised.
If A = 1, the critical path goes through OR1
and OR2. If A = 0 and B = 0, the critical
path is through I1,OR1 and OR2
(corresponding to a delay of 3 tgate). For
the case when A= 0 and B =1, the critical
path is through I1,OR1, AND3 and OR2.
Does not depend on C,D.
How to counter Clock Skew?
[Figure: data and clock routing among registers (REG) and logic blocks between In and Out, with the clock distribution paths labeled Negative Skew and Positive Skew.]
Sources of uncertainty
Device variation
• Variation
• Matching
– Poly orientation
– Dopant profiles
• Can be modeled and
compensated for
Interconnect variation (ILD)
Pattern and ILD correlation
Use of fillers is necessary
Temp. and Power
• Temp.
– Time varying (millisecond)
– Effect of clock gating
– Has a gradient → systematic → can be compensated for
• Power
– Instantaneous IR Drop (switching activity)
– Jitter (short pulses, data dependent)
– Can not be compensated for (only decoupling caps)
Data dependent loading
Capacitive coupling and X-talk works the same way.
It is modeled as a form of jitter due to its random nature
Clock Distribution
H-tree
CLK
Clock is distributed in a tree-like fashion
Example
• Clock H-Tree
– Clock skew: time
difference between
the arrival time of
the clock signal
between two leaves
– Identical branches
and leaves
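Given the arrival time of the clock at each leaf, the skew defined above is the largest pairwise difference, i.e. latest minus earliest arrival. A minimal sketch, with a hypothetical function name and made-up arrival times in ps:

```python
def clock_skew(arrival_times_ps):
    """Worst-case skew: latest minus earliest clock arrival among the leaves."""
    return max(arrival_times_ps) - min(arrival_times_ps)


# An ideal H-tree with identical branches and leaves: every leaf sees the same delay
ideal_leaves = [500] * 16
# Hypothetical arrival times once the branches no longer match exactly
perturbed_leaves = [500, 512, 495, 503]

print(clock_skew(ideal_leaves))      # 0
print(clock_skew(perturbed_leaves))  # 17
```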
Example
• Considering three parameters:
– Both FETs and wires; 64 samples + main
buffer
– All deterministic factors are nulled out → only
within-chip variation is considered
– Random ΔL of FETs with distribution N(0, 0.035 um)
– Random ΔW of wires with distribution N(0, 0.25 um)
– Spatial ΔL: ΔL = w0 + wx·x + wy·y
Example
Example
• Results
– Random ΔL: 139 ps vs. 171 ps
without considering spatial constraints
– Random ΔW: 41 ps vs. 49 ps
– Without considering spatial constraints, the worst
case is too pessimistic
More realistic H-tree
10 balanced segments
Each segment contains 580 drivers
All RC-matched
If we leave the clock tree for the last minute, we may end up with multiple timing constraint violations!
[Restle98]
The Grid System
[Figure: clock grid fed by four GCLK drivers]
Absolute delay is minimized
Allows late design changes
•No rc-matching
•Large power
Examples
• Alpha 21064 (0.75um) 200MHz
• Clock load 3.25nF (40%)
• Skew < 200pSec (10%)
Example: DEC Alpha 21164
Clock Frequency: 300 MHz - 9.3 Million Transistors
Total Clock Load: 3.75 nF
Power in Clock Distribution network : 20 W (out of 50)
Uses Two Level Clock Distribution:
• Single 6-stage driver at center of chip
• Secondary buffers drive left and right side
clock grid in Metal3 and Metal4
Total driver size: 58 cm!
21164 Clocking
• 2 phase single wire clock, distributed globally
• 2 distributed driver channels
– Reduced RC delay/skew
– Improved thermal distribution
– 3.75nF clock load
– 58 cm final driver width
• Local inverters for latching
• Conditional clocks in caches to reduce power
• More complex race checking
• Device variation
• Skew: 90pSec (65pSec effective)
[Figure: clock waveform with tcycle = 3.3ns, trise = 0.35ns, tskew = 150ps]
[Figure: location of clock driver on die (pre-driver, final drivers)]
21164 Clocking
• Clock buffers carefully sized to minimize
the skew
• The direction of the clock is considered
• One gate between the latches
• Dummy fillers → (increase cap)
– Dummies are shielded
Reducing Skew
1. Balance clock paths from a central distribution source to individual clocking
elements using H-tree structures.
2. The use of local clock grids (instead of routed trees) can reduce skew at the cost of
increased capacitive load and power dissipation.
3. If data-dependent clock load variations cause significant jitter, differential registers
that have a data-independent clock load should be used.
– The use of gated clocks to save power also results in data-dependent clock load and increased
jitter. In clock networks where the fixed load is large (e.g., using clock grids), the data-dependent
variation might not be significant.
4. If data flows in one direction, route data and clock in opposite directions. This
eliminates races at the cost of performance.
5. Shield clock wires from adjacent signal wires.
6. ILD: dummy fills.
7. Temperature: delay-locked loops, as discussed later in this chapter, can easily
compensate for temperature variations.
8. Power supply variation: on-chip decoupling capacitors. Unfortunately, decoupling
capacitors require a significant amount of area, and efficient packaging solutions must
be leveraged to reduce chip area.
Clock Drivers
Clock Skew in Alpha Processor
EV6 (Alpha 21264) Clocking
600 MHz – 0.35 micron CMOS
[Figure: global clock waveform (tcycle = 1.67ns, trise = 0.35ns, tskew = 50ps) and die photo with PLL]
• 2 Phase, with multiple conditional
buffered clocks
– 2.8 nF clock load
– 40 cm final driver width
• Local clocks can be gated “off” to
save power
• Reduced load/skew
• Reduced thermal issues
• Multiple clocks complicate race
checking
21264 Clocking
Hierarchical clocking
Trade-off between power and skew
Flexibility in types of clocks at each region
Not shielded
EV6 Clock Results
[Figure: GCLK skew map (at Vdd/2 crossings), scale 5 to 50 ps, and GCLK rise time map (20% to 80%, extrapolated to 0% to 100%), scale 300 to 345 ps]
EV7 Clock Hierarchy
Active Skew Management and Multiple Clock Domains
[Figure: clock hierarchy with PLL, SYSCLK, three DLLs, and the clock domains NCLK (Mem Ctrl), GCLK (CPU Core), L2L_CLK (L2 Cache) and L2R_CLK (L2 Cache)]
+ widely dispersed drivers
+ DLLs compensate static and low-frequency variation
+ divides design and verification effort
+ tailored clocks
- DLL design and verification is added work
Latch based timing
• We can have combinational circuits between the
two latches of a FF
– More flexibility in terms of timing
Flip-Flop – Based Timing
[Figure: flip-flop-based timing over one clock cycle (φ = 1, φ = 0), showing skew, flip-flop delay (TClk-Q, TSU) and logic delay. Representation after M. Horowitz, VLSI Circuits 1996.]
Latch timing
When data arrives at a transparent latch, it passes through after tD-Q:
the latch is a ‘soft’ barrier
When data arrives at a closed latch, it waits for Clk and leaves after tClk-Q:
the data has to be ‘re-launched’
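Single-Phase Clock with
Latches
[Figure: a latch and a logic block clocked by Clk; the waveform marks the period P, the pulse width PW during which the latch is transparent, and the skews Tskl and Tskt on the clock edges.]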
Preventing late arrivals
Case 1:
- The LM can
start ahead of
time
- c2q limits
Case 2: d2q limits
Lgk can
still
operate
Preventing late arrivals
Preventing Premature Arrivals
Data should not pass through the
latch more than once during its
transparent mode
Otherwise the data loops within the transparent window of time
Single latch timing
Latch-Based Design
L1 latch is transparent when φ = 0
L2 latch is transparent when φ = 1
[Figure: L1 latch, logic, L2 latch, logic in sequence, clocked by φ]
Latch-Based Timing
[Figure: static logic between the L1 and L2 latches over one cycle of φ (L2 transparent when φ = 1, L1 transparent when φ = 0), with the skew marked]
Can tolerate skew!
Long Path 1: hits L2 while it is transparent → goes through L2
Short Path 1: hits L2 while it is closed and has to wait until L2 becomes transparent
Latch based timing
Trans.
when
high
Trans.
when low
Slack-borrowing
[Figure: In, L1 (CLK1), CLB_A (tpd,A), L2 (CLK2), CLB_B (tpd,B), L1 (CLK1), with nodes a, b, c, d, e. Timing diagram over one period TCLK with clock edges (1) to (4), showing when a to e become valid, the delays tpd,A, tDQ and tpd,B, and the slack passed to the next stage.]
CLB_B starts before edge (3) arrives to latch its input: since CLB_A finished earlier than (3), the extra
time is passed to CLB_B. Likewise, e is valid before edge (4) latches the input of the next CLB.
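The "soft barrier" behavior that makes slack passing possible can be sketched as a small timing model. This is an assumption-laden illustration: the function name, the transparent windows and all numbers (arbitrary time units) are hypothetical, and setup time is ignored.

```python
def latch_departure(arrival, opens, closes, t_dq):
    """Time at which data leaves a latch that is transparent during [opens, closes].

    Data arriving before the latch opens waits for the opening edge; data arriving
    while the latch is transparent passes straight through after t_dq.
    """
    if arrival > closes:
        raise ValueError("data arrived after the latch closed")
    return max(arrival, opens) + t_dq


T_DQ = 5
# Hypothetical windows: L1 transparent during [0, 50], L2 during [50, 100]
b = latch_departure(arrival=-10, opens=0, closes=50, t_dq=T_DQ)

c_fast = b + 30                                    # CLB_A takes 30: c is early
d_fast = latch_departure(c_fast, 50, 100, T_DQ)    # waits for L2 to open

c_slow = b + 60                                    # CLB_A takes 60: c arrives in the window
d_slow = latch_departure(c_slow, 50, 100, T_DQ)    # passes through immediately

print(b, d_fast, d_slow)   # 5 55 70
```

In the second case CLB_B starts at time 70 rather than at a clock edge, so whatever remains of the L2 window is available to the next stage.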
Example
T=125
L4
L4 Becomes
transp. at edge
 no problem
when exactly f
arrives
Design consideration
Hold time violation
Data available for CLL
If the falling edge of clk2 comes with too
much skew, THL might not be able to
latch the previous data because of hold
time violation (ie, D2 is overwritten too
quickly after the edge)
Domino logic with delays
Clock skew
No time slack borrowing
Skew tolerant domino
Can we borrow time?
Multiphase
Time borrowing is possible
Self-timed and Asynchronous
Design
Functions of clock in synchronous design
1) Acts as completion signal
2) Ensures the correct ordering of events
Truly asynchronous design
1) Completion is ensured by careful timing analysis
2) Ordering of events is implicit in logic
Self-timed design
1) Completion ensured by completion signal
2) Ordering imposed by handshaking protocol
Synchronous Pipelined
Datapath
In
CLK
R1
D Q
tpd,reg
Logic
Block #1
tpd1
R2
D Q
Logic
Block #2
R3
D Q
tpd2
Logic
Block #3
tpd3
What clock does is that:
1- physical timing constraints are met
2- Clock events serve as a logical
ordering mechanism for the global
system events
If we guarantee these two items, we can remove the clock:
-power, area, complexity of clock tree…
R4
D Q
Synch. design
• It assumes that all clock events or timing
references happen simultaneously over
the complete circuit. This is not the case in
reality, because of effects such as clock
skew and jitter.
• significant current flows over a very short
period of time
• linking of physical and logical constraints
has some obvious effects (e.g. throughput)
Self-Timed Pipelined Datapath
[Figure: registers R1, R2, R3 and logic blocks F1, F2, F3 (delays tpF1, tpF2, tpF3) between In and Out; handshaking (HS) blocks exchange Req and Ack between stages, and each logic block has Start and Done signals.]
What does each signal do?
The logical ordering of the operations is ensured by the acknowledge-request
scheme, often called a handshaking
protocol.
Asynch. properties
• Timing signals are generated locally… no high
precision clock distribution over the chip (skew,
etc)
• Separating the physical and logical ordering →
performance (data dependency and no worst-case design)
• The automatic shut-down of blocks that are not
in use can result in power savings.
• Robust to variations in manufacturing and
operating conditions such as temperature.
Completion Signal Generation
LOGIC
In
Out
NETWORK
Start
DELAY MODULE
Using Delay Element (e.g. in memories)
Done
Completion Signal Generation
Using Redundant Signal Encoding
Completion Signal in DCVSL
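[Figure: DCVSL gate precharged under control of Start, with pull-down networks (PDN) driven by In1, In2 and their complements, complementary outputs B0 and B1, and a Done signal.]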
Self-Timed Adder
[Figure: self-timed adder. (a) Differential carry generation: Start-controlled carry chains for C0 to C4 and their complements, built from the propagate (P0 to P3), generate (G0 to G3) and kill (K0 to K3) signals. (b) Completion signal: Done derived from the carries C0 to C4 and their complements.]
Completion Signal Using Current
Sensing
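[Figure: input register, static CMOS logic whose ground return (GNDsense) is monitored by a current sensor (output A), and a min delay generator (output B) triggered by Start; Done is produced from A and B. The waveforms mark tdelay, toverlap, tMDG, tpd-NOR and the time at which the output is valid.]
The min delay generator provides a data-independent minimum-delay reference.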
Hand-Shaking Protocol
Two Phase Handshake
The four events, data change, request,
data acceptance, acknowledge proceed in
a cyclic order.
Every transition
means that the action
is valid!
Event Logic – The Muller-C
Element
[Figure: (a) schematic symbol of the C-element with inputs A, B and output F. Implementations: (a) logic using an S-R latch, (b) majority function, (c) dynamic.]
(b) Truth table (the C-element is a sequential element):
A  B  Fn+1
0  0  0
0  1  Fn
1  0  Fn
1  1  1
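The behavior of the Muller C-element (output follows the inputs when they agree, holds its previous value otherwise) fits in a few lines. The class below is a behavioral sketch only; the class and method names are illustrative.

```python
class MullerC:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.f = initial            # stored output Fn

    def update(self, a, b):
        """Apply inputs A and B and return Fn+1."""
        if a == b:                  # both inputs agree: output copies them
            self.f = a
        return self.f               # inputs differ: output holds Fn


c = MullerC(initial=0)
print(c.update(0, 1))   # 0 (hold)
print(c.update(1, 1))   # 1 (both high)
print(c.update(1, 0))   # 1 (hold)
print(c.update(0, 0))   # 0 (both low)
```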
2-Phase Handshake Protocol
Start from (DataReady, Ack) = (0, 0). When the inputs go to (1, 0),
Req = 1. The C-element is then blocked (and locked), and
no new data is sent to the data bus (Req stays high)
as long as the transmitted data is not processed by
the receiver, no matter what DataReady is.
Advantage : FAST - minimal # of signaling events (important for global
interconnect)
Disadvantage : requires the detection of transitions that may occur in
either direction → initialization is important
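The blocking behavior of the sender can be reproduced with a behavioral C-element. The sketch assumes the common arrangement in which Req is the output of a C-element driven by DataReady and the inverted Ack; that wiring and the function names are assumptions for illustration.

```python
def c_element(a, b, prev):
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else prev


def next_req(data_ready, ack, req):
    """Req generated by a C-element fed with DataReady and the inverted Ack."""
    return c_element(data_ready, 1 - ack, req)


req = 0
req = next_req(data_ready=1, ack=0, req=req)   # new data: Req toggles to 1
first = req
req = next_req(data_ready=0, ack=0, req=req)   # no Ack yet: DataReady is ignored
blocked = req
req = next_req(data_ready=0, ack=1, req=req)   # Ack has toggled: next event passes
second = req

print(first, blocked, second)   # 1 1 0
```

Every transition on Req is a request, so one transfer costs two transitions (one on Req, one on Ack).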
Problem: Self-timed FIFO
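[Figure: self-timed FIFO with registers R1, R2, R3 between In and Out, enable (En) and Done signals, and a chain of three C-elements between Reqi/Acki and Req0/Acko]
All 1s or 0s -> pipeline empty
Alternating 1s and 0s -> pipeline full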
2-Phase Protocol
Example
Assume there is a register at the
input which loads the data at the
beginning of Eval phase
From [Horowitz]
Example
DataReady1 is asserted. → Req to the second
block is asserted and the first C-element is locked. →
The second block loads data and starts the
evaluation process.
Example
DataReady2 is asserted. → Req to the third block is asserted and the second C-element is
locked.
→ The third block loads data and starts the evaluation process.
→ The first C-element is released and can accept a DataReady from the previous stage. (If
Req has already come, the first Req is unleashed and goes to eval phase.)
Example
4-Phase Handshake Protocol
Also known as RTZ
Slower, but unambiguous
Problem: 4-Phase Handshake
Protocol
Implementation using Muller-C elements
Example
Latches: positive edge-triggered or a level-sensitive implementation (latch when
level=1)
Self-Resetting Logic
[Figure: chain of precharged logic blocks (L1), (L2), (L3), each with its own completion detection; circuit detail of a self-resetting gate with inputs A, B, C, internal node int, output out, and post-charge logic.]
Clock-Delayed Domino
GND
CLK2 (to next stage)
CLK1
VDD
Q1 (also D2)
D1
Pulldown
Network
This is a style of dynamic logic, where
there is no global clock
signal. Instead, the clock for one stage is
derived from the previous stage.
Asynchronous-Synchronous
Interface
fin
Synchronous system
Asynchronous
system
fCLK
Synchronization
Synchronizers and Arbiters
• Arbiter: Circuit to decide which of 2 events
occurred first
• Synchronizer: Arbiter with clock φ as one of the
inputs
• Problem: Circuit HAS to make a decision in
limited time - which decision is not important
• Caveat: It is impossible to ensure correct
operation
• But, we can decrease the error probability at the
expense of delay
A Simple Synchronizer
CLK
int
D
I1
Q
I2
CLK
• Data sampled on rising edge of the clock
• Latch will eventually resolve the signal value,
but ... this might take infinite time!
Synchronizer: Output
Trajectories
[Figure: output trajectories, Vout (0.0 to 2.0) versus time (0 to 300 ps)]
Single-pole model for a flip-flop
Mean Time to Failure
Example
Tφ = 10 nsec = T
Tsignal = 50 nsec
tr = 1 nsec
τ = 310 psec
VIH - VIL = 1 V (VDD = 5 V)
N(T) = 3.9 × 10^-9 errors/sec
MTF(T) = 2.6 × 10^8 sec = 8.3 years
MTF(0) = 2.5 μsec
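The numbers in this example can be reproduced with the usual single-pole metastability model, in which the error rate decays exponentially with the waiting time T. The formula as written below is the standard textbook form and is an assumption here, since the slide's own equation is not reproduced in this text; the signal swing is taken as the full VDD. The function name is illustrative.

```python
import math


def sync_error_rate(wait, t_clk, t_signal, t_rise, tau, v_window, v_swing):
    """Errors per second after waiting `wait` seconds for the latch to resolve.

    P_error(0) = (v_window / v_swing) * (t_rise / t_signal)
    N(0)       = P_error(0) / t_clk
    N(T)       = N(0) * exp(-T / tau)
    """
    p_error_0 = (v_window / v_swing) * (t_rise / t_signal)
    return p_error_0 / t_clk * math.exp(-wait / tau)


params = dict(t_clk=10e-9, t_signal=50e-9, t_rise=1e-9, tau=310e-12,
              v_window=1.0, v_swing=5.0)

n_T = sync_error_rate(wait=10e-9, **params)   # wait one full clock period
n_0 = sync_error_rate(wait=0.0, **params)     # no waiting at all

print(f"N(T)   = {n_T:.2e} errors/sec")
print(f"MTF(T) = {1 / n_T:.2e} sec")
print(f"MTF(0) = {1 / n_0:.2e} sec")
```

Waiting a single 10 ns period moves the mean time to failure from microseconds to years, which is the delay-versus-error-probability trade-off.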
Influence of Noise
Uniform distribution
around VM
p(v)
logarithmic
reduction
T
0
VIL
VIH
Initial Distribution
Still Uniform
Low amplitude noise does not influence synchronization behavior
Typical Synchronizers
2 phase clocking circuit
2
Q
1
Q
1
Using delay line
2
Cascaded Synchronizers Increase
MTF
[Figure: three synchronizers (Sync) in series from In through O1 and O2 to Out, all clocked by φ]
Arbiters
Req1
Req2
Ack1
Arbiter
Req1
A
Ack2
B
Ack2
Ack1
(a) Schematic symbol
Req2
Req1
(b) Implementation
Req2
A
B
metastable
Ack1
VT gap
(c) Timing diagram
t
PLL-Based Synchronization
Chip 1
Chip 2
Data
Digital
System
Digital
System
fsystem = N x fcrystal
Divider
PLL
fcrystal < 200 MHz
Crystal
Oscillator
reference
clock
PLL
Clock
Buffer
PLL Block Diagram
Reference
clock
Local
clock
Up
Phase
detector
Charge
pump
Loop
filter
vcont
VCO
Down
Divide by
N
System
Clock
Phase Detector
Output before filtering
Transfer
characteristic
Phase-Frequency Detector
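[Figure: (a) schematic: two resettable D flip-flops, clocked by A and B, produce UP and DN and share the reset Rst]
(b) State transition diagram, with three states:
UP = 0, DN = 1 | UP = 0, DN = 0 | UP = 1, DN = 0
An A edge moves one state toward UP = 1; a B edge moves one state toward DN = 1.
(c) Timing waveforms of A, B, UP and DN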
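The three-state behavior of the phase-frequency detector can be written as a small state machine. This is a behavioral sketch assuming the usual two-flip-flop PFD, in which an edge on A sets UP, an edge on B sets DN, and UP = DN = 1 resets both immediately; the class name is illustrative.

```python
class PFD:
    """Behavioral three-state phase-frequency detector."""

    def __init__(self):
        self.up, self.dn = 0, 0

    def rising_edge(self, source):
        """Apply a rising edge on input 'A' or 'B' and return (UP, DN)."""
        if source == "A":
            if self.dn:
                self.dn = 0         # UP and DN both high would reset: back to (0, 0)
            else:
                self.up = 1
        elif source == "B":
            if self.up:
                self.up = 0         # same reset, reached from the UP state
            else:
                self.dn = 1
        else:
            raise ValueError("source must be 'A' or 'B'")
        return self.up, self.dn


pfd = PFD()
trace = [pfd.rising_edge(s) for s in ["A", "A", "B", "B", "A"]]
print(trace)   # [(1, 0), (1, 0), (0, 0), (0, 1), (0, 0)]
```

If A runs faster than B, the detector spends its time in the UP state, so the output reflects frequency as well as phase.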
PFD Response to Frequency
A
B
UP
DN
PFD Phase Transfer
Characteristic
[Figure: average (UP – DN), up to VDD, versus phase error from –2π to 2π]
Charge Pump
VDD
UP
DN
To VCO Control Input
PLL Simulation
Clock Generation using DLLs
Delay-Locked Loop (Delay Line Based)
[Figure: blocks: phase detector (U/D outputs), charge pump, filter, delay line (DL); input fREF, output fO]
Phase-Locked Loop (VCO-Based)
[Figure: blocks: phase detector PD (U/D outputs), charge pump CP, filter, VCO, ÷N divider; input fREF, output fO]
Delay Locked Loop
DLL-Based Clock Distribution
[Figure: GLOBAL CLK distributed to several digital circuits, each with a local loop made of a VCDL, a phase detector and CP/LF]