Clocked storage elements
Vladimir Stojanovic
Material in this presentation is adapted from
“Digital System Clocking: High-Performance and Low Power Aspects”,
V.G.Oklobdzija, V.M.Stojanovic, D.M.Markovic, N.M.Nedovic, © 2003 J.Wiley & IEEE
EE371 Lecture 6

Outline
• Latch and Flip-Flop
• Timing and Power Metrics
• High-Performance Issues
• Low-Energy Issues
• State-of-the-Art Circuits
• Microprocessor Examples

Recent Interest in Flip-Flops
• Trends in high-performance systems
» Higher clock frequency
» More transistors on chip
• Consequences
» Increased flip-flop overhead relative to cycle time
– Cycle time 10-20 FO4 delays, flop overhead 2-4 FO4
» Difficult to control both edges of the clock
» Higher impact of clock skew
» Higher crosstalk and substrate coupling
» Higher power consumption
– Expensive packages and cooling systems
– Limits on performance
» Clock burns up to 40%, flops up to 20% of total power

Why are clocked storage elements important?
• Cycle time ~12-20 FO4 delays
• Flip-flop overhead 2-3 FO4 (roughly 20% of the cycle time)
[Figure: pipeline stage, flip-flop → logic → flip-flop, both clocked by Clk; the cycle time T is split into TClk-Q, TLogic and TSetup]
$T_{D\text{-}Q} = T_{Clk\text{-}Q} + T_{Setup}$
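
As a quick arithmetic check of the ~20% figure, the sketch below takes the FO4 ranges quoted on this slide and computes the fraction of the cycle consumed by the flip-flop overhead TD-Q = TClk-Q + TSetup, plus the FO4 left for logic. It assumes nothing beyond those two ranges; the helper name is illustrative.

```python
def overhead_fraction(cycle_fo4: float, overhead_fo4: float) -> float:
    """Fraction of the clock cycle consumed by the flip-flop D-Q overhead."""
    if not 0 <= overhead_fo4 < cycle_fo4:
        raise ValueError("overhead must be non-negative and smaller than the cycle")
    return overhead_fo4 / cycle_fo4


# Corners of the ranges on the slide: cycle 12-20 FO4, overhead 2-3 FO4
for cycle in (12, 20):
    for overhead in (2, 3):
        frac = overhead_fraction(cycle, overhead)
        print(f"cycle {cycle} FO4, overhead {overhead} FO4: "
              f"{frac:.0%} of the cycle, {cycle - overhead} FO4 left for logic")
```

The corners span roughly 10% (2 of 20 FO4) to 25% (3 of 12 FO4), so "about 20%" is a mid-range figure.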

Latch and Flip-Flop
• Latch: “soft” edge clocking
» Latch is “transparent” (clock-level sensitive)
• Flip-Flop: “hard” edge clocking
» After the transition of the clock, a data change does not affect the output (clock-edge sensitive)
[Figure: Data, Clock and Q waveforms for a latch and for a flip-flop]
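
A minimal behavioral sketch of the difference, assuming an ideal zero-delay, discrete-time model in which each list element is one time step (the function names are illustrative, not from the lecture): the latch passes D to Q for as long as the clock is high, while the flip-flop updates Q only at the 0→1 clock transition.

```python
from typing import List


def latch(d: List[int], clk: List[int], q0: int = 0) -> List[int]:
    """Positive level-sensitive latch: Q follows D while clk is high."""
    q, out = q0, []
    for di, ci in zip(d, clk):
        if ci:          # transparent phase
            q = di
        out.append(q)   # opaque phase: hold the last value
    return out


def flip_flop(d: List[int], clk: List[int], q0: int = 0) -> List[int]:
    """Rising-edge flip-flop: Q samples D only on a 0->1 clock transition."""
    q, prev, out = q0, 0, []  # assume the clock was low before t = 0
    for di, ci in zip(d, clk):
        if ci and not prev:   # rising edge
            q = di
        prev = ci
        out.append(q)
    return out


clk = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
d   = [1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print("latch:", latch(d, clk))
print("F-F:  ", flip_flop(d, clk))
```

D changes while the clock is high at steps 3-4 and 9-10; those changes reach the latch output but not the flip-flop output, which is the "soft" versus "hard" edge distinction.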

Flip-Flop and Master-Slave Latch
• Operational behavior appears the same…
[Figure: a flip-flop (F-F) and a master-slave (M-S) latch, the latter built from a Master (L1) latch clocked by Clk1 followed by a Slave (L2) latch clocked by Clk2; both show identical Data, Clock and Q waveforms]
• How can one recognize the difference without knowing what is inside the “black box”?
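
One way to see why the black-box behavior matches: a sketch, under the same idealized discrete-time assumptions (each list element is one time step, zero delay, illustrative names), of an M-S latch built from a master latch transparent while the clock is low (Clk1 taken as the inverted clock) and a slave latch transparent while it is high (Clk2 taken as the clock itself). Its output changes only when the clock rises, carrying the value D had just before the edge, which is what a rising-edge flip-flop produces.

```python
from typing import List


def ms_latch(d: List[int], clk: List[int], q0: int = 0) -> List[int]:
    """Master-slave latch: master open while clk = 0, slave open while clk = 1."""
    master, slave, out = q0, q0, []
    for di, ci in zip(d, clk):
        if not ci:
            master = di      # master transparent, slave holds
        else:
            slave = master   # master holds, slave passes the master's value
        out.append(slave)
    return out


def rising_edge_reference(d: List[int], clk: List[int], q0: int = 0) -> List[int]:
    """Edge-triggered reference: at each 0->1 step, Q takes D from the step before."""
    q, out = q0, []
    for t, ci in enumerate(clk):
        if t > 0 and ci and not clk[t - 1]:
            q = d[t - 1]
        out.append(q)
    return out


clk = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
d   = [1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print(ms_latch(d, clk))
print(ms_latch(d, clk) == rising_edge_reference(d, clk))
```

In this idealized model the two are indistinguishable from the outside; telling them apart requires looking at internal structure and timing details such as setup time, which a zero-delay model does not capture.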

Flip-Flop and M-S Latch: Structural Difference
[Figure: Flip-Flop: Input → Pulse Generator (Clock) → S/R pulse → Capturing (Slave) Latch with no clock → Q. M-S Latch: Input → Master (L1) Latch clocked by Φ1 → Q1 → Slave (L2) Latch clocked by Φ2 → Q2 = Q]

T-G Master-Slave Latch
• PowerPC 603 (Gerosa, JSSC 12/94)
[Schematic: transmission-gate master-slave latch with input D, output Q and clocks Clk, Clkb]

Flip-Flop Example 1: SAFF (Sense-Amplifier-Based Flip-Flop)
[Figure: a pulse generator driven by D, D̄ and Clk produces a pulse on S or R depending on the value of D; the pulse sets or resets the capturing latch driving Q and Q̄]
• SAFF: DEC Alpha 21264 (Madden & Bowhill, 1990; Matsui 1994)

Flip-Flop Example 2: HLFF (Hybrid Latch Flip-Flop)
[Figure: HLFF schematic (Vdd, D, Clk, Q) with its pulse generator and second-stage latch, and the signal at node X for D=1 and D=0]

Logic Diagram of HLFF
[Figure: a pulse generator driven by Clk produces the Enable window of the second-stage latch, which passes D to Q; signal at node X shown for D=1 and D=0]

Performance Metrics: Timing and Power

Timing Parameters in Latches
• Setup and hold times are defined relative to the closing edge of the clock
[Figure: latch (D, Clk, Q) with clock pulse width tCW, setup time U and hold time H around the closing edge; (a) early data arrival: D1 → Q1 after tCQ; (b) late data arrival: D2 → Q2 after tDQ]
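
A first-order sketch of the two cases in the figure, assuming the latch opens at t_open and closes at t_close and that Q settles at whichever comes later: the opening edge plus tCQ (early data, case a) or the data arrival plus tDQ (late data, case b). The setup check is made against the closing edge, as the slide states. The function name and all numeric values are hypothetical.

```python
def latch_q_time(t_data: float, t_open: float, t_close: float,
                 t_cq: float, t_dq: float, t_setup: float) -> float:
    """Time at which a transparent-high latch output settles (first-order model)."""
    if t_data > t_close - t_setup:
        raise ValueError("setup violation: data too close to the closing edge")
    # (a) early data: output waits for the clock -> t_open + t_cq
    # (b) late data: output follows the data through the open latch -> t_data + t_dq
    return max(t_open + t_cq, t_data + t_dq)


# Hypothetical values in ps: clock high from 0 to 500 (tCW = 500)
T_CQ, T_DQ, T_SETUP = 60.0, 50.0, 30.0
print(latch_q_time(-100.0, 0.0, 500.0, T_CQ, T_DQ, T_SETUP))  # -> 60.0 (clock-limited, case a)
print(latch_q_time(200.0, 0.0, 500.0, T_CQ, T_DQ, T_SETUP))   # -> 250.0 (data-limited, case b)
```

Late-arriving data can still pass through as long as it meets setup before the closing edge; this is the "soft" edge property of latches.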
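
Timing Parameters in Flip-Flops
• Setup and hold times are defined relative to the rising edge of the clock
[Figure: flip-flop (D, Clk, Q) with setup time U and hold time H forming the sampling window around the rising clock edge; clock pulse width tCW; Clk-Q delay tCQ]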
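
The sampling window runs from U before to H after the rising clock edge, and D must not change inside it. A minimal sketch of that check, with hypothetical names and numbers; U or H may be negative (as for pulsed flip-flops), which only shifts the window as long as U + H stays positive.

```python
from typing import List


def window_violations(transitions: List[float], t_edge: float,
                      t_setup: float, t_hold: float) -> List[float]:
    """Return data transition times strictly inside the sampling window
    (t_edge - t_setup, t_edge + t_hold) around the rising clock edge."""
    if t_setup + t_hold <= 0:
        raise ValueError("sampling window width U + H must be positive")
    lo, hi = t_edge - t_setup, t_edge + t_hold
    return [t for t in transitions if lo < t < hi]


# Hypothetical values in ps: edge at 0, U = 20, H = 10 -> window (-20, 10)
print(window_violations([-50.0, -10.0, 5.0, 40.0], 0.0, 20.0, 10.0))  # -> [-10.0, 5.0]
# Negative setup (e.g. a pulsed flop): U = -5, H = 35 -> window (5, 35)
print(window_violations([-50.0, -10.0, 5.0, 20.0], 0.0, -5.0, 35.0))  # -> [20.0]
```

In the second call the transition at 5 ps sits exactly on the window boundary and is not flagged because the check uses strict inequalities; a real sign-off flow would add margin there.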
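
Data-to-Output Delay
• The sum of setup time and Clk-Q delay is the only true measure of performance with respect to system speed
$T = T_{Clk\text{-}Q} + T_{Logic} + T_{Setup} + T_{skew}$
[Figure: pipeline stage, flip-flop → logic → flip-flop, both clocked by Clk; the cycle time T is split into TClk-Q, TLogic and TSetup]
$T_{D\text{-}Q} = T_{Clk\text{-}Q} + T_{Setup}$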
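
Plugging numbers into the cycle-time equation shows why Clk-Q plus setup, rather than Clk-Q alone, is the right metric: both terms, and the skew, come straight out of the logic budget. The values below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def min_cycle_time(t_clk_q: float, t_logic: float, t_setup: float, t_skew: float) -> float:
    """T = TClk-Q + TLogic + TSetup + Tskew."""
    return t_clk_q + t_logic + t_setup + t_skew


def logic_budget(t_cycle: float, t_clk_q: float, t_setup: float, t_skew: float) -> float:
    """Time left for logic once the D-Q overhead and the skew are paid."""
    return t_cycle - (t_clk_q + t_setup + t_skew)


# Hypothetical values in ps
t_clk_q, t_setup, t_skew, t_logic = 120, 40, 30, 610
T = min_cycle_time(t_clk_q, t_logic, t_setup, t_skew)
print(f"T = {T} ps -> f_max = {1e3 / T:.2f} GHz")
print(f"D-Q overhead = {t_clk_q + t_setup} ps")
print(f"logic budget at 800 ps: {logic_budget(800, t_clk_q, t_setup, t_skew)} ps")
```

A flop whose Clk-Q is 20 ps faster but whose setup time is 20 ps longer leaves the logic budget unchanged, which is why comparing Clk-Q alone is misleading.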

Clk-Q Delay is a function of D-Clk
[Plot: Clk-Output delay [ps] (50-350) versus Data-Clk [ps] (-200 to 200); the setup and hold boundaries delimit the sampling window]
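
Data to Output Delay
[Plot: Setup Time vs. Data-to-Output (D-Q) Delay. D-Q and Clk-Q delay versus data-to-clock delay, between "data arrives late" and "data arrives early", divided into a Failure Region, a Variable Clk-Q Region and a Constant Clk-Q Region. The minimum D-Q delay DDQm occurs at the optimal setup time Uopt, where the Clk-Q curve has a slope of -1 (the 45° line)]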
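
The optimal setup point can be reproduced with a toy model. Assume (hypothetically; these are not measured data) that Clk-Q grows as CLKQ_NOMINAL + K / (t_dc - T_FAIL) as the data-to-clock delay t_dc shrinks toward a failure point T_FAIL. The D-Q delay is t_dc + Clk-Q(t_dc); sweeping t_dc finds its minimum DDQm at Uopt, where the Clk-Q curve has slope -1.

```python
import math

# Hypothetical first-order model, NOT measured data (units: ps and ps^2)
CLKQ_NOMINAL = 150.0  # Clk-Q when data arrives very early (constant Clk-Q region)
K = 400.0             # how sharply Clk-Q degrades as data moves toward the edge
T_FAIL = 20.0         # data-to-clock delay at or below which capture fails


def clk_q(t_dc: float) -> float:
    """Clk-Q delay versus data-to-clock delay t_dc (data arrives t_dc before Clk)."""
    if t_dc <= T_FAIL:
        raise ValueError("data too late: capture fails")
    return CLKQ_NOMINAL + K / (t_dc - T_FAIL)


def d_q(t_dc: float) -> float:
    """D-Q delay = data-to-clock delay + Clk-Q delay."""
    return t_dc + clk_q(t_dc)


# Sweep data arrival from just above failure up to 200 ps before the edge
sweep = [T_FAIL + 0.1 * i for i in range(1, 1801)]
t_opt = min(sweep, key=d_q)
print(f"Uopt ~ {t_opt:.1f} ps, DDQm ~ {d_q(t_opt):.1f} ps, Clk-Q there ~ {clk_q(t_opt):.1f} ps")
# For this model dClkQ/dt_dc = -1 (the 45-degree point) at t_dc = T_FAIL + sqrt(K)
print(f"closed form: Uopt = {T_FAIL + math.sqrt(K):.1f} ps")
```

Pushing data later than Uopt saves less setup time than it costs in Clk-Q, so D-Q rises; arriving earlier than Uopt simply spends more time waiting before the edge.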

Power Consumption
• Power related to a CSE can be divided into:
» Input power
– Data power (PD)
– Clock power (PCLK)
» Internal power (PINT)
– Depends on data activity and glitching activity
» Load power (PLOAD)
– Can be merged into PINT
$P_{tot} = P_{\text{internal\&load}} + \sum_{\text{inputs}(D,\,CLK)} P_{\text{driver}}$
[Figure: CSE with data driver (PD), clock driver (PCLK), internal power PINT and load power PLOAD at outputs Q, Qb]
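
Each term in the sum can be estimated with the standard dynamic-power expression P = α·C·VDD²·f, where α is the fraction of cycles in which the node is charged. The capacitances, activities, supply and frequency below are hypothetical placeholders, not values from the lecture; the point is only how the terms add up to Ptot, with PLOAD folded into PINT.

```python
VDD = 1.8      # V, assumed supply
F_CLK = 500e6  # Hz, assumed clock frequency


def switching_power(alpha: float, cap: float, vdd: float = VDD, f: float = F_CLK) -> float:
    """Dynamic power alpha * C * VDD^2 * f; alpha = fraction of cycles the node charges."""
    return alpha * cap * vdd ** 2 * f


# (activity, capacitance in F): hypothetical numbers for illustration only
terms = {
    "P_CLK (clock driver)": (1.0, 5e-15),        # clock input charges every cycle
    "P_D (data driver)": (0.25, 3e-15),          # depends on data activity
    "P_INT (internal + load)": (0.25, 120e-15),  # PLOAD merged into PINT
}
powers = {name: switching_power(a, c) for name, (a, c) in terms.items()}
for name, p in powers.items():
    print(f"{name:26s} {p * 1e6:7.2f} uW")
print(f"{'P_tot':26s} {sum(powers.values()) * 1e6:7.2f} uW")
```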

High-Performance Issues

Clock Skew and Jitter
[Figure: Ref_Clock from the driver (tDRVCLK) and the Received Clock (tRCVCLK) with period T; the received edges are offset by the skew tskew and vary by the jitter t−jit / t+jit]

The Idea of Clock Uncertainty Absorption
[Plot: D-Q delay [ps] (200-300) versus Clk arrival time [ps] (-30 to 60) around the nominal Clk; over the clock uncertainty tCU the D-Q delay varies only between DDQm and DDQM]
• Change in D-Q delay is much smaller than the clock uncertainty (the CSE absorbs part of the uncertainty)

Clock Uncertainty Absorption
[Figure: D, Clk and Q waveforms for early, nominal (TNominal = 0) and late data-to-clock delays DD-Clk across the clock uncertainty tCU; the resulting D-Q delay ranges from DDQm to the worst-case DDQM]
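
Reusing a toy Clk-Q model (hypothetical, not measured data) makes the absorption concrete. With data arrival fixed, a late clock lengthens the data-to-clock delay t_dc and an early clock shortens it, so sweeping t_dc over ±tCU/2 around the nominal point sweeps the D-Q delay between DDQm and DDQM. Near the optimal setup point the D-Q curve is flat, so its spread is smaller than tCU; the difference is the uncertainty the CSE absorbs.

```python
# Hypothetical Clk-Q model (ps): Clk-Q = CLKQ_NOMINAL + K / (t_dc - T_FAIL)
CLKQ_NOMINAL, K, T_FAIL = 150.0, 400.0, 20.0


def d_q(t_dc: float) -> float:
    """D-Q delay for a given data-to-clock delay t_dc (data fixed, clock moves)."""
    if t_dc <= T_FAIL:
        raise ValueError("data too late: capture fails")
    return t_dc + CLKQ_NOMINAL + K / (t_dc - T_FAIL)


def absorption(t_dc_nominal: float, t_cu: float, steps: int = 200):
    """Return (DDQm, DDQM, absorbed) for clock arrival swept over +/- t_cu / 2."""
    delays = [d_q(t_dc_nominal - t_cu / 2 + t_cu * i / steps) for i in range(steps + 1)]
    ddq_min, ddq_max = min(delays), max(delays)
    return ddq_min, ddq_max, t_cu - (ddq_max - ddq_min)


ddq_min, ddq_max, absorbed = absorption(t_dc_nominal=40.0, t_cu=20.0)
print(f"DDQm = {ddq_min:.1f} ps, DDQM = {ddq_max:.1f} ps, absorbed = {absorbed:.1f} ps")
# -> DDQm = 210.0 ps, DDQM = 220.0 ps, absorbed = 10.0 ps
```

For comparison, a CSE with constant Clk-Q would see its D-Q delay shift by the full 20 ps, absorbing nothing.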
Example: HLFF
[Partovi et al, ISSCC’96]

State-of-the-Art CSEs in CMOS Technology

Requirements in the Flip-Flop Design
• Small Clk-Output delay, narrow sampling window
• Low power
• Small clock load
• High driving capability (increased levels of parallelism)
» Typical flip-flop load in a 0.18µm CMOS process ranges from 50fF to over 200fF, with typical values of 100-150fF in critical paths
• Integration of logic into the flop
• Multiplexed or clock scan
• Crosstalk insensitivity
» Dynamic/high-impedance nodes are affected

Low-Energy Designs: Master-Slave Latch Examples
• C2MOS Latch [Suzuki et al, JSSC 1973]
• Master-Slave Latch (MSL) [G.Gerosa et al, JSSC 1994]
[Schematics: C2MOS latch and MSL, clocked by Clk and Clk1, with master nodes SM, QM and slave nodes SS, Q]
• Feedback for pseudo-static operation
• State node SS protection in PPC (decoupled Q)

Master-Slave Latches
• Positive setup times
• Two clock phases:
» distributed globally
» generated locally
• Small penalty in delay for incorporating MUX
• Some circuit tricks needed to reduce the overall delay

High-Performance Designs: Flip-Flop Examples 1/2
• HLFF (Hybrid Latch-Flip-Flop) [Partovi et al, JSSC 1996]
• SDFF (Semi-Dynamic Flip-Flop) [F.Klass et al, JSSC 1998]
[Schematics: HLFF and SDFF with inputs D, Clk and Clk1, internal node S and output Q]
• Pulse-generating first stage (precharge-evaluate)
• Keepers for pseudo-static operation
• Output load decoupled from internal nodes

High-Performance Designs: Flip-Flop Examples 2/2
• SAFF (Sense-Amplifier-Based Flip-Flop) [Matsui et al, 1994]
[Schematic: SAFF with inputs D, D̄, clock Clk, internal nodes S, R and outputs Q, Q̄]
• Fully-differential circuit
• First-stage sense amplifier can take reduced-swing inputs
• 2nd stage is a capturing latch
• Delays to Q and Q̄ are not equal

SAFF with Improved S-R Latch [Nikolic & Stojanovic ISSCC ’99]
• The first stage is an unchanged sense amplifier
• The second stage is sized to provide maximum switching speed
• Driver transistors are large
• Keeper transistors are small and disengaged during transitions

Flip-Flops
• First stage is a pulse generator
» generates a pulse (glitch) on the rising edge of the clock
• Second stage is a latch
» captures the pulse generated in the first stage
• Pulse generation potentially results in a negative setup time and the soft-edge property
• Must check for hold-time violations
• Note: power is always consumed in the clocked pulse generator
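
A discrete-time sketch of the two-stage structure, under simplifying assumptions (one list element per time step, the first-stage delay chain modeled as a delay of `delay_steps` steps, illustrative names): the pulse is Clk AND NOT(delayed Clk), and the second-stage latch is transparent only while the pulse is high. The example shows data that arrives one step after the rising edge still being captured, i.e. a negative setup time.

```python
from typing import List


def pulse_generator(clk: List[int], delay_steps: int = 3) -> List[int]:
    """Pulse = Clk AND NOT(Clk delayed by delay_steps): high for delay_steps
    steps after each rising edge (if the clock stays high that long)."""
    pulses = []
    for t, c in enumerate(clk):
        c_delayed = clk[t - delay_steps] if t >= delay_steps else 0  # clock assumed low before t = 0
        pulses.append(c & (1 - c_delayed))
    return pulses


def pulsed_latch(d: List[int], clk: List[int], delay_steps: int = 3, q0: int = 0) -> List[int]:
    """Second stage: a latch that is transparent only while the pulse is high."""
    q, out = q0, []
    for di, p in zip(d, pulse_generator(clk, delay_steps)):
        if p:
            q = di
        out.append(q)
    return out


clk = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
d   = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # D rises one step AFTER the clock edge
print("pulse:", pulse_generator(clk))  # -> [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
print("Q:    ", pulsed_latch(d, clk))  # -> [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

The hold-time risk follows directly: any D change up to the end of the pulse (step 4 here) would also reach Q, so the shortest path feeding the flop must be longer than the pulse width.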

Delay Comparison: M-S Latches and Flip-Flops
[Bar chart: Min D-Q Delay Comparison, Delay [FO4] (0-5) for MSL, C2MOS, HLFF, SDFF, SAFF and M-SAFF; 0.18µm, high load (14 min inverters)]
• Flip-flops are faster

Energy Comparison: M-S Latches and Flip-Flops
[Bar chart: energy breakdown at 50% activity, Energy [fJ] (0-120), split into ext. clock, ext. data, int. clock and internal non-clk, for MSL, C2MOS, HLFF, SDFF, SAFF and M-SAFF; 0.18µm, high load (14 min inverters)]
• Latches are lower energy

Summary
• CSE topology depends on target application
» Master-Slave Latches for low energy
» Flip-Flops & Pulsed latches for high performance
• Delay is critical in high-speed systems, although minimizing Clk energy is of increasing importance
• Methods for reducing Clk energy
» Clock gating (more effective in high-performance than in low-energy designs)
» Reduced-swing clocking
» Dual-edge clocking

Microprocessor Examples

Sun UltraSPARC-III Basic Flip-Flop (SDFF) [Klass, 1998]
[Schematic: SDFF with inputs D, Clk and Clk1, internal node S, NAND, transistors MP1-MP2 and MN1-MN5, inverters Inv1-Inv6, and output Q]

Sun UltraSPARC-III Dynamic Flip-Flops
[Schematics: differential and single-ended dynamic flip-flops; labels include D, Clk, internal nodes S and R, NAND, transistors MP1-MP4 and MN1-MN7, inverters Inv1-Inv6, and outputs Q]

Sun UltraSPARC-III Flip-Flops with Logic Embedding
[Schematic: dynamic flip-flop with an embedded 2-input XOR of D1 and D2 (transistors MN2a-MN2d), clock Clk, internal node S, NAND, transistors MP1-MP2 and MN1-MN5, inverters Inv1-Inv6, and output Q]

Latches used in 21064 Alpha [Gronowski et al, 1998]
[Schematics: 21064 modified TSPC latches with input D, clock Clk, internal node X, transistors P1-P5 and N1-N5, and output Q]

Logic Embedding in 21064/21164 Alpha
[Schematics: latches with embedded logic on inputs D1-D4 (internal nodes X, X1, X2), clocked by Clk: 1 level of logic (21064 Alpha) and 2 levels of logic (21164 Alpha)]

Flip-Flop used in 21264 Alpha
[Schematic: differential flip-flop with input D, clock Clk and outputs Q, Q̄]

IBM Processors: Level-Sensitive Scan Design (LSSD)
• Hazard-free level-sensitive polarity-hold latch [Eichelberger 1983]
[Schematic: latch with Data, +Clock and -Clock inputs and output Out]

The PowerPC 603 Master-Slave Latch [Gerosa et al, 1994]
[Schematic: master-slave latch with data input Din, output Dout, clocks C1 and C2, scan input SCANin and clock ACLK]

IBM Power4™ processor: Scannable Split Latch with LSSD Capability [Warnock et al, 2002]
[Schematic: L1 and L2 latches clocked by c1 and c2, with Data_in, Data_out, Scan_in, scan_clk and an attached logic gate]

Low-Energy Issues:
Clock & clocked elements burn more than 60% of the processor power

Low-Swing Clocking: Clock Driver Re-design [H. Kojima, JSSC, April 1995]
[Schematic and waveforms: half-swing clock driver generating clocks CPT, CPB, CNT and CNB from Clk, referenced to VDD, H-VDD and GND; devices Cp1, Cp2, Cn1, Cn2; capacitors CA, CB; thresholds Vthp, Vthn]
• 50% power reduction with half-swing clock (minus some penalty in clock drivers)

Low-Swing Clocking: CSE Re-design [H. Kawaguchi and T. Sakurai, JSSC, May 1998]
[Schematics: (a)-(c) clock drivers producing reduced-swing clocks (labels (VDD-Vth), VDD-Low, (VDD-nVth)) for a flip-flop with input D and outputs Q, Q̄; well bias Vwell > VDD]
$E_{(a)}, E_{(b)} \propto V_{DD}(V_{DD}-V_{th})$
$E_{(c)} \propto (V_{DD}-V_{th})^2$
• PMOS does not fully turn off
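
Both the 50% figure for the half-swing clock driver and the energy expressions on this slide follow from one first-order relation: charging a capacitance C through a swing V_swing from a supply V_supply draws E = C·V_swing·V_supply per cycle. The sketch below evaluates the ratios to a full-swing clock; VDD, Vth and C are hypothetical, and driver overhead and short-circuit current are ignored. The swing and supply assigned to cases (a)-(c) are read off the energy expressions, not from the circuit details.

```python
def clock_energy(c: float, v_swing: float, v_supply: float) -> float:
    """Energy drawn from the supply per cycle: E = C * V_swing * V_supply."""
    return c * v_swing * v_supply


C = 1e-12    # F, hypothetical clock capacitance
VDD = 1.8    # V, hypothetical supply
VTH = 0.45   # V, hypothetical threshold voltage

full = clock_energy(C, VDD, VDD)  # full swing: C * VDD^2
cases = {
    "half swing (VDD/2 from VDD)": clock_energy(C, VDD / 2, VDD),
    "(a),(b): VDD*(VDD - Vth)": clock_energy(C, VDD - VTH, VDD),
    "(c): (VDD - Vth)^2": clock_energy(C, VDD - VTH, VDD - VTH),
}
for name, e in cases.items():
    print(f"{name:30s} {e / full:.3f} of full-swing energy")
```

Lowering the swing alone scales energy linearly; lowering the supply as well, as in case (c), gives the quadratic saving.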

Low-Swing Clocking: N-only CSEs
[Schematic: M-S latch with N-only clocked transistors; input D, clock Clk, master nodes SM, QM, slave nodes SS, Q, transistors N1-N4]
• N-only clocked transistors, M-S latch example (N1 and N2 improve pull-up on SM) [D.Markovic, J.Tschanz, V.De, 2001, patent pending]

Clock Gating: Global Clock Gating
[Figure: (a) time-mux (no gating!): a 2:1 mux selected by Load feeds either In or Q back to register REG, which is clocked every cycle; (b) global clock gating: Clk gated by EN clocks REG, which loads In]
• Used to save clocking energy when data activity is low
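
A cycle-level sketch contrasting the two schemes in the figure, assuming one list element per clock cycle and counting clock edges that reach the register as a proxy for clocking energy (gating-logic overhead is ignored; names are illustrative). Both produce the same register contents; only the gated version stops clocking the register when EN = 0.

```python
from typing import List, Tuple


def time_mux_register(d: List[int], load: List[int], q0: int = 0) -> Tuple[List[int], int]:
    """(a) Mux feeds Q back when load = 0; the register is clocked every cycle."""
    q, edges, trace = q0, 0, []
    for di, li in zip(d, load):
        edges += 1              # clock reaches REG every cycle
        q = di if li else q     # recirculate the old value through the mux
        trace.append(q)
    return trace, edges


def gated_clock_register(d: List[int], en: List[int], q0: int = 0) -> Tuple[List[int], int]:
    """(b) Clk AND EN: the register sees a clock edge only when en = 1."""
    q, edges, trace = q0, 0, []
    for di, ei in zip(d, en):
        if ei:
            edges += 1
            q = di
        trace.append(q)
    return trace, edges


d  = [1, 0, 1, 1, 0, 0, 1, 0]
en = [1, 0, 0, 0, 1, 0, 0, 0]   # 25% load activity
q_mux, n_mux = time_mux_register(d, en)
q_gate, n_gate = gated_clock_register(d, en)
print(q_mux == q_gate, n_mux, n_gate)  # -> True 8 2
```

In a real design the enable must be held stable while the clock is high (commonly by latching EN) so that the gated clock does not glitch.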

Clock Gating: Local Clock Gating [M. Nogawa and Y. Ohtomo, JSSC, May 1998]
[Schematic: flip-flop (D, QM, Q) clocked by CP, generated by a pulse generator from Clk under control of Clock Control (P1, CPI) and Data-Transition Look-Ahead logic]
• Used to save clocking energy when data activity is low

Dual-Edge Triggering: Latch-Mux [R.P. Llopis and M. Sachdev, ISLPED Aug. 1996]
[Figure: concept: two latches fed by D, one transparent while Clk is high and one while Clk is low, with a 2:1 mux selected by Clk driving Q; circuit example clocked by Clk]
• Used to save clocking energy regardless of data activity!
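
A zero-delay behavioral sketch of the latch-mux concept, assuming one list element per time step and illustrative names: one latch is transparent while Clk is high, the other while Clk is low, and the mux always selects the latch that is currently holding. The result updates Q at both clock edges, matching a reference that samples D at every transition.

```python
from typing import List


def latch_mux_detff(d: List[int], clk: List[int], q0: int = 0) -> List[int]:
    """Dual-edge flip-flop from two latches and a mux (zero-delay model)."""
    pos = neg = q0   # pos: transparent while clk = 1, neg: transparent while clk = 0
    out = []
    for di, ci in zip(d, clk):
        if ci:
            pos = di
            out.append(neg)  # mux selects the latch that is currently holding
        else:
            neg = di
            out.append(pos)
    return out


def dual_edge_reference(d: List[int], clk: List[int], q0: int = 0) -> List[int]:
    """Reference: on every clock transition, Q takes D from the step before."""
    q, out = q0, []
    for t, ci in enumerate(clk):
        if t > 0 and ci != clk[t - 1]:
            q = d[t - 1]
        out.append(q)
    return out


clk = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
d   = [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]
print(latch_mux_detff(d, clk))  # -> [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
print(latch_mux_detff(d, clk) == dual_edge_reference(d, clk))  # -> True
```

Because Q updates on both edges, the clock can run at half the frequency for the same data rate, halving clock switching independently of how often the data changes.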

Dual-Edge Triggering: Pulsed-Latch
[Figure: concept: a pulse generator (Pulse Gen) driven by Clk clocks a latch (D, C, Q); circuit examples of pulse generators producing Clk1 and Clk2 from Clk: (a) Single-Edge, (b) Dual-Edge]

Dual-Edge Triggered Flip-Flop [N.Nedovic, V.G.Oklobdzija, ESSCIRC 2002]
[Figure: concept: two first-stage pulse-generating (PG) latches X and Y, producing SX and SY, drive an S-R capturing latch (CL) with output Q; circuit example clocked by Clk, Clk1 and Clk2]
• Pulse-generating latches trigger the capturing latch

Design goals
• Apply
» Small clock load
» Short direct path
» Reduced node swing
» Low-power feedback
» Pulsed design
» Optimization of both Master and Slave latch
• Avoid
» Positive setup time
» Sensitivity to clock slope and skew
» Dynamic (floating) nodes
» Dynamic Master latch
• Conduct Energy-Delay optimizations
• Take into account all sources of power dissipation
• ALWAYS use Clk-Q + setup time for max delay
For more details on storage elements, see Prof. Oklobdzija’s ISSCC’02 workshop: http://www.ece.ucdavis.edu/acsel under Presentations

What to Expect in the Future?
• Incorporating logic into the CSE
• Absorbing clock skew
• Pipeline boundaries will start to blur – pulsed latches
• Latch-less domino-style clocking, signals used to clock
• Synchronous design only in a limited domain
• Asynchronous communication between synchronous domains