8. Clock Distribution

advertisement
Clock Distribution
Shmuel Wimer
Bar Ilan Univ. Eng. Faculty
Technion, EE Faculty
July 2010
1
Clock System Architecture
clk3
External
Clock
ext_clk
Clock
Generator
Clock
Distribution
gclk
Buffers
clk1
Clocked
Elements
Gaters
clk2
Chip
Chip receives external clock through I/O pad.
Clock generator adjusts the global clock to the external clock.
Global clock is distributed across the chip.
Local drivers and gaters drive the physical clocks to clocked elements.
July 2010
2
Global Clock Generation
• Receives external clock signal and produce the global
clock distributed across the die.
• A large skew occurs between external clock and the
physical clocks at clocked elements due to delay of
distribution network (wires, buffers, gaters).
• Therefore, data at clocked elements is no more in sync
with data at I/O pins.
• Phased Locked Loop (PLL) compensates this delay.
• PLL can perform frequency multiplication to obtain the
required on-chip frequencies.
July 2010
3
Synchronous Chip Interface with PLL
Chip A communicates synchronously with chip B
Chip B uses the clock sent by chip A. Data in and out must be
synchronized to the common clock.
A PLL produces the global clock of chip B such that it is in sync
with the external clock.
Chip A
Chip B
CLKin
ext_clk
CLKout
clk
Dout
Din
July 2010
Din
ref_clk
PLL
clk_out
gclk
fdbk_clk
Dout
Clock Distribution
4
How PLL Works?
Charge
Pump
Loop
Filter
C
Up
I R
ref_clk
M
fdbk_clk
Phase
Detect
Vctrl
N
Voltage
Controlled
Oscillator
clk_out
I
Down
July 2010
5
Phase – Frequency Detector (PFD)
1
D
Q
Q_A: B should go
faster
A
CLR
CLR
B
1
D
Q
Q_B: B should go
slower
The two flip-flops receive the signals at their clock input (one is usually a
reference and the other is the sampled).
The output of the leading flip-flop is 1 for the lead duration.
Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
July 2010
6
What happens when the reference and the sampled signals are a shift of
each other?
A: reference
B: sampled
Q_A: sampled
should go faster
Q_B: sampled
should go slower
The spikes at Q_B are a result of the delay of the AND gate driving the
CLR input of flip-flip and the internal delay from CLR to Q.
July 2010
7
What happens when the reference and the sampled signals have different
frequencies?
A: reference
B: sampled
Q_A: sampled
should go faster
Q_B: sampled
should go slower
Sampled is more often 1-value than the reference is, since rising edge of
B occurs more often than rising edge of A.
July 2010
8
Charge Pump
faster
1
D
Q
Icp
CLK_ref
CLR
Sup
Vctrl
Sdn
CLR
CLK_fdbk
1
D
Icp
Q
slower
Converts PFD error (digital) to charge (analog), which then controls PLL VCO.
Charge is proportional to PFD widths: Qcp  Iup  tfaster  Idn  tslower .
July 2010
9
Current Mirror
Iin
Vcc
V
Iout
N2
N1
P1
Iin
Vss
V
P2
Iout
Charge pump consists of current mirrors which are sources of constant
current. Device N1 is in saturation since its gate is connected to high voltage.
Ids (=Iin) depends only on Vgs. Vgs is similar in N2, hence Iout=Iin.
This is an ideal current source with infinite output impedance since Iout is
independent of N2 load; a change in output voltage doesn’t affect Iout.
Current mirror works similarly for P transistors.
July 2010
10
How Charge Pump Works?
R load – determines
the current through
current mirror
Vcc
I
I
Vcc
I
C
P-type current mirror
faster
R
Switch – open when
faster = 1
Switch – open when
slower = 1
Vout
slower
I
I
I
N-type current mirror
Vss
July 2010
11
Faster Mode Vout → Vcc
Vcc
Vcc
C
faster=1
R
Vout
slower=0
Vss
July 2010
12
Slower Mode Vout → Vss
Vcc
Vcc
C
faster=0
R
Vout
slower=1
Vss
July 2010
13
Loop Filter
Vcc
C
R
Vctrl
+
Vctrl
-
Differential amplifier connected as a unity-gain follower is used.
July 2010
14
Voltage Controlled Oscillator (VCO)
A Ring Oscillator cascades an odd number of inverters and feeds back the
last output to first inverter (even number of inverters will be stable). It starts
to oscillate spontaneously.
Vout
If tinv is the delay of an inverter and n is the number of inverters,
oscillation frequency is f  1  2ntinv 
Frequency can be controlled by number of inverters and supply voltage
of inverter (higher voltage obtains faster inverter).
July 2010
15
Components of VCO
Buffering for
driving clk_out
clk_out
Vctrl
Vcc
- cc
V
V-cc
+
-
+-
Vss
Ring of 5 inverters
July 2010
Level Converter from
Vctrl-Vss to Vcc-Vss
16
Delay Locked Loop (DLL)
• It is a variant of PLL that uses voltage-controlled delay
line rather than oscillator.
• It adjusts phase only. Frequency multiplication is
impossible.
• It is simpler than PLL, less sensitive to Vctrl noise and
requires simpler loop filter.
• It is very difficult to correctly design PLL and DLL. It
requires expertise in control systems and analog circuit
design.
July 2010
17
How DLL Works?
Charge
Pump
Loop
Filter
C
Up
I R
ref_clk
Phase
Detect
Vctrl
fdbk_clk
Voltage
Controlled
Delay Line
clk_out
I
Down
July 2010
18
Delay Line
The signal from input to output is delayed
In
1
or not according to the control bit.
Out
0
In
2n1
1
2n2
1
0
0
Dn1
Dn2
1
1
Out
0
D0
Connecting a series of n delay elements, each of delay 2i , n  1  i  0,
it is possible to control the dely of the line to any value from 0 to 2n  1
dealy units by an n-bit control word  Dn 1 , Dn 2 ,
July 2010
, D0  .
19
Clock Distribution Networks
July 2010
20
Tree Clock Network (Unconstrained)
clk1
clk2
clk3
clkn
No constraints imposed on buffers and wires.
Used mostly by automatic tools in automatic synthesis flows.
Can be used for small blocks within large design.
Tools aim at minimizing the variance of clock delays.
July 2010
21
If TCLK is the clock delay at a leaf, the variance exapression
i
2
1 n


T

T
 i1  CLKi n  i1 CLKi  should be minimized.
n
Serpentine routing or extra buffers may be introduced
to obtain small variance.
Constraints on power can be imposed by limiting
number and size of clock buffers and width of wires.
July 2010
22
Clock Distribution with Grids
Grid feeds flops
directly, no local
buffers
Clock driver tree spans height of chip
Internal levels shorted together
Low skew but high power
July 2010
23
Clock drivers are on perimeter
July 2010
Clock drivers are on grid points
24
Delay and Skew in Grid Distribution
July 2010
25
DEC’s Alpha Microprocessor Clocking
July 2010
26
DEC Alpha 21264 Microprocessor Clock distribution
July 2010
27
Clock Distribution with Spines
July 2010
28
Intel’s Pentium4 Clock Distribution
July 2010
29
July 2010
30
Clock Distribution with Trees
RC-Tree
Each branch is individually
routed to balance RC delay
H-Tree
Recursive pattern to
distribute signals uniformly
with equal delay over area
More skew but less power
July 2010
31
Clock H-Tree
chip / functional block / IP
sequential elements
clock / PLL
July 2010
32
IBM / Motorola PowerPC Clock Distribution
July 2010
33
Delay Calculation
We use Elmore delay model. Sub trees are modeled
as capacitive loads
July 2010
34
Clock Skew and Jitter
• Clock should theoretically arrive simultaneously to all
sequential circuits.
• Practically it arrives in different times. The differences
are called clock skews.
• Skews result from paths mismatches, process variations
and ambient conditions, resulting physical clocks.
• Most system distribute a global clock and then use local
clock gaters located near clocked elements.
Clock skew consists of the following components:
July 2010
35
• Systematic is the portion existing under nominal
conditions. It can be minimized by appropriate design.
• Random is caused by process variations like devices’
channel length, oxide thickness, threshold voltage, wire
thickness, width and space. It can be measured on
silicon and adjusted by delay components.
• Drift is caused by time-dependent environmental
variations, occurring relatively slowly. Compensation of
those must takes place periodically.
• Jitter is rapid clock changes, occurring by power noise
and clock generator jitter. It cannot be compensated.
July 2010
36
Factors affecting clock skew, Intel 1998, 0.25u.
July 2010
37
Skew, Clock Cycle and Design Margins
Clock Jitter is the same order as skew, but far more
difficult to compensate.
July 2010
38
Skew Modeling
Point of
divergence
Q
1
2

Clock
Generator
1
2
Tclk1
m
CL
Tclk2
n
D
TCLK1  
m
T
i 1 i
 m
TCLK2  
n
T
i 1 i
 n
Cl - capacitive load, Id - drive current, VCC - voltage swing
  Cl VCC Id
July 2010
39
Consider a small change in the delay, taking the linear term.
VCC
ClVCC
Cl



 
VCC 
Cl 
I d  VCC 
Cl 
I d
2
VCC
Cl
I d
Id
Id
Id
 VCC Cl I d 
 


  
Cl
Id 
 VCC
Standard deviation of stage delay is    = .  is around 5%.
If clock buffers delays are normally distributed and independent
of each other, then  TCLK1   m , and  TCLK2   n .
Skew is:  Tskew    | TCLK2  TCLK1 | 
July 2010


m  n  skew .
40
Clock Distribution Switching Power
Consider m - level clock tree.
Let Cl_m be the total load of its far-end driven sequential
elements.
Assuming a fixed fanout k of each clock tree buffer, the
dynamic power is:
2
PCLK_m  Cl_mVCC
f,
PCLK_  m j  
July 2010
Cl_m
kj
2
VCC
f , 0  j  m  1.
41
Summing over whole clock tree:
2
PCLK   j 0 Cl_  m  j VCC
f 
m 1
m 1 Cl_m
2
VCC
f  j 0
kj
m


1

1
k


2
 VCC
fCl_m 

 1  1 k  
How much of the power is consumed by the far
end drivers of clock tree?
PCLK _ m
PCLK
July 2010

1  1 k 
1  1 k 
m
.
42
Given the number of sequential elements in a block, at
least 50% of the switching power is consumed by the
far end drivers (clock tree is binary, k=2). This number
approaches 1 rapidly with k growth.
Example: Assume a block with 214 sequential elements
and H-tree clock distribution. Then k=4 and m=7. The
far end drivers consume nearly 75% of the clock tree
switching power, while adding the next upper level
drivers brings it to more than 90%.
July 2010
43
Active Clock De-Skewing
• Compensates process variability, temperature gradients,
imperfect design.
• Can be implemented for global fixes (small HW
overhead) or local fixes (high HW overhead).
• Can be used at testing for one time fix (variability
occurring during manufacturing), or dynamically
concurrently with chip operation.
• Its implementation is a difficult design challenge.
July 2010
44
Intel’s Pentium2 De-Skewing System
1998, 450MHz Clock
0.25u process
60pSec skew w/o fix
15pSec skew with fix
Two clock spines for
two clock regions.
A phase detector
detects relative shifts.
Clock of a region is
shifted by a delay line.
July 2010
45
Delay line consists of two cascaded inverters.
Each has a programmable load consists of eight parallel P-N gate
capacitors.
The shift register stores a thermometer code for load programming
in steps of 12pSec.
July 2010
46
Intel’s IA64 Itanium1 De-Skewing System
2000, 800MHz clock
0.18u process
28pSec skew with fix
X4 increase w/o fix
30 independent de-skew
regions.
PLL
Each cluster is driven
from a global H-tree.
Delay circuit in de-skew
region are similar to
Pentium3 with 20-bit
registers.
July 2010
47
July 2010
48
Proposal for H-Tree Clock De-Skew –
Hierarchical Approach
If a phase detector (PD) has a skew guard band g, then guard bands
may accumulate along tree paths.
For example, if a logic stage is shared between region B and C, it may
add 7g time units to path delay.
July 2010
49
Proposal for H-Tree Clock De-Skew –
Mesh Approach
Clock is distributed by Htree, but de-skew takes
place by neighbor leaves
phase detection.
A delay buffer accepts
phase inputs from its 4
neighbors and then
decides of whether to
increase, decrease or
not change its delay.
July 2010
50
Clock Characteristics of Commercial Processors
July 2010
51
Power Consumption in Chips
• Clock power may reach 50% of total (dynamic +
static).
• Clock gating is very useful and standard design
practice
• Four gating methods:
– Synthesis based, automated by EDA tools, RTL
compilers, inserted into clock-tree
– Clock enable signals manually defined by designer,
inserted into clock-tree, FFs’ clock input
– Data-Driven clock gating, inserted at FF-level
– Auto-gated FF, inserted at latch-level
52
FF Data Toggling (DSP core)
23k FFs, Test bench of 250k CLK cycles
10% of application run-time CLK is enabled.
Of which, only 1.6% CLK pulses are useful!
53
FF Data Toggling
0.3
1.2
63 control blocks of ϻP, 200k FFs
0.25
1
0.2
0.8
0.15
0.6
0.1
0.4
a
0.05
0.2
0.03
0
0
1
6
11
16
21
26
31
36
block count
41
46
51
56
61
cumulative normalized clock
capacitive load
data activity / clock
activity
b
Data-Driven CLK Gating
55
tight
constraint
56
How Many FFs To Group ?
k: # flip-flops, q: FF probability of D=Q
q=1-p
Worst case : All FF are toggling independently of each other.
Latch overhead
amortized over k FFs
Net saving per FF
csaving  q
k
 cFF  cw   clatch
k  1  q   cw  cOR  
FF
Gater’s disabling probability
Derivate by k:
Probability of enabling FF
qk ln q  cFF  cW   clatch k 2  0
57
58
Optimal Flip-flop k-size Grouping
Given n flip-flops and m  1 clock cycles
a   a1, , am  is the activity (toggling) of flip-flop
FF1: 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0
FF2: 0 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1
ai  a j is the number of redundant clock pulses
ocurring by jointly clocking FFi and FFj
59
FF Pairwise Activity Model
G V , E, w: FF pairwise activity graph.
vi V corresponds to FFi .


eij  vi , v j  E is FF pairing.
ai | a j is joint toggling.
 
w eij  ai  a j is redundant clock pulses, hence a waste.
E   E: vertex matching
60
Total power:
P  2e E ai | a j 
ij
v V
i
v V
i




ai  e E e E  ai  ai | a j  a j  ai | a j  


ij
ij
ai  e E
ij
Essential + Waste
ai  a j  v V ai  e E w  eij 
i
ij
61
Flip-Flop Grouping Algorithm
FFj
Minimize:
 FFi  FFj
FFi  FF j
FFi
Optimal FFs pairing
can be solved by
minimal cost perfect
graph matching.
Can be repeated for groups of size 4, 8, 16 …
62
Is repeated perfect matching optimal ?
a7 | a8
a1 | a2
a1
a3
a3 | a4
a6 | a5
a2
a7
a4
a6
a5
a8
63
No! Here is the optimal 4-size grouping
64
Multi-Bit Flip-Flop
Saving the power of internal CLK drivers
1-bit flip-flop
2-bit flip-flop
CLK
CLK
D
Master
Latch
CLK
Slave
Latch
+
Q
=
CLK
CLK
D
Master
Latch
Master
Latch
CLK
Slave
Latch
Q1
CLK
CLK
D2
CLK
Slave
Latch
D1
Master
Latch
Slave
Latch
Q2
Q
65
66
MBFF should be combined with data-driven CG to
maximize energy savings.
Toggling vectors (VCD) are unfortunately not
always available.
Data-to-clock toggling ratio (probability) is more
often available.
How to utilize it for MBFF optimal grouping?
67
What is the energy waste in 2-bit MBFF?


p j 1  pi   pi 1  p j  pi  p j  2 pi p j
For n FFs grouped
in n/2 MBFFs it is
n
n2
j 1
i 1
 p j  2 psi pti
What is the optimal MBFF grouping?
Group the FFs such that
p1  p2 
 pn
68
Auto-Gated FF
69
Look-Ahead CLK Gating
70
Relaxed
constraint
71
Power Saving Per FF
save
cdyn

1  p   cFF+CLK  cFF  cO  
k
 cFF+CLK

p  cX  kcO   
 cAint  cFF  cO 
3


72
Which FF to Gate?
73
Results
74
75
Download