Analysis and Characterization of Random Skew

Analysis and Characterization of Random Skew
and Jitter in a Novel Clock Network
by
Vadim Gutnik
Bachelor of Science, Electrical Engineering and Computer Science,
and Materials Science and Metals Engineering,
University of California at Berkeley (1994)
Master of Science, Electrical Engineering and Computer Science,
Massachusetts Institute of Technology (1996)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2000
© Massachusetts Institute of Technology 2000. All rights reserved.
Author: Department of Electrical Engineering and Computer Science
March 3, 2000
Certified by: Anantha Chandrakasan
Associate Professor of Electrical Engineering
Thesis Supervisor
Accepted by: Arthur C. Smith
Chairman, Departmental Committee on Graduate Students
Analysis and Characterization of Random Skew and Jitter in
a Novel Clock Network
by
Vadim Gutnik
Submitted to the Department of Electrical Engineering and Computer Science
on March 3, 2000, in partial fulfillment of the
requirements for the degree of
Doctor of Science in Electrical Engineering
Abstract
System clock uncertainty, in the form of random skew and jitter, is beginning to
affect performance of large microprocessors significantly. Process and environmental
variations and inter-signal coupling on a chip contribute significant delay variations in
long clock lines, and these variations are predicted to make the now widely-used clock
tree distribution untenable. Distributed clock generation may allow clock networks
to continue scaling with advances in semiconductor processing technology.
A novel clock network composed of multiple synchronized phase-locked loops is analyzed, implemented, and tested. Undesirable large-signal stable (modelocked) states
dictate the transfer characteristic of the phase detectors; a matrix formulation of the
linearized system allows direct calculation of system poles for any desired oscillator
configuration. The circuits were fabricated in CMOS, and two implementations of
the system, a 4-oscillator proof-of-concept 400 MHz network and a 16-oscillator
1.3 GHz network, are presented.
A flash time-to-digital converter is presented that exploits parallelism to get precise time measurements with resolution much smaller than a single gate delay. Unfortunately, an unrelated failure precluded measurements on the 16-oscillator chip where
the measurement system was integrated, but the principle is shown to be valid on an
independent test chip.
Thesis Supervisor: Anantha Chandrakasan
Title: Associate Professor of Electrical Engineering
Acknowledgments
I would like to thank my thesis advisor, Professor Chandrakasan for innumerable
technical discussions, for always being available and approachable, and for making
sure I could concentrate on thesis work. Thanks also to my thesis readers Professors
Boning and Verghese for their help in organizing the thesis.
Thanks goes to my research group as well; my research would have been much less
enjoyable and much less successful were it not for their advice, help, and camaraderie.
And of course, thanks to my family for putting up with me through an awful lot
of years of school.
Contents
1 Clocks in Digital Systems
  1.1 Definitions
  1.2 Thesis Scope
2 Models of Clock Network Timing Variations
  2.1 Previous Work: Clocks
    2.1.1 Equipotential Clocking
    2.1.2 H-Trees and Generalized Trees
    2.1.3 Active Skew Management
  2.2 Previous Work: Variations
    2.2.1 Layout-Dependent Processing Variations
    2.2.2 Wafer-Scale and Random Physical Variations
    2.2.3 Circuit Implications of Mismatch
    2.2.4 Abstract Variation Models
  2.3 Categories of Mismatch
  2.4 Clock Architecture Comparison
    2.4.1 Clock metric
    2.4.2 Tree
    2.4.3 Grid
    2.4.4 Active Feedback
3 Synchronization and Stability
  3.1 Previous Work: Synchronization
    3.1.1 Local Data Synchronization
    3.1.2 Local Clock Synchronization
  3.2 Proposed Clock Architecture
  3.3 Small Signal
    3.3.1 General Derivation
    3.3.2 Examples
  3.4 Large Signal: Mode Locking
4 Implementation and Testing Distributed Clocks
  4.1 4 Oscillator Chip
    4.1.1 Oscillator
    4.1.2 Phase Detector
    4.1.3 Loop Filter
  4.2 16 Oscillator Chip
    4.2.1 Oscillator
    4.2.2 Phase Detector
    4.2.3 Loop Filter
5 On-Chip Measurement of Clock Performance
  5.1 Introduction and Motivation
  5.2 Time-to-Digital Converter Fundamentals
  5.3 SOTDC Yield
  5.4 Calibration of a SOTDC
  5.5 Circuit and Results
6 Conclusions
  6.1 Summary and Contributions
  6.2 Future Work
    6.2.1 Testing and measurement
    6.2.2 Unconventional Clocks
A Full Schematics
  A.1 4 oscillator chip
  A.2 16 oscillator chip
List of Figures
1-1 2 bit synchronous counter
1-2 Timing diagram for 3-counter
1-3 Two paths in a clock network
1-4 Relationship of clock offset, skew, and jitter.
2-1 Alpha clock grid evolution
2-2 Four-level H-tree
2-3 Zero-skew balanced tree
2-4 Digital active deskewing
2-5 Skew caused by finite rise time
2-6 Independent balancing of NFETs and PFETS
2-7 Example H-tree
2-8 Schematic model of capacitive coupling
2-9 Clock tree tradeoffs
2-10 Grid distribution block schematic
2-11 Model circuit for shorted grid drivers.
2-12 Power vs. skew for a grid.
2-13 Simulated edge in a grid with skew to the drivers.
2-14 Short circuit power in a grid vs. input tree skew.
2-15 Low-skew wire with DLL
2-16 Matching tree leaves with a DLL
2-17 Matching tree leaves with two DLLs
2-18 Matching tree leaves with a two DLLs which requires delay cell matching
2-19 DLL architecture
2-20 Multi-input delay cell DLL architecture
2-21 Tile number optimization
2-22 A variable delay element and phase comparator can be configured into a DLL or a PLL.
3-1 Mode-locking example
3-2 Distributed clocking network
3-3 Standard phase-locked loop.
3-4 Linear system model of a standard phase-locked loop
3-5 Multi-oscillator phase-locked loop
3-6 Linear system model of a multi-oscillator phase-locked loop
3-7 PLL loop gain Bode plots
3-8 Root locus for single-oscillator PLL with gain error
3-9 Asymmetrical one-dimensional PLL array
3-10 Symmetrical one-dimensional PLL array
3-11 Root locus for a one-dimensional array of PLLs.
3-12 Comparison of noise responses for symmetrical and asymmetrical networks
3-13 Root locus for a two-dimensional array of PLLs.
3-14 Mode-locking example
4-1 Micrograph of the 4 oscillator, 350 MHz chip
4-2 Relaxation oscillator schematic
4-3 Relaxation oscillator layout
4-4 Phase detector schematic
4-5 Phase detector timing waveforms
4-6 Sampled phase detector half-circuit transfer function
4-7 Sampled phase detector full transfer function
4-8 Loop filter schematic
4-9 Micrograph of the 16 oscillator, 1.3 GHz chip
4-10 Ring oscillator schematic
4-11 Phase detector
4-12 Simulated phase transfer curve
4-13 Locking behavior of the PLL array
4-14 Loop filter schematic
5-1 Time to voltage converter operation
5-2 Phase vernier
5-3 Arbiter definitions
5-4 TDC structure. "D" marks delay elements, and "A" the arbiters.
5-5 X(i) vs. i
5-6 SOTDC yield
5-7 Symmetric CMOS arbiter
5-8 Measured xi, with expected curve for 18ps standard deviation of t
5-9 Measured xi vs. xi derived via Eq. 5.9, for σ = 0.35ps
5-10 Measurement chip micrograph
A1.1 Top-level (chip core)
A1.2 Node
A1.3 Relaxation oscillator
A1.4 Compensation amplifier and summer
A1.5 Differential to single-ended amplifier
A1.6 Sampled phase comparator
A1.7 Phase comparator core
A2.1 Top-level (chip core)
A2.2 Individual tile
A2.3 Node
A2.4 Compensation amplifier
A2.5 Ring oscillator
A2.6 Differential inverter for the ring oscillator
A2.7 Clock divider
A2.8 Jitter measurement block
A2.9 Pulse generator
A2.10 DRAM block
A2.11 DRAM write token
A2.12 DRAM bitslice
A2.13 Phase measurement arbiter
A2.14 Dram data 3-state driver
A2.15 Dram output data serializer
Chapter 1
Clocks in Digital Systems
The vast majority of integrated circuits manufactured today are synchronous digital
systems. The performance of these systems, measured in terms of computation per
time, is readily increased by increasing the clock rate. The bulk of the effort in design
of high speed systems is expended on the design of systems that operate correctly
when synchronized by ever faster clocks. An increasing amount of effort has been
made in designing the clocks themselves so that imperfections in the clock do not
unnecessarily limit system performance.
This chapter introduces terminology and
constraints relevant to clock performance in digital systems.
1.1 Definitions
Digital devices can be modeled as finite state machines: a set of registers holds the
current state, combinational logic computes the next state, and at specific instants
the registers are loaded with the newly computed state. In the majority of digital
systems, where the registers are designed to be loaded at the same time, a periodic
synchronization signal, or clock, must be distributed throughout the system [1]. The
clock distribution network of a modern microprocessor uses a significant fraction of
the total chip power and has substantial impact on the overall performance of the
system. For example, the 72 watt, 600 MHz Alpha processor [2] dissipates 16 watts
in the global clock distribution, and another 23 watts in the local clocks: more than
half the power goes to driving the clock net!
Figure 1-1: 2 bit synchronous counter
Figure 1-2: Timing diagram for 3-counter
While clock design issues can be subtle, the main performance criteria for the
system clock are straightforward.
Consider a simple example.
Fig. 1-1 shows a
simple digital circuit: a synchronous counter that counts to 3. The associated timing
waveforms are shown in Fig. 1-2. For the first several cycles shown, the circuit works
correctly, and counts 00, 01, 10, 00. However, for a number of reasons described
below, actual clock signals are neither perfectly periodic nor perfectly simultaneous.
This timing imperfection can lead to two types of timing errors.
The first type of timing error occurs when Clock0 arrives early at cycle 4: in this
case, the data from Q1 does not have time to propagate through the NOR gate, so the
wrong value is latched into R0. Formally, this may be called a "setup time violation,"
because the correct value was not present at the input to a latch sufficiently before a
clock edge. A setup violation occurs if
$$T_{i,n} + t_{CQ} + t_{logic} > T_{j,n+1} - t_{setup} \qquad (1.1)$$
where $T_{i,n}$ is the time of arrival of the $n$th edge at the $i$th flip flop, $t_{CQ}$ is the clock-to-Q
time for the $i$th flip flop, $t_{logic}$ is the worst case (longest) logic delay between the $i$th
and $j$th flip flops, and $t_{setup}$ is the setup time for the $j$th flip flop. Note that $i$ could
equal $j$.
The second type of timing failure happens when Clock1 arrives too late at cycle 6:
the 0 that R0 latches on this cycle propagates to the input of R1 and is latched instead
of the correct value, formally because of a hold time violation on R1. Colloquially,
the value is said to have "raced through" latch R1. A hold violation occurs if
$$T_{i,n} + t_{CQ} + \tilde{t}_{logic} < T_{j,n} + t_{hold} \qquad (1.2)$$
where $t_{hold}$ is the hold time for the $j$th register, and $\tilde{t}_{logic}$ is the worst case (shortest)
logic delay.
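The two conditions in Eq. 1.1 and Eq. 1.2 can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names and all numeric values (in nanoseconds) are illustrative assumptions.

```python
def setup_violation(t_i_n, t_j_next, t_cq, t_logic_max, t_setup):
    """Eq. 1.1: data launched at T_{i,n} settles after T_{j,n+1} - t_setup."""
    return t_i_n + t_cq + t_logic_max > t_j_next - t_setup


def hold_violation(t_i_n, t_j_n, t_cq, t_logic_min, t_hold):
    """Eq. 1.2: data launched at T_{i,n} arrives before T_{j,n} + t_hold."""
    return t_i_n + t_cq + t_logic_min < t_j_n + t_hold


# Illustrative numbers in nanoseconds (assumed, not taken from the thesis).
T_PERIOD, T_CQ, T_SETUP, T_HOLD = 5.0, 0.5, 0.3, 0.2
T_LOGIC_MAX, T_LOGIC_MIN = 3.8, 0.1

# Ideal clocks: launching edge at 0, capturing edge exactly one period later.
ideal_setup = setup_violation(0.0, T_PERIOD, T_CQ, T_LOGIC_MAX, T_SETUP)
# Capturing edge arrives 0.5 ns early: the long path no longer fits.
early_setup = setup_violation(0.0, T_PERIOD - 0.5, T_CQ, T_LOGIC_MAX, T_SETUP)

# Same edge at both registers: the short path is still slow enough.
ideal_hold = hold_violation(0.0, 0.0, T_CQ, T_LOGIC_MIN, T_HOLD)
# Capturing register sees the edge 0.5 ns late: the new data races through.
late_hold = hold_violation(0.0, 0.5, T_CQ, T_LOGIC_MIN, T_HOLD)

print(ideal_setup, early_setup, ideal_hold, late_hold)
```

Note that only the setup check involves the next clock edge, so only it can be relaxed by lengthening the period.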
Setup and hold violations are different in a number of ways. Setup violations occur
because some instantaneous clock period is too short, and can be averted by lowering
the nominal clock frequency. Because setup violations involve successive clock edges,
possibly at the same register, they are typically considered to be a result of temporal
clock variation. Hold violations, on the other hand, involve arrivals of the same edge
at multiple registers; they result from spatial clock variation. Slowing down the clock
does nothing to avert hold violations; instead, the effective hold time of the offending
registers must be increased, often by adding pairs of inverters after the register.
Traditionally, clock networks have been characterized in terms of skew, the spatial
variations in arrival times, or $T_{\mathrm{s}}(i, j) = T_i - T_j$; and jitter, the temporal variation in
clock period at a node, $T_{\mathrm{j}}(n) = T_{i,n+1} - T_{i,n} - T_{period}$. Rewriting Eq. 1.1 and Eq. 1.2
in terms of skew and jitter (with the jitter taken at the capturing flip flop, and
$\tilde{t}_{logic}$ the shortest logic delay) gives
$$T_{\mathrm{s}}(i, j) - T_{\mathrm{j}}(n) > T_{period} - t_{setup} - t_{CQ} - t_{logic} \qquad (1.3)$$
$$T_{\mathrm{s}}(j, i) > t_{CQ} + \tilde{t}_{logic} - t_{hold} \qquad (1.4)$$
Figure 1-4: Relationship of clock offset, skew, and jitter. (a) Definition of clock time offset. (b) Time offset plot for a single clock. (c) Conventional view of skew and jitter. (d) Skew and jitter in modern clocks are commingled.
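The definitions of skew and jitter above translate directly into operations on lists of edge arrival times. The sketch below is illustrative only; the arrival times and the 10 ns period are assumed values, chosen so that node B lags node A by a constant 1 ns while one edge arrives 0.5 ns late at both nodes.

```python
def skew(t_i, t_j):
    """Spatial variation for one edge: T_s(i, j) = T_i - T_j."""
    return t_i - t_j


def jitter(arrivals, t_period):
    """Temporal variation at one node: T_j(n) = T_{n+1} - T_n - T_period."""
    return [b - a - t_period for a, b in zip(arrivals, arrivals[1:])]


T_PERIOD = 10.0                      # ns, assumed
node_a = [0.0, 10.0, 20.5, 30.0]     # third edge is 0.5 ns late
node_b = [1.0, 11.0, 21.5, 31.0]     # same pattern, delayed by 1 ns

skews = [skew(a, b) for a, b in zip(node_a, node_b)]
jitter_a = jitter(node_a, T_PERIOD)
jitter_b = jitter(node_b, T_PERIOD)

print(skews)      # constant skew between the two nodes
print(jitter_a)   # period error per cycle at node A
```

In this example the skew is the same on every edge and the jitter is the same at both nodes, which is the situation the traditional skew/jitter split assumes.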
Figure 1-3: Two paths in a clock network (Delay A to node A, Delay B to node B)
In older clock networks, the clock source was the source for the majority of jitter, so
jitter was the same for all the clock nodes. Referring to Fig. 1-3, the assumption
was that the delay along each of paths A and B is a constant, and the only source of
time-dependent noise is the clock source. Hence, if the clock arrives at node A one
nanosecond late, it would also arrive at node B one nanosecond late. Dually, skew was
caused by static path-length mismatches to the clock loads, so skew was constant
from cycle to cycle. If on one clock cycle the clock at B lagged the clock at A by one
nanosecond, it would lag by one nanosecond at the next clock cycle as well. If we
plot the time offset from an ideal clock, defined in Fig. 1-4(a), vs. time for a single
clock, we'd expect to see something like Fig. 1-4(b). The traditional model suggests
that two on-chip clocks behave as shown in Fig. 1-4(c). In modern clock systems,
however, delay from the clock source to the loads dominates both static and dynamic
mismatches, so arrival times at different nodes are not necessarily correlated. If the
clock arrival time at node A is not correlated with the arrival time at node B, the
jitter at B need not match the jitter at A, and the skew between A and B becomes
time-varying, as shown in Fig. 1-4(d). This means that the skew and jitter terms
in Eq. 1.3 and Eq. 1.4 would have to be fully indexed for sample time and location.
In short, there is little reason to treat skew and jitter separately in modern clock
networks.
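A small Monte Carlo experiment illustrates the argument. This is a hedged sketch rather than a model from the thesis: the Gaussian noise, the noise magnitudes, the path delays, and the function names are all assumptions made for illustration. With noise only at the source, the skew between two nodes is constant; once each path adds its own uncorrelated noise, the skew varies from edge to edge.

```python
import random


def arrivals(n_edges, t_period, delay, source_noise, path_sigma, rng):
    """Arrival times at one node: ideal edge + shared source noise
    + fixed path delay + noise private to this path."""
    return [n * t_period + source_noise[n] + delay + rng.gauss(0.0, path_sigma)
            for n in range(n_edges)]


def skew_spread(a, b):
    """Peak-to-peak variation of the skew between two nodes."""
    s = [x - y for x, y in zip(a, b)]
    return max(s) - min(s)


rng = random.Random(0)           # fixed seed so the run is repeatable
N, T = 1000, 10.0                # edges and period in ns (assumed)
source = [rng.gauss(0.0, 0.1) for _ in range(N)]

# Traditional model: all time-dependent noise comes from the clock source.
old_a = arrivals(N, T, 1.0, source, 0.0, rng)
old_b = arrivals(N, T, 2.0, source, 0.0, rng)

# Modern model: each distribution path contributes its own noise.
new_a = arrivals(N, T, 1.0, source, 0.1, rng)
new_b = arrivals(N, T, 2.0, source, 0.1, rng)

old_spread = skew_spread(old_a, old_b)
new_spread = skew_spread(new_a, new_b)
print(old_spread, new_spread)
```

The first spread is zero up to floating-point rounding, while the second is comparable to the per-path noise, so a single static skew number no longer describes the pair of nodes.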
For this reason, this thesis uses "clock skew" and "clock uncertainty" interchangeably to mean the difference between the actual clock arrival time and the nominal
arrival time, whether the reference is established by a spatially or a temporally distinct
clock edge. Aside from avoiding semantic distinction between skew and jitter, this
usage allows us to consider skew and jitter contributions of individual clock paths,
rather than pairs of paths. (This is an exact clock network analog of analyzing half-circuits in amplifier design.)
Just as there are distinctions between types of timing errors (hold vs. setup
violations), and between types of clock uncertainty (skew vs. jitter), there are several
divisions in the sources of clock uncertainty. First, errors can be divided into
systematic or random. Systematic errors are due to layout-dependent parameter
variations, length variations in the lines, load capacitance mismatches, etc.; that is,
any variations that are the same from chip to chip. In principle, such errors could
be modeled and corrected at design time given sufficiently good simulators. Failing
that, systematic errors can be deduced from measurements over a set of chips, and the
design adjusted to compensate. Random errors are due to manufacturing variations,
inter-signal coupling (which is predictable but often too hard to model correctly),
thermal and slow supply-voltage gradients, power-supply-noise-induced delay variations in buffers, and, to some extent, thermal noise. It is impossible to eliminate some
sources of random clock uncertainty, but it is possible to model some of the skew and
jitter sources, and to design in a way that minimizes their effects.
Mismatch may also be characterized as static or time-varying. In practice, there
is a continuum between changes that are slower than the time constant of interest
and those that are faster. For example, temperature variations on a chip vary on a
millisecond time scale. A clock network tuned by a one-time calibration or trimming
would be vulnerable to time-varying mismatch due to varying thermal gradients. On
the other hand, to a feedback network with a bandwidth of several megahertz, thermal
changes appear essentially static. Note the caveat that time-varying signals can cause
static errors as long as they are periodic with the clock. For example, the clock net is
usually by far the largest single net on the chip, and simultaneous transitions on the
clock drivers induce noise on the power supply. However, this high-speed effect does
not contribute to time-varying mismatch because it is the same on every clock cycle,
and hence affects each rising clock edge the same way. Of course, this power supply
glitch may still cause static mismatch if it is not the same throughout the chip.
Finally, random skew can be subdivided into spatially correlated and spatially
uncorrelated mismatch. (Note the similarity to static and time-varying mismatch,
which could be restated as temporally correlated and uncorrelated.) Again, the distinction is not absolute. Different physical parameters will have different correlation
distances; hence it is possible for a single pair of wires to be correlated in one respect
but not in the other. Table 1.1 shows the categories and several examples of the
sources of each type of random mismatch.
              correlated                        uncorrelated
static        wafer-scale etching, polishing    MOSFET channel doping
              and lithography gradients
time-varying  temperature and power-supply      value-dependent load capacitance,
              gradients                         inter-signal coupling
Table 1.1: Categorization and example sources of non-systematic mismatch
1.2 Thesis Scope
As argued in Chapter 2, signal delay across a microprocessor chip measured in clock
cycles has been increasing as technology scales to smaller feature sizes, and is now
comparable to one clock cycle. Because clock uncertainty scales with path delay,
relatively longer delays increase the fraction of clock uncertainty per clock cycle; this
trend could severely limit performance if not corrected. The overall goal of this thesis
was to examine clock performance at both the circuit and the architectural level to
find ways to design clocks in an environment where performance is limited by random
physical mismatches and noise.
This thesis is split into three parts.
The first part, Chapter 2, analyzes how
sources of skew and jitter affect different clock architectures. The nonintuitive result
is that a tree architecture is not well suited to systems where cycle time is shorter
than cross-chip path delay, and that distributed clock networks become increasingly
attractive.
This analysis leads into the second part, which proposes a novel clock network
composed of multiple synchronized phase-locked loops. Chapter 3 covers large- and
small-signal stability of the system. Undesirable large-signal stable (modelocked)
states dictate the transfer characteristic of the phase detectors; a matrix formulation of the linearized system allows direct calculation of system poles for any desired
oscillator configuration. Chapter 4 deals with circuit implementation in CMOS, presenting two implementations of the system: a 4-oscillator proof-of-concept 400 MHz
network, and a 16-oscillator, 1.3 GHz network.
The last part of the thesis, Chapter 5, examines ways to measure performance
of a high-speed clock. As clock performance is optimized for fast operation, it becomes increasingly difficult to measure clock jitter. A flash time-to-digital converter
is presented that exploits parallelism to get precise time measurements with resolution much smaller than a single gate delay. Unfortunately, an unrelated failure
precluded measurements on the 16-oscillator chip where the measurement system
was integrated, but the principle is shown to be valid on an independent test chip.
Chapter 2
Models of Clock Network Timing Variations
Unpredictable parameter variations and noise are becoming dominant concerns for
clocks. Clock networks have traditionally been optimized for minimum design time
(gridded clocks) or power and wireability (trees). Process variations, on the other
hand, have been studied extensively in terms of matching limitations on analog circuits, and to some extent in individual clock architectures. This chapter considers
how clock uncertainty depends on both architecture and imposed mismatch.
2.1 Previous Work: Clocks
Consider first the taxonomy and evolution of clock networks. Note that a great deal
of work nominally about "clocking" has gone into finding the exact sequence of timing
signals needed to clock a microprocessor at the fastest possible speed [3, 4, 5, 6, 7, 8, 9],
and a number of CAD tools have been developed to find and verify such timing
schedules [10, 11, 12]. However, the analysis of what timing signals are needed is
independent of how the signals are distributed. Unpredictable variations are no more
tolerated in scheduled-skew designs than in ideally zero-skew designs. The remaining
discussion will assume that the optimal clocking schedule has already been determined
and that what remains is implementation.
2.1.1 Equipotential Clocking
Conceptually the simplest clocking strategy is to distribute a global clock to the
chip as a regular, though heavily loaded, signal line. This is known as equipotential
clocking because the implicit assumption is that resistance in the wires is negligible
and the entire net is always at a uniform voltage. For small nets with relatively
few clock loads and a slow clock, this works well. For large chips and fast clocks,
equipotential clocking has the advantage that most of the clock distribution network
can be designed independently of the logic.
In fact, there is some RC time constant ($\tau$) associated with the wires of such
a clock net. When $\tau$ is small compared to the clock period, the RC delays are
unimportant. As feature sizes scale down, however, $\tau$ increases and clock rates go up,
so the net no longer appears as a lumped capacitance and acts instead as a lossy delay
line. Propagation delays along the clock net cause skew. Because $\tau$ scales with the
size of the net, equipotential clocking can still be used for subsections of a chip [13],
and implicitly at the lowest level in hierarchical [14] and distributed [15, 16] designs.
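To make the scaling of the wire time constant with net size concrete, the sketch below uses the standard Elmore delay of a uniform distributed RC line, 0.5·r·c·L². This formula is swapped in for illustration and is not taken from the thesis; the per-millimeter resistance and capacitance are assumed round numbers, not process data.

```python
def wire_tau(r_per_mm, c_per_mm, length_mm):
    """Elmore delay of a uniform distributed RC line: 0.5 * r * c * L^2."""
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2


R_PER_MM = 50.0       # ohm/mm, assumed for illustration
C_PER_MM = 0.2e-12    # F/mm, assumed for illustration

tau_full = wire_tau(R_PER_MM, C_PER_MM, 10.0)   # wire spanning 10 mm
tau_half = wire_tau(R_PER_MM, C_PER_MM, 5.0)    # wire spanning half of that

for f_clk in (200e6, 600e6):
    period = 1.0 / f_clk
    print(f"{f_clk / 1e6:.0f} MHz: tau/period = {tau_full / period:.2f} (10 mm), "
          f"{tau_half / period:.2f} (5 mm)")
```

Because the delay grows with the square of the length, halving the span driven from one point cuts the time constant by a factor of four, which is why smaller subsections remain approximately equipotential at higher clock rates.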
The tour de force of equipotential clocking was the first DEC Alpha chip [17]
(Fig. 2-1(a)). In that design, a single, segmented buffer placed lengthwise in the
center of the die drives a grid made using two upper metal layers (i.e., the thickest
metal available, to lower $\tau$). The worst-case time difference between clock arrivals
was 200 picoseconds, and this was sufficient for a 200 MHz clock.
The next two versions, the 300 MHz Alpha and its strikingly similar 433 MHz
cousin [18, 19], both used two drivers for the entire grid (Fig. 2-1(b)). Why? With
higher clock speeds, the RC delay from the center of the chip to the edges becomes
significant; the two drivers effectively both drive halves of the chip, so the delays are
shorter. The 600 MHz Alpha [2] (Fig. 2-1(c)) followed this trend: it has four top-level
buffers, because with the higher clock speeds and wire delays, ever smaller sections
of the chip can be modeled as equipotentials.
Figure 2-1: Evolution of Alpha's grid based clock network. In all cases, large buffers
drive a regular mesh of metal2 and metal3 wires. (a) One-driver grid. (b) Two-driver grid. (c) Windowpane grid.
2.1.2 H-Trees and Generalized Trees
If it were possible to lay out the clock net so that all points where the clock is used
are equidistant from the clock driver, the wire delay would not cause skew. This idea
led to H-trees (Fig. 2-2) [20, 21, 14].
By symmetry, the distance from the center of the net (the root of the tree), to each
of the ends (leaves), is the same. Therefore, regardless of $\tau$, signals should arrive
at the leaves at the same time. The clock can then be distributed to a smaller
(approximately equipotential) net around each leaf. The size of this equipotential
region around each leaf shrinks as the depth of the tree increases, so deeper trees
are needed for faster clock speeds.
Figure 2-2: Four level H-Tree. Paths from the center to the leaves are geometrically the same.
The maximum clock frequency is limited by dispersion of pulses on the RC wires, so the basic
H-tree can be improved immediately by symmetrically inserting buffers along the
branches to regenerate the signal [21, 22, 15, 14]. Clock trees are insensitive to global
process and environmental variations; skew is still zero if the resistance of the wires
is higher than expected, say, or if the input threshold to all the buffers changes. Of
course, H-trees are affected by intra-die variations [23, 24].
Anything that causes
similar paths on the different parts of the chip to have different delays (e.g., local
line width variations, temperature gradients, varying threshold voltages, etc.) causes
skew.
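The geometric symmetry of an H-tree can be verified with a short recursive construction. This is an illustrative sketch under assumed conventions: `levels` counts binary branch points, branch lengths halve every two levels, and the function name and coordinates are hypothetical, not taken from the thesis layouts.

```python
def h_tree_leaves(levels, x=0.0, y=0.0, half=1.0, horizontal=True, path=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an H-tree.

    Each level branches symmetrically in one direction, alternating between
    horizontal and vertical; the branch length halves every two levels.
    """
    if levels == 0:
        return [(x, y, path)]
    dx, dy = (half, 0.0) if horizontal else (0.0, half)
    next_half = half if horizontal else half / 2.0
    leaves = []
    for sign in (-1.0, 1.0):
        leaves += h_tree_leaves(levels - 1, x + sign * dx, y + sign * dy,
                                next_half, not horizontal, path + half)
    return leaves


leaves = h_tree_leaves(4)                        # 2**4 = 16 leaves
lengths = {length for _, _, length in leaves}    # distinct root-to-leaf lengths
positions = {(lx, ly) for lx, ly, _ in leaves}

print(len(leaves), lengths)
```

Every leaf is reached through the same wire length, so nominal delay is identical; the intra-die variations described above break this only by making equal-length segments electrically unequal.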
H-trees are most useful when clocking regular arrays, because the leaves form a
regular grid. What can be done if the clock loading is not so geometrically regular?
The vital feature of H-trees is that the distance from the root to all the leaves is the
same. Finding a balanced tree for an arbitrary set of points is known as the zero-skew tree problem. In general, finding a zero-skew tree with minimum total length
is exceptionally hard; however, a number of heuristic algorithms have been proposed
[25, 26, 27, 28, 29]. Closely related to the zero-skew problem is the bounded skew tree
problem, where a small amount of path difference is allowed to help minimize the
total wire length, and therefore minimize area and power dissipation [30].
Figure 2-3: Zero-skew balanced tree
All of these tree approaches are bottom-up algorithms that start by connecting groups
of nodes into a tree and then merging trees until only one net remains. They are
distinguished by exactly how they merge trees, behavior in pathological cases, how the
number of computations scales with the number of clock loads, how they route around
obstructions, etc. The result is essentially the same, however: they all
produce an irregular clock tree that ties together a specified set of clock loads such
that the distance from the root to the leaves is approximately equal (Fig. 2-3). Most
modern processors use some version of such trees to distribute the clock [31, 32, 33, 34].
Those that do not use explicit trees still simulate and balance path delays from the
clock source to all the loads, and so act essentially as generalized clock trees. There the
matching is generally less precise, because the delay to the leaves, while nominally
identical, is composed of the delays of a variable number of gates and length of wire,
so even global variations in a particular parameter may cause skew.
Figure 2-4: Digital active deskewing
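The bottom-up merging step shared by the zero-skew tree heuristics discussed in this section can be sketched as follows. This is a simplified illustration, not any one of the cited algorithms: delay is assumed proportional to wire length (real tools use RC delay models), and the function name and numbers are hypothetical.

```python
def merge(delay_a, delay_b, distance):
    """Merge two balanced subtrees whose roots are `distance` apart.

    Returns (wire_to_a, wire_to_b, merged_delay) such that
    delay_a + wire_to_a == delay_b + wire_to_b. Delay is taken to be
    proportional to wire length, a deliberate simplification.
    """
    wire_to_a = (distance + delay_b - delay_a) / 2.0
    if wire_to_a < 0.0:
        # Subtree a is already slower than b plus the whole span:
        # tap at a's root and lengthen (snake) the wire to b.
        wire_to_a = 0.0
        wire_to_b = delay_a - delay_b
    elif wire_to_a > distance:
        # Mirror case: tap at b's root and lengthen the wire to a.
        wire_to_a = delay_b - delay_a
        wire_to_b = 0.0
    else:
        wire_to_b = distance - wire_to_a
    return wire_to_a, wire_to_b, delay_a + wire_to_a


# Two leaves 4 units apart: the tap point is the midpoint.
pair = merge(0.0, 0.0, 4.0)
# Merge that pair (delay 2) with a single leaf (delay 0) 4 units away:
# the tap point shifts toward the slower subtree.
upper = merge(pair[2], 0.0, 4.0)
# A subtree so slow that the other branch needs extra wire.
snaked = merge(5.0, 0.0, 2.0)

print(pair, upper, snaked)
```

Repeating this merge until one root remains yields an irregular tree with equal nominal root-to-leaf delay, at the cost of extra wire wherever the snaking case occurs.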
2.1.3
Active Skew Management
One approach to measure and cancel out static skew involves splitting the H-tree
into two halves, measuring the relative offset between the two, and applying the
appropriate delay, as shown in Fig. 2-4 [35]. In this structure, the delays and control
signals are digital; this adds a measure of noise immunity, but increases the overhead
power and area. Further, the model does not scale well: there is explicit digital
control to guarantee that the delays do not both continue to increase. Splitting the
tree into more sections allows finer adjustment, but the control overhead increases
rapidly as well.
2.2
Previous Work: Variations
Because the goal of a clock network is to distribute an identical signal to multiple
locations, device and interconnect matching is important. Environmental variables,
such as supply voltage, switching activity and temperature depend on the design of
the chip, and hence are under the control of the designer. Conversely, processing
variables, including film thickness, lateral lengths, resistivity, etc., are defined by the
manufacturing process, and can be treated as imposed constraints [43]. This section
describes some of the approaches to modeling the constraints and their effects on
circuits.
2.2.1
Layout-Dependent Processing Variations
Some manufacturing process steps, most notably etching, chemical-mechanical polishing (CMP) and lithography, are influenced by topography on a chip. This layout-dependent processing causes systematic device and interconnect variations [43, 44, 45].
Modeling this variation falls into the realm of statistical metrology; see [46] for a review. This systematic variation need not limit clock performance, however. Design
rules are evolving to ensure layout pattern uniformity. For some effects, it may be
feasible to add a spatially-varying fabrication mask offset, just as masks are made
by adjusting the drawn layout to compensate for lithography and etching biases.
As a last resort, clock performance can be measured and systematic offsets can be
compensated in the design.
2.2.2
Wafer-Scale and Random Physical Variations
Unlike systematic skew, skew caused by random physical variations is unavoidable.
For example, a dominant source of device mismatch over small areas is Vth variation
due to stochastic distribution of dopants; variation depends only on channel area
[47, 45, 48, 49]. Wafer-scale non-uniformity, while not truly random, varies from chip
to chip. For example, deposited thin films often have a radially-symmetric thickness
profile across a wafer. This results in slants in parameter properties across chips that
depend on position of the chip within a wafer, and hence cannot be compensated on
chip [43].
Figure 2-5: Clock skew caused by finite signal rise time (voltage vs. time, with buffer
thresholds between Vth min and Vth max). t1 - t0 and t3 - t2 is skew
due to variable buffer threshold voltages. t3 - t1 and t2 - t0 is due to variable rise
time. t3 - t0 shows the worst case combined effect.
2.2.3
Circuit Implications of Mismatch
Processing mismatch translates directly into loss of clock performance. For example,
variations in saturation current or buffer thresholds can both lead to variable clock
arrival times, as shown in Fig. 2-5 [21, 20]. Exact numbers are not easily available,
but one may assume that there could be 10% dynamic variation in VDD across a chip
(which affects the threshold and drive current) and another 5% variation in IDSS
between two distant, though nominally matched, buffers. That leads to an expected
clock skew of 2.5% of the total clock cycle from a single pair of gates! In the current
regime, where the clock skew budget is approximately 10% of the clock period, this
is quite substantial [22, 50, 51]. Attempts to increase the maximum clock speed by
increasing pipelining along an H-tree exacerbate this effect [52].
Because random variations cause substantial skew, there have been a number of
attempts to minimize mismatches at the circuit level. For example, it was noticed that
due to poor matching between nfets and pfets, signal paths which do not match the
nfets and pfets separately may add skew unnecessarily [53]. The canonical example is
shown in Fig. 2-6. On a rising input clock edge, gates N1, P2 and N3 are turned on
in the top chain and N4 and P5 in the bottom chain. Because nfets may be expected
to track nfets better than pfets, and vice versa, the lowest skew is achieved by sizing
the transistors so that dN1 + dN3 = dN4 and dP2 = dP5, where dN1 is the delay
due to transistor N1, etc. The general observation is that matching is best between
similar components. One cannot expect wire delays to match gate delays over all
process corners, for example.
Figure 2-6: Independent balancing of NFETs and PFETs
Clock designers have also started to pay attention to wisdom from analog design:
matching is best between similar elements, and matching between identical elements
is improved by making them larger. For example, matching wire delays to gate delays
is likely to lead to random skew. And when matching delays through a clock tree, at
some times fast paths need to be slowed down. There are two straightforward ways to
accomplish this: make the wires longer or make them wider. Which is better? Wider
wires are preferable because of the diminished influence of edge effects [50, 54, 55].
Consideration of random variations is becoming increasingly important in clock
designs. The solutions tend to be ad hoc, and there has been little work on how well
physically separated components may be expected to match. And most clock trees
are still designed to achieve minimal nominal skew without consideration for how
random variations will affect performance.
2.2.4
Abstract Variation Models
At the opposite extreme from the ad hoc physical models are the abstract
models for skew [15, 56, 42, 57]. The assumption in these models is that skew is caused
by uncorrelated, random variations in the clock distribution network. Unfortunately,
because they are so far removed from implementation, generic statistical models give
somewhat misleading results, for several reasons.
The first is that they are too optimistic about statistical independence of variations. For example, gates that are near each other are likely to match each other
more so than gates that are physically separated. This means that the sum of the
skews caused by gates in any signal path will have higher variance than would the
sum of skews caused by the same number of gates randomly selected from the chip.
Also, as has been pointed out, not all variations have the same weight in the final
skew: clock trees, for example, are much more sensitive to differences at the root of
the tree than at the leaves [56].
Ironically, the second weakness is that general statistical models can be too pessimistic as well. For example, an analysis of pulse width down a long line of buffers
suggests that the pulse-width follows a random walk [57].
Thus, it is argued, the
pulse might disappear entirely unless the clock period is sufficiently long. In fact, it
is not particularly hard to add feedback to ensure a 50% duty cycle, which effectively
limits the random walk. In this case and some others, circuit tricks can overcome
apparent stochastic barriers [15].
Fundamentally, the very generality that makes sweeping statistical statements
interesting is their weakness because such bounds do not take into account circuit
or architectural changes that affect network performance. Although they may place
bounds on clock performance, they are necessarily qualitative, and can neither suggest
circuit improvements nor take them into account.
2.3
Categories of Mismatch
All on-chip clock networks rely on device parameter matching. This is a crucial
difference between logic critical paths and clock networks: variation in critical path
delay can be overcome by speeding up the critical path so that the worst-case delay
meets timing constraints [58]. Time-dependent logic delay can be included directly
in the worst-case timing estimates: maximum delay is constrained by Eq. 1.3 and
minimum delay by Eq. 1.4. In contrast, because the clock network itself establishes
the timing, both too-slow and too-fast clocks must be avoided. Physical variations
are often separated into local and global contributions [59]. For the
purposes of clock distribution, time-varying mismatch must be considered explicitly
as jitter (and, if uncorrelated spatially, as contributing to skew); see footnote 1.
Integrated circuit fabrication processes generally result in wafer-scale gradients
in line width (both metal and polysilicon), thin film thickness (metal wires, gate
oxide, interlayer dielectric) and doping concentration [43]. Manufacturing gradients
have been cited to explain distance-dependent mismatch in transistors [60]. These
variations significantly affect device and interconnect performance. In minimum-size
inverters, for example, Leff variation can lead to 9% delay mismatch [61] between
chips; in a different process 37% variation of ring oscillator speed was reported within
single dies [62]. Clocks depend on matching rather than absolute delays, and are
therefore insensitive to truly global parameter variations. We also make the optimistic
assumption that all systematic variations are compensated. This could be achieved
via modeling (i.e., statistical metrology), or simply testing finished chips if multiple
silicon revisions are to be made.
However, because clock networks span an entire chip, wafer-scale gradients are
noticeable. It is generally accepted that global effects can be ignored for distances
smaller than 100µm, but are noticeable for distances larger than 1mm [47, 60]. Global
environmental variations, specifically in temperature and DC supply voltage variation,
are imposed by design rather than fabrication, but are otherwise similar in effect.
Temperature affects resistivity of the metal, channel mobility, and threshold voltages,
and supply voltage affects saturation currents and hence gate delay [63].
Footnote 1: There is a subtle asymmetry between temporal variation in logic and clock. Slack in Eq. 1.4
cannot be exploited to decrease clock cycle time, while any decrease in clock uncertainty directly lowers
the minimum clock period. For this reason, temporal variations of the clock are analyzed explicitly.
Figure 2-7: Example H-tree (segments x1 through x7)
Table 2.1: Contributions to skew for an H-tree
Segment    xi
1          0.1
2          0.3
3          0.5
4          0.5
5          0.5
6          0.4
7          0.25
Average    0.36
The distance between most nominally matched components of a clock distribution
network is comparable to chip size, which is typically 1cm or larger. Fig. 2-7 shows
an example H-tree, and the distances xi, normalized to chip size, between nominally
matched wire segments are tabulated in Table 2.1. Most of the distances are comparable
to the size of a chip; hence, we may expect that the wafer-scale variations
are dominant and consider inter-chip mismatch data. Still, this brings up a messy
modeling issue.
Delay along a clock wire is a sum of small delays. The delay of each
buffer-wire-buffer segment contributes a small random component. If the segments are
strictly independent (e.g., uncorrelated threshold voltage variations), the variance
along the wire is the sum of individual variances, so the standard deviation of the
resulting offset increases as the square root of the length of the wire. Another model
is that the mismatch is due to a gradient of delays across a chip (perhaps from thin-film deposition). Because the linear gradient is summed, the mismatch rises with the
square of the wire length. Finally, if the perturbations are each fixed-size or uniformly
distributed (e.g., a higher supply voltage for a section of the chip), the worst-case
offset increases linearly with wire length.
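The three mismatch models can be compared with a short numerical sketch. The function names and the unit per-segment error are illustrative assumptions, not part of the original analysis; the point is only to show square-root, quadratic, and linear growth with the number of segments.

```python
import math


def independent_sigma(n_segments, sigma_segment):
    """Standard deviation of the sum of n independent per-segment errors."""
    return math.sqrt(n_segments) * sigma_segment


def gradient_offset(n_segments, gradient_step):
    """Accumulated offset when segment k is off by k * gradient_step."""
    return sum(k * gradient_step for k in range(1, n_segments + 1))


def fixed_offset(n_segments, delta_segment):
    """Worst-case offset when every segment adds the same fixed error."""
    return n_segments * delta_segment


if __name__ == "__main__":
    # Unit per-segment error (hypothetical) to expose only the scaling law.
    for n in (1, 4, 16, 64):
        print(n, independent_sigma(n, 1.0), gradient_offset(n, 1.0), fixed_offset(n, 1.0))
```

Doubling the wire length multiplies the independent-segment result by about 1.4, the fixed-size result by 2, and the gradient result by nearly 4 for long wires.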
Because gradients dominate over relatively long distances, it would probably be
most accurate to model short nearby wires with independent segments, long distant
wires in terms of gradients, and intermediate wires linearly. However, that obfuscates
the analysis unnecessarily; the key point is that short near wires match better than
long distant wires. For the sake of analysis, we will assume that uncertainty scales
linearly with delay with a mismatch coefficient α, as p(x) - p(0) ≈ αp(0).
This argument can be extended to say that the variability in delay along a path
scales linearly with the delay along the path; that is, that there is a fixed percentage
error in on-chip path delay. We will use this assumption, although there is an important caveat: the mismatch coefficient α depends on the construction of the path. A 1ns delay with α = 0.11
gives more skew (110ps) than a 1.1ns delay with α = 0.09 (99ps). For this reason the
classic line-driver optimization may give suboptimal results if wire mismatch is not
the same as buffer mismatch. However, for the optimal combination, delay variability
will scale linearly with delay.
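The caveat about path construction can be illustrated with the two cases quoted in the text. This is a minimal sketch assuming the fixed-percentage model (skew equals mismatch coefficient times delay); the helper name path_skew_ps is hypothetical.

```python
def path_skew_ps(delay_ns, alpha):
    """Skew in ps for a path delay in ns under a fixed-percentage error alpha."""
    return delay_ns * 1000.0 * alpha


# The two paths compared in the text: a faster path built from poorly matched
# elements versus a slightly slower path built from better matched elements.
skew_fast_path = path_skew_ps(1.0, 0.11)
skew_slow_path = path_skew_ps(1.1, 0.09)

print(f"{skew_fast_path:.0f} ps vs {skew_slow_path:.0f} ps")
```

The shorter delay does not win here, which is why minimizing nominal delay alone may not minimize skew.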
Of course, matching is not perfect for adjacent wires or devices either. Strong
sensitivity of threshold voltage and saturation current to L at short channels also
limits matching for minimum-size devices; typically saturation current has a 3% mismatch for minimum devices, and matching down to 1% is straightforward in larger
devices. Local mismatch is an important limit for phase detector offset in PLL and
DLL systems.
Time-varying effects include capacitive and inductive coupling between signal and
clock lines and signal-dependent capacitance. Careful layout can minimize the capacitance between signal lines likely to switch near clock edges and clock wires, but
signal coupling is still important because it can be a significant source of jitter. We
will assume that up to 5% of the capacitance of any wire may transition during the
time a clock edge propagates.
Temperature changes on a chip are generally many orders of magnitude slower
than the clock speed, and are therefore reasonably treated as static gradients. On
the other hand, supply voltage can change within a single clock cycle in response
to changing load current. For this reason, temporal correlation is important when
matching elements that depend on supply voltage. An example where this is significant is described in Section 2.4.4.
2.4
Clock Architecture Comparison
While a number of authors have considered the impact of variations on clock performance, most assume tree distribution [52, 41, 63]. This section establishes a common
metric and compares several clock architectures.
2.4.1
Clock metric
The three categories of mismatches listed above cover what is needed for a first-order
comparison of clock networks. For normalization, each is scaled to distribute a 1 GHz
clock to a total of 200pF load capacitance over a 2cm chip in a standard 0.25µm
CMOS process. A clock wire in a TSMC 0.25µm CMOS process would be 1µm wide,
have a resistance of about 0.07Ω/µm, and a capacitance of 0.1fF/µm.
It would be convenient to choose a single parameter to characterize clock networks.
As discussed earlier, skew and jitter are in general functions of both position and
time. It is appropriate to consider the worst case clock uncertainty over time, but
meaningless to look at worst case across a chip: in all practical cases a signal that
takes longer than a clock cycle to propagate would be pipelined, and hence re-clocked.
Hence, clock uncertainty between points on a chip further apart than one clock cycle is
irrelevant. For this reason, the metric for clock quality will be taken to be worst-case
clock mismatch over a distance corresponding to signal propagation distance during
one half of a clock cycle.
Figure 2-8: Schematic model of capacitive coupling (coupling capacitance 0.05C)
2.4.2
Tree
Propagation delay along an H-tree can be split into delay from the root to the leaves,
and delay from the leaves to a sub-block or tile. Delays to loads from a leaf are
generally not matched, so the entire delay in a sub-block adds directly to total skew;
this is sometimes called internal clock skew [14, 63]. The point of an H-tree, however,
is to match delays from the root to the leaves, so those delays are nominally matched,
and only variations contribute to skew. Consider an 8-level H-tree (i.e., one with
2^8 = 256 leaves). Assuming equal-sized buffers along the tree, these buffers would be
placed at intervals of perhaps 2mm, for a total of 10 segments.
Delay along the tree in this example is simulated to be 0.86ns. Assuming a mismatch coefficient α = 0.1,
skew caused by gradient mismatch is 0.86ns x 0.1 = 86ps. Internal skew (Si) is no
larger than 0.07Ω/µm x 625µm x 0.2pF ≈ 9ps.
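The two estimates above can be reproduced directly. The sketch below uses only the numbers quoted in the text; the constant names are illustrative, and the wire resistance is taken per micrometer as in the clock metric of Section 2.4.1.

```python
ALPHA = 0.1              # mismatch coefficient assumed in the text
TREE_DELAY_NS = 0.86     # simulated root-to-leaf delay
R_OHM_PER_UM = 0.07      # clock wire resistance per micrometer
PATCH_WIRE_UM = 625.0    # wire length inside a leaf patch
PATCH_LOAD_PF = 0.2      # load capacitance used for the patch estimate


def mismatch_skew_ps(delay_ns, alpha):
    """Skew from mismatch, as a fixed fraction of path delay."""
    return delay_ns * 1000.0 * alpha


def internal_skew_ps(r_ohm_per_um, length_um, load_pf):
    """Lumped RC bound on skew inside a patch (ohm * pF gives ps)."""
    return r_ohm_per_um * length_um * load_pf


tree_mismatch = mismatch_skew_ps(TREE_DELAY_NS, ALPHA)
tree_internal = internal_skew_ps(R_OHM_PER_UM, PATCH_WIRE_UM, PATCH_LOAD_PF)
print(f"mismatch {tree_mismatch:.0f} ps, internal {tree_internal:.2f} ps")
```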
Capacitive coupling adds a time-varying offset. Fig. 2-8 shows the schematic
model used to test the effect of capacitive coupling. The effect may be estimated by
adjusting the effective line capacitance for the Miller-multiplied coupling capacitance.
In the current example, the line capacitance is 200fF, the output capacitance of the
driving buffer is 34fF, and the input capacitance to the receiving buffer is 77fF. A
signal making a transition in the same direction as the clock lowers the effective wire
capacitance by 5% (given the assumptions above), so the delay should decrease by
(0.05 x 200)/(200 + 111) ≈ 3%. Conversely, a signal transitioning in the opposite direction will slow
down the clock by the same 3%, so the total would be up to 6% variation. (Simulation
indicates the total variation is 5%.) This component of uncertainty (skew if the
interference recurs on every clock cycle, jitter if it is inconsistent) also scales with
the total delay along the tree, and so adds a worst-case 45ps to clock uncertainty.
To sum up, a clock distributed by a tree as described above will have skew of 140
picoseconds, or 14% of the clock cycle; this is in line with industrial results given the
speed and assumptions about the process.
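As a check on the arithmetic, the coupling estimate and the tree total can be recomputed from the quoted values. The names below are illustrative; the three skew components are the rounded figures given in the text (86ps mismatch, 9ps internal, 45ps coupling).

```python
LINE_C_FF = 200.0         # clock line capacitance
DRIVER_OUT_C_FF = 34.0    # output capacitance of the driving buffer
RECEIVER_IN_C_FF = 77.0   # input capacitance of the receiving buffer
COUPLED_FRACTION = 0.05   # fraction of wire capacitance that may switch


def coupling_delay_shift(line_c, driver_c, receiver_c, coupled_fraction):
    """Fractional delay change when the coupled capacitance switches with the clock."""
    total_c = line_c + driver_c + receiver_c
    return coupled_fraction * line_c / total_c


shift = coupling_delay_shift(LINE_C_FF, DRIVER_OUT_C_FF, RECEIVER_IN_C_FF, COUPLED_FRACTION)

# Rounded components quoted in the text, in ps.
MISMATCH_PS, INTERNAL_PS, COUPLING_PS = 86.0, 9.0, 45.0
tree_total_ps = MISMATCH_PS + INTERNAL_PS + COUPLING_PS

print(f"one-sided shift {100 * shift:.1f}%, tree total {tree_total_ps:.0f} ps")
```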
Generalization
We can generalize from this example to other trees. Fig. 2-9(a) shows how the two
components of skew change with the depth of the tree, n. (The tree of this example
had n = 8.) As argued above, both mismatch and coupling cause skew proportional
to wire length L from root to leaves of the tree; in units of chip size,
L = 1 - (1/2)^(n/2).
Internal skew scales with the area of the resulting patch (see footnote 2), so Si ∝ 2^(-n).
The other key parameter is power. Power scales linearly with switched capacitance, so the clock distribution power (excluding the load) scales as 2^(n/2). Fig. 2-9(b)
combines the results into a plot of the fundamental clock network tradeoff between
power and performance.
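The scaling relations of this generalization can be tabulated for a few tree depths. This is a sketch under the stated proportionalities only: root-to-leaf length 1 - (1/2)^(n/2) in units of chip size, internal skew taken as proportional to patch area (assumed here to fall as 2^(-n)), and distribution power as 2^(n/2). All quantities are relative, and the function names are hypothetical.

```python
def root_to_leaf_length(n):
    """Root-to-leaf wire length of an n-level H-tree, in units of chip size."""
    return 1.0 - 0.5 ** (n / 2)


def relative_internal_skew(n):
    """Internal skew relative to a single patch covering the whole chip."""
    return 2.0 ** (-n)


def relative_power(n):
    """Clock distribution power (excluding the load), relative to n = 0."""
    return 2.0 ** (n / 2)


if __name__ == "__main__":
    for n in (2, 4, 6, 8, 10):
        print(n, root_to_leaf_length(n), relative_internal_skew(n), relative_power(n))
```

Deeper trees trade a rapidly shrinking internal skew against steadily growing power, while the length-dependent skew saturates as L approaches the chip size.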
Scaling
Note, however, that a clock tree does not scale well with process technology. As
chip dimensions shrink, wire delay (T) is, at best, constant. Total chip size is also
nearly constant. However, clock speeds increase as the gate delay decreases. Delay
along the clock net also speeds up, but not by the same factor. Along an optimally
buffered line, the ratio of gate delay (d) to T is constant, so as d falls, the distance
between buffers decreases. Wire delay is proportional to the square of the wire length
between buffers (l). Hence l ∝ √d. The total number of segments is proportional to
1/l, so the total delay along a tree is proportional to d/√d = √d. Since the clock
period is directly proportional to d, skew as a fraction of the clock period will grow
as 1/√d as gate delay falls. In other words, without a dramatic redesign or process
improvements, a 4GHz clock tree would have unpredictable clock skew of 30% of a
clock period, and a 16GHz clock would have to budget over half of the clock period
for skew and jitter margin.
Figure 2-9: Clock tree tradeoffs. (a) Skew components in a tree (area-scaled skew,
length-scaled skew, and total) vs. depth of tree. (b) Power vs. skew (ps) for a clock tree.
Footnote 2: Strictly speaking, it scales with length squared, but that is equivalent to area for non-pathological
patches.
Note that as clock speed increases, signal delay across a chip exceeds a single
clock cycle. In the example above, a 2cm-long wire has a delay of 0.86ns with 1GHz
clocks. Scaling to 4GHz, the same wire (with optimal buffering) will have a delay of
approximately 0.43ns, compared to a clock period of 0.25ns. Given the metric defined
in Section 2.4.1, therefore, there is no reason to minimize global skew at all. In a tree,
however, the worst-case skew occurs between nearest neighbors, so tree distribution
cannot take advantage of the relaxed global constraints. This is the fundamental
reason why trees become less attractive at high clock speeds.
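The scaling figures quoted in this section follow from the square-root dependence on gate delay. The sketch below assumes gate delay falls in proportion to the clock period and takes the 1GHz tree example (14% skew, 0.86ns wire delay) as the baseline; the function names and default arguments are illustrative.

```python
import math


def skew_fraction(clock_ghz, base_fraction=0.14, base_ghz=1.0):
    """Skew as a fraction of the clock period, growing as 1/sqrt(d)."""
    return base_fraction * math.sqrt(clock_ghz / base_ghz)


def buffered_wire_delay_ns(clock_ghz, base_delay_ns=0.86, base_ghz=1.0):
    """Delay of the same optimally buffered wire, shrinking as sqrt(d)."""
    return base_delay_ns / math.sqrt(clock_ghz / base_ghz)


if __name__ == "__main__":
    for f_ghz in (1.0, 4.0, 16.0):
        period_ns = 1.0 / f_ghz
        print(f_ghz, skew_fraction(f_ghz), buffered_wire_delay_ns(f_ghz), period_ns)
```

At 4GHz the sketch gives a skew of roughly 28% of the period and a cross-chip delay of 0.43ns against a 0.25ns period, in line with the rounded values in the text.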
Figure 2-10: Grid distribution block schematic
2.4.3
Grid
A pure grid network would have a single, central driver for the entire chip and a mesh
of clock wires. Skew would be simply the wire delay across the chip, just as it is the
wire delay in a patch for each leaf of a tree. In the limiting case, a clock plane with a
central driver would give skew of 0.07Ω/µm x 0.1fF/µm x (10^4 µm)^2 = 0.7ns (see footnote 3). Clearly,
a single driver will not give adequate performance, so modern grids are H-tree-grid
hybrids: a short H-tree distributes clock to a few (4 or 16, for example) buffers around
a chip, and those buffers drive a clock grid in parallel, as shown in Fig. 2-10. The
final patches are larger than those typical of trees, but the grid helps eliminate skew
caused by the tree distribution by shorting together outputs of multiple buffers.
Take as an example system a 4-level (2^4 = 16 node) clock tree where the final
buffers drive a global grid. Following the example of the previous section, such a tree
would have seven 2mm-long segments and an expected clock uncertainty of 70ps. Delay
across each region, assuming a lumped model with minimum-width wires, would give
a skew of 2.5mm x 70Ω/mm x 6.25pF ≈ 1ns. Because this skew is dominated by
wire resistance and load capacitance, it can be reduced by increasing the width of the
wires at the cost of increased power. At the point where the capacitance of the wires
equals the load capacitance there is one clock wire every 200µm, and the expected
wire skew is 89ps (85ps simulated).
Footnote 3: Scaling this value down to the size of the first Alpha gives skew ≈ 200ps, which was reported
for that chip.
Figure 2-11: Model circuit for shorted grid drivers.
Furthermore, shorting the buffers together helps drive down some of the uncertainty at the cost of increased short-circuit power during switching and somewhat
slower edge rates. A simple circuit model for a grid driven from multiple points is
shown in Fig. 2-11. Simulations with a 70 picosecond skew on buffer inputs show
a total skew of 145ps, of which 55ps is due to the input skew. It is possible to keep
driving this lower by increasing wire width; however, the benefits of wider wires get
incrementally smaller as the wire capacitance comes to dominate the total. Doubling
the wire width again, for example, lowers total skew to 110ps, of which 34ps is due
to the input.
The drawback, of course, is the power dissipation. The extra wiring needed to get
skew down to 110ps added 25pF of capacitance per buffer, while the clock load per buffer
is only 12.5pF. Still, grid distribution is used because much of the skew is predictable
and, unlike with H-trees, the clock design is largely independent of floorplanning.
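The lumped RC estimates used for the grid can be reproduced as follows. This is a rough sketch using only the wire parameters and dimensions quoted in the text; the function names are hypothetical, and the simulated values (for example the 85ps and 145ps results) are not modeled.

```python
R_OHM_PER_UM = 0.07      # clock wire resistance per micrometer
C_F_PER_UM = 0.1e-15     # clock wire capacitance per micrometer (0.1 fF)


def clock_plane_skew_s(length_um):
    """Distributed RC product for a centrally driven clock plane."""
    return R_OHM_PER_UM * C_F_PER_UM * length_um ** 2


def lumped_region_skew_s(length_mm, r_ohm_per_mm, load_f):
    """Lumped estimate: wire resistance across a region times its load."""
    return length_mm * r_ohm_per_mm * load_f


plane_skew = clock_plane_skew_s(1.0e4)                   # 10^4 um from a central driver
region_skew = lumped_region_skew_s(2.5, 70.0, 6.25e-12)  # 16-buffer grid region

print(f"plane {plane_skew * 1e9:.2f} ns, region {region_skew * 1e9:.2f} ns")
```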
Figure 2-12: Power vs. skew (ps) for a grid.
Generalization
The primary parameter for a gridded clock is the capacitance of the grid (C); that
sets both the power dissipation (P ∝ C) and the wire skew. Si is proportional to
1 + CL/C where CL is the load capacitance and C the grid capacitance. Mismatch-induced skew is shorted out by lower-resistance wires, so that component of skew falls
as 1/CL. A plot of simulated power dissipation vs. skew, corresponding to Fig. 2-9(b),
is shown in Fig. 2-12.
Scaling
Grid distributions depend only on wire delays. As mentioned above, wire delays tend
not to improve with process technology scaling. As the skew budget decreases with
rising clock speed, a grid clock must either increase capacitance or subdivide the chip
further with a deeper initial clock tree. In the example above, the initial tree itself
does not add significant power, so an obvious scaling strategy would be to simply
make larger trees to minimize Si.
As long as delay variations in the initial tree are comparable to rise time, deeper
trees and smaller Si will improve performance. However, rise time scales linearly
with d, so by the same reasoning as applied to the tree scaling arguments, skew
as a fraction of rise time will increase with 1/√d as gate delay falls. When the tree
skew exceeds rise time, short circuit power dissipation increases rapidly, and the clock
edges begin to show an unacceptable kink. Fig. 2-13 shows simulated edge shapes
with increasing input skew for a grid driven from a 4-level tree with skews from 0 to
200ps, and Fig. 2-14 shows the corresponding short circuit power dissipation.
Figure 2-13: Simulated edge in a grid with skew to the drivers (edge shape with input skew; voltage vs. time).
2.4.4
Active Feedback
As is evident from the sections above, an increasing share of skew comes from the
initial long-distance distribution of a clock to relatively small loads. A delay-locked
loop (DLL) could be adapted to measure and cancel out wire variations. One possible
implementation is shown in Fig. 2-15, where a DLL is used to implement a single wire
with low effective delay. The intuition is that the delays are adjusted symmetrically
until the round trip time from the source to the load and back is a known multiple
of a clock period (in line with the examples so far, assume the round trip time is
2ns, which is 2 clock periods). Then by symmetry, the signal arrives at the load
with a 1 period clock delay, which means it has effectively 0 delay for clock signals.
Unfortunately, this intuition is misleading.
Figure 2-14: Short circuit power in a grid vs. input tree skew (ps).
Figure 2-15: Low-skew wire with DLL
Despite the apparent symmetry, the forward path cannot be expected to match the
reverse path in this connection, for two main reasons. First, the nominally
matched buffers are physically separated. In Fig. 2-15, b1 should match b7, although
it would be physically near b8. b1 isn't as far away from its matched pair as it might be
in a tree, but it will still typically be millimeters away. Second, there is no temporal
correlation. The clock signal passes w1 at a different time than it passes w7, so
any time-dependent variations, including those due to power supply and capacitive
coupling, do not match. Taking the results from Section 2.4.2, the effective skew for a
1cm-long DLL wire would be ≈ 90ps, which is only a 30% improvement over a simple
wire, and that does not count offset in the comparison of the two edges or mismatches
in the delay cells.
Figure 2-16: Matching tree leaves with a DLL
Another approach, more like a traditional DLL, is shown in Fig. 2-16. The global
clock is distributed to two half H-trees, a phase comparison is done at the leaves, and
a variable delay is adjusted to align the clocks. The technique is meant to balance
delays along path 1 (d1) and path 4 (d4) in this example. Note, however, that while
nodes A and B may be matched, nodes C and D are not; the mismatch between
nodes C and D (m_CD) is
m_CD = (d1 + d3) - (d4 + d6).
The loop drives d1 + d2 = d4 + d5, so
m_CD = (d5 - d2) + (d3 - d6),
which is somewhat smaller than it would be without the DLL (in which case
m_CD = (d1 - d4) + (d3 - d6)) because w2 and w5 are both closer
together, and shorter, than d1 and d4.
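A small numerical sketch makes the residual mismatch concrete. The delay values below are hypothetical integers in picoseconds chosen only for illustration, and the loop is modeled, as an assumption, by adjusting d1 until d1 + d2 = d4 + d5.

```python
def mismatch_cd(d1, d3, d4, d6):
    """Mismatch between nodes C and D: (d1 + d3) - (d4 + d6)."""
    return (d1 + d3) - (d4 + d6)


def locked_d1(d2, d4, d5):
    """Value the loop drives d1 to so that d1 + d2 = d4 + d5."""
    return d4 + d5 - d2


# Hypothetical path delays in ps (not taken from the thesis).
d1, d2, d3 = 510, 100, 105
d4, d5, d6 = 490, 98, 102

m_open_loop = mismatch_cd(d1, d3, d4, d6)
m_with_dll = mismatch_cd(locked_d1(d2, d4, d5), d3, d4, d6)

print(m_open_loop, m_with_dll)
```

With the loop locked, the result equals (d5 - d2) + (d3 - d6): the large difference between d1 and d4 is replaced by the smaller difference between the short paths d2 and d5.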
An immediate generalization would be to break up the trees further, with two
more comparators and variable delay elements, as in Fig. 2-17. (Note the difference
between Fig. 2-17 and Fig. 2-18. The latter generalization requires matching between
delay elements D2 and D5, and between D3 and D6; the former does not require that
the delay elements match at all.) Because delays to the leaves are controlled by DLLs,
the top-level tree structure is no longer necessary; Fig. 2-19 shows a DLL distribution
where each DLL drives a local tree. Static delay variations of nearest neighbors are
cancelled out by the DLL to within the precision of the matching of the comparators.
Figure 2-17: Matching tree leaves with two DLLs
Figure 2-18: Matching tree leaves with two DLLs, which requires delay cell matching
Figure 2-19: DLL architecture
Dynamic variations, due to supply noise or signal coupling, however, persist; two
1cm-long paths with active DLL matching will have a relative jitter of approximately
50ps (all of it time-varying), and skew from mismatch in the phase detectors, and
some mismatch from distribution along local trees. A typical phase detector has a
delay equal to 2 inverters, and its two halves are physically close together, so skew
is expected to be approximately 2 x 5% x d ~ 10ps. As drawn, the maximum skew
in the network is not between two paths connected with a DLL; rather, the skew
between A and B is the sum of the skews through three DLLs (10ps each) and four
local trees (25ps each). Total clock uncertainty between A and B, then, is 180ps and
the scaling is even worse because the effective distance between two nearby points
grows rapidly as the number of DLLs increases. A much better result can be obtained
by using DLLs that take multiple reference inputs, and adjust output phase to be
aligned exactly between the two inputs. The network can then be redrawn somewhat
more symmetrically, as Fig. 2-20. (For clarity, the local tree was not drawn, and the
connections to the comparators are abstracted.)
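The 180ps figure is consistent with adding the static contributions along the A-to-B chain to the time-varying jitter. The sketch below assumes that reading (three phase detectors at 10ps, four local trees at 25ps, plus 50ps of jitter); the function name is hypothetical.

```python
def dll_chain_uncertainty_ps(n_dlls, dll_skew_ps, n_trees, tree_skew_ps, jitter_ps):
    """Worst-case uncertainty across a chain of DLLs and local trees."""
    return n_dlls * dll_skew_ps + n_trees * tree_skew_ps + jitter_ps


# Values quoted in the text for the path between nodes A and B.
uncertainty_ab = dll_chain_uncertainty_ps(
    n_dlls=3, dll_skew_ps=10, n_trees=4, tree_skew_ps=25, jitter_ps=50
)
print(uncertainty_ab)
```

Because the static terms grow with the number of DLLs between two nearby points, chaining more DLLs makes the worst case worse rather than better.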
Optimization of the number of tiles is straightforward. As argued
previously, internal skew scales with tile area, so as the number of tiles increases,
internal skew falls. However, every boundary between tiles introduces some skew
because of mismatch in the phase detector. Hence, as the number of tiles increases, the
number of boundaries increases. Fig. 2-21 shows the optimization curves calculated
for this clock metric.
Figure 2-20: Multi-input delay cell DLL architecture
Figure 2-21: Tile number optimization (area-scaled skew, boundary skew, and total vs. number of tiles)
One weakness of DLL networks is that DLLs are inherently sensitive
to input jitter. A phase-locked loop (PLL), though somewhat more complicated in
implementation, filters out noise on the inputs. PLLs and DLLs are nearly identical
structures in isolation. Each has a variable delay element as a core, represented in
Fig. 2-22(a). An input signal with phase θ is delayed by some time Δ and output with
phase φ. In both the DLL and PLL cases (Fig. 2-22(b) and Fig. 2-22(c)), Δ = φ - θ.
The only difference is where the input signal comes from. If the input to the block is
φ (its own output), the system acts as a PLL; if it is θ (the external input), a DLL.
The noise and stability implications of
the feedback will be considered in the next chapter.
Figure 2-22: A variable delay element and phase comparator can be configured into
a DLL or a PLL. (a) Variable delay block. (b) Delay-locked loop. (c) Phase-locked loop.
Scaling
As in other clock networks, faster clocks require a more finely-grained architecture.
Jitter in a DLL network will rise in exactly the same way as it increases in clock
trees, and for the same reasons. Skew scales linearly with d because it is comprised
of comparator mismatches and delays across each leaf-patch. Note, however, that
in a PLL the noise can be expected to scale with d; a PLL network like the one in
Fig. 2-20 would have total clock uncertainty that is a constant fraction of the clock
period.
Chapter 3
Synchronization and Stability
The purpose of an on-chip clock is to synchronize computation. Distributed networks
make explicit this synchronization. Chapter 2 argues that the performance of distributed clock networks scales favorably with clock speed (or at least does not scale
as poorly as do clock trees). This chapter gives some background on synchronization
architectures and then considers the synchronization of multiple oscillators.
3.1 Previous Work: Synchronization
There are two main synchronization schemes. In the first method, handshaking guarantees that computation proceeds in the correct order, although independent processes
are not synchronized in any way. In the second method, a global clock is used to synchronize data, but the generation of the global clock is split among multiple blocks
that must align their respective clocks.
3.1.1 Local Data Synchronization
The earliest distributed networks dealt with synchronization of data explicitly, rather
than of multiple clocks. The archetypical example of this is large processor arrays.
It has been suggested that the computational density available in modern VLSI be
used to build large arrays of simple processors which communicate only with nearest
neighbors [21, 20, 15, 16]. Since skew is only relevant between communicating processors [7], trees do not seem well suited to the problem: there is no reason to eliminate
global skew as long as the clock skew between neighboring processors is low. This can
be accomplished by having each processor synchronize directly with its peers.
So-called self-timed systems use handshaking between the blocks for synchronization [21, 41]. Each communication path between two blocks is accompanied by extra
signals that implement some manner of flow control. For example:
1. The processor sending data puts the data on the wire and asserts a Data Ready
signal.
2. The receiving processor reads the data and then asserts a Data Accepted
signal.
3. Data Ready is unasserted.
4. Data Accepted is unasserted.
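The four steps above can be sketched as a small Python model. The `Channel` class, its field names, and the `transfer` function are illustrative names introduced here, not part of any cited design; sender and receiver are run in sequence inside one function purely to show the order of the signal transitions.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """One self-timed link: a data wire plus two flow-control wires (hypothetical model)."""
    data: object = None
    data_ready: bool = False
    data_accepted: bool = False
    log: list = field(default_factory=list)


def transfer(ch, value):
    """Run one four-phase handshake and return the value seen by the receiver."""
    # 1. Sender drives the data wire, then asserts Data Ready.
    ch.data = value
    ch.data_ready = True
    ch.log.append("ready+")
    # 2. Receiver reads the data, then asserts Data Accepted.
    received = ch.data
    ch.data_accepted = True
    ch.log.append("accepted+")
    # 3. Sender sees the acknowledgement and unasserts Data Ready.
    ch.data_ready = False
    ch.log.append("ready-")
    # 4. Receiver unasserts Data Accepted; the link is idle again.
    ch.data_accepted = False
    ch.log.append("accepted-")
    return received


ch = Channel()
word = transfer(ch, 0x2A)
```

No clock appears anywhere in the model: the ordering is enforced only by each side waiting for the other's transition, which is the property the text attributes to self-timed systems.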
Because no global synchronization is needed, self-timed systems are an example
of an asynchronous system. Such systems have several advantages over globally synchronized systems: there is no global clock to propagate, and each block can work at
its actual speed rather than the global worst-case clock speed [21]. However, there
are several significant drawbacks: there is circuit overhead in generating the local
synchronization signals; the designs are notoriously hard to analyze and test; and
often the system operates at the worst-case time anyway, because computation is
always limited by the latest input [15, 41, 42]. The approach suggested by El-Amawy
[16] avoids some of these problems by having a system that looks fully synchronous,
albeit with some local clock skew. However, there is still no global synchronization,
and communication is only allowed between neighboring processors. Despite these
drawbacks, asynchronous systems are an alternative to global clocking, and may become more prevalent if the prospects of very high speed clock distribution are not
improved.
Figure 3-1: Mode-locking example (clock signals at Nodes 1 through 4 versus time)
3.1.2 Local Clock Synchronization
The proposed clock distribution architecture is organized as a synchronous array.
That is, clocks are generated at multiple places over the chip and controlled to have
the same phase and frequency. This approach has not been used in integrated clocks,
but it has been proposed for parallel computers, and some of the issues are similar
[40]. Pratt and Nguyen suggest constructing a clock for a parallel computer from
synchronized, voltage-controlled quartz crystal oscillators. Phase detectors and
integrators generate phase error signals, and these are used to pull the crystals to the
same phase and frequency.
While the desired, phase-locked configuration can be proven stable, it is possible
that some arrangement of unequal clock phases is also stable on a given network;
this effect is known as mode-locking. In the simplest example, a system consisting of
four nodes is stable although the phases are not equal, as shown in Fig. 3-1. Each
node sees one neighbor leading and one lagging, and therefore doesn't adjust. The
authors show that mode-locking can be avoided in a regular mesh with nonlinear
phase detectors, which they implement as balanced XOR gates.
This architecture is inconvenient for on-chip clock distribution for several reasons.
First, modern microprocessors are not organized as regular structures internally;
memory caches and ALUs have vastly different clocking needs. Therefore it
will be necessary to remove the constraint that the clock nodes form a regular array.
Second, this method depends on having relatively noise-free, well-matched crystal oscillators, but such oscillators are not available on chip, and what is available has much
worse short-term stability. Therefore, the phase comparators and stabilization network must be completely redesigned to compensate for the noisier oscillators. Third,
they assume that wire delays between nodes are negligible; on an IC, these delays are
the very heart of the problem.
3.2 Proposed Clock Architecture
The proposed distributed clock network is an array of synchronized PLLs. Independent
oscillators generate the clock signal at multiple points ("nodes") across a chip; each
oscillator distributes the clock to only a small section of the chip ("tile") (Fig. 3-2).
Phase detectors (PD) at the boundaries between tiles produce error signals that are
summed by an amplifier in each tile and used to adjust the frequency of the node
oscillator. In general, the network need not be square or regular.
With locally generated clocks, there are no chip-length clock lines to couple in jitter; skew is introduced only by asymmetries in phase detectors instead of mismatches
in physically separated buffers; and the clock is regenerated at each node, so high
frequency jitter does not accumulate with distance from the clock source. Unlike
earlier work on multiple clock domains which suggested the use of multiple independent clocks, this approach produces a single fully synchronized clock. The rest of this
chapter examines small and large signal stability of a distributed phase-locked loop.
3.3 Small Signal
In a multiple-oscillator PLL, large- and small-signal behavior are interrelated. In
normal operation, the oscillators are phase-locked, and jitter depends on the network
response to noise. Because startup is expected to take a negligibly small fraction of
time, the connection of the oscillators is optimized for small-signal behavior rather
than to make initial acquisition more efficient. The linearized small-signal behavior,
valid when the oscillators are nearly in phase, is analyzed first.
3.3.1 General Derivation
A traditional phase-locked loop (PLL) consists of three components: a voltage-controlled
oscillator (VCO), a phase detector (PD), and a low-pass loop filter, connected
as shown in Fig. 3-3. In a digital application like clock generation, the output of the
oscillator is a square wave, and the phase detector generates a signal that on average
is related to the difference in phase between two square waves. Clearly, both the
oscillator and the phase detector are nonlinear in a strict sense. However, there is an
approximately linear relationship between the input voltage of the oscillator and the
phase of the output square wave. The relationship between the input phase difference
and averaged output of the phase detector is also linear. Hence, the system can be
modeled as a linear feedback system (Fig. 3-4).
The system as drawn in Fig. 3-4 is described by:

$$ \phi = \frac{aH(s)}{s}\,(u - \phi) \tag{3.1} $$

$$ \phi = \frac{aH(s)}{s + aH(s)}\,u \tag{3.2} $$

where $u$ is the input phase and $\phi$ is the output phase. The poles of the system are the solutions of

$$ aH(s) + s = 0 \tag{3.3} $$

Substituting $H(s) = (s + z)/s$ into Eq. 3.3 gives

$$ a(s + z) + s^2 = 0 \tag{3.4} $$

which is a familiar result for a simple phase locked loop.
Exactly the same analysis applies to a network of coupled oscillators. Consider a
set of interlocked PLLs, as shown in Fig. 3-5.
Figure 3-2: Distributed clocking network (chip boundary, tile boundaries, phase detectors, and a loop filter and VCO in each tile)
Figure 3-3: Standard phase-locked loop.
Figure 3-4: Linear system model of a standard phase-locked loop.
Figure 3-5: Multi-oscillator phase-locked loop
Figure 3-6: Linear system model of a multi-oscillator phase-locked loop

The network can be modeled as a multivariable linear system; in fact, the block
diagram (Fig. 3-6) is essentially identical to the one for a single oscillator system,
except that the connections between blocks are vectors instead of individual signals,
and the gains and transfer functions are matrices instead of scalars. This means that
the phase detector becomes a matrix $A_1$ of size $N(N+1)/2 \times N$ instead of a single
subtraction, and the loop filter becomes $A_2$, a corresponding $N \times N(N+1)/2$ matrix.
$G = A_2 A_1$ is an intuitively meaningful $N \times N$ matrix. The network of oscillators
is similar to a lumped circuit $C$ with a node for each oscillator and a branch for
each connection between pairs of oscillators. Node voltages in $C$ represent oscillator
phase, and branch currents represent the error signals on the output of the phase
detector. $G$ is the conductance matrix for $C$ with unity conductance branches. $G$ for
a 4 oscillator network is shown in Eq. 3.5. Each off-diagonal entry $g_{ij}$ is $-1$ if there is
a phase detector between node $i$ and node $j$, and 0 otherwise; each diagonal entry
$g_{ii}$ is the number of detectors attached to node $i$.

$$ G = \begin{pmatrix} 3 & -1 & -1 & 0 \\ -1 & 2 & 0 & -1 \\ -1 & 0 & 2 & -1 \\ 0 & -1 & -1 & 2 \end{pmatrix} \tag{3.5} $$
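The rule for filling in $G$ can be sketched in a few lines of Python. The function name, the 0-based node numbering, and the assumption that the external reference contributes one extra detector at the first node are illustrative choices rather than details taken from the text; with them, the sketch yields the four-oscillator matrix of Eq. 3.5.

```python
def conductance_matrix(n_nodes, branches, reference_nodes):
    """Build G for a network with unity-conductance branches.

    branches: pairs (i, j) of nodes joined by a phase detector (0-indexed).
    reference_nodes: nodes that carry one extra detector tied to the reference.
    """
    g = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in branches:
        # Each detector adds 1 to both diagonal entries and -1 off the diagonal.
        g[i][i] += 1
        g[j][j] += 1
        g[i][j] -= 1
        g[j][i] -= 1
    for i in reference_nodes:
        # A detector to the reference only touches the diagonal.
        g[i][i] += 1
    return g


# Assumed 2 x 2 arrangement: node 0 neighbors 1 and 2; node 3 neighbors 1 and 2.
G4 = conductance_matrix(4, [(0, 1), (0, 2), (1, 3), (2, 3)], reference_nodes=[0])
```

Because the construction only needs a list of node pairs, it applies unchanged to irregular networks, which is the case the proposed architecture is meant to allow.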
DC gain in the loop can be lumped into $a_3$.
Recasting Eq. 3.1 in matrix form gives Eq. 3.6,

$$ \phi = \left[ sI + a_3 A_2 A_1 h(s) \right]^{-1} h(s)\, a_3 A_2\, u \tag{3.6} $$

where $u$ is now the phase error input to each phase comparator. In other words, $u(1)$
is the reference phase, and $u(2) \ldots u(n)$ are the noise contributions from interconnect
and phase detector mismatch.
3.3.2 Examples
Matrix $A_1$ is determined by the geometry of the tiles, and hence will be constrained by
the placement of clock loads, which for this problem is fixed. Assuming the simplest
possible phase-locked loop, $h(s) = (s + z)/s$. This leaves $A_2$, $a_3$, and $z$ as design
variables.
There are still far too many choices to find the general optimum, but a few examples may help guide the search.
Single oscillator
The reference design is a single-oscillator phase-locked loop. Stability constraints of
a single oscillator PLL may be derived directly from Eq. 3.3; however, it is more
common and more intuitive to analyze the loop gain, $a\,h(s)/s$. Magnitude and phase
Bode plots of the loop gain are shown in Fig. 3-7. Note that because of sampling at
the phase detector, the continuous time approximation is only valid for frequencies
much lower than the oscillator frequency. The Bode plots add multiple parasitic
poles at the clock frequency $\omega_c$ to model the phase effects of the sampling.

Figure 3-7: PLL loop gain Bode plots. (a) Loop gain magnitude. (b) Loop gain phase.

For the PLL to be stable and sufficiently damped, the phase must be above $-135^\circ$ when the
loop gain is at 0 dB. This means that the unity-gain frequency, $\omega_0$, should be much
lower than $\omega_c$, and that the zero, $z$, should be much lower than $\omega_0$. The location of
the dominant pole is not critical to the stability.
For a typical 1 GHz oscillator, $a = \omega_0 \approx 330$ MHz, consistent with the constraint
$\omega_0 < \omega_c$. In turn, this puts an upper limit of 50 MHz on $z$. Fig. 3-8 shows the root
locus for this PLL over a gain error from $-50\%$ to $+100\%$.
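The pole movement under gain error can be reproduced approximately from Eq. 3.4, whose roots have a closed form. The Python sketch below is only an illustration: it uses the quoted 330 MHz and 50 MHz figures directly as the values of $a$ and $z$ (ignoring any factor of $2\pi$ and the parasitic poles at the clock frequency), and the function and variable names are introduced here.

```python
import cmath


def pll_poles(a, z):
    """Roots of s^2 + a*s + a*z = 0, i.e. Eq. 3.4 with h(s) = (s + z)/s."""
    disc = cmath.sqrt(a * a - 4.0 * a * z)
    return (-a + disc) / 2.0, (-a - disc) / 2.0


A_NOMINAL = 330e6  # nominal gain, assumed equal to the quoted unity-gain frequency
Z = 50e6           # zero placed at the quoted upper limit (assumption)

# Pole pairs at the two ends and the middle of the quoted gain-error range.
locus = {}
for gain_error in (-0.5, 0.0, 1.0):
    a = A_NOMINAL * (1.0 + gain_error)
    locus[gain_error] = pll_poles(a, Z)
```

Under these assumptions the poles stay in the left half plane over the whole range; they form a complex pair when $a < 4z$ (the $-50\%$ case here) and split along the real axis at higher gain.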
One dimensional array
A one-dimensional array of oscillators with phase detectors between neighbors is the
first generalization of a single PLL. In a perfectly asymmetrical array (call this system
$S_1$), the output of PLL $i$ is the input to PLL $i+1$, as shown in Fig. 3-9. $S_1$ is described
by

$$ A_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix}, \qquad A_{2,1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{3.7} $$
Figure 3-8: Root locus for single-oscillator PLL with gain error (imaginary axis versus real axis)
Figure 3-9: Asymmetrical one-dimensional PLL array
This system has multiple poles at the same place where a single-oscillator PLL has
single poles.
Figure 3-10: Symmetrical one-dimensional PLL array

On the other hand, in a perfectly symmetrical array (call it $S_2$), the input to each
oscillator $i$ is the phase of oscillators $i-1$ and $i+1$ (Fig. 3-10). The $A_1$ matrix is the
same because the physical arrangement of nodes is identical, but $A_2$ changes:

$$ A_{2,2} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{3.8} $$
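The two one-dimensional arrays differ only in the loop-filter matrix, and the effect on the loop can be checked numerically. The sketch below, in plain Python, forms $A_2 A_1$ for a four-oscillator chain with the asymmetrical choice (the identity) and the symmetrical choice (the transpose of $A_1$) and finds the largest eigenvalue of each. The helper names and the use of power iteration are choices made for this illustration.

```python
import math

# Phase-detector matrix for a four-oscillator chain; row 0 is the reference detector.
A1 = [[1, 0, 0, 0],
      [-1, 1, 0, 0],
      [0, -1, 1, 0],
      [0, 0, -1, 1]]
# Asymmetrical array: each oscillator listens to one detector only.
A2_ASYM = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
# Symmetrical array: each oscillator listens to the detectors on both sides.
A2_SYM = [list(col) for col in zip(*A1)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dominant_eigenvalue(m, iterations=200):
    """Power iteration with a Rayleigh quotient; used for the symmetric product only."""
    v = [1.0] + [0.0] * (len(m) - 1)
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]
    return sum(a * b for a, b in zip(v, w))


G_ASYM = matmul(A2_ASYM, A1)
G_SYM = matmul(A2_SYM, A1)

# G_ASYM is lower triangular, so its eigenvalues are its diagonal entries.
lambda_asym = max(G_ASYM[i][i] for i in range(4))
lambda_sym = dominant_eigenvalue(G_SYM)
```

For this four-node chain the asymmetrical product has all eigenvalues equal to 1, while the symmetrical product has a largest eigenvalue of about 3.53, so the symmetrical loop needs a correspondingly lower gain for the same margin.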
To achieve the same phase margin in $S_2$ as in $S_1$, it is necessary to lower the gain $a_3$.
This can be shown with a geometrical argument: in $S_2$, when the phase of oscillator
$i$ changes by $\Delta\phi$, the change is measured at two phase detectors, so oscillator $i$ feels
twice the feedback that it would have felt in $S_1$, and at the same time, oscillators
$i-1$ and $i+1$ both adjust in the opposite direction, giving 4 times the effective gain.
Hence, the gain must be decreased by a factor of approximately 4. Mathematically,
the largest eigenvalue of $A_{2,1}A_1$ is 1, but the largest eigenvalue of $A_{2,2}A_1$ is 3.5.
Poles of the symmetrical system, solved via Eq. 3.6,¹ are plotted in Fig. 3-11.

¹ While it is possible to use Eq. 3.6 directly, it is often more convenient to take advantage of the
simple form of $h(s)$, and rewrite the zero-input state equations thus:

$$ \begin{pmatrix} \phi' \\ \phi'' \\ \phi''' \end{pmatrix} = \begin{pmatrix} 0 & I & 0 \\ 0 & 0 & I \\ -Gz & -G & -pI \end{pmatrix} \begin{pmatrix} \phi \\ \phi' \\ \phi'' \end{pmatrix} \tag{3.9} $$

Figure 3-11: Root locus for a one-dimensional array of PLLs.

The key difference between $S_1$ and $S_2$ is the systems' response to noise. In both cases,
noise at frequencies higher than the unity gain frequency $\omega_0$ is attenuated. For
frequencies much lower than $\omega_0$, the response can be calculated via Eq. 3.6. Fig. 3-12
shows a Bode plot of noise at node P in response to a noise source at node N.
Noise performance of $S_1$ is much worse for intermediate frequencies because there is
no feedback so errors propagate forever. In $S_2$, the feedback limits the influence of
preceding stages, and this in turn attenuates noise. For this reason, networks with
feedback are preferred, despite the more complicated stability calculation.

Figure 3-12: Comparison of noise responses for symmetrical and asymmetrical networks (noise versus frequency)

Two dimensional array
A two dimensional array is analyzed exactly the same way as is a one-dimensional
array, except that the gain has to decrease by another factor of two because the center
oscillators see four neighbors rather than two. A 16-element array in a 4 x 4 grid is
implemented in this thesis. Its G matrix and poles are shown below.
I
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0)
1
-3
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
-3
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
-2
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
-3
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
-4
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
-4
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
-3
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
-3
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
-4
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
-4
1
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
-3
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
-2
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
-3
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
-3
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
-2)
(3.10)
3.4 Large Signal: Mode Locking
Figure 3-13: Root locus for a two-dimensional array of PLLs.

The analysis of the previous section indicates that fully-connected networks should
have a better noise response than asymmetrical networks. However, the feedback
allows the possibility of undesirable large-signal modes. Consider the network of
Fig. 3-5, and its associated matrices:

$$ A_1 = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{pmatrix}, \qquad A_2 = A_1^T = \begin{pmatrix} -1 & 1 & 1 & 0 & 0 \\ 0 & -1 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 & 1 \\ 0 & 0 & 0 & -1 & -1 \end{pmatrix} \tag{3.11} $$

Figure 3-14: Mode-locking example (clock signals at Nodes 1 through 4 versus time)
Because phase is periodic with period $2\pi$, the phase measured at the phase detectors
is $\Delta\phi = A_1\phi \bmod 2\pi$. For small $\phi$, $(A_1\phi \bmod 2\pi) = A_1\phi$, so the nonlinearity is
irrelevant. However, consider $\phi_m = [0, \pi/2, -\pi/2, \pi]^T$. Because of the nonlinearity,

$$ A_2 (A_1\phi_m \bmod 2\pi) = A_2\,[0, -\pi/2, \pi/2, -\pi/2, \pi/2]^T = 0 \tag{3.12} $$

so $\phi_m$ is a stationary point. This is intuitively easy to see, in reference to Fig. 3-14:
each oscillator leads one neighbor, and lags behind another neighbor by exactly the
same amount. The net phase error is zero, so clearly there is no restoring force to drive
the oscillators into phaselock. Furthermore, this equilibrium point is stable, because
the nonlinearity does not change for small deviations from $\phi_m$, so dynamics about $\phi_m$
are the same as those about 0. The locking of a distributed oscillator to non-zero
relative phases has been called mode-locking [40]. At startup, each oscillator in a
distributed PLL starts at a random phase, so there is a nonzero chance of converging
to a mode-locked state. Simulations show that for a network like the one shown here,
the system ends modelocked from ~1/3 of random initial states. The probability
goes up rapidly with the size of the system; a 4 x 4 array ends up modelocked
well over 99% of the time.
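The stationary point of Eq. 3.12 can be checked directly. The sketch below is a plain-Python illustration, not the simulation used for the statistics above: it assumes that phase differences wrap into $[-\pi, \pi)$, takes $A_2 = A_1^T$ with the detector ordering of Eq. 3.11, and introduces the helper names `wrap`, `matvec`, and `restoring_signal` for this purpose.

```python
import math

# Detector matrix for the four-oscillator network (rows: reference, then four branches).
A1 = [[-1, 0, 0, 0],
      [1, -1, 0, 0],
      [1, 0, -1, 0],
      [0, 1, 0, -1],
      [0, 0, 1, -1]]
A2 = [list(col) for col in zip(*A1)]  # loop-filter matrix taken as the transpose


def wrap(x):
    """Map a phase difference into [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def restoring_signal(phi):
    """Return (drive at each oscillator, wrapped phase seen by each detector)."""
    measured = [wrap(d) for d in matvec(A1, phi)]
    return matvec(A2, measured), measured


# Candidate mode-locked state: neighbors a quarter cycle apart.
phi_m = [0.0, math.pi / 2, -math.pi / 2, math.pi]
drive_locked, measured_locked = restoring_signal(phi_m)

# Small offset on node 0 only: the wrap is inactive and the drive equals G * phi.
drive_small, _ = restoring_signal([0.1, 0.0, 0.0, 0.0])
```

Every detector in the mode-locked state reports a quarter-cycle error, yet the summed drive at each oscillator is zero, whereas the small in-phase perturbation produces the linear response given by the conductance matrix.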
Pratt and Nguyen proved several useful properties about systems in mode-lock.
The lemmas and theorem are repeated here with outlines of proofs, generalized to
include arbitrary (rather than Cartesian) networks.
Consider a system of oscillators to be a circuit, with oscillators at the nodes,
and connections between oscillators to be branches. (This is the same model as was
presented in Section 3.3.1). The phase counterpart to Kirchhoff's Voltage Law is:
Lemma 1 The sum of branch phase differences around any loop must be a multiple of $2\pi$.
The sum is a multiple of $2\pi$ rather than 0 because phase differences here are defined
over a range $[-\pi, \pi)$, so at any branch $2\pi$ might be added or subtracted to bring the
result into the right range. For example, a phase detector will measure the difference
$5\pi/6 - (-5\pi/6)$ as $-\pi/3$, not $5\pi/3$. This is true independent of mode-lock.
The second lemma derives from conditions for mode-lock: that is, the nodes are
in static equilibrium although the phases are not identical.
Lemma 2 If a set of oscillators is mode-locked, there must be at least one loop in
the network for which the sum of phase differences is a nonzero multiple of $2\pi$.
The proof is as follows: in mode-lock, by definition, the nodes are not all at the
same phase. Therefore, there must be at least one node which connects to a branch
with nonzero phase error. Call that Node 1. Because Node 1 is in equilibrium by
definition of mode-lock it must connect to at least one branch with a positive phase
error. That branch connects to some Node 2, and appears as a negative phase error
there. Since Node 2 is also in equilibrium, it must have some other branch with an
offsetting positive phase error. Because there is a finite number of nodes, the loop
will eventually close back on Node 1. By Lemma 1, the sum must be a multiple of
$2\pi$. Because by construction, all the branches were positively-oriented, the sum must
be nonzero [40].
There are a number of ways to avoid mode-lock. The most obvious one is to simply
break the feedback: a consequence of Lemma 2 is that if there are no feedback loops,
there can be no modelock. This is not an attractive solution because, as shown in
the example with a one-dimensional array, full feedback helps average and attenuate
noise, so it would be best to avoid modelock without affecting the interconnection
of the system or the operation when correctly phase locked. One possible solution
would be to have a special startup state where there is no feedback between oscillators,
and then an operational state with full feedback. The system might be synchronized
during the startup, and then would remain phase-locked in the operational state. The
biggest drawback of this approach is that the transition from the reset state to the
operational state jolts the system, and could push it into mode-lock. Thus, it would
be preferable to have a solution that does not require changing network topology even
temporarily. Fortunately, there is such a way.
If we define a minimal loop as a loop in the graph that cannot be decomposed
into other loops, we can combine the results succinctly into:
Theorem 1 For a system in mode-lock, there must be a phase difference $\theta$ between
two oscillators such that $\theta \ge 2\pi/n$, where $n$ is the number of nodes in the largest
minimal loop in the network.
By Lemma 2, there must be at least one loop ($L$) with a phase difference sum of at
least $2\pi$. If it has more than $n$ nodes, it cannot be a minimal loop. Decompose $L$ into
$L_1$ and $L_2$. By Lemma 1, the loop sum around both $L_1$ and $L_2$ must be an integral
multiple of $2\pi$, so at least one of them must have a loop sum of at least $2\pi$; iterate
if necessary to get a loop of $n$ or fewer nodes. Since the sum of the branch phase
differences must be at least $2\pi$, at least one of the branches must have a phase
difference of at least $2\pi/n$.
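The two lemmas and the theorem can be illustrated on a single four-node loop. The Python sketch below assumes that phase differences wrap into $[-\pi, \pi)$ and uses an example state with neighbors a quarter cycle apart; the function names and the node ordering are introduced here for illustration.

```python
import math


def wrap(x):
    """Map a phase difference into [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi


def loop_phase_differences(phases, loop):
    """Wrapped phase difference across each branch, walking the loop in order."""
    successors = loop[1:] + loop[:1]
    return [wrap(phases[b] - phases[a]) for a, b in zip(loop, successors)]


# Example from Lemma 1: the detector reports -pi/3 rather than 5*pi/3.
example = wrap(5 * math.pi / 6 - (-5 * math.pi / 6))

# Assumed mode-locked ring, nodes listed in ring order.
loop = [0, 1, 2, 3]
locked = [0.0, math.pi / 2, math.pi, -math.pi / 2]
diffs_locked = loop_phase_differences(locked, loop)
winding_locked = sum(diffs_locked) / (2 * math.pi)

# In-phase state with small offsets: the loop sum is zero.
in_phase = [0.00, 0.05, -0.02, 0.01]
winding_in_phase = sum(loop_phase_differences(in_phase, loop)) / (2 * math.pi)

# Theorem 1: some branch must carry at least 2*pi/n when the loop sum is nonzero.
bound = 2 * math.pi / len(loop)
```

In the mode-locked example the loop sum is one full turn and every branch sits exactly at the bound $2\pi/n = \pi/2$, which is why a detector whose gain turns negative at or below that angle removes the state.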
Theorem 1 suggests a way to distinguish between mode-locked states and the
desired 0-phase state: in mode-lock, there must be at least some large phase errors
across individual branches. If the gain of the phase detector is designed to be negative
for a phase difference larger than $\theta$, then all mode-locked states are made unstable
without affecting the in-phase equilibrium. Pratt and Nguyen suggest that an XOR
phase detector precludes modelock in a rectangular network of oscillators because the
response decreases for phase errors larger than $\pi/2$ [40]. This result follows directly
from Theorem 1: in a rectangular array, the largest minimal loop has 4 nodes, so
$\theta = 2\pi/4 = \pi/2$. Two other phase detectors are described in the next chapter, both
with $\theta < \pi/2$, which would be useful in non-rectangular networks, and where more
gain near 0 phase is desirable.
Chapter 4
Implementation and Testing
Distributed Clocks
Two test chips were made to explore implementation issues: how much power do the
oscillators require? How much area is needed for the compensation filters? Can a
real loop, with the buffer and wire delays, be stabilized? The first was a 4-oscillator
chip in a 0.6 µm double-poly CMOS process with a clock speed up to 350 MHz, and
the second was a 16-oscillator chip in a 0.35 µm single-poly CMOS process at clock speeds of
1.2-1.4 GHz. The two chips are described in turn below.
4.1 4 Oscillator Chip
The 4 oscillator chip was done as a proof of concept to show correct phase locking in
the simplest system that could possibly be vulnerable to modelock; a micrograph is shown
in Fig. 4-1. It consists of four nodes (each with an oscillator and loop filter) and
five phase detectors (one between each pair of neighbors, and one connected to an
external input). High-speed probes contact chip pads at the edges of the chip. One
probe drives the input, and the other three are connected to outputs of the oscillators.
(The probes are too large to connect more than one probe on a single chip side, so
all four oscillators could not be measured at the same time.)
Figure 4-1: Micrograph of the 4 oscillator, 350 MHz chip
4.1.1 Oscillator
The primary metric in the design of oscillators for clock generation is jitter, and
the majority of that is due to power supply noise [64, 65]. Integrated LC oscillators
often have a lower noise floor than other on-chip oscillators, but substrate and supply
noise are dominant on a large digital chip. Ring-type or relaxation oscillators are
usually preferred for on-chip clocks because large chips are usually sorted into different
categories based on measured achievable clock speed, and LC oscillators are more
difficult to tune. For this chip, a differential relaxation oscillator was chosen because
Hspice simulations showed that this relaxation oscillator had better power-supply
rejection than did ring oscillators. The relaxation current-controlled oscillator, or
"CCO," is shown in Fig. 4-2. Transistors $M_3$, $M_4$, $M_5$, and $M_6$, along with capacitor
$C$, make up a conventional source-coupled multivibrator, with $M_7$ and $M_8$ as active
loads and nbias controlling oscillation frequency through $I_{d3,4}$. The drawback of
that circuit is a feedthrough of $-6$ dB to nodes V+ and V- from $V_{DD}$, and almost
0 dB to the capacitor from ground via $C_{bs}$ of $M_{3,4}$, so supply noise rejection is poor.
In the proposed oscillator, $M_1$ and $M_2$ provide shunt-shunt feedback around $M_3$ and
$M_4$ respectively, lowering the output impedance at V+ and V- to $1/g_m$. D1 and D2
limit the amplitude of oscillation to avoid saturation of $M_3$ and $M_4$. Frequency can
be adjusted by adding common-mode current into nodes V+ and V-.
Oscillator layout is shown in Fig. 4-3. Layout for both halves of the oscillator
is identical, and the halves are immediately adjacent. Good matching between the
halves corresponds to a 50% duty cycle. Furthermore, all source/drain regions were
shared to minimize layout area and parasitic capacitance.
4.1.2 Phase Detector
As discussed previously, modelock can be avoided in regular arrays by using nonlinear
phase detectors whose response decreases monotonically beyond a phase difference of
$\pi/2$ [40]. The phase detector Pratt and Nguyen suggest (a flip-flop delay and an XOR
gate) is not well-suited for integrated PLLs, however. First, it has relatively low gain,
so mismatch can lead to large input-referred phase offsets. Second, it generates
full-swing digital signals at half the clock frequency; this digital noise must be attenuated
in the loop filter.

Figure 4-3: Relaxation oscillator layout
Figure 4-2: Relaxation oscillator schematic
Figure 4-4: Phase detector schematic

The phase detector proposed here, shown in Fig. 4-4, has the right nonlinearity,
higher gain at small $\Delta\phi$, and much less high-frequency content than an XOR.
The noise that is generated is at the clock frequency, and is attenuated an extra 6 dB
given the same first-order loop filter. (Only half of the circuit is drawn. The other
half is the symmetrical counterpart, with clock1 and clock2 switched.) $M_1$, $M_2$, and
$M_3$ comprise an arbiter. The voltage at node A is buffered, sampled, and converted to a
current, so that multiple inputs can be summed at each oscillator node. Synchronous
sampling of the arbiter output by $M_6$ and $M_7$ demodulates it, removing high
frequency content. Timing waveforms are shown in Fig. 4-5. The phase of the sampling
instant affects the transfer function, shown in Fig. 4-6.
Node A is the output of the arbiter. When clock1 and
clock2 are nearly in phase, as is the case at sample periods 1 and 2, A is sampled while
its value is still valid, so the output Y goes from 0 to 1 over the width of the arbitration
window. Hence, the phase detector has a high gain near 0 phase difference. As the
phase difference increases, the timing of the sampling instant becomes relevant. A is sampled
at a fixed delay from the rising edge of clock1. If clock2 falls before A is sampled, the
output Y will also fall, as shown for periods 3 and 4. Therefore, $\theta_c$, the phase angle
at which the output transfer function starts to fall, depends on the relative timing
of the falling edge of clock2 and the sample delay. If $\theta_s$ is the phase of the sampling
instant and $\theta_f$ the phase of the falling edge, $\theta_c = \theta_f - \theta_s$, so the characteristic angle
could be adjusted easily simply by setting the delay through I1 ... I9. With $\theta_s \approx \pi/2$
and a 50% duty cycle (i.e., $\theta_f = \pi$), $\theta_c$ would be $\pi/2$, which is the constraint to avoid
modelock. Were smaller $\theta_c$ needed to accommodate a different network structure, the
same circuit could be used with a different $\theta_s$. Adding the output from the unshown
half of the circuit gives the other half of the phase response, shown in Fig. 4-7. The
full circuit fits in 80 µm x 40 µm.
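The relation $\theta_c = \theta_f - \theta_s$ can be turned into a small design helper. The Python sketch below is an illustration under stated assumptions: phases are measured from the rising clock edge, the sample delay is expressed as a time within one period, and the function names and the comparison against $2\pi/n$ (taken from the mode-lock theorem of the previous chapter) are introduced here rather than taken from the circuit description.

```python
import math


def characteristic_angle(sample_delay, period, duty_cycle=0.5):
    """theta_c = theta_f - theta_s, with phases measured from the rising clock edge."""
    theta_s = 2 * math.pi * sample_delay / period  # phase of the sampling instant
    theta_f = 2 * math.pi * duty_cycle             # phase of the falling edge
    return theta_f - theta_s


def meets_modelock_constraint(theta_c, largest_minimal_loop):
    """True if the response starts to fall at or before 2*pi/n (assumed criterion)."""
    return theta_c <= 2 * math.pi / largest_minimal_loop + 1e-12


T = 1 / 350e6                                  # period at the 350 MHz clock speed
theta_c = characteristic_angle(0.25 * T, T)    # quarter-period sample delay
ok_for_grid = meets_modelock_constraint(theta_c, 4)

# A shorter delay gives a larger theta_c, which no longer meets the 4-node criterion.
theta_c_early = characteristic_angle(0.10 * T, T)
ok_early = meets_modelock_constraint(theta_c_early, 4)
```

A quarter-period delay with a 50% duty cycle gives $\theta_c = \pi/2$, matching the rectangular-array case; a network with larger minimal loops would call for a longer sample delay.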
4.1.3 Loop Filter
Figure 4-5: Phase detector timing waveforms (Clock1, Clock2, A, Sample, and Y over sample periods 1 through 5)
Figure 4-6: Sampled phase detector half-circuit transfer function (output current versus phase)
Figure 4-7: Sampled phase detector full transfer function (output current versus phase)

One loop filter is associated with each CCO. Conventional loop filters use a charge
pump with an RC pole-zero pair, and often put the large capacitor and resistor off
chip. To avoid inconveniently large resistors and capacitors, a feed-forward compensation
method was used. The loop filter of Fig. 4-8 consists of two differential amplifiers.
(Note that because the frequency control to the oscillator consists of two currents,
both amplifiers have twin outputs.) $M_3$, $M_4$, $M_5$, and $M_6$ make up amplifier $A_1$,
biased by $M_9$, while $M_1$, $M_2$, $M_7$, $M_8$, $M_{11}$ and $M_{12}$ make up $A_2$, biased by $M_{10}$.
The differential output currents from the phase comparators at the edges of each tile
are summed at nodes $I_{in+}$ and $I_{in-}$ and drive both amplifiers. $A_1$ is a single stage
differential pair, so it has relatively low gain but a bandwidth limited by $g_{m3,4}/C_{gs3,4}$,
since nodes Iout1 and Iout2 drive a low impedance. $A_2$ has two stages, much like a
prototypical op-amp. The first is biased at very low current to give high gain at DC
and allow the use of a relatively small compensation capacitor, and the second provides
the needed gain and isolates the high impedance pole from the output. In this
amplifier, the DC gain was simulated at 31 dB with a 16 kHz pole, a compensating zero
at 7.6 MHz, and a high frequency pole well above the PLL target frequency. The use
of feed-forward compensation allowed the use of very small capacitors; the loop filter,
including the poly-poly capacitor, and the CCO with its output buffers together take
up 88 µm x 88 µm.
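The simulated numbers describe a pole-zero response whose shape is easy to evaluate. The Python sketch below is a rough model under stated assumptions: it treats the loop filter as a single real pole and a single real zero with the quoted DC gain, and it ignores the high-frequency pole. The function names are introduced here, and nothing in it is taken from the actual simulation deck.

```python
import math

DC_GAIN_DB = 31.0   # simulated DC gain quoted in the text
F_POLE = 16e3       # dominant pole, Hz
F_ZERO = 7.6e6      # compensating zero, Hz


def loop_filter_gain(f):
    """Complex gain of A0 * (1 + j f/fz) / (1 + j f/fp); high-frequency pole ignored."""
    a0 = 10 ** (DC_GAIN_DB / 20)
    return a0 * complex(1.0, f / F_ZERO) / complex(1.0, f / F_POLE)


def gain_db(f):
    """Magnitude of the modeled gain in dB."""
    return 20 * math.log10(abs(loop_filter_gain(f)))


low = gain_db(0.0)       # DC: the full 31 dB
mid = gain_db(350e3)     # between pole and zero: falling at about 20 dB per decade
high = gain_db(1e9)      # far above the zero: flat feed-forward level
feedforward_db = DC_GAIN_DB + 20 * math.log10(F_POLE / F_ZERO)
```

In this simplified model the gain above the zero settles near $31 + 20\log_{10}(16\,\text{kHz}/7.6\,\text{MHz}) \approx -22.5$ dB, which is the level the low-gain feed-forward path would have to supply.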
Figure 4-8: Loop filter schematic
4.2 16 Oscillator Chip
The 16 oscillator chip was a second generation chip with a number of improvements
over the 4 oscillator first generation. First, a larger network provides a more thorough
test of modelock-resistance, because modelock is more likely from initial startup than
in smaller networks. Second, a newer and faster fabrication process, 0.35 µm, was used,
to test the ideas at clock speeds more appropriate for modern microprocessors. Third,
key circuits were redesigned: the oscillator is a ring oscillator instead of a relaxation
oscillator, and no longer requires two levels of polysilicon; the phase detector now
uses a much simpler arbiter-based design that gives phase and frequency feedback as
appropriate.
4.2.1 Oscillator
The second chip used an NMOS-loaded differential ring oscillator as a voltage controlled oscillator (VCO) (Fig. 4-10) primarily because only one layer of polysilicon
was used, and diodes were disallowed in an effort to make the circuits more amenable
to implementation in a standard microprocessor process. Transistors M4-M8 comprise the
differential inverter. The differential pair is M5,8, the tail current is driven by M6,
and M4,7 act as the NMOS load. The NMOS loads allow fast oscillation and shield
the output signal from VDD noise. Vbias is a low-pass version of VDD generated by
subthreshold leakage through PFET M1; supply noise coupling in through Cgd of M4,7
is bypassed by M2. The oscillation frequency is only dependent on the supply voltage
through capacitor nonlinearity and the output conductance of M4,7, and feedback of
the PLL compensates drift of VDD and Vbias.
4.2.2 Phase Detector
Just like the phase detector for the 4-oscillator chip, the second generation phase
detector, shown in Fig. 4-11, has sufficient nonlinearity, higher gain at small input phase difference, and less high-frequency content than an XOR phase detector.
Compared to Fig. 4-4, however, it is somewhat simpler in implementation, and has
a smaller transistor count. It also has less delay from the clock inputs to the phase
detector outputs, which is important because the phase detector time constant helps
set the PLL feedback poles.
Figure 4-9: Micrograph of the 16 oscillator, 1.3 GHz chip
Figure 4-10: Ring oscillator schematic
The core (M1-M6) is an NMOS-loaded arbiter which acts as a nonlinear phase
detector. For no input phase difference, the output is balanced. As the phase difference increases from zero, one output will be asserted for the full duration of an input
pulse, while the other output will be asserted for only the remainder of the input pulse
duration after the first input pulse ends, which is equal to the input phase difference.
Thus the detector has very high gain near zero phase error that drops off to zero as
the input phase difference approaches the input pulse width (Fig. 4-12).
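The transfer characteristic described above can be captured in an idealized timing model. This is a sketch under stated assumptions, not the simulated curve of Fig. 4-12: the leading output is taken to be asserted for the full pulse width, the lagging output for a time equal to the phase difference, and the net output is their difference. The function name and the 0.2 ns pulse width used in the usage note are hypothetical.

```python
def arbiter_pd_net_time(dt, width):
    """Net asserted time per cycle (leading output minus lagging output).

    dt    : input time difference, positive when input 1 leads
    width : input pulse width, in the same units as dt
    """
    if dt == 0:
        return 0.0  # balanced outputs for zero phase difference
    net = max(width - abs(dt), 0.0)  # falls to zero as |dt| approaches the pulse width
    return net if dt > 0 else -net
```

With a hypothetical 0.2 ns pulse, `arbiter_pd_net_time(0.05, 0.2)` is about 0.15 ns and `arbiter_pd_net_time(-0.05, 0.2)` is about -0.15 ns; the sign flips across zero, which is the high-gain region, while the magnitude decays linearly to zero at the pulse width.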
The pulse generators P1 and P2 enable this arbiter to give frequency error feedback.
If one input is at a higher frequency than the other, its output will be asserted for
more input pulses than the other. Because the width of the pulses is independent
of input frequency, the average output voltage corresponds to frequency. Unlike a
typical phase-frequency detector, however, the strength of the error signal falls to
zero as frequency difference goes to 0, so there can be no modelock problems, yet
large signal frequency- (and hence, phase-) locking is enhanced.
Fig. 4-13 shows the large signal correction and small signal behavior of the entire array of PLLs as
the already internally-locked array approaches and locks to the reference clock. The
detector fits in 30 µm × 30 µm.
Figure 4-11: Phase detector
4.2.3 Loop Filter
This loop filter, Fig. 4-14, is conceptually identical to the previous loop filter, Fig. 4-8, though for biasing reasons, the wide bandwidth amplifier now has p-inputs and a
current mirror, and the high gain amplifier loads are cascoded.
M1-M5 make up amplifier A1, while M9-M17 make up A2. The differential
output currents from the phase detectors at the edges of each tile are summed at
nodes In+ and In-, and drive both amplifiers. A1 is a single stage differential pair
so it has relatively low gain but a bandwidth limited by gm/Cgs. A2 has a high gain
cascoded stage driving a common source PFET M17. M16 is a large gate capacitor
which serves to set the dominant pole of A2 such that the PLL network is stable. M15
is biased at very low current to boost gain and enable a low time constant (as low
as 12 kHz) with a 15 µm × 15 µm gate capacitor. The simple design and feed-forward
compensation allow the loop filter to fit in only 15 µm × 45 µm. Each clock node,
consisting of an oscillator and a loop filter, takes just 45 µm × 45 µm.
Figure 4-12: Simulated phase transfer curve (horizontal axis: time difference in nanoseconds)
Figure 4-13: Locking behavior of the PLL array (horizontal axis: simulation time in microseconds; annotations: small signal regime, large signal regime, reference clock)
Figure 4-14: Loop filter schematic
Chapter 5
On-Chip Measurement of Clock Performance
While increasing resources are devoted to implementing low skew and low jitter clocks
in modern microprocessors, there are few ways to measure jitter. Skew can be measured by such off-chip methods as e-beam [66] and photonic emission [67, 68], but
because both average thousands of edges, neither method is suitable for resolving
cycle-by-cycle clock jitter. A method to measure clock jitter was developed in this
thesis. A proof-of-concept test chip showed that excellent measurement performance
is possible, and this chapter describes the theory and results from that chip.
5.1 Introduction and Motivation
On-chip measurement necessarily requires tricks. Acceptable clock skew is generally around 10% of a clock cycle and a microprocessor clock period is typically 8-12 gate delays. Hence, the measurement necessarily requires timing resolution smaller than a single gate delay. Time-to-voltage converters work by integrating a current onto a capacitor, as in Fig. 5-1 [69, 70, 71].
The capacitor starts with 0 voltage; at the beginning of the interval to be measured,
switch S1 closes, and the capacitor charges for the duration of the interval. Then S1
opens, the voltage is amplified, converted to a digital value and output, and then S2
closes to reset the capacitor. Such converters may have high dynamic range but do
not have enough resolution for clock jitter measurement, essentially because the time
of interest is comparable to the time it takes to open and close switch S1.
Figure 5-1: Time to voltage converter operation
Figure 5-2: Phase vernier
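The resolution requirement and the time-to-voltage relation can be written out explicitly. The symbols $t_g$ (one gate delay), $T_{clk}$, $t_{skew}$, $I$, $C$, and $\Delta t$ are introduced here for illustration, and the second relation assumes an ideal constant current and ideal switches.

```latex
% Skew budget expressed in gate delays, using the figures quoted in the text
T_{clk} \approx (8 \text{ to } 12)\, t_g, \qquad
t_{skew} \approx 0.1\, T_{clk} \approx (0.8 \text{ to } 1.2)\, t_g

% Ideal time-to-voltage conversion: constant current I onto capacitor C,
% starting from zero volts, for the duration \Delta t of the interval
V = \frac{I\, \Delta t}{C}
```

Since the skew budget itself is about one gate delay, resolving it requires a measurement step well below $t_g$, which is where the finite switching time of S1 becomes the limiting factor.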
Another approach is to sample the signal of interest into registers which are clocked
by closely-spaced sampling phases, as shown in Fig. 5-2. The interpolator takes in
several uniformly-placed phases and generates a larger number of phases with closer
spacing. The newly generated phases are used to clock a string of registers, marked
R[i] in the figure. The timing of a transition on SigIn can be deduced to within
the spacing of the sampling phases. Effectively, the registers compare the transition
instant of the input signal SigIn to a set of fixed times, just as a flash analog-to-digital
converter (ADC) compares an input voltage to a set of voltage thresholds. Because
of the similarity, it is useful to think of this architecture as a flash time-to-digital
converter, or TDC. Because the comparison thresholds are clock phases, this will be
called a sampling phase time-to-digital converter, or SPTDC. Either a delay-locked
loop with phase interpolation (as shown) or an array oscillator can be used to generate
sampling phases with time differences smaller than a single gate delay [72, 73, 74, 75].
However, mismatches between the oscillators in the array or delays in a DLL can be
significant, giving as much as a gate delay offset before calibration [72].
The approach presented here is also a flash TDC, but rather than creating the
time vernier by generating closely-spaced clocks, the vernier arises from input-referred
offset on the samplers. Hence, the proposed converter will be called a sampling offset
time-to-digital converter, or SOTDC. The advantage is that instead of needing to
generate precise clocks, it is necessary only to create some sampling elements and
measure their relative positions. As will be demonstrated, measurement can be much
more precise than any calibration is likely to be. The SOTDC was developed to
measure jitter between clock domains, but it works to measure the timing of any
signal relative to a reference.
5.2 Time-to-Digital Converter Fundamentals
Calibration and operation of the SOTDC depend critically on the operation of the
sampling elements. (In Fig. 5-2, the sampling elements were registers, but they were
acting as arbiters.) An arbiter is a circuit that determines which of two inputs arrived
first. Because only the time difference between rising edges of the two inputs affects
the output, it is conventional to think of the arbiter as having a single input, where
that input is a time interval t between two incoming edges, as shown in Fig. 5-3(a).
Given enough time, the output of an arbiter settles to either a logic '1' or '0', indicating
whether the first or second input arrived first. Unfortunately, device mismatch gives
arbiters an effective time offset, $t_{os}$.
Also, because of thermal noise, the output, y,
is not deterministic. y(t) = 1 if and only if $t > t_{os} + t_n$, where $t_n$ is white noise with
standard deviation $\sigma$ [76, 77]. Therefore, the probability that the output y is a '1' is
given by the Gaussian cumulative density function
$$P(y = 1) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{t - t_{os}}{\sqrt{2}\,\sigma}\right)\right] \qquad (5.1)$$
which is plotted in Fig. 5-3(b). The strong sensitivity of y to t near $t = t_{os}$ makes the
arbiter useful for precise time measurement.
Figure 5-3: Arbiter definitions. (a) Arbiter input definition. (b) Probability that arbiter output is a 1.
Figure 5-4: TDC structure. "D" marks delay elements, and "A" the arbiters.
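Eq. 5.1 is straightforward to evaluate numerically. The following minimal sketch assumes the standard Gaussian form of the cumulative distribution, with the arbiter offset and noise standard deviation passed in the same time units as t; the function and parameter names are illustrative.

```python
import math


def p_one(t, t_os, sigma):
    """Probability that an arbiter with offset t_os and noise sigma outputs a 1."""
    return 0.5 * (1.0 + math.erf((t - t_os) / (math.sqrt(2.0) * sigma)))
```

At `t == t_os` the probability is exactly one half, and one standard deviation above the offset it is about 0.84, so nearly all of the transition occurs within a few sigma of the offset.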
Fig. 5-4 shows the simplified theory of operation of a flash TDC (cf. a flash ADC).
In any flash converter, the input is compared to a set of thresholds; call the thresholds
x. In a TDC, x is the set of offset times to which the input time t is compared. In
an SPTDC, each threshold $x_i$ is composed of a vernier delay D and an arbiter offset
$t_{os}$. Variation of $t_{os}$ is significant: the standard deviation of $t_{os}$, $\sigma_t$, is about 18 ps in
0.35 µm CMOS. Fig. 5-5(a) shows a plot of ideal x for an 8-level converter; Fig. 5-5(b)
shows the actual positions of the x with normally distributed $t_{os}$. Because $\sigma_t$
is large, errors in the x are significant. However, the random spread of $t_{os}$ suggests
another approach to generating the x: eliminate the vernier delay entirely, and let
$x_i = t_{os,i}$. Fig. 5-5(c) shows typical x for such a converter.
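The three threshold sets of Fig. 5-5 can be generated with a few lines of code. This is a sketch under assumptions: offsets are drawn from a zero-mean Gaussian, the vernier step is a free parameter, and the conversion ignores arbiter noise. The helper names are hypothetical.

```python
import random


def make_thresholds(n, sigma_t, step=0.0, seed=0):
    """Thresholds x_i = (centred i) * step + Gaussian offset.

    step > 0, sigma_t = 0 : ideal vernier
    step > 0, sigma_t > 0 : vernier plus arbiter offsets (SPTDC)
    step = 0, sigma_t > 0 : offsets only (SOTDC)
    """
    rng = random.Random(seed)
    centre = (n - 1) / 2.0
    return [(i - centre) * step + rng.gauss(0.0, sigma_t) for i in range(n)]


def flash_code(t, thresholds):
    """Noiseless flash conversion: number of thresholds that t exceeds."""
    return sum(1 for x in thresholds if t > x)
```

Counting ones, rather than locating the first zero, makes the decode independent of the physical ordering of the arbiters, which matters once the thresholds are placed at random.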
5.3 SOTDC Yield
The random placement of $x_i$ in an SOTDC means that measurement precision varies
from chip to chip. Finding a formula for the expected yield given a desired precision
over a fixed range is surprisingly difficult. The problem is quite amenable to Monte
Carlo simulation, however. A simulated plot of expected yield vs. precision is shown
in Fig. 5-6.
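A Monte Carlo estimate of this kind might look like the sketch below. The yield criterion is an assumption, since the text does not define it: a chip is counted as good if no gap between adjacent thresholds (or between a threshold and the edge of the range) exceeds the desired precision. The default of 64 arbiters matches the test chip, while the range of plus or minus half a standard deviation and the trial count are illustrative choices.

```python
import random


def max_gap(thresholds, lo, hi):
    """Largest interval inside [lo, hi] that contains no threshold."""
    edges = [lo] + sorted(x for x in thresholds if lo < x < hi) + [hi]
    return max(b - a for a, b in zip(edges, edges[1:]))


def sotdc_yield(precision_ps, n_arbiters=64, sigma_t_ps=18.0,
                half_range_ps=9.0, trials=2000, seed=0):
    """Fraction of simulated chips whose largest gap is within precision_ps."""
    rng = random.Random(seed)
    good = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, sigma_t_ps) for _ in range(n_arbiters)]
        if max_gap(x, -half_range_ps, half_range_ps) <= precision_ps:
            good += 1
    return good / trials
```

Sweeping `precision_ps` over a few picoseconds produces a yield curve of the same general kind as Fig. 5-6, although the numbers depend on the assumed criterion and range and should not be read as a reproduction of that figure.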
5.4 Calibration of a SOTDC
Of course, a vernier-less, or sampling offset TDC is useless if it cannot be calibrated:
the outputs of the arbiters give information about the input signal in terms of the $x_i$;
if the $x_i$ are unknown, the arbiter outputs are useless. Fortunately, it is possible to
find x empirically.
A TDC could be calibrated directly by connecting two signals with precisely-known t and measuring resulting outputs for t over the range of interest. Fitting the
probabilities of an output '1' vs. t for each arbiter via Eq. 5.1 gives the effective x.
Unfortunately, input jitter adds linearly to the apparent measurement noise in this
case. In cases where it is impossible or inconvenient to input known signals, it is also
possible to calibrate a flash TDC indirectly with uncorrelated signals.
Figure 5-5: x(i) vs. i. (a) Ideal, $x_i \propto i$. (b) $x_i \propto i + t_{os}$, 18 ps std. dev. (c) $x_i = t_{os}$, 18 ps std. dev.
Figure 5-6: Expected yield of an SOTDC, for a fixed precision over a range of one
standard deviation. (Horizontal axis: precision in ps.)
For uniformly distributed t, the probability that t is measured between two sampling thresholds, $P(x_i + t_n > t > x_j + t_n) \triangleq P_{ij}(01)$, is proportional to $x_i - x_j \triangleq \Delta_{ij}$ for
a single event, as long as the difference is much larger than sampling noise, $\Delta_{ij} \gg t_n$.
For example, if the two input signals are constant-frequency square waves, measurements with bit i low and bit j high will occur with a frequency of $\Delta_{ij} f_1 f_2$, where $f_1$ and $f_2$
are the frequencies of the two input signals. While x can be fully deduced
from such measurements, the resolution is poor for $\Delta_{ij} \approx t_n$.
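The proportionality between code frequency and threshold spacing can be checked with a short simulation. The sketch below assumes a uniformly distributed input time and ignores arbiter noise, so it applies only in the regime where spacings are much larger than the noise; the function name, sample count, and example thresholds are hypothetical.

```python
import random


def estimate_spacings(thresholds, t_lo, t_hi, n_samples=200_000, seed=0):
    """Estimate gaps between adjacent thresholds from output code counts."""
    rng = random.Random(seed)
    xs = sorted(thresholds)
    counts = [0] * (len(xs) + 1)
    for _ in range(n_samples):
        t = rng.uniform(t_lo, t_hi)
        counts[sum(1 for x in xs if t > x)] += 1  # noiseless thermometer code
    span = t_hi - t_lo
    # Interior codes correspond to the gaps between adjacent thresholds.
    return [span * c / n_samples for c in counts[1:-1]]
```

For example, thresholds at 0, 3, and 10 ps sampled uniformly over [-20, 30] ps return estimated gaps close to 3 ps and 7 ps, each recovered purely from how often the corresponding code occurs.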
A second indirect calibration method resolves small $\Delta_{ij}$ in terms of $\sigma$. When $\Delta_{ij}$
is comparable to $t_n$, there will sometimes be a "bubble" in the output codeword;
that is, it will appear that $x_j + t_n > t > x_i + t_n$ even though $x_i > x_j$. The ratio
$r = P_{ij}(10)/P_{ij}(01)$ should depend only on $\delta = \Delta_{ij}/\sigma$, and in fact, it does.
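Consider two arbiters with $t_1 = x_1 + t_{n1}$ and $t_2 = x_2 + t_{n2}$. $t_1$ and $t_2$ are the
instantaneous switching thresholds of the arbiters, so
$$P(y_1 = 1) = P(t > t_1) \qquad (5.2)$$
$$P(y_2 = 0) = P(t < t_2) \qquad (5.3)$$
$$P(y_1 = 1, y_2 = 0) \triangleq P_{12}(10) = P(t_1 < t < t_2) \qquad (5.4)$$
$$P_{12}(10) = P(t_1 < t_2) \cdot P(t_1 < t < t_2 \mid t_1 < t_2) \qquad (5.5)$$
Let $x = t_2 - t_1$. Then x is Gaussian with mean $x_2 - x_1 = \Delta t$ and standard deviation
$\sqrt{2}\,\sigma$. For uniformly distributed t, $P(t_1 < t < t_2 \mid t_1 < t_2) \propto t_2 - t_1$. Substituting into
Eq. 5.5,
$$P_{12}(10) \propto P(x > 0) \cdot \mathrm{E}[x \mid x > 0] \qquad (5.6)$$
$$\propto \int_0^\infty x \, \frac{1}{\sqrt{4\pi\sigma^2}} \, e^{-(x - \Delta t)^2 / (4\sigma^2)} \, dx \qquad (5.7)$$
$$\propto \frac{\Delta t}{2}\left[1 + \operatorname{erf}\left(\frac{\Delta t}{2\sigma}\right)\right] + \frac{\sigma}{\sqrt{\pi}} \, e^{-\Delta t^2 / (4\sigma^2)} \qquad (5.8)$$
By symmetry, $P_{12}(01)\big|_{\Delta t} = P_{12}(10)\big|_{-\Delta t}$. Defining $\delta = \Delta t / (2\sigma)$ and $\operatorname{erfcx}(x) = e^{x^2} \frac{2}{\sqrt{\pi}} \int_x^\infty e^{-t^2}\, dt$ gives
$$r(\delta) = \frac{P_{12}(10)}{P_{12}(01)} = \frac{1 + \sqrt{\pi}\,\delta\,\operatorname{erfcx}(-\delta)}{1 - \sqrt{\pi}\,\delta\,\operatorname{erfcx}(\delta)} \qquad (5.9)$$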
In this way an array of arbiters can be calibrated to much higher precision than their
manufacturing tolerances without the use of precise input clocks.
Thus, by measuring r and inverting Eq. 5.9, one can find relative spacings of x
in terms of $\sigma$. Combined with either of the previous two calibration methods, this
measurement thus gives $\sigma$ and precise measurements of x. Note that both indirect
methods are completely insensitive to input jitter.
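The inversion of Eq. 5.9 has no closed form, but r is monotonic in the normalized spacing, so a bisection search suffices. The sketch below is one possible approach, not necessarily the procedure used for the measurements: it takes the normalized spacing to be the threshold difference divided by twice the noise standard deviation, and it evaluates r in an algebraically equivalent form that multiplies numerator and denominator by exp(-delta^2)/sqrt(pi), so that only `math.erf` and `math.erfc` are needed. The search interval [0, 4] is an arbitrary assumption.

```python
import math


def bubble_ratio(delta):
    """r(delta) = P12(10) / P12(01), equivalent to Eq. 5.9 without erfcx."""
    g = math.exp(-delta * delta) / math.sqrt(math.pi)
    p10 = g + delta * (1.0 + math.erf(delta))  # proportional to P12(10)
    p01 = g - delta * math.erfc(delta)         # proportional to P12(01)
    return p10 / p01


def invert_ratio(r, lo=0.0, hi=4.0, iters=60):
    """Solve bubble_ratio(delta) = r for delta in [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bubble_ratio(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Given a measured ratio r of at least 1, the threshold spacing follows as twice the noise standard deviation times `invert_ratio(r)`; ratios below 1 correspond to the opposite ordering of the two arbiters and can be handled by swapping their roles.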
5.5 Circuit and Results
The SOTDC circuit consists of a set of nominally identical arbiters and output circuitry to transfer the bits off-chip. The implemented symmetric CMOS arbiter is
shown in Fig. 5-7. The outputs are precharged when In1 and In2 are low (for clock
systems where jitter is meaningful, there will be substantial overlap between the low
phases of the inputs). The first edge that arrives pulls down the corresponding output, and the positive feedback guarantees that eventually a valid logic value can be
latched from the output. For the test chip, 64 such arbiters were connected in parallel
to two test inputs, and their outputs individually recorded.
Figure 5-7: Symmetric CMOS arbiter
Fig. 5-8 shows x for one test chip measured directly. As expected, process variations distribute the x over a range of approximately 50 picoseconds. A plot of x
calculated by numerically inverting Eq. 5.9 for measured data vs. x measured directly
is shown in Fig. 5-9. The fit is perfect to within the tolerances of the measurement
equipment; clearly, calibration by random signals is viable. Best fit $\sigma$ is 0.35 picoseconds, which corresponds to an arbiter aperture of ~1 ps, consistent with a previously
reported simulated value of 10 ps in a 3 µm CMOS process. Nonuniform spacing of the
arbiter thresholds limits resolution of this TDC to 2 ps over the range [-15 ps, 15 ps].
The goal of this part of the thesis was to measure jitter in the 16 oscillator chip
described in Chapter 4. A set of arbiters was connected between the clocks of neighboring tiles, and a 128-word DRAM recorded arbiter results. Unfortunately, the
DRAM timing was marginal on that test chip, so direct measurements were unavailable.
Figure 5-8: Measured $x_i$, with expected curve for 18 ps standard deviation of $t_{os}$. (Horizontal axis: threshold x(i) in picoseconds.)
Figure 5-9: Measured $x_i$ vs. $x_i$ derived via Eq. 5.9, for $\sigma$ = 0.35 ps. (Horizontal axis: directly-measured x(i).)
Figure 5-10: Measurement chip micrograph
Chapter 6
Conclusions
6.1 Summary and Contributions
A great deal of work has been done previously on clocks in integrated circuits. As the
ratio of clock period to wire-delay across a chip decreases, more and more attention
is being devoted to clocking. An attempt was made in this thesis to look forward, to
predict the clocks necessary in the near future to continue the trend of faster devices
and faster clocks.
One contribution of this thesis has been the analysis of clock networks in terms
of performance given parameter variations and noise. Although much of the focus
has been on the contrast between different clock networks, the conclusion is that
the different architectures do not replace but rather complement each other. Over
a single tile where signal propagation delay is small compared to the clock period
and all points must be synchronized, tree distribution is effective. For relatively long
distances on a chip, clock regeneration becomes useful to filter out high frequency
noise on the distribution wire. A multiple-oscillator peer network also avoids the
problem of having different paths to nearest neighbors that plagues trees. Gridded
distribution, or more generally shorting together spatially separated buffers greatly
reduces skew and jitter between tiles as long as the initial offsets are small.
Another contribution is the analysis and implementation of a clock network that
uses distributed generation. Theory about mode-locking was extended to account for
non-orthogonal networks.
Inter-oscillator coupling was treated in the context of a
single multivariable system which exposes all possible interactions. The phase detector and oscillator were modified from standard versions to satisfy the requirements
needed for a distributed clock. Although the details will likely be changed (shorting together the tiles and finding another way to measure phase differences between
clocks is an obvious improvement) the main strength of this architecture is that the
clock traverses the same path, peer-to-peer, as does the data. Because the clock can
be measured and corrected over multiple cycles, however, it appears that clock skew
can always be corrected to a fraction of the uncertainty in data delay. In other words,
it should always be possible to distribute a clock using the same technology as is used
for long-distance interconnect.
Verification of clock design will likely become more important as a way to confirm predictions about clock performance. The proposed and tested sampling offset
time to digital converter appears to be well-suited to this task, with resolution of a
small fraction of a single gate delay. Because of its extreme hardware simplicity and
generality, the SOTDC may find its way onto many chips as a simple debugging tool.
6.2 Future Work
This thesis was dominated by analysis and implementation of the distributed clock
network, and of how that network compares with conventional clock networks. This
leaves a two-fold opening for future work: more accurate testing and comparison to
conventional clock networks, and the development of clock architectures that are as yet
impractical.
6.2.1 Testing and measurement
The focus of the design and testing of the multiple-oscillator array was on initial
locking and stability. Testing received substantially less attention. Another version of
that chip with a more robust DRAM (so that precise timing data could be obtained),
and controllable, on-chip noise generators (i.e., large transistors between power and
ground) would help calibrate the noise models.
On a similar topic, distributed PLLs make low-speed functional testing difficult.
For distributed clock generation to move to production, stability of the network at
low speeds should be addressed. It is trivial to add a controllable divider for each node
oscillator; however, the extra delay will certainly make the network unstable unless
other changes are made.
6.2.2 Unconventional Clocks
Grids and clock trees have found widespread use in industry already. A number of
other clocking strategies have been proposed that may either find use in niche applications, or perhaps someday take over as the dominant clock method if technology
evolves to make them more attractive.
Salphasic
Salphasic clocking is conceptually related to equipotential clocking. If the wires are
lossless but the transmission line delay is causing clock skew, it is possible to set up
standing waves in the clock network. Because these standing waves are perfectly synchronous with the signal at the driver, a clock can be distributed over long distances
with no skew. Of course, this depends on having lossless transmission lines for clock
distribution; this constraint can be approximated closely in systems on the scale of
several meters with clocks in the tens of megahertz [36]. On chip, however, resistance
in the wires has made salphasic clocking untenable.
Resonant Clocks
Resonant clocking is a similar approach, intended for a different purpose. A standing
wave is set up in a transmission line with a period equal to the desired period of a
clock. With care, a transmission line can be tuned to resonate a fundamental and
several odd harmonics in phase, despite the capacitive load and small resistive losses
in the wire so that a true square wave appears at the load [37]. A resonant clock in
a low-loss transmission line dissipates a fraction of the $CV^2 f$ power that traditional
clock networks do. The technique is relatively new, and has not been proven to be
practical at high speeds.
Optical Clocking
Because the propagation speed of optical signals is easily controlled, optical clocks
have been suggested as a way to equalize path delay and thus minimize clock skew [38,
39]. Optical signals, transmitted either in a tree, as in the first citation, or in free space
as in the second, also have the advantage that they do not interfere with each other,
and are immune to electrical or magnetic coupling.
Unfortunately, the conversion
from optical signals to electrical is a significant stumbling block. Detectors for optical
signals are not silicon, and hence require a substantial fabrication process change.
Second, the conversion is often relatively slow and error prone because the detected
currents are small. No optical clock has been demonstrated for VLSI, although optical
clocks may become practical in the future.
Bibliography
[1] Neil H. E. Weste and Kamran Eshraghian. Principles of CMOS VLSI design.
Addison Wesley, 2 edition, 1990.
[2] Daniel W. Bailey and Bradley J. Benschneider. Clocking design and analysis for
a 600 MHz Alpha microprocessor. Journal of Solid State Circuits, 33(11):1627-1633, November 1998.
[3] Stephen H. Unger and Chung-Jen Tan. Clocking schemes for high-speed digital
systems. IEEE Transactions on Computers, C-35(10):880-895, October 1986.
[4] Arthur F. Champernowne et al. Latch-to-latch timing rules. IEEE Transactions
on Computers, 39(6):798-808, June 1990.
[5] E. G. Friedman. The applications of localized clock distribution design to improving the performance of retimed sequential circuits. In Proceedings of the
IEEE Asia-Pacific Conference on Circuits and Systems, pages 12-17, December
1992.
[6] Karem A. Sakallah et al. Synchronization of pipelines. IEEE Transactions on
Computer-Aided Design, 12(8):1132-1146, August 1993.
[7] Jose Luis Neves and Eby G. Friedman. Topological design of clock distribution
networks based on non-zero clock skew specifications. In Proceedings of the 36th
Midwest Symposium on Circuits and Systems, pages 468-471, August 1993.
[8] Narendra V. Shenoy, Robert K. Brayton, and Alberto L. Sangiovanni-Vincentelli.
Resynthesis of multi-phase pipelines. In Proceedings of the ACM/IEEE Design
Automation Conference, pages 490-496, June 1993.
[9] C. Thomas Gray et al. Timing constraints for wave-pipelined systems. IEEE
Transactions on Computer-Aided Design, 13(8):987-1004, August 1994.
[10] Michel R. Dagenais and Nicholas C. Rumin. On the calculation of optimal
clocking parameters in synchronous circuits with level sensitive latches. IEEE
Transactions on Computer-Aided Design, 8(3):268-278, March 1989.
[11] Karem A. Sakallah, Trevor N. Mudge, and Oyekunle A. Olukotun. Analysis and
design of latch-controlled synchronous digital circuits. IEEE Transactions on
Computer-Aided Design, 11(3):322-333, March 1992.
[12] Tolga Soyata and Eby G. Friedman. Retiming with non-zero clock skew, variable register, and interconnect delay. In Proceedings of the IEEE International
Conference on Computer-Aided Design, pages 234-241, November 1994.
[13] François Anceau. A synchronous approach for clocking VLSI systems. Journal
of Solid State Circuits, SC-17(1):51-56, February 1982.
[14] H. B. Bakoglu, J. T. Walker, and J. D. Meindl. A symmetric clock-distribution
tree and optimized high-speed interconnections for reduced clock skew in ULSI
and WSI circuits. In VLSI in Computers and Processors, pages 118-122, Rye
Brook, NY, October 1986. IEEE International Conference on Computer Design.
[15] Allan L. Fisher and H. T. Kung. Synchronizing large VLSI processor arrays.
IEEE Transactions on Computers, C-34(8):734-740, August 1985.
[16] Ahmed El-Amawy. Clocking arbitrarily large computing structures under constant skew bound. IEEE Transactions on Parallel and Distributed Systems,
4(3):241-255, 1993.
[17] Daniel W. Dobberpuhl et al. A 200-MHz 64-b dual-issue CMOS microprocessor.
Journal of Solid State Circuits, 27(11):1555-1567, November 1992.
[18] Bradley J. Benschneider et al. A 300-MHz 64-b quad-issue CMOS RISC microprocessor. Journal of Solid State Circuits, 30(11):1203-1214, November 1995.
[19] Paul E. Gronowski et al. A 433-MHz 64-b quad-issue RISC microprocessor.
Journal of Solid State Circuits, 31(11):1687-1696, November 1996.
[20] Donald F. Wann and Mark A. Franklin. Asynchronous and clocked control structures for VLSI based interconnection networks. IEEE Transactions on Computers, C-32(3):284-293, March 1983.
[21] S. Y. Kung and R. J. Gal-Ezer. Synchronous versus asynchronous computation
in very large scale integrated (VLSI) array processors. Proceedings of SPIE,
341:53-65, May 1982.
[22] Sanjay Dhar, Mark A. Franklin, and Donald F. Wann. Reduction of clock delays
in VLSI structures. In IEEE International Conference on Computer Design,
pages 778-783, October 1984.
[23] Mehdi Hatamian and Glenn L. Cash. Parallel bit-level pipelined VLSI designs for
high-speed signal processing. Proceedings of the IEEE, 75(9):1192-1202, September 1987.
[24] Eby G. Friedman and Scott Powell. Design and analysis of hierarchical clock
distribution system for synchronous standard cell/macrocell VLSI. Journal of
Solid State Circuits, SC-21(2):240-246, April 1986.
[25] Michael A. B. Jackson, Arvind Srinivasan, and E. S. Kuh. Clock routing for highperformance ICs. In 27th Proceedings of the ACM/IEEE Design Automation
Conference, pages 573-579, June 1990.
[26] Fumihiro Minami and Midori Takano. Clock tree synthesis based on RC delay
balancing. In Proceedings of the IEEE Custom Integrated Circuits Conference,
pages 28.3.1-28.3.4, May 1992.
[27] Ting-Hai Chao, Yu-Chin Hsu, Jan-Ming Ho, Kenneth D. Boese, and Andrew B.
Kahng. Zero skew clock routing with minimum wirelength. IEEE Transactions
on Circuits and Systems-II: Analog and Digital Signal Processing, 39(11):799-814, November 1992.
[28] Jason Cong, Andrew B. Kahng, and Gabriel Robins. Matching-based methods for
high-performance clock routing. IEEE Transactions on Computer-Aided Design,
12(8):1157-1169, August 1993.
[29] Ren-Song Tsay. An exact zero-skew clock routing algorithm. IEEE Transactions
on Computer-Aided Design, 12(2):242-249, February 1993.
[30] Andrew B. Kahng and C.-W. Albert Tsao. Practical bounded-skew clock routing.
Journal of VLSI Signal Processing, 16(2/3):87-103, June/July 1997.
[31] Shantanu Ganguly, Daksh Lehther, and Satyamurthy Pullela. Clock distribution methodology for the PowerPC microprocessors. Journal of VLSI Signal
Processing, 16(2/3):181-189, June/July 1997.
[32] Earl T. Cohen et al. A 533MHz BiCMOS superscalar microprocessor. In ISSCC
Digest of Technical Papers, pages 164-165, February 1997.
[33] Charles F. Webb et al. A 400MHz S/390 microprocessor. In ISSCC Digest of
Technical Papers, pages 168-169, February 1997.
[34] Toyohiko Yoshida et al. A 2V 250MHz multimedia processor. In ISSCC Digest
of Technical Papers, pages 266-267, February 1997.
[35] G. Geannopoulos and X. Dai. An adaptive digital deskewing circuit for clock
distribution networks. In ISSCC Digest of Technical Papers, pages 400-401,
February 1998.
[36] Vernon L. Chi. Salphasic distribution of clock signals for synchronous systems.
IEEE Transactions on Computers, 43(5):597-602, May 1994.
[37] M. E. Becker and T. F. Knight, Jr. Transmission line clock driver. In IEEE
International Conference on Computer Design, pages 489-490, October 1999.
[38] C.-S. Li, F. Tong, K. Liu, and D. G. Messerschmitt. Fanout analysis of multistage optical clock distribution using optical amplifiers. In Globecom, pages
434-438, 1991.
[39] Helmut Zarschizky, Christian Gerndt, Martin Honsberg, and Ekkehard Klement.
Optical clock distribution with a compact free space interconnect system. In
IEEE Lasers and Electro-Optics Society Annual Meeting, pages 590-591, 1992.
[40] Gill A. Pratt and John Nguyen. Distributed synchronous clocking. IEEE Transactions on Parallel and Distributed Systems, February 1995.
[41] David G. Messerschmitt. Synchronization in digital system design. IEEE Journal
on Selected Areas in Communications, 8(8):1404-1419, October 1990.
[42] Morteza Afghahi and Christer Svensson. Performance of synchronous and
asynchronous schemes for VLSI systems. IEEE Transactions on Computers,
41(7):858-872, July 1992.
[43] D. Boning and S. Nassif. Models of Process Variations in Device and Interconnect, chapter 6. IEEE Press, 2000.
[44] Brian E. Stine et al. Simulating the impact of poly-CD wafer-level and die-level
variation on circuit performance. In Second International Workshop on Statistical
Metrology, June 1997.
[45] M. Eisele, J. Berthold, R. Thewes, E. Wohlrab, D. Schmitt-Landsiedel, and
W. Weber. Intra-die device parameter variations and their impact on digital
CMOS gates at low supply voltages. In Technical Digest of IEDM, pages 67-70,
1995.
[46] Duane S. Boning and James E. Chung. Statistical metrology - measurement
and modelling of variation for advanced process development and design rule
103
generation. In Proceedings of the International Conference on Characterization
and Metrology for ULSI Technology, March 1998.
[47] Tomohisa Mizuno, Jun-ichi Okamura, and Akira Toriumi. Experimental study
of threshold voltage fluctuation due to statistical variation of channel dopant
number in MOSFET's. IEEE Transactions on Electron Devices, 41(11):22162221, November 1994.
[48] Martin Eisele, Jörg Berthold, Doris Schmitt-Landsiedel, and Reinhard Mahnkopf. The impact of intra-die device parameter variations on path delays and on the design for yield of low voltage digital circuits. IEEE Transactions on VLSI, 5(4):360-368, December 1997.
[49] Xinghai Tang, Vivek K. De, and James D. Meindl. Intrinsic MOSFET parameter fluctuations due to random dopant placement. IEEE Transactions on VLSI, 5(4):369-376, December 1997.
[50] D. C. Keezer and V. K. Jain. Design and evaluation of wafer scale clock distribution. In Proceedings of the IEEE International Conference on Wafer Scale Integration, pages 168-175, January 1992.
[51] José Luis Neves and Eby G. Friedman. Circuit synthesis of clock distribution
networks based on non-zero clock skew. In Proceedings of the IEEE International
Symposium on Circuits and Systems, pages 4.175-4.178, June 1994.
[52] Mohamed Nekili, Guy Bois, and Yvon Savaria. Pipelined H-trees for high-speed
clocking of large integrated systems in the presence of process variations. IEEE
Transactions on VLSI, 5(2):161-174, June 1997.
[53] Masakazu Shoji. Elimination of process-dependent clock skew in CMOS VLSI.
Journal of Solid State Circuits, SC-21(5):875-880, October 1986.
[54] Satyamurthy Pullela, Noel Menezes, and Lawrence T. Pillage. Reliable non-zero skew clock trees using wire width optimization. In 30th Proceedings of the ACM/IEEE Design Automation Conference, pages 165-170, June 1993.
[55] Masato Edahiro. Delay minimization for zero-skew routing. In Proceedings of
the IEEE International Conference on Computer-Aided Design, pages 563-566,
November 1993.
[56] Steven D. Kugelmass and Kenneth Steiglitz. An upper bound of expected clock skew in synchronous systems. IEEE Transactions on Computers, 39(12):1475-1477, December 1990.
[57] Marios D. Dikaiakos and Kenneth Steiglitz. Comparison of tree and straight-line clocking in long systolic arrays. Journal of VLSI Signal Processing, pages 1177-1180, 1991.
[58] Keith A. Bowman, Xinghai Tang, John C. Eble, and James D. Meindl. Impact of extrinsic and intrinsic parameter variations on CMOS system on a chip performance. In Proceedings of the ASIC/SOC Conference, pages 267-271, September 1999.
[59] Marcel J. M. Pelgrom, Aad C. J. Duinmaijer, and Anton P. G. Welbers. Matching properties of MOS transistors. Journal of Solid State Circuits, 24(5):1433-1440, October 1989.
[60] Shy-Chyi Wong, Kuo-Hua Pan, Dye-Jyun Ma, M. S. Liang, and P. N. Tseng. On
matching properties and process factors for submicrometer CMOS. In Proceedings of the 1996 IEEE International Conference on Microelectronic Test Structures, volume 9, pages 43-47, March 1996.
[61] Shih-Wei Sun and Paul G. Y. Tsui. Limitation of CMOS supply-voltage scaling by MOSFET threshold-voltage variation. Journal of Solid State Circuits, 30(8):947-949, August 1995.
[62] M. Nekili, Y. Savaria, and G. Bois. Spatial characterization of process variations via MOS transistor time constants in VLSI and WSI. Journal of Solid State Circuits, 34(1):80-84, January 1999.
[63] Payman Zarkesh-Ha, Tony Mule, and James D. Meindl. Characterization and
modeling of clock skew with process variations. In Proceedings of the IEEE 1999
Custom Integrated Circuits Conference, pages 441-444, 1999.
[64] Ian A. Young, Monte F. Mar, and Bharat Bhushan. A 0.35µm CMOS 3-880MHz PLL N/2 clock multiplier and distribution network with low jitter for microprocessors. In ISSCC Digest of Technical Papers, pages 330-331, February 1997.
[65] Raghunand Bhagwan and Alan Rogers. A 1GHz dual-loop microprocessor PLL
with instant frequency shifting. In ISSCC Digest of Technical Papers, pages
336-337, February 1997.
[66] P. J. Restle, K. A. Jenkins, A. Deutsch, and P. W. Cook. Measurement and modeling of on-chip transmission line effects in a 400 MHz microprocessor. Journal
of Solid State Circuits, 33(4):662-665, April 1998.
[67] Y. Uraoka, T. Maeda, I. Miyanaga, and K. Tsuji. New failure analysis technique
of ULSIs using photon emission method. In Proceedings of the International
Conference on Microelectronic Test Structures, volume 5, pages 100-105, March
1992.
[68] Yukiharu Uraoka, Isao Miyanaga, Kazuhiko Tsuji, and Shigenobu Akiyama.
Failure analysis of ULSI circuits using photon emission. IEEE Transactions on
Semiconductor Manufacturing, 6(4):324-331, November 1993.
[69] Andrew E. Stevens, Richard P. Van Berg, Jan Van Der Spiegel, and Hugh H.
Williams. A time-to-voltage converter and analog memory for colliding beam
detectors. Journal of Solid State Circuits, 24(6):1748-1752, December 1989.
[70] C. Konstadakellis, S. Siskos, and Th. Laopoulos. A fast, versatile, CMOS time-to-voltage converter. In Proceedings of the 6th Mediterranean Electrotechnical Conference, pages 282-285, 1991.
[71] Elvi Räisänen-Ruotsalainen, Timo Rahkonen, and Juha Kostamovaara. A time digitizer with interpolation based on time-to-voltage conversion. In Proceedings of the 40th Midwest Symposium on Circuits and Systems, pages 197-200, August 1997.
[72] Dan Weinlader, Ron Ho, Chih-Kong Ken Yang, and Mark Horowitz. An eight channel 36Gsample/s CMOS timing analyzer. In ISSCC Digest of Technical Papers, pages 170-171, 2000.
[73] Thomas A. Knotts, David Chu, and Jeremy Sommer. A 500MHz time digitizer
IC with 15.625ps resolution. In ISSCC Digest of Technical Papers, pages 58-59,
1994.
[74] Yasuo Arai and Masahiro Ikeno. A time digitizer CMOS gate-array with a 250 ps
time resolution. Journal of Solid State Circuits, 31(2):212-219, February 1996.
[75] J. G. Maneatis and M. A. Horowitz. Precise delay generation using coupled oscillators. Journal of Solid State Circuits, 28(12):1273-1282, December 1993.
[76] Lindsay Kleeman. The jitter model for metastability and its application to redundant synchronizers. IEEE Transactions on Computers, 39(7):930-942, July 1990.
[77] W. A. M. Van Noije, W. T. Liu, and S. J. Navarro, Jr. Precise final state
determination in mismatched CMOS latches. Journal of Solid State Circuits,
30(5):607-611, May 1995.
Appendix A
Full Schematics
A.1 4 oscillator chip
A.2 16 oscillator chip
Figure A1.1: Top-level (chip core)
Figure A1.2: Node
Figure A1.3: Relaxation oscillator
Figure A1.4: Compensation amplifier and summer
Figure A1.5: Differential to single-ended amplifier
Figure A1.6: Sampled phase comparator
Figure A1.7: Phase comparator core
Figure A2.1: Top-level (chip core)
Figure A2.2: Individual tile
Figure A2.3: Node
Figure A2.4: Compensation amplifier
Figure A2.5: Ring oscillator
Figure A2.6: Differential inverter for the ring oscillator
Figure A2.7: Clock divider
Figure A2.8: Jitter measurement block
Figure A2.9: Pulse generator
Figure A2.10: DRAM block
Figure A2.11: DRAM write token
Figure A2.12: DRAM bitslice
Figure A2.13: Phase measurement arbiter
Figure A2.14: DRAM data 3-state driver
Figure A2.15: DRAM output data serializer