Analysis and Characterization of Random Skew and Jitter in a Novel Clock Network

by Vadim Gutnik

Bachelor of Science, Electrical Engineering and Computer Science, and Materials Science and Metals Engineering, University of California at Berkeley (1994)
Master of Science, Electrical Engineering and Computer Science, Massachusetts Institute of Technology (1996)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2000

© Massachusetts Institute of Technology 2000. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, March 3, 2000
Certified by: Anantha Chandrakasan, Associate Professor of Electrical Engineering, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Departmental Committee on Graduate Students

Analysis and Characterization of Random Skew and Jitter in a Novel Clock Network

by Vadim Gutnik

Submitted to the Department of Electrical Engineering and Computer Science on March 3, 2000, in partial fulfillment of the requirements for the degree of Doctor of Science in Electrical Engineering

Abstract

System clock uncertainty, in the form of random skew and jitter, is beginning to affect performance of large microprocessors significantly. Process and environmental variations and inter-signal coupling on a chip contribute significant delay variations in long clock lines, and these variations are predicted to make the now widely-used clock tree distribution untenable. Distributed clock generation may allow clock networks to continue scaling with advances in semiconductor processing technology. A novel clock network composed of multiple synchronized phase-locked loops is analyzed, implemented, and tested.
Undesirable large-signal stable (modelocked) states dictate the transfer characteristic of the phase detectors; a matrix formulation of the linearized system allows direct calculation of system poles for any desired oscillator configuration. The circuits were fabricated in CMOS, and two implementations of the system are presented: a 4-oscillator proof-of-concept 400 MHz network, and a 16-oscillator, 1.3 GHz network. A flash time-to-digital converter is presented that exploits parallelism to get precise time measurements with resolution much smaller than a single gate delay. Unfortunately, an unrelated failure precluded measurements on the 16-oscillator chip where the measurement system was integrated, but the principle is shown to be valid on an independent test chip.

Thesis Supervisor: Anantha Chandrakasan
Title: Associate Professor of Electrical Engineering

Acknowledgments

I would like to thank my thesis advisor, Professor Chandrakasan, for innumerable technical discussions, for always being available and approachable, and for making sure I could concentrate on thesis work. Thanks also to my thesis readers, Professors Boning and Verghese, for their help in organizing the thesis. Thanks go to my research group as well; my research would have been much less enjoyable and much less successful were it not for their advice, help, and camaraderie. And of course, thanks to my family for putting up with me through an awful lot of years of school.

Contents

1 Clocks in Digital Systems
  1.1 Definitions
  1.2 Thesis Scope
2 Models of Clock Network Timing Variations
  2.1 Previous Work: Clocks
    2.1.1 Equipotential Clocking
    2.1.2 H-Trees and Generalized Trees
    2.1.3 Active Skew Management
  2.2 Previous Work: Variations
    2.2.1 Layout-Dependent Processing Variations
    2.2.2 Wafer-Scale and Random Physical Variations
    2.2.3 Circuit Implications of Mismatch
    2.2.4 Abstract Variation Models
  2.3 Categories of Mismatch
  2.4 Clock Architecture Comparison
    2.4.1 Clock metric
    2.4.2 Tree
    2.4.3 Grid
    2.4.4 Active Feedback
3 Synchronization and Stability
  3.1 Previous Work: Synchronization
    3.1.1 Local Data Synchronization
    3.1.2 Local Clock Synchronization
  3.2 Proposed Clock Architecture
  3.3 Small Signal
    3.3.1 General Derivation
    3.3.2 Examples
  3.4 Large Signal: Mode Locking
4 Implementation and Testing Distributed Clocks
  4.1 4 Oscillator Chip
    4.1.1 Oscillator
    4.1.2 Phase Detector
    4.1.3 Loop Filter
  4.2 16 Oscillator Chip
    4.2.1 Oscillator
    4.2.2 Phase Detector
    4.2.3 Loop Filter
5 On-Chip Measurement of Clock Performance
  5.1 Introduction and Motivation
  5.2 Time-to-Digital Converter Fundamentals
  5.3 SOTDC Yield
  5.4 Calibration of a SOTDC
  5.5 Circuit and Results
6 Conclusions
  6.1 Summary and Contributions
  6.2 Future Work
    6.2.1 Testing and measurement
    6.2.2 Unconventional Clocks
A Full Schematics
  A.1 4 oscillator chip
  A.2 16 oscillator chip

List of Figures

1-1 2 bit synchronous counter
1-2 Timing diagram for 3-counter
1-3 Two paths in a clock network
1-4 Relationship of clock offset, skew, and jitter
2-1 Alpha clock grid evolution
2-2 Four-level H-tree
2-3 Zero-skew balanced tree
2-4 Digital active deskewing
2-5 Skew caused by finite rise time
2-6 Independent balancing of NFETs and PFETs
2-7 Example H-tree
2-8 Schematic model of capacitive coupling
2-9 Clock tree tradeoffs
2-10 Grid distribution block schematic
2-11 Model circuit for shorted grid drivers
2-12 Power vs. skew for a grid
2-13 Simulated edge in a grid with skew to the drivers
2-14 Short circuit power in a grid vs. input tree skew
2-15 Low-skew wire with DLL
2-16 Matching tree leaves with a DLL
2-17 Matching tree leaves with two DLLs
2-18 Matching tree leaves with two DLLs, which requires delay cell matching
2-19 DLL architecture
2-20 Multi-input delay cell DLL architecture
2-21 Tile number optimization
2-22 A variable delay element and phase comparator can be configured into a DLL or a PLL
3-1 Mode-locking example
3-2 Distributed clocking network
3-3 Standard phase-locked loop
3-4 Linear system model of a standard phase-locked loop
3-5 Multi-oscillator phase-locked loop
3-6 Linear system model of a multi-oscillator phase-locked loop
3-7 PLL loop gain Bode plots
3-8 Root locus for single-oscillator PLL with gain error
3-9 Asymmetrical one-dimensional PLL array
3-10 Symmetrical one-dimensional PLL array
3-11 Root locus for a one-dimensional array of PLLs
3-12 Comparison of noise responses for symmetrical and asymmetrical networks
3-13 Root locus for a two-dimensional array of PLLs
3-14 Mode-locking example
4-1 Micrograph of the 4 oscillator, 350 MHz chip
4-2 Relaxation oscillator schematic
4-3 Relaxation oscillator layout
4-4 Phase detector schematic
4-5 Phase detector timing waveforms
4-6 Sampled phase detector half-circuit transfer function
4-7 Sampled phase detector full transfer function
4-8 Loop filter schematic
4-9 Micrograph of the 16 oscillator, 1.3 GHz chip
4-10 Ring oscillator schematic
4-11 Phase detector
4-12 Simulated phase transfer curve
4-13 Locking behavior of the PLL array
4-14 Loop filter schematic
5-1 Time to voltage converter operation
5-2 Phase vernier
5-3 Arbiter definitions
5-4 TDC structure. "D" marks delay elements, and "A" the arbiters
5-5 X(i) vs. i
5-6 SOTDC yield
5-7 Symmetric CMOS arbiter
5-8 Measured $x_i$, with expected curve for 18 ps standard deviation of t
5-9 Measured $x_i$ vs. $x_i$ derived via Eq. 5.9, for $\sigma$ = 0.35 ps
5-10 Measurement chip micrograph
A1.1 Top-level (chip core)
A1.2 Node
A1.3 Relaxation oscillator
A1.4 Compensation amplifier and summer
A1.5 Differential to single-ended amplifier
A1.6 Sampled phase comparator
A1.7 Phase comparator core
A2.1 Top-level (chip core)
A2.2 Individual tile
A2.3 Node
A2.4 Compensation amplifier
A2.5 Ring oscillator
A2.6 Differential inverter for the ring oscillator
A2.7 Clock divider
A2.8 Jitter measurement block
A2.9 Pulse generator
A2.10 DRAM block
A2.11 DRAM write token
A2.12 DRAM bitslice
A2.13 Phase measurement arbiter
A2.14 DRAM data 3-state driver
A2.15 DRAM output data serializer

Chapter 1

Clocks in Digital Systems

The vast majority of integrated circuits manufactured today are synchronous digital systems. The performance of these systems, measured in terms of computation per unit time, is readily increased by increasing the clock rate. The bulk of the effort in design of high speed systems is expended on the design of systems that operate correctly when synchronized by ever faster clocks. An increasing amount of effort has been made in designing the clocks themselves so that imperfections in the clock do not unnecessarily limit system performance. This chapter introduces terminology and constraints relevant to clock performance in digital systems.

1.1 Definitions

Digital devices can be modeled as finite state machines: a set of registers holds the current state, combinational logic computes the next state, and at specific instants the registers are loaded with the newly computed state. In the majority of digital systems, where the registers are designed to be loaded at the same time, a periodic synchronization signal, or clock, must be distributed throughout the system [1]. The clock distribution network of a modern microprocessor uses a significant fraction of the total chip power and has substantial impact on the overall performance of the system.
For example, the 72 watt, 600 MHz Alpha processor [2] dissipates 16 watts in the global clock distribution, and another 23 watts in the local clocks: more than half the power goes to driving the clock net!

While clock design issues can be subtle, the main performance criteria for the system clock are straightforward. Consider a simple example. Fig. 1-1 shows a simple digital circuit: a synchronous counter that counts to 3. The associated timing waveforms are shown in Fig. 1-2. For the first several cycles shown, the circuit works correctly, and counts 00, 01, 10, 00. However, for a number of reasons described below, actual clock signals are neither perfectly periodic nor perfectly simultaneous. This timing imperfection can lead to two types of timing errors.

[Figure 1-1: 2 bit synchronous counter]

[Figure 1-2: Timing diagram for 3-counter]

The first type of timing error occurs when clock0 arrives early at cycle 4: in this case, the data from Q1 does not have time to propagate through the NOR gate, so the wrong value is latched into R0. Formally, this may be called a "setup time violation," because the correct value was not present at the input to a latch sufficiently before a clock edge. A setup violation occurs if

$$T_{i,n} + t_{CQ} + t_{logic} > T_{j,n+1} - t_{setup} \tag{1.1}$$

where $T_{i,n}$ is the time of arrival of the $n$th edge at the $i$th flip flop, $t_{CQ}$ is the clock-to-Q time for the $i$th flip flop, $t_{logic}$ is the worst case (longest) logic delay between the $i$th and $j$th flip flops, and $t_{setup}$ is the setup time for the $j$th flip flop. Note that $i$ could equal $j$.

The second type of timing failure happens when clock1 arrives too late at cycle 6: the 0 that R0 latches on this cycle propagates to the input of R1 and is latched instead of the correct value, formally because of a hold time violation on R1. Colloquially, the value is said to have "raced through" latch R1.
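As an illustration of the setup condition in Eq. 1.1, the following minimal Python sketch evaluates the inequality for a single launch and capture pair. The function name and all numeric values are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
def setup_violation(t_launch, t_capture, t_cq, t_logic_max, t_setup):
    """Return True if Eq. 1.1 holds, i.e. a setup violation occurs.

    t_launch    : arrival time of edge n at the launching flip flop i
    t_capture   : arrival time of edge n+1 at the capturing flip flop j
    t_cq        : clock-to-Q time of flip flop i
    t_logic_max : longest logic delay between flip flops i and j
    t_setup     : setup time of flip flop j
    """
    return t_launch + t_cq + t_logic_max > t_capture - t_setup


# Hypothetical values, all in nanoseconds.
period = 2.0
t_cq, t_logic_max, t_setup = 0.2, 1.5, 0.1

# Capturing edge arrives exactly one period after the launching edge.
on_time = setup_violation(0.0, period, t_cq, t_logic_max, t_setup)

# Capturing edge arrives 0.3 ns early, shortening the instantaneous period.
early = setup_violation(0.0, period - 0.3, t_cq, t_logic_max, t_setup)

print(on_time, early)
```

With these assumed numbers the on-time edge leaves 0.2 ns of margin, while the early edge does not leave enough time for the data to settle, which is the situation described for cycle 4 of the counter example.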
A hold violation occurs if

$$T_{i,n} + t_{CQ} + \hat{t}_{logic} < T_{j,n} + t_{hold} \tag{1.2}$$

where $t_{hold}$ is the hold time for the $j$th register, and $\hat{t}_{logic}$ is the worst case (shortest) logic delay.

Setup and hold violations are different in a number of ways. Setup violations occur because some instantaneous clock period is too short, and can be averted by lowering the nominal clock frequency. Because setup violations involve successive clock edges, possibly at the same register, they are typically considered to be a result of temporal clock variation. Hold violations, on the other hand, involve arrivals of the same edge at multiple registers; they result from spatial clock variation. Slowing down the clock does nothing to avert hold violations; instead, the effective hold time of the offending registers must be increased, often by adding pairs of inverters after the register.

Traditionally, clock networks have been characterized in terms of skew, the spatial variation in arrival times, $T_s(i,j) = T_i - T_j$; and jitter, the temporal variation in clock period at a node, $T_j(n) = T_{i,n+1} - T_{i,n} - T_{period}$. Rewriting Eq. 1.1 and Eq. 1.2 in terms of skew and jitter, with the jitter taken at the capturing flip flop, gives

$$T_s(i,j) - T_j(n) > T_{period} - t_{setup} - t_{CQ} - t_{logic} \tag{1.3}$$

$$T_s(j,i) > t_{CQ} + \hat{t}_{logic} - t_{hold} \tag{1.4}$$

as the conditions for a setup violation and a hold violation, respectively.

[Figure 1-3: Two paths in a clock network]

[Figure 1-4: Relationship of clock offset, skew, and jitter. (a) Definition of clock time offset. (b) Time offset plot for a single clock. (c) Conventional view of skew and jitter. (d) Skew and jitter in modern clocks are commingled.]

In older clock networks, the clock source was the source for the majority of jitter, so jitter was the same for all the clock nodes. Referring to Fig. 1-3, the assumption was that the delay along each of paths A and B is a constant, and the only source of time-dependent noise is the clock source. Hence, if the clock arrives at node A one nanosecond late, it would also arrive at node B one nanosecond too late. Dually, skew was caused by static path-length mismatches to the clock loads, so skew was constant from cycle to cycle. If on one clock cycle the clock at B lagged the clock at A by one nanosecond, it would lag by one nanosecond at the next clock cycle as well. If we plot the time offset from an ideal clock, defined in Fig. 1-4(a), vs. time for a single clock, we'd expect to see something like Fig. 1-4(b). The traditional model suggests that two on-chip clocks behave as shown in Fig. 1-4(c).

In modern clock systems, however, delay from the clock source to the loads dominates both static and dynamic mismatches, so arrival times at different nodes are not necessarily correlated. If the clock arrival time at node A is not correlated with the arrival time at node B, the jitter at B need not match the jitter at A, and the skew between A and B becomes time-varying, as shown in Fig. 1-4(d). This means that the skew and jitter terms in Eq. 1.3 and Eq. 1.4 would have to be fully indexed for sample time and location. In short, there is little reason to treat skew and jitter separately in modern clock networks. For this reason, this thesis uses "clock skew" and "clock uncertainty" interchangeably to mean the difference between the actual clock arrival time and the nominal arrival time, whether the reference is established by a spatially or a temporally distinct clock edge. Aside from avoiding semantic distinction between skew and jitter, this usage allows us to consider skew and jitter contributions of individual clock paths, rather than pairs of paths. (This is an exact clock network analog of analyzing half-circuits in amplifier design.)

Just as there are distinctions between types of timing errors (hold vs. setup violations), and between types of clock uncertainty (skew vs. jitter), there are several divisions in the sources of clock uncertainty.
First, errors can be divided into systematic or random. Systematic errors are due to layout-dependent parameter variations, length variations in the lines, load capacitance mismatches, etc.; that is, any variations that are the same from chip to chip. In principle, such errors could be modeled and corrected at design time given sufficiently good simulators. Failing that, systematic errors can be deduced from measurements over a set of chips, and the design adjusted to compensate. Random errors are due to manufacturing variations, inter-signal coupling (which is predictable but often too hard to model correctly), thermal and slow supply-voltage gradients, power-supply-noise-induced delay variations in buffers, and to some extent, thermal noise. It is impossible to eliminate some sources of random clock uncertainty, but it is possible to model some of the skew and jitter sources, and to design in a way that minimizes their effects.

Mismatch may also be characterized as static or time-varying. In practice, there is a continuum between changes that are slower than the time constant of interest and those that are faster. For example, temperature variations on a chip vary on a millisecond time scale. A clock network tuned by a one-time calibration or trimming would be vulnerable to time-varying mismatch due to varying thermal gradients. On the other hand, to a feedback network with a bandwidth of several megahertz, thermal changes appear essentially static. Note the caveat that time-varying signals can cause static errors as long as they are periodic with the clock. For example, the clock net is usually by far the largest single net on the chip, and simultaneous transitions on the clock drivers induce noise on the power supply. However, this high speed effect does not contribute to time-varying mismatch because it is the same on every clock cycle, and hence affects each rising clock edge the same way.
Of course, this power supply glitch may still cause static mismatch if it is not the same throughout the chip.

Finally, random skew can be subdivided into spatially correlated and spatially uncorrelated mismatch. (Note the similarity to static and time-varying mismatch, which could be restated as temporally correlated and uncorrelated.) Again, the distinction is not absolute. Different physical parameters will have different correlation distances; hence it is possible for a single pair of wires to be correlated in one respect but not in the other. Table 1.1 shows the categories and several examples of the sources of each type of random mismatch.

              | correlated                                                    | uncorrelated
static        | wafer-scale etching, polishing and lithography gradients     | MOSFET channel doping
time-varying  | temperature and power-supply gradients                        | value-dependent load capacitance, inter-signal coupling

Table 1.1: Categorization and example sources of non-systematic mismatch

1.2 Thesis Scope

As argued in Chapter 2, signal delay across a microprocessor chip measured in clock cycles has been increasing as technology scales to smaller feature sizes, and is now comparable to one clock cycle. Because clock uncertainty scales with path delay, relatively longer delays increase the fraction of clock uncertainty per clock cycle; this trend could severely limit performance if not corrected. The overall goal of this thesis was to examine clock performance at both the circuit and the architectural level to find ways to design clocks in an environment where performance is limited by random physical mismatches and noise.

This thesis is split into three parts. The first part, Chapter 2, analyzes how sources of skew and jitter affect different clock architectures. The nonintuitive result is that a tree architecture is not well suited to systems where cycle time is shorter than cross-chip path delay, and that distributed clock networks become increasingly attractive.
This analysis leads into the second part, which proposes a novel clock network composed of multiple synchronized phase-locked loops. Chapter 3 covers large- and small-signal stability of the system. Undesirable large-signal stable (modelocked) states dictate the transfer characteristic of the phase detectors; a matrix formulation of the linearized system allows direct calculation of system poles for any desired oscillator configuration. Chapter 4 deals with circuit implementation in CMOS, presenting two implementations of the system: a 4-oscillator proof-of-concept 400 MHz network, and a 16-oscillator, 1.3 GHz network.

The last part of the thesis, Chapter 5, examines ways to measure performance of a high-speed clock. As clock performance is optimized for fast operation, it becomes increasingly difficult to measure clock jitter. A flash time-to-digital converter is presented that exploits parallelism to get precise time measurements with resolution much smaller than a single gate delay. Unfortunately, an unrelated failure precluded measurements on the 16-oscillator chip where the measurement system was integrated, but the principle is shown to be valid on an independent test chip.

Chapter 2

Models of Clock Network Timing Variations

Unpredictable parameter variations and noise are becoming dominant concerns for clocks. Clock networks have traditionally been optimized for minimum design time (gridded clocks) or power and wireability (trees). Process variations, on the other hand, have been studied extensively in terms of matching limitations on analog circuits, and to some extent in individual clock architectures. This chapter considers how clock uncertainty depends on both architecture and imposed mismatch.

2.1 Previous Work: Clocks

Consider first the taxonomy and evolution of clock networks.
Note that a great deal of work nominally about "clocking" has gone into finding the exact sequence of timing signals needed to clock a microprocessor at the fastest possible speed [3, 4, 5, 6, 7, 8, 9], and a number of CAD tools have been developed to find and verify such timing schedules [10, 11, 12]. However, the analysis of what timing signals are needed is independent of how the signals are distributed. Unpredictable variations are no more tolerated in scheduled-skew designs than in ideally zero-skew designs. The remaining discussion will assume that the optimal clocking schedule has already been determined and that what remains is implementation.

2.1.1 Equipotential Clocking

Conceptually the simplest clocking strategy is to distribute a global clock to the chip as a regular, though heavily loaded, signal line. This is known as equipotential clocking because the implicit assumption is that resistance in the wires is negligible and the entire net is always at a uniform voltage. For small nets with relatively few clock loads and a slow clock, this works well. For large chips and fast clocks, equipotential clocking has the advantage that most of the clock distribution network can be designed independently of the logic.

In fact, there is some RC time constant ($\tau$) associated with the wires of such a clock net. When $\tau$ is small compared to the clock period, the RC delays are unimportant. As feature sizes scale down, however, $\tau$ increases and clock rates go up, so the net no longer appears as a lumped capacitance and acts instead as a lossy delay line. Propagation delays along the clock net cause skew. Because $\tau$ scales with the size of the net, equipotential clocking can still be used for subsections of a chip [13], and implicitly at the lowest level in hierarchical [14] and distributed [15, 16] designs.

The tour de force of equipotential clocking was the first DEC Alpha chip [17] (Fig. 2-1(a)).
In that design, a single, segmented buffer placed lengthwise in the center of the die drives a grid made using two upper metal layers (i.e., the thickest metal available, to lower $\tau$). The worst-case time difference between clock arrivals was 200 picoseconds, and this was sufficient for a 200 MHz clock. The next two versions, the 300 MHz Alpha and its strikingly similar 433 MHz cousin [18, 19], both used two drivers for the entire grid (Fig. 2-1(b)). Why? With higher clock speeds, the RC delay from the center of the chip to the edges becomes significant; the two drivers effectively each drive half of the chip, so the delays are shorter. The 600 MHz Alpha [2] (Fig. 2-1(c)) followed this trend: it has four top-level buffers, because with the higher clock speeds and wire delays, ever smaller sections of the chip can be modeled as equipotentials.

[Figure 2-1: Evolution of Alpha's grid based clock network. In all cases, large buffers drive a regular mesh of metal2 and metal3 wires. (a) One-driver grid. (b) Two-driver grid. (c) Windowpane grid.]

2.1.2 H-Trees and Generalized Trees

If it were possible to lay out the clock net so that all points where the clock is used are equidistant from the clock driver, the wire delay would not cause skew. This idea led to H-trees (Fig. 2-2) [20, 21, 14]. By symmetry, the distance from the center of the net (the root of the tree) to each of the ends (leaves) is the same. Therefore, regardless of $\tau$, signals should arrive at the leaves at the same time. The clock can then be distributed to a smaller (approximately equipotential) net around each leaf. The size of this equipotential region around each leaf shrinks as the depth of the tree increases, so deeper trees are needed for faster clock speeds.

[Figure 2-2: Four level H-Tree. Paths from the center to the leaves are geometrically the same.]

The maximum clock frequency is limited by dispersion of pulses on the RC wires, so the basic H-tree can be improved immediately by symmetrically inserting buffers along the branches to regenerate the signal [21, 22, 15, 14]. Clock trees are insensitive to global process and environmental variations; skew is still zero if the resistance of the wires is higher than expected, say, or if the input threshold to all the buffers changes. Of course, H-trees are affected by intra-die variations [23, 24]. Anything that causes similar paths on different parts of the chip to have different delays (e.g., local line width variations, temperature gradients, varying threshold voltages, etc.) causes skew.

H-trees are most useful when clocking regular arrays, because the leaves form a regular grid. What can be done if the clock loading is not so geometrically regular? The vital feature of H-trees is that the distance from the root to all the leaves is the same. Finding a balanced tree for an arbitrary set of points is known as the zero-skew tree problem. In general, finding a zero-skew tree with minimum total length is exceptionally hard; however, a number of heuristic algorithms have been proposed [25, 26, 27, 28, 29]. Closely related to the zero-skew problem is the bounded skew tree problem, where a small amount of path difference is allowed to help minimize the total wire length, and therefore minimize area and power dissipation [30].

All of these tree approaches are bottom-up algorithms that start by connecting groups of nodes into a tree and then merging trees until only one net remains. They are distinguished by exactly how they merge trees, behavior in pathological cases, how the number of computations scales with the number of clock loads, how they route around obstructions, etc. The result is essentially the same, however: they all produce an irregular clock tree that ties together a specified set of clock loads such that the distance from the root to the leaves is approximately equal (Fig. 2-3).

[Figure 2-3: Zero-skew balanced tree]

Most modern processors use some version of such trees to distribute the clock [31, 32, 33, 34]. Those that do not use explicit trees still simulate and balance path delays from the clock source to all the loads, so they act essentially as generalized clock trees. There the matching is generally less precise, because the delay to the leaves, while nominally identical, is composed of the delays of a variable number of gates and length of wire, so even global variations in a particular parameter may cause skew.

2.1.3 Active Skew Management

One approach to measure and cancel out static skew involves splitting the H-tree into two halves, measuring the relative offset between the two, and applying the appropriate delay, as shown in Fig. 2-4 [35]. In this structure, the delays and control signals are digital; this adds a measure of noise immunity, but increases the overhead power and area. Further, the model does not scale well: there is explicit digital control to guarantee that the delays do not both continue to increase. Splitting the tree into more sections allows finer adjustment, but the control overhead increases rapidly as well.

[Figure 2-4: Digital active deskewing]

2.2 Previous Work: Variations

Because the goal of a clock network is to distribute an identical signal to multiple locations, device and interconnect matching is important. Environmental variables, such as supply voltage, switching activity and temperature, depend on the design of the chip, and hence are under the control of the designer. Conversely, processing variables, including film thickness, lateral lengths, resistivity, etc., are defined by the manufacturing process, and can be treated as imposed constraints [43].
This section describes some of the approaches to modeling the constraints and their effects on circuits.

2.2.1 Layout-Dependent Processing Variations

Some manufacturing process steps, most notably etching, chemical-mechanical polishing (CMP) and lithography, are influenced by topography on a chip. This layout-dependent processing causes systematic device and interconnect variations [43, 44, 45]. Modeling this variation falls into the realm of statistical metrology; see [46] for a review. This systematic variation need not limit clock performance, however. Design rules are evolving to ensure layout pattern uniformity. For some effects, it may be feasible to add a spatially-varying fabrication mask offset, just as masks are made by adjusting the drawn layout to compensate for lithography and etching biases. As a last resort, clock performance can be measured and systematic offsets can be compensated in the design.

2.2.2 Wafer-Scale and Random Physical Variations

Unlike systematic skew, skew caused by random physical variations is unavoidable. For example, a dominant source of device mismatch over small areas is $V_{th}$ variation due to stochastic distribution of dopants; this variation depends only on channel area [47, 45, 48, 49]. Wafer-scale non-uniformity, while not truly random, varies from chip to chip. For example, deposited thin films often have a radially-symmetric thickness profile across a wafer. This results in gradients in parameter values across chips that depend on the position of the chip within a wafer, and hence cannot be compensated on chip [43].

Figure 2-5: Clock skew caused by finite signal rise time (voltage vs. time, with threshold levels $V_{th,\max}$ and $V_{th,\min}$). $t_1 - t_0$ and $t_3 - t_2$ are skew due to variable buffer threshold voltages; $t_3 - t_1$ and $t_2 - t_0$ are due to variable rise time; $t_3 - t_0$ shows the worst-case combined effect.

2.2.3 Circuit Implications of Mismatch

Processing mismatch translates directly into loss of clock performance.
For example, variations in saturation current or buffer thresholds can both lead to variable clock arrival times, as shown in Fig. 2-5 [21, 20]. Exact numbers are not easily available, but one may assume that there could be 10% dynamic variation in $V_{DD}$ across a chip (which affects the threshold and drive current) and another 5% variation in $I_{DSS}$ between two distant, though nominally matched, buffers. That leads to an expected clock skew of 2.5% of the total clock cycle from a single pair of gates! In the current regime, where the clock skew budget is approximately 10% of the clock period, this is quite substantial [22, 50, 51]. Attempts to increase the maximum clock speed by increasing pipelining along an H-tree exacerbate this effect [52].

Because random variations cause substantial skew, there have been a number of attempts to minimize mismatches at the circuit level. For example, it was noticed that due to poor matching between nfets and pfets, signal paths which do not match the nfets and pfets separately may add skew unnecessarily [53]. The canonical example is shown in Fig. 2-6. On a rising input clock edge, transistors N1, P2 and N3 are turned on in the top chain and N4 and P5 in the bottom chain. Because nfets may be expected to track nfets better than pfets, and vice versa, the lowest skew is achieved by sizing the transistors so that $d_{N1} + d_{N3} = d_{N4}$ and $d_{P2} = d_{P5}$, where $d_{N1}$ is the delay due to transistor N1, etc.

Figure 2-6: Independent balancing of NFETs and PFETs (a three-inverter chain P1/N1, P2/N2, P3/N3 driving Clock 1 and a two-inverter chain P4/N4, P5/N5 driving Clock 2 from a common clock input).

The general observation is that matching is best between similar components. One cannot expect wire delays to match gate delays over all process corners, for example. Clock designers have also started to pay attention to wisdom from analog design: matching is best between similar elements, and matching between identical elements is improved by making them larger.
For example, matching wire delays to gate delays is likely to lead to random skew. And when matching delays through a clock tree, fast paths sometimes need to be slowed down. There are two straightforward ways to accomplish this: make the wires longer or make them wider. Which is better? Wider wires are preferable because of the diminished influence of edge effects [50, 54, 55].

Consideration of random variations is becoming increasingly important in clock designs. The solutions tend to be ad hoc, and there has been little work on how well physically separated components may be expected to match. And most clock trees are still designed to achieve minimal nominal skew without consideration for how random variations will affect performance.

2.2.4 Abstract Variation Models

At the opposite extreme from the ad hoc physical models are the abstract models for skew [15, 56, 42, 57]. The assumption in these models is that skew is caused by uncorrelated, random variations in the clock distribution network. Unfortunately, because they are so far removed from implementation, generic statistical models give somewhat misleading results, for several reasons.

The first is that they are too optimistic about statistical independence of variations. For example, gates that are near each other are likely to match each other more closely than gates that are physically separated. This means that the sum of the skews caused by gates in any signal path will have higher variance than would the sum of skews caused by the same number of gates randomly selected from the chip. Also, as has been pointed out, not all variations have the same weight in the final skew: clock trees, for example, are much more sensitive to differences at the root of the tree than at the leaves [56].

Ironically, the second weakness is that general statistical models can be too pessimistic as well.
For example, an analysis of pulse width down a long line of buffers suggests that the pulse width follows a random walk [57]. Thus, it is argued, the pulse might disappear entirely unless the clock period is sufficiently long. In fact, it is not particularly hard to add feedback to ensure a 50% duty cycle, which effectively limits the random walk. In this case and some others, circuit tricks can overcome apparent stochastic barriers [15].

Fundamentally, the very generality that makes sweeping statistical statements interesting is their weakness, because such bounds do not take into account circuit or architectural changes that affect network performance. Although they may place bounds on clock performance, they are necessarily qualitative, and can neither suggest circuit improvements nor take them into account.

2.3 Categories of Mismatch

All on-chip clock networks rely on device parameter matching. This is a crucial difference between logic critical paths and clock networks: variation in critical path delay can be overcome by speeding up the critical path so that the worst-case delay meets timing constraints [58]. Time-dependent logic delay can be included directly in the worst-case timing estimates: maximum delay is constrained by Eq. 1.3 and minimum delay by Eq. 1.4. In contrast, because the clock network itself establishes the timing, both too-slow and too-fast clocks must be avoided. Physical variations are often separated into local and global contributions [59]. For the purposes of clock distribution, time-varying mismatch must be considered explicitly as jitter (and, if uncorrelated spatially, as contributing to skew) [footnote 1].

Integrated circuit fabrication processes generally result in wafer-scale gradients in line width (both metal and polysilicon), thin film thickness (metal wires, gate oxide, interlayer dielectric) and doping concentration [43].
Manufacturing gradients have been cited to explain distance-dependent mismatch in transistors [60]. These variations significantly affect device and interconnect performance. In minimum-size inverters, for example, $L_{eff}$ variation can lead to 9% delay mismatch [61] between chips; in a different process 37% variation of ring oscillator speed was reported within single dies [62].

Clocks depend on matching rather than absolute delays, and are therefore insensitive to truly global parameter variations. We also make the optimistic assumption that all systematic variations are compensated. This could be achieved via modeling (i.e., statistical metrology), or simply by testing finished chips if multiple silicon revisions are to be made. However, because clock networks span an entire chip, wafer-scale gradients are noticeable. It is generally accepted that global effects can be ignored for distances smaller than 100 µm, but are noticeable for distances larger than 1 mm [47, 60]. Global environmental variations, specifically in temperature and DC supply voltage, are imposed by design rather than fabrication, but are otherwise similar in effect. Temperature affects resistivity of the metal, channel mobility, and threshold voltages, and supply voltage affects saturation currents and hence gate delay [63].

Footnote 1: There is a subtle asymmetry between temporal variation in logic and clock. Slack in Eq. 1.4 cannot be exploited to decrease clock cycle time, while any decrease in clock uncertainty directly lowers the minimum clock period. For this reason, temporal variations of the clock are analyzed explicitly.

Figure 2-7: Example H-tree (segments labeled x1 through x7).

Segment   x_i
1         0.1
2         0.3
3         0.5
4         0.5
5         0.5
6         0.4
7         0.25
Average   0.36

Table 2.1: Contributions to skew for an H-tree

The distance between most nominally matched components of a clock distribution network is comparable to chip size, which is typically 1 cm or larger. Fig.
2-7 shows an example H-tree, and the distances $x_i$, normalized to chip size, between nominally matched wire segments are tabulated in Table 2.1. Most of the distances are comparable to the size of a chip; hence, we may expect that the wafer-scale variations are dominant and consider inter-chip mismatch data.

Still, this brings up a messy modeling issue. Delay along a clock wire is a sum of small delays. The delay of each buffer-wire-buffer segment contributes a small random component. If the segments are strictly independent (e.g., uncorrelated threshold voltage variations), the variance along the wire is the sum of individual variances, so the standard deviation of the resulting offset increases as the square root of the length of the wire. Another model is that the mismatch is due to a gradient of delays across a chip (perhaps from thin-film deposition). Because the linear gradient is summed, the mismatch rises with the square of the wire length. Finally, if the perturbations are each fixed-size or uniformly distributed (e.g., a higher supply voltage for a section of the chip), the worst-case offset increases linearly with wire length.

Because gradients dominate over relatively long distances, it would probably be most accurate to model short nearby wires with independent segments, long distant wires in terms of gradients, and intermediate wires linearly. However, that obfuscates the analysis unnecessarily; the key point is that short near wires match better than long distant wires. For the sake of analysis, we will assume that uncertainty scales linearly with delay with a mismatch coefficient $\alpha$, as $p(x) - p(0) \approx \alpha\, p(0)$. This argument can be extended to say that the variability in delay along a path scales linearly with the delay along the path; that is, that there is a fixed percentage error in on-chip path delay. We will use this assumption, although there is an important caveat: $\alpha$ depends on the construction of the path.
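The three accumulation models described above, and the linear simplification adopted for the rest of the analysis, can be compared numerically. The following is a minimal sketch under stated assumptions: the function names, the unit per-segment values, and the choice to model the gradient as a deviation that grows by one step per segment are illustrative and are not taken from the thesis.

```python
import math

def independent_sigma(n_segments, sigma_segment):
    """Uncorrelated segments: variances add, so sigma grows as sqrt(n)."""
    return sigma_segment * math.sqrt(n_segments)

def gradient_offset(n_segments, step):
    """Linear gradient: segment k deviates by k*step, so the sum grows as ~n^2."""
    return step * n_segments * (n_segments + 1) / 2

def fixed_offset(n_segments, delta):
    """Fixed-size perturbation per segment: worst case grows linearly with n."""
    return delta * n_segments

def linear_skew_ps(delay_ps, alpha):
    """Simplification used in the text: skew is a fixed fraction alpha of path delay."""
    return alpha * delay_ps

if __name__ == "__main__":
    # Quadrupling the wire length (4 -> 16 segments) under each model.
    for n in (4, 16):
        print(n, independent_sigma(n, 1.0), fixed_offset(n, 1.0), gradient_offset(n, 1.0))
```

Quadrupling the segment count doubles the independent-segment offset, quadruples the fixed-perturbation offset, and increases the gradient offset by more than a factor of ten, which is the sense in which short near wires match better than long distant ones.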
A 1 ns delay with $\alpha = 0.11$ gives more skew (110 ps) than a 1.1 ns delay with $\alpha = 0.09$ (99 ps). For this reason the classic line-driver optimization may give suboptimal results if wire mismatch is not the same as buffer mismatch. However, for the optimal combination, delay variability will scale linearly with delay.

Of course, matching is not perfect for adjacent wires or devices either. Strong sensitivity of threshold voltage and saturation current to $L$ at short channels also limits matching for minimum-size devices; typically saturation current has a 3% mismatch for minimum devices, and matching down to 1% is straightforward in larger devices. Local mismatch is an important limit for phase detector offset in PLL and DLL systems.

Time-varying effects include capacitive and inductive coupling between signal and clock lines and signal-dependent capacitance. Careful layout can minimize the capacitance between signal lines likely to switch near clock edges and clock wires, but signal coupling is still important because it can be a significant source of jitter. We will assume that up to 5% of the capacitance of any wire may transition during the time a clock edge propagates.

Temperature changes on a chip are generally many orders of magnitude slower than the clock speed, and are therefore reasonably treated as static gradients. On the other hand, supply voltage can change within a single clock cycle in response to changing load current. For this reason, temporal correlation is important when matching elements that depend on supply voltage. An example where this is significant is described in Section 2.4.4.

2.4 Clock Architecture Comparison

While a number of authors have considered the impact of variations on clock performance, most assume tree distribution [52, 41, 63]. This section establishes a common metric and compares several clock architectures.
2.4.1 Clock metric

The three categories of mismatches listed above cover what is needed for a first-order comparison of clock networks. For normalization, each is scaled to distribute a 1 GHz clock to a total of 200 pF load capacitance over a 2 cm chip in a standard 0.25 µm CMOS process. A clock wire in a TSMC 0.25 µm CMOS process would be 1 µm wide, have a resistance of about 0.07 Ω/µm, and a capacitance of 0.1 fF/µm.

It would be convenient to choose a single parameter to characterize clock networks. As discussed earlier, skew and jitter are in general functions of both position and time. It is appropriate to consider the worst-case clock uncertainty over time, but meaningless to look at the worst case across a chip: in all practical cases a signal that takes longer than a clock cycle to propagate would be pipelined, and hence re-clocked. Hence, clock uncertainty between points on a chip further apart than one clock cycle is irrelevant. For this reason, the metric for clock quality will be taken to be worst-case clock mismatch over a distance corresponding to signal propagation distance during one half of a clock cycle.

Figure 2-8: Schematic model of capacitive coupling (coupling capacitance 0.05C).

2.4.2 Tree

Propagation delay along an H-tree can be split into delay from the root to the leaves, and delay from the leaves to a sub-block or tile. Delays to loads from a leaf are generally not matched, so the entire delay in a sub-block adds directly to total skew; this is sometimes called internal clock skew [14, 63]. The point of an H-tree, however, is to match delays from the root to the leaves, so those delays are nominally matched, and only variations contribute to skew.

Consider an 8-level H-tree (i.e., one with $2^8 = 256$ leaves). Assuming equal-sized buffers along the tree, these buffers would be placed at intervals of perhaps 2 mm, for a total of 10 segments. Delay along the tree in this example is simulated to be 0.86 ns. Assuming $\alpha = 0.1$, skew caused by gradient mismatch is 0.86 ns × 0.1 = 86 ps.
Internal skew ($S_i$) is no larger than 0.07 Ω/µm × 625 µm × 0.2 pF ≈ 9 ps.

Capacitive coupling adds a time-varying offset. Fig. 2-8 shows the schematic model used to test the effect of capacitive coupling. The effect may be estimated by adjusting the effective line capacitance for the Miller-multiplied coupling capacitance. In the current example, the line capacitance is 200 fF, the output capacitance of the driving buffer is 34 fF, and the input capacitance to the receiving buffer is 77 fF. A signal making a transition in the same direction as the clock lowers the effective wire capacitance by 5% (given the assumptions above), so the delay should decrease by $\frac{0.05 \times 200}{200 + 111} \approx 3\%$. Conversely, a signal transitioning in the opposite direction will slow down the clock by the same 3%, so the total would be up to 6% variation. (Simulation indicates the total variation is 5%.) This component of uncertainty (skew if the interference recurs on every clock cycle, jitter if it is inconsistent) also scales with the total delay along the tree, and so adds a worst-case 45 ps to clock uncertainty.

To sum up, a clock distributed by a tree as described above will have skew of 140 ps, or 14% of the clock cycle; this is in line with industrial results given the speed and assumptions about the process.

Generalization

We can generalize from this example to other trees. Fig. 2-9(a) shows how the two components of skew change with the depth of the tree, $n$. (The tree of this example had $n = 8$.) As argued above, both mismatch and coupling cause skew proportional to wire length $L$ from root to leaves of the tree; in units of chip size, $L = 1 - (1/2)^{n/2}$. Internal skew scales with the area [footnote 2] of the resulting patch, so $S_i \propto 2^{-n}$. The other key parameter is power. Power scales linearly with switched capacitance, so the clock distribution power (excluding the load) scales as $2^{n/2}$. Fig.
2-9(b) combines the results into a plot of the fundamental clock network tradeoff between power and performance.

Figure 2-9: Clock tree tradeoffs. (a) Skew components in a tree vs. tree depth (area-scaled skew, length-scaled skew, and total). (b) Power vs. skew for a clock tree.

Footnote 2: Strictly speaking, it scales with length squared, but that is equivalent to area for non-pathological patches.

Scaling

Note, however, that a clock tree does not scale well with process technology. As chip dimensions shrink, wire delay ($\tau$) is, at best, constant. Total chip size is also nearly constant. However, clock speeds increase as the gate delay decreases. Delay along the clock net also speeds up, but not by the same factor. Along an optimally buffered line, the ratio of gate delay ($d$) to $\tau$ is constant, so as $d$ falls, the distance between buffers decreases. Wire delay is proportional to the square of the wire length between buffers ($l$). Hence $l \propto \sqrt{d}$. The total number of segments is proportional to $1/l$, so the total delay along a tree is proportional to $d/\sqrt{d} = \sqrt{d}$. Since the clock period is directly proportional to $d$, skew as a fraction of the clock period will grow as $1/\sqrt{d}$ as gate delay falls. In other words, without a dramatic redesign or process improvements, a 4 GHz clock tree would have unpredictable clock skew of 30% of a clock period, and a 16 GHz clock would have to budget over half of the clock period for skew and jitter margin.

Note that as clock speed increases, signal delay across a chip exceeds a single clock cycle. In the example above, a 2 cm-long wire has a delay of 0.86 ns with 1 GHz clocks. Scaling to 4 GHz, the same wire (with optimal buffering) will have a delay of approximately 0.43 ns, compared to a clock period of 0.25 ns. Given the metric defined in Section 2.4.1, therefore, there is no reason to minimize global skew at all.
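As a rough check on the tree numbers and the scaling argument, the arithmetic can be written out as follows. This is a sketch under stated assumptions: the function names are hypothetical, the coupling term uses the simulated 5% total variation (which gives 43 ps rather than the rounded worst-case 45 ps quoted in the text), and clock frequency is taken to be inversely proportional to gate delay $d$.

```python
import math

def tree_uncertainty_ps(delay_ps=860.0, alpha=0.1, internal_ps=9.0, coupling_frac=0.05):
    """Sum the three tree contributions from Section 2.4.2 (all in ps)."""
    mismatch = alpha * delay_ps          # gradient mismatch: 0.1 * 860 ps = 86 ps
    coupling = coupling_frac * delay_ps  # capacitive coupling, simulated 5% of delay
    return mismatch + internal_ps + coupling

def skew_fraction(f_ghz, base_fraction=0.14, base_f_ghz=1.0):
    """Skew as a fraction of the period grows as 1/sqrt(d), i.e. as sqrt(f)."""
    return base_fraction * math.sqrt(f_ghz / base_f_ghz)

if __name__ == "__main__":
    print(f"tree uncertainty at 1 GHz: {tree_uncertainty_ps():.0f} ps")
    for f in (1.0, 4.0, 16.0):
        print(f"{f:4.0f} GHz: skew is {100 * skew_fraction(f):.0f}% of the clock period")
```

Starting from the 14% figure at 1 GHz, the square-root law gives 28% at 4 GHz and 56% at 16 GHz, consistent with the "30%" and "over half" estimates above.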
In a tree, however, the worst-case skew occurs between nearest neighbors, so tree distribution cannot take advantage of the relaxed global constraints. This is the fundamental reason why trees become less attractive at high clock speeds.

Figure 2-10: Grid distribution block schematic.

2.4.3 Grid

A pure grid network would have a single, central driver for the entire chip and a mesh of clock wires. Skew would be simply the wire delay across the chip, just as it is the wire delay in a patch for each leaf of a tree. In the limiting case, a clock plane with a central driver would give skew of 0.07 Ω/µm × 0.1 fF/µm × ($10^4$ µm)$^2$ = 0.7 ns [footnote 3]. Clearly, a single driver will not give adequate performance, so modern grids are H-tree-grid hybrids: a short H-tree distributes clock to a few (4 or 16, for example) buffers around a chip, and those buffers drive a clock grid in parallel, as shown in Fig. 2-10. The final patches are larger than those typical of trees, but the grid helps eliminate skew caused by the tree distribution by shorting together outputs of multiple buffers.

Footnote 3: Scaling this value down to the size of the first Alpha gives skew ≈ 200 ps, which was reported for that chip.

Take as an example system a 4-level ($2^4 = 16$ node) clock tree where the final buffers drive a global grid. Following the example of the previous section, such a tree would have seven 2 mm-long segments and an expected clock uncertainty of 70 ps. Delay across each region, assuming a lumped model with minimum-width wires, would give a skew of 2.5 mm × 70 Ω/mm × 6.25 pF ≈ 1 ns. Because this skew is dominated by wire resistance and load capacitance, it can be reduced by increasing the width of the wires at the cost of increased power. At the point where the capacitance of the wires equals the load capacitance there is one clock wire every 200 µm, and the expected wire skew is 89 ps (85 ps simulated).

Figure 2-11: Model circuit for shorted grid drivers.
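The two RC estimates in this section can be reproduced directly from the wire parameters given in Section 2.4.1. The sketch below only restates that arithmetic; the function and constant names are illustrative.

```python
R_PER_UM = 0.07      # wire resistance, ohm per micron (Section 2.4.1)
C_PER_UM = 0.1e-15   # wire capacitance, farad per micron (0.1 fF/um)

def plane_skew_s(length_um):
    """Distributed-RC estimate for a centrally driven plane: r * c * length^2."""
    return R_PER_UM * C_PER_UM * length_um ** 2

def lumped_region_skew_s(length_mm, r_ohm_per_mm, load_f):
    """Lumped estimate for one grid region: total wire resistance times load capacitance."""
    return length_mm * r_ohm_per_mm * load_f

if __name__ == "__main__":
    print(f"central driver, 1 cm reach: {plane_skew_s(1e4) * 1e9:.2f} ns")
    print(f"one of 16 regions:          {lumped_region_skew_s(2.5, 70.0, 6.25e-12) * 1e9:.2f} ns")
```

Both estimates land near 1 ns (0.7 ns and about 1.1 ns respectively), which is why minimum-width wires are not viable for the grid and the wire width must be increased.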
Furthermore, shorting the buffers together helps drive down some of the uncertainty at the cost of increased short-circuit power during switching and somewhat slower edge rates. A simple circuit model for a grid driven from multiple points is shown in Fig. 2-11. Simulations with a 70 ps skew on buffer inputs show a total skew of 145 ps, of which 55 ps is due to the input skew. It is possible to keep driving this lower by increasing wire width; however, the benefits of wider wires get incrementally smaller as the wire capacitance comes to dominate the total. Doubling the wire width again, for example, lowers total skew to 110 ps, of which 34 ps is due to the input. The drawback, of course, is the power dissipation. The extra wiring needed to get skew down to 110 ps added 25 pF of capacitance per buffer, while the clock load per buffer is only 12.5 pF. Still, grid distribution is used because much of the skew is predictable and, unlike with H-trees, the clock design is largely independent of floorplanning.

Figure 2-12: Power vs. skew for a grid.

Generalization

The primary parameter for a gridded clock is the capacitance of the grid ($C$); that sets both the power dissipation ($P \propto C$) and the wire skew. $S_i$ is proportional to $1 + C_L/C$, where $C_L$ is the load capacitance and $C$ the grid capacitance. Mismatch-induced skew is shorted out by lower-resistance wires, so that component of skew falls as $1/C_L$. A plot of simulated power dissipation vs. skew, corresponding to Fig. 2-9(b), is shown in Fig. 2-12.

Scaling

Grid distributions depend only on wire delays. As mentioned above, wire delays tend not to improve with process technology scaling. As the skew budget decreases with rising clock speed, a grid clock must either increase capacitance or subdivide the chip further with a deeper initial clock tree.
In the example above, the initial tree itself does not add significant power, so an obvious scaling strategy would be to simply make larger trees to minimize $S_i$. As long as delay variations in the initial tree are comparable to rise time, deeper trees and smaller $S_i$ will improve performance. However, rise time scales linearly with $d$, so by the same reasoning as applied to the tree scaling arguments, skew as a fraction of rise time will increase with $1/\sqrt{d}$ as gate delay falls. When the tree skew exceeds rise time, short-circuit power dissipation increases rapidly, and the clock edges begin to show an unacceptable kink. Fig. 2-13 shows simulated edge shapes with increasing input skew for a grid driven from a 4-level tree with skews from 0 to 200 ps, and Fig. 2-14 shows the corresponding short-circuit power dissipation.

Figure 2-13: Simulated edge in a grid with skew to the drivers (edge shape with input skew; voltage vs. time).

Figure 2-14: Short circuit power in a grid vs. input tree skew (input skew from 0 to 200 ps).

2.4.4 Active Feedback

Figure 2-15: Low-skew wire with DLL (source, load, buffers b1 through b8, wire segments w1 through w7, and delay elements D/2).

As is evident from the sections above, an increasing share of skew comes from the initial long-distance distribution of a clock to relatively small loads. A delay-locked loop (DLL) could be adapted to measure and cancel out wire variations. One possible implementation is shown in Fig. 2-15, where a DLL is used to implement a single wire with low effective delay. The intuition is that the delays are adjusted symmetrically until the round trip time from the source to the load and back is a known multiple of a clock period (in line with the examples so far, assume the round trip time is 2 ns, which is 2 clock periods).
Then by symmetry, the signal arrives at the load with a 1 period clock delay, which means it has effectively 0 delay for clock signals.

Unfortunately, this intuition is misleading. Despite the apparent symmetry, there is little reason for the forward path to match the reverse path in this connection, for two main reasons. First, the nominally matched buffers are physically separated. In Fig. 2-15, $b_1$ should match $b_7$, although it would be physically near $b_8$. $b_1$ isn't as far away from its matched pair as it might be in a tree, but it will still typically be millimeters away. Second, there is no temporal correlation. The clock signal passes $w_1$ at a different time than it passes $w_7$, so any time-dependent variations, including those due to power supply and capacitive coupling, do not match. Taking the results from Section 2.4.2, the effective skew for a 1 cm-long DLL wire would be ≈ 90 ps, which is only a 30% improvement over a simple wire, and that does not count offset in the comparison of the two edges or mismatches in the delay cells.

Figure 2-16: Matching tree leaves with a DLL.

Another approach, more like a traditional DLL, is shown in Fig. 2-16. The global clock is distributed to two half H-trees, a phase comparison is done at the leaves, and a variable delay is adjusted to align the clocks. The technique is meant to balance delays along path 1 ($d_1$) and path 4 ($d_4$) in this example. Note, however, that while nodes A and B may be matched, nodes C and D are not; the mismatch between nodes C and D ($m_{CD}$) is $(d_1 + d_3) - (d_4 + d_6)$. The loop drives $d_1 + d_2 = d_4 + d_5$, so $m_{CD} = (d_3 - d_2) - (d_6 - d_5)$, which is somewhat smaller than it would be without the DLL (in which case $m_{CD} = (d_1 - d_4) + (d_3 - d_6)$) because $w_2$ and $w_5$ are both closer together, and shorter, than $d_1$ and $d_4$.

An immediate generalization would be to break up the trees further, and add two more comparators and variable delay elements, as in Fig. 2-17. (Note the difference between Fig. 2-17 and Fig.
2-18. The latter generalization requires matching between delay elements $D_2$ and $D_5$, and between $D_3$ and $D_6$; the former does not require that the delay elements match at all.) Because delays to the leaves are controlled by DLLs, the top-level tree structure is no longer necessary; Fig. 2-19 shows a DLL distribution where each DLL drives a local tree. Static delay variations of nearest neighbors are cancelled out by the DLL to within the precision of the matching of the comparators.

Figure 2-17: Matching tree leaves with two DLLs.

Figure 2-18: Matching tree leaves with two DLLs, which requires delay cell matching.

Figure 2-19: DLL architecture (global clock, four compare/delay stages, and end points A and B).

Dynamic variations, due to supply noise or signal coupling, however, persist; two 1 cm-long paths with active DLL matching will have a relative jitter of approximately 50 ps (all of it time-varying), plus skew from mismatch in the phase detectors and some mismatch from distribution along local trees. A typical phase detector has a delay equal to 2 inverters, and its two halves are physically close together, so skew is expected to be approximately 2 × 5% × $d$ ≈ 10 ps. As drawn, the maximum skew in the network is not between two paths connected with a DLL; rather, the skew between A and B is the sum of the skews through three DLLs (10 ps each) and four local trees (25 ps each). Total clock uncertainty between A and B, then, is 180 ps, and the scaling is even worse because the effective distance between two nearby points grows rapidly as the number of DLLs increases.

A much better result can be obtained by using DLLs that take multiple reference inputs, and adjust output phase to be aligned exactly between the two inputs. The network can then be redrawn somewhat more symmetrically, as in Fig. 2-20.
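The 180 ps figure for the chained-DLL network can be accounted for by adding the terms listed above. The sketch below is one reading of that sum, not a formula stated in the thesis: it assumes the 50 ps of relative jitter is added once to the static skew of three phase detectors and four local trees, and it assumes a gate delay $d$ of 100 ps, which is inferred from "2 × 5% × d ≈ 10 ps". The function names are hypothetical.

```python
def phase_detector_skew_ps(gate_delay_ps=100.0, n_inverters=2, mismatch=0.05):
    """Skew of one phase detector: n inverter delays with 5% local mismatch."""
    return n_inverters * mismatch * gate_delay_ps

def chained_dll_uncertainty_ps(n_dlls, n_local_trees,
                               detector_skew_ps=10.0,
                               local_tree_skew_ps=25.0,
                               jitter_ps=50.0):
    """Worst-case uncertainty between two points separated by a chain of DLLs."""
    static_skew = n_dlls * detector_skew_ps + n_local_trees * local_tree_skew_ps
    return static_skew + jitter_ps

if __name__ == "__main__":
    print(f"phase detector skew: {phase_detector_skew_ps():.0f} ps")
    print(f"A to B uncertainty:  {chained_dll_uncertainty_ps(3, 4):.0f} ps")
```

Under these assumptions the static part is 130 ps and the total is 180 ps; every additional DLL in the chain between two neighboring points adds another detector and local-tree term, which is the poor scaling the text refers to.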
(For clarity, the local tree was not drawn, and the connections to the comparators are abstracted.)

Figure 2-20: Multi-input delay cell DLL architecture.

Optimization of the number of tiles is straightforward. As argued previously, internal skew scales with tile area, so as the number of tiles increases, internal skew falls. However, every boundary between tiles introduces some skew because of mismatch in the phase detector. Hence, as the number of tiles increases, the number of boundaries, and with it the boundary skew, increases. Fig. 2-21 shows the optimization curves calculated for this clock metric.

Figure 2-21: Tile number optimization (area-scaled skew, boundary skew, and total vs. number of tiles, from 1 to 64).

One weakness of DLL networks is that DLLs are inherently sensitive to input jitter. A phase-locked loop (PLL), though somewhat more complicated in implementation, filters out noise on the inputs. PLLs and DLLs are nearly identical structures in isolation. Each has a variable delay element as a core, represented in Fig. 2-22(a). An input signal with phase $\theta$ is delayed by some time $\Delta$ and output with phase $\phi$. In both the DLL and PLL cases (Fig. 2-22(b) and Fig. 2-22(c)), $\Delta = \phi - \theta$. The only difference is where the input signal comes from.
If the input to the block is $\phi$, the system acts as a PLL; if it is $\theta$, a DLL. The noise and stability implications of the feedback will be considered in the next chapter.

Figure 2-22: A variable delay element and phase comparator can be configured into a DLL or a PLL. (a) Variable delay block. (b) Delay-locked loop. (c) Phase-locked loop.

Scaling

As in other clock networks, faster clocks require a more finely-grained architecture. Jitter in a DLL network will rise in exactly the same way as it increases in clock trees, and for the same reasons. Skew scales linearly with $d$ because it is composed of comparator mismatches and delays across each leaf-patch. Note, however, that in a PLL the noise can be expected to scale with $d$; a PLL network like the one in Fig. 2-20 would have total clock uncertainty that is a constant fraction of the clock period.

Chapter 3

Synchronization and Stability

The purpose of an on-chip clock is to synchronize computation. Distributed networks make this synchronization explicit. Chapter 2 argues that the performance of distributed clock networks scales favorably with clock speed (or at least does not scale as poorly as do clock trees). This chapter gives some background on synchronization architectures and then considers the synchronization of multiple oscillators.

3.1 Previous Work: Synchronization

There are two main synchronization schemes. In the first method, handshaking guarantees that computation proceeds in the correct order, although independent processes are not synchronized in any way. In the second method, a global clock is used to synchronize data, but the generation of the global clock is split among multiple blocks that must align their respective clocks.

3.1.1 Local Data Synchronization

The earliest distributed networks dealt with synchronization of data explicitly, rather than of multiple clocks. The archetypical example of this is large processor arrays.
It has been suggested that the computational density available in modern VLSI be used to build large arrays of simple processors which communicate only with nearest neighbors [21, 20, 15, 16]. Since skew is only relevant between communicating processors [7], trees do not seem well suited to the problem: there is no reason to eliminate global skew as long as the clock skew between neighboring processors is low. This can be accomplished by having each processor synchronize directly with its peers.

So-called self-timed systems use handshaking between the blocks for synchronization [21, 41]. Each communication path between two blocks is accompanied by extra signals that implement some manner of flow control. For example:

1. The processor sending data puts the data on the wire and asserts a Data Ready signal.
2. The receiving processor reads the data and then asserts a Data Accepted signal.
3. Data Ready is unasserted.
4. Data Accepted is unasserted.

Because no global synchronization is needed, self-timed systems are an example of an asynchronous system. Such systems have several advantages over globally synchronized systems: there is no global clock to propagate, and each block can work at its actual speed rather than the global worst-case clock speed [21]. However, there are several significant drawbacks: there is circuit overhead in generating the local synchronization signals; the designs are notoriously hard to analyze and test; and often the system operates at the worst-case time anyway, because computation is always limited by the latest input [15, 41, 42].

The approach suggested by El-Amawy [16] avoids some of these problems by having a system that looks fully synchronous, albeit with some local clock skew. However, there is still no global synchronization, and communication is only allowed between neighboring processors.
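The four-step handshake listed earlier in this section can be sketched in a few lines of Python. This is only an illustrative model, not a circuit from the cited work: the `Link` class, the `transfer` function, and the `trace` list are hypothetical names introduced here, and the two blocks are simulated sequentially rather than concurrently.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One self-timed path: data wires plus two flow-control wires (hypothetical model)."""
    data: object = None
    data_ready: bool = False      # driven by the sending processor
    data_accepted: bool = False   # driven by the receiving processor


def transfer(link, value, trace):
    """Move one value across the link using the four steps from the text."""
    # Step 1: sender puts the data on the wire and asserts Data Ready.
    link.data = value
    link.data_ready = True
    trace.append((link.data_ready, link.data_accepted))
    # Step 2: receiver reads the data and asserts Data Accepted.
    received = link.data
    link.data_accepted = True
    trace.append((link.data_ready, link.data_accepted))
    # Step 3: Data Ready is unasserted.
    link.data_ready = False
    trace.append((link.data_ready, link.data_accepted))
    # Step 4: Data Accepted is unasserted; the link is idle again.
    link.data_accepted = False
    trace.append((link.data_ready, link.data_accepted))
    return received


link = Link()
trace = []
values = [transfer(link, v, trace) for v in (3, 1, 4)]
```

Each transfer costs four signal transitions on the control wires, which is one way to see the circuit overhead mentioned above.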
Despite these drawbacks, asynchronous systems are an alternative to global clocking, and may become more prevalent if the prospects of very high speed clock distribution are not improved.

3.1.2 Local Clock Synchronization

The proposed clock distribution architecture is organized as a synchronous array. That is, clocks are generated at multiple places over the chip and controlled to have the same phase and frequency. This approach has not been used in integrated clocks, but it has been proposed for parallel computers, and some of the issues are similar. Pratt and Nguyen suggest constructing a clock for a parallel computer from synchronized, voltage-controlled quartz crystal oscillators [40]. Phase detectors and integrators generate phase error signals, and these are used to pull the crystals to the same phase and frequency. While the desired, phase-locked configuration can be proven stable, it is possible that some arrangement of unequal clock phases is also stable on a given network; this effect is known as mode-locking. In the simplest example, a system consisting of four nodes is stable although the phases are not equal, as shown in Fig. 3-1. Each node sees one neighbor leading and one lagging, and therefore doesn't adjust. The authors show that mode-locking can be avoided in a regular mesh with nonlinear phase detectors, which they implement as balanced XOR gates.

Figure 3-1: Mode-locking example (clock signals at Nodes 1 through 4 versus time)

This architecture is inconvenient for on-chip clock distribution for several reasons. First, modern microprocessors are not organized as regular structures internally; memory caches and ALUs have vastly different clocking needs. Therefore it will be necessary to remove the constraint that the clock nodes form a regular array.
Second, this method depends on having relatively noise-free, well-matched crystal oscillators, but such oscillators are not available on chip, and what is available has much worse short-term stability. Therefore, the phase comparators and stabilization network must be completely redesigned to compensate for the noisier oscillators. Third, they assume that wire delays between nodes are negligible; on an IC, these delays are the very heart of the problem.

3.2 Proposed Clock Architecture

The proposed distributed clock network is an array of synchronized PLLs. Independent oscillators generate the clock signal at multiple points ("nodes") across a chip; each oscillator distributes the clock only to a small section of the chip ("tile") (Fig. 3-2). Phase detectors (PD) at the boundaries between tiles produce error signals that are summed by an amplifier in each tile and used to adjust the frequency of the node oscillator. In general, the network need not be square or regular. With locally generated clocks, there are no chip-length clock lines to couple in jitter; skew is introduced only by asymmetries in phase detectors instead of mismatches in physically separated buffers; and the clock is regenerated at each node, so high frequency jitter does not accumulate with distance from the clock source. Unlike earlier work on multiple clock domains, which suggested the use of multiple independent clocks, this approach produces a single fully synchronized clock. The rest of this chapter examines small and large signal stability of a distributed phase-locked loop.

3.3 Small Signal

In a multiple-oscillator PLL, large- and small-signal behavior are interrelated. In normal operation, the oscillators are phase-locked, and jitter depends on the network response to noise. Because startup is expected to take a negligibly small fraction of time, the connection of the oscillators is optimized for small-signal behavior rather than to make initial acquisition more efficient. The linearized small signal behavior,
The linearized small signal behavior, 52 valid when the oscillators are nearly in phase, is analyzed first. 3.3.1 General Derivation A traditional phase-locked loop (PLL) consists of three components: a voltage controlled oscillator (VCO), a phase detector (PD), and a low-pass loop filter, connected as shown in Fig. 3-3. In a digital application like clock generation, the output of the oscillator is a square wave, and the phase detector generates a signal that on average is related to the difference in phase between two square waves. Clearly, both the oscillator and the phase detector are nonlinear in a strict sense. However, there is an approximately linear relationship between the input voltage of the oscillator and the phase of the output square wave. The relationship between the input phase difference and averaged output of the phase detector is also linear. Hence, the system can be modeled as a linear feedback system Fig. 3-4. The system as drawn in Fig. 3-4 is described by: - = aHi(s) (u - ) aH(s)/(s + aH(s)) u (3.1) (3.2) where u is the input phase. The poles of the system are the solutions of aH(s) + 1 = 0 Substituting H(s) = (3.3) (s + z)/s into Eq. 3.3 gives a(s + z) + S2 = 0 (3.4) which is a familiar result for a simple phase locked loop. Exactly the same analysis applies to a network of coupled oscillators. Consider a set of interlocked PLLs, as shown in Fig. 3-5. The network can be modeled as a multivariable linear system; in fact, the block 53 Chip Boundary ile Boundary Phase Detector Loop Filter &vco & VCOj Figure 3-2: Distributed clocking network Reference timer-CLooptput Otu PDFilter Figure 3-3: Standard phase-locked loop. Loop Filter VCO PD Output Reference s ............ (voltage) s ---.--.. (phase) Figure 3-4: Linear system model of a standard phase-locked loop. 
54 L Reference PD ---- 1 FL r ----- 0 FitrPD Loop VC0 r VCO Fle Loop VCO PDFilter PDFilter Figure 3-5: Multi-oscillator phase-locked loop PD Loop Filter Reference j A 21- -- *, A2 *h ( s) VCO N a N Output N Figure 3-6: Linear system model of a multi-oscillator phase-locked loop diagram (Fig. 3-6) is essentially identical to the one for a single oscillator system, except that the connections between blocks are vectors instead of individual signals, and the gains and transfer functions are matrices instead of scalars. This means that the phase detector becomes a matrix A1 of size N(N + 1)/2 x N instead of a single subtraction, and the loop filter becomes A 2 , a corresponding N x N(N+ 1)/2 matrix. G = A 2 A1 is an intuitively meaningful N x N matrix. The network of oscillators is similar to a lumped circuit C with a node for each oscillator and a branch for each connection between pairs of oscillators. Node voltages in C represent oscillator phase, and branch currents represent the error signals on the output of the phase detector. G is the conductance matrix for C with unity conductance branches. G for a 4 oscillator network is shown in Eq. 3.5. Each off-diagonal entry gij is -1 if there is a phase detector between node i and node j; gij is the number of detectors attached 55 to node i. 3 -1 G = -1 0 ' -1 2 0 -1 -1 0 2 -1 0 -1 -1 (3.5) 2 DC gain in the loop can be lumped into a 3 . Recasting Eq. 3.1 in matrix form gives Eq. 3.6, 4b = [sI + a3A 2 Aih(s)]-' h(s)a3 A 2 U (3.6) where u is now the phase error input to each phase comparator. In other words, u(1) is the reference phase, and u(2) ... u(n) are the noise contributions from interconnect and phase detector mismatch. 3.3.2 Examples Matrix A1 is determined by the geometry of the tiles, and hence will constrained by the placement of clock loads, which for this problem is fixed. Assuming the simplest possible phase-locked loop, h(s) = (s + z)/s. This leaves A 2 , a3 , and z as design variables. 
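As a small check on the conductance-matrix description above, the following sketch builds $G$ from a list of phase-detector connections. The connection list is an assumption chosen to be consistent with Eq. 3.5 (a reference detector on node 1 plus detectors 1-2, 1-3, 2-4, and 3-4); the function name `conductance_matrix` and the use of `None` for the reference are hypothetical conventions introduced here.

```python
def conductance_matrix(n_nodes, detectors):
    """Build G with unity-conductance branches.

    detectors: list of (i, j) node pairs (0-based); j is None when the
    detector compares node i against the external reference.
    """
    G = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in detectors:
        G[i][i] += 1              # every detector on node i adds to g_ii
        if j is not None:
            G[j][j] += 1
            G[i][j] -= 1          # off-diagonal entries are -1 per detector
            G[j][i] -= 1
    return G


# Assumed connectivity for the 4-oscillator example (0-based node indices).
detectors = [(0, None), (0, 1), (0, 2), (1, 3), (2, 3)]
G = conductance_matrix(4, detectors)
for row in G:
    print(row)
```

Because the reference contributes only to a diagonal entry, the row containing it no longer sums to zero, which is what pins the absolute phase of the array.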
There are still far too many choices to find the general optimum, but a few examples may help guide the search.

Single oscillator

The reference design is a single-oscillator phase-locked loop. Stability constraints of a single oscillator PLL may be derived directly from Eq. 3.3; however, it is more common and more intuitive to analyze the loop gain, $ah(s)/s$. Magnitude and phase Bode plots of the loop gain are shown in Fig. 3-7. Note that because of sampling at the phase detector, the continuous time approximation is only valid for frequencies much lower than the oscillator frequency. The Bode plots below add multiple parasitic poles at the clock frequency $\omega_c$ to model the phase effects of the sampling.

Figure 3-7: PLL loop gain Bode plots. (a) Loop gain magnitude. (b) Loop gain phase.

For the PLL to be stable and sufficiently damped, the phase must be above $-135^\circ$ when the loop gain is at 0 dB. This means that the unity-gain frequency, $\omega_0$, should be much lower than $\omega_c$, and that the zero, $z$, should be much lower than $\omega_0$. The location of the dominant pole is not critical to the stability. For a typical 1 GHz oscillator, $a = \omega_0 \approx 330$ MHz, consistent with the constraint $\omega_0 < \omega_c$. In turn, this puts an upper limit of 50 MHz on $z$. Fig. 3-8 shows the root locus for this PLL over a gain error from -50% to 100%.

One dimensional array

A one-dimensional array of oscillators with phase detectors between neighbors is the first generalization of a single PLL. In a perfectly asymmetrical array (call this system $S_1$), the output of PLL $i$ is the input to PLL $i+1$, as shown in Fig. 3-9. $S_1$ is described by

$$ A_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix}, \qquad A_{2,1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{3.7} $$
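Returning briefly to the single-oscillator reference design, the poles of Eq. 3.4, $s^2 + as + az = 0$, can be evaluated directly. The sketch below is illustrative only: it plugs in the quoted 330 MHz and 50 MHz figures as plain numbers (treating them as being in the same frequency units is an assumption), and the function name `pll_poles` is hypothetical. It sweeps the gain over the -50% to +100% error range mentioned for the root locus.

```python
import cmath


def pll_poles(a, z):
    """Roots of s^2 + a*s + a*z = 0 (Eq. 3.4), returned as complex numbers."""
    disc = cmath.sqrt(a * a - 4.0 * a * z)
    return ((-a + disc) / 2.0, (-a - disc) / 2.0)


a_nominal = 330e6   # loop gain, taken from the quoted 330 MHz figure (assumed units)
z = 50e6            # loop-filter zero at its quoted upper limit

for gain_error in (-0.5, 0.0, 1.0):
    a = a_nominal * (1.0 + gain_error)
    p1, p2 = pll_poles(a, z)
    kind = "complex pair" if abs(p1.imag) > 0 else "real"
    print(f"gain error {gain_error:+.0%}: poles {p1:.3e}, {p2:.3e} ({kind})")
```

Under these assumed numbers the poles are real at nominal gain and become a complex pair when the gain drops by half, since the discriminant $a(a - 4z)$ changes sign at $a = 4z$.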
Figure 3-8: Root locus for single-oscillator PLL with gain error

Figure 3-9: Asymmetrical one-dimensional PLL array

This system has multiple poles at the same place where a single-oscillator PLL has single poles.

On the other hand, in a perfectly symmetrical array (call it $S_2$), the input to each oscillator $i$ is the phase of oscillators $i-1$ and $i+1$ (Fig. 3-10). The $A_1$ matrix is the same because the physical arrangement of nodes is identical, but $A_2$ changes:

$$ A_{2,2} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{3.8} $$

Figure 3-10: Symmetrical one-dimensional PLL array

To achieve the same phase margin in $S_2$ as in $S_1$, it is necessary to lower the gain $a_3$. This can be shown with a geometrical argument: in $S_2$, when the phase of oscillator $i$ changes by $\Delta\phi$, the change is measured at two phase detectors, so oscillator $i$ feels twice the feedback that it would have felt in $S_1$, and at the same time, oscillators $i-1$ and $i+1$ both adjust in the opposite direction, giving 4 times the effective gain. Hence, the gain must be decreased by a factor of approximately 4. Mathematically, the largest eigenvalue of $A_{2,1}A_1$ is 1, but the largest eigenvalue of $A_{2,2}A_1$ is 3.5. Poles of the symmetrical system, solved via Eq. 3.6 [footnote 1], are plotted in Fig. 3-11.

Figure 3-11: Root locus for a one-dimensional array of PLLs.

The key difference between $S_1$ and $S_2$ is the systems' response to noise. In both cases, noise at frequencies higher than the unity gain frequency $\omega_0$ is attenuated. For frequencies much lower than $\omega_0$, the response can be calculated via Eq. 3.6. Fig. 3-12 shows a Bode plot of noise at node P in response to a noise source at node N.

Footnote 1: While it is possible to use Eq. 3.6 directly, it is often more convenient to take advantage of the simple form of $h(s)$ and rewrite the zero-input state equations in first-order block form (Eq. 3.9).
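The eigenvalue comparison quoted above for $S_1$ and $S_2$ can be checked numerically. The sketch below assumes the 4-node matrices read as a bidiagonal $A_1$, an identity $A_{2,1}$, and $A_{2,2} = A_1^T$; that reading is an assumption about Eqs. 3.7 and 3.8, and the helper names are hypothetical. For $S_1$ the product is triangular, so its eigenvalues are simply its diagonal; for $S_2$ the product is symmetric and a plain power iteration is enough.

```python
import math

# Assumed phase-detector matrix for the 4-node one-dimensional array.
A1 = [[1, 0, 0, 0],
      [-1, 1, 0, 0],
      [0, -1, 1, 0],
      [0, 0, -1, 1]]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def largest_eigenvalue_symmetric(m, iterations=500):
    """Power iteration followed by a Rayleigh quotient (symmetric m only)."""
    v = [(-1.0) ** i for i in range(len(m))]
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return sum(x * y for x, y in zip(v, matvec(m, v)))


# S1: A2 is the identity, so G1 = A1 is lower triangular.
G1 = A1
lam1 = max(G1[i][i] for i in range(4))

# S2: A2 = A1 transposed, so G2 = A1^T A1 is symmetric.
G2 = matmul(transpose(A1), A1)
lam2 = largest_eigenvalue_symmetric(G2)

print("largest eigenvalue, S1:", lam1)
print("largest eigenvalue, S2:", round(lam2, 3))
```

Under these assumptions the symmetric case gives a largest eigenvalue of about 3.53, in line with the quoted 3.5 and with the rough factor-of-4 gain reduction argued geometrically.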
Noise performance of $S_1$ is much worse for intermediate frequencies because there is no feedback, so errors propagate forever. In $S_2$, the feedback limits the influence of preceding stages, and this in turn attenuates noise. For this reason, networks with feedback are preferred, despite the more complicated stability calculation.

Figure 3-12: Comparison of noise responses for symmetrical and asymmetrical networks (noise versus frequency)

Footnote 1, continued: the zero-input state equations referred to as Eq. 3.9 form a first-order block system in the phase vector and its derivatives; the block entries recoverable from the source are $-Gz$, $-G$, $-pI$, $I$, and $0$.

Two dimensional array

A two dimensional array is analyzed in exactly the same way as a one-dimensional array, except that the gain has to decrease by another factor of two because the center oscillators see four neighbors rather than two. A 16-element array in a 4 x 4 grid is implemented in this thesis. Its $G$ matrix and poles are shown below. Written in 4 x 4 blocks, with $I$ the 4 x 4 identity,

$$ G = \begin{pmatrix} E' & I & 0 & 0 \\ I & M & I & 0 \\ 0 & I & M & I \\ 0 & 0 & I & E \end{pmatrix}, \quad E = \begin{pmatrix} -2 & 1 & 0 & 0 \\ 1 & -3 & 1 & 0 \\ 0 & 1 & -3 & 1 \\ 0 & 0 & 1 & -2 \end{pmatrix}, \quad M = \begin{pmatrix} -3 & 1 & 0 & 0 \\ 1 & -4 & 1 & 0 \\ 0 & 1 & -4 & 1 \\ 0 & 0 & 1 & -3 \end{pmatrix} \tag{3.10} $$

where $E'$ is identical to $E$ except for its leading entry $g_{11}$, which is not legible in the source. Note that Eq. 3.10 is written with the opposite sign convention from Eq. 3.5.

Figure 3-13: Root locus for a two-dimensional array of PLLs.

3.4 Large Signal: Mode Locking

The analysis of the previous section indicates that fully-connected networks should have a better noise response than asymmetrical networks. However, the feedback allows the possibility of undesirable large-signal modes.

Figure 3-14: Mode-locking example (clock signals at Nodes 1 through 4 versus time)

Consider the network of Fig.
3-5, and its associated matrices:

$$ A_1 = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} -1 & 1 & 1 & 0 & 0 \\ 0 & -1 & 0 & 1 & 0 \\ 0 & 0 & -1 & 0 & 1 \\ 0 & 0 & 0 & -1 & -1 \end{pmatrix} \tag{3.11} $$

Because phase is periodic with period $2\pi$, the phase measured at the phase detectors is $\Delta\phi = A_1\phi \bmod 2\pi$. For small $\phi$, $(A_1\phi \bmod 2\pi) = A_1\phi$, so the nonlinearity is irrelevant. However, consider $\phi_m = [0,\ \pi/2,\ -\pi/2,\ \pi]^T$. Because of the nonlinearity,

$$ A_2 (A_1\phi_m \bmod 2\pi) = A_2\,[0,\ -\pi/2,\ \pi/2,\ -\pi/2,\ \pi/2]^T = 0 \tag{3.12} $$

so $\phi_m$ is a stationary point. This is intuitively easy to see, in reference to Fig. 3-14: each oscillator leads one neighbor, and lags behind another neighbor by exactly the same amount. The net phase error is zero, so clearly there is no restoring force to drive the oscillators into phaselock. Furthermore, this equilibrium point is stable, because the nonlinearity does not change for small deviations from $\phi_m$, so dynamics about $\phi_m$ are the same as those about 0. The locking of a distributed oscillator to non-zero relative phases has been called mode-locking [40].

At startup, each oscillator in a distributed PLL starts at a random phase, so there is a nonzero chance of converging to a mode-locked state. Simulations show that for a network like the one shown here, the system ends modelocked from about 1/3 of random initial states. The probability goes up rapidly with the size of the system; a 4 x 4 array ends up modelocked well over 99% of the time.

Pratt and Nguyen proved several useful properties about systems in mode-lock. The lemmas and theorem are repeated here with outlines of proofs, generalized to include arbitrary (rather than Cartesian) networks. Consider a system of oscillators to be a circuit, with oscillators at the nodes, and connections between oscillators to be branches. (This is the same model as was presented in Section 3.3.1.) The phase counterpart to Kirchhoff's Voltage Law is:

Lemma 1 The sum of branch phase differences around any loop must be a multiple of $2\pi$.
The sum is a multiple of $2\pi$ rather than 0 because phase differences here are defined over a range $[-\pi, \pi)$, so at any branch $2\pi$ might be added or subtracted to bring the result into the right range. For example, a phase detector will measure the difference $5\pi/6 - (-5\pi/6)$ as $-\pi/3$, not $5\pi/3$. This is true independent of mode-lock. The second lemma derives from conditions for mode-lock: that is, the nodes are in static equilibrium although the phases are not identical.

Lemma 2 If a set of oscillators is mode-locked, there must be at least one loop in the network for which the sum of phase differences is a nonzero multiple of $2\pi$.

The proof is as follows: in mode-lock, by definition, the nodes are not all at the same phase. Therefore, there must be at least one node which connects to a branch with nonzero phase error. Call that Node 1. Because Node 1 is in equilibrium by definition of mode-lock, it must connect to at least one branch with a positive phase error. That branch connects to some Node 2, and appears as a negative phase error there. Since Node 2 is also in equilibrium, it must have some other branch with an offsetting positive phase error. Because there is a finite number of nodes, the loop will eventually close back on Node 1. By Lemma 1, the sum must be a multiple of $2\pi$. Because by construction all the branches were positively-oriented, the sum must be nonzero [40].

There are a number of ways to avoid mode-lock. The most obvious one is to simply break the feedback: a consequence of Lemma 2 is that if there are no feedback loops, there can be no modelock. This is not an attractive solution because, as shown in the example with a one-dimensional array, full feedback helps average and attenuate noise, so it would be best to avoid modelock without affecting the interconnection of the system or the operation when correctly phase locked.
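The mode-locked stationary point of Eq. 3.12 and the wrapped-phase example under Lemma 1 can both be checked with a few lines of Python. The sketch assumes a particular sign convention for the 5 x 4 phase-detector matrix (one reference detector on node 1 and four detectors between neighbors) and takes the loop-filter matrix as its transpose; both choices, and the helper name `wrap`, are assumptions made here for illustration.

```python
import math


def wrap(x):
    """Map a phase difference into the range [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


# Assumed phase-detector matrix: row 0 is the reference detector on node 1,
# rows 1..4 are the detectors on branches 1-2, 1-3, 2-4, and 3-4.
A1 = [[-1, 0, 0, 0],
      [1, -1, 0, 0],
      [1, 0, -1, 0],
      [0, 1, 0, -1],
      [0, 0, 1, -1]]

phi_m = [0.0, math.pi / 2, -math.pi / 2, math.pi]

# Wrapped branch errors seen by the five phase detectors.
errors = [wrap(sum(a * p for a, p in zip(row, phi_m))) for row in A1]

# Net error summed at each node, using A2 = A1 transposed.
net = [sum(A1[k][node] * errors[k] for k in range(5)) for node in range(4)]

print("branch errors / pi:", [round(e / math.pi, 3) for e in errors])
print("net error per node:", [round(n, 12) for n in net])

# Lemma 1 example: the detector reports -pi/3 rather than 5*pi/3.
print("wrapped 5pi/6 - (-5pi/6):", round(wrap(5 * math.pi / 6 + 5 * math.pi / 6) / math.pi, 4), "pi")
```

Only the last branch needs the wrap (its raw difference is $-3\pi/2$), and that single $2\pi$ correction is what lets the net error vanish at every node even though the phases are unequal.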
One possible solution would be to have a special startup state where there is no feedback between oscillators, and then an operational state with full feedback. The system might be synchronized during the startup, and then would remain phase-locked in the operational state. The biggest drawback of this approach is that the transition from the reset state to the operational state jolts the system, and could push it into mode-lock. Thus, it would be preferable to have a solution that does not require changing network topology even temporarily. Fortunately, there is such a way. If we define a minimal loop as a loop in the graph that cannot be decomposed into other loops, we can combine the results succinctly into:

Theorem 1 For a system in mode-lock, there must be a phase difference $\theta$ between two oscillators such that $\theta \geq 2\pi/n$, where $n$ is the number of nodes in the largest minimal loop in the network.

By Lemma 2, there must be at least one loop ($L$) with a phase difference sum of at least $2\pi$. If it has more than $n$ nodes, it cannot be a minimal loop. Decompose $L$ into $L_1$ and $L_2$. By Lemma 1, the loop sum around both $L_1$ and $L_2$ must be an integral multiple of $2\pi$, so at least one of them must have a loop sum of at least $2\pi$; iterate if necessary to get a loop of $n$ or fewer nodes. Since the sum of the branch phase differences around that loop must be at least $2\pi$, at least one of the branches must have a phase difference of at least $2\pi/n$.

Theorem 1 suggests a way to distinguish between mode-locked states and the desired 0-phase state: in mode-lock, there must be at least some large phase errors across individual branches. If the gain of the phase detector is designed to be negative for a phase difference larger than $\theta$, then all mode-locked states are made unstable without affecting the in-phase equilibrium. Pratt and Nguyen suggest that an XOR phase detector precludes modelock in a rectangular network of oscillators because the response decreases for phase errors larger than $\pi/2$ [40].
This result follows directly from Theorem 1: in a rectangular array, the largest minimal loop has 4 nodes, so $\theta = 2\pi/4 = \pi/2$. Two other phase detectors are described in the next chapter, both with $\theta < \pi/2$, which would be useful in non-rectangular networks, and where more gain near 0 phase is desirable.

Chapter 4
Implementation and Testing Distributed Clocks

Two test chips were made to explore implementation issues: how much power do the oscillators require? How much area is needed for the compensation filters? Can a real loop, with the buffer and wire delays, be stabilized? The first was a 4-oscillator chip in a 0.6 µm double-poly CMOS process with a clock speed up to 350 MHz, and the second was a 16-oscillator chip in a 0.35 µm single-poly CMOS process at clock speeds of 1.2-1.4 GHz. The two chips are described in turn below.

4.1 4 Oscillator Chip

The 4 oscillator chip was done as a proof of concept to show correct phase locking in the simplest system that could possibly be vulnerable to modelock; a plot is shown in Fig. 4-1. It consists of four nodes (each with an oscillator and loop filter) and five phase detectors (one between each pair of neighbors, and one connected to an external input). High-speed probes contact chip pads at the edges of the chip. One probe drives the input, and the other three are connected to outputs of the oscillators. (The probes are too large to connect more than one probe on a single chip side, so all four oscillators could not be measured at the same time.)

Figure 4-1: Micrograph of the 4 oscillator, 350 MHz chip

4.1.1 Oscillator

The primary metric in the design of oscillators for clock generation is jitter, and the majority of that is due to power supply noise [64, 65]. Integrated LC oscillators often have a lower noise floor than other on-chip oscillators, but substrate and supply noise are dominant on a large digital chip.
Ring-type or relaxation oscillators are usually preferred for on-chip clocks because large chips are usually sorted into different categories based on measured achievable clock speed, and LC oscillators are more difficult to tune. For this chip, a differential relaxation oscillator was chosen because Hspice simulations showed that this relaxation oscillator had better power-supply rejection than did ring oscillators.

The relaxation current-controlled oscillator, or "CCO," is shown in Fig. 4-2. Transistors M3, M4, M5, and M6, along with capacitor C, make up a conventional source-coupled multivibrator, with M7 and M8 as active loads and nbias controlling oscillation frequency through $I_{d3,4}$. The drawback is that this circuit has a feedthrough of -6 dB to nodes V+ and V- from VDD, and almost 0 dB to the capacitor from ground via $C_{bs}$ of M3 and M4, so supply noise rejection is poor. In the proposed oscillator, M1 and M2 provide shunt-shunt feedback around M3 and M4 respectively, lowering the output impedance at V+ and V- to $1/g_m$. D1 and D2 limit the amplitude of oscillation to avoid saturation of M3 and M4. Frequency can be adjusted by adding common-mode current into nodes V+ and V-.

Oscillator layout is shown in Fig. 4-3. Layout for both halves of the oscillator is identical, and the halves are immediately adjacent. Good matching between the halves corresponds to a 50% duty cycle. Furthermore, all source/drain regions were shared to minimize layout area and parasitic capacitance.

4.1.2 Phase Detector

As discussed previously, modelock can be avoided in regular arrays by using nonlinear phase detectors whose response decreases monotonically beyond a phase difference of $\pi/2$ [40]. The phase detector Pratt and Nguyen suggest (a flip-flop delay and an XOR gate) is not well-suited for integrated PLLs, however. First, it has relatively low gain,
so mismatch can lead to large input-referred phase offsets. Second, it generates full-swing digital signals at half the clock frequency; this digital noise must be attenuated in the loop filter.

Figure 4-2: Relaxation oscillator schematic

Figure 4-3: Relaxation oscillator layout

The phase detector proposed here, shown in Fig. 4-4, has the right nonlinearity, has higher gain at small $\Delta\phi$, and has much less high-frequency content than an XOR. The noise that is generated is at the clock frequency, and is attenuated an extra 6 dB given the same first-order loop filter. (Only half of the circuit is drawn. The other half is the symmetrical counterpart, with clock1 and clock2 switched.) M1, M2, and M3 comprise an arbiter. The voltage at node A is buffered, sampled, and converted to a current, so that multiple inputs can be summed at each oscillator node. Synchronous sampling of the arbiter output by M6 and M7 demodulates it, removing high frequency content. Timing waveforms are shown in Fig. 4-5. The phase of the sampling instant affects the transfer function, shown in Fig. 4-6.

Node A is the output of the arbiter. When clock1 and clock2 are nearly in phase, as is the case at sample periods 1 and 2, A is sampled while its value is still valid, so the output Y goes from 0 to 1 over the width of the arbitration window. Hence, the phase detector has a high gain near 0 phase difference. As the phase difference increases, the timing of the sampling instant becomes relevant.
A is sampled at a fixed delay from the rising edge of clock1. If clock2 falls before A is sampled, the output Y will also fall, as shown for periods 3 and 4. Therefore, $\theta_c$, the phase angle at which the output transfer function starts to fall, depends on the relative timing of the falling edge of clock2 and the sample delay. If $\theta_s$ is the phase of the sampling instant and $\theta_f$ the phase of the falling edge, $\theta_c = \theta_f - \theta_s$, so the characteristic angle can be adjusted simply by setting the delay through inverters I1 ... I9. With $\theta_s \approx \pi/2$ and a 50% duty cycle (i.e., $\theta_f = \pi$), $\theta_c$ would be $\pi/2$, which is the constraint to avoid modelock. Were a smaller $\theta_c$ needed to accommodate a different network structure, the same circuit could be used with a different $\theta_s$. Adding the output from the unshown half of the circuit gives the other half of the phase response, shown in Fig. 4-7. The full circuit fits in 80 µm x 40 µm.

Figure 4-4: Phase detector schematic

Figure 4-5: Phase detector timing waveforms

Figure 4-6: Sampled phase detector half-circuit transfer function

4.1.3 Loop Filter

One loop filter is associated with each CCO. Conventional loop filters use a charge pump with an RC pole-zero pair, and often put the large capacitor and resistor off chip. To avoid inconveniently large resistors and capacitors, a feed-forward compensation method was used. The loop filter of Fig. 4-8 consists of two differential amplifiers. (Note that because the frequency control to the oscillator consists of two currents, both amplifiers have twin outputs.) M3, M4, M5, and M6 make up amplifier $A_1$, biased by M9, while M1, M2, M7, M8, M11, and M12 make up $A_2$, biased by M10. The differential output currents from the phase comparators at the edges of each tile are summed at nodes $I_{in+}$ and $I_{in-}$ and drive both amplifiers.
$A_1$ is a single stage differential pair, so it has relatively low gain but a bandwidth limited by $g_{m3,4}/C_{gs3,4}$, since nodes $I_{out1}$ and $I_{out2}$ drive a low impedance. $A_2$ has two stages, much like a prototypical op-amp. The first is biased at very low current to give high gain at DC and allow the use of a relatively small compensation capacitor, and the second provides the needed gain and isolates the high impedance pole from the output. In this amplifier, the DC gain was simulated at 31 dB with a 16 kHz pole, a compensating zero at 7.6 MHz, and a high frequency pole well above the PLL target frequency. The use of feed-forward compensation allowed the use of very small capacitors; the loop filter, including the poly-poly capacitor, and the CCO with its output buffers together take up 88 µm x 88 µm.

Figure 4-7: Sampled phase detector full transfer function

Figure 4-8: Loop filter schematic

4.2 16 Oscillator Chip

The 16 oscillator chip was a second generation chip with a number of improvements over the 4 oscillator first generation. First, a larger network provides a more thorough test of modelock-resistance, because modelock is more likely from initial startup than in smaller networks. Second, a newer and faster fabrication process, 0.35 µm, was used, to test the ideas at clock speeds more appropriate for modern microprocessors. Third, key circuits were redesigned: the oscillator is a ring oscillator instead of a relaxation oscillator, and no longer requires two levels of polysilicon; the phase detector now uses a much simpler arbiter-based design that gives phase and frequency feedback as appropriate.

4.2.1 Oscillator

The second chip used an NMOS-loaded differential ring oscillator as a voltage controlled oscillator (VCO) (Fig.
4-10) primarily because only one layer of polysilicon was used, and diodes were disallowed in an effort to make the circuits more amenable to implementation in a standard microprocessor. Transistors M4 - M8 comprise the differential inverter. The differential pair is M5 and M8, the tail current is driven by M6, and M4 and M7 act as the NMOS load. The NMOS loads allow fast oscillation and shield the output signal from VDD noise. Vbias is a low-pass filtered version of VDD generated by subthreshold leakage through PFET M1; supply noise coupling in through $C_{gd}$ of M4 and M7 is bypassed by M2. The oscillation frequency is only dependent on the supply voltage through capacitor nonlinearity and the output conductance of M4 and M7, and feedback of the PLL compensates drift of VDD and Vbias.

Figure 4-9: Micrograph of the 16 oscillator, 1.3 GHz chip

Figure 4-10: Ring oscillator schematic

4.2.2 Phase Detector

Just like the phase detector for the 4-oscillator chip, the second generation phase detector, shown in Fig. 4-11, has a sufficient nonlinearity, higher gain at small input phase difference, and less high-frequency content than an XOR phase detector. Compared to Fig. 4-4, however, it is somewhat simpler in implementation, and has a smaller transistor count. It also has less delay from the clock inputs to the phase detector outputs, which is important because the phase detector time constant helps set the PLL feedback poles. The core (M1 - M6) is an NMOS-loaded arbiter which acts as a nonlinear phase detector. For no input phase difference, the output is balanced. As the phase difference increases from zero, one output will be asserted for the full duration of an input pulse, while the other output will be asserted for only the remainder of the input pulse duration after the first input pulse ends, which is equal to the input phase difference.
Thus the detector has very high gain near zero phase error that drops off to zero as the input phase difference approaches the input pulse width (Fig. 4-12). The pulse generators P1 and P2 enable this arbiter to give frequency error feedback. If one input is at a higher frequency than the other, its output will be asserted for more input pulses than the other. Because the width of the pulses is independent of input frequency, the average output voltage corresponds to frequency. Unlike a typical phase-frequency detector, however, the strength of the error signal falls to zero as frequency difference goes to 0, so there can be no modelock problems, yet large signal frequency- (and hence, phase-) locking is enhanced. Fig. 4-13 shows the large signal correction and small signal behavior of the entire array of PLLs as the already internally-locked array approaches and locks to the reference clock. The detector fits in 30 µm × 30 µm.

Figure 4-11: Phase detector

4.2.3 Loop Filter

This loop filter, Fig. 4-14, is conceptually identical to the previous loop filter, Fig. 4-8, though for biasing reasons, the wide bandwidth amplifier now has p-inputs and a current mirror, and the high gain amplifier loads are cascoded. M1-M5 make up amplifier A1, while M9-M17 make up A2. The differential output currents from the phase detectors at the edges of each tile are summed at nodes In+ and In-, and drive both amplifiers. A1 is a single stage differential pair, so it has relatively low gain but a bandwidth limited by gm/Cgs. A2 has a high gain cascoded stage driving a common source PFET M17. M16 is a large gate capacitor which serves to set the dominant pole of A2 such that the PLL network is stable. M15 is biased at very low current to boost gain and enable a low-frequency dominant pole (as low
as 12 kHz) with a 15 µm × 15 µm gate capacitor. The simple design and feed-forward compensation allow the loop filter to fit in only 15 µm × 45 µm. Each clock node, consisting of an oscillator and a loop filter, takes just 45 µm × 45 µm.

Figure 4-12: Simulated phase transfer curve (horizontal axis: time difference in nanoseconds)

Figure 4-13: Locking behavior of the PLL array (horizontal axis: simulation time in microseconds; the plot marks the reference clock, the large signal regime, and the small signal regime)

Figure 4-14: Loop filter schematic

Chapter 5

On-Chip Measurement of Clock Performance

While increasing resources are devoted to implementing low skew and low jitter clocks in modern microprocessors, there are few ways to measure jitter. Skew can be measured by such off-chip methods as e-beam [66] and photonic emission [67, 68], but because both average thousands of edges, neither method is suitable for resolving cycle-by-cycle clock jitter. A method to measure clock jitter was developed in this thesis. A proof-of-concept test chip showed that excellent measurement performance is possible, and this chapter describes the theory and results from that chip.

5.1 Introduction and Motivation

On-chip measurement necessarily requires tricks. Acceptable clock skew is generally around 10% of a clock cycle and a microprocessor clock period is typically 8-12 gate delays. Hence, the measurement necessarily requires timing resolution smaller than a single gate delay.

Figure 5-1: Time to voltage converter operation

Time-to-voltage converters work by integrating a current onto a capacitor, as in Fig. 5-1 [69, 70, 71].
The capacitor starts with 0 voltage; at the beginning of the interval to be measured, switch S1 closes, and the capacitor charges for the duration of the interval. Then S1 opens, the voltage is amplified, converted to a digital value and output, and then S2 closes to reset the capacitor. Such converters may have high dynamic range but do not have enough resolution for clock jitter measurement, essentially because the time of interest is comparable to the time it takes to open and close switch S1.

Figure 5-2: Phase vernier

Another approach is to sample the signal of interest into registers which are clocked by closely-spaced sampling phases, as shown in Fig. 5-2. The interpolator takes in several uniformly-placed phases and generates a larger number of phases with closer spacing. The newly generated phases are used to clock a string of registers, marked R[i] in the figure. The timing of a transition on SigIn can be deduced to within the spacing of the sampling phases. Effectively, the registers compare the transition instant of the input signal SigIn to a set of fixed times, just as a flash analog-to-digital converter (ADC) compares an input voltage to a set of voltage thresholds. Because of the similarity, it is useful to think of this architecture as a flash time-to-digital converter, or TDC. Because the comparison thresholds are clock phases, this will be called a sampling phase time-to-digital converter, or SPTDC. Either a delay-locked loop with phase interpolation (as shown) or an array oscillator can be used to generate sampling phases with time differences smaller than a single gate delay [72, 73, 74, 75]. However, mismatches between the oscillators in the array or delays in a DLL can be significant, giving as much as a gate delay offset before calibration [72].
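The flash comparison performed by the sampling registers can be illustrated with a minimal Python sketch. It assumes ideal, noise-free registers and perfectly placed sampling phases; the function names and the 25 ps phase spacing are hypothetical and chosen only for illustration, not taken from the implemented circuit.

```python
def sample_registers(t_edge, phases):
    """Model R[i]: 1 if the SigIn edge has already arrived when phase i samples."""
    return [1 if phase >= t_edge else 0 for phase in phases]


def decode_thermometer(bits, phases):
    """Return (lower, upper) bounds on the edge time from a thermometer code.

    Assumes `phases` is sorted in ascending order and `bits` has no bubbles.
    """
    first_one = len(bits) - sum(bits)  # number of leading zeros in the code
    lower = phases[first_one - 1] if first_one > 0 else float("-inf")
    upper = phases[first_one] if first_one < len(phases) else float("inf")
    return lower, upper


# Hypothetical sampling phases, 25 ps apart (illustrative values only).
phases_ps = [0, 25, 50, 75, 100]
code = sample_registers(60, phases_ps)
bounds = decode_thermometer(code, phases_ps)
print(code, bounds)
```

For an edge at 60 ps the registers clocked at 75 ps and 100 ps capture a 1, so the edge is located only to within the 25 ps interval between the 50 ps and 75 ps phases. This is the sense in which the resolution of such a converter is set by the phase spacing.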
The approach presented here is also a flash TDC, but rather than creating the time vernier by generating closely-spaced clocks, the vernier arises from input-referred offset on the samplers. Hence, the proposed converter will be called a sampling offset time-to-digital converter, or SOTDC. The advantage is that instead of needing to generate precise clocks, it is necessary only to create some sampling elements and measure their relative positions. As will be demonstrated, measurement can be much more precise than any calibration is likely to be. The SOTDC was developed to measure jitter between clock domains, but it works to measure the timing of any signal relative to a reference.

5.2 Time-to-Digital Converter Fundamentals

Calibration and operation of the SOTDC depend critically on the operation of the sampling elements. (In Fig. 5-2, the sampling elements were registers, but they were acting as arbiters.) An arbiter is a circuit that determines which of two inputs arrived first. Because only the time difference between rising edges of the two inputs affects the output, it is conventional to think of the arbiter as having a single input, where that input is a time interval t between two incoming edges, as shown in Fig. 5-3(a). Given enough time, the output of an arbiter settles to either a logic '1' or '0', indicating whether the first or second input arrived first. Unfortunately, device mismatch gives arbiters an effective time offset, $t_{os}$. Also, because of thermal noise, the output, y, is not deterministic: $y(t) = 1$ if and only if $t > t_{os} + t_n$, where $t_n$ is white noise with standard deviation $\sigma$ [76, 77]. Therefore, the probability that the output y is a '1' is given by the Gaussian cumulative distribution function

$$P(y = 1) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{t - t_{os}}{\sqrt{2}\,\sigma}\right)\right] \tag{5.1}$$

which is plotted in Fig. 5-3(b). The strong sensitivity of y to t near $t = t_{os}$ makes the arbiter useful for precise time measurement.

Figure 5-3: Arbiter definitions. (a) Arbiter input definition. (b) Probability that arbiter output is a 1.

Fig. 5-4 shows the simplified theory of operation of a flash TDC (cf. a flash ADC). In any flash converter, the input is compared to a set of thresholds; call the thresholds x. In a TDC, x is the set of offset times to which the input time t is compared. In a SPTDC, each threshold $x_i$ is composed of a vernier delay D and an arbiter offset $t_{os}$. Variation of $t_{os}$ is significant: the standard deviation of $t_{os}$, $\sigma_t$, is about 18 ps in 0.35 µm CMOS. Fig. 5-5(a) shows a plot of ideal x for an 8-level converter; Fig. 5-5(b) shows the actual positions of the x with normally distributed $t_{os}$. Because $\sigma_t$ is large, errors in the x are significant. However, the random spread of $t_{os}$ suggests another approach to generating the x: eliminate the vernier delay entirely, and let $x_i = t_{os,i}$. Fig. 5-5(c) shows typical x for such a converter.

Figure 5-4: TDC structure. "D" marks delay elements, and "A" the arbiters.

5.3 SOTDC Yield

The random placement of $x_i$ in an SOTDC means that measurement precision varies from chip to chip. Finding a formula for the expected yield given a desired precision over a fixed range is surprisingly difficult. The problem is quite amenable to Monte Carlo simulation, however. A simulated plot of expected yield vs. precision is shown in Fig. 5-6.

5.4 Calibration of a SOTDC

Of course, a vernier-less, or sampling offset, TDC is useless if it cannot be calibrated: the outputs of the arbiters give information about the input signal in terms of the $x_i$; if the $x_i$ are unknown, the arbiter outputs are useless. Fortunately, it is possible to find x empirically. A TDC could be calibrated directly by connecting two signals with precisely-known t and measuring resulting outputs for t over the range of interest. Fitting the probabilities of an output '1' vs. t for each arbiter via Eq.
5.1 gives the effective x. Unfortunately, input jitter adds linearly to the apparent measurement noise in this case.

In cases where it is impossible or inconvenient to input known signals, it is also possible to calibrate a flash TDC indirectly with uncorrelated signals. For uniformly distributed t, the probability that t is measured between two sampling thresholds, $P(x_i + t_n > t > x_j + t_n) \triangleq P_{ij}(01)$, is proportional to $x_i - x_j \triangleq \Delta_{ij}$ for a single event, as long as the difference is much larger than sampling noise, $\Delta_{ij} \gg t_n$. For example, if the two input signals are constant-frequency square waves, measurements with bit i low and bit j high will occur with a frequency of $\Delta_{ij} f_1 f_2$, where $f_1$ and $f_2$ are the frequencies of the two input signals. While x can be fully deduced from such measurements, the resolution is poor for $\Delta_{ij} \sim t_n$.

Figure 5-5: x(i) vs. i. (a) Ideal, $x_i \propto i$. (b) $x_i \propto i + t_{os}$, 18 ps std. dev. (c) $x_i = t_{os}$, 18 ps std. dev.

Figure 5-6: Expected yield of an SOTDC, for a fixed precision over a range of one standard deviation (horizontal axis: precision in ps).

A second indirect calibration method resolves small $\Delta_{ij}$ in terms of $\sigma$. When $\Delta_{ij}$ is comparable to $t_n$ there will sometimes be a "bubble" in the output codeword; that is, it will appear that $x_j + t_n > t > x_i + t_n$ even though $x_i > x_j$. The ratio $r = P_{ij}(10)/P_{ij}(01)$ should depend only on the ratio $\Delta_{ij}/\sigma$, and in fact, it does. Consider two arbiters with $t_1 = x_1 + t_{n1}$ and $t_2 = x_2 + t_{n2}$. $t_1$ and $t_2$ are the instantaneous switching thresholds of the arbiters, so

$$P(y_1 = 1) = P(t > t_1) \tag{5.2}$$

$$P(y_2 = 0) = P(t < t_2) \tag{5.3}$$

$$P(y_1 = 1, y_2 = 0) \triangleq P_{12}(10) = P(t_1 < t < t_2) \tag{5.4}$$

$$P_{12}(10) = P(t_1 < t_2) \cdot P(t_1 < t < t_2 \mid t_1 < t_2) \tag{5.5}$$

Let $x = t_2 - t_1$. Then x is Gaussian with mean $x_2 - x_1 = \Delta t$ and standard deviation $\sqrt{2}\,\sigma$.
For uniformly distributed t, $P(t_1 < t < t_2 \mid t_1 < t_2) \propto t_2 - t_1$. Substituting into Eq. 5.5,

$$P_{12}(10) \propto P(x > 0) \cdot \mathrm{E}[x \mid x > 0] \tag{5.6}$$

$$\propto \int_0^\infty x \, \frac{1}{\sqrt{4\pi\sigma^2}} \, e^{-(x - \Delta t)^2/(4\sigma^2)} \, dx \tag{5.7}$$

$$\propto \frac{\sigma}{\sqrt{\pi}} \, e^{-\Delta t^2/(4\sigma^2)} + \frac{\Delta t}{2}\left[1 + \operatorname{erf}\left(\frac{\Delta t}{2\sigma}\right)\right] \tag{5.8}$$

By symmetry, $P_{12}(01)\big|_{\Delta t} = P_{12}(10)\big|_{-\Delta t}$. Defining $\delta = \Delta t/(2\sigma)$ and $\operatorname{erfcx}(x) = e^{x^2} \frac{2}{\sqrt{\pi}} \int_x^\infty e^{-t^2}\,dt$ gives

$$r(\delta) = \frac{P_{12}(10)}{P_{12}(01)} = \frac{1 + \sqrt{\pi}\,\delta\,\operatorname{erfcx}(-\delta)}{1 - \sqrt{\pi}\,\delta\,\operatorname{erfcx}(\delta)} \tag{5.9}$$

In this way an array of arbiters can be calibrated to much higher precision than their manufacturing tolerances without the use of precise input clocks. Thus, by measuring r and inverting Eq. 5.9, one can find relative spacings of x in terms of $\sigma$. Combined with either of the previous two calibration methods, this measurement thus gives $\sigma$ and precise measurements of x. Note that both indirect methods are completely insensitive to input jitter.

5.5 Circuit and Results

The SOTDC circuit consists of a set of nominally identical arbiters and output circuitry to transfer the bits off-chip. The implemented symmetric CMOS arbiter is shown in Fig. 5-7. The outputs are precharged when In1 and In2 are low (for clock systems where jitter is meaningful, there will be substantial overlap between the low phases of the inputs). The first edge that arrives pulls down the corresponding output, and the positive feedback guarantees that eventually a valid logic value can be latched from the output. For the test chip, 64 such arbiters were connected in parallel to two test inputs, and their outputs individually recorded.

Figure 5-7: Symmetric CMOS arbiter

Fig. 5-8 shows x for one test chip measured directly. As expected, process variations distribute the x over a range of approximately 50 picoseconds. A plot of x calculated by numerically inverting Eq. 5.9 for measured data vs. x measured directly is shown in Fig. 5-9. The fit is perfect to within the tolerances of the measurement equipment; clearly, calibration by random signals is viable.
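The numerical inversion of Eq. 5.9 can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for Fig. 5-9: the function names, the bisection bracket of [-4, 4], and the Monte Carlo parameters are all assumptions. The bubble ratio is taken as $r(\delta) = [1 + \sqrt{\pi}\,\delta\,\operatorname{erfcx}(-\delta)] / [1 - \sqrt{\pi}\,\delta\,\operatorname{erfcx}(\delta)]$ with $\delta = \Delta t/(2\sigma)$, and a simple simulation of two noisy arbiters driven by uniformly distributed input times is included as a consistency check.

```python
import math
import random

SQRT_PI = math.sqrt(math.pi)


def erfcx(x):
    """Scaled complementary error function exp(x^2) * erfc(x); adequate for |x| < ~20."""
    return math.exp(x * x) * math.erfc(x)


def bubble_ratio(delta):
    """r(delta) = P12(10) / P12(01) from Eq. 5.9, with delta = dt / (2 * sigma)."""
    num = 1.0 + SQRT_PI * delta * erfcx(-delta)
    den = 1.0 - SQRT_PI * delta * erfcx(delta)
    return num / den


def invert_ratio(r, lo=-4.0, hi=4.0, iters=80):
    """Solve bubble_ratio(delta) = r by bisection (r is increasing in delta).

    The bracket [lo, hi] is an assumption; r must lie between the ratios at its ends.
    """
    target = math.log(r)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if math.log(bubble_ratio(mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_ratio(dt, sigma, n_events, seed=0, span=10.0):
    """Monte Carlo estimate of P12(10) / P12(01) for two arbiters spaced dt apart."""
    rng = random.Random(seed)
    n10 = n01 = 0
    for _ in range(n_events):
        t = rng.uniform(-span * sigma, span * sigma)  # uncorrelated input time
        t1 = -0.5 * dt + rng.gauss(0.0, sigma)        # x1 + t_n1
        t2 = +0.5 * dt + rng.gauss(0.0, sigma)        # x2 + t_n2
        if t1 < t < t2:
            n10 += 1                                  # y1 = 1, y2 = 0
        elif t2 < t < t1:
            n01 += 1                                  # y1 = 0, y2 = 1
    return n10 / n01
```

In use, a measured count ratio r for a pair of arbiters is passed to `invert_ratio`, and the threshold spacing follows as $\Delta t = 2\sigma\delta$. Because the event counts enter only through their ratio, the absolute rate of input events drops out of the estimate.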
Best fit $\sigma$ is 0.35 picoseconds, which corresponds to an arbiter aperture of ~1 ps, consistent with a previously reported simulated value of 10 ps in a 3 µm CMOS process. Nonuniform spacing of the arbiter thresholds limits resolution of this TDC to 2 ps over the range [-15 ps, 15 ps].

Figure 5-8: Measured $x_i$, with expected curve for 18 ps standard deviation of $t_{os}$ (horizontal axis: threshold x(i) in picoseconds).

Figure 5-9: Measured $x_i$ vs. $x_i$ derived via Eq. 5.9, for $\sigma$ = 0.35 ps (horizontal axis: directly-measured x(i)).

The goal of this part of the thesis was to measure jitter in the 16 oscillator chip described in Chapter 4. A set of arbiters was connected between the clocks of neighboring tiles, and a 128-word DRAM recorded arbiter results. Unfortunately, the DRAM timing was marginal on that test chip, so direct measurements were unavailable.

Figure 5-10: Measurement chip micrograph

Chapter 6

Conclusions

6.1 Summary and Contributions

A great deal of work has been done previously on clocks in integrated circuits. As the ratio of clock period to wire-delay across a chip decreases, more and more attention is being devoted to clocking. An attempt was made in this thesis to look forward, to predict the clocks necessary in the near future to continue the trend of faster devices and faster clocks.

One contribution of this thesis has been the analysis of clock networks in terms of performance given parameter variations and noise. Although much of the focus has been on the contrast between different clock networks, the conclusion is that the different architectures do not replace but rather complement each other. Over a single tile where signal propagation delay is small compared to the clock period and all points must be synchronized, tree distribution is effective. For relatively long distances on a chip, clock regeneration becomes useful to filter out high frequency noise on the distribution wire.
A multiple-oscillator peer network also avoids the problem of having different paths to nearest neighbors that plagues trees. Gridded distribution, or more generally shorting together spatially separated buffers, greatly reduces skew and jitter between tiles as long as the initial offsets are small.

Another contribution is the analysis and implementation of a clock network that uses distributed generation. Theory about mode-locking was extended to account for non-orthogonal networks. Inter-oscillator coupling was treated in the context of a single multivariable system which exposes all possible interactions. The phase detector and oscillator were modified from standard versions to satisfy the requirements needed for a distributed clock. Although the details will likely be changed (shorting together the tiles and finding another way to measure phase differences between clocks is an obvious improvement), the main strength of this architecture is that the clock traverses the same path, peer-to-peer, as does the data. Because the clock can be measured and corrected over multiple cycles, however, it appears that clock skew can always be corrected to a fraction of the uncertainty in data delay. In other words, it should always be possible to distribute a clock using the same technology as is used for long-distance interconnect.

Verification of clock design will likely become more important as a way to confirm predictions about clock performance. The proposed and tested sampling offset time-to-digital converter appears to be well-suited to this task, with resolution of a small fraction of a single gate delay. Because of its extreme hardware simplicity and generality, the SOTDC may find its way onto many chips as a simple debugging tool.

6.2 Future Work

This thesis was dominated by analysis and implementation of the distributed clock network, and of how that network compares with conventional clock networks.
This leaves a two-fold opening for future work: more accurate testing and comparison to conventional clock networks, and the development of clock architectures that are as yet impractical.

6.2.1 Testing and measurement

The focus of the design and testing of the multiple-oscillator array was on initial locking and stability. Testing received substantially less attention. Another version of that chip with a more robust DRAM (so that precise timing data could be obtained), and controllable, on-chip noise generators (i.e., large transistors between power and ground) would help calibrate the noise models.

On a similar topic, distributed PLLs make low-speed functional testing difficult. For distributed clock generation to move to production, stability of the network at low speeds should be addressed. It is trivial to add a controllable divider for each node oscillator; however, the extra delay will certainly make the network unstable unless other changes are made.

6.2.2 Unconventional Clocks

Grids and clock trees have found widespread use in industry already. A number of other clocking strategies have been proposed that may either find use in niche applications, or perhaps someday take over as the dominant clock method if technology evolves to make them more attractive.

Salphasic

Salphasic clocking is conceptually related to equipotential clocking. If the wires are lossless but the transmission line delay is causing clock skew, it is possible to set up standing waves in the clock network. Because these standing waves are perfectly synchronous with the signal at the driver, a clock can be distributed over long distances with no skew. Of course, this depends on having lossless transmission lines for clock distribution; this constraint can be approximated closely in systems on the scale of several meters with clocks in the tens of megahertz [36]. On chip, however, resistance in the wires has made salphasic clocking untenable.
Resonant Clocks

Resonant clocking is a similar approach, intended for a different purpose. A standing wave is set up in a transmission line with a period equal to the desired period of a clock. With care, a transmission line can be tuned to resonate a fundamental and several odd harmonics in phase, despite the capacitive load and small resistive losses in the wire, so that a true square wave appears at the load [37]. A resonant clock in a low-loss transmission line dissipates a fraction of the $CV^2f$ power that traditional clock networks do. The technique is relatively new, and has not been proven to be practical at high speeds.

Optical Clocking

Because the propagation speed of optical signals is easily controlled, optical clocks have been suggested as a way to equalize path delay and thus minimize clock skew [38, 39]. Optical signals, transmitted either in a tree, as in the first citation, or in free space, as in the second, also have the advantage that they do not interfere with each other, and are immune to electrical or magnetic coupling. Unfortunately, the conversion from optical signals to electrical is a significant stumbling block. First, detectors for optical signals are not silicon, and hence require a substantial fabrication process change. Second, the conversion is often relatively slow and error prone because the detected currents are small. No optical clock has been demonstrated for VLSI, although optical clocks may become practical in the future.

Bibliography

[1] Neil H. E. Weste and Kamran Eshraghian. Principles of CMOS VLSI design. Addison Wesley, 2nd edition, 1990.
[2] Daniel W. Bailey and Bradley J. Benschneider. Clocking design and analysis for a 600 MHz Alpha microprocessor. Journal of Solid State Circuits, 33(11):1627-1633, November 1998.
[3] Stephen H. Unger and Chung-Jen Tan. Clocking schemes for high-speed digital systems. IEEE Transactions on Computers, C-35(10):880-895, October 1986.
[4] Arthur F. Champernowne et al.
Latch-to-latch timing rules. IEEE Transactions on Computers, 39(6):798-808, June 1990.
[5] E. G. Friedman. The applications of localized clock distribution design to improving the performance of retimed sequential circuits. In Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems, pages 12-17, December 1992.
[6] Karem A. Sakallah et al. Synchronization of pipelines. IEEE Transactions on Computer-Aided Design, 12(8):1132-1146, August 1993.
[7] José Luis Neves and Eby G. Friedman. Topological design of clock distribution networks based on non-zero clock skew specifications. In Proceedings of the 36th Midwest Symposium on Circuits and Systems, pages 468-471, August 1993.
[8] Narendra V. Shenoy, Robert K. Brayton, and Alberto L. Sangiovanni-Vincentelli. Resynthesis of multi-phase pipelines. In Proceedings of the ACM/IEEE Design Automation Conference, pages 490-496, June 1993.
[9] C. Thomas Gray et al. Timing constraints for wave-pipelined systems. IEEE Transactions on Computer-Aided Design, 13(8):987-1004, August 1994.
[10] Michel R. Dagenais and Nicholas C. Rumin. On the calculation of optimal clocking parameters in synchronous circuits with level sensitive latches. IEEE Transactions on Computer-Aided Design, 8(3):268-278, March 1989.
[11] Karem A. Sakallah, Trevor N. Mudge, and Oyekunle A. Olukotun. Analysis and design of latch-controlled synchronous digital circuits. IEEE Transactions on Computer-Aided Design, 11(3):322-333, March 1992.
[12] Tolga Soyata and Eby G. Friedman. Retiming with non-zero clock skew, variable register, and interconnect delay. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 234-241, November 1994.
[13] François Anceau. A synchronous approach for clocking VLSI systems. Journal of Solid State Circuits, SC-17(1):51-56, February 1982.
[14] H. B. Bakoglu, J. T. Walker, and J. D. Meindl.
A symmetric clock-distribution tree and optimized high-speed interconnections for reduced clock skew in ULSI and WSI circuits. In VLSI in Computers and Processors, pages 118-122, Rye Brook, NY, October 1986. IEEE International Conference on Computer Design.
[15] Allan L. Fisher and H. T. Kung. Synchronizing large VLSI processor arrays. IEEE Transactions on Computers, C-34(8):734-740, August 1985.
[16] Ahmed El-Amawy. Clocking arbitrarily large computing structures under constant skew bound. IEEE Transactions on Parallel and Distributed Systems, 4(3):241-255, 1993.
[17] Daniel W. Dobberpuhl et al. A 200-MHz 64-b dual-issue CMOS microprocessor. Journal of Solid State Circuits, 27(11):1555-1567, November 1992.
[18] Bradley J. Benschneider et al. A 300-MHz 64-b quad-issue CMOS RISC microprocessor. Journal of Solid State Circuits, 30(11):1203-1214, November 1992.
[19] Paul E. Gronowski et al. A 433-MHz 64-b quad-issue RISC microprocessor. Journal of Solid State Circuits, 31(11):1687-1696, November 1996.
[20] Donald F. Wann and Mark A. Franklin. Asynchronous and clocked control structures for VLSI based interconnection networks. IEEE Transactions on Computers, C-32(3):284-293, March 1983.
[21] S. Y. Kung and R. J. Gal-Ezer. Synchronous versus asynchronous computation in very large scale integrated (VLSI) array processors. Proceedings of SPIE, 341:53-65, May 1982.
[22] Sanjay Dhar, Mark A. Franklin, and Donald F. Wann. Reduction of clock delays in VLSI structures. In IEEE International Conference on Computer Design, pages 778-783, October 1984.
[23] Mehdi Hatamian and Glenn L. Cash. Parallel bit-level pipelined VLSI designs for high-speed signal processing. Proceedings of the IEEE, 75(9):1192-1202, September 1987.
[24] Eby G. Friedman and Scott Powell. Design and analysis of hierarchical clock distribution system for synchronous standard cell/macrocell VLSI. Journal of Solid State Circuits, SC-21(2):240-246, April 1986.
[25] Michael A. B.
Jackson, Arvind Srinivasan, and E. S. Kuh. Clock routing for high-performance ICs. In 27th Proceedings of the ACM/IEEE Design Automation Conference, pages 573-579, June 1990.
[26] Fumihiro Minami and Midori Takano. Clock tree synthesis based on RC delay balancing. In Proceedings of the IEEE Custom Integrated Circuits Conference, pages 28.3.1-28.3.4, May 1992.
[27] Ting-Hai Chao, Yu-Chin Hsu, Jan-Ming Ho, Kenneth D. Boese, and Andrew B. Kahng. Zero skew clock routing with minimum wirelength. IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, 39(11):799-814, November 1992.
[28] Jason Cong, Andrew B. Kahng, and Gabriel Robins. Matching-based methods for high-performance clock routing. IEEE Transactions on Computer-Aided Design, 12(8):1157-1169, August 1993.
[29] Ren-Song Tsay. An exact zero-skew clock routing algorithm. IEEE Transactions on Computer-Aided Design, 12(2):242-249, February 1993.
[30] Andrew B. Kahng and C.-W. Albert Tsao. Practical bounded-skew clock routing. Journal of VLSI Signal Processing, 16(2/3):87-103, June/July 1997.
[31] Shantanu Ganguly, Daksh Lehther, and Satyamurthy Pullela. Clock distribution methodology for the PowerPC microprocessors. Journal of VLSI Signal Processing, 16(2/3):181-189, June/July 1997.
[32] Earl T. Cohen et al. A 533MHz BiCMOS superscalar microprocessor. In ISSCC Digest of Technical Papers, pages 164-165, February 1997.
[33] Charles F. Webb et al. A 400MHz S/390 microprocessor. In ISSCC Digest of Technical Papers, pages 168-169, February 1997.
[34] Toyohiko Yoshida et al. A 2V 250MHz multimedia processor. In ISSCC Digest of Technical Papers, pages 266-267, February 1997.
[35] G. Geannopoulos and X. Dai. An adaptive digital deskewing circuit for clock distribution networks. In ISSCC Digest of Technical Papers, pages 400-401, February 1998.
[36] Vernon L. Chi. Salphasic distribution of clock signals for synchronous systems. IEEE Transactions on Computers, 43(5):597-602, May 1994.
[37] M.
E. Becker and T. F. Knight, Jr. Transmission line clock driver. In IEEE International Conference on Computer Design, pages 489-490, October 1999.
[38] C.-S. Li, F. Tong, K. Liu, and D. G. Messerschmitt. Fanout analysis of multistage optical clock distribution using optical amplifiers. In Globecom, pages 434-438, 1991.
[39] Helmut Zarschizky, Christian Gerndt, Martin Honsberg, and Ekkehard Klement. Optical clock distribution with a compact free space interconnect system. In IEEE Lasers and Electro-Optics Society Annual Meeting, pages 590-591, 1992.
[40] Gill A. Pratt and John Nguyen. Distributed synchronous clocking. IEEE Transactions on Parallel and Distributed Systems, February 1995.
[41] David G. Messerschmitt. Synchronization in digital system design. IEEE Journal on Selected Areas in Communications, 8(8):1404-1419, October 1990.
[42] Morteza Afghahi and Christer Svensson. Performance of synchronous and asynchronous schemes for VLSI systems. IEEE Transactions on Computers, 41(7):858-872, July 1992.
[43] D. Boning and S. Nassif. Models of Process Variations in Device and Interconnect, chapter 6. IEEE Press, 2000.
[44] Brian E. Stine et al. Simulating the impact of poly-CD wafer-level and die-level variation on circuit performance. In Second International Workshop on Statistical Metrology, June 1997.
[45] M. Eisele, J. Berthold, R. Thewes, E. Wohlrab, D. Schmitt-Landsiedel, and W. Weber. Intra-die device parameter variations and their impact on digital CMOS gates at low supply voltages. In Technical Digest of IEDM, pages 67-70, 1995.
[46] Duane S. Boning and James E. Chung. Statistical metrology - measurement and modelling of variation for advanced process development and design rule generation. In Proceedings of the International Conference on Characterization and Metrology for ULSI Technology, March 1998.
[47] Tomohisa Mizuno, Jun-ichi Okamura, and Akira Toriumi.
Experimental study of threshold voltage fluctuation due to statistical variation of channel dopant number in MOSFET's. IEEE Transactions on Electron Devices, 41(11):2216-2221, November 1994.
[48] Martin Eisele, Jörg Berthold, Doris Schmitt-Landsiedel, and Reinhard Mahnkopf. The impact of intra-die device parameter variations on path delays and on the design for yield of low voltage digital circuits. IEEE Transactions on VLSI, 5(4):360-368, December 1997.
[49] Xinghai Tang, Vivek K. De, and James D. Meindl. Intrinsic MOSFET parameter fluctuations due to random dopant placement. IEEE Transactions on VLSI, 5(4):369-376, December 1997.
[50] D. C. Keezer and V. K. Jain. Design and evaluation of wafer scale clock distribution. In Proceedings of the IEEE International Conference on Wafer Scale Integration, pages 168-175, January 1992.
[51] José Luis Neves and Eby G. Friedman. Circuit synthesis of clock distribution networks based on non-zero clock skew. In Proceedings of the IEEE International Symposium on Circuits and Systems, pages 4.175-4.178, June 1994.
[52] Mohamed Nekili, Guy Bois, and Yvon Savaria. Pipelined H-trees for high-speed clocking of large integrated systems in the presence of process variations. IEEE Transactions on VLSI, 5(2):161-174, June 1997.
[53] Masakazu Shoji. Elimination of process-dependent clock skew in CMOS VLSI. Journal of Solid State Circuits, SC-21(5):875-880, October 1986.
[54] Satyamurthy Pullela, Noel Menezes, and Lawrence T. Pillage. Reliable nonzero skew clock trees using wire width optimization. In 30th Proceedings of the ACM/IEEE Design Automation Conference, pages 165-170, June 1993.
[55] Masato Edahiro. Delay minimization for zero-skew routing. In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 563-566, November 1993.
[56] Steven D. Kugelmass and Kenneth Steiglitz. An upper bound of expected clock skew in synchronous systems. IEEE Transactions on Computers, 39(12):1475-1477, December 1990.
[57] Marios D. Dikaiakos and Kenneth Steiglitz. Comparison of tree and straight-line clocking in long systolic arrays. Journal of VLSI Signal Processing, pages 1177-1180, 1991.
[58] Keith A. Bowman, Xinghai Tang, John C. Eble, and James D. Meindl. Impact of extrinsic and intrinsic parameter variations on CMOS system on a chip performance. In Proceedings of the ASIC/SOC Conference, pages 267-271, September 1999.
[59] Marcel J. M. Pelgrom, Aad C. J. Duinmaijer, and Anton P. G. Welbers. Matching properties of MOS transistors. Journal of Solid State Circuits, 24(5):1433-1440, October 1989.
[60] Shy-Chyi Wong, Kuo-Hua Pan, Dye-Jyun Ma, M. S. Liang, and P. N. Tseng. On matching properties and process factors for submicrometer CMOS. In Proceedings of the 1996 IEEE International Conference on Microelectronic Test Structures, volume 9, pages 43-47, March 1996.
[61] Shih-Wei Sun and Paul G. Y. Tsui. Limitation of CMOS supply-voltage scaling by MOSFET threshold-voltage variation. Journal of Solid State Circuits, 30(8):947-949, August 1995.
[62] M. Nekili, Y. Savaria, and G. Bois. Spatial characterization of process variations via MOS transistor time constants in VLSI and WSI. Journal of Solid State Circuits, 34(1):80-84, January 1999.
[63] Payman Zarkesh-Ha, Tony Mule, and James D. Meindl. Characterization and modeling of clock skew with process variations. In Proceedings of the IEEE 1999 Custom Integrated Circuits Conference, pages 441-444, 1999.
[64] Ian A. Young, Monte F. Mar, and Bharat Bhushan. A 0.35µm CMOS 3-880MHz PLL N/2 clock multiplier and distribution network with low jitter for microprocessors. In ISSCC Digest of Technical Papers, pages 330-331, February 1997.
[65] Raghunand Bhagwan and Alan Rogers. A 1GHz dual-loop microprocessor PLL with instant frequency shifting. In ISSCC Digest of Technical Papers, pages 336-337, February 1997.
[66] P. J. Restle, K. A. Jenkins, A. Deutsch, and P. W. Cook.
Measurement and modeling of on-chip transmission line effects in a 400 MHz microprocessor. Journal of Solid State Circuits, 33(4):662-665, April 1998.

[67] Y. Uraoka, T. Maeda, I. Miyanaga, and K. Tsuji. New failure analysis technique of ULSIs using photon emission method. In Proceedings of the International Conference on Microelectronic Test Structures, volume 5, pages 100-105, March 1992.

[68] Yukiharu Uraoka, Isao Miyanaga, Kazuhiko Tsuji, and Shigenobu Akiyama. Failure analysis of ULSI circuits using photon emission. IEEE Transactions on Semiconductor Manufacturing, 6(4):324-331, November 1993.

[69] Andrew E. Stevens, Richard P. Van Berg, Jan Van Der Spiegel, and Hugh H. Williams. A time-to-voltage converter and analog memory for colliding beam detectors. Journal of Solid State Circuits, 24(6):1748-1752, December 1989.

[70] C. Konstadakellis, S. Siskos, and Th. Laopoulos. A fast, versatile, CMOS time-to-voltage converter. In Proceedings of the 6th Mediterranean Electrotechnical Conference, pages 282-285, 1991.

[71] Elvi Räisänen-Ruotsalainen, Timo Rahkonen, and Juha Kostamovaara. A time digitizer with interpolation based on time-to-voltage conversion. In Proceedings of the 40th Midwest Symposium on Circuits and Systems, pages 197-200, August 1997.

[72] Dan Weinlader, Ron Ho, Chih-Kong Ken Yang, and Mark Horowitz. An eight channel 36Gsample/s CMOS timing analyzer. In ISSCC Digest of Technical Papers, pages 170-171, 2000.

[73] Thomas A. Knotts, David Chu, and Jeremy Sommer. A 500MHz time digitizer IC with 15.625ps resolution. In ISSCC Digest of Technical Papers, pages 58-59, 1994.

[74] Yasuo Arai and Masahiro Ikeno. A time digitizer CMOS gate-array with a 250 ps time resolution. Journal of Solid State Circuits, 31(2):212-219, February 1996.

[75] J. G. Maneatis and M. A. Horowitz. Precise delay generation using coupled oscillators. Journal of Solid State Circuits, 28(12):1273-1282, December 1993.

[76] Lindsay Kleeman.
The jitter model for metastability and its application to redundant synchronizers. IEEE Transactions on Computers, 39(7):930-942, July 1990.

[77] W. A. M. Van Noije, W. T. Liu, and S. J. Navarro, Jr. Precise final state determination in mismatched CMOS latches. Journal of Solid State Circuits, 30(5):607-611, May 1995.

Appendix A

Full Schematics

A.1 4 oscillator chip

Figure A1.1: Top-level (chip core)

Figure A1.2: Node

Figure A1.3: Relaxation oscillator

Figure A1.4: Compensation amplifier and summer

Figure A1.5: Differential to single-ended amplifier

Figure A1.6: Sampled phase comparator

Figure A1.7: Phase comparator core

A.2 16 oscillator chip

Figure A2.1: Top-level (chip core)

Figure A2.2: Individual tile

Figure A2.3: Node

Figure A2.4: Compensation amplifier

Figure A2.5: Ring oscillator

Figure A2.6: Differential inverter for the ring oscillator

Figure A2.7: Clock divider

Figure A2.8: Jitter measurement block

Figure A2.9: Pulse generator

Figure A2.10: DRAM block

Figure A2.11: DRAM write token

Figure A2.12: DRAM bitslice

Figure A2.13: Phase measurement arbiter

Figure A2.14: DRAM data 3-state driver

Figure A2.15: DRAM output data serializer