5.75 to 44Gb/s Quarter Rate CDR with Data Rate Selection in 90nm Bulk CMOS

George von Bueren, Lucio Rodoni, Heinz Jaeckel
Electronics Laboratory, ETH Zürich, CH-8092 Zürich, Switzerland

Alex Huber
Institute of Microelectronics, University of Applied Sciences Northwestern Switzerland, CH-5210 Windisch, Switzerland

Roland Brun, Daniel Holzer
Bern University of Applied Sciences, CH-3400 Burgdorf, Switzerland

Martin Schmatz
IBM Zurich Research Laboratory, CH-8803 Rüschlikon, Switzerland

Abstract: This paper presents a quarter rate clock/data recovery (CDR) circuit for plesiochronous serial I/O-links. This 2x-oversampled phase-tracking CDR, implemented in 90nm bulk CMOS technology, covers the whole range of data rates from 5.75 to 44Gb/s thanks to a data rate selection logic. A bit error rate < 10⁻¹² was verified up to 38Gb/s using a 2⁷−1 PRBS pattern. The CDR is able to track a maximum frequency deviation of ±615ppm between incoming data and reference clock.

Keywords: clock data recovery, quarter rate, CMOS.

I. INTRODUCTION

The aggregate data communication bandwidth of key components in telecommunication equipment and computer servers has increased continuously in the past. This progress has been achieved by increasing the serial data rate and by integrating more links on a single chip. In order to achieve multi-channel integration into a CMOS logic process, these transceivers should be low power and area efficient. One of the most crucial and speed-limiting circuit blocks in these link macrocells is the clock and data recovery (CDR) circuit in the receiver. The first 40Gb/s CMOS CDR was presented in 2003 [1]. This 40Gb/s CDR was realized in 0.18µm CMOS, employs a quarter-rate architecture with a multiphase VCO and passive loop filter, achieves a bit-error rate (BER) of 10⁻⁶ and consumes a current of 144mA from a 2V supply.
In the case of plesiochronous systems, where every participant gets nearly the same frequency, the CDR tracking loop with the area-consuming passive loop filter can be replaced with a digital phase tracking loop [2]. A half-rate 25Gb/s CDR implemented in 90nm CMOS achieving a BER < 10⁻¹² incorporates a digital first-order loop filter, consumes 98mA from a 1.1V supply and occupies an area of only 0.064mm²; it is therefore suited for high-density integration [3]. It has been shown with a quarter rate CDR [4] that area and power consumption can be further reduced thanks to two accomplishments. First, the application of a phase-programmable PLL [5] allows realizing a dual loop CDR [2] without phase rotators. Second, most analog circuits use a static-CMOS design style instead of current mode logic (CML). This 40Gb/s CDR is implemented in 65nm SOI CMOS and its area and power consumption are 0.03mm² and 72mW, respectively. The use of a static-CMOS design style is only possible with regulated supply voltages [5], [6]. Compared to the static-CMOS design style, CML circuits have a better immunity to supply variations and generate less switching noise.

The 40Gb/s CDR presented in this paper employs fully differential CML in all analog high-speed circuits. With a 90nm CMOS technology, CML circuits are mandatory to process a 40Gb/s data stream. Only the digital loop filter consists of CMOS gates. We propose a data rate selection logic that allows covering the whole range of data rates from 5.75 to 44Gb/s. This feature makes the circuit especially suitable for multi-standard applications, enabling new link rates while supporting compatibility with legacy rates.

This work was supported by the Swiss Federal Office for Professional Education and Technology, contract/grant number KTI 7995.1.

II. CDR TOPOLOGY

In high-density serial I/O links, the transmitter (TX) and receiver (RX) are clocked by two independent reference clocks having the same nominal frequency.
These reference clocks are multiplied from a quartz crystal oscillator with a frequency tolerance ranging from ±10 to ±100ppm. In these plesiochronous systems the CDR has to track a slowly drifting phase difference between the incoming data and the RX clock caused by the small frequency offset between the TX and RX clocks. Hence, a phase-tracking loop in the CDR is sufficient.

The architecture of our phase tracking loop is shown in Fig. 1. It is a 2x-oversampled quarter rate CDR with the advantage that only the first latch of the sampling flip-flop must be able to track the data at full speed. Eight parallel samplers acquire the four data bits (D0..3) and four edges (E0..3) needed to evaluate the sampling position [7]. Eight parallel 1:4 demultiplexers reduce the data rate from 10 to 2.5Gb/s and align the sampled bits, which are separated by one eighth of the period of the reference clock signal, to one single clock phase, generating 16 data (d0..15) and 16 edge (e0..15) bits. The transition from differential signaling to full swing CMOS signal levels is performed in the demultiplexers. The phase tracking loop is implemented by a digital delay locked loop. The digital control logic consists of an edge detection logic, a digital loop filter and an up/down counter, which controls the output phases (φi) of the four phase rotators. The reference clock phases (Ψi) are generated in an analog delay locked loop (DLL). The four 10Gb/s data bits D0..3 are buffered and fed to output pins for testing and measurement purposes.

Fig. 1. Architecture of the phase tracking loop.

III. CIRCUIT DESIGN

A. Sampler

The first stage of the master-slave flip-flop is a shunt inductive peaked CML latch. The bandwidth enhancement is necessary since this latch has to track the 40Gb/s input data. With a 0.7nH on-chip inductor a maximal bandwidth enhancement by a factor of 1.8 [8] has been achieved. The area of one multi-layer spiral inductor amounts to 20x20µm². In the second latch of the master-slave flip-flop no inductive peaking is required because this latch operates with a 10Gb/s data stream only.

B. Digital Control Loop with Rate Selection

Fig. 2 illustrates the block diagram of the digital control loop. All circuit blocks are synthesized circuits and are placed and routed with a digital design tool.

Fig. 2. Block diagram of the digital loop filter.

Fig. 3. Principle of the rate selection: quarter-rate (QR), half-rate (HR), and full-rate (FR) mode; • sample points, ◦ discarded samples.

The edge detector solves the Alexander equations [7] and outputs a single early or late signal after majority voting. In order to relax the speed requirement for the digital loop filter, the early and late output signals of the edge detector are demultiplexed by a factor of 2. The loop filter is realized as a finite state machine and accumulates the incoming early and late bits. A phase step (up or down) is induced when the overhang of early or late signals is greater than three.

This edge detection logic can work in three operation modes as depicted in Fig. 3. Quarter rate (QR) operation is used for an input data rate from 23 to 44Gb/s. For each of the 16 data/edge bit pairs, the early/late generation logic generates an early/late signal by solving the Alexander equations [7]. When the data rate is lower and the bit length larger, between 11.5 and 23Gb/s, the CDR operates in half rate (HR) mode.
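The edge detection, majority voting and loop filter described in this subsection can be illustrated with a small behavioral sketch. It is not the synthesized logic of the chip: the function and class names, the pairing of each edge sample with the data bits before and after it, the sign convention (+1 for early, -1 for late), and the reset of the accumulator after a phase step are assumptions made for illustration. The 2:1 demultiplexing of the early/late signals is omitted, so the filter receives one decision per call.

```python
from typing import Sequence

EARLY, LATE, NONE = +1, -1, 0  # assumed sign convention


def early_late(prev_bit: int, edge_bit: int, data_bit: int) -> int:
    """Alexander decision for one data/edge pair.

    Assumes edge_bit was sampled between prev_bit and data_bit.
    Without a data transition no timing information is available.
    """
    if prev_bit == data_bit:
        return NONE
    # Edge sample still equals the old bit: the clock sampled too early.
    return EARLY if edge_bit == prev_bit else LATE


def majority_vote(data: Sequence[int], edges: Sequence[int], prev_bit: int) -> int:
    """Reduce all pair decisions of one word to a single early/late/none."""
    balance = 0
    for edge_bit, data_bit in zip(edges, data):
        balance += early_late(prev_bit, edge_bit, data_bit)
        prev_bit = data_bit
    return (balance > 0) - (balance < 0)


class LoopFilter:
    """Accumulates decisions; steps when the overhang exceeds the threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.balance = 0

    def update(self, decision: int) -> int:
        """Return +1 or -1 when a phase step is issued, otherwise 0."""
        self.balance += decision
        if self.balance > self.threshold:
            self.balance = 0  # assumed: accumulator restarts after a step
            return +1
        if self.balance < -self.threshold:
            self.balance = 0
            return -1
        return 0
```

With a threshold of three, four more early than late decisions are needed before a step is issued, while alternating early and late decisions cancel and never move the sampling phase.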
The edge samples used in the quarter rate mode are omitted and only the data samples are evaluated. In this mode, the even data samples take the role of the edge bits and the odd data samples are still data bits. From the eight data/edge pairs the early/late information is generated. For a still lower input data rate from 5.75 to 11.5Gb/s, the full rate (FR) mode is appropriate. Here, the odd data samples are alternately used as data and edge bits. In this case, the early/late logic generates 4 early/late signals. Hence, our receiver can cover the full range of data rates from 5.75 to 44Gb/s, even though the multi-phase delay locked loop (DLL), which generates the reference clock phases Ψi, is band limited. The DLL operates from 5.75 to 11.5GHz and limits the lower data rate of the CDR.

C. DLL, Phase Rotator and Clock Buffer

In order to update the sample position, we use four parallel phase rotators, which are controlled by a thermometer coded up/down counter. Using a full thermometer code, glitches or discontinuities in the phase rotator characteristics can be avoided. The four differential reference clock phases (Ψi), which are generated by the DLL, are fed to the four phase rotators. One phase rotator, shown in Fig. 4(a), consists of a phase selection stage followed by a phase interpolation stage [2]. The first stage selects two clock phases from two adjacent phase octants. Using eight clock phases provides a better phase linearity compared to using six phases or I/Q interpolation schemes. The phase interpolator that blends the two selected phases is controlled by the 8-bit thermometer coded value W7..0. The schematics of the 4:1 multiplexer and the interpolator are depicted in Fig. 4(b) and Fig. 4(c), respectively. Retiming flip-flops between the up/down counter and the phase rotator guarantee that all control signals Si, W7..0, W7B..0B change at the same time. The common mode outputs of the selector and the interpolator are regulated by a replica bias, as in all CML circuits of this CDR.

Fig. 4. (a) Phase rotator, (b) phase selector, (c) phase interpolator.

An important practical requirement is that the amplitude and common mode voltage of the sampling clock are always valid, even after start-up, to assure the presence of the CDR system clock. This implies that the control signals Si, W7..0, W7B..0B are initialized correctly. As can be seen in Fig. 4(b) and Fig. 4(c), the regulated output common mode voltages of the proposed selector and interpolator circuits are always valid because their output common mode voltages are independent of the digital control signals.

A total number of 64 phase steps for one 100ps reference clock period or 16 steps for one data unit interval (UI) of 25ps are provided, resulting in a nominal timing resolution of 1.56ps. As a consequence, the maximal possible frequency offset between TX and RX clocks that can be tracked correctly amounts to 10⁶/(64·8·3)ppm = 615ppm.

The reference clock signals (Ψi) as well as the sample clock signals (φi) are driven by clock buffers using inductive and capacitive peaking to have enough driving capability and to remove any DC-offset in the differential clock signal. The inductive shunt peaking is used to expand the bandwidth of the buffer. With capacitive peaking, the gain at lower frequencies (<5GHz) is decreased. In addition, the output DC levels are regulated actively to reduce DC-offset and duty cycle distortion of the clock signal.

Fig. 5. Chip photo and layout of the CDR.

IV. MEASUREMENT RESULTS

Our CDR circuit is fabricated in a 90nm bulk CMOS technology and consumes 230mA from a 1V power supply voltage (analog supply 215mA, digital supply 15mA).
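As a quick arithmetic check of the supply figures just quoted, the following sketch converts the measured currents into total power, energy per bit and power efficiency. The 40Gb/s operating point is taken as the nominal rate of the circuit; the variable names are illustrative only and the derived energy per bit is not a number stated in the paper.

```python
SUPPLY_V = 1.0          # supply voltage from the text
I_ANALOG_A = 0.215      # analog supply current
I_DIGITAL_A = 0.015     # digital supply current
DATA_RATE_BPS = 40e9    # assumed nominal operating point of 40Gb/s

# Total power is the supply voltage times the summed supply currents.
total_power_w = SUPPLY_V * (I_ANALOG_A + I_DIGITAL_A)

# Share of the power that is drawn by the analog (CML) part.
analog_share = I_ANALOG_A / (I_ANALOG_A + I_DIGITAL_A)

# Energy spent per received bit, in picojoules.
energy_per_bit_pj = total_power_w / DATA_RATE_BPS * 1e12

# Power efficiency in Gb/s per mW, the unit used in Table I.
efficiency_gbps_per_mw = (DATA_RATE_BPS / 1e9) / (total_power_w * 1e3)

print(f"total power:    {total_power_w * 1e3:.0f} mW")
print(f"analog share:   {analog_share:.1%}")
print(f"energy per bit: {energy_per_bit_pj:.2f} pJ/bit")
print(f"efficiency:     {efficiency_gbps_per_mw:.3f} Gb/s/mW")
```

Under these assumptions the total is 230mW, about 5.75pJ per bit, and the efficiency of roughly 0.174Gb/s/mW agrees with the value listed for this work in Table I. The analog CML part accounts for well over 90% of the power.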
All inputs and outputs are ESD protected except the differential 40Gb/s data inputs. The layout of the core circuit, which occupies 570x350µm² (=0.2mm²), and the die micrograph of the CDR circuit are shown in Fig. 5. The CDR is able to lock to a PRBS data stream at up to 44Gb/s when the input signal is applied to the chip using on-wafer probes. The 40Gb/s input eye diagram with a 10GHz sinusoidal clock signal is illustrated in Fig. 6(a). The recovered data at 10Gb/s is shown in Fig. 6(b). The operating ranges for full-, half- and quarter-rate modes are 5.75 to 11.5Gb/s, 11.5 to 23Gb/s and 23 to 44Gb/s, respectively. In all operating ranges, the maximum frequency offset that can be tracked is ±615ppm for a BER of <10⁻¹² up to 38Gb/s. The limit was set by the measurement setup because the input pattern was not error free above 38Gb/s. The value of ±615ppm is sufficient to countervail inequalities of the clock frequencies of two chips clocked from different crystal oscillators.

Fig. 6. (a) 40Gb/s input data, 10GHz sinusoidal clock signal (time scale: 10ps/div, amplitude scale: 50mV/div). (b) Recovered 10Gb/s data (time scale: 20ps/div, amplitude scale: 50mV/div).

Fig. 7. Eye diagram of a 24Gb/s data stream at the input of the package (left eye diagram) and at the pad of the circuit (right eye diagram).

Besides the frequency offset that can be tracked, the jitter tolerance is the second key parameter for CDRs employed in chip-to-chip communication. The jitter tolerance measurements have been performed in a packaged module (Fig. 7). Fig. 7 also shows the eye diagram of the 24Gb/s input data before (left eye diagram) and after (right eye diagram) a trace of 1.6cm length on the substrate. The jitter tolerance plot at 24Gb/s of the packaged CDR and the extended jitter tolerance mask for XAUI [9] are illustrated in Fig. 8. For all jitter frequencies and all jitter amplitudes, the XAUI mask can be fulfilled by our circuit.

Fig. 8. Jitter tolerance of the packaged CDR at 24Gb/s achieving a BER<10⁻¹².

Table I shows a comparison with previously published 40Gb/s CMOS CDRs with analog [1], [10] or digital loop filters [4], [11]. Fully analog CDRs are area consuming and dissipate less power but have a larger BER (>10⁻¹²) compared to [4], [11]. Among the three CDRs with a digital loop filter our CDR covers the largest range of data rates. Furthermore, it consumes less power and has a smaller chip area than the 3x-oversampling CDR [11]. Only the circuit in [4] reaches superior performance with respect to power and area, but it uses a more advanced transistor technology that allows the speed-critical circuit blocks to be implemented in CMOS logic instead of the more power- and area-consuming CML.

TABLE I. 40Gb/s CMOS CDRs

Ref.   Data rate  Tbit/Tclock  Demux  Loop     Supply  Power  Area    Gb/s/mW  Tb/s/mm²  BER     CMOS
       [Gb/s]                  data   filter   [V]     [mW]   [mm²]
[1]    40         1/4          1:4    passive  2       144    0.64 a  0.28     0.06      10⁻⁶    0.18µm
[4]    27-40      1/4          1:8    digital  1       72     0.03    0.56     1.33      <10⁻¹²  65nm b
[10]   40         1/2          1:2    passive  1.2     48     0.42    0.83     0.09      10⁻⁹    90nm
[11]   40-44      1/4          1:16   digital  1.4     900    1.44    0.048    0.03      <10⁻¹²  90nm
This   5.75-44    1/4          1:16   digital  1       230    0.2     0.174    0.2       <10⁻¹²  90nm

a. Estimated
b. Silicon on insulator technology (SOI)

V. CONCLUSION

A clock-data-recovery circuit implemented in 90nm bulk CMOS for 40Gb/s chip-to-chip communication is presented. Thanks to the novel rate selection feature in the fully digital loop filter, a very large data rate range from 5.75 to 44Gb/s can be covered. From 5.75 to 38Gb/s a BER <10⁻¹² is achieved even for a frequency offset of ±615ppm and data jitter amplitudes above the XAUI mask.

ACKNOWLEDGMENT

The authors thank T. Toifl, C. Menolfi, T. Morf, C. Kromer, M. Kossel, J. Weiss for fruitful discussions, M. Lanz and M. Witzig for bonding and the IBM foundry team for manufacturing the CMOS chips.

REFERENCES

[1] J. Lee and B. Razavi, "A 40-Gb/s Clock and Data Recovery Circuit in 0.18-µm CMOS Technology," IEEE J. Solid-State Circuits, vol. 38, pp. 2181-2190, Dec. 2003.
[2] S. Sidiropoulos and M. Horowitz, "A Semi-Digital Dual Delay-Locked Loop," IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1683-1692, Nov. 1997.
[3] C. Kromer, G. Sialm, C. Menolfi, M. Schmatz, F. Ellinger, and H. Jäckel, "A 25-Gb/s CDR in 90-nm CMOS for High-Density Interconnects," IEEE J. Solid-State Circuits, vol. 41, no. 12, pp. 2921-2929, Dec. 2006.
[4] T. Toifl, C. Menolfi, P. Buchmann, C. Hagleitner, M. Kossel, T. Morf, J. Weiss, and M. Schmatz, "A 72mW 0.03mm2 Inductorless 40 Gb/s CDR in 65 nm SOI CMOS," in IEEE ISSCC Dig. Technical Papers, pp. 226-227, 11-15 Feb. 2007.
[5] T. Toifl, C. Menolfi, P. Buchmann, et al., "0.94ps-rms-Jitter 0.016mm2 2.5GHz Multi-Phase Generator PLL with 360° Digitally Programmable Phase Shift for 10Gb/s Serial Links," IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2700-2712, Dec. 2005.
[6] E. Alon, J. Kim, S. Pamarti, K. Chang, and M. Horowitz, "Replica compensated linear regulators for supply-regulated phase-locked loops," IEEE J. Solid-State Circuits, vol. 41, pp. 413-424, Feb. 2006.
[7] J. D. H. Alexander, "Clock Recovery from Random Binary Data," Electronics Letters, vol. 11, pp. 541-542, 1975.
[8] S. S. Mohan, M. Hershenson, S. P. Boyd, and T. H. Lee, "Bandwidth Extension in CMOS with Optimized On-Chip Inductors," IEEE J. Solid-State Circuits, vol. 35, no. 3, pp. 346-355, March 2000.
[9] IEEE Std. 802.3ae-2002, Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gbps Operation.
[10] C. F. Liao and S. I. Liu, "40 Gb/s Transimpedance-AGC Amplifier and CDR Circuit for Broadband Data Receivers in 90 nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 642-655, March 2008.
[11] N. Nedovic, N. Tzartzanis, H. Tamura, H. Rotella, M. Wiklund, Y. Mizutani, Y. Okaniwa, T. Kuroda, J. Ogawa, and W. Walker, "40-to-44 Gb/s 3× Oversampling CMOS CDR, 1:16 DEMUX," in IEEE ISSCC Dig. Technical Papers, pp. 224-225, 11-15 Feb. 2007.