IEICE TRANS. FUNDAMENTALS, VOL.E83–A, NO.6 JUNE 2000 1100 PAPER Special Section of Papers Selected from ITC-CSCC’99 A 1.0 Gbps CMOS Oversampling Data Recovery Circuit with Fine Delay Generation Method∗ Jun-Young PARK† and Jin-Ku KANG† , Nonmembers SUMMARY This paper describes an oversampling data recovery circuit composed of an analog delay locked loop and a digital decision logic. The novel oversampling technique is based on the delay locked loop circuit locked to multiple clock periods rather than a single clock period, which generates the timing resolution less than the gate delay of the delay chain. The digital logic for data recovery was implemented with the assumption that there is no frequency deviation that hurts the center of acquired data. The chip has been fabricated using 0.6 µm CMOS technology. The chip has been tested at 1.0 Gb/s NRZ input data with 125 MHz clock and recovers the serial input data into eight 125 Mb/s output stream. key words: oversampling data recovery, PLL, DLL, jitter 1. Introduction Data recovery systems using the data-oversampling method have been developed in CMOS technology [1]– [3]. In their oversampling technique, the clock (or data) is propagated through delay taps for generating multiphase clocks (or data), limit the sampling resolution to the minimal intrinsic gate delay. As a result the technique was applied to relatively low speed system applications. Finer delay would require the use of faster, more expensive technologies. In order to generate precise delays with significantly finer resolution than an intrinsic gate delay without the use of state-of-the-art integrated circuit technology, some approaches have been presented [4], [5]. These approaches though have a complex architecture to implement. This paper describes a single delay chain that is locked to multiple clock periods to achieve a delay resolution less than one delay cell delay time. The sampling resolution can be much finer than with the single delay line tapping method. This finer sampling resolution provides the capability of oversampling higher data rate with a high oversampling ratio. Consequently, jitter tolerance can also be improved [6]. The multi phase signals from delay chain are used as clock signals oversampling 1 Gbps input data. Proper data boundary position is decided from sampled data bit array to perform the correct data recovery. Manuscript received October 4, 1999. Manuscript revised January 11, 2000. † The authors are with the School of Electrical and Computer Engineering, Inha University, Inchon, Korea. ∗ This work was supported by the development program for the exemplary schools in information and communication from the Ministry of Information and Communication (MIC). In the data recovery circuit each bit of input data is oversampled by four times. A ‘correct bit’ in the oversampled points is then selected as recovered data by locating the NRZ input data boundary. The circuit requires a clock at 1/8 of the data frequency, and for one clock cycle, 8 data bits are recovered in parallel. This oversampling data recovery technique could be used on SONET-based telecommunications, ATM digital communications, transceiver circuits and other high speed data communications. The primary components of the data recovery circuit are the analog delay locked loop for generating the fine clock delays, input sampler, FIFO register and digital decision logic for data recovery. Section 2 describes the overall architecture of the implemented data recovery circuit and Sect. 3 discusses the basic theory of the components. Circuit design is presented in Sect. 4 and experimental results are given in Sect. 5, followed by the conclusions. 2. Data Recovery Circuit Architecture The architecture of the data recovery circuit is shown in Fig. 1. The DLL for a multi-phase clock generator is locked by an external clock signal that runs at oneeighth the data rate for the 1:8 demultiplexing data recovery. The delay stages of the delay chain are tapped to produce the sampling clock edges and these edges Fig. 1 Architectural floor plan. PARK and KANG: A 1.0 Gbps CMOS OVERSAMPLING DATA RECOVERY CIRCUIT 1101 are adjusted in time sequence for data oversampling. A bank of input samplers uses these clock edges to oversample the input data. Transitions in the input data bit stream are detected and processed to decide proper input data bit boundary and this information is used to pick the oversampled points to recover data. To process the oversampled data in parallel, first in first out (FIFO) registers are needed. The sampled points must be held in parallel at the register by the time needed to decide data boundary position so that correct bits are selected. In our architecture, 4X oversampling is chosen. This means the sampler oversamples four times per bit. Since eight input data are processed in one clock cycle, 32 clock edges and latches in the samplers are required for the 1:8 demultiplexing data recovery circuit. It is a very practical assumption that in the worst case two oversampled points near the each data boundary among the four samples per bit are uncertain due to jitter. Thus two sampled points from the center of single data bit are guaranteed to derive the correct data recovery. 3. Table 1 The delay time (D) per single stage for various stage numbers (N ) and the number of clock periods (n). (unit: ps, T =8 ns) N /n 24stage 32stage 40stage 48stage 56stage 64stage 1 333.3 250 200 166.6 143.8 125 2 666.6 500 400 333.3 285.7 250 3 1000 750 600 500 428.5 375 4 1333 1000 800 666.6 571.4 500 5 1666 1250 1000 833.3 714.2 625 6 2000 1500 1200 1000 857.1 750 Principle of Operation 3.1 Multi-Phase Clock Signal Generator The general method that makes signals equally spaced in time domain is to tap a chain of delay elements, which the delay time is controlled by DLL (delay locked loop). When the two input signals to phase detector from delay chain are in phase, i.e. the DLL is locked, the following equation is satisfied between the delay time per unit delay cell (D), the external clock period (T ), and the number of total delay cell stages (N ). DN = nT (1) where n is an integer number. In conventional DLL’s, the delay chain is locked to the single clock period. In this case the delay time per unit delay cell (D) can be expressed as below: D = T /N Fig. 2 Waveform of delay stage in case of N =32, n=3, and T =8 ns. The delay difference between 0th stage and 11th stage is 250 ps under the absolute gate delay of 750 ps. (N/n = T /D is an irrational number in this case.) (when n=1) (2) From the above equation, the delay time D can be reduced by increasing the total stage number N but can not be less than intrinsic gate delay of the each delay stage. Now we lift the constraint that the delay chain should be locked to the one clock period (n = 1) as used in the common DLL’s. Here, the number of locking clock periods, n, can be any integer. Table 1 shows the delay time (D = nT /N ) for various stage numbers (N ) and the numbers of clock periods (n) under the fixed external clock period, T = 8 ns. The bolded numbers in Table 1 are the cases when the ratio N/n = T /D produces irrational number and the other numbers are the case when the ratio makes rational number. The combinations of the number of delay stages (N ) and the number of clock periods to DLL (n) which generate the bold styled delay values are the conditions that makes a delay resolution less than the delay time of single delay cell. In the cases when the ratio N/n = T /D produces a rational number, the same phase clocks are generated after one clock cycle period. Thus the delay resolution is the same as the delay of the signle delay cell. Figure 2 and Table 2 show the special case when the total number of delay stage (N ) is 32, the number (n) is 3, and the external clock period (T ) is 8 ns. These setting displays that each delay cell generate 750 ps delay, but 32 equally spaced signals have 250 ps difference in sequence. As shown in Fig. 2, the signal from 11th delay stage has the delay difference of 250 ps with the signal going into 1st delay stage. This is because the actual delay is the modulus function of the clock period on the total delay time after 11th stage. And the delay difference between 11th stage and 22nd stage is also 250 ps. The periodicity makes a finer delay resolution than the absolute gate delay possible through many dif- IEICE TRANS. FUNDAMENTALS, VOL.E83–A, NO.6 JUNE 2000 1102 Table 2 The time delay of delay stage with respect to 0th stage in the order of time sequence. (N =32, n=3, and T =8 ns) N 0 D(ps) 0 N 24 11 250 (8250) 500 (16500) 750 22 1 12 23 2 13 1000 (9000) 1250 (17250) 1500 1750 (9750) N 16 3 D(ps) 2000 (18000) 2250 14 2500 6 25 2750 (18750) 3000 17 3250 (11250) 3500 (19500) 3750 7 4 15 26 5 27 28 18 29 D(ps) 4000 (12000) 4250 (20250) 4500 (10500) 4750 (12750) 5000 (21000) 5250 5500 (13500) 5750 (21750) N 8 D(ps) 6000 19 6250 (14250) 6500 (22500) 6750 30 9 20 31 10 21 7000 (15000) 7250 (23250) 7500 7750 (15750) Fig. 4 Bit boundary decision with the transition information of sampled data. into four groups. xl [n] = x[4n + l] Fig. 3 Data recovery process. ferent delay taps by rearranging in time sequence. The effective delay, δi at ith delay stage is therefore can be written as: δi = (i × D) mod T (3) 3.2 Data Recovery Figure 3 shows the overall process of data recovery. Recovery process requires finding and tracking the bit boundaries. The decision logic first detects data transition points by an XOR of adjacent samples, indicating the bit boundary can be deduced from four samples in one bit data period. If the data boundaries are determined, samples in two points after each the boundary position are selected as the recovered data. Since the 8 bit input data are processed in one clock cycle, the bit boundary reappears in every four samples. In order to determine the bit boundary, the 32 samples are divided (0 ≤ l ≤ 3) (4) where n is the samples position through the 32 stages and l is the samples looked at each single data bit. We denote position ‘a’ when the transition occurs between x0 [n] and x3 [n], position ‘b’ between x3 [n] and x2 [n], position ‘c’ between x2 [n] and x1 [n] and position ‘d’ between x1 [n] and x0 [n]. Figure 4 shows an example of the boundary detection process with a portion of a sampled bit stream. To find which of the four transition positions is the most likely bit boundary, transitions corresponding to the same boundary position are counted. The position with the largest total determines the bit borders. The decision logic updates the bit boundary in every 8 bit of input stream. If there is no data transition, the transition position ‘a’ is chosen for the bit boundary. Due to jitter in the input data, two transition positions can be possible candidates for the bit boundary as shown in Fig. 5. If the samples from jittered input data has 5 or 3 sampling points per single data bit instead of the nominal 4 sampling points, the transition count calculation gives two positions (‘a’ and ‘b’) as the bit boundary candidates. If the position a is chosen as the boundary, the correct data recovery is performed. But, the position b is selected as the boundary, the wrong data are recovered. To resolve this problem, we also look for the position which has no transition, which is the position ‘c’ in Fig. 5. This no-transition position ‘c’ indicates that two adjacent sampled points, x2 [n] and x1 [n], should be placed in the middle of one bit sampling window. This can be derived from the assumption that the maximum possible jitter hurts only two sampling points near bit boundary. Therefore, the transition position ‘b’ can not be the boundary in the case shown in Fig. 5. This supplementary process to the maximum transition count algorithm is made pos- PARK and KANG: A 1.0 Gbps CMOS OVERSAMPLING DATA RECOVERY CIRCUIT 1103 Fig. 7 Fig. 5 The information from transition count can give an information of incorrect bit boundary position due to jitter. Fig. 6 Block diagram of the designed DLL. sible because of our scheme is implemented with the 4X oversampling technique. Unlike other techniques [7], [8], this data recovery algorithm can also be applied to the data communication without any preamble period. 4. Circuit Implementation A block diagram of the designed DLL for generating a fine resolution multi-phase clock is shown in Fig. 6. It consists of a phase detector, a charge pump, and a delay chain of 32 stage. The transistors in delay cell are locked at 750 ps delay time. The real sampling resolution is 250 ps though. The small signals in delay cell are converted to full swing level as clock signals at data sampler. The D flip-flop in the phase detector is Block diagram of the boundary decision circuit. a single phase clock D-F/F with a clear function [9]. The phase detector using flip-flop is not sensitive to differences of the duty-cycle between the inputs and D flip-flop of only 11 transistors is used to achieve dead zone free. The circuit of the input sampler is based on Ref. [7]. Input data and the sampling clocks are provided in differential manner. The sampling ends on the falling edge of the clock, holding the most recent sampled data that arrived for regenerative amplification for half cycle. Due to the presence of multiple clocks in the sampler, the sampled points could be overwritten before proceeding to the data processing block. To avoid this, the half-segment register deskewing with a FIFO structure is used [10]. In order to latch the oversampled bits in parallel, they must first be aligned in time (deskewed). Instead of implementing area-consuming delay units on de-skewing lines, half-segment latch stages were used. This allows wider setup/hold time windows for the parallel latching. FIFOs are used for shifting samples without being overwritten by the next incoming sampling clock. The transition detection and boundary decision block is shown in Fig. 7. As described in the data recovery algorithm, both transition count and non-transition count are used for the boundary position decision. 5. Experimental Results The recovery circuit is implemented in a standard 0.6 µm CMOS technology and post layout simulation has been done using HSPICE. The die photo is shown in Fig. 8. The design and performance statistics are summarized in Table 3. The external clock signal to the delay chain is 1 V swing (3.8 V–4.8 V) at 125 MHz and the DLL is locked to 3 times of the clock period (3 × 8 ns). The differential input data is a 1 Gbps with 0.5 V swing (1.25 V–1.75 V). Figure 9 shows the simulated full swing converted signals from delay chain in the order of the delay stage. Figure 10 displays the simulated signals adjusted IEICE TRANS. FUNDAMENTALS, VOL.E83–A, NO.6 JUNE 2000 1104 Fig. 10 Fig. 8 Table 3 Simulated signals rearranged in time sequence. Chip micrograph. Design and performance statistics. Technology Transistors Die size (Core) Total I/O pins Power Supply Input Data Rate Sampling Clock Rate Clock jitter (peak to peak) Output Data (8 Channel) Sampling Resolution Power Dissipation 0.6 µm CMOS 14675 3100 µm× 1900 µm 80 pin QFP 5 Volt 1 Gb/s 125 MHz 100 ps 125 Mb/s 250 ps 800 mW Fig. 11 Fig. 9 Simulated signals from each stage from delay chain (1st, 2nd, 3rd, 4th, 5th stage. . .). Measured output signals at output channels. in time sequence, which is used as clock edges for data oversampling. The resolution is 250 ps, so that 1 Gbps data (data period=1 ns) is oversampled by four times. The measured output signals from five channels of eight output channels are shown in Fig. 11. For example, the measured output signal of channel 2 is ‘1010. . ..’ pattern. The pulse width of ‘1’ and ‘0’ is 8 ns (125 Mb/s). The other channels patterns are ‘0001. . .,’ ‘1100. . .’ and ‘1110. . .’ Various input patterns were supplied for checking the correct data recovery. Due to the demultiplexed structure and the lack of BER test equipment, the BER test using PRBS pattern could not be performed yet. Instead, the output for various known input data patterns with maximum bit length of 16 bits were checked and compared. In all input data patterns, data were recovered without PARK and KANG: A 1.0 Gbps CMOS OVERSAMPLING DATA RECOVERY CIRCUIT 1105 any error. The sampling clock jitter was measured by checking the output while sweeping a input transition. The time period producing an unstable output is measured as clock jitter which was about 100 ps. 6. Conclusion This paper describes a 1 Gbps data recovery circuit using 4X oversampling technique. To achieve a fine sampling resolution for high speed data oversampling, a single delay chain locked to multiple clock periods is proposed. This approach can realize the delay resolution less than the gate delay in the delay chain. This is satisfied when the ratio N/n = T /D makes irrational number (N : total number of delay cell, n: number of clock periods, T : input signal period, D: delay time per unit delay cell). By using 4X oversampling and data recovery algorithm of the maximum-minimum transition count, the robust oversampling data recovery circuit is implemented in 0.6 µm CMOS. The test chip has shown that 1 Gbps input data can be recovered 1:8 demultiplexed output format. This scheme can be expanded to higher bit rate applications by scaling the sampling resolution. References [1] J. Sonntag and R. Leonowich, “A monolithic CMOS 10 MHz DPLL for burst-mode data retiming,” IEEE International Solid-State Circuit Conference, pp.194–195, Feb. 1990. [2] M. Bazes and R. Ashuri, “A novel CMOS digital clock and data decoder,” IEEE J. Solid State Circuits, pp.1934–1940, Dec. 1992. [3] B. Kim, D. N. Helman, and P. R. Gray, “A 30-MHz hybrid analog/digital clock recovery circuit in 2 µm CMOS,” IEEE J. Solid State Circuits, pp.1385–1394, Dec. 1990. [4] S. Sidiropolous and M. Horowitz, “A semi-digital DLL with unlimited phase shift capability and 0.08–400 MHz operating range,” IEEE J. Solid State Circuits, pp.1683–1692, vol.32, no.11, Nov. 1997. [5] J.G. Maneatis, “Low-jitter and process-independent DLL and PLL based on self-biased technique,” ISSCC, pp.130– 131, 1996. [6] J. Kang, “Performance analysis of oversampling data recovery circuit,” IEICE Trans. Fundamentals, vol.E82-A, no.6, pp.958–964, June 1999. [7] C. Yang, R. Farjad-Rad, and M. Horowitz, “A 0.5 µm CMOS 4.0 Gbps serial link transceiver with data recovery using oversampling,” IEEE J. Solid State Circuits, pp.713– 721, 1998. [8] T. Fong, M. Cerisola, R. Hofmeister, and L. Kazovsky, “Ultra fast clock recovery for optical packet switched networks,” Electronic Letters, pp.1687–1688, Sept. 1995. [9] R. Rogenmoser, “1.16 GHz dual-modulus 1.2 µm CMOS prescaler,” CICC, pp.27.6.1–27.6.4, 1993. [10] J. Kang, W. Liu, and R. K. Cavin, “A CMOS high speed data recovery circuit using matched delay sampling technique,” IEEE J. Solid State Circuits, pp.1588–1596, Oct. 1997. Jun-Young Park received his B.S. and M.S. degree in Electrical and Computer Engineering at Inha University, Inchon, Korea, in 1998 and 2000, respectively. His research interests are VLSI design, mixed-mode circuit design, and signal processing. Jin-Ku Kang received his Ph.D. in Computer Engineering at North Carolina State University, Raleigh, NC, U.S.A, in 1996. He is an assistant professor in School of Electrical and Computer Engineering at Inha University, Inchon, Korea. From 1984–1988 he was with Samsung Electronics. He has also worked for INTEL, Hillsboro, OR, U.S.A. before he joined the University in 1997. His research interests are VLSI design, mixedmode circuit design, PLL, data communication circuits, and clock distribution in high speed digital systems.