A 1.0 Gbps CMOS Oversampling Data Recovery Circuit with Fine

advertisement
IEICE TRANS. FUNDAMENTALS, VOL.E83–A, NO.6 JUNE 2000
1100
PAPER
Special Section of Papers Selected from ITC-CSCC’99
A 1.0 Gbps CMOS Oversampling Data Recovery Circuit
with Fine Delay Generation Method∗
Jun-Young PARK† and Jin-Ku KANG† , Nonmembers
SUMMARY This paper describes an oversampling data recovery circuit composed of an analog delay locked loop and a
digital decision logic. The novel oversampling technique is based
on the delay locked loop circuit locked to multiple clock periods rather than a single clock period, which generates the timing
resolution less than the gate delay of the delay chain. The digital logic for data recovery was implemented with the assumption
that there is no frequency deviation that hurts the center of acquired data. The chip has been fabricated using 0.6 µm CMOS
technology. The chip has been tested at 1.0 Gb/s NRZ input data
with 125 MHz clock and recovers the serial input data into eight
125 Mb/s output stream.
key words: oversampling data recovery, PLL, DLL, jitter
1.
Introduction
Data recovery systems using the data-oversampling
method have been developed in CMOS technology [1]–
[3]. In their oversampling technique, the clock (or data)
is propagated through delay taps for generating multiphase clocks (or data), limit the sampling resolution to
the minimal intrinsic gate delay. As a result the technique was applied to relatively low speed system applications. Finer delay would require the use of faster,
more expensive technologies. In order to generate precise delays with significantly finer resolution than an intrinsic gate delay without the use of state-of-the-art integrated circuit technology, some approaches have been
presented [4], [5]. These approaches though have a complex architecture to implement.
This paper describes a single delay chain that is
locked to multiple clock periods to achieve a delay resolution less than one delay cell delay time. The sampling resolution can be much finer than with the single
delay line tapping method. This finer sampling resolution provides the capability of oversampling higher data
rate with a high oversampling ratio. Consequently, jitter tolerance can also be improved [6]. The multi phase
signals from delay chain are used as clock signals oversampling 1 Gbps input data. Proper data boundary
position is decided from sampled data bit array to perform the correct data recovery.
Manuscript received October 4, 1999.
Manuscript revised January 11, 2000.
†
The authors are with the School of Electrical and Computer Engineering, Inha University, Inchon, Korea.
∗
This work was supported by the development program
for the exemplary schools in information and communication from the Ministry of Information and Communication
(MIC).
In the data recovery circuit each bit of input data is
oversampled by four times. A ‘correct bit’ in the oversampled points is then selected as recovered data by
locating the NRZ input data boundary. The circuit requires a clock at 1/8 of the data frequency, and for one
clock cycle, 8 data bits are recovered in parallel. This
oversampling data recovery technique could be used on
SONET-based telecommunications, ATM digital communications, transceiver circuits and other high speed
data communications.
The primary components of the data recovery circuit are the analog delay locked loop for generating the
fine clock delays, input sampler, FIFO register and digital decision logic for data recovery. Section 2 describes
the overall architecture of the implemented data recovery circuit and Sect. 3 discusses the basic theory of the
components. Circuit design is presented in Sect. 4 and
experimental results are given in Sect. 5, followed by
the conclusions.
2.
Data Recovery Circuit Architecture
The architecture of the data recovery circuit is shown
in Fig. 1. The DLL for a multi-phase clock generator
is locked by an external clock signal that runs at oneeighth the data rate for the 1:8 demultiplexing data recovery. The delay stages of the delay chain are tapped
to produce the sampling clock edges and these edges
Fig. 1
Architectural floor plan.
PARK and KANG: A 1.0 Gbps CMOS OVERSAMPLING DATA RECOVERY CIRCUIT
1101
are adjusted in time sequence for data oversampling. A
bank of input samplers uses these clock edges to oversample the input data. Transitions in the input data
bit stream are detected and processed to decide proper
input data bit boundary and this information is used
to pick the oversampled points to recover data. To process the oversampled data in parallel, first in first out
(FIFO) registers are needed. The sampled points must
be held in parallel at the register by the time needed to
decide data boundary position so that correct bits are
selected. In our architecture, 4X oversampling is chosen. This means the sampler oversamples four times
per bit. Since eight input data are processed in one
clock cycle, 32 clock edges and latches in the samplers
are required for the 1:8 demultiplexing data recovery
circuit. It is a very practical assumption that in the
worst case two oversampled points near the each data
boundary among the four samples per bit are uncertain
due to jitter. Thus two sampled points from the center
of single data bit are guaranteed to derive the correct
data recovery.
3.
Table 1 The delay time (D) per single stage for various stage
numbers (N ) and the number of clock periods (n). (unit: ps,
T =8 ns)
N /n
24stage
32stage
40stage
48stage
56stage
64stage
1
333.3
250
200
166.6
143.8
125
2
666.6
500
400
333.3
285.7
250
3
1000
750
600
500
428.5
375
4
1333
1000
800
666.6
571.4
500
5
1666
1250
1000
833.3
714.2
625
6
2000
1500
1200
1000
857.1
750
Principle of Operation
3.1 Multi-Phase Clock Signal Generator
The general method that makes signals equally spaced
in time domain is to tap a chain of delay elements,
which the delay time is controlled by DLL (delay locked
loop). When the two input signals to phase detector
from delay chain are in phase, i.e. the DLL is locked,
the following equation is satisfied between the delay
time per unit delay cell (D), the external clock period
(T ), and the number of total delay cell stages (N ).
DN = nT
(1)
where n is an integer number.
In conventional DLL’s, the delay chain is locked to
the single clock period. In this case the delay time per
unit delay cell (D) can be expressed as below:
D = T /N
Fig. 2 Waveform of delay stage in case of N =32, n=3, and
T =8 ns. The delay difference between 0th stage and 11th stage
is 250 ps under the absolute gate delay of 750 ps. (N/n = T /D
is an irrational number in this case.)
(when n=1)
(2)
From the above equation, the delay time D can be
reduced by increasing the total stage number N but can
not be less than intrinsic gate delay of the each delay
stage. Now we lift the constraint that the delay chain
should be locked to the one clock period (n = 1) as used
in the common DLL’s. Here, the number of locking
clock periods, n, can be any integer. Table 1 shows the
delay time (D = nT /N ) for various stage numbers (N )
and the numbers of clock periods (n) under the fixed
external clock period, T = 8 ns.
The bolded numbers in Table 1 are the cases when
the ratio N/n = T /D produces irrational number and
the other numbers are the case when the ratio makes
rational number. The combinations of the number of
delay stages (N ) and the number of clock periods to
DLL (n) which generate the bold styled delay values are
the conditions that makes a delay resolution less than
the delay time of single delay cell. In the cases when
the ratio N/n = T /D produces a rational number, the
same phase clocks are generated after one clock cycle
period. Thus the delay resolution is the same as the
delay of the signle delay cell.
Figure 2 and Table 2 show the special case when
the total number of delay stage (N ) is 32, the number
(n) is 3, and the external clock period (T ) is 8 ns. These
setting displays that each delay cell generate 750 ps delay, but 32 equally spaced signals have 250 ps difference
in sequence. As shown in Fig. 2, the signal from 11th
delay stage has the delay difference of 250 ps with the
signal going into 1st delay stage. This is because the
actual delay is the modulus function of the clock period
on the total delay time after 11th stage. And the delay difference between 11th stage and 22nd stage is also
250 ps. The periodicity makes a finer delay resolution
than the absolute gate delay possible through many dif-
IEICE TRANS. FUNDAMENTALS, VOL.E83–A, NO.6 JUNE 2000
1102
Table 2 The time delay of delay stage with respect to 0th
stage in the order of time sequence. (N =32, n=3, and T =8 ns)
N
0
D(ps)
0
N
24
11
250
(8250)
500
(16500)
750
22
1
12
23
2
13
1000
(9000)
1250
(17250)
1500
1750
(9750)
N
16
3
D(ps)
2000
(18000)
2250
14
2500
6
25
2750
(18750)
3000
17
3250
(11250)
3500
(19500)
3750
7
4
15
26
5
27
28
18
29
D(ps)
4000
(12000)
4250
(20250)
4500
(10500)
4750
(12750)
5000
(21000)
5250
5500
(13500)
5750
(21750)
N
8
D(ps)
6000
19
6250
(14250)
6500
(22500)
6750
30
9
20
31
10
21
7000
(15000)
7250
(23250)
7500
7750
(15750)
Fig. 4 Bit boundary decision with the transition information
of sampled data.
into four groups.
xl [n] = x[4n + l]
Fig. 3
Data recovery process.
ferent delay taps by rearranging in time sequence. The
effective delay, δi at ith delay stage is therefore can be
written as:
δi = (i × D) mod T
(3)
3.2 Data Recovery
Figure 3 shows the overall process of data recovery.
Recovery process requires finding and tracking the bit
boundaries. The decision logic first detects data transition points by an XOR of adjacent samples, indicating
the bit boundary can be deduced from four samples in
one bit data period. If the data boundaries are determined, samples in two points after each the boundary
position are selected as the recovered data. Since the 8
bit input data are processed in one clock cycle, the bit
boundary reappears in every four samples. In order to
determine the bit boundary, the 32 samples are divided
(0 ≤ l ≤ 3)
(4)
where n is the samples position through the 32 stages
and l is the samples looked at each single data bit. We
denote position ‘a’ when the transition occurs between
x0 [n] and x3 [n], position ‘b’ between x3 [n] and x2 [n],
position ‘c’ between x2 [n] and x1 [n] and position ‘d’
between x1 [n] and x0 [n].
Figure 4 shows an example of the boundary detection process with a portion of a sampled bit stream.
To find which of the four transition positions is the
most likely bit boundary, transitions corresponding to
the same boundary position are counted. The position
with the largest total determines the bit borders. The
decision logic updates the bit boundary in every 8 bit
of input stream. If there is no data transition, the transition position ‘a’ is chosen for the bit boundary.
Due to jitter in the input data, two transition positions can be possible candidates for the bit boundary as
shown in Fig. 5. If the samples from jittered input data
has 5 or 3 sampling points per single data bit instead
of the nominal 4 sampling points, the transition count
calculation gives two positions (‘a’ and ‘b’) as the bit
boundary candidates. If the position a is chosen as the
boundary, the correct data recovery is performed. But,
the position b is selected as the boundary, the wrong
data are recovered. To resolve this problem, we also
look for the position which has no transition, which is
the position ‘c’ in Fig. 5. This no-transition position
‘c’ indicates that two adjacent sampled points, x2 [n]
and x1 [n], should be placed in the middle of one bit
sampling window. This can be derived from the assumption that the maximum possible jitter hurts only
two sampling points near bit boundary. Therefore, the
transition position ‘b’ can not be the boundary in the
case shown in Fig. 5. This supplementary process to
the maximum transition count algorithm is made pos-
PARK and KANG: A 1.0 Gbps CMOS OVERSAMPLING DATA RECOVERY CIRCUIT
1103
Fig. 7
Fig. 5 The information from transition count can give an information of incorrect bit boundary position due to jitter.
Fig. 6
Block diagram of the designed DLL.
sible because of our scheme is implemented with the 4X
oversampling technique. Unlike other techniques [7],
[8], this data recovery algorithm can also be applied to
the data communication without any preamble period.
4.
Circuit Implementation
A block diagram of the designed DLL for generating
a fine resolution multi-phase clock is shown in Fig. 6.
It consists of a phase detector, a charge pump, and a
delay chain of 32 stage. The transistors in delay cell
are locked at 750 ps delay time. The real sampling resolution is 250 ps though. The small signals in delay
cell are converted to full swing level as clock signals at
data sampler. The D flip-flop in the phase detector is
Block diagram of the boundary decision circuit.
a single phase clock D-F/F with a clear function [9].
The phase detector using flip-flop is not sensitive to
differences of the duty-cycle between the inputs and D
flip-flop of only 11 transistors is used to achieve dead
zone free.
The circuit of the input sampler is based on Ref. [7].
Input data and the sampling clocks are provided in
differential manner. The sampling ends on the falling
edge of the clock, holding the most recent sampled data
that arrived for regenerative amplification for half cycle. Due to the presence of multiple clocks in the sampler, the sampled points could be overwritten before
proceeding to the data processing block. To avoid this,
the half-segment register deskewing with a FIFO structure is used [10]. In order to latch the oversampled
bits in parallel, they must first be aligned in time (deskewed). Instead of implementing area-consuming delay units on de-skewing lines, half-segment latch stages
were used. This allows wider setup/hold time windows
for the parallel latching. FIFOs are used for shifting
samples without being overwritten by the next incoming sampling clock.
The transition detection and boundary decision
block is shown in Fig. 7. As described in the data recovery algorithm, both transition count and non-transition
count are used for the boundary position decision.
5.
Experimental Results
The recovery circuit is implemented in a standard
0.6 µm CMOS technology and post layout simulation
has been done using HSPICE. The die photo is shown
in Fig. 8. The design and performance statistics are
summarized in Table 3. The external clock signal to
the delay chain is 1 V swing (3.8 V–4.8 V) at 125 MHz
and the DLL is locked to 3 times of the clock period
(3 × 8 ns). The differential input data is a 1 Gbps with
0.5 V swing (1.25 V–1.75 V). Figure 9 shows the simulated full swing converted signals from delay chain in
the order of the delay stage.
Figure 10 displays the simulated signals adjusted
IEICE TRANS. FUNDAMENTALS, VOL.E83–A, NO.6 JUNE 2000
1104
Fig. 10
Fig. 8
Table 3
Simulated signals rearranged in time sequence.
Chip micrograph.
Design and performance statistics.
Technology
Transistors
Die size (Core)
Total I/O pins
Power Supply
Input Data Rate
Sampling Clock Rate
Clock jitter (peak to peak)
Output Data (8 Channel)
Sampling Resolution
Power Dissipation
0.6 µm CMOS
14675
3100 µm× 1900 µm
80 pin QFP
5 Volt
1 Gb/s
125 MHz
100 ps
125 Mb/s
250 ps
800 mW
Fig. 11
Fig. 9 Simulated signals from each stage from delay chain (1st,
2nd, 3rd, 4th, 5th stage. . .).
Measured output signals at output channels.
in time sequence, which is used as clock edges for data
oversampling. The resolution is 250 ps, so that 1 Gbps
data (data period=1 ns) is oversampled by four times.
The measured output signals from five channels of eight
output channels are shown in Fig. 11. For example, the
measured output signal of channel 2 is ‘1010. . ..’ pattern. The pulse width of ‘1’ and ‘0’ is 8 ns (125 Mb/s).
The other channels patterns are ‘0001. . .,’ ‘1100. . .’
and ‘1110. . .’ Various input patterns were supplied for
checking the correct data recovery.
Due to the demultiplexed structure and the lack of
BER test equipment, the BER test using PRBS pattern could not be performed yet. Instead, the output
for various known input data patterns with maximum
bit length of 16 bits were checked and compared. In
all input data patterns, data were recovered without
PARK and KANG: A 1.0 Gbps CMOS OVERSAMPLING DATA RECOVERY CIRCUIT
1105
any error. The sampling clock jitter was measured by
checking the output while sweeping a input transition.
The time period producing an unstable output is measured as clock jitter which was about 100 ps.
6.
Conclusion
This paper describes a 1 Gbps data recovery circuit using 4X oversampling technique. To achieve a fine sampling resolution for high speed data oversampling, a
single delay chain locked to multiple clock periods is
proposed. This approach can realize the delay resolution less than the gate delay in the delay chain. This is
satisfied when the ratio N/n = T /D makes irrational
number (N : total number of delay cell, n: number of
clock periods, T : input signal period, D: delay time per
unit delay cell). By using 4X oversampling and data recovery algorithm of the maximum-minimum transition
count, the robust oversampling data recovery circuit is
implemented in 0.6 µm CMOS. The test chip has shown
that 1 Gbps input data can be recovered 1:8 demultiplexed output format. This scheme can be expanded
to higher bit rate applications by scaling the sampling
resolution.
References
[1] J. Sonntag and R. Leonowich, “A monolithic CMOS
10 MHz DPLL for burst-mode data retiming,” IEEE International Solid-State Circuit Conference, pp.194–195, Feb.
1990.
[2] M. Bazes and R. Ashuri, “A novel CMOS digital clock and
data decoder,” IEEE J. Solid State Circuits, pp.1934–1940,
Dec. 1992.
[3] B. Kim, D. N. Helman, and P. R. Gray, “A 30-MHz hybrid
analog/digital clock recovery circuit in 2 µm CMOS,” IEEE
J. Solid State Circuits, pp.1385–1394, Dec. 1990.
[4] S. Sidiropolous and M. Horowitz, “A semi-digital DLL with
unlimited phase shift capability and 0.08–400 MHz operating range,” IEEE J. Solid State Circuits, pp.1683–1692,
vol.32, no.11, Nov. 1997.
[5] J.G. Maneatis, “Low-jitter and process-independent DLL
and PLL based on self-biased technique,” ISSCC, pp.130–
131, 1996.
[6] J. Kang, “Performance analysis of oversampling data recovery circuit,” IEICE Trans. Fundamentals, vol.E82-A, no.6,
pp.958–964, June 1999.
[7] C. Yang, R. Farjad-Rad, and M. Horowitz, “A 0.5 µm
CMOS 4.0 Gbps serial link transceiver with data recovery
using oversampling,” IEEE J. Solid State Circuits, pp.713–
721, 1998.
[8] T. Fong, M. Cerisola, R. Hofmeister, and L. Kazovsky,
“Ultra fast clock recovery for optical packet switched networks,” Electronic Letters, pp.1687–1688, Sept. 1995.
[9] R. Rogenmoser, “1.16 GHz dual-modulus 1.2 µm CMOS
prescaler,” CICC, pp.27.6.1–27.6.4, 1993.
[10] J. Kang, W. Liu, and R. K. Cavin, “A CMOS high speed
data recovery circuit using matched delay sampling technique,” IEEE J. Solid State Circuits, pp.1588–1596, Oct.
1997.
Jun-Young Park
received his B.S.
and M.S. degree in Electrical and Computer Engineering at Inha University, Inchon, Korea, in 1998 and 2000, respectively. His research interests are VLSI design, mixed-mode circuit design, and signal processing.
Jin-Ku Kang
received his Ph.D. in
Computer Engineering at North Carolina
State University, Raleigh, NC, U.S.A, in
1996. He is an assistant professor in
School of Electrical and Computer Engineering at Inha University, Inchon, Korea. From 1984–1988 he was with Samsung Electronics. He has also worked for
INTEL, Hillsboro, OR, U.S.A. before he
joined the University in 1997. His research interests are VLSI design, mixedmode circuit design, PLL, data communication circuits, and clock
distribution in high speed digital systems.
Download