672 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 54, NO. 3, MARCH 2019

A 56-Gb/s PAM4 Receiver With Low-Overhead Techniques for Threshold and Edge-Based DFE FIR- and IIR-Tap Adaptation in 65-nm CMOS

Ashkan Roshan-Zamir, Student Member, IEEE, Takayuki Iwai, Yang-Hang Fan, Ankur Kumar, Student Member, IEEE, Hae-Woong Yang, Student Member, IEEE, Lee Sledjeski, John Hamilton, Soumya Chandramouli, Member, IEEE, Arlo Aude, Member, IEEE, and Samuel Palermo, Senior Member, IEEE

Abstract— This paper presents a four-level pulse amplitude modulation (PAM4) quarter-rate receiver that efficiently compensates for moderate channel loss in a robust manner through background adaptation of the receiver thresholds and equalization taps. The front-end utilizes an input single-stage continuous-time linear equalizer (CTLE) to boost the main cursor and relax the pre-cursor cancellation requirement, requiring only a 2-tap pre-cursor feed-forward equalizer (FFE) on the transmitter side. A 2-tap decision feedback equalizer (DFE) follows that includes one finite impulse response (FIR) tap and one infinite impulse response (IIR) tap to cancel first post-cursor and long-tail inter-symbol interference (ISI), respectively. In addition to the per-slice main three data samplers, a single error sampler is utilized for background threshold control and an edge-based sampler performs both phase-locked loop (PLL)-based clock and data recovery (CDR) phase detection and generates information for background DFE tap adaptation. Fabricated in general purpose (GP) 65-nm CMOS, the 56-Gb/s receiver achieves 4.63 mW/Gb/s and compensates for up to 20.8-dB loss at a bit error rate (BER) < 10⁻¹² when operated with a 2-tap FFE transmitter.

Index Terms— Decision feedback equalizer (DFE), DFE adaptation, four-level pulse amplitude modulation (PAM4), infinite impulse response (IIR), receiver, serial link, threshold adaptation.

Manuscript received July 7, 2018; revised October 4, 2018; accepted October 31, 2018. Date of publication December 18, 2018; date of current version February 21, 2019. This paper was approved by Guest Editor Nan Sun. This work was supported by SRC under Grant 1836.143. (Corresponding author: Ashkan Roshan-Zamir.)
A. Roshan-Zamir was with the Analog and Mixed Signal Center, Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843 USA. He is now with Texas Instruments Incorporated, Santa Clara, CA 95051 USA (e-mail: ashkanroshan@tamu.edu).
T. Iwai was with the Analog and Mixed Signal Center, Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843 USA. He is now with Toshiba Memory Corporation, Kawasaki 212-8520, Japan.
Y.-H. Fan, A. Kumar, H.-W. Yang, and S. Palermo are with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843 USA.
L. Sledjeski, J. Hamilton, S. Chandramouli, and A. Aude are with Texas Instruments Incorporated, Duluth, GA 30096 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2018.2881278

I. INTRODUCTION

SUPPORTING increased bandwidth demand in datacenters and high-performance computing systems requires higher per-lane electrical I/O data rates, motivating the development of recent high-speed I/O standards that utilize four-level pulse amplitude modulation (PAM4) with higher spectral efficiency [1], [2]. However, as shown in Fig. 1, PAM4 signaling increases system complexity at both the transmitter and receiver sides. Since the feed-forward equalizer (FFE) at the transmitter must be implemented on both the MSB and LSB signals, this can result in a high level of output stage segmentation with large FFE tap counts [3].
At the receiver, the multi-level decisions require multiple data samplers and equalizers with sufficient linearity to cancel long-tail multilevel inter-symbol interference (ISI). This paper focuses on improving the receiver efficiency, while also minimizing the transmit-side equalization requirements.

While ADC-based receivers [4], [5] are well suited for PAM4 signaling due to their inherent multi-level detection and robust digital equalization, their power can be prohibitive for moderate channel loss applications. This motivates a power-efficient mixed-signal receiver front-end solution. However, relative to NRZ receivers, transmit swing limitations make PAM4 receivers more sensitive to noise and residual ISI. This necessitates stringent ISI cancellation, and can result in transmit-side FFEs with three or more taps [3], [5], [7] and receive-side multi-stage continuous-time linear equalizers (CTLEs) [5]–[7] and decision feedback equalizers (DFEs) with large tap counts when implemented with conventional finite impulse response (FIR) feedback filters [6]. Improvement in DFE efficiency is possible with architectures that combine conventional FIR and infinite impulse response (IIR) feedback filters [8], [9]. However, it is difficult to support channels with over 15 dB of loss at Nyquist utilizing a DFE-only approach due to excessive sampler sensitivity requirements [10].

Reliable PAM4 receiver operation requires robust configuration of both the equalization settings and sampler thresholds. DFE taps must be adaptively tuned to support operation over a wide range of channels. These varying equalization settings result in different data samplers’ multi-level threshold values. An approach that utilizes a standard minimum mean square error algorithm requires four extra error samplers at each of the four PAM4 levels [6].
Another technique that involves the symmetrical adaptation of the high and low thresholds in the presence of Gray coding [11] requires two extra error samplers and can be sensitive to transmitter and front-end non-linearity. Overall, both the equalization tap weight and threshold adaptations should be performed with minimal hardware overhead and also offer compatibility with efficient clock and data recovery (CDR) architectures that support PAM4 modulation.

0018-9200 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. Conceptual PAM4 transceiver with transmit- and receive-side equalization, adaptation for equalization taps and sampler thresholds, and clock recovery circuitry.

The efficiency of the receiver CDR system is also important. While phase interpolator (PI)-based CDRs are often employed to generate the optimal sampling time from multi-phase high-speed reference clocks [6], [7], [11], this involves the design of a separate dedicated phase-locked loop (PLL). In addition, high-speed global clock distribution and per-channel multi-phase generation circuitry are required in a multi-channel receiver system.

This paper presents a 56-Gb/s mixed-signal PAM4 receiver that is targeted for moderate channel loss applications and addresses the aforementioned issues [12]. Section II compares different equalization configurations for operating over a moderate loss channel with low complexity. The PAM4 receiver architecture that employs a single-stage CTLE and DFE with only 1-FIR tap and 1-IIR tap to efficiently cancel long-tail ISI is detailed in Section III, along with a discussion on the bang-bang phase detector (BBPD) PLL-based CDR that recovers the clock using only one per-slice edge sampler.
Section IV describes the proposed background sampler threshold adaptation scheme that does not rely on equal PAM4-level spacing and uses only one additional per-slice sampler that periodically scans the top and bottom PAM4 eyes to compensate for nonlinearity. Section IV also discusses how, by utilizing the CDR edge samplers’ information, the DFE adaptation scheme of [13] is extended for PAM4 operation with the addition of independent per-slice tap values for mismatch robustness. Experimental results from a general purpose (GP) 65-nm CMOS prototype are presented in Section V. Finally, Section VI concludes this paper.

II. EQUALIZATION COMPARISONS

The frequency response of a representative refined electrical channel that has a moderate 20.8 dB of loss at the 14-GHz Nyquist frequency for 56-Gb/s PAM4 modulation is shown in Fig. 2(a). This smoothly decreasing response is caused by skin effect and dielectric loss, with minimal performance degradation due to reflections. Fig. 2(b) shows the channel’s non-equalized 28-GS/s pulse response, which is well characterized by a fast rising edge with one significant pre-cursor ISI term and a slow-decaying long-tail ISI on the falling edge with significant ISI terms up to the 10th post-cursor. Transmission of this un-equalized data results in a very poor bit error rate (BER), as shown in the voltage and timing margin curves of Fig. 2(c) and (d), respectively. Overall, given the heightened sensitivity of PAM4 to residual ISI, these ISI terms should be sufficiently cancelled to achieve the target BER.

A potential power-efficient approach is to not utilize any CTLE and have a DFE-only receiver with a 1-FIR tap to cancel the large first post-cursor and 2-IIR taps, with one IIR tap optimized for the fast-decaying close-in ISI and the other IIR tap optimized for the slow-decaying tail [10].
While the equalized pulse response, in this case, displays minimal post-cursor ISI, there still exists a significant pre-cursor term that is not cancelled by the DFE-only receiver. This still results in a poor BER > 10⁻³ for this moderate loss channel.

The large pre-cursor ISI term can potentially be cancelled by utilizing a 2-tap transmit FFE in combination with the DFE with 1-FIR and 2-IIR taps. However, utilizing a large negative pre-cursor tap results in both attenuation of the main cursor and additional ISI now present in the second pre-cursor position. While this improves the BER to near 10⁻⁹, forward error correction (FEC) would still generally be required in the system. Overall, achieving a better BER would require a higher complexity transmitter with more pre-cursor taps.

Fig. 2. Refined electrical channel. (a) S21 response. (b) Simulated 28-GS/s pulse responses and 56-Gb/s PAM4 (c) voltage and (d) timing margins with various equalizer configurations. Only the middle eye voltage margins are shown for clarity due to different top and bottom threshold levels changing with the varying equalizer configurations.

With the transmitter complexity fixed to a 2-tap FFE, further performance improvement is possible by introducing a receive-side CTLE. However, as discussed in Section I, relying solely on CTLE for long-tail ISI cancellation can require multiple high-bandwidth stages that employ inductive peaking and consume a large amount of area and power. Nonetheless, a CTLE with high-frequency peaking offers the benefits of cancelling pre-cursor ISI, boosting the main cursor, and cancelling close-in post-cursor ISI in a manner similar to the first DFE IIR tap.
Thus, a mixed receive-side equalization approach is investigated that consists of a single-stage CTLE, with a conservative 6-dB high-frequency peaking optimized to provide relative pre-cursor ISI attenuation, followed by a DFE with 1-FIR tap for the large first post-cursor ISI and only one IIR tap, starting at the second post-cursor, optimized for the long-tail ISI. As depicted in the equalized pulse response, although similar pre-cursor ISI cancellation is achieved, the boost in the main cursor makes the system performance less susceptible to residual ISI. The resultant bathtub curves show a BER = 10⁻¹² with timing and voltage margins of 0.22 UI and 18 mV, respectively. Note that at this performance level a 1-tap FIR, 2-tap IIR DFE-only receiver without CTLE would require a much more complex 4-tap FFE transmitter. In the 65-nm CMOS technology utilized in this paper, it is estimated that scaling a 56-Gb/s PAM4 transmitter from 2 taps to 4 taps would consume an additional 97 mW. This is much higher than the 11 mW consumed in the receive-side CTLE.

III. RECEIVER ARCHITECTURE

Based on the system analysis of Section II, a PAM4 receiver with a single-stage CTLE and a 1-tap FIR, 1-tap IIR DFE is proposed (Fig. 3). After the input CTLE, a quarter-rate DFE follows, with each of its four slices consisting of five samplers. Three data samplers implement a 2-bit flash ADC for PAM4 symbol detection, one error sampler periodically scans the top and bottom eyes for threshold tuning, and one edge sampler provides information for both CDR phase locking and DFE tap adaptation. The outputs of the four receiver slices are further deserialized to 1/8 symbol rate, with the data and edge samples driving the CDR’s PAM4 BBPD. At this point, the data samples are also probed out for external BER testing. All the data, error, and edge samples are then further deserialized to 1/32 symbol rate for processing by the DFE tap and threshold adaptation logic.

A detailed block diagram of the equalizer data path is shown in Fig.
4. Eight total quarter-rate phase clocks are used for data and edge detection, with each receiver slice operating with a single pair of data and edge clocks. This quarter-rate architecture reduces clocking power and relaxes timing of the current mode logic (CML) samplers by giving them extra time to recover from previous decisions. PAM4 symbol detection is performed with the three data samplers, with the middle sampler threshold set to zero and the top and bottom samplers’ thresholds set to ±2/3 the post-equalized amplitude of the received signal by the threshold adaptation circuitry. The error sampler is clocked by the same data clock as the main three data samplers and has a threshold that is periodically scanned to track the top and bottom PAM4 eyes in order to provide threshold adaptation information. Timing recovery and DFE adaptation information are provided by the edge sampler, whose threshold is set to zero.

The DFE FIR tap, which cancels the first post-cursor ISI, is efficiently realized by feeding back the data samplers’ 3-bit thermometer-coded output bits directly to three equally weighted summer inputs embedded in the data, error, and edge samplers. This minimizes the DFE FIR tap critical path delay to meet the stringent 1-UI timing. In order to minimize the samplers’ internal loading, the DFE IIR tap is subtracted from the sampler input with preceding CML summers. This is possible due to the DFE IIR tap starting at the second post-cursor to cancel the long-tail ISI. The IIR tap signal is generated by re-serializing the quarter-rate data samplers’ outputs and passing this full-rate data through a low-pass filter.

Fig. 3. 56-Gb/s PAM4 receiver with threshold and DFE tap adaptation.

Fig. 4. Equalizer data path.

Fig. 5. Single-stage CTLE. (a) Schematic and simulated frequency response with different (b) capacitor DAC and (c) resistor DAC settings.

Fig.
5 shows the single-stage CTLE with manually tunable 3-bit degeneration resistor and capacitor DACs. At a minimum 0-dB gain setting, tuning the capacitor provides up to 6 dB of peaking close to the 14-GHz Nyquist frequency. This bandwidth is achieved at a reasonable power efficiency by employing shunt inductive peaking. Tuning the resistor DAC provides near 6 dB of low-frequency gain control.

The choice of sampler topology is critical to achieving reliable PAM4 operation at high data rates. While strong-arm samplers [15] and modified double-tail latch versions [16] have advantages that include no dc power, small aperture time, high gain, and CMOS-level outputs, their multi-stage implementation can increase delay. Conversely, single-stage CML samplers [17] can provide higher bandwidth and reduced delay, while suffering from reduced gain and static power consumption. Given that this design is targeted for an aggressive 56-Gb/s data rate for the 65-nm CMOS technology, CML samplers are chosen.

Fig. 6. CML sampler with DFE FIR-tap and threshold control (a) schematic, (b) simulated normalized FIR-tap offset weight versus differential input amplitude, (c) FIR tap weight and offset versus DAC code, and (d) impulse sensitivity function.

Fig. 6(a) shows the schematic of the CML slicer with the embedded DFE FIR tap and additional threshold/offset control pairs. In order to avoid noise propagation in the DFE loop, the input signal must be amplified sufficiently to make a true decision and fully switch the CML slicer DFE FIR tap pairs [18]. As shown in Fig. 6(b), the CML sampler requires a 14-mV differential input amplitude to achieve switching in the feedback tap pairs equivalent to 90% of the DFE tap weight at 56 Gb/s.
Independent DFE FIR-tap weights set the tail currents with 6-bit resolution on a per-slice basis to compensate for the mismatch between the receiver slices, achieving more than 150 mV of range [Fig. 6(c)]. The sampler threshold and offset are controlled through the DAC-generated Voff/Vth voltage with 7-bit (1-bit sign, 6-bit amplitude) resolution and a maximum range of more than 250 mV [Fig. 6(c)]. The sampler has a simulated 11-ps aperture time, obtained from the impulse sensitivity function shown in Fig. 6(d).

Fig. 7 shows the single-IIR MUX that combines the thermometer quarter-rate data from all the slices and serializes it to full rate using a current-mode architecture. A tunable RC load implements the IIR filter, with the time constant controlled from 22 to 175 ps through the coarse-tuning 3-bit resistor DAC and the fine-tuning 3-bit capacitor DAC. IIR tap amplitude control is achieved with the tunable tail current. Per-slice summers combine the IIR tap signal with the CTLE output. The input pair that is driven by the CTLE is degenerated to achieve the required linear range, with additional −12- to 6-dB gain control provided by the tunable degeneration resistor. As the IIR tap cancellation starts from the second post-cursor and has a relatively small amplitude, its input pair does not require degeneration. Here, the IIR summation is done on a per-slice basis to minimize the summer-to-slicer routing and isolate slicer kickback. At the default settings of 6-dB CTLE peaking and 0-dB summer gain, the entire receiver front end has a simulated 1-dB compression point of 324-mV differential input amplitude. This is sufficient to handle more than 600 mVppd of input swing.

The PLL-based CDR shown in Fig. 8 provides a power-efficient solution to both generate the eight quarter-rate clocks and adjust their phase to track the incoming data.
A BBPD receives the 1/8 rate data and edge samples and filters out all but the symmetric transitions to avoid asymmetric PAM4 transition-induced jitter [19]. In order to reduce loop latency, the BBPD works with eight parallel early/late signals controlling an eight-segment charge pump. This parallel charge pump drives the loop filter to produce the control voltage for a 14-GHz LC voltage-controlled oscillator (VCO) [20]. In addition to the primary resonator tank, oscillator phase noise is reduced with tanks also in the source of both cross-coupled transistor pairs [21]. After the VCO, quarter-rate clocks are generated by a CML divide-by-two block and then converted to CMOS levels. Static CMOS phase interpolators then efficiently generate the eight clock phases for the quarter-rate data and edge samplers. Per-phase skew calibration is achieved with tunable delay buffers preceding the samplers. The CDR loop bandwidth is tuned through the loop filter’s 4-bit capacitor DAC and the charge pump’s 3-bit current DAC. Minimal jitter tolerance peaking is achieved with damping factor control provided by the loop filter’s 4-bit resistor DAC.

Fig. 7. DFE IIR-tap MUX, filter, and per-slice input summer.

Fig. 8. PAM4 PLL-based CDR architecture.

IV. THRESHOLD AND DFE TAP ADAPTATION

Given that PAM4 receiver sampler thresholds and equalization settings can vary with channel conditions, adaptive tuning is necessary to support operation over a wide range of channels. This process is further complicated in the presence of transmitter and/or front-end non-linearity. This section describes the background sampler threshold adaptation scheme that utilizes only one additional per-slice sampler and the DFE tap adaptation scheme that utilizes the CDR edge samplers’ information.

Fig. 9. Background sampler threshold adaptation algorithm.

A.
Sampler Threshold Adaptation

Transmitter- and receiver-side non-linearity, which is often compressive, can result in margin reduction for the top and bottom eyes. This necessitates increased accuracy in the top/bottom eye threshold placement relative to the nominally zero threshold center eye. Threshold adaptation is achieved with the error sampler periodically estimating the top and bottom eye heights in order to place the data thresholds TH1,3 in the middle of these eyes.

An initial foreground calibration step is performed where all 20 samplers are set to zero offset/threshold by shorting the input to the common mode and adjusting the per-sampler Voff/Vth DAC codes. On a per-slice basis, the top sampler’s threshold TH3 is then incremented up by 1 LSB (Error Offset 1) to come to the initial condition shown in Fig. 9. The initial coarse adaptation steps are based on uniform symbol statistics, with both the top sampler TH3 and error sampler THER thresholds increased until the error sampler detects a one with 25% probability. Also, in parallel, the bottom sampler TH1 is stepped at the same rate in an open-loop manner to improve convergence speed. The data statistics are computed by averaging 256 symbols per slice for each threshold step decision. This value is selected to minimize convergence time with minimal threshold code dithering.

At the end of State 1, the error sampler THER,1 is residing at the bottom of the top eye and the top sampler TH3 is 1 LSB inside the eye. Next, the polarity of the error sampler threshold is inverted and then fine-tuned to converge to the top of the bottom eye based on a 25% zero-detection criterion, with the bottom sampler TH1 following the error sampler THER,2 by −1 LSB difference (State 2). This independent top and bottom threshold tuning eliminates errors caused by PAM4 asymmetry and level spacing mismatch.
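
The State 1 coarse step described above can be illustrated with a small behavioral model. This is only a sketch, not the synthesized adaptation logic: the ideal PAM4 levels, the Gaussian noise, the LSB size, the downward mirroring of TH1, and the names `ones_fraction` and `coarse_adapt` are all assumptions made for illustration. Only the 256-symbol averaging window, the 25% target, the 1-LSB offset of TH3, and the 6-bit amplitude code range come from the text.

```python
import random

PAM4_LEVELS = (-3.0, -1.0, 1.0, 3.0)  # assumed ideal levels, arbitrary units
WINDOW = 256        # symbols averaged per step decision (value from the text)
TARGET_ONES = 0.25  # 25% ones-detection target (value from the text)


def ones_fraction(threshold, samples):
    """Fraction of samples the error sampler resolves as 1 (above threshold)."""
    return sum(s > threshold for s in samples) / len(samples)


def coarse_adapt(lsb=0.05, noise_sigma=0.1, max_code=63, seed=0):
    """Step the error-sampler code upward until about 25% ones are detected.

    Returns (th_er, th3, th1) in the same arbitrary units as PAM4_LEVELS.
    """
    rng = random.Random(seed)
    code = 0  # all thresholds start at zero after foreground calibration
    while code < max_code:
        # One averaging window of uniformly distributed, noisy PAM4 symbols.
        window = [rng.choice(PAM4_LEVELS) + rng.gauss(0.0, noise_sigma)
                  for _ in range(WINDOW)]
        if ones_fraction(code * lsb, window) <= TARGET_ONES:
            break  # only the top level is above: bottom of the top eye reached
        code += 1
    th_er = code * lsb
    th3 = (code + 1) * lsb  # top data sampler stays 1 LSB above the error sampler
    th1 = -code * lsb       # assumption: bottom sampler mirrors the steps downward
    return th_er, th3, th1
```

With uniform symbol statistics, half of the symbols exceed a zero threshold, and the fraction falls to one quarter only once the threshold clears the second-highest level, which is why the 25% criterion locates the bottom edge of the top eye.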
In order not to rely on uniform statistics, the process then transitions to monitoring the relative values of the error sampler and the bottom/top samplers to track the eye edges in States 3 and 4, respectively. It should be noted that at the end of the first State 4, the top and bottom slicers are in a sub-optimal position inside the eye. While ideally the top and bottom samplers should be following the error sampler with ±1/2 eye height, respectively, in States 4 and 3, due to the lack of eye height estimation at this point, there is only ±1 LSB difference.

Next, in order to get an estimation of the top eye height in State 5, the data samplers’ thresholds are fixed and the error sampler THER,5 is increased until the discrepancy is detected between the error sampler and the top sampler outputs. This implies that the error sampler THER,5 has reached the top edge of the top eye at the end of State 5. This is repeated to find the error sampler THER,6 that corresponds to the bottom of the bottom eye at the end of State 6. At this point, the top and bottom eye heights are now independently found. This results in the following optimum top and bottom threshold settings:

$$TH_1 = \frac{TH_{ER,3} + TH_{ER,6}}{2} \tag{1}$$

$$TH_3 = \frac{TH_{ER,4} + TH_{ER,5}}{2}. \tag{2}$$

Fig. 10. PAM4 DFE FIR and IIR-tap adaptation logic tables.

Next, the TH1 and TH3 thresholds are placed in the middle of their corresponding eye when the process goes back to States 3 and 4 for monitoring of the top of the bottom eye and bottom of the top eye, respectively. The algorithm then periodically rotates between States 3–6 to track eye height and optimal threshold position. While not implemented in this prototype, the middle eye threshold could also be adjusted with no front-end hardware overhead by adding extra middle eye monitoring states in the threshold adaptation algorithm.
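
As a worked illustration of (1) and (2), the short Python sketch below places each data threshold midway between the two eye edges found by the error sampler. The function name `data_thresholds` and the millivolt values are hypothetical and are chosen only to show that the two thresholds need not be symmetric about zero when the eyes are unequal.

```python
def data_thresholds(th_er3, th_er4, th_er5, th_er6):
    """Place TH1 and TH3 in the middle of the bottom and top eyes.

    th_er3: top edge of the bottom eye (State 3)
    th_er4: bottom edge of the top eye (State 4)
    th_er5: top edge of the top eye (State 5)
    th_er6: bottom edge of the bottom eye (State 6)
    """
    th1 = (th_er3 + th_er6) / 2  # Eq. (1)
    th3 = (th_er4 + th_er5) / 2  # Eq. (2)
    return th1, th3


# Hypothetical eye edges in mV: the bottom eye is taller than the top eye.
th1, th3 = data_thresholds(th_er3=-70.0, th_er4=70.0, th_er5=110.0, th_er6=-120.0)
# th1 = -95.0 mV, th3 = 90.0 mV
```

A scheme that forced TH1 = −TH3 would misplace one of the two thresholds in this example by several millivolts, which is the error that the independent top and bottom eye scans avoid.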
Due to the reduced non-linearity present in the smaller-level middle eye, adapting the middle eye threshold is not as critical as the top and bottom eye threshold placement.

B. DFE FIR and IIR Tap Adaptation

The edge-based DFE tap background adaptation logic tables are shown in Fig. 10; they are modified from [13] to allow for PAM4 operation and independent per-slice DFE FIR-tap control. Similar to the BBPD logic, the DFE tap adaptation works with symmetric PAM4 data transitions in order to improve convergence. When a symmetric transition is detected, the correlation between the edge sample and the sign of the previous symbols determines the residual ISI polarity from the corresponding symbol. Adjusting the DFE tap weights based on this edge information works to maximize the horizontal timing margin, which correlates with improved vertical eye height.

Fig. 11. Chip micrograph of the 56-Gb/s PAM4 receiver.

Given that the CTLE is optimized for pre-cursor cancellation, the DFE FIR-tap is adapted to cancel the large first post-cursor ISI and the DFE IIR-tap, which starts with an additional 1-UI delay, is adapted to cancel the long-tail ISI that is considered to start at the second post-cursor location. As the DFE FIR-tap cancels the first post-cursor, if the D₋₁ symbol polarity matches the edge sample ISI polarity, this implies that the tap value is too small and the FIR-tap counter is incremented, and vice versa. As PAM4 receivers require improved sensitivity, independent per-slice adaptation is implemented for the DFE FIR-taps to compensate for mismatch in the four receiver slices. The DFE IIR-tap amplitude is set in a similar manner utilizing the D₋₂ polarity, as this IIR tap compensates for long-tail ISI after the first post-cursor. Adjustment of the DFE IIR-tap time constant is determined by the correlation between either D₋₃ or D₋₄ and the edge sample. The use of one common DFE IIR-tap MUX allows for the adaptation of only a single set of IIR values.
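
The FIR-tap counter rule above can be sketched as a sign-correlation vote. This is a hedged illustration rather than the logic tables of Fig. 10, which are not reproduced here: the edge sample ISI polarity is assumed to arrive already decoded as ±1 for each detected symmetric transition, and the names `update_fir_code` and `adapt_slices` are hypothetical. The 6-bit code range follows the FIR-tap resolution stated in Section III.

```python
def update_fir_code(code, votes, max_code=63):
    """Return the next FIR-tap code for one slice.

    votes: iterable of (edge_isi_sign, d1_sign) pairs, each +1 or -1,
           one pair per symmetric transition in the averaging window.
    """
    corr = sum(edge * d1 for edge, d1 in votes)
    if corr > 0:
        code += 1  # polarities mostly match: tap too small, increment
    elif corr < 0:
        code -= 1  # polarities mostly differ: tap too large, decrement
    return max(0, min(max_code, code))  # clamp to the 6-bit code range


def adapt_slices(codes, votes_per_slice):
    """Update each slice independently so slice mismatch is tracked per tap."""
    return [update_fir_code(c, v) for c, v in zip(codes, votes_per_slice)]
```

Accumulating the votes over a window before moving the code by one step is what limits code dithering; a zero net correlation leaves the code unchanged.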
Due to the increased sensitivity of PAM4 to residual ISI, a larger 6-bit resolution is used for all the DFE FIR tap weight, IIR amplitude, and IIR time constant settings relative to a previous NRZ receiver that utilized 5-bit resolution [13]. For each code step, the DFE tap adaptation logic makes decisions based on the correlation between edge and data samples averaged over 256 symbols per slice for the FIR settings and 1024 symbols for the IIR weight and time constant settings. Similar to the threshold adaptation procedure, these values are selected to minimize convergence time with minimal code dithering.

V. EXPERIMENTAL RESULTS

Fig. 11 shows the chip micrograph of the PAM4 receiver, which was fabricated in a GP 65-nm CMOS process and occupies a total active area of 0.51 mm². The CTLE is placed close to the bottom right input pads. At the CTLE output is the DFE circuitry that is followed by the deserialization logic where the data signals are multiplexed out of the chip for BER testing. The CDR circuitry is above this, with the LC VCO placed roughly in the middle of the chip and the clock divider and phase generator placed near the DFE to minimize routing. In addition to the clocks from the CDR, the receiver can also be tested via an external bypass clock to measure timing margin. This clock comes in from the bottom of the chip through an inductively peaked clock buffer. As the synthesized threshold and DFE tap adaptation logic runs at a relatively low speed, it is placed at the top of the chip. An on-chip DAC is included to provide an analog monitor of the sampler thresholds and DFE tap coefficient adaptation convergence behavior.

Fig. 12. PAM4 receiver test setup and channel responses.

The receiver is characterized utilizing two test channels in the experimental setup shown in Fig. 12.
Channel 1 consists of a 2-in FR4 test channel trace and the 1-in Rogers RX board trace, with SMA cables between the pattern generator and boards and three total sets of SMA connectors. Channel 2 has the same components, except for a longer 4-in FR4 test channel trace. A PAM4 pattern generator with 1-main and 1-pre-cursor FFE taps generates PRBS15 data which passes through Channel 1 and Channel 2 with 16.1 and 20.8 dB of loss at 14 GHz, respectively. The on-die 1/8 rate data MUX at the receiver output allows for independent verification of the MSB or LSB outputs with an NRZ BER tester. In order to measure timing bathtub curves, the CDR is bypassed and the receiver is clocked with an external half-rate clock from a pattern generator. This allows for a programmable phase shift to measure the BER at different sampling times.

Fig. 13(a) shows the transmitter PAM4 pre-channel eye diagram without any equalization with 600-mVppd swing. Co-optimizing the 2-tap pre-cursor FFE with the receiver equalization results in a completely closed eye at the output of Channel 2 [Fig. 13(b)]. For all the subsequent reported testing results, the receiver CTLE is set to have 0-dB dc gain and 6-dB high-frequency peaking.

Fig. 13. (a) 56-Gb/s eye diagram before Channel 2 without equalization and (b) after Channel 2 with 2-tap pre-cursor FFE.

Utilizing the on-chip monitor DAC, Fig. 14 shows the DFE tap coefficients and sampler thresholds convergence for both channels. All of the DFE taps settle within 2 µs, with the higher loss Channel 2 settings displaying both higher FIR and IIR tap values. Both channels have FIR[1:4] tap values that are slightly different due to mismatches in the receiver slices and residual skew between the sampling clock phases. The initial threshold adaptation procedure completes within 16 µs, with the higher loss Channel 2 having lower absolute threshold values due to the reduced swing at the sampler input.
After the initial convergence, the error samplers continue to scan the top and bottom eyes for background threshold optimization.

The CDR bypass mode is used to measure the receiver’s combined MSB/LSB BER timing bathtub curves with various receiver equalization configurations for Channels 1 and 2, as shown in Fig. 15(a) and (b), respectively. Utilizing a DFE is a necessity, as both channels display poor BER with the CTLE-only configuration. While an optimized combination of CTLE and a DFE with only the FIR tap enabled allows a BER < 10⁻⁷ for Channel 1, the BER is worse than 10⁻² for Channel 2 due to significant ISI from the second post-cursor and beyond. Enabling the DFE IIR tap allows for efficient long-tail ISI cancellation to achieve 0.22 UI and 0.19 UI timing margin at a BER = 10⁻¹² for Channels 1 and 2, respectively. It should be noted that the transmitter-side FFE still plays a critical role in pre-cursor cancellation, as only a BER > 10⁻⁷ is achieved for Channel 2 without transmitter FFE.

Fig. 14. Measured DFE tap adaptation operating over (a) Channel 1 and (b) Channel 2, and measured sampler threshold adaptation operating over (c) Channel 1 and (d) Channel 2. The edge sampler values are omitted and only error sampler 1 is shown for clarity.

Fig. 15. Measured 56-Gb/s receiver timing bathtub curves operating over (a) Channel 1 and (b) Channel 2 and receiver voltage bathtub curves utilizing a 2-tap TX FFE and RX CTLE and 1-tap FIR 1-tap IIR RX DFE operating over (c) Channel 1 and (d) Channel 2.

Utilizing the maximum receiver equalization configuration, the CDR is then enabled. This involves initially coarsely setting the VCO frequency by manually forcing the loop filter output voltage through an analog MUX and then allowing the closed-loop PLL-based CDR to achieve phase lock. The data sampler thresholds are then manually adjusted from their converged values to measure the voltage bathtub curves of Fig. 15(c) and (d). Worst eye voltage margins of 23 and 14 mV are achieved at a BER = 10⁻¹² for Channel 1 and Channel 2, respectively.

Jitter tolerance measurements were also performed using the high-loss Channel 2 with the CDR enabled and the background equalization and threshold adaptation running. Fig. 16 shows that the CDR has more than 6 MHz of bandwidth with 0.12 UI of high-frequency jitter tolerance at a BER = 10⁻⁹. This exceeds the CEI-56G-VSR mask with margin, as the specification only requires a BER = 10⁻⁶.

TABLE I. PAM4 RECEIVER PERFORMANCE COMPARISONS

Fig. 16. Measured PAM4 jitter tolerance (BER = 10⁻⁹) operating over Channel 2.

Fig. 17. 56-Gb/s receiver power breakdown.

Fig. 17 shows the 56-Gb/s receiver power breakdown. The receiver consumes 259 mW of power, with the CML comparators and clocking circuits contributing the most. Table I summarizes the receiver performance and compares it with other PAM4 receivers operating over 32 Gb/s. The receiver achieves a power efficiency of 4.63 mW/Gb/s, which is superior to the ADC-based design of [5], and the mixed-signal front end of [7] that utilizes a two-stage CTLE and an additional TX FFE tap. Employing the DFE IIR-tap allows for a reduction in the total tap count relative to [6], while also extending the maximum supported channel loss. Compared to [11], the presented work extends the maximum achievable data rate in a similar process.

VI. CONCLUSION

This paper has presented a 56-Gb/s PAM4 quarter-rate receiver that efficiently compensates for moderate channel loss in a robust manner through background adaptation of the receiver thresholds and equalization taps.
Combining a CTLE and a DFE with only one FIR tap and one IIR tap allows for low receive-side equalization complexity and operation with only a 2-tap transmit-side FFE. In addition to the three main per-slice data samplers, a single edge-based sampler is utilized for both CDR phase detection and DFE tap adaptation with independent per-slice values for the required PAM4 sensitivity. Sampler threshold adaptation is also achieved with a single per-slice error sampler that periodically scans the top and bottom PAM4 eyes. Overall, the proposed PAM4 receiver architecture enables transmission over channels with up to 20.8 dB of loss at Nyquist with only a 2-tap pre-cursor transmitter-side FFE, while improving the power efficiency compared to the state-of-the-art receivers operating at similar data rates over channels with comparable loss.

ACKNOWLEDGMENT

The authors would like to thank Texas Instruments for providing laboratory equipment access and S. Finn, P. Crinion, T. K. Chin, W. Haque, K. Jakoush, and A. Rane for measurement assistance.

REFERENCES

[1] CEI-56G-VSR_PAM4 Very Short Reach Interface, document OIF 2014.230.07, Optical Internetworking Forum, Jun. 2016.
[2] IEEE P802.3bs 200 Gb/s and 400 Gb/s Ethernet Task Force. Accessed: Nov. 2016. [Online]. Available: http://www.ieee802.org/3/bs/
[3] P. Upadhyaya et al., “A fully adaptive 19-to-56Gb/s PAM-4 wireline transceiver with a configurable ADC in 16nm FinFET,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2018, pp. 108–110.
[4] S. Kiran, S. Cai, Y. Luo, S. Hoyos, and S. Palermo, “A 32 Gb/s ADC-based PAM-4 receiver with 2-bit/stage SAR ADC and partially-unrolled DFE,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2018, pp. 1–4.
[5] Y. Frans et al., “A 56-Gb/s PAM4 wireline transceiver using a 32-way time-interleaved SAR ADC in 16-nm FinFET,” IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 1101–1110, Apr. 2017.
[6] J. Im et al., “A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET,” IEEE J. Solid-State Circuits, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
[7] P.-J. Peng, J.-F. Li, L.-Y. Chen, and J. Lee, “A 56 Gb/s PAM-4/NRZ transceiver in 40 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2017, pp. 110–111.
[8] O. Elhadidy, A. Roshan-Zamir, H.-W. Yang, and S. Palermo, “A 32 Gb/s 0.55 mW/Gbps PAM4 1-FIR 2-IIR tap DFE receiver in 65-nm CMOS,” in Proc. Symp. VLSI Circuits, Jun. 2015, pp. C224–C225.
[9] S. Shahramian and A. C. Carusone, “A 0.41 pJ/Bit 10 Gb/s hybrid 2 IIR and 1 discrete-time DFE tap in 28 nm-LP CMOS,” IEEE J. Solid-State Circuits, vol. 50, no. 7, pp. 1722–1735, Jul. 2015.
[10] A. Roshan-Zamir, O. Elhadidy, H.-W. Yang, and S. Palermo, “A reconfigurable 16/32 Gb/s dual-mode NRZ/PAM4 SerDes in 65-nm CMOS,” IEEE J. Solid-State Circuits, vol. 52, no. 9, pp. 2430–2447, Sep. 2017.
[11] L. Tang, W. Gai, L. Shi, X. Xiang, K. Sheng, and A. He, “A 32 Gb/s 133 mW PAM-4 transceiver with DFE based on adaptive clock phase and threshold voltage in 65 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2018, pp. 114–116.
[12] A. Roshan-Zamir et al., “A 56 Gb/s PAM4 receiver with low-overhead threshold and edge-based DFE FIR and IIR-tap adaptation in 65 nm CMOS,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2018, pp. 1–4.
[13] S. Shahramian, B. Dehlaghi, and A. C. Carusone, “A 16 Gb/s 1 IIR + 1 DT DFE compensating 28 dB loss with edge-based adaptation converging in 5 µs,” in IEEE ISSCC Dig. Tech. Papers, Jan./Feb. 2016, pp. 410–411.
[14] A. Roshan-Zamir, O. Elhadidy, H. W. Yang, and S. Palermo, “A 16/32 Gb/s dual-mode NRZ/PAM4 SerDes in 65 nm CMOS,” in Proc. IEEE Compound Semiconductor Integr. Circuit Symp. (CSICS), Oct. 2016, pp. 1–4.
[15] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703–1714, Nov. 1996.
[16] E. Mensink, D. Schinkel, E. A. M. Klumperink, E. van Tuijl, and B. Nauta, “Power efficient gigabit communication over capacitively driven RC-limited on-chip interconnects,” IEEE J. Solid-State Circuits, vol. 45, no. 2, pp. 447–457, Feb. 2010.
[17] M. Mizuno et al., “A GHz MOS adaptive pipeline technique using MOS current-mode logic,” IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 784–791, Jun. 1996.
[18] Y. Lu and E. Alon, “Design techniques for a 66 Gb/s 46 mW 3-tap decision feedback equalizer in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.
[19] N. Qi et al., “A 51Gb/s, 320mW, PAM4 CDR with baud-rate sampling for high-speed optical interconnects,” in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2017, pp. 89–92.
[20] J. Lee and B. Razavi, “A 40-Gb/s clock and data recovery circuit in 0.18-µm CMOS technology,” IEEE J. Solid-State Circuits, vol. 38, no. 12, pp. 2181–2190, Dec. 2003.
[21] E. Hegazi, H. Sjoland, and A. A. Abidi, “A filtering technique to lower LC oscillator phase noise,” IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 1921–1930, Dec. 2001.

Ashkan Roshan-Zamir (S’14) received the B.Sc. and M.Sc. degrees in electrical engineering from the University of Tehran, Tehran, Iran, in 2010 and 2013, respectively, and the Ph.D. degree in electrical engineering from Texas A&M University, College Station, TX, USA, in 2018. He was a Design Intern with Samsung Semiconductor Inc., San Jose, CA, USA, in 2015, where he was involved in the design of clock and data recovery systems. He was a Design Intern with Texas Instruments Incorporated, Duluth, GA, USA, in 2017, where he was involved in the design of integrated circuits for high-speed wireline communication. Since 2018, he has been a Design Engineer with Texas Instruments Incorporated, Santa Clara, CA, USA, where he is involved in designing integrated mixed-signal circuits and circuits for wireline and optical communication.
His current research interests include analog and mixed-signal integrated circuits, high-speed circuits for electrical and optical communication, and clock and data recovery circuits.

Takayuki Iwai received the B.Eng. and M.Eng. degrees from Waseda University, Tokyo, Japan, in 2004 and 2006, respectively. In 2006, he joined Toshiba Corporation, Kawasaki, Japan, where he was involved in the design of embedded DRAM, stacked chip SoC DRAM, and high-speed I/O circuits. From 2016 to 2018, he was a Visiting Scholar with Texas A&M University, College Station, TX, USA, where he was involved in research on high-speed electrical and optical transceiver circuits. Since 2017, he has been with Toshiba Memory Corporation, Kawasaki. His current research interests include high-speed analog and mixed-signal integrated circuits, high-speed electrical and optical transceiver circuits, and high-speed clocking circuits.

Yang-Hang Fan received the B.S. degree in engineering and system science and the M.S. degree from the Institute of Electronics Engineering, National Tsing Hua University, Hsinchu, Taiwan, in 2007 and 2009, respectively. He is currently pursuing the Ph.D. degree in electrical engineering with Texas A&M University, College Station, TX, USA. From 2011 to 2015, he was with Faraday Technology, Hsinchu, Taiwan, where he worked on the design of mixed-signal integrated circuits for high-speed wireline communication. Since 2016, he has been a Research Assistant with the Analog and Mixed Signal Center, Texas A&M University. Since 2018, he has been an Intern with Hewlett Packard Enterprise, Palo Alto, CA, USA. His current research interests include mixed-signal integrated circuits and high-speed electrical and optical link circuits.

Ankur Kumar (S’18) received the B.E. degree (Hons.) in electrical and electronics engineering and the M.Sc. degree (Hons.) in mathematics from the Birla Institute of Technology and Science, Pilani, India, in 2014. He is currently pursuing the Ph.D. degree in electrical engineering with Texas A&M University, College Station, TX, USA. In 2014, he was a Design Intern with STMicroelectronics Pvt. Ltd., Greater Noida, India. From 2014 to 2016, he was a Senior Systems Engineer with Hewlett Packard Enterprise, Bangalore, India. In 2016, he joined Texas A&M University. Since 2018, he has been a Design Intern with Texas Instruments Incorporated, Duluth, GA, USA, where he is involved in the design of integrated circuits for high-speed wireline communication. His current research interests include the design of high-speed and low-power circuits for electrical and optical communication and clock and data recovery circuits.

Hae-Woong Yang (S’13) was born in Seoul, South Korea. He received the B.S. and M.E. degrees in electrical and computer engineering from Texas A&M University, College Station, TX, USA, in 2007 and 2009, respectively, and the Ph.D. degree from the Analog and Mixed Signal Center, Texas A&M University, in 2018. His interests are in low-power high-speed electrical link circuits, clock generation circuits, and signal integrity. Dr. Yang was a co-recipient of the Student Best Paper Award in the 2014 Midwest Symposium on Circuits and Systems.

Lee Sledjeski received the B.S. degree in electrical engineering from the University of Connecticut, Storrs, CT, USA, in 1989. He is currently a Staff Applications Engineer with Texas Instruments (TI) Incorporated, Duluth, GA, USA, and is heavily involved in high-speed interfaces and direct customer support. He works closely with other engineers at TI to develop signal conditioning solutions for multi-gigabit communication and storage standards.

John Hamilton, photograph and biography not available at the time of publication.

Soumya Chandramouli (S’05–M’08) received the B.S. degree in electrical and computer engineering (summa cum laude) from Lafayette College, Easton, PA, USA, in 2002, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2004 and 2007, respectively. In 2008, she joined the Interface Group, National Semiconductor, Duluth, GA, USA, as an Analog Circuit Designer. She is currently an Analog Design Manager and a Design Lead in the high-speed signal conditioning product line and a Member, Group Technical Staff, with Texas Instruments Incorporated, Duluth. She has authored or co-authored 15 conference and journal papers. She holds three patents. Dr. Chandramouli was a recipient of the MTT Undergraduate Research Scholarship in 2001. She has served as a reviewer for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS: EXPRESS BRIEFS.

Arlo Aude (S’93–M’95) received the Bachelor of Science degree in electrical engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 1995. He worked with Harris Semiconductor, Melbourne, FL, USA, as a Test Engineer. He is currently a Senior Member of Technical Staff and a Design Technologist for the High-Speed Signal Conditioning Group, Texas Instruments Incorporated, Atlanta, GA, USA. He is an Eagle Scout with The Boy Scouts of America, Tampa, FL, USA, and an alumnus of Psi Upsilon Fraternity. He is the inventor or co-inventor of over 35 patents in many disciplines. He has authored or co-authored over 12 journal and industry papers.

Samuel Palermo (S’98–M’07–SM’17) received the B.S. and M.S. degrees in electrical engineering from Texas A&M University, College Station, TX, USA, in 1997 and 1999, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2007. From 1999 to 2000, he was with Texas Instruments Incorporated, Dallas, TX, USA, where he worked on the design of mixed-signal integrated circuits for high-speed serial data communication. From 2006 to 2008, he was with Intel Corporation, Hillsboro, OR, USA, where he worked on high-speed optical and electrical I/O architectures. In 2009, he joined the Electrical and Computer Engineering Department, Texas A&M University, where he is currently an Associate Professor. His research interests include high-speed electrical and optical interconnect architectures, RF photonics, high-performance clocking circuits, and integrated sensor systems. Dr. Palermo is a member of Eta Kappa Nu. He has also previously served as a Distinguished Lecturer for the IEEE Solid-State Circuits Society and on the IEEE CASS Board of Governors. He was a recipient of the 2013 NSF-CAREER Award, the Texas A&M University Department of Electrical and Computer Engineering Outstanding Professor Award in 2014, the Best Student Paper at the 2014 Midwest Symposium on Circuits and Systems, the Engineering Faculty Fellow Award in 2015, and the Best Student Paper at the 2016 Dallas Circuits and Systems Conference. He was a co-recipient of the Jack Raper Award for Outstanding Technology-Directions Paper at the 2009 International Solid-State Circuits Conference. He was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–II. He is currently an Associate Editor of IEEE SOLID-STATE CIRCUITS LETTERS.