Circuit Design for a 2.2 GByte/s Memory Interface
Stefanos Sidiropoulos
Work done at Rambus Inc. with A. Abhyankar, C. Chen, K. Chang, TJ Chin, N. Hays, J. Kim, Y. Li, G. Tsang, A. Wong, D. Stark

Increasing Chip I/O Bandwidth
- Computers:
  - Main memory: SDRAM100 (100 Mbps) -> RDRAM (0.8-1.1 Gbps)
  - Peripherals: PCI (66 Mbps) -> Infiniband (2.5 Gbps)
- Networks:
  - Physical front end:
    - LAN: Fast-Eth (100 Mbps) -> Gigabit-Eth (1 Gbps)
    - WAN: OC-12 (625 Mbps) -> OC-48 (2.4 Gbps)
  - Switch fabric: 625 Mbps -> 2.5 Gbps

Outline
- Overview
- Timing Methods
- Signaling Methods
- Timing Circuits
- Signaling Circuits
- Results

Main Issues
[Figure: Tx -> Channel (PCB, coax, fiber) -> Rx]
- Drive and capture signals at the correct time
  - Bit times are as small as 2-3 gate delays
- Send and receive signals robustly
  - Noise is a large fraction of the signal
[Figure: example bit-sequence waveform, annotated < 400-mV and < 1-ns]

Timing Architectures
- Synchronous: same frequency and phase
  - Conventional busses
  - Conventional memories
- Mesochronous: same frequency, unknown phase
  - Fast memories/busses
  - MP networks
  - Interconnection networks
- Plesiochronous: almost the same frequency
  - Network front-end
  - Router core
[Figure labels: common clock F0 with delays t, t (synchronous); F0 with tA ≠ tB (mesochronous); F1 ≈ F2 (plesiochronous)]

Synchronous Systems
[Figure: PLL/DLL between system clock CKX and on-chip clock CKC; data input DI; on-chip logic; timing diagram of CKX, DI, CKC]
- On-chip clock is a multiple of system clock:
  -> Synthesize on-chip clock frequency
- On-chip clock phase varies:
  -> Cancel clock buffer delay

Mesochronous Systems
[Figure: source clock CKSRC, PLL/DLL, receive clock CKRCV, rcvr, data, logic, ref; timing diagram of CKSRC, data D0-D3, CKRCV]
- Position the on-chip sampling clock at the optimal point, i.e. maximize “timing” margin

Plesiochronous Systems
[Figure: rcvr, logic, CKR, DIN, CRC; timing diagram of DIN (D0, D1) and CKR]
- Recover incoming data fundamental frequency
- Position sampling clock at the “optimal” point

Signaling
- Send and receive the data impaired by noise:
  - Independent noise sources:
    - Thermal and uncorrelated system noise
  - Proportional noise sources:
    - Reflections, cross-talk, signal-return noise
[Figure: driver options (High Impedance, Low Impedance) and receiver options (Differential, Single Ended with shared reference VS/2)]

Outline
- Background
- Timing Circuits
- Signaling Circuits
- Results

Rambus Memory Channel
[Figure: Controller and Clk Gen driving two strings of memories M1, M2 ... M16; clocks CTM and CFM; data D0, D1, D2; label "24"]
- 1.6-GB/s (800 Mbps/pin):
  - Current mode signaling
  - Source synchronous clocking

Increasing System Performance
- Increase transfer rate:
  - System Clock: 400 -> 533 MHz (800 -> 1066 Mbps/pin)
  - Peak Bandwidth: 1.6 -> 2.2 GB/s
- Challenges:
  - Timing Margin
  - Device Variations
  - Channel Imperfections
  - Voltage Errors
  - Bus Hand-off

Prototype DRAM Interface Chip
Technology:    0.25-µm, 2.5-V CMOS
Supply:        1.8-V
Active Area:   11.2 x 1.3 mm2
Package:       LGA, µBGA
Chip Includes: T/R DLL
               2 data bytes, 1 address byte
               Packet protocol logic
               18 KB SRAM

Outline
- Background
- Timing Circuits
  - Requirements
  - Architecture
  - Timing Error Sources
- Signaling Circuits
- Results

RDRAM Timing Circuit Requirements
[Figure: DLL producing TCLK and RCLK; clocks CTM and CFM; 8-bit DQA, DQB and RQ paths; timing diagrams for DQ and DQ/RQ with data D0-D3 against TCLK and RCLK]

PLLs vs DLLs
[Figure: PLL (VCO, ÷N, PD, Filter, ref clk, clk) and DLL (VCDL, PD, Filter, ref clk, clk)]
- PLL: second/third order loop:
  -> Stability is an issue
  -> Frequency synthesis easy
  -> Ref. clk jitter gets filtered
  -> Phase error accumulates
- DLL: first order loop:
  -> Stability guaranteed
  -> Frequency synthesis problematic
  -> Ref. clk jitter propagates
  -> Phase error does not accumulate

Supply Noise: DLL vs PLL
[Figure: 6-stage DLL vs 6-stage PLL; phase error (deg.) vs time (ns), 0 to 1500 ns; curves for DLL, PLL BW 20 MHz and PLL BW 5 MHz, with peaks marked DLL-pk and PLL-pk]
* Supply sensitivity: 0.1%-delay/%-supply/element
- No need for clock multiplication
  -> use a DLL

Conventional DLL
[Figure: delay line with clk, ref clk and PD]
- Limited phase acquisition range
  -> Generate delay by using phase interpolation

Variable Phase Interpolation
$\Theta = \dfrac{(N - w)\,\phi + w\,\psi}{N}, \quad w = 0 \ldots N$
[Figure: phasors φ, ψ, Θ with neighbors φ', ψ'; phase sets φ0-φ3 and ψ0-ψ3]
- If φ, ψ selectively span 2π:
  -> Can generate any Θ
- φ, ψ can be generated by a DLL

RDRAM Delay Buffers
[Figure: delay buffer [Maneatis’93] with VCP and VCTL; bias circuit [Hu’92] generating VCN]
- Use differential elements with replica biasing:
  - Increased noise immunity
  - Not easily portable
  - Require larger supply head-room, but ok for 1.8-V

Interpolator Design
[Figure: interpolator with VCP, VCN and a DAC (label "5")]
- Interpolator bias and input/output time constant scales
  -> TDC remains linear over large frequency range

Dual DLL Block Diagram
[Figure: CORE loop (PD/CP/Bias, Amp, Amp, Input Clock) and PERIPHERAL loop (FSM, up/dn, PD, Ref Clock)]

Device Timing Variations
[Figure: Receive Window Distribution histogram; # parts (0 to 25) vs Receive-valid Window Center (ps), -50 to 100 ps]
- 100 parts: µ ≅ 30-ps, σ ≅ 20-ps

Propagation Delay Mismatch
$v(t) = A \cdot [\sin(\omega t) + r \cdot \sin(\omega t - 2\varphi)] \;\Rightarrow\; v(t) = A' \cdot \sin(\omega t + \theta)$
[Figure: channel with a discontinuity at the DRAM module, electrical distance φ; phasor diagram with A, rA, 2φ, A', θ]
- Clock and data channels different
- Clock and data spectral components different
  -> Propagation delays can differ by ~100-ps
- Regain margin: every DRAM transmit/receive timing must be offset from its lock point

Original Dual-DLL
[Figure: PD/CP/Bias, Amp, Amp, Input Clock, Mux+Interpolator, Decoder (8), FB Clock, Counter, FSM up/dn, PD, Ref Clock, Main Clock to I/O]

DLL for “in-system” Calibration
[Figure: PD/CP/Bias, Amp, Amp, Input Clock, two Mux+Interpolator blocks, two Decoders (8), Adder, Counter, Offset[7:0] (set @boot time), up/dn, PD, FB Clock, Ref Clock, Main Clock to I/O]

Outline
- Background
- Timing Circuits
- Signaling Circuits
  - Bus Environment Challenges
  - Output Subsystem Design
- Results
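
The phase interpolation formula and the boot-time offset adder from the timing slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the circuit's actual encoding. The split of the 8-bit code into 8 coarse phases and 32 interpolation steps, the mapping of 256 codes onto one clock period, and the names `code_to_phase`, `calibrated_code` and `lsb_ps` are all hypothetical.

```python
N_COARSE = 8    # assumed: number of DLL reference phases, 45 degrees apart
N_FINE = 32     # assumed: interpolation steps between two adjacent phases
N_CODES = N_COARSE * N_FINE   # 256 codes, matching an 8-bit Offset[7:0]


def interpolate(phi_deg, psi_deg, w, n):
    """Theta = ((n - w) * phi + w * psi) / n, with w in 0..n."""
    return ((n - w) * phi_deg + w * psi_deg) / n


def code_to_phase(code):
    """Map a digital code to an output phase in degrees (assumed encoding)."""
    coarse, w = divmod(code % N_CODES, N_FINE)
    phi = coarse * 360.0 / N_COARSE
    # psi is the next reference phase; it is left unwrapped so that the
    # interpolation stays monotonic across the 315 -> 360 degree segment.
    psi = phi + 360.0 / N_COARSE
    return interpolate(phi, psi, w, N_FINE) % 360.0


def calibrated_code(lock_code, offset):
    """Adder in the calibration path: lock point plus boot-time offset, mod 256."""
    return (lock_code + offset) % N_CODES


def lsb_ps(freq_mhz):
    """Time step of one code in ps, assuming 256 codes span one clock period."""
    period_ps = 1e6 / freq_mhz
    return period_ps / N_CODES


if __name__ == "__main__":
    print(code_to_phase(16))          # halfway between the 0 and 45 degree phases
    print(calibrated_code(250, 10))   # the sum wraps around the phase circle
    print(lsb_ps(400.0))              # step size at a 400 MHz clock
```

Because the sum wraps modulo 256, an offset can move the I/O clock in either direction from the lock point without a discontinuity. Under these assumptions one code step at 400 MHz is just under 10 ps.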

“Back-to-Back” Reads
[Figure: terminated bus with Vterm, Contr., Mem1 and Mem2, flight times ∆t1 and ∆t2; waveforms at the Controller and at Mem2 with intervals ∆t1+∆t2 and 2∆t2 and levels Vterm-Vsw and Vterm-1.5Vsw]
- Compliance voltage for M2 as low as 0.5-V

Output Driver Subsystem
[Figure: Driver Bias Voltage Generator (VGREF, VGATE) producing VG[6:0]; CC[6:0], EN; 7-bit drivers for DQ0/Q0, DQ1/Q1 ... DQ8/Q8]

Driver Bias Voltage Generator
[Figure: bias generator with IC, IR, VGREF, R, and a ">VT" annotation]
- Constant gate overdrive:
  - Increase noise immunity
  - Constant saturation margin over PVT

Driver IV Characteristics
[Figure: Iout (mA), 0 to 35, vs Vpad (V), 0 to 1.8; corners TT, SS, FF]

Output Driver Model
$i_{out} = g_m \cdot v_g + v_o / r_o - A \cdot g_{m2} \cdot v_o$
[Figure: small-signal model with vG, vO, gm, gm2, ro and gain -A]
- Negative resistance compensates for finite ro

Output Driver Schematic
[Figure: SL[1:0], M7[1:0], M6[1:0], DQ, VG[6:0], M5, M2[6:0], Q, M3, M1[6:0], M4]
- M6-M7 control maximum feedback current
- M3/M4 ratio constrained to minimize time constant

Driver IV Characteristics
[Figure: Iout (mA), 0 to 35, vs Vpad (V), 0 to 1.8; corners TT, SS, FF]

Outline
- Introduction
- Timing
- Signaling
- Results

Operating Range
[Figure: TBIT (nsec), 0.75 to 2.75, vs VDD (Volts), 1.0 to 2.5; marked point 1.8-V, 1.1 Gbps/pin]

Measured DLL Jitter
- < 100-ps peak-peak with interface and core active

Uncalibrated Output Data-valid Window
[Figure: VDD (Volts), 1.5 to 2.5, vs ∆t (ns), -1.0 to 1.0; window of 1-V by 760-ps]
- TBIT = 900-ps, TOFFS = default
  -> TQ offset ~ 150-ps

Calibrated Output Data-valid Window
[Figure: VDD (Volts), 1.5 to 2.5, vs ∆t (ns), -1.0 to 1.0; window of 1-V by 780-ps]
- TBIT = 900-ps, calibrated TOFFS
  -> TQ offset < 20-ps

Measured Calibration Accuracy
[Figure: offset (degrees), 0 to 350, vs code #, 0 to 250; curves at 400 MHz and 533 MHz]
- DNL, INL < 2-LSB

RDRAM Power Modes
- DLL must go into low-power “nap” mode:
  - IVDD < 4-mA
  - Restore clock phase within 80-ns
- Digital peripheral loop logic naturally holds state
- Hold state of core loop on 25-pF charge-pump capacitor

Measured Driver I-V Characteristics
[Figure: Iout (mA), 0 to 35, vs Vpad (V), 0 to 1.8; curves for FB off and FB on]

Summary
- Increasing memory interface bandwidth:
  -> Minimize both voltage and timing errors
- Voltage errors are systematic
  -> Compensated with new driver design
- Timing errors are unpredictable
  -> Compensated with “in-system” calibration
- Expect to see more digital “calibration” in high-speed links:
  - Challenge is to minimize overhead:
    - Area, Power, Yield..
    - System bring-up and ease of use..
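
The summary's point that voltage errors are compensated in the driver can be illustrated with the small-signal model from the Output Driver Model slide, iout = gm·vg + vo/ro − A·gm2·vo. The sketch below is only a numerical reading of that equation. The function names and the component values in the example are hypothetical and are not taken from the measured chip.

```python
def output_current(v_g, v_o, g_m, r_o, a, g_m2):
    """Small-signal output current: gm*vg + vo/ro - A*gm2*vo."""
    return g_m * v_g + v_o / r_o - a * g_m2 * v_o


def output_conductance(r_o, a, g_m2):
    """d(iout)/d(vo): the finite-ro term minus the feedback (negative) term."""
    return 1.0 / r_o - a * g_m2


def gain_for_flat_iv(r_o, g_m2):
    """Feedback gain A that cancels the output conductance: A = 1/(ro*gm2)."""
    return 1.0 / (r_o * g_m2)


if __name__ == "__main__":
    r_o, g_m2 = 500.0, 0.004      # hypothetical values: ohms, siemens
    a = gain_for_flat_iv(r_o, g_m2)
    print(a)                                   # gain needed for a flat I-V curve
    print(output_conductance(r_o, a, g_m2))    # residual slope with feedback on
    print(output_conductance(r_o, 0.0, g_m2))  # slope with feedback off
```

With the feedback term sized so that A·gm2 equals 1/ro, the output current no longer depends on the pad voltage in this model. That is the behavior the slide describes as negative resistance compensating for finite ro.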