Circuit Design for a 2.2 GByte/s Memory Interface

Stefanos Sidiropoulos
Work done at Rambus Inc with A. Abhyankar, C. Chen, K.
Chang, TJ Chin, N. Hays, J. Kim, Y. Li, G. Tsang, A. Wong,
D. Stark
Increasing Chip I/O Bandwidth
• Computers:
   Main memory:
   • SDRAM100 (100 Mbps) → RDRAM (0.8-1.1 Gbps)
   Peripherals:
   • PCI (66 Mbps) → Infiniband (2.5 Gbps)
• Networks:
   Physical Front End:
   • LAN: Fast-Eth (100 Mbps) → Gigabit-Eth (1 Gbps)
   • WAN: OC-12 (622 Mbps) → OC-48 (2.4 Gbps)
   Switch Fabric:
   • 625 Mbps → 2.5 Gbps
Outline
• Overview
• Timing Methods
• Signaling Methods
• Timing Circuits
• Signaling Circuits
• Results
Main Issues
[Figure: Tx → Channel (PCB, coax, fiber) → Rx; received bit pattern with amplitude < 400-mV and bit time < 1-ns]
• Drive and capture signals at the correct time
   • Bit times are as small as 2-3 gate delays
• Send and receive signals robustly
   • Noise is a large fraction of the signal
Timing Architectures
• Synchronous: same frequency and phase (figure: common clock F0, equal delays t)
   • Conventional busses
   • Conventional memories
• Mesochronous: same frequency, unknown phase (figure: common clock F0, tA ≠ tB)
   • Fast memories/busses
   • MP networks
   • Interconnection networks
• Plesiochronous: almost the same frequency (figure: separate clocks, F1 ≈ F2)
   • Network front-end
   • Router core
Synchronous Systems
[Figure: PLL/DLL generates on-chip clock CKC from external clock CKX; CKC clocks data DI into the on-chip logic. Timing diagram: CKX, DI, CKC]
• On-chip clock is a multiple of system clock:
   → Synthesize on-chip clock frequency
• On-chip clock phase varies:
   → Cancel clock buffer delay
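Mesochronous Systems
[Figure: source clock CKSRC and data feed the receiver (rcvr) and logic; a PLL/DLL derives the receive clock CKRCV from the reference (ref). Timing diagram: CKSRC, data D0 to D3, CKRCV]
• Position on-chip sampling clock at the optimal point,
   i.e. maximize “timing” margin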
Plesiochronous Systems
[Figure: input DIN feeds the receiver (rcvr), logic and CRC block; recovered clock CKR. Timing diagram: DIN (D0, D1), CKR]
• Recover incoming data fundamental frequency
• Position sampling clock at the “optimal” point
Signaling
• Send and receive the data impaired by noise:
   • Independent noise sources:
      • Thermal and uncorrelated system noise
   • Proportional noise sources:
      • Reflections, cross-talk, signal-return noise
[Figure: signaling options: high-impedance vs. low-impedance drivers, differential vs. single-ended receivers; the single-ended receiver uses a shared reference (ref) at VS/2]
Outline
• Background
• Timing Circuits
• Signaling Circuits
• Results
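Rambus Memory Channel
[Figure: Controller and clock generator (Clk Gen) on a channel with two groups of devices M1, M2 ... M16; clocks CTM and CFM; bus labelled 24; data D0, D1, D2]
• 1.6-GB/s (800 Mbps/pin):
   • Current mode signaling
   • Source synchronous clocking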
Increasing System Performance
• Increase transfer rate:
   System Clock: 400 → 533 MHz (800 → 1066 Mbps/pin)
   Peak Bandwidth: 1.6 → 2.2 GB/s
• Challenges:
   • Timing Margin
      • Device Variations
      • Channel Imperfections
   • Voltage Errors
      • Bus Hand-off
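
The peak-bandwidth figures follow from the per-pin data rate and the width of the data path. A minimal sketch of that arithmetic, assuming a 16-pin (2-byte) data path; that width is inferred from 1.6 GB/s at 800 Mbps/pin rather than stated on this slide, and the function name is illustrative:

```python
def peak_bandwidth_gbytes(clock_mhz: float, data_pins: int = 16) -> float:
    """Peak bandwidth in GB/s for a bus that transfers on both clock edges.

    data_pins=16 is an assumption (two data bytes), not a value from the slide.
    """
    mbps_per_pin = 2 * clock_mhz          # 400 MHz -> 800 Mbps/pin
    return mbps_per_pin * data_pins / 8 / 1000


for clock in (400, 533):
    print(f"{clock} MHz -> {peak_bandwidth_gbytes(clock):.2f} GB/s")
```

Under this assumption 400 MHz gives 1.60 GB/s and 533 MHz gives about 2.13 GB/s; the 2.2 GB/s headline corresponds to 1.1 Gbps/pin.
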
Prototype DRAM Interface Chip
Technology: 0.25-µm, 2.5-V CMOS
Supply: 1.8-V
Active Area: 11.2 x 1.3 mm²
Package: LGA, µBGA
Chip Includes:
• T/R DLL
• 2-Data bytes, 1-Address byte
• Packet Protocol Logic
• 18 KB SRAM
Outline
• Background
• Timing Circuits
   • Requirements
   • Architecture
   • Timing Error Sources
• Signaling Circuits
• Results
RDRAM Timing Circuit Requirements
[Figure: DLL generating TCLK and RCLK from the CTM and CFM clocks; 8-bit DQA, DQB and RQ buses. Timing diagram: CFM, CTM, DQ (D0 to D3), DQ/RQ (D0 to D3), RCLK, TCLK]
PLLs vs DLLs
[Figure: PLL loop (PD, filter, VCO, ÷N; ref clk in, clk out) and DLL loop (PD, filter, VCDL; ref clk in, clk out)]
• PLL: second/third order loop:
   → Stability is an issue
   → Frequency synthesis easy
   → Ref. Clk jitter gets filtered
   → Phase error accumulates
• DLL: first order loop:
   → Stability guaranteed
   → Frequency synthesis problematic
   → Ref. Clk jitter propagates
   → Phase error does not accumulate
Supply Noise: DLL vs PLL
[Plot: 6-stage DLL vs 6-stage PLL; phase error (deg.), 0 to -50, vs. time (ns), 0 to 1500; curves: DLL, PLL BW 20 MHz, PLL BW 5 MHz; peaks marked DLL-pk and PLL-pk]
* Supply sensitivity: 0.1%-delay/%-supply/element
• No need for clock multiplication → use a DLL
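
The difference between the two loops under a supply step can be illustrated with a toy discrete-time model. This is a sketch under stated assumptions, not the simulation behind the plot: the 10% supply step, the loop gains and both function names are hypothetical, and only the 0.1 %-delay/%-supply sensitivity comes from the slide. In the DLL the delay change appears once and is then corrected; in the PLL the oscillator period error adds to the phase on every cycle until the loop cancels it.

```python
def dll_phase_error(step_deg, loop_gain, cycles):
    """First-order DLL: the delay change shows up once, then decays."""
    err, trace = step_deg, []
    for _ in range(cycles):
        trace.append(err)
        err *= 1.0 - loop_gain
    return trace


def pll_phase_error(step_deg, kp, ki, cycles):
    """Second-order PLL: the period error accumulates as phase each cycle
    until the proportional (kp) and integral (ki) paths cancel it."""
    err, integ, trace = 0.0, 0.0, []
    for _ in range(cycles):
        err += step_deg - kp * err - integ
        integ += ki * err
        trace.append(err)
    return trace


SENSITIVITY = 0.1    # %-delay per %-supply per element (from the slide)
SUPPLY_STEP = 10.0   # % supply step: assumed for illustration
step_deg = 360.0 * SENSITIVITY * SUPPLY_STEP / 100.0

dll = dll_phase_error(step_deg, loop_gain=0.1, cycles=1500)
pll = pll_phase_error(step_deg, kp=0.1, ki=0.0025, cycles=1500)
print(f"DLL peak {max(dll):.1f} deg, PLL peak {max(pll):.1f} deg")
```

With these assumed gains the PLL peak is several times the DLL peak, which matches the qualitative ordering of DLL-pk and PLL-pk in the plot; the absolute numbers are not meant to reproduce it.
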
Conventional DLL
[Figure: delay line from ref clk to clk, controlled by a phase detector (PD)]
• Limited phase acquisition range
   → Generate delay by using phase interpolation
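Variable Phase Interpolation
[Figure: interpolator with inputs φ, φ’, ψ, ψ’ and output Θ; phasor diagram of phases φ0 to φ3 and ψ0 to ψ3]
$$\Theta = \frac{(N - w)\cdot\phi + w\cdot\psi}{N}, \qquad w = 0..N$$
• If φ, ψ selectively span 2π:
   → Can generate any Θ
• φ, ψ can be generated by a DLL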
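
The interpolation formula can be sketched in a few lines. The first function is the weighted average from the slide; the second shows one way a code could select two adjacent coarse phases and interpolate between them. The choice of 8 coarse phases and 32 interpolation steps is an assumption for illustration, as are the function names.

```python
def interpolate(phi_deg, psi_deg, w, n):
    """Theta = ((N - w) * phi + w * psi) / N for a weight w in 0..N."""
    if not 0 <= w <= n:
        raise ValueError("w must lie in 0..N")
    return ((n - w) * phi_deg + w * psi_deg) / n


def phase_from_code(code, n_phases=8, n_steps=32):
    """Pick two adjacent coarse phases, then interpolate between them.

    n_phases and n_steps are assumed values, not taken from the slides.
    """
    coarse, w = divmod(code % (n_phases * n_steps), n_steps)
    spacing = 360.0 / n_phases
    phi = coarse * spacing
    psi = phi + spacing            # next coarse phase (360 wraps to 0 deg)
    return interpolate(phi, psi, w, n_steps) % 360.0


print(interpolate(0.0, 45.0, 16, 32))   # halfway between 0 and 45 degrees
print(phase_from_code(128))             # halfway around the circle
```

Because the coarse phases cover the full circle, every code maps to a distinct output phase, which is the property the slide summarizes as being able to generate any Θ.
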
RDRAM Delay Buffers
[Figure: differential delay element [Maneatis’93] with bias circuit [Hu’92]; control voltage VCTL, bias voltages VCP and VCN]
• Use differential elements with replica biasing:
   • Increased noise immunity
   • Not easily portable
   • Require larger supply head-room but ok for 1.8-V
Interpolator Design
[Figure: differential interpolator biased by VCP and VCN, with weights set by a DAC (input bus labelled 5)]
• Interpolator bias and input/output time constant scales
   → TDC remains linear over large frequency range
Dual DLL Block Diagram
[Block diagram: CORE loop (PD/CP/Bias, amplifiers, input clock) and PERIPHERAL loop (PD, FSM with up/dn control, ref clock)]
Device Timing Variations
[Histogram: Receive Window Distribution; # parts (0 to 25) vs. Receive-valid Window Center (ps), -50 to 100]
• 100 parts: µ ≅ 30-ps, σ ≅ 20-ps
Propagation Delay Mismatch
[Figure: channel with a DRAM, a module and a discontinuity, phase length φ; phasor diagram: A plus reflected component rA at angle 2φ gives A’ at angle θ]
$$v(t) = A \cdot [\sin(\omega t) + r \cdot \sin(\omega t - 2\phi)] \;\Rightarrow\; v(t) = A' \cdot \sin(\omega t + \theta)$$
• Clock and data channels different
• Clock and data spectral components different
   → Propagation delays can differ by ~ 100-ps
• Regain margin: every DRAM transmit/receive timing must be offset from its lock point
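
The phasor sum can be evaluated numerically to see how a reflection turns into a timing shift. A minimal sketch, assuming a hypothetical reflection coefficient r = 0.2 and φ = 45°; neither value is from the slides, and the function names are illustrative.

```python
import cmath
import math


def reflected_sum(r, phi_deg, amplitude=1.0):
    """Return (A', theta_deg) for v(t) = A[sin(wt) + r sin(wt - 2*phi)]."""
    phasor = amplitude * (1 + r * cmath.exp(-2j * math.radians(phi_deg)))
    return abs(phasor), math.degrees(cmath.phase(phasor))


def timing_shift_ps(theta_deg, freq_mhz):
    """Convert a phase shift at freq_mhz into picoseconds."""
    period_ps = 1e6 / freq_mhz
    return theta_deg / 360.0 * period_ps


a_prime, theta = reflected_sum(r=0.2, phi_deg=45.0)   # assumed values
print(f"A' = {a_prime:.3f} A, theta = {theta:.1f} deg, "
      f"shift = {timing_shift_ps(theta, 533):.0f} ps at 533 MHz")
```

For these assumed values the shift is a few tens of picoseconds, the same order of magnitude as the ~100-ps mismatch quoted on the slide.
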
Original Dual-DLL
[Block diagram: PD/CP/Bias, amplifiers, input clock, Mux+Interpolator, Decoder, 8-bit Counter, FSM (up/dn), PD; clocks: FB clock, ref clock, main clock to I/O]
DLL for “in-system” Calibration
[Block diagram: PD/CP/Bias, amplifiers, input clock, two Mux+Interpolator paths each with its own Decoder, 8-bit Counter (up/dn from PD), Adder with Offset[7:0] (set at boot time); clocks: FB clock, ref clock, main clock to I/O]
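
The adder path can be sketched as modular arithmetic on the 8-bit phase code. This assumes the 256 codes span one clock period (suggested by the 8-bit Offset[7:0] and the 0 to 360 degree calibration range, but not stated explicitly); the function names and the example skew are hypothetical.

```python
CODES_PER_CYCLE = 256   # 8-bit phase code, assumed to span one clock period


def io_phase_code(lock_code, offset):
    """Add the boot-time offset to the loop's lock code, modulo 8 bits.

    offset may be negative; the wrap-around mirrors an 8-bit adder.
    """
    return (lock_code + offset) % CODES_PER_CYCLE


def lsb_ps(freq_mhz):
    """Time step of one code at the given clock frequency, in ps."""
    return 1e6 / freq_mhz / CODES_PER_CYCLE


def offset_for_skew(skew_ps, freq_mhz):
    """Nearest offset code that cancels a measured skew (hypothetical helper)."""
    return -round(skew_ps / lsb_ps(freq_mhz))


offset = offset_for_skew(150.0, 533)          # example skew, in ps
print(offset, io_phase_code(10, offset), f"{lsb_ps(533):.2f} ps/LSB")
```

The feedback loop keeps tracking with its own counter, while the I/O clock is taken at a fixed code distance from the lock point, so the offset only needs to be determined once at boot.
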
Outline
• Background
• Timing Circuits
• Signaling Circuits
   • Bus Environment Challenges
   • Output Subsystem Design
• Results
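“Back-to-Back” Reads
[Figure: bus terminated to Vterm with Controller (Contr.), Mem1 and Mem2 separated by flight times ∆t1 and ∆t2. Waveforms at the Controller and at Mem2: levels Vterm, Vterm-Vsw and Vterm-1.5Vsw, with intervals ∆t1+∆t2 and 2∆t2]
• Compliance voltage for M2 as low as 0.5-V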
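Output Driver Subsystem
[Block diagram: Driver Bias Voltage Generator (VGREF, VGATE) producing VG[6:0] from current-control bits CC[6:0] and EN; 7-segment (×7) output drivers with data inputs Q0, Q1 ... Q8 driving pads DQ0, DQ1 ... DQ8]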
Driver Bias Voltage Generator
[Figure: bias generator with currents IC and IR; VGREF set by the drop IR·R, margin > VT]
• Constant gate overdrive:
   • Increase noise immunity
   • Constant saturation margin over PVT
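Driver IV Characteristics
[Plot: Iout (mA), 0 to 35, vs. Vpad (V), 0 to 1.8; curves for TT, SS and FF corners]
Output Driver Model
[Figure: small-signal model with transconductances gm (input vG) and gm2, output resistance ro, and feedback gain -A from output vO]
$$i_{out} = g_m \cdot v_g + v_o / r_o - A \cdot g_{m2} \cdot v_o$$
• Negative resistance compensates for finite ro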
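
The compensation condition can be read off the driver model: the output conductance is 1/ro − A·gm2, so it vanishes when A·gm2 = 1/ro. A minimal numeric sketch with hypothetical element values (none of them are from the slides):

```python
def iout(vg, vo, gm, ro, a, gm2):
    """Small-signal output current: gm*vg + vo/ro - A*gm2*vo."""
    return gm * vg + vo / ro - a * gm2 * vo


def output_conductance(ro, a, gm2):
    """d(iout)/d(vo); zero when A*gm2 equals 1/ro."""
    return 1.0 / ro - a * gm2


# Hypothetical element values, chosen only to illustrate the cancellation.
RO = 500.0                     # ohms
GM2 = 1e-3                     # siemens
A_IDEAL = 1.0 / (RO * GM2)     # feedback gain that nulls the conductance

print(output_conductance(RO, 0.0, GM2))       # feedback off: 1/ro
print(output_conductance(RO, A_IDEAL, GM2))   # feedback on: ~0
```

With the feedback gain at this value the output current no longer depends on the pad voltage, which is the flat I-V behaviour the negative resistance is meant to provide.
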
Output Driver Schematic
[Schematic: output stack M1[6:0]/M2[6:0] driving DQ, gated by data Q and bias VG[6:0]; feedback devices M3, M4, M5; M6[1:0] and M7[1:0] selected by SL[1:0]]
• M6-M7 control maximum feedback current
• M3/M4 ratio constrained to minimize time constant
Driver IV Characteristics
[Plot: Iout (mA), 0 to 35, vs. Vpad (V), 0 to 1.8; curves for TT, SS and FF corners]
Outline
• Introduction
• Timing
• Signaling
• Results
Operating Range
[Shmoo plot: TBIT (nsec), 0.75 to 2.75, vs. VDD (Volts), 1.0 to 2.5; marked point: 1.8-V, 1.1 Gbps/pin]
Measured DLL Jitter
< 100-ps peak-peak with interface and core active
Uncalibrated Output Data-valid Window
[Shmoo plot: VDD (Volts), 1.5 to 2.5, vs. ∆t (ns), -1.0 to 1.0; window marked 1-V by 760-ps]
• TBIT = 900-ps, TOFFS = default → TQ offset ~ 150-ps
Calibrated Output Data-valid Window
[Shmoo plot: VDD (Volts), 1.5 to 2.5, vs. ∆t (ns), -1.0 to 1.0; window marked 1-V by 780-ps]
• TBIT = 900-ps, calibrated TOFFS → TQ offset < 20-ps
Measured Calibration Accuracy
[Plot: offset (degrees), 0 to 350, vs. code #, 0 to 250; curves at 400 MHz and 533 MHz]
• DNL, INL < 2-LSB
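
DNL and INL can be computed from a measured code-to-phase curve as follows. This is a sketch using an endpoint fit and synthetic data; it is not the procedure or the data behind the measured plot, and the function name is illustrative.

```python
def dnl_inl(phases_deg):
    """Return (dnl, inl) in LSB for a measured code-to-phase curve.

    Endpoint fit: 1 LSB is the average step between the first and last code.
    """
    n = len(phases_deg)
    lsb = (phases_deg[-1] - phases_deg[0]) / (n - 1)
    dnl = [(b - a) / lsb - 1.0 for a, b in zip(phases_deg, phases_deg[1:])]
    inl = [(p - phases_deg[0]) / lsb - i for i, p in enumerate(phases_deg)]
    return dnl, inl


# Synthetic data for illustration only; not the measured curve.
measured = [0.0, 1.0, 2.0, 4.0, 5.0]
dnl, inl = dnl_inl(measured)
print(max(abs(x) for x in dnl), max(abs(x) for x in inl))
```

DNL measures how far each individual step deviates from one ideal LSB, while INL measures the accumulated deviation of each code from the straight line through the endpoints.
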
RDRAM Power Modes
DLL must go into low-power “nap” mode
• IVDD < 4-mA
• Restore clock phase within 80-ns
• Digital peripheral loop logic naturally holds state
• Hold state of core loop on 25-pF charge-pump capacitor
Measured Driver I-V Characteristics
[Plot: Iout (mA), 0 to 35, vs. Vpad (V), 0 to 1.8; curves: FB off, FB on]
Summary
• Increasing memory interface bandwidth:
   → Minimize both voltage and timing errors:
   • Voltage errors are systematic
      → Compensated with new driver design
   • Timing errors are unpredictable
      → Compensated with “in-system” calibration
• Expect to see more digital “calibration” in high-speed links:
   • Challenge is to minimize overhead:
      • Area, power, yield
      • System bring-up and ease of use