FB-DIMM technology
Dezső Sima
Spring 2008 (Ver. 1.0)
Sima Dezső, 2008

Motivations to introduce FB-DIMMs in servers/workstations

Shortcomings of the stub-bus topology used with conventional DRAM architectures [2]

Stub-bus topology
• The data lines of the memory controller are electrically connected to the data lines of every DRAM device on the bus (memory channel).
• Impedance discontinuities affect signal integrity [2].
• Memory channels may have 8 DIMMs with 9 DRAM devices/DIMM (i.e. 72 devices/channel).
• Heavy signal loading, due to the large number of devices and the impedance discontinuities on the bus, limits the number of DRAM devices that can be connected to the channel; the higher the data rate, the tighter the limit.

Figure: Scaling the number of channels with memory hubs [7]. Two ranks of DRAM devices per DIMM are assumed. With a single rank per DIMM the number of DIMMs per channel may be doubled, but the declining trend shown in the figure remains the same.

• For higher DRAM speeds, fewer DRAM devices can be connected per memory channel [2].
• Stub-bus channel capacity (device density x number of devices) has hit its ceiling [2], while increasing server performance doubles the memory capacity demand about every two years [2] (from Jacob, Memory Systems, 2007).
• Increasing the number of memory channels instead is costly, since each DDR2 memory channel requires 240 pins.

FB-DIMM technology (1)

Principle of operation
• introduce packet-based serial transmission (as in the PCI-E, SATA and SAS buses)
• introduce full buffering (registered DIMMs buffer only addresses)
• CRC error checking (cyclic redundancy check)

FB-DIMM technology (2)

Figure: FB-DIMM memory architecture [4]
Figure: Maximum supported FB-DIMM configuration [6] (6 channels / 8 DIMMs)

FB-DIMM technology (3)

Implementation details (1)
• Serial transmission between the North Bridge and the DIMMs (each bit needs a pair of wires).
• Number of serial links:
  • 14 read lanes (2 wires each)
  • 10 write lanes (2 wires each)
• Clocked at 6 x the double-pumped data rate; e.g. for DDR2-667 DRAM the link clock rate is 6 x 667 MHz ≈ 4 GHz.
• Every 12 link cycles (that is, every two memory cycles) constitute a packet:
  • Read packets (frames, bursts): 168 bits (12 x 14 bits)
    • 144 data bits (the number of data bits delivered by a 72-bit wide DDR2 module (64 data bits + 8 ECC bits) in two memory cycles)
    • 24 CRC bits
  • Write packets (frames, bursts): 120 bits (12 x 10 bits)
    • 98 payload bits
    • 22 CRC bits

FB-DIMM technology (4)

Implementation details (2)
The 98 payload bits consist of:
• 2 frame type bits,
• 24 bits of command,
• 72 bits for data and commands, according to the frame type, e.g. 72 bits of data, 36 bits of data + one command, or two commands.

Commands
• row select, precharge, refresh, read, write etc.
• All commands include a 3-bit FB-DIMM module address to select one of 8 modules.

FB-DIMM technology (5)

Implementation details (3)
Read bandwidth:
• In one frame (that is, in 12 link cycles) one FB-DIMM channel transfers 128 data bits + 16 ECC bits.
• One frame lasts 2 memory cycles.
• One DDR2 DIMM channel transfers 2 x 72 bits in 2 memory cycles (2 x 64 data bits + 2 x 8 ECC bits).
• The read bandwidth of an FB-DIMM channel therefore equals the bandwidth of a DDR2 channel.
Write bandwidth:
• The write bandwidth of an FB-DIMM channel is up to 0.5 x the read bandwidth.
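The frame and bandwidth figures above can be re-derived with a little arithmetic. The following Python lines are a minimal sketch, not part of the FB-DIMM specification: the constants are the ones quoted on these slides, DDR2-667 is an assumed worked example, and all variable names are illustrative.

# A minimal sketch (Python), assuming DDR2-667, that recomputes the figures above.
DATA_RATE_MTPS = 667                        # DDR2-667: million transfers per second
LINK_CLOCK_MHZ = 6 * DATA_RATE_MTPS         # serial links clocked at 6 x the data rate
print("link clock:", LINK_CLOCK_MHZ, "MHz") # 4002 MHz, i.e. roughly 4 GHz

# Frame sizes (one frame = 12 link cycles)
READ_FRAME_BITS  = 12 * 14                  # 14 read lanes  -> 168 bits
WRITE_FRAME_BITS = 12 * 10                  # 10 write lanes -> 120 bits
READ_PAYLOAD     = 2 * (64 + 8)             # two 72-bit DDR2 transfers (data + ECC) = 144 bits
print("read CRC bits :", READ_FRAME_BITS - READ_PAYLOAD)   # 24
print("write CRC bits:", WRITE_FRAME_BITS - 98)            # 22 (98 payload bits)

# Read bandwidth: one frame covers two memory-bus transfers,
# so frames arrive at half the transfer rate.
frames_per_s = DATA_RATE_MTPS * 1e6 / 2
fbdimm_read  = frames_per_s * (2 * 64) / 8 / 1e6     # data bits only, in MB/s
ddr2_channel = DATA_RATE_MTPS * 1e6 * 64 / 8 / 1e6   # 64-bit wide DDR2 data bus, in MB/s
print(fbdimm_read, ddr2_channel)                     # both ~5336 MB/s -> equal read bandwidth
print("write bandwidth <=", 0.5 * fbdimm_read, "MB/s")   # up to 0.5 x the read bandwidth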
But FB-DIMMs allow simultaneous read and write operations.

FB-DIMM technology (6)

FB-DIMM data buffer (Advanced Memory Buffer, AMB)
• Manages the read/write operations of the module.

Source: PCStats
• FB-DIMM-4300 (DDR2-533 SDRAM): clock speed 133 MHz, data rate 532 MHz, throughput 4300 MB/s
• PC2-5300 (DDR2-667 SDRAM): clock speed 167 MHz, data rate 667 MHz, throughput 5300 MB/s
• PC2-6400 (DDR2-800 SDRAM): clock speed 200 MHz, data rate 800 MHz, throughput 6400 MB/s

Figure: Different implementations of FB-DIMMs
Figure: Block diagram of the AMB [3] (there are two Command/Address (C/A) buses to limit the load presented by the 9 to 36 DRAM devices)

FB-DIMM technology (7)

Necessary routing to connect the North Bridge to the DIMM socket
a) In the case of a DDR2 DIMM (240 pins): a 3-layer PCB is needed.
b) In the case of an FB-DIMM (69 pins): a 2-layer PCB is needed (although a 3rd layer is used for power lines).
Figure: PCB routing [4]

FB-DIMM technology (8)

Figure: Latency and bandwidth figures of different DRAM technologies for a mix of SPEC applications [5]

FB-DIMM technology (9)

Pros and cons of FB-DIMMs

Advantages of FB-DIMMs vs DDR2 and DDR3 DIMMs
• more memory channels (up to 6) → higher total bandwidth
• more DIMM modules (up to 8) per channel → higher memory capacity (up to 192 GB)
• fewer wires → simplified PCB routing
• simultaneous read/write operation in a channel

Disadvantages of FB-DIMMs vs DDR2 and DDR3 DIMMs
• higher latency and lower bandwidth figures for 4 to 8 DIMM modules
• higher cost
• higher dissipation (typical dissipation figures: DDR2 DIMM: about 5 W, AMB: about 5 W, DDR2 FB-DIMM: about 10 W)

Latency
Latency is potentially the more troubling issue. Intel addressed it by not having the signals stored and then retransmitted: the data travels along a special fast pass-through path in the buffer itself. This removes much of the latency that a store-and-forward architecture would induce.

Figure: FB-DIMM heat sinks (heat spreaders)

FB-DIMM technology (10)

Market penetration of the FB-DIMM technology
• 5/2006: Intel adopts it in its Bensley platform (5000) for DPs.
• 9/2006: AMD takes it off its roadmap.
• 8/2007: Sun introduces it in the Niagara II.
• 9/2007: Intel uses it in the Caneland platform (7000) for MPs.
• 2007: Major memory manufacturers intend to develop DDR3 DIMMs instead of DDR3-based FB-DIMMs.

Standardisation
• 1/2007: JESD206, FB-DIMM Architecture and Protocol
• 3/2007: JESD205, DDR2 SDRAM Fully Buffered DIMM (FB-DIMM) Design Specification
  • DDR2-533, DDR2-667, DDR2-800
  • x72 ECC, 240 pin
  • 256 Mb, 512 Mb, 1 Gb, 2 Gb, 4 Gb devices

FB-DIMM technology (11)

DDR2 vs DDR SDRAM
The key difference between DDR and DDR2 is that the DDR2 data bus is clocked at twice the speed of the memory cells, so four data words can be transferred per memory-cell cycle without speeding up the memory cells themselves.

Figure: Clocking schemes of the SDR, DDR and DDR2 SDRAM technologies [1]

DDR2's bus frequency is boosted by electrical interface improvements, on-die termination, prefetch buffers and off-chip drivers. However, latency is greatly increased as a trade-off: the DDR2 prefetch buffer is 4 bits deep, whereas it is 2 bits deep for DDR (and 8 bits deep for DDR3). While DDR SDRAM has typical read latencies between 2 and 3 bus cycles, early DDR2 had read latencies between 4 and 6 cycles. Although introduced in Q2 2003 at 200/266 MHz, DDR2 initially could not compete because of its high latency; as lower-latency parts became available by the end of 2004, DDR2 became widespread.
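To make the clocking scheme just described concrete, here is a minimal sketch (Python) that derives the data rate and peak module bandwidth from the memory-cell clock. The prefetch depths are the ones quoted above; the cell clocks (133/167/200 MHz) are taken from the PCStats list on the AMB slide; the helper name module_figures and the assumption of 64-bit-wide (non-ECC) modules are illustrative only.

# A minimal sketch (Python) of the DDR/DDR2/DDR3 clocking arithmetic described above.
PREFETCH = {"DDR": 2, "DDR2": 4, "DDR3": 8}      # bits prefetched per data pin

def module_figures(tech, cell_clock_mhz):
    """Return (data rate in MT/s, peak bandwidth in MB/s) for a 64-bit-wide module."""
    data_rate = cell_clock_mhz * PREFETCH[tech]  # transfer rate grows with the prefetch depth
    bandwidth = data_rate * 8                    # 64 bits = 8 bytes per transfer
    return data_rate, bandwidth

for clk in (133, 167, 200):                      # cell clocks from the PCStats list above
    rate, bw = module_figures("DDR2", clk)
    print(f"DDR2 with {clk} MHz cells -> DDR2-{rate}, about {bw} MB/s")
# 133 MHz -> DDR2-532, ~4256 MB/s (marketed as DDR2-533 / PC2-4300)
# 167 MHz -> DDR2-668, ~5344 MB/s (DDR2-667 / PC2-5300)
# 200 MHz -> DDR2-800, ~6400 MB/s (DDR2-800 / PC2-6400)

Running the same helper for "DDR" shows why DDR-400 needs 200 MHz cells while DDR2-400 gets by with 100 MHz cells, at the price of the extra latency cycles discussed above.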
Memory            Timings    Latency    Bandwidth in dual-channel mode
DDR400 SDRAM      2.5-3-3    12.5 ns    6.4 GB/s
DDR400 SDRAM      2-3-2      10 ns      6.4 GB/s
DDR533 SDRAM      3-4-4      11.2 ns    8.5 GB/s
DDR533 SDRAM      2.5-3-3    9.4 ns     8.5 GB/s
DDR2-533 SDRAM    5-5-5      18.8 ns    8.5 GB/s
DDR2-533 SDRAM    4-4-4      15 ns      8.5 GB/s
DDR2-533 SDRAM    3-3-3      11.2 ns    8.5 GB/s
DDR2-600 SDRAM    5-5-5      16.6 ns    9.6 GB/s
DDR2-600 SDRAM    4-4-4      13.3 ns    9.6 GB/s

Table: Burst timing, latency and bandwidth figures of DDR and DDR2 DRAM technologies [1]

CAS latency (Column Address Strobe latency, CL): the time delay (in number of clock cycles) between a memory chip being accessed for data and the first data bit becoming available. For instance, after accessing a 400 MHz CL3 device, the first bit arrives in 3 x 2.5 ns = 7.5 ns (a worked sketch follows after Figure 5.1b below).

Early DDR2-533 SDRAM modules, available at the time of the announcement of the i925 and i915 chipsets (6/2004), had 4-4-4 timings (CAS Latency - RAS to CAS Delay - RAS Precharge Time).

FB-DIMM technology ()

Power savings are achieved primarily by the drop in operating voltage (1.8 V compared to DDR's 2.5 V). DDR2 DIMMs have 240 pins instead of the 184 pins used by DDR DIMMs.

DDR3

Official JEDEC specifications:
                   DDR2                 DDR3
Rated speed        400-800 Mbps         800-1600 Mbps
Vdd/Vddq           1.8 V +/- 0.1 V      1.5 V +/- 0.075 V
Internal banks     4                    8
Termination        Limited              All DQ signals
Topology           Conventional T       Fly-by
Driver control     OCD calibration      Self-calibration with ZQ
Thermal sensor     No                   Yes (optional)
Source: Anandtech

DDR3 appeared in mid-2007, e.g. with Intel's P35 (Bearlake) chipset. Source: Wiki

5.2. Speed gap between processor and memory (1a)

Figure 5.1a: DRAM types. A comparison of DRAM(1), FPM(2), EDO(3), BEDO(4), SDRAM(5) and DRDRAM(6) with respect to asynchronous vs. synchronous (since 1996) operation, level of overlapping (e.g. overlapping the read and address transfer operations, BEDO's internal 2-bit address generator and dual banks, DRDRAM's fully pipelined operation), cached structure, random access time (typically 60/70/80/100 ns for plain DRAM), cycle time within a burst for a 60 ns part (~40 ns with FPM, ~25 ns with EDO, ~15 ns with BEDO), full burst timing (from (5-7)-3-3-3 down to (5-7)-1-1-1 and 5-1-1-1), bus frequency (up to 66 MHz asynchronous, 66/100/133 MHz for SDRAM with ~15/10/7.5 ns burst cycle time, 256/300/356/400 MHz transfer rate over a 1-2 byte wide data path for DRDRAM, with remakes at 4/3.3/2.8/2.5 ns), maximum and effective bandwidth (MB/s), developer (BEDO: Micron, DRDRAM: Rambus) and example chipsets (Triton I/II/III and 430 ZX with burst timings from 7-3-3-3 down to 7-1-1-1; the 820/840 chipsets for DRDRAM).
(1) Dynamic RAM, (2) Fast Page Mode DRAM, (3) Extended Data Out DRAM, (4) Burst-mode EDO, (5) Synchronous DRAM, (6) Direct Rambus DRAM

5.2. Speed gap between processor and memory (1b)

Figure 5.1b: Latency of DRAM chips. t_RAC (row access time: the time from the row address until the data are valid), in ns, of the typical DRAM parts of each year from 1981 to 2000 (from 16 Kbit to 256 Mbit devices), annotated with the contemporary processors (AT, 386 DX, 486 DX, Pentium, Pentium Pro, Pentium II, Pentium III) and chipsets (430 NX, 450 KX/GX, 440 BX, 815); t_RAC declines from roughly 200 ns to a few tens of ns over this period.
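Before the system-level latency figures, here is a minimal sketch (Python) of the arithmetic behind the CAS-latency example above and behind Figures 5.1c/5.1d, which express the same delay in processor clock cycles. The helper names, the processor clocks and the round 100 ns system-level latency are assumed example values, not data read from the figures.

# A minimal sketch (Python) of the latency arithmetic used above and in Figures 5.1c/5.1d.
def cas_latency_ns(cl_cycles, clock_mhz):
    """First-data delay: CAS latency (in clock cycles) times the clock period."""
    return cl_cycles * 1000.0 / clock_mhz

def latency_in_cpu_cycles(latency_ns, cpu_clock_ghz):
    """The same delay as seen by the processor, in processor clock cycles."""
    return latency_ns * cpu_clock_ghz

print(cas_latency_ns(2.5, 200))   # DDR400 CL2.5: 2.5 x 5 ns = 12.5 ns, as in the table above
print(cas_latency_ns(3, 400))     # the "400 MHz CL3" example: 3 x 2.5 ns = 7.5 ns

# The widening speed gap: a fixed 100 ns system-level latency costs ever more cycles
for f_ghz in (0.06, 1.0, 3.0):    # e.g. a 60 MHz Pentium ... a 3 GHz Pentium 4 (assumed clocks)
    print(f_ghz, "GHz ->", latency_in_cpu_cycles(100, f_ghz), "processor cycles")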
5.2. Speed gap between processor and memory (1c)

Figure 5.1c: System-level memory latency in x86-based PCs, 1981-2000. Two curves are plotted against the year: memory latency in ns, which declines only slowly (to the order of 100-150 ns by the late 1990s), and memory latency in processor clock cycles, which rises steeply (to several hundred cycles, roughly 470-700, by 2000); the years are annotated with the contemporary PCs and processors (PC/8088, AT/286, 386 DX, 486 DX, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4).

5.2. Speed gap between processor and memory (1d)

Figure 5.1d: Latency of DRAM chips in processor clock cycles, plotted against the processor clock frequency f_c (0.5-4.0 GHz). The curves cover FPM, EDO, PC 66/100/133 SDRAM, RDRAM-40/-60, DDR 266/333/400 and DDR2-533 memories in systems from the 386, 486 and Pentium up to the Pentium Pro, Pentium II, Pentium III and Pentium 4; the chip latency grows from a few cycles to well above 100 cycles at Pentium 4 class clock rates.

5.2. Speed gap between processor and memory (2)

Figure 5.2: Relative transfer rate of memories (T_memory/f_c) as a function of the processor clock frequency f_c (0.5-4.0 GHz), for FPM, EDO, PC-66/100/133, PC-800D, DDR 266/333/400 and DDR 333D/400D/533D memories (D: dual channel), in systems from the Pentium and Pentium Pro to the Pentium II, Pentium III and Pentium 4.

References

[1]: Gavrichenkov I., "DDR2 vs. DDR: Revenge Gained," Xbit Laboratories, Dec. 17, 2004, http://www.xbitlabs.com/articles/memory/display/ddr2-ddr.html
[2]: Vogt P., "Fully Buffered DIMM (FB-DIMM) Server Memory Architecture," Intel Developer Forum, Feb. 18, 2004, http://www.idt.com/content/OSA_S008_FB-DIMM-Arch.pdf
[3]: McTague M. & David H., "Fully Buffered DIMM (FB-DIMM) Design Considerations," Intel Developer Forum, Feb. 18, 2004, http://www.idt.com/content/OSA-S009.pdf
[4]: Haas J. & Vogt P., "Fully Buffered DIMM Technology Moves Enterprise Platforms to the Next Level," Technology Intel Magazine, March 2005, pp. 1-7
[5]: Ganesh B., Jaleel A., Wang D. & Jacob B., "Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling," Proc. HPCA 2007
[6]: "Introducing FB-DIMM Memory: Birth of Serial RAM?," PCStats, Dec. 23, 2005, http://www.pcstats.com/articleview.cfm?articleid=1812&page=1
[7]: Haas J. & Vogt P., "Fully-Buffered DIMM Technology Moves Enterprise Platforms to the Next Level," Technology Intel Magazine, March 2005, http://www.intel.com/technology/magazine/computing/fully-buffered-dimm-0305.htm