LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory Systems

Ani Udipi§, Naveen Muralimanohar*, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs
§ Currently with ARM

Memory Reliability
• Datacenters are the backbone of the web-connected infrastructure
  – Reliability is essential
• Memory reliability is a major concern [Schroeder et al., SIGMETRICS '09]
  – Memory is among the most error-prone parts of a server
  – Even a few uncorrectable errors will require DIMM replacement
  – Memory ranks near the top of component replacements in datacenters
  – Increases downtime and operational cost
Source: Nagios

Some Numbers
• A single server blade: 2 billion DRAM cells per chip × 36 DRAM chips per DIMM × 2 DIMMs per channel × 4 channels per processor × 4 processors per blade = ~2.5 × 10^12 DRAM cells
• Datacenter: 16 blades per enclosure × 4 enclosures per rack × 10 racks per container × 40 containers per datacenter = ~64 × 10^15 DRAM cells
• Assume the MTTF per cell is the age of the universe (~14 billion years):
  – Blade DRAM MTTF = 2 days
  – Datacenter DRAM MTTF = 7 seconds

Target Reliability
• High-end servers commonly have high reliability expectations
  – Single Symbol Correct, Double Symbol Detect (SSC-DSD)
  – One symbol == one DRAM chip ("Chipkill")
• Today's systems employ symbol-based ECC codes

Problems with Existing Solutions
• Increased access granularity
  – Every data access is spread across 36 DRAM chips
  – JEDEC standards define a minimum access granularity per chip
  – Massive overfetch of data at multiple levels: wastes energy, wastes bandwidth
• Reduced rank-level parallelism
• x4 device width restriction
  – Fewer ranks for a given DIMM real estate
• Reliability level: 1 failed chip out of 36

A New Approach: LOT-ECC
• Operate on a single rank of x8 memory: 9 chips
  – and support 1 failed chip out of 9
• Multiple tiers of localized protection
  – Tier 1: Local Error Detection (checksum)
  – Tier 2: Global Error Correction (parity)
  – Tiers 3 and 4 handle specific failure cases
• Data mapping handled by the memory controller with firmware support
  – Transparent to OS, caches, etc.
  – Strictly commodity DRAM used
• Significant power and performance benefits

Tier 1 – Local Error Detection (LED)
• Standard x72 DIMM (nine x8 parts): eight data chips + one ECC chip
• We use all 9 chips for both data and ECC
• 64 bits per chip per burst
  – 57 data bits + 7 checksum bits

Tier 1 – Local Error Detection (LED)
• 57 bits × 9 chips = 513 bits, but only one 512-bit cache line is read at a time
  – 57 data bits per chip on the first 8 chips; 56 bits on the 9th chip
  – The 1 spare bit on the 9th chip is used in a different tier of protection
• No performance impact on memory reads or writes
  – LED operations occur in parallel with data operations
• Note that LED is local to each chip
  – Need to pinpoint the exact failed chip, not simply detect an error somewhere in the rank

Tier 2 – Global Error Correction (GEC)
[Figure: data layout across Chips 0 through 8; each chip holds data segments, their LED checksums, and a slice of the GEC parity]
• A, B, C, D, E, F, G, H: cache lines, each comprised of segments X0 through X8
• LXN: Tier-1 Local Error Detection for cache line X, segment N
• PX0 ... PXN: Tier-2 Global Error Correction across segments X0 through X8
• PPX: parity across GEC segments PX0-6 through PX49-55
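The write-path encoding implied by the last three slides can be summarized in a short sketch. This is a minimal illustration rather than the paper's implementation: it assumes the 512-bit cache line is carved into nine segments (57 bits on chips 0-7, 56 on chip 8), that the GEC word is a plain bitwise XOR across the segments, and it uses a mod-127 stand-in for the 7-bit LED code (a fuller checksum sketch appears later in the deck). All function names are illustrative.

```python
# Minimal sketch of LOT-ECC encoding for one 64-byte cache line.
# Segment widths follow the slides; led_stub() and the XOR-based GEC
# are illustrative assumptions, not the paper's exact codes.

SEG_BITS = [57] * 8 + [56]        # per-chip data segment widths (Tier 1 slides)

def split_segments(line_512: int):
    """Carve a 512-bit cache line into nine per-chip segments."""
    segs, shift = [], 0
    for width in SEG_BITS:
        segs.append((line_512 >> shift) & ((1 << width) - 1))
        shift += width
    return segs

def led_stub(segment: int) -> int:
    """Stand-in for the 7-bit Tier-1 LED checksum (detection only)."""
    return segment % 127          # placeholder; see the 1's-complement sketch later

def encode_line(line_512: int):
    segs = split_segments(line_512)
    led  = [led_stub(s) for s in segs]   # Tier 1: one 7-bit checksum per chip
    gec  = 0
    for s in segs:                       # Tier 2: 57-bit parity across the segments
        gec ^= s                         # (the ninth segment is zero-padded to 57 bits)
    return segs, led, gec                # each chip stores its segment + LED; GEC goes to the reserved region
```

On a read, the nine LED checks run in parallel with the data return; the GEC word is consulted only if one of them flags an error.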
The Devil is in the Details..
• ..and the details are in the paper!
• Need to detect and correct additional errors in the GEC region
  – The parity is 57 bits; the write granularity is 72 bits
  – Use the remaining 15 bits wisely: add two more tiers of protection (T3, T4), plus the surplus bit borrowed from data + LED
[Figure: GEC region layout across Chips 0 through 8; 7-bit GEC parity slices PA0-6 ... PA49-55, PA56, and PPA, each paired with a 1-bit T4 field]

Optimizing Write Behavior
[Figure: data segments, LED checksums, and GEC bits for cache lines A through H mapped into the same DRAM row across Chips 0 through 8]
• Every write has to update its GEC bits
  – We are already borrowing one bit from [data + LED] to use in the GEC
  – Put them all in the same DRAM row!
  – Guaranteed row-buffer hit
  – Data mapping handled by the memory controller

GEC Coalescing
• A DDR3 burst of 8 forces 72 bytes per access
  – GEC per cache line is only 72 bits
• With sufficient locality, one GEC write can potentially cover 8 data writes
  – In reality, each write becomes 1 + δ writes (for 0.125 < δ ≤ 1)
• Note that even with δ = 1, the benefits of the row-buffer hit remain
• Writes are typically buffered at the memory controller to avoid bus-turnaround overheads
  – The controller can re-order accesses to accommodate coalescing
• Results show three cases: basic design (δ = 1), simple coalescing (measured δ), and oracular design (δ = 0.125)

Constructing the LED Code
• Use a 7-bit ECC code to detect errors in 57 data bits
  – We choose a 7-bit 1's complement checksum (see the sketch after this slide)
• Paper details code operation and computes FIT
  – Single-bit, double-bit, row, column, row-column, pin, chip, multiple random errors, and combinations
• Very small rate of undetected errors
  – Caused by very specific, uncommon bit-flip combinations
  – Less than 5E-5 FIT!
• Captures some failure modes NOT captured by existing mechanisms (failure of 2 chips out of 18, errors in more than 2 chips per rank, etc.)
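As a rough illustration of how a 7-bit 1's-complement checksum over 57 data bits could work: assume the data bits are grouped into nine 7-bit words (the top word only partially filled) and summed with end-around carry. The word grouping, padding, and the names led7_checksum and led7_check are assumptions for this sketch, not the paper's exact construction.

```python
# Sketch of a 7-bit 1's-complement checksum over 57 data bits (Tier-1 LED style).

def led7_checksum(data_57: int) -> int:
    """Sum the data as nine 7-bit words with end-around carry; store the complement."""
    total = 0
    for i in range(9):                        # 9 words x 7 bits covers the 57 data bits
        total += (data_57 >> (7 * i)) & 0x7F
        if total > 0x7F:                      # fold the carry back in (1's complement add)
            total = (total & 0x7F) + 1
    return (~total) & 0x7F

def led7_check(data_57: int, stored: int) -> bool:
    """Detection only: re-sum data plus stored checksum; all-ones means no error seen."""
    total = stored & 0x7F
    for i in range(9):
        total += (data_57 >> (7 * i)) & 0x7F
        if total > 0x7F:
            total = (total & 0x7F) + 1
    return total == 0x7F

# Example usage:
#   data = (1 << 56) | 0x123456789ABCD
#   c = led7_checksum(data)
#   led7_check(data, c)              -> True
#   led7_check(data ^ (1 << 20), c)  -> False  (any single-bit flip shifts the sum)
```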
Checksum Design
• Not all error combinations actually occur in DRAM
  – Small number of failure modes with specific root causes
  – The code's effectiveness under those failures is what matters
• Current symbol-based codes guarantee capturing 100% of SSC-DSD errors
  – At huge power and performance penalties
  – Likely overkill
• Not scalable as error rates increase
  – Use strong yet practical codes + RAS features instead
  – Example: Proactive patrol scrubbing will capture a majority of soft errors; they may not coincide with hard errors

Evaluation Methodology
• Performance analysis: in-house DRAM simulator
  – Models refresh, address/command bus, data bus, bank/rank/channel contention, and read/write queues
• Power analysis: Micron power calculator spreadsheet
  – Reflects the timing parameters assumed for the performance simulations
  – Bus-utilization and bank-utilization numbers obtained from the performance simulations
  – Accounts for activation power, read/write power, termination power, and background power
  – Includes low-power sleep modes

Evaluation Platforms
• Xeon 7500-like system
  – 8 DDR3 channels, 2 DIMMs/channel
  – Dual-ranked x4 or quad-ranked x8 DIMMs
  – "Lockstep mode" is the only supported mode: two ranks operate together to provide a 144-bit bus, so either bandwidth is wasted by masking out half the burst, or prefetching is forced
• Also evaluate Xeon 5500-like systems
  – 3 DDR3 channels, 3 DIMMs/channel
  – "Lockstep mode" wastes one channel entirely and gangs the other two
• Evaluate five design points each
  – Baseline symbol-based SSC-DSD
  – Virtualized ECC (Yoon & Erez, ASPLOS '10)
  – LOT-ECC with no coalescing, simple coalescing, and oracular coalescing

Power Results: Xeon 7500
[Figure: memory power, LOT-ECC vs. baseline; 43% reduction]

Power Results: Xeon 5500
[Figure: memory power, LOT-ECC vs. baseline; 32% reduction]

Performance Results: Xeon 7500
[Figure: memory latency reduction; LOT-ECC 4.6%, +GEC coalescing 7.7%, oracular 16.2%]

Performance Results: Xeon 5500
[Figure: memory latency reduction; LOT-ECC 42.9%, +GEC coalescing 46.9%, oracular 57.3%]

Storage Overhead
• For each 64-byte cache line
  – 63 bits of LED checksum
  – 57 bits of GEC parity
  – 7 bits of T3 code
  – 9 bits of T4 code
• Total storage overhead of 26.5% (see the worked numbers after this slide)
• Current ECC implementations and DIMMs already accept 12.5% through the extra chip
• The additional 14% sits in data memory, mapped via firmware
• Memory capacity is cheap as long as it stays commodity
  – Better to spend capacity on this than power/performance
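A quick sanity check of the overhead arithmetic on the Storage Overhead slide; the bit counts are taken directly from the slide, and the script simply redoes the division (the slide's 26.5% and 14% are rounded).

```python
# Storage overhead per 64-byte (512-bit) cache line, per the slide above.
DATA_BITS = 512
overhead  = {"LED checksums": 63,    # 7 bits x 9 chips
             "GEC parity":    57,
             "T3 code":        7,
             "T4 code":        9}

total = sum(overhead.values())                              # 136 bits
print(f"total overhead : {total / DATA_BITS:.1%}")          # ~26.6% (slide: 26.5%)
print(f"in the ECC chip: {64 / DATA_BITS:.1%}")             # 12.5%, already provided by the 9th chip
print(f"in data memory : {(total - 64) / DATA_BITS:.1%}")   # ~14.1% (slide: ~14%), mapped by firmware
```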
Key Contributions
• Multi-tiered protection design that keeps fault tolerance contained to fewer chips
• Unique data layout tailored to the access mechanism of commodity DRAM systems
• Exploits row-buffer efficiency by co-locating data and all tiers of fault-tolerance codes
  – Mitigates the overhead of the additional writes typical of parity-based systems
• Coalescing optimization to further minimize the impact of parity writes

Key Benefits
• Power efficiency: fewer chips activated per access, reduced access granularity, and reduced static energy through better use of low-power modes (43% memory power savings)
• Performance gains: more rank-level parallelism and reduced access granularity (7.7% memory latency reduction)
• Improved protection: can handle 1 failed chip out of 9, compared with 1 in 36 today
• Flexibility: works with a single rank of x4 DRAMs or the more efficient x8 DRAMs
• Implementation ease: changes to the memory controller and system firmware only; commodity processor, memory, and OS

BACKUP SLIDES

Tier 2 – Global Error Correction (GEC)
• GEC is a parity computed across the cache line's segments, one per chip
• LED has already pinpointed the erroneous segment
  – Error correction is trivial
• Storing the parity
  – A portion of memory is set aside to hold the GEC
  – Handled by the memory controller + firmware
• No impact on reads unless an error is detected
• GEC is also self-contained (single cache line)
  – No read-before-write
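The correction path described on this backup slide is trivial once the LED has named the failed chip. A minimal sketch, assuming the GEC word is the plain XOR of the nine per-chip segments (the ninth zero-padded) as in the earlier encoding sketch; correct_segment and its arguments are illustrative names.

```python
# Sketch of the GEC correction path: rebuild the segment of the chip that
# the Tier-1 LED flagged, using the eight surviving segments plus the parity.

def correct_segment(segments, gec_parity, failed_chip):
    """Reconstruct the failed chip's segment via XOR over the survivors and the parity."""
    rebuilt = gec_parity
    for chip, seg in enumerate(segments):
        if chip != failed_chip:          # fold in every surviving segment
            rebuilt ^= seg
    return rebuilt                       # equals the lost segment when GEC = XOR of all nine

# Usage: if the LED flags chip 3 on a read,
#   segments[3] = correct_segment(segments, gec_parity, failed_chip=3)
# Reads that pass all nine LED checks never touch the GEC at all,
# matching "no impact on reads unless an error is detected".
```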