LOT-ECC: LOcalized and Tiered Reliability
Mechanisms for Commodity Memory Systems
Ani Udipi§, Naveen Muralimanohar*, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs
§Currently with ARM
Memory Reliability
• Datacenters are the backbone of the web-connected infrastructure
  – Reliability is essential
• Memory reliability is a major concern [Schroeder et al., SIGMETRICS ’09]
  – Among the most error-prone parts of a server
  – Even a few uncorrectable errors require DIMM replacement
• DIMM replacement ranks near the top of component replacements in datacenters
  – Increases downtime
  – Increases operational cost

[Chart: component replacements in datacenters; source: Nagios]
Some Numbers
• A single server blade:
  2 billion DRAM cells per chip × 36 DRAM chips per DIMM × 2 DIMMs per channel
  × 4 channels per processor × 4 processors per blade ≈ 2.5 × 10^12 DRAM cells
• A datacenter:
  16 blades per enclosure × 4 enclosures per rack × 10 racks per container
  × 40 containers per datacenter ≈ 64 × 10^15 DRAM cells
• Assume the MTTF per cell is the age of the universe, ~14 billion years (checked below):
  – Blade DRAM MTTF ≈ 2 days
  – Datacenter DRAM MTTF ≈ 7 seconds
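A quick Python sanity check of this arithmetic, assuming a simple series-failure model (aggregate MTTF = per-cell MTTF / number of cells):

    # Sanity check of the MTTF scaling argument, assuming a simple series
    # model: aggregate MTTF = per-cell MTTF / cell count.
    YEAR_S = 365 * 24 * 3600                    # seconds per year (~3.15e7)
    cell_mttf_s = 14e9 * YEAR_S                 # per-cell MTTF: age of the universe

    blade_cells = 2e9 * 36 * 2 * 4 * 4          # ~2.5e12 cells per blade
    dc_cells = blade_cells * 16 * 4 * 10 * 40   # ~64e15 cells per datacenter

    print(f"blade MTTF: {cell_mttf_s / blade_cells / 86400:.1f} days")  # ~2 days
    print(f"datacenter MTTF: {cell_mttf_s / dc_cells:.1f} seconds")     # ~7 seconds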
Target Reliability
• High-end servers commonly have high reliability expectations
  – Single Symbol Correct, Double Symbol Detect (SSC-DSD)
  – One symbol == one DRAM chip (“Chipkill”)
• Today’s systems employ symbol-based ECC codes
Problems with Existing Solutions
• Increased access granularity
  – Every data access is spread across 36 DRAM chips
  – JEDEC standards define a minimum access granularity per chip
  – Massive overfetch of data at multiple levels (sketched below)
     ◦ Wastes energy
     ◦ Wastes bandwidth
     ◦ Reduces rank-level parallelism
• x4 device-width restriction
  – Fewer ranks for a given amount of DIMM real estate
• Reliability level: 1 failed chip out of 36
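Rough Python arithmetic behind the overfetch claim, assuming the conventional x4 chipkill organization above (36 x4 chips lockstepped as a 144-bit bus, DDR3 burst of 8); the figures are illustrative:

    # Access granularity for conventional x4 chipkill (assumption: 36 x4
    # chips, 32 data + 4 ECC, lockstepped as a 144-bit bus, DDR3 burst of 8).
    chips, chip_width, burst = 36, 4, 8
    total_bits = chips * chip_width * burst       # 1152 bits = 144 B on the bus
    data_bits = 32 * chip_width * burst           # 1024 data bits per access
    line_bits = 64 * 8                            # one 64-byte cache line

    print(total_bits // 8, "bytes moved per access")       # 144
    print(data_bits // line_bits, "lines' worth of data")  # 2 -> 2x overfetch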
A New Approach: LOT-ECC
• Operate on a single rank of x8 memory: 9 chips
  – and tolerate 1 failed chip out of 9
• Multiple tiers of localized protection
  – Tier 1: Local Error Detection (checksum)
  – Tier 2: Global Error Correction (parity)
  – Tiers 3 and 4 handle specific failure cases
• Data mapping handled by the memory controller with firmware support
  – Transparent to the OS, caches, etc.
  – Strictly commodity DRAM used
• Significant power and performance benefits
Tier 1 – Local Error Detection (LED)
[Figure: burst layout on Chip 0 … Chip 7 and Chip 8, each chip split into data + checksum]
• Standard x72 DIMM (nine x8 parts): eight data + one ECC
• We use all 9 chips for both data and ECC
• 64 bits per chip per burst: 57 data + 7 checksum
Tier 1 – Local Error Detection (LED)
• 57 bits × 9 = 513 (see the sketch after this list)
  – Only 1 cache line is read at a time
  – 57 bits/chip on the first 8 chips; 56 bits on the 9th chip
  – The 1 extra bit on the 9th chip is used in a different tier of protection
• No performance impact on memory reads or writes
  – LED ops occur in parallel with data ops
• Note that LED is local to each chip
  – Need to pinpoint the exact failed chip, not simply detect an error in the rank
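The bit budget in Python (values straight from this slide):

    # Bit budget for Tier-1 LED: nine x8 chips, 64 bits per chip per burst,
    # split into 57 data bits + a 7-bit per-chip checksum.
    CHIPS, BITS_PER_CHIP, LED_BITS = 9, 64, 7
    data_per_chip = BITS_PER_CHIP - LED_BITS      # 57 data bits per chip

    total_data = CHIPS * data_per_chip            # 513 bits
    spare = total_data - 64 * 8                   # 512-bit line leaves 1 spare bit
    print(total_data, spare)                      # 513 1; the spare bit (on the
                                                  # 9th chip) feeds a higher tier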
Tier 2 – Global Error Correction (GEC)
[Figure: layout across Chip 0 … Chip 8, showing the data, LED, and GEC regions of each chip]
• A, B, C, …, H – cache lines, each comprised of segments X0 through X8
• LXN – Tier-1 Local Error Detection for cache line X, segment N
• PX – Tier-2 Global Error Correction across segments X0 through X8
• PPX – parity across GEC segments PX0-6 through PX49-55
The Devil is in the Details…
• …and the details are in the paper!
• Need to detect and correct additional errors in the GEC region
  – Parity is 57 bits; write granularity is 72 bits
  – Use the remaining 15 bits wisely: add two more tiers of protection
[Figure: GEC word layout – chips 0–7 each hold 7 parity bits (PA0-6 … PA49-55) plus 1 T4 bit, the surplus bit borrowed from data + LED; chip 8 holds PA56, PPA, and a T4 bit]
Optimizing Write Behavior
[Figure: data, LED, and GEC for cache lines A–H co-located in the same DRAM row across chips 0–8]
• Every write has to update its GEC bits
  – Already borrowing one bit from [data + LED] to use in the GEC
  – Put them all in the same DRAM row (see the sketch below)!
  – Guaranteed row-buffer hit
  – Data mapping handled by the memory controller
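A hypothetical Python sketch of such a mapping; the row size, constants, and layout are illustrative assumptions, not the paper's exact scheme:

    # Hypothetical mapping: reserve the tail of each DRAM row for the GEC
    # words of the cache lines stored earlier in that row, so a data write
    # and its GEC update always hit the same open row buffer.
    ROW_BYTES = 8192                  # assumed 8 KB DRAM row
    LINE_BYTES = 64                   # cache-line size
    GEC_BYTES = 9                     # one 72-bit GEC word per line

    LINES_PER_ROW = ROW_BYTES // (LINE_BYTES + GEC_BYTES)   # 112 lines fit

    def gec_column(line_in_row: int) -> int:
        """Byte offset, within the same row, of a line's GEC word."""
        assert 0 <= line_in_row < LINES_PER_ROW
        return LINES_PER_ROW * LINE_BYTES + line_in_row * GEC_BYTES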
GEC Coalescing
• A DDR3 burst of 8 forces 72 bytes per access
  – GEC per cache line is only 72 bits
• With sufficient locality, one GEC write can potentially cover 8 data writes
  – In reality, each write becomes 1 + δ writes (for 0.125 ≤ δ ≤ 1)
• Note that even with δ = 1, the benefits of the row-buffer hit remain
• Writes are typically buffered at the memory controller to avoid bus-turnaround overheads
  – The controller can re-order accesses to accommodate coalescing
• Results show three cases: basic design (δ = 1), simple coalescing (measured δ), and oracular design (δ = 0.125), illustrated below
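The bounds on the resulting write traffic, in Python:

    # Each data write costs 1 + delta writes, where delta is the fraction of
    # GEC updates that cannot be merged (0.125 <= delta <= 1, since one
    # 72-byte GEC write can cover the 72-bit GEC words of up to 8 lines).
    for name, delta in [("basic (no coalescing)", 1.0), ("oracular", 0.125)]:
        print(f"{name}: {1 + delta:.3f} writes per data write")
    # simple coalescing achieves a measured delta between these two bounds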
Constructing the LED Code
• Use a 7-bit ECC code to detect errors in 57 data bits
  – We choose a 7-bit 1’s complement checksum (sketched below)
• Paper details code operation and computes FIT
  – Single-bit, double-bit, row, column, row-column, pin, chip, multiple random errors, and combinations
• Very small rate of undetected errors
  – Caused by very specific, uncommon bit-flip combinations
  – Less than 5 × 10⁻⁵ FIT!
• Captures some failure modes NOT captured by existing mechanisms (failure of 2 chips out of 18, errors in >2 chips per rank, etc.)
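A minimal Python sketch of such a checksum, assuming the 57 data bits are zero-padded to nine 7-bit words; the paper's exact grouping may differ:

    # 7-bit one's complement checksum over a (zero-padded) 63-bit value.
    def led_checksum(data_bits: int) -> int:
        s = 0
        for i in range(9):                    # nine 7-bit words
            s += (data_bits >> (7 * i)) & 0x7F
        while s >> 7:                         # fold end-around carries
            s = (s & 0x7F) + (s >> 7)
        return (~s) & 0x7F                    # one's complement of the sum

    def led_check(data_bits: int, stored: int) -> bool:
        """An error is detected when the recomputed checksum mismatches."""
        return led_checksum(data_bits) == stored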
Checksum Design
• Not all error combinations actually occur in DRAM
  – Small number of failure modes with specific root causes
  – The code’s effectiveness under those failures is what matters
• Current symbol-based codes guarantee capturing 100% of SSC-DSD errors
  – At huge power and performance penalties
  – Likely overkill
• Not scalable as error rates increase
  – Use strong yet practical codes + RAS features
  – Example: proactive patrol scrubbing will capture a majority of soft errors, so they are less likely to coincide with hard errors
Evaluation Methodology
• Performance analysis: in-house DRAM simulator
  – Models refresh, address/command bus, data bus, bank/rank/channel contention, read/write queues
• Power analysis: Micron power calculator spreadsheet
  – Reflects the timing parameters assumed for performance simulations
  – Bus utilization and bank utilization numbers obtained from performance simulations
  – Accounts for activation power, read/write power, termination power, and background power
  – Includes low-power sleep modes
Evaluation Platforms
• Xeon 7500-like system
  – 8 DDR3 channels, 2 DIMMs/channel
  – Dual-ranked x4 or quad-ranked x8 DIMMs
  – “Lockstep mode” is the only supported mode
     ◦ Two ranks operate together to provide a 144-bit bus
     ◦ Wasted bandwidth from masking out half the burst, OR forced prefetching
• Also evaluate Xeon 5500-like systems
  – 3 DDR3 channels, 3 DIMMs/channel
  – “Lockstep mode” wastes one channel entirely and gangs the other two
• Evaluate five design points each
  – Baseline symbol-based SSC-DSD
  – Virtualized ECC [Yoon & Erez, ASPLOS ’10]
  – LOT-ECC with no coalescing, simple coalescing, or oracular coalescing
Power Results: Xeon 7500
[Chart: memory power across design points; LOT-ECC reduces memory power by 43%]

Power Results: Xeon 5500
[Chart: memory power across design points; LOT-ECC reduces memory power by 32%]

Performance Results: Xeon 7500
[Chart: memory latency reduction – LOT-ECC 4.6%, +GEC coalescing 7.7%, oracular 16.2%]

Performance Results: Xeon 5500
[Chart: memory latency reduction – LOT-ECC 42.9%, +GEC coalescing 46.9%, oracular 57.3%]
Storage Overhead
• For each 64-byte cache line:
  – 63 bits of LED checksum
  – 57 bits of GEC parity
  – 7 bits of T3 code
  – 9 bits of T4 code
• Total storage overhead of 26.5% (verified below)
• Current ECC implementations and DIMMs already accept 12.5% through the extra chip
• The additional 14% resides in data memory, managed via firmware
• Memory capacity is cheap as long as it is commodity
  – Better to spend capacity than power/performance
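The quoted total follows directly from these counts:

    # Per-line overhead, in bits, against a 512-bit (64-byte) cache line.
    led, gec, t3, t4 = 63, 57, 7, 9
    overhead = led + gec + t3 + t4          # 136 bits
    print(f"{overhead / 512:.2%}")          # 26.56%, quoted as 26.5% here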
Key Contributions
• Multi-tiered protection design that contains fault tolerance to fewer chips
• Unique data layout tailored to the access mechanism of commodity DRAM systems
• Exploits row-buffer efficiency
  – Co-locates data and all tiers of fault-tolerance codes
  – Mitigates the overhead of the additional writes typical of parity-based systems
• Coalescing optimization to further minimize the impact of parity writes
Key Benefits
• Power efficiency: fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes (43% memory power savings)
• Performance gains: more rank-level parallelism, reduced access granularity (7.7% memory latency reduction)
• Improved protection: can handle 1 failed chip out of 9, compared to 1 in 36 today
• Flexibility: works with a single rank of x4 DRAMs or more efficient x8 DRAMs
• Implementation ease: changes to the memory controller and system firmware only; commodity processor, memory, and OS
BACKUP SLIDES
Tier 2 – Global Error Correction (GEC)
• GEC is a parity written across the cache-line segments on each chip
• LED has already pinpointed the erroneous segment
  – Error correction is trivial (see the sketch below)
• Storing the parity
  – A portion of memory is set aside to hold the GEC
  – Handled by the memory controller + firmware
• No impact on reads unless an error is detected
• GEC is also self-contained (single cache line)
  – No read-before-write
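A minimal Python sketch of the correction step, assuming the GEC parity is a bitwise XOR across a line's nine segments (helper names are hypothetical):

    # Once Tier-1 LED pinpoints the failed chip, the bad segment is rebuilt
    # as the XOR of the GEC parity and the eight surviving segments.
    from functools import reduce
    from operator import xor

    def rebuild_segment(segments: list[int], failed: int, parity: int) -> int:
        """Recover segments[failed] from the parity and surviving segments."""
        survivors = (s for i, s in enumerate(segments) if i != failed)
        return reduce(xor, survivors, parity)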