The Efficacy of Error Mitigation Techniques for DRAM Retention Failures
Samira Khan, Donghyuk Lee, Yoongu Kim, Alaa R. Alameldeen, Chris Wilkerson, and Onur Mutlu

Motivation
[Figure: technology scaling shrinks DRAM cells]
Scaling DRAM cells results in more failures
• Longer manufacture-time tests
• Lower yield
• Higher cost

Vision: Online Profiling
[Figure: the system detects and mitigates failures in DRAM cells]
Detect and mitigate errors after the system has become operational
Reduces cost of testing, increases yield, enables scaling
What is the effectiveness of system-level detection and mitigation techniques?

Summary
• We analyze the efficacy of testing, guardbanding, ECC, and recent techniques
– Using experimental data from real DRAMs
• Key Conclusions
– Testing alone cannot guarantee reliable operation
– A combination of ECC, testing, and guardbanding is more effective
– Testing+ECC-based techniques block memory for a significant time, which degrades performance
• We propose a possible online profiling mechanism

Outline
• DRAM Scaling Problem
• Online Profiling as a Solution
• Efficacy of System-Level Detection and Mitigation
– Simple Techniques
– Recently Proposed Techniques
• Towards an Online Profiling System
• Conclusion

Retention Failure
[Figure: a DRAM cell drawn as a switch and a capacitor. Charge leaks from the capacitor, so the cell is refreshed every 64 ms. A timeline compares the cell's retention time with the 64 ms refresh interval.]

Intermittent Retention Failure
• Some retention failures are intermittent
• Two characteristics of intermittent retention failures
1. Data Pattern Sensitivity
2. Variable Retention Time

1. Data Pattern Sensitivity
[Figure: interference (noise) from neighboring cells; the same cell fails with one neighboring data pattern and does not fail with another]
Some cells can fail depending on the data stored in neighboring cells

2. Variable Retention Time
[Plot: retention time (ms, 0 to 640) of a cell over time]
The retention time of some cells changes at random points in time

Testing for Retention Failures
[Figure: manufacturing-time testing sorts chips into PASS and FAIL]
Manufacturers perform exhaustive testing of the DRAM chips
Chips failing tests are discarded

DRAM Scaling Problem
[Figure: manufacturing-time testing sorts chips into PASS and FAIL]
More interference in smaller technology nodes leads to lower yield and higher cost

System-Level Online Profiling
1. Not fully tested during manufacture time
2. Ship modules with possible failures
3. Detect and mitigate failures online
Increases yield, reduces cost, enables scaling

System-Level Online Profiling
What is the effectiveness of detection and mitigation techniques for retention failures?
Our goal is to analyze the efficacy of
1. Simple Techniques
• Testing, Guardbanding, ECC
2. Recently Proposed Techniques
• ArchShield, RAIDR, SECRET, RAPID, VS-ECC, Hi-ECC
We analyze the effectiveness of these techniques using experimental data from real DRAM
ArchShield ISCA'13, RAIDR ISCA'12, SECRET ICCD'12, RAPID HPCA'06, VS-ECC ISCA'11, Hi-ECC ISCA'10

Methodology
FPGA-based testing infrastructure
Evaluated 96 chips from three major vendors

Efficacy of Simple Techniques
1. Testing
2. Guardbanding
3. Error Correcting Code

1. Testing
1. Write some pattern in the module
2. Wait until refresh interval
3. Read and verify
Repeat
Test each module with different patterns for many rounds
Zeros (0000), Ones (1111), Tens (1010), Fives (0101), Random

Efficacy of Testing
[Plot: number of failing cells found (0 to 200000) vs. number of rounds (0 to 1000), for patterns ZERO, ONE, TEN, FIVE, RAND, and All]
Only a few rounds can discover most of the failures
Even after hundreds of rounds, a small number of new cells keep failing
Testing alone cannot detect all possible failures

2. Guardbanding
• Adding a safety margin on the refresh interval
• Can avoid VRT failures
[Figure: 2X and 4X guardbands relative to the refresh interval]
Effectiveness of guardbanding depends on the difference between retention times of a cell

Efficacy of Guardbanding
[Plot: number of failing cells (1 to 1000000, log scale) vs. retention time (0 to 20 seconds)]
Most of the cells exhibit close-by retention times
There are few cells with large differences in retention times
Even a large guardband (5X) cannot detect 5-15% of the intermittently failing cells

3. Error Correcting Code
• Error Correcting Code (ECC)
– Additional information to detect error and correct data

Effectiveness of ECC
[Plot: probability of new failure (1E-18 to 1E+00) vs. number of rounds (1 to 1000), for No ECC; SECDED; and SECDED, 2X Guardband]
SECDED code reduces error rate by 100 times
Adding 2X guardband reduces error rate by 1000 times
Combination of techniques reduces error rate by 10^7 times
A combination of mitigation techniques is much more effective

Efficacy of Recent Techniques
1. Bit Repair Techniques (in the paper)
2. Variable-Strength ECC (in the paper)
3. Higher-Strength ECC

Higher Strength ECC (Hi-ECC)
No testing, use strong ECC
But amortize cost of ECC over larger data chunk
Can potentially tolerate errors at the cost of higher strength ECC
Hi-ECC ISCA'10

Efficacy of Hi-ECC
[Plot: time to failure (years, 1E-05 to 1E+25) vs. number of rounds (1 to 10000), for 4EC5ED, 3EC4ED, DECTED, and SECDED, each with 2X Guardband; reference line at 10 years]
After starting with 4EC5ED, can reduce to 3EC4ED code after 2 rounds of tests
Can reduce to DECTED code after 10 rounds of tests
Can reduce to SECDED code after 7000 rounds of tests (4 hours)
Testing can help to reduce the ECC strength

Towards an Online Profiling System
Key Observations:
• Testing alone cannot detect all possible failures
• Combination of ECC and other mitigation techniques is much more effective
– But degrades performance
• Testing can help to reduce the ECC strength
– Even when starting with a higher strength ECC

Towards an Online Profiling System
1. Initially protect DRAM with strong ECC
2. Periodically test parts of DRAM
3. Mitigate errors and reduce ECC
Run tests periodically after a short interval at smaller regions of memory

Conclusion
• We analyze the efficacy of testing, guardbanding, ECC, and recent techniques at system level
– Using experimental data from real DRAMs
• Key Conclusions
– Testing alone cannot guarantee reliable operation
– A combination of techniques is more effective
– Testing+ECC-based techniques block memory for a significant time, which degrades performance
• We propose online profiling that runs in the background without disrupting current programs
– Run periodically at smaller regions of memory

Thank you
Full data set for 96 chips is available at
http://www.ece.cmu.edu/~safari/tools/dram-sigmetrics2014-fulldata.html

1. Bit Repair Techniques
1. Test DRAM module at boot up
2. Mitigate failures by repairing the bits
These techniques are vulnerable to new intermittent failures
ArchShield ISCA'13, RAIDR ISCA'12, SECRET ICCD'12, RAPID HPCA'06

Efficacy of Bit Repair Techniques
[Plot: time to failure (days, 0 to 25) vs. number of rounds (1 to 10^7), for No Guardband and 2X Guardband]
Will fail immediately, even after initial testing of 10^4 rounds
System fails within 13 days, even after initial testing of 10^7 rounds
System fails within 23 days, even after initial testing of 10^7 rounds
Even longer tests are not sufficient to guarantee reliable operation

2. Variable-Strength ECC (VS-ECC)
1. Test DRAM module at boot up
2. Protect failed lines with strong ECC
Will fail as soon as there are two bit errors in SECDED lines
VS-ECC ISCA'11

Efficacy of VS-ECC
[Plot: time to failure (years, 1E-08 to 1E+02) vs. number of rounds (0 to 1000), for No Guardband and 2X Guardband; reference line at 10 years]
Memory blocked for 19 minutes in 2GB DRAM
Memory blocked for 7 minutes in 2GB DRAM
With higher capacity DRAM, memory will be blocked for an unacceptable amount of time

Challenges and Opportunities
Challenges:
• Performance Overhead
• Mitigation Overhead
Opportunities:
• Enable Failure-aware Optimizations

Reduction in Error Rate in all Modules
[Plot: reduction in new failure rate (1 to 100000, log scale) vs. number of rounds (0 to 1000)]

Difference in Modules
[Plot: number of failing cells (in millions, 0 to 20) vs. refresh interval (2 to 20 seconds), for modules A1-A4, B1-B4, and C1-C4]

Tested DRAM Modules
Manufacturer   Module Name   Assembly Date (Year-Week)   Number of Chips
A              A1            2013-18                     8
A              A2            2012-26                     8
A              A3            2013-18                     8
A              A4            2014-08                     8
B              B1            2012-37                     8
B              B2            2012-37                     8
B              B3            2012-41                     8
B              B4            2012-20                     8
C              C1            2012-29                     8
C              C2            2012-29                     8
C              C3            2013-22                     8
C              C4            2012-29                     8

Time to Test
Operation                  Time (2GB)   Time (64GB)
Write/Read a row           667.5 ns     667.5 ns
Write/Read whole module    174.98 ms    5.59 s
1 round, 1 pattern         413.96 ms    11.24 s
1 round, 5 patterns        2.06 s       56.22 s
1000 rounds, 5 patterns    34 m         15.6 hours

Temperature Controlled Environment
[Figure: temperature-controlled test setup]

Dependence of Retention Time on Temperature
[Plot annotations:]
• Fraction of cells that exhibited retention time failure at any tWAIT for any data pattern at 50 °C
• Normalized retention times of the same cells at 55 °C
• Normalized retention times of the same cells at 70 °C
• Best-fit exponential curves for retention time change with temperature
Slide courtesy Onur Mutlu, ISCA'13

Dependence of Retention Time on Temperature
Relationship between retention time and temperature is consistently bounded (predictable) within a device
Every 10 °C temperature increase causes a 46.5% reduction in retention time in the worst case

Effect of Temperature
• The worst-case fit curve for retention time at different temperatures corresponds to e^(-0.0625 T), where T is the temperature [ISCA'13]
• A 10 °C increase in temperature results in a reduction of 1 – e^(-0.0625 × 10) = 46.5%
• 1 second at 45 °C corresponds to 82 ms at 85 °C
• 20 seconds at 45 °C corresponds to 1640 ms at 85 °C

Characteristics not Dependent on Refresh Interval
[Plot: probability of new bit failure (1E-12 to 1E-03) vs. number of rounds (0 to 1000), for refresh intervals of 2 s, 4 s, 5 s, and 10 s]

Expected Number of Multi-Bit Failures
[Plot: expected number of words (8B) (1E-24 to 1E+06) vs. number of rounds (1 to 1000), for 1-, 2-, and 3-bit failures, each with and without a 2X guardband]
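
The write, wait, read-and-verify loop on the "Testing" slide can be sketched on a toy memory model. Everything below is an illustrative assumption rather than the authors' FPGA infrastructure: `SimulatedModule`, `run_rounds`, the 8-bit words, and the per-wait flip probabilities are hypothetical, the "Random" pattern is simplified to one random word repeated across the module, and the model does not capture data pattern sensitivity.

```python
import random

WORD_BITS = 8
# Nibble patterns from the Testing slide, repeated to fill an 8-bit word.
PATTERNS = {"ZERO": 0x00, "ONE": 0xFF, "TEN": 0xAA, "FIVE": 0x55}


class SimulatedModule:
    """Toy memory (hypothetical): each weak cell flips with a fixed probability per wait."""

    def __init__(self, num_words, weak_cells, rng):
        self.words = [0] * num_words
        self.weak_cells = weak_cells  # {(word, bit): flip probability per wait}
        self.rng = rng

    def write_all(self, value):
        self.words = [value] * len(self.words)

    def wait_refresh_interval(self):
        # Stand-in for waiting one refresh interval: weak cells may lose data.
        for (word, bit), prob in self.weak_cells.items():
            if self.rng.random() < prob:
                self.words[word] ^= 1 << bit

    def read(self, word):
        return self.words[word]


def run_rounds(module, rounds, rng):
    """Run write / wait / read-and-verify for every pattern, for many rounds."""
    failing = set()
    found_per_round = []
    for _ in range(rounds):
        # Four fixed patterns plus one random word per round (a simplification).
        values = list(PATTERNS.values()) + [rng.getrandbits(WORD_BITS)]
        for value in values:
            module.write_all(value)
            module.wait_refresh_interval()
            for word in range(len(module.words)):
                diff = module.read(word) ^ value
                for bit in range(WORD_BITS):
                    if (diff >> bit) & 1:
                        failing.add((word, bit))
        found_per_round.append(len(failing))  # cumulative count of failing cells
    return failing, found_per_round


if __name__ == "__main__":
    rng = random.Random(0)
    weak = {(1, 3): 0.5, (5, 0): 0.01}  # hypothetical weak cells
    module = SimulatedModule(num_words=8, weak_cells=weak, rng=rng)
    cells, curve = run_rounds(module, rounds=50, rng=rng)
    print(sorted(cells), curve[-1])
```

In this toy model, a cell with a low flip probability may only show up after many rounds, which mirrors the slide's observation that a small number of new cells keep failing even after hundreds of rounds.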
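
The arithmetic on the "Effect of Temperature" slide can be checked with a few lines. This sketch only applies the worst-case exponent quoted on the slide, e^(-0.0625 T); the function names are hypothetical, and the 45 °C to 85 °C interpretation of the two examples is an assumption that happens to reproduce the slide's numbers.

```python
import math

# Worst-case exponent from the slide: retention time scales as e^(-0.0625 * T).
WORST_CASE_RATE_PER_DEG_C = 0.0625


def reduction_fraction(delta_temp_c, rate=WORST_CASE_RATE_PER_DEG_C):
    """Fractional loss of retention time for a temperature increase of delta_temp_c."""
    return 1.0 - math.exp(-rate * delta_temp_c)


def scale_retention_ms(retention_ms, from_temp_c, to_temp_c,
                       rate=WORST_CASE_RATE_PER_DEG_C):
    """Scale a retention time measured at from_temp_c to to_temp_c (worst-case fit)."""
    return retention_ms * math.exp(-rate * (to_temp_c - from_temp_c))


if __name__ == "__main__":
    print(f"10 C increase: {reduction_fraction(10) * 100:.1f}% reduction")
    print(f"1 s at 45 C  -> {scale_retention_ms(1000, 45, 85):.0f} ms at 85 C")
    print(f"20 s at 45 C -> {scale_retention_ms(20000, 45, 85):.0f} ms at 85 C")
```

The 20-second case evaluates to roughly 1642 ms; the slide's 1640 ms is consistent with multiplying the rounded 82 ms figure by 20.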