SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS
Samira Khan

MEMORY IN TODAY'S SYSTEM
Processor, DRAM memory, storage. DRAM is critical for performance.

MAIN MEMORY CAPACITY
Systems need gigabytes of DRAM, and the demand for high capacity keeps increasing, driven by (1) more cores and (2) data-intensive applications.

DEMAND 1: INCREASING NUMBER OF CORES
2012-2015: SPARC M5 (6 cores), SPARC M6 (12 cores), SPARC M7 (32 cores). More cores need more memory.

DEMAND 2: DATA-INTENSIVE APPLICATIONS
Memory caching and in-memory databases create even more demand for memory.

HOW DID WE GET MORE CAPACITY?
Technology scaling: DRAM cells keep shrinking. DRAM scaling enabled high capacity.

DRAM SCALING TREND
[Chart: megabits per chip (log scale, 1 to 10,000) vs. start of mass production, 1985-2015]
DRAM scaling is getting difficult. Source: Flash Memory Summit 2013, Memcon 2014.

DRAM SCALING CHALLENGE
Manufacturing reliable cells at low cost is getting difficult.

WHY IS IT DIFFICULT TO SCALE?
To answer this, we need to take a closer look at a DRAM cell.

WHY IS IT DIFFICULT TO SCALE?
A DRAM cell consists of a transistor and a capacitor (logical view); the vertical cross section shows the bitline, contact, capacitor, and transistor.

WHY IS IT DIFFICULT TO SCALE?
Two challenges in scaling: 1. Capacitor reliability 2. Cell-to-cell interference.

SCALING CHALLENGE 1: CAPACITOR RELIABILITY
As cells shrink, the capacitor is getting taller.

CAPACITOR RELIABILITY
[Figure: tall, narrow capacitor structure at the 58 nm node]
This results in failures during manufacturing. Source: Flash Memory Summit, Hynix 2012.

IMPLICATION: DRAM COST TREND
[Chart: cost per bit vs. year, 2000-2020, with a projection beyond 2015]
Cost per bit is expected to go higher. Source: Flash Memory Summit, Hynix 2012.

SCALING CHALLENGE 2: CELL-TO-CELL INTERFERENCE
Smaller technology nodes bring cells closer together, so there is more interference between cells, and more interference results in more failures.

IMPLICATION: DRAM ERRORS IN THE FIELD
1.52% of DRAM modules failed in Google servers; 1.6% of DRAM modules failed at LANL. [SIGMETRICS'09, SC'12]

GOAL
Enable high-capacity memory without sacrificing reliability.

TWO DIRECTIONS
1. DRAM is difficult to scale: make DRAM scalable. [SIGMETRICS'14, DSN'15, ongoing]
2. New technologies are predicted to be highly scalable: leverage new technologies. [WEED'13, ongoing]

[Talk outline] DRAM SCALING CHALLENGE / SYSTEM-LEVEL TECHNIQUES TO ENABLE DRAM SCALING / NON-VOLATILE MEMORIES: UNIFIED MEMORY & STORAGE / PAST AND FUTURE WORK

TRADITIONAL APPROACH TO ENABLE DRAM SCALING
Unreliable DRAM cells are made reliable at manufacturing time, and the system in the field assumes reliable DRAM. DRAM has a strict reliability guarantee.

MY APPROACH
Instead of making DRAM fully reliable at manufacturing time, shift the responsibility for reliability to the system in the field.

VISION: SYSTEM-LEVEL DETECTION AND MITIGATION
Detect and mitigate errors after the system has become operational (online profiling). This reduces cost, increases yield, and enables scaling.
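To make the online-profiling vision concrete, here is a minimal C sketch of the kind of background loop such a system could run. It is my own illustration rather than the mechanism proposed in the talk; the region size, the scheduling, and the row_passes/mitigate callbacks are assumptions.

    /* Sketch (not the talk's exact design): periodically test a small region
     * of DRAM in the background and mitigate rows that fail, after the
     * system is already operational. Region size and callbacks are assumed. */
    #include <stdint.h>
    #include <stdbool.h>

    #define REGION_ROWS 1024   /* hypothetical number of rows tested per step */

    void profile_step(uint64_t num_rows, uint64_t *next_row,
                      bool (*row_passes)(uint64_t row),
                      void (*mitigate)(uint64_t row)) {
        for (uint64_t i = 0; i < REGION_ROWS; i++) {
            uint64_t row = (*next_row + i) % num_rows;
            if (!row_passes(row))   /* failure detected in the field...           */
                mitigate(row);      /* ...remap, retire, or protect the row       */
        }
        *next_row = (*next_row + REGION_ROWS) % num_rows;  /* next region next time */
    }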
CHALLENGE: INTERMITTENT FAILURES
[Section outline] Efficacy of system-level techniques with real DRAM chips / high-level design / new system-level techniques.

CHALLENGE: INTERMITTENT FAILURES
If failures were permanent, a simple boot-up test would have worked. What are the characteristics of intermittent failures?

DRAM RETENTION FAILURE
A DRAM cell stores data as charge on a capacitor behind an access switch; the charge leaks, so every cell is refreshed every 64 ms. A retention failure occurs when a cell's retention time falls below the 64 ms refresh interval.

INTERMITTENT RETENTION FAILURE
Some retention failures are intermittent. Two characteristics of intermittent retention failures: (1) data pattern sensitivity and (2) variable retention time.

DATA PATTERN SENSITIVITY
Due to interference, some cells can fail or not fail depending on the data stored in neighboring cells. [JSSC'88, MDTD'02]

VARIABLE RETENTION TIME
[Chart: retention time (0-640 ms) over time]
Retention time changes randomly in some cells. [IEDM'87, IEDM'92]

CURRENT APPROACH TO CONTAIN INTERMITTENT FAILURES
1. Manufacturers perform exhaustive testing of DRAM chips at manufacturing time. 2. Chips failing the tests are discarded.

SCALING AFFECTING TESTING
More interference in smaller technology nodes means longer tests and more failures, leading to lower yield and higher cost.

SYSTEM-LEVEL ONLINE PROFILING
1. Modules are not fully tested at manufacture time. 2. Modules ship with possible failures. 3. Failures are detected and mitigated online. This increases yield, reduces cost, and enables scaling.

CHALLENGE: INTERMITTENT FAILURES
[Section outline] Efficacy of system-level techniques with real DRAM chips / high-level design / new system-level techniques.

EFFICACY OF SYSTEM-LEVEL TECHNIQUES
Can we leverage existing techniques? (1) Testing, (2) guardbanding, (3) error correcting code (ECC), (4) higher-strength ECC. We analyze the effectiveness of these techniques using experimental data from real DRAMs; the data set is publicly available.

METHODOLOGY
FPGA-based testing infrastructure; evaluated 96 chips from three major vendors.

1. TESTING
Write a data pattern into the module, wait until the refresh interval expires, then read the data back and verify; repeat. Each module is tested with different patterns for many rounds: zeros (0000), ones (1111), tens (1010), fives (0101), and random data.

EFFICACY OF TESTING
[Chart: number of failing cells found (0-200,000) vs. number of rounds (0-1000), for the zero, one, ten, five, and random patterns and for all patterns combined]
Only a few rounds can discover most of the failing cells, but even after hundreds of rounds a small number of new cells keep failing.
Conclusion: Testing alone cannot detect all possible failures.
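The per-round procedure described under "1. TESTING" maps naturally to code. The C sketch below shows one round for one pattern, assuming a memory-mapped test region and a caller-supplied way to wait out the (possibly extended) refresh interval; it illustrates the described procedure, not the actual FPGA test harness.

    /* One round of the write / wait / read-and-verify retention test described
     * above, for a single data pattern (e.g., 0x00000000 zeros, 0xFFFFFFFF ones,
     * 0xAAAAAAAA "tens", 0x55555555 "fives", or random data). The region pointer
     * and the wait callback are assumptions; the real infrastructure is FPGA-based. */
    #include <stdint.h>
    #include <stddef.h>

    size_t test_round(volatile uint32_t *region, size_t nwords, uint32_t pattern,
                      void (*wait_refresh_interval)(void)) {
        for (size_t i = 0; i < nwords; i++)     /* 1. write the pattern                  */
            region[i] = pattern;
        wait_refresh_interval();                /* 2. wait (e.g., 64 ms or longer)       */
        size_t failures = 0;
        for (size_t i = 0; i < nwords; i++)     /* 3. read back and verify               */
            if (region[i] != pattern)
                failures++;
        return failures;                        /* number of words that lost their data  */
    }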
2. GUARDBANDING
• Adding a safety margin (e.g., 2X or 4X guardband) on the refresh interval
• Can avoid VRT failures
Effectiveness depends on the difference between the retention times of a cell.

EFFICACY OF GUARDBANDING
[Chart: number of failing cells (log scale, 1 to 1,000,000) vs. retention time in seconds, 0-20 s]
Most of the cells exhibit close-by retention times, but there are a few cells with large differences in retention times.
Conclusion: Even a large guardband (5X) cannot detect 5-15% of the failing cells.

3. ERROR CORRECTING CODE
• An error correcting code (ECC) adds additional information to detect errors and correct data.

EFFECTIVENESS OF ECC
[Chart: probability of a new failure (log scale, 1 to 1E-18) vs. number of rounds, for no ECC, SECDED, and SECDED with a 2X guardband]
The SECDED code reduces the error rate by about 100 times; adding a 2X guardband reduces the error rate by about 1000 times; the combination of techniques reduces the error rate by about 10^7 times.
Conclusion: A combination of mitigation techniques is much more effective.
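As a rough intuition for why ECC helps this much, the C sketch below compares the probability that an unprotected 64-bit word fails (any bit fails) with the probability that a SECDED-protected 72-bit codeword becomes uncorrectable (two or more bits fail), assuming independent bit failures. The per-bit probabilities used are illustrative assumptions, not the measured rates behind the chart.

    /* Back-of-the-envelope view of SECDED: without ECC a 64-bit word fails if
     * any bit fails; with SECDED a 72-bit codeword fails only if two or more
     * bits fail. Per-bit probabilities are illustrative, not measurements. */
    #include <stdio.h>
    #include <math.h>

    static double p_any_of(double p, int n) {         /* P(at least one of n bits fails) */
        return 1.0 - pow(1.0 - p, n);
    }
    static double p_two_or_more(double p, int n) {    /* P(at least two of n bits fail)  */
        return 1.0 - pow(1.0 - p, n) - n * p * pow(1.0 - p, n - 1);
    }

    int main(void) {
        const double per_bit[] = { 1e-6, 1e-7, 1e-8 };
        for (int i = 0; i < 3; i++) {
            double p = per_bit[i];
            printf("p=%.0e   no ECC: %.2e   SECDED uncorrectable: %.2e\n",
                   p, p_any_of(p, 64), p_two_or_more(p, 72));
        }
        return 0;
    }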
4. HIGHER-STRENGTH ECC (HI-ECC)
Skip testing and use a strong ECC instead, amortizing the cost of the ECC over a larger data chunk. This can potentially tolerate errors at the cost of higher-strength ECC.

EFFICACY OF HI-ECC
[Chart: time to failure in years (log scale) vs. number of rounds of testing, for 4EC5ED, 3EC4ED, DECTED, and SECDED codes, each with a 2X guardband; the 10-year mark is highlighted]
After starting with a 4EC5ED code, the system can reduce to a 3EC4ED code after 2 rounds of tests, to a DECTED code after 10 rounds, and to a SECDED code after about 7000 rounds (roughly 4 hours).
Conclusion: Testing can help reduce the ECC strength, but it blocks memory for hours.

CONCLUSIONS SO FAR
Key observations:
• Testing alone cannot detect all possible failures.
• A combination of ECC and other mitigation techniques is much more effective.
• Testing can help reduce the ECC strength, even when starting with a higher-strength ECC, but it degrades performance.

CHALLENGE: INTERMITTENT FAILURES
[Section outline] Efficacy of system-level techniques with real DRAM chips / high-level design / new system-level techniques.

TOWARDS AN ONLINE PROFILING SYSTEM
1. Initially protect DRAM with a strong ECC. 2. Periodically test parts of DRAM. 3. Mitigate errors and reduce the ECC strength. Tests run periodically, after a short interval, on small regions of memory.

CHALLENGE: INTERMITTENT FAILURES
[Section outline] Efficacy of system-level techniques with real DRAM chips / high-level design / new system-level techniques.

WHY SO MANY ROUNDS OF TESTS? DATA-DEPENDENT FAILURE
A cell fails only when a specific pattern is present in its neighboring cells. In the linear address space, cell X sits between X-1 and X+1, but DRAM internally scrambles addresses, so the physical neighbors of X may map to, for example, X-4 and X+2. As a result, even many rounds of random patterns cannot detect all failures.

DETERMINE THE LOCATION OF PHYSICALLY ADJACENT CELLS
With scrambled addresses, the physical left and right neighbors of a failing cell X sit at unknown offsets. Naive solution: for a given failure X, test every combination of two bit addresses in the row, which is O(n^2): 8192 x 8192 tests, about 49 days for a row with 8K cells. Our goal is to reduce the test time.
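The 49-day figure follows directly from the pair count if each pairwise test costs on the order of one refresh-interval wait. The small C check below reproduces the arithmetic; the 64 ms per-test cost is an assumption used for illustration.

    /* Sanity check of the naive O(n^2) neighbor-search cost quoted above:
     * 8192 * 8192 two-address combinations, each assumed to cost roughly one
     * ~64 ms wait, comes out to about 49-50 days for a single 8K-cell row. */
    #include <stdio.h>

    int main(void) {
        const long long bits_per_row = 8192;
        const long long tests = bits_per_row * bits_per_row;   /* O(n^2) pairs    */
        const double seconds = (double)tests * 0.064;          /* ~64 ms per test */
        printf("tests: %lld, total time: %.1f days\n", tests, seconds / 86400.0);
        return 0;
    }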
STRONGLY VS. WEAKLY DEPENDENT CELLS
A strongly dependent cell fails even if only one neighbor's data changes; a weakly dependent cell fails only if both neighbors' data change. For strongly dependent cells, the neighbor location can be found by testing every bit address one at a time (0, 1, ..., X, X+1, X+2, ..., n).

PHYSICAL NEIGHBOR LOCATION TEST
Testing every bit address will detect only one neighbor, so parallel tests are run in different rows and the discovered locations are aggregated across rows.

PHYSICAL NEIGHBOR LOCATION TEST
[Figure: linear vs. recursive testing over scrambled bit addresses 0-7]
Linear testing checks one bit address at a time; recursive testing narrows down the neighbor's location by testing progressively smaller groups of addresses (0-3 and 4-7, then pairs, then individual bits).

PHYSICAL NEIGHBOR-AWARE TEST
Vendor:                     A          B           C
Reduction in tests:         745654X    1016800X    745654X
Extra failures detected:    27%        11%         14%
The neighbor-aware test detects more failures with a small number of tests by leveraging neighbor-location information.

NEW SYSTEM-LEVEL TECHNIQUES
1. Reduce failure mitigation overhead: mitigate for the common case rather than the worst case, reducing mitigation cost by around 3X-10X (ongoing). 2. Variable refresh: leverage adaptive refresh to mitigate failures (DSN'15).

SUMMARY
Proposed online profiling to enable scaling. Analyzed the efficacy of system-level detection and mitigation techniques. Found that a combination of techniques is much more effective, but blocks memory. Proposed new system-level techniques to reduce detection and mitigation overhead.

[Talk outline] DRAM SCALING CHALLENGE / SYSTEM-LEVEL TECHNIQUES TO ENABLE DRAM SCALING / NON-VOLATILE MEMORIES: UNIFIED MEMORY & STORAGE / PAST AND FUTURE WORK

TWO-LEVEL STORAGE MODEL
The CPU accesses DRAM memory with loads and stores (volatile, fast, byte-addressable) and accesses storage through file I/O (non-volatile, slow, block-addressable).

TWO-LEVEL STORAGE MODEL
Non-volatile memories (NVM) such as PCM and STT-RAM combine characteristics of memory and storage.

VISION: UNIFY MEMORY AND STORAGE
With NVM as persistent memory accessed by loads and stores, the CPU gets an opportunity to manipulate persistent data directly.

DRAM IS STILL FASTER
A hybrid unified memory-storage system keeps DRAM as memory alongside NVM persistent memory.

CHALLENGE: DATA CONSISTENCY
A system crash can result in permanent data corruption in NVM.

CURRENT SOLUTIONS
Explicit interfaces to manage consistency, e.g., NV-Heaps [ASPLOS'11], BPFS [SOSP'09], Mnemosyne [ASPLOS'11]:
    AtomicBegin {
        Insert a new node;
    } AtomicEnd;
This limits adoption of NVM: code has to be rewritten with a clear partition between volatile and non-volatile data, placing the burden on programmers.

OUR GOAL AND APPROACH
Goal: provide efficient, transparent consistency in hybrid systems. Approach: periodic checkpointing of dirty data, with execution alternating between running and checkpointing within each epoch (epoch 0, epoch 1, ...). The mechanism is transparent to the application and the system: hardware checkpoints and recovers the data.

CHECKPOINTING GRANULARITY
Page granularity (used for DRAM) causes extra writes but needs only small metadata; dirty-cache-block granularity (used for NVM) causes no extra writes but needs huge metadata. High write-locality pages go to DRAM, low write-locality pages to NVM.

DUAL-GRANULARITY CHECKPOINTING
DRAM is good for streaming writes; NVM is good for random writes. Dual-granularity checkpointing can adapt to the access pattern.

TRANSPARENT DATA CONSISTENCY
Cost of consistency compared to a system with zero-cost consistency, running unmodified legacy code: DRAM -3.5%, NVM +2.7%. The design provides consistency without significant performance overhead.
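To illustrate the dual-granularity idea above, the C program below classifies pages at the end of an epoch by how many of their cache blocks were written: densely written pages are treated as high write-locality pages (DRAM, page-granularity checkpoint) and sparsely written pages as low write-locality pages (NVM, block-granularity checkpoint). This is my own sketch, not the talk's hardware design; the page/block sizes and the threshold are assumptions.

    /* Sketch of the dual-granularity placement decision described above.
     * Densely written (high write-locality) pages -> DRAM, checkpoint whole page;
     * sparsely written (low write-locality) pages -> NVM, checkpoint dirty blocks.
     * Sizes and the threshold are illustrative assumptions. */
    #include <stdio.h>

    #define BLOCKS_PER_PAGE     64   /* e.g., 4 KB page with 64 B cache blocks   */
    #define LOCALITY_THRESHOLD  32   /* hypothetical cutoff for "high locality"  */

    enum placement { DRAM_PAGE_GRANULARITY, NVM_BLOCK_GRANULARITY };

    static enum placement classify(int dirty_blocks_this_epoch) {
        return dirty_blocks_this_epoch >= LOCALITY_THRESHOLD
                   ? DRAM_PAGE_GRANULARITY    /* streaming writes: page checkpoint, small metadata      */
                   : NVM_BLOCK_GRANULARITY;   /* random writes: per-block checkpoint, no extra writes   */
    }

    int main(void) {
        const int dirty_counts[] = { 3, 40, 64 };   /* dirty blocks written in one epoch */
        for (int i = 0; i < 3; i++)
            printf("%2d dirty blocks -> %s\n", dirty_counts[i],
                   classify(dirty_counts[i]) == DRAM_PAGE_GRANULARITY
                       ? "DRAM, page-granularity checkpoint"
                       : "NVM, block-granularity checkpoint");
        return 0;
    }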
[Talk outline] DRAM SCALING CHALLENGE / SYSTEM-LEVEL TECHNIQUES TO ENABLE DRAM SCALING / NON-VOLATILE MEMORIES: UNIFIED MEMORY & STORAGE / PAST AND FUTURE WORK

OTHER WORK
Improving cache performance [MICRO'10, PACT'10, ISCA'12, HPCA'12, HPCA'14]. Efficient low-voltage processor operation [HPCA'13, Intel Technology Journal'14]. New DRAM architecture [HPCA'15, ongoing]. Enabling DRAM scaling [SIGMETRICS'14, DSN'15, ongoing]. Unifying memory and storage [WEED'13, ongoing].

FUTURE WORK: TRENDS
[Figure: system stack with CPU, memory (DRAM, NVM, logic), and storage]

APPROACH
Design and build better systems with new capabilities by redefining functionalities across different layers in the system stack.

RETHINKING STORAGE
Across the application, operating system, and devices (CPU, flash controller, flash chips): what is the best way to design a system to take advantage of SSDs?

ENHANCING SYSTEMS WITH NON-VOLATILE MEMORY
Across the processor, file system, device, network, and memory: how can we provide efficient, instantaneous system recovery and migration?

DESIGNING SYSTEMS WITH MEMORY AS AN ACCELERATOR
With specialized cores and memory with logic, across the application, processor, and memory: how do we manage data movement when applications run on different accelerators?

THANK YOU
Questions?

SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS
Samira Khan