SAFER: Stuck-At-Fault Error Recovery for Memories
Nak Hee Seong†, Dong Hyuk Woo†, Vijayalakshmi Srinivasan‡, Jude A. Rivers‡, Hsien-Hsin S. Lee†

Emerging Memory Technologies
• Resistive memories
  – Due to DRAM scaling challenge
• Phase Change Memory (PCM)
  – Scalability, high density
  – Limited write endurance (avg. 10^8 writes)
• Incurring stuck-at faults

Cell Write Endurance
• Endurance variation
  – No spatial correlation
  – Increases with technology scaling
• Issues
  – Unpredictable cell endurance
    • Read verification required for each write
  – The weakest cell dictates memory lifetime!
  – # of stuck-at faults gradually grows!
• Multi-bit error recovery scheme is needed!

Existing Error Correcting Methods
• (72,64) Hamming code
  – For transient faults
  – Single Error Correction Double Error Detection (SECDED)
  – 12.5% overhead
• Error-Correcting Pointers (ECP) [Schechter, ISCA37]
  – Dynamically replace failed cells with extra cells
  – Storing multiple fail pointers for each data block
  – Recover from 6 fails with 61-bit overhead (11.9%)

SAFER: Stuck-At-Fault Error Recovery

Concept of SAFER
• Exploit two properties of Stuck-At Faults
  – Permanency
  – Readability
• Multiple error correction
  – Fault separation
  – Low-cost Single Error Correction (SEC)
Figure labels: Fault Separation, SEC, SEC

SAFER: 1. Fault Separation 2. Single Error Correction

Fault Separation
• Assuming 2 faults in an 8-bit block
  – C(8,2) = 28 possible fault pairs
• How to separate these 2 faults (of all 28 pairs)?
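The question above can be checked by brute force. The following is a minimal sketch, assuming each candidate pattern p splits the 8-bit block into two groups according to bit p of the cell index (the bit-pointer idea the deck describes next). The function names are illustrative and do not come from the slides.

```python
from itertools import combinations
from typing import List

N_BITS = 8        # block size used on the slide
POINTER_BITS = 3  # log2(8) index bits address every cell in the block


def group_of(cell: int, pattern: int) -> int:
    """Group (0 or 1) of a cell when the block is split on index bit `pattern`."""
    return (cell >> pattern) & 1


def separating_patterns(a: int, b: int) -> List[int]:
    """Patterns that put faulty cells a and b into different groups."""
    diff = a ^ b  # set bits mark the index bits where a and b differ
    return [p for p in range(POINTER_BITS) if (diff >> p) & 1]


pairs = list(combinations(range(N_BITS), 2))  # all C(8,2) fault pairs

# Any two distinct cells differ in at least one index bit, so this list
# stays empty: every pair is separated by at least one pattern.
unseparated = [(a, b) for a, b in pairs if not separating_patterns(a, b)]
```

For example, `separating_patterns(7, 3)` returns `[2]`: cells 7 and 3 differ only in index bit 2, so only the pattern built on that bit places them in different groups.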
Figure: Pattern #2, Pattern #1, and Pattern #0, each drawn over data block bits 7 to 0.

Decision for Fault Separation
• Use bit pointers for fault separation

  Data block bit   7 6 5 4 3 2 1 0
  Bit pointer
    bit 2          1 1 1 1 0 0 0 0   Pattern #2
    bit 1          1 1 0 0 1 1 0 0   Pattern #1
    bit 0          1 0 1 0 1 0 1 0   Pattern #0

Decision for Fault Separation
• Find pattern candidates by XORing bit pointers
  – A 1 in the difference vector marks a pattern that separates the two faults

  Data block bit   7 6 5 4 3 2 1 0   Difference vector
    bit 2          1 1 1 1 0 0 0 0   1                   Pattern #2
    bit 1          1 1 0 0 1 1 0 0   0                   Pattern #1
    bit 0          1 0 1 0 1 0 1 0   0                   Pattern #0

Decision for Fault Separation
• Find pattern candidates by XORing bit pointers

  Data block bit   7 6 5 4 3 2 1 0   Difference vector
    bit 2          1 1 1 1 0 0 0 0   0                   Pattern #2
    bit 1          1 1 0 0 1 1 0 0   1                   Pattern #1
    bit 0          1 0 1 0 1 0 1 0   1                   Pattern #0

Extension to Multi-Group Partition
• Use two bits for 4 group partition

  Data block bit  15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
  Bit pointer
    bit 3          1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0
    bit 2          1  1  1  1  0  0  0  0  1  1  1  1  0  0  0  0
    bit 1          1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0
    bit 0          1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0

  Example partitions shown: (bit 3, bit 2) and (bit 3, bit 1)

Dynamic Partition
• 4 group partition for a 16-bit data block
Figure: three animation steps over the 16-bit data block, each showing the 1st Partition Field, the 2nd Partition Field, and the Fixed Partition Counter. The 1st field moves from bit 2 to bit 3, the 2nd field from bit 0 to bit 1, and the counter from 0 to 1 to 2. Values shown: 1000 and 0010 combine to 1010; 0010 and 0000 combine to 0010.

Dynamic Partition
• Objective
  – Separate multiple stuck-at faults into different groups
• Additional meta data
  – Assuming an n bit block and a k group partition
  – Meta data bits = $\log_2 k \cdot \lceil \log_2 \log_2 n \rceil + \lceil \log_2(\log_2 k + 1) \rceil$
    • $\log_2 k$: # of partition fields
    • $\lceil \log_2 \log_2 n \rceil$: size of each partition field
    • $\lceil \log_2(\log_2 k + 1) \rceil$: size of fixed partition counter
• Example: n = 512, k = 32
  – Required meta data: 5 × 4 + 3 = 23 bits/block
  – 6 ≤ the number of separable stuck-at faults ≤ 32

SAFER: 1. Fault Separation 2.
Single Error Correction

Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
Figure: three animation steps. Each step is a Write followed by a Verify read. In the last step the verify read differs from the written data at a stuck-at cell: Need to recover!!

Low-cost Single Error Correction
• Data Inversion as an SEC
  – One additional bit per group
Figure: Inversion & Mark sets the flip mark "F", followed by the 2nd Write and 2nd Verify. On a read, data marked "F" is inverted back. Recovered from Stuck-At Fault!!

Design Issues

SAFER Sequence for a Write
Figure (flowchart): Start, Read, Write (1st), Verify. No error: Success. Error: Inversion, Write (2nd), Verify. No error: Success. Error: if Fixed Partition Counter < MAX, Re-partition; otherwise Failure.
• Drawbacks
  – Accelerating wear-out
  – Performance degradation

Fail Information Cache
• Objective: avoid the 2nd writes
• Solution: early inversion decision
• Fail Info. Cache with 1K entries
  – Keep track of recent data blocks with stuck-at faults
  – Store fail positions and their stuck-at values
Figure: the Block Address is split into Tag, Cache Index, and Bank Addr. Banks #0 to #15 hold entries made of Valid, Tag, Fail Pointer, and Stuck Value fields.

Evaluation

Evaluation
• Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance: μ = 100M writes, σ = 10M writes
  – IdealECC, ECP, SAFER, SAFER_FC
• Hardware overhead
Figure: Meta-bit size (0 to 64) for two groups: IdealECC4, ECP4, SAFER8, SAFER8_FC and IdealECC6, ECP6, SAFER32, SAFER32_FC.
Figure: Relative Lifetime Improvement (0.0 to 1.4) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC, with a 14.8% annotation.

Conclusion
• Need to
recover from multiple stuck-at faults
• SAFER
  – Efficient recovery scheme
  – Handles the growing stuck-at faults
    • Dynamic partition
    • Data inversion
  – SAFER32_FC
    • 11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
    • 14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)

Thank You All!! Questions?

SRAM Fail Info. Cache Overhead
• Cell size in 2024
  – SRAM = 140 F² @ 10nm, PCM = 6 F² @ 8nm
  – 36.6X difference
• Compared with an 8 Gbit PCM chip

  Number of Entries   Tag Size (bits)   Entry Size (bits)   Cache Size (bits)   Area Overhead
  1K                  23                25                  25.6K               0.01%
  2K                  22                24                  49.2K               0.02%
  4K                  21                23                  94.2K               0.04%
  8K                  20                22                  0.18M               0.08%
  16K                 19                21                  0.33M               0.15%
  32K                 18                20                  0.63M               0.28%
  64K                 17                19                  1.19M               0.53%
  128K                16                18                  2.25M               1.00%

Relative Lifetime Improvement
• Need a method measuring relative lifetime
  – Independent from σ and T
• Definition: the recovery scheme contribution for lifetime, expressed in terms of T and (L − F)
• Cell Write Endurance Distribution: μ = 100M writes, σ = 10M writes
• Bit Toggle Rate (T) = 0.5
Figure: Lifetime Contribution. x-axis: Lifetime (Million Writes), 100 to 260, with points F and L marked.

Figure: Lifetime Contribution per Meta-bit (0.000 to 0.025) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC.

Figure: Average Number of Recovered Fails per 256B (0 to 30) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC.

SAFER with Fail Cache
Figure axes and series: Miss rate (0% to 80%); Relative Lifetime Improvement (0.0 to 1.4); series SAFER2, SAFER4, SAFER8, SAFER16, SAFER32, and IdealECC8; cache sizes None and 1K to 128K; x-axes: the number of maximum fails per 512 bits (1 to 15) and the number of
cache entries

Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
Figure: two animation steps. Each step is a Write followed by a Verify read. In the second step the verify read differs from the written data at a stuck-at cell: Need to recover!!

Low-cost Single Error Correction
• Data Inversion as an SEC
  – One additional bit per group
Figure: Inversion & Mark sets the flip mark "F", followed by the 2nd Write and 2nd Verify. On a read, data marked "F" is inverted back. Recovered from Stuck-At Fault!!

Evaluation
• Monte Carlo simulations
  – Data block size = 512 bits
  – Perfect wear-leveling scheme (256-byte block)
  – Cell write endurance: μ = 100M writes, σ = 10M writes
  – IdealECC, ECP, SAFER, SAFER_FC
Figure: Hardware Overhead (0% to 14%) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC, with an 11.9% annotation.
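The data-inversion step from the Low-cost Single Error Correction slides can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it models one group as a list of bits, models stuck-at cells as a hypothetical `stuck` dictionary mapping cell position to stuck value, and uses function names that are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def write_cells(data: List[int], stuck: Dict[int, int]) -> List[int]:
    """Model a physical write: a stuck cell keeps its stuck value."""
    return [stuck.get(i, bit) for i, bit in enumerate(data)]


def group_write(data: List[int], stuck: Dict[int, int]) -> Tuple[List[int], int]:
    """Write one group, returning (stored cells, flip bit)."""
    stored = write_cells(data, stuck)      # 1st write
    if stored == data:                     # 1st verify
        return stored, 0
    inverted = [b ^ 1 for b in data]       # inversion & mark
    stored = write_cells(inverted, stuck)  # 2nd write
    if stored == inverted:                 # 2nd verify
        return stored, 1
    # Inversion alone cannot fix conflicting faults in one group;
    # the deck handles this case by re-partitioning.
    raise ValueError("group holds conflicting stuck-at faults")


def group_read(stored: List[int], flip: int) -> List[int]:
    """Read one group, undoing the inversion when the flip bit is set."""
    return [b ^ flip for b in stored]


# Cell 2 is stuck at 1, but the data wants a 0 there.
cells, flip = group_write([0, 1, 0, 1], {2: 1})
recovered = group_read(cells, flip)
```

With a single stuck cell the second write always verifies, because inverting the data makes the desired bit equal to the stuck value. This is why separating faults into different groups lets one flip bit per group act as a single error correction.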