SAFER: Stuck-At-Fault Error Recovery for Memories

Nak Hee Seong†
Dong Hyuk Woo†
Vijayalakshmi Srinivasan‡
Jude A. Rivers‡
Hsien-Hsin S. Lee†
Emerging Memory Technologies
• Resistive memories
– Due to DRAM scaling challenge
• Phase Change Memory (PCM)
– Scalability, high density
– Limited write endurance (avg. 10^8 writes)
• Incurring stuck-at faults
Cell Write Endurance
• Endurance variation
– No spatial correlation
– Increases with technology scaling
• Issues
– Unpredictable cell endurance
• Read verification required for each write
– The weakest cell dictates memory lifetime!
– # of stuck-at faults gradually grows!
• Multi-bit error recovery scheme is needed!
Existing Error Correcting Methods
• (72,64) Hamming code
– For transient faults
– Single Error Correction Double Error Detection (SECDED)
– 12.5% overhead
• Error-Correcting Pointers (ECP) [Schechter, ISCA37]
– Dynamically replace failed cells with extra cells
– Storing multiple fail pointers for each data block
– Recover from 6 fails with 61-bit overhead (11.9%)
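The overhead figures quoted above can be checked with a little arithmetic. This is a minimal sketch that assumes a 512-bit data block and the usual ECP layout of one pointer plus one replacement cell per correctable failure and one shared "full" bit; that layout is recalled from the ECP proposal and is not spelled out on the slide. The function names are illustrative.

```python
def hamming_overhead(code_bits: int, data_bits: int) -> float:
    """Fraction of extra storage for a (code_bits, data_bits) code."""
    return (code_bits - data_bits) / data_bits


def ecp_meta_bits(n_pointers: int, block_bits: int = 512) -> int:
    """Meta bits for ECP with n_pointers entries (assumed layout, see lead-in)."""
    pointer_bits = (block_bits - 1).bit_length()  # 9 bits address 512 cells
    # Each entry: one pointer + one replacement cell; plus one shared full bit.
    return n_pointers * (pointer_bits + 1) + 1


print(hamming_overhead(72, 64))               # 0.125, i.e. 12.5%
print(ecp_meta_bits(6))                       # 61
print(round(100 * ecp_meta_bits(6) / 512, 1)) # 11.9
```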
Concept of SAFER
• Exploit two properties of stuck-at faults
– Permanency
– Readability
• Multiple error correction
– Fault separation
– Low-cost Single Error Correction (SEC)
[Figure: fault separation, with SEC applied to each resulting group]
SAFER:
1. Fault Separation
2. Single Error Correction
Fault Separation
• Assuming 2 faults in an 8-bit block
– C(8,2) = 28 possible fault pairs
• How to separate these 2 faults (of all 28 pairs)?
[Figure: three candidate partition patterns (Pattern #0, Pattern #1, Pattern #2), each drawn over bit positions 7 to 0]
Decision for Fault Separation
• Use bit pointers for fault separation
Data block bit:  7 6 5 4 3 2 1 0
Pointer bit 2:   1 1 1 1 0 0 0 0   (Pattern #2)
Pointer bit 1:   1 1 0 0 1 1 0 0   (Pattern #1)
Pointer bit 0:   1 0 1 0 1 0 1 0   (Pattern #0)
Decision for Fault Separation
• Find pattern candidates by XORing bit pointers
Data block bit:  7 6 5 4 3 2 1 0   Difference vector
Pointer bit 2:   1 1 1 1 0 0 0 0   1   (Pattern #2)
Pointer bit 1:   1 1 0 0 1 1 0 0   0   (Pattern #1)
Pointer bit 0:   1 0 1 0 1 0 1 0   0   (Pattern #0)
Decision for Fault Separation
• Find pattern candidates by XORing bit pointers
Data block bit:  7 6 5 4 3 2 1 0   Difference vector
Pointer bit 2:   1 1 1 1 0 0 0 0   0   (Pattern #2)
Pointer bit 1:   1 1 0 0 1 1 0 0   1   (Pattern #1)
Pointer bit 0:   1 0 1 0 1 0 1 0   1   (Pattern #0)
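The XOR step on these slides can be sketched in a few lines. The idea is that a 1 in bit i of the difference vector means the two faulty cells differ in pointer bit i, so Pattern #i puts them in different groups. The function name and the example fault positions are illustrative choices, not taken from the slides.

```python
from itertools import combinations


def separating_patterns(fault_a: int, fault_b: int, pointer_bits: int = 3) -> list:
    """Return the pattern numbers that place the two faults in different groups."""
    diff = fault_a ^ fault_b  # difference vector of the two bit pointers
    return [i for i in range(pointer_bits) if (diff >> i) & 1]


# Faults at bits 7 (111) and 3 (011): difference vector 100.
print(separating_patterns(7, 3))  # [2]
# Faults at bits 5 (101) and 6 (110): difference vector 011.
print(separating_patterns(5, 6))  # [0, 1]

# Every one of the C(8,2) = 28 fault pairs has at least one candidate,
# because two distinct pointers always differ in some bit.
pairs = list(combinations(range(8), 2))
print(len(pairs), all(separating_patterns(a, b) for a, b in pairs))  # 28 True
```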
Extension to Multi-Group Partition
• Use two bits for 4 group partition
Data block bit:  15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Pointer bit 3:    1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0
Pointer bit 2:    1  1  1  1  0  0  0  0  1  1  1  1  0  0  0  0
Pointer bit 1:    1  1  0  0  1  1  0  0  1  1  0  0  1  1  0  0
Pointer bit 0:    1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0
Example bit-pointer pairs: (bit 3, bit 2), (bit 3, bit 1)
Dynamic Partition
• 4 group partition for a 16-bit data block
[Figure: three animation steps over a data block with bit pointers 15 to 0]
Step 0: 1st partition field = bit 2, 2nd partition field = bit 0, fixed partition counter = 0
Step 1: 1000 ⊕ 0010 = 1010, 1st partition field changes to bit 3, fixed partition counter changes from 0 to 1
Step 2: 0010 ⊕ 0000 = 0010, 2nd partition field changes to bit 1, fixed partition counter changes from 1 to 2
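The dynamic partition walkthrough above can be mimicked with a short sketch. It assumes a greedy policy chosen purely for illustration: when a new fault lands in the same group as a known fault, the highest set bit of their difference vector becomes the next partition field. The deck does not state which differing bit is selected, so `add_fault` and `group_of` are hypothetical helpers rather than the authors' procedure.

```python
def group_of(position: int, fields: list) -> int:
    """Group index formed by concatenating the selected pointer bits."""
    group = 0
    for bit in fields:
        group = (group << 1) | ((position >> bit) & 1)
    return group


def add_fault(fields: list, faults: list, new_fault: int, max_fields: int = 2):
    """Return partition fields that separate new_fault from faults, or None."""
    fields = list(fields)
    for old in faults:
        if group_of(old, fields) == group_of(new_fault, fields):
            if len(fields) == max_fields:
                return None  # counter already at MAX: cannot separate
            diff = old ^ new_fault  # differs only in bits not yet used as fields
            fields.append(diff.bit_length() - 1)  # assumed choice: highest set bit
    return fields


fields, faults = [], []
for fault in (8, 2, 0):  # fault positions matching the XORs in the walkthrough
    fields = add_fault(fields, faults, fault)
    faults.append(fault)

print(fields)                                # [3, 1]
print([group_of(f, fields) for f in faults]) # [2, 1, 0]
```

With both fields in use, a further fault at bit 1 would share a group with bit 0 and `add_fault` would return `None`, which corresponds to running out of partition fields.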
Dynamic Partition
• Objective
– Separate multiple stuck-at faults into different groups
• Additional meta data
– Assuming an n bit block and a k group partition
– $\log_2 k \times \lceil \log_2 \log_2 n \rceil + \lceil \log_2(\log_2 k + 1) \rceil$
   where $\log_2 k$ is the # of partition fields, $\lceil \log_2 \log_2 n \rceil$ is the size of each partition field, and $\lceil \log_2(\log_2 k + 1) \rceil$ is the size of the fixed partition counter
• Example: n = 512, k = 32
– Required meta data: 23 bits/block
– 6 ≤ the number of separable stuck-at faults ≤ 32
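The 23-bit figure in the example can be reproduced from the three labeled terms. This is a minimal sketch using integer arithmetic for the ceilings; the function names are illustrative, and the total that adds one flip bit per group is an inference from the "one additional bit per group" note on the data inversion slide, not a number stated here.

```python
def ceil_log2(m: int) -> int:
    """ceil(log2(m)) for an integer m >= 1, without floating point."""
    return (m - 1).bit_length()


def partition_meta_bits(n: int, k: int) -> int:
    """Meta bits for the dynamic partition of an n-bit block into k groups."""
    num_fields = ceil_log2(k)                 # number of partition fields
    field_size = ceil_log2(ceil_log2(n))      # bits to name one pointer bit
    counter_size = ceil_log2(num_fields + 1)  # fixed partition counter
    return num_fields * field_size + counter_size


print(partition_meta_bits(512, 32))       # 5 * 4 + 3 = 23
# Assumption: adding one flip bit per group gives the per-block total.
print(partition_meta_bits(512, 32) + 32)  # 55
```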
Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
[Figure: three animation steps of a write followed by a verify read; in the last step the verified data differs from the written data. Need to recover!!]
Low-cost Single Error Correction
• Data Inversion as an SEC
– One additional bit per group (flip mark)
[Figure: the failing group is inverted and marked "F", then written (2nd write) and verified (2nd verify); a group marked "F" is inverted again when read. Recovered from stuck-at fault!!]
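Data inversion as a single error correction can be illustrated with a toy cell model. The sketch below assumes one group with at most one stuck-at cell and a fault-free flip bit; `program`, `write_group`, and `read_group` are hypothetical helper names, and the 4-bit data is an arbitrary example.

```python
def program(target: list, stuck: dict) -> list:
    """Write target into the cells; a stuck cell keeps its stuck value."""
    return [stuck.get(i, bit) for i, bit in enumerate(target)]


def write_group(data: list, stuck: dict):
    """Write one group, inverting it when the first verify fails."""
    stored = program(data, stuck)        # 1st write
    if stored == data:                   # 1st verify
        return stored, 0
    inverted = [bit ^ 1 for bit in data]
    stored = program(inverted, stuck)    # 2nd write (inversion)
    if stored != inverted:               # 2nd verify
        raise ValueError("group holds faults that inversion cannot hide")
    return stored, 1                     # flip bit marks the inversion


def read_group(stored: list, flip: int) -> list:
    """Undo the inversion on a read when the flip bit is set."""
    return [bit ^ flip for bit in stored]


data = [0, 1, 0, 1]
stuck = {2: 1}                           # cell 2 is stuck at 1
stored, flip = write_group(data, stuck)
print(stored, flip)                      # [1, 0, 1, 0] 1
print(read_group(stored, flip) == data)  # True
```

The inversion works because a stuck cell that disagrees with the data necessarily agrees with the inverted data.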
Design Issues
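SAFER Sequence for a Write
[Flowchart, reconstructed from its labels]
1. Start: Read, then Write (1st)
2. Verify: if there is no error, Success
3. On error: Inversion Write (2nd), then Verify; if there is no error, Success
4. On error: if Fixed Partition Counter < MAX, Re-partition and write again; otherwise Failure
• Drawbacks:
– accelerating wear-out
– performance degradation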
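The write sequence can be simulated end to end with a small model. The sketch below is an illustration under several assumptions: cells are a Python list, stuck cells are a dict of position to stuck value, the flip bits are fault-free, re-partitioning picks the highest differing pointer bit, and the flow restarts from the first write after a re-partition. None of these details is confirmed by the flowchart labels.

```python
def group_of(position: int, fields: list) -> int:
    group = 0
    for bit in fields:
        group = (group << 1) | ((position >> bit) & 1)
    return group


def program(target: list, stuck: dict) -> list:
    return [stuck.get(i, bit) for i, bit in enumerate(target)]


def errors(target: list, stored: list) -> list:
    return [i for i, (t, s) in enumerate(zip(target, stored)) if t != s]


def safer_write(data: list, stuck: dict, fields: list, max_fields: int):
    fields = list(fields)
    while True:
        first = errors(data, program(data, stuck))      # Write (1st) + Verify
        if not first:
            return "success", fields, set()
        flips = {group_of(p, fields) for p in first}    # groups to invert
        target = [bit ^ int(group_of(i, fields) in flips)
                  for i, bit in enumerate(data)]
        second = errors(target, program(target, stuck)) # Inversion Write (2nd) + Verify
        if not second:
            return "success", fields, flips
        if len(fields) >= max_fields:                   # counter has reached MAX
            return "failure", fields, flips
        clash = second[0]                               # fault broken by the inversion
        mate = next(p for p in first
                    if group_of(p, fields) == group_of(clash, fields))
        fields.append((clash ^ mate).bit_length() - 1)  # Re-partition


data = [0] * 16
stuck = {8: 1, 2: 0}  # two stuck cells with opposite stuck values
print(safer_write(data, stuck, fields=[], max_fields=2))
# ('success', [3], {1})
```

In this run the block needs four writes in total (two before and two after the re-partition), which is the kind of extra wear and latency the drawbacks above refer to.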
Fail Information Cache
• Objective: avoid the 2nd writes
• Solution: early inversion decision
• Fail Info. Cache with 1K entries
– Keep track of recent data blocks with stuck-at faults
– Store fail positions and their stuck-at values
[Figure: fail information cache organized as Bank #0 to Bank #15; the block address supplies the tag, cache index, and bank address; each entry holds a valid bit, a tag, a fail pointer, and a stuck value]
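The early inversion decision can be sketched as a lookup before the write. Here a plain dict stands in for the banked SRAM cache, keyed by block address and holding fail positions with their stuck values; the names `fail_cache` and `plan_write` and the example address are hypothetical. On a hit, the groups to invert are known up front, so the data can be written in its final form the first time.

```python
def group_of(position: int, fields: list) -> int:
    group = 0
    for bit in fields:
        group = (group << 1) | ((position >> bit) & 1)
    return group


# Hypothetical stand-in for the cache: block address -> {fail position: stuck value}
fail_cache = {0x40: {8: 1, 2: 0}}


def plan_write(block_addr: int, data: list, fields: list):
    """Return (bits to write, groups to flip) on a hit, or None on a miss."""
    faults = fail_cache.get(block_addr)
    if faults is None:
        return None  # miss: fall back to write, verify, and inversion write
    flips = {group_of(p, fields) for p, v in faults.items() if data[p] != v}
    target = [bit ^ int(group_of(i, fields) in flips)
              for i, bit in enumerate(data)]
    return target, flips


target, flips = plan_write(0x40, [0] * 16, fields=[3])
print(flips)                       # {1}
print(target[8], target[2])        # 1 0
print(plan_write(0x41, [0] * 16, fields=[3]))  # None
```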
Evaluation
• Monte Carlo simulations
– Data block size = 512 bits
– Perfect wear-leveling scheme (256-byte block)
– Cell write endurance: μ = 100M writes, σ = 10M writes
– IdealECC, ECP, SAFER, SAFER_FC
• Hardware overhead
[Figure: meta-bit size (0 to 64) in two panels: IdealECC4, ECP4, SAFER8, SAFER8_FC and IdealECC6, ECP6, SAFER32, SAFER32_FC]
• Relative lifetime improvement
[Figure: relative lifetime improvement (0.0 to 1.4) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC; callout: 14.8%]
Conclusion
• Need to recover from multiple stuck-at faults
• SAFER
– Efficient recovery scheme
– Handles the growing number of stuck-at faults
• Dynamic partition
• Data inversion
– SAFER32_FC
• 11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
• 14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)
Thank You All!!
Questions?
SRAM Fail Info. Cache Overhead
• Cell size in 2024
– SRAM = 140 F^2 @ 10 nm, PCM = 6 F^2 @ 8 nm
– 36.6X difference
• Compared with an 8 Gbit PCM chip
Number of Entries   Tag Size (bits)   Entry Size (bits)   Cache Size (bits)   Area Overhead
1K                  23                25                  25.6K               0.01%
2K                  22                24                  49.2K               0.02%
4K                  21                23                  94.2K               0.04%
8K                  20                22                  0.18M               0.08%
16K                 19                21                  0.33M               0.15%
32K                 18                20                  0.63M               0.28%
64K                 17                19                  1.19M               0.53%
128K                16                18                  2.25M               1.00%
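The table entries can be re-derived from the cell sizes. This sketch assumes that "8 Gbit" means 8 × 2^30 cells and that area overhead is SRAM cache area divided by PCM array area, ignoring peripheral circuits; both are assumptions made for illustration. Computed this way the cell-area ratio is about 36.5, close to the 36.6X quoted on the slide.

```python
PCM_CHIP_BITS = 8 * 2**30      # assumed meaning of "8 Gbit"
SRAM_CELL_AREA = 140 * 10**2   # 140 F^2 at F = 10 nm, in nm^2
PCM_CELL_AREA = 6 * 8**2       # 6 F^2 at F = 8 nm, in nm^2


def cache_bits(entries: int, entry_bits: int) -> int:
    return entries * entry_bits


def area_overhead_percent(entries: int, entry_bits: int) -> float:
    """SRAM cache area as a percentage of the PCM cell array area."""
    sram_area = cache_bits(entries, entry_bits) * SRAM_CELL_AREA
    pcm_area = PCM_CHIP_BITS * PCM_CELL_AREA
    return 100 * sram_area / pcm_area


print(round(SRAM_CELL_AREA / PCM_CELL_AREA, 1))  # 36.5
# (entries, entry size in bits) taken from the table rows
for entries, entry_bits in [(1024, 25), (16 * 1024, 21), (128 * 1024, 18)]:
    print(entries, cache_bits(entries, entry_bits),
          round(area_overhead_percent(entries, entry_bits), 2))
```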
Relative Lifetime Improvement
• Need a method measuring relative lifetime
– independent of σ and T
• Definition
$$\text{Relative lifetime improvement} = \frac{\text{recovery scheme contribution to lifetime} \times T}{\sigma} = \frac{(L - F) \times T}{\sigma}$$
• Cell write endurance distribution: μ = 100M writes, σ = 10M writes
• Bit toggle rate (T) = 0.5
[Figure: lifetime contribution versus lifetime (million writes, 100 to 260), with F and L marked on the lifetime axis]
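As a quick numeric illustration of the definition, the sketch below plugs in the slide's T and σ together with made-up values of L and F. The lifetimes are hypothetical and chosen only to make the arithmetic easy to follow; they are not read off the figure.

```python
def relative_lifetime_improvement(L: float, F: float, T: float, sigma: float) -> float:
    """(L - F) * T / sigma, with L and F in block writes and sigma in cell writes."""
    return (L - F) * T / sigma


T = 0.5        # bit toggle rate from the slide
SIGMA = 10e6   # sigma = 10M writes from the slide

# Hypothetical lifetimes: F = 120M writes, L = 140M writes.
print(relative_lifetime_improvement(140e6, 120e6, T, SIGMA))  # 1.0
```

Multiplying by T converts block writes into expected cell writes, and dividing by σ expresses the gain in units of the endurance spread, which is what makes the metric independent of both.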
[Figure: lifetime contribution per meta-bit (0.000 to 0.025) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC]
[Figure: Average Recovered Fails per 256B; average number of recovered fails (0 to 30) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC]
SAFER with Fail Cache
[Figure: miss rate (0% to 80%) versus the number of maximum fails per 512 bits (1 to 15), for caches of 1K to 128K entries]
[Figure: relative lifetime improvement (0.0 to 1.4) versus the number of cache entries (None, 1K to 128K), for SAFER2 to SAFER32, with IdealECC8 shown for reference]
Low-cost Single Error Correction
• Stuck-At Fault Property: Readability
[Figure: two animation steps of a write followed by a verify read; in the second step the verified data differs from the written data. Need to recover!!]
Low-cost Single Error Correction
• Data Inversion as an SEC
– one additional bit per group
[Figure: the failing group is inverted and marked "F", then written (2nd write) and verified (2nd verify); a group marked "F" is inverted again when read. Recovered from stuck-at fault!!]
Evaluation
• Monte Carlo simulations
– Data block size = 512 bits
– Perfect wear-leveling scheme (256-byte block)
– Cell write endurance: μ = 100M writes, σ = 10M writes
– IdealECC, ECP, SAFER, SAFER_FC
[Figure: hardware overhead (0% to 14%) for IdealECC1 to IdealECC8, ECP1 to ECP6, SAFER2 to SAFER32, and SAFER2_FC to SAFER32_FC; callout: 11.9%]