MACAU: A Markov Model for Reliability
Evaluations of Caches Under Single-bit
and Multi-bit Upsets
Jinho Suh
Murali Annavaram
Michel Dubois
Outline
• Definitions
• Model-based Soft-Error Reliability Evaluations
• Modeling Multiple bit Upsets
• MACAU
  • Describing Markov Chain
  • Measuring Intrinsic Reliability
  • Benchmarking Realistic Reliability
• Evaluations
• Summary
HPCA-18
Definitions
• Domain
  Set of bits bundled together; protection domain
• Spatial MBU (Multi-Bit Upset; spatial q-BU)
  Multiple (q-)bits are flipped due to one particle hit
  cf. spatial MBE
• Temporal MBU
  Multiple bits are flipped due to more than one particle hit
Model-based Soft-Error Reliability Evaluations
• Comparison [Saleh’90, Reviriego’09, Mukherjee’03, Suh’11]

                        Intrinsic MTTF    AVF     PARMA       MACAU
  Masking effect        X                 O       O           O
  Single-Bit Upset      O                 O       O           O
  Spatial MBU           O                 X       X           O
  Temporal MBU          --                X       O           O
  Protection code       O                 X       O           O
  Variable domain size  X                 X       O           X
  Computation           Extremely cheap   Cheap   Expensive   Expensive

• Projection from industry [ITRS’07, ITRS’10]

  Year of production    2013     2016     2019     2022
  Feature size [nm]     35       25       18       13
  Gate length [nm]      13       9        6        4.5
  SER [FIT per Mb]      1,250    1,300    1,350    1,400
  % MBU in SEU          64%      100%     100%     100%
Challenges in Modeling Spatial MBUs
• Multiple spatial MBUs may leave complex patterns of flipped bits
  • Protection domain: word
  [Figure: spatial MBUs labeled 1, 2 and 3 striking an array of protection domains PD#1 to PD#8]
• Dealing with MBUs spanning domains vertically/horizontally
  • MBUs spanning domains vertically: easy to address if isolated in multiple protection domains
  • MBUs spanning domains horizontally (edge-effect): negligible impact
Assumptions in MACAU
1. Spatial MBUs always happen in contiguous patterns
   [Figure: MBU patterns (a) Contiguous, (b) Disjoint, (c) Diagonal]
   • Recent studies [Georgakos’07, Mahatme’11] report that (a) happens much more frequently than (b)/(c)
   • (b)/(c) can be approximated as (a) by the contiguous patterns framed by the red dotted rectangles (with minimal overestimation)
2. At most one SEU strikes a protection domain in one cycle, and at most two SEUs flip bits in the same protection domain
   • Soft errors are extremely rare
3. Edge effect is ignored
   • It affects only a small number of bits next to the border of two protection domains
Distribution of Spatial MBUs in a SEU
• Probability distribution of MBUs for omni-directional galactic cosmic rays [Tipton’08] in the cache
  • We concentrate on the MBU patterns included in the dotted square
[Figure: bar chart of probability of event (log scale, 10^0 down to 10^-4) versus number of columns (1 to 10), with one series for 1 row and one for 2 rows]
Probability of a SEU in One Cycle
• SEU Model
  • pSEU_PD: probability that one SEU arrives during one cycle period in the protection domain
  • Poisson probability mass function gives pSEU_PD:
    $p_{SEU\_PD} = \lambda_{SEU\_PD} \times e^{-\lambda_{SEU\_PD}} \approx \lambda_{SEU\_PD}$
  • λSEU_PD: Poisson rate of SEUs in a 32-bit word protection domain
    ex) 3×10^-24/PD @ 65nm, 3GHz CPU
• Probability of having a spatial q-BU in PD due to one SEU
  • Including the effect of vertical spatial MBUs
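The approximation pSEU_PD ≈ λSEU_PD can be checked numerically. The following is a minimal sketch, assuming the example rate of 3×10^-24 per protection domain quoted on this slide; the names `poisson_pmf` and `lam_seu_pd` are illustrative and not part of MACAU.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean count per interval is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Example rate from the slide: SEUs per 32-bit protection domain.
lam_seu_pd = 3e-24

p_one = poisson_pmf(1, lam_seu_pd)   # exactly one SEU in the interval
p_two = poisson_pmf(2, lam_seu_pd)   # two SEUs in the same interval

# Relative error of approximating p_one by lam_seu_pd itself.
rel_err = abs(p_one - lam_seu_pd) / lam_seu_pd

print(f"p(1 SEU) = {p_one:.3e}, p(2 SEUs) = {p_two:.3e}, rel. error = {rel_err:.1e}")
```

At rates this small, the probability of two SEUs in one interval is many orders of magnitude below that of one, which is consistent with the "at most one SEU per cycle" assumption.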
Modeling Spatial Effects with Markov Chain
• Markov states
  • Transient state (non-recurrent state)
    • After departing from the state, probability of not returning is nonzero
  • Absorbing state
    • No more transition to other states is possible once the state is visited
• Markov chain
  • State expresses the number of flipped, incorrect bits
  [Figure: Markov chain over states 0, 1, ..., q-1, q, q+1, with transitions drawn for SBU only, for SBU and 2BU, and for up to q-BU]
Markov Chain and Overlap of SEUs
• Number of overlapped bits
  k-bit flipped by 1st SEU: Current Markov state = k
  q-BU arrives with 2nd SEU: Next Markov state = k + d
  • Overlap (o): k’ = k + d = k + q - 2o, so d = q - 2o
  • d is even if and only if q is even
  • d is odd if and only if q is odd
  • o = 0, 1, 2, …, min(k, q)

  k    q    o    d     k'    overlap
  8    3    0    3     11    no
  8    3    1    1     9     partial
  8    3    2    -1    7     partial
  8    3    3    -3    5     full
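The state update k’ = k + q - 2o can be reproduced in a few lines. This is a minimal sketch; the helper names `next_state` and `classify_overlap` are illustrative, and o = 0 is included to cover the no-overlap case.

```python
def next_state(k: int, q: int, o: int) -> int:
    """Number of flipped bits after a q-BU overlaps o of the k bits already flipped."""
    if not 0 <= o <= min(k, q):
        raise ValueError("overlap must satisfy 0 <= o <= min(k, q)")
    # The o overlapped bits flip back to correct; the other q - o bits become faulty.
    return k + q - 2 * o


def classify_overlap(q: int, o: int) -> str:
    """Label the overlap as on the slide: no, partial or full."""
    if o == 0:
        return "no"
    return "full" if o == q else "partial"


k, q = 8, 3
rows = [(o, q - 2 * o, next_state(k, q, o), classify_overlap(q, o))
        for o in range(min(k, q) + 1)]
for o, d, k_next, kind in rows:
    print(f"k={k} q={q} o={o} d={d:+d} k'={k_next} overlap={kind}")
```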
Overlap of SEUs
• In N-bit protection domain with contiguous k-bit fault
  • SEU of spatial q-BU arrives:
  [Figure: N-bit protection domain with a k-bit fault and a spatial q-BU landing at three different positions]
  1. Full overlap (0 < o = q)
  2. Partial overlap (0 < o < q)
  3. No overlap (o = 0)
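The three overlap cases can be counted by brute force. The sketch below is only an illustration under assumptions that are not stated on the slide: the q-BU is a contiguous run that lands uniformly at random on any position fully inside the N-bit domain (edge effect ignored), and the existing k-bit fault sits at a fixed, hypothetical offset `fault_start`. It is not MACAU's closed-form overlap probability.

```python
from collections import Counter
from fractions import Fraction


def overlap_distribution(n_bits: int, fault_start: int, k: int, q: int) -> dict:
    """Distribution of the overlap o between a fixed k-bit fault and a random q-BU.

    Assumption: every start position 0 .. n_bits - q of the q-BU is equally likely.
    """
    fault = set(range(fault_start, fault_start + k))
    starts = range(n_bits - q + 1)
    counts = Counter(len(fault & set(range(s, s + q))) for s in starts)
    return {o: Fraction(c, len(starts)) for o, c in sorted(counts.items())}


# Hypothetical example: 32-bit word, 8-bit fault at bits 10..17, 3-BU arrives.
dist = overlap_distribution(n_bits=32, fault_start=10, k=8, q=3)
for o, p in dist.items():
    print(f"o={o}: probability {p}")
```

In this example o = 3 is the full-overlap case, o = 1 and o = 2 are partial overlaps, and o = 0 is no overlap.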
Markov Transition Matrix T:
Example for the Case of D
• Building T with overlapping probabilities
  Example: matrix D for MBUs with up to 3 horizontal BUs and up to 2 vertical BUs
Markov Transition Matrix T:
General Case
• T contains the probabilities of transition between any two
states in one cycle
Using MACAU
• Measuring intrinsic MTTF
  1. Build T
  2. Add transitions on T for scrubbing
  3. Calculate mean first-passage time from state 0 to failing states
• Benchmarking
  1. Program starts
  2. Build T
  3. Manage VCC
  4. Word consumed? If no, continue managing VCC
  5. If yes, calculate probability of failure on word from T^VCC
  6. Accumulate E[#fail]
  7. Program ends? If no, continue managing VCC
  8. If yes, calculate failure rate
Calculating intrinsic MTTF
• Mean first-passage time gives intrinsic MTTF
  1. Make the states that cause failure absorbing states
     • With a b-bit error correcting code, states > b are failing states
  2. Measure the transition time from state 0 (clean state) to a failing state
• With transition matrix T:
  • Without scrubbing
    • First-passage time from state 0 to any absorbing state gives the intrinsic MTTF
  • With (stochastic) scrubbing with scrubbing interval of L
    • States that can be scrubbed have extra transitions to state 0 with probability = 1/L
    • Then first-passage time gives intrinsic MTTF
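The mean first-passage time to an absorbing state follows from solving (I - Q) t = 1, where Q is the transient-to-transient block of T. The sketch below is a toy illustration and not MACAU's actual matrix: it assumes SBUs only, SEC protection (states 0 and 1 transient, state 2 failing), no overlap of strikes, and deliberately large per-cycle probabilities so that the arithmetic stays well conditioned. All names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mean_first_passage(t, n_transient):
    """Expected cycles to absorption from each transient state: (I - Q) t = 1."""
    a = [[(1.0 if i == j else 0.0) - t[i][j] for j in range(n_transient)]
         for i in range(n_transient)]
    return solve(a, [1.0] * n_transient)


def sec_chain(p, scrub):
    """Toy T for SEC: p = per-cycle SBU probability, scrub = 1/L scrubbing probability."""
    return [
        [1 - p, p, 0.0],               # state 0: clean
        [scrub, 1 - p - scrub, p],     # state 1: one flipped bit, correctable
        [0.0, 0.0, 1.0],               # state 2: failure (absorbing)
    ]


mttf_no_scrub = mean_first_passage(sec_chain(0.01, 0.0), 2)[0]
mttf_scrub = mean_first_passage(sec_chain(0.01, 0.1), 2)[0]
print(mttf_no_scrub, mttf_scrub)
```

For this toy chain the closed form is 2/p + scrub/p^2 cycles from state 0, so scrubbing lengthens the expected time to failure, as the slide describes.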
Benchmarking FIT rate with T
• Whenever an access is made:
  1. Measure VCC
  2. Calculate S = T^VCC to get the transition matrix after VCC cycles
  3. Add to the expected number of failures by summing the probabilities of reaching failure states
• Failure probability is obtained from state transition probability in S
[Table: states counted as SDC and as DUE under each protection scheme: No protection, Odd-parity, SECDED, DECTED, TECQED]
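The per-access accumulation can be sketched as follows. This is a toy illustration with assumptions labeled in the code: each word is taken to start in state 0 (clean) at the beginning of its vulnerable interval, the 3-state matrix and its probability are made up, and only row 0 of T^VCC is propagated instead of forming the full matrix S. The function names are hypothetical.

```python
def step(dist, t):
    """One cycle: multiply the row vector of state probabilities by T."""
    n = len(t)
    return [sum(dist[i] * t[i][j] for i in range(n)) for j in range(n)]


def failure_probability(t, vcc, failing_states):
    """Probability of being in a failing state after vcc cycles, starting clean."""
    dist = [1.0] + [0.0] * (len(t) - 1)   # assumption: word starts in state 0
    for _ in range(vcc):
        dist = step(dist, t)              # dist becomes row 0 of T^vcc
    return sum(dist[j] for j in failing_states)


def expected_failures(t, vccs, failing_states):
    """Accumulate E[#fail] over a sequence of accesses with measured VCCs."""
    return sum(failure_probability(t, vcc, failing_states) for vcc in vccs)


# Toy chain: states 0 and 1 are tolerated, state 2 is a failure (absorbing).
P = 0.5  # unrealistically large per-cycle upset probability, for readability only
T = [
    [1 - P, P, 0.0],
    [0.0, 1 - P, P],
    [0.0, 0.0, 1.0],
]

e_fail = expected_failures(T, vccs=[1, 2, 3], failing_states=[2])
print(e_fail)
```

Which states count as SDC and which as DUE depends on the protection scheme; passing different `failing_states` lists would give the two totals separately.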
Evaluations
• Intrinsic MTTFs (32b-word)

  SEUs on a word       Protection   Model       No scrub    Once/year   Once/month   Once/day
  SBUs only            SEC          MACAU       6.715E+06   1.092E+13   1.329E+14    3.986E+15
  SBUs only            SEC          Saleh       6.245E+06   1.058E+13   1.287E+14    3.862E+15
  1BU+2BU (0.5:0.5)    DEC          MACAU       8.012E+06   1.593E+13   1.938E+14    5.813E+15
  1BU+2BU (0.5:0.5)    DEC          Reviriego   7.211E+06   1.411E+13   1.716E+14    5.149E+15
  D                    DEC          MACAU       9.700E+06   1.153E+08   1.153E+08    1.153E+08
  D                    DEC          Reviriego   --          --          --           --
  D                    TEC          MACAU       1.330E+07   1.815E+14   2.209E+15    6.626E+16
  D                    TEC          Reviriego   1.700E+07   1.839E+14   2.238E+15    6.713E+16

  • Differences in MTTF: MACAU addresses the ‘overlapping effect’, which Saleh/Reviriego ignore [Ming’11]
• FIT rates from benchmarking

  Protection       DUE (TRUE)    DUE (FALSE)   SDC
  No-protection    N/A           N/A           1328.711
  Odd-parity       1217.840      2448.644      110.872
  SECDED           110.872       222.923       37.614
  DECTED           37.614        75.629        8.0925E-16
  TECQED           7.3947E-16    1.3724E-15    6.9784E-17

  • MACAU differs by ≤ 0.015% from PARMA when benchmarking SBUs only
Summary
MACAU
• Model for temporal/spatial MBU effects
• Capable of evaluating various protection schemes
• Useful for quick evaluation of caches, by measuring intrinsic MTTFs
• Useful for rigorously benchmarking FIT rates in caches under MBUs
and SBUs
Future work
• Refining model for addressing edge effect
• Spatial MBU model for arbitrarily shaped patterns
• Model for TAG and meta-bit vulnerability
• Application to processor buffers (ROB, LSQ, IFQ)
THANK YOU!
(Some) References
[Biswas’10] Arijit Biswas, Charles Recchia, Shubhendu S. Mukherjee, Vinod Ambrose, Leo Chan, Aamer Jaleel, Mike Plaster, and Norbert Seifert. Explaining Cache SER Anomaly Using Relative DUE AVF Measurement. HPCA 2010.
[Li’05] X. Li, S. Adve, P. Bose, and J.A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. In Proceedings of the International Conference on Dependable Systems and Networks, 496-505, 2005.
[Mahatme’11] Mahatme, N., Bhuva, B., Fang, Y., and Oates, A. Analysis of multiple cell upsets due to neutrons in SRAMs for a deep-n-well process. In Reliability Physics Symposium (IRPS), 2011 IEEE International (April 2011), pp. SE.7.1-6.
[Ming’11] Ming, Z., Yi, X. L., Chang, L., and Wei, Z. J. Reliability of memories protected by multibit error correction codes against MBUs. IEEE Transactions on Nuclear Science 58, 1 (Feb. 2011), 289-295.
[Mukherjee’03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, 29-40, 2003.
[Mukherjee’04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, 37-42, 2004.
[Reviriego’09] Reviriego, P., and Maestro, J. A. Study of the effects of multibit error correction codes on the reliability of memories in the presence of MBUs. IEEE Transactions on Device and Materials Reliability 9 (2009), 31-39.
[Saleh’90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. IEEE Transactions on Reliability, 39(1), 114-122, 1990.
[Tipton’08] Tipton, A. D., Pellish, J. A., Hutson, J. M., Baumann, R., Deng, X., Marshall, A., Xapsos, M. A., Kim, H. S., Friendlich, M. R., Campola, M. J., Seidleck, C. M., LaBel, K. A., Mendenhall, M. H., Reed, R. A., Schrimpf, R. D., Weller, R. A., and Black, J. D. Device-orientation effects on multiple-bit upset in 65 nm SRAMs. IEEE Transactions on Nuclear Science 55 (2008), 2880-2885.
[Ziegler] J. F. Ziegler and H. Puchner. SER – History, Trends and Challenges. Cypress Semiconductor Corp.
ADDENDUM
Definitions
• Domain
  Set of bits bundled together; protection domain
• SEU (Single Event Upset)
  State change in memory due to one particle hit
• SBU (Single-Bit Upset)
  One bit is flipped due to one SEU
• Spatial MBU (Multi-Bit Upset; spatial q-BU)
  Multiple (q-)bits are flipped due to one SEU
• Temporal MBU
  Multiple bits are flipped due to more than one SEU
• Fault
  Incorrect state in a domain
• Error
  A manifested fault, propagated outside the original domain
• Failure
  Visible error
• Consumption
  An event resulting in the change of architectural state
• Vulnerability clock cycles (VCCs)
  Time in cycles that a bit is exposed to particle hits
Model-based Soft-Error Reliability Evaluations
Intrinsic Reliability
• (Intrinsic) Mean-Time-To-Failure [Saleh’90, Reviriego’09]
  + Fast, first-cut estimation of circuit-level reliability
  − Highly pessimistic
    • No consideration of masking effects
  − Unclear for protected memories
    • No consideration of cleaning effects on accesses
Benchmarked Reliability
• AVF (Architectural Vulnerability Factor) [Mukherjee’03]
  + Quickly calculates SDC without protection or DUE under parity due to SBUs
  − Ignores temporal/spatial MBUs
  − Cannot account for error detection/correction schemes
• PARMA (Precise Analytical Reliability Model for Architectures) [Suh’11]
  + Addresses temporal MBUs caused by multiple SBUs
  + Evaluates FIT (Failures-In-Time) of protected caches
  − Cannot account for spatial MBUs
• MACAU models spatial MBUs and their temporal effects to evaluate soft-error vulnerabilities on cache data bit-cells
Vulnerability Clock Cycles (VCCs)
Vulnerability of a bit ∝ Exposure time
• Common assumption in model-based studies
• We measure a bit’s exposure time with VCCs
• VCCs are equivalent to ACE (Architecturally Correct Execution) cycles in AVF methods
• Managing VCCs is similar to (reliability-)lifetime analysis in AVF
Two Basic Models in Soft Error Benchmarking
1. Fault generation model
Poisson Single Event Upset model
Probability distribution of having k faulty bit(s) in a
domain (set of bits) during vulnerability clock cycles
2. Fault propagation model
• Observing consumption for tracking failures due to SEUs
• Accumulating expected number of (total system) failures whenever
consumption happens
• Benchmarking measures:
Generated faults → Errors (propagated faults) → Expected number of failures → Failure rate
Temporal & Spatial Effects
• Temporal effects
  • Required for evaluating/quantifying reliability in the presence of protection codes
  Example:
  • If SBU is dominant, temporal effects should be addressed for evaluating SECDED protected caches
  • Evaluating DUE FIT rates requires quantifying failures with 2 flipped bits
  • Evaluating SDC FIT rates requires quantifying failures with >2 flipped bits
• Spatial effects
  • Growing concerns with future technologies
  • All SEUs are expected to be spatial MBUs in near future [ITRS’07]
  • Radiation hardened/interleaving design may not always be possible
Spatial MBUs and Layout
• Circuit layout determines the population of spatial MBUs
  • Deep-N-well process is commonly used by TSMC, Infineon, etc.
  • Parasitic bipolar transistors contribute to spatial MBUs
  • With deep-N-well process, only parasitic NPN transistors are turned on
  • At most two bit flips are observed in the same direction of wells [Mahatme’11]
[Figure: two cache slice layouts, each with blocks of Word 0 to Word 7 along wordlines and bitlines and alternating P and N wells: (a) Wells in the bitline direction, (b) Wells in the wordline direction]
Markov Chain and Overlap of SEUs
• Consider how many bits are overlapped
  k-bit flipped by 1st SEU: Current Markov state = k
  q-BU arrives with 2nd SEU: Next Markov state = k + d
  • Overlap (o): k’ = k + d = k + q - 2o, so d = q - 2o
  • d is even if and only if q is even
  • d is odd if and only if q is odd
  • o = 0, 1, 2, …, min(k, q)

  k    q    o    d     k'
  8    3    0    3     11
  8    5    1    3     11
  8    7    2    3     11
  8    9    3    3     11
  8    1    1    -1    7
Set-up for Markov Chain
•
Overlapping probabilities
• o overlapped bits when q-BU hits a word with already k flipped bits
1. If 0 < o = q
2. If 0 < o < q
3. Else o = 0
MACAU: Computing S = T^Nc
• Major computation bottleneck in MACAU
  • Square-and-multiply method for efficient matrix multiplication
• Matrix power computation using doubling trick
  Nc = 571 (decimal) = 1000111011 (binary); shift right by one at each step

  Q      T    T^2   T^4   T^8   T^16   T^32   T^64   T^128   T^256   T^512
  LSB    1    1     0     1     1      1      0      0       0       1

  Result matrix S is initialized as I; then if LSB == 1, S = S∙Q
  S = T^Nc = T^(1+2+8+16+32+512) = T^571
• Matrix computation is also data-parallel computation
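The doubling trick can be written in a few lines. The following is a minimal pure-Python sketch; the names `mat_mul` and `mat_pow` are illustrative, and a real implementation would more likely use an optimized matrix library.

```python
def mat_mul(a, b):
    """Plain O(n^3) product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def mat_pow(t, nc):
    """Compute T^nc by square-and-multiply, scanning nc from its LSB upward."""
    s = identity(len(t))      # result matrix S starts as I
    q = t                     # Q holds T^(2^i) at step i
    while nc > 0:
        if nc & 1:            # LSB == 1: fold the current power into S
            s = mat_mul(s, q)
        nc >>= 1              # shift right by one
        if nc:
            q = mat_mul(q, q)  # square Q for the next bit
    return s


# Integer check: [[1, 1], [0, 1]]^n equals [[1, n], [0, 1]].
s_571 = mat_pow([[1, 1], [0, 1]], 571)
print(s_571)
```

For Nc = 571 this takes 9 squarings plus one multiplication into S for each of the 6 set bits, rather than 570 successive multiplications by T.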
AVF vs PARMA vs MACAU
• Comparison

                                AVF     PARMA    MACAU
  SBU                           O       O        O
  Spatial MBU                   X       X        O
  Temporal MBU                  X       O        O
  Protection code               X       O        O
  Variable domain size (n-bit)  X       O        X
  Computation complexity        O(1)    O(n^3)   O(n^3 × log(Nc))

• VCC = Nc
• MACAU is capable of addressing all the soft error related situations
• Both PARMA and MACAU are much slower than AVF
• Is the computation overhead an overkill for practical use? No
• Reliability-aware sampling for accelerating reliability simulations