MACAU: A Markov Model for Reliability Evaluations of Caches Under Single-bit and Multi-bit Upsets
Jinho Suh, Murali Annavaram, Michel Dubois
HPCA-18

Outline
• Definitions
• Model-based Soft-Error Reliability Evaluations
• Modeling Multiple-bit Upsets
• MACAU
  • Describing Markov Chain
  • Measuring Intrinsic Reliability
  • Benchmarking Realistic Reliability
• Evaluations
• Summary

Definitions
• Domain: set of bits bundled together; protection domain
• Spatial MBU (Multi-Bit Upset; spatial q-BU): multiple (q) bits are flipped due to one particle hit (cf. spatial MBE)
• Temporal MBU: multiple bits are flipped due to more than one particle hit

Model-based Soft-Error Reliability Evaluations
• Comparison [Saleh’90, Reviriego’09, Mukherjee’03, Suh’11]

                        Intrinsic MTTF    AVF      PARMA      MACAU
  Masking effect        X                 O        O          O
  Single-Bit Upset      O                 O        O          O
  Spatial MBU           O                 X        X          O
  Temporal MBU          --                X        O          O
  Protection code       O                 X        O          O
  Variable domain size  X                 X        O          X
  Computation           Extremely cheap   Cheap    Expensive  Expensive

• Projection from industry [ITRS’07, ITRS’10]

  Year of production    2013     2016     2019     2022
  Feature size [nm]     35       25       18       13
  Gate length [nm]      13       9        6        4.5
  SER [FIT per Mb]      1,250    1,300    1,350    1,400
  % MBU in SEU          64%      100%     100%     100%

Challenges in Modeling Spatial MBUs
• Multiple spatial MBUs may leave complex patterns of flipped bits
• Protection domain: word
[Figure: array of word protection domains PD#1 to PD#8 with MBU strikes labeled 1, 2, 3]
• Dealing with MBUs spanning domains vertically/horizontally
  • MBUs spanning domains vertically: easy to address if isolated in multiple protection domains
  • MBUs spanning domains horizontally (edge effect): negligible impact

Assumptions in MACAU
1. Spatial MBUs always happen in contiguous patterns
[Figure: MBU patterns (a) Contiguous, (b) Disjoint, (c) Diagonal]
   • Recent studies [Georgakos’07, Mahatme’11] report that (a) happens much more frequently than (b)/(c)
   • (b)/(c) can be approximated as (a) by the contiguous patterns framed by the red dotted rectangles (with minimal overestimation)
2. 
At most one SEU strikes a protection domain in one cycle, and at most two SEUs flip bits in the same protection domain
   • Soft errors are extremely rare
3. Edge effect is ignored
   • It affects only a small number of bits next to the border of two protection domains

Distribution of Spatial MBUs in a SEU
• Probability distribution of MBUs for omni-directional galactic cosmic rays [Tipton’08] in the cache
• We concentrate on the MBU patterns included in the dotted square
[Figure: probability of event (log scale, 10^0 down to 10^-4) versus number of columns (0 to 10), for 1-row and 2-row MBU patterns]

Probability of a SEU in One Cycle
• SEU model
  • $p_{SEU\_PD}$: probability that one SEU arrives during one cycle period in the protection domain
  • The Poisson probability mass function gives $p_{SEU\_PD}$:
    $p_{SEU\_PD} = \lambda_{SEU\_PD} \times e^{-\lambda_{SEU\_PD}} \cong \lambda_{SEU\_PD}$
  • $\lambda_{SEU\_PD}$: Poisson rate of SEUs in a 32-bit word protection domain, e.g. $3 \times 10^{-24}$ per PD at 65 nm, 3 GHz CPU
• Probability of having a spatial q-BU in a PD due to one SEU
  • Including the effect of vertical spatial MBUs

Modeling Spatial Effects with Markov Chain
• Markov states
  • Transient state (non-recurrent state): after departing from the state, the probability of not returning is nonzero
  • Absorbing state: no more transitions to other states are possible once the state is visited
• Markov chain
  • The state expresses the number of flipped, incorrect bits
[Figure: Markov chain over states 0, 1, ..., q+1, built up for SBU only, for SBU and 2BU, and for up to q-BU]

Markov Chain and Overlap of SEUs
• Number of overlapped bits
  • k bits flipped by 1st SEU: current Markov state = k
  • q-BU arrives with 2nd SEU: next Markov state = k + d
• Overlap (o): $k' = k + d = k + q - 2o$, so $d = q - 2o$
  • Each of the o overlapped bits flips back to its correct value, and the other q − o bits become incorrect
  • d is even if and only if q is even
  • d is odd if and only if q is odd
  • o = 0, 1, 2, …, min(k, q)

  k   q   o   d    k'   overlap
  8   3   0   3    11   no
  8   3   1   1    9    partial
  8   3   2   -1   7    partial
  8   3   3   -3   5    full

Overlap of SEUs
• In an N-bit protection
domain with contiguous k-bit fault
• SEU of spatial q-BU arrives:
[Figure: three placements of a spatial q-BU relative to the k-bit fault inside the N-bit protection domain]
  1. Full overlap (0 < o = q)
  2. Partial overlap (0 < o < q)
  3. No overlap (o = 0)

Markov Transition Matrix T: Example for the Case of D
• Building T with overlapping probabilities
• Example: matrix D for MBUs with up to 3 horizontal BUs and up to 2 vertical BUs

Markov Transition Matrix T: General Case
• T contains the probabilities of transition between any two states in one cycle

Using MACAU
• Measuring intrinsic MTTF
  1. Build T
  2. Add transitions on T for scrubbing
  3. Calculate mean first-passage time from state 0 to failing states
• Benchmarking
  1. Program starts
  2. Build T
  3. Manage VCC
  4. Word consumed? If no, keep managing VCC
  5. If yes, calculate the probability of failure on the word from $T^{VCC}$
  6. Accumulate E[#fail]
  7. Program ends? If no, go back to managing VCC
  8. If yes, calculate failure rate

Calculating intrinsic MTTF
• Mean first-passage time gives intrinsic MTTF
  1. Make the states that cause failure absorbing states
     • With a b-bit error correcting code, states > b are failing states
  2. Measure the transition time from state 0 (clean state) to a failing state
• With transition matrix T:
  • Without scrubbing
    • First-passage time from state 0 to any absorbing state gives the intrinsic MTTF
  • With (stochastic) scrubbing at a scrubbing interval of L
    • States that can be scrubbed have extra transitions to state 0 with probability = 1/L
    • Then the first-passage time gives the intrinsic MTTF

Benchmarking FIT rate with T
• Whenever an access is made:
  1. Measure VCC
  2. Calculate $S = T^{VCC}$ to get the transition matrix after VCC cycles
  3. 
Add to the expected number of failures by summing the probabilities of reaching failure states
• Failure probability is obtained from the state transition probability in S
[Table: SDC and DUE failure states for each protection scheme: No protection, Odd-parity, SECDED, DECTED, TECQED]

Evaluations
• Intrinsic MTTFs (32b-word)
  SEUs: SBUs only; 1BU+2BU (0.5:0.5)
  Protection on a word: SEC, DEC, DEC, D, TEC

  Scrubbing    MACAU      Saleh      MACAU      Reviriego  MACAU      Reviriego  MACAU      Reviriego
  No scrub     6.715E+06  6.245E+06  8.012E+06  7.211E+06  9.700E+06  --         1.330E+07  1.700E+07
  Once/year    1.092E+13  1.058E+13  1.593E+13  1.411E+13  1.153E+08  --         1.815E+14  1.839E+14
  Once/month   1.329E+14  1.287E+14  1.938E+14  1.716E+14  1.153E+08  --         2.209E+15  2.238E+15
  Once/day     3.986E+15  3.862E+15  5.813E+15  5.149E+15  1.153E+08  --         6.626E+16  6.713E+16

• Differences in MTTF: MACAU addresses the ‘overlapping effect’, which Saleh/Reviriego ignore [Ming’11]
• FIT rates from benchmarking

                No-protection  Odd-parity  SECDED    DECTED      TECQED
  DUE (TRUE)    N/A            1217.840    110.872   37.614      7.3947E-16
  DUE (FALSE)   N/A            2448.644    222.923   75.629      1.3724E-15
  SDC           1328.711       110.872     37.614    8.0925E-16  6.9784E-17

• MACAU differs by ≤ 0.015% from PARMA when benchmarking SBUs only

Summary
MACAU
• Model for temporal/spatial MBU effects
• Capable of evaluating various protection schemes
• Useful for quick evaluation of caches, by measuring intrinsic MTTFs
• Useful for rigorously benchmarking FIT rates in caches under MBUs and SBUs
Future work
• Refining the model to address the edge effect
• Spatial MBU model for arbitrarily shaped patterns
• Model for TAG and meta-bit vulnerability
• Application to processor buffers (ROB, LSQ, IFQ)

THANK YOU!

(Some) References
[Biswas’10] A. Biswas, C. Recchia, S. S. Mukherjee, V. Ambrose, L. Chan, A. Jaleel, M. Plaster, and N. Seifert. Explaining Cache SER Anomaly Using Relative DUE AVF Measurement. HPCA 2010.
[Li’05] X. Li, S. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. 
In Proceedings of the International Conference on Dependable Systems and Networks, 496-505, 2005.
[Mahatme’11] N. Mahatme, B. Bhuva, Y. Fang, and A. Oates. Analysis of multiple cell upsets due to neutrons in SRAMs for a deep-n-well process. In IEEE International Reliability Physics Symposium (IRPS), SE.7.1-6, April 2011.
[Ming’11] Z. Ming, X. L. Yi, L. Chang, and Z. J. Wei. Reliability of memories protected by multibit error correction codes against MBUs. IEEE Transactions on Nuclear Science, 58(1), 289-295, Feb. 2011.
[Mukherjee’03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, 29-40, 2003.
[Mukherjee’04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, 37-42, 2004.
[Reviriego’09] P. Reviriego and J. A. Maestro. Study of the effects of multibit error correction codes on the reliability of memories in the presence of MBUs. IEEE Transactions on Device and Materials Reliability, 9, 31-39, 2009.
[Saleh’90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. IEEE Transactions on Reliability, 39(1), 114-122, 1990.
[Tipton’08] A. D. Tipton, J. A. Pellish, J. M. Hutson, R. Baumann, X. Deng, A. Marshall, M. A. Xapsos, H. S. Kim, M. R. Friendlich, M. J. Campola, C. M. Seidleck, K. A. LaBel, M. H. Mendenhall, R. A. Reed, R. D. Schrimpf, R. A. Weller, and J. D. Black. Device-orientation effects on multiple-bit upset in 65 nm SRAMs. IEEE Transactions on Nuclear Science, 55, 2880-2885, 2008.
[Ziegler] J. F. Ziegler and H. 
Puchner, “SER – History, Trends and Challenges,” Cypress Semiconductor Corp.

ADDENDUM

Definitions
• Domain: set of bits bundled together; protection domain
• SEU (Single Event Upset): state change in memory due to one particle hit
• SBU (Single-Bit Upset): one bit is flipped due to one SEU
• Spatial MBU (Multi-Bit Upset; spatial q-BU): multiple (q) bits are flipped due to one SEU
• Temporal MBU: multiple bits are flipped due to more than one SEU
• Fault: incorrect state in a domain
• Error: a manifested fault, propagated outside the original domain
• Failure: visible error
• Consumption: an event resulting in the change of architectural state
• Vulnerability clock cycles (VCCs): time in cycles that a bit is exposed to particle hits

Model-based Soft-Error Reliability Evaluations
Intrinsic reliability
• (Intrinsic) Mean-Time-To-Failure [Saleh’90, Reviriego’09]
  + Fast, first-cut estimation of circuit-level reliability
  − Highly pessimistic
    • No consideration of masking effects
  − Unclear for protected memories
    • No consideration of cleaning effects on accesses
Benchmarked reliability
• AVF (Architectural Vulnerability Factor) [Mukherjee’03]
  + Quickly calculates SDC without protection, or DUE under parity, due to SBUs
  − Ignores temporal/spatial MBUs
  − Cannot account for error detection/correction schemes
• PARMA (Precise Analytical Reliability Model for Architectures) [Suh’11]
  + Addresses temporal MBUs caused by multiple SBUs
  + Evaluates FIT (Failures-In-Time) of protected caches
  − Cannot account for spatial MBUs
• MACAU models spatial MBUs and their temporal effects to evaluate soft-error vulnerabilities on cache data bit-cells

Vulnerability Clock Cycles (VCCs)
• Vulnerability of a bit ∝ exposure time
  • Common assumption in model-based studies
• We measure a bit’s exposure time with VCCs
  • VCCs are equivalent to ACE (Architecturally Correct Execution) cycles in AVF methods
  • Managing VCCs is similar to (reliability-)lifetime analysis in AVF

Two Basic Models in Soft Error Benchmarking
1. Fault generation model
   • Poisson Single Event Upset model
   • Probability distribution of having k faulty bit(s) in a domain (set of bits) during vulnerability clock cycles
2. Fault propagation model
   • Observing consumption for tracking failures due to SEUs
   • Accumulating the expected number of (total system) failures whenever consumption happens
• Benchmarking measures: generated faults, errors (propagated faults), expected number of failures, failure rate

Temporal & Spatial Effects
• Temporal effects
  • Required for evaluating/quantifying reliability in the presence of protection codes
  • Example: if SBUs are dominant, temporal effects must be addressed to evaluate SECDED-protected caches
    • Evaluating DUE FIT rates requires quantifying failures with 2 flipped bits
    • Evaluating SDC FIT rates requires quantifying failures with >2 flipped bits
• Spatial effects
  • Growing concern with future technologies
  • All SEUs are expected to be spatial MBUs in the near future [ITRS’07]
  • Radiation-hardened/interleaved designs may not always be possible

Spatial MBUs and Layout
• Circuit layout determines the population of spatial MBUs
  • Deep-N-well process is commonly used by TSMC, Infineon, etc.
  • Parasitic bipolar transistors contribute to spatial MBUs
  • With the deep-N-well process, only parasitic NPN transistors are turned on
  • At most two bit flips are observed in the same direction of wells [Mahatme’11]
[Figure: cache slice layout showing a block of Word 0 to Word 7, wordlines, bitlines, and P/N wells: (a) Wells in the bitline direction, (b) Wells in the wordline direction]

Markov Chain and Overlap of SEUs
• Consider how many bits are overlapped
  • k bits flipped by 1st SEU: current Markov state = k
  • q-BU arrives with 2nd SEU: next Markov state = k + d
• Overlap (o): $k' = k + d = k + q - 2o$, so $d = q - 2o$
• d is even if
and only if q is even
• d is odd if and only if q is odd
• o = 0, 1, 2, …, min(k, q)

  k   q   o   d    k'
  8   3   0   3    11
  8   5   1   3    11
  8   7   2   3    11
  8   9   3   3    11
  8   1   1   -1   7

Set-up for Markov Chain
• Overlapping probabilities
  • o overlapped bits when a q-BU hits a word that already has k flipped bits
  1. If 0 < o = q
  2. If 0 < o < q
  3. Else o = 0

MACAU: Computing $S = T^{N_c}$
• Major computation bottleneck in MACAU
• Square-and-multiply method for efficient matrix multiplication
  • Matrix power computation using the doubling trick
  • Example: Nc = 571 (decimal) = 1000111011 (binary)

  Q     T   T^2   T^4   T^8   T^16   T^32   T^64   T^128   T^256   T^512
  LSB   1   1     0     1     1      1      0      0       0       1

  • Result matrix S is initialized as I
  • At each step, if LSB == 1 then S = S∙Q; then Q is squared and Nc is shifted right by one
  • $S = T^{N_c} = T^{(1+2+8+16+32+512)} = T^{571}$
• Matrix computation is also data-parallel

AVF vs PARMA vs MACAU
• Comparison (VCC = Nc)

                                AVF     PARMA     MACAU
  SBU                           O       O         O
  Spatial MBU                   X       X         O
  Temporal MBU                  X       O         O
  Protection code               X       O         O
  Variable domain size (n-bit)  X       O         X
  Computation complexity        O(1)    O(n^3)    O(n^3 × log(Nc))

• MACAU is capable of addressing all the soft-error-related situations
• Both PARMA and MACAU are much slower than AVF
• Is the computation overhead overkill for practical use? No
  • Reliability-aware sampling for accelerating reliability simulations
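
The square-and-multiply computation of S = T^Nc from the addendum can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `mat_mul` and `mat_pow`, the hit probability `P_HIT`, and the 3-state toy matrix `T_TOY` are all hypothetical, and the toy chain ignores the overlap probabilities that the real MACAU matrix carries.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(t, nc):
    """Return t**nc by square-and-multiply, scanning nc from its LSB."""
    if nc < 0:
        raise ValueError("nc must be non-negative")
    n = len(t)
    # Result matrix S is initialized as the identity I.
    s = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    q = t                      # Q holds T^(2^i) at step i
    while nc > 0:
        if nc & 1:             # LSB == 1: fold the current power into S
            s = mat_mul(s, q)
        nc >>= 1               # shift right by one
        if nc:                 # square only while bits remain
            q = mat_mul(q, q)
    return s


# Hypothetical toy chain (NOT the MACAU matrix): state = number of
# flipped bits; state 2 is an absorbing failing state, as it would be
# for a single-error-correcting code. Overlap is ignored here.
P_HIT = 1e-3                   # assumed per-cycle hit probability
T_TOY = [
    [1.0 - P_HIT, P_HIT, 0.0],
    [0.0, 1.0 - P_HIT, P_HIT],
    [0.0, 0.0, 1.0],
]

S = mat_pow(T_TOY, 571)        # transition matrix after VCC = 571 cycles
p_fail = S[0][2]               # probability of reaching the failing state from state 0
print(f"P(fail after 571 cycles) = {p_fail:.6f}")
```

For Nc = 571 the loop performs 9 squarings and 6 multiplications into S, compared with 570 products for naive repeated multiplication, which is where the log(Nc) factor in the complexity comparison comes from.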