Soft Error Benchmarking of L2 Caches with PARMA
Jinho Suh, Mehrtash Manoochehri, Murali Annavaram, Michel Dubois

Outline
• Introduction
  – Soft Errors
  – Evaluating Soft Errors
• PARMA: Precise Analytical Reliability Model for Architecture
  – Model
  – Applications
• Conclusion and Future Work

Soft Errors
• Random errors on perfect circuits, mostly affecting SRAMs
  – From alpha particles (electric noises)
  – From neutron strikes in cosmic rays
• Severe problem, especially with power saving techniques
  – Superlinear increases with voltage and capacitance scaling
  – Near-/sub-threshold Vdd operation for power savings
• Concerns in:
  – Large servers
  – Avionic or space electronics
  – SRAM caches with drowsy modes
  – eDRAM caches with reduced refresh rates

Why Benchmark Soft Errors
• Designers need a good estimate of expected errors to incorporate a ‘just-right’ solution at design time
• Good estimation is non-trivial
  – Multi-Bit Errors are expected
  – Masking effects: not every Single Event Upset leads to an error [Mukherjee’03]
    · Faults become errors when they propagate to the outer scope
    · Faults can be masked off at various levels
• Design decision
  – Is the protection under consideration too much or too little?
  – Is a newly proposed protection scheme better?
• The impact of soft errors needs to be addressed at design time
• Estimating soft error rates for target application domains is an important issue

Evaluating Soft Errors: Some Reliability Benchmarking Approaches
• Fundamental difficulty: soft errors happen very rarely
• Field Analysis, Life Testing
  – Difficulty in collecting data [Ziegler]
  – Obsolete for design iteration
• Fault Injection, Accelerated Testing
  – Require massive experiments
  – Distortion in measurement/interpretation
• Analytical Modeling (Intrinsic SER, AVF, SoftArch)
  – Better for estimating SER in short time
  – Complexity determines preciseness
  – Intrinsic FIT (Failure-in-Time) rate
    · Highly pessimistic: no consideration of masking effects
    · Unclear for protected caches
  – AVF [Mukherjee’03] and SoftArch [Li’05]
    · Quickly compute SDC without protection or DUE under parity
    · Ignores temporal/spatial MBEs
    · Can’t account for error detection/correction schemes

Two Components of PARMA (Precise Analytical Reliability Model for Architecture)
1. Fault generation model
  – Poisson Single Event Upset model
  – Probability distribution of having k faulty bit(s) in a domain (set of bits) during multiple cycles
2. Fault propagation model
  – Fault becomes Error when faulty bit is consumed
    · Instruction with faulty bit commits
    · Load commits and its operand has a faulty bit
• PARMA measures: Generated faults → Propagated faults → Expected errors → Error rate

Using Vulnerability Clock Cycles to Track Bit Lifetime
• Used to track cycles that any bit spends in the vulnerable component: L2$
  – Ticks when a bit resides in L2
  – Stops when a bit stays outside L2
• Similar to lifetime analysis in AVF method
• When a word is updated to hold new data, its VC resets to zero
• When this block is refilled later, VCs should start ticking from here
• L2 block is NOT dead even when it is evicted to MEM because it can be refilled into L2 later
• Accesses to L1$ determine consumption of the faulty bits
• When L1 block is evicted, REAL impact of Soft Error to the system is finalized
[Figure: Proc, L1$, L2$, and Main Memory with a per-word VC for words 0 to 3 of a set of bits; the VC ticks while the bits are in L2$ and stops otherwise.]

Probability of a Bit Flip in One Cycle
• SEU Model
  – p: probability that one bit is flipped during one cycle period
  – Poisson probability mass function gives p:
$$p = \sum_{\text{odd } j} \frac{\lambda^{j}}{j!}\, e^{-\lambda}$$
  – λ: Poisson rate of SEUs
    · ex) $10^{-25}$/bit @ 65nm 3GHz CPU

Temporal Expansion: Probability of a Bit Flip in Nc Vulnerability Cycles
• $q(N_c)$: probability of a bit being faulty in $N_c$ vulnerability cycles
• To be faulty at the end of $N_c$ cycles, a bit must flip an odd number of times in $N_c$
$$P_i(N_c) = \binom{N_c}{i} p^{i} (1-p)^{N_c-i}, \quad i = 0,\dots,N_c$$
$$q(N_c) = \sum_{i=0}^{\lfloor N_c/2 \rfloor} P_{2i+1}(N_c) \quad (\text{with } P_j(N_c) = 0 \text{ for } j > N_c)$$
[Figure: timeline of $N_c$ vulnerability clock cycles; p applies to one cycle period, $q(N_c)$ to the whole interval.]

Spatial Expansion: from a Bit to the Protection Domain (Word)
• ${}^{S}Q(k)$
  – Probability of set of bits S having k faulty bits inside (during $N_c$ cycles)
• Choose cases where there are k faulty bits in S
  – S has [S] bits inside
  – Assumed that all the bits in the word have the same VCs
  – Otherwise, discrete convolution should be used
$${}^{S}Q(k) = \binom{[S]}{k}\, q(N_c)^{k}\, (1-q(N_c))^{[S]-k}, \quad k = 0,\dots,[S]$$
[Figure: protection domain S (word) made of bytes; each bit has fault probability $q(N_c)$ and each byte has distribution $q_b(k)$ over the same timeline.]

Faults in the Access Domain (Block)
• ${}^{D}Q(k)$
  – Probability of k faulty bits in any protection domain inside of D ($S_m$)
• Choose cases where there are k faulty bits in each $S_m$
  – Sum for all $S_m$ in D
$${}^{D}Q(k) = \sum_{m=1}^{M} {}^{S}Q_m(k)$$
• So far, masking effect has not been considered
  – Expected number of intrinsic faults/errors are calculated so far
[Figure: access domain D (block) containing protection domains S (words), each made of bytes.]

Considering Masking Effect: Separating TRUE from Intrinsic Faults
• If all faults occur in unconsumed bits, then don’t care (FALSE events)
• TRUE faults = {All faults in S} – {All faults in unconsumed bits}
$${}^{S}Q(k) - {}^{C}Q(0)\,{}^{\bar C}Q(k)$$
[Figure: bits of a protection domain; grey: consumed bits ($C$), white: unconsumed bits ($\bar C$).]
• Probability that $\bar C$ has k faults and $C$ has 0 faults: FALSE or masked faults
• Deduct the probability that ALL k faulty bits are in the unconsumed bytes from the probability that the protection domain S has k faulty bits to obtain the probability of TRUE faults, which become SDCs or TRUE DUEs
• $C$ and $\bar C$ are obtained through simulations

Using PARMA to Measure Errors in Block Protected by Block-Level SECDED
• k ≥ 3 is SDC
$${}^{B}E_{SB,SDC} = \sum_{k=3}^{8N_B} {}^{B}Q(k) \;-\; {}^{C}Q(0) \sum_{i=3}^{8N_{\bar C}} {}^{\bar C}Q(i)$$
  – First term: ≥3 faults in the block; second term: all faulty bits unconsumed
  – Undetected error that affects reliability (SDC): three or more faulty bits in the block; at least one faulty bit in the consumed bits
• k = 2 is DUE
$${}^{B}E_{SB,TRUE\_DUE} = {}^{B}Q(2) \;-\; {}^{C}Q(0)\,{}^{\bar C}Q(2)$$
  – First term: 2 faults in the block; second term: all faulty bits unconsumed
  – Detected error that affects reliability (TRUE DUE): exactly two faulty bits in the block; at least one faulty bit in the consumed bits
• See paper for how to apply PARMA on the different protection schemes

Four Contributions
Modeling
1. Development of the rigorous analytical model called PARMA
Application
2. Measuring SERs on structures protected by various schemes
3. Observing possible distortions in the accelerated studies
  – Quantitatively
  – Qualitatively
4. Verifying approximate models

Measuring SERs on Structures Protected by Various Schemes
• Target Failures-In-Time of IBM Power6
  – SDC: 114
  – DUE: 4,566
• Average L2 (256KB, 32B block) cache FITs:
  – 100M SimPoint simulations of 18 benchmarks from SPEC2K, on sim-outorder
  – Results were verified with AVF simulations

| Scheme | SDC [FIT] | DUE (TRUE+FALSE) [FIT] | Latency [cycles] | Check bits per 256 bits |
|---|---|---|---|---|
| No Protection | 155.66 | N/A | 10 | 0 |
| 1-bit Odd Parity | 2.53E-15 | 372.83 | 10 | 1 |
| Block-level SECDED | 8.34E-31 | 7.04E-15 | 14 | 10 |
| Word-level SECDED | 2.92E-33 | 6.32E-16 | 13 | 56 |

• Implies word-level SECDED might be overkill in most cases
• Implies increasing the protection domain size: ex) CPPC @ISCA2011
• Partially protected caches or caches with adaptive protection schemes need to be carefully quantified for their FITs
  – PARMA provides a comprehensive framework that can measure the effectiveness of such schemes

Observing Possible Distortions in the Accelerated Tests
• Highly accelerated tests
  – SPEC2K benchmarks end in several minutes (wall-clock time)
  – Needs to accelerate SEU rate $10^{17}$ times to see reasonable faults
• How to scale down the results?
  – Results multiplied by $10^{-17}$ times?
  – Can distort results quantitatively
• SDC > DUE?
  – Having more than two errors overwhelms the cases of having two errors
  – Can be misleading qualitatively
• Results were verified with fault-injection simulations
[Figure: FIT for ammp versus SEU rate (logarithmic axes), with curves for DUE, SDC, and MAX possible errors.]

Verifying Approximate Models
• Example: model for word-level SECDED protected cache
• Methods for determining cache scrubbing rates [Mukherjee’04][Saleh’90]
  – Ignoring cleaning effects at accesses: overestimate by how much?
• New model with geometric distribution of Bernoulli trials
  – Assumption: at most two bits are flipped between two accesses to the same word
  – Every access results in a detected error or in no-error (corrected)
  – $P_{DUE}$: pmf of two Poisson arrivals
  – $T_{AVG}$: average access interval between two accesses to the same word
$$MTTF_{SECDED\text{-}Word} = \frac{1}{P_{DUE}} \times T_{AVG} \ [\text{sec}] \quad (\text{mean of geometric distribution})$$
[Figure: timeline for one word. <1> Read word #1: activate ECC code, removing existing 1 faulty bit. <2> Write to word #1: updating word, removing any faulty bit. Marked intervals: average ACE interval, average unACE interval, MTTF for having 2nd faulty bits in the same word, MTTF extended due to <1> and <2>.]
• Average results
  – AVF x FIT from previous method: 2.1454 FIT
  – New approximate model: 2.8246E-14 FIT
  – PARMA: 6.3170E-16 FIT
• PARMA provides rigorous reliability measurements
  – Hence, it is useful to verify the faster, simpler approximate models

Conclusion and Future Work
• Summary
  + PARMA is a rigorous model measuring Soft Error Rates in architectures
  + PARMA works with a wide range of SEU rates without distortion
  + PARMA handles temporal MBEs
  + PARMA quantifies SDC or DUE rates under various error detection/protection schemes
  - PARMA does not address spatial MBEs yet
  - PARMA does not model TAG yet
  - Due to the complexity, PARMA is slow
• Future Work
  – Extend PARMA to account for spatial MBEs and TAG vulnerability
  – Develop sampling methods to accelerate PARMA

THANK YOU! QUESTIONS?

(Some) References
[Biswas’05] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. Mukherjee, and R. Rangan. Computing Architectural Vulnerability Factors for Address-Based Structures. In Proceedings of the 32nd International Symposium on Computer Architecture, 532-543, 2005.
[Li’05] X. Li, S. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. In Proceedings of the International Conference on Dependable Systems and Networks, 496-505, 2005.
[Mukherjee’03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A Systematic Methodology to Calculate the Architectural Vulnerability Factors for a High-Performance Microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, 29-40, 2003.
[Mukherjee’04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, 37-42, 2004.
[Saleh’90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. In IEEE Transactions on Reliability, 39(1), 114-122, 1990.
[Ziegler] J. F. Ziegler and H. Puchner. SER – History, Trends and Challenges. Cypress Semiconductor Corp.

Addendum

Some Definitions
• SDC = Silent Data Corruption
• DUE = Detected and Unrecoverable Error
• SER = Soft Error Rate = SDC + DUE
• Errors are measured as
  – MTTF = Mean Time to Failure
  – FIT = Failure in Time; 1 FIT = 1 failure in a billion hours
    · 1 year MTTF = 1 billion/(24*365) = 114,155 FIT
    · FIT is commonly used since FIT is additive
• Vulnerability Factor = fraction of faults that become errors
  – Also called derating factor or soft error sensitivity

Soft Errors and Technology Scaling
• Hazucha & Svensson model
$$Circuit\_SER = Const \times Flux \times Area \times e^{-Q_{crit}/Q_{coll}}$$
• For a specific size of SRAM array:
  – Flux depends on altitude & geomagnetic shielding (environmental factor)
  – (Bit) Area is process technology dependent (technology factor)
  – $Q_{coll}$ is charge collection efficiency, technology dependent
  – $Q_{crit} \propto C_{node} \times V_{dd}$
• According to scaling rules both C and V decrease and hence Q decreases rapidly
• Static power saving techniques (on caches) with drowsy mode or using near-/sub-threshold Vdd make cells more vulnerable to soft errors
Source: Hazucha et al., “Impact of CMOS technology scaling on the atmospheric neutron soft error rate”

Error Classification
• Silent Data Corruption (SDC)
• TRUE- and FALSE- Detected Unrecoverable Error (DUE)
[Figure: error classification flow with “Consumed?” decision points.]
Source: C. Weaver et al., “Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor,” ISCA 2004

Soft Error Rate (SER)
• Intrinsic SER – more from the component’s view
  – Assumes all bits are important all the time
  – Intrinsic SER projections from ITRS2007 (High Performance model)

| Year of production | 2010 | 2013 | 2016 | 2019 | 2022 |
|---|---|---|---|---|---|
| Feature size [nm] | 45 | 35 | 25 | 18 | 13 |
| Gate length [nm] | 18 | 13 | 9 | 6 | 4.5 |
| Soft Error Rate [FIT per Mb] | 1200 | 1250 | 1300 | 1350 | 1400 |
| Failure rate in 1Mb [fails/hour] | 1.2E-6 | 1.25E-6 | 1.3E-6 | 1.35E-6 | 1.4E-6 |
| % Multi-Bit Upsets in Single Event Upsets | 32% | 64% | 100% | 100% | 100% |

  – Intrinsic SER of caches protected by SECDED code?
    · Cleaning effect on every access
• Realistic SER – more from the system’s view
  – Some soft errors are masked and do not cause system failure
  – EX) AVF x Intrinsic SER: what about caches with protection code?

Soft Error Estimation Methodologies: Industries
• Field analysis
  – Statistically analyzes reported soft errors in market products
  – Using repair records, sales of replacement parts
  – Provides obsolete data
• Life testing
  – Tester constantly cycles through 1,000 chips looking for errors
  – Takes around six months
  – Expensive, not fast enough for chip design process
  – Usually used to confirm the accuracy of accelerated testing (x2 rule)
• Accelerated testing
  – Chips under various beams of particles, under well-defined test protocol
    · Terrestrial neutrons – particle accelerators (protons)
    · Thermal neutrons – nuclear reactors
    · Radioactive contamination – radioactive materials
• Hardship
  – Data rarely published: potential liability problems of products
  – Comparisons of accelerated testing vs life testing are even rarer
  – IBM, Cypress published a small amount of data showing correlation
Source: J. F. Ziegler and H. Puchner, “SER – History, Trends and Challenges,” Cypress Semiconductor Corp

Soft Error Estimation Methodologies: Common Ways in Research
• Fault-injection
  – Generate artificial faults based on the fault model
  + Applicable to a wide level of designs (from RTL to system simulations)
  - Massive number of simulations necessary to be statistically valid
  - Highly accelerated Single Event Upset (SEU) rate is required for Soft Errors
  - How to scale down the measurements to ‘real environment’ is unclear
• Architectural Vulnerability Factor
  – Find derating factor (Faults → Errors) by {ACE bits}/{total bits} per cycle
• SoftArch
  – Extrapolate AVG(TTFs) from one program to MTTF using infinite executions
• AVF and SoftArch – use simplified Poisson fault generation model
  + Works well with small scale system in the current technology at earth’s surface: single bit error dominant environment
  - Can’t account for error protection/detection schemes (ECC)
  - Unable to address temporal & spatial MBEs
• AVF is NOT an absolute metric for reliability
  – FIT_structure = intrinsic_FIT_structure * AVF_structure
Sources: M. Li et al., “Accurate Microarchitecture-Level Fault Modeling for Studying Hardware Faults,” HPCA 2009; S. S. Mukherjee et al., “A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor,” MICRO 2003; X. Li et al., “SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors,” DSN 2005

Evaluating Soft Errors: Some Reliability Benchmarking Approaches
• Intrinsic FIT (Failure-in-Time) rate – highly pessimistic
  – Every bit is vulnerable in every cycle
  – Unclear how to compute intrinsic FIT rates for protected caches
• Architectural Vulnerability Factor [Mukherjee’03]
  – Lifetime analysis on Architecturally Correct Execution bits
  – De-rating factor (Faults → Errors); realistic FIT = AVF x Intrinsic FIT
• SoftArch [Li’05]
  – Computes TTF for one program run and extrapolates to MTTF
• AVF and SoftArch
  + Quickly compute SDC with no parity or DUE under parity
  - Ignores temporal MBEs
    · Two SEUs on one word become two faults instead of one fault
    · Two SEUs on the same bit become two faults instead of zero fault
  - Ignores spatial MBEs
  - Can’t account for error detection / correction schemes
• To compare SERs of various error correcting schemes:
  – Temporal/spatial MBEs must be accurately counted

Prior State of the Art Reliability Model: AVF
• Architectural Vulnerability Factor (AVF)
  – AVF_bit = Probability a bit matters (for a user-visible error) = # of bits that affect the user-visible outcome / total # of bits
  – If we assume AVF = 100% then we will over-design the system
  – Need to estimate AVF to optimize the system design for reliability
• AVF equation for a target structure
$$AVF_{structure} = \frac{\sum_{i=0}^{N} (\text{bitwise AVF})_i}{N} = \frac{\sum_{i=0}^{N} ACEcycles_i}{N \times total\_cycles} = \frac{\text{Average number of ACE bits in a structure in a cycle}}{\text{Total number of bits in a structure}} \quad \text{(Eq. 1)}$$
• AVF is NOT an absolute metric for reliability
  – FIT_structure = intrinsic_FIT_structure * AVF_structure
Source: Shubu Mukherjee, “Architecture design for soft errors”

ACEness of a Bit
• ACE (Architecturally Correct Execution) bit
  – ACE bit affects program outcome: correctness is subjective (user-visible)
• Microarchitectural ACE bits
  – Invisible to programmer, but affects program outcome
  – Easier to consider Un-ACE bits
    · Idle/Invalid/Misspeculated state
    · Predictor structures
    · Ex-ACE state (architecturally dead or invisible states)
• Architectural ACE bits
  – Visible to programmer
  – Transitive (ACE bit in the word makes the Load instruction ACE)
  – Easier to consider Un-ACE bits
    · NOP instructions
    · Performance-enhancing operations (non-opcode field of non-binding prefetch, branch prediction hint)
    · Predicated false instructions (except predicate status bit)
    · Dynamically dead instructions
    · Logical masking
• AVF framework = lifetime analysis to correctly find ACEness of bits in the target structure for every operating cycle
Source: Shubu Mukherjee, “Architecture design for soft errors”

Rigorous Failure/Error Rate Modeling
• In existing methodologies such as AVF multiplied by intrinsic rate
  – Estimation is simple and easy
  – Imprecise estimation but safe overestimation
• Downside of the classical approach (i.e. AVF-based methodology)
  – SEU is a very rare event while program execution time is rather short
    · In a 3GHz processor, SEU rate is 1.0155E-25 within one cycle for one bit
    · Equivalently, the probability of being hit by an SEU and being a faulty bit is 1.0155E-25
  – Simplified assumption that one SEU results in one Fault/Error directly
    · the same bit may be hit multiple times, and/or
    · multiple bits may become faulty in a word
• In space, or when extremely low Vdd is supplied to SRAM cell:
  – SEU rate could rise high (more than 10E6 times)
  – Second order effects become significant
• With data protection methodology:
  – How to measure vulnerability is uncertain due to the simplified assumption

Reliability Theory (1)
• Fundamental definition of probability in Reliability Theory
  – Number(Event)/Number(Trials): approximation of true Prob(Event)
  – True probability is barely known
  – Approximation → true probability when trials → ∞, by the Law of Large Numbers
• Two events in R-T: Survival & Failure of a component/system
• Reliability Functions
  – (Component/system) Reliability R(t), and Probability of Failure Q(t)
$$R(t) = \frac{N_s}{N_s + N_f}, \quad Q(t) = \frac{N_f}{N_s + N_f}, \quad R(t) + Q(t) = 1$$
    · Prob(Event) up to and at time t: conditional probability
    · Note that R(t), Q(t) are time dependent in general
  – (Conditional) Instantaneous Failure Rate λ(t), a.k.a. Hazard function h(t)
$$\lambda(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt}$$

Reliability Theory (2)
• Reliability functions (cont’d)
  – (Unconditional) Failure Density Function f(t)
$$f(t) = -\frac{dR(t)}{dt}, \quad \lambda(t) = \frac{f(t)}{R(t)}$$
  – Average Failure Rate from time 0 to T
$$AFR(0,T) = \frac{\int_0^T \lambda(t)\,dt}{T}$$
  – Discrete dual of λ(t): Hazard Probability Mass Function h(j)
$$h(j) = \mathrm{Prob}(t = j \mid t \ge j) = \mathrm{Prob}(\text{fail at } j \mid \text{survived until } j-1)$$
  – Average Failure Rate from timeslot 0 to T
$$AFR(0,T) = \frac{\sum_{j=1}^{T} h(j)}{T}$$

Reliability Theory (3)
• How to measure Reliability
  – R(t) itself
    · Events with constant failure rate
$$R(T) = \exp\left(-\int_0^T \lambda(t)\,dt\right)$$
  – MTTF
$$MTTF = \int_0^{\infty} t\, f(t)\,dt$$
    · Sampling issue: usually no test can aggregate total test time to ∞
    · (Right) censorship with no replacement, then Maximum Likelihood Estimation, by B. Epstein, 1954
      At the end of the test time $t_r$, measure TTFs ($t_i$) for samples that failed and truncate the lifetime of all survived samples to $t_r$
      Then, the MLE of MTTF is
$$\hat m = \frac{\sum_{i=1}^{r} t_i + (n-r)\,t_r}{r}, \quad \text{where } n: \text{\# total items},\ r: \text{\# failed items}$$
  – FIT: one intuitive form of failure rate
    · Failures in time 1E9 hours
    · Interchangeable with MTTF only when failure rate is constant
    · Additive between independent components

Vulnerability Clock
• Used to track cycles that any bit spends in the vulnerable component: L2$
  – Ticks when a bit resides in L2
  – Stops when a bit stays outside L2
[Figure: vulnerability-clock state diagram spanning L1 caches (resistant to SEUs), L2 cache (vulnerable to SEUs), and memory (resistant to SEUs), from START (1st MEM BLK access) to END (program ends, prepare PARMA calculation). Labeled events: BLK fetched to L2; BLK fetched by L1 miss; BLK fetched to L1; cold store miss; store or consume on L1; L1 BLK replaced; L1 BLK discarded; L1 BLK writes back (bit untouched); L1 BLK writes back (bit updated or consumed); BLK replaced; L2 BLK writes back to MEM; PARMA calculation for SDC/TRUE DUE. Labeled clock actions: VC_L1 updates, stops, resets; VC_L2 ticks, updates, resets, stops; VC_MEM updates, stops; VC_L2 := VC_MEM; VC_MEM := VC_L2; VC_L1 := VC_L2.]

PARMA Model: Measuring Soft Error FIT with PARMA
• PARMA measures failure rate by accumulating failure probability mass
  – Index processor cycle by j ($1 \le j \le T_{exe}$)
  – Total failures observed during $T_{exe}$ (failure rate):
    · Equivalent to expected number of failures of type ERR
$$H_{ERR}(T_{exe}) = \sum_{j=1}^{T_{exe}} h_{ERR}(j) \cdot 1 = E[ERR]$$
  – FIT extrapolation with infinite program execution assumption
$$FIT_{ERR} = \frac{E[ERR]}{T_{exe} \times CyclePeriod} \times 3600 \times 10^{9}$$
• How to calculate $h_{ERR}(j)$?
• Let’s start with p: probability that one bit is flipped during one cycle period
  – Obtained from Poisson SEU model

PARMA Model: Fault Generation Model
• SEU Model
  – Assumptions:
    · All clock cycles are independent with respect to SEUs
    · All bits are independent with respect to SEUs (do not account for spatial MBEs)
  – Widely accepted model for SEU: Poisson model
  – p: probability that one bit is flipped during one cycle period (in SBE cases)
    · Spatial MBE case: probability that multi-bits become faulty during one cycle
  – Poisson probability mass function gives p:
$$p = \sum_{\text{odd } j} \frac{\lambda^{j}}{j!}\, e^{-\lambda}$$
  – λ: Poisson rate of SEUs, ex) $10^{-25}$/bit @ 65nm 3GHz CPU

PARMA Model: Measuring Soft Error FIT with PARMA
• PARMA measures failure rate by accumulating failure probability mass
  – Index processor cycle by j ($1 \le j \le T_{exe}$)
  – A (conditional) failure probability mass at cycle j:
$$h_{ERR}(j) = \Pr(\text{type-ERR failure at } j \mid \text{survived all types of faults until } j)$$
  – Total failures observed during $T_{exe}$ (failure rate):
    · Equivalent to expected number of failures of type ERR
$$H_{ERR}(T_{exe}) = \sum_{j=1}^{T_{exe}} h_{ERR}(j) \cdot 1 = E[ERR]$$
  – FIT extrapolation with infinite program execution assumption
$$FIT_{ERR} = \frac{E[ERR]}{T_{exe} \times CyclePeriod} \times 3600 \times 10^{9}$$
  – Average FIT with multiple programs
$$FIT_{ERR} = \sum_{i \in \text{benchmarks}} f_i \cdot FIT_{i,ERR}$$

Failures Measured in PARMA
• No-protection, 1-bit Parity, 1-bit ECC on Word and 1-bit ECC on Block

| Scheme | Failure | Access domain | Protection domain | Faulty bits | Notation |
|---|---|---|---|---|---|
| No parity | SDC | Block | N/A | ≥1 in C | ${}^{B}E_{NP,SDC}$ |
| 1-bit Parity | TRUE DUE | Block | Block | ∀odd in S, ≥1 in C | ${}^{B}E_{P1B,TRUE\_DUE}$ |
| 1-bit Parity | SDC | Block | Block | ∀even >0 in S, ≥1 in C | ${}^{B}E_{P1B,SDC}$ |
| 1-bit ECC, word-level | TRUE DUE | Block containing M words | Word | 2 in any $S_m$, ≥1 in that $C_m$ | ${}^{B}E_{SW,TRUE\_DUE}$ |
| 1-bit ECC, block-level | TRUE DUE | Block | Block | 2 in S, ≥1 in C | ${}^{B}E_{SB,TRUE\_DUE}$ |
| 1-bit ECC, word-level | SDC | Block containing M words | Word | ≥3 in any $S_m$, ≥1 in that $C_m$ | ${}^{B}E_{SW,SDC}$ |
| 1-bit ECC, block-level | SDC | Block | Block | ≥3 in S, ≥1 in C | ${}^{B}E_{SB,SDC}$ |

Spatial Expansion: From a Bit to a Byte in Nc Vulnerability Cycles
• $q_b(k)$
  – Probability of a Byte having k faulty bits (in $N_c$ vulnerability cycles)
• From 8 bits in the Byte, choose k faulty bits
$$q_b(k) = \binom{8}{k}\, q(N_c)^{k}\, (1-q(N_c))^{8-k}, \quad k = 0,\dots,8$$
[Figure: timeline of $N_c$ vulnerability clock cycles for the bits of one byte.]

Spatial Expansion: from a Byte to the Protection Domain (Word)
• ${}^{S}Q(k)$
  – Probability of set of bits S having k faulty bits inside (during $N_c$ cycles)
• Choose cases where there are k faulty bits in S
  – Enumerate all possibilities of faulty bits in bytes of S such that their total number = k
$${}^{S}Q(k) = \sum_{\sum_j l_j = k}\ \prod_{j \in S} q_{b,j}(l_j)$$
[Figure: protection domain S (word) composed of bytes, each with its own $q_b(k)$.]

Faults in the Access Domain (Block)
• ${}^{D}Q(k)$
  – Probability of k faulty bits in any protection domain inside of D ($S_m$)
• Choose cases where there are k faulty bits in $S_m$
  – Sum for all $S_m$ in D
$${}^{D}Q(k) = \sum_{m=1}^{M} {}^{S}Q_m(k) = \sum_{m=1}^{M}\ \sum_{\sum_j l_j = k}\ \prod_{j \in S_m} q_{b,j}(l_j)$$
• So far, masking effect has not been considered
  – Expected number of intrinsic faults/errors are calculated so far
[Figure: access domain D (block) containing protection domains S (words), each made of bytes.]

PARMA Model: Failures Measured in PARMA (1)
• Unprotected cache
$${}^{B}E_{NP,SDC} = \sum_{k=1}^{8N_B} {}^{B}Q(k) \;-\; {}^{C}Q(0) \sum_{i=1}^{8N_{\bar C}} {}^{\bar C}Q(i)$$
  – Second term: all faults unconsumed
  – Without protection, any nonzero faulty bit(s) will cause SDC failure
  – SDCs: having at least one faulty bit in the consumed bits
• Odd parity per block
$${}^{B}E_{P1B,SDC} = \sum_{\text{even } k>0}^{8N_B} {}^{B}Q(k) \;-\; {}^{C}Q(0) \sum_{\text{even } i>0}^{8N_{\bar C}} {}^{\bar C}Q(i)$$
$${}^{B}E_{P1B,TRUE\_DUE} = \sum_{\text{odd } k}^{8N_B} {}^{B}Q(k) \;-\; {}^{C}Q(0) \sum_{\text{odd } i}^{8N_{\bar C}} {}^{\bar C}Q(i)$$
  – Nonzero, even number k of faulty bits in the block is SDC
  – SDCs: having at least one faulty bit in the consumed bytes, from having nonzero, even number of faulty bits in the block
  – TRUE DUEs: having at least one faulty bit in the consumed bytes, from having odd number of faulty bits in the block

PARMA Model: Failures Measured in PARMA (2)
• SECDED per block (k ≥ 3 is SDC)
$${}^{B}E_{SB,SDC} = \sum_{k=3}^{8N_B} {}^{B}Q(k) \;-\; {}^{C}Q(0) \sum_{i=3}^{8N_{\bar C}} {}^{\bar C}Q(i)$$
$${}^{B}E_{SB,TRUE\_DUE} = {}^{B}Q(2) \;-\; {}^{C}Q(0)\,{}^{\bar C}Q(2)$$
  – First term: ≥3 faults (or exactly 2 faults) in the block; second term: all faults unconsumed
  – SDCs: having at least one faulty bit in the consumed bits, from having more than two faulty bits in the block
  – TRUE DUEs: having at least one faulty bit in the consumed bits, from having exactly two faulty bits in the block
• SECDED per word
$${}^{B}E_{SW,SDC} = \sum_{m=1}^{M} {}^{W}E^{m}_{SW,SDC} = \sum_{m=1}^{M} \left[ \sum_{k=3}^{8N_W} {}^{W}Q_m(k) \;-\; {}^{C_m}Q_m(0) \sum_{i=3}^{8N_{\bar C_m}} {}^{\bar C_m}Q_m(i) \right]$$
$${}^{B}E_{SW,TRUE\_DUE} = \sum_{m=1}^{M} {}^{W}E^{m}_{SW,TRUE\_DUE} = \sum_{m=1}^{M} \left[ {}^{W}Q_m(2) \;-\; {}^{C_m}Q_m(0)\,{}^{\bar C_m}Q_m(2) \right]$$
  – Sum for all the words in the block
  – Additive because the FITs from each word are independent and counted separately
  – Same as the ‘per block’ case except that the protection domain is the word
  – Because the access domain is the block, all the words in the same block are addressed by adding FITs

PARMA Simulations
• Target processor
  – 4-wide OoO processor
  – 64-entry ROB
  – 32-entry LSQ
  – McFarling’s hybrid branch predictor
• Cache configuration

| Cache | Size | Associativity | Latency [cyc] |
|---|---|---|---|
| IL1: 32B BLK | 16KB | 1-way | 2 |
| DL1: 32B BLK | 16KB | 4-way | 3 |
| UL2: 32B BLK | 256KB | 8-way | NP/P1b: 10, SW(4B): 13, SB(64B): 14 |

• sim-outorder was modified and executed with the Alpha ISA
• 18 benchmarks from SPEC2000 were used with SimPoint sampling of 100M-instruction samples

Evaluating Soft Errors: AVF or Fault-Injection, Why Not?
• AVF fails for handling scenarios under error protection schemes
• Why not use fault injection for such scenarios?
  – Possible distortion in the interpretation of results due to the highly accelerated experiments

Simulations with PARMA: Results in FIT (1)
(a) NP_SDC: no-protection/SDC (≈ AVF_SDC); P1B_TRUE_DUE: odd parity/TRUE DUE
(b) P1B_FALSE_DUE: odd parity/FALSE DUE
(c) P1B_SDC: odd parity/SDC
(d) SB_TRUE_DUE: block-level SECDED/TRUE DUE
(e) SB_FALSE_DUE: block-level SECDED/FALSE DUE
(f) SB_SDC: block-level SECDED/SDC
(g) SW_TRUE_DUE: word-level SECDED/TRUE DUE
(h) SW_FALSE_DUE: word-level SECDED/FALSE DUE
(i) SW_SDC: word-level SECDED/SDC

| Bench | (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h) | (i) |
|---|---|---|---|---|---|---|---|---|---|
| ammp | 320.32 | 419.27 | 2.50E-14 | 2.53E-14 | 1.53E-14 | 1.32E-29 | 1.99E-15 | 2.92E-15 | 1.02E-31 |
| art | 48.76 | 16.74 | 1.22E-16 | 1.22E-16 | 3.70E-17 | 3.89E-34 | 1.03E-17 | 9.01E-18 | 3.21E-36 |
| crafty | 429.47 | 716.45 | 1.99E-14 | 2.34E-14 | 4.74E-14 | 4.24E-30 | 1.85E-15 | 6.41E-15 | 2.83E-32 |
| eon | 382.25 | 298.23 | 1.45E-14 | 1.63E-14 | 6.03E-15 | 2.72E-30 | 1.69E-15 | 9.77E-16 | 3.57E-32 |
| facerec | 98.08 | 0.59 | 9.80E-17 | 9.79E-17 | 1.37E-18 | 8.88E-35 | 1.18E-17 | 2.45E-19 | 1.22E-36 |
| galgel | 60.35 | 77.61 | 9.52E-17 | 9.52E-17 | 8.53E-17 | 3.54E-34 | 8.15E-18 | 1.38E-17 | 2.72E-36 |
| gap | 138.59 | 22.27 | 3.32E-16 | 3.94E-16 | 5.26E-17 | 2.34E-33 | 4.30E-17 | 8.80E-18 | 2.59E-35 |
| gcc | 349.11 | 229.96 | 3.76E-15 | 4.86E-15 | 3.25E-15 | 1.89E-31 | 4.65E-16 | 4.68E-16 | 1.86E-33 |
| gzip | 547.53 | 1115.56 | 1.05E-14 | 1.17E-14 | 9.56E-15 | 2.86E-31 | 1.30E-15 | 1.24E-15 | 3.67E-33 |
| mcf | 14.71 | 14.43 | 1.28E-17 | 1.28E-17 | 3.23E-17 | 2.93E-35 | 1.13E-18 | 4.35E-18 | 1.89E-37 |
| mesa | 460.52 | 112.19 | 4.50E-15 | 5.03E-15 | 8.57E-16 | 4.01E-32 | 5.44E-16 | 1.52E-16 | 4.73E-34 |
| parser | 138.54 | 380.34 | 1.86E-15 | 2.00E-15 | 4.12E-15 | 4.59E-32 | 1.47E-16 | 5.82E-16 | 2.97E-34 |
| perlbmk | 100.82 | 315.37 | 2.74E-15 | 2.97E-15 | 8.85E-15 | 6.96E-32 | 2.36E-16 | 1.17E-15 | 4.46E-34 |
| sixtrack | 76.24 | 7.92 | 3.75E-16 | 3.91E-16 | 1.39E-16 | 9.70E-33 | 4.26E-17 | 2.14E-17 | 1.12E-34 |
| twolf | 193.40 | 419.25 | 1.52E-15 | 1.56E-15 | 2.88E-15 | 1.45E-32 | 1.24E-16 | 4.11E-16 | 1.06E-34 |
| vortex | 831.26 | 324.74 | 8.57E-15 | 9.63E-15 | 2.80E-15 | 1.70E-31 | 1.04E-15 | 4.31E-16 | 1.93E-33 |
| vpr | 184.31 | 369.25 | 2.16E-15 | 2.18E-15 | 3.27E-15 | 3.04E-32 | 1.83E-16 | 4.79E-16 | 2.45E-34 |
| wupwise | 146.69 | 130.00 | 1.99E-15 | 2.02E-15 | 7.28E-16 | 1.75E-32 | 1.63E-16 | 1.70E-16 | 1.42E-34 |
| Average | 155.66 | 217.17 | 2.53E-15 | 3.45E-15 | 3.59E-15 | 8.34E-31 | 2.25E-16 | 4.07E-16 | 2.92E-33 |

PARMA Application: a Gold-Standard for Developing New Approximate Model (3)
• Results

| Name | AVF | AVF x FIT from previous method | FIT from new approximate model | FIT from PARMA |
|---|---|---|---|---|
| ammp | 40.977% | 2.9374 | 8.4182E-14 | 4.9114E-15 |
| art | 2.849% | 0.2042 | 4.5179E-16 | 1.93476E-17 |
| crafty | 61.078% | 4.3784 | 3.3463E-14 | 8.25685E-15 |
| eon | 99.049% | 7.1003 | 1.3441E-13 | 2.67121E-15 |
| facerec | 4.319% | 0.3096 | 1.3138E-15 | 1.20444E-17 |
| galgel | 6.010% | 0.4308 | 7.0577E-16 | 2.19513E-17 |
| gap | 7.118% | 0.5103 | 4.5248E-16 | 5.18293E-17 |
| gcc | 27.658% | 1.9827 | 7.7612E-15 | 9.33375E-16 |
| gzip | 83.466% | 5.9832 | 6.9763E-14 | 2.53328E-15 |
| mcf | 1.267% | 0.0908 | 2.9364E-16 | 5.47892E-18 |
| mesa | 30.070% | 2.1555 | 1.6881E-14 | 6.96557E-16 |
| parser | 22.983% | 1.6475 | 1.8796E-14 | 7.28453E-16 |
| perlbmk | 31.621% | 2.2667 | 2.9209E-14 | 1.41053E-15 |
| sixtrack | 3.916% | 0.2807 | 5.9788E-16 | 6.39658E-17 |
| twolf | 26.750% | 1.9176 | 2.2392E-14 | 5.35443E-16 |
| vortex | 53.171% | 3.8115 | 3.6437E-14 | 1.4704E-15 |
| vpr | 24.232% | 1.7371 | 4.0074E-14 | 6.61992E-16 |
| wupwise | 12.183% | 0.8733 | 1.1242E-14 | 3.33145E-16 |
| Average | 27.33% | 2.1454 | 2.8246E-14 | 6.31707E-16 |

• With PARMA, we can verify newly developed approximate models

Simulation with PARMA: Overhead
• Need to track all memory footprint
  – Vulnerability clock cycles for L1, L2 and Memory copies
• Data structure: Binary Search Tree
  – Quick search and insertion
  – Memory footprint never decreases
• Memory overhead: ~17 bytes for tracking 1 byte of memory footprint
• Computation overhead: $O(n^3)$ with non-parallelized code
  – n: number of bits in the block
  – Probability calculation for having k specific faulty bits is $O(n^2)$
  – Need to know the probability distribution on k in [0, n]
• Overall ~25x slowdown in simulation time from base sim-outorder
  – Still much faster than doing a massive number of tests with fault injection

PARMA Application: a Gold-Standard for Developing New Approximate Model (1)
• PARMA provides rigorous reliability measurements
  – Hence, it is useful to verify faster, simpler approximate models
• Example: model for word-level SECDED protected cache
  – Known methods for determining cache scrubbing rates
  – Model from previous work [Mukherjee’04][Saleh’90]
$$MTTF = \frac{1}{L}\sqrt{\frac{\pi}{2M}}, \quad L: \text{SER for a word},\ M: \text{\# of words in memory}$$
  – Ignores cleaning effects at accesses
    · Okay for determining cache scrubbing rates because it overestimates
    · But by how much does it overestimate?
[Figure: timeline for one word. <1> Read word #1: activate ECC code, removing existing 1 faulty bit. <2> Write to word #1: updating word, removing any faulty bit. Marked intervals: MTTF for having 2nd faulty bits in the same word; MTTF extended due to <1> and <2>.]

PARMA Application: a Gold-Standard for Developing New Approximate Model (2)
• New model: model with Bernoulli attempts
  – Assumption: at most two bits are flipped between two accesses to the same word
  – Every access results in a detected error or in no-error (corrected)
  – An interval between two accesses is a binomial event
    [Equation relating ACE, unACE, $T_{exe}$, $ACE_{AVG}$, $unACE_{AVG}$, and $Quantum_{Binomial}$; not recoverable from the extracted text.]
  – Expected number of attempts in binomial process for success = $MTTF_{SECDED}$
  – Then:
$$MTTF_{SECDED} = \frac{1}{P_{DUE}} \times Quantum_{Binomial} \ [\text{sec}]$$
  – $P_{DUE}$ = Poisson PMF $f(2, \lambda_{ACE,word})$

PARMA Application: a Gold-Standard for Developing New Approximate Model (3)
• Word-level SECDED average vulnerability, converted to FIT rate
  – AVF x Intrinsic FIT from previous method: 2.1454 FIT
  – New approximate model: 2.8246E-14 FIT
  – PARMA: 6.3170E-16 FIT
• With PARMA, we can verify newly developed approximate models
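
Worked sketch: fault generation, masking, and FIT extrapolation
The model slides (bit-flip probability p, temporal expansion q(Nc), spatial expansion Q(k), the masking deduction for block-level SECDED, and the FIT extrapolation) can be tried out numerically with a small sketch like the one below. It is an illustration under stated assumptions, not the authors' simulator: the function names are made up for this example, the two odd-term sums are replaced by their standard closed forms, every bit of the block is assumed to share one vulnerability clock (the same simplification the word-level slide makes), and the residency time and consumed-bit count in the demo are hypothetical.

```python
import math


def flip_prob_per_cycle(lam):
    """p: probability that a bit sees an odd number of SEUs in one cycle.

    Closed form of sum over odd j of e^-lam * lam^j / j!, i.e. (1 - e^(-2 lam)) / 2.
    """
    return -math.expm1(-2.0 * lam) / 2.0


def faulty_prob(p, n_cycles):
    """q(Nc): probability of an odd number of flips in n_cycles cycles.

    Closed form of the odd-term binomial sum, (1 - (1 - 2p)^Nc) / 2, evaluated
    with log1p/expm1 so a tiny p does not round to zero. Assumes 0 <= p < 0.5.
    """
    return -math.expm1(n_cycles * math.log1p(-2.0 * p)) / 2.0


def q_dist(n_bits, q, k):
    """Q(k): probability of exactly k faulty bits among n_bits sharing one VC."""
    return math.comb(n_bits, k) * q**k * (1.0 - q) ** (n_bits - k)


def secded_block(n_bits, n_consumed, q):
    """Expected (SDC, TRUE DUE) for one access under block-level SECDED.

    Each term is "k faults in the block" minus "consumed bits clean and all
    faults in the unconsumed bits", following the masking slide.
    """
    n_unconsumed = n_bits - n_consumed
    consumed_clean = q_dist(n_consumed, q, 0)
    sdc = sum(q_dist(n_bits, q, k) for k in range(3, n_bits + 1))
    sdc -= consumed_clean * sum(
        q_dist(n_unconsumed, q, i) for i in range(3, n_unconsumed + 1)
    )
    true_due = q_dist(n_bits, q, 2) - consumed_clean * q_dist(n_unconsumed, q, 2)
    return sdc, true_due


def fit_rate(expected_failures, t_exe_cycles, cycle_period_s):
    """FIT = E[ERR] / (Texe * CyclePeriod) * 3600 * 1e9 (failures per 1e9 hours)."""
    return expected_failures / (t_exe_cycles * cycle_period_s) * 3600.0 * 1e9


if __name__ == "__main__":
    lam = 1e-25            # SEU rate per bit per cycle, the example quoted on the slides
    n_c = 3_000_000_000    # hypothetical residency: 3e9 cycles (1 s at 3 GHz)
    p = flip_prob_per_cycle(lam)
    q = faulty_prob(p, n_c)
    # Hypothetical access: 256-bit block of which 64 bits are consumed.
    sdc, due = secded_block(n_bits=256, n_consumed=64, q=q)
    print("p =", p, "q(Nc) =", q)
    print("E[SDC] =", sdc, "E[TRUE DUE] =", due)
    print("FIT(TRUE DUE) =", fit_rate(due, n_c, 1.0 / 3e9))
```

The log1p/expm1 formulation is a numerical choice rather than part of the model: with p near 1E-25, evaluating 1 - 2p directly in double precision rounds to 1 and would report q(Nc) = 0.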