Evaluating Impact of Soft-Errors in an Embedded System
- Vijay Sheshadri, Graduate Student, Dept. of Electrical Engineering
July 16, 2016

What is a Soft-error?
• Transient fault caused by cosmic ray particles.
• A charged particle is incident on a component.
• Sufficient charge collection causes an erroneous bit flip (0 → 1).
• The charged particle creates electron-hole pairs (EHPs), which get collected by the drain.

Soft-error in a System
• Bit read?
  • no: benign fault, no error
  • yes: Bit has error protection?
    • yes:
      • Error is only detected (e.g., parity + no recovery): detected, but unrecoverable error (DUE)
      • Error can be corrected (e.g., ECC): no error
    • no: Does bit matter?
      • yes: Silent Data Corruption (SDC)
      • no: benign fault, no error
Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005

Masking of Soft-error
[Figure: a particle strike in the combinational logic between two register stages (inputs I1 to I7, outputs O1 and O2), showing when the strike does or does not become a soft error. Labels: logical masking, electrical masking, latching-window masking.]

FIT Equation: Vulnerability Factors
$\mathrm{FIT} = \sum_{i} \left(\text{intrinsic error rate}_i \times \text{vulnerability factor}_i\right)$, summed over each vulnerable device $i$
• Vulnerability Factor = Timing Vulnerability Factor * Architectural Vulnerability Factor
• Timing Vulnerability Factor (TVF): fraction of time a bit is vulnerable
• Architectural Vulnerability Factor (AVF): fraction of time a bit matters for the final output of a program
Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005

Architectural Vulnerability Factor
• Fraction of time a bit matters for the final output of a program
  • Branch predictor: doesn't matter at all (AVF = 0%)
  • Program counter: almost always matters (AVF ~ 100%)
• Computing AVF for complex structures
  • Statistical fault injection
  • ACE (Architecturally Correct Execution) analysis
Source: Shubhu Mukherjee et al.
Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005

Soft-error & Automobiles
• Mar. 2010: NHTSA enlisted the NASA Engineering and Safety Center (NESC) to investigate "Unintended Acceleration".
• Apr. 2011: NESC discounts SEU in its report to NHTSA, stating that the ICs are manufactured using SOI (silicon-on-insulator) technology.
• As per the AEC-Q100 standard, SEU testing is required for automobile electronics with RAM > 1 Mb.

An Example
• Predicted block RAM upset rate for a Virtex-5 FPGA = 635 FIT/Mb ≈ 1.5E-05 upsets per day per Mb (1 FIT = 1 failure per 10^9 device-hours, so 635E-09 x 24 ≈ 1.52E-05).
  Ref: A. Lesea, "Continuing Experiments of Atmospheric Neutron Effects on Deep Submicron Integrated Circuits," WP286 (v1.0), Xilinx, Inc., 2008
• Assume this FPGA is used in a throttle control module.
• If 500,000 such vehicles are produced by a vendor, then total upsets per day per Mb of block RAM ≈ 1.52E-05 x 500,000 ≈ 7.6 vehicle upsets per day.

Soft-error Mitigation
• Robust circuit designs (radiation-hardened) are resilient to soft errors.
• Soft-error mitigation at:
  • Device level: silicon-on-insulator, triple-well
  • Circuit level: DICE cell, triple-modular redundancy
  • Architecture level: RMT, lock-stepping, ECC

Soft-error Mitigation (continued)
• Soft-error mitigation techniques incur penalties in:
  • area (spatial redundancy)
  • timing (temporal redundancy)
• Selective hardening of the components for reduced penalty
  • Often based on logical/electrical/timing derating
• A low-cost mitigation technique is proposed for critical applications, based on application derating
  • Certain applications can mask or recover from transient faults*
* Ref: V. Wong et al., "Soft Error Resilience of Probabilistic Inference Applications," SELSE II, 2006

Critical Application - An Analogy
[Figure: car dashboard applications: climate monitor/display, airbag deployment, GPS, cruise control]
• A micro-controller embedded in a car dashboard may be handling many applications.
• A critical application in this case could be 'Airbag deployment'.
• A soft error (SE) during this application could be catastrophic.

Target Module
• PWM: output is a pulse, the width of which decides the speed of the motor.
• Etpwmi0 module: ~800 FFs & ~3000 logic gates
• 180-nm CMOS technology, 80 MHz frequency
[Figure: block diagram with Motor, ADC, CPU core, and PWM]

Basic Simulation Steps*
• Pre-analysis: identify components utilized by the critical application
• Fault injection: inject a single fault at a random time instance by depositing the opposite value on the component
• Error metric:
  • Error count => no. of mismatches b/w output and reference
  • PW count => no. of clock cycles the output is '1' as compared to reference
* Ref: J. Blome et al., "Cost-Efficient Soft Error Protection for Embedded Microprocessors," CASES, 2006

Simulation tools
• Verilog netlist simulated with timing information, using Synopsys VCS
• Fault-injection module coded in C. Uses VPI (Verilog Procedural Interface) functions to:
  • Access a net in the netlist (vpiHandle)
  • Read value of the net (vpi_get_value)
  • Overwrite value of the net (vpi_put_value)

Simulation - Pre-analysis
• Categorize FFs based on their activity:
  a) Low-activity FFs (no. of toggles less than 2)
  b) High-activity FFs (no. of toggles higher than 2)
• Opposite values are forced and the output pulse is observed for errors
• FFs in which errors were observed are identified and subjected to fault injection

Simulation - Fault-injection
• For the FFs obtained from pre-analysis, inject a fault at a random instance of time (within the time interval of the first output pulse)
• Measure Error count & PW count
• Identify FFs with error in acceptable limits
[Figure: Verilog test bench connected to the fault-injection module (C + VPI); waveform showing original value, modified value, fault-injection window, and output pulse]

Absolute error vs.
Acceptable error
• Absolute error: raise error flag for any mismatch b/w the output pulse and reference
• Acceptable error: raise error flag only if the mismatch b/w the output pulse and reference lies outside the tolerance limit*
• Examples: delayed pulse, self-correcting pulse
[Figure: two waveform panels, one per example, each showing the target FF, the actual output, the reference copy, and the point where the fault is injected; the delayed-pulse panel marks the delay]
* Ref: X. Li et al., "Exploiting Soft Computing for Increased Fault Tolerance," Workshop on Architectural Support for Gigascale Integration, 2006

Simulations - Combinational logic
Fault injection steps:
• SE modeled as a 1 ns pulse (system clock frequency = 80 MHz)
• Transient pulse injected onto the gate output
• Target combinational circuit selected at random
• Example: 2-input NAND gate
[Figure: NAND gate with inputs A and B and output Y; waveforms of A, B, and Y for the actual output and the reference copy, with the injected fault marked]

Results
• Pre-analysis: ~18% of FFs are used by the application
• Fault-injection: the number of faults injected is proportional to the number of flip-flops in the group
  • Low-toggle FFs are more in number, hence the no. of faults injected in low-toggle FFs is higher

Results (continued)
[Figure: two bar charts on logarithmic axes, "Fault-Injection simulation" (categories: low activity FFs, high activity FFs, combo logic) and "Fault-Injection in FFs" (categories: low-toggle FFs, high-toggle FFs). Legend entries: # faults injected, # errors, # acceptable errors, total FFs, # critical FFs, FFs with acceptable error.]
• Low-toggle FFs are more vulnerable to soft errors, since an erroneous bit flip may remain unchanged.
• A high-toggle FF is written very often, so an erroneous bit flip has a higher probability of getting overwritten.

Computing AVF
AVF = Pe * % component
• Pe = probability that a fault injected in the component results in an error
  Pe = (no. of errors) / (no. of faults injected)
• % component = the percentage of that component with respect to the total number of components
Example: For a latch,
a. if # errors = 50% of injected faults (Pe = 0.5)
b.
if latches make up 20% of the circuit
then AVF = 0.5 x 0.2 = 0.1

AVF - Results
• Low-activity FFs have a higher Pe and are more in number; hence they have a higher AVF.
• Combinational logic, though high in number, has Pe ~4E-03, causing its AVF to drop.
[Figure: bar chart of AVF = P(error) * % of component on a logarithmic axis; plotted values 0.1590, 0.0079, and 0.0030 across the categories low activity FFs, high activity FFs, and combo logic]

Summary
• Fault-resilience scheme for critical applications using application derating and inherent error tolerance
• For the application considered,
  • ~12% of the sequential logic was safety critical (prev. work reports 30% of seq. logic hardened for 99% fault coverage in an ARM embedded processor running an image processing algorithm)
  • failures in combinational logic were negligible
• Worst-case scenario would only be the same as radiation-hardening a generic system, i.e., all the hardware is identified as safety-critical

Future Work
• Perform fault-injection analysis on the processor core managing the control loop
• Conduct neutron beam experiments on the circuit to compare with simulations and find the FIT rate
• Implement circuit hardening and test the system to ascertain its robustness
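
Worked example (sketch)
The arithmetic on the "Computing AVF" and "An Example" slides can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, the count of 100 injected faults is an illustrative stand-in for "50% of injected faults", and the fleet figure assumes the quoted per-Mb upset rate applies once per vehicle.

```python
HOURS_PER_DAY = 24
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def error_probability(num_errors: int, num_faults_injected: int) -> float:
    """Pe = (no. of errors) / (no. of faults injected)."""
    if num_faults_injected <= 0:
        raise ValueError("at least one fault must be injected")
    return num_errors / num_faults_injected


def avf(num_errors: int, num_faults_injected: int, component_fraction: float) -> float:
    """AVF = Pe * (fraction of the circuit made up by this component type)."""
    if not 0.0 <= component_fraction <= 1.0:
        raise ValueError("component_fraction must be between 0 and 1")
    return error_probability(num_errors, num_faults_injected) * component_fraction


def fit_to_upsets_per_day(fit_rate: float) -> float:
    """Convert a FIT rate into expected upsets per day."""
    return fit_rate / HOURS_PER_FIT_UNIT * HOURS_PER_DAY


# Latch example from the slides: Pe = 0.5, latches are 20% of the circuit.
latch_avf = avf(num_errors=50, num_faults_injected=100, component_fraction=0.20)

# Virtex-5 example from the slides: 635 FIT/Mb, 500,000 vehicles.
upsets_per_day_per_mb = fit_to_upsets_per_day(635)
fleet_upsets_per_day = upsets_per_day_per_mb * 500_000

print(f"Latch AVF: {latch_avf:.2f}")
print(f"Upsets per day per Mb: {upsets_per_day_per_mb:.3g}")
print(f"Fleet upsets per day: {fleet_upsets_per_day:.2f}")
```

Keeping Pe and the component fraction as separate inputs mirrors the slide's definition, so the same function can be reused for each component group (low-activity FFs, high-activity FFs, combinational logic) once its error and fault counts are known.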