Evaluating Impact of Soft-Errors in an Embedded System
Vijay Sheshadri
Graduate Student
Dept. of Electrical Engineering

What is a Soft-error?

Transient fault caused by cosmic ray particles.
A charged particle is incident on a component.
Sufficient charge collection causes an erroneous bit-flip (e.g., 0 to 1).
The charged particle creates electron-hole pairs (EHPs), which get collected by the drain.
July 16, 2016
Soft-error in a System
Flowchart: outcome of a faulty bit
Bit read?
  No: benign fault, no error
  Yes: does the bit have error protection?
    Yes, error can be corrected (e.g., ECC): no error
    Yes, error is only detected (e.g., parity + no recovery): detected, but unrecoverable error (DUE)
    No: does the bit matter?
      Yes: Silent Data Corruption (SDC)
      No: benign fault, no error
Source: Shubhu Mukherjee et al., Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005

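The decision flow on this slide can be sketched as a small classifier. This is an illustrative sketch only: the names `Outcome` and `classify_fault`, and the three protection labels, are assumptions made for the example and do not come from the cited paper.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign fault, no error"
    CORRECTED = "error corrected, no error"
    DUE = "detected, but unrecoverable error (DUE)"
    SDC = "silent data corruption (SDC)"


def classify_fault(bit_read: bool, protection: str, bit_matters: bool) -> Outcome:
    """Classify one faulty bit.

    protection is a hypothetical label: "none", "detect" (e.g., parity
    without recovery) or "correct" (e.g., ECC).
    """
    if protection not in ("none", "detect", "correct"):
        raise ValueError(f"unknown protection: {protection!r}")
    if not bit_read:
        return Outcome.BENIGN       # the faulty value is never consumed
    if protection == "correct":
        return Outcome.CORRECTED    # the error is repaired on read
    if protection == "detect":
        return Outcome.DUE          # the error is seen but cannot be repaired
    # Unprotected bit: the outcome depends on whether the bit matters.
    return Outcome.SDC if bit_matters else Outcome.BENIGN


if __name__ == "__main__":
    print(classify_fault(True, "none", True).value)
    print(classify_fault(True, "detect", True).value)
```

The order of the checks follows the flowchart: an unread bit is benign regardless of protection, and only unprotected bits can end in SDC.
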
Masking of Soft-error
Figure: a particle strike in the combinational logic between two register stages (inputs I1 to I7, outputs O1 and O2), illustrating three masking mechanisms: latching-window masking, logical masking, and electrical masking. One path is labelled "Soft error", another "No soft error".

FIT Equation: Vulnerability Factors

FIT = (for each vulnerable device i) (intrinsic error ratei * vulnerability factori)

Vulnerability Factor =
Timing Vulnerability Factor * Architectural Vulnerability Factor

Timing Vulnerability Factor (TVF)


fraction of time bit is vulnerable
Architectural Vulnerability Factor (AVF)

fraction of time bit matters for final output of a program
July 16, 2016
Source: Shubhu Mukherjee et al. Radiation-Induced
Soft Errors: An Architectural Perspective, HPCA 20055
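As a minimal sketch of how the FIT equation is evaluated, the snippet below sums the per-device contributions. The device list and its numbers are made up purely for illustration, and the function names are hypothetical.

```python
import math


def vulnerability_factor(tvf: float, avf: float) -> float:
    """Vulnerability factor = TVF * AVF, both expressed as fractions in [0, 1]."""
    return tvf * avf


def total_fit(devices) -> float:
    """Sum intrinsic_rate_i * vulnerability_factor_i over all devices.

    devices is an iterable of (intrinsic_fit, tvf, avf) tuples.
    """
    return sum(rate * vulnerability_factor(tvf, avf) for rate, tvf, avf in devices)


if __name__ == "__main__":
    # Hypothetical devices: (intrinsic FIT, TVF, AVF). Not measured data.
    devices = [
        (100.0, 1.0, 0.5),
        (200.0, 0.5, 0.1),
    ]
    fit = total_fit(devices)
    assert math.isclose(fit, 60.0)
    print(f"total FIT = {fit:.1f}")
```

A device with AVF = 0 contributes nothing to the total, however high its intrinsic error rate.
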
Architectural Vulnerability Factor

Fraction of time a bit matters for the final output of a program

Branch predictor: doesn’t matter at all (AVF = 0%)
Program counter: almost always matters (AVF ~ 100%)

Computing AVF for complex structures:
  Statistical Fault Injection
  ACE (Architecturally Correct Execution) Analysis
Source: Shubhu Mukherjee et al., Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005

Soft-error & Automobiles

Mar. 2010: NHTSA enlisted the NASA Engineering and Safety Center (NESC) to investigate “Unintended Acceleration”

Apr. 2011: NESC discounts SEU in its report to NHTSA, stating that the ICs are manufactured using SOI (silicon-on-insulator) technology

As per the AEC-Q100 standard, SEU testing is required for automobile electronics with RAM > 1 Mb

An Example

Predicted Block RAM upset rate for a Virtex-5 FPGA = 635 FIT/Mb ≈ 1.52E-05 upsets per day per Mb.

Ref: A. Lesea, “Continuing Experiments of Atmospheric Neutron Effects on Deep Submicron Integrated Circuits,” WP286 (v1.0), Xilinx, Inc., 2008

Assume this FPGA is used in a throttle control module.

If a vendor produces 500,000 such vehicles, then, counting 1 Mb of Block RAM per vehicle, total upsets per day = 1.52E-05 x 500,000 ≈ 7.6 vehicle upsets per day.

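The arithmetic behind this example can be checked with a short script. One FIT is one failure per 10^9 device-hours; the 1 Mb of Block RAM per vehicle is an assumption made here so that the per-Mb rate can be scaled to a fleet, and the function names are hypothetical.

```python
import math

FIT_HOURS = 1e9  # one FIT = one failure per 10^9 device-hours


def fit_to_upsets_per_day(fit: float) -> float:
    """Convert a FIT rate into expected upsets per day."""
    return fit * 24.0 / FIT_HOURS


def fleet_upsets_per_day(fit_per_mb: float, mb_per_vehicle: float, vehicles: int) -> float:
    """Expected upsets per day across a fleet of identical vehicles."""
    return fit_to_upsets_per_day(fit_per_mb) * mb_per_vehicle * vehicles


if __name__ == "__main__":
    per_mb = fit_to_upsets_per_day(635.0)
    # Assumption: each vehicle exposes 1 Mb of Block RAM.
    fleet = fleet_upsets_per_day(635.0, 1.0, 500_000)
    assert math.isclose(per_mb, 1.524e-05)
    assert math.isclose(fleet, 7.62)
    print(f"{per_mb:.3e} upsets/day/Mb, {fleet:.2f} upsets/day across the fleet")
```

The unrounded rate is 1.524E-05 upsets per day per Mb, which is why the fleet total comes to about 7.6 rather than 7.5.
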
Soft-error Mitigation

Robust circuit designs (radiation-hardened) are resilient to soft errors

Soft-error mitigation at:
  Device level: silicon-on-insulator, triple-well
  Circuit level: DICE cell, triple-modular redundancy
  Architecture level: RMT, lock-stepping, ECC

Soft-error Mitigation

Soft-error mitigation techniques incur penalties in:
  area (spatial redundancy)
  timing (temporal redundancy)

Selective hardening of the components for reduced penalty
  Often based on logical/electrical/timing derating

A low-cost mitigation technique is proposed for critical applications, based on application derating
  Certain applications can mask or recover from transient faults*

*Ref: V. Wong et al., “Soft Error Resilience of Probabilistic Inference Applications,” SELSE II, 2006

Critical Application - An Analogy
Figure: a dashboard micro-controller handling climate monitor/display, airbag deployment, GPS, and cruise control.
• A micro-controller embedded in a car dashboard may be handling many applications.
• A critical application in this case could be ‘Airbag deployment’.
• A soft error (SE) during this application could be catastrophic.

Target Module

PWM: the output is a pulse whose width decides the speed of the motor.

Etpwmi0 module:
  ~800 FFs and ~3000 logic gates
  180-nm CMOS technology, 80 MHz frequency
Figure: block diagram with CPU core, ADC, PWM, and motor.

Basic Simulation Steps*

Pre-analysis: identify the components utilized by the critical application

Fault injection: inject a single fault at a random time instance by depositing the opposite value on the component

Error metric:
  Error count: number of mismatches between the output and the reference
  PW count: number of clock cycles the output is ‘1’, as compared to the reference
Ref: J. Blome et al., “Cost-Efficient Soft Error Protection for Embedded Microprocessors,” CASES, 2006

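The two error metrics can be sketched on per-clock-cycle samples of the output. This is a sketch under stated assumptions: the traces are lists of 0/1 values sampled once per clock cycle, and the function names are hypothetical rather than taken from the actual tool flow.

```python
def error_count(output, reference) -> int:
    """Number of clock cycles in which the output differs from the reference."""
    if len(output) != len(reference):
        raise ValueError("traces must cover the same number of clock cycles")
    return sum(1 for out, ref in zip(output, reference) if out != ref)


def pw_count(trace) -> int:
    """Number of clock cycles in which the trace is '1' (the pulse width)."""
    return sum(1 for sample in trace if sample == 1)


def pw_difference(output, reference) -> int:
    """Pulse width of the output relative to the reference, in clock cycles."""
    return pw_count(output) - pw_count(reference)


if __name__ == "__main__":
    reference = [0, 1, 1, 1, 0, 0, 0, 0]
    delayed = [0, 0, 1, 1, 1, 0, 0, 0]  # same pulse, one cycle late
    print("error count:", error_count(delayed, reference))
    print("PW difference:", pw_difference(delayed, reference))
```

For the delayed pulse the error count is 2 while the pulse-width difference is 0, which shows why the two metrics are tracked separately.
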
Simulation tools

Verilog netlist simulated with timing information, using Synopsys VCS

Fault-injection module coded in C.

Uses VPI (Verilog Procedural Interface) functions to:
  Access a net in the netlist (vpiHandle)
  Read the value of the net (vpi_get_value)
  Overwrite the value of the net (vpi_put_value)

Simulation – Pre-analysis

Pre-analysis

Categorize FFs based on their activity:
  a) Low-activity FFs (no. of toggles less than 2)
  b) High-activity FFs (no. of toggles higher than 2)

Opposite values forced and output pulse observed for errors

FFs in which errors were observed are identified and subjected to fault injection

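The activity-based categorization can be sketched as follows. The slide does not say how an FF with exactly 2 toggles is treated, so this sketch assumes it counts as high-activity; the FF names and traces are hypothetical.

```python
def count_toggles(samples) -> int:
    """Number of value changes between consecutive samples of one flip-flop."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def categorize_ffs(traces, threshold: int = 2):
    """Split flip-flops into (low_activity, high_activity) name lists.

    Assumption: an FF with exactly `threshold` toggles is treated as
    high-activity, since the slide leaves that boundary case open.
    """
    low, high = [], []
    for name, samples in traces.items():
        (low if count_toggles(samples) < threshold else high).append(name)
    return low, high


if __name__ == "__main__":
    # Hypothetical per-cycle values recorded during a fault-free run.
    traces = {
        "ff_a": [0, 0, 0, 0],
        "ff_b": [0, 1, 0, 1, 0],
        "ff_c": [0, 1, 1, 1],
    }
    low, high = categorize_ffs(traces)
    print("low activity:", low)
    print("high activity:", high)
```
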
Simulation – Fault-injection

Fault injection

For the FFs obtained from pre-analysis, inject a fault at a random instance of time (within the time interval of the first output pulse)

Measure error count and PW count. Identify FFs with error within acceptable limits
Figure: the test bench (Verilog) drives the design while the fault-injection module (C + VPI) replaces the original value with a modified value inside the fault-injection window, which is aligned with the output pulse.

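The injection loop can be illustrated on a toy model. The sketch below is not the Etpwmi0 netlist or the VCS/VPI flow: it assumes a hypothetical 3-bit counter-based PWM, flips one counter bit at a random cycle inside the first output pulse, and compares the faulty trace with a fault-free reference.

```python
import random

COUNTER_BITS = 3
PERIOD = 1 << COUNTER_BITS  # 8 clock cycles per PWM period
DUTY = 3                    # output is '1' while counter < DUTY
CYCLES = 2 * PERIOD         # simulate two PWM periods


def simulate_pwm(flip_cycle=None, flip_bit=0):
    """Return the per-cycle output; optionally flip one counter bit once."""
    counter = 0
    trace = []
    for cycle in range(CYCLES):
        if cycle == flip_cycle:
            counter ^= 1 << flip_bit  # deposit the opposite value on one bit
        trace.append(1 if counter < DUTY else 0)
        counter = (counter + 1) % PERIOD
    return trace


def run_campaign(n_faults: int, seed: int = 0) -> int:
    """Inject n_faults single bit-flips; return how many changed the output."""
    rng = random.Random(seed)
    reference = simulate_pwm()
    errors = 0
    for _ in range(n_faults):
        flip_cycle = rng.randrange(DUTY)        # inside the first output pulse
        flip_bit = rng.randrange(COUNTER_BITS)  # which counter bit to upset
        if simulate_pwm(flip_cycle, flip_bit) != reference:
            errors += 1
    return errors


if __name__ == "__main__":
    print("reference:", simulate_pwm())
    print("faulty   :", simulate_pwm(flip_cycle=1, flip_bit=2))
    print("errors in 20 injections:", run_campaign(20))
```

In this toy design every counter upset shifts the phase of the pulse, so each injection produces a mismatch; a real netlist also contains FFs whose upsets are masked.
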
Absolute error vs. Acceptable error



Absolute error: raise the error flag for any mismatch between the output pulse and the reference
Acceptable error: raise the error flag only if the mismatch between the output pulse and the reference lies outside the tolerance limit*
Examples: delayed pulse, self-correcting pulse
Figure: two waveform panels (delayed pulse; self-correcting pulse), each showing the target FF, the actual output, and the reference copy, with the fault-injection point marked; the delayed-pulse panel also marks the delay.
*Ref: X. Li et al., “Exploiting Soft Computing for Increased Fault Tolerance,” Workshop on Architectural Support for Gigascale Integration, 2006

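The difference between the two flags can be sketched as below. The slide does not define the tolerance limit, so this sketch assumes it is a maximum number of mismatching clock cycles; `tolerance_cycles` and the function names are hypothetical.

```python
def mismatch_cycles(output, reference) -> int:
    """Number of clock cycles in which the output differs from the reference."""
    if len(output) != len(reference):
        raise ValueError("traces must cover the same number of clock cycles")
    return sum(1 for out, ref in zip(output, reference) if out != ref)


def absolute_error_flag(output, reference) -> bool:
    """Flag any mismatch at all between the output and the reference."""
    return mismatch_cycles(output, reference) > 0


def acceptable_error_flag(output, reference, tolerance_cycles: int) -> bool:
    """Flag only mismatches larger than the assumed tolerance limit."""
    return mismatch_cycles(output, reference) > tolerance_cycles


if __name__ == "__main__":
    reference = [0, 1, 1, 1, 0, 0, 0, 0]
    delayed = [0, 0, 1, 1, 1, 0, 0, 0]  # pulse arrives one cycle late
    stuck = [0, 1, 1, 1, 1, 1, 1, 1]    # pulse never returns to '0'
    for name, trace in (("delayed", delayed), ("stuck", stuck)):
        print(name,
              absolute_error_flag(trace, reference),
              acceptable_error_flag(trace, reference, tolerance_cycles=2))
```

With a tolerance of 2 cycles, the delayed pulse raises the absolute flag but not the acceptable-error flag, while the stuck output raises both.
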
Simulations-Combinational logic

Fault injection steps:

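SE modeled as a 1 ns pulse (system clock frequency = 80 MHz, i.e., a 12.5 ns clock period)

Transient pulse injected onto the gate output

Target combinational circuit selected at random

Example: 2-input NAND gate
Figure: a 2-input NAND gate (inputs A, B; output Y) with the fault injected on Y, shown next to a reference copy of the same gate; panel labels: actual output, reference copy, injected fault.
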
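A rough sketch of why most 1 ns transients in combinational logic go unnoticed: the pulse is captured only if it overlaps the latching window of the downstream flip-flop. The setup and hold times below are hypothetical placeholders, not values from the 180-nm library, and the model ignores electrical and logical masking.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / freq_mhz


def pulse_overlaps_window(pulse_start_ns: float, pulse_width_ns: float,
                          edge_ns: float, setup_ns: float, hold_ns: float) -> bool:
    """True if the transient overlaps the latching window around a clock edge."""
    window_start = edge_ns - setup_ns
    window_end = edge_ns + hold_ns
    pulse_end = pulse_start_ns + pulse_width_ns
    return pulse_start_ns < window_end and pulse_end > window_start


if __name__ == "__main__":
    period = clock_period_ns(80.0)  # 12.5 ns at 80 MHz
    setup_ns, hold_ns = 0.2, 0.1    # assumed, for illustration only
    for start in (5.0, 12.0):
        hit = pulse_overlaps_window(start, 1.0, period, setup_ns, hold_ns)
        print(f"pulse at {start} ns captured: {hit}")
```
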
Results

Pre-analysis: ~18% of the FFs are used by the application

Fault injection: the number of faults injected is proportional to the number of flip-flops in the group

Low-toggle FFs are more numerous, hence the number of faults injected in low-toggle FFs is higher

Results
Figure: two log-scale bar charts. “Fault-Injection simulation” plots # faults injected, # errors, and # acceptable errors for low-activity FFs, high-activity FFs, and combo logic. “Fault-Injection in FFs” plots total FFs, # critical FFs, and FFs with acceptable error for low-toggle and high-toggle FFs.

Low-toggle FFs are more vulnerable to soft errors, since an erroneous bit-flip may remain unchanged

A high-toggle FF is written very often, so an erroneous bit-flip has a higher probability of getting overwritten

Computing AVF

AVF = Pe * % component

Pe = probability that a fault injected in the component results in an error = (no. of errors) / (no. of faults injected)
% component = the percentage of that component with respect to the total number of components
Example: for a latch,
  a. if # errors = 50% of injected faults (Pe = 0.5)
  b. if latches make up 20% of the circuit
  then AVF = 0.5 x 0.2 = 0.1

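The AVF calculation on this slide can be written out directly. The function names are hypothetical; the numbers reproduce the latch example above, taking 100 injected faults as an arbitrary count that gives Pe = 0.5.

```python
import math


def error_probability(n_errors: int, n_faults_injected: int) -> float:
    """Pe = (no. of errors) / (no. of faults injected)."""
    if n_faults_injected <= 0:
        raise ValueError("at least one fault must be injected")
    return n_errors / n_faults_injected


def avf(n_errors: int, n_faults_injected: int, component_fraction: float) -> float:
    """AVF = Pe * fraction of the circuit made up of this component type."""
    return error_probability(n_errors, n_faults_injected) * component_fraction


if __name__ == "__main__":
    # Latch example: 50 errors out of 100 injected faults, latches are 20% of the circuit.
    latch_avf = avf(50, 100, 0.20)
    assert math.isclose(latch_avf, 0.1)
    print(f"latch AVF = {latch_avf:.2f}")
```
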
AVF - Results

Low-activity FFs have a higher Pe and are more numerous; hence they have a higher AVF

Combinational logic, though high in number, has Pe ~4E-03, causing AVF to drop
Figure: log-scale bar chart of AVF = P(error) * % of component for low-activity FFs, high-activity FFs, and combo logic; the data labels read 0.1590, 0.0079, and 0.0030.

Summary

Fault-resilience scheme for critical applications using application derating and inherent error tolerance

For the application considered:
  ~12% of the sequential logic was safety critical (previous work reports 30% of sequential logic hardened for 99% fault coverage in an ARM embedded processor running an image-processing algorithm)
  failures in combinational logic were negligible

Worst-case scenario would only be the same as radiation-hardening a generic system, i.e., all the hardware is identified as safety-critical

Future Work

Perform fault-injection analysis on the processor core
managing the control loop

Conduct neutron beam experiments on the circuit to
compare with simulations and find FIT rate

Implement circuit hardening and test the system to
ascertain its robustness