Fault Tolerance in Embedded Systems

Daniel Shapiro
dshap092@uottawa.ca
http://site.uottawa.ca/~dshap092
Fault Tolerance
• This presentation is based on [1]
• The focus is on the basics as applied to embedded systems with processors
• This presentation does not rely on Wikipedia, but see the Byzantine fault tolerance article there for background
Overview
1. Trends & Problems
2. Fault Tolerance Definitions
3. Fault Hiding
4. Fault Avoidance
5. Error Models
6. # Simultaneous Errors
7. Fault Tolerance Metrics
8. Error Detection
9. Error Recovery
10. Fault Diagnosis
11. Self-Repair
Trends Problems
Cosmic rays and alpha particles
• Fault Tolerance
• Goal = safety + liveness
• Safe: Hide faults from
hurting the user, even in
failure
• Live: performs the
desired task
• Better to fail than to do
harm
Trends Problems
• More devices/processor
means more units can
fail
– Think CISC v.s. RISC
• More complex designs
mean more failure
cases exist
– Think AVX v.s. MMX
• Cache faults and more
generally memory faults
– Recharging DRAM is
“easier” than reloading a
destroyed cache line
Fault Tolerance Definitions
• Fault
  – Physical faults
  – Software faults
• A fault may manifest as an error
• A masked fault does not show up as an error
• Errors may also be masked
• Otherwise the error results in a failure
• Logical mask – 0 AND error bit (see the sketch after this list)
• Architectural mask – e.g., an error in the destination register of a NOP
• Application mask – a silent fault, like writing garbage to an unused address … produces no failure
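A minimal C sketch of the logical-masking case above (the “0 AND error bit” bullet), assuming a flipped signal feeds an AND gate whose other input happens to be 0; the variable names and the injected flip are illustrative, not from the source.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t correct_bit = 1;               /* intended value of the signal    */
    uint8_t faulty_bit  = correct_bit ^ 1; /* a transient fault flips the bit */

    uint8_t other_input = 0;               /* the AND gate's other input is 0 */

    /* Logical masking: 0 AND x == 0 regardless of the fault. */
    uint8_t correct_out = correct_bit & other_input;
    uint8_t faulty_out  = faulty_bit  & other_input;

    printf("correct output = %u, faulty output = %u -> %s\n",
           correct_out, faulty_out,
           (correct_out == faulty_out) ? "fault is logically masked"
                                       : "fault becomes an error");
    return 0;
}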
Fault Hiding
• Some faults are automatically recovered already: branch prediction can recover from faulty branches
• The dangerous cases are the faults that are NOT masked
• Goal: mask all faults
  – E.g., HDD faults are common but hidden
• Transient fault – a signal glitch
• Permanent fault – a wire burns out
• Intermittent fault – a cold-soldered wire
• Fault tolerance scheme – design the system to mask the expected fault type (transient/permanent/intermittent)
Fault Avoidance
• Fault avoidance is just as good as fault tolerance
• Error detection and correction is the alternative
• Permanent faults
  – Physical wear-out
  – Fabrication defects
  – Design bugs
Error Models
• We only care about errors, since masked faults are innocuous
• Error models
  – Used for improving fault tolerance
  – E.g., the stuck-at-0/1 model tells us that there is a potential error (see the sketch after this list)
  – Many stuck-at-0 errors can mean that there is NO PROBLEM
  – Reduce the need to evaluate all sources of error; the design space shrinks dramatically
• Three main error model parameters:
  – Type of error – bridging/coupling error (e.g., short, cross-talk), stuck-at error, fail-stop error, delay error
  – Error duration – transient, intermittent, permanent
  – # simultaneous errors – errors are rare; how many wars can you fight at once?
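A small fault-injection sketch in C under the stuck-at-0 error model described above: force one bit of a word to 0 and check whether a downstream result changes. If the bit was already 0, the fault is masked, which is the “many stuck-at-0 errors can mean NO PROBLEM” case. The word value, bit positions, and the consume() stand-in are assumptions made for illustration.

#include <stdio.h>
#include <stdint.h>

/* Force bit 'pos' of 'word' to 0: the stuck-at-0 error model. */
static uint32_t inject_stuck_at_0(uint32_t word, unsigned pos) {
    return word & ~(1u << pos);
}

/* Some downstream computation that consumes the word. */
static uint32_t consume(uint32_t word) {
    return word + 3u;
}

int main(void) {
    uint32_t value = 0x0000000Au;     /* example: bit 2 is already 0, bit 1 is 1 */
    unsigned positions[] = { 2, 1 };  /* one masked case, one visible case       */

    for (unsigned i = 0; i < 2; i++) {
        uint32_t faulty = inject_stuck_at_0(value, positions[i]);
        int masked = (consume(faulty) == consume(value));
        printf("stuck-at-0 on bit %u: %s\n", positions[i],
               masked ? "masked (no error)" : "visible (becomes an error)");
    }
    return 0;
}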
# Simultaneous Errors
• Maybe one error hides another error
  – E.g., a parity checker misses a 2-bit flip (see the sketch after this list)
• Reasons for handling multiple simultaneous errors:
  – Mission-critical systems
  – High error rates
  – Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error occurs upon the NEXT read of the word
• Better to detect the first error AND to have double-error correction, since the error rate trends are against us.
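A short C sketch of the 2-bit-flip point above: even parity over a word catches any single flip, but two simultaneous flips cancel out and go undetected. The word value and flipped bit positions are illustrative choices.

#include <stdio.h>
#include <stdint.h>

/* Even parity of a 32-bit word: 0 if the number of 1 bits is even. */
static unsigned parity32(uint32_t w) {
    unsigned p = 0;
    while (w) { p ^= (w & 1u); w >>= 1; }
    return p;
}

int main(void) {
    uint32_t word   = 0xC0FFEE00u;
    unsigned stored = parity32(word);                  /* parity stored with the word */

    uint32_t one_flip = word ^ (1u << 5);              /* single upset: detected      */
    uint32_t two_flip = word ^ (1u << 5) ^ (1u << 17); /* double upset: missed        */

    printf("1-bit flip detected: %s\n", parity32(one_flip) != stored ? "yes" : "no");
    printf("2-bit flip detected: %s\n", parity32(two_flip) != stored ? "yes" : "no");
    return 0;
}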
Fault Tolerance Metrics
• Availability
  – 99.999% = five nines of availability
• Reliability
  – P(still no failure at time t)
  – Most errors are not failures
• A mean is not a probability distribution: variance matters (compare lifetimes of 2 and 20 vs. 11 and 12)
• MTTF – Mean Time To Failure
• MTTR – Mean Time To Repair
• MTBF = MTTF + MTTR (a worked example follows this list)
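A worked example in C of the metrics on this slide: MTBF = MTTF + MTTR, availability = MTTF / MTBF, and the downtime that five nines allows per year. The MTTF and MTTR numbers are invented for illustration.

#include <stdio.h>

int main(void) {
    /* Illustrative numbers, not from the source. */
    double mttf_hours = 100000.0;  /* mean time to failure */
    double mttr_hours = 1.0;       /* mean time to repair  */

    double mtbf = mttf_hours + mttr_hours;   /* MTBF = MTTF + MTTR  */
    double availability = mttf_hours / mtbf; /* fraction of time up */

    printf("MTBF = %.1f h, availability = %.6f\n", mtbf, availability);

    /* Five nines of availability: allowed downtime per year. */
    double year_minutes = 365.25 * 24.0 * 60.0;
    printf("five nines allows %.2f minutes of downtime per year\n",
           (1.0 - 0.99999) * year_minutes);
    return 0;
}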
Fault Tolerance Metrics
• Failures in Time (FIT) rate
  – # failures per 1 billion (10^9) hours of operation
  – Additive across components
  – ∝ 1/MTTF
  – Arbitrary
  – The raw rate includes masked failures
  – The effective rate excludes masked failures
• Effective FIT = FIT × AVF (a worked example follows this list)
  – Helps locate transient error vulnerability
  – Shown to be a good lower bound on reliability
• Architectural Vulnerability Factor (AVF)
  – Architecturally Correct Execution = ACE state
  – Otherwise = un-ACE state
  – E.g., PC state = ACE; branch predictor state = un-ACE
  – AVF = fraction of time spent in ACE state
• Component AVF = average # ACE bits per cycle / # state bits
• If many ACE bits reside in a structure for a long time, that structure is highly vulnerable → large AVF
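A small C sketch of the relationships on this slide: FIT ∝ 1/MTTF (here taken as 10^9 hours / MTTF), component AVF = average # ACE bits per cycle / # state bits, and effective FIT = raw FIT × AVF. All numeric values are made up for illustration.

#include <stdio.h>

int main(void) {
    /* Illustrative numbers only. */
    double mttf_hours       = 2.0e6;   /* raw MTTF of the structure    */
    double avg_ace_bits     = 512.0;   /* average # ACE bits per cycle */
    double total_state_bits = 4096.0;  /* # state bits in the structure */

    double raw_fit = 1.0e9 / mttf_hours;    /* failures per 10^9 hours   */
    double avf     = avg_ace_bits / total_state_bits;
    double eff_fit = raw_fit * avf;         /* effective FIT = FIT * AVF */

    printf("raw FIT = %.1f, AVF = %.3f, effective FIT = %.1f\n",
           raw_fit, avf, eff_fit);
    return 0;
}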
Error Detection
• Helps to provide safety
• Without redundancy we cannot detect errors
• What kind of redundancy do we need?
• Redundancy
  – Physical (majority gate = TMR, dual modular redundancy = DMR, NMR where N is odd and > 3) – see the voter sketch after this list
  – Temporal (run twice & compare results)
  – Information (extra bits like parity)
• The Boeing 777 uses “triple-triple” modular redundancy: two levels of triple voting, where each vote comes from a different architecture
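A minimal TMR majority voter in C, matching the physical-redundancy bullet above: three copies of a result are voted bitwise, so a single corrupted copy is outvoted. The input values and the injected bit flip are illustrative.

#include <stdio.h>
#include <stdint.h>

/* Bitwise 2-out-of-3 majority vote across three redundant copies. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (b & c) | (a & c);
}

int main(void) {
    uint32_t golden = 0xDEADBEEFu;

    uint32_t copy_a = golden;
    uint32_t copy_b = golden ^ (1u << 9);  /* one replica suffers a bit flip */
    uint32_t copy_c = golden;

    uint32_t voted = tmr_vote(copy_a, copy_b, copy_c);
    printf("voted = 0x%08X (%s)\n", voted,
           voted == golden ? "error masked by TMR" : "error escaped");
    return 0;
}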
Error Detection
• Physical Redundancy
• Heterogeneous hardware units can provide physical redundancy
  – E.g., a watchdog timer
  – E.g., Boeing 777: different architectures running the same program and then voting on the results
  – Design diversity
• Unit replication
  – Gate level
  – Register level
  – Core level
• Wastes lots of area & power
• NMR is impractical for PCs
• False error reporting becomes more likely
• Using different hardware for the voters avoids the possibility of design bugs
Error Detection
Temporal Redundancy
• Twice the active power, but not twice the area
• Can find transient but not permanent errors
• Smart pipelining can have the votes arrive one cycle apart, but wastes pipeline slots
Information Redundancy
• Error-Detecting Code (EDC)
• Words mapped to code words, like checksums and CRC
• Hamming Distance (HD)
• Single-Error Correcting (SEC), Double-Error Detecting (DED) with an HD of 4 (see the sketch after this list)
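A sketch of information redundancy in C: a Hamming(7,4) single-error-correcting code. This is HD = 3, so SEC only; adding one overall parity bit would give the SEC-DED (HD = 4) code mentioned above. The bit layout and test value are my own illustrative choices, not from the source.

#include <stdio.h>
#include <stdint.h>

/* Encode 4 data bits d3..d0 into a 7-bit codeword.
 * Codeword positions 1..7 hold: p1 p2 d0 p4 d1 d2 d3. */
static uint8_t hamming74_encode(uint8_t d) {
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* covers positions 2,3,6,7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Correct a single flipped bit (if any) and return the 4 data bits. */
static uint8_t hamming74_decode(uint8_t cw) {
    uint8_t bit[8];
    for (int i = 1; i <= 7; i++) bit[i] = (cw >> (i - 1)) & 1;
    uint8_t s1 = bit[1] ^ bit[3] ^ bit[5] ^ bit[7];
    uint8_t s2 = bit[2] ^ bit[3] ^ bit[6] ^ bit[7];
    uint8_t s4 = bit[4] ^ bit[5] ^ bit[6] ^ bit[7];
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s4 << 2)); /* = error position */
    if (syndrome) bit[syndrome] ^= 1;                         /* correct the flip */
    return (uint8_t)(bit[3] | (bit[5] << 1) | (bit[6] << 2) | (bit[7] << 3));
}

int main(void) {
    uint8_t data = 0xB;                  /* 4-bit example value      */
    uint8_t cw = hamming74_encode(data);
    uint8_t corrupted = cw ^ (1u << 4);  /* flip codeword position 5 */
    printf("sent 0x%X, decoded after 1-bit error: 0x%X\n",
           data, hamming74_decode(corrupted));
    return 0;
}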
Error Detection
• For the ALU we can compare the bit-count of the inputs and outputs, but this is not common
• Many other techniques exist, like BIST or calculating a known quantity and comparing to a ROM with the answer in it
• Re-Execution with Shifted Operands (RESO) finds permanent errors
• Redundant multithreading: use empty issue slots to run redundant threads
• Checking invariant conditions (see the sketch after this list)
• Anomaly detection, like behavioural antivirus (look at data and/or traces)
• Error Detection by Duplicated Instructions (EDDI) – let software look into the hardware using randomly inserted dummy code
• Way more exists about caches, CAMs, consistency, and more
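A small C sketch of the “checking invariant conditions” idea above: run a computation, then verify a property that must hold if no error occurred, and trigger recovery if it does not. The sorted-array invariant and the data values are assumptions made for illustration; EDDI itself is compiler-inserted duplication, which this does not reproduce.

#include <stdio.h>
#include <stdlib.h>

/* Invariant check: after sorting, every element must be <= its successor. */
static int is_sorted(const int *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        if (a[i - 1] > a[i]) return 0;
    return 1;
}

static int cmp_int(const void *x, const void *y) {
    return (*(const int *)x > *(const int *)y) - (*(const int *)x < *(const int *)y);
}

int main(void) {
    int data[] = { 7, 3, 9, 1, 4 };
    size_t n = sizeof data / sizeof data[0];

    qsort(data, n, sizeof data[0], cmp_int);

    /* A transient fault in the sorter or in memory would break the invariant. */
    if (!is_sorted(data, n)) {
        fprintf(stderr, "invariant violated: triggering recovery\n");
        return 1;
    }
    printf("invariant holds: result accepted\n");
    return 0;
}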
Error Recovery
• Detection provides safety, but what about liveness?
• Forward Error Recovery (FER)
  – Once detected, the error is seamlessly corrected
• FER is implemented using physical, information, or temporal redundancy
• More hardware is needed to correct than to detect
  – E.g., DMR can detect, but TMR or triple-triple can correct (spatial)
• HD = k (information redundancy)
  – Detects up to k-1 bit errors
  – Corrects up to (k-1)/2 bit errors
  – (HD, detect, correct) = (5, 4, 2)
• TMR by repetition (temporal)
Error Recovery
• Backward Error Recovery (BER)
  – Rollback / safe point (a toy rollback sketch follows this list)
  – Restore point
  – Recovery line for multicore (cool!)
  – How do we model communication in a multiprocessor with caches?
  – Just log everything? Nope; save it distributed and in the caches. Possibly use software.
  – Way more crazy algorithm-selection magic…
• The Output Commit Problem
  – Sphere of recoverability
  – Don’t let bad data out
  – Wait for the error detection hardware to complete
  – The latency is usually hidden
  – Processor state is difficult to store/restore
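A toy backward-error-recovery loop in C: checkpoint the state, run a step, and if the error detector fires, roll back to the checkpoint and retry. The state struct, the injected fault, the stand-in detector, and the retry limit are all invented for the sketch; real BER must also solve the output-commit and multicore recovery-line issues listed above.

#include <stdio.h>

/* Toy architectural state to checkpoint and restore. */
struct state { long pc; long acc; };

/* One unit of work; 'inject_fault' simulates a transient error. */
static void do_step(struct state *s, int inject_fault) {
    s->acc += 42;
    s->pc  += 4;
    if (inject_fault) s->acc ^= 0x100;  /* bit flip in the accumulator */
}

/* Stand-in error detector (think DMR comparison or a parity check). */
static int error_detected(const struct state *s, const struct state *golden) {
    return s->acc != golden->acc;
}

int main(void) {
    struct state s = { 0, 0 }, checkpoint, golden;

    checkpoint = s;                          /* take a checkpoint (safe point) */
    golden = checkpoint;
    do_step(&golden, 0);                     /* reference result for detection */

    for (int attempt = 1; attempt <= 3; attempt++) {
        do_step(&s, attempt == 1);           /* fault only on the first try */
        if (!error_detected(&s, &golden)) {
            printf("step committed after %d attempt(s)\n", attempt);
            return 0;
        }
        s = checkpoint;                      /* rollback to the safe point */
        printf("error detected on attempt %d: rolled back\n", attempt);
    }
    printf("giving up: possible permanent fault\n");
    return 1;
}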
Error Recovery
FER when a DRAM module fails – RAID-M / chipkill
Fault Diagnosis
• Diagnosis hardware
  – FER and BER do not solve livelock
  – E.g., the multiplier fails, recover, multiply again… livelock
• Idea: be smart and figure out which components are toast
• BIST (Built-In Self-Test)
  – Compare boundary-scan data or stored tests to a ROM with the right answers (see the sketch after this list)
• Run BIST at fixed intervals or at the end of a context switch
• Commit changes if error-free, otherwise restore
• Try to test all components in the system, ideally all gates in the system
• Multiprocessors/NoCs typically have dedicated diagnosis hardware
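A BIST-style sketch in C, matching the “compare stored tests to a ROM with the right answers” bullet above: feed fixed test vectors to the unit under test and compare against golden results stored in a table. The adder standing in for the unit under test and its test vectors are assumptions made for illustration.

#include <stdio.h>
#include <stdint.h>

/* Unit under test: here just an adder (imagine this is the hardware ALU). */
static uint32_t uut_add(uint32_t a, uint32_t b) {
    return a + b;
}

/* Test "ROM": stimulus vectors and their expected (golden) responses. */
struct bist_vector { uint32_t a, b, expected; };
static const struct bist_vector rom[] = {
    { 0x00000000u, 0x00000000u, 0x00000000u },
    { 0x00000001u, 0x00000001u, 0x00000002u },
    { 0xFFFFFFFFu, 0x00000001u, 0x00000000u },  /* wrap-around case */
    { 0x12345678u, 0x87654321u, 0x99999999u },
};

/* Run every stored test; return 0 if all pass, else the 1-based failing index. */
static int run_bist(void) {
    for (unsigned i = 0; i < sizeof rom / sizeof rom[0]; i++)
        if (uut_add(rom[i].a, rom[i].b) != rom[i].expected)
            return (int)(i + 1);
    return 0;
}

int main(void) {
    int fail = run_bist();
    if (fail)
        printf("BIST failed on vector %d: mark the unit as faulty\n", fail);
    else
        printf("BIST passed: unit is usable\n");
    return 0;
}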
Self-Repair
• BIST can tell you what broke, but not how to fix it.
• The i7 can respond to errors on the on-chip busses at runtime. Partial bus shorts do not kill the system; data is transferred like a packet (NoC).
  – Because of all the prediction, lanes, and issue logic, a superscalar has much more redundancy than RISC
  – For RISC, just steal a core from the grid and mark the old core dead
  – CISC has some very crazy metrics for triggering self-repair
• Remember the infinite-loop multiplier we diagnosed?
• Alternative: notice that the multiplier is dead and use shift-add (Booth) multiplication instead (see the sketch after this list)
• Another cool idea: if the shifter breaks, use the multiplier with base-2 inputs (hot spare)
• A cold spare would be a fully dedicated redundant unit
  – The Cell BE only uses 7 cores and has an 8th cold-spare SPE! So cool!
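A C sketch of the “multiplier is dead, fall back to shift-add” idea above: once diagnosis marks the hardware multiplier as faulty, a shift-add routine stands in for it. The diagnosis flag and the software routine are illustrative assumptions; a real design would do this in hardware or microcode.

#include <stdio.h>
#include <stdint.h>

/* Set by the diagnosis stage (e.g., after BIST) when the multiplier is faulty. */
static int multiplier_is_faulty = 1;

/* Fallback: shift-add multiplication that avoids the broken multiplier. */
static uint32_t shift_add_mul(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b) {
        if (b & 1u) product += a;  /* add the shifted multiplicand */
        a <<= 1;
        b >>= 1;
    }
    return product;
}

/* Self-repairing multiply: use the hardware unit only if it is healthy. */
static uint32_t safe_mul(uint32_t a, uint32_t b) {
    if (multiplier_is_faulty)
        return shift_add_mul(a, b);
    return a * b;                  /* would use the hardware multiplier */
}

int main(void) {
    printf("123 * 457 = %u\n", safe_mul(123u, 457u));
    return 0;
}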
Conclusions
• Things are getting a bit crazy in error detection and correction
• Multicore and caches complicate everything
• Although this fault tolerance material has been known for a while, it is only now entering the PC market because the error rate is increasing with process technology scaling
• Like the Byzantine generals problem, we start to worry about whom to trust in a running but broken chip
• Voting works best for transient errors. It works for permanent errors too, but land the plane or you will end up crashing.
• You can prove that it is easier to detect a problem than to fix it.
References
[1] Daniel J. Sorin, “Fault Tolerant Computer Architecture,” Synthesis Lectures on Computer Architecture, 2010.
Questions?