Fault Tolerance in Embedded Systems
Daniel Shapiro
dshap092@uottawa.ca
http://site.uottawa.ca/~dshap092

Fault Tolerance
• This presentation is based upon [1]
• Focus is on the basics as applied to embedded systems with processors
• This presentation does not rely on Wikipedia.
• See Byzantine fault tolerance on wiki

Overview
1. Trends Problems
2. Fault Tolerance Definitions
3. Fault Hiding
4. Fault Avoidance
5. Error Models
6. # Simultaneous Errors
7. Fault Tolerance Metrics
8. Error Detection
9. Error Recovery
10. Fault Diagnosis
11. Self-Repair

Trends Problems
Cosmic rays and alpha particles
• Fault Tolerance
• Goal = safety + liveness
• Safe: keep faults from hurting the user, even in failure
• Live: performs the desired task
• Better to fail than to do harm

Trends Problems
• More devices per processor means more units can fail
– Think CISC vs. RISC
• More complex designs mean more failure cases exist
– Think AVX vs. MMX
• Cache faults and more generally memory faults
– Recharging DRAM is “easier” than reloading a destroyed cache line

Fault Tolerance Definitions
• Fault
– Physical faults
– Software faults
• A fault may manifest as an error
• A masked fault does not show up as an error
• Errors may also be masked
• Otherwise the error results in a failure
• Logical mask – 0 AND error bit
• Architectural mask – error in the destination register of a NOP
• Application mask – silent fault like writing garbage to an unused address … produces no failure

Fault Hiding
• Some faults are already recovered automatically: branch prediction can recover from faulty branches
• Dangerous cases are the faults that are NOT masked
• Goal: mask all faults
– E.g.
HDD faults are common but hidden
• Transient fault – signal glitch
• Permanent fault – wire burns
• Intermittent fault – cold soldered wire
• Fault tolerance scheme – design a system for masking the expected fault type (transient/permanent/intermittent)

Fault Avoidance
• Fault avoidance is just as good as fault tolerance
• Error detection and correction is the alternative
• Permanent faults
– Physical wear-out
– Fabrication defects
– Design bugs

Error Models
• We only care about errors, since masked faults are innocuous
• Error models
– For improving fault tolerance
– E.g. the stuck-at-0/1 model tells us that there is a potential error
– Many, many stuck-at-0 errors can mean that there is NO PROBLEM
– Reduces the need to evaluate all sources of error. Design space size ↓↓
• 3 main error model parameters
• Type of error – bridging/coupling error (e.g. short, cross-talk), stuck-at error, fail-stop error, delay error
• Error duration – transient, intermittent, permanent
• # simultaneous errors – errors are rare, how many wars can you fight at once?

# Simultaneous Errors
• Maybe 1 error hides another error
• E.g. a 2-bit flip slips past a parity checker
• Reasons for resolving:
– Mission critical
– High error rate
– Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error occurs upon the NEXT read of the word
• Better to detect the first error AND to have double error correction, since the error rate trends are against us.

Fault Tolerance Metrics
• Availability – 99.999% = five nines of availability
• Reliability – P(no failure up to time t)
– Most errors are not failures
• Mean != probability
• Variance (2 and 20 vs.
11 and 12)
• MTTF – Mean Time To Failure
• MTTR – Mean Time To Repair
• MTBF = MTTF + MTTR

Fault Tolerance Metrics
• Failures in Time (FIT)
– Rate
– # failures / 1 billion hours
– Additive
– ∝ 1/MTTF
– Arbitrary
– Raw rate includes masked failures
– Effective rate excludes masked failures
• Effective FIT = FIT × AVF
– Helps locate transient error vulnerability
– Shown to be a good lower bound on reliability
• Architectural Vulnerability Factor (AVF)
– Architecturally Correct Execution = ACE state
– Otherwise = un-ACE state
– E.g. PC state = ACE; branch pred = un-ACE
– Fraction of time in ACE state
• Component AVF = avg # ACE bits per cycle / # state bits
• If many ACE bits reside in a structure for a long time, that structure is highly vulnerable (large AVF).

Error Detection
• Helps to provide safety
• Without redundancy we cannot detect errors
• What kind of redundancy do we need? DMR
• Redundancy
– Physical (majority gate = TMR, dual modular redundancy = DMR, NMR where N is odd > 3)
– Temporal (run twice & compare results)
– Information (extra bits like parity)
• Boeing 777 uses “triple-triple” modular redundancy: 2 levels of triple voting, where each vote is from a different architecture

Error Detection
• Physical Redundancy
• Heterogeneous hardware units can provide physical redundancy
– E.g. Watchdog timer
– E.g. Boeing 777: different architectures running the same program and then voting on results.
– Design Diversity
• Unit replication
– Gate level
– Register level
– Core level
• Wastes lots of area & power
• NMR impractical for PCs
• False error reporting becomes more likely
• Using different hardware designs for the voting units guards against a design bug shared by every replica

Error Detection
Temporal Redundancy
• Twice the active power but not twice the area
• Can find transient but not permanent errors
• Smart pipelining can have the votes arrive 1 cycle apart, but wastes pipeline slots
Information Redundancy
• Error-Detecting Code (EDC)
• Words mapped to code words, like checksums and CRC
• Hamming Distance (HD)
• Single-Error Correcting (SEC) Double-Error Detecting (DED) with HD of 4

Error Detection
• For an ALU we can compare the bit count of inputs and outputs, but this is not common
• Many other techniques exist, like BIST or calculating a known quantity and comparing to a ROM with the answer in it.
• REexecution with Shifted Operands (RESO) finds permanent errors.
• Redundant multithreading: use empty slots to run redundancy threads
• Checking invariant conditions
• Anomaly detection like behavioural antivirus (look at data and/or traces)
• Error Detection by Duplicated Instructions (EDDI) – let software check the hardware by inserting duplicated instructions and comparing their results
• Way, way more stuff about caches, CAMs, consistency, and more.

Error Recovery
• Safety from detection, but what about liveness?
• Forward Error Recovery – FER
– Once detected, the error is seamlessly corrected
• FER implemented using physical, information, or temporal redundancy
• More HW needed to correct than detect
– E.g. DMR can detect but TMR or triple-triple can correct (spatial)
• HD = k (information redundancy)
– k-1 bit error detection
– ⌊(k-1)/2⌋ bit error correction
– (HD, detect, correct)
• (5,4,2)
• TMR by repetition (temporal)

Error Recovery
• Backwards Error Recovery
– BER
– Rollback / Safe point
– Restore point
– Recovery line for multicore (cool!)
– How do we model communication in MPs with caches?
– Just log everything? Nope, save it distributed and in the caches. Possibly use software.
– Way more crazy algorithm selection magic…
• The Output Commit Problem
– Sphere of recoverability
– Don’t let bad data out
– Wait for error detection hardware to complete
– Latency is usually hidden
– Processor state is difficult to store/restore

Error Recovery
• FER when DRAM module fails
– RAID-M/chipkill

Fault Diagnosis
• Diagnosis hardware
– FER and BER do not solve livelock
– E.g. mult fails, recover, mult again... livelock
• Idea: be smart, figure out what components are toast
• BIST – Compare boundary scan data or stored tests to a ROM with the right answers
• Run BIST at fixed intervals or at the end of a context switch
• Commit changes if error free, otherwise restore
• Try to test all components in the system, ideally all gates in the system
• MPs/NoC typically have dedicated diagnosis hardware

Self-Repair
• BIST can tell you what broke, but not how to fix it.
• i7 can respond to errors on the on-chip buses at runtime. Partial bus shorts do not kill the system. Data is transferred like a packet (NoC)
– Because of all the prediction, lanes, and issue logic, superscalar has much more redundancy than RISC
– For RISC just steal a core from the grid and mark the old core dead
– CISC has some very crazy metrics for triggering self-repair
• Remember the infinite loop mult we diagnosed?
• Alternative: notice that mult is dead, use shift-add (Booth) multiplication instead
• Another cool idea: if the shifter breaks, use the mult with power-of-2 inputs (hot spare)
• A cold spare would be a fully dedicated redundant unit
– CellBE only uses 7 cores and has an 8th cold spare SPE! So cool!
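The diagnosis and self-repair idea above (notice that mult is dead, then route around it) can be sketched in Python. This is a minimal sketch under stated assumptions, not the method from [1]: the class `SelfRepairingMultiplier`, the test vectors, and the fault model `stuck_at_zero_multiplier` are all hypothetical names invented for illustration. The fallback is plain shift-and-add on unsigned operands rather than Booth recoding.

```python
def shift_add_multiply(a: int, b: int, width: int = 16) -> int:
    """Multiply two unsigned `width`-bit integers using only shifts and adds."""
    product = 0
    for bit in range(width):
        if (b >> bit) & 1:          # add a shifted copy of a for each set bit of b
            product += a << bit
    return product & ((1 << (2 * width)) - 1)


class SelfRepairingMultiplier:
    """Wraps a (possibly faulty) multiplier with BIST-style diagnosis and a fallback."""

    # Stored (a, b, expected) vectors stand in for the BIST ROM.
    # The values are assumptions chosen for this sketch.
    BIST_VECTORS = ((3, 5, 15), (7, 7, 49), (255, 2, 510), (0, 9, 0))

    def __init__(self, hw_multiply):
        self.hw_multiply = hw_multiply  # callable modelling the hardware unit
        self.hw_dead = False

    def run_bist(self) -> bool:
        """Diagnosis: compare the unit's answers to the stored right answers."""
        for a, b, expected in self.BIST_VECTORS:
            if self.hw_multiply(a, b) != expected:
                self.hw_dead = True     # mark the unit as toast
                return False
        return True

    def multiply(self, a: int, b: int) -> int:
        """Self-repair: route around a unit that diagnosis marked dead."""
        if self.hw_dead:
            return shift_add_multiply(a, b)
        return self.hw_multiply(a, b)


def stuck_at_zero_multiplier(a: int, b: int) -> int:
    """Hypothetical permanent fault: output bit 0 is stuck at 0."""
    return (a * b) & ~1
```

A possible use: build `SelfRepairingMultiplier(stuck_at_zero_multiplier)`, call `run_bist()` at a fixed interval, and keep calling `multiply()`. Once the self-test fails, every later multiply goes through the shift-and-add path, which avoids the fail, recover, fail again livelock described on the Fault Diagnosis slide.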
Conclusions
• Things are getting a bit crazy in error detection and correction
• Multicore and caches complicate everything
• Although these fault tolerance techniques have long been known, they are only now entering the PC market because the error rate is increasing with process technology
• Like the Byzantine generals problem, we start to worry about who to trust in the running but broken chip
• Voting works best for transient errors. It works for permanent errors too, but land the plane or you will end up crashing.
• You can prove that it is easier to detect a problem than to fix it.

References
[1] Daniel J. Sorin, “Fault Tolerant Computer Architecture,” Synthesis Lectures on Computer Architecture, 2010.

Questions?