SWAT: Designing Resilient Hardware by Treating Software Anomalies

Lei Chen, Byn Choi, Xin Fu, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Rob Smolinski, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
swat@cs.illinois.edu

Motivation

• Hardware will fail in the field for several reasons
  – Wear-out (devices grow weaker)
  – Transient errors (high-energy particle strikes)
  – Design bugs, and so on
⇒ Need in-field detection, diagnosis, recovery, and repair
• The reliability problem is pervasive across many markets
  – Traditional redundancy solutions (e.g., nMR) are too expensive
⇒ Need low-cost solutions for multiple failure sources
⇒ Must incur low area, performance, and power overheads

Observations

• Need to handle only the hardware faults that propagate to software
• The fault-free case remains the common case and must be optimized
⇒ Watch for software anomalies (symptoms)
  – Zero- to low-overhead "always-on" monitors
⇒ Diagnose the cause after a symptom is detected
  – May incur high overhead, but rarely invoked

SWAT: SoftWare Anomaly Treatment

SWAT Framework Components

• Detection: symptoms of software misbehavior
• Recovery: checkpoint and rollback
• Diagnosis: rollback/replay on a multicore
• Repair/reconfiguration: redundant, reconfigurable hardware
• Flexible control through firmware
[Timeline figure: Checkpoint ... Checkpoint ... Fault → Error → Symptom detected, followed by Recovery, Diagnosis, Repair]

Advantages of SWAT

• Handles all faults that matter
  – Oblivious to low-level failure modes and masked faults
• Low, amortized overheads
  – Optimizes for the common case; exploits SW reliability solutions
• Customizable and flexible
  – Firmware control adapts to specific reliability needs
• Holistic system view enables novel solutions
  – Synergistic detection, diagnosis, and recovery solutions
• Beyond hardware reliability
  – Long-term goal: unified system (HW+SW) reliability
  – Potential application to post-silicon test and debug

SWAT Contributions

• Very low-cost detectors with low SDC rate and detection latency [ASPLOS'08, DSN'08]
• Application-aware SWAT: even lower SDC rate and latency
• Accurate fault modeling [HPCA'09]
• Multithreaded workloads [MICRO'09]
• In-situ diagnosis [DSN'08]

Outline

• Motivation
• Detection
• Recovery analysis
• Diagnosis
• Conclusions and future work
Simple Fault Detectors [ASPLOS'08]

• Simple detectors that observe anomalous SW behavior
  – Fatal traps: division by zero, RED state, etc.
  – Hangs: simple HW hang detector
  – Kernel panic: OS enters panic state due to the fault
  – High OS activity: abnormally long stretches of contiguous OS execution
  – Application abort due to the fault
• All detections are handled by SWAT firmware
• Very low hardware area and performance overhead

Evaluating Fault Detectors

• Simulate OpenSolaris on an out-of-order processor
  – GEMS timing models + Simics full-system simulator
• I/O- and compute-intensive applications
  – Client-server: apache, mysql, squid, sshd
  – All SPEC CPU2000 C/C++ benchmarks: 12 integer, 4 FP
• µarchitecture-level fault injections (single-fault model)
  – Stuck-at and transient faults in 8 µarch units
  – ~18,000 total injections, a statistically significant sample
• Methodology: after injecting a fault, run 10M instructions in timing simulation; if no symptom appears within 10M instructions, run to completion in functional simulation and classify the outcome as masked or a potential silent data corruption (SDC)

Metrics for Fault Detection

• Potential SDC rate
  – An undetected fault that changes the application's output
  – The output change may or may not be important
• Detection latency
  – Latency from architecture-state corruption to detection
  – Architecture state = registers + memory (this definition is refined later)
  – High detection latency impedes recovery

SDC Rate of Simple Detectors

[Chart: injected faults broken down into masked / detected / potential SDC per µarch unit (FP ALU, AGEN, RAT, ROB, Int reg, Reg Dbus, INT ALU, Decoder) for permanent faults in SPEC]
• 0.6% potential SDC rate for permanent faults in SPEC, excluding the FPU
• Faults in the FPU need different detectors - they mostly corrupt only data
[Chart: masked / detected / potential SDC for server and SPEC workloads, permanent and transient faults; potential SDC rates range from 0.4% to 0.7%]
• SWAT detectors are highly effective for hardware faults
  – Low potential SDC rates across workloads

Detection Latency

[Chart: cumulative % of detected faults vs. detection latency (<10K to >10M instructions) for SPEC and server workloads, permanent and transient faults]
• 90% of the faults are detected in under 10M instructions
• Existing work claims such latencies are recoverable with HW checkpointing - more recovery analysis follows later

Exploiting Application Support for Detection

• Techniques inspired by software bug detection
  – Likely program invariants: iSWAT - instrumented binary, no hardware changes, <5% performance overhead on x86 processors
  – Detecting out-of-bounds addresses: low hardware overhead, near-zero performance impact
• Exploiting application-level resiliency

Low-Cost Out-of-Bounds Detector

• Sophisticated detectors exist for security and software bugs
  – They track the object accessed and validate pointer accesses
  – They require full-program analysis and changes to the binary
• Bad addresses caused by HW faults are more obvious
  – Invalid pages, unallocated memory, etc.
• Low-cost out-of-bounds detector (sketched below)
  – Monitor the boundaries of the heap, stack, and globals
  – An address beyond these bounds signals a HW fault
  – SW communicates the boundaries to HW; HW enforces checks on each load/store address
[Figure: application address space with reserved, stack, libraries, heap, globals, app code, and empty regions]
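To make the mechanism concrete, here is a minimal C++ model of the check described above. It assumes the firmware/runtime updates three bound registers (heap, stack, globals) and that the hardware evaluates the check on every load/store address; the names and the exact region set are illustrative, not the actual SWAT hardware interface.

```cpp
#include <cstdint>

// Illustrative model of the out-of-bounds symptom detector: software
// tells the hardware where the legal regions are, and the hardware
// flags any load/store address that falls outside all of them.
struct Region { uint64_t lo, hi; };  // legal range [lo, hi)

struct OutOfBoundsDetector {
    Region heap{}, stack{}, globals{};  // boundaries set by SW

    // Firmware/runtime updates the bounds, e.g., on heap growth
    // (sbrk/mmap) or stack growth. (Hypothetical interface.)
    void set_heap(uint64_t lo, uint64_t hi)    { heap = {lo, hi}; }
    void set_stack(uint64_t lo, uint64_t hi)   { stack = {lo, hi}; }
    void set_globals(uint64_t lo, uint64_t hi) { globals = {lo, hi}; }

    static bool in(const Region& r, uint64_t a) {
        return a >= r.lo && a < r.hi;
    }

    // Conceptually evaluated on every load/store address; returns true
    // when the address lies outside every legal region, i.e., a symptom
    // the SWAT firmware should treat as a likely hardware fault.
    bool is_symptom(uint64_t addr) const {
        return !(in(heap, addr) || in(stack, addr) || in(globals, addr));
    }
};
```

The point of the design is that a handful of range comparators suffice: unlike software memory-safety tools, no per-object metadata or binary changes are needed, which is why the hardware and performance costs stay near zero.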
Impact of Out-of-Bounds Detector

[Chart: injected faults broken down into masked / detected by OoB / detected by other symptoms / potential SDC, for SWAT vs. SWAT+OoB; server potential SDC rates fall from 0.58% to 0.38% (permanents) and 0.28% to 0.23% (transients); SPEC rates stay near 0.65%]
• Lower potential SDC rate in server workloads
  – 39% lower for permanents, 52% lower for transients
• For SPEC workloads, the impact is on detection latency

Application-Aware SDC Analysis

• Potential SDCs are undetected faults that corrupt application output
• But many applications can tolerate faults
  – The client may detect the fault and retry the request
  – The application may perform fault-tolerant computations, e.g., a place & route with the same cost, or an acceptable PSNR
⇒ Not all potential SDCs are true SDCs
  – For each application, define its notion of fault tolerance
• SWAT detectors cannot (and arguably should not) detect such acceptable changes

Application-Aware SDCs for Server Workloads

[Chart: number of potential SDCs with SWAT, SWAT+OoB, and with application-level tolerance; permanents: 52 (0.58%) → 34 (0.38%) → 12 (0.13%); transients: 25 (0.28%) → 21 (0.23%) → 9 (0.10%)]
• 46% of potential SDCs are tolerated by simple retry
• Only 21 SDCs remain out of 17,880 injected faults
  – Most are detectable through application-level validity checks

Application-Aware SDCs for SPEC

[Chart: number of SDCs under SWAT+OoB and at output-degradation thresholds >0%, >0.01%, >1%; permanents: 56 (0.6%), 46 (0.5%), 37 (0.4%), 33 (0.4%); transients: 58 (0.6%), 16 (0.2%), 11 (0.1%), 8 (0.1%)]
• Only 62 faults show >0% degradation from the golden output
• Only 41 injected faults are SDCs at >1% degradation
  – 38 of these are from apps we conservatively classify as fault-intolerant: chess-playing apps, compilers, parsers, etc.

Reducing Potential SDCs Further (future work)

• Explore application-specific detectors
  – Compiler-assisted invariants like iSWAT
  – Application-level checks
• Need to fundamentally understand why and where SWAT works
  – The SWAT evaluation so far is largely empirical
  – Build models to predict the effectiveness of SWAT
⇒ Develop new low-cost symptom detectors
⇒ Extract a minimal set of detectors for a given set of faults
⇒ Analyze reliability vs. overhead trade-offs

Reducing Detection Latency: A New Definition

• SWAT relies on checkpoint/rollback for recovery
• Detection latency therefore dictates fault recovery
  – The fault is recoverable only if some checkpoint is fault-free
• Traditional definition: architecture-state corruption to detection
• But software may mask some architecture-state corruptions!
• New definition: unmasked architecture-state corruption (i.e., software-state corruption) to detection
[Timeline figure: Fault → bad arch state → bad SW state → detection; the old latency is measured from the arch-state corruption, the new latency from the SW-state corruption, so more checkpoints remain recoverable]

Measuring Detection Latency

• New detection latency = SW-state corruption to detection
• But identifying SW-state corruption is hard!
  – Need to know how the faulty value is used by the application
  – If the faulty value affects output, the SW state is corrupted
• Measure the latency by rolling back to successively older checkpoints and replaying: if the symptom disappears on replay, the checkpoint predates the SW-state corruption (sketched below)
  – Used only for analysis; not required in a real system
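The rollback-and-replay measurement can be phrased as a small loop over checkpoints. The sketch below assumes a simulator-provided replay callback and a simple Checkpoint record, both stand-ins for the simulator's own machinery; it yields a conservative (upper-bound) latency, since the corruption lies somewhere after the newest symptom-free checkpoint.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Analysis-only sketch of measuring the *new* detection latency:
// roll back to successively older checkpoints and replay; the newest
// checkpoint whose replay is symptom-free precedes the software-state
// corruption.
struct Checkpoint {
    uint64_t instr_count;  // instruction count when the checkpoint was taken
    // ... architectural registers + memory snapshot would live here
};

uint64_t new_detection_latency(
    const std::vector<Checkpoint>& chkpts,   // oldest first, non-empty
    uint64_t detection_instr,                // instruction count at detection
    const std::function<bool(const Checkpoint&)>& replay_shows_symptom) {
    // Walk backwards from the most recent checkpoint.
    for (auto it = chkpts.rbegin(); it != chkpts.rend(); ++it) {
        if (!replay_shows_symptom(*it)) {
            // Clean replay: software state was corrupted after this
            // checkpoint, so measure the latency from here.
            return detection_instr - it->instr_count;
        }
    }
    // Every checkpoint replays the symptom: corruption predates them all.
    return detection_instr - chkpts.front().instr_count;
}
```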
Detection Latency with the New Definition

[Charts: cumulative % of detected faults vs. detection latency (<10K to >10M instructions) for permanent and transient faults in server workloads, comparing the old-latency SWAT curve, the new-latency SWAT curve, and the new latency with the out-of-bounds detector; corresponding charts for SPEC]
• Measuring the new latency is important for studying recovery
• The new techniques significantly reduce detection latency
  – >90% of faults are detected in <100K instructions
• Reduced detection latency improves recoverability

Implications for Fault Recovery

• Recovery = checkpointing + I/O buffering
• Checkpointing
  – Records pristine architectural state for recovery
  – Periodic register snapshots; log memory writes (sketched after the next slide)
• I/O buffering
  – Buffers external events until execution is known to be fault-free
  – A HW buffer records device reads and buffers device writes
• These "always-on" mechanisms must incur minimal overhead

Overheads from Memory Logging

[Chart: memory log size (KB) vs. checkpoint interval (10K to 10M instructions) for apache, sshd, squid, mysql]
• The new techniques reduce checkpointing overheads by over 60%
  – The checkpoint interval shrinks to 100K instructions from millions
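As a rough illustration of why log size tracks the checkpoint interval, here is a minimal undo-log model in the spirit of the logging described above: registers are snapshotted at the start of each interval, and each memory word's old value is logged on its first write. The data structures are illustrative, not SafetyNet's or ReVive's actual designs.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal sketch of checkpointing via an undo log of memory writes:
// snapshot the registers at the checkpoint, record the *old* value of
// each memory word the first time it is written in the interval, and
// roll back by restoring the logged words and the register snapshot.
struct UndoLogCheckpoint {
    std::unordered_map<uint64_t, uint64_t> undo;  // addr -> pre-write value
    uint64_t regs[32] = {};                       // register snapshot

    void begin(const uint64_t (&cur_regs)[32]) {
        undo.clear();
        for (int i = 0; i < 32; ++i) regs[i] = cur_regs[i];
    }

    // Called on every store: emplace logs the old value only on first
    // touch, so the log grows with the write *footprint*, not the
    // write count.
    void on_store(uint64_t addr, uint64_t old_value) {
        undo.emplace(addr, old_value);
    }

    // Rollback: write the logged values back, then restore registers.
    template <class Mem>
    void rollback(Mem& mem, uint64_t (&cur_regs)[32]) const {
        for (const auto& [addr, val] : undo) mem.write(addr, val);
        for (int i = 0; i < 32; ++i) cur_regs[i] = regs[i];
    }

    size_t log_bytes() const { return undo.size() * 16; }  // addr + value
};
```

Because the log records only first touches, shrinking the interval from millions of instructions to 100K shrinks the logged write footprint, which is what the memory-log chart above reflects.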
Overheads from Output Buffering

[Chart: output buffer size (KB) vs. checkpoint interval (10K to 10M instructions) for apache, sshd, squid, mysql]
• The new techniques reduce the output buffer size to near zero
  – <5KB of buffering at a 100K-instruction checkpoint interval (buffering spans 2 checkpoints)
  – Near-zero overheads at a 10K-instruction interval

Low-Cost Fault Recovery (future work)

• The new techniques significantly reduce recovery overheads
  – Over 60% smaller memory logs, near-zero output buffering
• But they still do not enable ultra-low-cost fault recovery
  – ~400KB of HW overhead for in-hardware memory logs (SafetyNet)
  – High performance impact for in-memory logs (ReVive)
• Need an ultra-low-cost recovery scheme at short intervals
  – Even shorter detection latencies
  – Checkpoint only the state that matters
  – Application-aware insights: transactional apps, recovery domains for the OS, …

Fault Diagnosis

• Symptom-based detection is cheap, but
  – It may incur a long latency from fault activation to detection
  – It is difficult to diagnose the root cause: the same symptom may stem from a SW bug, a transient fault, or a permanent fault
• Goal: diagnose the fault with minimal hardware overhead
  – Diagnosis is rarely invoked, so a higher performance overhead is acceptable

SWAT Single-Threaded Fault Diagnosis [Li et al., DSN'08]

• First, diagnosis for a single-threaded workload on one core
  – Multithreaded workloads on multicores come later, with several new challenges
• Key ideas
  – Single-core fault model; a fault-free core is available elsewhere in the multicore
  – Checkpoint/replay already exists for recovery ⇒ replay on a good core and compare
  – Synthesize DMR, but only for diagnosis: traditional DMR (P1 and P2 always running and compared) is expensive because it is always on; synthesized DMR pairs the faulty core with a fault-free core only when a fault is suspected

SW Bug vs. Transient vs. Permanent

• Rollback/replay on the same and on a different core; watch whether the symptom reappears (see the sketch below)
  – Rollback on the faulty core; no symptom ⇒ transient fault or nondeterministic s/w bug; continue execution
  – Symptom reappears ⇒ deterministic s/w bug or permanent h/w fault; rollback/replay on a good core
  – No symptom on the good core ⇒ permanent h/w fault, needs repair!
  – Symptom also reappears on the good core ⇒ deterministic s/w bug (send to the s/w layer)
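The flowchart above transcribes almost directly into code. In the sketch below, the two predicates are assumed hooks into the checkpoint/replay machinery: each rolls back to the last checkpoint, replays on the named core, and reports whether the symptom reappears. The names are illustrative, not an interface from the SWAT papers.

```cpp
#include <functional>

// Diagnosis decision procedure from the
// "SW Bug vs. Transient vs. Permanent" slide.
enum class Diagnosis {
    TransientOrNondetSwBug,  // symptom gone on re-execution: resume
    PermanentHwFault,        // deterministic on the faulty core only: repair
    DeterministicSwBug       // deterministic on a good core too: to SW layer
};

Diagnosis diagnose(const std::function<bool()>& symptom_on_faulty_core_replay,
                   const std::function<bool()>& symptom_on_good_core_replay) {
    if (!symptom_on_faulty_core_replay())
        return Diagnosis::TransientOrNondetSwBug;  // continue execution
    if (!symptom_on_good_core_replay())
        return Diagnosis::PermanentHwFault;        // needs repair
    return Diagnosis::DeterministicSwBug;          // send to the s/w layer
}
```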
µarch-Level Fault Diagnosis

• Once a symptom is diagnosed as a permanent fault, microarchitecture-level diagnosis identifies the faulty unit ("unit X is faulty")
[Flowchart: symptom detected → diagnosis → software bug / transient fault / permanent fault → microarchitecture-level diagnosis → unit X is faulty]

Trace-Based Fault Diagnosis (TBFD)

• µarch-level fault diagnosis using rollback/replay
• Key insight: the execution that caused the symptom is a trace that activates the fault
  – Deterministically replay that trace on the faulty and fault-free cores
  – A divergence means faulty hardware was used, which yields diagnosis clues
• Diagnose faults down to the µarch units of the processor
  – Check µarch-level invariants in several parts of the processor
  – Diagnosis in the out-of-order logic (the meta-datapath) is complex

Trace-Based Fault Diagnosis: Evaluation

• Goal: diagnose faults at reasonable latency
• Faults diagnosed in 10 SPEC workloads
  – ~8,500 detected faults (98% of unmasked faults)
• Results
  – 98% of detections successfully diagnosed
  – 91% diagnosed within 1M instructions (~0.5ms on a 2GHz processor)

SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO'09]

• Challenge 1: deterministic replay of a full multithreaded system incurs high overhead
• Challenge 2: multithreaded apps share data among threads, so a faulty value can propagate through memory (e.g., core 1 stores it, core 2 loads it)
  – The core that raises the symptom may not be the faulty one; the symptom may even be detected on a fault-free core
  – No core in the system is known to be fault-free

mSWAT Diagnosis: Key Ideas

• Challenge: full-system deterministic replay ⇒ key idea: isolated deterministic replay - record each thread's inputs so that threads TA-TD (running on cores A-D) can be replayed independently
• Challenge: no known good core ⇒ key idea: emulated TMR - re-execute each captured thread on two other cores and vote on the outcomes (a code sketch follows the Future Work slide)

mSWAT Diagnosis: Evaluation

• Diagnose detected permanent faults in multithreaded apps
  – Goal: identify the faulty core, then apply TBFD for µarch-level diagnosis
  – Challenges: nondeterminism; no core known to be fault-free
  – ~4% of the faults are detected on a fault-free core
• Results
  – 95% of detected faults diagnosed
  – All detections on a fault-free core were correctly diagnosed
  – 96% of diagnosed faults require <200KB of buffering, which can be stored in a lower-level cache ⇒ low HW overhead
• SWAT diagnosis can work with other symptom detectors

Summary: SWAT Works!

• Very low-cost detectors with low SDC rate and detection latency [ASPLOS'08, DSN'08]
• Application-aware SWAT: even lower SDC rate and latency
• Accurate fault modeling [HPCA'09]
• Multithreaded workloads [MICRO'09]
• In-situ diagnosis [DSN'08]

Future Work

• Formalization of when and why SWAT works
• Near-zero-cost recovery
• More server and distributed applications
• App-level, customizable resilience
• Other core and off-core parts of the multicore
• Other fault models
• Prototyping SWAT on an FPGA (with Michigan)
• Interaction with safe programming
• Unifying with s/w resilience
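Backup: a concrete sketch of the emulated-TMR vote referenced on the mSWAT key-ideas slide. The outcome digest (e.g., a hash of retired stores) and the replay hook are assumptions of this sketch; mSWAT's actual mechanism compares captured execution traces rather than this simplified digest.

```cpp
#include <array>
#include <cstdint>
#include <functional>

// Emulated TMR: a captured thread trace is re-executed on two other
// cores, and the three outcome digests are voted on; the outvoted
// core is diagnosed as faulty.
using Digest = uint64_t;

// Returns the id of the core diagnosed faulty, or -1 if all agree.
int vote_on_thread(int original_core,
                   const std::array<int, 2>& other_cores,
                   Digest original_digest,
                   const std::function<Digest(int core)>& replay_on_core) {
    const Digest a = replay_on_core(other_cores[0]);
    const Digest b = replay_on_core(other_cores[1]);
    if (original_digest == a && a == b) return -1;   // unanimous: no fault seen
    if (a == b) return original_core;                // original core outvoted
    if (original_digest == b) return other_cores[0]; // first replica outvoted
    if (original_digest == a) return other_cores[1]; // second replica outvoted
    // All three differ: should not happen under a single-fault model
    // with isolated deterministic replay; escalate conservatively.
    return original_core;
}
```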