SWAT: Designing Resilient Hardware by Treating Software Anomalies
Lei Chen, Byn Choi, Xin Fu, Siva Hari, Man-lap (Alex) Li, Pradeep Ramachandran,
Swarup Sahoo, Rob Smolinski, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
swat@cs.illinois.edu
Motivation
• Hardware will fail in the field for several reasons
– Wear-out (devices become weaker)
– Transient errors (high-energy particles)
– Design bugs
– … and so on
 Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy solutions (e.g., nMR) too expensive
 Need low-cost solutions for multiple failure sources
 Must incur low area, performance, power overhead
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
 Watch for software anomalies (symptoms)
– Zero to low overhead “always-on” monitors
 Diagnose cause after symptom detected
– May incur high overhead, but rarely invoked
 SWAT: SoftWare Anomaly Treatment
SWAT Framework Components
• Detection: Symptoms of software misbehavior
• Recovery: Checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Figure: SWAT execution timeline. Periodic checkpoints are taken; a fault becomes an error, a symptom is detected, then recovery, diagnosis, and repair follow.]
Advantages of SWAT
• Handles all faults that matter
– Oblivious to low-level failure modes and masked faults
• Low, amortized overheads
– Optimize for common case, exploit SW reliability solutions
• Customizable and flexible
– Firmware control adapts to specific reliability needs
• Holistic systems view enables novel solutions
– Synergistic detection, diagnosis, recovery solutions
• Beyond hardware reliability
– Long term goal: unified system (HW+SW) reliability
– Potential application to post-silicon test and debug
SWAT Contributions
• Very low-cost detectors [ASPLOS’08, DSN’08]: low SDC rate and detection latency
• Application-aware SWAT: even lower SDC rate and latency
• Accurate fault modeling [HPCA’09]
• Multithreaded workloads [MICRO’09]
• In-situ diagnosis [DSN’08]
[Figure: these contributions annotated on the SWAT timeline of checkpoints, fault/error, symptom detection, recovery, diagnosis, and repair.]
Outline
• Motivation
• Detection
• Recovery analysis
• Diagnosis
• Conclusions and future work
Simple Fault Detectors [ASPLOS ’08]
• Simple detectors that observe anomalous SW behavior
– Fatal traps: division by zero, RED state, etc.
– Hangs: simple HW hang detector (sketch below)
– Kernel panic: OS enters panic state due to fault
– High OS: abnormally high contiguous OS activity
– App abort: application abort due to fault
All symptoms are reported to the SWAT firmware.
• Very low hardware area, performance overhead
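As a concrete illustration, here is a minimal software sketch of a hang detector in this spirit; the window size, uniqueness threshold, and interface are assumptions for illustration, not the actual SWAT hardware design:

#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: if only a handful of distinct branch PCs retire
// over a long window, the program is probably stuck spinning in a loop.
class HangDetector {
    std::unordered_map<uint64_t, uint64_t> branchCounts;  // branch PC -> hits
    uint64_t branchesSeen = 0;
    static constexpr uint64_t kWindow = 100000;       // branches per window (assumed)
    static constexpr uint64_t kMaxUniquePCs = 16;     // "tight loop" cutoff (assumed)
public:
    // Call on every retired branch; returns true when a hang is suspected.
    bool onBranch(uint64_t pc) {
        ++branchCounts[pc];
        if (++branchesSeen < kWindow) return false;
        bool suspected = branchCounts.size() <= kMaxUniquePCs;  // too few targets
        branchCounts.clear();  // start a fresh window
        branchesSeen = 0;
        return suspected;
    }
};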
Evaluating Fault Detectors
• Simulate OpenSolaris on out-of-order processor
– GEMS timing models + Simics full-system simulator
• I/O- and compute-intensive applications
– Client-server – apache, mysql, squid, sshd
– All SPEC 2K C/C++ – 12 Integer, 4 FP
• µarchitecture-level fault injections (single fault model)
– Stuck-at, transient faults in 8 µarch units
– ~18,000 total faults  statistically significant
[Methodology timeline: inject the fault, then run 10M instructions in detailed timing simulation watching for a symptom; if no symptom appears, switch to functional simulation, run to completion, and classify the outcome as masked or a potential Silent Data Corruption (SDC). A sketch of this flow follows.]
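To make the flow concrete, here is a sketch of one injection experiment. The Simulator interface is a hypothetical stand-in (the real campaign drives Simics + GEMS, whose APIs differ), so all the stubbed calls are assumptions:

#include <cstdint>
#include <string>

// Hypothetical stand-ins for the simulation harness (not the Simics/GEMS API).
struct FaultSpec { int unitId; bool permanent; uint64_t injectCycle; };

struct Simulator {
    void injectFault(const FaultSpec&) {}                                   // stub
    bool runTimingAndWatch(uint64_t maxInstrs) { (void)maxInstrs; return false; } // stub: true if a symptom fired
    std::string runFunctionalToCompletion() { return {}; }                  // stub
    std::string goldenOutput() const { return {}; }                         // stub
};

enum class Outcome { Detected, Masked, PotentialSDC };

Outcome runOneInjection(Simulator& sim, const FaultSpec& fault) {
    sim.injectFault(fault);               // stuck-at or transient in a µarch unit
    if (sim.runTimingAndWatch(10000000))  // detailed timing, 10M-instruction window
        return Outcome::Detected;         // fatal trap, hang, panic, ...
    // No symptom within the window: fast functional run to completion,
    // then compare application output against a fault-free (golden) run.
    return sim.runFunctionalToCompletion() == sim.goldenOutput()
               ? Outcome::Masked : Outcome::PotentialSDC;
}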
Metrics for Fault Detection
• Potential SDC rate
– Undetected fault that changes app output
– Output change may or may not be important
• Detection Latency
– Latency from architecture state corruption to detection
 Architecture state = registers + memory
 Will improve later
– High detection latency impedes recovery
SDC Rate of Simple Detectors: SPEC, permanents
[Chart: per-unit breakdown of injected permanent faults into masked, detected, and SDC for decoder, INT ALU, register Dbus, int reg, ROB, RAT, AGEN, and FP ALU, plus totals with and without FP; labels show potential SDC rates between 0.4% and 0.7%.]
• 0.6% potential SDC rate for permanents in SPEC, without FPU
• Faults in FPU need different detectors
– Mostly corrupt only data
Potential SDC Rate
[Chart: masked / detected / potential-SDC breakdown of injected faults for server and SPEC workloads, permanent and transient; potential SDC rates range from 0.4% to 0.7% across the four categories.]
• SWAT detectors highly effective for hardware faults
– Low potential SDC rates across workloads
Detection Latency
[Charts: cumulative detection latency (instructions) for SPEC and server workloads, permanent and transient faults, bucketed as <10K, <100K, <1M, <10M, >10M.]
• 90% of the faults detected in under 10M instructions
• Existing work claims these are recoverable w/ HW chkpting
– More recovery analysis follows later
Exploiting Application Support for Detection
• Techniques inspired by software bug detection
– Likely program invariants: iSWAT (see sketch below)
 Instrumented binary, no hardware changes
 <5% performance overhead on x86 processors
– Detecting out-of-bounds addresses
 Low hardware overhead, near-zero performance impact
• Exploiting application-level resiliency
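As a minimal sketch of what an iSWAT-style likely range invariant could look like once instrumented into the binary (the bounds, program point, and symptom handler are illustrative assumptions, not the actual iSWAT implementation):

#include <cstdint>
#include <cstdio>

// Hypothetical sketch: training runs record the min/max a value takes at a
// program point; the instrumented binary checks new values against that
// range, and a violation is treated as a symptom of a possible hardware
// fault rather than an immediate failure.
struct RangeInvariant {
    int64_t lo, hi;  // bounds learned from fault-free training runs
    bool holds(int64_t v) const { return lo <= v && v <= hi; }
};

static RangeInvariant gChunkLen = {0, 4096};  // assumed trained bounds

static void raiseSymptom(const char* where) {
    // In SWAT this would hand control to the firmware for diagnosis/rollback.
    std::fprintf(stderr, "likely invariant violated at %s\n", where);
}

int sumChunk(const int* data, int len) {
    if (!gChunkLen.holds(len)) raiseSymptom("sumChunk:len");  // inserted check
    int sum = 0;
    for (int i = 0; i < len; ++i) sum += data[i];  // original application code
    return sum;
}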
Low-Cost Out-of-Bounds Detector
• Sophisticated detector for security, software bugs
– Track object accessed, validate pointer accesses
– Require full-program analysis, changes to binary
• Bad addresses from HW faults more obvious
– Invalid pages, unallocated memory, etc.
• Low-cost out-of-bounds detector (sketch below)
– Monitor boundaries of heap, stack, globals
– Address beyond these bounds  HW fault
– SW communicates boundaries to HW
– HW enforces checks on ld/st address
[Figure: application address space showing empty, app code, globals, heap, libraries, stack, and reserved regions.]
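Here is a small software model of that check, with invented register names and layout; in hardware this amounts to a few bound registers and comparators on the load/store address path:

#include <cstdint>

// Hypothetical model: the runtime publishes the live regions (globals,
// heap, stack) to bound registers, and every ld/st address is checked
// against them. Code/library regions are omitted for brevity.
struct BoundRegs {
    uint64_t globalLo, globalHi;  // [lo, hi) for each region
    uint64_t heapLo,   heapHi;    // updated by SW on brk/mmap
    uint64_t stackLo,  stackHi;   // updated by SW on stack growth
};

// Conceptually one comparator tree per memory access; an address outside
// every known-valid region is flagged as a likely hardware fault.
inline bool accessIsValid(const BoundRegs& b, uint64_t addr) {
    return (addr >= b.globalLo && addr < b.globalHi) ||
           (addr >= b.heapLo   && addr < b.heapHi)   ||
           (addr >= b.stackLo  && addr < b.stackHi);
}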
Impact of Out-of-Bounds Detector
[Charts: masked / DetectOoB / DetectOther / potential-SDC breakdown for SWAT vs. SWAT+OoB on server and SPEC workloads, permanents and transients. Server potential SDC drops from 0.38% to 0.23% (permanents) and from 0.58% to 0.28% (transients); SPEC rates stay in the 0.63–0.67% range.]
• Lower potential SDC rate in server workloads
– 39% lower for permanents, 52% for transients
• For SPEC workloads, the impact is on detection latency
Application-Aware SDC Analysis
• Potential SDC  undetected faults that corrupt app output
• But many applications can tolerate faults
– Client may detect fault and retry request
– Application may perform fault-tolerant computations
 E.g., Same cost place & route, acceptable PSNR, etc.
 Not all potential SDCs are true SDCs
– For each application, define a notion of fault tolerance
• SWAT detectors cannot (and should not) detect such acceptable changes
Application-Aware SDCs for Server
[Charts: number of potential-SDC faults in server workloads. Permanents: 34 (0.38%) with SWAT, 21 (0.23%) with SWAT+OoB, 9 (0.10%) after accounting for app-level tolerance. Transients: 52 (0.58%), 25 (0.28%), and 12 (0.13%) respectively.]
• 46% of potential SDCs are tolerated by simple retry
• Only 21 remaining SDCs out of 17,880 injected faults
– Most detectable through application-level validity checks
Application-Aware SDCs for SPEC
[Charts: number of potential-SDC faults in SPEC with SWAT+OoB and at output-degradation thresholds. Permanents: 56 (0.6%) total, 46 (0.5%) at >0% degradation, 37 (0.4%) at >0.01%, 33 (0.4%) at >1%. Transients: 58 (0.6%), 16 (0.2%), 11 (0.1%), and 8 (0.1%) at the same thresholds.]
• Only 62 faults show >0% degradation from golden output
• Only 41 injected faults are SDCs at >1% degradation
– 38 from apps we conservatively classify as fault intolerant
 Chess playing apps, compilers, parsers, etc.
Reducing Potential SDCs further (future work)
• Explore application-specific detectors
– Compiler-assisted invariants like iSWAT
– Application-level checks
• Need to fundamentally understand why, where SWAT works
– SWAT evaluation largely empirical
– Build models to predict effectiveness of SWAT
 Develop new low-cost symptom detectors
 Extract minimal set of detectors for given sets of faults
 Reliability vs. overhead trade-off analysis
Reducing Detection Latency: New Definition
• SWAT relies on checkpoint/rollback for recovery
• Detection latency dictates fault recovery
– Checkpoint fault-free  fault recoverable
• Traditional defn. = arch state corruption to detection
• But software may mask some corruptions!
• New defn. = Unmasked arch state corruption to detection
[Timeline: fault  bad arch state  bad SW state  detection. The old latency is measured from arch-state corruption, the new latency from SW-state corruption; a checkpoint taken before SW-state corruption remains recoverable.]
Measuring Detection Latency
• New detection latency = SW state corruption to detection
• But identifying SW state corruption is hard!
– Need to know how faulty value used by application
– If faulty value affects output, then SW state corrupted
• Measure latency by rolling back to older checkpoints
– Only for analysis, not required in real system
[Timeline: roll back to successively older checkpoints and replay. If the fault effect is masked when replaying from a checkpoint, SW state at that checkpoint was still good; the newest such checkpoint bounds the new latency. A sketch of this analysis loop follows.]
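Under assumed checkpoint/replay hooks (analysis-only infrastructure, as noted above, not part of a deployed system), the measurement loop might look like this:

#include <cstdint>
#include <vector>

// Hypothetical sketch: the newest checkpoint whose replay matches the
// golden run still holds uncorrupted SW state, bounding where software
// state first went bad.
struct Checkpoint { uint64_t instr; /* plus saved arch state, logs, ... */ };

static bool replayMatchesGolden(const Checkpoint&) { return false; }  // stub

// Checkpoints ordered oldest to newest. Returns the instruction count of
// the newest checkpoint with good SW state (0 if none qualifies).
uint64_t lastGoodCheckpoint(const std::vector<Checkpoint>& chkpts) {
    for (auto it = chkpts.rbegin(); it != chkpts.rend(); ++it)
        if (replayMatchesGolden(*it))
            return it->instr;  // new latency = detection point - this value
    return 0;
}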
Detection Latency - SPEC
[Charts: cumulative detection latency (instructions) for permanent and transient faults, comparing the old-latency SWAT curve, the new-latency SWAT curve, and the new latency with the out-of-bounds detector; buckets <10K, <100K, <1M, <10M, >10M.]
• Measuring new latency important to study recovery
• New techniques significantly reduce detection latency
– >90% of faults detected in <100K instructions
• Reduced detection latency impacts recoverability
Detection Latency - Server
[Charts: cumulative detection latency for permanent and transient faults in server workloads, comparing the old-latency SWAT curve, the new-latency SWAT curve, and the new latency with the out-of-bounds detector.]
• Measuring new latency important to study recovery
• New techniques significantly reduce detection latency
- >90% of faults detected in <100K instructions
• Reduced detection latency impacts recoverability
Implications for Fault Recovery
• Checkpointing
– Record pristine arch state for recovery
– Periodic registers snapshot, log memory writes
• I/O buffering
– Buffer external events until known to be fault-free
– HW buffer records device reads, buffers device writes
“Always-on”  must incur minimal overhead
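As a sketch of these two always-on mechanisms in software, under assumed structures (the deck builds on SafetyNet/ReVive-style schemes; this is not their exact design):

#include <cstdint>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical sketch: registers are snapshotted at each checkpoint, the
// first store to each address in an interval logs the old value (an undo
// log), and device writes are held back until the interval is known to be
// fault-free.
struct RegFile { uint64_t r[32]; };

class RecoverySupport {
    RegFile savedRegs{};
    std::vector<std::pair<uint64_t, uint64_t>> undoLog;  // (addr, old value)
    std::unordered_set<uint64_t> loggedAddrs;            // first-write filter
    std::vector<std::string> pendingDeviceWrites;        // buffered output
public:
    void beginInterval(const RegFile& regs) {            // periodic snapshot
        savedRegs = regs;
        undoLog.clear();
        loggedAddrs.clear();
    }
    void onStore(uint64_t addr, uint64_t oldVal) {       // before a store commits
        if (loggedAddrs.insert(addr).second)             // log first write only
            undoLog.emplace_back(addr, oldVal);
    }
    void onDeviceWrite(std::string bytes) {              // hold external output
        pendingDeviceWrites.push_back(std::move(bytes));
    }
    void commitInterval() { pendingDeviceWrites.clear(); /* release to device */ }
    void rollback(uint64_t* mem /* flat word-addressed memory, for illustration */) {
        for (auto it = undoLog.rbegin(); it != undoLog.rend(); ++it)
            mem[it->first] = it->second;                 // undo newest-first
        pendingDeviceWrites.clear();                     // drop tainted output
        // ...then restore savedRegs into the architectural register file.
    }
};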
Overheads from Memory Logging
[Chart: memory log size (KB) vs. checkpoint interval (10K to 10M instructions) for apache, sshd, squid, and mysql; log size grows steeply with interval length, approaching ~2,500KB at the longest intervals.]
• New techniques reduce chkpt overheads by over 60%
– Chkpt interval reduced to 100K from millions of instrs.
Overheads from Output Buffering
[Chart: output buffer size (KB) vs. checkpoint interval (10K to 10M instructions) for apache, sshd, squid, and mysql; buffer sizes stay under ~20KB even at the longest intervals.]
• New techniques reduce output buffer size to near-zero
– <5KB buffer for 100K chkpt interval (buffer for 2 chkpts)
– Near-zero overheads at 10K interval
Low Cost Fault Recovery (future work)
• New techniques significantly reduce recovery overheads
– 60% reduction in memory logs, near-zero output buffer
• But still do not enable ultra-low cost fault recovery
– ~400KB HW overheads for memory logs in HW (SafetyNet)
– High performance impact for in-memory logs (ReVive)
• Need ultra low-cost recovery scheme at short intervals
– Even shorter latencies
– Checkpoint only state that matters
– Application-aware insights – transactional apps, recovery
domains for OS, …
Fault Diagnosis
• Symptom-based detection is cheap but
– May incur long latency from activation to detection
– Difficult to diagnose root cause of fault
[Diagram: a detected symptom may stem from a software bug, a transient fault, or a permanent fault; which one?]
• Goal: Diagnose the fault with minimal hardware overhead
– Rarely invoked  higher perf overhead acceptable
SWAT Single-threaded Fault Diagnosis [Li et al., DSN ‘08]
• First, diagnosis for single threaded workload on one core
– Multithreaded w/ multicore later – several new challenges
Key ideas
• Single core fault model, multicore  fault-free core available
• Chkpt/replay for recovery  replay on good core, compare
• Synthesizing DMR, but only for diagnosis
[Diagram: traditional DMR runs P1 and P2 in lockstep with an always-on comparison, which is expensive. SWAT synthesizes DMR only on a fault: after a symptom, the checkpointed execution from faulty P1 is replayed on fault-free P2 and compared.]
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
Symptom detected  rollback and replay on the faulty core:
– No symptom  transient fault or non-deterministic s/w bug  continue execution
– Symptom  deterministic s/w bug or permanent h/w fault  rollback/replay on a good core:
 No symptom  permanent h/w fault, needs repair!
 Symptom  deterministic s/w bug (send to s/w layer)
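The same decision tree written out as a sketch; the replay hook below is an assumed stand-in for "roll back to the last checkpoint and re-execute":

// Hypothetical sketch of the SWAT diagnosis decision tree.
enum class Diagnosis {
    TransientOrNondetSWBug,  // symptom gone on same-core replay
    PermanentHWFault,        // persists on suspect core, gone on good core
    DeterministicSWBug       // persists on both cores
};

static bool replayHitsSymptom(int coreId) { (void)coreId; return false; }  // stub

Diagnosis diagnose(int suspectCore, int knownGoodCore) {
    if (!replayHitsSymptom(suspectCore))
        return Diagnosis::TransientOrNondetSWBug;  // continue execution
    if (!replayHitsSymptom(knownGoodCore))
        return Diagnosis::PermanentHWFault;        // repair/reconfigure suspectCore
    return Diagnosis::DeterministicSWBug;          // hand off to the software layer
}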
µarch-level Fault Diagnosis
[Flow: symptom detected  diagnosis separates software bug, transient fault, and permanent fault; a permanent fault proceeds to microarchitecture-level diagnosis, which concludes "unit X is faulty".]
Trace Based Fault Diagnosis (TBFD)
• µarch-level fault diagnosis using rollback/replay
• Key: Execution caused symptom  trace activates fault
– Deterministically replay trace on faulty, fault-free cores
– Divergence  faulty hardware used  diagnosis clues
• Diagnose faults to µarch units of processor
– Check µarch-level invariants in several parts of the processor
– Diagnosis in the out-of-order logic (meta-datapath) is complex
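A simplified sketch of the replay comparison at the heart of TBFD; real TBFD checks invariants across many µarch structures, while this illustration collapses them into a single resource id:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: re-execute the symptom-causing trace on the faulty
// and a fault-free core, and let the first divergent instruction implicate
// the µarch resources it used.
struct RetiredInstr {
    uint64_t pc;        // instruction address
    uint64_t result;    // value produced
    int      resource;  // µarch resource used (simplified to one id)
};

// Returns the implicated resource at the first divergence, or -1 if the
// two executions agree (no permanent-fault clue in this trace).
int firstDivergentResource(const std::vector<RetiredInstr>& faulty,
                           const std::vector<RetiredInstr>& good) {
    const std::size_t n = faulty.size() < good.size() ? faulty.size() : good.size();
    for (std::size_t i = 0; i < n; ++i)
        if (faulty[i].pc != good[i].pc || faulty[i].result != good[i].result)
            return faulty[i].resource;  // diagnosis clue
    return -1;
}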
Trace-Based Fault Diagnosis: Evaluation
• Goal: Diagnose faults at reasonable latency
• Faults diagnosed in 10 SPEC workloads
– ~8500 detected faults (98% of unmasked)
• Results
– 98% of detections successfully diagnosed
– 91% diagnosed within 1M instr (~0.5ms on 2GHz proc)
SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO ‘09]
• Challenge 1: Deterministic replay involves high overhead
• Challenge 2: Multithreaded apps share data among threads
[Diagram: Core 1's faulty store is loaded by Core 2 through shared memory, so the symptom fires on a fault-free core.]
• The symptom-causing core may not be the faulty core
• No known fault-free core in the system
mSWAT Diagnosis - Key Ideas
• Challenge: full-system deterministic replay is too expensive  key idea: isolated deterministic replay of each thread
• Challenge: no known good core  key idea: emulated TMR
[Diagram: threads TA–TD run on cores A–D; each thread's captured trace is re-executed in isolation on other cores, emulating TMR so executions can be compared. A sketch of the comparison follows.]
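One way the emulated-TMR comparison could be encoded, as a heavily hedged sketch (the fingerprinting and bookkeeping below are invented for illustration and are not mSWAT's actual mechanism):

#include <array>
#include <cstdint>
#include <vector>

// Hypothetical sketch: each thread's captured trace is executed three
// times on different cores; in any disagreement, the odd-one-out execution
// casts suspicion on the core that ran it.
struct ThreadRuns {
    std::array<int, 3>      core;  // which core ran each copy
    std::array<uint64_t, 3> hash;  // fingerprint of each execution
};

// Returns the core blamed by the divergences, or -1 if all runs agree.
int diagnoseFaultyCore(const std::vector<ThreadRuns>& threads, int numCores) {
    std::vector<int> suspicion(numCores, 0);
    for (const auto& t : threads)
        for (int i = 0; i < 3; ++i) {
            bool oddOneOut = t.hash[i] != t.hash[(i + 1) % 3] &&
                             t.hash[i] != t.hash[(i + 2) % 3];
            if (oddOneOut) ++suspicion[t.core[i]];
        }
    int worst = -1;
    for (int c = 0; c < numCores; ++c)
        if (suspicion[c] > 0 && (worst == -1 || suspicion[c] > suspicion[worst]))
            worst = c;
    return worst;  // -1: no divergence (e.g., the fault did not reactivate)
}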
mSWAT Diagnosis: Evaluation
• Diagnose detected perm faults in multithreaded apps
– Goal: Identify faulty core, TBFD for µarch-level diagnosis
– Challenges: Non-determinism, no fault-free core known
– ~4% of faults were detected on a fault-free core
• Results
– 95% of detected faults diagnosed
 All detections from fault-free core diagnosed
– 96% of diagnosed faults require <200KB buffers
 Can be stored in lower level cache  low HW overhead
• SWAT diagnosis can work with other symptom detectors
Summary: SWAT works!
• Very low-cost detectors [ASPLOS’08, DSN’08]: low SDC rate and detection latency
• Application-aware SWAT: even lower SDC rate and latency
• Accurate fault modeling [HPCA’09]
• Multithreaded workloads [MICRO’09]
• In-situ diagnosis [DSN’08]
[Figure: the contributions annotated on the SWAT timeline.]
Future Work
• Formalization of when/why SWAT works
• Near zero cost recovery
• More server/distributed applications
• App-level, customizable resilience
• Other core and off-core parts in multicore
• Other fault models
• Prototyping SWAT on FPGA w/ Michigan
• Interaction with safe programming
• Unifying with s/w resilience