SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin,
Mark D. Hill, and David A. Wood
Computer Sciences Department
University of Wisconsin—Madison
(C) 2002 Daniel Sorin
Wisconsin Multifacet Project
Overview
• Hardware fault frequencies are increasing
• Hardware checkpoint/recovery for multiprocessors
– Transparent to software
• SafetyNet Innovations
– Efficient coordination of checkpoint creation
– Optimized logging of checkpoint state
– Checkpoint validation off critical path
• SafetyNet achieves 3 goals, existing systems get 2
– High availability
– High performance
– Low cost
slide 2
SafetyNet – Daniel Sorin
Outline
• Availability
– Motivation
– Example targeted faults
– Differences between SafetyNet and existing approaches
• SafetyNet: Key Features
• A SafetyNet Implementation
• Evaluation
• Conclusions
slide 3
Availability Motivation
• Fault frequencies are increasing
1. Technological reasons
– Smaller transistors
– Denser wires
2. Architectural reasons
– More components
– More aggressive designs
• Marketing trends demand more availability
Need architectural solution to improve availability
slide 4
Which Faults Do We Target?
• Hardware faults in shared memory multiprocessors
– Mostly transient, some permanent, not chipkill
• We focus on faults outside of processor cores
– Why? Good techniques for processors (e.g., DIVA)
• Interconnection network
– Example: dead switch
– Detect with timeout
• Cache coherence protocols
– Example: lost coherence message
– Detect with timeout
slide 5
[Figure: three CPUs connected by an interconnection network]
System Hardware Design Space
Existing systems get only 2 out of 3 features
– Backward Error Recovery (Tandem NonStop)
– Forward Error Recovery (IBM mainframes)
– Servers and PCs
slide 6
Outline
• Availability
• SafetyNet: Key Features
– System abstraction
– Innovations
• A SafetyNet Implementation
• Evaluation
• Conclusions
slide 7
SafetyNet Abstraction
[Figure: system state shown as a sequence of versions, each with processor and memory state: the most recently validated checkpoint (the recovery point), the checkpoints awaiting validation, and the active (architectural) state of the system]
slide 8
SafetyNet Execution Model
[Figure: timeline of checkpoints CP1 to CP5. Each checkpoint moves through the states active, validating, and recovery point. Events marked along the time axis: Create CP3, Validate CP2, Create CP4, Recovery]
slide 9
SafetyNet Goal and Innovations
• Goal: Recover to consistent checkpoint if fault
• Inefficient but correct solution
– Periodically quiesce entire system to take checkpoint
– Checkpoints include all system state
– Stop system to validate checkpoints as fault free
• SafetyNet innovations:
1. Efficient coordination of checkpoint creation across system
2. Optimized checkpointing of system state
3. Pipelined validation of checkpoints in background
slide 10
Key #1
Coordinating Checkpoint Creation
• Checkpoints must reflect consistent system state
– Nodes must agree on memory values and coherence
• Coordinate checkpoints in logical time
– Logical time is time base that respects causality
• Each node maintains its own logical clock
– Create checkpoint every K logical cycles
We need logical time base that helps coordination
slide 11
Logical Time Base
• Many logical time bases exist
– Depends on coherence protocol
• Broadcast snooping systems
– Increment clock for every coherence request processed
– Nodes can be at different logical times
– All nodes can agree when coherence transaction happens
• Directory protocol systems
– Based on loosely synchronized physical clock (10 kHz)
– More complicated explanation; refer to paper for details
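The snooping case can be sketched roughly as follows. This is a minimal illustration, assuming a totally ordered broadcast so that every node processes the same sequence of coherence requests; `K`, `Node`, `process`, and `run` are hypothetical names, not taken from the talk. Each node increments its own logical clock once per request processed and starts a new checkpoint interval every K logical cycles, so a node that lags in physical time still assigns each request to the same interval.

```python
K = 4  # hypothetical checkpoint interval, in logical cycles


class Node:
    """One node with its own logical clock (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.logical_clock = 0
        self.seen = []  # (request, checkpoint interval) pairs

    def process(self, request):
        # Interval the request falls in, counted from 1 for readability.
        interval = self.logical_clock // K + 1
        self.seen.append((request, interval))
        # Snooping: one logical cycle per coherence request processed.
        self.logical_clock += 1


def run(requests, progress):
    """Let each node process only its first progress[name] requests."""
    nodes = {name: Node(name) for name in progress}
    for name, node in nodes.items():
        for request in requests[: progress[name]]:
            node.process(request)
    return nodes


requests = [f"req{i}" for i in range(10)]
# P2 lags behind P1 in physical time, so it has processed fewer requests.
nodes = run(requests, {"P1": 10, "P2": 6})

# Both nodes agree on the interval of every request they have both processed.
common = len(nodes["P2"].seen)
agree = nodes["P1"].seen[:common] == nodes["P2"].seen
print(agree)
```

The two nodes sit at different logical times (10 and 6), yet they never disagree about which checkpoint interval a given request belongs to.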
slide 12
Key #2
Optimized Checkpointing of System State
• Checkpoint all state needed to resume execution
– Processor registers
– Memory state (including cache state)
– Cache coherence state
• Processors save register state at each checkpoint
– Copy registers into shadow registers
• Logically, cache/memory log old data every time:
– Store overwrites an old checkpoint of block
– Block’s coherence ownership is transferred
How can we reduce the amount of logged state?
slide 13
Optimized Logging
• Insight: only recover at checkpoint granularity
• Intervals between checkpoints group writes/transfers
– E.g., checkpoint every 100,000 cycles (100 μsec at 1GHz)
• Only log first store/transfer per block per interval
• Optimization at cache:
– Label cache blocks with checkpoint numbers (CNs)
– If write/transfer is from same checkpoint, no logging needed
Large benefit due to locality of references
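A minimal sketch of the cache-side optimization, covering stores only (ownership transfers would use the same check). The `Cache` class, its fields, and the tuple layout of the log are assumptions made for illustration. A block is logged only when its CN differs from the current interval, so repeated stores to a hot block cost a single log entry per interval.

```python
class Cache:
    """Cache with a checkpoint log buffer (CLB); names are illustrative."""

    def __init__(self):
        self.current_cn = 1   # checkpoint interval now in progress
        self.blocks = {}      # addr -> {"data": value, "cn": int or None}
        self.clb = []         # FIFO log of (cn, addr, old_data)

    def new_checkpoint(self):
        self.current_cn += 1

    def store(self, addr, value):
        block = self.blocks.setdefault(addr, {"data": None, "cn": None})
        if block["cn"] != self.current_cn:
            # First store to this block in this interval: log the old data.
            self.clb.append((self.current_cn, addr, block["data"]))
            block["cn"] = self.current_cn
        # Later stores in the same interval overwrite without logging.
        block["data"] = value


cache = Cache()
cache.blocks["B"] = {"data": 2000, "cn": None}
for value in (3000, 3001, 3002):   # three stores in the same interval
    cache.store("B", value)
cache.new_checkpoint()
cache.store("B", 4000)             # first store of the next interval

print(cache.clb)  # [(1, 'B', 2000), (2, 'B', 3002)]
```

Four stores produce two log entries here; with more stores per interval the saving grows accordingly.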
slide 14
Key #3
Checkpoint Validation in Background
• Only validate when all agree checkpoint is fault-free
– Example: no outstanding coherence requests in checkpoint
• Nodes perform fault detection, then coordinate
• Can be in background and pipelined
– Reason why we have checkpoints awaiting validation
• Can hide long fault detection latencies
– Number of outstanding checkpoints x checkpoint length
– Design tolerance to be longer than longest detection latency
Don’t slow down execution to validate checkpoints
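The tolerance rule above is simple arithmetic, sketched here. The 100,000-cycle interval at 1 GHz is the talk's own example; the count of 4 outstanding checkpoints and the function name are hypothetical, chosen only to show the calculation.

```python
def detection_latency_tolerance_us(outstanding_checkpoints, interval_cycles, clock_hz):
    """Longest fault detection latency (in microseconds) that can be hidden."""
    interval_us = interval_cycles * 1_000_000 / clock_hz
    return outstanding_checkpoints * interval_us


# 100,000-cycle intervals at 1 GHz come from the talk; 4 outstanding
# checkpoints is an assumed value for illustration.
tolerance = detection_latency_tolerance_us(4, 100_000, 1_000_000_000)
print(f"{tolerance:.0f} microseconds")  # 400 microseconds
```

A designer would pick the two parameters so that this product exceeds the longest detection latency in the system, for example the longest coherence request timeout.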
slide 15
Outline
• Availability
• SafetyNet: Key Features
• A SafetyNet Implementation
• Evaluation
• Conclusions
slide 16
System Model
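[Figure: one node, containing a CPU with register checkpoints (reg CPs), cache(s) with a CLB, a network interface, memory with a CLB, and an I/O bridge, attached to an NS half switch and an EW half switch]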
• Checkpoint Log Buffer (CLB) at cache and memory
• Just FIFO log of block writes/transfers
slide 17
Example of SafetyNet Operation
[Figure: P1 and P2 attached to the interconnection network]
P1 Regs: CP2, CP3
P1 Cache (Addr State CN data): B M - 2000
P1 CLB (Addr State data): empty
P2 Regs: CP2, CP3
P2 Cache (Addr State CN data): empty
P2 CLB (Addr State data): empty
Recovery point is checkpoint 2. Most recent checkpoint is 3.
Active checkpoint is 4. Processor 1 owns block B (validated).
slide 18
Example of SafetyNet Operation
[Figure: P1 and P2 attached to the interconnection network]
P1 Regs: CP2, CP3
P1 Cache (Addr State CN data): B M 4 3000
P1 CLB (Addr State data): B M 2000
P2 Regs: CP2, CP3
P2 Cache (Addr State CN data): empty
P2 CLB (Addr State data): empty
P1 stores 3000 to block B between checkpoints 3 and 4.
Logs old data.
slide 19
Example of SafetyNet Operation
[Figure: P1 and P2 attached to the interconnection network]
P1 Regs: CP2, CP3
P1 Cache (Addr State CN data): B M 4 3000
P1 CLB (Addr State data): B M 2000
P2 Regs: CP2, CP3
P2 Cache (Addr State CN data): empty
P2 CLB (Addr State data): empty
P1 loads from block B.
SafetyNet uninvolved.
slide 20
Example of SafetyNet Operation
[Figure: P1 and P2 attached to the interconnection network]
P1 Regs: CP2, CP3, CP4
P1 Cache (Addr State CN data): B M 4 3000
P1 CLB (Addr State data): B M 2000
P2 Regs: CP2, CP3, CP4
P2 Cache (Addr State CN data): empty
P2 CLB (Addr State data): empty
Coordinated creation of checkpoint 4. Active checkpoint is 5.
Save register state at beginning of checkpoint 4.
slide 21
Example of SafetyNet Operation
[Figure: P1 and P2 attached to the interconnection network]
P1 Regs: CP2, CP3, CP4
P1 Cache (Addr State CN data): empty
P1 CLB (Addr State data): B M 2000; B M 3000
P2 Regs: CP2, CP3, CP4
P2 Cache (Addr State CN data): B M 5 3000
P2 CLB (Addr State data): empty
P2 requests ownership of block B. P1 logs old data and sends copy
to P2. P1 invalidates cache entry.
slide 22
Example of SafetyNet Operation
[Figure: P1 and P2 attached to the interconnection network]
P1 Regs: CP3, CP4
P1 Cache (Addr State CN data): empty
P1 CLB (Addr State data): B M 2000; B M 3000
P2 Regs: CP3, CP4
P2 Cache (Addr State CN data): B M 5 3000
P2 CLB (Addr State data): empty
Validation of checkpoint 3. Discard checkpoint 2 registers.
Recovery point is now beginning of checkpoint 3.
slide 23
Example of SafetyNet Operation
[Figure: P1 and P2 attached to the interconnection network]
P1 Regs: CP3
P1 Cache (Addr State CN data): B M - 2000
P1 CLB (Addr State data): empty
P2 Regs: CP3
P2 Cache (Addr State CN data): empty
P2 CLB (Addr State data): empty
Recovery (to checkpoint 3). Restore CP3 registers. Restore
ownership of B to P1. Invalidate B at P2. Now restart system!
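The store, ownership transfer, and recovery steps of this example can be replayed in a small software model. This is a sketch under stated assumptions, not the hardware design: the `Node` class and its methods are hypothetical, register checkpoints are omitted, and recovery is modelled as discarding cache blocks labelled with an unvalidated CN and then unrolling the CLB newest-first.

```python
class Node:
    """One processor node: a cache plus a checkpoint log buffer (CLB)."""

    def __init__(self):
        self.cache = {}  # addr -> {"state": str, "cn": int or None, "data": int}
        self.clb = []    # FIFO log of (cn, addr, state, old_data)

    def _log_if_first(self, cn, addr):
        old = self.cache[addr]
        if old["cn"] != cn:  # first write/transfer of the block this interval
            self.clb.append((cn, addr, old["state"], old["data"]))

    def store(self, cn, addr, value):
        self._log_if_first(cn, addr)
        self.cache[addr].update(cn=cn, data=value)

    def transfer_to(self, other, cn, addr):
        """Hand ownership of addr to another node during interval cn."""
        self._log_if_first(cn, addr)   # log old data before giving it up
        block = self.cache.pop(addr)   # invalidate the local copy
        other.cache[addr] = {"state": "M", "cn": cn, "data": block["data"]}

    def recover(self, recovery_cn):
        # Assumption: blocks labelled with an unvalidated CN are discarded.
        stale = [a for a, b in self.cache.items()
                 if b["cn"] is not None and b["cn"] > recovery_cn]
        for addr in stale:
            del self.cache[addr]
        # Unroll the log newest-first so the oldest logged value wins.
        while self.clb and self.clb[-1][0] > recovery_cn:
            _, addr, state, data = self.clb.pop()
            self.cache[addr] = {"state": state, "cn": None, "data": data}


p1, p2 = Node(), Node()
p1.cache["B"] = {"state": "M", "cn": None, "data": 2000}  # validated block

p1.store(4, "B", 3000)       # logs (B, M, 2000) in P1's CLB
p1.transfer_to(p2, 5, "B")   # logs (B, M, 3000); P2 holds B with CN 5
p1.recover(3)                # recovery to checkpoint 3
p2.recover(3)

print(p1.cache)  # {'B': {'state': 'M', 'cn': None, 'data': 2000}}
print(p2.cache)  # {}
```

After recovery P1 again owns B with the value 2000 and P2 no longer holds the block, which matches the final slide of the example.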
slide 24
System Recovery and Restart
• Any component can trigger recovery
– E.g., processor times out on coherence request
• All in-progress transactions are dropped
– By definition, these transactions are not validated
• After recovery, resume execution
– May have to reconfigure (e.g., route around dead link)
– Must replay work that was lost
slide 25
I/O and the Outside World
• Output commit problem – Can’t send uncommitted data beyond sphere of recoverability
– SafetyNet includes processors, memory, coherence
– Doesn’t include network, disks, printer, etc.
– Standard solution: wait to communicate with I/O
– Only send validated data to outside world
• Input commit problem – Input can’t be recovered
– Standard solution: log input
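The standard output commit solution can be sketched as a buffer at the edge of the sphere of recoverability. The `OutputBuffer` class and its method names are hypothetical, and tagging each output with the checkpoint number of the interval that produced it is an assumption made for illustration.

```python
from collections import deque


class OutputBuffer:
    """Holds outbound I/O until the checkpoint that produced it is validated."""

    def __init__(self):
        self.pending = deque()  # (cn, payload), in program order
        self.released = []      # stands in for the device outside the sphere

    def write(self, cn, payload):
        self.pending.append((cn, payload))

    def on_validate(self, validated_cn):
        # Everything produced at or before the validated checkpoint is safe.
        while self.pending and self.pending[0][0] <= validated_cn:
            self.released.append(self.pending.popleft()[1])

    def on_recovery(self, recovery_cn):
        # Output from rolled-back intervals was never sent; just drop it.
        self.pending = deque(p for p in self.pending if p[0] <= recovery_cn)


out = OutputBuffer()
out.write(4, "disk write A")
out.write(5, "disk write B")
out.on_validate(4)   # releases only the output of interval 4
out.on_recovery(4)   # interval 5 is rolled back, so its output is discarded
print(out.released, list(out.pending))  # ['disk write A'] []
```

The cost of this scheme is added output latency: data waits in the buffer for as long as validation takes.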
slide 26
Outline
• Availability
• SafetyNet: Key Features
• A SafetyNet Implementation
• Evaluation
– Methodology
– Runtime performance
• Conclusions
slide 27
Methodology: Simulation & Workloads
• Simulation
– Simics full-system simulation of 16-proc SPARC system
– Detailed timing simulation of memory system
• MOSI directory cache coherence protocol
– Simple, in-order processor model
– 128KB L1I/D, 4MB L2, 512KB CLB
• Workloads (commercial and scientific)
– Online transaction processing (OLTP): IBM’s DB2
– Static web server: Apache driven by SURGE
– Dynamic web server: Slashcode
– Java server: SpecJBB
– Scientific: barnes-hut from SPLASH2
slide 28
Runtime Performance
Normalize results to unprotected system
slide 29
Runtime Performance
Unprotected system crashes if fault occurs
slide 30
Runtime Performance
Error bars = +/- one standard deviation
SafetyNet has same fault-free performance as unprotected
slide 31
Runtime Performance
SafetyNet avoids crashes in presence of lost messages
slide 32
Runtime Performance
SafetyNet avoids crashes in presence of dead half-switch
slide 33
High-Level Comparison to ReVive
Feature                          ReVive                      SafetyNet
Backward error recovery scheme   Yes                         Yes
Fault model                      Transient & permanent       Transient & some permanent
Processor modification           No                          Yes
Software modification            Minor                       None
Fault-free performance           6-10% loss                  No loss
Output commit latency            At least 100 milliseconds   No more than 0.4 milliseconds
slide 34
Conclusions
• SafetyNet: global, consistent checkpointing
– Low cost and high performance
– Efficient logical time checkpoint coordination
– Optimized checkpointing of state
– Pipelined, in-background checkpoint validation
• Improved availability
– Avoid crash in case of fault
– Same fault-free performance
slide 35
Performance vs. CLB Size
Caveats
• Scaled workloads
• 100,000 cycle intervals
slide 36
Traditional Availability
• Forward Error Recovery (FER)
– Use redundant hardware to mask faults
– E.g., triple modular redundancy with voter or pair&spare
– Systems: IBM mainframes, Intel 432, Stratus
– Sacrifices cost to achieve availability
• Backward Error Recovery (BER)
– If fault detected, recover system to pre-fault state
– Periodically stop system and save state or log changes
– Fault? Restore pre-fault checkpoint or unroll log
– Systems: Sequoia, Synapse N+1, Tandem NonStop
– Sacrifices performance to achieve availability
slide 37