SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood
Computer Sciences Department
University of Wisconsin–Madison
(C) 2002 Daniel Sorin
Wisconsin Multifacet Project

Overview

• Hardware fault frequencies are increasing
• Hardware checkpoint/recovery for multiprocessors
  – Transparent to software
• SafetyNet Innovations
  – Efficient coordination of checkpoint creation
  – Optimized logging of checkpoint state
  – Checkpoint validation off critical path
• SafetyNet achieves 3 goals, existing systems get 2
  – High availability
  – High performance
  – Low cost

slide 2  SafetyNet – Daniel Sorin

Outline

• Availability
  – Motivation
  – Example targeted faults
  – Differences between SafetyNet and existing approaches
• SafetyNet: Key Features
• A SafetyNet Implementation
• Evaluation
• Conclusions

Availability Motivation

• Fault frequencies are increasing
  1. Technological reasons
     – Smaller transistors
     – Denser wires
  2. Architectural reasons
     – More components
     – More aggressive designs
• Marketing trends demand more availability

Need architectural solution to improve availability

Which Faults Do We Target?

• Hardware faults in shared memory multiprocessors
  – Mostly transient, some permanent, not chipkill
• We focus on faults outside of processor cores
  – Why? Good techniques for processors (e.g., DIVA)
• Interconnection network
  – Example: dead switch
  – Detect with timeout
• Cache coherence protocols
  – Example: lost coherence message
  – Detect with timeout

[Figure: three CPUs connected by an interconnection network]

System Hardware Design Space

Existing systems get only 2 out of 3 features

[Figure: design space with labeled points: Backward Error Recovery (Tandem NonStop), Forward Error Recovery (IBM mainframes), Servers and PCs]

Outline

• Availability
• SafetyNet: Key Features
  – System abstraction
  – Innovations
• A SafetyNet Implementation
• Evaluation
• Conclusions

SafetyNet Abstraction

[Figure: system state shown as the most recently validated checkpoint (the recovery point), checkpoints awaiting validation, and the current active (architectural) state, each with processor and memory components]

SafetyNet Execution Model

[Figure: timeline of checkpoints CP1 through CP5, each moving through the states active, validating, and recovery pt; events along the time axis: Create CP3, Validate CP2, Create CP4, Recovery]

SafetyNet Goal and Innovations

• Goal: Recover to consistent checkpoint if fault
• Inefficient but correct solution
  – Periodically quiesce entire system to take checkpoint
  – Checkpoints include all system state
  – Stop system to validate checkpoints as fault free
• SafetyNet innovations:
  1. Efficient coordination of checkpoint creation across system
  2. Optimized checkpointing of system state
  3. Pipelined validation of checkpoints in background

Key #1: Coordinating Checkpoint Creation

• Checkpoints must reflect consistent system state
  – Nodes must agree on memory values and coherence
• Coordinate checkpoints in logical time
  – Logical time is time base that respects causality
• Each node maintains its own logical clock
  – Create checkpoint every K logical cycles

We need logical time base that helps coordination

Logical Time Base

• Many logical time bases exist
  – Depends on coherence protocol
• Broadcast snooping systems
  – Increment clock for every coherence request processed
  – Nodes can be at different logical times
  – All nodes can agree when coherence transaction happens
• Directory protocol systems
  – Based on loosely synchronized physical clock (10 kHz)
  – More complicated explanation; refer to paper for details

Key #2: Optimized Checkpointing of System State

• Checkpoint all state needed to resume execution
  – Processor registers
  – Memory state (including cache state)
  – Cache coherence state
• Processors save register state at each checkpoint
  – Copy registers into shadow registers
• Logically, cache/memory log old data every time:
  – Store overwrites an old checkpoint of block
  – Block’s coherence ownership is transferred

How can we reduce the amount of logged state?
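
The logical-time coordination from Key #1 and the Logical Time Base slide can be sketched for the broadcast snooping case. This is a minimal illustration, not the SafetyNet hardware design: it assumes every node processes the same totally ordered stream of coherence requests, and the class `Node`, the interval `K = 4`, and the request names are hypothetical values chosen only for the example.

```python
from dataclasses import dataclass, field

K = 4  # hypothetical checkpoint interval, in logical cycles


@dataclass
class Node:
    name: str
    logical_clock: int = 0
    checkpoint_number: int = 1
    # Checkpoint interval in which this node processed each request.
    request_cn: dict = field(default_factory=dict)

    def process(self, request_id: str) -> None:
        """Process one coherence request, in broadcast order."""
        self.request_cn[request_id] = self.checkpoint_number
        # The logical clock advances once per processed request.
        self.logical_clock += 1
        if self.logical_clock % K == 0:
            # Every K logical cycles the node begins a new checkpoint interval.
            self.checkpoint_number += 1


requests = [f"req{i}" for i in range(10)]
fast, slow = Node("fast"), Node("slow")

# The fast node has processed all ten requests; the slow node only six.
for r in requests:
    fast.process(r)
for r in requests[:6]:
    slow.process(r)

# The nodes are at different logical times, yet they agree on the
# checkpoint interval of every request both have processed.
agree = all(fast.request_cn[r] == cn for r, cn in slow.request_cn.items())
print(fast.logical_clock, slow.logical_clock, agree)
```

Because the checkpoint boundary is defined by the logical clock rather than by wall-clock time, no node has to stop and wait for the others when a checkpoint is created.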

Optimized Logging

• Insight: only recover at checkpoint granularity
• Intervals between checkpoints group writes/transfers
  – E.g., checkpoint every 100,000 cycles (100 μsec at 1 GHz)
• Only log first store/transfer per block per interval
• Optimization at cache:
  – Label cache blocks with checkpoint numbers (CNs)
  – If write/transfer is from same checkpoint, no logging needed

Large benefit due to locality of references

Key #3: Checkpoint Validation in Background

• Only validate when all agree checkpoint is fault-free
  – Example: no outstanding coherence requests in checkpoint
• Nodes perform fault detection, then coordinate
• Can be in background and pipelined
  – Reason why we have checkpoints awaiting validation
• Can hide long fault detection latencies
  – Number of outstanding checkpoints x checkpoint length
  – Design tolerance to be longer than longest detection latency

Don’t slow down execution to validate checkpoints

Outline

• Availability
• SafetyNet: Key Features
• A SafetyNet Implementation
• Evaluation
• Conclusions

System Model

[Figure: one node, with CPU, register checkpoints (reg CPs), cache(s) with CLB, memory with CLB, network interface, I/O bridge, NS half switch, and EW half switch]

• Checkpoint Log Buffer (CLB) at cache and memory
• Just FIFO log of block writes/transfers

Example of SafetyNet Operation

P1  Regs:  CP2, CP3
    Cache: Addr B, State M, CN (blank), data 2000
    CLB:   (empty)
P2  Regs:  CP2, CP3
    Cache: (empty)
    CLB:   (empty)
[P1 and P2 are connected by the interconnection network]

Recovery point is checkpoint 2. Most recent checkpoint is 3. Active checkpoint is 4. Processor 1 owns block B (validated).
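
The first-store-per-interval rule from the Optimized Logging slide, which the walk-through exercises, can be sketched as follows. This is a simplified model under stated assumptions: it covers stores only (ownership transfers would be logged the same way), and the names `Block`, `Cache`, `store`, and `create_checkpoint` are hypothetical. The values 2000 and 3000 and the active checkpoint number 4 mirror the example slides; the other values are made up.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    data: int
    cn: Optional[int] = None  # None models a block with no CN label yet


class Cache:
    def __init__(self, active_cn: int) -> None:
        self.blocks = {}         # addr -> Block
        self.clb = deque()       # FIFO Checkpoint Log Buffer of (addr, old data)
        self.active_cn = active_cn

    def store(self, addr: str, value: int) -> None:
        block = self.blocks[addr]
        if block.cn != self.active_cn:
            # First store to this block in the current interval:
            # log the old data once and relabel the block.
            self.clb.append((addr, block.data))
            block.cn = self.active_cn
        # Later stores in the same interval need no logging.
        block.data = value

    def create_checkpoint(self) -> None:
        self.active_cn += 1


cache = Cache(active_cn=4)
cache.blocks["B"] = Block(data=2000)

cache.store("B", 3000)     # logged: old data 2000
cache.store("B", 3001)     # same interval, not logged
cache.store("B", 3002)     # same interval, not logged
cache.create_checkpoint()  # active checkpoint becomes 5
cache.store("B", 4000)     # first store of the new interval: logs 3002

print(list(cache.clb))
```

Four stores produce only two log entries here, which is the locality benefit the slide refers to: repeated writes to a block within one interval cost a single CLB entry.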

Example of SafetyNet Operation

P1  Regs:  CP2, CP3
    Cache: Addr B, State M, CN 4, data 3000
    CLB:   Addr B, State M, data 2000
P2  Regs:  CP2, CP3
    Cache: (empty)
    CLB:   (empty)
[P1 and P2 are connected by the interconnection network]

P1 stores 3000 to block B between checkpoints 3 and 4. Logs old data.

Example of SafetyNet Operation

P1  Regs:  CP2, CP3
    Cache: Addr B, State M, CN 4, data 3000
    CLB:   Addr B, State M, data 2000
P2  Regs:  CP2, CP3
    Cache: (empty)
    CLB:   (empty)
[P1 and P2 are connected by the interconnection network]

P1 loads from block B. SafetyNet uninvolved.

Example of SafetyNet Operation

P1  Regs:  CP2, CP3, CP4
    Cache: Addr B, State M, CN 4, data 3000
    CLB:   Addr B, State M, data 2000
P2  Regs:  CP2, CP3, CP4
    Cache: (empty)
    CLB:   (empty)
[P1 and P2 are connected by the interconnection network]

Coordinated creation of checkpoint 4. Active checkpoint is 5. Save register state at beginning of checkpoint 4.

Example of SafetyNet Operation

P1  Regs:  CP2, CP3, CP4
    Cache: (empty)
    CLB:   Addr B, State M, data 2000
           Addr B, State M, data 3000
P2  Regs:  CP2, CP3, CP4
    Cache: Addr B, State M, CN 5, data 3000
    CLB:   (empty)
[P1 and P2 are connected by the interconnection network]

P2 requests ownership of block B. P1 logs old data and sends copy to P2. P1 invalidates cache entry.

Example of SafetyNet Operation

P1  Regs:  CP3, CP4
    Cache: (empty)
    CLB:   Addr B, State M, data 2000
           Addr B, State M, data 3000
P2  Regs:  CP3, CP4
    Cache: Addr B, State M, CN 5, data 3000
    CLB:   (empty)
[P1 and P2 are connected by the interconnection network]

Validation of checkpoint 3. Discard checkpoint 2 registers. Recovery point is now beginning of checkpoint 3.

Example of SafetyNet Operation

P1  Regs:  CP3
    Cache: Addr B, State M, CN (blank), data 2000
    CLB:   (empty)
P2  Regs:  CP3
    Cache: (empty)
    CLB:   (empty)
[P1 and P2 are connected by the interconnection network]

Recovery (to checkpoint 3). Restore CP3 registers.
Restore ownership of B to P1. Invalidate B at P2. Now restart system!

System Recovery and Restart

• Any component can trigger recovery
  – E.g., processor times out on coherence request
• All in-progress transactions are dropped
  – By definition, these transactions are not validated
• After recovery, resume execution
  – May have to reconfigure (e.g., route around dead link)
  – Must replay work that was lost

I/O and the Outside World

• Output commit problem
  – Can’t send uncommitted data beyond sphere of recoverability
    • SafetyNet includes processors, memory, coherence
    • Doesn’t include network, disks, printer, etc.
    • Standard solution: wait to communicate with I/O
    • Only send validated data to outside world
• Input commit problem
  – Input can’t be recovered
  – Standard solution: log input

Outline

• Availability
• SafetyNet: Key Features
• A SafetyNet Implementation
• Evaluation
  – Methodology
  – Runtime performance
• Conclusions

Methodology: Simulation & Workloads

• Simulation
  – Simics full-system simulation of 16-proc SPARC system
  – Detailed timing simulation of memory system
    • MOSI directory cache coherence protocol
  – Simple, in-order processor model
  – 128KB L1I/D, 4MB L2, 512KB CLB
• Workloads (commercial and scientific)
  – Online transaction processing (OLTP): IBM’s DB2
  – Static web server: Apache driven by SURGE
  – Dynamic web server: Slashcode
  – Java server: SpecJBB
  – Scientific: barnes-hut from SPLASH2

Runtime Performance

[Figure: runtime performance results, normalized to the unprotected system, with error bars; shown over five successive slides with the captions below]

• Normalize results to unprotected system
• Unprotected system crashes if fault occurs
• Error bars = +/- one standard deviation
• SafetyNet has same fault-free performance as unprotected
• SafetyNet avoids crashes in presence of lost messages
• SafetyNet avoids crashes in presence of dead half-switch

High-Level Comparison to ReVive

                                 ReVive                     SafetyNet
Backward error recovery scheme   Yes                        Yes
Fault model                      Transient & permanent      Transient & some permanent
Processor modification           No                         Yes
Software modification            Minor                      None
Fault-free performance           6-10% loss                 No loss
Output commit latency            At least 100 milliseconds  No more than 0.4 milliseconds

Conclusions

• SafetyNet: global, consistent checkpointing
  – Low cost and high performance
  – Efficient logical time checkpoint coordination
  – Optimized checkpointing of state
  – Pipelined, in-background checkpoint validation
• Improved availability
  – Avoid crash in case of fault
  – Same fault-free performance

Performance vs. CLB Size

[Figure: performance as a function of CLB size]

Caveats
• Scaled workloads
• 100,000 cycle intervals

Traditional Availability

• Forward Error Recovery (FER)
  – Use redundant hardware to mask faults
  – E.g., triple modular redundancy with voter or pair&spare
  – Systems: IBM mainframes, Intel 432, Stratus
  – Sacrifices cost to achieve availability
• Backward Error Recovery (BER)
  – If fault detected, recover system to pre-fault state
  – Periodically stop system and save state or log changes
  – Fault? Restore pre-fault checkpoint or unroll log
  – Systems: Sequoia, Synapse N+1, Tandem NonStop
  – Sacrifices performance to achieve availability
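
The "log changes" and "unroll log" steps of backward error recovery listed above can be illustrated with a generic undo log. This is a sketch of the general technique only, not of any of the named systems; the class `UndoLogMemory` and its methods are hypothetical.

```python
class UndoLogMemory:
    """Memory that logs old values so writes can be unrolled after a fault."""

    def __init__(self, initial: dict) -> None:
        self.memory = dict(initial)
        self.log = []  # (addr, old value), appended in write order

    def checkpoint(self) -> None:
        # State is now considered safe: earlier log entries are not needed.
        self.log.clear()

    def write(self, addr: str, value: int) -> None:
        # Save the value being overwritten before changing memory.
        self.log.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def recover(self) -> None:
        # Unroll the log newest-first so the oldest logged value wins.
        while self.log:
            addr, old = self.log.pop()
            self.memory[addr] = old


mem = UndoLogMemory({"A": 1, "B": 2})
mem.checkpoint()
mem.write("A", 10)
mem.write("B", 20)
mem.write("A", 11)

# Fault detected: return to the state at the last checkpoint.
mem.recover()
print(mem.memory)
```

Unrolling in reverse order matters when one address is written more than once: the last entry popped for `A` is the value it held at the checkpoint.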