MPICH-V: Fault Tolerant MPI
Rachit Chawla

Outline
- Introduction
- Objectives
- Architecture
- Performance
- Conclusion

Fault Tolerant Techniques
- Transparency
  - User level: re-launch the application from a previous coherent snapshot
  - API level: error codes are returned, to be handled by the programmer
  - Communication library level: transparent fault tolerant communication layer
- Checkpoint coordination
  - Coordinated
  - Uncoordinated
- Message logging
  - Optimistic: all events are logged in volatile memory
  - Pessimistic: all events are logged on stable storage

Checkpointing
- Coordinated
  - A coordinator initiates each checkpoint
  - No domino effect
  - Simplified rollback recovery
- Uncoordinated
  - Independent checkpoints
  - Possibility of a domino effect
  - Complex rollback recovery

Logging
- Piece-Wise Deterministic (PWD) assumption: for every non-deterministic event, store enough information in a determinant to replay it
- Non-deterministic events: message sends/receives, software interrupts, system calls
- Replay: re-execute the events that occurred after the last checkpoint

Objectives
- Automatic fault tolerance
- Transparency (for the programmer and the user)
- Tolerate n faults (n being the number of MPI processes)
- Scalable infrastructure and protocol
- No global synchronization

MPICH-V
- Extension of MPICH at the communication library level: implements all communication subroutines of MPICH
- Tolerant to the volatility of nodes: node failures and network failures
- Uncoordinated checkpointing, using Checkpoint Servers
- Distributed pessimistic message logging, using Channel Memories

Architecture
- Communication library: relink the application with "libmpichv"
- Run-time environment:
  - Dispatcher
  - Channel Memories (CM)
  - Checkpoint Servers (CS)
  - Computing/communicating nodes

Big Picture
[Figure: the Dispatcher, Channel Memories and a Checkpoint Server sit on the stable network, while the computing nodes sit behind firewalls]

Overview
- Channel Memories are dedicated nodes acting as message repositories; messages are tunneled through them
- Each node has a home CM: sending a message means sending it to the receiver's home CM
- Checkpointing and logging are distributed: the execution context goes to the CS, the communication context to the CM

Dispatcher (Stable)
- Initializes the execution: a stable, centralized Service Registry is started, providing services (CM, CS) to the nodes; CMs and CSs are assigned in a round-robin fashion
- Launches the instances of the MPI processes on the nodes
- Monitors the node state through "alive" signals and time-outs (see the dispatcher sketch below)
- Reschedules the tasks of dead MPI process instances on available nodes

Steps
- When a node starts executing, it contacts the Service Registry and is assigned a CM and a CS based on its rank
- It periodically sends an "alive" signal containing its rank to the dispatcher
- On a failure, its execution is restarted; the other processes remain unaware of the failure
- The CM allows a single connection per rank: if the faulty process reconnects, an error code is returned and it exits

Channel Memory (Stable)
- Logs every message
- Messages are sent and received through PUT and GET operations, which are transactions
- FIFO order is maintained for each receiver
- On a restart, a node replays its communications using its CMs (see the Channel Memory sketch below)
[Figure: nodes PUT messages into and GET messages from Channel Memories across the network]

Checkpoint Server (Stable)
- Checkpoints are stored on stable storage
- During execution, a node performs a checkpoint and sends the image to its CS
- On a restart, the dispatcher tells the node which CS to contact for the task's last checkpoint; the node contacts that CS with its rank and gets the last checkpoint image back (see the Checkpoint Server sketch below)
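Dispatcher sketch. Here is a minimal C sketch of the alive-signal/time-out bookkeeping the dispatcher performs, as described in the Dispatcher and Steps slides. Every name in it (dispatcher_t, record_alive, check_timeouts) and the 30-second time-out are illustrative assumptions; this is not the actual MPICH-V runtime code.

/*
 * Minimal sketch of the dispatcher's failure detection: each MPI process
 * periodically reports "alive" with its rank; a rank whose last report is
 * older than a time-out is declared dead and its task is relaunched on an
 * available node.  All names and the time-out value are illustrative.
 */
#include <stdio.h>
#include <time.h>

#define NPROCS  4
#define TIMEOUT 30.0                      /* seconds without an "alive" signal */

typedef struct {
    time_t last_alive[NPROCS];            /* time of the last "alive" signal per rank */
} dispatcher_t;

/* Called whenever an "alive" message carrying `rank` arrives. */
static void record_alive(dispatcher_t *d, int rank, time_t now)
{
    d->last_alive[rank] = now;
}

/* Periodic scan: every rank whose signal is too old is declared dead and
 * its task is relaunched on an available node. */
static void check_timeouts(dispatcher_t *d, time_t now)
{
    for (int rank = 0; rank < NPROCS; rank++) {
        if (difftime(now, d->last_alive[rank]) > TIMEOUT) {
            printf("rank %d timed out: relaunching its task on a free node\n", rank);
            record_alive(d, rank, now);   /* restart the clock for the new instance */
        }
    }
}

int main(void)
{
    dispatcher_t d;
    time_t start = time(NULL);

    for (int r = 0; r < NPROCS; r++)      /* every rank reports once at start-up */
        record_alive(&d, r, start);

    record_alive(&d, 0, start + 35);      /* only rank 0 keeps reporting */
    check_timeouts(&d, start + 40);       /* ranks 1..3 exceed the time-out */
    return 0;
}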
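Channel Memory sketch. The following C sketch illustrates the Channel Memory idea of pessimistic message logging: senders PUT messages into the receiver's home CM, the receiver GETs them in FIFO order, and a restarted receiver replays the same sequence from the log instead of asking senders to resend. The in-memory data structure and the names (cm_put, cm_get, cm_replay) are hypothetical stand-ins, not the real ch_cm device API.

/*
 * Minimal sketch (not the real ch_cm device): an in-memory stand-in for a
 * Channel Memory.  Every message is logged before it is delivered
 * (pessimistic logging), so a restarted receiver can replay the same FIFO
 * sequence from the log instead of asking the senders to resend.
 */
#include <stdio.h>
#include <string.h>

#define MAX_MSGS 64
#define MAX_LEN  128

typedef struct {                 /* one logged message */
    int  src_rank;
    char payload[MAX_LEN];
} cm_msg_t;

typedef struct {                 /* per-receiver FIFO log on the stable CM node */
    cm_msg_t log[MAX_MSGS];
    int      logged;             /* messages logged so far                  */
    int      delivered;          /* messages already handed to the receiver */
} channel_memory_t;

/* PUT: the sender tunnels its message to the receiver's home CM (a transaction). */
static void cm_put(channel_memory_t *cm, int src_rank, const char *payload)
{
    if (cm->logged == MAX_MSGS)
        return;                          /* log full: ignored in this toy sketch */
    cm_msg_t *m = &cm->log[cm->logged++];
    m->src_rank = src_rank;
    snprintf(m->payload, MAX_LEN, "%s", payload);
}

/* GET: the receiver fetches the next undelivered message, in FIFO order. */
static const cm_msg_t *cm_get(channel_memory_t *cm)
{
    if (cm->delivered == cm->logged)
        return NULL;                     /* nothing currently in transit */
    return &cm->log[cm->delivered++];
}

/* Replay after a restart: rewind the delivery cursor; the log itself is kept,
 * so the restarted process re-receives exactly the same message sequence. */
static void cm_replay(channel_memory_t *cm)
{
    cm->delivered = 0;
}

int main(void)
{
    channel_memory_t cm = { .logged = 0, .delivered = 0 };
    const cm_msg_t *m;

    cm_put(&cm, 0, "iteration 1 halo");  /* rank 0 sends to this CM's owner */
    cm_put(&cm, 2, "iteration 1 halo");  /* rank 2 sends as well            */

    while ((m = cm_get(&cm)) != NULL)
        printf("delivered from rank %d: %s\n", m->src_rank, m->payload);

    cm_replay(&cm);                      /* the receiver crashed and restarted */
    while ((m = cm_get(&cm)) != NULL)
        printf("replayed  from rank %d: %s\n", m->src_rank, m->payload);
    return 0;
}

The design point this captures is that the CM, not the sender, is the stable party: as long as the CM survives, only the failed node restarts and no other process has to coordinate or roll back.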
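Checkpoint Server sketch. Finally, a minimal C sketch of the Checkpoint Server interaction: a node stores its latest image on the CS and, after a crash, fetches it back by rank and resumes from there. The toy image (just a rank and an iteration counter) and the names cs_store/cs_fetch are assumptions made for illustration; real checkpoints are full process images that are compressed and transferred.

/*
 * Minimal sketch of the Checkpoint Server interaction: a node periodically
 * sends its execution image to the CS; after a crash, the restarted node asks
 * the CS (by rank) for the last image and resumes from it.
 */
#include <stdio.h>
#include <string.h>

#define NPROCS 4

typedef struct {          /* toy "execution context" saved in a checkpoint */
    int rank;
    int iteration;
} ckpt_image_t;

typedef struct {          /* stable storage held by the Checkpoint Server */
    ckpt_image_t last[NPROCS];
    int          has_image[NPROCS];
} ckpt_server_t;

/* Node -> CS: store the latest image for this rank (older ones are replaced). */
static void cs_store(ckpt_server_t *cs, const ckpt_image_t *img)
{
    cs->last[img->rank] = *img;
    cs->has_image[img->rank] = 1;
}

/* Restarted node -> CS: fetch the last image for `rank`; returns 0 if none. */
static int cs_fetch(const ckpt_server_t *cs, int rank, ckpt_image_t *out)
{
    if (!cs->has_image[rank])
        return 0;
    *out = cs->last[rank];
    return 1;
}

int main(void)
{
    ckpt_server_t cs;
    memset(&cs, 0, sizeof cs);

    /* Rank 2 checkpoints after iteration 10, then crashes at iteration 14. */
    ckpt_image_t img = { .rank = 2, .iteration = 10 };
    cs_store(&cs, &img);

    /* On restart, the dispatcher tells the node which CS to contact; the node
     * recovers its last image and replays the lost iterations using the
     * messages already logged in its Channel Memory. */
    ckpt_image_t restored;
    if (cs_fetch(&cs, 2, &restored))
        printf("rank %d restarts from iteration %d\n",
               restored.rank, restored.iteration);
    return 0;
}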
Putting It All Together
- Worst condition: an in-transit message combined with a checkpoint
[Figure: pseudo time scale of several processes logging messages in their CMs and sending checkpoint images 1 and 2 to the CS; after a crash, the failed process rolls back to its latest checkpoint]
- Rollback to the latest process checkpoint

Performance Evaluation
- XtremWeb P2P platform: a dispatcher, a client that executes the parallel application, and workers hosting the MPICH-V nodes, CMs and CSs
- 216 Pentium III 733 MHz PCs connected by Ethernet
- Node volatility is simulated by enforcing process crashes
- NAS BT benchmark: a simulated Computational Fluid Dynamics application, a parallel benchmark with significant communication and computation

Effects of CM on RTT
[Figure: round-trip time (mean over 100 measurements) vs. message size up to 384 kB, comparing the P4 device with ch_cm using one CM (in-core and out-of-core); tunneling through a CM costs roughly a factor of 2 in bandwidth (about 5.6 MB/s vs. 10.5 MB/s)]

Impact of Remote Checkpointing
- Measured time between the reception of a checkpoint signal and the actual restart: fork, checkpoint, compress, transfer to the CS, way back, decompress, restart
[Figure: checkpoint round-trip time over 100BaseT Ethernet vs. local disk for bt.W.4 (2 MB), bt.A.4 (43 MB), bt.B.4 (21 MB) and bt.A.1 (201 MB); the remote overhead ranges from about +2% to +28%]
- The cost of a remote checkpoint is close to that of a local one (as low as +2%) because compression and transfer are overlapped

Performance of Re-Execution
[Figure: re-execution time vs. token size (up to 256 kB) for 0 to 8 restarts]
- Re-execution is faster than the original execution because the messages are already stored in the CM
- The system can survive the crash of all MPI processes

Execution Time vs. Faults
[Figure: total execution time (sec.) vs. number of faults (0 to 10, about one fault every 110 s); the base execution without checkpointing and faults takes 610 s]
- The overhead of checkpointing is about 23%
- With 10 faults, performance is 68% of the fault-free performance

Conclusion
- MPICH-V is a full-fledged fault tolerant MPI environment (library + runtime)
- It combines uncoordinated checkpointing with distributed pessimistic message logging
- Its architecture relies on Channel Memories, Checkpoint Servers, a Dispatcher and computing nodes