MPICH-V: Fault Tolerant MPI
Rachit Chawla
Outline
 Introduction
 Objectives
 Architecture
 Performance
 Conclusion
Fault Tolerant Techniques
 Transparency
  User Level / API Level
   Error codes returned, to be handled by the programmer
  Communication Library Level
   Transparent fault tolerant communication layer
 Checkpoint Co-ordination
  Re-launch application from a previous coherent snapshot
  Coordinated
  Uncoordinated
 Message Logging
  Optimistic – all events are logged in volatile memory
  Pessimistic – all events are logged on stable storage
Checkpointing
 Coordinated
  Coordinator initiates a checkpoint
  No Domino Effect
  Simplified Rollback Recovery
 Uncoordinated
  Independent checkpoints
  Possibility of Domino Effect
  Rollback Recovery Complex
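To make the uncoordinated case concrete, here is a minimal C sketch in which a process checkpoints on its own schedule, with no agreement with other ranks. The state struct, file name and checkpoint interval are illustrative assumptions, not MPICH-V's actual mechanism (which checkpoints the process image and ships it to a Checkpoint Server).

/* Sketch of uncoordinated checkpointing (illustrative only): each
 * process saves its own state whenever it decides to, independently
 * of all other ranks. */
#include <stdio.h>

struct app_state {            /* hypothetical application state */
    long iteration;
    double partial_result;
};

/* Write this process's state to its own checkpoint file. */
static int save_checkpoint(int rank, const struct app_state *s) {
    char path[64];
    snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

int main(void) {
    int rank = 0;                       /* stand-in for the MPI rank */
    struct app_state s = {0, 0.0};

    for (s.iteration = 1; s.iteration <= 1000000; s.iteration++) {
        s.partial_result += 1.0 / s.iteration;

        /* Uncoordinated: this process checkpoints on its own schedule
         * (here every 100000 iterations) without agreeing with peers. */
        if (s.iteration % 100000 == 0)
            save_checkpoint(rank, &s);
    }
    printf("rank %d done, result %f\n", rank, s.partial_result);
    return 0;
}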
Logging
 Piece-Wise Deterministic (PWD) assumption
  For all non-deterministic events, store information in a determinant and replay it
 Non-Deterministic Events
  Send/Receive message
  Software Interrupt
  System Calls
 Replay execution events after the last checkpoint
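The following is a small C sketch of pessimistic logging under the PWD assumption: the determinant of a receive event (sender, delivery order, tag) is forced to stable storage before the message would be delivered. The determinant layout and log file name are assumptions for illustration, not the MPICH-V on-disk format.

/* Sketch of pessimistic event logging under the PWD assumption. */
#include <stdio.h>
#include <unistd.h>

struct determinant {
    int sender_rank;     /* where the message came from    */
    int recv_seq;        /* delivery order at the receiver */
    int msg_tag;         /* tag that matched the receive   */
};

/* Pessimistic: the record is forced to disk before we return, so a
 * crash right after delivery can always be replayed. */
static int log_determinant(FILE *log, const struct determinant *d) {
    if (fwrite(d, sizeof *d, 1, log) != 1) return -1;
    fflush(log);
    return fsync(fileno(log));      /* force to stable storage */
}

int main(void) {
    FILE *log = fopen("event_log.bin", "ab");
    if (!log) return 1;

    struct determinant d = {.sender_rank = 3, .recv_seq = 42, .msg_tag = 7};
    if (log_determinant(log, &d) == 0) {
        /* only now would the message be handed to the application */
        printf("logged receive #%d from rank %d\n", d.recv_seq, d.sender_rank);
    }
    fclose(log);
    return 0;
}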
Objectives
 Automatic Fault Tolerance
 Transparency (for programmer/user)
 Tolerate n faults (n = number of MPI processes)
 Scalable Infrastructure/Protocol
 No Global Synchronization
MPICH-V
 Extension of MPICH – Comm Lib level
 Implements all comm subroutines in MPICH
 Tolerant to volatility of nodes
  Node Failure
  Network Failure
 Uncoordinated Checkpointing
  Checkpointing Servers
 Distributed Pessimistic Message Logging
  Channel Memories
Architecture
 Communication Library
  Relink the application with “libmpichv” (see the sketch below)
 Run-Time Environment
  Dispatcher
  Channel Memories - CM
  Checkpointing Servers - CS
  Computing/Communicating Nodes
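Because fault tolerance lives in the communication library, the application source does not change. The ordinary MPI ring program below is the kind of code that, per the slides, would simply be relinked against “libmpichv”; the exact compile/link flags (something like mpicc ring.c -o ring -lmpichv) depend on the installation and are assumed here.

/* A plain MPI token-ring program: nothing in the source refers to
 * fault tolerance; only the link step changes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, token;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size > 1) {
        if (rank == 0) {
            token = 1;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &st);
            printf("token made it around the ring of %d processes\n", size);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}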
Big Picture
 [Diagram: Dispatcher, Channel Memory and Checkpoint Server connected over the network to computing Nodes behind firewalls, with numbered interactions (1-4) among them.]
Overview
 Channel Memory
  Dedicated Nodes
  Message tunneling
  Message Repository
 Node
  Home CM
  Send a message – send it to the receiver’s home CM
 Distributed Checkpointing/Logging
  Execution context - CS
  Communication context - CM
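A sketch of the tunneling idea: a send never goes directly to the receiver but is PUT into the receiver's home CM, as assigned by the registry. The rank-to-CM table, host names and cm_put() below are placeholders for illustration only.

/* Sketch of message tunneling through Channel Memories. */
#include <stdio.h>

#define NUM_RANKS 4

/* Home Channel Memory of each rank, as handed out by the registry
 * (round-robin over two CMs in this made-up example). */
static const char *home_cm[NUM_RANKS] = {
    "cm0.cluster", "cm1.cluster", "cm0.cluster", "cm1.cluster"
};

/* Stand-in for the real PUT transaction against a Channel Memory. */
static void cm_put(const char *cm_host, int src, int dst,
                   const void *buf, size_t len) {
    printf("PUT %zu bytes from rank %d for rank %d into %s\n",
           len, src, dst, cm_host);
}

/* "Send" = PUT into the *receiver's* home CM; the receiver will later
 * GET it from the same place, which is also where it stays logged. */
static void mpichv_send(int src, int dst, const void *buf, size_t len) {
    cm_put(home_cm[dst], src, dst, buf, len);
}

int main(void) {
    const char msg[] = "hello";
    mpichv_send(0, 3, msg, sizeof msg);   /* rank 0 -> rank 3 via its home CM */
    return 0;
}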
Dispatcher (Stable)
 Initializes the execution
  A stable, centralized Service Registry is started
  Provides services (CM, CS) to nodes
  CM, CS assigned in a round-robin fashion
 Launches the instances of MPI processes on Nodes
 Monitors the Node state
  “alive” signal, or time-out
 Reschedules tasks on available nodes for dead MPI process instances
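A minimal sketch of the dispatcher's monitoring loop, assuming a table of last "alive" timestamps per rank and a fixed timeout; the timeout value and the reschedule() hook are illustrative, not taken from MPICH-V.

/* Sketch of dispatcher failure detection via alive signals / time-out. */
#include <stdio.h>
#include <time.h>

#define NUM_RANKS    4
#define TIMEOUT_SEC 60

static time_t last_alive[NUM_RANKS];

/* Called whenever an "alive" message carrying a rank is received. */
static void on_alive(int rank) {
    last_alive[rank] = time(NULL);
}

/* Placeholder for restarting the MPI process instance elsewhere. */
static void reschedule(int rank) {
    printf("rank %d timed out, restarting from its last checkpoint\n", rank);
    last_alive[rank] = time(NULL);   /* reset the timer for the restart */
}

static void check_timeouts(void) {
    time_t now = time(NULL);
    for (int r = 0; r < NUM_RANKS; r++)
        if (now - last_alive[r] > TIMEOUT_SEC)
            reschedule(r);
}

int main(void) {
    for (int r = 0; r < NUM_RANKS; r++)
        on_alive(r);                  /* all ranks start out alive    */
    last_alive[2] = time(NULL) - 120; /* simulate rank 2 going silent */
    check_timeouts();                 /* detects and reschedules rank 2 */
    return 0;
}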
Steps
 When a node executes
  Contacts the Stable Service Registry
  Gets assigned a CM and a CS based on its rank
  Sends an “alive” signal periodically to the dispatcher – the signal contains its rank
 On a failure
  The failed process restarts its execution
  Other processes are unaware of the failure
  The CM allows a single connection per rank
  If the faulty process reconnects, an error code is returned and it exits
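A small sketch of the single-connection-per-rank rule at a CM: the first registration for a rank succeeds, and any further attempt gets an error code so that the extra (presumed faulty) instance can exit. Data structures and the error constant are assumptions.

/* Sketch of "one connection per rank" enforcement at a Channel Memory. */
#include <stdio.h>
#include <stdbool.h>

#define NUM_RANKS 4
#define ERR_DUPLICATE_RANK (-1)

static bool connected[NUM_RANKS];   /* is some instance of rank r attached? */

/* Returns 0 on success, ERR_DUPLICATE_RANK if the rank is already connected. */
static int cm_register(int rank) {
    if (connected[rank])
        return ERR_DUPLICATE_RANK;
    connected[rank] = true;
    return 0;
}

int main(void) {
    if (cm_register(1) == 0)
        printf("rank 1 connected to its CM\n");

    /* a second instance of rank 1 tries to connect */
    if (cm_register(1) == ERR_DUPLICATE_RANK)
        printf("duplicate rank 1 refused: the extra instance exits\n");
    return 0;
}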
Channel Memory (Stable)
 Logs every message
 Send/Receive messages – GET & PUT
 GET & PUT are transactions
 FIFO order maintained for each receiver
 On a restart, the node replays its communications using the CMs (see the sketch below)
 [Diagram: nodes perform Get and Put operations on a Channel Memory across the network.]
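A sketch of the per-receiver FIFO kept by a Channel Memory: PUT appends a logged message for a receiver, GET returns messages in the same order, and the log is kept so a restarted receiver can replay its receives. Array sizes and the message struct are simplifications.

/* Sketch of a Channel Memory's per-receiver message log with FIFO GET/PUT. */
#include <stdio.h>

#define NUM_RANKS 4
#define MAX_MSGS  128
#define MAX_LEN    64

struct msg { int src; char data[MAX_LEN]; };

/* One FIFO per receiving rank; 'next_get' is the delivery/replay cursor. */
static struct msg fifo[NUM_RANKS][MAX_MSGS];
static int count[NUM_RANKS], next_get[NUM_RANKS];

static int cm_put(int dst, int src, const char *data) {
    if (count[dst] == MAX_MSGS) return -1;
    struct msg *m = &fifo[dst][count[dst]++];
    m->src = src;
    snprintf(m->data, MAX_LEN, "%s", data);
    return 0;
}

/* Returns the next logged message for 'dst', or NULL if none pending. */
static const struct msg *cm_get(int dst) {
    if (next_get[dst] == count[dst]) return NULL;
    return &fifo[dst][next_get[dst]++];
}

int main(void) {
    cm_put(2, 0, "first");            /* rank 0 -> rank 2 */
    cm_put(2, 1, "second");           /* rank 1 -> rank 2 */

    const struct msg *m;
    while ((m = cm_get(2)) != NULL)   /* delivered in PUT order */
        printf("rank 2 got \"%s\" from rank %d\n", m->data, m->src);

    next_get[2] = 0;                  /* a restarted rank 2 replays from the log */
    return 0;
}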
Checkpoint Server (Stable)
 Checkpoints are stored on stable storage
 During execution – a node performs a checkpoint and sends the image to its CS
 On a restart
  Dispatcher
   Informs the node about its task
   Indicates which CS to contact to get the task’s last checkpoint
  Node
   Contacts the CS with its rank
   Gets the last checkpoint image back from the CS
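A sketch of the node/CS exchange, with a local file per rank standing in for the Checkpoint Server's stable storage: the node stores its image after a checkpoint and fetches the latest image back by rank on restart. File names and the fake image are illustrative.

/* Sketch of storing and retrieving checkpoint images keyed by rank. */
#include <stdio.h>

/* "Send image to CS": store it under the owning rank's name. */
static int cs_store(int rank, const char *image, size_t len) {
    char path[64];
    snprintf(path, sizeof path, "cs_store_rank%d.img", rank);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(image, 1, len, f);
    fclose(f);
    return n == len ? 0 : -1;
}

/* "Contact CS with its rank": fetch the last checkpoint image back. */
static long cs_fetch(int rank, char *buf, size_t cap) {
    char path[64];
    snprintf(path, sizeof path, "cs_store_rank%d.img", rank);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    long n = (long)fread(buf, 1, cap, f);
    fclose(f);
    return n;
}

int main(void) {
    const char image[] = "serialized process state";   /* fake image */
    cs_store(5, image, sizeof image);

    char restored[64];
    long n = cs_fetch(5, restored, sizeof restored);
    if (n > 0)
        printf("rank 5 restarts from %ld-byte image: %s\n", n, restored);
    return 0;
}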
Putting it all together
Worst condition: in-transit message + checkpoint
 [Diagram: processes on a pseudo time scale send checkpoint images through their CMs to the CS; when a process crashes, it rolls back to its latest process checkpoint.]
Performance Evaluation
 XtremWeb – P2P platform
  Dispatcher
  Client – executes the parallel application
  Workers – MPICH-V nodes, CMs & CSs
 216 PIII 733 MHz PCs
  Connected by Ethernet
 Node volatility simulated by enforcing process crashes
 NAS BT benchmark – simulated Computational Fluid Dynamics application
  Parallel benchmark
  Significant communication + computation
Effects of CM on RTT
 [Plot: round-trip time (sec) vs. message size (0–384 kB), mean over 100 measurements, comparing P4 with ch_cm using 1 CM (in-core, out-of-core, out-of-core best); the CM path sustains about 5.6 MB/s versus about 10.5 MB/s for P4, roughly a factor of 2.]
Impact of Remote Checkpointing
Time between reception of a checkpoint signal and actual restart:
fork, ckpt, compress, transfer to CS, way back, decompress, restart
 [Plot: round-trip time (sec) for bt.W.4 (2MB), bt.A.4 (43MB), bt.B.4 (21MB) and bt.A.1 (201MB), comparing checkpointing to a distant CS over 100BaseT Ethernet with a local (disc) checkpoint; the remote overhead ranges from about +2% (214 s vs. 208 s) up to about +28%.]
 Cost of remote checkpoint is close to the one of local checkpoint (can be as low as 2%)…
 …because compression and transfer are overlapped
Performance of Re-Execution
 [Plot: execution time (sec) vs. token size (0–256 kB) for 0 through 8 restarts.]
 Re-execution is faster than execution: messages are already stored in the CM
 The system can survive the crash of all MPI Processes
Execution Time Vs Faults
 [Plot: total execution time (sec) vs. number of faults (0–10), with about 1 fault every 110 sec; the base execution time without checkpointing and faults is about 610 sec.]
 Overhead of checkpointing is about 23%
 With 10 faults, performance is 68% of the one without faults
Conclusion
 MPICH-V
  A full-fledged fault tolerant MPI environment (lib + runtime)
  Uncoordinated checkpointing + distributed pessimistic message logging
  Channel Memories, Checkpoint Servers, Dispatcher and Nodes