CS 271 1
• Fault tolerance is achieved by periodically using stable storage to save the processes’ states during the failure-free execution.
• Upon a failure , a failed process rolls back from one of its saved states , thereby reducing the amount of lost computation.
• Each of the saved states is called a checkpoint
CS 271 2
• Uncoordinated checkpointing : Each process takes its checkpoints independently
• Coordinated checkpointing : Process coordinate their checkpoints in order to save a system-wide consistent state.
• Communication-induced checkpointing : It forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.
CS 271 3
CS 271
Recovery
Line
P
0 m
7 m
0 m
2 m
3 m
5
P
1 m
6 m
1 m
4
P
2
Domino Effect: Cascaded rollback which causes the system to roll back to too far in the computation (even to the beginning), in spite of all the checkpoints
4
• Global state of a distributed system
– Local state of each process
– Messages sent but not received
• Many applications need the state of the system
– Failure recovery, distributed deadlock detection
– Detect stable properties .
• Problem: how can you figure out the state of a distributed system?
– Each process is independent
– Network does not have any processing power.
• Distributed snapshot : a consistent global state
CS 271 5
a) A consistent cut b) An inconsistent cut
CS 271 6
• Assume each process communicates with another process using unidirectional FIFO pointto-point channels (e.g, TCP connections )
• Any process can initiate the algorithm
– Checkpoint local state
– Send MARKER on every outgoing channel
• On receiving a first marker on a channel:
– Process checkpoints local state and
– Send markers on all outgoing channels, and save messages on all other channels.
• On receiving subsequent marker on a channel:
– stop saving messages for that channel
– Saved messages are the state of the channel
CS 271 7
• A process finishes when
– It receives a marker on each incoming channel and processes them all
– State : local state plus state of all channels
– Send state to initiator
• Any process can initiate snapshot
– Multiple snapshots may be in progress
• Each is distinguished by tagging the marker with the initiator ID (and sequence number)
CS 271 8
a) Organization of a process and channels for a distributed snapshot
CS 271 9
b) Process Q receives a marker for first time and records local state c) Q records all incoming message d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel
CS 271 10
Execution Example p
S p
0 S p
1 S p
2 S p
3 q
S q
0 m
1
S q
1 m
2
S q
2 m
3
S q
3
CS 271 11
Execution Example q records state as S q
1 , sends marker to p p
S p
0 S p
1 S p
2 S p
3 q
S q
0 m
1
S q
1 m
2
S q
2 m
3
S q
3
CS 271 12
Execution Example p records state as S p
2 , channel state as empty p
S p
0 S p
1 S p
2 S p
3 q
S q
0 m
1
S q
1 m
2
S q
2 m
3
S q
3
CS 271 13
Execution Example q records channel state as m
3 p
S p
0 S p
1 S p
2 S p
3 q
S q
0 m
1
S q
1 m
2
S q
2 m
3
S q
3
CS 271 14
Execution Example
Recorded Global State = ((S p
2 , S q
1 ), (0,m
3
) ) p
S p
0 S p
1 S p
2 S p
3 q
S q
0 m
1
S q
1 m
2
S q
2 m
3
S q
3
CS 271 15
(Snapshot and global states)
• General solution for global state detection .
• Causality based detection of stable properties .
• Simple efficient protocol, uses Markers and
FIFO properties .
• Don’t forget Channel States .
• Foundation for Distributed Checkpointing and
Rollback Recovery
CS 271 16