pptx - UCSB Computer Science

GLOBAL STATES AND

CHECKPOINTS

CS 271 1

Distributed Checkpoints and Rollback

Recovery

• Fault tolerance is achieved by periodically using stable storage to save the processes’ states during the failure-free execution.

• Upon a failure , a failed process rolls back from one of its saved states , thereby reducing the amount of lost computation.

• Each of the saved states is called a checkpoint

CS 271 2

Checkpoint based Recovery

• Uncoordinated checkpointing : Each process takes its checkpoints independently

• Coordinated checkpointing : Process coordinate their checkpoints in order to save a system-wide consistent state.

• Communication-induced checkpointing : It forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.

CS 271 3

CS 271

Domino effect: example

Recovery

Line

P

0 m

7 m

0 m

2 m

3 m

5

P

1 m

6 m

1 m

4

P

2

Domino Effect: Cascaded rollback which causes the system to roll back to too far in the computation (even to the beginning), in spite of all the checkpoints

4

Global State

Chandy and Lamport—TOCS 1985

• Global state of a distributed system

– Local state of each process

– Messages sent but not received

• Many applications need the state of the system

– Failure recovery, distributed deadlock detection

– Detect stable properties .

• Problem: how can you figure out the state of a distributed system?

– Each process is independent

– Network does not have any processing power.

• Distributed snapshot : a consistent global state

CS 271 5

Global State

a) A consistent cut b) An inconsistent cut

CS 271 6

Distributed Snapshot Algorithm

• Assume each process communicates with another process using unidirectional FIFO pointto-point channels (e.g, TCP connections )

• Any process can initiate the algorithm

– Checkpoint local state

– Send MARKER on every outgoing channel

• On receiving a first marker on a channel:

– Process checkpoints local state and

– Send markers on all outgoing channels, and save messages on all other channels.

• On receiving subsequent marker on a channel:

– stop saving messages for that channel

– Saved messages are the state of the channel

CS 271 7

Distributed Snapshot

• A process finishes when

– It receives a marker on each incoming channel and processes them all

– State : local state plus state of all channels

– Send state to initiator

• Any process can initiate snapshot

– Multiple snapshots may be in progress

• Each is distinguished by tagging the marker with the initiator ID (and sequence number)

CS 271 8

Snapshot Algorithm Example

a) Organization of a process and channels for a distributed snapshot

CS 271 9

Snapshot Algorithm Example

b) Process Q receives a marker for first time and records local state c) Q records all incoming message d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel

CS 271 10

Execution Example p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 11

Execution Example q records state as S q

1 , sends marker to p p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 12

Execution Example p records state as S p

2 , channel state as empty p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 13

Execution Example q records channel state as m

3 p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 14

Execution Example

Recorded Global State = ((S p

2 , S q

1 ), (0,m

3

) ) p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 15

Take home Message

(Snapshot and global states)

• General solution for global state detection .

• Causality based detection of stable properties .

• Simple efficient protocol, uses Markers and

FIFO properties .

• Don’t forget Channel States .

• Foundation for Distributed Checkpointing and

Rollback Recovery

CS 271 16

pptx - UCSB Computer Science

GLOBAL STATES AND

CHECKPOINTS

Distributed Checkpoints and Rollback

Recovery

Checkpoint based Recovery

Domino effect: example

Global State

Chandy and Lamport—TOCS 1985

Global State

Distributed Snapshot Algorithm

Distributed Snapshot

Snapshot Algorithm Example

Snapshot Algorithm Example

Take home Message

Related documents

Products

Support

pptx - UCSB Computer Science

GLOBAL STATES AND

CHECKPOINTS

Distributed Checkpoints and Rollback

Recovery

Checkpoint based Recovery

Domino effect: example

Global State

Chandy and Lamport—TOCS 1985

Global State

Distributed Snapshot Algorithm

Distributed Snapshot

Snapshot Algorithm Example

Snapshot Algorithm Example

Take home Message

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib