pptx - UCSB Computer Science

advertisement

GLOBAL STATES AND

CHECKPOINTS

CS 271 1

Distributed Checkpoints and Rollback

Recovery

• Fault tolerance is achieved by periodically using stable storage to save the processes’ states during the failure-free execution.

• Upon a failure , a failed process rolls back from one of its saved states , thereby reducing the amount of lost computation.

• Each of the saved states is called a checkpoint

CS 271 2

Checkpoint based Recovery

• Uncoordinated checkpointing : Each process takes its checkpoints independently

• Coordinated checkpointing : Process coordinate their checkpoints in order to save a system-wide consistent state.

• Communication-induced checkpointing : It forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.

CS 271 3

CS 271

Domino effect: example

Recovery

Line

P

0 m

7 m

0 m

2 m

3 m

5

P

1 m

6 m

1 m

4

P

2

Domino Effect: Cascaded rollback which causes the system to roll back to too far in the computation (even to the beginning), in spite of all the checkpoints

4

Global State

Chandy and Lamport—TOCS 1985

• Global state of a distributed system

– Local state of each process

– Messages sent but not received

• Many applications need the state of the system

– Failure recovery, distributed deadlock detection

– Detect stable properties .

• Problem: how can you figure out the state of a distributed system?

– Each process is independent

– Network does not have any processing power.

• Distributed snapshot : a consistent global state

CS 271 5

Global State

a) A consistent cut b) An inconsistent cut

CS 271 6

Distributed Snapshot Algorithm

• Assume each process communicates with another process using unidirectional FIFO pointto-point channels (e.g, TCP connections )

• Any process can initiate the algorithm

– Checkpoint local state

– Send MARKER on every outgoing channel

• On receiving a first marker on a channel:

– Process checkpoints local state and

– Send markers on all outgoing channels, and save messages on all other channels.

• On receiving subsequent marker on a channel:

– stop saving messages for that channel

– Saved messages are the state of the channel

CS 271 7

Distributed Snapshot

• A process finishes when

– It receives a marker on each incoming channel and processes them all

– State : local state plus state of all channels

– Send state to initiator

• Any process can initiate snapshot

– Multiple snapshots may be in progress

• Each is distinguished by tagging the marker with the initiator ID (and sequence number)

CS 271 8

Snapshot Algorithm Example

a) Organization of a process and channels for a distributed snapshot

CS 271 9

Snapshot Algorithm Example

b) Process Q receives a marker for first time and records local state c) Q records all incoming message d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel

CS 271 10

Execution Example p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 11

Execution Example q records state as S q

1 , sends marker to p p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 12

Execution Example p records state as S p

2 , channel state as empty p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 13

Execution Example q records channel state as m

3 p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 14

Execution Example

Recorded Global State = ((S p

2 , S q

1 ), (0,m

3

) ) p

S p

0 S p

1 S p

2 S p

3 q

S q

0 m

1

S q

1 m

2

S q

2 m

3

S q

3

CS 271 15

Take home Message

(Snapshot and global states)

• General solution for global state detection .

• Causality based detection of stable properties .

• Simple efficient protocol, uses Markers and

FIFO properties .

• Don’t forget Channel States .

• Foundation for Distributed Checkpointing and

Rollback Recovery

CS 271 16

Download