Distributed Systems
CS 15-440
Fault Tolerance- Part III
Lecture 19, Nov 21, 2012
Majd F. Sakr and Mohammad Hammoud
Today…
 Last session
 Fault Tolerance – Part II
 Reliable request-reply communication
 Today’s session
 Fault Tolerance – Part III
 Reliable group communication
 Atomicity
 Recovery
 Announcement:
 Project 3 is due tomorrow by 11:59PM
Objectives
Discussion on Fault Tolerance
 General background on fault tolerance
 Process resilience, failure detection, and reliable communication
 Atomicity and distributed commit protocols
 Recovery from failures
Reliable Communication
 Reliable Request-Reply Communication
 Reliable Group Communication
Reliable Group Communication
 As we considered reliable request-reply communication, we also need to consider reliable multicasting services
 E.g., election algorithms use multicasting schemes
Reliable Group Communication
 A Basic Reliable-Multicasting Scheme
 Scalability in Reliable Multicasting
 Atomic Multicast
Reliable Multicasting
 Reliable multicasting indicates that a message sent to a process group should be delivered to each member of that group
 A distinction should be made between:
 Reliable communication in the presence of faulty processes
 Reliable communication when processes are assumed to operate correctly
 In the presence of faulty processes, multicasting is considered to be reliable when it can be guaranteed that all non-faulty group members receive the message
Basic Reliable Multicasting Questions
 What happens if during communication (i.e., a message
is being delivered) a process P joins a group?
 Should P also receive the message?
 What happens if a (sending) process crashes during
communication?
 What about message ordering?
Reliable Multicasting with Feedback Messages
 Consider the case when a single sender S wants to multicast a message to multiple receivers
 An S's multicast message may be lost part way and delivered to some, but not to all, of the intended receivers
 Assume that messages are received in the same order as they are sent
[Figure: the sender keeps message M25 in a history buffer; receivers whose last received message is 24 return ACK25, while a receiver whose last received message is 23 reports that it missed message 24]
An extensive and detailed survey of total-order broadcasts can be found in Defago et al. (2004)
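The feedback scheme sketched in the figure can be approximated in code. Below is a minimal, illustrative Python sketch (class and method names such as on_missed are assumptions of this sketch, not part of the lecture): the sender keeps every message in its history buffer until all receivers have acknowledged it, and a receiver that detects a gap requests the missing messages before acknowledging.

class Sender:
    def __init__(self, receivers):
        self.receivers = list(receivers)
        self.history = {}      # seq -> message, kept until acknowledged by all
        self.acks = {}         # seq -> set of receivers that acknowledged
        self.next_seq = 0

    def multicast(self, payload):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.history[seq] = payload
        self.acks[seq] = set()
        for r in self.receivers:
            r.on_message(seq, payload, self)

    def on_ack(self, receiver, seq):
        self.acks[seq].add(receiver)
        if len(self.acks[seq]) == len(self.receivers):
            del self.history[seq]          # safe to drop from the history buffer
            del self.acks[seq]

    def on_missed(self, receiver, seq):
        receiver.on_message(seq, self.history[seq], self)   # retransmit from history

class Receiver:
    def __init__(self):
        self.last = -1         # highest sequence number delivered so far

    def on_message(self, seq, payload, sender):
        for missing in range(self.last + 1, seq):
            sender.on_missed(self, missing)      # ask for any gap first
        if seq == self.last + 1:
            self.last = seq                      # deliver to the application here
            sender.on_ack(self, seq)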
Reliable Group Communication
 A Basic Reliable-Multicasting Scheme
 Scalability in Reliable Multicasting
 Atomic Multicast
Scalability Issues with a Feedback-Based Scheme
 If there are N receivers in a multicasting process, the sender must
be prepared to accept at least N ACKs
 This might cause a feedback implosion
 Instead, we can let a receiver return only a NACK
 Limitations:
 No hard guarantees can be given that a feedback implosion will
not happen
 It is not clear for how long the sender should keep a message in its
history buffer
Nonhierarchical Feedback Control
 How can we control the number of NACKs sent back to the sender?
 A NACK is sent to all the group members after some random delay
 A group member suppresses its own feedback concerning a missing
message after receiving a NACK feedback about the same message
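A minimal sketch of this random-delay NACK suppression idea, assuming the underlying transport exposes a multicast primitive that delivers the NACK to the whole group (timer values and names below are illustrative, not from the lecture):

import random, threading

class FeedbackSuppressor:
    def __init__(self, multicast_to_group, max_delay=0.5):
        self.multicast = multicast_to_group     # sends the NACK to the whole group
        self.max_delay = max_delay
        self.pending = {}                       # seq -> Timer for a scheduled NACK

    def on_missing(self, seq):
        """Schedule a NACK for `seq` after a random delay."""
        if seq in self.pending:
            return
        timer = threading.Timer(random.uniform(0, self.max_delay),
                                self._send_nack, args=(seq,))
        self.pending[seq] = timer
        timer.start()

    def _send_nack(self, seq):
        self.pending.pop(seq, None)
        self.multicast(("NACK", seq))

    def on_nack_from_peer(self, seq):
        """Another member already NACKed `seq`, so suppress our own feedback."""
        timer = self.pending.pop(seq, None)
        if timer:
            timer.cancel()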
Hierarchical Feedback Control
 Feedback suppression is basically a nonhierarchical solution
 Achieving scalability for very large groups of receivers requires that hierarchical approaches are adopted
 The group of receivers is partitioned into a number of subgroups, which are organized into a tree
Hierarchical Feedback Control
 The subgroup containing the sender S forms the root of the tree
 Within a subgroup, any reliable multicasting scheme can be used
 Each subgroup appoints a local coordinator C responsible for handling retransmission requests in its subgroup
 If C misses a message m, it asks the C of the parent subgroup to retransmit m
Reliable Group Communication
 A Basic Reliable-Multicasting Scheme
 Scalability in Reliable Multicasting
 Atomic Multicast
Atomic Multicast
 P1: What is often needed in a distributed system is the guarantee
that a message is delivered to either all processes or to none at all
 P2: It is also generally required that all messages are delivered in
the same order to all processes
 Satisfying P1 and P2 results in an atomic multicast
 Atomic multicast:
 Ensures that non-faulty processes maintain a consistent view
 Forces reconciliation when a process recovers and rejoins the group
Virtual Synchrony (1)
 A multicast message m is uniquely associated with a list of processes to which it should be delivered
 This delivery list corresponds to a group view (G)
 There is only one case in which delivery of m is allowed to fail:
 When a group-membership change is the result of the sender of m crashing
 In this case, m may either be delivered to all remaining processes, or ignored by each of them
 A reliable multicast with this property is said to be virtually synchronous
Virtual Synchrony (2)
[Figure: The Principle of Virtual Synchronous Multicast. A reliable multicast is implemented by multiple point-to-point messages among P1, P2, P3, and P4 under view G = {P1, P2, P3, P4}; when P3 crashes, the partial multicast from P3 is discarded and the view changes to G = {P1, P2, P4}; when P3 later rejoins, the view becomes G = {P1, P2, P3, P4} again]
Message Ordering
 Four different virtually synchronous multicast orderings
are distinguished:
1. Unordered multicasts
2. FIFO-ordered multicasts
3. Causally-ordered multicasts
4. Totally-ordered multicasts
1. Unordered Multicasts
 A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes

    Process P1      Process P2      Process P3
    Sends m1        Receives m1     Receives m2
    Sends m2        Receives m2     Receives m1

Three communicating processes in the same group
2. FIFO-Ordered Multicasts
 With FIFO-ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent (a sketch of FIFO delivery using per-sender sequence numbers follows the table)

    Process P1      Process P2      Process P3      Process P4
    Sends m1        Receives m1     Receives m3     Sends m3
    Sends m2        Receives m3     Receives m1     Sends m4
                    Receives m2     Receives m2
                    Receives m4     Receives m4

Four processes in the same group with two different senders
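As referenced above, FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue. A minimal, illustrative Python sketch (not the lecture's own code) follows:

from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.expected = defaultdict(int)    # sender -> next sequence number to deliver
        self.holdback = defaultdict(dict)   # sender -> {seq: msg} waiting for earlier ones
        self.delivered = []

    def on_receive(self, sender, seq, msg):
        """Buffer out-of-order messages and deliver each sender's messages in
        the order that sender sent them."""
        self.holdback[sender][seq] = msg
        while self.expected[sender] in self.holdback[sender]:
            next_seq = self.expected[sender]
            self.delivered.append((sender, self.holdback[sender].pop(next_seq)))
            self.expected[sender] += 1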
3-4. Causally-Ordered and
Total-Ordered Multicasts
 Causally-ordered multicast preserves potential causality
between different messages
 If message m1 causally precedes another message m2,
regardless of whether they were multicast by the same sender or
not, the communication layer at each receiver will always deliver
m1 before m2
 Total-ordered multicast requires that when messages are
delivered, they are delivered in the same order to all group
members (regardless of whether message delivery is
unordered, FIFO-ordered, or causally-ordered)
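One common way to realize total-ordered delivery is a sequencer-based scheme (one possible mechanism; the lecture does not prescribe a particular one): every multicast is routed through a sequencer that assigns consecutive global sequence numbers and re-multicasts, and each member delivers strictly in that order. A minimal sketch with invented names:

class Sequencer:
    def __init__(self, group):
        self.group = group
        self.next_seq = 0

    def submit(self, msg):
        """Assign the next global sequence number and re-multicast to the group."""
        seq = self.next_seq
        self.next_seq += 1
        for member in self.group:
            member.on_sequenced(seq, msg)

class Member:
    def __init__(self):
        self.expected = 0
        self.holdback = {}       # seq -> msg, waiting for earlier messages
        self.delivered = []

    def on_sequenced(self, seq, msg):
        self.holdback[seq] = msg
        # Deliver in strict sequence order, so every member sees the same order.
        while self.expected in self.holdback:
            self.delivered.append(self.holdback.pop(self.expected))
            self.expected += 1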
Virtually Synchronous Reliable Multicasting
 A virtually synchronous reliable multicast that offers total-ordered delivery of messages is what we refer to as atomic multicasting

    Multicast                   Basic Message Ordering      Total-Ordered Delivery?
    Reliable multicast          None                        No
    FIFO multicast              FIFO-ordered delivery       No
    Causal multicast            Causal-ordered delivery     No
    Atomic multicast            None                        Yes
    FIFO atomic multicast       FIFO-ordered delivery       Yes
    Causal atomic multicast     Causal-ordered delivery     Yes

Six different versions of virtually synchronous reliable multicasting
Implementing Virtual Synchrony (1)
 We will consider a possible implementation of virtual synchrony that appeared in Isis [Birman et al. 1991]
 Isis assumes a FIFO-ordered multicast
 Isis makes use of TCP, hence, each transmission is
guaranteed to succeed
 Using TCP does not guarantee that all messages sent to a
view G are delivered to all non-faulty processes in G before
any view change
Implementing Virtual Synchrony (2)
 The solution adopted by Isis is to let every process in G keep a message m until it knows for sure that all members in G have received it
 If m has been received by all members in G, m is said to
be stable
 Only stable messages are allowed to be delivered
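A minimal sketch of this stability bookkeeping (illustrative only; the actual Isis protocol is considerably more involved, and the names below are assumptions): each process tracks, per message, which view members are known to hold it, delivers a message once every member of the current view G holds it, and, on a view change, flushes its remaining unstable messages.

class ViewMember:
    def __init__(self, my_id, view):
        self.my_id = my_id
        self.view = set(view)            # current group view G
        self.unstable = {}               # msg_id -> (payload, set of members known to have it)
        self.delivered = []

    def on_receive(self, msg_id, payload, from_id):
        entry = self.unstable.setdefault(msg_id, (payload, set()))
        entry[1].update({from_id, self.my_id})
        self._try_deliver(msg_id)

    def on_stability_info(self, msg_id, holders):
        """Learn which members are known to hold msg_id (e.g., via piggybacked acks)."""
        if msg_id in self.unstable:
            self.unstable[msg_id][1].update(holders)
            self._try_deliver(msg_id)

    def _try_deliver(self, msg_id):
        payload, holders = self.unstable[msg_id]
        if self.view <= holders:                 # stable: every member in G has it
            self.delivered.append(payload)
            del self.unstable[msg_id]

    def on_view_change(self, new_view):
        """Flush: hand back all unstable messages to be re-sent before installing the new view."""
        flushed = list(self.unstable.items())
        self.view = set(new_view)
        return flushed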
Implementing Virtual Synchrony (3)
[Figure: three stages of a view change among processes 0 through 7: (a) process 4 notices that process 7 has crashed and sends a view change; (b) process 6 sends out all its unstable messages, followed by a flush message; (c) process 6 installs the new view when it receives a flush message from everyone else]
Distributed Commit
 The atomic multicasting problem is an example of a more general problem, known as distributed commit
 The distributed commit problem involves having an operation being performed by each member of a process group, or none at all
 With reliable multicasting, the operation is the delivery of a message
 With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction
 Distributed commit is often established by means of a coordinator and participants
One-Phase Commit Protocol
 In a simple scheme, a coordinator can tell all participants
whether or not to (locally) perform the operation in question
 This scheme is referred to as a one-phase commit protocol
 The main drawback of the one-phase commit protocol is that if one of the participants cannot actually perform the operation, there is no way to tell the coordinator
 In practice, more sophisticated schemes are needed. The most commonly utilized one is the two-phase commit protocol
Two-Phase Commit Protocol
 Assuming that no failures occur, the two-phase commit protocol
(2PC) consists of the following two phases, each consisting of
two steps:
Phase I: Voting Phase
Step 1
• The coordinator sends a VOTE_REQUEST message to all participants.
Step 2
• When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.
Two-Phase Commit Protocol
Phase II: Decision Phase
Step 1
• The coordinator collects all votes from the participants.
• If all participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants.
• However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicasts a GLOBAL_ABORT message.
Step 2
• Each participant that voted for a commit waits for the final reaction by the coordinator.
• If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction.
• Otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
2PC Finite State Machines

The finite state machine for the coordinator in 2PC:
    INIT --[Commit / send Vote-request]--> WAIT
    WAIT --[Vote-abort received / send Global-abort]--> ABORT
    WAIT --[Vote-commit received from all / send Global-commit]--> COMMIT

The finite state machine for a participant in 2PC:
    INIT --[Vote-request / send Vote-abort]--> ABORT
    INIT --[Vote-request / send Vote-commit]--> WAIT
    WAIT --[Global-abort / send ACK]--> ABORT
    WAIT --[Global-commit / send ACK]--> COMMIT
2PC Algorithm
Actions by coordinator:

write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
Two-Phase Commit Protocol
Actions by participants:

write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received;    /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT { write GLOBAL_COMMIT to local log; }
    else if DECISION == GLOBAL_ABORT { write GLOBAL_ABORT to local log; }
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}
Two-Phase Commit Protocol
Actions for handling decision requests:    /* executed by a separate thread */

while true {
    wait until any incoming DECISION_REQUEST is received;    /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip;    /* participant remains blocked */
}
Objectives
Discussion on Fault Tolerance
 General background on fault tolerance
 Process resilience, failure detection, and reliable communication
 Atomicity and distributed commit protocols
 Recovery from failures
Recovery
 So far, we have mainly concentrated on algorithms that allow us to
tolerate faults
 However, once a failure has occurred, it is essential that the process
where the failure has happened can recover to a correct state
 In what follows we focus on:
 What it actually means to recover to a correct state
 When and how the state of a distributed system can be recorded and
recovered, by means of checkpointing and message logging
Recovery
 Error Recovery
 Checkpointing
 Message Logging
Error Recovery
 Once a failure has occurred, it is essential that the process where
the failure has happened can recover to a correct state
 Fundamental to fault tolerance is the recovery from an error
 The idea of error recovery is to replace an erroneous state with an
error-free state
 There are essentially two forms of error recovery:
1. Backward recovery
2. Forward recovery
1. Backward Recovery (1)
 In backward recovery, the main issue is to bring the system from its
present erroneous state back to a previously correct state
 It is necessary to record the system’s state from time to time onto a
stable storage, and to restore such a recorded state when things
go wrong
[Figure: stable storage, shown with a crash after drive 1 is updated and with a bad spot on a drive]
1. Backward Recovery (2)
 Each time (part of) the system’s present state is
recorded, a checkpoint is said to be made
 Problems with backward recovery:
 Restoring a system or a process to a previous state is generally
expensive in terms of performance
 Some states can never be rolled back (e.g., typing in UNIX rm -fr *)
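As a concrete, deliberately simplified illustration of checkpointing to stable storage and rolling back to it, here is a sketch in Python; the file name, the use of pickle, and the atomic-rename trick are assumptions of this sketch rather than anything mandated by the lecture.

import os, pickle

CHECKPOINT = "checkpoint.bin"

def take_checkpoint(state):
    """Write the state to a temporary file, flush it to disk, then atomically
    replace the previous checkpoint, so a crash never leaves a partial file."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CHECKPOINT)              # atomic rename on POSIX file systems

def recover():
    """Backward recovery: restore the most recently checkpointed state."""
    with open(CHECKPOINT, "rb") as f:
        return pickle.load(f)

# Usage sketch
state = {"balance": 100, "last_msg": 24}
take_checkpoint(state)
state["balance"] -= 40                       # work done since the last checkpoint...
# ...after a failure, that work is lost and we roll back:
state = recover()                            # balance is 100 again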
2. Forward Recovery
 When the system detects that it has made an error, forward recovery corrects the erroneous state (rather than rolling back to a previous one) so that the system can continue moving forward
 Forward recovery is typically faster than backward recovery
but requires that it has to be known in advance which errors
may occur
 Some systems make use of both forward and backward
recovery for different errors or different parts of one error
Recovery
 Error Recovery
 Checkpointing
 Message Logging
Why Checkpointing?
 In a fault-tolerant distributed system, backward recovery
requires that the system regularly saves its state onto a
stable storage
 This process is referred to as checkpointing
 In particular, checkpointing consists of storing a distributed snapshot of the current application state (i.e., a consistent global state) and, later on, using it to restart the execution in case of a failure
Recovery Line
 In a distributed snapshot, if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message; such local checkpoints, in which senders and receivers can both be identified, jointly form a distributed snapshot
[Figure: processes P and Q take local checkpoints over time; the most recent consistent distributed snapshot before a failure of P forms a recovery line, whereas a set of checkpoints that records the receipt of a message from Q to P without its sending is not a recovery line]
Checkpointing
 Checkpointing can be of two types:
 1. Independent checkpointing: each process simply records its local state from time to time in an uncoordinated fashion
 2. Coordinated checkpointing: all processes synchronize to jointly write their states to local stable storage (a minimal sketch follows below)
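The sketch below illustrates one simple, blocking, coordinator-driven variant of coordinated checkpointing (an assumption-laden toy, with invented names): the coordinator asks every process to take a tentative checkpoint while holding back outgoing application messages, and only after all acknowledgments arrive is the checkpoint made permanent, which keeps the saved global state consistent.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = {}
        self.tentative = None
        self.sending_blocked = False

    def on_checkpoint_request(self):
        """Phase 1: freeze outgoing application messages and take a tentative
        local checkpoint, then acknowledge to the coordinator."""
        self.sending_blocked = True
        self.tentative = dict(self.state)        # local snapshot
        return ("ACK", self.pid)

    def on_checkpoint_commit(self):
        """Phase 2: make the tentative checkpoint permanent and resume."""
        self.permanent_checkpoint = self.tentative
        self.tentative = None
        self.sending_blocked = False

def coordinated_checkpoint(coordinator_proc, others):
    # Phase 1: the coordinator asks every process (itself included) to checkpoint.
    procs = [coordinator_proc] + list(others)
    acks = [p.on_checkpoint_request() for p in procs]
    # Phase 2: once all acks are in, the tentative checkpoints form a consistent
    # global state, so tell everyone to make them permanent.
    if len(acks) == len(procs):
        for p in procs:
            p.on_checkpoint_commit()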
Domino Effect
 Independent checkpointing may make it difficult to find a recovery line, potentially leading to a domino effect resulting from cascaded rollbacks
[Figure: after a failure of P, each candidate pair of local checkpoints of P and Q turns out not to be a recovery line, forcing both processes to roll back further and further]
 With coordinated checkpointing, the saved state is automatically globally consistent; hence, the domino effect is inherently avoided
Recovery
 Error Recovery
 Checkpointing
 Message Logging
Why Message Logging?
 Considering that checkpointing is an expensive operation,
techniques have been sought to reduce the number of checkpoints,
but still enable recovery
 An important technique in distributed systems is message logging
 The basic idea is that if transmission of messages can be replayed,
we can still reach a globally consistent state but without having to
restore that state from stable storage
 In practice, the combination of having fewer checkpoints and
message logging is more efficient than having to take many
checkpoints
Message Logging
 Message logging can be of two types:
1. Sender-based logging: A process can log its messages before
sending them off
2. Receiver-based logging: A receiving process can first log an
incoming message before delivering it to the application
 When a sending or a receiving process crashes, it can restore the
most recently checkpointed state, and from there on replay the
logged messages (important for non-deterministic behaviors)
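A minimal sketch of sender-based logging combined with checkpointed receivers (the names and the sequence-number bookkeeping are assumptions of this sketch): the sender appends each message to its log before sending it, so that after a receiver restores its last checkpoint, the sender can replay the messages the receiver had consumed since that checkpoint.

class LoggingSender:
    def __init__(self, channel_send):
        self.channel_send = channel_send   # function(dest, seq, msg) doing the real send
        self.log = []                      # (dest, seq, msg), logged before sending

    def send(self, dest, msg):
        seq = len(self.log)
        self.log.append((dest, seq, msg))  # sender-based logging: log first...
        self.channel_send(dest, seq, msg)  # ...then send

    def replay_to(self, dest, since_seq):
        """After dest restores its checkpoint, re-send every logged message for
        dest from the checkpointed position onward."""
        for d, seq, msg in self.log:
            if d == dest and seq >= since_seq:
                self.channel_send(d, seq, msg)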
Replay of Messages and Orphan Processes
 Incorrect replay of messages after recovery can lead to orphan processes; this should be avoided
[Figure: P sends logged message M1 to Q; Q sends unlogged message M2 to R, which then sends M3. When Q crashes and later recovers, M1 is replayed, but M2 can never be replayed, so M3 becomes an orphan]
Objectives
Discussion on Fault Tolerance
 General background on fault tolerance
 Process resilience, failure detection, and reliable communication
 Atomicity and distributed commit protocols
 Recovery from failures
Next Class
Distributed File Systems - Part I

Thank You!