Broadcast Variants
why broadcasts?
• distributed systems are inherently group
oriented and hence it is more useful to
talk about one-to-all or one-to-many
communication, that is broadcast and
multicast within the broader context of
group communication
• most useful in database replication and
in the general case of state machine
replication – where every server replica
is expected to respond to the same
sequence of requests
Distributed Systems (DNR)
• compared to unicast communication,
the problems are made complex by
message ordering (at the receiving
end) and reliability (sending
process crashes) issues in
broadcast
• message ordering and reliability
are orthogonal to each other, and
hybrid models combining both often exist
* p1 and p2: p1 broadcasts in FIFO order and
the messages are received out of order
* p2 crashing midway
• message ordering definitions:
• FIFO order – if a process p sends m1 before
it sends m2, then m2 is not delivered at a
process q before m1 (easily implemented
using message sequence numbers)
• total order – if a process (correct or
faulty) p delivers a message m1 before m2,
then every process delivers m2 only after
it has delivered m1
• causal order – for every process q, if m1
happens before m2, then m2 is not
delivered at q before m1 is
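The remark that FIFO order is easily implemented with message sequence numbers can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the class name FifoReceiver and its fields are hypothetical, and sequence numbers are assumed to start at 0 for each sender.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in sequence-number order."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected number per sender
        self.held = defaultdict(dict)      # early arrivals, keyed by number
        self.delivered = []

    def receive(self, sender, seq, msg):
        self.held[sender][seq] = msg
        # Deliver for as long as the next expected message is available.
        while self.next_seq[sender] in self.held[sender]:
            m = self.held[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, m))
            self.next_seq[sender] += 1


q = FifoReceiver()
q.receive("p", 1, "m2")   # arrives early and is held back
q.receive("p", 0, "m1")   # releases m1, then m2
print(q.delivered)        # [('p', 'm1'), ('p', 'm2')]
```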
• causal ordering ⇒ single source FIFO
ordering
• total ordering ⇏ FIFO or causal ordering
• a combination such as FIFO-total order
broadcast (which enforces single source
FIFO) or causal-total order broadcast
(which preserves causality) is possible
[Figure: processes p1, p2 and p3 exchanging messages m1, m2 and m3]
m1 → m2 (FIFO) and m1 → m3 (causal) are
maintained in the total order
• we will discuss:
– best effort broadcast (BEBcast)
– reliable broadcast (RBcast)
– terminating reliable broadcast (TRBcast)
– uniform reliable broadcast (URBcast)
– (uniform reliable) causal order broadcast
(COBcast)
– (uniform reliable) total order broadcast
(ABcast, or atomic broadcast)
assumptions
• groups are static: dynamic groups are not
addressed here
• processes will not have access to stable
storage (no fail-recovery)
• the system is asynchronous and, at the
network level, communication is point-to-point
• fail-stop processes unless otherwise
stated
• channels – two interpretations of the
liveness criterion:
• reliable channel – a reliable channel
between processes p and q ensures the
following: if p executes send(m) and q is
correct, then q eventually receives m
• quasi reliable channel – a quasi reliable
channel between processes p and q ensures
the following: if p and q are correct and
p executes send(m), then q eventually
receives m
• reliable vs. quasi-reliable:
• let process q be correct; a reliable
channel implies if p executes send(m) at
time t, and crashes at time t+1, then q
must eventually receive m, a useful model
of a shared persistent space
• a quasi reliable channel is weaker – both
p and q must be correct at the same time,
a useful model of TCP with error recovery
Best effort broadcast (BEBcast)
• burden of ensuring reliability is only on
the sender: as long as the sender of a
message does not crash, the properties of
a quasi reliable channel ensure that all
correct processes eventually deliver the
message
• operations:
• at p, BEBcast(m): for every process q ≠ p,
send(m) by reliable unicast
• on receive(m) at q: BEBdeliver(m) at q
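The two operations can be sketched in a few lines. This is an illustrative single-threaded model, not a network implementation: the Process class and the crash_after parameter are hypothetical, and a "send" is modelled as a direct method call. It shows why the guarantee depends on the sender surviving the whole loop.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.crashed = False
        self.delivered = []          # (sender, message) pairs BEBdelivered

    def beb_deliver(self, sender, msg):
        self.delivered.append((sender, msg))


def beb_cast(sender, group, msg, crash_after=None):
    """Unicast msg to every q != sender.

    crash_after (hypothetical knob): the sender crashes once this many
    sends have completed, modelling a failure partway through the loop.
    """
    sent = 0
    for q in group:
        if q is sender:
            continue
        if crash_after is not None and sent == crash_after:
            sender.crashed = True
            return
        if not q.crashed:
            q.beb_deliver(sender.pid, msg)   # receive(m) triggers BEBdeliver(m)
        sent += 1


p1, p2, p3 = Process("p1"), Process("p2"), Process("p3")
group = [p1, p2, p3]
beb_cast(p1, group, "m", crash_after=1)      # p1 reaches p2, then crashes
print(p2.delivered, p3.delivered)            # [('p1', 'm')] []
```

With the sender crashing midway, p2 and p3 end up disagreeing, which is exactly the case best effort broadcast does not cover.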
• transport level mechanisms: reliable
unicast by TCP (ack-implosion problem) or
IP multicast
• properties:
• validity (a liveness property)– for any
two correct processes p and q, every
message broadcast by p is eventually
delivered by q
• integrity (a safety property)– for any
message m, every correct process q
delivers m at most once, and only if m
was previously broadcast by some process
p
Reliable broadcast (RBcast)
• in best effort broadcast, if the sender
fails immediately after broadcasting to
all, end-to-end error recovery is not
possible, so the correct processes might
disagree on whether or not to deliver the
message
• reliable broadcast ensures that correct
processes agree on the messages they
deliver even when the sender crashes,
i.e., it adheres to the properties of a
reliable channel
• reliable broadcast is built on top of
best-effort broadcast + failure detector
abstraction
• operations:
• at p, RBcast(m) ⇒ BEBcast(m)
• at q, BEBdeliver(m) ⇒ RBdeliver(m)
• if q (unreliably) detects that p has
crashed, then q executes BEBcast(m)
• note – retransmissions received by other
correct processes must be handled as
duplicates
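A minimal sketch of these operations, under simplifying assumptions: sends are direct method calls, the failure detector is driven by hand through a hypothetical on_crash_detected callback, and the "only" parameter (also hypothetical) restricts the recipients to model a crash partway through the broadcast. Messages are identified by the (origin, message) pair so that relayed copies can be recognised as duplicates.

```python
class Node:
    def __init__(self, pid, net):
        self.pid, self.net = pid, net
        self.crashed = False
        self.rb_delivered = []      # (origin, msg) pairs, in delivery order
        self.seen = set()           # for duplicate filtering
        net[pid] = self

    def beb_cast(self, origin, msg, only=None):
        for pid, q in self.net.items():
            if pid != self.pid and not q.crashed and (only is None or pid in only):
                q.beb_deliver(origin, msg)

    def rb_cast(self, msg, only=None):
        self.beb_deliver(self.pid, msg)      # deliver locally
        self.beb_cast(self.pid, msg, only)

    def beb_deliver(self, origin, msg):
        if (origin, msg) in self.seen:       # duplicate from a relay: ignore
            return
        self.seen.add((origin, msg))
        self.rb_delivered.append((origin, msg))

    def on_crash_detected(self, crashed_pid):
        # Relay every message already delivered from the suspected process.
        for origin, msg in list(self.rb_delivered):
            if origin == crashed_pid:
                self.beb_cast(origin, msg)


net = {}
p1, p2, p3 = Node("p1", net), Node("p2", net), Node("p3", net)
p1.rb_cast("m", only={"p2"})     # p1 reaches only p2 ...
p1.crashed = True                # ... and then crashes
p2.on_crash_detected("p1")       # p2 relays m on p1's behalf
p3.on_crash_detected("p1")       # p3's relay reaches p2 as a duplicate
print(p2.rb_delivered, p3.rb_delivered)
```

After the relay both correct processes hold m exactly once.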
• properties:
• validity – if a correct process p
broadcasts a message m, then p eventually
delivers m
• integrity – for a message m, a correct
process q delivers m at most once and
only if m was previously broadcast by
some process p
• agreement (a liveness property)– if a
correct process p delivers a message m,
then m is eventually delivered by every
correct process q
• Is the following run acceptable?
• process p executes RBcast(m) and later
crashes; some process q RBdelivers m and
then crashes; all other processes are
correct, but none of them RBdelivers m
• yes: p crashes, so validity (which binds
only a correct sender) is not violated; q
crashes, so agreement (which binds only
correct processes that deliver) is not
violated either
uniform reliable broadcast
(URBcast)
• consider the scenario discussed earlier:
process p1 executes RBcast(m) and later
crashes; some process p2 RBdelivers m and
then crashes; all other processes are
correct, but none of them RBdelivers m
• this run satisfies reliable broadcast,
yet it seems to be lacking: a process has
delivered m, but no correct process ever
does
• the problem is that q RBdelivers m at once
and rebroadcasts it only if the failure of
the source is detected
• URBcast ensures that a process (correct
or not) delivers the message only when it
knows that the message has been seen
(BEBdelivered) by all correct processes
• the uniform property matters when
processes interact with the outside
world: the fact that a process delivered
a message is significant even if it
crashes afterwards, because before
crashing it may have communicated with
the external world; the other processes
must be aware of this situation
• agreement property replaced by uniform
agreement – if some process (correct or
not) p delivers a message m, then m is
eventually delivered by every correct
process q
• the reliable channel assumption holds:
if p executes send(m) to q and q is
correct, then q eventually receives m
• operations:
• at p, URBcast(m) ⇒ BEBcast(m)
• at q, BEBdeliver(m): if m is received by q
for the first time and q ≠ p, then
BEBcast(m) followed by URBdeliver(m)
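The relay-before-deliver rule can be sketched as below. The model is an assumption-laden illustration: sends are synchronous method calls (so a completed relay stands in for the reliable channel guarantee), the "only" parameter is a hypothetical way to make the originator crash after reaching a subset, and the originator is assumed to deliver its own message once its broadcast loop completes.

```python
class UrbNode:
    def __init__(self, pid, net):
        self.pid, self.net = pid, net
        self.crashed = False
        self.seen = set()
        self.urb_delivered = []
        net[pid] = self

    def beb_cast(self, origin, msg, only=None):
        for pid, q in list(self.net.items()):
            if pid != self.pid and not q.crashed and (only is None or pid in only):
                q.beb_deliver(origin, msg)

    def urb_cast(self, msg, only=None):
        self.seen.add((self.pid, msg))
        self.beb_cast(self.pid, msg, only)
        if only is None:
            self.urb_delivered.append((self.pid, msg))
        else:
            self.crashed = True               # modelled crash partway through

    def beb_deliver(self, origin, msg):
        if (origin, msg) in self.seen:        # not the first time: ignore
            return
        self.seen.add((origin, msg))
        self.beb_cast(origin, msg)            # relay first ...
        self.urb_delivered.append((origin, msg))   # ... then deliver


net = {}
p1, p2, p3 = UrbNode("p1", net), UrbNode("p2", net), UrbNode("p3", net)
p1.urb_cast("m", only={"p2"})    # p1 reaches only p2, then crashes
p2.crashed = True                # p2 crashes after having delivered m
print(p2.urb_delivered, p3.urb_delivered)
```

Because p2 relayed before delivering, the correct process p3 also delivers m even though both p1 and p2 have crashed.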
Causal order broadcast (COBcast)
• reliable broadcast does not guarantee any
ordering among messages delivered by
different processes
• single source FIFO ordering is a special
case of causal ordering where messages
from the same process should be delivered
in the order they were broadcast
• practical scenario:
• on a publish-subscribe whiteboard, p1
broadcasts a proposal m1 to all; p2 sees
it and replies to all with a comment m2
• here m1 → m2
• due to arbitrary delay, p3 receives m2
before m1 and has to withhold m2 until
m1 has been delivered
• a suitable ‘middleware’ for causal
ordering would relieve the programmer
from performing such a task
• we say that a message m1 may potentially
have caused another message m2 (or m1 →
m2), if any of the following applies
• m1 and m2 were broadcast by the same
process p and m1 was broadcast before m2
• m1 was delivered by process p, m2 was
broadcast by process p, and m2 was
broadcast after the delivery of m1
• there exists some message m’ such that
m1 → m’ and m’ → m2
• additional property:
• causal delivery – no process p delivers a
message m2 unless p has already delivered
every message m1 such that m1 → m2
• causally ordered broadcast can be
achieved in the presence of crash
failures
• when RBcast is replaced by URBcast, we
get a uniform reliable causally ordered
broadcast
• two implementations are discussed:
no-waiting causal broadcast
• whenever a process RBdelivers m, it
COdelivers m without waiting for other
messages to be RBdelivered
• algorithm outline:
• each message m carries a control field
pastm which includes all messages that
causally precede m
• when a message m is RBdelivered, pastm is
inspected first: all messages in pastm
that have not yet been COdelivered must
be COdelivered before m itself is
COdelivered
• each process memorises all messages it
has COBcast or COdelivered in a variable
past_list
• past_list and pastm are ordered sets
at pi:
init: past_list = delivered_list = empty;
upon <COBcast(m)> {
  RBcast(m, past_list);
  past_list = past_list ∪ {m};}
upon <RBdeliver(pj, pastm, m)>
  if (m ∉ delivered_list) then {
    for all messages m’ ∈ pastm not delivered
    so far {
      COdeliver(m’) in deterministic order;
      delivered_list = delivered_list ∪ {m’};
      past_list = past_list ∪ {m’};}
    COdeliver(pj, m);
    delivered_list = delivered_list ∪ {m};
    past_list = past_list ∪ {m};}
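The no-waiting algorithm can be tried out with the sketch below. It is a simplified model under stated assumptions: RBcast is replaced by handing the returned packet to each receiver by hand, a message is identified by its (sender, message) pair, and the names CoNode, co_bcast and rb_deliver are hypothetical.

```python
class CoNode:
    def __init__(self, pid):
        self.pid = pid
        self.past_list = []      # ordered causal past, as (sender, msg) pairs
        self.delivered = []      # COdelivered (sender, msg) pairs, in order

    def _record(self, entry):
        if entry not in self.past_list:
            self.past_list.append(entry)

    def co_bcast(self, msg):
        """Return the packet (sender, past, msg) that RBcast would carry."""
        packet = (self.pid, list(self.past_list), msg)
        self._record((self.pid, msg))
        return packet

    def rb_deliver(self, packet):
        sender, past, msg = packet
        if (sender, msg) in self.delivered:
            return                       # already delivered via another past
        # Deliver the missing causal past first, then the message itself.
        for entry in past + [(sender, msg)]:
            if entry not in self.delivered:
                self.delivered.append(entry)
                self._record(entry)


p1, p2, p4 = CoNode("p1"), CoNode("p2"), CoNode("p4")
pk1 = p1.co_bcast("m1")
p2.rb_deliver(pk1)
pk2 = p2.co_bcast("m2")          # m2 carries m1 in its past
p4.rb_deliver(pk2)               # m2 arrives first: m1 then m2 are delivered
p4.rb_deliver(pk1)               # the late copy of m1 is discarded
print(p4.delivered)              # [('p1', 'm1'), ('p2', 'm2')]
```

The packet for m2 already contains m1, which is the message-size cost of never waiting.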
• in the figure above, p4 RBdelivers m2 first,
but since the message carries m1 in its
pastm, m1 and m2 are COdelivered in order;
finally, when m1 is RBdelivered from p1, it
is discarded
• weakness: large message size due to the
past causal history carried
waiting causal order broadcast
• instead of keeping a record of all past
messages, history is now represented by
vector clocks
• vector clocks essentially capture the
causal precedence between messages
• as before, waiting COBcast relies on the
underlying RBcast and RBdeliver
primitives
• every process p maintains a vector clock
that records the number of messages
that p has COdelivered from every other
process, i.e., VCp[j], j=1..n, j ≠ p, and
the number of messages it has itself
COBcast, i.e., VCp[p]
• this vector is then attached to every
message m that p COBcasts
• a process q that RBdelivers m interprets
this vector time stamp to determine how
many messages are missing (if any), and
from which process
• from p itself, these are the VCp[p]-1
messages that p broadcast before m; from
every other process k ≠ p, they are the
VCp[k] messages that p had delivered
before it sent m
• process q needs to COdeliver all these
missing messages before it can COdeliver
m
• at p2, the vector time stamp [0,2,0]
implies that one message from p1 is still
pending, one message from p1 has been
RBdelivered but awaits COdeliver, and
none is expected from p0
at pi:
init: pending = empty; ∀ i,j: VCi[j] = 0; pending
list ordered in increasing order of vector time
upon COBcast(m) {
  COdeliver(pi, m); // receive locally
  VCi[i]++; // count m itself before stamping it
  RBcast(VCi, pi, m);}
upon RBdeliver(VCj, pj, m) {
  for i ≠ j augment pending with (VCj, pj, m);
  // ignore messages from self
  wait until VCj[j] = VCi[j]+1 and ∀ k ≠ j: VCj[k] ≤
  VCi[k]; {
    remove (VCj, pj, m) from pending;
    COdeliver(pj, m);
    VCi[j]++;}
}
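The waiting algorithm can be exercised with the following sketch. It rests on a few assumptions: process ids are the integers 0..n-1, RBcast is modelled by passing the returned packet to receivers by hand, and the sender's own entry is incremented before the clock is attached, so that a time stamp counts the message it travels with. The names VcNode, co_bcast and rb_deliver are hypothetical.

```python
class VcNode:
    def __init__(self, pid, n):
        self.pid, self.vc = pid, [0] * n
        self.pending = []          # packets RBdelivered but not COdelivered
        self.delivered = []        # COdelivered (sender, msg) pairs

    def co_bcast(self, msg):
        self.vc[self.pid] += 1                     # count this broadcast
        self.delivered.append((self.pid, msg))     # deliver locally
        return (list(self.vc), self.pid, msg)      # packet handed to RBcast

    def _deliverable(self, ts, j):
        # Next message from j, and nothing it depends on is still missing.
        return ts[j] == self.vc[j] + 1 and all(
            ts[k] <= self.vc[k] for k in range(len(ts)) if k != j)

    def rb_deliver(self, packet):
        if packet[1] == self.pid:
            return                                  # ignore own messages
        self.pending.append(packet)
        progress = True
        while progress:                             # drain what became deliverable
            progress = False
            for p in list(self.pending):
                ts, j, msg = p
                if self._deliverable(ts, j):
                    self.pending.remove(p)
                    self.delivered.append((j, msg))
                    self.vc[j] += 1
                    progress = True


p0, p1, p2 = VcNode(0, 3), VcNode(1, 3), VcNode(2, 3)
pk1 = p0.co_bcast("m1")          # stamped [1, 0, 0]
p1.rb_deliver(pk1)
pk2 = p1.co_bcast("m2")          # stamped [1, 1, 0]: depends on m1
p2.rb_deliver(pk2)               # held back: m1 from p0 is missing
p2.rb_deliver(pk1)               # m1 is delivered, which releases m2
print(p2.delivered)              # [(0, 'm1'), (1, 'm2')]
```

Only a vector of n integers travels with each message, at the price of holding m2 until m1 arrives.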
Total order broadcast (TOBcast)
• causal order broadcast enforces a global
ordering for all messages that are
causally dependent on each other
• messages that are not so related are said
to be concurrent and could be delivered in
any order
• a total order abstraction orders all
messages, even those that are concurrent
• it is sometimes possible to have a total
order that does not respect causal order
• a convenient abstraction for managing
replicated state machines (e.g., in fault
tolerant servers)
• totally ordered reliable broadcast cannot
be achieved in the presence of crash
failures when the underlying
communication is asynchronous
• this is because totally ordered broadcast
is equivalent to consensus; recall that
consensus cannot be solved
deterministically in an asynchronous
system with failures (FLP result)
• assumptions: asynchronous with no process
failures, or synchronous with fail-stop
processes
• how do we achieve causal-total order
broadcast?
• properties:
• validity – if a correct process p
broadcasts a message m, then p eventually
delivers m
• integrity – for a message m, a correct
process q delivers m at most once, and
only if m was previously broadcast by
some process p
• uniform agreement (atomicity in delivery)
– if a process p delivers a message m,
then m is eventually delivered by every
correct process q
• uniform total order (an order property) –
if a process (correct or faulty) p
delivers a message m1 before m2, then
every process delivers m2 only after it
has delivered m1.
• algorithm 1 – asynchronous with no
process failures
• assume reliable channels (the stronger
condition, acceptable under the no-failure
assumption) with single source FIFO
delivery (each process stamps sequence
numbers)
• each process maintains an increasing
counter, a time stamp, which is attached
to every message it broadcasts
• each process also maintains a vector with
estimates of the time stamps of all
others
• suppose ts[j] is the vector element that
corresponds to pj on pi; it says that pi
will never again receive a message from
pj with a time stamp smaller than or
equal to this value
• processes use special time stamp update
messages to keep the estimates current
• RBdelivered messages are queued in a
pending list in increasing order of
<ts(m), pid> pairs, with pid used to
break ties
• ABdeliver can be done for the message at
the head of the pending list once its
time stamp is no greater than any element
of the process's current vector
at pi: (0 ≤ i ≤ n-1)
init ts[j] = 0; (0 ≤ j ≤ n-1); pending = empty;
ABcast(m) {
  ts[i]++; add (m, ts[i], pi) to pending;
  RBcast(m, ts[i], pi);}
upon RBdeliver(m, ts(msg), pj), j ≠ i { // ignore self msg
  ts[j] = ts(msg);
  add (m, ts(msg), pj) to pending;
  if (ts(msg) > ts[i]) then {
    ts[i] = ts(msg); RBcast(new_ts, ts[i], pi);}}
upon RBdeliver(new_ts, ts(new_ts), pj), j ≠ i // ignore
self msg
  ts[j] = ts(new_ts);
delivery_test() // at any time
  while (m, ts(msg), pj) at head of pending list and
  ∀ k: ts(msg) ≤ ts[k] {
    remove (m, ts(msg), pj) from pending;
    ABdeliver(m);}
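A small simulation of the time stamp algorithm, offered as a sketch under the stated no-failure assumption. The network is modelled as a single global FIFO queue (which preserves per-sender FIFO order), and the names TsNode, ab_cast and receive are hypothetical.

```python
from collections import deque


class TsNode:
    def __init__(self, pid, n, network):
        self.pid, self.n, self.network = pid, n, network
        self.ts = [0] * n              # ts[j]: latest time stamp known for pj
        self.pending = []              # (time stamp, sender pid, message)
        self.delivered = []

    def _rbcast(self, kind, payload):
        for q in range(self.n):
            if q != self.pid:
                self.network.append((q, kind, payload, self.ts[self.pid], self.pid))

    def ab_cast(self, msg):
        self.ts[self.pid] += 1
        self.pending.append((self.ts[self.pid], self.pid, msg))
        self._rbcast("msg", msg)
        self._delivery_test()

    def receive(self, kind, payload, stamp, sender):
        self.ts[sender] = stamp
        if kind == "msg":
            self.pending.append((stamp, sender, payload))
            if stamp > self.ts[self.pid]:
                self.ts[self.pid] = stamp
                self._rbcast("new_ts", None)   # tell the others our clock moved
        self._delivery_test()

    def _delivery_test(self):
        self.pending.sort()                    # order by (time stamp, pid)
        while self.pending and all(self.pending[0][0] <= t for t in self.ts):
            self.delivered.append(self.pending.pop(0)[2])


network = deque()
nodes = [TsNode(i, 3, network) for i in range(3)]
nodes[0].ab_cast("a")
nodes[2].ab_cast("b")                          # concurrent with "a"
while network:
    dest, kind, payload, stamp, sender = network.popleft()
    nodes[dest].receive(kind, payload, stamp, sender)
print([node.delivered for node in nodes])
```

Both broadcasts carry time stamp 1, so the tie is broken by process id and every node delivers "a" before "b".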
total order broadcast with time stamps
Total order broadcast by consensus
• uses reliable broadcast and
consensus as building blocks
• messages are first disseminated
using a reliable broadcast
primitive and are stored in a bag
of unordered messages at every
process
• processes then use consensus to
order the messages in the bag
• algorithm works in rounds
• there is one consensus instance per round
• messages to be delivered in a round are
agreed upon before proceeding to next
round
• RBcast can be replaced with URBcast to
give ‘uniform total order broadcast’
• algorithm 2 – synchronous with fail-stop
processes
init:
unordered = delivered = empty;
round = 1; wait = false;
TOBcast(m) {
  RBcast(m);}
upon RBdeliver(m) {
  if (m ∉ delivered) then
    unordered = unordered ∪ {m};}
upon ((unordered ≠ empty) ∧ (wait = false)) {
  wait = true;
  propose(round, unordered);} // propose() and
  // decide() are consensus primitives
upon (m’ ← decide(round)) { // may take f+1 rounds
  // in case of failures; m’ is the decided set
  delivered = delivered ∪ m’;
  unordered = unordered \ m’;
  TOdeliver(m’); // in a deterministic order
  round++; wait = false;}
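The round structure can be sketched as follows. The consensus primitive is replaced by a toy stand-in (the first value proposed in a round becomes the decision), which is an assumption made only to keep the example runnable; the names Consensus, ToNode and step are hypothetical, and RBdeliver is driven by hand.

```python
class Consensus:
    """Toy stand-in: the first value proposed in a round is the decision."""

    def __init__(self):
        self.decisions = {}

    def propose(self, rnd, value):
        return self.decisions.setdefault(rnd, value)


class ToNode:
    def __init__(self, consensus):
        self.consensus = consensus
        self.unordered = set()     # RBdelivered but not yet ordered
        self.delivered = []        # TOdelivered messages, in order
        self.round = 1

    def rb_deliver(self, msg):
        if msg not in self.delivered:
            self.unordered.add(msg)

    def step(self):
        """Run one consensus round if there is anything to order."""
        if not self.unordered:
            return
        batch = self.consensus.propose(self.round, tuple(sorted(self.unordered)))
        for msg in batch:                      # deterministic order in a batch
            if msg not in self.delivered:
                self.delivered.append(msg)
        self.unordered -= set(batch)
        self.round += 1


c = Consensus()
a, b = ToNode(c), ToNode(c)
a.rb_deliver("y")
a.step()                           # round 1 decides ("y",)
for node in (a, b):
    node.rb_deliver("x")
b.rb_deliver("y")
for node in (a, b):
    while node.unordered:
        node.step()
print(a.delivered, b.delivered)    # ['y', 'x'] ['y', 'x']
```

Although b sees both messages before its first round, it adopts the round 1 decision and so delivers in the same order as a.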
Terminating reliable broadcast
(TRBcast)
• uniform reliable broadcast says that if
some process (correct or not) p delivers
a message m, then m is eventually
delivered by every correct process q
• however, q cannot decide whether it
should wait for m or not; q has no means
to distinguish the case where some
process has delivered m, and where q can
indeed wait for m, from the case where no
process will ever deliver m, in which
case q should definitely not keep waiting
for m
• suppose a process r URBcasts a message m
but crashes while doing so, and another
process p detects that r has crashed
without having seen m
• this does not mean that m was not
broadcast
• this nuance is captured by terminating
reliable broadcast
• TRBcast ensures precisely that every
process q either delivers the message m
or some indication F that m will never be
delivered (by any process); abstraction
is defined for a specific originator
process src
• properties:
• validity – if the sender src is correct
and broadcasts a message m, then src
eventually delivers m
• integrity – if a correct process delivers
a message m then either m=F or m was
previously broadcast by src
• uniform agreement – if any process
delivers a message m, then m is
eventually delivered by every correct
process
• assumptions: synchronous with fail stop
processes
• underlying abstractions – a perfect
failure detector, consensus, best effort
broadcast
• the source src identifies itself as the
originator in the message m that it
best-effort broadcasts to all
• a participant joins the TRBcast by
broadcasting a special null message
• every process waits until it either gets
a message broadcast by the sender or
detects the crash of sender
• all processes run a consensus instance to
agree on whether to deliver m or the
failure notification F
init:
proposal = decision = null;
TRBcast(m, psrc)
  BEBcast(m);
upon (BEBdeliver(m, psrc) ∧ (proposal = null))
  proposal = m; propose(proposal);
upon ((psrc_crash) ∧ (proposal = null))
  proposal = Fsrc; propose(proposal);
upon decide(decision) // consensus round
  TRBdeliver(decision, psrc);
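The two proposal paths can be sketched as below. As an assumption for illustration, consensus is again a toy stand-in in which the first proposal wins, the perfect failure detector is driven by hand through a hypothetical on_src_crash callback, and F is represented by the string "F".

```python
F = "F"    # failure indication delivered when src is known to have crashed


class TrbConsensus:
    """Toy stand-in: the first proposal becomes the decision."""

    def __init__(self):
        self.decision = None

    def propose(self, value):
        if self.decision is None:
            self.decision = value
        return self.decision


class TrbNode:
    def __init__(self, consensus):
        self.consensus = consensus
        self.proposal = None
        self.delivered = None          # value TRBdelivered for this instance

    def _propose(self, value):
        if self.proposal is None:      # propose at most once
            self.proposal = value
            self.delivered = self.consensus.propose(value)

    def beb_deliver(self, msg):
        self._propose(msg)             # heard from src: propose its message

    def on_src_crash(self):
        self._propose(F)               # detected the crash first: propose F


# Case 1: src reaches q1 before crashing; q2 only sees the crash.
c1 = TrbConsensus()
q1, q2 = TrbNode(c1), TrbNode(c1)
q1.beb_deliver("m")
q2.on_src_crash()

# Case 2: src crashes before reaching anyone.
c2 = TrbConsensus()
r1, r2 = TrbNode(c2), TrbNode(c2)
r1.on_src_crash()
r2.on_src_crash()
print(q1.delivered, q2.delivered, r1.delivered, r2.delivered)   # m m F F
```

In both cases every process delivers the same value, either m or F, so no process is left waiting.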
[Scanned figures in the slides have been
extracted from the text books of
R. Guerraoui and H. Attiya]