Broadcast Variants
Distributed Systems (DNR)

why broadcasts?
• distributed systems are inherently group oriented, so it is more useful to talk about one-to-all or one-to-many communication, that is, broadcast and multicast, within the broader context of group communication
• most useful in database replication and in the general case of state machine replication, where every server replica is expected to respond to the same sequence of requests
• compared to unicast communication, broadcast is complicated by message ordering (at the receiving end) and reliability (sending process crashes) issues
• message ordering and reliability are orthogonal to each other, and hybrid models often exist

[Figure: p1 and p2, with p1 broadcasting in FIFO order and the messages received out of order; p2 crashing in the midst]

message ordering definitions:
• FIFO order: if a process p sends m1 before it sends m2, then m2 is not delivered at a process q before m1 (easily implemented using message sequence numbers)
• total order: if a process (correct or faulty) p delivers a message m1 before m2, then every process delivers m2 only after it has delivered m1
• causal order: for every process q, if m1 happens before m2 (m1 → m2), then m2 is not delivered at q before m1 is

• causal ordering ⇒ single source FIFO ordering
• total ordering ⇏ FIFO or causal ordering
• a combination of FIFO-total order broadcast (which enforces single source FIFO), or causal-total order broadcast (which preserves causality), is possible

[Figure: processes p1, p2, p3 and messages m1, m2, m3; m1 → m2 (FIFO) and m1 → m3 (causal) are maintained in the total order]

we will discuss:
- best effort broadcast (BEBcast)
- reliable broadcast (RBcast)
- terminating reliable broadcast (TRBcast)
- uniform reliable broadcast (URBcast)
- (uniform reliable) causal order broadcast (COBcast)
- (uniform reliable) total order broadcast (ABcast, or atomic broadcast)

assumptions
• groups are static: dynamic groups are not addressed here
• processes do not have access to stable storage (no fail-recovery)
• the system is asynchronous and, at the network level, communication is point-to-point
• processes are fail-stop unless otherwise stated

channels: two interpretations of the liveness criterion
• reliable channel: a reliable channel between processes p and q ensures the following: if p executes send(m) and q is correct, then q eventually receives m
• quasi reliable channel: a quasi reliable channel between processes p and q ensures the following: if p and q are correct and p executes send(m), then q eventually receives m

reliable vs. quasi-reliable:
• let process q be correct; a reliable channel implies that if p executes send(m) at time t and crashes at time t+1, then q must still eventually receive m; this is a useful model of a shared persistent space
• a quasi reliable channel is weaker: both p and q must be correct, which makes it a useful model of TCP with error recovery

Best effort broadcast (BEBcast)
• the burden of ensuring reliability is only on the sender: as long as the sender of a message does not crash, the properties of a quasi reliable channel ensure that all correct processes eventually deliver the message
• operations:
  - at p, BEBcast(m): for every process q ≠ p, send(m) by reliable unicast
  - on receive(m) at q: BEBdeliver(m) at q
• transport level mechanisms: reliable unicast by TCP (ack-implosion problem) or IP multicast
• properties:
  - validity (a liveness property): for any two correct processes p and q, every message broadcast by p is eventually delivered by q
  - integrity (a safety property): for any message m, every correct process q delivers m at most once, and only if m was previously broadcast by some process p
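
The BEBcast operations above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the names `Process`, `beb_cast`, `msg_id` and `crash_after` are hypothetical and not from the slides, and a direct method call stands in for reliable unicast.

```python
class Process:
    """A group member; `delivered` is its BEBdeliver log."""

    def __init__(self, pid):
        self.pid = pid
        self.crashed = False
        self.seen = set()      # message ids already delivered (integrity: at most once)
        self.delivered = []    # (sender pid, payload) in delivery order

    def receive(self, sender_pid, msg_id, payload):
        # on receive(m) at q: BEBdeliver(m), unless q crashed or m is a duplicate
        if self.crashed or msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.delivered.append((sender_pid, payload))


def beb_cast(sender, group, msg_id, payload, crash_after=None):
    """BEBcast(m): unicast m to every q != p.

    `crash_after` (an assumption of this sketch) makes the sender crash after
    that many unicasts, showing that reliability rests on the sender alone.
    """
    if sender.crashed:
        return
    sent = 0
    for q in group:
        if q is sender:
            continue
        if crash_after is not None and sent == crash_after:
            sender.crashed = True   # crash in the midst of the broadcast
            return
        q.receive(sender.pid, msg_id, payload)
        sent += 1


group = [Process(i) for i in range(3)]
beb_cast(group[0], group, "m1", "hello")               # sender stays up
beb_cast(group[0], group, "m2", "bye", crash_after=1)  # sender crashes midway
```

After the second call only one receiver holds "bye": best effort broadcast gives no agreement once the sender crashes.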
Reliable broadcast (RBcast)
• in best effort broadcast, if the sender fails immediately after broadcasting to all, end-to-end error recovery is not possible, and the correct processes might disagree on whether or not to deliver the message
• reliable broadcast ensures that correct processes agree on the messages they deliver even when the sender crashes, i.e., it adheres to the properties of a reliable channel
• reliable broadcast is built on top of best-effort broadcast + a failure detector abstraction
• operations:
  - at p, RBcast(m): BEBcast(m)
  - at q, on BEBdeliver(m): RBdeliver(m)
  - if q (unreliably) detects that p has crashed, then q executes BEBcast(m)
  - note: other correct processes that receive the retransmission must handle duplicates properly
• properties:
  - validity: if a correct process p broadcasts a message m, then p eventually delivers m
  - integrity: for a message m, a correct process q delivers m at most once, and only if m was previously broadcast by some process p
  - agreement (a liveness property): if a correct process p delivers a message m, then m is eventually delivered by every correct process q
• is the following run acceptable?
  - process p executes RBcast(m) and later crashes; some process q RBdelivers m and then crashes; all other processes are correct, but none of them RBdelivers m
  - process p executes RBcast(m) and later crashes: validity is not violated
  - process q RBdelivers m and then crashes: agreement is not violated, since q is not a correct process

uniform reliable broadcast (URBcast)
• consider the scenario discussed earlier: process p1 executes RBcast(m) and later crashes; some process p2 RBdelivers m and then crashes; all other processes are correct, but none of them RBdelivers m; the run satisfies reliable broadcast, yet it seems to be lacking in some respect
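
The RBcast operations and the run discussed above can be replayed in a small simulation. This is a minimal sketch, assuming an in-memory network and a failure detector that notifies every process at crash time; `Net`, `Node`, `reach` and `on_suspect` are hypothetical names, not from the slides.

```python
class Node:
    def __init__(self, pid, net):
        self.pid, self.net = pid, net
        self.crashed = False
        self.delivered = {}   # msg_id -> (origin pid, payload): the RBdeliver log

    def rb_cast(self, msg_id, payload, reach=None):
        """RBcast(m) = BEBcast(m); `reach` limits the recipients to model a
        sender that crashes in the midst of the broadcast (sketch assumption)."""
        self.delivered[msg_id] = (self.pid, payload)   # sender delivers its own m
        self.net.beb_cast(self, (self.pid, msg_id, payload), reach)

    def on_beb_deliver(self, msg):
        origin, msg_id, payload = msg
        if self.crashed or msg_id in self.delivered:
            return                                     # drop duplicates
        self.delivered[msg_id] = (origin, payload)     # RBdeliver(m)

    def on_suspect(self, crashed_pid):
        """Failure detector output: relay every message delivered from the suspect."""
        if self.crashed:
            return
        for msg_id, (origin, payload) in list(self.delivered.items()):
            if origin == crashed_pid:
                self.net.beb_cast(self, (origin, msg_id, payload))


class Net:
    def __init__(self, n):
        self.nodes = [Node(i, self) for i in range(n)]

    def beb_cast(self, sender, msg, reach=None):
        for q in self.nodes:
            if q is not sender and (reach is None or q.pid in reach):
                q.on_beb_deliver(msg)

    def crash(self, pid):
        self.nodes[pid].crashed = True
        for q in self.nodes:
            q.on_suspect(pid)


# Run A: p0 reaches only p1 and crashes; p1 is correct and relays, so agreement holds.
a = Net(4)
a.nodes[0].rb_cast("m", "x", reach={1})
a.crash(0)

# Run B (the run from the slides): p1 RBdelivers m, crashes, then p0's crash is detected.
b = Net(4)
b.nodes[0].rb_cast("m", "x", reach={1})
b.crash(1)
b.crash(0)
```

In run B the only processes that delivered m are faulty, so no RBcast property is broken even though the correct processes never see m.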
• the problem is that q RBdelivers m first and only takes a step to rebroadcast if the failure of the source is detected
• URBcast ensures that a process (correct or not) delivers the message only when it knows that the message has been seen (BEBdelivered) by all correct processes
• the URB property is important when, say, processes interact with the outside world: the fact that a process has delivered a message matters even if it crashes afterwards, because before crashing it might have communicated with the external world, and the other processes must be aware of this situation
• the agreement property is replaced by uniform agreement: if some process (correct or not) p delivers a message m, then m is eventually delivered by every correct process q
• the reliable channel assumption holds: if p executes send(m) to q and q is correct, then q eventually receives m
• operations:
  - at p, URBcast(m): BEBcast(m)
  - at q, on BEBdeliver(m): if m is received by q for the first time and q ≠ p, then BEBcast(m); then URBdeliver(m)

Causal order broadcast (COBcast)
• reliable broadcast does not guarantee any ordering among messages delivered by different processes
• single source FIFO ordering is a special case of causal ordering, where messages from the same process should be delivered in the order they were broadcast
• practical scenario:
  - on a publish-subscribe whiteboard, p1 broadcasts a proposal m1 to all, which p2 sees and replies to with a comment m2 to all
  - here m1 → m2
  - due to an arbitrary delay, p3 receives m2 before m1 and has to withhold m2
  - a suitable 'middleware' for causal ordering would relieve the programmer from performing such a task
• we say that a message m1 may potentially have caused another message m2 (written m1 → m2) if any of the following applies:
  - m1 and m2 were broadcast by the same process p, and m1 was broadcast before m2
  - m1 was delivered by process p, m2 was broadcast by process p, and m2 was broadcast after the delivery of m1
  - there exists some message m' such that m1 → m' and m' → m2
• additional property:
  - causal delivery: no process p delivers a message m2 unless p has already delivered every message m1 such that m1 → m2
• causally ordered broadcast can be achieved in the presence of crash failures
• when RBcast is replaced by URBcast, we get a uniform reliable causally ordered broadcast
• two implementations are discussed:

no-waiting causal broadcast
• whenever a process RBdelivers m, it COdelivers m without waiting for other messages to be RBdelivered
• algorithm outline:
  - each message m carries a control field pastm, which includes all messages that causally precede m
  - when a message m is RBdelivered, pastm is inspected first: all messages in pastm that have not yet been COdelivered must be COdelivered before m itself is COdelivered
  - each process memorises all messages it has COBcast or COdelivered in a variable past_list
  - past_list and pastm are ordered sets

at pi:
  init: past_list = delivered_list = empty;
  upon <COBcast(m)> {
    RBcast(m, past_list);
    past_list = past_list ∪ {m}; }
  upon <RBdeliver(pj, pastm, m)>
    if (m ∉ delivered_list) then {
      for all messages m' ∈ pastm not delivered so far, in deterministic order {
        COdeliver(m');
        delivered_list = delivered_list ∪ {m'};
        past_list = past_list ∪ {m'}; }
      COdeliver(pj, m);
      delivered_list = delivered_list ∪ {m};
      past_list = past_list ∪ {m}; }

• in the accompanying figure, p4 RBdelivers m2 first, but since the message carries m1 in its pastm, m1 and m2 are COdelivered in order; when m1 is finally RBdelivered from p1, it is discarded
• weakness: large message size, due to the past causal history being carried

waiting causal order broadcast
• instead of keeping a record of all past messages, history is now represented by vector clocks
• vector clocks essentially capture the causal precedence between messages
• waiting COBcast relies, as before, on the underlying RBcast and RBdeliver primitives
• every process p maintains a vector clock that represents the number of messages that p has COdelivered from every other process, i.e., VCp[j], j = 1..n, j ≠ p, and the number of messages it has itself COBcast, i.e., VCp[p]
• this vector is attached to every message m that p COBcasts
• a process q that RBdelivers m interprets this vector time stamp to determine how many messages are missing (if any), and from which process
• for previous messages from p itself this number is VCp[p]-1; for messages received by p before it sent m, it is VCp[k], k ≠ p
• process q needs to COdeliver all these missing messages before it can COdeliver m
• example: at p2, the vector time stamp [0,2,0] is interpreted as follows: one message from p1 is still pending, one message from p1 has already been RBdelivered but is pending COdeliver, and none is pending from p0

at pi:
  init: pending = empty; ∀i,j: VCi[j] = 0;
        / pending list is ordered in increasing order of vector time
  upon COBcast(m) {
    COdeliver(pi, m);              / receive locally
    VCi[i]++;                      / count m itself before stamping it
    RBcast(VCi, pi, m); }
  upon RBdeliver(VCj, pj, m) {
    for i ≠ j:                     / ignore messages from self
      augment pending with (VCj, pj, m);
      wait until VCj[j] = VCi[j]+1 and ∀k ≠ j: VCj[k] ≤ VCi[k]; {
        remove (VCj, pj, m) from pending;
        COdeliver(pj, m);
        VCi[j]++; } }

Total order broadcast (TOBcast)
• causal order broadcast enforces a global ordering on all messages that causally depend on each other
• messages that do not are said to be concurrent and could be delivered in any order
• a total order abstraction orders all messages, even those that are concurrent
• it is sometimes possible to have a total order that does not respect causal order
• a convenient abstraction for managing replicated state machines (e.g., in fault tolerant servers)
• totally ordered reliable broadcast cannot be achieved in the presence of crash failures when the underlying communication is asynchronous
• this is because totally ordered broadcast ≡ consensus (the two problems are equivalent); recall that consensus cannot be solved in an asynchronous system with failures (FLP result)
• assumptions: asynchronous with no process failures, or synchronous with fail-stop processes
• how do we achieve causal-total order broadcast?
• properties:
  - validity: if a correct process p broadcasts a message m, then p eventually delivers m
  - integrity: for a message m, a correct process q delivers m at most once, and only if m was previously broadcast by some process p
  - uniform agreement (atomicity in delivery): if a process p delivers a message m, then m is eventually delivered by every correct process q
  - uniform total order (an order property): if a process (correct or faulty) p delivers a message m1 before m2, then every process delivers m2 only after it has delivered m1
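
The uniform total order property stated above can be checked mechanically over per-process delivery logs. The function below is a sketch for testing such logs, not part of any of the algorithms; `total_order_violation` and the sample logs are hypothetical, and each log is assumed to contain no duplicates (integrity).

```python
from itertools import combinations, permutations


def total_order_violation(logs):
    """Return a witness (p, q, m1, m2) if p delivers m1 before m2 while q
    delivers m2 without having delivered m1 first; return None otherwise.

    `logs` maps a process name to its list of delivered messages, in order.
    """
    pos = {p: {m: i for i, m in enumerate(log)} for p, log in logs.items()}
    for p, q in permutations(logs, 2):
        # combinations() keeps list order, so m1 precedes m2 in p's log
        for m1, m2 in combinations(logs[p], 2):
            if m2 in pos[q] and (m1 not in pos[q] or pos[q][m1] > pos[q][m2]):
                return (p, q, m1, m2)
    return None


# p3 has only delivered a prefix so far: still consistent with the property.
ok = {"p1": ["m1", "m3", "m2"], "p2": ["m1", "m3", "m2"], "p3": ["m1", "m3"]}
# p2 delivers the two messages in the opposite order: a violation.
bad = {"p1": ["m1", "m2"], "p2": ["m2", "m1"]}
# p2 delivers m2 but skipped m1: also a violation of the property as stated.
gap = {"p1": ["m1", "m2"], "p2": ["m2"]}

print(total_order_violation(ok))
print(total_order_violation(bad))
print(total_order_violation(gap))
```

Because the property is uniform, the logs of crashed processes should be included in `logs` as well.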
• algorithm 1: asynchronous with no process failures
• assume reliable (a stronger condition under the no-failure assumption) and single source FIFO channels (each process stamps sequence numbers)
• each process maintains an increasing counter, a time stamp, which is attached to every message it broadcasts
• each process also maintains a vector with estimates of the time stamps of all others
• suppose ts[j] is the vector element that corresponds to pj on pi; it says that pi will never again receive a message from pj with a time stamp smaller than or equal to this value
• processes use special time stamp update messages to keep the estimates up to date
• RBdelivered messages are queued in a pending list in increasing order of ⟨ts(m), pid⟩ pairs, with pid used to break ties
• ABdeliver can be done for any message at the head of the pending list whose time stamp is no greater than any element of the current time stamp vector of the process

at pi: (0 ≤ i ≤ n-1)
  init: ts[j] = 0 (0 ≤ j ≤ n-1); pending = empty;
  ABcast(m) {
    ts[i]++;
    add (m, ts[i], pi) to pending;
    RBcast(m, ts[i], pi); }
  upon RBdeliver(m, ts(msg), pj), j ≠ i {          / ignore self msg
    ts[j] = ts(msg);
    add (m, ts(msg), pj) to pending;
    if (ts(msg) > ts[i]) then {
      ts[i] = ts(msg);
      RBcast(new_ts, ts[i], pi); } }
  upon RBdeliver(new_ts, ts(new_ts), pj), j ≠ i    / ignore self msg
    ts[j] = ts(new_ts);
  delivery_test()                                  / at any time
    while (m, ts(msg), pj) is at the head of the pending list and ∀k: ts(msg) ≤ ts[k] {
      remove (m, ts(msg), pj) from pending;
      ABdeliver(m); }

[Figure: total order broadcast with time stamps]

Total order broadcast by consensus
• uses reliable broadcast and consensus as building blocks
• messages are first disseminated using a reliable broadcast primitive and are stored in a bag of unordered messages at every process
• processes then use consensus to order the messages in the bag
• the algorithm works in rounds
• there is one consensus instance per round
• the messages to be delivered in a round are agreed upon before proceeding to the next round
• RBcast can be replaced with URBcast to give 'uniform total order broadcast'
• algorithm 2: synchronous with fail-stop processes

init: unordered = delivered = empty; round = 1; wait = false;
TOBcast(m) {
  RBcast(m); }
upon RBdeliver(m) {
  if (m ∉ delivered) then unordered = unordered ∪ {m}; }
upon ((unordered ≠ empty) ∧ (wait = false)) {
  wait = true;
  propose(round, unordered); }      / propose() and decide() are consensus primitives
upon (m' = decide(round)) {         / m' is the decided batch; may take f+1 rounds in case of failures
  delivered = delivered ∪ m';
  unordered = unordered \ m';
  TOdeliver(m');                    / in a deterministic order
  round++;
  wait = false; }

Terminating reliable broadcast (TRBcast)
• uniform reliable broadcast says that if some process (correct or not) p delivers a message m, then m is eventually delivered by every correct process q
• however, q cannot decide whether it should wait for m or not: q has no means to distinguish the case where some process has delivered m, and q can indeed wait for m, from the case where no process will ever deliver m, in which case q should definitely not keep waiting for m
• suppose a process r URBcasts a message m but crashes while doing so, and another process p detects that r has crashed without having seen m
• this does not mean that m was not broadcast
• this nuance is captured by terminating reliable broadcast
• TRBcast ensures precisely that every process q either delivers the message m or some indication F that m will never be delivered (by any process); the abstraction is defined for a specific originator process src
• properties:
  - validity: if the sender src is correct and broadcasts a message m, then src eventually delivers m
  - integrity: if a correct process delivers a message m, then either m = F or m was previously broadcast by src
  - uniform agreement: if any process delivers a message m, then m is eventually delivered by every correct process
• assumptions: synchronous with fail-stop processes
• underlying abstractions: a perfect failure detector, consensus, best effort broadcast
• the source src identifies itself as the originator in the message m that it best effort broadcasts to all
• a participant joins in the TRBcast by broadcasting a special null message
• every process waits until it either gets a message broadcast by the sender or detects the crash of the sender
• all processes run a consensus instance to agree on whether to deliver m or the failure notification F

init: proposal = decision = null;
TRBcast(m, psrc):
  BEBcast(m);
upon (BEBdeliver(m, psrc) ∧ (proposal = null)):
  proposal = m;
  propose(m);
upon ((psrc_crash) ∧ (proposal = null)):
  proposal = Fsrc;
  propose(Fsrc);
upon decide(decision):               / consensus round
  TRBdeliver(decision, psrc);

----------------------------------------
[Scanned figures in the slides have been extracted from the text books of R. Guerraoui and H. Attiya]
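
The TRBcast pseudocode above can be illustrated with a single-instance simulation. This is a sketch under strong assumptions: `trb_round`, `toy_consensus`, `reached` and the marker `F` are hypothetical names, the failure detector is taken to be perfect, and `toy_consensus` is only a stand-in that decides the proposal of the lowest-numbered proposer, not a real consensus protocol.

```python
F = "F"   # hypothetical marker meaning "m will never be delivered"


def toy_consensus(proposals):
    """Stand-in for consensus: every process decides the value proposed by the
    lowest-numbered proposer, so the decision is always a proposed value."""
    return proposals[min(proposals)]


def trb_round(n, src, m, reached, src_crashes):
    """One TRBcast instance among processes 0..n-1.

    reached: pids that BEBdeliver m before src (possibly) crashes.
    Returns {pid: decision} for the processes that remain correct.
    """
    correct = [p for p in range(n) if not (src_crashes and p == src)]
    proposals = {}
    for p in correct:
        # With a correct src, BEB validity means every correct process sees m.
        saw_m = (p == src) or (p in reached) or (not src_crashes)
        # Otherwise the perfect failure detector reports the crash: propose F.
        proposals[p] = m if saw_m else F
    decision = toy_consensus(proposals)
    return {p: decision for p in correct}   # TRBdeliver(decision, src) everywhere


print(trb_round(4, 0, "m", set(), src_crashes=False))  # all deliver "m"
print(trb_round(4, 0, "m", set(), src_crashes=True))   # all deliver F
print(trb_round(4, 0, "m", {2}, src_crashes=True))     # mixed proposals, one decision
```

In the third call the proposals are mixed (F, m, F); the point of running consensus is that all correct processes still deliver the same value, here F.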