On the Cost of Fault-Tolerant Consensus When There are no Faults
Idit Keidar and Sergio Rajsbaum
PODC 2002 Tutorial
© Idit Keidar and Sergio Rajsbaum; PODC 2002

About This Tutorial
• Preliminary version in SIGACT News and MIT Tech Report, June 2001
• More polished lower bound proof to appear in IPL
• New version of the tutorial in preparation
• The talk includes only a subset of references, sorry
• We include some food for thought
Any suggestions are welcome!

Consensus
Each process has an input, should decide an output s.t.
Agreement: correct processes’ decisions are the same
Validity: decision is input of one process
Termination: eventually all correct processes decide
There are at least two possible input values, 0 and 1

Basic Model
• Message passing
• Channels between every pair of processes
• Crash failures
  – t<n potential failures out of n>1 processes
• No message loss among correct processes

How Long Does It Take to Solve Consensus?
Depends on the timing model:
• Message delays
• Processing times
• Clocks
And on the metric used:
• Worst case
• Average
• etc.

The Rest of This Tutorial
• Part I: Realistic timing model and metric
• Part II: Upper bounds
• Part III: Lower bounds
• Part IV: New directions and extensions

Part I: Realistic Timing Model

Asynchronous Model
• Unbounded message delay, processor speed
Consensus impossible even for t=1 [FLP85]

Synchronous Model
• Algorithm runs in synchronous rounds. In each round:
  – send messages to any set of processes,
  – receive messages from previous round,
  – do local processing (possibly decide, halt)
• If process i crashes in a round, then any subset of the messages i sends in this round can be lost

Synchronous Consensus
• 1 round with no failures
• Consider a run with f failures (f<t)
  – Processes can decide in f+1 rounds [Lamport, Fischer 82; Dolev, Reischuk, Strong 90] (early-deciding)
• In this talk we count the rounds until deciding
  – halting takes min(f+2,t+1) rounds [Dolev, Reischuk, Strong 90]

The Middle Ground
Many real networks are neither synchronous nor asynchronous
• During long stable periods, delays and processing times are bounded
  – Like synchronous model
• Some unstable periods
  – Like asynchronous model

Partial Synchrony Model [Dwork, Lynch, Stockmeyer 88]
• Processes have clocks with bounded drift
• D, upper bound on message delay
• r, upper bound on processing time
• GST, global stabilization time
  – Until GST, unstable: bounds do not hold
  – After GST, stable: bounds hold
  – GST unknown

Partial Synchrony in Practice
• For D, r, choose bounds that hold with high probability
• Stability forever?
  – We assume that once stable, the system remains stable
  – In practice, stability has to last “long enough” for the given algorithm to terminate
  – A commonly used model that alternates between stable and unstable times: Timed Asynchronous Model [Cristian, Fetzer 98]

Consensus with Partial Synchrony
• Unbounded running time by [FLP85], because model can be asynchronous for unbounded time
• Solvable iff t < n/2 [DLS88]

In a Practical System
Can we say more than: consensus will be solved eventually?

Performance Metric
Number of rounds in well-behaved runs
• Well-behaved:
  – No failures
  – Stable from the beginning
• Motivation: common case

The Rest of This Tutorial
• Part II: best known algorithms decide in 2 rounds in well-behaved runs
  – 2D time (with delay bound D, 0 processing time)
• Part III: this is the best possible
• Part IV: new directions and extensions

Part II: Algorithms, and the Failure Detector Abstraction
II.a Failure Detectors and Partial Synchrony
II.b Algorithms

Time-Free Algorithms
• We describe the algorithms using failure detector abstraction [Chandra, Toueg 96]
• Goal: abstract away time, get simpler algorithms

Unreliable Failure Detectors [Chandra, Toueg 96]
• Each process has local failure detector oracle
  – Typically outputs list of processes suspected to have crashed at any given time
• Unreliable: failure detector output can be arbitrary for unbounded (finite) prefix of run

Performance of Failure Detector Based Consensus Algorithms
• Implement a failure detector in the partial synchrony model
• Design an algorithm for the failure detector
• Analyze the performance in well-behaved runs of the combined algorithm

A Natural Failure Detector Implementation in Partial Synchrony Model
• Implement failure detector using timeouts:
  – When expecting a message from a process i, wait D + r + clock skew before suspecting i
• In well-behaved runs, the bounds D, r always hold, hence no false suspicions

The resulting failure detector is <>P - Eventually Perfect
• Strong Completeness: From some point on, every faulty process is suspected by every correct process
• Eventual Strong Accuracy: From some point on, every correct process is not suspected*
* holds in all runs

Weakest Failure Detectors for Consensus
• <>S - Eventually Strong
  – Strong Completeness
  – Eventual Weak Accuracy: From some point on, some correct process is not suspected
• W - Leader (the leader oracle, commonly written Ω)
  – Outputs one trusted process
  – From some point, all correct processes trust the same correct process

Relationships among Failure Detector Classes
• <>P is a subset of <>S: every <>P detector also satisfies the <>S properties
• <>S is strictly weaker than <>P
• <>S ~ W [Chandra, Hadzilacos, Toueg 96]
Food for thought: What is the weakest timing model where <>S and/or W are implementable but <>P is not?

Note on the Power of Consensus
• Consensus cannot implement <>P, interactive consistency, atomic commit, …
• So its “universality”, in the sense of
  – wait-free objects in shared memory [Herlihy 93]
  – state machine replication [Lamport 78; Schneider 90]
  does not cover sensitivity to failures, timing, etc.
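The timeout rule from the failure detector slides above (wait D + r + clock skew before suspecting a process) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tutorial's implementation: the class and method names are hypothetical, local time is passed in explicitly instead of being read from a clock, and retracting a suspicion when a late message arrives is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutFailureDetector:
    """Timeout-based suspicion list (hypothetical names, illustrative only)."""
    delay_bound: float       # D: assumed upper bound on message delay
    processing_bound: float  # r: assumed upper bound on processing time
    clock_skew: float        # assumed bound on clock skew
    expected_since: dict = field(default_factory=dict)  # pid -> local time we began waiting
    suspected: set = field(default_factory=set)

    @property
    def timeout(self) -> float:
        return self.delay_bound + self.processing_bound + self.clock_skew

    def expect(self, pid, now):
        """Start waiting for a message from pid at local time `now`."""
        self.expected_since[pid] = now

    def on_message(self, pid, now):
        """A message from pid arrived: stop waiting and retract any suspicion (assumption)."""
        self.expected_since.pop(pid, None)
        self.suspected.discard(pid)

    def tick(self, now):
        """Suspect every process whose expected message is overdue; return the suspect set."""
        for pid, since in self.expected_since.items():
            if now - since > self.timeout:
                self.suspected.add(pid)
        return set(self.suspected)


fd = TimeoutFailureDetector(delay_bound=1.0, processing_bound=0.2, clock_skew=0.1)
fd.expect(1, now=0.0)
fd.expect(2, now=0.0)
fd.on_message(1, now=0.9)      # process 1 answers within the bounds
early = fd.tick(now=1.0)       # still inside the timeout: nobody is suspected
late = fd.tick(now=2.0)        # process 2 is overdue and becomes suspected
fd.on_message(2, now=2.5)      # a late message retracts the suspicion
after = fd.tick(now=3.0)
```

When the bounds hold from the start, as in a well-behaved run, every expected message arrives before the timeout and the suspect set stays empty, which is the "no false suspicions" property the slides rely on.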
A Simple W Implementation
• Use <>P implementation
• Output lowest id non-suspected process
In well-behaved runs: process 1 always trusted

Other Failure Detector Implementations
• Message efficient <>S implementation [Larrea, Fernández, Arévalo 00]
• QoS tradeoffs between accuracy and completeness [Chen, Toueg, Aguilera 00]
• Leader Election [Aguilera, Delporte, Fauconnier, Toueg 01]
• Adaptive <>P [Fetzer, Raynal, Tronel 01]
Food for thought: When is building <>P more costly than <>S or W?

Part II: Algorithms, and the Failure Detector Abstraction
II.a Failure Detectors and Partial Synchrony
II.b Algorithms

Algorithms that Take 2 Rounds in Well-Behaved Runs
• <>S-based [Schiper 97; Hurfin, Raynal 99; Mostefaoui, Raynal 99]
• W-based for t < n/3 [Mostefaoui, Raynal 00]
• W-based for t < n/2 [Dutta, Guerraoui 01]
• Paxos (optimized version) [Lamport 89; 96]
  – Leader-based (W)
  – Also tolerates omissions, crash recoveries
• COReL - Atomic Broadcast [Keidar, Dolev 96]
  – Group membership based (<>P)

Of This Laundry List, We Present Two Algorithms
1. <>S-based [MR99]
2. Paxos

<>S-based Consensus [MR99]
  val ← input v; est ← null
  for r = 1, 2, … do
    coord ← (r mod n)+1
    /* step 1 */
    if I am coord, then send (r, val) to all
    wait for ((r, val) from coord OR suspect coord (by <>S))
    if receive val from coord then est ← val else est ← null
    /* step 2 */
    send (r, est) to all
    wait for (r, est) from n-t processes
    if any non-null est received then val ← est
    if all ests have same v then send (“decide”, v) to all; return(v)
  od
• Upon receive (“decide”, v), forward to all, return(v)

In Well-Behaved Runs
[Figure: message diagram for processes 1, 2, …, n in a well-behaved run, with labels (1, v1), est = v1, (1, v1), decide v1.]

In Case of Omissions
The algorithm can block in case of transient message omissions, waiting for a specific round message that will not arrive

Paxos [Lamport 88; 96; 01]
• Uses W failure detector
• Phase 1: prepare
  – A process who trusts itself tries to become leader
  – Chooses largest unique (using ids) ballot number
  – Learns outcome of all smaller ballots
• Phase 2: accept
  – Leader proposes a value with his ballot number
  – Leader gets majority to accept his proposal
  – A value accepted by a majority can be decided

Paxos - Variables
• Type Rank
  – totally ordered set with minimum element r0
• Variables:
  Rank BallotNum, initially r0
  Rank AcceptNum, initially r0
  Value ∪ {⊥} AcceptVal, initially ⊥

Paxos Phase I: Prepare
• Periodically, until decision is reached do:
    if leader (by W) then
      BallotNum ← (unique rank > BallotNum)
      send (“prepare”, rank) to all
• Upon receive (“prepare”, rank) from i
    if rank > BallotNum then
      BallotNum ← rank
      send (“ack”, rank, AcceptNum, AcceptVal) to i

Paxos Phase II: Accept
Upon receive (“ack”, BallotNum, b, val) from n-t
  if all vals = ⊥ then myVal = initial value
  else myVal = received val with highest b
  send (“accept”, BallotNum, myVal) to all /* proposal */
Upon receive (“accept”, b, v) with b ≥ BallotNum
  AcceptNum ← b; AcceptVal ← v /* accept proposal */
  send (“accept”, b, v) to all (first time only)

Paxos – Deciding
Upon receive (“accept”, b, v) from n-t
  decide v
  periodically send (“decide”, v) to all
Upon receive (“decide”, v)
  decide v

In Well-Behaved Runs
Our W implementation always trusts process 1
[Figure: message diagram for processes 1, 2, …, n, with labels (“prepare”,1), (“ack”,1,r0,⊥), (“accept”,1,v1), (“accept”,1,v1), decide v1.]

Optimization
• Allow process 1 (only!) to skip Phase 1
  – use rank r0
  – propose its own initial value
• Takes 2 rounds in well-behaved runs
• Takes 2 rounds for repeated invocations with the same leader

What About Message Loss?
• Does not block in case of a lost message
  – Phase I can start with new rank even if previous attempts never ended
• But constant omissions can violate liveness
• Specify conditional liveness: If n-t correct processes including the leader can communicate with each other then they eventually decide

Part III: Lower Bounds in Partial Synchrony Model

Upper Bounds From Part II
We saw that there are algorithms that take 2 rounds to decide in well-behaved runs
• <>S-based, W-based, Paxos, COReL
• Presented two of them.

Why are there no 1-Round Algorithms?
There is a lower bound of 2 rounds in well-behaved executions
  – Similar bounds shown in [Dwork, Skeen 83; Lamport 00]
• We will show that the bound follows from a similar bound on Uniform Consensus in the synchronous model

Uniform Consensus
• Uniform agreement: decision of every two processes is the same
Recall: with consensus, only correct processes have to agree

From Consensus to Uniform Consensus
In partial synchrony model, any algorithm A for consensus solves uniform consensus [Guerraoui 95]
Proof: Assume by contradiction that A does not solve uniform consensus
  – in some run, p, q decide differently, and p fails
  – but q cannot distinguish this run from one in which p is non-faulty and merely slow: p may wake up after q decides, so two correct processes disagree, contradicting agreement

Synchronous Uniform Consensus
Every algorithm has a well-behaved run that takes 2 rounds to decide
• More generally, it has a run with f failures (f<t-1), that takes at least f+2 rounds to decide [Charron-Bost, Schiper 00; KR 01]
  – as opposed to f+1 for consensus

A Simple Proof of the Uniform Consensus Synchronous Lower Bound [Keidar, Rajsbaum IPL 02]

States
• State = list of processes’ local states
• Given a fixed deterministic algorithm, state at the end of run determined by initial values and environment actions
  – failures, message loss
  – can be denoted as x.E1.E2.E3, where x is a state and the Ei are environment actions

Connectivity
States x, x’ are similar, x~x’, if they look the same to all but at most one process
• Set of initial states of consensus is connected, e.g., 000 ~ 001 ~ 011 ~ 111 is a chain of similar states linking 000 to 111
• Intuition: in connected states there cannot be different decisions

Coloring
• Impossibility proofs color non-decided states
• Classical coloring: valency, the potential decisions a state can lead to, e.g. [FLP85]
• Our coloring: val(x) = decision of correct processes in failure-free extension of x (0 or 1)

To Prove Lower Bounds
• Sufficient to look at subset of runs, called a system
• Simplifies proof
• A set of environment actions defines a system

Considered Environment Actions
• (i, [k]) - i fails,
  – messages to processes {1,…,k} lost (if sent)
  – [0] empty set - no loss
  – applicable if i non-failed and < t failures
• (0, [0]) - no failures
  – always applicable
Notice: at most one process fails in one round
  – its messages are lost to a prefix of the processes

Layering
• Layering L = set of environment actions
  – L(X) = {x.E | x ∈ X, E ∈ L applicable to x}
  – L^0(X) = X
  – L^k(X) = L(L^(k-1)(X))
• Define system using layers
  – X0 set of initial states
  – System: all runs obtained from L( . )
[Moses, Rajsbaum 98; Gafni 98; Herlihy, Rajsbaum, Tuttle 98]
[Figure: the layers X0, L(X0), L^2(X0).]

Proof Strategy
• Uniform Lemma: from connected set, under some conditions, 2 more rounds needed for uniform consensus (recall: 1 for consensus)
• The initial states are connected. In general: for f<t+1, L^f(X0) connected (will not be shown)
  – feature of model, not of the problem
  – also implies consensus f+1 lower bound
  – can be proven for all L^i(X0) in other models, e.g., mobile failure model [MosesR98], [Santoro, Widmayer 89], and asynchronous model

Uniform Lemma
• If
  – X connected
  – ∃ x, x’ ∈ X, s.t. val(x) = 0, val(x’) = 1
  – In all states in X exist at least 3 non-failed processes and 2 can fail
• Then
  – ∃ y ∈ X s.t. in y.(0,[0]), the 1-round failure-free extension of y, not all decide

Uniform Lemma: Proof
• X connected, val(x) = 0, val(x’) = 1, so there is a chain x ~ … ~ y ~ y’ ~ … ~ x’ with val(y) = 0, val(y’) = 1, where y and y’ differ only in state of some j
• Assume, by contradiction, in failure-free extensions of y, y’, all decide after 1 round
• 2 cases: j either failed or non-failed

Illustrating the Contradiction
Case 1: j is correct
[Figure: states y, y’ and their extensions y.(0,[0]), y’.(0,[0]), y.(1,[2]), y’.(1,[2]), y.(1,[2]).(3,[3]), y’.(1,[2]).(3,[3]). Annotations: val(y) = 0, so y leads to decision 0 in one failure-free round; y.(0,[0]) and y.(1,[2]) look the same to process 3; y.(1,[2]).(3,[3]) and y’.(1,[2]).(3,[3]) look the same to process 2.]
A contradiction to uniform agreement!

The uniform consensus synchronous lower bound
• n > 2, t > 1, f = 0
• X0 = {initial failure-free states} is connected
• ∃ x, x’ ∈ X0 s.t. val(x) = 0, val(x’) = 1 (validity)
• By Uniform Lemma, from some initial state need 2 rounds to decide

Part IV: Extensions and New Directions

Less Than Well-Behaved Runs
We now discuss stable runs with failures
  – only initial failures, and after that run is “well-behaved”
  – the run is synchronous but there are f failures
Food for thought: What if the run is unstable and then becomes stable?

Stable Runs with Initial Failures Only
• Algorithms we showed here:
  – [MR99] can take t+2 rounds
  – Paxos takes 4 rounds after leader is known
• Other results:
  – If n > 3t, exists <>W-based algorithm that terminates in 2 rounds [Mostefaoui, Raynal 00]
  – If n < 3t+1, tight lower bound of 3 rounds [Dutta, Guerraoui, Keidar WIP; Lamport WIP]

Stable Runs with f Failures
• When f<t-1, we showed lower bound of f+2
• Algorithms we showed here can take 2f+2 in runs with f failures
• [Dutta, Guerraoui]
  – extend lower bound to cover also f=t-1 and f=t
  – show algorithm that takes t+2
Food for thought: Algorithm that takes f+2 rounds for every f ≤ t?
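The connectivity step of the lower-bound proof above (a chain of similar initial states from the all-0 state to the all-1 state, along which val must change between two adjacent states) can be checked mechanically for small n. The sketch below is illustrative only: the helper names are hypothetical, and `majority_val` is a stand-in for val(x), chosen only because it satisfies val(0…0) = 0 and val(1…1) = 1 as validity requires.

```python
def similar(x, y):
    """x ~ y: the two states differ in the local state of at most one process."""
    return sum(a != b for a, b in zip(x, y)) <= 1


def chain(x, y):
    """Chain of pairwise-similar states from x to y, changing one input at a time."""
    states = [x]
    current = list(x)
    # Change the last process first, matching the slide's 000 ~ 001 ~ 011 ~ 111.
    for i in range(len(x) - 1, -1, -1):
        if current[i] != y[i]:
            current[i] = y[i]
            states.append(tuple(current))
    return states


def flip_point(states, val):
    """First adjacent pair (y, y') on the chain with val(y) != val(y')."""
    for y, y_next in zip(states, states[1:]):
        if val(y) != val(y_next):
            return y, y_next
    return None


def majority_val(x):
    # Hypothetical stand-in for val(x); not the coloring of any particular algorithm.
    return int(sum(x) * 2 > len(x))


n = 3
path = chain((0,) * n, (1,) * n)
all_similar = all(similar(a, b) for a, b in zip(path, path[1:]))
pair = flip_point(path, majority_val)
```

Whatever function plays the role of val, the endpoints have different colors, so some adjacent pair on the chain must differ in color. That pair is the y, y’ from which the Uniform Lemma derives its contradiction.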
Byzantine Failures
• Byzantine variants of Paxos [Castro, Liskov 99; Chockler, Malkhi, Reiter 01]
• Need n>3t, even with authentication [DLS88]
  – as opposed to synchronous case
• New result: same upper and lower bounds as in crash failure model (but with more processes) [Lamport WIP]

Weakness of the “Rounds” Performance Metric
• Over the Internet, time to complete a round can depend heavily on number of messages sent and other factors [Bakr, Keidar]
  – Reason: depends on delay distribution
Food for thought: Better performance metrics?

More Food for Thought
• Shared memory models
  – Good definitions of “well-behaved executions”? Including factors such as contention, e.g., [Alur, Attiya, Taubenfeld 97]
  – Generic reductions from message passing bounds to bounds in different shared memory models
• Infinitely many processes
  – Paxos in shared memory with infinitely many processes [Chockler, Malkhi]

More Food for Thought Yet
• Other problems
  – Atomic Commit – 3 rounds?
  – mutual exclusion, etc.
• Proof techniques
  – Layered analysis for partial synchrony models
  – Improve timing bounds for such models
    • E.g., consensus [Attiya, Dwork, Lynch, Stockmeyer 94]
    • Wait-free k-set agreement [Herlihy, Rajsbaum, Tuttle 98]

On an Optimistic Note
• Consensus requires 2 rounds in partial synchrony model because of false suspicions
• Optimistic approach: use 1-round algorithm
  – correct while there are no false suspicions
  – upon false suspicions, reconcile or rollback
  – E.g., Group Communication (Horus, Amoeba, ...), Atomic Commit [Jimenez-Peris, Patino-Martinez, Alonso, Arevalo 01]
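As a closing illustration of the 2-round upper bound from Part II, the following Python sketch simulates the optimized Paxos accept phase in a well-behaved run: process 1 skips Phase 1 and proposes its own input with rank r0, every process echoes the first “accept” it receives, and a process decides once it has heard the same “accept” from n-t processes. It is a rough sketch under stated assumptions, not the tutorial's code: the class and function names are hypothetical, rounds are simulated synchronously with no failures or message loss, and Phase 1 and the periodic “decide” messages are omitted.

```python
R0 = 0  # r0, the minimum rank


class Process:
    def __init__(self, pid, initial_value):
        self.pid = pid
        self.initial_value = initial_value
        self.ballot_num = R0
        self.accept_num = R0
        self.accept_val = None       # None stands in for the "no value yet" marker
        self.echoed = False
        self.accept_senders = {}     # (rank, value) -> ids of processes heard from
        self.decision = None

    def on_accept(self, sender, b, v, quorum):
        """Handle ("accept", b, v); return the messages to send in the next round."""
        out = []
        if b >= self.ballot_num:
            self.accept_num, self.accept_val = b, v   # accept the proposal
            if not self.echoed:                       # echo the first time only
                self.echoed = True
                out.append(("accept", b, v))
        self.accept_senders.setdefault((b, v), set()).add(sender)
        if len(self.accept_senders[(b, v)]) >= quorum:  # heard from n-t processes
            self.decision = v
        return out


def well_behaved_run(inputs, t):
    """Simulate synchronous rounds with no failures and no message loss."""
    n = len(inputs)
    procs = [Process(i + 1, v) for i, v in enumerate(inputs)]
    leader = procs[0]  # the simple W implementation always trusts process 1
    # Optimization: process 1 skips Phase 1 and proposes its own input with rank r0.
    outbox = {leader.pid: [("accept", R0, leader.initial_value)]}
    rounds = 0
    while outbox and any(p.decision is None for p in procs):
        rounds += 1
        next_outbox = {}
        for sender, msgs in outbox.items():
            for _, b, v in msgs:
                for p in procs:  # "send to all", delivered within this round
                    replies = p.on_accept(sender, b, v, quorum=n - t)
                    next_outbox.setdefault(p.pid, []).extend(replies)
        outbox = next_outbox
    return rounds, [p.decision for p in procs]


rounds, decisions = well_behaved_run(inputs=["a", "b", "c", "d", "e"], t=2)
```

In this run the leader's proposal reaches everyone in round 1 and the echoes reach everyone in round 2, so all five processes decide process 1's input after two rounds, matching the count given on the Optimization slide.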