PODC 2002 Tutorial Slides

advertisement
On the Cost of Fault-Tolerant
Consensus
When There are no Faults
Idit Keidar and Sergio Rajsbaum
PODC 2002 Tutorial
© Idit Keidar and Sergio Rajsbaum; PODC 2002
About This Tutorial
• Preliminary version in SIGACT News and MIT
Tech Report, June 2001
• More polished lower bound proof to appear in IPL
• New version of the tutorial in preparation
• The talk includes only a subset of references, sorry
• We include some food
for thought
Any suggestions are welcome!
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Consensus
Each process has an input, should decide an output s.t.
Agreement: correct processes’ decisions are the same
Validity: decision is input of one process
Termination: eventually all correct processes decide
There are at least two possible input values 0 and 1
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Basic Model
• Message passing
• Channels between every pair of processes
• Crash failures
– t<n potential failures out of n>1 processes
• No message loss among correct processes
© Idit Keidar and Sergio Rajsbaum; PODC 2002
How Long Does It Take to Solve
Consensus?
Depends on the timing model:
• Message delays
• Processing times
• Clocks
• And on the metric used:
• Worst case
• Average
• etc
© Idit Keidar and Sergio Rajsbaum; PODC 2002
The Rest of This Tutorial
•
•
•
•
Part I: Realistic timing model and metric
Part II: Upper bounds
Part III: Lower bounds
Part IV: New directions and extensions
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Part I: Realistic Timing Model
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Asynchronous Model
• Unbounded message delay, processor speed
Consensus impossible even for t=1
[FLP85]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Synchronous Model
• Algorithm runs in synchronous rounds:
Round
– send messages to any set of processes,
– receive messages from previous round,
– do local processing (possibly decide, halt)
• If process i crashes in a round, then any subset
of the messages i sends in this round can be lost
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Synchronous Consensus
1 round with no failures
• Consider a run with f failures (f<t)
– Processes can decide in f+1 rounds
[Lamport Fischer 82; Dolev, Reischuk, Strong 90] (early-deciding)
• In this talk deciding
– halting takes min(f+2,t+1) [Dolev, Reischuk, Strong 90]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
The Middle Ground
Many real networks are neither synchronous
nor asynchronous
• During long stable periods, delays and
processing times are bounded
– Like synchronous model
• Some unstable periods
– Like asynchronous model
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Partial Synchrony Model
[Dwork, Lynch, Stockmeyer 88]
•
•
•
•
Processes have clocks with bounded drift
D, upper bound on message delay
r, upper bound on processing time
GST, global stabilization time
– Until GST, unstable: bounds do not hold
– After GST, stable: bounds hold
– GST unknown
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Partial Synchrony in Practice
• For D, r, choose bounds that hold with high
probability
• Stability forever?
– We assume that once stable remains stable
– In practice, has to last “long enough” for given algorithm
to terminate
– A commonly used model that alternates between stable
and unstable times:
Timed Asynchronous Model [Cristian, Fetzer 98]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Consensus with Partial Synchrony
Unbounded running time
by [FLP85], because model can be
asynchronous for unbounded time
• Solvable iff t < n/2 [DLS88]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
In a Practical System
Can we say more than:
consensus will be solved eventually ?
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Performance Metric
Number of rounds in well-behaved runs
• Well-behaved:
– No failures
– Stable from the beginning
• Motivation: common case
© Idit Keidar and Sergio Rajsbaum; PODC 2002
The Rest of This Tutorial
• Part II: best known algorithms decide in 2
rounds in well-behaved runs
– 2D time (with delay bound D, 0 processing time)
• Part III: this is the best possible
• Part IV: new directions and extensions
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Part II: Algorithms, and the
Failure Detector Abstraction
II.a Failure Detectors and Partial Synchrony
II.b Algorithms
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Time-Free Algorithms
• We describe the algorithms using failure
detector abstraction [Chandra, Toueg 96]
• Goal: abstract away time, get simpler
algorithms
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Unreliable Failure Detectors
[Chandra, Toueg 96]
• Each process has local failure detector oracle
– Typically outputs list of processes suspected to
have crashed at any given time
• Unreliable: failure detector output can be
arbitrary for unbounded (finite) prefix of run
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Performance of Failure Detector
Based Consensus Algorithms
• Implement a failure detector in the partial
synchrony model
• Design an algorithm for the failure detector
• Analyze the performance in well-behaved
runs of the combined algorithm
© Idit Keidar and Sergio Rajsbaum; PODC 2002
A Natural Failure Detector
Implementation
in Partial Synchrony Model
• Implement failure detector using timeouts:
– When expecting a message from a process i,
wait D + r + clock skew before suspecting i
• In well-behaved runs, D, r always hold,
hence no false suspicions
© Idit Keidar and Sergio Rajsbaum; PODC 2002
The resulting failure detector is
<>P - Eventually Perfect
• Strong Completeness: From some point on,
every faulty process is suspected by every
correct process
• Eventual Strong Accuracy: From some point
on, every correct process is not suspected*
*holds in all runs
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Weakest Failure Detectors for
Consensus
• <>S - Eventually Strong
– Strong Completeness
– Eventual Weak Accuracy: From some point on,
some correct process is not suspected
• W - Leader
– Outputs one trusted process
– From some point, all correct processes trust the
same correct process
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Relationships among Failure
Detector Classes
• <>S is a subset of <>P
• <>S is strictly weaker than <>P
• <>S ~ W
[Chandra, Hadzilacos, Toueg 96]
Food for thought:
What is the weakest timing model where <>S
and/or W are implementable but <>P is not?
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Note on the Power of Consensus
• Consensus cannot implement <>P, interactive
consistency, atomic commit, …
• So its “universality”, in the sense of
– wait-free objects in shared memory [Herlihy 93]
– state machine replication [Lamport 78; Schneider 90]
does not cover sensitivity to failures, timing, etc.
© Idit Keidar and Sergio Rajsbaum; PODC 2002
A Simple W Implementation
• Use <>P implementation
• Output lowest id non-suspected process
In well-behaved runs: process 1 always trusted
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Other Failure Detector
Implementations
• Message efficient <>S implementation [Larrea,
Fernández, Arévalo 00]
• QoS tradeoffs between accuracy and completeness
[Chen, Toueg, Aguilera 00]
• Leader Election [Aguilera, Delporte, Fauconnier, Toueg
01]
• Adaptive <>P [Fetzer, Raynal, Tronel 01]
Food for thought:
When is building <>P more costly than <>S or W?
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Part II: Algorithms, and the
Failure Detector Abstraction
II.a Failure Detectors and Partial Synchrony
II.b Algorithms
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Algorithms that Take 2 Rounds
in Well-Behaved Runs
• <>S-based [Schiper 97; Hurfin, Raynal 99; Mostefaoui,
Raynal 99]
• W-based for t < n/3 [Mostefaoui, Raynal 00]
• W-based for t < n/2 [Dutta, Guerraoui 01]
• Paxos (optimized version) [Lamport 89; 96]
– Leader-based (W)
– Also tolerates omissions, crash recoveries
• COReL - Atomic Broadcast [Keidar, Dolev 96]
– Group membership based (<>P)
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Of This Laundry List,
We Present Two Algorithms
1 <>S-based [MR99]
2 Paxos
© Idit Keidar and Sergio Rajsbaum; PODC 2002
<>S-based Consensus [MR99]
• val  input v; est  null
for r =1, 2, … do
coord  (r mod n)+1
if I am coord, then send (r,val) to all
wait for ( (r, val) from coord OR suspect coord (by <>S))
1
if receive val from coord then est  val else est  null
2
send (r, est) to all
wait for (r,est) from n-t processes
if any non-null est received then val  est
if all ests have same v then send (“decide”, v) to all; return(v)
od
• Upon receive (“decide”, v), forward to all, return(v)
© Idit Keidar and Sergio Rajsbaum; PODC 2002
In Well-Behaved Runs
1
1
(1, v1)
est = v1
1
decide v1
2
2
.
.
.
.
.
.
n
n
(1, v1)
© Idit Keidar and Sergio Rajsbaum; PODC 2002
In Case of Omissions
The algorithm can block in case of transient
message omissions, waiting for a specific
round message that will not arrive
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Paxos
[Lamport 88; 96; 01]
• Uses W failure detector
• Phase 1: prepare
– A process who trusts itself tries to become leader
– Chooses largest unique (using ids) ballot number
– Learns outcome of all smaller ballots
• Phase 2: accept
– Leader proposes a value with his ballot number.
– Leader gets majority to accept his proposal.
– A value accepted by a majority can be decided
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Paxos - Variables
• Type Rank
– totally ordered set with minimum element r0
• Variables:
Rank
BallotNum, initially r0
Rank
AcceptNum, initially r0
Value  {^} AcceptVal,
initially ^
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Paxos Phase I: Prepare
• Periodically, until decision is reached do:
if leader (by W) then
BallotNum  (unique rank > BallotNum)
send (“prepare”, rank) to all
• Upon receive (“prepare”, rank) from i
if rank > BallotNum then
BallotNum  rank
send (“ack”, rank, AcceptNum, AcceptVal) to i
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Paxos Phase II: Accept
Upon receive (“ack”, BallotNum, b, val) from n-t
if all vals = ^ then myVal = initial value
else myVal = received val with highest b
send (“accept”, BallotNum, myVal) to all /* proposal */
Upon receive (“accept”, b, v) with b  BallotNum
AcceptNum  b; AcceptVal  v /* accept proposal */
send (“accept”, b, v) to all
© Idit Keidar and Sergio Rajsbaum; PODC 2002
(first time only)
Paxos – Deciding
Upon receive (“accept”, b, v) from n-t
decide v
periodically send (“decide”, v) to all
Upon receive (“decide”, v)
decide v
© Idit Keidar and Sergio Rajsbaum; PODC 2002
In Well-Behaved Runs
1
1
1
1
2
2
.
. (“ack”,1,r0,^)
.
.
.
.
.
n
n
n
2
(“prepare”,1) .
Our W implementation
always trusts process 1
1
(“accept”,1 ,v1) .
(“accept”,1 ,v1)
decide v1
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Optimization
• Allow process 1 (only!) to skip Phase 1
– use rank r0
– propose its own initial value
• Takes 2 rounds in well-behaved runs
• Takes 2 rounds for repeated invocations with
the same leader
© Idit Keidar and Sergio Rajsbaum; PODC 2002
What About Message Loss?
• Does not block in case of a lost message
– Phase I can start with new rank even if previous
attempts never ended
• But constant omissions can violate liveness
• Specify conditional liveness:
If n-t correct processes including the leader can
communicate with each other
then they eventually decide
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Part III: Lower Bounds in Partial
Synchrony Model
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Upper Bounds From Part II
We saw that there are algorithms that take 2 rounds to
decide in well-behaved runs
• <>S-based, W-based, Paxos, COReL
• Presented two of them.
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Why are there no 1-Round
Algorithms?
There is a lower bound of 2 rounds in wellbehaved executions
– Similar bounds shown in [Dwork, Skeen 83; Lamport 00]
• We will show that the bound follows from a
similar bound on Uniform Consensus in the
synchronous model
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Uniform Consensus
• Uniform agreement: decision of every two
processes is the same
Recall: with consensus, only correct processes
have to agree
© Idit Keidar and Sergio Rajsbaum; PODC 2002
From Consensus
to Uniform Consensus
In partial synchrony model, any algorithm A
for consensus solves uniform consensus
[Guerraoui 95]
Proof: Assume by contradiction that A does not
solve uniform consensus
– in some run, p,q decide differently, p fails
– p may be non-faulty, and may wake up after q
decides
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Synchronous Uniform Consensus
Every algorithm has a well-behaved
run that takes 2 rounds to decide
• More generally, it has a run with f failures
(f<t-1), that takes at least f+2 rounds to
decide [Charron-Bost, Schiper 00; KR 01]
– as opposed to f+1 for consensus
© Idit Keidar and Sergio Rajsbaum; PODC 2002
A Simple Proof of the Uniform
Consensus Synchronous Lower Bound
[Keidar, Rajsbaum IPL 02]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
States
• State = list of processes’ local states
• Given a fixed deterministic algorithm,
state at the end of run determined by initial
values and environment actions
– failures, message loss
– can be denoted as:
x . E1. E2. E3
x state, Ei environment actions
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Connectivity
States x, x’ are similar, x~x’, if they look the
same to all but at most one process
• Set of initial states of consensus is connected
000
~
001
~
011
~
111
• Intuition: in connected states there cannot be
different decisions
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Coloring
• Impossibility proofs color non-decided states
• Classical coloring: valency, potential
decisions state can lead to e.g. [FLP85]
• Our coloring:
val(x) = decision of correct processes in
failure-free extension of x (0 or 1)
© Idit Keidar and Sergio Rajsbaum; PODC 2002
To Prove Lower Bounds
• Sufficient to look at subset of runs, called a
system
• Simplifies proof
• A set of environment actions defines a
system
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Considered Environment Actions
• (i, [k]) - i fails,
– messages to processes {1,…,k} lost (if sent)
– [0] empty set - no loss
– applicable if i non-failed and < t failures
• (0, [0]) - no failures
– always applicable
Notice: at most one process fails in one round
– its messages lost by prefix of processes
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Layering
• Layering L = set of environment actions
– L(X) = {x.E | x  X, E  L applicable to x}
– L0(X) = X
– Lk(X) = L(Lk-1(X))
• Define system using layers
– X0 set of initial states
– System: all runs obtained from L( . )
[Moses, Rajsbaum 98; Gafni 98;
Herlihy, Rajsbaum,Tuttle 98]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
L2(X0)
L(X0)
X0
Proof Strategy
• Uniform Lemma: from connected set, under some
conditions, 2 more rounds needed for uniform
consensus (recall: 1 for consensus)
• The initial states are connected.
In general: for f<t+1, Lf(X0) connected (will not be shown)
– feature of model, not of the problem
– also implies consensus f+1 lower bound
– can be proven for all Li(X0) in other models, e.g.,
mobile failure model [MosesR98], [Santoro,Widemayer89], and
asynchronous model
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Uniform Lemma
• If
– X connected
– x,x’X, s.t. val(x)= 0, val(x’)=1
– In all states in X exist at least 3 non-failed
processes and 2 can fail
• Then
– yX s.t. in y.(0,[0]) not all decide
1-round failure-free extension of y
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Uniform Lemma: Proof
• X connected, val(x)= 0, val(x’)=1
x
...
y
y’
...
x’
differ only in state of
some j
• Assume, by contradiction, in failure-free
extensions of y, y’, all decide after 1 round
• 2 cases: j either failed or non-failed
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Illustrating the Contradiction
Case 1: j is correct
y’
y
y
y’
X
y.(0,[0])
y’.(0,[0])
X
y’.(1,[2])
y.(1,[2])
the same
val(y)=0, solook
y leads
to to process 3
decision 0
in one failure-free round
X
X
X
X
y.(1,[2]).(3,[3]) y.(1,[2]).(3,[3])
look the same to process 2
A contradiction to uniform agreement!
© Idit Keidar and Sergio Rajsbaum; PODC 2002
The uniform consensus
synchronous lower bound
•
•
•
•
n >2, t >1, f =0
X0 = {initial failure-free states} connected
x’,xX0 s.t. val(x)=0, val(x’)=1 (validity)
By Uniform Lemma, from some initial state
need 2 rounds to decide
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Part IV: Extensions and
New Directions
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Less Than Well-Behaved Runs
We now discuss stable runs with failures
– only initial failures, and after that run is “wellbehaved”
– the run is synchronous but there are f failures
Food for thought:
What if the run is unstable and then becomes
stable?
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Stable Runs
with Initial Failures Only
• Algorithms we showed here:
– [MR99] can take t+2 rounds
– Paxos takes 4 rounds after leader is known
• Other results:
– If n > 3t, exists <>W-based algorithm that
terminates in 2 rounds [Mostefaoui, Raynal 00]
– If n < 3t+1, tight lower bound of 3 rounds
[Dutta, Guerraoui, Keidar WIP; Lamport WIP]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Stable Runs
with f Failures
• When f<t-1, we showed lower bound of f+2
• Algorithms we showed here can take 2f+2 in
runs with f failures
• [Dutta, Guerraoui
]
– extend lower bound to cover also f=t-1 and f=t
– show algorithm that takes t+2
Food for thought:
Algorithm that takes f+2 rounds for every f <=t ?
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Byzantine Failures
• Byzantine variants of Paxos [Castro Liskov 99;
Chockler, Malkhi, Reiter 01]
• Need n>3t, even with authentication [DLS88]
– as opposed to synchronous case
• New result: same upper and lower bounds as
in crash failure model (but with more
processes) [Lamport WIP]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Weakness of the “Rounds”
Performance Metric
• Over the Internet, time to complete a round
can depend heavily on number of messages
sent and other factors [Bakr, Keidar
]
– Reason: depends on delay distribution
Food for thought:
Better performance metrics?
© Idit Keidar and Sergio Rajsbaum; PODC 2002
More Food for Thought
• Shared memory models
– Good definitions of “well-behaved executions”?
Including factors such as contention E.g., [Alur,
Attiya, Taubenfeld 97]
– Generic reductions from message passing
bounds to bounds in different shared memory
models
• Infinitely many processes
– Paxos in shared memory with infinitely many
processes [Chockler, Malkhi
]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
More Food for Thought Yet
• Other problems
– Atomic Commit – 3 rounds?
– mutual exclusion, etc.
• Proof techniques
– Layered analysis for partial synchrony models
– Improve timing bounds for such models
• E.g., consensus [Attiya, Dwork, Lynch, Stockmeyer 94]
• Wait-free k-set agreement [Herlihy, Rajsbaum, Tuttle 98]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
On an Optimistic Note
• Consensus requires 2 rounds in partial
synchrony model because of false suspicions
• Optimistic approach: use 1-round algorithm
– correct while there are no false suspicions
– upon false suspicions, reconcile or rollback
– E.g., Group Communication Horus, Amoeba, ...
Atomic Commit [Jimenez-Peris, PatinoMartinez, Alonso, Arevalo 01]
© Idit Keidar and Sergio Rajsbaum; PODC 2002
Download