Lecture 8: Paxos Principles of Reliable Distributed Systems Spring 2008

Principles of Reliable
Distributed Systems
Lecture 8: Paxos
Spring 2008
Prof. Idit Keidar
 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2008
Material
• Paxos Made Simple
Leslie Lamport
ACM SIGACT News (Distributed
Computing Column) 32, 4 (Whole Number
121, December 2001) 18-25.
Issues in the Real World I/III
• Problem: Sometimes messages take longer than expected
• Solution 1: Use longer timeouts
– Slow
• Solution 2: Assume asynchrony
– Impossible: FLP
• Solution 3: Assume eventual synchrony or unreliable failure detectors
– See last week: MR Algorithm
Reminder: MR in “Normal”
(Failure-Free Suspicion-Free) Runs
[Figure: round 1 of MR among processes 1, 2, …, n. Messages and labels, in order: (1, v1); all have est = v1; (1, v1); (decide, v1); all decide v1.]
On MR’s Performance
• The algorithm can take unbounded time
– What if no failures occur?
• Is this inevitable?
• Can we say more than “decision is reached eventually”?
Performance Metric
Number of communication steps in
well-behaved runs
• Well-behaved:
– No failures
– Stable (synchronous) from the beginning
– With failure detector: no false suspicions
• Motivation: common case
MR’s Running Time in
Well-Behaved Runs
• In round 1:
– Coord is correct, not suspected by any process
– All processes decide at the end of phase two
• Decision in two communication steps
– Halting (stopping) takes three steps
• How many steps in the synchronous model?
– 2 rounds for decision in Uniform Consensus
– No performance penalty for indulgence!
Back to Last Week’s Example
• Example network:
– 99% of packets arrive within 10 µsec
– Upper bound of 1000 µsec on message latency
• Now we can choose a timeout of 10 µsec, without
violating safety!
• Most of the time, the algorithm will be just as fast
as a synchronous uniform consensus algorithm
– We did pay a price in resilience, though
Issues in the Real World II/III
• Problem: Sometimes messages are lost
• Solution 1: Use retransmissions
– In case of transient partitions, a huge backlog can build up; catching up may take forever
– More congestion, long message delays for extensive periods
• Solution 2: Allow message loss
– Impossible: 2 Generals
• Solution 3: Assume eventually reliable links
– That’s what we’ll do today
Issues in the Real World III/III
• Problem: Processes may crash and later
recover (aka crash-recovery model)
• Solution 1: Store information on stable
storage (disk) and retrieve it upon recovery
– What happens to messages arriving when
they’re down?
– See previous slide
MR and Unreliable Links
• From MR Algorithm Phase II:
wait for (r,est) from n-t processes
• Transient message loss violates liveness
• What if we move to the next round in case
we can’t get n-t responses for too long?
– Notice the next line in MR:
if any non-⊥ value e received then val ← e
What If MR Didn’t Wait …
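[Figure: a two-round run among processes 1, 2, …, n in which MR does not wait. Round 1 labels: (1, v1); est = ⊥ at process 2; (1, v1); (1, ⊥); (1, v1); decide v1. Round 2 labels: no waiting; no change of val2; (2, v2); will decide v2.]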
What Do We Want?
• Do not get stuck in a round (like MR does)
– Move on upon timeout
– Move on upon hearing that others moved on
• But a new leader, before proposing a decision value, must learn any possibly decided value (must check with a majority)
Paxos: Main Principles
• Use “leader election” module
– If you think you’re leader, you can start a new “ballot”
• Paxos name for a round
• Always join the newest ballot you hear about
– Leave old ballots in the middle if you need to
• Two phases:
– First learn outcomes of previous ballots from a majority
– Then propose a new value, and get a majority to
endorse it
Leader Election Failure Detector
• Ω – Leader
– Outputs one trusted process
– From some point, all correct processes trust the
same correct process
• Can easily implement ◊S
• Is the weakest for consensus
[Chandra, Hadzilacos, Toueg 96]
Ω Implementations
• Easiest: use ◊P implementation
– In eventual synchrony model
– Output lowest id non-suspected process
• Ω is implementable also in some situations where ◊P isn’t
• Optimizations possible
– Choose “best connected”, strongest, etc.
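The "easiest" implementation above can be sketched in a few lines of Python. This is only an illustration: the function name `omega_output` and the `suspected` set (standing in for the current output of a ◊P module) are assumptions made here, not part of the lecture.

```python
def omega_output(process_ids, suspected):
    """Trust the lowest-id process that the ◊P module does not currently suspect."""
    candidates = [p for p in process_ids if p not in suspected]
    return min(candidates) if candidates else None


# Process 1 is suspected, so the leader estimate moves to process 2.
assert omega_output([1, 2, 3, 4], suspected={1}) == 2
# With no suspicions, every process trusts process 1.
assert omega_output([1, 2, 3, 4], suspected=set()) == 1
```

Once ◊P stops making mistakes, every correct process computes the same minimum, which is the "same correct process" property Ω requires.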
Paxos: The Practicality
• Overcomes message loss without
retransmitting entire message history
• Tolerates crash and recovery
• Does not rotate through dead coordinators
• Used in replicated file systems
– Frangipani – DEC, early 90s
– Nowadays Microsoft
The Part-Time Parliament
[Lamport 88,98,01]
Recent archaeological discoveries on the island of
Paxos reveal that the parliament functioned
despite the peripatetic propensity of its part-time
legislators.
The legislators maintained consistent copies of the
parliamentary record, despite their frequent forays
from the chamber and the forgetfulness of their
messengers.
The Paxon parliament’s protocol provides a new way
of implementing the state-machine approach to the
design of distributed systems.
Annotation of TOCS 98 Paper
• This submission was recently discovered behind a filing
cabinet in the TOCS editorial office.
• …the author is currently doing field work in the Greek
isles and cannot be reached …
• The author appears to be an archeologist with only a
passing interest in computer science.
• This is unfortunate; even though the obscure ancient Paxon
civilization he describes is of little interest to most
computer scientists, its legislative system is an excellent
model for how to implement a distributed computer system
in an asynchronous environment.
The Setting
• The data (ledger) is replicated at n processes
(legislators)
• Operations (decrees) should be invoked (recorded)
at each replica (ledger) in the same order
• Processes (legislators) can fail (leave the
parliament)
• At least a majority of processes (legislators) must
be up (present in the parliament) in order to make
progress (pass decrees)
– Why majority?
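One way to approach the "Why majority?" question: any two majorities share at least one process, so two separated groups can never both pass decrees. The brute-force check below is a small sketch of that fact for n = 5; the helper names `quorums` and `always_intersect` are made up for this illustration.

```python
from itertools import combinations


def quorums(n, size):
    """All subsets of processes 1..n with exactly `size` members."""
    return [set(c) for c in combinations(range(1, n + 1), size)]


def always_intersect(n, size):
    """True if every two quorums of the given size share at least one process."""
    qs = quorums(n, size)
    return all(a & b for a in qs for b in qs)


n = 5
# Majorities (size 3 out of 5) always overlap.
assert always_intersect(n, n // 2 + 1)
# Groups of size 2 can be disjoint, e.g. {1, 2} and {3, 4}.
assert not always_intersect(n, n // 2)
```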
Eventually Reliable Links
• There is a time after which every message
sent by a correct process to a correct
process eventually arrives
– Old messages are not retransmitted
• Usual failure-detector-based algorithms
(like MR) do not work
– Homework question
The Paxos
Atomic Broadcast Algorithm
• Leader based: each process has an estimate
of who is the current leader
• To order an operation, a process sends it to
its current leader
• The leader sequences the operation and
launches a Consensus algorithm (Synod) to
fix the agreement
The (Synod) Consensus Algorithm
• Solves non-terminating consensus in
asynchronous system
– Or consensus in a partial synchrony system
– Or consensus using an Ω failure detector
• Overcomes transient crashes & recoveries
and message loss
– Can be modeled as just message loss
The Consensus Algorithm Structure
• Two phases
• Leader contacts a majority in each phase
• There may be multiple concurrent leaders
• Ballots distinguish among values proposed by different leaders
– Unique, locally monotonically increasing
– Correspond to rounds of ◊S-based algorithms [MR]
– Processes respond only to leader with highest ballot seen so far
Ballot Numbers
• Pairs ⟨num, process id⟩
• ⟨n1, p1⟩ > ⟨n2, p2⟩
– If n1 > n2
– Or n1 = n2 and p1 > p2
• Leader p chooses unique, locally monotonically increasing ballot number
– If latest known ballot is ⟨n, q⟩ then p chooses ⟨n+1, p⟩
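The ordering on ballot numbers is lexicographic on the pair, which Python tuples provide directly. The sketch below is illustrative only; the names `Ballot` and `next_ballot` are assumptions introduced here, not identifiers from the lecture.

```python
from typing import NamedTuple


class Ballot(NamedTuple):
    """A ballot number <num, process id>; tuples compare lexicographically."""
    num: int
    pid: int


def next_ballot(latest_known: Ballot, my_id: int) -> Ballot:
    """If the latest known ballot is <n, q>, process p chooses <n+1, p>."""
    return Ballot(latest_known.num + 1, my_id)


# <2, 1> beats <1, 3> because 2 > 1; ties on num are broken by process id.
assert Ballot(2, 1) > Ballot(1, 3)
assert Ballot(1, 3) > Ballot(1, 2)
# Process 2 has seen <4, 7>, so it picks <5, 2>, which is larger than <4, 7>.
assert next_ballot(Ballot(4, 7), my_id=2) == Ballot(5, 2)
assert next_ballot(Ballot(4, 7), my_id=2) > Ballot(4, 7)
```

Including the process id in the pair is what makes ballots unique: two leaders may pick the same num, but never the same pair.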
The Two Phases of Paxos
• Phase 1: prepare
– If you trust yourself by Ω (believe you are the leader)
• Choose new unique ballot number
• Learn outcome of all smaller ballots from majority
• Phase 2: accept
– Leader proposes a value with its ballot number
– Leader gets majority to accept its proposal
– A value accepted by a majority can be decided
Paxos - Variables
BallotNum_i, initially ⟨0,0⟩
Latest ballot p_i took part in (phase 1)
AcceptNum_i, initially ⟨0,0⟩
Latest ballot p_i accepted a value in (phase 2)
AcceptVal_i, initially ⊥
Latest accepted value (phase 2)
Phase I: Prepare - Leader
• Periodically, until decision is reached do:
    if leader (by Ω) then
        BallotNum ← ⟨BallotNum.num+1, myId⟩
        send (“prepare”, BallotNum) to all
• Goal: contact other processes, ask them to
join this ballot, and get information about
possible past decisions
Phase I: Prepare - Cohort
• Upon receive (“prepare”, bal) from i
    if bal ≥ BallotNum then   /* this is a higher ballot than my current, I better join it */
        BallotNum ← bal
        send (“ack”, bal, AcceptNum, AcceptVal) to i
• The “ack” is a promise not to accept ballots smaller than bal in the future
• It tells the leader about my latest accepted value and what ballot it was accepted in
Phase II: Accept - Leader
Upon receive (“ack”, BallotNum, b, val) from n-t
    if all vals = ⊥ then myVal = initial value
    else myVal = received val with highest b
    send (“accept”, BallotNum, myVal) to all /* proposal */
The value accepted in the highest ballot might have been decided, so I better propose this value
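The leader's value-selection rule can be sketched in Python as follows. This is a minimal illustration rather than the lecture's code: the function name `choose_value`, the `Ack` tuple, and the use of `None` for ⊥ are assumptions made here.

```python
from typing import NamedTuple, Optional


class Ack(NamedTuple):
    accept_num: tuple          # ballot <num, pid> in which the value was accepted
    accept_val: Optional[str]  # None stands in for ⊥ (nothing accepted yet)


def choose_value(acks, initial_value):
    """Pick the value to propose after hearing acks from n-t processes."""
    accepted = [a for a in acks if a.accept_val is not None]
    if not accepted:
        # All vals = ⊥: nothing was accepted by this majority, propose our own value.
        return initial_value
    # Otherwise adopt the value accepted in the highest ballot; it may already be decided.
    return max(accepted, key=lambda a: a.accept_num).accept_val


acks = [Ack((0, 0), None), Ack((2, 3), "x"), Ack((1, 1), "y")]
assert choose_value(acks, "mine") == "x"
assert choose_value([Ack((0, 0), None)] * 3, "mine") == "mine"
```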
Phase II: Accept - Cohort
Upon receive (“accept”, b, v)
    if b ≥ BallotNum then   /* this is not from an old ballot */
        AcceptNum ← b; AcceptVal ← v   /* accept proposal */
        send (“accept”, b, v) to all (first time only)
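The two cohort handlers (prepare and accept) can be put together in a small Python sketch. It is a hedged illustration under stated assumptions: the class name `Acceptor`, the return-a-message style, and `None` for ⊥ are choices made here. The sketch also records `b` as the latest ballot seen when accepting, which the pseudocode above does not do; that extra assignment is an addition of this sketch so that an older proposal arriving later is still refused.

```python
class Acceptor:
    """Cohort-side state for one consensus instance (None stands in for ⊥)."""

    def __init__(self):
        self.ballot_num = (0, 0)   # latest ballot joined (phase 1)
        self.accept_num = (0, 0)   # latest ballot in which a value was accepted
        self.accept_val = None     # latest accepted value

    def on_prepare(self, bal):
        """Join ballot `bal` if it is not older than the current one."""
        if bal >= self.ballot_num:
            self.ballot_num = bal
            return ("ack", bal, self.accept_num, self.accept_val)
        return None  # older ballot: no reply

    def on_accept(self, b, v):
        """Accept proposal (b, v) unless a higher ballot was already joined."""
        if b >= self.ballot_num:
            self.ballot_num = b  # addition in this sketch, see lead-in
            self.accept_num, self.accept_val = b, v
            return ("accept", b, v)  # to be sent to all
        return None


a = Acceptor()
assert a.on_prepare((2, 1)) == ("ack", (2, 1), (0, 0), None)
# A proposal from the older ballot <1, 2> would break the promise, so it is ignored.
assert a.on_accept((1, 2), "old") is None
assert a.on_accept((2, 1), "new") == ("accept", (2, 1), "new")
# A later leader learns about the accepted value through the ack.
assert a.on_prepare((3, 2)) == ("ack", (3, 2), (2, 1), "new")
```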
Paxos – Deciding
Upon receive (“accept”, b, v) from n-t
decide v
periodically send (“decide”, v) to all
Upon receive (“decide”, v)
decide v
Why don’t we ever “return”?
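The decision rule amounts to counting distinct senders of the same ("accept", b, v) message. The following Python sketch illustrates it; the class name `Learner` and its parameters `n` and `t` (with n-t forming a majority) are assumptions for this example, and values are assumed to be hashable.

```python
from collections import defaultdict


class Learner:
    """Decides v once ("accept", b, v) arrived from n-t distinct processes."""

    def __init__(self, n, t):
        self.quorum = n - t
        self.senders = defaultdict(set)  # (b, v) -> ids of processes that sent it
        self.decision = None

    def on_accept(self, sender, b, v):
        self.senders[(b, v)].add(sender)
        if self.decision is None and len(self.senders[(b, v)]) >= self.quorum:
            self.decision = v
        return self.decision

    def on_decide(self, v):
        """A ("decide", v) message from another process is adopted directly."""
        if self.decision is None:
            self.decision = v
        return self.decision


learner = Learner(n=5, t=2)
assert learner.on_accept(1, (1, 1), "v1") is None
assert learner.on_accept(1, (1, 1), "v1") is None   # duplicate sender counts once
assert learner.on_accept(2, (1, 1), "v1") is None
assert learner.on_accept(3, (1, 1), "v1") == "v1"   # third distinct sender: n-t = 3
```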
In Failure-Free Synchronous Runs
[Figure: message flow among processes 1, 2, …, n in a failure-free synchronous run. Messages and labels, in order: (“prepare”, ⟨1,1⟩); (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥); (“accept”, ⟨1,1⟩, v1) from process 1; (“accept”, ⟨1,1⟩, v1) from all; decide v1. Note: a simple Ω implementation always trusts process 1.]
Correctness: Agreement
• Follows from Lemma 1:
If a proposal (“accept”, b, v) is sent by a
majority, then for every sent proposal
(“accept”, b’, v’) with b’>b, it holds that
v’=v.
Proving Agreement Using Lemma 1
• Let v be a decided value. The first process that
decides v receives n-t accept messages for v with
some ballot b, i.e., (“accept”, b, v) is sent by a
majority.
• No other value is sent with an “accept” message
with the same b. Why?
• Let (“accept”, b1, v1) be the proposal with the
lowest ballot number (b1) sent by n-t
• By Lemma 1, v1 is the only possible decision value
To Prove Lemma 1
• Use Lemma 2: (invariant):
If a proposal (“accept”, b, v) is sent, then there is a
set S consisting of a majority such that either
– no p ∈ S accepts a proposal ranked less than b
(all vals = ⊥)
or
– v is the value of the highest-ranked proposal among
proposals ranked less than b accepted by processes in S
(myVal = received val with highest b).
What Makes Lemma 2 Hold
• A process accepts a proposal numbered b
only if it has not responded to a prepare
request having a number greater than b
• The “ack” response to “prepare” is a
promise not to accept lower-ballot proposals
in the future
• The leader uses “ack” messages from a
majority in choosing the proposed value
Termination
• Assume no loss for a moment
• Once there is one correct leader –
– It eventually chooses the highest ballot number
– No other process becomes a leader with a
higher ballot
– All correct processes “ack” its prepare message
and “accept” its accept message and decide
What About Message Loss?
• Does not block in case of a lost message
– Phase 1 can start with new rank even if
previous attempts never ended
• Conditional liveness:
If n-t correct processes including the leader can
communicate with each other
then they eventually decide
• Holds with eventually reliable links
Performance?
[Figure: the failure-free message flow among processes 1, 2, …, n. Messages, in order: (“prepare”, ⟨1,1⟩); (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥); (“accept”, ⟨1,1⟩, v1) from process 1; (“accept”, ⟨1,1⟩, v1) from all. Callout: “Why is this phase needed?”]
4 Communication steps in well-behaved runs
Compared to 2 for MR
Optimization
• Allow process 1 (only!) to skip Phase 1
– Initialize BallotNum to ⟨1,1⟩
– Propose its own initial value
• 2 steps in failure-free synchronous runs
– Like MR
• 2 steps for repeated invocations with the
same leader
– Common case
Atomic Broadcast
by Running A Sequence of
Consensus Instances
The Setting
• Data is replicated at n servers
• Operations are initiated by clients
• Operations need to be performed at all
correct servers in the same order
– State-machine replication
Client-Server Interaction
• Leader-based: each process (client/server)
has an estimate of who is the current leader
• A client sends a request to its current leader
• The leader launches the Paxos consensus
algorithm to agree upon the order of the
request
• The leader sends the response to the client
Failure-Free Message Flow
[Figure: failure-free message flow between client C and servers S1, S2, …, Sn. Labels: request (C to S1); Phase 1: (“prepare”) and (“ack”); Phase 2: (“accept”); response (S1 to C).]
Observation
• In Phase 1, no consensus values are sent:
– Leader chooses largest unique ballot number
– Gets a majority to “vote” for this ballot number
– Learns the outcome of all smaller ballots from
this majority
• In Phase 2, leader proposes either its own
initial value or latest value it learned in
Phase 1
Message Flow: Take 2
[Figure: second version of the message flow between client C and servers S1, S2, …, Sn. Labels: Phase 1: (“prepare”) and (“ack”); request (C to S1); Phase 2: (“accept”); response (S1 to C).]
Optimization
• Run Phase 1 only when the leader changes
– Phase 1 is called “view change” or “recovery mode”
– Phase 2 is the “normal mode”
• Each message includes BallotNum (from the last
Phase 1) and ReqNum
– e.g., ReqNum = 7 when we’re trying to agree what the
7th operation to invoke on the state machine should be
• Respond only to messages with the “right”
BallotNum
Paxos Atomic Broadcast:
Normal Mode
Upon receive (“request”, v) from client
    if (I am not the leader) then forward to leader
    else
        /* propose v as request number n */
        ReqNum ← ReqNum + 1
        send (“accept”, BallotNum, ReqNum, v) to all
Upon receive (“accept”, b, n, v) with b = BallotNum
    /* accept proposal for request number n */
    AcceptNum[n] ← b; AcceptVal[n] ← v
    send (“accept”, b, n, v) to all (first time only)
Recovery Mode
• The new leader must learn the outcome of all the
pending requests that have smaller BallotNums
– The “ack” messages include AcceptNums and
AcceptVals of all pending requests
• For all pending requests, the leader sends “accept”
messages
• What if there are holes?
– e.g., leader learns of request number 13 and not of 12
– fill in the gaps with dummy “do nothing” requests
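Filling holes can be sketched as below. The sketch assumes the new leader has collected the pending requests into a dictionary keyed by request number and only fills gaps between the lowest and highest numbers it learned of; the names `fill_holes` and `NOOP` are hypothetical.

```python
NOOP = "no-op"  # hypothetical marker for a dummy "do nothing" request


def fill_holes(pending):
    """Given {request number: value} learned in recovery, return gap-free proposals."""
    if not pending:
        return []
    first, last = min(pending), max(pending)
    return [(n, pending.get(n, NOOP)) for n in range(first, last + 1)]


# The new leader learned of requests 11 and 13 but not 12.
assert fill_holes({11: "a", 13: "c"}) == [(11, "a"), (12, NOOP), (13, "c")]
```

The leader would then send an “accept” message for every entry in the returned list, so that later requests are never blocked behind an undecided slot.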
Leslie Lamport’s Reflections
• Inspired by my success at popularizing the
consensus problem by describing it with
Byzantine generals, I decided to cast the algorithm
in terms of a parliament on an ancient Greek
island.
• To carry the image further, I gave a few lectures in
the persona of an Indiana-Jones-style
archaeologist.
• My attempt at inserting some humor into the
subject was a dismal failure.
The History of the Paper
by Lamport
• I submitted the paper to TOCS in 1990. All three referees
said that the paper was mildly interesting, though not very
important, but that all the Paxos stuff had to be removed. I
was quite annoyed at how humorless everyone working in
the field seemed to be, so I did nothing with the paper.
• A number of years later, a couple of people at SRC needed
algorithms for distributed systems they were building, and
Paxos provided just what they needed. I gave them the
paper to read and they had no problem with it. So, I
thought that maybe the time had come to try publishing it
again.