krzys@cs.cornell.edu
"Coordinated Attack"
What are they trying to achieve?
Consensus on whether to attack:
– Both make the same decision.
– Common knowledge:
• "A" knows that "B" knows that "A" knows... etc.
• Refer e.g. to Joe Halpern’s work.
• It’s because the system is asynchronous.
– Messages may take arbitrarily long to get delivered.
– Impossible to tell if a process failed (or is just slow).
Why are we here?
We’re here because...
– Real systems are not asynchronous?
• We wouldn’t wait 1000 years for a message.
• Equipment gets repaired or replaced.
– We don’t need any absolute guarantees?
• An asteroid may hit anybody at any time...
• ...so what guarantees are absolute anyway?!
– For the stubborn... problem ill-posed?
What are they trying to achieve?
Generals agree on whether to attack:
– Both do make (at least one) decision.
– Both make only one (irreversible) decision.
– Both make the same decision.
– If all initially intended on doing the same, then that’s what they decide.
– Both make their decision at the same time.
What are they trying to achieve?
Liveness: decision is made
Correctness: exactly one decision
Non-triviality: decision isn’t arbitrary
Simultaneity: at the same time
Now we have a simple solution!
Send a message with proposal
Upon receiving, take a decision based on certain agreed-upon function of all inputs
(e.g. give "A" priority over "B"): $F(v_A, v_B)$
Until proposal is received, keep asking for it
Ultimately the other party will get your proposal as well and he’ll do the same
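A minimal sketch of such an agreed-upon function, assuming the proposals are booleans (attack or not) and that "priority" means A's value wins whenever the two differ. The name `decide` and the boolean encoding are illustrative assumptions, not part of the slides.

```python
def decide(v_a: bool, v_b: bool) -> bool:
    """Hypothetical F(v_A, v_B): if the proposals differ, A's proposal wins."""
    return v_a if v_a != v_b else v_b

# Both generals evaluate the same deterministic function on the same inputs,
# so they reach the same decision once each holds the other's proposal.
decision_at_a = decide(True, False)
decision_at_b = decide(True, False)
```

If both generals start with the same intention, that intention is what the function returns, which is the non-triviality requirement.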
Our solution again...
[Figure: A and B exchange proposals; each computes $F(v_A, v_B)$; B decides, A decides.]
But does it really work?
What if we have failures?
– If the other process dead: we’ll never make progress.
– And if we try making progress, it may turn out the other process wasn’t dead at all...
• Unsafe, decision could have been wrong!
Conclusions
Liveness seems very hard to achieve
– Can’t just "make up" for a missing value...
– Can we achieve it via a smarter scheme?
– But isn’t it the very thing we may give up?
• We could rely on probabilistic guarantees!
• Can’t do the same for correctness/nontriviality.
What is „distributed consensus”?
Reaching an agreement among a set of distributed processes.
– What is "agreement"?
– What does it mean to "reach" it?
– What "set of processes"?
What is „distributed consensus”?
Agreement:
– All processes think „X”.
X = ...
• Let’s do something: commit, rollback transaction.
• The value of parameter „A” is now „50”.
• Process „200” is now the new group coordinator.
– Handle failures:
• We don’t want one dead guy to hang the system
• A majority of processes needs to „think” so?
– (think of overlapping)
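The "overlapping" remark can be checked by brute force: any two majorities of the same process set share at least one member. A small sketch (the function names are illustrative):

```python
from itertools import combinations

def majorities(processes):
    """Yield every subset that contains more than half of the processes."""
    processes = list(processes)
    n = len(processes)
    for size in range(n // 2 + 1, n + 1):
        for subset in combinations(processes, size):
            yield set(subset)

def all_majorities_overlap(processes):
    """Check that any two majorities share at least one process."""
    quorums = list(majorities(processes))
    return all(q1 & q2 for q1 in quorums for q2 in quorums)

overlap_for_five = all_majorities_overlap(range(5))
```

The shared member is what prevents two disjoint groups from "thinking" different things at once.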
What is „distributed consensus”?
Agreement:
– Seems to imply that no individual opinion should be „critical” to the final outcome
• Something is „critical” ⇒ progress is in danger
– Seems to imply that we need to rely on a form of majority voting...
– Does it imply any „common knowledge”?
• Can processes „change their minds”?
• Can processes give up on agreement?
What is „distributed consensus”?
Reaching agreement:
– So do all need to „think” at the same time?
• What does it mean „at the same time”?
• Consider consistent cuts and atomicity:
– All would have to think „X” before some other „Y”
– Leads us to the virtual synchrony model
• Maybe: all processes will eventually think „X”?
What is „distributed consensus”?
What is the set of processes involved:
– A static, fixed set:
• All processes know each other’s names.
– A form of common knowledge.
– A fixed point of reference.
– Built into the system or updated „consistently”
(everywhere „at the same time” = „atomically”).
What is „distributed consensus”?
Set of processes involved:
– A dynamic set (out of some superset):
• For example: all alive nodes, a set of nearby nodes etc.
• Agreeing processes need to first agree on who’s there to agree with. With whom to agree on that, though?
• Consensus within a consensus: Group Membership.
– Fixing membership upfront is a way to solve this recursive dependency.
What is „distributed consensus”?
A static set of processes:
– Processes may fail.
Do we include failed processes in the set?
• Yes: Everybody becomes a single point of failure.
• No: What about the actions of faulty processes?
– Could affect environment: e.g. a teller machine.
– May require everybody to do the same!
• No: Need a consistent way of reporting failures.
– Failure Detectors, Oracles.
What is „distributed consensus”?
Agreeing on failures.
– Network partitioning.
[Figure: a network partitioned into two groups, A and B.]
What is „distributed consensus”?
A method to collectively:
– Make a decision according to majority will
– Ensure that actions can be based on it, that conflicting decisions cannot be taken
Almost by „definition” it’s a 2PC
– Need to declare and learn intentions
– Need to „secure” the decision made!
Do we really need „consensus”?
Approach a bit „religious”
– Why bother about liveness:
• Probabilistic guarantees are perfectly enough
• Could be quite good even without much effort
– Why simultaneously:
• A „promise” to agree would often be enough...
• Ordering may be all that we care about
– Common knowledge may not be important
– Why solve all problems at once:
• Rely on oracles, failure detectors
– What does „consensus” really „need” to be „solved”?
What is „impossibility”?
As defined, the problem is unsolvable...
...but has it been defined in the „right” way?
– Are all the assumptions reasonable?
• Does the model sound right?
• Are the „required” properties really required?
– Aren’t they too strong?
– Are they intuitive, do they have interpretation?
– Is the conclusion something I care about?
• What is „impossibility”, after all?
• Does this apply to any realistic scenarios?
What did they really prove?
Every protocol must necessarily have a
„window of vulnerability”
– A failure during this period may be fatal:
– ...may cause the protocol to get stuck
– ...may keep the protocol running forever
Conclusion:
– Accept non-liveness as given
– Change the approach: terminate any old protocols if no progress observed, then initiate new, clean rounds
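A back-of-the-envelope sketch of why restarting can be good enough, under a strong assumption that is not in the slides: each fresh round decides independently with the same probability. The function name is illustrative.

```python
def undecided_probability(p_round_succeeds: float, rounds: int) -> float:
    """Chance that none of `rounds` independent rounds reaches a decision."""
    return (1.0 - p_round_succeeds) ** rounds

# With a 50% chance per round, ten rounds all fail with probability 0.5 ** 10.
still_stuck = undecided_probability(0.5, 10)
```

The probability of never deciding shrinks geometrically, but it reaches zero only in the limit, so this is a probabilistic guarantee rather than liveness.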
System model
Assumptions (weak):
– Processes are modeled as automata:
• Can have infinitely many states
• Can have unbounded internal storage
– Processes operate in steps
• Receive, work, atomically send multiple msgs
– Communication via messages
• Asynchronous, nondeterministic... (inevitable)
• ...but „fair” (messages eventually delivered) (necessary: this is the weakening assumption)
System model
Participants
– N processes, N≥2
– Cooperative (a non-byzantine setting)
– State:
• Distinguished input/output registers
• Unbounded internal storage, program counter
– Behavior determined by input + transition f.
– One-bit input register $x_p$ (fixed at the beginning)
– Write-once output register $y_p$, values in $\{b, 0, 1\}$
• b = undecided; 0, 1 = decision states
• writing = making a decision
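A toy rendering of the two registers, assuming a Python class is an acceptable stand-in for the automaton's state. The class and method names are illustrative; the slides only fix the registers and their value ranges.

```python
class Process:
    """Toy process with a one-bit input register and a write-once output register."""

    UNDECIDED = "b"

    def __init__(self, x: int):
        assert x in (0, 1)
        self.x = x                   # input register, fixed at the beginning
        self.y = self.UNDECIDED      # output register, starts undecided

    def decide(self, value: int) -> None:
        """Write the output register; a second write is an error."""
        assert value in (0, 1)
        if self.y != self.UNDECIDED:
            raise RuntimeError("output register is write-once")
        self.y = value
```

Making the second write an error mirrors the requirement that a decision is irreversible.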
System model
Communication model:
– A single message is a pair $(p, m)$, $m \in M$
• p is the name of the destination process
• M is some fixed universe of all known messages
– Message buffer : a multi-set
• Contains all messages sent & not yet delivered
– Operations supported:
• Send(p,m) – place message in buffer
• Receive(p,m)
– delete some message (p,m) from buffer (message gets delivered), or...
– ...just return (buffer stays unchanged)
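A minimal sketch of this buffer, assuming a `Counter` as the multiset and a seeded random generator standing in for the environment's choice. `MessageBuffer`, `NULL` and the 50% chance of delivering nothing are illustrative assumptions.

```python
import random
from collections import Counter

NULL = None  # stands for the "nothing delivered" outcome of a receive

class MessageBuffer:
    """Multiset of (destination, message) pairs sent but not yet delivered."""

    def __init__(self, seed=0):
        self.pending = Counter()
        self.rng = random.Random(seed)   # models the environment's choice

    def send(self, p, m):
        """Place message (p, m) in the buffer."""
        self.pending[(p, m)] += 1

    def receive(self, p):
        """Deliver some pending message for p, or return NULL and change nothing."""
        candidates = [msg for (dest, msg), k in self.pending.items()
                      if dest == p and k > 0]
        if not candidates or self.rng.random() < 0.5:
            return NULL
        m = self.rng.choice(candidates)
        self.pending[(p, m)] -= 1
        if self.pending[(p, m)] == 0:
            del self.pending[(p, m)]
        return m
```

Because `receive` may return `NULL` even when messages are pending, the sketch allows arbitrary delay and reordering; only repeated calls eventually drain the buffer.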
System model
Communication guarantees:
– Communication is reliable: msgs are not corrupted
– Communication is nondeterministic:
• Don’t know when message gets delivered
• Can be delayed for a finite number of rounds
– Other messages may be delivered first
– Nothing may be delivered
• Messages can be reordered
– Communication is „fair”:
• If receive is performed infinitely many times...
... every message eventually gets delivered.
What really is deterministic here?
Processes ARE deterministic
Environment is NOT
– Environment can choose event sequence
• Like moving needles at different speeds!
– Environment can feed a process either with events or with a „non-event” (call it $\varepsilon$)
• Deterministic automata with $\varepsilon$-transitions
System as a whole is nondeterministic
Our general strategy
A typical proof by contradiction
Show that we can’t have all properties
$\neg(C \wedge N \wedge L)$, i.e. $(C \wedge N) \Rightarrow \neg L$
Assume correctness and nontriviality...
...show that liveness isn’t guaranteed!
– We therefore want to show that:
Any protocol can be made forever indecisive
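The two ways of stating the goal, "not all of C, N, L hold" and "C and N together imply not L", are logically equivalent. A quick truth-table check (C = correctness, N = nontriviality, L = liveness; the helper `implies` is illustrative):

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication a => b."""
    return (not a) or b

# Compare the two formulas on all eight truth assignments.
equivalent = all(
    (not (c and n and l)) == implies(c and n, not l)
    for c, n, l in product([False, True], repeat=3)
)
```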
Our general strategy
A little confusing... the proof is indirect
Use Games with the Devil approach:
– Exploit the inherent uncertainty
– Construct sneaky (yet possible) scenarios:
• Communication is maliciously delayed
• The "red button" – we can "blow up" a process
– An irresistible analogy to pumping lemma
Our general strategy
[Figure: "ALERT!!! The danger of consensus!" and "The danger eliminated"; labels: "(can deliver)", "the red button applied".]
A quick refresher on notation
Configuration:
– Internal states of each process
– Contents of the message buffer
Initial configuration:
– Each process starts at an initial state
– Message buffer is empty
Initial state:
– All values but those of input registers are fixed
– In particular, output registers have value „b” (undecided)
Some configurations „have decision value”
– A certain process is in a decision state
[Figure: configurations $C_3$ and $C_4$ with decision values 0 and 1.]
A quick refresher on notation
Step:
– Configuration $C_1$ → configuration $C_2$
– A primitive step by a single process „p”:
• Perform receive(p), obtain $m \in M$
• Depending on p’s internal state and m:
– „p” enters a new state
– „p” sends a finite number of messages
– Determinism:
• For a given configuration C, step is uniquely determined by the message delivered
A quick refresher on notation
Event:
– A pair e=(p,m)
– Can be „applicable” to a configuration
• $(p, \varepsilon)$, the „non-event”, is always applicable
• $(p, m)$ is applicable if message (p, m) is in the buffer
– A function $e(\cdot)$: <config> → <config>
• Uniquely determines a step in every C:
• e(C) = C’
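A sketch of events as functions on configurations, assuming a toy protocol in which a process only counts the messages it receives and sends nothing. `Config`, `transition` and `apply_event` are illustrative names; `None` plays the role of the non-event.

```python
from typing import NamedTuple, Optional, Tuple

class Config(NamedTuple):
    states: Tuple[int, ...]                # internal state of each process
    buffer: Tuple[Tuple[int, str], ...]    # sorted multiset of (destination, message)

Event = Tuple[int, Optional[str]]          # (p, m); m is None for the non-event

def applicable(e: Event, c: Config) -> bool:
    p, m = e
    return m is None or (p, m) in c.buffer

def transition(p: int, state: int, m: Optional[str]):
    """Hypothetical deterministic transition: count received messages, send nothing."""
    new_state = state if m is None else state + 1
    return new_state, []                   # new state, list of (destination, message) to send

def apply_event(e: Event, c: Config) -> Config:
    """e(C) = C': deliver m (if any) to p and let p take one atomic step."""
    assert applicable(e, c)
    p, m = e
    buffer = list(c.buffer)
    if m is not None:
        buffer.remove((p, m))              # removes one copy only (multiset semantics)
    new_state, outgoing = transition(p, c.states[p], m)
    states = c.states[:p] + (new_state,) + c.states[p + 1:]
    return Config(states, tuple(sorted(buffer + outgoing)))
```

Since `transition` is a pure function, the resulting configuration depends only on C and on the chosen event, which is the determinism claim on the slide.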
A quick refresher on notation
Schedule:
– A sequence of events, $\sigma$
• Can be finite or infinite
• Can be applied to C, producing $C' = \sigma(C)$
– We say that such C’ is reachable from C
– Config. reachable from initial config. is accessible
Run:
– A sequence of configurations
• Determined by C and $\sigma = (e_1, e_2, \ldots)$ as $(C, e_1(C), e_2(e_1(C)), \ldots)$
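How a finite schedule determines a run can be sketched in a few lines. Here configurations are plain integers and events are plain functions, which are toy stand-ins chosen only to keep the example short.

```python
def run(config, schedule):
    """Return the run (C, e1(C), e2(e1(C)), ...) for a finite schedule."""
    configs = [config]
    for event in schedule:
        config = event(config)
        configs.append(config)
    return configs

# Toy stand-ins: a configuration is an integer, events are functions on it.
inc = lambda c: c + 1
dbl = lambda c: c * 2

schedule = [inc, dbl, inc]
configs = run(3, schedule)     # 3 -> 4 -> 8 -> 9
reachable = configs[-1]        # the configuration reached by the whole schedule
```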
Configurations and events
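[Figure: tree of configurations $C_1$, $C_2$, $C_3$ connected by events $e_1$ to $e_4$; the events applicable in a configuration form the range of choice; some configurations carry decision values 0 and 1.]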
A quick refresher on notation
Consensus protocol is:
– „Partially correct” if:
• [correctness] No accessible configuration has more than one decision value.
• [nontriviality] Accessible configurations with both „0” and „1” decision values exist
A quick refresher on notation
Process is nonfaulty: takes infinitely many steps
– Eventually receives every message sent!
Run is admissible if:
– At most one process is faulty
– All messages sent to nonfaulty ones are eventually received (the „fairness”)
Run is deciding if:
– Some process reaches a decision state
A quick refresher on notation
Consensus protocol is:
– „Totally correct in spite of one fault” if:
• Partially correct
• [liveness] Every admissible run is deciding (every path in the „configurations tree” has a finite prefix that ends with some process in a deciding state)
Our general strategy
Partially correct = correct + nontrivial
Take a partially correct protocol
Construct an infinite path that never enters configurations w. decision values
– Via choosing the right sequence of events
This will mean that the given protocol is not „totally correct in spite of one fault”
– Such path represents admissible, nondeciding run
Not totally correct = not live
Bivalent configurations
Configuration in which, in a given protocol, the outcome is not determined
– The protocol might lead to accepting „A”...
– ...but it might as well lead to accepting „B”
Our proof by induction:
– A) Show that initial configuration is bivalent
– B) Show that we can force the protocol to produce bivalent configurations indefinitely
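For a finite, explicitly given configuration graph, valency can be computed by exhaustive search. This is only a sketch: real protocols have infinitely many configurations, and the graph below is a made-up example, not the one drawn on the slides.

```python
def reachable_decisions(config, successors, decision):
    """Set of decision values found in configurations reachable from `config`."""
    seen, stack, values = set(), [config], set()
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        if decision.get(c) is not None:
            values.add(decision[c])
        stack.extend(successors.get(c, []))
    return values

def valency(config, successors, decision):
    """Classify a configuration by the decision values still reachable from it."""
    values = reachable_decisions(config, successors, decision)
    if values == {0, 1}:
        return "bivalent"
    if values == {0}:
        return "0-valent"
    if values == {1}:
        return "1-valent"
    return "undetermined"

# Hypothetical configuration tree: edges are single steps, leaves carry decisions.
successors = {"C1": ["C2", "C3"], "C2": ["C4", "C5"], "C3": ["C6"]}
decision = {"C4": 0, "C5": 1, "C6": 1}
```

In this example `C1` and `C2` are bivalent, while `C3` is 1-valent: once the run enters `C3`, the outcome is already determined.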
Bivalent configurations
[Figure: tree of configurations $C_1$ to $C_8$; each edge is a single step taken (state transition); nodes are marked as bivalent or univalent configurations; leaves carry decision values 0 and 1.]
Analogy to the Pumping Lemma
[Figure: bivalent configurations $C_1$ and $C_2$. At $C_1$ we could not deliver message $e_1$..., but at $C_2$ it’s okay, we are still bivalent.]
Proof decomposed
1. Showing the existence of some initial bivalent configuration
2. Showing that we can get from one bivalent configuration into another...
3. ...in a way that every message gets delivered after a finite time.
Initial bivalent configuration
Proof by contradiction:
Assume init. biv. config. doesn’t exist
What would it mean, though?
– Every set of inputs determines the outcome of the consensus algorithm
– There exists a function that given the inputs, produces the decision
– Our algorithm essentially „computes” this function
– But one process may fail...
– ...so we might miss one of the input arguments!
– Our algorithm sort of „tolerates” a loss of one bit
– Note the analogy to error correcting codes!
Initial bivalent configuration
Assume it doesn’t exist, then...
...there must exist both 0-valent and 1-valent initial configurations (by partial correctness)
Recall: this corresponds to „nontriviality”
Initial bivalent configuration
Adjacent configurations:
– Differ by value of a single input register
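[Figure: adjacent initial configurations over processes $p_1$ to $p_8$: $C_1$ with inputs 1 1 0 0 1 0 1 1 and $C_2$ with inputs 1 1 0 0 0 0 1 1; they differ only in the input of $p_5$.]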
Initial bivalent configuration
Every two initial configurations are connected by a chain of adjacent ones
[Figure: a chain of adjacent initial configurations from $C_1$ to $C_2$: 1 1 0 0 1 0 1 1 → 1 1 0 0 0 0 1 1 → 1 1 1 0 0 0 1 1 → 0 1 1 0 0 0 1 1.]
Initial bivalent configuration
There must exist an adjacent pair with different valencies (one 0-valent, one 1-valent)!!!
What does it mean?
– A single process „determines” the output!
– In a sense, what this guy does is „critical”
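The chain argument can be sketched directly: walk from one initial configuration to another, flipping one input at a time, and stop at the first step where the valency changes. The `majority` rule below is a made-up stand-in for "the value this protocol is bound to decide"; it is not a claim about any real protocol.

```python
def chain(start, end):
    """Initial configurations from `start` to `end`, flipping one input bit per step."""
    current, result = list(start), [tuple(start)]
    for i, bit in enumerate(end):
        if current[i] != bit:
            current[i] = bit
            result.append(tuple(current))
    return result

def critical_pair(start, end, valency):
    """First adjacent pair on the chain whose valencies differ, plus the process index."""
    configs = chain(start, end)
    for c0, c1 in zip(configs, configs[1:]):
        if valency(c0) != valency(c1):
            p = next(i for i in range(len(c0)) if c0[i] != c1[i])
            return c0, c1, p
    return None

# Hypothetical valency: pretend the protocol always decides the majority input.
majority = lambda inputs: int(sum(inputs) * 2 > len(inputs))
pair = critical_pair((0, 0, 0), (1, 1, 1), majority)
```

Here the valency flips between `(1, 0, 0)` and `(1, 1, 0)`, so the process with index 1 is the „critical” one.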
Initial bivalent configuration
Let $C_0$, $C_1$ be the univalent adjacent pair
Let the „critical” process be P
Take an admissible run from $C_0$ where P takes no steps (must exist... why?)
Take a corresponding schedule $\sigma$
Apply $\sigma$ to $C_1$
– must lead to almost identical configurations (differences only in P’s state)
[Figure: the same events $e_1, e_2, \ldots, e_n$ applied starting from $C_0$ and from $C_1$.]
Must reach the same decisions!!!
The Intuition
How could the initial value of a process that didn’t communicate at all affect the outcome of the protocol?
Commutativity of schedules
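Assumption:
– $\sigma_1$, $\sigma_2$ are both applicable to C and involve disjoint sets of processes
Conclusion:
– $\sigma_1$ applicable to $\sigma_2(C)$
– $\sigma_2$ applicable to $\sigma_1(C)$
– $\sigma_1(\sigma_2(C)) = \sigma_2(\sigma_1(C))$
Argument:
– $\sigma_1$, $\sigma_2$ don’t „interact”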
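A toy check of commutativity, assuming a protocol in which a process merely logs what it receives and never sends. The configuration encoding and the name `apply` are illustrative; the point is only that schedules touching disjoint processes can be applied in either order.

```python
def apply(schedule, config):
    """Apply a schedule of (process, message) events to a configuration.

    A configuration is (states, buffer): a tuple of per-process tuples of
    received messages, and a sorted tuple of pending (destination, message) pairs.
    Toy transition (an assumption): a process just logs what it receives.
    """
    states, buffer = list(config[0]), list(config[1])
    for p, m in schedule:
        if m is not None:
            buffer.remove((p, m))          # raises ValueError if not applicable
            states[p] = states[p] + (m,)
    return tuple(states), tuple(sorted(buffer))

C = (((), (), ()), ((0, "a"), (1, "b"), (2, "c")))
schedule_1 = [(0, "a"), (0, None)]         # only process 0 takes steps
schedule_2 = [(1, "b"), (2, "c")]          # only processes 1 and 2 take steps

left = apply(schedule_1, apply(schedule_2, C))
right = apply(schedule_2, apply(schedule_1, C))
```

Both orders end in the same configuration, because neither schedule reads or changes anything the other one depends on.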
Commutativity of schedules
[Figure: configurations C, $C_1 = \sigma_1(C)$ and $\sigma_2(C_1)$, showing the states of the processes taking steps in $\sigma_1$ and of the processes taking steps in $\sigma_2$.]
The inductive step
Intuition:
– We want to apply some event „e”...
...but we need to avoid univalent configs
– How far can we get via delaying „e”...
... so that we can safely apply „e” later?
– We want to show we can apply „e” as every event eventually must be applied
The inductive step
[Figure: C is bivalent; applying e right away may lead to trouble (a univalent configuration). Where else can we get without applying e (adding a delay)? e is applicable to each of the yellow guys; we want to show that some of the pink guys are bivalent!]
The inductive step
Intuition:
– If C was bivalent, there must be a way to delay e so as to get into another univalent configuration, with a valency different from that of e(C)
– If things weren’t pre-determined in C , then there must exist an alternative scenario
The inductive step
[Figure: three cases (case 1, case 2, case 3) of how $E_i$ and $F_i$ are placed relative to C.]
– There must be some 0-valent $E_0$ and some 1-valent $E_1$ reachable from C (since C is bivalent)
– Define $F_i$ accordingly among the pink guys
– The $F_i$ guys are i-valent:
• they are not bivalent
• they have a path to $E_i$
– There exist both 0-valent and 1-valent pink guys
The inductive step
Intuition:
– When at C, event e leads to some univalent configuration
– When delayed to some C’, event e leads to another univalent configuration
– Well... this change happens at some point in time, during a certain primitive step e’!
The inductive step
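There must exist neighbors $C_0$, $C_1$ such that $D_i = e(C_i)$ are i-valent
Say it looks like this: $C_1 = e'(C_0)$ with $e' = (p', m')$; $D_0 = e(C_0)$ is 0-valent, $D_1 = e(C_1)$ is 1-valent
[Figure: the paths from C towards $F_0$, $E_0$ and towards $F_1$, $E_1$ meet at the neighbors $C_0$, $C_1$.]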
The inductive step
Intuition:
– Can it be that some process different from p is making this critical step e’?
– No, since then we could delay his step and apply e first... but once we apply e, we are in a univalent configuration and applying e’ would make no difference (commutativity).
The inductive step
Case #1: $p \neq p'$
$D_1 = e'(D_0)$ by commutativity...
...but this is wrong, 1-valent cannot follow 0-valent
[Figure: square with $C_0$, $C_1 = e'(C_0)$, $D_0 = e(C_0)$ and $D_1 = e(C_1) = e'(D_0)$, where $e' = (p', m')$.]
The inductive step
Intuition:
– So both e, e’ are delivered to p
– Apparently p is now the „critical” guy
– Let’s kill the critical guy then!
– The protocol must make some progress
– Actions of other processes must now cause decision to be made
– But then, what if we revive p???
– Still, he was the critical guy, so delivering messages to him now should matter!
– We will again refer to commutativity
The inductive step
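Case #2: $p = p'$
Consider any finite deciding run from $C_0$ in which p takes no steps, let $\sigma$ be the corresponding schedule and $A = \sigma(C_0)$
By commutativity, $\sigma$ is applicable to $D_i$, thus giving i-valent $E_i = \sigma(D_i)$
Again by commutativity, we get $E_0 = e(A)$ and $E_1 = e(e'(A))$
But that means that A is bivalent... and A is deciding!
We reached a contradiction
[Figure: $C_1 = e'(C_0)$, $D_0 = e(C_0)$, $D_1 = e(C_1)$, $A = \sigma(C_0)$, $E_0 = \sigma(D_0) = e(A)$, $E_1 = \sigma(D_1) = e(e'(A))$.]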
Final construction
A queue of processes maintained
Buffer organized as FIFO queues
In each step (roughly what happens):
– Take a process from the process queue
– Give him his earliest undelivered message
– Put him at the end of the process queue
Guarantees admissibility:
– Every process takes infinitely many steps
– Every message is eventually delivered
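The bookkeeping part of this construction (process queue plus per-process FIFO inboxes) can be sketched as below. The sketch deliberately leaves out the hard part, namely inserting extra steps so that each delivery lands in a bivalent configuration; `round_robin` and `inboxes` are illustrative names.

```python
from collections import deque

def round_robin(process_ids, inboxes, steps):
    """Yield (process, message) events; message is None if the inbox is empty.

    `inboxes` maps each process to a FIFO deque of its undelivered messages.
    """
    queue = deque(process_ids)
    for _ in range(steps):
        p = queue.popleft()                                 # process at the head of the queue
        m = inboxes[p].popleft() if inboxes[p] else None    # its earliest undelivered message
        queue.append(p)                                     # back to the end of the queue
        yield (p, m)

inboxes = {1: deque(["m1", "m2"]), 2: deque(), 3: deque(["m3"])}
events = list(round_robin([1, 2, 3], inboxes, steps=6))
```

Every process gets a turn in each pass over the queue, and a message moves one position closer to delivery each time its destination is scheduled, which is where admissibility comes from.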
Final construction
Okay, now what exactly happens:
[Figure: from a bivalent configuration C, a sequence guaranteeing that applying $m_1$ later will put us again in a bivalent configuration leads to C’. Process queue: $p_1, \ldots, p_6$; message queue of $p_1$: $m_1, \ldots, m_5$.]
Consensus protocol
Assumptions:
– A majority of processes aren’t faulty
(before the protocol starts)
– No process dies during the protocol
Consensus Protocol: Phase #1
N=9
L=5
Consensus Protocol: Phase #2a
N=9
L=5
Consensus Protocol: Phase #2b
N=9
L=5
Consensus Protocol: Phase #2c
N=9
L=5
What is Paxos?
A practical algorithm (one of many)
– Arguably most prominent
– An underlying mechanism in real systems
– Dynamic membership
• Processes may fail or restart at any time
– Achieves simultaneous agreement
– Does not even try to guarantee liveness
• Simply start a new protocol if not sure
Conclusions
What’s possible or impossible... first we need to ask the right question
Consensus manifests in many ways and has many flavors to choose from
We can only make probabilistic progress... and that’s fine, we accept it as given
As a consequence, actual protocols like Paxos used in practice keep aborting and restarting
Overhead is always high, consensus is costly
Consistency is not sacred, either...