Consensus Krzysztof Ostrowski


Consensus

Krzysztof Ostrowski

krzys@cs.cornell.edu

Part #1

A motivating example

"Coordinated Attack"

What are they trying to achieve?

 Consensus on whether to attack:

– Both make the same decision.

– Common knowledge:

• "A" knows that "B" knows that "A" knows... etc.

THIS IS IMPOSSIBLE!

• See, e.g., Joe Halpern’s work.

• It’s because the system is asynchronous.

– Messages may take arbitrarily long to get delivered.

– Impossible to tell if a process failed (or is just slow).

Why are we here?

 We’re here because...

– Real systems are not asynchronous?

• We wouldn’t wait 1000 years for a message.

• Equipment gets repaired or replaced.

– We don’t need any absolute guarantees?

• An asteroid may hit anybody at any time...

• ...so what guarantees are absolute anyway?!

– For the stubborn... problem ill-posed?

What are they trying to achieve?

 Generals agree on whether to attack:

– Both do make (at least one) decision.

– Both make only one (irreversible) decision.

– Both make the same decision.

– If both initially intended to do the same, then that’s what they decide.

– Both make their decision at the same time.

What are they trying to achieve?

 Liveness: decision is made

 Correctness: exactly one decision

 Non-triviality: decision isn’t arbitrary

 Simultaneity: at the same time

A new hope:

We could drop some assumptions!

Now we have a simple solution!

 Send a message with proposal

 Upon receiving, take a decision based on a certain agreed-upon function of all inputs, $F(v_A, v_B)$ (e.g. give "A" priority over "B")

 Until proposal is received, keep asking for it

 Ultimately the other party will get your proposal as well and he’ll do the same
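The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: no process fails, the channel loses messages at random but not forever, and `F` gives "A" priority. The names `run_exchange`, `loss_rate` and `seed` are hypothetical and not part of the slides.

```python
import random

def F(v_a, v_b):
    """Agreed-upon decision function; here A's proposal simply has priority."""
    return v_a

def run_exchange(v_a, v_b, loss_rate=0.5, seed=0):
    """Each party keeps asking for the other's proposal, then applies F.

    Assumption: nobody fails. With loss_rate == 1.0 this loops forever,
    which is exactly the liveness problem discussed on the next slides.
    """
    rng = random.Random(seed)
    proposals = {"A": v_a, "B": v_b}
    known = {"A": {"A": v_a}, "B": {"B": v_b}}  # what each party has learned
    rounds = 0
    while any(len(k) < 2 for k in known.values()):
        rounds += 1
        for me, other in (("A", "B"), ("B", "A")):
            if other in known[me]:
                continue
            # The request and the reply must both survive the channel.
            if rng.random() >= loss_rate and rng.random() >= loss_rate:
                known[me][other] = proposals[other]
    decisions = {me: F(k["A"], k["B"]) for me, k in known.items()}
    return decisions, rounds

decisions, rounds = run_exchange("attack", "retreat")
print(decisions, "after", rounds, "round(s)")
```

Both parties evaluate the same function on the same pair of inputs, so they cannot disagree; what the sketch cannot give is a bound on how long they wait.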

Our solution again...

[Figure: A and B exchange proposals; each applies $F(v_A, v_B)$; "A decides", "B decides".]

But does it really work?

 What if we have failures?

– If the other process is dead, we’ll never make progress.

– And if we try making progress, it may turn out the other process wasn’t dead at all...

• Unsafe: the decision could have been wrong!

Conclusions

 Liveness seems very hard to achieve

– Can’t just "make up" for a missing value...

– Can we achieve it via a smarter scheme?

– But isn’t it the very thing we may give up?

• We could rely on probabilistic guarantees!

• Can’t do the same for correctness/nontriviality.

Part #2

Definitions

What is „distributed consensus”?

 Reaching an agreement among a set of distributed processes.

– What is "agreement"?

– What does it mean to "reach" it?

– What "set of processes"?

What is „distributed consensus”?

 Agreement:

– All processes think „X”.

X = ...

• Let’s do something: commit, rollback transaction.

• The value of parameter „A” is now „50”.

• Process „200” is now the new group coordinator.

– Handle failures:

• We don’t want one dead guy to hang the system

• A majority of processes needs to „think” so?

– (think of overlapping)

What is „distributed consensus”?

 Agreement:

– Seems to imply that no individual opinion should be „critical” to the final outcome

• Something is „critical” ⇒ progress is in danger

– Seems to imply that we need to rely on a form of majority voting...

– Does it imply any „common knowledge”?

• Can processes „change their minds”?

• Can processes give up on agreement?

What is „distributed consensus”?

 Reaching agreement:

– So do all need to „think” at the same time ?

• What does it mean „at the same time” ?

• Consider consistent cuts and atomicity:

– All would have to think „X” before some other „Y”

– Leads us to the virtual synchrony model

• Maybe: all processes will eventually think „X” ?

What is „distributed consensus”?

 What is the set of processes involved:

– A static, fixed set:

• All processes know each other’s names.

– A form of common knowledge .

– A fixed point of reference.

– Built into the system or updated „consistently”

(everywhere „at the same time” = „atomically”).

What is „distributed consensus”?

 Set of processes involved:

– A dynamic set (out of some superset):

• For example: all alive nodes, a set of nearby nodes etc.

• Agreeing processes need to first agree on who’s there to agree with. With whom to agree on that, though?

• Consensus within a consensus: Group Membership.

– Fixing membership upfront is a way to solve this recursive dependency.

What is „distributed consensus”?

 A static set of processes:

– Processes may fail.

Do we include failed processes in the set?

• Yes : Everybody becomes a single point of failure.

• No : What about the actions of faulty processes?

– Could affect environment: eg. a teller machine.

– May require everybody to do the same!

No: need a consistent way of reporting failures.

– Failure Detectors, Oracles.

What is „distributed consensus”?

 Agreeing on failures.

– Network partitioning.

[Figure: network partitioning; labels: A, B.]

What is „distributed consensus”?

 A method to collectively:

– Make a decision according to majority will

– Ensure that actions can be based on it, that conflicting decisions cannot be taken

 Almost by „definition” it’s a 2PC

– Need to declare and learn intentions

– Need to „secure” the decision made!

Do we really need „consensus”?

 Approach a bit „religious”

– Why bother about liveness:

• Probabilistic guarantees are perfectly enough

• Could be quite good even without much effort

– Why simultaneously:

• A „promise” to agree would often be enough...

• Ordering may be all that we care about

– Common knowledge may not be important

– Why solve all problems at once:

• Rely on oracles, failure detectors

– What does „consensus” really „need” to be „solved”?

Part #3

The Impossibility Result

What is „impossibility”?

 As defined, the problem is unsolvable...

...but has it been defined in the „right” way?

– Are all the assumptions reasonable?

• Does the model sound right?

• Are the „required” properties really required?

– Aren’t they too strong?

– Are they intuitive, do they have interpretation?

– Is the conclusion something I care about?

• What is „impossibility”, after all?

• Does this apply to any realistic scenarios?

What did they really prove?

 Every protocol must necessarily have a

„window of vulnerability”

– A failure during this period may be fatal:

– ...may cause the protocol to get stuck

– ...may keep the protocol running forever

 Conclusion:

– Accept non-liveness as given

– Change the approach: terminate any old protocols if no progress observed, then initiate new, clean rounds

System model

 Assumptions (weak):

– Processes are modeled as automata:

• Can have infinitely many states

• Can have unbounded internal storage

– Processes operate in steps

• Receive, work, atomically send multiple msgs

– Communication via messages

• Asynchronous, nondeterministic... (inevitable)

• ...but „fair” (messages are eventually delivered): necessary, and this is the weakening assumption

System model

 Participants

– N processes, N≥2

– Cooperative (a non-byzantine setting)

– State:

• Distinguished input/output registers

• Unbounded internal storage, program counter

– Behavior determined by input + transition f.

– One-bit input register $x_p$ (fixed at the beginning)

– Write-once output register $y_p$ with values in {b, 0, 1}: b means undecided, 0 and 1 are the decision states (writing = making a decision)

System model

 Communication model:

– A single message is a pair $(p, m)$ with $m \in M$: p is the name of the destination process, M is some fixed universe of all known messages

– Message buffer : a multi-set

• Contains all messages sent & not yet delivered

– Operations supported:

• Send(p,m) – place message in buffer

• Receive(p)

– delete some message $(p, m)$ from the buffer and return m (the message gets delivered), or...

– ...just return $\emptyset$ (buffer stays unchanged)
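A minimal sketch of such a buffer, assuming Python's `Counter` as the multiset. The class name `MessageBuffer`, the `choose` callback (which stands in for the nondeterministic environment) and the `NULL` marker are illustrative names, not taken from the slides.

```python
from collections import Counter

NULL = None  # stand-in for the "nothing delivered" marker

class MessageBuffer:
    """Multiset of messages that were sent but not yet delivered."""

    def __init__(self):
        self._pending = Counter()  # (destination, message) -> multiplicity

    def send(self, p, m):
        """Place message m, addressed to process p, in the buffer."""
        self._pending[(p, m)] += 1

    def pending_for(self, p):
        """Distinct messages currently waiting for p, in a stable order."""
        return sorted((m for (q, m) in self._pending if q == p), key=repr)

    def receive(self, p, choose=lambda waiting: NULL):
        """Deliver one waiting message picked by `choose`, or return NULL.

        Returning NULL leaves the buffer unchanged, even if messages wait.
        """
        waiting = self.pending_for(p)
        m = choose(waiting)
        if m is NULL or m not in waiting:
            return NULL
        self._pending[(p, m)] -= 1
        if self._pending[(p, m)] == 0:
            del self._pending[(p, m)]
        return m

    def __len__(self):
        return sum(self._pending.values())
```

Passing a different `choose` function models a different environment: one that always returns `NULL` delays every message, one that returns `waiting[0]` delivers eagerly.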

System model

 Communication guarantees:

– Communication is reliable: messages are not corrupted

– Communication is nondeterministic:

• Don’t know when message gets delivered

• Can be delayed for a finite number of rounds

– Other messages may be delivered first

– Nothing may be delivered

• Messages can be reordered

– Communication is „fair”:

• If receive is performed infinitely many times...

... every message eventually gets delivered.

What really is deterministic here?

 Processes ARE deterministic

 Environment is NOT

– Environment can choose event sequence

• Like moving needles at different speeds!

– Environment can feed a process either with events or with a „non-event” (call it $\emptyset$)

• Deterministic automata with $\emptyset$-transitions

 System as a whole is nondeterministic

Part #4

The Proof

Our general strategy

 A typical proof by contradiction

 Show that we can’t have all properties

$\neg(C \land N \land L)$, i.e. $(C \land N) \Rightarrow \neg L$ (C = correctness, N = nontriviality, L = liveness)

 Assume correctness and nontriviality...

...show that liveness isn’t guaranteed!

– We therefore want to show that:

Any protocol can be made forever indecisive

Our general strategy

 A little confusing... the proof is indirect

 Use Games with the Devil approach:

– Exploit the inherent uncertainty

– Construct sneaky (yet possible) scenarios:

• Communication is maliciously delayed

• The "red button" – we can "blow up" a process

– An irresistible analogy to pumping lemma

Our general strategy

[Figure: "ALERT!!! The danger of consensus!" (a message that can be delivered); "The danger eliminated" (the red button applied).]

A quick refresher on notation

 Configuration:

– Internal states of each process

– Contents of the message buffer

 Initial configuration:

– Each process starts at an initial state

– Message buffer is empty

 Initial state:

– All values but those of input registers are fixed

– In particular, output registers have value „b” (undecided)

 Some configurations „have a decision value”:

– A certain process is in a decision state

[Figure: configurations $C_1$ and $C_2$ without a decision value; $C_3$ with decision value 0 and $C_4$ with decision value 1.]

A quick refresher on notation

 Step

– Configuration → configuration, e.g. $C_1 \to C_2$

– A primitive step by a single process „p”:

• Perform receive(p), obtain $m \in M \cup \{\emptyset\}$

• Depending on p’s internal state and m:

– „p” enters a new state

– „p” sends a finite number of messages

– Determinism:

• For a given configuration C, step is uniquely determined by the message delivered

A quick refresher on notation

 Event:

– A pair e=(p,m)

– Can be „applicable” to a configuration

• $(p, \emptyset)$ always applicable

• $(p, m)$ applicable if message m is in the buffer

– A function $e(\cdot)$: <config> → <config>

• Uniquely determines a step in every C:

• e(C) = C’

A quick refresher on notation

 Schedule:

– A sequence of events, denoted $\sigma$

• Can be finite or infinite

• Can be applied to C, producing $C' = \sigma(C)$

– We say that such C’ is reachable from C

– Config. reachable from initial config. is accessible

 Run:

– A sequence of configurations

• Determined by C and $\sigma = (e_1, e_2, \dots)$ as $(C, e_1(C), e_2(e_1(C)), \dots)$
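The notation above can be made concrete with a toy model. In this sketch a configuration is a pair (process states, message buffer), an event is a pair `(p, m)` with `m = None` standing in for the null event, and `echo_step` is a made-up transition function used only for illustration. The starting configuration is an arbitrary one chosen for brevity, not an initial configuration (its buffer is not empty).

```python
from collections import Counter

def apply_event(config, event, step):
    """Return e(C) for event e = (p, m); m is None for the null event."""
    states, buffer = config
    p, m = event
    buffer = Counter(buffer)  # work on a copy
    if m is not None:
        if buffer[(p, m)] == 0:
            raise ValueError(f"event {event} is not applicable")
        buffer[(p, m)] -= 1
    new_state, sent = step(p, states[p], m)
    buffer.update(sent)  # each sent item is a (destination, message) pair
    return {**states, p: new_state}, +buffer  # unary + drops zero counts

def run(config, schedule, step):
    """The run (C, e1(C), e2(e1(C)), ...) determined by C and the schedule."""
    configs = [config]
    for event in schedule:
        configs.append(apply_event(configs[-1], event, step))
    return configs

def echo_step(p, state, m):
    """Hypothetical protocol: count deliveries, answer 'ping' with 'pong'."""
    if m is None:
        return state, []
    other = "q" if p == "p" else "p"
    return state + 1, ([(other, "pong")] if m == "ping" else [])

start = ({"p": 0, "q": 0}, Counter({("q", "ping"): 1}))
sigma = [("p", None), ("q", "ping"), ("p", "pong")]
configs = run(start, sigma, echo_step)
```

Because `echo_step` is deterministic, the run is fully determined by `start` and `sigma`; all nondeterminism sits in the choice of the schedule.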

Configurations and events

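[Figure: a tree of configurations $C_1$, $C_2$, $C_3$ connected by events $e_1, \dots, e_4$; the events applicable to a configuration form its range of choice; some configurations carry decision values 0 or 1.]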

A quick refresher on notation

 Consensus protocol is:

– „Partially correct” if:

• [correctness]

No accessible configuration has more than one decision value.

• [nontriviality]

Accessible configurations with both „0” and „1” decision values exist

A quick refresher on notation

 A process is nonfaulty if it takes infinitely many steps

– Eventually receives every message sent!

 Run is admissible if:

– At most one process is faulty

– All messages sent to nonfaulty ones are eventually received (the „fairness”)

 Run is deciding if:

– Some process reaches a decision state

A quick refresher on notation

 Consensus protocol is:

– „Totally correct in spite of one fault” if:

• Partially correct

• [liveness]

Every admissible run is deciding

(every path in the „configurations tree” has a finite prefix that ends with some process in a deciding state)

Our general strategy

 Partially correct = correct + nontrivial

 Take a partially correct protocol

 Construct an infinite path that never enters configurations w. decision values

– Via choosing the right sequence of events

 This will mean that the given protocol is not

„totally correct in spite of one fault”

– Such path represents admissible, nondeciding run

 Not totally correct = not live

Bivalent configurations

 Configuration in which, in a given protocol, the outcome is not determined

– The protocol might lead to accepting „A”...

– ...but it might as well lead to accepting „B”

 Our proof by induction:

– A) Show that initial configuration is bivalent

– B) Show that we can force the protocol to produce bivalent configurations indefinitely

Bivalent configurations

[Figure: a tree of configurations $C_1$ to $C_8$; each edge is a single step taken (state transition); inner nodes are marked as bivalent or univalent configurations, and the leaves carry decision values 0 or 1.]

Analogy to the Pumping Lemma

[Figure: a path through bivalent configurations. At $C_1$ we could not deliver the message $e_1$..., but at $C_2$ it’s okay: after delivering it we are still bivalent.]

Proof decomposed

1. Showing the existence of some initial bivalent configuration

2. Showing that we can get from one bivalent configuration into another...

3. ...in a way that every message gets delivered after a finite time.

Initial bivalent configuration

 Proof by contradiction:

 Assume that no initial bivalent configuration exists

 What would it mean, though?

– Every set of inputs determines the outcome of the consensus algorithm

– There exists a function that given the inputs, produces the decision

– Our algorithm essentially „computes” this function

– But one process may fail...

– ...so we might miss one of the input arguments!

– Our algorithm sort of „tolerates” a loss of one bit

– Note the analogy to error correcting codes!

Initial bivalent configuration

 Assume it doesn’t exist, then...

...there must exist both 0-valent and 1-valent initial configurations (by partial correctness)

 Recall: this corresponds to „nontriviality”

Initial bivalent configuration

 Adjacent configurations:

– Differ by value of a single input register

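[Figure: two adjacent initial configurations over processes $p_1, \dots, p_8$. $C_1$ has inputs 1 1 0 0 1 0 1 1 and $C_2$ has inputs 1 1 0 0 0 0 1 1; they differ only in the input of $p_5$.]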

Initial bivalent configuration

 Every two initial configurations are connected by a chain of adjacent ones

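[Figure: a chain of adjacent initial configurations leading from $C_1$ to $C_2$, each differing from the next in a single input bit: 1 1 0 0 1 0 1 1, then 1 1 0 0 0 0 1 1, then 1 1 1 0 0 0 1 1, then 0 1 1 0 0 0 1 1.]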
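The chain argument is easy to replay mechanically. In the sketch below, `adjacent_chain` flips the differing inputs one at a time, and `critical_pair` looks for the first adjacent pair whose valencies differ. The `toy_valency` function is purely an assumption for illustration (a majority vote over the inputs); the real valency depends on the protocol under study.

```python
def adjacent_chain(start, end):
    """Initial configurations from start to end, one input flipped per link."""
    chain = [tuple(start)]
    current = list(start)
    for i, (a, b) in enumerate(zip(start, end)):
        if a != b:
            current[i] = b
            chain.append(tuple(current))
    return chain

def critical_pair(chain, valency):
    """First adjacent pair with different valencies, or None if there is none."""
    for left, right in zip(chain, chain[1:]):
        if valency(left) != valency(right):
            return left, right
    return None

def toy_valency(inputs):
    """Hypothetical stand-in: 1 if a strict majority of inputs is 1, else 0."""
    return 1 if 2 * sum(inputs) > len(inputs) else 0

start = (1, 1, 0, 0, 1, 0, 1, 1)
end = (0, 1, 1, 0, 0, 0, 1, 1)
chain = adjacent_chain(start, end)
pair = critical_pair(chain, toy_valency)
```

If the two endpoints have different valencies and every configuration on the chain is univalent, `critical_pair` cannot return `None`: somewhere along the chain the valency has to switch.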

Initial bivalent configuration

 There must exist an adjacent pair of a 0-valent and a 1-valent initial configuration!!!

 What does it mean?

– A single process „determines” the output!

– In a sense, what this guy does is „critical”

Initial bivalent configuration

Let $C_0$, $C_1$ be the univalent adjacent pair

Let the „critical” process be P

Take an admissible run from $C_0$ where P takes no steps (must exist... why?)

Take a corresponding schedule $\sigma = (e_1, e_2, \dots, e_n)$

Apply $\sigma$ to $C_1$

– must lead to almost identical configurations (differences only in P’s state)

Must reach the same decisions!!!

[Figure: the same events $e_1, e_2, \dots, e_n$ applied to $C_0$ and to $C_1$.]

The Intuition

 How could the initial value of a process that didn’t communicate at all affect the outcome of the protocol?

Commutativity of schedules

 Assumption:

– $\sigma_1$, $\sigma_2$ involve disjoint sets of processes

 Conclusion:

– $\sigma_1$ is applicable to $\sigma_2(C)$

– $\sigma_2$ is applicable to $\sigma_1(C)$

– $\sigma_1(\sigma_2(C)) = \sigma_2(\sigma_1(C))$

 Argument:

– $\sigma_1$, $\sigma_2$ don’t „interact”
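A quick way to convince oneself is to run both orders on a toy system. The sketch below assumes a made-up deterministic transition function `step` (it logs each delivery and, on "go", sends an acknowledgement to a third process); the names and messages are illustrative only. Both schedules are assumed to be applicable to the starting configuration.

```python
from collections import Counter

def step(p, state, m):
    """Toy deterministic transition: log the delivery; on 'go', ack to p3."""
    sent = [("p3", "ack-from-" + p)] if m == "go" else []
    return state + (m,), sent

def apply_schedule(config, schedule):
    """Apply a finite schedule (a list of events (p, m)) to a configuration."""
    states, buffer = dict(config[0]), Counter(config[1])
    for p, m in schedule:
        if m is not None:
            if buffer[(p, m)] == 0:
                raise ValueError(f"({p}, {m}) is not applicable")
            buffer[(p, m)] -= 1
        states[p], sent = step(p, states[p], m)
        buffer.update(sent)
    return states, +buffer  # unary + drops zero counts

C = ({"p1": (), "p2": (), "p3": ()},
     Counter({("p1", "go"): 1, ("p2", "go"): 1}))
sigma1 = [("p1", "go"), ("p1", None)]  # only p1 takes steps
sigma2 = [("p2", "go")]                # only p2 takes steps

one_then_two = apply_schedule(apply_schedule(C, sigma1), sigma2)
two_then_one = apply_schedule(apply_schedule(C, sigma2), sigma1)
print(one_then_two == two_then_one)
```

Each schedule touches only its own processes' states and only removes messages addressed to them, and the buffer is a multiset, so the order in which the two schedules add their outgoing messages does not matter.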

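Commutativity of schedules

[Figure: the commutativity diagram. From C, $\sigma_1$ leads to $C_1 = \sigma_1(C)$ and then $\sigma_2$ leads to $\sigma_2(C_1)$. The processes taking steps in $\sigma_1$ (states X, $X_1$, $X_2$) and the processes taking steps in $\sigma_2$ (states Y, $Y_1$, $Y_2$) are disjoint.]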

The inductive step

 Intuition:

– We want to apply some event „e”...

...but we need to avoid univalent configs

– How far can we get via delaying „e”...

... so that we can safely apply „e” later?

– We want to show we can apply „e” as every event eventually must be applied

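The inductive step

[Figure: a bivalent configuration C; applying e right away means trouble, since it leads to a univalent configuration. Where else can we get without applying e (adding a delay)? The yellow configurations are reached from C without applying e, and e is applicable to each of the yellow guys; the pink configurations result from applying e. We want to show that some of the pink guys are bivalent!]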

The inductive step

 Intuition:

– If C was bivalent, there must be a way to delay e so that we get into another univalent configuration, different in valency from e(C)

– If things weren’t pre-determined in C , then there must exist an alternative scenario

The inductive step

[Figure: three cases (case 1, case 2, case 3) showing how $E_i$ and $F_i$ can be positioned relative to C and to the pink configurations.]

– There must be some 0-valent $E_0$ and some 1-valent $E_1$ reachable from C (since C is bivalent)

– Define $F_i$ accordingly among the pink guys

– The $F_i$ guys are i-valent:

• they are not bivalent

• they have a path to $E_i$

– So there exist both 0-valent and 1-valent pink guys

The inductive step

 Intuition:

– When at C , event e leads to some univalent configuration

– When delayed to some C’ , event e leads to another univalent configuration

– Well... this change happens at some point in time, during a certain primitive step e’ !

The inductive step

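– There must exist neighbors $C_0$, $C_1$ such that $D_i = e(C_i)$ are i-valent

– Say it looks like this: $C_1 = e'(C_0)$ with $e' = (p', m')$, $D_0 = e(C_0)$ is 0-valent and $D_1 = e(C_1)$ is 1-valent

[Figure: C at the top with paths through $E_0$, $F_0$ and $F_1$, $E_1$ and a node labelled "(meet)"; below, the neighbors $C_0$ and $C_1$ joined by $e'$, with e leading from $C_0$ to $D_0$ and from $C_1$ to $D_1$.]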

The inductive step

 Intuition:

– Can it be that some process different from p is making this critical step e’ ?

– No, since then we could delay his step and apply e first... but once we apply e , we are in a univalent configuration and applying e’ would make no difference (commutativity).

The inductive step

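Case #1: $p \neq p'$

– $D_1 = e'(D_0)$ by commutativity...

– ...but this is wrong, 1-valent cannot follow 0-valent

[Figure: from $C_0$, event $e' = (p', m')$ leads to $C_1$; e leads from $C_0$ to $D_0$ and from $C_1$ to $D_1$; $e'$ leads from $D_0$ to $D_1$.]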

The inductive step

 Intuition:

– So both e , e’ are delivered to p

– Apparently p is now the „critical” guy

– Let’s kill the critical guy then!

– The protocol must make some progress

– Actions of other processes must now cause decision to be made

– But then, what if we revive p ???

– Still, he was the critical guy, so delivering messages to him now should matter!

– We will again refer to commutativity

The inductive step

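Case #2: $p = p'$

 Consider any finite deciding run from $C_0$ in which p takes no steps, let $\sigma$ be the corresponding schedule and let $A = \sigma(C_0)$

 By commutativity, $\sigma$ is applicable to $D_i$, thus giving i-valent $E_i = \sigma(D_i)$

 Again by commutativity, we get $e(A) = E_0$ and $e(e'(A)) = E_1$

 But that means that A is bivalent... and A is deciding! We reached a contradiction

[Figure: $C_0$, $C_1$, $D_0$, $D_1$ as before; $\sigma$ leads from $C_0$ to A and from $D_i$ to $E_i$; e leads from A to $E_0$, and $e'$ followed by e leads from A to $E_1$.]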

Final construction

 A queue of processes maintained

 Buffer organized as FIFO queues

 In each step (roughly what happens):

– Take a process from the process queue

– Give him his earliest undelivered message

– Put him at the end of the process queue

 Guarantees admissibility:

– Every process takes infinitely many steps

– Every message is eventually delivered
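The fairness skeleton of this construction can be sketched with two kinds of queues. This is only the round-robin part: the sketch deliberately leaves out the step that matters for the proof, namely inserting, before each delivery, a schedule that keeps the configuration bivalent. The function name `round_robin` and the data layout are assumptions made for illustration.

```python
from collections import deque

def round_robin(processes, inboxes, steps):
    """Return the events (p, m) produced by `steps` rounds of the scheduler.

    p comes from the head of the process queue; m is p's earliest
    undelivered message, or None if p's FIFO inbox is empty.
    Note: the inboxes are consumed (mutated) as messages are delivered.
    """
    queue = deque(processes)
    events = []
    for _ in range(steps):
        p = queue.popleft()                       # take a process from the queue
        m = inboxes[p].popleft() if inboxes[p] else None
        events.append((p, m))                     # give it its earliest message
        queue.append(p)                           # put it at the end of the queue
    return events

inboxes = {
    "p1": deque(["m1", "m2"]),
    "p2": deque(),
    "p3": deque(["m3"]),
}
events = round_robin(["p1", "p2", "p3"], inboxes, steps=6)
print(events)
```

With N processes, every process takes a step once in every N consecutive steps, and a message that is k-th in its recipient's queue is delivered within k turns of that recipient, which is where the two admissibility guarantees come from.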

Final construction

 Okay, now what exactly happens:

[Figure: a bivalent configuration C with the process queue $p_1, p_2, \dots, p_6$ and the message queue of $p_1$ holding $m_1, \dots, m_5$. From C we run a sequence guaranteeing that applying $m_1$ later will put us again in a bivalent configuration C’.]

Part #5

The Consensus Protocol

Consensus protocol

 Assumptions:

– A majority of processes aren’t faulty

(before the protocol starts)

– No process dies during the protocol

Consensus Protocol: Phase #1

N=9

L=5

Consensus Protocol: Phase #2a

N=9

L=5

Consensus Protocol: Phase #2b

N=9

L=5

Consensus Protocol: Phase #2c

N=9

L=5

Part #6

Paxos

What is Paxos?

 A practical algorithm (one of many)

– Arguably most prominent

– An underlying mechanism in real systems

– Dynamic membership

• Processes may fail or restart at any time

– Achieves simultaneous agreement

– Does not even try to guarantee liveness

• Simply start a new protocol if not sure

Part #7

Conclusions

Conclusions

 What’s possible or impossible... first we need to ask the right question

 Consensus manifests in many ways and has many flavors to choose from

 We can only make probabilistic progress... and that’s fine, we accept it as given

 As a consequence actual protocols like Paxos used in practice keep aborting and restarting

 Overhead is always high, consensus is costly

 Consistency is not sacred, either...
