Distributed Systems: Atomicity, Decision Making, Faults, Snapshots Slides adapted from Ken's CS514 lectures

Announcements
• Prelim II coming up next week:
– In class, Thursday, November 20th, 10:10–11:25pm
– 203 Thurston
– Closed book, no calculators/PDAs/…
– Bring ID
• Topics:
– Everything after first prelim
– Lectures 14-22, chapters 10-15 (8th ed)
• Review Session Tonight, November 18th, 6:30pm–7:30pm
– Location: 315 Upson Hall
2
Review: What time is it?
• In a distributed system we need practical ways to deal with
time
– E.g. we may need to agree that update A occurred before update B
– Or offer a “lease” on a resource that expires at time 10:10.0150
– Or guarantee that a time critical event will reach all interested
parties within 100ms
3
Review: Event Ordering
• Problem: distributed systems do not share a clock
– Many coordination problems would be simplified if they did (“first
one wins”)
• Distributed systems do have some sense of time
– Events in a single process happen in order
– Messages between processes must be sent before they can be
received
– How helpful is this?
4
Review: Happens-before
• Define a happens-before relation (denoted by →).
– 1) If A and B are events in the same process, and A was executed
before B, then A → B.
– 2) If A is the event of sending a message by one process and B is
the event of receiving that message by another process, then A → B.
– 3) If A → B and B → C, then A → C.
5
Review: Total ordering?
• Happens-before gives a partial ordering of events
• We still do not have a total ordering of events
– We are not able to order events that happen concurrently
• Concurrent if (not A → B) and (not B → A)
6
Review: Partial Ordering
Pi → Pi+1; Qi → Qi+1; Ri → Ri+1
R0 → Q4; Q3 → R4; Q1 → P4; P1 → Q2
7
Review: Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
8
Review: Timestamps
• Assume each process has a local logical clock that ticks
once per event and that the processes are numbered
– Clocks tick once per event (including message send)
– When sending a message, send your clock value
– When receiving a message, set your clock to MAX(your clock,
timestamp of message + 1)
• Thus sending comes before receiving
• The only visibility into actions at other nodes comes from
communication; communication synchronizes the clocks
– If the timestamps of two events A and B are the same, then use the
process identity numbers to break ties.
• This gives a total ordering!
9
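As a concrete illustration of the rule on the slide above, here is a minimal Python sketch of a Lamport logical clock with process-id tie-breaking; the class and method names are invented for the example, not part of the lecture.

# Minimal sketch of a Lamport logical clock with process-id tie-breaking.
class LamportClock:
    def __init__(self, process_id: int):
        self.process_id = process_id
        self.time = 0

    def local_event(self):
        """The clock ticks once per event."""
        self.time += 1
        return (self.time, self.process_id)

    def send(self) -> int:
        """Sending is an event; the message carries the clock value."""
        self.time += 1
        return self.time

    def receive(self, msg_timestamp: int):
        """On receipt, jump ahead of the sender: MAX(own clock, msg + 1)."""
        self.time = max(self.time, msg_timestamp + 1)
        return (self.time, self.process_id)

# Total order: compare (time, process_id) tuples; the id breaks ties.
p, q = LamportClock(1), LamportClock(2)
ts = p.send()                    # p -> q
print(q.receive(ts) > (ts, 1))   # True: the receive is ordered after the send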
Review: Distributed Mutual Exclusion
• Want mutual exclusion in distributed setting
– The system consists of n processes; each process Pi resides at a
different processor
– Each process has a critical section that requires mutual exclusion
– Problem: Cannot use atomic testAndSet primitive since memory not
shared and processes may be on physically separated nodes
• Requirement
– If Pi is executing in its critical section, then no other process Pj is
executing in its critical section
• Compare three solutions
• Centralized Distributed Mutual Exclusion (CDME)
• Fully Distributed Mutual Exclusion (DDME)
• Token passing
10
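As a reminder of what the centralized (CDME) scheme from this review looks like, a minimal sketch in which one coordinator queues requests and grants the critical section to one process at a time; the class, method, and process names are invented for illustration.

# Sketch of centralized distributed mutual exclusion (CDME):
# one coordinator queues requests and grants the critical section
# to one process at a time. Message passing is abstracted as calls.
from collections import deque

class Coordinator:
    def __init__(self):
        self.waiting = deque()   # queued requesters
        self.holder = None       # process currently in its critical section

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            return "GRANT"       # reply immediately
        self.waiting.append(pid)
        return "QUEUED"          # requester blocks until granted

    def release(self, pid):
        assert self.holder == pid
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder       # next process to receive GRANT, if any

c = Coordinator()
print(c.request("P1"))  # GRANT
print(c.request("P2"))  # QUEUED
print(c.release("P1"))  # P2 now holds the lock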
Today
• Atomicity and Distributed Decision Making
• Faults in distributed systems
• What time is it now?
– Synchronized clocks
• What does the entire system look like at this
moment?
11
Atomicity
• Recall:
– Atomicity = either all the operations associated with a
program unit are executed to completion, or none are
performed.
• In a distributed system we may have multiple copies of
the data
– (e.g. replicas are good for reliability/availability)
• PROBLEM: How do we atomically update all of the
copies?
– That is, either all replicas reflect a change or none
12
E.g. Two-Phase Commit
• Goal: Update all replicas atomically
– Either everyone commits or everyone aborts
– No inconsistencies even in face of failures
– Caveat: Assume only crash or fail-stop failures
• Crash: servers stop when they fail – do not continue and generate bad
data
• Fail-stop: in addition to crash, fail-stop failure is detectable.
• Definitions
– Coordinator: Software entity that shepherds the process (in our
example could be one of the servers)
– Ready to commit: side effects of update safely stored on non-volatile
storage
• Even if I crash, once I say I am ready to commit a recovery
procedure will find the evidence and continue with the commit protocol
13
Two Phase Commit: Phase 1
• Coordinator sends a PREPARE message to each replica
• Coordinator waits for all replicas to reply with a vote
• Each participant replies with a vote
– Votes PREPARED if ready to commit and locks data items being
updated
– Votes NO if unable to get a lock or unable to ensure ready to
commit
14
Two Phase Commit: Phase 2
• If coordinator receives PREPARED vote from all replicas
then it may decide to commit or abort
• Coordinator sends its decision to all participants
• If a participant receives a COMMIT decision, it commits the
changes resulting from the update
• If a participant receives an ABORT decision, it discards the
changes resulting from the update
• Participant replies DONE
• When the coordinator has received DONE from all participants,
it can delete its record of the outcome
15
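Putting the two phases together, here is a compressed Python sketch of the coordinator's side; send and recv stand in for a messaging layer and, like the function name, are assumptions for illustration rather than part of the slides.

# Sketch of two-phase commit from the coordinator's side.
# send(r, msg) and recv(r) stand in for a messaging layer (not shown).
def two_phase_commit(replicas, send, recv):
    # Phase 1: ask every replica to prepare and collect the votes.
    for r in replicas:
        send(r, "PREPARE")
    votes = [recv(r) for r in replicas]          # "PREPARED" or "NO"

    # Phase 2: commit only if everyone voted PREPARED, else abort.
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    for r in replicas:
        send(r, decision)

    # Bookkeeping: once every replica acknowledges DONE, the
    # coordinator may forget its record of the outcome.
    for r in replicas:
        assert recv(r) == "DONE"
    return decision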
Performance
• In the absence of failures, 2PC (two-phase commit) requires a
total of 2 (1.5?) round trips of messages before a decision is
made
– Prepare
– Vote NO or PREPARED
– Commit/abort
– Done (but DONE is just for bookkeeping; it does not affect response time)
16
Failure Handling in 2PC –
Replica Failure
• The log contains a <commit T> record.
– In this case, the site executes redo(T).
• The log contains an <abort T> record.
– In this case, the site executes undo(T).
• The log contains a <ready T> record
– In this case consult coordinator Ci.
• If Ci is down, site sends query-status T message to the other sites.
• The log contains no control records concerning T.
– In this case, the site executes undo(T).
17
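These recovery rules amount to a small decision procedure over the replica's log; a sketch follows, with the log representation and the ask_coordinator callback invented for illustration.

# Sketch: what a recovering replica does for transaction T,
# based on the control records it finds in its log.
def recover(log_records_for_T, ask_coordinator):
    if "<commit T>" in log_records_for_T:
        return "redo(T)"                 # outcome already known: commit
    if "<abort T>" in log_records_for_T:
        return "undo(T)"                 # outcome already known: abort
    if "<ready T>" in log_records_for_T:
        # Voted PREPARED but never learned the outcome:
        # must ask the coordinator (or, if it is down, the other sites).
        return ask_coordinator("query-status T")
    # No control records at all: T never reached the prepared state,
    # so it is safe to abort unilaterally.
    return "undo(T)"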
Failure Handling in 2PC –
Coordinator Ci Failure
• If an active site contains a <commit T> record in its log, then
T must be committed.
• If an active site contains an <abort T> record in its log, then
T must be aborted.
• If some active site does not contain the record <ready T> in
its log then the failed coordinator Ci cannot have decided to
commit T. Rather than wait for Ci to recover, it is preferable
to abort T.
• All active sites have a <ready T> record in their logs, but no
additional control records. In this case we must wait for the
coordinator to recover.
– Blocking problem – T is blocked pending the recovery of the coordinator’s site Si.
18
Failure Handling
• Failures are detected with timeouts
• If a participant times out before getting a PREPARE, it can abort
• If the coordinator times out waiting for a vote, it can abort
• If a participant times out waiting for a decision, it is blocked!
– Wait for the coordinator to recover?
– Punt to some other resolution protocol
• If the coordinator times out waiting for DONE, it keeps its record of the
outcome
– Other sites may have a replica
19
Failures in distributed systems
• We may want to avoid relying on a single
server/coordinator/boss to make progress
• Thus we want decision making to be distributed among the
participants (“all nodes created equal”) => the “consensus
problem” in distributed systems
• However depending on what we can assume about the
network, it may be impossible to reach a decision in some
cases!
20
Impossibility of Consensus
• Network characteristics:
– Synchronous - some upper bound on network/processing delay.
– Asynchronous - no upper bound on network/processing delay.
• Fischer, Lynch, and Paterson showed:
– With even just one failure possible, you cannot guarantee
consensus.
• Cannot guarantee consensus process will terminate
• Assumes asynchronous network
– Essence of proof: Just before a decision is reached, we can delay a
node slightly too long to reach a decision.
• But we still want to do it… right?
21
Distributed Decision Making Discussion
• Why is distributed decision making desirable?
– Fault tolerance! Also, atomicity in a distributed system
– A group of machines can come to a decision even if one or more of
them fail during the process
– After decision made, result recorded in multiple places
• Undesirable if algorithm is blocking (e.g. two-phase commit)
– One machine can be stalled until another site recovers:
– A blocked site holds resources (locks on updated items, pages pinned
in memory, etc) until learns fate of update
• To reduce blocking
– add more rounds (e.g. three-phase commit)
– Add more replicas than needed (e.g. quorums)
• What happens if one or more of the nodes is malicious?
– Malicious: attempting to compromise the decision making
• Known as Byzantine fault tolerance. More on this next time
22
Faults
23
Categories of failures
• Crash faults, message loss
– These are common in real systems
– Crash failures: process simply stops, and does nothing wrong that
would be externally visible before it stops
• These faults can’t be directly detected
24
Categories of failures
• Fail-stop failures
– These require system support
– Idea is that the process fails by crashing, and the system notifies
anyone who was talking to it
– With fail-stop failures we can overcome message loss by just
resending packets, which must be uniquely numbered
– Easy to work with… but rarely supported
25
Categories of failures
• Non-malicious Byzantine failures
– This is the best way to understand many kinds of corruption and
buggy behaviors
– Program can do pretty much anything, including sending corrupted
messages
– But it doesn’t do so with the intention of screwing up our protocols
• Unfortunately, a pretty common mode of failure
26
Categories of failure
• Malicious, true Byzantine, failures
– Model is of an attacker who has studied the system and wants to
break it
– She can corrupt or replay messages, intercept them at will,
compromise programs and substitute hacked versions
• This is a worst-case scenario mindset
– In practice, doesn’t actually happen
– Very costly to defend against; typically used in very limited ways
(e.g. key mgt. server)
27
Models of failure
• Question here concerns how failures appear in formal
models used when proving things about protocols
• Think back to Lamport’s happens-before relationship, →
– Model already has processes, messages, temporal ordering
– Assumes messages are reliably delivered
28
Two kinds of models
• We tend to work within two models
– Asynchronous model makes no assumptions about time
• Lamport’s model is a good fit
• Processes have no clocks, will wait indefinitely for messages, could run
arbitrarily fast/slow
• Distributed computing at an “eons” timescale
– Synchronous model assumes a lock-step execution in which
processes share a clock
29
Adding failures in Lamport’s model
• Also called the asynchronous model
• Normally we just assume that a failed process “crashes:” it
stops doing anything
– Notice that in this model, a failed process is indistinguishable from a
delayed process
– In fact, the decision that something has failed takes on an arbitrary
flavor
• Suppose that at point e in its execution, process p decides to treat q as
faulty…
30
What about the synchronous model?
• Here, we also have processes and messages
– But communication is usually assumed to be reliable: any message
sent at time t is delivered by time t + Δ
– Algorithms are often structured into rounds, each lasting some fixed
amount of time Δ, giving time for each process to communicate with
every other process
– In this model, a crash failure is easily detected
• When people have considered malicious failures, they often
used this model
31
Neither model is realistic
• Value of the asynchronous model is that it is so stripped
down and simple
– If we can do something “well” in this model we can do at least as
well in the real world
– So we’ll want “best” solutions
• Value of the synchronous model is that it adds a lot of
“unrealistic” mechanism
– If we can’t solve a problem with all this help, we probably can’t solve
it in a more realistic setting!
– So seek impossibility results
32
Fischer, Lynch and Paterson
• Impossibility of Consensus
• A surprising result
– Impossibility of Asynchronous Distributed Consensus with a Single
Faulty Process
• They prove that no asynchronous algorithm for agreeing on
a one-bit value can guarantee that it will terminate in the
presence of crash faults
– And this is true even if no crash actually occurs!
– Proof constructs infinite non-terminating runs
– Essence of proof: Just before a decision is reached, we can delay a
node slightly too long to reach a decision.
33
Tougher failure models
• We’ve focused on crash failures
– In the synchronous model these look like a “farewell cruel world”
message
– Some call it the “failstop model”. A faulty process is viewed as first
saying goodbye, then crashing
• What about tougher kinds of failures?
– Corrupted messages
– Processes that don’t follow the algorithm
– Malicious processes out to cause havoc?
34
Here the situation is much harder
• Generally we need at least 3f+1 processes in a system to
tolerate f Byzantine failures
– For example, to tolerate 1 failure we need 4 or more processes
• We also need f+1 “rounds”
• Let’s see why this happens
35
Byzantine Generals scenario
• Generals (N of them) surround a city
– They communicate by courier
• Each has an opinion: “attack” or “wait”
– In fact, an attack would succeed: the city will fall.
– Waiting will succeed too: the city will surrender.
– But if some attack and some wait, disaster ensues
• Some Generals (f of them) are traitors… it doesn’t matter if
they attack or wait, but we must prevent them from
disrupting the battle
– Traitor can’t forge messages from other Generals
36
Byzantine Generals scenario
[Cartoon: the generals shout conflicting opinions: “Attack!”, “No, wait!”, “Surrender!”, “Wait…”]
37
A timeline perspective
[Timeline: processes p, q, r, s, t]
• Suppose that p and q favor attack, r is a traitor and s and t
favor waiting… assume that in a tie vote, we attack
38
A timeline perspective
[Timeline: processes p, q, r, s, t]
• After first round collected votes are:
– {attack, attack, wait, wait, traitor’s-vote}
39
What can the traitor do?
• Add a legitimate vote of “attack”
– Anyone with 3 votes to attack knows the outcome
• Add a legitimate vote of “wait”
– Vote now favors “wait”
• Or send different votes to different folks
• Or don’t send a vote at all to some
40
Outcomes?
• Traitor simply votes:
– Either all see {a,a,a,w,w}
– Or all see {a,a,w,w,w}
• Traitor double-votes
– Some see {a,a,a,w,w} and some {a,a,w,w,w}
• Traitor withholds some vote(s)
– Some see {a,a,w,w}, perhaps others see {a,a,a,w,w}, and still others
see {a,a,w,w,w}
• Notice that traitor can’t manipulate votes of loyal Generals!
41
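As a toy illustration of how a loyal General tallies whichever multiset of votes it ends up seeing, using the tie-breaks-to-attack rule from the earlier slide; everything else in this Python sketch is invented for the example.

# Toy sketch: a loyal General decides from the multiset of votes it saw.
# Per the scenario, a tie is resolved in favour of attacking.
from collections import Counter

def decide(votes):
    c = Counter(votes)
    return "attack" if c["a"] >= c["w"] else "wait"

print(decide(["a", "a", "a", "w", "w"]))  # attack
print(decide(["a", "a", "w", "w", "w"]))  # wait
print(decide(["a", "a", "w", "w"]))       # attack (tie -> attack)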
What can we do?
• Clearly we can’t decide yet; some loyal Generals might
have contradictory data
– Anyone with 4 votes can “decide”
– But with 3 votes to “wait” or “attack,” a General isn’t sure (one could
be a traitor…)
• So: in round 2, each sends out “witness” messages: here’s
what I saw in round 1
– General Smith sent me: “attack (signed) Smith”
42
Digital signatures
• These require a cryptographic system
– For example, RSA
– Each player has a secret (private) key K⁻¹ and a public key K.
• She can publish her public key
– RSA gives us a single “encrypt” function:
• Encrypt(Encrypt(M, K), K⁻¹) = Encrypt(Encrypt(M, K⁻¹), K) = M
• Encrypt a hash of the message to “sign” it
43
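The slide describes RSA abstractly; purely as an illustration (not part of the lecture), signing a hash of a message and verifying it with the Python cryptography package looks roughly like this.

# Illustration only: sign a message's hash with RSA-PSS and verify it.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"attack"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

# sign() hashes the message and signs the digest with the private key.
signature = private_key.sign(message, pss, hashes.SHA256())

# verify() raises InvalidSignature if the message or signature was altered.
public_key.verify(signature, message, pss, hashes.SHA256())
print("signature verified")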
With such a system
• A can send a message to B that only A could have sent
– A just encrypts the body with her private key
• … or one that only B can read
– A encrypts it with B’s public key
• Or can sign it as proof she sent it
– B recomputes the hash of the message and decrypts A’s signature to
see if they match
• These capabilities limit what our traitor can do: he can’t
forge or modify a message
44
A timeline perspective
[Timeline: processes p, q, r, s, t]
• In the second round, if the traitor didn’t behave identically toward all
Generals, we can weed out his faulty votes
45
A timeline perspective
[Timeline: p, q, s, and t all announce “Attack!!”; r, the traitor: “Damn! They’re on to me”]
• We attack!
46
Traitor is stymied
• Our loyal generals can deduce that the decision was to
attack
• Traitor can’t disrupt this…
– Either forced to vote legitimately, or is caught
– But costs were steep!
• (f+1)·n² messages!
• Rounds can also be slow….
– “Early stopping” protocols: min(t+2, f+1) rounds; t is true number of
faults
47
Distributed Snapshots
48
Introducing “wall clock time”
• Back to the notion of time…
• Distributed systems sometimes need a more precise notion
of time than happens-before
• There are several options
– Instead of using network/process identity to break ties…
– “Extend” a logical clock with the clock time and use it to break ties
• Makes meaningful statements like “B and D were concurrent, although
B occurred first”
• But unless clocks are closely synchronized such statements could be
erroneous!
– We use a clock synchronization algorithm to reconcile differences
between clocks on various computers in the network
49
Synchronizing clocks
• Without help, clocks will often differ by many milliseconds
– Problem is that when a machine downloads time from a network
clock it can’t be sure what the delay was
– This is because the “uplink” and “downlink” delays are often very
different in a network
• Outright failures of clocks are rare…
50
Synchronizing clocks
[Figure: p asks time.windows.com “What time is it?”; the reply is 09:23.02921 and the measured delay is 123 ms]
• Suppose p synchronizes with time.windows.com and notes
that 123 ms elapsed while the protocol was running… what
time is it now?
51
Synchronizing clocks
• Options?
– p could guess that the delay was evenly split, but this is rarely the
case in WAN settings (downlink speeds are higher)
– p could ignore the delay
– p could factor in only “certain” delay, e.g. if we know that the link
takes at least 5ms in each direction. Works best with GPS time
sources!
• In general we can’t do better than the uncertainty in the link delay
from the time source down to p
52
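A sketch of the “factor in the measured delay” option: estimate the remote clock from one request/response exchange and report the residual uncertainty. The query_server callback is an assumption for illustration; the best we can say is “server time ± half the round trip”, which is exactly the uncertainty the slide warns about.

# Sketch: Cristian-style clock reading over one request/response exchange.
import time

def read_remote_clock(query_server):
    t0 = time.monotonic()
    server_time = query_server()          # e.g. ask a network time source
    t1 = time.monotonic()
    rtt = t1 - t0
    # Assume the reply was generated midway through the round trip.
    estimate = server_time + rtt / 2
    uncertainty = rtt / 2                 # can't do better without more info
    return estimate, uncertainty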
Consequences?
• In a network of processes, we must assume that clocks are
– Not perfectly synchronized.
• We say that clocks are “inaccurate”
– Even GPS has uncertainty, although small
– And clocks can drift during periods between synchronizations
• Relative drift between clocks is their “precision”
53
Temporal distortions
• Things can be complicated because we can’t predict
– Message delays (they vary constantly)
– Execution speeds (often a process shares a machine with many
other tasks)
– Timing of external events
• Lamport looked at this question too
54
Temporal distortions
• What does “now” mean?
[Timeline diagram: processes p0–p3 with events a–f]
55
Temporal distortions
Timelines can “stretch”…
[Timeline diagram: processes p0–p3 with events a–f]
… caused by scheduling effects, message delays, message loss…
57
Temporal distortions
Timelines can “shrink”
[Timeline diagram: processes p0–p3 with events a–f]
E.g. something lets a machine speed up
58
Temporal distortions
Cuts represent instants of time.
[Timeline diagram: processes p0–p3 with events a–f, crossed by several cuts]
But not every “cut” makes sense
Black cuts could occur, but not gray ones.
59
Consistent cuts and snapshots
• Idea is to identify system states that “might” have occurred
in real-life
– Need to avoid capturing states in which a message is received but
nobody is shown as having sent it
– This is the problem with the gray cuts
60
Temporal distortions
Red messages cross gray cuts “backwards”
[Timeline diagram: processes p0–p3; a red message crosses the gray cut “backwards”]
61
Temporal distortions
Red messages cross gray cuts “backwards”
[Timeline diagram: processes p0–p3; a red message crosses the gray cut “backwards”]
In a nutshell: the cut includes a message that “was never sent”
62
Who cares?
• Suppose, for example, that we want to do distributed
deadlock detection
– System lets processes “wait” for actions by other processes
– A process can only do one thing at a time
– A deadlock occurs if there is a circular wait
63
Deadlock detection “algorithm”
• p worries: perhaps we have a deadlock
• p is waiting for q, so sends “what’s your state?”
• q, on receipt, is waiting for r, so sends the same question…
and r for s…. And s is waiting on p.
64
Suppose we detect this state
• We see a cycle…
[Diagram: p waiting for q, q waiting for r, r waiting for s, s waiting for p]
• … but is it a deadlock?
65
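The “is it a deadlock?” question is just cycle detection in the wait-for graph; a minimal Python sketch follows, assuming each process waits for at most one other process (the graph representation is invented for the example).

# Sketch: detect a cycle in a wait-for graph, e.g. {"p": "q", "q": "r", ...}.
# Each process waits for at most one other, so follow the chain and see
# whether it leads back to where we started.
def in_deadlock(waits_for, start):
    seen = set()
    node = start
    while node in waits_for:
        node = waits_for[node]
        if node == start:
            return True          # circular wait: deadlock (on this cut!)
        if node in seen:
            return False         # joined some other chain; no cycle through start
        seen.add(node)
    return False

print(in_deadlock({"p": "q", "q": "r", "r": "s", "s": "p"}, "p"))  # True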
Phantom deadlocks!
• Suppose the system has a very high rate of locking.
• Then perhaps a lock release message “passed” a query
message
– i.e. we see “q waiting for r” and “r waiting for s” but in fact, by the
time we checked r, q was no longer waiting!
• In effect: we checked for deadlock on a gray cut – an
inconsistent cut.
66
Consistent cuts and snapshots
• Goal is to draw a line across the system state such that
– Every message “received” by a process is shown as having been
sent by some other process
– Some pending messages might still be in communication channels
• A “cut” is the frontier of a “snapshot”
67
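A small sketch of the consistency check itself, with an invented message representation: a cut is consistent exactly when every message received inside the cut was also sent inside it.

# Sketch: check whether a cut (a set of events) is consistent.
from collections import namedtuple
Msg = namedtuple("Msg", ["send_event", "recv_event"])

def is_consistent(cut, messages):
    # Consistent iff every message received inside the cut was also sent
    # inside the cut.
    return all(m.send_event in cut
               for m in messages if m.recv_event in cut)

msgs = [Msg("a", "b"), Msg("c", "d")]
print(is_consistent({"a", "b", "c"}, msgs))  # True: d is outside the cut
print(is_consistent({"b"}, msgs))            # False: b received, a never sent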
Chandy/Lamport Algorithm
• Assume that if pi can talk to pj they do so using a lossless,
FIFO connection
• Now think about logical clocks
– Suppose someone sets his clock way ahead and triggers a “flood”
of messages
– As these reach each process, it advances its own time… eventually
all do so.
• The point where time jumps forward is a consistent cut
across the system
68
Using logical clocks to make cuts
Message sets the time
forward by a “lot”
[Timeline diagram: processes p0–p3 with events a–f]
Algorithm requires FIFO channels: must
delay e until b has been delivered!
69
Using logical clocks to make cuts
“Cut” occurs at point
where time advanced
[Timeline diagram: processes p0–p3; the cut falls where each process’s clock jumps forward]
70
Turn idea into an algorithm
• To start a new snapshot, pi …
– Builds a message: “Pi is initiating snapshot k”.
• The tuple (pi, k) uniquely identifies the snapshot
• In general, on first learning about snapshot (pi, k), px
– Writes down its state: px’s contribution to the snapshot
– Starts “tape recorders” for all incoming communication channels
– Forwards the snapshot message on all outgoing channels
– Stops the “tape recorder” for a channel when a snapshot message for
(pi, k) is received on it
• The snapshot consists of all the local state contributions and all the
tape recordings for the channels
71
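Below is a minimal Python sketch of these per-process rules; the class name, channel names, the send callback, and the message format are assumptions for illustration, and delivery plus the final collection at the initiator are not shown.

# Minimal sketch of the Chandy/Lamport rules at one process px.
class SnapshotProcess:
    def __init__(self, name, incoming, outgoing, send):
        self.name = name
        self.incoming = incoming      # names of incoming channels
        self.outgoing = outgoing      # names of outgoing channels
        self.send = send              # send(channel, message)
        self.local_state = None       # px's contribution, once recorded
        self.open_channels = set()    # channels still being "tape recorded"
        self.recording = {}           # channel -> messages recorded on it

    def on_marker(self, snapshot_id, channel):
        if self.local_state is None:
            # First time we hear about (pi, k): record our state, start the
            # tape recorders, and forward the marker on all outgoing channels.
            self.local_state = self.capture_state()
            self.recording = {c: [] for c in self.incoming}
            # The channel the marker arrived on contributes an empty recording.
            self.open_channels = {c for c in self.incoming if c != channel}
            for c in self.outgoing:
                self.send(c, ("MARKER", snapshot_id))
        else:
            # Marker on another channel: stop that channel's tape recorder.
            self.open_channels.discard(channel)

    def on_message(self, channel, msg):
        # Ordinary application message: record it if the channel is still open.
        if self.local_state is not None and channel in self.open_channels:
            self.recording[channel].append(msg)

    def capture_state(self):
        return {"process": self.name}  # whatever matters to the application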
Chandy/Lamport
• This is the same algorithm, implemented with an outgoing flood
followed by an incoming wave of snapshot contributions
• Snapshot ends up accumulating at the initiator, pi
• Algorithm doesn’t tolerate process failures or message
failures.
72
Chandy/Lamport
[Network diagram: nodes p, q, r, s, t, u, v, w, x, y, z]
A network
73
Chandy/Lamport
p: “I want to start a snapshot”
[Network diagram]
A network
74
Chandy/Lamport
p records local state
[Network diagram]
A network
75
Chandy/Lamport
p starts monitoring incoming channels
[Network diagram]
A network
76
Chandy/Lamport
“contents of channel py”
[Network diagram]
A network
77
Chandy/Lamport
p floods message on outgoing channels…
[Network diagram]
A network
78
Chandy/Lamport
[Network diagram: the snapshot message spreads through the network]
A network
79
Chandy/Lamport
q is done
[Network diagram]
A network
80
Chandy/Lamport
[Network diagram]
A network
81
Chandy/Lamport
[Network diagram]
A network
82
Chandy/Lamport
[Network diagram: snapshot contributions flow back toward p]
A network
83
Chandy/Lamport
[Network diagram: snapshot contributions flow back toward p]
A network
84
Chandy/Lamport
[Network diagram: snapshot contributions accumulate at p]
A network
85
Chandy/Lamport
Done!
[Network diagram: p now holds the snapshot contributions from every node]
A snapshot of a network
86
What’s in the “state”?
• In practice we only record things important to the
application running the algorithm, not the “whole” state
– E.g. “locks currently held”, “lock release messages”
• Idea is that the snapshot will be
– Easy to analyze, letting us build a picture of the system state
– And will have everything that matters for our real purpose, like
deadlock detection
87
Summary
• Types of faults
– Crash, fail-stop, non-malicious Byzantine, Byzantine
• Two-phase commit: distributed decision making
– First, make sure everyone guarantees that they will commit if asked
(prepare)
– Next, ask everyone to commit
– Assumes crash or fail-stop faults
• Byzantine Generals Problem: distributed decision making with
malicious failures
– n generals: up to f of them may be malicious
– All non-malicious generals must come to the same decision
– Only solvable if n ≥ 3f+1, but costs (f+1)·n² messages
88