Distributed Systems: Atomicity, decision making, snapshots Slides adapted from Ken's CS514 lectures

Announcements
Please complete course evaluations
– http://www.engineering.cornell.edu/CourseEval/
• Prelim II coming up this week:
– Thursday, April 26th, 7:30—9:00pm, 1½ hour exam
– 101 Phillips
– Closed book, no calculators/PDAs/…
– Bring ID
• Topics:
– Since last Prelim, up to (and including) Monday, April 23rd
– Lectures 19-34, chapters 10-18 (7th ed)
• Review Session Tuesday, April 24th
– during second half of 415 Section
• Homework 6 (and solutions) available via CMS
– Do it without looking at solutions. However, it will not be graded
• Project 5 due after Prelim II, Monday, April 30th
– Make sure to look at the lecture schedule to keep up with due dates
2
Review: What time is it?
• In distributed system we need practical ways to deal with
time
– E.g. we may need to agree that update A occurred before update B
– Or offer a “lease” on a resource that expires at time 10:10.0150
– Or guarantee that a time critical event will reach all interested
parties within 100ms
3
Review: Event Ordering
• Problem: distributed systems do not share a clock
– Many coordination problems would be simplified if they did (“first
one wins”)
• Distributed systems do have some sense of time
– Events in a single process happen in order
– Messages between processes must be sent before they can be
received
– How helpful is this?
4
Review: Happens-before
• Define a Happens-before relation (denoted by →).
– 1) If A and B are events in the same process, and A was executed
before B, then A → B.
– 2) If A is the event of sending a message by one process and B is
the event of receiving that message by another process, then A →
B.
– 3) If A → B and B → C, then A → C.
5
Review: Total ordering?
• Happens-before gives a partial ordering of events
• We still do not have a total ordering of events
– We are not able to order events that happen concurrently
• Concurrent if (not A → B) and (not B → A)
6
Review: Partial Ordering
Pi → Pi+1; Qi → Qi+1; Ri → Ri+1
R0 → Q4; Q3 → R4; Q1 → P4; P1 → Q2
7
Review: Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
8
Review: Timestamps
• Assume each process has a local logical clock that ticks
once per event and that the processes are numbered
– Clocks tick once per event (including message send)
– When sending a message, attach your clock value
– When receiving a message, set your clock to MAX(your clock,
timestamp of message + 1)
• Thus sending comes before receiving
• The only visibility into actions at other nodes comes through
communication; communication synchronizes the clocks
– If the timestamps of two events A and B are the same, then use the
process identity numbers to break ties.
• This gives a total ordering!
9
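The clock rules on this slide can be sketched in a few lines of Python; the class name and methods here are illustrative, not from any particular library:

```python
class LamportClock:
    def __init__(self, process_id):
        self.process_id = process_id  # used only to break timestamp ties
        self.time = 0

    def send(self):
        # Sending is an event: tick, then attach the clock value.
        self.time += 1
        return self.time

    def receive(self, msg_timestamp):
        # Jump past the sender's timestamp so sending precedes receiving.
        self.time = max(self.time, msg_timestamp + 1)
        return self.time

    def stamp(self):
        # (time, process_id) pairs compare lexicographically,
        # giving a total order over all events.
        return (self.time, self.process_id)
```

For example, if process 1 sends to process 2, the receiver's stamp is guaranteed to be greater than the sender's, even though neither has a real clock.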
Review: Distributed Mutual Exclusion
• Want mutual exclusion in distributed setting
– The system consists of n processes; each process Pi resides at a
different processor
– Each process has a critical section that requires mutual exclusion
– Problem: Cannot use atomic testAndSet primitive since memory not
shared and processes may be on physically separated nodes
• Requirement
– If Pi is executing in its critical section, then no other process Pj is
executing in its critical section
• Compare three solutions
• Centralized Distributed Mutual Exclusion (CDME)
• Fully Distributed Mutual Exclusion (DDME)
• Token passing
10
Today
• Atomicity and Distributed Decision Making
• What time is it now?
– Synchronized clocks
• What does the entire system look like at this moment?
11
Atomicity
• Recall:
– Atomicity = either all the operations associated with a
program unit are executed to completion, or none are
performed.
• In a distributed system may have multiple copies of
the data
– (e.g. replicas are good for reliability/availability)
• PROBLEM: How do we atomically update all of the
copies?
– That is, either all replicas reflect a change or none
12
General’s Paradox
• General’s paradox:
– Constraints of problem:
• Two generals, on separate mountains
• Can only communicate via messengers
• Messengers can be captured
– Problem: need to coordinate attack
• If they attack at different times, they all die
• If they attack at same time, they win
– Named after Custer, who died at Little Bighorn because he arrived a
couple of days too early!
• Can messages over an unreliable network be used to
guarantee two entities do something simultaneously?
– Remarkably, “no”, even if all messages get through
– No way to be sure last message gets through!
13
Replica Consistency Problem:
Concurrent and conflicting updates
• Imagine we have multiple bank servers and a client desiring
to update their bank account
– How can we do this?
• Allow a client to update any server then have server
propagate update to other servers?
– Simple and wrong!
– Simultaneous and conflicting updates can occur at different servers!
• Have client send update to all servers?
– Same problem - race condition - which of the conflicting updates will
reach each server first?
14
Two-phase commit
• Since we can't solve the General's Paradox (i.e.
simultaneous action), and clients may send concurrent and
conflicting updates, let's solve a related problem
– Distributed transaction: Two machines agree to do something, or not
do it, atomically
• Algorithm for providing atomic updates in a distributed
system
• Give the servers (or replicas) a chance to say no and if any
server says no, client aborts the operation
15
Framework
• Goal: Update all replicas atomically
– Either everyone commits or everyone aborts
– No inconsistencies even in face of failures
– Caveat: Assume only crash or fail-stop failures
• Crash: servers stop when they fail – do not continue and generate bad
data
• Fail-stop: in addition to crash, fail-stop failure is detectable.
• Definitions
– Coordinator: Software entity that shepherds the process (in our
example could be one of the servers)
– Ready to commit: side effects of update safely stored on non-volatile
storage
• Even if I crash, once I say I am ready to commit, a recovery
procedure will find evidence and continue with the commit protocol
16
Two Phase Commit: Phase 1
• Coordinator sends a PREPARE message to each replica
• Coordinator waits for all replicas to reply with a vote
• Each participant replies with a vote
– Votes PREPARED if ready to commit and locks data items being
updated
– Votes NO if unable to get a lock or unable to ensure ready to commit
17
Two Phase Commit: Phase 2
• If coordinator receives PREPARED vote from all replicas
then it may decide to commit or abort
• Coordinator sends its decision to all participants
• If participant receives COMMIT decision then commit
changes resulting from update
• If participant receives ABORT decision then discard
changes resulting from update
• Participant replies DONE
• When coordinator receives DONE from all participants
then it can delete its record of the outcome
18
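The two phases above can be sketched from the coordinator's point of view. The `Replica` class and its `prepare`/`commit`/`abort` methods are hypothetical in-memory stand-ins for real replica servers, not part of any actual system:

```python
class Replica:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit   # False models "can't get a lock"
        self.state = "INIT"

    def prepare(self):
        # Phase 1: lock data and persist the pending update, or vote NO.
        if self.can_commit:
            self.state = "PREPARED"
            return "PREPARED"
        return "NO"

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def two_phase_commit(coordinator_log, participants):
    # Phase 1: collect votes from every replica.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    coordinator_log.append(decision)   # decision point: must survive a crash
    # Phase 2: propagate the decision (participants would reply DONE).
    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    return decision
```

A single NO vote (or timeout, modeled here as `can_commit=False`) forces every participant to abort, which is exactly the "chance to say no" described above.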
Performance
• In absence of failure, 2PC (two-phase-commit) makes a
total of 2 (1.5?) round trips of messages before decision is
made
– Prepare
– Vote NO or PREPARED
– Commit/abort
– Done (but Done is just for bookkeeping, does not affect response time)
19
Failure Handling in 2PC –
Replica Failure
• The log contains a <commit T> record.
– In this case, the site executes redo(T).
• The log contains an <abort T> record.
– In this case, the site executes undo(T).
• The log contains a <ready T> record
– In this case consult coordinator Ci.
• If Ci is down, site sends query-status T message to the other sites.
• The log contains no control records concerning T.
– In this case, the site executes undo(T).
20
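The recovery rules above amount to a case analysis over the replica's log. A minimal sketch, assuming (as an illustration) that the log is a list of (record, transaction) pairs:

```python
def recover(log, txn):
    """Decide a replica's recovery action for transaction `txn`.

    `log` is assumed to be a list of (record, txn) pairs,
    e.g. ("ready", "T"); returns the action to take."""
    records = {rec for rec, t in log if t == txn}
    if "commit" in records:
        return "redo"             # <commit T> present: execute redo(T)
    if "abort" in records:
        return "undo"             # <abort T> present: execute undo(T)
    if "ready" in records:
        return "ask-coordinator"  # outcome unknown: consult Ci (or peers)
    return "undo"                 # no control records: T cannot have committed
```

Note the asymmetry: a site that never wrote `<ready T>` can safely undo on its own, while a site that did must ask, because the coordinator may already have committed.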
Failure Handling in 2PC –
Coordinator Ci Failure
• If an active site contains a <commit T> record in its log, then
T must be committed.
• If an active site contains an <abort T> record in its log, then
T must be aborted.
• If some active site does not contain the record <ready T> in
its log then the failed coordinator Ci cannot have decided to
commit T. Rather than wait for Ci to recover, it is preferable
to abort T.
• All active sites have a <ready T> record in their logs, but no
additional control records. In this case we must wait for the
coordinator to recover.
– Blocking problem – T is blocked pending the recovery of site Si.
21
Failure Handling
• Failure detected with timeouts
• If participant times out before getting a PREPARE, it can abort
• If coordinator times out waiting for a vote, it can abort
• If a participant times out waiting for a decision, it is blocked!
– Wait for Coordinator to recover?
– Punt to some other resolution protocol
• If a coordinator times out waiting for DONE, it keeps a record
of the outcome
– other sites may have a replica.
22
Failures in distributed systems
• We may want to avoid relying on a single
server/coordinator/boss to make progress
• Thus want the decision making to be distributed among the
participants (“all nodes created equal”) => the “consensus
problem” in distributed systems.
• However depending on what we can assume about the
network, it may be impossible to reach a decision in some
cases!
23
Impossibility of Consensus
• Network characteristics:
– Synchronous - some upper bound on network/processing delay.
– Asynchronous - no upper bound on network/processing delay.
• Fischer, Lynch, and Paterson showed:
– With even just one failure possible, you cannot guarantee
consensus.
• Cannot guarantee consensus process will terminate
• Assumes asynchronous network
– Essence of proof: Just before a decision is reached, we can delay a
node slightly too long to reach a decision.
• But we still want to do it.. Right?
24
Distributed Decision Making Discussion
• Why is distributed decision making desirable?
– Fault Tolerance!
– A group of machines can come to a decision even if one or more of
them fail during the process
• Simple failure mode called “failstop” (different modes later)
– After decision made, result recorded in multiple places
• Undesirable feature of Two-Phase Commit: Blocking
– One machine can be stalled until another site recovers:
• Site B writes “prepared to commit” record to its log, sends a “yes”
vote to the coordinator (site A) and crashes
• Site A crashes
• Site B wakes up, checks its log, and realizes that it has voted "yes" on the
update. It sends a message to site A asking what happened. At this point,
B cannot decide to abort, because the update may have committed
• B is blocked until A comes back
– A blocked site holds resources (locks on updated items, pages pinned
in memory, etc) until it learns the fate of the update
• Alternative: protocols such as "Three Phase
Commit" don't have this blocking problem
• What happens if one or more of the nodes is malicious?
– Malicious: attempting to compromise the decision making
• Known as Byzantine fault tolerance. More on this next time
25
Introducing “wall clock time”
• Back to the notion of time…
• Distributed systems sometimes need a more precise notion
of time than happens-before
• There are several options
– Instead of network/process identity to break ties…
– “Extend” a logical clock with the clock time and use it to break ties
• Makes meaningful statements like “B and D were concurrent, although
B occurred first”
• But unless clocks are closely synchronized such statements could be
erroneous!
– We use a clock synchronization algorithm to reconcile differences
between clocks on various computers in the network
26
Synchronizing clocks
• Without help, clocks will often differ by many milliseconds
– Problem is that when a machine downloads time from a network
clock it can’t be sure what the delay was
– This is because the “uplink” and “downlink” delays are often very
different in a network
• Outright failures of clocks are rare…
27
Synchronizing clocks
[Diagram: p asks time.windows.com "What time is it?"; the reply "09:23.02921" arrives after a measured delay of 123 ms]
• Suppose p synchronizes with time.windows.com and notes
that 123 ms elapsed while the protocol was running… what
time is it now?
28
Synchronizing clocks
• Options?
– p could guess that the delay was evenly split, but this is rarely the
case in WAN settings (downlink speeds are higher)
– p could ignore the delay
– p could factor in only “certain” delay, e.g. if we know that the link
takes at least 5ms in each direction. Works best with GPS time
sources!
• In general can’t do better than uncertainty in the link delay
from the time source down to p
29
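The "evenly split" guess and the "certain delay" refinement above can be combined in one estimator. A sketch, where `min_one_way` is the assumed minimum one-way link delay (e.g. the 5 ms mentioned above); the function name and encoding are illustrative:

```python
def estimate_server_time(server_time, rtt, min_one_way=0.0):
    """Estimate the current time from a time server's reply.

    server_time: timestamp carried in the reply (seconds)
    rtt: measured round-trip time of the exchange (seconds)
    min_one_way: known lower bound on one-way link delay, if any
    """
    # Assuming the delay was evenly split, the reply spent rtt/2 in flight.
    estimate = server_time + rtt / 2.0
    # The true one-way delay lies in [min_one_way, rtt - min_one_way],
    # so the estimate is uncertain by +/- (rtt/2 - min_one_way).
    uncertainty = rtt / 2.0 - min_one_way
    return estimate, uncertainty
```

With the 123 ms exchange on the previous slide, the estimate would be the server's timestamp plus 61.5 ms, uncertain by ±61.5 ms (less if a minimum link delay is known), matching the claim that we can't do better than the uncertainty in the downlink delay.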
Consequences?
• In a network of processes, we must assume that clocks are
– Not perfectly synchronized.
• We say that clocks are “inaccurate”
– Even GPS has uncertainty, although small
– And clocks can drift during periods between synchronizations
• Relative drift between clocks is their “precision”
30
Temporal distortions
• Things can be complicated because we can’t predict
– Message delays (they vary constantly)
– Execution speeds (often a process shares a machine with many
other tasks)
– Timing of external events
• Lamport looked at this question too
31
Temporal distortions
• What does “now” mean?
[Timeline diagram: processes p0–p3, events a–f]
32
Temporal distortions
Timelines can “stretch”…
[Timeline diagram: processes p0–p3, events a–f]
… caused by scheduling effects, message delays, message loss…
34
Temporal distortions
Timelines can “shrink”
[Timeline diagram: processes p0–p3, events a–f]
E.g. something lets a machine speed up
35
Temporal distortions
Cuts represent instants of time.
[Timeline diagram: processes p0–p3, events a–f, with several cuts drawn across the timelines]
But not every "cut" makes sense
Black cuts could occur but not gray ones.
36
Consistent cuts and snapshots
• Idea is to identify system states that “might” have occurred
in real-life
– Need to avoid capturing states in which a message is received but
nobody is shown as having sent it
– This is the problem with the gray cuts
37
Temporal distortions
Red messages cross gray cuts “backwards”
[Timeline diagram: processes p0–p3, events a–f]
38
Temporal distortions
Red messages cross gray cuts “backwards”
[Timeline diagram, as on the previous slide]
In a nutshell: the cut includes a message that "was never sent"
39
Who cares?
• Suppose, for example, that we want to do distributed
deadlock detection
– System lets processes “wait” for actions by other processes
– A process can only do one thing at a time
– A deadlock occurs if there is a circular wait
40
Deadlock detection “algorithm”
• p worries: perhaps we have a deadlock
• p is waiting for q, so sends “what’s your state?”
• q, on receipt, is waiting for r, so sends the same question…
and r for s…. And s is waiting on p.
41
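Since each process can only do one thing at a time, the probe above amounts to following a chain of "waiting for" edges and checking for a repeat. A sketch, with a hypothetical `waits_for` map standing in for the query messages:

```python
def find_deadlock(waits_for, start):
    """Probe the wait-for chain from `start`; return the cycle of blocked
    processes if the probe comes back around, else None.

    `waits_for` maps each blocked process to the one process it waits on
    (each process does only one thing at a time)."""
    seen = []
    p = start
    while p in waits_for:
        if p in seen:
            return seen[seen.index(p):]   # cycle detected: a deadlock
        seen.append(p)
        p = waits_for[p]
    return None                           # chain ended: no deadlock
```

As the next slides point out, the hard part is not finding the cycle but making sure the map reflects a consistent instant of the system.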
Suppose we detect this state
• We see a cycle…
[Diagram: p waiting for q, q waiting for r, r waiting for s, s waiting for p]
• … but is it a deadlock?
42
Phantom deadlocks!
• Suppose system has a very high rate of locking.
• Then perhaps a lock release message “passed” a query
message
– i.e. we see “q waiting for r” and “r waiting for s” but in fact, by the
time we checked r, q was no longer waiting!
• In effect: we checked for deadlock on a gray cut – an
inconsistent cut.
43
Consistent cuts and snapshots
• Goal is to draw a line across the system state such that
– Every message “received” by a process is shown as having been
sent by some other process
– Some pending messages might still be in communication channels
• A “cut” is the frontier of a “snapshot”
44
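The consistency rule above can be checked directly. In this sketch a cut is encoded (purely as an assumption for illustration) as the number of events each process has executed, and each message records the event indices of its send and receive:

```python
def is_consistent(cut, messages):
    """Check a cut against the rule above.

    `cut` maps each process to the number of events it has executed
    by the cut; `messages` is a list of tuples
    (sender, send_event_index, receiver, recv_event_index).
    Consistent = every message received inside the cut was also sent
    inside the cut; pending sent-but-not-received messages are fine."""
    for sender, s_idx, receiver, r_idx in messages:
        received = r_idx < cut.get(receiver, 0)
        sent = s_idx < cut.get(sender, 0)
        if received and not sent:
            return False   # a gray cut: message "was never sent"
    return True
```

Note the asymmetry: a message sent but not yet received just means it is still in a channel, while a message received but not sent is impossible in any real execution.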
Chandy/Lamport Algorithm
• Assume that if pi can talk to pj they do so using a lossless,
FIFO connection
• Now think about logical clocks
– Suppose someone sets his clock way ahead and triggers a “flood”
of messages
– As these reach each process, it advances its own time… eventually
all do so.
• The point where time jumps forward is a consistent cut
across the system
45
Using logical clocks to make cuts
Message sets the time
forward by a “lot”
[Timeline diagram: processes p0–p3; one message carries a large timestamp, advancing each receiver's clock]
Algorithm requires FIFO channels: must
delay e until b has been delivered!
46
Using logical clocks to make cuts
“Cut” occurs at point
where time advanced
[Timeline diagram: the consistent cut falls where each process's clock jumped forward]
47
Turn idea into an algorithm
• To start a new snapshot, pi …
– Builds a message: “Pi is initiating snapshot k”.
• The tuple (pi, k) uniquely identifies the snapshot
• In general, on first learning about snapshot (pi, k), px
– Writes down its state: px's contribution to the snapshot
– Starts "tape recorders" for all communication channels
– Forwards the message on all outgoing channels
– Stops "tape recorder" for a channel when a snapshot message for (pi, k) is
received on it
• Snapshot consists of all the local state contributions and all the tape
recordings for the channels
48
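The per-process rules above can be sketched as a class. This is only a sketch: marker forwarding and channel delivery are assumed to be handled by a surrounding runtime, and `on_marker(None)` is used, as a convention of this sketch, to make a process initiate a snapshot:

```python
class SnapshotProcess:
    def __init__(self, name, state, incoming):
        self.name = name
        self.state = state                  # application state
        self.incoming = incoming            # names of incoming channels
        self.snapshot_state = None          # local contribution, once taken
        self.recording = {}                 # channel -> recorded messages
        self.done_channels = set()          # channels whose recorder stopped

    def on_marker(self, channel):
        # By convention in this sketch, on_marker(None) initiates a snapshot.
        if self.snapshot_state is None:
            # First marker: record local state and start "tape recorders"
            # on every other incoming channel. (The runtime is assumed to
            # forward the marker on all outgoing channels at this point.)
            self.snapshot_state = self.state
            self.recording = {c: [] for c in self.incoming if c != channel}
        # A marker arriving on a channel stops that channel's recorder.
        self.done_channels.add(channel)

    def on_message(self, channel, msg):
        # Messages arriving after the local state was recorded but before
        # that channel's marker are "in flight" across the cut: record them.
        if (self.snapshot_state is not None
                and channel in self.recording
                and channel not in self.done_channels):
            self.recording[channel].append(msg)
```

FIFO channels matter here: because markers cannot overtake ordinary messages, everything recorded on a channel was genuinely in flight at the cut.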
Chandy/Lamport
• This is the same algorithm, implemented as an outgoing flood
followed by an incoming wave of snapshot contributions
• Snapshot ends up accumulating at the initiator, pi
• Algorithm doesn’t tolerate process failures or message
failures.
49
Chandy/Lamport
[Diagram: processes p, q, r, s, t, u, v, w, x, y, z connected by channels]
A network
50
Chandy/Lamport
p: "I want to start a snapshot"
[Network diagram]
A network
51
Chandy/Lamport
p records local state
[Network diagram]
A network
52
Chandy/Lamport
p starts monitoring incoming channels
[Network diagram]
A network
53
Chandy/Lamport
"contents of channel py"
[Network diagram]
A network
54
Chandy/Lamport
p floods message on outgoing channels…
[Network diagram]
A network
55
Chandy/Lamport
q is done
[Network diagram]
A network
57
Chandy/Lamport
Done! p has gathered contributions from p, q, r, s, t, u, v, w, x, y, z
[Network diagram]
A snapshot of a network
63
What’s in the “state”?
• In practice we only record things important to the
application running the algorithm, not the “whole” state
– E.g. “locks currently held”, “lock release messages”
• Idea is that the snapshot will be
– Easy to analyze, letting us build a picture of the system state
– And will have everything that matters for our real purpose, like
deadlock detection
64