Distributed Systems: Atomicity, decision making, snapshots
Slides adapted from Ken's CS514 lectures

Announcements
• Please complete course evaluations – http://www.engineering.cornell.edu/CourseEval/
• Prelim II coming up this week:
  – Thursday, April 26th, 7:30–9:00pm, 1½ hour exam
  – 101 Phillips
  – Closed book, no calculators/PDAs/…
  – Bring ID
• Topics:
  – Since last Prelim, up to (and including) Monday, April 23rd
  – Lectures 19-34, chapters 10-18 (7th ed)
• Review Session Tuesday, April 24th – during second half of 415 Section
• Homework 6 (and solutions) available via CMS
  – Do it without looking at the solutions; however, it will not be graded
• Project 5 due after Prelim II, Monday, April 30th
  – Make sure to look at the lecture schedule to keep up with due dates 2

Review: What time is it?
• In a distributed system we need practical ways to deal with time
  – E.g. we may need to agree that update A occurred before update B
  – Or offer a “lease” on a resource that expires at time 10:10.0150
  – Or guarantee that a time-critical event will reach all interested parties within 100ms 3

Review: Event Ordering
• Problem: distributed systems do not share a clock
  – Many coordination problems would be simplified if they did (“first one wins”)
• Distributed systems do have some sense of time
  – Events in a single process happen in order
  – Messages between processes must be sent before they can be received
  – How helpful is this? 4

Review: Happens-before
• Define a happens-before relation (denoted by →)
  – 1) If A and B are events in the same process, and A was executed before B, then A → B
  – 2) If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B
  – 3) If A → B and B → C, then A → C 5

Review: Total ordering?
• Happens-before gives a partial ordering of events
• We still do not have a total ordering of events
  – We are not able to order events that happen concurrently
• A and B are concurrent if (not A → B) and (not B → A) 6

Review: Partial Ordering
[Figure: space-time diagram with processes P, Q, R and events P0…P4, Q0…Q4, R0…R4]
Pi → Pi+1; Qi → Qi+1; Ri → Ri+1
R0 → Q4; Q3 → R4; Q1 → P4; P1 → Q2 7

Review: Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1, R2, R3, R4 8

Review: Timestamps
• Assume each process has a local logical clock that ticks once per event, and that the processes are numbered
  – Clocks tick once per event (including message sends)
  – When sending a message, attach your clock value
  – When receiving a message, set your clock to MAX(your clock, timestamp of message + 1)
    • Thus sending comes before receiving
    • The only visibility into actions at other nodes comes through communication, and communication synchronizes the clocks
  – If the timestamps of two events A and B are the same, use the process identity numbers to break ties
• This gives a total ordering! 9
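The timestamp rules above translate almost line for line into code. Below is a minimal sketch, in Python, of a Lamport logical clock with process-id tie-breaking, following the slide's rules (tick on every event, MAX on receipt); the class and method names are illustrative and not part of the lecture.

```python
# Minimal sketch of a Lamport logical clock with process-id tie-breaking.
# Class and method names are illustrative, not from the lecture.

class LamportClock:
    def __init__(self, process_id):
        self.process_id = process_id
        self.time = 0

    def local_event(self):
        """Tick once for an internal event."""
        self.time += 1
        return self.time

    def send_event(self):
        """Sending is an event: tick, then piggyback the value on the message."""
        self.time += 1
        return self.time

    def receive_event(self, msg_timestamp):
        """On receipt, jump ahead of the sender: MAX(own clock, timestamp + 1)."""
        self.time = max(self.time, msg_timestamp + 1)
        return self.time

    def stamp(self):
        """Totally ordered timestamp: compare times, break ties by process id."""
        return (self.time, self.process_id)
```

Comparing the (time, process_id) tuples lexicographically yields the total order the slide refers to.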
Review: Distributed Mutual Exclusion
• We want mutual exclusion in a distributed setting
  – The system consists of n processes; each process Pi resides at a different processor
  – Each process has a critical section that requires mutual exclusion
  – Problem: we cannot use an atomic testAndSet primitive, since memory is not shared and processes may be on physically separated nodes
• Requirement
  – If Pi is executing in its critical section, then no other process Pj is executing in its critical section
• We compared three solutions
  – Centralized Distributed Mutual Exclusion (CDME)
  – Fully Distributed Mutual Exclusion (DDME)
  – Token passing 10

Today
• Atomicity and distributed decision making
• What time is it now?
  – Synchronized clocks
• What does the entire system look like at this moment? 11

Atomicity
• Recall:
  – Atomicity = either all the operations associated with a program unit are executed to completion, or none are performed
• In a distributed system we may have multiple copies of the data
  – (e.g. replicas are good for reliability/availability)
• PROBLEM: How do we atomically update all of the copies?
  – That is, either all replicas reflect a change or none do 12

General’s Paradox
• General’s paradox:
  – Constraints of the problem:
    • Two generals, on separate mountains
    • Can only communicate via messengers
    • Messengers can be captured
  – Problem: need to coordinate an attack
    • If they attack at different times, they all die
    • If they attack at the same time, they win
  – Named after Custer, who died at Little Bighorn because he arrived a couple of days too early!
• Can messages over an unreliable network be used to guarantee that two entities do something simultaneously?
  – Remarkably, “no”, even if all messages get through
  – There is no way to be sure the last message gets through! 13

Replica Consistency Problem
Concurrent and conflicting updates
• Imagine we have multiple bank servers and a client that wants to update their bank account
  – How can we do this?
• Allow a client to update any server, then have that server propagate the update to the other servers?
  – Simple and wrong!
  – Simultaneous and conflicting updates can occur at different servers
• Have the client send the update to all servers?
  – Same problem – a race condition: which of the conflicting updates will reach each server first? 14

Two-phase commit
• Since we can’t solve the General’s Paradox (i.e. simultaneous action), and concurrent, conflicting updates may be sent by clients, let’s solve a related problem
  – Distributed transaction: two machines agree to do something, or not do it, atomically
• An algorithm for providing atomic updates in a distributed system
• Give the servers (or replicas) a chance to say no, and if any server says no, the client aborts the operation 15

Framework
• Goal: update all replicas atomically
  – Either everyone commits or everyone aborts
  – No inconsistencies even in the face of failures
  – Caveat: assume only crash or fail-stop failures
    • Crash: servers stop when they fail – they do not continue and generate bad data
    • Fail-stop: in addition to crashing, a fail-stop failure is detectable
• Definitions
  – Coordinator: the software entity that shepherds the process (in our example it could be one of the servers)
  – Ready to commit: the side effects of the update are safely stored on non-volatile storage
    • Even if I crash, once I say I am ready to commit, a recovery procedure will find the evidence and continue with the commit protocol 16

Two Phase Commit: Phase 1
• The coordinator sends a PREPARE message to each replica
• The coordinator waits for all replicas to reply with a vote
• Each participant replies with a vote
  – Votes PREPARED if it is ready to commit, and locks the data items being updated
  – Votes NO if it is unable to get a lock or unable to ensure it is ready to commit 17

Two Phase Commit: Phase 2
• If the coordinator receives a PREPARED vote from all replicas, then it may decide to commit or abort
• The coordinator sends its decision to all participants
• If a participant receives a COMMIT decision, it commits the changes resulting from the update
• If a participant receives an ABORT decision, it discards the changes resulting from the update
• Each participant replies DONE
• When the coordinator has received DONE from all participants, it can delete its record of the outcome 18
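As a schematic view of the two phases just described, here is a sketch of the coordinator's side of the protocol. The send, recv_vote, recv_done and forget helpers are assumed placeholders rather than a real messaging API, and timeouts, logging and recovery are deliberately left out because the next slides deal with them.

```python
# Schematic sketch of the two phases, coordinator side.  send(), recv_vote(),
# recv_done() and forget() are assumed helpers, not a real RPC library;
# timeouts, logging and recovery are omitted (see the failure-handling slides).

def two_phase_commit(coordinator, participants, update):
    # Phase 1: solicit votes
    for p in participants:
        coordinator.send(p, ("PREPARE", update))
    votes = [coordinator.recv_vote(p) for p in participants]

    # Decide: commit is only possible if every participant voted PREPARED
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"

    # Phase 2: propagate the decision, then collect acknowledgements
    for p in participants:
        coordinator.send(p, (decision, update))
    for p in participants:
        coordinator.recv_done(p)       # DONE is bookkeeping only
    coordinator.forget(update)         # safe to delete the record of the outcome
    return decision
```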
Performance
• In the absence of failures, 2PC (two-phase commit) makes a total of 2 (1.5?) round trips of messages before a decision is made:
  – Prepare
  – Vote: NO or PREPARED
  – Commit/abort
  – Done (but Done is just for bookkeeping; it does not affect response time) 19

Failure Handling in 2PC – Replica Failure
• The log contains a <commit T> record
  – In this case, the site executes redo(T)
• The log contains an <abort T> record
  – In this case, the site executes undo(T)
• The log contains a <ready T> record
  – In this case, consult the coordinator Ci
    • If Ci is down, the site sends a query-status T message to the other sites
• The log contains no control records concerning T
  – In this case, the site executes undo(T) 20

Failure Handling in 2PC – Coordinator Ci Failure
• If an active site contains a <commit T> record in its log, then T must be committed
• If an active site contains an <abort T> record in its log, then T must be aborted
• If some active site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T
• If all active sites have a <ready T> record in their logs, but no additional control records, then we must wait for the coordinator to recover
  – Blocking problem – T is blocked pending the recovery of site Si 21

Failure Handling
• Failures are detected with timeouts
• If a participant times out before getting a PREPARE, it can abort
• If the coordinator times out waiting for a vote, it can abort
• If a participant times out waiting for a decision, it is blocked!
  – Wait for the coordinator to recover?
  – Punt to some other resolution protocol
• If the coordinator times out waiting for DONE, it keeps its record of the outcome
  – Other sites may have a replica 22
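The recovery rules above amount to a case analysis over the log records a site finds when it restarts. The sketch below is one way to write that analysis down; the log-inspection helpers (has_record, ask_coordinator, query_other_sites) and the redo/undo operations are assumptions for illustration, not an actual recovery-manager API.

```python
# Sketch of the replica-recovery case analysis.  has_record(), ask_coordinator(),
# query_other_sites(), redo() and undo() are assumed helpers.

def recover_transaction(site, T):
    if site.has_record("commit", T):
        site.redo(T)                   # decision was commit: replay the effects
    elif site.has_record("abort", T):
        site.undo(T)                   # decision was abort: roll back
    elif site.has_record("ready", T):
        # We voted PREPARED, so the outcome is out of our hands: ask the
        # coordinator, or the other sites if the coordinator is down.
        outcome = site.ask_coordinator(T) or site.query_other_sites(T)
        if outcome == "COMMIT":
            site.redo(T)
        elif outcome == "ABORT":
            site.undo(T)
        # otherwise: blocked until some site that knows the outcome recovers
    else:
        site.undo(T)                   # never voted: safe to abort unilaterally
```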
Failures in distributed systems
• We may want to avoid relying on a single server/coordinator/boss to make progress
• Thus we want the decision making to be distributed among the participants (“all nodes created equal”) => the “consensus problem” in distributed systems
• However, depending on what we can assume about the network, it may be impossible to reach a decision in some cases! 23

Impossibility of Consensus
• Network characteristics:
  – Synchronous – some upper bound on network/processing delay
  – Asynchronous – no upper bound on network/processing delay
• Fischer, Lynch, and Paterson showed:
  – With even just one failure possible, you cannot guarantee consensus
    • Cannot guarantee that the consensus process will terminate
    • Assumes an asynchronous network
  – Essence of the proof: just before a decision is reached, we can delay a node slightly too long for a decision to be reached
• But we still want to do it… right? 24

Distributed Decision Making Discussion
• Why is distributed decision making desirable?
  – Fault tolerance!
  – A group of machines can come to a decision even if one or more of them fail during the process
    • Simple failure mode called “fail-stop” (different modes later)
  – After the decision is made, the result is recorded in multiple places
• Undesirable feature of two-phase commit: blocking
  – One machine can be stalled until another site recovers:
    • Site B writes a “prepared to commit” record to its log, sends a “yes” vote to the coordinator (site A), and crashes
    • Site A crashes
    • Site B wakes up, checks its log, and realizes that it has voted “yes” on the update. It sends a message to site A asking what happened. At this point, B cannot decide to abort, because the update may have committed
    • B is blocked until A comes back
  – A blocked site holds resources (locks on updated items, pages pinned in memory, etc.) until it learns the fate of the update
  – Alternative: there are alternatives such as “Three Phase Commit” which don’t have this blocking problem
• What happens if one or more of the nodes is malicious?
  – Malicious: attempting to compromise the decision making
  – Known as Byzantine fault tolerance. More on this next time 25

Introducing “wall clock time”
• Back to the notion of time…
• Distributed systems sometimes need a more precise notion of time than happens-before
• There are several options
  – Instead of network/process identity to break ties…
  – “Extend” a logical clock with the clock time and use it to break ties
    • Makes meaningful statements like “B and D were concurrent, although B occurred first”
    • But unless clocks are closely synchronized, such statements could be erroneous!
  – We use a clock synchronization algorithm to reconcile differences between clocks on various computers in the network 26

Synchronizing clocks
• Without help, clocks will often differ by many milliseconds
  – The problem is that when a machine downloads time from a network clock it can’t be sure what the delay was
  – This is because the “uplink” and “downlink” delays are often very different in a network
• Outright failures of clocks are rare… 27

Synchronizing clocks
[Figure: process p asks time.windows.com “What time is it?”; the reply “09:23.02921” arrives after a measured delay of 123 ms]
• Suppose p synchronizes with time.windows.com and notes that 123 ms elapsed while the protocol was running… what time is it now? 28

Synchronizing clocks
• Options?
  – p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
  – p could ignore the delay
  – p could factor in only the “certain” delay, e.g. if we know that the link takes at least 5ms in each direction. This works best with GPS time sources!
• In general we can’t do better than the uncertainty in the link delay from the time source down to p 29
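The “evenly split” and “certain delay” options can be made concrete with the usual round-trip arithmetic: the best guess places the server's reading halfway through the measured delay, and the possible error is half the round trip minus whatever one-way delay is known for certain. A sketch of that calculation follows; fetch_server_time is an assumed helper, not a real NTP client.

```python
import time

# Sketch of round-trip clock reading.  fetch_server_time() is an assumed helper
# that asks a time source (e.g. time.windows.com) and returns its reading.

def estimate_server_time(fetch_server_time, min_one_way=0.0):
    t_send = time.monotonic()
    server_reading = fetch_server_time()
    t_recv = time.monotonic()

    rtt = t_recv - t_send                 # the "123 ms" in the example
    # Best guess: the reading was taken halfway through the round trip,
    # so by now it is about rtt/2 old.
    estimate = server_reading + rtt / 2.0
    # Worst-case error: the reading could have been taken any time after the
    # request had travelled at least min_one_way, and before the reply arrived.
    uncertainty = rtt / 2.0 - min_one_way
    return estimate, uncertainty
```

With a 123 ms round trip and a guaranteed 5 ms in each direction, the estimate can still be off by up to about 56.5 ms, which is the “can’t do better than the uncertainty in the link delay” point.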
Consequences?
• In a network of processes, we must assume that clocks are
  – Not perfectly synchronized
    • We say that clocks are “inaccurate”
  – Even GPS has uncertainty, although it is small
  – And clocks can drift during the periods between synchronizations
    • The relative drift between clocks is their “precision” 30

Temporal distortions
• Things can be complicated because we can’t predict
  – Message delays (they vary constantly)
  – Execution speeds (often a process shares a machine with many other tasks)
  – The timing of external events
• Lamport looked at this question too 31

Temporal distortions
• What does “now” mean?
[Figure: space-time diagram of processes p0–p3 with events a–f] 32

Temporal distortions
• What does “now” mean?
[Figure: the same diagram, with a candidate “now” line drawn across it] 33

Temporal distortions
• Timelines can “stretch”…
  – …caused by scheduling effects, message delays, message loss…
[Figure: the diagram with one process’s timeline stretched] 34

Temporal distortions
• Timelines can “shrink”
  – E.g. something lets a machine speed up
[Figure: the diagram with one process’s timeline compressed] 35

Temporal distortions
• Cuts represent instants of time
• But not every “cut” makes sense
  – Black cuts could occur, but not gray ones
[Figure: the diagram with several black (consistent) and gray (inconsistent) cuts] 36

Consistent cuts and snapshots
• The idea is to identify system states that “might” have occurred in real life
  – We need to avoid capturing states in which a message is received but nobody is shown as having sent it
  – This is the problem with the gray cuts 37

Temporal distortions
• Red messages cross gray cuts “backwards”
[Figure: the diagram with messages crossing a gray cut from the future side to the past side] 38

Temporal distortions
• Red messages cross gray cuts “backwards”
  – In a nutshell: the cut includes a message that “was never sent”
[Figure: the same diagram, highlighting one such message] 39

Who cares?
• Suppose, for example, that we want to do distributed deadlock detection
  – The system lets processes “wait” for actions by other processes
  – A process can only do one thing at a time
  – A deadlock occurs if there is a circular wait 40

Deadlock detection “algorithm”
• p worries: perhaps we have a deadlock
• p is waiting for q, so it sends “what’s your state?”
• q, on receipt, is waiting for r, so it sends the same question… and r for s… and s is waiting on p 41

Suppose we detect this state
• We see a cycle…
[Figure: the wait-for cycle p → q → r → s → p]
• … but is it a deadlock? 42

Phantom deadlocks!
• Suppose the system has a very high rate of locking
• Then perhaps a lock release message “passed” a query message
  – i.e. we see “q waiting for r” and “r waiting for s”, but in fact, by the time we checked r, q was no longer waiting!
• In effect: we checked for deadlock on a gray cut – an inconsistent cut 43

Consistent cuts and snapshots
• The goal is to draw a line across the system state such that
  – Every message “received” by a process is shown as having been sent by some other process
  – Some pending messages might still be in communication channels
• A “cut” is the frontier of a “snapshot” 44

Chandy/Lamport Algorithm
• Assume that if pi can talk to pj, they do so using a lossless, FIFO connection
• Now think about logical clocks
  – Suppose someone sets his clock way ahead and triggers a “flood” of messages
  – As these reach each process, it advances its own time… eventually all do so
• The point where time jumps forward is a consistent cut across the system 45

Using logical clocks to make cuts
• A message sets the time forward by a “lot”
[Figure: the space-time diagram with a high-timestamp message flooding through p0–p3]
• The algorithm requires FIFO channels: we must delay e until b has been delivered! 46

Using logical clocks to make cuts
• The “cut” occurs at the point where time advanced
[Figure: the same diagram with the cut drawn where each process’s clock jumped] 47

Turn idea into an algorithm
• To start a new snapshot, pi…
  – Builds a message: “pi is initiating snapshot k”
    • The tuple (pi, k) uniquely identifies the snapshot
• In general, on first learning about snapshot (pi, k), px
  – Writes down its state: px’s contribution to the snapshot
  – Starts “tape recorders” for all communication channels
  – Forwards the message on all outgoing channels
  – Stops the “tape recorder” for a channel when a snapshot message for (pi, k) is received on it
• The snapshot consists of all the local state contributions and all the tape recordings for the channels 48
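Below is a minimal sketch of these “tape recorder” rules, assuming the lossless FIFO channels the algorithm requires. The process-runtime hooks (record_local_state, incoming_channels, outgoing_channels, send) are assumptions for illustration, and failures are not handled.

```python
# Sketch of the Chandy/Lamport marker rules, assuming lossless FIFO channels.
# record_local_state(), incoming_channels(), outgoing_channels() and send()
# are assumed process-runtime hooks, not a real library API.

class SnapshotParticipant:
    def __init__(self, proc):
        self.proc = proc
        self.local_state = None
        self.channel_state = {}     # incoming channel -> messages recorded so far
        self.open_channels = set()  # channels whose "tape recorder" still runs

    def start_snapshot(self, snapshot_id):
        """Run by the initiator, or on first hearing about snapshot (pi, k)."""
        self.local_state = self.proc.record_local_state()
        self.open_channels = set(self.proc.incoming_channels())
        self.channel_state = {ch: [] for ch in self.open_channels}
        for ch in self.proc.outgoing_channels():
            self.proc.send(ch, ("MARKER", snapshot_id))   # forward the flood

    def on_receive(self, channel, msg, snapshot_id):
        if msg == ("MARKER", snapshot_id):
            if self.local_state is None:
                self.start_snapshot(snapshot_id)  # first contact with the snapshot
            self.open_channels.discard(channel)   # stop this channel's recorder
        elif channel in self.open_channels:
            # An application message that was in flight when the cut was taken.
            self.channel_state[channel].append(msg)

    def contribution_ready(self):
        return self.local_state is not None and not self.open_channels
```

Note that the channel on which a process first hears about the snapshot ends up with an empty recording: FIFO delivery guarantees that nothing sent before the sender’s cut point can still be behind the marker.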
Chandy/Lamport
• The same algorithm, but implemented with an outgoing flood, followed by an incoming wave of snapshot contributions
• The snapshot ends up accumulating at the initiator, pi
• The algorithm doesn’t tolerate process failures or message failures 49

Chandy/Lamport
[Figure sequence, slides 50–63: a network of processes p, q, r, s, t, u, v, w, x, y, z. Captions: “A network”; “I want to start a snapshot”; “p records local state”; “p starts monitoring incoming channels”; “contents of channel py”; “p floods message on outgoing channels”; “q is done”; the snapshot contributions from q, s, z and the rest accumulate back at p; “Done! A snapshot of a network”]

What’s in the “state”?
• In practice we only record things important to the application running the algorithm, not the “whole” state
  – E.g. “locks currently held”, “lock release messages”
• The idea is that the snapshot will be
  – Easy to analyze, letting us build a picture of the system state
  – And will have everything that matters for our real purpose, like deadlock detection 64
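Once the snapshot has accumulated at the initiator, an application such as deadlock detection only has to analyse the recorded state. As an illustration, the sketch below looks for a circular wait in a wait-for graph assembled from such a snapshot; the waits_for mapping (process to the set of processes it waits on, after discounting any lock-release messages recorded on the channels) is an assumed representation, not something the lecture defines.

```python
# Sketch: detect a circular wait in a wait-for graph taken from a consistent
# snapshot.  waits_for maps each process to the set of processes it waits on.

def find_deadlock(waits_for):
    visited, on_stack = set(), set()

    def dfs(p, path):
        visited.add(p)
        on_stack.add(p)
        path.append(p)
        for q in waits_for.get(p, ()):
            if q in on_stack:                 # back edge: a circular wait
                return path[path.index(q):] + [q]
            if q not in visited:
                cycle = dfs(q, path)
                if cycle:
                    return cycle
        on_stack.discard(p)
        path.pop()
        return None

    for p in list(waits_for):
        if p not in visited:
            cycle = dfs(p, [])
            if cycle:
                return cycle
    return None                               # no circular wait in this cut
```

For the cycle on the earlier slide, find_deadlock({'p': {'q'}, 'q': {'r'}, 'r': {'s'}, 's': {'p'}}) returns ['p', 'q', 'r', 's', 'p']; because the graph comes from a consistent cut, the cycle cannot be the phantom of the earlier slide, where a release message “passed” the query.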