A SHORT DETOUR: THE THEORY OF CONSENSUS
Cornell University
Ken Birman

Consensus Problem
- Systems like Isis2 are solving a hard problem
- We will briefly look at the associated theory

State machine replication consensus
- The state machine idea is very elegant and simple, obvious in many ways
- Hopefully everyone understands it!
- But notice that “a group of N processes” assumes agreement on who is in the group!
- Without agreement on membership, different processes might have different views of group membership
- In this case some members might miss some updates!

Consensus Problem
- When Isis2 is running, it automatically adjusts group membership to handle failures
- It also places concurrent multicasts into a fixed order
- In doing this it creates a powerful tool for solving the so-called “consensus” or “agreement” problem

Consensus Problem
- We imagine a set of processes, already running
- Each process has some input from a sensor device
- For simplicity, assume input values {0,1}
- Our goal: agree on a single value such that
  - All group members use the same value
  - The value was one of the actual inputs (e.g. if all propose 1, we must use 1; if all propose 0, we use 0)

Consensus problem
- Without failures consensus is an easy task!
- With Isis2: The process with rank 0 simply sends its value to every other process: “The answer is 1”
- Without Isis2: Every process broadcasts its value to every other process
  - All collect N values, take the majority vote
  - If N is even, break ties in some standard way

Consensus with failures
- Must ask: “What sort of failures”?
- Simple case: “Crash” failures (process stops)
- Hardest case: “Byzantine” failures (process may lie, cheat, run the protocol incorrectly)
- Both have been studied extensively but we will focus on the crash failure case in our lectures
- Assume a very simple version: there are either NO failures, or exactly ONE failure
- In words: “Consensus even if one member crashes”

Consensus with [at most] one crash
- With Isis2 this problem is easy
- Have the rank-0 process multicast its input
- If a new view becomes defined “instead” of the decision being reported, have the new rank-0 member multicast its input
- Clearly this solves consensus with zero or one crashes
- With Isis2 it is easy to solve consensus

Consensus without Isis2
- In a famous theory paper, Fischer, Lynch and Paterson formalized the fault-tolerant consensus problem and studied it
- They showed that the problem cannot be solved!
- Any correct solution must be at risk of endless delays caused by mistaken detections of failures!
- But if this is true, how can Isis2 solve consensus?

How Isis2 solves consensus
- First we build a subsystem (the Oracle) to track membership in the system, and in process groups
- To join, or handle failure, send a message to the Oracle
- It will report the event to the whole system
- The system has an “easier task” because every component sees the identical membership events
- But the Oracle itself solves consensus!

FLP says it is impossible to solve consensus...
- ... but by “impossible” FLP means “not always feasible”
- In effect: There are situations in which progress cannot be guaranteed
- Indeed, Isis2 is not always able to make progress
- It could hang, unable to advance
- But this is impractical to trigger in real life because the situations are so complex to create

Is Isis2 correct?
- Isis2 has an unavoidable problem
- Certain rare failure sequences can confuse the system
- It will be unable to form new group membership views and for this reason won’t deliver extra multicasts
- Thus in fact consensus will not be “solved”
- But these sequences are very rare
- They require that specific messages be delayed, at specific steps in the protocols, and then released after specific amounts of time
- It is “not possible” to force such an exception to occur in practice; instead, you are more likely to cause Isis2 to crash

This limitation is shared
- You might read about other systems:
  - Paxos, a protocol by Lamport (Isis2 SafeSend is a version of Paxos)
  - Zookeeper, used at Yahoo!
  - Chubby, used at Google
- ... all of them share this same issue that we see with Isis2!
- None is able to “always solve consensus”
- FLP always applies
- But because the problem is so unlikely, we accept it as a kind of inevitable limitation

The CAP Theorem: Brewer
You can have just 2 of {Consistency, Availability and Partition or Fault Tolerance}

CAP theorem
- A second important cloud computing question, posed by Berkeley Professor Eric Brewer
- He considered the many attempts to make cloud computing systems behave like big databases
- Databases use the ACID model: Atomicity, Consistency, Isolation, Durability (also called Serializability)
- Brewer’s company, Inktomi, found that ACID was slow!

CAP definitions
- C stands for Consistency
- A stands for Availability: the system always responds to every request without long delays
- P stands for Partitionability: even if some nodes crash or are inaccessible due to network failure, the system still remains available

CAP theorem
- In any wide area cloud computing service that has two or more “places” where services are offered, there will be a tradeoff
- You can have at most 2 of C, A and P
- For example, to guarantee A+P, we lose Consistency!
- How does CAP apply to systems like Isis2?
- Answer is complicated

CAP and Isis2
- Isis2 is normally used within a cluster or a single data center, but in fact it CAN be configured in a WAN setting if there are no firewalls
- So the scenario for CAP arises
- If it arises, Isis2 limits availability during a partitioning fault: it offers C+A but limits P
- Isis2 has a strong consistency model: a new form of virtual synchrony

Virtual synchrony is a “consistency” model:
- Membership epochs: begin when a new configuration is installed and reported by delivery of a new “view” and associated state
- Protocols run “during” a single epoch: rather than overcome failure, we reconfigure when a failure occurs
- THIS IS WHERE CAP APPLIES!
[Figure: three timelines for processes p, q, r, s, t over time 0 to 70, labeled “Non-replicated reference execution” (A=3, B=7, B = B-A, A=A+1), “Synchronous execution”, and “Virtually synchronous execution”]

The “optimistic early delivery” issue
- In the previous lecture we mentioned that in Isis2 SafeSend is always safe, but that protocols such as OrderedSend, Send, etc. are “optimistic”
- This relates to CAP
- SafeSend will never deliver a message that could later be lost, as in our example in the picture
- But it is slower
- The picture illustrates an optimistic early delivery

Optimistic early delivery
- Allows messages to be delivered before it is certain that all group members will do the delivery
- Failures can disrupt such a delivery
- In that case, perhaps the message is “lost”
- Yet optimistic early delivery is much faster!

What about other group systems
- We mentioned that Isis2 isn’t the only cloud computing solution based on process groups
- Others we mentioned: Spread, JGroups, Paxos
- All of them encounter the same issues with FLP+CAP
- These systems all provide strong guarantees
- They can solve consensus, so FLP applies
- The guarantees involve costs. CAP is ultimately about those costs

Isis2: Send vs. in-memory SafeSend
- Send scales best, but SafeSend with in-memory (rather than disk) logging and small numbers of acceptors isn’t terrible.

g.Flush
- When using optimistic early delivery, the application should call g.Flush() before talking to external users or external databases
- This protocol delays until any optimistic early delivery multicasts have been completed
- g.OrderedSend+g.Flush is just like g.SafeSend!

Version 2: Send+Flush (Optimistic delivery virtually synchronous FIFO multicast with a delay for stability)
[Figure: execution timeline for an individual first-tier replica (A, B, C, D) of a soft-state first-tier service, handling the request “Update the monitoring and alarms criteria for Mrs. Marsh as follows…”. Labels: Send, Send, local response delay, flush, Confirmed. The response delay seen by the end-user would also include Internet latencies.]
- In this example we use g.Send + g.Flush instead of g.SafeSend

SafeSend versus Send: looks “similar”
[Figure: side-by-side execution timelines for replicas A, B, C, D of a soft-state first-tier service, one built from Send operations followed by a flush and one built from SafeSend operations, each ending in Confirmed]
- Bottom line: Which is fastest?

Flush delay as function of shard size
- Flush is fairly fast if we only wait for acks from 3-5 members, but slow if we wait for all members. Isis2 lets the developer set the threshold.
- Cornell (Birman): No distribution restrictions.

What does Flush do?
- When an optimistic multicast is running, we track how many deliveries have occurred
- g.Flush(k) delays until any pending multicasts have been delivered to at least k members
- For example g.Flush(2) means at least 2 members have copies of the multicast
- Unless both fail, simultaneously, right now, the multicast will be fully delivered.
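
To make the g.Flush(k) rule concrete, here is a toy Python model. It is not Isis2 code: the class and method names are hypothetical, and it assumes the sender counts as one of the members holding a copy. It tracks which members hold each pending multicast and reports whether a flush with threshold k could return yet.

```python
from collections import defaultdict


class PendingMulticasts:
    """Toy bookkeeping for optimistic multicasts (hypothetical, not the Isis2 API)."""

    def __init__(self):
        # message id -> set of member names known to hold a copy
        self.copies = defaultdict(set)

    def sent(self, msg_id, sender):
        # Assumption: the sender holds a copy of its own multicast.
        self.copies[msg_id].add(sender)

    def delivered(self, msg_id, member):
        # Record that one more member has received the multicast.
        self.copies[msg_id].add(member)

    def flush_would_return(self, k):
        # Flush(k) may return once every pending multicast has reached >= k members.
        return all(len(members) >= k for members in self.copies.values())


pending = PendingMulticasts()
pending.sent("m1", "A")
pending.sent("m2", "A")
pending.delivered("m1", "B")
before = pending.flush_would_return(2)   # m2 is still held only by A
pending.delivered("m2", "C")
after = pending.flush_would_return(2)    # both multicasts now have 2 copies
print(before, after)
```

A small k keeps the wait short because only the first few acknowledgements matter; waiting for every member corresponds to setting k to the group size.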
Summary
- To build a simple replication solution, use g.SafeSend
- But to get the best performance, it is better to use g.Send+g.Flush, with a small Flush reliability level
- As a smart designer, you won’t make things complex: start by using g.SafeSend
- But when tuning performance, consider shifting to g.Send+g.Flush, if your application can be correct with the weaker optimistic guarantees of g.Send!

Multicast ordering options
How “much” order do we need?

Other “flavors” of multicast
- g.SafeSend(): Paxos
- g.OrderedSend(): Like SafeSend but for in-memory data. “Optimistic delivery”
- g.CausalSend(): Rarely used, implements causal ordering. If x → y then delivers x before y
- g.Send(): FIFO: if some process sends x, then y, delivers x before y
- g.RawSend(): like UDP multicast. Very weak guarantees.

How we think about ordering
- Always start by visualizing a step by step total ordering
- As if every program only used g.SafeSend
- Even if multicasts X and Y are sent concurrently, either X will be delivered before Y everywhere, or Y before X
- Every member sees the identical ordering
- But SafeSend is slow. So... let’s try to use a weaker form of ordering

Using g.OrderedSend+g.Flush
- This is just like SafeSend but it employs optimistic early delivery
- Then the g.Flush pauses until the optimism is finished
- A relatively simple and safe change for most cases
- Speeds things up a lot anyhow!

Using g.Send+g.Flush
- This is a similar idea but gives even faster code
- g.Send only promises total ordering for messages sent by the same sender
- Just like TCP: a “FIFO” property
- If member P sends X, then Y, all receive X before Y
- But if X and Y are sent concurrently, order may vary!

When is g.Send “good enough”?
- We can use g.Send if all the updates to a given variable are sent by just a single member in the view
- For example, suppose our group maintains object O and only member P ever updates O
- Then all the updates will be sent in some order by P ...
- ... and applied in the order they were sent
- ... and so O will remain consistent!

What if our group has many objects?
- As long as each object is updated by just one member, the “owner” for that object
  - We have many concurrent updates to distinct objects
  - But each object receives updates in the exact order the owner sent them
  - And if a failure occurs, the multicast will still reach every member
- So the objects remain in consistent states! g.Send is all we need

g.CausalSend and g.RawSend
- g.CausalSend has the property that if process P receives message X (from anyone) and then sends message Y, message X will be delivered before Y
  - This is a very interesting concept and can be useful, but because we have limited time, we will not look closely
  - Book discusses it in more detail
- g.RawSend doesn’t guarantee reliability
  - If X is sent before Y, by process P, then X is delivered before Y or X is lost and not delivered at all
  - RawSend is like the Internet IP multicast protocol

Point to point protocols
- We have only discussed multicast protocols
- In Isis2 you can also send point-to-point messages
  - g.P2PSend(dest, REQUEST, args...);
  - g.P2PQuery(dest, REQUEST, args... EOL, results...);
- The “dest” argument would be an Isis.Address for some member. Access as v.member[i] in View v.
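
Returning to the single-owner argument for g.Send, the toy Python simulation below shows why FIFO order per sender is enough. It is not Isis2 code, and all names are hypothetical. Two replicas see the senders' streams interleaved differently, yet each sender's own order is preserved, so both replicas end in the same state. The last lines show what goes wrong when two members update the same object.

```python
def apply_updates(deliveries):
    """Apply (object_name, new_value) updates in delivery order."""
    state = {}
    for obj, value in deliveries:
        state[obj] = value
    return state


# P owns object "O1", Q owns object "O2"; each sends its updates in FIFO order.
from_p = [("O1", 1), ("O1", 2), ("O1", 3)]
from_q = [("O2", 10), ("O2", 20)]

# Two replicas see different interleavings, but each sender's order is preserved.
replica_1 = [from_p[0], from_q[0], from_p[1], from_p[2], from_q[1]]
replica_2 = [from_q[0], from_q[1], from_p[0], from_p[1], from_p[2]]

state_1 = apply_updates(replica_1)
state_2 = apply_updates(replica_2)
print(state_1 == state_2)

# Counter-example: if P and Q both wrote the same object, the interleaving would matter.
shared_1 = apply_updates([("O", "from P"), ("O", "from Q")])
shared_2 = apply_updates([("O", "from Q"), ("O", "from P")])
print(shared_1 == shared_2)
```

When an object can have more than one writer, a totally ordered primitive such as g.OrderedSend or g.SafeSend is the safer choice.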
g.RawP2PSend
- With RawSend, you can send an unreliable multicast
- With RawP2PSend, you can send an unreliable datagram
- Useful for sending “unimportant” information or data that needs to be very timely
- If lost, no effort is made to recover it

Subset multicast
- In Isis2 we usually multicast to the entire group all at once, but there are cases in which data is sharded
- This means we have a large group but it holds data at subsets of its members
- For example, a group with 100 members could hold 50 shards of size 2
- Typically arises for data that has some form of “key”
- (key,value) pairs are common in distributed systems

Multicasting to a list of members
- The basic idea is simple
- g.Send and g.OrderedSend permit you to provide a list of destinations, as do g.Query and g.OrderedQuery
- But ONLY these primitives support lists
- You call g.Send(List<Address> dests, REQUEST, ...);

How can the members learn dests?
- It often is important to know who else received the same multicast (e.g. when doing a task in parallel)
- It would be awkward to do this: g.Send(dests, .... REQUEST, dests, args...)
- so g.Send(REQUEST, dests, args...) is allowed
- Now dests is passed to your handler as an argument!

The question of Durability
SafeSend with and without DiskLogger

Isis2 has two notions of durability
- Earlier we saw that a group can be checkpointed
- You need to enable the logging feature: g.Persistent(gname)
- gname will be the name of a file used to log the group state for this particular member
- The system will periodically put a checkpoint into the file. You can control the frequency.
- On restart, a checkpointed group will reload the saved state.

Durability with SafeSend
- SafeSend (Paxos protocol) has a second concept of durability
- This protocol “remembers” the multicasts it delivers
- In this way we can safely update an external database or file such that after a crash, every update is applied exactly once
- But using this feature requires understanding durability concepts

Durability... inside... and external
- A SafeSend is durable if the message will not be lost until it is safe to garbage collect it
- An update to an external database is durable if every replica of the database has the update in it
- Our challenge: use SafeSend to do externally durable database updates

A replicated database with SafeSend
[Figure: execution timeline for an individual first-tier replica (A, B, C, D) of a soft-state first-tier service, handling the request “Update the monitoring and alarms criteria for Mrs. Marsh as follows…”. Labels: SafeSend, SafeSend, SafeSend, local response delay, Confirmed, and a db at each replica. The response delay seen by the end-user would also include Internet latencies.]
- We will see that SafeSend is the safest Isis2 option
- But it is also the most expensive

Failure concern
[Figure: the same execution timeline, with a db at each of A, B, C, D]
- What if copy C has failed when updates are applied?

SafeSend is durable...
- SafeSend will remember updates, but even so, copy C has missed them because it was not running when they were done!
- What should we do when copy C recovers?
- There are two main cases
  - Case 1: In-memory data
  - Case 2: Data stored externally on a disk

Case 1: In-memory data
- For this case we can use g.SafeSend or even g.OrderedSend+g.Flush
- When the copy restarts we’ll do a state transfer to it
- C restarts with no memory of previous state so this is necessary and sufficient
- It works well even for large amounts of state (we may need to use the new Isis2 out-of-band data transfer mechanism if the data is very large)

Case 2: C has external state
- Suppose that when C recovers after a crash, it restarts the local replica of a database or file
- This assumes that the file or database is on the disk
- Thus it was not lost in the failure and can still be used
- But it may be missing many updates
- SafeSend holds those updates. Our task: replay the missing ones. But how can we learn which are missing?

Identifying missing updates
- In fact this is quite hard
- Suppose that C crashes just as update X is being executed by the database or file system
  - Perhaps X was completed and is still in the database
  - Perhaps X was lost and is no longer in the database
  - Worst case: perhaps X is partially applied and the file or database is corrupted. Then some form of cleanup will be needed. Isis2 doesn’t do this for you.

Replaying updates
- You must enable the “DiskLogger” for SafeSend
  g.SetSafeSendThreshold(3);
  g.SetDurabilityMethod(new Group.DiskLogger(myGroup, gname + "-durabilitylog"));
- Two steps: first, we tell SafeSend how many logged copies of each message are needed for safety
  - If skipped, every group member must log all messages
  - More than 2 is very conservative
- Next, we give a file name for logging the updates
  - Isis2 will automatically create K copies, one per group member, appending the member rank to the file name

On restart, SafeSend redelivers!
- Until you call DiskLogger.Done(), Isis2 SafeSend assumes that your database replica might not include the update yet
- So:
  - You apply the update
  - Then tell Isis2 the update is Done()
- Note: for asynchronous updates, obtain a “completion tag” from Isis2 for the action, supply it in Done()
- The updates are delivered in order, but may repeat after a failure

Your job?
- After restart any “perhaps not done” updates are redelivered to your handler
- You must check to see if this update was already applied to the file or database
- It is your task to find a way to check
- Apply, in order, if not yet done, then call Done()
- This way member C in our example can catch up

Failure concern
[Figure: the same execution timeline for replicas A, B, C, D of the soft-state first-tier service, now showing C restarting and applying the missing updates to its db]
- SafeSend+DiskLogger allows C to catch up, but YOU need to detect and ignore duplicated updates

Would it be easier to use Paxos?
- Many people who read about SafeSend think that perhaps using Paxos would be easier
- But in fact SafeSend is a version of Paxos!
- The issue is that Paxos, like SafeSend, maintains its own set of logs, but then we need to apply the operations in the logs to the external database
- Paxos would call the external replica a “learner”
- After recovery from a crash a learner must learn any operations it missed.
- This is the exact problem SafeSend solves using the DiskLogger

Putting Objects into Messages
Telling Isis2 about user-defined classes [Automarshall]

Isis2 moves data within messages
- Isis2 has two data-moving subsystems
- The primary one moves data within messages. When you send a multicast, the message is created from the arguments you supply.
- In 2013 a new version is being added. It moves data out of band using memory-mapped files and is intended for big objects.
- This kind of data is moved as big byte arrays.
- What kinds of data can be put into a message?

Messages
- Internally, a message consists of a table listing the contents, followed by the data in a compacted form, followed by a checksum
- Isis2 already knows about built-in .NET data types (int, double/float, byte, bool, etc), arrays, lists
- The system also understands Isis2 Address objects, Views, Messages
- You can register additional data types of your own

Registering a new data type
- Define a new class of your own
- Two options
  - You can request that the Isis2 “automarshaller” create messages and recreate the objects. Fields must be public, and there must be a null constructor for your class (one that takes no arguments)
  - Or you can do the marshalling on your own, by providing a method ToBArray() that returns a byte[] and a constructor that takes a byte[] argument

A trivial example
- You call Msg.RegisterType(typeof(myData), TID);
  [AutoMarshalled]
  public class myData
  {
      public int d;
      public myData() { }
  }
- TID should be a small integer (1..127) used throughout your collection of programs to identify this kind of object.

Private fields are not marshalled
- Many classes have some data that should be moved from machine to machine, but other fields that do not need to be moved
- Simply declare the ones Isis2 should include as public and the others as internal or private
- Note that Isis2 can only handle nested objects if they are of registered types that it knows about

Useful helper methods
- It can be very helpful to know about:
  byte[] ba = Msg.ToBArray(object, object, .....);
  object[] objs = Msg.BArrayToObjects(ba);
  object[] objs = Msg.BArrayToObjects(ba, Type[] types);
  Msg.InvokeFromBArray(buffer, delegate(args...) { .... } )
- There are a number of additional such methods

Isis2 messages...
- ... are strongly typed
- ... are encrypted when transmitted if you use g.SetSecure() to put the group into secure mode
- ... automatically handle “endian” issues
- ... can be saved into files and read out of later, even on a machine of some other type
- ... are language independent
- ... have a fairly “dense” (efficient) data representation

What about large objects?
- For very large objects (many megabytes or more) Isis2 messages may not be efficient
- Transmission can be slow and encoding involves copying and perhaps fragmentation
- For such cases we are implementing a new “out of band” memory-mapped file transfer utility
  - Expected by end of summer 2013
  - Big objects can be referenced as “attachments” like in email, and moved using an ultra-efficient direct method

Summary
- We have learned about the FLP and CAP theorems
- Isis2 and other systems must live with these, but they do not prevent us from guaranteeing correctness, consistency, fault-tolerance and high performance
- We looked at various message safety and ordering properties
- Tradeoffs between SafeSend and optimistic early delivery with Send+Flush
- Durability for SafeSend, replicating an external database
- Finally we looked at some special cases and at the Isis2 message subsystem
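
As a closing illustration of the durability discussion, one possible way to detect and ignore redelivered updates is to store the sequence number of the last applied update alongside the data. The Python sketch below is not Isis2 code: the replica class, the sequence numbers, and the done callback are hypothetical stand-ins, and it assumes the update and the recorded sequence number are written atomically in the real store.

```python
class ReplicaStore:
    """Hypothetical stand-in for an external database replica."""

    def __init__(self):
        self.data = {}
        self.last_applied = 0   # sequence number of the last applied update

    def handle_redelivery(self, seq, key, delta, done):
        # Updates arrive in order but may repeat after a restart.
        if seq > self.last_applied:
            # Assumption: these two writes happen atomically in the real store.
            self.data[key] = self.data.get(key, 0) + delta
            self.last_applied = seq
        # Either way the update is now reflected in the store, so report it.
        done(seq)


acknowledged = []
store = ReplicaStore()
# Update 2 is redelivered after a simulated restart; it must not be applied twice.
for seq, key, delta in [(1, "x", 5), (2, "x", 7), (2, "x", 7), (3, "y", 1)]:
    store.handle_redelivery(seq, key, delta, acknowledged.append)

print(store.data, store.last_applied, acknowledged)
```

Because the updates here are increments, applying update 2 twice would leave x at 19 instead of 12, which is why the duplicate check matters.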