>> Amar Phanishayee: All right. So let’s start. It’s my pleasure to introduce Iulian Moraru. Iulian is a fifth-year grad student at CMU, where he works with the recently tenured David Andersen. Iulian’s interests are in distributed systems, and early in his grad career he also looked into data structures for solving problems in the realm of machine learning and applied them to a really scalable [indiscernible] at CMU. His PhD thesis is focused on improving this glorious protocol that you all know, Paxos, and Iulian is going to talk to us about that today. So please help me welcome Iulian.
[clapping]
>> Iulian Moraru: Thank you. Thank you, Amar. Thank you for having me here. So this talk is going to be about revisiting Paxos, and I am going to try to convince you that it’s indeed as exciting as that sounds, just not in that way. The general topic of my thesis work is fault tolerance in distributed systems. And fault tolerance, here as in most other fields, is all about redundancy. The way we achieve redundancy in distributed systems is through state replication. That means replicating the state of a process onto multiple machines, so that if some of these machines fail the remaining ones will be able to continue handling client queries and commands just as the failed ones would have. And to keep the state of these processes in sync we make them behave like state machines that change their internal state only as a result of executing the commands that are proposed by the clients of the system. If we make sure they execute the same commands in the same order, then they will transition through the exact same sequence of states. This approach is called state machine replication, or SMR for short, and it is important.
For example, in local-area clusters state machine replication is used for operations as diverse as replicating data, resource discovery and distributed synchronization, and because we build larger and larger clusters there is increasing pressure on these implementations of state machine replication to have both high throughput and high availability. And these are some real-world systems that implement state machine replication for these operations. We also use state machine replication in the wide area, because we have clients on different continents accessing the same databases. So we want to bring that data closer to the clients, and we also want to be able to tolerate full datacenter outages. And because in this setting distances are so large, any inefficiency, in unnecessary message delays or round trips to commit, is going to have a high impact on latency. So it is critical in this setting for our implementations of state machine replication to have low latency. Spanner and Megastore are well-known systems that implement state machine replication for geo-replication.
Now, there are multiple ways to implement state machine replication, and they range from simple primary-backup protocols to more complex protocols such as Paxos and Byzantine fault tolerance. All the systems that I presented on the previous slides implement Paxos, or protocols that are very similar to Paxos, and the reason is that Paxos has a nice balance between safety, which is to say what kind of failures it can tolerate, and performance. And what I mean by performance is that Paxos and Paxos-like protocols are fast: they are at least as fast as simple primary-backup, and they do not depend on external failure detectors, which makes them have very high availability.
And here is an example of what I mean by that. Let’s say we have a primary-backup system: if there is a network partition and we let the clients continue to talk to both the primary and the backup, their states will diverge, which is exactly the opposite of what we want to achieve. So what we do in this setting is we have an external entity, this external failure detector, which decides, for example, that the primary has officially failed and that the backup is the new authoritative copy of the data. And then the clients will learn from this external failure detector that they have to talk only to the backup. But of course it takes time between the network partition setting in and the clients switching to talking to the backup, and that’s a window of time where the system is essentially unavailable. In Paxos we don’t have that problem, because we use more resources. So here we have three replicas instead of two to tolerate one failure, but any of these replicas appearing to have failed, or any network partition setting in, will not cause the majority to stop working. And in fact we don’t have to declare failure synchronously: if a replica appears to have failed we just continue with the majority. For this reason Paxos has high availability, essentially constant availability, or in other words instant failover.
And this brings me to the overarching goal of my thesis work. In my thesis work I want to improve state machine replication, specifically Paxos-style state machine replication. By that I mean systems that tolerate benign failures, so non-Byzantine failures, and by Paxos-style I mean systems that use [indiscernible] consensus. I want to improve it in multiple practically important dimensions, but at the same time in a way that is well anchored in theory, and here is what I mean by that: a Paxos system has essentially two components. There is this nice, elegant core, which is the general algorithm, the Paxos algorithm, and then there are a bunch of implementation considerations, right: how do we choose to implement a certain feature, how do we optimize a certain performance characteristic that’s important in our system? And I want to improve state machine replication in a way that also expands this nice algorithmic core to include more of the practical implementation considerations, because I want the result to be applicable to a wide range of applications.
So in this talk I will present the two main components of my work so far. They are Egalitarian Paxos, which is a new [indiscernible] protocol based on Paxos; it has lower latency, higher throughput and higher performance stability than previous state machine replication protocols. And I will also talk about quorum read leases, which address an orthogonal performance characteristic of state machine replication, and that is: how do we read really quickly from replicated state machines? So I start with Egalitarian Paxos. Before I get to Egalitarian Paxos it’s useful to go through a quick overview of Paxos. Paxos is, at its core, an agreement protocol used by a set of distributed processes to agree on one thing. It tolerates F failures with a total of 2F plus 1 replicas, and that’s optimal because, remember, we don’t depend on external failure detectors. And Paxos tolerates only benign failures, non-Byzantine failures: machines can fail to reply for an indefinite amount of time, but they will not reply in ways that do not conform to the protocol.
Also, communication is asynchronous. So for Paxos to be safe we don’t make any synchrony assumptions about communication; however, for it to be live there has to exist a window of time where there is synchrony. So how do we use Paxos, this protocol which at its core is a consensus protocol that lets us agree on one thing, to agree on a sequence of things? That’s what we want in state machine replication. And I am going to show you this with an example. Let’s say we have three replicas and this is their replicated state; everyone has a copy of this. Initially all these command slots are empty, right; they are ordered command slots. Clients talk to the replicas to propose commands, and the replicas will contend for these pre-ordered slots, and they will do so by running Paxos. As a result of running Paxos only one of these replicas will win that slot; everyone will know which replica that is, and everyone will know what command went into that slot. The replicas that lost that slot will contend for a different slot, and so on. They will [indiscernible] in Paxos. Now, when a continuous sequence of slots has been filled, every replica can independently execute the same commands in the same order, thus ensuring that their states will remain in sync.
>>: So are you implying executing the Paxos [indiscernible] on every command, or are you not really doing that?
>> Iulian Moraru: I am going to get to that in the next slide. So this is like the canonical execution of Paxos. The take-away here is that in canonical Paxos we use separate instances of Paxos to decide every slot, and that it takes two round trips to commit a command, because the first round trip is necessary to take ownership of a slot; only after a replica has taken ownership of a slot can it propose the command. But of course, as you remarked, this is inefficient because it takes two round trips. So practical implementations of Paxos are usually referred to as Multi-Paxos. In Multi-Paxos one of the replicas is the pre-established owner of all the slots. So for example in this case the green replica is the pre-established owner of all the slots. Then clients talk to only this one replica, which decides which commands go into which slots. And it will be able to do so after just one round trip, because again, the first round trip was to take ownership of the slot, and that is not necessary here anymore; it is already the owner. But unfortunately this single stable-leader replica, as I am going to refer to it in this talk, can be a bottleneck for performance and availability, right. It has to handle more messages for each command than all the other, non-leader replicas, and if it fails there is a window of time where there is no leader, before another replica is elected as leader. So the question that motivated this research is: can we have it all? Can we have the high throughput and low latency given to us by Multi-Paxos, but at the same time preserve the constant availability of canonical Paxos? We also want to be able to distribute load evenly across all our replicas, so that we use our resources efficiently and get high throughput. We also want to use the fastest replicas: perhaps in our system there are concurrent jobs running on some of these replicas, and we might want to avoid those replicas that experience high load when we commit commands.
>>: So let me ask what you mean by constant availability in standard Paxos. In particular it seems to me that, if I remember the way you described it, clients just send their commands to some replica and then the replica tries to contend for the next slot, right.
>> Iulian Moraru: Correct.
>>: If the client chooses a replica that has failed then obviously that replica is not going to do anything.
>> Iulian Moraru: That’s right, so for that client it might appear that the whole thing is unavailable.
>>: Right.
>> Iulian Moraru: But for other clients --.
>>: It’s the retries that make it different.
>> Iulian Moraru: Exactly.
>>: So I guess the question is: is that really better than using Multi-Paxos?
>> Iulian Moraru: Well, it matters in that other clients, the ones that have not been unlucky enough to choose the replica that has failed, will choose replicas that have not failed, and for them the system will be available.
>>: Yeah, but doesn’t it come out in a wash? Because what will happen in the canonical Paxos scheme when you take a replica failure is that, until you have a timeout, because the client can change replicas, let’s say with 3 replicas, a third of the clients will see unavailability.
>> Iulian Moraru: That’s right.
>>: And in Multi-Paxos if you take a replica failure, two-thirds of the time you are not going to lose the leader, so there will be no effect at all, and one-third of the time everybody will slow down.
>> Iulian Moraru: That’s right.
>>: So when you multiply this out they come out the same.
>> Iulian Moraru: That’s right, but I am not talking about what happens if you take the whole thing over a day. I am talking about what happens in that moment, and in that moment one-third of the clients of canonical Paxos will experience higher latency for their commands, whereas the other two-thirds will not experience that. For Multi-Paxos, when the leader fails every client will experience higher latency for all their commands.
>>: This is true, but it only happens a third as much.
>> Iulian Moraru: That’s right.
>>: Because it’s going to have a third as many leaders as it has total replicas. So you wind up with --. I mean it’s not clear that it’s that much better, right. [Inaudible]
>> Iulian Moraru: I think this comes down to what I define as availability. By unavailability I mean that all the clients cannot do anything for a particular period of time.
>>: Right, if you define availability as some client being able to do something, then canonical Paxos is better than Multi-Paxos.
>> Iulian Moraru: That’s right.
>>: Exactly.
>>: All right.
>> Iulian Moraru: And you may argue that that’s not enough, but that’s what I mean here.
>>: I mean it’s not clear that it’s actually any better in practice, but I understand what you are saying.
>> Iulian Moraru: Okay.
>>: Good.
>> Iulian Moraru: So the last property that we want is to be able to use the closest replicas, because in wide-area replication we want to commit commands after talking to our closest neighbors, instead of having to go to a neighbor that is perhaps furthest away. So, as I said, canonical Paxos has these properties, but because of the two round trips it’s not exactly efficient, right: it doesn’t have high throughput and low latency. Multi-Paxos solves that, but it loses these other properties in the process. Now, by contrast, EPaxos has all the properties that I have mentioned and it implements them efficiently, so much so that it has higher performance than Multi-Paxos. In Egalitarian Paxos it’s all about ordering.
So we have seen that previous strategies included contending for slots, as in canonical Paxos. One replica decides: that’s the case for Multi-Paxos, but also for other versions of Paxos like Fast Paxos and Generalized Paxos. And finally a newer protocol based on Paxos, called Mencius, has this property that replicas take turns committing commands. So it is pre-established that the first replica is the command leader for every third slot starting at one, the second replica is the command leader for every third slot starting at two, and so on. And Mencius is effective in balancing load, because at any given time there will be many commands being proposed concurrently, and all the replicas will be leaders for some commands and acceptors for other commands. Unfortunately, in Mencius the whole system runs at the speed of the slowest replica, because we cannot decide a slot, we cannot commit it, until we have learned what happened in all the previous slots, and the previous slots belong to all the replicas; a third of the previous slots belong to each replica. Furthermore, any replica being unavailable causes the whole system to become unavailable.
>>: Which makes you wonder what the point is.
>> Iulian Moraru: Better throughput; it does have better throughput.
>>: Okay, as opposed to just doing a single machine.
>>: Anyway, okay, right, that’s not your --.
>> Iulian Moraru: It is effective in balancing load.
>>: Right.
>> Iulian Moraru: So, in EPaxos we take a different approach. Instead of having a linear sequence of pre-ordered slots, we split that space into as many subspaces as there are replicas, and we give each replica pre-established ownership of one of these rows. So this is the replicated state; everyone has a copy of this two-dimensional array. Clients will propose commands to any replica of their choosing, and that replica will get that command committed in one of its own slots, a slot that belongs to its own row; for the green replica that’s the first row, and so on. And this is good because there is no longer contention for slots, but it’s no longer clear how we order these commands: does slot 3.1 come before or after slot 2.2? So the way we do it is that in the process of deciding which command is to be committed in a slot, we also decide ordering constraints for that command. This means that B has an ordering constraint on A, so B should come after A; it should be executed after A. And we do this for every command, and when these commands have been committed every replica sees the same command in the same slot and with the same ordering constraints, so they can independently analyze these constraints and come up with the same ordering. So the take-away here is that we achieve load balancing, because every replica is a command leader for some commands, and we have the flexibility to choose any replicas to be part of a quorum to commit a command. We no longer have to involve one particular replica in every decision, like was the case for the stable leader in Multi-Paxos. Question?
>>: Is there an assumption that a client submits only one command at a time? I mean, if I am a client and I have a set of commands that are dependent upon one another and I send those to different replicas that aren’t communicating, I could have those operations commit out of order relative to what the client expected [indiscernible].
>> Iulian Moraru: Sorry to interrupt, but what is your question? Are clients sending commands one at a time? And the answer is yes.
>>: So they wait, they wait until [indiscernible], so you can’t pipeline commands.
>> Iulian Moraru: If it is important for them that some commands are executed in a particular order, then they have to wait for the first command to be acknowledged by the system before they propose the next command.
>>: Is that true for all versions? Because in Multi-Paxos it seems like that wouldn’t exist, because since I serialize through a leader I can sequence them and I can be sure that one of them will be executed.
>>: You can’t really. So if it’s Smarter, it doesn’t work that way.
>> Iulian Moraru: So even in Multi-Paxos you don’t get that guarantee.
>>: And the reason is you might change leaders. [inaudible]
>> Iulian Moraru: That’s right.
>>: So Smarter lets clients pipeline, but the ordering guarantee that you get is that anything that you send down is ordered after things that you have already seen completions for. So you can send out as many as you want, but they will execute in whatever order.
>>: Okay.
>>: It’s just like doing asynchronous disk I/O system calls.
>>: Well, unless you have [indiscernible].
>>: Unless you have [inaudible].
>>: Okay, I understand.
>> Iulian Moraru: Okay, I was saying that --.
>>: Can you do that, by the way? Can you do the same thing where a client can kind of pipeline if he doesn’t care about the order of commands?
>> Iulian Moraru: Sure.
>>: And they still wind up serialized.
>> Iulian Moraru: They wind up serialized, but not necessarily linearized.
>>: Right, yeah.
>>: So in Paxos you have to couple serialization guarantees with commit guarantees, similar to ordering.
>> Iulian Moraru: So, as I was saying, we have the flexibility in EPaxos to choose any replicas to be part of our quorums. There is no longer a special replica that has to be part of every decision, right, like a stable leader, and we don’t have to get some information from every other replica in the system, like was the case for Mencius. And this has important implications for performance stability, because we will be able to avoid replicas that are slow or unresponsive, and for wide-area commit latency, because we will be able to just talk to our closest neighbors. Question, yes?
>>: [inaudible].
>> Iulian Moraru: Excuse me?
>>: Well, the important question here is: how are the dependencies determined?
>> Iulian Moraru: Right, I will get to that in the next slide.
>>: All right.
>> Iulian Moraru: Is there a question?
>>: And maybe this is something you can answer at the end of the talk, because you are saying you want to combine low latency with [inaudible]. I am just wondering if at the end you could take a few minutes to talk about what scenarios [inaudible] where high-throughput Paxos systems [inaudible].
>> Iulian Moraru: I think the simple answer is that in a local cluster probably throughput is more important than latency, whereas in the wide area latency becomes very important.
>>: Okay.
>> Iulian Moraru: Just because any sort of mistake or inefficiency that we make is going to cost us a lot, tens of milliseconds.
>>: When we did Smarter we had it backed by a disk. High throughput was really important, as SQL Server [inaudible].
>> Iulian Moraru: Okay, so to go into more detail about how we set these ordering dependencies, I will go through an example. This is a time sequence diagram and time flows from left to right. We have five replicas, and let’s assume that there is a command A proposed at replica 1.
Replica 1 at this point has no other commands, so it says, “PreAccept A”. That’s what we call these first messages: PreAccepts. So, take A with the dependency that it depends on nothing; essentially there is no ordering dependency at this point. It sends this PreAccept to a majority of replicas, that is, itself and two other replicas. Replicas R2 and R3 agree, because they haven’t seen other commands, and because these two acceptors agreed with each other, R1 can commit locally and notify everyone else asynchronously. Now let’s assume that at about the same time there is another command, B, proposed at replica 5. R5 has not seen anything for A, so it says B depends on nothing. R4 agrees, but of course R3 has seen a PreAccept for A, so it says B has to depend on A. Because the two acceptors R3 and R4 disagreed with each other, R5 has to take the union of these constraints and run a second round of communication saying that these are the final ordering constraints. And the replicas that receive these Accept messages only have to acknowledge them; they don’t have to update the ordering constraints anymore, even if there are some other commands being proposed concurrently. So they only have to acknowledge. When they have acknowledged, R5 can commit locally and notify everyone else asynchronously. And one last example: let’s say C is proposed to replica 1. R1 has not seen any message for B, so it says C depends only on A. But R2 and R3 have both seen commits for B before getting the PreAccept for C, so they say, “C has to depend on both A and B”. Because the two acceptors agree with each other, R1 can commit locally and notify every other replica asynchronously, including the client.
>>: Okay.
>> Iulian Moraru: So a simple analysis of this protocol shows you that when commands are not concurrent it will take one round trip to commit, but when commands are concurrent some of them might have to undergo two rounds of communication before commit. And that seems like a bad thing, right; that’s what we wanted to avoid with [inaudible]. [Coughing]
>>: Just to be clear, this does resolve [inaudible] order, right? You would never wind up with any --. For any pair of reads, or sorry, of operations, I can always tell what order they are in. There always was [inaudible].
>> Iulian Moraru: That’s right, that’s right. Now to our rescue comes this observation, made by Generalized Paxos before us and by [indiscernible] broadcast protocols before Generalized Paxos, which is that we don’t actually have to order every command with respect to every other command. We just have to order those commands that interfere with each other. An intuitive way to think about this is that we only have to order those commands that refer to the same state. If they don’t refer to the same state, for example two puts to different keys in a replicated key-value store, it doesn’t matter in which order we execute them on different replicas: as long as we execute them both we will get the same state. So what this means for our system is that we will be able to commit commands after just one round trip if they are either non-concurrent, or concurrent but non-interfering, and we will take two round trips only for those commands that are both concurrent and interfering. And it turns out that in practice it is rarely the case that concurrent commands interfere with each other.
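[Editor's note: to make the commit path just walked through more concrete, here is a minimal, single-process sketch in Go. It is not the actual EPaxos implementation: the "replicas" are plain in-memory objects, interference is approximated as "same key", sequence numbers, ballots and failure handling are omitted, and all names (Replica, PreAccept, Commit) are illustrative. It only shows how a command leader unions the dependency sets returned by PreAccept replies and decides between the one-round fast path and the two-round slow path.]

```go
package main

import "fmt"

// Command is the client operation; interference here is simply "same Key".
type Command struct {
	Key string
	Op  string
}

// Replica remembers, per key, the IDs of interfering commands it has seen.
type Replica struct {
	seenByKey map[string][]string
}

func NewReplica() *Replica { return &Replica{seenByKey: map[string][]string{}} }

// PreAccept records the command and replies with the union of the proposed
// dependencies and every interfering command this replica has already seen.
func (r *Replica) PreAccept(id string, cmd Command, deps []string) []string {
	set := map[string]bool{}
	for _, d := range deps {
		set[d] = true
	}
	for _, d := range r.seenByKey[cmd.Key] {
		set[d] = true
	}
	r.seenByKey[cmd.Key] = append(r.seenByKey[cmd.Key], id)
	out := []string{}
	for d := range set {
		out = append(out, d)
	}
	return out
}

func sameSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	in := map[string]bool{}
	for _, x := range a {
		in[x] = true
	}
	for _, x := range b {
		if !in[x] {
			return false
		}
	}
	return true
}

// Commit is the command-leader logic: PreAccept locally and at a fast quorum;
// if every reply matches the leader's dependencies, commit after one round
// (fast path); otherwise take the union and run one more round (slow path,
// where acceptors only acknowledge; that round is omitted here).
func Commit(id string, cmd Command, leader *Replica, fastQuorum []*Replica) (deps []string, fastPath bool) {
	deps = leader.PreAccept(id, cmd, nil)
	fastPath = true
	union := map[string]bool{}
	for _, d := range deps {
		union[d] = true
	}
	for _, rep := range fastQuorum {
		reply := rep.PreAccept(id, cmd, deps)
		if !sameSet(reply, deps) {
			fastPath = false
		}
		for _, d := range reply {
			union[d] = true
		}
	}
	if !fastPath {
		deps = []string{}
		for d := range union {
			deps = append(deps, d)
		}
	}
	return deps, fastPath
}

func main() {
	r1, r2, r3 := NewReplica(), NewReplica(), NewReplica()
	// A is proposed at R1, which PreAccepts at R2; no interference yet.
	fmt.Println(Commit("A", Command{Key: "x", Op: "put"}, r1, []*Replica{r2})) // [] true
	// B is proposed at R3, which PreAccepts at R2; R2 has already seen A, so
	// B is forced onto the slow path and picks up A as a dependency.
	fmt.Println(Commit("B", Command{Key: "x", Op: "put"}, r3, []*Replica{r2})) // [A] false
}
```

Running the sketch mirrors the example above: the non-concurrent (non-interfering) command commits on the fast path with no dependencies, while the later interfering command takes the extra round and records the first command as a dependency.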
The next logical question is: how do we determine whether commands interfere with each other before executing them? Because, remember, we first have to order these commands, and in the ordering process we have to determine whether they interfere or not, and only then can we execute them. Now the answer is application specific. For NoSQL systems that are variations of a key-value store it is very simple, because we just look at the operation key: if two operations have the same key then they interfere, otherwise they don’t. Another approach that we can take is the one taken by Google App Engine, which requires developers to explicitly specify which sets of transactions interfere with each other. And finally, it is possible to infer interference automatically even for more complex databases, for example relational databases, because it turns out that most OLTP workloads have these simple transactions that are executed most frequently. And they are so simple that we can analyze them before executing them and tell with certainty which tables, and which rows in which tables, they will touch. So for these simple transactions we will be able to determine whether they interfere or not, and for the remaining transactions, which are more complex and perhaps cannot be analyzed beforehand, we can safely say that they interfere with everything else.
Okay, so we have gotten these dependencies; how do we actually order the commands? This again is an example with five commands, A, B, C, D, E, and the edges here are the ordering dependencies. So A has a dependency on B, B also has a dependency on A, and so on. The first thing that we do is find the strongly connected components. Then, because the graph in which the SCCs, the strongly connected components, are super-nodes is a DAG, a directed acyclic graph, we can sort it topologically and execute every strongly connected component in inverse topological order. When there are multiple commands in a strongly connected component we use another ordering constraint that I haven’t described. We call it an approximate sequence number, but a simple way to think about it is as a Lamport clock, and that is essentially what it is. And we execute in increasing order of this approximate sequence number. So what we get is --.
>>: Can I ask you a question. You just moved a little too fast for me. So what’s a dependency? Is that interference? You started using the word dependency in the previous graph.
>> Iulian Moraru: So a dependency is an ordering dependency.
>>: It’s an ordering dependency.
>> Iulian Moraru: It means that a command has a dependency on another command if, when committing this command, the second one, we saw something about the first command; so we add a dependency. Now we only add dependencies if those two commands interfere. Commands interfere if they refer to the same state. If they interfere, then we add some ordering dependencies between them.
>>: More particularly if they have [inaudible], right.
>> Iulian Moraru: Correct, if they are both reads it’s not necessary, right. So the properties --. Yes?
>>: So could it be the case that analyzing whether commands interfere or not takes longer than [inaudible]?
>> Iulian Moraru: For some applications it may take longer, right. My argument is that for many applications it is actually very easy, because we have things like operation keys.
>>: Yeah, so --.
>> Iulian Moraru: And even for relational databases.
If you take the TPC-C benchmark, which is very popular in the database community, it has one transaction, I believe it’s called the New Order transaction, which is very easy to analyze. And we can tell with certainty whether two of these New Order transactions interfere or not.
>>: Right, but that depends on what, at that point, you are considering to be the same thing, right. You mean the semantics are the same or the state stays the same?
>> Iulian Moraru: I wanted to say that they commute.
>>: They commute.
>> Iulian Moraru: That’s the simplest way.
>>: So the system won’t resolve in [inaudible]?
>> Iulian Moraru: But that’s the case for Multi-Paxos too, because every replica has to execute all the commands, not at exactly the same time.
>>: Um.
>> Iulian Moraru: As long as they execute all the commands that have been committed so far, they will have the same state.
>>: If they execute all the commands in the same order they will have the same state, right. You can imagine, let’s say, a page applier that just grabs the next page and [inaudible].
>> Iulian Moraru: Sure, yeah.
>>: And that only matters when [inaudible].
>> Iulian Moraru: I am sorry, so you are suggesting that --.
>>: Well, one of the things that we addressed in Smarter is that you can get into a situation where, let’s say, a replica goes down for a while, right; it’s not permanently lost, but it’s off for --.
>> Iulian Moraru: Right, and it doesn’t get [inaudible].
>>: So you have two options: one is you can just save all the commands --.
>> Iulian Moraru: Oh, I see, so you are saying you [indiscernible] the state and you send the state in bulk to the [inaudible].
>>: Or you checkpoint the state and [inaudible]. So you can send the entirety of the state, that will always work, regardless of this, or you can send only the pieces of the state that have changed, and you can’t do that.
>> Iulian Moraru: Absolutely, but you can checkpoint here. You can use a special checkpoint command that interferes with everything.
>>: Right, but the point is --.
>> Iulian Moraru: And after you have executed that you will say --.
>>: Your states will be different, so you will --.
>> Iulian Moraru: In between checkpoints, sure.
>>: In between checkpoints. So to bring back a replica you either have to continue remembering all the commands, which results in unbounded storage, or you have to move all of the state.
>> Iulian Moraru: No, you can just remember all the commands since the previous checkpoint.
>>: Well, that’s what I meant, but obviously it stops checkpointing when it stops running. I mean, it’s not a terrible thing, but.
>> Iulian Moraru: Yeah, since the last checkpoint that it has observed, that’s right.
>>: Right, you just lose [inaudible].
>> Iulian Moraru: Yeah, that’s right.
>>: It’s actually kind of handy sometimes. All right.
>> Iulian Moraru: Yes.
>>: Can I just ask one more question about that previous slide before you move on. I am trying to understand the semantics of the circular dependency there. We are trying to determine the order of transactions.
>> Iulian Moraru: So, am I getting this correctly? You are saying that you don’t know why two commands might have a circular dependency, or are you asking what that implies for ordering?
>>: I guess both, it just --.
>> Iulian Moraru: So two commands may have a circular dependency if, for example, they are concurrent, but it can happen even if they are not concurrent with each other, if they are both concurrent with a third one, for example.
Let’s take the simple case: if two commands are concurrent, one of them is seen first on some replicas and the other one is seen first on some other replicas, simply because replicas don’t necessarily receive the messages for the same commands in the same order. Okay.
>>: I thought that was just a transient state. In other words, if there is disagreement I thought we had to keep iterating until finally everyone agrees on [inaudible].
>> Iulian Moraru: No, that’s not what we do. We don’t continue iterating. We execute at most two rounds, and if after the first round, sorry, not the second, there has been a disagreement, we take the union of those dependencies. And when we take the union, that’s when we get the circular dependencies.
>>: So hang on, now I don’t understand what these are, because I thought I understood. I didn’t think carefully enough.
>> Iulian Moraru: So this means that B got ordered after E on some replica.
>>: After?
>> Iulian Moraru: Yeah, so some replica received some message for B after it had received some message for E.
>>: It depends on E, so B depends on E?
>>: So you are saying in this case circular means some replica saw A before B and others saw the opposite.
>> Iulian Moraru: Exactly.
>>: Okay.
>>: I think the confusion was earlier Bill said, “This always establishes a total order” and you said, “Yes”.
>> Iulian Moraru: It does establish a total order, because after you have executed this algorithm --.
>>: After you have executed this algorithm, but earlier, before you described this slide --.
>>: Right, the first thing you showed does establish it. The two-round algorithm establishes a total order.
>> Iulian Moraru: Only after we have executed this algorithm on the dependencies that the commit protocol establishes.
>>: So after the last algorithm that you showed on the other slides you could still end up with circular dependencies.
>> Iulian Moraru: Exactly.
>>: So does every replica see the same graph?
>> Iulian Moraru: Every replica will see exactly the same graph.
>>: [inaudible].
>> Iulian Moraru: No, in this graph you can assume that they all interfere with each other. The simplest way to imagine this is that you have some commands that refer to object A and some commands that refer to object B, and they will have different graphs. Of course, you can have commands that refer to both A and B, and that’s when the graphs will be connected.
>>: Is it possible that they agree because [inaudible]?
>> Iulian Moraru: Right, so those commands will be in different connected components and would have no bearing on this.
>>: But they will eventually [inaudible] the same order?
>> Iulian Moraru: Okay, exactly.
>>: Okay, I think I get it now.
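[Editor's note: to make the execution step concrete (find the strongly connected components of the committed dependency graph, process them in inverse topological order, and break ties inside a component with the approximate sequence number), here is a minimal sketch in Go. It uses Tarjan's SCC algorithm, which conveniently emits components in the order we want, and illustrative types; it assumes every instance referenced in the graph is already committed, which the real protocol has to wait for.]

```go
package main

import (
	"fmt"
	"sort"
)

// Instance is a committed command: its ID, its approximate sequence number
// (a Lamport-clock-like value) and the instances it depends on.
type Instance struct {
	ID   string
	Seq  int
	Deps []string
}

// executionOrder returns instance IDs in the order they should be executed:
// SCCs of the dependency graph in inverse topological order (dependencies
// first), and commands inside an SCC by increasing Seq.
func executionOrder(instances map[string]*Instance) []string {
	index := 0
	idx := map[string]int{}
	low := map[string]int{}
	onStack := map[string]bool{}
	var stack []string
	var order []string

	var strongconnect func(v string)
	strongconnect = func(v string) {
		idx[v], low[v] = index, index
		index++
		stack = append(stack, v)
		onStack[v] = true
		for _, w := range instances[v].Deps { // assumes all deps are present
			if _, seen := idx[w]; !seen {
				strongconnect(w)
				if low[w] < low[v] {
					low[v] = low[w]
				}
			} else if onStack[w] {
				if idx[w] < low[v] {
					low[v] = idx[w]
				}
			}
		}
		if low[v] == idx[v] { // v is the root of an SCC: pop it off the stack
			var scc []string
			for {
				w := stack[len(stack)-1]
				stack = stack[:len(stack)-1]
				onStack[w] = false
				scc = append(scc, w)
				if w == v {
					break
				}
			}
			// Inside an SCC, order by the approximate sequence number.
			sort.Slice(scc, func(i, j int) bool {
				return instances[scc[i]].Seq < instances[scc[j]].Seq
			})
			order = append(order, scc...)
		}
	}

	for v := range instances {
		if _, seen := idx[v]; !seen {
			strongconnect(v)
		}
	}
	return order
}

func main() {
	// A and B interfere and were committed with mutual (circular) dependencies,
	// forming one SCC; C depends on both.
	g := map[string]*Instance{
		"A": {ID: "A", Seq: 1, Deps: []string{"B"}},
		"B": {ID: "B", Seq: 2, Deps: []string{"A"}},
		"C": {ID: "C", Seq: 3, Deps: []string{"A", "B"}},
	}
	fmt.Println(executionOrder(g)) // [A B C]
}
```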
>> Iulian Moraru: Okay, so the properties that this whole commit protocol and execution protocol guarantee are, first, linearizability, which is a strong property. Linearizability implies serializability, and furthermore implies that if two commands A and B interfere, and A is committed before B is even proposed, then A will be executed before B everywhere. And the second property, which is an important property, is about the size of fast-path quorums for state machine replication that tolerates F concurrent failures: fast-path quorums have to contain F plus the ceiling of F over 2 replicas, including the command leader. What I mean by fast-path quorum is: how many replicas, including the command leader, have to agree on a certain set of dependencies for that command to be committed on the fast path? And that’s this number of replicas. What this means is that for the most common deployments of Paxos, of 3 and 5 replicas respectively, this comes down to the smallest majority, so this is optimal. And it also means that it is better than Fast and Generalized Paxos by exactly one replica. In practical terms this means that if we do geo-replication --. Oh, excuse me, sorry. So if we do geo-replication and we have a replica in Japan, one on the US West Coast, one on the US East Coast and one in Europe, the replica in Japan will be able to commit a command after talking to just its closest neighbor, the US West Coast. It doesn’t have to go all the way to Europe or all the way to the East Coast.
So we have implemented EPaxos, and we have also implemented Multi-Paxos, Mencius and Generalized Paxos, and we compared them on a replicated key-value store. First I am going to show you wide-area commit latency. This is the setup I have shown before: we have a replica in an Amazon EC2 datacenter in Northern Virginia, one in California, one in Oregon, one in Ireland and one in Tokyo. And this gives you a sense of the round-trip times between those locations. Now in this diagram I am going to show you the median commit latency as observed at every location for every protocol. We have clients co-located with every replica in each datacenter, and these clients propose commands and then measure the time it takes to get the reply back. So a client in California has to wait for its local replica to talk to Oregon and Virginia. And that’s also the case for a Multi-Paxos client, but only because the Multi-Paxos leader is located in California. A client that’s in Virginia has to first go to the leader in California, wait for it to commit the command and then get the reply back in Virginia, so it experiences higher latency. Mencius has higher latency because, remember, before committing a command we have to wait for some information from every other replica in the system, including the ones that are furthest away. And Generalized Paxos has higher latency than EPaxos because its fast-path quorums are larger. In fact, at every location EPaxos has the smallest latency of all the protocols, because it has optimal commit latency in the wide area for 3 and 5 replicas. Now of course, this is median latency. When there is command interference, the 99th percentile latency is going to be double this. However, there are ways to mitigate that, ways that we just haven’t implemented yet. And we believe that command interference will be rare enough that it will affect at most the 99th percentile latency.
Now in the local cluster we compare Multi-Paxos, Mencius and EPaxos to see how they fare with respect to throughput. This is a 5-replica scenario; EPaxos is run with different rates of command interference: 0 percent means that no commands interfere, 100 percent means that all commands interfere. And what you can see here is that EPaxos has higher throughput than the other protocols when command interference is low. Furthermore, when replicas are slow (and for Multi-Paxos the leader has to be the slow one for this to have an effect, otherwise throughput remains the same), the performance of EPaxos degrades more gracefully than for the other protocols. The reason is we can simply avoid those replicas that are slower, whereas in the other protocols we cannot.
>>: So a quick question.
>> Iulian Moraru: Yes.
>>: So looking at that top one, you are comparing Multi-Paxos versus the 100 percent EPaxos [inaudible]?
>> Iulian Moraru: Correct, yeah.
>>: So the difference is like 10 or 20 percent?
>> Iulian Moraru: So the difference here is probably smaller than 10 percent.
>>: Yeah, so can you comment on that?
>> Iulian Moraru: So what happens here? The reason why Multi-Paxos is slow here is that all the decisions go through the leader and the leader is bottlenecked.
>>: Right.
>> Iulian Moraru: The reason why EPaxos is slow here is that it has to do two round trips of communication before committing every command.
>>: Right, but it uses all the five servers.
>> Iulian Moraru: But it uses all the five servers.
>>: So why is the difference so small?
>> Iulian Moraru: You are saying it should be larger?
>>: Yeah, right.
>> Iulian Moraru: I don’t have the intuition of why it should be larger. So EPaxos does more work, but it distributes that work across all the replicas. Multi-Paxos does less work, but it concentrates that work on the leader.
>>: So my practical experience with doing this is that the work of being the leader is not very substantial.
>> Iulian Moraru: So the assumption here is that the work that you do on the leader essentially has to do with communication, because executing the command, for example, or sending it to the disk, is not the bottleneck in the system. If that were the bottleneck in the system then these results would look different.
>>: Okay, what’s the bottleneck?
>> Iulian Moraru: Here it’s the CPU, the fact that you have to handle more interrupts on the leader.
>>: So you are [inaudible]?
>> Iulian Moraru: No, so the operations are small enough that this workload is dominated by messaging.
>>: So these are [inaudible], right, all reads?
>> Iulian Moraru: Right, sorry.
>>: Okay, and then if we look at the 0 percent conflict case, it’s only a twofold difference?
>> Iulian Moraru: Okay, almost twice, yeah.
>>: So not five times?
>> Iulian Moraru: No, it’s not five times, because there is still some work that has to be done on every machine. So the amount of work that you do in total is more than the amount of work done in total in Multi-Paxos.
>>: So are you logging these?
>> Iulian Moraru: In this case they weren’t logged. In our paper we have a comparison between them. When we do log synchronously to SSDs the results are much closer to each other, that’s true, because there logging is the dominant factor.
>>: Right, and I guess depending on your application you may not care about persistence.
>> Iulian Moraru: That’s the reason why I showed you this graph: I wanted to compare the protocols themselves, not necessarily the underlying needs of the application, which may differ.
>>: You said the dominant overhead in messaging is the time the CPU takes to service the message?
>> Iulian Moraru: No, no, no, it’s the fact that it has to handle more interrupts to reply to more messages.
>>: Is that because there are more total messages in the protocol?
>> Iulian Moraru: Because there are more total messages on one replica.
>>: It’s because the [inaudible].
>> Iulian Moraru: So a leader --.
>>: [inaudible].
>> Iulian Moraru: So the leader handles order N messages for each command, whereas a non-leader replica handles order 1 messages for each command; that’s where the imbalance comes from.
>>: Okay.
>>: [inaudible].
I have one question: in EPaxos, are the replicas actually measuring and sending all these messages to the other replicas?
>> Iulian Moraru: Yeah.
>>: So how do you determine the slow replica?
>> Iulian Moraru: Oh, we send pings and we measure how long it takes to reply to pings, echo messages essentially.
>>: And this is done once?
>> Iulian Moraru: It’s done periodically, something like every 250 milliseconds or half a second.
>>: So for the second [inaudible] you would slow down one of the replicas?
>> Iulian Moraru: Yes.
>>: How did you slow it down? By reducing throughput or --?
>> Iulian Moraru: No, by having a CPU-intensive program, essentially an infinite loop, running on the same machine.
>>: [inaudible].
>> Iulian Moraru: What is the metric?
>>: What is the protocol --?
>> Iulian Moraru: Oh, it’s an Amazon EC2 datacenter. I think it’s, yeah, I think [inaudible]. But that doesn’t matter, because the bandwidth was not the bottleneck.
>>: No, but you are saying [inaudible].
>> Iulian Moraru: It’s TCP, yeah. Okay, so now the intuition of --. So batching is an approach that we commonly use to increase throughput, and you might have the intuition that Multi-Paxos might do better with batching because it has more of a chance to use larger batches, right. A Multi-Paxos leader receives all the commands; it can just commit all these commands in one batch, in one round of Paxos, whereas in EPaxos we get the same number of commands distributed across multiple replicas, so there are going to be more batches for the same number of commands. This is a latency versus throughput graph; the Y axis is log scale. This is the curve for Multi-Paxos, and obviously it’s better to be lower and to the right, because that means higher throughput and lower latency. EPaxos is significantly better than Multi-Paxos, and the reason is that the communication with the clients is now shared by all the replicas instead of having to fall squarely on the leader. And perhaps counter-intuitively, EPaxos 100 percent is almost as good as EPaxos 0 percent. The reason is that the second round of communication necessary to commit commands when there is interference, when there are conflicts, is spread across a large number of commands in a batch, right. Second-round messages are small; they don’t have to contain the commands themselves anymore, just the revised ordering constraints.
And finally a nice side benefit, and this comes back to our discussion about availability, is that in EPaxos we have constant availability. This will probably clarify what I mean by that. So this is Multi-Paxos, and what I am showing here is throughput over time when there is a leader failure; a non-leader replica failure would not be seen here. When there is a leader failure, the throughput for the whole replicated state machine goes down until the new leader has been elected. These are commands from clients that have not been able to commit in the meantime. In Mencius any replica failing will have this behavior, whereas in EPaxos a replica failing will cause the throughput to go down by a third. In this case we had 3 replicas. So the throughput goes down by a third because those clients that were talking to the replica that has failed have to time out and then talk to some other replica.
>>: And again I think it’s worth observing that in the Multi-Paxos case two-thirds of the time nothing happens.
>> Iulian Moraru: That’s right.
>>: Right.
>> Iulian Moraru: But when something happens, it’s worse than when something happens in EPaxos.
>>: But it happens more often.
>> Iulian Moraru: That’s right. So, instead of conclusions for this part of the talk, I am going to try to disentangle some of the most important insights behind EPaxos. The first one was that we deal with ordering explicitly, instead of dealing with ordering implicitly by running Paxos on pre-ordered command slots. And this gives us high throughput, because we load balance better; it gives us performance stability, because we have the flexibility to avoid slow replicas; and low latency, because we can choose the replicas that are closest to us. Furthermore, it lets us optimize only those delays that matter. Previous Paxos variants that optimize for low latency try to do away with the first message delay, between the client and the first replica; these are Fast Paxos and Generalized Paxos. Now, when we are doing wide-area replication, where latency is really very important, that first message delay is going to be insignificant, because usually the client will be co-located with its closest replica in the same datacenter. So instead of focusing on doing away with that message delay we focus on having smaller quorums, which in turn gives us lower latency, because we can talk to fewer of our closest neighbors. And we have a technical report with a proof of correctness for EPaxos. It also contains a TLA+ specification that can be model checked. And we have released our implementation of EPaxos and all the other protocols as open source. Yes?
>>: Why is it called Egalitarian?
>> Iulian Moraru: Because replicas perform the same functions at the same time, sort of.
>>: [inaudible].
>> Iulian Moraru: Instead of having a leader that dictates. Okay, so to go back: the goal of my thesis work was to improve SMR, and I think that Egalitarian Paxos goes a long way toward doing so, because it has higher throughput than previous state machine replication protocols, it has optimally low wide-area latency for 3 and 5 replicas, it offers better performance robustness and constant availability, and furthermore there is no need for leader election; there is no leader.
Now, can we improve other aspects of state machine replication? A very important one that doesn’t fall into this category is what happens when we read from the replicated state machine. How do we improve read performance? Now, why is that different from command throughput? Let’s look at the different ways in which we can do reads from a replicated state machine. A simple way is to treat reads just like any other command. So if we have a client, that client may try to read from a replica; this, by the way, is Multi-Paxos, but the discussion in this second part of my talk applies to any Paxos-type protocol. So in this particular case the example is Multi-Paxos. Let’s say that a client tries to read from a replica: that replica will forward the read to the leader, the leader will commit it just like it would any other command, and then it will get back to the client with the result. Now, obviously this involves a lot of communication; this is not ideal for the wide area. A second, better way is to have the replica that got the read just talk to any quorum of replicas.
For instance, its closest neighbors, and ask them: what is the latest command you have seen? Then wait for that command to be executed, get the result, and the replica closest to the client replies to the client. This is less communication, fewer rounds of communication, fewer message delays; it’s better, but there is still some communication. So practical systems use leases. The most common variant of leases is leader leases, where the leader has a lease on all the objects in the system. As long as that lease is active the leader can read from its local store. Now, even if the leader appears to have failed, the new leader will not be able to commit updates until the lease of the old leader has expired. In this case the client can either read directly from the leader, or have its closest replica forward the read to the leader and get the result back. Obviously the clients that are co-located with the leader will see very low latency, and other clients will see high latency. And the other previous way in which we had leases was described in Megastore. These were leases where every replica in the system has a lease on every object in the system. This means that a client can just read locally from its closest replica, and that’s great for reads; however, we pay the price when we do writes, because every write has to go synchronously to every other replica in the system. We cannot commit a write until every replica in the system has acknowledged it. So writes will have higher latency, and furthermore, when a replica is unavailable the whole system is unavailable for writes.
>>: Until the lease expires.
>> Iulian Moraru: Until the lease expires, exactly. So if we look at the design space for leases, so far we have these two points. We have the leader lease, which has almost the same write performance as a system that has no leases, and has better read performance than a system without leases, but the read performance is not great, right; we can only read locally at the leader. And we have the Megastore lease, where the read performance is great, we can read locally at every replica, but we pay for that in write performance and availability. In our work we set out to explore the space in between, and I am going to argue that we can do so with something that we call quorum leases. Quorum leases allow us to get the right trade-off for our application between read performance and write performance, and furthermore I am going to argue that they actually get us most of the benefits of both the Megastore lease and the leader lease. And the idea here is pretty simple. Instead of having one replica, the leader, hold the lease, or all the replicas hold the lease, we can let an arbitrary subset of the replicas hold the lease. In particular, we make the observation that the fact that in Paxos we have to communicate synchronously with a quorum before we can commit an update induces a natural leasing strategy: we simply take the smallest majorities, the smallest quorums, and give them leases for objects. And in fact we can have different quorums hold leases for disjoint subsets of objects. The assumption here is that different objects are popular, or hot, at different replicas. Imagine we have, let’s say, a replicated social network: it is likely that the clients in Europe will send a different set of requests to the datacenter in Europe than the clients in the US will send to the datacenters in the US, right. So the assumption is that there is some geographic locality in our workload.
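[Editor's note: here is a minimal, single-process sketch of the quorum-lease idea, with illustrative names (LeaseTable, CanReadLocally, WriteTargets) rather than the actual design from the talk. The intent it shows: an object's lease names the replicas that may answer reads for it from local state while the lease is active, and an update to that object must reach a Paxos quorum plus every active lease holder before it can be applied, so no holder serves stale reads.]

```go
package main

import (
	"fmt"
	"time"
)

type ReplicaID int

// Lease grants a subset of replicas the right to read an object locally,
// bounded in time so that it can expire if a holder becomes unreachable.
type Lease struct {
	Holders map[ReplicaID]bool
	Expiry  time.Time
}

// LeaseTable maps object keys to their current lease (different objects may
// be leased to different quorums).
type LeaseTable map[string]*Lease

// CanReadLocally reports whether replica r may serve a read of key from its
// local state without contacting anyone else.
func (t LeaseTable) CanReadLocally(r ReplicaID, key string, now time.Time) bool {
	l, ok := t[key]
	return ok && l.Holders[r] && now.Before(l.Expiry)
}

// WriteTargets returns the replicas that must acknowledge an update to key
// before it is applied: a Paxos quorum plus every active lease holder.
func (t LeaseTable) WriteTargets(key string, paxosQuorum []ReplicaID, now time.Time) map[ReplicaID]bool {
	targets := map[ReplicaID]bool{}
	for _, r := range paxosQuorum {
		targets[r] = true
	}
	if l, ok := t[key]; ok && now.Before(l.Expiry) {
		for r := range l.Holders {
			targets[r] = true
		}
	}
	return targets
}

func main() {
	now := time.Now()
	table := LeaseTable{
		// Objects hot in Europe leased to the majority {0, 1, 2}; objects hot
		// in the US leased to the majority {2, 3, 4}.
		"eu:alice": {Holders: map[ReplicaID]bool{0: true, 1: true, 2: true}, Expiry: now.Add(2 * time.Second)},
		"us:bob":   {Holders: map[ReplicaID]bool{2: true, 3: true, 4: true}, Expiry: now.Add(2 * time.Second)},
	}
	fmt.Println(table.CanReadLocally(0, "eu:alice", now)) // true: local read
	fmt.Println(table.CanReadLocally(0, "us:bob", now))   // false: needs a quorum read
	fmt.Println(table.WriteTargets("eu:alice", []ReplicaID{0, 1, 2}, now))
}
```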
>>: And that [inaudible] multiple objects. Like imagine that you have a B-tree, right; everybody is going to touch the root.
>> Iulian Moraru: Right, so when this happens too often we may not want to use that. When this happens infrequently we can do two things: we can either pay the price on the write and just forward it to all the quorums that are pertinent to that composite update, or we can make sure that we configure the quorums such that objects that are usually accessed together are part of the same quorum lease.
>>: [inaudible].
>> Iulian Moraru: Right, right.
>>: [inaudible].
>> Iulian Moraru: Right, have a smaller quorum, exactly, right. So I guess what you are saying is that if we have a read-heavy object then maybe we will give the lease for that object to more replicas, and if it’s a write-heavy object we will give the read lease to fewer replicas; right, that’s a possibility.
>>: So does this also mean that once you start doing this, with a read lease, before you are able to commit a write [indiscernible] don’t you need to know the set of objects [inaudible]?
>> Iulian Moraru: That’s right, yeah.
>>: Which isn’t necessarily an obvious thing in general state machine replication.
>> Iulian Moraru: Not in general, but I think in many cases it’s fairly easy.
>>: And certainly [inaudible].
>> Iulian Moraru: That’s right. Okay, so instead of describing our design in detail I am just going to lay out the challenges of implementing quorum read leases. The obvious question is: which quorums should hold the leases for which objects? And it quickly becomes apparent that quorum leases should be dynamic. Before quorum leases we had the single leader lease and the lease where all the replicas hold the lease, which means that it’s very easy to decide who holds the lease, right: it’s either the leader or everyone. But now we have to adapt to access patterns. We have to give the lease to those objects that need it more; sorry, to those replicas that read those particular objects more. So now the challenges become: how do we establish, maintain and update these leases? By that I mean establishing the timing information necessary to be able to expire the leases in a timely way, and also migrating different objects between different quorums. We want to do this cheaply, for millions or many millions of objects, concurrently, with minimal bandwidth overhead. So that’s one challenge. And the other challenge, of course, is that we want to do this while at the same time maintaining the safety of the Paxos protocol underneath the quorum leases and the safety of our leases, right: we want every read to be a strongly consistent read. And I am going to quickly show you some preliminary results. We have used the YCSB benchmark with a skewed workload; this is the skewed-distribution workload from YCSB. And again, this is the same wide-area setup. And we observed that every replica, not just one replica as with the single leader lease, can do 60 to 90 percent of its reads locally, and furthermore 60 to 90 percent of all writes have the minimal latency that we can get in a Multi-Paxos system in this configuration.
Now, we believe that with a bit more engineering we can get this number to over 90 percent of reads for the skewed workload; I think that should be pretty easy to do. In the local area, quorum leases also have the benefit of increasing read throughput, because we can now do local reads at every replica. With every one of our 5 replicas doing local reads, read throughput increases by 4.6X.
>>: Wow, wow, okay, that depends, right. Again, if you look at what we did with reads in Smarter, we read from all of them. You use the leader to guarantee serialization, but you read from all the replicas.
>> Iulian Moraru: But we don’t do any communication for reads.
>>: Right, but in the local-area case the communication is relatively cheap.
>> Iulian Moraru: Not for throughput.
>>: Not for throughput?
>> Iulian Moraru: No, I understand your point, which I think is that we should compare against the read strategy in Smarter, and that’s a valid point. We should do that, yeah.
>>: Right, which doesn’t work very well in the wide area, but that’s not what it was designed to do.
>> Iulian Moraru: Right, no, I understand. Now, this number is about 20 percent lower than Megastore leases, because Megastore leases have the best read performance, but we also don’t have the problems for writes that Megastore leases have, in particular the write availability problem.
So, in summary, I presented my work so far in improving state machine replication. I have shown you that we can get optimally low latency in the wide area, higher throughput and higher performance stability with Egalitarian Paxos, and we can get high read performance without sacrificing writes with quorum leases. And I believe there is extensive room for future work here. I want, in particular, to study faster reconfiguration algorithms that have minimal impact on availability. I also want to study more uses of wall-clock time beyond leases. What I mean by that is: let’s say we had synchronized clocks, what can we do better? We know for sure that there are ways to implement EPaxos with synchronized clocks that are easier than a normal implementation of Egalitarian Paxos. But conversely, if we don’t have synchronized clocks and we want to get things like Spanner’s point-in-time snapshot reads, can we do that with just quorum leases? I think there might be a way to do that. And finally, I would like to be able to put this all together and provide general transactions on top of multiple EPaxos groups that also use quorum leases. And the last slide I am going to show is to point out that I have also worked on other things, including hardware-aware data structures, in particular the feed-forward Bloom filter. That’s a variant of Bloom filters that is well suited to multiple pattern matching, for example malware scanning. And also on re-designing systems around new memory technologies like phase-change memory and memristors. And if your interests are aligned with these, please ask me about it. Thank you.
[clapping]
>> Amar Phanishayee: Are there any more questions, clarifications, or do people want to go back to slides about which you might have said “I will ask later”?
>> Iulian Moraru: Yes.
>>: So this is something I didn’t understand, and that is [inaudible].
>> Iulian Moraru: Right, that’s not a regular clock, that’s just a Lamport clock.
>>: [inaudible].
>> Iulian Moraru: Yeah.
>>: So is that something that only works in this particular case, or is that something where you could use Lamport clocks to do the ordering and --.
>> Iulian Moraru: So let me see if I understand your question. You are saying: what if we used just the Lamport clocks and not the other dependency thing, with the graph and all the edges, right?
>>: Yeah.
>> Iulian Moraru: Yeah, so that’s not possible. We thought of that and it’s not possible. The reason is that the approximate sequence numbers, the Lamport clock values assigned to commands, don’t have to cover the entire space of that Lamport clock. So, for example, if a command gets sequence number 5, it doesn’t necessarily mean that there are commands with sequence numbers 1, 2, 3 and 4, so we might wait indefinitely for something to be committed for 1, 2, 3 and 4. Yes?
>>: Sorry, I don’t have a question; I just have a clarification [inaudible]. If I understand this correctly, basically this trades off the impact of failures by reducing that impact but spreading it out and making it more likely over time.
>> Iulian Moraru: Uh huh.
>>: So why would it be better to actually increase the impact of failure but reduce [inaudible]? Like, I am sort of struggling with it; it seems like the [indiscernible] is a better trade-off.
>>: Well, no, it’s not clear, right. It depends on --. If you measure it by --. If you assume that replicas are equally likely to fail and you measure it sort of by the total amount of extra latency added by failures, then it’s a wash.
>>: I agree; I understand.
>>: It’s better for predictability though. You would rather have a larger number of smaller failures than a smaller number of large-impact failures.
>>: Possibly, but it turns out that from the point of view of the individual requests it’s exactly the same.
>> Iulian Moraru: Let me point out --.
>>: So the question is: is it useful to have the system partially functioning some of the time? And it may or may not be; it depends on your application.
>> Iulian Moraru: Let me point out something else, some other way in which EPaxos might be better in this scenario. If we really do want no client to observe any latency penalty when a replica fails, we can simply propose that command to a quorum of replicas, right. In this case the problem is of course that we would have lower throughput overall, because we propose each command twice in a 3-replica system. We never have that latency penalty when something fails, but our system has to be able to distinguish between the same command proposed twice, right, and only execute the first one.
>>: Yeah, but you have to be really careful with that, right.
>> Iulian Moraru: No, no, no, because you will have to do that for Multi-Paxos as well, because imagine a client that has sent a command to a leader that fails, and it re-sends that command, the same command, to another leader. So any --.
>>: Yeah, but in that case only one of the --. If you do it correctly only one of those is ever accepted, so you don’t ever --.
>> Iulian Moraru: I don’t think so, the leader --.
>>: Trust me, it is in Smarter; it absolutely has that property in Smarter.
>> Iulian Moraru: And you don’t use any per-client IDs for commands?
>>: No, there is a per-client ID.
>> Iulian Moraru: Well, then that’s probably why it works in Smarter.
>>: Yeah.
>> Iulian Moraru: And that’s how you would solve it in this system.
>>: Well, I mean the way that it, so --.
>> Iulian Moraru: And that’s how you would solve it in any Paxos.
>>: Okay, all right, yeah, so you can do the same thing.
>>: You two need to have a whiteboard later to hash this out.
[laughter]
>> Amar Phanishayee: Any other questions?
If not, it’s time to go.
[clapping]