>> Amar Phanishayee: So it's my pleasure to introduce Wyatt Lloyd. Wyatt is a grad student at Princeton. He's actually defending sometime this week, later this week-- >> Wyatt Lloyd: No, next week. >> Amar Phanishayee: Next week. Okay, there we go. And he works with Mike Freedman at Princeton. Wyatt has also closely worked with Dave Andersen and Michael Kaminsky at CMU. And Wyatt's work tackles building systems that are geo-replicated and offer low latency to users of the system, and in doing so, they provide the standard things that are required of large-scale systems today: high availability, high scalability, but also partition tolerance and stronger notions of consistency. And not just that, but he also offers limited transactions. So at this point you probably think I'm making things up, right? So I'm going to give it to Wyatt. Take it away, and blow us away. >> Wyatt Lloyd: All right. Thank you very much. I'll give these guys one second to trickle in. So I like to begin the talk by talking about the types of problems that I like to look at, which are what I consider to be classic distributed systems problems examined through the lens of modern cloud or web services. And so today I'm going to talk about how we re-examined replicated storage for massive-scale websites with the new requirements of low latency, massive scale, and geo-replication. So what do I mean when I say geo-replicated storage? Well, I mean the backend of massive-scale websites, things like Facebook, Reddit, or Amazon, and I mean the storage that holds the data that represents that service. So on Facebook, this would be your profile, who you're friends with, what you like, your status updates, and so on. And why is this geo-replicated storage? Well, if we look at the architecture of these services, we'll see that they are located in geographically different places: East Coast, West Coast, Europe, for instance. So what are some reasons for this? Well, one is fault tolerance. If one of our data centers goes down, we still want our service to be up and available so people can use it; but another one, and the one we are more concerned with in this talk, is so that we can serve client requests quickly. We want to take clients, direct them to a nearby data center, serve their request entirely inside that data center, and then send a response back to them very quickly. And this is because many Internet companies have done many studies, and they've seen that there's this strong correlation between page load time, user engagement, and revenue. So we want things to be really fast. So we'll zoom in on one of these data centers. So what happens when I actually connect to one of these web services? Well, my web browser creates a connection with one of the machines in the front-end web tier. And this is the machine that's building my page for me and interacting with the service on my behalf. And these machines don't have durable state on them, it's not stored there, and they're independent, meaning they can each do the same thing. So if my service becomes more popular and I want to be able to handle more clients, this is pretty simple: I just add more machines to this front-end web tier, and I can direct clients to them and they can handle front-end requests. So how does this machine in the front-end web tier actually build my pages and interact with the service? Well, it does this by reading and writing data from a separate storage tier.
So in contrast, this storage tier is where that data is actually durably stored, and it's a cooperative system. So you have a large number of nodes cooperating together to provide this single storage abstraction. So this storage tier will be the focus of the talk today. And there's two interesting dimensions of the storage tier. So our first interesting dimension is the many-node dimension. So the scale of this state is massive. It's very large. It's far too large to fit on a single machine. So instead, we have to spread it out across a large number of machines. So the typical way to do this, and the way that we do this, is by sharding the data, or putting different subsets of the data on different nodes in the system. So one way we can do this is by last name. So we put Ada Lovelace's data on the server for L and Alan Turing's on the server for T. Now our other interesting dimension is the multiple-data-center dimension. So we have our data in multiple places, East Coast, West Coast, Europe, so if we write data into one location, we have to replicate it out to those other locations as well. So at a high level, we have three goals for this storage. One, we want to serve client requests quickly. We want things to be very fast because of that correlation between page load time and revenue. Two, we want to be able to scale out the number of nodes in the data center. As our service becomes more popular, we want to be able to continue to grow that storage tier to keep up with capacity and throughput demands. And then third, we want to be able to interact with the data coherently, in a way that makes sense. Now you notice that I've been a little wishy-washy with this term coherent. And that's because, as we'll see later, stronger things that we could say, like the strongest forms of consistency, are actually theoretically incompatible with our first two properties. So what we do in this work and in this talk is we're going to take these first two goals and we're going to say these are our requirements. We are going to ensure that we're always going to be fast and scalable, and then we're going to push on this third goal. How can I interact with this data in a way that makes sense? So to do that we're going to provide stronger consistency, so that operations appear in an order that makes sense for users and programmers, and stronger semantics, or more powerful tools for programmers to interact with this data spread across many machines. >>: Stronger than what? >> Wyatt Lloyd: So stronger than eventual. Yeah. Okay. So I want to state our first two goals a little bit more precisely, that we want things to be fast and scalable. So we can do this with the ALPS properties. So the ALPS properties say that we want a system that has availability, which means that operations always complete successfully; low latency, which says that operations always complete quickly, something that's on the order of a local round-trip time, and something that's much faster than a wide-area round-trip time; partition tolerance, which says that our system continues to operate even if we have a network partition, for instance, one separating the East Coast and West Coast data centers. And if you take these three properties together, you get what's called an always-on data store. Operations always complete, they're always successful, and they're always very fast. And in addition to these three properties, we also want scalability.
And this simply says that we can continue to grow our cluster, increasing our capacity and throughput, by adding more machines. So in addition to these ALPS properties, let's say we are fast and scalable, we also want to be able to interact with this data in a way that makes sense. So the way that we do that is through consistency. And consistency is a restriction on the order, and sometimes the timing, of operations in our system. We prefer stronger consistency because it makes programming easier, there's fewer things for me to reason about as a programmer, and because it makes the user experience better. Websites behave in a way that I intuitively expect them to. Yeah. >>: When you say partition tolerance, do you also consider partitions within the data center or is that outside your-- >> Wyatt Lloyd: Yeah. So we do not. So we assume that partitions happen only in the wide area, not inside data centers. Okay. So I want to explain stronger consistency a little bit more clearly through an example. So I want to explain what strong consistency is, and this is more formally called linearizability. And this gives us two properties. The first, that there exists a total order over all operations in the system, so that everyone sees all operations in the same order, and that this order agrees with real time, so that if operation A completes before operation B begins, it will be ordered in this way. So intuitively, the way to think about this is that if I write something on the East Coast and then call up my friend on the West Coast and say, hey, check out this thing that I just wrote, my friend on the West Coast will be guaranteed to see it in a linearizable system. So linearizability is great for programmers, it's easy to reason about, and it's great for users, it behaves in a way that they expect. So there's a spectrum of consistency models, and what consistency can we actually achieve with ALPS systems? So let's start at the top of the spectrum with linearizability, which I just described. Can I achieve this with ALPS systems? So unfortunately, the answer is no. So the famous CAP theorem, which I expect pretty much everyone in here knows, says that you can't have linearizability, availability, and partition tolerance. You can't have all three of these things. So linearizability is out. So let's move down the spectrum a little bit and we get to sequential consistency. So sequential consistency is the original correctness condition for multiprocessors. It still has that total order that exists over all operations, but it no longer has that real-time requirement. Sequential consistency is still very, very useful, but unfortunately this is also provably incompatible with the ALPS properties. Notably, the total order in sequential consistency and our low latency requirement are incompatible. And this proof about the total order also means that other consistency models, things like serializability, which has a total order over transactions, not operations, are also provably incompatible with low latency. So if we can't have linearizability or serializability or sequential consistency, what can we have? Well, if we look at the systems that people built that were fast and scalable, things like Amazon's Dynamo and Facebook's, and now Apache's, Cassandra system, these, when they are in their low-latency configurations at least, only provide you with what's called eventual consistency. So eventual consistency is this weak catch-all term that just means that data written in one location will eventually show up somewhere else.
And it doesn't give you any guarantees about ordering, and it especially doesn't give you any guarantees about ordering for data that's spread across different shards of the system. So the question is: what sits in this whitespace here, in between sequential consistency, which is impossible to achieve, and eventual consistency, which doesn't give you very much? So this is where our work falls, this is where our talk sits, and we're going to provide something called causal consistency. So causal consistency is going to give us a partial order, not a total order, but a partial order over operations in the system that agrees with the notion of potential causality. So if A happens before B and you see A and B, you'll see A before you see B. So I'll explain this much more clearly with an example in two slides. First, I want to revisit these theoretical results, though, that say that there's this fundamental choice between the ALPS properties and the strongest forms of consistency. And what I'm not arguing is that you always want the ALPS properties. Sometimes you need the strongest forms of consistency. And specifically, you need them when you want to enforce a global invariant. So if I have something like a bank account that you want to guarantee never goes below zero, I need the strongest forms of consistency for that. But in all other situations, we can provide you stronger consistency while still guaranteeing low latency. So before our work, this was the feeling in the community for scalable systems. So this is a quote from Amazon's Dynamo paper, very influential, from SOSP in 2007, and it said: "for Dynamo to provide this always-on experience, this high level of availability, Dynamo sacrifices consistency." And what our work says is, don't do that. Don't settle for eventual consistency. Instead we can provide you causal consistency. So what is causal consistency? So I'll explain it so hopefully it makes sense now. So I'll do this through an example. So on our social network we're going to have Alan Turing remove his boss from his friends group, and he's going to post to his friends, "I'm looking for a new job," and then his friend Ada will read that post. So you have three operations here and three rules that define causality. So let's go through all three of them. Our first rule is the thread-of-execution rule. This says that operations done by the same thread of execution are ordered this way by causality. So Alan's first operation is before his second. So everyone will see them in that order. Our second rule is the reads-from rule, which says that operations that read a value are causally after the operations that wrote that value. So Ada's read of his post is after his write of that post. And then our final rule is the transitive closure over our first two rules. So this means that Ada's read of his post is still after his first operation. This also means that any of Ada's later operations will appear to people after they see Alan's earlier operations. So causal consistency is great for users because now websites work in the way that they expect. Yeah. >>: What would strong consistency [inaudible]? >> Wyatt Lloyd: So what would strong consistency look like here? So it doesn't map directly to this. Everyone would see all of these things in this order, but strong consistency would also avoid the drawbacks, the sort of tricky thing you still have to reason about, which I'll get to in a few slides, and it would allow you to enforce global invariants.
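As an illustration (not from the talk or the COPS/Eiger code), here is a minimal Java sketch of the three causality rules just described, applied to the Alan/Ada example; the Operation record, readsFrom map, and happensBefore method are hypothetical names chosen for this sketch.

```java
// A minimal sketch of the three causality rules, applied to the Alan/Ada example.
import java.util.*;

public class CausalityExample {
    record Operation(String client, int seq, String description) {}

    // Rule 3 (transitive closure) falls out of chaining rules 1 and 2 via recursion.
    static boolean happensBefore(Operation a, Operation b,
                                 Map<Operation, Operation> readsFrom) {
        // Rule 1: same thread of execution, earlier in that thread.
        if (a.client().equals(b.client()) && a.seq() < b.seq()) return true;
        // Rule 2: b read a value that a (or something causally after a) wrote.
        Operation writer = readsFrom.get(b);
        return writer != null
            && (writer.equals(a) || happensBefore(a, writer, readsFrom));
    }

    public static void main(String[] args) {
        Operation unfriend = new Operation("alan", 1, "remove boss from friends");
        Operation post     = new Operation("alan", 2, "post: looking for a new job");
        Operation read     = new Operation("ada",  1, "read Alan's post");

        Map<Operation, Operation> readsFrom = Map.of(read, post);

        System.out.println(happensBefore(unfriend, post, readsFrom)); // true (rule 1)
        System.out.println(happensBefore(post, read, readsFrom));     // true (rule 2)
        System.out.println(happensBefore(unfriend, read, readsFrom)); // true (rule 3)
    }
}
```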
>>: Are you assuming that Alan's order goes to the same data center? >> Wyatt Lloyd: Yeah. So one of our assumptions is that users are tied to a particular data center. So the clients of our system are the front-end web servers, but the people that we really care about causality for are like meta-clients, the clients of the clients. So we assume that they're connected to a single data center. So we assume that the web service that's using us tracks this in some way, like it has in your cookie the data center you've been using recently, and then will redirect you to that data center if for some reason there's some routing flap and you're sent somewhere else. But, yes, we are assuming users are sticking to a data center. Okay. So causality is great for users because things work in the way that they expect. So here are some examples. So when Alan removes his boss from his friends list and then posts that he wants a new job, he does this because he expects those operations to show up in that order. In an eventually consistent system, this new-job post could show up before the boss removal. And this is exactly what he didn't want to happen. Or on Amazon, if I add something to my cart and then I click on my cart, in an eventually consistent system, what I added might not be there. In a causally consistent system, my boss will never see my post, and in a causally consistent system, I will always see what I just added to my cart. Or on Wikipedia, if I see some spammy article and I delete it and then I refresh the page, I expect it not to be there, but it might be in an eventually consistent system, and it won't be in a causally consistent system. So causal consistency is great for users in that websites work in the way they expect now, but our bigger contribution is making programmers' lives better. And programmers really like causality because it simplifies what they have to reason about when they're programming. So here are some examples of this. So let's say I take a photo and I upload it to a website, and then I take that photo and I add it to an album. In an eventually consistent system I can get this add-to-album operation before I get the photo that it's referring to. So I have to reason about, well, what do I do in this situation? Do I queue up a callback, do I even think about this, what's going on? I have to reason about getting these operations out of order. I don't have to do that in a causally consistent system. Here's another example. Let's say I like something on Facebook and then I hide it from my timeline. In an eventually consistent system, you have to reason about, what happens if I get this hide-from-timeline operation before the thing I'm trying to hide exists? Or on Amazon. You have to create an account before you can check out. In a causally consistent system, when you go to check out, your account will always exist. In an eventually consistent system it might not. So causal consistency is great for users and programmers, but it's not as good as strong consistency. There are still some tricky things that we have to reason about, and specifically, we have to reason about concurrent writes. These are called conflicts in causal consistency. And these are writes that don't have any ordering between them, and they are writes to the same data item. So this is what it looks like. We have two people at two different data centers writing to the same data at the same time. So what happens here?
Well, in traditional causal consistency, in the formalization of traditional causal consistency at least, the operations are propagated out to the other replicas, and whatever arrives later at a location overwrites what existed earlier at that location. So this means that on the East Coast and the West Coast and our other data centers, we might end up having divergent values. We might have different values for the same data. And this is not what we want; we want everyone to see the same data everywhere. So how do we handle this? We do one of two things. One option is to just arbitrarily pick one of the two operations and say this one happened later than the other. So this is called the last-writer-wins rule, or the Thomas write rule. And this is the default thing that we do in our system. So just choose one of the two and have it overwrite the other when they're both present in a data center. We could also do something fancier. So we could take our two update operations, one and two, and merge them together so we get three in each one of our data centers. We can do this through a function that's commutative and associative, which we call an application-specific conflict handler function. So one of the things that we did in this work was that we named and formally defined this combination of causal consistency and convergent conflict handling as causal plus consistency. So if you look at previous causal systems from the distributed systems community, these systems provided causal consistency, but they actually also provided causal plus consistency. Intuitively, this is what we want our system to actually provide. So our contribution there was naming and defining it. So if we take a look at these previous causal systems, they're based on a similar idea. And that's this idea: let's exchange logs of operations. So this is what it looks like. So in our local data center we are going to take operations and we're going to log them in the order they occurred. Now we're going to take this log of operations and ship it off to a remote data center. Now in this remote data center we're going to play back those operations in that exact same order. So this log is functioning as a single serialization point. So this is good, and it's going to implicitly capture and enforce our causal order. If A happens before B, A will be logged before B, and then when we ship this across to the remote data center, A will be played back before B. So that's great because that gives us causal consistency. But it has one of two problems. One, if we do this on a per-shard basis, so remember our data is spread out across a huge number of machines, if we just do this on each shard, we don't have any causality across the shards and we are actually not providing causal consistency. Or, if we do this data-center wide, this is not going to be scalable. Let's take all the operations in this data center and log them to one place. That's a big bottleneck. But an even bigger bottleneck is when we ship this log off to another data center and try to play it back in that exact same order across this big distributed system. So that scales even worse. So let's review the challenges that we've seen with consistency. We saw that the strongest forms of consistency are theoretically incompatible with the ALPS properties. We saw that eventual consistency, which is what people were settling for with these ALPS systems, doesn't really provide us with any consistency.
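As a side illustration of the convergent conflict handling described a moment ago (hypothetical names, not the system's actual API): by default, concurrent writes converge via last-writer-wins on a Lamport timestamp plus node id, or the application can supply a commutative, associative merge function so every data center ends up with the same value.

```java
// A minimal sketch of convergent conflict handling for two concurrent writes.
import java.util.function.BinaryOperator;

public class ConflictHandling {
    record Versioned<V>(V value, long lamport, int nodeId) {
        // Deterministic total order on concurrent writes: higher Lamport time wins,
        // node id breaks ties (the last-writer-wins / Thomas write rule).
        boolean after(Versioned<V> other) {
            return lamport > other.lamport
                || (lamport == other.lamport && nodeId > other.nodeId);
        }
    }

    static <V> Versioned<V> lastWriterWins(Versioned<V> a, Versioned<V> b) {
        return a.after(b) ? a : b;
    }

    public static void main(String[] args) {
        Versioned<Integer> east = new Versioned<>(5, 17, 1);  // concurrent write, East Coast
        Versioned<Integer> west = new Versioned<>(9, 17, 2);  // concurrent write, West Coast

        // Default: pick one arbitrarily but deterministically, the same choice everywhere.
        System.out.println(lastWriterWins(east, west).value()); // 9 at every data center

        // Fancier: an application-specific conflict handler; addition is commutative
        // and associative, so all replicas converge to the same merged value.
        BinaryOperator<Integer> merge = Integer::sum;
        System.out.println(merge.apply(east.value(), west.value())); // 14 everywhere
    }
}
```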
And then we saw this technique of log exchange, which provides causal consistency but is not scalable. So the contribution of our work is that we are the first scalable system that's able to provide this causal consistency. >>: So is Amazon's entire e-commerce infrastructure built on eventual consistency? >> Wyatt Lloyd: Their entire e-commerce, so I don't have full visibility into what Amazon does. So they have Dynamo there, which is eventually consistent. I've only heard rumors of how much Dynamo is used internally versus not used, and my understanding is it is not used exclusively and it's not used ubiquitously. I don't know if anyone else here has more insight than that. >>: That's what we heard too. >> Wyatt Lloyd: Yeah. Not used that much. >>: Because the consistency [inaudible]. >> Wyatt Lloyd: Yeah, cool. Confirmation. Amar. >> Amar Phanishayee: Just a quick question. So is it not scalable? >> Wyatt Lloyd: Yes. >> Amar Phanishayee: One of the assumptions that you make in your work is the user data [inaudible]? >> Wyatt Lloyd: No. User data is not owned by any data center. >>: [inaudible] if you wanted to write to [inaudible]? I mean the one post [inaudible] one data center [inaudible]. >> Wyatt Lloyd: So the data is not owned by any data center; it's that users are tied to a data center for a temporal period of time. So we assume that a user continues to use the same data center, but I can be a user in the East Coast data center and you can be on the West Coast and we can be reading and writing the exact same data, and you can be updating data that's stored under my key and things like this. But it's for the causality tracking and enforcing, which I'll get into soon, that we assume users are tied to a particular data center. We do have ways to move users between data centers, but this happens, you know, when you're going to log out, when you're on a plane moving across the country, that kind of thing. >>: Actually, my question is within the data center. You have all these machines where data is stored, systems like Isis or Bayou or whatever, right? >> Wyatt Lloyd: Yeah. >>: [inaudible] some sort of causal consistency [inaudible]. [inaudible] systems today data icons across servers. How much tracking needs to be done [inaudible]? >> Wyatt Lloyd: So that's super workload dependent. So definitely some, but how much data is related across all of the different servers is highly dependent on workloads and slightly difficult to quantify. >>: So something, the application has to decide how much data-- >> Wyatt Lloyd: Absolutely. >>: [inaudible] previous techniques would scale or not. Is that right? >> Wyatt Lloyd: So previous techniques were designed with the idea, which was correct and accurate in the 1990s, that your data fits on a single machine. And the ways that you would take this and generalize it out to a system where the data is spread across different machines, you know, we can guess, and there are different ways to do this that are better or worse, but I suspect that if you take this all the way to its logical conclusion, if you transform these into a fully distributed, scalable system, you'll end up with something that's very similar to what I'm about to describe. Okay. So, yeah. So our big contribution is this scalable causally consistent system. So how do we provide this? What's our key idea for enabling this? Well, it's two parts.
The first part is we are going to capture causality with explicit dependency metadata. So we're going to say that this purple operation is after that blue operation. And then the second part is that we're going to enforce this ordering with distributed verifications. So we're not going to expose a replicated put operation, a replicated write operation, until we know that the causally previous operations have been applied in this data center. So here's what this looks like: in the local data center we have the data spread across these many shards, they're replicating this data out to the remote data center in this scalable way that's distributed, and then in the remote data center, these nodes are doing a small number of checks with one another to make sure we are applying these operations in that correct causal order. So let's dive into the architecture of our system. Here we have a local data center with the data spread across many nodes, sharded across many nodes there, and then you have two remote data centers out in the distance. Inside the data center, this is our client. As I mentioned earlier, the front-end web server is our direct client, the one that we can easily reason about. And this is what's building pages on behalf of meta-clients, actual users out there in the world. So what we're going to do with this client is we're going to keep all of its operations local. And this is how I provide that always-on experience, that available, low-latency, partition-tolerant experience. And this is always on because we're never going out into the wide area where we could hit a partition, and we're never going out into the wide area where things take a long time, where things are slow. We're going to keep everything local. So if we take a look at our client a little bit more in depth, we have this client library that sits on the client. And the client library does two things for us. The first is it mediates access to nodes in the local cluster. So it knows where data sits on these different nodes, so we can send requests directly to them; and second, and far more important, is it's going to track and attach causal dependencies to our operations. So let's see how it actually does this. So here's a read operation. So the web server sends a read operation to the client library, and the client library is going to send this out to the node in the local data center that's responsible for the data we are interested in. That node will respond immediately. Excuse me, sorry. Okay. So it will respond immediately and will update some metadata in that client library. The client library will then return to the web server. Okay. Great. So what about a write operation? This is a little bit trickier. So a write operation comes into the client library. We're going to transform this into a write-after operation. So this is a write operation that also includes that ordering metadata: this operation is after these other operations. We're going to apply this at the local data center immediately and queue it up for replication in the background. Now, we can apply it immediately because this client has been interacting with this data center. So we know that everything the client previously read was read from this data center, and everything the client previously wrote was written into this data center. So we know that all causally previous operations have been applied here, so we can apply it right away.
So we apply it right away, we return to the client library immediately, again update some dependency metadata there, and return to the web server. Now in the background, we're going to replicate these write-after operations out to the nodes in the other data centers that are responsible for this particular data. So let's zoom in on one of our remote data centers. Okay, question. >>: So locally, writes are synchronized? >> Wyatt Lloyd: Local writes are synchronous. Yeah. So we assume linearizability, or we build on top of systems that provide linearizability, inside the local data center. Yeah. Or we assume partitions don't occur. Okay. So we replicate these out to the nodes in the other data centers responsible for this data. So let's see what happens in these remote data centers. So we zoom in. We get this replicated write-after operation coming in, and it has these attached dependencies. So if we take a look at these dependencies, we'll see that there's two parts. The first part is this locator key. This lets us know: who do I ask to find out whether this causally previous operation has been applied yet or not? So here it's L. So we'll ask the node that's responsible for storing key L. And then the second part is this unique timestamp. And this uniquely identifies the operation in the system, so when we ask node L, it knows exactly what operation we are talking about. So this timestamp comes from the accepting data center. The node that accepts the write assigns each write operation this unique timestamp; it's based on the logical Lamport clocks that we have in the system, and we append a unique node ID to ensure that these are globally unique. Okay. So we can send out these dependency check operations that ask the other node, have you applied this particular operation? And that node will check. If it has, it will return right away; if it hasn't, it will block until it does. Okay, question? >>: Is the timestamp globally unique or is the timestamp plus the [inaudible]? >> Wyatt Lloyd: Yeah. So the timestamp is those things concatenated together. So you need the globally unique node ID at the end to ensure uniqueness. Yeah. Okay. So we send out these dependency checks. We send them out in parallel for each one of our dependencies. And then those nodes respond once those operations have been applied. So once we get all these responses back, we know we can safely apply this write operation, because all the causally previous write operations have been applied in this data center. Yeah. >>: [inaudible] it seems like [inaudible] are implicit rather than explicit. If I remove my boss from my friends list and then make a post, the act of making a post doesn't explicitly depend on my friends list; it's just that I have to know in my brain that I already removed my boss from my friends list? >> Wyatt Lloyd: Right. So this is a very interesting point. So what we are capturing in our system is a pessimistic notion of potential causality. We're not capturing actual causality, because, as you notice, that's in your mind. We don't actually know what's going on. So anything that could potentially cause it is what we are capturing. So if you read something earlier and then you write something after you've read it, we're going to ensure that your write shows up later, even though we don't know if it really is causally related or not; we're going to capture every possible scenario just like that. Yeah. Okay.
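A minimal sketch (hypothetical names, not the actual COPS/Eiger code) of the remote-replication path just described: each dependency carries a locator key plus a globally unique timestamp (a Lamport clock value concatenated with a node id), and the remote shard holds a replicated write-after operation until a dependency check for every causally previous operation has returned.

```java
// A minimal sketch of applying a replicated write-after operation at a remote shard.
import java.util.*;
import java.util.concurrent.*;

public class RemoteApply {
    record Dependency(String locatorKey, long uniqueTimestamp) {}

    // Per-shard set of already-applied operation timestamps.
    static final Set<Long> applied = ConcurrentHashMap.newKeySet();

    // dep_check: returns once the named operation has been applied on this shard;
    // it blocks rather than failing.
    static void depCheck(Dependency d) throws InterruptedException {
        while (!applied.contains(d.uniqueTimestamp())) Thread.sleep(1);
    }

    // Apply a replicated write only after all (parallel) dependency checks return.
    static void applyReplicatedWrite(long ts, List<Dependency> deps,
                                     ExecutorService pool) throws Exception {
        List<Future<?>> checks = new ArrayList<>();
        for (Dependency d : deps) checks.add(pool.submit(() -> { depCheck(d); return null; }));
        for (Future<?> f : checks) f.get();   // wait for every causally previous operation
        applied.add(ts);                      // now safe to expose this value to readers
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        applied.add(100L);                                    // dependency already applied here
        applyReplicatedWrite(101L, List.of(new Dependency("L", 100L)), pool);
        System.out.println(applied.contains(101L));           // true
        pool.shutdown();
    }
}
```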
So to summarize the basic architecture of our system, we're going to keep all of our operations local inside this local data center and then do our replication in the background. And that's what allows us to provide that always-on experience. Then we're going to shard our data across many nodes inside each one of these data centers; that's what allows us to provide scalability. And then we're going to control how we do replication using dependencies and dependency checks, so we can provide causal consistency in this scalable way across that sharded data store. Yeah. Question. >>: [inaudible] application and maybe you'll get to it, but [inaudible]. If you don't have latency or low latency [inaudible]? >> Wyatt Lloyd: Yes. >>: My understanding is that in causality, the write is also dependent on reads, right? Let's say I'm doing a Facebook post. Before I post, I see two other posts. >> Wyatt Lloyd: Yes. >>: And you assume that when this gets replicated and the other person, for example, sees these posts, the order they appear in is, you know, your post after the earlier post from [inaudible]. So my question is this: imagine A posts one additional thing after that. So you have A's post, you write your post, and then A writes another post after that. Is causality preserved so that you can always guarantee the other person sees your post before A's new post? >> Wyatt Lloyd: So causality exists only in the forward direction. So we would ensure that your post appeared after the first post that you read. We would not ensure that some second post that that person did appeared after yours. We would only do that if that person read your post and then posted in response to you. >>: So your causality does not guarantee causality on the write order? So say after your post, they say, oh, I don't agree with you, but this post potentially can move up in the order [inaudible]. >> Wyatt Lloyd: So if the causality exists, we will enforce the order. So if there is actual causal dependence, so the person saw my thing and said I don't agree with you, then we would enforce that order. If two people are commenting on the same post concurrently, we won't enforce any ordering there. So any causality that exists, we will enforce, but any causality that doesn't exist, we won't enforce, because then we gain greater parallelism and we can be more efficient. >>: The reason that I asked is [inaudible] consistency. I think this may cause some problems, because you are reading some values, but this value may be rewritten by the time you [inaudible]. >> Wyatt Lloyd: Yeah. >>: [inaudible]. >> Wyatt Lloyd: Yeah. So the way to think about things in our system is everything is operation based. And when I move on in the future and I start talking about limited transactions, these will only be read-only or write-only transactions. So in a single atomic block, you're only either reading data or you're writing data. So there's no assurance that you read this data and then you wrote something based on it, and then when it's replicated across, the data that you read still exists there. It could have been updated to something new concurrently while you were doing your writes as well. Yeah. >>: So I'm trying to understand your sense of causal consistency. >> Wyatt Lloyd: Yes.
>>: If I post on someone's, if I respond to a post with a comment saying, you're an expletive one, and that same person replies back, well, yeah, you're an expletive two. Under your definition, that "you're an expletive two" might occur before "you're an expletive one". >> Wyatt Lloyd: So no. Because that person was responding, they must have read your post. And because they read your post, they're causally after your post. So with one and two, because two saw one and was responding to it, it will show up after it. >>: Okay. >> Wyatt Lloyd: Yeah. So for anything that's read or written in the system, anything that actually happens and is seen by the system, we enforce that order. Yeah. >>: [inaudible] consistency exactly [inaudible] causal ordering-- >> Wyatt Lloyd: Absolutely. >>: [inaudible] paper in 1970 [inaudible]. >> Wyatt Lloyd: 78. Exactly. Yeah. So causal consistency is exactly equivalent to enforcing the happens-before relationship. Yeah, exactly the same. Okay. Cool. So in the current system I've described, there's this big challenge. And that's that we're going to have lots of dependencies. So why are dependencies bad? Well, we have metadata associated with them, so we are wasting space in the system, and we have to do these dependency checks, so it's taking away throughput from user-facing operations. So we don't want to have lots of dependencies. And we have lots of them because dependencies grow with client lifetime. So here's an example. Let's say a user does a write operation, and they do another write operation and another write operation. Then they read some value and they read some other value. Well, their next write operation has a ton of dependencies. It depends on all of its previous write operations because of that thread-of-execution rule. It depends on all of the operations that wrote values that it read because of the reads-from rule. And then it depends on all the operations that those operations depend on because of the transitive closure rule. So as you can see, this can very quickly get out of hand, but luckily we can use the concept of nearest dependencies to dramatically reduce what we have to track in the system. So nearest dependencies are what I've marked in green here on this graph, and they're going to be the nodes that transitively capture all of our ordering constraints. In graph-theoretic terms, the green nearest dependencies are the nodes that have a longest path of length one to our current operation. Because they transitively capture all of our ordering constraints, we know that if we are after the green nearest dependencies, we're going to be after all the blue dependencies as well. So this is great because this is all we really need to track in our system. One really nice thing about these nearest dependencies is that even if I have this huge graph of causality, I still have a small number of nearest dependencies. And as I'll show you near the very end of the talk, this allows us to have an implementation that's quite efficient. So there is one drawback of nearest dependencies, and it's this: to actually figure out what the nearest dependencies are, we need a cut of this causal graph. We need a large subgraph around each one of these operations that we're looking at to be able to calculate what is actually a nearest dependency and what is not. So instead of tracking this optimal set of nearest dependencies, we track what we call one-hop dependencies. So these are what I've marked in orange here.
So you notice that the middle node in this graph is orange, but it was not green. So it's a one-hop dependency, but it's not a nearest dependency. These are the nodes that have a shortest path of length one to the current operation in the system. They are a superset of nearest dependencies, so they're going to be sufficient for providing causal consistency, but they're a lot easier to track. In particular, to track them, all we need to do is track, one, the last write operation that we did, and two, all of the reads that happened after that. So now, from our thread-of-execution rule, we only have one dependency, on that last write that we did. From our reads-from rule, we now only have a limited number of dependencies, on the operations that wrote values that we read after that last write. And then finally, we don't have any dependencies from the transitive closure. None of them are there. Question. >>: [inaudible]? >> Wyatt Lloyd: So is this like tracking-- >>: [inaudible] reach parts of the global [inaudible], so there's a global vector [inaudible] parts of it? >> Wyatt Lloyd: Yeah. So vector clocks are a difficult comparison to address, in that it depends on what you do with them. We have this big graph going on in the system, and vector clocks compress things based on a single node and have a certain number of checks going in between them. What we do is we have a much larger amount of parallelism going on in the system, and we ensure that all operations can be accepted locally right away. If I were just using regular vector clocks to build one of these systems, then if I'm waiting for an operation from over here to show up in this data center, I wouldn't be able to accept a new operation coming into this node until that operation is applied. So I'd have to block this write operation. And I don't want to do that; I want everything to be low latency. This is slightly unsatisfying because exactly how you would build vector clocks into the system is unclear. But it's related at a high level, and we're doing all the specific things that you need to make it efficient in this scalable implementation. >>: An interesting point [inaudible] value [inaudible] vector clocks; in theory it's a huge vector clock. [inaudible] so that would be [inaudible] dependencies. But the problem is that [inaudible] vector [inaudible]. Maybe I'm wrong. >> Wyatt Lloyd: If you had a vector clock with an entry per data location, I would say it would be equivalent. But that would be, yes, that would be ungodly large. But yes, I would say that would be equivalent. >>: [inaudible]. >> Wyatt Lloyd: So you could think of what we're doing as a clever way to take this ungodly large vector clock and reduce it down to a minimum set that's easy for us to track and that allows us to enforce these things in a causally consistent way. Yeah. Okay, cool. So what do these one-hop dependencies actually buy us, or what do they give us? Well, checking them is sufficient for causal consistency because they are a superset of the nearest dependencies. And there are still going to be few enough of them that we're going to be competitive with eventually consistent systems, which I'll show you near the end of the talk. We never have to store any dependencies on the server, and this is because we don't need any of those transitive closure dependencies, and because we're not calculating the nearest dependencies.
So we never need any dependencies on servers. And it really simplifies our client-side dependency tracking. All we need to do is, every time we have a write operation, attach all of our current dependencies to that write operation, clear them out entirely, and add a new dependency on that write. So it's very simple. So before I move on to the second half of the talk, I want to summarize how to provide causal consistency in this scalable way. So we have this large number of clients that are concurrently accessing the system, sending read and write operations to their local replica of the data store. These operations are serviced right away, so reads return right away and writes are applied right away. And then responses are sent back immediately to the clients. We're going to update metadata on each one of these web servers so we can continue to track causality. And then in the background, we're going to replicate data out to the remote data centers, to the nodes in the remote data centers that own the data we've just written. And these remote data centers will do these explicit dependency checks and will only apply the values once those dependency checks have returned. And they're going to exploit the transitivity in that graph of causality, so we can do this in an efficient way. Yeah. >>: [inaudible] tiers, so I as a user always contact the same [inaudible], not just the same data center? >> Wyatt Lloyd: So in the current design of the system, we have users sticking to a front-end web server. And one of the questions that I'll often get is, if you were going to design the system over from scratch, what one thing would you change? This is the thing that I would change. I would not have the client library actually resident on one of these web servers; instead I would have the client library resident on one of these machines, and we would connect users to a particular server machine based on their cookie, so that we could remove all state from these web servers. But right now, in the current design, they are sticking to web servers. >>: And what happens [inaudible] front end [inaudible]? >> Wyatt Lloyd: So that's why I would move them. >>: Moving them affects it because the failures can also then happen at the storage tier. >> Wyatt Lloyd: Right. So one of the things that we're doing in the storage tier that we're not doing in the web tier is that each one of these servers that I'm representing here is actually a logical server. So it's an abstraction of something that does not have failures. And what's going on underneath that is we have replicated state machines. So that's like a chain replication group or a Paxos group going on underneath that's providing us fault tolerance. And so because we already have that abstraction there, that's why I would put the client library there. For these front-end web servers, we don't want to do that; we don't want to do any replication to provide fault tolerance. >>: And it's easy to provide that abstraction? Because I mean in the end-- >> Wyatt Lloyd: Is it easy? I mean, there's lots of ongoing research into this, which is one of the reasons why I wanted to have that abstraction, because there's lots of exciting things going on, some from here, some from CMU, and at different places, at this lower level, for providing efficient replicated state machines. Lots of questions. Yes.
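Before the next question, a minimal sketch (assumed names, not the actual COPS/Eiger client library) of the client-side bookkeeping just summarized: the library keeps one dependency on the client's last write plus one per value read since then, and on each write it attaches those dependencies, clears them, and replaces them with a single dependency on the new write.

```java
// A minimal sketch of one-hop dependency tracking inside the client library.
import java.util.*;

public class ClientLibrarySketch {
    record Dependency(String locatorKey, long uniqueTimestamp) {}

    private final Set<Dependency> currentDeps = new HashSet<>();

    // Called after each read returns: remember the operation that wrote the value we read.
    public void trackRead(String key, long writeTimestamp) {
        currentDeps.add(new Dependency(key, writeTimestamp));
    }

    // Called on each write: attach all current dependencies to the write-after
    // operation, then clear them and depend only on this new write.
    public Set<Dependency> trackWrite(String key, long newWriteTimestamp) {
        Set<Dependency> attached = Set.copyOf(currentDeps);
        currentDeps.clear();
        currentDeps.add(new Dependency(key, newWriteTimestamp));
        return attached; // shipped with the write-after operation for remote dependency checks
    }
}
```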
>>: So when you're applying [inaudible] data center, why [inaudible] another [inaudible] comes from the data center, you just apply the [inaudible]? How do you order that? >> Wyatt Lloyd: So how do I order it? >>: Yeah. >> Wyatt Lloyd: So when I get something over here, I have these dependencies attached to it and I check the dependencies. Only once they return do I apply this operation. >>: But then that means they [inaudible]? >> Wyatt Lloyd: So, okay. Yeah. So for the writes that return to clients, in the lifecycle of a write there's the client-facing part of the write, which happens only in the local data center. So the write comes in, it's applied right away, and returned to the client. And the client is now done; my write's been applied. There's also a part that's in the background and is not user facing. That's when we replicate things out to the other data centers and then do this check. So for users, all user-facing operations are very fast. They have low latency. But what can have longer latency, which does have longer latency, is this latency of visibility: how long from when I write something over here until it shows up over there? They will delay it until they can show it in the correct consistent order. >>: When do the conflicts [inaudible]? >> Wyatt Lloyd: So the conflicts would surface once both conflicting operations are resident on a single server. So once this server gets an operation from the remote data center, if it also had a concurrent write, then we could recognize it there. If we're doing the last-writer-wins rule, we can actually do this obliviously. We just apply things in the order given by the Lamport timestamps, and that automatically enforces the last-writer-wins rule. But if you're trying to detect those, if you've told me that you want to detect those, then we'll detect those every time a write comes in, and we check against the current version to see whether they are causally ordered and there's not a conflict, or whether they're concurrent and there is a conflict. >>: So I'm curious what should the [inaudible] operator [inaudible], this write and later on the [inaudible] potentially? >> Wyatt Lloyd: It's not that the write is dumped in favor of another write. So once we get a write, we're going to apply that elsewhere unless this data center blows up. But you're not guaranteed that no one else is also allowed to write that data. Yeah. >>: Quick question. Here when the [inaudible] front end, reading [inaudible] basically backend-- >> Wyatt Lloyd: Yeah. >>: For each read there didn't need to be a basic reading to [inaudible] meaning for a [inaudible]? >> Wyatt Lloyd: So it depends on how the logical server abstraction is set up, and I would set it up in a way that's cheap for reads, expensive for writes; but if you just set it up with a normal Paxos group, you're going to do a normal Paxos read. >>: [inaudible]? >>: [inaudible] down there [inaudible]. >> Wyatt Lloyd: There's lots of awesome stuff going on so you don't have to do expensive things. Yeah. >>: [inaudible] it's a browser-- >> Wyatt Lloyd: So our client is a front-end web server. So this is like a machine running your PHP server, your Apache server, whatever. And then the meta-client is the browser. And when we get really deep, the meta-meta-client is the person using the browser, using the web server, using our system. >>: [inaudible] abstractions?
What's this global abstraction? Is it a single partition or-- >> Wyatt Lloyd: Oh, yeah. So each server is a replicated state machine. Each one individually is a replicated state machine. >>: [inaudible] by [inaudible] data center [inaudible] exploit that? >> Wyatt Lloyd: Yes. It's the one that's accepting that write. >>: So what's the [inaudible]? [inaudible] these objects, bytes, shards, what's that-- >> Wyatt Lloyd: So on each server it's a shard. The level of abstraction, the data model we're providing, is going to be different in the two different systems, which is on my next slide. So that's an excellent, this is a good transition. >>: Just one question. [inaudible] meticulous way if there's a write in one data center? >> Wyatt Lloyd: Yes. >>: An update is propagated from this data center [inaudible] and so if there's a partition now, all the writes that are going to go there are going to succeed but they're going to be concurrent, [inaudible] right? >> Wyatt Lloyd: Yes. >>: So partition tolerant [inaudible] write to actually show this concurrent write it would be causally correct but if you look at the timeline actually [inaudible]. >> Wyatt Lloyd: So depending on how things happen on different sides of the partition with the current logical clocks, strange things could happen. If you look at the 1978 original paper, there are ways to incorporate real-time clocks into this as well, so you have a notion of real time, so you don't have some strange thing happening where you have one really active data center that is going to overwrite data that was written 10 minutes later in a non-active data center elsewhere. But, yeah. Okay. Great. So let's take it up to a little bit higher level. We're talking about these geo-replicated storage systems and these properties we want them to have. We want them to be fast and scalable; those are the ALPS properties. We want to be able to interact with this data in a coherent way, in a way that makes sense. So I just showed you how I can provide causal consistency for these scalable systems, and that was the big contribution of our COPS paper that was at SOSP 2011. And after that work we continued to push on, how can I interact with this data in a way that makes sense? And so our Eiger paper, which was at NSDI just a few weeks ago, makes a number of new contributions. So one, it's going to move us from the key-value data model that we had in COPS, this very simple data model which is appropriate as an opaque cache for my web service but not as my primary data store. We're going to move to a rich data model, so I can actually build a rich web service on top of this, something like Facebook. We're also going to have read-only and write-only transactions that allow us to consistently interact with data that's spread across many servers in the system. Okay. So let's dive into this rich data model, which is a column family data model. It's this widely used hierarchical structure. Here's what it looks like abstractly, and don't worry, I'll fill that in so it hurts your eyes less. So this data model was introduced by Google with their Bigtable system. So you can see why it's called a big table. And then it was open sourced by Facebook with the Cassandra system, which is now widely used outside of Facebook. Okay. So this column family data model, it's very useful for all of these services, it's used for all this stuff, so how can I actually build something like Facebook on top of this, a social network? So we have keys on the left side.
These are going to correspond to users in our system. So we have one for Ada Lovelace, one for Alan Turing, and then one for his advisor from when he was at Princeton, or his boss, Alonzo Church. Along the top I have super columns, which you could also think of as categories. And then underneath each one of these I have columns. So columns are individual data locations, and we can store things like your age in one, your town in another, and so on. Underneath, okay, that's profile, so what about underneath friends and status? Well, here I only have an entry for data that exists. So I'll have an entry under Ada that she is friends with Alan, and we have some metadata associated with it, like the date, but I won't have an entry that says she's not friends with herself, and she's not friends with Alonzo, and she's not friends with our other 1 billion social network users. So this is a sparse table. It's a [inaudible] data model. The same thing for status updates. You only have an entry for something that exists. And then under counters we have a special type of column. So we have friends here, and this is a number, and it's special in that it's going to be commutatively updated. So when I move to a new town, I'm overwriting this opaque value that existed there. In a counter, I'm not overwriting the value that existed before; instead I'm adding something to it. Every time I add a friend, I'm adding one to this value. So while we have one operation that created that town value that says that I live in London, I have 631 operations that commuted together to provide this one single value that says I have 631 friends. So one of our challenges is how do we represent this in an efficient way in our new system? So this column family data model, again, it's storing tons of data, so it's spread across lots of servers in the system, and this all existed before, it was widely used, so what's our contribution? Well, it's providing causal consistency for this data model. So now when Alan removes his boss from his friends list and then posts that he's looking for a new job, these operations will show up in that order across this rich data model, across all of these servers. In addition to this, we provide read-only transactions, so I can now consistently read data spread across this table and across all of these servers, and then write-only transactions that allow me to atomically update data that's spread across many servers in a single data center. So the way you could think about this is, now I can consistently update symmetric relationships. I can make Ada and Alonzo friends atomically at a single time. So they both become friends at the same time, and we never have an inconsistency in this graph of actual friendships. So our Eiger system provides these ALPS properties, so it's going to be fast and scalable; it provides this rich data model, so we can build rich web services on top of it; it provides causal consistency for that rich data model, though I actually didn't go over how we did that; and then we also provide these read-only and write-only transactions, each of which I want to go over briefly. So first, why do I even need read-only transactions? Why aren't reads enough? At a high level, it's because we're sending asynchronous requests to distributed data. Here's what this looks like. So we have Alan Turing's operations, where he removes his boss from his friends list and then he posts that he's looking for a new job.
And here we have a data center where neither of these operations has shown up yet. So this could be his local data center where he's issued these operations but they haven't been applied yet, or it could be a remote data center that they haven't been replicated out to yet. We're going to track his operations with this purple bar here. And here's the client of his boss, of Alonzo Church, this web server that's going to be building his page for him. So this web server can go out, before either of these operations has been applied, and request the friends list that includes his boss. Now Alan Turing's operations can appear and be applied in the correct causal order, and our data store has always been consistent; but now the web server goes out and gets this status update, and it's going to return inconsistent data back to the user even though it was reading from an always-consistent data store. So instead what we need are read-only transactions. Read-only transactions give us a consistent and up-to-date view of data that's spread across many servers in the system. The way to think about it is this: we have these data items, we have logical time moving in that direction, each one of these data items moves through progressively newer versions in logical time, and what our read-only transactions are going to do is return a view of the data store from one particular logical time. So we are either going to return when they're friends and there's the old status, or when they're not friends and there's the old status, or when they're not friends and there's the new status. We're never going to return that inconsistent result where they are friends and there's the new status. So what are some of the challenges that we overcame for these read-only transactions? Well, one was scalability. Traditional solutions for transactions have something like a transaction manager, which is centralized, but we want to be able to continue to scale the number of nodes that we have in our system, so we achieve scalability through a fully decentralized algorithm. We also want to be able to guarantee low latency. And so we do this with an algorithm that's going to take at most two parallel rounds of local reads inside the data center and that avoids all locks and blocking. So all responses are sent back immediately. And then finally, we want to provide high performance. So our tagline for this work is: don't settle for eventual consistency, because we want performance that's going to be competitive with eventual. So we achieve that with an algorithm that's going to take one round of reads in the normal case. So how do the read-only transactions actually work? Well, at a high level it's this: we're going to start out with a first round of optimistic parallel reads. So we're going to ask the servers that have the data, what's your current value? And they're going to return that value along with some validity metadata: when in logical time are these values valid? Once we get all of our responses back, we're going to calculate an effective time for the read-only transaction. This is a single logical time at which the read-only transaction takes place. And we're going to calculate this in a way that we are always going to ensure progress in the system. We're always moving forward; we are never stuck on an old consistent cut. Once we have this effective time, we'll check our first-round results. Are they all valid at that one particular effective time? If they are, we'll return, and most of the time that will be the case.
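A minimal sketch (assumed names and a simplified notion of validity intervals; not Eiger's actual algorithm or code) of the two-round read-only transactions being described: an optimistic first round returns each value with the logical-time interval over which it is valid, an effective time is chosen so the transaction always makes progress, and the second round, which Wyatt describes next, re-reads only the values that are not valid at that time.

```java
// A minimal sketch of a two-round read-only transaction over several servers.
import java.util.*;

public class ReadOnlyTransactionSketch {
    // A value plus the logical-time interval [validFrom, validUntil] over which it is valid.
    record VersionedValue(String value, long validFrom, long validUntil) {}

    interface Server {
        VersionedValue firstRoundRead(String key);          // optimistic round one
        VersionedValue readAtTime(String key, long time);   // round two, at the effective time
    }

    static Map<String, String> readOnlyTransaction(List<String> keys, Server server) {
        // Round one: optimistic reads (issued in parallel in the real system).
        Map<String, VersionedValue> first = new HashMap<>();
        for (String k : keys) first.put(k, server.firstRoundRead(k));

        // Effective time: chosen so the transaction always makes progress; here,
        // the maximum validFrom seen across the first-round results.
        long effectiveTime = first.values().stream()
                .mapToLong(VersionedValue::validFrom).max().orElse(0L);

        // Round two, only if needed: re-read keys whose first-round result is not
        // valid at the effective time (this only happens under concurrent writes).
        Map<String, String> result = new HashMap<>();
        for (String k : keys) {
            VersionedValue v = first.get(k);
            if (v.validFrom() <= effectiveTime && effectiveTime <= v.validUntil()) {
                result.put(k, v.value());
            } else {
                result.put(k, server.readAtTime(k, effectiveTime).value());
            }
        }
        return result; // a consistent cut of the data store at the effective time
    }
}
```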
In fact, the only time it won't be the case is if there are concurrent writes to the data that we are reading. If that happens, we might have to reread some data in a second round, and we'll do that with a second round of parallel read-at-time operations. These are special read operations that include that effective time and ask the servers to return the data at that particular logical time we chose. The servers can return that data, and then we can return this consistent cut of the data from the effective time we chose right away to the client. Now, to support these read-at-time operations, the servers have to store a number of old values, but we can limit that significantly because we are always making progress in the system. We are always moving forward, so we never have to store old values indefinitely. So our Eiger system provides the ALPS properties, the rich data model, causal consistency, and these read-only transactions. So now I want to quickly get into write-only transactions. So write-only transactions allow us to atomically update data spread across many servers in a single data center. We're going to replicate these out in the background in that causally consistent order, and the data will appear in the other data centers atomically as well. So what are some of the challenges that we overcame here? Well, again there's scalability: how do we do this in a way that ensures scalability when our data is spread across many nodes? And again, we do this with a fully decentralized algorithm. How do we ensure low latency? So for write-only transactions, we do it with a limited number of local rounds of communication, avoiding all locks and blocking again, but really the tricky thing was how do we ensure low latency for concurrent read-only transactions that are reading the same data? And we do this by not blocking those read-only transactions that are concurrent, but instead indirecting them through one of the nodes participating in the write-only transaction to find out whether the write-only transaction happens before or after the effective time that we've chosen. So our Eiger system, it provides these ALPS properties: we're going to be fast and scalable, the rich data model, causal consistency so that operations appear in an order that makes sense to users and programmers, and these limited forms of transactions, so I can consistently interact with data spread across many servers. But what does all of this cost? Well, to find out, I built some prototypes. I built a prototype for the COPS system on top of FAWN-KV. This is about four and a half thousand lines of C++. And then for the more recent system, I took the open-source Cassandra system, forked it, and added about 5,000 lines of Java to it. So I'll show you results from the more recent Eiger system. So when we're evaluating these systems, we wanted to answer two key questions. The first is: what is the cost of this stronger consistency, causal consistency, and our stronger semantics, these limited transactions, compared to the eventually consistent, non-transactional Cassandra system? So I'll show you the overhead for a real workload shared with us by our friends from Facebook, and I'll also show you the overhead for a large state space of different possible workloads, not just the one that we really care about.
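Returning to the read-at-time operations described above, here is a server-side sketch of the kind of multi-versioned storage they imply: a few old versions per column, indexed by the logical time at which each version became valid, plus a way to drop versions that no read-only transaction can still ask for. The class, method names, and the garbage-collection rule are hypothetical simplifications, not the actual Eiger implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Server-side sketch: multi-versioned columns that can answer "read at time"
// requests. Each column keeps a small window of old versions, indexed by the
// logical time at which each version became valid.
class MultiVersionStore {
    private final Map<String, NavigableMap<Long, byte[]>> versions = new HashMap<>();

    // A write installs a new version that is valid from 'writeTime' onward.
    void write(String key, long writeTime, byte[] value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(writeTime, value);
    }

    // Return the newest version that was already valid at the effective time.
    byte[] readAtTime(String key, long effectiveTime) {
        NavigableMap<Long, byte[]> v = versions.get(key);
        if (v == null) return null;
        Map.Entry<Long, byte[]> e = v.floorEntry(effectiveTime);
        return e == null ? null : e.getValue();
    }

    // Because read-only transactions always make progress, no future effective
    // time can fall below 'oldestNeededTime', so older versions can be dropped
    // (keeping the one version still needed to answer reads at that time).
    void garbageCollect(String key, long oldestNeededTime) {
        NavigableMap<Long, byte[]> v = versions.get(key);
        if (v == null) return;
        Long keep = v.floorKey(oldestNeededTime);
        if (keep != null) v.headMap(keep, false).clear();
    }
}
```

Bounding how far back an effective time can fall is what lets the servers keep only a small number of old versions per column rather than an unbounded history.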
And then second, the big contribution of our work is being able to provide these properties in a scalable way for data that's spread across a large number of nodes, so I'll show you empirically that our system scales to large clusters. So here's our experimental setup. We're going to have a local data center that has N servers in it, and we'll spread the data across those N servers. N will be eight in the first two experiments, and then it will grow to be much larger in the later experiments. We also have N client machines with many threads of execution; they're going to be fully loading these servers in the local data center. We also have a remote data center that we are replicating data out to. That will also have N machines in it. So here are the results for that Facebook workload. This is the workload from Facebook's TAO system, an eventually consistent, non-transactional, geo-replicated production system that they use at Facebook. So TAO is like that baseline I'm saying you don't have to settle for. Along the y-axis is throughput for that eight-server cluster; this is going to be in hundreds of thousands of columns per second, so those are the individual data locations that I showed you in that big table earlier. The Cassandra system has a throughput of about half a million of these per second. Our Eiger system is very close. In fact, we see only about a 6.5 percent overhead for providing these much stronger properties. So I find this very encouraging. So that's just one particular workload. What about a large state space of possible workloads? Well, to find out, we built a dynamic workload generator, and this generator has seven parameters. These include things like: what is the size of values in the system? What do reads and writes look like in terms of how many keys they are spread across? How many columns within each key are they spread across? How often are we reading versus writing? How often are those writes transactional? So in our system all reads are transactional, but only some writes are transactional. So what we did for each one of these parameters was we chose a sensible default and then perturbed each one individually, so we have seven graphs. Here I'll show you three of the graphs. Excuse me. This should give you a good idea of what's going on without being too overwhelming. So each one of these graphs is normalized. So we'll have the Cassandra system, that baseline system, as a flat line on top. We want to be as close to that as possible. So I'll show you the results for value size, and this axis is logarithmic; the results for write fraction, that is, how often are we reading versus writing, where we are reading more often toward zero; and the number of columns per read, so roughly how many data locations underneath a single key are being read or updated by each read. Okay. So here are the results. The gray lines indicate our default values. And as you can see, there are parts of the state space where our overhead is medium, about 30 percent here. This is where value sizes are tiny. That's because we have metadata in our system, we have these dependencies. In the Eiger system, dependencies are 16 bytes. And so when you compare this with a one-byte or an eight-byte value, most of our processing is not going to the actual value itself. When we increase value size, however, to something that is still not very large, like 32 bytes, our overhead goes to what I would call low.
This is true also for a large part of the write-fraction state space, and it's largely independent of the number of columns within a read. So we have this very large area of the state space where our overhead is pretty low; we even have parts of the state space where our overhead is minimal. When we are only doing reads, or when values are large, like 512 bytes or larger, our overhead turns out to be very, very low. So this is very encouraging. So not just for that Facebook workload that we really care about, but for this large state space of workloads, our overhead is pretty competitive as well. So, okay. Yeah. >>: So you're varying one parameter here? >> Wyatt Lloyd: Yes. >>: So what were all the other parameters set at for>> Wyatt Lloyd: Yeah. So they were set at their gray values, the default values. Yeah. Okay. So here's the graph that I really like. It's my favorite graph. This shows how well our system scales. So this is the money graph, right? Okay. So along the x-axis we're going to double the number of servers in each cluster, right? From one to two, four, eight, more servers in each cluster. And at the very extreme end of this graph, we have 128 servers in a cluster. This is actually a pretty big experiment, for me at least as an academic, because we have 384 machines here: two clusters of 128 servers and one cluster of 128 clients. Along the y-axis I have the normalized throughput. So I'm going to take the throughput of the cluster and divide it by the throughput of a single-server cluster. So ideally, as we double the number of servers in each cluster, we're going to double our throughput as well. So what happens as I double from 1 to 2, 2 to 4, 4 to 8? My throughput almost doubles each time. Here we see about a 72 percent increase for each doubling of the number of servers. So this is good, but it's not ideal. Ideally I would have a 100 percent increase for each doubling of servers. So what's going on here? Well, most of our operations are spread across a large number of keys, which are spread across a large number of machines. In a single-server cluster or in small clusters, many of these operations are batched together into a single operation. So we get great benefits from batching in a small cluster, and then a little bit less and a little bit less and so on. Okay? So what happens as I continue to double the number of servers, from 8 to 16 and all the way up to 128 servers in each cluster? Well, the effects of batching are decreased. I see about a doubling of throughput for each one of these, and in fact, now I see a 96 percent increase every time I double the number of servers. So this is super encouraging. What it says is we can scale out our cluster. We can keep increasing our capacity and our throughput by adding more machines. So to summarize this main part of the talk: I talked about geo-replicated storage that serves as the backend of large-scale websites, things like Facebook, Reddit, or Amazon. I talked about the ALPS properties that precisely say we want to be fast and scalable. I talked about causal consistency, which we enforce in a scalable way through explicit dependencies and distributed dependency checks. I showed you how we exploit transitivity to reduce the overhead of our system. I talked about our stronger semantics, including this rich data model, so you can build services like Facebook on top of this, and read-only transactions, so I can consistently read data spread across many machines.
And write-only transactions, so I can atomically update data spread across many machines in a single data center. I showed you our system is competitive with this eventually consistent baseline, and that we can scale to many nodes. I want to revisit the theme of my research, which is that I like to look at these classic distributed systems problems, except through the lens of modern cloud or web services, and with that in mind, I want to review some previous work and talk about a few future directions. So some of my previous work was looking at fault tolerance for cloud services. This included the TRODS system, which was at DSN in 2011, which looked at how can I recover connections to these front-end web servers with minimal overhead? So what's new in this setting? Well, one is we can't modify these client machines. These are client browsers; I don't have any control over them, so I can't change them. The other new thing is these front-end web servers are actually very similar to one another. They can each serve the same content. So what we did was restrict our domain a little bit. We said, let's just worry about what we call object delivery services. So these are going to be services that are serving static content, data that's not changing, so photos or videos or static webpages. And with this restriction we were able to provide fault tolerance for these connections with dramatically lower overhead than previous approaches. And our key technique is that we embed application-level state in the mechanics of the TCP transport layer, so that we get these unmodified clients to actually help us out even though they are unmodified. I also had another project, at NSDI in 2010, on a system called Prophecy. This was about scaling Byzantine fault tolerance. So Byzantine fault tolerance provides protection against the worst kinds of faults, malicious faults, where failing nodes can even collude with one another. And in traditional Byzantine fault tolerance systems, you need four machines or even more to provide throughput similar to a single non-fault-tolerant machine. So in this project we produced the first system that was able to provide Byzantine fault tolerance, with slightly weaker consistency, that had throughput competitive with non-fault-tolerant clusters for common web workloads. I want to touch on a few interesting future directions that I'm excited to work on. One is: how do we reason about partial replicas in geo-replicated storage? So right now, and in this talk, we have a couple of data centers and we have our data in each one of the data centers. So what happens when I have 20 data centers? Well, I'll only want to have a subset of the data in each one of the data centers; I don't want 20 copies of it. So how do I reason about that? How do I ensure low latency and consistency for this data when I only have subsets of it in each one of these data centers? This is also really cool because it allows us to extend our reasoning to edge devices. So your phone or your laptop could be viewed as a tiny partial replica of the data that we actually have in the service. Another exciting future direction is making general transactions more scalable. When I say general transactions, I mean things that are more general than what I talked about in this talk, but not fully general in the database sense. So general transactions would allow you to atomically read data and then write data in a single atomic block, for instance.
I'm interested in looking at how we can make these fully scalable as well, so we can use them on data spread across a huge number of machines. I'm also interested in revisiting the network abstractions that we use to build these storage systems. So the current abstractions that we have are from the 1970s, and we don't have to be stuck with them. And in this big distributed system, I control all the nodes; so if I want to put a new networking stack on there, I can do that. I can upgrade all of them at the same time. And I think this is an exciting place to do networking research because of that, and when I was building these systems I noticed that the abstractions underneath didn't actually match what I wanted to do. What I want to do is issue a massive number of parallel RPCs. My final direction of future work that I'm excited about is improving the programmability of these distributed cloud storage services. So I talked about this fundamental impossibility result: either we can have strong consistency or we can have low latency. We can't have both. Sometimes you need strong consistency and sometimes you need low latency. So what I'm interested in doing is, first, unifying these two abstractions for the programmers so they can explicitly choose what they need in each situation. And then iterating on that and improving it using software engineering and programming language techniques, so eventually we can get to a place where programmers are telling us their invariants and we're enforcing them for them, and they're telling us declaratively what they want to do instead of imperatively how they want to do it. So to conclude, in general I like to look at classic distributed systems problems, except through the lens of modern cloud or web services; and today I showed you how we re-examined replicated storage for these large web services with the new requirements of geo-replication, massive scale, and low latency; and I showed you how we can provide stronger causal consistency and stronger semantics for them. Thank you. >>: We have a few minutes for questions. >>: What does the scaling of the eventually consistent version of Cassandra look like? >> Wyatt Lloyd: So it looks very similar. The big bottleneck that we see there, where there's that non-ideal scaling characteristic, is that we have lots of batching in a single-server cluster and not so much batching in a much larger cluster. This happens in an eventually consistent system as well. >>: But how does the slope compare to the slope that you showed? >> Wyatt Lloyd: So I didn't publish the slope for the Cassandra system. But in my not fully rigorous experimentation with it, it looked exactly the same. >>: So for the Cassandra system and the Eiger systems, [inaudible] experiments, each of the writes [inaudible] replicated? >> Wyatt Lloyd: Oh, yeah. No, no, no. Sorry. So in these experiments Cassandra does not implement replicated state machines. So you take these results with a grain of salt, in that each write is being written onto a single server locally, and we would see different performance characteristics, perhaps a little bit worse, if we were doing something on top of that actual logical server abstraction. >>: For each of the systems, let's say basically it starts [inaudible] do you think the percentage difference basically still would be the same, or basically Eiger potentially may have more overhead? >> Wyatt Lloyd: So I think we did things very naively.
Eiger would have more overhead. An interesting direction that I didn't talk about is looking at these two levels of abstraction on top of one another. And I think, depending on what we put underneath, we'll have good performance for some workloads and worse performance for other workloads; so if you have a super write-heavy workload versus a super read-heavy workload, you want different abstractions underneath. And exploring this, I think, is very interesting. And also seeing, you know, can we do something that cuts across these levels of abstraction that would be more efficient? I think that's an interesting direction of research as well. Yeah. >>: So my understanding is that you communicate to a user that they've been migrated from one web server to another, or one data center to another, by logging them out and back in, so they have to log back in? >> Wyatt Lloyd: So no. So one, this is not implemented; this is at the level in front of our system. But the way that I would tell someone writing a web service in front of this is that in the user's cookie, [inaudible] something that is the last time, or the last logical time, this user interacted with the service and what data center they were using. And then there's stuff that I didn't talk about, like we have garbage collection that goes around and keeps track of how far behind the different data centers are, so that we can do things like say there are no dependencies on stuff from yesterday, for instance. So if you read data from back then, there are no dependencies. If you include that, you can do a check that says this user is using this data center, and once all the data centers are past what that user has seen, you can delete the state that says this user has been using data center X. So when they're on a plane this will happen, and when they get off the plane in the other location they can start using whatever data center their DNS sends them to, or whatever. >>: So you delay. So if the user, if the web server dies or the data center dies, you just delay them by, you just say the user can't use the system until you are sure that all of this data has been copied, has dependencies [inaudible]? >> Wyatt Lloyd: So if the user physically migrates, that will be fine, because they can only physically migrate slowly. If a user is directed to a different data center, we'll do this redirection to the data center they had been using, or we'll do triangle routing in the worst case. If that data center blows up, if that data center has a failure, that's sort of not ideal. We'll try to redirect things for a while, right? At some point you're going to give up, and you're going to say the user either has to wait for that data center to come back up, or the user is going to see inconsistent data. That's sort of an administrative policy that you have to decide. And in the worst case, let's say you write some data into this data center and we accept it locally right away and send a response back to you, and then this data center gets destroyed in, like, the earthquake that they tell us is coming, right? So in that case we could have data loss, because we've only written it in one place and we are optimistically sending it out elsewhere.
I think that's not going to happen in a high fraction of cases, because, look: here's the web server, here's the data storage. The web server writes something to the data storage, we write it right away and send something back, and concurrently with that, in the background, we're sending it out to the other data centers. So if it doesn't make it out to the other data centers, it's likely that before this web server responds [inaudible] with something that says your post went through, this web server will also be blown up. >>: I was just wondering how you communicate to the user, and it sounds like a combination of delays and non-communication. >> Wyatt Lloyd: So you can think of it as primarily triangle routing, and in the worst case we would either delay or show something inconsistent. Yeah. >>: So why would you show something inconsistent? [inaudible], right? >> Wyatt Lloyd: So if you have the user accessing data center X and data center X blows up, and some of their updates and some of the stuff that they read is no longer available anywhere else, you have to choose: do I have this person wait forever because that data center is no longer there, or do I show them the data that doesn't include everything that was there? This is this extreme failure case, right? >>: So it could be old data that's still causally consistent. >> Wyatt Lloyd: It would be old data that was causally consistent with itself, but the user might not see things like, I see your status update and we're using the same data center; the data center gets immediately blown up. We never have that update again. If I access another data center, I cannot see your status update. I might even lose it. >>: [inaudible]. >> Wyatt Lloyd: Yes. Exactly. >>: [inaudible] replicate and you don't know [inaudible], I mean there's nothing you can do about that. >> Wyatt Lloyd: Yes, yes. >>: So if you're comparing systems, [inaudible] system, what would you say [inaudible]? >> Wyatt Lloyd: So Spanner is not a low-latency system. So there's been a bunch of cool work pushing on both sides of this impossibility result. On the one side it's mainly my work; on the other side it's the Walter system from NYU and Marcos Aguilera. There's also this Gemini paper at OSDI that's looking at how we can make strongly consistent systems more efficient, and because of these impossibility results, none of these systems can guarantee low latency like we can. So in Spanner, they're doing at least one wide-area round trip. The initial motivation for Spanner was storing ad data. So with ad data it's reasonable for them to say we can have a reasonable amount of latency, because we are doing this wide-area agreement, but it's all going to be on the East Coast, for instance. So we're going to have only a small amount of latency there. In a social network setting you want everyone to be able to access this data, so if you only write things on the East Coast, anyone who wants to read that data has to access the East Coast for it. So there's a trade-off there; they're hiding a little bit by not distributing things globally. And they always have to do something in the wide area anyway. But they provide stronger consistency. So that's the trade-off. Yeah. >>: [inaudible] linearizability which is not achievable and then causal [inaudible] is right underneath there? >> Wyatt Lloyd: Yeah.
>>: Do you have a sense for something useful in between those two, or is causal plus the best way>> Wyatt Lloyd: So there's an awesome theoretical result out of UT Austin where they take, so we have causal consistency and we have causal plus consistency, and they have this thing called real-time causal consistency, which is very similar. And they have this very intricate, cool proof that says you cannot have a consistency model that is strictly stronger than real-time causal consistency in a low-latency, wide-area system. So there's nothing strictly stronger than that. There could be something that's incomparable that would be more useful, but no one's thought of it yet, and you can't prove something like that. But, yeah. This exists, and everything that we know of that is more useful, that is better, has proofs that say you can't have it. >>: What about [inaudible] transactions? >> Wyatt Lloyd: Sorry, I can't hear. >>: [inaudible] read only and write only transactions>> Wyatt Lloyd: Yes. >>: [inaudible] general [inaudible]? >> Wyatt Lloyd: So interactive transactions are very tricky. General transactions that are greater in scope than what we have, but less than fully interactive, where I can start a transaction at the beginning of the day and then at the end of the day try to commit it, things in between are really interesting, and I need them to transfer money between our bank accounts and things like this. They're very tricky to handle in a low-latency way, because as soon as you give me read-write transactions, the first thing I want to do is implement mutual exclusion, right? This field is empty, I grab the lock. But you're somewhere else, and if you're doing this at low latency, the field is also empty for you and you grab the lock. So if we are doing read-write transactions and we're trying to guarantee low latency, we have to have these really funky conflict resolution policies, which some systems like Bayou have, but which I think are very difficult for programmers to use in a meaningful way, generally. So when I talk about general transactions there, I have this bullet point where I say we can't guarantee low latency. I think we can't guarantee low latency because the thing about read-write transactions is that the consistency I need underneath them is stronger than causal consistency. >>: [inaudible] an interactive phase for read transactions, like the same as you do right now for write transactions, or [inaudible] with just reading interactively. I mean, we don't know who my friends are or who my transaction is with. They want to read the posts of your friends [inaudible]. >> Wyatt Lloyd: Two phases of reads, like you read some stuff and then you read some stuff after that. So I think that's super useful, and it's very interesting, and it's something that I started to design and then I had to graduate. But it's very useful, and I think we probably could do that in a low-latency way. Yeah. >>: Let's thank the speaker again.