>> John Douceur: Well, good morning, everyone. It's my pleasure to welcome Chris Stewart, who is an assistant professor at the Ohio State University, who is visiting us for the day as part of a larger West Coast tour, we understand. Chris's interests are in network and distributed systems, and particularly two sort of areas. One in sustainability, and one in performance. And based on the title of this talk, I assume we're going to be hearing more about the latter. So take it away.

>> Christopher Stewart: All right. Excellent. So if I were Troy McClure, I would start this talk with, you may remember me from such research as. For those many good friends, happy to see you guys again. For those just meeting me, I graduated from the University of Rochester late 2008. My work there was mostly on modeling and system management for internet services. So there, the idea, we started with some early work, published in NSDI '05, my second year of grad school, and we had some works going all the way up through 2008. And the basic idea there was to come up with some first principle models that gave us good ideas about what response time would look like as we changed the number of nodes in a system so we could do things like capacity planning. Anomaly detection came a little bit later.

One of the great things about doing a set of slides like this is you get time to sort of think about looking back at the high level contributions that these things had, right? So here I think with this particular work especially, the contribution was on the simplicity of the model. So we started looking at queueing theory, which at the time people weren't looking really too deeply into for internet services. I think there was a paper from Duke and [indiscernible] group was working on similar stuff. And we showed how to use system metrics that you can routinely collect at some middle layer -- we instrumented the OS, but today it's done in a lot of middleware systems.
We showed how to collect these metrics and how to apply them in a fashion that system managers could actually understand and implement. So when I was at HP for basically the latter part of my Ph.D. thesis, which I spent working with folks there, we ended up shipping this out-scope to a product group, and this paper, I'm told, influenced some other people at other companies that ended up being able to apply these in practice.

So around the time this was winding down, conveniently, it was around the time I was graduating. So figured it was time to switch areas and I did a one-year post-doc with my advisor. So I started looking at performance anomaly detection. And here the idea is to use these models that we were developing before and use them now as the base system. Instead of trying to use models to predict what the real system is like, we want to say hey, the real system should look like these models. And here, there was a really nice interplay between some machine learning techniques, especially, here. So we looked at trying to characterize the whole space for this anomaly detection stuff.

Most recently, or second most recently, I left the systems community altogether and started looking at datacenter architectures in particular. In 2009, early 2009, I was caught up in the post-election cycle, I was all about renewable energy at the time. So I was looking at how can we power datacenters directly with solar panels. And we published one of the first papers on that at HotPower. It's been pretty well received. So now there's an emerging community in the sustainable computing area, and I'm the editor of the Sustainable Computing Register, which is an IEEE publication from the IEEE STC on sustainable computing. I'll say that again, the IEEE STC on sustainable computing. I encourage everyone to join. It's actually free. This is a new model for IEEE, where we're just trying to bring together experts, researchers in the field with a common interest area.
So after this, I decided, okay, let's get back to the heart of what I work on, and how can we do things that are very relevant to business, to industry in terms of performance and system management without losing the sustainability focus. And this is what my talk will be about today. And so I'm going to give you guys some discussion about actually some things that are very recent. So these are papers that are in the pipeline, stuff that hasn't been published yet. So I appreciate your feedback, and you can ask questions at any time. I think I have been given an hour, but the talk is about 40 minutes without interruption. So feel free to ask questions at any time, and I can manage. If I can't answer, I'll just pretend like I'm running out of time and we'll go from there.

So at a high level, my current research goal comes from the intersection of three trends. So from the hardware side, SSD is faster than disk. PCM is faster than disk, and the total capacity of on-chip caches is still growing. So we can still fit more transistors on chip at least for a few more years. So we have the ability to access more data quickly than we ever did before.

On the software side, that has been a timely occurrence, because there's been this huge data explosion. So we want to perform analytics on top of any and everything. By we, I should say there are certain companies that want to perform analytics on top of any and everything, for whom analytics is their profit margin. So being able to predict what people are doing is very important. And here, intuitively at a high level, the closer to realtime you can get with data analytics, presumably the better your profit margins can become. It doesn't hurt you to be able to do things faster.

And then finally, the human aspect, the human interaction aspect. Some would say this is surprising. Still isn't getting much better. So the time that you have to do all of this data analytics is still about the same.
So there's a really neat website called humanbenchmark.com. They just show you a red box and they ask you to click as soon as this red box turns green. So just measuring your response time. This is the range of the scores, 215 to about 500 milliseconds. Still the same place it was in the early 2000s when we first started looking at --

>>: How much of that is the [indiscernible] from the brain to the fingers?

>> Christopher Stewart: I don't know. So I think that --

>>: Did you ever stop to notice what time takes.

>> Christopher Stewart: Yeah, I don't know why we're so slow. In particular, I don't know why I, myself, am slower than others, actually. When I did this benchmark, I lose upward of 500 milliseconds plus. So there is a biological answer here, and at a high level, you could say maybe we could find ways to reinvent computers to circumvent some of these biological constraints, right. Maybe allowing response time to be a function of how fast we can blink, rather than click. Maybe that's a little tangential to the main point here, but it's an interesting thought.

So the way I view these all sort of combining together, we're going to want to be able to run data analytics on moderate scale, moderate amounts of data, very quickly, in as close to realtime as possible. So we're looking at emerging applications that are defined not by the interaction time between a service and a human, but between a machine and a machine. And we need to do a lot of those interactions before we go back out to the human. So at a high level, we're trying to get at emerging applications that have a demand for very low latency but also very strict SLAs. And I guess I'll define that on the next slide.

So the general outline here is first, I'm going to talk about the types of applications that I've been targeting in my research group. Then we'll talk very specifically about our key value store that supports these types of applications, show some results and conclude.
So when I say very strict SLA with low latency, by low latency, we're talking about the bound in a service level agreement. So K percent of requests must complete within X amount of time, where X amount of time is defined in terms of milliseconds. K percent of requests is defined in terms of number of nines. So this long-held goal of trying to get to multiple nines, like the telephone industry, for access times.

So here on the bottom, I'm showing the difference between what I see as emerging services and traditional services. Your traditional service had a couple of different layers. But in the end, when it got to the data store layer, you only had a few machines issuing accesses before you responded back. For instance, you did one look-up in a user query table to see what their most recent purchases were, or to retrieve their cookies or something like that.

With these emerging services, what we're looking at is middleware that sort of encapsulates a lot of what was previously done at the database layer. So transaction processing. It also performs some complicated data analytics business logic and is issuing a lot of requests to your back-end data store. Here the thickness of the lines indicates the number of accesses issued by these components. So why do they issue more accesses than previously? Two reasons. One, you're doing more complicated tasks, like data analytics of some sort. And these back ends have gotten simpler. So where a database gave you the ability, for instance, to do transactions, your key value stores may just give you the ability to do gets and puts. And --

>>: Is this because when you say data analytics, I'm trying to figure out what your scoping target is, because there's one kind of data analytics, which is like [indiscernible] over web pages and stuff like that. And there's another type of data analytics, when you think about Facebook, things like that, so somebody's profile page loads.
You're trying to figure out which posts should be shown, stuff like that. So there's much different scale requirements, stuff like that, for both of those. And so can you give some example applications, like --

>> Christopher Stewart: Right, right. So at a high level, you can follow the -- those will highlight and those are all example applications that I'm going to talk about. And on the next slide, James, it's coming, it's coming. But to answer your question directly, it would be closer towards this Facebook side. So a moderate amount of data that needs to be analyzed and searched, and very quickly.

>>: The phone company, after all, the lifeline service, 99.9% reliability makes sense. If you're trying to make money, wouldn't 90% do it? You make money on nine customers and you lose on the tenth. If you're ahead of the game in other ways --

>> Christopher Stewart: Yeah, so the reason here is the user isn't involved on all of the requests here. So you're right. You know, if one out of ten users has a bad experience, maybe we're willing to deal with it. The problem here is most of your accesses are happening at this layer. That is, machine to machine interactions. So a few slow accesses, every 1,000 requests, means that everybody is going to see a very slow service. That's what we're getting at.

>>: But the user level, you lose 90% of the [indiscernible].

>> Christopher Stewart: Exactly, exactly. So take a look at some concrete applications, right. So one domain that we've been targeting is the scientific cloud. The goal here is not to replace the supercomputer. That works very well for the types of applications that it targets. But there are many scientific applications that don't utilize supercomputers well. So these, if you run them, they run on a small portion of the supercomputer. They don't need to run for months at a time. Maybe they're running for weeks now, or several days.
And what we would like to do is make those simulations move from an order of magnitude of days to hours, or hours to minutes. And what I'm showing you here is a concrete example. So we're looking at a smart grid simulator that we've instrumented in my lab. And one of the collaborators that I've been working with out here, Bonneville Power, I think they're based in Oregon. I don't know if they reach all the way out here or not. Maybe not.

So one question you might ask is, okay, there's been an increase in demand, so should I increase the amount of hydropower that I'm producing, or should I ship electricity in from elsewhere? And if you have a smart grid, you also have devices that may adjust their behavior if you tell them to. Well, how does all of this play out? You'd really want to have an idea of this, so you'd like to be able to run simulations to get an idea of what's going to happen before you do it.

So the way this particular grid simulator works is you have a master node that's going to set up a lot of, for instance, global variables and just, you know, do some maintenance work. And then it's going to start all of these different simulation agents. For instance, if you're modeling houses that all have smart air conditioners that are going to react to messages from the power company to indicate when electricity is quite costly and then turn up the heat, right, you want to have those modeled. You want to have the free market price changes modeled. For instance, if you start producing more hydro, that's going to affect prices. And you need to model people's personal behaviors. All of these simulated agents have input and output files that they need to access in order to do a correct simulation, as well as global variables that need to be accessed consistently over time and shared across these different agents. So one more [indiscernible]. So one of these simulations --
One of these relatively short simulations, so just simulating the effects across three days, can issue over 10,000 storage accesses. And right now, if you just put this on one node and ran it, you're talking about a little over an hour -- no, a little over two hours to do a single computation, which is too slow, because your power operator really would like to be able to do these simulations on the order of minutes to get an idea of demand response.

>>: So are these jobs IO bound? Are they [indiscernible] bound?

>> Christopher Stewart: They're a mix. So at some points, they go through different phases. So at some points, they may be IO bound. At some points, they may be CPU bound, because, for instance, maybe an agent is trying to update or model a person or something like that.

So once again, going back to the earlier question, so if you have 10,000 requests in here, if you're only meeting an SLA of 95 percent, that means you're going to have a lot of slow accesses, right. Five percent of 10K is a relatively large number. So you really need to have a high SLA to guarantee you'll have low latency on all of the accesses issued by the smart grid simulator.

>>: [indiscernible] try to see how that follows, because it seems like if you model one of these agents, has to get through 10,000 accesses, it seems like what would drive the total simulation time is the average access time, not the worst case access time.

>> Christopher Stewart: Right, right. So the intuition here is that this is a time step type simulation. So in order to simulate the next minute, you need the results from the prior minute. So if you have a couple of slow accesses, what they're going to do is slow down -- they'll prevent you from moving forward. Whereas if you -- for instance, if this were more like a web server, right, where just everybody throws all their requests against it and you get what comes back, then you care about the average, right.
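As an illustrative aside (not part of the talk), the arithmetic behind "five percent of 10K is a relatively large number" can be sketched in a few lines. The function names are hypothetical, and the second function assumes accesses miss the latency bound independently of one another, which the talk does not claim.

```python
def expected_slow_accesses(num_accesses: int, sla_fraction: float) -> float:
    """Expected number of accesses that miss the SLA latency bound."""
    return num_accesses * (1.0 - sla_fraction)


def prob_all_within_bound(num_accesses: int, sla_fraction: float) -> float:
    """Chance that every access meets the bound (assumes independent accesses)."""
    return sla_fraction ** num_accesses


if __name__ == "__main__":
    # 10,000 accesses is the figure quoted for the three-day grid simulation.
    for sla in (0.95, 0.99, 0.999, 0.9999):
        slow = expected_slow_accesses(10_000, sla)
        clean = prob_all_within_bound(10_000, sla)
        print(f"SLA {sla:.2%}: ~{slow:.0f} slow accesses, "
              f"P(no slow access) = {clean:.3g}")
```

Under this simple model, a 95 percent SLA leaves about 500 slow accesses in a 10,000-access run, which is why a time-stepped job needs many more nines than a single user-facing request does.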
But here you can be -- we'll show some data on exactly how that has an effect a little later.

>>: So it's like the [indiscernible] problem, you get a barrier.

>> Christopher Stewart: Exactly, exactly. So looking at some of the data analytics features -- uh-huh?

>>: Sorry. I guess intuitively, it seems like can't they just have a rule of thumb, and a formula they plug in how many smart air conditioners they have, and the cost, couldn't you get a good approximation that you could run in a second? Why do you need to do this elaborate simulation? [indiscernible]. Is it such a chaotic system where small change and -- I mean, I can't imagine that --

>> Christopher Stewart: Well, the smart grid, if its ultimate glory were ever realized, would be an incredibly complex distributed system. You can imagine if everyone had their own meter that's fluctuating and changing over time and people have the ability, for instance, to turn that meter on and off, right, it can be -- now, I think your question is very valid. What is the actual impact of those things, right. Is it that it just sort of converges to some global average? That I don't know. But presumably, we would like to be able to run the test to see if that happens. This is just something that has been of interest to the folks that are building the smart grid. You see how that works out in practice.

So from the data analytics perspective, I guess what we're looking at, the example I'm providing here is in terms of email. What if you wanted to do some sort of targeted ad placement, and actually some previous work here from Microsoft sort of motivated this. So they were looking at searches at Bing and they found that if you could look at the recent log of searches that had been performed, you could improve the recall accuracy by something like 25 to 50 percent. It was an interesting study. More generally, I think social networks and whatnot provide even deeper data here.
Like so can you detect when a crowd is -- when there's some sort of crowd phenomena happening, or when there's a trend that is changing the meaning or context of words, for better targeting of ads, right. So here, the challenge is the data's changing frequently. So you could use some old data, which is what people pretty much do nowadays. But more up to date is better. And it would be even better if you could do some sort of dynamic feature extraction from the query.

And to give a very concrete example, we sort of view this as being very akin to deep Q&A, like the IBM Watson project. What happens here is you give input of some text that, for instance in Watson, may have been a question or, I'm sorry, a sentence. He said this, and then you want to extract from that text not just keywords, but some context also. And the way this is done, actually, was very interesting to me as we looked into the NLP approaches to this. I call it a big set of if statements. So we have an idea of what context looks like here, here, here, here. We need to check against all of these. And based on that, you issue a lot of storage accesses and try to retrieve relevant content. So this idea sort of follows into a MapReduce model. But the key idea is you're time bound. So Watson, they were trying to do two to three seconds to win at Jeopardy.

By the way, this quote, does anybody know this one? Simplicity is a great virtue, but it takes hard work to achieve and education to appreciate. This is Edsger Dijkstra. One of my favorites. So ideally, you guys will believe that my ideas are relatively simple. But also impactful. That's where we're trying to go with this.

So once again, this is a real system that we have up and running, so we have GridLAB-D up and running in my lab. We have OpenEphyra, which is actually what the earliest versions of Watson were built on, running on our system with data sets. One fun data set is from USENIX.
So we've culled all the USENIX web pages and we can tell you about your history of service to that community, in particular, what PCs and whatnot. And then finally, because I do still care about the sustainability side -- uh-huh?

>>: We were talking about stragglers before. So to first approximation, what fraction of the problems that you're trying to solve are, quote unquote, just fragment problems?

>> Christopher Stewart: We hear a lot about stragglers. Actually --

>>: Would you say that's the main thing that keeps people from satisfying the SLAs, the fact that if there were no stragglers then satisfying SLAs would actually be fairly straightforward. But the fact is there are stragglers that straggle different ways.

>> Christopher Stewart: There's still some issues with just queueing and managing workload fluctuations, but this talk is specifically about the stragglers. So when we get into some of the more technical aspects, that's what we will be talking about. So yeah.

So the idea here is you have these, with the ability to beat human perception now, pretty easily, computers are fast enough that we can respond pretty quickly, one way to think about that is it can give you some leeway with the SLA. Some opportunities to invest in other things when you would have originally invested in, for instance, more nodes. So our idea here is we want to meet an SLA using as few resources as possible, and then exploit other competitive advantages. For instance, going carbon neutral, Microsoft, right? And how can we turn those into profit?

So another application that we've set up in my lab, you'll notice the URL is just sort of like a random IP address. Greenmail or mantismail or I don't know. We have to come up with some internal name for it. This is a carbon neutral email IMAP cache, and what we'd like to be able to do is provide differentiated SLAs to different users based on their desired level of greenness.
That is, the amount of carbon offsets that they would like to receive as they access the site and the frequency with which they access the site. Here, once again, we want to use as few nodes to meet the SLA as possible so that we can have as many -- so that we can buy, within a budget, as many carbon offsets as possible for our sets of users. So this is another one that we have set up.

All of this comes together for the technical focus of this talk, which is Zoolander. This is our key value store on the back end. So without further delay, the basic idea behind Zoolander is this. For years, we've been looking at using replication and partitioning as our ways to manage SLAs, right. And so here I'm depicting an awful JPEG figure. The basic concept's here. So partitioning means you have some set of keys or groups of keys, often called shards, so you put keys together. And any writes that are going to one shard go to one node. Writes going to this other shard go to another node, right. Replication, traditional replication, or what we call replication for throughput, is, hey, I want to evenly divide the accesses, so I'm just going to, whenever I get a write, I'll send it to one node. Next write, I'll send to another node, so forth and so forth.

Now, the basic idea behind both of these approaches is that you want to reduce queueing. You want to reduce the amount of work that each node has to do. And, you know, it's worked very well. It clearly can get us up to this 95 percentile SLAs. The challenges here, traditionally, have been in terms of managing hot spots and consistency. So if you do partitioning, if you have more accesses to one key than another, you have a hot spot, which means you've added a node, but you haven't divided the workload in half. If you do replication for throughput, sure you divide the accesses evenly, but in order to maintain a consistent view of data, the nodes have to interact with each other.
You still haven't really divided the workload in half. Now, that's the traditional problem. But as James was pointing out, we're actually concerned about a different problem. We're actually concerned about stragglers. And so the key problem here is even if you have perfect division, even distribution of workload, and very light arrival rates, your ability to reach these very strict SLAs that we need to ensure that something like 10,000 accesses all happen very quickly is actually still limited.

So here you're looking at an experiment that we did on ZooKeeper, where we had just a very small key range that we accessed with no concurrency. And on the X axis, you're looking at the latency of each access. The Y axis is the cumulative distribution. So your SLA that says something like 98% of accesses is just the question of reading horizontally along here to the intersection of the line, and down to the X axis, and that tells you your latency bound at that point.

This figure should be a little bit depressing, because what we're saying is you can see all of these very heavy tails here. And what it means, 98%, 96%, 91% for read-only accesses, writes, and writes to an actually dependable store can actually be processed within -- that is, that bar is three times the mean. So that means most of the time, that means a high -- I shouldn't say with a high probability. But getting to 99%, that is relatively close to the mean, isn't possible under just this experiment.

>>: This is a function of the [indiscernible], right. So if you [indiscernible] --

>> Christopher Stewart: No, no, no arrival rate here. This is just take an access, issue it, wait for it to come back. Issue another access, wait for it to come back. This is --

>>: [indiscernible].

>> Christopher Stewart: No concurrency. So you can divide this. Like I said, sort of depressing, right?

>>: [indiscernible].

>> Christopher Stewart: So I'll get there.
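As an illustrative aside (not part of the talk), reading an SLA off a cumulative distribution amounts to taking a percentile of the measured latencies. The sketch below uses the nearest-rank method; the function names and the sample numbers are made up for illustration and are not the ZooKeeper measurements from the slide.

```python
import math


def latency_at_percentile(latencies_ms, k_percent):
    """Smallest observed latency that at least k_percent of samples do not exceed."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: 1-based rank into the sorted samples.
    rank = max(1, math.ceil(k_percent * len(ordered) / 100.0))
    return ordered[rank - 1]


def meets_sla(latencies_ms, k_percent, bound_ms):
    """True if k_percent of accesses complete within bound_ms."""
    return latency_at_percentile(latencies_ms, k_percent) <= bound_ms


if __name__ == "__main__":
    # Hypothetical heavy-tailed trace: 96 fast accesses and 4 stragglers.
    samples = [1.0] * 96 + [10.0] * 4
    bound = 3.0 * (sum(samples) / len(samples))  # "three times the mean"
    for k in (91, 96, 98, 99):
        print(k, latency_at_percentile(samples, k), meets_sla(samples, k, bound))
```

With a trace shaped like this one, the bound at three times the mean is met up to the 96th percentile and missed beyond it, which mirrors the kind of ceiling the figure is described as showing.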
So the -- I just want to point out before that that this actually happens across other key value stores, right. So we've looked at Cassandra, ZooKeeper, Memcached. These are all popular key value stores that people are using. So what are the reasons for this? You have background jobs, right? So, for instance, Cassandra has this write buffer issue that, you know, you issue a write, it tries to respond as quickly as possible by keeping a buffer of your writes. This is great, high throughput, right? I can batch all of these buffered writes together.

The problem is, every so often, a request can get stuck behind one of these buffer writes. So if a buffer write occurs once every second and a typical access is taking you just one millisecond, then you have this good chance that -- not good chance, but you have 5 percent of your accesses are going to be affected by possibly this write buffer flush. Now, if the write buffer flush stalls the request, if it requires stalling the entire system, maybe a write buffer flush, you may say, doesn't have to do that. But definitely garbage collection or DNS timeouts could. Then you're going to talk about having almost a 10X increase for that particular request for these background jobs.

>>: It seems like there is an arrival rate. In other words, if there's a background job and there's some arrival rate, it's not that you're issuing the request in isolation. You're issuing the requests and other people are also issuing requests concurrently. Is that what's happening?

>> Christopher Stewart: You can call it queueing. You can call it queueing, but these aren't particularly requests that are coming from nodes in your system for storage access, right. This is a part of the system itself. Cassandra just keeps this write buffer.

>>: [inaudible]. Are you talking about background jobs, or things accessing.

>> Christopher Stewart: So just to maybe explain this example a little better. So Cassandra keeps a write buffer.
Eventually, you have to flush this write buffer just to keep your data dependable, to get it on a place that is reliable. That is a background job. The process of flushing the write buffer is a background job. Garbage collection would be what we're calling a background job here.

>>: Or updating or -- [indiscernible].

>> Christopher Stewart: So if we look at this, this is an interesting issue. So what we found is that these things that are causing the heavy tail tend not to be correlated with a specific access. So it's not a function of the input I'm giving to the key value store. It's just something that has to happen over time. And, as we saw if I go back, these are some pretty big delays. So every now and then, you just really get whammed with slow delays.

And this has a big impact. So earlier, somebody was asking me about the barriers and the impact of stragglers, right. So if we look at that GridLAB-D workload again, and now we're going to actually inject stragglers into the system. So on 2% of our accesses, we're going to inject a delay. And we're going to compare that to injecting the same delay, the same total amount of delay, across all 100 -- all accesses.

And the idea, the take-away from this is, so this dotted line represents linear slowdown. So if we inject across all accesses, we can actually do better than the linear slowdown. Why? Because we can mask things with computation. Because you have to compute, so you can compute while you're doing IO in the background. And it's a small delay. It's not a long delay. So this is okay. But if you focus these delays on just a couple of stragglers, you start to run into trouble. Because now, the delay is long enough that you're delaying these barriers as you go through.

>>: So you're saying that delaying all requests by 20% has less effect than going 2% of the --

>> Christopher Stewart: It's the same aggregate amount of delay that you're injecting.

>>: 20%. So you're saying the total delay and concentrating it into a small number?

>> Christopher Stewart: Exactly. So here, on this line, you may be adding, say, for instance, just 200 microseconds to each request, rather than divide by 50, two milliseconds to each request.

>>: The average delay, you're adding a series is the standard deviation, if you will, of the delay you're adding. In other words, at 20 --

>> Christopher Stewart: Exactly, exactly.

>>: So which line is the linear?

>> Christopher Stewart: The dotted line.

>>: Oh, the one that's --

>> Christopher Stewart: Yeah, sorry. So this is reversed. That is reversed. Whoops. Okay.

So here's the key insight behind Zoolander. I really like this work. So the key insight is we want to revisit this very old technique called replication for predictability, which I think falls in line with this old paper by Jeff Mogul called an old dumb idea whose time has come. The basic idea behind replication for predictability is you make N copies of your data. But instead of sending each request to one of N, you send all requests to all of N. And you take the fastest response back.

Now, and so it's a dumb idea. Why is it a dumb idea, just off the bat? You don't improve throughput this way. So you're issuing the same amount of work to all of your nodes. Why are you doing this? It turns out to be a useful idea now, because it reduces the variance. So that heavy tail, you're less likely to get stuck behind the heavy tail if you do something like this. And we're here showing this, right.

>>: You're reducing the capacity of the system by a third. So you're comparing now the 20% of the 2% to some [indiscernible]. Your hypothesis [indiscernible].

>> Christopher Stewart: Yep, yep, yep, yep, yep, yep, yep. We'll get to that. We'll get to how we can answer that. So here, we're just showing a high level depiction of what's happening here.
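As an illustrative aside (not part of the talk), the straggler-injection argument can be reproduced with a toy time-step model. It assumes each step overlaps one storage access with a fixed amount of computation and cannot finish until both are done, so the step takes the larger of the two. All numbers are invented for illustration, apart from the 200 microseconds per request mentioned above; this is not the GridLAB-D experiment.

```python
def run_time_us(access_latencies_us, compute_us):
    """Total run time when every step waits at a barrier for compute and its access."""
    return sum(max(compute_us, access) for access in access_latencies_us)


def spread_delay(base_us, num_accesses, extra_us):
    """Add the same small delay to every access."""
    return [base_us + extra_us] * num_accesses


def concentrated_delay(base_us, num_accesses, extra_us, num_stragglers):
    """Put the same aggregate delay onto only num_stragglers accesses."""
    per_straggler = extra_us * num_accesses // num_stragglers
    slow = [base_us + per_straggler] * num_stragglers
    fast = [base_us] * (num_accesses - num_stragglers)
    return slow + fast


if __name__ == "__main__":
    n, base, compute, extra = 1000, 1000, 1500, 200  # microseconds, assumed
    baseline = run_time_us([base] * n, compute)
    spread = run_time_us(spread_delay(base, n, extra), compute)
    focused = run_time_us(concentrated_delay(base, n, extra, 20), compute)  # 2%
    print(baseline, spread, focused)
```

In this toy setup the spread-out delay is hidden entirely behind computation, while the same total delay concentrated on 2 percent of accesses lengthens the run, because each straggler outlasts the computation it overlaps with.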
So here, along this strip and along this strip we have requests that are arriving. And here we're showing one of these anomalies happening. That is, replica 2, or when we do replication for predictability, to use different terms, we use a duplicate. So replica versus duplicate, partition versus duplicate. So the key point here is the second request on traditional approaches would get stuck behind this write buffer flush and wouldn't finish until here. And if you have a barrier, meaning this third get can't happen until this responds and it does some computation, well, you've delayed the entire execution as a function of the write buffer. That's what we just saw on the previous graph.

>>: So I'm probably getting ahead of you, but you probably are going to say this is on slide 22, but what about the sources that will happen? If you're doing the same thing on everybody, then maybe you'll trigger the write buffer flushes at the same time on everybody. The flush takes place at 3:00 p.m. How do you make them get skewed?

>> Christopher Stewart: We'll get to some of the things that we do to deal with that. So the key idea here is instead, just using two nodes, Jeremy, right, if you did duplication -- replication for predictability here, you would actually get this response back sooner and then you'd be able to start the third request earlier and you'd get a speedup here. So this is the situation in which it helps.

>> Christopher Stewart: Well, if you use the same number of nodes, it keeps power consumption the same. So here, you're just looking at two copies, two copies of two nodes running the same copy of data. Presumably, maybe you could argue if you're energy efficient and you're, you know, most nodes today, though, you know, you have two nodes running. It's 2X the power.

>>: Let's be clear. So that means that your claim is that power is essentially just a straight linear function of how many nodes you have in the system.
>> Christopher Stewart: Rather than the amount of data hosted on each node. >>: Or the number of requests it's processing. I guess that's what he's getting at. >> Christopher Stewart: Yeah. >>: Right now, seems like what you're doing is actually giving more work to the nodes. >> Christopher Stewart: That's fair. You're right. In that sense. So we haven't considered energy yet. If you are energy proportional as a function of the request accesses or actually as a function of the amount of data stored on the node, this approach, I could see some issues. >>: [indiscernible] that's the key point. >> Christopher Stewart: Yeah. >>: I can't buy the claim that it's just faster in an absolute sense sending to both. Because in a system that has a lot of competing requests, one of them, all of them are going to be delayed by the fact that, for example, duplicate two, which had more response, could have been processing somebody else's response instead. So it would have finished faster, I mean, you can make an argument that this particular request you're drawing finished faster than it otherwise would have, but that was at the expense of some other request that would have finished sooner and is now finishing much later because it got pushed back in the queue. >> Christopher Stewart: So there is -- yeah, so right now, I think we're trying to motivate the high level intuition, but we do not contend that this should replace all of the prior work that we've done on partitioning and replication for through-put and these other techniques. But maybe in some cases, this is something that we may want to insert and use when it is appropriate for certain appropriate workloads. So the answer is if you agree with me that this is a situation where it can benefit, then I should only have to convince you that I can figure out when this situation occurs and that this situation occurs in nontrivial cases that are not made up. 
>>: I think you've convinced me that the first request that comes in is faster, but it seems like all the subsequent requests would be slower. >>: In a system where you continue, for example, seeks on a disk, on a system where you have multiple [indiscernible], seems like this is going to increase queueing time. Because in this system, you're actually sending more requests to each disk. So it's having to move around, possibly wasting [indiscernible], it seems like, because in this example, the replication of predictability, so -- >> Christopher Stewart: So not to get too far ahead, but yes. If you have shared resources on the back end of this, if you have a shared disk, for instance, on the back end of this that would create problems, that would definitely be something that we would want to avoid. So there is a need to avoid having -- I'm sorry. I put my hand at the wrong place. So if these two duplicates had shared state here, right, you have to account for these dependencies. And this is something that we've tried to account for. >>: It's a shared state, right. You look at the workload, a duplicate of one versus replica of one, replica of one would have to process two requests. Duplicate of one would have to process three requests. There's more work duplicate one had to do than replica one. So if you had a situation where there was queueing, I think [indiscernible] then you're going to increase the queueing delay in the second case even though the integer request may finish sooner, you've increased the queueing because you've increased the load. >> Christopher Stewart: I agree. >>: A violation where you say look, it really doesn't work. >> Christopher Stewart: Yeah, I totally agree with you guys. In fact, I think your intuition is exactly right. So a key plot that I will show hopefully in the next ten minutes will be a function of utilization over time and how this approach compares with partitioning and replication for through-put. We are getting there. 
>>: My intuition is you showed us where the tail is ten, a hundred times. If it's more than two times, then you can afford to halve the number of machines that you have available so you can -- you double the queueing [indiscernible], it will move the mean over by a factor of two, but you're still better off because the tail -- >> Christopher Stewart: And your intuition is exactly right as well. >>: That's a problem. >> Christopher Stewart: So go ahead. >>: So data consistency, so if you're [indiscernible] if you really want to be [indiscernible]. >> Christopher Stewart: Yeah, we'll get to this. Actually, I won't focus on this too much in the talk. At a high level, what I can -- at a high level what I can say to you now is so you saw this Gridlab-D is hosting shared variables, right. So it requires consistent access. So what I'm going to tell you, at a high level right now, and sort of skip over a little bit in this talk just because short on time is our key value store supports that. So we can do strong consistency. We can do [indiscernible] and write consistency, or we can do write order consistency. And I'll describe some of the mechanisms for that support, but we could talk about it offline to talk more about exactly how to do that. >>: We'll talk a little [indiscernible]. >> Christopher Stewart: So now, we're not the first people to do that. Actually, Microsoft either in its hires or in its own papers have been doing this before us so Mantri at OSDI, these guys were looking at doing this MapReduce task scheduling and they were running things twice and taking -- running two copies of the same task, taking the one that responded fastest, under certain nested conditions. >>: The first Google paper does that. Regional map [indiscernible] they just do it in a smarter way. The general idea of the assigning two replicas [indiscernible]. Paper by Google. >> Christopher Stewart: Okay. I believe you. 
So and then Peter's work, SCADS at FAST, is actually I think the most related. So they were looking at specifically high SLA key value store and they were issuing their reads to two copies and waiting for the fastest one to respond. So the difference between our work is we're providing full support for this approach. So we actually think this approach is really neat. Rather than just trying this out on a couple of duplicates, sort of conservatively, saying hey, we've got two copies and we only want to do it for reads, we want to support reads and writes at scale. This is our goal. I'll show you some tests on up to a 32-node cluster for a single partition. So in order to provide that full support, we have to do a couple of things. We have to manage these read and write accesses for consistency. Support scaling out. And, importantly, for this talk, what I'll focus on is the modeling, because this is the part that I really love. And, of course, there's also adapting to workload change, failure recovery, some of these other traditional aspects to managing a data store, which we deal with also, but I won't focus on too much here. So I want to give you a high level idea of the type of support that we have to have and how we go about addressing that. So if we add this into the mix, our whole point here is we want to add this into the mix. So we want to be able to scale the number of replicas, but we don't want to say that this is all we want to do, right. We want to also do partitioning when it's appropriate. We want to do the others. The problem is we've just had an SLA violation. How do we decide which of these approaches is the appropriate approach to take to resolve this SLA violation? And so for that, what we're going to do is come up with a performance model that is a little bit different from past models. 
Rather than predicting something like average latency, right, what we want to predict here is the service level, where the service level, if you recall from some previous slides, is that percentage of requests that will complete within a latency bound. So this is an important point here. We're predicting percentages here, not latency. >>: You say it's an SLA violation. It isn't actually an SLA violation yet, right, that you're expecting. But it's 4% of requests that only take 15 milliseconds. You're allowed up to 5%. Isn't that the case? >> Christopher Stewart: No, we're trying to hit SLAs that are multiple nines, right. So no less than 99%, this is what we want to -- >>: You're not trying to predict the fact that we want to take action now so we don't really mess up. >> Christopher Stewart: Yeah, you really missed it. So here's sort of the bird's eye view of our model. So we're taking as input some system metrics, like arrival rate, service time distributions, and network latency. As well as target SLA in terms of service level and latency bound. And I'm going to show you a whole bunch of equations inside of here that really are just a function of some first principles about the way replication for predictability should work based on the example that I've shown you. At a high level, just the model by itself is going to output the expected service level. So we're going to get a percentage out. You give us a latency bound, we tell you what kind of an SLA you could actually support at that latency bound. The way you really want to use this is you want to iterate over this, right, so you want to take as an output that service level, compare it to your target, see, well, maybe I'm not meeting the target yet so I should change my node layout some kind of way, and iterate over here. So there's a replication policy as well. 
And that replication policy is going to tell us should I use partitioning, how much partitioning should I use, how much replication for predictability. And our -- the jargon that we'll use to do that is say, for instance, we'll have four one-duplicate partitions. So here, each shard is on an individual node. So this is just traditional partitioning, and two duplicate partitions means we have two nodes, each serving these shards. So I'll sort of fall back to this in explaining a little later. But this is the jargon. >>: When you say service time distributions, you're talking about here being observed [indiscernible] of -- >> Christopher Stewart: Yes. >>: [indiscernible] that's I guess Jay's question. >> Christopher Stewart: Yeah, they can be, yes. They can and they cannot be. Our model considers both, but we focus on the case where they're the same. >>: Anybody tried to implicate the shard by [indiscernible]. So you ask for three copies, any -- >> Christopher Stewart: Yes, so there's this relationship to quorums, right. Quorums are great if you want to do this on top of reads. But when you start dealing with consistency in writes, you need some more support. So the key idea here is this. So there would actually be two problems if we did this fully. So if we gave you a model that accurately predicted what happened when you added partitions or when you did replication for through-put and you had this consistency, that's a problem in its own right. And people have been struggling with how to model and understand the effects of hot spots. Our approach sort of simplifies that problem by saying, look, existing systems give you great support for partitioning and traditional replication, and we're just going to assume that they're really good at it. So we're going to bias in favor of these approaches and bias against this new approach that we're arguing that you should revisit, replication for predictability. How do we do that? 
We can just assume that accesses are going to be evenly distributed across partitions. So that means we're increasing the performance we would expect to see from partitions or replication for through-put. And then we're going to be accurate on replication for predictability. So that's what this model does. And the basic intuition here is we look at a CDF distribution like this one, and from the CDF distribution, we take our latency bound. We're looking at a bound of about 15 milliseconds. We look at the intersection point with the line and that tells us the probability of an SLA violation with no concurrency, no queueing, not considering any of the dependencies that happen between network latency and, also, possibly workload changes. So here's a whole big chunk of formulas. I'll motivate really just the intuition here. So the probability that an SLA violation is going to occur is going to be one minus the result from the CDF. We can come up with a very general formula if we don't want to assume that the CDFs are the same across the duplicates. But if we do make that assumption, this actually just boils down to good old calculus of geometric series, where the expected SLA is one minus -- oh, typo. Raised to the N here. Gosh. It's bad when you have things highlighted that it's a typo. So one minus the probability of SLA violation raised to the N. So that means in order to have a violation, all of our duplicates have to incur this violation independently, together. So we do consider two key dependencies. So you've got network latency and you've got queueing delay. Both of these things are going to make it harder to meet a latency bound for all of your duplicates, because they all suffer from these measures, these effects. So we use an M/G/1 queueing delay model and we just use the CDF, the cumulative distribution function of your network latency. And the way, the novel idea is how do you incorporate these. So going back here, tau is our latency bound. 
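The core formula just stated (expected service level is one minus the per-duplicate violation probability raised to the N) can be sketched as follows. This assumes identical, independent duplicates and leaves out queueing delay and network latency; the function names and the example numbers are illustrative, not taken from the slides.

```python
def service_level(p_meet_one, n_duplicates):
    """Expected fraction of requests that meet the latency bound.

    p_meet_one is the CDF value at the latency bound for a single duplicate.
    A request misses the bound only if every duplicate misses it.
    """
    p_violate_one = 1.0 - p_meet_one
    return 1.0 - p_violate_one ** n_duplicates


def duplicates_needed(p_meet_one, target, max_duplicates=64):
    """Smallest duplicate count whose predicted service level reaches target."""
    for n in range(1, max_duplicates + 1):
        if service_level(p_meet_one, n) >= target:
            return n
    return None  # target not reachable within max_duplicates
```

For instance, if a single node meets the bound 90% of the time, two duplicates are predicted to meet it about 99% of the time and three about 99.9% of the time, which is the "adding nines" effect this model is meant to capture. The second function is one simple way to iterate the model against a target, as described earlier.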
So our point here is that adding in queueing theory or adding in queueing delay and adding in network delay affects your tau for each node. So now we have a model that is a function, that has this modified tau for each node, modified latency bound. And workload changes require some additional support like one-time monitoring, which we're also doing to deal with changes. >>: So you're just using the average value of the average queueing delay, then, or are you actually modeling the variability? >> Christopher Stewart: So we are using the average queueing delay, the average network latency. We could go more complex, but in a couple of slides, I'm going to show you that we're pretty accurate so we decided to keep it as simple as possible for now. There is, I think, another metrics paper here that looks more into that variability, but we're not at that point yet. So here's the key plot on the model's accuracy. So we issued nothing but writes. Still not dealing with queueing delay at all. So bear with me on that. One million accesses, and we add a duplicate after each one of these rounds of tests. And recall, we're predicting percentages here. So absolute error, as you've typically seen it, isn't really relevant here. So we look at the absolute percentage point error between our predicted percentage and what we actually observe. And you basically see only one line here. Which is great. It shows that our model is accurately predicting the service level as we scale up. And to make that more clear, each point we're plotting the prediction error, which in terms of absolute percentage points, is pretty darn low. >>: So what is your [indiscernible], is that simulation or is this actual hardware? >> Christopher Stewart: This is on actual hardware. We just ran Zookeeper. Zookeeper with one node apiece. So this is a very fast version of Zookeeper. Yeah, this is on our -- >>: When you say service system, you're now in the replication version, or there's partitions? 
>> Christopher Stewart: These are just duplicates. These are -- so we're only looking at our ability to predict the behavior of replication for predictability right now. Because that's where we want to be accurate. >>: So when you say [indiscernible]. >> Christopher Stewart: Exactly, exactly. And so the other take-away from this is not just our accuracy, but how high we can get. So you can start to see an idea of the level of 9s that you can achieve through this approach that we've historically dismissed because it doesn't improve through-put, but it does improve predictability a lot. >>: These are duplicates, meaning they don't know about each other? Seems like if you're doing writes then the read is going to be the worst case [indiscernible] you have to wait for all eight to come back and see which one's the most recent? >>: [inaudible]. >> Christopher Stewart: So -- >>: Is there a story about that, or should I wait? >> Christopher Stewart: Yeah, so the story about that is there's not a story about that here. Maybe we'll talk about it offline. We deal with that. You're right, reads can be slower. We evaluate that. It's not bad. It's a function of your through-put slowing down, not a function of your latency slowing down, which is what we really care about. But we'll talk about that. >>: How many trials is this to get you to this point? >> Christopher Stewart: One million. One million accesses apiece. >>: [indiscernible]. >> Christopher Stewart: So once again, this is percentage point error. So we're looking at -- and this is very different from latency. So you're looking at a tail where a small percentage point error can correspond actually to a lot of latency. So even a one percent error, for instance, at the 99 percentile, actually would be awful in terms of latency. A one percent error would mean that you were mispredicting by almost 3X, right? The difference between the 99 percentile and the 99.9 percentile. 
So take that with a little bit of a grain of salt here. This is a metric on percentage points, not on latency. So it's a little different. >>: To restate Jay's astonishment, it seems like with real hardware, it's unlikely that you could even run benchmark twice. Forget prediction for a second. It seems like repeatability wouldn't even reach that level, let alone predictability. Especially if you have these asynchronous background tasks like buffer flushes that are happening out of your control or DNS run. The list of explanations that you gave for -- >> Christopher Stewart: Yeah, so -- >>: Over a million. >> Christopher Stewart: You're looking over a large sample space is the whole point here. So if I, for instance, tried to convince you -- actually, yeah, my students have trouble with this all the time. They often want to run smaller experiments. Hey, look, we want to understand 99.9%. Just run 3,000 experiments, right. But that only gives you three sample points. So now your variability is very high if you're trying to understand this 0.009 level. But, you know, you run a million, that variability is going to go toward a mean of some sort, right. Just using the law of big numbers. >>: You're still only getting a thousand points, right? If you're trying to figure out your 99.99 percentile, then it's dictated by a thousand out of a million requests. >> Christopher Stewart: Five is enough for a T-test. A thousand is quite a bit higher than that, right? >>: I would have expected prediction errors of one in a million, given an N of a thousand. Maybe I'm misunderstanding what the prediction error is. It's the prediction error of what? Of the service level? >> Christopher Stewart: Yeah, so it's, for instance, we predict 99.10%, and we observe 99.102%. Something like that. So, well, in any case, we'll sort of move on past this. I would like to talk a little bit about queueing and maybe breeze through the actual results before taking questions. 
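The sample-size arithmetic in this exchange can be checked with a short helper. This is only a sketch of the counting argument (how many requests are expected to land beyond a given percentile); the function name is made up for illustration.

```python
def tail_sample_count(num_requests, percentile):
    """Expected number of requests that fall beyond the given percentile."""
    tail_fraction = (100.0 - percentile) / 100.0
    # Round away floating-point noise so 3000 at 99.9 reads as 3.0.
    return round(num_requests * tail_fraction, 6)
```

With 3,000 requests only about 3 fall beyond the 99.9th percentile, which is the speaker's point about small experiments. With one million requests about 1,000 fall beyond the 99.9th percentile, and about 100 fall beyond the 99.99th.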
So what we're looking at here is probably the important plot for those that want to understand how do you combine these two techniques together. So on the X axis, we're looking at utilization. So we're increasing the utilization on each duplicate up to one. And on the Y axis, we're looking at the expected service level, according to our model. So we're able to understand the performance of using just replication for predictability only using the results that I just showed you. So that's just if we have four duplicates. Their utilization level will all be the same. And you just feed that into our model. We can also understand the effect of using these traditional approaches only, right, where we biased toward no consistency overhead, evenly distributed partitions. And then we compare what we're claiming, what we really want to sort of move people toward doing, what we're doing in my group, and it's working pretty well, is the mixed approach, where we're doing a little bit of both, depending on the workload. So first thing to note is that we really should reconsider replication for predictability in our system design. Why? Because if you want to have a very tight latency bound, here we're looking at 3.5 milliseconds, which is just a little bit longer than our mean latency, I think about 25% longer, you're just, you can achieve service levels that you just can't achieve otherwise. So if we use four duplicates, we can get a service level that's pretty darn high. But if we use just one duplicate, we're talking about orders of magnitude lower. I'm sorry. If we used only replication for predictability. So that is if we just distributed each access to just one node and didn't get any of the variability gains that we get from replication for predictability. If we -- the second take-away here is the comparison between these two plots. 
So the problem with replication for predictability is when you have queueing, as queueing increases, you get this huge drop-off, because once queueing gets to be larger than the benefit that you're getting from cutting into this tail, you know, the performance benefit goes straight down, because your node is basically fully utilized and you're not getting any performance, any gains. So if we did this at the three and a half millisecond level, even though we can achieve very high SLAs under very light workload, as soon as the utilization gets larger, we hit a cliff. Now, that said if you increase tau up to 15 milliseconds, you can go a lot longer. So take away two is we revisit this, because for realistic latency bounds that we care about now on the order of maybe tens of milliseconds, we can still even just using this approach get up to a decent level of utilization. >>: So in practice, what was the utilization you see? >> Christopher Stewart: So we're looking at 20 to 30 percent utilization for a key value store that needs to give you very low response times. >>: So like that's what you'd see if you looked at Facebook, you worked at those workloads. >> Christopher Stewart: I don't know Facebook's, but -- >>: For reasonable workload generation. >> Christopher Stewart: Right, right. The issue here is we've long known that queueing delay means low response times. And while there was a lot of work maybe mid 2000s that was looking at consolidation and how can we consolidate things to increase the utilization, really, once you get around 50% utilization, people get skittish about the effects of queueing delay. So I think in practice, you're really looking somewhere around this range of, well, I mean, some people run as low as 10%. For instance, our CIO status center, they just don't use -- they have no utilization at all. Some more cost efficient approaches may push you up to 55%. And then the final, most important take-away from this is hey look. 
You don't have to deal with this drop-off at all if you take a more modest approach. So if you look at the utilization that you're facing and decide to use a mixture of replication for predictability and partitioning, you can get the same type of curve that you would see here with just using partitioning only, but you can just tick up the service level that you can meet by using the two in combo. All right. Big picture slide. I won't take any more questions so I can finish in a couple more minutes. Big picture slide. This is an early depiction of our implementation. The key idea to support writes and write order consistency is that we issue all writes through a message repeater, a multicast message repeater. Ideally, that would be in hardware. We did it in software and we still got pretty good performance results. We took a lot of pain to implement this in a way that is agnostic to the key value store that is on the back end. So we've implemented this with Zookeeper back here and with Cassandra and different configurations of both. We do some system monitoring data, and we have a couple of tried and true systems implementation techniques. Like if we put this message repeater in the loop to the data path, we've created exactly what I was telling you we were worried about, which is these things that are common to all duplicates, right, that could cause problems for all duplicates. So to minimize its effect, our implementation uses a low overhead call back so we send writes directly back -- or we send responses directly back to the client, rather than going RPC style back through the repeater. General overview. Last but not least, I wanted to show you guys this stuff about Gridlab-D. So this is sort of looking at the layout of agents in Gridlab-D, issuing reads, doing some computation and finally, at the end of a time step, issuing a write back to the key value store. So you can see this sort of, I don't want to call it a barrier, but you can see the issue here. 
So until these writes complete, you can't move forward to the next time step. And James, along your question, you know, this multi-phase scientific computation, it's been well studied, the fact that these scientific workloads that run a long time go in and out of different types of phases. And ours is no different. So on an open workload, I'll just skip to the end-to-end tests. So here is the take-away plot. So we're showing performance improvement as you would typically expect it, right. So the relative performance improvement between two approaches. So using just the partitioning approach versus using the combo approach of partitioning and replication for through-put where the combo approach is chosen by our model. At the far end here, we're looking at the total running time of a Gridlab-D simulation over a course of three days. And we're getting about a 20% improvement when we compare same size storage systems. So using 8, 16 or 32 nodes where we use only replication for predictability, or mixed, right, so same number of nodes, we're getting a consistent performance gain there. And we get that by reducing the 99 percentile significantly. So we've significantly reduced the straggler effect here. And to be sure that this isn't just something that happens on my cluster and, you know, nowhere else, we took this, we ran it on EC2 in order to scale up to larger sizes, and as you can see, results are still consistent there. Actually, EC2 works even better, because there you have a lot of variability. I mean, we were able to, at one point, literally watch our CDF change without touching anything. So this is end-to-end test. Conclusion, basically if you walk away with anything here, the four points. So we were looking at these emerging workloads. They need low latency on thousands of accesses. This is our target. This is what we sort of see going forward into the future. 
What we're suggesting is that the systems community may want to revisit replication for predictability, this old approach where you send many requests to -- you send the same request to many nodes and take the fastest response. It actually makes sense now for these types of workloads. And you know, Zoolander provides all this support that I had previously mentioned, and you just saw our results on end-to-end scientific computing workload. And then these are all of my students that have contributed to some of these projects that I've talked about today in some way or another, and I just wanted to, even though they are not here, in case they see these slides, I wanted to make sure they knew I was thinking about them. That's it. Ran over a little bit. So time for Q&A. >>: So the simulation, I just want to revisit the consistency thing a little bit. How many barriers were there in the run? There was one global barrier that separates two phases, or were there different barriers throughout this? >> Christopher Stewart: So it was more like one global barrier. So you would issue, you would operate all of your agents would do what they were going to do for one time step. Then -- >>: I guess my question is how many times -- in other words, how many times in the execution was there a barrier? There must be a lot of them. >> Christopher Stewart: We did it at three-second granularity over three days, but that's not clear because you have intermediate. So if you want to do that math, how many three-second. >>: 6,400. So it still seems like if there's this barrier where everybody has to write before anybody starts to read in the next time step, it just seems like there would be this problem that if you're releasing the write barrier after the fastest write, then the readers would have to wait for the slowest read to make sure they got the fastest writer. Or conversely, you could wait for all writers to complete, and then you could wait for the fastest reader, but you can't do both. 
>> Christopher Stewart: Right. So here's the way we handle that. So if you want, if you need strong consistency, what you do is you send writes and reads through this multicast repeater that I showed you. So what that does is serializes everything. And that can get a through-put of -- >>: [inaudible]. >> Christopher Stewart: Yeah, you send everything through this multicast repeater. >>: [inaudible]. >> Christopher Stewart: Yeah, so then you'd have to use multiple. But this is standard practice for achieving this type of strong consistency. Zookeeper does the same thing. Zookeeper, LazyBase was just a [indiscernible] paper this year. So if you want strong consistency, you have to pay a performance penalty for it. >>: Zookeeper is not [indiscernible]. >> Christopher Stewart: Right. So -- >>: [indiscernible]. >> Christopher Stewart: Yeah, you can go with quorums. And what quorums can do is allow you to sort of mask the effect on reads while lowering your through-put on writes. Or you can take this approach where you just pay the penalty on both. And so that's sort of our trade-off. If you want read my own write, we can do that a little bit more efficiently. So when we respond back with the fastest completing node in the write, we can also respond back with an address, which allows you to directly issue your read back to that node again. And if that node is, for instance, a Zookeeper group that is 3X replicated, then you can scale that read performance by adding nodes. Does that make sense? >>: Are there background tasks on the centralized -- I mean, seems like now -- >> Christopher Stewart: On the software repeater? Yeah, yeah. That's exactly why earlier, I mentioned this, you know, we have to take approaches to implement that as well as we possibly can. You're right. Some things are just going to be caused by this message repeater in between. >>: [indiscernible], right? >> Christopher Stewart: Yeah, yeah. So it's not that hard to implement it very well, I think. 
But, you know, I implement systems. And then finally, if you want write order consistency, then the issue is, then if you can deal with some inconsistency in your reads, which, you know, a lot of these other systems support, then you just get the write back from the fastest and then you issue your reads to whomever and eventually you're guaranteed that everybody will eventually give you the same write performance. So our raw through-put numbers are comparable to existing systems, right. So we're not making any fancy tradeoffs there. >>: In general [indiscernible] you have a scaled down [indiscernible]. >> Christopher Stewart: Well, that's only -- but you're talking about a very specific type of workload, right. So you're really talking about, hey, if I only want sequential strong consistency, right, then you have, yeah, the single bottleneck that is a problem, right. But if you don't need that, we can still scale your reads, right, by doing -- if you're willing to relax that. I think, you know, I don't want to get into trouble by saying anything too bold, but when people scale out nodes to support sequential consistency, it's often either to improve read through-put or for availability. You very rarely -- by very rarely, I mean I can't think of anybody off the top of my head that it's good high write performance while supporting strong consistency. That's not a fair comparison there. >> John Douceur: We're running late so I don't want to go too long, thank our speaker again. >> Christopher Stewart: Thanks for hosting me. It was a pleasure.