>> Karin Strauss: Welcome. My name is Karin Strauss. I'm here to introduce Malte Schwarzkopf. I hope I didn't butcher your name. >> Malte Schwarzkopf: Yes, very good. >>: Malte is a last-year Ph.D. student at the University of Cambridge and his work is on distributed systems and operating systems, or the combination of both. So Malte will talk about work he's done at the University of Cambridge and at Google today. Thank you very much, Malte. >> Malte Schwarzkopf: Hello, everyone, and thanks to Karin for hosting me here. So this talk is sort of coming in two parts, and, you know, we'll go as far through it as we feel like it. But the first part is work that I did while interning at Google. And then if there's time, I'll move on to talk about some of the follow-on work that I've been doing back at Cambridge. But the overall umbrella topic here is scheduling on clusters in data centers. So I'll first talk about this system called Omega, which was published at EuroSys 2013. Omega is Google's next-generation cluster management system. And Andy, who was one of the other interns at the time, and myself were quite privileged to make a contribution to it. It's actually a much bigger system that doesn't just do scheduling, but this talk and this paper are on the scheduling aspects. So scheduling in this case means the problem of mapping work, such as tasks that are parts of jobs, to resources, which in the data center are typically machines. Now, of course, there's thousands of machines and there's thousands of tasks in any particular job potentially, so the scale of this problem can get quite big. Now, at Google one -- a few of the things that people observed over time is that the scheduling problem has really changed a little bit in the sort of last ten years, and specifically, while they were running the old cluster management system, which precedes Omega and sort of dates back about ten years. First of all, workloads are becoming quite diverse in shapes, sizes, and also in their sort of requirements of the scheduler. This makes the scheduler's job potentially quite a lot harder, as we'll see. And at the same time, the size of the clusters involved is increasing quite dramatically, so we're looking at, you know, tens of thousands of machines now, and hundreds of thousands is not completely out of the expectation. And the number of jobs arriving in any particular time unit is also growing, so all of this combined makes the problem harder. Now, how does a typical cluster scheduler work? Well, as I said, we've got some work that we need to map to resources, so in some way the scheduler has to track the state of machines, which is sort of down here. These gray squares are representing machines, and some of them are running tasks; some of them aren't. Obviously in practice a machine would run more than one task, but I visualize it as having just one here. And up there we've got tasks arriving from all sorts of different workloads, and they will proceed through some scheduling logic in order to be mapped to machines. Now, this scheduling logic could well be the same for everyone, for every arriving job, but actually in practice what seems to be happening is that it's actually becoming more and more heterogeneous over time. So some jobs are getting sort of scheduled in a much simpler fashion by just dropping them wherever, and some jobs have more stringent requirements, and the scheduling logic is becoming quite elaborate.
And in fact it's not completely out of the realm of possibility for the scheduling logic at Google in practice to take around 60 seconds to place a large job on the cluster, just because there's a lot of work to do. Now, why might this be? There's constraint solvers in some of the things that people want to use, Monte Carlo simulations and so on, and these things can get quite time consuming for large jobs. So this sort of feature creep in the scheduling logic actually leads to an increasingly complex scheduler implementation, if you assume that there's just sort of one implementation. In fact, the previous Google cluster scheduler went through exactly this process of sort of becoming more and more complex. There were more and more control paths through it and shortcuts introduced, extensions requested. Different teams wanted different things, so everyone sort of added their little piece to the pie. But of course, this significantly increases the complexity of engineering and maintaining the scheduler, so we would like to streamline this and break the monolithic scheduler up into modular ones. But while these are probably easier to maintain, they still, at the end, have to share the physical cluster resources in some way, and we have to arbitrate resources between the different schedulers that are working in the same cluster. Now, there are various ways this can be done, and people have looked at this in the past. The straightforward one is sort of you just break the cluster up into multiple logical clusters. You say, well, we've got the red cluster, the blue cluster, and the green cluster, and we just split our resources three ways. Now, of course, that has a bunch of problems. For example, if the red cluster, say, is a MapReduce cluster, it's completely full. It's running at high utilization, but there's space in the others, but because of the static partitioning, the jobs can't grab resources from the other logical clusters. Now, clearly, one possible way of addressing this is by actually making that partition dynamic, and in fact, this is what some existing systems do. For example, Mesos, the cluster management system from Berkeley, schedules resources on two levels. So they have a higher-level resource manager and a set of schedulers that I show in red, blue, and green here, which each work with their resource allocation. So this sort of yellowish resource manager will carve out a little partition of the cluster and offer that to the different schedulers. We'll later see that that actually has some problems too. So in Omega what we do is take a slightly different view. We say, okay, instead of having this resource manager that does arbitration, we just sort of take a very laissez-faire approach and say every scheduler can claim any resource at any point in time if it wants to. So the entire cluster is, in effect, a big shared data structure. Now, there's no need for any reservations, for any a priori coordination between schedulers. They just try, hope for the best, and then sort out any problems afterwards. So let's see how that works in detail. Consider an example. Here we've got our cluster state, and this is the shared data structure that sort of represents the ground truth. That sort of represents the state of the cluster at this point in time. So if a machine fails, it will disappear from this. If a job gets scheduled, it appears in this and so on. Now, we have two schedulers, red and blue.
Each scheduler, internal to itself, has a replica of the cluster state. So you can see little gray squares there that are just a replica of the data structure down there, and this replica is frequently updated. We're just sort of pushing diffs, effectively, of changes that have happened in reality to sort of bring the replica up to date. Now, let's see what happens when jobs arrive. So in the red scheduler a bunch of tasks in a job arrive. They're considered for scheduling. And using the red scheduler's specific scheduling logic, it decides on two machines where these tasks should go. Now, that happens and the replica is updated, but of course, that still needs to make its way to the shared data structure. So a delta, as we call it, is created and sent to the shared cluster state, asking to be committed. Now meanwhile, the blue scheduler has been busy too, and it has two tasks. It considers them. It finds machines to place them on, and using its specialized logic makes that decision and then sends a delta to the shared cluster state. So, in the shared cluster state, these deltas are applied. And in this case unfortunately, both the schedulers decided to place a task on the same machine. Now, that could be fine, but it could also be problematic depending on how big these tasks are and how much room there is on this machine. So there can be a conflict, which in this case means that the two tasks that we're trying to place cannot both go on to that machine, as it would lead to overcommit. So what happens in Omega is one of the schedulers will succeed according to some arbitration policy. It can just be first come, first served, and the other one will fail. So the blue scheduler has succeeded in this case and it gets told your job is now scheduled, whereas the red scheduler is told your job has failed and it has to try again. Now, clearly these sorts of conflicted scheduling attempts can lead to wasted work. And in the worst case, the red scheduler would have to try over and over and over again until it eventually gets its job scheduled. So if it got very unlucky it might never make it, so clearly the viability of this kind of optimistically concurrent model depends on how often these conflicts occur and how well we can avoid them in practice for these sort of real, large-scale workloads. So, after this brief explanation of how Omega works, I'll now use a set of practical sort of case studies to investigate how viable the model is and answer that question I've just posed. Yeah? Sure. >>: Who makes the decision where it is allocated? >> Malte Schwarzkopf: You mean how does blue make the decision of using these two machines? >>: Blue made the decision of one of those two machines and requested those two machines. Then someone presumably said yes, you have these two machines. >> Malte Schwarzkopf: Yeah. That would be the code maintaining this data structure, but that code is very, very simple. It just, you know, the first one that arrives and fits gets the resources. It's not -- there is -- there is a thread running a sort of admission control on the data structure, but it's very simple code. There's no complicated policy here. >>: [indiscernible] >> Malte Schwarzkopf: No. That's in one place. >>: What if the red job just has tons of tasks arriving but they're very low priority? Wouldn't red wind up filling everything? >> Malte Schwarzkopf: Exactly. And that's going to be one of the cases that I'll sort of look at.
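(To make the delta-and-commit flow just described concrete, here is a minimal sketch, in Python, of the kind of optimistic-concurrency loop a scheduler could run against the shared cluster state. It is purely illustrative: the names, the single CPU dimension, and the per-machine sequence-number conflict check -- the coarse-grained variant discussed later in the talk -- are my assumptions, not Omega's actual code.)

```python
# Illustrative sketch only -- not Omega's real implementation.
import copy

class SharedClusterState:
    def __init__(self, machines):
        # machines: {machine_id: {"free": cpus, "seq": 0}}; seq is bumped on every change
        self.machines = machines

    def snapshot(self):
        # A scheduler's local replica; in reality it is kept fresh by pushing diffs.
        return copy.deepcopy(self.machines)

    def try_commit(self, delta):
        """Atomically apply a list of (machine_id, seq_seen, cpus) claims. The whole
        delta fails if any touched machine changed since the replica was taken, or
        if applying it would overcommit a machine."""
        staged = copy.deepcopy(self.machines)
        for machine_id, seq_seen, cpus in delta:
            m = staged[machine_id]
            if self.machines[machine_id]["seq"] != seq_seen or m["free"] < cpus:
                return False          # conflict: another scheduler got there first
            m["free"] -= cpus
            m["seq"] += 1             # record that this machine changed
        self.machines = staged
        return True


def schedule_job(shared, tasks, pick_machine):
    """Optimistic loop: place against the local replica, then try to commit."""
    while True:
        replica = shared.snapshot()
        delta = []
        for cpus in tasks:
            machine_id = pick_machine(replica, cpus)   # scheduler-specific logic
            replica[machine_id]["free"] -= cpus
            delta.append((machine_id, replica[machine_id]["seq"], cpus))
        if shared.try_commit(delta):
            return delta              # job scheduled
        # otherwise: wasted work -- refresh the replica and try again
```

(A batch and a service scheduler would each run this loop concurrently with their own pick_machine logic; the only coordination between them is the commit check.)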
So the concrete example is if red is a batch scheduler, and it has large MapReduce jobs, then if he tries to touch five thousand machines every time he schedules a job, then he might be causing other people to basically get starved. And we have to somehow deal with that without incurring coordination overhead and, you know, without restricting red's ability to actually get his jobs started. So you know, you don't want him to starve other people, but you also don't want him to get starved as a result of other people having touched the machines that he wants to use. So priorities are one way of doing this, and we'll look at some other ways in a minute. So, in fact, let me take that directly from here. In this work we consider a simple sort of two-way workload split where we have batch jobs and service jobs. And this is a real differentiation that Google are actually doing in Omega. They also have some other categories, but these are sort of the most important ones and also the most challenging ones to deal with. Now, batch jobs run for, you know, some amount of time and finish. So a MapReduce job is sort of the canonical example of this, whereas service jobs are those production jobs that require careful scheduling for optimal resource allocation and which run for a very long time. So conceptually they might run forever. For example, if you run the Gmail backend, you start it up and it remains running until either the machine fails or the scheduler ends up moving it somewhere else. But the assumption is it will basically go ad infinitum. And it is very important that it meets its SLAs and doesn't get placed in a bad -- in a bad spot. So let's look how this actually -- these two categories actually break up the real-world cluster workload that we're seeing on Google clusters. We're having a look at three representative Google clusters here. We'll call them A, B, and C. They differ a little bit in their size and utilization, but important to point out is that cluster C is the one for which Google released a trace in 2011. So the workload that's covered here is actually the same as in the publicly released trace, whereas the other two, they don't release traces for. So what the bar chart shows here for each cluster is the relative share of batch jobs and service jobs. Batch jobs are shown in solid color, whereas the service jobs are shown as the dotted portion. And the four categories that we're looking at here are the number of jobs, the number of tasks, and the aggregate CPU and the aggregate RAM resources used by these tasks. So if we look at this, the takeaway is pretty clear. The large, large, large majority of jobs fall into the batch category. It's way above 90 percent. It's really hard to actually see the service jobs on the left-hand two bars, but a lot more resources, in turn, are consumed by service jobs. In fact, more than half of the resources in all of these clusters actually are devoted to running service jobs. So clearly it makes a lot of sense to invest careful scheduling into the service jobs because they run for a long time and use a lot of resources, whereas the batch jobs are much more sort of throw-away, but much more numerous. So looking at some other statistics, we find that the individual jobs have different properties too. I've already touched on some of those. Batch jobs are typically much shorter. Their sort of median is about twelve to twenty minutes for a job, and some of them are a lot shorter than that.
Some of them are a bit longer, and they arrive a lot more often. So in the traces that we considered, the interarrival times of batch jobs were sort of every four to seven seconds, and in peak times it can be, you know, several per second as well in big clusters. Whereas service jobs -- well, the traces we looked at were for 29 days, and the median job runs for 29 days, so conceptually it runs ad infinitum. If we had looked at a longer trace, we might have seen something different. But they run for a very long time typically, and they don't arrive as often. Their sort of scheduling requests arrive on the granularity of tens of minutes rather than every second. So let's see, with that sort of workload characterization in mind, how the Omega approach actually compares to the competition, the other approaches that I showed you earlier. So the way we did this is we wrote a simple simulator that can simulate all of the scheduler architectures that we're interested in. So it can simulate a monolithic scheduler. It can simulate a Mesos-style two-level scheduler architecture. It can simulate Omega. And this simulator, by virtue of being a simulator, does simplify things a little bit. For example, it uses empirical distributions for the sizes and shapes of jobs and tasks, by sort of just sampling from distributions that we derived from the traces, and it uses a simple placement algorithm, but it does allow us to implement all of these architectures and do a like-with-like comparison. And the simulator is open source, so if you're interested, you can look at it. It's written in Scala, I think, most of it. Now, how do we model scheduler decision time? Because that's obviously going to be very important for this sort of work, because the time when the scheduler is trying to make its decision is the time when it's vulnerable to competing updates from other schedulers. And it's also the time that the logic takes to make a decision, and we are sort of bounded by the fact that we can optimize the logic only to some extent; if we're doing a large-scale sort of constraint solve, that's something that will just take a relatively long amount of time and make the scheduler vulnerable. So the way we model this in our simulation is we say, well, there's a per-task amount of work that you have to do, and we typically set that to about five milliseconds as a sort of conservative, generous overestimate. And this is the work where, if you have twice as many tasks, you have to do twice as much of this work. But of course, there's also a constant per-job overhead that is just the overhead of dealing with a job, and we typically set this to a hundred milliseconds, which again, is a conservative overapproximation. In practice, it would be much faster in the real world, but we're making things deliberately difficult for the Omega model here in order to see its limits. So let's see how we do. In this first experiment, we are going to see a couple of 3D graphs, and I'll take a moment to explain them because 3D graphs are a bit complicated sometimes. So we start off looking at a monolithic scheduler. So this is the case of the monolithic scheduler where there's one big scheduler that handles all the jobs and has all of the scheduling logic in it. We sort of dismissed this in the beginning because it's hard to engineer, but we're taking it as a baseline for our evaluation. So we call this a single-path scheduler.
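(As a sketch of the decision-time model just described, and of the busyness metric used in the graphs that follow -- the constants and function names here are illustrative, not the simulator's actual code:)

```python
# Hypothetical sketch of the decision-time model described above.

T_JOB_OVERHEAD = 0.100   # constant per-job overhead (seconds), deliberately generous
T_PER_TASK     = 0.005   # per-task work (seconds), also a conservative overestimate

def decision_time(num_tasks, t_job=T_JOB_OVERHEAD, t_task=T_PER_TASK):
    """Time one scheduling attempt occupies the scheduler (and leaves it
    vulnerable to competing updates from other schedulers)."""
    return t_job + num_tasks * t_task

def scheduler_busyness(attempts, elapsed):
    """Fraction of wall-clock time spent making decisions, like 'top' for the
    scheduler; values pinned at 1.0 mean it cannot keep up with the workload.
    attempts: list of (num_tasks, num_attempts) pairs for scheduled jobs."""
    busy = sum(decision_time(n) * a for n, a in attempts)
    return min(busy / elapsed, 1.0)

# Example: a 100-task job scheduled in one attempt occupies the scheduler for
# 0.1 + 100 * 0.005 = 0.6 seconds.
```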
It applies the same logic, the same scheduling software, to everyone, and inside that logic, we might be taking different control-flow paths, but it will take roughly as long. Now, on this diagram we vary -- on this axis, we vary the per-job overhead for all of the jobs. So here, we've got a hundred milliseconds, and here we vary it up to a hundred seconds, and this would be one millisecond. So it's a log scale, and as we go along, sort of the per-job scheduling time grows. And on the other axis we vary the per-task decision time. So again, we see sort of one millisecond here, and it goes all the way up to one second per task. So at the far end of the diagram is sort of you take a hundred seconds for the job and then another one second for each task, so that's a pretty ridiculous case. And the z-axis shows a metric that we call the scheduler busyness. Now, you can imagine that as sort of the value that top would show for the CPU in your machine. It's basically what fraction of time does the scheduler spend making decisions and what fraction of the time is it idle and not doing anything? Now obviously, this can't be greater than 1.0 because then it's making decisions all the time. The color is adding another dimension, which actually is quite intuitive if you think about it. When the utilization of the scheduler hits a hundred percent and it's making decisions all the time, then of course, it might not be able to keep up with the workload. So wherever the diagram is colored red, we actually failed to schedule all of the jobs, and we started accumulating a larger and larger backlog until eventually things fell over. Whereas where things are blue, all jobs got scheduled. Everything is great. There's also an intermediate bit where, sort of, temporarily some jobs don't get scheduled, but then the scheduler catches up again. Which is, you know, sort of somewhere in the purple range in this diagram. So as we can see, the monolithic scheduler doesn't do very well when we have it spend a long time making its decision, and that's what we would have expected. It's not very realistic to do this. You would never use a hundred seconds in a monolithic scheduler for every job, but it serves as a useful baseline. >>: I know this is a detail, but I see a red dot or portion that's below 1. >> Malte Schwarzkopf: Yeah. >>: What's going on there? >> Malte Schwarzkopf: There's a bit of rounding noise in the values we get from this, because when we rerun the simulator, it did generate the same sequence of jobs, but at the end of the simulation, a job might be halfway scheduled. It will never finish scheduling and will not be accounted as scheduled, so there's a bit of random noise here. But we actually looked at the trace, and it did schedule. It did spend most of the time -- it's never going to be exactly a hundred percent, but there were tiny gaps. It's effectively a hundred percent of the time. You couldn't have fit in any more jobs, but that doesn't mean that necessarily you didn't have short idle periods, especially at the beginning when there aren't that many jobs around. So it's noise. >>: Thank you. >> Malte Schwarzkopf: Okay. So let's consider something a bit more realistic than this sort of slightly dumb, monolithic, single-path, everyone-has-the-same-experience-going-through-the-scheduler situation. This diagram shows the same metrics for what we call a multipath monolithic scheduler.
So this is effectively a scheduler that starts with a big switch statement that goes if this is a service job, then apply this scheduling logic. If this is a batch job, apply that scheduling logic. If this is something else, then apply this scheduling logic. So it's sort of multiple schedulers in one, which is really not nice for engineering purposes, but it is a sort of plausible thing you could do, and you wouldn't have to deal with all the shared data structure stuff that we have to deal with. Now again, we vary the decision times on the x- and y-axes, but here we vary it only for service jobs, not for all jobs, because we now have these different control-flow paths, these different categories of jobs we're dealing with, and we only increase the decision time for service jobs because these are the ones that typically take a long time to place, because Google is really concerned about placing them well. Whereas for batch jobs it doesn't really matter because they only live for a couple of minutes, so we can just drop them anywhere. We see that the busyness has actually now dropped a lot, and we can easily take a hundred seconds or close to a hundred seconds and one second per task and still not quite hit the ceiling. So that's good news. But there's sort of an offset at the base, where even if the scheduling time for each job and every task is very short, the scheduler is still busy for a significant fraction of time, and it's actually more than it needs to be. And we investigated this, and we found it's because there's no parallelism in the system. While there's different control-flow paths, you end up scheduling one job at a time, so therefore, you deal with a series of incoming jobs one after the other, which means you can get head-of-line blocking. So as a batch job, you can get stuck behind a service job that takes a long time to schedule, even though, if the scheduler had scheduled them the other way around or in parallel, you would long have been done. >>: Why would reordering it reduce the scheduler's busyness? I see that it would release the job sooner, so the short jobs wouldn't need to wait as long for the long jobs, but you're measuring how busy the scheduler is, and I don't see why that would raise the total amount of work the scheduler is doing. >> Malte Schwarzkopf: That's right, yes. Let me think about this. So in the next graph, we do have parallelism. Yes. So it's not the head-of-line blocking that makes the difference here. It's the fact that you have these very numerous batch jobs, and they come in all the time, every couple of seconds, and take a little while to schedule, so even if they don't take a long time to schedule, you'll still be busy some of the time. In the next graph, when this is happening in parallel and we're looking at the busyness of the service scheduler only, then the batch scheduler sort of exists but it's not shown on this diagram, and it will be a flat surface that's offset by about this much from the bottom. So it is lack of parallelism, but it's not the head-of-line blocking that's causing it. I should have been clearer there. The head-of-line blocking does also happen in this, though, so jobs might experience a longer wait time. And there's, in fact, an analogous version of this 3D graph where we show -- we look at the job wait time and we see the same phenomenon. But here, you're right.
The sort of step at the bottom is actually the batch workload, which now in the next graph will disappear because we're only looking at the service scheduler and the batch stuff is sort of handled in parallel. Okay, so we're moving on to Mesos, which is this two-level model where you have -- where you have the resource manager that allocates dynamic shares to different schedulers, and then they work with their -- with their share. And again, we vary the decision time for service jobs, and we show the busyness of the service scheduler only in this case. The batch scheduler, as we just discussed, is sort of flat because we're not varying the decision time for that. But the surprising thing here is that in every case we end up with unscheduled jobs, which is a bit odd because when the decisions are fast, that shouldn't be happening. And in order to understand why this is happening we have to look at how Mesos actually works in practice. So we have a Mesos resource manager, and let's say we've got two schedulers which receive their resource offers in turn. Mesos does not make offers in parallel. It just offers every idle resource in the cluster to each scheduler in sort of a round-robin fashion instead of offering parts. That's a bit sort of counter to the model, and we were a bit surprised when we found that, but the reasoning that the Mesos guys apply is that they assume that there's high churn in the cluster and therefore, if you offer all of the resources to everyone in turn very frequently, effectively everyone gets offered the optimal share. But let's see what happens here. So in this case the green scheduler receives an offer for all of the available resources first, and then it makes its decision, and it might take a while. In the meantime, a blue task of the blue scheduler has finished, so now a little, little piece of resource has become available. And what Mesos could do is it could just wait and offer that in the next round, or it could make another offer. And while they sort of didn't implement parallel offers by splitting up the available resources, they did have this sort of -- they do have this sort of event-driven model where, when a resource becomes available, they will immediately offer it. And in this case, what that means is they do end up effectively with a parallel offer, but with a really dumb parallel offer, because it's offering a tiny amount of resource to the blue scheduler because that resource has just become available. When it would have been a much better idea to actually carve up the initial allocation into sort of equal shares. And in this case, blue receives its tiny offer and says, well, but I've got too much work for that; I can't really make use of that, so I'll just leave it unused and wait for a new offer. So it can't use it. Meanwhile, the green guy is still making its decision. Let's say he's the service scheduler and he's taking a long time to make his decision. So we repeat this many times and, in fact, Mesos will offer this again and go, well, have you thought about it? Maybe you do want it after all, and then it goes away again. They do it again and so on. Repeat many times. At some point the batch scheduler is done and releases all of the resources. And in fact, if we compare here, it has only placed two tasks and released all of the rest of the resources that it had acquired a lock on.
But by this time, now, blue finally gets its large offer that it can use to schedule all the work it needs to schedule, but by now, it's actually given up. And that is what happened in the simulation. So we actually debugged this, you know, the fact that we saw unscheduled jobs in every case, and effectively what we saw was that they were being dismissed because the scheduler had given up after a timeout, a fairly long timeout of something like a minute. It had just stopped retrying because it had received so many offers that it couldn't use that it decided to get rid of the job instead of leaving it in there and blocking the scheduler from doing other work. So, let's see how Omega compares to this, after this short interlude. And remember again, our main goal here was to decompose the scheduler and make it flexible, not to necessarily beat every other model. But actually if we compare this to Mesos, it's not too bad. The Mesos graph, with the caveat of everything being red, was a little bit lower. And that makes sense because in Omega we're experiencing these conflicts, and we're going to be doing extra work. So clearly, we would expect to be at most as good as Mesos probably was. But this is not quite as good as we would like. So we ended up making a couple of optimizations that I'll discuss next and found that actually, when we're a bit smarter about deciding when a conflict occurs and how we handle these resource conflicts than we initially were, we can actually get quite close to Mesos. So in this graph we see that for the entire plot the busyness of the scheduler is low, so we can always keep up with the workload, and that's good news. Now, here we compare all three of them again, and we can see the optimized Omega is a little bit worse still than Mesos, but it does actually handle all of the workload. And if we compare it to the monolithic scheduler, there's this little step which has gone because we've got parallelism in Omega, but actually if you compare the shape, modulo the step, we're still a little bit worse than the monolithic scheduler. Again, that's what you would expect because we're doing some extra work. So that was great, but it was sort of a low-fidelity simulation because we tried to compare all these different scheduler models and we -- yep? >>: The evaluation metric that we've sort of been getting an intuitive feel for is, can the scheduler keep up, but the other question that you started out with is, does it make good decisions? That's not reflected here, right? >> Malte Schwarzkopf: That's not measured here. What we assume is that if it schedules a job, then it's satisfied with the decision that it has made. >>: Is it the case that the utilization of the cluster is equivalent between all three of these? >> Malte Schwarzkopf: Well, it's lower in this case because you're not scheduling everything. >>: It's failing, hence the red lines. >> Malte Schwarzkopf: Yeah. But if you just consider the blue, then the utilization is equivalent. Yeah. >>: So it's not the case that making faster decisions doesn't get -- doesn't reduce some of the slack time in the cluster. I imagine that a scheduler that did a better job might -- then there are other metrics, but might actually produce a better overall result, I guess. I think what you're saying is as long as the scheduler completes its job, everybody's happy. >> Malte Schwarzkopf: Well, yeah, certainly everybody's happy.
I think, if I understand your point correctly, you're saying, you know, because we've got some parallelism, you know, we're now dealing with more scheduling decisions per time unit, and therefore we can probably achieve higher utilization because we can do more work. >>: One question is whether -- is whether a slower scheduler, even if it gets the job done, gets it done while leaving pieces of resources unutilized for some time while it got its decision made. >> Malte Schwarzkopf: Right. >>: The other question is whether, when you have schedulers arguing over placement decisions, whether a smarter scheduler, like -- the point of going beyond the monolithic scheduler is the monolithic scheduler is very simple. And my intuition is that at some point, with more complex scheduling decisions, you can get a better overall outcome because you presumably -- >> Malte Schwarzkopf: Sure. The metric that people at Google use to assess this would not be utilization. It would be something like conformance to fault tolerance SLAs, for example. Conformance with request-handling SLAs, you know, and those sorts of things. We did not particularly focus on that. We did look at it, and because it's a simulation, we can't really say what would have happened to the request SLAs, but we -- we looked at the fault tolerance, and it was better because the more complex scheduling logic is making better decisions and placing things. >>: Right. It seems like that was your primary goal, but I guess there's also a scalability goal. >> Malte Schwarzkopf: Well, so the premise that we started with was, you know, people want -- for service jobs they want this fault tolerance, so they will badger the scheduler team until they implement it. And they could either implement it into the one scheduler that they have, in which case you end up with the monolithic case, or they could disaggregate the scheduler, and you know, that's -- that ended up being a good idea. >>: So fundamentally then, you take as a premise that you're going to have to have complexity. >> Malte Schwarzkopf: Yeah. >>: And then the metric of success is how well you implement, how quickly you -- >> Malte Schwarzkopf: Yeah, yeah. Okay. So we did another simulation which actually is much higher fidelity, and I'll explain that with this table. So in the first simulator, which I explained to you that we used to compare the different models, we made a couple of simplifications in order to get it to run reasonably fast and in order to make the implementation not too complicated, so we assumed that all the machines are homogeneous. The job parameters were sampled from distributions. We didn't particularly support placement constraints. So Google has all sorts of hard and soft placement constraints that are described in other papers. We did not support those. And the scheduling algorithm was a sort of fairly stupid, random first fit. All of the graphs you've seen so far were with these sorts of caveats. With the high-fidelity simulator that I'm now going to talk about, we actually literally used the Google scheduling logic. So when I say Google algorithm here, I mean I literally wrote "hash include scheduler dot h" and was really using the real logic that's used in production. And as a result we are also supporting constraints. We're supporting all of the sort of complications. We also used a real-world machine description for a real cluster and replayed a workload trace, a complete event trace of a month-long sort of cluster workload.
So this is a much more realistic simulation than the sort of simpleminded one to compare the different models that we used initially. So let's see how this works out for a month-long trace of cluster C that we've seen earlier. So again, on the y-axis here we've got scheduler busyness. On the x-axis we're varying the service scheduler's one-off decision time, so that's the sort of per-job overhead of the service scheduler. And again we vary it up to a hundred seconds. The blue line is the batch scheduler, and because we're not varying his decision time, it remains -- the busyness remains constant. In fact, it wouldn't necessarily have to remain constant. It could actually go up if the batch scheduler was having to retry a lot as a result of the service scheduler taking longer to decide. And we can see that that's not happening, so that's actually good news for us to some extent. But what we also show here is an approximation of what the busyness of the service scheduler, shown in red, would look like if we did not have any conflicts. So if we only had one attempt per job, and everything happily scheduled the first time round, and that's the dotted line here. Now, we can see the area between these lines is the overhead that we're adding due to having conflicts, due to having to redo work, and the bad news is this is pretty huge. Yep. >>: If you're hash-including the actual code, how do you vary the one-off decision time? >> Malte Schwarzkopf: Right, so we don't actually, you know, measure the time that the code takes to run. We sort of -- this is an event-driven simulation, so we account for decision time, and the real code takes some time that is either comparable or less or more to actually make the decision. So when the simulator runs it uses the real logic, but the time it accounts is a parameter that we vary. And the reasoning behind this is, you know, you could to some extent, you know, you could optimize it to make it faster. You could simplify it to make it faster, or you might be adding more and more of these fault tolerance features or complexity that would make it slower. So we were sort of exploring the space of scheduler decision times, and the reason that we're interested in scheduler decision times is because with the Omega model, if you think about it, the crucial parameter that makes schedulers vulnerable to interference is how long they take to make decisions. So we were trying to test, you know, how far can you push this before it folds over, and you can see it falls over at around twenty to thirty seconds. >>: I was wondering if maybe their existing algorithm said, here, have this much time to run your algorithm. So then you could actually model it. >> Malte Schwarzkopf: I actually talked to the guys writing the logic and sort of tried to get an estimate from them how long it takes. And they actually couldn't really tell me, and the reason is because it's really hard to tease this out of the existing logic as a per-job quantity. They're sort of -- the logic is doing all sorts of smart things in terms of when it looks at a job and it's touching something that is relevant to other jobs, it will sort of amortize the work and do a bunch of work for the other jobs as well. So it's really hard to say it took five seconds to decide for this job because it affected a bunch of other jobs as well. This is in the existing monolithic scheduler, so in the disaggregated schedulers, presumably this would change, but they didn't exist yet at the time, so we modeled it like this.
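(A minimal sketch of that event-driven accounting, under the same assumptions as the earlier decision-time sketch: the real placement logic runs, but the simulated clock is charged the modeled decision time, which is the parameter being varied. The names are hypothetical, not the actual Google simulator.)

```python
import heapq

# Hypothetical sketch: the simulated clock advances by the *modeled* decision
# time, regardless of how long the real placement logic takes to execute.

def simulate(job_arrivals, place_job, t_job, t_task):
    """job_arrivals: list of (arrival_time, job) where job has a .tasks list;
    place_job is the (real or simulated) placement logic."""
    events = [(t, i, job) for i, (t, job) in enumerate(job_arrivals)]
    heapq.heapify(events)
    free_at, busy, placements = 0.0, 0.0, []
    while events:
        now, _, job = heapq.heappop(events)
        start = max(now, free_at)                # queue behind earlier decisions
        placements.append(place_job(job))        # real scheduling logic runs here...
        cost = t_job + t_task * len(job.tasks)   # ...but we charge the modeled time
        free_at, busy = start + cost, busy + cost
    return placements, busy / free_at            # second value: scheduler busyness
```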
Okay, so bad news: It kind of doesn't scale particularly well, and the overhead gets quite big. So compared to the previous graphs, this is worse, and that's what we would expect. We sort of looked at why we saw such a big difference, and it's mainly placement constraints. If you have production jobs that are very picky in where they want to schedule, they will, of course, hone in on the same resources, so these resources become hot and more contended, and therefore you're seeing more overhead. So all for nothing, or maybe not quite. So remember the optimizations that I mentioned earlier. We looked at this and sort of thought a little bit more about what you can do in order to reduce the frequency at which conflicts occur and deal with conflicts more gracefully. So one optimization that we made was we said, well, we can detect conflicts in a slightly more fine-grained way. So initially what we did was we gave every machine a sequence number, and we just increased the sequence number every time a scheduler touched that machine. So conflict detection was then very simple because we could just check if the sequence number had changed, and if it had, we said, oh, bad luck. You have to try again. So, of course, that's a bit pessimistic because actually you might be able to fit that task and also the other task that the other scheduler is trying to fit on the same machine, if you make life a little harder for yourself in terms of checking whether you actually experienced a conflict. And I say a little bit harder. It sounds quite simple. It sounds like, you know, you just need to do a couple of integer comparisons here. The resource model that Google use in their data centers is reasonably complicated. There are sort of guaranteed resources and reserved resources and nice-to-have resources, and then there's a bunch of other dimensions as well. So it's a sort of reasonably complicated piece of multi-dimensional sort of analysis whether another task fits on the machine. So we actually take a little longer to detect conflicts here, but in exchange we maybe have fewer of them, and that turned out to be a good idea. The second optimization that we made is a relatively obvious one, which is that we said, well, if we have a big job and on the first attempt one of its tasks conflicts and can't get placed, then, of course, one option is to fail the entire scheduling attempt and retry the entire job. Which then second time round might work, but actually wouldn't it be much smarter if we only retried the task that failed? Now, of course, this only works for jobs where it's acceptable to incrementally schedule the tasks, not for ones that want gang scheduling. But for those it should give us a much better chance of scheduling second time round because the job is now a lot smaller and is going to touch fewer machines. So let's see how these optimizations actually impact the performance. Again, you know the axes. Y-axis, scheduler busyness; x-axis, service scheduler decision time. The red line is the red line that we had before. So we had the sort of yellow underneath here. And then the cerulean line and the pink line are with the optimizations applied. So for the cerulean line we've applied the incremental scheduling instead of gang scheduling, but we're still making the coarse-grained conflict detection decisions. And with the pink line we apply both of them.
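(A hypothetical sketch of the two optimizations just described, building on the earlier SharedClusterState sketch: fine-grained conflict detection re-checks whether a task still fits instead of failing on any sequence-number change, and incremental scheduling keeps the tasks that committed and retries only the rest, assuming the job does not need gang scheduling. In reality the fit check is multi-dimensional and the commit would be one transaction; this is illustrative only.)

```python
# Illustrative only; reuses the earlier SharedClusterState sketch.

def try_commit_fine_grained(shared, delta):
    """Ignore sequence numbers: a claim only conflicts if the task no longer
    fits on the machine. Returns (committed claims, tasks to retry)."""
    committed, failed = [], []
    for machine_id, _seq_seen, cpus in delta:
        m = shared.machines[machine_id]
        if m["free"] >= cpus:            # still fits despite concurrent changes
            m["free"] -= cpus
            m["seq"] += 1
            committed.append((machine_id, cpus))
        else:
            failed.append(cpus)          # only this task needs another attempt
    return committed, failed


def schedule_job_incremental(shared, tasks, pick_machine, max_attempts=10):
    """Incremental scheduling: keep whatever committed, retry only the rest."""
    pending, placed = list(tasks), []
    for _ in range(max_attempts):
        if not pending:
            return placed
        replica = shared.snapshot()
        delta = []
        for cpus in pending:
            machine_id = pick_machine(replica, cpus)   # scheduler-specific logic
            replica[machine_id]["free"] -= cpus
            delta.append((machine_id, replica[machine_id]["seq"], cpus))
        committed, pending = try_commit_fine_grained(shared, delta)
        placed.extend(committed)
    raise RuntimeError("job not fully scheduled after %d attempts" % max_attempts)
```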
And we find in practice we got sort of a 2x or fifty percent, depending on which way round you look at it, difference in scheduler busyness as a result of applying each of these optimizations. When we look at the previous graph now with these optimizations applied, the yellow area that shows the overhead of wasted work due to conflicts is a lot smaller. It's still there. We still have to redo things, and in fact, if you look at the numbers, without the optimizations the worst case that I saw was something like 200 retries until the job finally scheduled, and then it did schedule. But in this case, the worst I saw was seven retries, and that was sort of very rare, a one-off in a month of simulated cluster runtime. So we can make these optimizations and therefore make our model viable even at pretty long scheduler decision times. And that is good news because that's what Google likes. We also did a case study on what you can do if you have these sorts of disaggregated schedulers. So far we've looked at batch and service jobs, and I sort of in a hand-wavy way said service jobs take a long time to schedule and therefore, you know, it'd be nice if we could disaggregate things. An opportunity that is granted to us by having these disaggregated schedulers is that maybe we'll have more schedulers. We can have job-specific schedulers. So one case study that we looked at here for qualitative benefit, which is sort of a little bit what John was asking about earlier, but it's a bit different in this case. We looked at a MapReduce scheduler, a job-specific MapReduce scheduler, and what this scheduler does is it introspects the cluster state to see if there's any idle resources around and sort of opportunistically leverages these idle resources to complete MapReduce jobs faster. Now, to understand why this works we have to look at the number of workers in a MapReduce job, and at Google that number is a manually specified configuration option chosen by a human. So we end up -- basically a human ends up writing a job profile and job scheduling request, and in that, that human says, I'd like a hundred workers, please. This is not the same as the number of map shards and reduce shards. It's just the number of parallel workers, so it's like the number of parallel threads, I suppose, in an OpenMP program or something. And it turns out if you look at a histogram of the sort of distribution of numbers of workers that people use in practice, the numbers are all sort of nice numbers that humans will pick when you say, well, how many workers would you like? And a lot of them say 200 because that seems like a nice number. Now of course, there's a serious point behind this, which is that people actually implicitly make a trade-off between getting the job done faster by having more workers, assuming it is embarrassingly parallel, and getting it to schedule sooner, because if you ask for 8,000 workers, it will probably take a fair while until the job gets scheduled; whereas if you ask for five, it will be pretty fast but the job itself might take longer. So we said, well, maybe it'd be better if this wasn't decided by a human, but it was decided by the scheduler as a function of the cluster's busyness at any point in time. If there's lots of idle resources then, you know, maybe we can give the job a bit more resources and it can complete faster. If there's not, then it will just run with however many workers it gets. So we built a scheduler that did that, and here we are having a look at the benefit.
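(A minimal sketch of that idea: a hypothetical job-specific MapReduce scheduler that inspects its replica of the cluster state and scales the worker count up toward the number of map shards when idle capacity exists. The sizing rule and the names are illustrative, not the real scheduler's policy.)

```python
# Illustrative sketch of an opportunistic, job-specific MapReduce scheduler.

def choose_worker_count(replica, configured_workers, num_map_shards,
                        cpus_per_worker=1):
    """Never go below what the user configured; scale up toward the number of
    map shards if the replica of the cluster state shows idle capacity."""
    idle_slots = sum(m["free"] // cpus_per_worker for m in replica.values())
    # Extra workers only help up to one per map shard.
    return max(configured_workers, min(num_map_shards, idle_slots))

# Example: a job configured with 200 workers and 5,000 map shards could get up
# to 5,000 workers on a mostly idle cluster, but never fewer than 200.
```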
So on the x-axis, we now have, on a log scale, the relative speed-up of job completion compared to the case of running it with the configured number of workers. And on the y-axis we have a cumulative distribution. Now I have to say a caveat here is the speed-up model that we used was pretty naive. We basically assumed that MapReduce jobs are infinitely scalable by adding more workers, which of course, is not true. If you have -- well, we didn't quite go to infinite. We never added more workers than map shards, for example, because that doesn't make sense. Of course, not every job is sort of linearly scalable like this. But if we assume that sort of very optimistic scaling model, then we get this sort of happy zone up there where jobs experience a speed-up. And we can look at this, and we see that out of the jobs that we looked at in this cluster, 60 percent of the MapReduce jobs were actually affected by this optimization, and that means 60 percent of MapReduce jobs could have benefited from extra resources, from being given extra resources in addition to the ones that were configured. >>: Yeah, just to understand the experiment. The speed-up includes both scheduling time and execution. >> Malte Schwarzkopf: This is job runtime. Scheduling time is not included. >>: It's not. >> Malte Schwarzkopf: It's just the job runtime as a result of using, you know, more workers. And what we did was we looked at the number of map shards, and we said, up to the number of map shards, we'll give you more workers so that you can run more of them in parallel. And you can see actually that for a number of jobs it didn't matter, this other 40 percent, and those are the ones that in some cases already had the number of workers set to the number of map shards, or they were very small, in which case it didn't matter. So if we look at sort of the median of the jobs that actually do experience a speed-up and look at how big that speed-up is, we're sort of looking at between 3 and 4x, which actually is quite considerable. Now the CDF does, of course, say in the 90th percentile up there, you can get a hundred-x speed-up. That might or might not be the case, and we actually found some jobs that were in that sort of category where the configured number of workers was tiny compared to the amount of work that was supposed to be done. Those jobs had been explicitly throttled by their designer, and when we talked to them and said, hey, we could make your job a lot faster by giving you many more workers, they said, no, no, no, we don't actually want that. It's deliberate that they are being throttled. So it doesn't always help. But it's a nice qualitative result that by having custom scheduling logic that takes into account the sort of opportunistically available resources in the cluster, which it can actually very easily inspect by just looking at the replica of the cluster state that it has available to it, we can achieve this kind of benefit and flexibility. So let's finish up this Omega stuff. We found that for Google's workloads, flexibility and scale in scheduling require parallelism. They require disaggregation into different schedulers. And it can work if we do it right, and the use of shared state with the optimistically concurrent transaction model is the right way of doing it, or at least it is for Google. So I can stop here, if people are sort of a bit talked out. It's been about an hour. I can talk about some of the work that I have done since, or you know, whatever you prefer.
We can have questions. >>: Do you have a short version of the work you're doing now? >> Malte Schwarzkopf: Yeah, it's not going to take another hour [laughter]. It's not dimensioned for that, but there was a question back there. Shall we talk about that first? >>: Regarding this busyness, so parallel scheduling itself really does not reduce busyness. >> Malte Schwarzkopf: No. >>: Unless you consider using additional resources to make scheduling decisions. So is this implicitly shown in your graph? >> Malte Schwarzkopf: So what you're saying is if you do things in parallel it doesn't reduce the amount of work you're doing. That's correct. But if you run things in parallel, then you're sort of, by assumption, giving it extra resources to do the work, right? If I have two threads and they work perfectly in parallel, then I've used fifty percent more CPU time. Yeah, in any given time tick, but I'm getting twice as much work done ideally. >>: This is probably fine in this context. I'm just trying to see how you measure the busyness. >> Malte Schwarzkopf: Yeah, so in the 3D graphs we had this discussion about the step. In the 3D graphs -- in the initial one -- we were measuring the overall busyness of the scheduler, but there was only one scheduler. And then when we had multiple schedulers, we only looked at the service scheduler because that was the one where the shape was actually varying, whereas for the batch scheduler, because we didn't modify -- we didn't vary the decision time, it was just flat. And this offset. The work is still there. It's just done in a different thread, if you will. So it's being done, and the overall busyness, the cumulative busyness, remains the same, but it's being done sort of in parallel. But you have, you know, to use more resources to make the decisions. In practice Google is not looking at having 500 schedulers, right. Google is looking at having a handful, and you can easily run this on a single multicore machine. And it would work pretty well. >>: Another last question is, at the very beginning you mentioned this two-level scheduling. You don't really have a comparison example, as in the later parallel slide, comparing two-level scheduling with how Omega works. So do you have some other results like -- >> Malte Schwarzkopf: So I can -- there's a bit more stuff in the paper. There might be something in my backup slides, but they're right at the end, so I'm not going to skip ahead to them now. But I can tell you informally what our experience with two-level scheduling was. We used Mesos, the Mesos model, because Mesos is a real thing that is out there and that people are using. So it's an implementation of two-level scheduling. It's an implementation that's based on a bunch of assumptions. So one of the assumptions they make is high churn. One of the assumptions they make is schedulers don't take extremely long to make their decisions. Now unfortunately, with our premises we violated both of these assumptions, and therefore, it didn't work particularly well. So you know, full credit -- one of my coauthors is one of the Mesos authors, so, you know, we weren't deliberately trying to screw them over. They optimized for a different setting. You could, of course, build a two-level scheduler system that did not necessarily have the limitations that Mesos has. So you could have one that does smarter dynamic partitioning, for example.
But we think it would (a) be more complicated and (b) some of the nice things you can do with Omega you couldn't do in that setting. So if you think about the MapReduce scheduler that inspects the spare resources available in the cluster, if you had a dynamic partitioned scheme and the MapReduce scheduler was only seeing its dynamic partition of the overall resources, then it might not actually get a very good idea of how much spare resource there is, because some partition -- some resource manager has decided to only give it a slice. Now of course, you could then communicate this out of band in some way or another. Omega is not the only way of doing things, but we found it is a nice simple model that actually captures lots of policies. Does that make sense? >>: It does. I'm just, well, trying to figure out. Some of the constraints you were talking about where you do dynamic partitioning, I think, also exist in the Omega scheduler, and some of those can also be avoided by doing dynamic scheduling while at the same time being aware of the global state. >> Malte Schwarzkopf: I think the answer is yes. So there are some policies that don't work well in Omega. So one good example is if you had a scheduling logic that always picked the least loaded machine. That would be a really bad idea in Omega because then all the schedulers using this logic would hone in on that one machine, and they would basically live-lock each other. So we do assume that your scheduling logic has some element of randomization. We also do assume that your scheduling logic is designed in such a way that the majority of -- that you can actually in most cases schedule on the majority of resources, that you're not sort of picking your favorite machine every time. This is not enforced in Omega. If you build a stupid scheduler, then it will either break the system or not work very well. In practice, in the practical implementation that Google have done since, which I don't know very much about because I had left by the time they had done most of it, but in the practical implementation they are making sure that by writing a stupid or buggy scheduler you're not ruining the entire system; you're just ruining yourself. So in -- Jay in the beginning asked about the code that sort of manages the shared state. And there you can put in a policy -- you can just put in a counter and say, you know, if this guy keeps, you know, causing other people to fail, then we make him fail a little bit so that it gets a bit fairer again. And in practice what would end up happening is a badly written scheduler will just keep failing, and then, you know, someone has an incentive to make it better or make it more -- make it fit the assumptions better. >>: Thank you. >> Malte Schwarzkopf: Okay, great. >>: I'm just trying to understand this a little better. In a closely managed environment like this, where you know what jobs are in the cluster. >> Malte Schwarzkopf: Sure. >>: I see the benefit of Omega especially when you want to enable, say, innovation at the scheduler level. >> Malte Schwarzkopf: Yep. >>: You could let people write different schedulers for different kinds of jobs. The low-level scheduler is really dumb. It just says resource available or not, but in an environment where you control almost everything, right, you might have one set of machines dedicated to certain services. Why is the monolithic approach -- monolithic -- >> Malte Schwarzkopf: Right. I see where you are.
So one thing that I think is sort of a policy decision that Google make is that they don't dedicate machines particularly to things. So, you know, I was an intern, so I don't know everything, but what I do know is that whenever possible they try and use shared infrastructure. So the sort of model where you say these machines over there are Gmail machines, and these machines are the web search machines -- that doesn't exist. They assume sort of a utility cloud model where, you know, there's infrastructure that's used by everyone. So with that premise, monolithic schedulers that sort of just run in their own little playpen and never have to talk to each other become unviable, because everyone is sharing the resources. Now you could, of course, come up with schemes where some resources are sort of preferentially sticky for some workload, and some resources can only be leveraged opportunistically by others, but you could do that in Omega as well. It would complicate the code that manages the shared state, however, and one of the sort of design premises in Omega is that that code should be as simple as possible. It should effectively just be a conflict check. Because, and this is sort of the wider picture of Omega, I suppose, which John [inaudible] can explain much better than I can, because they also want to hook in things like machine management with this same shared-state data structure. So if a machine fails or is taken offline for maintenance, it will just be removed from the shared data structure by another component that is not a scheduler, but instead is a machine manager that says, I'm removing this machine. And that's actually quite a nice abstraction for, you know, sort of unifying the cluster management service. >>: -- one of the things failure to detect [indiscernible] >> Malte Schwarzkopf: Sure. Yeah. This is not the only way of solving this problem, absolutely not. But Google liked this sort of giant, shared infrastructure vision where every cluster management component is using one sort of central, shared notion of the state of the world. And, you know, that's what they're thinking is the right thing to do. It's clearly not the only way you could do it. And it might change; they have very fast turnover of systems there, so, you know. The average lifetime of a system, of a scheduler there is probably, you know, a few years, so we don't know what they're going to be doing in five years' time. Okay. Any other questions? Stop? Look at some more stuff? I don't know how keen people are on going to lunch [laughter]. >>: I don't mind listening. >> Malte Schwarzkopf: Well, we can go a little further. So one of the things that people asked a lot when I presented this paper at EuroSys was, you know, why might scheduling take a hundred seconds? That seems outrageous. And in the beginning I sort of said, oh, you know, Monte Carlo stuff, constraint solving, blah, blah, blah, it takes long. And that is sort of the fact at Google, that, you know, these things do take long, but it's not a very concrete example of why something would take that long. So I'll show you one example of something that's probably a bit closer to home for Microsoft and that actually does cause very long scheduling decisions. There's a T on the second line there. And this is relevant to a new system that I'm working on which is called Firmament, and that system actually extends Quincy, which is why I said it's a little closer to home for Microsoft.
So the premise of Firmament is that machines don't really look like this anymore. That's what they looked like in 2009. We now look at a complex sort of architecture like this, or indeed, if you have an Intel machine, it might even look like this. You've got deep cache hierarchies, multiple memory controllers, hyperthreading, all sorts of architectural complications that weren't really there a few years ago. And this matters for data centers because these machines are being shared. Again, in the sort of Google infrastructure model, where you might have tasks from completely different jobs running on different parts of the system, the sharing of resources really does matter to the way your tasks will proceed. So here's a little experiment that I have done using some SPEC CPU benchmarks. This is basically all of the SPEC CPU benchmarks arranged roughly in order of how memory-intensive they are, as in how much of their time they spend accessing memory as opposed to doing CPU-bound work, on a fairly recent AMD Opteron machine. And the heat map shows how much a SPEC CPU benchmark degrades as a result of sharing a level two cache with another instance of a SPEC CPU 2006 benchmark. So on a twelve-core machine there's only two things running, but they do run atop the same L2 cache. And you can see that in a bunch of cases it gets very red here, which is 1.5, so that's a fifty percent degradation. By degradation, I mean increased runtime. But actually the scale is clamped to 1.5 here to bring out the differences; if you leave it free, it's up to about 7x in the worst cases. And as you would expect when you're sharing a cache and other levels of the memory hierarchy, if the workload is more memory bound, the interference is worse. I should have said which way round this is, and it's hard to get right, so I'll be very careful. The diagram shows that the left-hand workload, in this case lbm, which is highly memory intensive, causes the degradation indicated by the color of the workload on the x-axis. So what it means in practice is lbm running next to gromacs will cause gromacs to take fifty percent longer than gromacs would do if it was running in isolation on an otherwise idle machine. So the baseline is the workload running in isolation on an idle machine. Now, some of you might be thinking, well, sometimes it looks like it's getting faster when you're sharing things; that seems a bit odd. And it turns out the baseline of this experiment was a little bit buggy, because I measured the isolated runtime without first warming up the buffer cache. So because the isolation case was the first in a series of experiments to be run, the baseline time is actually higher than it would be with a warmed-up buffer cache, which was the case for all of the other experiments. So what I should have done is either drop the buffer cache in between experiments or warm it up before running the experiment. But because this 28 x 28 matrix takes about a month to generate, I think I've only by now rerun the baseline so I can replot this, and I haven't yet got round to actually doing the plotting. So I assume that these blue artifacts will go away, and in fact, I've verified for a small subset that they do go away when you're running on a warm buffer cache. But the degradation is only going to get worse as a result of this; it's not going to get better if this becomes 1.0.
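As a sketch of how a pairwise interference matrix like the heat map just described might be produced: the two run_* functions below are placeholders for actually launching benchmarks pinned to cores, and the baseline is taken with a warm buffer cache, reflecting the corrected methodology mentioned above. The entries are the degradation ratios shown in the heat map.

    import itertools

    def run_alone(benchmark):
        """Placeholder: run `benchmark` pinned to one core on an idle machine, return seconds."""
        raise NotImplementedError

    def run_sharing_l2(aggressor, victim):
        """Placeholder: run both pinned to two cores sharing an L2 cache, return the victim's seconds."""
        raise NotImplementedError

    def interference_matrix(benchmarks):
        # Warm the buffer cache before taking the isolated baselines, so the
        # baseline isn't inflated relative to the later co-run measurements.
        for b in benchmarks:
            run_alone(b)
        baseline = {b: run_alone(b) for b in benchmarks}

        degradation = {}
        for aggressor, victim in itertools.product(benchmarks, repeat=2):
            # > 1.0 means the victim slowed down next to this aggressor; the
            # heat map clamps these ratios at 1.5 to bring out the differences.
            degradation[(aggressor, victim)] = run_sharing_l2(aggressor, victim) / baseline[victim]
        return degradation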
So that was only two tasks on an otherwise idle machine, but of course, if you're running in a data center, you will want to pack the entire machine with work, ideally. And so I did that with the SPEC CPU benchmarks and again ordered them by memory intensity. I don't expect you to follow all of these bars, but if we only look at the worst case, this gray bar here: on a 12-core machine we run 12 instances of the lbm benchmark on all of the cores. And as a result, the runtime is almost 5x the isolated runtime, because they are just evicting each other's cache lines all the time and contending for the memory bus and all of these sorts of things. Towards the left-hand edge, which is off the screen, with the CPU-bound benchmarks you find that the degradation is almost zero because they just run in their own L1 cache and, you know, do CPU-bound work. So it can get quite bad, but of course, that's SPEC CPU, and SPEC CPU is maybe not a super representative benchmark for data center workloads. So we also looked at a realistic sort of data center application, which in this case is Spark, the sort of Berkeley parallel data processing framework. And this is actually an artifact from a real experiment we did for a different paper. We ran Spark on a simple string processing workload. So this is taking an input string and splitting a column out of it, so it's a column projection on a multicolumn, space-separated input string, for reasonably big but not enormous inputs, because we only had a couple of machines. And on this cluster of, I think it is, six or seven machines, one of our machines is a 48-core machine. The others are smaller machines, because we're a poor university and we can't afford lots of homogeneous machines. What we found was that if you look at the runtime here, the blue line, the 8C version, is Spark limited to using only eight cores on each machine, whereas in the red line we let it use as many as it wants, which means on the 48-core machine it's using 48 cores in the red case and only eight cores in the blue case. And it turns out that Spark actually got faster when you gave it fewer cores, and the reason for this was that it was contending for resources on the machine, which is not a huge surprise if you think about it. But for us when we ran the experiment, it was a bit surprising that removing cores made the benchmark go faster, so this interference does happen in the real world. And, in fact, there are 76 cores that we've removed in total by restricting all cluster machines to only eight cores running Spark, and as a result, it got twice as fast. So that's just one thing. You obviously can contend for network access as well. You can have hard and soft constraints. You can have accelerators. You can have deadlines, SLAs, special resources, SSDs, GPGPUs that you don't have in every machine. All of this comes into the scheduling algorithm. So what's a good scheduling mechanism to capture all of these dimensions? Well, actually, if you think about it, Quincy is quite well placed for this, and I really like the Quincy work. I think, you know, it's sad that nobody has ever done any more work on this, which is why I'm now doing it, I suppose. But the Quincy scheduler is based on a global, sort of cluster-wide optimization of a flow problem. So it actually models scheduling as a min-cost, max-flow optimization. I'll just explain this very quickly because I assume that people here are reasonably familiar with this.
But basically Quincy has this graph where it has machines, racks, and what they call a cluster aggregator, which is effectively a "don't care" node that means the task can schedule anywhere. And then there's a sink that sucks the flow out of the system. All of the machines where tasks can schedule are connected to the sink, and the tasks coming out of different jobs on the left-hand side here are the sources of flow. So overall, the optimization is to get flow going from the sources to the sink in such a way that the overall cost of routing the entire flow to the sink is minimized. And in order to be able to do this routing, obviously we have to connect things up a little bit more. So because there might be more tasks than there is room to run them in the cluster, there must be an unscheduled node through which we can send flow if a task is not currently scheduled. Every task is connected to that unscheduled node and also to the cluster aggregator, which means it can schedule anywhere. And if a task like T0,0 here ends up routing flow like this, routing it through machine M1, then effectively what that means in the Quincy world is that it's scheduled on machine one. Or if it's routing its flow through the unscheduled node, it's not scheduled. You can also have preferences that point directly into other parts of this hierarchy. So thinking about this sort of degradation problem as a result of interference, what we would actually like to do in that case is we'd like to combine cache-sensitive and cache-insensitive workloads, to put memory-bound tasks next to CPU-bound tasks, for example, because that will maximize the overall utility -- use CPU-bound tasks to fill in gaps and so on. So how can we express that in this Quincy paradigm? Well, one thing we can do, and this is how this work sort of got started, is we can actually extend the leaves of the graph from machines to individual cores. So we say, well, okay, instead of having machines connected to the sink, we actually model the entire architectural hierarchy of the machine, with all the caches and the cores and even the hyperthreads if you want, as part of this graph. So we can do a global optimization over the placement of tasks on individual cores in the machine. And if we do that, we obviously need to classify workloads in some way, so we need to have some sort of equivalence class of a CPU-bound and a memory-bound workload. Ideally, again, we would do this automatically; we wouldn't need the user to specify it. The heuristic that we're using in Firmament is sort of an animal zoo. This is actually not my invention. It was invented by people who did work on cache sharing in architecture in the early 2000s. And I think they've added a bunch more since, but they came up with these categories of devils, turtles, rabbits, and sheep. And I don't need to explain those in detail, but effectively, devils are really bad. They cause degradation if anything runs close to them. Turtles are super gentle. They're the CPU-bound tasks that will just sit there and do their work, never minding what anyone else is doing in the system. And rabbits and sheep are various different increments in between. So what we've done in Firmament is we've leveraged the performance counters that you find in modern CPUs to measure a bunch of things at low overhead. So we measure things like the memory accesses per instruction. We measure the cycles per instruction.
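Here is a toy version of such a scheduling flow graph, expressed with networkx's min-cost flow solver. This is an illustration of the modeling idea rather than the Quincy or Firmament implementation, and all of the edge costs are made up: two tasks can go through the cluster aggregator to any machine, T0 additionally has a cheap preference edge to one machine, and staying unscheduled is possible but expensive.

    import networkx as nx

    G = nx.DiGraph()

    # Two tasks each inject one unit of flow; the sink absorbs both units.
    for t in ["T0", "T1"]:
        G.add_node(t, demand=-1)
    G.add_node("S", demand=2)

    # Every task can go to the cluster aggregator X ("don't care, schedule
    # anywhere") or, at a high cost, route through the unscheduled node U.
    for t in ["T0", "T1"]:
        G.add_edge(t, "X", weight=5, capacity=1)
        G.add_edge(t, "U", weight=10, capacity=1)
    # T0 also has a cheap preference edge straight to machine M0.
    G.add_edge("T0", "M0", weight=1, capacity=1)

    # The aggregator fans out to the machines; each machine takes one task here.
    for m in ["M0", "M1"]:
        G.add_edge("X", m, weight=0, capacity=1)
        G.add_edge(m, "S", weight=0, capacity=1)
    G.add_edge("U", "S", weight=0, capacity=2)

    flow = nx.min_cost_flow(G)
    # flow["T0"]["M0"] == 1 means T0 is scheduled on M0; flow on a task's edge
    # to U means it stays unscheduled. Flow routed via the aggregator X still
    # has to be decomposed into concrete task-to-machine assignments afterwards.
    print(flow["T0"], flow["T1"])

Running this places T0 on M0 via its preference edge and routes T1 through the aggregator to M1, because that minimizes the total cost of getting both units of flow to the sink.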
We also measure how many cache references there are and how many cache misses. Performance counters aren't perfect, but they're a very low-overhead way of measuring these microarchitectural things. They also lie sometimes, but, you know, it's as good as we can get without much overhead. So we're looking at two examples here. There's one task that is approximating pi, which is a CPU-bound workload; it's just literally computing in the L1 cache. We can see it does about 12,000 instructions for each memory access, so it very rarely accesses memory, and only about two percent of its memory accesses end up causing cache misses. That's what you'd expect for an application like this. The other example is matrix multiplication on a big matrix. This was set up for a matrix that, you know, definitely doesn't fit in the L1 cache and probably didn't fit into L3 either. We find that it only does 40 instructions per memory access, because memory accesses are quite common in that sort of workload, it being a matrix multiplication. And 65 percent of the memory accesses end up causing last-level cache misses in this implementation. >>: You talk about 65 percent causing a miss. That implies a given size for the cache, and that size is a thing that's going to change when you put a contending workload next to it. >> Malte Schwarzkopf: Yes. >>: The metric you really want there is what cache size produces a 90 percent hit ratio or an 80 percent. >> Malte Schwarzkopf: Yeah, and in fact, that's what the cache-sharing architecture people look at. It's a bit hard to figure this out dynamically if you're running on a real machine rather than an architecture simulator. >>: There's not a good way to estimate that using -- >> Malte Schwarzkopf: No. You can use something like Cachegrind, but yeah, it's a simulator. A bunch of people have done work in this area where they used offline simulation with those sorts of tools to predict how much cache a workload needs, or what its cache footprint would be. So let's assume we've done this, so we've classified some of our tasks as devils and we've classified some of our tasks as turtles. >>: Can you go back? >> Malte Schwarzkopf: Sure. >>: Can you use that model to predict the red and blue picture? >> Malte Schwarzkopf: To some extent. So if you look at the cache-sharing papers, as a result of having these animals they come up with a bunch of heuristics where they say putting devils next to turtles is good, putting devils next to rabbits is really bad, putting sheep next to rabbits has no effect. So you can sort of come up with a taxonomy of what goes well with what. You can't exactly predict what the heat map will look like, but you can say sort of the bottom of the heat map would be the devils, and the top would be the turtles. So you can predict the general place that a task falls into in the sort of hierarchy of how interfering something is, if you can classify it as a devil. Also, of course, tasks go through phases. Some workloads have memory-intensive phases and compute-intensive phases, and we're not capturing that perfectly. You could do a better job there, but we're trying to do this, you know, in production with almost no overhead, as opposed to offline. >>: Based on your previous experience in Omega with the schedulers there, did they do something very, very similar? >> Malte Schwarzkopf: No. Well, for this, they don't do anything. They just suck up the interference. In fact, that's not quite true.
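As a rough illustration of this kind of counter-based classification, here is a sketch that maps the measurements just listed onto the animal classes. The thresholds are invented for illustration and are not Firmament's actual rules, but the two example workloads above land where you would expect.

    from dataclasses import dataclass

    @dataclass
    class TaskCounters:
        instructions: int
        memory_accesses: int
        llc_references: int
        llc_misses: int

    def classify(c: TaskCounters) -> str:
        """Map raw performance-counter readings to a rough 'animal' class."""
        instr_per_access = c.instructions / max(c.memory_accesses, 1)
        miss_rate = c.llc_misses / max(c.llc_references, 1)
        if instr_per_access > 1000 and miss_rate < 0.05:
            return "turtle"   # CPU-bound, mostly lives in its own cache
        if instr_per_access < 100 and miss_rate > 0.5:
            return "devil"    # memory-bound, hammers shared caches and the memory bus
        return "sheep" if miss_rate < 0.2 else "rabbit"   # the in-between classes

    # The pi approximation and matrix multiplication described above, using the
    # ratios quoted in the talk (roughly 12,000 vs. 40 instructions per memory
    # access, 2 vs. 65 percent last-level miss rates); absolute counts invented.
    pi_task = TaskCounters(1_200_000_000, 100_000, 80_000, 1_600)
    mm_task = TaskCounters(400_000_000, 10_000_000, 9_000_000, 5_850_000)
    print(classify(pi_task), classify(mm_task))   # -> turtle devil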
Google published a paper called CPI² at the same EuroSys where the Omega paper was. What they do do is they sample these sorts of performance counters to detect when really bad interference is happening, and then they move the task somewhere else. But it's reactive rather than proactive, and they only sample about one percent of tasks; they don't do it for everything. And certainly, the scheduler has nothing to do with it. It's a completely separate part of the system that rebalances these sorts of things. So if we wanted to do this in the scheduler using the Quincy model, what we can do is we can classify our tasks and we can then add another aggregator node into this flow graph, which is the devil aggregator up here. And we have a turtle aggregator for the equivalence class of turtles here, and then we connect our tasks to those. So let's say we have a devil running here on C0, and we now have a turtle that we would like to schedule. So based on our equivalence classes and based on the reasoning about how things go well together, what we would like to do is for the turtle to end up here, because then we use this machine, this part of the machine, in the most efficient way. It's not really helping us if it ends up here, or if we put another devil here, or a rabbit or a sheep or something like that. So we'd like to create an incentive for the scheduler to put the turtle into that place, but we don't want a hard constraint, because, you know, we're probably better off with it scheduling in a suboptimal place than not scheduling at all. So what we can do is we can put a preference edge from the turtle aggregator to the node representing this L2 cache and say, well, we have a strong preference for putting turtles underneath that L2 cache because we already have a devil here. And then when the flow optimizer runs it will, with great likelihood, put a turtle there, but it might not end up doing so if there's some other globally optimal solution. If it did put a turtle there, then the flow gets routed like this, and the turtle gets scheduled, and we're all happy. And the same happens for the devil, which might end up going there. So what is the impact of this? We've built the system. We've tested it. This is an indicative result; it's not, you know, a fully-fledged SOSP evaluation or anything like that. But remember the pi approximation and the matrix multiplication. They would fall into the equivalence classes of turtle and devil. If we run them in isolation, the pi approximation takes about 6.3 seconds and the matrix multiplication takes about 20 seconds on an otherwise idle machine, so in a very unrealistic environment for a cluster. If we use a simple queue-based greedy scheduler -- nothing to do with Quincy, no flow optimization, just, you know, the sort of state-of-the-art, simple queue-based scheduler that doesn't take a long time to make a decision but maybe makes somewhat suboptimal decisions -- and it uses a random placement strategy, then we end up with a 4x degradation in the runtime of the matrix multiplication, while even the turtle, the pi approximation, degrades. And this is with multiple matrix multiplication tasks, so it's not a single task; this is an average value over, I think, 12 matrix multiplication tasks and 12 pi approximation tasks. So sometimes the greedy scheduler ends up putting things next to each other in a bad way, and therefore we end up with this degradation. >>: How many resources are you using for this experiment?
>> Malte Schwarzkopf: This was using one machine, where we either allow the greedy scheduler to place things on this one machine in any way it likes, or we use the flow optimization. Obviously, we can also do this across multiple machines, but that's not happening in this experiment. So if we use the Quincy flow modeling for the scheduling problem and we put in some of the stuff that I discussed -- preference for turtles next to devils, you know, interleaving of these categories in this way -- we end up improving by 2x over the greedy scheduler. It's still 2x worse than running in isolation, but running in isolation is not really an option if you have a data center full of 48-core machines and you want to use them reasonably efficiently as opposed to just running one task per machine. >>: If you run these two jobs in sequence, they'd still take less time than running them together. Am I misunderstanding your numbers here? If you run all the turtle jobs and it takes seven seconds, and then you run all the devil jobs, and it takes twenty seconds, then you're done 27 seconds later. >> Malte Schwarzkopf: That's true, but again, in a real data center environment, you can't just run your entire workload in sequence, right. For these particular jobs, yes. >>: But for lower chunks -- in other words, you can't do that at the data center granularity, but you can probably do it at the L3 cache granularity and maybe at the machine granularity -- >> Malte Schwarzkopf: Yes, sure. >>: It seems like you'd want to exploit those opportunities if the answer is that -- >> Malte Schwarzkopf: Yes. >>: Maybe you should try even harder to keep these things apart, because maybe running in isolation is actually a better overall utilization. >> Malte Schwarzkopf: So one thing we found with this experiment was actually if you don't pin things to cores and you let the OS scheduler make decisions, it performs somewhat better, because it does what you say: it schedules things in sequence in a reasonable way. But the problem with that is the OS scheduler can only see one machine, right. If the cluster scheduler decides to give 12 memory-bound devil tasks to that machine and decides to put all the turtles on the machine next door, the OS scheduler can't do anything. My argument is that the cluster scheduler should be taking care of this, and I think we're in agreement on that -- that you should try hard. Of course, there's limits to this, because the cluster scheduler will be far away from the machine, so it can't make decisions on a microsecond granularity just because of the communication delay, but maybe it can amortize. It can sort of load a schedule into the thread scheduler and say, I think this is a good order of threads, you know, a good order of running the different threads on the different cores in your machine. We haven't investigated that yet, but that would be nice. >>: I guess the point I'm trying to make is that if your scheduler is producing runtimes that are twice as bad as just running the jobs sequentially, then you're probably trying too hard or something, right? Something has gone off the rails, right. >> Malte Schwarzkopf: Sure, but do bear in mind that this case is underutilizing the infrastructure, and you might not have the luxury of running things sequentially. >>: It's underutilizing it only on the temporal axis, but down here, that bottom one took 48 seconds to get the same work done, so it was underutilizing at a very fine grain. >> Malte Schwarzkopf: Sure.
>>: It's sort of silly to say that that core is going to waste if adding it makes the whole system function slower. >> Malte Schwarzkopf: But what I'm saying is this 20-second number is unrealistic in the sense that in a real-world setting you will have other things running on the same machine at the same time, and these things will make that number worse. I don't know how much they're going to make it worse. In the best case they will make it not much worse. In the worst case they will make it a lot worse, but you're not going to get this number in reality unless you're underutilizing your machine. >>: So let me see if I understand -- maybe there's something that could clarify this question -- to understand the setting. When you're saying you're running a task in isolation. >> Malte Schwarzkopf: Yes. >>: You're utilizing one thread and everything else in that machine is empty. >> Malte Schwarzkopf: Correct. >>: When you're doing simple greedy and flow interleave you're effectively using all the threads in that machine. >> Malte Schwarzkopf: Yes. >>: So if you were trying to use all the threads in the machine, you wouldn't get those numbers. >> Malte Schwarzkopf: That's what I'm saying. >>: No matter which scheduler you are using. >> Malte Schwarzkopf: I think John is making a valid point in that he is saying, you know, in the ideal case, if you were an omniscient scheduler and you did everything right and you could perfectly disaggregate all of the interference effects in the machine, then you might get close to that. >>: I'm saying something simpler, which is if I have those two jobs I'd like to run and then a whole bunch of other jobs, and I've got a machine with 48 cores, one way I could achieve those jobs is I could do all the turtle jobs, and that'll be done seven seconds later. Then I can do all the devil jobs, and I'll be done 20 seconds later. >> Malte Schwarzkopf: No, you won't, because the devil jobs -- if you do all the devil jobs on that 48-core machine they will take longer than 20 seconds. >>: There's only one unit's worth of work in the first row, whereas there are 12 units of work in a -- >>: Thank you -- >>: Okay. That explains the mystery. That was the answer that I was looking for. Now I understand. >> Malte Schwarzkopf: So you're right. Sorry, I should have made this clearer. This is doing 12 times as much work as this. If you were doing the same amount of work, then it would actually take correspondingly longer to do this. >>: Okay. >> Malte Schwarzkopf: Because, as you said, you do it in sequence. Okay, sorry. So this is a nice indicative result, but obviously, this approach is not trivial. We have to worry about what costs we put into the flow graph. They must have a total -- they must achieve a total ordering, for example. So if you wanted to use a multidimensional vector to, for example, express the costs of contending for caches, network access, disk access, then that's great, but you have to be able to totally order your cost vectors in some way or another. For integers it's simple, but in reality you do have multidimensional costs: if you put a turtle here, that might be good for cache interference, but it might be bad for disk access, for example. These things are not quite as clear-cut as I've made them sound so far. So there's some complexity there.
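One simple way to get the total ordering mentioned above, sketched under the assumption that fixed per-dimension weights are acceptable, is to collapse each multidimensional cost vector into a single integer; a lexicographic ordering would be another option.

    # Illustrative only: the weights and dimension names are made up.
    WEIGHTS = {"cache": 10, "network": 5, "disk": 2}

    def scalar_cost(cost_vector):
        """Flatten a per-dimension penalty dict into one totally ordered integer."""
        return sum(WEIGHTS[dim] * penalty for dim, penalty in cost_vector.items())

    # A placement that is good for cache interference but bad for disk access:
    print(scalar_cost({"cache": 1, "network": 0, "disk": 8}))   # -> 26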
One thing this approach cannot deal with, and Quincy can't deal with either, is combinatorial constraints. If you have a combinatorial constraint of the form "I never want more than one task of my job to run on any one machine, because I want to achieve fault tolerance," for example, we can't do that, because the flow optimization might end up putting two tasks on the same machine, because everyone is always connected to the "don't care; can schedule anywhere" node. We could tweak things around to approximate it, but we can't have any combinatorial constraints; that just won't work. It does work if you do multiple rounds of scheduling and only schedule one task from the job per round, but then again, it gets complicated. And of course, and this is how I started talking about this work, the runtime of the flow optimizer can get quite big if the graph is quite big. So here we've got a simple comparison of, again, the greedy queue-based scheduler and the flow scheduler, and this is a CDF of the per-task decision time for a very simple-minded job. You can see the greedy queue-based scheduler will take, you know, at most 20 milliseconds to make a decision -- this is a very unoptimized implementation -- whereas the flow optimizer for a single machine -- again, this is only considering a single machine, not a whole data center -- will take up to fifty milliseconds. And then as you scale this up, from one machine to 35,000 machines in this case, with the same amount of work being scheduled but the amount of resources available increased, we end up basically with linear scalability in the number of machines. But of course, if you increase the number of machines, you will also have more work to do, so we also increase the number of jobs; in that experiment, the number of machines is constant but the number of jobs is increasing, and again, we end up with linear scalability. The reason for this is that effectively the flow optimizer is linear in the number of edges that exist in the graph, and both increasing the number of machines and increasing the amount of work increases the number of edges. Now, what I'd like you to take away from this sort of high-level description of Firmament is that we can use an extension of the Quincy approach to reduce interference -- and, you know, there's some more work to be done, but we can have an effect -- at the cost of a real scheduler possibly running for multiple seconds. So if you look at the case of 8,000 or 9,000 jobs which have a hundred tasks each -- so that's just under a million tasks running on a large 35,000-machine data center -- we're looking at 90 seconds of scheduler runtime at the top of the y-axis there, which is in the same ballpark as the sort of worst case we considered for Omega. So this is a practical example of something that might take so long. Now, of course, if you can build an incremental min-cost, max-flow solver -- in fact, we're working on that right now -- then potentially we can make it much faster by incrementally solving the problem for each newly arrived task as opposed to optimizing the entire cluster every time. But as it stands, it takes quite a long time. I think that's all I have. Here are some other projects that are going on in Cambridge, but I won't talk about those. >> Karin Strauss: Thank you. [applause]