>> Karin Strauss: Welcome. My name is Karin...

advertisement
>> Karin Strauss: Welcome. My name is Karin Strauss. I'm here to introduce Malte Schwarzkopf. I hope
I didn't butcher your name.
>> Malte Schwarzkopf: Yes, very good.
>>: Malte is a last-year Ph.D. student at University of Cambridge and his work is on distribution systems
and operating systems or the combination of both. So Malte will talk about work he's done at University
of Cambridge and at Google today. Thank you very much, Malte.
>> Malte Schwarzkopf: Hello, everyone, and thanks to Karin for hosting me here. So this talk is sort of
coming in two parts, and, you know, we'll go as far through it as we feel like it. But the first part is work
that I did while interning at Google. And then if there's time, I'll move on to talk about some of the
follow-on work that I've been doing back at Cambridge. But the overall umbrella topic here is
scheduling on clusters in data centers. So I'll first talk about this system called Omega, which was
published in EuroSys 2013. Omega is Google's next-generation cluster management system. And Andy,
who was one of the other interns at the time, and myself were quite privileged to make a contribution
to it. It's actually a much bigger system that doesn't just do scheduling, but this talk and this paper are
on the scheduling aspects. So scheduling in this case means the problem of mapping work, such as tasks
that are parts of jobs, to resources which in the data center are typically machines. Now, of course,
there's thousands of machines and there's thousands of tasks in any particular job potentially, so the
scale of this problem can get quite big. Now, at Google one -- a few of the things that people observed
over time is that the scheduling problem has really changed a little bit in the sort of last ten years, and
specifically, while they were running the old cluster management system, which precedes Omega and
sort of dates back about ten years. First of all, workloads are becoming quite diverse in shapes, sizes,
and also in their sort of requirements of the scheduler. This makes the scheduler's job potentially quite
a lot harder as we'll see. And at the same time, the size of the clusters involved is increasing quite
dramatically, so we're looking at, you know, tens of thousands of machines now, and hundreds of
thousands is not completely out of the expectation. And the number of jobs surviving at any particular
time unit is also growing, so all of this combined makes the problem harder. Now, how does a typical
cluster scheduler work? Well, as I said, we've got some work that we need to map to resources, so in
some way the scheduler has to track the state of machines, which is sort of down here. These gray
squares are representing machines, and some of them are running tasks; some of them aren't.
Obviously in practice a machine would run more than one task, but I visualize it as having just one here.
And up there we've got tasks arriving from all sorts of different workloads, and they will proceed
through some scheduling logic in order to be mapped to machines. Now, this scheduling logic could well
be the same for everyone, for every arriving job, but actually in practice what seems to be happening is
that it's actually becoming more and more heterogeneous over time. So some jobs are getting sort of
scheduled in a much simpler fashion by just dropping them wherever, and some jobs have more
stringent requirements, and the scheduling logic is becoming quite elaborate. And in fact it's not
completely out of the realm of possibility for the scheduling logic at Google in practice to take around 60
seconds to place a large job on the cluster just because there's a lot of work to do. Now, why might this
be? There's constraints solvers in some of the things that people want to use, Monte Carlo simulations
and so on, and these things can get quite time consuming for large jobs. So this sort of feature creep in
the scheduling logic actually leads to an increasingly complex scheduler implementation. If you assume
that there's just sort of one implementation. In fact, the previous Google cluster scheduler went for
exactly this process of sort of becoming more and more complex. There were more and more control
paths through it and shortcuts introduced, extensions requested. Different teams wanted different
things, so everyone sort of added their little piece to the pie. But of course, this significantly increases
the complexity of engineering and maintaining the scheduler, so we would like to streamline this and
break the monolithic scheduler up into modular ones. But while these are probably easier to maintain,
they still, at the end, have to share the physical cluster resources in some way, and we have to arbitrate
resources between the different schedulers that are working in the same cluster. Now, there are
various ways this can be done, and people have looked at this in the past. The straightforward one is
sort of you just break the cluster up into multiple logical clusters. You say, well we've got the red
cluster, the blue cluster, and the green cluster, and we just split our resources three ways. Now, of
course, that has a bunch of problems, for example, if the red cluster, say, is a MapReduce cluster, it's
completely full. It's running at high utilization, but there's space in the others, but because of the static
partitioning, the jobs can't grab resources from the other logical clusters. Now, clearly, one possible way
of addressing this is by actually making that partition dynamic, and in fact, this is what some existing
symptoms do. For example, Mesos, the cluster management system from Berkeley, schedules resources
on two levels. So they have a higher-level resource manager and a set of schedulers that I show in red,
blue, and green here, which each work with their resource allocation. So this sort of yellowish resource
manager will carve out a little partition of cluster and offer that to the different schedulers. We'll later
see that that actually has some problems too. So in Omega what we do is take a slightly differently
view. We say, okay, instead of having this resource manager that does arbitration, we just sort of say,
take a very laissez-faire approach and say every scheduler can claim any resource at any point in time if
it wants to. So the entire cluster is, in effect, a big shared-data structure. Now, there's no need for any
reservations, for any a priori coordination between schedulers. They just try, hope for the best, and
then sort out any problems afterwards. So let's see how that works in detail. Consider an example.
Here we've got our cluster stated in sort of being, and this is the shared-data structure and that sort of
represents the ground truth. That sort of represents the state of the cluster at this point in time. So if a
machine fails, it will disappear from this. If a job gets scheduled, it appears in this and so on. Now, we
have two schedulers, red and blue. Each scheduler, internal to itself, has a replica of the cluster state.
So you can see little gray squares there that are just a replica of the data structure down there, and this
replica is frequently updated. We're just sort of pushing diffs effectively of changes that have happened
in reality to sort of bring the replica up to date. Now, let's see what happens when jobs arrive. So in the
red scheduler a bunch of tasks in a job arrive. They're considered for scheduling. And using the red
scheduler's specific scheduling logic, it decides on two machines where these tasks should go. Now, that
happens and the replica is updated, but of course, that still needs to make its way to the shared-data
structure. So a delta, as we call it is, created and sent to the shared cluster state asking to be
committed. Now meanwhile, the blue scheduler has been busy too, and it has two tasks. It considers
them. It finds machines to place them on, and using its specialized logic makes that decision and then
sends a delta to the shared-cluster state. So, in the shared-cluster state, these deltas are applied. And
in this case unfortunately, both the schedulers decided to place a task on the same machine. Now, that
could be fine, but it could also be problematic depending on how big these tasks are and how much
room there is on this machine. So there can be a conflict, which in this case means that the two tasks
that we're trying to place cannot both go on to that machine, as it would lead to overcommit. So what
happens in Omega is one of the schedulers will succeed according to some arbitrative policy. It can just
be first come, first serve and the other one will fail. So the blue scheduler has succeeded in this case
and it gets told your job is now scheduled, whereas the red scheduler is told your job has failed and it
has to try again. Now, clearly these sorts of conflicted scheduling attempts can lead to wasted work.
And in the worst case, the red scheduler would have to try over and over and over again until it
eventually gets its job scheduled. So if it got very unlucky it might never make it, so clearly the viability
of this kind of optimistically concurrent model depends on how often these conflicts roccur and how
well we can avoid them in practice for these sort of real, large-scale workloads. So, after this brief
explanation of how Omega works, I'll now use a set of practical sort of case studies to investigate how
viable the model is and answer that question I've just posed. Yeah? Sure.
>>: Who makes the decision where it is allocated?
>> Malte Schwarzkopf: You mean how does blue make the decision of using these two machines?
>>: Blue made the decision of one of those two machines and requested those two machines. Then
someone presumably said yes, you have these two machines.
>> Malte Schwarzkopf: Yeah. That would be the code maintaining this data structure, but that code is
very, very simple. It just, you know, the first one that arrives and fits gets the resources. It's not -- there
is -- there is a FRED running a sort of admission to the data structure, but it's very simple code. There's
no complicated policy here.
>>: [indiscernible]
>> Malte Schwarzkopf: No. That's in one place.
>>: What if the red job just has tons of tasks arriving but they're very low priority? Wouldn't red wind
up filling everything?
>> Malte Schwarzkopf: Exactly. And that's going to be one of the cases that I'll sort of look at. So the
concrete example is if red is a batch scheduler, and, it has large MapReduce jobs, then if he tries to
touch five thousand machines every time he schedules a job, then he might be causing other people to
get basically get starved. And we have to somehow deal with that without incurring coordination
overhead, and you know, restricting red's ability to actually get his jobs started. So you know, you don't
want him to starve other people, but you also don't want him to get starved as a result of other people
having touched the machines that he wants to use. So priorities are with one way after doing this, and
we'll look at some other ways in a minute. So, in fact, let me take that directly from here. In this work
we consider a simple sort of two-way workload split where we have batch jobs and service jobs. And
this is a real differentiation that Google are actually doing in Omega. They also have some other
categories, but these are sort of the most important ones and also the most challenging ones to deal
with. Now, batch jobs run for, you know, some amount of time and finish. So a MapReduce job is sort
of the canonical example of this, whereas service jobs are those production jobs that require careful
scheduling schedule for optimal resource allocation and which run for a very long time. So conceptually
they might run forever. For example, if you run the gmail backend, you start it up and it remains
running until either the machine fails or the scheduler ends up moving it somewhere else. But
assumption is it will basically go ad infinitum. And it is very important that it meets its SLAs and doesn't
get placed in a bad -- in a bad spot. So let's look how this actually -- these two categories actually break
up the real-world cluster workload that we're seeing on Google clusters. We're having a look at free
representative Google clusters here. We'll call them A, B, and C. They differ a little bit in their size and
utilization, but important to point out is that C cluster is the one for this Google released a trace in 2011.
So the workload that's covered here is actually the same as in the publicly released trace, whereas the
other two, they don't releases traces for. So what the bar chart shows here for each cluster is the
relative share of batch jobs and service jobs. Batch jobs are shown in solid color, whereas the service
jobs are shown as the dotted portion. And the four categories that we're looking at here, are the
number of jobs, the number of tasks, and the aggregate CPU and the aggregate RAM resources used by
these tasks. So if we look at this, the takeaway is pretty clear. The large, large, large majority of jobs fall
into the batch category. It's way above 90 percent. It's really hard to actually see the service jobs on
the left-hand two bars, but a lot more resources, in turn, are consumed by service jobs. In fact, more
than half of the resources in all of these clusters actually are devoted to running service jobs. So clearly
it makes a lot of sense to invest careful scheduling into the service jobs because they run for a long time
and use a lot of resources, whereas the batch jobs are much more sort of throw-away, but much more
numerous. So looking at some other statists, we find that the individual jobs have different priorities
too. I've already touched on some of those. Batch jobs are typically much shorter. Their sort of median
is about twelve to twenty minutes for a job, and some of them are a lot shorter than that. Some of
them are a bit longer, and they arrive a lot more often. So in the traces that we considered, the interarrival times of batch jobs were sort of every four to seven seconds, and in peak times it can be, you
know, several ones per second as well in big clusters. Whereas service jobs -- well, the traces we looked
at were for 29 days, and the median job runs for 29 days, so conceptually it runs ad infinitum. If we had
looked at a longer trace, we might have seen something different. But they run for a very long time
typically, and they don't arrive at often. Their sort of scheduling requests arrive on the granularity of
tens of minutes rather than every second. So let's see with that sort of workload characterization in
mind, how the Omega approach actually compares to the competition of the other approaches that I
showed you earlier. So the way we did this is we wrote a simple simulator that can simulate all of the
architect scheduler architectures that we're interested in. So it can simulate a monolithic scheduler. It
can simulate a Mesos-style two-level scheduler architecture. It can simulate Omega. And this simulator,
by virtue of being a simulator, does simplify things a little bit. For example, it uses empirical
distributions for the sizes and shapes of jobs and tasks by sort of just sampling traces that we derived,
and it uses a simple placement algorithm, but it does allow us to implement all of these architectures
and do a like-with-like comparison. And the simulator is open source, so if you're interested, you can
look at it. It's written in Scholar, I think, most of it. Now, how do we model scheduler decision time?
Because that's obviously going to be very important for this sort of work, because the time when the
scheduler is trying to make its decision is the time when it's vulnerable to competing updates from other
schedulers. And it's also the time that the logic takes to make a decision, and we are sort of bounded by
the fact that we can optimize the logic only to some extent if we're doing a large-scale sort of constraint
solution, something that will just take a relatively long amount of time and make the scheduler
vulnerable. So the way we model this in our simulation is we say, well, there's a per-task amount of
work that you have to do, and we typically set that to about five milliseconds as a sort of conservative,
generous overestimate. And this is the work that, if you have twice as many tasks you have to do twice
as much of this work. But of course, there's also a constant per-job overhead that is just the overhead
of dealing with a job, and we typically set this to a hundred milliseconds, which again, is a conservative
over approximation. In practice, it would be much faster in the real world, but we're making things
deliberately difficult for the Omega model here in order to see its limits. So let's see how we do. In this
first experiment, we are going to see a couple of 3D graphs, and I'll take a moment to explain them
because 3D graphs are a bit complicated sometimes. So we start off looking at a monolithic scheduler.
So this is the case of the monolithic scheduler where there's one big scheduler that handles all the jobs
and has all of scheduling logic in it. We sort of dismissed this in the beginning because it's hard to
engineer, but we're taking it as a baseline for our evaluation. So we call this a single-path scheduler. It
applies the same logic to the same scheduling software to everyone, and in inside that logic, we might
be taking different control-flow paths, but it will take roughly as long. Now on this diagram we vary -on this axis, we vary the per-job overhead for all of the jobs. So here, we've got a hundred milliseconds,
and here we vary it up to hundred seconds and this would be one millisecond. So it's a log scale, and as
we go along, sort of the per-job scheduling time grows. And on the other axis we vary the per-task
decision time. So again, we see sort of one millisecond here, and it goes all the way up to one second
per task. So at the far end of the diagram is sort of you take a hundred seconds for the job and then
another one second for each task, so that's a pretty ridiculous case. And the Z axis shows a metric that
we call the scheduler busyness. Now, what you can imagine that as sort of the value that top would
show for the CPU in your machine. It's basically what fraction of time does the scheduler spend making
decisions and what fraction of the time is it idle and not doing anything? Now obviously, this can't be
greater than 1.0 because then it's making decisions all the time. The color is adding another dimension,
which actually is quite intuitive if you think about it. When the utilization of the scheduler hits a
hundred percent and it's making decisions all the time, then of course, it might not be able to keep up
with the workload. So wherever the diagram is colored red, we actually failed to schedule all of the
jobs, and we started accumulating a larger and larger backlog until eventually things fell over. Whereas
where things are blue, all jobs got scheduled. Everything is great. There's also an intermediate bit
where sort of some, temporarily jobs don't get scheduled, but then the scheduler catches up again.
Which is sort, you know, of somewhere in the sort of purple range in this diagram. So as we can see, the
monolithic scheduler doesn't do very well when we have it spend a long time making its decision, and
that's what we would have expected. It's not very realistic to do this. You would never use a hundred
seconds in a monolithic scheduler for every job, but it serves as a useful baseline.
>>: I know this is a detail, but I see a red dot or portion that's below 1.
>> Malte Schwarzkopf: Yeah.
>>: What's going on there?
>> Malte Schwarzkopf: There's a couple of rounding noise in the values we get from this, because when
we reason the simulator, it did generate the same sequence of jobs, but at the end of the simulation, a
job might be halfway scheduled. It will never schedule. I will not be accounted as scheduled, so there's
a bit of random noise here. But we actually looked at the trace, and it did schedule. It did spend most
of the time -- it's never going to be exactly a hundred percent, but there were tiny gaps. It's effectively a
hundred percent of the time. You couldn't have fit in any more jobs, but that doesn't mean that
necessarily you didn't have short idle periods, especially at the beginning when there aren't that many
jobs around. So it's noise.
>>: Thank you.
>> Malte Schwarzkopf: Okay. So let's consider something a bit more realistic than this sort of slightly
dumb, monolithic single path, everyone has the same experience going through the scheduler situation.
This diagram shows the same metrics for what we call a multipath monolithic scheduler. So this is
effectively a scheduler that starts with a big switch statement that goes if this is a service job, then apply
this scheduling logic. If this is a batch job, apply that scheduling logic. If this is something else, then
apply this scheduling logic. So it's sort of multiple schedulers in one, which is really not nice for
engineering purposes, but it is a sort of plausible thing you could do and you wouldn't have to deal with
all the shared-data structure stuff that we have to deal. Now again, we vary the decision times on the xand y-axis, but here we vary it only for service jobs, not for all jobs because we now have these different
control-flow paths, these different categories of jobs we're dealing with, and we only increase the
decision time for service jobs because these are the ones that typically take a long time to place because
Google is really concerned about placing them well. Whereas batch jobs it doesn't really matter
because they only live for a couple of minutes, so we can just drop them anywhere. We see that the
busyness has actually now dropped a lot, and we can easily take a hundred seconds or close to hundred
seconds and one second per task and still not quite hit the it ceiling. So that's good news. But they're
sort of offset at the base, where even if the scheduling time for each job and every task is very short, the
scheduler is still busy for a significant fraction of time, and it's actually more than it needs to be. And we
investigated this, and we found it's because there's no parallelism in the system. While there's different
control-flow paths, you end up scheduling one job at a time, so therefore, you deal with a series of
incoming jobs one after the other, which means you can get head-of-line blocking. So as a batch job,
you can get stuck behind the a service job that takes a long time to schedule even though if the
scheduler had scheduled them the other way around or in parallel, you would long have been done,
long have been done.
>>: Why would reordering it reduce the scheduler's busyness? I see that it would release the job
sooner, so the short jobs wouldn't need to wait as long for the long jobs, but you're measuring how busy
the scheduler is, and I don't see why that would raise the total amount of work the scheduler is doing.
>> Malte Schwarzkopf: That's right, yes. Let me think about this. So in the next graph, we do have
parallelism. Yes. So it's not the head-of-line blocking that makes the different here. It's the fact that
you have these very numerous batch jobs, and they come in all the time, every couple of seconds, take a
couple of seconds to schedule, so even if they don't take a long time to schedule, you'll still be busy
some of the time. In the next graph when this is happening in parallel and we're looking at the busyness
of the service scheduler only, then the batch scheduler sort of exists but it's not shown on this diagram,
and it will be a flat surface that's offset by about this much from the bottom. So it is lack of parallelism,
but it's not the head-of-line blocking that's causing it. I should have been clearer there. The head-ofline blocking does also happen in this though, so jobs might experience a longer wait time. And there's,
in fact, an analogous version of this 3D graph where we show -- we look at the job wait time and we see
the same phenomenon. But here, you're right. The sort of step at the bottom is actually the batch
workload, which now in the next graph will disappear because we're only looking at the service
scheduler and the batch stuff is sort of handled in parallel. Okay, so we're moving on to Mesos, which is
this this two-level model where you have -- where you have the resource manager that allocates
dynamic shares to different schedulers, and then they work with their -- with their share. And again, we
vary the decision time for service jobs, and we show the busyness of the service scheduler only in this
case. The batch scheduler, as we just discussed, is sort of flat because we're not varying the decision
time for that. But the surprising thing here is that in every case we end up with unscheduled jobs, which
is a bit odd because when the decisions are fast, that shouldn't be happening. And in order to
understand why this is happening we have to look at how Mesos actually works in practice. So we have
a Mesos resource manager, and let's say we've got two schedulers which receive their resource offers in
turn. Mesos does not make offers in parallel. It just offers every idle resource in the cluster to each
scheduler in sort of a round robin fashion instead of offering parts. That's a bit sort of counter to the
model, and we were a bit surprised when we found that, but the reasoning that the Mesos guys apply is
that they assume that there's a high churn in the cluster and therefore, if you offer all of the resources
to everyone in turn very frequently, effectively everyone gets offered the optimal share. But let's see
what happens here. So in this case the green scheduler receives an offer for all of the available
resources first, and then it makes its decision, and it might take a while. In the meantime, a blue task of
the blue scheduler has finished, so now a little, little piece of resource has become available. And what
Mesos could do is it could just wait and offer that in the next round, or it could make another offer. And
while they sort of didn't implement parallel offers by splitting up the available resources, they did have
this sort of -- they do have this sort of event-driven model where when a resource becomes available,
they will immediately offer it. And in this case, what that means is they do end up effectively with a
parallel offer but with a really dumb parallel offer because it's offering a tiny amount of resource to the
blue scheduler because that resource has just become available. When it would have been a much
better idea to actually carve up the initial allocation into sort of equal shares. And in this case, blue
receives its tiny offer and says well, but I've got too much work for that; I can't really make use of that,
so I'll just leave it unused and wait for a few offer. So it can't use it. Meanwhile, while the green guy is
still making its decision. Let's say he's the service scheduler and he's taking a long time to make his
decision. So we repeat this many times and, in fact, Mesos will offer this again and go, well, have you
thought about it? Maybe you do want it after all, and then it goes away again. They do it again and so
on. Repeat many times. At some point the batch scheduler is done and releases all of the resources.
And in fact, if we compare here it has only placed two tasks and released all of the rest of the resources
that it had acquired is a lock on. But by this time, now, blue finally gets its large offer that it can use to
schedule all the work it needs to schedule, but by now, it's actually given up. And that is what happened
in the simulation. So we actually debugged this, you know, the fact that we saw unscheduled jobs in
every case, and effectively what we saw was that they were being dismissed because the schedule had
given up after a timeout a fairly long timeout of something like a minute. It had just stopped retrying
because it had received so many offers that it couldn't use that it decided to get rid of the job instead of
leaving it in there and blocking the scheduler from doing other work. So, let's see how Omega compares
to this. After this short interlude. And remember again, our main goal here was to decompose the
scheduler and make it flexible, not to necessarily beat every other model. But actually if we compare
this to Mesos, it's not too bad. The Mesos graph, with the caveat of everything being red, was a little bit
lower. And that makes sense because in Omega we're experiencing these conflicts, and we're going to
be doing extra work. So clearly, we would expect to be at least at good -- at most as good as Mesos
probably was. But this is not quite as good as we would like. So we ended up making a couple of
optimizations that I'll discuss next and found that actually when we're a bit smarter about deciding
when a conflict occurs and how we handle these resource conflicts than we initially were, we can
actually get quite close to Mesos. So in this graph we see that for the entire plot the busyness of the
scheduler is low, so we can always keep up with the workload, and that's good news. Now, here we
compare all three of them again, and we can see the optimized Omega is a little bit worse still than
Mesos, but it does actually handle all of the workload. And if we compare it to the monolithic scheduler,
there's this little step which has gone because we've got parallelism in Omega, but actually if you
compared the shape modulo, the step, we're still a little bit worse, than the monolithic scheduler.
Again, that's what you would expect because we're going some extra work. So that was great, but it was
sort of a low-fidelity simulation because we tried to compare all these different schedule models and we
-- yep?
>>: The evaluation may have been sort of getting us an intuitive feel for it is a metric of can the
schedule keep up, but the other question that you started out is, does it make good decisions? That's
not reflected here, right?
>> Malte Schwarzkopf: That's not measured here. What we assume is that if it schedules job, then it's
satisfied with the decision that it has made.
>>: Is it the case that the utilization of the cluster is equivalent between all three of these?
>> Malte Schwarzkopf: Well, it's lower in this case because you're not scheduling everything.
>>: It's failing, hence the red lines.
>> Malte Schwarzkopf: Yeah. But if you just consider the blue, then the utilization is equivalent. Yeah.
>>: So it's not the case that making faster decisions doesn't get -- doesn't reduce some of the slack time
in the cluster. I imagine that a scheduler that did a better job might -- then there are other metrics, but
might actually produce a better overall result, I guess. I think what you're saying is as long as the
scheduler completes its job, everybody's happy.
>> Malte Schwarzkopf: Well, yeah, certainly everybody's happy. I think, if I understand your point
correctly, you're saying, you know, because we've got some parallelism, you know, we're now dealing
with more scheduling decisions per time unit, and therefore we can probably higher utilization because
we can do more work.
>>: One question is whether -- is whether a slower schedule, even if it gets the job done, gets it done
while leaving pieces of resources unutilized for some time while it got its decision made.
>> Malte Schwarzkopf: Right.
>>: The other question is whether when you have schedulers arguing over placement decisions,
whether a smarter scheduler, like -- the point of going beyond the monolithic scheduler is the
monolithic scheduler is very simple. And my intuition is that point somehow with more complex
scheduling decisions, you can get a better overall outcome because you presumably ->> Malte Schwarzkopf: Sure. The metric that people at Google use to assess this would not be
utilization. It would be something like conformance to fault tolerance SLAs, for example. Conformance
with requests, you know, handling SLAs and those sorts of things. We did not particularly focus at that.
We did look at it, and because it's a simulation, we can't really say what would have happened to the
request SLAs, but we said the -- we looked at the fault tolerance, and it was better because the more
complex scheduling logic is making better decisions and placing things.
>>: Right. It seems like that was your primary goal, but I guess there's also a scalability goal.
>> Malte Schwarzkopf: Well, so the premise that we started with was, you know, people want -- for
service jobs they want this fault tolerance, so they will badger the scheduler team until they implement
it. And they could either implement it into the one scheduler that they have, in which case you end up
with a monolithic case, or they could disaggregate the scheduler, and you know, that's -- that ended up
being a good idea.
>>: So fundamentally then, you take as a premise that you're going to have to have complexity.
>> Malte Schwarzkopf: Yeah.
>>: And then the metric of success is how well you implement, how quickly you ->> Malte Schwarzkopf: Yeah, yeah. Okay. So we did another simulation which actually is much more
high fidelity, and I'll explain that with this table. So in the first simulator I explained to you that we used
to compare the different models. We made a couple of simplifications in order to get it to run
reasonably fast and in order to make the implementation not too complicated, so we assumed that all
the machines are homogenous. The job parameters were sampled from distributions. We didn't
particularly support placement constraints. So Google has all sorts of hard and soft placement
constraints that are described in other papers. We did not support those. And the scheduling algorithm
was a sort of fairly stupid, random first fit. All of the graphs you've seen so for were with these sort of
caveats. With the high-fidelity simulator that I'm now going to talk about, we actually literally used the
Google scheduling logic. So when I say Google algorithm here, I mean I literally wrote hash include
scheduler dot h and was really using the real logic that's used in production. And as a result we are also
supporting constraints. We're supporting all of the sort of complications. We also used a real-world
machine description for a real cluster and replayed a workload trace, a complete event trace of a
month-long sort of cluster work load. So this is a much more realistic simulation than the sort of simpleminded one to compare the different models that we used initially. So let's see how this works out for a
month-long trace of cluster C that we've seen earlier. So again, on the y-axis here we've got scheduler
busyness. On the x-axis we're varying the service scheduler's one-off decision time, so that's the sort of
per-job overhead of the service scheduler. And again we vary it up to a hundred seconds. The blue line
is the batch scheduler, and because we're not varying his decision time, it remains -- the busyness
remains constant. In fact, it wouldn't necessarily have to remain constant. It could actually go up if the
batch scheduler was having to retry a lot as a result of the service scheduler taking longer to decide.
And we can see that that's not happening, so that's actually good news for us to some extent. But what
we also show here is an approximation of what the busyness would look like of the service scheduler
shown in red if we did not have any conflicts. So if we only had one attempt per job, and everything
happily scheduled the first time round and, that's the dotted line here. Now, we can see the area
between these lines is the overhead that we're adding due to having conflicts, due to having to redo
work, and the bad news is this is pretty huge. Yep.
>>: If you're hash-including the actual code, how do you vary the one-off position time?
>> Malte Schwarzkopf: Right, so we don't actually, you know, measure the time that the code takes to
run. We sort of -- this is an event-driven simulation, so we account for decision time, and it takes some
time that is either comparable or less or more to actually make the decision. So when the simulator
runs it uses the real logic, but the time it accounts is a parameter that we vary. And the reasoning
behind this is, you know, you could to some extent, you know, you could optimize it to make it faster.
You could simplify it to make it faster, or you might be adding more and more of these fault tolerance
features or complexity that would make it slower. So we were sort of exploring the space for scheduler
decision times, and the reason that we're interested in scheduler decision times is because with the
Omega model, if you think about it, the crucial parameter that makes schedulers vulnerable to
interference is how long they take to make decisions. So we were trying to test, you know, how far can
you push this before it folds over, and you can see it falls over at around twenty to thirty seconds.
>>: I was wondering if maybe their existing algorithm said, here, have this much time to make your
algorithm. So to then you could actually model.
>> Malte Schwarzkopf: I actually talked to the guys writing the logic and sort of tried to get an estimate
from them how long it takes. And they actually couldn't really tell me, and the reason is because it's
really hard to tease this out of the existing logic as a per-job quantity. They're sort of -- the logic is doing
all sorts of smart things in terms of when it looks at a job and it's touching something that is relevant to
other jobs, it will sort of amortize the work and do a bunch of work for the other job as well. So it's
really hard to say it took five seconds to decide for this job because it affected a bunch of other jobs as
well. This is in the existing monolithic scheduler, so in the disaggregated schedulers, presumably this
would change, but they didn't exist yet at the time, so we modeled it like this. Okay, so bad news: It
kind of doesn't scale particularly well, and the overhead gets quite big. So compared to the previous
graphs, this is worse, and that's what we would expect. We sort of looked at why we saw such a big
difference, and it's mainly placement constraints. If you have production jobs that are very picky in
where they want to schedule, they will, of course, hone in on the same resources, so these resources
become hot and more contended, and therefore you're seeing more overhead. So all for nothing, or
maybe not quite. So remember the optimizations that I mentioned earlier. We looked at this and sort
of thought a little bit more about what you can do in order to reduce the frequency at which conflicts
occur and deal with conflicts more gracefully. So one optimization that we made was we said, well, we
can detect conflicts in a slightly more fine-grained way. So initially what we did was we gave every
machine a sequence number, and we just increased the sequence number every time a scheduler
touched that machine. So conflict detection was then very simple because we could just check if the
sequence number had changed and if it had, we said, oh, bad luck. You have to try again. So, of course,
that's a bit pessimistic because actually you might be able to fit that task and also the other task that the
other scheduler is trying to fit on the same machine if you make life a little harder for yourself in terms
of checking whether you actually experienced a conflict. And I say a little bit harder. It sounds quite
simple. It sounds like, you know, you just need to do of a couple of integer comparisons here. The
resource model that Google use in their data centers is reasonably complicated. There are sort of
guaranteed resources and reserved resources and nice-to-have resources, and then there's a bunch of
other dimensions as well. So it's a sort of reasonably complicated piece of multi-dimensional sort of
analysis whether another task fits on the machine. So we actually take a little longer to detect conflicts
here, but at the -- at the price of maybe having fewer of them, and that turned out to be a good idea.
The second optimization that we made is a relatively obvious one, which is that we said, well, if we have
a big job and on the first attempt one of its tasks conflicts and can't get placed then, of course, one of
option is to fail the entire scheduling attempt and retry the entire job. Which then second time round
might work, but actually wouldn't it be much smarter if we only retried the task that failed? Now, of
course, this only works for jobs where it's acceptable to incrementally schedule the tasks, not for ones
that want gang scheduling. But for those it should give us much better chance of scheduling second
time round because the job is now a lot smaller and is going to touch fewer machines. So let's see how
these optimizations actually impact the performance. Again, you know the axes. Y-axis scheduler
busyness, x-axis, service scheduler decision time. Is red line is the red line that we had before. So we
had the sort of yellow underneath here. And then the cerulean line and the pink line are with the
optimizations applied. So for the cerulean line we've applied the incremental scheduling instead of gang
scheduling, but we're still making the course-grained conflict detection decisions. And with the pink line
we apply both of them. And we find in practice we got sort of a 2x or fifty percent, depending on which
way round you look at it, difference in scheduler busyness as a result of applying each of these
optimizations. When we look at the previous graph now with these optimizations applied, the yellow
area that shows the overhead of wasted work due to conflicts is a lot smaller. It's still there. We still
have to redo things, and in fact, if you look at the numbers without the optimizations, the worst case
that I saw was something like 200 retries until the job finally scheduled, and then it did schedule. But in
this case, the worst I saw was seven retries, and that was sort of very rare, a one-off in month of
simulated cluster runtime. So we can make these optimizations and therefore make our model viable
even at pretty long scheduler decision times. And that is good news because that's what Google likes.
We also did a case study on what you can do if you have these sorts of disaggregated schedulers, so so
far we've looked and batch and servers and I sort of in a hand way said service jobs take a long time to
schedule and therefore we want, you know, it'd be nice if we could disaggregated things. An
opportunity that is granted to us by having these disaggregated schedulers is that maybe we'll have
more schedulers. We can have job-specific schedulers. So one case study that we looked at here for
qualitative benefit, which is sort of a little bit what John was asking about earlier, but it's a bit different
in this case. We looked at a MapReduce scheduler, a job-specific MapReduce scheduler, and what this
scheduler does is it introspects the cluster state to see if there's any idle resources around and sort of
opportunistically leverages these idle resources to complete MapReduce jobs faster. Now, to
understand why this works we have to look at the number of workers in a MapReduce job, and at
Google that number is a manually human-specified configuration option. So we end up -- basically a
human ends up writing a job profile and job scheduling request, and in that, that human says, I'd like a
hundred workers, please. This is not the same as the number of map shards and reduce shards. It's just
the number of parallel workers, so it's like the parallel number of FREDs, I suppose, in an open MP
program or something. And it turns out if you look at a histogram of the sort of distribution of numbers
of workers that people use in practice, the numbers are all sort of nice numbers that humans will pick
when you say, well, how many workers would you like? And a lot of them say 200 because that seems
like a nice number. Now of course, there's a serious point behind this, which is that people actually
implicitly make a trade-off between getting the job done faster by having more workers, assuming it is
embarrassingly parallel, and getting it to schedule sooner, because if you ask for 8,000 workers, it will
probably take a fair while until the job gets scheduled; whereas if you ask for five, it will be pretty fast
but the job itself might take the longer. So we said well, maybe it'd be better if this wasn't decided by a
human, but it was decided by the scheduler as a function of the cluster's busyness at any point in time.
If there's lots of idle resources then, you know, maybe we can give the job a bit more resources and it
can compete faster. If there's not, then it will just have how many workers it gets. So we built a
scheduler that did that, and here we are having a look at the benefit. So on the x-axis, we now have the
relative speed-up of job completion compared to the case of running it with the number of workers
configured on a log scale. And on the y-axis we have a cumulative distribution. Now I have to say a
caveat here is the speed-up model that we used was pretty naive. We basically assumed that
MapReduce jobs are infinitely scalable by adding more workers, which of course, is not true. If you have
-- well, we didn't quite go to infinite. We never added more workers than map shards, for example,
because that doesn't make sense. Of course, not every job is sort of linearly scalable liking this. But if
we assume that sort of very optimistic scaling model, then we get this sort of happy zone up there
where jobs experience a speed-up. And we can look at this, and we see that out of the jobs that we
looked at in this cluster, 60 percent of the MapReduce jobs were actually affected by this optimization,
and that means 60 percent of MapReduce jobs could have benefited from extra resources, from being
given extra resources in addition to the ones that were configured.
>>: Yeah, just to understand the experiment. The speed-up includes both scheduling time and
execution.
>> Malte Schwarzkopf: This is job runtime. Scheduled time is not included.
>>: It's not.
>> Malte Schwarzkopf: It's just the job runtime as a result of using, you know, more workers. And what
we did was we looked at the number of MapReduce shards, and we said up to the number of
MapReduce shards, we'll give you more workers so that you can run more of them in parallel. And you
can see actually that for a number of jobs it didn't matter, this other 40 percent, and those are the ones
that in some cases that already had the number of workers set to the number of MapReduce shards or
they were very small, in which case it didn't matter. So if we look at sort of the median of the jobs that
actually do experience a speed-up and look at how about that speed-up is, we're sort of looking at
between 3 and 4x, which actually is quite considerable. Now the CDF does, of course, say in the 90th
percentile up there, you can get a hundred x speed-up. That might or might not be the case, and we
actually found some jobs that were in that sort of category where the configured number of workers
was tiny compared to the amount of work that was supposed to be done. Those jobs had been explicitly
throttled by their designer, and when we talked to them and said, hey, if we could make your job a lot
faster by giving you much more workers, they said, no, no, no we don't actually want that. It's
deliberate that they are being throttled. So it doesn't always help. But it's a nice tastiness result that by
having a custom scheduling logic that takes into account the sort of opportunistically available resources
in the cluster, which it can actually very easily inspect by just looking at the replica of the cluster state
that it has available to it, we can achieve this kind of benefit, and flexibility. So let's finish up this Omega
stuff. We found that for Google's workloads flexibility and scale in scheduling require parallelism. They
require disaggregation into different schedulers. And it can work if we do it right and their use of shared
state in the optimistically concurrent transaction model is the right way of doing it, or at least it is for
Google. So I can stop here. If people are sort of a bit talked out. It's been about an hour. I can talk
about some of the work that I have done since, or you know, whatever you prefer. We can have
questions.
>>: Do you have a short version of the work you're doing now?
>> Malte Schwarzkopf: Yeah it's not going to take another hour [laughter]. It's not dimensioned for
that, but there was a question back there. Shall we talk about that first?
>>: Regarding this busyness, so parallel scheduling itself really does not reduce busyness.
>> Malte Schwarzkopf: No.
>>: Unless you considered using additional resources to make scheduling decision. So is this implicitly
shown in your graph?
>> Malte Schwarzkopf: So what you're saying is if you do things in parallel it doesn't reduce the amount
of work you're doing. That's correct. But if you run things in parallel, then you're sort of by assumption
you're giving it extra resources to do the work, right? If I have two FREDs and they work perfectly in
parallel, then I've used fifty percent more CPU time. Yeah in any given time tick, but I'm getting twice as
much work done ideally.
>>: This is probably fine in this context. I'm just trying to see how you measure the busyness.
>> Malte Schwarzkopf: Yeah, so in the 3D graphs we had this discussion about the step. In the 3D
graphs were in the initial one measuring the busyness as a result of the overall busyness of the
scheduler, but there was only one scheduler. And then when we had multiple schedulers, we only
looked at the service scheduler because that was the one where the shape was actually varying,
whereas for the batch scheduler because we didn't modify -- we didn't vary the decision time, it was just
flat. And this offset. The work is still there. It's just done in a different FRED, if you will. So it's being
done, and the overall busyness, the cumulative busyness remains the same, but it's being done sort of in
parallel. But you have, you know, to use more resources to make the decisions. In practice Google is
not looking at having 500 schedulers, right. Google is looking at having a handful, and you can easily run
this on a single multicore machine. And it would work pretty well.
>>: Another last question is, at the very beginning you measured about this tool that was scheduling.
You don't really is have a comparison example as a later parallel slide compared with two-level
scheduling how the Omega works. So do you have some results other like ->> Malte Schwarzkopf: So I can -- there's a bit more stuff in the paper. There might be something in my
backup slides, but they're right at the end, so I'm not going skip ahead to them now. But I can tell you
offer informally what our experience with two-level scheduling was. We used Mesos, the Mesos model,
because Mesos is a real thing that is out there and that people are using. So it's an implementation of
two-level scheduling. It's an implementation that's based on a bunch of assumptions. So one of the
assumptions they make is high churn. One of the assumptions they make is schedulers don't take
extremely long to make their decisions. Now unfortunately, with our premises we violated both of
these assumptions, and therefore, it didn't work particularly well. So you know, every credit -- one of
my coauthors is one of the Mesos authors so, you know, we weren't deliberately trying to screw them
over. They optimized for a different setting. You could, of course, build a two-level scheduler system
that did not necessarily have the limitations that Mesos has. So you could have one that does smarter
dynamic partitioning, for example. But we think it would (a) be more complicated and (b) some of the
nice things you can do with Omega you couldn't do in that setting. So if you think about the MapReduce
scheduler that inspects the spare resources available in the cluster, if you had a dynamic partitioned
scheme and the MapReduce scheduler was only seeing its dynamic partition of the overall resources,
then it might not actually get a very good idea of how much spare resources there are, because some
partition -- some resource manager has decided to only give it a slice. Now of course, you could then
communicate this out of band in some way or another. Omega is not the only way of doing things, but
we found it is a nice simple model that actually captures lots of policies. Does that make sense?
>>: It does. I'm just, well, trying to figure out. Some of the constraints you were talking about where
you do dynamic partitioning, I think also exist in the Omega scheduler, and some of those can also be
avoided by doing dynamic scheduling at the same time be aware of the global state.
>> Malte Schwarzkopf: I think the answer is yes. So there are some policies that don't work well in
Omega. So one good example is if you had a scheduling logic that always picked the least loaded
machine. That would be a really bad idea in Omega because then the old schedulers using this logic
would hone in on that one machine, and they would basically live-lock each other. So we do assume
that your scheduling logic has some element of randomization. We also do assume that your scheduling
logic is designed in such a way that the majority of -- that you can actually in most cases schedule on the
majority of resources, that you're not sort of picking your favorite machine every time. This is not this is
not enforced in Omega. If you build a stupid scheduler, then it will either break the system or not work
very well. In practice in the practical implementation that Google have done since, which I don't know
very much about because I had left by the time they had done most of it, but in the practical
implementation they are making sure that by writing a stupid or buggy scheduler you're not ruining the
entire system; you're just ruining yourself. So in -- Jay in the beginning asked about the code that sort of
that manages the shared state. And there you can put in a policy that you can just put in a counter and
say, you know, if this guy keeps, you know, causing other people to fail, then we make him fail a little bit
so that it gets a bit fairer again. And in practice what would end up happening is a badly written
scheduler will just keep failing, and then, you know, someone has an incentive to make it better or make
it more -- make it fit the assumptions better.
>>: Thank you.
>> Malte Schwarzkopf: Okay, great.
>>: I'm just trying to understand this a little better. In a closely managed environment like this, where
you know what jobs are in the cluster.
>> Malte Schwarzkopf: Sure.
>>: I see the benefit of Omega especially when you want to enable say, innovation at the scheduler
level.
>> Malte Schwarzkopf: Yep.
>>: You could let people write different schedulers for different kinds of jobs. The load level scheduler
is really dumb. It just says resource available or not, but in an environment when you control almost
everything, right, you might have one set of machines dedicated to certain services. Why is the
monolithic approach -- monolithic ->> Malte Schwarzkopf: Right. I see where you are. So one thing that I think is sort of a policy decision
that Google make is that they don't dedicate machines particularly to things. So, you know, I was an
intern, so I don't know everything, but what I do know is that whenever possible they try and use shared
infrastructure. So the sort of model where you say these machines over there are gmail machines, and
these machine are the web search machines -- that doesn't exist. They assume sort of a utility cloud
model where, you know, there's infrastructure that's used by everyone. So with that premise,
monolithic servers that sort of just run in their own little playpen and never have to talk to each other
become unviable because everyone is sharing the resources. Now you could, of course, come up with
schemes where some resources are sort of preferentially sticky for some workload, and some resources
can only be leveraged opportunistically by otherwise, but you could do that in Omega as well. It would
complicate the code that manages the shared state, however, and one of the sort of design premises in
Omega is that code should be as simple as possible. It should effectively just be a conflict check.
Because, and this is sort of the wider picture of Omega as suppose, which John [inaudible] can explain
much better than I can, because they also want to hook in things like machine management with this
same shared-state data structure. So if a machine fails or is taken offline for maintenance, it will just
disappear from the shared-data structure by another component that is not a scheduler, but instead is a
machine manager that says, I'm removing this machine. And that's actually quite a nice abstraction for,
you know, sort of unifying the cluster management service.
>>: -- one of the things failure to detect [indiscernible]
>> Malte Schwarzkopf: Sure. Yeah. This is not the only way of solving this problem, absolutely not. But
Google liked this sort of giant, shared infrastructure vision where every cluster management component
is using one sort of central, shared notion of the state of the world. And, you know, that's what they're
thinking is the right thing to do. It's clearly not the only way you could do it. And it might change; they
have very fast turnover of systems there so, you know. The average lifetime of a system, of a scheduler
there is probably, you know, a few years, so, we don't know what they're going to be doing in five years'
time. Okay. Any other questions? Stop? Look at some more stuff? I don't know how keen people are
on going to lunch [laughter].
>>: I don't mind listening.
>> Malte Schwarzkopf: Well, we can go a little further. So one of the things that people asked a lot
when I presented this paper at EuroSys was, you know, why might scheduling take a hundred seconds?
That seems outrageous. And I sort of in the beginning I said oh, you know, Monte Carlo or stuff,
constraint solving blah, blah, blah, it takes long. And that is sort of the fact at Google that, you know,
these things do take long, but it's not a very concrete example of why something would take that long.
So I'll show you on example of something that's probably a bit closer to home for Microsoft and that
actually does cause very long scheduling decisions. There's a T on the second line there. And this is
relevant to a new system that I'm working on which is called Firmament, and that actually that system
extends Quincy, which is why I said it's a little closer to home for Microsoft. So the premise of
Firmament is that machines don't really look like this anymore. That's what they looked like in 2009.
We now look at a complex sort of architecture like this, or indeed if you have an Intel machine, it might
even look like this. You've got deep cache hierarchies, multiple memory controllers hyper FREDing, all
sorts of architectural complications that weren't really there a few years ago. And this matters for data
centers because these machines are being shared, again, in the sort of Google infrastructure model
where you might have tasks from completely different jobs running on different parts of the system, the
sharing of resources really does matter to the way you're tasks will proceed. So here's a little
experiment that's I have done using some spec CPU benchmarks. So this is basically all of the spec CPU
benchmarks arranged roughly in order of how memory-intensive they are as in how much of their time
they spent accessing memory as opposed to doing CPU-bound work on a fairly recent AMD Opteron
machine. And the heat map shows how much a spec CPU benchmark degrades as a result of sharing a
level two cache with another instance of a spec CPU 2006 benchmark. So there's only on a twelve core
machine. There's only two things running, but they do run atop the same l2 cache. And you can see
that in a bunch of cases it gets very red here, which is 1.5, so that's a fifty percent degradation. By
degradation, I mean increased runtime. But actually the scale is clamped to 1.5 here to bring out the
differences. If you leave it free, it's about seven and 7x in the worst cases. And as you would expect
when you're sharing a cache and other levels of the memory hierarchy, if the workload is more memory
bound, the interference is worse. I should have said which way round this is, and it's hard to get right,
so I'll be very careful. The diagram shows that the left-hand workload, in this case lbm, which is highly
memory intensive, causes the degradation indicated by the color of the workload on the x-axis. So what
it means in practice is lbm running next to gromacs will cause gromacs to take fifty percent longer than
gromacs would do if it was running in isolation on an otherwise idle machine. So the baseline is the
workload running in isolation on an idle machine. Now, some of you might be thinking, well, sometimes
it looks like it's getting faster when you're sharing things; that seems a bit odd. And I turns out the
baseline of this experiment was a little bit buggy because I've measured the isolated runtime without
first warming up the buffer cache. So because the isolation case was the first in a series of experiments
to be run, the baseline time is actually higher than it would be with a warmed-up buffer cache, which
was the case for all of the other experiments. So what I should have done is either dropped the buffer
cache in between experiments or warmed it up before, before running the experiment but it takes -because this 28 x 28 matrix takes about a month to generate, so I think I've, by now, rerun the baseline
so I can replot this, but I haven't yet got round to actually doing the plotting. So I assume that this these
blue artifacts will go away, and in fact, I've verified it for a small subset that they do go away when
you're running on a warm buffer cache. But the degradation is only going to get worse as a result of
this. It's not going to get better if this becomes 1.0. So that was only two tasks on an otherwise idle
machine, but of course, if you're running in a data center, you will want to pack the entire machine with
work ideally. And so I did that with the spec CPU benchmarks and again ordered them by memory
intensity. I don't expect you to follow all of these bars, but if we only look at the worst case, this gray
bar here is on a 12 core machine we run 12 instances of the lbm benchmark on all of the cores. And as a
result, the runtime is almost 5x of the isolated runtime because they are just evicting each other's cache
lines all the time and contend for the memory bus and all of these sorts of things. Towards the left-hand
edge that's off the screen, but with the CPU-bound benchmarks you finds that the degradation is almost
zero because they just run in their own l1 cache, and you know, do CPU-bound work. So it can get quite
bad, but of course, that's spec CPU, and spec CPU is maybe not a super representative benchmark for
data center workloads. So we also looked at a realistic sort of data center application, which in this case
is Spark, the sort of Berkeley parallel data processing framework. And this is actually an artifact from a
real experiment we did for a different paper. We ran Spark on a simple string processing workload. So
this is taking an input string and splitting a column out of it, so it's a column projection on a multicolumn
space-separated input string for reasonably big but not enormous inputs because we only had a couple
of machines. And on this cluster of, I think it is, six or seven machines, one of our machines is a 48-core
machine. The others are smaller machines because we're a poor university, and we can't afford lots of
homogenous machines. What we found was that if you look at the runtime here, the blue line, the 8C
version, is Spark limited to using only eight cores on each machine. Whereas in the red line we let it use
as many as it wants, which means on the 48-core machine, in this case it's using 48 cores. In this case
it's only using eight cores. And it turns out that Spark actually got faster when you gave it fewer cores,
and the result for this was that it was contending for resources on the machine, which is not a huge
surprise if you think about it, but for us when we ran the experiment, it was a bit surprising that
removing cores made the benchmark go faster, so this interference does happen in the real world. And,
in fact, there's 76 cores that we've removed in total by restricting all cluster machines to only eight cores
running Spark. And as a result, it got twice as fast. So that's just one thing. You obviously can contend
for network access as well. You can have some sort of hard and soft constraints. You can have
accelerators. You can have deadlines, SLAs, special resources, SSDs, GPGPUs that you don't have in
every machine. All of this comes into the scheduling algorithm. So what's a good scheduling mechanism
to capture all of these dimensions? Well, actually if you think about it, Quincy is quite well-placed for
this, and I really like the Quincy work. I think, you know, it's sad that nobody has ever done, you know,
any more work on this, which is why I'm now doing it I suppose. But the Quincy scheduler is based on
an optimization, a global sort of cluster-wide optimization of a flow problem. So it actually models
scheduling as a min-cost, max-flow optimization. I'll just explain this very quickly because I assume that
people here are reasonably familiar with this. But basically Quincy has this graph where it has machines,
racks, and what they call a cluster aggregator, which is effectively a “don't care” node that means can
schedule anywhere. And then there's a sink that sucks the flow out of the system. All of the machines
where tasks can schedule are connected to the sink, and the task comes out of different jobs on the lefthand side here are the sources of flow. So overall, the optimization is to get flow going from the sources
to the sink in such a way that the cost, the overall cost of routing the entire flow to the sink is
minimized. And in order to be able to do this routing, obviously we have to connect things up a little bit
more. So because there might be more tasks than there are machines to run them, room to run them in
the cluster, there must be an unscheduled node through which we can send flow if a tasks is not
currently scheduled. Every task is connected to that unscheduled node and also to the cluster
aggregate, which means it can schedule anywhere. And if a task like T0,0 here ends up routing flow like
this, routing it through machine M1, then effectively what that means in the Quincy world is that it's
scheduled on machine one. Or if it's routing its flow through the unscheduled node, it's not scheduled.
You can also have preferences that point directly into other parts of this hierarchy. So thinking about
this sort of degradation problem as a result of interference, what we would actually like to do in that
case is we'd like to combine cache-sensitive and cache-insensitive workloads to put memory-bound
tasks next to CPU-bound tasks, for example, because that will maximize the overall utility. Use CPU
tasks to fill in gaps and so on. So how can we express that in this Quincy paradigm? Well, one thing we
can do, and this is how this work sort of got started, is we can actually extend the leaves of the graph
from machines to individual cores. So we say, well, okay, instead of having machines connected to the
sink, we actually model the entire architectural hierarchy of the machine with all the caches and the
cores and even the hyperFREDs if you want, as part of this graph. So we can do a global optimization
over a placement of tasks on individual cores in the machine. And if we do that, we obviously need to
classify workloads in some way, so we need to have some sort of equivalence class of a CPU-bound and
a memory-bound workload. Ideally again, we would do this automatically. We wouldn't need the user
to specify he have it. The heuristic that we're using in Firmament is sort of an animal zoo. This is
actually not my invention. It was invented by people who did work on cache sharing in architecture in
the early 2000s. And they came up with those categories, and I think they've added a bunch more since,
but they he came up with these categories of devils, turtles, rabbits, and sheep. And I don't need to
explain those in detail, but effectively, devils are really bad. They cause degradation if anything runs
close to them. Turtles are super gentle. They're the CPU-bound tasks that will just sit there and do their
work, never minding what anyone else is doing in the system. And rabbits and sheep are various
different increments in between. So what we've done in Firmament is we've leveraged the performance
counters that you find in modern CPUs to measure a bunch of things at low overhead. So we measure
things like the memory accesses per instruction. We measure the cycles per instruction. We measure
how many cache references there are, how many cache misses. Performance counters aren't perfect,
but they're a very low overhead way of measuring these microarchitectural things. They also lie
sometimes, but you know, it's as good as we can get without much overhead. So we're looking at two
examples here. There's one task that is approximating pi, which is a CPU-bound workload. It's just
literally computing in the L1 cache. We can see it does about 12,000 instructions for each memory
access, so it very rarely accesses memory, and only L2 percent of its memory accesses end up causing
cache misses. That's what you'd expect for an application like this. Matrix multiplication on a big
matrix. This was set up for a matrix that, you know, definitely doesn't fit in an L1 cache and probably
didn't fit into L3. We find that it only does 40 instructions per memory access because memory access is
quite common in that sort of workload as a matter of it being a matrix multiplication. And 65 percent of
the accesses, memory accesses end up causing last-level cache misses in this implementation.
>>: You talk about 65 percent causing a miss. That implies a given size for the cache, and that size is a
thing that's going to change when you put a contending workload next to it.
>> Malte Schwarzkopf: Yes.
>>: The metric you really want there is what cache size produces a 90 percent hit ratio or an 80 percent.
>> Malte Schwarzkopf: Yeah, and in fact, that's what the cache-sharing architecture people look at. It's
a bit hard to figure this out dynamically if you're not running an architecture simulator, but a real
machine.
>>: There's a not a good way to estimate that using ->> Malte Schwarzkopf: No. You can use something like cache grind, but yeah, it's a simulator. A bunch
of people have done work in this area where they used offline simulation with those sort of tools to
predict how much cache or what the cache footprint of a workload would be. So let's assume we've
done this, so we've classified some of our tasks as devils and we've classified some of our tasks as
turtles.
>>: Can you go back?
>> Malte Schwarzkopf: Sure.
>>: Can you use that model to predict the red and blue picture?
>> Malte Schwarzkopf: To some extent. So if you look at the cache-sharing papers, they come up -- as a
result of having these animals they come up with a bunch of heuristics where they say putting devils
next to turtles is good. Putting devils next to rabbits is really bad. Putting sheep next to rabbits has no
effect. So you can sort of come up with a taxonomy of what goes well with what. You can't exactly
predict what the heat map will look like, but you can say sort of the bottom of the heat map would be
the devils, and the top would be the turtles. So you can predict the general place in the sort of hierarchy
of how interfering is something that the task folds into, if you can classify it as a devil. Also, of course,
tasks go in three phases. You have some workloads have memory-intensive phases and computerintensive phases, and we're not capturing that perfectly. You could do a better job there, but we're
trying to do this, you know, in production with almost no overhead as opposed to offline.
>>: Based on your previous experience in Omega with the schedulers there, did they do something very,
very similar?
>> Malte Schwarzkopf: No. Well for this, they don't do anything. They just suck up the interference. In
fact, that's not quite true. Google published a paper called CPI squared in the same EuroSys where the
Omega was. What they do do is they sample these sorts of performance counters to detect when really
bad interference is happening, and then they move the task somewhere else. But it's reactive rather
than proactive, and they only sample about one percent of tasks; they don't do it for everything. And
certainly, the scheduler has nothing to do with it. It's a completely separate part of system that
rebalances these sorts of things. So if we wanted to do this in the scheduler using the Quincy model,
what we can do is we could classify our tasks and we can then add another aggregator node into this
flow graph, which is the devil aggregator up here. And we have turtle aggregate for the equivalence
class of turtles here, and then we connect our tasks to those. So let's say we have a devil running here
on C0, and we are now having a turtle that we would like to schedule. So based on our equivalence
classes and based on the reasoning of how things go well together, what we would like to do is for the
turtle to end up here, because then we use this machine, this part of the machine in the most efficient
way. If it ends up here, it's not really helping us if we put another devil here or rabbit or a sheep or
something like that. So we'd like to make it create an incentive for the scheduler to put the turtle into
that place, but we don't want a hard constraint, because you know, we're probably better off it
scheduling in a suboptimal place than it not scheduling at all. So what we can do is we can put a
preference edge from the turtle aggregator to the node representing this L2 cache and say, well, we
have a strong preference for putting turtles underneath that L2 cache because we already have a devil
here. And then when the flow optimizer runs it will, with great likelihood, put a turtle there, but it might
not end up doing so if there's some other globally optimal solution. If it did put a turtle there, then the
flow gets routed like this, and the turtles gets scheduled, and we're all happy. And the same happens
for the devil, which might end up going there. So what is the impact of this? We've built the system.
We've tested it. This is an indicative result. It's not, you know, a fully-fledged SOSP evaluation or
anything like that. But remember the pi approximation and the matrix multiplication. They would fall
into the equivalence classes of turtle and devil. If we run them in isolation, the pi approximation takes
about 6.3 seconds and the matrix multiplication takes about 20 seconds on an otherwise idle machine,
so in a very unrealistic environment for a cluster. If we use a simple Q-based greedy scheduler, nothing
to do with Quincy, no flow optimization, just you know, the sort of state-of-the-art, Q-based simple
scheduler that doesn't take a long time to make a decision but maybe makes somewhat suboptimal
decisions. And it uses a random placement strategy, then we end up with a 4x degradation in runtime
of the matrix multiplication while even the turtle, the pi approximation, degrades, and this is with
multiple metrics, multiplication tasks, so it's not a single task. This is an average value out of, I think, 12
tasks and 12 pi approximation tasks. So sometimes the greedy scheduler ends up putting things to next
to each other in a bad way, and therefore, we end up with this degradation.
>>: How much resources are you using for this experiment?
>> Malte Schwarzkopf: This was using one machine that we either allow the greedy scheduler on this
one machine place things in any way it likes, or we use the flow optimization. Obviously, we can also do
this across multiple machines, but that's not happening in this experiment. So if we use the flow -- the
Quincy flow modeling for the scheduling problem and we put in some of the stuff that I discussed,
preference for turtles next to devils, you know, interleaving of these categories in this way, we end up
improving by 2x over the greedy scheduler. It's still 2x worse than running in isolation, but you running
in isolation is not really an option if you have a data center full of 48-core machines and you want to use
them reasonably efficiently as opposed to just running one task per machine.
>>: If you run these two jobs in sequence, they'd still take less time than running them together. Am I
misunderstanding your numbers here? If you run all the turtle jobs and it takes seven seconds, and then
you run all the devil jobs, and it takes twenty seconds, then you're done 27 seconds later.
>> Malte Schwarzkopf: That's true, but again, in a real data center environment, you can't just run your
entire workload in sequence, right. For these particular jobs, yes.
>>: But the lower chunks in -- other words, you can't do that at the data center granularity, but you can
probably do it at the L3 cache granularity and maybe at the machine granularity ->> Malte Schwarzkopf: Yes, sure.
>>: It seems like you'd want to exploit those opportunities if the answer is that ->> Malte Schwarzkopf: Yes.
>>: Maybe you should try even harder to keep these things apart because maybe running in isolation is
actually a better overall utilization.
>> Malte Schwarzkopf: So one thing we found with this experiment was actually if you don't pin things
to cores and you let the OS scheduler make decisions, it actually performs somewhat better because it
does what you say: It schedules things in sequence in a reasonable way. But the problem with that is
the OS scheduler can only see one machine, right. If the cluster scheduler decides to give 12 memorybound devil tasks to that machine and decides to put all the turtles on the machine next door, the OS
scheduler can't do anything. My argument that this cluster scheduler should be taking care of this, and I
think we're in agreement on that, that you should try hard. Of course, there's limits to this because the
cluster scheduler will be far away from the machine, so it can't make decisions on a microsecond
granularity just because of the communication delay, but maybe it can amortize. It can sort of load a
schedule into the FRED scheduler and say, I think this is a good order of FREDs, you know, this is a good
order of running the different FREDs on the different cores in your machine. We haven't investigated
that yet, but that would be a nice.
>>: I guess the point I'm trying to make is that if your scheduler a producing runtimes that are twice as
bad as just running the jobs sequentially, then you're probably trying to too hard or something, right?
Something has gone off the rails, right.
>> Malte Schwarzkopf: Sure, but well do bear in mind that this case is underutilizing the infrastructure,
and you might not have the luxury of running things sequentially.
>>: It's underutilizing it only on the temporal axis, but down here, that bottom one took 48 seconds to
get the same work done, so it was underutilizing at a very fine grain.
>> Malte Schwarzkopf: Sure.
>>: It's sort of silly to say that that core is going to waste if adding it makes the whole system function
slower.
>> Malte Schwarzkopf: But what I'm saying is this 20 second number is unrealistic in the sense that in a
real-world setting you will have other things running on the same machine at the same time, and these
things will make that number worse. I don't know how much they're going to make it worse. In the best
case they will make it not much worse. In the worst case they will make it a lot worse, but you're not
going to get this number in reality unless you're underutilizing your machine.
>>: So let me see if I understand just to -- maybe there's something that could clarify this question to
understand the setting. When you're saying you're running a task in isolation.
>> Malte Schwarzkopf: Yes.
>>: You're utilizing one thread and everything else in that machine is empty.
>> Malte Schwarzkopf: Correct.
>>: When you're doing simple, greedy, and flow interleaves you're effectively using all the threads in
that machine.
>> Malte Schwarzkopf: Yes.
>>: So if you were trying to use all the threads in the machine, you wouldn't get those numbers.
>> Malte Schwarzkopf: That's what I'm saying.
>>: No matter which scheduler you are using.
>> Malte Schwarzkopf: I think John is making a valid had point in that he is saying, you know, if you
were in an ideal case you were an omniscient scheduler and you did everything right and you could
perfectly disaggregate all the offer interference effects in the machine, then you might get close to that.
>>: I'm saying something simpler, which is if I have those two jobs I'd like to run and then a whole bunch
of other jobs, and I've got a machine with 48 cores, one way I could achieve those jobs is I could do all
the turtle jobs, and that'll be done seven seconds later. Then I can do all the devil jobs, and I'll be done
20 seconds later.
>> Malte Schwarzkopf: No, you won't because the devil job -- if you do all devil jobs on that 48-core
machine they will take longer than 20 seconds.
>>: There's only one unit worth of work in the first row, whereas there are 12 units of work in a ->>: Thank you ->>: Okay. That explains the mystery. That was the answer that I was looking for. Now I understand.
>> Malte Schwarzkopf: So you're right. If you were -- sorry, I should have made this clearer. This is
doing 12 times as much work as this. If you were doing the same amounts of work, then it would take
actually times longer to do this.
>>: Okay.
>> Malte Schwarzkopf: Because as you said you do it in sequence. Okay, sorry. So this is a nice
indicative result, but obviously, this approach is not trivial. We have to worry about what costs we put
into the flow graph. They must have a total -- they must achieve a total ordering, for example. So if you
wanted to use a multidimensional vector to, for example, express the costs of contending for caches,
network access, disk access, then that's great, but you have to be able to totally order your cost vectors
in some way of another. For integers it's simple, but if you have multidimensional costs, which in reality
you do have, if you put a turtle here, that might be good for cache interference, but it might be bad for
disk access, for example. These things are not quite as clear-cut as I made them so far. So there's some
complexity there. One thing this approach cannot deal with, and Quincy I can't deal with either, is
combinatorial constraints. If you have a combinatorial constraint of the form "I never want more than
one type -- more than one task of my job to run on any one machine because I want to achieve fault
tolerance," for example, we can't do that because the flow optimization might end up putting two tasks
on the same machine because everyone is always connected to the "don't care; can't schedule
anywhere" node. We could tweak things around to approximate it, but we can't have any combinatorial
constraints. That just won't work. It does work if you do multiple rounds of scheduling and only
schedule one task from the job per round, but then again, it gets complicated. And of course, and this is
how I started talking about this work, the runtime of the flow optimizer can get quite big if the graph is
quite big. So here we've got a simple comparison of, again, the greedy queue-based scheduler and the
flow scheduler, and this is a CDF of the per-task decision time for a very simple-minded job. You can see
the greedy queue-based scheduler will take, you know, at most 20 milliseconds to make a decision. This
is a very unoptimized implementation. Whereas the flow optimizer for a single machine, again, this is
only considering a single machine, not a whole data center, will take up to fifty milliseconds. And then
as you scale this up, if you scale it up from one machine to 35,000 machines in this case, with the same
amount of work being scheduled but the amount of resources available increased, we end up basically
with linear scalability in the number of machines. But of course, if you increase the number of
machines, you will also have more work to do, so we also increase the number of jobs, and again, in this
experiment, the number of machines is constant but the number of jobs is increasing. And again, we
end up with linear scalability. The reason for this is effectively the flow optimizer is linear in the number
of edges that exist in the graph, and both increasing the number of machines and increasing the amount
of work increases the number of edges. Now, what I'd like you to take away from this sort of high level
description of Firmament is that we can use an extension of the Quincy approach to reduce
interference, and you know, there's some more work to be done, but we can have an effect, but at the
cost of a real scheduler possibly running for multiple seconds. So if you look at this, if you look at the
case of 8,000 or 9,000 jobs which have a hundred tasks each, so that's looking at just under a million
tasks running on a large 35,000-machine data center we're looking at 90 seconds of scheduler runtime
at the top of the y-axis there, which is the same ballpark as the sort of worst case we considered for
Omega. So this is a practical example of something that might take so long. Now, of course, if you can
build an incremental min-cost, max-flow solver -- in fact, we're working on that right now and
potentially we can make it much faster if we can incrementally solve the problem for each newly arrived
task as opposed to optimizing the entire cluster every time, but as it stands, it takes quite a long time. I
think that's all I have. Here are some other projects that are going on at in Cambridge, but I won't talk
about those.
>> Karin Strauss: Thank you. [applause]
Download