
>> Andrew Baumann: Thank you all for coming. It's my pleasure to introduce Simon Peter.
Simon is just finishing up his Ph.D. in the Systems Group in ETH Zurich and he's going to talk
about operating systems scheduler design for multi-core architectures.
>> Simon Peter: Hi everybody. Before I start with the main topic of the talk I'd like to say a little
bit more about the context of my work, which is the Barrelfish multi-core research operating
system. I assume that most of you already know about this. So we'll keep this fairly brief.
Barrelfish is a cooperation between MSR and ETH Zurich, and the overarching problem that
we're trying to address with this work is commodity computing is and has been for the past couple
of years in a period of fundamental change.
Multicore architectures mean speedup through more processor cores and processor dies instead of just simply faster processors.
Now, unfortunately, this development is not transparent to software and so we actually have to
write our software specifically or modify our software to make use of the additional cores in the
architecture.
However, at the same time we see a fast pace of quite nontrivial developments within the architecture. Intel and AMD are turning out a new multi-core architecture roughly every one and a half years or so.
So that means we have to keep modifying our software in order to make it work on new multicore
hardware architectures.
Now, operating systems struggle very much to keep up with these developments -- especially commodity operating systems, which is what I'm talking about.
There's an extreme engineering effort being put into commodity operating systems like Linux and Windows to have them work right now on multicore architectures, and it's projected that that effort has to increase even further as more fundamental changes are being put into these architectures.
This is a structural problem with these operating systems. Linux and Windows are quite complex monolithic programs that were developed back in the early '90s with uniprocessors and quite simple multiprocessors in mind, and these quite disruptive changes in the architecture that impact software very much were just not around at the time.
Now, Barrelfish is a new operating system that's been built from scratch for multi-core architectures. In order to address these problems, one of the main ideas that we're using within Barrelfish is to use distributed systems techniques in order to make the system both more scalable as we add more cores to the architecture, as well as to give it something that we call agility with the changing architecture: making the operating system easily changeable or adaptable to different hardware architectures.
Barrelfish is similar to other multi-core operating system approaches like Corey, Akaros, fos and also Tessellation that have been proposed within academia over the past year or two.
And so my hopes are that the ideas that I will present in this talk are also applicable to the rest of
the multicore OS space. Now, I have been involved with many things in the Barrelfish operating
system and I encourage you if you get to talk to me later to ask me questions about other things
that I've been working on as well.
And I'd also like to say that we can make this talk fairly interactive. So if you have any questions
just raise your hand and I can keep track of time.
Okay. So back to the topic of this talk, which is multicore scheduling. Why is this becoming a
challenge now? What's different in the way we need to schedule applications now and in the
future as opposed to how we used to schedule applications in the past.
Well, it's that applications also increasingly use parallelism in order to gain computational
speedups with the additional cores that are now in the architecture.
If I look over the past couple of years, I see a vast number of parallel runtimes emerging. There are things like Intel Threading Building Blocks. Microsoft has PLINQ. There's also a lesser-known runtime called ConcRT. OpenMP has been around for a while, but we're seeing increased usage.
And there are other things like Apple's Grand Central Dispatch, for example. All of these runtimes are essentially made to encourage programmers to use more parallelism, and programs are starting to use these runtimes in order to do that.
There's a new class of workloads that is anticipated by many in the community to become more
important for the end user commodity computing space. These workloads have an umbrella
term. They're called recognition mining synthesis workloads. An example would be recognizing
faces on photos that an end user might want to run on their desktop box in their photo collection.
The common property of all of the applications in this class is that they are quite computationally intensive, while also being easily parallelizable, so they benefit quite nicely from the additional cores in the architecture.
Now parallel applications like these have quite different scheduling requirements as opposed to
the traditional single threaded or even concurrent applications, and that's something that's quite
well known, for example, in the super computing community that has been dealing with
parallelism essentially for decades.
And quite adequate scheduling algorithms for these parallel applications have been developed in
this community. However, the focus there is mainly to run a big batch of usually scientific applications that run for a very long time on a supercomputer.
So you submit that. And then you have to wait usually for hours, days, sometimes even weeks.
You don't really change the system while you're running these applications.
And then you collect your results back later. What's new now is that we want to run these parallel
applications in a mix together with traditional commodity applications.
So maybe alongside a word processor that a user might be running or as part of a Web service.
So the main two differences in this scenario are that the setting is much more interactive. We now have to deal with ad hoc changes to the workload as the user provides input to the system, and it also gives us a new metric: responsiveness to the user is now important. We have to be able to make scheduling decisions quickly enough that the user doesn't actually feel the scheduler making its decisions.
>>: [inaudible] there's the word processor and there's a massive scientific calculation alongside?
>> Simon Peter: Compute intensive applications. Face recognition, for example, is something
that's really compute intensive. It takes a while to recognize faces on photos. But it's still
something that you might want to run either in the background or you might want to run this as
part of a Web service, for example, you might just want to do this when people submit photos.
>>: Why is the response more challenging than this way? Seems like I've always wanted --
>> Simon Peter: This type of workload, that's true. But I was comparing it to what people do in
the super computing community. So there they have developed scheduling algorithms for parallel
applications, but their use cases are typically throughput oriented batch processing where you
have a big batch of applications and you just want to run -- there's no real latency requirement for
a particular job, for example.
Okay. So to put that in a nutshell again: the research question that I was trying to answer is, how do we schedule multiple parallel applications in a dynamic, interactive mix on commodity multicore PCs?
I'd like to motivate this a bit further and actually make it more concrete by giving you an example
of what happens when we run multiple parallel applications today on a commodity operating
system.
In my case I ran two parallel OpenMP applications concurrently on a 16-core AMD Opteron system, fairly state of the art, on the Linux operating system. I used the Intel OpenMP library, which is also a state-of-the-art OpenMP library. Both of my applications essentially execute parallel for loops, which is one of the main ways in OpenMP to gain parallelism.
One application, however, is completely CPU bound. It's just chugging away doing computation,
and it's using eight threads. And the other application is quite synchronization-intensive. It
executes a lot of Open MP barriers.
If you don't know what a barrier is, I'd like to introduce it real quick. It's essentially a primitive where a thread that enters the barrier has to wait until all the other threads have entered the barrier before any of them can come out. It's a synchronization across all the threads: if one thread enters the barrier, it waits for all the other threads to enter the barrier as well, and then they can all leave the barrier again.
This primitive is something that's used quite often within parallel computing, especially at the end of computations and sub-computations, because you want all threads to finish computing their chunk of a parallel dataset before you do an aggregation operation over that computation.
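A minimal sketch of the kind of barrier-heavy OpenMP loop being described follows; this is illustrative, not the actual benchmark code. Each outer iteration runs a parallel for loop whose implicit barrier forces all threads to synchronize, and progress is counted as iterations per second.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        enum { N = 1024 };              /* illustrative chunk size */
        static double data[N];
        long iterations = 0;
        double start = omp_get_wtime();

        while (omp_get_wtime() - start < 1.0) {   /* measure for one second */
            #pragma omp parallel for num_threads(8)
            for (int i = 0; i < N; i++)
                data[i] += i * 0.5;     /* a little computation per element */
            /* implicit barrier at the end of the parallel for:
             * no thread proceeds until all threads have finished */
            iterations++;
        }
        printf("parallel for iterations per second: %ld\n", iterations);
        return 0;
    }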
So yes?
>>: Are these real applications that have these properties or for this purpose are you synthesizing
applications that exaggerate these properties?
>> Simon Peter: In fact, these are applications that I have synthesized, though they're actually
doing some computation. It's not that they're just going for barriers all the time. They're doing
some computation.
>>: Like a relationship --
>> Simon Peter: Yes, but I synthesized them so I could show the effect. So what I do then is, as I
run these two applications concurrently, I vary the number of threads that the barrier application
uses as both applications run.
In order to have some notion of the progress that these applications are making, I count the number of parallel for loop iterations that they're going through over some time interval -- in my case it was a second.
And then I look at the speedup or rather the slow-down as we will see on the next slide of these
two applications as they run concurrently.
So here's what I got. On this graph you see in the top, the kind of reddish looking line is the CPU
bound application and the purple line is the barrier application. On the X axis we have the
number of threads that are allocated at any point in time to the barrier application. Remember
that the CPU bound application is running on eight threads all the time and that we have 16 cores
in total in this machine.
And then on the Y axis, I have the relative rate of progress that these two applications are going
through. Normalized to the two threads case here on the left-hand side.
So let's start and focus actually on the left-hand side of this graph, up until eight barrier --
>>: Is there a lot of variability in the amount of computation between those threads? And in the second case, the case where they're synchronizing, is there a lot of variability in what each thread is doing?
>> Simon Peter: In this case there's not a lot of variability.
>>: Same amount of work, same barrier, same amount of work, same barrier?
>> Simon Peter: Yes, this is pretty much -- I would argue if you use a lot of synchronization, most
of these algorithms are actually trying to optimize for that case, that actually most of the threads
are doing similar amounts of work.
If you end up having one thread doing a lot more work, then if you're going through the barrier, it
would delay the execution of all the other threads essentially.
>>: Are you surprised the degradation given --
>> Simon Peter: I can explain that. Yes. I'll explain it. So up until eight barrier threads, things
looked pretty much benign. We have the CPU bound application essentially stays there at a
progress of 1, which is what we would expect because it has the same amount of threads
allocated to it. And there's nothing else running inside the system.
So the operating system here is doing something sane, and it's scheduling the CPU bound
application somewhere where the barrier application is not running.
We can see that the barrier application is actually degrading in its performance, and that's due to
the increase in cost of going through a barrier with an increasing number of threads. You have to
synchronize with an increasing number of threads.
If you're going through the barrier often enough you can actually see this degradation. Here's
why the application is degrading.
>>: You've got more cores also working on the problem, too?
>> Simon Peter: Sure. But the progress that I'm showing you here is actually over all the
threads. So I could have graphed it differently and then it would have stayed up there. But the
difference comes from the barrier.
>>: So this is relatively -- you're dividing by the number of threads that you have.
>> Simon Peter: Yeah. Yeah. Sorry, I should have probably said that earlier. So, yeah, it's
divided by the number of threads.
>>: The CPU intensive, is that the same curve as you would get if you didn't have the CPU
intensive workload working on the other eight cores.
>> Simon Peter: Yes, you would see this degradation. Exactly, the degradation is due to the synchronization that the application is doing over an increased number of cores. But there's no impact from the other application doing memory contention, exactly.
And actually I verified that the operating system is doing core partitioning here essentially. It's
assigning the eight cores to the CPU-bound application and then it's scheduling the barrier-bound
application on the other eight cores that are still available.
It's essentially doing a load balancing algorithm. Now on the right-hand side, from nine threads
onwards, this is the first case where we have more threads in total in the system than we have
cores in the system.
The operating system starts to time-multiplex the different threads over the cores. And
with the CPU-bound application we still see something that we would expect and that is that
CPU-bound application degrades linearly as it's being overlapped with more and more threads of
the barrier application.
So that's quite straightforward. However, finally we see that the barrier application's progress drops sharply, almost immediately, down to zero. And that is due to the fact that it can only make
progress when all of its threads are running concurrently. If one thread is being preempted by
another thread of the CPU-bound application, then all of the other threads in the barrier
application now have to wait for that one single thread to enter the barrier as well before they can
all continue together.
So this is something that we do not want, and that can be alleviated.
>>: One CPU-bound thread can still see the inter[inaudible].
>> Simon Peter: Sure, exactly if you had 16 barrier threads and just had one CPU bound thread
running you would still see a barrier like this. So you could set up this experiment in different
ways and you would still --
>>: For example, running Word and the symbols -- versus the --
>> Simon Peter: Yeah. Yeah. So there are scheduling techniques to alleviate this problem.
Parallel scheduling techniques, essentially. One is called gang scheduling and that's a technique
to synchronize the execution of all the threads of one application. So you make sure you're
always dispatching all of the threads that are within this one application at the same time. And
you never have one thread execute out of order essentially with the other threads.
And then there's also core partitioning, which is dedicating a number of cores to a particular
application. So there is no overlap with any other application.
And these work. However, we can't easily apply them in present-day commodity operating systems because there is no knowledge about the applications' requirements. The applications today are essentially scheduled as black boxes using heuristics by the operating system,
completely unaware of what the requirements are, while the applications and the runtime
systems, especially within these applications, could have known that there are threads that are
synchronizing within the application.
So in this case, for example, in the OpenMP case, interacting threads are always within an OpenMP team. So either the runtime, or the runtime with help from the compiler, could have known that these threads are synchronizing, but there's no way to tell the operating system of this fact.
Also, we can't just go and realize better scheduling solely from within the runtimes themselves,
because it's much more of a problem of coordination between different runtimes. We might not
just want to run Open MP programs. Maybe we want to run programs that use other parallel
programming runtimes. How do we do the coordination between these different runtimes? If we
did it just within the runtimes, then several runtimes might contend for the same cores. Even if
they are very smart, to figure out where to run within the system and how to schedule if two
different runtimes make the same decisions they might still end up battling over the same
resources and all the benefits that they have calculated for themselves might be void.
So this is a case where policies and mechanisms are essentially in the wrong place. Now, my
solution to this problem is to have the operating system exploit the knowledge of the different
applications to avoid misscheduling them. For example, the knowledge about which threads are
synchronizing within an application. On the other hand, also applications should be able to react
better to different processor allocations that the operating system is making.
For example, the CPU-bound application could modify the number of threads that it's using as processor allocations change, in order to share the machine with the barrier application or to give the barrier application more CPU resources.
So my idea is to integrate the operating system scheduler and language runtimes more closely, and the way I do this is essentially, firstly, by removing the threads abstraction at the operating system interface level. That doesn't mean I'm taking threads away from the application programmers; they can still be provided within the runtime.
This is something that has been done in operating systems before: there's the K42 operating system, the Psyche operating system, and a scheduling technique called inheritance scheduling that essentially does not require threads at the OS interface level.
In my case, I provide what I call processor dispatchers at the OS interface instead, and those give you much more control over which processors you're actually running on, and give you information about where you're running. So the operating system won't just migrate a thread from one core to a different core. Instead, if you have a dispatcher you know you're running on a particular core. So it's kind of closer to running on hardware threads, essentially.
And then I extend the interface between the operating system scheduler and the programming language runtime. And before I get to that, I should also say: since extending the interface usually raises questions like, are people actually going to use this, do we now have to modify all of our applications to use this interface -- I claim that now is quite a good time to make this move, because application programmers are increasingly programming against parallel runtimes instead of writing to the operating system interface directly. So we would only need to modify the parallel runtimes to integrate more closely with the OS, instead of every single application.
Okay. So I call my idea end-to-end scheduling, not to be confused with the end-to-end paradigm in networking. I just call it that way because I kind of cut through the threads abstraction to integrate the OS scheduler and the language runtimes more closely.
And end-to-end scheduling is a technique for the dynamic scheduling of parallel applications at interactive time scales. There are three concepts to it.
The first one is that applications specify processor requirements to the operating system scheduler, so the OS scheduler can learn what is important to the application. The second one is that the operating system scheduler can notify the application about allocation changes that it's making as the system runs, so the application can react to system state changes. And finally, in order to make this work at interactive time scales, I break up the scheduling problem and schedule at multiple time scales instead, so we can react quickly to ad hoc workload changes. I'll explain on a later slide how that works and why that is so important.
Okay. First to the first concept, which is the processor requirements. In my system applications
can say how many processors they want. They can also say which processors they want. They
can say when they want them and at what rate.
For example, an application can say something like I need three cores all at the same time every
20 milliseconds. That is somewhat similar to Solaris scheduler hints although it's much more
explicit. Schedule hints in Solaris are essentially a feature where you can say I'm currently in a
critical section, please don't preempt me now.
The scheduler can still decide what it wants to do but it takes the hints from the applications in
order to make a more informed decision.
So this is a bit more explicit than just that and a bit more expressive.
The idea is that the applications submit these processor requirements quite infrequently, or just once at application start. In my system they are actually submitted in plain text; it's not really a language, it's more of a specification IDL that allows you to specify those, like a manifest.
I do realize that applications are going through different phases, especially in this more interactive
scenario. Some of them might be parallel phases. Some might be sequential -- single threaded
phases and these phases might have different processor requirements.
So I allow the specification of these different phases in these processor requirement specifications directly. An example would be a music library application that does some parallel audio re-encoding: it essentially has two phases. One phase is where you're pretty much just running the graphical user interface -- that might be a single-threaded phase -- and then you have a phase where you do the parallel audio re-encoding and you're actually using multiple threads.
So in my processor requirement specifications you can say: I need one core in phase A and I need three cores in phase B. And then I added a new system call that you can use in order to switch between these phases. So you can say: I'm switching to phase A now; I'm switching to phase B now.
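To make the manifest and phase-switch idea concrete, here is a hedged sketch in C. The manifest grammar and the names submit_manifest() and switch_phase() are illustrative assumptions, not the actual Barrelfish interface.

    /* Hypothetical declarations; in the real system these would be the
     * OS/runtime interface and the application's own code. */
    void submit_manifest(const char *manifest);
    void switch_phase(int phase_id);
    void run_gui(void);
    void reencode_audio_in_parallel(void);

    /* Phase A: GUI only -- one core.
     * Phase B: parallel re-encoding -- three cores, gang-scheduled,
     *          all at the same time, every 20 milliseconds. */
    static const char *manifest =
        "phase A: 1 core\n"
        "phase B: 3 cores, gang, period 20ms\n";

    void music_library_app(void)
    {
        submit_manifest(manifest);  /* submitted once, at application start */

        switch_phase(0);            /* phase A: single-threaded GUI phase */
        run_gui();

        switch_phase(1);            /* phase B: parallel re-encoding phase */
        reencode_audio_in_parallel();

        switch_phase(0);            /* back to the GUI phase */
    }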
The second concept is that the operating system scheduler notifies the applications of the current allocation that it has allotted to them. This is very similar to work that has been done before: there are scheduler activations, and also work that's been done in the Psyche and K42 operating systems. So I'm applying these ideas in my system.
So the operating system can say something like you now have these two cores every ten
milliseconds, as other applications are going through different phases or new applications start
up.
Finally, since we're in an interactive setting we want to be responsive -- yes?
>>: The OS says these two cores, ten seconds of delay. How much -- you'll probably have
[inaudible].
>> Simon Peter: So when the operating system tells you this, then you have them. The
operating system has a right to change this at any time. In my system, it will do that whenever an
application goes into a different phase or another application comes up or an application exits.
So we have these quite defined phase boundaries essentially where the operating system can
change its mind and give you a new allocation.
>>: But can't [inaudible].
>> Simon Peter: Yes. So if it tells you, you have the cores, then you do have the cores. Yeah.
>>: Even if a priori operating system shows up, have the cores, you have two minutes you have
two minutes.
>> Simon Peter: It will tell you things like you have it every ten milliseconds. But it won't tell you
you will have it guaranteed for the next two minutes or something like that. So it can still change
its mind.
So since we're in this interactive scenario, we need quite small overhead when we're scheduling. We certainly can't take very long to change our mind. And this is a big problem, because computing these processor allocations, especially taking into account the whole architecture, the topology of the architecture, is a quite complex problem. It's essentially scheduling on a heterogeneous platform: you have different latencies of communicating between different cores; you have different cache affinities, different memory affinities. This problem is very hard to solve. And actually, if you have multiple applications, it takes quite a while.
In the evaluation section of the talk I will show you, for one example problem, how long it took to solve it.
Also if we're doing things like gang scheduling we have to synchronize dispatch over several
cores. If we do that every time slice and we're talking about interactive time slices typically on the
order of one millisecond to tens of milliseconds this might not scale with a lot of cores. So we'll
slow down the system.
So in order to tackle that problem I break up the scheduling problem into multiple time scales.
There is the long-term computation of processor allocations, and I expect this to be on the order
of minutes. So the allocations that I'm providing are expected to stay relatively stable for at least
tens of seconds or minutes.
Then there's the medium term -- the medium-term allocation phase changes that the applications can go through with their system call. And that's on the order of, say, deciseconds. I would expect that it doesn't make much sense to go through different resource requirements much faster than that.
But you want to have something where you want to be fairly interactive with the users so you
might still change very quickly between different phases.
Finally, you have the short-term per-core scheduling; that's the regular scheduler that runs in the kernel and does the time slicing of the applications, and that's on the order of milliseconds.
Okay. So how is this all implemented within the Barrelfish operating system? I realize all of this
by a combination of techniques at different time granularities.
First of all, I have a centralized placement controller. That's the thing on the right-hand side. I
should also say how this schematic is built up. We have the cores essentially on the X axis here
and there's as many cores as there are cores in the system. And we have some components
that run in kernel space and some components that run in user space.
So the placement controller is the entity that's responsible for doing the long-term scheduling. So
this will receive the processor requirement specifications from the different applications, and then
go and produce schedules for different combinations of phases that these applications can go
through.
And it makes sense to have this be centralized because it's doing something that we can take
some time for computing and it also needs to have global knowledge of what's going on within the
system in order to do this, the placement of different applications on to the cores.
And then the placement controller will download the produced schedules to a network of planners
that there's one planner running on each core. These planners will, first of all, just store that
information locally so that we don't have to do communication when we're going through different
phase changes with the placement controller.
Then we have the system call that allows an application to change a phase. When that happens, one thread of the application will issue the system call and communicate with the local core's planner to change the phase. That planner then has an ID for the phase, which it communicates to those other planners that are involved in scheduling this application.
So there's just one message being sent, one cache line essentially with the ID of the new phase
being sent to the other planners. And the other planners just have to look up that ID in their
database of different schedules that they have for their local core, and can then activate that schedule. The way they do that is essentially by downloading a number of parameters to the kernel scheduler for that particular phase.
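The per-core planner logic just described can be sketched as follows; the structure and function names are illustrative assumptions, not Barrelfish code. The point is that a phase change is only a table lookup plus a local update of the kernel scheduler's parameters.

    #include <stdint.h>

    #define MAX_PHASES 16

    /* Scheduling parameters for one phase on this core, precomputed by
     * the placement controller and stored locally by the planner. */
    typedef struct {
        uint64_t period;   /* dispatch period for this application          */
        uint64_t slice;    /* CPU share per period                          */
        int      gang;     /* nonzero if dispatch must be gang-synchronized */
    } phase_params_t;

    static phase_params_t local_schedules[MAX_PHASES];

    /* Illustrative stand-in for programming the kernel scheduler on this core. */
    void kernel_set_schedule(const phase_params_t *params);

    /* Handler for the one-cache-line message carrying the new phase ID:
     * no further communication is needed, just a local table lookup. */
    void planner_on_phase_change(int phase_id)
    {
        if (phase_id >= 0 && phase_id < MAX_PHASES) {
            kernel_set_schedule(&local_schedules[phase_id]);
        }
    }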
I have evaluated how long it takes in the worst case to do such a phase change. If we change every single scheduling parameter on, in my case, a 48-core machine, which was the biggest machine I had available at the time when I did this experiment, and do a phase change over all of the cores, it takes 37 microseconds. So since a phase lasts on the order of deciseconds, this is quite low overhead.
Finally, just as another number, the context switch duration of the system is four microseconds, which is the other number that I have there. And so doing this scheduling at multiple time scales allows me to do something that's pretty cool, something that I call phase-locked scheduling.
This is a technique that allows me to do synchronized dispatch, like gang scheduling, over a large number of cores without requiring excessive communication between these cores, which would not scale very well if I did it at very fine time granularities. The way I do this is I use a deterministic scheduler that's running in kernel space per core. In my case I use the RBED scheduler, which was presented by Scott Brandt in 2003 and is essentially a hard real-time scheduler -- though I'm not using it for the real-time properties, I'm using it for the property of being deterministic.
And then I do a clock synchronization over the core local timers that drive each scheduler on
each core. And then I can essentially download a fixed schedule to all of these deterministic
schedulers and know that these schedulers will dispatch certain threads at certain times without
any further need for communication.
So I do not have to do the communication over all the cores that are involved in, say, gang scheduling an application every time I want to dispatch or remove the threads from the particular cores.
So I have a slide on this to show you how this works. And this is an actual trace that I took of the
CPU bound and the barrier application running on the 16 core system.
In the first case, we're just doing regular best effort scheduling, and this is also to show you the
benefits of gang scheduling the applications.
So I should probably explain the graph. The red bars are the barrier application and the blue bars are the CPU-bound application; on the X axis we have time and on the Y axis we have the cores.
In the best effort case, the barrier application can only make progress in these small time
windows when all of the threads are executing together.
Then you can gang schedule. In the classical case you'd have to synchronize every time with
every core that's involved in the schedule.
>>: What's the application that makes just one core?
>> Simon Peter: Well, so this is where gang scheduling becomes harder, and this is why it takes a long time to compute an optimal gang schedule -- it's essentially a bin-packing problem. You have this other application that would run in the system, and it might either also be gang scheduled, or if it runs on one core there's no point in gang scheduling it.
You now have to figure out how to fill the two-dimensional matrix of time and cores optimally with the gang and then this one application. Either you leave a lot of holes, in which case your schedule is not very good, or you actually do the bin packing and try to fit everything in.
So it's quite hard to compute an optimal gang schedule. And this is why the long-term scheduling in the placement controller takes quite a long time. This is why it's important to break up the scheduling problem and not do it ad hoc, essentially, when the workload changes.
Okay. So in the phase-locked case, what I'm doing is I synchronize the core-local clocks once. For that I need some communication in order to synchronize the clocks. You can either use a readily available protocol for that, like NTP, to synchronize the clocks, or use a reference clock. In my system I used a reference clock: there's the PIT in the system, essentially, that I can use to synchronize the core-local APIC clocks.
And then you agree on a future start time for the gangs. For that you also need a little bit of communication. And then you also agree on the gang period. And finally, this will just schedule itself: each core has agreed and has its deterministic schedule, and now we can go ahead and gang schedule. We only might need to communicate again at some point in the future when we need to resync the clocks.
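Assuming clocks that have been synchronized once and an agreed start time and period, each core can then decide locally and deterministically when the gang runs. Here is a minimal sketch of that per-core computation; names and units are illustrative, not Barrelfish's scheduler code.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t start;    /* agreed future start time, in synchronized clock ticks */
        uint64_t period;   /* agreed gang period                                    */
        uint64_t slice;    /* portion of each period given to the gang              */
    } gang_schedule_t;

    /* Should this core be running its gang thread at time 'now'?  Every core
     * evaluates this locally, so all cores dispatch the gang at the same
     * moments without exchanging any further messages. */
    static bool gang_runs_now(const gang_schedule_t *g, uint64_t now)
    {
        if (now < g->start) {
            return false;                        /* gang has not started yet */
        }
        return (now - g->start) % g->period < g->slice;
    }

    /* Time of the next gang dispatch, used to program the core-local timer. */
    static uint64_t next_gang_dispatch(const gang_schedule_t *g, uint64_t now)
    {
        if (now < g->start) {
            return g->start;
        }
        uint64_t elapsed = (now - g->start) / g->period;
        return g->start + (elapsed + 1) * g->period;
    }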
I've actually evaluated whether that is the case on x86 machines, and it turns out that it's not. It seems that the APIC clocks are all driven by the same clock source. I ran a long-term experiment on several machines where I started the machine, synchronized the clocks, and checked for drift over a day or two, and there was no measurable drift on these machines. So it seems they just have an offset, based on when the core actually started up, but they're probably all driven by the same clock.
Okay. So let's go back to the original example and see if end-to-end scheduling actually benefits us here. This is the same two applications, just not running on Linux but running on Barrelfish. Otherwise same conditions. The barrier application slows down a little bit faster, and that's due to the barrier implementation that I had available in the OpenMP library that I used within Barrelfish. Obviously I couldn't use the Intel OpenMP library; because it's closed source, I wasn't able to just port it to Barrelfish.
We also see a little bit more noise. And that's mainly due to the fact that Barrelfish has more
background activity. There are these monitors and planners running in the background whereas
Linux is typically almost completely idle when you don't do anything.
The pitfall is still there -- that's the important bit. Now if I activate end-to-end scheduling, and I tell the operating system that the barrier-based application would benefit from gang scheduling, then it will actually do so, and we can see that the pitfall is now gone. So the barrier application keeps its progress due to gang scheduling. We still see a little bit of degradation due to the barrier. There's more time taken away from the CPU-bound application, and so it slows down faster. That's what you would expect, because we have to take the time that we're now allotting to the barrier application from somewhere.
We're now much fairer to the parallel barrier application in this workload mix. Okay. So far I've only talked about cores. Now, multicore is not just about cores: there are more things, like memory hierarchies, different cache interconnects, cache sizes and things like that. Another main outcome of the Barrelfish work is that the system topology is very important.
So how far away are different cores on the memory interconnect? Barriers are implemented by probing shared memory, and I did this measurement on the 16-core AMD Opteron -- you see the schematic here. I measured that it takes about 75 cycles to do a memory probe into a cache on a core that's on the same die within the system, and it takes about twice as much to go to a core that's on a separate die within the system. And so the application's performance executing these barriers depends very much on where the operating system is actually placing those threads. If it happens to place the threads close to each other, the application will make more progress; if it places threads on cores that are further away from each other, then we will see less progress.
And other multicore platforms look different again. So we have to be able to accommodate all of them in order to make good progress with the application. We can actually see this in the original example -- this is back on the Linux system -- in the big error bars that we're seeing for the barrier-bound application. As you can see, this is most pronounced in the two-threads case: we only have two threads, and the operating system is making arbitrary decisions over the many runs which I did in this benchmark. In many cases it happened to place them on two cores that were close together and we had very good progress; if it happened to place these threads on two cores that were further away from each other, we essentially got half the progress in the barrier application.
Now, how do we incorporate that? There is a component inside Barrelfish that's called the system knowledge base; that's mainly work done by another Ph.D. student in the group, Adrian Schüpbach. The system knowledge base is a component that contains abstract knowledge about the machine. It knows, for example, what the cache hierarchy is and what the topology is between different cores. And it gains this information either via runtime measurements or through other information sources like ACPI, for example, or CPUID. So it's one abstract store of all this information.
And what I did is I interfaced my scheduler with this component. So the system knowledge base
is now informed by my scheduler about the resource allocations that the scheduler is doing. So it
knows about utilizations in the machine.
Also, I should say that the system knowledge base is a constraint solver. So you can query it using constraints and cost functions, and this allows us to ask queries like: I need at least three cores that share a cache, and more are better -- instead of just saying I want these particular cores, and things like that.
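As an illustration of what such a query could look like from an application's point of view, here is a hedged sketch. The function skb_query() and the goal names (allocate_cores, share_cache, maximize) are stand-ins invented for this sketch; in Barrelfish the system knowledge base is an ECLiPSe constraint solver queried with Prolog goals through a client library.

    #include <stdio.h>

    /* Illustrative stand-in for the SKB client call. */
    int skb_query(const char *goal, char *result, int result_len);

    int request_cores(void)
    {
        char result[256];

        /* "Give me at least three cores that share a cache; more are
         * better" -- expressed as constraints plus a cost function
         * rather than as a list of specific core IDs. */
        const char *goal =
            "allocate_cores(Cores), length(Cores, N), N >= 3, "
            "share_cache(Cores), maximize(N).";

        if (skb_query(goal, result, sizeof(result)) != 0) {
            return -1;               /* solver could not satisfy the query */
        }
        printf("allocated cores: %s\n", result);
        return 0;
    }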
So I worked with Adrian and with Jana Giceva, another Ph.D. student in the group, to port a parallel column store, essentially a key-value store, to Barrelfish. As a workload, we ran a real workload on this application: all the European flight bookings, coming from a company called Amadeus, which is responsible for those. Essentially they're interested in high query throughput to book flights.
And we ran this column store together with a background stressor application inside the system.
Again, on the 16-core AMD system. And we specified six constraints for the application: there were things like cache and memory size, as well as sharing of different caches and NUMA locality. We asked the system to maximize CPU time, and after submitting this constraint query, it took the resource allocator several seconds to produce a final schedule. This is just to give you an idea of how long it can take to do this.
And it ended up deciding that it would partition the machine between the column store and the
background stressor application. And on this graph down here we can see the result in terms of
query throughput.
Let's start with the middle one, which is the way to naively run the two applications by just starting
the background stressor and starting the column store. In that case the column store just goes
ahead and tries to acquire all the cores in the system.
So it overlaps with the background stressor application and we can see the query throughput is
down there at a little more than 3,000 queries per second. There's also a way of just manually
placing these two applications, trying to partition them. Then we get a throughput of 5,000 -- 500
queries per second. And finally, on Barrelfish, we are really close to the optimal way of placing these two applications. So Barrelfish made the right decision in deciding to partition the machine between these two applications.
Finally, I'd also like to say a little bit more about my evaluation of phase-locked scheduling. It turned out, after evaluating it, that it's maybe not yet important on the x86 platform.
When I just used synchronized scheduling, synchronizing every time slice essentially over all the cores, there was no real measurable slowdown up until 48 cores, which was the biggest machine I had available to measure this. The reason is that both the core-local APIC timers and interprocessor interrupts, which is what I used in order to keep the schedule synchronized over all the cores, generate a trap on the receiving core.
And taking a trap takes around an order of magnitude longer on x86 than the act of sending an IPI to a different core. So in this case it doesn't gain much to avoid sending interrupts, which is essentially what I do with phase-locked scheduling: now every core uses its core-local timer, which generates the trap anyway. However, it does use fewer resources, since I don't actually send the interprocessor interrupts to the other cores. On the x86 architecture that doesn't make much of a difference, because x86 has a dedicated interrupt bus; there's no contention on that bus, essentially, from multiple applications trying to send interrupts.
But it does produce less contention on systems that don't have a special interrupt bus. For example, the Intel Single-chip Cloud Computer, which is another architecture that I investigated Barrelfish on, uses the same mesh interconnect both for the messages and cache lines that are being sent between cores and for the interrupts. So if you have a lot of traffic on that interconnect, the interrupts could actually be delayed on the interconnect, and then you don't have a schedule that's as nicely synchronized.
So the insight here is that one should always investigate the cost of intercore sends and receives individually. That was something that we hadn't thought through thoroughly within Barrelfish before; we always thought there was one cost associated with a message that's being transmitted between two different cores, but we didn't look so much at receive cost versus send cost.
Obviously there's more future work here. One thing that I haven't evaluated so far is how expressive the operating system and runtime API should be. So far I've only looked at these exact processor requirements, essentially.
If one were to deploy this in a real system, you would probably want to do an expressiveness analysis first, in order to not give the user this very expressive interface where you can essentially do everything, like the constraint solving and all that. It's nice because it allows you to experiment a lot with different things, but you probably want to figure out what an adequate interface for applications is. One thing that I looked at a little bit and found interesting was number-of-cores speedup functions that might be provided by applications, in order to give the operating system an idea of how well the applications scale with the workload they're running, and they could do that based on estimates.
And there's actually related work here. There's a system called Parkour, presented two years ago at HotPar by the University of California, San Diego, that allows you to do exactly these estimates. You give it an OpenMP workload, in this case, and it will figure out how well that workload will scale. Essentially you can feed that information into the scheduler, and the scheduler could automatically figure out how many cores to allocate to a particular application over some other applications that might scale better or less well.
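A hedged sketch of how such a speedup estimate could drive core allocation follows. Using Amdahl's law as the speedup model is a simplifying assumption made here for illustration; Parkour and the Barrelfish scheduler may model scalability differently.

    #include <stdio.h>

    /* Amdahl's law: speedup on n cores for a program whose parallel
     * fraction is p (0 <= p <= 1). */
    static double speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / (double)n);
    }

    /* Give an application more cores only while each extra core still
     * buys at least 'min_gain' additional speedup. */
    static int cores_worth_allocating(double p, int max_cores, double min_gain)
    {
        int n = 1;
        while (n < max_cores &&
               speedup(p, n + 1) - speedup(p, n) >= min_gain) {
            n++;
        }
        return n;
    }

    int main(void)
    {
        /* Example: 90% parallel code on a 16-core machine; stop when an
         * extra core adds less than 0.1x speedup. */
        printf("allocate %d cores\n", cores_worth_allocating(0.9, 16, 0.1));
        return 0;
    }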
Another thing that I haven't looked at so far is I/O scheduling. So far everything was mainly done within memory; I just looked at cores and caches. So there's still some work to be done on devices and interrupts, for example. What if we have a device driver that wants to be scheduled?
Finally, I also find heterogeneous multi-core architectures to be important in the future. Things
like programmable accelerators are becoming quite pervasive. Like GPUs but also network
processors and the question is what are the required abstractions here, are the abstractions I've
presented so far adequate for this or do I need to integrate more abstractions?
Okay. I'd also like to say a little bit about my Barrelfish-related achievements to date. Throughout my work I was able to author and co-author several papers at top-ranking conferences, workshops and also journals. With Barrelfish I participated in building a system from scratch. I'm the proud owner of the first check-in to the code base, which dates back to October 2007.
Barrelfish is a very large system. It contains around 150,000 lines of new code; if you count all the ported libraries and applications, we amount to around half a million lines of code.
It had more than 20,000 downloads of the source code over the past year and generated quite significant media interest. Being a Ph.D. student, I was also involved in supervising several bachelor's and master's students; the most important ones were one who implemented a virtual machine monitor together with me for Barrelfish, and another one who implemented the OpenMP runtime system that I then used for my experiments on Barrelfish.
I also had two bigger collaborations going on. One was with Intel, in order to evaluate Barrelfish on the Single-chip Cloud Computer; that is where I gained most of my knowledge of that platform. They also provided some other test systems for the group to do experiments on.
And then there was a collaboration with the Barcelona Supercomputing Center. They were interested in running a parallel runtime that was based on C++, and we didn't have a C++ compiler and runtime system for Barrelfish at the time, so I collaborated with them in order to essentially port one to Barrelfish.
So to conclude: multicore poses fundamental problems to the design of operating systems, things like parallel applications as well as the fast pace of architectural changes.
Commodity operating systems need to be redesigned in order to cope with these problems. I've focused on processor scheduling, where I found that combining hardware knowledge with resource negotiation and parallel scheduling techniques can enable parallel applications in an interactive scenario.
I'd also like to say that some problems are only evident when dealing with a real system -- for example, the insight that one should investigate message send and receive costs separately, and that these costs can be quite different on a multicore machine.
At the very end I'd like to say a little bit about myself and projects I've worked on besides Barrelfish. My main two areas are operating systems as well as distributed and networked systems. Other projects I have worked on include a replication framework for the Network File System version 4, which I did for my master's thesis. This is essentially a framework that allows you to specify different replication strategies, and also writable replication for the Network File System, which was not around at the time. I completed two internships throughout my Ph.D., both at Microsoft Research. The first one was in Cambridge, working with Rebecca Isaacs and Paul Barham, where we applied constraint solving techniques in order to help solve the problem of scheduling applications on heterogeneous clusters; in our case it was DryadLINQ applications that we wanted to run on heterogeneous clusters.
I did another internship with [inaudible], working on the Fay cluster tracing system that got published at SOSP last year, in 2011.
Fay is a system that allows you to trace through all the layers of the software stack on a number
of machines all the way down to the operating system kernel and doing so safely as well as in a
dynamic fashion.
Finally, just before starting my work on Barrelfish, while I was already in the Ph.D. program, I did a study of timer usage within operating systems that got published at EuroSys in 2008, where I found that many programmers actually set their time-outs to quite arbitrary values, without looking at what the underlying reason for setting the timer was in the first place and what the timer is being used for. And I provided several methodologies to get out of that problem in the paper. Finally, my preferred approach, especially to doing systems research, is to find a sufficiently challenging problem, understand the whole space, and build a real system. Thank you. [applause] Questions?
>>: Can you talk more about the constraint system used by the scheduler?
>> Simon Peter: Yes. So it's based on the ECLiPSe Prolog interpreter, not to be confused with the Eclipse integrated development environment.
It is Prolog with added constraints, essentially. We've ported this to Barrelfish; it's currently a single-threaded, centralized component.
>>: What kind of constraints does it solve in this case?
>> Simon Peter: Well, there are different solvers inside the system that you can use to solve the
constraints. You can specify constraints like this value should be smaller than another value in
the system or smaller than a constant or bigger than, constraints like that.
I don't know if you want to know --
>>: I'll follow up later.
>> Simon Peter: Okay. Yes?
>>: In your experiments, were all the applications providing you with scheduling information?
>> Simon Peter: So in the experiments I do that. The system itself does not rely on getting
information from every single application. It's more of a thing where applications can support the
scheduling by providing information. So if you don't provide information, then the operating
system scheduler will essentially resort to best effort scheduling for you.
>>: How much does it work out to? If I have a bunch of legacy applications [inaudible].
>> Simon Peter: So I have not evaluated that thoroughly. But definitely the background services that are running and are not providing any information did not impact the system very much. You can see that there was more noise in the system, so there is some impact, but for the sake of running the experiments it was still okay.
>> Andrew Baumann: Other questions?
>>: You mentioned earlier, you mentioned trying to decrease the latency or increase the
responsiveness, I guess, the experiment you showed us with the throughput.
>> Simon Peter: Yes, that is true. I'm sorry, I don't have a number for the responsiveness. The idea that I wanted to get across is that we need to be able to make scheduling decisions quite quickly, and if we do the constraint solving, for example, every time the workload changes, then that is clearly not responsive enough.
So for the purposes of this talk it's more about that.
>>: Can the application give you a hint that makes your constraint solving run longer than you expect?
>> Simon Peter: Yes. So that's one -- actually that's a good and bad point about the constraint
solving. Yeah, so you don't know up front how long it might take to do the solving. However, the
good thing about constraint solving is that it kind of searches the problem space.
So it will give you a quite good solution very quickly, and then it can go on and solve further. And that's actually something that I'm planning to use in the system as well: the placement controller will essentially go off and do the constraint solving, and it will start downloading scheduling information as it arrives at schedules, so we can start using those, and then go on and solve some more.
The interesting thing about that is also that it seems to be trading off quite nicely with different
lengths of phases that you have. If you have an application that has quite short phases, then
we'll go through these phases quite quickly. In that case we might not have the optimal schedule
available at that point. However, since the phases are quite short, it might not matter too much to
have the optimal phase allocation available. And if you have an application that has quite long
phases, then it also allows you to -- it allows you more solving time essentially to come up with a
better schedule for these longer phases and then the decisions will also have more impact.
Any more questions?
>>: So did I -- this is a general question about the scheduler. Are you proposing that we should integrate gang scheduling and this type of framework into existing systems? Should I put a gang scheduler in one of my phases, is that going to solve --
>> Simon Peter: You shouldn't just put the gang scheduler into the system. I'm just saying that the scheduling problem will definitely become much harder because of the parallel applications that you potentially want to run in your system. They do have scheduling requirements like gang scheduling -- synchronized dispatch.
>>: Doesn't having [inaudible] isn't that how you get scheduler [inaudible] that's different
scheduling [inaudible].
>> Simon Peter: That depends a little bit. That depends on how much you really care about the
best effort applications. I mean, if you don't care so much about the progress of these
applications, and since they're just best effort anyway, it probably doesn't matter so much for
them. You can just try and --
>>: [inaudible].
>> Simon Peter: That depends on -- right now, I mean -- that depends a little bit what the user
wants to do with it. If he's actually, like actually using Word at this point in time it's important that
you give him the service at that point in time. But it's still a best effort application. I'm not sure
what you're trying to get at right now.
Okay.
>> Andrew Baumann: Any other questions? Thanks [applause]