
>> Galen Hunt: So good morning, welcome you all here. It's my pleasure to be
hosting Andrew Baumann again. I hosted him about two years ago, almost
exactly two years that Andrew came through when he finished his PhD, which he
did at the University of New South Wales in Australia. Gernot Heiser was
his advisor. And after he came through here, he then went to ETH, where he's
been working with Timothy Roscoe, Mothy, as we all know and love him on
multikernels on Barrelfish. You know Barrelfish was not in the title nor in the
abstract anywhere.
>> Andrew Baumann: That was because we forgot to put it in after the
anonymization. [laughter].
>> Galen Hunt: Okay. So anyway, so we're glad to have him here to talk to us
about Barrelfish. So take it away, Andrew.
>> Andrew Baumann: Thanks. So just so I know, how many people were at
SOSP last week and saw this talk already? Okay.
>> Galen Hunt: Not very much.
>> Andrew Baumann: Good. Well, you guys will probably be bored for the first
part of the talk, but I've got some new stuff at the end.
And also, feel free to ask questions as I go.
So, yeah, the title of this is, as it was in the paper, The Multikernel: A New OS
Architecture for Scalable Multicore Systems. And this is joint work with a whole
bunch of folks at Cambridge, a bunch of students at ETH, and Mothy Roscoe, whom
Galen already mentioned.
And so what we're trying to answer in this work is the question of how should an
operating system for future multicore systems be structured? And there are two
key problems that we see here. The first is scalability, sort of the obvious one:
how do you scale up with an ever-increasing core count?
But the second, which is a little more subtle is managing heterogeneity and
hardware diversity. So to give some examples of what we mean by that, these
are three current or near current multicore processors. On the Niagara, you have
this banked L2 cache and a crossbar switch, and what that means is that all
regions of the L2 are equidistant from all the cores on the system. And so
shared-memory algorithms running on those cores that are manipulating data in
the same cache line work quite efficiently on the Niagara, whereas on the
Opteron you have a large, slower shared L3 that's almost twice as slow to access
as the local L2s. And on the Nehalem you have this banked cache with a ring
network inside the chip shuttling data around inside the cache.
And so the message is that the way that you optimize shared memory algorithms
to scale well on these three processors is different because they all have
different access latencies and different tradeoffs. And so we need a way for
system software to be able to adapt to these kinds of different architectures.
Another example of diversity is in the interconnect. This is an eight-socket, four
cores per socket, 32 core AMD Opteron machine that we have in our lab. And
you can see that these hypertransport links between the sockets make for pretty
unusual network topology. And that topology plays into the scalability and the
performance of shared memory algorithms executing on that machine.
You see the layout of the interconnect. Particularly you see when you go beyond
here these two links get contended. And so these kinds of interconnect
topologies play in the way that you want to optimize things on the machine. But
they're all different. This is what the forthcoming eight-socket Nehalems will
look like. And there are also plenty of other different hypertransport topologies
that you can have inside a box.
And even as things move on to the chip and densities on a chip become greater,
you'll start to see these kinds of networking effects inside a chip. So that's a
Larrabee with a ring network. This is a [inaudible] chip with a mesh network inside
the processor. And again, the communication latencies inside the chip depend
on where you are on the interconnect and the utilization of the interconnect.
Finally we're starting to see diverse cores. Today inside a commodity PC, it's
quite likely that you'll have programmable network interface of some sort or a
GPU. And you can put FPGAs in CPU sockets. For example, in that Opteron
system there are FPGAs that will sit on the hypertransport bus.
And in general operating systems, today's operating systems don't handle these
things particularly well. This is also one of the motivations behind Helios.
On the single die, it seems likely in the near future that we'll have heterogenous
cores either because they'll be asymmetric in performance, a lot of people are
talking about situations where you may have a small number of large, fast, out of
order cores that are power hungry, and a much larger number of smaller, simpler
in order cores. Or you may find that some cores on the chip don't support the full
instruction set. They can leave out things like streaming instructions,
virtualization support, and, you know, Cell is just one extreme example of core
heterogeneity in current systems.
So in general, there's increasing core counts, but also kind of related to that is
increasing diversity between cores and within the machine and between different
systems.
And so unlike in a lot of prior cases of scalability to large number of cores it's
going to be much harder to make these kinds of optimizations that you need to
make to scale at design time. Because the machines on which you run will be
different and the resources inside the machine will have different tradeoffs and
different topologies. And so we need a way for system software to be able to
adapt dynamically to the system on which it finds itself running in order to scale.
And so we think that because of all of these things now is a good time to rethink
the default structure of an operating system. Today's multiprocessor operating
systems are more or less like this by default: the shared-memory kernel that executes
on every core in the machine communicates and synchronizes using shared-memory
data structures protected by locks or other synchronization mechanisms.
And anything that doesn't fit into that model tends to be abstracted away as a
device driver. So a programmable network interface, regardless of the fact that
you can run application code over there, is hidden behind the device driver
API.
And so we propose structuring the operating system for something like this, more
as a distributed system than as a shared-memory multiprocessor program. And
we call the model for the OS as a distributed system the multikernel. And in the
rest of this talk, I'm going to introduce and motivate firstly three design principles
that make up the multikernel model. And they are making all communication
between cores explicit, making the structure of the OS neutral to the underlying
hardware, and viewing state inside the machine as replicated.
So after I introduce those design principles, I'm going to talk about Barrelfish,
which is our implementation of a multikernel. I'll present some evaluation from
our SOSP paper. And I'll also present something a little more concrete to give
you a sort of feeling for how things work inside Barrelfish and how we can analyze
and optimize the performance of something in a system structured this way.
But first the design principles. So I said that we want to make inter-core
communication explicit. And in a multikernel what that means is that all
communication between cores in the system is done with explicit message-passing
operations. And so there is no assumption of shared state in the model.
That's quite a radical change, radical departure from the way people have built
multiprocessor operating systems in the past. But if we can make it work, we
think it has some good advantages. First, it allows you to decouple the structure
of the system from the concrete details of the communication mechanism
between cores. And so you explicitly express your communication patterns inside
the machine. So to give an example of that, typically the way you optimize
shared-memory data structures is you think about which cache lines are likely to
be local, which cache lines are likely to be remote, how do I optimize for the
locality of some piece of data and think explicitly about when do I -- when I need
to communicate, which lines am I going to invalidate, which data structures are
going to be moving across the machine?
If you're doing this with message based communication, you think for some
particular operation with which other cores in the system do I need to
communicate? And in order to do that, I need to send them a message. So it's a
different way of thinking about scalability within the machine.
Message based communication also supports heterogenous cores on
non-coherent interconnects. So again, think about the example of an ARM
processor on the other side of a PCI Express interface. Very similarly to
Helios, you cannot coherently share memory with that thing, but you can send it a
message. And even if there's no coherent shared memory and the software
running on the offload processor is running a different instruction set and
supports different register sizes and pointer sizes, it can interpret your message
and do something in response.
Message based communication is potentially a better match for future hardware,
either because that hardware might support cheap explicit message passing. An
example here is the Tile processor, where you can send a message in software
across the interconnect in tens of cycles, or because that hardware may not
support cache-coherence or may have limited support for cache-coherence. So
Intel's 80-core Polaris prototype is one example of a machine where they didn't
implement cache-coherence in hardware because it was too expensive.
Message based communication allows an important optimization of split-phase
operations. So what this -- what split-phase means in this context is that you
decouple the message that requests some remote operation from the message
that indicates the success or failure of the response.
And so rather than everything being a sort of synchronous model where you
perform some operation and you block waiting for that operation to complete, you
can initiate multiple operations and then asynchronously handle responses. And
that's important for getting greater concurrency and parallelism inside the
machine. I'll give you an example of that later on.
And finally, we can reason about the scalability, correctness, and performance of
systems built upon explicit communication with message passing. So you may
well be thinking that's all very nice, but the machines that we have today are
fundamentally shared-memory systems. And the performance of any message
passing [inaudible] on these kinds of systems is going to be slow compared to
using the hardware cache coherence to migrate data.
And this is what I'm about to present to you here: a very simple microbenchmark
just exploring that tradeoff. So there are two cases. In the
shared-memory case, we have a large shared array in shared memory, and we
have a number of cores manipulating varying-sized regions of the array. And
there's no locking here. The cores are just directly executing write updates
on the shared array.
And so what happens when a core issues a write is that the cache coherence
protocol migrates cache lines around to the different cores that are performing
the updates. And while that migration happens, the processor is stalled, doesn't
get to retire any other instructions while it's waiting for the interconnect to fetch
and invalidate cache lines. And so the performance of the whole operation is
going to be limited by the latency of these round trips across the interconnect.
And in general, it's going to depend on two things, the size of data in terms of the
number of cache lines that have to be modified as part of each update and the
contention in terms of the number of cores all hammering on the same cache
lines.
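[Editor's note: a minimal C sketch of the shared-memory side of this microbenchmark, for illustration only: several cores directly issuing write updates to the same cache lines of a shared array, with no locking, so almost all of the cost is coherence traffic. The thread setup, cycle counting, and sizes here are assumptions, not the actual benchmark code from the talk or paper.]

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define NLINES      8        /* cache lines touched per update */
    #define NUPDATES    100000   /* updates performed by each core */
    #define NCORES      4        /* contending cores               */

    /* the shared array all cores hammer on */
    static volatile uint8_t shared_array[NLINES * CACHE_LINE]
        __attribute__((aligned(CACHE_LINE)));

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static void *worker(void *arg)
    {
        uint64_t start = rdtsc();
        for (int u = 0; u < NUPDATES; u++) {
            /* one write per cache line: the coherence protocol must
             * fetch and invalidate these lines between the cores, and
             * the processor stalls while that happens */
            for (int l = 0; l < NLINES; l++)
                shared_array[l * CACHE_LINE] = (uint8_t)u;
        }
        printf("core %ld: ~%lu cycles/update\n", (long)(intptr_t)arg,
               (unsigned long)((rdtsc() - start) / NUPDATES));
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCORES];
        for (long i = 0; i < NCORES; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NCORES; i++)
            pthread_join(t[i], NULL);
        return 0;
    }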
And this is how it performs. This is a four socket, 16 core Opteron machine.
This is what happens when everyone modifies one cache line, two, four, eight cache
lines. You can see that it doesn't scale particularly well. And remember that I
said there's no locking here. So in the 16-core case and in the 14-core case all of
those cores are executing the same number of instructions, simply
modifying the same region of shared memory. And all of the extra cycles
between there and down here are stalls waiting for the cache coherence
protocols to move cache lines around.
>>: [inaudible].
>> Andrew Baumann: Yes. Well, in practice, given the x86 consistency model, it
doesn't actually make a difference, because when you do a write to one word
you have to move the whole cache line over.
So in the message passing case what we've done is we've localized the array on
a single core. So there's one core that is responsible for updating the array. And
all of the other cores, when they have updates, instead of directly modifying the
array, they express their update as a message. We assume that the update with
-- and its results can be expressed in a single cache line. So we essentially send
a request in a cache line to say please manipulate this entry in the array and tell
me when you're done.
And we send it as an RPC to the server core. And the way that we send it is
with this implementation of a message channel. Now, on current hardware the
only communication mechanism we have is coherent shared memory. And so
what we have here is a ring buffer based in shared memory that we use for
shipping the messages between cores. And we have microoptimized the
implementation of the ring buffer to the details of AMD's hypertransport cache
coherence mechanism so that it moves messages as efficiently as possible on
this hardware.
And in this experiment, the clients send the request and then block waiting for a
reply, so they're synchronous RPCs.
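[Editor's note: a simplified C sketch of the kind of cache-line-granularity, shared-memory ring-buffer channel described here, with one writer and one reader. The real URPC implementation is carefully specialized to the AMD coherence protocol and includes flow control; the struct layout, epoch scheme, and function names below are assumptions for illustration only.]

    #include <stdbool.h>
    #include <stdint.h>

    #define CACHE_LINE 64
    #define RING_SLOTS 16

    /* one message occupies exactly one cache line; the last word is an
     * epoch flag written after the payload, so the reader never sees a
     * half-written message */
    struct urpc_msg {
        uint64_t payload[7];
        volatile uint64_t epoch;
    } __attribute__((aligned(CACHE_LINE)));

    struct urpc_chan {
        struct urpc_msg *ring;  /* RING_SLOTS slots in shared memory    */
        unsigned pos;           /* next slot (private to each endpoint) */
        uint64_t epoch;         /* expected/next epoch for that slot    */
    };                          /* ring starts zeroed, epoch starts at 1 */

    /* no flow control in this sketch: the sender must stay fewer than
     * RING_SLOTS messages ahead of the receiver */
    static void urpc_send(struct urpc_chan *c, const uint64_t payload[7])
    {
        struct urpc_msg *m = &c->ring[c->pos];
        for (int i = 0; i < 7; i++)
            m->payload[i] = payload[i];
        __sync_synchronize();          /* payload visible before flag */
        m->epoch = c->epoch;           /* publish the message         */
        if (++c->pos == RING_SLOTS) { c->pos = 0; c->epoch++; }
    }

    /* returns false with no remote traffic if nothing has arrived: the
     * slot's cache line stays local until the sender writes it */
    static bool urpc_try_recv(struct urpc_chan *c, uint64_t payload[7])
    {
        struct urpc_msg *m = &c->ring[c->pos];
        if (m->epoch != c->epoch)
            return false;
        __sync_synchronize();
        for (int i = 0; i < 7; i++)
            payload[i] = m->payload[i];
        if (++c->pos == RING_SLOTS) { c->pos = 0; c->epoch++; }
        return true;
    }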
And that's how it scales. This is what happens when the server core manipulates
one cache line in this shared array for every message. And that's what happens
when it manipulates eight cache lines, which is not terribly surprising. Because
the data that it's manipulating is now staying entirely local in its cache. The only
things that move across the interconnect are the messages in the ring buffers.
>>: [inaudible] so you do an IPI?
>> Andrew Baumann: No. So in this experiment the server is polling and the
clients are polling while they're waiting for a reply. And so on this machine for
this benchmark, which admittedly is not particularly fair on the shared-memory
case, since there are many ways you could optimize this kind of thing that don't have
everybody hammering on the same shared memory, we do better beyond four cores
and four cache lines.
But what's more interesting I think is if you look at the cost of an update as
experienced at the server, so this is the time it takes the server to perform each
update, and as you'd expect it stays flat because it's just manipulating local state,
you can infer that this time difference here is essentially the same period that the
client is blocked waiting for a reply. And so the message is sitting in the queue.
And the reason that this scales up is because the server is saturated and there's
a queuing delay at the server to process the updates.
But in this case, the client is retiring instructions; it's polling on the message
channel waiting to see the reply, whereas in the shared-memory case it doesn't
get to do anything, the process is stalled waiting for the interconnect.
So if we had an asynchronous RPC primitive where the client could send an
update to the server core, do something else, do some other useful work and
then asynchronously handle the reply, those cycles would actually be available to
perform useful work. And that's why we say that this split phase optimization is
going to be important, because as the latencies increase you want to be able to
do work while you're waiting for something to happen remotely.
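[Editor's note: a sketch of the split-phase pattern being argued for here: issue several asynchronous requests, keep doing useful work, and handle replies as they arrive. The urpc_send/urpc_try_recv calls are the hypothetical channel primitives from the previous sketch, and do_other_useful_work is a placeholder.]

    #include <stdbool.h>
    #include <stdint.h>

    struct urpc_chan;   /* the hypothetical channel from the earlier sketch */
    void urpc_send(struct urpc_chan *c, const uint64_t payload[7]);
    bool urpc_try_recv(struct urpc_chan *c, uint64_t payload[7]);

    static void do_other_useful_work(void) { /* placeholder */ }

    /* ask the server core to update n array entries, split-phase */
    void update_entries_split_phase(struct urpc_chan *to_server,
                                    struct urpc_chan *from_server,
                                    const uint64_t *entries, int n)
    {
        int sent = 0, acked = 0;
        uint64_t msg[7] = {0}, reply[7];

        while (acked < n) {
            /* phase 1: keep a bounded number of requests in flight,
             * without blocking for any individual reply */
            if (sent < n && sent - acked < 8) {
                msg[0] = entries[sent++];
                urpc_send(to_server, msg);
            }
            /* phase 2: drain whatever replies have already arrived */
            while (acked < sent && urpc_try_recv(from_server, reply))
                acked++;
            /* these cycles are available for real work, instead of
             * being lost to coherence stalls as in the blocking case */
            do_other_useful_work();
        }
    }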
Our second design principle is that we separate the structure of the operating
system from the underlying hardware, at least as much as possible. And so in the
model the only hardware specific parts of the system are the message transports
which are responsible for moving data between cores and so I've just showed
you one example on hypertransport. They're highly optimized and specialized to
the underlying hardware. And of course the parts of the system that interact with
hardware, such as device drivers, are the bits that manipulate the CPU state and so
on.
This is important because this is how we can adapt to all the hardware diversity
that I've just motivated. Either changing performance characteristics, changing
topologies and so on. In particular we can late-bind the implementations of the
message transports and also the messaging protocols used above those
transports to the particular machine on which we run. And I'll show you an
example later, in the unmap case, of how the way that you send messages in
a particular machine can be changed depending upon the topology of that
machine.
The final design principle is that all potentially-shared state is accessed as if it
was a local replica. This includes anything that is traditionally maintained by an
operating system, including run queues, process control blocks, file system
metadata, so on and so forth.
This is kind of required by the message-passing model. If you don't have any
shared state in the model, then what you have to have is local state and
messages to maintain the replicated copies of local state.
It also -- but it's good because it naturally supports domains that don't share
memory. So again think about the case where you have some application, part
of the application is running on a CPU, part of the application is running on an
offload processor, like an ARM on the other side of a PCI Express bus.
It also naturally supports changes to the set of running cores. You can think
about hotplugging devices that have cores on them, but you also just have a
problem already in today's system of how do you imagine bringing up and
shutting down cores for power management?
And in general there's a lot of literature in the distributed system space that is
related to how do you maintain the consistency of replicas when nodes in the
system are coming and going and they have partial inconsistent replicas when
they reappear.
So we think that there's a way to take that stuff and apply it to the problem of
power management and hot plug and operating systems.
So as I've sort of presented the model so far, you can see this spectrum of
sharing and locking where traditional operating systems are progressively trying
to get more scalable by introducing finer-grained locking and then partially
replicating state in order to scale up. Whereas the multikernel is this extreme
end point over here on the right, where we have no shared state at all and replicas
are maintained by protocols based on message passing.
In reality there are going to be situations where it's much cheaper to share
memory. You can think about two hardware threads on the same processor.
You can think about cores with a large shared cache between them. There will
be situations in the hardware level where it's cheaper to share locally than to
exchange messages between cores.
So in reality, in a multikernel system, we see sharing as a local optimization of
replication. So you may find that one replica of some piece of state is shared
between some threads or some tightly coupled cores, and rather than send
messages to those other cores, you simply take out a lock and manipulate the
local shared copy and then release the lock.
But the important difference here is that sharing is the local optimization of
replication rather than replication being the scalability optimization of a
shared-memory model. And so this firstly allows us simply to support hardware
that doesn't have sharing, or hardware where it's cheaper to send
messages, for a more scalable system, but it also allows us to make this local
sharing decision at runtime based upon the hardware on which we find
ourselves. And the basic model remains this split-phase replica update model.
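[Editor's note: a sketch of "sharing as a local optimization of replication". Whether a group of cores shares one replica under a lock, or keeps separate replicas updated by messages, is decided at run time from the discovered topology. The structures, the shares_replica test, and urpc_send_update are illustrative assumptions, not Barrelfish code.]

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct urpc_chan;                          /* hypothetical transport */
    void urpc_send_update(struct urpc_chan *c, uint64_t delta);

    struct replica {
        pthread_mutex_t lock;  /* only contended when the replica is shared */
        uint64_t state;
    };

    struct core {
        int id;
        int socket;                 /* filled in from discovered topology */
        struct replica *replica;    /* possibly shared with nearby cores  */
        struct urpc_chan *chan;     /* channel to this core's OS node     */
    };

    /* the runtime sharing decision: here, cores under one shared cache
     * (same socket) use one replica; in reality this would be driven by
     * measured costs rather than a hard-coded rule */
    static bool shares_replica(const struct core *a, const struct core *b)
    {
        return a->socket == b->socket;
    }

    void apply_update(struct core *self, struct core *peers, int npeers,
                      uint64_t delta)
    {
        /* local (possibly shared) replica: lock, update, unlock */
        pthread_mutex_lock(&self->replica->lock);
        self->replica->state += delta;
        pthread_mutex_unlock(&self->replica->lock);

        /* remote replicas: split-phase update messages; the
         * acknowledgements are handled later by the event loop */
        for (int i = 0; i < npeers; i++) {
            if (shares_replica(self, &peers[i]))
                continue;  /* already covered by the shared local replica */
            urpc_send_update(peers[i].chan, delta);
        }
    }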
So if you put all that together, this is sort of our logical picture of what a
multikernel looks like. You have some instance of the operating system running
on every core in the machine. And it's maintaining a partial replica of some part
of the logical global state of the OS. And it's maintaining those replicas by
exchanging messages between the different OS nodes.
Note that you can specialize the implementation of the OS node and the replica
that it maintains to the architecture of the particular core on which it's running and
also note that where hardware supports that we can happily support applications
that use shared memory across multiple cores but we don't want to rely upon
shared memory for the correctness and for the implementation of the OS itself.
And so that's more or less the multikernel model.
Barrelfish is our prototype implementation of a multikernel. It's from scratch,
we've reused some libraries, but essentially everything else is written by us. And
it currently runs on 64-bit Intel. Orion Hodson at MSR Cambridge has an initial
port to ARM, so the next job for us is to integrate that into the same tree, once I
get back to Zurich. And it's open source.
If you look at what Barrelfish looks like, we've partitioned the OS node that was
the thing running on every core into a kernel mode portion that we call the CPU
driver that is completely specialized to the hardware on which it runs, handles
traps and exceptions and reflects them up to user software. So you can think of
this as sort of a microkernel on each core.
And a privileged user domain called the monitor. The monitor also runs on every
core. But the monitor actually exchanges messages with the monitors on all the
other cores. And so the monitors together implement this logical distributed
system, and they mediate local operations on global state by exchanging
messages between themselves to maintain replicated state.
Our current message transport -- and this is only on cache coherent x86
hardware, is our implementation of user-level RPC. So again, this is this
shared-memory ring buffer that is optimized for user-to-user message transport
based upon shared memory. But we fully expect that to change as hardware
changes and the messaging mechanisms change. Much like in a microkernel or an
exokernel, most other system facilities are implemented at user level.
There's a whole other set of ideas in Barrelfish that are basically the design
choices that we make when building an operating system but are not directly
related to the multikernel model but are just a whole set of choices that you have
to make when you build an OS, and things that we thought were the right way to
build an OS.
I'm not going to talk through all of these points, but feel free to ask questions. So
some of the important ones are minimize shared state obviously. We decouple
the messaging, the thing that moves the message between cores, from the
notification mechanism, such as an IPI on current hardware. We use
capabilities for all resource management using the same model as the seL4
system that was presented at SOSP this week.
We have upcall based dispatch. I said we run drivers in their own domains. We
use a lot of DSLs for specifying different parts of the operating system, including
in this case device registers and device access code is generated from a DSL.
And so on.
There are all sorts of applications running on Barrelfish, because we're trying to build
it as much as you realistically can in research as a real OS. You're looking at the
slide show viewer that runs on Barrelfish. Our web server runs on Barrelfish, and it
successfully withstood a slashdotting a couple of weeks ago.
We now have a virtual machine monitor, and that's our plan for getting device
support for a lot of the devices where we don't worry about the performance.
>>: So OS components like the [inaudible] that you [inaudible].
>> Andrew Baumann: Yeah, our networking stack is all [inaudible] at the
moment. And that's a placeholder. We --
>>: [inaudible].
>> Andrew Baumann: Don't have one yet.
>>: Okay. So is the web server just keeping everything in memory and then
serving --
>> Andrew Baumann: The web server is serving -- there are two cases of the
web server. Everything is in memory, but it can either be the static RAM FS -- there's
a RAM FS essentially that the web server serves from -- and there's a dynamic case
where we have an in-memory database and we serve things out of the database.
But we don't have the storage driver yet.
So the virtual -- we're implementing the device drivers where we care about the
performance where we want to perform benchmarks and for our hardware, but
the reason for the VM is that it gives us a way to support a much larger set of
devices where the performance doesn't matter so much.
We run shared-memory benchmarks, run a database engine, we run a
constraint engine. I'll tell you why we might want to run a constraint engine in a
moment. And more every day. So that's Barrelfish.
Now, how do you evaluate a completely different operating system structure? In
particular, Barrelfish, as, you know, an initial implementation of something, is
necessarily much less complete than anything like Windows or Linux. So it's
very difficult to do comparisons here that aren't apples and oranges.
Our goals for Barrelfish initially are that it should have good baseline
performance, which means that on at least some current hardware it's comparable
to existing systems. What we're more interested in seeing is scalability with core
counts and the ability to adapt dynamically to the underlying hardware on which we
find ourselves, and also to exploit the message-passing primitives for
good performance.
So I'll show you some bits of the evaluation. This is a microbenchmark of
message-passing cost simply because every research operating system has to
have microbenchmarks. There's a lot of numbers on these, but the high level
point is that on current cache-coherent hardware it takes between 400 and 700
cycles to move a message, where the message is a cache line, from one core
to another. And that's because we have microoptimized it down to two
hypertransport requests on AMD systems. And we think that, given the way things are
on current hardware, that's probably about the best you can do.
The other thing to note is that you get batching and pipelining for free. If you
send multiple messages down the same channel, they actually get pipelined
across the interconnect and you get much better than the single message latency
for sending multiple messages at once.
There's an interesting side note here, which is sort of comparing the multikernel
system, which is using message passing between cores as a way to scale the OS,
to a microkernel, which is using message passing within a core between different
components as a way to manage protection and isolate different parts of the
system from each other. And this is not really our main goal, but it's an
interesting comparison.
At least on this machine, the latency of an intercore message is
comparable to the latency of an intracore message on a microkernel, with the
advantage that you don't have to do a context switch, you can get better
throughput because you pipeline these things across the interconnect and less
impact on the cache and so on.
So it's interesting to think -- you know, one way of thinking about this is
partly like a microkernel, but rather than decomposing these things into
different servers that run on the same core and having to context switch
between them, you decompose these things into different servers that run on
different cores with message passing between the cores. Obviously, depending
upon the amount of data that you need to move and whether you have shared
memory, these tradeoffs change. But it's an interesting comparison.
What I'm going to present now is a case study of how we implement in Barrelfish
one piece of OS functionality that is often a scalability problem. And in this case,
it's unmap. When a user unmaps some region of memory what the operating
system needs to do is send a message to every other core that may have that
mapping in its TLB, and wait for them to acknowledge that they performed the
unmap operation locally. In most systems the way this works is that the kernel
on the initiating core sends an inter-processor interrupt to all the other cores that
might hold the mapping and then spins on some region of shared memory
waiting for them to acknowledge that they performed the unmap locally.
In Barrelfish the way this works is that the user domain sends a local request to
its monitor domain. So that's a local message. And then the monitor
performs a single-phase commit across the other cores in the machine that
hold the mapping. And this is single-phase commit because that's essentially
what this is: send a request to every core, wait for them to acknowledge that they
have performed the operation. And the question is how do we implement this
single-phase commit, how do we implement this communication inside the
machine?
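[Editor's note: a sketch of unmap as a single-phase commit over message channels: send the request to every core that may hold the mapping, then collect an acknowledgement from each. send_unmap_request and try_recv_ack are hypothetical stand-ins for the monitor's channel operations; the real system also chooses the send order and routing, as described next.]

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct chan;   /* hypothetical per-core message channel */
    void send_unmap_request(struct chan *c, uintptr_t vaddr, size_t len);
    bool try_recv_ack(struct chan *c);

    /* single-phase commit: the operation completes only when every core
     * has acknowledged that it performed the local TLB invalidation */
    void unmap_commit(struct chan **cores, int ncores,
                      uintptr_t vaddr, size_t len)
    {
        /* phase 1: fire off the request to every core that may hold the
         * mapping, without waiting in between */
        for (int i = 0; i < ncores; i++)
            send_unmap_request(cores[i], vaddr, len);

        /* phase 2: poll the channels until all acknowledgements arrive */
        bool done[ncores];
        int acked = 0;
        for (int i = 0; i < ncores; i++)
            done[i] = false;
        while (acked < ncores) {
            for (int i = 0; i < ncores; i++) {
                if (!done[i] && try_recv_ack(cores[i])) {
                    done[i] = true;
                    acked++;
                }
            }
        }
    }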
We looked at a couple of different messaging protocols to do this. In what we
call the unicast case, the initiating monitor has a point to point channel with all
the other monitors in the system, and so when the initiating monitor gets an
unmap request, it writes that request into every channel and then waits for the
acknowledgements on every channel -- sorry, message channel, each of which
consists of multiple cache lines.
In the broadcast case, we have one shared channel, so every receiver core is
polling on the shared channel and the request is written once into that shared
channel. Neither of these perform particularly well. Broadcast doesn't scale any
better because cache coherence is not broadcast, even if you write the message
once into one region of shared memory. If everybody's reading it, it crosses the
interconnect once for every reader.
So the question is how can we make this scale better? If you look at the
machine, this is the machine on which we're doing this benchmark, when you
send the message once to every other core, your message is crossing the
interconnect many, many times. This interconnect link carries the message at
least four times because it's going to each of these four cores.
What you would like to do is something more like a multicast tree, where you
send the message once, say, to every socket and then send it on locally
to the other cores on the same socket. And we do this
in Barrelfish. So in the multicast case we have message channels to an
aggregation core on every socket in the box, and then that core, that monitor,
locally sends it on to its three neighbors. And that's much faster because they're
sharing an L3 cache. And then we aggregate the replies, the acknowledgement
in the same way. There's one additional optimization to this, which is if you look
at the topology of the machine, some cores are further away in terms of more
hypertransport hops and therefore greater message latency than other cores in
the system.
And so if you send the messages to the sockets that are furthest away from you
first and then to progressively closer sockets, you get greater parallelism as the
messages travel through the machine. And we call that NUMA-aware
multicast. And that scales much, much better.
Interestingly, in the NUMA-aware case you can see steps where you cross over
additional numbers of hypertransport hops.
But this brings up a general point. In order to perform this optimization, you need to
know a fair bit about the underlying hardware. In particular for this one we need
to know the mapping of cores to sockets. We need to know which cores are on
the same sockets. We can get that information out of CPUID.
We need to know the messaging latency so that we can send to things that are
furthest away first and then to things that are local to us.
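[Editor's note: a sketch of how a NUMA-aware send order might be computed: one aggregation core per socket, sorted so the most distant sockets (most interconnect hops, highest measured latency) are sent to first, keeping the slow messages in flight while the cheap local ones complete. In Barrelfish this information comes from the system knowledge base; here it is just an array, and the field names are assumptions.]

    #include <stdlib.h>

    struct socket_route {
        int aggregation_core;  /* monitor that forwards within its socket */
        int hops;              /* interconnect hops from the sending core */
        unsigned latency;      /* measured one-way message latency, cycles */
    };

    /* order sockets farthest-first: more hops first, then higher latency */
    static int farthest_first(const void *a, const void *b)
    {
        const struct socket_route *x = a, *y = b;
        if (y->hops != x->hops)
            return y->hops - x->hops;
        return (y->latency > x->latency) - (y->latency < x->latency);
    }

    /* sort the per-socket routes into the order in which the aggregation
     * cores should be sent to for a multicast */
    void build_send_order(struct socket_route *routes, int nsockets)
    {
        qsort(routes, nsockets, sizeof routes[0], farthest_first);
    }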
More generally, Barrelfish needs a way to reason about and optimize for
relatively complex combinations of hardware resources and interconnects and so
on. The way we tackle this is with constraint logic programming.
We have a user-level service called the system knowledge base, which is a port of
the ECLiPSe constraint engine. And that stores a rich and detailed representation
of hardware topology, performance measurements, device enumeration, all these
kinds of data that we can gather from the underlying hardware, and it allows us to
run constraint optimization queries against it.
And so in this case, there's actually a Prolog query that we use to construct the
optimal multicast routing tree for one- or two-phase commit inside the machine.
When we put all that together, this is how unmap scales in Barrelfish. The important
thing here for me is not that, you know, we can scale better than Windows or
Linux, partly because all of these things could be optimized. Barrelfish is paying
the very heavy cost of a local message-passing operation that is quite inefficient.
Linux and Windows could equally well be optimized to do sort of multicast IPI
techniques or at least use broadcast IPIs. The important point for me is that
recasting this problem from a shared-memory synchronization problem into some
more abstract message-passing operation like a single-phase commit, and then
being able to optimize behind that high-level API for the machine on which you're
running, based upon performance measurements and so on, seems like a good
way to cast these kinds of optimizations and be able to adapt to the hardware that
you run on.
>>: [inaudible].
>> Andrew Baumann: Yeah?
>>: So which systems are using IPIs and which systems are using polling.
>> Andrew Baumann: Barrelfish is using polling.
>>: Okay.
>> Andrew Baumann: I can talk about IPIs in a moment. These two systems
use IPIs.
>>: Is that -- do you think that the main thing you're seeing on this is that the
efficient multicast or is there a difference between polling and IPI?
>> Andrew Baumann: Well, so, yes, so clearly you are seeing a difference here
in polling. If there's a user application running on Barrelfish on the cores that you
want to unmap from, then there's a tradeoff between time slicing overhead and
how long it takes the core to acknowledge the message and whether you want to
send an IPI.
So let me come back to that at the end. I have a slide about it.
The other things that we ran on Barrelfish are compute-bound shared-memory
workloads. So we had some benchmarks from the NAS
OpenMP benchmark suite and SPLASH-2. These are not very interesting,
because they're essentially just exercising the hardware cache coherence
mechanisms to migrate cache lines around. They're essentially just here as, you
know, sort of proof that even though the operating system doesn't use
shared memory it can happily support user applications that do so.
So they're mainly exercising the hardware coherence mechanisms. And they're
not very interesting, as you can see. Actually, none of these things scales
particularly well on that hardware, which maybe says something about the
granularity of sharing that these benchmarks assume.
The other set of benchmarks that we ran, just briefly, are looking at sort of
slightly higher-level OS services, I/O and stuff like that. We get the same network
throughput as Linux; we can run this kind of pipelined web server setup where
we run the driver on one core, an application such as the web server on another core,
and a database engine on another core, and pipeline these requests across the cores
in the system, and we get respectable performance.
This is very much apples and oranges, so I'm not going to, you know, try to read
too much into this, because any small, efficient, static in-memory web server
can probably beat something on Linux, even if it's a fast web server on Linux.
The high-level point for us here is just that you can build similar services on top of
this, and there is no inherent performance penalty from structuring the system in
this way, but we really have to get more experience with complex applications
and higher-level OS services before we can claim success on this.
Enough evaluation. I have more if you want to ask questions about it.
But what I wanted to show you finally was an example of first of all the Barrelfish
sort of programming interface that we have at the moment, and secondly how
you can analyze and tune and scale performance of things in a system that's built
this way, because it's quite a different model to a sort of a traditional OS.
So what I'm going to show you now is a series of traces from the user code that
is doing this. Essentially you have a domain running on a single core with a
single dispatcher. A dispatcher is like our version of a thread, except that it's an
explicit upcall model instead of the
thread abstraction.
And what you want to do is get another dispatcher running on every other core in
the system in the same shared address space. So this is typically what happens
if you run something like SPLASH or OpenMP in the setup phase where it's
forking a thread on every other core in the same address space.
So on Barrelfish the way that we do this is we iteratively request creating a new
dispatcher on every other core. And this is split-phase. There's one
operation to say please start creating a dispatcher on another core, and there's a
separate acknowledgement -- which, because we're using C in, you know, quite an
event-driven style, is a separate acknowledgement that the dispatcher
has been created. So we keep a count of the number of dispatchers created
and then we handle messages until that count reaches the number of
dispatchers that we require. So we create N dispatchers concurrently.
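[Editor's note: a sketch of the split-phase spanning loop just described: one call requests a new dispatcher on another core, a separate event signals completion, and the domain simply counts completions while handling events. span_request, handle_one_event, and the handler wiring are placeholder names for whatever the real event-driven interface is.]

    #include <stddef.h>

    void span_request(int coreid);  /* split-phase: returns immediately */
    void handle_one_event(void);    /* runs any pending message handlers */

    static int dispatchers_created;

    /* assumed to be registered with the event framework; called when a
     * "dispatcher created" reply arrives from some core */
    void span_reply_handler(int coreid)
    {
        (void)coreid;
        dispatchers_created++;
    }

    void span_to_cores(const int *cores, int ncores)
    {
        dispatchers_created = 0;

        /* issue all N requests up front, without blocking on any reply */
        for (int i = 0; i < ncores; i++)
            span_request(cores[i]);

        /* then handle events until every dispatcher exists */
        while (dispatchers_created < ncores)
            handle_one_event();
    }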
And here's how the first version of this, which is also unfortunately what's in the
current release of Barrelfish that you can download, scales. And the answer is
very, very badly. What you see here is this is a 16-way machine. So you're
seeing one line on the Y axis for every core in the system, you're seeing the
horizontal axis is time in CPU cycles. And the color codings of each bar
represent what that core is doing at that point in time. And these orange lines
which you can hardly see because there's so many of them are message sends
and receives between this core and mostly between this core and core zero.
And what you're actually seeing in this case is that almost all the time all of these
cores are in the monitor, blocked waiting for the response to a memory
allocation RPC. And that's because there's one memory server running on this
core which is responsible for allocating all the memory inside the machine. So
every time another core needs to allocate memory, it sends an RPC, a blocking
RPC to the memory server on core zero, and so the whole operation is
bottlenecked on core zero, and it takes 50 million cycles.
So how do you improve this kind of thing? Well, the first obvious thing to do is
partition the memory server. Have a local memory server on each core that
manages some region of memory. It probably makes sense for those regions of
memory to be local to the NUMA domain of the core on which it runs. And that's
what we do here. And so you can see that we're now down to 9 million cycles,
which is a pretty good improvement. We're still spending quite a lot -- there's still
quite a lot of communication with core
zero. And we're spending a lot of time just bzeroing all the pages, because
what's happening here is we're mapping in, allocating things like stacks, thread
contexts, stuff like that. All that memory needs to be zeroed on allocation. And
partly this is just a horribly inefficient bzero that works byte by byte. And partly we
should be doing that somewhere else, not on the critical path.
So we fix that. We also changed the implementation
of the monitor program to do more allocations locally, at the local memory server,
rather than going back to the memory server on core zero.
>>: What does it mean by [inaudible] before you start the monitor?
>> Andrew Baumann: Yes. So in particular -- look, in this case --
>>: [inaudible] eventually, right, so you're just zeroing --
>> Andrew Baumann: I'm not saying that this is -- this is, assuming that we could
move that off the critical path and pre-zero everything then this is what would
happen. In practice the way we ran this was to zero memory at boot and not
reuse it. So I'm not -- this is not in Barrelfish yet for that reason.
And we do a lot better -- and actually I can see a couple of bugs here. One of
them is I don't understand why these three cores are creating things and then it's
going up there. But what you can see is that there's still not as much parallelism
as you would like because this core sends a request to this core, then it goes
back here, switches to the user program, runs the user program for a while, the
user program requests the next dispatcher be created. Remember that the user
program here is iteratively doing this. And each one of these involves the
message to its local monitor and a context switch locally. So there's a, you know,
standard optimization there: aggregate all those things up into a single operation
that says create me a dispatcher on this set of cores.
And so if we do that, then we do much better. And that's down to two and a half
million cycles and 76 messages. So there are sort of two things that I think are
interesting here. First of all is the way that you analyze and optimize the
performance of something based upon message passing. You sort of see
dependencies in terms of message arrival and departure.
And what you can see here with these long horizontal lines is that there's a large
queuing delay between a message being sent, sitting in a queue and actually
being processed over here and so there's still, you know, plenty of improvements
that we could make.
The other one is that there is a whole different set of performance optimizations
like in any system moving memset off the critical path is something that, you
know, people have done for ages. Aggregating, combining low level operations
into high level operations like create me a dispatch on every core is a sort of
typical optimization of IPIs. But some of the optimizations are at least easier to
reason about. I think given this explicit view of message passing dependencies
inside the system.
>>: [inaudible].
>> Andrew Baumann: Yeah. To be honest, I don't know. There's plenty of --
there's plenty of bugs in this. So to wrap up. I've argued that modern computers
are inherently distributed systems inside the box, now more than ever. And so
this is a good time to rethink the default structure of an operating system.
I've presented the multikernel, which is our model of the OS built in this way. It
uses explicit communication and replicated state on each core. It tries to be
neutral to the underlying hardware.
I've also shown you some of Barrelfish which is our implementation of the
multikernel. Barrelfish is as much as possible a real system, and we're
definitely continuing to work on it and build it in that way. I think I can argue that
it performs reasonably on current hardware, and more importantly that it shows
trends of being able to adapt and scale better for future hardware. So at least
from our perspective, the approach so far is promising.
And again, another plug for our website. You can find papers, source code, and
other information there. To answer Chris's question about polling: all of the stuff
-- well, no. In the unmap case here, there are situations where you need
to do context switches and so on before you handle a message, and there are
different domains running on the other cores, so not everything there is polling. But
in the other benchmarks I presented, in most cases there were only one or two
things running on every other core. So you can ask how our messaging
latency, given that we're polling to receive messages, compares to something
where you send an IPI and thus interrupt the other core. And this is the most
frequently asked question. That's why I have a slide about it.
First of all, polling for messages is cheap. This might be obvious, but the cache
line that contains the next message in the ring buffer is local to the
receiver's core until the point in time that the message is sent by the sender and
the sender invalidates the line. So polling a whole set of message channels is
relatively cheap because all the state is local in your cache.
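[Editor's note: a sketch of why polling many channels is cheap: checking a channel is just a read of the next slot's flag word, which stays in the local cache until the sender actually writes it. The urpc_try_recv call is the hypothetical channel primitive from the earlier sketch, and deliver_to_domain is a placeholder for waking or upcalling the receiving domain.]

    #include <stdbool.h>
    #include <stdint.h>

    struct urpc_chan;
    bool urpc_try_recv(struct urpc_chan *c, uint64_t payload[7]);
    void deliver_to_domain(int domain, const uint64_t payload[7]);

    /* swept by the monitor (or, eventually, by the kernel between context
     * switches); a channel with nothing pending costs one local cache read */
    void poll_all_channels(struct urpc_chan **chans, const int *owner, int n)
    {
        uint64_t payload[7];
        for (int i = 0; i < n; i++) {
            while (urpc_try_recv(chans[i], payload))
                deliver_to_domain(owner[i], payload);  /* may unblock it */
        }
    }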
Obviously if the receiver is not running -- and the other thing I should say is that
we aggregate incoming message channels into the monitor domain, and so there is
one central place on the core where, if a user domain doesn't happen to be
running at the moment, it doesn't have to wait for a time slice if it's blocked. A
user domain can block on a channel, and somebody else on that core will
poll its channels for it and unblock it when a message arrives. So it's not that
all user domains have to remain runnable in order to poll the channels, but --
>>: But somebody is in the monitor.
>> Andrew Baumann: In this case the monitor and we intend to push that into
the kernel so that we can do it between every context switch.
So there's a tradeoff between time slicing overhead, time slicing frequency, and
message delay. But there would be some situations where you need to notify
another core. And in order to do that, you need to use an inter-processor
interrupt.
In Barrelfish we took an explicit decision to decouple the notification from the
message transport because it's very expensive. On our hardware on these AMD
systems, it's something like 800 cycles to send the IPI on the sending side and
more than 1,000 cycles to handle the IPI. Because first thing is when the IPI
arrives you go off and execute microcode which can take some time. Then you
get into the kernel, then you need to get into user space. The high level bit is it's
not cheap. You clearly don't want to do this for every message passing
operation.
So there's a tradeoff. We decouple it from fast-path messaging. There are a
number of good reasons to do that. First of all, you often have -- and in
particular, given the split phase interface, somebody on one core can express a
number of operations. Let's say I'm a garbage collector. I want to unmap a
whole region of memory that may be mapped on several other cores. I want to get all
of those done with one IPI at best. I don't want to have that happen on every
message. And that's where the split phasing comes in. You issue a number of
requests and then you block and send the IPI.
The other argument is that there are some operations that don't require the low
latency. Just because somebody's executing an unmap or requesting some
other service on that core, if there's a user program running on that core getting
its work done, why should it be interrupted for this remote operation? So there
are some situations where you don't require low latency and you can also avoid
the IPI.
Then there are the operations that actually need the low latency. And at the
moment the way we handle that with Barrelfish is with the explicit send IPI to
[inaudible], which is not what you want. What you probably want is something
that is not at the lowest level of the messaging transport but at a slightly higher
level, like a timeout that says if I send a number of operations and this domain is
blocked waiting for reply and it doesn't get a reply within some period of time then
send the IPI automatically. Do some sort of an optimization like that. We haven't
done that yet.
The other reason to send an IPI is if a core's gone to sleep. You obviously don't
want -- if a core has nothing to do, you don't want it to sit there burning power
polling channels all the time. So we have a way for a core to go to sleep, set
some global state that says I've gone to sleep. If you send it a message, you
check the global state and you send it the IPI if it's already asleep. Or we can
use MONITOR/MWAIT for that.
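[Editor's note: a sketch of decoupling notification from the message transport as just described: a core publishes a "sleeping" flag before it halts or MWAITs, and a sender only pays for an inter-processor interrupt when that flag is set. send_ipi, the flag array, and urpc_send are assumptions standing in for the real mechanisms.]

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_CORES 64

    struct urpc_chan;
    void urpc_send(struct urpc_chan *c, const uint64_t payload[7]);
    void send_ipi(int core);   /* expensive: on the order of 1000+ cycles */

    /* set by the owning core just before it goes to sleep, cleared on wake */
    static volatile bool core_asleep[MAX_CORES];

    void notify_core(int core, struct urpc_chan *c, const uint64_t payload[7])
    {
        /* fast path: just enqueue the message; a polling receiver sees it */
        urpc_send(c, payload);

        /* slow path: only wake the core if it has actually gone to sleep */
        __sync_synchronize();
        if (core_asleep[core])
            send_ipi(core);
    }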
>>: [inaudible] user model [inaudible].
>> Andrew Baumann: If you're just going into kernel mode? You mean on the
receiver side?
>>: Let's say both sides were already in the kernel mode.
>> Andrew Baumann: Then it's cheaper. I mean, so a big part of this is context
switching. So you're looking at hundreds of cycles if -- because you still have to
-- this is not the latency. This is just the number of cycles that you lose by taking
an IPI. It's still exception handling and it's microcode before that even runs. I
don't -- but I haven't --
>>: [inaudible] one of the questions I -- one of the questions I have is so this
monitor really is a trusted component of the system, right?
>> Andrew Baumann: Why not put it in the kernel?
>>: Yeah, why -- I mean, why not put it in kernel mode, right? You could still do
separate --
>> Andrew Baumann: We --
>>: Processes.
>> Andrew Baumann: So there's a reason that I presented the multikernel as
this abstract thing and Barrelfish as monitor and CPU driver, because we realized
there would be good performance arguments for putting that stuff in kernel mode
and --
>>: You know, and you could --
>> Andrew Baumann: It's -- it's largely a design simplicity argument. It was the
simplest way to build the system. It's nice to have a kernel that is non-preemptable,
single-threaded, with serial trap and exception handling.
>>: Yeah, I agree with that. It certainly makes sense. I guess one of the
questions -- you know, so not necessarily arguing that you should use managed
code and go the whole Singularity route, but you could still, you know, we
have multiple hypothesis in ring zero --
>> Andrew Baumann: I think --
>>: Or you know shall or protected --
>> Andrew Baumann: I think probably the approach -- I think probably the
approach we will take is putting the most critical parts of the monitor's
functionality, and polling message channels is an obvious example, putting those
up into kernel mode.
I think just from an implementation perspective, especially if we're dealing with
different distributed algorithms and message handling code and stuff, it's easier if
you can write that in user mode, particularly because you can start to use
high-level languages and stuff.
>>: So how do hypervisors fit in there as the ultimate physical resource manager
and isolation platform.
>> Andrew Baumann: They're not really in the way that we're thinking about the
OS, not because we think that they're a bad idea, but because we think that
these issues about how you scale across large numbers of diverse cores are
orthogonal to whether what you're implementing looks like a hypervisor or an
operating system. So I think that most of this model would apply equally well to
the implementation of a hypervisor, which has [inaudible] multiple cores and
implement communication mechanisms between diverse and different cores and
sort of some level of system services on those cores as they would to the
implementation of an OS.
>>: Because one concern is that, you know, any one of your communication
resource management paths can make the entire machine vulnerable, and if I
wanted to, you know, partition the --
>> Andrew Baumann: I'm sorry, vulnerable to what?
>>: Bugs and bad implementation. For example, VMkit, which wants to bring
in other operating systems, acting like the hypervisor monitor: the bugs there
and, you know, some legacy application, could then bring down your machine.
>> Andrew Baumann: Yeah. I guess -- I guess to some extent I mean we have
not really been thinking about the sort of security and reliability properties of this.
I think that if you structure the system with explicit messaging and then you
define the messaging protocol much like channel contracts, you have a handle
on isolating and containing things. But we haven't tried to do that in Barrelfish.
In particular if you were trying to do isolation, really strict isolation, you wouldn't
want to build a message transport on top of a shared-memory thing that both
sides could fiddle with.
>>: I'm thinking, what if the hypervisor -- I don't want it to lie to you, I want it to just
enforce the limits. So my view of hypervisors in a world of plenty of cores is
not like multiplexing processors and lying about availability of resources; it actually
tells you what resources are there and provides, like these high-end data
center machines where you can actually program hardware registers to separate
views of memory and CPU, a really hardware-partitioned machine.
>> Andrew Baumann: Agreed.
>>: Yeah.
>> Andrew Baumann: The bottom level thing should be providing you some
guaranteed and isolated set of resources rather than [inaudible]. And that the
sort of -- that's the way we're trying to structure the user environment on
Barrelfish as well. Not that we're trying to implement a hypervisor, but the
reason for the capability model that we have and the reason for things like the
system knowledge base is so that you can say I want to have this chunk of
memory and I know what cores I'm running on, and then the user level
optimization process has more knowledge to work with.
>>: And then -- no lies from the lower level software.
>> Andrew Baumann: Yeah.
>>: And then you develop according to a dynamic changing resource
environment.
>> Andrew Baumann: It's kind of the -- to some extent it's the [inaudible] with the
abstractions.
>>: Can you go back to the earlier -- the numbers -- the polling numbers and the
argument that most of the [inaudible] of monitoring to [inaudible] those numbers
to be small. But in the end, you know, most of that message passing
is happening because of the activity in user space anyway. Is that a -- the
application is running up there, and something that it would be doing would
cause it to go into monitor mode and send the message?
>> Andrew Baumann: Yeah. Not necessarily, because there are things -- as
soon as you're running -- as soon as you're not running one application across
the whole machine, you have different sets of applications. There are system
services that may run on different cores and may not run on the same core all the
time, with which you need to communicate. So an application uses the file system;
the file system is trying to provide some [inaudible] like consistency guarantees: when
I execute this write, nobody else has executed this write; some other application
on some other core had that file open at some point in the past, and so now we
need to go and communicate with some replica of the file system running on that
core.
There are things that -- there would be cases where you need to communicate
with things that aren't just the application on the other cores if that makes sense.
>>: So some -- so the application [inaudible] change if the application is more
[inaudible] can use multiple cores, you can still run that on these multicores, but
then that can mean that the application has to have its own shared memory --
sorry, message passing communication mechanism in there, or else the
application itself is inherently using shared memory.
>> Andrew Baumann: If the application is inherently a shared-memory
application the API to the application on Barrelfish looks very similar to the API to
the application on something like Linux or Windows. We have a thread model at
the moment that's implemented in user space, but it looks like a pthreads kind of
API. We have shared-memory mapping. You can map memory and share an
address space across multiple cores. The application doesn't really see this stuff.
Some of the application services are implemented in libraries in the application's
address space instead of up in the kernel, but, you know, the API is similar. I
think what we're targeting is more future application programming environments:
applications implemented at some higher level of abstraction than pthreads.
We can adapt the runtime environment to understand the underlying hardware
and to communicate perhaps with messages for scalability or at least use the
split phase primitives that we offer in the operating system.
>>: So some of this could be abstracted to the [inaudible] application?
>> Andrew Baumann: Yes.
>>: So just timelinewise how long -- how long of [inaudible] would you see
before the learnings from this would be ready to integrate into --
>> Andrew Baumann: I don't know.
>>: [inaudible] and go.
>>: I mean, you're building a separate OS, and that's good as an [inaudible], but at
some point we would like to see this, and some form of this scalability, come
back into Windows. Is that something that you think is three years away, 10
years away?
>> Andrew Baumann: It's hard for me to say. Three years away does sound a
bit tough.
>>: Okay.
>> Andrew Baumann: At least initially our focus has just been on this is where
we think hardware is going, how would you want to build an operating system
starting completely from scratch, independently from everything else, how would
you build an OS from that and see where that takes us, and then the hope is that,
you know, if this works there may be pieces of this that you can somehow isolate
and put into an existing system.
You know, one example of that is just things like the machine-specific multicast
unmap. Maybe that in its current form doesn't make sense. But you can imagine
taking these bits and pieces and putting them into an existing system. What
we're more interested in is how does the underlying fundamental structure of the
OS play into how you can build things on a multicore.
So that's a bit of a non answer, but it's also not our current focus. At least on our
side. You might have to ask the Cambridge folks about what they're thinking in
that direction.
>>: Do you [inaudible] come to be used to optimize the queue size for message
passing?
>> Andrew Baumann: No, but that's a really good idea.
>>: So that would -- I think that would be the biggest -- I don't [inaudible]. Do
you think it's possible? Do you think --
>> Andrew Baumann: Specific support for message passing?
>>: Well, you're doing it right now.
>> Andrew Baumann: That's a -- yeah, I mean that's a fundamental assumption.
>>: What I meant was do you think it's possible to automatically rather than
rewriting by hand for every -- so AMD switches their CPU topology, you run
your constraints however you figure out what the topology is, you at the end
generate code for the message passing under the covers and so you don't have
to support --
>> Andrew Baumann: Hang on. Let me say this. There's two different parts
here. There's the transport implementation, which is written by hand and
[inaudible] specific, and that's given one message channel between one core and
another core and, you know, either hypertransport or QuickPath or whatever
underneath, microoptimized to get a message there efficiently. That we
expect to write by hand and not generate.
Then there's the next level up, which is how do I do something like single phase
commit over a set of message channels between a set of different cores
multicast. And that we would probably like to be able to generate. Right now the
way we do it is that we have this C implementation that takes -- is, you know, is
given a -- we call it a routing table, but it's a little bit more than a routing table; it's
a message send-order sort of table that says what order do I send to what
other cores. And that thing is generated from this knowledge base.
There's, you know, you could compile that code dynamically --
>>: I'm wondering --
>> Andrew Baumann: Oh, you're interested in the [inaudible].
>>: Yes, I'm wondering if that might be possible. If it was possible, that would be
a huge piece, making it easier for a [inaudible] to take something like this on.
>> Andrew Baumann: So you can see this thing -- you can see this message
transport implementation, you can see it as an interconnect driver.
>>: Yes.
>> Andrew Baumann: The interconnect -- there are not that many interconnects.
I don't think they change that fast. The problem -- the problem -- the reason the
diversity is a problem is because the combination of interconnect cross product
topology, cross product different machines, that is changing all the time and it's
different on every machine on which you boot. But if you -- we can distill it down
to the interconnect. We have QuickPath, PCI Express, you know. Maybe there's
that order of --
>>: [inaudible] already mapped these out.
>>: Yeah, but I think the high-order bit is that product guys get really nervous about
user mode code, and your user code [inaudible].
>>: [inaudible]. [laughter].
>> Galen Hunt: Why don't we -- shall we terminate? Anyway, shall we thank the
speaker. We have hit an hour.
[applause].
>> Andrew Baumann: Thank you.