>> Galen Hunt: Good morning. It's my pleasure to introduce Andrew Baumann, who is
interviewing with our group, with the operating systems group for a position, a research
position. Andrew did his PhD at the University of New South Wales with Gernot Heiser
and for his PhD he did some work on live upgrading of operating system components,
right?
>> Andrew Baumann: Yeah. That was a crazy idea.
>> Galen Hunt: That was a crazy idea, which is why one should do it for one's PhD and
not afterwards. [laughter]. And then for the last two years he's been at ETH Zurich with
Mothy, Timothy Roscoe doing the Barrelfish system, which is what he's actually here to
talk about. And I will say this about Andrew, Paul Barham I have a -- we joke that the
qualifications for an OS researcher is that they should know what a TLB is and how to
use it. And Andrew has met many TLBs and loves them all. [laughter].
>> Andrew Baumann: I wouldn't say loves them all.
>>: [inaudible].
>> Andrew Baumann: Yeah. So thanks all, for coming. If you have questions as I go,
feel free to ask. I'll try to manage the time. A couple of brief little words of introduction
just following on from what Galen said before I start talking about Barrelfish. I'm sort of
interested in lots of core OS research issues and, largely because of what
we've been doing in Barrelfish, increasingly in distributed network systems.
As well as the dynamic update stuff, I've also worked on a single address space operating
system that we were building called Mungi, and done some work on tracing and performance
monitoring in the K42 multiprocessor OS at IBM Research, which was also related to the
dynamic update work. And since moving to ETH, I've worked on a couple of things,
including a study of OS timer usage and a runtime for self-hosting and self-organizing
distributed systems. But mainly I've been working on Barrelfish, which is an OS for
heterogeneous multicore, so that's what I'm going to be talking about today.
For those of you that saw the talk I gave here after SOSP, some of this is probably going
to be fairly familiar, for which I apologize. But feel free to ask questions and, you know,
ask about things you think I'm not covering.
The goals of Barrelfish are to figure out how an operating system for future multicore
systems should be structured. And so that includes running a dynamic set of
general-purpose applications. We think the real challenges of large-scale and
heterogeneous multicore occur when you have these things in your laptops and your
desktops, on the computers that you're using every day. And in particular, because of
this, to reduce the code complexity that is required for building a system like this. And so
the challenges are obviously scaling to a large number of cores. The number of cores in
the system is increasing at rates approaching Moore's law. How do you deal with that?
But moreover, tackling and managing the heterogeneity and hardware diversity that's
starting to arrive in this space. And I'm going to talk more about why these things are
heterogeneous and why that's a problem.
So let me dive in with some examples of current multicore chips on the market. On the left
you have the Sun Niagara. And it has a banked L2 cache that's connected to all the
cores on the chip by this crossbar switch. And what that means is that any region of the
L2 is equidistant in terms of access latency for all of the cores on the chip. And so
shared memory algorithms that rely on relatively fine grained sharing and rely on directly
manipulating state that's shared by a large number of threads on a large number of cores
or hardware threads work quite well on this chip as long as they stay within the L2 and
they stay within a single chip.
The same is not true of these other processors over here. On the top is the current AMD
Opteron where you have a large shared slow L3 cache that's almost double the access
latency of the L2. And so cache lines that are in the L3 will be much slower to access.
And then here you have the Beckton, where you have this [inaudible] -- it's thought to have a ring
network inside the chip -- to access regions of the cache that are not local to the core that's
accessing them.
The important point here is that the performance tradeoffs of these three chips are all
quite different. The way you would go about optimizing scalable multithreaded
multiprocessor shared data structures for these three things are very different. And two
of them are even x86 and run the same operating system. So there's a sort of
challenge here for system software to be able to cope with this diversity.
Another place in which you see diversity is the interconnect architecture that connects the
sockets inside the system to make up a multicore system. And this is an example here of
an actually relatively old now, two-year-old Opteron system that we have in our lab that
has eight sockets; each socket has four cores. And you can see that these blue links
between the sockets, which are HyperTransport links, make for a fairly unusual, you might
just say insane and ridiculous, interconnect layout.
And the topology of that network plays into the performance of the machine. As you run
shared memory scalability benchmarks and you go beyond about 20 cores, you
start to see all sorts of interesting effects because of this topology. And you also see effects
like contention occurring on these two links here. And depending upon the pattern of
memory access that you do inside the machine and where you're running and whom you're
sharing with, the layout of the interconnect actually matters. And the hard point -- the
important point is that that's just that box, and there are many other systems on the
market with different layouts, different topologies.
And right now no system software is really trying to take account of this. And it's not
even clear how it can.
>>: I like the floppy disk drive.
>> Andrew Baumann: Yeah, well, this is -- [laughter] this machine still does have a
floppy disk drive and a floppy disk controller. It's completely insane. [laughter].
>>: [inaudible] why you're overloading your links.
>> Andrew Baumann: That's not why the links are being overloaded, no, that's true.
>>: [inaudible].
>> Andrew Baumann: Yeah. It's because this is one board and this is a [inaudible] and
there are two risers here that sandwich the [inaudible] on top of the other board.
So, yeah, reasonable sort of engineering decisions, but the end result is bizarre. And
now that's that system. But the challenge is that as sizes shrink and as more of this stuff
moves on to a chip, you start to see the same kind of effects within a chip -- I'm sorry, I
should say quickly, again as an example of different interconnects, this is the eight
socket Nehalem interconnect topology, which is arguably much more sane.
As die sizes shrink and all of this stuff starts to move on to a chip, you see the same kind
of networking effect occurring between the cores and a chip. And again, there's no clear
model for what the interconnect here should look like. On the top I'm showing you the
ring network on the Larrabee, and on the bottom is the mesh network on the Tilera TILE64
processor.
And again, depending upon what this looks like, it has a big impact on how you scale
software on these things and how things perform.
So finally we're starting to see diversity between the cores that you have in a system.
Today that's relatively specialized. You can buy a machine today that might have a
programmable network interface with a general purpose processor running close to the
network where you can offload application processing. Or you might have GPUs inside
the system and there are all sorts of applications that can make use of GPUs for
application processing. And there are even FPGAs that you can plug into
CPU sockets, such as in that Opteron system I showed you before -- you can buy FPGAs
that sit on the system interconnect. And that's today.
In the near future it seems quite likely that we'll start to see this kind of diversity between
cores on a single die, either because of performance asymmetry. There's a lot of
arguments that say that for reasons of power efficiency you will have a small number of
large fast power hungry out of order cores and a much larger number of slower simpler in
order cores or you might have some kind of special purpose logic because special
purpose logic is more power efficient, it can be switched on and off depending on what
you need.
There are a lot of different visions of the future that all involve special purpose logic, special
purpose core types. Also you might simply find that some cores lack various features
of the architecture, like streaming instructions and virtualization. And so there's a
challenge here for low-level system software that can do sensible resource allocation and
management of this divergent set of hardware with complex communication paths between
it.
And this is the kind of problem that we're trying to tackle with Barrelfish. So to summarize,
all of this is increasing core counts and increasing diversity. And unlike many previous
systems that have scaled to large numbers of cores, typically in the supercomputing space
and the high performance space, once this stuff enters general purpose computing you
can't make nearly as many optimization decisions and you can't bake into your system at
design time the way that you're going to use this hardware and the way that you're going
to scale to this hardware, because it's all going to be different.
And so there's a challenge here for system software to be able to adapt to the hardware
in which it finds itself running, whether it's the number of cores, the topology, the interconnect,
or the available hardware resources.
And finally, to show you two sort of current examples of where research
processors are going in this space, I'm going to show you the Rock Creek and the
Beehive. This is Intel's single-chip cloud computer. And if you get past the buzzword,
what this thing is is a research chip that Intel announced I think in December. So it's quite
recent. And they've actually built this. So it's on silicon. You have 24 tiles, two
essentially pentium cores per tile. What's interesting about this chip from a research
perspective is first of all, simply the number of cores. Second of all, the interconnect,
which is quite configureable and high performance relative to the performance of the
cores. But most importantly the fact that none of these caches are coherent with
anything else on this chip. Instead there's some hardware widgets to make it easier to do
core to core message-passing, and software is expected to deal with this without having
cache coherent memory. And the reason for building the chip in this way is because
there's a widespread belief that cache coherence won't scale to the large number of cores
that you'll see on these future multicore chips. And so this is an experiment to say, can
software make good use of a chip that's built in this way?
Another one which I hope you've heard of is Chuck Thacker's Beehive processor that's
being developed by the group at Silicon Valley. And this thing is another research chip,
and they have a bunch of ideas about doing combined hardware-software research in the
FPGAs and so on.
But this is also interesting from a Barrelfish perspective. Again, because it's very
different. In this case what you have is a variable number of RISC-ish cores on the ring
and you have explicit message passing operations, more so than on the Rock Creek, to
pass messages between software on the cores around the ring. It also is not cache
coherent and also has a very interesting memory access model, which I won't go into in this
talk. It's another example of a current research architecture that is interesting to think
about from a system software perspective of how you build something like an operating
system for this kind of processor.
And so because of all these trends in architecture and in computer design, we think that
now is an excellent time to rethink the default structure of an operating system. Today's
operating systems are by and large a shared memory multithreaded kernel that executes
on every core in the machine or every hardware thread in the machine and synchronizes
and communicates using state in shared memory protected by synchronization
mechanisms like locks or other shared memory primitives. And anything that doesn't fit
into this model, either because it's not on coherent memory or because it's a
heterogeneous core type or because it can't be -- it can't be expressed in this model, is
abstracted away behind a relatively low-level device interface. And so we propose
structuring the OS as a distributed system of explicitly communicating independent cores.
And we call the model for an OS structured in this way the multikernel. And our design
principles for the multikernel are that we make intercore communication explicit, we make the
structure of the operating system neutral to that of the underlying hardware, and we view
all state as replicated inside the machine. And so in the rest of this talk I'm going to go
into a lot more detail about those three design principles and what the concrete
implications of them are.
I'm going to show you Barrelfish, which is our implementation of the multikernel. I'm going to
show you some parts of an evaluation. And I'm going to wrap up by talking about where
we're taking this and some interesting ideas for future work.
So the first design principle was making all the communication between cores explicit.
And what I mean by that is that all intercore communication uses explicit message
passing. There's no assumption of shared state in the model. That's a fairly sort of bold
statement I think and certainly very different from the way people structure operating
systems and system software today. If we can make it work, it has some big advantages.
First, most importantly it lets you have a much looser coupling between the structure of
your system and the concrete details of the underlying hardware in terms of how the
hardware implements communication.
So you explicitly express communication patterns between cores in your system rather than
implementing a scalable shared memory primitive, where you have to think about: to
implement this system primitive, what data structures do I need to mutate, how do I
protect the correctness of those operations with respect to all the other processes in the
system; and then, in order to make that scale well, you think about, well, how do I ensure
that my operations are consistent, how do I ensure that I'm mostly making local memory
accesses, how do I optimize things for placement in cache lines, where am I likely to take
a cache fault? Instead you think about it at the level of: based on my local, purely local
state, and what I previously have heard from the other processors in the system, with
whom do I need to communicate to perform some operation; then you do some explicit
message based communication with those processes, and then you can map that on to
an underlying hardware primitive, be that either a shared memory primitive or an explicit
non-coherent message based primitive.
Message based communication actually supports heterogeneous cores or cores which
may be behind non-coherent interconnects. Again, think about the case of the offload
core on the other side of the PCI express interface. You might like to run some part of
your OS over there and communicate with it, even if PCI express doesn't offer you
coherent sharing.
It's a better match for future hardware like the processors I showed you before,
because that hardware might have explicit message passing -- as well as the Beehive
and the SCC, the Tilera processors have explicit hardware support for message passing --
or because that hardware may not have cache coherence, or cache coherence may be
very painfully slow.
Message based communication allows the important optimization of split-phase
operations, by which I mean that you can decouple the operation that requests some
remote service, that invokes some remote communication, from the handling of the response,
and you can do something useful in the meantime. That's much harder to do on top of a
coherent shared memory primitive. And I'll show you an example of that in a minute.
And finally we can reason about it. There's a wealth of formal frameworks about the
correctness and the behavior of systems that use explicit message based
communication.
So the standard sort of OS response to this is that's nice but the performance is going to
be terrible. The machines that we have today are designed to provide a shared memory
model. And they're optimized for that model. And the performance of any message
based primitive on top of this is going to be painfully slow.
And there's some truth in that. What I'm going to show you now, however, is a simple
microbenchmark that is simply exploring the cost of that tradeoff. In particular it's
exploring how much the shared model really costs you if you want to push it and you
want to use it for sharing.
In this experiment, I'm going to show you two cases. I'm going to show you the shared
memory case and I'm going to show you the message case. In the shared memory case,
we have a shared array that is in coherent shared memory and we have a variable
number of client cores that are wanting to mutate the state in this array. So they're
performing read-write updates on that shared array.
The performance of this operation, in terms of the time it takes to perform each update or
the throughput in updates that you can achieve, is going to depend upon two things. It's
going to depend upon the size of the data in the array and the size of the data that's
mutated by each update in terms of the number of cache lines that I need to touch to
perform an update. And it's going to depend upon the contention in terms of the number
of other cores also trying to mutate the array. And that's because every time one of
these processors tries to change something in the array it executes a read instruction or
a write instruction on the shared array. And what happens is the processor pipeline is
stalled -- barring the ability of the processor to get any more parallelism out of the pipeline, it
stalls while the hardware cache coherence protocol migrates the modified cache lines
around between the processors that are all attempting to modify the cache lines.
And that migration either the fetch or the invalidate that the cache coherence protocol has
to do is limited by the latency of the round trips across the interconnect. And so here's
how it actually performs. This is on a 16-core AMD system. This is what happens when
every core in the system for variable number of cores is trying to mutate one cache line in
the shared array. And this is what happens when they need to manipulate two, four, and
eight cache lines. You can see that this doesn't scale particularly well either in the
number of cores or in the size of the modified state. And in particular what's happening
over here, all of these extra cycles as you increase the contention in terms of the number
of cores, all of these extra cycles are stall cycles. What's happening is the processors
are executing the same number of instructions. The cycles are just going away stalled
waiting for the cache coherence protocols to do the fetches and the invalidates.
Now, this is obviously a worst case for sharing. But it shows you how much you can lose
to a coherent shared memory model when you are explicitly manipulating the same piece
of shared state. So that's not particularly good.
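A minimal sketch of the kind of client loop this shared-memory benchmark runs, assuming an ordinary x86 machine with GCC and pthreads rather than Barrelfish itself; the constants and names are illustrative, and pinning each thread to its own core is omitted for brevity:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CACHELINE 64
    #define NLINES    8          /* cache lines touched per update (1/2/4/8 in the talk) */
    #define NUPDATES  100000

    /* the shared array: each line of interest sits on its own cache line */
    static volatile uint64_t shared_array[NLINES * CACHELINE / sizeof(uint64_t)]
        __attribute__((aligned(CACHELINE)));

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Each contending core runs this read-modify-write loop.  Under contention
       most of the extra cycles are stalls while the coherence protocol migrates
       the modified lines between the cores. */
    static void *client(void *arg) {
        (void)arg;
        uint64_t start = rdtsc();
        for (int i = 0; i < NUPDATES; i++)
            for (int l = 0; l < NLINES; l++)
                shared_array[l * (CACHELINE / sizeof(uint64_t))] += 1;
        printf("%.0f cycles/update\n", (double)(rdtsc() - start) / NUPDATES);
        return NULL;
    }

    int main(int argc, char **argv) {
        int ncores = (argc > 1) ? atoi(argv[1]) : 2;   /* number of contending cores */
        pthread_t t[ncores];
        for (int i = 0; i < ncores; i++) pthread_create(&t[i], NULL, client, NULL);
        for (int i = 0; i < ncores; i++) pthread_join(t[i], NULL);
        return 0;
    }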
If we look at the message passing case, what we do on the same machine, there is no
hardware message primitive. All we have from the hardware in terms of communication
is very slow interprocessor interrupts and coherent shared memory. So the question is
how can you build an efficient message based primitive on top of what the hardware
provides you? What we do is we implement a message channel that uses coherent
shared memory as the underlying transport. So then it comes down to the question of
knowing the details of how the hardware implements cache coherence, how can you most
efficiently get a fixed size message from one core to another? And on current systems
that [inaudible] down to essentially a ring buffer in shared memory where you're very
careful to make sure that every message send involves as few coherence operations as
possible. In our case, this boils down to one fetch and one invalidate for every message
sent.
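A minimal sketch of such a cache-line ring-buffer channel, in the spirit of what's being described but not the actual Barrelfish UMP code; the names (ump_chan, ump_send, ump_recv) and the acknowledgement scheme are illustrative assumptions, and x86 load ordering is assumed:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE  64
    #define RING_SLOTS 64

    /* One message slot per cache line: 56 bytes of payload plus a sequence word
       that is written last to publish the message. */
    struct ump_slot {
        uint8_t payload[CACHELINE - sizeof(uint64_t)];
        volatile uint64_t seq;
    } __attribute__((aligned(CACHELINE)));

    struct ump_chan {
        struct ump_slot *ring;        /* RING_SLOTS slots in coherent shared memory */
        volatile uint64_t *ack;       /* its own cache line, written only by the receiver */
        uint64_t next;                /* private position of this endpoint */
    };

    /* Send: one memcpy into the slot, then a single publishing write, so a message
       costs roughly one invalidate at the sender and one fetch at the receiver. */
    static bool ump_send(struct ump_chan *c, const void *msg, size_t len) {
        if (len > CACHELINE - sizeof(uint64_t) || c->next - *c->ack >= RING_SLOTS)
            return false;                       /* message too big, or ring full */
        struct ump_slot *s = &c->ring[c->next % RING_SLOTS];
        memcpy(s->payload, msg, len);
        __sync_synchronize();                   /* payload visible before publishing */
        s->seq = c->next + 1;                   /* the write that moves the line */
        c->next++;
        return true;
    }

    /* Receive: poll the next slot; the lazy ack write tells the sender the slot is
       free again for a later lap around the ring. */
    static bool ump_recv(struct ump_chan *c, void *msg, size_t len) {
        struct ump_slot *s = &c->ring[c->next % RING_SLOTS];
        if (s->seq != c->next + 1)
            return false;                       /* nothing new yet */
        memcpy(msg, s->payload, len);
        c->next++;
        *c->ack = c->next;
        return true;
    }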
And so in the message based case, we have these message channels that allow efficient
communication. And we encapsulate this array behind a single server core that is
responsible for performing all the updates on the array. And so now when a client wants
to modify the array, rather than directly manipulating the shared state, it describes its
operation as a message. We assume the request that it wants to put forward can always
be described in a single 64-byte cache line. It writes that request into the message
channel. The server core processes the operation on its behalf, manipulates the array
and sends back the reply.
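A minimal sketch of the server side of this experiment, reusing the hypothetical ump_send/ump_recv channel from the sketch above; the message layout and names are illustrative, not taken from Barrelfish:

    /* The server's request/response loop.  A request fits in one cache line; the
       update itself only touches memory that stays in this core's cache. */
    struct op {
        uint64_t off;                 /* offset into the server's array */
        uint8_t  data[48];            /* new contents for that region */
    };

    static void server_loop(struct ump_chan *req[], struct ump_chan *resp[],
                            int nclients, uint8_t *array, size_t array_len) {
        struct op m;
        for (;;) {
            for (int c = 0; c < nclients; c++) {          /* poll each client channel */
                if (!ump_recv(req[c], &m, sizeof(m)))
                    continue;
                if (m.off + sizeof(m.data) <= array_len)  /* apply the update locally */
                    memcpy(array + m.off, m.data, sizeof(m.data));
                while (!ump_send(resp[c], &m, sizeof(m))) /* reply unblocks the client */
                    ;
            }
        }
    }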
And so for the client here, essentially this is a blocking RPC. And while the operation is
being performed the client [inaudible] waiting for a reply. So this is how it performs. This is
what happens when the server modifies one cache line worth of state in response to every
client request. And this is what happens when the server modifies eight lines in response
to every client request. You can see that they're almost identical because all of that
shared state is now local in the client's cache, all of the updates are local to the client.
You can also see that on this machine at this point here for this benchmark we're already
crossing over here at four cores and four cache lines where the overhead of sending and
receiving messages wins back over the overhead of migrating these lines around for the
shared memory case.
But what's more interesting is if you look at the cost of each update at the server, this is
the time that the server experiences to perform each modification; as you'd expect it's flat
and it's low, because it's all local to the server's cache. You can then infer that the reason
this line here is increasing is that there's queuing delay in the message channels. The
clients are offering more load to the server than the server can satisfy. And so the clients
are experiencing a queuing delay while their request sits in the channel while the server
is busy processing all the requests.
>>: [inaudible] speed up in the system, right.
>> Andrew Baumann: As you add cores, yes. It's completely centralized. You can
satisfy all the requests on a single server.
But we're still doing better already, even though it's all centralized.
>>: [inaudible] slows down. You more or less slowed up.
>> Andrew Baumann: Well, everything ends up being serialized at some point. That's
because we're mutating one array. Either the coherence protocol serializes it or the
server serializes it.
So if you -- and if you compare the cost of these two lines something like this here, all of
this time here is where the clients are blocked waiting for a reply. But unlike the shared
memory case, they're actually executing instructions. So you could use those to do some
of this work if you had a nice asynchronous RPC primitive where you send out the request,
you do something else and then you handle the replies.
>>: [inaudible].
>>: [inaudible].
>>: Well, I'm asking.
>> Andrew Baumann: We don't use interprocessor interrupts except for the case when a
whole core wants to go to sleep. Otherwise what we do is we have a bunch of incoming
channels and we [inaudible] serve them from polling, and when there's nothing else -- when there's nothing else to do, then you can go and spend the time polling.
>>: I've had some conversations with the Barrelfish team about ways to reduce the
overhead, which here seems to be on the order of a few hundred cycles each way. That
could be taken down by at least an order of magnitude [inaudible]. That would help.
>> Andrew Baumann: I mean this hardware is not built for this at all.
>>: I'm curious to -- this is a comparison between, you know, a smart message passing
solution and a sort of naive shared memory solution.
>> Andrew Baumann: I wouldn't call this message passing solution particularly smart.
It's: centralize everything on a single core and send the requests to that core.
>>: But I'm more concerned about the shared memory.
>> Andrew Baumann: Sure.
>>: So like [inaudible] are using MCS locks there, people who try to request --
>> Andrew Baumann: There's no locking here at all. I'm assuming the perfect lock --
>>: That's the problem, right?
>> Andrew Baumann: Consists [inaudible].
>>: [inaudible] then you might be queuing up [inaudible].
>> Andrew Baumann: Yes. But the overall throughput could only be worse.
>>: Why?
>> Andrew Baumann: Because the lock is [inaudible] the [inaudible] protocol is
serializing everything. And I'm doing the minimum number of operations in terms of writes to
perform each update. If I added a lock, which I would have to do if I was going to build a
real data structure, the lock would serialize the clients explicitly instead of the coherence
protocol, but I'm adding extra operations to acquire and release the lock. I'm
doing more coherence traffic than I would be in this case.
>>: I was just curious.
>>: [inaudible] cost go down [inaudible].
>> Andrew Baumann: There's a little bit of an effect here where the server sometimes
[inaudible] goes to sleep whereas here it's completely busy.
>>: So [inaudible] but on shared memory my understanding is that if only one guy writes and
many others are just trying to read, you don't have to wait --
>> Andrew Baumann: The readers have to wait. If there's one writer and many readers,
the readers have --
>>: You don't have to -- unless it's the writer, then you have to -- the cost of the -- like
the unit cost on a read is much simpler -- much lower than if you have to send a message to the
server and say, tell me what's in that memory. Is that the case?
>> Andrew Baumann: So in general, I mean, there are different ways of implementing
cache coherence. But in general what will happen is, when the writer writes, it's local and
modified in that cache, and any reader will have to fetch the line from the guy who wrote it
last; typically that's what happens unless it's being flushed out to memory.
So in a typical protocol, where you have one writer and many readers, the readers
experience the delay of the coherence protocol.
>>: I know --
>> Andrew Baumann: Oh, no, sorry, I lie. The writer, depending on what states the
[inaudible] protocol offers, will have to then do an invalidate for every write,
because the readers pulled the line down to the shared state.
>>: I'm trying to remember [laughter].
>>: I'm curious though about the argument of doing more coherence [inaudible] I'm not
sure that the hardware's actually doing a very good thing if everybody tries to -- I'm not
sure that the serialization can come back is actually -- it could be worse than what we get
if we [inaudible].
>> Andrew Baumann: The one thing the lock would let you do is it would let you do a
kind of split-phase thing, I think. You could -- enqueue -- you could have some kind of
backoff lock where you enqueue yourself in a queue and somebody will do an explicit
notification. MCS -- MCS by default doesn't do that, the reader enqueues himself and
then spins on a variable in the queue node.
>>: [inaudible] its own location that nobody else touches until it's available.
>> Andrew Baumann: Right. So he can -- right. He can then go on to some other
operations, that's true.
>>: There's an interesting discussion. I think it's rather moot because the [inaudible].
We will have to [inaudible].
>> Andrew Baumann: I think in general these things -- I mean, message passing and
shared memory are duals, right? And for every optimization and every way you think
about restructuring the system there's usually a dual. Sometimes it's obvious; sometimes
it's not. And a lot of what it comes down to is just a different way of thinking about the
problem more than necessarily better or worse.
The one motivating reason for the message passing case is that if you have hardware
that doesn't do coherent shared memory, it's much easier to map a message passing
primitive on to that than it is -- if you have a coherent shared memory
abstraction -- to map that on to a message passing primitive; cf. distributed shared
virtual memory. And that's the argument for structuring the system around that.
>>: Phil and I will agree with you.
>>: Well, the other thing is if you're building pipelines, if you're building pipelines out of
heterogeneous processors or homogeneous ones even, shared memory is not the best way to
coordinate that.
>> Andrew Baumann: So that was hopefully the most contentious design principle.
The second one is to make the structure of the operating system neutral to the underlying
hardware. And so we mean in practice that the only hardware specific parts of a
multikernel are the message transports, which, like the one I've just showed you, have to
be specialized to the details of the underlying hardware, and the traditional parts of the
operating system that are hardware specific, like device drivers and what we call the CPU
driver, which is the part of the operating system that deals with the MMU and all the
privileged state in the CPU.
In particular, we don't have to implement efficient scalable shared memory primitives that
are different depending on things like cache size and topology and available
synchronization primitives. And so that allows us to adapt to changing performance
characteristics of different machines. You can late bind the concrete implementation of a
message transport and even some of the protocol layers that you layer on top of that for
primitives like how do I do an efficient broadcast or how do I do efficient group
communication, based on knowledge of the underlying hardware.
The third design principle is viewing state as replicated. And this kind of falls out
naturally from the message passing model. But it means that any potentially shared state
that is required by multiple cores in the system is accessed always as if it were a
potentially inconsistent local replica. That includes things like scheduler run queues,
process control blocks, file system metadata, system capability metadata -- all of the state
that the operating system would need to keep consistent across multiple cores, because
you can't keep it in shared memory.
Again that naturally supports domains that don't share memory, naturally supports things
where you don't have hardware coherence and it arguably more easily supports changes
to the set of running cores. If you're bringing cores up and down for power management
reasons, or if you're hot plugging devices that have cores on them, and you view this as a
replica maintenance problem instead of "if I turn off the power while this lock is held, how
do I recover this particular state", arguably it gives you a cleaner way to think about that
problem.
So as I've presented it, you can see this sort of logical spectrum of approaches to making
operating systems scale -- making system software in general scale. Typically you start from
this point on the left where everything is shared with this one big lock, you progressively
use finer grained locking. Maybe you notice that locality is a problem, so you either partition
your data or you introduce ways of doing replica maintenance -- clustered objects in K42
are one way of doing this on top of the shared memory abstraction -- all the way to the multikernel,
where we've dropped -- jumped to this extreme end point of purely distributed state and
nothing but replica maintenance protocols.
And a lot of current sort of trends in system software, particularly operating systems, are
gradually moving in this direction to scale up. You can sort of logically see this as having
jumped to the end point of that scale. In reality, that's not a very good place to be,
because there are going to be
some situations on hardware where it's much cheaper to share memory than it is to send
messages.
Cases in point being things like hardware threads or tightly coupled cores that share a
cache or maybe even depending upon the [inaudible] that you're maintaining, maybe
even a small coherence domain.
So in reality, that's the model. But behind the model we can locally optimize back to using
shared memory between local core pairs where necessary. And so we see sharing as a
local optimization of replication -- of messages -- rather than messaging and replica maintenance
being a scalability optimization of sharing.
So you might have a shared replica for threads or a closely coupled set of cores where,
when you invoke some operation, rather than actually sending a message to that core, all
you'll do is take out a lock and manipulate the shared copy of the state that's shared by you
and the other core. The important point is that that's purely local. And you can even
make that optimization at runtime on the basis of performance information or
topology information that you find out about in the machine. And the basic model that's
visible from the API side remains this kind of split-phase replica maintenance style. It just
sort of happens that sometimes the callback will come back immediately because it was
local.
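A minimal sketch of what that local optimization might look like behind a single interface, with illustrative names (replica_update, shares_cache), a toy spinlock, and the hypothetical ump_send channel from the earlier sketch; this is not the Barrelfish implementation:

    /* An update request as in the earlier server sketch. */
    struct update {
        uint64_t off;
        uint8_t  data[48];
    };

    struct replica {
        volatile int lock;                         /* toy spinlock */
        uint8_t state[4096];                       /* the locally shared copy of the state */
    };

    struct peer {
        int shares_cache;                          /* decided at boot from topology information */
        struct replica *shared;                    /* valid only when shares_cache is set */
        struct ump_chan *chan;                     /* otherwise: message channel to the peer's core */
    };

    static void spin_lock(volatile int *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
    static void spin_unlock(volatile int *l) { __sync_lock_release(l); }

    /* The caller always sees the same split-phase interface: whether the update
       goes through the shared replica or through a message is a purely local decision. */
    static void replica_update(struct peer *p, const struct update *m,
                               void (*done)(const struct update *)) {
        if (p->shares_cache) {
            spin_lock(&p->shared->lock);
            if (m->off + sizeof(m->data) <= sizeof(p->shared->state))
                memcpy(p->shared->state + m->off, m->data, sizeof(m->data));
            spin_unlock(&p->shared->lock);
            done(m);                               /* the "reply" comes back immediately */
        } else {
            while (!ump_send(p->chan, m, sizeof(*m)))
                ;                                  /* remote replica updated by messaging; */
        }                                          /* done() fires later when the ack is polled */
    }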
So if you put all that together, this is the logical view of an operating system constructed
as a multikernel. Inside this dotted line, where you would typically have your shared
memory kernel running across all the cores, you instead have a separate instance of
what we call an OS node running on every core, maintaining some possibly partial
replica of logically global state inside the system. And these nodes exchange messages
to maintain the consistency of those replicas.
Note two things. Note that you can customize the implementation of an OS node to the
hardware on which it runs. There's no particular reason that they need to be the same
architecture as long as they can interpret the same messages. And note that where the
hardware supports it, you know, applications can happily continue to use coherent shared
memory across some subset of the cores inside the machine. But the operating
system won't -- at least the lowest level of the operating system shouldn't be relying upon
this in order to function.
So that's the model. Barrelfish is our concrete implementation of the multikernel. It's
written from scratch. We've used a few bits and pieces of library code but otherwise it's
all written from scratch. It's open source.
It currently supports 32 and 64 bit Intel processors, but we're in the process of porting
it to a couple of other architectures. Beehive is currently running in user mode on a single
core, and they're starting to work on the messaging. With Rock Creek we've been more
involved at ETH; we had one student who got to spend one week being babysat
in an Intel lab with a Rock Creek board. And so we can actually boot on up to 12 cores,
one string of cores, before some strange race condition in the messaging kills us. And
we're hoping to go back next month and figure that out and get the rest of the system up.
>>: So why did you do it from scratch? Is there something -- why didn't you instead start
with an existing one and say, okay, here are the top hundred shared data structures
where it would be useful to [inaudible].
>> Andrew Baumann: Because at the end of the day you still couldn't boot it on something that
didn't have -- that didn't support coherence.
>>: Could you do static analysis and turn all the references to global data -- to shared
memory and to accrued -- accrued -- and then just optimize ones that were --
>> Andrew Baumann: I guess you could do that. But I think that --
>>: I'm asking suppose you worked with a company with --
>> Andrew Baumann: [inaudible] existing operating system how would you go about
applying this? I think -- I think the approach that we're taking is very much a sort of, you
know, from-scratch construction. If you want to build a system like this, first of all, does it make
any sense? And even if I were in a company like this that had a large monolithic system,
I wouldn't start with an idea like this by trying to modify that to be like this. I'd first want to
know if this just makes sense, just as a starting point on its own. And you can see this
approach as being like that.
If this works, then it would make sense to try to apply it. But first let's just see if it works
at a small scale. One other interesting note about portability here is that unlike porting a
traditional OS, where you have the architecture as a compile time constant and
you know which architecture you're compiling for, porting something like Barrelfish is
very different, because you have potentially different parts of your source tree for different
parts of your system being compiled for all these different architectures at the same time
and then maybe even linked together into one image. That's not hugely interesting as a
research challenge. But it is kind of interesting as a software engineering challenge
where we've had to rethink a lot of how you actually structure system software. There's a
lot more dynamic nature to this than your traditional operating system.
Who's involved? I keep saying we. We is the systems group at ETH Zurich and also
Microsoft Research Cambridge. On the ETH side I should acknowledge a large bunch of folks who
have been involved in this. On the ETH side, Pierre Dagand was a former intern.
Simon Peter, Adrian Schupbach and Akhilesh Singhania are three PhD students working
on the Barrelfish project. Mothy Roscoe, as Galen mentioned, is my boss. And then we've
had the support of, and we've worked very closely with, a large group of folks at
Cambridge as well: Paul Barham, Richard Black, Tim Harris, Orion Hodson, who used to
be here, Rebecca Isaacs, Ross McIlroy, who also used to be here. And we've actually
been surprisingly successful at working closely between the two different groups. So all of
those folks have contributed to everything in this talk.
When you put it all together, this is what Barrelfish looks like. And if you compare it sort
of from a high level to the previous picture of the multikernel architecture, you'll see that
where we had the OS node, in Barrelfish we factored that into a kernel mode thing
called the CPU driver and a privileged user mode thing called the monitor. And that's
largely because it made it easier to construct the system.
This thing is essentially a very simple single-core microkernel or an exokernel kind of
thing. It serially handles traps and exceptions, reflects them to user code, implements
local protection. Does not communicate with any other core in the system at all. And is
mostly hardware specific.
The monitor is mostly hardware independent. And it actually implements -- it
communicates with all the other monitors on the other cores, and it is responsible for
implementing many of these replica maintenance algorithms and for mediating
operations on potentially global state. And so this split, between potentially long
running things that communicate and short running things that don't and are purely local, made
it easier to factor the system.
Our current message transport between x86 cores is something called UMP, which is our
implementation of a previous research idea called URPC, which is essentially this shared
memory ring buffer that allows you to move cache line size messages very efficiently
between cores.
That's completely different on different platforms. So obviously the messaging transport
on Beehive uses the hardware primitive for sending messages and the messaging
transport on Rock Creek is also using the hardware features to accelerate message
transport. And that can even be different between different pairs of cores inside the
same system.
Much like a microkernel or an exokernel, most other system facilities are implemented
at user level, which then themselves may need to be replicated across multiple cores for
scalability reasons.
When you build an operating system from scratch, you get to make a whole lot of design
decisions and there's a whole bunch in Barrelfish that have nothing directly to do with our
research agenda. But there were, you know, fun decisions to make and things that we
had to implement. So this is a slide briefly listing many of those ideas. Feel free to ask
me questions. I won't go through all of them. Some of the important ones are probably
minimizing shared state, like a lot of the multiprocessor research OSs have
done in the past.
We use capabilities for all resource management, including management of physical
memory inside the machine, using the capability model from a system called seL4. We
have upcall-based processor dispatch, much like scheduler activations. We run drivers
at user level. We're actually building -- this slide is a little bit out of date; it talks about
specifying device registers, but we actually have a whole series of little domain specific
languages to make it easier to construct different parts of the system. So one of these
languages generates the code for accessing device registers, another one generates our
build system, another one generates our messaging transport stubs for different
underlying hardware interconnects, another one generates part of the logic that implements
the capability system in the kernel. It's actually a very interesting way about -- of building
an operating system.
Some of the applications running on Barrelfish -- because in order to build this as a real
system and try to keep ourselves real, we have to have applications -- some of the
applications that are running today are a slide show viewer, which you're actually looking
at. So why -- the reason I keep sort of glancing around at that screen is because I can't
see anything here, because you would be surprised how heinously complicated it is to
enable two display pipes at once. [laughter].
Our web server runs on Barrelfish.
>>: [inaudible].
>> Andrew Baumann: Thanks for that. I'll send you the code and you can tell me how to
implement that. [laughter].
>>: [inaudible] we believe anything you said up to this point some [laughter].
>> Andrew Baumann: The problem is that the T is in hardware and it's behind the GPU
programming API.
>>: [inaudible].
>> Andrew Baumann: Yeah. Two video cards is easy. So our web server runs on
Barrelfish. We have a virtual machine monitor that we're using to be able to run device
drivers where we don't worry about the performance, or applications where we don't
necessarily need the scale.
In terms of evaluation, we currently have a relatively limited set of shared memory
multiprocessor benchmarks like these guys. We have a database engine and a constraint
engine, which I'll talk more about in a minute. But we are rapidly trying to acquire more.
And in particular, you can look at this list and say, well, what's not
there? What's not there today is something like a file system, which is a big missing part.
And what's not there is a real network stack that multiple applications can use.
And so that's obviously some of the things that we're working on.
>>: [inaudible].
>> Andrew Baumann: No, you won't. But here's a [inaudible] going to use all the cores
in my multicore machine. Probably at some point I might want to run more than one of
them.
So, evaluation, which I'll kind of preface a little bit with that discussion. This raises a
generally tricky question for OS research, which is how do you go about evaluating a
completely different operating system structure and in particular how do you evaluate the
implementation of something that is by necessity much less complete than a real OS?
Our goals for Barrelfish are that it should have good baseline performance, which means
that it's at least comparable to the performance of existing systems on current
hardware. But more importantly that it show promise of being able to scale with core
counts and be able to adapt to different details of the hardware on which it finds itself
running and be able to actually exploit the message passing primitive to achieve better
performance based upon hardware knowledge.
The first kind of evaluation I'm going to show you just because this is a research OS talk
and all research OSs for a long time have held a tradition of message passing
microbenchmarks is the performance of our user to user message passing primitive,
which is this shared memory ring buffer thing. There's a lot of numbers on this slide, but
the important point is probably that the performance of a cross-core, intercore message
on current hardware, over shared memory, is on the order of 400 to 700 cycles, and I think
that's quite respectable given that the hardware wasn't built to support this.
It's also interesting because it means that in contrast to many other distributed -- most, if
not all other distributed systems, the latency of message propagation can easily be
outweighed by the cost of handling -- sending and receiving messages and actually
processing messages. And so the tradeoffs for building things on top of this end up
being quite different to a typical distributed system. It's often much cheaper just to send
another message than to try to optimize the number of messages that you have to send if
it's more expensive to process them.
Also note that if you look at the throughput numbers and you work it out, we actually get
pipelining for free across the interconnect because of the way the message transport
works. You can have multiple messages in flight.
But then given an optimized message transport, how do you usefully implement part of an
operating system on top of that? What I'm going to show you now is how we implement
unmap or TLB shootdown. And this is what happens when you have some piece of
replicated state like the TLB which by design is replicated between all the cores in the
system. And you need to do some operation that would make some of those replicas
inconsistent, like reducing the rights on or removing a mapping that might have been cached in
all those TLBs.
In a typical -- and so logically what you have to do is send a message to every core that
has this mapping cached and wait for them all to acknowledge that they've removed it. In
most systems the way this works is with interprocessor interrupts. The kernel on the
initiating core sends an IPI to every other core that might have the mapping and then spins
on some shared acknowledgement count that it uses to know when all the other cores
have performed the flush.
In Barrelfish the way this works is that first a user domain sends a local request to its
monitor and then the monitor performs a single-phase commit across all the cores in the
system. And it's a single-phase commit because you need to know when it's done; it
can't fail, it's just the flush of the TLB for a particular -- for a particular memory region.
So the question is how do we implement this single-phase commit, and how do we do it
efficiently? We looked at a couple of different protocols for doing this in our machines.
The first is what we call unicast. And in this case we have one message channel point to
point between every pair of cores in the system. And the initiating core sends a single
message down every channel to every other core to say unmap this region.
And because every message send is essentially one cache line invalidate, what this boils
down to is writing N cache lines, where N is the number of cores. An optimization that we
tried is what you might call broadcast, where you have a single cache line, you write the
message once and every receiver is pulling it out of that single channel.
Neither of these things actually performs very well. This is on the 32-core Opteron, the
box I showed you at the beginning. Neither of these things actually scales very well. In
particular broadcast doesn't scale well because cache coherence isn't broadcast. Even if
you only write it once, every receiver has to go and fetch it from the core that wrote the
message. So it doesn't do any better.
So the question is how can we do better? If we look at the topology of the machine --
again, this is the crazy 8 socket system -- and we think about what's happening in both the
unicast and the broadcast cases. Let's say this guy is initiating the unmap, he's sending a
message here, here, and here. Okay. Then he's sending a message here, here, and
here, and here. The same message is crossing the same interconnect links many, many
times. We're not using our network resources very efficiently.
What you'd like to do is something where you can aggregate messages. For example,
send a message once to every other socket in the system rather than once to every other
core. And we do that, and we call it multicast. So in the multicast case, we have an
aggregation core, which is the first core in every socket in the box. And then that core
locally sends the message on to its three neighbors and we aggregate the responses in
the same way. This performs -- this send here locally is much more efficient because
all of these cores are sharing an L3 cache, so the message sent through the local cache is
much faster.
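A minimal sketch of this multicast unmap idea, again using the hypothetical ump channel primitive from the earlier sketch plus assumed topology arrays (nsockets, chan_to, chan_from) and an assumed local_tlb_flush helper; it is illustrative, not the Barrelfish protocol code:

    struct unmap_msg {
        uint64_t vaddr;               /* start of the region being unmapped */
        uint64_t size;
    };

    extern int nsockets;                              /* assumed topology information */
    extern struct ump_chan *chan_to[];                /* initiator -> aggregation core, per socket */
    extern struct ump_chan *chan_from[];              /* aggregation core -> initiator, per socket */
    extern void local_tlb_flush(uint64_t vaddr, uint64_t size);   /* assumed helper */

    /* Initiator: one message per socket rather than per core, then wait for one
       acknowledgement per socket -- a single-phase commit, since the flush can't fail. */
    static void unmap_initiate(const struct unmap_msg *m) {
        for (int s = 0; s < nsockets; s++)
            while (!ump_send(chan_to[s], m, sizeof(*m)))
                ;                                     /* spin until the slot is free */
        for (int s = 0; s < nsockets; s++) {
            struct unmap_msg ack;
            while (!ump_recv(chan_from[s], &ack, sizeof(ack)))
                ;                                     /* poll for the per-socket ack */
        }
    }

    /* Aggregation core (first core of a socket): forward over the cheap intra-socket
       channels, flush its own TLB, gather the local acks, ack the initiator once. */
    static void unmap_aggregate(struct ump_chan *from_init, struct ump_chan *to_init,
                                struct ump_chan *local_to[], struct ump_chan *local_from[],
                                int nlocal) {
        struct unmap_msg m, ack;
        while (!ump_recv(from_init, &m, sizeof(m)))
            ;
        for (int c = 0; c < nlocal; c++)
            while (!ump_send(local_to[c], &m, sizeof(m)))
                ;
        local_tlb_flush(m.vaddr, m.size);
        for (int c = 0; c < nlocal; c++)
            while (!ump_recv(local_from[c], &ack, sizeof(ack)))
                ;
        while (!ump_send(to_init, &m, sizeof(m)))     /* one ack back per socket */
            ;
    }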
There's one more optimization that you can make to this which is if you go back and look
at this picture, you observe that depending upon where the initiator is, some cores are
going to be further away on the interconnect than other cores. And there's a higher
message latency to reach this guy over here than there is to reach this guy over here.
>>: So [inaudible] back to your second design principle, because this -- I was waiting for this to
come up.
>> Andrew Baumann: How do I make the optimization based on hardware knowledge?
>>: Yeah. Or it seems like -- isn't this an example of like -- isn't it going to be really hard
to try to make the [inaudible] independent of the [inaudible].
>> Andrew Baumann: Okay. Hold that thought. Hold that thought. So you want to send
to this guy first and thus get greater parallelism in the way the messages travel across
the interconnect. We do that as well. And that's called NUMA-aware multicast. And this is
what the resulting messaging operation looks like. And you can see that obviously
enough it's scaling much better. But it brings up an important question for Barrelfish and
for the multikernel in general, which is you need a lot of hardware knowledge to make this
kind of optimization decision. And the decision is going to be different based on different
systems.
In this case the way we made the decision is based upon the mapping of cores to
sockets which is a somewhat suboptimal way of knowing where the shared caches are.
And you can get that information from CPUID. And we based it on messaging latency,
which is: which sockets are further away, which cores are further away to send to? And
we can do a bunch of online measurements, for example at [inaudible] time, to collect this
information about the system. But more generally, an operating system on a diverse
heterogeneous machine is going to need a way of reasoning about those resources in the
system and making these kinds of optimization decisions.
The way that we currently tackle this in Barrelfish is with constraint logic
programming. And this is where the constraint engine comes in. There's a user level
service, called the system knowledge base, that just runs on a single core. It doesn't have to
scale because it's off the fast path. And this thing is actually a port of the ECLiPSe
constraint engine.
The system knowledge base stores as rich and detailed a representation of all the
information about the hardware that we can either gather or measure and allows users to
perform online reasoning and optimization queries against it. So in particular, there's a
Prolog optimization query that we use to construct the optimal multicast tree for the
machine on which we're booting, and that then configures the implementation of the
multicast group communication primitive for that particular machine.
Now, constraints are sort of a place holder here. Clearly this is throwing a sledge
hammer at an ant. And it's not clear what the right way of doing these optimizations is,
but I think there's an argument for some part of the system that is doing these high level
global optimizations using as much information about hardware as it can gather and is
explicitly off the fast path for messaging.
>>: [inaudible]. Sledgehammer at a mountain, because some other bus design might
have a whole bunch of other scheduling constraints, and so, oh, you only get high
performance if you schedule the messages to these cores in exactly this order.
>> Andrew Baumann: So there's a hard question about how you express all those
constraints and how you optimize for them. I would argue that it's, if not easier in the
message passing abstraction, certainly no harder than in the shared memory abstraction.
I think the only way to tackle that problem is to do it explicitly, as in collect all the
information we can and allow you to do queries against that high level information.
Operating systems today don't do this at all, they just say what is the best thing that we
can implement that will work reasonably everywhere rather than how can we implement it
in such a way that we can change what we're doing based on the underlying hardware.
But it's definitely an open problem.
>>: [inaudible] I mean how much of this do you think can be done sort of in this dynamic
online fashion versus [inaudible] inherently done at design time? I mean it seems like --
[inaudible] rephrasing it, but it seems like this ECLiPSe thing can only handle so much,
that there's some stuff that you couldn't even figure out how you could abstract out the
reasoning in the first place.
>> Andrew Baumann: Sure. Yeah. You always have to pick a point. And then the
further you want to go to being dynamic, the more difficult it is to express all the possible
optimizations.
>>: It's important in this context in particular because you're saying -- you're trying to
avoid sort of the specificity creep that you see in a lot of commodity OSs --
>> Andrew Baumann: Right.
>>: Then you look at the [inaudible] [laughter]. One of the things that you see is you
know you'll see like all these you know architecture specific stuff and people will
[inaudible] the comments about how this is an ugly feature, this particular chipset or
universal or whatever. What you're trying to say is, well, look, if we have this sort of
different architecture or we have this nice message passing set, we can kind of punt on a
lot of that. At least that's sort of the high to mid levels.
>>: But it seems that when we start talking about this multicast tree stuff, you're starting to
go down that --
>> Andrew Baumann: What you do in all those kinds of situations is you build primitives
or you build abstractions that allow you to build the high levels of your software
independently of the implementation in the abstraction.
In the shared memory case, you build scalable implementations or data structures like
lists and hash tables that are different depending upon the underlying hardware.
In the message passing case, you build implementations of things like broadcast or group
communication, single-phase commit -- Paxos even, if you want. Different kinds of protocols
whose implementation you can optimize without the layers above needing to know
how that works. I think that you can -- I think that that's tractable. Yes, it's hard, and
figuring out the optimization for different pieces of hardware is still there and it's still hard
and somebody has to do it. But the goal is not to throw that stuff away, the goal is to
figure out how to decouple those details from the way you built the software.
>>: So how much of the -- this kind of stuff do you expose to the application? So if I
want to write an application in one of these strange systems, it seems to me that there's a
couple classes [inaudible] there's things that don't care about performance, right, it's just
-- I can't type fast enough to do anything. And it doesn't matter what you do, all right?
There's another class of things where I really care about performance. I probably want to
know about the topology, the HPC stuff. I want to know about topology of the underlying
hardware and [inaudible].
>>: So if I understand, coming home instead of [inaudible], what can you tell me about how much you
export to the application writer of the --
>> Andrew Baumann: So we're not directly trying to solve the problem of how to make
applications scale from multicore. The problem that Barrelfish is trying to solve is how do
I make the lowest level of the system software work well on scalable [inaudible] but
there's a joule of, you know, the argument that this is the way to build an operating
system in an argument that says maybe this is a good way to build an application. And
applications on Barrelfish have access to all of the same information that the OS itself
has. So it's nothing that's stopping an application going and talk to this thing.
Let me come back to that when I get to future work a bit at the end.
>>: [inaudible].
>> Andrew Baumann: What's the right set of primitives? Yes, that's a good question.
>>: It is small but [inaudible] spend a lot of time optimizing it.
>>: There's a lot of people that talk about taking advantage of topology or knowing
exactly where all the interconnects go. Let me tell you that in HPC that was given up a
long time ago as a bad job just too much diversity, so what we've come to instead is
essentially asking the operative system to allocate neighbors to applications and then
having some sort of model that lets you -- lets you cope with that. The furthest we've
gone is sort of some of the -- some of the sort of this hierarchal NUMA systems in which
you specified the abstract domain on which the computation is placed. The domain gets
placed by the operating system on to collections of nodes or collections of cores. But the
actual details of wow, this look is latency this and bandwidth that are just way too
complicated.
>> Andrew Baumann: Yeah. The application wants some notion of these things. These
things are loosely coupled.
>>: These need to be close together, these don't, whatever.
>>: [inaudible] this is too much work.
>>: Yeah. They get -- there was thought of this, and then Larry Snider sort of pioneered
this view that took us away from that the long key model and its variants are an extension
of that. There's just a whole bunch of examples of situations where HPC people backed
away from knowing the details of the topology systems.
>>: But I think the argument here is exactly the same which is you don't want to
[inaudible] to know the details.
>>: Right, right, right.
>> Andrew Baumann: Right. There's no reason that you can't build abstractions of
[inaudible] data. Somebody somewhere at the bottom needs to --
>>: Maybe there's a small number of communication patterns that the operating system
and the applications use, and they can spend a lot of time optimizing those.
>>: [inaudible] creates a lot of support problems as well. [inaudible] database systems
want to adapt to the hardware and make the best use of the current hardware resources
that you have. But that means that your query optimizer [inaudible]. Nobody has done
that, because they don't know how to support it: the customers call back and say, okay,
this one doesn't work -- well, you are unique. You have 10,000 unique systems out there
and you can't reproduce them. The same is going to be a problem if you start modifying
the decisions the operating system makes at runtime. How are you going to tell where a
problem originated from?
>>: So people that design architectures like that need to be punished. [laughter].
>> Andrew Baumann: The problem is not that they're bad, the problem is just they're
different. But that's the point you're getting at.
>>: Perhaps that's [inaudible].
>>: It is bad.
>> Andrew Baumann: There are many good arguments against doing runtime adaptation.
The point is --
>>: I'm just saying we have to be very, very careful. We can't go in there whole hog.
>> Andrew Baumann: Sometimes you need to adapt just to avoid doing something blatantly
stupid -- that's the challenge. And a lot of stuff in systems comes down to avoiding the
blatantly terrible case.
So let me wrap up quickly because we're running a bit late, I think. This is what happens
when you put it all together into an actual unmap primitive, which also involves
communicating with the local monitor. This has actually improved since then, but we had a
pretty inefficient local message passing primitive in that graph.
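For illustration, a rough sketch of the shape of that unmap path, with hypothetical names
(mon_send, mon_recv, vspace_unmap) rather than the real Barrelfish API: the user-level side
asks its local monitor to coordinate the change with the other cores and waits for an
acknowledgement before finishing locally.

#include <stdint.h>

struct chan;                                                    /* channel to the local monitor */
void mon_send(struct chan *c, uint32_t type, uintptr_t arg);    /* assumed primitive */
void mon_recv(struct chan *c, uint32_t *type, uintptr_t *arg);  /* assumed primitive */

enum { UNMAP_REQUEST = 1, UNMAP_DONE = 2 };

int vspace_unmap(struct chan *monitor, uintptr_t vaddr)
{
    uint32_t type;
    uintptr_t arg;

    /* 1. Ask the monitor to propagate the unmap to every core that might
     *    have the mapping cached (the cross-core part of the operation). */
    mon_send(monitor, UNMAP_REQUEST, vaddr);

    /* 2. Wait until the monitors report completion; this round trip is why the
     *    cost of the local message passing primitive shows up in the graph. */
    do {
        mon_recv(monitor, &type, &arg);
    } while (type != UNMAP_DONE || arg != vaddr);

    /* 3. Only now remove the local page-table entry and flush the TLB. */
    return 0;
}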
Let me skip quickly through these benchmarks, which show the overhead of Barrelfish when
running basic multithreaded shared memory workloads. This is SPLASH-2, which is a
multithreaded, shared memory, compute-bound thing. This is a similar NAS OpenMP
benchmark. None of these workloads actually scale very well on our machine, which says
something about the workloads and the cache protocol itself. The only point these
applications are trying to make is that there's no inherent penalty for running coherent
shared memory applications on an operating system built in this way, nor would you really
expect there to be. And also that, yes, you know, we can still run shared memory
applications.
Perhaps more interestingly, we've built an I/O stack on top of Barrelfish. In this case we
have the network driver on one core, the web server and application server on another core,
and then a backend database on another core. We're using message channels to pipeline
requests through this thing -- a pipelined model -- and that gets quite healthy speedups
over the standard sort of request-parallelism model on traditional systems.
Again, there's an apples-and-oranges kind of comparison going on here: our implementation
of the whole stack versus something that's much more complete. But it's merely trying to
make the point that you can build system services on top of a model like this and they can
perform reasonably.
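A toy sketch of that pipeline structure in C, purely illustrative: a tiny single-producer /
single-consumer ring stands in for the real inter-core channels (a real channel needs
memory barriers and careful cache-line placement, omitted here), and each stage loops
pulling requests from its incoming channel and pushing them to the next stage.

#include <stdint.h>
#include <stdbool.h>

#define RING_SLOTS 64

struct channel {                       /* one-way, single producer, single consumer */
    volatile uint64_t slot[RING_SLOTS];
    volatile unsigned head, tail;
};

static bool chan_send(struct channel *c, uint64_t msg)
{
    unsigned next = (c->head + 1) % RING_SLOTS;
    if (next == c->tail)
        return false;                  /* full: caller retries later (back-pressure) */
    c->slot[c->head] = msg;
    c->head = next;
    return true;
}

static bool chan_recv(struct channel *c, uint64_t *msg)
{
    if (c->tail == c->head)
        return false;                  /* nothing waiting */
    *msg = c->slot[c->tail];
    c->tail = (c->tail + 1) % RING_SLOTS;
    return true;
}

/* Middle stage: take a request handed over by the driver core, do the web /
 * application-server work, and pass it on to the database core.  Requests stream
 * through the stages instead of each being handled start to finish on one core. */
void webserver_stage(struct channel *from_driver, struct channel *to_db)
{
    uint64_t req;
    for (;;) {
        if (chan_recv(from_driver, &req)) {
            /* ... parse the request and build the database query ... */
            while (!chan_send(to_db, req))
                ;                      /* spin if the database stage falls behind */
        }
    }
}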
To wrap up, I wanted to talk briefly about some ideas for future work and where some of
this stuff is headed. What we have today is a first-cut implementation of an operating
system based on this model, and a high-level argument that says it's not an obvious
disaster to build a system in this way and that the idea shows some promise. But that
leaves open a whole lot of interesting questions, some of which we're obviously working on.
The first is: what are the actual protocols that you need to do replication and agreement in
a system like this? The kind of state that we replicate today in Barrelfish is mostly the
capability system metadata and some aspects of memory allocation. That works, but it's
also not very interesting. You end up doing things like one-phase commit and in some cases
two-phase commit. One of the big questions is what are the right protocols for building
higher-level system services on top of these, and how can you structure them in a way that
lets you use transparent local sharing to make them go fast locally?
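As one concrete example of the kind of protocol in question, here is a rough C sketch of a
two-phase commit coordinator over per-core message channels, as one might use for an update
to replicated capability metadata. The channel functions (rep_send, rep_recv) and message
types are placeholders, not the actual Barrelfish protocol code.

#include <stdbool.h>
#include <stdint.h>

struct chan;                                                   /* channel to one replica */
void rep_send(struct chan *c, uint32_t type, uintptr_t arg);   /* assumed primitive */
uint32_t rep_recv(struct chan *c);          /* assumed: blocks, returns the message type */

enum { PREPARE = 1, VOTE_YES, VOTE_NO, COMMIT, ABORT };

/* Coordinator for one update: ask every replica to prepare, and commit only if
 * all of them vote yes.  With split-phase channels the two rounds could be
 * overlapped with other work rather than blocking as this simple version does. */
bool replicate_update(struct chan *replica[], int n, uintptr_t update)
{
    bool all_yes = true;

    for (int i = 0; i < n; i++)
        rep_send(replica[i], PREPARE, update);

    for (int i = 0; i < n; i++)
        if (rep_recv(replica[i]) != VOTE_YES)
            all_yes = false;

    uint32_t decision = all_yes ? COMMIT : ABORT;
    for (int i = 0; i < n; i++)
        rep_send(replica[i], decision, update);

    return all_yes;
}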
And note as I said before, the tradeoffs with respect to the performance of message
passing are going to be very different here than in many of the classical distributed
systems. Although we can reuse ideas from the distributed systems space, I don't think many
of the same protocols would necessarily make sense given these kinds of assumptions.
The next question, which leads a little bit into what I think some of the questions were
getting at, is how do you build applications for this? Obviously your applications aren't
all going to be multiprocessor shared-memory pthreads applications. What does a native
Barrelfish application look like, if there is such a thing?
I don't know the answer to this. I'm not sure anybody knows the answer to this. But there's
a general hard question here about how you program applications for different kinds of
heterogeneous, scalable multicore hardware. And I think that, first of all, the answer is
going to have to lie in raising the abstraction bar. Obviously people aren't going to be
building applications in C or low-level languages and using the typical operating system
APIs that we have today. And that's one of the reasons we're not directly trying to specify
the Barrelfish API that you write applications to.
Instead, we think that applications are going to be written to higher-level language
runtimes and programming environments, and that there will be different environments for
different styles of application. One example of this that maps very well onto heterogeneous
multicore is data-intensive applications implemented in something like a Dryad or MapReduce
kind of model, where the runtime can map the computation that it wants to do down onto the
available hardware resources. Obviously not all applications fit this model.
There will be applications that need tightly coupled synchronization and that want some
variant of a message passing model -- probably not MPI, because that's way too low level,
but something slightly higher level than that for those kinds of applications. Other
applications might be doing streaming data processing or whatever. There will be some
number of classes of applications, and the answer here has got to lie in tightly coupling
the implementation of the runtime for those things with the underlying OS, so that the
runtime can use knowledge of the hardware to optimize itself and present its workload
appropriately to the OS, without the application developer having to know about it. And the
optimizations that you have to use there end up being domain-specific.
How do you construct useful system services? Things that we're looking at at the moment
include the networking architecture and also the file system. The obvious thing to try here
is to take some ideas from cluster file systems and HPC file systems and see how they map
into this space. One thing that's very different is when it comes down to dealing with
things like where the data actually lives: having hardware-supported shared memory, even if
it's not coherent, makes the implementation of these kinds of things very different.
And finally, there's a big open question about how you do resource allocation and
scheduling in a system that's structured in this way. Given some diverse set of competing
application demands, a varying set of processors, and varying requirements like performance
and power management, how do you actually solve that resource allocation problem? I think
that's a big open question.
Part of the answer is going to have to lie in high-level things like constraint solving. We're
starting to look at how we can do heterogeneity-aware scheduling and resource allocation,
particularly how the application can express its resource requirements to the operating
system. Another thing to note is that because you have these message queues between
applications and between cores, you can actually infer something about communication
patterns, and even load, based on the length of the message queues, like some of the
event-driven architectures that people have built in the past. And so there might be
something interesting that you could do there.
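A toy illustration of that observation, with made-up names: because the queues are
explicit, a scheduler can sample how full each one is and treat a persistently deep queue
as a sign of load, without any cooperation from the applications themselves.

#include <stddef.h>

/* Per-channel state; 'capacity' is assumed to be a power of two. */
struct msg_queue {
    unsigned head, tail, capacity;
};

static unsigned queue_depth(const struct msg_queue *q)
{
    return (q->head - q->tail) & (q->capacity - 1);
}

/* Pick the dispatcher with the most work waiting, judged purely from how full
 * its incoming queue is -- no shared counters or application instrumentation. */
static size_t most_loaded(struct msg_queue *const queues[], size_t n)
{
    size_t busiest = 0;
    for (size_t i = 1; i < n; i++)
        if (queue_depth(queues[i]) > queue_depth(queues[busiest]))
            busiest = i;
    return busiest;
}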
Finally just to head off one question that people often tend to ask, now that you've built
an operating system based on explicit message based communication, can you actually
extend that model outside the box? Does it make sense to put a network in between and
run one core on the other side of the network?
The first answer to that question is probably no, or at least not in the trivial sense.
There are big differences between being inside a machine and going across machines, in
particular in the guarantees that your messaging system can offer: things like reliable
message delivery, all the kinds of things that our message transports guarantee within a
machine, are completely different as soon as you go across machines. The performance
tradeoffs are different. And even if you don't rely on it, having hardware support for
shared memory makes things very different. But building a system this way definitely
narrows the gap between the programming model that you use inside a machine and the
programming model that you might use across multiple machines. You have explicit message
passing. In particular, you have these explicit split-phase APIs that allow you to tolerate
much more latency in communication.
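The split-phase style looks roughly like this in C (illustrative names only): the caller
hands over a continuation along with the request, gets control back immediately, and the
continuation runs whenever the reply message arrives.

#include <stdint.h>

typedef void (*continuation_fn)(void *arg, uint64_t result);

/* Book-keeping for one outstanding request. */
struct pending {
    continuation_fn fn;
    void *arg;
};

/* Issue the request and return immediately; remember what to run when the
 * reply eventually arrives. */
void lookup_start(struct pending *p, uint64_t key,
                  continuation_fn fn, void *arg)
{
    p->fn = fn;
    p->arg = arg;
    (void)key;   /* ... enqueue a lookup message carrying 'key' on the channel ... */
}

/* Called from the message-handling loop when the reply shows up.  The caller has
 * been free to do other work in the meantime, which is what makes this style
 * tolerant of high latency -- across cores or, in principle, across a network. */
void lookup_done(struct pending *p, uint64_t result)
{
    p->fn(p->arg, result);
}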
So you can think about taking programming models for programming aggregates of machines and
mapping them onto programming inside the machine, using the same kinds of primitives there.
You can also think about extending your PC, or your personal collection of machines, with
resources that you dynamically acquire from the cloud or other resources that you might
have in your pocket, and having the same part of your operating system that deals with
high-level questions like resource allocation and scheduling decisions be just as able to
say, well, I'm going to go and run this over there on the other side of the network, and to
have some kind of integrated view of the part of my machine that's in the cloud, just as it
would if it were running on my local machine or on my phone or whatever.
So, to wrap up: I've argued that modern computers are inherently distributed systems, and
that it's therefore time to rethink the structure of an OS to match. I've presented the
multikernel, which is a model for an OS as a distributed system. It uses explicit
communication and replicated state, and it's neutral to the structure of the underlying
hardware. And I've shown you Barrelfish.
I hope I've convinced you that Barrelfish is a promising approach for structuring operating
systems. We're trying to build it very much as a real system. I think I've shown it has at
least reasonable performance on current hardware and, more importantly, that structuring a
system in this way should allow it to scale and adapt better to future hardware. And with
that, I'll put in another plug for our website. There's papers and source code and more
information there.
And I'm happy to take any other questions.
[applause].
>>: I have one question: suppose I have a [inaudible] -- what does your C compiler do on
Barrelfish?
>> Andrew Baumann: We have GCC, and it does what GCC does. You can --
>>: [inaudible] then you would use the shared --
>> Andrew Baumann: That application, that application is not going to be able to run on
cores that don't have coherent shared memory and aren't the same architecture. All
right? So in the way you've built that application, you've limited the set of cores on which
you can run it. But there's no reason that the -- our operating system can't run that
application. Obviously we're not building the operating system that way. We've built the
operating system in C, but we don't use threads, and we don't assume shared memory
between multiple instances of the same program.
>>: [inaudible] if you want to do that kind of sharing you should use the APIs and
[inaudible].
>> Andrew Baumann: Yeah. No, the shared memory multithreaded model is probably
the right model for some set of applications. They're just not going to run across all these
cores. But if they only operate on a small number of cores and have loosely coupled
synchronization, it's a perfectly valid programming model.
>>: So I didn't hear much about security. I'm a security person so I'm always curious.
>> Andrew Baumann: Okay.
>>: And I was wondering, one, if there's a failure in any one of the subkernels on any one
of the chips or cores, does that mean you're hosed? Is everything fully [inaudible] by
everything else? And for the message passing -- how do you defend against an attempt to do
a denial of service by sending [inaudible]?
>> Andrew Baumann: The answer is that right now we don't defend against any of those
things, and if one of them goes down, the whole system goes down. However, as a model for
constructing a system, we know to some approximation how to secure message-based systems.
You can build things that look at the integrity of messages on the channel. You can build
things that isolate failed components. You can build messaging protocols that tolerate
failures fairly well. You can do the same things in a shared memory model, but I would
argue that it's more complex, because you don't know what the boundaries of sharing are;
you don't know where the communication is. If some other core has gone and trampled on all
my data structures, I'm much more screwed than if some other core sent me a bad message and
I can detect that it's a bad message. So we're not trying to solve this problem, but I
think this model gives you a better handle on it.
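A small sketch of that point about containing bad messages, with made-up message types:
everything crossing a channel is checked against an explicit schema at the boundary, so a
malformed message is simply rejected rather than corrupting the receiver's state.

#include <stdbool.h>
#include <stdint.h>

enum msg_type { MSG_ALLOC = 1, MSG_FREE = 2, MSG_TYPE_MAX };

struct msg {
    uint32_t type;
    uint32_t len;                 /* bytes of payload actually used */
    uint8_t  payload[64];
};

/* Check the message against its schema at the trust boundary.  A misbehaving
 * sender can waste the receiver's time, but it cannot reach past this check
 * into the receiver's data structures. */
static bool msg_valid(const struct msg *m)
{
    return m->type >= MSG_ALLOC && m->type < MSG_TYPE_MAX
        && m->len <= sizeof m->payload;
}

static bool handle_message(const struct msg *m)
{
    if (!msg_valid(m))
        return false;             /* drop it; maybe log or throttle the sender */
    /* ... dispatch on m->type and act on m->payload ... */
    return true;
}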
>>: And in terms of availability just how do you protect against one application just
sending lots of -- sending lots of messages [inaudible].
>> Andrew Baumann: That's just a general sort of resource allocation problem. How do
I protect against one application spinning on a core or creating lots of threads and
throttling the system? I can -- you know, I can account for the message channels it sets up --
>>: Some kind of accounting for [inaudible].
>> Andrew Baumann: Well, I can also account for the individual message operations. I mean,
in general most of these issues are different in the messaging scenario than in the shared
memory scenario. But I don't think they're necessarily any harder.
>>: Well, one thing that might be harder is that in the message passing scenario, depending
on what constraints you've solved for, you may not know the cost of sending a message.
There may be certain types of messages that cost an order of magnitude more than other
types of messages. So it may be very hard to set a threshold where you say you can only
send X messages; you may need to have some other kind of --
>> Andrew Baumann: Yeah, yeah.
>>: I don't believe you at all, but I'll let him answer the question. [laughter].
>> Andrew Baumann: I think it's probably [inaudible] [laughter].
[applause]