>> Galen Hunt: So good morning, and welcome to you all. It's my pleasure to be hosting Andrew Baumann again. I hosted him about two years ago, almost exactly two years ago, when Andrew came through after finishing his PhD, which he did at the University of New South Wales in Australia. Gernot Heiser was his advisor. After he came through here he went to ETH, where he's been working with Timothy Roscoe (Mothy, as we all know and love him) on multikernels and on Barrelfish. You know, Barrelfish was not in the title nor in the abstract anywhere. >> Andrew Baumann: That was because we forgot to put it in after the anonymization. [laughter]. >> Galen Hunt: Okay. So anyway, we're glad to have him here to talk to us about Barrelfish. So take it away, Andrew. >> Andrew Baumann: Thanks. So just so I know, how many people were at SOSP last week and saw this talk already? Okay. >> Galen Hunt: Not very many. >> Andrew Baumann: Good. Well, you guys will probably be bored for the first part of the talk, but I've got some new stuff at the end. And also, feel free to ask questions as I go. So, yeah, the title of this is as it was in the paper: The Multikernel: A New OS Architecture for Scalable Multicore Systems. And this is joint work with a whole bunch of folks at Cambridge, a bunch of students at ETH, and Mothy Roscoe, whom Galen already mentioned. And so what we're trying to answer in this work is the question of how an operating system for future multicore systems should be structured. And there are two key problems that we see here. The first is scalability, the obvious one: how do you scale up with an ever-increasing core count? But the second, which is a little more subtle, is managing heterogeneity and hardware diversity. To give some examples of what we mean by that, these are three current or near-current multicore processors. On the Niagara, you have a banked L2 cache and a crossbar switch, and what that means is that all regions of the L2 are equidistant from all the cores on the system. And so shared-memory algorithms running on those cores that are manipulating data in the same cache line work quite efficiently on the Niagara, whereas on the Opteron you have a large, slower shared L3 that's almost twice as slow to access as the local L2s. And on the Nehalem you have a banked cache with a ring network inside the chip shuttling data around inside the cache. And so the message is that the way you optimize shared-memory algorithms to scale well on these three processors is different, because they all have different access latencies and different tradeoffs. And so we need a way for system software to be able to adapt to these kinds of different architectures. Another example of diversity is in the interconnect. This is an eight-socket, four cores per socket, 32-core AMD Opteron machine that we have in our lab. And you can see that these hypertransport links between the sockets make for a pretty unusual network topology. And that topology plays into the scalability and the performance of shared-memory algorithms executing on that machine. You see the layout of the interconnect; in particular, you see that when you go beyond here, these two links get contended. And so these kinds of interconnect topologies play into the way that you want to optimize things on the machine. But they're all different. This is what the forthcoming eight-socket Nehalems will look like. And there are also plenty of other hypertransport topologies that you can have inside a box.
And even as things move onto the chip and densities on a chip become greater, you'll start to see these kinds of networking effects inside a chip. So that's a Larrabee with a ring network. This is a [inaudible] chip with a mesh network inside the processor. And again, the communication latencies inside the chip depend on where you are on the interconnect and the utilization of the interconnect. Finally, we're starting to see diverse cores. Today inside a commodity PC, it's quite likely that you'll have a programmable network interface of some sort, or a GPU. And you can put FPGAs in CPU sockets; for example, in that Opteron system there are FPGAs that will sit on the hypertransport bus. And in general, today's operating systems don't handle these things particularly well. This is also one of the motivations behind Helios. On a single die, it seems likely in the near future that we'll have heterogeneous cores, either because they'll be asymmetric in performance -- a lot of people are talking about situations where you may have a small number of large, fast, out-of-order cores that are power hungry, and a much larger number of smaller, simpler, in-order cores -- or you may find that some cores on the chip don't support the full instruction set. They may leave out things like streaming instructions or virtualization support. And, you know, Cell is just one extreme example of core heterogeneity in current systems. So in general, there are increasing core counts, but also, kind of related to that, increasing diversity between cores within the machine and between different systems. And so, unlike a lot of prior work on scaling to large numbers of cores, it's going to be much harder to make the kinds of optimizations that you need to scale at design time, because the machines on which you run will be different, and the resources inside the machine will have different tradeoffs and different topologies. And so we need a way for system software to adapt dynamically to the system on which it finds itself running in order to scale. And so we think that because of all of these things, now is a good time to rethink the default structure of an operating system. Today's multiprocessor operating systems more or less look like this by default: a shared-memory kernel that executes on every core in the machine and communicates and synchronizes using shared-memory data structures protected by locks or other synchronization mechanisms. And anything that doesn't fit into that model tends to be abstracted away as a device driver. So a programmable network interface, regardless of the fact that you can run application code over there, is hidden behind the device driver API. And so we propose structuring the operating system for something like this more as a distributed system than as a shared-memory multiprocessor program. And we call the model for the OS as a distributed system the multikernel. And in the rest of this talk, I'm going to introduce and motivate, firstly, three design principles that make up the multikernel model. And they are: making all communication between cores explicit, making the structure of the OS neutral to the underlying hardware, and viewing state inside the machine as replicated. So after I introduce those design principles, I'm going to talk about Barrelfish, which is our implementation of a multikernel. I'll present some evaluation from our SOSP paper.
And I'll also present something a little more concrete to give you a feeling for how things work inside Barrelfish and how we can analyze and optimize the performance of something in a system structured this way. But first, the design principles. So I said that we want to make inter-core communication explicit. And in a multikernel, what that means is that all communication between cores in the system is done with explicit message-passing operations. And so there is no assumption of shared state in the model. That's quite a radical departure from the way people have built multiprocessor operating systems in the past. But if we can make it work, we think it has some good advantages. First, it allows you to decouple the structure of the system from the concrete details of the communication mechanism between cores, and so you explicitly express your communication patterns inside the machine. To give an example of that: typically, the way you optimize shared-memory data structures is to think about which cache lines are likely to be local, which cache lines are likely to be remote, how do I optimize for the locality of some piece of data, and to think explicitly about when I need to communicate, which lines am I going to invalidate, which data structures are going to be moving across the machine. If you're doing this with message-based communication, you think, for some particular operation, with which other cores in the system do I need to communicate? And in order to do that, I need to send them a message. So it's a different way of thinking about scalability within the machine. Message-based communication also supports heterogeneous cores on non-coherent interconnects. So again, think about the example of the ARM processor on the other side of a PCI Express interface. Very similarly to Helios, you cannot coherently share memory with that thing, but you can send it a message. And even if there's no coherent shared memory and the software running on the offload processor is running a different instruction set and supports different register sizes and pointer sizes, it can interpret your message and do something in response. Message-based communication is potentially a better match for future hardware, either because that hardware might support cheap explicit message passing -- an example here is the Tile processor, where you can send a message in software across the interconnect in tens of cycles -- or because that hardware may not support cache coherence or may have limited support for cache coherence. So Intel's 80-core Polaris prototype is one example of a machine where they didn't implement cache coherence in hardware because it was too expensive. Message-based communication allows an important optimization of split-phase operations. What split-phase means in this context is that you decouple the message that requests some remote operation from the message that indicates the success or failure of that operation in response. And so rather than everything being a synchronous model where you perform some operation and block waiting for that operation to complete, you can initiate multiple operations and then asynchronously handle responses. And that's important for getting greater concurrency and parallelism inside the machine. I'll give you an example of that later on. And finally, we can reason about the scalability, correctness, and performance of systems built upon explicit communication with message passing.
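A minimal sketch in C of the split-phase pattern just described. It is purely illustrative: the channel type and the send_request, poll_reply and do_other_work functions are hypothetical placeholders, not the Barrelfish API.

    #include <stdbool.h>

    /* Hypothetical message-channel primitives; not the Barrelfish API. */
    struct channel;
    void send_request(struct channel *c);    /* initiate a remote operation        */
    bool poll_reply(struct channel *c);      /* consume one reply if one arrived   */
    void do_other_work(void);                /* whatever else this core can do     */

    /*
     * Split-phase: initiate the operation on all n remote cores first, then
     * overlap useful local work with handling the replies, instead of issuing
     * one synchronous request at a time and stalling until each completes.
     */
    void split_phase_operation(struct channel *chan[], int n)
    {
        int replies = 0;

        /* Phase 1: fire off every request. */
        for (int i = 0; i < n; i++) {
            send_request(chan[i]);
        }

        /* Phase 2: handle replies asynchronously; each channel delivers
         * exactly one reply, so counting them is sufficient. */
        while (replies < n) {
            for (int i = 0; i < n; i++) {
                if (poll_reply(chan[i])) {
                    replies++;
                }
            }
            do_other_work();    /* cycles a blocking design would waste */
        }
    }

The same pattern reappears later in the talk, both in the unmap commit and in the dispatcher-creation trace.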
So you may well be thinking: that's all very nice, but the machines we have today are fundamentally shared-memory systems, and the performance of any message passing [inaudible] on these kinds of systems is going to be slow compared to using the hardware cache coherence to migrate data. And so what I'm about to present here is a very simple microbenchmark, just exploring that tradeoff. There are two cases. In the shared-memory case, we have a large shared array in shared memory, and we have a number of cores manipulating varying-sized regions of the array. And there's no locking here; the cores are just directly executing write updates on the shared array. So what happens when a core issues a write is that the cache coherence protocol migrates cache lines around to the different cores that are performing the updates. And while that migration happens, the processor is stalled; it doesn't get to retire any other instructions while it's waiting for the interconnect to fetch and invalidate cache lines. And so the performance of the whole operation is going to be limited by the latency of these round trips across the interconnect. And in general, it's going to depend on two things: the size of the data, in terms of the number of cache lines that have to be modified as part of each update, and the contention, in terms of the number of cores all hammering on the same cache lines. And this is how it performs. This is a four-socket, 16-core Opteron machine. This is what happens when everyone modifies one cache line; then two, four, eight cache lines. You can see that it doesn't scale particularly well. And remember that I said there's no locking here. So in the 16-core case and in the 14-core case, all of those cores are executing the same number of instructions, simply modifying the same region of shared memory. And all of the extra cycles between there and down here are stalls waiting for the cache coherence protocol to move cache lines around. >>: [inaudible]. >> Andrew Baumann: Yes. Well, in practice, given the x86 consistency model, it doesn't actually make a difference, because when you do a write to one word you have to move the whole cache line over. So in the message-passing case, what we've done is localize the array on a single core. There's one core that is responsible for updating the array. And all of the other cores, when they have updates, instead of directly modifying the array, express their update as a message. We assume that the update and its result can be expressed in a single cache line. So we essentially send a request in a cache line to say please manipulate this entry in the array and tell me when you're done, and we send it as an RPC to the server core. And the way that we send it is with this implementation of a message channel. Now, on current hardware the only communication mechanism we have is coherent shared memory. And so what we have here is a ring buffer based in shared memory that we use for shipping the messages between cores. And we have microoptimized the implementation of the ring buffer to the details of AMD's hypertransport cache coherence mechanism so that it moves messages as efficiently as possible on this hardware. And in this experiment, the clients send the request and then block waiting for a reply, so they're synchronous RPCs. And that's how it scales. This is what happens when the server core manipulates one cache line in this shared array for every message.
And that's what happens when it manipulates eight cache lines, which is not terribly surprising, because the data that it's manipulating is now staying entirely local in its cache. The only things that move across the interconnect are the messages in the ring buffers. >>: [inaudible] so you do an IPI? >> Andrew Baumann: No. In this experiment the server is polling, and the clients are polling while they're waiting for a reply. And so on this machine, for this benchmark -- which admittedly is not particularly fair on the shared-memory case; there are many ways you could optimize this kind of thing that don't have everybody hammering on the same shared memory -- we do better above about four cores and four cache lines. But what's more interesting, I think, is if you look at the cost of an update as experienced at the server. This is the time it takes the server to perform each update, and as you'd expect it stays flat, because it's just manipulating local state. You can infer that this time difference here is essentially the period that the client is blocked waiting for a reply while the message is sitting in the queue. And the reason this scales up is that the server is saturated and there's a queuing delay at the server to process the updates. But in this case the client is retiring instructions -- it's polling on the message channel waiting to see the reply -- whereas in the shared-memory case it doesn't get to do anything; the processor is stalled waiting for the interconnect. So if we had an asynchronous RPC primitive, where the client could send an update to the server core, do some other useful work, and then asynchronously handle the reply, those cycles would actually be available to perform useful work. And that's why we say this split-phase optimization is going to be important: as the latencies increase, you want to be able to do work while you're waiting for something to happen remotely. Our second design principle is that we separate the structure of the operating system from the underlying hardware, at least as much as possible. And so in the model, the only hardware-specific parts of the system are the message transports, which are responsible for moving data between cores -- I've just shown you one example on hypertransport; they're highly optimized and specialized to the underlying hardware -- and, of course, the parts of the system that interact directly with hardware, such as device drivers and the bits that manipulate CPU state and so on. This is important because this is how we can adapt to all the hardware diversity that I've just motivated: changing performance characteristics, changing topologies and so on. In particular, we can late-bind the implementations of the message transports, and also the messaging protocols used above those transports, to the particular machine on which we run. And I'll show you an example later, in the unmap case, of how the way you send messages in a particular machine can be changed depending upon the topology of that machine. The final design principle is that all potentially-shared state is accessed as if it were a local replica. This includes anything that is traditionally maintained by an operating system: run queues, process control blocks, file system metadata, and so on. This is kind of required by the message-passing model: if you don't have any shared state in the model, then what you have to have is local state plus messages to maintain the replicated copies of that local state.
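As a rough illustration of that last principle, here is a small C sketch of what "local state plus update messages" might look like for one piece of notionally global OS state. All of the names and messaging primitives are hypothetical, and real replica maintenance also has to deal with ordering and agreement, which this ignores.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical messaging primitive; not the Barrelfish API. */
    struct channel;
    void send_msg(struct channel *c, const void *buf, size_t len);

    /* A per-core replica of some notionally global piece of OS state, here a
     * record of which cores currently have a particular region mapped. */
    struct mapping_replica {
        uint64_t region_base;
        uint64_t cores_with_mapping;    /* bitmap, one bit per core */
    };

    /* The update message exchanged between cores instead of shared writes. */
    struct update_msg {
        uint64_t region_base;
        int      core;
        bool     mapped;
    };

    /* Reads are purely local: they only consult this core's replica. */
    bool core_has_mapping(const struct mapping_replica *r, int core)
    {
        return (r->cores_with_mapping >> core) & 1;
    }

    /* A local update is applied to the local replica, then propagated to the
     * other replicas by messages rather than by writing shared memory. */
    void update_mapping(struct mapping_replica *r,
                        struct channel *peers[], int npeers,
                        int core, bool mapped)
    {
        struct update_msg m = { r->region_base, core, mapped };

        if (mapped) r->cores_with_mapping |=  (1ULL << core);
        else        r->cores_with_mapping &= ~(1ULL << core);

        for (int i = 0; i < npeers; i++) {
            send_msg(peers[i], &m, sizeof(m));
        }
    }

    /* Handler run on each peer core when the update message arrives there. */
    void handle_update(struct mapping_replica *r, const struct update_msg *m)
    {
        if (m->mapped) r->cores_with_mapping |=  (1ULL << m->core);
        else           r->cores_with_mapping &= ~(1ULL << m->core);
    }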
It's also good because it naturally supports domains that don't share memory. So again, think about the case where you have some application, part of which is running on a CPU and part of which is running on an offload processor, like an ARM on the other side of a PCI Express bus. It also naturally supports changes to the set of running cores. You can think about hotplugging devices that have cores on them, but you also already have the problem in today's systems of how you manage bringing up and shutting down cores for power management. And in general, there's a lot of literature in the distributed systems space related to how you maintain the consistency of replicas when nodes in the system are coming and going and have partial, inconsistent replicas when they reappear. So we think there's a way to take that work and apply it to the problem of power management and hotplug in operating systems. So as I've presented the model so far, you can see this spectrum of sharing and locking, where traditional operating systems are progressively trying to get more scalable by introducing finer-grained locking and then partially replicating state in order to scale up, whereas the multikernel is this extreme end point over here on the right, where we have no shared state at all and replica maintenance is done by protocols sending messages around. In reality, there are going to be situations where it's much cheaper to share memory. You can think about two hardware threads on the same processor, or cores with a large shared cache between them. There will be situations at the hardware level where it's cheaper to share locally than to exchange messages between cores. So in reality, in a multikernel system, we see sharing as a local optimization of replication. You may find that one replica of some piece of state is shared between some threads or some tightly-coupled cores, and rather than sending messages to those other cores, you simply take out a lock, manipulate the local shared copy, and then release the lock. But the important difference here is that sharing is a local optimization of replication, rather than replication being a scalability optimization of a shared-memory model. And so this allows us, firstly, simply to support hardware that doesn't have sharing, or hardware where it's cheaper to send messages, so that we get a scalable system, but it also allows us to make this local sharing decision at runtime, based upon the hardware on which we find ourselves. And the basic model remains this split-phase replica update model. So if you put all that together, this is our logical picture of what a multikernel looks like. You have some instance of the operating system running on every core in the machine, and it's maintaining a partial replica of some part of the logical global state of the OS. And it's maintaining those replicas by exchanging messages between the different OS nodes. Note that you can specialize the implementation of the OS node, and the replica that it maintains, to the architecture of the particular core on which it's running. And also note that where the hardware supports it, we can happily support applications that use shared memory across multiple cores, but we don't want to rely upon shared memory for the correctness and the implementation of the OS itself. And so that's more or less the multikernel model. Barrelfish is our prototype implementation of a multikernel.
It's written from scratch -- we've reused some libraries, but essentially everything else is written by us -- and it currently runs on 64-bit Intel. Ron Hudson at MSR Cambridge has an initial port to ARM, so the next job for us is to integrate that into the same tree once I get back to Zurich. And it's open source. If you look at what Barrelfish looks like, we've partitioned the OS node -- that was the thing running on every core -- into a kernel-mode portion that we call the CPU driver, which is completely specialized to the hardware on which it runs, handles traps and exceptions, and reflects them up to user software (you can think of this as sort of a microkernel on each core), and a privileged user domain called the monitor. The monitor also runs on every core, but the monitor actually exchanges messages with all the monitors on other cores. And so the monitors together implement this logical distributed system, and they mediate local operations on global state by exchanging messages among themselves to maintain replicated state. Our current message transport -- and this is only on cache-coherent x86 hardware -- is our implementation of user-level RPC. So again, this is the shared-memory ring buffer, optimized for user-to-user message transport based upon shared memory. But we fully expect that to change as hardware and messaging mechanisms change. Much like in a microkernel or an exokernel, most other system facilities are implemented at user level. There's a whole other set of ideas in Barrelfish that are basically the design choices we made when building an operating system but are not directly related to the multikernel model -- just the whole set of choices that you have to make when you build an OS, and things that we thought were the right way to build an OS. I'm not going to talk through all of these points, but feel free to ask questions. Some of the important ones are: minimize shared state, obviously. We decouple the messaging (the thing that moves the message between cores) from the notification mechanism, such as an IPI on current hardware. We use capabilities for all resource management, using the same model as the seL4 system that was presented at SOSP this week. We have upcall-based dispatch. I said we run drivers in their own domains. We use a lot of DSLs for specifying different parts of the operating system; in this case, for example, device registers and device access code are generated from a DSL. And so on. There are all sorts of applications running on Barrelfish, because we're trying to build it, as much as you realistically can in research, as a real OS. You're looking at the slide show viewer that runs on Barrelfish. Our web server runs on Barrelfish, and it successfully survived being slashdotted a couple of weeks ago. We now have a virtual machine monitor, and that's our plan for getting device support for a lot of the devices where we don't worry about the performance. >>: So OS components like the [inaudible] that you [inaudible]. >> Andrew Baumann: Yeah, our networking stack is all [inaudible] at the moment. And that's a place holder. We -- >>: [inaudible]. >> Andrew Baumann: Don't have one yet. >>: Okay. So is the web server just keeping everything in memory and then serving -- >> Andrew Baumann: The web server is serving -- there are two cases of the web server.
Everything is in memory, but it can either be static -- there's a RAM FS, essentially, that the web server serves from -- or there's a dynamic case where we have an in-memory database and we serve things out of the database. But we don't have the storage driver yet. So we're implementing the device drivers where we care about the performance, where we want to run benchmarks, and for our own hardware, but the reason for the VM is that it gives us a way to support a much larger set of devices where the performance doesn't matter so much. We run shared-memory benchmarks, a database engine, and a constraint engine -- I'll tell you why we might want to run a constraint engine in a moment -- and more every day. So that's Barrelfish. Now, how do you evaluate a completely different operating system structure? In particular, Barrelfish, as an initial implementation of something, is necessarily much less complete than anything like Windows or Linux. So it's very difficult to do comparisons here that aren't apples and oranges. Our goals for Barrelfish initially are that it should have good baseline performance, which means that on at least current hardware it's comparable to existing systems. What we're more interested in seeing is scalability with core count, the ability to adapt dynamically to the underlying hardware on which we find ourselves, and also to exploit the message-passing primitives for good performance. So I'll show you some bits of the evaluation. This is a microbenchmark of message-passing cost, simply because every research operating system has to have microbenchmarks. There are a lot of numbers on these, but the high-level point is that on current cache-coherent hardware it takes between 400 and 700 cycles to move a message -- where the message is a cache line -- from one core to another. And that's because we have microoptimized it down to two hypertransport requests on AMD systems. And we think that, given the way things are on current hardware, that's probably about the best you can do. The other thing to note is that you get batching and pipelining for free. If you send multiple messages down the same channel, they actually get pipelined across the interconnect, and you get much better than the single-message latency for sending multiple messages at once. There's an interesting side note here, which is comparing the multikernel -- a system using message passing between cores as a way to scale the OS -- to a microkernel, which is using message passing within a core, between different components, as a way to manage protection and isolate different parts of the system from each other. And this is not really our main goal, but it's an interesting comparison. At least on this machine, the latency of an inter-core message is comparable to the latency of an intra-core message on a microkernel, with the advantage that you don't have to do a context switch, you can get better throughput because you pipeline these things across the interconnect, there's less impact on the cache, and so on. So it's interesting to think about -- one way of thinking about this is to think of it partly as a microkernel, but rather than decomposing things into different servers that run on the same core and having to context switch between them, decomposing them into different servers that run on different cores with message passing between the cores.
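The user-level RPC channel mentioned above is, at heart, a ring of cache-line-sized slots in shared memory that the receiver polls. Below is a much-simplified, hypothetical sketch of such a channel (single producer, single consumer, no flow control; the real Barrelfish transport is more carefully optimized for the coherence protocol).

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE  64
    #define RING_SLOTS 64                    /* fixed at channel setup */

    /* One message occupies exactly one cache line; the final word carries an
     * epoch marker, so the receiver can poll a line that stays in its own
     * cache until the sender actually writes it. */
    struct __attribute__((aligned(CACHELINE))) urpc_slot {
        uint64_t payload[7];
        _Atomic uint64_t epoch;              /* 0 = never written */
    };

    /* Each end keeps its own private position and expected epoch. */
    struct urpc_chan {
        struct urpc_slot *ring;              /* RING_SLOTS slots in shared memory */
        uint64_t pos;
        uint64_t epoch;
    };

    void urpc_init(struct urpc_chan *c, struct urpc_slot *shared_ring)
    {
        c->ring  = shared_ring;              /* zero-initialized, visible to both cores */
        c->pos   = 0;
        c->epoch = 1;
    }

    /* Sender: fill the payload, then publish it with a release store of the epoch. */
    void urpc_send(struct urpc_chan *tx, const uint64_t msg[7])
    {
        struct urpc_slot *s = &tx->ring[tx->pos];
        memcpy(s->payload, msg, sizeof(s->payload));
        atomic_store_explicit(&s->epoch, tx->epoch, memory_order_release);
        if (++tx->pos == RING_SLOTS) { tx->pos = 0; tx->epoch++; }
    }

    /* Receiver: non-blocking poll. Until the sender's store invalidates it, the
     * slot being polled is a cache hit on the receiver's core. Flow control
     * (ensuring the sender never laps the receiver) is omitted for brevity. */
    bool urpc_try_recv(struct urpc_chan *rx, uint64_t msg[7])
    {
        struct urpc_slot *s = &rx->ring[rx->pos];
        if (atomic_load_explicit(&s->epoch, memory_order_acquire) != rx->epoch)
            return false;
        memcpy(msg, s->payload, sizeof(s->payload));
        if (++rx->pos == RING_SLOTS) { rx->pos = 0; rx->epoch++; }
        return true;
    }

Note how the cheap polling discussed later in the talk falls out of this layout: the line being polled stays in the receiver's cache until the sender's store invalidates it.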
Obviously, depending upon the amount of data that you need to move and whether you have shared memory, these tradeoffs change. But it's an interesting comparison. What I'm going to present now is a case study of how we implement, in Barrelfish, one piece of OS functionality that is often a scalability problem, and in this case it's unmap. When a user unmaps some region of memory, what the operating system needs to do is send a message to every other core that may have that mapping in its TLB, and wait for them to acknowledge that they have performed the unmap operation locally. In most systems the way this works is that the kernel on the initiating core sends inter-processor interrupts to all the other cores that might hold the mapping and then spins on some region of shared memory waiting for them to acknowledge that they performed the unmap locally. In Barrelfish, the way this works is that the user domain sends a local request to its monitor domain -- so that's a local message -- and then the monitor performs a single-phase commit across the other cores in the machine that hold the mapping. And we call it a single-phase commit because that's essentially what this is: send a request to every core, wait for them to acknowledge that they have performed the operation. And the question is, how do we implement the single-phase commit; how do we implement this communication inside the machine? We looked at a couple of different messaging protocols to do this. In what we call the unicast case, the initiating monitor has a point-to-point channel to all the other monitors in the system, and so when the initiating monitor gets an unmap request, it writes that request into every message channel (each of which consists of multiple cache lines) and then waits for the acknowledgements on every channel. In the broadcast case, we have one shared channel: every receiver core is polling on the shared channel, and the request is written once into that shared channel. Neither of these performs particularly well. Broadcast doesn't scale any better, because cache coherence is not broadcast: even if you write the message once into one region of shared memory, if everybody's reading it, it crosses the interconnect once for every reader. So the question is, how can we make this scale better? If you look at the machine -- this is the machine on which we're doing this benchmark -- when you send the message once to every other core, your message is crossing the interconnect many, many times. This interconnect link carries the message at least four times, because it's going to each of these four cores. What you would like to do is something more like a multicast tree, where you send the message once to, say, every socket and then send it on locally to the other cores on the same socket. And we do this in Barrelfish. So in the multicast case, we have message channels to an aggregation core on every socket in the box, and then that core's monitor locally sends it on to its three neighbors. And that's much faster, because they're sharing an L3 cache. And then we aggregate the replies, the acknowledgements, in the same way. There's one additional optimization to this, which is that, if you look at the topology of the machine, some cores are further away -- in terms of more hypertransport hops, and therefore greater message latency -- than other cores in the system.
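A rough sketch of how this multicast single-phase commit might look, again with purely hypothetical channel primitives rather than the real Barrelfish code. The distance-based send ordering described next could be layered on top simply by sorting the initiator's list of socket aggregators.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical primitives; not the Barrelfish API. poll_for_ack() consumes
     * one pending acknowledgement, if any, and returns true if it did. */
    struct channel;
    void send_unmap_request(struct channel *c, uint64_t vaddr, uint64_t len);
    bool poll_for_ack(struct channel *c);
    void send_ack(struct channel *c);
    void unmap_local_tlb(uint64_t vaddr, uint64_t len);

    /* Initiator: one request per socket, then wait for one aggregated ack each. */
    void unmap_commit(struct channel *socket_aggregators[], int nsockets,
                      uint64_t vaddr, uint64_t len)
    {
        for (int i = 0; i < nsockets; i++)
            send_unmap_request(socket_aggregators[i], vaddr, len);

        for (int acked = 0; acked < nsockets; )
            for (int i = 0; i < nsockets; i++)
                if (poll_for_ack(socket_aggregators[i]))
                    acked++;
    }

    /* Aggregation core on one socket: unmap locally, fan the request out to
     * the other cores on this socket (cheap, over the shared L3), collect
     * their acks, and send a single combined ack back to the initiator. */
    void aggregator_handle_unmap(struct channel *upstream,
                                 struct channel *local_peers[], int nlocal,
                                 uint64_t vaddr, uint64_t len)
    {
        unmap_local_tlb(vaddr, len);

        for (int i = 0; i < nlocal; i++)
            send_unmap_request(local_peers[i], vaddr, len);

        for (int acked = 0; acked < nlocal; )
            for (int i = 0; i < nlocal; i++)
                if (poll_for_ack(local_peers[i]))
                    acked++;

        send_ack(upstream);    /* one aggregated reply per socket */
    }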
And so if you send the messages to the sockets that are furthest away from you first, and then to progressively more local sockets, you get greater parallelism in the messages traveling through the machine. And we call that NUMA-aware multicast. And that scales much, much better. Interestingly, in the NUMA case you can see steps where you go over additional numbers of hypertransport hops. But this brings up a general point. In order to perform this optimization, you need to know a fair bit about the underlying hardware. In particular, for this one we need to know the mapping of cores to sockets -- we need to know which cores are on the same socket, and we can get that information out of CPUID -- and we need to know the message latencies, so that we can send to the things that are furthest away first and then to the things that are local to us. More generally, Barrelfish needs a way to reason about and optimize for relatively complex combinations of hardware resources, interconnects and so on. The way we tackle this is with constraint logic programming. We have a user-level service called the system knowledge base, which is a port of the ECLiPSe constraint engine. It stores a rich and detailed representation of hardware topology, performance measurements, device enumeration -- all the kinds of data we can gather from the underlying hardware -- and allows us to run constraint optimization queries against it. And so in this case, there's actually a Prolog query that we use to construct the optimal multicast routing tree for one- or two-phase commit inside the machine. When we put all that together, this is how unmap scales in Barrelfish. The important thing here for me is not that, you know, we can scale better than Windows or Linux, partly because all of these things could be optimized: Barrelfish is paying the very heavy cost of a local message-passing operation that is quite inefficient, and Linux and Windows could equally well be optimized to do multicast IPI techniques, or at least use broadcast IPIs. The important point for me is that recasting this problem from a shared-memory synchronization problem to a more abstract message-passing operation, like a single-phase commit, and then being able to optimize behind that high-level API for the machine on which you're running, based upon performance measurements and so on, seems like a good way to cast these kinds of optimizations and to be able to adapt to the hardware that you run on. >>: [inaudible]. >> Andrew Baumann: Yeah? >>: So which systems are using IPIs and which systems are using polling? >> Andrew Baumann: Barrelfish is using polling. >>: Okay. >> Andrew Baumann: I can talk about IPIs in a moment. These two systems use IPIs. >>: Do you think that the main thing you're seeing in this is the efficient multicast, or is there a difference between polling and IPI? >> Andrew Baumann: Well, yes, clearly you are seeing a difference here in polling. If there's a user application running on Barrelfish on the cores that you want to unmap from, then there's a tradeoff between time-slicing overhead, how long it takes the core to acknowledge the message, and whether you want to send an IPI. So let me come back to that at the end; I have a slide about it. The other things that we ran on Barrelfish are compute-bound shared-memory workloads. So we had some benchmarks from the NAS OpenMP benchmark suite and SPLASH-2.
These are not very interesting, because they're essentially just exercising the hardware cache coherence mechanisms to migrate cache lines around. They're essentially just here as, you know, proof that even though the operating system doesn't use shared memory, it can happily support user applications that do. So they're mainly exercising the hardware coherence mechanisms, and they're not very interesting, as you can see. Actually, none of these things scales particularly well on that hardware, which maybe says something about the granularity of sharing that these benchmarks assume. The other set of benchmarks that we ran, just briefly, look at slightly higher-level OS services, I/O and stuff like that. We get the same network throughput as Linux; we can run this kind of pipelined web server setup, where we run the driver on one core, an application such as the web server on another core, and a database engine on another core, and pipeline requests across the cores in the system, and we get respectable performance. This is very much apples and oranges, so I'm not going to try to read too much into this, because any small, efficient, static in-memory web server can probably beat something on Linux, even a fast web server on Linux. The high-level point for us here is just that you can build similar services on top of this, and there is no inherent performance penalty from structuring the system in this way, but we really have to get more experience with complex applications and higher-level OS services before we can claim success on this. Enough evaluation -- I have more if you want to ask questions about it -- but what I wanted to show you finally was an example of, first of all, the Barrelfish programming interface that we have at the moment, and secondly, how you can analyze and tune the performance of things in a system that's built this way, because it's quite a different model from a traditional OS. So what I'm going to show you now is a series of traces from user code that is doing this. Essentially you have a domain running on a single core with a single dispatcher. A dispatcher is like our version of a thread, except that it's an explicit upcall model instead of a threads abstraction. And what you want to do is get another dispatcher running on every other core in the system in the same shared address space. So this is typically what happens if you run something like SPLASH or OpenMP: in the setup phase it forks a thread on every other core in the same address space. On Barrelfish, the way that we do this is to iteratively request creating a new dispatcher on every other core. And this is split-phase: there's one operation to say please start creating a dispatcher on another core, and -- because we're using C in, as you know, a very event-driven style -- there's a separate acknowledgement that the dispatcher has been created. So we keep a count of the number of dispatchers created, and then we handle messages until that count reaches the number of dispatchers that we require. So we create N dispatchers concurrently. And here's how the first version of this -- which is also, unfortunately, what's in the current release of Barrelfish that you can download -- scales. And the answer is: very, very badly. What you're seeing here is a 16-way machine.
So you're seeing one line on the Y axis for every core in the system, and the horizontal axis is time in CPU cycles. And the color codings of each bar represent what that core is doing at that point in time. And these orange lines, which you can hardly see because there are so many of them, are message sends and receives, mostly between each core and core zero. And what you're actually seeing in this case is that almost all the time, all of these cores are in the monitor, blocked waiting for the response to a memory allocation RPC. And that's because there's one memory server running on this core which is responsible for allocating all the memory inside the machine. So every time another core needs to allocate memory, it sends a blocking RPC to the memory server on core zero, and so the whole operation is bottlenecked on core zero, and it takes 50 million cycles. So how do you improve this kind of thing? Well, the first obvious thing to do is partition the memory server: have a local memory server on each core that manages some region of memory, and it probably makes sense for those regions of memory to be local to the NUMA domain of the core on which it runs. And that's what we do here. And so you can see that we're now down to 9 million cycles, which is a pretty good improvement. We're still spending quite a lot of time -- there's still quite a lot of communication with core zero -- and we're spending a lot of time just bzeroing all the pages, because what's happening here is we're mapping in and allocating things like stacks, thread contexts, stuff like that, and all that memory needs to be zeroed on allocation. Partly this is just a horribly inefficient bzero that's going byte by byte, and partly we should be doing that somewhere else, not on the critical path. So we fixed that. We also changed the implementation of the monitor program to do more allocations locally, against the local memory server, rather than going back to the memory server on core zero. >>: What does it mean by [inaudible] before you start the monitor? >> Andrew Baumann: Yes. So in particular -- look, in this case -- >>: [inaudible] eventually, right, so you're just zeroing -- >> Andrew Baumann: I'm not saying that this is -- this is assuming that we could move that off the critical path and pre-zero everything; then this is what would happen. In practice, the way we ran this was to zero memory at boot and not reuse it. So this is not in Barrelfish yet, for that reason. And we do a lot better -- and actually I can see a couple of bugs here. One of them is that I don't understand why these three cores are creating things and then it's going up there. But what you can see is that there's still not as much parallelism as you would like, because this core sends a request to this core, then it goes back here, switches to the user program, runs the user program for a while, and the user program requests the next dispatcher be created. Remember that the user program here is iteratively doing this, and each one of these involves a message to its local monitor and a context switch locally. So this is, you know, a standard optimization: aggregate all those things up into a single operation that says create me a dispatcher on this set of cores. And if we do that, then we do much better: that's down to two and a half million cycles and 76 messages. So there are two things that I think are interesting here.
First of all is the way that you analyze and optimize the performance of something based upon message passing. You see dependencies in terms of message arrival and departure. And what you can see here, with these long horizontal lines, is that there's a large queuing delay between a message being sent, sitting in a queue, and actually being processed over here, and so there are still, you know, plenty of improvements that we could make. The other one is that there's a whole set of performance optimizations that are the same as in any system: moving memset off the critical path is something that, you know, people have done for ages, and aggregating and combining low-level operations into high-level operations, like create me a dispatcher on every core, is a sort of typical API optimization. But some of the optimizations are at least easier to reason about, I think, given this explicit view of message-passing dependencies inside the system. >>: [inaudible]. >> Andrew Baumann: Yeah. To be honest, I don't know. There are plenty of bugs in this. So, to wrap up. I've argued that modern computers are inherently distributed systems inside the box, now more than ever, and so this is a good time to rethink the default structure of an operating system. I've presented the multikernel, which is our model of the OS built in this way. It uses explicit communication and replicated state on each core, and it tries to be neutral to the underlying hardware. I've also shown you some of Barrelfish, which is our implementation of the multikernel. Barrelfish is as much as possible a real system, and we're definitely continuing to work on it and build it in that way. I think I can argue that it performs reasonably on current hardware, and more importantly that it shows signs of being able to adapt and scale better on future hardware. So at least from our perspective, the approach so far is promising. And again, another plug for our website: you can find papers, source code, and other information there. To answer Chris's question about polling: all of the stuff -- well, no. In the web server case here, there are situations where you need to do context switches and so on before you handle a message, and there are different domains running on the other cores, so not everything there is polling; but in the other benchmarks I presented, in most cases there were only one or two things running on every other core. So you can ask: how does our messaging latency, given that we're polling to receive messages, compare to something where you send an IPI and thus interrupt the other core? And this is the most frequently asked question; that's why I have a slide about it. First of all, polling for messages is cheap. This might be obvious, but the cache line that contains the next message in the ring buffer is local to the receiver's core until the point in time that the message is sent and the sender invalidates the line. So polling a whole set of message channels is relatively cheap, because all the state is local in your cache. Obviously, if the receiver is not running -- and the other thing I should say is that we aggregate incoming message channels into the monitor domain, and so there is one central place on the core where, if a user domain doesn't happen to be running at the moment, it doesn't have to wait for a time slice if it's blocked. A user domain can block on a channel, and somebody else on that core will poll its channel for it and unblock it when a message arrives.
So it's not that all user domains have to remain runnable in order to poll the channels, but -- >>: But somebody is in the monitor. >> Andrew Baumann: In this case the monitor, and we intend to push that into the kernel so that we can do it on every context switch. So there's a tradeoff between time-slicing overhead, time-slicing frequency, and message delay. But there will be some situations where you need to notify another core, and in order to do that, you need to use an inter-processor interrupt. In Barrelfish we took an explicit decision to decouple the notification from the message transport, because it's very expensive. On our hardware, on these AMD systems, it's something like 800 cycles to send the IPI on the sending side and more than 1,000 cycles to handle the IPI, because the first thing that happens when the IPI arrives is that you go off and execute microcode, which can take some time; then you get into the kernel, then you need to get into user space. The high-level bit is that it's not cheap. You clearly don't want to do this for every message-passing operation. So there's a tradeoff, and we decouple notification from fast-path messaging. There are a number of good reasons to do that. First of all, in particular given the split-phase interface, somebody on one core can express a number of operations. Let's say I'm a garbage collector and I want to unmap a whole region of memory; that may be several unmap calls, and I want to get all of those done with one IPI at best. I don't want that to happen on every message. And that's where the split-phasing comes in: you issue a number of requests and then you block and send the IPI. The other argument is that there are some operations that don't require low latency. Just because somebody's executing an unmap or requesting some other service on that core -- if there's a user program running on that core getting its work done, why should it be interrupted for this remote operation? So there are some situations where you don't require low latency and you can also avoid the IPI. Then there are the operations that actually need the low latency, and at the moment the way we handle that in Barrelfish is with an explicit send-IPI to [inaudible], which is not what you want. What you probably want is something that is not at the lowest level of the message transport but at a slightly higher level, like a timeout that says: if I've sent a number of operations and this domain is blocked waiting for a reply and it doesn't get a reply within some period of time, then send the IPI automatically -- some sort of optimization like that. We haven't done that yet. The other reason to send an IPI is if a core's gone to sleep. If a core has nothing to do, you obviously don't want it to sit there burning power polling channels all the time. So we have a way for a core to go to sleep and set some global state that says I've gone to sleep; if you send it a message, you check the global state and you send it the IPI if it's already asleep. Or we can use MONITOR/MWAIT for that. >>: [inaudible] user model [inaudible]. >> Andrew Baumann: If you're just going into kernel mode? You mean on the receiver side? >>: Let's say both sides were already in kernel mode. >> Andrew Baumann: Then it's cheaper. I mean, a big part of this is context switching. So you're looking at hundreds of cycles if -- because you still have to -- this is not the latency; this is just the number of cycles that you lose by taking an IPI.
It's still exception handling, and it's microcode before that even runs. I don't -- but I haven't -- >>: [inaudible] one of the questions I have is: this monitor really is a trusted component of the system, right? >> Andrew Baumann: Why not put it in the kernel? >>: Yeah, why -- I mean, why not put it in kernel mode, right? You could still do separate -- >> Andrew Baumann: We -- >>: Processes. >> Andrew Baumann: So there's a reason that I presented the multikernel as this abstract thing and Barrelfish as monitor and CPU driver, because we realized there would be good performance arguments for putting that stuff in kernel mode, and -- >>: You know, and you could -- >> Andrew Baumann: It's largely a design simplicity argument. It was the simplest way to build the system. It's nice to have a kernel that is non-preemptable, single-threaded, and handles traps and exceptions serially. >>: Yeah, I agree with that. It certainly makes sense. I guess one of the questions -- you know, so not necessarily arguing that you should use managed code and go the whole Singularity route, but you could still, you know, have multiple processes in ring zero -- >> Andrew Baumann: I think -- >>: Or, you know, shall we say protected -- >> Andrew Baumann: I think probably the approach we will take is putting the most critical parts of the monitor's functionality -- and polling message channels is an obvious example -- up into kernel mode. I think, just from an implementation standpoint, especially if we're dealing with distributed algorithms and message-handling code and stuff, it's easier if you can write that at user level, particularly because you can start to use high-level languages and stuff. >>: So how do hypervisors fit in there, as the ultimate physical resource manager and isolation platform? >> Andrew Baumann: They're not really in the way that we're thinking about the OS -- not because we think they're a bad idea, but because we think that these issues about how you scale across large numbers of diverse cores are orthogonal to whether what you're implementing looks like a hypervisor or an operating system. So I think that most of this model would apply equally well to the implementation of a hypervisor, which has to [inaudible] multiple cores and implement communication mechanisms between diverse cores and some level of system services on those cores, as it would to the implementation of an OS. >>: Because one concern is that, you know, any one of your communication or resource management paths can make the entire machine vulnerable, and if I wanted to, you know, partition the -- >> Andrew Baumann: I'm sorry, vulnerable to what? >>: Bugs and bad implementation. For example, the VM kit, which wants to bring in other operating systems, acting like the hypervisor monitor -- the bugs there, and, you know, a legacy application could then bring down your machine. >> Andrew Baumann: Yeah. I guess to some extent we have not really been thinking about the security and reliability properties of this. I think that if you structure the system with explicit messaging and then you define the messaging protocols, much like channel contracts, you have a handle on isolating and containing things. But we haven't tried to do that in Barrelfish.
In particular, if you were trying to do really strict isolation, you wouldn't want to build a message transport on top of a shared-memory thing that both sides could fiddle with. >>: I'm thinking: what if the hypervisor -- I don't want it to lie to you, I want it to just enforce the limits. So my view of hypervisors in a world of plenty of cores is not multiplexing processors and lying about the availability of resources; it actually tells you what resources are there and provides, like -- like these high-end data center machines where you can actually program hardware registers to separate views of memory and CPU and really hardware-partition the machine. >> Andrew Baumann: Agreed. >>: Yeah. >> Andrew Baumann: The bottom-level thing should be providing you some guaranteed and isolated set of resources rather than [inaudible]. And that's the way we're trying to structure the user environment on Barrelfish as well. Not that we're trying to implement a hypervisor, but the reason for the capability model that we have, and the reason for things like the system knowledge base, is so that you can say I want to have this chunk of memory, and I know what cores I'm running on, and then the user-level optimization process has more knowledge to work with. >>: And then -- no lies from the lower-level software. >> Andrew Baumann: Yeah. >>: And then you develop according to a dynamically changing resource environment. >> Andrew Baumann: To some extent it's the [inaudible] with the abstractions. >>: Can you go back to the earlier numbers -- the polling numbers -- and the argument that most of the [inaudible] of monitoring to [inaudible] those numbers to be small. But in the end, most of that message passing is happening because of activity in user space anyway, isn't it? The application is running up there, and anything it would be doing would cause it to go into monitor mode and send the message? >> Andrew Baumann: Yeah. Not necessarily, because as soon as you're not running one application across the whole machine, you have different sets of applications. There are system services, with which you need to communicate, that may run on different cores and may not run on the same core all the time. So an application uses the file system; the file system is trying to provide some [inaudible]-like consistency guarantees -- when I execute this write, nobody else has executed this write -- and some other application on some other core had that file open at some point in the past, and so now we need to go and communicate with some replica of the file system running on that core. So there would be cases where you need to communicate with things that aren't just the application on the other cores, if that makes sense. >>: So the application [inaudible] change if the application is more [inaudible] can use multiple cores -- you can still run that on these multicores, but then that can mean that the application has to have its own shared-memory -- sorry, message-passing -- communication mechanism in there, or rather that the application itself is inherently using shared memory. >> Andrew Baumann: If the application is inherently a shared-memory application, the API to the application on Barrelfish looks very similar to the API on something like Linux or Windows.
We have a thread model at the moment that's implemented in user space, but it looks like a pthreads kind of API. We have shared-memory mapping: you can map memory and share an address space across multiple cores. The application doesn't really see this stuff. Some of the application services are implemented in libraries in the application's address space instead of up in the kernel, but, you know, the API is similar. I think what we're targeting is more future application programming environments: APIs of applications implemented at some higher level of abstraction than pthreads, where we can adapt the runtime environment to understand the underlying hardware and to communicate, perhaps with messages, for scalability, or at least use the split-phase primitives that we offer in the operating system. >>: So some of this could be abstracted to the [inaudible] application? >> Andrew Baumann: Yes. >>: So just timeline-wise, how long of [inaudible] would you see before the learnings from this would be ready to integrate into -- >> Andrew Baumann: I don't know. >>: [inaudible] and go. >>: I mean, you're building a separate OS, and that's good as an [inaudible], but at some point we would like to see some form of this scalability come back into Windows. Is that something that you think is three years away, 10 years away? >> Andrew Baumann: It's hard for me to say. Three years away does sound a bit tough. >>: Okay. >> Andrew Baumann: At least initially, our focus has just been on: this is where we think hardware is going; how would you want to build an operating system, starting completely from scratch, independently from everything else; how would you build an OS for that, and see where that takes us. And then the hope is that, you know, if this works, there may be pieces of this that you can somehow isolate and put into an existing system. You know, one example of that is just things like machine-specific multicast unmap. Maybe that in its current form doesn't make sense, but you can imagine taking these bits and pieces and putting them into an existing system. What we're more interested in is how the underlying fundamental structure of the OS plays into how you can build things on a multicore. So that's a bit of a non-answer, but it's also not our current focus, at least on our side. You might have to ask the Cambridge folks about what they're thinking in that direction. >>: Do you [inaudible] come to be used to optimize the queue size for message passing? >> Andrew Baumann: No, but that's a really good idea. >>: So that would -- I think that would be the biggest -- I don't [inaudible]. Do you think it's possible? Do you think -- >> Andrew Baumann: Specific support for message passing? >>: Well, you're doing it right now. >> Andrew Baumann: Yeah, I mean, that's a fundamental assumption. >>: What I meant was: do you think it's possible to do it automatically, rather than rewriting by hand for every -- so AMD switches their CPU topology, you run your constraints, however you figure out what the topology is, and at the end you generate code for the message passing under the covers, so you don't have to support -- >> Andrew Baumann: Hang on, let me say this. There are two different parts here. There's the transport implementation, which is written by hand and [inaudible]-specific; that's given one message channel between one core and another core and, you know, either hypertransport or QuickPath or whatever underneath, microoptimized to get a message there efficiently.
That we expect to write by hand and not generate. Then there's the next level up, which is: how do I do something like a single-phase commit over a set of message channels between a set of different cores, with multicast? And that we would probably like to be able to generate. Right now the way we do it is that we have a C implementation that is given a -- we call it a routing table, but it's a little bit more than a routing table; it's a message send-order sort of table that says in what order I send to which other cores. And that thing is generated from the knowledge base. You know, you could compile that code dynamically -- >>: I'm wondering -- >> Andrew Baumann: Oh, you're interested in the [inaudible]. >>: Yes, I'm wondering if that might be possible. If it was possible, that would be a huge piece, making it easier for a [inaudible] to take something like this on. >> Andrew Baumann: So you can see this message transport implementation as an interconnect driver. >>: Yes. >> Andrew Baumann: The interconnect -- there are not that many interconnects, and I don't think they change that fast. The reason the diversity is a problem is because the combination of interconnect, cross product with topology, cross product with different machines -- that is changing all the time, and it's different on every machine on which you boot. But we can distill it down to the interconnect. We have QuickPath, PCI Express, you know. Maybe there's that order of -- >>: [inaudible] already mapped these out. >>: Yeah, but I think the high-order bit is that product guys get really nervous about user-mode code, and your user code [inaudible]. >>: [inaudible]. [laughter]. >> Galen Hunt: Why don't we -- shall we terminate? Anyway, shall we thank the speaker; we have hit an hour. [applause]. >> Andrew Baumann: Thank you.