>> Galen Hunt: Good morning. It's my pleasure to introduce Andrew Baumann, who is interviewing with our group, with the operating systems group, for a position, a research position. Andrew did his PhD at the University of New South Wales with Gernot Heiser, and for his PhD he did some work on live upgrading of operating system components, right? >> Andrew Baumann: Yeah. That was a crazy idea. >> Galen Hunt: That was a crazy idea, which is why one should do it for one's PhD and not afterwards. [laughter]. And then for the last two years he's been at ETH Zurich with Mothy, Timothy Roscoe, doing the Barrelfish system, which is what he's actually here to talk about. And I will say this about Andrew: Paul Barham and I have a -- we joke that the qualification for an OS researcher is that they should know what a TLB is and how to use it. And Andrew has met many TLBs and loves them all. [laughter]. >> Andrew Baumann: I wouldn't say loves them all. >>: [inaudible]. >> Andrew Baumann: Yeah. So thanks, all, for coming. If you have questions as I go, feel free to ask. I'll try to manage the time. A couple of brief words of introduction, just following on from what Galen said, before I start talking about Barrelfish. I'm interested in lots of core OS research issues and increasingly, largely because of what we've been doing in Barrelfish, in distributed and networked systems. As well as the dynamic update stuff, I've also worked on a single address space operating system that we were building called Mungi. I've done some work on tracing and performance monitoring in the K42 multiprocessor OS at IBM Research, which was also related to the dynamic update work. And since moving to ETH, I've worked on a couple of things, including a study of OS timer usage and a runtime for self-hosting and self-organizing distributed systems. But mainly I've been working on Barrelfish, which is an OS for heterogeneous multicore, so that's what I'm going to be talking about today. For those of you that saw the talk I gave here after SOSP, some of this is probably going to be fairly familiar, for which I apologize. But feel free to ask questions and, you know, ask about things you think I'm not covering. The goals of Barrelfish are to figure out how an operating system for future multicore systems should be structured. And that includes running a dynamic set of general-purpose applications. We think the real challenges of large, scalable and heterogeneous multicore occur when you have these things in your laptops and your desktops, on the computers that you're using every day. And in particular, because of this, to reduce the code complexity that is required for building a system like this. And so the challenges are obviously scaling to a large number of cores. The number of cores in the system is increasing at rates approaching Moore's law. How do you deal with that? But, moreover, tackling and managing the heterogeneity and hardware diversity that's starting to arrive in this space. And I'm going to talk more about why these things are heterogeneous and why that's a problem. So let me illustrate this with some examples of current multicore chips on the market. On the left you have the Sun Niagara. It has a banked L2 cache that's connected to all the cores on the chip by this crossbar switch. And what that means is that any region of the L2 is equidistant in terms of access latency for any of the cores on the chip.
And so shared memory algorithms that rely on relatively fine grained sharing, and rely on directly manipulating state that's shared by a large number of threads on a large number of cores or hardware threads, work quite well on this chip as long as they stay within the L2 and they stay within a single chip. The same is not true of these other processors over here. On the top is the current AMD Opteron, where you have a large, shared, slow L3 cache that's almost double the access latency of the L2. And so cache lines that are in the L3 will be much slower to access. And then here you have the Beckton, where you have this [inaudible] -- it has a ring network inside the chip to access regions of the cache that are not local to the core that's accessing them. The important point here is that the performance tradeoffs of these three chips are all quite different. The ways you would go about optimizing scalable multithreaded multiprocessor shared data structures for these three things are very different. And two of them are even x86 and run the same operating system. So there's a sort of challenge here for system software to be able to cope with this diversity. Another place in which you see diversity is the interconnect architecture that connects the sockets inside the system to make up a multicore system. And this is an example here of an actually relatively old now, two years old, Opteron system that we have in our lab that has eight sockets, and each socket has four cores. And you can see that these blue links between the sockets, which are HyperTransport links, make for a fairly unusual, you might just say insane and ridiculous, interconnect layout. And the topology of that network plays into the performance of the machine. As you run shared memory scalability benchmarks and you go beyond about 20 cores, you start to see all sorts of interesting effects because of this crossing. And you also see effects like contention occurring on these two links here. And depending upon the pattern of memory accesses that you do inside the machine, and where you're running and with whom you're sharing, the layout of the interconnect actually matters. And the hard point -- the important point is that that's just that box, and there are many other systems on the market with different layouts, different topologies. And right now no system software is really trying to take account of this. And it's not even clear how it can. >>: I like the floppy disk drive. >> Andrew Baumann: Yeah, well, this is -- [laughter] this machine still does have a floppy disk drive and a floppy disk controller. It's completely insane. [laughter]. >>: [inaudible] why you're overloading your links. >> Andrew Baumann: That's not why the links are being overloaded, no, that's true. >>: [inaudible]. >> Andrew Baumann: Yeah. It's because this is one board and this is a [inaudible] and there are two risers here that sandwich the [inaudible] on top of the other board. So, yeah, reasonable sort of engineering decisions, but the end result is bizarre. And now that's that system. But the challenge is that as die sizes shrink and as more of this stuff moves onto a chip, you start to see the same kind of effects within a chip -- I'm sorry. I should say quickly, again, as an example of different interconnects, this is the eight-socket Nehalem interconnect topology, which is arguably much more sane. As die sizes shrink and all of this stuff starts to move onto a chip, you see the same kind of networking effects occurring between the cores on a chip.
And again, there's no clear model for what the interconnect here should look like. On the top I'm showing you the ring network on the Larrabee, and on the bottom is the mesh network on the Tilera TILE64 processor. And again, depending upon what this looks like, it has a big impact on how you scale software on these things and how things perform. So finally, we're starting to see diversity between the cores that you have in a system. Today that's relatively specialized. You can buy a machine today that might have a programmable network interface with a general purpose processor running close to the network, where you can offload application processing. Or you might have GPUs inside the system, and there are all sorts of applications that can make use of GPUs for application processing. And there are even FPGAs that you can plug into CPU sockets, such as in that Opteron system I showed you before -- you can buy FPGAs that sit on the system interconnect. And that's today. In the near future it seems quite likely that we'll start to see this kind of diversity between cores on a single die, either because of performance asymmetry -- there's a lot of arguments that say that for reasons of power efficiency you will have a small number of large, fast, power hungry, out-of-order cores and a much larger number of slower, simpler, in-order cores -- or you might have some kind of special purpose logic, because special purpose logic is more power efficient and it can be switched on and off depending on what you need. There's a lot of different visions of the future that all involve special purpose logic, special purpose core types. Also you might simply find that some cores leave out various features of the architecture, like streaming instructions and virtualization. And so there's a challenge here for low-level system software that can do sensible resource allocation and management of this divergent set of hardware with complex communication paths between it. And this is the kind of problem that we're trying to tackle with Barrelfish. So to summarize, all of this is increasing core counts and increasing diversity. And unlike many previous systems that have scaled to large numbers of cores, typically in the supercomputing space and the high performance space, once this stuff enters general purpose computing you can't make nearly as many optimization decisions and you can't bake into your system at design time the way that you're going to use this hardware and the way that you're going to scale to this hardware, because it's all going to be different. And so there's a challenge here for system software to be able to adapt to the hardware on which it finds itself running, be it the number of cores, the topology, the interconnect, or the available hardware resources. And finally, to show you two sort of current examples of where research processors are going in this space, I'm going to show you Rock Creek and the Beehive. This is Intel's single-chip cloud computer. And if you get past the buzzword, what this thing is is a research chip that Intel announced, I think, in December. So it's quite recent. And they've actually built this, so it's on silicon. You have 24 tiles, two essentially Pentium cores per tile. What's interesting about this chip from a research perspective is, first of all, simply the number of cores. Second of all, the interconnect, which is quite configurable and high performance relative to the performance of the cores.
But most importantly, the fact that none of these caches are coherent with anything else on this chip. Instead there are some hardware widgets to make it easier to do core to core message passing, and software is expected to deal with this without having cache coherent memory. And the reason for building the chip in this way is because there's a general belief that cache coherence won't scale to the large number of cores that you'll see on these future multicore chips. And so this is an experiment to say: can software make good use of a chip that's built in this way? Another one which I hope you've heard of is Chuck Thacker's Beehive processor that's being developed by the group in Silicon Valley. And this thing is another research chip, and they have a bunch of ideas about doing combined hardware-software research on FPGAs and so on. But this is also interesting from a Barrelfish perspective, again, because it's very different. In this case what you have is a variable number of RISC-ish cores on a ring, and you have explicit message passing operations, more so than on Rock Creek, to pass messages between software on the cores around the ring. It also is not cache coherent and also has a very interesting memory access model which I won't go into in this talk. It's another example of a current research architecture that is interesting to think about from a system software perspective: how do you build something like an operating system for this kind of processor? And so because of all these trends in architecture and in computer design, we think that now is an excellent time to rethink the default structure of an operating system. Today's operating systems are by and large a shared memory multithreaded kernel that executes on every core in the machine, or every hardware thread in the machine, and synchronizes and communicates using state in shared memory protected by synchronization mechanisms like locks or other shared memory primitives. And anything that doesn't fit into this model, either because it's not on coherent memory, or because it's a heterogeneous core type, or because it can't be expressed in this model, is abstracted away behind a relatively low level device interface. And so we propose structuring the OS as a distributed system of explicitly communicating, independent cores. And we call the model for an OS structured in this way the multikernel. And the design principles for a multikernel are that we make intercore communication explicit, we make the structure of the operating system neutral to that of the underlying hardware, and we view all state as replicated inside the machine. And in the rest of this talk I'm going to go into a lot more detail about those three design principles and what the concrete implications of them are. I'm going to show you Barrelfish, which is our implementation of a multikernel. I'm going to show you some parts of an evaluation. And I'm going to wrap up by talking about where we're taking this and some interesting ideas for future work. So the first design principle was making all the communication between cores explicit. And what I mean by that is that all intercore communication uses explicit message passing. There's no assumption of shared state in the model. That's a fairly sort of bold statement, I think, and certainly very different from the way people structure operating systems and system software today. If we can make it work, it has some big advantages.
First, most importantly, it lets you have a much looser coupling between the structure of your system and the concrete details of the underlying hardware, in terms of how the hardware implements communication. So you explicitly express communication patterns between cores in your system, rather than implementing a scalable shared memory primitive, where you have to think about: to implement this system primitive, what data structures do I need to mutate, how do I protect the correctness of those operations with respect to all the other processes in the system, and then, in order to make that scale well, how do I ensure that my operations are consistent, how do I ensure that I'm mostly making local memory accesses, how do I optimize things for placement in cache lines, where am I likely to take a cache fault? Instead you think about it at the level of: based on my local, purely local state, and what I have previously heard from the other processors in the system, with whom do I need to communicate to perform some operation? Then you do some explicit message based communication with those processors, and you can map that onto an underlying hardware primitive, be that either a shared memory primitive or an explicit non-coherent message based primitive. Message based communication naturally supports heterogeneous cores, or cores which may be behind non-coherent interconnects. Again, think about the case of the offload core on the other side of the PCI Express interface. You might like to run some part of your OS over there and communicate with it, even if PCI Express doesn't offer you coherent sharing. It's a better match for future hardware like the processors I showed you before, because that hardware might have explicit message passing -- so as well as the Beehive and the SCC, the Tilera processors have explicit hardware support for message passing -- or because that hardware may not have cache coherence, or cache coherence may be very painfully slow. Message based communication allows the important optimization of split phase operations, by which I mean that you can decouple the operation that requests some remote service -- that invokes some remote communication -- from the handling of the response, and you can do something useful in the meantime. That's much harder to do on top of a coherent shared memory primitive. And I'll show you an example of that in a minute. And finally, we can reason about it. There's a wealth of formal frameworks about the correctness and the behavior of systems that use explicit message based communication. So the standard sort of OS response to this is: that's nice, but the performance is going to be terrible. The machines that we have today are designed to provide a shared memory model, and they're optimized for that model, and the performance of any message based primitive on top of this is going to be painfully slow. And there's some truth in that. What I'm going to show you now, however, is a simple microbenchmark that is simply exploring the cost of that tradeoff. In particular, it's exploring how much the shared memory model really costs you if you want to push it and you want to use it for sharing. In this experiment, I'm going to show you two cases. I'm going to show you the shared memory case and I'm going to show you the message case. In the shared memory case, we have a shared array that is in coherent shared memory and we have a variable number of client cores that want to mutate the state in this array.
So they're performing read-write updates on that shared array. The performance of this operation, in terms of the time it takes to perform each update or the throughput in updates that you can achieve, is going to depend upon two things. It's going to depend upon the size of the data in the array and the size of the data that's mutated by each update, in terms of the number of cache lines that I need to touch to perform an update. And it's going to depend upon the contention, in terms of the number of other cores also trying to mutate the array. And that's because every time one of these processors tries to change something in the array, it executes a read instruction or a write instruction on the shared array. And what happens is the processor pipeline stalls -- barring the ability of the processor to get any more parallelism out of the pipeline, it stalls while the hardware cache coherence protocol migrates the modified cache lines around between the processors that are all attempting to modify the cache lines. And that migration, either the fetch or the invalidate that the cache coherence protocol has to do, is limited by the latency of the round trips across the interconnect. And so here's how it actually performs. This is on a 16-core AMD system. This is what happens when every core in the system, for a variable number of cores, is trying to mutate one cache line in the shared array. And this is what happens when they need to manipulate two, four, and eight cache lines. You can see that this doesn't scale particularly well, either in the number of cores or in the size of the modified state. And in particular, what's happening over here, all of these extra cycles as you increase the contention in terms of the number of cores, all of these extra cycles are stall cycles. What's happening is the processors are executing the same number of instructions. The cycles are just going away stalled, waiting for the cache coherence protocol to do the fetches and the invalidates. Now, this is obviously a worst case for sharing. But it shows you how much you can lose to a coherent shared memory model when you are explicitly manipulating the same piece of shared state. So that's not particularly good. If we look at the message passing case, what we do on the same machine -- there is no hardware message primitive. All we have from the hardware in terms of communication is very slow interprocessor interrupts and coherent shared memory. So the question is: how can you build an efficient message based primitive on top of what the hardware provides you? What we do is we implement a message channel that uses coherent shared memory as the underlying transport. So then it comes down to the question of, knowing the details of how the hardware implements cache coherence, how can you most efficiently get a fixed size message from one core to another? And on current systems that [inaudible] down to essentially a ring buffer in shared memory, where you're very careful to make sure that every message send involves as few coherence operations as possible. In our case, this boils down to one fetch and one invalidate for every message sent. And so in the message based case, we have these message channels that allow efficient communication. And we encapsulate this array behind a single server core that is responsible for performing all the updates on the array. And so now when a client wants to modify the array, rather than directly manipulating the shared state, it describes its operation as a message.
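To make that concrete, here is a minimal sketch in C of the kind of cache-line ring buffer channel being described. This is not the actual Barrelfish transport: the names (ump_msg, ump_send, ump_poll), the sequence-number scheme, and the omission of flow control and acknowledgements are all simplifying assumptions for illustration. The idea it shows is that each message occupies exactly one cache line and is published by a single store, so a send costs the sender roughly one invalidate and the receiver one fetch.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define CACHELINE 64
#define SLOTS     64                     /* ring size */

/* One slot is one cache line: 56 bytes of payload plus a sequence word,
 * which is written last so the receiver sees a complete message. */
struct ump_msg {
    uint64_t payload[7];
    uint64_t seq;
} __attribute__((aligned(CACHELINE)));

struct ump_chan {
    struct ump_msg *ring;   /* shared region mapped by both cores */
    uint64_t pos;           /* private cursor into the ring       */
    uint64_t seq;           /* private last-used sequence number  */
};

/* Sender side: copy the payload, then publish it by bumping the
 * sequence word. The release store is what invalidates the line in
 * the receiver's cache. No flow control here -- a real channel needs
 * acknowledgements so the sender cannot overrun the receiver. */
static void ump_send(struct ump_chan *tx, const uint64_t payload[7])
{
    struct ump_msg *slot = &tx->ring[tx->pos];
    memcpy(slot->payload, payload, sizeof slot->payload);
    __atomic_store_n(&slot->seq, ++tx->seq, __ATOMIC_RELEASE);
    tx->pos = (tx->pos + 1) % SLOTS;
}

/* Receiver side: poll the next slot's sequence word; when it changes,
 * the fetch pulls the whole cache line (and thus the payload) across. */
static bool ump_poll(struct ump_chan *rx, uint64_t payload[7])
{
    struct ump_msg *slot = &rx->ring[rx->pos];
    if (__atomic_load_n(&slot->seq, __ATOMIC_ACQUIRE) != rx->seq + 1)
        return false;                    /* nothing new yet */
    memcpy(payload, slot->payload, sizeof slot->payload);
    rx->seq++;
    rx->pos = (rx->pos + 1) % SLOTS;
    return true;
}
```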
We assume that the request it wants to perform can always be described in a single 64-byte cache line. It writes that request into the message channel. The server core processes the operation on its behalf, manipulates the array and sends back the reply. And so for the client here, essentially, this is a blocking RPC. And while the operation is being performed the client [inaudible] waiting for a reply. So this is how it performs. This is what happens when the server modifies one cache line worth of state in response to every client request. And this is what happens when the server modifies eight lines in response to every client request. You can see that they're almost identical, because all of that shared state is now local in the server's cache; all of the updates are local to the server. You can also see that on this machine, at this point here for this benchmark, we're already crossing over at four cores and four cache lines, where the overhead of sending and receiving messages wins back over the overhead of migrating these lines around in the shared memory case. But what's more interesting is if you look at the cost of each update at the server. This is the time that the server experiences to perform each modification. As you'd expect, it's flat and it's low, because it's all local to the server's cache. You can then infer that the reason this line here is increasing is that there's queuing delay in the message channels. The clients are offering more load to the server than the server can satisfy. And so the clients are experiencing a queuing delay while their request sits in the channel while the server is busy processing all the requests. >>: [inaudible] speed up in the system, right. >> Andrew Baumann: As you add cores, yes. It's completely centralized. You satisfy all the requests on a single server. But we're still doing better already, even though it's all centralized. >>: [inaudible] slows down. You more or less slowed up. >> Andrew Baumann: Well, everything ends up being serialized at some point. That's because we're mutating one array. Either the coherence protocol serializes it or the server serializes it. So if you -- and if you compare the cost of these two lines, something like this here, all of this time here is where the clients are blocked waiting for a reply. But unlike the shared memory case, they're actually executing instructions. So you could use those to do some of this work if you had a nice asynchronous RPC primitive where you send out the request, you do something else and then you handle the reply. >>: [inaudible]. >>: [inaudible]. >>: Well, I'm asking. >> Andrew Baumann: We don't use interprocessor interrupts except for the case when a whole core wants to go to sleep. Otherwise what we do is we have a bunch of incoming channels and we [inaudible] serve them by polling, and when there's nothing else -- when there's nothing else to do, then you can go and spend the time polling. >>: I've had some conversations with the Barrelfish team about ways to reduce the overhead, which here seems to be on the order of a few hundred cycles each way. That could be taken down by at least an order of magnitude [inaudible]. That would help. >> Andrew Baumann: I mean, this hardware is not built for this at all. >>: I'm curious to -- this is a comparison between, you know, a smart message passing solution and a sort of naive shared memory solution. >> Andrew Baumann: I wouldn't call this message passing solution particularly smart.
It's: centralize everything on a single core and send the requests to that core. >>: But I'm more concerned about the shared memory. >> Andrew Baumann: Sure. >>: So like [inaudible] are using MCS locks there, people who try to request -- >> Andrew Baumann: There's no locking here at all. I'm assuming the perfect lock -- >>: That's the problem, right? >> Andrew Baumann: Consists [inaudible]. >>: [inaudible] then you might be queuing up [inaudible]. >> Andrew Baumann: Yes. But the overall throughput could only be worse. >>: Why? >> Andrew Baumann: Because the lock is [inaudible] the [inaudible] protocol is serializing everything. And I'm doing the minimum number of operations, in terms of writes, to perform each update. If I added a lock, which I would have to do if I was going to build a real data structure, the lock would serialize the clients instead of the coherence protocol doing it explicitly, but I'm adding extra operations to acquire and release the lock. I'm doing more coherence traffic than I would be in this case. >>: I was just curious. >>: [inaudible] cost go down [inaudible]. >> Andrew Baumann: There's a little bit of an effect here where the server sometimes [inaudible] goes to sleep, whereas here it's completely busy. >>: So [inaudible] but on shared memory my understanding is that if only one guy writes and many others are just trying to read, you don't have to wait -- >> Andrew Baumann: The readers have to wait. If there's one writer and many readers, the readers have -- >>: You don't have to -- unless it's the writers, then you have to -- the cost of the -- like the unit cost of a read is much simpler -- much lower than having to send a message to the server and say, tell me what's in that memory. Is that the case? >> Andrew Baumann: So in general, I mean, there are different ways of implementing cache coherence. But in general what will happen is when the writer writes, it's local and modified in that cache, and any reader will have to fetch the line from the guy who wrote it last; typically that's what happens, unless it's been flushed out to memory. So in the typical protocol, where you have one writer and many readers, the readers experience the delay of the coherence protocol. >>: I know -- >> Andrew Baumann: Oh, no, sorry, that's a lie. The writer, depending on what states the [inaudible] protocol offers, the writer will have to then do an invalidate for every write, because the readers pulled it down to the shared state. >>: I'm trying to remember [laughter]. >>: I'm curious though about the argument of doing more coherence [inaudible] I'm not sure that the hardware's actually doing a very good thing if everybody tries to -- I'm not sure that the serialization you come back with is actually -- it could be worse than what we get if we [inaudible]. >> Andrew Baumann: The one thing the lock would let you do is it would let you do a kind of split phase thing, I think. You could -- enqueue -- you could have some kind of backoff lock where you enqueue yourself in a queue and somebody will do an explicit notification. MCS -- MCS by default doesn't do that; the waiter enqueues himself and then spins on a variable in the queue node. >>: [inaudible] its own location, that nobody else touches, until it's available. >> Andrew Baumann: Right. So he can -- right. He can then go on to some other operations, that's true. >>: There's an interesting discussion. I think it's rather moot because the [inaudible]. We will have to [inaudible].
>> Andrew Baumann: I think in general these things -- I mean, message passing and shared memory are duals, right? And for every optimization and every way you think about restructuring the system there's usually a dual. Sometimes it's obvious; sometimes it's not. And a lot of what it comes down to is just a different way of thinking about the problem, more than necessarily better or worse. The one motivating reason for the message passing case is that if you have hardware that doesn't do coherent shared memory, it's much easier to map a message passing primitive onto that than it is to map a coherent shared memory abstraction onto a message passing primitive -- cf. distributed shared virtual memory. And that's the argument for structuring the system around that. >>: Phil and I will agree with you. >>: Well, the other thing is if you're building pipelines, if you're building pipelines out of heterogeneous processors, or homogeneous even, shared memory is not the best way to coordinate that. >> Andrew Baumann: So that was hopefully the most contentious design principle. The second one is to make the structure of the operating system neutral to the underlying hardware. And what we mean in practice is that the only hardware specific parts of a multikernel are the message transports, which, like the one I've just shown you, have to be specialized to the details of the underlying hardware, and the traditional parts of the operating system that are hardware specific, like device drivers and what we call the CPU driver, which is the part of the operating system that deals with the MMU and all the privileged state in the CPU. In particular, we don't have to implement efficient scalable shared memory primitives that are different depending on things like cache size and topology and available synchronization primitives. And so that allows us to adapt to changing performance characteristics of different machines. You can late bind the concrete implementation of a message transport, and even some of the protocol layers that you layer on top of that, for primitives like how do I do an efficient broadcast or how do I do efficient group communication, based on knowledge of the underlying hardware. The third design principle is viewing state as replicated. And this kind of falls out naturally from the message passing model, but it means that any potentially shared state that is required by multiple cores in the system is always accessed as if it were a potentially inconsistent local replica. That includes things like scheduler run queues, process control blocks, file system metadata, system capability metadata -- all of the state that the operating system would need to keep consistent across multiple cores, because you can't keep it in shared memory. Again, that naturally supports domains that don't share memory, naturally supports things where you don't have hardware coherence, and it arguably more easily supports changes to the set of running cores. If you're bringing cores up and down for power management reasons, or if you're hot plugging devices that have cores on them, and you view this as a replica maintenance problem -- instead of, if I turn off the power while this lock is held, how do I recover this particular state -- arguably it gives you a cleaner way to think about that problem. So as I've presented it, you can see this sort of logical spectrum of approaches to making operating systems scale, to making system software in general scale.
Typically you start from this point on the left, where everything is shared with one big lock, and you progressively use finer grained locking. Maybe you notice that locality is a problem, so you either partition your data or you introduce ways of doing replica maintenance. Clustered objects in K42 are one way of doing this on top of the shared memory abstraction. To the multikernel, where we've dropped -- jumped to this extreme end point of purely distributed state and nothing but replica maintenance protocols. And a lot of current sort of trends in system software, particularly operating systems, are gradually moving in this direction to scale up. You can sort of logically see this as having jumped to the end point of that scale. In reality, that's not a very good place to be, because there are going to be some situations on hardware where it's much cheaper to share memory than it is to send messages. Cases in point being things like hardware threads, or tightly coupled cores that share a cache, or maybe even, depending upon the [inaudible] that you're maintaining, maybe even a small coherence domain. So in reality, that's the model, but behind the model we can locally optimize back using shared memory between local core pairs where necessary. And so we see sharing as a local optimization of replication -- of messages -- rather than messaging and replica maintenance as a scalability optimization of sharing. So you might have a shared replica for hardware threads or a closely coupled set of cores, where when you invoke some operation, rather than actually sending a message to that core, all you'll do is take a lock and manipulate the shared copy of the state that's shared by you and the other core. The important point is that that's purely local. And you can even make that optimization at runtime, on the basis of performance information or topology information that you find out about the machine. And the basic model that's visible from the API side remains this kind of split phase, replica maintenance style -- it just sort of happens that sometimes the callback will come back immediately because it was local. So if you put all that together, this is the logical view of an operating system constructed as a multikernel. Inside this dotted line, where you would typically have your shared memory kernel running across all the cores, you instead have a separate instance of what we call an OS node running on every core, maintaining some possibly partial replica of logically global state inside the system. And these nodes exchange messages to maintain the consistency of those replicas. Note two things. Note that you can customize the implementation of an OS node to the hardware on which it runs. There's no particular reason that they need to be the same architecture, as long as they can interpret the same messages. And note that where the hardware supports it, you know, applications can happily continue to use coherent shared memory across some subset of the cores inside the machine. But the operating system won't -- at least the lowest level of the operating system shouldn't be -- relying upon this in order to function. So that's the model. Barrelfish is our concrete implementation of the multikernel. It's written from scratch. We've used a few bits and pieces of library code, but otherwise it's all written from scratch. It's open source.
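Going back to that split phase, sharing-as-a-local-optimization point for a moment, here is a hypothetical C sketch of what it can look like to the calling code. None of these names come from Barrelfish, and the spinlock and the helper functions are stand-ins; the point it illustrates is that the caller always goes through the same asynchronous, callback-based interface, while the choice between a lock on a locally shared replica and a message to a remote OS node is a per-peer decision made from hardware knowledge at runtime.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct update;                             /* describes one state change        */
struct msg_chan;                           /* message channel to a remote node  */
typedef void (*completion_fn)(void *arg);  /* invoked once the update is applied */

struct replica_peer {
    bool shares_cache;        /* decided from topology info at boot time      */
    atomic_flag lock;         /* protects the shared replica, if there is one */
    void *shared_replica;     /* valid only when shares_cache is true         */
    struct msg_chan *chan;    /* used otherwise                               */
};

/* Assumed helpers: apply an update to a local replica, or ship it to a
 * remote OS node whose reply will trigger the completion callback. */
void apply_update(void *replica, const struct update *u);
void chan_send_update(struct msg_chan *c, const struct update *u,
                      completion_fn done, void *arg);

void replica_update(struct replica_peer *p, const struct update *u,
                    completion_fn done, void *arg)
{
    if (p->shares_cache) {
        /* Local optimization: the peer shares a cache with us, so mutate
         * a shared replica under a lock. The caller still gets its
         * completion callback -- it just fires immediately. */
        while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
            ;                                   /* spin */
        apply_update(p->shared_replica, u);
        atomic_flag_clear_explicit(&p->lock, memory_order_release);
        done(arg);
    } else {
        /* Default path: split phase message to the remote OS node; the
         * caller can do useful work until 'done' is invoked. */
        chan_send_update(p->chan, u, done, arg);
    }
}
```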
It currently supports 32 and 64 bit x86 processors, but we're in the process of porting it to a couple of other architectures, including Beehive, which is running in user mode on a single core, and they're starting to work on the messaging. With Rock Creek we've been more involved at ETH; we had one student who got to spend one week being babysat in an Intel lab with a Rock Creek board. And so we can actually boot on up to 12 cores -- one string of cores -- before some strange race condition in the messaging kills us. And we're hoping to go back next month and figure that out and get the rest of the system up. >>: So why did you do it from scratch? Is there something -- why didn't you instead start with an existing one and say, okay, here are the top hundred shared data structures where it would be useful to [inaudible]. >> Andrew Baumann: Because at the end of the day you couldn't boot that on something that didn't have -- that didn't support coherence. >>: Could you do static analysis and turn all the references to global data -- to shared memory and to accrued -- accrued -- and then just optimize the ones that were -- >> Andrew Baumann: I guess you could do that. But I think that -- >>: I'm asking, suppose you worked with a company with -- >> Andrew Baumann: [inaudible] existing operating system, how would you go about applying this? I think -- I think the approach that we're taking is very much a sort of, you know, from-scratch construction. If you want to build a system like this, first of all, does it make any sense? And even if I were in a company like this that had a large monolithic system, I wouldn't start with an idea like this by trying to modify that to be like this. I'd first want to know if this just makes sense, just as a starting point on its own. And you can see this approach as being like that. If this works, then it would make sense to try to apply it. But first let's just see if it works at a small scale. One other interesting note about portability here is that it's unlike porting a traditional OS, where you have the architecture as a compile time constant and you know which architecture you're compiling for. Porting something like Barrelfish is very different, because you have potentially different parts of your source tree for different parts of your system being compiled for all these different architectures at the same time, and then maybe even linked together into one image. That's not hugely interesting as a research challenge, but it is kind of interesting as a software engineering challenge, where we've had to rethink a lot of how you actually structure system software. There's a lot more dynamic nature to this than your traditional operating system. Who's involved? I keep saying we. We is the systems group at ETH Zurich and also Microsoft Research in Cambridge. I should acknowledge a large bunch of folks who have been involved in this. On the ETH side, Pierre Dagand was a former intern; Simon Peter, Adrian Schupbach and Akhilesh Singhania are three PhD students working on the Barrelfish project; and Mothy Roscoe, as Galen mentioned, is my boss. And then we've had the support of, and we've worked very closely with, a large group of folks at Cambridge as well: Paul Barham, Richard Black, Tim Harris, Orion Hodson, who used to be here, Rebecca Isaacs, Ross McIlroy, who also used to be here. And we've actually been surprisingly successful at working closely between the two different groups. So all of those folks have contributed to everything in this talk.
When you put it all together, this is what Barrelfish looks like. And if you compare it, sort of at a high level, to the previous picture of the multikernel architecture, you'll notice that where we had the OS node, in Barrelfish we factored that into a kernel mode thing called the CPU driver and a privileged user mode thing called the monitor. And that's largely because it made it easier to construct the system. This thing is essentially a very simple single-core microkernel or an exokernel kind of thing. It serially handles traps and exceptions, reflects them to user code, implements local protection, does not communicate with any other core in the system at all, and is mostly hardware specific. The monitor is mostly hardware independent. And it actually implements -- it communicates with all the other monitors on the other cores, and it is responsible for implementing many of these replica maintenance algorithms and for mediating operations on potentially global state. And so this split, between potentially long running things that communicate and short running things that don't and are purely local, made it easier to factor the system. Our current message transport between x86 cores is something called UMP, which is our implementation of a previous research idea called URPC, which is essentially this shared memory ring buffer that allows you to move cache-line-sized messages very efficiently between cores. That's completely different on different platforms. So obviously the messaging transport on Beehive uses the hardware primitive for sending messages, and the messaging transport on Rock Creek is also using the hardware features to accelerate message transport. And that can even be different between different pairs of cores inside the same system. Much like a microkernel or an exokernel, most other system facilities are implemented at user level, and they may then themselves need to be replicated across multiple cores for scalability reasons. When you build an operating system from scratch, you get to make a whole lot of design decisions, and there's a whole bunch in Barrelfish that have nothing directly to do with our research agenda. But they were, you know, fun decisions to make and things that we had to implement. So this is a slide briefly listing many of those ideas. Feel free to ask me questions; I won't go through all of them. Some of the important ones are probably minimizing shared state, like a lot of the multiprocessor research OSes have done in the past. We use capabilities for all resource management, including management of physical memory inside the machine, using the capability model from a system called seL4. We have upcall-based processor dispatch, much like scheduler activations. We run drivers at user level. We're actually building -- this slide is a little bit out of date: it talks about specifying device registers, but we actually have a whole series of little domain specific languages to make it easier to construct parts of the system. So one of these languages generates the code for accessing device registers, another one generates our build system, another one generates our messaging transport stubs for different underlying hardware interconnects, another one generates part of the logic that implements the capability system in the kernel. It's actually a very interesting way of building an operating system.
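To give a feel for what those generated messaging stubs amount to, here is a hypothetical sketch of the shape such an interface might take in C. The interface name (mem_alloc) and every identifier here are invented for illustration, not the output of the real Barrelfish stub compiler; the point is only that each message becomes a send function supplied by whichever transport backs the binding (shared-memory UMP, hardware message rings, and so on) plus a receive handler that the user registers and that is invoked from the dispatch loop, preserving the split phase style described earlier.

```c
#include <stdint.h>

struct mem_alloc_binding;                       /* one endpoint of a channel */

/* Outgoing messages: function pointers filled in by the chosen transport
 * when the binding is set up. */
struct mem_alloc_tx_vtbl {
    int (*alloc_request)(struct mem_alloc_binding *b,
                         uint64_t bytes, uint32_t preferred_node);
    int (*alloc_reply)(struct mem_alloc_binding *b,
                       uint64_t base, uint64_t bytes, int err);
};

/* Incoming messages: handlers registered by the application or service,
 * called from the polling/dispatch loop when a message arrives. */
struct mem_alloc_rx_vtbl {
    void (*alloc_request)(struct mem_alloc_binding *b,
                          uint64_t bytes, uint32_t preferred_node);
    void (*alloc_reply)(struct mem_alloc_binding *b,
                        uint64_t base, uint64_t bytes, int err);
};

struct mem_alloc_binding {
    struct mem_alloc_tx_vtbl tx;    /* provided by the transport   */
    struct mem_alloc_rx_vtbl rx;    /* provided by the stub's user */
    void *st;                       /* per-binding user state      */
};

/* Client side: fire off the request and return immediately; the reply
 * handler in rx runs later, from the dispatch loop -- split phase. */
static int request_memory(struct mem_alloc_binding *b, uint64_t bytes)
{
    return b->tx.alloc_request(b, bytes, 0 /* any node */);
}
```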
Some of the applications running on Barrelfish -- because in order to build this as a real system and try to keep ourselves honest, we have to have applications. Some of the applications that are running today are a slide show viewer, which you're actually looking at. So why -- the reason I keep sort of glancing around at that screen is because I can't see anything here, because you would be surprised how heinously complicated it is to enable two display pipes at once. [laughter]. Our web server runs on Barrelfish. >>: [inaudible]. >> Andrew Baumann: Thanks for that. I'll send you the code and you can tell me how to implement that. [laughter]. >>: [inaudible] we believe anything you said up to this point [laughter]. >> Andrew Baumann: The problem is that the T is in hardware and it's behind the GPU programming API. >>: [inaudible]. >> Andrew Baumann: Yeah. Two video cards is easy. So our web server runs on Barrelfish. We have a virtual machine monitor that we're using to be able to run device drivers and applications where we don't worry about the performance or don't necessarily need the scale. In terms of evaluation, we currently have a relatively limited set of shared memory multiprocessor benchmarks, like these guys. We have a database engine and a constraint engine, which I'll talk more about in a minute. But we are rapidly trying to acquire more. And in particular, you can look at this list and say, well, what's not there? What's not there today is something like a file system, which is a big missing part. And what's not there is a real network stack that multiple applications can use. And so those are obviously some of the things that we're working on. >>: [inaudible]. >> Andrew Baumann: No, you won't. But here's a [inaudible] going to use all the cores in my multicore machine. Probably at some point I might want to run more than one of them. So, evaluation, which I'll kind of preface a little bit with that discussion. This raises a generally tricky question for OS research, which is how do you go about evaluating a completely different operating system structure, and in particular how do you evaluate the implementation of something that is by necessity much less complete than a real OS? Our goals for Barrelfish are that it should have good baseline performance, which means that it's at least comparable to the performance of existing systems on current hardware. But more importantly, that it show promise of being able to scale with core counts, and be able to adapt to different details of the hardware on which it finds itself running, and be able to actually exploit the message passing primitive to achieve better performance based upon hardware knowledge. The first kind of evaluation I'm going to show you, just because this is a research OS talk and all research OSes for a long time have held a tradition of message passing microbenchmarks, is the performance of our user to user message passing primitive, which is this shared memory ring buffer thing. There's a lot of numbers on this slide, but the important point is probably that the performance of a cross-core, intercore message on current hardware, over shared memory, is on the order of 400 to 700 cycles, and I think that's quite respectable given that the hardware wasn't built to support this.
It's also interesting because it means that, in contrast to many other distributed -- most, if not all, other distributed systems -- the latency of message propagation can easily be outweighed by the cost of sending and receiving messages and actually processing messages. And so the tradeoffs for building things on top of this end up being quite different to a typical distributed system. It's often much cheaper just to send another message than to try to optimize the number of messages that you have to send, if it's more expensive to process them. Also note that if you look at the throughput numbers and you work it out, we actually get pipelining for free across the interconnect because of the way the message transport works. You can have multiple messages in flight. But then, given an optimized message transport, how do you usefully implement part of an operating system on top of that? What I'm going to show you now is how we implement unmap, or TLB shootdown. And this is what happens when you have some piece of replicated state, like the TLB, which by design is replicated between all the cores in the system, and you need to do some operation that would make some of those replicas inconsistent, like reducing the rights on a mapping or removing a mapping that might have been cached in all those TLBs. And so logically, what you have to do is send a message to every core that has this mapping cached and wait for them all to acknowledge that they've removed it. In most systems the way this works is with interprocessor interrupts. The kernel on the initiating core sends an IPI to every other core that might have the mapping and then spins on some shared acknowledgement count that it uses to know when all the other cores have performed the flush. In Barrelfish the way this works is that first a user domain sends a local request to its monitor, and then the monitor performs a single-phase commit across all the cores in the system. And it's a single-phase commit because you need to know when it's done. It can't fail; it's just the flush of the TLB for a particular -- for a particular memory region. So the question is how do we implement this single-phase commit, and how do we do it efficiently? We looked at a couple of different protocols for doing this on our machines. The first is what we call unicast. And in this case we have one message channel, point to point, between every pair of cores in the system. And the initiating core sends a single message down every channel to every other core to say unmap this region. And because every message send is essentially one cache line invalidate, what this boils down to is writing N cache lines, where N is the number of cores. An optimization that we tried is what you might call broadcast, where you have a single cache line, you write the message once, and every receiver is pulling it out of that single channel. Neither of these things actually performs very well. This is on the 32-core Opteron, the box I showed you at the beginning. Neither of these things actually scales very well. In particular, broadcast doesn't scale well because cache coherence isn't broadcast. Even if you only write it once, every receiver has to go and fetch it from the core that wrote the message. So it doesn't do any better. So the question is how can we do better? If we look at the topology of the machine -- again, this is the crazy 8-socket system -- and we think about what's happening in both the unicast and the broadcast cases.
Let's say this guy is initiating the unmap: he's sending a message here, here, and here. Okay. Then he's sending a message here, here, and here, and here. The same message is crossing the same interconnect links many, many times. We're not using our network resources very efficiently. What you'd like to do is something where you can aggregate messages -- for example, send a message once to every other socket in the system rather than once to every other core. And we do that, and we call it multicast. So in the multicast case, we have an aggregation core, which is the first core in every socket in the box. And then that core locally sends the message on to its three neighbors, and we aggregate the responses in the same way. This performs -- this send here locally is much more efficient because all of these cores are sharing an L3 cache, so a message sent through the local cache is much faster. There's one more optimization that you can make to this, which is if you go back and look at this picture, you observe that depending upon where the initiator is, some cores are going to be further away on the interconnect than other cores. And there's a higher message latency to reach this guy over here than there is to reach this guy over here. >>: So [inaudible] back to your second design principle, because this -- waiting for this to come up. >> Andrew Baumann: How do I make the optimization based on hardware knowledge? >>: Yeah. Or it seems like -- isn't this an example of like -- isn't it going to be really hard to try to make the [inaudible] independent of the [inaudible]. >> Andrew Baumann: Okay. Hold the thought. Hold the thought. So you want to send to this guy first, and thus get greater parallelism in the way the messages travel across the interconnect. We do that as well, and that's called NUMA-aware multicast. And this is what the NUMA-aware multicast operation looks like, and you can see that, obviously enough, it's scaling much better. But it brings up an important question for Barrelfish and for the multikernel in general, which is that you need a lot of hardware knowledge to make this kind of optimization decision. And the decision is going to be different on different systems. In this case the way we made the decision is based upon the mapping of cores to sockets, which is a somewhat suboptimal way of knowing where the shared caches are -- and you can get that information from CPUID. And we based it on messaging latency -- which sockets are further away, which cores are further away to send to -- and we can do a bunch of online measurements, for example at [inaudible] time, to collect this information about the system. But more generally, an operating system on a diverse, heterogeneous machine is going to need a way of reasoning about the resources in the system and making these kinds of optimization decisions. The way that we currently tackle this in Barrelfish is with constraint logic programming. And this is where the constraint engine comes in. There's a user level service, called the system knowledge base, that just runs on a single core; it doesn't have to scale because it's off the fast path. And this thing is actually a port of the ECLiPSe constraint engine. The system knowledge base stores as rich and detailed a representation of all the information about the hardware as we can either gather or measure, and allows users to perform online reasoning and optimization queries against it.
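Here is a rough C sketch of that two-level, NUMA-aware single-phase commit, written as if the multicast tree (one aggregation core per socket, ordered farthest-first) has already been handed to us -- for example by a query against the system knowledge base. The channel and TLB-flush functions and all the names are assumptions for illustration rather than Barrelfish code; the structure is the point: one message per remote socket, local fan-out over the shared cache, and one aggregated acknowledgement back per socket.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_SOCKETS 8
#define MAX_LOCAL   4

struct unmap_msg { uint64_t vaddr, size; };

struct mcast_tree {
    int nsockets;                        /* remote sockets, ordered farthest-first */
    struct {
        int aggregator;                  /* first core on the socket          */
        int local[MAX_LOCAL];            /* its neighbours sharing the L3     */
        int nlocal;
    } socket[MAX_SOCKETS];
};

/* Assumed primitives: per-core message channels (see the ring-buffer
 * sketch earlier) and a local TLB flush. */
void send_unmap(int core, const struct unmap_msg *m);
void send_ack(int core);
bool poll_ack(int core);                 /* true once 'core' has acknowledged */
void flush_local_tlb(uint64_t vaddr, uint64_t size);

/* Initiator: one message per remote socket, then wait for one ack each. */
void unmap_commit(const struct mcast_tree *t, const struct unmap_msg *m)
{
    for (int s = 0; s < t->nsockets; s++)        /* farthest sockets first */
        send_unmap(t->socket[s].aggregator, m);

    flush_local_tlb(m->vaddr, m->size);          /* our own TLB, meanwhile */

    for (int s = 0; s < t->nsockets; s++)
        while (!poll_ack(t->socket[s].aggregator))
            ;                                    /* could do useful work here */
    /* Every TLB has now dropped the mapping; the unmap can complete. */
}

/* Aggregation core: flush locally, fan out over the shared L3, gather the
 * local acks, and send a single ack back to the initiator. */
void aggregator_handle_unmap(const struct mcast_tree *t, int s,
                             const struct unmap_msg *m, int initiator)
{
    flush_local_tlb(m->vaddr, m->size);
    for (int i = 0; i < t->socket[s].nlocal; i++)
        send_unmap(t->socket[s].local[i], m);    /* cheap: shared cache */
    for (int i = 0; i < t->socket[s].nlocal; i++)
        while (!poll_ack(t->socket[s].local[i]))
            ;
    send_ack(initiator);                         /* one ack per socket */
}
```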
So in particular, there's a Prolog optimization query that we use to construct the optimal multicast tree for the machine on which we're booting, and that then configures the implementation of the multicast group communication primitive for that particular machine. Now, constraints are sort of a placeholder here. Clearly this is throwing a sledgehammer at an ant, and it's not clear what the right way of doing these optimizations is. But I think there's an argument for some part of the system that is doing these high level global optimizations, using as much information about the hardware as it can gather, and is explicitly off the fast path for messaging. >>: [inaudible]. Sledgehammer at a mountain, because some other bus design might have a whole bunch of other scheduling constraints, and so, oh, you only get high performance if you schedule the messages to these cores in exactly this order. >> Andrew Baumann: So there's a hard question about how you express all those constraints and how you optimize for them. I would argue that it's, if not easier in the message passing abstraction, certainly no harder than in the shared memory abstraction. I think the only way to tackle that problem is to do it explicitly, as in collect all the information we can and allow you to do queries against that high level information. Operating systems today don't do this at all; they just say, what is the best thing that we can implement that will work reasonably everywhere, rather than, how can we implement it in such a way that we can change what we're doing based on the underlying hardware. But it's definitely an open problem. >>: [inaudible] I mean, how much of this do you think can be done sort of in this dynamic online fashion versus [inaudible] inherently done at design time? I mean it seems like -- [inaudible] rephrasing it, but it seems like this enclosed thing can only handle so much, that there's some stuff that you couldn't even figure out how you could abstract out the reasoning in the first place. >> Andrew Baumann: Sure. Yeah. You always have to pick a point. And then the further you want to go towards being dynamic, the more difficult it is to express all the possible optimizations. >>: It's important in this context in particular because you're saying -- you're trying to avoid sort of the specificity creep that you see in a lot of commodity OSes -- >> Andrew Baumann: Right. >>: Then you look at the [inaudible] [laughter]. One of the things that you see is, you know, you'll see like all this, you know, architecture specific stuff, and people will [inaudible] the comments about how this is an ugly feature of this particular chipset or universal or whatever. What you're trying to say is, well, look, if we have this sort of different architecture, or we have this nice message passing setup, we can kind of punt on a lot of that, at least at sort of the high to mid levels. But it seems that when we start talking about this multicast tree stuff, you're starting to go down that -- >> Andrew Baumann: What you do in all those kinds of situations is you build primitives or you build abstractions that allow you to build the higher levels of your software independently of the implementation of the abstraction. In the shared memory case, you build scalable implementations of data structures like lists and hash tables that are different depending upon the underlying hardware. In the message passing case, you build implementations of things like broadcast or group communication. Single-phase commit -- Paxos even, if you want.
Different kinds of protocols whose implementation you can optimize without the layers above needing to know how that works. I think that you can -- I think that that's tractable. Yes, it's hard, and figuring out the optimization for different pieces of hardware is still there and it's still hard and somebody has to do it. But the goal is not to throw that stuff away; the goal is to figure out how to decouple those details from the way you build the software. >>: So how much of this kind of stuff do you expose to the application? So if I want to write an application on one of these strange systems, it seems to me that there's a couple of classes [inaudible]: there's things that don't care about performance, right, it's just -- I can't type fast enough to do anything, and it doesn't matter what you do, all right? There's another class of things where I really care about performance. I probably want to know about the topology, the HPC stuff. I want to know about the topology of the underlying hardware and [inaudible]. So if I understand [inaudible], can you tell me how much you export to the application writer of the -- >> Andrew Baumann: So we're not directly trying to solve the problem of how to make applications scale on multicore. The problem that Barrelfish is trying to solve is how do I make the lowest level of the system software work well on scalable [inaudible]. But there's a dual of, you know, the argument that this is the way to build an operating system, and an argument that says maybe this is a good way to build an application. And applications on Barrelfish have access to all of the same information that the OS itself has. So there's nothing stopping an application going and talking to this thing. Let me come back to that when I get to future work a bit at the end. >>: [inaudible]. >> Andrew Baumann: What's the right set of primitives? Yes, that's a good question. >>: It is small but [inaudible] spend a lot of time optimizing it. >>: There's a lot of people that talk about taking advantage of topology or knowing exactly where all the interconnects go. Let me tell you that in HPC that was given up a long time ago as a bad job -- just too much diversity -- so what we've come to instead is essentially asking the operating system to allocate neighbors to applications and then having some sort of model that lets you -- lets you cope with that. The furthest we've gone is sort of some of these hierarchical NUMA systems in which you specify the abstract domain on which the computation is placed. The domain gets placed by the operating system onto collections of nodes or collections of cores. But the actual details of, wow, this link is latency this and bandwidth that, are just way too complicated. >> Andrew Baumann: Yeah. The application wants some notion of these things: these things are loosely coupled. >>: These need to be close, these don't, whatever. >>: [inaudible] this is too much work. >>: Yeah. They get -- there was thought of this, and then Larry Snyder sort of pioneered this view that took us away from that; the LogP model and its variants are an extension of that. There's just a whole bunch of examples of situations where HPC people backed away from knowing the details of the topology of systems. >>: But I think the argument here is exactly the same, which is you don't want to [inaudible] to know the details. >>: Right, right, right. >> Andrew Baumann: Right.
There's no reason that you can't build abstractions of [inaudible] data. Somebody somewhere at the bottom needs to -- >>: Maybe there's a small number of communication patterns that the operating system and the application use, and they can spend a lot of time optimizing those. >>: [inaudible] creates a lot of support problems as well. [inaudible] database systems want to adapt to the hardware and make the best use of the current hardware resources that you have. But that means that your query optimizer [inaudible]. Nobody has done that, because they don't know how to support it: the customers call back and say, okay, this one doesn't perform, and -- well, you are unique; there are 10,000 unique systems out there and you can't reproduce them. The same is going to be a problem if you start modifying the decisions the operating system makes at runtime. How are you going to tell where that behavior originated from? >>: So people that design architectures like that need to be punished. [laughter]. >> Andrew Baumann: The problem is not that they're bad, the problem is just that they're different. But that's the point you're getting at. >>: Perhaps that's [inaudible]. >>: It is bad. >> Andrew Baumann: There are many good arguments against doing runtime adaptation. The point is -- >>: I'm just saying we have to be very, very careful. We can't go in there whole hog. >> Andrew Baumann: Sometimes you need to adapt just to avoid doing something blatantly stupid; that's the challenge. A lot of stuff in systems comes down to avoiding the blatantly terrible case. So let me wrap up quickly, because we're running a bit late, I think. This is what happens when you put it all together into an actual unmap primitive, which also involves communicating with the local monitor. You can see this has actually improved since then, but we have a pretty inefficient local message passing primitive in that graph. Let me skip quickly through these benchmarks, which show the overhead of Barrelfish when running basic multithreaded shared memory workloads. This is SPLASH-2, which is a multithreaded, shared memory, compute-bound suite. This is a similar NAS OpenMP benchmark. None of these workloads actually scale very well on our machine, which says something about the workloads and the cache coherence protocol itself. The only point these benchmarks are trying to make is that there's no inherent penalty for running coherent shared memory applications on an operating system built in this way, nor would you really expect there to be -- and also that, yes, we can still run shared memory applications. Perhaps more interestingly, we've built an I/O stack on top of Barrelfish. In this case we have the network driver on one core, the web server and application server on another core, and then a backend database on another core, and we use message channels to pipeline requests through this thing in a pipelined model. That gets quite healthy speedups over the standard sort of request parallelism model on traditional systems. There's again an apples-and-oranges comparison going on here -- our implementation of the whole stack versus something that's much more complete -- but it's merely trying to make the point that you can build system services on top of a model like this and they can perform reasonably. To wrap up, I wanted to talk briefly about some ideas for future work and where some of this stuff is headed.
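A minimal sketch of the pipelined service structure just described, with POSIX threads standing in for cores and small bounded queues standing in for Barrelfish's inter-core message channels; this illustrates the request-pipelining idea only, and is not the actual Barrelfish network stack or web server.

/* Illustrative sketch only: a three-stage pipeline (driver -> server -> db),
 * one thread per stage standing in for one core per stage, connected by
 * bounded message queues. */
#include <pthread.h>
#include <stdio.h>

#define QLEN 8
#define NREQ 16
#define STOP (-1)

struct queue {
    int buf[QLEN], head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
};

static void q_init(struct queue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void q_send(struct queue *q, int v) {          /* blocking send */
    pthread_mutex_lock(&q->mu);
    while (q->count == QLEN) pthread_cond_wait(&q->not_full, &q->mu);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QLEN; q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mu);
}

static int q_recv(struct queue *q) {                   /* blocking receive */
    pthread_mutex_lock(&q->mu);
    while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->mu);
    int v = q->buf[q->head]; q->head = (q->head + 1) % QLEN; q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->mu);
    return v;
}

static struct queue drv_to_srv, srv_to_db;

static void *server_stage(void *arg) {                 /* "web server" core */
    (void)arg;
    for (;;) {
        int req = q_recv(&drv_to_srv);
        q_send(&srv_to_db, req);                        /* parse, then pass on */
        if (req == STOP) return NULL;
    }
}

static void *db_stage(void *arg) {                      /* "database" core */
    (void)arg;
    for (;;) {
        int req = q_recv(&srv_to_db);
        if (req == STOP) return NULL;
        printf("db handled request %d\n", req);
    }
}

int main(void) {                                        /* "driver" core */
    pthread_t srv, db;
    q_init(&drv_to_srv); q_init(&srv_to_db);
    pthread_create(&srv, NULL, server_stage, NULL);
    pthread_create(&db, NULL, db_stage, NULL);
    for (int r = 0; r < NREQ; r++)
        q_send(&drv_to_srv, r);                         /* packets arriving */
    q_send(&drv_to_srv, STOP);
    pthread_join(srv, NULL);
    pthread_join(db, NULL);
    return 0;
}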
What we have today is a relatively first-cut implementation of an operating system based on this model, and a high-level argument that says it's not an obvious disaster to build a system in this way and that the idea shows some promise. But that leaves open a whole lot of interesting questions, some of which we're obviously working on. The first is: what are the actual protocols that you need to do replication and agreement in a system like this? The kind of state that we replicate today in Barrelfish is mostly the capability system metadata and some aspects of memory allocation. That works, but it's also not very interesting. You end up doing things like one-phase commit and, in some cases, two-phase commit. One of the big questions is what the right protocols are for building higher-level system services on top of these, and how you can structure them in a way that lets you use transparent local sharing to make them go fast locally. And note, as I said before, the tradeoffs with respect to the performance of message passing are going to be very different here than in many classical distributed systems. Although we can reuse ideas from the distributed systems space, I don't think many of the same protocols would necessarily make sense given these kinds of assumptions. The next question, which leads a little bit into what I think some of the earlier questions were getting at, is how do you build applications for this? Obviously your applications aren't all going to be multiprocessor shared-memory pthreads applications. What does a native Barrelfish application look like, if there is such a thing? I don't know the answer to this. I'm not sure anybody knows the answer to this. But there's a general hard question here about how you program applications for different kinds of heterogeneous, scalable multicore hardware. And I think that, first of all, the answer is going to have to lie in raising the abstraction bar. Obviously people aren't going to be building applications in C or other low-level languages and using the typical operating system APIs that we have today. And so that's one of the reasons we're not directly trying to specify the Barrelfish API that you write applications to. Instead, we think that applications are going to be written to higher-level language runtimes and programming environments, and that there will be different environments for different styles of application. So one example of this that maps very well onto heterogeneous multicore is data-intensive applications implemented in something like a Dryad or MapReduce kind of model, where the runtime can map the computation it wants to do down onto the available hardware resources. Obviously not all applications fit this. There will be applications that need tightly coupled synchronization and have some variant of a message passing model -- probably not MPI, because that's way too low level, but something slightly higher level than that for those kinds of applications. Other applications might be doing streaming data processing or whatever. There will be some number of classes of applications, and the answer here has got to lie in tightly coupling the implementation of the runtime for those things with the underlying OS, so that the runtime can use knowledge of the hardware to optimize itself and to present its workload appropriately to the OS, without the application developer having to know about it.
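As an illustration of the kind of agreement round mentioned above, here is a sketch of a two-phase update to state that is replicated per core (such as capability metadata), with the replicas and the "messages" between them simulated by ordinary function calls. The real Barrelfish protocols differ, and all names here are hypothetical.

/* Illustrative sketch only: the shape of a two-phase agreement round for
 * updating per-core replicated OS state, with replicas simulated in-process.
 * Not the actual Barrelfish protocol. */
#include <stdbool.h>
#include <stdio.h>

#define NREPLICAS 4

/* Per-core replica of some piece of OS state. */
struct replica {
    int value;          /* committed state */
    int pending;        /* value staged by a prepare */
    bool has_pending;
};

static struct replica replicas[NREPLICAS];

/* Phase 1: each replica votes on whether it can apply the update.
 * In a real system this vote arrives as a reply message. */
static bool prepare(struct replica *r, int newval)
{
    if (r->has_pending)          /* conflicting update in flight: vote no */
        return false;
    r->pending = newval;
    r->has_pending = true;
    return true;
}

/* Phase 2: commit or abort, depending on the collected votes. */
static void commit(struct replica *r) { r->value = r->pending; r->has_pending = false; }
static void abort_update(struct replica *r) { r->has_pending = false; }

/* Coordinator: runs on the core initiating the change. */
static bool replicated_update(int newval)
{
    bool all_yes = true;
    for (int i = 0; i < NREPLICAS; i++)          /* "send" PREPARE to each core */
        if (!prepare(&replicas[i], newval))
            all_yes = false;

    for (int i = 0; i < NREPLICAS; i++)          /* "send" COMMIT or ABORT */
        if (all_yes) commit(&replicas[i]);
        else if (replicas[i].has_pending && replicas[i].pending == newval)
            abort_update(&replicas[i]);
    return all_yes;
}

int main(void)
{
    if (replicated_update(42))
        printf("committed: every core now sees %d\n", replicas[0].value);
    else
        printf("aborted: replicas keep their old state\n");
    return 0;
}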
And the optimizations that you have to use there end up being domain-specific. The next question is how you construct useful system services. Things that we're looking at at the moment include the networking architecture and also the file system. The obvious thing to try here is to take some ideas from cluster file systems and HPC file systems and see how they map onto this space. One thing that's very different is when it comes down to dealing with questions like where does the data actually live: having hardware-supported shared memory, even if it's not coherent, makes the implementation of these kinds of things very different. And finally there's a big open question about how you do resource allocation and scheduling in a system that's structured in this way: given a diverse set of competing application demands, a widely varying set of processors, and varying requirements like performance and power management, how do you actually solve that resource allocation problem? I think that's a big open question. Part of the answer is going to have to lie in high-level things like constraint solving, but we're starting to look at how we can do heterogeneity-aware scheduling and resource allocation, particularly how the application can express its resource requirements to the operating system. Another thing to note is that because you have these message queues between applications and between cores, you can actually infer something about communication patterns, and even load, based on the length of the message queues, like some of the event-driven architectures that people have built in the past. So there might be something interesting you could do there. Finally, just to head off one question that people often tend to ask: now that you've built an operating system based on explicit message-based communication, can you actually extend that model outside the box? Does it make sense to put a network in between and run one core on the other side of the network? The first answer to that question is probably no, or at least not in the trivial sense. There are big differences between being inside a machine and going across machines, in particular in the guarantees that your messaging system can offer: things like reliable message delivery and all the other properties our message transports guarantee within a machine are completely different as soon as you go across machines. The performance tradeoffs are different. And even if you don't rely on that, having hardware support for shared memory makes things very different. But building a system this way definitely narrows the gap between the programming model that you use inside a machine and the programming model that you might use across multiple machines. You have explicit message passing; in particular, you have these explicit split-phase APIs that allow you to tolerate much more latency from communication. So you can think about taking programming models for programming aggregates of machines and mapping them onto programming inside the machine, using the same kinds of primitives there.
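The split-phase style mentioned above can be sketched as follows: an operation is issued and returns immediately, and a continuation runs when the reply message is eventually dispatched. The names (remote_read_start, deliver_reply) are hypothetical, and the reply is faked in main; a real implementation would pull replies off an inter-core channel in its message loop.

/* Illustrative sketch only: a split-phase call -- issue the request, return
 * immediately, and run a continuation when the reply arrives. */
#include <stdio.h>

typedef void (*continuation_t)(int result, void *arg);

/* One outstanding request: where the reply should be delivered. */
struct pending {
    int in_use;
    continuation_t cont;
    void *arg;
};

#define MAX_PENDING 8
static struct pending outstanding[MAX_PENDING];

/* Phase 1: send the request and register the continuation.  A real
 * implementation would enqueue a message to another core here. */
static int remote_read_start(int address, continuation_t cont, void *arg)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!outstanding[i].in_use) {
            outstanding[i] = (struct pending){ 1, cont, arg };
            printf("request %d: read of %#x sent, caller keeps running\n",
                   i, (unsigned)address);
            return i;
        }
    }
    return -1;   /* no slot free: caller must retry later */
}

/* Phase 2: the event loop delivers a reply and runs the continuation.
 * Here the "reply" is faked; normally it is pulled off a channel. */
static void deliver_reply(int id, int result)
{
    if (id < 0 || id >= MAX_PENDING || !outstanding[id].in_use)
        return;
    outstanding[id].in_use = 0;
    outstanding[id].cont(result, outstanding[id].arg);
}

static void on_read_done(int result, void *arg)
{
    printf("continuation for %s: got value %d\n", (const char *)arg, result);
}

int main(void)
{
    int id = remote_read_start(0x1000, on_read_done, "page table walk");
    /* ... do other useful work while the reply is in flight ... */
    deliver_reply(id, 42);    /* stands in for the message-loop dispatch */
    return 0;
}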
You can also think about extending your PC -- your personal collection of machines -- with resources that you dynamically acquire from the cloud, or other resources that you might have in your pocket, and having the same part of your operating system that deals with high-level questions like resource allocation and scheduling decisions be just as able to say, well, I'm going to go and run this over there on the other side of the network, and to have some kind of integrated view of the part of my machine in the cloud just as it would if it were running on my local machine or on my phone or whatever. So, to wrap up: I've argued that modern computers are inherently distributed systems, and that it's therefore time to rethink the structure of an OS to match. I've presented the multikernel, which is a model for an OS structured as a distributed system: it uses explicit communication and replicated state, and it's neutral to the structure of the underlying hardware. And I've shown you Barrelfish. I hope I've convinced you that Barrelfish is a promising approach for structuring operating systems. We're trying to build it very much as a real system. I think I've shown that it has at least reasonable performance on current hardware and, more importantly, that structuring a system in this way should allow it to scale and adapt better to future hardware. And with that, I'll put in another plug for our website -- there are papers, source code and more information there -- and I'm happy to take any other questions. [applause]. >>: I have one question: suppose I have a [inaudible] -- what does your C compiler do on Barrelfish? >> Andrew Baumann: We have GCC, and it does what GCC does. You can -- >>: [inaudible] then you would use the shared -- >> Andrew Baumann: That application is not going to be able to run on cores that don't have coherent shared memory or aren't the same architecture, right? So in the way you've built that application, you've limited the set of cores on which you can run it. But there's no reason our operating system can't run that application. Obviously we're not building the operating system that way: we've built the operating system in C, but we don't use threads, and we don't assume shared memory between multiple instances of the same program. >>: [inaudible] if you want to do that kind of sharing you should use the APIs and [inaudible]. >> Andrew Baumann: Yeah. No, the shared memory multithreaded model is probably the right model for some set of applications. They're just not going to run across all these cores. But if they only operate on a small number of cores and have loosely coupled synchronization, it's a perfectly valid programming model. >>: So I didn't hear much about security. I'm a security person, so I'm always curious. >> Andrew Baumann: Okay. >>: And I was wondering, one, if there's a failure in any one of the subkernels on any one of the chips or cores, does that mean you're hosed -- is everything fully [inaudible] by everything else? And for the message passing -- how do you defend against an attempt at denial of service by sending [inaudible]? >> Andrew Baumann: The answer is that right now we don't defend against any of those things, and if one of them goes down, the whole system goes down. However, as a model for constructing a system, we know to some approximation how to secure message-based systems. You can build things that check the integrity of messages on the channel. You can build things that isolate failed components.
You can build messaging protocols that tolerate failures fairly well. You can do the same thing in a shared memory model, but I would argue that it's more complex, because you don't know what the boundaries of sharing are; you don't know where the communication is. If some other core has gone and trampled on all my data structures, I'm much more screwed than if some other core sent me a bad message and I can detect that it's a bad message. So we're not trying to solve this problem, but I think this model gives you a better handle on it. >>: And in terms of availability, how do you protect against one application just sending lots of -- sending lots of messages [inaudible]. >> Andrew Baumann: That's just a general sort of resource allocation problem. How do I protect against one application spinning on a core, or creating lots of threads and throttling the system? I can -- you know, I can account for the message channels that get set up -- >>: Some kind of accounting for [inaudible]. >> Andrew Baumann: Well, I can also account for the individual message operations. I mean, in general most of these issues are different in the messaging scenario than in the shared memory scenario, but I don't think they're necessarily any harder. >>: Well, one thing that might be harder is that in the message passing scenario, depending on what constraints you've solved for, you may not know the true cost of sending a message. There may be certain types of messages that cost an order of magnitude more than other types of messages. So it may be very hard to set a threshold where you say you can only send X messages; you may need to have some other kind of a -- >> Andrew Baumann: Yeah, yeah. >>: I don't believe you at all, but I'll let him answer the question. [laughter]. >> Andrew Baumann: I think it's probably [inaudible] [laughter]. [applause]