>> Chris Hawblitzel: So hello, everybody. Please join me in welcoming -- [laughter]. So let me say a few words about Livio. Livio has been doing a lot of interesting work in using operating system technology and improvements to operating systems in order to squeeze more performance out of commodity hardware. And so today he's going to be telling us about exception-less system calls and how you can get better performance that way. So please take over.

>> Livio Soares: Hi. Thanks for having me here. I'm very excited to come visit MSR. So as Chris mentioned, I'm going to talk about a new operating system primitive that I call the exception-less system call, and it's a new way for applications to request services from the operating system kernel. And we believe that it has different performance characteristics that are more compatible with modern computer architecture than the traditional way of using a software trap or an exception for these two to communicate.

But before I get into the details of exception-less system calls, I want to talk a little bit about the motivations for this type of work. And it seems that it's no news to you that energy has been constraining the way we've built computers in the last few years, and it seems like it will continue to constrain or restrict us in some way. So we have, beginning at the microprocessor level, an inability to increase the clock frequency at the rate we increased it in the past. So in 2004, I guess, when I started my graduate school, this was already the case, and it's still the case today. And in the near future it looks like we're going to see dark silicon, because we can't even power all of the transistors, let alone increase the clock rate. And this also has effects at a larger scale, which is data centers. So we have energy starting to dictate operational costs for data centers indirectly through powering the actual machines -- sorry, directly through powering the machines and indirectly through cooling all of this heat that gets dissipated. And it seems like it is also starting to limit server density, because you can't feed enough power to a single data center.

So just as an example, this is a pretty old report from 2007 from the EPA drawing potential curves for U.S. energy consumption. So from 2000 to 2007 it's pretty accurate. From then on they draw some different scenarios, and in 2007 in particular 1.5 percent of all U.S. energy was consumed by data centers. I was trying to get new numbers. It seems like estimates around now are between 2 and 3 percent. So we have seen that the curve has increased. It hasn't really gone to the best-practice scenario.

>>: Like all energy or all electricity production?

>> Livio Soares: That's a good question. I think it's --

>>: Well, no, like thermal energy, like gasoline or natural gas or --

>>: [inaudible].

>>: Sorry.

>> Livio Soares: And I guess more interesting is the phenomenon behind the growth in energy, which is the demand for computation. So here I have the installed server base. So this is not only servers in data centers, it's servers in either your private corporation or in data centers, but I guess in recent times it's mostly been in data centers, for the U.S. and the world, and you can see the curve follows a similar shape. And so we see a four times increase in the number of servers over the last decade.
And if we consider that over the last decade you have a 30 times increase in server performance, you can do a back-of-the-envelope calculation: we have at least a 100 times increase in demand for server computation. This is probably an underestimate. There's virtualization, which became popular starting in the early 2000s and made it possible to consolidate a lot of computation into a single server, so it may be even worse than 100-fold. And there's no reason why this phenomenon won't continue. Especially now that we have mobile computation becoming so popular, offloading parts of the application to the server side is going to increasingly continue to occur, and I think we'll see this trend going forward.

So given these trends, we would like applications to run very efficiently so we can reduce the number of machines we have and reduce the power that we require from those machines. So what I have graphed here are several workloads, mostly from SPEC 2006, divided up into SPECint, SPECfp and a couple of server workloads, and I'm showing the instructions per cycle, IPC, on an Intel Core i7 processor. So in theory the Intel Core i7 processor can deliver up to four instructions every cycle. In practice, you see that even the highest-IPC workload barely scratches two, and what's worse, the servers have the lowest IPC among these applications. So this is also not new. But it seems like given these trends, we have a computer systems challenge, which is to enable efficient computation. Now, the focus of my graduate work has been a little more narrow than that: I've just been looking at run time and operating system techniques to help run applications more efficiently.

So this is the outline of my talk for today. Mostly I'm going to talk about exception-less system calls, so I'm going to try to outline the problem with synchronous system calls and then describe our implementation of exception-less system calls, and talk a little bit about how that changes the interface for the application in user space. So we have two ways to interface with the kernel: a threading mode that supports threaded applications mostly transparently, and a way to do event-driven execution which is not transparent to the application -- you must modify the code. And if I get a chance, I'll talk about some previous work that I did on helping the performance of last-level processor caches, which are known to have a lot of pollution, so they have very low reuse rates, and I have a technique called the software pollute buffer which tries to dynamically monitor applications and use page coloring to improve their performance. And maybe I'll talk a little bit about future work.

So as I was mentioning, the performance of servers in terms of IPC is quite low compared to other types of workloads, and this is mainly because of locality. And there are, I guess, two types of locality that we can reason about: temporal locality, which is reusing items in a short period of time, and spatial locality, which is reusing items, for example, on the same core, or items that are physically close together when we talk about memory. And all processor structures seem to exploit one or the other type of locality. So we have caches -- instruction caches, data caches -- that use this, branch predictors, TLBs, and studies show that servers exhibit low levels of locality when compared to more traditional scientific workloads. And some of the reasons are inherent to what servers do.
So servers are request-driven, so it's very hard to predict ahead of time what you're going to access or what the pattern is that you're going to access. It's harder than it is in a more traditional workload. You also have to support multiple concurrent requests, which potentially requires a server to touch data from different parts of your data set. And there are things that are not inherent to servers but just to the way we engineer them: they happen to have large instruction and data footprints, and they are heavy users of operating system services. And the heavy use has non-trivial performance implications.

So a quick reminder on how applications access kernel services. Usually you have a system call which is based on a software trap or an exception. So you have some assembly here where you move the system call number to a register and maybe some arguments, and you issue a syscall instruction which immediately, synchronously, jumps into kernel mode into a previously declared handler. You resolve that system call by handling it and return to the application. Now, both the syscall and the sysret are synchronous, and they perform a few things during the mode switch which cost a few cycles.

And traditionally people have tried to measure the overhead of system calls, usually in a sort of controlled microbenchmark where you have a tight loop where you read the time stamp, issue a very simple system call that basically doesn't do anything in the operating system, return, and you note the time difference. And what I have here is this type of benchmark over a series of processors. On the left you have the oldest and on the right you have the newest, and I have both Intel processors and IBM processors. There seems to be a trend in recent history that they've tried to improve the performance of this exception going back and forth.

Now, unfortunately, when I did this I thought, okay, system calls are not problematic, and it seems that that data is a little misleading when it comes to understanding the full impact that system calls have on performance. So let me show you a different experiment. We took one of those benchmarks from SPEC 2006, Xalan, which has been engineered to spend virtually no time in the operating system. So it does very few system calls, just to do the setup in the beginning. And I ran this on the Intel Core i7, and I artificially injected exceptions into the execution of Xalan, and I did two types of exception handling: a null exception handler, which would try to show us what the direct cost of issuing system calls is, and a pwrite system call inside the exception handler, which tries to show a little bit of the indirect cost of actually doing some work while you're in the operating system. And in the graph that I'm going to show I only measured user mode time, so I completely ignored what was happening inside the kernel. And in an ideal world we would see the performance of user mode completely unaltered.

>>: Does your ability to select and measure just user mode time depend -- when you say that, you really mean you measured time up until the kernel switched the [inaudible] over, right? That switch between user mode and kernel mode time happens in the operating system somewhere behind the context switch. Is that right?

>>: Or, basically, how did you measure that?

>> Livio Soares: Yeah, so I used the hardware performance counter facility, and they allow you to specify what type of cycles you want to monitor.
>>: So the hardware does this transition for you right at the moment of the exception.

>> Livio Soares: Right. So I don't know exactly how close it is --

>>: [inaudible]

>> Livio Soares: Yes, it is.

>>: [inaudible]

>> Livio Soares: So here we have the results of the degradation of Xalan due to the synchronous system call. So on the orange, or the lower, part of the graph we have the direct, null-exception cost, so we have up to a 45 percent slowdown of the application when we have an exception every thousand instructions. Then the x axis grows on a log scale. And the blue part at the top of the graph is showing the indirect cost when you actually do work inside of the operating system. So remember, I'm just measuring user mode time. And just to give you some reference, with a very simple benchmark running on Apache and another one on MySQL, this is the range that we're looking at for these types of server applications, between 2,000 and 10k. So in that range we see that the blue part is more predominant than the orange part. So the fact that we're doing work inside the operating system is slowing down the user application.

>>: [inaudible] how did you get every x thousand instructions?

>> Livio Soares: So I also used the hardware performance counters. I didn't want to modify the application, because it's hard to know exactly x number of instructions. So I programmed the hardware counters to trigger an exception every x number of instructions, user mode instructions.

>>: And did you then run that with the null system call to measure the overhead of your exception injection? Because that's going to pollute the cache and --

>> Livio Soares: Right. Yes. So the baseline is you just -- well --

>>: [inaudible].

>>: Well, the direct overhead here is going to be very different from the syscall instruction, I think, is what Bill is getting at. Taking a [inaudible] from the [inaudible] makes one exception call [inaudible]

>> Livio Soares: Right. So here they were using the same path. So the direct is actually handling the --

>>: So direct is just handling the -- the orange part of this is the overhead of your exception injection technique?

>> Livio Soares: That's right.

>>: Okay. So what we want to do is look at the difference between the orange and the blue. Okay.

>> Livio Soares: Well, no, the orange is what you would see just by -- if you really were to issue a system call instruction.

>>: Well, in a real program I'm not going to have timers that go off and generate system calls, I'm going to execute my program logic and decide it's time for system calls. What I'm trying to do is isolate the performance perturbation of the timer that you stuck in there to generate the system calls from the system call itself, because --

>> Livio Soares: So the actual handling of the exception for the performance counter is not timed, right? Because I'm showing user mode time here only. Now, what you may be implying is that handling a system call looks a little bit different than handling a performance counter exception. Yes, but the code was written in a way that really we're just issuing some hundred or 200 instructions. This is a very lightweight exception handler for the performance counter case. And it's already causing this much trouble in user mode, and I'm not timing that part, the kernel part of it.

>>: [inaudible]

>> Livio Soares: So probably it's not the exception thrashing the caches, because for that piece of code you're really modifying a half dozen instruction lines and data lines.
It's really the pipeline being hammered.

>>: It's thrashing the pipeline?

>> Livio Soares: Yeah, it's thrashing the pipeline, right. So the direct effect is you're hammering the pipeline. It takes about a hundred or so cycles, or more, just to start warming up your pipeline. And actually -- I don't have this graph here, but I have another graph which showed how long it took for user mode IPC to recover from a single exception. And it can take up to thousands of cycles.

>>: So have you looked at this on a CPU with a really short pipeline, or with something that doesn't do out-of-order execution, like an Atom or something?

>> Livio Soares: No.

>>: Okay. If you have a short pipeline on a CPU, does that mean you can issue lots of system calls [inaudible]?

>> Livio Soares: Yeah, I think that would be the case. So I tried this on a PowerPC, which has a pretty aggressive pipeline as well, and Intel, but my intuition is yes. If you have a very short pipeline, you don't have much state to warm up to begin with, so the direct performance perturbation that you get is going to be less. But the indirect is still there. And the indirect is really the thrashing of computer architecture state. So when I measured a single Linux write call, it seems like two-thirds of the L1 cache -- both the data and the instruction -- well, no, the instruction was less, but the data cache and the TLB were evicted during that write call. So at least for Linux, it does cause a lot of pressure on these low-level, performance-sensitive structures of the processor. And as you can imagine, the performance of the operating system is going to be equally affected by the application. So if you were to measure the IPC of the kernel -- and because the kernel doesn't get to execute a lot, it's just a request-driven piece of software -- you see that it gets affected tremendously. Yes?

>>: Is this dependent on the parameters of the write, like how many bytes you're writing?

>> Livio Soares: I haven't explored that too much. I don't think so -- in the experiments I've done, it doesn't really matter. Something close to this even occurs on a read, which you would expect to have a slightly different code path.

>>: [inaudible]

>> Livio Soares: It's not necessarily a lot of data. It's scattered data. So I guess the operating system may not be optimized to be TLB sensitive, so it may be touching a few variables along its path. If these variables don't sit together on a few pages, then --

>>: [inaudible]

>> Livio Soares: Sorry?

>>: [inaudible]

>> Livio Soares: So it could be more sensitive to what you're going to write, but there are some fundamental things that you do that are independent of the actual data. So checking the processor, the current processor, the permissions, the file descriptor table -- all these things that the operating system does along the way to get to the actual write are causing perturbation at the architecture level.

And so we found that synchronous system calls are expensive, and not necessarily because of the exception per se, although that can have costs at the pipeline level. They're also expensive because they don't play nicely with the way computer architectures are designed. So what we're proposing is to try to sidestep this boundary so that we can have longer periods of time executing in user mode and longer periods of time executing in kernel mode, so that we don't wipe the pipeline clean as frequently and we also try to reuse as much of the instructions and data as possible.
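For reference, here is a minimal sketch of the conventional synchronous path described earlier -- the system call number and arguments loaded into registers, followed by the syscall instruction -- written as a raw write on x86-64 Linux. The wrapper function is purely illustrative; it is not code from the talk.

```c
#include <stddef.h>

/* write(fd, buf, count) issued directly with the x86-64 "syscall"
 * instruction: the call number (1 = write on Linux) goes in rax,
 * the arguments in rdi, rsi and rdx; the instruction traps synchronously
 * into the kernel, and the return value comes back in rax.  The hardware
 * clobbers rcx and r11 during the mode switch. */
static long raw_write(int fd, const void *buf, size_t count)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "a" (1L),          /* system call number: write */
                        "D" ((long) fd),   /* rdi: file descriptor      */
                        "S" (buf),         /* rsi: buffer               */
                        "d" (count)        /* rdx: byte count           */
                      : "rcx", "r11", "memory");
    return ret;
}
```

Every such call forces a user/kernel mode switch, which is exactly the pipeline flush and cache/TLB pollution that the injection experiments above were quantifying.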
So exception-less system calls are our way to try to decouple the execution of the application from the execution of the operating system. And there are two main parts to it. One is the interface: we use a shared memory structure so that the application can write requests to it and read results from it. And there is also a mechanism to execute those requests asynchronously in Linux, which we call syscall threads.

So let me try to give a high-level overview -- by the way, we call our system Flex, for flexible system calls. So I guess there are three main components. The first is the actual implementation in the operating system, which includes what I mentioned, the interface and the execution threads. And there's some support for applications. So we have a Flex threads library, which is binary compatible with current Linux pthreads, and a libflex library, which is used for event-driven applications. And note also that we can have regular, traditional applications that don't use exception-less system calls, and they shouldn't interfere with our system at all. So they are supported.

So the benefits of exception-less system calls, as I mentioned, are that you try to reduce the number of mode switches, saving on the direct costs, and then you can do a few optimizations. One of them is that you can batch a lot of system calls to be executed in a chunk, so that you can execute a lot of things in the application space, and then when you enter the kernel you have a lot of work piled up to do. So that's batching. And that can occur even on a single core. Then if you do have a multicore, you can do a little bit of parallel execution, where you put messages on this shared page to execute remotely, and this allows you to somehow specialize the contents of the caches or other structures on the processor to execute either in user mode or in kernel mode.

So let me begin by talking about the interface, which we call the syscall page. On the right we have the diagram of a syscall page. It's structured as a series of entries. Each one corresponds to a system call, so it has things like the system call number, the arguments, a status code and the return code from the system call. And on the left you have the code necessary to issue, let's say, a write system call. So the application wants to do a write. Instead of issuing a syscall instruction, it runs this piece of code where you find a system call entry that's free, you fill in the arguments -- we have an example there at the bottom -- and you tell the operating system, okay, this is ready to execute. Submit. And then you try to find something else to do in the meanwhile. And once the operating system notices that, it goes and executes the system call, updates this memory with done and the return code, and finally the application can move on from there. But now you need something to execute this request --

>>: [inaudible]

>> Livio Soares: Sure.

>>: [inaudible]

>> Livio Soares: Right. So I guess I can answer --

>>: If you're going to get to it, that's fine --

>> Livio Soares: No, I can answer right now. So we try to map one of these pages, or several of these pages, per core, and for that core the library guarantees that there's only one thread running that has access to that page at a time. So you may not need locking necessarily, but depending on your execution model, yes, it may be necessary to lock some of these things. But my implementation right now doesn't require a lock.
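A minimal sketch of what the syscall page entry and the user-space submission path described above might look like. The struct layout, the field names, and the find_free_entry() helper are assumptions made for illustration -- the talk only says that each entry holds the system call number, the arguments, a status code and a return code -- and the real system shares the page with the kernel rather than using a private static array.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/syscall.h>   /* __NR_write */

enum sc_status { SC_FREE, SC_SUBMITTED, SC_BUSY, SC_DONE };

struct sc_entry {
    volatile uint32_t status;   /* SC_FREE -> SC_SUBMITTED -> SC_DONE */
    uint32_t          sysnum;   /* e.g. __NR_write                    */
    uint64_t          args[6];  /* raw system call arguments          */
    int64_t           ret;      /* return value, valid once SC_DONE   */
};

/* In the real system this page is mapped per core and shared with the
 * kernel; here it is just a static array so the sketch compiles.  Only
 * one user thread touches a given page at a time, so no lock is taken. */
#define SC_ENTRIES 64
static struct sc_entry syscall_page[SC_ENTRIES];

static struct sc_entry *find_free_entry(void)
{
    for (size_t i = 0; i < SC_ENTRIES; i++)
        if (syscall_page[i].status == SC_FREE)
            return &syscall_page[i];
    return NULL;   /* page full: fall back to a synchronous call */
}

/* Exception-less write(): post the request on the syscall page and return
 * immediately; the caller later checks for SC_DONE (or, with Flex threads,
 * the user-level scheduler switches to another thread in the meantime). */
static struct sc_entry *exless_write(int fd, const void *buf, size_t count)
{
    struct sc_entry *e = find_free_entry();
    if (!e)
        return NULL;
    e->sysnum  = __NR_write;
    e->args[0] = (uint64_t) fd;
    e->args[1] = (uint64_t) (uintptr_t) buf;
    e->args[2] = (uint64_t) count;
    e->status  = SC_SUBMITTED;   /* kernel-side syscall threads pick this up */
    return e;
}
```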
>>: [inaudible] then that means that to do a system call, you have to communicate with that thread, which is going to require a lot, right?

>> Livio Soares: So I guess I'm going to explain my threads library in a few slides, and I think that will make it clear. So inside the operating system you need some context to actually execute these system call requests in the background. The way Linux or most Unixes do it is that when you trap into the kernel, you use that stack to go ahead and start executing the system call. Now, you don't have that luxury right now, because your application thread is off doing user space work. So you need some context -- some stack and an execution environment -- to go and do the work. And this is what we call syscall threads. They're kernel-only threads, and their job is to go pick up submitted slots from the syscall page and execute them. Just for the sake of our implementation, we pinned syscall threads per core, to also help with some of the locking issues.

So as I mentioned, one of the optimizations you can do is system call batching. The goal is to queue up as many system call requests as possible, and once you're done doing application work, have nothing more to do and have to get a reply, you can switch down to the kernel, execute the system calls and return. The other, slightly more interesting, thing is that you can specialize cores. An example of this is a 4-core setup where the top two cores here execute user mode threads only. So they're just producing system calls and never enter the kernel. And then the bottom left one can be a kernel-only core that's only executing syscall threads. And because sometimes the split between user mode and kernel mode is not perfect, you may still have a core that mixes execution between them. Yes?

>>: So in pthreads-based code all these system calls are synchronous; all the threads that make these system calls are blocking, waiting for a result, right? So how are you extracting parallelism if you can't make another call -- you can't execute any more instructions until you know the result of the system call [inaudible]?

>> Livio Soares: Right. So that's why I did the Flex threads library, which I'm going to get to in a couple of slides, which is to try to give the illusion to that pthread thread that it is still synchronous. Right.

>>: But that instruction counter, the instruction pointer of that thread, can't make progress until [inaudible]

>> Livio Soares: That's correct.

>>: So where do you get the parallelism? Where are you doing more work?

>> Livio Soares: So I'm going to explain that. That is an important part. It's coming soon. So what programs can benefit from Flex? The first thing we've done is try to extract parallelism from threaded applications, so we built Flex threads, which I'll explain in more detail in the next slide. And the other mechanism is: if you could program directly to this, so change the application so that it knows that these things are asynchronous, what would you do? And so we've ported memcached and nginx, which is a popular web server that is event-driven already, to use the Flex facility, and we built a little library that helps these servers execute asynchronous system calls.

So the Flex threads library is a hybrid m-on-n threading library. So you have one kernel-visible thread per core, but on top of that you have a lot of user-level threads.
So I guess Microsoft has had this notion of fibers for a while now, and it's a little bit similar to that. And the goal of the Flex threads library is to redirect system calls to the system call page, and then, because we have a user mode threading library, we can switch to a different thread to continue doing some other work. And so -- go ahead.

>>: So it seems like the most obvious danger here is that now you're going to cover more territory and exercise your cache -- you know, blow out your caches.

>> Livio Soares: Right.

>>: So how are you going to keep that from happening?

>> Livio Soares: So I don't keep it from happening.

>>: [inaudible] a better tradeoff than letting the kernel blow out your caches.

>> Livio Soares: That's right. So I guess it may be workload specific. For some of the workloads that we have, the work that is being done in each thread is similar enough that it's much more beneficial to keep doing user space work than it is to jump into the kernel. But you're right, it could happen that in your application each thread is doing very different work --

>>: [inaudible]

>> Livio Soares: Yeah, and touching different data, and then doing this may result in you using the caches in a worse way than you would have otherwise. Yeah.

So let me try to illustrate a little bit better how the Flex threads library works. We have one single thread visible to the operating system per core. That's the white Flex thread there. And on top of it we can schedule multiple user-level threads, and these are what the pthread programmer thinks about. And below we have the syscall threads that we had before. So you start executing a user-level thread, and it wants to do a system call, so the library simply dispatches the request to the system call page and blocks. And we get the parallelism by choosing another user-level thread and continuing execution. We keep that thread blocked until the results come back, and then we can unblock it. And from the thread's perspective, it was a synchronous call. Then eventually we may run out of work to do, and if there's no core in the background doing operating system work, there's a fall-back mechanism, which is you just synchronously jump into the kernel, say wait, and then the kernel tries to -- at least finish executing at least one of the system calls before it wakes up the Flex thread again.

>>: What happens if you have kernel threads and one of them's busy -- it will get time-sliced out.

>> Livio Soares: Right.

>>: How do you deal with that with the user thread? So say you have a user-level thread that doesn't make syscalls and that needs the processor a hundred percent. Will it starve the rest of the threads on that core?

>> Livio Soares: In my implementation, yes. I guess there has been some work in hybrid threading, this m-on-n threading. You can send a signal up to the threading library and say, oh, this much time has been spent already. So I haven't actually needed it, because none of the workloads -- we're talking about a few thousand cycles where each thread does some work. But, yeah, you would have that problem if you have a CPU-hungry thread in there.

So let me quickly try to show you the evaluation. We implemented this on Linux 2.6.33. We had a Nehalem processor which is not very aggressive. It was a cheaper model, 2.3 gigahertz, four cores on a chip.
We have clients connected over a 1 gigabit network, and I'm going to show you two workloads: sysbench, which is a simple OLTP-style workload running on MySQL -- execution of that results in 80 percent in user space and 20 percent in kernel space -- and then ApacheBench, which tries to hammer the Apache web server asking for a static web page. And in the graphs I'm going to show you, whatever is denoted as sync is the default Linux threading library, and Flex is our Flex threads library.

So here is sysbench on MySQL with a single core. On the x axis we have increasing concurrency, so more concurrent requests from the client, and on the y axis the throughput that the client sees in requests per second. So we have a modest gain here of 15 percent with Flex when you have enough concurrency; between 64 and 128 we really see some gains. But more interesting are the four-core numbers. In the one-core case it was only doing the system call batching. In the four-core case -- because we had an 80/20 percent split -- we mostly have one of the cores doing the kernel work for the three other cores that are issuing system calls. So we are able to specialize the caches a little bit more, and we see a better improvement than before. Yes?

>>: Can you go back a slide? Okay. Please continue. I just wanted to get [inaudible]

>> Livio Soares: Sure.

>>: Does this require any modification of the application or was this just --

>> Livio Soares: So this is the same MySQL binary running on both the baseline and mine. And so, well, there's the latency, which is also important for certain server workloads, and we see that on average we have about a 30 percent decrease in latency. If you were to consider the 95th percentile, it can be a little bit more, especially in some anomalous cases like the single core running on default Linux.

But, yes, this is the graph that I like the most, which is somehow validating the intuition of the work. It's showing different processor metrics that I collected using hardware performance counters running the four-core scenario. On the left we have user mode execution and on the right kernel mode execution, and there are two types of bars: the IPC bars, which are in orange or brown, and for those higher is better -- the higher it is, the more efficiently you're executing -- and then the blue bars, which are misses per thousand instructions on these structures. So you want fewer misses, and that will result in better execution. So you see that some of the metrics in user mode don't get affected too much, and this is somewhat expected, since 80 percent of the time you are executing in user mode. So user mode is probably dominating those structures. Yes?

>>: Is the MySQL benchmark touching disk?

>> Livio Soares: No. So this setup that I have was targeted -- yeah, it was in memory. So it was trying to get full use of the CPU.

>>: Okay.

>>: And the improvement here, this is with the four cores, so you're doing the specialization?

>> Livio Soares: Yes, that's correct.

>>: So maybe you've already discussed this. I apologize.

>> Livio Soares: That's okay.

>>: So should I think of this as just a confirmation of the computation spreading paper [inaudible]

>> Livio Soares: Yeah, I guess the principles that worked there --

>>: It's the exact same thing, right?

>> Livio Soares: Yeah --

>>: Running the OS on one set of cores and running the apps on the other set of cores?

>> Livio Soares: That's right. So that principle is exactly the same principle that lets you have performance [inaudible] here.
So, yeah, I guess it is a confirmation of that.

>>: Is this particular slide showing me anything other than that? Should I draw any conclusion? Is there anything -- so I guess my question is, if their system ran this exact same benchmark, would I expect to get the exact same results, or is there something different or better that you're doing?

>> Livio Soares: So I guess -- I find it hard to compare their system with my system. So I understand the principle of specializing cores is the same in both works. The way they obtained the specialization is through -- I guess they have two things. They have hardware support for detecting when they should migrate execution over, and they have a VM layer, and that's as much as I could understand from the paper. I didn't talk --

>>: They didn't have hardware support. They were just using traps. They were just saying, you know, when you go into kernel mode, the kernel [inaudible] and then, yeah, they used the VM to produce it, whereas you've modified the kernel scheduler [inaudible]

>> Livio Soares: So what I tried to do in the beginning was emulate some of their work, but unfortunately the only way currently to communicate between cores synchronously is through IPIs. And on my machine that takes several thousand cycles. So when I implemented a version of what they described, I got very bad results. So that's why I assumed they must have had some hardware support to carry over execution in a very inexpensive way. Otherwise, I don't see how they achieved the performance that they did. So my proposal is, okay, I don't have that hardware support. Let me try to change the way applications and the OS interact so that I can sort of emulate what it would have been if I had that hardware support, without really requiring the hardware support.

>>: Just one more thing. So to be sure I'm reading this right, the only thing that got worse was the TLB in user?

>> Livio Soares: Yes, that's the only thing that got worse in user mode, yeah, was the TLB.

>>: I mean, the only thing that got worse, period, was the TLB in user.

>> Livio Soares: In this graph, yes.

>>: Have you looked into why the TLB gets worse in user?

>> Livio Soares: So I tried to do some monitoring, and it seems that what that gentleman pointed out was the case, which is that you have to do more work in user mode before you jump into kernel mode. And when you're doing that work, not necessarily all of the transactions in MySQL are accessing the same data. So you're potentially spanning more data. And when you go back to kernel mode, you have to recollect the arguments that were sent to begin with. So there's a slight increase in TLB misses. That's the best I could infer.

>>: [inaudible] so you're putting more pressure on the TLB.

>> Livio Soares: That's right.

>>: So another question. Is the L3 a shared --

>> Livio Soares: Yeah. So the Nehalem has private L2s and one big 8-megabyte shared L3.

>>: So any insight on why it went [inaudible]

>> Livio Soares: Yeah. So unfortunately this is the problem with relative performance. The L3 actually wasn't -- so if you look at the MPKI, I think it was relatively small. So if you have a small variation on something that small, it looks like a big thing, but it's actually not.

>>: So it's hitting more pages, but the work [inaudible] is very large.

>> Livio Soares: Right. Or the prefetcher is doing a good enough job.

>>: It used to have five L3 misses and now it has one [laughter].
>> Livio Soares: And so for ApacheBench, which is, I guess, one of the best-suited workloads for this technique because you have a 50-50 split between kernel and user, with system call batching we see an 80 to 90 percent improvement as long as we have more than 16 or 32 concurrent requests. And with four cores we only get a little bit better, because you already see a lot of the locality benefits. Sorry, I lost my train of thought. So similar to MySQL --

>>: [inaudible] where you're not decreasing the reduction but you're actually showing an improvement in the positive direction. It seems like you always get improvement once you've overloaded the server for the MySQL results.

>>: Said differently, when you showed slides 33 and 34, you could have gotten a lot of that improvement by simply doing admission control and stopping the green line from falling off the cliff. You didn't actually need to do this sort of improvement, because right here the green line kind of falls off the cliff. If you just sort of stop the system at 40 concurrent requests and use admission control, you'd have most of the benefit. Whereas --

>>: So it may have been that the green line, the baseline, went up, and I would have obtained a similar performance improvement over the baseline.

>>: That seems implausible. Why are they going down now? It's because the system is now thrashing, right? Once you've gone off the cliff -- on this graph the performance improvement comes from sort of keeping the system from falling off the cliff as quickly, and admission control would have given you better throughput by just limiting the number of concurrent requests you handle live, just don't [inaudible] whereas on slide 37 you really are --

>> Livio Soares: So I think some of the benefits you would get from kernel mode execution, even if you had admission control, are likely to -- so you may get a slightly smaller chunk of that, but IPC is improving. So this is not only -- I don't know what type of thrashing you were referring to, but you see at least the kernel improving its execution. I don't see why you would all of a sudden lose that.

>>: I think what Ed is wanting to say is, for percent improvement, just look at the peak of the green line and the peak of the blue line, and that's your true improvement. I was actually hoping he'd ask the general question, which was: is the best benefit for this [inaudible] server, or are there cases where --

>>: There's still a benefit. See, the peak of the blue is much higher.

>>: We'll talk later.

>> Livio Soares: Sure. So for Apache we also see a similar reduction in latency. And here we have a more drastic change in the IPC. You see both user and kernel almost doubling their efficiency, and the misses on several of the structures are reduced. Now, we still have a problem with the TLB in user mode.

>>: I have a question. You have these kernel threads that are sitting there waiting for something to show up in your syscall page. When there's no work for them to do there, what do they do?

>> Livio Soares: So they call the Flex wait system call, which is a synchronous system call which --

>>: [inaudible]

>> Livio Soares: Oh, they wait on a queue. So we keep at most one syscall thread awake per core.

>>: And it does what? Does it spin?

>> Livio Soares: No, it doesn't spin. It checks once to see if there's work to do --

>>: And then what does that core do next?
>> Livio Soares: And then it goes to sleep and --

>>: [inaudible] no system calls to one system call, you have to poke it to wake it up.

>> Livio Soares: Right. So if you spend a long time --

>>: That's fine. If the answer was it was spinning, then your IPC would be uninteresting, because what you'd tell me --

>>: What you actually probably do is wait until the queue fills up the system [inaudible].

>>: [inaudible].

>>: Like you don't want to wake it up for one, you want to --

>>: Well, otherwise your system calls are going to languish. You need to wake it up.

>>: It's fine if [inaudible] the whole idea here is to collect as many system calls as you can.

>>: [inaudible].

>>: This isn't just counting spinning. That's all I wanted to know.

>> Livio Soares: Well, I also measured the number of instructions, and my system does have a few more instructions, but it's under 2 to 3 percent. So it's not all instructions from [inaudible]. Did you have a question?

>>: I was just wondering how -- are you making forward progress? Maybe I'll shut up for a little bit.

>> Livio Soares: Oh, I can end it here [laughter]. I don't need to talk about the -- this was the main message I wanted to give today. So synchronous system calls degrade performance, and I guess what I'm targeting is processor pollution. So even if we were to build slightly better processor support for jumping between modes, you would shave a little bit of that off, but there's something more inherent, which is that executions that don't access the same data tend to perform poorly on modern processors. And so I proposed these exception-less system calls, which are a way to more flexibly schedule execution: either you wait a little bit longer or you can execute on a separate core. And I showed you the Flex threads library, which uses exception-less system calls and tries to give the illusion to the application that it's still issuing synchronous calls. And I've shown you a little bit of the results. Yes?

>>: [inaudible]

>> Livio Soares: Right. So I think the latency improvement comes from the fact that once you reach a certain number of threads, say more than 200 threads executing work, it's very likely that a thread can't execute a whole transaction, or can't do all of its work, in one continuous stretch of time. It may issue some network request and get switched out for a different thread, right? So even in the default Linux case, you're likely to be interrupted several times, with all the threads doing work, so you have to wait a long time before you're scheduled in again --

>>: [inaudible]

>> Livio Soares: It may not be hardware interrupts. It's just that you want to do a read from the socket, and it's not ready, so you have to schedule something else in.

>>: [inaudible]

>> Livio Soares: Right. Or -- yeah. So --

>>: [inaudible]

>> Livio Soares: It was growing?

>>: No, no. I'm asking it [inaudible]

>> Livio Soares: At the 99th percentile we did have higher latencies. That's expected.

>>: [inaudible].

>>: In order to get good batching, you need to have a high rate of requests going through the system.

>> Livio Soares: That's right.

>>: And so with this setup you've got this terrible tradeoff between a high rate of requests and increased latency but also increased [inaudible]

>> Livio Soares: Right.

>>: So I was hoping you were going to tell us about the [inaudible] version, because with the [inaudible], if it works correctly, you don't need to oversaturate your number of threads and number of thread handles to get the same amount of concurrency.
>> Livio Soares: I'm not sure I'm going to have time to talk about that. But I do also see --

>>: Does it have -- does it avoid this [inaudible] with the events?

>>: Do you have a graph like this with the events?

>> Livio Soares: So not here, but it's not as good at reducing the latency in the event-driven case. In the event-driven case -- so what causes high latency here is the number of threads that need to be scheduled in and out. For event-driven execution you have a lot less of having to wait such a long time before you get notified, oh, your thing was ready a long time ago, you could have already executed part of this call. And this is with current epoll systems, for example. So you have a smaller latency to begin with. So event-driven has much smaller latency, I've observed, as the baseline. So my system mildly improves the latency, and that's only because you're executing things a little bit faster.

>>: We can talk about it later. I was hoping you wouldn't -- with the event-driven system, you wouldn't have to oversaturate as much to see a throughput improvement.

>> Livio Soares: I think you need concurrency in requests. You need about 32 or 64 concurrent requests coming in. You're right that it's only a single thread now -- you don't need 64 threads. But you still need a number bigger than 16 or 32. Below that you're still not winning much.

>>: This sensitivity is related to the cost of flushing your pipeline. CPU to CPU it might be different, I mean, quicker. You're observing the difference between the cost of batching your call versus the cost of a pipeline flush. And the more expensive the pipeline flush is, the less concurrency is needed to recover that cost through batching. Is that accurate?

>> Livio Soares: Yeah, the flushing and the amount of pollution you do as well. And I think I add some overhead from my system as well that needs to be recovered. So if you don't have enough volume of system calls to execute, the overhead I introduce ends up costing you. So in that case you're better off just doing a synchronous call.

>>: Did you do any experiments where instead of waiting to batch you simply execute the system calls immediately, avoiding the exception? So now there's no penalty to batching. You're just purely covering the pipeline costs.

>> Livio Soares: Right. On a separate core?

>>: Yeah.

>> Livio Soares: Yeah. So as long as I had something to execute, I didn't wait until --

>>: You didn't wait?

>> Livio Soares: No, I didn't wait. So as long as I had something to execute I would try to execute. I don't think I did experiments where I explicitly waited for a certain number of -- well, first of all, it's hard to tell, right? Because you're sleeping right now, so it's hard for you to wait until -- you don't even know how many system calls there are unless user space is willing to send you a wake-up, and that's going to cost you.

So I'm going to skip this part, where we were trying to improve the performance of last-level caches by dynamically profiling applications using hardware performance counters, looking at the miss and hit rates, such as in the example here, seeing that some parts of your address space behave better than others, and using some sort of page coloring to restrict their use of the cache. And I call that the pollute buffer. We get some improvement. But I think it's more interesting to talk a little bit about future work.
So as part of my future work I guess I'd like to continue the theme of efficient computation. As I started talking about in the intro, before 2004 every decade or so we saw a 30x clock rate improvement, and probably more than that in application improvement because of growing caches and whatnot. And according to somewhat recent estimates, we're going to get something between two and three times clock rate improvement every decade after 2004, and as I said, most of that is due to thermal and power issues. But because of these last decades, we have had a hard boundary between the hardware and the software layer, where we haven't changed the [inaudible] too much, and it's been really good: hardware has grown very fast, and it allowed software to grow independently of the hardware. But the most promising solutions to the thermal and power problem that I'm aware of -- like multicore and specialized execution, and I guess GPUs are an example of that, and you can get even more aggressive and say let's do analog circuitry -- those greatly impact the hardware-software interface.

So as part of my future work I guess I'd like to continue my current line of trying to rethink ways to build run time systems that focus on efficiency of execution, and potentially even ask the hardware architects to make slight changes to their basic architecture to support a more efficient run time system. So there are a few things that I would, I guess, want to suggest. First of all, I've been doing a lot of work in performance monitoring with PMUs, and I'd like to do more dynamic adaptation, and for that we need to monitor more types of things and make it cheaper to monitor. I think it would be great to spend some of the resources -- the abundance of transistors that we have -- to support better monitoring of execution. And then, hopefully, with that information you can try to adapt the run time system, and I think specialization is a big part of adapting the run time system to try to match whatever the application is doing with whatever the hardware sees. But I think there's a point at which you can't do things transparently anymore, so it's likely that we may have to ask application writers to change the way they've been writing applications and maybe support a higher level interface to the run time, so that there's more room to make this transparent adaptation to match the hardware. So that's it for today. Thank you for coming to my talk.

[applause]