>> Chris Hawblitzel: So hello, everybody. Please join me in welcoming -- [laughter]. So let me say a few words about Livio. Livio has been doing a lot of
interesting work in using operating system technology and improvements to
operating systems in order to squeeze more performance out of the commodity
hardware. And so today he's going to be telling us about exception-less system
calls and how you can get better performance that way.
So please take over.
>> Livio Soares: Hi. Thanks for having me here. I'm very excited to come visit
MSR.
So as Chris mentioned, I'm going to talk about a new operating system primitive
that I call exception-less system call, and it's a new way for applications to
request services from the operating system kernel. And we believe that it has
different performance characteristics that are more compatible with modern
computer architecture than the traditional way of using a software trap or an
exception for these two to communicate.
But before I get into the details of exception-less system call, I want to talk a little
bit about the motivations for this type of work. And it seems that it's no news to
you that energy has been constraining the way we've built computers in the last
few years, and it seems like it will continue to constrain or restrict in some way.
So we have, beginning at the microprocessor level, an inability to increase the clock frequency at the rate that we did in the past.
So in 2004, I guess, when I started my graduate school, this was already the
case, and it's still the case today. And in the near future it looks like we're going
to see dark silicon because we can't even power all of the transistors, let
alone increase the clock rate. And this also has effects at a larger scale which
are data centers.
So we have energy starting to dictate operational costs for data centers indirectly
through powering the actual machines -- sorry, directly through powering the
machines and indirectly through cooling all of this heat that gets dissipated. And it seems like it is also starting to limit server density, because you can't feed enough power to a single data center.
So just as an example, this is a pretty old report from 2007 from the EPA drawing
potential curves for U.S. energy consumption. So from 2000 to 2007 it's pretty accurate. From then on they draw some different scenarios, and in 2007 in particular 1.5 percent of all U.S. energy was consumed by data centers.
I was trying to get newer numbers. It seems like estimates now are between 2 and 3 percent. So we have seen that the curve has increased. It hasn't really
gone to the best practice scenario.
>>: Like all energy or all electricity production?
>> Livio Soares: That's a good question. I think it's --
>>: Well, no, like thermal energy, like gasoline or natural gas or --
>>: [inaudible].
>>: Sorry
>> Livio Soares: And I guess more interesting is the phenomenon behind the
growth of the energy, which is the demand for computation. So here I have the
installed server base. So this is not only servers in data centers, it's servers in
either your private corporation or data centers, but I guess in recent times it's mostly been in data centers -- for the U.S. and the world -- and you can see the curve follows a similar shape.
And so we see a four times increase in the number of servers for the last decade.
And if we consider that over the last decade you had a 30 times increase in server performance, you can do a back-of-the-envelope calculation: we have at least a 100 times increase in demand for server computation.
This is probably an underestimate. There's virtualization, which became popular starting in the early 2000s and made it possible to consolidate a lot of computation onto a single server, so it may be even more than 100-fold.
And there's no reason why this phenomenon won't continue. Especially now that
we have mobile computation becoming so popular, offloading parts of the
application to the server side is going to increasingly continue to occur, and I
think we'll see this trend going forward.
So given these trends, we would like applications to run very efficiently so we can
reduce the number of machines we have and reduce the power that we require
from those machines.
So what I have graphed here are several workloads, mostly from SPEC 2006, divided up into SPEC Int, SPEC FP, and a couple of server workloads, and I'm showing the instructions per cycle, IPC, on an Intel Core i7 processor.
So in theory the Intel Core i7 processor can deliver up to four instructions every
cycle. In practice, you see that even the highest IPC workload barely scratches
two, and what's worse for the servers, you have the lowest among these
applications. So this is also not new. But it seems like given these trends, we
have a computer systems challenge which is to enable efficient computation.
Now, the focus of my graduate work has been a little narrower than that: I've just been looking at run time and operating system techniques to help run applications more efficiently. So this is the outline of my talk for today.
So mostly I'm going to talk about exception-less system call, so I'm going to try to
outline the problem with synchronous system calls and then describe our
implementation of exception-less system call, and talk a little bit about how that changes the interface between the application and the kernel. So we have two ways to interface with the kernel: a threading mode that supports threaded applications mostly transparently, and a way to do event-driven execution, which is not transparent to the application -- you must modify the code.
And if I get a chance, I'll talk about some previous work that I did on helping the
performance of last-level processor caches, which are known to have a lot of pollution -- they have very low reuse rates -- and I have a technique called a software pollute buffer which tries to dynamically monitor applications and
use page coloring to improve their performance. And maybe I'll talk a little bit
about future work.
So as I was mentioning, the performance of servers in terms of IPC is quite low
compared to other types of workloads, and this is mainly because of locality. And there are, I guess, two types of locality that we can reason about: temporal locality,
which is reusing items in a short period of time, and spatial locality, which is
reusing items, for example, on the same core or that are physically close
together when we talk about memory, for example.
And all processor structures seem to exploit one or the other type of locality. So
we have caches -- instruction caches, data caches -- branch predictors, TLBs, and studies show that servers exhibit low levels of locality when compared to more traditional scientific workloads.
And some of the reasons are inherent to what servers do. So servers are request driven, so it's very hard to predict ahead of time what you're going to access, or the pattern in which you're going to access it. It's harder than in a more traditional workload.
You also have to support multiple concurrent requests, which potentially requires
a server to touch data from different parts of your data set, and there are things
that are not inherent to servers but are just the way we engineer them: they happen to have large instruction and data footprints, and they are heavy users of operating system services.
And the heavy use has non-trivial performance implications. So a quick reminder
on how applications access kernel services. So usually you have a system call
which is based on a software trap or an exception. So you have some assembly
here where you move the system call number to a register and maybe some
arguments, and you issue a syscall instruction which immediately and synchronously jumps into kernel mode, into a previously declared handler. You handle that system call and return to the application.
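Just as a rough sketch of what that slide describes -- not the exact code shown in the talk -- a conventional synchronous write on x86-64 Linux boils down to something like this:

/* Minimal sketch of a conventional, synchronous system call on x86-64
 * Linux: the syscall number and arguments go into registers, the
 * `syscall` instruction traps into a previously registered kernel
 * handler, and `sysret` returns with the result in rax. */
static long raw_write(int fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)                    /* result comes back in rax   */
                     : "a"(1),                      /* __NR_write is 1 on x86-64  */
                       "D"(fd), "S"(buf), "d"(len)  /* rdi, rsi, rdx              */
                     : "rcx", "r11", "memory");     /* clobbered by syscall       */
    return ret;
}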
Now, both the syscall and the sysret are synchronous, and they perform a few things during the mode switch which cost a few cycles. And traditionally
people have tried to measure the overhead of system calls, usually in a sort of
controlled microbenchmark where you have a tight loop where you read the time
stamp, issue a very simple system call that basically doesn't do anything in the
operating system, returns, and you note the time difference.
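A minimal version of that kind of microbenchmark -- the specific call, getppid(), and the iteration count here are assumptions, not what was used in the talk -- looks roughly like this:

#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>                    /* __rdtsc() */

/* Tight loop timing a near-null system call with the time stamp
 * counter. This captures only the direct mode-switch cost, not the
 * indirect cache/TLB pollution discussed later in the talk. */
int main(void)
{
    const unsigned long iters = 1000000;
    unsigned long long start = __rdtsc();
    for (unsigned long i = 0; i < iters; i++)
        (void)getppid();                  /* does almost no work in the kernel */
    unsigned long long end = __rdtsc();
    printf("~%llu cycles per system call\n", (end - start) / iters);
    return 0;
}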
And what I have here is this type of benchmark over a series of processors. On
the left you have the oldest and on the right you have the newest, and I have
both Intel processors and IBM processors.
There seems to be a trend in recent history of trying to improve the performance of this exception going back and forth.
Now, unfortunately when I did this I thought, okay, system calls are not
problematic, and it seems that that data is a little misleading when it comes to
understanding the full impact that system calls have on performance.
So let me show you a different experiment. So we took one of those benchmarks
from SPEC 2006, Xalan, which has been engineered to virtually spend no time in
the operating system. So it does very few system calls just to do the setup in the
beginning. And I ran this on the Intel core i7, and I artificially injected exceptions
in the execution of Xalan, and I did two types of exception handling, a null
exception handling which would try to show us what the direct cost of issuing
system calls is, and we did a pwrite system call inside the exception handler
which is trying to show a little bit the indirect costs of actually doing some work
while you're in the operating system.
And in the graph that I'm going to show I only measured user mode time, so I
completely ignored what was happening inside the kernel. And in an ideal world
we would see the performance of user mode completely unaltered.
>>: Does your ability to select and measure just user mode time depend -- when you say that, you really mean you measured time up until the kernel switched the [inaudible] over, right? That switch between user mode and kernel mode time happens in the operating system somewhere behind the context switch. Is that right?
>>: Or, basically, how did you measure that?
>> Livio Soares: Yeah, so I used the hardware performance counter facility,
and they allow you to specify what type of cycles you want to monitor.
>>: So the hardware does this transition for you right at the moment of the
exception.
>> Livio Soares: Right. So I don't know exactly how close it is --
>>: [inaudible]
>> Livio Soares: Yes, it is.
>>: [inaudible]
>> Livio Soares: So here we have the results of the degradation of Xalan due to
the synchronous system calls. So on the orange, or lower, part of the graph we have the direct cost of that null exception, so we see up to a 45 percent slowdown of the application when we have an exception every thousand instructions. And I grow the x axis on a log scale.
And then the blue part at the top of the graph is showing the indirect cost when you actually do work inside of the operating system. So remember, I'm just
measuring user mode time.
And just to give you some reference, a very simple benchmark running on
Apache and another one on MySQL, this is the range that we're looking at for these types of server applications: between 2,000 and 10,000 instructions between system calls. So in that range we see that the blue part is more predominant than the orange part.
So the fact that we're doing work inside the operating system is slowing down the
user application.
>>: [inaudible] how did you get every x thousands instructions?
>> Livio Soares: So I also used the hardware performance counter. So I didn't
want to modify the application because it's hard to know exactly x number of
instructions. So I programmed the hardware counters to trigger an exception
every x number of instructions, user mode instructions.
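One way to approximate that kind of instruction-count-triggered exception from user space -- this is a sketch using Linux's perf_event_open interface, not necessarily how the experiment in the talk was implemented -- is to have the counter deliver a signal every N user-mode instructions:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>

static int perf_fd;

/* The "injected exception": do nothing here (direct-cost case) or
 * issue a syscall such as pwrite() (indirect-cost case), then re-arm. */
static void on_overflow(int sig)
{
    (void)sig;
    ioctl(perf_fd, PERF_EVENT_IOC_REFRESH, 1);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.sample_period = 10000;        /* one signal every 10,000 instructions */
    attr.exclude_kernel = 1;           /* count user-mode instructions only    */
    attr.disabled = 1;

    perf_fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    signal(SIGIO, on_overflow);
    fcntl(perf_fd, F_SETFL, O_ASYNC);  /* deliver a signal on counter overflow */
    fcntl(perf_fd, F_SETSIG, SIGIO);
    fcntl(perf_fd, F_SETOWN, getpid());
    ioctl(perf_fd, PERF_EVENT_IOC_REFRESH, 1);

    /* ... run the workload being perturbed (e.g. Xalan) here ... */
    return 0;
}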
>>: And did you then run that with the null system call to measure the overhead of your exception injection? Because that's going to pollute the cache and --
>> Livio Soares: Right. Yes. So the baseline is you just -- well --
>>: [inaudible].
>>: Well, the direct overhead here is going to be very different from the syscall
instruction I think is what Bill is getting at. Taking a [inaudible] from the
[inaudible] makes one exception call [inaudible]
>> Livio Soares: Right. So here they were using the same path. So the direct is actually handling the --
>>: So direct is just handling the -- the orange part of this is the overhead of your exception injection technique?
>> Livio Soares: That's right.
>>: Okay. So what we want to do is look at the difference between the orange
and the blue. Okay.
>> Livio Soares: Well, no, the orange you would see just the same if you really were to issue a system call instruction.
>>: Well, in a real program I'm not going to have timers that go off and generate
system calls, I'm going to execute my program logic and decide it's time for
system calls. What I'm trying to do is isolate the performance perturbation of the timer that you stuck in there to generate the system calls from the system call itself, because --
>> Livio Soares: So the actual handling of the exception for the performance counter is not timed, right? Because I'm showing user mode time here only.
Now, what you may be implying is that handling a system call looks a little bit different than handling a performance counter exception. Yes, but the code was written in a way that we're really just issuing some hundred or two hundred instructions. This is a very lightweight exception handler in the performance counter case. And it's already causing this much of a problem in user mode, and I'm not timing that kernel part of it.
>>: [inaudible]
>> Livio Soares: So probably it's not the exception thrashing the caches, because for that piece of code you're really modifying half a dozen instruction lines and data lines. It's really the pipeline being hammered.
>>: It's thrashing the pipeline?
>> Livio Soares: Yeah, it's thrashing the pipeline, right. So the direct effect is
you're hammering the pipeline. It takes about a hundred or so or more cycles
just to start warming up your pipeline. And actually -- I don't have this graph
here, but I have another graph which was showing how long it took to recover
for -- user mode IPC to recover from a single exception. And it can take up to
thousands of cycles.
>>: So have you looked at this on a CPU with a really short pipeline, or with something that doesn't do out-of-order execution, like an Atom or something?
>> Livio Soares: No.
>>: Okay. If you have a short pipeline on a CPU does that mean you issue lots
of system calls [inaudible]?
>> Livio Soares: Yeah, I think that would be the case. So I tried this on a PowerPC, which has a pretty aggressive pipeline as well, and on Intel, but my intuition is yes.
So if you have a very short pipeline, you don't have much state to warm up to
begin with, then the performance perturbation you get -- the direct performance
perturbation that you get is going to be less. But the indirect is still there.
And the indirect is really the thrashing of computer architecture state. So when I
measured a single Linux write call it seems like two-thirds of the L1 cache, both
the data and the instruction -- well, no the instruction was less, but the data and
the TLB were evicted during that write call.
So at least for Linux, it does cause a lot of pressure on these low-level
performance sensitive structures of the OS. And as you can imagine, the
performance of the operating system is going to be equally affected by the
application. So if you were to measure the IPC of the kernel, and because the
kernel doesn't get to execute a lot, it's just a request-driven piece of software,
you see that it gets affected tremendously.
Yes?
>>: Is this dependent on the parameters of the write, like how many bytes you're
writing?
>> Livio Soares: I haven't explored that too much. I don't think -- the
experiments I've done, it doesn't really matter. Something close to this even occurs on a read, which you would expect to have a slightly different code path.
>>: [inaudible]
>> Livio Soares: It's not necessarily a lot of data. It's scattered data. So I guess
the operating system may not be optimized to be TLB sensitive, so it may be
touching a few variables along its path. If these variables don't sit together on a
few pages, then --
>>: [inaudible]
>> Livio Soares: Sorry?
>>: [inaudible]
>> Livio Soares: So it could be more sensitive to what you're going to write, but there are some fundamental things that you do that are independent of the actual data. So checking the processor -- the current processor -- the permissions, the file descriptor table: all these things that the operating system does along the way to get to the actual write are causing perturbation at the architecture level.
And so we found that synchronous system calls are expensive and not
necessarily because of the exception per se, although they can have costs at the
pipeline. They're also expensive because they don't play fair with the way
computer architectures are designed.
So what we're proposing is to try to sidestep this boundary so that we can have
longer periods of time executing in user mode and longer periods of time
executing in kernel mode so that we don't wipe the pipeline clean as frequently
and we also try to reuse as much of the instructions and data as possible.
So exception-less system call is our way to try to decouple execution of the
application from the execution of the operating system. And in it there are two main parts. One is the interface: we use a shared memory structure so that the application can write requests to it and read results from it. And the other is a mechanism to execute those requests asynchronously in Linux, which we call syscall threads.
So let's try to give a high-level overview -- by the way, we call our system Flex, for flexible system calls. So I guess there are three main components. The first
is the actual implementation in the operating system that includes what I
mentioned, the interface and the execution threads. And there's some support
for applications.
So we have a Flex threads library which is binary compatible with current Linux pthreads, and a libflex library which is used for event-driven applications.
And note also we can have regular traditional applications that don't use
exception-less system call and it shouldn't interfere with our system at all. So
they are supported.
So the benefits of exception-less system calls, as I mentioned, are that you try to reduce the number of mode switches, saving on the direct costs, and then you can do a few optimizations. So one of them is you can batch a lot of system calls to be executed as a chunk, so that you can execute a lot of things in the application
space, and then when you enter the kernel you have a lot of work piled up to do,
so that's batching. And that can occur even on a single core.
Then if you do have multicore, you can do a little bit of parallel execution, where you put messages on this shared page to execute remotely, and this allows you to somewhat specialize the contents of the caches or other structures on the processor to execute either in user mode or in kernel mode.
So let me begin by talking about the interface, which we call the syscall page. So
on the right we have the diagram of a syscall page. It's structured as a series of
entries. Each one corresponds to a system call, so it has things like system call
number, the arguments, a status code and the return code from the system call.
And on the left you'd have the code necessary to issue, let's say, a write system
call.
So the application wants to do a write. Instead of issuing a syscall instruction, it
runs this piece of code where you find a system call entry that's free, you fill in
the argument, so we have an example there at the bottom, and you tell the
operating system, okay, this is ready to execute. Submit.
And then you try to find something else to do in the meanwhile. And once the
operating system notices that, it goes, executes the system call, updates this
memory with done and the return code, and finally the application can move on
from there.
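To make the shape of this concrete, here is a minimal sketch of what such an entry and the user-side submission path could look like. The field names, the atomic status flag, and flex_write() itself are assumptions for illustration, not the actual Flex interface (and, as discussed next, the real design avoids locking by giving each core its own page):

#include <stdatomic.h>
#include <stdint.h>

/* One entry of a shared syscall page: the application fills in a
 * request and flips status to SUBMITTED; a kernel-side syscall thread
 * later executes it and flips the status to DONE with the result. */
enum { ENTRY_FREE, ENTRY_SUBMITTED, ENTRY_BUSY, ENTRY_DONE };

struct syscall_entry {
    _Atomic int status;     /* FREE / SUBMITTED / BUSY / DONE */
    int         sysnum;     /* e.g. __NR_write                */
    uint64_t    args[6];    /* raw system call arguments      */
    int64_t     ret;        /* return value, valid once DONE  */
};

/* Exception-less write: post the request and return immediately; the
 * caller (or a threading library acting for it) checks for ENTRY_DONE
 * later. No trap, no mode switch on this path. */
static struct syscall_entry *
flex_write(struct syscall_entry *page, int nentries,
           int fd, const void *buf, uint64_t len)
{
    for (int i = 0; i < nentries; i++) {
        struct syscall_entry *e = &page[i];
        int expected = ENTRY_FREE;
        if (atomic_compare_exchange_strong(&e->status, &expected, ENTRY_BUSY)) {
            e->sysnum  = 1;                 /* __NR_write on x86-64 */
            e->args[0] = (uint64_t)fd;
            e->args[1] = (uintptr_t)buf;
            e->args[2] = len;
            atomic_store(&e->status, ENTRY_SUBMITTED);
            return e;
        }
    }
    return 0;                               /* page full: caller must fall back */
}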
But now you need something to execute this request --
>>: [inaudible]
>> Livio Soares: Sure.
>>: [inaudible]
>> Livio Soares: Right. So I guess I can answer --
>>: If you're going to get to it, that's fine --
>> Livio Soares: No, I can answer right now.
So we try to map one of these pages or several of these pages per core, and at
that core the library guarantees that there's only one thread running that has
access to that page at a time. So you may not need locking necessarily, but
depending on your execution model, yes, it may be necessary to do some locking. But my implementation right now doesn't require a lock.
>>: [inaudible] then that means that to do a system call, you have to
communicate with that thread, which is going to require a lot, right?
>> Livio Soares: So I guess I'm going to explain my threads library in a few
slides, and I think that will make it clear.
So inside the operating system you need some context to actually execute these
system call requests in the background. So the way Linux, or most Unixes, does it is when you trap into the kernel, you use that thread's kernel stack to go ahead and start executing
the system call.
Now, you don't have that luxury right now because your application thread is off
doing user space work. So you need some context -- some stack and an
execution environment -- to go and do the work.
So this is what we call syscall threads. So they're kernel-only threads, and their
job is to go pick up submitted slots from the syscall page and execute them. Just for the sake of our implementation, we pin syscall threads per core, to also help with some of the locking issues.
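Conceptually, each syscall thread is just a per-core loop over its syscall page. The sketch below is pseudocode in the spirit of the design, not actual Linux kernel code; do_syscall() and sleep_until_woken() are made-up helpers standing in for the kernel's syscall dispatch and its wait-queue machinery, and it builds on the struct syscall_entry sketched above:

/* Per-core syscall thread (pseudocode): execute submitted entries,
 * then block until the user side falls back into the kernel. */
static void syscall_thread_main(struct syscall_entry *page, int nentries)
{
    for (;;) {
        int did_work = 0;
        for (int i = 0; i < nentries; i++) {
            struct syscall_entry *e = &page[i];
            if (atomic_load(&e->status) != ENTRY_SUBMITTED)
                continue;
            atomic_store(&e->status, ENTRY_BUSY);
            e->ret = do_syscall(e->sysnum, e->args);   /* dispatch via the syscall table */
            atomic_store(&e->status, ENTRY_DONE);
            did_work = 1;
        }
        if (!did_work)
            sleep_until_woken();    /* stay off the CPU while the page is empty */
    }
}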
So as I mentioned, one of the optimizations you can do is system call batching.
So the goal is to queue up as many system call requests as possible, and once you're done doing application work -- you have nothing more to do, you have to get a reply -- you can switch down to the kernel, execute the system calls, and return.
The other slightly more interesting thing is you can specialize cores. So an
example of this is you have a 4-core setup where the top two cores here execute
user mode threads only. So they're just producing system calls, never entering the kernel.
And then the bottom left one, you can have a kernel-only core that's only
executing syscall threads.
And because sometimes a split is not perfect between user mode and kernel
mode, you may still have a core that mixes execution between them.
Yes?
>>: So in pthreads-based code all these system calls are synchronous, all the
threads that make these system calls are blocking, waiting for a result, right? So
how are you extracting parallelism if you can't make another call -- you can't
execute any more instructions until you know the result of the system call
[inaudible]?
>> Livio Soares: Right. So that's why I did the Flex thread library, which I'm
going to get to in a couple of slides, which tries to give the illusion to that pthread that the call is still synchronous. Right.
>>: But that instruction counter, the instruction pointer to that thread can't make
progress until [inaudible]
>> Livio Soares: That's correct.
>>: So where do you get the parallelism? Where are you doing more work?
>> Livio Soares: So I'm going to explain that. That is an important part. It's coming soon.
So what programs can benefit from Flex? So the first thing we've done is tried to
extract parallelism from threading applications, so we built Flex threads which I'll
explain in more detail in the next slide.
And the other mechanism is if you could program directly to this, so change the
application so that it knows that these things are asynchronous, what would you
do?
And so we've ported memcached and nginx, which is a popular web server and is already event-driven, to use the Flex facility, and we built a little library that helps these servers execute asynchronous system calls.
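The talk doesn't show libflex's actual API, but purely as an illustration of the idea, an event-driven server could reap completions in its main loop instead of blocking per call, along these lines (the pending structure and the callback convention are made up, and this builds on the syscall entry sketch from earlier):

/* Hypothetical event-loop integration: requests were posted earlier
 * with something like flex_write(); each loop iteration reaps any
 * completed entries and runs their continuations. */
struct pending {
    struct syscall_entry *e;
    void (*done)(int64_t ret, void *arg);
    void *arg;
};

static void reap_completions(struct pending *p, int npending)
{
    for (int i = 0; i < npending; i++) {
        if (p[i].e && atomic_load(&p[i].e->status) == ENTRY_DONE) {
            p[i].done(p[i].e->ret, p[i].arg);           /* run the continuation  */
            atomic_store(&p[i].e->status, ENTRY_FREE);  /* recycle the page slot */
            p[i].e = 0;
        }
    }
    /* ... then poll the network and post new exception-less requests ... */
}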
So the Flex thread library is a hybrid M-on-N threading library. So you have one kernel-visible thread per core, but on top of that you have a lot of user-level threads. So I guess Microsoft has
had this notion of fibers for a while now, and it's a little bit similar to that.
And the goal of Flex thread library is to redirect system calls to the system call
page, and then because we have a user mode threading library we can switch to
a different thread to continue doing some other work.
And so -- go ahead.
>>: So it seems like the most obvious danger here is that now you're going to
cover more territory and exercise your cache -- you know, pull out your caches
>> Livio Soares: Right.
>>: So how are you going to keep that from happening?
>> Livio Soares: So I don't keep it from happening.
>>: [inaudible] a better tradeoff than letting the kernel blow out your caches.
>> Livio Soares: That's right. So I guess it may be workload specific. Some of
the workloads that we have -- the work that is being done in each thread is
similar enough that it's much more beneficial to keep doing user space work than
it is to jump into the kernel. But you're right, it could happen that your
application -- each thread is doing very different work --
>>: [inaudible]
>> Livio Soares: Yeah, and touching different data, and then doing this may
result in you're using the caches in a worse way than you would have otherwise.
Yeah.
So let me try to illustrate a little bit better how the Flex thread library works. So
we have one single thread visible to the operating system per core. So that's the
white Flex thread there. And on top of it we can schedule multiple user level
threads, and these are what the pthread programmer thinks about. And below
we have the syscall threads that we had before.
So you start executing a user level thread, and it wants to do a system call, so
the library simply dispatches the request to the system call page and blocks.
And so we get the parallelism by choosing another user level thread and
continuing execution.
So we keep that thread blocked until the results come back, and then we can
unblock it. And from the thread's perspective, it was a synchronous call.
So then eventually we may run out of work to do, and if there's no core in the background doing operating system work, there's a fall-back mechanism, which is you just synchronously jump into the kernel -- say, a wait call -- and then the kernel tries to finish executing at least one of the system calls before it wakes up the Flex thread again.
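Putting the pieces together, the pthreads-compatible wrapper for a blocking call might look roughly like the following. flex_wait(), uthread_yield(), have_runnable_uthreads(), syscall_page and NENTRIES are assumed names, and the real library parks the blocked user-level thread rather than polling, but the control flow is the same idea (this builds on the flex_write() sketch from earlier):

#include <sys/syscall.h>
#include <unistd.h>

/* Sketch of how an M-on-N library keeps the synchronous illusion:
 * post the request, run other user-level threads on this core, and
 * only trap into the kernel (flex_wait) when nothing else is runnable. */
ssize_t write(int fd, const void *buf, size_t len)
{
    struct syscall_entry *e =
        flex_write(syscall_page, NENTRIES, fd, buf, len);
    if (!e)
        return syscall(SYS_write, fd, buf, len);   /* page full: plain syscall */

    while (atomic_load(&e->status) != ENTRY_DONE) {
        if (have_runnable_uthreads())
            uthread_yield();    /* keep doing user-mode work on this core      */
        else
            flex_wait();        /* synchronous fallback: kernel runs the queue */
    }
    ssize_t ret = (ssize_t)e->ret;
    atomic_store(&e->status, ENTRY_FREE);
    return ret;
}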
>>: What happens -- with kernel threads, if one of them's busy, it will get time-sliced out.
>> Livio Soares: Right.
>>: How do you deal with that with the user thread? So say you have a user
level thread that doesn't make syscalls and that needs the processor a hundred
percent? Will it starve the rest of the threads in that core?
>> Livio Soares: In my implementation, yes. I guess there has been some work
in hybrid threading, this m-on-n threading. You can send a signal up to the
threading library and say oh, this much time has been spent already. So I
haven't actually needed it because none of the workloads -- we're talking about a
few thousand cycles where each thread does some work. But, yeah, you would
have that problem if you have a CPU-hungry thread in there.
So let me quickly try to show you the evaluation. So we implemented this on
Linux 2.6.33. We had a Nehalem processor which is not very aggressive -- it was a cheaper model, 2.3 gigahertz, four cores on a chip. We have clients connected with a 1 gigabit network, and I'm going to show you two workloads: sysbench, which is a simple OLTP-style workload running on MySQL, whose execution results in 80 percent in user space and 20 percent in kernel
space.
And then there's ApacheBench, which tries to hammer the Apache web server asking for a static web page. And I'm going to show you the graphs. Whatever is
denoted as sync is the default Linux threading library and Flex is our Flex thread
library.
So the sysbench on MySQL with the single core. So on the x axis we have
increasing concurrency, so more concurrent requests from the client, and on the y axis the throughput that the client sees in requests per second.
So we have a modest gain here of 15 percent with Flex when you have enough
concurrency, so between 64 and 128 we really see some gains. But more
interesting are the four-core numbers.
So in the one-core case it was only doing the system call batching. In the four-core case, because we had an 80/20 percent split, we mostly have one of the cores doing the kernel work for three other cores that are issuing system calls.
So we are able to specialize the caches a little bit more, and we see a better
improvement than before.
Yes?
>>: Can you go back a slide? Okay. Please continue. I just wanted to get
[inaudible]
>> Livio Soares: Sure.
>>: Does this require any modification of the application or was this just --
>> Livio Soares: So this is the same MySQL binary running on both the baseline and mine.
And so, well, there's the latency which is also important for certain server
workloads, and we see that on average we have about a 30 percent decrease in
latency. If you were to consider the 95th percentile, it can be a little bit more,
especially there's some anomalous cases like the single core running on default
Linux.
But, yes, this is the graph that I like the most, which is somehow validating the
intuition of the work. It's showing different processor metrics that I collected
using hardware performance counters running the four-core scenario.
So on the left we have user mode execution and on the right kernel mode
execution, and there are two types of bars: the IPC bars, which are in orange or brown, and for those higher is better -- the higher it is, the more efficiently you're executing. And then the blue bars, which are misses per instruction. So you want fewer misses, and fewer misses will result in better execution.
So you see that some of the metrics in user mode don't get affected too much,
and this is somewhat expected since 80 percent of the time you are executing in
user mode. So user mode is probably dominating those structures.
Yes?
>>: Is MySQL benchmark touching discs?
>> Livio Soares: No. So this setup that I have was targeted -- yeah, it was in
memory. So it was trying to get the full use of the CPU.
>>: Okay.
>>: And improvement here, this is with the four-core, so you're doing the
specialization?
>> Livio Soares: Yes, that's correct.
>>: So maybe you've already discussed this. I apologize.
>> Livio Soares: That's okay.
>>: So should I think of this as just a confirmation of the computation spreading
paper [inaudible]
>> Livio Soares: Yeah, I guess the principles that worked there --
>>: It's the exact same thing, right?
>> Livio Soares: Yeah --
>>: Running the OS on one set of cores and running the apps on the other set of cores?
>> Livio Soares: That's right. So that is exactly the same principle that lets you get the performance [inaudible] here. So, yeah, I guess it is confirmation of that.
>>: Is this particular slide, is this showing me anything other than that? Should I
draw any conclusion? Is there anything -- so I guess my question is if their
system ran this exact same benchmark, I would expect to get the exact same
results, or is there something different or better that you're doing?
>> Livio Soares: So I guess -- I find it hard to compare their system with my
system. So I understand the principle of specializing cores is the same in both
works. The way they obtained the specialization is through -- I guess they have
two things. So they have hardware support for detecting when they should migrate execution over and they have a VM layer, and that's as much as I could understand from the paper. I didn't talk --
>>: They didn't have hardware support. They were just using traps. They were
just saying, you know, when you go into kernel mode, the kernel [inaudible] and
then yeah, they used the VM to produce, whereas you've modified the kernel
scheduler [inaudible]
>> Livio Soares: So what I tried to do in the beginning was emulate some of their
work, but unfortunately the only way currently to communicate between cores
synchronously is through IPIs. And in my machine it takes several thousand
cycles.
So if I were to implement a version of what they described, I had very bad
results. So that's why I assumed they must have had some hardware support to
carry over execution in a very inexpensive way. Otherwise, I don't see how they
achieved the performance when they did.
So my proposal is, okay, I don't have that hardware support. Let me try to
change the way applications and OS interact so that I can sort of emulate what
would have been if I had that hardware support without really requiring the
hardware support.
>>: Just one more thing. So to be sure I'm reading this right, the only thing that
got worse was TLB in user?
>> Livio Soares: Yes, that's the only thing that got worse in user mode, yeah,
was the TLB.
>>: I mean, the only thing that got worse, period, was TLB in user.
>> Livio Soares: In this graph, yes.
>>: Have you looked into why TLB gets worse in user?
>> Livio Soares: So I tried to do some monitoring, and it seems that I guess
what that gentleman pointed out was the case which -- you have to do more work
in user mode before you jump into kernel mode. And when you're doing that
work, not necessarily all of the transactions in MySQL are accessing the same
data. So you're potentially spanning more data.
And when you go back to kernel mode, you have to recollect the arguments that
were sent to begin with. So there's a slight increase in TLB pressure. That's the best I
could infer.
>>: [inaudible] so you're putting more pressure on the TLB
>> Livio Soares: That's right.
>>: So another question. Is the L3 a shared --
>> Livio Soares: Yeah. So the Nehalem has private L2s and one big 8-megabyte shared L3.
>>: So any insight on why it went [inaudible]
>> Livio Soares: Yeah. So unfortunately this is the problem with relative
performance. So L3 actually wasn't -- so if you look at the MPKI, I think it was
relatively small. So variations on that, if you have a small variation, it looks like a
big thing, but it's actually not.
>>: So it's hitting more pages, but the work [inaudible] is very large.
>> Livio Soares: Right. Or the prefetcher is doing a good enough job.
>>: It used to have five L3 misses and now it has one [laughter].
>> Livio Soares: And so for ApacheBench, which is, I guess, one of the best
suited workloads for this technique because you have 50-50 use of kernel and
user, so with system call batching we see an 80 to 90 percent improvement as
long as we have more than 16 or 32 concurrent requests.
And with four cores we only get a little bit better because you already see a lot of
the locality benefits.
Sorry. I lost my train of thought.
So similar to MySQL --
>>: [inaudible] where you're not decreasing the reduction but you're actually
showing an improvement in the positive direction. It seems like you always get
improvement once you've overloaded the server for the MySQL results.
>>: Said differently, when you showed slides 33 and 34, you could have gotten a lot of that improvement by simply doing admission control and stopping the green line from falling off the cliff. You didn't actually need to do this sort of improvement because right here the green line kind of falls off the cliff. If you just sort of stop the system at 40 concurrent requests and use admission control, you'd have most of the benefit. Whereas --
>>: So it may have been that the green line, the baseline, went up and I would
have obtained similar performance improvement over the baseline.
>>: That seems implausible. Why are they going down now? It's because the
system is now thrashing, right? Once you've gone off the cliff that would be -- on
this graph the performance improvement comes from sort of keeping the system
from falling off the cliff as quickly, and admission control would have given you better throughput by just limiting the number of concurrent requests you handle live, just don't [inaudible] whereas on slide 37 you really are --
>> Livio Soares: So I think some of the benefits you would get from kernel mode
execution, even if you had admission control, are likely to -- so you may get a
slightly smaller chunk of that, but IPC is improving. So this is not only -- I don't
know what type of thrashing you were referring to, but you see at least the kernel
improving its execution. I don't see why you would all of a sudden lose that.
>>: I think what Ed is wanting to say is for percent improvement, just look at the
peak of the green line and the peak of the blue line and that's your true
improvement. I was actually hoping he'd ask the general question, which was is
the best benefit for this [inaudible] server or are there cases where --
>>: There's still benefits. See, the peak of the blue is much higher.
>>: We'll talk later
>> Livio Soares: Sure.
So for the Apache we also see a similar reduction in latency. And here we have
a more drastic change in the IPC. So you see both user and kernel almost
doubling their efficiency and the misses on several of the structures reduced.
Now, we still have a problem with the TLB in user mode.
>>: I have a question. You have these kernel threads that are sitting there
waiting for something to show up in your syscall page. When there's no work for
them to do there, what do they do?
>> Livio Soares: So they call the Flex wait system call, which is a synchronous
system call which --
>>: [inaudible]
>> Livio Soares: Oh, they wait on a queue. So we keep one syscall thread
awake per core at most.
>>: And it does what? Does it spin?
>> Livio Soares: No, it doesn't spin. So it checks once to see if there's work to
do --
>>: And then what does that core do next?
>> Livio Soares: And then it goes to sleep and --
>>: [inaudible] no system calls to one system call, you have to poke it to wake it
up
>> Livio Soares: Right. So if you spend a long time --
>>: That's fine. If the answer was it was spinning, then your IPC would be uninteresting because what you'd tell me --
>>: What you actually probably do is wait until the queue fills up the system
[inaudible].
>>: [inaudible].
>>: Like you don't want to wake it up for one, you want to --
>>: Well, otherwise your system calls are going to languish. You need to wake it
up.
>>: It's fine if [inaudible] the whole idea here is to collect as many system calls
as you can.
>>: [inaudible].
>>: This isn't just counting spinning. That's all I wanted to know.
>> Livio Soares: Well, I also measure the number of instructions, and my system
does have a few more instructions, but it's under 2 to 3 percent. So it's not all
instructions from [inaudible].
Did you have a question?
>>: I was just wondering how -- are you making forward progress? Maybe I'll
shut up for a little bit.
>> Livio Soares: Oh, I can end it here [laughter]. I don't need to talk about the -- this was the main message I wanted to give today. So synchronous system calls degrade performance, and I guess what I'm targeting is processor pollution.
So even if we were to build slightly better processor support for jumping modes, you would shave a little bit of that off, but there's something more inherent, which is that executions that don't access the same data tend to perform poorly on modern processors.
And so I proposed this exception-less system call which is a way to more flexibly
schedule execution. Either you wait a little bit longer or you can execute on a
separate core. And I showed you the Flex thread library which uses
exception-less system call and tries to give the illusion to the application that it's
still issuing synchronous calls. And I've shown you a little bit of the results.
Yes?
>>: [inaudible]
>> Livio Soares: Right. So I think the latency throughput improvement comes
from the fact that once you reach a certain number of threads, say more than 200
threads executing work, it's very likely that that thread can't execute a whole
transaction or can't do all of its work in one continuous stretch of time.
So it may issue some network request that gets switched out to a different
thread, right? So even in the default Linux case, you're likely to be interrupted
several times with all the threads doing work, so you have to wait a long time
before you're scheduled in again --
>>: [inaudible]
>> Livio Soares: It may not be hardware interrupts. It's just you want to do a
read from the socket, and it's not ready, so you have to schedule something else
in.
>>: [inaudible]
>> Livio Soares: Right. Or -- yeah. So --
>>: [inaudible]
>> Livio Soares: It was growing?
>>: No, no. I'm asking it [inaudible]
>> Livio Soares: At the 99th percentile we did have higher latencies. That's
expected.
>>: [inaudible].
>>: In order to get good batching, you need to have a high rate of requests going
through the system
>> Livio Soares: That's right.
>>: And so with this setup you've got this terrible tradeoff between high rate of
requests and increased latency but also increased [inaudible]
>> Livio Soares: Right.
>>: So I was hoping you were going to tell us about the [inaudible] version
because there are with the [inaudible], if it works correctly, you don't need to
oversaturate your number of threads and number of thread handles to get the
same amount of concurrency.
>> Livio Soares: I'm not sure I'm going to have time to talk about that. But I do
also see --
>>: Does it have -- does it avoid this [inaudible] with the events?
>>: Do you have a graph like this with the event?
>> Livio Soares: So not here, but it's not as good at reducing the latency in the
event-driven case. So in the event-driven case, because -- so what causes high
latency here is the number of threads that needs to be scheduled in and out.
For event-driven execution you have a lot less of having to wait such a long time
before you get notified oh, your thing was ready a long time ago, you could have
already executed part of this call.
And this is with current epoll interfaces, for example. So you have a
smaller latency to begin with. So event-driven has much smaller latency, I've
observed, as the baseline. So my system, it mildly increases the latency, and
that's only because you're executing things a little bit faster.
>>: We can talk about it later. I was hoping you wouldn't -- with the event-driven
system, you wouldn't have to oversaturate as much to get the -- see a
throughput.
>> Livio Soares: I think you need concurrency in requests. You need about 32
or 64 concurrent requests coming in. You're right that it's only a single thread
now. You don't need 64 threads. But you still need a number bigger than 16 or
32. Below that you're still not winning much.
>>: This sensitivity is related to the cost of flushing your pipeline. From CPU to CPU it might be different -- I mean, quicker. You're observing the difference between the cost of batching your calls versus the cost of a pipeline flush.
And the more expensive the pipeline flush is, the less concurrency needed to
recover that cost through batching. Is that accurate?
>> Livio Soares: Yeah, the flushing and the amount of pollution you do as well.
Yeah, and I think I add some overhead from my system as well that needs to be
recovered. So if you don't have enough volume of system calls to execute, the
overhead I introduce ends up costing you. So in that case you're better off just
doing a synchronous call.
>>: Did you do any experiments where instead of waiting to batch you simply
execute the system calls immediately, avoiding the exception? So now there's
no penalty to batching. You're just purely covering the pipeline costs.
>> Livio Soares: Right. On a separate core?
>>: Yeah.
>> Livio Soares: Yeah. So as long as I had something to execute, I didn't wait
until --
>>: You didn't wait?
>> Livio Soares: No, I didn't wait. So as long as I had something to execute I
would try to execute. I don't think I did experiments where I explicitly waited for a
certain number of -- well, first of all, it's hard to tell, right? Because you're
sleeping right now, so it's hard for you to wait until -- you don't even know how
many system calls are there unless user space is willing to send you a wake-up, and
that's going to cost you.
So I'm going to skip this part where we were trying to improve the performance of
last-level caches by dynamically profiling applications using hardware performance counters, looking at the miss and hit rates, such as the example here, seeing that some parts of your address space behave better than others, and using some sort of page coloring to restrict the use of the cache. And I call that the pollute
buffer. We get some improvement. But I think it's more interesting to talk a little
bit about future work.
So as part of my future work I guess I'd like to continue the theme of efficient
computation, so as I started talking about in the beginning, in the intro, before
2004 every decade or so we saw 30x clock rate improvement and probably more
in application improvement because of growing caches and whatnot.
And according to somewhat recent estimates, we're going to get something
between two and three times clock rate improvement every decade after 2004,
and as I said, most of that is due to thermal and power issues. But over these last
decades, we have had a hard boundary between the hardware and the software
layer where we haven't changed the [inaudible] too much, and it's been really
good. Hardware has grown very fast and it allowed software to grow
independently of the hardware.
But the most promising solutions to the thermal and power problem that I'm aware of -- like multicore and specialized execution, and I guess GPUs are an example of that, and you can get even more aggressive and say let's do analog circuitry -- those greatly impact the hardware-software interface.
So as part of my future work I guess I'd like to continue my current line of trying
to rethink ways to build run time systems to focus on efficiency of execution and
potentially even ask the hardware architects to make slight changes to their basic
architecture to support a more efficient run time system.
So there are a few things that I would, I guess, want to suggest. First of all, I've been doing a lot of work in performance monitoring with PMUs, and I'd like to do more dynamic adaptation; unfortunately, for that they need to monitor more types of things and it needs to be cheaper to monitor. I think it would be great to spend
some of the resources -- the transistors, the abundance of transistors that we
have to support better monitoring of execution.
And then hopefully with that information, you can try to adapt the run time
system, and I think specialization is a big part of adapting the run time system to
try to match whatever the application is doing with whatever the hardware sees.
But I think there's a point at which you can't do things transparently anymore, so
it's likely that we may have to ask application writers to change the way they've
been writing applications and maybe support a higher-level interface to the run time so that there's more room to make this transparent adaptation to match
hardware.
So that's it for today. Thank you for coming to my talk.
[applause]