>>: Okay. It's my great pleasure to welcome Todd -- I'm not going to try and pronounce
this name. Mytkowicz?
>> Todd Mytkowicz: Mytkowicz.
>>: Mytkowicz. From the University of Colorado, who is coming here to talk to us about
the possibility of joining our team. And talk to us about performance analysis, I
suppose. Take it away.
>> Todd Mytkowicz: Thanks. Thanks a lot. Thanks for having me. It's great to be
here.
So the title of my talk is: We have it easy, but do we have it right? This is actually a
similar talk, and hopefully you guys haven't already seen it, to something that my
advisor has given here about a year ago. So there's a lot of new content, but there's
going to be a little bit of overlap.
So when I started out in my grad program, I realized pretty quickly that systems is where
I wanted to be. I was taking a class in operating systems, and the professor
came in and gave a lecture on context switching, and you know, all the nitty gritty
details, and then said, okay, your homework assignment is go home, and you know,
here are these micro controllers, these little boards that we got. Go home and
implement context switching.
And, you know, this was just awesome. I mean, we learned about a concept in a class
from a textbook and reading, but then we were able to go home and program this on our
own and kind of hack and get something working that's in pretty much every major
operating system out there in use today.
So you know, systems research is really great because there's a big component of
engineering. And so you get to tinker and see your results.
But you know, there's also a notion of evaluation, right? And particularly when we do
research, we have to build things -- you know, we build smarter compiler optimizations. We
might come along and come up with some idea for a better compiler optimization.
But to know how much smarter it really is, we have to do an evaluation, right? We have to
compare ourselves to prior work that has come along.
So systems research has this kind of interesting interplay between the engineering
aspect of our work, where, you know, we build stuff, and then the evaluation aspect,
where we kind of say to other people, hey, look, you should be using my compiler. It's
going to be, you know, ten times faster than the current one that you're using. It will
make your life better.
So let's generalize a little bit about how we actually conduct experiments in systems
research.
The idea is that -- you know, this is a generalization, but generally what we start with is
a very complicated system. Right? We have some system that has multiple layers of a
hardware/software stack, application frameworks. And we go in and we take a
measurement of that system, and we may measure certain layers of that application
stack from the hardware up to the application layer, and we get something interesting
from those measurements, and we say, okay, look, there's a bottleneck in our system at
this point. We're seeing a huge number of writes to disk when we shouldn't be seeing
those. So that's a potential bottleneck. We can come up with a way to fix this.
So we come up with some innovation. And so we augment our system now with this
innovation, and we come along and again we do another measurement. And so we're
measuring before and after our innovation, and we'd like to ask, you know, what's the
efficacy of our innovation?
So but what if our data are wrong? Right? What if we had come along and done our
measurements and come to the conclusion using, you know, best methodologies, that
our innovation was beneficial?
Well, if our data are wrong, then obviously our conclusion is wrong, right? We've
claimed to the public or to our other research colleagues that, look, you know, my
compiler is going to make your program ten percent faster. But in reality, when they
actually use it, it doesn't actually make their program any faster.
So when we have a poor evaluation, we're going to come to wrong conclusions.
So how can our data be wrong? Well, I kind of didn't tell you the whole story about how
we actually carry out experiments in systems work. We traditionally execute our
experiments in some sort of an experimental setup. This is like the context that carries --
that holds all of the aspects of our experiments.
And so if you read a systems paper, people are usually pretty good at saying, like, okay,
we looked at, you know, the GCC compiler version 4.2, the Linux operating system
2.6.24, so the things that we think can bias our results are -- or impact our conclusions,
we're usually pretty good about specifying those things in our experimental work.
But what we're going to talk about today is that in reality, there are certain things in our
experimental setup that we would really never consider as potentially causing huge
changes in our conclusions, but they're going to impact our conclusions nonetheless.
So that's kind of what the point of this talk is: aspects of our
experimental setup are going to bias our data, and thus cause us to come to wrong
conclusions about our innovations.
Okay. So let's take a look at an example. I've been talking relatively abstractly
[indiscernible]. So let's take a look at an example of evaluating a compiler optimization.
So let's imagine that we have, using our notation here, some system -- I'm just going to
use perlbench compiled with gcc at O2. That's kind of our baseline.
We've got some innovation that we would like to evaluate. That's just the O3
optimizations above and beyond O2. Again, we're using perlbench.
And what we'd like to do is do an evaluation that's reproducible. So the idea is I gave
my advisor some homework, and I asked him, hey, Amer -- Amer is my
advisor -- could you go off and do an evaluation and tell me just what's the speedup of
O3 over O2?
And so Amer came back and used speedup as the metric. He used proper experimental
methodology and calculated that, you know, there's about an 18 percent, plus or minus
some very small amount, speedup by using the O3 optimizations over O2. All right.
So his conclusion is that O3 is good. Speedup is just the run time of O2
divided by the run time of O3 -- you just measure seconds -- so the fact that it's above one
means that there's about an 18 percent speedup.
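Restating the metric in symbols (notation ours), with T_O2 and T_O3 the measured run
times in seconds:

    \text{speedup} = \frac{T_{\mathrm{O2}}}{T_{\mathrm{O3}}}

so a value of about 1.18 is an 18 percent speedup, and a value of about 0.84 is roughly a
16 percent slowdown.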
So Amer says, you know what? The conclusion: O3 is good. It's really helped out.
So I came along and I did the exact same experiment. These are real results from real
runs. This is -- I'm not pulling any tricks. And I came back and I came -- used the same
machine, same binaries. We had no one else on the machine. It wasn't like frequency
scaling was going on. It was, you know, we made sure that the temperature was the
same in the room.
So I came back and I said, look, Amer, you're wrong. I see about a 16 percent slowdown,
plus or minus some very small amount. I did proper methodology as well. I
ran my experiments hundreds of times, and these are the numbers that I got.
So I could claim that O3 is bad. He must have made a mistake here.
So to figure out what was going on here, we went back and we kind of sat down
together and we started looking at what was different between the two of our
experimental setups. And we covered the main things that you would expect, right? It
was the same machine, the same binaries, the frequency scaling wasn't kicking on, no
one else was on the machine. The disks -- there wasn't like an NFS issue. And what we
came to was a really, really subtle aspect of the differences between our experimental
setups.
My name is longer than Amer's. And these machines were running UNIX. And so in our
HOME variable, I used a little bit more space than Amer did. And you know, if you take a
look at what that actually does to an ELF binary -- this is a very simplified view
of what an ELF binary really looks like, right? You've got your text, or your code, and
your data, you've got your environment variables, which are placed right
before the start of the run time stack. And then the stack, which grows in this direction,
right?
And because my environment variables took up a little bit more space, because my
name was slightly longer, the start of my stack address was slightly higher. And so, you
know, we -- this is the only thing we could see that was different between our
experimental setups, so we were wondering, could this really cause a change in
program performance as large as the results that we saw on the prior slides?
So we did a controlled experiment where we just took an empty environment, so we
started with no bytes in our environment, and that's given here in the X axis. And we
just progressively added environment variables, dummy environment variables, to this
empty environment.
On the Y axis, I'm showing you speedup -- so below 1.0 is a slowdown, above 1.0 is a
speedup -- and each point here is just the mean and variance of multiple runs. So this is
the conclusion of whether or not O3 gives a speedup over O2 for perlbench.
Now, clearly, what we see, despite the fact that the environment variables are actually
never used by the program, right, they never actually get touched by the program, we
see a fluctuation in our conclusions by a large amount. About ten percent. And even if
we disregard these guys, these two kind of outliers, we still have a variation that's
sizeable relative to what we kind of see in today's compiler evaluations.
And moreover, these points here, these are the confidence intervals, these runs are
incredibly deterministic. You know, you run a program once, and it takes ten seconds,
run a program again, it takes 10.1 seconds. So the variation here is very minimal
relative to the fluctuations from the performance benefits that we're seeing by doing
optimizations.
So clearly, environment variables are causing a change in our conclusions here. Even
though they're completely irrelevant to the actual program.
So the way I'm going to generalize these results, so we can see this isn't just a one-off
example for SPEC, is we're going to make use of a violin plot. So when you think about
a violin plot, it's just projecting all these data down to this -- to the violin plot. The white
dot in the middle is going to be the median, and then there's this -- if you tilt your head
sideways, you can see this distribution is just smoothed out. So it's effectively a smooth
histogram, and it's symmetric about the middle point.
So the nice thing about it is the height tells you what's the range in variation, and the
width or the fatness tells you kind of where the points are actually distributed. So it's a
good way to summarize these data.
Okay. So we looked at a bunch of the SPEC -- or all of the SPEC C benchmarks. And
we see that, you know, here's perlbench from the prior slide, and that there are other
benchmarks as well that suffer from a similar phenomenon. In particular, for lbm the
height of the violin is relatively large. So we may claim a speedup or a slow
down, depending on the size of our UNIX environment, our UNIX environment
variables.
And even if we disregard the ones that are below 1.0, we still see large fluctuations in
the height on the order of a couple of percent, which again is relatively large for
today's compiler evaluations.
So the setting of totally irrelevant environment variables can lead to biased conclusions in
our evaluations.
I should mention, feel free to interrupt me for questions if you guys have anything.
Yeah?
>>: First of all, if you run it one time, it may take ten seconds; run it again on the same
machine, same environment variables, it takes 10.1.
>> Todd Mytkowicz: Yup.
>>: But that's not much different from what you're getting here. Here you would get
variation of, say, 10 to 10 and a half seconds. So --
>> Todd Mytkowicz: Okay. I didn't do the math right. The variation is very small.
That's the distance [indiscernible].
>>: And the second question is, I never see [indiscernible] of [indiscernible].
>> Todd Mytkowicz: That's right. That's a good question. So the issue is these results
were -- we started at 0 and went in offsets of 64 bytes. And as a consequence, if you
actually expand this out in this region of the space, there's a lot more fluctuation going
on. But just as a function of the fact that -- we'll get to this in a little bit, but just doing
these results took a lot of computer time, so we just picked 64 bytes and went with it.
Okay. So once Amer and I used the same environment variables, we started to agree
on, you know, our overall conclusion for perlbench. But for one of the other
benchmarks, we disagreed. And so again, we went back and started to figure out
what was different between our experimental setups.
And it turned out that, you know, we were using the same compiled binaries, but
Amer was using ld *.o to actually do his linking. And so *.o just expands to
alphabetical order. And so you get, you know, something like this laid out in the text
section. The text section of the code has these .o files just swapped location-wise.
I, on the other hand, was using the default that ships with SPEC, and so I had a slightly
different ordering than what Amer had. And so we were curious, could this be the
source of bias?
It turns out, yes, indeed it is. I'm just going to show you the violin plots here
rather than kind of step through the plots like we did with the environment variables for perlbench.
We just took every -- so to do this experiment, we took a particular benchmark like
libquantum. We just compiled all the files into .o files, and then just randomly
changed the linking order when we actually built the binary. And we did that for 32
randomly generated linking orders. And then we asked, what's the speedup of O3 over
O2 for any particular linking order? And you can see we get very large fluctuations in our
conclusions -- whether it's a speedup or a slowdown -- but also still pretty wide
swaths, even when we always claim that there's a speedup.
So we're having large fluctuations in our conclusions based on the order in which
we link our programs together. Yeah?
>>: So it seems like some programs are more susceptible than others, obviously, to
this variation. And also, the attendant biases in the same way, like sjeng
in this one, you know, is also in the same kind of cluster --
>> Todd Mytkowicz: Yes.
>>: -- as it was with the -- do you have any explanation?
>> Todd Mytkowicz: Not a concise one that you're probably going to like. What turns
out to be interesting is that I'm showing you data for the Core 2. If we take, for
instance -- for instance, we'll take gcc. If you run gcc on a Pentium 4 and do the same
set of experiments, you're going to see a very different shape to this particular -- to the
violin graph. In particular, there's pretty much not one benchmark across all the
architectures we've tried where you don't see some sort of fluctuation like this.
So for a particular piece of -- a particular architecture, perlbench is very susceptible to
both environment variables, and link order. Part of that is -- we're going to get to the
reason as to why that is in a second.
>>: Is the underlying cause the same?
>> Todd Mytkowicz: No.
>>: Okay. Because it's surprising that the effect is so similar --
>> Todd Mytkowicz: Yeah.
>>: -- [indiscernible].
>> Todd Mytkowicz: I haven't actually brought up [indiscernible] and I haven't actually
looked into that one in depth. So I don't know if, [indiscernible], you know, the reason is
similar for both of these scenarios.
>>: So are these all with a particular environment [indiscernible]?
>> Todd Mytkowicz: Yes.
>>: Which ones did you choose and why?
>> Todd Mytkowicz: I just chose 0 -- so no bytes in the environment. So the question
was, are these all for a particular environment size? And
yeah. Because we knew that the environment variables were a problem, so we just said,
okay, we're going to just hold that constant for now as a parameter of our study and just
look at link order.
You raised an interesting question about the generality of these results, in the sense that if
we had used an environment size of, say, 15 bytes, would we get different results. We
would maybe get different fluctuations, so the distribution here might look
slightly different. But the overall result would be the same.
>>: And so remind me then, which link order is the 1.0 baseline?
>> Todd Mytkowicz: I'm not sure I understand your question.
>>: You're measuring stuff relative to --
>> Todd Mytkowicz: So there's O2 -- I'm sorry, the [run time] of O2
divided by the [run time] of O3. Yeah. And I'm just showing you the default and the
alphabetical order to show you that you just can't choose the default link order that
SPEC ships and expect that that, for instance, gets you the median. Right. I mean, that
doesn't really make any sense. SPEC doesn't necessarily know anything about our
hardware. They just give you a link order.
Okay. So the order of .o files can lead to contradictory conclusions here.
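As a sketch of the randomization being described -- our own illustration with made-up
object file names, not the scripts they actually used -- one way to emit a random link
order for each build is simply to shuffle the .o list before printing the link command:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        /* hypothetical object files for some benchmark */
        const char *objs[] = { "gates.o", "measure.o", "oper.o", "qec.o", "qureg.o" };
        size_t n = sizeof objs / sizeof objs[0];

        srand((unsigned)time(NULL));
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            const char *tmp = objs[i];
            objs[i] = objs[j];
            objs[j] = tmp;
        }

        printf("gcc -O3 -o benchmark");           /* one randomly ordered link line */
        for (size_t i = 0; i < n; i++)
            printf(" %s", objs[i]);
        printf("\n");
        return 0;
    }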
Okay, so where does bias come from? This gets a little bit to the questions that were
just being asked.
Well, you may wonder if maybe it was a methodological issue. We're pretty sure it's
not. The results that you've seen so far took five months of machine time. This
machine was literally churning out results for five straight months. So every graph that we've
showed you has actually, you know, been statistically rigorous in the sense that we've
applied statistics to all of our results. We've looked at a bunch of different platforms for
hardware. Two microprocessors, both the Pentium 4 and the Core 2 in Intel's hardware.
And the Power. And the M5 is a simulator, which we're not going to get to in these data,
but we have it in the paper if anybody is interested. So the simulator also suffers from
bias above and beyond the fact that it's simulating stuff, right?
The benchmarks we looked at are C and C++. But other people have replicated these
results with other programs and other benchmarks. And again, I'm showing you the
results from gcc 4.1, which is one of the more recent compilers when this -- when we
did these experiments, but we also saw this similar phenomenon on Intel hardware with
Intel's compiler.
You may think that Intel knows the details of their hardware and so their compiler can
kind of mitigate some of these effects. It turns out that they're almost exactly the
same.
Okay. So generally speaking, bias comes about from interaction with hardware buffers.
And what's interesting is these buffers, you know, as soon as you see these results, a
lot of people start to say, okay, this must have been a cache issue. You're
getting conflict misses in your L1, for instance.
And it turns out that the buffers we're talking about are really nonintuitive
buffers. They're not the things that you necessarily think about as generally described
as buffers.
So for instance, for perlbench on the Core 2, there's something
in the chip that's called the store-to-load forwarding queue. And the idea here is you can
imagine if you write to some variable in memory, then you want to load that value right
away. Rather than waiting for that write to go all the way back to the cache and actually
update the cache with that value, the store-to-load forwarding queue is this kind of internal
buffer within the actual pipeline of the chip that says, okay, I know that that value was just
written. I'm just going to use that value directly for my load if it comes right afterwards,
after the store.
So it's a performance optimization the chip does. You can imagine if you're in a
loop and you're doing loads and stores, loads and stores, and you hit
this right, you fit within this buffer, and you're going to get very good
performance if that particular loop is indicative of your program's performance.
But if you miss and you have to go back to the L1 for every read, you're going to have,
you know, really poor performance. And so that's exactly what's happening with
perlbench and the Core 2.
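For intuition only -- this is an assumed shape, not perlbench's actual code -- here is the
kind of loop where the effect shows up: every iteration stores a value and then
immediately loads it back, which is exactly the pattern the store-to-load forwarding
queue is there to catch.

    #include <stddef.h>

    /* `volatile` keeps acc in memory, so each iteration really does a store
     * followed shortly by a load of the same location.  Whether that load is
     * served out of the store-to-load forwarding buffer or by a slower trip
     * to the L1 depends on how the addresses involved happen to line up. */
    long sum_store_load(const long *a, size_t n)
    {
        volatile long acc = 0;
        for (size_t i = 0; i < n; i++)
            acc = acc + a[i];    /* load acc stored last iteration, store it again */
        return acc;
    }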
There are some static variables that are allocated in the BSS segment, and there are some
stack-allocated variables that conflict with them, and they happen to be in the same hot
loop. And as a consequence, you get good store-to-load forwarding if your environment
variables align the hot stack-allocated variables such that they don't conflict with those
static variables. Yeah?
>>: So what's the [indiscernible] versus this queue?
>> Todd Mytkowicz: Oh, I don't know off the -- I'm not sure off the top of my head.
>>: Multiple cycles to the outline [indiscernible].
>> Todd Mytkowicz: I think it's -- it must be something reasonable, yeah. At least one,
right? [Laughter].
Okay. So for bzip2 on the simulator -- this is the M5 O3CPU model, an Alpha
simulator -- there's a hot loop in bzip2. And you know, bzip2 really just does compress,
decompress, compress, decompress. And the compression loop pretty much fits
in about eight Alpha instructions. And as a consequence, depending on how that
loop aligns in the front end, the front end can completely turn off and stream the
decoded instructions directly from this loop stream detector. Then, you know, you
don't have to do any of the decoding. So you save yourself a cycle.
Turns out that has a huge impact on bzip2's performance in the simulator. And as a
consequence, you can see the performance of bzip2 fluctuate by 20, 30 percent,
depending on how you actually align this loop within that particular structure.
So lastly, on the Pentium 4 -- this is a bit of a dated machine nowadays -- the
performance of libquantum will fluctuate by 30 to 40 percent, depending on how that
particular hot loop in libquantum fits within the trace cache. And the trace cache, you
can think of it as just a special purpose, kind of fancy, instruction cache.
So bias, generally speaking, comes about when O2 interacts differently with these buffers
than O3 does. And a lot of times, right, the buffers that we're
talking about are not the kind of buffers that you normally consider to be an issue in
microarchitecture.
Okay. So what can we do about bias? Well, other areas have suffered from bias for a
long, long time. And they have kind of their own ways to deal with it. And so we're just
going to appropriate one of their ideas, which is randomized trials. Yeah?
>>: It seems like where you're going with this is you're going to talk about how to make
sure your studies [indiscernible].
>> Todd Mytkowicz: Right.
>>: The main takeaway I got from the last couple of slides is that it seems almost hopeless to
keep thinking about doing these optimizations because these random things are going to
completely defeat the optimizations you're doing.
>> Todd Mytkowicz: Well, I think that it's an interesting question to think about kind of at
a philosophical level. But at the end of the day, if someone came to you --
>>: Optimizing the wrong thing. The thing we're trying to optimize in O3 is like a minor
effect compared to these other structures that you should be paying attention to.
>> Todd Mytkowicz: Possibly, possibly. Right. I mean, if someone came to you and
said, hey, I've got this idea called register allocation. You're going to -- here's how it
works. And you know, I haven't actually been able to evaluate it yet. You're probably
going to say, okay, that sounds like a pretty good idea. Right? There are registers,
which give really fast access to memory. Being smart about allocating your data to those
things seems reasonable.
And at the end of the day, register allocation provides a
benefit that's relatively large and maybe will overwhelm some of the effects that we're
starting to see here.
So it's not that optimizations themselves are kind of, you know, hopeless. Right?
There is -- there's absolutely hope here, since each one of these variations you're
seeing could in turn be turned into an optimization, right?
>>: [Indiscernible] things have changed. Introduce registers that compiler writers put
stuff in, take advantage of it. And the new kind of thing to sort of forward [indiscernible]
load things are not as dismal compiler writers and so they don't write the [indiscernible]
and take advantage [indiscernible] which means that the whole discipline has kind of
changed by not tracking the hardware anymore.
>> Todd Mytkowicz: Yeah.
>>: So it's an interesting question about whether it should be or whether, you know --
>> Todd Mytkowicz: It's also an interesting question which gets into kind of marketing
aspects of this, which is do we have the ability to do that even if we would like to?
Right? I mean, we're not the ones that make these chips. We'd love to be able to sit
down with Intel and say, hey, what's the, you know, what's exactly going on in this
particular buffer? How do we exploit it? And we don't have that information. You know,
rightly so, maybe. They're not going to give that out. Yeah.
>>: You said you also tested on ICC and it showed a lot of the same issues. So I
mean, did they all have the same information? Seems like it's just too much
complicated stuff going on for this targeting to be [indiscernible].
>> Todd Mytkowicz: Yeah. I mean, in some sense, for instance, take a look at
perlbench with this stack -- the stack allocated variables being the problem. That's a
data dependent thing, right? So you're never going to be able to prove statically that
your stack is going to be a problem for this particular, you know, structure. So
sometimes that part is almost, you know, you can't really do much about it.
But the question here that we come back to is, okay, what can we do about it. And the
idea is that we know that these are sources of bias in our experiments. And you know,
if we have a compiler optimization that gets, say, maybe 2 to 5 percent -- you know, God
says, hey, this is going to give you two percent -- we'd like to be able to measure that.
And at least if we do randomized trials effectively, we can actually get to those
numbers. That is, we're going to make a claim about average behavior over
environment variables and link order, because we know that those things can bias
our results.
And so the idea is that rather than looking at just the performance of a single program,
we look at the performance of a large number of programs where we randomly set
these parameters that we know to cause bias, and as such, mitigate some of the effects of
those parameters.
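Here is a minimal sketch of the bookkeeping for such a randomized trial. The run times
below are placeholders, not the talk's data, and pairing one O2 run with one O3 run per
randomized setup is just one reasonable design choice.

    #include <math.h>
    #include <stdio.h>

    /* Mean and sample standard deviation of x[0..n-1]. */
    static void mean_std(const double *x, int n, double *mean, double *std)
    {
        double s = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];
        *mean = s / n;
        for (int i = 0; i < n; i++)
            ss += (x[i] - *mean) * (x[i] - *mean);
        *std = sqrt(ss / (n - 1));
    }

    int main(void)
    {
        /* run times in seconds under randomized environment sizes and link
         * orders -- placeholder values, one O2/O3 pair per random setup */
        double t_o2[] = { 10.2, 10.4, 9.9, 10.8, 10.1, 10.6 };
        double t_o3[] = {  9.1,  9.8, 9.0, 10.2,  9.3,  9.9 };
        enum { N = 6 };

        double speedup[N];
        for (int i = 0; i < N; i++)
            speedup[i] = t_o2[i] / t_o3[i];

        double m, sd;
        mean_std(speedup, N, &m, &sd);
        /* rough 95% confidence interval for the mean speedup */
        printf("speedup = %.3f +/- %.3f\n", m, 1.96 * sd / sqrt((double)N));
        return 0;   /* build with: cc randomized_trial.c -lm */
    }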
So let's take a look at how this would look in practice. Here's our perlbench example.
I'm showing you cycles, which is just run time, on the X axis, and this is a histogram.
We looked at both O2 and O3, and for each of these compiled binaries, we looked at a large
number of link orders and a large number of environment variables. So, you know,
there's a huge number of runs here. And we're just plotting the distribution of run times.
And the idea is if we have done a good job of randomizing over environment variables
and link orders, we can start to make claims about the distributions here.
So clearly, you know -- well, it's actually not that clear, because it's not that big of a
speedup because of these guys over here -- O3 tends to give a slight, on average,
increase in performance over O2. And so that would be our claim for the efficacy
of the O3 optimizations on perlbench. Yeah?
>>: It also looks pretty clear that the long tail is more typically associated with O3.
>> Todd Mytkowicz: That's right. Yeah.
>>: So reasons?
>> Todd Mytkowicz: I don't know the reasons. I haven't looked into it.
>>: Because I mean, arguably, yeah, so [indiscernible].
>>: [Indiscernible].
>>: Yeah, exactly. I mean --
>>: [Indiscernible] and all that.
>> Todd Mytkowicz: It depends. If you're the Oracles of the world and you want to sell a
single binary and you would like it to get good performance on average -- I mean, on
average or in the worst case are two different questions depending on the organization's
needs. Yeah?
>>: Again, [indiscernible] back a slide. So you took a -- took a lot of effort for you to
find out those two [indiscernible].
>> Todd Mytkowicz: Absolutely.
>>: So how do I -- I mean, where [indiscernible] --
>>: Yeah. How do we know there aren't other variables we should also be
randomizing?
>> Todd Mytkowicz: You're absolutely right. There are. There are other variables, but
we don't know them yet. So until we know them, we can't, you know --
>>: Seems like the argument for your strategy, I mean, you're doing randomization, but
it seems like another argument, another strategy would be to measure things in a
deployed setting. In other words, you know, look at a thousand customers, you look
and you may have [indiscernible] quote, randomized environment variables.
>> Todd Mytkowicz: Yup.
>>: And you measure things in that setting. You're going to get a better picture of
[indiscernible] --
>> Todd Mytkowicz: That's exactly right.
>>: -- happens.
>> Todd Mytkowicz: And in fact, you guys have a group that's starting to do that. So I
mean, that's kind of cool in the sense that you have the ability to talk about sending this
out -- O2 out to a thousand people -- and seeing what kind of performance gains they
get.
But I think that kind of gets at, [indiscernible], your point a little bit, and the title of the talk,
which is, you know, if you would like to make a claim about the benefit of your
optimization, we need to have codified methods that allow
us to make those claims.
Okay. So we were curious if other kinds of common tasks that we do for evaluation can
produce biased data in the systems [indiscernible], and so in particular, we were curious
about Java, because Java is a little bit more of a complicated environment.
And so I spent the summer at IBM, and while at IBM, my mentor there, Peter Sweeney,
one day came to me and said, hey, Todd, I've got this program, and it's taking a lot of time in
this one particular method. I would like you to speed up that method.
And so he gave me this task and I went off, and he told me what method it was. It doesn't
really matter what method that was, right? So my baseline here, in our notation, is just
the program that Peter gave me, and he said, you know, this method M is taking up almost
20 percent of total execution. Make a fix and see if you can change it. Speed it up.
For my optimization, I went in and did some algorithmic changes. So my innovation
was the algorithmic change. And before, you know, before I sent Peter my results, I
wanted to run it by my advisor. I knew that bias had bitten me before. I wanted to make
sure that I did a good job of evaluating this. So I said, Amer, can you take my optimized
program and tell me how much time we're spending in this method M.
So Amer did that and he came back the next day and says, hey, Todd, your
optimization is good. He said, you did a really good job. We're spending no time in that
particular method.
Okay. That's better than I thought but great.
So Peter, I sent it off to Peter and Peter came back and said, Todd, you know, I thought
we talked about this, there's a problem in this method M. You didn't change it at all.
And you know what? I did proper methodology. You know, I ran this multiple times. So
you can definitely make this thing faster. Okay.
So to take a look at why this happens, we did the exact same thing as before. We
went back and looked at what the differences were between Amer's and Peter's
experimental setup. And it turned out that they were, you know, they were both -- they
knew about environment variables. They had taken care of that.
But what they were dealing with now is that they were using two different profilers.
Now, these profilers were both statistically based profilers. They're sampling profilers.
And they're supposed to produce the same result, right? The profiler shouldn't determine
where our hot spots are.
So could this be the source of bias? Turns out, yes, indeed, it was. We did another
controlled experiment where we looked at four different profilers. Hprof is an
open source profiler that ships with the Sun JDK. These are all data from the Sun JDK.
Jprofiler is a commercial profiler. It's a pretty expensive profiler. It won a bunch of
developer awards. As is YourKit. And xprof is the profiler that ships internally with Sun
HotSpot. So this one is open source, and this one is part of Sun.
So there's -- each profiler here has three bars. And that's the hottest method according
to that particular profiler. So hprof said that we spent about six percent of [indiscernible]
execution in this particular method. And that was the hottest method.
Jprofiler, on the other hand, said we spent no time in that particular method. Actually,
we spent most of our time -- 12 percent of our time -- in this gray method here. And for these
points, I'm showing you the mean and confidence intervals, so these are reproducible
results. You know, there's very little agreement as to what the
hot parts of our program are across these four different profilers.
And I'm just not going to take the time here, but this generalizes to other benchmarks.
Yeah?
>>: [Indiscernible].
>> Todd Mytkowicz: They're all sampling profilers?
>>: [Indiscernible] instructions kind of basic loss and just --
>> Todd Mytkowicz: I didn't.
>>: -- use that as a [indiscernible].
>> Todd Mytkowicz: I didn't, no. That gets into other issues because the point we
wanted to make was that these are all -- they're all supposed to be producing the same
result. So the question you're kind of getting to is which is right and we'll get to that in a
second.
So this is for the Sun JDK, but it also happens for IBM's JDK as well. Very similar
results. But I'll point you to our [indiscernible] paper if you're interested.
Okay. Where does bias come from? Well, in this situation, it turns out to be a single
source. Before we get into that, we need to take a quick diversion into how our profilers
work.
So a profiler works by periodically stopping your program and asking, what
method is executing right now? So you can imagine if your program has two
methods, foo and bar, and the profiler takes a hundred samples, 90 of which are
attributed to foo, well, you know, you're spending 90 percent of your time in foo. So
they're nice because you can, you know, control the sampling rate which allows you to
control the overhead. So they're usually pretty low overhead profilers as far as profiling
goes.
And the other nice thing about them is there's kind of a built-in error measurement of
how accurate your results are. That is, if you do your sampling correctly, the
error in your estimate goes down with the square root of the number of samples.
So if you increase the sampling for that particular method, or run your program longer, you
get more samples, and you get a more statistically accurate
measurement of how much time you're in that particular method.
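In symbols: if a method receives k of n samples, the estimated fraction of time spent in it
and its standard error are roughly

    \hat{p} = \frac{k}{n}, \qquad \mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} \;\propto\; \frac{1}{\sqrt{n}}

which is the built-in error bound he's referring to -- and it only holds if the samples really
are independent, which is exactly the property at issue next.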
So -- I've kind of been hinting at this a little bit as I've been talking about it -- all
these profilers, because they're statistically based, require that your samples be
independent. So what that means is, if I take a sample every ten
milliseconds, right, and my program happens to be in foo every ten milliseconds,
well, I'm going to get a biased result -- I'm going to say that foo is taking up a large amount of
overall execution time. Even if foo is incredibly short and took up very little of our overall
execution time.
So the point is that the sample at time T should tell you
nothing about when the sample at time T plus one is going to occur, or T plus 15, right?
So just to prove to you that this actually happens with the current profilers that we
looked at on the prior slide, I'm using an autocorrelation graph. It may look a little scary,
but it's actually pretty simple. Autocorrelation is a really simple way to think about
correlations between two time series.
So you take a time series. In this case we're taking the time at which a particular
sample was serviced by the JVM when a profiler said, you know, I'd like to know what
method is running. So we're going to see if the times at which those
samples occur are correlated. So we take that time series -- for
instance, this is with hprof; let me just finish this one point -- and we duplicate it. And
so then we do a -- what's the correlation between those two time series? Well, clearly, if
you correlate a time series with itself, you're going to get one.
Then we shift that second series by one and we do -- that's our lag of one here on the X
axis, and we ask what's the correlation. And we just progressively shift this guy down
and do the correlation.
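For reference, the quantity being plotted is the standard sample autocorrelation of the
series of sample times x_1, ..., x_n at lag k (the plotting tool may normalize slightly
differently):

    \rho_k = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}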
The point is that once we get to a lag that's relatively large, we should see
that this goes to 0. The autocorrelation score should go to 0, which is the Y axis.
And what we see instead is a periodic pattern, which shows that the time at which a
sample occurs at time T can tell you something about when a sample is going to occur
at time T plus, say, a thousand.
So as a consequence, this is our way of showing you that samples are not independent,
which was kind of the main point of what statistically based profilers need to adhere to
in order to be accurate. Yeah?
>>: [Indiscernible] amount of correlation?
>> Todd Mytkowicz: For this it is. Yeah. If you use a profiler -- we'll get to this in a
second -- that doesn't use yieldpoints, where the samples are not correlated, it's lower.
>>: And so these profilers [indiscernible] the actual time [indiscernible] the cycle count or
whatever?
>> Todd Mytkowicz: No, I changed this. I had to go --
>>: You had to go and do it yourself.
>> Todd Mytkowicz: Yeah. The nice thing is that they all actually use the same kind of
hook into the VM to ask for when a profile occurs -- I'm sorry when a sample occurs. So
they all use the same call, the same method call.
>>: [Indiscernible] the time --
>> Todd Mytkowicz: No. Okay. So this randomization stuff has been around. We've
known about this in the systems community since the 80s, if not earlier. That was the
first paper I could find on it. But all of our profilers, you know, I
never thought this was going to be that big of an impact, to be perfectly honest. And I
think probably most people have a similar kind of idea.
So let's take a look at why this actually occurs -- why the time
at which a sample occurs at T tells you something about the time the sample
occurs at T plus one. The reason this comes about is because of these things
called yieldpoints that exist in the VM.
So in Java, the code that the VM produces is not
preemptive. It's kind of quasi-preemptive. And the point is that every once in a while,
the JIT will put out an instruction so that when the application executes that
method, it will say, hey, look, VM, should I stop
running? Do you need to do anything like garbage collection or profiling tasks?
And so as a consequence you can't sample at all parts of your program. You can
only sample at these particular points, what are called yieldpoints.
So if you take a look at this particular code, if you run this, you know, or a variant of this,
on your own Java platform, you'll see that a profiler will give very interesting results. That
is, it will pick one of these two methods as the place where you're spending all of your time.
So imagine that you have some code that loops, and that loop code calls a method --
call it straight -- which has some mathematics in it, maybe some modulus and some other
expensive operations.
When the JIT sees that this code gets hot, it's going to place a yieldpoint right here. At
least the Sun JVM will do this.
Now, you can imagine if you're profiling and all of a sudden your timer goes
off and says, hey, look, I'd like to know what method is executing, the VM is going to
say, look, I can't tell you until we get to a yieldpoint, and the next yieldpoint you
get to is right here. So there's a delay here, and that delay turns out to cause this
subtle interaction between our profiler asking to take
a sample and the actual dynamics of the underlying application.
So yieldpoints turn out to cause the autocorrelation that we saw in the
prior example.
Okay. What can we do about bias? Well, unlike last time, we have a --
>>: Bias on yieldpoints.
>> Todd Mytkowicz: Sure.
>>: So it seems to me like yieldpoints might be particularly problematic for this in Java,
but if these functions just have fairly predictable timing behavior -- like in this
expensive code, both of those take ten milliseconds, and you're doing a sample every
20 milliseconds -- you could still have this alignment without having a yieldpoint.
>> Todd Mytkowicz: You're absolutely right, yeah. Yeah.
Okay. So we know the source of bias here. At least we're claiming it's yieldpoints. I
haven't actually proved that yet.
And so we introduced a profiler called tprof. And the idea here is tprof doesn't use
yieldpoints. It just sits outside the VM. It's a bit of an engineering effort, and we can go into
the details of how this works maybe offline.
But the idea, at a very high level, is that it sits outside the VM, randomizes its
sampling interval, uses a UNIX signal to stop the entire VM process, gets an x86
address, and then maps that x86 address to the jitted code and then, you know, to a
bytecode address.
So the idea is that we can sample at all program points and we can control our sampling
period and thus make it random.
So how do we know that tprof is accurate? Well, we use a technique we're calling --
it's based on other experimental sciences -- causality analysis. The idea here is that we
have some method M and we'd like to know that our profile is accurate. And so the way
that we go about figuring this out is we cause a change to a particular method, and we'd
like an accurate profiler to reflect that particular change.
Now, speeding up a program by a fixed amount is actually really hard. You have side
effects from memory allocation and all these types of things. But we have the insight
that slowing it down by a fixed amount is actually relatively easy.
So the way that we do that is we take a particular method that we're interested in
studying and just add Fibonacci code -- a while loop that calculates the first N Fibonacci
numbers -- every time M executes. As we progressively increase N, we're going to
systematically slow down the program. We profile M every time that we
change this N value. And an unbiased profiler, one that's producing an accurate profile,
should show that slowdown only in M, right, and in no other methods.
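Here is a sketch of the injected slowdown, written in C for concreteness even though the
actual experiments were on Java methods; the method and helper names are hypothetical.

    static volatile unsigned long fib_sink;   /* keeps the busy work from being optimized away */

    /* First n Fibonacci numbers computed iteratively: O(n) extra work per call. */
    static unsigned long fib(int n)
    {
        unsigned long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            unsigned long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    static void do_real_work(void)
    {
        /* stands in for method M's original body */
    }

    /* Method M under study: its real work plus a tunable slowdown confined to M. */
    void method_m(int slowdown_n)
    {
        do_real_work();
        fib_sink = fib(slowdown_n);
    }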
So the idea, kind of intuitively, is that if we slow down a program by 20 seconds
by changing method M, an accurate profiler should say
that this method M is now taking 20 seconds longer. So a 20 second slowdown of the
application means a 20 second increase in M, and the slope as we change this N here
should be one.
So if we take a look at these four different profilers here, we've got tprof, hprof, xprof,
and jprof. The slope is given right afterwards. You guys can see that.
Each cluster of points here is -- as we're changing this variable N, the X axis gives us
overall run time, so we're slowing down the program as we change N by a large
amount, almost a factor of three. The Y axis is just the amount of time that a profiler
says we're in that particular method M that we're slowing down.
And you'll see that for tprof, the slope is about one; that's the profiler that doesn't
use yieldpoints. The other profilers have systematic bias. So you can see, for instance,
with jprof, the slope is .65, so we're slowing down the program by a huge amount by
inserting Fibonacci into this method. But some other method that's totally unrelated is
now getting hotter according to that profiler, and systematically getting hotter
nonetheless.
It turns out the results in the top graph are actually really good. Most of the time when
you pick this method M, the other profilers, the yieldpoint-based profilers, say that even
though you slowed down the program by a factor of two, this method that we inserted
Fibonacci into didn't change at all.
And tprof, the profiler that doesn't use yieldpoints doesn't have that issue.
>>: [Indiscernible].
>> Todd Mytkowicz: Just different -- it's different methods or different benchmarks. We
picked a large number of benchmarks and a large number of methods to do
experiments on because it was automated.
So yieldpoint-based profilers rarely produce unbiased profiles.
Okay. So we've seen a bunch of examples where really subtle innocuous aspects of
our experimental design or experimental setup can bias our data and our conclusions.
And kind of the overarching point that we've been trying to push here is that
small changes can have really large effects. It turns out that that's actually the classical
definition of chaos, if you think about it from a mathematical standpoint.
And so we were interested in taking standard tools from nonlinear
dynamics, which is where the idea of chaos comes from, trying to apply them to
understanding computer performance, and seeing what pops out.
So this is work that was actually first started by Hugues Berry at [INRIA]. But we
carried it to its next logical conclusion. And the idea here is, imagine that you have
some really large state space X. And X is every single register in your machine. You
have some really complicated function F. And every cycle, F updates the state X to
produce a new state, right? F is deterministic, and then you feed X of T plus one
back into F and you get X of T plus two.
So the idea is that you can get some notion of the performance of a program over time
by thinking about the program going through this giant state space, and we just use
standard techniques. Once you actually frame the problem
as a physics problem, you can just use standard tools
from nonlinear dynamics to start to understand some of the properties of F. And one of
those properties is sensitive dependence on initial conditions, which means that if you
take X and you make a slight perturbation to it, you can see an exponential divergence
in the distance between X and the perturbed X as
you apply this function F to them.
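In the notation he's using, with x_t the machine state at cycle t:

    x_{t+1} = F(x_t), \qquad \bigl\| F^{t}(x_0 + \delta_0) - F^{t}(x_0) \bigr\| \approx \|\delta_0\|\, e^{\lambda t}

A positive largest Lyapunov exponent lambda is the usual formal statement of that
sensitive dependence on initial conditions.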
So the kind of high-level point here is that there's chaos in computers and that's kind of
a nice theoretical underpinning for some of the results that we've started to see.
So our conclusion here is that, well, this means that we need to be really careful with
our instrumentation. And that kind of leads me into some of the other work that my
coworkers and I have done. All with the idea of trying to reduce kind of the overhead of
our instrumentation techniques in order to capture relatively interesting information. So
I'm going to go through rather quickly these next two slides. Yeah?
>>: On the other hand, a heavyweight instrumentation that actually stopped after every
instruction, it would slow your program down --
>> Todd Mytkowicz: Yes?
>>: -- but it would be hard to measure [indiscernible] absolutely accurate about where
the hot spots are.
>> Todd Mytkowicz: Well, I mean, if you take a look -- think about bytecode
instructions. This can get into a bit of a philosophical argument. But think about
measuring the hot spot -- the program -- where the hot spots in your program are by
number of instructions executed per function.
If a function gets inlined, then you know, the JIT all of a sudden removes that function,
and the timing information isn't necessarily one-to-one with the number of
instructions. So that may be some surrogate for a definition of performance. But
at the end of the day, we really do care about time and not necessarily the number of
instructions or some other kind of heavyweight instrumentation measure. Yeah.
>>: So I'm just curious. So you are saying that the profilers you've looked at, they all
accurately reflect when the jit does [indiscernible] things like that, accurately reflect
which function is responsible?
>> Todd Mytkowicz: No, no. So --
>>: [Indiscernible] that line variety of sort of which was the function that --
>> Todd Mytkowicz: Was the hottest.
>>: Right. And there are a number of different possible explanations. One could be that they're
inaccurately determining the time --
>> Todd Mytkowicz: Right.
>>: -- because of things like jit. So did you eliminate that as a possibility?
>> Todd Mytkowicz: Yeah. You can turn off inlining and this effect still stays. In fact, I
think that graph may have actually been with inlining turned off.
I mean, inlining is a big problem for any profiler. There's been a bunch of work with
dealing with that in and of itself. But the effects that we've seen are not because of
inlining.
Okay. So the idea with this work that we did at MICRO was -- when you start
trying to diagnose a performance anomaly, the lowest level
of data you can start to capture is hardware metrics like cache misses and
branch mispredicts and these types of things. And modern microprocessors have a
huge number of registers that allow you to capture a large number of -- I'm sorry.
They have a huge number of metrics that you'd like to capture, but only a few registers to
actually do it. So you can run your program and maybe capture two performance
metrics. But at the end of the day, you don't know what you're looking for, so you need
to do a large number of runs before you can start to kind of understand what the
performance of -- or where the performance anomaly is coming from. So you may need
to do one run where you capture data cache misses and another where you capture
instruction cache misses. And what you would like is some way to reason across those
particular runs.
So here's an example of the problem that you face when you start to try and do that.
I'm showing you just instructions per cycle here on the X axis.
Let's imagine that in this run you capture instructions per cycle and data cache misses,
and this guy, this second run you capture instructions per cycle and instruction cache
misses.
Now, for whatever reason, maybe the OS jumped in here and caused a delay in this
particular phase of the program's behavior, so it's replicated three times, while down
here it's replicated only once.
You would like to come up with some alignment that
matches this event three in this trace with the event three in that trace.
And so we use an idea from genomic sequencing called dynamic time warping. To do
this alignment, you come up with an optimal alignment such that the distance between
these two traces is minimized.
And we found that it worked relatively well for reasoning about a large number of metrics
captured over many separate runs.
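The textbook dynamic-time-warping recurrence captures the idea; their implementation
may differ in details such as the per-event distance d:

    D(i,j) = d(x_i, y_j) + \min\bigl\{ D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \bigr\}, \qquad D(0,0) = 0

The alignment is read off by backtracking from D(m,n); the diagonal move matches
events one for one, while the other two moves let an event in one trace repeat, which is
what absorbs the phase that shows up three times in one run and once in the other.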
Okay. So the second idea to reduce our instrumentation is a notion we're calling inferred
call path profiling. And the idea here is that call paths are really useful for program
performance or debugging, but unfortunately they're kind of expensive to collect at run
time in production systems. And the way you normally do that
is you add instrumentation that kind of keeps track. If you look at this as a call graph, it
keeps track, as the program executes, of what method you're actually executing and
where you have actually come from on the call chain.
And there are various optimizations you can do to kind of make sure that you do that in
an efficient way, but you still have to execute all this extra code in order to do that.
So let's imagine we're in foo and we'd like to know, did we come through A or through B.
Now, traditionally, as I said, you add instrumentation. What we would like to do is kind
of limit the amount of instrumentation we add, so we're just sampling -- sampling
the PC at foo and using that information to infer how we actually got here.
So I'm going to add a set of numbers here, and hopefully this makes it a little bit
easier to see how we do that. These numbers are just the size, in bytes,
of the activation record for these particular functions.
So main has, you know, one variable, so it's got eight bytes. B has one variable, so it's
got eight bytes. A on the other hand, has 16 bytes, has two variables. So this
activation record is a little bit larger.
So the point is, if we came through A, the stack height -- the difference between main and
foo -- should be 32 bytes. If we came through B, it's going to be 24 bytes.
So the idea is that stack height provides context -- a calling context -- without
computation. We can sample the PC, which gives us the function we're in, and the
stack pointer, the SP. A lot of modern hardware will actually allow you to sample the
registers at any particular time when you sample the actual program, so you get the
program counter and the stack pointer, and you can effectively do this difference and
then infer the path that you actually came from.
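Here is a toy version of the stack-height idea -- our own sketch, not their implementation.
foo reports how far its frame is from main's, and that distance differs depending on
whether foo was reached through a or through b, because their activation records have
different sizes. Frame layout is compiler dependent, so compile with -O0, and the
cross-frame pointer subtraction is technically undefined behavior that nonetheless
illustrates the point on common platforms.

    #include <stdio.h>

    static char *stack_base;                    /* address of a local in main */

    void foo(void)
    {
        char here;
        /* The "stack height" at this sample point: it encodes the caller. */
        printf("stack height at foo: %td bytes\n", stack_base - &here);
    }

    void a(void) { volatile double pad[4]; pad[0] = 0; foo(); }  /* bigger activation record  */
    void b(void) { volatile double pad[1]; pad[0] = 0; foo(); }  /* smaller activation record */

    int main(void)
    {
        char base;
        stack_base = &base;
        a();                                    /* foo reached through a */
        b();                                    /* foo reached through b */
        return 0;
    }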
So there are a bunch more details here that I'm totally glossing over. You
can imagine there are a bunch of issues with ambiguity. That's the biggest problem with
this approach. And we have a bunch of techniques in the paper to actually deal with
that, and I can talk about that offline if people are interested.
Okay. So there's been a lot of work that we've covered here. I'm just going to really
quickly go over some related work.
If people haven't read Andy Georges' paper from OOPSLA in 2007, I highly
suggest it. It's an awesome paper on statistically rigorous performance evaluations for
Java. And they go on to show how, if you don't use proper statistical methodology
across a large number of experiments, you can
kind of come to different conclusions in your results.
And we kind of see our work as an extension of that, to show that even when using
proper methodologies, you have to worry about bias in your results.
There has been a lot of work on bias in experiments, but it's usually
kind of domain-specific stuff. So Steve Blackburn had a paper where he showed
that if you change the size of your heap, you're going to come to very
different conclusions about memory allocator performance.
And Alameldeen and Wood showed that, you know, when you're doing simulations of
microprocessors, you want to change the latency of your L2 cache in order to get
variability in your runs from run to run. So they're kind of domain-specific applications of
bias.
There's been a lot of work on evaluating profilers. The difference between the work that
we've done and the work that's come prior is most of this evaluation work has a gold
standard to which you can compare the results. We lack that gold standard. If
someone gives you a profile and says that you're spending 80 percent of your time in
foo, there's nothing that we can use that validates that a hundred
percent. So we had to come up with something in order to figure that out.
And then there's been a lot of work on just call path profiling that I'm not going to touch
on.
Okay. So let's take a look at some key lessons here.
Would you believe a presidential poll from one small town? I'll take Boulder as an example.
It's where I'm from. It's a pretty liberal place. If someone came around and tried to poll
the next presidential election in Boulder and then put those results out and said this is
the person that's going to win, you probably wouldn't believe it, and you shouldn't, right? It's
biased.
So then why would you believe a systems experiment that's done in one experimental
setup? We've seen that the experimental setup can have a
huge impact on the conclusions of your experiment and the evaluation of your ideas.
And other areas go to great effort to kind of work around these problems, and I think we
should too. So we've had it easy, but we haven't had it right.
So this work was definitely done with a lot of help from a large number of people here.
So I'd like to make sure I thank them. And I'll take some questions, whatever you guys
have.
>>: I have two questions. First one is given this and given the time that you put in
obviously to doing this kind of measurement, demonstrating that there is, you know,
quite a lot of bias, what do you recommend for people to do? Especially economics.
What is the conclusion you draw around that?
>> Todd Mytkowicz: Mm-hmm. So I think there are a couple of things. First off, this is
just -- we're hopefully starting -- the idea behind this work is to first off show, you know, raise
awareness that these things can actually cause a problem in our evaluations. And you
know, I think a lot of people probably realize this when they kind of get into the nitty
gritty of doing their own evaluations. They see some -- you know, one day they get one
result and the next day they get another result.
And you know, maybe it doesn't make sense to actually go in and figure that out on an
individual basis for that particular person. But the idea is that when you kind of -- when
you would like to disseminate your results to other people, we think that you need to
start thinking about these things in your own evaluations. So we gave you an example
of randomized trials for compiler evaluations.
So the first part is just raising awareness and showing that there are solutions to these
problems, even if, you know, they're not necessarily the ones we like. That is that it
takes a lot of computer time to actually execute our evaluations now. But it doesn't
necessarily require any more input from you. I mean, if doing
these randomized trials mitigates the impact on your evaluations, then you know, that's
good enough. You don't actually need to go in and figure out the problem for
yourself.
But more generally speaking, the idea would be that, you
know, there are other sources of bias out there in a large number of other
parts of systems. So using a different heap size, for instance, is an example of this. That's
the Blackburn result.
And there are a host of these things out there. And as a community, it would behoove
us to start thinking about these types of things. And maybe, you know, having like
Microsoft open up something like the example that you gave earlier in
your questions, which is, you know, having the ability to farm these out to
one particular set of clusters that are dedicated to compiler evaluation work. You
have this set of machines that are all different and you start being able to
talk about average behavior.
So I think there are ways that we can start generalizing this and making impacts and
helping other people do their own evaluations. Because you don't necessarily as an
individual want to have do this for every experiment that you come across.
>>: [Indiscernible]. So how do you see the work going forward? In other words, are
you kind of done? Do you think that that -- you sort of made your mark and new
frontiers, or do you see a way -- things to do?
>> Todd Mytkowicz: Sure. There are definitely things to do. So for instance, there are
really almost dumb statistical techniques that you can apply to start understanding
whether or not you need to do these huge numbers of runs. And you can cut them off
really quickly if you notice that the variation isn't that big. So that's one aspect that is
just kind of a no-brainer, in the sense that I took a perlbench program that runs in
15 seconds and it took me three days to do an evaluation of it. That's not going to scale
for a large number of people.
So being able to be smart about how you do those evaluations is certainly an area of
future work that I think would be great.
I think the way I would describe it is, right now we're at the stage of botany as compared
to biology. We're still categorizing some of the impacts of the problems rather than
having like a general science that we can talk about. And so going and pushing on
what that general science is, whether we can [indiscernible] general methodologies that people
can use to help them understand this stuff, that's also an area I think is ripe for
future work.
>>: That particular point reminds me of this work by [indiscernible] where he took
people around him [indiscernible].
>> Todd Mytkowicz: Yeah. Thanks.
[Applause.]