>>: Okay. It's my great pleasure to welcome Todd -- I'm not going to try and pronounce
this name. Mytkowicz?
>> Todd Mytkowicz: Mytkowicz.
>>: Mytkowicz. From the University of Colorado, who is coming here to talk to us about
the possibility of joining our team. And talk to us about performance analysis, I
suppose. Take it away.
>> Todd Mytkowicz: Thanks. Thanks a lot. Thanks for having me. It's great to be
here.
So the title of my talk is: We have it easy, but do we have it right? This is actually a
similar talk, and hopefully you guys haven't already seen it, to something that my
advisor has given here about a year ago. So there's a lot of new content, but there's
going to be a little bit of overlap.
So when I started out in my grad program, I realized pretty quickly that systems is where
I wanted to be. I was taking a class in operating systems, and the professor
came in and gave a lecture on context switching, and you know, all the nitty gritty
details, and then said, okay, your homework assignment is go home, and you know,
here are these micro controllers, these little boards that we got. Go home and
implement context switching.
And, you know, this was just awesome. I mean, we learned about a concept in a class
from a textbook and reading, but then we were able to go home and program this on our
own and kind of hack and get something working that's in pretty much every major
operating system out there in use today.
So you know, systems research is really great because there's a big component of
engineering. And so you get to tinker and see your results.
But you know, there's also a notion of evaluation, right? And particularly when we do
research, we have to build things -- you know, we build smarter compiler optimizations. We
might come along and come up with some idea for a better compiler optimization.
But to know how much smarter it really is, we have to do an evaluation, right? We have to
compare ourselves to prior work that has come along.
So systems research has this kind of interesting interplay between the engineering
aspect of our work, where, you know, we build stuff, and then the evaluation aspect,
where we kind of say to other people, hey, look, you should be using my compiler. It's
going to be, you know, ten times faster than the current one that you're using. It will
make your life better.
So let's generalize a little bit about how we actually conduct experiments in systems
research.
The idea is that -- you know, this is a generalization, but generally what we start with is
a very complicated system. Right? We have some system that has multiple layers of a
hardware/software stack, application frameworks. And we go in and we take a
measurement of that system, and we may measure certain layers of that application
stack from the hardware up to the application layer, and we get something interesting
from those measurements, and we say, okay, look, there's a bottleneck in our system at
this point. We're seeing a huge number of writes to disk when we shouldn't be seeing
those. So that's a potential bottleneck. We can come up with a way to fix this.
So we come up with some innovation. And so we augment our system now with this
innovation, and we come along and again we do another measurement. And so we're
measuring before and after our innovation, and we'd like to ask, you know, what's the
efficacy of our innovation?
So but what if our data are wrong? Right? What if we had come along and done our
measurements and come to the conclusion using, you know, best methodologies, that
our innovation was beneficial?
Well, if our data are wrong, then obviously our conclusion is wrong, right? We've
claimed to the public or to our other research colleagues that, look, you know, my
compiler is going to make your program ten percent faster. But in reality, when they
actually use it, it doesn't actually make their program any faster.
So when we have a poor evaluation, we're going to come to wrong conclusions.
So how can our data be wrong? Well, I kind of didn't tell you the whole story about how
we actually carry out experiments in systems work. We traditionally execute our
experiments in some sort of an experimental setup. This is like the context that carries --
that holds all of the aspects of our experiments.
And so if you read a systems paper, people are usually pretty good at saying, like, okay,
we looked at, you know, the GCC compiler version 4.2, the Linux operating system
2.6.24, so the things that we think can bias our results are -- or impact our conclusions,
we're usually pretty good about specifying those things in our experimental work.
But what we're going to talk about today is that in reality, there are certain things in our
experimental setup that we would really never consider as potentially causing huge
changes in our conclusions, but they're going to impact our conclusions nonetheless.
So that's kind of what the point of this talk is: aspects of our
experimental setup are going to bias our data, and thus cause us to come to wrong
conclusions about our innovations.
Okay. So let's take a look at an example. I've been talking relatively abstractly
[indiscernible]. So let's take a look at an example of evaluating a compiler optimization.
So let's imagine that we have, using our notation here, some system -- I'm just going to
use perlbench compiled with gcc at O2. That's kind of our baseline.
We've got some innovation that we would like to evaluate. That's just the O3
optimizations above and beyond O2. Again, we're using perlbench.
And what we'd like to do is do an evaluation that's reproducible. So the idea is I gave
my advisor some homework, and I asked him, hey, Amer -- Amer is my
advisor -- could you go off and do an evaluation and tell me just what's the speedup of
O3 over O2?
And so Amer came back and used speedup as the metric. He used proper experimental
methodology and calculated that, you know, there's about an 18 percent, plus or minus
some very small amount, speedup by using the O3 optimizations over O2. All right.
So his conclusion is that O3 is good. Speedup is just the run time of O2
divided by the run time of O3 -- you just measure seconds -- so the fact that it's above one
means that there's about an 18 percent speedup.
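Restating the metric in symbols (notation ours), with T_O2 and T_O3 the measured run
times in seconds:

    \text{speedup} = \frac{T_{\mathrm{O2}}}{T_{\mathrm{O3}}}

so a value of about 1.18 is an 18 percent speedup, and a value of about 0.84 is roughly a
16 percent slowdown.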
So Amer says, you know what? The conclusion: O3 is good. It's really helped out.
So I came along and I did the exact same experiment. These are real results from real
runs. This is -- I'm not pulling any tricks. And I came back and I came -- used the same
machine, same binaries. We had no one else on the machine. It wasn't like frequency
scaling was going on. It was, you know, we made sure that the temperature was the
same in the room.
So I came back and I said, look, Amer, you're wrong. I see about a 16 percent slowdown,
plus or minus some very small amount. I did proper methodology as well. I
ran my experiments hundreds of times, and these are the numbers that I got.
So I could claim that O3 is bad. He must have made a mistake here.
So to figure out what was going on here, we went back and we kind of sat down
together and we started looking at what was different between the two of our
experimental setups. And we covered the main things that you would expect, right? It
was the same machine, the same binaries, the frequency scaling wasn't kicking on, no
one else was on the machine. The disks -- there wasn't like an NFS issue. And what we
came to was a really, really subtle aspect of the differences between our experimental
setups.
My name is longer than Amer's. And these machines were running UNIX. And so in our
HOME variable, I used a little bit more space than Amer did. And you know, if you take a
look at what that actually does to an ELF binary -- this is a very simplified view
of what an ELF binary really looks like, right? You've got your text, or your code, and
your data, you've got your environment variables, which are placed right
before the start of the run time stack. And then the stack, which grows in this direction,
right?
And because my environment variables took up a little bit more space, because my
name was slightly longer, the start of my stack address was slightly higher. And so, you
know, we -- this is the only thing we could see that was different between our
experimental setups, so we were wondering, could this really cause a change in
program performance as large as the results that we saw on the prior slides?
So we did a controlled experiment where we just took an empty environment, so we
started with no bytes in our environment, and that's given here in the X axis. And we
just progressively added environment variables, dummy environment variables, to this
empty environment.
On the Y axis, I'm showing you speedup -- so below 1.0 is a slowdown, above 1.0 is a
speedup -- and each point here is just the mean and variance of multiple runs. So this is
the conclusion of whether or not O3 gives a speedup over O2 for perlbench.
Now, clearly, what we see, despite the fact that the environment variables are actually
never used by the program, right, they never actually get touched by the program, we
see a fluctuation in our conclusions by a large amount. About ten percent. And even if
we disregard these guys, these two kind of outliers, we still have a variation that's
sizeable relative to what we kind of see in today's compiler evaluations.
And moreover, these points here, these are the confidence intervals, these runs are
incredibly deterministic. You know, you run a program once, and it takes ten seconds,
run a program again, it takes 10.1 seconds. So the variation here is very minimal
relative to the fluctuations from the performance benefits that we're seeing by doing
optimizations.
So clearly, environment variables are causing a change in our conclusions here. Even
though they're completely irrelevant to the actual program.
So the way I'm going to generalize these results, so we can see this isn't just a one-off
example for SPEC, is we're going to make use of a violin plot. So when you think about
a violin plot, it's just projecting all these data down to this -- to the violin plot. The white
dot in the middle is going to be the median, and then there's this -- if you tilt your head
sideways, you can see this distribution is just smoothed out. So it's effectively a smooth
histogram, and it's symmetric about the middle point.
So the nice thing about it is the height tells you what's the range in variation, and the
width or the fatness tells you kind of where the points are actually distributed. So it's a
good way to summarize these data.
Okay. So we looked at a bunch of the SPEC -- or all of the SPEC C benchmarks. And
we see that, you know, here's perlbench from the prior slide, and that there are other
benchmarks as well that suffer from a similar phenomenon. In particular, for lbm the
height of the violin is relatively large. So we may claim a speedup or a slow
down, depending on the size of our UNIX environment, our UNIX environment
variables.
And even if we disregard the ones that are below 1.0, we still see large fluctuations in
the height on the order of a couple of percent, which again is relatively large for
today's compiler evaluations.
So the setting of totally irrelevant environment variables can lead to biased conclusions in
our evaluations.
I should mention, feel free to interrupt me for questions if you guys have anything.
Yeah?
>>: First of all, if you run it one time, it may take ten seconds; run it again on the same
machine, same environment variables, it takes 10.1.
>> Todd Mytkowicz: Yup.
>>: But that's not much different from what you're getting here. Here you would get
variation of, say, 10 to 10 and a half seconds. So --
>> Todd Mytkowicz: Okay. I didn't do the math right. The variation is very small.
That's the distance [indiscernible].
>>: And the second question is, I never see [indiscernible] of [indiscernible].
>> Todd Mytkowicz: That's right. That's a good question. So the issue is these results
were -- we started at 0 and went in offsets of 64 bytes. And as a consequence, if you
actually expand this out in this region of the space, there's a lot more fluctuation going
on. But just as a function of the fact that -- we'll get to this in a little bit, but just doing
these results took a lot of computer time, so we just picked 64 bytes and went with it.
Okay. So once Amer and I used the same environment variables, we started to agree
on, you know, our overall conclusion for perlbench. But for one of the other
benchmarks, we disagreed. And so again, we went back and started to figure out
what was different between our experimental setups.
And it turned out that, you know, we were using the same compiled binaries, but
Amer was using ld *.o to actually do his linking. And so *.o just expands to
alphabetical order. And so you get, you know, something like this laid out in the text
section. The text section of the code has these .o files just swapped location-wise.
I, on the other hand, was using the default that ships with SPEC, and so I had a slightly
different ordering than what Amer had. And so we were curious, could this be the
source of bias?
It turns out, yes, indeed it is. I'm just going to show you the violin plots here
rather than kind of step through the plots like we did with the environment variables for perlbench.
We just took every -- so to do this experiment, we took a particular benchmark like
libquantum. We just compiled all the files into .o files, and then just randomly
changed the linking order when we actually built the binary. And we did that for 32
randomly generated linking orders. And then we asked, what's the speedup of O3 over
O2 for any particular linking order? And you can see we get very large fluctuations in our
conclusions -- whether it's a speedup or a slowdown -- but also still pretty wide
swaths, even when we always claim that there's a speedup.
So we're having large fluctuations in our conclusions based on the order in which
we link our programs together. Yeah?
>>: So it seems like some programs are more susceptible than others, obviously, to
this variation. And also, the attendant biases in the same way, like sjeng
in this one, you know, is also in the same kind of cluster --
>> Todd Mytkowicz: Yes.
>>: -- as it was with the -- do you have any explanation?
>> Todd Mytkowicz: Not a concise one that you're probably going to like. What turns
out to be interesting is that I'm showing you data for the Core 2. If we take, for
instance -- for instance, we'll take gcc. If you run gcc on a Pentium 4 and do the same
set of experiments, you're going to see a very different shape to this particular -- to the
violin graph. In particular, there's pretty much not one benchmark across all the
architectures we've tried where you don't see some sort of fluctuation like this.
So for a particular piece of -- a particular architecture, perlbench is very susceptible to
both environment variables, and link order. Part of that is -- we're going to get to the
reason as to why that is in a second.
>>: Is the underlying cause the same?
>> Todd Mytkowicz: No.
>>: Okay. Because it's surprising that the effect is so similar --
>> Todd Mytkowicz: Yeah.
>>: -- [indiscernible].
>> Todd Mytkowicz: I haven't actually brought up [indiscernible] and I haven't actually
looked into that one in depth. So I don't know if, [indiscernible], you know, the reason is
similar for both of these scenarios.
>>: So are these all with a particular environment [indiscernible]?
>> Todd Mytkowicz: Yes.
>>: Which ones did you choose and why?
>> Todd Mytkowicz: I just chose 0 -- so no bytes in the environment. So the question
was, are these all for a particular environment size? And
yeah. Because we knew that the environment variables were a problem, so we just said,
okay, we're going to just hold that constant for now as a parameter of our study and just
look at link order.
You raised an interesting question about the generality of these results, in the sense that if
we had used an environment size of, say, 15 bytes, would we get different results. We
would maybe get different fluctuations, so the distribution here might look
slightly different. But the overall result would be the same.
>>: And so remind me then, which link order is the 1.0 baseline?
>> Todd Mytkowicz: I'm not sure I understand your question.
>>: You're measuring stuff relative to --
>> Todd Mytkowicz: So there's O2 -- I'm sorry, the [run time] of O2
divided by the [run time] of O3. Yeah. And I'm just showing you the default and the
alphabetical order to show you that you just can't choose the default link order that
SPEC ships and expect that that, for instance, gets you the median. Right. I mean, that
doesn't really make any sense. SPEC doesn't necessarily know anything about our
hardware. They just give you a link order.
Okay. So the order of .o files can lead to contradictory conclusions here.
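As a sketch of the randomization being described -- our own illustration with made-up
object file names, not the scripts they actually used -- one way to emit a random link
order for each build is simply to shuffle the .o list before printing the link command:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        /* hypothetical object files for some benchmark */
        const char *objs[] = { "gates.o", "measure.o", "oper.o", "qec.o", "qureg.o" };
        size_t n = sizeof objs / sizeof objs[0];

        srand((unsigned)time(NULL));
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            const char *tmp = objs[i];
            objs[i] = objs[j];
            objs[j] = tmp;
        }

        printf("gcc -O3 -o benchmark");           /* one randomly ordered link line */
        for (size_t i = 0; i < n; i++)
            printf(" %s", objs[i]);
        printf("\n");
        return 0;
    }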
Okay, so where does bias come from? This gets a little bit to the questions that were
just being asked.
Well, you may wonder if maybe it was a methodological issue. We're pretty sure it's
not. The results that you've seen so far took five months of machine time. This
machine was literally churning out results for five straight months. So every graph that we've
showed you has actually, you know, been statistically rigorous in the sense that we've
applied statistics to all of our results. We've looked at a bunch of different platforms for
hardware. Two microprocessors, both the Pentium 4 and the Core 2 in Intel's hardware.
And the Power. And the M5 is a simulator, which we're not going to get to in these data,
but we have it in the paper if anybody is interested. So the simulator also suffers from
bias above and beyond the fact that it's simulating stuff, right?
The benchmarks we looked at are C and C++. But other people have replicated these
results with other programs and other benchmarks. And again, I'm showing you the
results from gcc 4.1, which is one of the more recent compilers when this -- when we
did these experiments, but we also saw this similar phenomenon on Intel hardware with
Intel's compiler.
You may think that Intel knows the details of their hardware and so their compiler can
kind of mitigate some of these effects. It turns out that they're almost exactly the
same.
Okay. So generally speaking, bias comes about from interaction with hardware buffers.
And what's interesting is these buffers, you know, as soon as you see these results, a
lot of people start to say, okay, this must have been a cache issue. You're
getting conflict misses in your L1, for instance.
And it turns out that the buffers we're talking about are really nonintuitive
buffers. They're not the things that you necessarily think about as generally described
as buffers.
So for instance, for perlbench on the Core 2, there's something
in the chip that's called the store-to-load forwarding queue. And the idea here is you can
imagine if you write to some variable in memory, then you want to load that value right
away. Rather than waiting for that write to go all the way back to the cache and actually
update the cache with that value, the store-to-load forwarding queue is this kind of internal
buffer within the actual pipeline of the chip that says, okay, I know that that value was just
written. I'm just going to use that value directly for my load if it comes right afterwards,
after the store.
So it's a performance optimization the chip does. You can imagine if you're in a
loop and you're doing loads and stores, loads and stores, and you hit
this right, you fit within this buffer, and you're going to get very good
performance if that particular loop is indicative of your program's performance.
But if you miss and you have to go back to the L1 for every read, you're going to have,
you know, really poor performance. And so that's exactly what's happening with
perlbench and the Core 2.
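For intuition only -- this is an assumed shape, not perlbench's actual code -- here is the
kind of loop where the effect shows up: every iteration stores a value and then
immediately loads it back, which is exactly the pattern the store-to-load forwarding
queue is there to catch.

    #include <stddef.h>

    /* `volatile` keeps acc in memory, so each iteration really does a store
     * followed shortly by a load of the same location.  Whether that load is
     * served out of the store-to-load forwarding buffer or by a slower trip
     * to the L1 depends on how the addresses involved happen to line up. */
    long sum_store_load(const long *a, size_t n)
    {
        volatile long acc = 0;
        for (size_t i = 0; i < n; i++)
            acc = acc + a[i];    /* load acc stored last iteration, store it again */
        return acc;
    }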
There are some static variables that are allocated in the BSS segment, and there are some
stack-allocated variables that conflict with them, and they happen to be in the same hot
loop. And as a consequence, you get good store-to-load forwarding if your environment
variables align the hot stack-allocated variables such that they don't conflict with those
static variables. Yeah?
>>: So what's the [indiscernible] versus this queue?
>> Todd Mytkowicz: Oh, I don't know off the -- I'm not sure off the top of my head.
>>: Multiple cycles to the outline [indiscernible].
>> Todd Mytkowicz: I think it's -- it must be something reasonable, yeah. At least one,
right? [Laughter].
Okay. So for bzip2 on the simulator -- this is the M5 O3CPU model, an Alpha
simulator -- there's a hot loop in bzip2. And you know, bzip2 really just does compress,
decompress, compress, decompress. And the compression loop pretty much fits
in about eight Alpha instructions. And as a consequence, depending on how that
loop aligns in the front end, the front end can completely turn off and stream the
decoded instructions directly from this loop stream detector. Then, you know, you
don't have to do any of the decoding. So you save yourself a cycle.
Turns out that has a huge impact on bzip2's performance in the simulator. And as a
consequence, you can see the performance of bzip2 fluctuate by 20, 30 percent,
depending on how you actually align this loop within that particular structure.
So lastly, on the Pentium 4 -- this is a bit of a dated machine nowadays -- the
performance of libquantum will fluctuate by 30 to 40 percent, depending on how that
particular hot loop in libquantum fits within the trace cache. And the trace cache, you
can think of it as just a special purpose, kind of fancy, instruction cache.
So bias, generally speaking, comes about when O2 interacts differently with these buffers
than O3 does. And a lot of times, right, the buffers that we're
talking about are not the kind of buffers that you normally consider to be an issue in
microarchitecture.
Okay. So what can we do about bias? Well, other areas have suffered from bias for a
long, long time. And they have kind of their own ways to deal with it. And so we're just
going to appropriate one of their ideas, which is randomized trials. Yeah?
>>: It seems like where you're going with this is you're going to talk about how to make
sure your studies [indiscernible].
>> Todd Mytkowicz: Right.
>>: The main takeaway I got from the last couple of slides is that it seems almost hopeless to
keep thinking about doing these optimizations because these random things are going to
completely defeat the optimizations you're doing.
>> Todd Mytkowicz: Well, I think that it's an interesting question to think about kind of at
a philosophical level. But at the end of the day, if someone came to you --
>>: Optimizing the wrong thing. The thing we're trying to optimize in O3 is like a minor
effect compared to these other structures that you should be paying attention to.
>> Todd Mytkowicz: Possibly, possibly. Right. I mean, if someone came to you and
said, hey, I've got this idea called register allocation. You're going to -- here's how it
works. And you know, I haven't actually been able to evaluate it yet. You're probably
going to say, okay, that sounds like a pretty good idea. Right? There are registers,
which give really fast access to memory. Being smart about allocating your data to those
things seems reasonable.
And at the end of the day, register allocation provides a
benefit that's relatively large and maybe will overwhelm some of the effects that we're
starting to see here.
So it's not that optimizations themselves are kind of, you know, hopeless. Right?
There is -- there's absolutely hope here, since each one of these variations you're
seeing could in turn be turned into an optimization, right?
>>: [Indiscernible] things have changed. Introduce registers that compiler writers put
stuff in, take advantage of it. And the new kind of thing to sort of forward [indiscernible]
load things are not as dismal compiler writers and so they don't write the [indiscernible]
and take advantage [indiscernible] which means that the whole discipline has kind of
changed by not tracking the hardware anymore.
>> Todd Mytkowicz: Yeah.
>>: So it's an interesting question about whether it should be or whether, you know --
>> Todd Mytkowicz: It's also an interesting question which gets into kind of marketing
aspects of this, which is do we have the ability to do that even if we would like to?
Right? I mean, we're not the ones that make these chips. We'd love to be able to sit
down with Intel and say, hey, what's the, you know, what's exactly going on in this
particular buffer? How do we exploit it? And we don't have that information. You know,
rightly so, maybe. They're not going to give that out. Yeah.
>>: You said you also tested on ICC and it showed a lot of the same issues. So I
mean, did they all have the same information? Seems like it's just too much
complicated stuff going on for this targeting to be [indiscernible].
>> Todd Mytkowicz: Yeah. I mean, in some sense, for instance, take a look at
perlbench with this stack -- the stack allocated variables being the problem. That's a
data dependent thing, right? So you're never going to be able to prove statically that
your stack is going to be a problem for this particular, you know, structure. So
sometimes that part is almost, you know, you can't really do much about it.
But the question here that we come back to is, okay, what can we do about it. And the
idea is that we know that these are sources of bias in our experiments. And you know,
if we have a compiler optimization that gets, say, maybe 2 to 5 percent -- you know, God
says, hey, this is going to give you two percent -- we'd like to be able to measure that.
And at least if we do randomized trials effectively, we can actually get to those
numbers. That is, we're going to make a claim about average behavior over
environment variables and link order, because we know that those things can bias
our results.
And so the idea is that rather than looking at just the performance of a single program,
we look at the performance of a large number of programs where we randomly set
these parameters that we know to cause bias, and as such, mitigate some of the effects of
those parameters.
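Here is a minimal sketch of the bookkeeping for such a randomized trial. The run times
below are placeholders, not the talk's data, and pairing one O2 run with one O3 run per
randomized setup is just one reasonable design choice.

    #include <math.h>
    #include <stdio.h>

    /* Mean and sample standard deviation of x[0..n-1]. */
    static void mean_std(const double *x, int n, double *mean, double *std)
    {
        double s = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];
        *mean = s / n;
        for (int i = 0; i < n; i++)
            ss += (x[i] - *mean) * (x[i] - *mean);
        *std = sqrt(ss / (n - 1));
    }

    int main(void)
    {
        /* run times in seconds under randomized environment sizes and link
         * orders -- placeholder values, one O2/O3 pair per random setup */
        double t_o2[] = { 10.2, 10.4, 9.9, 10.8, 10.1, 10.6 };
        double t_o3[] = {  9.1,  9.8, 9.0, 10.2,  9.3,  9.9 };
        enum { N = 6 };

        double speedup[N];
        for (int i = 0; i < N; i++)
            speedup[i] = t_o2[i] / t_o3[i];

        double m, sd;
        mean_std(speedup, N, &m, &sd);
        /* rough 95% confidence interval for the mean speedup */
        printf("speedup = %.3f +/- %.3f\n", m, 1.96 * sd / sqrt((double)N));
        return 0;   /* build with: cc randomized_trial.c -lm */
    }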
So let's take a look at how this would look in practice. Here's our perlbench example.
I'm showing you cycles, which is just run time, on the X axis, and this is a histogram.
We looked at both O2 and O3, and for each of these compiled binaries, we looked at a large
number of link orders and a large number of environment variables. So, you know,
there's a huge number of runs here. And we're just plotting the distribution of run times.
And the idea is if we have done a good job of randomizing over environment variables
and link orders, we can start to make claims about the distributions here.
So clearly, you know -- well, it's actually not that clear, because it's not that big of a
speedup because of these guys over here -- O3 tends to give a slight, on average,
increase in performance over O2. And so that would be our claim for the efficacy
of the O3 optimizations on perlbench. Yeah?
>>: It also looks pretty clear that the long tail is more typically associated with O3.
>> Todd Mytkowicz: That's right. Yeah.
>>: So reasons?
>> Todd Mytkowicz: I don't know the reasons. I haven't looked into it.
>>: Because I mean, arguably, yeah, so [indiscernible].
>>: [Indiscernible].
>>: Yeah, exactly. I mean --
>>: [Indiscernible] and all that.
>> Todd Mytkowicz: It depends. If you're the Oracles of the world and you want to sell a
single binary and you would like it to get good performance on average -- I mean, on
average or in the worst case are two different questions depending on the organization's
needs. Yeah?
>>: Again, [indiscernible] back a slide. So you took a -- took a lot of effort for you to
find out those two [indiscernible].
>> Todd Mytkowicz: Absolutely.
>>: So how do I -- I mean, where [indiscernible] --
>>: Yeah. How do we know there aren't other variables we should also be
randomizing?
>> Todd Mytkowicz: You're absolutely right. There are. There are other variables, but
we don't know them yet. So until we know them, we can't, you know --
>>: Seems like the argument for your strategy, I mean, you're doing randomization, but
it seems like another argument, another strategy would be to measure things in a
deployed setting. In other words, you know, look at a thousand customers, you look
and you may have [indiscernible] quote, randomized environment variables.
>> Todd Mytkowicz: Yup.
>>: And you measure things in that setting. You're going to get a better picture of
[indiscernible] --
>> Todd Mytkowicz: That's exactly right.
>>: -- happens.
>> Todd Mytkowicz: And in fact, you guys have a group that's starting to do that. So I
mean, that's kind of cool in the sense that you have the ability to talk about sending this
out -- O2 out to a thousand people -- and seeing what kind of performance gains they
get.
But I think that kind of gets at, [indiscernible], your point a little bit, and the title of the talk,
which is, you know, if you would like to make a claim about the benefit of your
optimization, we need to have codified methods that allow
us to make those claims.
Okay. So we were curious if other kinds of common tasks that we do for evaluation can
produce biased data in the systems [indiscernible], and so in particular, we were curious
about Java, because Java is a little bit more of a complicated environment.
And so I spent the summer at IBM, and while at IBM, my mentor there, Peter Sweeney,
one day came to me and said, hey, Todd, I've got this program, and it's taking a lot of time in
this one particular method. I would like you to speed up that method.
And so he gave me this task and I went off, and he told me what method it was. It doesn't
really matter what method that was, right? So my baseline here, in our notation, is just
the program that Peter gave me, and he said, you know, this method M is taking up almost
20 percent of total execution. Make a fix and see if you can change it. Speed it up.
For my optimization, I went in and did some algorithmic changes. So my innovation
was the algorithmic change. And before, you know, before I sent Peter my results, I
wanted to run it by my advisor. I knew that bias had bitten me before. I wanted to make
sure that I did a good job of evaluating this. So I said, Amer, can you take my optimized
program and tell me how much time we're spending in this method M.
So Amer did that and he came back the next day and says, hey, Todd, your
optimization is good. He said, you did a really good job. We're spending no time in that
particular method.
Okay. That's better than I thought but great.
So Peter, I sent it off to Peter and Peter came back and said, Todd, you know, I thought
we talked about this, there's a problem in this method M. You didn't change it at all.
And you know what? I did proper methodology. You know, I ran this multiple times. So
you can definitely make this thing faster. Okay.
So to take a look at why this happens, we did the exact same thing as before. We
went back and looked at what the differences were between Amer's and Peter's
experimental setup. And it turned out that they were, you know, they were both -- they
knew about environment variables. They had taken care of that.
But what they were dealing with now is that they were using two different profilers.
Now, these profilers were both statistically based profilers. They're sampling profilers.
And they're supposed to produce the same result, right? The profiler shouldn't determine
where our hot spots are.
So could this be the source of bias? Turns out, yes, indeed, it was. We did another
controlled experiment where we looked at four different profilers. Hprof is an
open source profiler that ships with the Sun JDK. These are all data from the Sun JDK.
Jprofiler is a commercial profiler. It's a pretty expensive profiler. It won a bunch of
developer awards. As is YourKit. And xprof is the profiler that ships internally with Sun
HotSpot. So this one is open source, and this one is part of Sun.
So there's -- each profiler here has three bars. And that's the hottest method according
to that particular profiler. So hprof said that we spent about six percent of [indiscernible]
execution in this particular method. And that was the hottest method.
Jprofiler, on the other hand, said we spent no time in that particular method. Actually,
we spent most of our time -- 12 percent of our time -- in this gray method here. And for these
points, I'm showing you the mean and confidence intervals, so these are reproducible
results. You know, there's very little agreement as to what the
hot parts of our program are across these four different profilers.
And I'm just not going to take the time here, but this generalizes to other benchmarks.
Yeah?
>>: [Indiscernible].
>> Todd Mytkowicz: They're all sampling profilers?
>>: [Indiscernible] instructions kind of basic loss and just --
>> Todd Mytkowicz: I didn't.
>>: -- use that as a [indiscernible].
>> Todd Mytkowicz: I didn't, no. That gets into other issues because the point we
wanted to make was that these are all -- they're all supposed to be producing the same
result. So the question you're kind of getting to is which is right and we'll get to that in a
second.
So this is for the Sun JDK, but it also happens for IBM's JDK as well. Very similar
results. But I'll point you to our [indiscernible] paper if you're interested.
Okay. Where does bias come from? Well, in this situation, it turns out to be a single
source. Before we get into that, we need to take a quick diversion into how our profilers
work.
So a profiler works by periodically stopping your program and asking, what
method is executing right now? So you can imagine if your program has two
methods, foo and bar, and the profiler takes a hundred samples, 90 of which are
attributed to foo, well, you know, you're spending 90 percent of your time in foo. So
they're nice because you can, you know, control the sampling rate which allows you to
control the overhead. So they're usually pretty low overhead profilers as far as profiling
goes.
And the other nice thing about them is there's kind of a built-in error measurement of
how accurate your results are. That is, if you do your sampling correctly, the
error in your estimate goes down with the square root of the number of samples.
So if you increase the sampling for that particular method, or run your program longer, you
get more samples, and you get a more statistically accurate
measurement of how much time you're in that particular method.
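In symbols: if a method receives k of n samples, the estimated fraction of time spent in it
and its standard error are roughly

    \hat{p} = \frac{k}{n}, \qquad \mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} \;\propto\; \frac{1}{\sqrt{n}}

which is the built-in error bound he's referring to -- and it only holds if the samples really
are independent, which is exactly the property at issue next.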
So -- I've kind of been hinting at this a little bit as I've been talking about it -- all
these profilers, because they're statistically based, require that your samples be
independent. So what that means is, if I take a sample every ten
milliseconds, right, and my program happens to be in foo every ten milliseconds,
well, I'm going to get a biased result -- I'm going to say that foo is taking up a large amount of
overall execution time. Even if foo is incredibly short and took up very little of our overall
execution time.
So the point is that the sample at time T should tell you
nothing about when the sample at time T plus one is going to occur, or T plus 15, right?
So just to prove to you that this actually happens with the current profilers that we
looked at on the prior slide, I'm using an autocorrelation graph. It may look a little scary,
but it's actually pretty simple. Autocorrelation is a really simple way to think about
correlations between two time series.
So you take a time series. In this case we're taking the time at which a particular
sample was serviced by the JVM when a profiler said, you know, I'd like to know what
method is running. So we're going to see if the times at which those
samples occur are correlated. So we take that time series -- for
instance, this is with hprof; let me just finish this one point -- and we duplicate it. And
so then we do a -- what's the correlation between those two time series? Well, clearly, if
you correlate a time series with itself, you're going to get one.
Then we shift that second series by one and we do -- that's our lag of one here on the X
axis, and we ask what's the correlation. And we just progressively shift this guy down
and do the correlation.
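For reference, the quantity being plotted is the standard sample autocorrelation of the
series of sample times x_1, ..., x_n at lag k (the plotting tool may normalize slightly
differently):

    \rho_k = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}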
The point is that once we get to a lag that's relatively large, we should see
that this goes to 0. The autocorrelation score should go to 0, which is the Y axis.
And what we see instead is a periodic pattern, which shows that the time at which a
sample occurs at time T can tell you something about when a sample is going to occur
at time T plus, say, a thousand.
So as a consequence, this is our way of showing you that samples are not independent,
which was kind of the main point of what statistically based profilers need to adhere to
in order to be accurate. Yeah?
>>: [Indiscernible] amount of correlation?
>> Todd Mytkowicz: For this it is. Yeah. If you use a profiler -- we'll get to this in a
second -- that doesn't use yieldpoints, where the samples are not correlated, it's lower.
>>: And so these profilers [indiscernible] the actual time [indiscernible] the cycle count or
whatever?
>> Todd Mytkowicz: No, I changed this. I had to go --
>>: You had to go and do it yourself.
>> Todd Mytkowicz: Yeah. The nice thing is that they all actually use the same kind of
hook into the VM to ask for when a profile occurs -- I'm sorry when a sample occurs. So
they all use the same call, the same method call.
>>: [Indiscernible] the time --
>> Todd Mytkowicz: No. Okay. So this randomization stuff has been around. We've
known about this in the systems community since the 80s, if not earlier. That was the
first paper I could find on it. But all of our profilers, you know, I
never thought this was going to be that big of an impact, to be perfectly honest. And I
think probably most people have a similar kind of idea.
So let's take a look at why this actually occurs -- why the time
at which a sample occurs at T tells you something about the time the sample
occurs at T plus one. The reason this comes about is because of these things
called yieldpoints that exist in the VM.
So in Java, the code that the VM produces is not
preemptive. It's kind of quasi-preemptive. And the point is that every once in a while,
the JIT will put out an instruction so that when the application executes that
method, it will say, hey, look, VM, should I stop
running? Do you need to do anything like garbage collection or profiling tasks?
And so as a consequence you can't sample at all parts of your program. You can
only sample at these particular points, what are called yieldpoints.
So if you take a look at this particular code, if you run this, you know, or a variant of this,
on your own Java platform, you'll see that a profiler will give very interesting results. That
is, it will pick one of these two methods as the place where you're spending all of your time.
So imagine that you have some code that loops, and that loop code calls a method --
call it straight -- which has some mathematics in it, maybe some modulus and some other
expensive operations.
When the JIT sees that this code gets hot, it's going to place a yieldpoint right here. At
least the Sun JVM will do this.
Now, you can imagine if you're profiling and all of a sudden your timer goes
off and says, hey, look, I'd like to know what method is executing, the VM is going to
say, look, I can't tell you until we get to a yieldpoint, and the next yieldpoint you
get to is right here. So there's a delay here, and that delay turns out to cause this
subtle interaction between our profiler asking to take
a sample and the actual dynamics of the underlying application.
So yieldpoints turn out to cause the autocorrelation that we saw in the
prior example.
Okay. What can we do about bias? Well, unlike last time, we have a --
>>: Bias on yieldpoints.
>> Todd Mytkowicz: Sure.
>>: So it seems to me like yieldpoints might be particularly problematic for this in Java,
but if these functions just have fairly predictable timing behavior -- like in this
expensive code, both of those take ten milliseconds, and you're doing a sample every
20 milliseconds -- you could still have this alignment without having a yieldpoint.
>> Todd Mytkowicz: You're absolutely right, yeah. Yeah.
Okay. So we know the source of bias here. At least we're claiming it's yieldpoints. I
haven't actually proved that yet.
And so we introduced a profiler called tprof. And the idea here is tprof doesn't use
yieldpoints. It just sits outside the VM. It's a bit of an engineering effort, and we can go into
the details of how this works maybe offline.
But the idea, at a very high level, is that it sits outside the VM, randomizes its
sampling interval, uses a UNIX signal to stop the entire VM process, gets an x86
address, and then maps that x86 address to the jitted code and then, you know, to a
bytecode address.
So the idea is that we can sample at all program points and we can control our sampling
period and thus make it random.
So how do we know that tprof is accurate? Well, we use a technique we're calling --
it's based on other experimental sciences -- causality analysis. The idea here is that we
have some method M and we'd like to know that our profile is accurate. And so the way
that we go about figuring this out is we cause a change to a particular method, and we'd
like an accurate profiler to reflect that particular change.
Now, speeding up a program by a fixed amount is actually really hard. You have side
effects from memory allocation and all these types of things. But we have the insight
that slowing it down by a fixed amount is actually relatively easy.
So the way that we do that is we take a particular method that we're interested in
studying and just add Fibonacci code -- a while loop that calculates the first N Fibonacci
numbers -- every time M executes. As we progressively increase N, we're going to
systematically slow down the program. We profile M every time that we
change this N value. And an unbiased profiler, one that's producing an accurate profile,
should show that slowdown only in M, right, and in no other methods.
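Here is a sketch of the injected slowdown, written in C for concreteness even though the
actual experiments were on Java methods; the method and helper names are hypothetical.

    static volatile unsigned long fib_sink;   /* keeps the busy work from being optimized away */

    /* First n Fibonacci numbers computed iteratively: O(n) extra work per call. */
    static unsigned long fib(int n)
    {
        unsigned long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            unsigned long next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    static void do_real_work(void)
    {
        /* stands in for method M's original body */
    }

    /* Method M under study: its real work plus a tunable slowdown confined to M. */
    void method_m(int slowdown_n)
    {
        do_real_work();
        fib_sink = fib(slowdown_n);
    }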
So the idea, kind of intuitively, is that if we slow down a program by 20 seconds
by changing method M, an accurate profiler should say
that this method M is now taking 20 seconds longer. So a 20 second slowdown of the
application means a 20 second increase in M, and the slope as we change this N here
should be one.
So if we take a look at these four different profilers here, we've got tprof, hprof, xprof,
and jprof. The slope is given right afterwards. You guys can see that.
Each cluster of points here is -- as we're changing this variable N, the X axis gives us
overall run time, so we're slowing down the program as we change N by a large
amount, almost a factor of three. The Y axis is just the amount of time that a profiler
says we're in that particular method M that we're slowing down.
And you'll see that for tprof, the slope is about one; that's the profiler that doesn't
use yieldpoints. The other profilers have systematic bias. So you can see, for instance,
with jprof, the slope is .65, so we're slowing down the program by a huge amount by
inserting Fibonacci into this method. But some other method that's totally unrelated is
now getting hotter according to that profiler, and systematically getting hotter
nonetheless.
It turns out the results in the top graph are actually really good. Most of the time when
you pick this method M, the other profilers, the yieldpoint-based profilers, say that even
though you slowed down the program by a factor of two, this method that we inserted
Fibonacci into didn't change at all.
And tprof, the profiler that doesn't use yieldpoints doesn't have that issue.
>>: [Indiscernible].
>> Todd Mytkowicz: Just different -- it's different methods or different benchmarks. We
picked a large number of benchmarks and a large number of methods to do
experiments on because it was automated.
So yieldpoint-based profilers rarely produce unbiased profiles.
Okay. So we've seen a bunch of examples where really subtle innocuous aspects of
our experimental design or experimental setup can bias our data and our conclusions.
And kind of the overarching point that we've been trying to push here is that
small changes can have really large effects. It turns out that that's actually the classical
definition of chaos, if you think about it from a mathematical standpoint.
And so we were interested in taking standard tools from nonlinear
dynamics, which is where the idea of chaos comes from, trying to apply them to
understanding computer performance, and seeing what pops out.
So this is work that was actually first started by Hugues Berry at [INRIA]. But we
carried it to its next logical conclusion. And the idea here is, imagine that you have
some really large state space X. And X is every single register in your machine. You
have some really complicated function F. And every cycle, F updates the state X to
produce a new state, right? F is deterministic, and then you feed X of T plus one
back into F and you get X of T plus two.
So the idea is that you can get some notion of the performance of a program over time
by thinking about the program going through this giant state space, and we just use
standard techniques. Once you actually frame the problem
as a physics problem, you can just use standard tools
from nonlinear dynamics to start to understand some of the properties of F. And one of
those properties is sensitive dependence on initial conditions, which means that if you
take X and you make a slight perturbation to it, you can see an exponential divergence
in the distance between X and the perturbed X as
you apply this function F to them.
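In the notation he's using, with x_t the machine state at cycle t:

    x_{t+1} = F(x_t), \qquad \bigl\| F^{t}(x_0 + \delta_0) - F^{t}(x_0) \bigr\| \approx \|\delta_0\|\, e^{\lambda t}

A positive largest Lyapunov exponent lambda is the usual formal statement of that
sensitive dependence on initial conditions.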
So the kind of high-level point here is that there's chaos in computers and that's kind of
a nice theoretical underpinning for some of the results that we've started to see.
So our conclusion here is that, well, this means that we need to be really careful with
our instrumentation. And that kind of leads me into some of the other work that my
coworkers and I have done. All with the idea of trying to reduce kind of the overhead of
our instrumentation techniques in order to capture relatively interesting information. So
I'm going to go through rather quickly these next two slides. Yeah?
>>: On the other hand, a heavyweight instrumentation that actually stopped after every
instruction, it would slow your program down --
>> Todd Mytkowicz: Yes?
>>: -- but it would be hard to measure [indiscernible] absolutely accurate about where
the hot spots are.
>> Todd Mytkowicz: Well, I mean, if you take a look -- think about bytecode
instructions. This can get into a bit of a philosophical argument. But think about
measuring the hot spot -- the program -- where the hot spots in your program are by
number of instructions executed per function.
If a function gets inlined, then you know, the JIT all of a sudden removes that function,
and the timing information isn't necessarily one-to-one with the number of
instructions. So that may be some surrogate for a definition of performance. But
at the end of the day, we really do care about time and not necessarily the number of
instructions or some other kind of heavyweight instrumentation measure. Yeah.
>>: So I'm just curious. So you are saying that the profilers you've looked at, they all
accurately reflect when the jit does [indiscernible] things like that, accurately reflect
which function is responsible?
>> Todd Mytkowicz: No, no. So --
>>: [Indiscernible] that line variety of sort of which was the function that --
>> Todd Mytkowicz: Was the hottest.
>>: Right. And there are a number of different possible explanations. One could be that they're
inaccurately determining the time --
>> Todd Mytkowicz: Right.
>>: -- because of things like jit. So did you eliminate that as a possibility?
>> Todd Mytkowicz: Yeah. You can turn off inlining and this effect still stays. In fact, I
think that graph may have actually been with inlining turned off.
I mean, inlining is a big problem for any profiler. There's been a bunch of work with
dealing with that in and of itself. But the effects that we've seen are not because of
inlining.
Okay. So the idea with this work that we did at MICRO was -- when you start
trying to diagnose a performance anomaly, the lowest level
of data you can start to capture is hardware metrics like cache misses and
branch mispredicts and these types of things. And modern microprocessors have a
huge number of registers that allow you to capture a large number of -- I'm sorry.
They have a huge number of metrics that you'd like to capture, but only a few registers to
actually do it. So you can run your program and maybe capture two performance
metrics. But at the end of the day, you don't know what you're looking for, so you need
to do a large number of runs before you can start to kind of understand what the
performance of -- or where the performance anomaly is coming from. So you may need
to do one run where you capture data cache misses and another where you capture
instruction cache misses. And what you would like is some way to reason across those
particular runs.
So here's an example of the problem that you face when you start to try and do that.
I'm showing you just instructions per cycle here on the X axis.
Let's imagine that in this run you capture instructions per cycle and data cache misses,
and this guy, this second run you capture instructions per cycle and instruction cache
misses.
Now, for whatever reason, maybe the OS jumped in here and caused a delay in this
particular phase of the program's behavior, so it's replicated three times, while down
here it's replicated only once.
You would like to come up with some alignment that
matches this event three in this trace with the event three in that trace.
And so we use an idea from genomic sequencing called dynamic time warping. To do
this alignment, you come up with an optimal alignment such that the distance between
these two traces is minimized.
And we found that it worked relatively well for reasoning about a large number of metrics
captured over many separate runs.
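The textbook dynamic-time-warping recurrence captures the idea; their implementation
may differ in details such as the per-event distance d:

    D(i,j) = d(x_i, y_j) + \min\bigl\{ D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \bigr\}, \qquad D(0,0) = 0

The alignment is read off by backtracking from D(m,n); the diagonal move matches
events one for one, while the other two moves let an event in one trace repeat, which is
what absorbs the phase that shows up three times in one run and once in the other.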
Okay. So the second idea to reduce our instrumentation is a notion we're calling inferred
call path profiling. And the idea here is that call paths are really useful for program
performance or debugging, but unfortunately they're kind of expensive to collect at run
time in production systems. And the way you normally do that
is you add instrumentation that kind of keeps track. If you look at this as a call graph, it
keeps track, as the program executes, of what method you're actually executing and
where you have actually come from on the call chain.
And there are various optimizations you can do to kind of make sure that you do that in
an efficient way, but you still have to execute all this extra code in order to do that.
So let's imagine we're in foo and we'd like to know, did we come through A or through B.
Now, traditionally, as I said, you add instrumentation. What we would like to do is kind
of limit the amount of instrumentation we add, so we're just sampling -- sampling
the PC at foo and using that information to infer how we actually got here.
So I'm going to add a set of numbers here, and hopefully this makes it a little bit
easier to see how we do that. These numbers are just the size, in bytes,
of the activation record for these particular functions.
So main has, you know, one variable, so it's got eight bytes. B has one variable, so it's
got eight bytes. A on the other hand, has 16 bytes, has two variables. So this
activation record is a little bit larger.
So the point is, if we came through A, the stack height -- the difference between main and
foo -- should be 32 bytes. If we came through B, it's going to be 24 bytes.
So the idea is that stack height provides context -- a calling context -- without
computation. We can sample the PC, which gives us the function we're in, and the
stack pointer, the SP. A lot of modern hardware will actually allow you to sample the
registers at any particular time when you sample the actual program, so you get the
program counter and the stack pointer, and you can effectively do this difference and
then infer the path that you actually came from.
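Here is a toy version of the stack-height idea -- our own sketch, not their implementation.
foo reports how far its frame is from main's, and that distance differs depending on
whether foo was reached through a or through b, because their activation records have
different sizes. Frame layout is compiler dependent, so compile with -O0, and the
cross-frame pointer subtraction is technically undefined behavior that nonetheless
illustrates the point on common platforms.

    #include <stdio.h>

    static char *stack_base;                    /* address of a local in main */

    void foo(void)
    {
        char here;
        /* The "stack height" at this sample point: it encodes the caller. */
        printf("stack height at foo: %td bytes\n", stack_base - &here);
    }

    void a(void) { volatile double pad[4]; pad[0] = 0; foo(); }  /* bigger activation record  */
    void b(void) { volatile double pad[1]; pad[0] = 0; foo(); }  /* smaller activation record */

    int main(void)
    {
        char base;
        stack_base = &base;
        a();                                    /* foo reached through a */
        b();                                    /* foo reached through b */
        return 0;
    }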
So there are a bunch more details here that I'm totally glossing over. You
can imagine there are a bunch of issues with ambiguity. That's the biggest problem with
this approach. And we have a bunch of techniques in the paper to actually deal with
that, and I can talk about that offline if people are interested.
Okay. So there's been a lot of work that we've covered here. I'm just going to really
quickly go over some related work.
If people haven't read Andy Georges' paper from OOPSLA in 2007, I highly
suggest it. It's an awesome paper on statistically rigorous performance evaluations for
Java. And they go on to show how, if you don't use proper statistical methodology
across a large number of experiments, you can
kind of come to different conclusions in your results.
And we kind of see our work as an extension of that, to show that even when using
proper methodologies, you have to worry about bias in your results.
There has been a lot of work on bias in experiments, but it's usually
kind of domain-specific stuff. So Steve Blackburn had a paper where he showed
that if you change the size of your heap, you're going to come to very
different conclusions about memory allocator performance.
And Alameldeen and Wood showed that, you know, when you're doing simulations of
microprocessors, you want to change the latency of your L2 cache in order to get
variability in your runs from run to run. So they're kind of domain-specific applications of
bias.
There's been a lot of work on evaluating profilers. The difference between the work that
we've done and the work that's come prior is most of this evaluation work has a gold
standard to which you can compare the results. We lack that gold standard. If
someone gives you a profile and says that you're spending 80 percent of your time in
foo, there's nothing that we can use that validates that a hundred
percent. So we had to come up with something in order to figure that out.
And then there's been a lot of work on just call path profiling that I'm not going to touch
on.
Okay. So let's take a look at some key lessons here.
Would you believe a presidential poll from one small town? I'll take Boulder as an example.
It's where I'm from. It's a pretty liberal place. If someone came around and tried to poll
the next presidential election in Boulder and then put those results out and said this is
the person that's going to win, you probably wouldn't believe it, and you shouldn't, right? It's
biased.
So then why would you believe a systems experiment that's done in one experimental
setup? We've seen that the experimental setup can have a
huge impact on the conclusions of your experiment and the evaluation of your ideas.
And other areas go to great effort to kind of work around these problems, and I think we
should too. So we've had it easy, but we haven't had it right.
So this work was definitely done with a lot of help from a large number of people here.
So I'd like to make sure I thank them. And I'll take some questions, whatever you guys
have.
>>: I have two questions. First one is given this and given the time that you put in
obviously to doing this kind of measurement, demonstrating that there is, you know,
quite a lot of bias, what do you recommend for people to do? Especially economics.
What is the conclusion you draw around that?
>> Todd Mytkowicz: Mm-hmm. So I think there are a couple of things. First off, this is
just -- we're hopefully starting -- the idea behind this work is to first off show, you know, raise
awareness that these things can actually cause a problem in our evaluations. And you
know, I think a lot of people probably realize this when they kind of get into the nitty
gritty of doing their own evaluations. They see some -- you know, one day they get one
result and the next day they get another result.
And you know, maybe it doesn't make sense to actually go in and figure that out on an
individual basis for that particular person. But the idea is that when you kind of -- when
you would like to disseminate your results to other people, we think that you need to
start thinking about these things in your own evaluations. So we gave you an example
of randomized trials for compiler evaluations.
So the first part is just raising awareness and showing that there are solutions to these
problems, even if, you know, they're not necessarily the ones we like. That is that it
takes a lot of computer time to actually execute our evaluations now. But it doesn't
necessarily require any more input from you. I mean, if doing
these randomized trials mitigates the impact on your evaluations, then you know, that's
good enough. You don't actually need to go in and figure out the problem for
yourself.
But more generally speaking, the idea would be that, you
know, there are other sources of bias out there in a large number of other
parts of systems. So using a different heap size, for instance, is an example of this. That's
the Blackburn result.
And there are a host of these things out there. And as a community, it would behoove
us to start thinking about these types of things. And maybe, you know, having like
Microsoft open up something like the example that you gave earlier in
your questions, which is, you know, having the ability to farm these out to
one particular set of clusters that are dedicated to compiler evaluation work. You
have this set of machines that are all different and you start being able to
talk about average behavior.
So I think there are ways that we can start generalizing this and making impacts and
helping other people do their own evaluations. Because you don't necessarily as an
individual want to have do this for every experiment that you come across.
>>: [Indiscernible]. So how do you see the work going forward? In other words, are
you kind of done? Do you think that that -- you sort of made your mark and new
frontiers, or do you see a way -- things to do?
>> Todd Mytkowicz: Sure. There are definitely things to do. So for instance, there are
really almost dumb statistical techniques that you can apply to start understanding
whether or not you need to do these huge numbers of runs. And you can cut them off
really quickly if you notice that the variation isn't that big. So that's one aspect that is
just kind of a no-brainer, in the sense that I took a perlbench program that runs in
15 seconds and it took me three days to do an evaluation of it. That's not going to scale
for a large number of people.
So being able to be smart about how you do those evaluations is certainly an area of
future work that I think would be great.
I think the way I would describe it is, right now we're at the stage of botany as compared
to biology. We're still categorizing some of the impacts of the problems rather than
having like a general science that we can talk about. And so going and pushing on
what that general science is, whether we can [indiscernible] general methodologies that people
can use to help them understand this stuff, that's also an area I think is ripe for
future work.
>>: That particular point reminds me of this work by [indiscernible] where he took
people around him [indiscernible].
>> Todd Mytkowicz: Yeah. Thanks.
[Applause.]