>>Aaron Smith: So it's my great pleasure to introduce Vijay Reddi. Vijay recently graduated from Harvard University where he was co-advised by Mike Smith and David Brooks, and he will be joining A&D Research Labs soon. I'm not sure when. So today he's going to talk about software-assisted hardware reliability. >>Vijay Janapa Reddi: Thanks, Aaron. So some of this work that I'm actually going to be talking about is work that I've actually finished up as part of my thesis, and, I mean, this is just one avenue of work that I've looked at, but I'm not really going to go into too much depth because I'm going to talk about processor reliability issues and specifically one particular thing. But what I'm going to try and do is not just get nailed down into one specific paper. I think I'm just going to try and stem things from the way it stands in terms of the design variability issues or saying -- sort of building up the architectural piece and how software actually becomes an integral piece in the long run. So it's really trying to put a vision together. I mean, a lot of the papers are out there, but it's sort of trying to pitch whatever it is that I've been pushing forward. So just to give you a brief introduction of the kind of work that I've been doing, it's really -- I would consider myself a software guy for the most part. What I've done is work in [inaudible] translators. For those of you in the architecture community, you would know about PIN, which is one of the things I've worked on quite a bit all through my master's. And then I was basically looking for how do you actually use software intelligently to mitigate some of the hardware problems. So when I went over to Harvard, I started looking at architecture problems and, you know, from architecture there are a couple of circuit-level issues that were coming up basically in terms of reliability, and I started fusing the three things together in my head and I kind of felt like there was a new avenue opening up for research that's sort of emerging, which is what we'll talk about today. So the crux of the work is going to be about variations. So typically what we're seeing is that transistor sizes are shrinking. Now, that's a great thing because now for the same area you're able to get a lot more compute density. But the problem is you make these transistor sizes so much smaller, they're becoming very susceptible to variations. Now, variations can be either static or dynamic. On the static side it means that one transistor has inherently different properties from its neighboring transistor. On the dynamic side -- sorry. And that basically comes out of process issues. When you're fabricating, there are design-time challenges, so fabrication and manufacturing-time challenges. On the dynamic side, dynamic meaning it is sort of when the processor is actually running, on the static side it's, you know, this is what you have. You have this kind of chip that's got some kind of variability within it. On the dynamic side the activity of the program pretty much leads to different hot spots on the chip, as we were saying, on this heat map. Or you can also end up with voltage fluctuations within the chip. And that is actually going to become a major problem in the future. 
Most of the work that's out there right now sort of looks at the static side, the process variation side of things, and there's been a fair amount of work in the temperature or the dynamic temperature side of things, but voltage variation is still not really understood in the space yet, and that's sort of really what I'm going to peel off into. So that's going to be the focus. I'm going to start off real basic because it's not really well known out there, so I'm going to talk about why voltage within the microprocessor actually fluctuates in a way that we don't want it to fluctuate, and why that is going to become a problem. Then the prior work in the space, why some of those existing techniques are not going to scale into the future, what we need to do about it, and how, at the end of the day, software actually becomes a very critical piece. In fact, I'm going to pitch this entire architectural reliability problem as a software optimization problem to the compiler space and why the software will actually be able to mitigate it much more effectively than the hardware is going to be able to do it. So voltage noise. That's typically what we call this problem. Voltage noise or voltage fluctuation. And one of the points I want to emphasize is that those variations end up translating to some kind of delay in the circuit, and that delay is going to impact the performance of your chip. And voltage especially is a big problem because it impacts frequency and the power efficiency of the cores, and those are things we will peel into in the coming couple of slides. But why does voltage within the microprocessor actually fluctuate? Well, to understand that, what I'm showing you here is the current trace for GCC running in our simulator, and we're seeing that the current is fluctuating inside the processor because, well, the workload is doing different things over the course of the several hundred cycles that we're looking at here. Now, the voltage, in response, is actually fluctuating as well. Now, typically you want the voltage to be sitting around at the nominal voltage. You want it to be sitting idle at that point, but it's not, and that's because there are parasitics involved in the power delivery network going from your power supply, like your battery source, all the way down to the actual execution units. And the parasitics are what are causing the voltage to fluctuate. Now, in order to make sure the processor is always going to work correctly, we have to understand how much of a swing we're going to dynamically see in here. Now, while parasitics are one reason that the voltage will fluctuate, today's microprocessors use clock gating, for instance, as a way of saving power, dynamically shutting off parts of the processor based on the usage. Now, from a power perspective, that's great, but, again, that's going to cause the voltage to swing because whenever you see a sudden shift in current, the parasitics effectively cause the voltage to swing much larger. So understand that you want the voltage to be sitting around the nominal, but it's not -- and that's because of the parasitics. And you need to understand how much of a swing you see in order to make sure that the circuit works correctly, because if the voltage falls below a certain operating margin or a certain threshold value, well, then you end up with a slow circuit, slower than what you've actually designed or you're expecting it to operate at. 
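As a rough illustration of the parasitics he is describing, the dip seen at the die when the load current shifts is roughly the resistive IR drop plus the inductive L·dI/dt term across the delivery network. The sketch below and its component values are illustrative assumptions, not figures from the talk.

```python
# Illustrative first-order estimate (values are assumptions, not from the talk):
# when the load current steps, the die voltage dips by roughly the IR drop plus
# the inductive L*dI/dt term of the power-delivery parasitics.
R = 0.001    # assumed effective series resistance, ohms
L = 10e-12   # assumed effective series inductance, henries (~10 pH)

def droop(i_old, i_new, dt):
    """Voltage dip for a current step from i_old to i_new amps over dt seconds."""
    return i_new * R + L * (i_new - i_old) / dt

# A 20 A current step over ~2.5 ns (half a period of a ~200 MHz resonance):
print(droop(10.0, 30.0, 2.5e-9))   # ~0.11 V, a double-digit-percent dip on a 1 V rail
```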
At that point you can end up with timing violations in the circuit, and then you end up with incorrect results. If it goes up too high, then you can potentially impact the behavior of the -- you can impact the characteristics of the transistor and you end up damaging the transistor itself, and so you have, you know, a lifetime issue. And throughout the rest of the talk I'm just going to focus on voltage emergencies -- I'm going to call these points where you can potentially run into incorrect execution voltage emergencies -- and I'm only talking about the correctness issue. I'm not going to talk about the lifetime issue, because correctness is fundamental to operation, and it has grave implications on the frequency and power efficiency of the processor. So the way we typically handle this today industry-wide, the practice is -- what you do is you try to determine what is the absolute worst case voltage swing you would see inside the processor while it's actually running. And the only way to sort of determine that is to actually write viruses that cause -- you really have to understand how -- these are very microcode kind of viruses where you try to understand how to activate the functional blocks in a certain way so you get these rapid swings in current, and that translates to voltage swings. And that's what they do. And here I'm actually showing you a snapshot of what the code I've written does to a processor, and you can see the voltage swings by a large amount. These are actually measured results, and that's effectively what they do. Based on that large swing, you can determine how much of a guard band or how much voltage swing you can actually tolerate. And then you call this the upper and lower margin, and then you determine where you're going to set your nominal. And then when you run your real program, what you see is that the real program is well within those extremes. So you're guaranteed correctness, right? So margins get you that correctness. But the problem is that the way you allocate these bands so there's room for guaranteeing correctness has severe penalties. It's like a static -- it's a static penalty you end up paying up front. For instance, if your nominal voltage is 1 volt, the IBM Power6 designers, for instance, found that at any given point their voltage might drop all the way down to .8 volts just based on the activity inside the chip. What that effectively means is that although you're powering your circuits with one volt, the effective operating speed is set by .8 volts, because that determines the frequency. And so naturally you're losing some amount of peak operational speed in the circuit, and that's what I'm trying to show here for the 45 nanometer design space where your nominal is set at 1 volt, and what we're seeing is that that translates to a 24 percent loss in clock frequency immediately. Now, as you're going into future nodes, and if I believe the ITRS projections, right, which say that you're going to get a slight amount of supply voltage scaling but very minimal threshold voltage scaling, then that's going to end up with slower circuits as you're going into the future. That means that your peak operational frequency for a given margin, for instance, the way the Power6 designers were using, for the exact same margin what you end up seeing is that the peak frequency starts dropping pretty quickly. 
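A rough sanity check on that frequency claim, using a textbook alpha-power delay model with made-up parameters; his 24 percent number comes from his own 45 nanometer model, so this is only meant to show the shape of the effect.

```python
# Rough sketch, assuming an alpha-power delay model with illustrative parameters
# (the 24% figure in the talk comes from his own 45 nm model, not this formula).
VTH, ALPHA = 0.35, 1.3          # assumed threshold voltage and alpha exponent

def max_freq(v):
    return (v - VTH) ** ALPHA / v   # relative maximum clock frequency at supply v

nominal, worst_case_droop = 1.0, 0.8   # power the chip at 1.0 V, time the paths for 0.8 V
loss = 1 - max_freq(worst_case_droop) / max_freq(nominal)
print(f"peak frequency given up to the margin: {loss:.0%}")   # ~22% with these numbers
```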
Now, we might not necessarily be focused completely on the peak frequency just in terms of clock frequency improvements, but that's important because at the end of the day, it sort of determines how much work you're going to get done over a period of time, right? So the frequency is an important component here. So the only way to actually maintain frequency from one generation to another, for instance, is you have to enable tighter margins. But I just said that the whole reason you have these guard bands or margins is because you're trying to make sure your circuit is always operating correctly. And if you enable tighter margins to sustain your frequency in future processor generations, then you end up with a very high chance of getting incorrect execution. And, really, you don't have a choice. I mean, this is the direction we will have to go, because one of the big ones is that the margins also impact energy efficiency, or the amount of power your circuits -- your processor is burning. For instance, when we look at the 20 percent margin that I was talking about, imagine if you could lower your nominal down to a much closer, tighter guard band, right? If you could operate your circuit at .83 volts and still maintain the same frequency, then that translates to a 31 percent reduction in your power. And it's huge, because this is, you know, a chip-wide effect. It's not sort of being optimized for a specific workload or something. This is just your basic chip operation. So it has both frequency and energy efficiency issues. And another thing that it actually impacts is the cost of the chip itself, which is something that we don't typically think about. But when I was talking to the Intel guys, they're very concerned about the amount of capacitance they would have to put on. So basically in order to tolerate voltage swings -- it's kind of hard to see here, but what they do is they put a lot of capacitance on in order to prevent these large swings from happening. It helps them dampen the effect. And that's what I'm showing you here. This is the land side -- you know, the back of a processor, the package -- and there are a lot of capacitors here, and as I'm saying, if the swings are going to be getting larger, basically if voltage noise becomes a bigger problem in the future, then you're going to have to pack in a lot more capacitance, which means you're going to have to invest more in the chip. And if you don't have the package capacitance, which is basically what I'm showing you here -- what I've done is I've peeled the package caps off. You can literally pop them off carefully. You've got to be careful you don't shock the chip. But you can see that if the package cap is reduced, then the swing ends up increasing a decent amount. And I will actually be using that as a way of projecting into future nodes in a bit. But there's a cost issue here as well. So far I was talking about industry sort of designing for that extreme worst case point, where you're running these power viruses that cause conditions that almost never occur in a real program -- that's not what you'd see. And yet your average-case behavior is based on this extreme design point. Instead, why not really figure out what the average behavior of the chip is and then operate it accordingly? Don't set those worst case margins, really. Just sort of set more aggressive conditions. And what I'm showing you here are chip measurement results of voltage samples. 
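The 31 percent figure follows directly from the usual dynamic-power scaling; a minimal check of the arithmetic:

```python
# Minimal check of the power argument: dynamic power scales roughly with V^2 at
# a fixed frequency, so reclaiming the guardband from 1.0 V down to 0.83 V saves
# about 31% -- the number quoted above.
v_nominal, v_tight = 1.0, 0.83
print(f"dynamic power reduction: {1 - (v_tight / v_nominal) ** 2:.0%}")   # ~31%
```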
So zero percent means it's the nominal voltage, and each sample basically is how much deviation you see in the voltage going up or, like, below the nominal voltage. And what I'm saying here is that in the real production chip what you see is that the swing is only about 3 percent. So it's a very, very small amount of voltage swing you see. But instead we go out and we put in these large guard bands that the processor very rarely gets stressed to. In fact, it's a good thing that they're doing it, because the measurement results do show that you do see these very large swings inside the chip, and it does go as far as 12 percent. And on the Core 2 Duo we actually determined that the margin that the Intel guys actually put in was 14 percent. So it does make practical sense that they have these large bands, because that's what's getting you that correctness at all times. But for the most part, the swing is only 2 to 3 percent. So instead you'd want to typically operate under these situations. Yeah? >>: So under all these situations, at least I guess from what you say on [inaudible], is that there's sort of a constant computation that's going on. >>Vijay Janapa Reddi: Yeah. >>: If you have a more realistic operating system scenario, then you'd have bursts of activity, which are then followed by bursts of inactivity, which sounds an awful lot more like how I would think a power virus would be written. So do you have a sense that in a real sort of operating system environment, the swing is actually a lot more than this? >>Vijay Janapa Reddi: So these are actually measured results from the chip where you're actually running it with the OS running and everything. >>: But you have sort of a constant computation as opposed to something -- an intermittent computation, like decoding a video frame that comes every n times a second or something like that. >>Vijay Janapa Reddi: The time scales we're talking about are extremely small. I mean, it seems that the kind of computation you're talking about is in the millisecond range. But here the swings -- if the current happens to drop dramatically within a few cycles, that's the kind of granularity that we're talking about. >>: So is it a few cycles? Is that the -- what's the time constant of your sort of worst case power virus? What do you oscillate at to make it have the worst possible -- >>Vijay Janapa Reddi: Typically you'd want to oscillate it at around 100 to 200 megahertz, the shifts in the current, because that's when basically -- you would see this in almost every IBM paper, Intel paper. It's where the characteristics sit. That is the point where you get the worst case. And so you can -- with the virus I had written, you can actually figure out whether your code is actually oscillating at that point by constructing FFTs, and then you'd be able to see it. >>: [inaudible] >>Vijay Janapa Reddi: Yeah. Okay. Yeah, sorry. Yeah, a power virus is quite different from the dI/dt virus, but yeah. So what I'm trying to argue for is that if you look at a lot of the production runs that we're seeing, like, over several thousands of runs, most of the samples are pretty much sitting within the 3 to 4 percent range. So you don't really want to optimize based on the 12 percent very infrequent case that we're seeing. You want to design it for the typical case, the typical case point, because then you can get a lot of frequency or power improvements. 
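A minimal sketch of the FFT check he mentions for seeing whether a dI/dt virus is actually exercising that 100 to 200 megahertz region; the trace here is synthetic and the numbers are illustrative, not his measurement code.

```python
# Minimal sketch (illustrative, not his code) of the FFT check he mentions:
# take a sampled current trace and see whether its energy sits near the
# ~100-200 MHz resonance of the power-delivery network.
import numpy as np

fs = 3.0e9                                  # sample once per cycle at an assumed 3 GHz
t = np.arange(0, 2e-6, 1 / fs)              # 2 microseconds of samples
current = 20 + 10 * np.sign(np.sin(2 * np.pi * 150e6 * t))  # square-wave-ish dI/dt virus

spectrum = np.abs(np.fft.rfft(current - current.mean()))
freqs = np.fft.rfftfreq(len(current), 1 / fs)
peak = freqs[spectrum.argmax()]
print(f"dominant current frequency: {peak / 1e6:.0f} MHz")  # ~150 MHz for this trace
```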
And there's been a little bit of work in the research space sort of looking at what can we do -- I mean, do we really have to design for it? And people have been saying maybe we don't need those extreme cases. And I'm going to argue a little bit about some of the challenges we're seeing if you're going to try to implement it. Basically the notion is we want to design for the typical case, which means that if we have some amount of voltage range, you enable a tighter margin. You run it closer. Then you end up with the risk of getting incorrect execution if the voltage dips below. So the typical solution has been, oh, we're going to build sensors and spread them across the chip and you're going to have some kind of detection mechanism that's basically sampling and then trying to figure out, does the voltage get too close at a certain point, and if it does, that means that there's a high chance that the voltage is going to go beyond the correctness margin. And at that point you want to throttle or basically slow the processor down a little bit. And you can do this at very fine-grain levels by controlling the issue rate of the machine or the way you dispatch instructions out. And by doing that, you'll be able to prevent the voltage from actually dipping below, and then you prevent the emergency. Because all it takes is one glitch and that's it. If you get incorrect execution, you've got incorrect execution. It doesn't matter whether this happens once or a million times. Correctness, right? That's what we're going after. And the way that it's typically implemented is by monitoring the current voltage in the processor. And if you cross that particular voltage setting, then you invoke the throttling mechanism, however you choose to implement it. There are multiple flavors of that. And the biggest problem here is the delay, the time it takes to be able to detect something and then actually respond to it. And I'll show you in the next slide that the delay has to be extremely small. The challenge with the delay is that you have to have all these sensors scattered across the chip, because you could get a dip anywhere across the entire chip, and that's the problem, really, because you have to detect all of them and then figure out whether you're really getting the problem. So where you set the threshold, or the way you figure out if you're going to be crossing the margin, becomes very critical. Typically if your margin, let's say, is your average -- your average-case design margin is 4 percent, very tight, you're trying to get the efficiency -- then you want to set that threshold point where you're going to try and protect the processor as close as possible to that extreme point. Now, when you do it, one of the problems is there might not just be enough time to detect and prevent the voltage from dropping below. And so that's what we're seeing here. If you assume that the red line is the typical case, you know, the way the voltage would actually behave if you weren't trying to protect the processor, and the blue line is the one where you actually do take action to try and prevent the problem, if there's not enough time, well, the voltage is going to dip below and you're going to end up with an incorrect execution. 
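To make the delay and threshold trade-off concrete, here is a toy classification of a per-cycle voltage trace under a delayed sensor. The function, its semantics, and the example thresholds are illustrative assumptions, not the hardware scheme itself.

```python
# Toy model (not the actual hardware scheme): a throttling controller only sees
# the voltage from `delay` cycles ago. Count emergencies it would fail to head
# off, and trigger firings that were not followed by a real emergency.
def classify(voltage_trace, trigger, margin, delay):
    missed, false_alarms = 0, 0
    for cycle, v in enumerate(voltage_trace):
        seen = voltage_trace[cycle - delay] if cycle >= delay else voltage_trace[0]
        if v < margin and seen >= trigger:
            missed += 1          # dipped below the margin before the trigger fired
        if seen < trigger and v >= margin:
            false_alarms += 1    # would have throttled, but nothing bad happened
    return missed, false_alarms

# e.g. classify(trace, trigger=0.97, margin=0.96, delay=20) on a per-cycle trace:
# a slow sensor misses fast dips, and raising the trigger trades misses for
# false alarms, which is the false-positive problem discussed next.
```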
And it turns out that if you really want to be able to pull this off in this particular configuration, for instance, the delay to be able to detect and respond has to be no more than one cycle, because if you go beyond one cycle of detection and response time, then you end up with a large number of failures. And like I was saying, all it takes is one mistake. That's the problem. And so, therefore, designing it is extremely difficult, because a typical delay time for sensors tends to be around, like, 20 cycles or so. Even in the best case, if you assume it's got to be about 10 cycles, it's impossible from our data. The natural thing to do is to push the thresholds up, the trigger thresholds. That way you have some room to actually prevent the problem. When you do that, you can very safely prevent the problem. But you can end up with false positives when you try to do this in the hardware, because there might be a case where the voltage does go beyond that triggering point and then you try to prevent it, obviously, and then you realize that you didn't really have to do that. Turns out that at this kind of configuration you end up with 80 percent false positives, and the penalties are huge because you're throttling unnecessarily, you're slowing the machine down for no reason. So those two have been pretty much the prominent ways of looking at these things. Industry does do the worst case design right now, but as we scale into the future, that's just not going to work because the swings are getting bigger. As you make the transistors smaller, you're pulling a lot more current, and then naturally as the voltage decreases and the current increases, your swing ends up becoming much larger, and so you must have a much bigger guard band. So that's not scaling. And the threshold techniques that people have been proposing -- well, the problem there is that the detection circuits are just not there. You're not capable of doing it. And what I'm going to pitch towards is that you really need to be able to understand what is causing the voltage in the processor to actually fluctuate, and that's very much dependent upon how aggressive your processor is and how aggressive the microarchitecture is, and then what are the characteristics of the power delivery network itself and how the program is actually behaving. Because if the program tends to have a loop that maybe does not actually resonate at that frequency, then you're not going to get the problem, the emergency problem. So it very much depends on how the program behaves as well. And in this talk I'm actually going to try and take that combined approach: given that we know that those interactions are the ones that actually cause the problem, I'm going to go the route of actually understanding, by looking at the voltage traces, how we're going to actually design a better solution. And what we've found is that when you actually look at these voltage fluctuations inside the processor, you realize that it's actually recurring, basically meaning that there's a very strong recurring pattern. Once you end up with the problem, we've found that the problem pattern tends to repeat over and over. And then it's pretty much based on -- the reason that these swings happen is because of sudden stalls inside the processor, because when the processor suddenly stops, then there's a huge shift in the current. And that causes the voltage to swing by a huge amount. 
And the stalls tend to be a big -- the events tend to -- the microarchitectural events are a pretty good indicator of when you're getting a stall. And it's also the program, the code itself, how the code is actually running. And so I'll walk you through an example of that in the next few slides. But just to put a context around it -- Yeah? >>: So this makes sense. I can understand this from a single [inaudible]. >>Vijay Janapa Reddi: Yeah. >>: [inaudible] >>Vijay Janapa Reddi: Yeah. >>: [inaudible] >>Vijay Janapa Reddi: Yeah. So I actually want to start getting into the software piece, and that's where the hardware piece starts crumbling a bit, because I start off saying that we can build -- for a single core, I show how you can actually build an intelligent hardware structure that gets rid of the way people have been proposing and you can actually implement it, and that will solve the problem in one core. And I haven't extended that into, like, the two-core systems, you know, the basic multicore system. But that's where the software becomes much more critical, because trying to gather and simulate data from multiple cores in very short time scales is just next to impossible, right? And trying to figure out the interaction is even more difficult. But the software then -- because we have understood how events across the processor, across multiple cores, cause the voltage to swing, you can actually schedule the code accordingly. And by code, meaning multiple threads. >>: [inaudible] >>Vijay Janapa Reddi: So the way we do it is actually we assume some feedback mechanism. >>: [inaudible] >>Vijay Janapa Reddi: Yeah. When I get to that point I'll actually explain what kind of feedback mechanism we need and what we actually end up using. So for most of the measurements, the way we do it -- just because I've talked about some of the data -- is actually we tap into the VCC and VSS pins that you can actually tap into at the back of the motherboard, and we use a differential probe, because these are very high frequency measurements, and capture that on the scope while the programs are running, doing execution, and then analyze it offline. >>: Does the chip have just one VCC and VSS or does it have one VCC on one core and a different VCC on the other core? >>Vijay Janapa Reddi: So the ones -- so the multicore system that I'm actually going to be talking about is all connected to the same power, but the ones I haven't looked at -- >>: It's not [inaudible]. >>Vijay Janapa Reddi: No. >>: How bad are the differences? If one end of the chip is 1 volt, how bad would the other end of the chip be? >>Vijay Janapa Reddi: It's not necessarily, you know, the same everywhere. That's the thing. I think that's the point you are making. And so what we're tapping into is, you know, a basic keyhole view of what is really going on inside the processor at that point where we're actually measuring the voltage. So I'm not going to walk through the validation process. We have validated this thing and made sure that our setup is actually representative of what we expect to see, in terms of, are our measurements, given some kind of test cases running on the processor, do they actually make sense. We have done the validation, and for the Core 2 Duo that was also done with Intel, with the folks who typically do the motherboard regulator designs. 
So that's all just from the measurement and characterization kind of basic setup, but this is the way we actually sort of look into solutions in the future. What we have is, in order to really understand what is going on inside the processor, we use a simulator, and it's an x86 simulator. What we do is we rely on Wattch to get the power model or the current consumption, which basically tracks microarchitectural activity within the processor -- that's Wattch. That generates current, and then we feed that into the power delivery model that we have, and that power delivery model basically, cycle by cycle, spits out what the voltage is. And we use that as our feedback mechanism into whatever higher-level solution we're interested in looking at. And this has been -- this is primarily the single core setup. For the multicore I've just done it all on the real system. And so when I get to that point, I will be talking about that. So my solution to the whole thing is, you know, it's called tolerance, avoidance, and elimination. It's the way I think we should be going about this problem. And this is just on the voltage noise, but I believe we can expand it. The idea is, instead of completely preventing the voltage from ever really dropping below the point, the idea is we allow the voltage to actually dip below the margin, right? We're allowing incorrect execution, so basically it's an error-prone circuit or an error-prone chip. And when we detect that the voltage has gone below, then we know that the computation is incorrect, and then we rely on a recovery unit to roll the execution back. And Intel actually -- actually, we were talking to them on Friday, and they do have a prototype chip that actually does error recovery right now. So it's sort of coming together, and they're actually very much interested in how you do avoidance mechanisms as well, because they're thinking about putting these things in, because you cannot always rely on allowing errors to happen. It's just too costly, basically, if you allow them to happen. >>: So if your chip voltage drops below and is no longer reliable, how can you trust that the roll-back is done correctly? >>Vijay Janapa Reddi: Because the roll-back pretty much assumes some amount of delay by the time you detect it. And you're keeping snapshots, so the time when you take the snapshot versus when you actually detect the problem is sort of -- you know it a priori when you're configuring it that way. >>: But all that's done in the same sort of logic, I assume. So how do you know that logic is functioning correctly? >>Vijay Janapa Reddi: So that part is, like, timing protected. I think that's what you're trying to get to. >>: Okay. >>Vijay Janapa Reddi: So you basically allow errors to happen, and when you get an error, you roll the execution back. But by allowing errors to happen, that's just the first step. In the future what we ideally want to do is we want to be able to predict these things and say, okay, we want to get rid of the problem altogether. We want to be able to anticipate the fact that the voltage will dip below. And so I'll show you that you can actually build an intelligent predictor mechanism by understanding -- by relying on tolerance, you understand what causes the problem. Then you can actually build an intelligent predictor for that problem. 
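A stripped-down sketch of the simulation loop he describes at the top of this segment: a per-cycle current estimate from the activity/power model fed through a lumped power-delivery model that hands a voltage back every cycle. The single RLC stage and every constant below are illustrative stand-ins, not his actual model.

```python
# Stripped-down sketch of the cycle-by-cycle feedback loop described above:
# per-cycle current from a Wattch-style activity model goes through a lumped
# power-delivery model, which returns a per-cycle voltage that higher-level
# schemes (margins, predictors, throttling) can react to. The RLC stage and
# its constants are illustrative assumptions, not the model from the talk.
VDD, MARGIN = 1.0, 0.96          # nominal supply and a 4% lower margin
R, L, C = 1e-3, 10e-12, 0.1e-6   # assumed lumped parasitics and decap
DT = 1 / 3.0e9                   # one cycle at an assumed 3 GHz clock

def run(current_per_cycle):
    """Return the cycles on which the modeled die voltage crosses the margin."""
    i_ind = current_per_cycle[0]
    v_die = VDD - i_ind * R                # start at the steady-state operating point
    emergencies = []
    for cycle, i_load in enumerate(current_per_cycle):
        # semi-implicit update: inductor current chases the drop across R and L,
        # then the decap absorbs the mismatch between delivered and drawn current
        i_ind += (VDD - v_die - i_ind * R) / L * DT
        v_die += (i_ind - i_load) / C * DT
        if v_die < MARGIN:
            emergencies.append(cycle)      # a voltage emergency in this toy model
    return emergencies

# e.g. a stall-then-burst current pattern: steady draw, a brief dip, then a surge
trace = [20.0] * 2000 + [14.0] * 30 + [26.0] * 2000
print(len(run(trace)))
```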
And you can then take that, and from what we learn you can actually push it up even further to the software level, where the software can completely eliminate the problem altogether by restructuring the code or scheduling the threads a bit differently on the microprocessor. So I'm going to talk about tolerance first, because it allows us to understand what is causing the problem. So for this I'm going to walk you through a pretty complicated series of animations which will sort of explain what within the microprocessor is actually causing the voltage to really swing hard and when you actually get a problem, because by understanding that, you can actually take preventive actions. So I'm showing you the voltage trace for GCC over 800 cycles or so. And what we see is that if we set up a 4 percent margin, you end up with potentially incorrect -- you know, you can end up with voltage emergencies. And at first it looks like it's pretty random, it's just random fluctuations, but if I were to splice this up you would see that there are recurring patterns in the noise: A, B, C. Basically there are recurring phases of voltage activity. And what we've found is that because we're in the simulator, we're able to easily stitch it back up to the code level. And what we've found is that GCC is actually initializing its register set, and this is during the first part of the code, and it's all in a loop. And so it makes sense that that's why you're getting these recurring problems or these recurring swings, these repeating patterns. But what we're specifically interested in is why do we suddenly see these large swings at this point -- why not here versus there. And when we observed the activity inside the processor, what we find is that pipeline flushes are going on, or basically the microarchitecture mispredicted a branch. And at that point it has to drain the entire pipeline out. So it's a sudden stall in activity, and that causes the voltage to swing. But what's interesting is also that only a few of them, a few of those sudden pipeline stalls, actually cause an emergency. Like this one doesn't, but this one does. And so that's what we want to figure out, sort of what are the subtle differences, what is so different about this branch misprediction versus that branch misprediction. And when we relate it back to the code, we find that the hot spots are a bit different. So 2 here corresponds to the basic block in the control flow graph there. We're looking at the branch, the branch that controls the basic block, the path following basic block 2, and 5 means the branch here, whether you're going this route or whether you're actually going to take this loop back edge path. It's a back edge going back, you know, so it's a loop. So, for instance, we see that 5, whenever it mispredicts, whenever the machine mispredicts on branch 5, we always get a problem. We always end up with an emergency. But in the case of 2, that doesn't happen. And to sort of figure out what is so special about 2, you know, why 2 only causes a problem in one situation versus, you know, in A it doesn't cause a problem, we have to really understand what code the machine is actually executing. And the best heuristic for that is the issue rate of the machine. So we see that in A, in phase A, where you don't have a problem, the issue rate is pretty jagged, so that means there's a lot of -- the current is not being consumed very aggressively. It's sort of an average -- a smaller amount of current. 
But in B the processor is consuming a larger amount of current. And I say that the processor is consuming a large amount of current directly from the issue graph because the issue logic typically consumes -- it is the brain of the processor, right? So it consumes a decent amount of power. And so here you're seeing -- I'm going to zoom in a bit -- here you're seeing that when, in B, the processor is issuing a lot of instructions and really being aggressive, it's doing exactly what you would like it to do, it's running a lot of instructions within a short amount of time, but when you get that branch misprediction here, at that point the issue completely stops. The machine comes to a sudden standstill, and that causes the voltage actually to swing up very quickly, and then slightly past that -- maybe about a couple of cycles later -- the issue ramps up suddenly. So there's a complete stall in current draw and then there's a sudden burst, because there's a lot of ILP sitting on the correct path, and that's what's actually causing -- the sudden current draw is causing the voltage to actually drop below. Whereas here it's a bit different, right? Following this point here, you're able to see that the issue is not that aggressive. And those subtle changes are effectively what, you know, determine whether you have a problem or not. And relating it back to the code, what you see is in A, where you don't have a problem, versus B, where you have a problem, the paths that the program is actually taking are slightly different. >>: From a processor perspective, if sudden drops in activity are the problem, then does the processor just compensate by doing a lot of busy work, like issuing [inaudible] or something? >>Vijay Janapa Reddi: So you can actually -- yeah, you can end up dilating it, but the problem is you can't do that with every single pipeline flush. In fact, if you do that -- if we think about it as that being a heuristic to predicting, you know, whether you will actually get an emergency or not, then it's really bad. It's about 8 percent or so. You need to be able to track the history that is actually causing the problem, not just, oh, did I get a branch misprediction, because the branches are just one point where you get this problem. There are actually cache misses, TLB misses. And, in fact, you don't even need any -- you don't even need to have specific microarchitectural events to cause problems. The code itself can be a problem. >>: So even if the processor could compensate, it wouldn't know when to do it? >>Vijay Janapa Reddi: It wouldn't really know when to do it because -- yeah, sometimes there's actually no indication, because it's just the way the code is resonating inside. Right. And, actually, I will get to the point where I show you an example where a divide instruction basically stalls the processor in a way that it actually causes this effect. >>: I'm sorry, but in all those cases, the TLB and cache misses and the problems you were mentioning, won't those all be visible through the issue rate? Isn't that where the processor can realize how much -- I mean, does the issue rate correspond very tightly with the current drop? >>Vijay Janapa Reddi: It tends to correspond pretty decently, yes. And you're saying just react more to the issue rate? >>: Yeah, more to the issue rate [inaudible] if it drops too quickly, then you do fake work, and if it rises too quickly then -- >>Vijay Janapa Reddi: That's exactly what -- I mean, that's a good idea, right? 
Fundamentally, it's a very simple and a good idea to do it, but the problem is being able to detect that delta in a very short amount of time, because these increases and drops are so sharp that by the time you detect them and you actually respond to it, that's where the problem comes, and that's the sort of threshold mechanism I was talking about. You're effectively trying to detect whether that's happening. And that's the big problem: being able to detect and respond just requires a very quick response time that we cannot actually build. >>: But that's, like you were saying, more of a proactive mechanism, where I can simply -- if I issued n instructions a cycle, I have my scheduler, you know, not -- just simply not permit the issuing of more than n plus k the next cycle, where I can choose k so that I don't actually ever have, you know, a ramp up -- >>Vijay Janapa Reddi: I see what you're saying. So completely just do it based on the issuing. >>: Be more proactive so that I'm just going to make sure that I never ramp up or ramp down more than [inaudible]. >>Vijay Janapa Reddi: Yeah, I think -- I'm not so sure, but I think there was some -- well, okay, it wasn't for the dI/dt, but that's something I have not actually looked at and I haven't really seen. But maybe, yeah, if you don't rely on any of the detection mechanisms and you just look at the issue rate -- >>: That still might be pretty tricky to implement because you still have to choose the k; you have to still do it in a way that you can guarantee that, you know, the current fluctuations don't change by [inaudible]. >>Vijay Janapa Reddi: I think that would also depend on sort of what functional blocks you are activating during that set of instructions that you issued. That I think would make it hard. But the point I was trying to build from this slide is that it's a mix of program and microarchitectural activity in the past that actually leads to whether, you know, you're going to get an emergency or not. And that actually builds the case as to, okay, it's not just random swings or something, there's something intelligent we can really do, because we can understand it at this level. And so I was saying, you know, if we tolerate the emergencies, you allow them to happen and you build these recovery mechanisms, then you can actually just use that. I mean, one idea is to just rely on that. And the problem with that is that you have to implement very fine-grained recovery mechanisms that are pretty intrusive to the microarchitecture. I mean, you have to -- and when I was at Intel and I was talking to these guys some time back, they were saying that making any changes to traditional microarchitecture structures in order to enable very fine-grained recovery is a big pain in the butt, because it takes a lot of validation and testing, and that's a real pain for them. So just to show you the kind of -- the recovery costs that you typically would need to implement, I'm showing you a heat map here of the amount of performance improvement you would get from using a very aggressive margin, right? So the Core 2 Duo has a 14 percent margin, and we're going to set that to a typical case, so whatever value we choose. And here ideally you'd want to be around 4 percent. And so as you make that margin tighter, what you're seeing is the color becomes more vibrant, and the blue color here means that you're getting a tremendous amount of clock frequency improvement because you're able to run it faster. 
And that, to some extent, would be able to translate into program performance. And if you want to get this piece here, well, you're sort of realizing that your recovery mechanisms have to be around 100 cycles or so, and that's pretty intrusive. I mean, there has been some work, even from within our own group, looking at this stuff, basically just purely designing the recovery mechanisms only. And Razor, for those of you who are familiar with the architecture side -- I mean, Razor is where this kind of notion all started off. And the problem is it is very intrusive, and you have to change -- you know, you have to identify your critical paths and all of that. From a performance standpoint, it's a great way of doing things. It's designing it and building it that's the problem. And so what I'm going to be talking about is how can we use some kind of coarser-grained recovery mechanism, which is sort of coming out because the ability to allow program execution to continue and then take some kind of checkpoint state is becoming very interesting for many uses. For instance, there's a lot of work out there saying we could do debugging this way in multi-threaded programs, for instance, right? You could track some history. So the notion is to basically leverage some of the existing recovery logic that people have been pushing out as this general purpose piece of hardware that you can use for multiple reasons, and then apply that to mitigate -- to get the performance improvement that we want. And that's ideally what I'd push for, because if we want to use the really -- if we want to get this very good amount of improvement, then we'd have to be around 10 cycle recovery costs, which is really -- I mean, that's almost even more fine-grained than branch misprediction costs. And that's what I'm going to show. So what I'm trying to show you here is that's the kind of recovery cost that you'd want right now, right? And that is for today's processors. But as you go into the future -- and I'll tell you how I project them, but as you go into the future, what you start seeing is that the potential you have here in terms of the recovery costs quickly starts diminishing. So you use Proc 25. This is typically projecting into, like, about the 22 nanometer space. And that starts dropping. And then as you go even further, into the 16 and 11 nanometer nodes, you start seeing that the recovery cost really has to be extremely fine-grained. And this is based on the assumption that, you know, the voltage will scale a little bit, which is what the ITRS is expecting, and the current density is going to increase. And the way I've been doing it -- the way I come to this conclusion is we take an existing chip and we measure the data, and we look at the recovery cost versus the margin tradeoff. In order to project into the future, where we expect the peak-to-peak voltage swing within the processor to increase, we artificially fake that out. Unfortunately, you can't really see it, but what we do is we reduce the number of capacitors, package capacitors, and that artificially causes the swing to increase, and that in effect allows us to study what's going to happen at the end of the day. This is just a vehicle. At the end of the day, this is what we're really interested in. 
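A toy break-even calculation for the trade-off behind that heat map: a tighter margin buys clock frequency, but every emergency now costs a rollback, so coarse recovery eats the gain. The rates and penalties below are made-up illustrations, not his measured data.

```python
# Toy break-even model for the margin-versus-recovery-cost trade-off discussed
# above; the rates and penalties are made-up illustrations, not measured data.
def net_speedup(freq_gain, emergencies_per_kilo_instr, rollback_cycles,
                cycles_per_instr=1.0):
    base_cycles = 1000 * cycles_per_instr
    recovery_cycles = emergencies_per_kilo_instr * rollback_cycles
    return (1 + freq_gain) * base_cycles / (base_cycles + recovery_cycles)

# 17.5% frequency headroom from a tighter margin, 1 emergency per 1000 instructions:
print(net_speedup(0.175, 1, 10))     # ~1.16x: fine-grained recovery keeps most of the gain
print(net_speedup(0.175, 1, 1000))   # ~0.59x: coarse checkpointing turns it into a net loss
```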
And this is showing that as you keep popping caps off and the swing increases -- because the average swing increases -- you end up with more recoveries going on inside the processor, and because of that increase in the frequency of emergencies, the recovery cost has to become smaller, because the penalty per emergency needs to decrease dramatically. And that's effectively what's happening. And so we really need to understand how do we design and mitigate the problem. And, of course, I wanted to show you the swing here, the magnitude of the swing as it increases when you start breaking capacitors off, but sorry about that. So by tolerating, what I'm saying is that we can understand -- we can understand what activity is causing the problem, and because we're not going to be able to build such extremely fine-grained recovery mechanisms, we want to be able to build lighter-weight schemes on top of it. And that's what I'm going to talk very briefly about: the predictor, which I call the principle of avoidance. The idea is you allow the emergency to happen inside the processor and you detect that it's happened, and you can detect it using the typical sensor I was talking about, which might take long -- a bit of time to actually detect it, but that's completely fine because we're not trying to proactively prevent it. And when you detect it, you initiate your checkpoint recovery, which rolls the execution back to some previously known state, right? And you know that state is correct. And you notify what I call the emergency predictor, and the predictor then basically takes a snapshot of some activity that was happening, because like I was saying, when you're doing checkpoint recovery, you're actually able to track what is going on inside the processor. You maintain some history. You track that, and then you can actually predict using that activity pattern. You don't really have to wait the next time around. So that allows us to basically prevent the emergencies. And so here I'm showing you what the predictor would actually track in hardware. So it's just some control flow I'm showing you here, but, of course, like I said, there might be other microarchitectural activity, and you track those patterns and you store them in your table. And these patterns actually become very important for the software piece, because unless you know what caused the problem, you wouldn't really know where to actually fix the problem. So that's why these signatures are very critical for the software piece. And so what I'm trying to show here is how far ahead these patterns allow you to predict and anticipate the problem. And that's what I call lead time. If you were to get an emergency here, you know, you ideally want some amount of lead time so you can take some preventive action. And I'm showing you that even as you go 16 cycles ahead of time, it's able to predict based on those patterns that we were able to record. Even with 16 cycles of lead time, you can actually predict them very accurately, up to 90 percent accuracy, which is huge. And if you were to just rely on that simple detection mechanism, then it's about -- sorry, yeah? >>: So how do you define accuracy? 
>>Vijay Janapa Reddi: So accuracy is -- the way we look at it is what fraction of -- okay, we record how many times the emergency happens, and each time we track -- we compare whether the predictor, with whatever history it looked at, did it also expect an emergency at that point. So basically the predictor -- you might get some kind of activity history pattern, and you look at that history pattern and you say, did I get this emergency when I saw that pattern? >>: So is this saying that there's 10 percent false negatives, 10 percent that you missed? >>Vijay Janapa Reddi: 10 percent that I missed. >>: But it doesn't say anything about how many extra ones you thought you were going to get but you didn't necessarily -- >>Vijay Janapa Reddi: How many extra ones that -- >>: So if I were to make a predictor that said every cycle is a voltage emergency, under this metric would I have 100 percent accuracy? >>Vijay Janapa Reddi: No, you still wouldn't. >>: Okay. >>Vijay Janapa Reddi: Because there's still some amount of error in the history that you record, because you record a history and -- well, if the basic block is huge, for instance, and you're only tracking control flow. That means -- sorry. >>: I think my question was more to your metric of accuracy as to -- I thought the thing that you said was that you're saying the percentage of emergencies which you predicted would be an emergency. >>Vijay Janapa Reddi: Yeah. >>: The question is do you consider false positives and false negatives? >>Vijay Janapa Reddi: Ah, false positives and false negatives. So actually that -- >>: [inaudible] >>Vijay Janapa Reddi: Yeah. So I did do -- so this is not based on the actual run where the predictor is actually active as well. And when we do that, actually there tends to be this behavior where if you -- as you're preventing some of the emergencies, some of the future nearby emergencies go away. So in fact it turns out that only one percent of the emergencies do we actually miss with the predictor. This is, of course, assuming that you've already learned the pattern. That's what I'm saying. So first you have to learn the pattern, right? That's a compulsory miss, just like a cache miss. And once you get that, once you've figured out the pattern that causes the problem, then you can use that to anticipate the future occurrences. And when we do that, we realize that 99 percent of the time we're able to prevent the problem. That is in the actual run, because we actually throttle. So one of the things -- obviously it's very much like a branch predictor. >>: [inaudible] >>Vijay Janapa Reddi: It's actually -- >>: [inaudible] >>Vijay Janapa Reddi: It's instruction-based, right, because I'll look at branches or -- it's not completely instruction-based. It's a bit of instruction and microarchitectural events, because I basically encode, oh, this is a branch, it's a taken or not taken branch. And then I will also encode whether you got an L1 miss or an L2 miss. >>: [inaudible] >>Vijay Janapa Reddi: So, actually, yeah, I was going to get to that in the next slide. Because it turns out that we just need to track a small fraction, so it's about seven events or so. Because those are the ones that really end up causing the major stalls. So because it's very much similar to a branch predictor-like structure, then what you see is, as the amount of information increases, then obviously you're tracking more history, and therefore your ability to actually predict the problem increases. 
And where you track the information, for instance -- I think this is what you're sort of asking, where you're tracking and what kind of events you're tracking, right? What I'm showing you here is -- the first three basically assume just the code path. You're only tracking the program path. And it shows that if you capture the commit side, which means you're only looking at the true program path, you're not looking at speculation, effectively you're ignoring branch mispredictions, then you only get 40 percent. But when you go to the decode side, you end up capturing some of the speculation that the processor is doing, so you're effectively capturing the misprediction implicitly, and then your accuracy improves. So branches, obviously, are a major thing. And as you throw in more events -- basically TLB misses, L1 and L2 effects -- then your accuracy improves dramatically. >>: [inaudible] >>Vijay Janapa Reddi: So the signature sizes, this is what I'm talking about here. So this is the number of entries, and I used 3-bit encodings for each of the entries. So like I was saying right at the beginning, if you use aggressive margins, you can either get power reduction or you can boost up the clock frequency. Here I'm just going to talk about it from a frequency perspective. So in our system, we had a 14 percent worst-case margin, just the way the Core 2 Duo processor was experiencing, and we chose a 4 percent aggressive setting, you know, whatever the average case design point seemed like, and that translates to a 17.5 percent increase in clock frequency you would be able to get by using an aggressive margin. And what we see is that if you compare against an oracle predictor scheme -- the oracle, because it still has to throttle once per emergency, so it knows exactly when the problem is going to come and it's going to pay a little bit of penalty -- you would see that compared to the 17.5 percent, it comes to about 15 percent, which is very close, but you're still paying some performance penalty. And if you take the signature predictor, which is the one we have come up with, it comes very close to the oracle. And, in fact, it does that even though it's relying on these very expensive training mechanisms, right? Because it has to actually incur a problem, take that problem, and then apply it at run time, so basically learn, pay the cost for learning, and after that it has to be able to predict very effectively to do this. But there's actually more room to push it even further. And that's where the software piece kind of comes in. And I have to talk a lot about the basics -- I really have to lay the foundation for how the software would actually be able to do these things, because you have to understand what is causing the voltage to swing, and the predictor is a very key piece of this entire scheme because it captures a record of the history patterns that are needed inside the compiler to actually change the code around. >>: Sorry. One more question about that. What was the [inaudible] again? Is that just the frequency increase you can achieve or is it actual -- >>Vijay Janapa Reddi: This is actually the speedup once you're actually running the program. >>: Okay. >>Vijay Janapa Reddi: So from tolerating, we have now figured out what is causing the problem, and now you're able to predict it accurately. 
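A minimal sketch of the signature predictor as described: keep a short window of interleaved control-flow and microarchitectural events (he mentions roughly seven event types and 3-bit encodings per entry), and remember which windows were in flight when a detected emergency forced a rollback. The window length, event names, and table representation below are my own assumptions, not the exact parameters from the work.

```python
# Minimal sketch of the emergency signature predictor described in the talk:
# a sliding window of recent events, each packed into 3 bits, is the signature;
# signatures that preceded a detected emergency are remembered and matched
# against on later occurrences. Structure details here are assumptions.
from collections import deque

EVENT_CODE = {"br_taken": 0, "br_not_taken": 1, "br_mispredict": 2,
              "l1_miss": 3, "l2_miss": 4, "tlb_miss": 5, "long_stall": 6}

class EmergencyPredictor:
    def __init__(self, history_len=16):
        self.history = deque(maxlen=history_len)   # recent event window
        self.bad_signatures = set()                # histories that led to emergencies

    def _signature(self):
        sig = 0
        for code in self.history:                  # pack 3 bits per event
            sig = (sig << 3) | code
        return sig

    def observe(self, event):
        """Called as events occur; returns True if the current history matches
        a pattern that previously ended in a voltage emergency (time to throttle)."""
        self.history.append(EVENT_CODE[event])
        return self._signature() in self.bad_signatures

    def record_emergency(self):
        """Called by the checkpoint-recovery machinery after a detected emergency,
        so the same pattern can be anticipated the next time around."""
        self.bad_signatures.add(self._signature())
```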
And the prediction, although it seems like a hardware thing, is actually very important for the software, because it is a representation of the quality of the information you're giving the software, because the software is going to basically operate off that. And so the software, once it gets the profile information from these hardware structures, would actually be able to change the code, and that's what I'm going to talk about in a single core system. And then when you go to the multicore system, you can actually schedule threads intelligently. You can actually get rid of the problem altogether, so basically you're never going to get a voltage emergency, in effect. And so when you actually look at -- what we did was we took the programs and we observed the number -- the locations where this problem actually happens in terms of program structure. What we find is that there are very few places in the code where you get these emergencies. They tend to be hot spots. And that's what I'm showing you here. Across the three different benchmark suites, what I'm showing you is the dynamic emergencies, the number of emergencies you actually incur during program execution, and the static locations -- if you were to relate it back to the control flow graph of the program, you know, how many places you actually end up with an emergency. And it's a very small fraction. It's basically just a few hundred instructions or instruction hot spots that cause the entire problem. So it actually motivates everything for the software, because if this problem were sort of scattered all across the code, then it's very hard for the software to really be able to do anything. The software schemes work really well when there's a repeating pattern, because changing something at one location specifically for a given problem is much more effective. So here, just to really explain, we looked at some of the control flow graph stuff just to sort of show how the interaction between the program control flow and the events the processor is experiencing can actually make a problem. And I'm showing you benchmark C. So what we see is the code -- you get an emergency here in this section of the control flow graph, and you get this because what happens is the processor ends up stalling briefly at a certain point, and it stalls when we get a divide instruction, because this divide instruction is a long latency operation, and effectively until its computation is processed, the rest of the instructions cannot actually be dispatched. And this is in a loop. And so what we see is that you get a stall here. That's what we observe. Actually, when we look at the set of processor events we're observing here, we see that the long latency thing, the divide, happened. When you get that divide, the issue rate of the machine suddenly comes to a complete standstill, and that's because there's a dependency on that instruction -- the machine cannot go forward until that computation is finished -- and so the current drops dramatically. But when the computation finishes, you see a sudden increase in the issue rate, as expected, because now suddenly all the instructions are being executed, the dependent instructions, and then you see a -- that's why the current increases. Because the current increase is steep and sudden, you see that the voltage actually drops suddenly. 
And what we effectively end up doing is that we can actually change the way the code is structured and sort of dilate that sudden burst in activity and then prevent the problem. So it's basically because of that sudden issue rate that you're getting that voltage drop, right? And that's happening because there's a lot of ILP, and the way I'm showing that is, let's say your machine can typically issue two instructions a cycle. It's doing great. And each cycle it's able to dispatch that number of instructions. Then, naturally, there's a huge issue rate, and the current is going to be consistently high. And the colors here represent instructions that are effectively dependent. That means they cannot proceed until the previous instruction completes. What we want to do is we basically want to smooth the issue rate out a little bit, and by doing that, you know, you're getting that averaging effect, and that prevents the problem. And so we effectively want to cripple what the machine does at a given point, and you want to do this intelligently because you just cannot do it all over the place. Otherwise you'd lose performance dramatically. And by chaining up dependent instructions together you can actually end up slowing the machine down just enough at the right hot spot. And we actually tried doing this in the static compiler, and the problem is when you chain something at one place, the problem ends up creeping up at another place. So you effectively have to kind of keep fixing it on the fly. That's what we actually end up doing. It's a dynamic feedback system. So what I'm trying to show here is if you have one instruction, for instance a move instruction, and you could move that instruction down and increase the read-after-write dependence chain at the right place -- well, the compiler has to move instructions while making sure you're still preserving all the semantics correctly. If you can do that, then what you see is -- so this is the before, the one we had a problem in, and this is the after. What we see is that we flatten out the issue rate a little bit, so we crippled the ILP of the machine, and that way you decrease the amount of current draw just a little bit, effectively, and that prevents the voltage drop. And it's a bit tricky to do this in an out-of-order machine because you have to worry about the instruction scheduling window, because it's pretty big. So there are a lot of nifty tricks that we had to pull on it.

>>: [inaudible] compiler is specialized for every speed grade -- every stepping of every processor?

>>Vijay Janapa Reddi: I'm sorry?

>>: It seems that the compiler has to be done differently for every speed grade of every stepping of every processor.

>>Vijay Janapa Reddi: I think it sort of -- what do you mean by a speed grade?

>>: You buy chips at, say, either 2 gigahertz or 5 gigahertz, you pay more money for them and [inaudible].

>>Vijay Janapa Reddi: Right. So I think -- we haven't actually seen it -- like, okay, we just took one chip on which we were trying this out, and I think the point we were able to show was that, look, we would actually be able to come up with an algorithm that makes sense. And so I guess I cannot strongly make the claim that, oh, this would just work out of the box, because we're still looking at how the optimization has to be built so that it's going to work completely independently of the platform itself.
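To show the intuition behind the rescheduling, here is a toy model, under heavy simplifying assumptions, of in-order issue around a long-latency divide. The instruction names, latencies and issue width are all made up; the point is only that when the consumers of the divide sit together they all issue at full width the moment the divide completes (a sharp current step), whereas rewriting them into a read-after-write chain turns that step into a one-per-cycle ramp.

```python
def issue_trace(schedule, deps, latency, width=2, max_cycles=64):
    """Toy in-order issue model of a `width`-wide machine. `deps[i]` is the set
    of instructions i must wait for; `latency[i]` is how long i takes (1 by
    default). Returns instructions issued per cycle -- a crude proxy for
    current draw, so a sharp jump after a stall stands in for the dI/dt event."""
    finish, trace, idx = {}, [], 0
    for cycle in range(max_cycles):
        issued = 0
        while issued < width and idx < len(schedule):
            insn = schedule[idx]
            if any(finish.get(d, max_cycles) > cycle for d in deps.get(insn, ())):
                break  # in-order: a stalled instruction blocks everything behind it
            finish[insn] = cycle + latency.get(insn, 1)
            issued += 1
            idx += 1
        trace.append(issued)
        if idx == len(schedule):
            break
    return trace

latency = {"div": 10}

# Before: all four consumers wait on the divide, then burst out at full width.
before = issue_trace(
    ["div", "a", "b", "c", "x1", "x2", "x3", "x4"],
    {"x1": {"div"}, "x2": {"div"}, "x3": {"div"}, "x4": {"div"}},
    latency)   # ... 0, 0, 0, 2, 2 -- a sudden jump in issue rate

# After: the consumers are rewritten into a read-after-write chain, so the
# post-stall issue rate ramps one instruction per cycle instead of spiking.
after = issue_trace(
    ["div", "a", "b", "c", "x1", "x2", "x3", "x4"],
    {"x1": {"div"}, "x2": {"x1"}, "x3": {"x2"}, "x4": {"x3"}},
    latency)   # ... 0, 0, 0, 1, 1, 1, 1 -- a small performance cost, no spike
```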
What we found basically was that if you can construct the chain so that it matches the issue width of the machine, then it typically works very well. So it's a very simple heuristic that we found, but it seems to have a very tight correlation with how effective the scheme is. The only problem is you have to worry about how big the instruction window of the machine is. If it's really big, trying to create read-after-write dependence chains is extremely hard because you have to find those chains all over the place. Another simple way is to just throw in no-op instructions or something to dilate the time. And so what I'm trying to show here is that if you reschedule the code intelligently, then you can reduce a large number of emergencies. And I'm showing you the fraction of emergencies that remain after just one optimization, after just one pass that we had made. And if you actually try to dilate the time, try to lengthen things out by injecting instructions into the stream just arbitrarily -- just because you have a hot spot at some point and the compiler says, oh, based on the emergency signature pattern, it looks like the hot spot is here and I need to fix that, and you just start throwing in instructions to basically cripple the ILP -- then that becomes a problem, because that can actually introduce unnecessary stalls. And that's where we find that we lose performance, actually. So naturally, if you introduce no-op instructions, you're going to lose time because you're doing a lot of unnecessary work, versus if you actually reschedule the code intelligently, then you can actually get better performance at the end of the day. But the point I'm trying to make here is, when we reschedule, we're not getting performance improvements because we changed the code a lot; it's just that you can actually come up with optimizations that are not completely performance driven. You're just specifically optimizing the code for this problem, to get rid of the problem.

>>: What kind of no-ops did you guys inject? Because for [inaudible].

>>Vijay Janapa Reddi: Yeah, so you actually construct pseudo no-ops. You look at the code and then basically force it to actually do some computation that's actually redundant. Yeah. So we do pseudo no-ops. It's not really no-ops because, you're right, a true no-op just gets killed off at dispatch, at the decode stage.

>>: So when you specialize the code, you're running more software, doesn't that cause additional voltage --

>>Vijay Janapa Reddi: Yes. So we actually looked at the compiler itself. The compiler itself does end up getting emergencies when it's actually doing the thing. So the way we handle it is, because you have a tolerance mechanism at the end of the day, we make this huge assumption that in the baseline there is always a corrective mechanism in the hardware. And so whenever we get that fault, we just take the fault and we observe the penalty loss inside the compiler. And the thing is that if your compiler is good, if you can really optimize it correctly, then the number of times you would have to call the compiler would be very few. That's the assumption. If the optimizations don't work and you keep calling the compiler, then it becomes redundant.
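As a loose sketch of the pseudo no-op idea, something like the following could pad a flagged hot spot with redundant-but-real instructions that the front end cannot simply discard. The instruction strings, the scratch register, and the default of one filler per issue slot are illustrative assumptions, not the compiler's actual transformation.

```python
def inject_pseudo_nops(block, hotspot_index, issue_width=2, scratch_reg="r_tmp"):
    """Pad the code just before a flagged hot spot with 'pseudo no-ops': real,
    mutually dependent operations on a scratch register (assumed dead here),
    so the decoder cannot eliminate them the way it would true no-ops.
    Repeating the same register creates a short read-after-write chain, which
    loosely follows the chain-length heuristic mentioned in the talk."""
    fillers = [f"add {scratch_reg}, {scratch_reg}, 0"] * issue_width
    return block[:hotspot_index] + fillers + block[hotspot_index:]

# Hypothetical basic block; index 2 is where the signature says the burst starts.
padded = inject_pseudo_nops(["div r1, r2, r3", "mov r4, r1", "add r5, r4, r6"], 2)
```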
And so what we actually end up doing -- the way the entire system works is you propagate stuff up to the software level, but if the software cannot fix it, then that emergency signature that the software relies on is fed back into the predictor mechanism in the hardware, because that is much cheaper for maintaining performance than relying completely on tolerating the emergency. Yes, of course. Then, now, as we go into the multicore systems -- so all of that was just on the single core. If you can optimize and localize, you're basically performing very localized optimizations to get rid of the problem. But what happens in the multicore case is that you have interactions from another core on the chip that can cause problems on your core, because typically in a multicore system a certain number of cores -- for instance, two or four or six cores -- can actually be tethered to the same power supply. So what that means is that if one core is misbehaving, then you end up penalizing every other core, because the droop will propagate. And the IBM guys have actually done this -- the way they do it is pretty crazy, because they will actually create a droop at one core and observe it traveling across, and they'll say that, okay, after x number of nanoseconds you will see it -- and if the second core happens to fire up right at that time, then you get a much bigger droop. So they do these very detailed studies trying to deal with it in a multicore system. And as we're starting to expand into many, many cores, this droop issue becomes more significant, because if at any time all the cores happen to fire up and draw a lot of current, then the droop is going to be much bigger. So that means your guard bands have to get substantially larger. You end up compromising the overall efficiency of the system itself. And so in order to mitigate that, what I was looking at was how the noise within the program actually changes over time. And I'm showing you three benchmarks from CPU2006. And what we found is that if you were to [inaudible] the emergencies that you would typically have per thousand clock cycles, using a very aggressive margin -- just purely for characterization purposes -- then you see that in some benchmarks the noise is very flat. Whereas in some cases you end up with some kind of phase, a minor voltage noise swing. In some extreme cases you get a very large amount of variance, and that variance basically happens because of the microarchitectural stalls. We've actually been able to correlate all of this noise down to the microarchitectural activity. And the question is how can we actually use that intelligently. You can actually pair multiple threads with very different behavior together and you'll get a large amount of variance. So in order to prove that, what I'm showing you here is I take one benchmark that's here and combine it with every other benchmark on the x axis, and then I show the distribution that we see in terms of how many voltage emergencies we get. And that's what the box plot is effectively showing you. And the blue point is a pretty good comparison point because it's basically saying what happens if I just run a program with itself. It gives us some kind of comparison point. And what we see is that there's a huge amount of variance. In most cases it's almost a hundred emergencies of difference, right?
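As a small illustration of that characterization step, a sketch along these lines could turn a thread's emergency timestamps into the per-thousand-cycle profile the box plots are built from, and summarize how spiky it is. The function names, the window size, and the use of variance as the "spikiness" summary are assumptions made here.

```python
import statistics

def emergencies_per_kcycle(emergency_cycles, total_cycles, window=1000):
    """Bin a thread's emergency timestamps (in cycles) into fixed windows and
    return the per-window counts -- the 'emergencies per thousand cycles'
    profile used to characterize how noisy each workload is."""
    bins = [0] * (total_cycles // window + 1)
    for c in emergency_cycles:
        bins[c // window] += 1
    return bins

def noise_variance(profile):
    # Flat profiles (low variance) pair safely with almost anything; spiky
    # profiles are the ones worth separating across supply domains.
    return statistics.pvariance(profile)
```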
And this is per thousand clock cycles. That's just because we're using a very aggressive setting to try and determine what the noise characteristics are. But there's a huge amount of variance. And so what you can actually do is schedule threads, using the OS or hypervisor, across the cores intelligently.

>>: [inaudible]

>>Vijay Janapa Reddi: I'm sorry?

>>: If you've got a message-passing base where each core is operating independently, how do you do coordination between them?

>>Vijay Janapa Reddi: So I haven't actually looked at it when you have that kind of mechanism, but maybe at that point it would open up the door to, well, can you actually look at the interactions between the threads. Because that's very different from the case we're looking at here, where the programs are completely independent. But in that case there might be interactions. And one of the things I was doing on the chip was to actually see what happens if you have a producer-consumer relationship and you release something and then there's a sudden burst on the second core. And we find that it's pretty -- I've had a tough time trying to really be able to take a picture of it, for instance, on a scope, because it's really a whole bunch of factors that end up starting up. It's not just a few instructions. But I think that might be something we would be able to do, just looking at the traffic that's coming in and out of threads. Especially if each one is just bound to a single core.

So there's been a lot of work in the space in terms of scheduling threads, right? Typically we've been looking at scheduling threads in the operating system community for voltage noise -- sorry, for performance reasons. And typically what you're trying to do is minimize the bottleneck you have at the shared resource, the cache. And so that gets rid of one piece of it -- you know, by optimizing for cache sharing, you're in effect removing the cache stalls in a way. And if you remove the bottleneck cache stalls, you would expect that, okay, some amount of the noise has to go down, because you just got rid of some of the stalls. But it turns out that that alone is not enough, because you do have to account for other kinds of events that are happening inside the processor. Like I said, there are branches, there's the TLB, and there's a lot of interaction, which I haven't shown here in the slides, but all those interactions create different levels of noise swings. And to prove that point, what I'm showing you here is -- the x axis is IPC. So let's say I schedule intelligently for IPC only, based on hardware metrics, which is a pretty well-established literature space out there. I'm showing you that if you schedule for IPC, you're going to get better performance relative to some baseline here. We'll just speculate in my case. But the problem is you end up with more droops, because the system is completely agnostic of what is going on underneath the schedule itself. And once I actually tack on the recovery costs, the performance actually decreases dramatically. You can even go into the negative space. But if you were to intelligently schedule for droops alone -- you're trying to say, like, I just want to pair threads based on some activity history, I want to pair threads that are actually causing a low number of recoveries -- then you can see that you end up with a little bit of a performance win, but that's really not the point.
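To illustrate what "scheduling for droops rather than for IPC" could look like mechanically, here is a toy co-scheduler sketch under heavy assumptions: the per-pair IPC and droop counts are assumed to come from prior profiling, the recovery penalty is a made-up constant, and the greedy pairing strategy is an illustration, not the scheduling policy from the talk.

```python
from itertools import combinations

def effective_throughput(ipc, droops, recovery_penalty, window=1000):
    """Discount a pairing's combined IPC by the rollback cost of its droops
    over a fixed window of cycles. All numbers are illustrative."""
    useful_cycles = max(window - droops * recovery_penalty, 0)
    return ipc * useful_cycles / window

def droop_aware_pairing(threads, pair_ipc, pair_droops, recovery_penalty=100):
    """Greedy sketch: repeatedly bind the pair of threads with the best
    recovery-adjusted throughput to one shared power-supply domain.
    pair_ipc and pair_droops are dicts keyed by frozenset({a, b})."""
    remaining, schedule = set(threads), []
    while len(remaining) > 1:
        best = max(combinations(sorted(remaining), 2),
                   key=lambda p: effective_throughput(pair_ipc[frozenset(p)],
                                                      pair_droops[frozenset(p)],
                                                      recovery_penalty))
        schedule.append(best)
        remaining -= set(best)
    if remaining:
        schedule.append((remaining.pop(),))  # odd thread out runs alone
    return schedule
```

The point of discounting by recovery cost is exactly the effect described above: an IPC-optimal pairing can look worse once the rollback penalties of its droops are charged against it.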
That's just an artifact. But the biggest point is that the number of droops you see relative to your baseline decreases, and that's a good thing, because that means you now have fewer recoveries, effectively. And so it makes a very strong case that scheduling for droops is very different from scheduling for performance. And I'm showing you what happens when you have just random schedules -- if you're not scheduling for performance or anything, you're just pairing threads together to get the job done, which is pretty much what the Linux scheduler, if you look at it, really turns out to be doing. It's not truly random, but the effect is effectively the same as random. And so the point I'm trying to make here is that when you're looking at multiple threads or multiple cores, there's a large amount of room for optimization by looking at this at the software level, because trying to build mechanisms at the hardware level in CMP systems would be extremely hard for this. And I think the way to think about the software piece is -- being a software guy, the last thing I ever want to hear is that, oh, the hardware is going to be faulty, so you need to make sure your code is always going to be correct, right? And you have to do all this optimization. I think that would just never, ever work. But the way I think about it is, look, we're going to design the hardware as intelligently as we can to try to mitigate some of these problems, because this is really going to start eating us in terms of energy efficiency and so on. Then the software guys could really think about it as, look, here's another opportunity for performance optimization, because cache misses and branch mispredictions, for instance, are just some specific things that we optimize for. And the case that I'm really trying to construct here is that as this becomes a big problem and you get the right substrate at the hardware level, you push it as an optimization problem to the software level. So by just getting rid of the recovery cost, you get better performance naturally. And so just to summarize the things that I covered in the talk: I mentioned that margins are increasingly causing frequency loss and power efficiency loss, and increasing the cost of the chip, and I mentioned that by tolerating you can actually intelligently avoid the thing. And for all of that, you have to look at how the voltage is fluctuating within the processor. And once you learn to avoid -- avoidance is important because you're tracking the right information there -- then you can actually use software to completely get rid of the problem itself. And so we have come up with this three-tiered architecture, basically -- tolerance, avoidance and elimination -- that I really think is applicable to this particular problem that people haven't completely understood yet. And if you remember the first slide that I started off with, there's process variation, there's thermal variation and there's voltage variation. We looked at voltage variation as happening at a very fine-grained time scale, but we might be able to generalize this to the broader question of how you handle variations as a whole. And I think the architecture we've built up actually works pretty well for this. And I think this notion of tolerating errors is something that we are going to be seeing in the future.
The architecture will be tolerating the errors, and that's just how it's going to go -- in effect, you're going to be speculating all the time on correctness. I mean, we already do that with branch mispredictions or, like, memory speculation. And this is just going to be another notion that's going to be coming out. Because, like I said, on Friday I'm actually going to be talking to the guys who put the chip together, and they do have a prototype chip now that can tolerate, that does this and is actually running programs to completion, basically. So that would be my talk. Any questions?

[applause]

>>: [inaudible]

>>: Do you have any sense of -- sort of following up on this earlier question -- how much the compiler [inaudible] need to be aware of the variation across chips that you mentioned very early in the talk, especially with regard to temperature and just [inaudible] process variations?

>>Vijay Janapa Reddi: I think it depends on what your workload characteristics are. So let's say we talk about process variations, right? Then one core will obviously be running at a much lower frequency, and in a multithreaded workload you don't necessarily want to put the critical section on that core, for instance. There would be a lot of benefit from moving it onto the core with less variation, because you're going to be able to operate at higher clock rates and get through it. And if you're just purely looking at crunching through non-critical sections, then you might want to schedule the threads intelligently that way.

>>: It's kind of crazier, I mean, because you have core-to-core process variations such that if you scheduled A here and B here, you get emergencies. If you scheduled them the other way around you might not actually get the emergencies, because of the [inaudible] variations and how the activities match up with that [inaudible].

>>Aaron Smith: All right. Any more questions?

>>: Do these signatures, like, for one particular architecture, depend a lot on the workload or --

>>Vijay Janapa Reddi: They actually depend very much on the workload and the architecture. So I didn't show it here, but this was based on about 55 different programs -- the experimental set I'm talking about is 55 very unique programs -- and then we looked at three very different architectures in terms of the aggressiveness of the configuration, and the patterns you record -- the information we're recording is correct, but the orderings are obviously completely different. So it has to be a very dynamic thing. It's not something where you would say, like, okay, I'm just going to capture this record of activity and then just do some static compiler optimization. We tried that. That was ideally what we wanted to be able to solve.

>>Aaron Smith: All right. Should we thank our speaker one more time?

[applause]