>>Aaron Smith: So it's my great pleasure to introduce Vijay Reddi. Vijay recently
graduated from Harvard University where he was co-advised by Mike Smith and David
Brooks, and he will be joining A&D Research Labs soon. I'm not sure when.
So today he's going to talk about software-assisted hardware reliability.
>>Vijay Janapa Reddi: Thanks, Aaron.
So some of the work that I'm going to be talking about is work that I finished up as part
of my thesis, and this is just one avenue of work that I've looked at. I'm not going to go
into too much depth on any one thing; I'm going to talk about processor reliability issues
and one particular problem in that space.
But what I'm going to try to do is not get nailed down in one specific paper. I'm going to
try to build things up from where the design variability issues stand -- building up the
architectural piece and showing how software actually becomes an integral piece in the
long run.
So it's really trying to put a vision together. A lot of the papers are out there, but this is
trying to pitch the overall direction I've been pushing forward.
So just to give you a brief introduction of the kind of work that I've been doing: I would
consider myself a software guy for the most part. What I've done is work in dynamic
binary translators. For those of you in the architecture community, you would know
about PIN, which is one of the things I worked on quite a bit all through my master's.
And then I was basically looking at how you can use software intelligently to mitigate
some of the hardware problems.
So when I went over to Harvard, I started looking at architecture problems, and from
architecture there were a couple of circuit-level issues coming up around reliability. I
started fusing the three things together in my head, and I felt like there was a new
avenue for research emerging, which is what we'll talk about today.
So the crux of the work is going to be about variations. So typically what we're seeing is
that transistor sizes are shrinking. Now, that's a great thing because now for the same
area you're able to get a lot more compute density. But the problem is you make these
transistor sizes so much smaller, they're becoming very susceptible to variations.
Now, variations can be either static or dynamic. On the static side it means that one
transistor has inherently different properties from its neighboring transistor. That
basically comes out of process issues: fabrication- and manufacturing-time challenges.
On the dynamic side -- dynamic meaning while the processor is actually running -- on
the static side, this is what you have: a chip with some kind of variability built into it.
On the dynamic side, the activity of the program leads to different hot spots on the chip,
as on this heat map, or you can end up with voltage fluctuations within the chip. And
that is actually going to become a major problem in the future. Most of the work that's
out there right now looks at the static side, the process variation side of things, and
there's been a fair amount of work on the dynamic temperature side of things, but
voltage variation is still not very well understood in this space yet, and that's really what
I'm going to peel off into. So that's going to be the focus.
I'm going to start off real basic because this isn't really well known out there. I'm going
to talk about why voltage within the microprocessor fluctuates in ways we don't want it
to fluctuate, why that's going to become a problem, what the prior work in the space is
and why some of those existing approaches are not going to scale into the future, what
we need to do about it, and how, at the end of the day, software becomes a very critical
piece. In fact, I'm going to pitch this entire architectural reliability problem as a software
optimization problem to the compiler space, and argue why software will actually be
able to mitigate it much more effectively than the hardware is going to be able to.
So voltage noise. That's typically what we call this problem: voltage noise or voltage
fluctuation. And one of the points I want to emphasize is that these variations end up
translating to some kind of delay in the circuit, and that delay is going to impact the
performance of your chip.
And voltage especially is a big problem because it impacts the frequency and the power
efficiency of the cores, and those are things we will peel into in the coming couple of
slides.
But why does voltage within the microprocessor actually fluctuate? Well, to understand
that, what I'm showing you here is the current trace for GCC running in our simulator,
and we're seeing that the current fluctuates inside the processor because the workload
is doing different things over the several hundred cycles we're looking at here.
Now, the voltage, in response, is fluctuating as well. Typically you want the voltage to
be sitting right around the nominal voltage, but it's not, and that's because of the
parasitics in the power delivery network going from your power supply, like your battery
source, all the way down to the actual execution units. The parasitics are what cause
the voltage to fluctuate.
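To make the parasitics story concrete, here is a minimal sketch of a power delivery network modeled as a series resistance and inductance from the supply feeding a decoupling capacitor at the die. A step in load current makes the die voltage ring around nominal. The R, L, and C values are illustrative placeholders I picked so the resonance lands near 100 megahertz; they are not numbers from the talk.

```python
# Minimal power delivery sketch: supply feeds the die through series R and L,
# with decoupling capacitance C at the die. A step in load current (e.g. clock
# gating waking units up) makes the die voltage ring around nominal.
R, L, C = 1e-3, 1e-11, 2.5e-7      # ohms, henries, farads (illustrative)
VDD, dt = 1.0, 1e-11               # 1 V nominal, 0.01 ns time step
i_base, i_burst = 10.0, 20.0       # amps

v = VDD - i_base * R               # start in steady state at the baseline load
i_L = i_base
trace = []
for n in range(20000):             # simulate 200 ns
    i_load = i_base if n * dt < 5e-8 else i_burst
    i_L += dt * (VDD - i_L * R - v) / L   # inductor: L di/dt = VDD - i*R - v
    v += dt * (i_L - i_load) / C          # die cap:  C dv/dt = i_L - i_load
    trace.append(v)

# Resonance sits near 1/(2*pi*sqrt(L*C)) ~ 100 MHz; droop ~ dI*sqrt(L/C) ~ 6%.
print("min %.3f V, max %.3f V" % (min(trace), max(trace)))
```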
Now, in order to make sure the processor is always going to work correctly, we have to
understand how much of a swing we're going to dynamically see here. While parasitics
are one reason the voltage fluctuates, today's microprocessors also use clock gating,
for instance, as a way of saving power, dynamically shutting off parts of the processor
based on usage.
From a power perspective that's great, but, again, it's going to cause the voltage to
swing, because whenever you see a sudden shift in current, the parasitics effectively
cause the voltage to swing much harder.
So, again: you want the voltage to be sitting around the nominal, but it's not, because
of the parasitics. And you need to understand how much of a swing you see in order to
make sure the circuit works correctly, because if the voltage falls below a certain
operating margin, a certain threshold value, you end up with a slow circuit, slower than
what you designed it for or are expecting it to operate at. At that point you can end up
with timing violations in the circuit, and then you end up with incorrect results.
If it goes up too high, then you can potentially impact the characteristics of the
transistor and end up damaging the transistor itself, so you have a lifetime issue.
Throughout the rest of the talk I'm going to focus on voltage emergencies -- I'm going
to call these points where you can potentially run into incorrect execution voltage
emergencies -- and I'm only talking about the correctness issue. I'm not going to talk
about the lifetime issue, because correctness is fundamental to operation, and it has
grave implications for the frequency and power efficiency of the processor.
So the way we typically handle this today, industry-wide, the practice is: you try to
determine the absolute worst case voltage swing you would see inside the processor
while it's actually running. And the only way to determine that is to actually write
viruses -- very microcoded kinds of viruses where you try to understand how to activate
the functional blocks in just the right way to get these rapid swings in current, which
translate into voltage swings. And that's what they do.
And here I'm showing you a snapshot of what the code I've written does to a processor;
you can see the voltage swings by a large amount. These are actually measured
results, and that's effectively what they do.
Based on that large swing, you determine how much of a guard band, how much
voltage swing, you can actually tolerate. You call these the upper and lower margins,
and then you determine where you're going to set your nominal.
And then when you run your real program, what you see is that the real program stays
well within those extremes. So you're guaranteed correctness, right? Margins get you
that correctness.
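The margining arithmetic itself is simple; here's a hedged sketch, assuming you have the worst-case voltage trace captured during a virus run (the trace values below are toy data):

```python
# Hedged sketch of the margining step: find the extremes of the virus-run
# trace and size the guard bands so every program stays inside them.
def guard_bands(virus_trace, nominal):
    """Return (lower, upper) margins as fractions of the nominal voltage."""
    lo, hi = min(virus_trace), max(virus_trace)
    return (nominal - lo) / nominal, (hi - nominal) / nominal

# Toy virus trace around a 1.0 V nominal:
trace = [1.00, 1.05, 0.92, 1.12, 0.86, 1.01]
lower, upper = guard_bands(trace, 1.0)
print("lower margin %.0f%%, upper margin %.0f%%" % (lower * 100, upper * 100))
# The clock is then set by the worst-case floor, nominal * (1 - lower), even
# though real programs rarely swing anywhere near it.
```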
But the problem is that the way you allocate these bands to guarantee correctness has
severe penalties. It's a static penalty you end up paying up front. For instance, if your
nominal voltage is 1 volt, the IBM Power6 designers, for instance, found that at any
given point their voltage might drop all the way down to 0.8 volts just based on the
activity inside the chip.
What that effectively means is that although you're powering your circuits with 1 volt,
the effective operating speed is set by 0.8 volts, because that determines the
frequency. So naturally you're losing some amount of peak operational speed in the
circuit, and that's what I'm studying here for the 45 nanometer design space, where
your nominal is set at 1 volt. What we're seeing is that this translates to a 24 percent
loss in clock frequency immediately.
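One way to sanity-check a number like that 24 percent is the alpha-power delay model, where fmax scales as (V - Vth)^alpha / V. The threshold voltage and alpha below are assumed values I chose for illustration, not parameters from the talk; with them the loss comes out around 22 percent, in the ballpark of the quoted figure.

```python
# Alpha-power delay model: fmax ~ (V - Vth)^alpha / V. Vth and alpha are
# assumed, picked only to illustrate the size of the effect.
def fmax(v, vth=0.35, alpha=1.3):
    return (v - vth) ** alpha / v

v_nominal, v_floor = 1.0, 0.8    # 1 V nominal, 0.8 V worst-case droop floor
loss = 1.0 - fmax(v_floor) / fmax(v_nominal)
print("frequency lost to the margin: %.0f%%" % (loss * 100))   # ~22%
```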
Now, as you go into future nodes, if I believe the ITRS projections, which say that you're
going to get some amount of supply voltage scaling but very minimal threshold voltage
scaling, that's going to give you slower circuits going forward. That means that your
peak operational frequency for a given margin -- for instance, the exact same margin
the Power6 designers were using -- starts dropping pretty quickly.
Now, we might not necessarily be focused completely on the peak frequency just in
terms of clock frequency improvements, but that's important because at the end of the
day, it sort of determines how much work you're going to get done over a period of time,
right? So the frequency is an important component here.
So the only way to maintain frequency from one generation to another, for instance, is
to enable tighter margins. But I just said that the whole reason you have these guard
bands or margins is to make sure your circuit is always operating correctly. And if you
enable tighter margins to sustain your frequency in future processor generations, you
end up with a very high chance of getting incorrect execution. And, really, you don't
have a choice; this is the direction we will have to go, because, for one big thing, the
margins also impact energy efficiency, the amount of power your processor is burning.
For instance, with the 20 percent margin I was talking about, imagine you could lower
your nominal down to a much closer, tighter guard band. If you can operate your circuit
at 0.83 volts and still maintain the same frequency, that translates to a 31 percent
reduction in your power. And that's huge, because this is a chip-wide effect. It's not
something being optimized for a specific workload; this is just your basic chip
operation. So margins have both frequency and energy efficiency implications.
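The 31 percent falls straight out of dynamic power scaling roughly as the square of the voltage at fixed frequency; the arithmetic:

```python
# Dynamic power goes roughly as C * V^2 * f, so at a fixed frequency the
# saving from dropping the supply from 1.0 V to 0.83 V is:
savings = 1.0 - (0.83 / 1.0) ** 2
print("power reduction: %.0f%%" % (savings * 100))   # ~31%
```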
Another thing it impacts is the cost of the chip itself, which is something we don't
typically think about. But when I was talking to the Intel guys, they were very
concerned about the amount of capacitance they would have to put on. Basically, in
order to tolerate voltage swings -- it's kind of hard to see here, but what they do is put
a lot of capacitance on to prevent these large swings from happening. It helps them
dampen the effect. And that's what I'm showing you here. This is the land side -- the
back of a processor package -- and there are a lot of capacitors here. As I'm saying,
the swings are going to be getting larger; basically, voltage noise becomes a bigger
problem in the future, so you're going to have to pack in a lot more capacitance, which
means you're going to have to invest more in the chip.
And if you don't have the package capacitance -- which is basically what I'm showing
you here: what I've done is peel the package caps off. You can literally pop them off
carefully; you've got to be careful you don't shock the chip. You can see that if the
package capacitance is reduced, the swing ends up increasing a decent amount. I will
actually be using that as a way of projecting into future nodes in a bit. But there's a
cost issue here as well.
So far I was talking about industry designing for that extreme worst case point, based
on running these power viruses that trigger behavior that almost never occurs in a real
program, and then operating your average case behavior around this extreme design
point. Instead, why not really figure out what the average behavior of the chip is and
then operate it accordingly? Don't set those worst case margins; set more aggressive
conditions.
And what I'm showing you here are chip measurement results of voltage samples. Zero
percent means the nominal voltage, and each sample is basically how much deviation
you see in the voltage above or below the nominal. And what I'm saying here is that in
the real production chip, the swing is only about 3 percent. So it's a very, very small
amount of voltage swing you see.
But instead we go out and put in these large guard bands that the processors very
rarely stress. In fact, it's a good thing that they do, because the measurement results
show that you do see these very large swings inside the chip, going as far as 12
percent. And on the Core 2 Duo we determined that the margin the Intel guys actually
put in was 14 percent. So it does make practical sense that they have these large
bands, because that's what's getting them that correctness at all times.
But for the most part, the swing is only 2 to 3 percent. So instead you'd want to typically
operate under these situations.
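A sketch of that characterization step, assuming you have a list of scope samples of the die voltage. The data below is synthetic, shaped to mimic the mostly-small-swings-plus-rare-deep-dip behavior described here:

```python
import numpy as np

# Characterization sketch: express each voltage sample as percent deviation
# from nominal and see where the mass of the distribution sits.
rng = np.random.default_rng(0)
nominal = 1.0
samples = nominal + rng.normal(0.0, 0.01, 100_000)   # mostly small swings
samples[rng.integers(0, samples.size, 5)] -= 0.12    # a few rare deep dips

dev = 100.0 * (samples - nominal) / nominal
print("within +/-3%%: %.2f%% of samples" % (100 * np.mean(np.abs(dev) <= 3)))
print("worst-case dip: %.1f%%" % dev.min())
```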
Yeah?
>>: So under all these situations, at least I guess from what you say on [inaudible],
there's sort of a constant computation that's going on.
>>Vijay Janapa Reddi: Yeah.
>>: If you have a more realistic operating system scenario, then you'd have bursts of
activity, which are then followed by bursts of inactivity, which sounds an awful lot more
like how I would think a power virus would be written. So do you have a sense that in a
real sort of operating system environment, the swing is actually a lot more than this?
>>Vijay Janapa Reddi: So these are actually measured results from the chip where
you're actually running it with the OS running and everything.
>>: But you have sort of a constant computation as opposed to something intermittent --
a computation like decoding a video frame that comes n times a second or something
like that.
>>Vijay Janapa Reddi: The time scales we're talking about are extremely small. I mean,
it seems that the kind of computation you're talking about is in the millisecond range.
But here the swings -- if the current happens to drop dramatically within a few cycles,
that's the kind of granularity we're talking about.
>>: So is it a few cycles? Is that the -- what's the time constant of your sort of worst
case power virus? What do you oscillate at to make it have the worst possible --
>>Vijay Janapa Reddi: Typically you'd want to oscillate the shifts in the current at
around 100 to 200 megahertz, because that's when basically -- you would see this in
almost every IBM paper, Intel paper. That's where the resonance characteristics sit.
That is the point where you get the worst case. And with the virus I had written, you
can actually figure out whether your code is oscillating at that point by constructing
FFTs, and then you'd be able to see it.
>>: [inaudible]
>>Vijay Janapa Reddi: Yeah. Okay. Yeah, sorry. Yeah, a power virus is quite different
from the dI/dt virus, but yeah.
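The spectral check he describes would look something like this: take an FFT of a current trace and see whether the dominant component sits in the 100 to 200 megahertz resonance band. The clock rate and the synthetic trace here are assumptions for illustration:

```python
import numpy as np

# Does a current trace have energy in the 100-200 MHz resonance band?
# Assumes one current sample per cycle at a 2 GHz clock; the trace is
# synthetic, square-waved at ~150 MHz the way a dI/dt virus would drive it.
f_clk = 2e9
cycles = np.arange(16384)
current = 10.0 + 5.0 * np.sign(np.sin(2 * np.pi * 150e6 * cycles / f_clk))

spectrum = np.abs(np.fft.rfft(current - current.mean()))
freqs = np.fft.rfftfreq(cycles.size, d=1.0 / f_clk)
peak = freqs[np.argmax(spectrum)]
print("dominant component: %.0f MHz" % (peak / 1e6))
print("inside the resonance band:", 100e6 <= peak <= 200e6)
```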
So what I'm trying to argue is that if you look at a lot of the production runs we're
seeing, over several thousands of runs, most of the samples are sitting within a 3 to 4
percent range. So you don't really want to optimize based on the very infrequent 12
percent case. You want to design for the typical case, the typical case point, because
then you can get a lot of frequency or power improvements.
And there's been a little bit of work in the research space looking at what we can do --
do we really have to design for the worst case? People have been saying maybe we
don't need those extreme margins. And I'm going to argue a little bit about some of the
challenges you're going to see if you try to implement that.
Basically the notion is that we want to design for the typical case, which means that
given some amount of voltage range, you enable a tighter margin; you run closer to the
edge. Then you run the risk of getting incorrect execution if the voltage dips below.
So the typical solution has been: we're going to build sensors, spread them across the
chip, and have some kind of detection mechanism that's basically sampling and trying
to figure out whether the voltage gets too close to a certain point. If it does, that means
there's a high chance the voltage is going to go beyond the correctness margin, and at
that point you want to throttle, basically slow the processor down a little bit. You can do
this at very fine-grain levels by controlling the issue rate of the machine or the way you
dispatch instructions. By doing that, you'll be able to prevent the voltage from actually
dipping below, and then you prevent the emergency. Because all it takes is one glitch
and that's it. If you get incorrect execution, you've got incorrect execution. It doesn't
matter whether it happens once or a million times. Correctness, right? That's what
we're going after.
And the way it's typically implemented is by monitoring the current voltage in the
processor. If you cross that particular voltage setting, then you invoke the throttling
mechanism, however you choose to implement it. There are multiple flavors of that.
And the biggest problem here is the delay, the time it takes to be able to detect
something and then actually respond to it. And I'll show you in the next slide that the
delay has to be extremely small.
The challenge with the delay is that you have to have all these sensors scattered
across the chip, because you could get a dip anywhere across the entire chip. That's
the problem, really: you have to detect all of them and then figure out whether you're
really getting the problem.
So where you set the threshold -- the way you figure out whether you're going to cross
the margin -- becomes very critical. Typically, if your average case design margin is,
say, 4 percent, very tight, because you're trying to get the efficiency, then you want to
set the threshold point where you're going to try to protect the processor as close as
possible to that extreme point.
Now, when you do that, one of the problems is that there might not be enough time to
detect and prevent the voltage from dropping below. And that's what we're seeing
here. If you assume that the red line is the typical case -- the way the voltage would
actually behave if you weren't trying to protect the processor -- and the blue line is the
one where you actually do take action to try to prevent the problem, then if there's not
enough time, the voltage is going to dip below and you're going to end up with
incorrect execution.
And it turns out that if you really want to be able to pull this off in this particular
configuration, for instance, the delay to be able to detect and respond has to be no
more than one cycle, because if you go beyond one cycle of detecting and response
time, then you end up with a large number of failures. And like I was saying, all it takes
is one mistake. That's the problem.
And so, therefore, designing it is extremely difficult, because a typical delay for sensors
tends to be around 20 cycles or so. Even in the best case it's got to be about 10 cycles,
and from our data that's impossible.
The natural thing to do is to push the trigger thresholds up. That way you have some
room to actually prevent the problem.
When you do that, you can very safely prevent the problem, but you can end up with
false positives when you try to do this in hardware, because there might be a case
where the voltage goes beyond the triggering point, you try to prevent the emergency,
and then you realize you didn't really have to.
It turns out that in this kind of configuration you end up with 80 percent false positives,
and the penalties are huge because you're throttling unnecessarily, slowing the
machine needlessly.
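A toy sketch of that trigger-threshold tradeoff: a sensor fires when the voltage crosses a trigger level, but throttling only takes effect some cycles later, so you can count failures (dips that hit the margin before the throttle lands), preventions, and false positives. All traces, levels, and delays here are made-up values, and the evaluation runs over the unthrottled trace, so it's only qualitative:

```python
# Toy evaluation of a trigger threshold with sensor delay, run over the
# unthrottled trace (a real throttle would reshape it). A firing is a miss if
# the dip lands inside the delay window, a prevention if it lands after the
# throttle would, and a false positive if no dip below margin was coming.
def evaluate(trace, trigger, margin, delay, horizon=20):
    misses = prevented = false_pos = 0
    for t, v in enumerate(trace):
        if v < trigger:                              # sensor fires
            if min(trace[t:t + delay + 1]) < margin:
                misses += 1                          # too late to act
            elif min(trace[t:t + horizon]) < margin:
                prevented += 1                       # throttle lands in time
            else:
                false_pos += 1                       # needless throttle
    return misses, prevented, false_pos

margin = 0.96                   # 4 percent margin around a 1.0 V nominal
trace = [1.00, 0.99, 0.975, 0.965, 0.95, 0.97, 0.985, 1.00]
for trigger, delay in [(0.965, 10), (0.98, 10), (0.98, 1)]:
    print("trigger %.3f delay %2d ->" % (trigger, delay),
          evaluate(trace, trigger, margin, delay))
```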
So those two have been pretty much the prominent ways of looking at these things.
Industry does the worst case design right now, but as we scale into the future, that's
just not going to work, because the swings are getting bigger. As you make the
transistors smaller, you're pulling a lot more current, and as the voltage decreases and
the current increases, your swing ends up becoming much larger, so you need a much
bigger guard band.
So that's not scaling. And the threshold techniques that people have been proposing,
well, the problem there is that the detection circuits are just not there. You're not
capable of doing it.
And what I'm going to pitch is that you really need to understand what is causing the
voltage in the processor to fluctuate, and that is very much dependent on how
aggressive the processor and the microarchitecture are, on the characteristics of the
power delivery network itself, and on how the program is actually behaving. Because if
the program happens to have a loop that does not actually resonate at that frequency,
then you're not going to get the emergency problem. So it very much depends on how
the program behaves as well.
And in this talk I'm going to take that combined approach. Given that we know those
interactions are what actually cause the problem, I'm going to go the route of
understanding, by looking at the voltage traces, how we can design a better solution.
And what we've found is that when you actually look at these voltage fluctuations
inside the processor, you realize there's a very strong recurring pattern. Once you end
up with the problem, we've found that the problem pattern tends to repeat over and
over.
And the reason these swings happen is sudden stalls inside the processor: when the
processor suddenly stops, there's a huge shift in the current, and that causes the
voltage to swing by a huge amount.
And the microarchitectural events tend to be a pretty good indicator of when you're
getting a stall.
And it's also the program, the code itself, how the code is actually running. And I'll
walk you through an example of that in the next few slides.
But just to put a context around --
Yeah?
>>: So this makes sense. I can understand this from a single [inaudible].
>>Vijay Janapa Reddi: Yeah.
>>: [inaudible]
>>Vijay Janapa Reddi: Yeah.
>>: [inaudible]
>>Vijay Janapa Reddi: Yeah. So I actually want to start getting into the software piece,
and that's where the hardware piece starts crumbling a bit. I start off by showing that
for a single core you can actually build an intelligent hardware structure that improves
on what people have been proposing, and you can actually implement it, and that will
solve the problem on one core. I haven't extended that to, like, two-core systems, you
know, the basic multicore system. But that's where the software becomes much more
critical, because trying to gather and simulate data from multiple cores at very short
time scales is just next to impossible, right? And trying to figure out the interactions is
even more difficult.
But with the software, because we have understood how events across the processor
cause the voltage to swing across multiple cores, you can actually schedule the code
accordingly. And by code I mean multiple threads.
>>: [inaudible]
>>Vijay Janapa Reddi: So the way we do it is actually we assume some feedback
mechanism.
>>: [inaudible]
>>Vijay Janapa Reddi: Yeah. When I get to that point I'll explain what kind of feedback
mechanism we need and what we actually end up using.
For most of the measurements, since I've talked about some of the data: the way we
do it is we tap into the VCC and VSS pins that you can actually get at on the back of the
motherboard, and we use a differential probe, because these are very high frequency
measurements. We capture that on a scope while the programs are running, doing
execution, and then analyze it offline.
>>: Does the chip have just one VCC and VSS or does it have one VCC on one core
and a different VCC on the other core?
>>Vijay Janapa Reddi: So the ones -- so the multicore system that I'm actually going to
be talking about is all connected to the same power, but the ones I haven't looked at --
>>: It's not [inaudible].
>>Vijay Janapa Reddi: No.
>>: How bad are the differences? If one end of the chip is 1 volt, how bad would the
other end of the chip be?
>>Vijay Janapa Reddi: It's not necessarily the same everywhere; it's not equivalent.
That's the thing. I think that's the point you're making.
And so what we're tapping into is, you know, a keyhole view of what is really going on
inside the processor at the point where we're actually measuring the voltage.
I'm not going to walk through the validation process. We have validated this setup and
made sure it's actually representative of what we expect to see -- that our
measurements, given some test cases running on the processor, actually make sense.
We did the validation on the Core 2 Duo, together with the folks at Intel who typically
design the motherboard voltage regulators.
So that's all just the measurement and characterization kind of basic setup, but this is
the way we actually look into solutions going forward.
In order to really understand what is going on inside the processor, we use a simulator,
an x86 simulator. We rely on Wattch to get the power model, the current consumption
-- Wattch basically tracks microarchitectural activity within the processor. That
generates current, and then we feed that into the power delivery model that we have,
and that power delivery model, cycle by cycle, spits out what the voltage is. And we
use that as our feedback mechanism into whatever higher-level solution we're
interested in looking at.
And this has been -- this is primarily on the single core setup. For the multicore I've just
done it all on the real system. And so when I get to that point, I will be talking about
that.
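Structurally, the loop he's describing looks something like the sketch below: a per-cycle current from an activity/power model (Wattch plays that role in the real setup) drives a power delivery model that emits a voltage every cycle, which the mechanism under study consumes as feedback. The classes, interfaces, and constants are mine, not the actual simulator code:

```python
# Sketch of the co-simulation loop: activity -> current (Wattch's role) ->
# power delivery model -> per-cycle voltage -> feedback to the next layer.
class PowerDeliveryModel:
    def __init__(self, R=1e-3, L=1e-11, C=2.5e-7, vdd=1.0, dt=1e-11):
        self.R, self.L, self.C, self.vdd, self.dt = R, L, C, vdd, dt
        self.i = 10.0                       # start at the baseline load
        self.v = vdd - self.i * R           # steady-state die voltage

    def step(self, i_load, substeps=50):
        """Advance one clock cycle (substeps * dt seconds); return voltage."""
        for _ in range(substeps):
            self.i += self.dt * (self.vdd - self.i * self.R - self.v) / self.L
            self.v += self.dt * (self.i - i_load) / self.C
        return self.v

def current_model(cycle):
    # Stand-in for Wattch: current tracks microarchitectural activity.
    return 20.0 if (cycle // 100) % 2 else 10.0    # bursty phases

pdn = PowerDeliveryModel()
for cycle in range(1000):
    v = pdn.step(current_model(cycle))
    if v < 0.96:          # 4 percent margin: hand the event to the next layer
        print("emergency at cycle %d, v = %.3f V" % (cycle, v))
        break
```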
So my solution to the whole thing is, you know, it's called tolerance, avoidance, and
elimination. It's the way I think we should be going about this problem. And this is just
on the voltage noise, but I believe we can expand it.
The idea is that instead of completely preventing the voltage from ever dropping below
the point, we allow the voltage to actually dip below the margin, right? We're allowing
incorrect execution, so basically it's an error-prone circuit, an error-prone chip.
And when we detect that the voltage has gone below, then we know that the
computation is incorrect, and then we rely on a recovery unit to roll the execution back.
And Intel actually -- actually, we were talking to them on Friday, and they do have a
prototype chip that actually does error recovery right now. So it's sort of coming
together, and they're actually very much interested in how you do avoidance
mechanisms as well because they're thinking about putting these things in because you
cannot always rely on allowing errors to happen. It's just too costly, basically, if you
allow them to happen.
>>: So if your chip voltage drops below and is no longer reliable, how can you be sure
the roll-back is done correctly?
>>Vijay Janapa Reddi: Because the roll-back pretty much assumes some amount of
delay by the time you detect it. And you're keeping snapshots so that the time when
you take the snapshot versus when you actually detect the delay is sort of -- you know
a priori when you're configuring it that way.
>>: But all that's done in the same sort of logic, I assume. So how do you know that
logic is functioning correctly?
>>Vijay Janapa Reddi: So that part is, like, timing protected. I think that's what you're
trying to get to.
>>: Okay.
>>Vijay Janapa Reddi: So you basically allow errors to happen, and when you get an
error, you roll the execution back. But allowing errors to happen is just the first step. In
the future what we ideally want to do is predict these things and say, okay, we want to
get rid of the problem altogether. We want to be able to anticipate that the voltage will
dip below. And so I'll show you that you can actually build an intelligent predictor
mechanism: by relying on tolerance, you understand what causes the problem, and
then you can build an intelligent predictor for that problem.
And then, from what we learn, you can push it up even further to the software level,
where the software can completely eliminate the problem altogether by restructuring
the code or scheduling the threads a bit differently on the microprocessor.
So I'm going to talk about tolerance first, because it allows us to understand what is
causing the problem. So for this I'm going to walk you through a pretty complicated
series of animations which will sort of explain what within the microprocessor is actually
causing the voltage to really swing hard and when you actually get a problem, because
by understanding that, then you can actually take preventive actions.
So I'm showing you the voltage trace for GCC over 800 cycles or so. And what we see
is that if we set a 4 percent margin, you can end up with voltage emergencies. At first it
looks pretty random, just random fluctuations, but if I splice this up you see that there
are recurring patterns in the noise -- A, B, C. Basically there are recurring phases of
voltage activity.
And because we're in the simulator, we're able to easily stitch it back up to the code
level. What we've found is that GCC is actually initializing its register set -- this is
during the first part of the code -- and it's all in a loop. So it makes sense that you're
getting these recurring swings, these repeating patterns.
But what we're specifically interested in is why we suddenly see these large swings at
this point -- why not here versus there. And when we observed the activity inside the
processor, what we found is that pipeline flushes are going on; basically the
microarchitecture mispredicted a branch, and at that point it has to drain the entire
pipeline out. So it's a sudden stall in activity, and that causes the voltage to swing.
But what's interesting is also that only a few of them, a few of those sudden pipeline
stalls, actually cause an emergency. Like this one doesn't, but this one does.
And so that's what we want to figure out, sort of what are the subtle differences, what is
so different about this branch misprediction versus that branch misprediction. And
when we relate it back to the code, we find that the hot spots are a bit different.
So 2 here corresponds to the basic block in the control flow graph there. We're looking
at the branch, the branch that controls the basic block, the path following basic block 2,
and 5 means the branch here, whether you're going this route or whether you're actually
going to take this loop back edge path. It's a back edge going back, you know, so it's a
loop.
So, for instance, we see that 5, whenever it mispredicts, whenever the machine
mispredicts on the branch 5, we always get a problem. We always end up with an
emergency. But in the case of 2, that doesn't happen. And to sort of figure out what is
so special about 2, you know, why 2 only causes a problem in one situation versus, you
know, in A it doesn't cause a problem, we have to really understand what code the
machine is actually executing.
And the best heuristic for that is the issue rate of the machine. We see that in phase A,
where you don't have a problem, the issue rate is pretty jagged, which means the
current is not being consumed very aggressively -- it's a smaller, more average amount
of current. But in B the processor is consuming a larger amount of current.
And I can tell that the processor is consuming a large amount of current directly from
the issue graph, because the issue logic is the brain of the processor, right? It
consumes a decent amount of power. And so here you're seeing -- I'm going to zoom in
a bit -- that in B the processor is issuing a lot of instructions and really being
aggressive. It's doing exactly what you would like it to do, running a lot of instructions
within a short amount of time. But when you get that branch misprediction here, at that
point the issue completely stops.
The machine comes to a sudden standstill, and that causes the voltage to swing up
very quickly; and then, maybe a couple of cycles later, the issue rate ramps up
suddenly. So there's a complete stall in current draw and then a sudden burst, because
there's a lot of ILP sitting on the correct path, and that sudden current draw is what's
actually causing the voltage to drop below.
Whereas here it's a bit different, right? Following this point here, you can see that the
issue is not that aggressive. And those subtle changes are effectively what determine
whether you have a problem or not.
And relating it back to the code, what you see is that in A, where you don't have a
problem, versus B, where you do, the paths the program takes are slightly different.
>>: From a processor perspective, if sudden drops in activity are the problem, then
does the processor just compensate by doing a lot of busy work, like issuing [inaudible]
or something?
>>Vijay Janapa Reddi: So you can actually -- yeah, you can end up dilating it, but the
problem is you can't do that with every single pipeline flush. In fact, if you think of that
as a heuristic for predicting whether you will actually get an emergency or not, it's
really bad -- it's about 8 percent or so. You need to be able to track the history that is
actually causing the problem, not just, oh, did I get a branch misprediction, because
branches are just one point where you get this problem. There are also cache misses,
TLB misses. And, in fact, you don't even need specific microarchitectural events to
cause problems. The code itself can be a problem.
>>: So even if the processor could compensate, it wouldn't know when to do it?
>>Vijay Janapa Reddi: It wouldn't really know when to do it because -- yeah,
sometimes there's actually no indication, because it's just the way the code is
resonating inside. Right.
And, actually, I will get to the point where I show you an example where a divide
instruction basically stalls the processor in a way that causes this effect.
>>: I'm sorry, but in all those cases -- the TLB and cache misses and the problems you
were mentioning -- won't those all be visible through the issue rate? Isn't that where
the processor can realize how much -- I mean, does the issue rate correspond very
tightly with the current drop?
>>Vijay Janapa Reddi: It tends to correspond pretty decently, yes. And you're saying
just more to the issue rate?
>>: Yeah, more to the issue rate [inaudible] if it drops too quickly, then do you fake
work, and if it rises too quickly then --
>>Vijay Janapa Reddi: That's exactly what -- I mean, that's a good idea, right?
Fundamentally, it's a very simple and a good idea to do it, but the problem is being able
to detect that delta in a very short amount of time because these increases and drops
are so sharp that by the time you detect them and you actually respond to it, that's
where the problem comes, and that's the soft threshold mechanism I was talking about.
You're effectively trying to detect whether that's happening. And that's the big problem,
being able to detect and respond just requires very quick response time that we cannot
actually build.
>>: But that's, like you were saying, more of a proactive mechanism where I can
simply -- if I issue n instructions a cycle, I have my scheduler, you know, simply not
permit the issuing of more than n plus k the next cycle, where I can choose k so that I
don't actually ever have, you know, a ramp up --
>>Vijay Janapa Reddi: I see what you're saying. So completely just do it based on the
issuing.
>>: Be more proactive so that I'm just going to make sure that I never ramp up or ramp
down more than [inaudible].
>>Vijay Janapa Reddi: Yeah, I think -- I'm not so sure, but I think there was some --
well, okay, it wasn't for dI/dt, but that's something I have not actually looked at and I
haven't really seen. But maybe, yeah, if you don't rely on any of the detection
mechanism and you just look at the issue rate --
>>: That still might be pretty tricky to implement because you still have to choose the k,
you have to still do it in a way that you can guarantee that, you know, if the current
fluctuations don't change by [inaudible].
>>Vijay Janapa Reddi: I think that would also depend on what functional blocks you're
activating during that set of instructions you issue. That I think would make it hard.
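For what it's worth, the questioner's idea can be sketched in a few lines: clamp how far the issue width may move from one cycle to the next. The n and k here are the hypothetical parameters from the discussion; note that limiting the downward ramp implies issuing fake work when demand collapses:

```python
# Sketch of the idea above: never let issue width move by more than k per
# cycle. 'demand' is what the scheduler would otherwise issue; clamping the
# downward ramp implies inserting fake work when demand collapses.
def slew_limited_issue(demand, k=1, width=4):
    prev, schedule = 0, []
    for want in demand:
        lo, hi = max(0, prev - k), min(width, prev + k)
        issue = min(max(want, lo), hi)      # clamp the per-cycle delta
        schedule.append(issue)
        prev = issue
    return schedule

# A stall followed by a burst (post-misprediction, say) gets smoothed out:
print(slew_limited_issue([4, 4, 0, 0, 4, 4, 4]))   # [1, 2, 1, 0, 1, 2, 3]
```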
But the point I was trying to make with this slide is that it's a mix of program and
microarchitectural activity in the past that determines whether you're going to get an
emergency or not. And that builds the case that these aren't just random swings;
there's something intelligent we can really do, because we can understand it at this
level.
And so, as I was saying, if we tolerate the emergencies -- you allow them to happen
and you build these recovery mechanisms -- then one idea is to just rely on that. The
problem with that is that you have to implement very fine-grained recovery mechanisms
that are pretty intrusive to the microarchitecture. When I was at Intel talking to these
guys some time back, they were saying that making any changes to traditional
microarchitectural structures in order to enable very fine-grained recovery is a big pain
in the butt, because it takes a lot of validation and testing, and that's a real pain for
them.
So just to show you the kind of recovery costs you would typically need, I'm showing
you a heat map here of the amount of performance improvement you would get from
using a very aggressive margin. The Core 2 Duo has a 14 percent margin, and we're
going to set that to a typical case, whatever value we choose; ideally you'd want to be
around 4 percent.
And so as you make that margin tighter, what you're seeing is the color becomes more
vibrant, and the blue color here means that you're getting a tremendous amount of clock
frequency improvement because you're able to run it faster. And that, to some extent,
would be able to translate into program performance.
And if you want to get this piece here, well, you're sort of realizing that your recovery
mechanisms have to be around 100 cycles or so, and that's pretty intrusive. I mean,
there has been some work even from within our own group looking at this stuff, basically
just purely designing just the recovery mechanisms only.
And Razor, for those of you who are familiar with the architecture side -- Razor is
where this kind of notion all started. The problem is it's very intrusive: you have to
identify your critical paths and all of that. From a performance standpoint, it's a great
way of doing things. It's designing it and building it that's the problem.
So what I'm going to be talking about is how we can use some kind of coarser-grained
recovery mechanism, which is emerging because the ability to allow program execution
to continue and then take some kind of checkpoint state is becoming very interesting
for many uses. For instance, there's a lot of work out there saying we could do
debugging of multi-threaded programs this way, right? You could track some history.
So the notion is to leverage some of the existing recovery logic that people have been
pushing as a general purpose piece of hardware usable for multiple reasons, and apply
that to get the performance improvement that we want. That's ideally what I'd push for,
because if we want to get the really good improvement, we'd have to be at around
10 cycle recovery costs, which is almost even more fine-grained than branch
misprediction costs. And that's what I'm going to show.
So here what I'm trying to show you is the kind of recovery cost you'd want right now,
and that is for today's processors. But as you go into the future -- and I'll tell you how I
project this -- what you start seeing is that the potential you have in terms of recovery
costs quickly starts diminishing.
So you look at Proc 25 -- this is typically projecting into about the 22 nanometer space
-- and that starts dropping. And as you go even further, into the 16 and 11 nanometer
nodes, you start seeing that the recovery cost really has to be extremely fine-grained.
And this is based on the assumption that the voltage will scale a little bit, which is what
the ITRS is expecting, and that the current density is going to increase.
And the way I come to this conclusion is that we take an existing chip, we measure the
data, and we look at the recovery cost versus margin tradeoff. In order to project into
the future, where we expect the peak-to-peak voltage swing within the processor to
increase, we artificially fake that out. Unfortunately you can't really see it here, but
what we do is reduce the number of package capacitors, and that artificially causes the
swing to increase, which in effect allows us to study what's going to happen at the end
of the day.
This is just a vehicle; at the end of the day, this is what we're really interested in. And
this is showing that as you keep breaking caps off and the average swing increases,
you end up with more recoveries going on inside the processor, and because of that
increase in the frequency of emergencies, the recovery cost has to become smaller --
the penalty per emergency needs to decrease dramatically. And that's effectively
what's happening. So we really need to understand how to design for and mitigate the
problem.
And, of course, I wanted to show you the magnitude of the swing increasing as you
break capacitors off, but sorry about that.
So by tolerating, what I'm saying is that we can understand what is causing the activity,
and because we're not going to be able to build these extremely fine-grained recovery
mechanisms, we want to build lighter-weight schemes on top. And that's what I'm
going to talk about very briefly: the predictor, which I call the principle for avoidance.
The idea is you allow the emergency to happen inside the processor and you detect
that it happened -- and you can detect it using the typical sensors I was talking about,
which might take a bit of time, but that's completely fine because we're not trying to
proactively prevent it.
When you detect it, you initiate your checkpoint recovery, which rolls the execution
back to some previously known good state, right? You know that state is correct. And
you notify what I call the emergency predictor, and the predictor basically takes a
snapshot of the activity that was happening -- because, like I was saying, with
checkpoint recovery you're able to track what's going on inside the processor. You
maintain some history, you track that, and then you can predict using that activity
pattern. You don't have to wait for the emergency the next time around.
So that allows us to basically prevent the emergencies.
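Here's a toy rendering of that tolerate-then-learn loop. The pipeline is faked with a random event stream in which one particular pattern always precedes a dip; real checkpointing and sensing are abstracted away, and all names and patterns are mine. The point is just the control flow: take the first, compulsory emergency, roll back, learn its signature, and avoid the rest.

```python
import collections, random

# Toy tolerate-then-learn loop over a fake event stream.
random.seed(1)
EVENTS = ["br_taken", "br_miss", "l1_miss", "l2_miss", "tlb_miss", "none"]
BAD_PATTERN = ("br_miss", "none", "l2_miss")   # pretend this causes the dip

history = collections.deque(maxlen=3)          # recent activity signature
learned = set()                                # predictor table
taken = avoided = 0

for cycle in range(100_000):
    history.append(random.choice(EVENTS))
    sig = tuple(history)
    if sig in learned:
        avoided += 1        # avoidance: throttle before the dip can happen
    elif sig == BAD_PATTERN:
        taken += 1          # tolerance: detect, roll back to the checkpoint,
        learned.add(sig)    # ...and learn the signature that led up to it

print("emergencies tolerated:", taken, " emergencies avoided:", avoided)
```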
And so here I'm showing you what the predictor would actually track in hardware. So
it's just some control flow I'm showing you here, but, of course, like I said, there might
be other microarchitectural activity, and you track those patterns and you store them in
your table. And these patterns actually become very important for the software piece
because unless you know what caused the problem, you wouldn't really know where to
actually fix the problem. So that's why these signatures are very critical for the software
piece.
And what I'm trying to show here is how far ahead these patterns allow you to predict
and anticipate the problem. That's what I call lead time. If you were to get an
emergency here, you ideally want some amount of lead time so you can take some
preventive action.
And I'm showing you that even when you go 16 cycles ahead of time, it's able to
predict from those patterns we recorded. Even with 16 cycles of lead time, you can
predict them very accurately, up to 90 percent accuracy, which is huge.
And if you were to just rely on that simple detection mechanism, then it's about -- sorry,
yeah?
>>: So how do you define accuracy?
>>Vijay Janapa Reddi: So accuracy, the way we look at it, is: we record how many
times an emergency happens, and each time we compare whether the predictor, with
whatever history it looked at, also expected an emergency at that point. So basically
the predictor sees some kind of activity history pattern, and you ask: did I get this
emergency when I saw that pattern?
>>: So is this saying that there's 10 percent false-negatives, 10 percent that you
missed?
>>Vijay Janapa Reddi: 10 percent that I missed.
>>: But it doesn't say anything about how many extra ones you thought you were going
to get but you didn't necessarily --
>>Vijay Janapa Reddi: How many extra ones that --
>>: So if I were to make a predictor that said predict every cycle that there's a voltage
emergency, under this metric would I have 100 percent accuracy?
>>Vijay Janapa Reddi: No, you still wouldn't.
>>: Okay.
>>Vijay Janapa Reddi: Because there's still some amount of error in the history that
you record -- well, if the basic block is huge, for instance, and you're only tracking
control flow. That means -- sorry.
>>: I think my question was more to your metric of accuracy as to -- I thought the thing
that you said was that you're saying the percentage of emergencies which you predicted
would be an emergency.
>>Vijay Janapa Reddi: Yeah.
>>: The question is do you consider false-positives and false-negatives?
>>Vijay Janapa Reddi: Ah, false positives and false negatives. So actually that --
>>: [inaudible]
>>Vijay Janapa Reddi: Yeah. So I did do that -- this number is not based on the actual
run where the predictor is active. And when we do that run, there tends to be this
behavior where, as you're preventing some of the emergencies, some of the nearby
future emergencies go away. So in fact it turns out that the predictor misses only one
percent of the emergencies.
This is, of course, assuming that you've already learned the pattern. That's what I'm
saying. So first you have to learn the pattern, right? That's a compulsory miss, just like
in a cache.
And once you get that, once you've figured out the pattern that causes the problem,
then you can use that to anticipate the future occurrences. And when we do that, we
realize that 99 percent of the time we're able to prevent the problem. That is in the
actual run because we actually throttle.
So one of the things -- obviously it's very much like a branch predictor.
>>: [inaudible]
>>Vijay Janapa Reddi: It's actually --
>>: [inaudible]
>>Vijay Janapa Reddi: It's instruction-based, right, because I'll look at branches or --
it's not completely instruction-based. It's a mix of instruction and microarchitectural
events, because I basically encode, oh, this is a branch, it's a taken or not-taken
branch. And then I will also encode whether you got an L1 miss or an L2 miss.
>>: [inaudible]
>>Vijay Janapa Reddi: So, actually, yeah, I was going to get to that in the next slide.
Because it turns out that we just need to track a small fraction, so it's about seven
events or so. Because those are the ones that really end up causing the major stalls.
So because it's very much similar to a branch predictor-like structure, then what you see
is as the amount of information increases, then obviously you're tracking more history,
and therefore your ability to actually predict the problem increases.
And where you track the information -- I think this is what you're asking: where you're
tracking and what kinds of events you're tracking, right? What I'm showing you here is
that the first three bars assume the code path only; you're only tracking the program
path. And it shows that if you capture at the commit side, which means you're only
looking at the true program path -- you're not looking at speculation, so effectively
you're ignoring branch mispredictions -- then you only get 40 percent. But when you go
to decode, you end up capturing some of the speculation that the processor is doing,
so you're capturing the misprediction implicitly, and your accuracy improves. So
branches, obviously, are a major thing. And as you throw in more events -- TLB
misses, L1 and L2 effects -- your accuracy improves dramatically.
>>: [inaudible]
>>Vijay Janapa Reddi: So the signature sizes -- this is what I'm talking about here.
This is the number of entries, and I used 3-bit encodings for each of the entries.
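As I understand the encoding, each history entry gets a 3-bit code -- enough for the roughly seven event kinds mentioned earlier -- and the entries pack into one integer that indexes or tags the predictor table. The specific code assignments below are my guesses; the talk only gives the entry width and the event count.

```python
# Each history entry gets a 3-bit code (up to 8 event kinds; the talk says
# about seven matter). Entries pack into one integer that tags the table.
EVENT_CODE = {
    "br_taken": 0, "br_not_taken": 1, "br_mispredict": 2,
    "l1_miss": 3, "l2_miss": 4, "tlb_miss": 5, "long_stall": 6,
}

def encode_signature(events):
    """Pack an event history, oldest first, into one integer signature."""
    sig = 0
    for e in events:
        sig = (sig << 3) | EVENT_CODE[e]
    return sig

hist = ["br_taken", "l1_miss", "br_mispredict", "long_stall"]
print(bin(encode_signature(hist)))   # 000 011 010 110 -> 0b11010110
```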
So like I was saying right at the beginning, you can either go -- if you use aggressive
margins, you can either get power reduction or you can boost up the clock frequency.
Here I'm just going to talk about it from a frequency perspective.
So in our system we had a 14 percent worst case margin, just what the Core 2 Duo
processor was experiencing, and we chose 4 percent as the aggressive setting,
whatever the average case design point seemed like. That translates to a 17.5 percent
increase in clock frequency you would be able to get by using the aggressive margin.
And what we see is that if you compare against an oracle predictor scheme -- the
oracle still has to throttle once per emergency, so it knows exactly when the problem is
going to come and it still pays a little bit of penalty -- compared to the 17.5 percent, it
comes to about 15 percent, which is very close, but you're still paying some
performance penalty.
And if you take the signature predictor, which is the one we came up with, it comes
very close to the oracle. And, in fact, it does that even though it's relying on these very
expensive training mechanisms, right? Because it has to actually incur a problem,
learn from it, and then apply that at run time -- basically pay the cost for learning -- and
after that it has to predict very effectively to do this.
But there's actually room to push it even further. And that's where the software piece
comes in. I really had to lay the foundation for how the software would be able to do
these things, because you have to understand what is causing the voltage to swing,
and the predictor is a very key piece of this entire scheme because it captures a record
of the history patterns that are needed inside the compiler to actually change the code
around.
>>: Sorry. One more question about that. What was the [inaudible] again? Is that just
the frequency increase you can achieve or is it actual --
>>Vijay Janapa Reddi: This is the actual speedup once you're actually running the
program.
>>: Okay.
>>Vijay Janapa Reddi: So from tolerating, we have figured out what is causing the
problem, and now you're able to predict it accurately. And the prediction, although it
seems like a hardware thing, is actually very important for the software, because it is a
representation of the quality of the information you're giving the software, and the
software is going to operate off that.
And so the software, once it gets the profile information from these hardware
structures, can actually change the code, and that's what I'm going to talk about in a
single core system. And then when you go to the multicore system, you can schedule
threads intelligently. You can actually get rid of the problem altogether, so in effect
you're never going to get a voltage emergency.
So what we did was we took the programs and we observed the locations where this
problem actually happens in terms of program structure. What we find is that there are
very few places in the code where you get these emergencies. They tend to be hot
spots.
And that's what I'm showing you here. Across the three different benchmark suites, the
dynamic emergencies are the number of emergencies you actually incur during
program execution, and the static locations are how many places in the control flow
graph of the program you actually end up with an emergency. And it's a very small
fraction. It's basically just a few hundred instructions, or instruction hot spots, that
cause the entire problem. That actually motivates everything for the software, because
if this problem were scattered all across the code, it would be very hard for the
software to do anything. The software schemes work really well when there's a
repeating pattern, because changing something at one location for a specific problem
is much more effective.
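The profiling reduction he's describing amounts to attributing each dynamic emergency to its static code location and counting unique sites; a sketch with toy data (the addresses are invented):

```python
from collections import Counter

# Attribute every dynamic emergency to its static code location (say, the PC
# of the offending basic block) and count unique sites. Data here is toy.
emergency_pcs = [0x400a10, 0x400a10, 0x400b38, 0x400a10, 0x400b38,
                 0x400a10, 0x400a10, 0x400c90, 0x400a10]

by_site = Counter(emergency_pcs)
print("dynamic emergencies:", sum(by_site.values()))
print("static locations:   ", len(by_site))
for pc, n in by_site.most_common():
    print("  %#x -> %d" % (pc, n))
```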
So here, just to really explain, we looked at some of the control flow graph structure to
show how the interaction between the program control flow and the events the
processor is experiencing can actually cause a problem. And I'm showing you
benchmark C.
So what we see is that you get an emergency here in this section of the control flow
graph, and you get it because the processor ends up stalling briefly at a certain point. It
stalls when we get a divide instruction, because the divide is a long-latency operation,
and effectively until its computation is finished, the rest of the instructions cannot
actually be dispatched.
And this is in a loop. And so what we see is that you get a stall here. That's what we
observe. When we look at the set of processor events we're observing here, we see
that the long-latency divide happened. When you get that divide, the issue rate of the
machine suddenly comes to a complete standstill, because there's a dependency on
that instruction -- the machine cannot go forward until that computation is finished --
and so the current drops dramatically.
But when the computation finishes, you see a sudden increase in the issue rate, as
expected, because now suddenly all the dependent instructions are being executed,
and that's why the current increases. Because the current increase is steep and
sudden, the voltage actually drops suddenly.
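The mechanics can be sketched with the textbook V = Vdd - I*R - L*dI/dt view of the power delivery network. The constants below are made-up illustrative values, not measurements from any real chip; the takeaway is that the step in current, not its absolute level, digs the droop.

    # Back-of-the-envelope droop model: V = Vdd - I*R - L*dI/dt, stepped
    # once per cycle. All constants are made-up illustrative values.
    VDD, R, L, DT = 1.0, 1e-3, 1e-12, 1e-9  # volts, ohms, henries, seconds/cycle

    def supply_voltage(current_trace):
        volts, prev_i = [], current_trace[0]
        for i in current_trace:
            di_dt = (i - prev_i) / DT
            volts.append(VDD - i * R - L * di_dt)  # resistive + inductive drop
            prev_i = i
        return volts

    # Divide stalls issue (current sags to ~10 A), then all the dependent
    # instructions fire at once (current jumps back to ~50 A): the jump
    # produces the undershoot.
    stall_then_burst = [50] * 20 + [10] * 20 + [50] * 20
    print(min(supply_voltage(stall_then_burst)))  # worst-case sample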
And what we effectively end up doing is that we can actually change the way the code is
structured and sort of dilate that sudden burst in activity and prevent the problem.
So it's basically because of that sudden issue rate that you're getting the voltage drop,
right? And that's happening because there's a lot of ILP. The way I'm showing that is,
let's say your machine can typically issue two instructions a cycle. It's doing great, and
each cycle it's able to dispatch that number of instructions. Then, naturally, there's a
high issue rate, and the current is going to be consistently high.
And the colors here represent instructions that are effectively dependent. That means
they cannot proceed until the previous instruction completes. What we want to do is
basically stretch the issue rate out a little bit, and by doing that you get an averaging
effect, and that prevents the problem.
And so we effectively want to cripple what the machine does at a given point, and you
want to do this intelligently because you just cannot do it all over the place. Otherwise
you'd lose performance dramatically.
And by chaining up dependent instructions together you can actually end up slowing the
machine just enough at the right hot spot. We actually tried doing this in the static
compiler, and the problem is that when you chain something at one place, the problem
ends up creeping up at another place. So you effectively have to keep fixing it on the
fly. That's what we actually end up doing. It's a dynamic feedback system.
So what I'm trying to show here is that if you have one instruction, for instance a move
instruction, you can move that instruction down and increase the read-after-write
dependence chain at the right place -- the compiler has to move instructions while
making sure you're still preserving all the semantics correctly. If you can do that, then
what you see is -- so this is the before, the one we had a problem in, and this is the
after -- we flatten out the issue rate a little bit. We crippled the ILP of the machine, and
that way you decrease the amount of current drawn just a little bit, and that prevents
the voltage drop.
And it's a bit tricky to do this in an out-of-order machine because you have to worry
about the instruction scheduling window, because it's pretty big. So there were a lot of
nifty tricks that we had to pull.
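Here is a deliberately simplified sketch of that rescheduling idea. It greedily orders instructions in a flagged hot block so that, where possible, each instruction reads the result of the previous one, serializing issue instead of letting a wide burst go out in one cycle; one useful heuristic, mentioned a little later, is to size the chain against the machine's issue width. A real pass must also verify that every move is legal (memory ordering, flags, live ranges), which this toy skips.

    # Simplified sketch: lengthen read-after-write chains inside a flagged
    # hot block so the out-of-order core cannot issue a wide burst at once.
    # Instructions are modeled as (dest_reg, src_regs) tuples; the legality
    # checks a real compiler must do are deliberately omitted.

    def lengthen_raw_chain(block):
        remaining = list(block)
        scheduled = [remaining.pop(0)]
        while remaining:
            last_dest = scheduled[-1][0]
            for idx, (dest, srcs) in enumerate(remaining):
                if last_dest in srcs:
                    # Prefer an instruction that reads the previous result:
                    # the RAW dependence forces issue to serialize.
                    scheduled.append(remaining.pop(idx))
                    break
            else:
                scheduled.append(remaining.pop(0))  # no dependent op available
        return scheduled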
>>: [inaudible] compiler is specialized for every speed grade -- every stepping of every
processor?
>>Vijay Janapa Reddi: I'm sorry?
>>: It seems that the compiler has to be done differently for every speed grade of every
stepping of every processor.
>>Vijay Janapa Reddi: I think it sort of -- what do you mean by a speed grade?
>>: You buy chips at a given gigahertz, either 2 gigahertz or 5 gigahertz, and you pay
more money for them and [inaudible].
>>Vijay Janapa Reddi: Right. So I think -- we haven't actually seen it -- okay, so we
just took one chip on which we were trying this out, and the point we were able to show
was that, look, we would actually be able to come up with an algorithm that makes
sense. So I guess I cannot strongly claim that, oh, this would just work out of the box,
because we're still looking at how the optimization has to be built so that it works
completely independently of the platform itself.
What we found basically was that if you can construct the chain so that it matches the
issue rate, the issue width of the machine, then it typically works very well. So it's a
very simple heuristic that we've found, but it seems to have a very tight correlation with
how effective the scheme is.
The only problem is you have to worry about how big the instruction window of the
machine is. If it's really big, trying to create read-after-write dependence chains is
extremely hard because you have to find those chains all over the place.
Another simple way is to just throw in no-op instructions or something to dilate the time.
And so that's what I'm trying to show here: if you reschedule the code intelligently, then
you can reduce a large number of emergencies. I'm showing you the fraction of
emergencies that remain after just one optimization, after just one pass that we made.
And if you instead try to dilate the time, try to lengthen things out by injecting
instructions into the stream just arbitrarily -- just because you have a hot spot at some
point and the compiler says, oh, based on the emergency signature pattern it looks like
the hot spot is here and I need to fix it, and you just start throwing in instructions to
cripple the ILP -- then that becomes a problem, because that can actually introduce
unnecessary stalls. And that's where we find that we actually lose performance.
So naturally, if you introduce no-op instructions, you're going to lose time because
you're doing a lot of unnecessary work, versus if you actually reschedule the code
intelligently, then you can get better performance at the end of the day. But the point
I'm trying to make here is that when we reschedule, we're not getting performance
improvements because we changed the code a lot; it's that you can actually come up
with optimizations that are not completely performance driven. You're specifically
optimizing the code to get rid of this problem.
>>: What kind of no-ops did you guys inject? Because for [inaudible].
>>Vijay Janapa Reddi: Yeah, so you actually construct pseudo no-ops. You look at the
code and then basically force it to do some computation that's actually redundant. So
we do pseudo no-ops. They're not really no-ops because, you're right, a true no-op just
gets killed off at dispatch, at the decode stage.
>>: So when you specialize the code, you're running more software. Doesn't that cause
additional voltage -->>Vijay Janapa Reddi: Yes. So we actually looked at the compiler itself. The compiler
itself does end up causing emergencies while it's doing its work. The way we handle it
is that you have a tolerance mechanism at the end of the day -- we make this big
assumption that there is always a corrective mechanism in the hardware as the
baseline. And so whenever we get that fault, we just take the fault and account for the
penalty incurred inside the compiler.
And the thing is that if your compiler is good, if you can really optimize correctly, then
the number of times you would have to call the compiler is very small. That's the
assumption. If the optimizations don't work and you keep calling the compiler, then it
becomes redundant.
And so the way the entire system actually works is you propagate stuff up to the
software level, but if the software cannot fix it, then the emergency signature that the
software relies on is fed back into the predictor mechanism, because it is much cheaper
for the hardware to maintain performance that way than to rely on completely tolerating
the emergency.
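Put together, the control loop might look like the sketch below: the hardware tolerates via checkpoint/rollback, the dynamic compiler tries to eliminate a hot spot a bounded number of times, and signatures the compiler cannot fix are pinned back into the predictor so the hardware throttles them preemptively. All of the component interfaces and the retry budget are assumptions for illustration.

    # Sketch of the tolerate -> eliminate -> avoid feedback loop. The
    # compiler/predictor interfaces and the retry budget are assumptions.
    MAX_RECOMPILES = 3  # assumed budget before giving up on a hot spot

    def handle_emergency(signature, compiler, predictor, attempts):
        # attempts: collections.Counter keyed by signature.
        if attempts[signature] < MAX_RECOMPILES:
            attempts[signature] += 1
            compiler.reschedule_hotspot(signature)  # software elimination
        else:
            # Cheaper to throttle preemptively than to keep rolling back.
            predictor.pin_signature(signature)      # hardware avoidance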
Yes, of course. Now, as we go into the multicore systems -- all of that was just on the
single core, where you can optimize locally; you're basically performing very localized
optimizations to get rid of the problem.
But what happens in a multicore is that you have interactions where another core can
cause problems on your core, because typically in a multicore system a certain number
of cores -- for instance, two or four or six -- can actually be tethered to the same power
supply. So what that means is that if one core is misbehaving, you end up penalizing
every other core, because the droop will propagate. And the IBM guys have actually
done this -- the way they do it is pretty crazy, because they'll create a droop at one core
and actually observe it traveling across, and they'll say that after x number of
nanoseconds you will see it, and if the second core happens to fire up right at that time,
then you get a much bigger droop.
So they do these very detailed studies trying to deal with it in a multicore system. And
as we're starting to expand into many, many cores, this droop issue becomes more
significant, because if at any time all the cores happen to fire up and draw a lot of
current, then the droop is going to be much bigger. So that means your guard bands
have to increase substantially, and you end up compromising the overall efficiency of
the system itself.
And so in order to mitigate that, what I was looking at was how the noise within the
program actually changes over time. And I'm showing you three benchmarks from CPU
2006. What we found is that if you were to [inaudible] the emergencies that you would
typically have per thousand clock cycles, using a very aggressive margin purely for
illustration purposes, then you see that in some benchmarks the noise is very flat.
Whereas in some cases you end up with some kind of phase, a minor voltage noise
swing. In some extreme cases you get a very large amount of variance, and that
variance basically happens because of the microarchitectural stalls.
We've actually been able to correlate all of this noise down to the microarchitectural
activity. And the question is how can we actually use that intelligently. If you pair
multiple threads with very different behavior together, you'll get a large amount of
variance.
So in order to prove that, what I'm showing you here is I take one benchmark that's
here and combine it with every other benchmark on the x axis, and then I show the
distribution that we see in terms of how many voltage emergencies we get. And that's
what the box plot is effectively showing you.
And the blue point is a pretty good comparison point because it's basically saying what
happens if I just run a program with itself. It gives us some kind of reference. And what
we see is that there's a huge amount of variance. In most cases it's almost a hundred
emergencies of difference, right? And this is per thousand clock cycles. That's just
because we're using a very aggressive setting to try to determine what the noise
characteristics are. But there's a huge amount of variance.
And so what you can actually do is schedule threads using the OS or hypervisor across
the cores intelligently.
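As a sketch of what such a droop-aware scheduler could do, assume each thread has a measured solo emergency rate from the profiling above; then a simple greedy pairing keeps noisy threads away from each other. The profiling interface and the pairing objective are assumptions, not the actual OS policy.

    # Greedy droop-aware co-scheduling sketch. droop_profile[t] is the
    # measured emergencies per 1K cycles when thread t runs alone (an
    # assumed profiling interface).
    def pair_for_droops(threads, droop_profile):
        ordered = sorted(threads, key=lambda t: droop_profile[t])
        pairs = []
        while len(ordered) >= 2:
            # Pair quietest with noisiest so no shared supply gets two
            # high-variance threads at once; an odd thread out runs alone.
            pairs.append((ordered.pop(0), ordered.pop(-1)))
        return pairs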
>>: [inaudible]
>>Vijay Janapa Reddi: I'm sorry?
>>: If you've got a message-passing-based system where each core is operating
independently, how do you do coordination between them?
>>Vijay Janapa Reddi: So I haven't actually looked at it when you have that kind of
mechanism, but maybe at that point it would open up the door: can you actually look at
the interactions between the threads? Because that's very different from the case we're
looking at here, where the programs are completely independent. But in that case there
might be interactions. And one of the things I was doing on the chip was to actually see
what happens if you have a producer-consumer relationship and you release something
and then there's a sudden burst on the second core.
And we find that it's pretty -- I've had a tough time trying to really capture a picture of it,
for instance on a scope, because it's really a whole bunch of factors that end up starting
up. It's not just a few instructions.
But I think that that might be something we would be able to do just looking at the traffic
that's coming in and out of threads. Especially if each one is just bound to a single core.
So there's been a lot of work in the space basically in terms of scheduling threads,
right? Typically we've been looking at scheduling threads in the operating system
community for voltage noise -- sorry, for performance reasons. And typically what
you're trying to do is you're trying to minimize the bottleneck you have at the shared
resource, the cache.
And so by optimizing for cache sharing, you're in effect removing the cache stalls. And
if you remove the bottleneck cache stalls, you would expect that some amount of the
noise has to go down, because you just got rid of some of the stalls.
But it turns out that that alone is not enough, because you have to account for other
kinds of events that are happening inside the processor. Like I said, there are
branches, there's the TLB, and there's a lot of interaction which I haven't shown here in
the slides, but all those interactions create different levels of noise swings.
And to prove that point, what I'm showing you here is -- the x axis is IPC. So let's say I
schedule intelligently for IPC only, based on hardware metrics, which is a pretty
well-established literature space out there. I'm showing you that if you schedule for
IPC, you're going to get better performance relative to some baseline, which I'll specify
for my case in a second.
But the problem is you end up with more droops because the system is completely
agnostic of what is going on underneath the schedule itself. And once I actually tag on
the recovery costs, the performance actually decreases dramatically. You can even go
into the negative space.
But if you were to intelligently schedule for droops alone -- you're trying to say, I just
want to pair threads based on some activity history, I want to pair threads that cause a
low number of recoveries -- then you see that you end up with a little bit of a
performance win, but that's really not the point. That's just an artifact.
The biggest point is that the number of droops you see relative to your baseline
decreases, and that's a good thing, because that means you now have fewer
recoveries. And so it makes a very strong case that scheduling for droops is very
different from scheduling for performance.
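The accounting behind that claim can be written down directly; the numbers below are made up purely to show the shape of the trade-off, not measured results.

    # Net speedup once each droop pays a rollback penalty. Numbers are
    # invented for illustration only.
    def net_speedup(raw_speedup, droops_per_1k_cycles, recovery_cycles):
        overhead = droops_per_1k_cycles * recovery_cycles / 1000.0
        return raw_speedup - overhead

    print(net_speedup(1.08, 3.0, 50))  # IPC-driven schedule: 1.08 - 0.15 = 0.93
    print(net_speedup(1.02, 0.5, 50))  # droop-driven: 1.02 - 0.025 ~= 1.0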
And I'm showing you what happens when you have random -- just random schedules, if
you're not scheduling for performance or anything, you're just pairing threads together
to get the job done, which is pretty much what the Linux scheduler, if you look at it,
really turns out to be doing. It's not truly random, but the effect is effectively the same
thing as random.
And so that's the point I'm trying to make here: when you're looking at multiple threads
or multiple cores, there's a large amount of room for optimization by looking at this at
the software level, because trying to build mechanisms for this at the hardware level in
CMP systems would be extremely hard.
And I think the way to think about the software piece is -- being a software guy, the last
thing I ever want to hear is that, oh, the hardware is going to be faulty, so you need to
make sure your code is always going to be correct, right? And you have to do all this
optimization. I think that would just never, ever work.
But the way I think about it is, look, we're going to design the hardware intelligently as
well as we can to try to mitigate some of these problems because this is really going to
start eating us in terms of energy efficiency and stuff.
Then the software guys could really think about it as, look, here's another opportunity
for performance optimization, because cache misses and branch mispredictions, for
instance, are just specific things that we already optimize for. And the case I'm really
trying to construct here is that as this becomes a big problem and you get the right
substrate at the hardware level, you push it as an optimization problem to the software
level. So just by getting rid of the recovery cost, you get better performance naturally.
And so just to summarize the things that I covered in the talk: I mentioned that margins
are increasingly causing frequency loss, hurting power efficiency, and increasing the
cost of the chip, and I mentioned that beyond tolerating, you can actually intelligently
avoid the thing. And for all of that, you have to look at how the voltage is fluctuating
within the processor.
And once you learn to avoid -- avoidance is important because you're tracking the right
information there -- then you can actually use software to completely get rid of the
problem itself. And so we have come up with this three-tiered architecture, basically
tolerance, avoidance, and elimination, that I really think is applicable to this particular
problem that people haven't completely understood yet.
And the variation -- if you remember the first slide that I started off with, there's this
direction on the process variations, the thermal variations, and the voltage variations.
We looked at the voltage variation as happening at a very fine-grained time scale, but
we might be able to generalize this to the broader question of how you handle variations
as a scheme. And I think the architecture we've built up actually works pretty well for
that.
And I think this notion of tolerating errors is something that we are going to be seeing in
the future. The architecture will be tolerating the errors; in effect, you're going to be
speculating all the time on correctness -- I mean, we already do that with branch
mispredictions or memory speculation. And this is just going to be another notion that's
coming out. Because, like I said, on Friday I'm actually going to be talking to the guys
who put the chip together, and they do have a prototype chip that tolerates droops this
way and is actually running programs to completion.
So that would be my talk. Any questions?
[applause]
>>: [inaudible]
>>: Do you have any sense of -- sort of following up on this earlier question -- of how
much the compiler [inaudible] need to be aware of the variation across chips that you
mentioned very early in the talk, especially with regards to temperature and just
[inaudible] process variations?
>>Vijay Janapa Reddi: I think it depends on what your workload characteristics are. So
let's say we talk about process variations, right? Then one core will obviously be
running at a much lower frequency, and in a multithreaded workload you don't
necessarily want to put the critical section on that core, for instance. There would be a
lot of benefit from moving that thread onto the core with less variation, because you're
going to be able to operate at higher clock rates and get through it. And if a thread is
purely crunching through non-critical sections, then you might want to schedule the
threads intelligently that way.
>>: It's kind of crazier, I mean, because you have core-to-core process variations such
that if you scheduled A here and B here, you get emergencies, but if you scheduled
them the other way around you might not actually get the emergencies, because of how
the [inaudible] variations and the activities match up with that [inaudible].
>>Aaron Smith: All right. Any more questions?
>>: Do these signatures, like, for one particular architecture, do they depend a lot on
the workload or -->>Vijay Janapa Reddi: They actually depend very much on the workload and the
architecture. So I didn't show it here, but this was based on about 55 different
programs -- the experimental set I'm talking about is 55 very unique programs -- and
then we looked at three very different architectures in terms of the aggressiveness of
the configuration. The patterns you record -- the information we're recording is correct,
but the orderings are obviously completely different. So it has to be a very dynamic
thing. It's not something where you would say, okay, I'm just going to capture this
record of activity and then just do some static compiler optimization. We tried that.
That was ideally what we wanted to be able to do.
>>Aaron Smith: All right. Should we thank our speaker one more time?
[applause]