>> Engin Ipek: Hi, everybody. It is my distinct pleasure to have Radu Teodorescu with us today. Radu is
a PhD candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign.
And today he will talk to us about his work on addressing parameter variation with architectural techniques.
>> Radu Teodorescu: Thank you very much, Engin. Good morning, everyone, and thank you for coming
to my talk. Today I'm going to talk about the work that I did as a graduate student at the University of
Illinois on helping Moore's Law through architectural techniques to address parameter variation.
As we all know technology scaling continues to follow Moore's Law delivering more transistors with every
new generation of microprocessors. This has translated into increased computing power at lower costs
and has been a driver for the computing industry for many decades now. However, as transistors reach
deep nanometer sizes this continued scaling faces major challenges. Now some of these challenges
originate in the chip manufacturing process while others originate in the environment in which the chips run.
On the manufacturing side, I'll mention just a couple. One is sub-wavelength lithography, which is the
process of actually printing transistors onto the silicon die. It uses a light source with a wavelength of 193
nanometers to project much smaller feature sizes, on the order of 35 nanometers in the current generation.
This leads to imprecision and difficulty in controlling these feature sizes, which in turn leads to variation in
the effective length of the transistor, one of its important parameters.
Other problems include variation in dopant density in the transistor channel: as the channel gets smaller
and we get fewer dopant atoms, this leads to variation in parameters such as the threshold voltage of the
transistor. And there are many more manufacturing challenges, but we also have environmental factors that
impact the functioning of these transistors, including temperature variation and supply voltage fluctuation.
So there are many problems. These manufacturing and environmental factors cause significant variation in
transistor parameters that we care about a lot, such as switching speed, which determines the operating
frequency of a transistor and of the entire chip, and leakage power consumption, which is an important
component of the total power consumption of our systems today.
So if we were to plot the probability density function of any of these parameters -- let's say this is switching
speed -- for all the transistors that we have in one of our chips today, we would expect to see a function that
looks something like this, with most transistors falling inside this nominal region. Instead, what we actually
see, again for switching speed, is a distribution that looks more like this, with some of the transistors faster,
most of them slower, and fewer of them falling inside this nominal region here.
Now this behavior has a direct impact on current microprocessors, such as AMD's quad-core Opteron, for
instance -- this is just an example, I'm not implying anything about AMD's technology here -- and also on
Intel's 80-core supercomputer-on-a-chip prototype. So it has an impact on current and future systems,
resulting in reduced chip frequency compared to what would otherwise be possible without variation,
significantly higher power consumption, and decreased reliability.
Now to get a sense of the very real effect of variation let's look at some actual data as measured by
researchers at Intel. So this plot will show you the normalized frequency versus leakage power
consumption for a set of chips that are manufactured in what is now an old technology, 130 nanometers.
And their distribution on this frequency versus leakage plot looks something like this. Each dot here
represents the frequency versus leakage power of a single die. So as we see, we have 30% variation in
frequency across these dies and up to 20 times variation in leakage power consumption.
So while we would expect a chip to run somewhere here, at high frequency and with low leakage power -- yes, question?
>> Question: (Inaudible) all of those dots like the overall mass in the left corner versus the outliers?
>> Radu Teodorescu: Well, I don't actually have that because I didn't have access to the numbers. I had
essentially just a plot; this is an illustration that I did over it. Yeah?
>> Question: (Inaudible) -- independent of the actual probability mass?
>> Radu Teodorescu: So the point that I'm trying to make here is essentially that there is significant
variation both in frequency and leakage power. Now how many of these chips actually fall within a smaller
variation region here I don't know precisely. I will show you data for our own models. And that may
make --
>> Question: So we're plotting here the overall chip frequency and leakage, or the individual transistors?
>> Radu Teodorescu: That's a good question. So this is chip-to-chip variation. So it's the overall
frequency versus leakage power for the entire chip. Yeah. Later on we'll see this broken down into
chip-to-chip and within-chip variation, but that's coming up in a second.
So most of the chips will have lower frequency and higher leakage power consumption than we would
expect. And by some estimates actually one generation of process scaling is actually lost to process
variation. So this is a very significant problem. For the rest of this presentation I'm going to talk about my
work on getting some of these losses back.
First, a little bit more background on variation. Question?
>> Question: Like the processors you showed before.
>> Radu Teodorescu: Uh-huh.
>> Question: Like are these the ones that were manufactured or the ones that were actually sold?
>> Radu Teodorescu: They're the ones that were manufactured.
>> Question: Okay.
>> Radu Teodorescu: In fact some of these you can't sell because of the high power consumption, and
you have to throw them away.
>> Question: Okay.
>> Radu Teodorescu: Yes?
>> Question: So are the operating conditions the same for the different chips?
>> Radu Teodorescu: I would assume yes. Intel gave these measurements, and the way they do these
measurements is post manufacturing. They measure each die on the wafer under the same conditions.
Yes.
So variation occurs at different levels. If we examine a wafer of microprocessors such as this one,
containing Core 2 Duos, we will see that some of the chips are fast, shown here in red, while others are
slow, shown here in yellow. So variation makes individual chips different from one another, and this is
generally referred to as die-to-die variation.
Further, if we look inside a single die, such as this four-core chip, we will notice that there is significant
variation inside the die, and this is referred to as within-die variation.
Now within-die variation also has an interesting property: it is spatially correlated, meaning that
transistors that are close to each other tend to have similar properties. For instance, the ones shown here in
red are fast but consume a lot of leakage, or static, power, while the ones shown here in white and yellow
are slower but more power efficient. My work is focused mainly on addressing the problems caused by
within-die variation, and that's what I'm going to talk about next.
So here's an overview of the solutions that I propose and am going to talk about today. Even though
variation occurs at the level of circuits and devices, its effects on performance, power and reliability are felt
at the micro-architectural layer, which I care about, but also at the runtime system layer, the OS and
software. That is the reason why I developed architectural solutions that span different levels of the
computing stack. I will discuss two such techniques today. The first is aimed at variation reduction: we try
to reduce the effect of variation inside the die. The second technique is aimed at variation tolerance: we try
to design our system around variation.
In the first solution we look at a chip multiprocessor with four cores, with variation, and the goal of this
solution, like I said, is to reduce within-die variation. To do that we partition the chip into multiple cells, and
using a technique called body bias, applied to each of these cells, we can either reduce the leakage power
of cells that consume a lot of power, the ones shown here in red, or speed up cells that are slow, shown
here, resulting in reduced within-die variation. We call this dynamic fine-grain body biasing. This is what I'm
going to talk about first, in the first part of the talk.
In the second part of the talk, in the second work, I took a different approach. The second technique is
aimed at tolerating variation in a large chip multiprocessor, such as this 20-core CMP. The idea here is
that because of variation these CMPs will have cores that, even though they are designed to be identical,
will behave differently, meaning they will have different power consumptions and different frequency
potentials.
So to manage a system like this we develop variation-aware application scheduling and power
management algorithms for the OS and the power management subsystem. For instance, these algorithms
can map applications to high-frequency cores, shown here in red (the flowery things are applications),
when the goal is to improve the performance of the system, or to low-power cores when the goal is to save
power. In addition, by combining knowledge about the characteristics of the cores with the characteristics
and requirements of the applications, we develop a power management subsystem that is able to improve
throughput at the same power consumption compared to a variation-unaware case. This is what I am going
to talk about in the second part of the talk.
So here is an outline of my talk. I will describe the two solutions. First dynamic fine-grain body biasing,
which is work that appeared in last year's International Symposium on Micro-architecture. And then
describe the second solution, variation-aware scheduling and power management, which will appear in this
year's ISCA. I will evaluate both techniques together at the end and finally say a few words about my plans
for future work.
So let's start with dynamic fine-grain body biasing. Now body biasing is a well known technique and has
been used for a long time for controlling threshold voltage. In body biasing, a voltage is applied between the
source or drain and the substrate of a group of transistors. Depending on the polarity of this voltage that
gets applied we can have two types of effects. One is forward body bias, which essentially increases the
switching speed of these transistors at the cost of increased leakage power consumption.
The opposite is true in reverse body bias, which essentially decreases the speed of these transistors with
the benefit of actually saving a significant amount of leakage or static power consumption. So this property
makes body bias a key knob that can help us trade off leakage power for frequency. And you can think of
this much like how another well-known technique, dynamic voltage and frequency scaling, trades off
frequency for dynamic power.
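To make that trade-off concrete, here is a toy first-order model in Python. The constants, the linear body-effect approximation, the alpha-power delay law and the exponential subthreshold-leakage dependence are textbook simplifications assumed for illustration, not numbers or models from the talk.

```python
# Toy first-order model of the body-bias trade-off (illustrative constants).
# Forward body bias (Vbb > 0) lowers the threshold voltage, which speeds
# transistors up but raises subthreshold leakage exponentially; reverse body
# bias (Vbb < 0) does the opposite.
import math

def vth(vbb, vth0=0.30, body_factor=0.15):
    """Threshold voltage shifted (approximately linearly) by body bias."""
    return vth0 - body_factor * vbb

def gate_delay(vbb, vdd=1.0, alpha=1.3, k=1.0):
    """Alpha-power-law delay model: delay ~ Vdd / (Vdd - Vth)^alpha."""
    return k * vdd / (vdd - vth(vbb)) ** alpha

def leakage(vbb, i0=1.0, n=1.5, vt_thermal=0.026):
    """Subthreshold leakage ~ exp(-Vth / (n * kT/q))."""
    return i0 * math.exp(-vth(vbb) / (n * vt_thermal))

for vbb in (-0.3, 0.0, 0.3):          # reverse, no, and forward body bias
    print(f"Vbb={vbb:+.1f}V  delay={gate_delay(vbb):.3f}  leak={leakage(vbb):.4f}")
```

Running the loop shows the trend described above: forward bias shortens the delay but multiplies leakage, while reverse bias does the reverse.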
Previous work by researchers at Intel has proposed a technique called static fine-grain body biasing for
reducing within-die variation. In this technique a chip with variation is divided into so-called body bias
cells, where each cell is the unit at which the technique is applied, and each cell receives its own body bias
voltage as follows.
Reverse body bias is applied to cells that consume a lot of static power, and forward body bias is applied to
cells that are slow, to speed them up. The overall effect is a reduction in within-die variation in both
frequency and power consumption. This in turn results in improved overall processor frequency -- because,
remember, the frequency of a chip is determined by its slowest critical path, so by improving all the paths we
improve the probability that the slowest critical path will be faster -- and also an overall reduction in power
consumption. So that's one benefit: we reduce within-die variation.
But we also gain an additional control over the chip's frequency and power consumption. And this control
allows a manufacturer to essentially improve the chip's frequency, at the cost of increased power, beyond
what was possible without body bias. So let's see how this is done with an example. Here is a distribution
of chips on a frequency versus leakage power plot, and here I'm isolating just the effects of within-die
variation. So within-die variation causes the dies to distribute something like this. Again I'm plotting the
frequency of each of these dies versus leakage power. This is from our own variation model and its
simulations.
So after manufacturing, the vendor will actually take all these chips, test them, and based on their maximum
frequency place them in so-called frequency bins. I'm showing four here. So the chips that fall, for instance,
into bin four will be sold at 2 gigahertz; bin three would be 2.2, then 2.4, 2.6 and so on.
We've also seen that leakage power can be a significant problem. Right. It varies a lot and it can be very
high. So in addition to splitting the chips into frequency bins, the manufacturer also imposes a total power
limit on these chips, limiting the amount of power they can consume. This total power limit translates into a
leakage power limit for these chips that looks something like this.
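As a small sketch of this binning-plus-power-limit acceptance test (the bin frequencies follow the 2.0-2.6 GHz example above; the 25-watt leakage limit is a made-up placeholder, not a number from the talk):

```python
# Post-manufacturing binning sketch: place each die in the highest frequency
# bin it can sustain, then reject it if its leakage exceeds the limit.
def assign_bin(fmax_ghz, bins=((2.6, 1), (2.4, 2), (2.2, 3), (2.0, 4))):
    """Return (bin_id, rated_freq) for the fastest bin the die sustains,
    or None if it is too slow even for the lowest bin."""
    for rated_freq, bin_id in bins:
        if fmax_ghz >= rated_freq:
            return bin_id, rated_freq
    return None

def accept_die(fmax_ghz, leakage_w, leakage_limit_w=25.0):
    """A die is sold only if it fits some bin and stays under the leakage limit."""
    return assign_bin(fmax_ghz) is not None and leakage_w <= leakage_limit_w
```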
As a result some of these chips will have to be rejected because of high power consumption, and only those
falling in the green region here can be accepted, used and sold. Now normally this would be the end of the
story and the vendor would have to live with this situation. But what I'm going to show is that with fine-grain
body bias we can improve the operating conditions of these chips by trading off frequency against leakage
power consumption, and I'm going to take two examples here.
So let's take a chip that has a fairly low frequency, running here in bin three. We want to run it at a higher
frequency. By carefully applying this fine-grain body bias, we can improve its frequency at the cost of
increased leakage power consumption, and we do that as long as we stay below this power budget. We
cannot push it beyond, because then we would need to throw that chip away.
On the other hand -- yes? Question?
>> Question: The bottom of the bin?
>> Radu Teodorescu: At the bottom of the bin, yeah. Ideally you may want to stop here because there is
no need to push it beyond -- it depends where the frequency of the bin is. It could be here, somewhere in
the middle. Well, actually no, the frequency is here. Say this is 2.4; in that case you may not need to push
the chip all the way up. Here I'm assuming a continuous distribution of frequencies.
On the other hand, if you have a chip that's consuming a lot of leakage power and normally would have to
be thrown away, you can decrease the leakage power consumption of this chip by applying mostly a
reverse body bias at the cost of some loss of frequency and bring this chip within this green acceptable
region here.
So this is done in a post-manufacturing calibration process that has the goal of finding the maximum
frequency that each of these chips can run at. The problem is that after we find this maximum frequency
for each of these chips, the body bias settings for each of the cells that made this F-max reachable are fixed
for the lifetime of the chip, because we have to guarantee that these chips will run at the maximum
frequency F-max under any conditions. Worst-case conditions such as temperature and power consumption
are assumed. For that reason we make the observation that static fine-grain body biasing has to be
conservative.
Which brings us to the motivation for my work, which we call dynamic fine-grain body biasing. Now we
start by making two observations here. In chips today we have significant temperature variation. And this
is observed both in space across different cores, maybe even inside a core from one functional unit to the
other, and also in time as the activity factor and the workload change, for instance as computation migrates
from one core to another. So there is a significant temperature difference on chip in space, which can be
30 or 40 degrees centigrade, and also in time as the activity of the chip changes.
In addition to that we know that there is a very important impact from this temperature. Circuit delay
increases significantly with temperature. So while at low temperature at least some parts of these chips will
be faster, as the chip heats up it will essentially slow down, and at the highest temperature it will be
significantly slower. So we know there is significant temperature variation, and that temperature has a
significant impact on circuit delay.
So for that reason static fine-grain body bias has to assume worst-case temperature when calibrating the
body bias voltages. At that temperature most of the circuits on the die will be slower, allowing less of the
power-saving reverse body bias to be applied and requiring more of the forward body bias to reach the
maximum frequency F-max. And this results in overall higher power consumption. The main idea behind
dynamic fine-grain body biasing is essentially to allow the body bias voltages to change dynamically to
adapt to temperature changes in time, space and with workload.
And the rationale here is that at average temperature, at least some parts of the chip, shown here in blue,
will be significantly cooler. Also, which parts of the chip are cold at any given time will change dynamically.
As a result, at least some parts of the chip will be fast, allowing more reverse body bias to be applied and
requiring less of the forward body bias, which consumes a lot of power, to reach the exact same frequency
F-max, with significantly lower power consumption. So if we define the optimal body bias as the body bias
that requires the minimum amount of power to reach a certain target frequency, we could say that the goal
of dynamic fine-grain body biasing is essentially to keep the body bias optimal at any time as temperature
changes.
Now the big question of course is how do we find what this optimal body bias is? To do that we use a
circuit. We actually dynamically measure the delay of each body bias cell. So we use a circuit for
measuring delay that is placed in each of these body bias cells and the idea is that we want to make sure
that at any time each cell -- for each cell we spend the minimum amount of power to reach a certain delay.
So this delay sampling circuit consists of a critical path replica, which is essentially a circuit that is determined
by the timing analysis tool to be representative of the worst-case delay in that cell. We pair that critical
path replica with a phase detector, which essentially indicates whether our critical path, and therefore the
entire cell, is either speeding up or slowing down as a result of temperature changes when compared to the
reference clock. So what we care about is to make sure that each of these cells is fast enough to meet our
timing requirements to reach the target frequency of the chip, but is not too fast that it's actually wasting
power. We're not applying too much forward body bias and wasting power. So that is what we're trying to
do with this circuitry here.
So the body bias voltages for each of these cells are adjusted dynamically as the temperature changes
until the optimal body bias voltage is reached. Yes?
>> Question: (Inaudible) -- to know what the critical path is like (inaudible)?
>> Radu Teodorescu: This requires you to have a good idea what the critical path is at design time. Yes.
And your timing analysis tool will give you a good sense of that. But there's a caveat here. Because of
variation your critical path may not actually be representative. So let me -- this is important. Let me
answer that in more detail here.
So actually the circuitry that we use for each cell is quite a bit more complicated: we use several of these
critical path replicas, distributed across the cell -- five of them. We also add a small margin of delay, which
means each replica is going to be slower than the actual critical path by a little bit. So that's one way to do
this: we use several replicas in each of the body-bias cells.
There are also some proposals in the literature for measuring the delay of the actual critical paths rather
than replicating a critical path across the cell. So you can also do that.
This circuit is also subject to variation -- yes, and that is the idea, essentially. That is why we take the
critical path that the CAD tool tells us is the critical path, and copy it in various locations in the cell, hoping
that it will be affected by variation the same way as the actual critical path is affected. Yes?
>> Question: So the hope is that this replica would be affected -- within a reasonable margin -- by the same
variation as the original critical path, and because of spatial correlation you're going to place it in (inaudible).
But in certain parts of your micro-architecture, disrupting the layout may have an impact on the wiring that
you would have. For example, the issue/scheduling and wakeup logic, or actual logic that is on your critical
path, and you want to put this in the middle of your (inaudible), for example. That could slow you down quite
a bit.
>> Radu Teodorescu: Yes. Yeah. That's a very good question, and that is actually one of the reasons why
we want to keep this isolated. That's why we replicate the critical path: the replica can have a different
layout. It is not dependent on the layout of your circuit; you just try to squeeze this critical path wherever
you have space in the cell.
Yeah, I acknowledge that in some very dense areas this may be an issue. But by using replicas rather than
measuring the actual critical path, you're less disruptive to the design, because this is something you sort of
lay on top of your original design. I acknowledge that in some very crowded areas, where maybe you have
a lot of wiring going on, it may be a bit of an issue to place all this. But the reason why we're having a
separate circuit rather than measuring the actual critical path is to try to minimize the impact as much as
possible.
Also to point out here is that all these replicas have to be in agreement before a decision is made. For
instance, if at least one of them says this cell is too slow, then forward body bias is applied and the whole
cell is sped up. And the cell will not be slowed down with reverse body bias until all the replicas agree that
there is some slack there. Okay. Let me go back to the slide here.
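To summarize that rule, here is a minimal sketch of the per-cell adjustment just described; the function names, step size and bias limits are illustrative, not taken from the actual design.

```python
# Per-cell dynamic body-bias adjustment: each body-bias cell has several
# critical-path replicas, each with a phase detector reporting whether the
# replica meets the target clock period.
def adjust_cell_bias(current_bias, replica_meets_timing, step=0.05,
                     max_fbb=0.5, max_rbb=-0.5):
    """replica_meets_timing: list of booleans, one per critical-path replica.
    If any replica misses timing, speed the whole cell up (more forward bias).
    Only when every replica shows slack is the cell slowed down (more reverse
    bias) to save leakage."""
    if not all(replica_meets_timing):
        return min(current_bias + step, max_fbb)   # at least one replica too slow
    return max(current_bias - step, max_rbb)       # all replicas have slack

# Called periodically for every cell as temperature and activity change, e.g.:
# for cell_id, readings in phase_detector_samples.items():
#     bias[cell_id] = adjust_cell_bias(bias[cell_id], readings)
```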
So what we find is that by doing this, by applying this dynamic body bias online as the chip runs, this gives
us a lot of flexibility to control both frequency and power. And there's a question here: where do we choose
our optimal point on this frequency versus power curve? Depending on the demands of the system, we
define three different operating environments. In the first, we use dynamic body bias to improve both
frequency and power consumption -- we're trying to find a middle ground here: improve the chip frequency,
but don't spend too much power. This we call the standard environment.
We also looked at the high performance environment in which we spent all the power that we have in our
power budget to maximize frequency dynamically, at any cost, and we see how much frequency we can
get. In the low power environment we don't try to change the original frequency
of the chip with variation, we just try to minimize leakage power as much as possible. So in the interest of
time I'll actually only talk about the standard environment and at the end I'll show results for the other two,
as well.
So to see how dynamic body bias works in this standard environment let me bring back this original
distribution of chips. So here we have the original chip shown in blue, with variation, before any of this is
applied to it. And what we do, just like in the case of static body bias, is find the maximum frequency the
chip can reach under calibration conditions, and this gives us the F-max at maximum temperature.
Now normally at average temperature the chip will actually run with lower leakage, because the temperature
is significantly lower, even with just static body bias. However, by adapting to temperature changes,
dynamic fine-grain body biasing can decrease leakage power consumption even more than the static case
while keeping the frequency the same. So dynamic fine-grain body biasing saves leakage power compared
to the static case, while making sure the chip runs at the exact same frequency F-max. And note that this
F-max is already higher than the original frequency of the chip. So like I said, in the standard environment
we're trying to strike a balance between how much we improve frequency and how much we save power.
So this is all I have about body bias in this talk. I will present, like I said, the results at the end. For now let
me just give you a quick summary of what dynamic body bias achieves. So we find dynamic fine-grain
body biasing to be quite effective at reducing within-die variation. If we look at the original distribution of
chips here, again leakage power versus frequency, before body biasing is applied, we find that static body
biasing narrows the distribution somewhat, but dynamic body biasing is significantly more effective -- note
the much narrower distribution here.
And this results in 40% lower leakage power at the same frequency compared to static body bias in the
standard environment, and a 10% improvement in frequency in the high-performance environment, again
compared to the static body bias case.
So if there are no more questions here I would like to move on to the second part of the talk. Yes? Please.
>> Question: (Inaudible)?
>> Radu Teodorescu: Uh, excellent question. Excellent question. So it turns out that this is -- this
mechanism not only interacts with DVFS, it's actually essential to make DVFS possible. And the reason is
this. At lower voltages if you don't adapt the body bias values that you apply for each of the cells your chip
will actually not work at the same frequency. So if you just do static body bias for the chip at one volt, the
maximum volt, and then you try to change the supply voltage your chip will actually fail. So what you have
to do is essentially adapt the body bias dynamically as you change supply voltage and what we show here
is that the savings in leakage power continue to be significant so continues to scale as supply voltage goes
down.
I'm sorry, this is a fairly crowded plot. I'm showing here dynamic fine-grain body biasing at different
granularities, from one cell to 144 body bias cells. The results I showed before were just for the nominal
supply voltage of 1 volt, where we see a decrease in leakage power. But as we see here, even for lower
supply voltages this difference remains significant. So that is one good result, I think, but the other point is
that you need dynamic fine-grain body biasing in order to be able to scale the supply voltage. And there are
a lot of interesting issues here: essentially, once you have both dynamic fine-grain body biasing and dynamic
voltage and frequency scaling, you have added two knobs that you need to control at the same time. So this
poses an interesting control problem that I have not really addressed. But it's definitely an interesting thing
to look at.
>> Question: (Inaudible) -- so what was -- how did you model the variability?
>> Radu Teodorescu: So we developed our own variation model, and I'm going to talk a little bit about that
in the results section, but not a lot. Essentially we took data from industry for older technologies and scaled
it using projections for variation in the 45 and 32 nanometer technologies. We also abstracted the existing
models, which are at the transistor level, up to micro-architectural units, because we needed a model --
there is no model for variation at the micro-architectural level. So we went through a lot of pain to abstract
away the details of the circuit models in order to model variation at the level of functional units, caches and
so forth. I can talk a lot more about that offline, but in the interest of time let me move on.
>> Question: I don't understand why you really need fine-grain body biasing to enable DVFS.
>> Radu Teodorescu: You don't need fine-grain body biasing necessarily. You need dynamic body
biasing. As you saw in the plot, DFGBB-1 is actually chip-wide body bias. You need to change the body
bias voltages at the same time as you scale the supply voltage, because if you don't, your chip will actually
not run at the target frequency -- at a lower supply voltage you need a different value for the body bias.
You don't necessarily need fine-grain body bias, but as I'm showing here, fine-grain body bias makes a big
difference in power. This is all at the same frequency -- I'm sorry, of course not; this is at the maximum
frequency the chip can achieve at 0.8, 0.6 volts.
So I'm not sure that answers your question. Sure. So we are scaling the frequency here. What we want to
make sure, for all this data, is that the chip with and without body bias works at the same frequency. So we
take the chip, we measure its frequency after variation, we know the frequency of the chip, and we want to
make sure that after applying body bias all we do is save power -- we don't affect the original frequency. If
you determine the body bias at a supply voltage of one volt and then you scale the supply voltage while
leaving the body bias at the same level, your chip will fail; it will not actually run at that original frequency.
>> Question: (Inaudible).
>> Radu Teodorescu: Even at reduced frequency -- it will not run at the reduced frequency of the original
chip either. So it will actually be worse than the original chip at reduced frequency.
I can talk about this a lot more offline. Let me move on now to the second part of the talk.
So we have seen a technique that reduces within-die variation. I will now talk about a solution for tolerating
variation rather than trying to reduce it. Let's say there is nothing we can do about variation: we take the
chip with variation and see how we can design our system around it. The motivation for this work stems
from the observation that large chip multiprocessors will have significant core-to-core variation. So we
model a 20-core CMP in 32 nanometer technology, such as this one here, with variation, and we measure
core-to-core variation in both frequency and power consumption.
So as an illustration of what we find, let's take the fastest core on the die, which is C2, and the slowest core
on the die, which is C20, and compare them. If we look at frequency, what we find is that C2 is actually
30% faster than C20. It also consumes about twice the leakage power of C20, which is quite significant
and translates into roughly 40% higher total power consumption. So this I think is quite significant, because
we have designed identical cores, cores that were meant to be identical, that have dramatically different
properties.
And these are average numbers. For some of the dies that we model the variation is more severe, and we
see up to 50% variation in frequency. All the other cores are somewhere in between these two extreme
cases. So the question we ask now is: how can we exploit this variation? We know that current CMPs
generally run at the frequency of the slowest core, right? So that would be C20. What we can do, with
relatively simple changes to the hardware, is run each core at the maximum frequency that that individual
core can achieve. With this we can buy back some of that frequency loss and get a significant increase in
average frequency, because now we don't have a uniform frequency across the entire CMP. And the
support for actually doing this exists in production today: the AMD quad-core Opteron allows all its cores to
run at different frequencies -- mainly for power reasons, because they want to scale the frequency of each
core dynamically to save power, but one could presume also because of variation. I don't know, I'm just --
yes. Question?
>> Question: Are you assuming that the (inaudible) are largely (inaudible) by variation?
>> Radu Teodorescu: No, I'm not assuming that. I'm assuming that the cache will run at -- well, first of all
I'm assuming that the cache is not really very critical: you can always add a cycle or two to the access time.
So the L2 cache will of course be affected by variation, but we don't consider banking the cache and
running the cache at different frequencies.
>> Question: So --
>> Radu Teodorescu: So the L2 cache will remain at the same frequency. It's only the core and the L1
that run at the higher frequency.
>> Question: So I'm not worried about a couple of additional cycles that you may have for the hit time of
the L2 cache, but if on its own it starts running at only a very small fraction of the core frequency, then that
could become a (inaudible) and maybe even (inaudible).
>> Radu Teodorescu: A small -- I'm not sure. Okay. So consider the differences in frequency here. So
we're talking about 30%. Before it was running at two gigahertz. This would be 2 point something. So the
difference is there. But the frequency -- later on I'll talk about what happens if you scale the supply voltage
so then the frequency may be -- the frequency change will be more significant.
>> Question: (Inaudible).
>> Radu Teodorescu: I am going to assume -- not necessarily that the L2 cache is not affected by
variation, but that it runs at the fastest frequency its slowest part allows. So if there is a critical path in this
particular L2 cache here, that is going to determine the frequency of the cache for everybody. Yes?
So in such a system, after we let each core run at its own frequency, each core will have a different power
consumption and a different operating frequency, so essentially what we have done is turn a homogeneous
system into a heterogeneous system. In such a system it becomes suboptimal to treat all the cores as
equal -- for instance, in the OS scheduler or in the power management subsystem.
So what we propose doing is essentially exposing this variation to the OS and the software. To take
advantage of this knowledge we develop variation-aware scheduling algorithms for the OS and also a
variation-aware power management subsystem. And what we show is that these variation-aware solutions
are better at saving power or improving throughput than the variation-unaware case.
So I'll take these two in turn and first talk a little bit about variation-aware scheduling. This is actually a very
simple idea. We know that one of the jobs of the OS scheduler in a CMP is to assign the applications that
need to run in the system to the available cores. These cartoons here represent applications, and normally
all the cores are treated as equal, meaning that, all other things being equal, applications are randomly
assigned to cores.
In variation-aware scheduling, what we propose is using additional information to guide scheduling
decisions, such as knowledge about the variation in core frequency and power, but also details about
application behavior, such as dynamic power consumption and, for instance, computing density as
measured in instructions executed per cycle, or IPC. In such a system we could also have different goals --
for instance, the goal could be to reduce power or to improve performance -- and the scheduling algorithms
can be adapted to each of these goals.
We examined a number of variation-aware scheduling policies. I will only discuss two of them here, and
you will see they're quite simple, actually. For instance, when the goal of the system is to reduce power
consumption, what the OS scheduler can do is simply give priority to low-power cores in the schedule. So
essentially, if I have many cores to choose from, I will select the cores that consume less power.
In addition to that, applications that have high dynamic power can be assigned to low-power cores, and in
that way we try to balance the power distribution across the CMP and minimize hot spots to further reduce
power. This particular policy we call VarPower. Now when the goal is to improve performance -- I'm
running behind the animation; so these applications are assigned to the low-power cores here.
Now when the goal is to improve performance, the OS instead favors the high-frequency cores in the
schedule. In addition, the applications that have the highest IPC are sent to the highest-frequency cores,
because these applications are doing the most useful work and benefit most from the frequency boost. A
low-IPC application may, for instance, just be waiting on memory or IO, so it's not doing much, and pushing
it to a high-frequency core is not going to buy you much. We call this policy VarPerformance. So both
these algorithms can achieve power savings or can improve throughput compared to a variation-unaware,
naive schedule.
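Here is a sketch of the two policies just described, assuming hypothetical per-core and per-application profiling fields; this captures the gist of the talk, not the exact OS implementation.

```python
# VarPower: favor low-power cores and put power-hungry applications on them.
# VarPerformance: favor high-frequency cores and put high-IPC applications on them.
def var_power(apps, cores):
    """apps: list of dicts with 'dyn_power'; cores: list of dicts with 'static_power'."""
    cores = sorted(cores, key=lambda c: c["static_power"])        # low-power cores first
    apps = sorted(apps, key=lambda a: a["dyn_power"], reverse=True)
    return list(zip(apps, cores))                                 # hottest app -> coolest core

def var_performance(apps, cores):
    """apps: list of dicts with 'ipc'; cores: list of dicts with 'freq'."""
    cores = sorted(cores, key=lambda c: c["freq"], reverse=True)  # fastest cores first
    apps = sorted(apps, key=lambda a: a["ipc"], reverse=True)
    return list(zip(apps, cores))                                 # highest-IPC app -> fastest core
```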
Now let's examine what happens if we add an additional layer of control to this system. We do that in the
form of adding dynamic voltage and frequency scaling for each of these cores. Dynamic voltage and
frequency scaling, as we well know, is a technique for power management in which the voltage and
frequency of a processor are reduced dynamically to track changes in
the application and save power. Now future systems will most likely have core level control over both
voltage and frequency, which means that we'll have the ability to say what voltage and frequency each of
these cores should have. We already have in the AMD opteron the ability to set different frequencies for
each of these cores. We don't yet have the ability to set different voltages, because there are some
challenges that need to be worked out there, but we'll likely see that soon.
So now we have a lot of control and flexibility, right? We can have an independent voltage and frequency
for each of these cores. That's great. The problem is: how do we decide on a voltage and frequency level
for each of these cores? That is not a trivial issue; it is quite a challenge. It's pretty obvious that not all the
cores should have the same voltage and frequency, because some applications benefit more: your
high-computing-density, high-priority applications you want to run at the highest frequency, while others,
your mostly idle processes, you don't want to run at high frequency. So it is pretty obvious not all of them
should have the same voltage and frequency, because not all of them benefit equally from high frequency.
But deciding on voltage and frequency settings for all these cores is not a trivial issue, and in addition
variation makes the problem significantly more difficult, because it introduces uncertainty into this decision
problem. So let's see why.
Let's take an example. This -- I'm going to show here total power consumption versus frequency for a
single CMP that has 20 cores. So we take one core and we scale its supply voltage from one volt all the
way down to 0.6 volts. So as we see here it behaves quite nicely. The power goes down as does the
frequency as we scale the supply voltage. So this is fine. The problem is that when we take a different
core, this is what we get. It behaves completely differently. As we see, the blue core here and the red core
have the same power consumption and the same frequency at two different supply voltages, 0.7 and 0.85
volts. Moreover, the red core is more power efficient above 0.7 volts and less power efficient below 0.7
volts. In addition, as you can see, the red core can reach a higher frequency than the blue core even at the
maximum supply voltage.
And the problem is all the other cores behave differently, as well, which means that all of a sudden we don't
know anything about how our CMP is going to behave. You can say, oh, I'm going to set core three to 0.7
volts and it's not clear what I'm going to get, what frequency and what power consumption. So managing
this system and finding an optimal voltage and frequency for each of these cores becomes significantly
more difficult.
So here's how we attack this problem. To solve it, we treat it as an optimization problem, which we define
as follows. We are given a mapping of threads to cores, for instance produced by one of our variation-aware
scheduling policies like VarPerformance, so we already have a mapping of applications to cores. What we
want to find is the best voltage and frequency for each of these cores with a certain goal. The goal that we
choose is to maximize overall system throughput, as measured in the number of instructions executed in
the system each second, under a set of constraints; the main constraint that we set on this system is to
keep the total power consumption of the entire chip below a certain budget.
For us it makes sense to keep this budget flexible, meaning that we can set it in software: if this is running
on a laptop you could set the power budget at 50 watts, if it is running in a desktop it could be 75 or 100
watts, and you can even change it dynamically depending on whether your laptop is plugged into the wall or
unplugged.
So given this power budget that you give to your system, you want to get the best performance out of it.
The question is what we put into this box: how do we find the best, or as close to the best, voltage and
frequency per core as we can? To solve this we look at a range of solutions to fill in this box. The first,
which has been done in the past, is essentially exhaustive search, which means you look at all the possible
combinations of voltages and frequencies for each of these cores and decide which one is the best. This
has been done in previous work for a four-core CMP that had maybe three different voltage levels. But our
problem is much bigger: we have 20 cores and each core can have 10 different voltage levels. So the
search space for exhaustive search is so large that it is simply infeasible, and we discard that immediately.
The next thing we look at is simulated annealing, which is a probabilistic algorithm for solving optimization
problems. Question? Yes?
>> Question: So as you change the system's voltage and frequency, how do you estimate the performance
that you will get from each application (inaudible)?
>> Radu Teodorescu: That's a good question. So we use profiling to measure the IPC of the application
as it is running. Yes?
>> Question: Sharing a cache so it is going to be a very hard system to model?
>> Radu Teodorescu: A very hard system to what?
>> Question: To model?
>> Radu Teodorescu: To model. Yes, that is a concern and we have seen some of that, but there's not
much we can do about it. There is an impact from the shared cache. We have to make the assumption
that these applications are more or less independent when we run our decision algorithms. Of course they
are not totally independent, because of the cache interference.
>> Question: (Inaudible) performance results for profiling?
>> Radu Teodorescu: Yeah, I'm going to -- let me go through a couple of slides. If that's not clear by then
stop me.
So we look at simulated annealing for solving this optimization problem, and we get the best results with
simulated annealing. However, what we find is that it's computationally very expensive. We want to run
this algorithm online, as programs run, to allow the system to adapt to changes in workloads, power
consumption targets or even optimization goals.
So the solution that we come up with is our own algorithm, which we call LinOpt. It is based on linear
programming and has the advantage of being orders of magnitude faster than simulated annealing, but it
requires us to make some approximations. We compare it to simulated annealing and show that it is close
to optimal.
So let me tell you a little bit about how we cast the problem as a linear optimization problem. If that doesn't
answer your questions, then by all means stop me. We know that linear programming can solve
optimization problems of the following form: we want to maximize an objective function F of independent
variables X1 to XN, subject to constraints of the form that a function G of the same independent variables
X1 to XN has to be less than a constant C. We can have several of these constraints and one objective
function. What we can also have in linear programming is bounds on the value of each of these variables
X. Now the big restriction here is that, by definition, these functions F and G have to be linear. So that is a
big constraint in our system.
The problem that we're trying to solve is cast as a linear optimization as follows. The variables that we're
trying to find, the Xs, are for us the voltage levels for all the cores: what is the voltage for each core. The
objective function, which we're trying to maximize, is the overall system throughput. By definition,
throughput as measured in instructions per second can be computed as the frequency of a core times the
IPC of the application running on that core. That is the throughput for one core; we add up all the cores and
we get total system throughput.
Frequency is mostly a linear function of supply voltage, and the assumption that we make here is that the
IPC does not change as we change the supply voltage and frequency. This is an approximation, of course:
because of memory accesses your IPC will change somewhat. However, the changes are small compared
to the differences in IPC between applications and between phases of the same application. So while there
is a change in IPC, we find that that's not the biggest effect here; the biggest effect is the IPC difference
between applications and also within an application.
So, as a result, because frequency is mostly linear in voltage and IPC is mostly constant in voltage -- now,
another caveat here: we don't need to assume IPC is constant. If we are able to measure the IPC of the
application at two different voltages, we can extrapolate a linear function. So assuming it constant is just
one simplification that we make here; you don't necessarily need to make it. Because the IPC is fairly flat in
voltage we assume it is constant, but you could potentially assume a linear dependence instead.
Because of these linear relationships, we can compute an objective function F that is a linear function of the
supply voltages and of other constants that are characteristic of each core and of each application, like the
IPC. And for us the main constraint is to keep the power under a certain target, so we have to somehow
cast power as a function of supply voltage to write this constraint.
So power is clearly dependent on supply voltage. The problem here is that the dependence is nonlinear.
So what we have to do is essentially linearize the dependence of power on supply voltage, which is the
biggest approximation that we have to make. But what we find is that the results are pretty good compared
to simulated annealing, which does not need to make that approximation, and we quantify that in the
results. So this is how we cast the problem that we're trying to solve as a linear optimization problem. If
you have any questions here please ask now. Yes?
>> Question: How do you get the IPC number?
>> Radu Teodorescu: The IPC numbers are gathered through profiling.
>> Question: But I mean they were --
>> Radu Teodorescu: So you are running the application. Yes, so this is an online algorithm. This runs all
the time.
It is not something static. So you profile the application as it is running and you say, okay, I've seen this
IPC for this application for this long I'm going to predict that the IPC is going to be the same. So based on
my prediction that this IPC is going to be this I'm going to make a decision.
>> Question: So you're --
>> Radu Teodorescu: So based on profiling.
>> Question: So you're also assuming that these applications (inaudible)?
>> Radu Teodorescu: Yes. Yes. That is one big -- that is one assumption that we make because we can't
model that. Potentially if we could model that somehow we could add that into the optimization problem
here, but I'm not exactly sure how to model interference in the caches. Yes?
>> Question: I'm curious about the last line -- how can you linearize the power as a function of voltage?
>> Radu Teodorescu: Well, you can linearize anything you want, provided that you are willing to take the
hit. Well, it depends.
It seemed outrageous to me at first; I didn't consider this at first. I said there is no way this is going to match
a linear function. The question is over what interval you are linearizing the function, and it turned out the
approximation of the nonlinear function is fairly close -- decent enough to give good results. What happens,
because we linearize this nonlinear function and cannot predict precisely, is that we miss our power target
by a few percentage points. Even though we expect the power, as a result of our optimization, to be 50
watts, we may be at 51 or 52. So that does happen.
Okay. Let me move on here.
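To make the formulation concrete, here is a minimal sketch of such a linear program using SciPy's linprog. The per-core frequency and linearized-power coefficients and the IPC values are hypothetical placeholders standing in for the post-manufacturing and runtime profile data; this is a sketch of the idea, not the actual LinOpt code.

```python
# Choose one supply voltage per core to maximize total throughput
#   sum_i ipc[i] * (freq_a[i]*V[i] + freq_b[i])
# subject to the linearized total power constraint
#   sum_i (pow_c[i]*V[i] + pow_d[i]) <= power_budget.
import numpy as np
from scipy.optimize import linprog

def linopt_sketch(ipc, freq_a, freq_b, pow_c, pow_d, power_budget,
                  vmin=0.6, vmax=1.0):
    ipc, freq_a = np.asarray(ipc), np.asarray(freq_a)
    pow_c, pow_d = np.asarray(pow_c), np.asarray(pow_d)
    c = -(ipc * freq_a)                        # linprog minimizes, so negate throughput
    A_ub = pow_c.reshape(1, -1)                # single row: total power constraint
    b_ub = np.array([power_budget - pow_d.sum()])
    bounds = [(vmin, vmax)] * len(ipc)         # allowed voltage range per core
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x                               # one voltage per core

# Example with two made-up cores: the second is faster but leakier.
v = linopt_sketch(ipc=[1.2, 0.8], freq_a=[4.0, 5.0], freq_b=[0.5, 0.3],
                  pow_c=[40.0, 55.0], pow_d=[5.0, 8.0], power_budget=75.0)
```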
So the linear optimization algorithm works together with the OS scheduler. The OS scheduler will
essentially map applications to cores according to one of the variation-aware algorithms, for instance
VarPerformance, and then LinOpt finds the voltage and frequency settings for each of these cores.
LinOpt runs periodically as a system process, either on a spare core, if you have one, or otherwise inside a
power management unit -- an on-chip microcontroller similar to what the Intel Itanium 2's Foxton technology
uses to manage power. So they have something like this: they have essentially a microcontroller that runs
code inside the Itanium 2 for their power management algorithm. So we could use something like that.
All the data that I'm showing assumes LinOpt runs on a spare core. LinOpt uses quite a bit of profiling
information as input, as I mentioned before. Some of this profile information can be collected post
manufacturing, such as the frequency and static power consumption of each core at all the voltage levels.
Other information is dynamic and needs to be profiled at run time, such as the dynamic power and IPC for
each core and application pair.
Now we feed all this information, together with the power target and the optimization goal, to the LinOpt
algorithm, which outputs the best -- well, not the best, but as close as we can get to the optimal -- voltage
and frequency for each of these cores.
Now, for the implementation, we tried running LinOpt at different granularities, from every 500 milliseconds
all the way down to every millisecond. We find that a good interval for the performance versus overhead
tradeoff is to run it roughly every 10 milliseconds, which means it runs several times inside an OS
scheduling interval. So of course depending on how -- yes, question?
>> Question: (Inaudible) running this optimization also requires energy, right?
>> Radu Teodorescu: Yeah.
>> Question: So --
>> Radu Teodorescu: And yeah, yeah, so --
>> Question: Does the cost of running this on an additional core outweigh the gains you can get?
>> Radu Teodorescu: Yeah. That's a good question. So it depends. If you run it on a core, the impact is
higher. If you run it in the microcontroller, which is what we tend to do, the impact is less, because a
microcontroller is very simple and small in area and consumes a lot less power. The Itanium folks report
roughly 0.5% area for their microcontroller, and that's about the power consumption that we assume for
this. If you run it on a core (inaudible).
Okay. If there are no more questions on this I will move on to the evaluation. I think we have about 12
minutes here. Yes?
>> Question: This work (inaudible)?
>> Radu Teodorescu: Parallel applications, um -- so that's a very good question. We did not try this on
parallel applications; what we tried is essentially multiprogrammed workloads that have few dependencies
between applications. My gut feeling is that for parallel applications the optimization would actually try to
keep all the cores at the same voltage if the applications, the threads, are well balanced. If there is
imbalance, you try to favor the threads that are making good progress. If the threads are well balanced,
then you would try to keep them at the same frequency, because if you have barriers and synchronization
there is no point in running one of the threads very fast if it has to wait on a barrier. But that is something
we haven't looked at.
So let me move on to the evaluation. A few words about our evaluation infrastructure. We used our own
process variation model, which we call VARIUS. This was developed in (inaudible) with several of my
colleagues and was published in the IEEE Transactions on Semiconductor Manufacturing. So this is a
variation model that can be used at the micro-architectural level, and with it we can do Monte Carlo
simulations of 200 dies with variation, which we then use to analyze our solutions. We use the SESC
architectural simulator, which was developed in (inaudible), and a range of other tools to model leakage
power. We started with these models at the level of transistors: we model leakage and the effects of
variation with SPICE models, so very detailed transistor-level simulation. But we have to abstract that
away -- at the cache level we assume all the transistors in the cache are going to behave roughly the same,
all the transistors in the ALU are going to be mostly the same; we have to have some abstraction. We also
have tools for estimating temperature.
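As a toy illustration of this kind of Monte Carlo modeling (the grid size, sigmas and correlation length are made-up numbers, not the VARIUS parameters; the split into a spatially correlated systematic component plus an independent random component follows the description above):

```python
# Generate per-die maps of a varying parameter (here threshold voltage) for
# Monte Carlo analysis: a smoothed, spatially correlated systematic component
# plus an independent random component per grid cell.
import numpy as np

def sample_vth_map(n=16, vth_nom=0.30, sigma_sys=0.02, sigma_rand=0.02,
                   corr_cells=4, rng=np.random.default_rng(0)):
    """Return an n x n map of threshold voltages for one die."""
    white = rng.normal(0.0, 1.0, (n, n))
    kernel = np.ones((corr_cells, corr_cells)) / corr_cells**2
    # Smooth white noise over ~corr_cells neighbors so nearby units end up
    # with similar parameters (spatial correlation).
    sys = np.real(np.fft.ifft2(np.fft.fft2(white) *
                               np.fft.fft2(kernel, s=(n, n))))
    sys *= sigma_sys / sys.std()
    rand = rng.normal(0.0, sigma_rand, (n, n))   # independent per-cell component
    return vth_nom + sys + rand

# 200 sample dies, in the spirit of the Monte Carlo runs mentioned above.
dies = [sample_vth_map(rng=np.random.default_rng(i)) for i in range(200)]
```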
For the first piece of work, dynamic fine-grain body biasing, we model a four-core CMP in 45 nanometer
technology at 4 gigahertz. We evaluate fine-grain body biasing at different granularities, ranging from 1 to
144 cells. A cell here is essentially the unit at which body bias is applied, and we want to see how many of
these cells we need before the overhead gets too high. So we look at 1, 16, 64 and 144 cells.
So let's look at dynamic body biasing in the standard case. This plot shows again frequency versus
leakage power. The blue dot represents the average frequency and leakage for the 200 dies that we model
before body bias is applied; all the other results, as you see, are normalized to this value. So let's look first
at how static body bias can improve the chip's frequency. As we can see, this is SFGBB-144, which means
it uses 144 body bias cells. The frequency is improved by about 6%. As the number of cells goes down,
the improvement is smaller; with one cell, static body bias can essentially do nothing here.
On the other hand, dynamic body, chip with dynamic body bias would run at the exact same frequency as
the static case. However, the reduction in leakage is quite significant. These will run with significantly less
leakage power, which ranges between almost 30% to 40% reduction in leakage power for the 144 body
bias cells case. So of course the more body bias cells we have the higher the frequency and lower the
leakage because we have more control over variation in -- on this chip.
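As a rough illustration of why more cells give more control, here is a small Python sketch of a per-cell body bias policy of the kind the dynamic scheme implies: cells that cannot meet the target clock period get forward body bias to speed up, while cells with slack get reverse body bias to cut leakage. The sensitivities, step counts, and example delays are made-up numbers for illustration, not values from the actual design.

```python
# Illustrative per-cell dynamic body bias policy (made-up sensitivities, not the real design).
BIAS_DELAY_FACTOR = 0.02   # fractional delay change per bias step (forward or reverse)
MAX_STEPS = 5              # bias range limit in either direction

def bias_cells(cell_delays, target_period):
    """Return a per-cell setting: +steps = forward bias, -steps = reverse bias."""
    biases = []
    for delay in cell_delays:
        steps = 0
        if delay > target_period:
            # Slow cell: apply just enough forward body bias to meet the clock period.
            while delay > target_period and steps < MAX_STEPS:
                delay *= (1 - BIAS_DELAY_FACTOR)
                steps += 1
            biases.append(+steps)
        else:
            # Fast cell: apply reverse body bias (which also cuts leakage) while the
            # slowed-down cell would still meet the clock period.
            while delay / (1 - BIAS_DELAY_FACTOR) <= target_period and steps < MAX_STEPS:
                delay /= (1 - BIAS_DELAY_FACTOR)
                steps += 1
            biases.append(-steps)
    return biases

# With one cell per chip, a single bias must cover both the fastest and slowest regions;
# with many cells, each region gets its own setting, recovering frequency and leakage.
print(bias_cells([1.05, 0.90, 0.80, 1.10], target_period=1.0))
```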
For the other environments, let me just briefly mention the results. In the high-performance environment, DFG high performance achieves between a 7 and 10% frequency increase compared to the static case and up to 16% compared to the no body bias case, but this comes at a high cost in power. The more interesting case for us, where the results were actually much better than we expected, was dynamic fine-grain body biasing low power, where we got between a 10 and 50% reduction in leakage power compared to the static case, which we thought was quite good actually.
So for the second piece of work that I talked about, variation-aware scheduling and power management, we model a 20-core CMP in 32-nanometer technology at 4 gigahertz. We use multi-program workloads consisting of between 1 and 20 applications that are selected randomly from a pool of SPECint and SPECfp benchmarks, and we construct these multi-program workloads for this CMP.
We evaluate the following power management schemes. They all have the same goal, to maximize throughput, and work with the constraint of keeping power below a certain budget. Here I'm going to show results for 75 watts; in the paper we have the full range of budgets.
We call the first scheme Foxton Plus because it is an enhanced version of what the Intel Itanium 2 does to manage power. Essentially in this scheme we map applications randomly to cores and then we decrease the voltage for each of the cores in small increments, round robin, until we reach the power target. Yes? Question?
>> Question: The illustration shows applications from SPEC FP and SPEC INT, so are there any of the SPEC tests that actually have to wait on the network or user input, or spend any time waiting on anything other than memory?
>> Radu Teodorescu: Not that I know of. They're mostly -- yeah. At most they would (inaudible). When they developed SPEC they went to great lengths to remove I/O and all that. So what we take advantage of is, yes, waiting on memory, but not so much. But there is, I mean, a huge difference between the behavior of the applications in terms of their relation to (inaudible). So some applications are very (inaudible).
>> Question: (Inaudible).
>> Radu Teodorescu: Yeah.
>> Question: (Inaudible). Based on what you said this morning, I would have anticipated that your goal would have been to minimize power and your constraint would have been a fixed throughput. What's the (inaudible) --
>> Radu Teodorescu: That's -- yeah, that's one way to do it. The problem is, what is your throughput goal?
For most general purpose applications, for the more general purpose systems, you don't have a strict timing requirement. If it is your MP3 decoder, yes, fine, you have a frame for (inaudible). But what is the throughput requirement for (inaudible)? For most applications it is very hard to define that in terms of (inaudible).
>> Question: Well, my throughput requirement in PowerPoint is, can I display my set of slides in 90 minutes? What is (inaudible) --
>> Radu Teodorescu: (Inaudible).
>> Question: My point is it is a fixed amount of work. The question is, what is the smallest power it can use, right? I mean, you --
>> Radu Teodorescu: Yeah, I took a different approach because I could not see a way to define a timing
requirement for an application. The way people have attacked this problem, for instance in the embedded market, where you have your radio decoding something and your MP3 player and what-not, is that there you have very strict requirements. So the way they do it there is essentially to say give me the minimum power for a certain timing. I took a different approach because I think it makes sense for, you know, a general purpose system to say: I give you this budget, give me the best performance.
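A minimal Python sketch of the Foxton Plus baseline described above, as I understood it from the talk: map applications to cores at random, then lower core voltages in small round-robin steps until the chip is under the power budget. The power model and step size are placeholders, not the ones in the paper or in the real Foxton controller.

```python
import random

# Placeholder power model: per-core power scales roughly as V^3 (illustrative only).
def core_power(voltage, base_power):
    return base_power * voltage ** 3

def foxton_plus(apps, core_base_power, budget, v_max=1.0, v_min=0.6, step=0.01):
    """Random app-to-core mapping, then lower voltages round robin until under budget."""
    mapping = list(zip(random.sample(apps, len(apps)), range(len(core_base_power))))
    volts = [v_max] * len(core_base_power)

    def total_power():
        return sum(core_power(volts[c], core_base_power[c]) for _, c in mapping)

    while total_power() > budget and any(volts[c] > v_min for _, c in mapping):
        for _, c in mapping:                  # one small decrement per core, round robin
            volts[c] = max(v_min, volts[c] - step)
    return mapping, volts

# Example: 4 apps on a 4-core chip where core 0 burns a bit more power than the others.
print(foxton_plus(["a", "b", "c", "d"], [30.0, 25.0, 25.0, 25.0], budget=75.0))
```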
So our proposed scheme is variation-aware performance scheduling plus linear optimization, which essentially means that we use the variation-aware performance scheduling algorithm to map applications to cores, mapping the high-IPC applications to high-frequency cores, and then we use LinOpt, the linear optimization step, to find the voltage for each of these cores. We compare it to variation-aware performance scheduling plus simulated annealing, which is our upper bound. We want to see how close we get to that.
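To make the two steps concrete, here is a small Python sketch under simplified assumptions: the mapping step sorts applications by IPC and cores by frequency and pairs them up, and the linear-optimization step is reduced to a toy linear program (via scipy.optimize.linprog) in which both throughput and power are treated as linear in core voltage. The coefficients and the example numbers are invented for illustration; the actual formulation in the paper is more detailed.

```python
import numpy as np
from scipy.optimize import linprog

def map_apps_to_cores(app_ipc, core_freq):
    """Pair the highest-IPC applications with the highest-frequency cores."""
    apps = sorted(range(len(app_ipc)), key=lambda a: app_ipc[a], reverse=True)
    cores = sorted(range(len(core_freq)), key=lambda c: core_freq[c], reverse=True)
    return list(zip(apps, cores))

def choose_voltages(mapping, app_ipc, core_freq, power_slope, budget,
                    v_min=0.6, v_max=1.0):
    """Toy linearized optimization: maximize total throughput under a power budget.

    Assumes throughput_i ~ ipc_i * freq_i * v_i and power_i ~ power_slope_i * v_i,
    both linear in voltage -- a deliberate simplification of the real model."""
    perf = np.array([app_ipc[a] * core_freq[c] for a, c in mapping])
    slope = np.array([power_slope[c] for _, c in mapping])
    res = linprog(c=-perf,                      # linprog minimizes, so negate throughput
                  A_ub=[slope], b_ub=[budget],  # keep total power under the budget
                  bounds=[(v_min, v_max)] * len(mapping))
    return dict(zip([c for _, c in mapping], res.x))

# Hypothetical 4-core example: faster cores also tend to burn more power.
app_ipc = [2.0, 1.5, 1.2, 0.8]
core_freq = [4.2, 4.0, 3.8, 3.5]
power_slope = [22.0, 20.0, 18.0, 15.0]
mapping = map_apps_to_cores(app_ipc, core_freq)
print(choose_voltages(mapping, app_ipc, core_freq, power_slope, budget=60.0))
```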
So these are our results. This plot shows throughput for the three schemes, measured in millions of instructions per second, and we look at the three power management schemes for different experiments. We look at a lightly loaded system with four threads running on the 20-core CMP, all the way up to 20 threads, a fully utilized system, running on this 20-core CMP.
So these are the results; everything is normalized to the (inaudible) here. What we find is that our proposed scheme achieves a throughput improvement of between 12 and 17%, and also that it comes within roughly 2% of the simulated annealing solution, which uses almost none of the approximations that we made in the LinOpt algorithm.
So to sum up, we want to answer the following question: how much of the performance and power that was lost to variation have we recovered with each of these schemes? Why are we bothering to do all of this? So in the case of dynamic fine-grain body biasing we first look at frequency, and here the first bar shows the frequency of all the chips that we model without variation. We use the exact same model, but with nominal values; we simply abstract all the variation away. What we find is that because of within-die variation we lose roughly 15% of frequency. Now, DFG standard is able to recover about a third of that frequency loss, while DFG high performance recovers almost all of the frequency lost, but this comes at a high price in power consumption.
On the leakage side, because of within-die variation we see a 25% increase in leakage power. However, for leakage reduction, as we have seen, dynamic body bias is even more effective, not only at recovering the leakage lost to variation, but actually improving over the no-variation case, because it is able to adapt to temperature. In fact, DFG low power is roughly 35% lower than the no-variation case, which we thought was quite good.
Now for the second technique we look at the effect of variation on overall system throughput at the same power. The power constraint here is 75 watts, with 20 threads running on the 20-core CMP. What we find is that because of variation we lose about 15% of the system throughput. Again, 20 threads, 20 cores. And with our best scheme, variation-aware performance scheduling plus LinOpt, we are able to recover most of these losses.
So we can conclude that both techniques recover most of the losses that were caused by process variation, in some cases at a high price in power, like in DFG high performance, but in other cases, like in the case of leakage, we actually do better than the no-variation case.
So this is all I have. Let me, if I have time, talk a little bit about my future work. Okay, it's just a couple of slides.
So in the longer term I will be looking further down into the future. Today semiconductor road maps are predicting that in a few years, at the 11-nanometer technology node, we'll probably have 100 billion or more transistors on a single chip. This, if current trends continue, will translate into hundreds if not thousands of cores on a die. We'll have a lot of resources. The problem is that at the 11-nanometer node reliability problems will get exponentially worse, and the problems are many. One reason is variation. Another reason is the accelerated aging that we're already seeing in these systems, increased susceptibility to soft errors, and plenty of other problems.
So in this environment some cores will fail immediately, right out of manufacturing; you cannot just throw the system away because two out of your 2,000 cores have failed, so we have to design around that. Other cores will fail over time because of aging, maybe two weeks or two months after we deploy the system in production. Still other cores will fail intermittently as the environment in which the systems run changes.
So we have to learn how to successfully manage such an unreliable system. What I'm looking to build is an integrated approach to system reliability that spans multiple levels of the computing stack, starting with devices, centered around the micro-architecture, which is my level of expertise, but going up to the OS and software as well.
And so I believe that essentially by increasing collaboration across these layers and working with researchers that have expertise in different domains, we can achieve far more effective solutions than would otherwise be possible. As an example of how such a system would work, let's take the case of timing errors, for instance variation-induced timing errors. These are errors caused by the inability of a signal to propagate within a single clock cycle, causing the wrong value to be latched.
Timing errors are more likely to occur at high temperature or with an unstable supply. So what we could have at the circuit level is environment sensors, devices that can detect when the temperature is too high, the voltage is unstable, and so on. At that point this layer can inform the micro-architectural layer, which can enable detection and correction techniques, such as ECC or hardware redundancy, that would normally be disabled to save power.
Now let's say the error rate grows too high to be managed at the micro-architectural layer; we could bring in an OS intervention, and the OS would essentially migrate computation away from the failing core, either permanently, because the core is no longer usable, or temporarily, until the conditions that caused the spike in error rate disappear.
At the software and compiler levels we could have, for instance, different versions of the same code that run depending on the reliability of the core. For instance, we could have a version of the code that is optimized for performance, and this runs on reliable cores, or a version optimized for reliability, with significant hardware/software redundancy or checking built in, that could run on the (inaudible). So the bottom line here is that I believe integrated solutions, such as this one, will become key to tackling the daunting reliability challenges that we're likely to face in future systems.
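As a rough illustration of the cross-layer idea just described, here is a short Python sketch of an escalation policy: circuit-level sensors raise flags, the micro-architectural layer turns protection on or off, and above an error-rate threshold the OS migrates work off the core. The thresholds, sensor names, and actions are all hypothetical placeholders, not a design from the talk.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would come from circuit characterization.
TEMP_LIMIT_C = 95.0
VDROOP_LIMIT = 0.05       # fraction of nominal supply voltage
ERROR_RATE_LIMIT = 1e-3   # detected timing errors per cycle the hardware can absorb

@dataclass
class CoreStatus:
    temperature_c: float
    voltage_droop: float
    timing_error_rate: float

def reliability_action(core: CoreStatus) -> str:
    """Pick the lowest layer that can handle the current conditions."""
    stressed = (core.temperature_c > TEMP_LIMIT_C or
                core.voltage_droop > VDROOP_LIMIT)
    if core.timing_error_rate > ERROR_RATE_LIMIT:
        # Too many errors for hardware correction: the OS migrates work away.
        return "os_migrate_thread"
    if stressed:
        # Sensors report risky conditions: enable ECC / redundancy normally kept off.
        return "uarch_enable_protection"
    # Nominal conditions: keep protection disabled to save power.
    return "uarch_disable_protection"

print(reliability_action(CoreStatus(99.0, 0.02, 1e-5)))   # -> uarch_enable_protection
print(reliability_action(CoreStatus(80.0, 0.01, 5e-3)))   # -> os_migrate_thread
```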
So before I finish, let me tell you about some other work that I did as part of my PhD (inaudible), on a slightly different topic. This work focused on developing hardware support to help software, to help debug software essentially. The main idea was that the process of actually searching for bugs should continue at the user's side, on your laptop, desktop or server, to catch all those bugs that escape during development, and we know there are plenty. The problem is that most debugging tools come with a very heavy performance impact. Throughout this work we show that with the help of hardware support we can debug software with very little overhead. In this context I worked on developing a prototype of a processor with fast software-controlled checkpointing and rollback support. This was a fully working prototype running on an FPGA chip, and it could run an operating system; this is what we used as a test bed. The idea was to use this low-overhead rollback ability to quickly undo faulty executions and look for bugs.
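A toy Python sketch of how software might use such checkpoint/rollback support for bug hunting; the checkpoint here is just an in-memory copy, and the function names are hypothetical stand-ins for whatever the real hardware interface exposes.

```python
import copy

def run_with_rollback(program_state, step, check, max_retries=1):
    """Snapshot state, run a chunk of work, and on failure roll back and re-run instrumented.

    'step' mutates the state; 'check' returns True if the resulting state looks sane.
    With hardware checkpointing, snapshot/restore would be nearly free instead of a deep copy.
    """
    checkpoint = copy.deepcopy(program_state)       # stand-in for a hardware checkpoint
    step(program_state, instrumented=False)         # fast, uninstrumented execution
    if check(program_state):
        return program_state
    for _ in range(max_retries):
        program_state = copy.deepcopy(checkpoint)   # roll back the faulty execution
        step(program_state, instrumented=True)      # re-run with heavy bug detection on
        if check(program_state):
            break
    return program_state

# Tiny usage example with a hypothetical step that only behaves when instrumented.
def step(state, instrumented):
    state["x"] = state.get("x", 0) + 1 if instrumented else state.get("x", 0) - 1

print(run_with_rollback({"x": 0}, step, check=lambda s: s["x"] > 0))
```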
Another project I was involved in was the development of a hardware implementation of the lockset data race detection algorithm, for those of you who are familiar with that, and we managed to reduce the overhead of actually running lockset from 10X down to a few percentage points. Together with (inaudible) at Intel I also worked on log-based architectures for (inaudible) monitoring of production code, which had kind of the same idea of online monitoring, and I can talk a little bit about that offline.
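For readers not familiar with lockset, here is a minimal software version of the basic algorithm in Python; the hardware implementation mentioned above accelerates essentially this bookkeeping. Each shared variable keeps the set of locks consistently held when it is accessed, and an empty set signals a potential race. This is the textbook Eraser-style formulation, not the specific hardware design from the talk.

```python
# Minimal software lockset (Eraser-style): candidate locks per shared variable.
candidate_locks = {}   # variable name -> set of locks held on every access so far

def access(var, locks_held):
    """Refine the lockset for 'var'; warn if no common lock protects all accesses."""
    if var not in candidate_locks:
        candidate_locks[var] = set(locks_held)     # first access initializes the set
    else:
        candidate_locks[var] &= set(locks_held)    # intersect with currently held locks
    if not candidate_locks[var]:
        print(f"potential data race on {var!r}")

# Thread 1 accesses x while holding lock A; thread 2 accesses x holding only lock B.
access("x", {"A"})
access("x", {"B"})    # intersection becomes empty -> potential data race reported
```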
Thank you very much for your attention. Sorry for running over time here. But thank you very much for
coming. (Applause)