>> Engin Ipek: Hi, everybody. It is my distinct pleasure to have Radu Teodorescu with us today. Radu is a PhD candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. And today he will talk to us about his work on addressing parameter variation with architectural techniques. >> Radu Teodorescu: Thank you very much, Engin. Good morning, everyone, and thank you for coming to my talk. Today I'm going to talk about the work that I did as a graduate student at the University of Illinois on helping Moore's Law through architectural techniques that address parameter variation. As we all know, technology scaling continues to follow Moore's Law, delivering more transistors with every new generation of microprocessors. This has translated into increased computing power at lower cost and has been a driver for the computing industry for many decades now. However, as transistors reach deep nanometer sizes, this continued scaling faces major challenges. Some of these challenges originate in the chip manufacturing process, while others originate in the environment in which the chips run. On the manufacturing side I'll mention just a couple. One is sub-wavelength lithography, which is the process of actually printing transistors onto the silicon die. It uses a light source with a wavelength of 193 nanometers to project much smaller feature sizes, on the order of 35 nanometers in the current generation. This leads to imprecision and difficulty in controlling these feature sizes, which in turn leads to variation in the effective length of the transistor, one of its important parameters. Other problems include variation in dopant density in the transistor channel: as the channel gets smaller we have fewer dopant atoms, which causes variation in parameters such as the threshold voltage of the transistor. And there are many more manufacturing challenges, but we also have environmental factors that impact the functioning of these transistors. Some include temperature variation and supply voltage fluctuation. So there are many problems. These manufacturing and environmental factors cause significant variation in transistor parameters that we care about a lot, such as switching speed, which determines the operating frequency of a transistor and of the entire chip, and leakage power consumption, which is an important component of the total power consumption of our systems today. So if we were to plot the probability density function of any of these parameters -- let's say switching speed -- for all the transistors that we have in one of our chips today, we would expect to see a function that looks something like this, with most transistors falling inside this nominal region. Instead, what we see, again for switching speed, is a distribution that looks more like this, with some of the transistors actually faster, most of them slower, and fewer of them falling inside this nominal region here. Now this behavior has a direct impact on current microprocessors such as AMD's quad-core Opteron, for instance. This is just an example; I'm not implying anything about AMD's technology here. It also has an impact on, for instance, Intel's 80-core supercomputer-on-a-chip prototype. So it affects current and future systems, resulting in reduced chip frequency compared to what would otherwise be possible without variation, significantly higher power consumption, and decreased reliability.
Now to get a sense of the very real effect of variation, let's look at some actual data as measured by researchers at Intel. This plot shows the normalized frequency versus leakage power consumption for a set of chips manufactured in what is now an old technology, 130 nanometers. Their distribution on this frequency versus leakage plot looks something like this, where each dot represents the frequency versus leakage power of a single die. As we see, we have 30% variation in frequency across these dies and up to 20 times variation in leakage power consumption. So while we would expect a chip to run somewhere here, at high frequency and with low leakage power -- yes, question? >> Question: (Inaudible) all of those dots, like the overall mass in the left corner versus the outliers? >> Radu Teodorescu: Well, I don't actually have that, because I didn't have access to the numbers. I had essentially just a plot; this is really an illustration that I did. Yeah? >> Question: (Inaudible) -- independent of the actual probability mass? >> Radu Teodorescu: So the point that I'm trying to make here is essentially that there is significant variation both in frequency and in leakage power. Now, how many of these chips actually fall within a smaller variation region here I don't know precisely. I will show you data for our own models. And that may make -- >> Question: So are we plotting here the overall chip frequency and leakage, or the individual transistors? >> Radu Teodorescu: That's a good question. So this is chip-to-chip variation. It's the overall frequency versus leakage power for the entire chip. Later on we'll see this broken down into chip-to-chip and within-chip variation, but that's coming up in a second. So most of the chips will have lower frequency and higher leakage power consumption than we would expect. And by some estimates one generation of process scaling is actually lost to process variation. So this is a very significant problem. For the rest of this presentation I'm going to talk about my work on getting some of these losses back. First, a little bit more background on variation. Question? >> Question: Like the processors you showed before. >> Radu Teodorescu: Uh-huh. >> Question: Like, are these the ones that were manufactured or the ones that were actually sold? >> Radu Teodorescu: They're the ones that were manufactured. >> Question: Okay. >> Radu Teodorescu: In fact some of these you can't sell because of the high power consumption, and you have to throw them away. >> Question: Okay. >> Radu Teodorescu: Yes? >> Question: So are the operating conditions the same for the different chips? >> Radu Teodorescu: I would assume yes. Intel gave these measurements, and the way they do these measurements is post-manufacturing: they measure each die on the wafer under the same conditions. Yes. So variation occurs at different levels. If we examine a wafer of microprocessors such as this one, containing Core 2 Duos, we will see that some of the chips are fast, shown here in red, while others are slow, shown here in yellow. So variation will make individual chips different from one another, and this is generally referred to as die-to-die variation. Further, if we look inside a single die, such as this 4-core chip, we will notice that there is significant variation inside the die, and this is referred to as within-die variation.
Now within-die variation also has an interesting property in that it is spatially correlated, meaning that transistors that are close to each other tend to have similar properties. For instance, the regions shown here in red are fast but consume a lot of leakage or static power, while the ones shown here in white and yellow are slower but more power efficient. So my work has focused mainly on addressing the problems caused by within-die variation, and that's what I'm going to talk about next. So here's an overview of the solutions that I propose and am going to talk about today. Now even though variation occurs at the level of circuits and devices, its effects on performance, power and reliability are felt higher up: at the micro-architectural layer, which I care about, but also at the run-time system layer, the OS and software. That is the reason why I developed architectural solutions that span different levels of the computing stack. I will discuss two such techniques today. The first one is aimed at variation reduction: we try to reduce the effect of variation inside the die. The second technique is aimed at variation tolerance: we try to design our system around variation. So in the first solution we look at a chip multiprocessor with four cores with variation, and the goal of this solution, like I said, is to reduce within-die variation. To do that, we partition the chip into multiple cells, and using a technique called body bias applied to each of these cells we can either reduce the leakage power of cells that consume a lot of power, the ones shown here in red, or speed up cells that are slow, shown here, resulting in reduced within-die variation. We call this dynamic fine-grain body biasing, and it is what I'm going to talk about in the first part of the talk. Now in the second part of the talk, in the second work, I took a different approach. The second technique is aimed at tolerating variation in a large chip multiprocessor, such as this 20-core CMP. The idea here is that because of variation these CMPs will have cores that, even though they are designed to be identical, will behave differently, meaning they'll have different power consumptions and different frequency potentials. So to manage a system like this we developed variation-aware application scheduling and power management algorithms for the OS and the power management subsystem. For instance, these algorithms can map applications to high-frequency cores, as shown here in red -- the flowery icons are applications -- when the goal is to improve the performance of the system, or to low-power cores when the goal is to save power. In addition, by combining knowledge about the characteristics of the cores with the characteristics and requirements of the applications, we developed a power management subsystem that is able to improve throughput at the same power consumption compared to a variation-unaware system. This is what I am going to talk about in the second part of the talk. So here is an outline of my talk. I will describe the two solutions: first dynamic fine-grain body biasing, which is work that appeared in last year's International Symposium on Microarchitecture, and then the second solution, variation-aware scheduling and power management, which will appear in this year's ISCA. I will evaluate both techniques together at the end and finally say a few words about my plans for future work. So let's start with dynamic fine-grain body biasing.
Now body biasing is a well known technique and has been used for a long time for controlling threshold voltage. In body biasing, a voltage is applied between the source or drain and the substrate of a group of transistors. Depending on the polarity of the voltage that gets applied we can have two types of effects. One is forward body bias, which essentially increases the switching speed of these transistors at the cost of increased leakage power consumption. The opposite is true of reverse body bias, which decreases the speed of these transistors with the benefit of saving a significant amount of leakage or static power. So this property makes body bias a key knob that can help us trade off leakage power for frequency, much like another well known technique, dynamic voltage and frequency scaling, trades off frequency for dynamic power. Previous work by researchers at Intel has proposed a technique called static fine-grain body biasing for reducing within-die variation. In this technique a chip with variation is divided into so-called body bias cells, where each cell is the unit at which the technique is applied, and each cell receives its own body bias voltage as follows: reverse body bias is applied to cells that consume a lot of static power, and forward body bias is applied to cells that are slow, to speed them up. The overall effect is a reduction in within-die variation in both frequency and power consumption. This in turn results in improved overall processor frequency -- because remember, the frequency of a chip is determined by its slowest critical path, so by improving all of them we improve the probability that the slowest critical path will be faster -- and also an overall reduction in power consumption. So that's one benefit: we reduce within-die variation. But we also gain additional control over the chip's frequency and power consumption. And this control allows a manufacturer to improve the chip's frequency, at the cost of increased power, beyond what was possible without body bias. So let's see how this is done with an example. This is a distribution of chips on a frequency versus leakage power plot, and here I'm isolating just the effects of within-die variation. So within-die variation causes the dies to distribute something like this. Again I'm plotting the frequency of each of these dies versus leakage power. This is from our own variation model and simulation. So after manufacturing, the vendor will take all these chips, test them, and based on their maximum frequency place them in so-called frequency bins. I'm showing four here. The chips that fall, for instance, into bin four will be sold at 2 gigahertz; bin three would be 2.2, then 2.4, 2.6 and so on. We've also seen that leakage power can be a significant problem -- it varies a lot and it can be very high. So in addition to splitting the chips into frequency bins, the manufacturer also imposes a total power limit on these chips and limits the amount of power that they can consume. This total power limit translates into a leakage power limit for these chips that looks something like this. As a result some of these chips will have to be rejected because of high power consumption, and only those falling in the green region here can be accepted, used and sold.
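As a rough illustration of why body bias trades leakage for speed (these are standard first-order device equations, not formulas from the talk): the body-to-source voltage V_BS shifts the threshold voltage, and the threshold voltage in turn sets both subthreshold leakage and gate delay.

```latex
% Body effect (first order): forward body bias (V_{BS} > 0 for NMOS) lowers V_th,
% reverse body bias raises it.
V_{th} \;=\; V_{th0} \;+\; \gamma\!\left(\sqrt{\bigl|2\phi_F - V_{BS}\bigr|} \;-\; \sqrt{2\phi_F}\right)

% Subthreshold leakage grows exponentially as V_th drops
% (n and the thermal voltage V_T = kT/q are device constants):
I_{\mathrm{leak}} \;\propto\; e^{-V_{th}/(n V_T)}

% Gate delay (alpha-power law) shrinks as V_th drops:
t_d \;\propto\; \frac{V_{dd}}{\bigl(V_{dd}-V_{th}\bigr)^{\alpha}}
```

So forward bias buys speed at an exponential cost in leakage, while reverse bias gives that leakage back at a modest cost in speed; this is the per-cell knob that lets the scheme move a chip around within these frequency bins and power limits.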
Now normally this would be the end of the story and the vendor would have to live with this situation. But what I'm going to show is that with fine-grain body bias we can improve the operating conditions of these chips by trading off some frequency for leakage power, and I'm going to take two examples here. So let's take a chip that has a fairly low frequency, sitting here in bin three. We want to run it at a higher frequency. What we can do, by carefully applying this fine-grain body bias, is improve its frequency at the cost of increased leakage power consumption. And we do that as long as we stay below this power budget. We cannot push it beyond, because then we would need to throw that chip away. On the other hand -- yes? Question? >> Question: The bottom of the bin? >> Radu Teodorescu: At the bottom of the bin. Yeah. Yes. Yes. I mean, ideally you may want to stop here because there is no need to push it beyond. It depends where the frequency of the bin is. It could be here somewhere in the middle. This could be -- well, actually no, the frequency is here. Say this is 2.4, so in that case you may not need to push the chip all the way up. Yes. Here I'm assuming a continuous distribution of frequency. On the other hand, if you have a chip that's consuming a lot of leakage power and normally would have to be thrown away, you can decrease its leakage power consumption by applying mostly reverse body bias, at the cost of some loss of frequency, and bring this chip within this green acceptable region here. So this is done in a post-manufacturing calibration process that has the goal of finding the maximum frequency that each of these chips can run at. The problem is the following. After we find this maximum frequency for each of these chips, the body bias settings for each of the cells that made this F-max reachable are fixed for the lifetime of the chip, because we have to guarantee that these chips will run at the maximum frequency, at F-max, under any conditions. Worst-case conditions such as temperature and power consumption are assumed. For that reason we make the observation that static fine-grain body biasing has to be conservative. Which brings us to the motivation for my work, which we call dynamic fine-grain body biasing. We start by making two observations here. In chips today we have significant temperature variation. And this is observed both in space, across different cores, maybe even inside a core from one functional unit to another, and also in time, as the activity factor and the workload change, for instance as computation migrates from one core to another. So there is significant temperature difference on chip in space, which can be 30, 40 degrees centigrade, and also in time, as the activity of the chip changes. In addition to that, we know that temperature has a very important impact: circuit delay increases significantly with temperature. So while at low temperature at least some parts of these chips will be faster, as the chip heats up it will essentially slow down, and at the highest temperature it will be significantly slower. So we know there is significant temperature variation and that temperature has a significant impact on circuit delay. For that reason, static fine-grain body bias has to assume worst-case temperature when calibrating the body bias voltages.
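To put the temperature argument in symbols (again a first-order textbook relation, not the speaker's model): gate delay rises with temperature mainly because carrier mobility degrades roughly as a power of absolute temperature, which at nominal supply voltages outweighs the slight drop in threshold voltage.

```latex
% Alpha-power-law delay with temperature-dependent mobility (first-order sketch):
t_d(T) \;\propto\; \frac{V_{dd}}{\mu(T)\,\bigl(V_{dd}-V_{th}(T)\bigr)^{\alpha}},
\qquad \mu(T) \propto T^{-m}, \quad m \approx 1.5
```

Since a static calibration must guarantee F-max at the hottest, slowest operating point, it is stuck with worst-case settings even when the chip spends most of its time much cooler.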
At that temperature most of the circuits on this die will be slower, allowing less of the power-saving reverse body bias to be applied and requiring more of the forward body bias to reach the maximum frequency F-max. And this results in overall higher power consumption. The main idea behind dynamic fine-grain body biasing is essentially to allow the body bias voltages to change dynamically to adapt to temperature changes in time, in space and with the workload. And the rationale here is that at average temperature at least some parts of the chip, shown here in blue, will be significantly cooler. Also, which parts of the chip are cold at any time will change dynamically. As a result, at least some parts of the chip will be fast, allowing more reverse body bias to be applied and requiring less of the forward body bias, which consumes a lot of power, to reach the exact same frequency F-max with significantly lower power consumption. So if we define the optimal body bias as the body bias that requires the minimum amount of power to reach a certain target frequency, we could say that the goal of dynamic fine-grain body biasing is essentially to keep the body bias optimal at any time as temperature changes. Now the big question of course is how do we find what this optimal body bias is? To do that we use a circuit: we dynamically measure the delay of each body bias cell. So we use a circuit for measuring delay that is placed in each of these body bias cells, and the idea is that we want to make sure that at any time, for each cell, we spend the minimum amount of power to reach a certain delay. This delay-sampling circuit consists of a critical path replica, which is a circuit that is determined by the timing analysis tool to be representative of the worst-case delay in that cell. We pair that critical path replica with a phase detector, which will indicate whether our critical path, and therefore the entire cell, is either speeding up or slowing down as a result of temperature changes when compared to the reference clock. So what we care about is to make sure that each of these cells is fast enough to meet our timing requirements, to reach the target frequency of the chip, but is not so fast that it's actually wasting power -- that we're not applying too much forward body bias and wasting power. So that is what we're trying to do with this circuitry here. The body bias voltages for each of these cells are adjusted dynamically as the temperature changes until the optimal body bias voltage is reached. Yes? >> Question: (Inaudible) -- to know what the critical path is like (inaudible)? >> Radu Teodorescu: This requires you to have a good idea what the critical path is at design time, yes, and your timing analysis tool will give you a good sense of that. But there's a caveat here: because of variation your critical path replica may not actually be representative. So let me -- this is important -- let me answer that in more detail here. So actually the circuitry that we use for each cell is quite a bit more complicated. We actually use several of these critical path replicas, distributed across the cell -- five of them. And we also add a margin of error, a small margin of delay, which means the replica is going to be slower than the actual critical path by a little bit. So that's one possibility. We use several of these replicas in each body bias cell.
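A minimal sketch of the per-cell adjustment loop, in Python-style pseudocode (an illustration with assumed names, step sizes and limits, not the actual hardware controller). It takes the pass/fail outputs of the phase detectors, one per critical path replica, and applies the agreement rule described shortly: add forward bias as soon as any replica misses timing, and back off toward reverse bias only when every replica shows slack.

```python
# Hypothetical per-cell body-bias control loop (illustration only).
# Each phase detector reports whether its critical-path replica, plus a small
# safety margin, still meets timing against the reference clock.

VBB_STEP = 0.01               # assumed body-bias adjustment step, in volts
VBB_MIN, VBB_MAX = -0.5, 0.5  # assumed reverse / forward body-bias limits

def adjust_cell(vbb, replicas_meet_timing):
    """Return the next body-bias voltage for one cell.

    replicas_meet_timing[i] is True if replica i is at least as fast as the
    target clock period.
    """
    if not all(replicas_meet_timing):
        # At least one replica is too slow: add forward body bias to speed
        # the whole cell up, at the cost of extra leakage.
        return min(vbb + VBB_STEP, VBB_MAX)
    # Every replica has slack: move toward reverse body bias to save leakage,
    # one small step at a time so the cell never drops below the target speed.
    return max(vbb - VBB_STEP, VBB_MIN)

def control_step(cell_vbbs, sample_cell):
    """One periodic pass over all body-bias cells on the die.

    sample_cell(i) returns the list of phase-detector outputs for cell i.
    """
    return [adjust_cell(vbb, sample_cell(i)) for i, vbb in enumerate(cell_vbbs)]
```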
There are also proposals in the literature for measuring the delay of the actual critical paths, so sampling the real critical path rather than replicating it across the cell. So you can also do that. This circuit is also subject to variation. Yes -- and that is the idea, essentially. That is why we take the critical path that the CAD tool tells us is the critical path and copy it in various locations on this cell, hoping that it will be affected by variation in the same way as the actual critical path. Yes? >> Question: So the voltage at which this replica is affected could differ by some margin from that of the original critical path, and because of spatial correlation you're going to place it in (inaudible). But in certain parts of your micro-architecture, disrupting the layout may have an impact on the wiring that you would have. For example, if the issue and wakeup logic or the select logic is on your critical path and you want to put this in the middle of your (inaudible), for example, that could slow you down quite a bit. >> Radu Teodorescu: Yes. Yeah. That's a very good question, and that is actually one of the reasons why we want to keep this isolated. That's why we replicate the critical path: you can have a different layout, this is not dependent on the layout of your circuit, you just try to squeeze this critical path wherever you have space in the cell. Yeah, I acknowledge that in some very dense areas this may be an issue. But by using replicas rather than measuring the actual critical path you're less disruptive to the design, because this is something you sort of lay on top of your original design. It may, in some very crowded areas where maybe you have a lot of wiring going on, be a bit of an issue to place all of this. But the reason why we're having a separate circuit rather than actual critical path measurement is to try to minimize the impact as much as possible. Another thing to point out here is that all of these replicas have to be in agreement before a decision is made. For instance, if at least one of them says this cell is too slow, then forward body bias is applied and the whole cell is sped up. And the cell will not be slowed down with reverse body bias until all the replicas agree that there's some slack there. Okay, let me go back to the slide here. So what we find is that applying this dynamic body bias online, as the chip runs, gives us a lot of flexibility to control both frequency and power. And there's a question here: where do we choose our operating point on this frequency versus power curve? Depending on the demands of the system we define three different operating environments. In the standard environment we use dynamic body bias to improve both frequency and power consumption; we're trying to find the middle ground here, improve the chip frequency but don't spend too much power. We also looked at a high performance environment, in which we spend all the power that we have in our power budget to maximize frequency dynamically, like I said, at any cost, and we see how much frequency we can get.
In the low power environment we don't try to change the original frequency of the chip with variation; we just try to minimize leakage power as much as possible. In the interest of time I'll only talk about the standard environment, and at the end I'll show results for the other two as well. So to see how dynamic body bias works in the standard environment, let me bring back this original distribution of chips. Here we have the original chip, shown in blue, with variation, before any of this is applied to it. And what we do is, just like in the case of static body bias, find the maximum frequency the chip can reach under calibration conditions, and this gives us the F-max at maximum temperature. Now normally, at average temperature, the chip will actually run with lower leakage because the temperature is significantly lower, even with just static body bias. However, by adapting to temperature changes, dynamic fine-grain body biasing can decrease leakage power consumption even more than in the static case while keeping the frequency the same. So dynamic fine-grain body biasing saves leakage power compared to the static case while making sure the chip runs at the exact same frequency, F-max. And note that this F-max is already higher than the original frequency of the chip. So like I said, in the standard environment we're trying to strike a balance between how much we improve frequency and how much we save power. So this is all I have about body bias in this talk; I will present the results at the end. For now let me just give you a quick summary of what dynamic body bias achieves. We find dynamic fine-grain body biasing to be quite effective at reducing within-die variation. If we look at the original distribution of chips here, again leakage power versus frequency, before body biasing is applied, we find that static body biasing narrows the distribution somewhat, but dynamic body biasing is significantly more effective, and you can see that here: a much narrower distribution. And this results in 40% lower leakage power in the standard environment at the same frequency compared to static body bias, and a 10% improvement in frequency in the high performance environment, again compared to the static body bias case. So if there are no more questions here I would like to move on to the second part of the talk. Yes? Please. >> Question: (Inaudible)? >> Radu Teodorescu: Excellent question. So it turns out that this mechanism not only interacts with DVFS, it's actually essential to make DVFS possible. And the reason is this: at lower voltages, if you don't adapt the body bias values that you apply to each of the cells, your chip will not work at the same frequency. So if you just do static body bias for the chip at one volt, the maximum voltage, and then you try to change the supply voltage, your chip will actually fail. What you have to do is adapt the body bias dynamically as you change the supply voltage, and what we show here is that the savings in leakage power continue to be significant -- they continue to scale as the supply voltage goes down. I'm sorry, this is a fairly crowded plot. I'm showing here dynamic fine-grain body biasing at different granularities, from one cell to 144 body bias cells.
The results I showed before were just for the nominal supply voltage of 1 volt, where we see a decrease in leakage power. But we see here that even for lower supply voltages this difference remains significant. The main point -- so this is one good result, I think, but the other thing is that you need dynamic fine-grain body biasing in order to be able to scale the supply voltage. And there are a lot of interesting issues here, because once you have both dynamic fine-grain body biasing and dynamic voltage and frequency scaling you have two knobs that you need to control at the same time. So this poses an interesting control problem that I have not really addressed, but it's definitely an interesting thing to look at. >> Question: (Inaudible) -- so how did you model the variability? >> Radu Teodorescu: So we developed our own variation model, and I'm going to talk a little bit about that in the results section, but not a lot. Essentially we took data from industry for older technologies and scaled it using projections for variation in the 45 and 32 nanometer technologies. And we also abstracted the existing models, which are at the transistor level, up to micro-architectural units, because we needed a model -- there is no model for variation at the micro-architectural level. So we went through a lot of pain to abstract away the details of the circuit model in order to model variation at the level of functional units, caches and so forth. I can talk a lot more about that offline, but in the interest of time let me move on. >> Question: I don't understand why you really need fine-grain body biasing to enable DVFS. >> Radu Teodorescu: You don't need fine-grain body biasing necessarily; you need dynamic body biasing. As you saw in the plot, this D-FGBB 1 is actually chip-wide body bias. You need to change the body bias voltages at the same time as you scale the supply voltage, because if you don't, your chip will not run at the target frequency: at a lower supply voltage you need a lower body bias voltage -- I shouldn't say lower, you need a different value for the body bias. You don't necessarily need fine-grain body bias, but as I'm showing here, fine-grain body bias makes a big difference in power. This is all at the same frequency -- I'm sorry, of course not at the same frequency: this is at the maximum frequency the chip can achieve at 0.8, 0.6 and so on. So I'm not sure that answers your question. Sure. So we are scaling the frequency here and -- oh, yeah, what we want to make sure of for all this data is that the chip with and without body bias works at the same frequency. We take the chip, we measure its frequency after variation, so we know the frequency of the chip, and we want to make sure that after applying body bias all we do is save power; we don't affect the original frequency. If you just determine the body bias at a supply voltage of one volt and then scale the supply voltage while leaving the body bias at the same level, your chip will fail; it will not actually run at that original frequency. >> Question: (Inaudible). >> Radu Teodorescu: At reduced frequency as well -- it will not even run at the reduced frequency of the original chip. So it will actually be worse than the original chip at reduced frequency. I can talk about this a lot more offline. Let me move on now to the second part of the talk.
So we have seen a technique that reduces within-die variation. I will now talk about a solution for tolerating variation rather than trying to reduce it. Let's say there is nothing we can do about variation: we take the chip with variation and see how we can design our system around it. So the motivation for this work stems from the observation that large chip multiprocessors will have significant core-to-core variation. We model a 20-core CMP in 32 nanometer technology, such as this one here, with variation, and we want to measure core-to-core variation in both frequency and power consumption. As an illustration of what we find, let's take the fastest core on the die, which is C2, and the slowest core on the die, which is C20, and compare them. Now, if we look at frequency, what we find is that C2 is actually 30% faster than C20. It also consumes about twice the leakage power of C20, which is quite significant and translates into roughly 40% higher total power consumption. So this I think is quite significant, because we have designed identical cores -- cores that were meant to be identical -- that have dramatically different properties. And these are average numbers. For some of the dies that we model we see up to 50% variation in frequency, where the variation is more severe. And all the other cores are somewhere in between these two extreme cases. So the question we ask now is: how can we exploit this variation? We know that current CMPs generally run at the frequency of the slowest core, right? So that would be C20. What we can do, with relatively simple changes to the hardware, is run each core at the maximum frequency that that individual core can achieve. And with this we can buy back much of that frequency loss -- recall the gap can be as large as 50% on some dies -- and increase the average frequency, because now we don't have a single uniform frequency across the entire CMP. And the support for doing this exists in production today: the AMD quad-core Opteron allows its cores to run at different frequencies, mainly for power reasons, because they want to scale the frequency of each core dynamically to save power, but one could presume also because of variation -- I don't know, I'm just -- yes, question? >> Question: Are you assuming that the (inaudible) are largely (inaudible) by variation? >> Radu Teodorescu: No, I'm not assuming that. I'm assuming that the cache will run at -- well, first of all I'm assuming that the cache is not really very critical. You can always add a cycle or two to the access time. So the L2 cache will of course be affected by variation, but we don't consider banking the cache and running the banks at different frequencies. >> Question: So -- >> Radu Teodorescu: So the L2 cache remains at the same frequency. It's only the core and the L1 that run at the higher frequency. >> Question: So I'm not worried about a couple of additional cycles that you may have for the hit time of the L2 cache, but if on its own it ends up running at only a very small fraction of the core frequency then that could become a (inaudible) and maybe even (inaudible). >> Radu Teodorescu: A small -- I'm not sure. Okay. So consider the differences in frequency here. We're talking about 30%. Before, it was running at two gigahertz; this would be 2 point something. So the difference is there.
But the frequency -- later on I'll talk about what happens if you scale the supply voltage, and then the frequency change will be more significant. >> Question: (Inaudible). >> Radu Teodorescu: I am going to assume that -- not necessarily that the L2 cache is not affected by variation, but that it runs at the fastest frequency it can run at. So if there is a critical path in this particular L2 cache, that is going to determine the frequency of the cache for all the cores. Yes? So in such a system each core will have a different power consumption and a different operating frequency, so essentially what we have done is turn a homogeneous system into a heterogeneous system. In such a system it becomes suboptimal to treat all the cores as equal, for instance in the OS scheduler or in the power management subsystem. So what we propose is essentially exposing this variation to the OS and the software. To take advantage of this knowledge we develop variation-aware scheduling algorithms for the OS and also a variation-aware power management subsystem. And what we show is that these variation-aware solutions are better at saving power or improving throughput than the variation-unaware case. So I'll take these two in turn and first talk about variation-aware scheduling a little bit. This is actually a very simple idea. We know that one of the jobs of the OS scheduler in a CMP is to assign applications that need to run in the system to available cores. These cartoons here represent applications, and normally all the cores are treated as equal, meaning that, all other things being equal, applications are randomly assigned to cores. In variation-aware scheduling what we propose is using additional information to guide scheduling decisions, such as knowledge about the variation in core frequency and power, but also details about application behavior, such as dynamic power consumption and, for instance, computing density as measured in instructions executed per cycle, or IPC. In such a system we could also have different goals -- for instance, the goal of the system could be to reduce power or to improve performance -- and the scheduling algorithms can be adapted to each of these goals. We examined a number of variation-aware scheduling policies; I will only discuss two of them here, and you will see they're quite simple, actually. For instance, when the goal of the system is to reduce power consumption, what the OS scheduler can do is simply give priority to low-power cores in the schedule. So essentially, if I have many cores to choose from, I will select the cores that consume less power. In addition to that, applications that have high dynamic power can be assigned to low-power cores, and in that way we try to balance the power distribution across the CMP and minimize hot spots to further reduce power. This particular policy we call VarPower. Now when the goal is to improve performance -- I'm running behind the animation; so these applications are assigned to the low-power cores here. Now when the goal is to improve performance, the OS favors the high-frequency cores in the schedule. In addition, applications that have the highest IPC are sent to the highest-frequency cores, because these applications are doing the most useful work and they benefit most from the frequency boost (a small sketch of these two policies follows).
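Here is a minimal sketch of the two policies just described (an illustration with assumed data structures, not the actual OS code): VarPower fills the schedule from the most power-efficient cores and pairs the most power-hungry applications with them, while VarPerformance fills it from the fastest cores and pairs the highest-IPC applications with them.

```python
# Hypothetical variation-aware scheduling sketch (illustration only).
# cores: list of (core_id, freq_ghz, core_power_w)  -- from post-manufacturing profiling
# apps:  list of (app_id, ipc, dyn_power_w)         -- from run-time profiling

def var_power(cores, apps):
    """Goal: reduce power. Use only the lowest-power cores, and put the
    applications with the highest dynamic power on the most power-efficient
    cores to spread out heat and minimize hot spots."""
    chosen_cores = sorted(cores, key=lambda c: c[2])[:len(apps)]
    ordered_apps = sorted(apps, key=lambda a: a[2], reverse=True)
    return {app[0]: core[0] for app, core in zip(ordered_apps, chosen_cores)}

def var_performance(cores, apps):
    """Goal: improve throughput. Use only the highest-frequency cores, and
    put the highest-IPC applications on the fastest cores, since they gain
    the most from the extra frequency."""
    chosen_cores = sorted(cores, key=lambda c: c[1], reverse=True)[:len(apps)]
    ordered_apps = sorted(apps, key=lambda a: a[1], reverse=True)
    return {app[0]: core[0] for app, core in zip(ordered_apps, chosen_cores)}
```

Both are simple greedy pairings over profiled numbers; the point is only that the scheduler needs the per-core frequency and power figures exposed to it.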
Low-IPC applications may, for instance, just be waiting on memory or I/O, so they're not doing much, and pushing them to a high-frequency core is not going to buy you much. So both of these algorithms can achieve power savings or improve throughput compared to a variation-unaware, naive schedule. Now let's examine what happens if we add an additional layer of control to this system, in the form of dynamic voltage and frequency scaling for each of these cores. Dynamic voltage and frequency scaling, as we well know, is a known technique for power management in which the voltage and frequency of a processor are reduced dynamically to track changes in the application and save power. Now, future systems will most likely have core-level control over both voltage and frequency, which means that we'll have the ability to say what voltage and frequency each of these cores should have. We already have, in the AMD Opteron, the ability to set different frequencies for each core. We don't yet have the ability to set different voltages, because there are some challenges that need to be worked out there, but we'll soon see that. So now we have a lot of control and flexibility, right? We can have an independent voltage and frequency for each core. That's great. The problem is: how do we decide on a voltage and frequency level for each of these cores? That is not a trivial issue; it is quite a challenge. It's pretty obvious that not all the cores should have the same voltage and frequency, because some applications benefit more -- you know, your high-computing-density, high-priority application you want to run at the highest frequency, while others, your mostly idle processes, you don't want to run at high frequency. So it is pretty obvious not all of them should have the same voltage and frequency, because not all of them benefit equally from high frequency. But deciding on voltage and frequency settings for all these cores is not a trivial issue, and in addition variation makes the problem significantly more difficult, because it introduces uncertainty into this decision problem. Let's see why, with an example. I'm going to show here total power consumption versus frequency for a single CMP that has 20 cores. We take one core and we scale its supply voltage from one volt all the way down to 0.6 volts. As we see here, it behaves quite nicely: the power goes down, as does the frequency, as we scale the supply voltage. So this is fine. The problem is that when we take a different core, this is what we get. It behaves completely differently. As we see, the blue core and the red core here reach the same power consumption and the same frequency at two different supply voltages, 0.7 and 0.85 volts. Moreover, the red core is more power efficient above 0.7 volts and less power efficient below 0.7 volts. In addition, as you can see, the red core can reach a higher frequency than the blue core at the maximum supply voltage. And the problem is that all the other cores behave differently as well, which means that all of a sudden we don't know anything about how our CMP is going to behave. You can say, oh, I'm going to set core three to 0.7 volts, and it's not clear what you're going to get, what frequency and what power consumption. So managing this system and finding an optimal voltage and frequency for each of these cores becomes significantly more difficult. So here's how we attack this problem.
To solve this, we treat it as an optimization problem, which we define as follows. We are given a mapping of threads to cores, for instance produced by one of our variation-aware scheduling policies like VarPerformance; so we already have a mapping of applications to cores. What we want to find is the best voltage and frequency for each of these cores, with a certain goal. The goal that we choose is to maximize overall system throughput, as measured in the number of instructions that are executed in the system each second, under a set of constraints. The main constraint that we set on this system is to keep the total power consumption for the entire chip below a certain budget. And for us it makes sense to keep this budget flexible, meaning that we can set it in software: if this is running on a laptop you could have a power budget of 50 watts, if it is running in a desktop it could be 75 watts or 100 watts, or you can even change it dynamically depending on whether your laptop is plugged into the wall or unplugged. So given this power budget that you give to your system, you want to get the best performance out of it. The question is what we put into this box -- how do we find the best, or as close to the best, voltage and frequency for each core that we can? To answer this we look at a range of solutions to fill in this box. The first, which has been done in the past, is exhaustive search, which means you have to look at all the possible combinations of voltages and frequencies for each of these cores and decide which one is the best. This has been done in previous work for a four-core CMP that had maybe three different voltage levels. But our problem is much bigger: we have 20 cores and each core can have 10 different voltage levels. So the search space for exhaustive search is so large that it is simply infeasible; we discard that immediately. The next thing we look at is simulated annealing, which is a probabilistic algorithm for solving optimization problems. Question? Yes? >> Question: So when you change a core's voltage and frequency, how do you estimate the performance that you will get from each application (inaudible)? >> Radu Teodorescu: That's a good question. We use profiling to measure the IPC of the application as it is running. Yes? >> Question: They are sharing a cache, so it is going to be a very hard system to model? >> Radu Teodorescu: A very hard system to what? >> Question: To model? >> Radu Teodorescu: To model. Yes, that is a concern and we have seen some of that, but there's not much we can do about it. There is an impact from the shared cache. We have to make the assumption that these applications are more or less independent when we run our decision algorithms. Of course they are not totally independent, because of cache interference. >> Question: (Inaudible) performance results for profiling? >> Radu Teodorescu: Yeah, let me go through a couple of slides; if that's not clear by then, stop me. So we look at simulated annealing for solving this optimization problem, and we get the best results with simulated annealing. However, what we find is that it's computationally very expensive.
We want to run this algorithm online, as programs run, to allow the system to adapt to changes in workload, power consumption targets, or even optimization goals. So the solution we came up with is our own algorithm, which we call LinOpt. It is based on linear programming, and it has the advantage of being orders of magnitude faster than simulated annealing, but it requires us to make some approximations. We compare it against simulated annealing, which is close to optimal, to show how close it gets. So let me tell you a little bit about how we cast the problem as a linear optimization problem; if that doesn't answer your questions, then by all means stop me. We know that linear programming can solve optimization problems of the following form: we want to maximize an objective function F of independent variables X1 to XN, subject to constraints of the form that a function G of the same independent variables X1 to XN has to be less than a constant C. We can have several of these constraints and one objective function. What we can also have in linear programming are bounds on the value of each of these variables X. Now, the big restriction here is that, by definition, these functions F and G have to be linear. So that is a big constraint on our system. The problem that we're trying to solve maps to a linear optimization as follows. The variables that we're trying to find, the Xs, are for us the voltage levels for all the cores; that is what I'm trying to find -- the voltage for each of the cores. The objective function, which we're trying to maximize, is the overall system throughput. By definition, throughput as measured in MIPS, or millions of instructions per second, can be computed as the frequency of a core times the IPC of the application running on that core. That is the throughput for one core; we add up all the cores and we get total system throughput. Now, frequency is mostly a linear function of supply voltage, and the assumption that we make here is that the IPC does not change as we change the supply voltage and frequency. This is an approximation, of course: because of memory accesses your IPC will change somewhat. However, the changes are small compared to the differences in IPC between applications and between phases of the same application. So while there is a change in IPC, we find that that's not the biggest effect here; the biggest effect is the IPC difference between applications and also within an application. As a result, because frequency is mostly linear in voltage and IPC is mostly constant in voltage -- and another caveat here: we don't actually need to assume IPC is constant. If we can measure the IPC of the application at two different voltages, we can extrapolate a linear function. So that is just one simplification we make here; you don't necessarily need it. We make it because the IPC is fairly flat in voltage, so we assume it is constant, but you could potentially assume a linear dependence. Now, because of these linear relationships we can compute an objective function F that is linear in the supply voltages and also depends on constants that are characteristic to each core and to each application, like the IPC. And for us the main constraint is to keep the power under a certain target. So we have to somehow cast power as a function of supply voltage to write this constraint.
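Putting the pieces together in symbols (notation assumed here, not a slide from the talk), with V_i the supply voltage chosen for core i and N cores, the problem looks roughly like this:

```latex
\text{maximize } \sum_{i=1}^{N} \mathrm{IPC}_i \cdot f_i(V_i)
\qquad \text{subject to } \sum_{i=1}^{N} P_i(V_i) \le P_{\mathrm{budget}},
\qquad V_{\min} \le V_i \le V_{\max},

\text{where } f_i(V) \approx a_i V + b_i \text{ comes from per-core calibration and }
\mathrm{IPC}_i \text{ from run-time profiling.}
```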
So power is clearly dependent on supply voltage. The problem is that the dependence is nonlinear. What we have to do is essentially linearize the dependence of power on supply voltage, which is the biggest approximation that we have to make. But we find that the results are pretty good compared to simulated annealing, which does not need to make that approximation, and you will see that in the results. So this is how we cast the problem that we're trying to solve as a linear optimization problem. If you have any questions here, please ask now. Yes? >> Question: How do you get the IPC number? >> Radu Teodorescu: The IPC numbers are gathered through profiling. >> Question: But I mean they were -- >> Radu Teodorescu: So you are running the application. Yes, this is an online algorithm; it runs all the time, it is not something static. You profile the application as it is running and you say, okay, I've seen this IPC for this application for this long, I'm going to predict that the IPC is going to stay the same. So based on my prediction of the IPC I make a decision. >> Question: So you're -- >> Radu Teodorescu: So based on profiling. >> Question: So you're also assuming that these applications (inaudible)? >> Radu Teodorescu: Yes. Yes. That is one assumption that we make, because we can't model that. Potentially, if we could model it somehow, we could add it into the optimization problem here, but I'm not exactly sure how to model interference in the caches. Yes? >> Question: I'm curious about the last line -- how can you linearize power as a function of voltage? >> Radu Teodorescu: Well, you can linearize anything you want, provided that you are willing to take the hit. Well, it depends. It seemed outrageous to me at first; I didn't consider this at first. I said, there is no way this is going to match a linear function. The question is over what interval you are linearizing the function, and it turns out the approximation of the nonlinear function is fairly close, decent enough to give good results. What does happen, because we linearize this nonlinear function and cannot predict precisely, is that we miss our power target by a few percentage points: even though we expect the power resulting from our optimization to be 50 watts, we may be at 51 or 52. So that does happen. Okay, let me move on here. So the linear optimization algorithm works together with the OS scheduler. The OS scheduler maps applications to cores according to one of the variation-aware algorithms, for instance VarPerformance, and then LinOpt finds the voltage and frequency settings for each of these cores. LinOpt runs periodically as a system process, either on a spare core if you have one, or otherwise inside the power management unit -- it can be an on-chip microcontroller similar to what the Intel Itanium 2's Foxton technology uses to manage power. They have something like this: essentially a microcontroller inside the Itanium 2 that runs the code for their power management algorithm. We could use something like that. All the data that I'm showing assumes LinOpt runs on a spare core. LinOpt uses profile information as input, as I mentioned before -- quite a bit of profiling information. Some of this profile information can be collected post-manufacturing, such as the frequency and static power consumption of each core at all the voltage levels.
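To make the optimization step itself concrete, here is a minimal sketch of how the linearized problem could be handed to an off-the-shelf LP solver (an illustration with assumed coefficient names, not the actual LinOpt implementation). It consumes the kinds of inputs just listed: per-core frequency and power as a function of voltage from calibration, and per-application IPC from run-time profiling.

```python
# Hypothetical LinOpt-style sketch (illustration only): choose a supply voltage
# per core to maximize IPC-weighted frequency under a total power budget, after
# linearizing frequency and power in voltage over the operating range.
import numpy as np
from scipy.optimize import linprog

def choose_voltages(ipc, f_slope, f_icept, p_slope, p_icept,
                    power_budget, v_min=0.6, v_max=1.0):
    """ipc[i]      : profiled IPC of the application mapped to core i
    f_i(V) ~ f_slope[i] * V + f_icept[i]   (GHz, per-core calibration)
    P_i(V) ~ p_slope[i] * V + p_icept[i]   (watts, linearized approximation)
    Returns the chosen supply voltage for each core."""
    ipc = np.asarray(ipc, dtype=float)
    f_slope = np.asarray(f_slope, dtype=float)
    # Objective: maximize sum_i ipc[i] * f_i(V_i). linprog minimizes, so negate;
    # the f_icept terms only shift the objective by a constant.
    c = -(ipc * f_slope)
    # Power budget: sum_i p_slope[i] * V_i <= power_budget - sum_i p_icept[i]
    A_ub = [list(p_slope)]
    b_ub = [power_budget - float(np.sum(p_icept))]
    bounds = [(v_min, v_max)] * len(ipc)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x

# The continuous solution would then be snapped to the nearest supported DVFS
# level for each core, and the whole thing re-run every scheduling interval.
```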
The rest of the profile information is dynamic and needs to be collected at run time, such as the dynamic power and IPC for each core and application pair. We feed all this information, together with the power target and the optimization goal, to the LinOpt algorithm, which outputs the best -- well, not the best, but as close as we can get to the optimal -- voltage and frequency for each of these cores. Now, in the implementation we tried running LinOpt at different granularities, starting at every 500 milliseconds and going all the way down to every millisecond. We find that a good interval for the performance versus overhead tradeoff is to run it roughly every 10 milliseconds, which means it runs several times inside an OS scheduling interval. So of course depending on how -- yes, question? >> Question: (Inaudible) running this optimization also requires energy, right? >> Radu Teodorescu: Yeah. >> Question: So -- >> Radu Teodorescu: And yeah, yeah, so -- >> Question: Does running this on an additional core kind of outweigh the gains you can get? >> Radu Teodorescu: Yeah, that's a good question. So it depends. If you run it on a core the impact is higher. If you run it in the microcontroller, which is what we tend to favor, it is less, because a microcontroller is very simple and small and consumes a lot less power. The Itanium folks report roughly 0.5% area for their microcontroller, so that's about the power consumption that we assume for this. If you run it on a core (inaudible). Okay. If there are no more questions on this I will move on to the evaluation. I think we have about 12 minutes here. Yes? >> Question: Does this work (inaudible)? >> Radu Teodorescu: Parallel applications -- so that's a very good question. We did not try this on parallel applications; what we tried is essentially multi-programmed workloads that have few dependencies between applications. My gut feeling is that for parallel applications the optimization would actually try to keep all the cores at the same voltage if the threads are well balanced. If there is imbalance, you try to favor the threads that are making good progress. If the threads are well balanced then you would try to keep them at the same frequency, because if you have barriers and synchronization there is no point in running one of the threads very fast if it has to wait at a barrier. But that is something we haven't looked at. So let me move on to the evaluation. A few words about our evaluation infrastructure. We used our own process variation model, which we call VARIUS. This was developed together with several of my colleagues and was published in IEEE Transactions on Semiconductor Manufacturing. It is a variation model that can be used at the micro-architectural level, and with it we can do Monte Carlo simulations of 200 dies with variation, which we then use to analyze our solutions. We use the SESC architectural simulator, which was developed in our group, and a range of other tools to model leakage power. We started with these models at the level of transistors -- we model leakage and the effects of variation in SPICE, so very detailed transistor-level simulation -- but we have to abstract that away: at the cache level we assume, okay, all the transistors in the cache are going to behave roughly the same, all the transistors in the ALU are going to be mostly the same. We have to have some abstraction.
We have tools for estimating temperature as well. For the first piece of work, dynamic fine-grain body biasing, we model a four-core CMP in 45 nanometer technology at 4 gigahertz. We evaluate fine-grain body biasing at different granularities, ranging from 1 to 144 cells; a cell here is essentially the unit at which body bias is applied. We want to see how many of these cells we need before the overhead gets too high, so we look at 1, 16, 64 and 144 cells. So let's look at dynamic body biasing in the standard case. This plot again shows frequency versus leakage power. The blue dot represents the average frequency and leakage for the 200 dies that we model before body bias is applied; all the other results, as you see, are normalized to this value. Let's look first at how static body bias can improve the chip's frequency. As we can see, for S-FGBB 144, which means it uses 144 body bias cells, the frequency is improved by about 6%. As the number of cells goes down the improvement is smaller; with one cell, static body bias can essentially do nothing here. On the other hand, a chip with dynamic body bias runs at the exact same frequency as the static case, but the reduction in leakage is quite significant: these chips run with significantly less leakage power, ranging from almost 30% up to a 40% reduction in leakage power for the 144 body bias cell case. Of course, the more body bias cells we have, the higher the frequency and the lower the leakage, because we have more control over variation on the chip. For the other environments, let me just briefly mention the results. In the high performance environment, D-FGBB high performance achieves between a 7 and 10% frequency increase compared to the static case, and up to 16% compared to the no-body-bias case, but this comes at a high cost in power. The more interesting case for us, where the results were actually much better than we expected, was D-FGBB low power, where we got between a 10 and 50% reduction in leakage power compared to the static case, which we thought was quite good. For the second piece of work that I talked about, variation-aware scheduling and power management, we model a 20-core CMP in 32 nanometer technology at 4 gigahertz. We use multi-programmed workloads consisting of between one and 20 applications selected randomly from a pool of SPECint and SPECfp benchmarks, and we construct these multi-programmed workloads for this CMP. We evaluate the following power management schemes. They all have the same goal, to maximize throughput, and work under the constraint of keeping power below a certain budget. Here I'm going to show results for 75 watts; in the paper we have the full range of budgets. We call the first scheme Foxton+, because it is an enhanced version of what the Intel Itanium 2 does to manage power. In this scheme we map applications randomly to cores and then we decrease the voltage of each core in small increments, round-robin, until we reach the power target. Yes? Question? >> Question: The evaluation shows applications using SPECfp and SPECint, so are there any of the SPEC tests that actually have to wait on the network or user input, or spend any time waiting on anything other than memory? >> Radu Teodorescu: Not that I know of. They're mostly -- yeah.
At most they would (inaudible); when they developed SPEC they went to great lengths to remove I/O and all of that. So what we take advantage of is waiting on memory, yes, but not much else. But there is a huge difference between the behaviors of the applications in relation to (inaudible). Some applications are very (inaudible). >> Question: (Inaudible). >> Radu Teodorescu: Yeah. >> Question: (Inaudible). Based on what you said this morning, I would have anticipated that your goal would have been to minimize power and your constraint would have been a fixed throughput. What's the (inaudible) -- >> Radu Teodorescu: Yeah, that's one way to do it. The problem is: what is your throughput goal? For most general purpose applications, on more general purpose systems, you don't have a strict timing requirement. If it is your MP3 decoder, yes, fine, you have a frame (inaudible). But what is your throughput requirement for (inaudible)? For most applications it is very hard to define that in terms of (inaudible). >> Question: Well, my throughput requirement in PowerPoint is: can I display my set of slides in 90 minutes? What is (inaudible) -- >> Radu Teodorescu: (Inaudible). >> Question: My point is, it is a fixed amount of work. The question is what is the smallest power it can use, right? I mean, you -- >> Radu Teodorescu: Yeah, I took a different approach because I could not see a way to define a timing requirement for a general application. The way people have attacked this problem, for instance in the embedded market, where you have your radio decoding something, your MP3 player and whatnot, is that you have very strict requirements, so they essentially say: give me the minimum power for a certain timing requirement. I took a different approach because I think it makes sense for a general purpose system to say: I give you this power budget, give me the best performance. So our proposed scheme is VarPerf plus LinOpt, which essentially means that we use the VarPerf scheduling algorithm to map applications to cores, mapping the high-IPC applications to the high-frequency cores, and then we use LinOpt to find the voltage for each of these cores. We compare it to VarPerf plus simulated annealing, which is our upper bound; we want to see how close we can get to that. So these are our results. This plot shows throughput for the three power management schemes, measured in millions of instructions per second, for different experiments, from a lightly loaded system with four threads running on the 20-core CMP all the way up to a fully utilized system with 20 threads running on this 20-core CMP. Everything here is normalized to the (inaudible). What we find is that our proposed scheme achieves a throughput improvement of between 12 and 17%, and also that it comes within roughly 2% of the simulated annealing solution, which uses almost none of the approximations that we made in the LinOpt algorithm. So to sum up, we want to answer the following question: how much of the performance and power that was lost to variation have we recovered with each of these schemes? Why are we bothering to do all of this?
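To make the proposed scheme more concrete, here is a minimal sketch of the two steps just described: a VarPerf-style mapping that pairs high-IPC applications with high-frequency cores, followed by a LinOpt-style linear program that picks per-core frequencies under the power budget. This is an illustrative rendering under simplifying assumptions, not the algorithm from the paper: it solves only for frequency, uses a crudely linearized per-core power model, and all of the names and numbers in it (varperf_map, linopt_freqs, the watt and gigahertz figures) are made up.

    import numpy as np
    from scipy.optimize import linprog

    def varperf_map(app_ipc, core_fmax):
        # Pair the highest-IPC applications with the highest-frequency cores.
        apps = sorted(app_ipc, key=app_ipc.get, reverse=True)
        cores = sorted(core_fmax, key=core_fmax.get, reverse=True)
        return dict(zip(apps, cores))

    def linopt_freqs(ipc, watts_per_ghz, static_watts, f_min, f_max, budget):
        # Maximize throughput = sum_i ipc[i] * f[i] under a chip-wide power budget,
        # with per-core power linearized as watts_per_ghz * f + static_watts.
        # linprog minimizes, hence the negated objective.
        n = len(ipc)
        res = linprog(-np.asarray(ipc, dtype=float),
                      A_ub=np.asarray(watts_per_ghz, dtype=float).reshape(1, n),
                      b_ub=[budget - float(np.sum(static_watts))],
                      bounds=list(zip(f_min, f_max)), method="highs")
        return res.x if res.success else None

    # Hypothetical example: four applications, five cores whose maximum
    # frequencies (GHz) differ because of within-die variation.
    core_fmax = {0: 3.7, 1: 4.1, 2: 3.9, 3: 4.3, 4: 3.6}
    app_ipc = {"mcf": 0.6, "gcc": 1.3, "milc": 0.9, "bzip2": 1.8}
    mapping = varperf_map(app_ipc, core_fmax)          # e.g. bzip2 -> core 3
    apps = list(mapping)
    freqs = linopt_freqs(ipc=[app_ipc[a] for a in apps],
                         watts_per_ghz=[6.0] * len(apps),
                         static_watts=[4.0] * len(apps),
                         f_min=[1.0] * len(apps),
                         f_max=[core_fmax[mapping[a]] for a in apps],
                         budget=60.0)

In the setup described in the talk this step would rerun roughly every 10 milliseconds on a power-management microcontroller, and the real LinOpt chooses voltage together with frequency rather than frequency alone.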
So in the case of dynamic fine-grain body biasing we first look at frequency, and here the first bar shows the frequency of all the chips that we model without variation. We use the exact same model but with nominal values; we simply take all the variation away. What we find is that because of within-die variation we lose roughly 15% of the frequency. Dynamic fine-grain body biasing in the standard environment is able to recover about a third of that frequency loss, while the high-performance version recovers almost all of the lost frequency, but this comes at a high price in power consumption. On the leakage side, because of within-die variation we see a 25% increase in leakage power; however, for leakage reduction, as we have seen, dynamic body bias is even more effective, not only recovering the leakage lost to variation but actually improving over the no-variation case, because it is able to adapt to temperature. In fact, dynamic fine-grain body biasing in the low-power environment is roughly 35% lower than the no-variation case, which we thought was quite good. Now for the second technique we look at the effect of variation on overall system throughput at the same power. The power constraint here is 75 watts, with 20 threads running on the 20-core CMP. What we find is that because of variation we lose about 15% of the system throughput. Again, 20 threads, 20 cores. With our best scheme, VarPerf plus LinOpt, we are able to recover most of these losses. So we can conclude that both techniques recover most of the losses that were caused by process variation, in some cases at a high price in power, as with dynamic fine-grain body biasing in the high-performance environment, but in other cases, like leakage, we actually do better than the no-variation case. So this is all I have; let me, if I have time, talk a little bit about my future work, it's just a couple of slides. In the longer term I will be looking further down the road. Today, semiconductor roadmaps are predicting that in a few years, at the 11 nanometer node, we will probably have 100 billion or more transistors on a single chip. If current trends continue, this will translate into hundreds, if not thousands, of cores on a die. We will have a lot of resources. The problem is that at the 11 nanometer node reliability problems will get exponentially worse, and the problems are many. One is variation; another is the accelerated aging that we are already seeing in these systems; then increased susceptibility to soft errors, and plenty of other problems. In this environment some cores will fail immediately, right out of manufacturing, and you cannot just throw the system away because two out of your 2000 cores have failed, so we have to design around that. Other cores will fail over time because of aging, maybe two weeks or two months after we deploy the system in production. Still other cores will fail intermittently as the environment in which the system runs changes. So we have to learn how to successfully manage such an unreliable system. What I am looking to build is an integrated approach to system reliability that spans multiple levels of the computing stack, starting with devices, centered around microarchitecture, which is my area of expertise, but going up to the OS and software as well. I believe that by increasing collaboration across these layers and working with researchers who have expertise in different domains we can achieve far more effective solutions than would otherwise be possible.
As an example of how such a system would work, let's take the case of variation-induced timing errors. These are errors caused by the inability of a signal to propagate within a single clock cycle, causing the wrong value to be latched. Timing errors are more likely to occur at high temperature or under an unstable supply voltage. So at the circuit level we could have environment sensors, devices that can detect when the temperature is too high, the voltage is unstable, and so on. At that point this layer can inform the microarchitectural layer, which can enable detection and correction techniques, such as ECC or hardware redundancy, that would normally be disabled to save power. Now, if the error rate grows too high to be managed at the microarchitectural layer, we could bring in the OS, which would essentially migrate computation away from the failing core, either permanently, because the core is no longer usable, or temporarily, until the conditions that caused the spike in error rate disappear. At the software and compiler levels we could have, for instance, different versions of the same code that run depending on the reliability of the core: a version of the code that is optimized for performance and runs on reliable cores, or one optimized for reliability, with significant hardware and software redundancy or checking built in, that could run on the (inaudible). So the bottom line here is that I believe integrated solutions such as this one will become key to tackling the daunting reliability challenges that we're likely to face in future systems. Before I finish, let me tell you about some other work that I did as part of my PhD (inaudible), on a slightly different topic. This work focused on developing hardware support to help debug software. The main idea was that the process of actually searching for bugs should continue at the user side, on your laptop, desktop or server, to catch all those bugs that escape during development, and we know there are plenty. The problem is that most debugging tools come with a very heavy performance impact. Throughout this work we show that with the help of hardware support we can debug software with very little overhead. In this context I worked on developing a prototype of a processor with fast, software-controlled checkpointing and rollback support. This was a fully working prototype running on an FPGA chip, it could run an operating system, and this is what we used as a test bed. The idea was to use this low-overhead rollback ability to quickly undo faulty executions and look for bugs. Another project I was involved in was the development of a hardware implementation of the lockset data race detection algorithm, for those of you who are familiar with it, and we managed to reduce the overhead of actually running lockset from 10X down to a few percent. Together with (inaudible) at Intel I also worked on log-based architectures for (inaudible) monitoring of production code, which had kind of the same idea of online monitoring, and I can talk a little bit about that offline. Thank you very much for your attention. Sorry for running over time here. But thank you very much for coming. (Applause)