Jie Liu: It's a great pleasure to welcome Sherief Reda from Brown University. He's visiting us today
and he's been doing research on power and thermal management on both the chip level and the data
center level. And today he's primarily going to talk about the data center level, which will be very interesting. On the chip level management things we can chat off-line as well.
Shief Reda: Thanks a lot for the introduction and thanks for organizing this visit and initiating it. I'm
happy to be here today. Is this mic working or it's not working?
Jie Liu: It should be working.
Shief Reda: All right. So as he mentioned, I have a number of projects going on; mostly they deal with power, energy, and thermal at different places in the computing stack. The first research theme is adaptive energy-efficient computing systems. And by adaptive in here, I mean things that
can be applied during runtime. When we started this project, we were looking basically at techniques
at the chip level, thermal and power management at the chip level. And eventually we moved on to
server level, and then eventually to datacenter level. So that's going to be mainly the focus of this talk.
And in fact I'll just be talking about half of it, the part we worked on, the datacenter, which is relatively
recent.
The second theme that I work on is thermal and power sensing and modeling for processors, systems on a chip, other computing systems and computing platforms, and that's actually where a big bulk of my lab infrastructure is, with grants from UED's and other places. This is, for instance, a high-end infrared
camera that we used to look at the thermal distribution inside the chip. And for instance, I will show
here a realtime video typically. All right.
Yep, so this is a realtime video actually from a dual core AMD processor. And this is the thermal map.
It usually goes up to 55 degrees for the red, and the blue is usually around 35 degrees. I just couldn't save it with the video. You can see the dual core modules from the infrared camera. What's happening in here, this is the SPEC Java, SPECpower ssj workload. So what's happening, you're seeing the Java
threads being launched and then we're responding to a query from online and then terminating. And
with the infrared sensing, you can see -- I mean you can see the caches, you can see the memory
controller, you see the L1 caches in here. So you can actually see the thermal gradients. This is in
realtime.
So this is a starting point for much of our research, including quantifying hotspots: if you're going to put embedded thermal sensors on the chip, where should you put them, because you have a limited budget for them. We also have a big research project on inverting these temperature maps into power
maps so you can actually see what's underlying, what's the power consumption of the different units of
the processor.
>>: Realtime?
Shief Reda: This is realtime.
>>: So basically if the temperature changes [inaudible] --
Shief Reda: Yeah, because the thread launches and goes down quickly. So if you have a SPEC
application, if you have a regular SPEC CPU workload, it's not going to change like this. It's just going to launch a hotspot and stay like that. It's the nature of this workload. That's why we created that video, because the threads launch in a couple of seconds and then they die off. But other applications won't have this.
>>: What is the variance in temperature?
Shief Reda: So usually when it peaks like this -- as I said, I couldn't save it to the [indiscernible] but
it's about like 55 in here, maybe 35 in here.
>>: Wow.
Shief Reda: Yeah, so this is a big project and you can actually also -- imagine applications in security
where people are trying to insert malicious entries in your circuit, you can detect them with the thermal
infrared imaging. So this is a big project that I work on in here. I won't be talking about it today, but it's a big project.
Another big project is low power design techniques and CAD tools. The main project we work on is
approximate computing. So approximate computing is a way to trade off accuracy, in applications that allow it like machine learning and computer vision, et cetera, for some reductions in power consumption. And there's a lot of interest in approximate computing, and there are different scales you can work on in the computing stack. So you can design circuits that are approximate. That's one thing you can do. Or you can do it at the application level, so you can change your software, restructure your software. And I think there's a lot of applications in datacenters. I think potentially eventually I would like to merge these two, because we can imagine large-scale approximate computations happening in datacenters, and you really get a big savings on power consumption.
So these are the main three projects. I'll mostly talk about the first one. And I'd like to first
acknowledge the people that I work with. These are only the students who worked on this first project.
There are other students working on the other one. So some of them graduated, some of them are
working on it right now. I also have a colleague at BU; we collaborated especially on the first part of the talk, the pack and cap technique. The rest of the stuff was mainly at Brown. So we
usually write proposals together.
So what's the big goal in here? The big goal -- a goal shared by many researchers -- is that we want to design energy-efficient datacenters, and this goal means that we would like to maximize the total performance of the datacenter while meeting a dynamic total power budget. And the total power budget in here basically covers computing and cooling simultaneously. We maybe want to meet a certain cap.
There is a cap for the datacenter and we want to meet it. And there are a number of reasons why this
problem has to be considered from a dynamic sense. One of them, obviously there are a few things, but
one of them is the variance in demand over time. The datacenter -- demand for the datacenter is
changing over time. This is data collected from a computing cluster at Brown over 2010, and I'm sure
everybody has similar traces.
Also, applications themselves change their power consumption over time. They go through different
phases and change their power consumption, which means that you might have some settings that are
[Indiscernible] you the budget, but over time you see some drift and you're not able to meet this power budget.
>>: What [Indiscernible].
Shief Reda: It's relatively highly utilized.
>>: Do you cluster or something?
Shief Reda: No, it is not a cluster. This is actually a big computing facility at Brown and researchers
submit their job. So instead of everybody getting their own clusters, it's more like batch.
>>: [Indiscernible].
>>: Could you go back one slide? I'm curious here. Because you're saying energy efficient
datacenters; right? And then you talk about a power budget.
Shief Reda: Right.
>>: If -- because you're forced to stay within the budget, you have to run things slower. So you end up
consuming more energy, not less.
Shief Reda: Right, but sometimes -- I will explain actually there's a third reason. If you don't have any
concern for power consumption, then as you're saying you could be consuming the least amount of
energy. But sometimes you actually want to adhere to a certain cap. And the reason for this is
particularly the third one. So many datacenters are connected to the grid and the grid has energy
markets you can participate in. And if you want to participate in that energy market, the smart grid or
the grid operators will send you a regulation signal that tells you "please regulate your power
consumption to a certain level." And if you do, then you get back an energy incentive, you get reduced
rates, you get back money. So there's a lot of demand for these dynamic power caps. But you're
absolutely right. There's a chance that if you run at the lowest level, you could spend more energy.
>>: The main reason, I think right, the fact that you can over subscribe on a datacenter by putting in
more servers than it was actually designed for because you have the ability to limit the power is
probably more of a factor than the load response stuff; right?
Shief Reda: That's another possibility too. So that will tie in with point number one. You're doing this
over-provisioning to be able to handle those variations, but sometimes you don't need it.
But at the end of the day, even if the power budget is at the maximum, you want to maximize your performance within this power budget. So that's the energy efficiency, performance over power; so if I
-- say you have to consume, say, one megawatt, then you still want to maximize the performance within
this megawatt. There are so many ways you can get to the one megawatt, but what is the best way to
go through with it.
So that's why we always consider this power budget within our work. We don't consider -- we don't
want to just maximize performance within the maximum power budget, but potentially within lower
power budgets.
So what are the degrees of freedom we have during runtime? There's obviously things you can do
during planning, like purchasing how many servers you would like to have. But from our runtime
point perspective, what do we have? So there are a few degrees of freedom. One of them is that the
servers could have power cap controllers. You can tell each server to consume no more than 150 watts, and it will adhere to this power cap. If you have a heterogeneous datacenter, things become
more interesting, because which workloads goes to which server is going to determine performance,
energy, and power, and maybe different servers offer different tradeoffs for different applications. And
also, different servers could be -- maybe one is best for performance, maybe one is the least power,
maybe one is the highest energy efficiency and so forth. So this is another runtime degree of freedom.
Workload co-runners, you could choose to co-run applications on the same server, or you could make
sure there's no degradation by just running a single workload.
Virtual machine migration. Maybe you do this mapping initially, but then eventually this favorite
server for the blue circle is now free. So you can actually start migrating things to it to leverage that
usage.
The CRAC supply temperature. The CRAC supply temperature is going to determine your cooling power. And that's another degree of freedom you have in your datacenter. So what we want to do is to adaptively determine these degrees of freedom in the datacenter to actually improve energy
efficiency. Of course it's challenging, because all of these degrees of freedom are highly interacting
and there's a lot of dependencies in between them.
So when we try to do sort of a road map for these degrees of freedom, that's how things would look. Let's see the degrees of freedom and their applicability. And there are two domains: homogeneous clusters, where all the servers in your cluster have the same configuration, and heterogeneous clusters, where you actually have different racks, each one with its own server configuration.
So the cap controller is applicable to both of them, co-runners is applicable to both of them, workload
mapping is really mostly irrelevant for this one, unless you consider cooling into the process. Virtual
migration is also applicable only for heterogeneous. Cooling is applicable for both of them.
And of course workloads is another degree of freedom really. You can control it, but the way you look
at these degrees of freedom will change depending on the workloads that you're running. And you
actually look at most of the literature and most of the papers in the literature, you can easily map them
into some of these combinations in here. These combinations that are highlighted are actually the ones
that I'm working on. This is all work in progress. So we're trying to still do a better integration for all
of these degrees of freedom.
>>: So the mechanisms for managing power are under the cap controller?
Shief Reda: So I will actually talk about cap controller techniques, I'll talk about techniques with co-runners and mapping. So all these ones that have this highlight are ones that I will talk about.
>>: Have you looked at utilization scenarios? Because I suspect that VM migration will be useful
when you try to consolidate [indiscernible].
Shief Reda: So we will see some results on that. The short answer is there will be a difference.
Because if you have low utilization, sort of every workload comes and gets its favorite server and you
don't need to migrate afterwards. But if you have a high utilization, jobs come to the datacenter and in that case you don't have many free servers. So you allocate them to whatever is free. But then that opens the room for migration afterwards. So it is indeed true that the migration will have
different quantitative advantage as a function of your utilization. And I will show this later on in a
slide.
So this is the overview of the technique that I explained. I will start with a technique that we worked
on first for energy efficiency of server, just if you have a server, what would you do. Then I'll move on
to a cluster, which is just servers. And eventually then I'll talk about a whole datacenter, and I just
mean in here, really it's the cluster plus the cooling equipment. So these are the three parts of the talk
and this is how it's going to be structured.
So let's look first at the server level management technique, if you want to look at the energy efficiency side, maximizing performance within a budget. So traditionally the trade-off between performance and power consumption by a server is achieved by using dynamic voltage and frequency scaling, also known as the P-state. By controlling the P-state, you're controlling your performance and power. Many modern servers actually have power cap controllers that can sense the power consumption, say, once every one or two seconds. And you can give the server hardware a certain power cap, and then the firmware in the server, on the controller, will adjust the P-state or the dynamic voltage and frequency scaling accordingly. It's a very simple mechanism. If you are above the cap, then decrease DVFS until you meet the cap. If you're below the cap, then increase your DVFS or P-state, because you're not delivering
the best performance, so increase it until you meet the cap. So that's the classical design that's being
used right now. And it's mainly done in hardware.
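As a rough sketch of that classical capping loop (the measure_power and set_pstate calls are placeholders for the server's firmware interface, not a real API, and P-state 0 is assumed to be the fastest setting):

    import time

    P_STATE_MIN, P_STATE_MAX = 0, 8   # 0 = fastest setting (assumed numbering)

    def dvfs_cap_loop(power_cap_w, measure_power, set_pstate, period_s=1.0):
        pstate = P_STATE_MIN
        while True:
            power = measure_power()        # sensed once every second or two
            if power > power_cap_w and pstate < P_STATE_MAX:
                pstate += 1                # above the cap: slow down
            elif power < power_cap_w and pstate > P_STATE_MIN:
                pstate -= 1                # below the cap: speed back up
            set_pstate(pstate)
            time.sleep(period_s)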
>>: That assumes the workload is CPU bound; right?
[indiscernible]
Shief Reda: And actually that would be one of the things we're going to leverage in the rest of the talk.
So the problem is, it ignores the software. And it ignores the fact that we have multiple cores and multithreading. So when we started this project, we thought can we devise better power management
techniques by looking at both the software and the hardware in conjunction.
>>: So what are the symptoms of that problem?
Shief Reda: So the symptoms will appear when you look at the advantage of the technique
[Indiscernible]. It's going to come up right now. You're going to see like "okay."
So in here, we don't just have DVFS as a degree of freedom. We actually have the number of threads. So for instance, if I launch my application with just one thread, I'll get a different power consumption than if I launch it on four threads. So this means really that DVFS and the number of threads should both be considered in determining my power consumption. And that's even true if the threads belong to workloads that don't even have anything in common with each other.
So what we advocate is actually controlling the number of threads and DVFS, but controlling the number of threads adaptively during runtime is not an easy task. So what we do instead is launch the application as usual, say on four cores, launch with the four threads, and then we apply what we call thread packing. So thread packing meaning that I have four threads in here, and then in that case if I want to switch to a lower power mode I can pack the four threads into three cores, basically by setting the affinities of the four threads to those three cores. And by doing that, the fourth core becomes idle, basically, and the clock gating mechanisms in the processor will automatically shut this core down, saving some power consumption. And you can apply DVFS still on top of it. And if I would like to be more aggressive, I can pack four threads into two cores. And eventually if I really would like to be even more aggressive, I can pack four threads into one core.
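A minimal sketch of thread packing on Linux: os.sched_setaffinity accepts a thread ID, and every thread of a process is listed under /proc/<pid>/task, so packing is just a matter of shrinking the affinity mask. The process ID here is hypothetical.

    import os

    def pack_threads(pid, n_cores):
        # pin every thread of process `pid` onto the first n_cores logical CPUs
        cores = set(range(n_cores))
        for tid in os.listdir(f"/proc/{pid}/task"):
            os.sched_setaffinity(int(tid), cores)

    # e.g. pack_threads(12345, 2) packs a four-thread run onto two cores; the idle
    # cores get clock-gated by the processor, and DVFS can still be applied on top.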
So by applying this packing, plus DVFS, we can get a better power capping and power management technique. So we call this technique pack and cap, and it appeared in MICRO a few years back, two, three years ago.
>>: Do you consider the increase in temperature on the core that you're concentrating stuff on?
Shief Reda: The temperature does increase a little bit. But actually the temperature of this core will
slightly increase. Could be a degree. It's a good point. But keep in mind that when you have all the
threads running on all the cores, the whole chip overall is consuming a lot of power. So really the
temperature of this core would be its own -- coming from its own power plus the heat diffusion.
>>: It's more like a hotspot as opposed to having a temperature that's running across [indiscernible].
Shief Reda: It's an important question. But let's say imagine the thread is running in here. The
question is, is this thread running this core at full utilization? If it's running at full utilization and you
pack those things, then you're switching from sort of space multiplexing into time multiplexing, and
that time multiplexing is not going to necessarily increase the power of this core's temperature.
However, if this thread was IO dominated and you get some time where the core is not running, then
moving something in there will increase its utilization and increase its temperature. But if it's a regular
thread, then effectively by doing this, we actually found that it's no different than launching just one
thread on that one core in terms of performance.
>>: It will take longer.
Shief Reda: It will take longer to finish.
>>: [indiscernible].
Shief Reda: So this will go next. So we'll show this.
So let's first look at the first benefit from this technique compared to just DVFS. So let's look at one
metric, which is the dynamic power range. Which is if I launch application, what's the highest power it
can consume and what's the lowest power it can consume? So with the traditional technique, DVFS,
the difference would be basically the difference between the highest DVFS setting, this one, minus the lowest power setting, and this gives me about 5 watts in the servers. But by applying this packing technique on top of DVFS, we see those purple lines are actually achieving a larger power range. We're able to go into deeper states because we're packing things, so we're increasing -- we're extending the power range. So this is on a 4 core processor. And if you have an 8 core processor with two different sockets, it becomes more interesting, because as you start packing things into one socket, you can even make the first socket totally idle so that you can shut it down entirely. So you can actually get a 40 percent improvement in dynamic power range.
So a larger power range basically means better energy proportionality. And everybody is saying that servers basically don't have good energy proportionality. By doing this technique, you're actually able to extend the range even further. Let me show --
>>: Are these real results?
Shief Reda: This is a real server. These are the real results.
>>: So how -- I'm curious, how hard is it to move the thread that's running from one core to another, or
is it potentially [indiscernible]?
Shief Reda: It's very easy.
>>: How sticky are these cores?
Shief Reda: It's easy. There's a function in the operating [indiscernible], it's called [indiscernible] affinity, and you give it the process ID and you give it the set of cores that you prefer it to run on.
>>: Right.
Shief Reda: So you can even set its affinity not just to one core, you can say run on two cores, or run on three cores. So it's relatively easy.
>>: And it's pretty instantaneous, is it?
Shief Reda: I mean it's certainly within the time limits of the kernels [indiscernible].
>>: Every five milliseconds.
Shief Reda: Yeah. So DVFS is in the microseconds. You change DVFS in the microseconds; you change this one in the milliseconds.
>>: And you got to dig the cache misses to [indiscernible] --
Shief Reda: Yeah, that's a good point. Yeah. And in fact it becomes even trickier if you have multiple
sockets. Because if you have [indiscernible], you have non-uniform memory access, then switching it to
a socket that doesn't have its data would be harder, yeah.
So let's look at this data in more detail. So let's take one application in here from the PARSEC
benchmark and launch it with eight threads, it's called black [Indiscernible], and then we configure it to
pack it in one core, two cores, four cores, et cetera. And then for each configuration, we look at the
power consumption in here. This is just a small slice of it. And this is a hundred billion instructions.
And look at the runtime. So every point in here corresponds to, you know, we have two degrees of
freedom, the number of cores and the DVFS. So every one of those represents the change in the DVFS
while one core is active. In here two cores are active and then we change DVFS, over in here is eight
cores, and you change DVFS.
So one thing that becomes clear is my previous slide's point. You see the range. If I'm running it with eight threads and I change DVFS, this is the power range that I get, the smaller arrow. But if I'm able to pack things then I extend my power range, I'm able to go with the server all the way down to 120. So I can achieve lower power caps if required for my datacenter or for my servers.
The main question is, if we say we want to cap the power consumption, say, at 140 watts, it's different than DVFS. Now, there are so many combinations or pairs of settings that you can use to achieve that cap. So when you have multiple choices, now it becomes interesting, because one choice will actually give you the best performance. So this is what we're after right now; things are becoming more interesting. You have a bigger search space. So what we did is that we plotted the Pareto frontier, where basically the settings along the [indiscernible] line would give you the best trade-off between runtime and power consumption. So the optimal settings basically would be on this Pareto frontier. And
the question is how can we figure out this Pareto frontier.
So let's take a look in here. Say you don't have any limits in your power consumption, you're running
with eight active threads on eight cores, and then you want to cut down power consumption. Let's see
what's happening to the Pareto frontier? Is it moving across the cores or is it moving first along DVFS? It seems it's -- it seems that if you want to cut power, it's better to cut it with DVFS than to move across the cores to cut it. That gives you the better runtime, the better performance. So as you go down in here, you try to cut DVFS first; at some point DVFS will run out of steam, then you switch, reduce the cores as little as possible. So you just move a little bit in here, and then play the game again, the DVFS game. So it seems that it's better for performance to keep a larger number of cores than to
actually have a higher DVFS.
>>: What kind of [indiscernible].
Shief Reda: What is what?
>>: Thread synchronization.
Shief Reda: Threads synchronization? So we're not doing anything. We're just leaving -- oh, this one,
this is from the PARSEC application. So there -- I think the MP thread, the -- sorry.
>>: The P threads?
Shief Reda: Yeah, the P threads.
>>: But is this spinning or blocking?
Shief Reda: Well, each application will have its own programs.
>>: If this spinning synchronization, right, it would make sense that you would prefer to stay on
multiple cores.
Shief Reda: Okay.
>>: Right? Because as you concentrate more, these guys are going to be spinning, waiting for things
that are not going to happen because the other threads are not running.
Shief Reda: Right, so we've seen -- so I'll tell you something in here. Actually I'll come to this point
later. But I haven't seen anything like that in the PARSEC benchmarks. Their behavior changes a little bit. But I've never seen a single instance where you actually prefer to use DVFS over the number of cores. So it's a little bit -- so there could be a chance that there's a system out there with 16 threads and the programmer writes it in such a way, or maybe it's two cores or four cores, but a limited number of cores, and you do aggressive packing, and maybe the programmer did not write the code in the right way.
>>: I think the issue is the implementation of the P threads; right. If it's spinning implementation, the
result will probably be different than a blocking implementation, or synchronization.
Shief Reda: Probably there's a good chance -- I mean in all cases the program will work. We're not talking about --
>>: Yeah I know, it's just a matter of how long it's going to --
Shief Reda: Yeah, yeah, exactly. Yeah.
All right. So let's look at the other slides. Suppose I have the least power cap and I actually try to
increase it. Well, let's look at it in here: whenever I had some room to increase the number of cores, the
Pareto frontier just suddenly jumped to the other one. So this is a critical point in here. I have some
increase in my power budget, it didn't go up there, it actually preferred to increase the number of cores.
So that gives us our big idea in here for the Pareto frontier: to maximize the performance for a given cap, you actually want to activate the largest number of cores that you can allow within the cap before you actually try to increase DVFS. And this observation is repeated
across all the PARSEC benchmarks, all of them have the same kind of behavior.
So that gives us the idea for our approach, the first approach. The first approach is that this is our server we have right now, two degrees of freedom, DVFS and the number of cores. We're going to have some controller for it, which I call the multi-gain controller. We can measure power consumption -- say that's the one approach. And then you have a power cap. And the difference will be given to this controller, and the controller will tweak the DVFS and the number of cores. So the trick is that if you have a negative power slack,
meaning that you're consuming more than the cap, then you actually want to cut down DVFS before
you start to degrade the number of cores. Because we said if you want to maximize performance, you
want the highest number of cores. And if you have a positive power slack, meaning that you are
consuming below the cap, then to improve performance, you actually want to engage, increase the
number of active cores before you actually increase DVFS. So that's the -- so you see how it's working.
But it's the same -- it's a different strategy, but it boils down to the same concept, that you want to
maximize the cores before the frequency.
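In a minimal sketch, that cores-before-frequency priority looks roughly like the following; the single-step moves are a simplification, since the real controller scales each move by the gains discussed next.

    def pack_and_cap_step(power_w, cap_w, freq, cores, freq_min, freq_max, n_cores):
        # one control step of the cores-before-frequency policy (simplified)
        slack = cap_w - power_w
        if slack < 0:                      # over the cap
            if freq > freq_min:
                freq -= 1                  # cut frequency first
            elif cores > 1:
                cores -= 1                 # only then pack onto fewer cores
        elif slack > 0:                    # under the cap
            if cores < n_cores:
                cores += 1                 # unpack onto more cores first
            elif freq < freq_max:
                freq += 1                  # only then raise frequency
        return freq, cores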
>>: But the reaction could be quite different; right? Increasing the DVFS is easy, right? You're not
changing anything. The setting of a different number of cores, this allow that system [indiscernible].
Shief Reda: Like Ricardo was saying, the caches, for instance --
>>: Right. So you can't afford to make that kind of mistake. Because you can't predict how much the
power consumption is going to be. You can give it more cores, oh, it's too high and drop down and
give it more --
Shief Reda: So actually you want to -- so actually it's different. When you want to give it more power,
you increase the number of cores; but you want to go down, you will say I keep my cores, I try to
decrease DVFS. So if you have a positive power slack, increase the number of active cores. And then
suddenly power caps go down and you say I have to reduce my power, then go in DVFS first. Try and
reduce the DVFS. Because it's the number of cores that you want to maximize. It's -- think about that
number of cores is the first priority you want to be using.
>>: So I'm looking at, for example, the jump from one core to two cores; the jump would set the lowest DVFS at two cores. And if that's too high for whatever the cap is, you have to drop back.
Shief Reda: So you have to reduce first DVFS. And if you reach the lowest DVFS, you say okay, I
give up, there's nothing I can do except cutting down the cores. Then you switch the number of cores.
>>: Okay.
Shief Reda: But there is -- I think what you're trying to say is if I jump here immediately and I know at
the bottom there's nothing I can do except adjust the cores --
>>: Yeah.
Shief Reda: Yeah. So if you're at the bottom already of the DVFS, then you will have to switch back
to one core.
>>: You wanted to be pretty conservative and give it more cores.
Shief Reda: Yes. The other big question is how often things are going to change. If things are going to
change every ten seconds, maybe the milliseconds like you were mentioning in the operating system are not
going to be an issue. But if your, actually, power cap is changing at a higher rate, then that's a problem.
>>: These benchmarks that you test, I assume, stress the CPUs.
Shief Reda: These are classic benchmarks. They have a good range. Some of them do stress, some of
them are more memory bound. Some of them are scalable, meaning you increase the number of threads and the performance increases linearly. Some of them are not scalable, they have synchronization issues, so you try to scale them and the performance doesn't increase as much. So they have a good variety,
those PARSEC benchmarks.
>>: So if you -- say you're running eight cores at ten percent utilization, and then you want to just
reduce the amount of power by -- power capping such [indiscernible]. Do you see any difference
between what you're doing here and what the operating system should do ordinarily with power
management enabled? Because the OS will be core parking.
Shief Reda: I think the operating system will actually -- I think most right now will just use DVFS.
They won't play any of those tricks with the number of threads and packing them or anything. Right
now the assignment from threads to cores is based on fairness basically.
>>: You said, I believe, the OS will park cores; it will consolidate threads.
Shief Reda: Okay.
>>: If the utilization of the OS --
Shief Reda: The utilization, not power.
>>: Yeah. [indiscernible]
>>: So it would be interesting if you have a low utilization test to see sort of the comparison between the OS on its own versus [indiscernible].
Shief Reda: What I've seen is that the operating system, at least for the Linux kernel, is always maintaining fairness between the cores; that's what it's trying to do.
>>: So you tried to sort Windows 7 -- [indiscernible]?
Shief Reda: No, not Windows 7.
>>: And these applications tend to run full tilt on the CPU; right? It might be that they're memory bound, but as far as the operating system is concerned --
Shief Reda: You don't get utilization of 10 percent.
>>: Yeah.
Shief Reda: So we're talking about --
>>: So it's sort of like what we do with --
>>: For you to do that, you'd have to use like fewer threads than cores.
Shief Reda: Yeah, exactly. Yeah. Yeah. So in that case, it's already packed in some way. So, okay,
that's a good point. Like you're saying, if I have just two active threads and I have eight cores, then in that case there's no reason for the operating system to try and maintain fairness by switching the threads around. Because that's what the operating system does. I see it with the camera, it switches the threads around between the cores, not for thermal, but to maintain fairness between the cores. So you don't want to do that. You want to pack them --
>>: So I'm just looking. I think XP used to do that, and then Windows 7 changed that.
Shief Reda: So another thing with this multi-gain controller is this idea that the impact of changing a setting, say DVFS or the number of cores, is actually a function of the value of the other setting.
So for instance, if I have eight active cores and I change my DVFS, that's how the power is going to
change. So the average impact of the change is 5.61. And it's going to determine the gain of my
controller. That's impact. I want to see should I increase DVFS by one setting, two settings. And that's
the average gain.
However, if I have one active core and I change my DVFS, you're not getting the same change in power consumption; the average change is 0.9 watts. It makes sense. But the idea in here is that I can't just have a controller with one gain -- actually my gain would be a function -- my gain for DVFS is a function of how many active cores I have. The same thing applies on the other side. If I fix my DVFS and try to
change the number of cores, say from one to eight, the average impact for power increase on the server
is about 9.9 Watt. But if I'm running at a low frequency and I try to increase the number of active
cores, then the average impact is 6.1.
So really you have a lot of combinations in here, and for each one you have to characterize it against the value of the other setting to figure out the average gain.
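A sketch of what that gain lookup might look like; the entries below are just the end points quoted above, and a real characterization would fill in every combination.

    # watts per DVFS step, as a function of how many cores are active (assumed table)
    DVFS_GAIN_BY_CORES = {1: 0.9, 8: 5.61}
    # watts per added core, as a function of the current frequency (assumed table)
    CORE_GAIN_BY_FREQ = {"low": 6.1, "high": 9.9}

    def dvfs_steps_for(slack_w, active_cores):
        gain = DVFS_GAIN_BY_CORES.get(active_cores, 3.0)   # fallback is a placeholder
        return round(slack_w / gain)                       # how many DVFS steps to move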
>>: Why is it the most gaining --
Shief Reda: What is that?
>>: Why is there any scaling? Why is everything not scaled?
Shief Reda: Because you have just one active core and you're changing it. So think about -- say here
you have active cores eight; right? Let's multiply 8 by .9. You're actually getting close to that number.
It's just that you have so many of them so they add up a lot. Because you have more active cores. So
this is just inherently the power change of one core. But if you have a whole bunch of them,
then their power will aggregate and it will become big.
>>: And [indiscernible] compared to power.
Shief Reda: How is what?
>>: How would the [indiscernible] as opposed to power.
Shief Reda: So the energy is the runtime, and the power tradeoff that I -- I think it was a few slides
back. It was in here. So that's how different power is going to look like. But the goal is to maximize
performance within a certain power cap. So that gives us the idea for the multi-gain controller. And some servers actually don't have a power meter. So you have an uninstrumented server, some old one or some with very low sensing. But we've done techniques where you can basically project the impact of the settings on the power consumption. And many people here from this group also did techniques where you can model power consumption without a power meter. And the basic idea is that you have some performance counters, say from the CPU, that a toolkit will give you, and operating system metrics like utilization, memory, IO, disk transfers. And then you can combine those to learn a model off-line. And that model would link basically the DVFS and the number of active cores with power, and you can use this as a proxy for the power meter.
And basically during runtime you would query this model with the different possible settings that
you're thinking of. So if you have nine DVFS settings and four cores, I have 36 combinations. And you can use this model to evaluate the power consumption for the 36 combinations, and then from these combinations I would see the ones that consume less than the power cap and pick the winner. The winner would be the one that has the highest number of cores, because that's what we want to do, we want to traverse this Pareto frontier. So we will always select the combination in here from the model that gives the highest number of cores, before DVFS.
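A sketch of that selection step, with a made-up linear power model standing in for the offline-learned one (the coefficients and feature set here are illustrative, not the actual regression):

    def predicted_power(freq, cores, util, mem, io,
                        coef=(20.0, 6.0, 9.0, 0.3, 0.01, 0.01)):
        # toy linear model: base power plus one term per setting and per OS metric
        c0, cf, cn, cu, cm, cio = coef
        return c0 + cf * freq + cn * cores + cu * util + cm * mem + cio * io

    def best_setting(cap_w, freqs, max_cores, util, mem, io):
        feasible = [(n, f) for f in freqs for n in range(1, max_cores + 1)
                    if predicted_power(f, n, util, mem, io) <= cap_w]
        # prefer the most cores, then the highest frequency, as on the Pareto frontier
        return max(feasible) if feasible else (1, min(freqs))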
So let's look at some quick results. This is just -- I mean we have so many detailed results, but this is
just a quick overview. This is just a baseline controller. We want to compare our technique with
something else, so we said let's implement a baseline controller that uses just a single gain like people
do. And it almost prioritizes DVFS. So instead of -- it's going to change both of them, but always
plays with DVFS. So as a result it's not going to be working on the Pareto frontier.
>>: So what do you mean by a single gain controller? This is a PID controller?
Shief Reda: Yes, standard controller where you just have, you want to change DVFS, then you just
have one gain irrespective of how many active cores you have.
>>: I see. But for a PID control, there's --
Shief Reda: Yeah. We actually don't use an I. It's only a P controller in here. P controllers are good enough. Because there's no integration of the error. P/I controllers are good when you're getting integration, like data accumulates. In here --
>>: I'm surprised that the D is not involved either.
Shief Reda: Well, there's no -- it just works. I'll show you some results.
>>: Okay.
Shief Reda: So this is the standard technique. It doesn't work well -- there's a big tracking error, which is the difference from the cap. So let me explain this in detail. In here in this experiment we basically changed the cap randomly every five, I think, or six seconds. So basically the blue line is the power cap of the server, the red line is the actual power consumption from the server, using this technique. And we're actually showing, during this runtime, the frequency and the active number of cores. So this is what people would have thought it should be implemented like.
This is our pack and cap controller. So you see very accurate tracking in here. And because it has this
multi-gain controller, it basically reduces the power tracking error to 37.5 percent. And the actual
runtime, if we run these applications all to the end, all the applications, you will see that we, on the average, have 10 percent less runtime. And the reason for the lower runtime is basically we
try to follow this Pareto frontier, this idea of trying to maximize the number of cores before DVFS.
>>: So here, it's still a proportional controller, you just change the gain that --
Shief Reda: That's a function of the other setting. Yeah. And implementing this idea of trying to
maximize the cores before DVFS.
>>: Almost like --
Shief Reda: It's a closed loop. So this one is the closed loop, this one.
>>: I know this was --
Shief Reda: Yeah.
>>: So the result has the power meter in it.
Shief Reda: Yeah, this one has the power meter. These first two ones have the power meter. This is just like a reference for us to show that we are indeed doing better by following this Pareto frontier.
This one is without the power meter. So there's no power meter connected to the system, we changed
the power cap. But we can model power. So actually this works really well. This one kind of fell off,
but for the rest, it's actually going up and down. And there's no power meter associated with it.
>>: Well, that's a big problem. If you go 10 seconds [laughter].
Shief Reda: But still, for no meter, it's still a good deal.
>>: Well, the circuit breaker's still going to trip and [indiscernible] [laughter].
Shief Reda: And if you assume that the errors from different servers are like one is high, one is low,
then maybe they will add up to zero. Over all, you obviously have higher error than if you have the
meter. You also get less reduction over time. But overall, it's actually working pretty well.
>>: You should probably have some emergency thing, right, that if you realize that you have not been able to lower the power in time, you just go down and --
Shief Reda: Yeah, you just default it to some low value.
>>: Right, because that over there is [indiscernible] [laughter].
Shief Reda: I think -- I was impressed that it still works without a meter. So those are the techniques we've seen to increase the energy efficiency of a single server by using software and hardware techniques together. So let's upgrade things a little bit and move on to cluster level power management.
So for cluster level power management, I will talk about two parts. One of them is homogeneous, and one of them, where it becomes more interesting, is when you have heterogeneous clusters.
So for a single server, quantifying performance is easy, it's runtime. How are we going to do it for a
datacenter? So this is from the literature, and what we think is one of the best ways to characterize performance is to look at the application normalized performance. Meaning that you look at the ratio between the runtime in the ideal case of execution, meaning by itself at the highest power cap, versus the runtime at some other condition which is not ideal. So that's the application normalized performance. The whole performance of the datacenter, the system normalized performance, will be the arithmetic average of the ANPs. Some people also use the geometric average. We didn't really see any difference between the two of them. So you can use the SNP.
If you use reciprocals of the ANPs, that's the slowdown. That gives you sort of how much applications are slowing down on the average from your policy.
Another good metric is unfairness, which is the standard deviation of the ANPs. If you have a big standard deviation, then you're unfair in some way, because you're penalizing some applications more than others.
So these are the metrics we're considering for a cluster.
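For concreteness, those metrics can be computed as follows; an ANP of 1.0 means the workload ran as fast as in its ideal, uncapped execution.

    from statistics import mean, pstdev

    def anp(runtime_ideal, runtime_actual):
        return runtime_ideal / runtime_actual       # application normalized performance

    def cluster_metrics(anps):
        snp = mean(anps)                            # system normalized performance
        slowdown = mean(1.0 / a for a in anps)      # average slowdown
        unfairness = pstdev(anps)                   # standard deviation of the ANPs
        return snp, slowdown, unfairness

    # e.g. cluster_metrics([1.0, 0.8, 0.6]) -> (0.8, ~1.31, ~0.16)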
So let's look at the easiest scenario for cluster management. All your servers are basically identical. So
there's no inherent advantage from sending different workloads to different servers. You can pick any server and send the workload there, if you ignore cooling. The main degree of freedom in here that you can play with
is how much power you're going to give to each server. So assuming in here these servers have this
power cap controllers, and the question is how much you give to each one of them. So maybe people
will say, "Well, take the whole power budget, one megawatt, you have 10,000 servers, you know, give
each one 100 watts uniformly." That's not really the best way to go about allocating or budgeting
power on a cluster. Why is that? Because you have --
>>: It's not feasible. [Indiscernible].
Shief Reda: Huh?
>>: The best way is not feasible. There's a big packing problem.
Shief Reda: Yeah, so this is -- I'll actually show in a little bit.
So the idea in here, it's actually more of a knapsack problem, that's what we found. The reason you don't want to do this uniformly is that basically different applications will have different characteristics, and they will translate the power that you give to them differently. So these are four servers, they have the same identical configuration. But they're running different applications from the SPEC and PARSEC benchmarks. And you see that for each one of them, when you give it more power cap, basically the ANP or the performance changes in a different way. So this means really the uniform formula is not the best strategy. I should allocate it in a different way. So for instance, like you were saying, memory-bound applications usually do not leverage increases in power into increases in performance, so maybe I shouldn't give them too much power to begin with. So we need to find a better way to budget the power.
So let's look at a scenario in here and what people did. Let's say I have four servers, the same four servers, and I only have a budget of 520 watts. So that's really the floor of my power consumption, because it means that each server will get 130 watts, the lowest power cap that I can ever give. Suppose now that there's some good news and the company wants to spend more; so in that case we have 600 watts. So there are 80 more watts that we have available to budget among these servers. What can we do? One of the techniques in the literature, the greedy technique, basically ranks the servers by their energy efficiency, meaning it looks at -- it divides the performance by the power and starts allocating this extra power based on the energy efficiency of the server, where servers that are more energy efficient will actually get more power. Because we think that they're energy efficient, so in that case they can leverage it. So in that case, the most energy efficient is this one, we'll give it 35 watts; out of the 80, I still have [indiscernible]. Let's go to the next one, it's .68. We'll give it also another 35, so that's 70 out of my 80 watts. There are 10 left. I give them to the next most energy efficient, this one, and then finally this one will not get anything. And if we do this, this will be the ANP that you
will get.
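A rough sketch of that greedy baseline; the 130-watt floor and 35-watt chunks mirror the example above, while the efficiency values are made up.

    def greedy_budget(efficiency, total_budget, min_cap=130.0, chunk=35.0):
        # every server starts at the minimum cap; the leftover budget goes out in
        # fixed chunks to the most energy-efficient (performance/watt) servers first
        caps = {s: min_cap for s in efficiency}
        remaining = total_budget - min_cap * len(efficiency)
        for s in sorted(efficiency, key=efficiency.get, reverse=True):
            give = min(chunk, remaining)
            caps[s] += give
            remaining -= give
            if remaining <= 0:
                break
        return caps

    # e.g. greedy_budget({"s1": 0.71, "s2": 0.68, "s3": 0.55, "s4": 0.40}, 600.0)
    # gives s1 and s2 an extra 35 W each, s3 the remaining 10 W, and s4 nothing.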
Can we do better? Can we do better than that?
>>: That policy seems to be [indiscernible].
Shief Reda: Yeah. Well, we'll see that we have good improvements over it, so this is indeed counterintuitive. So let's imagine we actually have all these values. If we can actually have that ANP or the
performance for the different power caps, can we do better? And like you guys were already heading, we can actually think of the knapsack problem. So if I'm a thief and I have a knapsack with a certain weight capacity and I have some items each with a value and a weight, I want to basically pack those items into the bag to maximize my value. So we can see that this problem is connected to ours. In that case
really each item is basically a server and a workload combo. And in that case it has multiple choices
for power. So for each choice there's a power, that's the weight, and there's a value, which is the
performance. And I want to pack -- I want to make my choices for each item to maximize the total
performance, the total value subject to a given power budget. So that's the knapsack problem.
So if we had these values and solved the knapsack problem, these are the actual optimal choices that will
come up, which indeed shows that it was counter-intuitive for the greedy to do this choice. And we
could get in this simple case 13 percent improvement in performance.
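A compact sketch of that multiple-choice knapsack: one group of (power, ANP) choices per server, exactly one choice per group, total power at most the budget, solved by dynamic programming over integer watts.

    def mckp(options, budget):
        # options: one list per server of (power_watts, anp) choices
        NEG = float("-inf")
        best = [NEG] * (budget + 1)
        best[0] = 0.0
        for choices in options:                    # one group per server
            nxt = [NEG] * (budget + 1)
            for used in range(budget + 1):
                if best[used] == NEG:
                    continue
                for power, value in choices:       # must pick exactly one choice
                    if used + power <= budget:
                        nxt[used + power] = max(nxt[used + power], best[used] + value)
            best = nxt
        return max(best)

    # e.g. mckp([[(130, 0.5), (165, 0.8)], [(130, 0.6), (165, 0.7)]], 300) -> 1.4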
>>: So how much profiling do you have to do to get this table?
Shief Reda: So that will be my next couple of slides.
So how can we get those values? Those values are unknown to begin with. So how can we fill those
entries? So that's a big challenge. So we have to do some modeling in here. So the idea is basically
we can capture the characteristic, the ANP versus power, say with a function like this, a quadratic function. So this quadratic function is giving us the performance of a certain workload on a server as a function of the power cap that you give it. Now, these model coefficients, people in the past have thought of them as fixed values. If you have just one kind of workload, then they are fixed in value, but if you have different workloads, then these coefficients are really going to be capturing the characteristics of the application.
So what we are doing is we're actually having those coefficients be functions themselves of the application characteristics. So what are those characteristics we look at? At any time you're running the server, you actually have three characteristics that you can capture, and we actually use these to make the coefficients dynamically changing. One of them is the number of instructions executed per second by the server, which is basically corresponding to utilization. The other one is its current power consumption. That's another important value. Another important metric is the last level cache misses. So last level cache misses are an indicator of how memory-bound your application is. And by incorporating those three metrics, we can actually determine values for those coefficients that are actually a function of the application characteristics. So if you go back to our
example, let's say you have some power caps right now that are running and this is the ANP of the
servers. Well, in that case, here is our ANP predictor. And at those running points you collect the
through-put, you collect the power and the LLC misses. These are measurable and available. You send them to the predictor. Oops, where is it? Why is this? Why is the animation not working? What happened in here?
Sorry about that. I don't know why it's not --
Okay, you send those characteristics to the ANP predictor. The ANP predictor is going to use them. Here it is, okay. You send those values in there, and then the ANP predictor is going to calculate those predictions, and once you have those predictions, you can run the knapsack -- the multi-choice knapsack algorithm to get the optimal results. So this is --
>>: How long does it take to do that for a fairly large cluster?
Shief Reda: So the runtime will be proportional. I mean the runtime of the DP is basically the number
of power caps you have times the number of servers. So it's all of --
>>: So number of servers or number of server types?
Shief Reda: Number of servers.
>>: So [indiscernible].
Shief Reda: Yeah, you could.
We can say -- you could lose a little bit of efficiency. I mean you want to trade off both of them. You
can divide your datacenter into sub-clusters and solve the problem for each sub-cluster. So you lose a
little bit of the global picture, but you manage the runtime. It's also a function of how often you want
to change the power caps. It would be a function of how often, say, the smart -- the grid is telling you
to regulate your power consumption. So that's another factor.
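As a sketch of the ANP predictor described above -- a quadratic in the power cap whose coefficients are themselves functions of the measured throughput, current power, and last-level-cache misses -- where the linear dependence and the weight matrix W are assumptions rather than the actual fitted form:

    def anp_predict(power_cap, throughput, power_now, llc_misses, W):
        # W: 3x4 matrix of offline-fitted weights (assumed), one row per coefficient
        feats = (1.0, throughput, power_now, llc_misses)
        a, b, c = (sum(w * x for w, x in zip(row, feats)) for row in W)
        return a * power_cap ** 2 + b * power_cap + c

    # Evaluating this at each candidate power cap fills one row of the table that
    # the multi-choice knapsack optimizes over.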
These are results in here; unlike Part 1 where we built everything on a real server, I don't have a cluster, so we actually have a queuing simulator for 320 servers, and that queuing simulator is using execution traces from the actual server. So we collect execution traces of the through-put, the LLC misses, the power consumption over time, at various sampling rates for the applications, and then we use them as part of our queuing simulator.
So when we compare this -- first of all, we apply different power budgets and we adjust the power caps, and we look at the SNP. The bottom lines in here are the uniform and greedy results, and then above and on top of them is our approach, the predictor plus the knapsack. And the Oracle plus knapsack, here it is, the upper bound. If you really truly knew the actual values in the table, how much performance would you get? So that gives you an upper bound for what our approach is doing.
Eventually if you have a really, really, really low power cap, or a really, really, really high power cap, all these algorithms are going to behave in the same way, because you will give them as much power as they need or you will just starve them of power. It's really the range in between in here where these algorithms start to diverge from each other. So that gives us about a 4.4 percent improvement in the SNP over the greedy method that was proposed. And obviously greedy is not really doing well with [indiscernible] with respect to applications, so we improve the fairness dramatically. And that's actually important because in some cases datacenters have service level agreements where you try to basically make things more consistent. So that's another important metric.
>>: How is that fairness measured again?
Shief Reda: It's the standard deviation of the ANPs.
>>: The SNP sounds like a good metric, but could you give a little more explanation about what is the
impact of 4 percent improvement or how useful it is or --
Shief Reda: You're getting 4.4 percent improvement in performance for free. Because you're respecting
the same power budget. So you can twist it the other way, meaning I can reduce my power while
maintaining my performance. Right? So you can think of it two ways depending on what you care about. Do you want to maintain the same performance and cut power? Or do you want to maintain a certain power while maximizing performance within it?
>>: So this is the SNP for the full cluster --
Shief Reda: For all the clusters.
>>: So it's actually multiplied by 4 percent of whatever kilowatts you have.
Shief Reda: Exactly, exactly. So it's not just -- that's the nice thing about a datacenter, it gets multiplied. But we can get much better improvements than 4 point something -- I think it's still pretty good for a datacenter, because that's a large amount of savings. But we can do better if you have a heterogeneous cluster. So I'll explain some techniques right now; that's the second part for the cluster. And we're proposing something in here more dramatic for heterogeneous clusters. So for heterogeneous clusters, I'm assuming in here of course that you're -- there are a number of reasons why you would have heterogeneous clusters. One of them is some applications could be good on a certain kind of server, but not good on another, and some applications are good for this one, but not good for this one. You
but not good with the other, and some applications are good for this one, but not good for this one. You
also have multiple updates and multiple deployment cycles in the datacenter, so there actually is a
chance there will be clusters of different machines. And assume of course that when you go
purchasing, you purchase them by the racks so the servers within the same rack need to be all kind of
the same configuration.
So what's happening in here is that workloads come and we can assume we have their characteristics, if they've been running in the datacenter for a while; and if we don't, you can predict them. And in that case our HetMan job scheduler basically will map those applications periodically to the free servers in here, and it tries to map them so as to, say, maximize the performance within a power budget.
It can also map them as single workloads on free servers, or it can map them to busy servers, say, with 50 percent utilization. If you have a co-runner, if you enable co-runners, and you have multiple cores to leverage, then it can also do co-running.
Periodically, every once in a while on a much longer timescale, HetMan would actually reassess the
whole mapping of workloads in the datacenter and essentially shuffle things around if it can give you better results. And the reason it would do this is maybe that the first choice in here, when you map an
application to a free server, maybe it did not pick the best free server for it.
>>: So when it's reassigning the workload, is it changing the power cap for each server?
Shief Reda: Okay --
>>: Or a power cap for each server and then --
Shief Reda: So I actually did not include the power cap in this one. So for the homogeneous cluster, the
power cap was my only degree of freedom, so I used it over there. In here I thought we actually have
the possibility to map the jobs around to maintain the power without actually touching the power cap.
So I'm assuming in here that servers will be running at their highest power cap. But this is an area that
I think potentially could be included. The degrees of freedom in here I'm choosing are mapping and
core running and migration. So these are the three degrees of freedom. So I'm not playing with the
power caps in here any more. Because I think that we can meet a certain power budget by just
assigning things around, without touching the power cap. But I think it's definitely relevant and we
want to do it. I'm telling my students we need to include the cap controllers in here too.
>>: So in here what is the goal here? Is it to --
Shief Reda: The same goal we have always, maximize the performance per power budget. Except --
>>: So the power budget is whatever power budget has been given to that server.
Shief Reda: Exactly, to the cluster. Just like -- so in the knapsack that was the goal in here; right? We
had a power budget for the whole cluster and we were distributing it among the servers to maximize --
>>: That's exactly my question. In this next setup --
Shief Reda: In this setup, instead of using the power caps, we will leverage the fact that we have
multiple servers with multiple power-performance tradeoffs, and we can leverage that to achieve the
same effect.
>>: Okay.
Shief Reda: Good questions.
All right, so let's look at the first one in here, which is that you get some jobs in this queue, and very frequently, you're mapping these jobs in the queue onto the free servers. So this looks like the classical [indiscernible] matching problem. It's very similar to it. You want to match the things in here, in one part of the graph, to the other one to maximize your performance. But there's a little bit of a trick in
here, because we also want to cap the power consumption, which makes it different. So in that case if
we have -- if you, say, executed this benchmark before, or we have some predictor that can tell us what will be the ANP of executing workload I on server J, and what will be the additional power over idle from executing P over J -- I over J, sorry, then we can actually write an ILP, like this one. And when you solve this ILP optimally, it seeks to maximize the SNP such that each workload gets assigned to a
server, and each server has at most one workload, and the total power budget is met. So we're meeting
the power budget this time by shuffling things around.
Now, because of this constraint, this actually has to be solved as an ILP, which is a problem because of runtime. So we relax it and we round the solution. We found that when you relax it and round the solution, in 99.99 percent of the cases you don't get, in the rounded solution, any violation of the constraints. But sometimes it can happen that two workloads get assigned to one server while another server is free. So in that case we have a little clean-up procedure: if the rounded solution is not correct, then we have a heuristic to fix it. But it works nicely. So this is if you have jobs in the queue, you have free servers, and you want to map the jobs to them.
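As a rough sketch of the relax-and-round step just described -- assuming the per-pair performance perf[i][j] and power-over-idle power[i][j] come from prior profiling or prediction, and using scipy's LP solver as a stand-in for whatever solver was actually used -- the idea looks like this:

```python
# Illustrative sketch (not the speaker's code): relax the assignment ILP to an LP,
# solve it, then round. perf[i][j] = predicted ANP of workload i on server j;
# power[i][j] = predicted power over idle; budget = cluster power budget.
import numpy as np
from scipy.optimize import linprog

def map_jobs_to_free_servers(perf, power, budget):
    n_jobs, n_srv = perf.shape
    n = n_jobs * n_srv                         # one x_ij per (job, server) pair

    c = -perf.ravel()                          # maximize performance -> minimize -perf

    # Each job goes to exactly one server: sum_j x_ij = 1
    A_eq = np.zeros((n_jobs, n))
    for i in range(n_jobs):
        A_eq[i, i * n_srv:(i + 1) * n_srv] = 1.0
    b_eq = np.ones(n_jobs)

    # Each server hosts at most one job, and total power over idle <= budget
    A_ub = np.zeros((n_srv + 1, n))
    for j in range(n_srv):
        A_ub[j, j::n_srv] = 1.0
    A_ub[-1, :] = power.ravel()
    b_ub = np.concatenate([np.ones(n_srv), [budget]])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0), method="highs")
    x = res.x.reshape(n_jobs, n_srv)

    # Round each job to its largest fractional server; a real implementation would
    # add the clean-up heuristic mentioned above for the rare conflicting cases.
    return {i: int(np.argmax(x[i])) for i in range(n_jobs)}
```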
>>: So this approximation, are you solving the ILP as an LP?
Sherief Reda: We're solving, yeah, exactly --
>>: And then you round?
Sherief Reda: And then we round.
Now, so this is if you're mapping the jobs to free servers, but in many cases you actually want to -- you have high utilization, and then you want to apply co-runners. You want to co-run applications together. The problem with co-running applications is interference: when you run them together they're going to degrade each other's performance. So this is another experiment many people have done, including this group: you run those pairs of applications together and you see the average degradation that happens to the ANP. If you don't run them together then they get an ANP of one, but if you run them together, they degrade. And what you will see in here is basically that applications that are memory-bound tend to degrade more if you co-locate them with each other. Makes sense, because they contend for shared resources: the memory controller, the memory bus, and the main memory. So when you co-locate them on the same server, they tend to degrade each other.
So can we somehow map the workloads to the servers in such a way that we put nice people together with each other on the same server and minimize degradation? That's what we want to do. But before we can do that, we have to capture this interference in the linear program. And the way we do this is with the last-level cache misses. If we know the last-level cache misses of the applications, which is something we can characterize in advance, then we can actually have a model that predicts the degradation that will happen to workload I if it's assigned to server J that already has some existing workload.
So how does this work? Let's look at this plot. Each point in here represents a co-location, a pair of applications. And I'm plotting here the sum of the LLC misses of those individual applications. So before I put them on the same server, I profile them beforehand, I know the LLC misses of each one of them, I sum them up, run them together on the server, and that's the ANP that I will get. And what you're seeing in here is a clear trend: as the sum of the LLC misses increases, which of course means they're more memory-bound, the degradation increases. And then we can capture that degradation basically in this formula. So the only thing you need to know to figure out the degradation from assigning a workload I to some server J is the LLC misses of the workload already on the server plus those of the incoming workload; you plug them in here and you get the number.
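To make that concrete, here is a minimal sketch under the assumption that the fitted relationship is linear in the summed LLC miss rates (the talk only says it is an empirical formula learned per server class); the function names are illustrative, not from the actual system.

```python
# Hedged sketch: fit degradation of co-located pairs as a function of the sum of
# the pair's standalone LLC miss rates, one model per server class.
import numpy as np

def fit_degradation_model(llc_sums, degradations):
    # llc_sums[k]: sum of standalone LLC miss rates of training pair k
    # degradations[k]: measured average ANP loss when pair k is co-run
    slope, intercept = np.polyfit(llc_sums, degradations, 1)
    return lambda llc_a, llc_b: slope * (llc_a + llc_b) + intercept

# Usage: D = fit_degradation_model(train_sums, train_degs)
# predicted_loss = D(llc_of_incoming_workload, llc_of_workload_already_on_server)
```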
>>: So there are many simplifications here; right?
Shief Reda: There are many simplifications.
>>: No IO, and the LLC misses are additive.
Sherief Reda: So we're not necessarily assuming that they're additive, because we added them beforehand. There are two ways you can do it. You can have a model that says what the LLC misses will be if I add these two, and then use that to query another model that gives me the degradation. Or, what we do, it's just one direct model. So you use the standalone LLC misses measured beforehand, but the output is the ANP of actually co-locating them together. So it's like a shortcut directly to it.
>>: Right, but you're looking at only a pair of applications; right?
Sherief Reda: Yeah, so there's a chance -- so that's another thing open for future work. You could have
three or four -- yeah, could be more. Or however many.
>>: And we need to do this for every [inaudible].
Sherief Reda: So you actually need to -- it's a good point. This function will be one for each kind of
server. So this degradation has to be learned for every class of servers. So there's four of them in my
experiments. But if you have this degradation function, so here it is: the jobs are in the queue, and this time the servers each already have one workload. In that case you want to assign those jobs to the busy servers to still maximize the ANP. But this time there's a little thing in here, which is the degradation that's going to happen. So that's the old ANP, but I have to subtract twice the degradation. Why? Because the degradation was measured as the average over the two. So when I'm assigning this workload to this server, the new one will degrade and the original one will degrade because of the co-runner, and that's why there's a factor of two in here.
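One way to write down the adjusted objective coefficient he is describing (notation mine, not the slide's): when workload i is placed on a busy server j that already hosts workload k(j),

```latex
\mathrm{perf}^{\mathrm{co}}_{ij} \;=\; \mathrm{perf}_{ij} \;-\; 2\,D_j\!\left(\mathrm{LLC}_i + \mathrm{LLC}_{k(j)}\right)
```

where D_j is the per-server-class degradation model and the factor of two accounts for the slowdown of both the incoming workload and the one already running.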
>>: So you think that the number of workloads that you're trying to co-locate, do you think it's going
to have an impact on how good the approximation is?
Sherief Reda: So here's the trick. This LP approximation is trying to model what happens in reality, but it's not a perfect model. Its solutions, though, are definitely better than many of the heuristics in the literature. So this is what I will show. The goal is not necessarily to say my LP is a perfect mimic of reality. What you want is that it has good fidelity with the good solutions, meaning that the solution it gives [indiscernible].
>>: So have you tried to use a different heuristic, like use a simulator --
Sherief Reda: Yes, we used the Paragon paper from the guys at Stanford. So we used their model for co-runners, and we basically have better results. We just submitted this paper, so this part here is still in submission. But we have better results than Paragon.
>>: So doing two workloads per server?
Sherief Reda: There's a potential for more, but we do two right now.
>>: Yeah, we found that two is sort of what a greedy algorithm or LP can solve; it's more [indiscernible] --
Sherief Reda: I mean, I don't know what this model would look like if we had any more than two; it would look more complicated.
>>: It's not a greedy thing any more. Doing the two co-runners is already -- it's an ILP, right, so it's going to be hard. [indiscernible].
Sherief Reda: So it's no longer NP-hard; we relax the ILP to an LP and it works. But it gives better results than greedy, because greedy is basically what Paragon is doing in many of its decisions.
So another thing is --
>>: But they're really fast though.
Sherief Reda: Paragon is really fast, yeah. The LP is slow.
>>: And a question. So memory is just one of the resources, so I'm [indiscernible].
Sherief Reda: Yeah, so that's a good question. And I think it echoes what you were saying about the IOs, what Ricardo was saying.
>>: Yeah.
Sherief Reda: So in here, we're just modeling the memory-bound behavior. But if you have a cluster and there is, say, some communication with MPI, network interfaces, hard disks or SSDs --
>>: So the disk is really --
Sherief Reda: Yeah, so it's a function of your workload. We're using PARSEC and SPEC, so this works well with them. But for other kinds of applications, it's a different thing.
You know, I mentioned in the beginning how the degrees of freedom apply for the different servers, and I mentioned kind of briefly that which workloads you choose is actually a big degree of freedom that will change all your conclusions about all these techniques, including this next one. So for instance, when you map -- I'm wondering how much time I have left? Because right now it's one hour exactly from when we started. Are people okay for another --
Jie Liu: Yeah.
Sherief Reda: Okay. So the next technique: in here, if you recall, we're talking about the workloads that come into the queue and get assigned to the free servers. If you have low utilization, then hopefully you will have enough free servers from the different server classes, so you still get a good solution. But if you have high utilization in your datacenter, there's a chance that jobs actually won't get their best servers. However, after some time maybe one server will finish the workload that was assigned to it, and maybe at that point it's good to start migrating applications from one server to another. So we did some experiments with live migration, with the Xen hypervisor, and it's a little bit tricky; and again, it's a function of the workloads. If your workload really only runs for two or three seconds, it's not worth it to migrate. But if your workload takes long enough, say a couple of minutes like what happens with SPEC and PARSEC, then it might be a good idea to switch from Server A to Server B. In that case, the runtime will be almost a linear combination.
Because when you do live migration, the downtime at the end is really low; right? Cold migration and live migration have different tradeoffs, but with live migration, the downtime at the end is really tiny. So we looked at this idea of a linear combination of the actual runtimes on the two servers. So these are two servers here, Server A and Server B, and there's something called the initiation time, which is how far into the application's execution you initiate the migration to the other machine. If the initiation time is zero, it means I move this application right away to Server B, and in that case you're incurring mostly the runtime of Server B, which is the good server. But if your initiation time is really late, if you were really late in migrating this application, then most of your runtime will be incurred on Server A. Anything in between is basically a linear combination, and this plot shows the estimated time from the linear combination versus the actual time.
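In other words, with the live-migration downtime neglected, the model being described is roughly (my notation):

```latex
T_{\mathrm{total}} \;\approx\; \alpha\, T_A \;+\; (1-\alpha)\, T_B, \qquad 0 \le \alpha \le 1,
```

where T_A and T_B are the standalone runtimes on the two servers and alpha is the fraction of the job executed on Server A before the migration is initiated.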
Live migration incurs little downtime overhead, but it still has two problems. One of them is network traffic. Right now, if you migrate the virtual machines, they will be going from one rack to another rack, because we said each rack has the same set of servers. So if you go outside the rack, you're going over the switch on top of the rack, and that is going to create network traffic. So that's going to impose a limit on the number of virtual machine migrations you can do.
In addition, during the live transfer phase there is actually additional power consumption, somewhere between 5 to 10 percent. It's a function of how big the memory allocated to your virtual machine is. But that's another thing that can happen, and for PARSEC and SPEC applications, you can have this additional 10 percent power consumption for about twenty seconds during the live transfer phase.
>>: You're making an assumption that there's no point in migrating between machines of identical characteristics. But what if your VMs are different sizes and you want to pack different-size VMs onto a small --
Sherief Reda: Good point, consolidating virtual machines, yeah. There are a lot of papers on that. So in here we actually have all the virtual machines in the same configuration, so there's no inherent difference in moving one to another; it's really the server that's running the virtual machine that's giving it either higher performance --
>>: Is that something you observe in real life in your own cluster?
Sherief Reda: You guys have bigger clusters than mine, so you tell me about it. [laughter] I had two computers just migrating things between them. So you know better what is the deal here.
So if you want to do migration -- and also with co-runners. If all the virtual machines have the same configuration, migration could still be useful when you have co-runners. Maybe the initial co-runner assignment was not good, but now something is free, you can move it, and now the co-runner pairing is a better scenario. So that's another thing.
So how are we going to do it? If you really want to rip up all the workloads and remap them, that's a quadratic program, because now you need to model two assignments at once. So how do we do this? This is our algorithm. If we imagine these are the workloads on the servers, what we do first is split them into two mutually exclusive sets, say set L and set R, so that no two workloads residing on the same server exist in the same set. So you see how they're mutually exclusive and each set holds one workload from each server. Then we, say, fix set R: imagine those workloads are fixed on their servers, and we rip off the workloads that are in L. Once we rip them off, we can calculate the new ANP as a linear combination based on the initiation time of the migration, and then basically use our good old LP. Because one of the workloads on each server is fixed, it's like you're ripping off half of the workloads and allocating them as co-runners, so it's the same problem. Then you remap them; once L has been reshuffled, you fix it and do the same thing for R: you recalculate and update the performance, then rip those off and reshuffle them. So you basically solve the LP twice, and this way we break the complexity that would come from assigning two applications at the same time.
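A rough sketch of that alternating scheme, assuming a helper solve_assignment_lp() that wraps the relax-and-round LP from before (with migration cost and co-run degradation folded into its performance terms); this is an illustration of the idea, not the actual implementation.

```python
# Hedged sketch: split co-runners into two sets L and R, then alternately fix one
# set and re-solve the linear assignment for the other.
def remap_with_migration(placement, solve_assignment_lp, rounds=1):
    # placement: dict server -> list of workloads currently on it (at most two)
    L, R = [], []
    for server, workloads in placement.items():
        for k, w in enumerate(workloads):
            (L if k == 0 else R).append((w, server))   # co-runners land in different sets

    for _ in range(rounds):
        # Fix R, rip off the workloads in L, and re-solve their assignment.
        L = solve_assignment_lp(movable=L, fixed=R)
        # Then fix the new L and do the same for R.
        R = solve_assignment_lp(movable=R, fixed=L)
    return L, R
```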
Any questions about how this algorithm can actually work? Otherwise, if you think about it, the problem would be quadratic; you would have x_ij times x_ij terms if you want to map two things at the same time. So that's the way we break it. There's a little slide in here --
>>: [Indiscernible] VMs on the same server, which one do you choose to make [Indiscernible]?
Sherief Reda: So in here we just assume two co-runners, so really there are only two. That's why I'm dividing into set L and set R. If there are more, then we have to change the degradation model and change this algorithm. For now, for simplicity, it's only two.
There is a slide in here; maybe I won't go into much detail, but all these LPs require you to know the ANP, the power, and the LLC misses of the application. In many cases you've seen these applications before in the datacenter, so you know their characteristics. But in some cases you haven't. And we have a machine learning technique here that can run the workload very briefly on a reference server and estimate its characteristics on the other server classes that the workload did not run on. And it's better than the one in Paragon, which uses the Netflix heuristic, if you recall that paper.
So this is, again, our experimental setup. We have eight racks in here, two racks per server class, so these are four different servers -- or not really servers, but four different machines in my lab with different configurations. And again we execute the applications on all of them to get the application traces for performance, LLC misses, and power, and then we feed them to the queuing simulator, and we run it until steady state.
Here we compare our technique, HetMan, with one technique in the literature, Paragon. We actually modified Paragon a little bit so that it can work with multiple power budgets. And you see in here that, first of all, HetMan is meeting the different power budgets you give it for the cluster just by moving things around and reshuffling the applications onto the servers. And it's consistently giving around 12.5 percent better results. Paragon is mostly based on greedy heuristics.
This is just a runtime trace of the power consumption from the simulator under medium load conditions for the datacenter; initially there is no power constraint. The utilization increases a little bit, the job arrival rate increases, so the power goes up and down. At some point with HetMan we impose a 38-kilowatt budget; HetMan makes sure applications get assigned to servers so that the budget is met, and it meets it. And if you increase the power budget a little bit, it will allow that much extra.
>>: But if you just show this, this is not a fair comparison to Paragon; right? Because the goal there
was not to profile all of these different applications and the combinations and so on, right, we're doing
that.
Sherief Reda: So that's the slide that I skipped. I actually have Paragon, I have -- I can show you the comparisons afterwards. So this is actually -- Paragon, as I mentioned, has the Netflix heuristic. The point is that it does not profile every combination of workloads, and I don't profile every combination either; that's why in this function I use the sum of the LLC misses of the individual applications rather than the value measured when they run together, because in that case I would need to profile every pair.
So what we do here is similar in style to Paragon, but the idea is different. You profile an application on one server, the reference server, and measure, say, its instructions per second, and I get a certain number. Then from that number I figure out the instructions per second if I run it on platform A or platform C or platform D. And these models in here basically capture the relationship between the LLC misses on the reference server versus on a server from a different class. So this is different from Paragon, which, since you're familiar with it, uses the Netflix heuristic. This one is actually based on a simpler observation, which is that there are links between the different servers in many ways. If I have a processor with four cores versus one with two, I expect it to have roughly twice the number of instructions per second. So there is actually a physical reason why this server behaves the way the other one does. The Netflix heuristic is not assuming this; the Netflix heuristic is about people's random choices and moods, whatever. Here there is a physical reason why one server would have higher LLC misses than another: the size of its cache. If you have less cache memory, then it's expected that you have more LLC misses. Same thing, if I have a System C that has a higher DVFS setting, I will always get higher power for that system.
So we can build models that link this information and characterize things directly, and we don't need to profile every pair of applications. And I actually have results, I can show them afterwards, for our technique using the heuristics from Paragon versus the ones we are using, and you can see the difference right away. We actually even mixed and matched their components just to isolate how much better we are here and there and so forth.
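As an illustration of the kind of cross-server model being described -- assuming, for simplicity, a linear relationship per metric and per target class, which is my simplification rather than the exact model in the paper:

```python
# Hedged sketch: learn, per target server class, a mapping from a counter measured
# on the reference server (IPS, LLC misses, or power) to the same counter on the
# target class, using workloads that were profiled on both.
import numpy as np

def train_cross_platform_model(ref_values, target_values):
    slope, intercept = np.polyfit(ref_values, target_values, 1)
    return lambda ref_measurement: slope * ref_measurement + intercept

# Usage: predict_ips_on_C = train_cross_platform_model(ips_ref_train, ips_C_train)
#        ips_on_C = predict_ips_on_C(ips_measured_briefly_on_reference)
```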
So let's get some insights into what's happening. This is actually, I think, one of my favorite results. It shows you, as a function of the job arrival rate into the queue of the system, what the ANP will look like. When you have a low job arrival rate, your datacenter is lightly utilized, and then simply mapping the applications to the free servers works really well. There's no reason for co-runners; co-runners are actually going to degrade things. Because you have low utilization, things come into the queue and they get their favorite servers right away. They don't wait in the queue for a long time, so the waiting time in the queue is much less than the penalty you get from co-runners. That's why there is no point in doing co-runners when you have low utilization.
When you do the migration, it doesn't give you a big improvement. Why? Because the datacenter was lightly utilized to begin with, which means there was enough diversity within the system to map the jobs and get the best performance; there is no point in migration. But if we increase the job arrival rate, things start to switch. So in that case, for instance here, it's better to do co-running than a single application. Why? Because if you run single applications, the job will stay in the queue for a long time; the penalty of waiting in the queue will be higher than the degradation it would suffer if you ran it as a co-runner. So co-running in here becomes better. And this trend grows over time: as the datacenter becomes more and more utilized, it's better to do co-running than to make things run as single applications on their own servers, because the waiting time in the queue will keep increasing.
And the same trend: as the datacenter becomes more and more utilized, the advantage of migration keeps increasing; the relative advantage from using migration. Because it's highly utilized, it means that when applications get assigned the first time, they're not assigned to their best machines. So by periodically revisiting the assignment, you can get better results.
An important metric here is the number of virtual machine migrations. We found in our experiments it's about 2.8 to 6.7 percent of the virtual machines being migrated. So that's good news. If we were talking about migrating 50 percent of the virtual machines with this technique, it wouldn't work. So this is a reasonable number, and it's a function of whether we have co-runners or not.
>>: And [indiscernible] is 5 percent migration; right?
Sherief Reda: The virtual machine migration, right, that will be the overhead.
>>: So it's 5 percent of this one.
Sherief Reda: Yeah, it's 5 -- no, no, this is the number of virtual machines that will need to be
migrated.
>>: So each machine is 5 percent overhead?
Sherief Reda: No, let's say you have 400 virtual machines running in your center. Then you run your algorithm: how many of them will migrate? So 5 percent of the 400, so 20 virtual machines will be migrated around. And that's important, because the more virtual machines you're migrating, the more network traffic you're creating and the whole thing will start to degrade.
I think this needs to be dug into more deeply with a real datacenter and a real network topology to get it right. In that case you can assess how things will interfere with each other. But it needs a more detailed simulator.
Should I go on?
There's a final part in here on datacenter cooling. The problem here is that we assume we're given a power budget for computing, but in reality you have a total power budget for the datacenter. And that total power budget needs to be partitioned between the computing and the cooling in a self-consistent way, meaning that the cooling power is sufficient to extract the heat generated by the computing power. So how can this be done? That's the big question.
Do we have time to continue? Or are you people starting to --
Jie Liu: Five minutes.
Sherief Reda: Five minutes? Okay, I'll try to explain it in five minutes.
So how can we achieve this goal? Quickly: the cooling power of a CRAC unit is equal to the heat coming from the computing power divided by the coefficient of performance. And the coefficient of performance is usually an empirical function of T, the supply temperature. Easily enough, if I set the supply temperature high, the cooling power goes down, something we're all aware of. So as the CRAC supply temperature increases, I get higher values for the COP and my cooling power goes down.
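Written out, the relation he is using is

```latex
P_{\mathrm{cool}} \;=\; \frac{P_{\mathrm{compute}}}{\mathrm{COP}(T_{\mathrm{supply}})},
```

where COP(T_supply) is an empirical, increasing function of the CRAC supply temperature (the specific fit depends on the CRAC unit).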
So how can we compute the cooling power for a datacenter? Usually the goal is to find the maximum supply temperature for the CRAC, because that's going to minimize my cooling. But there is a constraint: the inlet temperature of each server has to be within the red-line temperature of the manufacturer. And to figure this out you actually need to execute a computational fluid dynamics simulation, and we have a tool to do that. So this is the computational fluid dynamics output from the tool, and from the CFD simulation you can see the inlet temperatures of the servers. If they are below the red line, it means you're spending more than you need on cooling, so in that case you increase the supply temperature. And if you see them above the red line, that means you're not doing enough cooling and you have to cut down the supply temperature. So that's how we figure out the supply temperature in simulation.
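That increase/decrease loop can be viewed as a simple search on the supply temperature; here is a hedged bisection sketch, where max_inlet_temp() stands in for the CFD simulation and the temperature range and tolerance are arbitrary assumptions:

```python
# Hedged sketch: find the highest CRAC supply temperature (to minimize cooling
# power) such that the hottest server inlet stays under the red-line temperature.
def find_supply_temperature(max_inlet_temp, redline, lo=10.0, hi=35.0, tol=0.1):
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if max_inlet_temp(mid) <= redline:
            lo = mid      # inlets still safe: the supply temperature can go higher
        else:
            hi = mid      # red-line violated: the supply temperature must come down
    return lo
```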
But now we have this procedure to compute the cooling power. How can we actually split the power? That brings us to our proposed self-consistent method. Let's say I have a datacenter with a total power budget of 70 kilowatts, and I want to split it between computing and cooling. That gives this straight line: the straight line is the set of all possible splits of computing and cooling power that sum to 70 kilowatts.
So the solution I'm looking for is somewhere along this line -- let's say the point with the star; that's where I want to get, but I don't know what it is. So we can start the simulation with a particular split, like say this point here. Let's say I start with this split, 45 kilowatts for computing and the rest for cooling. Now, I take these 45 kilowatts and run my CFD simulation to give me the cooling. Remember, this computing power is less than the starred point, so it's actually going to use less cooling power; the cooling power it needs is 21 kilowatts. But at this point I'm consuming way less than my budget, so I can actually increase my computing power budget to become 48. But if it's 48, this cooling is no longer correct and I need to recompute my cooling, and it's going to be higher than the starred point, so that's the new cooling. And then this process is repeated. Right now I'm consuming above the budget, so I will cut it back by cutting down my computing. And I can keep iterating this procedure until it converges, and we actually have a proof that it does converge.
So by doing this self-consistent partitioning, we can zoom in on this magical solution where I know that if I give 47 kilowatts to my computing and set the supply temperature to what this procedure gives me, I will consume 22.8 kilowatts of cooling and the two of them will sum to the 70-kilowatt total. So that's the iterative procedure.
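A minimal sketch of that fixed-point iteration, assuming a helper cooling_needed() that wraps the CFD-plus-COP evaluation and returns the cooling power required to extract a given computing power (the names and the starting split are my assumptions):

```python
# Hedged sketch: split a total datacenter budget into computing and cooling power
# so that the cooling is exactly what the computing requires.
def split_budget(total_budget, cooling_needed, tol=0.1, max_iters=50):
    compute = 0.6 * total_budget                  # arbitrary initial split
    for _ in range(max_iters):
        new_compute = total_budget - cooling_needed(compute)
        if abs(new_compute - compute) < tol:      # converged: split is self-consistent
            compute = new_compute
            break
        compute = new_compute
    return compute, total_budget - compute

# Usage: compute_kw, cooling_kw = split_budget(70.0, cooling_needed)
```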
We tried it for different power budgets and it always works correctly. It always gives you the cooling power and the supply temperature that exactly extract the heat from what you give to the computing, such that the sum is equal to the total power budget. So it works nicely.
So just to rehash the whole talk, I hope I did not give you headaches. The first key point is that if we integrate the software knobs, the number of threads and the number of active cores, with DVFS, we have a larger exploration space that we can leverage to achieve better power capping: better power ranges, better power capping accuracy, and improvements in runtime.
There are a few future directions to work on. The workload is a big one. We can think of the same idea for virtual machines: you can control the number of resources given to a virtual machine, and that becomes your degree of freedom.
Thread packing choices: we packed threads onto the cores, but maybe if we carefully look at which thread gets packed with which ones, we can get better results. There are also situations where you have asymmetric cores; maybe your cores don't all have the same performance, like in new architectures or in some of the new ARM processors where people have these four-plus-one strategies with different cores.
Key Point 2 is that we don't need to allocate the power caps uniformly to the server clusters. We can leverage the workload characteristics, build a predictor, and use it with a knapsack algorithm to get better power allocations for a given budget.
And Key Point 3 is that if you have heterogeneity in your datacenter, you have a large runtime optimization space, because then you can do co-runners, you can map different applications to different servers, you can do migration, and that gives you a much better improvement in energy efficiency. There are a few directions; one I would definitely love to pursue is working on a real cluster to answer some of the questions you guys mentioned that are not clearly visible with the simulator we have, and with different kinds of workloads, especially ones that require more coordination between the servers, because that's important. Right now we're considering each workload to be independent, but maybe there are actually connections between them.
Multiple co-runners, more than two, is something that was one of the questions raised here.
There's also the idea of software energy profiling for heterogeneous servers. I think building models for energy profiles that work across servers with different configurations and different makeups is an interesting research direction.
Key Point 4 is that we don't need to suffer violations because the cooling power plus the computing power exceeds the total power budget. We can plan both of them simultaneously in a self-consistent way so that the cooling can extract the heat from the cluster and the total power budget is met. One of the things we actually just submitted work on is the idea that you can reduce the cooling power, not during runtime, but through better planning for heterogeneous datacenters. You can take the racks of the different server classes, look at their specs and their characteristics, and strategically place them in the datacenter during the planning phase so that you improve the thermal characteristics and reduce the cooling power. So it's not a runtime degree of freedom, but during planning you can do some clever layout tricks to cut down the cooling.
So this -- some more information in these papers. Some of them appeared and some of them are being
submitted. But that's about it.
Any questions?
[Applause]
Sherief Reda: Thank you. Everybody's probably eager to go to lunch or go do whatever you want. I held you in here for some time; sorry about that. All right. See you. Thank you, guys.