>> Navendu Jain: Good morning, everyone. Thank you for coming. It's my pleasure to introduce Tajana Rosing from the University of California San Diego. She did her PhD at Stanford, then went on to HP Labs, and then finally came to San Diego. So she really likes California. She's going to be telling us about how to build energy efficient computing systems, and she and Roger Shulz will be visiting us on Monday; they're leading research projects in this apparently hot area. >> Tajana Simunic Rosing: Thanks. >> Navendu Jain: So, Tajana. >> Tajana Simunic Rosing: Thank you all for coming. And yes, I do like California. Sunshine is great. But I also love hiking, and I think you guys have some of the best hiking around Seattle, so I come up here very often to satisfy my craving for woods. As you can see, I work on energy efficient computing. I'll be spending quite a bit of time today talking to you about power and thermal management as it relates to both individual servers and multiple sets of servers. Some fraction of the time I'll be talking about what happens inside of a processor and how that may change the decisions we make that go beyond it. And then towards the end, I'll be talking about how some of the same energy management decisions can be made to change energy efficiency in sensor networks. It turns out that the ideas are actually quite similar. The numbers are very different. You know, in data centers you talk about megawatts and millions of dollars, and in sensor networks you're talking about milliwatts or maybe even microwatts and probably not millions of dollars in terms of energy cost. But the bottom line is still the same: you need to deliver some results within a reasonable amount of time and at as low an energy consumption as possible. So the future as we see it is basically a world that has a whole bunch of different types of sensors around us and on us as well. And towards the end of my talk, I'll illustrate a couple of projects we run at UCSD that look at how our daily decisions will affect our long-term health care. So this is work that we do with the UCSD School of Medicine that's focused primarily on preventative medicine. So not necessarily trying to deal with somebody who is sick. The interesting thing you find when you're looking at preventative medicine is that you need to be able to track effects on health 24/7, 365 days a year. When you try to do this, you have huge amounts of data that get generated, and you have a gigantic problem with battery lifetime. You know, if you're trying to detect only a small event using sensor nodes in a large scale environmental installation, you can do duty cycling, so you can sample periodically with a relatively low sampling rate. However, if you're trying to figure out whether a person is running versus sitting versus sleeping versus, you know, doing something else, you need to be continually monitoring. So energy is actually a very big challenge here. Secondly, there is a middle layer here that involves mobile access. So sensors are fine and all, except, you know, that data needs to somehow get to us in a form that makes sense to us, and today most of us will use our cell phones to interact with the data. And in fact, the project with the healthcare center really showed the power of providing feedback to people through a cell phone. The cell phones also have battery limitations. A lot of us have more sophisticated smartphones where we, you know, take pictures, we get e-mail, we may even watch some media.
I've got little kids who love playing games on them. You know, so battery lifetime is really key. Also, the connectivity between sensors and cell phones, and between cell phones and the back end, is variable. Sometimes we have good 3G coverage, and other times, when, you know, I was hiking on the Olympic peninsula, there was no 3G coverage. However, if I'm a heart patient, I need to be able to get connectivity to the sensors on my body, and perhaps to the sensors in the environment, through that same cell phone. So there needs to be a way for us to adapt to what is going on in the environment and what is going on with a particular application and the user's needs. And in the center of this picture is the data center infrastructure core. So if we have all these data from the sensors, from the cell phones, all the data eventually will need to be stored somewhere, will need to be analyzed, and will need to be leveraged over time. So going back again to the healthcare example, we actually don't know what really causes asthma in people. We do know that there's a factor that's genetic. We do know that there is another factor that's environmental. For example, there was a study that showed if you live within a couple miles of a freeway -- the study was done in LA county, as you might imagine -- you have a 50 percent higher chance of developing asthma. That's a huge number. The issue is that we don't know whether this comes about because of the freeway, because we happen to be testing a certain fraction of the population that simply is more susceptible to asthma, or because maybe there are some other environmental causes that would also affect asthma. It's not obvious. The only way we can get the answers to these questions is by having large sets of data over a long period of time that can then be analyzed retrospectively, right? And you get that by having this large infrastructure core. So at UCSD, we're really lucky in that we have full-fledged deployments of both large scale environmental sensor nodes and, through work again with the UCSD School of Medicine, body area networks. We have tight collaboration with people in industry, you know, QUALCOMM, for example, that's focused on developing cell phones and, you know, cell phone infrastructure. And we have a supercomputing center that deals with humongous data sets. And so we can really find out what is involved from the outside looking in on this picture and how one achieves energy efficient computing across this big scale. So with that in mind, I'll zoom in on my small slice of this big huge picture, and specifically I'll talk about power and thermal management, starting from the data center infrastructure at the center and working my way out to the sensor networks and mobile devices towards the end of my talk. So for data center infrastructure there are two variables that I've really been concerned about that go beyond just simple scheduling for performance. And those are power and temperature. Power in terms of how much we are spending to actually run systems, processors, memory, servers, racks, and so on. Temperature in terms of what fraction of this power is actually going to cooling and how reliability will be affected by the scheduling and power management decisions that are being made. So with that in mind, I'll be talking first a little bit about some issues that make our servers not so energy proportional, and specifically I'll be focusing on the interaction between the processor and memory infrastructure.
How could we make memory a little bit better architected so it doesn't consume as much energy as it does today? CPUs, it turns out, are actually fairly good already. You can scale their energy fairly well. Then I'll talk about, you know, if we have a particular system, what is the right way to monitor the workloads and, based on the characteristics of those workloads, to do power management? So this would be both controlling sleep states and voltage scaling, and how those decisions make a system operate more effectively. We'll be talking about machine learning algorithms that we've developed that are really good at doing online adaptation. And then, how do power management decisions affect thermal management? It turns out that the two aren't always compatible with each other. So if you think about large scale servers, the absolutely best way for you to save power is to just not run a server, right? As soon as you turn it on, you are already consuming a huge fraction of the overall power budget. However -- and the same idea applies on the chip, right; on the processor die you may have, you know, eight processors, four processors, whatever it is. You know, the best way you're going to save power is by shutting down a processor. Your next best way is to use voltage scaling potentially, or to be careful what workloads you place. However, shutting down this processor and having something else run very fast right next to it creates big temperature differentials, which can lead to reliability problems. So these two decisions aren't always compatible with each other, and I'll be talking about how we actually balance the two to achieve both low energy computation and also good reliability at the large scale. And lastly, I'll talk about how we scaled some of these results to a virtualization system that runs at the container data center level that we have at UCSD, and some of the preliminary results that we got from that. And then, as I said, the second topic will be looking at wireless sensing and energy management, or better said, what do we do at the outer parts of my picture on the previous slide? So, what happens with sensors and the mobile devices. So the high level view of cost efficient energy management in data centers is that you have some sets of jobs coming in. They need to be scheduled. They need to be scheduled in a way that's going to meet performance requirements, so you've got SLAs, you've got response time requirements, maybe you have some throughput requirements that are necessary to meet the SLAs. That scheduling will make decisions on what racks and what servers jobs will run at. At the individual server level you have either an operating system or a virtual machine that makes a decision on which exact processor something will be running at. Those decisions will strongly affect the energy efficiency of the overall system. Right now those decisions have been largely disconnected from the decision on what level of cooling you are using. So the level of cooling is controlled by an independent set of thermal sensors. Granted, the temperature that's sensed by those thermal sensors is affected by scheduling decisions. However, we found that if we actually consider both at the same time, we can get much more effective results. So at the high level, what we're really interested in is monitoring temperature, power, and performance at all of these scales, right, and using those variables to then control cooling, the power states that we can control, and task scheduling.
Because we believe it's only if we start managing the multi-scale properties of this system well that we'll get fairly big energy savings. The goal here is to get energy efficient computation. We have found that if we utilize some relatively simple prediction models we can actually do even better than the current reactive policies. So what I'll show you is some of the results of predicting what temperature is likely to do in the near future, where near future is defined as a certain number of scheduling decisions going forward, and what the incoming workload is likely to look like, again in the near future, where near future is basically a function of the decision I have to make right now. Okay? >>: Are you only looking at controlling the racks, or are you talking about controlling the [inaudible] chillers and all that? >> Tajana Simunic Rosing: So the project that we actually have funded at UCSD involves a whole data center container, and they have control of the water that's coming in, the rate and the temperature, and then inside of the container we have tapped into all of the controls for the fans that are in there, and we obviously have the ability to schedule jobs. Now, we're not quite -- so you won't see results of doing it at the whole container level, because we're not quite there yet. We got this container just this last year. So we're -- >>: [inaudible]. >> Tajana Simunic Rosing: This is a Sun container. It's actually a big NSF project, an NSF MRI grant that got us the container. So it was just within this last year that the container has been fully populated. We're at a point where we're running right now at five servers, and we're going to be deploying what we have tested already on a small number of servers to the racks in the container. So it's going to be running at seven racks within about a month. But to actually get to that takes a lot of work. So you won't see those results. But that is actually the end goal. And the project involves multiple people. It's not just me that's doing it. I'm leading the energy efficiency effort here. So if we zoom in on a single server, it really depends whose server you buy and what that server is designed to do. So the first server that we got was a really small two socket quad core Xeon machine with only eight gig of memory. So if you have a machine that has a relatively small amount of memory and doesn't have the most powerful processor in the world -- I mean, it's still reasonably powerful -- the type of power breakdown that you see when it's running active is, you know, maybe a bit less than half going to the CPU. You have some chunk that goes to memory that's relatively small. And then you have almost, you know, 16 percent that goes to fans and very little that goes to the hard disk. Now, if I look at the opposite extreme -- and I don't have this picture here, unfortunately -- you can see on Google's servers that the CPU has dropped down to about 30 percent. Memory's also 30 percent, because in Google servers it's about 64 gig that you have. Right? And then the other stuff is basically the rest of the pie. And I've talked to people at the supercomputing center, and they actually have some servers that run with 512 gig because of the types of jobs that they need to run. 512 gig of memory gets very, very hot. And there, you know, a huge fraction of the pie goes to memory and to fans in order to cool it down. So my point here is that things will change drastically in terms of what slice of the pie goes where.
But the top consumers are usually CPU, memory, fans, and power supply efficiency. Power supply efficiency I will not be talking about at all. I think there are people who have done a great job of designing better power supplies, and it's just a question of are you going to spend the money to put them in? And I won't spend much time talking about CPUs. You know, Intel seems to be doing an excellent job designing energy scalable CPUs. I will talk a little bit about how I use the properties of those CPUs to achieve energy efficient computation. What I will discuss next is what we have started looking at in terms of novel memory organizations, to try to get the same or near the same level of performance at a fraction of the power cost. And also I'll be talking a little bit later about what happens with temperature and therefore with cooling system control. Yeah? >>: [inaudible] distribution of this power, did you look how to [inaudible] in full capacity at some time or [inaudible]. >> Tajana Simunic Rosing: That's a very good question. I will show you a slide. And it changes. >>: Database, data -- >> Tajana Simunic Rosing: Yes. So this server is not -- >>: This guy always is -- >> Tajana Simunic Rosing: I totally agree. And it did not -- I do not show that here. >>: Okay. >> Tajana Simunic Rosing: This is not a database storage server at all. >>: [inaudible]. >> Tajana Simunic Rosing: So this -- this here is a computing server. So really my point with these pictures is that the slices of the pie change drastically. Right? So for the hard disk, you're right. If I was focusing on storage servers I would actually need to have a last bullet that says how do we deal with the hard disk. >>: [inaudible] is the storage server. >> Tajana Simunic Rosing: Yes. >>: Typically storage is most expensive. >> Tajana Simunic Rosing: Uh-huh. Uh-huh. So one of the things that you might be interested in that we are doing -- for another big grant through SDSC that I'm a part of -- is looking at large scale supercomputing storage that includes flash as a front end. Because that turns out to be one of the good ways to get the efficiency that you need and the response times that those types of applications really need. Yeah. Okay. So what we've looked at in terms of memory design is the idea of using a combination of non-volatile and volatile memory. The problem with DRAM is, yes, it responds very quickly, and it has pretty high density so you can pack a lot of stuff in it, but it also costs a lot of power. So if you look at a typical server, you turn on the server, and that memory is going to be on for the rest of the time that the server is on, and that really doesn't make the server very scalable in terms of energy. With non-volatile memory like PCM, yes, you know, you can turn power off to it and the data is retained, so you can achieve much more energy scalable computation as a result. The problem is you can write to it only so many times before it fails. And you know, the other thing to note is that it's not exactly mainstream yet. It's getting there. So the challenges are to be able to hide the drawbacks of PCM, the main drawback being the write endurance, which is a very similar issue to what you have with flash. So the design goals are for us to uniformly distribute write accesses across PCM, so do wear leveling much in the same way as you do it with flash, except now it's at the main memory level, not at the cache to the disk.
The other design goal is to direct any write intensive applications to DRAM instead. We achieve this with a combined hardware/software design. So this requires some changes to the operating system and it requires changes to the hardware. On the bright side, the overhead of the changes to both is very, very small, and the benefits, as you'll see, may justify the approach. So at a high level what we have is a new memory controller that monitors the writes and monitors the wear leveling properties of PCM and, based on that, directs workloads appropriately. So here is how it works. I will have two types of memory: DRAM and PRAM. The data and commands come through the memory controller. The memory controller actually has a small access map cache. So per page -- and this is done at page level because a page is the unit that you allocate from the operating system -- you actually count how many writes have been done. Based on that write count, we can figure out whether that page should be retired and a new page brought in or not, because that then allows us to do write wear leveling. And obviously when a page gets to some threshold of writes, we have to be able to swap the page. So that is where the interrupt for page swap occurs. And as you reach the limit of PCM endurance, we may have a bad page, so at that point a bad page interrupt is issued and a new page again is allocated. So the objective is to uniformly distribute the writes across PCM. We do this by holding, in the memory allocator, three different lists. One is a free list on the left, and then there's a threshold free list on the right. The free list is basically a set of pages that have not yet been allocated by any application. As the applications come in, they allocate the pages. Once those pages have been allocated some threshold number of times, they get put into the threshold free list. Eventually you get to the point where the free list is empty and the threshold free list is full. At that point, you swap the two. So the threshold free list becomes empty, the free list becomes full, and you start allocation again. You do this enough times, and you'll get to the point where some of the pages go beyond the write endurance per page. At that point, the page goes into the bad list. So that is where this bad page list comes in. The page swapper handles page swap and bad page interrupts. The bad page interrupt obviously will mark pages invalid and hopefully avoid failure. >>: [inaudible] what's bad page list -- >> Tajana Simunic Rosing: So the swap threshold is actually a variable that we've played with some, to see for the set of applications within the benchmarks that we had. And it turns out that it's actually not that sensitive. So the swap threshold in the paper that we published this year -- this last year, it's a new year already -- was 2,000, but it turned out if you put it down to 10 you get almost the same result as if you put it up to about 5,000 writes. So interestingly enough, it's really not that sensitive. There are two policies that we looked at. One policy is what we call uniform, which basically says that when you reallocate a page, you always want to put the page into PCM. The idea behind that is PCM is lower power, so therefore you'll be more energy efficient. However, clearly this could be a bad policy, because PCM also has a very slow response time to a write, much slower than DRAM, and you'll get your pages to become bad pages sooner.
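As a rough illustration of the list mechanism just described -- this is a minimal sketch, not the actual memory controller or operating system code from this work -- the allocator can be thought of as rotating pages through a free list and a threshold free list, retiring worn pages into a bad list. The swap threshold and endurance limit below are placeholder values:

```python
# Illustrative sketch of a threshold-based PCM page allocator: a free list,
# a threshold free list (swapped in when the free list empties), and a bad
# page list for pages that hit the endurance limit. Values are placeholders.
from collections import deque

SWAP_THRESHOLD = 2000        # writes before a page is retired for wear leveling
ENDURANCE_LIMIT = 10**9      # assumed per-page PCM write endurance

class PCMAllocator:
    def __init__(self, num_pages):
        self.free = deque(range(num_pages))   # pages available this round
        self.threshold_free = deque()         # pages retired after SWAP_THRESHOLD writes
        self.bad = set()                      # pages past the endurance limit
        self.writes = [0] * num_pages         # per-page lifetime write counts

    def allocate(self):
        # When the free list empties, swap it with the threshold free list
        # and start handing those pages out again (uniform wear leveling).
        if not self.free:
            self.free, self.threshold_free = self.threshold_free, self.free
        return self.free.popleft()

    def on_write(self, page):
        # Called when the access map cache sees a write to this page.
        self.writes[page] += 1
        if self.writes[page] >= ENDURANCE_LIMIT:
            self.bad.add(page)                # bad page interrupt: mark invalid
            return "bad_page"
        if self.writes[page] % SWAP_THRESHOLD == 0:
            # Page swap interrupt: the data moves to a freshly allocated page
            # and this page rests on the threshold free list until the swap.
            self.threshold_free.append(page)
            return "swap_page"
        return "ok"
```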
The next policy that we have is a hybrid policy that, after a page passes the threshold, will allocate the page instead in DRAM, assuming that because this page has been used so much it's likely being used by an application that has lots of writes and therefore should be going into DRAM. So we used a baseline system that had four gigs of DDR3 SDRAM, and this actually was measured on the server that we had. All the results that you'll see with PCM are simulated, because we don't have that much PCM in the house. So the hybrid system had one gig of SDRAM plus three gigs of PRAM. So it was about a one to three ratio. And the uniform system was all PRAM. The workloads that we simulated had a set of different types of applications. The idea was that we wanted to see the percentage of reads, the percentage of writes, and the total number of pages vary significantly. So we selected applications that would represent different classes of workloads by these particular values. So you can see that applu, for example, has lots and lots of pages and a fairly high write percentage, and then on the other hand sixtrack, because it's so CPU intensive, really doesn't do much. And face recognition is somewhere right in the middle of the two. So the average overhead in terms of performance for the hybrid system was less than six percent. This overhead is largely due to the fact that writes take longer in PCM. So if we actually had a better caching methodology we could probably hide this overhead even better. The overhead of the uniform system was much bigger. It was about 30 percent. So you really wouldn't want to build a server with just PCM. And that should be kind of obvious anyway. But there is real benefit -- and I'll skip through the details here -- to actually combining PCM and DRAM. So if we combine the two, you get energy savings of 30 percent without considering power management. So we did not actually turn off power to PCM and use the fact that it retains data. So I think the overall energy savings can be much bigger than this 30 percent. This is the worst case, effectively. You have a question? >>: [inaudible] just the total time [inaudible]. >> Tajana Simunic Rosing: So it includes all of the overheads. >>: What do you mean by the overheads? >> Tajana Simunic Rosing: Of the access map analysis, trying to figure out what the number of writes is that you have to that particular page, and of the page swaps. So the overall performance overhead basically is included: what your application would experience if it used this new system. >>: What's the estimated [inaudible] including DRAM in that scenario? >> Tajana Simunic Rosing: The number we had used was 10 to the 9th, right? It was a number we got out of the literature. As that number goes up, our results get better. >>: [inaudible]. >> Tajana Simunic Rosing: Oh, so with a particular -- and there is actually a detailed discussion on this in our paper. This was analysis that we had to do, because otherwise, if our server dies within five minutes of using it, it's useless. So it turns out that if you have a dumb policy with 10 to the 9th writes, you can end up with a failure within about an hour. But if you have a smart policy, like what we've designed, you end up with a failure after about 10 years. So you have to be very careful what you do with your software and hardware. >>: That's quite a difference. Do you -- >> Tajana Simunic Rosing: Yeah.
Well, the difference comes from the fact that the dumb policy will basically allocate a write intensive page always to the same page at the same location, and therefore it dies quickly. >>: [inaudible]. >> Tajana Simunic Rosing: Yeah. >>: [inaudible]. >> Tajana Simunic Rosing: Yeah. >>: What's -- did you look at the concept of potentially persisting the state in the PRAM between OS restarts? >> Tajana Simunic Rosing: See, that's what I mentioned here at the end. We have not accounted for that in this energy savings. What my student is working on right now is estimating and evaluating how well we would do if we actually leveraged this idea. Because I think the savings will be much bigger, and it will enable you to wake up extremely quickly. You have instant on, effectively. And you know, the fact that we used relatively little DRAM and PRAM is another issue. You know, normally you would want about a one to 10 ratio between the two to get even bigger benefits. But, you know, to actually simulate one gig to 10 gigs of PCM is very difficult to do with simulators. We had a hard enough time allocating enough memory to fill up four gig in a simulator. So that's one of the limitations of doing this simulation-wise. But jointly with Steve Swanson we're building an evaluation system that will have a memory controller and an FPGA, and we'll actually be able to see in somewhat real life what happens. Any other questions on this? So the nice thing about using PCM in the way that we've talked about is not only that you can now retain the state, but also that we can think about having a completely non uniform memory architecture, so we can have chunks of PCM associated with chunks of processors, and therefore you can start actually creating a truly energy proportional computer, right? Because I can assign workloads in such a way that only a fraction of the memory is being used, instead of all of the memory being on all the time. Even though maybe to the application it looks as if all of the memory is there, simply because the data is still retained, it just isn't on because you're not currently running it. Okay? So that's actually the motivation behind going with this work. Okay. So going forward, as I said, we have this great NSF funded Project Greenlight. The idea is to first evaluate the energy efficiency of a larger scale installation, and we're interested in seeing literally what happens at a whole box level. So how efficient is it if I do X versus Y in terms of what is the cost of power to the box and what is the cost to cool the box. And those two variables on this box are very easy to measure, because I've got one plug for water and one plug for power. So it's pretty much instantaneous feedback. And in order to achieve more energy efficient computation, we'll look at two classes of algorithms: power management and thermal management. Power management, because obviously our goal is to lower the energy consumption. Thermal management, because lowering the energy consumption of just the powering aspect would not necessarily be the best thing for the cooling system. So we need to be able to balance the two. And I'll show you some instances of why this is the case at the chip level. And this actually applies to the server just as well. So the first thing that we did is we started looking at workloads. And this was actually done ages ago on a single machine. What we looked at is different devices within the machine, so you see on the top a hard disk trace, and on the bottom you see a wireless network interface trace.
The reason why I included these two is because they're about as different from each other as it gets, right? On the wireless network interface you have relatively fast interarrival times and you're expecting actually to have very low power consumption. On the hard disk trace you expect much slower interarrival times and much slower processing times, and you also have much higher overhead for changing power states. Here's the interesting thing that we found on these traces. What you see on the X axis is the interarrival time measured in seconds. What you see on the Y axis is the tail distribution. So effectively it's one minus the cumulative probability distribution of getting that amount of interarrival time. The reason why I plot the tail of the distribution is because it highlights what happens with longer interarrival times. Why do I want longer interarrival times? Any ideas? Why do I care? >>: [inaudible]. >> Tajana Simunic Rosing: Sleep, right? Or slow down if it's a CPU. When do I care about short interarrival times? >>: [inaudible]. >> Tajana Simunic Rosing: Performance. Right? So if you think about performance, people usually use queuing models to evaluate performance of a distributed system. That queuing model does an excellent job at short interarrival times, i.e., when you're looking for high performance, and you can see even here that models for the exponential distribution, which is the backbone of queuing theory, and heavy tailed distributions such as Pareto both match the experimental data reasonably well. The problem is when you go to the right of this picture and you start looking at long interarrival times, the match is very poor. You end up in a situation in which the exponential distribution makes decisions that are way too optimistic, and in the case of the machine where we implemented a policy that was based purely on the exponential model, we ended up consuming more power than just leaving it on all the time. So this really illustrates that you need to be able to model the workload accurately if you are going to do power management well. And in the case of these particular devices you see that the same conclusion applies regardless of whether you're looking at a hard disk or a wireless network interface card; you can look at large scale networking data traces and you'll see exactly the same conclusion. It's all heavy tailed. Why do people use queuing theory? Well, it's easy. And it works reasonably well at those short interarrival times. Why we can't use simple queuing theory for power management is because it leads to bad decisions at those longer interarrival times. So that is what we did: we actually took a simple queuing model and we expanded it to ensure that we model correctly the points in time when we need to make a power management decision. So when we are in the idle state, we can either wait and get jobs that we then process -- notice in these states we're at the fast end of the curve where the Markov model holds. Once we get back into the idle state, we have a decision to make. That decision is: do we do something, do we slow down, do we shut down or not? If we decide to shut down or go into a sleep state, it takes us some amount of time to get there. So this cannot be modelled with a Markov chain; it has to actually be modelled with a uniform distribution or something similar to a uniform distribution -- essentially non exponential, because it takes you a finite amount of time to get there. Right?
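To see concretely why the exponential assumption falls apart at long idle periods, here is a small illustrative check -- synthetic heavy-tailed data, not the measured traces from the talk -- comparing the empirical tail (one minus the CDF) against an exponential model with the same mean:

```python
# Illustrative only: synthetic interarrival times, not the measured traces.
# Shows how an exponential fit with the same mean underestimates the
# probability of long idle periods when the data is heavy tailed.
import numpy as np

rng = np.random.default_rng(0)
alpha, scale = 1.5, 0.05
# Pareto-style heavy-tailed interarrival times (seconds)
interarrivals = scale * (rng.pareto(alpha, 100_000) + 1.0)

mean = interarrivals.mean()
for t in [0.1, 1.0, 10.0]:
    empirical_tail = np.mean(interarrivals > t)   # 1 - CDF from the data
    exponential_tail = np.exp(-t / mean)          # exponential with same mean
    print(f"P(interarrival > {t:5.1f}s): data={empirical_tail:.4f}  "
          f"exp fit={exponential_tail:.6f}")
# The exponential tail decays far faster, so a policy built on it
# mis-models long idle periods and makes poor sleep decisions.
```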
Similarly, as you saw in the previous slide, the request that comes after a longer idle period you really want to model with a non exponential distribution. You need a heavy tailed model. So you effectively have two distributions that govern the decision. And this is exactly where we had to make a little more sophisticated model. So what we did is we went with a time indexed -- that's what TI stands for -- semi Markov decision process. It's basically a glorified Markov chain. A little more complicated. The assumptions are that a general distribution governs the first request arrival. So this is the arrival that breaks the long idle period. Pretty much everything else can be modelled exponentially. And the user, device, and queue models are stationary. That last assumption is actually the lousiest one of them all. What does stationary mean? It means that on average the statistical properties of the distributions you design for do not change. And that we know to absolutely not be true in a typical set of servers. However, if we can pretend that it's true, you actually get a globally optimal policy with simple linear programming, and the results that we implemented are within 11 percent of an ideal policy. This ideal policy knows the future. So you get a really good result. So the question is, what do we do about this stationarity assumption? How do we handle that? Yeah? >>: [inaudible] exponential or [inaudible]. >> Tajana Simunic Rosing: I didn't hear you. >>: The service time, how did you model the service time? >> Tajana Simunic Rosing: That's a very good question. So the service time that happens down here was modelled exponentially. So we assumed that once the request comes, that's the beginning of a set of requests that you have to process that happen relatively quickly after each other. And at that point, we're back into this red part of the picture. Yeah. And that turned out to really not affect things much. If you have different results it would be interesting to see. But it really didn't affect things much. It was really this decision of do I go to sleep or not that was the critical decision. Yeah. So what we did to deal with the stationarity is we said, instead of designing one single policy that does a reasonable job for all of the workloads but, you know, is not optimal for anything, what we want is a library of a small number of policies, each of which is really good at delivering savings for a class of workloads that we can select it for. So that is how we used online learning here. And on the bright side, we were able to measure pretty good energy savings. We had experts -- these are effectively policies -- that could control power states and could control the speed of execution. Each of these experts was specifically optimized for a class of workloads. So for example, you would have a Web Search policy, or you would have a database policy, or you would have, you know, whatever, a video decoding policy. Because these workloads do have fairly different properties from each other. And then we have a controller that monitors what's going on in the system. It monitors it by looking at hardware performance counters. So it doesn't necessarily know anything from the application. If it knows something from the application, it will do better, obviously. It selects the best performing expert. This would be whatever it thinks is the best policy. That particular policy makes the decisions for a device, and then we evaluate how well those decisions were made. How do we evaluate?
Well, it's fairly straightforward. If the decision has to do with going into a sleep state, you know exactly how long the idle period was. So you can figure out what the best decision would have been, and relative to that best decision you can figure out how well your policy did. So if it was a simple timeout policy, it would wait some amount of time before going to sleep. That amount of time that it waited is counted against it. Right? The really cool thing about the controller is that its performance converges to that of the best performing expert at a very fast rate. So it's very good at selecting what the best policy to run right now is. That rate is a function of the number of experts -- now, remember, we have relatively few of these experts, we're not going to have hundreds of them, just a few that represent well the classes of workloads that you have running -- and the number of evaluation periods, which, in the case of the operating system, would be scheduler ticks. Okay. So here are a couple of examples of results we got on the hard disk drive and on a processor. For the hard disk drive we are looking at different traces that I got from HP, so my days of working at HP paid off. I picked three different traces because I wanted to see traces with fairly different average interarrival times and different variance around the average. We wanted to show how well we can adapt. We looked at the hard disk because decisions that are not correct for the hard disk show the worst possible case, right, because it takes so long to respond. And on the CPU we're focusing more on voltage and frequency scaling. So on the hard disk we had four policies that were implemented. There was a simple timeout policy that's a default. There was an adaptive timeout. There was our policy, which was optimized for a specific performance overhead -- in this case it was optimized for three and a half percent performance overhead. The nice thing that I forgot to mention about these policies is that this is the only class of power management policies that can be designed to meet performance guarantees. So you can say, I have this performance guarantee that has to be met, and it will output a set of decisions that will minimize energy within that performance guarantee. And the last one is a predictive policy. Now, this was just a sample set that represents what we've seen in the literature. You can add whatever other policy you like. Within this sample set you can see that our policy was best for performance, because we obviously optimized it to be that way, and that in this particular case the predictive policy did really well with energy savings. Now, when we actually ran across all of these different workloads with our online learning controller, you find that we can fairly easily trade off between low performance overhead, in which case our controller will pick our policy most frequently, because that is what gives you low performance overhead, and higher energy savings, where it will pick the appropriate policy for that. Okay? A similar thing happens when you look at voltage scaling. So in this particular case, this was actually done for a cell phone type application. And you can see that we have four different voltage and frequency settings. And as we trade off low performance overhead for energy savings, you can see that you get better performance if you run faster, obviously, and then as we want more energy savings, we're going to spend more time running at low speed.
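The controller she describes is in the spirit of standard online learning with experts. The sketch below is a generic exponentially weighted scheme with made-up losses and learning rate, not the exact update rule from the published work; the three expert names are purely illustrative:

```python
# Generic "experts" controller: each expert is a power management policy,
# the controller keeps a weight per expert, charges each expert a loss in
# [0, 1] per evaluation period (e.g., energy wasted relative to the oracle
# that knew the idle period length), and follows the best-weighted expert.
import math
import random

class ExpertsController:
    def __init__(self, num_experts, learning_rate=0.3):
        self.weights = [1.0] * num_experts
        self.eta = learning_rate

    def select(self):
        # Follow the currently best-weighted expert.
        return max(range(len(self.weights)), key=lambda i: self.weights[i])

    def update(self, losses):
        # losses[i]: how much worse expert i did than the best decision in
        # hindsight for this evaluation period (e.g., one scheduler tick).
        for i, loss in enumerate(losses):
            self.weights[i] *= math.exp(-self.eta * loss)

# Usage sketch with three hypothetical experts: timeout, adaptive timeout,
# and a workload-optimized policy. Losses here are random placeholders.
controller = ExpertsController(num_experts=3)
for tick in range(1000):
    chosen = controller.select()
    # ... run the chosen policy for one evaluation period ...
    losses = [random.random() for _ in range(3)]
    controller.update(losses)
```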
However, what you notice for this particular application, quicksort, is that there's a good fraction of the time when we also run at 400 megahertz, even though we're maximizing energy savings. The reason why this happens is because quicksort has regions in which it's more memory versus CPU intensive. And our controller is able to detect this appropriately and select a correct setting to ensure that you get performance and maximize energy efficiency. Because if you are CPU bound, you are not saving energy by running slow. Right? So that is effectively what you see here: 25 percent of the time we're CPU intensive, 75 percent of the time memory intensive, hence about 75 percent of the time you run low and about, you know, 20 percent of the time you run faster. So that brings me to the last question, you know, that I almost always get asked when I give this talk, and that is: you get these great savings from voltage scaling when you run on an old CPU, effectively, older technology. What happens when you do this in servers? And the answer actually is a little depressing, so I wanted to share that with you. This was a set of results that we got by running an AMD Opteron machine. We also did the same thing on a Xeon. The results are consistent with each other, so the conclusion that we draw out of this is a correct conclusion. We presented this to Intel. They agree that that's where things are going. You can see for the Opteron that the range of voltages is from 1.25 volts at the top speed to .9 volts at the low speed. So there is a very small range actually that we're playing with here, which by itself already tells you that the amount of energy savings for slowing down is going to be much smaller than it is for processors in the older technology. In contrast, for the XScale processor I think the range is from 1.2 to 1.8 volts, which is much bigger -- .6 versus not even .3. Meanwhile, the range of frequencies in this case, from .8 gigahertz at the low end to 2.6 gigahertz at the high end, is huge. Right? That's a factor of more than three. Compare that to my previous slide, where the range of frequencies was from 200 to 500 megahertz. So that's about a factor of 2.5. So basically, just by looking at these numbers, you would expect XScale to get better energy savings with voltage scaling than the AMD Opteron for sure. The other depressing thing about this processor as compared to the older XScale is that it leaks like crazy. So leakage power in these things is about 45 percent. What this tells you is that as you slow down, about half of your power is simply going to be wasted for several times as long if you run at the lowest frequency. That is very bad news. In contrast to that, XScale I think we estimated at about 20 percent. There is a big difference between 20 percent and 45 percent. So to illustrate the point, what we're doing here is comparing slowing down -- so running at top speed versus at one of these slower speeds -- with three simple power management policies. All three of these run as fast as they can and then idle. The first switches the CPU to state C1, so we just remove the clock supply. This is the lightest sleep state that you can choose. The second switches the CPU to state C6, so it's completely off. This is the deepest sleep you can do with the CPU alone. And the last policy actually says, well, you know, this is great, but what about the rest of the system? You know, the CPU can sleep, but there is a whole bunch of other stuff that's still on.
So in this case, we put memory into self refresh to show how running as fast as you can and shutting off actually helps the overall system power, and to contrast that with slowing down. Okay? So here are the results. These results actually show you only a small sample of workloads. We specifically picked mcf because it's very memory intensive and sixtrack because it's very CPU intensive, so this will show you the approximate range of savings that you can get. You would expect mcf to give you the best savings because of its memory intensiveness, so slowing down the CPU should not affect its performance as much, versus for sixtrack you would expect to get really bad results, because slowing it down, you know, will affect performance and energy. You can see first of all, if you look at the delay column, that at the very best, at the top level, the percent slowdown is about 30 percent. And this is running at 1.9 gigahertz versus 2.6 gigahertz. So you're paying 30 percent of performance overhead for almost no power savings. Okay? What we are showing here is the percent energy saved by voltage scaling relative to running as fast as possible and going into the C1 state, running as fast as possible and going into the C6 state, or also putting memory into self refresh. Okay? So your benefit from voltage scaling is relatively minuscule. On the worst end, you have an overhead of 230 percent in performance for sixtrack for about a seven percent benefit from voltage scaling. So how many of you are going to do voltage scaling? Yeah? >>: I am. Because they're totally incomparable. When you do voltage scaling, the processor's still running. You're able to process requests. When you put it to sleep, it's not. >> Tajana Simunic Rosing: Yes. And that is true. Now, why are you doing voltage scaling? >>: [inaudible] power. >> Tajana Simunic Rosing: So what this shows you is that you're actually not saving energy. The reasons why you may want to do it -- >>: Is that a characteristic of the processor you're using, though? >> Tajana Simunic Rosing: It's a characteristic of all of the high end processors you buy today. There is no processor that you can buy on the market that I know of today that would show significantly different results here. If you ran on your cell phone, you would see different results, because it's older technology and obviously much lower performance. But from an energy savings perspective, it actually does not make a lot of sense to do. It does make sense from a power perspective, for power budgeting, right? Voltage scaling will lower your power budget instantaneously while allowing you to continue running. It also makes sense if you're desperate, right? You need to save the last bit -- so in the case of thermal management, if things are getting hot, you still need to run, but you need to run in a way that lets you not be too hot, so you're going to use voltage scaling. Right? And if you actually talk to Intel, their newest processors that they are looking at putting out on the market going forward are going to have much faster transitions in and out of sleep states than they ever had before, and they're going to have a lot less flexibility with voltage scaling, because they don't see voltage scaling as all that useful. So the paradigm is really changing. The other thing to think about is that in the old times, when you had a single processor, the idea of using voltage scaling made a lot of sense because, as you said, you can still continue running, you know. If you shut it off, nothing is happening.
So you can't continue getting results. With lots of processors you have the option of continuing to run, perhaps running even faster if you use the top speed, while at the same time perhaps shutting off one or two processors on the die to save energy. Yeah? >>: That's actually just where I was going to go, shutting down the processors and -- >> Tajana Simunic Rosing: You don't have to shut off everything, basically. You can shut off one out of eight cores. >>: And if you do so, are you -- is the control for that being done in a way where you eliminate statically the [inaudible]? >> Tajana Simunic Rosing: Yes. Yeah. And that's why these numbers look the way they do. Because basically you have gating transistors that turn off the connection to the power supply. Otherwise it wouldn't work; you would have the same problem. Yeah. Okay. So that brings me to the key points. You saw that having power management policies that understand workload characteristics and can adapt to them can make a big difference. You saw that lowering voltage and frequency settings doesn't necessarily help in the high-end processors for pure energy savings. However, it does help with peak power reduction. It does help with thermal management. And obviously for systems that don't have high leakage, it's still a very good option. So if you have a system that has low leakage, by all means I would use it. Which is why you see voltage scaling still very prevalent in all the mobile systems. So this brings me to thermal management. It's pretty clear by now that the picture is not complete if you don't think about temperature. So for thermal management, what we took as inputs are workloads that we collected from a data center. This was work done jointly with Sun Microsystems. They, it turns out, have a big reliability lab right next door to UCSD, and there they have access to all the workloads from their customers. So that's what we used. I have a student who worked for Sun. We took the floor plan and, you know, based on that floor plan and temperature sensor data, we were able to figure out how to schedule workloads. What you see in the picture here is Sun's Niagara chip. That's what we started working with. And one interesting thing that you should immediately notice is that where you put the workload on this floor plan will affect temperature. If I put something that's very hot right here, it's not going to get nearly as hot as if I put it in the middle, because the level two cache acts as a very big cooling mechanism, right? The same is the case at the server and at the data center level. Where you put the workload in a rack or in a box will really affect how well it can be cooled. So it's exactly the same concept that you see on chip that can be expanded to the whole box, effectively. For power management, we used sleep states and we used voltage scaling. We also used the ability to migrate threads as another knob to deal with the temperature issues. And the results you're going to see are results that come out of a thermal simulator. This is primarily because Sun wasn't too keen on releasing reliability data. But the conclusions actually hold. There are two classes of policies I'll show you. One class is optimal and static. The goal behind this was to see what the ultimate benefit for temperature actually is if, first of all, I optimize job allocation for minimizing energy only. So if I say I want minimum energy, does that solve the temperature problem? Right. Secondly, what if I also add a thermal constraint? How does that change things?
So those are the top two policies. The reason why I looked at the optimal is because that tells me what I should be striving for with my heuristics. Obviously optimal results are simply not possible in reality, where I'm doing online scheduling of a large number of jobs that I don't know ahead of time. And this is why all of the operating system schedulers are heuristic in nature. So then the dynamic ones. One uses load balancing, which is basically the default that we have in Linux and in Solaris, and this balances the threads for performance only. Some of the balancing actually helps temperatures, some of it doesn't. Then we enhanced it with information about the floor plan, so if the scheduler now knows that some cores are inherently likely to be cooler than others, simply because of where they are, it's going to do a little better job of allocating based on that knowledge. Okay. The next policy, a simple improvement over the coolest-core floor plan policy, is a policy that uses thermal sensors. It keeps information on the recent history per core of what the temperature has been, and depending on how hot it's been, it will give either a higher or lower probability of sending workload there. So if it's been reasonably cool, then it's going to have a higher chance of getting new workload. If it's been reasonably hot, then it will have a lower chance. The idea behind this is to actually try to smooth out thermal gradients. So there are a number of ways that things can fail. One way that things fail is because temperature gets hot in a particular location, in a hot spot. There you have electromigration, you have dielectric breakdown; there are a number of failure mechanisms that depend only on what the absolute value of the temperature is. The second way that things fail is because temperature changes a lot in time, and this is essentially a stress mechanism. So if it goes between cold and hot a lot, you end up with failures. The last way things fail, which turns out to be the biggest pain of them all, is what we call spatial gradients. So if across the die we have large differences in temperature, signals travel at a different speed in one area of the chip versus another, and you can end up with these nasty bugs to debug. These are not permanent failures; they're basically temporary failures that you can recover from, but they're horrible to deal with. So what we're -- yeah? >>: [inaudible] redundancy do you have in the [inaudible]. >> Tajana Simunic Rosing: That is a perfect question. So sensors to date have been really lousy, actually. And in fact on the original chip we had one sensor that was very bad. So what we ended up doing is putting the chip in an oil bath and using an IR camera so we could actually see what the temperatures were. Newer chips have more sensors. >>: [inaudible] temperatures as well. >> Tajana Simunic Rosing: Yeah. I know. But what are you going to do, right? What's the choice? The good news is that some of the most recent chips are coming out with a minimum of one sensor per core, and that actually turns out to be good enough. If I have one sensor per core, plus we have information about the floor plan and the thermal properties of the package, we can in fact filter out bad sensor data. And if you're interested in that, I can show you, you know, afterwards, some of the work we've published in this area. And this work in fact was done because we had all this pain with the single sensor on the die. >>: Is that [inaudible] Intel and [inaudible] are doing? >> Tajana Simunic Rosing: Yes. Yeah. Yeah.
Yeah. So it's actually fairly well known by both Intel and AMD and, you know, Sun, who also have thermal sensors on the die, that those thermal sensors tend to be fairly low quality. They have very low resolution A to D conversion, so they tend to be very inaccurate. You get very noisy data. And you have to calibrate them, and calibration turns out to be very non-trivial. So you can be really far off if you don't do this right. Yeah. And that actually is even more true for servers and racks. So as we talk about these scheduling policies, they're closed-loop scheduling policies, effectively, because we monitor sensors, we monitor the workload, we make a decision based on that, and then we try to adapt. You really have to be careful, you know, how to identify sensor inaccuracies, to smooth them out or to simply detect that something has failed and work around it. Otherwise you end up with crazy decisions. For example, one of the things that happened was we ended up having a fan that kept spinning up and down a lot because it was triggering off of a sensor that was actually bad. So, okay. So moving on. The last set of policies that I'll show you are: going to sleep when it gets too hot, slowing down when it gets too hot, and moving threads when it gets too hot. And then at the bottom I have exactly the same approach we used for power management, online learning that selects among these policies online as necessary. Okay? So first the optimal. This plot shows you the percentage of hot spots across the particular die that I was showing if we use load balancing, versus if we do optimal energy aware optimization -- now, mind you, this is a very small number of jobs here, because doing an ILP for lots of jobs is impossible to get done -- and then if we do thermally aware optimization. And what you find that's interesting is that energy aware optimization will actually create quite a few hot spots. It does give minimum energy consumption. So among these three, this is by far the minimum energy consumption. But it's at the cost of clustering jobs. And the reason why it clusters jobs, again, is because from a processor perspective it is much more efficient to shut a core off than to run it slower. Okay? So this, I think, fairly well illustrates that doing energy management without considering the cooling, thermal, and reliability aspects is a bad idea, because you end up with corner cases that potentially cause a lot of problems. Okay? So here is what happens if we now consider all the different possible options and use online learning to select among them. So default is the default load balancing, and these were actually results done on Solaris. We ran all of the workloads on their server, so the performance is all actual performance running on the server. The temperature numbers are all simulation. Then migration, so you move a thread if it gets too hot; power management; voltage scaling plus power management; and the adaptive random technique that I showed you that keeps track of temperature history per core. And you can see that this little adaptive random technique actually gives a pretty good benefit. If you combine it with voltage scaling, you get to the point where you are very close to the voltage scaling results but without the performance overhead of voltage scaling. And then online learning, which beats the best possible individual policy by 20 percent. The reason why it beats them is simply because it's adaptive.
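As an aside, the adaptive random technique in the comparison above can be sketched roughly as follows. This is an illustrative weighting only -- the history length and the inverse-temperature weighting are assumptions, not the scheduler's actual parameters:

```python
# Rough sketch of an "adaptive random" style core selector: keep a short
# temperature history per core and make cores that have recently run cool
# more likely to receive the next thread.
import random
from collections import deque

class AdaptiveRandomScheduler:
    def __init__(self, num_cores, history_len=8):
        self.history = [deque(maxlen=history_len) for _ in range(num_cores)]

    def record_temperature(self, core, temp_c):
        self.history[core].append(temp_c)

    def pick_core(self):
        # Weight each core inversely to its recent average temperature, so
        # historically cooler cores get a higher selection probability.
        weights = []
        for hist in self.history:
            avg = sum(hist) / len(hist) if hist else 50.0  # assume 50C if unknown
            weights.append(1.0 / max(avg, 1.0))
        return random.choices(range(len(weights)), weights=weights, k=1)[0]
```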
So it will select whichever policy makes the most sense at any given point in time. >>: Are you going to talk about the impact, the reliability impact? >> Tajana Simunic Rosing: Yeah. So I'm not going to show reliability numbers, because again, I was not -- >>: Whether you reduce the hot spots by 20 percent, why would I care? >> Tajana Simunic Rosing: You care because temperature is exponentially related to mean time to failure. So if I reduce the hot spot temperature by 10 degrees, there is an exponential relationship between that and your mean time to failure due to electromigration and dielectric breakdown. >>: I understand that. But is this extending -- I mean, I need servers that last me four years. This is making it last 20 years? >> Tajana Simunic Rosing: So -- >>: I mean -- >> Tajana Simunic Rosing: Yeah. So that's -- that's an excellent question. So for this particular set of results, and I could show you estimates that we have for reliability, we can actually extend the per core lifetime by about, you know, a factor of 10, approximately, by doing things right. At the cost of not saving as much energy, right? So you get the performance, you get reliability, you get some energy savings, but it's not as much as you could get if you went purely for energy savings. How it affects the whole server, that's a whole other question. And that answer I actually don't have yet. The problem with all of this data is that it's very hard to get reliability numbers out of people. So temperature numbers I can get pretty easily, but, you know, what the exact relationship is between the temperature, temperature differentials, and reliability -- that's much more tricky. Yeah? >>: [inaudible] flip this around the other way. Now that you've reduced the hot spots, normalized the temperature across the die. >> Tajana Simunic Rosing: Yes. >>: Can you now push the entire die harder -- >> Tajana Simunic Rosing: Yes. And see, that was what was behind this whole thing. Because effectively what we've done is we've created a more balanced thermal environment, which means that now we can actually raise the envelope, so we have more head room to add more jobs, effectively, before your cooling system has to kick into a higher gear. >>: Or looked at another way, let me run in higher ambient temperatures -- >>: Sure. >> Tajana Simunic Rosing: Yeah. >>: So that exactly -- >> Tajana Simunic Rosing: Yeah. And you know, the two are obviously directly related to each other. >>: Don't have to worry about peak. >> Tajana Simunic Rosing: Yeah. >>: Fundamental -- >> Tajana Simunic Rosing: And we can actually get better than this. Because temperature is a slow moving variable, we can do proactive thermal management, so instead of waiting for things to get hot and then doing something about it, we can in fact forecast that it will get hot and, based on that, allocate jobs so it doesn't get to that point. So it lets you work with whatever ambient temperature you happen to have in the best possible way, effectively. It says, if I have X number of servers with Y number of cores, here is how I'm going to pack jobs onto those cores so that the temperature remains within the envelope that I'm targeting. For this, we again take data from thermal sensors. We used an autoregressive moving average predictor. The reason why we used this predictor is because the simple exponential average performs very poorly for corner cases where temperature cycles a lot.
So if somebody decides to be really smart and save power by shutting the server off and turning it back on a lot, the exponential average dies on that. The nice thing about ARMA is that you can also calculate it online very easily. So you can fit the parameters and keep monitoring how well the model fits the measurements. Based on this we do temperature predictions -- for some number of scheduling instances into the future, what will happen -- and based on that we then schedule. Online, we monitor the ARMA model, validate it, recalculate it if necessary, and move on and make the decisions. So when we add this capability to the system, here is where we end up in terms of reducing hot spots: over 80 percent relative to default load balancing, and over 60 percent as compared to reactive migration, at practically no performance impact. Simply because, again, temperature does not change that quickly, so we leverage that. And similarly, gradients are reduced to less than three percent in space and to less than five percent in time. So you end up literally in a situation where you have a fairly flat and low thermal profile, which is nirvana for a reliability engineer. Okay? Yeah? >>: [inaudible] data that compares [inaudible] the temperature coming out of a server which [inaudible] versus one that isn't? I mean, is it actually detectable? You are using it -- >> Tajana Simunic Rosing: I have data that compares multi-socket temperatures, and there's definitely a big benefit that we've measured. >>: By a degree or two or, you know, what's the scale? >> Tajana Simunic Rosing: It depends on what you are running. >>: No, I'm just -- in any given -- I mean, picking a configuration, I don't know what it is, but what sort of differences were you seeing [inaudible] scale [inaudible]. >> Tajana Simunic Rosing: In the best case we've seen about 10 degrees lower on the die, which is actually a big deal for reliability. But that is the best case. If you really run CPU-intensive jobs on all of the cores, there ain't nothing we can do about it; we're stuck. In that case, what he mentioned -- slowing down -- is about the best it's going to get. Maybe your performance isn't stellar, but at least your server hasn't died yet. Yeah. So the other thing that we have done, which I haven't covered here, is we looked at combining proactive scheduling of jobs with fan control. If I have some number of sockets and fans associated with those sockets trying to cool them, it turns out that the power of a fan is a cubic function of its speed. So as you go from low to high speed, you consume a lot more power. And so, if one socket's fan is already running at a higher power level, you have a lot of motivation to pack in as many jobs as that level of airflow will handle. When we do this jointly, we actually got savings of about 80 percent on combined socket and fan power, which is fairly big. And it's relatively simple to do. It requires the operating system to tap into thermal sensors and to have some level of fan speed control. Thermal sensors you can already do -- we've done it, so you can do it. Fan speed control is trickier, because right now that's done with a controller processor that's on the board, so you have to make some deals with manufacturers.
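A compact sketch of the proactive loop just described, under two simplifying assumptions: temperature is forecast with a plain least-squares autoregressive fit rather than the full online ARMA machinery, and the fan model is just the cubic speed-to-power relationship mentioned above. All thresholds and constants are illustrative:

import numpy as np

def fit_ar(history, order=3):
    """Least-squares AR(order) fit to a per-socket temperature trace --
    a simple stand-in for the online ARMA fitting described in the talk."""
    y = np.asarray(history, dtype=float)
    X = np.column_stack([y[i:len(y) - order + i] for i in range(order)])
    coeffs, *_ = np.linalg.lstsq(X, y[order:], rcond=None)
    return coeffs

def forecast_max(history, coeffs, steps=5):
    """Maximum predicted temperature over the next few scheduling intervals."""
    window = list(history[-len(coeffs):])
    peak = float("-inf")
    for _ in range(steps):
        nxt = float(np.dot(coeffs, window))
        peak = max(peak, nxt)
        window = window[1:] + [nxt]
    return peak

def fan_power(rpm, k=1e-6):
    """Fan power scales roughly with the cube of fan speed."""
    return k * rpm ** 3

def place_job(sockets, t_limit=75.0):
    """Pick the socket with the smallest marginal fan-power cost whose
    temperature forecast stays under the thermal limit. `sockets` maps a
    socket id to (temp_history, current_rpm, rpm_needed_with_extra_job)."""
    best, best_cost = None, float("inf")
    for sid, (hist, rpm_now, rpm_needed) in sockets.items():
        if forecast_max(hist, fit_ar(hist)) > t_limit:
            continue                  # would violate the thermal envelope
        # If the fan is already spinning fast enough, the extra job is
        # nearly free; otherwise pay the cubic jump to the higher speed.
        marginal = max(0.0, fan_power(rpm_needed) - fan_power(rpm_now))
        if marginal < best_cost:
            best, best_cost = sid, marginal
    return best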
But it's certainly doable. They can expose the variables that are needed to you. And the benefit, as I said, is big enough that it may be worthwhile thinking about it. And if it's that big in a single server, I would imagine you could get even bigger benefits at the whole-box level. So that's kind of what we're after right now. >>: Hopefully just have a simple equation which said okay, this fan is already running -- >> Tajana Simunic Rosing: Yes. >>: -- at high speed, so just pack all the jobs on there anyway [inaudible]. >> Tajana Simunic Rosing: Well, but it also goes the other way, right? You can also have a situation in which you've packed the jobs and more jobs keep coming in, so the question is, do I put them on that particular socket and possibly trip it over the control point, or do I put them on some other socket, and which other socket should I choose? And if I'm starting to get overloaded on a set of sockets, how do I offload, and to whom? Right. >>: [inaudible] components in the server as well to check the monitor. >> Tajana Simunic Rosing: Yeah. So my student has just submitted a paper that also looks at the memory subsystem. Because, depending on who built the server, they may have put fans in such a way that they blow first across all of the memory, and then the nice hot air comes to your CPU. In some cases they actually build big air ducts so that a subset of fans goes to the CPU and the other subset goes to memory. So it really depends on who built it and how they did it, and the end result of your job scheduling will change drastically depending on what configuration is there. So what you really want is to be able to learn this online and adapt to it, which is why we want to have the controls and the monitoring capability. As we move from a single server or a few servers to the multi-box level, we started using virtualization because it's a lot easier. So we developed a system that we call vGreen that sits on top of Xen. The system monitors all the different virtual machine and physical machine properties and communicates them to a scheduler -- right now there is a server that does centralized scheduling across multiple machines. The idea is to decide which VM should go where, and to do power and thermal management in tandem with that. For workload characteristics, the results you'll see right now are very basic and elementary: they basically say, is it CPU intensive or is it memory and I/O intensive, and depending on that we'll allocate it to an appropriate place. The interesting thing is that even this simple variable already gives good results, so after I saw that I told my students, boy, we've got to push the envelope here, because you're guaranteed to get the low-hanging fruit in no time at all. So that's what we're doing. So here is a motivating factor behind this. We all know that machines consume a huge amount of power just being turned on, and the major dynamic component is the CPU, so the CPU really determines the top range of power. Most people therefore assume that CPU utilization is good enough for estimating the power consumption of the whole server. And to a rough degree it may be good enough. However, what we found is that depending on what applications you run, you can actually see a pretty big range in the overall server power that you measure.
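Stepping back to vGreen for a moment, here is a toy illustration of the kind of characterization and placement it performs, assuming per-VM hardware-counter samples are available; the counters, the threshold and the greedy placement rule are assumptions for the sketch, not vGreen's actual heuristics:

from dataclasses import dataclass

@dataclass
class VMStats:
    name: str
    instructions: float        # retired instructions in the last interval
    mem_accesses: float        # e.g. last-level-cache misses in the interval

def classify(vm, mem_ratio_threshold=0.02):
    """Label a VM CPU-bound or memory-bound from counter samples.
    The threshold is illustrative, not the one used in vGreen."""
    ratio = vm.mem_accesses / max(vm.instructions, 1.0)
    return "memory" if ratio > mem_ratio_threshold else "cpu"

def place(vms, hosts):
    """Greedy placement: spread memory-bound VMs across hosts, then fill in
    with CPU-bound ones so no host is saturated on a single resource."""
    assignment = {h: [] for h in hosts}
    for label in ("memory", "cpu"):
        for vm in (v for v in vms if classify(v) == label):
            # pick the host with the fewest VMs of this label so far
            host = min(hosts, key=lambda h: sum(
                1 for v in assignment[h] if classify(v) == label))
            assignment[host].append(vm)
    return assignment

# Usage with made-up counter values:
vms = [VMStats("mcf", 1e9, 6e7), VMStats("perl", 1e9, 4e6)]
print(place(vms, ["host-a", "host-b"]))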
We saw -- here I'm showing a 65-watt delta; we've seen up to about 80 watts depending on the server, without any power management, just depending on what applications you run. Okay. That tells you that smart placement of jobs can make up to 70 watts of difference. Right? So that is exactly what we did. We used a combination of SPEC and PARSEC benchmarks, two servers and one separate centralized scheduler server, and all we did was dynamic workload characterization and migration. And what we got is 15 percent AC power savings -- so this is whole-box server savings -- with a 20 percent speedup. So both a performance and an energy benefit. Yeah? >>: [inaudible] do you have the [inaudible] together or -- >> Tajana Simunic Rosing: So this actually shows the same job at increasing utilization. If I have MCF, I'm going to run it on one core, on two cores, on three cores, up to whatever cores and threads I have available, and it shows what happens there. On the next slide it's actually mixes of jobs. >>: Are you doing [inaudible] capping on those -- >> Tajana Simunic Rosing: Do I do what? >>: The virtual machine utilization capping to achieve the 25 percent? How do you do the control -- >> Tajana Simunic Rosing: Oh, yeah, we basically capped it artificially, because that way we can control it. Because how else are you going to create an experiment, right? Yeah. So really what these numbers should tell you is that there is actually a lot of potential benefit that hasn't been exploited here yet, when you combine the ability to characterize jobs, to place the [inaudible] appropriately, and then to do power and thermal management jointly. So going forward, I'm actually heading a fairly big center called MuSyC, and the particular aspect of it that I'm focused on is leading an effort on energy-balanced data centers that spans from software to system and down to platform. Basically we're looking at how we can do multi-scale, cross-layer, distributed and hierarchical management. The question is, where should the control sit? There are lots of different layers. And how should the communication happen between these layers? We want to ensure that energy is spent only when and where it's needed instead of wasted in lots of interfaces. Okay? And this brings me to a quick summary, which I think I can skip, and I have a few more minutes to maybe touch on some of the sensor stuff. I think what's interesting about the sensor stuff is that all the ideas I just talked about, minus thermal management, pretty much apply. >>: We're actually starting to talk about thermal management in these as well. >> Tajana Simunic Rosing: So, yeah, I guess if you have a really bad environment, your CPUs have problems. >>: But we're not there yet. But it's come up. >> Tajana Simunic Rosing: So the work that I have been doing that I haven't talked about at all is looking at heterogeneous multicore processors that go into base stations. And thermal management there is actually a big deal. So I have a couple of papers that recently came out on that that you might be interested in. Yeah? >>: One comment. This is a concern I've had as we do more and more management. >> Tajana Simunic Rosing: Yes. >>: On different systems. What work, if any -- we've actually seen failures of servers as a result of incompatible controls where you reach oscillation -- >> Tajana Simunic Rosing: Yeah. >>: So what work, if any, have you done in that area?
>> Tajana Simunic Rosing: I've done quite a bit of work at the multi-socket, multi-chip level. I have not done anything at the server level. The work that we did on multi-sockets actually looked at fan failures and oscillation, so designing policies that are stable across multiple variables. We are going in that direction. The problem I'm running into is that reliability has been difficult to model. So if you have data that can help me do a good job of modeling, that would be good. >>: [inaudible] control systems -- >> Tajana Simunic Rosing: I know, it's -- >>: [inaudible]. >> Tajana Simunic Rosing: No, I totally hear you. I know. >>: [inaudible] and end up fighting each other -- >> Tajana Simunic Rosing: Yeah. Yeah. And that is why, when I said multi-scale, that's exactly what I'm after. You have all these different controls, and each one individually could be designed to do a great job, but when you put them together, you could have catastrophic failures. How do we avoid those catastrophic failures? I don't have a good answer yet. But we have started -- in my case we started kind of small and we're building up from there. And I also have a part of my team that started from the top down, getting all the thermal data in real time and trying to understand how it relates to the current control loop for the data center container. Because there is already a controller in there, and if we make changes to the controller, what happens? Which is a scary thought. So sensor networks are very much a distributed problem as well, except in this case it's also physically very much a distributed problem. What you see on this map is a large-scale sensor network deployment. UCSD is this little dot right here; that gives you an idea of the size. The range inland is about a hundred miles. The network spans up to, and just beyond, the Salton Sea. It goes 70 miles off the coast -- the connection off the coast was done jointly with the Navy, because they are interested in using our wireless backbone. It reaches close to Mexico and goes all the way up to Riverside County. What you see on this map is only the top layer of the network, the wireless mesh backbone. Underneath every one of these dots are literally hundreds of sensor node cluster heads and thousands of sensor nodes. And the sensor nodes do all kinds of different things, most of which have nothing to do with computer science at all. In fact, probably the largest density of nodes is for seismic monitoring, which Scripps Institution of Oceanography deploys. They have a number of ecological research stations in place that measure things like temperature, water quality and soil. And we also have people who are monitoring wildlife, like the California Wolf Center. Lots of different applications. What I find really exciting about this deployment is exactly this: it represents a really live test case of what life is going to look like as we start using more and more sensors and as people who are not engineers start deploying and using the networks. This network has also been used to help the California fire department fight forest fires in the San Diego area. The last big fire a few years back leveraged our network to a pretty big degree; a lot of the images that you saw actually came from our sensors out in the field. So here is an example. At the very low end, I mentioned we have these earthquake sensors. They don't produce that much data.
They use about five kilobits per second of bandwidth per sensor. However, they are somewhat latency sensitive when something exciting happens, where something exciting is defined as a big earthquake, right? At that point you really want this five-kilobit-per-second trickle to make the hundred-mile trip over a gazillion hops very quickly. At the middle layer we have things like motion-detect cameras and acoustic sensors. In the case of the motion-detect cameras -- you can see the wolves here -- they want to turn on when some interesting animal happens to walk by, keep recording video while the interesting animal is still there, and then stop. At the same time, we want to collect acoustic data. The people studying wolf behavior want to know, for this particular howl -- if you zoom in you can see the howl on the acoustic sensor -- how it correlates with the particular body posturing the wolf is making. So you need time correlation between the two, because the two sensors aren't even physically at the same location; they're in slightly different locations because of noise. Okay. High-resolution still cameras -- I give that example primarily because a lot of data comes out of those. Wildfire tracking cameras, day and night: these are still images from video cameras strategically located at places that help us track how a fire is progressing. We have also done experiments with helicopters carrying cameras that stream video to mountain tops and to the back-end control of the California fire department. And then at the very high end, there are two observatories -- Palomar and, not Livermore and not Lowell, it's an L observatory, sorry, I'm spacing on it -- but basically two observatories. Palomar observatory produces [inaudible] up to 150 megabits per second at night. So you can see a small problem: if I have 150 megabits per second streaming through my network in the middle of the night and at the same time there happens to be a fire somewhere close to Palomar, there will be a problem with how quickly the fire data gets through. So we need to be able to adjust our quality-of-service settings along the network very quickly and easily. At the ultimate low end we have devices that don't even use batteries. This is an example of a structural health monitoring device that my group developed that uses solar cells and supercaps. The reason we went this direction is that people like Los Alamos National Laboratory, which funded this project, want to deploy the sensors in areas that are very difficult for humans to access, and the shelf life of a solar cell and a supercap is longer than a battery's. The problem when you have a solar cell and a supercap is that your power source is very unreliable. Sometimes it's really good, sometimes it's really lousy, and yet you need to be able to deliver results regardless of how bad it is. For this particular application, the board is doing what we call active sensing. Each one of these boards, which has actually been shrunk significantly, has access to 16 PZTs, piezoelectric devices. Each piezoelectric device can generate a wave signature that's sent through the structure and sensed by another one, so you can create pairs of paths. The signature looks something like this: the red line shows what's actually sensed, and the blue line is how it should look if everything was okay.
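As an illustration of the comparison being described, one simple way a node might score a sensed path signature against its healthy baseline is a normalized residual per path; the metric and the threshold here are assumptions for the sketch, not the actual damage-detection pipeline on the board:

import numpy as np

def damage_index(sensed, baseline):
    """Normalized residual energy between the sensed waveform (the red
    line) and the healthy-structure baseline (the blue line) for one path."""
    sensed = np.asarray(sensed, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    return np.sum((sensed - baseline) ** 2) / np.sum(baseline ** 2)

def flag_paths(paths, threshold=0.1):
    """Flag every PZT pair whose signature deviates too much.
    `paths` maps (src, dst) -> (sensed_samples, baseline_samples),
    each roughly 10,000 samples per path as mentioned in the talk."""
    return [pair for pair, (s, b) in paths.items()
            if damage_index(s, b) > threshold]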
So in the plot you can see clearly that there is a problem here. Per single path we get 10,000 samples. With 16 PZTs and 10,000 samples per path, that is a lot of data, right? And that data, as you can see, has to be processed. The type of processing done here is very similar to what you do for video decoding or any other kind of large-scale signal processing. As a result, we had to put a DSP on board to do this effectively. Now, this processing is significant enough that during the nighttime I won't be able to do a whole lot of it. During the daytime, when it's nice and sunny -- I love San Diego -- you can do a lot, right? So this picture shows the variability in the amount of sunlight that's available, and a prediction that we developed to give us an estimate of how much we're likely to get. >> Navendu Jain: Five minutes. >> Tajana Simunic Rosing: Yeah, I see it. Yeah? >>: [inaudible]. >> Tajana Simunic Rosing: I'm sorry? >>: How often do you need to do this? >> Tajana Simunic Rosing: Well, that's the good news: a couple of times a day. Unless something bad happens. So a bomb drops, an airplane runs into a building, an earthquake happens -- then you need to do it more. So on the bright side, it's not often on average. On the not-so-bright side, there are trigger events for which you need to be ready to respond, so you always have to have some reserve planned. And in the case of solar, what we want is to be able to trade off the accuracy of the computation against the amount of energy that's currently stored and that's likely to become available. That is exactly what we do with the predictor, and what you see down here is that at some points in time we execute very few tasks and at other times we execute lots of tasks. So you can see accuracy grow significantly depending on how much juice we've got. Okay? >>: [inaudible]. >> Tajana Simunic Rosing: It doesn't, but it can estimate it from what it's currently getting out of the solar cell. Basically it says, currently I'm getting X amount of energy; given the recent history, which it keeps, I predict that I'm going to get Y amount. And it turns out that for about a 30-minute time period we can be within 10 percent accuracy, which is really good enough, because a typical run, if I did all 160 paths, takes about three and a half minutes to complete. So 30 minutes is good enough. Okay. So going back to my original sensor network: you saw that at the top level I have this high-speed wireless mesh, and in the middle I have these sensor node cluster heads that collect the data from a whole bunch of sensors at the low end. The sensor node cluster heads are where a lot of the significant computation happens, and where the decision on who gets what data occurs. So there is actually quite a high demand for delivering reasonable quality of service with good battery lifetime. For that we developed routing and scheduling algorithms that help us save power while improving throughput, because without them, this would be the layer at which energy becomes very critical, and that had not been addressed yet. And that enabled us to start thinking about combining body area networks with the environment. So for this new CitiSense project that I mentioned a little bit earlier, what we're looking at is how to monitor people 24-7 in order to understand how diseases such as asthma develop in the first place.
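Going back to the harvesting-aware scheduling described a moment ago, here is a minimal sketch of the idea: estimate near-term solar input from recent history, then admit only as many sensing and processing tasks as the stored plus predicted energy can cover while keeping a reserve for emergencies. The exponentially weighted predictor and all the numbers are placeholders, not the actual predictor, which the talk says stays within about 10 percent over a 30-minute horizon:

def predict_harvest(recent_mw, horizon_min=30, alpha=0.5):
    """Exponentially weighted estimate of average solar input over the
    next `horizon_min` minutes, from recent per-minute samples in mW.
    A stand-in for the predictor described in the talk; returns joules."""
    est = recent_mw[0]
    for sample in recent_mw[1:]:
        est = alpha * sample + (1 - alpha) * est
    return est * horizon_min * 60 / 1000.0   # mW * seconds -> joules

def admit_tasks(stored_j, recent_mw, task_cost_j, n_paths=160, reserve_j=50.0):
    """How many PZT paths to run this round: spend stored energy plus the
    predicted harvest, but always keep a reserve for emergency events."""
    budget = stored_j + predict_harvest(recent_mw) - reserve_j
    return max(0, min(n_paths, int(budget // task_cost_j)))

# Usage with illustrative numbers (120 J stored, ~300 mW of recent sun,
# 2.5 J per path measurement):
print(admit_tasks(stored_j=120.0, recent_mw=[300, 280, 350], task_cost_j=2.5))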
For asthma, you need to monitor air quality, you need to monitor the physical movement of a person so you understand how physical exertion affects breathing patterns -- you need to monitor a whole bunch of parameters. And for that we've actually already had a small deployment. What you see in this picture are the results we got by just monitoring physical activity. We had 63 patients use our system with the goal of becoming more physically active and therefore losing weight. And it turned out that the simple act of monitoring it and providing quality feedback in real time through the cell phone motivated people hugely. People really loved the system, and they actually lost significantly more weight than the people who didn't use it -- in this case, over six pounds more than the control group, which is a big deal for four months. The most important thing, I think, is that over 95 percent of the people wanted to continue using the system after the study was done and wanted to buy it for their friends and family. So people really liked what they got, which is not a small thing when you're giving real-time feedback on somebody's physical activity; it can be very annoying if you don't do it right. So, yeah, think about it. So this is what actually got us thinking: if we combine this with the environment, we can get a truly powerful way to provide doctors, medical professionals and public health officials with tools they can use to understand what to do and how to encourage us. Unfortunately, as you do this, you end up with increasingly complex sensor networks: you get stuff on your body, you get stuff in the environment, you've got these what we call local servers -- basically cell phones -- and you've got the back end. And to date, most people have basically said, just collect the data, send it to the back end, have the back end figure it out and give you the answer. If you think about healthcare, that absolutely does not work. You don't always have wireless connectivity in the first place. And secondly, you've got all this great processing right over here -- why are you not using it? Wireless actually costs you a lot of power in the first place. So what my group has been looking at is: if we have sets of tasks that have dependencies between them and that have some performance requirements -- I need to provide feedback within some amount of time -- what is the right way to dynamically schedule them across this whole network in a way that maximizes perceived battery lifetime per user? Because that's what you actually want: you want each individual to be happy; you don't really care about the whole system. And it has to deliver the information that people need. So here is an example of a very simple task graph we may be working with, and two different types of binding -- again an ILP, back to my slides some 20 slides back. The goal was to minimize the maximum energy consumption rate among all of the sources. In effect, we balance how quickly you're draining batteries from each individual user's perspective. And we got an assignment that was on average 20 percent better than moving everything to the back end under ideal conditions. So assuming you have perfect wireless connectivity, nothing changes, same tasks, you're going to do better if you assign some of the computation tasks to local nodes than if you send everything to the back end.
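A small sketch of the objective just described: given a task graph and a set of candidate nodes (body sensors, phone, back end), pick the assignment that minimizes the maximum capacity-normalized energy-drain rate across the sources. The brute-force search below is only workable for toy graphs -- the talk's formulation is an ILP -- and the cost model is an assumption for illustration:

from itertools import product

def max_drain_rate(assignment, task_energy, comm_energy, capacity):
    """Energy-drain rate of the worst-off node, normalized by battery
    capacity so a tiny body sensor and a phone are compared on perceived
    lifetime rather than raw joules.
    task_energy[t][n]            -- cost of running task t on node n
    comm_energy[(t, u)][(n, m)]  -- cost charged to n for sending t->u data to m
    capacity[n]                  -- battery capacity of node n (joules)"""
    drain = {n: 0.0 for n in capacity}
    for task, node in assignment.items():
        drain[node] += task_energy[task][node]
    for (t, u), costs in comm_energy.items():
        src, dst = assignment[t], assignment[u]
        if src != dst:
            drain[src] += costs[(src, dst)]
    return max(drain[n] / capacity[n] for n in capacity)

def best_assignment(tasks, nodes, task_energy, comm_energy, capacity):
    """Brute-force min-max search over all task-to-node bindings;
    exponential, so only for toy graphs."""
    best, best_cost = None, float("inf")
    for combo in product(nodes, repeat=len(tasks)):
        a = dict(zip(tasks, combo))
        cost = max_drain_rate(a, task_energy, comm_energy, capacity)
        if cost < best_cost:
            best, best_cost = a, cost
    return best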
Now, if you look at a real-life scenario where you actually have to start from some initial assignment, detect that things are changing, and then adapt at runtime, we can get about 80 percent longer battery lifetime than the best static implementation. So we take the ILP result, we deploy that, and then we do 80 percent better by using simple online heuristics. Okay? So that's where we're going, going forward. You can see that in my group we're looking at everything from ultra-low-power sensing systems all the way to large-scale servers. The idea is the same across this whole spectrum: we want to get energy-efficient behavior with reasonable performance, and we want to do it in a way that's relatively simple to verify and that's reliable. So that's all I had for you, and I think I just ran out of time. So this is good. [applause]