>> Victor Bahl: Thank you very much to those of you who are here, and thank you to those of you who are watching us remotely. It's a pleasure to have Abhinav Pathak. Abhinav has been my intern; this is his second internship with Microsoft Research. His interests are quite diverse: he has worked quite a bit on databases, network data centers, measurement, modeling and other good stuff. But his true passion is in smartphones and mobility, the use of mobile phones, et cetera, and he's been working on some really interesting ideas related to energy management in phones. Energy management, as you know, is a really important area, and this work just got accepted in [inaudible] with lots of great reviews, so I thought it would be good for him to share some of these ideas with the rest of us. With that, Abhinav. >> Abhinav Pathak: Thanks, Victor, thanks for the introduction. Thank you everyone. I'm Abhinav Pathak, and I'm here to talk about our work on fine-grained power modeling for smartphones using system call tracing. This work was done along with my advisor Charlie Hu from Purdue, and Ming, Victor and Yi-Min at Microsoft Research. This is the year 2011, and I don't need a motivation slide to say that smartphones sell and a lot of people use them. It's an accepted fact. What I'll talk about is one of the most crucial problems smartphone users face today: energy. Smartphones have limited battery life. If you are a moderate to heavy user of a smartphone, you have to carry a charger with you all the time: a car charger, a charger in your office, a charger at home. You need to make sure the battery doesn't run out. Let's ask a very innocent question: why do we want to save energy on phones? Obviously, so a moderate to heavy user can charge once a day, go home, charge, and be happy. But even for light users, people who don't depend heavily on their smartphones, you can provide a lot more functionality on the phone. You can enable continuous sensing. You can provide a lot more applications and services, even the dreaded background applications: antivirus, search indexing and so on. Now, the first question we need to ask if we want to save energy is: where is the energy being spent? Which component of the phone is consuming a lot of energy? Which application? Which process, which thread? Can you go further and ask which function in your code is consuming a lot of energy? The reason we want fine-grained measurement is that we are targeting developers. We can give them information: this particular set of functions consumes the most energy; this function you can probably improve; that function you cannot, it is already optimal. What do we do with the measurements? For one thing, you can provide app developers with a heat map. And while doing the measurements you learn how components interact and how energy is dissipated in the phone, so you can develop low-level strategies like scheduling: how do you schedule devices so that you can save energy? With this in mind, let me show the simplest way to measure power. The simplest way is with a piece of equipment, a Monsoon power monitor: you perform a 20-step surgery on your phone, bypass the battery, connect the terminals.
Connect the monitor to an external power supply, and then you run your application. It gives you a nice graph: this is the power consumed, this is the energy consumed. Now, in this approach, even though it is very expensive, is your mobile even mobile anymore? You're hooked to your desktop. You're sitting right there. You cannot move. Can you do the measurements in your car? It's very difficult. I know people who have done that, and I know what the problems are. The second question is: does this answer which component, which process, which thread is consuming energy? It just gives you an overall estimate: this device, this phone, is consuming this much energy. The silver lining in this methodology is that it is absolutely fine-grained. It gives you a measurement every 200 microseconds: at this particular instant, this much [inaudible]. We'll keep this in mind and proceed to how people model the energy consumption of smartphones. Feel free to stop me at any time for questions. Power modeling on smartphones: I'll talk about the state of the art, which is a very intuitive way to model smartphone energy consumption. There are several components on the phone: the screen, the SD card, the CPU, wireless, GPS and so on. In this approach, what do people do? There are two phases, a training phase and a prediction phase, and the ultimate goal is to predict the energy consumption. In the training phase you calculate, on average, the energy spent per byte of transfer, per byte sent or received. On average, if I use the CPU at 100 percent, how much energy does the CPU consume? Similarly, if I write one byte to the disk, how much energy is consumed? Then, during the prediction phase, every second I read /proc-like counters which give me aggregate statistics: in the past one second this much CPU was used, this many bytes were written to the disk, and so on. I multiply those with the per-unit numbers, usually a linear estimation, and that is my predicted energy. >>: [inaudible]. >> Abhinav Pathak: Yes. It fails miserably. >>: How did you consider the energy for send and receive? Because it varies completely with factors such as [inaudible] and the characteristics of the network, and it varies hugely with the size of the data that you're sending and receiving. >> Abhinav Pathak: Right, absolutely. That is what I'll show in the next slide. Even before I show you the numbers, I'll convince you it's not going to work. We even tried it with numbers; I'll have that towards the end. >>: I'm curious to see -- >> Abhinav Pathak: Sure. Let's see why this strategy doesn't work. I'll read the equation: predicted consumption is an estimation from the utilizations and so on. Now, the first fundamental yet intuitive assumption this model makes is that only active utilization implies energy consumption. That is, only if I'm sending and receiving on the network do I consume energy, nothing else. Only if I'm reading or writing to the disk, to the SD card, do I consume energy, nothing else. What we found is that on today's smartphones this assumption is wrong. Smartphones continue to draw a lot of energy even when you're not actively using the device. >>: [inaudible]. >> Abhinav Pathak: Sorry?
>>: [inaudible]. >> Abhinav Pathak: I have examples on the next slide of what I mean. A simple example: you open a file, you consume a lot of energy. You close a file, you consume a lot of energy. You close a socket, you change the power state. >>: But the first example [inaudible] seeking and consume a lot of -- >>: It makes syncing on the device -- >>: I think your graph -- >>: Even when you close a file you still use a component. >> Abhinav Pathak: Right. I'll show you examples for each of the assumptions that are wrong here in the next slides. The second assumption: devices don't interact in terms of energy, and that is why you can linearly add them. Basically it says that no matter whether I'm using the disk or not, the network will consume the same amount of energy if I send N packets. This is also wrong. What we observed is that when you run multiple devices together, their energies need not simply add. Third assumption: energy scales linearly with the amount of work. If sending ten packets consumes X amount of energy, then sending 20 packets will consume 2X. This may not hold either; in fact, sending 20 packets could consume 4X, and I can write an application that does exactly that. I'll show you examples where all three of these assumptions fail really badly. But before the examples: with such models it is also hard to drill down to the process level, thread level, function level, because the counters are not readily available. How much network data was sent by this particular function? Those counters are not readily available. And there are new, exotic devices that have no quantitative utilization at all, like GPS and the camera. There is no "I sent ten packets per second" for GPS; I just turn GPS on. With the camera I turn it on and take a picture. >>: The point you're trying to make is that the linear model is not appropriate; do I agree with you? >> Abhinav Pathak: Right, and additionally there are these problems. >>: That's not a problem with the model -- >> Abhinav Pathak: Right, this is not a stated problem, but this is -- >>: This is a side -- >> Abhinav Pathak: A side issue. Let's jump to the examples. First: only active utilization implies energy consumption. What did I do? I took an HTC Touch smartphone running Windows Mobile 6.5 and did a simple experiment: I open a file and I plot the energy consumption. On the x-axis I have time; on the y-axis I have the current consumed. Throughout this set of slides I'll talk about power in terms of current consumption; the actual power will be 3.7 volts times whatever the number is here. What we see is that the moment you do a file open, there's a spike. You don't do anything else, you just sleep after this, and the component continues to draw a lot of power. There is no active utilization in this zone, and a file open is not counted as utilization; it's not actually a read or a write. Similarly, file delete, file close and file create all show the same characteristics. Next example: several components have a tail state. We just saw a tail state after a file open. Here we open a socket, we send some data on the socket, we close the socket right here, and then for roughly two seconds it still stays in that high-power state. It continues to draw a lot of power. This is seen on disk, Wi-Fi and GPS, on Android and Windows Mobile, all of these.
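To make that accounting gap concrete, here is a minimal Python sketch using the rough figures quoted in this talk (about 180 mA extra while sending, about 110 mA in the tail until socket close, a 3.7 V battery); the one-second send and two-second tail durations are illustrative assumptions, not measured values.

# A minimal sketch (not the talk's tool) of why utilization-only accounting
# misses tail energy. Currents are the rough ones quoted in the talk; the
# durations are assumed for illustration.

VOLTAGE = 3.7  # volts

def energy_mj(extra_current_ma, seconds):
    # Energy in millijoules: mA * V * s = mJ
    return extra_current_ma * VOLTAGE * seconds

send_time = 1.0  # seconds of active sending (the only thing a linear model charges for)
tail_time = 2.0  # seconds the radio lingers in its high-power tail before socket close

active = energy_mj(180, send_time)
tail = energy_mj(110, tail_time)

print(f"utilization-only estimate: {active:.0f} mJ")
print(f"with the tail included:    {active + tail:.0f} mJ")
# Here the tail alone (about 814 mJ) exceeds the active send energy
# (about 666 mJ), so a model that only charges for bytes sent
# underestimates by more than half.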
And the more interesting thing we observed is that when we changed the handset, on a Touch running Windows Mobile 6.5, system calls have the ability to change the power states. I start sending data on the socket. I stop sending data, and until I do a socket close, the power is not going to drop. >>: So is this a socket on Wi-Fi or -- >> Abhinav Pathak: This is Wi-Fi. This is Wi-Fi. >>: Wi-Fi is much more [inaudible]. >> Abhinav Pathak: Right. It's actually much worse there; the tail lasts really long. You had a question? >>: I didn't, but I have one now. I can see the last one happening, because you're trying to maintain the connection since you're not closing the socket. >> Abhinav Pathak: Right. >>: If you have to maintain the connection you have to stay on. >> Abhinav Pathak: Right. >>: Or you lose the connection. >> Abhinav Pathak: Sure. >>: But I'm puzzled by the first one. Why would the first one happen? Is it because the [inaudible] remains in a high energy state? Do you have any intuition? >> Abhinav Pathak: This is what we think it is; the drivers are closed, so we don't know the answer for sure. The device continues to draw power in the hope that you'll have more operations. >>: What do you mean by power, right? It remains on and doesn't go to sleep mode. >> Abhinav Pathak: All we see is that the processor is not utilized. >>: Not utilized; it will go down after a few seconds. It's more likely an inefficient driver [inaudible]. >>: The 5-four doesn't have Wi-Fi. >>: I'm talking about that one. But that one is the SD card driver, and who knows what happens there. Maybe it's a compaction of the flash when you do something like that. It's very -- >> Abhinav Pathak: It could be. Right. >>: Makes sense. >> Abhinav Pathak: We don't have the tools to know exactly what's going on here. Let's look at the second assumption: devices don't interact in terms of energy. I did three simple experiments. In the first experiment I send data, sleep for some time, then do a socket close, to show there exists a tail. In the second I run a for loop ten million times doing some floating point computation. In the third experiment I do the same two things together, sending two megabytes of data while spinning the CPU, and repeat this five times. Now, in the first scenario, on the x-axis we have time and on the y-axis we have the current. What we see is that the moment we start the send, we consume an additional 180 milliamperes of current. After the send is done, we consume an additional 110 milliamperes until we perform the socket close. In the second experiment, the moment we start spinning the CPU we consume an additional 200 milliamperes until the operation completes. In the third experiment, when we do the two things side by side, we observe a power profile like this. Let's look in detail at what happens. When we start the send in the first iteration, we send a bit of data and jump up by 180 milliamperes, which is perfect, consistent with the first experiment. What we expect after the send is for the network device to consume an additional 110 milliamperes, because it enters the tail once we stop sending; and since I'm running the CPU on top of it, I expect to consume 200 plus 110 milliamps. But what happens is we see only the CPU's additional draw during that first loop. What happened to the tail? Where did the tail go?
I go to the last iteration -- >>: Your CPU spin seems to be driving -- >> Abhinav Pathak: Sorry? >>: The ten million CPU spins seem to be driving the time. >> Abhinav Pathak: Right. At this particular point, each iteration is one-fifth of what we ran before. What we expect here, at this time, is a network tail, because we just used the network device; we just sent. >>: But the power is used by the CPU, because you have to compute. >> Abhinav Pathak: Exactly. >>: But the time below is longer. Your total time below is stretched out to like 16. >> Abhinav Pathak: Right. >>: The other one is only ten. >> Abhinav Pathak: The upper one is only ten, and we have a six here. >>: I guess the point is, and that's the basis of this, you're right, he's right, but the big meta point is you can't just look at numbers and add them up and say -- >> Abhinav Pathak: What we see is that after the last send is completed we have the last CPU spin, and after the last CPU spin is over there's a tail of 110 milliamperes. After we do a socket close, it vanishes. A clear case of devices interacting with each other in terms of power. The last -- >>: The devices; I guess you're talking about components inside -- >> Abhinav Pathak: Components. Components. >>: Devices. >> Abhinav Pathak: The third example: energy scales linearly with the amount of work. On the phone running Windows Mobile 6.5, I send packets at a rate of less than 50 packets per second; choose any number, choose any bytes per packet. In the second experiment I send packets at a rate of more than 50 packets per second, again any number; you can choose 55 versus 45. This is the power consumption we observe. In the first case we see spikes of 100 to 125 milliamperes whenever we do a send, and in the second case, 300 milliamperes. So if you're sending 45 packets per second versus 55 packets per second, you're consuming three times the power. >>: The integration of the energy: is that three times, or about the same? Or smaller? >> Abhinav Pathak: That depends on a lot of factors, because the moment you're sending above 50 packets a second you're incurring tails, and you have to take those into account. >>: My point is that we have experiments showing that if you send faster, the overall energy consumed is usually less than -- >>: There's something missing here. In general it's much better from an energy perspective to send big chunks than small chunks, a few big chunks rather than many small ones, so I don't know what the bug might have been here, but something is odd with the data in this case. >> Abhinav Pathak: So what we -- >>: What you're trying to get across is that there's no linear correlation. >> Abhinav Pathak: There's no linear correlation. In terms of energy, even when we do the integration, the ratio nearly holds: the power ratio is three times, and the energy ratio is between 2.5 and 3. When you add a tail along with it, that really changes the ratio. All the smartphones we tested, one Android and two Windows Mobile, exhibited this. It looks like there's a particular number of packets configured somewhere that dictates when you jump to a power state, and that dictates how much power you are going to consume. I'll call these intelligent optimizations written into the device driver. Any questions here?
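As a concrete illustration of the kind of logic these experiments point to, here is a minimal Python sketch of a threshold-based power state rule. The 50 packets-per-second threshold and the rough currents are the ones just described; the two-second tail timeout and the exact function signature are illustrative assumptions, not the driver's actual closed-source logic.

def network_state(pkts_per_sec, secs_since_last_send, socket_open):
    """Return (state, extra_current_mA) for the Wi-Fi component."""
    if pkts_per_sec > 50:  # threshold observed in the experiments above
        return ("high", 300)
    if pkts_per_sec > 0:
        return ("low", 125)
    if socket_open and secs_since_last_send < 2.0:  # assumed tail timeout
        return ("tail", 110)  # no utilization, yet still drawing power
    return ("base", 0)

print(network_state(45, 0.0, True))   # ('low', 125)
print(network_state(55, 0.0, True))   # ('high', 300): 22% more packets, ~3x power
print(network_state(0, 1.0, True))    # ('tail', 110): idle but not yet closed
print(network_state(0, 1.0, False))   # ('base', 0): socket close kills the tail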
What have we learned so far? The state-of-the-art power modeling assumptions are wrong; I'll show you numbers, they don't work. There exists a notion of power states down in the device drivers. What we have hinted at so far: device drivers have intelligent power controls; system calls act as nice triggers of power consumption; and there definitely looks to be scope for energy optimization, something intelligent built on top of everything we have seen so far. What are the challenges in measuring fine-grained energy? The device drivers are closed source. They're just binaries. We don't have the code, and we assume we cannot get any information out of them: we cannot learn when the packet was sent out, we cannot learn when the bytes were written to the disk, and so on. So the aim of our modeling is very simple: we want to reverse engineer the power logic present in the device drivers. We do black-box reverse engineering, because they're closed source, and we use a finite state machine instead of linear regression, instead of adding everything up linearly. A finite state machine is basically simple: it consists of states and transitions. A state represents an abstract power state; it says how much power the device consumes while you are in that state. >>: How do you define a power state? >> Abhinav Pathak: It will come. Transitions can be basically anything: a timeout, a system call, or any other condition, for example crossing some packets-per-second rate. Nodes in our finite state machine represent the power states. We have two kinds: productive and tail. Productive states are where you actually do work; tail states are where you aren't doing any work but energy is still being consumed. Edges are the transition rules. And we make an assumption here: the device drivers implement very simple power logic, based on simple history, like how much was the utilization last time, do I have to open a connection, do I have to do this or that. Based on this assumption, we try to reverse engineer the power model. The methodology is very simple. We model energy consumption using system calls. System calls give you a very nice interface where you can sit and watch what all the applications are doing; they are the only way an application can access the hardware. Once you see the system calls, you can figure out how much the device is being used, which process is using it, which thread is using it. You can also see the non-utilization calls. Whatever the kernel does below, we have no idea; we treat it as a black box. We use what I call a systematic brute-force approach to the reverse engineering. In step one we model each individual system call. Once we have a finite state machine for every system call, in step two we combine the finite state machines of the different system calls that access the same component, for example the wireless card. And in step three we combine the finite state machines of the different components, CPU, disk, network and so on, to get one giant finite state machine for the entire device. Let's look at step one: modeling a single system call. Say it's a read, given a file descriptor and a read size. This is what the power profile looks like, with time on the x-axis and current on the y-axis. We do a disk read. It jumps to a high power state.
It continues to consume a lot of energy until it falls back down. We turn this into a figure like this: you are at the base state, state number one. You do a system call here and transition to another state called D1. At state D1 you consume some amount of power. After the work is completed at D1, you come down to state D2 and consume some power there for a duration of time. >>: How do you label the states? >> Abhinav Pathak: Excuse me? >>: How do you label D1 and D2? >> Abhinav Pathak: How do we label them? By the different power they consume. >>: But in a real trace you might get over 100 different spikes, right? >> Abhinav Pathak: Sorry? >>: In a real trace [inaudible] you might get over 100 different power spikes. It doesn't make any sense to label -- >> Abhinav Pathak: We're doing one system call at a time. We know exactly when the system call started in the profile and when it ended, and we analyze everything in between. Definitely there are a lot of spikes caused by other components we're not knowingly running; for example, the phone keeps on -- >>: What I'm trying to say is, if you treat the system call as a black box, there's no need to define states within that black box; just integrate the whole thing as one energy and you're done. >> Abhinav Pathak: It's not that simple, because we would induce error there. This is just one simple system call, and we already see different states inside it. So we come back and make a power model, a finite state machine. We say that in the base state you consume zero milliamperes; that's the reference. You do a file read and go into the high disk state, consuming an additional 109 milliamperes. You stay there until your file read is completed. Then you go to the disk tail, and on inactivity you go from the tail back to the base state. That is the model for a single system call. Any questions? Another example: how do we model the network? This is what [inaudible] was saying: we have multiple spikes, and they all correspond to one single system call. The send system call is simple. You're in the base state; you send fewer than 50 packets a second; you go to the low network state, with spikes of 125 milliamperes, and you come back when the send is done. We found that as you increase the amount of data, the number of packets you're sending, you stay in the low network state until you cross the 50 packets per second threshold or [inaudible], at which point you jump to a high network state of 280 milliamperes, and you see the spikes there. >>: Don't these numbers, that 280, depend on how far I am from the access point? >> Abhinav Pathak: They do. For simplicity, let's say we have good signal strength here; the paper has details of how these numbers change with the situation. There is variation: this 125 can become 350 depending on what your signal strength is. Similarly, in CPU modeling, the frequency you are running at defines the actual power. I'm skipping all those details here, but we do model them. So this is one system call: one finite state machine for the entire send. Next, we model multiple system calls to the same component. The observation is that a component can only have a small, finite number of power states. The methodology is simple: you go in program order; you can't do a read or write before you do an open.
Then you use a set of test applications, parameterized with code modules that you can easily shuffle around. The idea is to exercise multiple system calls together and see whether they generate a new state; I'll show you how. You identify and merge similar power states. So your send system call has a power finite state machine like this, your socket close system call has a finite state machine like this, and you see that these two states are the same, the base state is the same, and you just combine them. You go on combining multiple system calls going to the same component, and at the end you reach the finite state machine for an entire component. It's a very simple brute-force approach: you try to exercise different system calls in different states. That's it. Modeling multiple components: we continue this approach. The observation is that different components interact with each other. The methodology, again, is that you try to reach different combinations of states of different devices and see whether there are new states or transitions. I'll show you one simple example of such a capture for a device. This is a Tytn II phone on Windows Mobile 6.5, and this is what the networking finite state machine looks like: you're in the base state; you go to the low network state if you're sending packets; you go to the high network state with the spikes here. To annotate CPU on top of this, you run a program while you are in the network tail state: you run the CPU, measure the energy consumption, and model it as another state. Similarly, you run the CPU in the high network state, see that it consumes the same amount of energy, and realize it doesn't go anywhere new. Then suppose this is the disk part: base state, access the disk, and you have the disk tail. If you're in the disk tail and you access the CPU, you get a new state here; it consumes 130 milliamperes. Similarly, if you access the disk while the device is in the network tail state, there's a new state. It looks like it will blow up combinatorially, with a large number of states and a large number of transitions. But what we found is that most of these states are shared, and for three devices, say disk, CPU and network, we can contain it under 12 states. So the combinatorial concern, how many combinations you have to try, proved manageable for most of the components -- >>: That was an interesting diagram, but you seem to model [inaudible] with only one phase. Can you model your state machine with more than one tail? >> Abhinav Pathak: We haven't seen a device that has two tails. >>: Actually a 3G device is just like that. There's something called hide-and-seek [phonetic] that actually has two tails. >> Abhinav Pathak: Right. Yes. We can model multiple tail states, but for all the devices we tested we saw at most one tail state and at most two productive states. Nothing stops us from having a model with more than two productive states or with two tail states. >>: Okay, not the productive states; I'm asking about the tail states. >> Abhinav Pathak: There could be two. Right. We can do that. Now, one of the problems we see is that with the combinatorial approach, trying to exercise different states together, you can quickly run into a large amount of testing.
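To make the end product concrete, here is a minimal Python sketch of the prediction side: driving a combined finite state machine with a system-call event trace to estimate energy. The 109 mA disk-active figure is from the earlier example, but the tail current, the transition table, and the trace format are illustrative assumptions, not the measured model.

VOLTAGE = 3.7

# Extra current drawn in each state (mA). The disk tail value is a placeholder.
CURRENT = {"base": 0, "disk_active": 109, "disk_tail": 50}

# (state, event) -> next state; events come from the system call trace.
TRANSITIONS = {
    ("base", "read"): "disk_active",
    ("disk_active", "read_done"): "disk_tail",
    ("disk_tail", "read"): "disk_active",
    ("disk_tail", "timeout"): "base",
}

def predict_energy_mj(trace):
    """trace: list of (event, seconds spent in the resulting state)."""
    state, energy = "base", 0.0
    for event, dwell in trace:
        state = TRANSITIONS.get((state, event), state)  # unknown events: stay put
        energy += CURRENT[state] * VOLTAGE * dwell      # mA * V * s = mJ
    return energy

# A 0.5 s read followed by 1 s of idle tail before the timeout fires.
print(f"{predict_energy_mj([('read', 0.5), ('read_done', 1.0), ('timeout', 0.0)]):.0f} mJ")
# Note the tail's 185 mJ is charged with zero disk utilization, which is the
# energy a utilization-based linear model would miss entirely.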
Now, what we found is that most components have one or two productive states and no more than one tail state; even if you have two tail states, you're talking about 15 to 20 combinations for three or four components. And since power modeling is a one-time effort, something you do once and then forget about, spending some time on it is acceptable. What we are working on is automating this procedure. It is currently manual; we're trying to automate it: run applications which can decide by themselves whether something is a new state or an old state, and get the model out. Regarding the completeness of the power modeling: theoretically, a device driver can implement any number of complicated optimizations. If it rains outside, use this much power; otherwise use that much power. Nobody's stopping them, and good luck detecting it. But in practice device drivers are not that complicated. They're pretty simple, based on simple history information, and that is what we saw in practice: simple techniques were enough to reverse engineer most of the devices we looked at. As for the implementation, we implemented system call tracing on Windows Mobile 6.5 and Android. On Windows Mobile we used CeLog, extended to log additional system calls. It sits just above the kernel, where we can see all the applications, and we logged all the system calls: which application, which thread, what the parameters were, when it returned, everything. Android sits on top of the Linux kernel, so there we used SystemTap, a framework for logging any calls in the kernel. And since Android has Dalvik running on top, and Dalvik already does some optimization for the applications running on it, we had to additionally log what goes on in Dalvik. Our paper contains the details of why these steps are needed. Evaluation. We used three handsets, a Tytn II, a Touch and a Magic, two running Windows Mobile 6.5 and one running Android. I've already shown most of the finite state machine results; what I'll show here are the accuracy numbers. We did end-to-end energy estimation. We chose a number of applications, maps, Facebook, YouTube, chess, virus scan, photo uploader, for both Android and Windows Mobile. We ran the applications, some for 10 seconds, some for 20 seconds, and measured the predicted and actual energy and the estimation error. This graph plots the estimation error using our finite state machine model and using the state-of-the-art linear regression model we talked about earlier; the y-axis plots the error. What we see is that with our model we are under 4 percent end-to-end estimation error in all cases. Linear regression performs fairly well in some cases, with errors of less than a percent, but some of its errors go as high as 20 percent. >>: You're talking [inaudible] it's more or less that the integration [inaudible]. >> Abhinav Pathak: Exactly. >>: The devices. >> Abhinav Pathak: Exactly. And the point I'm trying to make is that even in common applications, it hurts when you miss these things. Now, let's see what happens when we look at the energy consumption under a microscope. So far this is only end-to-end energy measurement. So in the next set of slides, let's say an application runs for 10 seconds. All right.
I want to find out whether we predicted correctly at smaller time intervals. So I split the run into 50-millisecond bins, and for every bin I compute the estimation error, and I draw a CDF of all of this. We have the CDF for our model and the CDF for linear regression, for the different applications running on the different OSes. How do we read the graph? Take this point here: it says that 60 percent of the 50-millisecond bins had less than 10 percent error, and 40 percent showed more than 10 percent error. What would an ideal curve look like on this graph? Any curve hugging the y-axis, because that way you have as little estimation error as possible. What we see is that using the FSM, our model, the 80th percentile of errors is below 10 percent for all applications; that is, in 80 percent of the 50-millisecond periods, for all applications, we're under 10 percent error. The same number for linear regression is only 10 percent, and its errors grow; some are not at all tolerable, like the photo uploader showing 50 percent error in a lot of bins. What this indicates is that you can use our model even at small time granularities. You need not look only end-to-end; it is accurate at every intermediate point. And since 50 milliseconds is on the order of the time functions take to run, we can use this to predict per-function energy in your source code. Our paper contains further results, measurements and overhead analysis. I'll throw out one number: we're under nine percent overhead, which is similar to any linear regression scheme. We also show why you can't simply patch up linear regression and what the problems are there. I'm skipping all of that here. What are the limitations of our approach? It's worth noting that we are not solving all the problems. One limitation is that we rely on the assumption that the power optimizations in the device driver are simple. This is fairly true, because when you're writing low-level code you are advised not to write really complicated optimizations based on 100 different parameters, but nothing stops anybody from doing it. If the assumption doesn't hold, our modeling becomes much harder. The second limitation is devices with high access rates. The network is a very good device in this respect: you do a send once every 100 milliseconds; even in the worst case you don't do a million sends a second. The higher the access rate a component requires, the more system calls, and the higher your overhead will be. For us, the most significant overhead right now is logging the CPU system calls, basically which application is running on the CPU right now. If there were a device that required millions of accesses every second, we would have very high overhead. And the last limitation is that we don't have lower-level details, like exactly when the packet went out, which is when the device actually consumes the power. We don't have that information. If we did have it, we could easily incorporate it; we could easily write down transitions for that. >>: This is why you say it's true that you can basically have a power function, because many times you're just profiling, deciding if it's a function that might happen or might not happen this percent of the time. >> Abhinav Pathak: Right.
I'll talk about that three slides from now. Current and future work. We have the fine-grained modeling and measurement, and on top of it, and this work is about half done right now, an energy profiler. We call it eprof. The standard gprof tool tells you how much time each function takes; we're trying to build an eprof tool that tells you how much energy is consumed in every function. Then we want to build an energy debugger on top of that, to answer questions like: can you do better, or is this already as good as it gets? We're also planning multi-device-aware energy [inaudible] scheduling. I'll talk about all of this in the next few slides. Let's talk about the current work, the energy profiler. The problem: we want to measure the per-function, per-thread, per-process energy consumption of an application. The challenge is that you need to perform the energy accounting correctly, and as alluded to earlier, you need to account for tail energy and for device interaction; if you don't, you can go terribly wrong. I have an example on the next slide. Our approach: we have the finite state machine power models, and system calls provide a very good interface for telling which application, which process, which thread caused what. To answer which function caused it, we currently just mark the function boundaries ourselves; later we'll use compiler techniques. And we use a methodology called pinning to handle the tail side effects. What does that mean? Let's see. We have a simple program: main calls functions F1 and F2. F1 does something including a network send, and then F2 does some CPU work. The power profile looks like this, time on the x-axis and power on the y-axis. F1 starts here. The network send starts; the network send completes there; F1 ends there. Function F2 starts there and ends here. And the network tail ends here. Now, if you use strict timelines, where F1 started at that point and ended at that point and the energy consumed by F1 is only the area of the graph in between, then F2 gets only the area in this graph, and you have no answer to the question of who consumed this tail energy. F1 and F2 are over, your application is done; how do you account for this energy? Ideally, you ask: how much energy would F1 have consumed if there were no F2? That would be all the area in this part; if there were no F2 consuming additional CPU, the power profile would have looked like this. And what we say is that this should be the entire energy of F1, and only this should be the energy of F2. So the pinning methodology is simple: we pin functions based on the devices they use, and we attribute each tail to the functions pinned for that device. If a function is pinned for the network, it is charged for the network tail. If a function is pinned for the disk, it is charged for the disk tail. >>: What happens when you have two functions that both access -- >> Abhinav Pathak: That's a tricky question: what happens if you have two functions. >>: That's what I'm asking. >> Abhinav Pathak: Right. The simple approach would be to split the tail energy between them, but that may not be correct all the time, because if those functions had executed separately, you might not cross the threshold. Each might only send 40 packets, but they run together.
So you can give this as a hint to the programmer, say here is a way to -- and there's a policy question on top of it: how do you decide. >>: I think one thing -- >>: Go ahead. >>: Sort of a new part, I guess. You may have, but you haven't talked about it: you haven't taken much advantage of any history here, right? I mean, there are certain things, regardless of the applications, the functions and the different combinations we use them in, where over a period of time I would suspect you should be able to [inaudible], and that should feed back some really useful information for fine-tuning these models, right? >> Abhinav Pathak: That's a great idea. We could deploy this on users' handsets and do some learning, based on what the pattern is and how the energy is consumed, and make some analysis. >>: Because you should be able to disambiguate the question that he asked. You always have noisy data, and the way you deal with noisy data is to aggregate over multiple sources, and you kind of know -- >> Abhinav Pathak: Right. That is a way out. Right. So what we did is take an uploader application. It's simple; it has a GUI. You specify on the phone: upload all the photos in this particular directory. The upload-all-files routine reads files one at a time, does a hash calculation, and then sends each file over the network, and it does that for all the files. What I've plotted here is what percentage of the time each function consumed, and what percentage of the energy each function consumed taking the tail side effects into account. What we see is that send-file consumes 50 percent of the time but roughly 70 percent of the energy. If you don't consider -- >>: [inaudible] across the time? >> Abhinav Pathak: Over the whole run: if the application runs 10 seconds, how much of that time did the send-file function take. And we make sure that whenever send-file completes, the file has actually been transferred. It's not -- >>: It's not exact. >> Abhinav Pathak: It's not. It's end-to-end. So what we see here is that there's a lot of disparity among -- >>: I'm confused now. >> Abhinav Pathak: Okay. The blue bars show the percentage of time spent in each particular function by wall clock. >>: Compared to what? >> Abhinav Pathak: To the whole application; let's say it runs 100 seconds. >>: 50 percent of the whole application. >> Abhinav Pathak: 50 percent. >>: 50 percent of the CPU. But it's the way it's written -- >> Abhinav Pathak: [inaudible]. >>: Called from multiple different places. >> Abhinav Pathak: It's called once for each file in the directory, all from the same place. >>: Sorry. Just for a second there -- >> Abhinav Pathak: The thing I want to point out is that if you don't consider the tail side effects, the 68 percent number boils down to 50 or 55, and that is not correct, because you're not accounting for the tail. Another thing: even though the hash calculation runs for 20 percent of the time, it only consumes 10 percent of the energy. You can tell your programmer: don't bother with this one, it's already too little energy, you cannot save any more; but you can probably do something about that one. >>: But isn't this like, I'm not sure, what is the [inaudible] the learning you're providing? In a sense there are a lot of studies about how this takes so much energy, the display takes so much energy, such and such component takes so much energy.
So we could map that learning into something for the user in some sense, without actually having to do -- >> Abhinav Pathak: The first step is to give the developer insight into where the energy is being spent. The next step is to tell him whether he can optimize it. Maybe send-file consumes 70 percent of the energy, but if you can't optimize it, there's no point looking there. >>: I guess you're saying that you're doing it at the system call level. That's the difference? >> Abhinav Pathak: Right. >>: This particular system call takes this much? >> Abhinav Pathak: Exactly. And then, using that, we pin it onto the functions. >>: I'm trying to understand this, because you are looping through the files doing read-file, hash-calculation and send-file. Send-file and read-file are blocking calls, so I am not sure you can simply separate them: how much time read-file took, how much time hash took, how much time send-file took. That's the user-level view. But if you go down to the lower level, at the device, some of the send-file work might happen concurrently, because it can block and go back and do some hash calculation. So it's probably not easy to get a very clear cut of how much time read-file actually uses. Besides, new processors are starting to be multicore, and that will make the concurrency overlap even more. >> Abhinav Pathak: Right. The moment you go to multicore, things change. In this particular example, we made sure that once read-file is done, the entire data is in memory, and similarly, once send-file is done, the entire data is gone, received on the server. The buffers on the mobile are really small and fill up quickly, so we made sure of that; there is a hard bound there. But yes, in the general case [inaudible], so that presents an additional challenge of how you mark the boundaries of the functions. >>: [inaudible]. >> Abhinav Pathak: Yep, just a couple more slides. What we can do, what we're planning, is to develop a plug-in for Visual Studio that can plot a heat map: send-file consumes the most, read-file the next most, and once they're marked you can try to do something there. Future work: energy debugging. Once we have per-function energy consumption, we want to answer the question: does the application consume a reasonable amount of energy, and can we say what the optimal energy would be? Also future work, we're not looking at it yet: how do you decrease the energy consumption, with or without compromising performance, and what are the best programming practices? Should you use a hash table? Should you use an [inaudible]? We want to answer questions like these so we can give programmers tools for saving energy. On the other side, multi-device-aware [inaudible]: during the course of our measurements we saw a lot of interactions, a lot of cut-offs we can use. We are looking at three schemes: inertia, camouflaging and anticipation. Inertia is basically: if a device is doing work, let it keep doing work, because if you stop it you'll incur tail energy. Camouflaging: when multiple devices run together, the energies don't simply add up, so you can save energy. Think about a background application: some application has put the device into a network tail state, and you can run the background application right there, cutting its energy cost by 50 to 80 percent. And then you can do something like anticipation.
Stop the process: don't let it send any packets until you have a lot of packets, and then push them all out at once. Programmers can do this, and some of it is being done in Android: coalescing, caching of requests, GPS, in the Android framework. There are a lot of these optimizations. Then you can do tail avoidance and tail aggregation, very simple strategies. Looking at all of this, the main contribution here is that we developed fine-grained energy modeling: we predict energy consumption at really small intervals of time. We implemented this on two OSes and demonstrated the accuracy over the current state of the art. And that's the last slide. Thank you. >> Victor Bahl: Great. [applause] >> Abhinav Pathak: Thank you. I'll take any questions. >> Victor Bahl: Any questions? >>: I think this is a major improvement over the existing modeling. It's very good that you are predicting within 5 percent. I'm just wondering if you can expand a little on, for example, Wi-Fi: there are many different bandwidths you can use to talk to the AP. >> Abhinav Pathak: Exactly. >>: So the same send call doesn't always end up using the same amount of energy; it depends on the bandwidth being used at that time. >> Abhinav Pathak: That is correct. In fact, in some of the details I skipped here, when you are sending packets at [inaudible] different power levels, they're actually transmitted at different rates; one is sent at 6.5, one at 11. That in fact goes on per packet. >>: Right. Cool. >>: One of the problems is the hardware-dependence piece, where you have to build the model by having the actual hardware available so you can monitor the behavior, the state machines. Have you thought at all about how big a set you need in order to come up with something generic, or whether it's even possible to come up with a generic algorithm that works across multiple platforms, at least of the same type, smartphones, tablets, at least within each category? And the second piece, assuming that we can, which is my assumption: what is the minimal amount of information that hardware manufacturers need to provide in order to feed a model successfully? >> Abhinav Pathak: That's a good question. Right now we do the modeling for a set of hardware components that you have to have in hand. You could ask the component makers some questions, like: give us the power transition codes, give us -- >>: IP is going to prevent them from giving you lots of information. I'm asking what's the minimum we can get from them. What are the key features, and can they be described in a way that would not, A, expose IP, and B, be onerous for the OEMs to provide, especially since they may not want to go through all these steps themselves? You'd need to give them a simple cookbook that says: I need these five things from you so I can at least be in the ballpark. >> Abhinav Pathak: Right. That is actually a hard question, because we run into the IP restrictions: I would have asked, give me the power states, give me the transitions. That is not going to work, so we have to step back. >>: Things more on the order of: A costs twice as much as B; the break-even is X packets per second; things like that, where they can keep it at a fairly fuzzy level but still give you enough that you can at least say application A appears to be worse than application B. >> Abhinav Pathak: That could be. >>: Sorry, you had a question. I had a second one.
>>: No, I was going to -- >>: That's fine. >>: I was probably going to go off topic. What I was trying to see is: if you had a product, how long would it take to deploy this and actually get some reasonable numbers? And I was going to allude to what type of information you'd need out of the hardware. >> Abhinav Pathak: Once we have the model available, there is a generic framework: system calls, which run on all the phones. You just plug in the model and do the computation, and then you can do it across all the devices. All the devices give you the system call information, and you can update the model whenever the hardware changes. That way you don't have to do a push all the time. >>: I guess the first question is which one you're thinking of. One aspect of this is someone who generates an app and puts it into the marketplace, and the other is what we do ourselves here at Microsoft, right? Which one do you have in mind? >>: I was looking at the Microsoft one. >>: For us, where we have the source code and things like that? >>: Yeah. >>: Yeah. That should be pretty straightforward. >> Abhinav Pathak: If you have the source code of the device drivers. >>: The source code would be there. You've already profiled a bunch of these. >> Abhinav Pathak: Right. >>: Whatever's not, you can do. Then it's just a matter of -- >> Abhinav Pathak: Running the application. >>: Running the app. It should be -- >> Abhinav Pathak: If you have the source code of the application, then you can do it. >>: Is that part of the paper you linked, in terms of how to actually get started? >> Abhinav Pathak: The paper has only very brief information, towards the end, on how to get this started. >>: It would be interesting; we can talk about this and get it going. >>: So the next question is: if you had the ability to instrument the operating system, what is the minimum set of instrumentation points you would want put in, so that you could run this 24 by 7, not in sampling mode, but at a cost where you could always be monitoring resource utilization? I don't think it's necessarily every API. I think you'd want to attribute disk IO, network packets, a couple of other things, GPS. What's the minimum set of instrumentation points, so we could think about how we would potentially put those into an operating system running on a mobile device? >> Abhinav Pathak: It depends on the device. For example, for the disk you only need to log five system calls. >>: Well, the file cache is one problem. The file cache isn't a disk operation, so you've got to go deeper. >> Abhinav Pathak: A little deeper: the buffer cache, what is going on there, does it go further to the disk. Right. When those things come in, we're talking about the whole kernel complexity. >>: I understand. I want to know the minimum number of things you need. I have my own set; I was curious whether yours is different. We can take it offline. >> Abhinav Pathak: For the devices I've seen, the ones we tested, leaving the kernel complexity aside, five to six system calls would be sufficient for most of them. For GPS there are only two system calls I need to track. For the disk I need to track six or seven, I think.
For the network I need to track -- >>: But there's a lot more than that. There are all the sensors that you're not tracking now; you don't have good models for them yet. >> Abhinav Pathak: For sensors, take GPS, for example. >>: Accelerometer. >> Abhinav Pathak: Accelerometer: you have one start and one stop, and in between it consumes nearly constant power. That would be really easy. Camera, for example: start the camera hardware, close the camera hardware. Really easy. Maybe when the picture is taken, that could be another trigger. But if you're talking about something like memory, we have a lot of problems there, because capturing memory accesses at fine granularity would blow up the overhead. >>: Fortunately memory can roughly be folded in with CPU and GPU; memory state is going to scale linearly with that, so you can attribute it on that side. >> Abhinav Pathak: We did something like that. >>: The cache manager. It's actually the opposite: the device drivers on these mobile things are a lot of the time written by other parties, so we don't necessarily have access to them, whereas the kernel calls we do. So we actually take the opposite approach of instrumenting the kernel, knowing the call is going to pass through the device driver, and also some frameworks that are being pulled in, et cetera. So turn it on its head a bit: see what's inside the kernel rather than the device drivers. >> Abhinav Pathak: We did it exactly that way. I'd say per device I haven't seen anything above about six or seven system calls that are effective. For example, an lseek file system call doesn't do anything; it just goes into memory. It need not go all the way to the disk. Those things we can [inaudible]; we have a very, very small set needed to do this work. >>: Actually, some can be very expensive. There's structure, so you can have one open cause five IOs. >> Abhinav Pathak: True. >>: Then it does one write and closes, and it turns out the open cost is part of that. So I would disagree with the statement that you can drop the opens and closes that are part of the metadata operations. Metadata takes up one-third of the disk IO that we see in Windows operation: $LogFile, $Bitmap, $Mft, and the rest. It's one-third of the IO, and that's just pure metadata. There are a lot of cases where you attribute things to the mainstream: oh, he wrote ten bytes and he wrote ten bytes. But you also have to update the last access time, traverse address structures, change things, update things, and all of a sudden it blows up. >> Abhinav Pathak: Right; right now we're a level above that. >>: That's the problem with the API level. I think you have to do it deeper, in the guts and bowels of the kernel, where it actually knows: oh, 16 IOs were generated as a result of that call. I know the thread ID, so it might be attributed back in a lot of cases; in some cases we can't. But we work on that. >> Abhinav Pathak: That is very interesting. >>: This is a great concept. I would say your universe was restricted in a way, since you couldn't get into the OS, but the methodology is pretty applicable. >> Abhinav Pathak: Pretty applicable. We cannot get into Windows, and even on Android we cannot get into the device drivers, but you can still use these -- >>: On the Windows source code: we give out source code licenses to schools. We wouldn't want you to be digging through the cache manager on your own. >> Victor Bahl: All right. Thank you. Good job. I've got to run. But thank you very much. >> Abhinav Pathak: Thanks.