>> Navendu Jain: Good morning, everyone. Thank you for coming. It's my
pleasure to introduce Tajana Rosing from University of California San Diego.
She did her PhD from Stanford and then went on to HP Labs and then finally
came on to San Diego. So she really likes California.
So she's going to be telling us about how to build energy efficient computing
systems, and she and Roger Shulz will be visiting us on Monday to discuss their leading
research projects in this apparently hot area.
>> Tajana Simunic Rosing: Thanks.
>> Navendu Jain: So Tajana.
>> Tajana Simunic Rosing: Thank you all for coming. And yes, I do like
California. Sunshine is great. But I also love hiking. And I think you guys have
some of the best hiking around Seattle so I come up here very often to satisfy my
craving for woods.
As you can see, I work on energy efficient computing. I'll be spending quite a bit
of time today talking to you about power and thermal management as it relates to
both individual servers and multiple sets of servers. Some fraction of the time I'll
be talking about what happens inside of a processor and how that may change
the decisions we make that go beyond.
And then towards the end, I'll be talking about how some of the same energy
management decisions can be applied to improve energy efficiency in sensor
networks. It turns out that the ideas are actually quite similar. The numbers are very
different. You know, in data centers you talk about megawatts and millions of
dollars and sensor networks you're talking about milliwatts or maybe even
microwatts and probably not millions of dollars in terms of energy cost.
But the bottom line is still the same. You need to deliver some results within a
reasonable amount of time and as low energy consumption as possible.
So the future as we see it basically is a world that has whole bunch of different
types of sensors around us and on us as well. And towards the end of my talk,
I'll illustrate to you a couple of projects we run at UCSD to look at how our daily
decisions will affect our long-term health care. So this is work that we do with
UCSD School of Medicine that's focused primarily on preventative medicine. So
not necessarily trying to deal with somebody who is sick.
The interesting thing you find when you're looking at preventative medicine is
that you need to be able to track effects on health 24-7, 365 days a year.
When you try to do this, you have huge amounts of data that get generated, and
you have a gigantic problem with battery lifetime. You know, if you're trying to
detect only a small event using sensor nodes in a large scale
environmental installation, you can do duty cycling, so you can sample
periodically with a relatively low sampling rate.
However, if you're trying to figure out whether a person is running versus sitting
versus sleeping versus, you know, doing something else, you'll need to be
continually monitoring. So energy is actually a very big challenge here.
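To make the battery point concrete, here is a rough back-of-the-envelope sketch of duty cycling; the capacity and current numbers below are illustrative assumptions of mine, not figures from the talk.

    # Rough battery-lifetime estimate for a duty-cycled sensor node.
    # All constants are illustrative assumptions, not measurements.
    BATTERY_MAH = 2000.0       # battery capacity in milliamp-hours
    I_ACTIVE_MA = 20.0         # current draw while sensing and transmitting
    I_SLEEP_MA = 0.02          # deep-sleep current

    def lifetime_days(duty_cycle):
        # Average current scales with the fraction of time the node is awake.
        i_avg = duty_cycle * I_ACTIVE_MA + (1.0 - duty_cycle) * I_SLEEP_MA
        return BATTERY_MAH / i_avg / 24.0

    print(lifetime_days(0.01))   # ~1% duty cycle (periodic environmental sampling): ~380 days
    print(lifetime_days(1.00))   # continuous monitoring (activity tracking): ~4 days

The point is simply that continuous monitoring removes the duty-cycling lever, which is why energy becomes the dominant constraint for body area sensing.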
Secondly, there is a middle layer here that involves mobile access. So sensors
are fine and all except, you know, that the data needs to somehow get to us in
a form that makes sense to us, and today most of us will use our cell phones to
interact with the data.
And in fact, the project, again with the healthcare center, really showed the
power of providing feedback to people through a cell phone. The cell phones
also have battery limitations. A lot of us have more sophisticated smartphones
where we, you know, take pictures, we get e-mail, we may even watch some
media. I got little kids who love playing games on them. You know, so battery
lifetime is really key.
Also the connectivity between sensors and cell phones and cell phones and the
back end is something that is variable. Sometimes we have good 3G coverage
and other times when, you know, I was hiking on Olympic peninsula, there was
no 3G coverage. However, if I'm a heart patient, I need to be able to get
connectivity to the sensors on my body, and perhaps also the sensors in the
environment, through that same cell phone. So there needs to be a way for us to
adapt to what is going on in the environment and what is going on with a
particular application and the user's needs.
And in the center of this picture is the data center infrastructure core. So if we have
all these data from the sensors, from the cell phones, all the data eventually will
need to be stored somewhere, will need to be analyzed, and will need to be
leveraged over time. So going back again to the healthcare example, we actually
don't know what really causes asthma in people. We do know that there's a
factor that's genetic. We do know that there is another factor that's
environmental. For example, there was a study that showed that if you live within a
couple of miles of a freeway -- the study was done in LA County, as you might imagine --
you have a 50 percent higher chance of developing asthma. That's a huge
number. The issue is that we don't know whether this comes from the freeway
itself, or because we happen to be testing a certain fraction of the population that
simply is more susceptible to asthma, or because maybe there are some other
environmental causes that would also affect asthma. It's not obvious. The only
way we can get the answers to these questions is by having large sets of data
over a long period of time that can then be analyzed retrospectively, right?
And you get that by having this large infrastructure core. So at UCSD, we're
really lucky in that we have full fledged deployments of both large scale
environmental sensor nodes, work again with UCSD School of Medicine looking
at body area networks. We have tight collaboration with people in industry, you
know QUALCOMM, for example, that's focussed on developing cell phones and
you know, cell phone infrastructure. And we have a supercomputing center that
deals with humongous data sets.
And so we can really find out what is involved in this picture, from the outside
looking in, and how one achieves energy efficient computing at this big
scale.
So with that in mind, I'll zoom in on my small slice of this big huge picture, and
specifically I'll talk about power and thermal management starting from the data
center infrastructure at the center and work my way out to the sensor networks
and mobile devices towards the end of my talk.
So for data center infrastructure, there are two variables that I've really been
concerned about that go beyond just simple scheduling for performance. And
those are power and temperature. Power in terms of how much we are spending to
actually run systems: processors, memory, servers, racks, and so on.
Temperature in terms of what fraction of this power is actually going to cooling
and how reliability will be affected by the scheduling and power management
decisions that are being made.
So with that in mind, I'll be talking first a little bit about some issues that make our
servers not so energy proportional, and specifically I'll be focussing on interaction
between the processor and memory infrastructure. How could we make memory
a little bit better architected so it doesn't consume as much energy as it does
today?
CPUs it turns out are actually fairly good already. You can scale energy fairly
well. Then I'll talk about, you know, if we have a particular system, what is the
right way to monitor the workloads and based on the characteristics of workloads
to do power management? So this would be both controlling sleep states and
voltage scaling. And how those decisions will make a system operate more
effectively.
We'll be talking about machine learning algorithms that we've developed that are
really good at doing online adaptation. And then how power management
decisions affect thermal management? It turns out that the two aren't always
compatible with each other.
So if you think about large scale servers, the absolutely best way for you to save
power is to just not run a server, right? As soon as you turn it on, you are
already consuming a huge fraction of the overall power budget. However -- and
this same idea applies on the chip, right; on the processor die you may have,
you know, eight processors, four processors, whatever it is. You know, the best
way you're going to save power is by shutting down the processor.
Your next best way is to use voltage scaling potentially or to be careful what
workloads you place. However, shutting down this processor and having
something else run very fast right next to it creates big temperature differentials,
which can lead to reliability problems. So these two decisions aren't always
compatible with each other, and I'll be talking about how do we actually balance
the two to achieve both low energy computation and also to get good reliability at
the large scale.
And lastly, I'll talk about how we scaled some of these results to a virtualized
system that runs at the container data center level that we have at UCSD, and some
of the preliminary results that we got from that.
And then as I said, the second topic will be looking at wireless sensing and
energy management or better said, what do we do at the outer parts of my
picture on the previous slide? So what happens with the sensors and the mobile
devices.
So the high level view of cost efficient energy management in data centers is that
you have some sets of jobs coming in. They need to be scheduled. They need
to be scheduled in a way that's going to meet performance requirements so you
got SLAs, you got response time requirements, maybe you have some
throughput requirements that are necessary to meet the SLAs. That scheduling
will make decisions on what racks and what servers jobs will run at. At the
individual server level you have either an operating system or a virtual machine
that makes a decision on which exact processor something will be running at.
Those decisions will strongly affect energy efficiency of the overall system. Right
now those decisions have been largely disconnected from the decision on what
level of cooling you are using. The level of cooling is controlled by an independent
set of thermal sensors.
Granted, the temperature that's sensed by those thermal sensors is affected by
scheduling decisions. However, we found if we actually consider both at the
same time, we can get much more effective results. So at the high level view
what we're really interested in is monitoring temperature, power, and
performance at all of these scales, right, using those variables to then control
cooling, power states that we can control and task scheduling. Because we
believe it's only if we start scheduling using the multi-scale properties of this system well
that we'll get fairly big energy savings.
The goal here is to get energy efficient computation. We have found that if we
utilize some relatively simple prediction models we can actually do even better
than the current reactive policies. So what I'll show you is some of the results of
predicting what temperature is likely to do in the near future, where near future is
defined as a certain number of scheduling decisions going forward, and what the
incoming workload is likely to look like. Again in the near future, where near future is
basically a function of the decision I have to make right now. Okay?
>>: Are you only looking at controlling the racks, or are you talking about
controlling the [inaudible] chillers and all that?
>> Tajana Simunic Rosing: So the project that we actually have funded at
UCSD involves a whole data center container, and they have control of the water
that's coming in, the rate and the temperature, and then inside of the container
we have tapped into all of the controls for the fans that are in there and we
have the ability to schedule jobs.
Now, we're not quite -- so you won't see results of doing it at the whole container
level, because we're not quite there yet. We got this container just this last year.
So we're --
>>: [inaudible].
>> Tajana Simunic Rosing: This is a Sun container. It's actually a big NSF
project, NSF MRI grant that got us the container. So it was just within this last
year that the container has been fully populated. We're at a point where we're
running right now at five servers, and we're going to be deploying what we have
tested already on a small number of servers to the racks in the container. So it's
going to be running at seven racks within about a month. But to actually get to
that takes a lot of work. So you won't see the results. But that is actually the end
goal.
And the project is -- involves multiple people. It's not just me that's doing it. I'm
leading the energy efficiency effort here. So if we zoom in on to a single server, it
really depends whose server you buy and what that server is designed to do. So
the first server that we got was a really small two socket quad core Xeon
machine with only eight gig of memory.
So if you have a machine that has a relatively small amount of memory and doesn't
have the most powerful processor in the world -- I mean, it's still reasonably
powerful -- the kind of power breakdown that you see when it's running actively is,
you know, maybe a bit less than half goes to the CPU. You have some chunk
that goes to memory that's relatively small. And then you have almost, you
know, 16 percent that goes to fans and very little that goes to the hard disk.
Now, if I start looking at the opposite extreme -- and I don't have this picture
here, unfortunately -- on the Google servers the CPU has now dropped
down to about 30 percent. Memory is also 30 percent, because in Google
servers it's about 64 gig of memory that you have. Right? And then the other stuff is
basically the rest of the pie.
And if you talk -- I've talked to people at the supercomputing center and they actually
have some of the servers that run at 512 gig because of the types of jobs that
they need to run. 512 gig of memory gets very, very hot. And there, you know, a
huge fraction of the pie goes to memory and to fans in order to cool this down.
So the picture -- my point here is that things will change drastically in terms of
what slice of the pie it is. But the top components that are consumers are usually the
CPU, memory, fans and power supply efficiency.
Power supply efficiency I will not be talking about at all. I think there are
people who have done a great job of designing better power supplies and it's just
a question of are you going to spend the money to put it in? What I will be -- and
I won't spend much time talking about CPUs. You know, Intel seems to be doing
an excellent job designing energy scalable CPUs.
I will talk a little bit about how I use the properties of those CPUs to achieve
energy efficient computation. What I will discuss next is what we have started
looking at in terms of novel memory organizations to try to get the same or near the
same level of performance at a fraction of the power cost. And also I'll be talking
a little bit later about what happens with temperature and therefore with cooling
system control. Yeah?
>>: [inaudible] distribution of this power, did you look how to [inaudible] in full
capacity at sometime or [inaudible].
>> Tajana Simunic Rosing: That's a very good question. I will show you a slide.
And it changes.
>>: Database, data --
>> Tajana Simunic Rosing: Yes. So this server is not --
>>: This guy always is --
>> Tajana Simunic Rosing: I totally agree. And it did not -- I do not show that
here.
>>: Okay.
>> Tajana Simunic Rosing: This is not the database storage server at all.
>>: [inaudible].
>> Tajana Simunic Rosing: So this is -- this here is a computing server. So the
picture -- really my point with these pictures is that the slices of the pie change
drastically. Right? So for the hard disk, you're right. If I was focusing on storage
servers I would actually need to have a last bullet that says how do we deal with
the hard disk.
>>: [inaudible] is the storage server.
>> Tajana Simunic Rosing: Yes.
>>: Typically storage is most expensive.
>> Tajana Simunic Rosing: Uh-huh. Uh-huh. So one of the things that you
might be interested in that we are doing at -- for another big grant through SDSC
that I'm a part of is looking at large scale supercomputing storage that includes
flash as a front end. Because that turns out to be one of the good ways to get
the efficiency that you need and the response times that actually those types of
applications really need. Yeah. Okay.
So what we've looked at in terms of memory design is the idea of using both
non-volatile and volatile memory combination. The problem with DRAM is yes, it
responds very quickly, it has pretty high density so you can pack a lot of stuff in
it, but it also costs a lot of power. So if you look at a typical server, you turn on
the server, that memory is going to be on for the rest of the time that the server is
on, and that really doesn't make the server very scalable in terms of energy.
The problem with using non-volatile memory like PCM is that, yes, you know, you
can turn power off to it and the data is still retained, so you can achieve much
more energy scalable computation as a result. The problem is you can write to it
only so many times before it fails.
And you know, the other thing to note is that it's not exactly mainstream yet. It's
getting there. So the challenges are to be able to hide drawbacks of PCM, main
drawback being the write endurance which is a very similar issue that you have
with flash. So the design goals are for us to uniformly distribute write access
across PCM so do wear leveling much in the same way as you do it with flash,
except now it's at the main memory level, not as a cache to the disk. And
to direct any write intensive applications to DRAM instead. So we achieve this by
doing a combined hardware/software design.
So this requires some changes to the operating system and it requires changes
to the hardware. On the bright side, the overhead actually of the changes to both
is very, very small. And the benefits as you see may justify the approach.
So in -- at a high level what we have is a new memory controller that then
monitors the writes and monitors the wear leveling properties of PCM and
based on that will direct workloads appropriately.
So here is how it works. It looks at -- so I will have two types of memory.
I'll have DRAM and PRAM. The data and commands come through the memory
controller. The memory controller actually has a small access map cache. So for
different pages -- and this is done at page level because a page is the unit that you
allocate from the operating system -- you actually count how many writes have
been done. Based on that write count, we can figure out whether that page
should be retired and a new page brought in or not. Because that then allows us
to do write wear leveling.
And obviously when a page gets to some threshold of writes, we have to be able
to swap the page. So that is where the interrupt for page swap occurs. And as
you reach the limit of PCM endurance, we may have a bad page interrupt so at
that point that interrupt is issued and a new page again is allocated.
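As a rough illustration of that controller-side bookkeeping, here is a minimal Python sketch of an access map counting writes per page and raising the page swap and bad page interrupts. The class, the os_handler interface, and the constants are simplifications and stand-ins of mine, not the actual hardware design.

    # Simplified model of the memory controller's per-page write bookkeeping.
    SWAP_THRESHOLD = 2000        # writes before a page is retired for wear leveling (illustrative)
    PCM_ENDURANCE = 10**9        # total writes a PCM page tolerates before failing (illustrative)

    class AccessMapCache:
        def __init__(self):
            self.recent_writes = {}    # page -> writes since the last swap
            self.total_writes = {}     # page -> lifetime writes

        def on_write(self, page, os_handler):
            self.recent_writes[page] = self.recent_writes.get(page, 0) + 1
            self.total_writes[page] = self.total_writes.get(page, 0) + 1
            if self.total_writes[page] >= PCM_ENDURANCE:
                os_handler.bad_page_interrupt(page)    # page must be marked invalid
            elif self.recent_writes[page] >= SWAP_THRESHOLD:
                self.recent_writes[page] = 0
                os_handler.page_swap_interrupt(page)   # retire the page, bring a new one in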
So the objective is to uniformly distribute the writes across PCM. We do this by
keeping, in the memory allocator, three different lists. One is a free list on
the left and then a threshold free list on the right. The free list is basically a
set of pages that have not yet been allocated by any application.
As the applications come in, they allocate the pages. Once those pages have
been allocated some threshold number of times, they get put into a threshold free
list. Eventually you get to the point where free list is empty and threshold free list
is full. At that point, you swap the two.
So the threshold free list becomes empty, the free list becomes full, and you start
allocation again. You do this enough times, you'll get to the point where some of
the pages will go beyond the write endurance per page. And at that point, the
page will go into the bad list. So that is where this bad page list comes.
Page swapper will handle page swap and bad page interrupts. Bad page
interrupt obviously will mark pages invalid and hopefully avoid failure.
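That list management can be sketched in a few lines of Python. This is my own simplification of the scheme being described, with an illustrative threshold, not the actual allocator code.

    # Wear-leveling page allocator with a free list, a threshold free list, and a bad list.
    class WearLevelingAllocator:
        def __init__(self, pcm_pages, alloc_threshold=2000):    # threshold is illustrative
            self.free = list(pcm_pages)       # pages not yet allocated "too many" times
            self.threshold_free = []          # pages that hit the allocation threshold
            self.bad = []                     # pages past PCM write endurance
            self.alloc_count = {p: 0 for p in pcm_pages}
            self.alloc_threshold = alloc_threshold

        def allocate(self):
            if not self.free:                 # free list empty, threshold list full: swap them
                self.free, self.threshold_free = self.threshold_free, self.free
                for p in self.free:
                    self.alloc_count[p] = 0   # simplification: restart the rotation
            page = self.free.pop()
            self.alloc_count[page] += 1
            return page

        def release(self, page):
            if page in self.bad:
                return                        # never hand a failed page out again
            if self.alloc_count[page] >= self.alloc_threshold:
                self.threshold_free.append(page)
            else:
                self.free.append(page)

        def bad_page_interrupt(self, page):
            for lst in (self.free, self.threshold_free):
                if page in lst:
                    lst.remove(page)
            self.bad.append(page)

Rotating through the free and threshold free lists is what spreads allocations, and therefore writes, across all of the PCM pages.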
>>: [inaudible] what's the bad page list --
>> Tajana Simunic Rosing: So the swap threshold actually is a variable
that we've played with some to see for a set of applications within benchmarks
that we had. And it turns out that it's actually not that sensitive. So we had swap
threshold in the paper that we published this year -- this last year. It's a new year
already. It said 2,000, but it turned out if you put it down to 10 you get almost the
same result as if you put it up to about 5,000 writes. So interestingly enough, it's
really not that sensitive.
There are two policies that we looked at. So one policy is what we call uniform
that basically says when you reallocate the page, you'll always want to put the
page into PCM. The idea behind that is PCM is lower power, so therefore you'll
be more energy efficient. However, clearly this could be a bad policy because
PCM also has very slow response time to a write, much slower than DRAM, and
you'll get to using your -- getting your page to become a bad page sooner.
The next policy that we have is a hybrid policy that actually after using -- so after
passing the threshold will allocate the page instead in DRAM, assuming that
because this page has been used so much it's likely being used by an application
that has lots of writes and therefore should be going into DRAM.
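The difference between the two policies comes down to one placement decision when a page is reallocated. Here is a sketch of that decision as I read it, with a hypothetical parameter name and an illustrative threshold, not the exact implementation.

    # Where to place a page on reallocation under the two policies.
    def choose_memory(policy, writes_last_allocation, swap_threshold=2000):
        if policy == "uniform":
            return "PCM"                  # always prefer the lower-power memory
        if policy == "hybrid":
            # a page that keeps hitting the swap threshold is presumed write-intensive
            return "DRAM" if writes_last_allocation >= swap_threshold else "PCM"
        raise ValueError(policy)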
So we used a baseline system that had four gigs of DDR3 SDRAM, and this actually
was measured on the server that we had. All the results that you'll see with PCM
are simulated because we don't have that much PCM in the house. So the
hybrid system had one gig of SDRAM plus three gigs of PRAM. So it was about
one to three ratio. And the uniform system was all PRAM.
The workloads that we simulated had a set of different types of applications. The
idea was that we wanted to see both the percentage of reads, percentage of
writes and the total number of pages vary significantly. So we selected
applications that would represent different classes of workloads by these
particular values.
So you can see that the applu, for example, has lots and lots of pages and fairly
high write percentage and then on the other hand sixtrack, because it's so CPU
intensive really doesn't do much. And face recognition is somewhere right in the
middle of the two.
So the average overhead in terms of performance for the hybrid system was
less than six percent. This overhead is largely due to the fact that writes take
longer in PCM. So if we actually had a better caching methodology we could
probably hide this overhead even better.
The overhead of the uniform system was much bigger. It was about 30 percent.
So you really wouldn't want to build a server with just PCM. And that should be
kind of obvious anyway. But there is real benefit to -- and I'll skip through the
details here, to actually combining PCM and DRAM. So if we combine the two,
you get energy savings of 30 percent without considering power management.
So we did not actually turn off power to PCM and use the fact that it retains data.
So I think the overall energy savings can be much bigger than this 30 percent.
This is in the worst case effectively.
You have a question?
>>: [inaudible] just the total time [inaudible].
>> Tajana Simunic Rosing: So it includes all of the overheads.
>>: What do you mean by the overheads?
>> Tajana Simunic Rosing: Of the access map analysis, trying to figure out
what the number of writes to that particular page is, and of the
page swaps. So the overall performance overhead basically is included.
What your application would experience if it used this new system.
>>: What's the estimated [inaudible] including DRAM in that scenario?
>> Tajana Simunic Rosing: The number we had used was 10 to the 9th, right? It
was a number we got out of the literature. As the number goes up, our results get better.
>>: [inaudible].
>> Tajana Simunic Rosing: Oh, so with a particular -- and there is actually
detailed discussion on this in our paper. This was analysis that we had to do,
because otherwise, if our server dies within five minutes of using it, it's useless.
So it turns out that if you have a dumb policy with 10 to the 9th writes, you can
end up with a failure within about an hour. But if you have a smart policy, like
what we've designed, you end up with a failure of about 10 years. So you have
to be very careful what you do with your software and hardware.
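The shape of that calculation is simple, even though the exact numbers depend on the workload; everything below is an assumed, illustrative figure of mine rather than a measurement from the paper.

    # Back-of-the-envelope PCM lifetime under a given write endurance.
    ENDURANCE = 10**9                  # writes per PCM page (the figure from the literature)
    HOT_PAGE_WRITES_PER_SEC = 3e5      # assumed write rate landing on one page under a dumb policy
    N_PAGES = (4 * 2**30) // 4096      # 4 GB of PCM in 4 KB pages

    # Dumb policy: all of that traffic keeps hitting the same physical page.
    hours_to_fail = ENDURANCE / HOT_PAGE_WRITES_PER_SEC / 3600     # roughly an hour

    # Perfect wear leveling: the same traffic spread over every page.
    best_case_years = hours_to_fail * N_PAGES / (24 * 365)         # ~100 years, a best-case bound

Real policies land well below that best case, since total write traffic is much larger than a single hot page and leveling is never perfect, which is why the smart policy ends up in the range of years rather than hours.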
>>: That's quite a difference. Do you --
>> Tajana Simunic Rosing: Yeah. Well, the difference comes from the dumb
policy, which basically will allocate a write intensive page always at the same
location, and therefore it dies quickly.
>>: [inaudible].
>> Tajana Simunic Rosing: Yeah.
>>: [inaudible].
>> Tajana Simunic Rosing: Yeah.
>>: What's -- did you look at the concept of potentially persisting the state in the
PRAM between OS restarts?
>> Tajana Simunic Rosing: See that's what I mentioned here at the end. We
have not accounted for that in this energy savings. What my student is working
on right now is estimating and evaluating how well we would do if we actually
leveraged this idea. Because I think the savings will be much bigger, and it will
enable you to wake up extremely quickly. You have instant on effectively.
And you know, the fact that we used relatively little DRAM and PRAM, you know,
is another issue. You know, normally you would want about a one to 10 ratio
between the two to get even bigger benefits.
But you know, to actually simulate one gig to 10 gigs of PCM is very difficult to do
with simulators. We had a hard enough time allocating enough memory to fill up
four gig in a simulator. So that's one of the limitations of doing it simulation-wise.
But jointly with Steve Swanson we're building an evaluation system that will have a
memory controller and an FPGA, and we'll actually be able to see in somewhat
real life what happens.
Any other questions on this?
So the nice thing about using PCM in the way that we've talked about is not only
that you can now retain the state but we can also think about having a completely
non uniform memory architecture, so we can have chunks of PCM associated
with chunks of processors and therefore you can start actually creating a truly
energy proportional computer, right? Because I can assign workloads in such a
way that only a fraction of the memory is being used, instead of all of the
memories on all the time. Even though maybe to the application it looks as if all
of the memory is there, simply because the data is still retained; it just isn't on
because you're not currently running it. Okay?
So that's actually the motivation behind going with this work. Okay.
So going forward, as I said, we have this great NSF funded Project Greenlight.
The idea is to first evaluate the energy efficiency of a larger scale installation,
and we're interested in seeing literally what happens at a whole box level. So
how efficient is it if I do X versus Y in terms of what is the cost of power to the box
and what is the cost to cool the box. And those two variables on this box are very
easy to measure. Because I've got one plug for water and one plug for power. So
it's pretty much instantaneous feedback.
And in order to achieve more energy efficient computation, we'll look at two
classes of algorithms, power management and thermal management. Power
management because obviously our goal is to lower the energy consumption.
Thermal management, because lowering the energy consumption of just the
powering aspect of it would not necessarily be the best thing for the cooling
system. So we need to be able to balance the two. And I'll show you some
instances of why this would be the case on the chip level. And this actually
applies also to the server model just as well.
So first thing that we did is we started looking at workloads. And this was
actually done ages ago on a single machine. What we looked at is different
devices within the machine, so you see on the top a hard disk trace, on the
bottom you see a wireless network interface trace. The reason why I included these
two is because they're about as different from each other as it gets, right? On
wireless network interface you have relatively fast interarrival times and you're
expecting actually to have very low power consumption.
On the hard disk trace you expect much slower interarrival times and much
slower processing times and you also have much higher overhead for changing
power states. The interesting thing that we found on these traces, what you see
on the X axis is the interarrival time measured in seconds. What you see on the
Y axis is the tail distribution. So effectively it's one minus the cumulative
probability distribution of getting that interarrival time.
The reason why I plot the tail of the distribution is because it highlights what
happens with longer interarrival times. Why do I want longer interarrival times?
Any ideas? Why do I care?
>>: [inaudible].
>> Tajana Simunic Rosing: Sleep, right? Or slow down if it's a CPU. When do I
care about short interarrival times?
>>: [inaudible].
>> Tajana Simunic Rosing: Performance. Right? So if you think about
performance, people actually use -- usually will use queuing models to evaluate
performance of a distributed system. That queuing model does an excellent job
at short interarrival times, i.e., when you're looking for high performance. You
can see even here that both models, the exponential distribution, which is the
backbone of queuing theory, and a heavy tailed distribution such as Pareto, match
the experimental data reasonably well.
The problem is when you go to the right of this picture and you start looking at
long interarrival times, the match is very poor. You end up in a situation in which
the exponential distribution makes decisions that are way too optimistic. In fact, on
the machine where we implemented a policy that was based purely on the
exponential model, we ended up consuming more power than just leaving the device
on all the time. So this really illustrates that you really need to be able to model
the workload accurately if you are going to do power management well.
And in case of these particular devices you see that the same conclusion applies
regardless of whether you're looking at a hard disk, you're looking at wireless
network interface card, you can look at large scale networking data traces. You'll
see exactly the same conclusion. It's all heavy tailed.
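For reference, these tail plots are straightforward to reproduce: the empirical tail is one minus the empirical CDF of the interarrival times, compared against fitted exponential and Pareto tails. The sketch below uses plain maximum-likelihood fits, which is not necessarily the exact fitting procedure used for the measurements shown.

    import numpy as np

    def tail_curves(interarrivals):
        # Empirical tail P(T > t) versus fitted exponential and Pareto tails.
        t = np.sort(np.asarray(interarrivals, dtype=float))
        n = len(t)
        empirical = 1.0 - np.arange(1, n + 1) / n               # one minus the empirical CDF

        lam = 1.0 / t.mean()                                    # exponential rate, MLE
        exp_tail = np.exp(-lam * t)                             # P(T > t) = exp(-lam * t)

        positive = t[t > 0]
        t_min = positive.min()                                  # Pareto scale parameter
        alpha = len(positive) / np.log(positive / t_min).sum()  # Pareto shape, MLE
        pareto_tail = (t_min / np.maximum(t, t_min)) ** alpha   # P(T > t) = (t_min / t)^alpha

        return t, empirical, exp_tail, pareto_tail

Plotting these on a log scale makes the mismatch at long interarrival times obvious: the exponential tail drops off far faster than the measured one, which is exactly the optimism that leads to bad shutdown decisions.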
Why do people use queuing theory? Well, it's easy. And it works reasonably
well at those short interarrival times. Why we can't use simple queuing theory for
power management is because it leads to bad decisions at those longer
interarrival times. So here is what we did.
We actually took a simple queuing model and we expanded it to ensure that we
can model the points in time when we need to make a power management
decision correctly. So when we are in the idle state, we can either wait and get
jobs that we then process. Notice in these states we're at the fast end of the
curve where Markov model will hold.
Once we get back into the idle state, we have a decision to make. That decision
is do we do something, do we slow down, do we shut down or not? If we decide
to shut down or go into sleep state it takes us some amount of time to get there.
So this cannot be modelled again with the Markov chain; it has to actually be
modelled with a uniform distribution or something similar to a uniform distribution.
Essentially non-exponential, because it takes you a finite amount of time to get
there. Right?
Similarly, as you saw in the previous slide, the request that comes after longer
idle period you really want to model with non exponential distribution. You need
heavy tailed model. So you effectively have two distributions that actually govern
the decision. And this is exactly where we had to make a little more
sophisticated model. So what we did is we went with time indexed, that's what TI
stands for, semi Markov decision process.
It's basically a glorified Markov chain. A little more complicated. The
assumptions are that a general distribution governs the first request arrival. So this
is the arrival that breaks the long idle period. Pretty much everything else can be
modelled with an exponential distribution.
And user device and queue models are stationary. That last assumption is
actually the lousiest one of them all. What does stationary mean? It means that
on average the statistical properties of the distributions we design for do not
change. And that we know to absolutely not be true for a typical set of servers.
However, if we can pretend that it's true, you actually get a globally optimal policy
with simple linear programming. In the results that we implemented, we're
within 11 percent of an ideal policy, where this ideal policy knows the future.
So you get a really good result. So the question is what do we do about this
stationary assumption? How do we handle that? Yeah?
>>: [inaudible] exponential or [inaudible].
>> Tajana Simunic Rosing: I didn't hear you.
>>: The service time, how did you model the service time?
>> Tajana Simunic Rosing: That's a very good question. So the service time
that happens down here was modelled exponentially. So we assumed once the
request comes that that's a beginning of a set of requests that you have to
process that happen relatively quickly after each other. And at that point, we're
back into this red part of the picture. Yeah. And that turned out to really not
affect things much. If you have different results it would be interesting to see.
But it really didn't affect things much. It was really this decision of do I go to
sleep or not that was a critical decision. Yeah.
So what we did to deal with the stationarity is we said instead of designing one
single policy that does a decent job for all of the workloads but, you know, is
not optimal for anything, what we want to do is have a library of maybe a small
number of policies, each of which is really good at delivering savings for a class of
workloads that we can select among. So that is how we used online learning
here.
And on the bright side, we were able to measure pretty good energy savings. We had
experts -- these are effectively policies -- that could control power states and
the speed of execution. Each of these experts was specifically optimized
for a class of workload. So for example, you would have Web Search policy or
you would have a database policy or you would have, you know, whatever, video
decoding policy. Because these workloads do have fairly different properties
from each other.
And then we have a controller that monitors what's going on in the system. It
monitors it by looking at hardware performance counters. So it doesn't
necessarily know anything from the application. If it knows something from the
application, it will do better obviously. It selects the best performing expert. This
would be whatever it thinks is the best policy. That particular policy makes the
decisions for a device, and then we evaluate how well these decisions were made.
How do we evaluate? Well, it's fairly straightforward. If the decision was whether to go
into a sleep state, you know exactly how long the idle period was. So you can
figure out what was the best decision to make, and relative to that best decision
you can figure out how well your policy did.
So if it was a simple timeout policy, it would wait some amount of time before
getting to sleep. That amount of time that it waited is counted against it. Right?
The really cool thing about the controller is that its performance converges to that
of the best performing expert at a very fast rate. So it's very good at selecting
what is the best policy to run right now. That rate is a function of the number of
experts -- now, remember, we have relatively few of these experts; we're
not going to have hundreds of them, just a few to well represent the classes of
workloads that you have running -- and of the number of evaluation periods. And this,
in the case of the operating system, would be scheduler ticks. Okay?
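The controller itself is compact. Below is a minimal exponential-weights sketch of this kind of expert selection; the weighting scheme, the randomized selection, and the learning rate are the textbook versions and stand in for whatever the actual implementation uses. The standard regret bound for this family grows only with the square root of the number of evaluation periods times the log of the number of experts, which is why convergence to the best expert is fast when the expert library is small.

    import math
    import random

    class ExpertController:
        # Select among a small library of power management policies (the experts).
        def __init__(self, experts, learning_rate=0.5):
            self.experts = experts                 # e.g. timeout, adaptive timeout, predictive, ours
            self.weights = [1.0] * len(experts)
            self.eta = learning_rate

        def pick_expert(self):
            # Randomized selection proportional to weight; picking the max weight also works.
            total = sum(self.weights)
            r, acc = random.random() * total, 0.0
            for expert, weight in zip(self.experts, self.weights):
                acc += weight
                if r <= acc:
                    return expert
            return self.experts[-1]

        def update(self, losses):
            # losses[i] in [0, 1]: how far expert i was from the best decision in hindsight
            # for the period just finished, e.g. energy wasted idling before finally sleeping.
            self.weights = [w * math.exp(-self.eta * loss)
                            for w, loss in zip(self.weights, losses)]

At every evaluation period the controller picks an expert, lets that policy drive the sleep-state or frequency decision, and once the true idle-period length is known it charges every expert a loss and updates the weights.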
So here are a couple of examples of results we got on the hard disk drive and on
processor. For the hard disk drive we are looking at different traces that I got
from HP, so my days of working at HP paid off. I picked three different traces
because I wanted to see traces with fairly different average interarrival times and
different variance around the average. We wanted to show how well we can
adapt.
We looked at the hard disk because decisions that are not correct for the hard disk
show the worst possible case, right, because it takes so long to respond.
And on the CPU we're focussing more on voltage and frequency scaling. So on
the hard disk we had four policies that were implemented. There was a simple
timeout policy that's the default. There was an adaptive timeout. There was our policy,
which was optimized for a specific performance overhead. So in this case it was
optimized for three and a half percent performance overhead.
So the nice thing that I forgot to mention about these policies is that
this is the only class of power management policies that can be designed to meet
performance guarantees. So you can say, I have this performance guarantee
that has to be met, it will output a set of decisions that will minimize energy within
that performance guarantee.
And the last one is a predictive policy. Now, this was just a sample set that
represents what we've seen in literature. You can add whatever other policy you
like. Within this sample set you can see that our policy was best for performance
because we obviously optimized it to be that way. And that in this particular case
predictive did really well with energy savings.
Now when we actually ran across all of these different workloads with our online
learning controller, you find that we can fairly easily trade off low performance
overhead -- in that case our controller will pick our policy here most frequently,
because that is what gives you low performance overhead -- against higher energy
savings, where it will pick the appropriate policy for that. Okay?
Similar thing happens when you look at voltage scaling. So in this particular
case, this was actually done for a cell phone type application. And you can see
that we have four different voltage and frequency settings. And as we trade off
for low performance overhead for energy savings, you can see that you get
better performance if you run faster, obviously, and then as we want more energy
savings, we're going to spend more time running at low speed. However, what
you notice for this particular application quick sort, there's a good fraction of the
time that we also run at 400 megahertz, even though we're maximizing energy
savings.
The reason why this happens is because quick sort has regions in which it's
more memory versus CPU intensive. And our controller is able to detect this
appropriately and select a correct setting to ensure that you get performance and
maximize energy efficiency. Because if you are CPU bound, you are not saving
energy by running slow. Right?
So that is effectively what you see here. 25 percent of the time we're CPU
intensive, 75 percent of the time memory intensive, hence about 75 percent of
the time you run low, about, you know, 20 percent of the time you run faster.
So that brings me to the last question, you know, that I almost always get asked
when I give this talk, and that is, you know, you get these great savings with voltage
scaling when you run on an old CPU effectively, older technology. What
happens when you do this in servers?
And the answer actually is a little depressing so I wanted to share that with you.
This was a set of results that we got by running an AMD Opteron machine. We also
did the same thing on a Xeon. The results are consistent with each other. So
the conclusion that we draw out of this is a correct conclusion. We presented this to
Intel. They agree that that's where things are going. You can see for the Opteron
the range of voltages: at the top speed it is 1.25 volts, at the low speed it's .9
volts. So there is a very small range actually that we're playing with here, which by
itself already tells you that the amount of energy savings for slowing down is
going to be much smaller than it is for processors in the older technology.
In contrast, for the XScale processor I think the range is from 1.2 to 1.8 volts,
which is much bigger. You know, .6 versus not even .3. Meanwhile the range of
frequencies in this case, from .8 gigahertz at the low end to 2.6 gigahertz at the
high end, is huge. Right? That's more than a factor of 3. Compare that to my
previous slide, where the range of frequencies was from 200 to 500 megahertz. So
that's about a factor of 2.5.
So basically you would expect XScale to get better energy savings with voltage
scaling just by looking at these numbers than AMD Opteron for sure.
The other depressing thing about this processor as compared to the older XScale is
that it leaks like crazy. So leakage power in these things is about 45 percent. So
what this tells you is that as you slow down, about half of your power is simply
going to be wasted, for four times as long, if you run at the lowest frequency. That is
very bad news.
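A quick worked example shows why high leakage erodes the benefit of slowing down: dynamic power drops roughly with V squared times f, but leakage power is paid for the whole, now much longer, execution time. The numbers below are assumptions picked only to make the arithmetic concrete; they are not the measured Opteron data.

    # Illustrative comparison of "run slow" versus "race to idle" with high leakage.
    WORK = 1.0                        # seconds of execution at full speed
    P_DYN_FAST, P_LEAK = 55.0, 45.0   # watts of dynamic and leakage power (about 45% leakage)

    # Slow down to roughly a quarter of the frequency at about 0.7x the voltage.
    p_dyn_slow = P_DYN_FAST * (0.7 ** 2) * 0.25
    t_slow = WORK / 0.25              # a CPU-bound job now takes about four times as long
    energy_slow = (p_dyn_slow + P_LEAK) * t_slow        # leakage is paid for the whole time

    # Race to idle: run at full speed, then drop into a deep sleep state.
    P_SLEEP = 3.0
    energy_race = (P_DYN_FAST + P_LEAK) * WORK + P_SLEEP * (t_slow - WORK)

    print(energy_slow, energy_race)   # roughly 207 J versus 109 J: sleeping wins here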
In contrast to that, XScale I think we estimated at about 20 percent. There is a
big difference between 20 percent and 45 percent. So we wanted to illustrate the
point: what we're doing here is comparing slowing down -- so running at one of these
slower speeds instead of top speed -- against three simple power management
policies. In all three of those we run as fast as we can; in the first we then switch the
CPU to state C1, so we just remove the clock supply. This is the lightest sleep
state that you can choose. In the second, we switch the CPU to state C6, so it's
completely off.
This is the deepest sleep you can do with the CPU alone. And the last policy
actually says well, you know, this is great but what about the rest of the system?
You know, the CPU can sleep but there is a whole bunch of other stuff that's still on.
So in this case, we also put memory into self refresh, to show how much running as
fast as you can and shutting off actually helps the overall system power, and to
contrast that with slowing down. Okay?
So here are the results. These results actually show you only a small sample of
workloads. We specifically picked MCF because it's very memory intensive,
sixtrack because it's very CPU intensive, so this will show you approximate range
of savings that you can get. You would expect MCF to give you the best savings
because of its memory intensiveness, so slowing down the CPU actually should not
affect its performance as much, versus for sixtrack you would expect to get really
bad results because slowing it down, you know, will affect performance and
energy.
You can see first of all if you look at the delay column that at very best at the top
level the percent slowdown is about 30 percent. And this is running at 1.9
gigahertz versus 2.6 gigahertz. So you're paying 30 percent of performance
overhead for almost no power savings. Okay? So what we are showing here is the
percent energy saved by voltage scaling relative to running as fast as possible and
going into the C1 state, running as fast as possible and going into the C6 state, or
also putting memory into self refresh. Okay?
So your benefit for voltage scaling is relatively minuscule. On the worst end, you
have overhead of 230 percent in performance for sixtrack for about seven
percent benefit for voltage scaling. So how many of you are going to do voltage
scaling? Yeah?
>>: I am. Because they're totally incomparable. When you do voltage scaling,
the processor's still running. You're able to process requests. When you put it to
sleep, it's not.
>> Tajana Simunic Rosing: Yes. And that is true. Now, why are you doing
voltage scaling?
>>: [inaudible] power.
>> Tajana Simunic Rosing: So what this shows you is you're actually not saving
energy. Now, the reasons why you may want to do it --
>>: Is that a characteristic of the processor you're using though?
>> Tajana Simunic Rosing: It's a characteristic of all of the high end processors
you buy today. There is no processor that you can buy on the market that I know
of today that would show significantly different results here. If you ran on your
cell phone, you would see different results. Because it's older technology and
obviously much lower performance. But for energy savings perspective, it
actually does not make a lot of sense to do. It does make sense to do from a power
perspective, for power budgeting, right? Voltage scaling will lower your power
budget instantaneously while allowing you to continue running.
It also makes sense to do if you're desperate, right? You need to save the last
bit of -- so in the case of thermal management, if things are getting hot, you still need
to run, but you need to run in a way that lets you not get too hot, so you're going to
use voltage scaling. Right?
So if you actually talk to Intel, their newest processors that they are looking at
putting out on the market going forward are going to have much faster transitions in
and out of sleep states than they ever had before, and they're going to have a lot less
flexibility with voltage scaling because they don't see voltage scaling as all that
useful.
So things are -- the paradigm is really changing. The other thing to think about is in
the old times when you had a single processor, the idea of using voltage scaling
made a lot of sense because as you said, you can still continue running, you
know. If you shut it off, nothing is happening. So you can't continue getting
results.
With lots of processors you have the option of continuing to run, perhaps even
faster if you use the top speed, while at the same time perhaps
shutting off one or two processors on the die to save energy. Yeah?
>>: That's actually just where I was going to go, shutting down the processors
and --
>> Tajana Simunic Rosing: You don't have to shut off everything basically. You
can shut off one out of eight cores.
>>: And if you do so, are you -- is the control for that being done in a way where
you eliminate statically the [inaudible]?
>> Tajana Simunic Rosing: Yes. Yeah. And that's why these numbers look the
way they do. Because basically you have gating transistors that turn off the
connection to the power supply. Otherwise it wouldn't work. You would have the
same problem. Yeah. Okay.
So that brings me to the key points. You saw that having power management
policies that understand workload characteristics and can adapt to them can
make a big difference. You saw that lowering voltage and frequency settings
doesn't necessarily help in the high-end processors for pure energy savings.
However, it does help with peak power reduction. It does help with thermal
management. And obviously for systems that don't have high leakage, it's
still a very good option. So if you have a system that has low leakage, by all
means I would use it. Which is why you see voltage scaling still very prevalent in
all the mobile systems.
So this brings me to thermal management. It's pretty clear by now that the
picture is not complete if you don't think about temperature. So for thermal
management, the inputs we took are workloads that we collected at a data
center. This was work done jointly with Sun Microsystems. They, it turns out,
have a big reliability lab right next door to UCSD and there they have access to all
the workloads from their customers. So that's what we used. I have a student
who worked for Sun.
We took the floor plan and, you know, based on that floor plan and temperature
sensor data we were able to figure out how to schedule workloads. What you see
on the picture here is Sun's Niagara chip. That's what we started working with.
And one interesting thing that you should immediately notice is that on this floor
plan where you put the workload will affect temperature. If I put something that's
very hot right here, it's not going to get nearly as hot as if I put it in the middle.
Because the level two cache acts as a huge cooling mechanism, right?
The same is the case at the server and at the data center level. Where you put
the workload in a rack or in a box will really depend on how well it can be cooled.
So it's exactly the same concept that you see on chip that can be expanded to
the whole box effectively.
For power management, we used sleep states and we used voltage scaling. We
also used the ability to migrate threads as another knob to deal with the
temperature issues. And the results you're going to see are results that come out
of a thermal simulator. This is primarily because Sun wasn't too keen on releasing
reliability data. But the conclusions actually hold.
There are two classes of policies I'll show you. One is optimal and static. The
goal behind this was to see what is actually the ultimate benefit for temperature if
first of all I optimize job allocation for minimizing energy only. So if I say I want
minimum energy, does that solve the temperature problem? Right.
Secondly, what if I also add thermal constraint? How does that change things?
So that is the top two policies. The reason why I looked at the optimal is
because that tells me what I should be striving for with my heuristics. Obviously
optimal results are simply not possible in reality where I'm doing online
scheduling of large number of jobs that I don't know ahead of time. And this is
why all of the operating system schedulers are heuristic in nature.
Then there are the dynamic ones, which use load balancing, which is basically the
default in Linux and in Solaris, and this balances the threads for performance only.
Some of the balancing actually helps temperatures, some of it doesn't.
Then we enhanced it with information about the floor plan, so now the
scheduler knows that some cores are inherently likely to be cooler than others, simply
because of where they are, and it's going to do a little better job of allocating based
on that knowledge.
Okay. The next policy, a simple improvement over the coolest floor plan one, is a
policy that uses thermal sensors. It keeps information on the recent history
per core of what the temperature has been, and depending on how hot it's
been, it will give either a higher or lower probability of sending workload there. So if
it's been reasonably cool, then it's going to have a higher chance of getting new
workload. If it's been reasonably hot, then it will have a lower chance.
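A minimal sketch of that adaptive random idea, simplified from the description above rather than taken from the actual scheduler code; the history length and the 100-degree reference point are assumptions of mine.

    import random
    from collections import deque

    class AdaptiveRandomScheduler:
        # Pick a core with probability inversely related to its recent temperature.
        def __init__(self, num_cores, history_len=8):
            self.history = [deque(maxlen=history_len) for _ in range(num_cores)]

        def record_temperature(self, core, celsius):
            self.history[core].append(celsius)

        def pick_core(self):
            weights = []
            for samples in self.history:
                avg = sum(samples) / len(samples) if samples else 50.0   # assume lukewarm at start
                weights.append(max(100.0 - avg, 1.0))    # cooler recent history -> heavier weight
            return random.choices(range(len(weights)), weights=weights, k=1)[0]

Because the choice stays randomized, load still spreads around the die instead of piling onto the single coolest core, which is what smooths out the spatial gradients discussed next.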
The idea behind this is to actually try to smooth out thermal gradients. So there
are a number of ways that things can fail. One way that things fail is because
temperature gets hot in a particular location, in a hot spot. There you have
electromigration, you have dielectric breakdown. There are a number of failure
mechanisms that depend only on the absolute value of the
temperature.
The second way that things fail is because temperature changes in time a lot,
and this is essentially a stress mechanism. So if it goes between cold and hot a
lot, you end up with failures.
The last way things fail, which turns out to be the biggest pain of them all, is what we
call spatial gradients. So if across the die we have large differences in
temperature, signals travel at a different speed in one area of the chip versus
the other, and you can end up with these nasty bugs to debug. These are not
permanent failures; they're basically temporary failures that you can recover
from, but they're horrible to deal with.
So what we're -- yeah?
>>: [inaudible] redundancy do you have in the [inaudible].
>> Tajana Simunic Rosing: That is a perfect question. So sensors to date have
been really lousy, actually. And in fact on the original chip we had one sensor
that was very bad. So what we ended up doing is putting the chip in an oil bath
under an IR camera so we could actually see what the temperatures were.
Newer chips have more sensors.
>>: [inaudible] temperatures as well.
>> Tajana Simunic Rosing: Yeah. I know. But what are you going to do, right?
What's the choice? The good news is that some of the most recent chips are
coming out with a minimum of one sensor per core, and that actually turns out to be
good enough. If I have one sensor per core, plus we have information about the
floor plan, the thermal properties of the package we can in fact filter out bad
sensor data. And if you're interested in that, I can show you, you know,
afterwards, some of the work we've published in this area.
And this work in fact was done because we had all this pain with the single
sensor on the die.
>>: Is that [inaudible] Intel and [inaudible] are doing?
>> Tajana Simunic Rosing: Yes. Yeah. Yeah. Yeah. So it's actually fairly well
known by both Intel and AMD and, you know, Sun, who also has thermal sensors
on the die that those thermal sensors actually tend to be fairly low quality. They
have very low resolution A to D conversion, so they tend to be very inaccurate. They get
very noisy data. And you have to calibrate them. And calibration turns out to be
very non-trivial. So you can be really far off if you don't do this right.
Yeah, so -- yeah. And that actually is even more true for servers and racks. So
as we talk about these scheduling policies, they're closed-loop scheduling
policies effectively because we monitor sensors, we monitor the workload, we
make a decision based on that, and then we try to adapt. You really have to be
careful, you know, how to identify sensor inaccuracies to smooth them out or to
simply detect that something has failed and work around it. Otherwise you end
up with crazy decisions. You know, you can -- for example, one of the things that
had happened was we ended up having a fan that kept spinning up and down a
lot because it was triggering off of a sensor that was actually bad. So okay.
So moving on. The last set of policies that I'll show you are going to sleep when
it gets too hot, slowing down when it gets too hot and moving threads when it
gets too hot.
And then at the bottom I have exactly the same policy we did for thermal -- for
power management, online learning that selects among these policies online as
necessary. Okay?
So first the optimal. So this is -- this plot shows you the percentage of hot spots
across the particular die that I was showing. If we use load balancing versus if
we do optimal energy aware optimization. Now, mind you, this is a very small
number of jobs here because doing an ILP for lots of jobs is impossible to get
done. And then if we do thermally aware optimization.
And what you find that's interesting is energy aware optimization actually will
create quite a few hot spots. It does give minimum energy consumption. So
among these three, this is by far the minimum energy consumption. But it's at
the cost of clustering jobs. And the reason why it clusters jobs again is because
from a processor perspective, it is much more efficient to shut it off than to run it
slower. Okay? So this I think fairly well illustrates that doing energy
management without considering the cooling, thermal and reliability aspects is
a bad idea. Because you end up with corner cases that cause a lot of problems
potentially. Okay?
So here is what happens again if we now consider all different possible options
and if we use online learning to select among them. So default is the default load
balancing and this was actually results done on Solaris. We ran all of the
workloads on their server. So the performance is all actual performance running
on the server. The temperature numbers are all simulation.
Migration. So if you move a thread, if it gets too hot, power management,
voltage scaling and power management, the adaptive random technique that I
showed you that keeps track of temperature history per core. And you can see
that this little adaptive random technique actually gives a pretty good benefit. If you
combine it with voltage scaling, you get to the point where you are very close to
voltage scaling results but not at the performance overhead of voltage scaling.
And then online learning, which beats the best possible policy by 20 percent.
The reason why it wins is simply because it's adaptive. So it will select
whichever policy makes the most sense at any given point in time.
>>: Are you going to talk about the impact, the reliability impact?
>> Tajana Simunic Rosing: Yeah. So I'm not going to show reliability numbers
because again, I was not --
>>: So you reduce the hot spots by 20 percent. Why would I care?
>> Tajana Simunic Rosing: You care because temperature is exponentially
related to mean time to failure. So if I reduce temperature by 10 degrees, the hot
spot temperature, there is an exponential relationship between that and your
mean time to failure due to electromigration and dielectric breakdown.
>>: I understand that. But is this extending -- I mean, I need servers that last me
four years. This is making it last 20 years?
>> Tajana Simunic Rosing: So --
>>: I mean --
>> Tajana Simunic Rosing: Yeah. So that's -- that's an excellent question. So
for this particular set of results -- and I could show you the estimates that we have
for reliability -- we can actually extend the per-core lifetime in this case by about,
you know, a factor of 10, approximately, by doing things right. At the cost of not
saving as much energy, right?
So you get the performance, you get reliability, you get some energy savings,
but it's not as good as you could get by, you know, doing it this way. How it affects
the whole server, that's a whole other question, and that answer I actually don't
have yet. The problem with all of this is it's very hard to get reliability
numbers out of people. Temperature numbers I can get pretty easily, but, you
know, what the exact relationship is between temperature, temperature
differentials, and reliability -- that's much trickier. Yeah?
>>: [inaudible] flip this around the other way. Now that you've reduced the hot
spots, normalized the temperature across the die.
>> Tajana Simunic Rosing: Yes.
>>: Can you now push the entire die harder --
>> Tajana Simunic Rosing: Yes. And see, that was what was behind this whole
thing. Because effectively what we've done is create a more balanced
thermal environment, which means that now we can actually raise the envelope,
so we have more headroom to add more jobs before your cooling
system has to kick into a higher gear.
>>: Or looked at another way, let me run in higher ambient temperatures --
>>: Sure.
>> Tajana Simunic Rosing: Yeah.
>>: So that exactly --
>> Tajana Simunic Rosing: Yeah. And you know, the two are obviously directly
related to each other.
>>: Don't have to worry about peak.
>> Tajana Simunic Rosing: Yeah.
>>: Fundamental --
>> Tajana Simunic Rosing: And you can actually get better than this. Because
temperature is a slow-moving variable, we can do proactive thermal
management: instead of waiting for things to get hot and then doing
something about it, we can in fact forecast that it will get hot and, based on that,
allocate jobs so it doesn't get to that point. So it lets you work with whatever
ambient temperature you happen to have in the best possible way,
effectively. It says: if I have X number of servers with Y number of cores, here
is how I'm going to pack jobs onto those cores so that the temperature remains
within the envelope that I'm targeting.
So for this, we again take data from thermal sensors. We used an autoregressive
moving average (ARMA) predictor. The reason we used this predictor is that a
simple exponential average does very poorly in corner cases where the
temperature cycles a lot. So if somebody decides to be really smart and save
power by shutting off the server and then turning it back on a lot, the exponential
average dies on that.
And the nice thing about ARMA is that you can also calculate it online very easily.
So you can fit the parameters and then monitor how well the model fits the
results.
And then based on this temperature we actually do predictions -- for some
number of scheduling instances into the future, what will happen -- and based
on that we then schedule. Online we monitor the ARMA model, validate it,
recalculate it if necessary, and move on and make the decisions.
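To make the loop concrete, here is a minimal sketch in Python. It is only an illustration: a plain least-squares AR fit stands in for the full ARMA fit described here, and the model order, forecast horizon, validation tolerance, and temperature threshold are assumptions rather than values from the talk.

```python
import numpy as np

def fit_ar(history, order=2):
    """Least-squares fit of an AR(order) model to a core's temperature history.
    (A simple AR fit stands in for the full ARMA fit described in the talk.)"""
    X = np.column_stack([history[i:len(history) - order + i] for i in range(order)])
    y = history[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_ahead(history, coeffs, steps):
    """Forecast the temperature `steps` scheduling intervals into the future."""
    window = list(history[-len(coeffs):])
    for _ in range(steps):
        window.append(float(np.dot(coeffs, window[-len(coeffs):])))
    return window[len(coeffs):]

def model_is_valid(history, coeffs, tolerance=1.0):
    """Online validation: if one-step predictions drift from measurements, re-fit."""
    order = len(coeffs)
    preds = [float(np.dot(coeffs, history[t:t + order]))
             for t in range(len(history) - order)]
    return float(np.mean(np.abs(np.array(preds) - np.array(history[order:])))) < tolerance

def rank_cores(core_histories, core_models, threshold, horizon=5):
    """Return cores whose forecast peak stays under the thermal envelope,
    coolest-looking first, so new jobs land where they won't create hot spots."""
    peaks = {core: max(predict_ahead(hist, core_models[core], horizon))
             for core, hist in core_histories.items()}
    return sorted((c for c, t in peaks.items() if t < threshold), key=peaks.get)
```

The intended usage is simply: keep per-core temperature histories, re-fit a core's model whenever `model_is_valid` fails, and hand new jobs to the first cores that `rank_cores` returns.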
So when we add this capability to the system, here is where we end up in terms
of reducing hot spots relative to default load balancing. Okay?
So you can see that hot spots are reduced by over 80 percent -- over 60 percent as
compared to reactive migration -- at practically no performance impact. Simply
because, again, temperature does not change that quickly, so we leverage that.
And similarly, gradients are reduced to less than three percent in space
and to less than five percent in time. So you literally end up in a
situation in which you have a fairly flat and low thermal profile, which is nirvana
for a reliability engineer. Okay? Yeah?
>>: [inaudible] data that compares [inaudible] the temperature coming out of a
server which [inaudible] versus one that isn't? I mean, is it actually detectable?
You are using it --
>> Tajana Simunic Rosing: I have data that compares multi-socket temperatures
and that definitely is -- there's a big benefit that we've measured.
>>: By a degree or two or, you know, what's the scale?
>> Tajana Simunic Rosing: It depends on what you are running.
>>: No, so I'm just like in any given -- I mean, picking a configuration, I don't
know what it is, but what sort of differences were you seeing [inaudible] scale
[inaudible].
>> Tajana Simunic Rosing: So in the best case we've seen about 10 degrees
lower on the die, which is actually a big deal for reliability. But this is the best
case. You know, if you really run CPU-intensive jobs on all of the cores and
there ain't nothing we can do about it, you know, we're stuck. So in that case,
you know, what he mentioned -- slowing down -- is about the best that it's
going to get. Maybe your performance isn't stellar, but at least your
server hasn't died yet. Yeah.
So the other thing that we have done, that I haven't covered here, is we looked at
combining proactive scheduling of jobs with fan control. So if I
have two or X number of sockets, and fans associated with those
sockets trying to cool them, it turns out that the power of a fan is a cubic
function of its speed. So basically as you go from low to high speed, you
consume a lot more power. And so, if you are already running at a higher fan level
on one socket, you have a lot of motivation to pack in as many jobs as that level
of blowing air will handle.
So when you do this jointly, we actually got savings of about 80 percent on joint
socket and fan power, which is fairly big. And it's relatively simple to do. It
requires the operating system to tap into thermal sensors and to have some
level of fan speed control. Thermal sensors you can already do -- we've done it,
so you can do it. The fan speed control is trickier, because right now that's done
by a controller processor that's on the board. So for that you have to make
some deals with manufacturers. But it's certainly doable; you know, they can
expose the variables that are needed to you.
And the benefit, as I said, is big enough to where it may be worthwhile thinking
about it. And you know, if it's big -- if it's that big in the single server I would
imagine that you could get even bigger benefits at the whole box. So that's kind
of what we're after right now.
>>: Hopefully just have a simple equation which said okay this fan is already
running --
>> Tajana Simunic Rosing: Yes.
>>: -- at high speed, so just pack all the jobs on there anyway [inaudible].
>> Tajana Simunic Rosing: Well, but it also goes the other way, right? You can
also have a situation in which you pack the jobs, you have more jobs coming in,
so the question is do I put it on that particular socket and possibly trip it over the
control point or do I put it on some other socket and which other socket should I
choose? And if I'm starting to get overloaded on a set of sockets, how do I
offload, to whom? Right.
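A toy sketch of the kind of joint decision being described here -- does the next job go on the socket whose fan is already spinning fast, or would that trip the fan into a higher, much more expensive speed step? Only the cubic speed-to-power relationship comes from the talk; the fan steps, the cubic coefficient, the per-job heat model, and the per-job power number are made-up illustrative constants.

```python
# Toy model: fan power grows roughly with the cube of fan speed, so the marginal
# cooling cost of one more job depends heavily on whether it pushes the fan into
# the next speed step. All constants below are illustrative, not measured values.
FAN_STEPS_RPM = [2000, 4000, 6000, 8000]
FAN_POWER_COEFF = 1.5e-10            # watts per rpm^3 (made up)
JOB_POWER_W = 20.0                   # socket power added per job (made up)

def fan_power(rpm):
    return FAN_POWER_COEFF * rpm ** 3

def required_rpm(jobs_on_socket):
    """Stand-in thermal model: each job needs a bit more airflow."""
    needed = 2000 + 700 * jobs_on_socket
    return next((s for s in FAN_STEPS_RPM if s >= needed), FAN_STEPS_RPM[-1])

def marginal_cost(jobs_on_socket):
    """Extra socket + fan power if one more job lands on this socket."""
    delta_fan = (fan_power(required_rpm(jobs_on_socket + 1))
                 - fan_power(required_rpm(jobs_on_socket)))
    return JOB_POWER_W + delta_fan

def place_job(socket_job_counts):
    """Greedy joint decision: pick the socket with the cheapest marginal cost."""
    best = min(range(len(socket_job_counts)),
               key=lambda i: marginal_cost(socket_job_counts[i]))
    socket_job_counts[best] += 1
    return best
```

With numbers like these, a socket whose fan still has headroom at its current step absorbs extra jobs almost for free, whereas a socket sitting right at a step boundary pays the full cubic jump -- which is exactly the trade-off the question raises.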
>>: [inaudible] components in the server as well to check the monitor.
>> Tajana Simunic Rosing: Yeah. So my student has just submitted a paper
that also looks at the memory subsystem. Because, you know, depending on who
built the server, they may have put the fans in such a way that they blow first across
all of the memory and then the nice hot air comes to your CPU. In some
cases they actually build big air ducts that take a subset of fans to the CPU while the
other subset goes to memory. So it really depends on who did it and
how they did it, and the end result of your job scheduling will change drastically
depending on what configuration is there. So what you really want is to be able
to learn this online and adapt to it, which is why we want to have the controls and
the monitoring capability.
So for that, you know, as we move from a single server or a few servers to the
multi-box level, we started using virtualization because it's a lot easier.
So we developed a system that we call vGreen that sits on top of Xen. The system
is able to monitor all the different virtual machines and the physical machine
properties and communicate that to a central scheduler -- there is a server that right
now does centralized scheduling across multiple machines. The idea is to decide
which VM should go where and to also do power and thermal
management in tandem with that.
For workload characterization, right now the results you'll see are very, very basic
and elementary. They basically say: is it CPU intensive or is it memory/IO
intensive, and depending on that, we'll allocate it to an appropriate place. The
interesting thing is that even this simple variable actually gives you good results
already. So, you know, after I saw that I told my students, boy, we've got to push the
envelope here, because you're guaranteed to get the low-hanging fruit in no time at
all. So that's what we're doing.
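Roughly, the kind of characterization and placement being described might look like the sketch below. The thresholds, the metric names, and the greedy co-location rule are assumptions for illustration; the talk only says that each VM is tagged as CPU-intensive or memory/IO-intensive and then placed accordingly.

```python
from dataclasses import dataclass

@dataclass
class VMStats:
    name: str
    cpu_util: float    # observed CPU utilization, 0..1
    mem_io: float      # normalized memory/IO pressure, 0..1 (assumed metric)

def classify(vm, cpu_thresh=0.6, mem_thresh=0.5):
    """Very coarse tag, mirroring the 'CPU vs. memory/IO intensive' split."""
    if vm.mem_io > mem_thresh:
        return "memory"
    if vm.cpu_util > cpu_thresh:
        return "cpu"
    return "neutral"

def place(vms, hosts):
    """Greedy placement: spread each kind of VM across hosts so that CPU-bound
    and memory/IO-bound VMs end up mixed rather than clustered together."""
    counts = {h: {"cpu": 0, "memory": 0, "neutral": 0} for h in hosts}
    plan = {}
    for vm in vms:
        kind = classify(vm)
        host = min(hosts, key=lambda h: counts[h][kind])
        counts[host][kind] += 1
        plan[vm.name] = host
    return plan

# Hypothetical example: two hosts, two CPU-bound and two memory-bound VMs.
vms = [VMStats("vm1", 0.9, 0.1), VMStats("vm2", 0.2, 0.8),
       VMStats("vm3", 0.8, 0.2), VMStats("vm4", 0.3, 0.7)]
print(place(vms, ["host-a", "host-b"]))
```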
So here is a motivating factor behind this, you know. We all know that machines,
when you turn them on, consume a huge amount of power just by being on, and
really the major dynamic component is the CPU, right? So the CPU will really
affect the top range of power. Most people therefore assume that CPU utilization
is good enough for estimating the power consumption of the whole server.
And, you know, to a rough degree it may be good enough. However, what we
found is that depending on what applications you run, you can actually see a pretty
big range in the overall server power that you measure. Here I'm showing
65 watts; we've seen up to about 80 watts of delta, depending on the server,
without any power management -- just depending on what applications you
run. Okay.
That tells you that smart allocation of jobs can actually make up to 70 watts of
difference. Right? So that is exactly what we did. We used a combination of
SPEC and PARSEC benchmarks, two servers, and one separate centralized scheduler
server, and all we did is dynamic workload characterization and
migration. And what we got is 15 percent AC power savings -- so this is
whole-box server savings -- with a 20 percent speedup. So both a performance and
an energy benefit. Yeah?
>>: [inaudible] do you have the [inaudible] together or --
>> Tajana Simunic Rosing: So this actually shows the same job basically
increasing utilization. So if I have MCF, I'm going to run it on one core, on two
cores, on three cores up to, you know, whatever cores and threads that I have
available. And it shows what happens there. On the next slide it's actually mixes
of jobs.
>>: Are you doing [inaudible] capping on those --
>> Tajana Simunic Rosing: Do I do what?
>>: The virtual machine utilization capping, to achieve the 25 percent? How do
you do controlled --
>> Tajana Simunic Rosing: Oh, yeah, we basically capped it artificially, because
that way, you know, we can control it. Because, you know, how are you going to
create an experiment otherwise, right? Yeah.
So really what these numbers should tell you is that there is actually a lot of
potential benefit that hasn't been exploited here yet. You know, when you
combine the ability to characterize jobs, to place the [inaudible] appropriately and
then to do power and thermal management jointly.
So going forward, I'm actually heading a fairly big center called MuSyC, and
the particular aspect of that I'm focused on is leading an effort on energy-balanced
data centers that spans from software to systems and down to the
platform. Basically we're looking at how we can do multi-scale, cross-layer,
distributed and hierarchical management. So the questions are: where should the
control sit -- there are lots of different layers -- and how should the
communication happen between these layers? We want to ensure that energy is
spent only when and where it's needed instead of wasted at lots of interfaces.
Okay?
And this brings me to a quick summary, which I think I can skip. And I have a few
more minutes to maybe touch on some of the sensor stuff. I think what's
interesting about the sensor stuff is that all the ideas I just talked about,
minus thermal management, pretty much apply.
>>: We're actually starting to talk about thermal management in these as well.
>> Tajana Simunic Rosing: So, yeah, I guess if you have really bad
environment, your CPUs have problems.
>>: But we're not there yet. But it's come up.
>> Tajana Simunic Rosing: So some work that I have been doing, that I haven't
talked about at all, is looking at heterogeneous multicore processors that go into
base stations. And thermal management there is actually a big deal. So I have a
couple of papers that recently came out on that that you might be interested in.
Yeah?
>>: One comment. This is a concern I've had as we do more and more
management.
>> Tajana Simunic Rosing: Yes.
>>: On different systems. What work, if any -- we've actually seen failures of
servers as a result of incompatible controls where you reach oscillation ->> Tajana Simunic Rosing: Yeah.
>>: So what work, if any, have you done in that area?
>> Tajana Simunic Rosing: I've done quite a bit of work for the multi-socket,
multi-chip. I have not done anything on the server level. The work that we did on
multi-sockets actually looked at fan failures and oscillation, so designing policies
that actually are stable across multiple variables. We are going in that direction.
The problem that I'm running into is reliability has been difficult to model. So if
you have data, you know, that can help me do a good job of modeling, that would
be good.
>>: [inaudible] control systems --
>> Tajana Simunic Rosing: I know, it's --
>>: [inaudible].
>> Tajana Simunic Rosing: No, I totally hear you. I know.
>>: [inaudible] and end up fighting each other --
>> Tajana Simunic Rosing: Yeah. Yeah. And that is why, you know, when I
said multi-scale, that's exactly what I'm after. You know, you have all these
different things, and they each individually could be designed to do a great job.
But you put them together, you could have catastrophic failures. You know, how
do we avoid these catastrophic failures. I don't have a good answer yet.
But we have started -- in my case we started kind of small, you know, and we're
building up from there. And I also have a part of my team that started
sort of top-down, you know, getting all the thermal data in real time and trying
to understand how it relates to the current control loop for the data center
container. Because there is already a controller in there -- and if we make
changes to that controller, what happens? Which is a scary thought.
So sensor networks are very much a distributed problem as well, except in this
case it's also physically very much a distributed problem. What you see on
this map is a large-scale sensor network deployment. UCSD is this little dot right
here -- that gives you an idea of the size. The range inland is about a hundred
miles. The network spans up to the Salton Sea, just beyond the Salton Sea, and it
goes 70 miles off the coast. The connection off the coast was done jointly with the
Navy, because they are interested in using our wireless backbone. It reaches close to
Mexico and goes all the way up to Riverside County.
What you see on this map is only the top layer of the network, so the wireless
mesh backbone. What's underneath every one of these dots are literally
hundreds of sensor node cluster heads and thousands of sensor nodes. And the
sensor nodes do all kinds of different things, most of which have nothing to do
with computer science at all.
In fact, you know, probably the largest density of nodes is for seismic monitoring;
Scripps Institution of Oceanography deploys those. They have a number of
ecological research stations in place that measure things like temperature, water
quality, and soil. And we also have people who are monitoring wildlife, so the
California Wolf Center. Lots of different applications.
What I find really exciting about this deployment is exactly this: it
represents a real, live test case of what life is going to look like as we
start using more and more sensors and as people who are not engineers start
deploying and using the networks.
This network has actually also been used to help the California fire department
fight forest fires in the San Diego area. The last big fire a few
years back leveraged our network to a pretty big degree -- a lot of the images that
you saw actually came from our sensors out in the field. So here is an example.
You know, in the very low end I mentioned we have these earthquake sensors.
They don't produce that much data. They are about five kilobits per second
worth of bandwidth per sensor. However, they are somewhat latency sensitive
when something exciting happens where something exciting is defined as a big
earthquake, right?
So at that point you really want this five-kilobit-per-second trickle to make the
hundred-mile trip over a gazillion hops very quickly.
At the middle layer we have things like motion-detect cameras and acoustic
sensors. In the case of the motion-detect cameras -- you can see the wolves
here -- they want to turn on when some interesting animal happens to walk by,
keep recording video while the interesting animal is still there, and
then stop.
At the same time, we want to collect acoustic data. So people that are studying
wolf behavior want to know, you know, if this particular howl, if you really come
close you can see the howl on the acoustic sensor, how does it correlate with
particular body posturing that the wolf is making. So you would need to have
time correlation between these two, because these two actually aren't physically
at the same location even. They're in a slightly different location because of
noise.
Okay. High-resolution still cameras -- I give that example primarily because it's
actually a lot of data that comes out of those. Wildfire-tracking cameras, day
and night: these are still images from video cameras that are
strategically located at places that help us track how the fire is progressing. We
have also done experiments with helicopters with cameras on them that then
stream video to mountaintops and to the back-end control of the California fire
department.
And then at the very high end, there are two observatories, Palomar and not
Livermore and not Lowell, it's an L observatory, sorry. I'm spacing on it. But
basically two observatories. The Palomar observatory produces [inaudible] up to 150
megabits per second at night.
So you can see a small problem. If I have 150 megabits per second streaming
through my network in the middle of the night, and at the same time there
happens to be a fire out there somewhere close to Palomar, there will be a
problem in how quickly the fire data is going to get through. So we need to be
able to adjust our quality of service settings along the network very quickly and
easily.
At the ultimate low end we have devices that don't even use batteries. This is
an example of a structural health monitoring device that my group developed that
actually uses solar cells and supercaps. The reason we went this direction
is because the people at Los Alamos National Laboratory who funded this project want to
deploy the sensors in areas that are very difficult for humans to access, and the shelf
life of a solar cell and a supercap is longer than that of a battery.
The problem when you have a solar cell and a supercap is that your power source
is very unreliable. Sometimes it's really good, sometimes it's really lousy, and yet
you need to be able to deliver results regardless of how bad it is. And for this particular application, it's actually doing what we call active sensing.
So each one of these boards, which actually has been shrunk significantly, has
access to 16 PZTs -- piezoelectric devices. Each one of these piezoelectric
devices can generate a wave signature that's sent through the structure and
sensed by another one, so you can create pairs of paths.
The signature looks something like this. The red line shows what's actually
sensed; the blue line is how it should look if everything were okay. So
you can clearly see that there is a problem here.
Per single path we get 10,000 samples. With 16 PZTs, and 10,000 samples per path,
that is a lot of data, right? That data, as you can see, has to be
processed, and the type of processing that's done on it is very similar to what you
do for video decoding or any other kind of large-scale signal processing.
As a result, we had to put a DSP on board to really do this effectively.
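As a rough sense of scale -- assuming 16-bit samples, which the talk does not specify -- one full scan over the 160 paths mentioned a bit later works out to a few megabytes of raw waveform data, which is why an on-board DSP matters:

```python
SAMPLES_PER_PATH = 10_000     # from the talk
BYTES_PER_SAMPLE = 2          # assumed 16-bit ADC samples
PATHS_PER_SCAN = 160          # full path set mentioned later in the talk

raw_bytes = PATHS_PER_SCAN * SAMPLES_PER_PATH * BYTES_PER_SAMPLE
print(f"{raw_bytes / 1e6:.1f} MB of raw waveform data per full scan")  # ~3.2 MB
```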
Now, clearly this processing is significant enough that during the nighttime I
won't be able to do a whole lot of it. During the daytime, when it's nice and sunny --
I love San Diego -- you can do a lot, right? So this picture actually shows the
variability in the amount of sunlight that's available, and a prediction that we
developed to give us an estimate of how much we're likely to get.
>> Navendu Jain: Five minutes.
>> Tajana Simunic Rosing: Yeah, I see it. Yeah?
>>: [inaudible].
>> Tajana Simunic Rosing: I'm sorry?
>>: How often do you need to do this?
>> Tajana Simunic Rosing: Well, that's the good news. Couple of times a day.
Yeah. Unless something bad happens. So bomb drops, airplane runs into a
building, earthquake happens, then you need to do it more.
So on the bright side, it's not often on average. On the not-so-bright side, there are
events that do get triggered for which you need to be ready to respond, so you
always have to have some reserve planned. And in the case of solar, what we want
is to be able to trade off the accuracy of the computation against the amount of
energy that's currently stored and that's likely to become available. That is exactly
what we do with the predictor, and what you see down here is that at some points
in time we execute very few tasks and at other times we execute lots of tasks.
So you can see accuracy grow significantly depending on how much juice we've
got. Okay?
>>: [inaudible].
>> Tajana Simunic Rosing: It doesn't, but it can estimate it by what it's currently
getting out of the solar cell. So basically what it does is it says, well, you know,
currently I'm getting X amount of energy; given the recent history, which it keeps, I
predict that I'm going to get Y amount. And it turns out that over about a 30-minute
time period we can be within 10 percent accuracy, which is really good enough.
Because typically, if I did all 160 paths, it would take me about three and a half
minutes to complete. So 30 minutes is good enough. Okay.
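A minimal sketch of that loop: estimate the energy likely to arrive over the next half hour from recent harvesting history, then decide how many PZT paths to evaluate while keeping a reserve for emergencies. The EWMA predictor and all the constants are assumptions; only the roughly 30-minute horizon, the roughly 10 percent accuracy target, and the 160-path full scan come from the talk.

```python
class SolarBudget:
    """Toy energy-availability loop for a solar-cell + supercap node."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha            # EWMA smoothing factor (assumed)
        self.harvest_mw = 0.0         # smoothed recent harvesting power

    def observe(self, measured_mw):
        """Update the estimate with the power currently coming off the cell."""
        self.harvest_mw = self.alpha * measured_mw + (1 - self.alpha) * self.harvest_mw

    def predicted_energy_mj(self, horizon_s=30 * 60):
        """Energy expected over the horizon; mW * s = mJ."""
        return self.harvest_mw * horizon_s

    def paths_to_run(self, stored_mj, energy_per_path_mj, reserve_mj, max_paths=160):
        """Trade accuracy (number of paths) against stored + expected energy,
        always keeping a reserve so the node can react to a real event."""
        budget = stored_mj + self.predicted_energy_mj() - reserve_mj
        if budget <= 0:
            return 0
        return min(max_paths, int(budget // energy_per_path_mj))
```

On a sunny afternoon the budget covers all 160 paths; at night the same loop might schedule only a handful, which is the accuracy-versus-energy trade-off being described.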
So going back to my original sensor network, you saw that at the top level I have
this high-speed wireless mesh. In the middle I have these sensor node cluster
heads that collect the data from a whole bunch of sensors at the low end.
The sensor node cluster heads are where a lot of the significant computation happens,
and also where the decision on who gets what data occurs. So there is actually
quite a high demand for delivering reasonable quality of service with good battery
lifetime. And for that we developed routing and scheduling algorithms that
helped us save power while improving throughput, because without that, this
was actually the layer at which energy would become very critical, and that had not
been addressed yet.
And that enabled us to start thinking about combining body area networks with
the environment. So for this new CitiSense project that I talked about a little bit
earlier, what we're looking at is how we monitor people 24-7 in order to
understand how diseases such as asthma develop in the first place. And for
asthma, you need to monitor air quality, and you need to monitor the physical movement
of a person so you understand how much physical exertion affects the breathing
patterns. You need to monitor a whole bunch of parameters.
And for that we've actually already had a small deployment. What you see in
this picture is the results that we got by just monitoring physical activity. We
actually had 63 patients use our system with the goal of becoming more
physically active and therefore losing weight. And it turned out that the simple
act of monitoring it and providing quality feedback in real time through the cell
phone motivated people hugely. People really loved the system, and they
actually lost significantly more weight than the people who didn't use it.
So in this case, it was over six pounds more than the control. Right? Which is
a big deal for four months. The most important thing from this, I think, is that
over 95 percent of the people wanted to continue to use the system
after the study was done and wanted to buy it for their friends and family. So
people actually really liked what they got, which is not a small thing to achieve when
you give real-time feedback on somebody's physical activity. It can be very
annoying, you know, if you don't do it right.
So, yeah, I mean, think about it. This is what actually got us thinking. If
we combine this with the environment, we can get a truly powerful way to provide
doctors, medical professionals, and public health officials with tools that they
can use to understand what to do, you know, and how to encourage us.
Unfortunately as you do this, you end up with increasingly more complex sensor
networks so you get stuff on your body, you get stuff in the environment, you got
these what we call local servers, basically cell phones, you got the back end.
And to date most people basically said just collect the data and then send it to
the back end, have the back end figure it out and give you the answer.
If you think about healthcare, that absolutely does not work. Right? You don't
always have connectivity in the first place. And secondly, you've got all this
great processing right over here -- why are you not using it? You know, wireless
transmission actually costs you a lot of power in the first place.
So what my group has been looking at is this: if we have sets of tasks that have
dependencies between them and that have some performance requirements -- I
need to provide feedback within some amount of time -- what is the right way to
dynamically schedule them across this whole network in a way that maximizes
perceived battery lifetime per user? Because that's what you actually want: you
want each individual to be happy; you don't really care about the whole system.
And it has to deliver the information that people need. So here is an example of a very
simple task graph that we may be working with and, you know, two
different types of bindings -- again an ILP, back to my slides some 20 slides back.
The goal was to minimize the maximum energy consumption rate among all of
the sources, so in effect we balance how quickly you're draining
batteries from each individual user's perspective. And we got an assignment that was
on average 20 percent better than moving everything to the back end in ideal
conditions. So assuming you have perfect wireless connectivity, nothing
changes, and the same tasks, you're going to do better if you actually assign some of the
computation tasks to local nodes than if you send everything to the back end.
Now, if you look at the real-life scenario, where you actually have to start from
some initial assignment, detect that things are changing, and then adapt at
runtime, we can actually get about 80 percent longer battery lifetime than
the best static implementation. So we take the ILP result, we deploy that, and we are
80 percent better by doing simple online heuristics. Okay?
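A toy sketch of the online placement idea: each task can run in more than one place (body sensor, phone, back end), each option charges some energy to one or more battery-powered nodes (for instance the radio cost of shipping data to the back end), and the heuristic greedily picks the option that keeps the worst-off battery's drain rate lowest while staying inside a latency budget. The data layout, the latency model, and the greedy rule are all illustrative assumptions, not the actual ILP or heuristic from the talk.

```python
def place_tasks(tasks, battery_nodes, latency_budget_ms):
    """tasks: list of (task_name, options); options maps a placement name to
    (charges, latency_ms), where charges maps battery node -> mW drawn there
    (e.g. the radio cost of pushing data to the back end).
    Assumes every task has at least one option that fits the latency budget."""
    drain = {node: 0.0 for node in battery_nodes}   # running drain rate per battery
    plan, used_latency = {}, 0.0
    for task_name, options in tasks:
        best = None
        for where, (charges, latency_ms) in options.items():
            if used_latency + latency_ms > latency_budget_ms:
                continue
            # objective from the talk: keep the maximum drain rate among sources low
            worst = max(drain[n] + charges.get(n, 0.0) for n in battery_nodes)
            if best is None or worst < best[0]:
                best = (worst, where, charges, latency_ms)
        _, where, charges, latency_ms = best
        for node, mw in charges.items():
            drain[node] += mw
        used_latency += latency_ms
        plan[task_name] = where
    return plan

# Hypothetical example: a feature-extraction task can run on the phone (CPU cost)
# or at the back end (radio cost charged to the phone).
tasks = [("extract_features", {"phone":   ({"phone": 40.0}, 50.0),
                               "backend": ({"phone": 25.0}, 120.0)})]
print(place_tasks(tasks, ["body_sensor", "phone"], latency_budget_ms=200.0))
```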
So that's kind of where we're going, going forward. You can see that in my group
we're looking at everything from ultra-low-power sensing systems all the way
to large-scale servers. The idea is the same across this whole spectrum: we
want to get energy-efficient behavior with reasonable performance, and we want
to do it in a way that's relatively simple to verify and that's reliable.
So that's all I had for you. And I think I just ran out of time. So this is good.
[applause]