>> Ben Zorn: Great pleasure today to introduce Lingjia Tang from the University of Virginia. Lingjia is getting her Ph.D., finishing her Ph.D., and she's been working in the area of improving the performance of datacenter or warehouse scale computers.
She recently won an award, the CGO Best Paper Award, and her work has been recognized by Google. She's worked closely with Google in a number of internships. So it's a great pleasure. Thank you.
>> Lingjia Tang: Hi. Thanks. I don't know if this is working. Thanks. So my name is Lingjia. Thanks for the introduction. To make it easier for you guys to pronounce it, it's pronounced like Ninja, but with an L.
So today, I'll be talking about mitigating memory resource contention on commodity multicore machines to improve efficiency in modern warehouse scale computers.
I also want to share some insights and have some discussion with you guys about rethinking the role of the compiler and runtime system in this kind of emerging computing domain.
So let me first give a little bit of background. I'm sure a lot of you are really familiar with this warehouse scale datacenter, the computing domain that my work is focused on.
So especially in the last couple of years, these mega scale datacenters have gained tremendous momentum and have emerged as a very important computing domain. Companies including Microsoft, Google and Facebook are constructing these mega scale datacenters that host large scale internet services such as web search and mail, as I've shown here. And a lot of people are really familiar with these kinds of applications, and some people are really addicted to them.
So The Economist points out that datacenters have become vital to the functioning of society nowadays. These kinds of datacenters are very expensive to construct and to operate. Typically, each datacenter may cost hundreds of millions of dollars, and the cost and the size of those datacenters have grown rapidly in the last few years.
So a few years ago, when we talked about this kind of datacenter, we would say it's a football field sized building housing tens of thousands of machines. Now, we're talking about buildings tens of football fields in size, and we're talking about hundreds of millions of dollars for a recent datacenter. The one shown here, a picture I found on the internet, is Apple's new datacenter, which actually cost a billion dollars. So these things are getting really, really expensive nowadays.
And, of course, when people spend a billion dollars building a datacenter, they want it to be very efficient. So efficiency is really critical for the cost. My work mostly focuses on memory resource contention on commodity multicore architectures, which is a very significant limiting factor for efficiency in warehouse scale computers.
And I'm going to explain why. So those are the server racks that populate the datacenters, and if you look at the machines that we use in those types of datacenters, they are typically commodity multicore machines. On those kinds of machines, multiple processing cores share part of the memory subsystem. So each core here has its private L1 cache, but they share the L2 cache, the memory controller, and the bandwidth to memory.
So if you have multiple applications running on this kind of architecture, the multiple application threads may contend for some of those memory resources. For example, an application may evict the data of another application's thread out of the shared cache, causing significant performance degradation.
In addition to significant performance degradation, this memory resource contention on these kinds of multicore machines also limits server utilization. I'm going to explain why in the next couple of slides, but I want to point out here that a one percent improvement, either in application performance or in utilization, in a massive scale datacenter translates to potentially millions of dollars saved. So it's really important.
All right. So let me further illustrate the impact of memory resource
contention on both performance and utilization. So here's a really simplified
model. We have two applications and we have two options of how we run those
applications. First, we could say each application gets its own dedicated server, as shown here. So we have two machines running two applications. Each application may or may not occupy all of the processing cores of its machine. So you have low utilization of those machines, but you get, you know, peak performance.
The other option says let's allow co-location, so we improve the utilization of the servers and reduce the number of machines we need, but we potentially have significant performance degradation because of the memory resource contention.
So things get a little trickier when you have latency sensitive applications, such as Websearch or mail. Those are applications where users are sensitive to latency; you care about the user experience, so you don't want to sacrifice a lot of the quality of service, the performance, of those applications. So you care about those.
And there are also potentially low priority applications in the datacenter, such as batch applications, that you don't care that much about, but we really do care about the latency sensitive applications.
So if those applications cannot deliver acceptable QOS when we co-locate them, then we don't want to do this kind of co-location. So co-location is often disallowed for latency sensitive applications. The typical approach is to disallow co-location and sacrifice utilization for performance.
Just to show you that low utilization is pretty common, and it's actually one of the big challenges for datacenters, here's a graph from an HP production datacenter. One of the key observations you can see from this graph is that we're spending 60 percent of the time, or even more than 60 percent of the time, running at 10 percent server utilization. So this is quite low. Google has slightly better data; the official number is around 30 percent utilization.
I was talking to an IBM fellow, and they were saying the IBM production datacenters run at around 8 to 12 percent average utilization. So those are really low numbers.
So why is low utilization expensive? If you look at the total cost of ownership of a datacenter, purchasing servers is a huge chunk of the cost. On top of that, constructing the datacenter is a big chunk too, and then the power. So low utilization means that, given the same amount of work, you need more machines to run the work and you need a bigger datacenter, and those are very costly. So you want to improve utilization in the datacenter.
So therefore, mitigating contention is really critical for improving both
performance -- go ahead.
>>: Can you go back one slide? I just want to understand it a little bit
better. Would it be fair to say that the server cost is a recurring cost,
because, you know, servers keep failing and you keep buying more and more
servers. Whereas the other stuff, excluding the power at the datacenter,
that's basically a fixed cost?
>> Lingjia Tang: Oh, this is the total cost of ownership and its breakdown. So assuming you have the server for four or five years and amortize the cost.
All right. So the big picture of the work that I'm going to present here, my dissertation work: first we want to understand the interaction between the datacenter applications and the underlying memory subsystem. And on top of that, based on that understanding, we want to manage the resource sharing between application threads so we can improve application performance. We also want to manipulate the applications' characteristics so we can facilitate more co-location and workload consolidation on those kinds of servers to improve utilization.
I have publications with some results in those domains, so I'm going to briefly talk about those too and then focus most of the talk on this part of the work.
I also have other work related to how we improve efficiency in the datacenter. Tomorrow, Jason is going to talk a lot about that very interesting work, and you can hear a lot from him.
So the outline of the talk: I've discussed the motivation and the goal of the work. I'm going to talk a little bit about the characterization of those applications and how they interact with the memory subsystem first.
So let's take a deeper look at the memory resource contention on those kinds of architectures. This is the Intel Xeon Clovertown, which is actually widely deployed in Google production datacenters. On this kind of machine, we have two sockets, and each socket has four processing cores. Two cores share a last level cache, four cores on a socket share a bus, and then the two sockets share a memory controller.
So you have three major shared components on this kind of architecture. One interesting thing to notice is that the sharing could potentially have a constructive or a destructive performance impact. If threads belong to the same application and they share data, and it fits nicely in the shared cache, then they basically prefetch for each other and you are reducing the coherence traffic and the bandwidth traffic. So it's a good thing.
But potentially, they will also contend for those resources. If that happens, it's a destructive kind of performance impact for those applications.
Another interesting thing is that the thread-to-core mapping determines which resources are shared among multiple threads. In this simple example, those two threads are sharing those three resources. If I move a thread to another core, they're only sharing the memory bus and memory controller. If I move the thread to another socket, we only share the memory controller. So it's a really interesting feature of these kinds of architectures.
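As a rough illustration of what picking a thread-to-core mapping means mechanically, here is a minimal C sketch (my own, not from the talk) that pins two threads to chosen cores on Linux; the core numbers are assumptions that depend on the machine's actual topology.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        (void)arg;                       /* application work would go here */
        return NULL;
    }

    /* Pin a thread to one core; which caches and buses it shares with its
       siblings follows from which core we choose. Returns 0 on success. */
    static int pin_thread(pthread_t t, int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(t, sizeof(set), &set);
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, NULL);
        pthread_create(&t1, NULL, worker, NULL);

        /* Hypothetical Clovertown-like numbering: cores 0 and 1 share a last
           level cache; a core on the other socket would share only the
           memory controller with core 0. */
        if (pin_thread(t0, 0) != 0 || pin_thread(t1, 1) != 0)
            fprintf(stderr, "affinity call failed\n");

        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }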
So the research question for this part of the talk is: what's the performance impact of this thread-to-core mapping and the memory resource sharing on application performance for the emerging large scale datacenter applications that we care about?
In some of the prior work, people have been using SPEC, and they conclude there's significant performance degradation due to memory resource contention. There's a lot of work in the hardware community based on that.
But there's recent work in PPoPP 2010 that concludes that cache sharing does not matter for contemporary multithreaded applications. So our question is, you know, what about the large scale emerging datacenter applications that we care about? What's the performance impact for those applications?
So let me show you some of the key results that we found. There are more results and details for those applications in my publications, so we can talk offline.
So this is intra-application sharing, meaning we're interested in finding out
the sharing among threads that belong to the same application.
So here are the three large scale applications we use, including Websearch, Bigtable and Content Analyzer, a semantic analyzer. In each case, we run the application on the Clovertown machines I showed previously in three different sharing configurations. From left to right, you can think of it as more and more sharing among threads that belong to the same application. The Y axis is the normalized performance. One thing to notice here is there's a big performance swing, up to 22 percent for those applications. For example, the semantic analyzer degrades with more sharing, indicating there's contention among threads; the performance suffers from the contention.
And Websearch as well. But there's also an interesting thing to see from this graph, which is that there are both constructive and destructive impacts. If you look at Content Analyzer and Websearch, they're degrading from cache contention and bandwidth contention, but Bigtable actually benefits from the cache sharing. Go ahead.
>>: So just to make sure I understand, so you care about single machine
performance, ignoring potential contention on the network.
>> Lingjia Tang: Yeah, this is really just single machine, micro-architecture resource contention. So we're not looking at [indiscernible] contention. And for example, for Websearch, we're looking at leaf node performance.
>>: So this is sort of the request comes in, it's completely executed on the
machine?
>> Lingjia Tang: Yeah.
>>: It doesn't make any remote calls?
>> Lingjia Tang: Yeah. So basically, we use a trace of real world queries and we actually feed those queries to this machine.
>>: But do you have a sense of how much this impact actually translates to the
end-to-end latency, which is what the user sees?
>> Lingjia Tang: I don't have an exact number for how it translates to the complete end-to-end latency. I think for Websearch, the latency on the back end machine is not going to be the dominant part, because the network latency is so big that the latency on a single server might not mean that much. But the key for the back end is we really want to improve performance to reduce cost. So it's not so much about performance, it's a lot about cost. But we do care about the latency of those queries on those machines. You definitely want the best performance for those queries on those machines.
All right. So there are constructive and destructive impacts for the three applications. And then we looked at inter-application sharing. So basically, what's the impact of sharing among applications? In this case, we co-locate another application with multiple threads onto the same architecture and measure the performance impact.
So here are the results. Each graph is a single application. For Content Analyzer, there are three different running scenarios: running alone, running with another application that does image processing for Street View, and running with another benchmark. You can see that there's performance degradation, because the performance basically drops from running alone to running with other applications.
Another really interesting thing about this graph that I want to point out is that the optimal thread-to-core mapping changes when the co-runner changes. When Content Analyzer is running alone, its best performance comes from spreading its threads across the two sockets. But running with that co-runner actually makes it benefit from clustering all its threads together.
And similarly for Websearch. It changes in different settings, and in this case it prefers to share cache with its own threads but not share bandwidth with its own threads.
Bigtable always benefits from the sharing. And it's also interesting to notice that the difference between thread-to-core mappings can be pretty significant; it's up to 40 percent for Bigtable in this case.
So a quick summary. I have a lot more data from this kind of study in the publications that I don't have time to show here, but I want to highlight some of the things we found. Contrary to the prior work, we found that memory resource sharing actually matters a lot for large scale datacenter applications, and the performance swings up to 40 percent simply based on different thread-to-core mappings.
And the interesting thing is the optimal mapping changes when the application changes, when a co-runner changes, when the architecture changes. I highlight those two because basically what they are saying is that there's up to 40 percent performance opportunity that we leave on the table without thinking about thread-to-core mapping, without thinking about this interaction between the application and the micro-architecture. So the next question we ask is: how do we achieve the optimal thread-to-core mapping?
>>: All these applications that you have mentioned, do they involve any
synchronization?
>> Lingjia Tang: Yes, but not so much. There is, of course, synchronization in those applications, but they're not hugely synchronization-bound applications.
And it would be really interesting to see, you know, lock contention, that kind of performance impact on those applications, but we're only focusing on memory resource contention in those applications.
>>: You say that these are large scale datacenter applications, but then you look at single box performance.
>> Lingjia Tang: Right.
>>: So what is really different about these applications than some other
applications that would not run at large scale in a datacenter? Is it the
workloads are different, the memory access patterns are different that really
makes this now interesting compared to non-datacenter applications?
>> Lingjia Tang: There are actually publications coming out of Microsoft, really multiple publications, comparing those workloads with SPEC in terms of, you know, how they access different kinds of functional units or how their memory access patterns differ. And I think you will find there's quite a difference between those applications and SPEC, and there's new work coming out thinking about how to use these different characteristics to actually design new servers, new architectures. For example, the instruction cache is really important for those kinds of applications, which is kind of different from SPEC. So there's a lot of difference. But we can talk offline about it.
So how do we achieve this optimal thread-to-core mapping for better performance? I'm going to talk really briefly about this section. We basically manage the resource sharing by doing intelligent thread-to-core mapping of the application on the architecture to improve performance.
There are two approaches that we present in the paper. One is a heuristic-based approach and the other is an adaptive approach. For the heuristic-based approach, the key idea is to figure out the application's characteristics and map the threads according to those characteristics and the machine's memory topology, so we know what kind of caches this machine actually has, for example.
And we identify three important application characteristics, including the amount of data sharing among threads in the application, the cache usage, and the bandwidth usage.
>>: I have a question. I thought you just said that this thing depends on the
co-runner. So how can you make this decision in isolation?
>> Lingjia Tang: That's a good question. I'm going to get to that.
>>: Okay.
>> Lingjia Tang: So the heuristic-based approach basically says, say you have an application and you have a co-runner. The key idea is to compare the potential contention within the application versus the contention between the applications, and then you choose the thread-to-core mapping that minimizes this contention.
Because if you actually have a lot of data sharing, then you potentially prefer to just share data with your own threads instead of sharing cache with other applications. And if you have a lot of contention among your own threads, and let's say another application comes along with less contention, then you potentially want to share cache with that other application. So this is how we actually decide the mapping.
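Just to make the flavor of that heuristic concrete, here is a tiny hypothetical sketch in C; the profile fields, the scoring, and the numbers are illustrative assumptions, not the actual model from the paper.

    #include <stdio.h>

    struct app_profile {
        double data_sharing;   /* how much data the app's threads share */
        double cache_usage;    /* cache footprint pressure, normalized */
        double bw_usage;       /* memory bandwidth demand, normalized */
    };

    enum mapping { CLUSTER_OWN_THREADS, SPREAD_ACROSS_SOCKETS };

    /* Compare (roughly) contention inside the application against contention
       it would see next to the co-runner, and prefer to keep threads together
       when they share data or when the co-runner is the bigger threat. */
    static enum mapping choose_mapping(struct app_profile self,
                                       struct app_profile corunner) {
        double intra = self.cache_usage + self.bw_usage - self.data_sharing;
        double inter = corunner.cache_usage + corunner.bw_usage;
        return (intra < inter) ? CLUSTER_OWN_THREADS : SPREAD_ACROSS_SOCKETS;
    }

    int main(void) {
        struct app_profile latency_app = {0.6, 0.4, 0.3};  /* made-up numbers */
        struct app_profile batch_app   = {0.1, 0.9, 0.8};
        puts(choose_mapping(latency_app, batch_app) == CLUSTER_OWN_THREADS
                 ? "cluster own threads on one shared cache"
                 : "spread threads across sockets");
        return 0;
    }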
So you need to know the application characteristics of all co-runners and you
need to know the memory topology. So again -- okay.
>>: So this is fixed for a co-runner. So you pick a co-runner already?
>> Lingjia Tang: So this is fixed for the architecture. But depending on the co-runner, you may go down different paths of deciding how to map.
>>: Right. So do you know the co-runner? You are not choosing the co-runner.
>> Lingjia Tang: Yeah, that's a good question. We're not choosing the co-runner. That's a different research question. We basically know that we need to manage those multiple applications, and then we decide how do we actually --
>>: Can I ask a question?
>> Lingjia Tang: Go ahead.
>>: In these warehouse scale computing scenarios, is the set of co-runners more or less static, remaining unchanged over a long period, or does it change very fast?
>> Lingjia Tang: They're really long-running applications. So once you decide the co-running situation, it potentially won't change for days. But --
>>: Okay, cool.
>> Lingjia Tang: But they will actually change. If your program crashes, for example, the job mapper decides where the new tasks go. They basically use a bin packing algorithm, and they decide where to map those applications onto the machines. In that case, you actually have the co-runner change. But once it's running, it's going to run for a while.
So, to address your question, this is architecture specific, it potentially changes with the co-runner, and it requires profiling. So it's doable, but it's not necessarily optimal. So we propose adaptive TTC mapping. The insight is that because the optimal mapping changes when a lot of things change, the co-runner changes, the architecture changes, why don't we just do it online?
So the key idea is we have a competition heuristic, we have a learning phase where we basically go through the search space in a smart way, and then we pick the best thread-to-core mapping and we execute it for a while. And we basically loop this.
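A minimal sketch of that learn-then-execute loop might look like the following; apply_mapping() and measure_throughput() are placeholder stubs, since the talk doesn't spell out those mechanisms.

    #include <stdio.h>

    /* Placeholder stubs: a real system would set thread affinities (e.g., with
       pthread_setaffinity_np) and read an application-level metric or IPC. */
    static void apply_mapping(int candidate) { printf("trying mapping %d\n", candidate); }
    static double measure_throughput(double seconds) { (void)seconds; return 1.0; }

    static void adaptive_mapping(int n_candidates, int rounds) {
        for (int r = 0; r < rounds; r++) {
            int best = 0;
            double best_perf = -1.0;

            /* learning phase: briefly sample each candidate thread-to-core mapping */
            for (int c = 0; c < n_candidates; c++) {
                apply_mapping(c);
                double perf = measure_throughput(1.0);
                if (perf > best_perf) { best_perf = perf; best = c; }
            }

            /* execution phase: run the winner for a while, then re-learn, since
               co-runners and machines change over time */
            apply_mapping(best);
            measure_throughput(60.0);
        }
    }

    int main(void) {
        adaptive_mapping(4, 1);   /* a deployment would loop indefinitely */
        return 0;
    }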
And then, to address the earlier question, datacenter applications are long running and they have very steady phases in general. But they may change co-runners with different tasks, and the same binary could potentially be mapped to different architectures. So this kind of approach is more dynamic in addressing those kinds of changes in the environment.
So some evaluation for the adaptive mapping. These are real applications on real, production hardware. The performance is on the Y axis, the X axis is the workload. The blue bar is the average random mapping, the green is adaptive, and the yellow is the optimal mapping you could do for those kinds of workloads.
And the baseline is the worst case mapping. So basically, this is significantly better than the worst case; you don't want the worst case, basically. And we're really close to the optimal, and we're beating random mapping on both architectures.
>>: Are those bars significant? Do you do multiple runs here? It's probably variability, right?
>> Lingjia Tang: Yeah, so about the benchmarks here. We benchmarkized our applications, so we basically make sure they always repetitively run the same kind of trace that we're using, and between benchmark rounds there is about one percent difference.
>>: So what about changes in the workload? Say the binary's the same, but depending on what requests are coming in, it might change --
>> Lingjia Tang: So to address that, we basically use really long traces of queries. So it's basically kind of averaged out. We're looking at the average performance of those queries.
All right. Section summary. We presented adaptive mapping to actually improve the performance of those applications. And I want to highlight the importance of taking advantage of the interactions among applications and between applications and the micro-architecture, and I think this is where a lot of runtime systems will come in to actually improve performance in this kind of emerging domain.
So I talked a little bit about improving performance. I want to switch gears a little bit and talk about server utilization. How do you actually improve server utilization?
When we think about improving performance, we manage, at the application level, how the applications actually share the common memory resources on the multicore architectures.
For improving server utilization, the goal is: can we actually manipulate application characteristics, using compilation techniques or other techniques, to reduce the memory contention and improve server utilization?
So here we present a compilation technique to manipulate application characteristics so we can facilitate more workload consolidation on those kinds of machines.
So here, I've shown this graph before, but basically I want to highlight the goal of this work. Again, option one is to disallow co-location, so each application is running on a single dedicated server and you have idle cores on each server. You have really good performance but low utilization. The other option is that we co-locate those applications, but we potentially have really significant performance degradation because of the memory resource contention.
And for latency sensitive applications, we don't do this, because we want to deliver acceptable quality of service.
So the goal here is to mitigate the contention the low priority application generates for the high priority application, so that the high priority application, the latency sensitive application, can deliver acceptable quality of service.
When we do this, we can facilitate more safe co-locations, meaning co-locations where the high priority application delivers satisfactory QOS, satisfactory performance.
And with this we could actually improve utilization. Even if we sacrifice a little bit of the low priority application's performance, we still gain this kind of utilization, right, because if you compare this graph to this graph, that's the part of the cores that were previously idle, and we basically extract utilization from that part of the architecture.
So that's the goal of this part of the work, improving server utilization. The research questions then come down to how we do this. Can we actually mitigate the contention by changing the application's characteristics? We want to make the low priority application less contentious. How do we do this? Can we identify the code regions of the low priority application that are really contentious, and how do we change their characteristics?
So compiling for niceness is our approach. It's the first compilation approach to address this kind of challenge for multiple co-running applications in this kind of domain.
To highlight the technique: we pinpoint the contentious code regions and then we apply novel compilation techniques to manipulate their contentious characteristics, to reduce the contention this low priority application may generate when it's co-running with other applications, so we can actually allow it to co-run with high priority applications.
So some of the key insights before I go into the details of this work. Traditionally, when people think about a compiler, they think we're compiling an application that is going to be the single application running on this architecture. So we focus on performance a lot.
But in this kind of new domain, you have a lot of workloads, you want to consolidate those workloads on a server, and the number of cores on those servers is growing with every generation. So there's a lot of pressure; we can't really overlook this trend of wanting to consolidate workloads in this domain.
In this kind of domain, there are more objectives you may need to think about when you think about the compiler. How aggressive should you be? An application can really aggressively use memory resources and get the performance, but is that really the best way to do it? What about the influence this application has on the other co-running applications?
>>: Traditionally, you can imagine having OS priorities and just scheduling
things, scheduling the one that should contend at a lower priority. Why is
that not a good solution in this case?
>> Lingjia Tang: Well, so OS has -- well, I would say mostly because you want to change the application characteristics of the low priority application, and the OS is not really tasked with doing that. And here, it's not really a time sharing kind of situation. Traditionally, you would think about nice as applications time sharing a resource; you can tone down the priority of one application and do the time sharing. But here, two applications are co-running at the same time. They're always running simultaneously. And also, a lot of challenges come with how to do this in the OS, but I'm going to talk about it a little bit more.
>>: A small follow-up question triggered by your response. Is it true that they don't do time sharing at all on an individual core in these warehouse scale computing scenarios?
>> Lingjia Tang: There's definitely some amount of time sharing. But not so much, because if you have fewer application threads than the available cores, then you don't have to do time sharing in a lot of situations.
>>: It looks like it's a matter of billing for true cost of [indiscernible] if
you did datacenter billing, how would you bill to get niceness into the mix?
If you bill for megabytes of memory and for a [indiscernible].
>> Lingjia Tang: Right, this work is mostly looking at datacenters that provide internet services. So you basically have control of those applications. You build them yourself, so it's easy to negotiate and say which application has higher priority. And we often do this in the datacenter.
But, I mean, I can imagine if you did this -- this may be a wild claim, I haven't really looked into it -- you could actually say, select my priority with different billing prices. I could have a batch application; you could charge me really cheaply, just get it done whenever. So those kinds of options.
>>: I have one more question on the basic subject. So a lot of the datacenters seem to assume that you only have one job per CPU, so what are they doing when they're overloaded, when there's more than one job per CPU that needs to be done? Do they not do time sharing? Do they hold it up at the queue until somebody's finished?
>> Lingjia Tang: My experience is they often do time sharing at that point, and then your performance is going to degrade a lot. And then the operator is going to have to come in and say we can't do this.
>>: So shouldn't the queue that's sending them to the --
>> Lingjia Tang: Yeah, they do a bin packing algorithm. Basically, the application will come with its resource requirements, saying how many megabytes it wants, how many CPUs it wants. And then the mapper, the job mapper, the scheduler, decides how to map those applications. If the machine still has available resources, then I'm going to map this job there.
>>: So it seems to me that, you know, a lot of the decades-long research in OS, all the [indiscernible] they used to make a fuss about, is just irrelevant in warehouse scale computing. Is that correct?
>> Lingjia Tang: I mean, I think OS people really need to think about, in this kind of domain, what kinds of priorities and --
>>: The optimization criterion was not quality of service on a single job. It's fairness. And so the operating systems community optimized for an important part of the space. And that is different -- now it's how fast you get the answer back, or someone's not going to click on your website.
>> Lingjia Tang: Thanks. So another insight is, in this kind of domain, when you sacrifice a lot of the performance of a low priority application, you can facilitate safe co-location and win a lot of utilization in the end. As at other times, when you sacrifice a little bit of your self-interest in performance, you win big in the big picture. That's a life lesson: being nice is sometimes good for the big picture.
And also, in these kinds of datacenters, we know the applications we're running, they're long running, and we have the source code available.
So we call our system QOS-Compile, and here's an overview of the system. We have a profiler that identifies the contentious regions of those low priority applications, and we also have a compiler that targets those contentious regions and compiles them to mitigate their contentious nature. So I'm going to talk a little bit about the profiler first.
The profiler is based on hardware performance counters, because they're fairly low overhead. The idea is that, based on how aggressively an application is using certain memory resources, we predict how contentious it may be when it's running with other applications.
So the application is running, and the profiler, based on the performance counters, predicts a contention score for each execution phase. We identify the execution phases that have a high contention score and find the associated code regions. There's a lot of thinking and work that goes into how to build this kind of prediction model; we use regression, and here's the model. I can talk more offline if you're interested.
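To give a sense of the shape of such a model, here is a hedged sketch: a linear combination of counter-derived rates producing a contention score. The specific counters and coefficients are placeholders, not the fitted model from the paper.

    #include <stdio.h>

    struct counter_sample {
        double llc_miss_rate;   /* last level cache misses per instruction */
        double llc_ref_rate;    /* last level cache references per instruction */
        double bus_util;        /* memory bus / bandwidth utilization estimate */
    };

    /* Linear model: score = b0 + b1*miss + b2*ref + b3*bus. In the real system
       the coefficients come from regressing against measured degradation of
       co-runners; these values are purely illustrative. */
    static double contention_score(struct counter_sample s) {
        const double b0 = 0.05, b1 = 6.0, b2 = 1.5, b3 = 0.8;
        return b0 + b1 * s.llc_miss_rate + b2 * s.llc_ref_rate + b3 * s.bus_util;
    }

    int main(void) {
        struct counter_sample phase = {0.010, 0.050, 0.30};   /* one sampled phase */
        printf("predicted contention score: %.3f\n", contention_score(phase));
        return 0;
    }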
But let's take a look at the performance of this model. Here are two prior works, published work that uses last-level cache miss rate and reference rate to predict how contentious an application is. And this is our model.
As you can see, we have a much better correlation coefficient from the linear regression. And --
>>: Is that taken from your interferences between the different jobs, or what are the inputs to your model?
>> Lingjia Tang: Right, so the inputs to the model are the performance counters, for example the last level cache miss rate of the application. There's a lot of reasoning behind why we select certain performance counters. The output is a contention score. So basically, imagine an application runs; we calculate the score, and we compare the score with the average degradation when this application runs with co-runners.
>>: So you're actually predicting something different as well as using
multiple inputs to help you, right?
>> Lingjia Tang: Yeah.
>>: Okay.
>>: So is this on a per-region basis? I'm not sure I understand what the timeline is.
>> Lingjia Tang: Oh, sorry. This is actually at the application level. So average application, average cache miss rate, for example, for the entire --
>>: How do you do it by region? How do you identify contentious regions?
>> Lingjia Tang: I have a graph, but I'm not sure if I have it in the slides. But basically, when it's running, every millisecond we sample the performance counters. So every millisecond you get a contention score. And by region, we didn't do instrumentation to find -- we're basically just looking for the execution phase that corresponds to that part, the code region that corresponds to that part of the execution phase, which is not --
>>: So you do a phase analysis as part of this, then?
>> Lingjia Tang: We did Pin, we used Pin to actually -- so the way we did it is basically a two pass kind of analysis. In one pass, we get the contention score of each execution phase, and we also record the instructions executed during the phase. So you get the contention score every millisecond and the instructions executed every millisecond. The second pass is the Pin instrumentation, basically counting the instructions executed, and then we identify the code regions that are executing the matching instructions from the first pass. I think we have more details in the paper, but we could talk more offline.
There's also a phase graph that would show the phases. We actually do a pretty good job identifying the phases.
So that's the profiler. Let's take a look at the compiler. The intuition behind the compiler is that we use rate reduction to reduce the contentious nature of a contentious region. We basically reduce the memory request rate of those regions. And here's an illustrative example to show the intuition behind why, when you reduce the memory request rate, you reduce the contentious nature of the application.
So you have two applications running, application A and B, and assuming a certain initial last level cache content and access order, you have a certain execution time for B, for example. When you slow down A's accesses, what you change is the access order, so that B's memory requests get prioritized. Because A is slowed down, B's memory requests get served earlier, and that has a big impact on the performance of B. So in this illustrative example, B's execution time is greatly reduced.
Based on this intuition, we have two techniques to reduce the memory request rate. One is padding: we basically insert NOPs into those code regions, so the idea is that the application executes a few instructions and then executes NOPs, and during that time no memory requests are issued.
The other one is nap insertion, so basically inserting naps. You can control it, let the application run for a certain amount of time and then let it sleep for a few milliseconds, and use those techniques to control the rate reduction.
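Here is a minimal sketch of what those two transformations amount to inside a hypothetical contentious loop; the chunk size and durations are made-up knobs, and in the real system the compiler inserts the equivalent automatically.

    #define _POSIX_C_SOURCE 199309L
    #include <stdlib.h>
    #include <time.h>

    static void nap(long ns) {               /* nap insertion: sleep, so no     */
        struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
        nanosleep(&ts, NULL);                /* memory requests are issued      */
    }

    static void pad(long iters) {            /* padding: burn non-memory cycles */
        for (volatile long i = 0; i < iters; i++)
            ;                                /* stand-in for inserted NOPs      */
    }

    static void contentious_region(double *a, long n) {
        for (long i = 0; i < n; i++) {
            a[i] += 1.0;                     /* memory-intensive work           */
            if (i % 100000 == 99999) {
                nap(500000);                 /* e.g., 0.5 ms nap per chunk      */
                (void)pad;                   /* or call pad(...) instead        */
            }
        }
    }

    int main(void) {
        long n = 1000000;
        double *a = calloc((size_t)n, sizeof *a);
        if (a) { contentious_region(a, n); free(a); }
        return 0;
    }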
You may think those two techniques are really similar, because they basically both let the application execute for a while and then sleep for a while or execute NOPs for a while, so they should have the same kind of performance effect. But in our experiments, we found that that might not always be the case.
So here's some performance data showing padding and nap insertion in action. The Y axis here is the quality of service of the high priority application. The X axis is the execution rate of the low priority application.
From 1X to 0.4X, we slow down this application by inserting NOPs, padding, or napping. As we slow down this application, you can see the QOS of the high priority application improves. The three lines are napping with one millisecond naps, napping with ten millisecond naps, and padding. In this first case, we have really similar performance for the three techniques. In this case, however, you can see that napping every ten milliseconds actually outperforms the other two greatly.
So even if you slow down an application to the same execution rate, different techniques may have a different effect in terms of improving the QOS. The same rate reduction may not have the same effect.
It turns out this performance difference is because of the granularity at which you do the rate reduction. So let me illustrate. If you think about the performance of the high priority application, let's say it starts out really low, because the other application is generating interference, and then the napping or padding starts here: we start to slow down the low priority application.
In this case, the high priority application's performance only picks up gradually, instead of jumping up to the optimal performance. That's because when you just start napping or padding, there are still memory requests coming from the low priority application that are getting served in the system. So it takes a while for those memory requests to drain, and it also takes a while to warm up the cache for the high priority application before its performance picks up.
So what this really says is that when you slow down the application, the granularity you pick actually matters. A small, fine granularity for padding, in terms of cycles, may not really give you the best win for the slowdown you sacrifice. Also, comparing padding and napping, napping is easier to do, with more accurate timing control because you can use a timer, and it's more power efficient, because you're not actually executing NOPs.
So some of the evaluation for QOS-Compile, combining both identifying the code regions and applying the transformation to those applications. Here on the X axis, we have the high priority applications. The Y axis is the quality of service of the high priority application.
For each group, the first bar is when the application is running with the original LBM, with no padding, basically no QOS-Compile. The yellow bar is where we identify the code regions and insert naps into them, napping for ten milliseconds every ten milliseconds; the purple is napping for 20 milliseconds every ten milliseconds. As you can see, we greatly improve the performance of those high priority applications, up to 85 percent and 95 percent of their original peak performance.
Similar results, and there are more results in the paper. And let's look at the utilization we gain. These are the low priority applications when running with those high priority applications. Because we slow them down, they're not running at 100 percent of their execution rate, but we still gain significant utilization, around 40 percent in this case and 50 percent in this case, from cores that previously would have been idle if we didn't do this kind of co-location.
We also have really similar results for the Google applications, where Websearch gets back up to close to 90 percent in this case.
So the summary is: we pinpoint the contentious regions and we apply code transformations to those regions to mitigate the contention the low priority application generates for the high priority application, so we can improve server utilization, because we facilitate this kind of workload consolidation. And I want to stress multi-objective compilers in this kind of era, where you're actually looking at a lot more tradeoffs than we used to think about before.
All right. But --
>>: Quick question. So you just said inserting NOPs, sort of stretching out the execution. Have you thought at all about more sophisticated techniques or other things that could be done? You think about a compiler; a compiler can reason about the code and figure out different instruction schedules and things like that. Are there other things that could be done that are more sophisticated?
>> Lingjia Tang: I think potentially there are things that could be done. I was thinking about some of the optimizations you could consider, you know, memory hierarchy optimizations, compiler optimizations we do to optimize for the cache structure, or loop transformations that potentially change the memory characteristics. But when we looked at those kinds of optimizations, their impact in terms of mitigating contention is quite small compared to really slowing down the application. If you think about loop transformations, you're maybe looking at 6 or 8 percent of the performance. And for a 6 or 8 percent change in the memory characteristics and access pattern of your low priority application, the impact it has on the high priority application is really small compared to just slowing it down a lot. So I think potentially there are things to be done, but I chose to go this route because the benefit you get from those is not as big.
All right. So why be nice if it's not necessary? The last part of the work that I want to talk about is fairly new work that dynamically regulates the pressure we put on the memory subsystem.
The key observation is that compiling for niceness is a static technique, so it basically has to be conservative. You have to throttle down the application without knowing how much the high priority application may actually suffer from performance degradation or contention.
So why don't we do this reactively, online? We could dynamically manipulate the application characteristics based on the performance degradation we actually observe and the contention we actually detect. That way we can avoid unnecessary slowdown: if the high priority application is not really suffering from contention, we don't have to slow down the low priority application, and we can achieve more accurate QOS control, because we're actually monitoring the QOS online.
All right. So here's an overview of this technique. We have the compiler, which is profiling based: it identifies the contentious regions and instruments markers to invoke the runtime system. We also have the runtime system, which is tasked with monitoring the QOS, detecting contention, and doing feedback control of how much throttling, how much niceness, you want from this low priority application.
The compiler here is really similar to compiling for niceness. We use the same profiler to identify the contentious code regions, but instead of inserting NOPs or padding, we insert markers to invoke the runtime.
And the runtime system is where a lot of the interesting things happen. We have a monitor that's attached to the high priority application, which can monitor its performance counter information. It stores that in shared memory in the system, in a circular buffer. And the nap engine is the part that's attached to the low priority application.
So the low priority application can invoke the runtime, this nap engine. Basically, the nap engine reads the monitored information of the high priority application, detects the contention, and responds based on the contention detected and the QOS of the high priority application.
So I'm going to talk a little bit more of this part, because this is actually
really interesting.
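The structure she's describing could be sketched roughly as follows, collapsed into one process for illustration; the real system uses a shared-memory circular buffer between processes, and the QoS numbers here are placeholder stubs.

    #define _POSIX_C_SOURCE 199309L
    #include <pthread.h>
    #include <stdatomic.h>
    #include <time.h>

    static _Atomic double observed_qos = 1.0;    /* normalized to solo performance */
    static const double qos_target = 0.90;

    static void nap_ms(long ms) {
        struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }

    /* Monitor side: attached to the high priority application, it would sample
       hardware performance counters; here a constant stands in for that. */
    static double sample_high_priority_qos(void) { return 0.85; }

    static void *monitor_thread(void *arg) {
        (void)arg;
        for (int i = 0; i < 100; i++) {
            atomic_store(&observed_qos, sample_high_priority_qos());
            nap_ms(1);
        }
        return NULL;
    }

    /* Nap engine side: invoked from the marked contentious regions of the low
       priority application; naps when the published QoS is below target. */
    static void maybe_nap(void) {
        if (atomic_load(&observed_qos) < qos_target)
            nap_ms(2);
    }

    int main(void) {
        pthread_t m;
        pthread_create(&m, NULL, monitor_thread, NULL);
        for (int i = 0; i < 50; i++)
            maybe_nap();              /* stand-in for the low priority hot loop */
        pthread_join(m, NULL);
        return 0;
    }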
So to detect the contention and react to it, we could have implemented different kinds of policies.
In this work, the first policy we tried is the simple way: just conservatively throttle down whenever the high priority application suffers QOS degradation. It could be a false positive, because the QOS degradation may not really be due to contention, but let's just do it really conservatively.
The other way is feedback control. We have different states. We execute for a while, then we go to a check state to see whether napping, our technique, actually has an impact on the QOS. If it does, then we continue to nap. If it doesn't, that means maybe there's QOS degradation, but it's not necessarily caused by contention, because our technique is not going to be able to reduce that contention or improve that QOS. So in that case we just let the low priority application execute.
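Here is a rough sketch of the difference between the two policies as a little state machine; the QoS reading, the nap length, and the thresholds are stand-ins, not the actual runtime's logic.

    #define _POSIX_C_SOURCE 199309L
    #include <time.h>

    static double read_hp_qos(void)  { return 0.88; }   /* placeholder reading */
    static void nap_low_priority(long ms) {
        struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }

    /* "Simple": conservatively nap whenever the high priority QoS is below target. */
    static void simple_step(double target) {
        if (read_hp_qos() < target)
            nap_low_priority(2);
    }

    enum lp_state { EXECUTE, NAP, CHECK };

    /* "Targeted": feedback control. Remember the QoS seen while napping, then
       periodically stop napping (CHECK); resume napping only if QoS is actually
       worse without it, i.e., only if our contention was the real cause. */
    static enum lp_state targeted_step(enum lp_state s, double target,
                                       double *qos_while_napping) {
        switch (s) {
        case EXECUTE:
            return (read_hp_qos() < target) ? NAP : EXECUTE;
        case NAP:
            nap_low_priority(2);
            *qos_while_napping = read_hp_qos();
            return CHECK;
        case CHECK:                       /* this step ran without napping */
            return (read_hp_qos() < *qos_while_napping) ? NAP : EXECUTE;
        }
        return EXECUTE;
    }

    int main(void) {
        double remembered = 1.0;
        enum lp_state s = EXECUTE;
        for (int i = 0; i < 10; i++) {    /* toy driver */
            simple_step(0.90);
            s = targeted_step(s, 0.90, &remembered);
        }
        return 0;
    }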
So let's take a look at those two techniques in action. Here, this is sphinx, which is a SPEC benchmark, running with a low priority application. This is with the ref input; the X axis is the time, and the Y axis is the normalized IPC. So if it's one, meaning there's no contention, it's running at peak performance.
Here, the red line is the QOS when sphinx is co-located with the original low priority application. And the blue line is when we apply reactive niceness, here using the simple policy, to the low priority application. So you can see the IPC improvement.
Similarly, here we have the green and the red. The red line is the same as before, the original performance. The green line is when the targeted reactive policy is applied to the low priority application. The big takeaway from this graph is that this line is very stable, around 90 percent, which is our target QOS. And comparing, especially comparing to the simple policy, they're both effective, but the targeted policy achieves much more accurate QOS control, because it's actually monitoring the feedback, the effect of throttling down on those applications.
>>: Is QOS defined on just the high priority application, or is it both the
high priority and the low priority? Because if it's just the high priority, I
could just say I'm not going to run the low priority, and I maximize QOS.
>> Lingjia Tang: But you actually sacrifice utilization.
>>: Sure. But if QOS is what you're trying to optimize, I guess the question is what's the objective function? Do you take into consideration both utilization and quality of service?
>> Lingjia Tang: I would think of the optimization question, if you want to formally define it, as: the QOS is the constraint, you want the QOS to meet a certain constraint, and with that constraint, you want to maximize the utilization. That's how I would think about this optimization. So for example, the constraint here is we really want to guarantee 90 percent of the QOS, and then we say, with that, give me the maximum performance you could actually give me.
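Written out as an optimization problem (my formalization of what she just said, not a formula from the talk):

    \max \; U_{\mathrm{low}}
    \quad \text{subject to} \quad
    \mathrm{QoS}_{\mathrm{high}} \;\ge\; \alpha \cdot \mathrm{QoS}_{\mathrm{high}}^{\mathrm{solo}},
    \qquad \text{e.g. } \alpha = 0.9 .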
But I want to say, I think this is a really nice way of thinking about it, because you're actually giving people a flexible spectrum for selecting the trade-off between QOS and utilization. You could say, I care about 80 percent, give me the maximum utilization there. So people basically have a knob they can work with in a datacenter.
So let's take a look at another graph comparing simple and targeted. Again, the red line is the QOS when co-located with the original. The blue line is when we apply simple, so it's actually fairly stable here. The green line is using targeted. So they have very similar QOS in this experiment. However, when we look at the nap duration decided by the nap engine for those two heuristics, the green line here, targeted, is much lower than simple, meaning we are doing much less slowing down to achieve the same performance. The nap duration is basically how long you slow down the application, how long you let the application sleep.
And this is because targeted is using feedback control that basically detects how much throttling down is actually necessary for the performance you want.
So again, they're both very effective, but targeted is better for improving utilization than simple.
>>: [indiscernible].
>> Lingjia Tang: So here. So this is -- I think this is every two milliseconds, we nap. So every two milliseconds, our napping duration is a little over 0.5. So it's 0.5 divided by --
>>: It's related to the nap time?
>> Lingjia Tang: Yes. So let's say every two milliseconds, you nap. Every two milliseconds, you run full speed, and then you do a nap of, for example, 0.5 milliseconds. Then your utilization is 2 over 2.5, which is 80 percent.
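To spell out that arithmetic, with the same numbers as her example:

    \text{execution rate} \;=\; \frac{t_{\mathrm{run}}}{t_{\mathrm{run}} + t_{\mathrm{nap}}}
    \;=\; \frac{2\ \mathrm{ms}}{2\ \mathrm{ms} + 0.5\ \mathrm{ms}} \;=\; 0.8 .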
So let's take a look at the evaluation comparing our reactive niceness to QOS-Compile. Again, two high priority applications: running with the original low priority application, running with reactive niceness, and running with QOS-Compile. You can see those two performances are really similar. However, when you compare the utilization, we achieve much, much higher utilization than QOS-Compile. This is because we are dynamically figuring out how much utilization we can get for the desired performance.
And QOS-Compile is static: you only see two bars because you basically decide how much you want this application to throttle down before even knowing the co-runner. It doesn't really change when you change co-runners unless you re-specify it. But this one is more dynamic: based on what kind of application the low priority application is running with, it throttles that application down accordingly.
Some more evaluation to go through. These are the utilization numbers and those are the QOS targets: we set about 80 percent or 90 percent, and we do a pretty good job maintaining 90 percent if that's our target.
All right. So some performance and efficiency numbers. The red bar is where we run the two applications on separate machines, disallowing co-location, and the green bar is where we run those applications together and use reactive niceness to control for 90 percent of the QOS. So we achieve 90 percent of the QOS, we gain utilization, and we're much more power efficient.
All right. So the section summary: reactive niceness basically takes advantage of static compilation techniques but also dynamically regulates the memory pressure on those kinds of machines to facilitate co-location and improve utilization.
So to conclude, we characterized the impact of memory resource contention on large scale internet service applications, and it turns out it is actually very, very important. Based on those insights, we showed the thread-to-core mapping technique that improves performance, and also compiling for niceness and reactive niceness, which manipulate application characteristics to improve utilization in the datacenter.
Some of the future work that I'm thinking about: we're also using a lot of managed runtimes in the datacenter. We haven't really looked into how to construct a managed runtime that's aware of these kinds of tradeoffs between QOS and utilization, and potentially there could be more flexible research that could be done in the managed runtime space.
Again, runtime system infrastructure in the WSC is still quite new and emerging. There are a lot of interesting things we could look at, particularly taking advantage of the interactions among applications and between applications and the micro-architecture.
Also, there are a lot of mobile systems emerging. The hardware has been evolving with a really fast turnaround time per generation for mobile services, and there's a lot of heterogeneity going on in those kinds of systems. We really need to think about how we build the software stack, the compiler and runtime system, to take advantage of those systems, and I think that would be really interesting to look at.
All right. So with that, I'll take questions.
>>: Your last graph had QOS and instructions per [indiscernible]; the number looks low.
>> Lingjia Tang: It's --
>>: Instructions per joule or instructions per second per [indiscernible]?
>> Lingjia Tang: I'll look into it. I don't, yeah, maybe you're right. I didn't -- basically, what we did is use a power meter. Yeah.
>>: The power meter gives you energy on the [indiscernible]. It doesn't give you power. It gives you energy, even though it's called a power meter. If you're using the Sandy Bridge one.
>> Lingjia Tang: We just basically used the one you can actually [indiscernible] for measuring, like, household appliances, that kind of thing.
>>: Oh, yeah?
>> Lingjia Tang: So we actually measured the system performance because we
really care about the whole entire system.
>>: So actually at the wall, so you're just measuring draw out of the wall, not performance counters?
>> Lingjia Tang: Yeah, not performance counters. I've never tried the performance counters. I guess that would be really interesting to try.
>>: The low priority application doesn't nap. Is that [indiscernible]?
>> Lingjia Tang: Um-hmm. Yeah.
>>: [inaudible].
>>: How robust in general are those results? Are you confident that the general insights you have drawn from this work apply to datacenters? Can you go into that?
>> Lingjia Tang: I think it's very general. I've shown some of the Google applications, and that is data we got with Google applications on their production machines. And I do know that they're really working on these kinds of techniques right now.
>>: So maybe to point out a little bit more on that too: you opened up and talked about low utilization, that was the motivation. A lot of the benchmarks were SPEC, which are just integer codes that don't do any IO. So I'm curious, do you have any idea about how much IO is going to have an impact on some of these things? Because the SPEC benchmarks are designed not to have IO, right?
>> Lingjia Tang: Right, so for reactive niceness we're not using any Google applications, because I was not doing an internship when I did this work. But for QOS-Compile we compared SPEC and the Google applications, and we achieved really similar results. That's why we are confident that we could use SPEC for these kinds of applications.
In terms of IO, whether reactive niceness will have an effect on IO, I don't have data. I can only tell you the reason why I didn't worry much about using SPEC: for a lot of applications like Websearch, the data mostly resides in main memory. You want to reduce the IO operations as much as possible for that kind of application, because you want lower latency. So you want everything to be in memory.
But there are certain applications where IO would be an interesting thing to look at. Gmail has a lot of IO. I don't want to really say -- I think the main idea may potentially be applicable to those kinds of applications, but I don't have data for it.