>> Ben Zorn: It's a great pleasure today to introduce Lingjia Tang from the University of Virginia. Lingjia is finishing her Ph.D., and she's been working in the area of improving the performance of datacenter, or warehouse scale, computers. She recently won the CGO Best Paper Award, and her work has been recognized by Google. She's worked closely with Google in a number of internships. So it's a great pleasure. Thank you.

>> Lingjia Tang: Hi. Thanks. I don't know if this is working. Thanks. So my name is Lingjia. Thanks for the introduction. To make it easier for you guys to pronounce it, it's pronounced like Ninja, but with an L: Lin-jia.

So today, I'll be talking about mitigating memory resource contention on commodity multicore machines to improve efficiency in modern warehouse scale computers. I also want to share some insights and discuss with you how we rethink the role of the compiler and runtime system in this kind of emerging computing domain.

Let me first give a little bit of background. I'm sure a lot of you are really familiar with the warehouse scale datacenter, the computing domain that my work focuses on. In the last couple of years especially, mega scale datacenters have gained tremendous momentum and have emerged as a very important computing domain. Companies including Microsoft, Google, and Facebook are constructing these mega scale datacenters that host large scale internet services such as web search and mail, as I've shown here. A lot of people are really familiar with these kinds of applications, and some people are really addicted to them. The Economist points out that datacenters have become vital to the functioning of society nowadays.

These datacenters are very expensive to construct and to operate. Typically, each datacenter may cost hundreds of millions of dollars, and the cost and the size of those datacenters have grown rapidly in the last few years. A few years ago, when we talked about this kind of datacenter, we would say it's a football-field-sized building housing tens of thousands of machines. Now we're talking about buildings tens of football fields in size, and hundreds of millions of dollars for a recent datacenter. This one shown here, a picture I found on the internet, is Apple's new datacenter, which actually cost a billion dollars. So these things are getting really, really expensive. And, of course, when people spend a billion dollars building a datacenter, they want it to be very efficient. So efficiency is really critical for the cost.

My work mostly focuses on memory resource contention on commodity multicore architectures, which is a very significant limiting factor for efficiency in warehouse scale computers, and I'm going to explain why. These are the server racks that populate the datacenters, and the machines we use in these datacenters are typically commodity multicore machines. On those machines, multiple processing cores share part of the memory subsystem. Each core here has its private L1 cache, but they share the L2 cache, the memory controller, and the bandwidth to memory.
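To make the sharing structure concrete, here is a minimal sketch (not from the talk) of how one might inspect which cores share each cache level on a Linux machine by reading sysfs. The eight-core loop assumes a Clovertown-like part; the sysfs layout is the standard Linux one.

```c
/* Illustrative sketch: print which CPUs share each cache level with each
 * core, recovering the kind of topology described above (private L1s,
 * shared L2/LLC). Assumes Linux sysfs and an 8-core machine. */
#include <stdio.h>

int main(void) {
    char path[256], buf[256];
    for (int cpu = 0; cpu < 8; cpu++) {          /* Clovertown-like: 8 cores */
        for (int idx = 0; idx < 4; idx++) {      /* cache indices            */
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_list",
                     cpu, idx);
            FILE *f = fopen(path, "r");
            if (!f) continue;                    /* this index may not exist */
            if (fgets(buf, sizeof buf, f))
                printf("cpu%d index%d shared with cpus %s", cpu, idx, buf);
            fclose(f);
        }
    }
    return 0;
}
```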
So if you have multiple applications running on this kind of architecture, the multiple application threads may [indiscernible] for some of the memory resources. For example, one application may evict the data of another application's threads out of the shared cache, causing significant performance degradation. In addition to significant performance degradation, this memory resource contention on these kinds of multicore machines also limits server utilization, and I'm going to explain why in the next couple of slides. But I want to point out here that a one percent improvement, either in application performance or in utilization, in a datacenter of this massive scale translates to potentially millions of dollars saved. So it's really important.

All right. So let me further illustrate the impact of memory resource contention on both performance and utilization. Here's a really simplified model. We have two applications and two options for how to run them. The first says each application gets its own dedicated server, as shown here. So we have two machines running two applications. Each application may or may not occupy all of the processing cores of its machine, so you have low utilization of those machines, but you get, you know, peak performance. The other option says let's allow co-location, so we improve the utilization of the server and reduce the number of machines we need, but we potentially have significant performance degradation because of the memory resource contention.

Things get a little trickier when you have latency sensitive applications, such as Websearch or mail. Those are applications where users [indiscernible] latency; you care about user experience, so you don't want to sacrifice a lot of the quality of service, the performance, of those applications. And there are potentially low priority applications in the datacenter, such as batch applications, that you don't care that much about, but we really do care about the latency sensitive applications. So if those applications cannot deliver acceptable QOS when we co-locate them, then we don't want to do that kind of co-location. So co-location is often disallowed for latency sensitive applications. The typical approach is to disallow co-location and sacrifice utilization for performance.

Just to show you that low utilization is pretty common and is actually one of the big challenges for datacenters, here's a graph from an HP production datacenter. One of the key observations you can see from this graph is that they spend 60 percent of the time, or even more than 60 percent of the time, running at 10 percent server utilization. That is quite low. Google has slightly better data; the official number is around 30 percent utilization. I was talking to an IBM fellow, and they were saying IBM production datacenters see around 8 to 12 percent average utilization. So those are really low numbers.

So why is low utilization expensive? If you look at the total cost of ownership of a datacenter, purchasing servers is a huge chunk of that cost. On top of that, constructing the datacenter is a big chunk too. And then the power. So low utilization means that, given the same amount of work, you need more machines to run the work and a bigger datacenter, and those are very costly.
So you want to improve utilization in the datacenter. Therefore, mitigating contention is really critical for improving both performance -- go ahead.

>>: Can you go back one slide? I just want to understand it a little bit better. Would it be fair to say that the server cost is a recurring cost, because, you know, servers keep failing and you keep buying more and more servers, whereas the other stuff, excluding the power at the datacenter, is basically a fixed cost?

>> Lingjia Tang: Oh, this is the total cost of ownership breakdown. It assumes you have the server for four or five years and amortizes the cost.

All right. So the big picture of the work that I'm going to present here, my dissertation work: first we want to understand the interaction between the datacenter applications and the underlying memory subsystem. On top of that, based on that understanding [indiscernible], we want to manage resource sharing between application threads so we can improve application performance. We also want to manipulate the application characteristics so we can facilitate more co-location and workload consolidation on these kinds of servers to improve utilization. I have publications in those domains, so I'm going to briefly talk about those too and then focus most of the talk on this part of the work. I also have other work related to improving efficiency in the datacenter. Tomorrow, Jason is going to talk a lot about that very interesting work, and you can hear a lot from him.

So the outline of the talk: I've discussed the motivation and the goal of the work, so I'm going to talk first about the characterization of those applications and how they interact with the memory subsystem.

Let's take a deeper look at memory resource contention on these architectures. This is the Intel Xeon Clovertown, which is actually widely deployed in Google production datacenters. On this kind of machine, we have two sockets, and each socket has four processing cores. Two cores share a last level cache, four cores on a socket share a bus, and the two sockets share a memory controller. So you have three major shared components on this architecture.

One interesting thing to notice is that the sharing could potentially have a constructive or a destructive performance impact. If threads belong to the same application and share data, and the data fits nicely into the shared cache, then they basically prefetch for each other and reduce the coherence traffic and the bandwidth traffic. So that's a good thing. But they may also contend for those resources, and if that happens, it's a destructive performance impact for those applications. Another interesting thing is that the thread-to-core mapping determines which resources are shared among multiple threads. In this simple example, these two threads share all three resources. If I move one thread to another core, they only share the memory bus and the memory controller. If I move the thread to another socket, they only share the memory controller. So it's a really interesting feature of this kind of architecture.

So the research question of this first part of the talk is: what's the performance impact of this thread-to-core mapping and memory resource sharing on application performance for the emerging large scale datacenter applications that we care about?
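As an illustration of how a thread-to-core mapping is actually imposed, here is a minimal sketch using Linux CPU affinity. The specific core numbers, and the assumption that cores 0 and 1 share a last-level cache, are hypothetical and machine dependent; they are not taken from the talk.

```c
/* Minimal sketch of controlling thread-to-core mapping with CPU affinity.
 * Core numbering is machine specific; the "share an L2/LLC" comment below
 * assumes a Clovertown-style layout and is only for illustration. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                       /* pin this thread to one core */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    printf("thread pinned to core %ld\n", core);
    /* ... application work ... */
    return NULL;
}

int main(void) {
    /* "Clustered" mapping: put two threads on cores that (hypothetically)
     * share a last-level cache; a "spread" mapping would pick cores on
     * different sockets so they only share the memory controller. */
    long cores[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```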
In some of the prior work, people have been using SPEC, and they conclude there's significant performance degradation due to memory resource contention, and there's a lot of work in the hardware community based on that. But there's recent work in PPoPP 2010 that concludes that cache sharing does not matter for contemporary multithreaded applications. So our question is, you know, what about the large scale emerging datacenter applications that we care about? What's the performance impact for those applications? Let me show you some of the key results we found. There are more results and details on those applications in my publications, so we can talk offline.

So this is intra-application sharing, meaning we're interested in the sharing among threads that belong to the same application. Here are the three large scale applications we use, including Websearch, Bigtable, and Content Analyzer, a semantic analyzer. In each case, we run the application on the Clovertown machine shown previously in three different sharing configurations. From left to right, you can think of it as more and more sharing among threads that belong to the same application. The Y axis is the normalized performance.

One thing to notice here is that there's a big performance swing, up to 22 percent, for those applications. For example, the semantic analyzer degrades with more sharing, indicating there's contention among its threads; the performance suffers from the contention. Websearch as well. But there's also an interesting thing to see from this graph: there's both constructive and destructive impact. If you look at Content Analyzer and Websearch, they're degrading from cache contention and bandwidth contention, but Bigtable actually benefits from the cache sharing. Go ahead.

>>: So just to make sure I understand, you care about single machine performance, ignoring potential contention on the network.

>> Lingjia Tang: Yeah, this is really just single machine, micro-architecture resource contention. We're not looking at [indiscernible] contention. And for example, for Websearch, we're looking at leaf node performance.

>>: So this is sort of: the request comes in, and it's completely executed on the machine?

>> Lingjia Tang: Yeah.

>>: It doesn't make any remote calls?

>> Lingjia Tang: Yeah. So basically, we use a trace of real world queries, and we actually feed those queries to this machine.

>>: But do you have a sense of how much this impact actually translates to the end-to-end latency, which is what the user sees?

>> Lingjia Tang: I don't have an exact number for how it translates to the real world, the complete latency. I think for Websearch latency, the back end machine is not going to be the dominant part of the latency, because the network latency is so big that the latency on a single server might not mean that much. But the key for the back end is that we really want to improve performance to reduce cost. So it's not so much about latency, it's a lot about cost. But we do care about the latency of those queries on those machines. You definitely want the best performance for those queries on those machines.

All right. So there's constructive and destructive impact for the three applications. Then we look at inter-application sharing: basically, what's the impact of sharing among applications?
So in this case, we actually co-locate another application with multiple threads onto the same architecture -- what's the performance impact? Here are the results. Each graph is a single application. Content Analyzer has three different kinds of running scenarios: running alone, running with [indiscernible], an image processing application for Street View, and running with another benchmark. You can see that there's performance degradation, because the performance basically drops from running alone to running with other applications.

Another really interesting thing about this graph that I want to point out is that the optimal thread-to-core mapping changes when the co-runner changes. When Content Analyzer is running alone, its best performance comes from spreading its threads across the two sockets. But running with [indiscernible], it actually benefits from clustering all its threads together. It's similar for Websearch: it changes in different settings, and in this case it prefers to share cache with its own threads but not share bandwidth with its own threads. Bigtable always benefits from the sharing. It's also interesting to notice that the difference between thread-to-core mappings can be pretty significant -- up to 40 percent for Bigtable in this case.

So, a quick summary. I have a lot more data from this kind of study in the publication that I don't have time to show here, but I want to highlight some of the things we found. Contrary to the prior work, we found that memory resource sharing actually matters a lot for large scale datacenter applications, and the performance swings up to 40 percent simply based on different thread-to-core mappings. And the interesting thing is that the optimal mapping changes when the application changes, when the co-runner changes, and when the architecture changes. I highlight those two points because what they are saying is that there's up to 40 percent of performance opportunity left on the table if we don't think about thread-to-core mapping, about this interaction between the application and the micro-architecture. So the next question we ask is: how do we achieve the optimal thread-to-core mapping?

>>: All these applications that you have mentioned, do they involve any synchronization?

>> Lingjia Tang: Yes, but not so much. There is, of course, synchronization in those applications, but they're not heavily synchronization-bound applications. It would be really interesting to see, you know, lock contention and that kind of performance impact on those applications, but we're only focusing on memory resource contention here.

>>: You say that these are large scale datacenter applications, but then you look at single box performance.

>> Lingjia Tang: Right.

>>: So what is really different about these applications than some other applications that would not run at large scale in a datacenter? Is it that the workloads are different, the memory access patterns are different, that really makes this interesting compared to non-datacenter applications?

>> Lingjia Tang: There are actually publications coming out of Microsoft, really multiple publications, comparing those workloads with SPEC in terms of, you know, how they use different kinds of functional units or how the [indiscernible] memory access patterns differ.
And I think you will find there's quite a difference between those applications and SPEC, and there's new work coming out thinking about how to use these different characteristics to actually design new servers, new architectures. For example, the instruction cache is really important for these kinds of applications, which is kind of different from SPEC. So there's a lot of difference, but we can talk offline about it.

So how do we achieve this optimal thread-to-core mapping for better performance? I'm going to talk really briefly about this section. We basically manage the resource sharing by doing intelligent thread-to-core mapping of the application on the architecture to improve performance. There are two approaches that we present in the paper: one is a heuristic-based approach and the other is an adaptive approach.

For the heuristic-based approach, the key idea is to figure out the application's characteristics and map the threads according to those characteristics and the machine's memory topology -- what kind of caches this machine actually has, for example. And we identify three important application characteristics: the amount of data sharing among threads in the application, the cache usage, and the bandwidth usage.

>>: I have a question. I thought you just said that this thing depends on the co-runner. So how can you make this decision in isolation?

>> Lingjia Tang: That's a good question. I'm going to get to that.

>>: Okay.

>> Lingjia Tang: So the heuristic-based approach basically says: you have an application and you have a co-runner. The key idea is to compare the potential contention within the application versus the contention between the applications, and then you choose the thread-to-core mapping that minimizes this contention. Because if you have a lot of data sharing, then you potentially prefer to just share data with your own threads instead of sharing cache with the other application. And if you have a lot of contention among your own threads, and another application comes along with less contention, then you potentially want to share cache with that other application. So this is how we decide the mapping. You need to know the application characteristics of all co-runners and you need to know the memory topology.
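A toy sketch of the heuristic idea as described here: compare an estimate of intra-application contention against inter-application contention and pick a clustered or spread mapping accordingly. The profile fields, the contention estimates, and the decision rule are illustrative stand-ins, not the actual model from the paper.

```c
/* Hypothetical illustration of a heuristic thread-to-core mapping decision. */
#include <stdio.h>

typedef struct {
    double data_sharing;   /* fraction of accesses to shared data (profiled) */
    double cache_usage;    /* e.g., cache footprint pressure                 */
    double bw_usage;       /* e.g., memory bandwidth demand                  */
} app_profile;

typedef enum { MAP_CLUSTERED, MAP_SPREAD } ttc_mapping;

static double intra_contention(const app_profile *a) {
    /* high cache/bandwidth demand with little sharing => own threads fight */
    return a->cache_usage + a->bw_usage - a->data_sharing;
}

static double inter_contention(const app_profile *a, const app_profile *b) {
    return (a->cache_usage + a->bw_usage) * (b->cache_usage + b->bw_usage);
}

ttc_mapping choose_mapping(const app_profile *a, const app_profile *b) {
    /* If fighting my own threads hurts less than fighting the co-runner,
     * cluster my threads on the shared cache; otherwise spread them out. */
    return intra_contention(a) < inter_contention(a, b) ? MAP_CLUSTERED
                                                        : MAP_SPREAD;
}

int main(void) {
    app_profile me = {0.6, 0.3, 0.2}, corunner = {0.1, 0.8, 0.7};
    printf("%s\n", choose_mapping(&me, &corunner) == MAP_CLUSTERED
                       ? "cluster own threads" : "spread across sockets");
    return 0;
}
```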
>>: So this is fixed for a co-runner. So you pick a co-runner already?

>> Lingjia Tang: This is fixed for an architecture. But depending on the co-runner, you may go down different paths of deciding how to map.

>>: Right. So do you know the co-runner? You are not choosing the co-runner.

>> Lingjia Tang: Yeah, that's a good question. We're not choosing the co-runner. That's a different research question. We basically know that we need to manage those multiple applications, and then we decide how we actually --

>>: Can I ask a question?

>> Lingjia Tang: Go ahead.

>>: In these warehouse scale computing scenarios, is the set of co-runners more or less static, remaining unchanged over a long period, or does it change very fast?

>> Lingjia Tang: They're really long-running applications. So once you decide the co-running situation, it potentially won't change for days. But --

>>: Okay, cool.

>> Lingjia Tang: But they will actually change. If your program crashes, for example, the job mapper, with new tasks coming in, decides where those tasks go -- they basically use a bin packing algorithm to decide where to map those applications onto machines. In that case, the co-runner changes. But once it's running, it's going to run for a while.

And this approach, to address your question, is architecture specific, potentially changes with the co-runner, and requires profiling. So it's doable, but it's not necessarily optimal. So we propose adaptive thread-to-core mapping. The insight is that because the optimal mapping changes when a lot of things change -- the co-runner changes, the architecture changes -- why don't we just do it online? The key idea is that we have a competition heuristic: we have a learning phase where we go through the search space in a smart way, then we pick the best thread-to-core mapping and execute it for a while, and we basically loop this. To address the earlier question: datacenter applications are long running and have very steady phases in general, but they may change co-runners across different tasks, and the same binary could potentially be mapped to different architectures. So this kind of approach is more dynamic in addressing those kinds of changes in the environment.

Some evaluation for the adaptive mapping. These are real applications on real, production hardware. Performance is on the Y axis, and the X axis is the workload. The blue bar is the average random mapping, the green is adaptive, and the yellow is the optimal mapping you could do for those workloads. The baseline is the worst case mapping. So basically, this is significantly better than the worst case -- you don't want to do the worst case, basically. And we're really close to the optimal, and we're beating random mapping on both architectures.

>>: Are those bars significant? Do you do multiple runs here? There's probably variability, right?

>> Lingjia Tang: Yeah. So we benchmark-ized our applications: we basically make sure they repetitively run the same kind of trace that we're using, and between benchmark rounds there's about a one percent difference.

>>: So what about changes in the workload? Say the binary's the same, but depending on what requests are coming in, it might change --

>> Lingjia Tang: So to address that, we basically use really long traces of queries, so it kind of averages out. We're looking at the average performance over those queries.
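A rough sketch of the adaptive loop as described: briefly try each candidate mapping online, keep the best-performing one for a longer execution interval, and repeat. apply_mapping() and measure_perf() are placeholders for the affinity calls and hardware-counter reads; the interval lengths and number of candidate mappings are made up.

```c
/* Sketch of an adaptive thread-to-core mapping loop (learning phase, then
 * execution phase, repeated). Not the paper's actual implementation. */
#include <stdio.h>
#include <unistd.h>

#define N_MAPPINGS 3

static void apply_mapping(int m)  { (void)m; /* set thread affinities here */ }
static double measure_perf(void)  { return 1.0; /* e.g., aggregate IPC     */ }

int main(void) {
    for (;;) {
        /* learning phase: sample each candidate mapping briefly */
        int best = 0;
        double best_perf = -1.0;
        for (int m = 0; m < N_MAPPINGS; m++) {
            apply_mapping(m);
            usleep(100 * 1000);               /* let it run ~100 ms         */
            double p = measure_perf();
            if (p > best_perf) { best_perf = p; best = m; }
        }
        /* execution phase: stick with the winner for a longer interval     */
        apply_mapping(best);
        printf("executing with mapping %d\n", best);
        sleep(10);                            /* then re-learn, since the
                                                 co-runner or phase may have
                                                 changed                     */
    }
    return 0;
}
```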
All right. Section summary: we present adaptive mapping to improve the performance of those applications, and I want to highlight the importance of taking advantage of the interaction among applications and between applications and the micro-architecture. I think this is where a lot of runtime systems can come in to improve performance in this kind of emerging domain.

So I've talked a little bit about improving performance. I want to switch gears a little bit and talk about server utilization -- how do you actually improve server utilization? For improving performance, we managed, at the application level, how applications share the common memory resources on the multicore architecture. For improving server utilization, the goal is: can we manipulate application characteristics, using compilation techniques or other techniques, to reduce the memory contention and improve server utilization? Here we present a compilation technique to manipulate application characteristics so we can facilitate more workload consolidation on these kinds of machines.

I've shown this graph before, but I want to highlight the goal of this work. Again, option one is to disallow co-location, so each application runs on a single dedicated server and you have idle cores on each server. You get really good performance but low utilization. The other option says we co-locate those applications, but we potentially have really significant performance degradation because of the memory resource contention. For latency sensitive applications, we don't do this, because we want to deliver acceptable quality of service.

So the goal of this work is to mitigate the contention the low priority application generates toward the high priority application, so the high priority, latency sensitive application can deliver acceptable quality of service. When we do this, we can facilitate more safe co-locations, meaning co-locations where the high priority application still delivers satisfactory QOS, and that way we can improve utilization. Even if we sacrifice a little bit of the low priority application's -- oops, sorry -- even if we sacrifice a little bit of the low priority application's performance, we still gain utilization, right? Because if you compare this graph to this graph, these are cores that were previously idle, and we're basically extracting utilization from that part of the architecture. So that's the goal of this part of the work on improving server utilization.

So the research question becomes: how do we do this? Can we mitigate the contention by changing the application's characteristics? We want to make the low priority application less contentious. How do we do this? Can we identify the code regions of the low priority application that are really contentious, and how do we change their characteristics?

Compiling for niceness is our approach. It's the first compilation approach to address this kind of challenge for multiple co-running applications in this domain. To highlight the technique: we pinpoint the contentious code regions and then apply novel compilation techniques to manipulate their contentious characteristics -- to reduce the contention this low priority application may generate when it's co-running with other applications -- so we can allow it to co-run with high priority applications.

Some key insights before I go into the details of this work. Traditionally, when people think about a compiler, they think: we're compiling an application, and this is going to be the single application running on this architecture, so we focus a lot on its performance. But in this new domain, you have a lot of workloads, you want to consolidate those workloads onto a server, and the number of cores on those servers is growing with every generation. So there's big pressure; we can't really overlook this trend of wanting to consolidate workloads in this domain.
In this kind of domain, there are more objectives you may need to think about when you think about the compiler. How aggressive should an application be? An application can really aggressively use memory resources and get the performance, but is that really the best way to do it? What about the influence this application has on the other co-running applications?

>>: Traditionally, you can imagine having OS priorities and just scheduling things, scheduling the one that should contend at a lower priority. Why is that not a good solution in this case?

>> Lingjia Tang: Well, I would say mostly because you want to change the characteristics of the low priority application, and the OS is not really tasked with doing that. And here, it's not really a time sharing kind of situation. Traditionally, you would think about nice-ing an application that's time sharing a resource: you can tone down the priority of one application and do the time sharing. But here, the two applications are co-running at the same time; they're always running simultaneously. And also, a lot of challenges come with how to do this in the OS, but I'm going to talk about that a little bit more.

>>: A small follow-up question triggered by your response. Is it true that they don't do time sharing at all on an individual core in these warehouse scale computer scenarios?

>> Lingjia Tang: There's definitely some amount of time sharing. But not so much, because if your threads, your applications, use fewer than the available cores, then you don't have to do time sharing in a lot of situations.

>>: It looks like it's a matter of billing for the true cost of [indiscernible]. If you did datacenter billing, how would you bill to get niceness into the mix? If you bill for megabytes of memory and for a [indiscernible]?

>> Lingjia Tang: Right, this work is mostly looking at datacenters that provide internet services, so you basically have control of those applications. You bill yourself, so it's easy to negotiate and say which application has higher priority, and we often do this in the datacenter. But, I mean, I can imagine -- and this may be a wild claim, I haven't really looked into it -- that if you do this, you could actually select your priority with different billing prices. I could say: I have a batch application, you can charge me really cheaply, just get it done whenever. So those kinds of options.

>>: I have one more question on the basic subject. A lot of the datacenters seem to assume that you only run one job per CPU. So what are they doing when they're overloaded, when there's more than one job per CPU that needs to be done? Do they not do time sharing? Do they hold it up in the queue until somebody's finished?

>> Lingjia Tang: My experience is they often do time sharing at that point, and then your performance is going to degrade a lot, and then the operator is going to have to come in and say we can't do this.

>>: So shouldn't the queue that's sending them to the --

>> Lingjia Tang: Yeah, they do a bin packing algorithm. Basically, the application comes with its resource requirements, saying how many megabytes it wants, how many CPUs it wants, and then the job mapper, the scheduler, decides how to map those applications. If the machine still has available resources, then I'm going to map this job there.
>>: So it seems to me that, you know, a lot of the decades-long research in OS, all the [indiscernible] they used to make a fuss about, is just irrelevant in warehouse scale computing. Is that correct?

>> Lingjia Tang: I mean, I think OS people really need to think about, in this kind of domain, what kind of priorities and --

>>: The optimization criterion was not quality of service on a single job. It's fairness. And so the operating systems community optimized for an important part of the space. And that is a different -- now it's how fast you get the answer back, or someone's not going to click on your website.

>> Lingjia Tang: Thanks. So another insight is that in this kind of domain, when you sacrifice a little of the performance of a low priority application, you can facilitate safe co-location and win a lot of utilization in the end. As in other parts of life, when you sacrifice a little bit of your self-interest in performance, you win big in the big picture. That's a life lesson: being nice is sometimes good for the big picture. And also, in these kinds of datacenters, we know the applications we're running, they're long running, and we have the source code available.

So we call our system QOS-Compile, and here's an overview of the system. We have a profiler that identifies the contentious regions of those low priority applications, and we also have a compiler that targets those contentious regions and compiles them to mitigate their contentious nature. I'm going to talk a little bit about the profiler first.

This profiler is based on hardware performance counters, because they have fairly low overhead. The idea is that based on how aggressively an application is using certain memory resources, we predict how contentious it may be when it's running with other applications. So the application runs, and the profiler, based on the performance counters, predicts a contention score for each execution phase. We identify the execution phases that have a high contention score and find the associated code regions. There's a lot of thinking and work that goes into how to build this kind of prediction model; we use regression, and here's the model. I can talk more offline if you're interested. But let's take a look at the performance of this model. Here are two pieces of prior work, published in [indiscernible], which use last-level cache miss rate and reference rate to predict how contentious the application is. And this is our model. As you can see, we have a much better linear fit, a better correlation coefficient. And --

>>: Is that taken from your interferences between the different jobs, or what are the inputs to your model?

>> Lingjia Tang: Right, so the inputs to the model are the performance counters -- for example, the last level cache [indiscernible] of the application. There's a lot of reasoning behind why we select certain performance counters. The output is the contention score. So basically, the application runs, we calculate the score, and we compare the score against the average degradation this application causes to its co-runners when they run together.

>>: So you're actually predicting something different as well as using multiple inputs to help you, right?

>> Lingjia Tang: Yeah.

>>: Okay.
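For illustration only, here is a hypothetical shape of such a counter-based contention score. The talk says the real model is a regression over several performance counters; the particular counters and weights below are invented, not the published model.

```c
/* Hypothetical sketch of a performance-counter-based contention score. */
#include <stdio.h>

typedef struct {
    double llc_misses_per_kinst;   /* last-level cache misses / 1K insts */
    double llc_refs_per_kinst;     /* last-level cache references        */
    double bw_gb_per_s;            /* memory bandwidth consumed          */
} counter_sample;                  /* e.g., one sample per millisecond   */

/* weights would come from fitting measured co-run degradation offline;
 * these values are made up for the example */
static const double W0 = 0.05, W1 = 0.6, W2 = 0.2, W3 = 0.15;

double contention_score(const counter_sample *s) {
    return W0 + W1 * s->llc_misses_per_kinst
              + W2 * s->llc_refs_per_kinst
              + W3 * s->bw_gb_per_s;
}

int main(void) {
    counter_sample s = {2.0, 10.0, 1.5};
    /* phases whose score exceeds a threshold are flagged as contentious,
     * and the code regions executing in those phases become targets */
    printf("contention score: %.2f\n", contention_score(&s));
    return 0;
}
```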
This is actually application level. So average application, average cache miss rate, for example, for the entire ->>: How do you do it by region? How do you identify contentious regions? >> Lingjia Tang: I have a graph, but I'm not sure if I have it in the slides. But basically when it's running, every milliseconds, sample the performance counter. So you get every milliseconds, get a contention score. And by region, we didn't D instrumentation to actually find -- so we basically just looking for the execution phase that corresponds to that part. The co-region that corresponds to the part of execution phase, which is not ->>: So you do a phase analysis as part of this, then? >> Lingjia Tang: We did ping, we use ping to actually -- so we, the way we did it is basically, we do two path kind of analysis. One path, we basically get a contention score of the execution phase. We also record the instruction executed during the phase and then the next phase, the next path, we say basically you got the contention score every millisecond and the instructions executed every one millisecond. The second pass would be the ping instrumentation so basically counting the instructions executed. And then 18 identify the code regions that are executing the mapping instructions of the first [indiscernible] and identify the code regions. So I think we have more details in the paper. But we could talk more offline. So there's also a phase graph that would show the phase. pretty good job identifying the phases. We actually do a So that's the compiler. That's the profiler. Let's take a look at the compiler. The intuition behind the compiler is we use rate reduction to actually reduce the contentious nature of a contentious region. So we basically reduce the memory request rate of those region. And here's some illustrative example to show the intuition behind when you reduce the memory request rate, you reduce the contentious nature of the application. So you have two application running, application A and B, and assume certain initial last level cache content and access order, you have execution time of B, for example. So when you actually slow down A's access, basically what you changed is the access order here that you prioritize B's execution in the memory request. Because A is actually slowed down, so B memory requests get prioritized. And that will actually have a big impact on the performance of B. So in this illustrative example, B's execution is greatly reduced. So based on this intuition, we have two techniques to reduce the memory request rate. One is padding. Basically we inserted NOPS into those code regions so the idea is that application execute few instructions and then executing NOPS. So during this phase, no memory request issued. Another one is nap insertion. So it basically inserting nap. So you basically can control, let the application run certain minutes, seconds, for example and then let it sleep for a few milliseconds to use those techniques to control the rate reduction. And you may wonder that those two techniques are really similar, because they're just basically let application to execute for a while and let it sleep for a while, inserting NOPS for a while, they should have the same kind of performance. But in experiments, we found out that that might not necessarily always be the case. So here's some performance showing padding and nap insertion action. Y axis 19 here is the quality of service of high private application. execution rate of the low priority application. 
So here's some performance data showing padding and nap insertion in action. The Y axis here is the quality of service of the high priority application, and the X axis is the execution rate of the low priority application. So from 1X down to 0.4X, we slow down the low priority application by inserting NOPs, by padding, or by napping. As we slow down this application, you can see the QOS of the high priority application improves. The three lines are napping with one millisecond per nap, napping with ten milliseconds per nap, and padding. In this case, the three techniques have really similar performance. In this other case, however, you can see that napping every ten milliseconds greatly outperforms the other two. So even if you slow one application down to the same execution rate, different techniques may have a different effect on improving the QOS: the same rate reduction may not have the same effect.

It turns out this performance difference is because of the granularity at which you do the rate reduction. Let me illustrate. Think about the performance of the high priority application: say it starts out really low because the other application is generating interference, and the nap or padding starts here, where we start to slow down the low priority application. In this case, the high priority application's performance only gradually picks up instead of jumping to the optimal performance. That's because when you just start napping or padding, there are still memory requests from the low priority application being served in the system; it takes a while for those requests to drain, and it also takes a while to warm up the cache for the high priority application before its performance picks up. So what this really says is that when you slow down the application, the granularity you pick matters: a fine granularity for padding, on the order of cycles, may not give you the best win for the slowdown you sacrifice. Also, comparing padding and napping, napping is easier to do, has more accurate timing control because you can use a timer, and is more power efficient, because you're not executing NOPs.

So here is some of the evaluation for QOS-Compile; this combines both identifying the code regions and applying the transformation to those applications. On the X axis, we have the high priority applications, and the Y axis is the quality of service of the high priority application. For each application, the first bar is when it runs with the original low priority application -- the original LBM with no padding, basically no QOS-Compile. The yellow bar is where we identify the code regions and insert naps into them, napping for ten milliseconds every ten milliseconds; the purple is napping for 20 milliseconds every ten milliseconds. As you can see, we greatly improve the performance of those high priority applications, up to 85 percent and 95 percent of their original peak performance. There are similar results, and more results, in the paper. And let's look at the utilization we gain. These are the low priority applications when running with those high priority applications; because we slow them down, they're not running at 100 percent of their execution rate, but we still gain significant utilization, around 40 percent in this case and 50 percent in this case, on cores that previously would have been idle if we didn't do this kind of co-location.
We also have really similar results for the Google applications, where Websearch actually achieves up to close to 90 percent of its QOS in this case. So the summary: we pinpoint the contentious regions and apply code transformations to those regions to mitigate the contention the low priority application generates toward the high priority application, and we can improve server utilization because we facilitate this kind of workload consolidation. And I want to stress the idea of multi-objective compilers in this era, where we're looking at a lot more tradeoffs than we used to think about before. All right. But --

>>: Quick question. So you just talked about inserting NOPs, sort of stretching out the execution. Have you thought at all about more sophisticated techniques, other things that could be done? You think about a compiler -- a compiler can reason about the code and figure out a different instruction schedule and things like that. Are there other things that could be done that are more sophisticated?

>> Lingjia Tang: I think potentially there are things that could be done. I was thinking about some of the optimizations you could consider -- memory hierarchy optimizations, compiler optimizations we do to optimize for the cache structure, or loop transformations that change the memory characteristics. But when we looked at those kinds of optimizations, their impact in terms of mitigating contention is quite small compared to really slowing down the application. If you think about loop transformations, you're maybe looking at 6 or 8 percent: if you change the memory characteristics, the access pattern, of your low priority application by loop transformations, the impact that has on the high priority application is really small compared to just slowing it down a lot. So I think potentially there are things to be done, but I chose to go this route because the benefit you get from those is not as big.

All right. So, why be nice if it's not necessary? The last part of the work I want to talk about is fairly new work that dynamically regulates the pressure we put on the memory subsystem. The key idea is that compiling for niceness is a static technique, so you basically have to be conservative: you have to throttle down the application without knowing how much the high priority application may actually suffer from performance degradation or contention. So why don't we do this reactively, online? We could dynamically manipulate the application characteristics based on the performance degradation we actually observe and the contention we actually detect. That way we can avoid unnecessary slowdown -- if the high priority application is not really suffering from contention, we don't have to slow down the low priority application -- and we can achieve more accurate QOS control, because we're monitoring the QOS online.

All right. So, an overview of this technique. We have the compiler, which is basically profiling based: it identifies the contentious regions and instruments markers there to invoke the runtime system.
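Purely as a hypothetical illustration, instrumented low priority code might look something like the following, with a call into the runtime planted at the top of a contentious region. qos_nap_engine_poll() is an invented name; the talk only says that markers are inserted to invoke the runtime.

```c
/* Hypothetical shape of compiler-instrumented low priority code. */
#include <stdio.h>

/* Stub for the runtime hook: the real runtime would read the monitor's
 * shared buffer and possibly nap before returning. */
static void qos_nap_engine_poll(void) { }

static void contentious_region(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++) {
        if ((i & 0xFFF) == 0)         /* check periodically, not every iter */
            qos_nap_engine_poll();    /* may decide to nap before continuing */
        a[i] += b[i];                 /* memory-intensive work               */
    }
}

int main(void) {
    static double a[4096], b[4096];
    contentious_region(a, b, 4096);
    puts("done");
    return 0;
}
```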
We also have the runtime system, which is basically tasked with monitoring the QOS, detecting contention, and doing feedback control to decide how much throttling down -- how nice -- you want this low priority application to be.

The compiler here is really similar to compiling for niceness. We use the same profiler to identify the contentious code regions, but instead of inserting NOPs or padding, we insert markers to invoke the runtime. The runtime system is where a lot of the interesting things happen. We have the monitor, which is attached to the high priority application and monitors its performance counter information; that information is stored in shared memory in the system, in a circular buffer. And the nap engine is the part that's attached to the low priority application. Basically, this nap engine can invoke the runtime, read the monitored information of the high priority application, detect the contention, and respond to the detected contention to control the QOS of the high priority application.

So I'm going to talk a little bit more about this part, because it's actually really interesting. To detect contention and react to it, we could have implemented different kinds of policies. In this work, the first policy we tried is the simple way: just conservatively throttle down whenever the high priority application suffers QOS degradation. It could be a false positive, because the QOS degradation may not simply be due to contention, but let's just do it really conservatively. The other way is to do feedback control. We have different states: we execute for a while, then we go to a check state to see whether napping -- our technique -- actually has an impact on the QOS. If it does, we continue to nap. If it doesn't, that means there may be QOS degradation, but it's not necessarily caused by contention, because our technique is not able to reduce that degradation or improve the QOS. So in that case we just let the low priority application execute.
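A sketch of the targeted, feedback-controlled policy as described: nap, check whether napping actually raises the high priority application's QOS, and keep napping only if it does. read_hp_qos() stands in for reading the monitor's circular buffer; the fake QOS samples, the 90 percent target, and the 10-millisecond nap are illustrative.

```c
/* Sketch of a feedback-controlled nap engine policy ("targeted"). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* placeholder for reading the monitor's circular buffer of HP counters */
static double read_hp_qos(void) {
    static const double fake[] = {0.70, 0.78, 0.86, 0.92, 0.95};
    static int i = 0;
    return fake[i++ % 5];
}
static void run_low_priority_chunk(void) { /* a slice of real LP work */ }

static void nap_ms(long ms) {
    struct timespec t = {ms / 1000, (ms % 1000) * 1000000L};
    nanosleep(&t, NULL);
}

int main(void) {
    const double qos_target = 0.90;
    for (int step = 0; step < 20; step++) {   /* bounded for the demo */
        run_low_priority_chunk();
        double qos = read_hp_qos();
        if (qos >= qos_target)
            continue;                 /* no QOS problem: run at full speed  */
        /* check state: does throttling us actually raise the HP QOS?       */
        nap_ms(10);
        if (read_hp_qos() > qos) {    /* yes: contention we can mitigate    */
            while (read_hp_qos() < qos_target)
                nap_ms(10);           /* keep napping until the target QOS  */
            printf("step %d: napped until QOS recovered\n", step);
        }                             /* otherwise the degradation is not
                                         ours to fix, so keep executing     */
    }
    return 0;
}
```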
So let's take a look at those two techniques in action. Here, this is sphinx, a SPEC benchmark, running with a low priority application. This is the ref input; the X axis is time, and the Y axis is the normalized IPC -- if it's one, there's no contention and it's running at peak performance. The red line is the QOS when sphinx is co-located with the original low priority application, and the blue line is when we apply reactive niceness, here using the simple policy, to the low priority application. You can see the IPC improvement. Similarly, here we have the green and the red lines; the red line is again the original performance, and the green line is when the targeted, reactive policy is applied to the low priority application. The big takeaway from this graph is that this line is very stable around 90 percent, which is our target QOS. Comparing the two, they're both effective, but the targeted policy achieves much more accurate QOS control, because it's monitoring the feedback -- the effect of throttling down on those applications.

>>: Is QOS defined on just the high priority application, or is it both the high priority and the low priority? Because if it's just the high priority, I could just say I'm not going to run the low priority, and I maximize QOS.

>> Lingjia Tang: But then you sacrifice utilization.

>>: Sure. But if QOS is what you're trying to optimize, I guess the question is, what's the objective function? Do you take into consideration both utilization and quality of service?

>> Lingjia Tang: I would think of the optimization question, if you want to formally define it, as having the QOS as the constraint: you want the QOS to meet a certain constraint, and subject to that constraint, you want to maximize the utilization. That's how I would think about this optimization. So for example, the constraint here is that we really want to guarantee 90 percent of the QOS, and then we say: with that, give me the maximum utilization you can. And I want to say this is a really nice way of thinking about it, because you're giving people a flexible spectrum for selecting the trade-off between QOS and utilization. You could say: I care about 80 percent, give me the maximum utilization at that point. So people basically have a knob they can work with in the datacenter.

So let's take a look at another graph comparing simple and targeted. Again, the red line is the QOS when co-located with the original. The blue line is when we apply simple; it's actually fairly stable here. The green line is using targeted. So they give very similar QOS. However, when we look at the nap duration decided by the nap engine for those two heuristics, the green line here, targeted, is much lower than simple, meaning we're doing much less slowing down to achieve the same performance -- the nap duration is basically how long you slow down the application, how long you let it sleep. This is because targeted uses feedback control that detects how much throttling down is actually necessary for the performance you want. So again, they're both very effective, but targeted is better than simple for improving utilization.

>>: [indiscernible].

>> Lingjia Tang: So here, I think this is: every two milliseconds, we nap. So every two milliseconds, our napping duration is a little over 0.5 milliseconds. So it's 0.5 divided by --

>>: So it's related to the nap time?

>> Lingjia Tang: Yes. Let's say every two milliseconds you nap: you run at full speed for two milliseconds, and then you nap for, say, 0.5 milliseconds. Then your utilization is 2 over 2.5, which is 80 percent.

So let's take a look at the evaluation comparing our reactive niceness to QOS-Compile. Again, two high priority applications, each running with the original low priority application, with reactive niceness, and with QOS-Compile. You can see the two QOS results are really similar. However, when you compare the utilization, we achieve much, much higher utilization than QOS-Compile. This is because we are dynamically figuring out how much utilization we can get for the desired performance.
QOS-Compile is also static: you only see two bars because you basically decide how much you want the low priority application to throttle down before even knowing the co-runner -- it doesn't really change when you change co-runners unless you respecify it. But this approach is more dynamic: based on what application the low priority application is running with, it throttles it down accordingly.

Some more evaluation to go through. These are the utilization numbers, and those are the QOS targets -- about 80 percent and 90 percent -- and we do a pretty good job maintaining 90 percent if that's our target.

All right. Some performance efficiency numbers: the red bar is where we run the two applications on separate machines, disallowing co-location, and the green bar is where we run those applications together, using reactive niceness to maintain 90 percent of the QOS. So we achieve 90 percent of the QOS, we gain utilization, and we're much more power efficient.

So the section summary: reactive niceness basically takes advantage of static compilation techniques but also dynamically regulates the memory pressure on these kinds of machines to facilitate co-location and improve utilization.

To conclude, we characterized the impact of memory resource contention on large scale internet service applications, and it turns out it's actually very, very important. Based on those insights, we showed the thread-to-core mapping techniques that improve performance, and also compiling for niceness and reactive niceness, which manipulate application characteristics to improve utilization in the datacenter.

Some of the future work I'm thinking about: we're also using a lot of managed runtimes in the datacenter, and we haven't really looked into how to construct managed runtimes that are aware of these kinds of tradeoffs between QOS and utilization; potentially there's more research that could be done in the managed runtime space. Again, runtime system infrastructure in the WSC is still quite new and emerging, and there are a lot of interesting things we could look at, particularly taking advantage of the interaction among applications and between applications and the micro-architecture. Also, there are a lot of emerging mobile systems, and the hardware has been evolving with really fast generation turnaround times for mobile services. There's a lot of heterogeneity going on in those kinds of systems, and we really need to think about how to build the software stack -- the compiler and runtime system -- to take advantage of them. I think that would be really interesting to look at. All right. So with that, I'll take questions.

>>: Your last graph had QOS and instructions per [indiscernible]; the number looks low.

>> Lingjia Tang: It's --

>>: Instructions per joule, or instructions per second per [indiscernible]?

>> Lingjia Tang: I'll look into it. Yeah, maybe you're right. Basically, what we did is use a power meter.

>>: The power meter gives you energy on the [indiscernible]. It doesn't give you power; it gives you energy, even though it's called a power meter -- if you're using the Sandy Bridge one.
>> Lingjia Tang: We just basically used the kind you can [indiscernible] for measuring, like, household appliances.

>>: Oh, yeah?

>> Lingjia Tang: So we actually measured the whole system's power, because we really care about the entire system.

>>: So it's actually on the wall -- you're just measuring the draw out of the wall, not a performance counter?

>> Lingjia Tang: Yeah, not a performance counter. I've never tried the performance counter. I guess that would be really interesting to try.

>>: The low priority application doesn't nap. Is that [indiscernible]?

>> Lingjia Tang: Um-hmm. Yeah.

>>: [inaudible].

>>: How robust in general are those results? Are you confident that the general insights you have drawn from this work apply to datacenters? Can you go into that?

>> Lingjia Tang: I think it's very general. I've shown some of the Google applications, and that's data we got with Google applications on their production machines. And I do know that they're really working on these kinds of techniques right now.

>>: So maybe to point out a little bit more on that: you opened up and talked about low utilization -- that was the motivation -- but a lot of the benchmarks were SPEC, which are just integer codes that don't do any IO. So I'm curious, do you have any idea about how much IO is going to have an impact on some of these things? Because the SPEC benchmarks are designed not to have IO, right?

>> Lingjia Tang: Right. For reactive niceness, we're not using any Google applications, because I was not doing an internship when I did this work. But for QOS-Compile, we compared SPEC and the Google applications and achieved really similar results. That's why we're confident that we can use SPEC for these kinds of studies. In terms of IO, whether reactive niceness will have an effect with IO, I don't have data. I can only tell you the reason I didn't worry much about using SPEC: for Websearch and those kinds of applications, the data mostly resides in main memory -- you want to reduce IO operations as much as possible for that kind of application, just because you want lower latency, so you want everything to be in memory. But there are certain applications where IO would be an interesting thing to look at. Gmail has a lot of IO, and, yeah, I think the main idea may potentially be applicable to those kinds of applications, but I don't really have data for it.