>> Kathryn McKinley: I'm Kathryn McKinley. I've been here, I'm going to have my one-year anniversary this month, but this is my first candidate to host and I'm really thrilled to have Jason Mars here. He's been working on data center energy efficiency. Some of his work that we're going to hear about today has been selected for IEEE Top Picks, and it has influenced the way that Google is building their data centers, and I'm hoping to have him come here and influence how we build our data centers.

>> Jason Mars: Thank you. I appreciate it. Hi. So today, I'm going to be talking about a piece of work that's captured my interest for the past few years. It deals with the architecture of warehouse scale computers and how we build a highly efficient design. I've been looking forward to this talk and I'm happy to be here. So let's begin.

So the landscape of computing has been changing. Traditionally, when users thought of computing, they thought of a desktop they go to to do some type of activity, work or play, and then they move on with their daily lives. However, now we're always connected with highly mobile, portable devices, and much of our computation cycles live in these massive warehouse scale computers. And as noted by Forrester Research, the cloud was a $40 billion market in 2011, and this will grow to a $241 billion market by 2020. So this is really the space that my work has been in recently, and a very interesting space for computer science in general.

So here, I show two pictures of two of Google's large warehouse scale computers. Each of these buildings is a football field in size, and each building houses thousands of servers and machines. On these machines, we run large-scale internet applications, like search, mail, social networking, maps, and so forth. These warehouse scale computers are expensive, costing hundreds of millions of dollars to construct and operate, and this is growing. And my claim is that they're inefficient, as the system and software architecture of these warehouse scale computers remains in its infancy.

Now, when thinking about improving efficiency in warehouse scale computers, there are a number of optimization objectives and metrics you can consider. My work has focused on the performance of software running in these warehouse scale computers and on utilization. As noted by Luiz Barroso and Urs Hoelzle, software performance and server utilization are critical for efficiency in warehouse scale computers. So we are all familiar with performance: it's how well our software is running on these machines. However, utilization is a particularly interesting metric in warehouse scale computers. This graph to the right shows the utilization of Google's warehouse scale computers as a fraction of time, from 2007 data. So on the Y axis, we have the fraction of time, and on the X axis, we have the amount of utilization for that fraction of time. And as you can notice, the hump of the curve indicates that we're usually around 30 to 40 percent utilization on average, and what we'd like to do is move this hump to the right to have higher utilization for a larger fraction of time. And to put these two metrics in perspective, a 1% improvement in either performance or utilization results in millions of dollars saved at the scale of the Microsofts and Yahoos and their warehouse scale computers.

>>: So is that in that you don't have to buy the servers, or is that in that you turn them off so you're not paying for the energy? Where is that?

>> Jason Mars: This millions of dollars?
So it's that you won't have to use as many servers. So basically, there is a cost model used. When a product group wants to use X amount of machines, Google -- well, I've done a lot of my work at Google, but Google will actually put a price tag on how much it costs to use that many machines. And so if you can have a one percent improvement, using Google's cost model, across the entire infrastructure, you save millions of dollars worth of computing resources. So that's where that millions of dollars really comes from. It's an internal model. But you can think of it as building a smaller data center for some fixed amount of work. Question?

>>: So I'm curious. It seems like if you have a data center and you're running at that utilization, you could have fewer machines that you were running at a higher utilization. Can't you just trade off the number of machines you have turned on to change your utilization curve?

>> Jason Mars: Yeah, so that's a good point. This curve doesn't include load, right. So this is the utilization when all of the machines are active in some way. Basically, there are a number of contributing factors to low utilization, and one of those factors has to do with the lack of co-locating things together on the same machine. So does that -- right, I know what you're talking about. Basically, across, say, a month, you can have times when you're not getting as many queries to the data center, but I believe this curve factors that aspect out. I could be wrong, though.

>>: Google is not going to show us if they're turning on or off their machines anyway, right?

>> Jason Mars: Pretty much.

>>: I think you can assume the server doesn't just do computations, and a lot of wasted time is on things like stacks and storage and other external resources.

>> Jason Mars: Right.

>>: And the more you have to spare, the more time you're going to have sitting around waiting for the network.

>> Jason Mars: Absolutely, absolutely.

>>: You can't use the CPU.

>> Jason Mars: And all of those factors that you just mentioned factor into this utilization challenge.

>>: So is this CPU utilization or just any part of the system?

>> Jason Mars: It's compute utilization, yeah. So we'll see shortly how we can start addressing that utilization problem. However, before we look into improving efficiency in warehouse scale computers, it's important to reflect on the design of these systems. Traditionally, companies have used a functionality first, efficiency second approach, where initially they use commodity components, like, you know, off-the-shelf processors and open source software. They stitch these components together for functionality, and then they tweak these components for better efficiency as time goes on. The problem is that commodity components were not initially conceived and designed for the space of warehouse scale computers, and when you start with these components, you may lose sight of unique characteristics in warehouse scale computers that are critical for a highly efficient design. And so system architects may have missed a key characteristic that should steer that design, so I argue that we must rethink the system architecture and look for any characteristics we would want to design our systems to exploit to have a highly efficient system.
And so the insight of my work, and one of the underlying threads of the work I'm presenting today, is that one such characteristic has been missed, and that is the diversity in execution environments. So let me first define what I mean by an execution environment. Given a task, its execution environment is the underlying machine configuration that the task is running on, coupled with the co-running tasks on that machine at that given time. So the execution environment of this task is the generation of machine it's on together with the co-runner that's running. And so inside of a warehouse scale computer at any given time, we actually have a number of various execution environments. We have different machines because, as machines are retired or fail, new machines are brought in that are of newer generations, and at any particular time in a warehouse scale computer, each machine is loaded with a different number of tasks already running.

So say we take this identical web search job and run it on three different execution environments. Currently in warehouse scale computers, the system does not place tasks where they would like to run or where they run best. Tasks can't adapt at the machine level to events within the execution environment of the machine. And certain kinds of events we can't address or measure or manage explicitly, such as the interference between tasks. This is critical for utilization. So these are all hard problems and precisely what my work looks at. But before we talk about those problems, you might ask, well, is it important to acknowledge this diversity in execution environments? And my claim is that this diversity in execution environments is key for a highly efficient design.

A simple experiment can lead us to this conclusion. Just looking at machine configuration first: if we take three different machine configurations -- these are actually production configurations you would find in a Google cluster -- and we run these nine large-scale Google web services across these different machine configurations, we observe a significant impact on the performance of these various jobs. Some applications, like Bigtable here, can see a 3x difference in performance depending on the machine configuration. For other jobs, like protobuf, it matters little which machine they run on, and we have jobs like search scoring and maps detect face where one job will prefer one architecture while other jobs tend to prefer another. So we observe that task performance is heavily impacted by diverse machine configurations.

>>: Do you have a reason why, any intuition from a high level, why you see those? Is it [indiscernible]?

>> Jason Mars: I do have that intuition, and by and large it has to do with the diversity in the memory subsystems of the various architectures. So across the three architectures I've shown, we observe a very big variation in the cache sizes used, in the prefetchers that are used on those various architectures, and in the topology of the memory subsystem across these architectures. We have a generation 1 Xeon, which is a Clovertown, Core 2 type architecture. We have an Opteron, an Istanbul Opteron, and we have a Westmere, a generation 2 [indiscernible] type Core i7, but an earlier Core i7 architecture.
So if you notice, across these architectures, if you look at it, the cache sizes are very different, the hierarchy is different, and the types of prefetchers and the effectiveness of the prefetchers are different. That's our observation as to the biggest contributor to the variation.

>>: Maybe I'm [indiscernible] to your argument. Would it simplify to say the CPU runs at infinite speed and all the delays are due to the memory subsystem?

>> Jason Mars: Infinite speed is very fast, but yeah. I mean, if we could actually keep all of our work in, like, the first level cache, you would have a significantly fast machine. We can't really realize that. I agree with you completely. In practice, it's always the latency once you leave the first level cache that's causing --

>>: A local story: it's like complaining about the [indiscernible] stop at 520.

>> Jason Mars: Yeah, exactly. The potential is huge, which is, I think, exactly the point that you're making.

>>: You're normalizing to one on the Clovertown here?

>> Jason Mars: Right, exactly. So basically, we normalize the whole cluster to the minimum performing of the three architectures, yeah.

>>: So these are all multithreaded benchmarks?

>> Jason Mars: These are multithreaded benchmarks, yeah.

>>: So when you run on a Xeon, which I guess the squares are showing has two CPUs, and the Istanbul has four and the Westmere has four, do you fork a number of different threads?

>> Jason Mars: So we bin -- in this experiment, I bin the work to one core. These workloads are essentially part of a suite of pre-made benchmarks inside of Google, which are composed of the commercial binary coupled with a giant pre-made, packaged log of hundreds of thousands of queries that it just runs through. It's called perf lab.

>>: So these are all running in one thread?

>> Jason Mars: Right, in this particular --

>>: So they get the whole memory system to themselves?

>> Jason Mars: Exactly. So we don't see --

>>: [indiscernible] memory system itself.

>> Jason Mars: Yeah, exactly. So when we look at -- so that's machine configuration. We observe a lot of variation in performance. What about co-runners? We also observe, when we take the same type of experiments but only look at the degradations due to a co-runner running alongside on this machine, that we have a significant degradation in performance in some cases and not so much in others. So across the various clusters we can see -- and the key thing to take out of this graph is that as you change the co-runner, you can have a different amount of degradation. So we observe that task performance is highly impacted by the co-runners on that machine as well.

>>: So the co-runner, you have three co-runners there?

>> Jason Mars: Yeah, so here we fill up the whole -- because we're studying contention here. So we just fill up the whole chip with threads, so it's binned based on half the -- the policy is half the cores are one application, the other half of the cores are the other application. Because if we ran the experiment where we only limited it to two threads, we wouldn't really see the true pressure of contention, you know, given the memory system. You won't have as many accesses, memory accesses, coming from the various applications if they're limited to one thread. It would be a slower rate. So that was the rationale to fill up the whole chip.

>>: Okay. So again, each of these apps that you have, they get one core?
>> Jason Mars: So in this experiment, they get half the cores on the machine. It's a half-and-half policy.

>>: They get half the cores, okay.

>>: So I imagine there's a lot going on on these systems at any point in time, in the performance lab, and I imagine that there's a lot of variability you get just from running these things multiple times. So are these means taken over multiple runs?

>> Jason Mars: Right. So these are means over three runs. However, this perf lab, it's called perf lab, this benchmark suite that Google's developed internally, they've worked really hard to get the performance variation across runs down within the one percent range. So that's really what's used here.

>>: When you start adding interference, then you have timing issues of when you actually start the applications, right?

>> Jason Mars: Yes.

>>: Is the harness handling that?

>> Jason Mars: So the harness here is handling -- it spawns both applications at the same time. The applications run for a while, they run for 20 minutes, so we --

>>: So you minimize that.

>> Jason Mars: We minimize it, but there might be a little, yeah. So the observation is that the execution environment has a significant impact on performance, and from this, I claim that exploiting -- and not just exploiting but adapting to -- diversity in execution environments is critical for improving performance and utilization in warehouse scale computers.

So I've done a number of works dealing with these types of issues of optimization and the interaction between workloads and the underlying system, across disciplines in computer science such as characterization and workload studies, compilers, static and dynamic analysis, runtime systems and software systems, and computer architecture. But today, I'll be talking about the works that deal with this issue of diversity in execution environments and how we can exploit it. There are three aspects to the work, three design points where we need to acknowledge the diversity in execution environments. At the cluster level, we want to map jobs to execution environments they prefer. At the machine level, we want to enable jobs to adapt to the execution environment on the machine and to changes in that execution environment. And then there's a very important event that happens within an execution environment, important for utilization: the interference between tasks. We want to be able to acknowledge and explicitly manage this interference to improve utilization. That will become clear later in the talk.

So to summarize the contributions: I've developed an intelligent mapping approach that dynamically learns how jobs perform in various execution environments using continuous profiling, and then exploits that knowledge to do this mapping, and that results in a 15% performance improvement overall across a cluster. This has been validated using the same kind of applications I'm showing -- real production machines and real production workloads. And at the machine level, I've developed a new mechanism that allows applications to adapt and respond to events in the execution environment, and I've used it to solve two pressing problems in warehouse scale computers. The first is selecting aggressive optimizations based on whether they're effective dynamically, so you can detect whether your aggressive optimizations are giving you a benefit and then apply them only when they are.
And then also, we really need a mechanism in software -- and this, at the time, was actually a challenging problem -- to detect that contention is occurring, to be able to positively say that we observe contention. And so I've used this runtime system to detect and respond to contention dynamically, which is important for utilization. And then finally, dealing with the interference within an execution environment: certain applications have a precise amount of performance they must guarantee, and certain co-locations between applications can violate or not violate that performance requirement. I've developed a technique that can precisely predict which applications are okay to co-locate on a given machine, to allow more co-locations to occur over a policy of just disallowing co-locations. It will be clear when I get into it. And that presents a 50% improvement in utilization over a 500-machine cluster in the scenario that I will present.

>>: So in each of the scenarios, is it a five-machine cluster?

>> Jason Mars: Oh, no, no. So it differs from experiment to experiment. At the cluster level and for the interference in the execution environment -- these two works -- it's 500 machines. And we run an experimental scenario on those 500 machines using, you know, canned executions. It will be clear in just a few.

So let's look at the cluster level. I'll be going through each of these -- cluster level, machine level, and interference -- in the talk. We start with the cluster level. At the cluster level, what we're really trying to attack is the assumption of homogeneity in the way systems are designed currently. So many systems, like the systems used at Google, view the entire warehouse scale computer as a collection of thousands of cores and petabytes of memory. And basically, when a job comes in to be scheduled on this giant computer, it finds the first available machine with the prescribed number of cores and amount of memory needed by that job. Each job has a configuration file with the number of cores and memory needed. However, what actually happens is we do have this machine heterogeneity, and as the graphs I've shown you indicate, it matters a lot where the jobs end up landing. So how do we exploit this? How do we take advantage of this, off the cuff? What's the first thing to do to take advantage of this diversity?

We want to be able to map jobs where they run best, and we want to take advantage of the unique properties of warehouse scale computers to do so. We know the application services we're going to be running continuously on these machines, and we also know that we have continuous profiling as a service that lives in production, where we can continuously profile all of the jobs and get performance information using hardware performance counters. In this work, I use GWP, which is the Google-wide profiler, and there's a paper in IEEE Micro from 2010 that discusses that in more detail. So what I did here was develop an approach I call SmartyMap. What SmartyMap does is exploit this continuous profiling to build a knowledge bank of how jobs perform -- scores of how jobs perform in different execution environments, defined by both the underlying machine configuration and the co-running jobs.
And SmartyMap will train this knowledge bank using Google-wide profiling and use it to map jobs to the places where it predicts they will have the highest score. Question?

>>: Is that from production profiling, or on the test cluster where you run these benchmarks?

>> Jason Mars: The evaluation will show a test cluster.

>>: The maps and the models?

>> Jason Mars: Oh, this is live. SmartyMap runs live, and with the continuous profiling information, it refines its knowledge of the different types of jobs running and where each likes to run.

>>: So how do you incorporate the variance in workloads coming in and other noise that you see?

>> Jason Mars: Statistical methods are what we use. We use the average, basically, that we observe across different execution environments to score.

>>: And one more question. How do you know how long a request is running? Are applications instrumented to say this is the start of a request and this is when it's done, so that you know how long it's taking? Because if you just see an arbitrary, you know, binary, you don't know the latencies, right? The user would --

>> Jason Mars: Right, right. So that's not factored in. What happens is it uses hardware performance counters. The profiling service continuously samples hardware counters, all day long, all night long, across all machines, and it also collects what was running on that machine. So we have all of these logs of just hardware performance counter information for all the jobs running, as a service. It does it at the hardware performance counter level, and it doesn't give you higher level information such as, you know, quality of service information.

>>: So how do you know from the counters if this is slow or if this is fast?

>> Jason Mars: So here in this experiment, I use a metric I call instructions per second. I'll talk about it in terms of improving the performance we're getting from the whole cluster, in terms of how many instructions per second am I getting from the cluster -- kind of thinking of the whole cluster as a chip. But it's kind of -- I like it.

>>: So why is IPC the right metric?

>> Jason Mars: So IPC in particular wouldn't be the right metric, because we have different types of architectures. But IPS, which is essentially IPC generalized to seconds, works because seconds are comparable across architectures. But, I mean, I think you're hinting toward a really good point, which is you may not want to optimize strictly for how fast something is running. There might be other things you want to optimize for. And for that, you can use a similar methodology; you would just change the feature you're optimizing for. So if you could collect from GWP, if you could get a notion of -- let's even make it high level -- how satisfied a user is; if you can use a continuous profiling service to see how satisfied a user is with these types of services in different execution environments, you could apply the same methodology or the same approach that I'll show in the next couple of slides.

>>: Even though you're using instructions per second, your approach is general enough to handle other --

>> Jason Mars: Right. And you'll see on the next slide. It's because we formulate -- it's a great question. It's because we formulate the mapping problem as an optimization problem.
So we can leverage standard numerical optimization techniques to arrive at a good solution: we have an objective function, and we're just trying to optimize toward that objective function. That's how we formulate the problem. Given this knowledge bank, whose scores we can refine using the statistics I just mentioned, what SmartyMap does is, whenever it has a set of jobs it needs to map into a cluster, it will internally simulate the different mappings using the information it has from the knowledge bank, and use numerical optimization to optimize toward the highest performance it predicts you can get from the cluster. We use a stochastic hill climbing approach to perform this optimization. It works well for this kind of problem.

>>: Is this sort of a one-shot optimization, that when I have a bunch of jobs to place, I just optimize for those, ignoring everybody else and keeping everybody else fixed, or will you get additional benefit by sort of globally optimizing?

>> Jason Mars: So how this works is, imagine you want to turn a couple of services on inside of a cluster. When you turn those services on, there's a set of many types of jobs that need to run. So when you have the set of jobs you need to run, you run it through an optimizer that will simulate mappings, using the knowledge bank, to get the best aggregate performance it predicts. It kind of works like that. I think you're hinting toward two issues. One is, what if there are things already running? And that would be where you start your optimization: you'd have things running, and then you would be mapping, simulating things being added on to those machines, and you optimize accordingly. Another thing I think you're hinting at, which is a great point, is what about changing the optimization dynamically -- can you move things around?

>>: Can you turn off that service that's already running for a while?

>> Jason Mars: Yeah, I think that's an absolutely fine idea and probably would produce good results. But I didn't do that specifically, so I'm not going to talk too much about -- go ahead.

>>: How long does it -- so it could take a long time to compute that, and you have some requirements on getting these jobs scheduled quickly. So what kind of deadline do you give yourself for the scheduling algorithm?

>> Jason Mars: So with the scheduling algorithm we have here, and with problem sizes in the thousands of machines for a particular mapping scenario, in a matter of seconds you'll have a map that is predicted to be good by SmartyMap. And in this situation, these mappings won't be done often, because it happens when you turn on the services in the warehouse scale computer. You can kind of do that once. Let's assume you want to have search and maps and mail running on cluster 5. You can do that once and then let that run until you make a different decision -- like, all right, I don't want to run maps in that cluster, I want to move it to this other cluster -- and then you can do another remapping.

>>: So it's not as jobs arrive. It's just when the service gets turned on?

>> Jason Mars: Right, right, exactly right. So as time passes, the quality of the knowledge bank improves, because we keep refining it with new updated information.
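A minimal sketch of what such a stochastic hill-climbing mapper might look like, assuming the knowledge bank is simply a table of average IPS scores per (job type, machine type) learned from continuous profiling; the names, data layout, and the machine-only scoring are illustrative simplifications, not the production SmartyMap code, which also accounts for co-runners and resource constraints:

    /* Illustrative SmartyMap-style mapper: stochastic hill climbing over
     * job-to-machine assignments, scored with a knowledge bank of average
     * IPS per (job type, machine type).  All names are hypothetical. */
    #include <stdlib.h>

    #define NUM_JOBS          1000
    #define NUM_MACHINES       500
    #define NUM_JOB_TYPES        9
    #define NUM_MACHINE_TYPES    3

    /* Knowledge bank, filled in from GWP-style profiles. */
    static double kb_score[NUM_JOB_TYPES][NUM_MACHINE_TYPES];
    static int job_type[NUM_JOBS];         /* type of each job to place     */
    static int machine_type[NUM_MACHINES]; /* type of each machine          */

    /* Predicted aggregate IPS of the cluster for a candidate mapping,
     * where map[j] is the machine assigned to job j. */
    static double total_score(const int *map)
    {
        double s = 0.0;
        for (int j = 0; j < NUM_JOBS; j++)
            s += kb_score[job_type[j]][machine_type[map[j]]];
        return s;
    }

    /* Stochastic hill climbing: propose a random reassignment of one job
     * and keep it only if the predicted aggregate score improves. */
    void smartymap_assign(int *map, int iterations)
    {
        for (int j = 0; j < NUM_JOBS; j++)
            map[j] = rand() % NUM_MACHINES;          /* arbitrary start     */

        double best = total_score(map);
        for (int i = 0; i < iterations; i++) {
            int j = rand() % NUM_JOBS;
            int old = map[j];
            map[j] = rand() % NUM_MACHINES;          /* random perturbation */
            double s = total_score(map);
            if (s > best)
                best = s;                            /* keep improvement    */
            else
                map[j] = old;                        /* otherwise roll back */
        }
    }

Because the objective is just a score over the knowledge bank, the same loop could optimize a different feature (for example a user-satisfaction signal) by swapping out total_score.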
And I'm not going to get into too much detail about the specific classes of information you can use for mapping. Suffice it to say there's SmartyMap-M, which is if you were to only take information at the machine level -- if you were to only consider machines in the heterogeneous environment -- and what the complexity of the amount of information you would need to collect would be; and then there's machine level plus all the permutations of possible co-runners, where the complexity gets much higher. I would just point to this, because there's an interesting observation as to how SmartyMap works across these two classes of information, and I think I present that in the slide after this.

So just to describe the experimental setup, to see how well SmartyMap works: we have 500 machines, three production types of machines. These are actually, you know, real machine configurations used at Google, which might be interesting for you guys to see. I got approval to show that -- I think it slipped past the approval process, because normally they wouldn't approve that. Anyway, we have a thousand jobs that we'd like to map, and I've used the nine Google applications presented earlier. There are two experimental test beds: there's the Google one that I did at Google, and then there's an experimental test bed where I used SPEC and off-the-shelf machines, when I was not at Google, to do the same experimentation.

So what did we observe? This is normalized IPS, and it's normalized as if every job had the best machine it could run on and was running alone. That's the normalization; that's one. And as you observe, across the different metrics, the light blue bar shows what we'd expect currently, where it doesn't take into account any of the heterogeneity across machines or co-runners. And then we show different classes of information that we would collect. But observe that only the blue and the red show where we consider machines and not just co-runners, and notice that when you consider machines, you get most of the benefit. So the diversity across co-runners in the execution environment matters a lot less than the diversity across machines, which is a nice, interesting observation. But the interesting thing in the paper that this is attached to, which you guys haven't seen yet, is that it depends on how diverse your machines are. Across these three machines that actually exist in a real cluster, the diversity is relatively high. But it's kind of surprising, because we were just talking about going from Core 2 to Core i7, basically, and having the Opteron competitor to the Core i7 in there. So the generational differences aren't that huge.

>>: You are doubling the size of the memory system from the previous slide.

>> Jason Mars: Exactly. And the prefetchers got so much better, especially from Core 2 to i7 -- they fixed a lot of things in the prefetcher. You're absolutely right. Good point. So that's the result. But there's one more interesting observation. We have a 15% improvement overall across those two examples. Now, one interesting observation in this, and this might be really useful to some of the data center people here, is that it tells you something about how we build our clusters. If you were to acknowledge this diversity, does it change the way we would want to go buy machines to fill a cluster? And it does, right. So here we show SmartyMap applied to clusters that are fully composed of one of the three types of machines.
And then we show clusters that are half filled with pairs of the three different types of machines: Clovertown and Istanbul, Istanbul and Westmere, and Westmere and Clovertown. And what we can observe is -- oh, no. All right. Oh, I see, I was clicking. So what we observe is, if you have all Istanbul versus having half Istanbul and half of the older Clovertown architecture, when you're doing the optimization of mapping jobs where they like to run, you can match the performance of the all-newer-machine cluster with what would be a cheaper cluster to build, and you actually beat that performance. Why? It's because some of those workloads prefer the Clovertown over the Istanbul. If you can do the mapping right, you can actually have a cheaper mixed cluster and get matching or better performance, which we see in this case. I think that's really interesting and might tell us that we shouldn't just buy all of one machine.

>>: Although your Westmere number says we should just buy Westmere machines, right?

>> Jason Mars: Yeah, that's because Westmere is just such a strong processor. But if you look at, at least, the difference between the expected, random performance and the performance when optimizing, you do a little better than if you were to randomly map. So that's why I didn't -- I don't want to -- this one is a little bit more subtle, and maybe there's another graph I would need to build to really show this point, but you get better efficiency than random when you're doing the mapping, is what I mean, going across the two. But you're right. In this case, for best performance, you would just buy all Westmere, done.

>>: But if you look at performance per energy, powering up that memory system is costly, so you might get a better energy efficiency point on one of the other two graphs, depending upon how much it costs to run it.

>> Jason Mars: That would be a very interesting data point to show, and I kind of wish we had done that now, because if I can sell you the energy benefit, that would actually be interesting. But yeah. So that's that result.

So let's move to the machine level. At the machine level, we want to be able to allow jobs to adapt to their execution environment, because there are a number of problems that require this, and I'll show why. What we're dealing with here is rigidness -- rigidness at the application level. Traditionally, when a compiler generates your binary, your binary is fixed regardless of where it runs and its execution environment. Now, there's a class of problems that would require the application to adapt to its execution environment. Two I'll highlight and address in this work. The first problem is the problem of aggressive compiler optimizations. These are optimizations known to either improve or degrade performance based on certain dynamic effects or certain situations across different applications. So in one execution environment, you may see a performance improvement, while in another, you might experience a slowdown. That's the first problem where we need this kind of flexibility. The second problem is the issue of detecting and responding to interference online. Your web search job will have performance requirements and may be co-located with some low priority batch applications. Currently, there's no way for us to know programmatically, or to ask, well, are these jobs contending?
Because if we could ask this question and get a response from the software system, we might be able to do something to reduce the interference that is generated. So the ability to detect and respond requires a runtime system to be there to ask that question of. But before we can address these two problems, we need a mechanism for cracking this rigidness, for allowing this online adaptation.

When realizing this kind of mechanism, we have binary translators. For arbitrary native binary applications, we have dynamic binary translators. However, they haven't been adopted in many production environments, including warehouse scale computers, because they're huge in complexity and sacrifice significant performance, and the argument for using these binary translators for optimization hasn't really been made by our community in a convincing way for production. To achieve deployability in warehouse scale computers, we need an approach that is lightweight and low in complexity. So we need a new technology for online adaptation, and this new technology is one of the major challenges and contributions of this work.

So let's take a quick look at conventional approaches to binary adaptation. Your application runs on top of a runtime, and this runtime is responsible for keeping fine-grained control at the instruction and basic block levels. It uses something we call a code cache: the runtime dynamically translates the application at the instruction and basic block level into this code cache, and then execution is only allowed to occur out of this code cache. If execution were to transition back to the original binary, things would break, so that's a big no-no. However, there are expensive transitions between the code cache and the runtime that occur. When we have indirect branches that are not easy to predict and are on the critical path of your application, you may take many expensive transitions between the code cache and the runtime, and that's very expensive. And then to monitor the execution environment, you need to add instrumentation, and this instrumentation, to employ some kind of adaptation, might need to also make transitions to the runtime and back. So we have a lot of complexity, a lot of overhead. We need a new approach for binary adaptation.

So I asked the question, why not let the application run directly on the processor, right? And we can use a very tiny runtime that uses hardware performance counters to keep us informed about what's going on dynamically. So we have no fine-grained control and we execute directly. But you may ask, well, how do you adapt the application code then? I advocate and propose that we use, in software, what I call scenario-based multi-versioning, where we specialize instances and regions of code for dynamic situations we anticipate, and we just select which regions and versions of code to execute when we experience that dynamic situation. This is very lightweight, very low overhead to actually employ when you want to change your code dynamically, and then you get the compiler in the loop, so there's some added benefit over something that only operates on the binary.
So it's an interesting trade-off, because there are things that you might not be able to do since you're not compiling dynamically, but then there are things you can do because you have a compiler. The caveat is that you have to statically decide what you want to specialize for dynamically. That's the caveat of this approach. And we also need, especially when we want to coordinate multiple co-running applications, to enable coordination between runtimes. We need to let these runtime systems talk to each other so that we can have coordinated adaptation policies across applications. So this is the mechanism, and it's low in complexity and low in overhead.

I'm not going to go into too much detail about the runtime system -- I call it LOAF, the lightweight online adaptation framework -- but this diagram kind of summarizes technically what's really going on for scenario-based multi-versioning. Basically, we have an application where each call to a region of code that you've specialized statically for dynamic situations is an indirect branch that uses a table, a global dispatch table, to know which version is active, and you can reroute execution dynamically using this dynamic introspection engine. The beautiful thing about this approach is that as your program is running along as fast as it can, doing its thing with one particular routing, your dynamic engine can simply write new values into this table and then restructure, reconfigure how your binary is running. So you can think of it as a reconfigurable binary, and this is nice because, as opposed to traditional multi-versioning where you have some kind of conditional on your critical path -- and if that conditional had to look something up in your environment, you'd have to execute it every time -- here, the program itself never has to conceptually pick what version to use.

>>: Are you trading an indirect branch for --

>> Jason Mars: A direct branch.

>>: Well, yeah, exactly, for a test that's going to -- a branch [indiscernible]. So isn't this actually more overhead than what you just said?

>> Jason Mars: Oh, so okay. There is a tiny bit of overhead. I've actually run the experiments; the overhead is less than one percent. The experiment to test that overhead was to just keep these values the same and just let the application run, right. And that overhead is less than one percent.

>>: The branch predictor works?

>> Jason Mars: Yeah, and you're not changing --

>>: The direct branch or indirect branch?

>> Jason Mars: Yeah. And when you're re-routing, you know, a lot of it depends on the granularity at which you're changing this table. If you're changing this table at millisecond granularity and tens or hundreds of milliseconds pass before the next change comes, then you don't suffer a major penalty from switching the table too much. It's much lower than -- it's much lower than traditional approaches.

>>: So the one percent was redirecting every call through this table?

>> Jason Mars: Right, every call of the -- it's not all the functions, it's the hottest functions. Every call to the hottest functions, so kind of like every call, since they're the hottest functions.

>>: Arnold and Ryder did an instrumentation framework like this for Matt's Ph.D. thesis that's very similar. They had the same kind of results even when the branch predictors weren't that good. Or as good as they are now.
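A minimal sketch of scenario-based multi-versioning as described above: the compiler emits several specialized versions of each hot region, calls go through a global dispatch table of function pointers, and the introspection engine reroutes execution by overwriting a table slot. The function names, the table size, and the use of C11 atomics are illustrative assumptions, not the actual LOAF implementation:

    /* Sketch of scenario-based multi-versioning via a global dispatch
     * table.  All names here are hypothetical. */
    #include <stdatomic.h>

    typedef void (*hot_fn_t)(void *args);

    /* Statically compiled versions of one hot region (stub bodies). */
    void hot_loop_baseline(void *args)  { (void)args; /* plain -O2          */ }
    void hot_loop_prefetch(void *args)  { (void)args; /* cache prefetching  */ }
    void hot_loop_throttled(void *args) { (void)args; /* contention-aware   */ }

    /* One slot per hot region; slot 0 is this hot loop. */
    static _Atomic(hot_fn_t) dispatch_table[16] = { hot_loop_baseline };

    /* Application side: every call to the hot region is an indirect call
     * through the table, so no conditional runs on the critical path. */
    static inline void call_hot_loop(void *args)
    {
        hot_fn_t fn = atomic_load_explicit(&dispatch_table[0],
                                           memory_order_acquire);
        fn(args);
    }

    /* Runtime side: the introspection engine flips versions asynchronously,
     * e.g. when hardware counters indicate a new dynamic scenario. */
    void runtime_select_version(int slot, hot_fn_t version)
    {
        atomic_store_explicit(&dispatch_table[slot], version,
                              memory_order_release);
    }

Because the table is only written every few milliseconds or less often, the indirect call stays well predicted and the measured overhead can stay in the sub-one-percent range described above.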
>>: There seems to be a testing problem. You have 20 such options. You have over a million versions of the program at any one time, and an even bigger infinity if it's a long-acting bug. What would the testing [indiscernible] say about that?

>> Jason Mars: I haven't heard that comment before. That's an interesting point.

>>: Your compiler --

>> Jason Mars: You're right. You allude to a really good point. Statically, if you want to do too much online, there's a state explosion, right? You can have all kinds of permutations of versions for different dynamic scenarios. So it's up to the optimization designer to not go too crazy, or it will start to not be as performant. So that's an interesting point.

>>: So now that you have all these versions, you have code explosion. Do you do anything to control that so your cache doesn't blow up?

>> Jason Mars: Yeah. So I don't do anything explicitly. Really, everything I do is in the designing: you identify your dynamic scenarios and you specialize accordingly, and it's in controlling the number of specializations you want to do. But I don't do anything clever dynamically. It's not hard to imagine coming up with some ideas to keep that state explosion in check, though. Actually, that's a very, very interesting point. Yeah, we should chat more about that.

So it's time to address those problems now that we have this runtime system. I mentioned the two problems, SBO and CAER: aggressive optimizations and detecting contention. I'm going to talk about SBO first and then contention-aware execution. So how do we use this kind of approach to detect when our aggressive optimizations are effective and then use them only when they're effective? I use an approach I call the competition heuristic, where, throughout execution, you try the aggressive version and compare it to the non-aggressive version and identify which version produces the better performance, using the dynamic information that gives you performance information online. Once you make that decision, you let the winner run for a while before coming back around to do another test, and this continues throughout execution. That's the competition heuristic, and it's shown to work quite well.

So on this graph, these are SPEC applications. I show the execution time normalized to O2 -- basically just O2 with no aggressive optimizations. Each of the three bars -- the green, red, and light green bars -- shows the performance when statically applying these different permutations, and then another bar shows the performance when you dynamically select the optimizations. We consistently do better than our baseline, which is O2, the conservative approach of just not using aggressive optimization. So we get a clear win over the conservative approach, and by and large we even beat statically applying any one of them. So it's pretty effective, and overall it is about a 4 percent to 10 percent performance gain over the --

>>: Which one is doing it dynamically versus the fixed?

>> Jason Mars: So just this one, and all the others are some fixed aggressive optimizations: no aggressive optimizations, cache prefetching, loop unrolling, and both cache prefetching and loop unrolling, using GCC 4.3. So we get about a 4 to 10 percent performance improvement there. And ask for my dissertation if you find any of this interesting. All of this stuff is in detail in my dissertation, and there are papers too.
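A minimal sketch of the competition heuristic described above, built on top of the dispatch-table sketch from earlier; read_instructions() is a stand-in for a hardware performance counter read, and the trial and hold intervals are illustrative, not the values used in the evaluation:

    /* Competition heuristic sketch: periodically race the aggressive and
     * conservative versions, keep the winner, then re-test later. */
    #include <stdint.h>
    #include <unistd.h>

    typedef void (*hot_fn_t)(void *);
    extern void hot_loop_baseline(void *), hot_loop_prefetch(void *);
    extern void runtime_select_version(int slot, hot_fn_t version);
    extern uint64_t read_instructions(void);  /* assumed counter read       */

    #define TRIAL_MS  10          /* sample each version for ~10 ms         */
    #define HOLD_SEC   2          /* let the winner run before re-testing   */

    static uint64_t instructions_with(int slot, hot_fn_t version)
    {
        runtime_select_version(slot, version);
        uint64_t start = read_instructions();
        usleep(TRIAL_MS * 1000);
        return read_instructions() - start;   /* proxy for IPS in the trial */
    }

    /* Runs continuously in the introspection engine's monitoring thread. */
    void competition_loop(int slot)
    {
        for (;;) {
            uint64_t base = instructions_with(slot, hot_loop_baseline);
            uint64_t aggr = instructions_with(slot, hot_loop_prefetch);
            /* Keep whichever version retired more work in its trial. */
            runtime_select_version(slot,
                aggr > base ? hot_loop_prefetch : hot_loop_baseline);
            sleep(HOLD_SEC);                  /* then come back and re-test */
        }
    }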
So now let's look at contention-aware execution, where we need to detect and respond to contention. Let's make it clear: what is this contention problem? If we have multiple applications --

>>: Before you get into this, can you give me a better sense, when you're compiling for different versions of these [indiscernible], what's the difference between the [indiscernible] choices?

>> Jason Mars: Right. So the choices I made in this work were to use the -- so GCC is configured for some canned heuristic knob settings, right. I use the canned -march, the micro[architecture] [indiscernible] -- maybe I'm getting too technical -- but the designers of GCC have selected values for loop unrolling, which is aggressive, and they've selected values for the heuristics for how to do the cache prefetching. I've just used those. I didn't have versions for each knob setting, and this gets into the state explosion, right. Because if you have different versions for each knob setting, you can start to really blow up.

>>: Like the unroll is the unroll by four? Which I happen to have in my head.

>> Jason Mars: Yeah, exactly. Actually, that's precisely right. The unroll is unroll by four.

>>: That's the magic number.

>> Jason Mars: Four. Just go four. Four and you can't go wrong. That's a great point. You have to really think carefully about how you specialize.

>>: One of the issues in doing trials is that when you measure something at a fine grain, you could measure one instance of the loop that only has ten [indiscernible] and you could measure one that has a million. So, of course, the one that takes a million takes longer even though it is the faster version, because it's optimized more correctly for the million version, right?

>> Jason Mars: Yeah.

>>: So how do you take that into account, or are your workloads so homogeneous that you didn't see that very much?

>> Jason Mars: Yeah, so -- that's a good point. I use kind of a coarse granularity, in the milliseconds. So it's ten milliseconds for comparing the different versioning schemes, the space of versioning schemes. And I refresh the information continuously after some fixed amount of time -- I think it was a couple of seconds. These are SPEC applications, right. So after a couple of seconds, it would just do the evaluation again, aggressive, non-aggressive, and update. So it's really dynamic, truly dynamic in this sense. Does that answer it?

>>: So you're just making local choices based on the current averages?

>> Jason Mars: Precisely. Absolutely correct, yeah.

>>: And how do you alternate between the versions? Are you running two separate versions and measuring both of them, or are you switching between them?

>> Jason Mars: Switching between them, using that rerouting capability -- just updating that table.

>>: You could get unlucky because you --

>> Jason Mars: Switch at the wrong time, make the wrong decision.

>>: But you're not seeing that in --

>> Jason Mars: Yeah, not across these applications. So that's good. All right. So what's contention? When you have multiple applications, if the working sets of the applications fit neatly into the early levels of cache, we don't have interference; we get to scale with the number of cores. However, as soon as we use shared resources like the shared cache or the bandwidth to memory, these applications can slow each other down by contending for those shared resources. That's how we get interference.
Now, in warehouse scale computers, this is particularly problematic. If we have applications A and B, each running on its own machine, they'll run at full speed. However, as soon as we co-locate them on the same machine, we can have a performance degradation, and a high priority application may have a quality of service requirement that it must meet. And so for this reason, in warehouse scale computers, co-locations between high priority applications and other applications are simply disallowed; the high priority application gets the whole machine. That actually wastes potential utilization, because not all potential co-runners will degrade the performance of our high priority application. If we can do a check dynamically to see whether these programs are contending, we can identify those that don't have any contention at all, right? So what we need is a runtime approach that can detect contention. You know, at the time I did this, a lot of folks told me that from software, there's no counter for contention. If you have a dip in performance, how do you know that it's actually contention? So is detection even possible?

Well, I came up with this approach. I call it the shutter approach. On your runtime, you have your high priority, latency-sensitive application, and we have a batch application. The runtime can shutter the execution of the low priority application, and if it observes a corresponding change in last level cache misses, corresponding with the shutter of the low priority application, then we can assert they're contending, because we see it with this little online test. So we do this little test, and we can know they're contending. And there are a whole number of ways you can respond. You can respond by killing it. You can respond by slowing it down, which is what's in the results I show. You can respond in a number of ways. But the cool thing is you can rule out those that don't contend at all and let those run, with your little shutter approach. This uses the same runtime infrastructure that I presented before, and it also works quite well.

Basically, what this graph shows is the degradation across these SPEC applications when co-located with LBM. When we just allow co-location, the degradation can be very high, but when we co-locate with CAER, the degradation is significantly reduced. This is across two different heuristics for applying CAER. For brevity, I'll just tell you that we reduce the interference from 17 percent to 4 percent on average and get 60 percent utilization from the neighboring core. This experiment was done with two cores on the machine, so with 60 percent of the neighboring core, we get an overall improvement in utilization of the machine of 30 percent, while allowing the co-location and getting that reduction in interference. The problem with this is that it's not precise. It doesn't deal with performance requirements, and the next part of my talk deals with just that. That's the Bubble-Up work, which I'm going to get to right after this question.
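A minimal sketch of a shutter-style contention check, using OS signals to pause a separate batch process as a stand-in for the runtime's own rate-throttling mechanism; the counter-read function, window, and threshold are all illustrative assumptions rather than the actual CAER implementation:

    /* Shutter-style contention test: briefly pause the batch co-runner and
     * compare the latency-sensitive task's last-level-cache miss rate with
     * and without it running.  Names here are hypothetical. */
    #include <signal.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/types.h>

    extern uint64_t read_llc_misses(pid_t task);  /* assumed counter read   */

    #define WINDOW_MS        10
    #define CONTENTION_RATIO 1.5   /* "misses jump 50% with co-runner on"   */

    static uint64_t misses_in_window(pid_t task)
    {
        uint64_t start = read_llc_misses(task);
        usleep(WINDOW_MS * 1000);
        return read_llc_misses(task) - start;
    }

    /* Returns nonzero if the batch job appears to contend with the
     * high-priority task on the shared memory subsystem. */
    int shutter_detect(pid_t high_priority, pid_t batch)
    {
        kill(batch, SIGSTOP);                     /* shutter: pause batch   */
        uint64_t alone = misses_in_window(high_priority);
        kill(batch, SIGCONT);                     /* resume batch           */
        uint64_t shared = misses_in_window(high_priority);

        /* If misses rise sharply when the batch job runs, they contend;
         * the runtime can then throttle or kill the batch task. */
        return shared > (uint64_t)(alone * CONTENTION_RATIO);
    }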
In developing dynamic solutions, we need to, one option is to allow the compiler to stitch in reconfigurability in our binaries. I think that's an option. That's something we might want to think about doing. Performance monitors are the key enabler of online techniques moving forward and we really need to think about how to build monitors not just for debugging, but for also for online dynamic techniques. The approaches I presented are realizable today and I believe will gain more traction as performance counters are built into the application binary interface, right, where currently, you know, there's a lot of systems where if VM Ware is using your counters, you can't use them. As soon as we have this built in to the application binary interface, we'll solve that. And I believe ultimately, the day will come when all code and systems will be continually restructuring like the warming of a cache. Okay. So finally, I'd like to show how we manage and measure this interaction, this interference so we can allow for precise predictions and precise co-locations. So what does that mean? So some applications have strict performance requirements, as I mentioned below. And some applications that interfere, we can allow, if it doesn't hurt our performance too much to violate that requirement. So here's a simple example. If we take search render, and each one of these bars shows the performance of search renderer when co-located with each of these applications. Now we're back with Google large scale real large applications. And what we observe is when we co-locate search renderer with each one of these applications, search renderer's performance is impacted variably. If we have a 90 percent performance requirement, I'm going to call this the 26 quality of service threshold, then some of our co-runners violate that performance requirement, while others do not violate that performance requirement. So what we want to do is be able to identify all of the co-locations that don't violate that performance requirement for precision, because of this over provisioning of resources we have where latency sensitive applications get run alone and we waste those cores. So we want to eliminate the uncertainty of that interference penalty. And we want to precisely predict the impact of quality of service to allow safe co-locations. So the goal is to have a general methodology that's platform agnostic. It can run on all those machines I presented before, and is deployable at a scale of warehouse scale computers. Now, prior work, there's a lot of work that predicts whether an application is contentious, but we don't have a way to predict how much it will hurt each other when running together. If you file two arbitrary applications, we don't have a way to know how much they'll penalize each other from a performance standpoint when running together on a machine. And here, we're trying to capture an interaction with resources that are not explicitly manageable or visible to software. A quadratic brute force profiling methodology is straightforward. You take all of your possible jobs and co-locate it, you know, with all the other possible jobs, but that's not suitable at the scale of warehouse scale computers as jobs are updated frequently and there's a large number of different types of jobs. So it doesn't work at the scale. The question is, is a linear approach possible? Can you look at each type of job once and then tell arbitrarily how much they will interfere with each other from a performance standpoint? 
Especially considering that that impact is based on the interaction with the co-runner. So the insight that led me to this approach is that a white box approach -- where we try to analytically model all of the different aspects of the processor: the prefetchers, the caches, the interconnect, bandwidth, the memory controller, replacement policies, queues, buffers, and even the secret sauce, which is private, like how the prefetchers on real architectures work -- may not be the right approach. It's high in complexity. It's not portable; you'd have to redo the work for every type of machine. And it may not even be feasible, because of this secret sauce stuff, which really matters a lot for performance. So the question is, is a black box approach possible, where we can just treat the whole memory subsystem as a black box? It's lower in complexity. It's portable -- you can move it to any black box and run the same approach -- and it's deployable on real systems with secret sauce stuff.

When thinking of this approach, I'm going to use an analogy: the man or woman in a dark room. You can think of the memory subsystem as a dark room where, even in the dark, you can feel out the furniture and kind of get an idea of the layout of the room. Can we do that as a methodology? That will come in the slide after next; it will be clearer how that actually plays in. But first, let me tell you essentially how Bubble-Up works. We capture a representation of the sensitivity and a representation of the aggressiveness of each of the applications. When deciding a co-location, we take that representation of sensitivity and aggressiveness and combine them to predict the performance degradation. It will be clear from this animation. From our profiling we have a representation of sensitivity, which is a sensitivity curve, which you'll see, and we have a representation of the co-runner's aggressiveness, which is the aggressiveness score. When we want to predict how much applications will hurt each other when running on a real system, we take the sensitivity curve and the aggressiveness score. The sensitivity curve shows, on the Y axis, quality of service as you increase this notion of pressure on the memory subsystem -- you'll see how that works in the next slide -- and you can take the aggressiveness score and use the sensitivity curve to predict an actual performance level when co-located.

So then the question becomes, well, how do you get this magical, awesome sensitivity curve and these aggressiveness scores? We use a bubble to produce the sensitivity curve and a reporter to produce the aggressiveness score, and this is where the analogy comes in. The bubble is essentially a stress test that provides a pressure dial, where you can turn up the pressure on the memory subsystem holistically. Just using loads and stores, you can turn up the pressure dial that holistically turns up pressure on the memory subsystem. As you turn up this pressure dial, as shown in the animation, you generate a curve of how the performance degrades as you increase the pressure on the memory subsystem. And that's how you get the sensitivity curve.
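Once you also have the aggressiveness score described next, the prediction step itself is just a lookup into this curve. A minimal sketch, assuming the sensitivity curve is stored as normalized QoS measured at a fixed number of bubble pressure levels; the curve resolution and names are illustrative, not the Bubble-Up implementation:

    /* Sketch of the Bubble-Up prediction step: given an application's
     * sensitivity curve (QoS at increasing bubble pressure) and a
     * co-runner's aggressiveness score (its equivalent bubble pressure),
     * look up the predicted QoS of the co-location. */
    #define CURVE_POINTS 20    /* bubble pressure sampled at 20 sizes       */

    /* sensitivity_curve[i] = normalized QoS of the latency-sensitive app
     * when co-running with a bubble of pressure i (1.0 = running alone). */
    double predict_qos(const double sensitivity_curve[CURVE_POINTS],
                       double aggressiveness_score)
    {
        /* Clamp and linearly interpolate between measured pressure points. */
        if (aggressiveness_score <= 0.0)
            return sensitivity_curve[0];
        if (aggressiveness_score >= CURVE_POINTS - 1)
            return sensitivity_curve[CURVE_POINTS - 1];

        int lo = (int)aggressiveness_score;
        double frac = aggressiveness_score - lo;
        return sensitivity_curve[lo] * (1.0 - frac)
             + sensitivity_curve[lo + 1] * frac;
    }

    /* A co-location is allowed only if the prediction meets the QoS policy,
     * e.g. predict_qos(curve, score) >= 0.98 for a 98 percent threshold. */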
And the reporter, it's trained. It's trained to know -- it's complicated how it's trained. It actually has a sensitivity curve of its own that's used in reverse, but I'm not going to get into too much detail. It's trained, and it basically reports how much it thinks you're hurting it. It reports what it thinks the application's aggressiveness score is, based on how its own performance is being affected. So that's how the reporter works. These are the two steps of the profiling approach: you take each application and you run it through this profiling.

So what is this bubble? Well, how can we conceptualize the bubble? If you take the degradation of some application A by a co-runner C, you can conceptualize that degradation as the sum, over each shared resource Ri, of application A's sensitivity to pressure on Ri times co-runner C's pressure on that resource Ri. And essentially, the bubble approximates this actual degradation by replacing the co-runner with a bubble of some size k, where k is the pressure closest to the co-runner's. If this isn't landing, it's all in the paper and the dissertation. Now, with this approximation we do have a source of error, and that source of error comes from the difference introduced when you replace that C in the actual degradation with the bubble of size k. So there's a hypothesis: if you design the right kind of bubble, you can minimize this potential for error. Good bubble design is key, and in the paper, I outline systematic principles for designing a bubble. If your bubble stress test has all of these properties, it should be a very good bubble with very small error -- one of them is monotonic curves. The details are in the paper, and I show there how I achieve each of these three things.

But let's look at how well Bubble-Up works. And I have some backup slides that show even more how well it works. So let's take the experimental scenario -- this is actually very important for understanding the graphs coming next. We have 500 machines, and each machine has six cores, right. Search renderer is configured to use three cores, and each machine is running search renderer. So we have 500 machines half loaded with search renderer, right, and we have three cores available for the co-runner. And we have 500 jobs ready to run, a randomly selected, even mixture of the 17 Google workloads, and we allow co-location that is steered by Bubble-Up. So as opposed to having each machine only run search renderer, which is kind of the current approach, we'll allow co-location, but only the co-locations that Bubble-Up says are okay. Our baseline is basically those 500 machines half loaded, so we're at 50 percent utilization. Now, if we say, okay, Bubble-Up, please allow all of the co-locations that don't degrade our program by more than one percent -- essentially the ones that don't cause contention, with a one percent leeway -- this is how much utilization improves as you change the QoS policy, the line that tells you how much performance you have to guarantee. At 99 percent, we're already experiencing significant gains, and at 90 percent, we're beyond 80 percent utilization. So we have a significant improvement in utilization.
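Going back to the degradation model described a moment ago, here's a minimal sketch of the idea under its stated assumptions: degradation is additive across shared resources, and a bubble of size k stands in for the co-runner's pressure profile. The resources, sensitivities, and pressures below are all made up for illustration:

```python
# Sketch of the additive degradation model (all numbers hypothetical).
# Degradation of app A by co-runner C is modeled as the sum, over shared
# resources Ri, of A's sensitivity to Ri times C's pressure on Ri.
# Bubble-Up approximates C with a bubble of size k whose pressure profile
# is as close to C's as possible.

RESOURCES = ["llc", "bandwidth", "prefetch"]

# Sensitivity of application A to pressure on each resource (made up).
sensitivity_A = {"llc": 0.06, "bandwidth": 0.03, "prefetch": 0.01}

# Pressure the co-runner C actually exerts on each resource (made up).
pressure_C = {"llc": 1.2, "bandwidth": 0.8, "prefetch": 0.5}

# Pressure the chosen bubble of size k exerts (made up).
pressure_bubble_k = {"llc": 1.1, "bandwidth": 0.9, "prefetch": 0.4}

def degradation(sensitivity, pressure):
    return sum(sensitivity[r] * pressure[r] for r in RESOURCES)

actual = degradation(sensitivity_A, pressure_C)
predicted = degradation(sensitivity_A, pressure_bubble_k)

print(f"actual degradation   : {actual:.3f}")
print(f"bubble approximation : {predicted:.3f}")
print(f"error                : {abs(actual - predicted):.3f}")
```

The gap between the last two numbers is exactly the source of error discussed above, which good bubble design tries to keep small.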
So overall, at the 98 percent quality-of-service policy, we have about a 50 percent improvement in utilization by letting Bubble-Up steer the co-locations. But I mentioned that there was a potential for error, so Bubble-Up can potentially make mistakes. It can predict that a co-location is safe when it does in fact violate the quality of service. That can happen to some degree. But we observe that when it does happen, the violations are very small, right. In the worst case, we have a violation of 3 percent. So all of this blue is correct decisions made by Bubble-Up, right, and the other colors show slight violations. At a one percent policy, the worst we get is a 3 percent violation -- actually a 2.2 percent violation, but it's between 2 and 3 percent. And this is across these 17 large, real-world applications. And in the paper, I show that if you change how you define your QoS policies, you can significantly reduce violations and even bring them down to zero. So this is a huge step forward.

>>: What if I were to just have a policy of allowing anything, or randomly allocating? Do you have anything you compared against? You provided the policy that describes when and where you could go, but I could imagine just a random --

>> Jason Mars: Right, just randomly co-locate anything. Well, this top graph -- that's an interesting point -- might be helpful for that. It shows the absolute number of violations you would have if you allowed all co-locations. So if you just allowed everything to co-run, this is how much violation you get. If you let Bubble-Up steer the co-running, that's how many violations you have, and on top of that, the violations are very, very small. So in the worst case, if you say I want 95 percent, you get 92 percent across these workloads, which is a very small violation. And you can change that; you can incorporate a tolerance in your QoS policy to reduce the number of violations, depending on what your contract defines. In fact, for companies like Google and [indiscernible], it's not contractual -- it's not like they lose money; it's just that their latency will be 3 percent more than they want. So this is actually a very big result when you're providing web services, software as a service. So that's pretty cool.

And then it's general; it applies across architectures. Here are two totally different kinds of architectures, right, and we let Bubble-Up allow the co-locations across these two very different memory subsystems and hierarchies. The same bubble, the same reporter, the same design -- everything's the same. You just blindly move Bubble-Up to a different architecture and it works just as well. And to be clear, just as well doesn't mean it has as many co-locations. Just as well means it predicts the co-locations accurately, because, remember, these processors have smaller caches, a smaller memory subsystem, so you won't be able to have as many co-locations at a given quality of service level. So Bubble-Up works quite well here.

>>: How many cores do those machines have?

>> Jason Mars: Oh, yeah. It's the same half-loaded policy, where we use half the cores and leave half the cores for the co-runner. The Core 2 has four, the AMD one has six, and the Westmere has six. So it's six, six, and then four for the Core 2, but these only show the Core 2 and the AMD.
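As a concrete illustration of how a cluster manager could apply those QoS policies, here's a hypothetical sketch of the admission check: allow a co-location only when the predicted quality of service meets the policy, optionally with the kind of tolerance just discussed. The job names and predicted numbers are invented for the example:

```python
# Hypothetical sketch of a Bubble-Up-driven admission decision.
# Allow a co-location only when the predicted QoS stays above the policy
# threshold, optionally minus a small tolerance the operator accepts.

def allow_colocation(predicted_qos, qos_policy, tolerance=0.0):
    """qos_policy is the fraction of solo performance that must be
    preserved (e.g. 0.98); tolerance is extra slack the operator allows."""
    return predicted_qos >= qos_policy - tolerance

# Predicted QoS of the latency-sensitive app with each candidate co-runner
# (made-up names and values).
candidates = {
    "protobuf_job": 0.99,
    "bigtable_job": 0.97,
    "ml_batch_job": 0.91,
}

policy = 0.98
for job, qos in candidates.items():
    decision = "allow" if allow_colocation(qos, policy) else "reject"
    print(f"{job:>12}: predicted QoS {qos:.2f} -> {decision}")
```

With a strict 0.98 policy only the first candidate is admitted; adding a tolerance trades a few small violations for more co-locations, which is the knob discussed above.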
The other results, the ones I showed earlier, were on the Westmere architecture. So basically, all the machines follow the same half-loaded, then fully loaded policy.

>>: Does your reporter function need to be tuned to the architecture of the specific machine?

>> Jason Mars: No, that's the beautiful thing. It's the same reporter, the same bubble. You just increase that pressure and the reporter programmatically --

>>: Measures it on the new machine?

>> Jason Mars: Yeah, yeah. You have to run the profiling on the new machine. On every machine, you have to run the profiling. So you profile your applications on the given machine, and then you can predict arbitrary co-locations. Now, that might not necessarily be necessary -- it's tough; that's a good, interesting point -- but I think there are a lot more interesting, cool things you can do with this kind of approach, and that's something I'm really exploring as future work.

So I've presented how we integrate an awareness of execution environments at these three levels, and now I'll just talk about what's next -- the areas of research that I'm planning to jump right into. The first little vision, which has many components associated with it, is to have the warehouse scale computer be an autonomous, sentient entity that is continuously self-aware of all the execution environments within the warehouse scale computer and dynamically responding and optimizing. You can imagine having automated learning and reaction to historical data, where you derive policies from observing anomalies. If you learn, for instance, that when this job has to make an RPC this far away, most of the time that job over there is going to suffer, then you can find these little patterns and automatically derive policies from them. And you can imagine using machine learning to assist with that. So here, you think of the data center as a robot that's continuously self-optimizing. You might also want to express policies to that robot -- tell it how to deal with certain situations -- and how do you express that? How do you program this robot? How do you design a language in which you can express things like sensing, responses, and the execution environment? It's really a language for performance anomalies in warehouse scale computers. So I wrote a grant on that, and it was funded by Google. There's a lot of excitement about this particular idea.

Another idea I want to go into is the notion of accelerating warehouse scale computers, right. In our phones, we have IP blocks; we have accelerators in our phones for different common workloads that we run all the time. At the scale of warehouse scale computers, we have the same environment. If Samsung can build an SOC, why shouldn't Google or Microsoft or, you know, any of these companies? So what you'd have to do is find the common algorithms, the common things. There's a lot of machine learning that happens in real time, right; that should be something we can build an IP block for, that we can accelerate. And you've got to get a taxonomy of these different types of workloads that are amenable to acceleration, and then we can start designing custom hardware for these different approaches and continue to argue to Google or Microsoft or any of these companies that really, this is the right way to go, because you can save significantly. So just two more.
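Before going on to those, here's a purely speculative sketch of what a policy rule in the kind of sensing-and-response language described above might look like; none of these names, thresholds, or actions come from the talk:

```python
# Speculative sketch of a sensing/response policy for a self-optimizing
# data center. Everything here is hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    name: str
    condition: Callable[[dict], bool]  # predicate over observed metrics
    action: Callable[[dict], None]     # response when the condition fires

def migrate_away(metrics):
    print(f"migrating {metrics['job']} off {metrics['machine']}")

policies = [
    Policy(
        name="remote-rpc-interference",
        # e.g. "when a job's RPCs go far away and a neighbor's latency
        # spikes, move the neighbor" -- thresholds are made up.
        condition=lambda m: m["rpc_hops"] > 2 and m["neighbor_latency_ms"] > 50,
        action=migrate_away,
    ),
]

observed = {"job": "search-renderer", "machine": "rack3/node12",
            "rpc_hops": 3, "neighbor_latency_ms": 75}

for p in policies:
    if p.condition(observed):
        p.action(observed)
```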
I really think there's a lot of opportunity, and there's emerging momentum, in integrating software with architecture -- vertical integration, and hoisting complexity, where we let the architecture, the microarchitecture, expose things to the software platform and take advantage of that. One interesting philosophy for doing that is hybrid architecture design, Transmeta-like designs where you have a software component to the chip and there's binary translation going on. People are starting to look at that commercially, and I think we don't really know yet what those designs can do. I think they can do a lot, and I want to realize that potential and demonstrate it.

Okay. So then finally, I think we don't have enough research on the software side. We have a lot of research in the hardware community on accelerators and SOCs and so forth, but we don't have enough software systems work. We have some, but not enough attention is being paid to how you build the stack on top of these specialized cores -- how do you abstract them so that they stay specialized but are generalized as much as possible, right? What are the right kinds of systems we need for that? So that's another promising area that piques my interest. And with that, I'll take questions.

>>: Thank you.

>> Jason Mars: Thanks.

>>: So you have sort of the commoditization of these data center components, and I guess the assumption there is that there's a market for such components. So the question is, if there are only three companies, or maybe four, in the world that build these, how many are there [indiscernible] and where is the market for this?

>> Jason Mars: So that's a good point. Those few companies have scale, right; they have volume. So chip architecture companies like Intel pay a lot of attention to them. However, I would look at the trend more broadly. I would look at the fact that, for the kinds of applications we're using computers for, a lot of that computation needs to live in that kind of environment, and we can imagine either a lot of smaller companies that have their own little data centers or a few big companies that have their own infrastructures. I think regardless of how that looks, we'll continue to have more and more of our workloads run in the cloud. But it's an interesting point. That's always a question: how much complexity and cost do you want to invest in something? You have to always think about whether it makes sense economically, right. What are the economics of what you're proposing? Does having an accelerator make sense if there are only, like, three of them that you'll need? So it's a great point.

>>: So I'm a little curious. You really focused a lot on the CPU and its systems. What about other components? If you take the exact same system and I change from standard spinning drives to SSDs, how much do the dynamics change for these applications?

>> Jason Mars: So the scope of this work really deals from memory up through the software -- through the cores, through the software stack. That's really how I focused this, and there are a lot of applications that never really leave memory.
Like, you know, web search is one where you shard the index over some number of machines, the query gets sharded out, sent out, and then the search never leaves memory, never goes to disk, and comes back with the result. But there are applications that go to disk all the time, like mail, right. When you click on some random email, it can hit a disk, right, over there somewhere in the cluster, and it may not even be the same machine that's serving the query. So then the bottlenecks change completely, right. The bottleneck becomes I/O, and it becomes the network, how congested the network is. So my work doesn't go to that level, but I think a lot of the principles I expose here apply, right. You could imagine having techniques for sensing how contended the network is, right. And I think in the network space there might be more opportunities in hardware, like having something you can ask specific questions of. But yeah, you could have a lot of the same principles. You may be able to have something like Bubble-Up for, you know, the network, or I/O to disk -- you can contend for bandwidth to disk and so forth. You can imagine the same kinds of techniques; many of my techniques translate over. But it's very interesting.

>>: I think that the [indiscernible] are less predictable than your memory controller, because when you're running, say, two applications on the same machine, that's different than when, you know, there are a thousand applications on the network and you're getting random traffic, random bursts of traffic. The disks might also be, just because of how they physically behave, a little less predictable. So building these kinds of models might not be as easy.

>> Jason Mars: Yeah, yeah.

>>: The next step would be to bring it up a level and do it across the whole data center.

>> Jason Mars: Yeah, yeah. I think that's a great point. And actually, there are a lot of researchers doing a lot of interesting work in the network for this exact same space. So yeah, I hope to collaborate with a few. Thank you.