>> Doug Burger: Okay. It's my pleasure to introduce Professor Karu Sankaralingam. Boy, I haven't said that in --
>> Karu Sankaralingam: I know. [laughter]
>> Doug Burger: [inaudible] but I can spell it hands down -- who's visiting us from the University of Wisconsin. Karu is -- I co-advised Karu for his Ph.D. He was the lead student on the TRIPS Project hardware, which built a really exciting prototype, and he did a ton of work. Since then he's gone on to glory at the University of Wisconsin on the faculty there, has done some really, really interesting work in accelerators, controversial papers about whether simulators are good, an influential paper saying that ARM and x86 actually aren't that different in power and performance at the level of the fundamentals of the ISA, which was actually very influential within Microsoft. A lot of people read that paper and have talked about it. So I'm really just proud of the work he's done since going to Wisconsin. And the thing I love about Karu's work is that he digs really deep and tries to get at fundamentals. So you won't be hearing a fluffy talk about the latest flavor of the day. He'll be trying to solve the problem. And he's currently in search of a universal accelerator, which may be an oxymoron, but we'll find out. So thanks very much for coming to visit.
>> Karu Sankaralingam: Great. Thank you for the introduction, Doug, and thank you all for coming. And it's great to have this opportunity to present some of the work that's been going on in my group. So as Doug mentioned, I'm going to be talking about our search for a universal accelerator. I hope to give you some evidence that we are nearing success of some sort. So please feel free to interrupt with questions at any time. You don't need to withhold your questions till the very end.
>>: You've caught a glimpse of the unicorn in the woods.
>> Karu Sankaralingam: I may have. I may have.
>>: You know what happens to unicorn [inaudible].
>> Karu Sankaralingam: Yes. [inaudible] Yes, I do.
>> Karu Sankaralingam: So a lot of this work -- or most of the work -- was done by my students. A lot of it was led by Tony Nowatzki, who is graduating soon; Vinay, who is a second-year or third-year student; and Chen-Han and Venka, who have since graduated, based on some of the early work they did on all these projects. So we have to begin almost all architecture talks with some kind of quantitative graph, so I figured I would go with this cartoon pie chart, which is the technology motivation for a lot of the work that has happened and is happening in architecture: the energy expended on execution is a very small portion of the energy expended on everything else, which is like fetching instructions, register renaming, all kinds of stuff. So this is where today's state-of-the-art high-end processors are. And how did we get here? How did we get into this mess? So this is a very, very quick history of 30, 40 years of processors. Up through the 2000s the entire field was focused on building great high-performance uniprocessors. Power wasn't considered such a big deal, so it was okay, until we got to a certain thermal point where it was not okay. That then led to the multicore hype, which lasted not as long as some people would have liked, but that's probably a good thing. And we figured out that just putting more cores isn't going to solve the problem. Right now there is an ongoing specialization era of trying to build various kinds of things on a chip and have each of them do its own thing.
>> Karu Sankaralingam: Let me drill down a little bit and --
>>: Can I ask a quick question?
>> Karu Sankaralingam: Yes.
>>: Why do you end the multicore era in 2010?
>> Karu Sankaralingam: Well, I -- sure. I should continue it for longer. I wanted to have a small time period so I could say it didn't last for very long, but that's not quite the right time period. It is something ongoing, and maybe the hype ended in 2010.
>>: Okay.
>> Karu Sankaralingam: [inaudible] so let me drill down into looking at what I'm calling domain-specialized accelerators. There's a lot of focus on this, people saying, oh, I'm going to build an accelerator for regular expressions, graph traversal, this, that, and bake it all into silicon, and then hope someone is actually going to build this and this is all going to make sense. I'm going to argue this doesn't make a lot of sense in multiple ways, but the good thing about all this is that people are writing these papers saying 100X performance, 1000X [inaudible] -- 1,000 times better performance than some baseline. So you read these and it's like, what the hell is going on here? How can it be a thousand times better -- yes.
>>: If the work is still about 8 or 10 percent of the work --
>> Karu Sankaralingam: Well, that is that [inaudible] also. But there's one good thing in all of this in one sense, which is, okay, great, if your technique is providing a thousand times, 50 times better performance, a few percent of simulation error doesn't matter. That's the only good thing about this entire paradigm of work, I would argue. I wanted to work that in somehow. But so what is the problem with all this, right? One, if everyone is just going to build an accelerator for this and that, then you're totally ignoring how we provide general purpose performance for workloads that don't fit into one of these domains. Second, it's like, how the hell do I program this? In fact, almost all of these papers don't have the word programming in them. It's like, here's my blob of silicon, now go. And there's design complexity: how do I integrate so many of these things, what memory do they talk to, how do I actually feed them. And I coined this term "obsoletion prone." Basically, once stencil computation goes out of favor, you just have the silicon sitting there. It's not doing anything for you. And I also call something the Snow White trap, which is how you decide which of these accelerators to build. Is mine the best-looking accelerator? Then build it. And there's this economics argument: okay, let's find out which domains are important, we'll build accelerators for them, but will they continue to be important? There's this big problem of which ones do I build; once I've put 16 of them in, what is the 17th accelerator I should put there? What [inaudible] what constraints should it meet? And [inaudible] just plain boring. What is the point of taking a workload and putting something in digital [inaudible] -- like what do I learn by doing it, what do I learn by writing this paper? It's okay if I don't write these papers, but I don't want to read these papers, for sure, because I don't learn anything from them. Right? So I stole this picture from Joel Emer. He calls these domain-specialized accelerators -- and I'm paraphrasing him -- white swans as opposed to black swans. So I just stuck in these notes over there. A black swan event is an event that comes as a surprise to the observer, and the distinct event has a major effect.
So domain-specialized accelerators are white swans. There's no surprise at all. You build the stencil accelerator exactly like you would expect to do it, and there's no major effect. No one's going to build this thing. There's no use of this thing. So what we really need to be doing, definitely in research, is looking for this black swan of having general purpose performance, easy programmability, and high efficiency, and you can call this [inaudible] accelerator -- it can go after various different workloads at very high efficiency. So, I mean, there's got to be some surprise -- maybe not, but it shouldn't -- then you will have a major effect, in that you can accelerate many types of workloads and have a large impact, rather than going after one narrow thing and hoping for the best. Okay. So I'm going to propose what we're calling the ExoCore processor. The word exo originates from linguistics -- it means originating from outside the core in some sense. There's also a biology root word for this, especially for a disease that is caused externally rather than internally. But the idea is, the principle, we're going to infect a company from the outside. But no. The technical idea is that principles of dataflow can actually be used to take some kind of conventional processor, a basic processor, and build a hybrid von Neumann dataflow execution engine that can provide high performance for general purpose workloads, so we don't give up on those. Okay. And then you can take these very same principles of dataflow, combine them with concurrency, and build one substrate onto which you can map many of these acceleratable workloads, which can give you these big integer-factor performance improvements. Okay. And a summary of the results, based on some simulation studies of this design -- we're looking at some prototype implementations right now: compared to a high-end [inaudible] processor we can get higher performance at half the power by executing in this hybrid mode of doing von Neumann execution in some phases of the program and switching into a dataflow mode during some other phases. And for these other acceleratable workloads, like deep learning, really highly concurrent workloads, you can use this general fabric. And we will show you can sustain 50X to 100X gains at lower power than this big [inaudible] core -- rather, this thing that we designed. And we'll also show that we can match the power efficiency of these white swans that I just described before, these purpose-built domain-specialized accelerators. So that's a summary of what we can achieve with some of the architecture ideas I'm going to talk about. Okay. So I'm going to focus on the fundamental principles that we believe we have discovered. They are relatively simple but yet very capable. Okay. So there are two principles that form the wrapper around almost all of this work. The first is that building an execution model that seamlessly allows you to execute as a von Neumann processor while quickly switching to a dataflow execution model -- and having this happen at very, very fine granularity and at low cycles to switch -- is really useful for executing programs that have these regions that prefer von Neumann execution versus dataflow execution. So this hybrid execution, surprisingly, is just not being looked at in the dataflow literature.
>>: What about Arvind [phonetic] 30 years ago?
>> Karu Sankaralingam: No, Arvind -- in fact, I didn't know if Arvind was going to be there.
[inaudible], my student, said Arvind is there, be prepared for some weird question from him. And so he got up and said something like, this is the only dataflow paper from which I've learned anything. Or something to that effect. And he actually went -- because we wrote this claim, and surely someone must have looked at this. And people are nuts. So you actually need to have this hybrid, and it needs to be at a fine grain so that you can switch from one to the other. We'll get into some reasons on why you need this, and then we can talk about whether you really need to do all this in the architecture, can you do it all in the microarchitecture, and so on.
>>: Okay. So now this is the question that you obviously came prepared for coming here --
>> Karu Sankaralingam: Maybe.
>>: -- why do you need, for No. 1, to split them? Why don't you just have a dataflow core that replaces the von Neumann core?
>> Karu Sankaralingam: You can. Depending on what the microarchitecture of the dataflow core is, the dataflow core alone may suffice.
>>: Okay. So this is more of a hedge to be more compatible with industry --
>> Karu Sankaralingam: Sure.
>>: -- where they currently are?
>> Karu Sankaralingam: Yes. Correct. And also for them to -- if I have a core, am I willing to throw all of that away and start with a brand-new design? I'm probably more likely to say, all right, I'm going to take my core, I'm going to take your idea, integrate it, and I'll see if it works. If it doesn't work, I at least have my core, I can still sell it. Did that answer your question?
>>: Yeah.
>>: [inaudible].
>> Karu Sankaralingam: Thank you. Apparently that's still my defense.
>>: Who [inaudible] -- are the remaining [inaudible] so big?
>> Karu Sankaralingam: So the second idea is -- second principle, rather -- is that you can combine concurrency with dataflow. It might seem a little redundant. Dataflow is inherently concurrent, right? But you can partition programs up into individual dataflow regions, and you can have multiple of those going concurrently. It's kind of like what you guys have been doing in Catapult and Bing. And that gives you a lot of the benefits of what all these domain-specialized accelerators are actually exploiting. Inherently they're going off of concurrency, and you can build it in a flexible way rather than doing it all in a domain-specialized way. So these are really the two fundamental principles that allow us to go after what we're calling these general purpose workloads and to go after these highly acceleratable workloads that have a lot of concurrency inside them. Okay. And yes.
>>: So you seem to be suggesting that dataflow is enough to cover all of these different types of accelerators. Is that really -- like how can you actually argue that? For sure, right?
>> Karu Sankaralingam: Yes. So I will -- I'm coming very close to that statement. Let me refine that a little more since you brought it up. To go after these workloads, we need a little bit more than dataflow. You actually need to do a little bit of communication specialization, where if you have operands being sent a certain way, you need to specialize for that. We need to do a little bit of data reuse specialization in terms of putting in a scratchpad or something. So those are needed also. These two give you the biggest bang for the buck. You do need these other two, and they get combined synergistically. So when we talk about the second part, the role of those other two will become more clear.
>>: So for the potential accelerators, you'd want that -- those different features would be enough?
>> Karu Sankaralingam: Correct. And we've also looked at some scenarios where this fabric is definitely not enough. And one great example of it was actually some of the FPGA work here on looking at compression, where basically you're just banging on this memory in arbitrary ways. The memory isn't that big. You just have this [inaudible] hash table that is doing like [inaudible] compression; then you just need to build a hash table, and it doesn't matter what compute fabric you put around it, you need that weird multiported thing. So that's one scenario. Regular expressions and things like that, where you're doing irregular memory accesses, you just need something else. So right now we have some work that's looking at how we can actually generalize it, and whether that can actually become another fundamental principle. Okay. All right. So I'll very briefly talk about some tools which were instrumental for us to be able to get this far along in this research. One is, for about 30 years people have been looking at various types of spatial architecture schedulers. I looked at a couple in my research as part of TRIPS, and there were lots of really clever techniques, like doing simulated annealing and various other techniques, to push that forward. And Tony, my student, came up with this idea that you could actually specify the general scheduling problem for any dataflow fabric as an integer linear program. And in the past 30, 40 years, that literature has made huge leaps and bounds in the kinds of problems they can solve and how fast they can solve them. So based on that, we came up with a technique that provides the capability to be a universal spatial architecture scheduler. So one input to the scheduler is the architecture and the other input is the code you're trying to schedule, and it will emit the scheduled code for that architecture. And the architecture itself is specified as a graph. So this gave us the capability to go and look at various parts of the design space without constantly having to build some kind of scheduler and be worrying whether or not our performance was gated by the scheduler that we had. And one of the fascinating results of this was, because we were using commercial state-of-the-art ILP solvers, we were able to beat the performance of published specialized schedulers for various different architectures. It was also very easy for us to write this paper because all the TRIPS coauthors were conflicted with this paper, so I could say whatever I want -- how about that -- and no one would bother trying to kill this paper, right? But they still did. It still took us four tries to get this paper in. And I'm like, oh, my God. But, anyway --
>>: So have you been able to compile very large programs with this?
>> Karu Sankaralingam: Yeah, yeah. So we compiled all the TRIPS benchmarks that were part of the TRIPS papers. We --
>>: Very large [inaudible]. [laughter]
>> Karu Sankaralingam: Aaron is a very, very self-critical reviewer, but --
>>: I remember it took you four times to get it accepted.
>> Karu Sankaralingam: No, and then it's bizarre -- we got this -- I'm not even a PLDI guy. We got a PLDI Best Paper award for this paper. But getting to your question, yes, we have done a lot of these big DySER programs -- DySER is this functional-unit-like thing we built in our group -- which have 100, 200 instructions.
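As a rough illustration of the idea just described -- handing the spatial scheduling problem to an off-the-shelf ILP solver instead of a hand-rolled heuristic -- here is a toy sketch. It is not the formulation from their scheduler paper: the tiny dataflow graph, the 2x2 grid fabric, and the Manhattan routing-cost objective are all invented, and it uses the open-source PuLP package purely for illustration.

```python
# Toy sketch: spatial scheduling posed as an integer linear program (ILP).
# NOT the formulation from the paper -- just an illustration of the idea.
# Requires: pip install pulp
import pulp

# A tiny dataflow graph: edges are (producer, consumer).
nodes = ["ld", "mul", "add", "st"]
edges = [("ld", "mul"), ("mul", "add"), ("add", "st")]

# A hypothetical 2x2 grid of functional units, identified by (x, y) coordinates.
units = [(x, y) for x in range(2) for y in range(2)]

def dist(a, b):
    """Manhattan distance between two grid units (a proxy for routing cost)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

prob = pulp.LpProblem("spatial_schedule", pulp.LpMinimize)

# place[v][u] == 1 iff dataflow node v is mapped onto functional unit u.
place = {v: {u: pulp.LpVariable(f"place_{v}_{u[0]}{u[1]}", cat="Binary")
             for u in units}
         for v in nodes}

# Each node is mapped to exactly one unit.
for v in nodes:
    prob += pulp.lpSum(place[v][u] for u in units) == 1

# Each unit holds at most one node (capacity constraint).
for u in units:
    prob += pulp.lpSum(place[v][u] for v in nodes) <= 1

# Objective: minimize total routing distance across dataflow edges.
# The product place[a][ua] * place[b][ub] is linearized with an auxiliary binary.
cost_terms = []
for (a, b) in edges:
    for ua in units:
        for ub in units:
            both = pulp.LpVariable(
                f"edge_{a}_{b}_{ua[0]}{ua[1]}_{ub[0]}{ub[1]}", cat="Binary")
            prob += both <= place[a][ua]
            prob += both <= place[b][ub]
            prob += both >= place[a][ua] + place[b][ub] - 1
            cost_terms.append(dist(ua, ub) * both)
prob += pulp.lpSum(cost_terms)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for v in nodes:
    chosen = [u for u in units if place[v][u].value() == 1]
    print(v, "->", chosen[0])
```

The point of the sketch is only that, once the placement problem is written as constraints like these, finding an optimal mapping is "somebody else's problem" -- the solver's.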
We fed it even bigger ones as part of some sensitivity studies we're looking at right now. So okay. So a million lines of code is eventually going to get broken up into regions. If we had like a huge [inaudible] FPGA that has like a billion LUTs, then you could try to map the entire million lines of code, but you're typically going to break it down. Did that actually answer your question?
>>: Yeah.
>>: Integer [inaudible] is known to be NP-hard. You know how to get around that problem, or is it somebody else's problem?
>> Karu Sankaralingam: Exactly. That's exactly right. It is somebody else's problem. And the great thing is -- and you hit the nail on the head -- that is exactly why people have been avoiding this problem for 30 years. They have the introduction for all these papers -- it's like, oh, we have a scheduling problem, oh, we have one graph, another graph, it's NP-hard, let's go solve some other approximate problem because we cannot solve this. Sure, this is an NP-hard problem. But this is what the ILP guys do. They've been solving this for 40 years. They go and have heuristics, and they automatically determine the heuristics that apply for the instance of the problem that you're presented with, and that can be totally tractably solved, and solved optimally. It's actually totally kick ass. So basically they used all this literature that's been around for a long time. And not to say we didn't do anything clever here. The clever thing we did was to specify the scheduling problem as an ILP question, which is a bit of work and requires some creativity. So the other thing is we came up with this simulation mechanism that can model the role of the architecture and the compiler in a single framework we've been calling the transformable dependence graph, which allows us to look at various parts of the design space without having to first implement a complete compiler and a simulator for it, which is all multiple years of work. We can then use a very rapid design space exploration methodology, which will be appearing in this year's ASPLOS and was also recently published as a CAL paper. So some of the results I will be presenting will be looking at a very, very vast design space. So if you're wondering how we were able to even capture all of that, it's part of this technology we built about two years ago that allows us to do this very rapidly. Okay. All right. So getting back to this graph. So why does dataflow actually allow us to do that? The simple answer is that dataflow principles can eliminate these overheads by getting rid of having to fetch and decode and retire instructions every cycle, and allow you to focus on just the execution part. Okay. That's really the main high-level bit. So very briefly, the history of how we got here was, from about 2008 to '13 or so, we built this architecture called DySER, which was a way of having a specialized functional unit that we were going to integrate into a processor pipeline and have the processor feed this thing, with the load slice executing on the main processor and just the computation mapped onto this thing. And this itself would execute in some kind of dataflow fashion. So we built a prototype, we built a compiler. This kind of drove a lot of these problems that I talked about in terms of the compiler and simulation infrastructure that we developed. So then after we built this, we did our prototype, we did lots of evaluation, and we learned a lot of lessons.
Thankfully it wasn't like, oh, we built this, let's move on. So one thing is, well, if you build this thing and then you try to feed it, the thing that's feeding it becomes the power bottleneck. Sometimes it even becomes a performance bottleneck. We then showed how we can use dataflow principles to come up with better ways to feed this thing. I'm going to talk briefly about the compiler, but what is relevant for this talk is that we observed from all of that work that program heterogeneity is abundant in the execution of our programs, and you can come up with various types of dataflow techniques to go after various regions and fine-tune the technique itself to make the execution way better than executing on a conventional processor. Okay. So this was the main takeaway from building all of this and figuring out what it is you're actually trying to do. So that's what I'm going to be focusing on: this first principle of how I can have a hybrid von Neumann and dataflow execution to get the benefits of both of these things. Okay. We'll sidestep a little bit the question that Doug brought up -- well, why do you need the von Neumann core -- and we're going to assume you need it just for compatibility and so on.
>>: How do you get memory [inaudible] processor?
>> Karu Sankaralingam: You can deal with it in various ways. Once I talk about the microarchitecture, if I don't answer it, you can ask me again. Okay? All right. So here is the -- just to make sure we're all on the same page. The architecture is going to consist of a von Neumann core. You're going to then have an explicit dataflow processor, which will have its execution encoded in some kind of dataflow ISA. Okay. And we'll talk about why von Neumann can even complement dataflow architectures, talk about some program properties we can leverage to have this hybrid execution model, talk about our microarchitecture proposal here, which is essentially a synthesis of many existing ideas -- there's nothing super novel here -- and talk briefly about some results on how much better you can do. And the overall execution model is going to be: you execute on this core; when you enter one of these regions that are best suited for executing in dataflow mode, you'll switch to this thing, you'll power gate or turn this thing off so you get a lot of power efficiency, and this will execute at much higher performance and lower power than having to execute it on this thing. Okay. That's the overall execution model we're going to have. All right. So I'll go over this very quickly. Lots of you are likely familiar with dataflow and so on. But the basic high-level idea is, if I compare von Neumann execution to explicit dataflow execution: if you need speculation because you're highly control dependent and there are data-dependent branches that correlate [inaudible] very efficiently, you are likely going to need something like this to get high performance, because predication alone may not get you all the benefits, okay; as opposed to, if you have lots of operand-level parallelism and there's local and non-local ILP -- as in far-away ILP -- then you want to be executing on some kind of dataflow substrate so you can get all of the concurrency without much of the overheads. Okay. And this example is just a cartoon that shows how something like this can happen on a very simple straight-line dependence chain with some control flow happening based on the load I just did.
So whatever you do with the dataflow processor, you're going to be much better off on a control flow processor for something like this. Okay? And here you have lots of operand-level concurrency, which could be obtained even with a SIMD processor, but that's not so important. You can do well on a dataflow processor with this because the loads are not on the critical path of whatever it is you're trying to do. Okay? So here is a more general characterization of what is actually going on. If I look at control regularity and memory regularity along two axes: if both are highly regular -- right, GPU-friendly codes -- then you do it on a SIMD engine, you do it on a GPU. You don't really bother with those kinds of workloads on a general purpose processor. Okay. If you have very low memory regularity -- you're highly irregular -- then you're just going to be waiting for memory all the time. So you want to have a processor that's really efficient at waiting around, that doesn't have a lot of structures to maintain all this [inaudible], so you're likely going to do well on a dataflow processor here because you don't have all these overhead structures. Okay? And if your memory becomes highly regular but you have some control irregularity, then again, it's not clear. Depending on exactly what the code behavior is, you could be better off on a dataflow processor. And if you're somewhere in this middle region, right, where you have some amount of control regularity and the memory behavior varies between regular and irregular, then the low overheads of the dataflow processor are going to allow you to get more power-efficient execution, and the higher concurrency inside it will allow you to get higher performance. Right? And for the rest of the regions, where control irregularity dominates what's happening, you will need some kind of mechanism that allows you to execute those regions efficiently, which is why you'll need some kind of hybrid execution -- or which you can overcome by taking some of these mechanisms and embedding them inside your dataflow core itself. Yes.
>>: [inaudible] one more time what you think the mix is -- or what is good at this [inaudible] because of speculation.
>> Karu Sankaralingam: It's basically code that requires some kind of control speculation because you don't have that much operand-level parallelism, so your dataflow isn't doing anything, it's just stalled a lot. But your control flow core will be able to run ahead much better than the dataflow core. Okay?
>>: Are these regions detectable dynamically during execution?
>> Karu Sankaralingam: So we'll get to that in the next slide -- whether or not we are proposing you can actually do this statically, and how you can get away with using these cores.
>>: Isn't most of the energy in the out-of-order core because of the out-of-order [inaudible]?
>> Karu Sankaralingam: Sure.
>>: So by combining a dataflow accelerator, I mean, you're going to have to offload a lot of compute to the accelerator.
>> Karu Sankaralingam: Yes. You will. That actually goes back to the previous question. So what we can offload actually becomes the big question, right -- what can we offload to this thing? And so we'll be presenting some quantitative data here. So for this we particularly [inaudible] only -- this is kind of subsetting to make it look bad for us. So we looked only at SPECint workloads and irregular workloads from Mediabench, so we look at only these hard workloads, right? If you threw SPECfp in here, our numbers would look even better. Okay? All right.
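To make the two ends of that spectrum concrete, here are two invented toy kernels -- not taken from the talk's benchmarks -- one dominated by a data-dependent branch fed by a load (the case that favors the speculating von Neumann core), and one with abundant operand-level parallelism where the loads are off the critical path (the case that favors a dataflow substrate).

```python
# Two invented kernels illustrating the region types discussed above.

def control_dependent(data, threshold):
    """Favors the von Neumann core: the branch depends on a value that was
    just loaded, so branch prediction / control speculation is what wins."""
    total = 0
    for x in data:            # 'load'
        if x > threshold:     # data-dependent branch fed by that load
            total += x * 3
        else:
            total -= x
    return total

def operand_parallel(a, b, c):
    """Favors the dataflow engine: each iteration's multiply/add chain is
    independent, so there is abundant operand-level parallelism."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] * b[i] + c[i]   # independent work every iteration
    return out

print(control_dependent([1, 5, 2, 9], 3))        # -> 39
print(operand_parallel([1, 2], [3, 4], [5, 6]))  # -> [8, 14]
```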
So we're going to look at these workloads and figure out what program properties we can use to actually exploit this graph I just showed before. Okay? So the first property is this thing we call affinity phase behavior, or underlying phase behavior in the program. So this axis is time and these are different applications. Blue stands for preferring von Neumann execution, orange stands for data parallel, and red stands for either one or the other. The main thing is that these programs -- or most programs -- have these phases happening at relatively fine granularity. It's not like you execute for a billion instructions, offload once, and that executes for another billion instructions. These happen at a granularity of thousands, tens of thousands of instruction cycles. So programs are constantly moving from one phase to another, okay -- in fact, so much so that if I had an ideal dataflow processor to which I could offload instantaneously, only 25 percent of the programs wanted to be in this thing all the time. Most of the time you really wanted to be in some kind of hybrid execution mode where you're constantly moving from the von Neumann core to the dataflow core, because they have these program properties intermingling within the entire program's execution. Okay. So that's going to motivate us to have this kind of processor organization. We're going to take an explicit dataflow engine and integrate it within the private L1 cache hierarchy. So you could migrate some register values very quickly, execute here for some time using the same private level 1 cache, and then switch back once you're done with this execution. If the granularity was much larger, you could do other things. Okay. All right. So then we need to actually come up with some --
>>: So just so I understand that, are you switching within the speculative pipeline, or is it post commit?
>> Karu Sankaralingam: This is post commit. Yeah. So what program regions could we do? We could take arbitrary code, right -- functions, whatever. If you do that, then the mechanisms -- this goes back to Aaron's question -- the mechanisms you need inside the dataflow processor are heavyweight, because you need to cover all kinds of program behavior, and that ends up causing some [inaudible] power overhead. Depending on how you design this, you could lower this, but I'll call it relatively high. Okay. If you did only inner loops of programs, then you could come up with a very simple processor organization, but the coverage, as measured from these programs I just showed before, is that you're getting only 61 percent of all the instructions in the program. If you did only traces, you get only 41 percent. And if I take only longer-duration regions, this comes down even further. So neither of these is a good idea. If you look at what we call nested loops, then you could still come up with a mechanism that is relatively low area and power, low design complexity, and you get pretty good coverage. So 74 percent of the entire program's execution can cumulatively be offloaded to the dataflow engine, leaving only 26 percent on average on the high-power von Neumann core. And if you restricted yourself to only long-duration regions, the number is still pretty high: more than two-thirds of the program can be offloaded to this low-power thing. Okay. So this is our motivation, and we're going to go after a mechanism that will allow us to offload nested loops. Okay.
And the whole nested loops idea is kind of -- it's a little bit important to understand, because it means you need to have some kind of control flow support inside this core; it's not just one inner loop that's being offloaded onto this thing. Okay. All right. I will go over this relatively briefly. But the main idea is I want to -- yes.
>>: Sorry. Can you go back one slide. So I'm trying to understand the difference between nested loops and inner loops. Because it seems like if I had a nested loop, the inner loop would dominate. But here there's a 13 percent difference between the two. The one on the right is a subset of the one on the left.
>> Karu Sankaralingam: So, well, this diagram isn't quite as expressive as it probably could be. So the trip count of this is small, and when you have multiple of these happening and you don't want to unroll all of this -- because the dataflow fabric usually has some spatial constraints -- then you really wouldn't be able to fit the whole thing. Right? And if you tried to --
>>: Why is your coverage higher with nested loops? Because don't you have the same issue with nested loops?
>> Karu Sankaralingam: So if you come up -- yeah. If you come up with an encoding that allows you to express this control flow in some way without unrolling the whole thing, then you can come up with a smaller spatial fabric that can take the entire program's execution.
>>: But if you -- but if you come up with that mechanism, why doesn't it apply to inner loops?
>> Karu Sankaralingam: You could. That's what I said, which is, depending on how you actually design your fabric, you could move from one to the other.
>>: Okay.
>> Karu Sankaralingam: Okay? Did I actually answer your question, or --
>>: Yeah, well, the -- the -- the -- if you take 13 percent, it's like -- I just don't -- it doesn't seem right to me that the difference between nested loops and inner loops should be a fifth, right, of the inner loop.
>> Karu Sankaralingam: Oh.
>>: On the time you spend in the inner loops, the -- I would expect inner loops to dominate nested loops as well. In fact, dominate them even more.
>> Karu Sankaralingam: Inner loops to be more than nested loops?
>>: Well, what do you mean by coverage?
>> Karu Sankaralingam: Maybe that's the [inaudible]. This is the total cumulative program execution you could offload.
>>: [inaudible] instructions. So really what you're saying is the -- within nested loops, the code that is in nested loops but is not in inner loops is 13 percent of the program.
>> Karu Sankaralingam: Oh, I see.
>>: And -- and the inner loop is 61 percent of the program, and that just seems really high to me given that the inner loop is going to be running a bunch. Right? Because you're -- because you're going to -- you're going to iterate that inner loop some number of times, and then there's a bunch of additional code. But that additional code gets executed once per iteration.
>> Karu Sankaralingam: Okay. Yeah.
>>: So that -- it just doesn't add up.
>> Karu Sankaralingam: So this has -- okay, I should --
>>: I would actually point to the long duration, that that's even more of an issue there.
>> Karu Sankaralingam: So there are some implicit heuristics we're also using here in terms of when I offload. I think the objective function we were using is that I want to get a better energy-delay product by executing on this offloaded thing, so it has some built-in mechanisms -- built-in heuristics -- on how efficient the mechanism we are using for each of these is.
>>: Okay.
>> Karu Sankaralingam: So, for example, if you're doing a trace, we were saying you would have these sequences of instructions, or compound function units, for this. For the inner loops we were saying I have a little bit of control flow mechanism. So there are some heuristics on how much energy benefit you can get, and only then do we allow the offload decision.
>>: Okay.
>> Karu Sankaralingam: This is just not a pure "I can offload anytime" breakdown.
>>: So it could be a lot closer --
>> Karu Sankaralingam: Yes.
>>: -- and you're discounting a bunch of stuff --
>> Karu Sankaralingam: Correct. Correct.
>>: So why is it that you can -- not to beat on this graph forever --
>> Karu Sankaralingam: Yes.
>>: -- why do you get such better coverage when you don't look at the trace? Because the trace wouldn't encompass the dynamic path through the loop, right?
>> Karu Sankaralingam: So, well, okay. So here --
>>: You'd also get function calls, which would not get --
>> Karu Sankaralingam: So the -- okay. It depends on your code behavior, right? So a trace is defined as something where you enter and then you exit a single path of control. There's no predication, no hyperblock, nothing. And any time you deviate, you need to get back to the start of the execution and start over from scratch. So if you have a little bit of irregular control, most of the time you enter these traces -- or rather, many traces -- you end up basically throwing them away and restarting from scratch.
>>: So you're saying you have a lot of side exits from the trace.
>> Karu Sankaralingam: Yes.
>>: Early exits from the trace.
>> Karu Sankaralingam: Yes.
>>: So that's why. So maybe the trace -- well, I don't know how you formed your traces. Maybe that just needs to --
>> Karu Sankaralingam: So this uses -- again, like I said, this graph embeds a lot of heuristics on how we could implement all of these. And some of the design decisions will come back again when I talk about the actual dataflow fabric we are using for these slides. All right. So the high-level overview, just so that we can get to how the architecture actually works, is: you take a program, and we have a compiler pass that will detect these nested loops -- because nested loops are a program property -- and you can have some heuristics on trip count to figure out which ones you actually offload. And then you embed an explicit dataflow encoding in the binary for these regions. And then when you run the program and you enter one of these regions, you transfer live values -- whoops -- live values from the main core into this dataflow fabric. You execute on it for some time. When you're done, you transfer back all the architectural state and you continue on in the main core. And since you're running on this thing for long enough, you will power gate the out-of-order core. That will give you some power gating benefits. Plus this thing will run at lower power for that code region anyway. Okay. And so the main thing is you would transfer from von Neumann to dataflow at various regions, and you will need some intelligent scheduling technique, because you actually want to schedule, or stream in, the code that runs on this dataflow fabric before the region actually starts -- which you can do with some very simple prediction technique that tells you I am going to enter one of these regions, by starting the configuration well before you actually enter the program region. Okay.
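A minimal sketch of the hybrid execution flow just described follows. The class, field names, and the 10,000-instruction cutoff (mentioned later in the talk) are stand-ins invented for illustration; this is not the actual SEED hardware or compiler interface.

```python
# Hypothetical sketch of the von Neumann <-> dataflow offload flow described
# above. All names and thresholds are invented, not the real SEED interface.

class Region:
    def __init__(self, name, est_insts, edp_ratio):
        self.name = name            # nested-loop region found by the compiler pass
        self.est_insts = est_insts  # estimated dynamic instructions per invocation
        self.edp_ratio = edp_ratio  # predicted energy-delay vs. the big core (<1 is better)

MIN_INSTS = 10_000  # assumed cutoff, used here only for illustration

def run_region(region, live_in):
    if region.est_insts >= MIN_INSTS and region.edp_ratio < 1.0:
        # "Config" step: stream the region's dataflow configuration ahead of
        # time, transfer live register values, power-gate the big core.
        print(f"offload {region.name}: configure fabric, "
              f"copy {len(live_in)} live values, gate out-of-order core")
        result = f"dataflow({region.name})"
        print(f"return from {region.name}: copy architectural state back, "
              f"wake out-of-order core")
    else:
        # Region too short or not profitable: stay on the out-of-order core.
        result = f"von_neumann({region.name})"
    return result

print(run_region(Region("stencil_loop_nest", 50_000, 0.4), live_in=["r1", "r2", "r3"]))
print(run_region(Region("pointer_chase", 800, 1.3), live_in=["r1"]))
```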
And so that's this whole SEED config instruction that we introduce, which captures all of the state that needs to be transferred and the instruction state that needs to be put into this fabric, and the begin instruction actually tells you that now is when this region actually starts -- yes.
>>: So over there in the system architecture, are you representing -- so where you say it's a core, it's an actual core, not a hardware thread?
>> Karu Sankaralingam: Yes.
>>: [inaudible] companion typically?
>> Karu Sankaralingam: Yes.
>>: So each core would have a SEED --
>> Karu Sankaralingam: Yes. Yes.
>>: -- and how does the threading of the core relate to the SEED?
>> Karu Sankaralingam: This -- yeah. So we -- for now let's ignore hyperthreading or multiple threads. Each one executes on the SEED. For now let's assume that at any point in time only one thread can be using the SEED engine. So when --
>>: So if you had multiple hardware threads for the core, there would still be only one SEED and you just control the entrance to the --
>> Karu Sankaralingam: Yes. Correct.
>>: [inaudible] happens when you [inaudible] decides to schedule another thread after you've sent the SEED config -- do you have some regions that kind of become atomic?
>> Karu Sankaralingam: We can -- we'll get to how we can get precise exceptions, and exceptions at some point, once we talk about the architecture design. Okay? Yes.
>>: What about the state for the host component? Say if it's a short enough time, like DRAM, stay alive and keep the state? So can you power down these two?
>> Karu Sankaralingam: Ah. You want to power down more than just the core itself?
>>: You mentioned it's a nonstate, so I'm saying the stateful could be powered down for about as long as a DRAM cell is.
>> Karu Sankaralingam: You could. Right now we are -- for these programs, at least the types of programs we are looking at, they have relatively good L2 cache behavior and we have some decent L1 cache behavior, so we want to keep this on. We want to play nice with the virtual memory, so we're actually going to keep the TLBs on so we can do the translation. So the main power -- we get some leakage power savings by turning this off, but the main power savings we are going to get really is that this thing is more power efficient at executing the code than this thing. So turning this off is just a little bit of leakage power savings that we're going after.
>>: So jumping ahead a little bit -- I know you'll get there -- but when you run the native code on the OOO versus, let's say, a region you've offloaded onto SEED, what is the factor difference in energy efficiency?
>> Karu Sankaralingam: Energy efficiency, uh --
>>: How many joules do I burn running code on the OOO core versus the equivalent loop on SEED?
>> Karu Sankaralingam: Okay. I don't -- let me see if I can jump ahead to a performance --
>>: [inaudible] might help.
>> Karu Sankaralingam: I don't. Because I know it in terms of performance, and I know the overall energy. I don't have the individual region-by-region energy breakdown off the top of my head.
>>: It'd be really interesting to understand what factor [inaudible] -- like just kind of what -- you know, you started this off with the pie chart.
>> Karu Sankaralingam: Yeah. So we have the numbers; I just don't know them off the top of my head. I don't want to give you some bullshit answer. Okay.
>>: Wouldn't be any different.
>> Karu Sankaralingam: I know.
>>: That's a good one --
>> Karu Sankaralingam: I don't want to give you -- I don't want to give you a bullshit answer that seems too obviously bullshit.
>>: [inaudible] your defense. [laughter]
>>: Okay.
>>: I have a question.
>> Karu Sankaralingam: Yes.
>>: A similar question. So what is the -- is there some notion of a minimum amount of work that needs to happen with the SEED for it to be worth whatever the overhead of the transfer is?
>> Karu Sankaralingam: Right. So for these results, we have an implicit cutoff that you need to be executing at least 10,000 instructions. And the results I will be presenting will be based on an Oracle scheduler that also, after the fact, goes and checks whether the objective function, which was energy delay, actually improved. We also have a heuristic that has some mechanisms to detect this -- we call it the [inaudible] scheduler -- which will do this, and it's been optimized for a dual-issue core, and that ends up being pretty accurate in terms of offloading only when necessary. Did that answer your question?
>>: Yeah, yeah, yeah.
>> Karu Sankaralingam: Okay. All right. So I have about 15 minutes. Let me see -- I'll describe the architecture briefly, and then we'll get to some results at least. Okay? All right. So this is just a brief slide that talks about a rich history of dataflow work that starts from my dissertation. No. But --
[laughter]
>>: [inaudible]
[laughter]
>> Karu Sankaralingam: Okay. All right. So we have a version of this slide which goes all the way back to 1977. Okay? I'm not going to get to all of this, but the point of this is there are a lot of dataflow mechanisms that are going after various different design decisions in terms of how they handle control flow, how they do the dataflow firing rules, and what the execution units themselves are. And some of these are actually dataflow processors, but they don't call them dataflow processors -- like the BERET work. It's basically a very tiny dataflow engine that's doing compound function units. Right? So we can judiciously combine a subset of these to come up with one microarchitecture that gives us efficient execution on nested loops. This is not the only way to do it. This is one way to do it. Kind of think of this as a way to come up with some quantitative numbers that give you a good sense of the potential for something like this. So what we ended up doing -- our criteria were that we wanted low area and power, because we're going to take one of these and attach it to every core. If we said I want to have one of these for my entire chip, I would make totally different design decisions. So I want this to be small. I want it to go after certain types of these things. So that's why we did these things. We also wanted to complement the capabilities of the von Neumann core. So we don't want to go and reinvent a bunch of load-store queue mechanisms, memory [inaudible] mechanisms. So if you're going to do purely nonspeculative dataflow execution, you will just serialize when you get to some regions, and that gets embedded into the heuristics which you'll use for offloading code to this thing. Okay? So basically we integrate between the level 1 cache. We have some relatively simple mechanisms for doing control flow with predicated execution. We absolutely don't allow any control speculation. So this is purely nonspeculative dataflow. And we'll use some very simple dataflow-based firing rules.
And to amortize how much -- how little -- instruction fetch and decode we need to do, we're going to have execution units that allow compound function units, combining multiple primitive operations into a single big block. So you combine all of this and you get a microarchitecture that looks like this. We have eight compound functional units that within them have the capability to do two to five primitive RISC-like instructions' worth of work. We will have simple instruction storage resembling the classic token store from the original dataflow machines. We will then have a simple bus which will allow us to transmit two values every cycle. So this is, again, a heuristic engineering design decision we came up with -- this was sufficient. That will then end up triggering more dataflow execution on this entire thing. Okay. We'll have a store buffer to serialize stores going out to the memory system. We'll have a transfer block whose job is to take the architectural values from the main processor and move them into this thing. We have a configuration unit whose job is to put instructions into the instruction store and do a little bit of configuration of these compound function [inaudible]. Okay. So that's the main high-level architecture. And in terms of why this actually performs much better than the out-of-order core: in parentheses you can see the theoretical maximum IPC we can get. Because we have these compound function units, and we have eight of them, we could get as high as 16. And I'll make some time to go over some performance numbers. You'll see that we get some really high speedups in some program regions. And the instruction window ends up being effectively much larger because it's distributed, plus each instruction is doing much more work than a primitive RISC-like instruction, even on a [inaudible] machine. Okay. And we got rid of control speculation because if you get to those regions, you'll just do them on the von Neumann core. Okay? All right. So let's talk about some results. So this is based on a relatively good scheduler -- I mean simulator. And we've configured it so we're not lowballing the out-of-order core. So I have very high confidence in these results. Okay. All right. So all of these graphs are performance versus energy, the geometric mean across all the benchmarks that I described. And we're going to take an [inaudible] core, or a two-way, or [inaudible], and integrate a SEED engine with them. Okay. So going lower on this graph is basically better. We also added just a BERET core -- or a BERET engine -- to it. We added conservation cores, which basically harden basic blocks and attach them to a core. We added in-place loop execution, and we added the set of bigLITTLE [inaudible] what it would get. Okay. So as you can see, these existing techniques give you relatively small performance improvements. They're not big game changers in terms of how you can push the envelope here. Okay. With SEED you can get significantly higher benefits. You can lower this curve by quite a bit, getting 40 percent, 50 percent higher performance, and do all this at lower power. Okay. So --
>>: Can you go back, because these numbers are important. So really you're looking at roughly equivalent performance at about half the energy.
>> Karu Sankaralingam: Yes. And this is only about a quarter of what we can achieve by adding more dataflow techniques, which I'll get to in a couple of minutes. Okay. Is that okay?
>>: Yeah.
>> Karu Sankaralingam: All right.
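A minimal sketch of the firing rule behind the microarchitecture just described: an instruction waiting in the token store fires once all of its operands have arrived, and a compound functional unit simply bundles a few primitive operations into one fireable block. This is illustrative only, not the SEED implementation; the class names and the two-instruction program are invented.

```python
# Illustrative-only model of a token-store dataflow engine: each instruction
# fires when all of its input operands have arrived. Not the SEED design.

class DataflowInst:
    def __init__(self, name, op, n_inputs, dests):
        self.name = name
        self.op = op                  # stands in for a compound functional unit
        self.operands = [None] * n_inputs
        self.dests = dests            # (inst_name, operand_slot) pairs fed over the "bus"

    def deliver(self, slot, value):
        self.operands[slot] = value
        return all(v is not None for v in self.operands)  # firing rule: all operands present

def run(insts, initial_tokens):
    ready = list(initial_tokens)      # (inst_name, slot, value) tokens from the transfer block
    results = {}
    while ready:
        name, slot, value = ready.pop(0)
        inst = insts[name]
        if inst.deliver(slot, value):         # fires as soon as its last operand arrives
            out = inst.op(*inst.operands)
            results[name] = out
            for dest_name, dest_slot in inst.dests:
                ready.append((dest_name, dest_slot, out))  # result sent point to point
    return results

# Tiny program: (a + b) * c, expressed as two dataflow instructions.
insts = {
    "add": DataflowInst("add", lambda x, y: x + y, 2, [("mul", 0)]),
    "mul": DataflowInst("mul", lambda x, y: x * y, 2, []),
}
print(run(insts, [("add", 0, 2), ("add", 1, 3), ("mul", 1, 4)]))  # {'add': 5, 'mul': 20}
```

Notice there is no fetch, decode, or rename loop here: once the configuration is loaded, execution is driven entirely by operand arrival, which is where the power savings of the approach come from.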
So this slide actually is probably the most important part of this talk. This breaks down why this thing is actually doing what it's able to do. So this is representative of various different benchmarks we have. The names of the benchmarks are not important. But what I'm showing here is their individual regions; each of these tick marks represents one of those regions. It doesn't mean they all contributed equal time to each program or anything like that. They're just regions which naturally execute on the out-of-order core or on the dataflow core. There are many regions where we do like 5 times, 10 times worse because they're just control dominated, so you don't want to put them on a dataflow core, or build a dataflow core that has to handle these regions very well. Okay. There are some regions where we get similar performance, and these would still be better offloaded because they will have much higher energy efficiency, because we have much lower power. Okay. This goes back to the question that Doug was asking; I don't have a breakdown right now. Then there are other regions -- and these are the regions where there are basically fewer [inaudible] and there's not a whole lot of stack spilling happening for the x86 versions of these codes. And then there are these regions where we get really high speedup because there's just a lot of instruction-level parallelism, and we can get like 2X speedup compared to a big [inaudible] core. That's a big deal. Right? And then there are other regions where you have a lot of memory-level parallelism; this thing is going after far-away regions in the program and you do like 3 times and even 4 times better than your baseline core. And so by having this hybrid execution -- doing [inaudible] when you're in this region, doing dataflow when you're in this region -- overall you're able to get a performance improvement and get a power reduction.
>>: Are you simulating the memory system faithfully here, or --
>> Karu Sankaralingam: This is -- yeah. This is like a DRAMSim-like thing plugged into gem5, which we then extract out using this TDG model. So the point of all this is, with the SEED thing that I've talked about, we've only scratched the surface of looking at this vast design space of program behavior, most of which we're missing -- saying, if I have non-data-parallel code and it had high ILP and control wasn't on the critical path, then do this type of dataflow. There's actually a lot more here. You can have data parallel code which has low control; then you would do a SIMD-like engine, which actually can be folded into a dataflow substrate. If you had a little bit of control and you could separate the data path into an access slice and a computation slice, then you would do what I was doing five years ago with our DySER project, which is where we started, which led us to this entire thing. Right? And you can populate the design space and look at various different parts of the space. Going back to what Aaron was asking, if I had very, very stable control and there was a tight hot loop dominating the execution, you would just build a trace processor that didn't even have any of these [inaudible] mechanisms to do different instructions and such. It would just be compound function units, one after the other, boom, good to go. Okay? And so then you can flesh out various parts of the design space. And really you want multiple of these -- we can call them accelerators, you can call them region specialization units, you can call them whatever we want.
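One way to picture the design space just laid out is as a dispatch decision: for each region, a few coarse properties pick which specialization unit to use. The sketch below is purely illustrative -- the property names, thresholds, and the exact mapping are invented, not taken from the talk's data.

```python
# Invented sketch of mapping region properties to region specialization units,
# loosely following the design space described above. Names are made up.

def pick_engine(data_parallel, control_regular, memory_regular, high_ilp):
    if data_parallel and control_regular:
        return "SIMD/CGRA engine"                      # regular data-parallel code
    if data_parallel:
        return "access/execute (DySER-style) engine"   # separable access + compute slices
    if control_regular and high_ilp:
        return "trace-processor-style compound units"  # stable, hot inner trace
    if high_ilp or not memory_regular:
        return "SEED-style nonspeculative dataflow"    # ILP-rich or memory-latency-bound
    return "out-of-order von Neumann core"             # control-dominated regions

print(pick_engine(data_parallel=True,  control_regular=True,  memory_regular=True,  high_ilp=True))
print(pick_engine(data_parallel=False, control_regular=False, memory_regular=True,  high_ilp=False))
```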
We come up with these simple mechanisms, and you integrate them into an [inaudible] core. And, in fact, people have papers going after each and every one of these mechanisms -- some of them I did, others other people have done -- and they've all claimed that these are very simple to design and can be integrated into an existing core, and they all have some evidence that they can be compiled for. Right? So you could take a general purpose core -- this could be an optimized dataflow core, it doesn't matter -- and you can integrate these region specialization units that go after these various different mechanisms, using some well-defined principles, and each engine is itself very efficient. Okay? So you can combine all of these, and this is the same baseline curve that I was showing before. The range of the axes is different because we get higher performance improvements. And if I just add SIMD or a CGRA to my existing core, then you move down a little bit on this curve. Okay? If I --
>>: And this was -- this was the curve today, because we do have SIMD.
>> Karu Sankaralingam: Correct. Correct. So our SIMD implementation is way more generous than what you can get out of GCC. But you're right. The red one is the baseline we should be looking at. And so by adding various other things to it, you get a little bit of performance improvement. And if you add all of these mechanisms -- you add a trace processor, you add BERET, and you add a SEED-like engine -- you make a big shift of this curve down. Okay? To summarize all of the data here, the thing to take away is: this is where we are. This is Intel's highest [inaudible] core. You can take a two-way [inaudible] core, enhance it with these dataflow mechanisms, and beat it a little bit at less than half the power. And if you feel you need every ounce of performance you can get, you can enhance this thing as well, by augmenting it with these high-ILP techniques that are power efficient when you get into certain program regions, and you can increase performance and lower the power consumption of this red thing. Yes.
>>: So how do you scale this into the future? Is this a one-trick thing, or is it --
>> Karu Sankaralingam: Yeah. So that's a great question. I think we have barely scratched the surface here, because we've gone only after three or four program region types. We're covering about 67 percent of the entire execution. And I think you can get more. You can make each of these individually more power efficient. You can make the techniques themselves higher performance. Right. So I think there's a lot more to be gained here by applying these principles and individually tuning the microarchitecture itself. That probably wasn't a great, very specific answer, but that's kind of where I think we're going.
>>: Yeah, I'm thinking sort of going into the future and having issues with, say, Moore's law or --
>> Karu Sankaralingam: Sure. So, in fact, again, all of these are designed to have a relatively small area footprint. They're not going after big techniques, so I don't think -- the design philosophy does not require a lot of benefit from process scaling, or actually any benefit from process scaling. So my personal view is this is kind of the most promising thing, right, which is I build something simple and then I can keep improving it in a relatively nondisruptive way and get lots of performance improvements. Okay.
So the summary of this part is that we can take these dataflow principles, add them to a von Neumann core, and for a modest-sized core this gives us big performance improvements. And so can this help avoid the need for application-specific accelerators for some more time, because it improves general purpose performance some? That depends on exactly how much benefit you need for the specific applications that you are targeting. Okay.
>>: So -- so --
>> Karu Sankaralingam: Yes.
>>: -- maybe -- maybe you'll get there, but I -- you know, if you take a modern Intel Xeon, you spend about 30 picojoules to do a floating-point multiply and about 10 nanojoules to do that whole instruction, so it's a factor of 300, so your pie chart is way too fat, the --
>> Karu Sankaralingam: Sure.
>>: -- [inaudible] part, right? 0.3 percent roughly. And then you're getting about a 3X gain in energy. So you're still only pulling that up to 1 percent. If you believe those numbers.
>> Karu Sankaralingam: Yes.
>>: So I -- I mean, maybe you're going to jump ahead and show how you're going to get much more efficient.
>> Karu Sankaralingam: I will try to do that in two minutes if that's okay.
>>: Yeah.
>> Doug Burger: We're here until 3:00.
>> Karu Sankaralingam: Okay. So I think -- let me paraphrase Doug's question. So your question is, for these workloads, the execution is still pretty overhead dominated, because we did not change how much energy really goes into execution compared to how much goes into overhead. So if you need really, really high performance, these techniques alone are not going to be sufficient. Right? So you're right. Absolutely right. In fact, there are workloads for which we need to get 50X to 100X performance improvements, because only then do they become meaningful for end users to use. This could be various types of datacenter workloads, data mining, speech recognition, all these things where you need big improvements before things become realistic for people to use. Right? So then if you wanted to make something 50X, 100X better, you could put in more cores. That doesn't change the energy equation. You really need to do it more energy efficiently. Okay. So that's what the second piece is: to apply concurrency and dataflow to go after these acceleratable workloads where there's a little bit more structure and you can get at the concurrency without having to change too much of the algorithm or whatever, right? So, a very brief overview -- I'll go through this very quickly -- is we will look first at what domain-specific accelerators are doing for these workloads. So I look at four. And I will be most critical of the one that Doug is involved with. Which one?
>>: [inaudible].
>> Karu Sankaralingam: Shit. I should have thought of that, too. [laughter] No, this is the NPU work. No, I -- and then we will [inaudible] identify some common principles and show how you can unify all of this and try to come up with some more general unifying thing. Okay. So the principles -- which may seem obvious to you once you look at it this way, but they were not obvious when we started, and they're probably not obvious to all the people who have designed these specialized things, right? So for all accelerators, what people are explicitly or implicitly doing is they're matching the hardware concurrency to that of the algorithm. If the algorithm was hundred-way parallel, I'm going to build some kind of engine that's hundred-way parallel.
And there's going to be some way to do explicit communication instead of doing it through a register file, which is probably the most energy-inefficient way of doing communication: put stuff in a centralized place and then someone will read it later. That can be great when your communication happens across long time scales, but if that doesn't happen, why put it in a centralized structure -- just send it point to point. Okay. So there's that explicit communication. There are some problem-specific computation units that are used, and there's some specialization for the particular type of data reuse that is happening in that workload. You can also think of it as locality, but data reuse is a more general way of thinking about locality. And finally there is some coordination of all these four other things, which is done through some explicitly hardened state machine of sorts, or a scheduler, or something. But its job is to make sure these four mechanisms work together properly. So these are the principles. I will look at four different examples, which are really diverse, to show how these same principles are just implemented in different ways across these different techniques. So first let's start with NPU, which is -- I mean, actually I really like the work --
>>: You can slam it; it's fine.
>> Karu Sankaralingam: No, I wish I could, but -- it helps me make my point, so I don't want to. So this is work that was done by Doug and Hadi [phonetic] and others at U-Dub. And basically you run a small neural network accelerator to go after general purpose workloads with a co-designed profiling compiler. And so this is the architecture diagram for it. And in all of these slides I'm going to use this color code. So this is light green, this is light yellow. They kind of show up similar, right? And this is our redrawing of their architecture, with this general purpose core. There are eight processing elements. That's where you get the concurrency from. You're doing something eight wide compared to the baseline processor. That's the first place you get the concurrency from. Second, in terms of the specialization of the compute, the only real specialization here is there's a sigmoid unit that does either 16- or 64-bit sigmoids using, I think, eight cycles or so, compared to -- if you're using a baseline processor, you have to execute some [inaudible] functions, do all kinds of weird stuff. That takes a lot of time. Okay. And in terms of communication, there is this FIFO buffer and an output buffer whose job is to coordinate the communication among all of these processing elements to do big networks, which will then be virtualized to run on the small eight PEs. Okay. And the final type of communication specialization is these all have a shared bus that allows them to rapidly communicate values with each other, which is how you would send values from one neuron to all of the neurons, which here can belong to any of these eight other PEs. Okay.
>>: I think the only thing probably that I took issue with, and I think you'll agree, is that we didn't pick eight units because that was the concurrency of the program.
>> Karu Sankaralingam: Yes.
>>: You know, it doesn't matter for energy, right, you can do it in space or time. We just picked eight because that kind of got the lion's share for this much area.
>> Karu Sankaralingam: Sure. Sure. Yes.
>>: So if we had 16 [inaudible] decided it wouldn't have been any more or less energy efficient.
>> Karu Sankaralingam: Yep, I do. Yeah, I misspoke.
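A minimal software sketch of the NPU-style structure just described, assuming an illustrative PE count, sigmoid table, and layer size rather than the real microarchitecture: eight PEs split the output-layer work, a lookup table stands in for the specialized sigmoid unit, and the output buffer plays the role of the shared bus that broadcasts neuron values.

    /* Software model of one NPU-style fully connected layer. NUM_PE, the
     * table-based sigmoid, and the layer sizes are illustrative assumptions. */
    #include <math.h>
    #include <stdio.h>

    #define NUM_PE      8     /* concurrency: eight processing elements        */
    #define NUM_IN     16     /* input-layer neurons                           */
    #define NUM_OUT    16     /* output-layer neurons, virtualized across PEs  */
    #define SIG_ENTRIES 64    /* coarse table standing in for the sigmoid unit */

    static float sigmoid_lut[SIG_ENTRIES];

    static void init_sigmoid_lut(void) {
        for (int i = 0; i < SIG_ENTRIES; i++) {
            float x = -8.0f + 16.0f * (float)i / (SIG_ENTRIES - 1);
            sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
        }
    }

    static float sigmoid(float x) {
        if (x < -8.0f) x = -8.0f;
        if (x >  8.0f) x =  8.0f;
        return sigmoid_lut[(int)((x + 8.0f) / 16.0f * (SIG_ENTRIES - 1))];
    }

    /* Output neurons are statically partitioned across the PEs; each PE
     * accumulates its dot products (inputs arriving as if from the input FIFO)
     * and writes results to the shared output buffer. */
    static void npu_layer(const float in[NUM_IN],
                          const float w[NUM_OUT][NUM_IN],
                          float out[NUM_OUT]) {
        for (int pe = 0; pe < NUM_PE; pe++) {
            for (int o = pe; o < NUM_OUT; o += NUM_PE) {
                float acc = 0.0f;
                for (int i = 0; i < NUM_IN; i++)
                    acc += w[o][i] * in[i];
                out[o] = sigmoid(acc);      /* the specialized sigmoid unit */
            }
        }
    }

    int main(void) {
        float in[NUM_IN], w[NUM_OUT][NUM_IN], out[NUM_OUT];
        init_sigmoid_lut();
        for (int i = 0; i < NUM_IN; i++) in[i] = 0.1f * i;
        for (int o = 0; o < NUM_OUT; o++)
            for (int i = 0; i < NUM_IN; i++) w[o][i] = (o == i) ? 1.0f : 0.0f;
        npu_layer(in, w, out);
        printf("out[3] = %f\n", out[3]);    /* roughly sigmoid(0.3) */
        return 0;
    }

On real hardware the eight PEs run these dot products concurrently and a hardened controller sequences the buffers; the loop over pe here is just a sequential stand-in for that concurrency.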
So the concurrency is matched a little bit to each other. It's based on engineering heuristics about how you could then make that architecture work.
>>: [inaudible]
>> Karu Sankaralingam: Exactly. Okay. All right. So that's what is there. Okay. And finally there is this controller, which does two kind of related but unrelated things. One is, within each PE, it kind of figures out when to push stuff here and figures out when stuff arrives in these output buffers. And it also very carefully schedules stuff on this bus so that there are no conflicts and things don't get lost. And this is hard coded as a state machine which is generated by the compiler, which will take a neural network and create all this fancy logic and embed it onto this thing. Okay. So that's NPU, and then there are a couple of slides that talk about various other architectures. I'll briefly talk about the convolution engine and spend a little bit of time on DianNao. So this is work from Stanford, and you can see some of these mechanisms show up again here. There's a high-level control state machine whose job is to push stuff into specialized 1D shift registers or a 2D coefficient register, which is a way of doing data reuse, because that's what you want to do for convolution, right? So either 1D or 2D convolution, and some specialized storage for output. And the computation itself is done basically through these relatively wide-ish SIMD units. They do some amount of computation specialization. They should also be colored in a little bit of red because they do arbitrary -- semi-arbitrary precision computation over here. And there's some fusion operation that happens for combining many of these operations before you write them back out. Okay. And within this there's a lot of concurrency inside every one of these things, and they go into a reduction tree before they get written out. Okay. So you can see these mechanisms show up in this convolution engine design as well. And Q100 I'm going to skip. It needs a lot of knowledge of TPC-H queries, which I'm not going to have time to cover, to get this across. DianNao, some of you may be familiar with it. It's this [inaudible] work that's been coming out from Olivier Temam and his collaborators at ICT. Thank you. And it's fascinating work where essentially they take these big networks and stream some set of weights into a scratchpad. They call this the synapse buffer. And then the actual inputs and outputs come into two other dedicated buffers, and stuff is just constantly streaming in after the weights have been locked down. Okay. And the actual compute is basically a big array of multiply-accumulates feeding one sigmoid nonlinear transformation at the very end. This actually does very little for DianNao. If you didn't have this at all and you spent 40 cycles doing this, you still get almost all the performance you would get from the specialized chip. Okay? Yes.
>>: [inaudible] I don't know what [inaudible] but we use just a lookup table for that --
>> Karu Sankaralingam: Oh, I see. Okay.
>>: [inaudible] that was fine, which means that it's already verging on general purpose.
>> Karu Sankaralingam: Yeah. Yeah. So I don't know what -- so this is -- I know the 16 bit, but I don't know exactly how it's implemented. They have some cycle numbers in their paper on what its performance expectation is. Right. And the communication special --
>>: [inaudible] he was also a coauthor of the NPU work.
>> Karu Sankaralingam: Yes, I know. I did --
>>: [inaudible]
>> Karu Sankaralingam: Yeah. I know. I forgot -- I didn't hear you say [inaudible].
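A rough C model of the DianNao dataflow just described; the buffer names (SB for the synapse buffer, NBin and NBout for inputs and outputs) follow the paper, but the tile sizes and the partial-sum handling here are illustrative assumptions rather than the actual pipeline.

    /* One DianNao-style tile step: weights staged in the synapse buffer (SB),
     * inputs streaming through NBin, partial sums kept in NBout, a block of
     * multiply-accumulates, and a single nonlinearity applied only at the end.
     * TILE_OUT/TILE_IN and this exact partial-sum scheme are assumptions. */
    #include <math.h>

    #define TILE_OUT 16   /* output neurons handled per tile step */
    #define TILE_IN  16   /* input neurons consumed per tile step */

    static float nonlinearity(float x) {      /* the one sigmoid at the very end */
        return 1.0f / (1.0f + expf(-x));
    }

    static void diannao_tile_step(const float sb[TILE_OUT][TILE_IN],  /* synapse buffer */
                                  const float nbin[TILE_IN],          /* input buffer   */
                                  float nbout[TILE_OUT],              /* output buffer  */
                                  int last_step) {
        for (int o = 0; o < TILE_OUT; o++) {
            float acc = nbout[o];             /* partial sum lives in NBout       */
            for (int i = 0; i < TILE_IN; i++)
                acc += sb[o][i] * nbin[i];    /* the multiply-accumulate array    */
            nbout[o] = last_step ? nonlinearity(acc) : acc;
        }
    }

A full layer would call this step repeatedly over TILE_IN-sized chunks of input, with inputs streaming into NBin after the weights are locked down in SB, which matches the pattern described above; and, as noted in the talk, the final nonlinearity contributes very little to the overall speedup.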
>> Karu Sankaralingam: And you can see one of the very nice things about the communication specialization here are these adds and multiplies. They just communicate with each other, getting rid of all of this [inaudible] traffic the general purpose processor would have to do, because that's the only way it can do it -- compared to a dataflow processor, for example, which you've probably explicitly designed to do point-to-point communication and do it efficiently. Okay. So combining all of this, what we conclude is that all accelerators -- well, since we looked at four, we can now say all, but -- all accelerators essentially employ these five common principles. I'll briefly talk about some examples to prove the negative. Okay. And you can then, as an architect, come up with mechanisms to implement these principles in a somewhat general purpose way, or a more universal way. Okay. To get concurrency, you come up with multiple tiles. Now, what should each tile be comprised of? To get the communication and computation specialization, you come up with an efficient communication mechanism by using a spatial fabric that does efficient point-to-point communication, and that thing will execute in dataflow fashion, avoiding the overheads of going through some centralized structure and so on. Okay. And for data reuse, oddly, a single-ported scratchpad is enough for these workloads. Okay. I'm not saying it's enough for any accelerator workload, but essentially when you can get these 500X, 100X, 50X speedups, there is some inherent simplicity in your algorithm that your algorithm provider has already provided. So you can get away with a single-ported scratchpad, logically partitioned among the different data structures you need to put onto it. Okay. And for coordination, you need to make these things do different things. You could bake it in using some hard-coded state machine, but if you can get away with a little bit of [inaudible] inefficiency, I say just put a simple, very, very simple three-stage processor there and just write C programs for it. Okay. So you can combine all of this into this high-level microarchitecture where you will have some set of tiles which [inaudible] embed some computation specialization, attach a simple, efficient router to all of these so they can do efficient point-to-point communication and do a little bit of routing, have a mechanism to feed this with values, and then have a scratchpad that can send values into this fabric, and attach a simple low-power core whose job is to coordinate what goes into the scratchpad and sequence values going from the scratchpad into the fabric itself. Okay. And you stamp out multiple of these and you'll get a concurrent fabric that can do various types of workloads and can be programmed using a high-level language, without having to generate specialized RTL that gets baked into a [inaudible]. Okay. So very briefly, why I think this is really cool is you can write C programs for this with some simple pragmas to map different data structures to different hardware structures. And you can actually come up with compilers that will take C code and generate assembly code that can run on this fabric. Okay. And at this year's [inaudible] there was some work from David August's [phonetic] group and my student Tony and Matt Watkins [phonetic] -- they've done some great work in basically saying all of the compiler work we did for DySER is not necessary, by coming up with a dynamic binary translation technique that will just profile running code and generate code for spatial fabrics dynamically. Okay.
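A minimal sketch of what a C program with mapping pragmas for this kind of fabric might look like; the pragma names and the dot-product kernel here are invented for illustration, and the real toolchain's directives and ISA may look quite different.

    /* Hypothetical mapping of a kernel onto an LSSD-style tile. The pragmas
     * below are invented stand-ins for whatever directives the real compiler
     * uses to (1) pin a data structure in the tile's scratchpad and (2) mark
     * a loop for offload to the spatial dataflow fabric. */
    #include <stddef.h>
    #include <stdio.h>

    #define N 1024

    /* Place the coefficients in the scratchpad so the fabric streams them
     * without going back through the cache hierarchy. */
    #pragma lssd scratchpad(coeff)
    static float coeff[N];

    float dot_with_coeff(const float *x, size_t n) {
        float acc = 0.0f;

        /* The low-power control core sequences values from the scratchpad into
         * the fabric; the multiply-add chain runs spatially, with point-to-point
         * operand communication instead of a register file. */
        #pragma lssd offload
        for (size_t i = 0; i < n; i++)
            acc += coeff[i] * x[i];

        return acc;
    }

    int main(void) {
        float x[N];
        for (int i = 0; i < N; i++) { coeff[i] = 1.0f; x[i] = 2.0f; }
        printf("%f\n", dot_with_coeff(x, N));   /* expect 2048.0 */
        return 0;
    }

Since compilers ignore unknown pragmas, the same source still builds and runs on an ordinary core, which is in the spirit of keeping the whole thing programmable from plain C.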
So you can -- with a little bit of effort, you can have a static compiler that will map code onto this. There is also evidence that some part of this can be done dynamically today. There's more work that needs to be done to manage the scratchpad and so on. So you can program this thing -- that's the main point I want to make -- by just writing C programs. There's also a hardware design component of how many units do I need and what do I put in every one of these. So that's a design process you go through based on what your workloads of interest are and so on. Okay. I'm not going to spend too much time on this because I want to at least briefly talk about results. I'm happy to come back to this. So to give you some context for the results I'm going to talk about, we looked at these four different accelerators. We looked at their published numbers, and we compare to their published numbers. We got some more details about the actual numbers from a subset of them. DianNao, I'm still talking to Olivier for him to double-check exactly what we are reporting. Okay. So you'll see graphs where we have their published numbers compared to what we get out of LSSD. Okay. All right. So this talks about the four different configurations we have. Each is tuned differently to match exactly one thing. We have in our paper details of coming up with one fabric and how it will match if I just add that one thing and run various different workloads on it. Okay. The main difference is what goes into the compute fabric itself and the total number of tiles. Okay. All right. So for these results, what we do is we provision each of these fabrics to match the performance of the specialized domain-specific accelerators. So that's why you'll see this blue bar is by design matched to come up close to the domain-specialized accelerators. Q100 I'll come to at the end. Okay. It's a surprising result, but we'll come to it. And the main takeaway from here is the overhead you pay in power and area is about 2X. You might be wondering why am I here talking about a technique that gives you more overhead, but this is a single fabric, programmable in C, on which you can unleash a compiler and run various different workloads that you've not thought of while designing the thing. And we're comparing that to domain-specialized things that have been baked into silicon. Okay. So one way to paraphrase this result is: for all of these techniques, the true benefit of specialization is only 2X in area and only 2X in power.
>>: So how are you doing the sigmoid in the NPU case?
>> Karu Sankaralingam: So for this we said the [inaudible] we have a sigmoid unit -- that one sigmoid unit will be embedded inside the tiles --
>>: You have a sigmoid [inaudible].
>> Karu Sankaralingam: Yes. And there are actually some of the ones [inaudible] -- for some, depending on the size of the tiles we have in the fabric, we can beat NPU because of [inaudible] and so on, right?
>>: I'm not worried about that [inaudible].
>> Karu Sankaralingam: Yeah.
>>: How do your 2X [inaudible] compare with the [inaudible] you pay for FPGA versus [inaudible]?
>> Karu Sankaralingam: All right. That's a great question. I don't have the quantitative numbers, but I can answer it qualitatively. Okay. Which is -- okay. It's in a backup slide. Let me now try and find it. But the high-level bit is FPGAs give you some other capabilities.
They are lacking in a couple of ways. One is, right now the frequency of FPGAs is not very high, so the performance will be limited. Second, if you're going to synthesize all of these functional units using the [inaudible] instead of implementing them as specialized DSP slices, that's going to cost you a lot. Third, when you have this much operand-level concurrency and the actual operations are done on the DSP slices, you're going to effectively have to synthesize a big multi-ported register file for these things to talk to each other. That ends up becoming really inefficient. So I would argue you're actually better off with a CGRA than using an FPGA with DSP slices and block RAMs to do the communication --
>>: [inaudible] the CGRA [inaudible].
>> Karu Sankaralingam: Oh, coarse-grained reconfigurable.
>>: [inaudible].
>> Karu Sankaralingam: Yes.
>>: -- more pipelining, networks on chip, lots of hard [inaudible].
>> Karu Sankaralingam: So --
>>: I think that's a false choice, because the FPGAs are moving towards --
>> Karu Sankaralingam: Correct. Yes.
>>: They already have the [inaudible].
>> Karu Sankaralingam: Yeah.
>>: [inaudible] blocks. So really the question is what could you add to an FPGA to get more of these benefits?
>> Karu Sankaralingam: Absolutely. So --
>>: And close the [inaudible].
>> Karu Sankaralingam: [inaudible] So, in fact, what I do -- this is on my conclusion slide -- is I think you can view this as the natural evolutionary point that GPUs, FPGAs, all of these will get to, by taking these mechanisms and embedding them in the most microarchitecture-friendly way for their native execution model. So I don't see this as like a competition with FPGAs. I think this points to somewhat of a natural evolutionary trend of where they're going. And some of this stuff is already there -- some of these things that I talked about are clearly already in the roadmap. They're hardening the DSP slices, they're putting these transfer registers and latches everywhere, [inaudible] frequency to a gigahertz, and things like that. So in terms of where the area goes, this is a breakdown of individually how much goes into the SRAM, the dataflow fabric, and the interconnection network. And this is the ratio of what happens compared to the specialized accelerator itself. So very briefly, so DianNao -- Q100 is this thing that's been designed to go after TPC-H queries. Basically they designed an architecture that's actually not very balanced. The queries are sort-heavy, and Q100 is not good at doing sort, actually. So we can just do sort with the simple core we've put in. We are way better at sort. That's basically why we end up looking better than a domain-specialized accelerator. So there's nothing magic going on here. It's just, in my opinion, a slightly imbalanced design. But it's -- I mean, I don't mean to criticize what it is. That's kind of what we came up with in our analysis. Okay. So let me very briefly -- well, let me skip this slide. So, in terms of performance analysis with our framework -- in our paper we talk about it all; we can go through it offline -- we are able to break down the performance improvement that comes from each and every one of these mechanisms: just having a simple core, multiple of these cores, adding the dataflow fabric, adding the data reuse, adding the communication specialization. And so this is why I was saying before that most of the benefit for these workloads ends up coming from the concurrency and the dataflow execution.
The compute specialization isn't buying you that much. Okay. And you can also kind of argue whether something is data reuse or communication specialization -- you can label it either way. Right. So, again, I'm not saying this is true for all accelerators, but for these workloads, which actually represent a pretty wide spectrum, this is what is happening and these are the sources of improvement. And our paper has details for each and every workload domain and goes into this breakdown. And you can see that for different workloads the breakdowns of where the improvements come from change. I'm happy to talk about this offline. And if there's any interest in the paper, please let me know. I'm happy to send it to you. Okay. So very briefly, when will this not work? There are many cases where this will not work. One case where I know for sure it will not work is when the workload you're trying to run is highly memory dominated -- specifically, it's very irregular, in that you're getting some data, you go to memory, or maybe even a smallish memory, you do some bit mangling on it, you go to memory again. Okay. So then it's like, okay, I don't need to do any coordination, I don't need a dataflow fabric, I just need memory and a little bit of arbitrary Boolean logic attached to that memory. That's all I need. And I want that to be really efficient, and what we have right now is not very efficient at it. We know it's not efficient. Okay. So there are two concrete examples of this. One actually goes back to the deflate work that you guys have done here on doing lossless compression on streaming data using FPGAs. And there the biggest benefit, or one of the big sources of benefit, is this super funky hash table into which you have eight accesses every cycle, and that is implemented using the block RAMs, and it's like awesome, and we don't have a mechanism to match that. It would be nice to come up with a single principle that will expose SRAMs or block RAMs in some kind of ISA [inaudible] so that anyone can access it. And IBM PowerEN is basically a regular expression engine, an IDS sort of engine. The main thing it does is it takes the [inaudible]; the software does something intelligent, breaks it up into nice pieces, locks it up into a scratchpad, and they just access this thing constantly -- [inaudible] build an entire chip for it. The only real specialization happening there is this memory, which you can access with arbitrary transformations of the inputs coming in -- access the memory, do some more transformations. So what we are looking at right now is whether or not there is a general principle here to come up with memory sitting there, exposed, that can be addressed in arbitrary ways and can then become a building block that you can then embed into an FPGA. Because I've got to believe building the massive cool hash table using the block RAMs isn't the most efficient way of putting it in silicon. Yes.
>>: What about the cloud workloads, right? They're the -- they have very large memory footprints.
>> Karu Sankaralingam: Sure.
>>: So once you get beyond the --
>> Karu Sankaralingam: Yes.
>>: If you -- [inaudible] little computation to do, right, little execution --
>> Karu Sankaralingam: Yeah. If you go out of the DRAM capacity, then you might be better off with like some offload engine that is running close to the thing. To me the question is what should that offload engine be. Can it be an LSSD? I've not looked at those workloads enough to answer that either way.
We have looked at some of them and --
>>: The problem is irregularity, right? It's -- they might all fit in memory, they all might be designed to fit in memory, but they'll be irregular and there won't be that much computation to do per byte for a lot of them. They run from memory, right?
>> Karu Sankaralingam: Yes. Right.
>>: So -- [inaudible] just have a host.
>> Karu Sankaralingam: Yeah. I would argue --
>>: No, then the -- the thing that I was trying to get to is maybe you can put your [inaudible] not in the CPU but in the [inaudible] memory hierarchy or something like that.
>> Karu Sankaralingam: Yes. Absolutely. In fact, that's one of the things we're looking at in my group: how can I take the pieces of this LSSD thing and go after problems that are basically memory dominated -- what is the unit that I need? And our current results -- and my bias -- push us to believe that you basically get rid of this fabric, have a simple, very, very primitive processing core, and just the brute force approach gets you really far. And people [inaudible] like funky prefetch engines, this, that. Just this brute force approach of a power-efficient core that's very efficient at being idle can get you very, very far. So we have some of that; we can talk more offline. All right. So I have these results that show how much worse we are compared to the [inaudible] work. So we are 5X worse when you [inaudible] on this thing, after we make fancy assumptions about the hash table. I won't get into that. So very briefly, can we build this thing today? There are design and practical questions: are the RSUs simple enough, and can these compilers actually work? So we built some of this in our research. I think the idea is, [inaudible] compiler expert -- lots of you are -- there's every [inaudible] these things can be built, and these things can be promising. And in terms of ISA compatibility, there's a lot of ongoing work from my group and others that says how you can embed all of this in some dynamic binary translation engine and pull off many of these benefits. So I am very excited about all of these results, and I think they're really promising and point to a direction in which we can push future processors forward, providing high performance and general programmability. So in my group we are looking at basically building some kind of prototype of this thing and seeing -- well, let's run it against a big [inaudible] and see what happens. And I am confident we can beat many of these emerging ARMv8 cores in power and performance using many of these techniques here. And right now we are in the midst of building this specialized -- not specialized -- chip going after the DianNao specialized chip, with a prototype implementation of the LSSD idea that I just talked about. That's all I had. I'm happy to take any more questions. And thank you all for inviting me again and listening. I am sorry I ran over by 28 minutes.
[laughter]
[applause]