>> Doug Burger: Okay. It's my pleasure to introduce Professor Dean Tullsen from the University of California at San Diego. Dean is a very, very well known and productive and influential member of the computer architecture community. I won't recite his long list of awards and recognition and work and papers, but he is well known for his seminal work on simultaneous multithreading, done in a region of the country of which we are all very fond, namely here, although not at Microsoft. And I think it's public, right, your award is public now? It was in your ->> Dean Tullsen: I expect it is. It's probably still somewhat secret but you could reveal ->> Doug Burger: Well, it was in your bio, so I'm going to release it. So Dean this year is receiving at ISCA the ISCA Influential Paper Award, for a paper from the ISCA 15 years ago, and he is the first person to have won it two years running. Or is it two years running or was there a year in between? I think it was two years running. So this is a huge deal; two influential papers in a row for this prestigious award is a really incredible feat. So, we are delighted to have you here, and we're looking forward to hearing your talk on what will be the most influential piece of work 15 years from now. And so please. >> Dean Tullsen: Great, thanks. All right. Thanks. Thanks for coming. And thanks to Doug for hosting me. So if you'll bear with me a little bit, I always have trouble when I go out and give talks, because we work on so many things that I have trouble coming and talking about just one of the things we're working on. So I'm actually going to give you a brief introduction to a couple of other things that we're working on -- that have recently produced results -- and then I'll spend most of my time talking about this topic, which is data-triggered threads, which is a topic I'm really excited about. But to sort of put a spin on all the topics that I could talk about: a lot of the things that we are working on -- most of the things we're working on these days -- are very much focused on what some have called the parallelism crisis, and I think you're familiar with the parallelism crisis, but in short what we're finding is that it's a lot easier to create hardware parallelism than it is to create software parallelism. And this isn't true in every domain -- there are domains with tons of parallelism -- but certainly there are a lot of them where we can build, you know, really large multi-cores and many-cores, but there are not going to be, you know, hundreds of software threads to run on them. Right? And so again, you know, everybody has these road maps that are predicting more and more cores, but it's not clear that software parallelism is going to grow at the same speed that hardware parallelism is. Doug doesn't believe me. >> Doug Burger: No, I think it is clear. >> Dean Tullsen: Okay. Good. So, like I said, I didn't think I had to convince this audience. But there are a lot of implications of that. And so we could continue to advertise, you know, peak performance that is scaling, but unfortunately actual performance is only going to grow now at the rate at which software parallelism grows, which is again a lower rate than hardware parallelism grows, and in some cases will be quite slow. 
And the problem with that is that a large part of our economy, as I know you are well aware, is tied to this performance scaling, right, because along with this performance scaling is this upgrade cycle, right, where every three years you buy a new computer because your old one is embarrassingly slow. And when you buy that, you also buy new Microsoft products, et cetera, to go with it. And if we don't have the performance scaling, then all of a sudden this upgrade cycle slows down considerably. Some of the other implications, right: in order to keep up, we're going to need a lot of software parallelism. And so we're working on architecture things, but really everybody needs to chip in to solve this problem, and there's not going to be a silver bullet, and so the architects need to work on it, compiler people need to work on it, programming languages people need to work on it, and maybe together, you know, we can all make some progress. Another implication, and this is the way we've been thinking about these things for years, is that we just treat threads as a free resource, right, because in a lot of environments threads are just going to be sitting around idle. I'm talking about hardware contexts, whether it's multicore or multithreading. Most of the time, if we have some problem we want to solve, there are going to be threads available to help solve that problem. And another way of putting that -- you know, most of you have heard the expression that if the only tool you have is a hammer, then every problem looks like a nail. Well, that's very much going to be the case moving into the future: we have this giant hammer, and this giant hammer is lots and lots of hardware parallelism. And so most of the problems that we're going to try to solve from here on out, we're going to try to beat down with hardware parallelism. And in fact, the particular problem that we are trying to attack with this, coming back around to where we started, is this parallelism crisis. That is, if we only have a few threads to run, how are we going to scale the performance of those few threads or that one thread? Well, we're going to use lots of cores, lots of parallelism to do that. >>: [inaudible]. >> Dean Tullsen: Yes? >>: I had noticed you're not talking a lot here about energy, and that's one of the bounding things we're facing. So I had a fairly specific question. If you take SMT, if I have, you know, [inaudible] blah, blah, blah core and I have two threads, A and B, I could run them both on the same core in an SMT context or on, you know, adjacent cores. Have you guys measured what you think the energy per unit performance is for those two scenarios? >> Dean Tullsen: We have. We did it years ago. So we haven't revisited it really recently. And some of ->>: [inaudible] together but you don't get as much [inaudible]. >> Dean Tullsen: Right. The marginal energy -- the marginal power you use is really small if you're multithreaded. And generally it's much smaller than the marginal performance you get from running multithreaded. So most of the time SMT is a real big energy win. But again, that was work from years ago, but I -- I really believe that that's still valid. All right. In fact we're doing lots of stuff with energy, but I am talking about more performance issues right now. But they tend to be -- they tend to be highly related. My favorite way to reduce energy is still to run faster. 
And many times the most effective way to save energy is just to run fast. Okay. Okay. So again, what we're looking at is to use lots of hardware parallelism to run low levels of software parallelism quickly, efficiently, or with high energy efficiency. And so really sort of two things -- three things I wanted to talk about. Again, we'll spend most of our time on the last topic. But I'm going to talk about some things that we're doing in terms of compiling for non-traditional parallelism, architectural support for short threads, and then the data-triggered threads. And again, sort of true to the previous slide, we've sort of branched out, and we're again trying to sort of attack the parallelism crisis from all different angles. So first I'll talk briefly about this work on compiling for non-traditional parallelism. Non-traditional parallelism is this term I've been using for over a decade now. I keep expecting it to catch on. It never does. But what I mean by non-traditional parallelism is getting parallel speedup -- and by parallel speedup, I mean take something that runs at this speed with one core, make it run a lot faster with lots of cores -- but applying it even when there's no software parallelism. Okay? And so a couple of things I'm going to talk about, data spreading and inter-core prefetching, are just the most recent examples of this approach, in particular pointing the compiler at this problem. So with software data spreading, what we're trying to do is take some code, presumably serial code, and get speedup on multiple cores. And the particular problem I've got here is I've got a couple of loops with really large datasets that don't fit in my cache, and therefore there's lots of cache thrashing. When I execute loop A, then B, and go back to A, A's data is not in the cache anymore. And if we were able to parallelize this -- a lot of times when we parallelize code we get speedup not necessarily because we have four times as many functional units, but we get speedup because we have four times as much cache. So the question is, if we're running serial code, can we get those same speedups? And so with data spreading, all we do is we stick these migration calls in the middle of these loops and start bouncing this serial code around from core to core to core, which then allows us to aggregate the space of our private caches so that these data structures get spread over all the private caches in our system. And so what happens here, if the sizes happen to work out, as they do in my example, is that all of a sudden everything fits in the cache, and when I go back to the blue loop again everything's still in the private cache. And so you get the effect of aggregating the private caches of my cores without any hardware support, with a simple compiler solution. Okay. So inter-core prefetching, another attempt to use the compiler to get parallel speedup on serial code. And I really like this work because I consider helper thread prefetching on multi-cores to be an open architectural problem, meaning that we did sort of the early work on helper thread prefetching for SMT, as did some others, a little more than a decade ago. And even though multi-cores are really the dominant source of hardware parallelism now, nobody has really ever shown how to do helper thread prefetching across multi-cores. And so -- and you can do this with hardware support, but nobody has ever really seriously proposed that either. 
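To make the data-spreading transformation described above a little more concrete, here is a rough C-style sketch. The migrate_to_core() call is a stand-in for whatever migration primitive the compiler or runtime actually emits, and the core count, dataset size, and slicing are made up for illustration; this is a sketch of the idea, not the actual compiler pass.

    /* Rough sketch of software data spreading on serial code.
     * Assumption: migrate_to_core(c) is a hypothetical runtime call that
     * migrates the single software thread onto hardware core c; the real
     * compiler pass picks the migration points and granularity itself. */

    extern void migrate_to_core(int core);   /* assumed migration primitive */

    #define NUM_CORES 4
    #define N (1 << 22)        /* dataset much larger than one private cache */

    static double A[N], B[N];

    void loop_A(void)
    {
        for (int c = 0; c < NUM_CORES; c++) {
            migrate_to_core(c);               /* core c's cache holds slice c of A */
            for (int i = c * (N / NUM_CORES); i < (c + 1) * (N / NUM_CORES); i++)
                A[i] *= 1.0001;
        }
    }

    void loop_B(void)
    {
        for (int c = 0; c < NUM_CORES; c++) {
            migrate_to_core(c);               /* slice c of A is still warm here */
            for (int i = c * (N / NUM_CORES); i < (c + 1) * (N / NUM_CORES); i++)
                B[i] += A[i];
        }
    }

    /* Alternating loop_A(); loop_B(); now finds each slice of A and B still
     * resident in the private cache of the core that touched it last time,
     * instead of thrashing one core's cache with the whole dataset. */

The serial code never gains extra functional units this way; it just gets the aggregate capacity of all the private caches.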
And so for inter-core prefetching, we're going to do it actually without any hardware support. What we're going to do is we're still going to have this main thread and this helper thread that is precomputing addresses that the main thread is going to use, like the SMT helper thread work, but what we're going to do is again use migration, and we're going to have these threads basically chase each other around our multi-core. And so we've divided the execution into chunks. For the first chunk, the main thread's executing here, the helper thread's executing here, and after they both execute a chunk, they'll move. And now the main thread is executing in the CPU where the private cache is completely loaded with all the data it's going to need for the next chunk. Right? And they'll continue to just chase each other around the processor. Okay. So, helper thread prefetching on multi-cores. All right. Another project that I'm pretty excited about, and again I'm not sure I've gotten the world to believe me yet, but I believe that in order to take full advantage of the hardware parallelism that's becoming available to us, we need to be able to execute short threads effectively. We're seeing more and more execution models that require short threads or generate lots of short threads, whether it's speculative multithreading or transactional memory, and a lot of the compiler techniques would like to create short threads. But unfortunately, we're really good at building processors that can execute a billion instructions well, and we have no idea how to build processors that can execute a hundred instructions well. And really, to amortize the cost of sort of forking a thread and executing some code on it, we need many hundreds or thousands of instructions to do that. But what I'd really like to do is be able to find a pocket of parallelism that's a hundred instructions long and be able to fork a thread and profitably execute that, because there's just a ton of parallelism we're leaving on the floor because we can't amortize that cost. And so we did some simple things. We actually earlier had done some work with the branch predictor, but here we're just focusing on memory effects: the fact is that when we're running these 100-instruction threads, they're running seven times slower than if we just executed that same code on the same core where we had built up the state. And we were able to do some pretty simple things to really bring that down and to get this multiplicative performance increase for small threads. But they weren't the obvious things, necessarily, because copying cache data actually is kind of worse than doing nothing, it turns out. Okay. That's all I was going to say about that. Okay. So the topic at hand is data-triggered threads. The point here is that with traditional parallelism, you're executing, the program counter is moving, it gets to a fork instruction, and you generate parallelism. And so parallelism is virtually always generated based on the program counter, on control flow through the program. And this is true not just for sort of traditional models of execution; even all these exotic forms of execution still tend to spawn parallelism when you hit a program counter, whether it's a trigger instruction or a fork instruction or what have you. So the question we asked is, what if instead of spawning threads because we hit a program counter, we spawn threads because we touched memory? And that's what we're going to look at. Okay. 
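Before getting into that, here is a similarly rough sketch of the inter-core prefetching scheme described a moment ago: a main thread and a helper thread chasing each other around the cores, one chunk at a time. Again, migrate_to_core() is an assumed primitive, the chunk size is arbitrary, and the synchronization that keeps the helper exactly one chunk ahead is omitted; the real scheme manages that hand-off carefully.

    /* Rough sketch of inter-core prefetching. The two functions are meant to
     * run concurrently as two software threads (e.g., created with pthreads,
     * not shown). Assumption: migrate_to_core(c) migrates the calling thread
     * to core c; chunk-to-chunk synchronization is omitted for brevity. */

    #include <stddef.h>

    extern void migrate_to_core(int core);   /* assumed migration primitive */

    #define CHUNK     4096
    #define NUM_CORES 4

    static double data[1 << 22];

    /* Helper thread: stays one chunk ahead, loading the data for chunk k
     * on the core where the main thread will later execute chunk k. */
    void helper_thread(size_t n)
    {
        for (size_t base = CHUNK; base < n; base += CHUNK) {
            migrate_to_core((int)((base / CHUNK) % NUM_CORES));
            for (size_t i = base; i < base + CHUNK && i < n; i++)
                (void)*(volatile double *)&data[i];   /* load just to warm the cache */
        }
    }

    /* Main thread: does the real work, arriving at each core to find its
     * private cache already loaded with the chunk it is about to use. */
    double main_thread(size_t n)
    {
        double sum = 0.0;
        for (size_t base = 0; base < n; base += CHUNK) {
            migrate_to_core((int)((base / CHUNK) % NUM_CORES));
            for (size_t i = base; i < base + CHUNK && i < n; i++)
                sum += data[i];
            /* when this chunk is done, both threads move one core over */
        }
        return sum;
    }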
And so I'm going to talk a little bit about dataflow. You might have guessed I was going to talk about dataflow. This is in no way a dataflow machine, but it's interesting to see how this sort of relates to the ton of past work on dataflow. And so again, in a von Neumann architecture we're sort of following control through the program. We spawn parallelism. But again, it's always based on control flow. Okay? Dataflow -- I know you're familiar with this, but you express a program in terms of this graph, and it's not driven by instructions in memory, it's not driven by this artificial control flow generated by the compiler; you know, computation happens when the data becomes available. Right? And so -- the problem is, again, we've known about dataflow for years and years. It's this beautiful execution model. But nobody has ever built one and nobody is ever going to build one, because there are all kinds of technical problems with doing that. At least not one that's -- not one that's virtually ->>: Can you define nobody has ever built one. >> Dean Tullsen: No one has built one that is commercially viable. I apologize. I said exactly the wrong thing to this audience. Nobody has ever built one that's commercially viable. >>: I think even that's a little too strong. Nobody has ever built one that's been sold commercially. [brief talking over]. >> Dean Tullsen: Okay. You have to agree, there are all kinds of reasons why dataflow is hard. But there are a couple of things we really love about dataflow. One is that parallelism becomes exposed immediately, right; as soon as the data gets changed, the computation that depends on it becomes available for execution. And, talked about quite a bit less, there's no unnecessary computation, or at least a lot less unnecessary computation. That is, if you don't change A and B, you never execute this add. Okay? And that's the thing I want to talk about, because that's a little less obvious and that's really the focus of what we're doing with the data-triggered threads, although you can get the parallelism benefits as well. And the way I like to think about computation is a somewhat simplified model, but it's not all that simplified. So think of computation as just strings of computation that start with loads and end with stores. You know, sometimes there are multiple loads involved. But, you know, most things can be characterized this way, because if you're not storing data, then actually the computation you're doing doesn't have any impact. And what we find is that if we sort of think of computation like this, what you find is a lot of times you go back and you do the same computation: you do the same loads at the same addresses, load the same values, do the exact same computation you did sometime before, and store the exact same value to memory. And, you know, sometimes these loads have changed and you're loading a new value and you produce useful stores. But a lot of times these are what we call redundant loads. And of course then the computation that follows them is typically redundant as well. So a redundant load is when I load the same value from the same address as some previous instance -- with no intervening store that actually changes the data, I should say. So how often does this happen? Oh, wait. Sorry. I should give a little more concrete example first. So I've got some code here. This would probably be more representative if maybe this was a matrix multiply, but a matrix add makes it a lot easier to see this example. 
So what happens here is that I've got A and B, and then at the end of this loop I compute C, which is A plus B. And in the intervening code I make some changes to A and B. But in a lot of code it's rare that I would change all of A and all of B. It's more common that I would probably change a few values in A and a few values in B. And then I write this code, and this is -- again, you could say this is wasteful, but this is the way we write code, this is the way that code, you know, pretty much always appears when we look at it: I recompute all the values of C, even though very few of them have actually changed. All right. And so this is a real classic example of redundant computation based on redundant loads. Okay. So how often does this happen? Well, all the time. Right? And so we -- we measured the SPEC benchmarks, and on average 80 percent of the loads that we do are redundant by our definition, which means that it's the same value we loaded the last time we accessed this address. Nothing has changed. And most of the computation that follows this is also redundant, and then eventually I get to a store where I store the same value again. Not only are 80 percent of the loads redundant on average, but even in the worst case over 60 percent of what we're doing is redundant and unnecessary. >>: Have you looked at what fraction of those [inaudible] every time you give the talk, you do zeros in blue on the chart [inaudible]. >> Dean Tullsen: Oh. Sorry. Say that again. >>: If you figured out what percentage of the loads you're doing where the values were the same as zeros and showed that as a fraction of those [inaudible]. >> Dean Tullsen: That's a good question. I didn't -- you know, that's something we always looked at when we were looking at value prediction, like everybody else, several years ago. And, you know, most of the predictable values were zeros. I don't believe that that subsumes this nearly as much as it did the value prediction work. >>: Maybe 25 percent ->> Dean Tullsen: Right. Right. >>: Dean? >> Dean Tullsen: Yes. >>: [inaudible] for the particular value of the load or the whole chain that results in the store. >> Dean Tullsen: This is just the loads. And if I graphed here the actual computation that's redundant, the numbers would be lower. But it's still over 50 percent. So -- but we're focusing on loads primarily because that's the way we're thinking about it. We're sort of looking at the loads as sort of triggers to signify computation that we don't actually want to recompute, things like that. And so we focused on loads rather than the actual computation just because it made more sense to us intuitively. But if I actually graph the computation, it's still over 50 percent. It's lower because, you know, sometimes instructions depend on multiple loads, and if one of them's non-redundant you have to execute it. All right. Yeah. In the worst case, you know, 60 percent of the time we're still doing useless computation. Okay. Okay. So the point is, with data-triggered threads, can we exploit parallelism and avoid redundant computation in some of the same ways that maybe a dataflow architecture could, but with really, really minor changes to a von Neumann architecture? So it would hopefully be real simple to implement. And so we've got these data-triggered threads that are triggered by a modification to memory rather than by reaching some program counter. 
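Here, in ordinary C, is roughly what that matrix-add example looks like, just to pin down what redundant computation means in this talk. The array sizes and the particular updates are made up for illustration.

    /* The classic shape of redundant computation: only a couple of elements
     * of A and B actually change, but the code recomputes every element of C. */

    #define N 1024

    static double A[N][N], B[N][N], C[N][N];

    void update_and_recompute(void)
    {
        /* a handful of real changes ... */
        A[3][7]   += 1.0;
        B[512][9] *= 2.0;

        /* ... followed by the full recomputation we always write anyway */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];   /* almost every load here is redundant:
                                                  same address, same value as last time,
                                                  feeding a store of the same value */
    }

Only the two elements of C touched by those updates actually need recomputing; everything else is the redundant work being counted here.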
Our threads are always non-speculative, so this is very different -- I'm going to show some pictures that look a lot like speculative multithreading, but they're completely non-speculative, and the appearance of similarity is just going to be false. And so think about code like this. I've got a store, which may or may not be executed, up here. I've got this region of computation that depends on the store. And then I've got some following region of code here. And so what I'm going to do with data-triggered threads is I'm going to watch for this store to change memory. As soon as it's changed memory, I'm going to spawn this data-triggered thread, which is going to do the computation that used to be in B. When A sort of gets to the end of its region, it's going to wait for the data-triggered thread to be done, and then it's going to resume computation here at the end of what was B -- we're going to call this the skippable region -- and then execute the code in C. Okay. And so hopefully we get some speedup. We may get some extra parallelism here, depending on how much distance there is here. But the real power, at least in our initial look at this, where we really focused on redundancy -- the real power is if this store did not actually change memory, maybe it wrote the same value or maybe this if resolved so that we never executed the store, then, in fact, we just completely skip the B region, which is represented by the data-triggered thread, which we now don't have to execute, because the computation that the data-triggered thread would do would just be recomputing a value that's already in memory. >>: So what [inaudible] but not the load later on? So you might store something and a load path might be conditional. >> Dean Tullsen: So that's possible. It depends on the programmer to sort of think that through. But, you know, in this model, we might actually execute the data-triggered thread. >>: So there might be some speculation? >> Dean Tullsen: There could be some software speculation in this way, yeah. And it's a programming model, so the programmer could actually do all kinds of bad things. But when it works well, hopefully it works a lot like this. Okay. And so going back to this, you know, matrix add example, the type of thing that's going to happen is I've attached a data-triggered thread with a very simple pragma to this array. I obviously could have one attached to B as well, but I don't here. When I execute -- right, so this is my data trigger, this is my data-triggered thread, this is what I'm going to call my skippable region -- when I actually do this store to A, which changes a single value in the array A, this is going to spawn this data-triggered thread, it's going to recompute one element of C, and then when I get to this code -- which was here originally, which is now what I'm calling my skippable region -- it's going to completely skip that. Yeah, Jim? >>: So what's the [inaudible] programmer thinks about? >> Dean Tullsen: So that's a good question. And we should talk about that after, because you could probably help me figure that out. It's -- but it's -- it is -- I don't know how to describe this. 
There is -- let me describe this in sort of the terms we've been using, which is that the programmer can do anything they want, but we basically want them to create, you know, data-triggered threads with no data races, and then you've got a barrier here where you're going to wait for all updates to be complete and to appear in memory before moving on here. >>: So you [inaudible]. >> Dean Tullsen: This skippable region is an [inaudible] barrier, right? And so you're never going to get past the skippable region until all the data-triggered threads are completed. >>: [inaudible] do you run the skippable -- it seems like when you make the updates then you use the update function. >> Dean Tullsen: So this is something I was going to get into later. So this is a long conversation I had with my graduate student Hung-Wei, and I didn't particularly like skippable regions, and everybody looks at this and says, why do you have the skippable region there. I initially thought it was just because, you know, we're working from real code -- existing code -- instead of writing new code, and he wanted to leave it in there. But actually, you know, in the end he's right, and there's a real good reason to have the skippable region. And that's because we'd like the data-triggered threads to be able to fail every once in a while. We have an explicit abort. Sometimes you find that you're executing this data-triggered thread and you want to touch some data you didn't really expect to touch, and you might just abort the data-triggered thread. You also want to have the ability for the data-triggered threads not to find a core, in which case they don't execute. In all those cases, this represents a backup. We'd also like to be able to dynamically turn off data-triggered threads; you know, sometimes we're just firing off too many data-triggered threads, and we'd like to just turn them off for a while and fall back to this code. >>: [inaudible] hardware data [inaudible] threads and have the 15th one decide it doesn't want to run, and then what do you do? >> Dean Tullsen: So -- so. >>: Does the block update need to be added [inaudible]. >> Dean Tullsen: So sorry. Ask the question again, make sure I understand it. >>: Well, you said that one of the things that could happen is that you could run a bunch of -- you could spawn a bunch of data-triggered threads. One of them might decide, oh, there's some reason I don't [inaudible]. >> Dean Tullsen: Yes. >>: And if you then abort at that time, you've done some of the updates via the update functions. Then you've decided you're going to run the block update. Does that mean the block update needs to be [inaudible]? >> Dean Tullsen: It does. And so -- and in fact, we're going to place some restrictions on data-triggered threads. We're going to say they need to be restartable, for example. So for instance, you could actually abort data-triggered threads in the middle, and they better not be doing anything that sort of creates an inconsistency. They shouldn't even be accumulating state, you know, as some kind of sum internally, et cetera, because restarting them could create false values. And we place some restrictions, and the real one -- the only one we really -- the main one is just this restartable feature. >>: [inaudible] fundamental model and programming model [inaudible]. And you then have an incremental algorithm. 
And the programmer needs to ensure that the incrementalization that they have done with the data-triggered threads matches essentially the batch. >> Dean Tullsen: Yes. >>: And then there's lower level guarantees at the appropriate level. So fundamentally the [inaudible] is on the programmer to figure out, with respect to their data structures and with respect to this sort of dependence model, how to create an incremental version of their algorithm, and then the correctness condition at the high level is that when the data-triggered threads run, that's just an optimization of the batch. >> Dean Tullsen: Yes. >>: So is that ->> Dean Tullsen: That's a reasonable ->>: [inaudible]. >> Dean Tullsen: That's a reasonable way to think about it, yes. >>: Okay. Because then there's a whole area of something called self-adjusting computation, which is the software version of how you get an incremental algorithm from a batch one, which essentially looks very much like creating a dataflow graph in memory and then normalizing an [inaudible] validation and reexecuting the dataflow. So have you considered actually, you know, what you could do in hardware to support that sort of model, because there's already this self-adjusting computation [inaudible] explored in software [inaudible]. On the other hand, you get a hundred -- sometimes you get 20,000X speedups, you know, I mean really nice, really nice [inaudible]. >> Dean Tullsen: [inaudible] part of my reason for coming here; you could probably point me at some other things we might have missed. And if you wouldn't mind doing that, maybe send me an e-mail. >>: Just a clarification. What is the X in the function there and where [inaudible]. >> Dean Tullsen: So, sorry. Unfortunately it [inaudible] so we only allow one argument to the data-triggered thread. I'll talk about why we do that. And the argument is actually the address, right. So we trigger a thread because we touched a memory address, right? And the only argument that we allow the data-triggered thread to take is that address. Okay? It's always doing its thing based on that address, and knowing where the array starts, it can figure out what this index is. That's what this funny code is doing. >>: [inaudible] just on writing, so you don't actually check to see if the values changed from ->> Dean Tullsen: No. And that's done in the hardware. In this example [inaudible] we're working on a software model where the software does the checking, but I'm only talking about our sort of hardware model here. >>: [inaudible] memory action. What about the cache hierarchy -- is that on the first write from the CPU or when it hits the DRAM, or does it not matter? >> Dean Tullsen: So it doesn't matter, but it's the L1 cache that's going to have -- because eventually it's going to get to the L1 cache if we're doing a store anyway. And it's the L1 cache that's going to tell us: if the data is in the cache it's going to tell us right away, if it's in memory it's going to tell us a little later, but the L1 cache is going to tell us whether it's actually changed or not. >>: So the effect of [inaudible] calculate something useful to [inaudible] I mean if I had a dynamic data structure ->> Dean Tullsen: We actually do this for tons of dynamic data structures, and it's really powerful for those. Let me talk about this a little bit. 
But there is some truth there, and it doesn't really fit the flow of my slide, so I need to remember to say some things when I get there. All right. So let me talk about the basic structure of this. I've got three elements: this data-triggered thread, this skippable region, and these data triggers. The data-triggered thread is this new thread that does this incremental computation. I've got the skippable region, which again I tried to convince my graduate student we didn't need, but he finally convinced me we did. One of the things it does is create this implicit barrier; you know, as a performance optimization you might be able to find some cases where you don't need the barrier, but we just use a very traditional barrier there. And it gives us this fail-safe for, you know, when something goes wrong with the data-triggered threads. And also it allows us -- it's not something we've done yet, at least in the work I'm talking about in this talk, but again it allows you to turn off data-triggered threads, which sometimes is useful dynamically. And then we've got these data triggers, where we attach the thread with a pragma to actual variables in the program. And there are two ways in which we can do that. The obvious one is just attaching it to some variable, which is the example I already gave, but where it actually turns out to be really valuable -- and probably 80 percent of the time, when things are really working well, we're using this form -- is where we're actually attaching it to a field of a structure. Right? And the beauty here is that if I attach this to the whole variable, then if any of these values change, I trigger threads when maybe I didn't really want to. But by attaching it to a single field in the structure, all this stuff can change all the time, but if this particular field doesn't change, then I don't spawn the data-triggered threads. All right? And this is what allows me to sort of track these dynamic structures. And so here's the example -- here's the classic example; this is actually, I think, AMMP from the SPEC benchmarks. What's going on here is that AMMP -- and we find MCF doing something real similar -- spends a lot of time and a lot of computation just calculating metadata over these large structures. And so the structures themselves are changing constantly -- the values in them -- but the structure itself is almost never changing, right? And so what I can do is, if I just attach a data-triggered thread to this next field, right, then as long as the structure of this -- in this case, a linked list -- doesn't change, then I'm never spawning a data-triggered thread. Now, what I'm doing here is I'm doing it coarse grained. So for instance, if it's a linked list and I change next, it's not clear whether I'm pointing at something that, you know, now points to, you know, dead data or whatever. So this is more coarse grained. And so what I'm doing in the AMMP case is I'm creating a data-triggered thread that is just looking to see if the entire structure of this linked list changes, because it turns out it almost never does. And that's how I'm using the data-triggered thread in that case. And so it would be a little bit harder to just follow this next field, you know, if I'm moving things around in this linked list. I think there might be ways to do it, but pretty much all the cases where we really found data-triggered threads to be effective, we were looking at sort of more coarse-grained threads. All right. 
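Here is a rough sketch of what that kind of structure-field trigger might look like in source form, in the linked-list style just described. The #pragma spelling, the thread's interface, and the skippable-region annotation are illustrative guesses, not the exact syntax from the work; the point is just that only stores to the next field can spawn the thread, and the thread is a restartable recomputation with the original batch loop kept as the skippable fallback.

    /* Sketch: a data-triggered thread attached to one field of a structure. */

    #include <stddef.h>

    typedef struct node {
        struct node *next;     /* the trigger field: the shape of the list */
        double       value;    /* changes constantly, but never triggers */
    } node_t;

    static node_t *head;
    static double  list_metadata;   /* some summary computed over the list */

    /* The data-triggered thread: recompute the metadata whenever the shape of
     * the list changes. It takes only the triggering address, writes a single
     * result, and is restartable (re-running it from scratch is always safe). */
    #pragma dtt thread(recompute_metadata) trigger(node_t.next)   /* assumed syntax */
    void recompute_metadata(void *trigger_addr)
    {
        (void)trigger_addr;            /* coarse grained: rescan the whole list */
        double m = 0.0;
        for (node_t *p = head; p != NULL; p = p->next)
            m += p->value;             /* whatever metadata the program needs */
        list_metadata = m;
    }

    void main_computation(void)
    {
        /* value updates happen all the time and never spawn anything */
        for (node_t *p = head; p != NULL; p = p->next)
            p->value *= 1.01;

        /* a rare structural update: this store to ->next is the kind of store
         * the compiler turns into a triggering store */
        if (head != NULL && head->next != NULL)
            head->next = head->next->next;

        /* skippable region: the batch fallback, skipped when the triggered
         * thread (or a still-valid earlier result) has already done the work */
    #pragma dtt skippable(recompute_metadata)                      /* assumed syntax */
        {
            double m = 0.0;
            for (node_t *p = head; p != NULL; p = p->next)
                m += p->value;
            list_metadata = m;
        }
    }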
So again, I could do this just by attaching to the next field, and I could ignore the fact that all these other fields are changing all the time. Okay. So we made a lot of policy decisions. You know, I think most of them were pretty well thought out, but we're not making any strong arguments that these are absolutely the best decisions; they're ones we sort of made on the first pass of thinking through this programming model. And so some of the things that we did -- attaching data-triggered threads to structure fields was critical. It turns out our initial model of sort of how to support this in hardware didn't support this, so we had to rethink that. But we had to do that because this was actually what really made data-triggered threads work. We placed some constraints on data-triggered threads. I wish we didn't have to do this, but actually this made things a lot more straightforward, and this was really required. So the no-argument thing I've referred to a couple of times. The problem with allowing arguments is that, you know, you spawn this data-triggered thread and you have some arguments that you pull out of the main thread, and then you get to the skippable region and you find that the data-triggered thread executed with different arguments from what you have now, and, you know, what do you do with that? Do you keep the computation, do you throw it away? And it looks a lot like speculative multithreading all over again, with all the problems of speculative multithreading and versioning caches and all this stuff that we didn't want, so we just said no arguments, and that completely solved this problem. So now data-triggered threads are not speculative in any way, at least in the sense of, you know, speculative multithreading. And it's restartable because we want to have the ability to abort threads, and we can't have them sort of accumulate state and then die and start up again, die, start up again, and end up with these false values. And so this is a bit of a burden on the programmer, something that will have to be thought about. But I think compared to a lot of things that you'll find in parallel programming models, it's not completely onerous. >>: [inaudible] it's like a closure but it doesn't -- you can't refer to anything in your surrounding environment. So is that ->> Dean Tullsen: No. That is not true. So we're making tons of changes to memory, but they have to be such that -- so for instance you can compute a value, you can write it in memory, but as long as, you know, you restart the thread, it still computes the right value ->>: I thought it was more that the reads from the environment [inaudible] thread, so the data-triggered thread, its initial state cannot have values that were computed by -- it can read memory. >> Dean Tullsen: Right, right, right. It can read those explicitly, right, but it's not passed those, yes. >>: But it's not [inaudible]. >> Dean Tullsen: Yes. That's right. >>: The notion of the caching is sort of -- that's a programmer problem. >> Dean Tullsen: Yeah. >>: Right? Not a hardware problem? >> Dean Tullsen: Right. Yes. Thank you. All right. So this restartable thing is something the programmer will have to think about a little bit. But, you know, maybe you could sort of have the compiler, you know, basically raise an error and, you know, tell the programmer he's doing something wrong. Okay. So those are some of the programming model policy decisions we made. Some of the things we had to think about architecturally: sort of how to track these addresses. 
You know, it started with this notion that we would have some structure that would sort of track memory addresses. But as soon as we made this decision -- sorry -- to support fields in structures -- in some cases we have these multi-gigabyte structures where we're just tracking one field, which means that we have, you know, thousands and thousands of these specific addresses we want to track, right? So there's no way we're going to create a hardware structure -- create a hardware table -- to do that. If we did, it's going to be bigger than the cache. And so instead we just do this with ISA support. We just say, here's a new type of store; if that store changes memory, we generate a thread. And so now we can track these huge data structures with just, you know, sometimes a single store. And so ->>: [inaudible]. >> Dean Tullsen: What's that? >>: [inaudible]. >> Dean Tullsen: It's like a [inaudible] right. So again, this is -- you know, if you just look at the hardware implementation, there's not a lot of novelty here. But the novelty is in the programming model, okay. Okay. So again, we've got a store and a fork -- you're absolutely right. Because we can't embed enough information for the thread spawn in the store, we've actually got two consecutive instructions, so you are absolutely right. And so we've got this tstore now, and a new instruction after it that will sort of tell me where to get the information for the thread. And again, you know, sometimes this tstore is going to execute and it's not going to change memory. And again, this is hardware support that's very similar to what we saw with the silent store work. And it's not that onerous, because again, unless you actually allow, you know, word writes to your cache, most of the time a cache access is a read-modify-write. And so actually doing this should have very little overhead. >>: [inaudible]. >> Dean Tullsen: I would think so, yeah. >>: But it is atomic. >> Dean Tullsen: In this case, this is going to be atomic, yeah. Okay. And so a lot of times you don't even execute the tstore, in which case you don't generate the data-triggered thread, but even when you do execute it, sometimes you don't generate the data-triggered thread. So we've got a couple of hardware structures. The thread queue is a static structure that sort of knows, you know, what thread this tspawn should generate. And in our case, this is really small because we have very few static data-triggered threads. And then this is a dynamic structure that keeps track of the active data-triggered threads, and in reality we actually have very few of those active, so this is actually a very small structure as well. But this one is dynamic; this one's static. So again, using the thread queue, we spawn a data-triggered thread. We do allow it to have a return value, which gets stored in the thread status table. And so when A completes -- yes? >>: Back to the consistency model, does this tstore, tspawn have the implicit barrier? Can you jump -- does that mean -- yeah, you do a read-modify-write but typically you have to wait, and I keep going, it misses. >> Dean Tullsen: That's a good question. Did we think -- I believe it does, yeah. I believe you would have to be careful there. I can't remember what we implemented, but I believe you're right. Yeah. Yeah. Okay. So all right. 
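To spell out the architectural support being described, here is a rough behavioral sketch in C of the tstore/tspawn pair and the two small tables. The field names, table layouts, and the exact split of work between the two instructions are assumptions based on the talk, not the actual hardware design.

    /* Behavioral sketch of the tstore/tspawn support and the two structures. */

    #include <stdint.h>
    #include <stdbool.h>

    /* Thread queue: small, static; maps a tspawn's thread id to the PC where
     * the data-triggered thread begins. Set up once at the start of execution. */
    struct thread_queue_entry {
        uintptr_t thread_pc;
    };

    /* Thread status table: small, dynamic; one entry per data-triggered thread,
     * holding the return value that the skippable region will wait on. */
    struct thread_status_entry {
        bool      valid;          /* a result is present (possibly from an old run) */
        bool      running;        /* the thread is currently executing somewhere */
        uintptr_t trigger_addr;   /* the address whose change spawned it */
        uint64_t  return_value;
    };

    /* Model of "tstore value -> *addr" followed by "tspawn tid": the spawn only
     * happens if the store actually changes memory -- the silent-store check the
     * L1 can do as part of its read-modify-write of the line. */
    void tstore_then_tspawn(uint64_t *addr, uint64_t value, int tid,
                            const struct thread_queue_entry *tq,
                            struct thread_status_entry *tst)
    {
        uint64_t old = *addr;          /* old value is already in (or fetched to) the L1 */
        *addr = value;
        if (old != value) {            /* non-silent store: trigger the thread */
            tst[tid].valid        = false;
            tst[tid].running      = true;
            tst[tid].trigger_addr = (uintptr_t)addr;
            /* hardware would now start a thread at tq[tid].thread_pc, passing
             * addr as its single argument (thread creation itself omitted) */
            uintptr_t start_pc = tq[tid].thread_pc;
            (void)start_pc;
        }
        /* silent store (or the tstore never executed at all): nothing spawns,
         * and any previously computed result in tst[tid] stays valid, which is
         * exactly what lets the skippable region be skipped. */
    }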
So then, you know, the skippable region is skipped, but you still have to wait for the return value to show up in the thread status table before you complete. But again, the beauty of this is that sometimes you get here and the value in the thread status table is valid, even though you never executed a data-triggered thread, and you just continue on with the region in C. Okay? I'm not going to talk about the hardware structures. They're just really small. And these are trivially small; there's no value storage or anything like that, like you get in other models -- execution models that try to take advantage of redundancy. I'll talk about that a little bit. So let's look at some results. Methodology: we're running SPEC benchmarks. Again, it's programming model stuff, but we're architects, so we're running SPEC benchmarks, and we're not writing new code, but we're modifying existing, very mature code so that we can, you know, show the benefits you get from applying this technique. The data-triggered threads tend to be fairly small, and the actual changes to the source code are also very small. I want to be careful here, because the intellectual effort -- the grad student intellectual effort -- that went into this was fairly high. But when it was all done, the actual changes he made to the source code were actually very, very small. Just a few lines, usually. Okay. So here's what we get. We're running with SMT, so it's very easy to model both a multi-core CMP and SMT threads, the difference being that with CMP you get no competition for execution resources, and with SMT you do. And so in the best case we're getting close to a 6X speedup. Somewhere around one and a half on average -- close to one and a half if you look at the arithmetic mean. With the harmonic mean, which really biases towards the low values, you know, you still get close to 20 percent speedup with CMP. Now, you lose some of that with SMT, and we'll talk about why in just a second. Yeah? >>: Given the chart you showed at the beginning about how much redundant computation there was in each of these, I'm surprised that these numbers are so low. >> Dean Tullsen: So there's -- there's a couple of points here, right? There's a ton of redundant computation that goes on. We have sort of scratched the surface in terms of how to exploit that. This is one model that allows you to exploit this in fairly large chunks. Again, because we haven't completely solved -- although we're working on this with other research -- the problem of creating short threads efficiently, we're doing a lot of coarse-grained things when I'd like to be doing more fine-grained things, which limits the amount of redundant computation that we can exploit. And so what I'm going to say is the redundant computation and redundant loads are a phenomenon that we are somewhat taking advantage of, but most of it is still out there. Does that make sense? That's a good question. >>: As a baseline, I'd really like to know what you'd get if you had just done the equivalent without having to change anything architecturally, so you just change the code so that it does a load before the store, checks to see if anything's different, and then spawns that little bit of code if something changed. How much of a benefit would you have gotten with absolutely no parallel -- additional parallelism? >> Dean Tullsen: So you're meaning if you just programmed it differently? 
>>: Well, you're reprogramming it to take advantage of additional hardware, but you could have just reprogrammed it and not put in the additional hardware. So without any extra -- without those extra hardware structures, are they buying you anything, or are you just getting this because you've changed the ->> Dean Tullsen: Yeah, we ->>: The code? And I don't know that for [inaudible]. >> Dean Tullsen: Right, right. We don't have that. But the point is, what this buys you is -- what's going to happen is you're going to end up putting -- if you want to track all the changes, you know, go back to my matrix add example. If you want to track all the changes to these, you've got to put conditionals in the middle of all of your tight loops, right? And all of a sudden this code gets a lot slower. And so this allows you to do that without messing with that, right? And so most of the computation -- the original computation -- stays the same, without sticking in conditionals and extra flags and extra large data structures to track all these changes. >>: [inaudible] traditional compiler just take care of that? Right? You're putting conditionals in large loops, but you're going to unroll those loops. Your compiler will take care of that, and presumably existing compiler technology takes care of that. >> Dean Tullsen: But the conditional doesn't go away, because you still have to check on every access. Right? So the cost of that conditional is still there. Right? And so you're still expanding the cost of your inner loop significantly. >>: But are you going to even notice that, given that the loads and stores are probably what's killing you here? >> Dean Tullsen: Possibly. >>: Right. So if the loads and stores are what's killing you here, a few extra instructions aren't doing anything. What you've really done is you've just saved the need to go through this twice, and a few extra conditionals are probably negligible. >> Dean Tullsen: So here's the issue. You could sort of make a case that you could program this way, right? >>: But you're already ->> Dean Tullsen: But we're executing mature code that is not programmed this way. >>: [inaudible] programmed this way. I just want to know the benefit of -- that you got just from forcing the user to change their model versus the extra hardware. If it's all from just forcing the user to change their model, then great, we don't need the extra hardware, you can get this today. If the extra hardware really matters, then I'd like to know that too. >> Dean Tullsen: Yeah. So I mean we're looking -- to give you a [inaudible] we are looking at, you know, a full software version of this, right? But it's still -- you know, it's still the same programming model, but it's, you know, without any hardware support. And, you know, in some cases -- and I could show you where, you know, as we move a little further in this graph -- sometimes we're getting advantages from redundancy, in which case the software overhead doesn't matter, and in some cases we're getting, you know, benefits from parallelism, in which case the software overhead does. But again, that's work we're kind of in the middle of right now. >>: You just answered that [inaudible]. So do you know right now what fraction of this benefit can be attributed to starting the thread early versus just getting rid of redundant computations? >> Dean Tullsen: Yeah. Let me sort of move forward. We've got those graphs. 
And so here are the applications where we're getting advantage from redundancy. And one of the ways you can see that -- and I've got another graph where you can see it even more clearly -- one of the reasons you can see that is because I get the same benefit with CMP as I do with SMT, right; the fact that I'm competing for resources doesn't matter, because I'm not actually executing this other code. But where I'm getting advantage from parallelism -- I hate to even show these graphs -- actually SMT doesn't do nearly as well as CMP, because there's more competition for these execution resources. And that's these. And then the others are where we don't get speedup. And so a better way to see that is the execution time breakdown. So if you look at MCF, you know, the main thread is running here, never running data-triggered threads. But I'm also skipping all the computation that I've moved to the data-triggered threads. And so I'm getting all my speedup just by not executing this stuff. Right? I can come over here, where if you sort of, you know, double this, you can see that I'm actually executing more instructions here, but I'm getting speedup because I get a lot more parallelism. And keep in mind that I'm running the SPEC benchmarks, and so there's very little thread level parallelism here anyway. And so we're really scratching for it when we do find some, although these are cases where people have successfully found some parallelism. >>: [inaudible] speed up to the extra energy ratio; in other words, are you running more efficiently or less efficiently as a result of this [inaudible]. >> Dean Tullsen: Well, here ->>: Well, MCF is an outlier, so for the rest [inaudible]. >> Dean Tullsen: It's not an outlier. It's an outlier by degree. But, I mean, there are a lot of cases where we're running a lot fewer instructions, right? And so there's a huge gain in these cases for energy, right? I mean, it's, you know, 6X -- well, at least a 6X, you know, improvement in energy efficiency. And so the benefits we get from redundancy all, you know, go straight to energy, right? The parallelism -- this is just -- you know, you can just think of this as traditional parallelism, right? You know, the energy benefits you get from either going multi-core or SMT, which are a little or a lot, you know, really translate pretty directly here. All right. All right. So the three interesting cases: here, we're just eliminating a lot of redundancy. Here, more instructions, but speedup from parallelism, so this sort of fits a pretty traditional model. The interesting case here is where we actually have data-triggered threads, and they execute sometimes, but they never execute in parallel with the main thread -- there's just not enough slack to spawn them early enough -- but because they're redundant often enough, we still get speedup. All right. This is less interesting. So sensitivity to spawning latency is actually pretty predictable: if we don't actually spawn the threads, we don't care how long it takes to spawn them. In some cases we do. One of the interesting things that happened here is that we really did speed up for a couple of reasons. The first one was really expected: in most of the cases we're doing less computation, and we're doing a lot fewer loads. But even those loads that we're doing actually get better cache hit rates. 
And the reason for that is, a lot of times, by surrounding these very large data structures with these data-triggered threads, we're not bringing that data, you know, into the cache and pushing other stuff out. So our overall [inaudible] we're doing fewer loads, but those loads are hitting more often in the caches. Okay. So people have looked at all kinds of different mechanisms for taking advantage of redundancy. Their models of redundancy were different than ours, and so, you know, if you asked them how much redundancy there is, you'd get very different numbers because it's defined differently. But the big difference, you know, between our technique and all these others -- value prediction, instruction reuse, memoization -- is that they all have mechanisms, whether they're in hardware or software, that detect sameness. Right? And so if you want to know that this structure is the same as it was before, you actually have to store that structure somewhere. Right? And so, you know, in our case, though, we're applying data-triggered threads to multi-gigabyte structures with no hardware storage. And the reason we can do that is because we're detecting changes, not sameness, so we need no storage to actually detect sameness, right? And so if we go through this multi-gigabyte structure and never change it, you know, without any storage we can detect that -- you know, we know that it hasn't changed. Right? And so, you know, again, our hardware structures are tiny. All right. Okay. So that's everything I was going to say. Parallelism crisis: we really need new compilers, new execution models, new programming models. We find tons of redundant code in the C code that we measured. We created this data-triggered threads programming and execution model, which attacks redundant computation. It can expose parallelism too, but because of the code that we are attacking and because of the way that we, you know, profiled the code to kind of create these threads, we put a lot more emphasis on the redundant computation. But there is certainly some potential there that we'll probably look at more in the future. Very little architectural support: a little bit of ISA, some small hardware structures, and minor changes to source code, at least in terms of, you know, line counts and things like that. And, you know, in the best case we get really large speedups. Yeah? >>: If the layout of data-triggered memories are occurring, how do you avoid storing one extra bit per memory word to tell you this word is booby-trapped? >> Dean Tullsen: So because the -- if you think of this extra bit, it actually appears in the code instead of in memory, right? And so I have an extra instruction. I have a new instruction in my ISA. It's a tstore, and so, you know, if that tstore instruction touches memory, I know I spawn data-triggered threads; otherwise I don't. So I don't need any bits in memory. I've just got a new type of store instruction that watches it for me. >>: Somehow there's got to be a structure of which way they're booby-trapped between the tstore as [inaudible]. >>: [inaudible]. >>: So maybe this is something I didn't [inaudible] -- I saw the tstore, but where did it actually register, I want to run this code with that. [inaudible]. >> Dean Tullsen: So ->>: [inaudible]. >> Dean Tullsen: So that was the tspawn instruction that came after, right? So [inaudible]. >>: [inaudible] hidden address [inaudible]. >> Dean Tullsen: Yeah. So that was the thread queue, right. 
So I'll set up this thread queue at the beginning of my execution that aligns a tspawn instruction with a data-triggered thread's program counter, right? And that's enough that when I see the tstore followed by the tspawn, and the tstore changes memory, the tspawn then causes a lookup in the thread queue, which then pulls the program counter out of there and starts a thread. >>: It seems like you could get really far toward this with existing ISAs. Maybe a different level of overhead. But very, very far. >> Dean Tullsen: Yeah, yeah. Pretty close. The, you know, thing that you won't find in any ISA that I know of is one that allows you to do this silent store check. And that's really the part of the tstore that I don't think we can find in a current ISA. Most have the rest of the stuff, yeah. >>: It's a kind of silly question, but why do you have a static thread queue and not just put the address of the thread you want to spawn in the immediate [inaudible]. >> Dean Tullsen: That's a good question. That's a good question. Boy, I'd have to look -- I'd have to look that up. >>: [inaudible]. >> Dean Tullsen: If we went back and looked at what's in this table, without Hung-Wei here I would have to actually try to figure it out myself. Maybe I won't do that. I'm trying to think whether it's a residual of the time when we actually had arguments, because there was some information about arguments in there, and we eliminated that. What I don't know is whether, now that we don't have arguments, we could completely eliminate the thread queue. I'm not sure. That's a good question. But it's actually really small, so -- as I said multiple times. All right. Thanks a lot. [applause]