
>> Doug Burger: Okay. It's my pleasure to introduce Professor Dean Tullsen
from the University of California at San Diego. Dean is a very, very well known
and productive and influential member of the computer architecture community. I
won't recite his long list of awards, recognitions, and papers, but he
is well known for his seminal work on simultaneous multithreading, done in a
region of the country of which we are all very fond, namely here, although
not at Microsoft. And I think it's public, right, your award is public now? It was in
your --
>> Dean Tullsen: I expect it is. It's probably still somewhat secret, but you could
reveal --
>> Doug Burger: Well, it was in your bio, so I'm going to release it. So Dean
this year is receiving at ISCA the ISCA Influential Paper Award, for a paper from an ISCA 15
years ago, and he is the first person to have won it two years running. Or is it
two years running, or was there a year in between? I think it was two years
running. So this is a huge deal; two influential papers in a row for this prestigious
award is a really incredible feat. So we are delighted to have you here, and we look
forward to hearing your talk on what will be the most influential piece of work 15 years
from now. And so, please.
>> Dean Tullsen: Great, thanks. All right. Thanks. Thanks for coming. And
thank Doug for hosting me.
So if you'll bear with me a little bit: I always have trouble when I go out and give
talks because we work on so many things that I have trouble coming and
talking about just one of the things we're working on. So I'm actually going to give
you a brief introduction to a couple of other things we're working on -- things that
have recently produced results -- and then I'll spend most of my time
talking about this topic, which is data-triggered threads, a topic I'm really
excited about.
But to sort of put a common spin on all the topics I could talk about: most of the
things we're working on these
days are very much focused on what some call the parallelism crisis,
and I think you're familiar with the parallelism crisis, but in short what we're finding is
that it's a lot easier to create hardware parallelism than it is to create software
parallelism. And this isn't true in every domain -- there are domains with tons of parallelism --
but certainly there are a lot where we can build, you know, really large
multi-cores and many-cores, but there are not going to be, you know, hundreds of software threads to run on them. Right?
And so again, again, you know, everybody has these road maps that are
predicting more and more cores, but it's not clear that software parallelism is
going to grow at the same speed that hardware parallelism is. Doug doesn't
believe me.
>> Doug Burger: No, I think it is clear.
>> Dean Tullsen: Okay. Good. So, like I said, I didn't think I had to convince
this audience. But there are a lot of implications of that. And so we could
continue to advertise, you know, peak performance that is scaling, but
unfortunately actual performance is only going to grow now at the rate at which
software parallelism grows which is again a lower rate than hardware parallelism
grows, and in some cases will be quite slow.
And the problem with that is that a large part of our economy, as I know you are
well aware, is tied to this performance scaling, right, because along with this
performance scaling is this upgrade cycle, right, where every three years you buy
a new computer because your old one is embarrassingly slow. And when you
buy that, you also buy new Microsoft products, et cetera, to go with them.
And if we don't have the performance scaling, then all of a sudden this upgrade
cycle slows down considerably.
Some of the other implications, right: in order to keep up, we're going
to need a lot of software parallelism. And so we're working on architecture things,
but really everybody needs to chip in to solve this problem, and there's not going
to be a silver bullet, and so the architects need to work on it, compiler people
need to work on it, programming languages people need to work on it, and maybe together,
you know, we can all make some progress.
Another implication, and this is the way we've been thinking about these
things for years, is that we just treat threads as a free resource, right,
because in a lot of environments threads are just going to be sitting around
idle. I'm talking about hardware contexts, whether it's multicore or multithreading.
Most of the time, if we have some problem we want to solve, there are going to be
threads available to help solve that problem.
And another way of putting that -- you know, most of you heard the expression
that if the only tool you have is a hammer, then every problem looks like a nail.
Well, that's very much going to be the case moving into the future: we have
this giant hammer, and this giant hammer is lots and lots of hardware parallelism.
And so most of the problems that we're going to try to solve from here
on out, we're going to try to beat down with hardware parallelism. And in fact, the
particular problem that we are trying to attack, coming back around to
where we started, is this parallelism crisis. That is, if we only have a few threads
to run, how are we going to scale the performance of those few threads or that
one thread? Well, we're going to use lots of cores, lots of parallelism, to do that.
>>: [inaudible].
>> Dean Tullsen: Yes?
>>: I had noticed you're not talking a lot here about energy and that's one of the
bounding things we're facing. So I had a fairly specific question. If you take
SMT, if I have not, you know, [inaudible] blah, blah, blah core and I could run a
thread -- I have two threads, A and B, I could run them both on the same core
and SMT context or on, you know, adjacent cores. Have you guys measured
what you think the energy per unit performance is for those two scenarios?
>> Dean Tullsen: We have. We did it years ago. So we haven't revisited it
really recently. And some of --
>>: [inaudible] together but you don't get as much [inaudible].
>> Dean Tullsen: Right. The marginal energy -- the marginal power you use is
really small if you're multithreaded. And generally it's much smaller than the
marginal performance you get from running multithreaded.
So most of the time SMT is a real big energy win. But again, that was work from
years ago, but I really believe that that's still valid.
All right. In fact we're doing lots of stuff with energy, but I'm -- I am talking about
more performance issues right now. But they tend to be -- they tend to be highly
related. My favorite way to reduce energy is still to run faster. And many times
that's the most effective way to save energy is just to run fast.
Okay. Okay. So again what we're looking at is to use lots of hardware
parallelism to run low levels of software parallelism quickly, efficiently or with high
energy efficiency. And so really there are three things I want to
talk about. Again, we'll spend most of our time on the last topic. But I'm going to
talk about some things that we're doing in terms of compiling for non-traditional
parallelism, architectural support for short threads and then the data-triggered
threads.
And again, sort of true to the previous slide is that we've sort of branched out,
and we're sort of again also trying to sort of attack the parallelism crisis from all
different angles.
So first I'll talk briefly about this work on compiling for non-traditional parallelism.
Non-traditional parallelism is a term I've been using for over a decade now. I
keep expecting it to catch on. It never does. But what I mean by non-traditional
parallelism is getting parallel speed up, and by parallel speed up, I mean take
something that runs at this speed with one core, make it run a lot faster with lots
of cores. But applying it even when there's no software parallelism. Okay? And
so a couple things I'm going to talk about. Data spreading and inter-core
prefetching are just the most recent examples of this approach, in particular
pointing the compiler at this problem.
So with software data spreading, what we're trying to do is take some code,
presumably serial code, and get speedup on multiple cores. And the particular
problem I've got here is a couple of loops with really large datasets that
don't fit in my cache, and therefore there's lots of cache thrashing: when I
execute loop A, then B, then go back to A, A's data is not in the cache any more. And if we were
able to parallelize this, a lot of times when we parallelize code we get speedup
not necessarily because we have four times as many functional units but we get
speedup because we have four times as much cache.
So the question is if we're running serial code, can we get those same
speedups? And so with data spreading all we do is we stick these migration calls
in the middle of these loops and start bouncing this serial code around from core
to core to core which then allows us to aggregate the space of our private caches
so that these data structures get spread over all the private caches in our
system.
And so what happens here, if the sizes happen to work out, as they do in my
example, is that all of a sudden everything fits in the cache; when I go back to the
blue loop again, everything's still in the private caches. And so you get the effect
of aggregating the private caches of my cores without any hardware support, with
a simple compiler solution.
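A minimal sketch of the idea, assuming a hypothetical migrate_to_core() runtime call (built here on Linux's sched_setaffinity) and an invented chunk size; this illustrates the technique rather than reproducing the authors' compiler output.

```c
/* Software data spreading sketch: the serial loop migrates from core to core
 * so that its large array ends up spread across all the private caches. */
#define _GNU_SOURCE
#include <sched.h>

#define N_CORES 4
#define CHUNK   (1L << 16)   /* elements per private cache; an assumed tuning knob */

static void migrate_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* 0 = the calling thread */
}

/* The compiler-inserted migration call spreads this loop's data over the cores;
 * when the loop (or a later loop over the same array) runs again, each chunk is
 * still sitting in the private cache of the core it was touched on. */
void loop_a(double *a, long n)
{
    for (long i = 0; i < n; i++) {
        if (i % CHUNK == 0)
            migrate_to_core((int)((i / CHUNK) % N_CORES));
        a[i] = a[i] * 2.0 + 1.0;
    }
}
```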
Okay. So inter-core prefetching, another attempt to use the compiler to get
parallel speedup on serial code. And I really like this work because I consider
helper thread prefetching on multi-cores to be an open architectural problem,
meaning that we did sort of the early work on helper thread prefetching for SMT
as did some others a little more than a decade ago. And even though
multi-cores are really the dominant source of hardware parallelism now, nobody
has really ever shown how to do helper thread prefetching across multi-cores.
And so -- and you can do this with hardware support, but nobody has ever really
seriously proposed that either. And so we're going to do it actually without
any hardware support. What we're going to do is we're still going to have this
main thread and this helper thread that is precomputing addresses that the main
thread is going to use, like the SMT helper thread work, but what we're going to
do is again use migration, and we're going to have these threads basically chase
each other around our multi-core.
And so we've divided the execution into chunks. In the first chunk, the
main thread's executing here, the helper thread's executing here, and after they
both execute a chunk, they'll move. And now the main thread is executing on the
CPU where the private cache is completely loaded with all the data it's going to
need for the next chunk. Right? And they'll continue to just chase each other
around the processor. Okay. So, helper thread prefetching on multi-cores.
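A rough sketch of the idea under the same assumptions as the previous one: migrate_to_core() is hypothetical, the chunk size is made up, and the chunk-boundary synchronization between the two threads is only hinted at in comments.

```c
/* Inter-core prefetching sketch: the main thread and the helper swap cores
 * after every chunk, so the main thread always lands on a core whose private
 * cache the helper has just warmed. */
#include <stddef.h>

#define CHUNK 4096L

extern void   migrate_to_core(int core);    /* as in the data-spreading sketch */
extern double work(const double *a, long i);

static const double *data;
static long          n_elems;

/* Helper thread: prefetch chunk c+1 on the core the main thread moves to next. */
void *helper(void *arg)
{
    (void)arg;
    for (long c = 0; (c + 1) * CHUNK < n_elems; c++) {
        migrate_to_core((int)((c + 1) % 2));
        for (long i = (c + 1) * CHUNK; i < (c + 2) * CHUNK && i < n_elems; i++)
            __builtin_prefetch(&data[i]);   /* warm this core's private cache */
        /* a real implementation synchronizes with the main thread here */
    }
    return NULL;
}

/* Main thread: execute chunk c on the core the helper prefetched it into. */
double main_loop(void)
{
    double sum = 0.0;
    for (long c = 0; c * CHUNK < n_elems; c++) {
        migrate_to_core((int)(c % 2));
        for (long i = c * CHUNK; i < (c + 1) * CHUNK && i < n_elems; i++)
            sum += work(data, i);
    }
    return sum;
}
```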
All right. Another project that I'm pretty excited about, and again I'm not sure I've
gotten the world to believe me yet, but I believe that in order to take full
advantage of the hardware parallelism that's becoming available to us, we need
to be able to execute short threads effectively, because we're seeing more
and more execution models that require short threads or generate lots of short
threads, whether it's speculative multithreading, transactional memory, a lot of
the compiler techniques would like to create short threads. But unfortunately,
unfortunately we're really good at building processors that can execute a billion
instructions well. But we have no idea how to build processors that could
execute a hundred instructions well.
And really, to amortize the cost of forking a thread and executing some
code on it, we need multiple hundreds of thousands of instructions.
But what I'd really like to do is be able to find a pocket of parallelism that's a hundred
instructions long and be able to fork a thread and profitably execute that.
Because there's just a ton of parallelism we're leaving on the floor simply because
we can't amortize that cost.
And so we did some simple things. We actually had done some earlier work with
the branch predictor, but just focusing on memory effects: the fact is that we're
running these 100-instruction threads and they're running seven times slower
than if we just executed that same code on the same core, where we had built up
the state. And we were able to do some pretty
simple things to really bring that down and to get this multiplicative performance
increase for small threads. But they weren't necessarily the obvious things,
because copying cache data actually is kind of worse than doing nothing, it turns
out.
Okay. That's all I was going to say about that. Okay. So the topic at hand is
data-triggered threads. The point here is that with traditional parallelism, you're
executing, the program counter is moving, it gets to a fork instruction, and you
generate parallelism. And so parallelism is virtually always generated based on
the program counter, on control flow through the program.
And this is true not just for traditional models of execution; even
all these exotic forms of execution still tend to spawn parallelism
when you hit a program counter, whether it's a trigger instruction or a fork
instruction or what have you.
So the question we asked is: what if, instead of spawning threads because we hit
a program counter, we spawn threads because we touch memory? And that's
what we're going to look at.
Okay. And so I'm going to talk a little about dataflow. You might have guessed I
was going to talk about dataflow. This is in no way a dataflow machine, but it's
interesting to see how this sort of relates to the ton of past work on dataflow.
And so again, with a von Neumann architecture, we're sort of following control through
the program. We spawn parallelism. But again, it's always based on control
flow. Okay?
Dataflow -- I know you're familiar with this, but you express a program in terms of
this graph, and it's not driven by instructions in memory, it's not driven by this
artificial control flow generated by the compiler; instead, you know, computation
happens when the data becomes available. Right? And so -- the problem is,
again, we've known about dataflow for years and years. It's this beautiful
execution model. But nobody has ever built one, and nobody is ever going to
build one, because there are all kinds of technical problems with doing that. At
least not one that's -- not one that's virtually --
>>: Can you define nobody has ever built one.
>> Dean Tullsen: No one has built one that is commercially viable. I apologize.
I said exactly the wrong thing to this audience. Nobody has ever built one that's
commercially viable.
>>: I think even that's a little too strong. Nobody has ever built one that's been
sold commercially.
[brief talking over].
>> Dean Tullsen: Okay. You have to agree, there are all kinds of reasons why
dataflow is hard. But there are a couple of things we really love about dataflow.
One is that parallelism becomes exposed immediately, right: as soon as the data gets
changed, the computation that depends on it becomes available for execution.
And, talked about quite a bit less, there's no unnecessary computation, or at
least a lot less unnecessary computation. That is, if you don't change A and B,
you never execute this add. Okay?
And that's the thing I want to talk about, because that's a little less obvious and
that's really the focus of what we're doing with the data-triggered threads.
Although you can get the parallelism benefits as well.
The way I like to think about computation is with a somewhat simplified
model, but it's not all that simplified: think of computation as just strings of
computation that start with loads and end with stores. You know,
sometimes there are multiple loads involved. But, you know, most things can be
characterized this way. Because if you're not storing data, then the
computation you're doing doesn't actually have any impact.
And what we find is that if we think of computation like this, a lot of
times you go back and you do the same computation: you do the same
loads at the same addresses, load the same values, do the exact same
computation you did sometime before, and store the exact same value to
memory.
And, you know, so sometimes these loads have changed and you're loading a
new value and you produce useful stores. But a lot of times these are what we
call redundant loads. And of course then the computation that follows them is
typically redundant as well. So a redundant load is when I load the same value
from the same address as some previous instance with no intervening store that
actually changes the data, I should say.
So how often does this happen? Oh, wait. Sorry. I should give a little more
concrete example first. So I've got some code here. This would probably be more
representative if this were a matrix multiply, but a matrix add makes it a
lot easier to see in this example. So what happens here is that I've got A and B, and
then at the end of this loop I compute C, which is A plus B. And in the code
intervening I make some changes to A and B. But in a lot of code it's rare that I
would change all of A and all of B. It's more common that I would probably
change a few values in A and a few values in B.
And then I write this code, and -- again, you could say this is wasteful, but this
is the way we write code, this is the way that code, you know, pretty much
always appears when we look at it -- I recompute all the values of
C, even though very few of them have actually changed. All right. And so this is
a real classic example of redundant computation based on redundant
loads.
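To make the slide's example concrete, here is a hedged sketch in C; the array sizes and the particular updates are invented for illustration.

```c
/* Roughly the slide's matrix-add example: C is recomputed in full even though
 * only a few elements of A and B changed since the last pass, so most of the
 * loads (and the adds and stores that follow them) are redundant by the
 * talk's definition. */
#define N 1024

double A[N][N], B[N][N], C[N][N];

void update(void)
{
    A[3][7]   += 1.0;     /* only a handful of elements actually change ...  */
    B[100][5] -= 2.0;

    for (int i = 0; i < N; i++)        /* ... but C is rebuilt wholesale      */
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];
}
```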
Okay. So how often does this happen? Well, all the time. Right? And so we
measured SPEC benchmarks, and on average 80 percent of the loads that we
do are redundant by our definition, which means that it's the same value we
loaded last time we accessed this address. Nothing has changed. And most of
the computation that follows this is also redundant, and then eventually I get to a
store where I store the same value again. Not only is, on average, 80 percent
of the computation redundant, but even in the worst case over 60 percent of the
computation we're doing is redundant and unnecessary.
>>: Have you looked at what fraction of those [inaudible]? Every time you give the
talk, you do zeros in blue on the chart [inaudible].
>> Dean Tullsen: Oh. Sorry. Say that again.
>>: If you figure out what percentage of the loads you're doing the values were
the same as zeros and show that as a fraction of those [inaudible].
>> Dean Tullsen: That's a good question. I didn't -- you know, that's something
we always looked at when we were looking at value prediction like everybody
else several years ago. And, you know, most of the predictable values were
zeros. I don't believe that that's -- that that subsumes this nearly as much as it
did the value prediction work.
>>: Maybe 25 percent --
>> Dean Tullsen: Right. Right.
>>: Dean?
>> Dean Tullsen: Yes.
>>: [inaudible] for the particular value of the load or the whole chain that results
in the store.
>> Dean Tullsen: This is just the loads. And if I graphed here the actual
computation that's redundant, the numbers would be lower. But it's still over 50
percent. So -- but we're focusing on loads primarily because that's the way we're
thinking about it. We're sort of looking at the loads as triggers that signify
computation that we don't actually want to recompute, things like that. And so we
focused on loads rather than the actual computation just because it made more
sense to us intuitively.
But if I actually graph the computation, it's still over 50 percent. But it's lower
because, you know, sometime -- you know, instructions depend on multiple
loads. And if one of them's non-redundant you have to execute it.
All right. Yeah. In the worst case, you know, 60 percent of the time we're still doing mostly useless computation. Okay. Okay.
So the point is with data-triggered threads can we exploit parallelism and avoid
redundant computation in some of the same ways that maybe a dataflow
architecture could but with really, really minor changes to a von Neumann
architecture? So it would hopefully be real simple to implement.
And so we've got these data-triggered threads that are triggered by a
modification to memory rather than reaching some program counter. Our
threads are always non-speculative so this is very different -- I'm going to show
some pictures that look a lot like speculative multithreading but they're
completely non-speculative and the appearance of similarity is just going to be
false.
And so think about code like this. I've got -- I've got a store which may or may
not be executed up here. I've got this region of computation that depends on the
store. And then I've got some following region of code here. And so what I'm
going to do with data-triggered threads is I'm going to watch for -- watch for this
store to change memory. As soon as it's changed memory, I'm going to spawn
this data-triggered thread which is going to do the computation that maybe used
to be in B.
When A gets to the end of its region, it's going to wait for the
data-triggered thread to be done, and then it's going to resume computation here
at the end of what was B -- we're going to call that the skippable region -- and then
execute the code in C. Okay. And so hopefully we get some speedup. We may
get some extra parallelism here, depending on how much distance there is here.
But the real power, at least in our initial look at this, where we really focused on
redundancy -- the real power is if this store did not actually change memory,
maybe because it wrote the same value, or maybe because this if resolved so that we never
executed the store. Then, in fact, we just completely skip the B region, which
is represented by the data-triggered thread, which we now don't have to execute,
because the computation that the data-triggered thread would do would just be
recomputing a value that's already in memory.
>>: So what [inaudible] but not the load later on? So you might store something
and a load path might be conditional.
>> Dean Tullsen: So that's possible. It depends on the programmer to sort of
think that through. But, you know, in this model, we might actually execute the
data-triggered thread.
>>: So it might be some speculation?
>> Dean Tullsen: There could be some software speculation in this way, yeah.
And it's a programming model, so the programmer could actually do all kinds of
bad things. But when it -- you know, when it works well, hopefully it works a lot
like this.
Okay. And so going back to this, you know, matrix add example, the type of
thing that's going to happen is that I've attached a data-triggered thread with a very
simple pragma to this array. I would obviously have one attached to B as well, but I don't show that here.
So this is my data trigger, this is my
data-triggered thread, and this is what I'm going to call my skippable region. When I
actually do this store to A, which changes a single value in the array A, it's
going to spawn this data-triggered thread, which is going to recompute one
element in C, and then when I get to this code, which was here
originally and which is now what I'm calling my skippable region, it's going to
completely skip that.
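Here is a hedged sketch of what that might look like in code. The pragma spellings, the one-argument thread signature, and the index-recovery arithmetic are only guesses at the real DTT syntax, not a reproduction of it.

```c
/* Data-triggered-thread version of the matrix add (illustrative only). */
#define N 1024

double A[N][N], B[N][N], C[N][N];

/* The data-triggered thread: its only argument is the address the triggering
 * store modified; from it we recover (i, j) and update one element of C. */
void recompute_C(void *addr)
{
    long off = (double *)addr - &A[0][0];
    long i = off / N, j = off % N;
    C[i][j] = A[i][j] + B[i][j];
}

void update(void)
{
    /* #pragma dtt trigger(A) thread(recompute_C)   -- attach the thread to A */
    A[3][7] += 1.0;          /* this store is compiled to a tstore/tspawn pair */

    /* #pragma dtt skippable(recompute_C) */
    {
        /* The original batch loop: skipped once every triggered update has
         * completed, and kept as the fall-back if a thread aborts or never runs. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
    }
}
```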
Yeah, Jim?
>>: So what's the [inaudible] programmer thinks about?
>> Dean Tullsen: So that's a good question. And we should talk about that after,
because you could probably help me figure that out. I don't
know quite how to describe this, so let me describe it in sort of the terms
we've been using, which is that the programmer
can do anything they want, but we basically want them to create,
you know, data-triggered threads with no data races, and then you've got a
barrier here, in which case you're going to wait for
all updates to be complete and to appear in memory before moving on here.
>>: So you [inaudible].
>> Dean Tullsen: This skippable region is an [inaudible] barrier, right? And so
you're never going to get past the skippable region until all the data-triggered
threads are completed.
>>: [inaudible] do you run the skippable -- it seems like when you make the
updates then you use the updated function.
>> Dean Tullsen: So this is something I was going to get into later. This is a
long conversation I had with my graduate student Hung-Wei. I
didn't particularly like skippable regions, and everybody looks at this and says,
why do you have the skippable regions there? I initially thought it was just
because, you know, we're working from existing code
instead of writing new code and he wanted to leave it in there. But actually, you
know, in the end he's right, and there's a real good reason to have the
skippable region. And that's because we'd like the data-triggered threads to be
able to fail every once in a while. We have an explicit abort. Sometimes you
find that you're executing this data-triggered thread and you want to touch
some data you didn't really expect to touch, and you might just abort the
data-triggered thread.
You also want to have the ability for the data-triggered threads not to find a core
and so they don't execute. And in all those cases, and this represents a backup,
we'd also like to be able to dynamically turn off data-triggered threads. You
know, sometimes we're just firing off too many data-triggered threads,
and we'd like to just turn them off for a while and fall back to this code.
>>: [inaudible] hardware data [inaudible] threads and have the 15th one decide it
doesn't want to run, and then what do you do?
>> Dean Tullsen: So -- so.
>>: Does the block update need to be added [inaudible].
>> Dean Tullsen: So sorry. Ask the question again, make sure I understand it.
>>: Well, you said that one of the things that could happen is that you could run
a bunch of -- you could spawn a bunch of data to [inaudible] threads. One of
them might decide oh, there's some reason I don't [inaudible].
>> Dean Tullsen: Yes.
>>: And if you then abort at that time you've done some of the updates via the
update functions. Then you've decided you're going to run the block update.
Does that mean the block update needs to be [inaudible]?
>> Dean Tullsen: It does. And so, in fact, we're going to place some
restrictions on data-triggered threads. We're going to say they need to be
restartable, for example. So for instance, you could actually abort data-triggered
threads in the middle, and they'd better not be doing anything that creates
an inconsistency. They shouldn't even be accumulating state, you know, as
some kind of sum internally, et cetera, because restarting them could create false
values. So we place some restrictions, and
the main one is just this restartable feature.
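A small sketch of what that restartability restriction means in practice; the thread bodies below are invented examples, not code from the system.

```c
extern double values[];
extern long   n_values;
extern double cached_sum;

/* OK: every run recomputes the result from scratch and only then publishes it,
 * so aborting partway through and restarting gives the same answer. */
void dtt_sum(void *addr)
{
    (void)addr;
    double s = 0.0;
    for (long i = 0; i < n_values; i++)
        s += values[i];
    cached_sum = s;
}

/* NOT OK: accumulating into shared state.  An abort halfway through followed
 * by a restart double-counts part of the array -- exactly the "false values"
 * the restartability restriction rules out. */
void dtt_sum_broken(void *addr)
{
    (void)addr;
    for (long i = 0; i < n_values; i++)
        cached_sum += values[i];
}
```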
>>: [inaudible] fundamental model and programming model [inaudible]. And you
then have an incremental algorithm. And the programmer needs to ensure that
the incrementalization that they have done with the data-triggered threads
matches essentially the batch.
>> Dean Tullsen: Yes.
>>: And then the -- there's lower level guarantees at the appropriate level. So
fundamentally the [inaudible] is on the programmer to figure out with respect to
their data structures and with respect to this right -- sort of this dependence
model to create an incremental version of their algorithm, and then the
correctness condition at the high level is that the -- when the data-triggered
threads run that's just optimization of the batch.
>> Dean Tullsen: Yes.
>>: So is that --
>> Dean Tullsen: That's a reasonable --
>>: [inaudible].
>> Dean Tullsen: That's a reasonable way to think about it, yes.
>>: Okay. Because then there's a whole area of something called
self-adjusting computation, which is the software version of how you get an
incremental algorithm from a batch one, which essentially looks very much like
creating a dataflow graph in memory and then normalizing an [inaudible]
validation and reexecuting the dataflow.
So have you considered actually, you know, what you could do in hardware to
support that sort of model because there's a -- there's already this self-adjusting
computation [inaudible] explored in software [inaudible]. On the other hand, you
get a hundred times -- sometimes you get 20,000-times speedups, you know, I mean
really nice, really nice [inaudible].
>> Dean Tullsen: [inaudible] part of my reason for coming here, you could
probably point me at some other things we might have missed. And if you
wouldn't mind doing that, maybe send me an e-mail.
>>: Just clarification. What is the X in the function there and where [inaudible].
>> Dean Tullsen: So, sorry. Unfortunately it [inaudible] so we only allow one
argument to the data-triggered thread. I'll talk about why we do that. And the
argument is actually the address, right. So we trigger -- we trigger a thread
because we touched a memory address, right? And the only argument that we
allow the data-triggered thread to take is that address. Okay? It's always doing
its thing based on that address, and knowing where it starts, I figure out
what this index is. That's what this funny code is doing.
>>: [inaudible] just on writing, so you don't actually check to see if the values
change from --
>> Dean Tullsen: No, that's done in the hardware in this example.
[inaudible] we're working on a software model where the software does the
checking. But I'm only talking about our hardware model here.
>>: [inaudible] memory action. What about the cache hierarchy -- is that on the
first write from the CPU or when it hits the DRAM, or does it not matter?
>> Dean Tullsen: So it doesn't matter, but it's the L1 cache that's going to have it,
because eventually it's going to get to the L1 cache if we're doing a store anyway.
And it's the L1 cache that's going to tell us -- eventually: if the data is in the cache,
it's going to tell us right away; if it's in memory, it's going to tell us a little later. But
the L1 cache is going to tell us whether it's actually changed or not.
>>: So the effect of [inaudible] calculate something useful to [inaudible] I mean if
I had a dynamic data structure --
>> Dean Tullsen: We actually do this for tons of dynamic data structures, and it's
really powerful for those. Let me talk about this a little bit.
But there is some truth there, and I -- and it doesn't really fit the flow of my slide,
so I need to remember to say some things when I get there. All right.
So let me talk about the basic structure of this. I've got three elements: this
data-triggered thread, this skippable region, and these data triggers. The
data-triggered thread is this new thread that does this incremental computation.
I've got the skippable region, which again I tried to convince my graduate student
we didn't need, but he finally convinced me we did.
One of the things it does is create this implicit barrier. You know, as a
performance optimization you might be able to find some cases where you
don't need the barrier, but we just use a very traditional barrier there.
And we've got this fail-safe for when, you know, something goes
wrong with the data-triggered threads. And also it allows us -- it's not something
we've done yet, at least in the work I'm talking about in this talk -- but again it
allows you to turn off data-triggered threads, which sometimes is useful
dynamically.
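A hedged sketch of how those three pieces could fit together; tt_wait_all(), tt_all_done(), tt_abort(), and batch_recompute() are hypothetical stand-ins for the runtime and hardware support described here.

```c
#include <stdbool.h>
#include <stddef.h>

extern bool tt_all_done(void);      /* did every triggered thread finish?      */
extern void tt_abort(void);         /* explicit abort from inside a thread     */
extern void tt_wait_all(void);      /* the implicit barrier                    */
extern void batch_recompute(void);  /* the original (skippable) computation    */

/* Inside a data-triggered thread: bail out if we touch data we didn't expect. */
void example_dtt(void *addr)
{
    if (addr == NULL) {             /* stand-in for "unexpected data"          */
        tt_abort();
        return;
    }
    /* ... incremental update based on addr ... */
}

/* At the skippable region in the main thread. */
void skippable_region(void)
{
    tt_wait_all();                  /* barrier: all updates visible in memory  */
    if (!tt_all_done())             /* a thread aborted, or never found a core */
        batch_recompute();          /* so fall back to the original batch code */
}
```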
And then we've got these data triggers, where we attach the thread with
a pragma to actual variables in the program. And there are two ways in which
we can do that. The obvious one is just attaching it to some
variable, which is the example I already gave, but where it actually turns out to be
really valuable -- and probably 80 percent of the time when things are really
working well, we're using this form -- is when we're actually attaching it to a field of a
structure. Right?
And the beauty here is that if I attach this to the whole
variable, then if any of these values change, I trigger threads when maybe I didn't
really want to. But by attaching it to a single field instead, all this other stuff can
change all the time, but if this particular field doesn't change, then I don't spawn
the data-triggered threads. All right? And this is what allows me to track
these dynamic structures.
And so here's the classic example; this is actually, I think, AMMP
from the SPEC benchmarks. What's going on here is that AMMP -- and we
find MCF doing something real similar -- spends a lot of time and a lot of
computation just calculating metadata over these large structures. And so the
structures themselves are changing constantly -- the values in them -- but the
structure itself is almost never changing, right?
And so if I just attach a data-triggered thread to this next field,
right, then as long as the structure of this -- in this case, a linked list -- doesn't change,
I'm never spawning a data-triggered thread.
Now, what I'm doing here is coarse-grained. So for instance, if it's a
linked list and I change next, it's not clear whether I'm pointing at something that,
you know, now points to, you know, dead data or whatever. So this is more
coarse-grained. And so what I'm doing in the AMMP case is I'm
creating a data-triggered thread that is just looking to see if the entire structure of
this linked list changes. Because it turns out it almost never does. And that's how
I'm using the data-triggered thread in that case.
And so it would be a little bit harder to just follow this next field, you know, if I'm
moving things around in this linked list. I think there might be ways to do it, but
pretty much all the cases where we really found data-triggered threads to be
effective, we were looking at more coarse-grained threads.
All right. So again, I could do this just by attaching to the next field, and I could
ignore the fact that all these other fields are changing all the time. Okay.
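A hedged sketch of attaching a thread to a single structure field; the struct layout and the pragma spelling are illustrative, loosely modeled on the AMMP-style linked list just described.

```c
#include <stddef.h>

struct atom {
    double pos[3], force[3];   /* values change constantly: no trigger here             */
    struct atom *next;         /* #pragma dtt trigger(field) thread(rebuild_metadata)   */
};

struct atom *list_head;

/* Coarse-grained data-triggered thread: rebuild the metadata over the whole
 * list.  It only runs when the list is relinked, which is rare. */
void rebuild_metadata(void *addr)
{
    (void)addr;
    for (struct atom *a = list_head; a != NULL; a = a->next) {
        /* walk the list and recompute whatever summary the program keeps */
    }
}
```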
So we made a lot of policy decisions, you know, I think most of them were pretty
well thought out, but we're not making any -- any strong arguments that these are
absolutely the best decisions. But they're ones we sort of made on the first pass
of thinking through this programming model. And so some of the things that we
did -- so attaching data-triggered threads to structure fields was critical. It turns
out our initial model of sort of how to support this in hardware didn't support this,
so we had to rethink that. But we had to do that because this was actually what
really made data-triggered threads work. We placed some constraints on
data-triggered threads. I wish we didn't have to do this. But actually this made
things a lot more straightforward, and this was really required.
So, the no-argument thing I've referred to a couple of times. The problem with
allowing arguments is that, you know, you spawn this data-triggered thread and
you have some arguments that you pull out of the main thread, and then you get
to the skippable region and you find that the data-triggered thread executed with
different arguments than what you have now, and, you know, what do you do with
that -- do you keep the computation, do you throw it away? And it looks a lot like
speculative multithreading all over again, with all the problems of speculative
multithreading and versioning caches and all this stuff that we didn't want, so we
just said no arguments, and that completely solved this problem.
So now data-triggered threads are not speculative in any way, at least in the
sense of, you know, speculative multithreading. And they're restartable because
we want to have the ability to abort threads, and we can't have them
accumulate state and then die and start up again, die, start up again, and then
have these false values.
And so this is a bit of a burden on the programmer, something that will have to be
thought about. But I think compared to a lot of things that you'll find in parallel
programming models, it's not completely onerous.
>>: [inaudible] it's like a closure but it doesn't -- you can't refer to anything in
your surrounding environment. So is that --
>> Dean Tullsen: No. That is not true. So we're -- we're making tons of
changes to memory, but they have to be such that -- so for instance, you can compute
a value, you can write it to memory, but as long as, you know, you restart the
thread and it still computes the right value --
>>: I thought it was more that the reads from the environment [inaudible] thread, so
the data-triggered thread its initial state cannot have values that were computed
by -- it can read memory.
>> Dean Tullsen: Right, right, right. It can read those explicitly, right, but it's not
passed those, yes.
>>: But it's not [inaudible].
>> Dean Tullsen: Yes. That's right.
>>: The notion of the caching is sort of -- that's a programmer problem.
>> Dean Tullsen: Yeah.
>>: Right? Not a hardware problem?
>> Dean Tullsen: Right. Yes. Thank you. All right. So this restartable thing is
something the programmer will have to think about a little bit. But, you know,
maybe you could sort of have the compiler, you know, basically raise an error
and say, you know, tell him he's doing something wrong. Okay.
So those are some of the programming model policy decisions we made. Some of the
things we had to think about architecturally: how to track these
addresses. You know, it started with this notion that we would have some
structure that would track memory addresses. But as soon as we made
this decision to support fields of structures -- in some cases we have these
multi-gigabyte structures where we're just tracking one field, which means that
we have, you know, thousands and thousands of these specific
addresses we want to track, right? So there's no way we're going to create a
hardware table to do that. If we did, it's going to
be bigger than the cache.
And so instead we just do this with ISA support. We just say, here's a new type of
store; if that store changes memory, we generate a thread. And so now we can
track these huge data structures with, you know, sometimes just a single
store. And so --
>>: [inaudible].
>> Dean Tullsen: What's that?
>>: [inaudible].
>> Dean Tullsen: It's like a [inaudible] right. So again, this is -- you know if you
just look at the hardware implementation, there's not a lot of novelty here. But
the novelty is in the programming model, okay.
Okay. So again, we've got a store and a fork -- you're absolutely right. Because we
can't embed enough information for the thread spawn in the store, we've
actually got two consecutive instructions, and so you are absolutely right.
And so we've got this tstore now, and a new instruction after it that will tell
me where to get the information for the thread. And again, you know, sometimes
this tstore is going to execute and it's not going to change memory. And
again, this is hardware support that's very similar to what we saw with
the silent store work. And it's not that onerous, because again, unless you
actually allow, you know, word writes to your cache, most of the time a cache
access is a read-modify-write.
And so actually doing this should have very little overhead.
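For concreteness, here is C pseudocode for roughly what a tstore/tspawn pair does, as I understand the description above; it is a sketch of the semantics, not a real ISA definition, and enqueue_dtt() and the thread queue layout are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

struct tq_entry {
    void (*thread_pc)(void *addr);          /* entry point of the DTT       */
};
extern struct tq_entry thread_queue[];      /* small and static             */
extern void enqueue_dtt(void (*pc)(void *), void *addr);   /* runtime/HW    */

/* tstore: an ordinary store plus a silent-store check.  The L1 cache already
 * does a read-modify-write on most accesses, so the comparison is nearly free. */
bool tstore(uint64_t *addr, uint64_t value)
{
    bool changed = (*addr != value);
    *addr = value;
    return changed;                         /* feeds the following tspawn   */
}

/* tspawn: the instruction that follows a tstore; it only fires if the tstore
 * actually changed memory. */
void tspawn(bool changed, int tq_index, void *addr)
{
    if (changed)
        enqueue_dtt(thread_queue[tq_index].thread_pc, addr);
}
```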
>>: [inaudible].
>> Dean Tullsen: I would think so, yeah.
>>: But it is atomic.
>> Dean Tullsen: In this case, this is going to be atomic, yeah. Okay. And so a
lot of times you don't even execute the tstore, in which case you don't generate
the data-triggered threads but even when you do execute it sometimes you don't
generate the data-triggered thread.
So we've got a couple of hardware structures. Thread queue is a static structure
that sort of knows about -- knows, you know, what thread this tspawn should
generate. And in our case, this is really small because we have very few static
data-triggered threads. And then this is a dynamic structure that keeps track of
the active data-triggered threads, and in reality we actually have very few of those
active, so this is actually a very small structure as well. But this one is dynamic;
this one's static.
So again, this is used in the thread queue. We spawn a data-triggered thread; we
do allow it to have a return value, which gets stored in the thread status table.
And so when A completes -- yes?
>>: Back to the consistency model, does this tstore, tspawn have the implicit
barrier? Can you jump -- does that mean -- yeah, you do a range modifier right
but typically you have to wait and I keep going, it misses.
>> Dean Tullsen: That's a good question. Did we think -- I believe it does, yeah.
I believe you would have to be careful there. I can't remember what we
implemented, but I believe you're right. Yeah. Yeah.
Okay. So, all right. So then, you know, the skippable region is skipped, but
you still have to wait for the return value to show up in the thread
status table, and then you complete. But again, the beauty of this is that sometimes
you get here and the value in the thread status table is valid even though you
never executed a data-triggered thread, and you just continue on with the
region in C. Okay?
I'm not going to talk about the hardware structures in detail. They're just really small,
and these are trivially small.
There's no value storage or anything like that, like you get in other
execution models that try to take advantage of redundancy. I'll talk about that
a little bit.
So let's -- let's look at some results. Methodology, we're running SPEC
benchmarks. Again, we're -- it's programming model stuff, but we're architects,
so we're running SPEC benchmarks, and we're not writing new code, but we're
modifying existing very mature code so that we can -- we can -- you know, show
the benefits you get from applying this technique.
The data-triggered threads tend to be fairly small. And the actual changes to the
source code are also very small. I want to be careful here, because the
intellectual effort from the grad student that went into this was
fairly high. But when it was all done, the actual changes he made to the source
code were actually very, very small -- just a few lines, usually.
Okay. So here's what we get. We're running with SMT, so it's very easy to
model both a multi-core CMP and SMT threads, the difference being that
with CMP you get no competition for execution resources and with SMT you do. And
so in the best case we're getting close to a 6X speedup, and on average kind of more
than one and a half -- close to one and a half -- if you look at the arithmetic mean.
With the harmonic mean, which really biases towards the low values, you know,
you still get close to 20 percent speedup with CMP. Now, you lose some of that
with SMT, and we'll talk about why in just a second. Yeah?
>>: Given the chart you showed at the beginning about how much redundant
computation there was in each of these, I'm surprised that these numbers are so
low.
>> Dean Tullsen: So there's -- there's a couple points here, right? There's a ton
of redundant computation that goes on. We have sort of scratched the surface in
terms of how to exploit that. This is one model that allows you to exploit this in
fairly large chunks. Again because we haven't completely solved, although we're
working on this with other research, the problem of creating short threads
efficiently. Again, we're doing a lot of coarse grained things, when I'd like to be
doing more fine-grained things, which limits the amount of redundant
computation that we can exploit. And so what I'm going to say is the redundant
computation and redundant loads is a phenomenon that we are somewhat taking
advantage of. But most of it is still out there. Does that make sense? That's a
good question.
>>: As a baseline, I'd really like to know if you had just done the equivalent
without having to change anything architecturally, so you just change the code so
that it does a load before the store, checks to see if anything's different, and then
spawns that little bit of code if something changed. How much of a benefit would
you have gotten with absolutely no parallel -- additional parallelism?
>> Dean Tullsen: So you're meaning if you just programmed it differently?
>>: Well, you're reprogramming it to take advantage of additional hardware, but
you could have just reprogrammed it and not put in the additional hardware. So
without any extra -- without those extra hardware structures, are they buying you
anything, or are you just getting this because you've changed the --
>> Dean Tullsen: Yeah, we --
>>: The code? And I don't know that for [inaudible].
>> Dean Tullsen: Right, right. We don't have that. But the point is, what this
buys you is this: if you want
to track all the changes -- you know, go back to my matrix add example -- if you
want to track all the changes to these, you've got to put conditionals in the middle
of all of your tight loops, right? And all of a sudden this code gets a lot slower.
And so this allows you to do it without messing with that, right? Most of the original
computation stays the same, without sticking in conditionals and extra flags and
extra large data structures to track all these changes.
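For comparison, a sketch of the all-software alternative being discussed, with the per-store conditional and the extra tracking state written out explicitly; the names here are invented for illustration.

```c
#define N 1024

double A[N][N], B[N][N], C[N][N];
unsigned char dirty[N][N];                 /* extra tracking state            */

/* Every update to A now pays for a conditional (and the dirty array). */
static inline void set_A(int i, int j, double v)
{
    if (A[i][j] != v) {                    /* the per-store check             */
        A[i][j] = v;
        dirty[i][j] = 1;
    }
}

/* Recompute only the elements whose inputs changed. */
void recompute_changed(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (dirty[i][j]) {
                C[i][j] = A[i][j] + B[i][j];
                dirty[i][j] = 0;
            }
}
```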
>>: [inaudible] traditional compiler just take care of that? Right? You're putting
conditionals in large loops, but you're going to unroll those loops. Your compiler
will take care of that, and presumably existing compiler technology takes care of
that.
>> Dean Tullsen: But the conditional doesn't go away because you still have to
check on every access. Right? So the cost of that conditional is still there.
Right? And so you're still expanding the cost of your inner loop significantly.
>>: But are you going to even notice that given that the loads and stores are
probably what's killing you here?
>> Dean Tullsen: Possibly.
>>: Right. So if the loads and stores are what's killing you here, a few extra
instructions aren't doing anything. What you've really done is you've just saved
the need to go through this twice and a few extra conditionals are probably
negligible.
>> Dean Tullsen: So here's the issue. You could sort of make a case that you
could program this way, right?
>>: But you're already --
>> Dean Tullsen: But we're executing mature code that is not programmed this
way.
>>: [inaudible] programmed this way. I just want to know the benefit
you got just from forcing the user to change their model versus the extra
hardware. If it's all from just forcing the user to change their model, then great,
we don't need the extra hardware, you can get this today. If the extra
hardware really matters, then I'd like to know that too.
>> Dean Tullsen: Yeah. So I mean, we're looking -- to give you a [inaudible], we
are looking at, you know, a full software version of this, right? But it's still,
you know, the same programming model, just, you know, without any
hardware support. And, you know, in some cases, and I could show you where,
you know, as we move a little further in this graph, because, you know,
sometimes we're getting advantages from redundancy, in which case the software
overhead doesn't matter; in some cases we're getting, you know, benefits
from parallelism, in which case the software overhead does. But again, that's work
we're kind of in the middle of right now.
>>: You just answered that [inaudible]. So do you know right now what fraction
of this benefit can be attributed to starting the thread early versus just getting rid
of redundant computations?
>> Dean Tullsen: Yeah. Let me sort of move forward. We've got those graphs.
And so here are the applications where we're getting advantage from
redundancy. One of the ways you can see that -- and
I've got another graph where you can see it even more clearly -- is
that I get the same benefit with CMP as I do with SMT,
right, so the fact that I'm competing for resources doesn't matter, because I'm not
actually executing this other code.
But where I'm getting advantage from parallelism, I hate to even show these
graphs, but actually SMT doesn't do nearly as well as CMP because there's more
competition for these execution resources. And that's these.
And then there are the others, where we don't get speedup. And so a better way to see
this is the execution time breakdown. So if you look at MCF, you know, the main
thread is running here, never running data-triggered threads, but I'm also
skipping all the computation that I've moved to the data-triggered threads. And
so I'm getting all my speedup just by not executing this stuff. Right? I can come
over here, where, if you sort of, you know, double this, you can see that I'm
actually executing more instructions, but I'm getting speedup because I get
a lot more parallelism.
And keep in mind that I'm running the SPEC benchmarks, and so there's very
little thread-level parallelism here anyway. And so we're really scratching when
we do find some -- although these are cases where people have successfully
found some parallelism.
>>: [inaudible] speedup to the extra energy ratio; in other words, are you
running more efficiently or less efficiently as a result of this [inaudible].
>> Dean Tullsen: Well, here --
>>: Well, MCF is an outlier, so for the rest [inaudible].
>> Dean Tullsen: It's not an outlier. It's an outlier by degree. But, I mean,
there's a lot of cases where we're running a lot fewer instructions, right? And so
there's a huge gain in these cases for energy, right? I mean, it's, you know, 6X.
Well, 6X at least, you know, improvement in energy efficiency.
And so the benefits we get from redundancy all, you know, go straight to
energy, right? The parallelism -- you can just think of this
as traditional parallelism, right? You know, the energy benefits you get from
either going multi-core or SMT, which is a little or a lot, you know, really
translate pretty directly here.
All right. All right. So, the three interesting cases: here, we're just eliminating a lot of
redundancy. Here, more instructions, but speedup from parallelism, so this sort of
fits a pretty traditional model. The interesting case here is where we actually have
data-triggered threads, and they execute sometimes, but they never execute in parallel
with the main thread. There's just not enough slack to spawn them
early enough, but because they're redundant often enough, we still get speedup.
All right. This is less interesting. The sensitivity to spawning latency is actually
pretty predictable: if we don't actually spawn the threads, we don't care how long
it takes to spawn them; in some cases we do. One of the interesting things that
happened here is that we really did speed up for a couple of reasons. The first
one was really expected: in most of the cases we're doing less computation,
and we're doing a lot fewer loads.
But even those loads that we're doing actually get better cache hit rates. And the
reason for that is that a lot of times, by surrounding these very large data structures
with these data-triggered threads, we're not bringing that data, you know, into the cache
and pushing other stuff out. So overall [inaudible] doing fewer loads, and
those loads are hitting more often in the caches.
Okay. So people look at all kinds of different mechanisms for taking advantage
of redundancy. Their models of redundancy were different than ours and so, you
know, if you asked them how much redundancy there is, you get very different
numbers because it's defined differently.
But the big difference, you know, between our technique and all these others,
value prediction, instruction reuse, memoization is that they all have
mechanisms, whether they're in hardware or software, that detect sameness.
Right? And so if you want to know that this structure is the same as it was
before, you actually have to store that structure somewhere. Right?
And so, you know, in our case, though, we're
applying data-triggered threads to multi-gigabyte structures with no hardware
storage. And the reason we can do that is because we're detecting changes, not
sameness, so we need no storage to actually detect sameness, right? And so if
we go through this multi-gigabyte structure and never change it, you
know, without any storage we can detect that -- we know that it hasn't
changed. Right?
And so our -- you know, again, our hardware structures are tiny. All right. Okay.
So that's everything I was going to say.
The parallelism crisis: we really need new compilers, new execution models, new
programming models.
We find tons of redundant computation in the C code that we measured.
We created this data-triggered threads programming and execution model, which
attacks redundant computation. It can also expose parallelism; again, because
of the code that we are attacking and because of the way that we, you know, profiled
the code to create these threads, we put a lot more emphasis on the
redundant computation, but there is certainly some potential there that we'll
probably look at more in the future. Very little architectural support: a little bit of
ISA, some small hardware structures, and minor changes to source code, at least
in terms of, you know, line counts and things like that.
And, you know, in the best case we get really large speedups. Yeah?
>>: If the layout of data-triggered memories are occurring, how do you avoid
storing one extra bit per memory word to tell you this word is booby trapped?
>> Dean Tullsen: So, if you think of this extra bit, it actually
appears in the code instead of in memory, right? And so I have an extra
instruction -- I have a new instruction in my ISA. It's a tstore, and so, you
know, if that tstore instruction changes memory, I know I spawn
data-triggered threads; otherwise I don't. So I don't need any bits in memory.
I've just got a new type of store instruction that watches it for me.
>>: Somehow there's got to be a structure of which way they're booby trapped
between the tstore as [inaudible].
>>: [inaudible].
>>: So maybe this is something I didn't [inaudible] I saw the tstore, but where did
it actually register, I want to run this code with that. [inaudible].
>> Dean Tullsen: So --
>>: [inaudible].
>> Dean Tullsen: So that was the tspawn instruction that came after, right? So
[inaudible].
>>: [inaudible] hidden address [inaudible].
>> Dean Tullsen: Yeah. So that was the thread queue, right. So I'll set up this
thread queue at the beginning of my execution; it aligns a tspawn
instruction with a data-triggered thread's program counter, right? And that's
enough that when I see the tstore followed by the tspawn, and the
tstore changes memory, the tspawn then causes a lookup in the thread queue,
which pulls the program counter out of there and starts a thread.
>>: It seems like you could get really far towards this with existing ISAs. Maybe a
different level of overhead. But very, very far.
>> Dean Tullsen: Yeah, yeah. Pretty close. The, you know, one thing that you won't
find in any ISA that I know of is something that allows you to do this silent store
check. And that's really the part of the tstore that I don't think we can find in a
current ISA. Most of them have the rest of the stuff, yeah.
>>: It's a kind of silly question, but why do you have a static thread queue and not
just put the address of the thread you want to spawn in the immediate
[inaudible].
>> Dean Tullsen: That's a good question. That's a good question. Boy, I'd have
to look -- I'd have to look that up.
>>: [inaudible].
>> Dean Tullsen: If we went back, we could look at what's in this table, but
without Hung-Wei here I would have to actually try to figure it out myself. Maybe
I won't do that. I'm trying to think whether it's a residual of the time when
we actually had arguments, because there was some information about
arguments in there, and we eliminated that. What I don't know is
whether, now that we don't have arguments, we could completely eliminate the thread
queue. I'm not sure. That's a good question. But it's actually really small, so -- as I said
multiple times. All right. Thanks a lot.
[applause]