>> Doug Burger: It's my pleasure to introduce Tor Aamodt from the University of
British Columbia. Tor has done a ton of interesting work. He's worked recently
with FPGAs and some with GPUs, done some work on branch prediction in the
past and analog circuits, and is, all in all, fairly knowledgeable on a broad
range of subjects. I always enjoy talking to him.
So he came down and is sharing his work with us. And I'm actually really excited
to hear your talk on GPUs today. So thanks for coming.
>> Tor Aamodt: Thanks, Doug. So the title of this talk is Evolving GPUs into a
Substrate for Cloud Computing. I was trying to think of how to summarize the
work I've been doing for the last little while, and this sort of popped into my head as
a title. And then after I sent this, I went, oh, wow, evolution doesn't sound that
exciting. It sounds really slow paced, like glacial or something, and, you know,
we're architects. We're supposed to think of the answer right away and come up
with something really cool.
But then, you know, I thought even more and no, this is the right answer. This is
the right title. Because evolution can be disruptive. It can be disruptive if you,
you know, start with a small market that tolerates rapid innovation, right? So of
course what I'm thinking about here is the games market, where you can pump
out a new graphics processor every few years and they'll change the hardware
ISA in each one, right? And the games market will absorb that.
You start with that as your beginning point, right? And then you make many
small changes, right, because you're in a highly competitive environment, you
have to keep innovating. Make many small changes to improve your product.
And while you're doing that, you start thinking about, okay, what else can I do
besides this small market? Start making little tweaks to solve a larger market's
problems, right?
And maybe if you can do that and achieve that at a lower cost than the
incumbent then, you know, at some point you've got what's known as
the innovator's dilemma. You just sort of start eating their lunch.
And so I kind of think that maybe GPUs are going to do this. This is, I guess, the
speculation behind my research.
So what am I going to talk about? I'm going to talk about really two bits of
research we've been doing. One recently published on transactional memory for
GPUs. And this was published at MICRO, which is a computer
architecture conference. And it was also selected for IEEE Micro's Top Picks this
year.
And then I'll talk about some software work that we've been doing. And this is using
an AMD Fusion APU. This was taking a server-type application, Memcached,
which is something you wouldn't necessarily think would run well on a GPU. I think
most people's first intuition is, why are you doing that? And then we showed that this
actually can run pretty well on a GPU. So I'll talk a bit about that as well.
>>: So are you going to say who the GPUs are going to disrupt?
>> Tor Aamodt: We can have that discussion later maybe.
>>: Okay.
>> Tor Aamodt: Yeah. Yeah.
>>: Us?
>> Tor Aamodt: Could be, right? I was just trying to take a step back and say
what am I doing, right? Trying to change GPUs. I'm not trying to radically
redesign them from the ground up. And, you know, is that the right way to do it?
Okay. So I'll focus first on the transactional memory on GPU. And what's the
motivation for this? So a short story might help explain this.
So when I first got to UBC, GPUs were coming out and you could write CUDA
programs on them. And so a really enthusiastic undergrad came into my office,
wanting to do a senior project. And he wanted to do it on vision processing. He
was involved in some robotics competition and they needed to do like realtime
vision analysis.
And he thought, well, there's cool algorithms out there, but the
hardware isn't fast enough to do them in realtime, but there's these GPUs now,
people are talking about using those. So he wanted to use a GPU to accelerate this.
And I said that's great. I don't know anything about vision, so you're on your own
with that, but if you want to write some code to run on a GPU, that's great.
So he went off. He found some algorithm that was a really high-quality
algorithm and tried to get that to run on the GPU and spent all semester getting it
almost working, right? And then he ran out of time. And I was like, oh, okay, well,
that's pretty cool. You made a lot of good progress, you know, A plus to you.
Another student came along and didn't -- this was a more normal student, I
discovered: I have no idea what I want to do. I said okay, finish this project off.
All you've got to do is debug it, right? It's already there.
And so, needless to say, they couldn't get that to work. You know, four months
later the thing's not working. I said, well, maybe that was just not as good a student.
And then another student showed up and tried to, you know, again I said okay,
for sure, this is an awesome problem. And they had the same problem.
So what was it that was different about this problem compared to other GPU
problems? It required fine-grained synchronization using locks. And this turned
out to be kind of a hard problem for GPUs.
So if you look at the normal lifetime of GPU application development here, this
time here is developer time, and this Y axis is goodness of some sort. A
good application that maps onto a GPU well has a lifetime like this: some
time after not too many weeks of development, the application is running correctly
on the GPU. So this green line is saying it's not working, it's not working, and
suddenly it's getting you more or less the right answer.
Now, that doesn't mean it's running really fast. So initially performance might be
really bad. And here I'm assuming you've sort of identified like a big kernel that
you want to run. And that kernel initially may be very parallel, but
it's running very slowly on the GPU. And so you have to analyze why is it
running slowly and incrementally refine it and get performance to be good. And
then you're happy.
So that initial getting it to work gives you the hope that you can continue this
process and get somewhere. If you start looking at the kind of application that
we were looking at with this vision algorithm which involved locking, it's more like
this, okay? You're months and months and months into trying to get this code to do
anything correctly and it's not working. And you might give up here. So if you're
a company you might say, well, we sunk way too much money into this bad idea.
Let's just go and do something else, all right?
Or maybe, you know, you reach this point where it suddenly starts to work and
you figured out the code enough, but then you encounter this bad performance.
Okay. So then when your manager sees that they go okay, forget it, okay. You
spent months and months and months. You finally got this thing to work but the
performance is like 3X slower than a CPU. What are we doing here, right?
So the hope of this transactional memory business on GPU is to -- okay, before I
go into that, why would you want to use fine-grained locking on a GPU? So here
is just an example. The N-body problem. Okay?
So if you do like the simplest order N squared algorithm on a GPU, a lot of
parallelism, and you do this problem with five million interacting particles and you
run it for some number of time steps it will take say half an hour, okay? There's a
very simple software implementation.
Now if you do a more clever implementation, but one requiring something like
fine-grained locking, you can get this down to O(N log N), which is much more
computationally efficient, and so it's running orders of magnitude faster. So
that's probably the reason why you want to do fine-grained locking: to get some
better goodness in the algorithm.
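To make "fine-grained locking on a GPU" concrete, here is a rough CUDA sketch of the kind of code this implies. This is not the student's code or the talk's example; the lock array and tree-update structure are hypothetical, and the comments point out exactly why this style is risky under SIMT execution.

    // Illustrative sketch (not from the talk): per-node fine-grained locking in a
    // CUDA kernel using atomicCAS. The lock array and node/body arrays are
    // hypothetical. Note the hazard: if a thread holding a lock and a thread
    // spinning on it end up in the same warp, naive spinning can livelock under
    // SIMT lockstep execution -- which is exactly why this style of code is hard
    // to get working on a GPU.
    __global__ void update_nodes(int *locks, float *node_mass, const float *body_mass,
                                 const int *target_node, int n_bodies)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_bodies) return;

        int node = target_node[i];          // node this body wants to update
        bool done = false;
        while (!done) {
            // try to acquire the per-node lock (0 = free, 1 = held)
            if (atomicCAS(&locks[node], 0, 1) == 0) {
                node_mass[node] += body_mass[i];   // critical section
                __threadfence();                   // make the update visible
                atomicExch(&locks[node], 0);       // release the lock
                done = true;
            }
            // otherwise retry; a robust version needs care about intra-warp
            // divergence so the lock holder is not starved by its own warp
        }
    }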
But you have this problem as a software developer. Can I even get it to work?
Okay.
>>: [inaudible].
>> Tor Aamodt: Yes. Sure.
>>: Why is the shape of the red curve --
>> Tor Aamodt: Yeah. We can debate that, right.
>>: I'm not disagreeing, I'm just curious.
>> Tor Aamodt: Yeah. So this isn't measured data, first of all, right, this is my
intuition. If I have identified a kernel, so I've done some profiling and I
said this is a parallel kernel and I go run it, then each change I make to it is
going to maybe give me a factor of two or something improvement. So it's like
these factors of two multiplying together to give me something like that. That's
my intuition about how it works. Each one of those bottlenecks.
>>: You're knocking down Amdahl bottlenecks one after the other. The last one
you [inaudible] okay.
>> Tor Aamodt: All right. So let's go back to this region here where the
manager's freaking out about all the time and no results. So we want to shrink
that time to be small like it was for the applications that sort of naturally fit on a
GPU. So we're hoping that we can get the code to work. Now, we're not saying
it works really fast. We're just saying it gets you the right answer. We want to
get that within, you know, weeks or so, so that we've got something
happening on the GPU.
And of course at that point performance will stink and we'll have to go through
this whole performance tuning process. And of course transactional memory
costs something. So we may not actually ever achieve the same peak
performance we had before. But the hypothesis, or the assertion, is that this is a
much better scenario from a business perspective, where I'm going to get
some, you know, good level of performance in some reasonable amount of time
with some reasonable amount of risk, so that I can actually be successful in what
I'm trying to do, whatever it is I'm trying to do here.
Okay. So before I go more into this I want to just talk about what is a GPU and
quickly review what is transactional memory. So what do I mean by GPU?
So to us it's an NVIDIA- or AMD-like compute accelerator: SIMD hardware and an
aggressive memory subsystem. And so it gives us, we believe, high compute
throughput and efficiency. At least if you have the right problem.
And we're talking about nongraphics here. So OpenCL, DirectCompute, CUDA,
C++ AMP, whatever you want to think of there. The programming model is a
hierarchy of scalar threads. Okay. So what does that look like? So I've got a little picture here.
Another thing, in today's GPUs we have limited communication and
synchronization. That's not to say there's none, but it's limited. So a picture of
this in today's terminology using OpenCL terminology, you've got a kernel.
Inside of it you've got a work group. And this is some number of scalar work
items. And then the hardware from AMD or NVIDIA, those would be grouped
together into things called wavefronts or warps.
So inside here the smallest unit would be a work item in OpenCL; you can think
of it as a scalar thread. The hardware will group some number of them, nominally
32 let's say, into a unit of lockstep
execution called a wavefront or a warp. Then there might be up to about a
thousand work items in something called a work group or a thread block. And a
kernel may have thousands of thread blocks, all starting off running the same code.
Okay. So they can communicate. There's some local shared memory. This
might be 64 kilobytes. Everything within one of these work groups or thread
blocks, so roughly a thousand threads, can talk through this roughly 64-kilobyte
scratchpad memory, you can think of it as.
They can perform a barrier operation very quickly across all the threads
within a workgroup. If I want to synchronize across threads in different
workgroups, then I have to do something through global memory using some
sort of atomics.
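As a minimal sketch of those three mechanisms, here is a small CUDA kernel (the talk uses OpenCL terminology; __shared__ corresponds to OpenCL local memory and __syncthreads() to barrier(CLK_LOCAL_MEM_FENCE)). It assumes the kernel is launched with 256 threads per block.

    // Per-block scratchpad memory, a fast intra-block barrier, and a global
    // atomic for communicating across thread blocks.
    __global__ void block_sum(const int *in, int *global_total, int n)
    {
        __shared__ int scratch[256];              // scratchpad shared by the block
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        scratch[tid] = (gid < n) ? in[gid] : 0;
        __syncthreads();                          // barrier across the work group

        // tree reduction within the block, using the scratchpad
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) scratch[tid] += scratch[tid + stride];
            __syncthreads();
        }

        // cross-workgroup communication has to go through global memory atomics
        if (tid == 0) atomicAdd(global_total, scratch[0]);
    }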
Okay. So what are our baseline assumptions about the hardware? So here is
sort of a schematic of what we think of as a GPU. And basically all of this stuff would
be inside of the GPU package, right? And then there's some off-chip DRAM. So
we've got a number of SIMT cores, which I'll explain in a bit more detail in a
second. That's where those work items and workgroups are running.
When they do memory requests, they talk through an interconnection network.
This is an on-chip interconnection network to a bunch of memory partitions where
we'll have last level cache bank, memory controller talking to off-chip DRAM,
some sort of atomic operation unit.
Okay. So inside one of these SIMT cores we've got some sort of front end. This
is like the pipeline of a processor. There's a front end: fetch, decode, schedule,
handling branches. And there's this thing called the SIMT unit, where SIMT is single
instruction, multiple thread. And then the rest of the pipeline is like a
traditional SIMD datapath, where you've got your function units operating
in parallel.
So I'll talk a bit more about this SIMT thing and what that means in a second
here. Okay. So what is this SIMT, single instruction multiple thread? It really
goes back to a paper from SIGGRAPH '84. This was done at Pixar initially for a
processor called CHAP. And it's a way to take a bunch of scalar threads and
map them on to SIMD hardware. Okay?
So let's start off here. We have a bunch of threads that are going to run
through this control flow graph here. So you've got four threads. And these
squares here represent basic blocks in a program. The edges here are control
flow transfers between them. And the question is, okay, how can I
actually run this stuff on this hardware in parallel?
If everyone follows the same path, this isn't going
to be too hard. If everyone wants to follow a different path, it will be hard. Okay?
So let's start off as basic block A. Everyone is -- all four threads here are
executing basic block A. When I get to the end of basic block A, everyone goes
to basic block B, right? Pretty straightforward. What happens if at the end of
basic block B some threads want to go to C and some want to go to D? Okay.
So in this model, what happens is when that happens it's called branch
divergence. What we're going to do over here is we're going to push a couple
entries on to a stack. So this stack tracks the control flow state of this group of
threads, which would be a warp or a wavefront with NVIDIA or AMD terminology.
So what we can see is, if I just go back a step, when I was executing at B, I had
one entry on the stack and I had an active mask where all the threads were active,
and the next PC was just saying to execute B. When I get to the end here, I'm
going to push two entries onto the stack, one for each of the targets of this
branch, okay?
So push those onto the stack. The threads that are going one direction
will get one entry. The threads that are going the other direction will get a different
entry. And their active masks will be shown here. Okay?
Then I start executing whatever's on the top of the stack. So in this case, the
next thing on the stack is saying I want to execute basic block C. So execute
that. And I can get the active mask from here and just go ahead and execute
that.
When I get to the end of this branch path, we basically detect that I'm at
the end when I reach the reconvergence point here, this basic block E. When we
diverge at B, we know, because we can do some control flow analysis on this graph,
that the immediate post-dominator of the branch, basically the
join point after this branch, is down here at E. We'll track that when we diverge, so that
when we see that the next PC of the currently executing top of the stack
matches this reconvergence point, we'll know that this is the point where we want
to pop this entry off the stack.
Okay. So we pop this entry off the stack because the PCs match. And now the
next entry on the stack is revealed, and that's executing at D. So now we
are automatically executing at D. The same thing's going to happen here. D is going
to eventually reach E, and we're going to pop this entry off the stack.
When we pushed these two entries on the stack -- I forgot to mention that we updated the
entry below them to have the reconvergence PC as its next PC. So
when we finally pop that last entry off the stack, we continue execution,
reconverged again at E. And we can continue from that point forward.
Okay. So we can go around and do this again. So this type of mechanism
can handle arbitrary control flow. It can be irreducible control flow, et cetera. At
least the mechanism I just described here, which is also described in more
detail in a MICRO paper from a few years back, can handle arbitrary
control flow. So this is what I mean by a GPU in this talk.
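To make the reconvergence-stack mechanism more concrete, here is a small software sketch of the per-warp structure just described. This is a model of the idea, not the hardware: each entry holds a next PC, a reconvergence PC, and an active mask; divergence pushes two entries, and an entry is popped when its next PC reaches its reconvergence PC.

    #include <cstdint>
    #include <vector>

    struct StackEntry {
        uint32_t next_pc;      // PC this group of threads executes next
        uint32_t reconv_pc;    // pop this entry when next_pc reaches this PC
        uint32_t active_mask;  // which lanes of the warp are active
    };

    struct SimtStack {
        std::vector<StackEntry> s;

        SimtStack(uint32_t start_pc, uint32_t full_mask) {
            s.push_back({start_pc, UINT32_MAX, full_mask});   // base entry for the warp
        }

        // Called when the warp branches. taken_mask covers the active lanes that
        // take the branch; reconv_pc is the immediate post-dominator of the branch.
        void diverge(uint32_t taken_pc, uint32_t fallthru_pc,
                     uint32_t taken_mask, uint32_t reconv_pc) {
            uint32_t active = s.back().active_mask;
            s.back().next_pc = reconv_pc;                     // resume here after reconvergence
            uint32_t not_taken = active & ~taken_mask;
            if (not_taken)          s.push_back({fallthru_pc, reconv_pc, not_taken});
            if (taken_mask & active) s.push_back({taken_pc,   reconv_pc, taken_mask & active});
        }

        // Called each time the top-of-stack entry advances to a new PC.
        void advance(uint32_t new_pc) {
            s.back().next_pc = new_pc;
            if (new_pc == s.back().reconv_pc) s.pop_back();   // reached the join point
        }
    };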
Next I'm going to just briefly talk about transactional memory. Okay? So the
idea here is the programmer specifies atomic chunks of code called transactions.
Okay. So here's some code. It's got a critical section and it's a fairly complex
piece of code, so I've got multiple locks that I need to acquire before I can enter
the critical section. And if I'm not very careful about how I do this, I could end up
in a situation where I've got a deadlock, right? So if I'm not very clever, or, yeah,
even if I am very clever but I just haven't thought it all through, I could have a
deadlock situation.
So this has been tricky. People have recognized this. And they said, wouldn't
it be great if the programmer could just write this magic keyword, atomic, put a
curly brace, put in all the code that they want to run as if no one else is running
on the system, and have that happen? Okay. So that's sort of the dream
here of transactional memory.
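Here is a small sketch of that contrast, using a bank-transfer-style example like the ATM benchmark mentioned later in the talk. The lock-based version only avoids deadlock because it acquires locks in a fixed global order; the transactional version just declares the atomic region. tx_begin()/tx_commit() are placeholders for whatever TM interface the hardware or runtime exposes, not a specific vendor API.

    #include <mutex>

    extern void tx_begin();    // placeholder transactional interface
    extern void tx_commit();

    struct Account { int id; int balance; std::mutex mtx; };

    // Lock-based: correctness depends on acquiring locks in a consistent order.
    void transfer_locks(Account &a, Account &b, int amount) {
        Account &first  = (a.id < b.id) ? a : b;   // fixed order avoids deadlock
        Account &second = (a.id < b.id) ? b : a;
        std::lock_guard<std::mutex> l1(first.mtx);
        std::lock_guard<std::mutex> l2(second.mtx);
        a.balance -= amount;
        b.balance += amount;
    }

    // Transactional: the programmer just declares the atomic region.
    void transfer_tm(Account &a, Account &b, int amount) {
        tx_begin();
        a.balance -= amount;
        b.balance += amount;
        tx_commit();   // validated and committed, or aborted and retried
    }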
So how is it implemented? So first of all, the programmer's view is, if I have two
transactions running, one or the other happens at any given time. So here are two
transactions, TX1 and TX2. The programmer sees either TX1 happen before TX2
or TX2 happen before TX1. They didn't happen at the same time.
Of course if the hardware actually did this, it would be horribly slow, right, it
wouldn't give us a lot of parallelism. So in reality what the runtime system or the
hardware is going to do is run these things in parallel. So if these two
transactions are nonconflicting, meaning they access different memory locations
like you can see in this example, then they can run in parallel.
Of course if that's not true, if there's a conflict between them, then
we have to do something to repair the situation. So here I've got transaction 1
writing to B, and transaction 2 reading B. You know, if I want to be really safe
about this, I'm going to abort one of these things and rerun it. So in this case I'll
decide to abort transaction 2 and restart it with the updated value of B, so that I
get the situation where transaction 1 happened before transaction 2.
So that's transactional memory in a nutshell. So just to summarize, each
transaction has three phases. There's execution: in that
phase, we're in the transaction and we're tracking all the memory accesses into
what's called a read-set and a write-set.
Then we're going to validate the transaction. We want to detect any
conflicting accesses and resolve them somehow, for example, by stalling or aborting
the transaction.
And then we want to commit: if there was no conflict, we want to update
global memory. Okay.
So the question then becomes: are transactional memory and
GPUs compatible or not? And we believe the answer is that they
are compatible, although it's not at all obvious that they would be. So what I'm
going to talk about over the next few slides are all the problems that you encounter
when you try to take transactional memory and implement it on a GPU. And
there's several of them.
And what is our solution to that problem? Okay. So well, okay, so where are the
problems? How do they arise? First of all, GPUs are different from multicore
CPUs. One of the obvious differences we're talking about thousands of
concurrent threads, not tens. Okay. So what are the challenges here?
So challenge number one is it's SIMD hardware, okay? So as I was just
describing, the scalar threads are grouped into this thing called the warp or
wavefront and execute in lockstep. So here's some, you know, code,
transactional memory code written in assembly, so transaction begin. It's like the
curly brace opening up a transaction. Then some code that runs in the
transaction. And then transaction commit. And here's my group of threads that
are supposed to execute in SIMD lockstep.
You know, the obvious thing that's going to happen is, okay, they start
progressing through. And when I get to this commit part, only some of them
will succeed in the commit operation. So let's say the first three threads
were successful. They didn't have any conflicts with other transactions. So they
can commit. But T3 is unlucky, and it has to abort and retry. So we've now
introduced this control flow divergence problem.
Well, okay, so if you think back to what I was just talking about with that stack
based mechanism, there is sort of already a technique in modern GPUs for
handling this kind of control flow divergence. So the solution, which I describe in
more detail in the paper and I'm not going to have time to go into here, is to just
extend the SIMT stack to handle this problem for transactional memory. Okay.
So basically we treat this like a loop with some divergence inside of it. And we
have a very simple extension to the SIMT stack to handle it. Okay.
So a second challenge -- so I'm going to treat this one as solved, right. If you
guys want more details I can give you more details later. The second challenge
is transaction rollback, okay? So if I think about a CPU, okay, what happens
here? When I begin a transaction, I have to be able to restart it in case there is a
conflict. While I'm running in the transaction and updating registers I'm not
updating memory. So let's just focus on the registers for a moment.
When I start a transaction, a lot of research has proposed making a checkpoint of
the register file, effectively, at the beginning of the transaction, so if I have to abort
the transaction I can go back and restore the register file to the state it was in at the
beginning of the transaction and restart.
With tens of registers, okay, it's not nice, but it can be done. On a GPU there's
actually more register file than there is cache on today's GPUs. There are more
bytes of storage in the register file than in cache. So this is just not going to work,
right? Two megabytes of on-chip register file in Fermi, right? So
checkpointing that naively is just -- it's a nonstarter.
So we started studying, well, you know, what are we going to do about this?
And what my student discovered was that you don't actually
have to checkpoint a whole lot. Because mostly what we find is that registers are
overwritten at first use inside of a transaction. And there was some recent work
published on something very similar, called, I guess, idempotent points. So
this is sort of a general observation about code and it's true beyond
just transactional memory, so there's some ongoing work on using this for more
general purposes.
But here's how it applies for us. So here's an example transaction.
You can see that we're overwriting R2, right? So I overwrite R2. Let's say I
have to abort this transaction. Well, when I overwrote R2, if I were just naive
about this, I would checkpoint, or make a copy of, R2, right? But when I
restart this, the first thing I'm going to do is overwrite R2 again. So there's
really no value in checkpointing R2 at this point.
And when we looked at the kernels that we were running in this study, we
found there's only one of them that actually needed anything to be checkpointed
at all, and it only needed two registers to be checkpointed. And the fact that
other people are exploiting this property more generally, with
idempotent points, gives us some hope that this is something that we could
actually rely on.
So our solution is really just: let software checkpoint only what you need to
checkpoint. We're not going to blindly checkpoint all the registers. Okay.
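The rule behind "checkpoint only what you need" is that only registers read inside the transaction before they are overwritten (the transaction's live-ins) need to be saved; anything written first gets regenerated on retry anyway. Here is a tiny sketch of that analysis over a simplified instruction representation (my illustration, not the paper's compiler pass):

    #include <set>
    #include <vector>

    struct Instr {
        std::vector<int> srcs;  // registers read
        std::vector<int> dsts;  // registers written
    };

    // Returns the set of registers a software checkpoint must save.
    std::set<int> registers_to_checkpoint(const std::vector<Instr> &tx_body) {
        std::set<int> written, live_in;
        for (const Instr &ins : tx_body) {
            for (int r : ins.srcs)
                if (!written.count(r)) live_in.insert(r);   // read before any write
            for (int r : ins.dsts)
                written.insert(r);
        }
        return live_in;  // often empty or very small in the kernels studied
    }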
So the third challenge is conflict detection, okay. So if you look at existing
hardware transactional memory proposals, a lot of them say let's use cache
coherence protocols to detect conflicts. And the problem with that is, well, current
GPUs don't really have cache coherence. Well, Larrabee has some cache
coherence. But NVIDIA and AMD GPUs don't have cache coherence. And it's
not clear how well it's going to scale.
The other thing is, even if they have cache coherence, like say Larrabee, it's
not private per work item or per scalar thread as in the model that we have. It's a
cache shared amongst maybe hundreds of work items. And all of the hardware
transactional memory proposals that have been put out there assume that each
thread kind of has its own private cache, and the coherence interaction between
those private caches is what tells you about the conflicts.
So that idea is kind of out. Okay. What about signatures? That's another
proposal: use some sort of Bloom filter structure to detect conflicts. And that's
been shown as a way to get around having to modify the cache coherence protocol.
So we did some experiments with this, and we tried
various sizes. But, you know, I'll quote a data point here. Consider a signature
with 1024 bits per thread. That's a lot, but this is roughly the size you need. If
you use smaller than this, you have a really high what's known as false conflict
rate. Okay? So you have to scale up to this to get a reasonable conflict rate.
And I don't have data on the exact conflict rate that was, but it was something
reasonable, like 10 percent, say, false conflicts. And with the number of
threads you can have on a GPU, that's like four megabytes of storage just for
signatures. So this seems again like, wow, this is a huge amount of area to do
transactional memory.
So this looked like a showstopper to us. We weren't really sure how we were
going to solve this. So we spent a lot of time thinking, okay, how are we going to
get around this? And Wilson Fung, the student who was doing this work, had this
great idea of how to solve that. And before I tell you that great idea, I have to
also tell you challenge number 4.
Okay. The other thing is write buffering. So I talked about register files and
checkpointing register files. The other thing is, when I modify memory in a
transaction I need to buffer that up. I either need to buffer up what I'm changing
or I need to buffer up the changes I would like to make. So let's think about how
we might do that.
If I look at a GPU today and I look at one of these SIMT cores, right, on a
Fermi-style GPU there might be 1500 threads running on that SIMT core. And
the L1 data cache might be 48 kilobytes in Fermi, with 128-byte lines. So
there's about 384 cache blocks in the L1 cache on Fermi. And there's about four or
five times as many threads sharing that. So I can't really
buffer up my writes nicely in this small amount of storage. So how am I going to
handle that problem?
Okay. So these two problems, one being the conflict detection problem
and the other being this write buffering problem, we're going to try to
solve together. And the solution is value-based conflict detection.
Again, Wilson spent a lot of time scratching his head, figuring out how we were
going to solve this. And there is prior work on using this notion of value-based
conflict detection. It was really in sort of the software
transactional memory world where this was proposed initially. We realized it
actually makes a lot of sense for what we're trying to do here as well.
So let me just go through this example to explain what I mean by value-based
conflict detection.
So here I've got transaction 1. And it's going to do this atomic block of code that
is going to compute B is equal to A plus 1. And here's the assembly code for it
down here. And here I've got some private memory that's going to track read set
and write set. And then I've got my memory system. And initially the value of A
is 1 and the value of B is zero. Okay?
So if I run this code I'm going to load A. I'll put that into my read log. And then I'll
do the computation, which is adding one to it. And then I'm going to say
that I want to write B equal to 2. So before I can update the global state and
commit this transaction, I need to validate that this thing has not had a
conflict. So the way I'll do this in a value-based conflict detection system is I'll
check the value of A and see if it changed.
So if A didn't change, the assumption will be that there is no conflict and
I can go ahead and commit this update to memory. Okay. So this didn't change.
Our read set didn't change, so it's safe to publish B equal to 2.
Okay. So that's a somewhat simplified view of the system. Let's imagine there's
another transaction running; what would happen? So let's say it was running
concurrently and it read the value B equal to zero. It didn't see this new
updated value when it was running, it saw the old value. And, because this
transaction is computing A equal to B plus 2, it computed the
new value of A equal to 2. So what would happen here is, when we go to
validate this, we'll see that the value of B has changed, so there was a conflict. And so
we know we have to redo this transaction, abort it and restart it. Okay. So far so
good.
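Here is a simplified, sequential sketch of value-based conflict detection for a single transaction. It deliberately ignores the concurrent-commit hazard that comes up next; the read log records the value observed at each address, and validation re-reads those addresses and compares.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Memory = std::vector<uint32_t>;                 // toy model of global memory

    struct Transaction {
        std::unordered_map<size_t, uint32_t> read_log;    // addr -> value seen
        std::unordered_map<size_t, uint32_t> write_log;   // addr -> value to write

        uint32_t load(const Memory &mem, size_t addr) {
            if (write_log.count(addr)) return write_log[addr];  // read own write
            uint32_t v = mem[addr];
            read_log.emplace(addr, v);                     // remember first value observed
            return v;
        }
        void store(size_t addr, uint32_t v) { write_log[addr] = v; }

        // Validate then commit. Returns false if the transaction must abort.
        bool try_commit(Memory &mem) {
            for (auto &e : read_log)
                if (mem[e.first] != e.second) return false;  // value changed: conflict
            for (auto &e : write_log)
                mem[e.first] = e.second;                      // publish the write set
            return true;
        }
    };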
>>: So are you going to talk about silent stores?
>> Tor Aamodt: I wasn't planning to talk about silent stores. But if you want to talk
about the ABA problem, we can talk about it.
>>: Yes.
>> Tor Aamodt: Okay. Can we get to it in a second or -- okay. Maybe --
>>: Yes.
>>: When it's appropriate.
>>: Does the compare and the write have to be done atomically? Otherwise you
compare, somebody else goes around and writes, and then you make a mess.
>> Tor Aamodt: Okay. I'll get to your [inaudible] his first.
>>: Priority order.
>> Tor Aamodt: Just slide order. Okay. So what happens if we try to do both at
the same time, right, which is your scenario. Okay. So here -- did I go too fast?
Okay. So both these transactions executed concurrently, all right? So
transaction 1 computed B equal to 2. Transaction 2 computed A equal to 2.
And they go and they validate very quickly, right, and they both match. And so
they both get the go-ahead. And it looks like we can go ahead and update. But
this is bad, right? This is not a valid ordering.
Because the programmer expects transaction 1, then transaction 2, which
would give us this result in memory, or transaction 2, then transaction 1,
which would give us this result in memory. Okay? So the one we got, where
both values are equal to 2, doesn't match either. So this is bad. We have to fix this.
So the way we're going to fix this problem is -- well, the naive way to do it is to
serialize validation, okay. So do one thing at a time, basically. So if I just did my
validation and commit for each transaction serially, then I wouldn't
actually have this problem. So V here is validation, C is commit. So once this guy
finishes, I'll detect this race condition and it will be fine.
So benefit number one: no data race. Benefit number two: there are some
esoteric things, like livelock, that you can get into with this kind of system, and you
avoid those. So those are great.
The huge drawback here is we've serialized even the nonconflicting transactions,
so we've just killed performance. There's now no point in doing any of this. So
huge collateral damage. So this certainly won't scale up to tens of thousands of
threads. So the solution for this is speculative validation.
So we're going to split up the conflict detection into two parts. Okay? One is
we're going to do conflict detection against recently committed transactions, and
we're going to do that in parallel. Okay? And so what does that look like? So
here's a bunch of transactions. They all want to validate. They're all sort of
coming in at the same time. And so we'll compare all of the read-sets in parallel with
what's in global memory here. So we're checking all the read sets. They all
match. And that looks good.
So they pass this. Now, they're not comparing against each other. They're just
comparing against transactions, like transaction zero or whatever came before
transaction one, that have already updated memory, but recently. And
if a value changed, we'll see that there was a conflict.
Okay. So step 2 is: what if one of these transactions updates something
that another transaction we just validated in parallel also touched? What if there's a conflict
between these guys? For that, we're going to add some hardware
to detect that case. Okay? And when we do detect that case, we're going
to stall one of the transactions and try to revalidate it against memory a little later.
Okay. So I don't think I have slides on the details of how we do that
implementation. There's much more detail in the paper; it's basically a
new type of Bloom filter that we've proposed, called a recency Bloom filter, to
handle this.
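As a simplified, sequential stand-in for what the commit unit does, here is a sketch of the two-step check, building on the Transaction/Memory sketch above. Step one is the value-based validation against committed state (done in parallel across transactions in the real hardware); step two checks the read set against addresses written by transactions validating at about the same time. In KILO TM this second check uses the recency Bloom filter; an exact set is used here for clarity, and the abort-versus-stall distinction is collapsed.

    #include <unordered_set>

    struct CommitUnit {
        // Addresses written by recently committed / in-flight transactions.
        // In real hardware these entries age out (hence "recency").
        std::unordered_set<size_t> recently_written;

        // Returns true if tx committed; false means abort, or stall and revalidate.
        bool commit(Transaction &tx, Memory &mem) {
            // Step 1: value-based validation against committed state.
            for (auto &e : tx.read_log)
                if (mem[e.first] != e.second) return false;

            // Step 2: hazard check against concurrently validating transactions.
            for (auto &e : tx.read_log)
                if (recently_written.count(e.first)) return false;

            // Passed both checks: publish writes and record them as recent.
            for (auto &e : tx.write_log) {
                mem[e.first] = e.second;
                recently_written.insert(e.first);
            }
            return true;
        }
    };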
So here's a summary of the hardware changes. And then I'm going to jump to
talk about this ABA problem to get to Doug's question. So the changes are: I
mentioned the SIMD problem with committing, you know, some lanes may commit,
some may not, so we have to make some modifications to the SIMT stack
hardware. We need some hardware changes to track our read and write sets.
And we have a commit unit that implements this parallel validation. Okay.
So before we go into the results here, let me just quickly talk about the ABA
problem. Okay. So what is the ABA problem? The classic
example is this linked list thing. And this is where I'm trying to do some very
fancy stuff with atomic operations so I get around using locks, basically.
So imagine I want to do push and pop operations, but I don't want to have a
global lock on this linked list. So one idea that's been proposed is to do something like this.
I go in, I check: top is the first thing here and next is this next pointer.
And I'm going to make copies of these two things. And then I'm going to do this
atomic compare and swap. So I'm going to say, okay, if top didn't change, if it's
still pointing to whatever it was pointing to before, which is A, then I'm going to
swap this thing out, get rid of A, and have top now point to B. So
this is how this code is supposed to work. Okay?
So I popped A off the list because A didn't change. And I want to pop -- and I
want B to be the next pop. Okay? That's all well and good until another thread
comes along and does the following. Pops A, pops B, and then pushes A back
on. Okay?
So now I've got two threads sort of racing through this code at the same time.
And we can get this really weird result. So what's
supposed to happen, or what will happen here, is this other thread finishes: it
popped A and B and then pushed A. So this is now the state of the linked list,
okay? Now, thread 0 comes back and does this compare and swap, and it sees,
well, top is still pointing to A, okay? So now I can swap this out, right? And then
I get this really weird thing, which doesn't make any sense here, right? Because
really what I wanted is: if thread 1 got there first and did this sequence
of operations, and then thread 0 comes along and executes its pop, I should have
gone from A, C and just gone to C. But somehow B popped up on the list again,
right? So that's the ABA problem, or the first version of it that was found.
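For reference, here is a sketch of the naive lock-free pop being described, with a comment marking exactly where the ABA window sits. This is deliberately the broken version; a real implementation would need tagged pointers, hazard pointers, or similar.

    #include <atomic>

    struct Node { int value; Node *next; };

    std::atomic<Node *> top{nullptr};    // head of the lock-free stack

    Node *pop() {
        Node *old_top, *next;
        do {
            old_top = top.load();
            if (old_top == nullptr) return nullptr;
            next = old_top->next;
            // <-- ABA window: another thread can pop A, pop B, push A back here.
            //     top still points to A, so the CAS below succeeds and installs
            //     B as the new top even though B is no longer on the stack.
        } while (!top.compare_exchange_weak(old_top, next));
        return old_top;
    }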
So the intuition for why we don't suffer from this problem is basically --
I mean, the real question is, what is the problem? Okay, compare and swap
operated correctly. Right? It did the comparison. And it swapped things. Right?
The real problem here is the programmer didn't think of this race condition.
Okay? So our assertion is that if you are running transactions and you write
code like this, here's what happens. If you write
a transaction, you're going to basically guard all of this stuff. You're not just
going to guard this one little thing up here, you're going to guard all these things
that you're touching in a transaction. You're going to detect the conflict.
Okay. So that's a very hand-wavy, informal explanation. But we've been
really rigorous about this, so we wrote a correctness proof. Doug? Okay. You've got
to queue our code reader. Okay. We have a tech report here. Basically we can
show that we get conflict serializability.
And so we're pretty happy that you don't have that problem. Or we're pretty
satisfied we don't have that problem. Or another way of saying it is that we tolerate
it. That's what we say. ABA can happen and we can still
serialize the transactions. So as far as you're concerned, it didn't happen.
>>: Yeah, the --
>> Tor Aamodt: Sure. Yeah. There's a conflict.
>>: There's a record, yeah.
>> Tor Aamodt: Okay. Let me go back to where I was, which is to talk about --
>>: Thank you.
>> Tor Aamodt: Yes. Okay. So let's talk about data. Okay. So how well does
this work? So we used our simulator, GPGPU-Sim, which we've developed at
UBC. It's available under a BSD license, so anybody can use it. There's the URL.
The version that we used for this study we correlated against a GT200, and the
operations-per-cycle correlation is pretty good, 0.93. And we implemented the
changes to model KILO TM and did some, you know, pretty detailed modeling of
when values actually change in memory, to try and catch any problems like an
ABA problem happening, if it were to happen.
So we were kind of defensive in our programming to try and catch those kinds of
problems.
And we have a very detailed model of it as well. Okay. So the GPU applications that
we looked at: we have some essentially microbenchmarks here. A hash table,
and we have two versions, a high-conflict and a low-conflict version.
Then we have something that models moving money between two bank
accounts. So we call that ATM.
And then we've got some code from a games company that was
collaborating on this research. It's a cloth physics demo written in OpenCL. And
it's an interesting one because they couldn't scale it up, because it was written for
a CPU to run on the SSE units of an x86. And if you tried to scale it
up beyond this, you would have race conditions in it. So it was actually an
interesting candidate to try this out on.
Barnes-Hut, which is the N-body problem; it uses atomics.
CudaCuts. This is interesting because this is the actual vision problem that
I had these undergrads trying to solve. Somebody else
solved it and published a paper, while we didn't.
And then a data mining problem.
>>: Did any of these problems do a lot of [inaudible] updates in data mining is a
[inaudible] classification.
>> Tor Aamodt: So the data mining one is doing Apriori, which is sort of the
Amazon thing, you know: if you want this, what are other things you might want, based
on what other people bought.
Okay. So performance. I've got a few different ways of looking at
performance. And in this view what we're plotting is speedup over serialized
transactions. So what you can see here is, if I do an ideal transactional memory,
we're getting close to 300X over serializing all the transactions. Ideal
here has no overheads for the conflict detection and for committing the
transactions. Things just happen really quickly.
We're comparing that against fine-grained locking here. So this is a version of
the code that has locks for all the critical sections. And so you can see here that
transactional memory can do a little better than fine-grained locks. And that's
sort of a well-known result from transactional memory.
Okay. So now let's look at our detailed model. In the prior slide I was just
looking at this sort of idealized transactional memory. So the middle bar here is
our detailed performance simulation data. And we're comparing against both
fine-grained locking and the ideal TM; here I'm normalizing to the ideal TM. And
we're looking at normalized execution time. So in this graph, higher is actually
worse. In the prior graph, higher was better.
And there's a few things you can note here. So first of all, ATM. Okay. So for
the ATM one you can see that we're actually doing better. So with the ideal TM
you might think, okay, I can believe that that would be better than fine-grained
locking. Here, with all that stuff that I've described, we're still doing better than
fine-grained locking. And the reason is that when you write a critical section for
this particular application you have to acquire multiple locks, and there's a lot of
extra memory traffic that's being optimized away by doing transactional memory in this
case.
For the other examples, the trend is that we're going to do worse than fine-grained
locking. So these examples here are the ones that did the worst.
There was significant contention, and so that invoked the transactional memory
a lot more.
Another thing you can see here is that in some of the examples the fine-grained locking
actually did better than the ideal TM, and that's because in some of these
algorithms the fine-grained locking might be a
little bit more optimized. So the CudaCuts one is not actually
acquiring a lock, it's just doing an atomic add, for example.
And in Barnes-Hut they have try-lock type code, which basically, if it fails, will
do something else. So it's a bit more computationally efficient as well.
Okay. If I go to lower contention, so this is the lower contention hash table,
things get better. So performance tuning will help. Okay.
>>: Can you not easily go into something like Barnes-Hut and replace the
fine-grained locks with small transactions, and shouldn't that show better
performance? Because you're not -- assuming the cost of rollback is -- it's like
a transaction that's free, which it's not, you should never [inaudible] right.
>> Tor Aamodt: Right. So I think that code has been really heavily optimized
and -- yeah. Should we be better? Well, we're not.
>>: You might be -- your transactions and your locks may be at different
granularities [inaudible].
>> Tor Aamodt: I think there was some trickiness here. There was, you
know, some trickiness about isolation here, where we have sort of a weakly
isolated transactional memory system. So we had to make our transactions
a little bigger than you would want if you had a strongly isolated system. So I
think that might also explain a bit of it.
Okay. So the bottom line here is, with the performance model that we have, we're
getting to within 59 percent of fine-grained lock performance. And
one thing I want to point out is that when we compared our baseline, that is, our
fine-grained lock performance and our atomic performance, against a real GPU, it's
actually about four times faster than an NVIDIA GPU like Fermi.
So, you know, we may actually be faster. If you believe our performance
model for transactional memory and you don't believe it for atomics, then, you
know, it's sort of like comparable performance, is the way I would say this. That's
what I think it boils down to.
Okay. Implementation complexity here. So we went through and did a detailed
analysis of how much stuff you have to add. And basically the bottom-line number is
somewhere around one percent of a current GPU. We did some of this with
CACTI. Wilson has gotten a lot of flack about that, because one of our faculty
members was one of the first people to work on CACTI, and, you
know, knows that it's not very well calibrated at low numbers like this.
So he went off and actually used a memory compiler to come up with similar
numbers, you know, within a factor of two of the CACTI numbers for the area.
So, you know, if you believe this is the hardware you need to build,
then we think this is the overhead, how much it's going to cost
you.
So just to summarize the first half of this -- or the first three quarters of it, however
it works out. We have thousands of concurrent scalar transactions.
And basically we have no cache coherence protocol. Our KILO TM handles
scalar transaction abort.
I didn't mention this before, but it's word-level conflict detection, not cache
block level conflict detection. I didn't mention this either, but essentially in terms
of correctness it can handle unbounded transactions. Performance-wise, there will
be a performance cliff when you go to really large transactions that exceed the
size of the hardware structures, but we'll still run them. And I just mentioned the
performance. Okay.
Okay. So I'm going to move on to the next part of this talk, which is running
Memcached on a Fusion APU. I don't know if there were any questions before I
do that. Okay. If not. So here's a picture of a Google site which, I guess, is
a famous picture. There's this big river to cool down these data
centers, right?
And so the focus of this work was, you know, can we reduce power and make things
more efficient by running them on GPUs, given that these data centers are using a lot of power?
And we're interested in non-HPC. For GPUs it's pretty well established that if you
have a high performance computing type application that's very regular,
a dense matrix-multiply-esque type of application, it's going to map onto a GPU pretty
well.
But there's a lot of workloads that are economically, you know, much more
important that don't map so well or do not fit that picture. And the question is,
could you take those applications and run them on the GPU? And most
people's intuition is, well, no, right? And so we set out, because we're
researchers, to evaluate whether that is the case. And we're not the only people
looking at this space. At this year's PPoPP there's a number of
papers where people are doing really wild things on GPUs, like alias analysis and,
you know, breadth-first search, and getting really impressive results.
So I'm going to look at this particular application now. So, okay, we're talking
about Memcached. So here, to motivate this, is some analysis of what I get for
SIMD efficiency on a GPU. If I'm looking at an application and I'm the
programmer and I'm deciding should I invest several months of effort to try and
take this code and port it to some new platform, I'm going to look at that code
and say, well, does it look like something else that has run well on a GPU?
And so what I might do is look at control flow, for example, and have
some sort of intuitive understanding that a lot of control flow doesn't map too well,
because every time there's a branch I lose some -- you know, I mask off half of
the threads, et cetera. And so if I just assume each branch has a 50/50
probability of going left or right, this is the kind of SIMD efficiency I
would expect out of Memcached, okay? Out of the key lookup kernel in
Memcached. SIMD efficiency is basically what fraction of my lanes on any
given clock cycle are actually going to do something when they do anything.
So I'm factoring out waiting for memory here.
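One reasonable way to write that metric down (this is my formalization, not necessarily the exact definition used in the paper) is:

    SIMD efficiency = (sum over issue cycles of active lanes) / (number of issue cycles x SIMD width)

So a warp that issues with all 32 lanes active every cycle has an efficiency of 1.0, and one that averages 8 active lanes out of 32 has an efficiency of 0.25.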
And this actual bar here is what you get when you actually do this, when you finish the
exercise and you have the code ported over and run it. You see that we're at
several times higher efficiency. It's not perfect. There is branch divergence. But
it's not this low number that your intuition would give you. Okay.
So what is Memcached? Facebook uses this. This is a slide from
their keynote at HPCA-PPoPP. So in the system I've got a bunch of web
servers. I've got some storage with all my database stuff. And I want to speed
up accesses from the web server to the storage. And so I'll put some sort of
software caching thing in here. That's what Memcached does. I'll go and I'll look
and see if what I'm looking for can be satisfied out of this thing really quickly. If
not, then I have to go and find it in my other storage.
And those right there are Facebook's numbers. It helps a lot, so they do it.
So what we want to do is actually just take a look at this and just speed this part
up here. Okay.
So the question is, is this compatible with a GPU? So this is the control flow
graph of the key value -- key value lookup handler in Memcached. This is
actually just the hash function part. And this is, you know, like one of these dot
graphs and I think something is hiding part of the graph here. But it's pretty
messy, right? There's edges all over the place. Really highly irregular control
flow.
So can you actually get this thing to run on a GPU? This is the kind of thing
where you look at it and you go, nah, that's not going to run on a GPU. GPUs are
SIMD; everyone should do the same thing, right?
So that doesn't look very good. Not only that, but every thread is going to be
accessing something different. So it's also got irregular memory access
patterns. It has a large memory footprint and is highly input-data
dependent, right? So can we actually do this?
So this is the result of our study. Basically ported it over. Our students ported it
over. And so this is kind of the way they did it. Basically we only ported the GET
request. So that's reading the database. Updating it is still done on the CPU.
And so this is sort of like a schematic view of what we ported over. So GET
comes in, we do some hash function on it, we look up some array and we get out
a key there. If the key matches what we're looking for, then that's great, we have
a hit. Otherwise, we do some chaining to look for it further around in this hash
table. And the end result is just a hit or a miss. And an index into this table
where it is. Yes?
>>: The values are on GPU memory?
>> Tor Aamodt: So the values are in some separate array, okay? And what
we're going to return from this is just a pointer to that value. The value
could be really big, right? And we're just going to give you the offset into some
data structure that holds the value.
Okay. So how would we port this? If you just did each individual GET request,
it's not going to work out too well. So we're going to batch up requests, some
number of hundreds or thousands of requests batched together. And we're
going to run each one of these as a single work item in an OpenCL kernel, okay?
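To make the one-request-per-work-item structure concrete, here is a rough sketch of such a lookup kernel. It is written in CUDA rather than OpenCL, and the data layout (bucket_heads, slot_keys, slot_next, fixed-size integer keys) is purely illustrative, not Memcached's actual structures.

    struct Request { unsigned int key; };         // simplified fixed-size key

    __device__ unsigned int hash_key(unsigned int key, unsigned int n_buckets) {
        return (key * 2654435761u) % n_buckets;    // stand-in hash function
    }

    __global__ void batched_get(const Request *reqs, int n_reqs,
                                const unsigned int *bucket_heads,  // per-bucket first slot
                                const unsigned int *slot_keys,     // key stored in each slot
                                const unsigned int *slot_next,     // chain links (0 = end)
                                unsigned int n_buckets,
                                int *hit, unsigned int *value_index)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_reqs) return;

        unsigned int key  = reqs[i].key;
        unsigned int slot = bucket_heads[hash_key(key, n_buckets)];

        // Chain walk: different threads take different numbers of iterations,
        // which is one source of the branch divergence discussed a bit later.
        while (slot != 0 && slot_keys[slot] != key)
            slot = slot_next[slot];

        hit[i]         = (slot != 0);
        value_index[i] = slot;      // offset into the separate value storage
    }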
So what were we trying to do? We wanted to increase request
throughput and keep request latency reasonable. With this batching, of course, you're
waiting around for new requests to come in, so it's going to increase the individual
request latency, obviously.
And so what are the challenges? Well, I've mentioned them already. This
irregular control flow is a big part of it.
Our methodology here was we took some recent AMD Fusion APU parts. We
also took a discrete AMD Radeon GPU, a high-end part, and we ported the code
over to OpenCL. We also ran versions of it on our simulator to get a little bit
more insight into what was going on. And we also created a sort of Monte
Carlo simulator to run the control flow graph, to get that view of what the
programmer's intuition is going to look like.
And our input to Memcached was traces of Wikipedia accesses that have been
published online. So we pretended that we went into Wikipedia and added a
Memcached server to it.
Okay. So, one request per work item. Basically data can be anywhere in
memory, so the question becomes, what is
the performance of this? So one of the first studies we did was, okay, how much
benefit would there be from improving the memory system? What
we're actually trying to figure out here is how much of this initial bad
performance might be due to, say, the memory system versus control flow. And so
this study is just showing what would happen if you changed the memory
system. Okay. The end goal of this project is trying to make
the performance for this even better by changing the hardware.
But here you can see what happens if I had no cache, and as I go up to a one-megabyte
fully associative cache, and then no memory latency and no memory stalls. So this is
running on a GPU. And this is percentage of peak IPC.
So you can see that there's definitely benefit to using caching. Right? So in
our kernel we've actually put some effort into using the cache. But it kind of
saturates here. The details about these two last bars are that they're different versions
of an ideal memory system, where in the first one basically -- well, the last one here
is when I take out all stalling due to memory. Basically if I do a memory access,
even if it accesses one different cache block per lane, I'm going to assume that
takes one cycle, essentially. And that peaks out at about 32, 33 percent of
peak IPC. The rest is control flow.
>>: Are you majority [inaudible] because different elements have different chain
lengths in the hash table elements?
>> Tor Aamodt: Yes. I think that's a big part of it, yes. Okay. And actually the
key -- I'm not sure if the keys can be different lengths too. But okay.
So let's just look at this control flow problem again. So here's our control flow
graph. And here's the different work items. And they can diverge, right? So this
is just restating the divergence problem. Work items 1, 2, and 5 go here. Work
item 3 and 4 go over here. And if that, you know -- if that's actually the case, it's
going to reduce SIMD efficiency.
So here is some analysis. And this is giving a little bit more detail on that
graph I showed you before, of intuition versus actual. So I've got two versions of
intuition here. And this is a slightly different plot. This is showing, for each version
of Memcached, or each way of running it, a breakdown of where cycles go. So
this thing at the bottom here is saying that on these cycles, this, you know, 10 percent
or a little less of cycles, there's no divergence, okay? So I'm using all of my SIMD
lanes. At the top here, this vast big number up here is saying I'm getting, you
know, like 60 percent of the cycles where only between 1 and 4 out
of, say, 32 lanes are actually switched on. So this is highly inefficient execution
up here, due to multiple nested branches.
So the different bars here are showing -- the first two are different versions of
intuition. And the last one is what actually occurs. So this is from detailed
simulation. And these are just sort of different versions of the programmer's
intuition.
The first one is the bar I showed you before, which is let's assume every branch
is 50-50 taken or not taken.
The next version here is a little bit more refined. It's where I'm saying, well, as a
programmer I know some code will never run. Like it's an exception
condition or an assertion or something. I know that will never run. So if the
programmer can reason that that won't happen, I can get a slightly better, tighter
bound on how things will run. But as you can see, that doesn't change things a
whole lot. And when I compare it to the final actual result, there's
still a huge gap.
So 62 percent of, you know, execution will be highly diverged if I just sort of
have the pessimistic view of what control flow is going to do. If I refine it a little
bit, 51 percent is heavily diverged, but in actuality only 29 percent is this
heavily diverged. Yes?
>>: I need some intuition. If the brown one assumes 50/50, why would the green bar
even be visible? It would seem to be roughly one in a billion. A few in a billion. But
not visible.
>> Tor Aamodt: Yeah. Well, okay. So it really depends on your nesting depth,
right? I could go back to that graph, and it's really how many nested branches do
I have. So if I nest only once, it's going to be half of them, right? So this is
saying that maybe we only nest about four times or something? Yeah. Yeah.
>>: Why do you care about only 1 through 4? I mean.
>>: It's just a way to summarize it. I think the way I had it initially, this
average efficiency, what's the average number of units active, is probably
the right number.
>>: Do you know what is the point that this -- this [inaudible] CPU? Is it like if
your efficiency falls [inaudible] 16 [inaudible] GPU?
>> Tor Aamodt: I would argue it's probably around -- yeah, look at the clock
frequencies and say, you know, maybe -- well, it's hard to do, right? How many
cores do you have on the CPU, how many cores do you have on the GPU,
depending on what you call a core? Right? So what I'm calling a SIMT core, in
Fermi they have 16 of those, right? So if I have a 16-core GPU, then I
need, you know, basically -- and if that's 4-wide issue, then I need 4 lanes active for each
of those cores. So, right, like 8 percent or something. Right?
And we're doing a lot better than that, right? Plus we have a higher bandwidth
memory system. So a lot of these applications like memory bandwidth. So, yes?
Okay. So this is just a slide talking about how we handle the memory
transfers. So the students wrote a little dynamic memory manager to do this, to allocate
memory, and depending on how you're doing this you either transfer memory
between the CPU and the GPU, or if it's a Fusion part you don't actually have to do
that transfer.
Even on a Fusion part, on the current systems you'll get different virtual
addresses on the GPU versus the CPU. So that's a bit of a wrinkle that you have
to deal with. So this is just talking about how we handle that.
So on the Fusion systems we have physically shared memory, and we use the zero-copy
API. On the discrete systems we actually have to transfer things. And
the data I'm going to show is sort of an optimistic version of that data. So
in reality, with a discrete system you'd probably be slower than what we're
showing here.
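As a sketch of those two data paths, here is the analogous idea written with CUDA's APIs for concreteness (the work in the talk used AMD's OpenCL zero-copy path on Fusion; this is not their code). On an integrated part the GPU can read a mapped host buffer in place; on a discrete card the request batch has to be copied across PCIe each time.

    #include <cuda_runtime.h>
    #include <cstdlib>

    // Allocates a host-side request buffer and produces the pointer the GPU
    // will use. On older setups the zero-copy path may also require calling
    // cudaSetDeviceFlags(cudaDeviceMapHost) before any CUDA context is created.
    void *alloc_request_buffer(size_t bytes, bool integrated_zero_copy,
                               void **device_view)
    {
        if (integrated_zero_copy) {
            // Zero-copy: pinned, mapped host memory the GPU accesses in place.
            void *host_ptr = nullptr;
            cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);
            cudaHostGetDevicePointer(device_view, host_ptr, 0);
            return host_ptr;
        } else {
            // Discrete card: separate device buffer, explicit transfer per batch.
            void *host_ptr = std::malloc(bytes);
            cudaMalloc(device_view, bytes);
            return host_ptr;
        }
    }

    // Per batch, the discrete path pays for this copy (and one for the results):
    //   cudaMemcpy(*device_view, host_ptr, bytes, cudaMemcpyHostToDevice);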
Okay. So here's performance versus the CPU. So an AMD Llano is the
baseline for the first two bars here that we're comparing against. And we're
just looking at the kernel for the first bar here. So if I look at a discrete GPU and I
just look at the key -- the hash function part of it running on the GPU, it's like 30-
something times faster on this, you know, big monster card of a GPU.
If I include the data transfers, which is getting my requests onto the GPU and then
getting the response back, which is just a pointer to the answer, not the whole
value, just the pointer to the answer, it's actually slower. This is lower than one; it's
0.88 or something.
So this is like not -- this is not winning. This is losing, right? In reality.
If I look at the Fusion part, where I don't actually have to transfer stuff around,
then it's actually a pretty good win. Here, 7 and a half X better. This is on a Llano
Fusion APU, which I think is the latest one. And on the lower-powered
Zacate part we're still doing better than one. So --
>>: [inaudible] difference between the [inaudible].
>> Tor Aamodt: Yeah. Yeah. So if I just look at the bar that does not
include data transfer, yeah, this bar is much higher. And the reason for that is
this has a lot more peak gigaflops, you know, much more compute horsepower.
Yeah. But the data transfers kill you, right? So having this integrated
memory system looks like a pretty big win.
Okay. So here's just another view of the data. This is just showing you the
execution and transfer overhead, right? So here, all of your time is going to
moving stuff around. When you go to Llano, you almost can't see it.
For the Zacate, for some reason, when you map or unmap, like making it visible
to the GPU or not, there's some overhead.
Okay. Here's another view of this. So I talked about batching. So the question
becomes, well, how much batching do I need to do? So you can see here a few
data points. So if I look here, I'm looking at the normalized throughput. And this
is showing me that if I basically batch up about 10,000 requests, I'm kind of
saturating how much throughput I'm going to get. That's this curve over here.
Of course, as you batch up more, there's going to be an increase in latency. So
these units are sort of dimensionless, because this was work done with AMD and
they didn't really want us publishing absolute numbers. I guess it's an
industry thing, right?
But this number here for the latency is 0.5 milliseconds. So to give you some
idea of the latency. And what we understood is that that's sort of a reasonable
latency. Okay.
So the highest throughput relative to latency was around 8,000 requests, at that point
there. Okay. So a quick summary here for the second part. Our
assertion is programmer intuition doesn't always paint the whole picture. And I
don't think we're the only people who have seen this. As I said, at PPoPP
there's multiple papers on taking applications you might not think would fit on a
GPU and actually showing that they can run reasonably well.
So, you know, how does that work? We exploited the available parallelism on a
GPU by batching up requests. And, you know, the result we got here is a 7.5X
performance increase on the Llano system.
You know, we think you could do better on the -- on reducing the latency.
There's some work that we're aware of that follows on this that tries to do more
fine-grained spawning of the work on to the GPU. So hopefully that will -- you'll
hear about that soon.
And so basically data transfer overheads can have a large impact.
All right. So is there any other questions? I hope there's some time. Yeah.
>>: Can you go to the previous slide?
>> Tor Aamodt: Sure.
>>: The batch that says 8,000 requests, does this number include the amount of
time that you need to wait for 10,000 requests to come in?
>> Tor Aamodt: I don't think it does actually.
>>: [inaudible].
>> Tor Aamodt: Right. I think actually it might not include that. That's -- I was
grilling him about that two days ago. Why aren't we doing that. But as I
understand it from when they were profiling the overall system, they -- I guess
they got the impression that wasn't going to be a huge amount of it. So, yeah, I
think the answer there is it's not, which is we should include it, obviously.
Although -- yeah. That's the answer I'm sticking with right now.
>> Doug Burger: Any other questions? Okay. Thank you very much.
[applause]