>> Doug Burger: It's my pleasure to introduce Tor Aamodt from the University of
British Columbia. Tor has done a ton of interesting work. He's worked recently
with FPGAs and some with GPUs, done some work on branch prediction in the
past and analog circuits, and is, all in all, fairly knowledgeable on a broad
range of subjects. I always enjoy talking to him.
So he came down and is sharing his work with us. And I'm actually really excited
to hear your talk on GPUs today. So thanks for coming.
>> Tor Aamodt: Thanks, Doug. So the title of this talk is Evolving GPUs into a
Substrate for Cloud Computing. I was trying to think of how to summarize the
work I've been doing for the last little while, and this sort of popped into my head as
a title. And then after I sent this, I went, oh, wow, evolution doesn't sound that
exciting. It sounds really slow paced, like glacial or something, and, you know,
we're architects. We're supposed to think of the answer right away and come up
with something really cool.
But then, you know, I thought even more and no, this is the right answer. This is
the right title. Because evolution can be disruptive. It can be disruptive if you,
you know, start with a small market that tolerates rapid innovation, right? So of
course what I'm thinking about here is the games market, where you can pump
out a new graphics processor every few years and they'll change the hardware
ISA in each one, right? And the games market will absorb that.
You start with that as your beginning point, right? And then you make many
small changes, right, because you're in a highly competitive environment, you
have to keep innovating. Make many small changes to improve your product.
And while you're doing that, you start thinking about, okay, what else can I do
besides this small market? Start making little tweaks to solve a larger market's
problems, right?
And maybe if you can do that and achieve that at a lower cost than the
incumbent then, you know, at some point you've got what's known as
the innovator's dilemma. You just sort of start eating their lunch.
And so I kind of think that maybe GPUs are going to do this. This is, I guess, the
speculation behind my research.
So what am I going to talk about? I'm going to talk about really two bits of
research we've been doing. One recently published on transactional memory for
GPUs. And this was published at MICRO, which is a computer
architecture conference. And it was also selected for IEEE Micro's Top Picks this
year.
And then I'll talk about some software work that we've been doing. And this is using
an AMD Fusion APU. This was taking a server-type application, Memcached,
which is something you wouldn't necessarily think would run well on a GPU. I think
most people's first intuition is, why are you doing that? And then we showed that this
actually can run pretty well on a GPU. So I'll talk a bit about that as well.
>>: So are you going to say who the GPUs are going to disrupt?
>> Tor Aamodt: We can have that discussion later maybe.
>>: Okay.
>> Tor Aamodt: Yeah. Yeah.
>>: Us?
>> Tor Aamodt: Could be, right? I was just trying to take a step back and say
what am I doing, right? Trying to change GPUs. I'm not trying to radically
redesign them from the ground up. And, you know, is that the right way to do it?
Okay. So I'll focus first on the transactional memory on GPU. And what's the
motivation for this? So a short story might help explain this.
So when I first got to UBC, GPUs were coming out and you could write CUDA
programs on them. And so a really enthusiastic undergrad came into my office,
wanting to do a senior project. And he wanted to do it on vision processing. He
was involved in some robotics competition and they needed to do like realtime
vision analysis.
And he thought, well, there's cool algorithms out there, but the
hardware isn't fast enough to do them in realtime, but there's these GPUs now,
people are talking about using those. So he wanted to use a GPU to accelerate this.
And I said that's great. I don't know anything about vision, so you're on your own
with that, but if you want to write some code to run on a GPU, that's great.
So he went off. He found some algorithm that was a really high-quality
algorithm and tried to get that to run on the GPU and spent all semester getting it
almost working, right? And then he ran out of time. And I was like, oh, okay, well,
that's pretty cool. You made a lot of good progress, you know, A plus to you.
Another student came along and didn't -- this was a more normal student, I
discovered: I have no idea what I want to do. I said okay, finish this project off.
All you've got to do is debug it, right? It's already there.
And so, needless to say, they couldn't get that to work. You know, four months
later the thing's not working. I said, well, maybe that was just not as good a student.
And then another student showed up and tried to, you know, again I said okay,
for sure, this is an awesome problem. And they had the same problem.
So what was it that was different about this problem compared to other GPU
problems? It required fine-grained synchronization using locks. And this turned
out to be kind of a hard problem for GPUs.
So if you look at the normal lifetime of GPU application development here, this
time here is developer time, and this Y axis is goodness of some sort. A
good application that maps onto a GPU well has a lifetime like this: some
time after not too many weeks of development, the application is running correctly
on the GPU. So this green line is saying it's not working, it's not working, and
suddenly it's getting you more or less the right answer.
Now, that doesn't mean it's running really fast. So initially performance might be
really bad. And here I'm assuming you've sort of identified like a big kernel that
you want to run. And that kernel initially may be very parallel, but
it's running very slowly on the GPU. And so you have to analyze why is it
running slowly and incrementally refine it and get performance to be good. And
then you're happy.
So that initial getting it to work gives you the hope that you can continue this
process and get somewhere. If you start looking at the kind of application that
we were looking at with this vision algorithm which involved locking, it's more like
this, okay? You're months and months and months into trying to get this code to do
anything correctly and it's not working. And you might give up here. So if you're
a company you might say, well, we sunk way too much money into this bad idea.
Let's just go and do something else, all right?
Or maybe, you know, you reach this point where it suddenly starts to work and
you figured out the code enough, but then you encounter this bad performance.
Okay. So then when your manager sees that they go okay, forget it, okay. You
spent months and months and months. You finally got this thing to work but the
performance is like 3X slower than a CPU. What are we doing here, right?
So the hope of this transactional memory business on GPU is to -- okay, before I
go into that, why would you want to use fine-grained locking on a GPU? So here
is just an example. The N-body problem. Okay?
So if you do like the simplest order N squared algorithm on a GPU, a lot of
parallelism, and you do this problem with five million interacting particles and you
run it for some number of time steps it will take say half an hour, okay? There's a
very simple software implementation.
Now if you do a more clever implementation, but one requiring something like
fine-grained locking, you can get this down to O(N log N), which is much more
computationally efficient, and so it's running orders of magnitude faster. So
that's probably the reason why you want to do fine-grained locking: to get some
better goodness in the algorithm.
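To make "fine-grained locking on a GPU" concrete, here is a rough CUDA sketch of the kind of code this implies. This is not the student's code or the talk's example; the lock array and tree-update structure are hypothetical, and the comments point out exactly why this style is risky under SIMT execution.

    // Illustrative sketch (not from the talk): per-node fine-grained locking in a
    // CUDA kernel using atomicCAS. The lock array and node/body arrays are
    // hypothetical. Note the hazard: if a thread holding a lock and a thread
    // spinning on it end up in the same warp, naive spinning can livelock under
    // SIMT lockstep execution -- which is exactly why this style of code is hard
    // to get working on a GPU.
    __global__ void update_nodes(int *locks, float *node_mass, const float *body_mass,
                                 const int *target_node, int n_bodies)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_bodies) return;

        int node = target_node[i];          // node this body wants to update
        bool done = false;
        while (!done) {
            // try to acquire the per-node lock (0 = free, 1 = held)
            if (atomicCAS(&locks[node], 0, 1) == 0) {
                node_mass[node] += body_mass[i];   // critical section
                __threadfence();                   // make the update visible
                atomicExch(&locks[node], 0);       // release the lock
                done = true;
            }
            // otherwise retry; a robust version needs care about intra-warp
            // divergence so the lock holder is not starved by its own warp
        }
    }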
But you have this problem as a software developer. Can I even get it to work?
Okay.
>>: [inaudible].
>> Tor Aamodt: Yes. Sure.
>>: Why is the shape of the red curve --
>> Tor Aamodt: Yeah. We can debate that, right.
>>: I'm not disagreeing, I'm just curious.
>> Tor Aamodt: Yeah. So this isn't measured data, first of all, right, this is my
intuition. If I have identified a kernel, so I've done some profiling and I
said this is a parallel kernel and I go run it, then each change I make to it is
going to maybe give me a factor of two or something improvement. So it's like
these factors of two multiplying together to give me something like that. That's
my intuition about how it works. Each one of those bottlenecks.
>>: You're knocking down Amdahl bottlenecks one after the other. The last one
you [inaudible] okay.
>> Tor Aamodt: All right. So let's go back to this region here where the
manager's freaking out about all the time and no results. So we want to shrink
that time to be small like it was for the applications that sort of naturally fit on a
GPU. So we're hoping that we can get the code to work. Now, we're not saying
it works really fast. We're just saying it gets you the right answer. We want to
get that within, you know, weeks or so, so that we've got something
happening on the GPU.
And of course at that point performance will stink and we'll have to go through
this whole performance tuning process. And of course transactional memory
costs something. So we may not actually ever achieve the same peak
performance we had before. But the hypothesis, or the assertion, is that this is a
much better scenario from a business perspective, where I'm going to get
some, you know, good level of performance in some reasonable amount of time
with some reasonable amount of risk, so that I can actually be successful in what
I'm trying to do, whatever it is I'm trying to do here.
Okay. So before I go more into this I want to just talk about what is a GPU and
quickly review what is transactional memory. So what do I mean by GPU?
So to us it's an NVIDIA- or AMD-like compute accelerator: SIMD hardware and an
aggressive memory subsystem. And so it gives us, we believe, high compute
throughput and efficiency. At least if you have the right problem.
And we're talking about nongraphics here. So OpenCL, DirectCompute, CUDA,
C++ AMP, whatever you want to think of there. The programming model is a
hierarchy of scalar threads. Okay. So what does that look like? So I've got a little picture here.
Another thing, in today's GPUs we have limited communication and
synchronization. That's not to say there's none, but it's limited. So a picture of
this in today's terminology using OpenCL terminology, you've got a kernel.
Inside of it you've got a work group. And this is some number of scalar work
items. And then the hardware from AMD or NVIDIA, those would be grouped
together into things called wavefronts or warps.
So inside here the smallest unit would be a work item in OpenCL; you can think
of it as a scalar thread. The hardware will group some number of them, nominally
32 let's say, into a unit of lockstep
execution called a wavefront or a warp. Then there might be up to about a
thousand work items in something called a work group or a thread block. And a
kernel may have thousands of thread blocks, all starting off running the same code.
Okay. So they can communicate. There's some local shared memory. This
might be 64 kilobytes. Everything within one of these work groups or thread
blocks, so roughly a thousand threads, can talk through this roughly 64-kilobyte
scratchpad memory, you can think of it as.
They can perform a barrier operation very quickly across all the threads
within a workgroup. If I want to synchronize across threads in different
workgroups, then I have to do something through global memory using some
sort of atomics.
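As a minimal sketch of those three mechanisms, here is a small CUDA kernel (the talk uses OpenCL terminology; __shared__ corresponds to OpenCL local memory and __syncthreads() to barrier(CLK_LOCAL_MEM_FENCE)). It assumes the kernel is launched with 256 threads per block.

    // Per-block scratchpad memory, a fast intra-block barrier, and a global
    // atomic for communicating across thread blocks.
    __global__ void block_sum(const int *in, int *global_total, int n)
    {
        __shared__ int scratch[256];              // scratchpad shared by the block
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        scratch[tid] = (gid < n) ? in[gid] : 0;
        __syncthreads();                          // barrier across the work group

        // tree reduction within the block, using the scratchpad
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) scratch[tid] += scratch[tid + stride];
            __syncthreads();
        }

        // cross-workgroup communication has to go through global memory atomics
        if (tid == 0) atomicAdd(global_total, scratch[0]);
    }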
Okay. So what are our baseline assumptions about the hardware? So here is
sort of a schematic of what we think of as a GPU. And basically all of this stuff would
be inside of the GPU package, right? And then there's some off-chip DRAM. So
we've got a number of SIMT cores, which I'll explain in a bit more detail in a
second. That's where those work items and workgroups are running.
When they do memory requests, they talk through an interconnection network.
This is an on-chip interconnection network to a bunch of memory partitions where
we'll have last level cache bank, memory controller talking to off-chip DRAM,
some sort of atomic operation unit.
Okay. So inside one of these SIMT cores we've got some sort of front end. This
is like the pipeline of a processor. There's a front end: fetch, decode, schedule,
handling branches. And there's this thing called the SIMT unit, where SIMT is single
instruction, multiple thread. And then the rest of the pipeline is like a
traditional SIMD datapath, where you've got your function units operating
in parallel.
So I'll talk a bit more about this SIMT thing and what that means in a second
here. Okay. So what is this SIMT, single instruction multiple thread? It really
goes back to a paper from SIGGRAPH '84. This was done at Pixar initially for a
processor called CHAP. And it's a way to take a bunch of scalar threads and
map them on to SIMD hardware. Okay?
So let's start off here. We have a bunch of threads that are going to run
through this control flow graph here. So you've got four threads. And these
squares here represent basic blocks in a program. The edges here are control
flow transfers between them. And the question is, okay, how can I
actually run this stuff on this hardware in parallel?
If everyone follows the same path, this isn't going
to be too hard. If everyone wants to follow a different path, it will be hard. Okay?
So let's start off as basic block A. Everyone is -- all four threads here are
executing basic block A. When I get to the end of basic block A, everyone goes
to basic block B, right? Pretty straightforward. What happens if at the end of
basic block B some threads want to go to C and some want to go to D? Okay.
So in this model, what happens is when that happens it's called branch
divergence. What we're going to do over here is we're going to push a couple
entries on to a stack. So this stack tracks the control flow state of this group of
threads, which would be a warp or a wavefront with NVIDIA or AMD terminology.
So what we can see is, if I just go back a step, when I was executing at B, I had
one entry on the stack and I had an active mask where all the threads were active,
and the next PC was just saying to execute B. When I get to the end here, I'm
going to push two entries onto the stack, one for each of the targets of this
branch, okay?
So push those onto the stack. The threads that are going one direction
will get one entry. The threads that are going the other direction will get a different
entry. And their active masks will be shown here. Okay?
Then I start executing whatever's on the top of the stack. So in this case, the
next thing on the stack is saying I want to execute basic block C. So execute
that. And I can get the active mask from here and just go ahead and execute
that.
When I get to the end of this branch path, we basically detect that I'm at
the end when I reach the reconvergence point here, this basic block E. When we
diverge at B, we know, because we can do some control flow analysis on this graph,
that the immediate post-dominator of the branch, basically the
join point after this branch, is down here at E. We'll track that when we diverge, so that
when we see that the next PC of the currently executing top of the stack
matches this reconvergence point, we'll know that this is the point where we want
to pop this entry off the stack.
Okay. So we pop this entry off the stack because the PCs match. And now the
next entry on the stack is revealed, and that's executing at D. So now we
are automatically executing at D. The same thing's going to happen here. D is going
to eventually reach E, and we're going to pop this entry off the stack.
When we pushed these two entries on the stack -- I forgot to mention that we updated the
entry below them to have the reconvergence PC as its next PC. So
when we finally pop that last entry off the stack, we continue execution,
reconverged again at E. And we can continue from that point forward.
Okay. So we can go around and do this again. So this type of mechanism
can handle arbitrary control flow. It can be irreducible control flow, et cetera. At
least the mechanism I just described here, which is also described in more
detail in a MICRO paper from a few years back, can handle arbitrary
control flow. So this is what I mean by a GPU in this talk.
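To make the reconvergence-stack mechanism more concrete, here is a small software sketch of the per-warp structure just described. This is a model of the idea, not the hardware: each entry holds a next PC, a reconvergence PC, and an active mask; divergence pushes two entries, and an entry is popped when its next PC reaches its reconvergence PC.

    #include <cstdint>
    #include <vector>

    struct StackEntry {
        uint32_t next_pc;      // PC this group of threads executes next
        uint32_t reconv_pc;    // pop this entry when next_pc reaches this PC
        uint32_t active_mask;  // which lanes of the warp are active
    };

    struct SimtStack {
        std::vector<StackEntry> s;

        SimtStack(uint32_t start_pc, uint32_t full_mask) {
            s.push_back({start_pc, UINT32_MAX, full_mask});   // base entry for the warp
        }

        // Called when the warp branches. taken_mask covers the active lanes that
        // take the branch; reconv_pc is the immediate post-dominator of the branch.
        void diverge(uint32_t taken_pc, uint32_t fallthru_pc,
                     uint32_t taken_mask, uint32_t reconv_pc) {
            uint32_t active = s.back().active_mask;
            s.back().next_pc = reconv_pc;                     // resume here after reconvergence
            uint32_t not_taken = active & ~taken_mask;
            if (not_taken)          s.push_back({fallthru_pc, reconv_pc, not_taken});
            if (taken_mask & active) s.push_back({taken_pc,   reconv_pc, taken_mask & active});
        }

        // Called each time the top-of-stack entry advances to a new PC.
        void advance(uint32_t new_pc) {
            s.back().next_pc = new_pc;
            if (new_pc == s.back().reconv_pc) s.pop_back();   // reached the join point
        }
    };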
Next I'm going to just briefly talk about transactional memory. Okay? So the
idea here is the programmer specifies atomic chunks of code called transactions.
Okay. So here's some code. It's got a critical section and it's a fairly complex
piece of code, so I've got multiple locks that I need to acquire before I can enter
the critical section. And if I'm not very careful about how I do this, I could end up
in a situation where I've got a deadlock, right? So if I'm not very clever, or, yeah,
even if I am very clever but I just haven't thought it all through, I could have a
deadlock situation.
So this has been tricky. People have recognized this. And they said, wouldn't
it be great if the programmer could just write this magic keyword, atomic, put a
curly brace, put in all the code that they want to run as if no one else is running
on the system, and have that happen? Okay. So that's sort of the dream
here of transactional memory.
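Here is a small sketch of that contrast, using a bank-transfer-style example like the ATM benchmark mentioned later in the talk. The lock-based version only avoids deadlock because it acquires locks in a fixed global order; the transactional version just declares the atomic region. tx_begin()/tx_commit() are placeholders for whatever TM interface the hardware or runtime exposes, not a specific vendor API.

    #include <mutex>

    extern void tx_begin();    // placeholder transactional interface
    extern void tx_commit();

    struct Account { int id; int balance; std::mutex mtx; };

    // Lock-based: correctness depends on acquiring locks in a consistent order.
    void transfer_locks(Account &a, Account &b, int amount) {
        Account &first  = (a.id < b.id) ? a : b;   // fixed order avoids deadlock
        Account &second = (a.id < b.id) ? b : a;
        std::lock_guard<std::mutex> l1(first.mtx);
        std::lock_guard<std::mutex> l2(second.mtx);
        a.balance -= amount;
        b.balance += amount;
    }

    // Transactional: the programmer just declares the atomic region.
    void transfer_tm(Account &a, Account &b, int amount) {
        tx_begin();
        a.balance -= amount;
        b.balance += amount;
        tx_commit();   // validated and committed, or aborted and retried
    }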
So how is it implemented? So first of all, the programmer's view is, if I have two
transactions running, one or the other happens at any given time. So here are two
transactions, TX1 and TX2. The programmer sees either TX1 happen before TX2
or TX2 happen before TX1. They didn't happen at the same time.
Of course if the hardware actually did this, it would be horribly slow, right, it
wouldn't give us a lot of parallelism. So in reality what the runtime system or the
hardware is going to do is run these things in parallel. So if these two
transactions are nonconflicting, meaning they access different memory locations
like you can see in this example, then they can run in parallel.
Of course if that's not true, if there's a conflict between them, then
we have to do something to repair the situation. So here I've got transaction 1
writing to B, and transaction 2 reading B. You know, if I want to be really safe
about this, I'm going to abort one of these things and rerun it. So in this case I'll
decide to abort transaction 2 and restart it with the updated value of B, so that I
get the situation where transaction 1 happened before transaction 2.
So that's transactional memory in a nutshell. So just to summarize, each
transaction has three phases. There's execution: in that
phase, we're in the transaction and we're tracking all the memory accesses into
what's called a read-set and a write-set.
Then we're going to validate the transaction. We want to detect any
conflicting accesses and resolve them somehow, for example, by stalling or aborting
the transaction.
And then we want to commit: if there was no conflict, we want to update
global memory. Okay.
So the question then becomes: are transactional memory and
GPUs compatible or not? And we believe the answer is that they
are compatible, although it's not at all obvious that they would be. So what I'm
going to talk about over the next few slides are all the problems that you encounter
when you try to take transactional memory and implement it on a GPU. And
there's several of them.
And what is our solution to that problem? Okay. So well, okay, so where are the
problems? How do they arise? First of all, GPUs are different from multicore
CPUs. One of the obvious differences we're talking about thousands of
concurrent threads, not tens. Okay. So what are the challenges here?
So challenge number one is it's SIMD hardware, okay? So as I was just
describing, the scalar threads are grouped into this thing called the warp or
wavefront and execute in lockstep. So here's some, you know, code,
transactional memory code written in assembly, so transaction begin. It's like the
curly brace opening up a transaction. Then some code that runs in the
transaction. And then transaction commit. And here's my group of threads that
are supposed to execute in SIMD lockstep.
You know, the obvious thing that's going to happen is, okay, they start
progressing through. And when I get to this commit part, only some of them
will succeed in the commit operation. So let's say the first three threads
were successful. They didn't have any conflicts with other transactions. So they
can commit. But T3 is unlucky, and it has to abort and retry. So we've now
introduced this control flow divergence problem.
Well, okay, so if you think back to what I was just talking about with that stack
based mechanism, there is sort of already a technique in modern GPUs for
handling this kind of control flow divergence. So the solution, which I describe in
more detail in the paper and I'm not going to have time to go into here, is to just
extend the SIMT stack to handle this problem for transactional memory. Okay.
So basically we treat this like a loop with some divergence inside of it. And we
have a very simple extension to the SIMT stack to handle it. Okay.
So a second challenge -- so I'm going to treat this one as solved, right. If you
guys want more details I can give you more details later. The second challenge
is transaction rollback, okay? So if I think about a CPU, okay, what happens
here? When I begin a transaction, I have to be able to restart it in case there is a
conflict. While I'm running in the transaction and updating registers I'm not
updating memory. So let's just focus on the registers for a moment.
When I start a transaction, a lot of research has proposed making a checkpoint of
the register file, effectively, at the beginning of the transaction, so if I have to abort
the transaction I can go back and restore the register file to the state it was in at the
beginning of the transaction and restart.
With tens of registers, okay, it's not nice, but it can be done. On a GPU there's
actually more register file than there is cache on today's GPUs. There are more
bytes of storage in the register file than in cache. So this is just not going to work,
right? Two megabytes of on-chip register file in Fermi, right? So
checkpointing that naively is just -- it's a nonstarter.
So we started studying, well, you know, what are we going to do about this?
And what my student discovered was that you don't actually
have to checkpoint a whole lot. Because mostly what we find is that registers are
overwritten at first use inside of a transaction. And there was some recent work
published on something very similar, called, I guess, idempotent points. So
this is sort of a general observation about code and it's true beyond
just transactional memory, so there's some ongoing work on using this for more
general purposes.
But here's how it applies for us. So here's an example transaction.
You can see that we're overwriting R2, right? So I overwrite R2. Let's say I
have to abort this transaction. Well, when I overwrote R2, if I were just naive
about this, I would checkpoint, or make a copy of, R2, right? But when I
restart this, the first thing I'm going to do is overwrite R2 again. So there's
really no value in checkpointing R2 at this point.
And when we looked at the kernels that we were running in this study, we
found there's only one of them that actually needed anything to be checkpointed
at all, and it only needed two registers to be checkpointed. And the fact that
other people are exploiting this property more generally, with
idempotent points, gives us some hope that this is something that we could
actually rely on.
So our solution is really just: let software checkpoint only what you need to
checkpoint. We're not going to blindly checkpoint all the registers. Okay.
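The rule behind "checkpoint only what you need" is that only registers read inside the transaction before they are overwritten (the transaction's live-ins) need to be saved; anything written first gets regenerated on retry anyway. Here is a tiny sketch of that analysis over a simplified instruction representation (my illustration, not the paper's compiler pass):

    #include <set>
    #include <vector>

    struct Instr {
        std::vector<int> srcs;  // registers read
        std::vector<int> dsts;  // registers written
    };

    // Returns the set of registers a software checkpoint must save.
    std::set<int> registers_to_checkpoint(const std::vector<Instr> &tx_body) {
        std::set<int> written, live_in;
        for (const Instr &ins : tx_body) {
            for (int r : ins.srcs)
                if (!written.count(r)) live_in.insert(r);   // read before any write
            for (int r : ins.dsts)
                written.insert(r);
        }
        return live_in;  // often empty or very small in the kernels studied
    }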
So the third challenge is conflict detection, okay. So if you look at existing
hardware transactional memory proposals, a lot of them say let's use cache
coherence protocols to detect conflicts. And the problem with that is, well, current
GPUs don't really have cache coherence. Well, Larrabee has some cache
coherence. But NVIDIA and AMD GPUs don't have cache coherence. And it's
not clear how well it's going to scale.
The other thing is, even if they have cache coherence, like say Larrabee, it's
not private per work item or per scalar thread as in the model that we have. It's a
cache shared amongst maybe hundreds of work items. And all of the hardware
transactional memory proposals that have been put out there assume that each
thread kind of has its own private cache, and the coherence interaction between
those private caches is what tells you about the conflicts.
So that idea is kind of out. Okay. What about signatures? That's another
proposal: use some sort of Bloom filter structure to detect conflicts. And that's
been shown as a way to get around having to modify the cache coherence protocol.
So we did some experiments with this, and we tried
various sizes. But, you know, I'll quote a data point here. Consider a signature
with 1024 bits per thread. That's a lot, but this is roughly the size you need. If
you use smaller than this, you have a really high what's known as false conflict
rate. Okay? So you have to scale up to this to get a reasonable conflict rate.
And I don't have data on the exact conflict rate that was, but it was something
reasonable, like 10 percent, say, false conflicts. And with the number of
threads you can have on a GPU, that's like four megabytes of storage just for
signatures. So this seems again like, wow, this is a huge amount of area to do
transactional memory.
So this looked like a showstopper to us. We weren't really sure how we were
going to solve this. So we spent a lot of time thinking, okay, how are we going to
get around this? And Wilson Fung, the student who was doing this work, had this
great idea of how to solve that. And before I tell you that great idea, I have to
also tell you challenge number 4.
Okay. The other thing is write buffering. So I talked about register files and
checkpointing register files. The other thing is, when I modify memory in a
transaction I need to buffer that up. I either need to buffer up what I'm changing
or I need to buffer up the changes I would like to make. So let's think about how
we might do that.
If I look at a GPU today and I look at one of these SIMT cores, right, on a
Fermi-style GPU there might be 1500 threads running on that SIMT core. And
the L1 data cache might be 48 kilobytes in Fermi, with 128-byte lines. So
there's about 384 cache blocks in the L1 cache on Fermi. And there's about four or
five times as many threads sharing that. So I can't really
buffer up my writes nicely in this small amount of storage. So how am I going to
handle that problem?
Okay. So these two problems, one being the conflict detection problem
and the other being this write buffering problem, we're going to try to
solve together. And the solution is value-based conflict detection.
Again, Wilson spent a lot of time scratching his head, figuring out how we were
going to solve this. And there is prior work on using this notion of value-based
conflict detection. It was really in sort of the software
transactional memory world where this was proposed initially. We realized it
actually makes a lot of sense for what we're trying to do here as well.
So let me just go through this example to explain what I mean by value-based
conflict detection.
So here I've got transaction 1. And it's going to do this atomic block of code that
is going to compute B is equal to A plus 1. And here's the assembly code for it
down here. And here I've got some private memory that's going to track read set
and write set. And then I've got my memory system. And initially the value of A
is 1 and the value of B is zero. Okay?
So if I run this code I'm going to load A. I'll put that into my read log. And then I'll
do the computation, which is adding one to it. And then I'm going to say
that I want to write B equal to 2. So before I can update the global state and
commit this transaction, I need to validate that this thing has not had a
conflict. So the way I'll do this in a value-based conflict detection system is I'll
check the value of A and see if it changed.
So if A didn't change, the assumption will be that there is no conflict and
I can go ahead and commit this update to memory. Okay. So this didn't change.
Our read set didn't change, so it's safe to publish B equal to 2.
Okay. So that's a somewhat simplified view of the system. Let's imagine there's
another transaction running; what would happen? So let's say it was running
concurrently and it read the value B equal to zero. It didn't see this new
updated value when it was running, it saw the old value. And, because this
transaction is computing A equal to B plus 2, it computed the
new value of A equal to 2. So what would happen here is, when we go to
validate this, we'll see that the value of B has changed, so there was a conflict. And so
we know we have to redo this transaction, abort it and restart it. Okay. So far so
good.
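Here is a simplified, sequential sketch of value-based conflict detection for a single transaction. It deliberately ignores the concurrent-commit hazard that comes up next; the read log records the value observed at each address, and validation re-reads those addresses and compares.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Memory = std::vector<uint32_t>;                 // toy model of global memory

    struct Transaction {
        std::unordered_map<size_t, uint32_t> read_log;    // addr -> value seen
        std::unordered_map<size_t, uint32_t> write_log;   // addr -> value to write

        uint32_t load(const Memory &mem, size_t addr) {
            if (write_log.count(addr)) return write_log[addr];  // read own write
            uint32_t v = mem[addr];
            read_log.emplace(addr, v);                     // remember first value observed
            return v;
        }
        void store(size_t addr, uint32_t v) { write_log[addr] = v; }

        // Validate then commit. Returns false if the transaction must abort.
        bool try_commit(Memory &mem) {
            for (auto &e : read_log)
                if (mem[e.first] != e.second) return false;  // value changed: conflict
            for (auto &e : write_log)
                mem[e.first] = e.second;                      // publish the write set
            return true;
        }
    };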
>>: So are you going to talk about silent stores?
>> Tor Aamodt: I wasn't planning to talk about silent stores. But if you want to talk
about the ABA problem, we can talk about it.
>>: Yes.
>> Tor Aamodt: Okay. Can we get to it in a second or -- okay. Maybe --
>>: Yes.
>>: When it's appropriate.
>>: Does the compare and the write have to be done atomically? Otherwise you
compare, somebody else goes around and writes, and then you make a mess.
>> Tor Aamodt: Okay. I'll get to your [inaudible] his first.
>>: Priority order.
>> Tor Aamodt: Just slide order. Okay. So what happens if we try to do both at
the same time, right, which is your scenario. Okay. So here -- did I go too fast?
Okay. So both these transactions executed concurrently, all right? So
transaction 1 computed B equal to 2. Transaction 2 computed A equal to 2.
And they go and they validate very quickly, right, and they both match. And so
they both get the go-ahead. And it looks like we can go ahead and update. But
this is bad, right? This is not a valid ordering.
Because the programmer expects transaction 1, then transaction 2, which
would give us this result in memory, or transaction 2, then transaction 1,
which would give us this result in memory. Okay? So the one we got, where
both values are equal to 2, doesn't match either. So this is bad. We have to fix this.
So the way we're going to fix this problem is -- well, the naive way to do it is to
serialize validation, okay. So do one thing at a time, basically. So if I just did my
validation and commit for each transaction serially, then I wouldn't
actually have this problem. So V here is validation, C is commit. So once this guy
finishes, I'll detect this race condition and it will be fine.
So benefit number one: no data race. Benefit number two: there are some
esoteric things, like livelock, that you can get into with this kind of system, and you
avoid those. So those are great.
The huge drawback here is we've serialized even the nonconflicting transactions,
so we've just killed performance. There's now no point in doing any of this. So
huge collateral damage. So this certainly won't scale up to tens of thousands of
threads. So the solution for this is speculative validation.
So we're going to split up the conflict detection into two parts. Okay? One is
we're going to do conflict detection against recently committed transactions, and
we're going to do that in parallel. Okay? And so what does that look like? So
here's a bunch of transactions. They all want to validate. They're all sort of
coming in at the same time. And so we'll compare all of the read-sets in parallel with
what's in global memory here. So we're checking all the read sets. They all
match. And that looks good.
So they pass this. Now, they're not comparing against each other. They're just
comparing against transactions, like transaction zero or whatever came before
transaction one, that have already updated memory, but recently. And
if a value changed, we'll see that there was a conflict.
Okay. So step 2 is: what if one of these transactions updates something
that another transaction we just validated in parallel also touched? What if there's a conflict
between these guys? For that, we're going to add some hardware
to detect that case. Okay? And when we do detect that case, we're going
to stall one of the transactions and try to revalidate it against memory a little later.
Okay. So I don't think I have slides on the details of how we do that
implementation. There's much more detail in the paper; it's basically a
new type of Bloom filter that we've proposed, called a recency Bloom filter, to
handle this.
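As a simplified, sequential stand-in for what the commit unit does, here is a sketch of the two-step check, building on the Transaction/Memory sketch above. Step one is the value-based validation against committed state (done in parallel across transactions in the real hardware); step two checks the read set against addresses written by transactions validating at about the same time. In KILO TM this second check uses the recency Bloom filter; an exact set is used here for clarity, and the abort-versus-stall distinction is collapsed.

    #include <unordered_set>

    struct CommitUnit {
        // Addresses written by recently committed / in-flight transactions.
        // In real hardware these entries age out (hence "recency").
        std::unordered_set<size_t> recently_written;

        // Returns true if tx committed; false means abort, or stall and revalidate.
        bool commit(Transaction &tx, Memory &mem) {
            // Step 1: value-based validation against committed state.
            for (auto &e : tx.read_log)
                if (mem[e.first] != e.second) return false;

            // Step 2: hazard check against concurrently validating transactions.
            for (auto &e : tx.read_log)
                if (recently_written.count(e.first)) return false;

            // Passed both checks: publish writes and record them as recent.
            for (auto &e : tx.write_log) {
                mem[e.first] = e.second;
                recently_written.insert(e.first);
            }
            return true;
        }
    };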
So here's a summary of the hardware changes. And then I'm going to jump to
talk about this ABA problem to get to Doug's question. So the changes are: I
mentioned the SIMD problem with committing, you know, some lanes may commit,
some may not, so we have to make some modifications to the SIMT stack
hardware. We need some hardware changes to track our read and write sets.
And we have a commit unit that implements this parallel validation. Okay.
So before we go into the results here, let me just quickly talk about the ABA
problem. Okay. So what is the ABA problem? The classic
example is this linked list thing. And this is where I'm trying to do some very
fancy stuff with atomic operations so I get around using locks, basically.
So imagine I want to do push and pop operations, but I don't want to have a
global lock on this linked list. So one idea that's been proposed is to do something like this.
I go in, I check: top is the first thing here and next is this next pointer.
And I'm going to make copies of these two things. And then I'm going to do this
atomic compare and swap. So I'm going to say, okay, if top didn't change, if it's
still pointing to whatever it was pointing to before, which is A, then I'm going to
swap this thing out, get rid of A, and have top now point to B. So
this is how this code is supposed to work. Okay?
So I popped A off the list because A didn't change. And I want to pop -- and I
want B to be the next pop. Okay? That's all well and good until another thread
comes along and does the following. Pops A, pops B, and then pushes A back
on. Okay?
So now I've got two threads sort of racing through this code at the same time.
And we can get this really weird result. So what's
supposed to happen, or what will happen here, is this other thread finishes: it
popped A and B and then pushed A. So this is now the state of the linked list,
okay? Now, thread 0 comes back and does this compare and swap, and it sees,
well, top is still pointing to A, okay? So now I can swap this out, right? And then
I get this really weird thing, which doesn't make any sense here, right? Because
really what I wanted is: if thread 1 got there first and did this sequence
of operations, and then thread 0 comes along and executes its pop, I should have
gone from A, C and just gone to C. But somehow B popped up on the list again,
right? So that's the ABA problem, or the first version of it that was found.
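For reference, here is a sketch of the naive lock-free pop being described, with a comment marking exactly where the ABA window sits. This is deliberately the broken version; a real implementation would need tagged pointers, hazard pointers, or similar.

    #include <atomic>

    struct Node { int value; Node *next; };

    std::atomic<Node *> top{nullptr};    // head of the lock-free stack

    Node *pop() {
        Node *old_top, *next;
        do {
            old_top = top.load();
            if (old_top == nullptr) return nullptr;
            next = old_top->next;
            // <-- ABA window: another thread can pop A, pop B, push A back here.
            //     top still points to A, so the CAS below succeeds and installs
            //     B as the new top even though B is no longer on the stack.
        } while (!top.compare_exchange_weak(old_top, next));
        return old_top;
    }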
So the intuition for why we don't suffer from this problem is basically --
I mean, the real question is, what is the problem? Okay, compare and swap
operated correctly. Right? It did the comparison. And it swapped things. Right?
The real problem here is the programmer didn't think of this race condition.
Okay? So our assertion is that if you are running transactions and you write
code like this, here's what happens. If you write
a transaction, you're going to basically guard all of this stuff. You're not just
going to guard this one little thing up here, you're going to guard all these things
that you're touching in a transaction. You're going to detect the conflict.
Okay. So that's a very hand-wavy, informal explanation. But we've been
really rigorous about this, so we wrote a correctness proof. Doug? Okay. You've got
to queue our code reader. Okay. We have a tech report here. Basically we can
show that we get conflict serializability.
And so we're pretty happy that you don't have that problem. Or we're pretty
satisfied we don't have that problem. Or another way of saying it is that we tolerate
it. That's what we say. ABA can happen and we can still
serialize the transactions. So as far as you're concerned, it didn't happen.
>>: Yeah, the --
>> Tor Aamodt: Sure. Yeah. There's a conflict.
>>: There's a record, yeah.
>> Tor Aamodt: Okay. Let me go back to where I was, which is to talk about --
>>: Thank you.
>> Tor Aamodt: Yes. Okay. So let's talk about data. Okay. So how well does
this work? So we used our simulator, GPGPU-Sim, which we've developed at
UBC. It's available under a BSD license, so anybody can use it. There's the URL.
The version that we used for this study we correlated against a GT200, and the
operations-per-cycle correlation is pretty good, 0.93. And we implemented the
changes to model KILO TM and did some, you know, pretty detailed modeling of
when values actually change in memory, to try and catch any problems like an
ABA problem happening, if it were to happen.
So we were kind of defensive in our programming to try and catch those kinds of
problems.
And we have a very detailed model of it as well. Okay. So the GPU applications that
we looked at: we have some essentially microbenchmarks here. A hash table,
and we have two versions, a high-conflict and a low-conflict version.
Then we have something that models moving money between two bank
accounts. So we call that ATM.
And then we've got some code from a games company that was
collaborating on this research. It's a cloth physics demo written in OpenCL. And
it's an interesting one because they couldn't scale it up, because it was written for
a CPU to run on the SSE units of an x86. And if you tried to scale it
up beyond this, you would have race conditions in it. So it was actually an
interesting candidate to try this out on.
Barnes-Hut, which is the N-body problem; it uses atomics.
CudaCuts. This is interesting because this is the actual vision problem that
I had these undergrads trying to solve. Somebody else
solved it and published a paper, while we didn't.
And then a data mining problem.
>>: Did any of these problems do a lot of [inaudible] updates in data mining is a
[inaudible] classification.
>> Tor Aamodt: So the data mining one is doing Apriori, which is sort of the
Amazon thing, you know: if you want this, what are other things you might want, based
on what other people bought.
Okay. So performance. I've got a few different ways of looking at
performance. And in this view what we're plotting is speedup over serialized
transactions. So what you can see here is, if I do an ideal transactional memory,
we're getting close to 300X over serializing all the transactions. Ideal
here has no overheads for the conflict detection and for committing the
transactions. Things just happen really quickly.
We're comparing that against fine-grained locking here. So this is a version of
the code that has locks for all the critical sections. And so you can see here that
transactional memory can do a little better than fine-grained locks. And that's
sort of a well-known result from transactional memory.
Okay. So now let's look at our detailed model. In the prior slide I was just
looking at this sort of idealized transactional memory. So the middle bar here is
our detailed performance simulation data. And we're comparing against both
fine-grained locking and the ideal TM; here I'm normalizing to the ideal TM. And
we're looking at normalized execution time. So in this graph, higher is actually
worse. In the prior graph, higher was better.
And there's a few things you can note here. So first of all, ATM. Okay. So for
the ATM one you can see that we're actually doing better. So with the ideal TM
you might think, okay, I can believe that that would be better than fine-grained
locking. Here, with all that stuff that I've described, we're still doing better than
fine-grained locking. And the reason is that when you write a critical section for
this particular application you have to acquire multiple locks, and there's a lot of
extra memory traffic that's being optimized away by doing transactional memory in this
case.
For the other examples, the trend is that we're going to do worse than fine-grained
locking. So these examples here are the ones that did the worst.
There was significant contention, and so that invoked the transactional memory
a lot more.
Another thing you can see here is that in some of the examples the fine-grained locking
actually did better than the ideal TM, and that's because in some of these
algorithms the fine-grained locking might be a
little bit more optimized. So the CudaCuts one is not actually
acquiring a lock, it's just doing an atomic add, for example.
And in Barnes-Hut they have try-lock type code, which basically, if it fails, will
do something else. So it's a bit more computationally efficient as well.
Okay. If I go to lower contention, so this is the lower contention hash table,
things get better. So performance tuning will help. Okay.
>>: Can you not easily go into something like Barnes-Hut and replace the
fine-grained locks with small transactions, and shouldn't that show better
performance? Because you're not -- assuming the cost of rollback is -- it's like
a transaction that's free, which it's not, you should never [inaudible] right.
>> Tor Aamodt: Right. So I think that code has been really heavily optimized
and -- yeah. Should we be better? Well, we're not.
>>: You might be -- your transactions and your locks may be at different
granularities [inaudible].
>> Tor Aamodt: I think there was some trickiness here. There was, you
know, some trickiness about isolation here, where we have sort of a weakly
isolated transactional memory system. So we had to make our transactions
a little bigger than you would want if you had a strongly isolated system. So I
think that might also explain a bit of it.
Okay. So the bottom line here is, with the performance model that we have, we're
getting to within 59 percent of fine-grained lock performance. And
one thing I want to point out is that when we compared our baseline, that is, our
fine-grained lock performance and our atomic performance, against a real GPU, it's
actually about four times faster than an NVIDIA GPU like Fermi.
So, you know, we may actually be faster. If you believe our performance
model for transactional memory and you don't believe it for atomics, then, you
know, it's sort of like comparable performance, is the way I would say this. That's
what I think it boils down to.
Okay. Implementation complexity here. So we went through and did a detailed
analysis of how much stuff you have to add. And basically the bottom-line number is
somewhere around one percent of a current GPU. We did some of this with
CACTI. Wilson has gotten a lot of flack about that, because one of our faculty
members was one of the first people to work on CACTI, and, you
know, knows that it's not very well calibrated at low numbers like this.
So he went off and actually used a memory compiler to come up with similar
numbers, you know, within a factor of two of the CACTI numbers for the area.
So, you know, if you believe this is the hardware you need to build,
then we think this is the overhead, how much it's going to cost
you.
So just to summarize the first half of this -- or the first three quarters of it, however
it works out. We have thousands of concurrent scalar transactions.
And basically we have no cache coherence protocol. Our KILO TM handles
scalar transaction abort.
I didn't mention this before, but it's word-level conflict detection, not cache
block level conflict detection. I didn't mention this either, but essentially in terms
of correctness it can handle unbounded transactions. Performance-wise, there will
be a performance cliff when you go to really large transactions that exceed the
size of the hardware structures, but we'll still run them. And I just mentioned the
performance. Okay.
Okay. So I'm going to move on to the next part of this talk, which is running
Memcached on a Fusion APU. I don't know if there were any questions before I
do that. Okay. If not. So here's a picture of a Google site which, I guess, is
a famous picture. There's this big river to cool down these data
centers, right?
And so the focus of this work was, you know, can we reduce power and make things
more efficient by running them on GPUs, given that these data centers are using a lot of power?
And we're interested in non-HPC. For GPUs it's pretty well established that if you
have a high performance computing type application that's very regular,
a dense matrix-multiply-esque type of application, it's going to map onto a GPU pretty
well.
But there's a lot of workloads that are economically, you know, much more
important that don't map so well or do not fit that picture. And the question is,
could you take those applications and run them on the GPU? And most
people's intuition is, well, no, right? And so we set out, because we're
researchers, to evaluate whether that is the case. And we're not the only people
looking at this space. At this year's PPoPP there's a number of
papers where people are doing really wild things on GPUs, like alias analysis and,
you know, breadth-first search, and getting really impressive results.
So I'm going to look at this particular application now. So, okay, we're talking
about Memcached. So here, to motivate this, is some analysis of what I get for
SIMD efficiency on a GPU. If I'm looking at an application and I'm the
programmer and I'm deciding should I invest several months of effort to try and
take this code and port it to some new platform, I'm going to look at that code
and say, well, does it look like something else that has run well on a GPU?
And so what I might do is look at control flow, for example, and have
some sort of intuitive understanding that a lot of control flow doesn't map too well,
because every time there's a branch I lose some -- you know, I mask off half of
the threads, et cetera. And so if I just assume each branch has a 50/50
probability of going left or right, this is the kind of SIMD efficiency I
would expect out of Memcached, okay? Out of the key lookup kernel in
Memcached. SIMD efficiency is basically what fraction of my lanes on any
given clock cycle are actually going to do something when they do anything.
So I'm factoring out waiting for memory here.
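One reasonable way to write that metric down (this is my formalization, not necessarily the exact definition used in the paper) is:

    SIMD efficiency = (sum over issue cycles of active lanes) / (number of issue cycles x SIMD width)

So a warp that issues with all 32 lanes active every cycle has an efficiency of 1.0, and one that averages 8 active lanes out of 32 has an efficiency of 0.25.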
And this actual bar here is what you get when you actually do this, when you finish the
exercise and you have the code ported over and run it. You see that we're at
several times higher efficiency. It's not perfect. There is branch divergence. But
it's not this low number that your intuition would give you. Okay.
So what is Memcached? Facebook uses this. This is a slide from
their keynote at HPCA-PPoPP. So in the system I've got a bunch of web
servers. I've got some storage with all my database stuff. And I want to speed
up accesses from the web server to the storage. And so I'll put some sort of
software caching thing in here. That's what Memcached does. I'll go and I'll look
and see if what I'm looking for can be satisfied out of this thing really quickly. If
not, then I have to go and find it in my other storage.
And those right there are Facebook's numbers. It helps a lot, so they do it.
So what we want to do is actually just take a look at this and just speed this part
up here. Okay.
So the question is, is this compatible with a GPU? So this is the control flow
graph of the key value -- key value lookup handler in Memcached. This is
actually just the hash function part. And this is, you know, like one of these dot
graphs and I think something is hiding part of the graph here. But it's pretty
messy, right? There's edges all over the place. Really highly irregular control
flow.
So can you actually get this thing to run on a GPU? This is the kind of thing
where you look at it and you go, nah, that's not going to run on a GPU. GPUs are
SIMD; everyone should do the same thing, right?
So that doesn't look very good. Not only that, but every thread is going to be
accessing something different. So it's also got irregular memory access
patterns. It has a large memory footprint and is highly input-data
dependent, right? So can we actually do this?
So this is the result of our study. Basically ported it over. Our students ported it
over. And so this is kind of the way they did it. Basically we only ported the GET
request. So that's reading the database. Updating it is still done on the CPU.
And so this is sort of like a schematic view of what we ported over. So GET
comes in, we do some hash function on it, we look up some array and we get out
a key there. If the key matches what we're looking for, then that's great, we have
a hit. Otherwise, we do some chaining to look for it further around in this hash
table. And the end result is just a hit or a miss. And an index into this table
where it is. Yes?
>>: The values are on GPU memory?
>> Tor Aamodt: So the values are in some separate array, okay? And what
we're going to return from this is just a pointer to that value. The value
could be really big, right? And we're just going to give you the offset into some
data structure that holds the value.
Okay. So how would we port this? If you just did each individual GET request,
it's not going to work out too well. So we're going to batch up requests, some
number of hundreds or thousands of requests batched together. And we're
going to run each one of these as a single work item in an OpenCL kernel, okay?
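To make the one-request-per-work-item structure concrete, here is a rough sketch of such a lookup kernel. It is written in CUDA rather than OpenCL, and the data layout (bucket_heads, slot_keys, slot_next, fixed-size integer keys) is purely illustrative, not Memcached's actual structures.

    struct Request { unsigned int key; };         // simplified fixed-size key

    __device__ unsigned int hash_key(unsigned int key, unsigned int n_buckets) {
        return (key * 2654435761u) % n_buckets;    // stand-in hash function
    }

    __global__ void batched_get(const Request *reqs, int n_reqs,
                                const unsigned int *bucket_heads,  // per-bucket first slot
                                const unsigned int *slot_keys,     // key stored in each slot
                                const unsigned int *slot_next,     // chain links (0 = end)
                                unsigned int n_buckets,
                                int *hit, unsigned int *value_index)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_reqs) return;

        unsigned int key  = reqs[i].key;
        unsigned int slot = bucket_heads[hash_key(key, n_buckets)];

        // Chain walk: different threads take different numbers of iterations,
        // which is one source of the branch divergence discussed a bit later.
        while (slot != 0 && slot_keys[slot] != key)
            slot = slot_next[slot];

        hit[i]         = (slot != 0);
        value_index[i] = slot;      // offset into the separate value storage
    }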
So what were we trying to do? We wanted to increase request
throughput and keep request latency reasonable. With this batching, of course, you're
waiting around for new requests to come in, so it's going to increase the individual
request latency, obviously.
And so what are the challenges? Well, I've mentioned them already. This
irregular control flow is a big part of it.
Our methodology here was we took some recent AMD Fusion APU parts. We
also took a discrete AMD Radeon GPU, a high-end part, and we ported the code
over to OpenCL. We also ran versions of it on our simulator to get a little bit
more insight into what was going on. And we also created a sort of Monte
Carlo simulator to run the control flow graph, to get that view of what the
programmer's intuition is going to look like.
And our input to Memcached was traces of Wikipedia accesses that have been
published online. So we pretended that we went into Wikipedia and added a
Memcached server to it.
Okay. So, one request per work item. Basically data can be anywhere in
memory, so the question becomes, what is
the performance of this? So one of the first studies we did was, okay, how much
benefit would there be from improving the memory system? What
we're actually trying to figure out here is how much of this initial bad
performance might be due to, say, the memory system versus control flow. And so
this study is just showing what would happen if you changed the memory
system. Okay. The end goal of this project is trying to make
the performance for this even better by changing the hardware.
But here you can see what happens if I had no cache, and as I go up to a one-megabyte
fully associative cache, and then no memory latency and no memory stalls. So this is
running on a GPU. And this is percentage of peak IPC.
So you can see that there's definitely benefit to using caching. Right? So in
our kernel we've actually put some effort into using the cache. But it kind of
saturates here. The details about these two last bars are that they're different versions
of an ideal memory system, where in the first one basically -- well, the last one here
is when I take out all stalling due to memory. Basically if I do a memory access,
even if it accesses one different cache block per lane, I'm going to assume that
takes one cycle, essentially. And that peaks out at about 32, 33 percent of
peak IPC. The rest is control flow.
>>: Are you majority [inaudible] because different elements have different chain
lengths in the hash table elements?
>> Tor Aamodt: Yes. I think that's a big part of it, yes. Okay. And actually the
key -- I'm not sure if the keys can be different lengths too. But okay.
So let's just look at this control flow problem again. So here's our control flow
graph. And here's the different work items. And they can diverge, right? So this
is just restating the divergence problem. Work items 1, 2, and 5 go here. Work
item 3 and 4 go over here. And if that, you know -- if that's actually the case, it's
going to reduce SIMD efficiency.
So here is some analysis. And this is giving a little bit more detail on that
graph I showed you before, of intuition versus actual. So I've got two versions of
intuition here. And this is a slightly different plot. This is showing, for each version
of Memcached, or each way of running it, a breakdown of where cycles go. So
this thing at the bottom here is saying that on these cycles, this, you know, 10 percent
or a little less of cycles, there's no divergence, okay? So I'm using all of my SIMD
lanes. At the top here, this vast big number up here is saying I'm getting, you
know, like 60 percent of the cycles where only between 1 and 4 out
of, say, 32 lanes are actually switched on. So this is highly inefficient execution
up here, due to multiple nested branches.
So the different bars here are showing -- the first two are different versions of
intuition. And the last one is what actually occurs. So this is from detailed
simulation. And these are just sort of different versions of the programmer's
intuition.
The first one is the bar I showed you before, which is let's assume every branch
is 50-50 taken or not taken.
The next version here is a little bit more refined. It's where I'm saying, well, as a
programmer I know some code will never run. Like it's an exception
condition or an assertion or something. I know that will never run. So if the
programmer can reason that that won't happen, I can get a slightly better, tighter
bound on how things will run. But as you can see, that doesn't change things a
whole lot. And when I compare it to the final actual result, there's
still a huge gap.
So 62 percent of, you know, execution will be highly diverged if I just sort of
have the pessimistic view of what control flow is going to do. If I refine it a little
bit, 51 percent is heavily diverged, but in actuality only 29 percent is this
heavily diverged. Yes?
>>: I need some intuition. If the brown one assumes 50/50, why would the green bar
even be visible? It would seem to be roughly one in a billion. A few in a billion. But
not visible.
>> Tor Aamodt: Yeah. Well, okay. So it really depends on your nesting depth,
right? I could go back to that graph, and it's really how many nested branches do
I have. So if I nest only once, it's going to be half of them, right? So this is
saying that maybe we only nest about four times or something? Yeah. Yeah.
>>: Why do you care about only 1 through 4? I mean.
>>: It's just a way to summarize it. I think the way I had it initially, this
average efficiency, what's the average number of units active, is probably
the right number.
>>: Do you know what is the point that this -- this [inaudible] CPU? Is it like if
your efficiency falls [inaudible] 16 [inaudible] GPU?
>> Tor Aamodt: I would argue it's probably around -- yeah, look at the clock
frequencies and say, you know, maybe -- well, it's hard to do, right? How many
cores do you have on the CPU, how many cores do you have on the GPU,
depending on what you call a core? Right? So what I'm calling a SIMT core, in
Fermi they have 16 of those, right? So if I have a 16-core GPU, then I
need, you know, basically -- and if that's 4-wide issue, then I need 4 lanes active for each
of those cores. So, right, like 8 percent or something. Right?
And we're doing a lot better than that, right? Plus we have a higher bandwidth
memory system. So a lot of these applications like memory bandwidth. So, yes?
Okay. So this is just a slide talking about how we handle the memory
transfers. So the students wrote a little dynamic memory manager to do this, to allocate
memory, and depending on how you're doing this you either transfer memory
between the CPU and the GPU, or if it's a Fusion part you don't actually have to do
that transfer.
Even on a Fusion part, on the current systems you'll get different virtual
addresses on the GPU versus the CPU. So that's a bit of a wrinkle that you have
to deal with. So this is just talking about how we handle that.
So on the Fusion systems we have physically shared memory, and we use the zero-copy
API. On the discrete systems we actually have to transfer things. And
the data I'm going to show is sort of an optimistic version of that data. So
in reality, with a discrete system you'd probably be slower than what we're
showing here.
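As a sketch of those two data paths, here is the analogous idea written with CUDA's APIs for concreteness (the work in the talk used AMD's OpenCL zero-copy path on Fusion; this is not their code). On an integrated part the GPU can read a mapped host buffer in place; on a discrete card the request batch has to be copied across PCIe each time.

    #include <cuda_runtime.h>
    #include <cstdlib>

    // Allocates a host-side request buffer and produces the pointer the GPU
    // will use. On older setups the zero-copy path may also require calling
    // cudaSetDeviceFlags(cudaDeviceMapHost) before any CUDA context is created.
    void *alloc_request_buffer(size_t bytes, bool integrated_zero_copy,
                               void **device_view)
    {
        if (integrated_zero_copy) {
            // Zero-copy: pinned, mapped host memory the GPU accesses in place.
            void *host_ptr = nullptr;
            cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);
            cudaHostGetDevicePointer(device_view, host_ptr, 0);
            return host_ptr;
        } else {
            // Discrete card: separate device buffer, explicit transfer per batch.
            void *host_ptr = std::malloc(bytes);
            cudaMalloc(device_view, bytes);
            return host_ptr;
        }
    }

    // Per batch, the discrete path pays for this copy (and one for the results):
    //   cudaMemcpy(*device_view, host_ptr, bytes, cudaMemcpyHostToDevice);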
Okay. So here's performance versus the CPU. So an AMD Llano is the
baseline for the first two bars here that we're comparing against. And we're
just looking at the kernel for the first bar here. So if I look at a discrete GPU and I
just look at the key -- the hash function part of it running on the GPU, it's like 30-
something times faster on this, you know, big monster card of a GPU.
If I include the data transfers, which is getting my requests onto the GPU and then
getting the response back, which is just a pointer to the answer, not the whole
value, just the pointer to the answer, it's actually slower. This is lower than one; it's
0.88 or something.
So this is like not -- this is not winning. This is losing, right? In reality.
If I look at the Fusion part, where I don't actually have to transfer stuff around,
then it's actually a pretty good win. Here, 7 and a half X better. This is on a Llano
Fusion APU, which I think is the latest one. And on the lower-powered
Zacate part we're still doing better than one. So --
>>: [inaudible] difference between the [inaudible].
>> Tor Aamodt: Yeah. Yeah. So if I just look at the bar that does not
include data transfer, yeah, this bar is much higher. And the reason for that is
this has a lot more peak gigaflops, you know, much more compute horsepower.
Yeah. But the data transfers kill you, right? So having this integrated
memory system looks like a pretty big win.
Okay. So here's just another view of the data. This is just showing you the
execution and transfer overhead, right? So here, all of your time is going to
moving stuff around. When you go to Llano, you almost can't see it.
For the Zacate, for some reason, when you map or unmap, like making it visible
to the GPU or not, there's some overhead.
Okay. Here's another view of this. So I talked about batching. So the question
becomes, well, how much batching do I need to do? So you can see here a few
data points. So if I look here, I'm looking at the normalized throughput. And this
is showing me that if I basically batch up about 10,000 requests, I'm kind of
saturating how much throughput I'm going to get. That's this curve over here.
Of course, as you batch up more, there's going to be an increase in latency. So
these units are sort of dimensionless, because this was work done with AMD and
they didn't really want us publishing absolute numbers. I guess it's an
industry thing, right?
But this number here for the latency is 0.5 milliseconds. So to give you some
idea of the latency. And what we understood is that that's sort of a reasonable
latency. Okay.
So the highest throughput relative to latency was around 8,000 requests, at that point
there. Okay. So a quick summary here for the second part. Our
assertion is programmer intuition doesn't always paint the whole picture. And I
don't think we're the only people who have seen this. As I said, at PPoPP
there's multiple papers on taking applications you might not think would fit on a
GPU and actually showing that they can run reasonably well.
So, you know, how does that work? We exploited the available parallelism on a
GPU by batching up requests. And, you know, the result we got here is a 7.5X
performance increase on the Llano system.
You know, we think you could do better on the -- on reducing the latency.
There's some work that we're aware of that follows on this that tries to do more
fine-grained spawning of the work on to the GPU. So hopefully that will -- you'll
hear about that soon.
And so basically data transfer overheads can have a large impact.
All right. So is there any other questions? I hope there's some time. Yeah.
>>: Can you go to the previous slide?
>> Tor Aamodt: Sure.
>>: The batch that says 8,000 requests, does this number include the amount of
time that you need to wait for 10,000 requests to come in?
>> Tor Aamodt: I don't think it does actually.
>>: [inaudible].
>> Tor Aamodt: Right. I think actually it might not include that. That's -- I was
grilling him about that two days ago. Why aren't we doing that. But as I
understand it from when they were profiling the overall system, they -- I guess
they got the impression that wasn't going to be a huge amount of it. So, yeah, I
think the answer there is it's not, which is we should include it, obviously.
Although -- yeah. That's the answer I'm sticking with right now.
>> Doug Burger: Any other questions? Okay. Thank you very much.
[applause]