>> Adam Eversole: My name's Adam Eversole and I've been around these parts in Microsoft for
about 24 years now, started out in the applications group and -- my wife's calling me for some
reason. Just a second. Fix that. All right.
So I've been here for a while. I've been in ADT, the Advanced Development Team, here in
Research for the last 10 years. I'm one of the first two guys that started it all off under Gavin,
known as Gavin's group by many.
Anyway, I'm here to talk a little bit about parallel -- exploiting parallelism in native code. So
everything here is going to be about C++ish stuff. So I'm not going to really touch on anything
from C#. There's enough to go over in native code land to take at least the hour and a half that I
have. So without further ado, we'll get underway.
And the reason why I'm giving this talk, one of the reasons is because I have recently been
working on a machine learning project where we are using multiple GPUs per server in order to
speed up training. So I've learned a lot about this and I'm very interested in the topic, so I
thought I'd give a talk on it.
First of all, we don't just use GPUs or what I like to call accelerators. In fact, the cards that we
use in the servers to speed up our application for training don't even have display ports on them.
They're Tesla cards and that's their only purpose in life, is to be accelerators.
And so, first of all, we have some kind of problem we want to solve. And
in order to optimize any application, you have to start with understanding the problem well and
then coming up with some sort of design and coding the solution. And once you have a working solution,
then you start optimizing it.
It may be that depending on your application you can't use five Tesla GPU cards because they
just aren't available. But depending on what hardware capabilities and limitations you have, that
lets you decide what type of optimizations you can do as far as parallelizing code and finding the
best method to make the code parallel.
So I'm going to take a sample problem which everyone understands pretty well, matrix multiply.
So I think everyone understands how to do a matrix multiply, row times column gives you the
cell that you're looking for.
So it's pretty well understood what the problem is here. I'm going to be using square matrices for
everything even though that's not necessary because the application that I use generally uses
square matrices. It's in the hidden layers inside of a deep network, a neural network model.
All the hidden layers use large square matrices as a place to hold all the state for the model. So
you can look at this either as a summation or, if you prefer, the vector notation. It's the dot
product of the row and the column.
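For reference, the summation and dot-product forms being referred to are just the standard definition of matrix multiplication:

    C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} = \mathbf{a}_i \cdot \mathbf{b}_j

where \mathbf{a}_i is the i-th row of A and \mathbf{b}_j is the j-th column of B.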
So, first of all, you come up with a coding solution for whatever the problem is. Here's the naive
solution, the one that probably all of you could code without thinking about it. You just go
through the rows and the columns and you multiply them together and you put it in the resulting
matrix. So there's not too many mysteries there. Here's the main line that does all the
computation.
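As a rough sketch, the naive triple loop being described would look something like this (assuming square, row-major matrices stored in flat arrays; the names are illustrative, not the actual slide code):

    // Naive matrix multiply: C = A * B, all n x n, row-major, flat arrays.
    void MultiplyNaive(const float* A, const float* B, float* C, int n)
    {
        for (int i = 0; i < n; ++i)              // rows of A / C
        {
            for (int j = 0; j < n; ++j)          // columns of B / C
            {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)      // dot product of row i and column j
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
        }
    }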
So once we get something that we know works and follows the algorithm that we understand,
then we can start thinking about what we can do to improve it. So the first thing you probably
want to do is using some reasonable dataset, something that's like what you're expecting to be
used on your algorithm, you use tools such as profilers and other tools that are available to find
the hot spots and find the places where you can optimize the solution.
So do use a profiler. In this particular case it's not very interesting because there's not a lot of
calls. It's one call. So if I'm using an instrumented profiler, I get one call and a hundred percent's in
there.
But in a real application, there's many -- you look at the profile and you find out where all the
time is being taken and that's the part you optimize. Everyone's heard of the 80/20 rule. 80
percent of the time is spent in 20 percent of the code. And that's the part that you want to
optimize.
And you have to think about processor architecture, where I'm going to be talking about multiple
types of processors here. We have our Intel type of processors that have huge amounts of cache
and are optimized to do things really quickly, you have high clock speeds, and then you have
GPUs which are the other end of the spectrum.
Regular CPUs do branch prediction and they have all the silicon to make things as quick as
possible for sequential code. GPUs, on the other hand, have very simple cores but they have lots
of them. So they're optimized in a different way and you have to program for these different
things in a different way.
And you have to look at the hierarchy of where the data comes from. If you follow the data in an
application, you'll often find where you need to work. Paying attention to the data hierarchy is
important because, as you all know, as you go down the data hierarchy, things get faster by
orders of magnitude.
Disks, you don't want to hit them at all if you can help it. Even with the new SSDs, if you have
to go get something off disk, it's going to be orders of magnitude slower than if it's in main
memory. And, again, if you can get it from cache, that's going to be orders of magnitude faster
than if you have to go to main memory.
And newer implementations of processors that have multiple cores often have at least two if not
three levels of cache on chip. Here's a diagram of an i7 -- actually, yeah, this is i7 cache
architecture. So the i7s have L1 cache that's split into instruction and data cache, 32k of each.
Then you go to L2 cache, which is data and instruction combined, 256k, then you have 8
megabytes of L3 cache which is shared among all four cores.
So understanding the architecture and what you have to do to keep things in cache can make
your application faster. Especially if you're dealing with algorithms that are going to be
executed billions of times.
So what you want to do is make sure that your algorithm is cache friendly and won't
throw things out of cache unnecessarily. So you want to stay in L1 if you can. If not that, then
try to stay in L1 and L2 on the same core. And then if you get down to L3, it gets a little bit
slower. If you have to go to main memory because it's not in L3, your performance just
plummets. Cache lines are 64 bytes nowadays on the most common architecture. So keep that
in mind.
So microprocessors, as you probably know, haven't been getting faster here for the last, I don't
know, eight years or so. They just decided to add more cores instead. So 3, 4 gigahertz is about
the fastest you'll see mainly because of power constraints. If you crank up the -- as anybody
who's overclocked a CPU knows, if you crank up the clock speed, it's going to hurt you and
you're going to have to cool it quite a bit to make it work at all.
So they've pretty much stopped making them faster and just the transistors still accumulate, but
they're going to other cores and cache and other things like that.
So Moore's law still continues. Transistors are still expanding on the normal curve, but they're
going to multiple cores instead of faster clock rates and things of that nature.
And we're going to have to -- if you haven't already figured this out, you have to program
differently when you have four cores than if you only have one. So optimizing the algorithm. In
this particular case, we're using matrix multiply. And before we even get into parallelizing
something, it's best to get the best thing done on one thread that you can.
So in order to optimize this and make it more cache friendly, if we look at the algorithm, we
notice that we're going through an entire row and entire column each time. If our matrix is very
large, we're just going to blow everything out of cache before the next iteration. And we're not
really utilizing cache as efficiently as we should be.
And if they're both row major, then one of those matrices will be sequentially accessed. The
other one will not be; you'll only touch it once every however many columns there are in that
matrix. You'll be accessing it in that fashion.
So the cache utilization is pretty low, especially when you're going column by column.
Assuming we're in C++, we've got row-major matrices or arrays.
So the first thing we want to do is segment the algorithm into chunks that are cache friendly, do
each chunk one at a time, and then do the rest of the algorithm later to add up the chunks
properly. And the matrix multiply algorithm fits very nicely into this. We can do each -- we can
take a chunk of the two matrices that we're multiplying together and just do a submultiply,
pretend like they're two smaller matrices, and do the operations in just that little chunk of the
matrix and put the result in the location. We know it's going to be in the resulting matrix. And
then the next time, next two chunks we do, we just add to whatever's already there in the
resulting matrix, and everything works out fine.
And when you're picking your block size you think about what the cache line size is of the
processor. And that's the size or close to it that you want it to be.
So for a matrix multiply example, this is a very small one. But you get the idea. We chop it up
into equal-size chunks and we'll take that row of that chunk times the column of this chunk and
we'll do a regular matrix multiply on those four elements, put the result here. Then to
completely compute that particular resulting matrix element, it won't be done until we do this
row times this column and that chunk and add that in to the result matrix.
So you do instead of -- you could think of this as a 4x4 matrix and now we're doing the same
matrix multiply thing, but we're doing it that chunk times that chunk, this chunk times this
chunk. You do all the combinations just like you normally would in a matrix multiply and you
end up getting the same result, the same number of operations as you would have if you had
done it the normal way. Yeah.
>>: How -- are the matrices laid out in main memory? Are you laying out the chunks
contiguously? There's one chunk --
>> Adam Eversole: No, it's not contiguous.
>>: Not contiguous.
>> Adam Eversole: So the -- it's just -- this is basically, you know, stored this way in memory.
It's just one contiguous chunk of memory. What?
>>: You rely on it being fully mechanic.
>> Adam Eversole: Exactly.
>>: Okay.
>> Adam Eversole: Exactly. And if we do it right, then even though this one up here, you
know, it's probably -- it's row major so it stores it across the rows, this contiguous memory, so
we're actually pulling in four lines of cache to get this up here and another four lines to get it
down here.
But if we do it in the right sized chunks and the next operations over here are done in the right
order, those will still be in cache by the time we get around to them. So that's --
>>: So would you spend a little extra memory and store [inaudible]?
>> Adam Eversole: Yeah. You could actually just rotate this to do transposition on the matrix
before you even start. And then the data would all be contiguous in memory. That's another -- you can do that in addition to.
>>: That's for all your chunks, right, so you have a -- so --
>> Adam Eversole: Pardon me?
>>: That's for all your chunks, right? So you see we'd have to load [inaudible] of the row when
you want to [inaudible].
>> Adam Eversole: You have to load -- pardon me?
>>: So, sorry, I don't understand how -- so if the memory is [inaudible] you still want to take
those sub chunks --
>> Adam Eversole: Yeah, you still want to do the sub chunks.
>>: [inaudible].
>> Adam Eversole: I think what he was saying if you actually just transpose this matrix, then
you're going to get -- this column now becomes a row, so instead of taking eight cache lines to
store this, you only take two. If these are single point -- or floating point -- single-precision
floating point, then you can fit four floats in a cache line.
>>: Do you want to rewrite [inaudible] in those [inaudible]?
>> Adam Eversole: Rewrite it?
>>: Yeah. Like so you take the chunks and rewrite so that -- because like when you reach the
end of a row [inaudible] you still have this within a chunk. You're not looping back into a
chunk, right, [inaudible] chunk and you will come back near the end of the row.
>> Adam Eversole: Yeah. You're probably right. There's actually a lot of other things I could
have done to superoptimize this. I didn't do all those because I have a lot of other things to talk
about.
But the concept is you want to get -- you want to be efficient with cache. And in order to do that,
you want to think about locality and you want to do things in the same area as much as possible
so the things you just accessed, if you're going to access them again -- and matrix multiply is a
good example because you look at the same elements over and over again, the same row as many
times as there are columns and so on and so forth. Yeah.
>>: So the idea here is that you have -- within each of these matrices you have four submatrices.
>> Adam Eversole: Right.
>>: Each with just its own kind of block, and now you only have -- now you're accounting for
two chunks instead of the entire matrix? Is that the --
>> Adam Eversole: Yeah. You only look at these -- and we'll get into some real code here in a
little bit that actually does this so you can see what it does. I think it's the next slide in fact.
There it is. Okay. So this isn't the most efficient algorithm, but it fits on a slide and it's a lot
better than the last one.
So now you can see instead of only three layers now we have six layers. However, the number
of iterations is exactly the same as it was last time. So you'll notice we have a block size with
the cache size divided by the size of whatever type is coming in here. This doesn't actually show
you this, but this is a template based on the type. So this would be like double or a float or an int
or something. I'm using floating point in here because that's what I'm used to.
Anyway, these outside loops go up by the block size, and the inside loops start at wherever the
block is in the outer loop and goes up by the block size.
Now, if you notice, it has this min here, so if you're not -- if it doesn't exactly fit in your block
size, which is probably pretty likely, you're not going to be going off the end and hitting memory
you shouldn't.
So that's -- these three inner loops all do that. And the internal stuff is exactly the same as it was
last time. And, again, you probably have noticed that I -- K and N don't change this loop, so you
could technically bring this outside of this loop and just increment it. But I'm counting on the
compiler to do some of that stuff for me. Yes.
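A rough sketch of the kind of cache-blocked loop being walked through here, under the same assumptions as before (square, row-major, flat arrays, C zero-initialized); the names and block-size choice are illustrative, not the actual slide code:

    #include <algorithm>  // std::min

    // Cache-blocked matrix multiply: C += A * B, all n x n, row-major.
    // Assumes C starts zeroed; each block multiply adds into the result block.
    template <typename T>
    void MultiplyBlocked(const T* A, const T* B, T* C, int n)
    {
        const int cacheLineBytes = 64;                                    // common cache line size
        const int blockSize = static_cast<int>(cacheLineBytes / sizeof(T)); // e.g. 16 for float
        for (int ib = 0; ib < n; ib += blockSize)
            for (int jb = 0; jb < n; jb += blockSize)
                for (int kb = 0; kb < n; kb += blockSize)
                    // multiply the (ib,kb) block of A by the (kb,jb) block of B,
                    // accumulating into the (ib,jb) block of C
                    for (int i = ib; i < std::min(ib + blockSize, n); ++i)
                        for (int j = jb; j < std::min(jb + blockSize, n); ++j)
                            for (int k = kb; k < std::min(kb + blockSize, n); ++k)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }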
>>: Do you have to worry about the addresses being corrupt with cache?
>> Adam Eversole: The what?
>>: If the addresses were actually -- so you have this block size to make sure --
>> Adam Eversole: Oh, yeah. So unaligned access --
>>: Yeah.
>> Adam Eversole: -- penalties and things like that. Yeah. You'd want to pick your block size.
Some power of 2 would be a good thing. Probably something by -- oh, let's see. What have I
got? Cache size divided by -- so this gives us actually 16, I think, which in most processors
today works out pretty well.
So the -- yeah. You have to think about alignment as well. Depending on the processor, you
have a bigger penalty if you're accessing unaligned memory.
>>: So my question is -- so my question is also does -- so here the block size is basically the
cache line divided by the size of the type. Do you have to worry that your address actually maps
with the first thing in your cache line at all? Or can [inaudible] where you actually start the map
and meet with the cache line and then you wrap around?
>> Adam Eversole: I don't think cache lines are wrapped, but --
>>: At least in -- I know there's an operation to say I want an allocated 2D array and it will give
you a 2D array --
>> Adam Eversole: Right. And it's aligned properly and -- okay.
>>: [inaudible] raised a little bit at the edge of each stride just to make sure that you stay aligned
but [inaudible].
>>: [inaudible] might be longer than the actual thing [inaudible].
>> Adam Eversole: Yeah. You want the -- you want to do that if there -- and most processors
have a penalty if you're like one byte off or something. Yeah.
>>: So are you suggesting the cache size should be the size of your L1 data cache?
>> Adam Eversole: That, or a multiple of that. I actually -- when I ran this, it ended up that -- doubling or two times that value actually ended up being the best. So I just fudged the cache size
up to 128 and it worked out better. But that's a good starting point.
>>: Did you notice if this got compiled, like the compiler compiled this into the extended -- the
MMX like --
>> Adam Eversole: The MMX? I actually didn't -- actually it did. It did use some of the MMX instructions,
so --
>>: Well, not MMX --
>> Adam Eversole: Not MMX. Whatever the --
>>: [inaudible].
>> Adam Eversole: Yeah, right.
>>: [inaudible].
>> Adam Eversole: Single instruction, multiple data, SIMD, sort of, whatever the set's called,
SSE -- I forget.
>>: [inaudible].
>> Adam Eversole: So this particular one is the cache-friendly implementation. Oh, and I
forgot to run this stuff. Let me run this. Can't do it without running it. I've actually got the real
code here. So let's see. We'll just throw a return in here right after.
Okay. So we're going to compare these two. So I'm doing 1024 by 1024 which means we're
doing a million multiplies. Or, no, billion. Sorry. So it takes -- so there's the top one. It takes 7
1/2 seconds to do it the dumb way and 1.2 seconds to do it the smart way. So it worked pretty
well. And we're still all single threaded. We haven't got anything too fancy yet.
>>: So it's a difference you see between the L1, L2 cache and the L3?
>> Adam Eversole: Actually, I tried to use the Visual Studio Profiler to tell me that. But it
didn't work because everything's in one function. The instrumentation version instruments the
function names and collects data. But I only had one function and so it gave me like no data. So
it wasn't very useful. If I had more time, I'd probably use some other tool to try to get the L2
cache.
>>: VTune does good.
>> Adam Eversole: VTune? Yeah.
>>: Intel's profiler?
>>: Yeah.
>>: Yeah, that's a great one.
>>: We don't have [inaudible].
>>: Some teams [inaudible].
>> Adam Eversole: I mean, the CPUs have counters that will give you how many cache hits you
got versus misses. So you can get that information. I was trying to use our Visual Studio stuff,
but for this example it didn't work out so well.
Next thing is to understand whatever hardware -- yeah.
>>: So this works [inaudible] compel is like [inaudible] would basically do that [inaudible].
>> Adam Eversole: We're going to get to that. But, yeah, definitely. This is a toy example
because I wouldn't do this for real. I'd use MKL or cuBLAS or something. In fact, that's what
we use in the real project is cuBLAS. I mean, why reimplement something that's already been
done, right? If somebody's done it and done it well, just reuse it. But this is a toy example.
So hardware constraints. It depends on what platform you're on. If you're on a phone, you
probably can't do as much parallel processing as if you know you're on a server and you know
what your machines look like and you have Tesla cards in your machines or something like that.
So it depends on what your hardware constraints are.
A lot of phones nowadays are multicore, however, and whether they let you get at the cores or
not, that's another story. But most of the new phones are either dual or quad-core processors
under the covers. PCs and tablets have been multicore for some time now, two to four usually.
And GPUs are usually connected to your PCs even though they're not used for very much at all.
A lot of business PCs now have a GPU in them, but they're not playing games hopefully most of
the time. So the GPU is very commonly underutilized. And if you can use that resource that's
already sitting there that's available to speed up your application, why not.
If you're on a server, of course, you probably have lots of memory and lots of cores to deal with
and you're in a more constrained hardware environment so you know what you can do and what
you can expect to be there. And there's custom environments like the one that I work in where I
know I've got four Tesla cards so I've got, you know, thousands of cores at my disposal. GPUs,
FPGAs. Phi is the -- or Phi -- I don't know how they say that -- that's the new -- I guess not that
new anymore -- Intel version. They're trying to use general purpose accelerator architecture.
So I don't know if you've seen this graph before, but this is actually how it went. This only goes
up to 2015. And sorry the anti-aliasing there is messing it up. But as you can see, at least until
then, the transistor stayed on the curve. But our clock rates kind of evened out here. This is
clock speed. Transistors, clock speed, power consumption, and performance per clock are the
four lines there.
So our free lunch has been over for quite some time now and we need to learn how to use all the
cores that are given to us. So not only do we have the CPU cores to use but also GPUs
connected to most of the machines nowadays. And even some of the phones have GPUs on them
that could be utilized if you can figure out how to get by the API constraints.
So using CPU plus GPU for more efficient application is a good idea and something we should
be thinking about more than I think we currently do.
So CPUs versus GPUs. CPUs have always been optimized for sequential code because that's
what they've always run. And they get better and better at it. And in order to keep up with
current clock rates, they've had to add more and more cache to their chips.
If you look at the picture of an Intel chip, which we will in a second, you'll notice that a large
portion of the chip is cache. A lot of the transistors are going to cache. That's to ensure low
latency. When you want the instruction, when you want the data, it's got to be there. They'll
even do prefetching and speculative branch prediction and all this stuff to try to make things
faster. And they generally have high clock rates, which means 3 to 4 gigahertz, somewhere in
there.
GPUs, on the other hand, are optimized for parallel code. They use much more of the silicon to
create the cores, the actual computation units than you get over in the more complex CPU cores.
They don't have that much cache. They have a little bit now. They didn't used to have very
much at all. But they're adding some cache now.
They're made for high throughput. So the memory bandwidth, they'll go to 256 bits, 384 bit
wide memory buses to get between the GPU and the onboard memory. That's why they have
onboard memory, so they can get to it quickly. It's high throughput. They don't care so much
about latency because, after all, it was a graphics card, right, and you didn't have to update
frames more than once every -- 60 times a second. 120 times a second is sufficient. So you have
a lot of throughput but the latency doesn't matter that much. So that's what the GPUs are
optimized for. And they have lower clock rates. Keeps the power consumption down. Even at
that they run kind of hot.
So here's your Intel four-core processor. I think this is an i5. As you can see, this whole bottom
section here is the L3 cache. Has 8 megabytes of it. And if you look at the cores themselves,
you'll notice these areas in here that are -- suspiciously look like potential memory locations,
those are probably L1 and L2 cache areas. And this is a virtual one. But this is about what they
look like if you look at the real GPUs. This particular one is the latest one for NVIDIA. It's got
quite a few cores, as you can see. All the green ones are cores. And it's got -- each one of the
boxes is a symmetric multiprocessor, which looks like this if you look at it closer.
So every one of these symmetric multiprocessors, which there are 14 of on the K20X, is what
this came from -- these are pictures from NVIDIA -- 14 SMXs, so you have 2,688 cores on this
card. And it runs at 730 megahertz roughly. So not 3 or 4 gigahertz -- you know, not
even 1 gigahertz, but they don't really need the -- they don't really need to crank up the clock rate
so much. It has 6 gigabytes of memory onboard, so it can access it with -- to get this
256-gigabyte-per-second bandwidth.
And it does have some L2 cache, 1.5 megabytes, but that's nothing compared to like the 8
megabytes that you'd see on an Intel processor. And 64k of L1 cache is shared. Now, this is
much different than what you're used to in a regular CPU because you pretty much need to
manage this cache yourself. It's -- I say L1 shared because if you use shared memory, which is the
way most of the GPU code is written, you will be consuming some of this and it won't be
available for L1.
Okay. So how do you use all those cores? There's different methods of doing it. One of them is
libraries, as you mentioned. If you've got a library that does what you need, use it -- MKL, for
example, or LAPACK or whatever. And for GPU libraries, I have cuBLAS, cuSPARSE for
the CUDA folks, and other ones for the ATI versions of those.
So there's a number of resources out there. And if somebody's already done all the work for you
and they work at the GPU manufacturer, they're probably going to do a lot better job than you
would. So use the libraries if they're available.
Next one is directives. Directives are really easy to use. I don't know how many of you are
familiar with OpenMP. We'll look at it in a second. But it's really easy to use. It's in our
compiler today. All you have to do is put a little directive in front of it and flip a compiler switch
and you've got parallelized code for loops at least. And I'll show you that in a second.
Language extensions. These are all extensions to C++ or C -- I threw in the C# stuff just so
people would know that it's all available on the C# side too. Yeah.
>>: [inaudible] consider using accelerator [inaudible] .NET [inaudible] consider trying that
[inaudible]?
>> Adam Eversole: Accelerator what? I'm sorry?
>>: It will use all your cores [inaudible] GPU, the accelerator project from MSR.
>> Adam Eversole: Oh, oh. Okay. So it's a project. I have not -- I have not used that, no. I am
actually using something called PTask by Chris Rossbach in Silicon Valley, however, and that to
utilize multiple GPU cards on the same machine. And that --
>>: So we could do the same thing.
>> Adam Eversole: Is it the same thing?
>>: Yeah, on this [inaudible] API, it doesn't matter what language [inaudible].
>> Adam Eversole: Okay. Yeah, I'll have to talk to you --
>>: [inaudible] like the download -- download people [inaudible].
>>: Yeah, we release from time to time [inaudible].
>> Adam Eversole: I'm not familiar with that particular project, but I'll have to check that out.
Language extension. So PPL we'll talk about. The ones in bold I'm actually going to go over.
C# has PLINQ and TPL, which stands for -- I forget what it stands for. Thread --
>>: Thread Parallel Library.
>> Adam Eversole: -- Parallel Library, something like that. And then, of course, there's a whole
mess of async, threading APIs if you want to roll it all on your own. CUDA C/C++, C++ AMP, I'll
be looking at both of those here. OpenCL, which is an open version of a similar API set. And
DirectCompute, which is of course DirectX's version if you like to write [inaudible] language.
OpenACC is similar to OpenMP, and I'll talk about that in a second.
So OpenMP, really easy to use. It's already there in Visual Studio. So just put it in your code.
It's multiplatform. The "open" implies that. It's -- so the directives are usable across different
platforms, and different compilers support the OpenMP directives.
It's for shared memory devices, which means it won't work on a GPU because GPU has disjoint
memory. OpenACC is available. Pretty much equivalent directives, but it works for nonshared
memory devices like GPUs.
So this is all you really have to do. If you have a loop, put #pragma omp parallel for, flip a
compiler switch, and you'll magically be using more cores than you were before. You don't have
to tell it how many cores or anything like that, it just figures it out under the covers.
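As a minimal sketch, the directive applied to the earlier naive loop looks like this (illustrative code, not the actual slide; in Visual C++ the compiler switch is /openmp):

    // C = A * B, n x n, row-major. The pragma splits the outer loop across cores;
    // i, j, k, and sum are private per thread because they're declared inside the loop.
    void MultiplyOmp(const float* A, const float* B, float* C, int n)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
            {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }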
And a note here. If you think this is going to make up for having a bad algorithm in the first
place, you're wrong. As we'll see, I actually coded up -- I used the original algorithm and then
the optimized -- semioptimized algorithm and ran them both using multiprocessors, and we'll just
look at those. So we'll move our return down a few chunks.
So this is our OpenMP stuff. Let me just show you the code because it's pretty easy. So it's the
same as last time except for it's got this pragma in there. And this is the same as last time except
for it's got this pragma in there.
So all I did was add this little pragma line that says, hey, I want to do this thing in parallel, this
loop. And it automatically figures out, oh, you did it at that level so everything under there is
going to be private to whatever processor I put it under. You don't have to worry about that most
of the time.
So and there's a lot more than just parallel_for. If you want to go look up OpenMP, it lets you do
lots of other interesting things. Like if you're doing a sum or something of that nature, it has
automatic ways of partitioning it out and then bringing it back together again.
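For the sums he mentions, the usual OpenMP way to partition the work and then bring it back together is a reduction clause -- a small sketch with illustrative names:

    // Parallel sum: each thread accumulates a private partial total, and OpenMP
    // adds the partials into 'total' when the loop finishes.
    float SumOmp(const float* data, int n)
    {
        float total = 0.0f;
    #pragma omp parallel for reduction(+ : total)
        for (int i = 0; i < n; ++i)
            total += data[i];
        return total;
    }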
So let's run it. So this is our old one that takes seven seconds. I think I'll comment that one out
for the next time. So now it's OpenMP and you'll notice that the -- the parallel version, I've got
four cores on this thing. I think it's plus hyperthreading, so maybe eight.
But you'll notice that I'm still slower than -- the old algorithm, the dumb algorithm, even with
four cores doesn't meet the performance of just the sped-up single-core version. Then of course
this is much better when I use the four cores. So looks like we're getting just over four times the
performance, which is good. I don't know how much I should count a hyperthreading core.
Okay. Parallel Patterns Library, or PPL. This is another Microsoft-published library that comes
with Visual Studio so you've already got it. It does a lot of things, task parallelism for larger -- if
you want to have different work items partitioned out. That's not what we're looking for right
now. We just want the parallel_for, which it also does. It also has parallel containers that are
safe for concurrent access. So if you're using the usual vectors or other type of C++ standard
library stuff, it has concurrent versions of those that can be used safely for multiple threads. So
that's kind of handy. Unordered_set, a couple other things. It also has a concurrency runtime
which does a lot of the background work to help write concurrent code.
It is a CPU solution as well. It won't work on your GPU, but it's pretty easy. It's almost as easy
as the directive. Just write parallel_for, as we will see right here. Oh. What happened there?
Oh, I just collapsed it, didn't I? Okay. So once again I did both versions using PPL. And we'll
see that the performance is very similar to when we did the directives. Yeah.
>>: So how much faster would using those concurrent friendly containers be as opposed to just
putting a lock around the container and then writing and then releasing the lock?
>> Adam Eversole: I think it depends on how -- how much you need to use the lock.
>>: But what are these --
>> Adam Eversole: Presumably that's what they're doing internally, right?
>>: Yeah, that's what I was going --
>> Adam Eversole: Yeah. And so they have to assume that every possible write or read might
be in contention. So if you have a solution where you know that there's only one case in which
I'm going to have a conflict, you're probably better doing it yourself. But if you don't really
know or you just want to make sure that you're good, you can go ahead and use the other
concurrency containers.
They don't -- it's not a full API set either. Like it won't let you insert in the middle of a vector,
for example. And the memory that it allocates is not guaranteed to be contiguous, so you can't
do little tricks like address of C sub zero and just use a pointer. It might break.
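A small sketch of what using one of those concurrent containers looks like -- PPL's concurrency::concurrent_vector from <concurrent_vector.h>; as he says, it gives you thread-safe growth but no mid-vector inserts and no contiguity guarantee (the names here are illustrative):

    #include <ppl.h>
    #include <concurrent_vector.h>

    // Many threads appending results safely without an explicit lock.
    void CollectSquares(concurrency::concurrent_vector<int>& results, int n)
    {
        concurrency::parallel_for(0, n, [&](int i)
        {
            results.push_back(i * i);  // thread-safe append; final order is not guaranteed
        });
    }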
So let's see. So it's about the same as OpenMP. OpenMP did a little bit better. It did a little bit
better on the other version, so your mileage may vary. Yeah.
>>: So OpenMP, is that doing anything right when you -- when you [inaudible] so you have to
make sure --
>> Adam Eversole: Yeah, it's not -- it's not doing anything -- all it does is it provides the
infrastructure under the covers, the compiler does it, to use all your cores. So it looks at the loop
and it figures out, you know, I can farm this off to multiple processes and he said he wants this
done -- it wants it done at this level. Like I could have moved that down farther in the hierarchy
if I wanted to, down the second or third level of the loop. And then I would have been choosing
to do a sequential loop and then farm it down lower down in the step.
But, no, it's not doing anything real special like dividing things up for you or anything like that.
Memory accesses are still memory accesses. Is that the question?
>>: Yeah. So basically the loop you showed us, right, stands to [inaudible] reads and so there
were no [inaudible] you were writing the [inaudible] mapping because [inaudible].
>> Adam Eversole: Yeah, yeah, I mean, you have the same problems with unaligned accesses
and -- I mean, the original -- if the original algorithm doesn't work, it's not going to work when
it's parallelized either.
>>: Yeah, but you have the addition of [inaudible].
>> Adam Eversole: Yes. And in fact --
>>: [inaudible].
>> Adam Eversole: And in fact I have -- that's a good question for the audience here. So let's
see. I forgot to ask this question. So let's go look at the PPL implementation for a moment.
Okay. As you can see, parallel_for is right here. And you just -- so effectively what it is is four
parameters. This last one goes all the way down to here. So it's -- you pass in the function as -- or what you want executed -- as the last parameter of parallel_for.
And this one isn't capturing anything, so it's just from 0 to M, step one. So that's all I had to do
for the first one. The second one is very similar except for the step is different. I'm stepping by
block size now instead.
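A minimal sketch of that call pattern with PPL (from <ppl.h>; illustrative names, not the code on screen). The four-parameter overload is parallel_for(first, last, step, body), with the body -- a lambda here -- as the last parameter:

    #include <ppl.h>

    // Naive multiply with the outer loop handed to PPL. The blocked version is the
    // same call with the step set to the block size instead of 1.
    void MultiplyPpl(const float* A, const float* B, float* C, int n)
    {
        concurrency::parallel_for(0, n, 1, [&](int i)
        {
            for (int j = 0; j < n; ++j)
            {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
        });
    }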
To answer your -- what I think what you're getting at is do we have problems with concurrent
access. Let's go back and look at OpenMP CPU. Oh, by the way, I do a check after each one of
these to verify that it actually matches the original CPU version of the matrix to make sure it's
actually working as it should.
Does anyone see -- so this version of it where we block everything out, does anyone see a bug
here, concurrency bug?
>>: [inaudible].
>> Adam Eversole: What?
>>: [inaudible].
>> Adam Eversole: The result right here?
>>: If you [inaudible].
>> Adam Eversole: Right. So you're saying that this access might not be safe.
>>: Yes.
>> Adam Eversole: And you are correct. That particular line is not thread safe. Now, the
possibility that I'm actually going to hit it is very low, so it actually hasn't hit because -- because
what has to happen is you have to have two threads. I've only got, well, eight cores if you count
hyperthreading. I've only got eight cores. But one of them has to have V result -- they both have
to have read it. And one of them increments it and puts it back. The other one increments it and
puts it back and I've got a race condition.
>>: So but isn't -- IDX result is a function of K, K is based on KB and KB is segmented by the
omp parallel_for, right? So isn't it the case that like each of -- each of those will be accessed by
uniquely one thread? I mean, I see in general you're totally right, had you have done J -- or, no,
had you have parallelized it in a different way, that [inaudible] really could be dangerous. But I
think in this case because of where you put your --
>> Adam Eversole: You think I'm guaranteed that I'm okay?
>>: Yeah, I think if you put the parallel_for somewhere else you would definitely hit this
problem. Like if you put it in one of those tight loops, like I --
>> Adam Eversole: Right, yeah.
>>: -- then you would be dead.
>> Adam Eversole: But since I did it the way I did it --
>>: Yeah, you're effectively taking stripes of the matrix and saying this thread is responsible for
that stripe of the matrix.
>> Adam Eversole: Yeah, you're -- I think you're actually right.
>>: Yeah.
>> Adam Eversole: But this is a --
>>: [inaudible].
>> Adam Eversole: This is -- this is a -- this is a cool thing about OpenMP. I can say -- let's see.
Pragma. OpenMP atomic. And it will make the next line atomic. Now, how's that for cool?
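A tiny sketch of what that looks like: #pragma omp atomic makes just the single update statement that follows it atomic, without a full critical section (illustrative names):

    // Accumulate a product into the shared result from multiple threads.
    void AccumulateAtomic(float* C, int idxResult, float product)
    {
    #pragma omp atomic
        C[idxResult] += product;
    }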
>>: [inaudible].
>> Adam Eversole: It is. I'll show you it's slower. But it's not that much slower. I tried to -- I
tried to actually -- I mean, this particular line gets hit, let's see, probably, I don't know, 250
million times, something of that order. It's not in the tight loop -- I mean, the -- this result here
gets hit a billion times, and we're one out from the block size is 16, so 1/16 of a billion is still
pretty big.
If you try to put something, you know, like a critical section, oh, you'll just be there forever. So
they do a pretty smart job of the way they do it.
>>: Is the result a 32-bit thing or a 64-bit thing?
>> Adam Eversole: Right now I'm running it as single float, so it's 32 bit.
>>: [inaudible] intrinsic atomics?
>> Adam Eversole: There are intrinsic atomics. So you can use that too. But even using an
intrinsic atomic, it's a function call --
>>: Right.
>> Adam Eversole: -- called that many million times.
>>: Yes.
>> Adam Eversole: It slows things down considerably. I actually tried that too.
>>: Okay.
>>: When you say I checked the [inaudible] could you --
>> Adam Eversole: Oh, yeah.
>>: You can do that? Like if you don't add stuff and you [inaudible] is that saying -- yeah. It's
floating point, right, so there's a fudge factor in there.
>> Adam Eversole: Exactly. And especially for single -- single -- I mean, if you're down like five or six decimal places, that's about as far as you can
expect it to be the same. If you go past that, the order in which you add things starts making a
difference.
And for these particular ones, I think I was just using random numbers. And I had to -- it was
about the fifth or sixth significant digit was good, but past that would start being a problem. So,
yeah, floating point's never going to be the same.
Back to the slideshow. Okay. So that's a pattern, PPL, Parallel Patterns Library. Oops. Too far.
Back. Okay. C++ AMP, which stands for Accelerated Massive Parallelism. Now, this is the
first one that actually uses GPU. Up to now it's all been CPU multicore stuff. So C++ amp will
use a GPU if it's available.
They also -- the cool thing about amp is it can also be used on a multicore processor if you don't
have a GPU. So they have multiple accelerators. If you have multiple accelerator cards, it lets
you choose which one you want to use. If you don't have multiple accelerator cards, you can
say, hey, I want to use the CPU version and you tell it -- it's called the warp version of the
accelerator. It's a software emulation of a GPU that uses the actual cores. And it will use any
SIMD set that the processor happens to have.
So it's for separate-memory-space processors, like GPUs. It includes multidimensional arrays,
ways to handle those, indexing. You'll notice that all of my code was just using an array, and I
figured out the indexes myself. This lets you do it the easier way or the better looking way.
Memory transfers. This is a big deal in GPUs because in order for a GPU to work on some data,
it's got to be in GPU memory first. That means you've got to copy it from CPU memory up to
GPU memory, run whatever you're going to do, and then when you need the result back, you've
got to copy it back to CPU memory again. And if you're not efficient with that stuff, it ends up
costing you a lot of time.
So it handles all this memory transfer stuff for you kind of transparently if you want it to, or you
can be explicit and say, hey, I want to -- I don't want you to transfer it for me, I'll tell you when.
It also does tiling, which is much like the blocking that I showed you previously, but for GPUs
it's done in just a slightly different way because it doesn't have all that cache. That's used
automatically for you. Mathematical function library, and this -- like I said, the same code can
be run on the GPU or the CPU.
So this is what C++ AMP code looks like. The top line there, extent<2>, that says it's a
two-dimensional entity. So e_a, e_b, e_c, so the A and B matrix are multiplied together, C is the
result. So the top line just puts the dimensions in for each of those extents.
An array view is how you say, hey, this is something that I want the GPU to be able to see. I
want the GPU to be able to see this, this A array, which is up there, and here's my new name for
it, which is actually v_a. You don't see it because it's off the top. But that's the actual array that
holds this up as a vector. So this is a vector that came in as parameters. And same with v_b and
B result. This is the one that goes back.
So you'll notice ABC discard data. That says, oh, by the way, whatever was in this array C
thing, I don't care what it was because it's an output. It's an output array. So don't worry about
transferring all this garbage up to the GPU because I don't care. You're going to write over it
anyway.
Parallel_for_each. Since we have an extent object here, this ABC which is -- where is it? Where
do you see? There it is. So that's the result. It says, hey, I want to go over all the extents of my
result array and I'm passing you in a two-dimensional index called IDX.
The restrict(amp) here says everything else in here, down to here, is going to be -- has
to be a particular set of instructions that can be translated into GPU instructions. So it
limits what you can do inside of there. But everything you see here is fine.
So regular loop except for this goes over. So extent is 0. You'll notice 0 and 1. So that's the
second -- so this is the first index or the first whatever you call it -- the ranking of the array. So
the first index, the second index. So we want to loop over the -- usually it's row and then
column, so this would be the columns that we're looking at. And that would be row. This would
be column and the index.
So it uses these indexes. You can get into them. So this is two-dimensional index. It loops over
it. I think you can probably figure out what's going on. It's pretty much the same sort of
algorithm as the first example we did except for this actually takes care of two of those loops out
there because it does both dimensions of C at the same time.
And down here after we get done, we say, oh, we want to synchronize this. That means bring
that thing back to the CPU now. Because it used to be on the GPU. I want it back on the CPU
now. And you can either explicitly say that or when this goes out of scope, as you will see in the
other example, it automatically comes back.
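A rough sketch of the kind of C++ AMP code being walked through (a reconstruction, not the actual slide; it assumes square matrices held in std::vector<float>, and the e_/v_ names just follow the description):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // C = A * B on the default accelerator (the GPU, if one is available).
    void MultiplyAmp(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int n)
    {
        extent<2> e_a(n, n), e_b(n, n), e_c(n, n);   // 2-D shapes for A, B, C
        array_view<const float, 2> v_a(e_a, A);      // views the GPU can see
        array_view<const float, 2> v_b(e_b, B);
        array_view<float, 2> v_c(e_c, C);
        v_c.discard_data();                          // output only: don't copy C's garbage up

        parallel_for_each(v_c.extent, [=](index<2> idx) restrict(amp)
        {
            int row = idx[0];
            int col = idx[1];
            float sum = 0.0f;
            for (int k = 0; k < v_a.extent[1]; ++k)
                sum += v_a(row, k) * v_b(k, col);
            v_c[idx] = sum;
        });

        v_c.synchronize();                           // bring the result back to the CPU
    }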
So let's watch that thing run. So I'll show you in a second. I can also use this line right here.
This line will say, hey, use the CPU instead of the GPU. If I don't tell it anything, it looks
around and it says, oh, do you have any DirectX11 devices on your machine, which means it
works on anything that supports DirectX11, which is great, because lots of cards do that. Doesn't
have to be CUDA or NVIDIA card, like the next example CUDA does have to be. It can be an
ATI. It can be anything that supports DirectX11.
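The accelerator-picking line he's referring to is along these lines (a sketch; accelerator::direct3d_warp is the software WARP device that runs on the CPU cores, and if you set nothing, AMP picks the default DirectX 11 device):

    #include <amp.h>

    // Force C++ AMP to run on the CPU-based WARP accelerator instead of a GPU.
    void UseWarpAccelerator()
    {
        concurrency::accelerator::set_default(concurrency::accelerator::direct3d_warp);
    }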
>>: Can I do something which is worse than using the processor in a sense that it's kind of
something which is good graphics but not good [inaudible]? Could be [inaudible] which support
their in a slow way?
>> Adam Eversole: In a slow way? Yeah. Actually they have one of the emulators which is
really, really slow.
>>: Single threaded, though.
>> Adam Eversole: Yeah. It's for debugging and it's -- yeah, it's bad. You don't want to be
running with that thing ever if you can help it.
>>: They try to pick something intelligent. Like there's a logic [inaudible].
>> Adam Eversole: The DirectX guys have an enumeration API and they had capabilities. And
in fact there's a DirectX cap saying that you can run and see everything that comes back on your
system to see what it has. And, in fact, I have a -- I tried to run this on one of our GPU boxes but
found out that it doesn't have drivers that are DirectX11 compatible. So it was running that
really slow thing.
So let's do this. This one's running really with a GPU this time. The particular one I have in this
machine is just my laptop. So it's got 192 cores. It's one of the current -- it's like one core -- that
huge one I just showed you, it's one of those symmetric multiprocessors. So it's a little guy.
And as you can see, we got the dumb version, 181 milliseconds. The smart one only took 69
milliseconds. So we're doing much better than we did with the CPU-parallelized one, which I
think our best was 312. So even with a rather dumb GPU, you can get quite a speed up in things
like this algorithm. Yeah.
>>: Does that time include the memory transfer?
>> Adam Eversole: Yes.
>>: Okay.
>> Adam Eversole: It does.
>>: [inaudible].
>> Adam Eversole: And the -- well, let me -- okay. Let me tell you a little bit about what I did
here. Since these are so fast, I do it 20 times. I do it 20 times and take the average. So
technically it's not a memory transfer every single time. So it's amortized. One memory transfer,
20 runs, memory transfer back, check it. So there's only two memory transfers for 20 runs. So
1/20 of the memory transfer is in there.
>>: Okay. Do you know how it did that?
>> Adam Eversole: I don't. Actually, we'll be able to see that when I get to the CUDA code,
because it actually has a version of the timer that doesn't include the memory transfer so you can
see the difference.
Now let's just run that one more time, this time using CPU emulation just to show you what the
differences are.
So that's the really cool thing about C++ AMP. You can have the same code and it will run on
GPU, it will run on the CPU. All you have to do is change that line that says what accelerator
you want it to use.
>>: So I know CUDA has like -- excuse me -- cuBLAS and cuSPARSE and some of these
libraries. Is there library support for AMP in the same thing?
>> Adam Eversole: There is library support for AMP.
>>: Like matrix multiply [inaudible].
>> Adam Eversole: Yes. However, it's not as good as cuBLAS.
>>: Okay. Thank you. Okay.
>> Adam Eversole: I wanted to use this really bad.
>>: Yeah. Looks really pretty.
>> Adam Eversole: It does. And I wanted to use it but in our particular application 90 percent
of the things that we do are matrix multiply, so that's like at the top of the list and that's the one
thing that matters. And when I benchmarked it -- so you can see, you know, it's not quite as
good as our PPL and our OpenMP examples, but, you know, it's a lot better than a single
threaded.
>>: I was really curious, it's like twice as fast on the simple version but twice as slow on the
tiled version. It must be doing something that's really inefficient on the tiled.
>> Adam Eversole: On the tiled version?
>>: Executed.
>> Adam Eversole: Oh, I haven't actually told you about the tile. I shouldn't be running that.
Oh, well.
>>: The real comparison on the AMP would be versus just vectorized code.
>>: Right.
>>: Which is --
>>: Right. Which he doesn't -- which he --
>>: [inaudible] single core.
>>: Yeah. Which as long as the single core is truly vectorized code, but it might be vectorized
in a different [inaudible].
>>: I thought AMP vectorized on the CPU? Is that not true?
>>: AMP goes through DirectX and then DirectX executes it as vectorized code.
>>: If you run the CPU target instead of the X target.
>>: Yeah, I think it vectorizes --
>>: You have tons of for loops, so which ones get vectorized?
>> Adam Eversole: Yeah, this particular one would -- it's definitely trying to vectorize the stuff.
But it's a different -- it's kind of a different abstraction so it doesn't have as much information. It
has to go through DirectX11 and the underlying core -- or the underlying emulator isn't really
hardware. So there's a little bit of inefficiency there. But it's attractive that you can get, you
know, pretty decent -- you know, this is better than that even on the simple version. But, again,
it didn't do as well on the tile version.
>>: So the run version, like the AMP [inaudible] anything or you just say [inaudible] or you do
exact same thing as the block one?
>> Adam Eversole: For this one -- the tile version is done differently. I actually -- when we get
to the CUDA code, I can show you what I did to emulate what -- basically the short answer is all
that stuff I did to use caching, well, GPUs don't have that kind of cache. So it doesn't matter. In
fact, it makes it slower to do it the other way. Because you kind of have to do the caching
yourself when you're on the GPU by using local memory.
Okay. So GPU accelerator programming, I think I've actually already told you all about this, but
I'll go through the slide anyway. So the CPU handles program flow in a GPU application. The
CPU generally gets the data that you're going to process from disk or wherever it happens to be
and it has to copy the data over to the GPU like so. Goes over to the GPU, the GPU does it stuff,
and then it has to be transferred back in the end.
The key to making GPU programs efficient is to eliminate these copies as much as possible.
You don't want to for every single operation go back and forth and back and forth. You want to
copy your data up here, let the GPU do as much as it can with that data and then when you really
need the answer on the CPU, transfer it back. Remember you've got a copy of that data over
here so you don't have to -- it's still sitting there until you bring it back.
And you run kernels of code on the GPU. It's an accelerator so it does one thing. You say do
this, here's the data, do this, and it does it really fast.
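The copy-up, compute, copy-back pattern being described looks roughly like this with the CUDA runtime API (a sketch; RunKernelsOnDevice is a hypothetical stand-in for whatever kernels or library calls you launch while the data is resident):

    #include <cuda_runtime.h>

    // Assumed to launch one or more kernels / library calls on the device data.
    void RunKernelsOnDevice(float* d_data, int n);

    void ProcessOnGpu(float* h_data, int n)
    {
        float* d_data = nullptr;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void**)&d_data, bytes);                         // allocate GPU memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // copy up once

        RunKernelsOnDevice(d_data, n);  // do as much work as possible while it's on the GPU

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // copy the answer back once
        cudaFree(d_data);
    }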
So GPUs have hundreds or thousands of cores. Single instruction, multiple thread. This is a little bit
different than anything you're probably used to if you haven't done GPU programming before.
What happens is the -- there's these units called warps. So you've got 32 threads per warp, which
is a unit of processing, and also memory requests.
I said that there -- the key to getting efficient GPU programs is making sure that when you access
memory, you do it properly, which means coalesced. And
what that means is you don't want sequential access by a particular thread. In fact, you want
strided access. If you've got 32 threads per warp, which has been the standard for the last while,
you want to go in increments of 32. So the first time you increment -- or the first time you look
at a memory address, if it's 0, the next one you want to be 32.
And why do you want that? Because all the other 31 threads are going to be looking at the next
one and the next one and the next one and the next one all the way across in the same cache line
on the GPU. So it's called coalesced memory access. All instructions, if you go through
instruction by instruction on a GPU, every single thread is in the same instruction at the same
time. They don't diverge from each other, which is different than what you're used to in CPU
where they're just kind of all around their own thing.
And if there's a branch in your GPU code, if somebody takes that branch, everybody else is just
sitting there twirling their thumbs, oh, I wonder when you're going to get done with the branch.
So you want to eliminate branches as much as possible. You want to stride your memory access
so that your 31 other brothers in the same warp will be accessing adjacent memory locations to
the one that you're accessing.
So this kind of breaks the -- the usual rule on a CPU you want it to be adjacent, you want them to
be adjacent. You don't on a GPU.
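A sketch of what that looks like in a CUDA kernel -- the common grid-stride form, which generalizes the stride-by-32 idea to striding by the total thread count; because consecutive threads in a warp get consecutive values of i, their accesses land next to each other and coalesce:

    // Scale an array; each thread handles elements i, i + stride, i + 2*stride, ...
    __global__ void ScaleCoalesced(float* data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
        int stride = blockDim.x * gridDim.x;            // total number of threads launched
        for (; i < n; i += stride)
            data[i] *= factor;                          // neighbors in a warp touch neighboring floats
    }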
You can have up to 1024, and now I think it's 2048 on the new ones, threads per block. A block
is a number of threads that can access the same shared memory. So if you have something you
want to cache in memory, you bring it into local memory, you cache it there, and then everybody
in your block can access that memory. And this is L1 cache, slash, shared memory.
And then they have another concept called grids which is just a whole bunch of blocks. And it
looks kind of like this. So each thread gets per thread local memory which they call registers.
They have like 64,000 of these things. So they have tons of registers, quote/unquote. And then
here's a thread block which has some shared memory. And then a grid is multiple blocks that
can access the global memory. And different grids access the same global memory, so they're
shared. So if you have something that you need to share between different blocks or different
grids, you can put it in global memory. But you're going to pay the cost of going out to the
slower method.
AMP C++ Tiling, the second version of the AMP code, used this instead of the regular extent.
So compute_domain, you tell it, oh, here's the C, which is the result vector. I want to use this as
a compute_domain. And then down here in your parallel_for_each you say a compute_domain
tile, and you tell it the size of your tile, tile_size,tile_size.
And then when you're accessing it, you use the tiled index. And then it lets you do some -- let
me show you the real code on this. I'm running out of time here. But we can go look at this for a
second. Okay. So here's the real code. And you'll notice that it's caching the tile for the A and
the B matrix here. So that's the first thing it does, is it caches what it wants into local tiles.
It has this thing called barrier wait which means I'm going to stay here and wait till all the
threads get done populating these two buffers. If you don't do that, then one of them might get
ahead of another one because they're not all in the same warp here presumably. So you have to
use a barrier wait.
And then after everything's in the AMP matrix, I can just use the local memory and calculate my
sums, wait again so that it's populated, and then I can set it.
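A rough sketch of the tiled kernel being described (a reconstruction with illustrative names, not the code on screen; TS is the tile size, and for simplicity it assumes n is a multiple of TS):

    #include <amp.h>
    using namespace concurrency;

    static const int TS = 16;  // tile size (illustrative)

    // Tiled C = A * B using tile_static memory as the manually managed cache.
    void MultiplyAmpTiled(array_view<const float, 2> v_a,
                          array_view<const float, 2> v_b,
                          array_view<float, 2> v_c, int n)
    {
        v_c.discard_data();
        parallel_for_each(v_c.extent.tile<TS, TS>(),
                          [=](tiled_index<TS, TS> t_idx) restrict(amp)
        {
            tile_static float tileA[TS][TS];   // shared by the threads in this tile
            tile_static float tileB[TS][TS];
            int row = t_idx.global[0];
            int col = t_idx.global[1];
            float sum = 0.0f;
            for (int kb = 0; kb < n; kb += TS)
            {
                // Each thread loads one element of each tile into tile_static memory.
                tileA[t_idx.local[0]][t_idx.local[1]] = v_a(row, kb + t_idx.local[1]);
                tileB[t_idx.local[0]][t_idx.local[1]] = v_b(kb + t_idx.local[0], col);
                t_idx.barrier.wait();          // wait until both tiles are fully loaded

                for (int k = 0; k < TS; ++k)
                    sum += tileA[t_idx.local[0]][k] * tileB[k][t_idx.local[1]];
                t_idx.barrier.wait();          // don't reload the tiles until everyone is done
            }
            v_c[t_idx.global] = sum;
        });
    }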
And in this case they didn't say synchronize like I showed you last time. It's implicit when the
array view, which is right here -- when this array view goes out of scope, it's got to automatically
copy it back to the CPU, presuming that you actually want to know what the answer was. Yeah,
somebody have --
>>: What is the role of the second barrier? Like why do you wait?
>> Adam Eversole: Why do we wait? Well, this is -- the top one here is populating, populating
these two tiles, right?
>>: Yeah [inaudible].
>> Adam Eversole: Then this one is actually calculating the temp sums for the tiles. And then
you wait again before you actually store it.
>>: Why do you wait again before you store it?
>> Adam Eversole: Why do I wait here?
>>: Yeah.
>>: So that you don't start changing the cache while somebody else is trying to use it.
>>: Yeah.
>> Adam Eversole: Presumably nobody should be changing the cache. That's true.
>>: So you essentially have multiple tiles and each tile is assigned to bring in a certain piece of
data --
>> Adam Eversole: Right.
>>: -- up there and the tile is static and that's the cross-tiles. Everybody has a certain
assignment. So when you do this, you work it out and you say this tile is preassigned to load this
memory, this tile is preassigned to load that memory. So the barrier waits are preventing this
cross-tile coordination from messing things up. So if you don't have the barrier wait at the
bottom, it's going to loop back up into the for loop, it's going to start pulling in --
>> Adam Eversole: The next version. Yeah.
>>: -- the next version while [inaudible].
>> Adam Eversole: Before you're done using it. Yeah.
>>: Doing this cross matrix multiply.
>> Adam Eversole: Yeah. But you're right. If we were just dropping it -- if this wasn't here like
the last iteration, you probably don't need that. But it's just the last iteration, right?
>>: What confused me is that it's almost -- it's single instruction, multiple thread, correct?
>> Adam Eversole: That's only in a warp. Yes. You and your 31 other brothers.
>>: I see.
>> Adam Eversole: So, yeah, in a warp-by-warp basis you're moving through synchronously.
But like that big picture I showed you, it had 14 different processors and each of them had, you
know, 192 cores or something like that. And then you store the answer down here.
Okay. I'm going to run. So the last version is to use -- here's my verify code, by the way, that
goes through and looks for -- let's make this back to CPU version so I can do some good
comparisons. Take this out. The last version here -- actually, it's not the last version but
probably the last version I have time for. Didn't get to CUDA.
So this last one is using cuBLAS, which is the BLAS library for CUDA. And it's quite a bit
faster than everything else. So don't go and write this yourself. Use cuBLAS. Or MKL if you're
on CPUs or, you know -- oh, where is it? 27.3, yeah, that's pretty good.
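A sketch of the host-side cuBLAS call that corresponds to (cublasSgemm from cublas_v2.h; the handle comes from cublasCreate, and the matrices are assumed to already be on the GPU and stored column-major, which is what cuBLAS expects):

    #include <cublas_v2.h>

    // C = A * B for n x n single-precision matrices resident in GPU memory.
    void MultiplyCublas(cublasHandle_t handle,
                        const float* d_A, const float* d_B, float* d_C, int n)
    {
        const float alpha = 1.0f;
        const float beta  = 0.0f;
        // Computes C = alpha * A * B + beta * C.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n,
                    &alpha, d_A, n, d_B, n,
                    &beta,  d_C, n);
    }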
>>: Is that the same computation mechanism as the AMP, like you run it 20 times on average
but you only pay for the GPU as well?
>> Adam Eversole: That is. Yeah. Exact same. I do the -- I copy it up, run it 20 times, copy it
back down. So there's the -- the best we had previously was 70 on the AMP tiled. So 27 is
significantly better than that. They actually have a library that does it at about 46, I think, on this
machine. So they have a --
>>: Oh, AMP does.
>> Adam Eversole: AMP does.
>>: I see.
>> Adam Eversole: So they do considerably better than the regular tiled version. They have a
library, but it's still not as good as cuBLAS.
Now, for CUDA I'm just going to run some of this stuff because I'm almost out of time. I forgot
to up the font size on this. Sorry about that. Okay. So this is the CUDA version of it. CUDA is
an NVIDIA-specific runtime library and toolset that only works with NVIDIA cards currently.
It's open so ATI could implement it if they want, but they have their own stream things so they're
not going to.
It's the same exact thing. The block version you'll notice here is slower than the other version.
Because that sort of optimization just doesn't work on GPUs. Here's the simple version, 210,
which is not as good as we got with AMP. Actually, I think it's -- is that better than the AMP
simple version? It's similar. It's similar. I think it was 180 or something like that. And here is
the tiled version which isn't really optimized that well. In fact, it doesn't do as good as AMP.
But, oh, by the way, this is the difference, 87 versus 92. This is without any memory copies. So
you can see the effect there. 5 milliseconds or so was the difference or the penalty.
>>: So we have to do 5 times 20 --
>> Adam Eversole: No. This is doing it 20 times internally. And this number doesn't include any of the -- the
two memory copies. This one does. So it's taken 2 1/2 milliseconds per copy, copy up, copy
down.
>>: So if we were to have 20 different matrices to multiply, you have to pay up --
>> Adam Eversole: If you had 20 different matrices to multiply, yeah, you have to -- you have
to copy each one up presumably. Unless like in my application, you use the result of last matrix
multiply to do another one on it. And it goes up the chain and it comes back down. That's the
way machine learning deep neural networks work.
One minute. Okay. Let me just show you the last slide, the performance slide since CUDA
C++. Okay. There we go. So this was the whole gamut. This is in a logarithmic scale. As you
remember, this one took seven seconds and this one only took -- cuBLAS took, what, 20
something. So this is 25 milliseconds. That's eight seconds. But to get it all on the same chart, I
had to do [inaudible]. So you can see that the GPU -- this is all CPU from here that way, and this
is all GPU down here.
And this is a very lame GPU, by the way. I ran -- I ran this on -- let's see. Where's Excel? So I
ran this on my -- can you see that? Probably not. So I ran the same CUDA code on a Tesla card
that has 512 cores and it ran the CUDA stuff at 24 milliseconds versus 209 on my 192-core, 46
milliseconds versus 235 and 16 versus 92. That was the first run. I think we got a little bit better
than that when we just ran it.
So, yeah, it scales with the number of cores you got. And if I threw this at a K20X with 2,600
whatever cores, it's done before I can think about it.
So, anyway, utilize all the resources you've got, and thank you very much.
[applause]