>> Laurent Visconti: Well, good afternoon. It's my pleasure to introduce Mark Silberstein. He's a last-year PhD student at the Technion, Israel Institute of Technology. Mark's talk today will be on execution of memory-bound workloads on GPUs via a software-managed cache. Okay.

>> Mark Silberstein: Thanks a lot for the introduction, and thanks for having me here. It's a tall order. I know you guys are very difficult to impress; you have a lot of nice stuff going on here. But I'll try.

So this work was done at UC Davis in early 2007, and then I continued working on it in Haifa at the Technion. At UC Davis it was John Owens and Anjul Patney who helped me with it. John Owens is a professor at Davis now. And of course my advisors, Dan Geiger and Assaf Schuster, at the Technion.

So briefly, if you think of memory-demanding applications on GPUs, the best workload you can ever run on a GPU is one with a lot of operations per single memory access. This is because of the typical memory architecture on GPUs. This is called high arithmetic intensity. With this workload you would be able to get the maximum out of the GPU.

A nice workload is one where you do have a memory reuse pattern, so you can reuse your data in a certain way, but this pattern, the access pattern, is known at compile time. Typically, for matrix multiplication you do blocking in order to fit your computations into the cache, and the block size is usually determined at compile time, so it's kind of [inaudible] and it doesn't depend on the input. What you typically do is use the scratchpad memory on the GPU in a clever way: you prefetch the data that you're going to use several times, in order not to go to global memory every time, and by prefetching it you essentially make it available to the computation on chip. The interesting thing is that multiple threads work collaboratively on the prefetched data, and then they switch the set of data they are working on and go on to another piece of data, et cetera.

The tough workload is when you don't have this luxury of knowing the memory reuse pattern at compile time. And this is the workload that we'll be talking about.

So in a nutshell, we can think of this scratchpad memory, which is available to blocks of threads on the GPU, as a cache, a regular cache. And although this is not particularly new, and many people consider it as a cache, we go one step further and try to analyze our workload as if it were a real cache, and we also build machinery to allow this cache to be configured at runtime by selecting the appropriate cache policy depending on the input, the one which is optimal for a specific input.

So in my talk today I will describe our solution to this particular problem using the example of sum-product computations. I will also show some older results that we have had since 2007 on the G80, and some new results that we have on the GTX 285. Obviously we do not have results on [inaudible], not yet, but I'm hopeful that we will. And I will briefly mention the things that we did with the TSUBAME supercomputer, which has 620 GPUs. It's a cluster with a lot of GPUs, and we used it for running very large problems on thousands of CPUs and 120 GPUs. So I will briefly mention what we did there.

So, in essence, there is a very, very simple performance model that we can think of on a GPU.
So what we want to do is to be able to analyze our workload in advance, without actually writing any piece of code, to know what the best performance is that we can get if we do it in an optimal way. Essentially, if the data supply is fast enough to keep all the ALUs busy, then this is a compute-bound workload and it's easy. In the other case, though, on a GPU you actually have to ship your data in, and you saturate the bandwidth, and by saturating the bandwidth you are limiting your performance.

So can we actually do this kind of trivial arithmetic to know what the maximum is that we're going to get? Well, of course, because basically our expected performance is going to be more or less our arithmetic intensity, which is an algorithm-specific parameter, the number of operations per memory access that we do, multiplied by the bandwidth. And then we just take the minimum between the arithmetic intensity times the bandwidth and the best ALU performance we can get. So this is a very, very simplistic model.

But it turns out that when we take a very simple, trivial workload, one that is used by NVIDIA all over the place just to show a trivial kernel, which is just element-wise multiplication, what you see from this analysis is that the arithmetic intensity here is one third: we have three memory accesses, one write and two reads, and one operation. For this arithmetic intensity, the expected performance that we're going to get is only seven gigaflops, which is only about five percent utilization of a G80, and for a more advanced card, the 285, we would get slightly more than that, but it would still be a couple of percent of the available ALU capability of the GPU. It's 12 gigaflops, which is about one percent, because this card provides a teraflop of performance. So with this simplistic model we already know that this workload is going to be very, very GPU-unfriendly, if I may.

Now, if we think about the scratchpad as a cache, we can incorporate it into this simplistic analysis by considering not the number of memory accesses per operation but rather the memory access cost. This is what we do when we consider cache misses and cache hits: a cache hit costs nothing, a cache miss costs a memory access. And if we can formalize the cache hit model for our application, we can think of a way to analyze the workload that does take shared memory into account.

So for the case of matrix multiplication there is O(N) reuse. Without the cache, our arithmetic intensity would be only a constant, it would not depend on the input, and in this way we would be limited to this very low utilization of the GPU. Whereas with a cache, with only compulsory misses, the workload becomes compute bound, because the arithmetic intensity now depends on the input parameters. So it's perfect scaling in this sense, because the more data we have, the better performance we're going to get. So with this simple model we already know that for matrix multiplication there is no way we would be able to do it without using the scratchpad.

Now I'll turn to the application that we are using to evaluate our approach. The motivation for using sum-products is inference in very large Bayesian probabilistic networks for parametric genetic linkage analysis. This application allows geneticists to analyze their data and find disease-provoking genes.
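The model being described can be summarized in one line. The bandwidth figures in the comments below are approximate and only illustrative, used to reproduce the ballpark numbers quoted in the talk.

```latex
% AI = arithmetic intensity (operations per memory access),
% BW = global-memory bandwidth, P_ALU = peak ALU throughput.
\[
  \mathrm{Perf}_{\text{expected}} \;=\; \min\bigl(\mathrm{AI}\cdot \mathrm{BW},\; P_{\mathrm{ALU}}\bigr)
\]
% Example: element-wise multiply does 1 operation per 3 accesses (two reads,
% one write), i.e. AI = 1/3 op per access = 1/12 flop per byte for 4-byte
% floats. At roughly 86 GB/s (G80) this caps out near 7 GFLOP/s; at roughly
% 150--160 GB/s (GTX 285) it caps out near 12--13 GFLOP/s.
```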
So essentially this computation is exponential in some input parameters, and that's why it may require a lot of computational time. So what we want to do is try to use the GPU for computing it. Our other approach is to use distributed computing environments, opportunistic grid environments, and there is actually a working system that uses them, but that's another subject.

So what is a sum-product? Think of it as a reformulation, a more generalized way to think of matrix multiplication. What we have here is a product of two matrices, and this is just the formula for each one of the values of the resulting matrix. Now, if I think of X, Y, Z as variables over finite domains, then A(X, Z) and B(Z, Y) turn out to be functions of these variables, and these functions map from the discrete domain of X, Y, Z to the actual values that make up the matrices. So when we represent the matrix product this way, we can think of A(X, Z) and B(Z, Y) as being multiplied in some way and then marginalized, or summed, over the shared dimension. So this is just an abstraction of matrix multiplication.

So what we do here -- and this is essential to understand, because essentially this is where the random access to the input comes from -- is that we have two vectors, where 0, 0 here means X, Z and Z, Y respectively. We compute all the different permutations of these two, where we fix the variables that are not shared and go over the different permutations of the shared variables. So essentially we get the product over all the different values of the variable Z when we fix X, Y to be 0, 0, then X, Y to be 0, 1, et cetera. That's the first part, the multiplication. The second part is to take these different products and sum them, and we sum over the entries that refer to different values of Z. So we add this and this, this and this, this and this, and the result is exactly regular matrix multiplication.

However, the point is that now we can generalize from the multiplication of matrices to the sum-product of a set of general functions, functions that cannot be reduced to simple matrices, summed over the shared dimensions. Say we have three such functions, and we sum over W and Y. We may compute this in two ways. One way is to generate the full function over four dimensions and then sum over W and Y. This would require O(N^4) operations if the number of values for each variable is N. However, we can use the distributive law, and then we would first compute this product and marginalize over W, and then take the result and marginalize over Y. This is much more computationally efficient, and this approach is called bucket computations: you essentially split your input into buckets, where you first compute one bucket and then push the output of this computation to the next one. This way you save quite a lot; even in this simple computation you already save a factor of O(N) operations. So in the general case, you have a lot of functions, and you sum over some of the variables.
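As a concrete reference point for the formulation above, here is a minimal serial sketch (not the authors' code) of the two-function case, which reduces to ordinary matrix multiplication when the functions are plain two-dimensional tables. The function name, layouts, and sizes are illustrative assumptions.

```cuda
// out(x, y) = sum_z A(x, z) * B(z, y): the sum-product of two functions over
// discrete variables X, Z and Z, Y, marginalizing the shared variable Z.
#include <vector>

void sum_product_pair(const std::vector<float>& A,   // |X| * |Z| values, row-major
                      const std::vector<float>& B,   // |Z| * |Y| values, row-major
                      std::vector<float>& out,       // |X| * |Y| values, row-major
                      int nx, int ny, int nz)
{
    for (int x = 0; x < nx; ++x)
        for (int y = 0; y < ny; ++y) {
            float acc = 0.0f;
            for (int z = 0; z < nz; ++z)          // marginalize over the shared variable
                acc += A[x * nz + z] * B[z * ny + y];
            out[x * ny + y] = acc;                // one output per (x, y) assignment
        }
}
```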
Interestingly enough, this is a fairly general representation of a general problem: you can substitute the summation by max, min, or boolean OR, and the multiplication by summation or boolean AND, and then you get a totally different problem from a totally different domain. But the fundamental structure of the problem remains the same, and the fundamental question is to find the best order of splitting things into buckets.

Now, what I'm going to talk about today is how to accelerate these computations. So let's zoom in on one bucket. What I want to compute is a single bucket, and I want to compute it efficiently. I split my input in a certain way -- I won't go into that particular subject, but let's assume I have some way of splitting things into buckets. Now I have one bucket and I want to compute it. One bucket is a set of functions marginalized over a subset of the variables. I don't split it into more buckets; this is the basic unit. What I want to do is take this bucket, compute it on the GPU, and push the output of the computation to the next GPU invocation, the next kernel with other parameters. Essentially this computation forms a kind of tree, so it's essentially pipelined.

Now, with this in mind, if we do the same analysis that we did for simple matrix multiply, which as we've seen is a particular case of sum-product: in the worst case, where there is no reuse, I expect to get at most 22 divided by one plus the number of functions, which means that if I have 10 functions in my bucket, I'm going to get only two gigaflops out of this thing. No matter what I do, the GPU will give me only that performance. However, if I do use the scratchpad in a certain way, I'm going to get a significant improvement in ALU utilization. So I hope I convinced you that without the cache I would not be able to improve this performance.

Okay. So once we have that in mind, I want to zoom in a little bit just to remind you how a GPU is built. A GPU is essentially a [inaudible] that sits across the PCI bus. These numbers may be changing, increasing over time, but the essential basics for the next two years seem set to remain. In order to do any computation, you first have to push your data from the CPU to the GPU. Then you invoke what's called a kernel that runs on these SPs, stream processors, and then the output of the kernel execution has to be passed over the PCI bus back to the CPU. Now, have a look here: there is no cache. So this is kind of a weird thing that we have to deal with in this simplistic representation of GPUs.

So let's make it a little bit more complex, and that's what NVIDIA did with their CUDA architecture. They said, okay, let's split all these SPs into groups and provide each set of SPs with a little bit of shared memory, a little bit of on-chip memory which is very fast and provides huge bandwidth. However, this memory must be managed by the software: as a part of your kernel, as a part of the computation, you have to squeeze in the cache logic. If you do that in a structured way and use it as a scratchpad, fine, but we want to do the analysis, so we want to have a real cache here.

So how do we do that? Just to analyze the challenge: in a typical cache-aware algorithm, when you write something for a CPU, you are trying to implicitly avoid slow accesses to memory, so you try to be as local as you can.
And essentially you implicitly think about how the cache works and then try to lay things out exactly the way the cache expects, so you will be ready for that. For a GPU, you must explicitly stage data into the cache and keep it in the cache as long as you need it. But there is a chicken-and-egg question: you design both your cache algorithm and your computational algorithm, so which comes first? Just from a methodological perspective, this is a question that should be answered.

From our experience, this is the kind of workflow one should keep in mind when designing these algorithms. First of all, you have to devise an algorithm with temporal locality, because no matter what your cache logic is, if you are not temporally local, nothing will help; no cache will help in that case. Then you devise your caching algorithm according to the cache access pattern that you have in mind in your computational algorithm, and the key point here is to reduce the overhead of implementing the cache in software. The next thing you do is take the serial algorithm, which is very local and optimized for locality, and optimize data reuse within each set of threads. Because, as we've seen, this shared memory is shared within a subset of threads, and these threads have to work collaboratively on this piece of memory. This is the main challenge, and it is totally different from the one we have on CPUs. And then you do multiple passes of this workflow, trying to adjust things.

Now, what may make you wonder is that I'm mentioning temporal locality here, but what about spatial locality? Essentially, spatial locality seems to be not that important, because you implement the caching algorithm yourself, so you can handle that inside your cache logic. But there is a catch. If you are not spatially local, all this prefetching that's going on will just be very, very slow, because the GPU accesses parallel banks of memory, and if your accesses are not what's called coalesced, so that you fully utilize the bandwidth, then there is no way you will get the maximum out of the global memory bandwidth. So you must also be spatially local.

So with this workflow in mind, what's left? We have locality. So what do we still need to do in our cache, what do we need to do in software? The replacement policy: we want to know what we bring in, what we pin, and what we evict. We need to handle capacity and conflict misses here. But think of deciding this on demand in a GPU kernel. These are inherently serial decisions to make, and that would just ruin the performance. So what we can do is precompute it, depending on the input, and then plug it into the generic mechanism that we already implement as part of the kernel.

So for different functions we may have different amounts of data in the cache. Basically, what we implement here is a read-only, direct-mapped cache where the output is not cached, because it is not reused, and the input is cached in a way where we know exactly the amount of data from each function we're going to use in this particular set of threads. So we build our cache so that each function is cached with a different amount of data, and this set of data we call a cache page, because this is the set of data that one thread block is going to work on. So here we already see that the logic of accessing functions in different places is generic.
The only thing we need to know is where each function resides, and this logic is precomputed on the CPU before the computation and plugged into the algorithm. So briefly, what do we need to keep in our cache page? Think of three functions like this, and think of the cache page spanning four variables. We want to cache all the data that is accessed when we go over all the permutations of these four variables. So the size of the cache depends on the number of variables that we select. In this particular case, assuming that each variable has two values, it will be four for this function, eight for that one, and eight for the other. These values we're going to cache for each thread block. And what changes between the thread blocks? The values outside of this window: between the thread blocks we permute over all possible values of X5, X6.

Now, as you can see, the expected cache size is input dependent; it is dictated by the input. But what if in this case we cross the boundaries of the available shared memory? We compute the size this way because it is dictated by the logic of the program, but now the cache size is 16 times four bytes. What if our memory is only 16 bytes? What do we do then? These are capacity misses, because we have more data to cache than we have space for. So what we could do is check on every access whether the data is there, and if it's not, fetch it. But that's very expensive; you would ruin all the benefit of having the cache in place. So instead we say, all right, we will precompute the cache misses that we're going to get because of capacity, and we will avoid caching the functions that cause the required cache space to be too large; we avoid caching them altogether. And in doing that, we want to select the set of functions that minimizes the total expected cache miss rate. This can be reduced to a very simple binary knapsack problem, which with easy heuristics can get you within a factor of two of the optimum. So we solve this binary knapsack problem on the CPU and plug the result into the GPU code, depending on the input.

So the parallel compute structure is very simple: every thread handles one output, and in handling one output it accesses a certain amount of data. As we have seen, this data is cached. Now, different threads work collaboratively on this piece of data to reuse it efficiently and to avoid thrashing the cache. The interesting thing here is that we do not have any cache conflicts. Why? Because threads discard the cache when they cross the block boundary, and we cache all the data that will be accessed by one thread block. So by having all the data of one thread block cached, and discarding it across block boundaries, we make sure that there are no cache conflicts. It's not efficient, because we increase the cache miss rate this way, and I'll show you in a minute that we can improve that.
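To make the CPU-side selection step concrete, here is a sketch of the kind of greedy heuristic the capacity-miss argument above suggests: given each function's shared-memory footprint and an estimate of the global accesses that caching it would save, pick a subset that fits. The struct fields and the benefit metric are assumptions for illustration; the talk only states that the selection reduces to a binary knapsack solved heuristically on the CPU.

```cuda
// Host-side sketch: choose which functions to keep in the software cache.
#include <algorithm>
#include <cstddef>
#include <vector>

struct FuncSegment {
    int    id;              // function index
    size_t bytes;           // shared-memory footprint of this function's cache page
    double accesses_saved;  // estimated global-memory accesses avoided if cached
};

std::vector<int> choose_cached_functions(std::vector<FuncSegment> segs,
                                         size_t shared_mem_bytes)
{
    // Greedy by benefit density (accesses saved per byte of shared memory).
    std::sort(segs.begin(), segs.end(),
              [](const FuncSegment& a, const FuncSegment& b) {
                  return a.accesses_saved / a.bytes > b.accesses_saved / b.bytes;
              });
    std::vector<int> cached;
    size_t used = 0;
    for (const auto& s : segs) {
        if (used + s.bytes <= shared_mem_bytes) {  // skip functions that would overflow
            cached.push_back(s.id);
            used += s.bytes;
        }
    }
    return cached;  // plugged into the kernel's cache metadata before launch
}
```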
So what is the kernel structure? It turns out to be a fairly typical kernel structure that you can find in many kernels; you have definitely seen something like it. First you compute which chunk you're going to work on. Then you populate your cache, and then you loop from zero to the number of summation values that you want to iterate over. For each summation value, you iterate over the functions that you have, and when you access a function you ask whether it is in the cache or not: if it is, you get it from the cache, and if it is not, you get it from main memory. Once you get this value, you aggregate the values of the different functions in the inner iteration, and then you sum over all the summation values. If you think about it, this is exactly the code for matrix multiplication, in a simplistic way, when you just sum over the shared dimension, but it is general for any number of functions you may have. And after you've done that, you just write the output out.

Okay. So as I said, we don't have cache conflicts, because we ensured that our threads access the same piece of data and never access different cache pages. But as I said, there is this problem of an increased cache miss rate, because across the block boundaries we just lose everything that we prefetched. So what we can do is improve this by putting several cache pages in the same thread block, but then some of the functions have to be reloaded. How do we know which functions to reload? Well, you might have guessed: we decide that on the CPU and plug it into the GPU.

Another thing is that if you have too many summation values, your cache is going to explode, because every function has to have all its summation values prefetched, and if you have too many of them, there is no way you can fit it. So what you can do is what everyone does in the matrix world, which is tiling. You tile your input across many different invocations: in the first kernel invocation you deal with the first part, and then you invoke your kernel a second time and aggregate with the previous kernel invocation. The interesting question to ask here is how to determine the tile sizes, because if each of your tiles fits in memory but the kernel invocation is very, very short, then you're going to lose all the benefit of having your data cached, because the overhead of kernel invocation will dominate. So there is a heuristic that we apply to determine the tile size, predicting the expected kernel execution time and trying to optimize it.

There is one catch, though. The catch is that we plug in the data from the CPU to be used when populating the cache. Now, when we populate the cache, where do we get the metadata from? From memory. Which means that we are trying to improve the access to slow memory by making it even slower, because we add more memory accesses along the way. Too bad, right? So what we can do is use the texture and constant memories here, because some of them are cached and they reduce the load on the main memory bus, like texture memory, and this is really helpful. When we use that, and we also reuse the same data many times from the scratchpad, we amortize the overhead of accessing the slow memory in order to push everything into the cache.

Another thing is that we were forced to do dynamic loop unrolling. The loop runs from zero to the number of summation variables, so we don't know at compile time how many summation values to expect. So we do dynamic loop unrolling, or tiling of the loop, where we just look at how many iterations are left: we unroll it four times if there are more than four values left, and we unroll it two times otherwise. Essentially we do this factorization of the number of remaining iterations and do the best we can for unrolling. Unrolling too aggressively ruins the performance, because some of the registers are spilled out to global memory.
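To make the kernel shape concrete, here is a minimal sketch of a kernel with this structure: one thread per output value, a collaboratively populated shared-memory cache page, and a precomputed per-function flag saying whether that function's data is in the cache or must be read from global memory. The metadata layout and the index arithmetic are deliberately simplified placeholders, not the real implementation.

```cuda
__global__ void sum_product_kernel(const float* __restrict__ global_data,
                                   const int*   __restrict__ cache_meta,  // per function: cache offset, or -1
                                   float* out,
                                   int n_funcs, int n_sum_values, int cache_words)
{
    extern __shared__ float cache[];                  // the "cache page" for this block

    // 1. Decide which output value this thread works on (grid assumed sized to the output).
    int out_idx = blockIdx.x * blockDim.x + threadIdx.x;

    // 2. Collaboratively populate the cache page with a coalesced copy.
    for (int i = threadIdx.x; i < cache_words; i += blockDim.x)
        cache[i] = global_data[blockIdx.x * cache_words + i];
    __syncthreads();

    // 3. Sum over the marginalized variable, multiplying all functions per value.
    float sum = 0.0f;
    for (int z = 0; z < n_sum_values; ++z) {
        float prod = 1.0f;
        for (int f = 0; f < n_funcs; ++f) {
            int off = cache_meta[f];
            // Toy index math: real code maps (out_idx, z, f) through per-function strides.
            float v = (off >= 0) ? cache[off + z]                        // cache hit
                                 : global_data[f * n_sum_values + z];    // fall back to global memory
            prod *= v;
        }
        sum += prod;
    }

    // 4. Write the single output this thread owns.
    out[out_idx] = sum;
}
```

And here is a sketch of the dynamic unrolling idea mentioned just above: since the trip count is known only at run time, the remaining iterations are peeled into fixed-size chunks of four and two so that each chunk can be unrolled without excessive register pressure. `body()` is a placeholder for the real per-summation-value work.

```cuda
__device__ inline float body(int z) { return 1.0f / (1.0f + z); }  // placeholder work

__device__ float sum_over_values(int n_sum_values)
{
    float sum = 0.0f;
    int z = 0;
    for (; z + 4 <= n_sum_values; z += 4)        // unrolled-by-4 chunks
        sum += body(z) + body(z + 1) + body(z + 2) + body(z + 3);
    if (z + 2 <= n_sum_values) {                 // one unrolled-by-2 chunk
        sum += body(z) + body(z + 1);
        z += 2;
    }
    if (z < n_sum_values)                        // final single iteration
        sum += body(z);
    return sum;
}
```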
So we got to the very interesting part, right? These are the results. A small warning: these results are from the older evaluation, but we have some brand-new results too, and I wonder whether you're going to see the catch in them. I mean, they are too good to be true. But maybe you can find the problem.

So, hardware: a single-core run on an Intel Core 2 CPU with a large cache, and a G80 GPU -- never mind how much memory it had, this is the classic old first-generation CUDA card -- with CUDA 1.0. And we actually compared against our own CPU kernel, and this kernel performed quite close to ATLAS. If you consider that this is not just regular matrix multiplication but involves a lot of other overhead, with the index computations and all that, getting within five percent of ATLAS is not that bad.

We had two implementations: one in linear scale, another in log scale. The log scale is essentially because we wanted to avoid underflow; to remind you, our input comes from probabilistic networks, so the values are below one. If you multiply too many times and then sum too many times, you sometimes get zero, although the true value is definitely not zero. So the trivial way to avoid underflow is to use log scale, and this is used quite often, although it is extremely inefficient. What we do is replace multiplication by summation and summation by exp and log, and then we just preprocess all the data beforehand and map it to the log scale.

So the interesting part -- I would definitely want to stop here and say, okay, this is my performance, but that would be unfair. Transcendental functions on the GPU are mapped to the special function units; they are performed in hardware. So when we got this nice speedup -- peak, admittedly, but still three orders of magnitude of speedup -- well, we thought there was some problem with the computation of the speedup; it cannot be true. So then we ran a lot of microbenchmarks just trying to understand where the speedup comes from. And it turns out that in this case a factor of 25 is a direct consequence of using the software-managed cache: without it we lose performance by a factor of 25. A factor of two is just from using the hardware -- just using the GPU in that way is better than using the CPU. And another factor of 25 is because of the special function units. Essentially, exp is more or less six times slower than a summation on the GPU, and 200 times slower on the CPU, so there is a factor of more or less 35 between the speeds of this particular operation. So when you have a lot of transcendental functions, this is what makes a huge difference.
Now, you know that the special function unit's capabilities are limited -- there is only one special function unit shared by a whole group of ALUs -- and this has been dramatically improved in the latest generation of the cards, but even here it's quite amazing.

Now, on average, as you can see, the performance is not as bright as we would want it to be, and the worst case is really devastating. Why is that? Well, as we saw, the reuse determines the expected performance. So if our input doesn't have any reuse, we just saturate the bandwidth, we are out of luck, we got bad input, and we just should not run it on the GPU, full stop. This is a decision we are going to have to make before running this in a production setup. However, obviously, in some cases it's definitely worthwhile. Now, this is single precision performance on this card. So this is the bottom line. Yes?

>>: [inaudible].

>> Mark Silberstein: I'm sorry?

>>: What does varying the input --

>> Mark Silberstein: Okay. So I'll show you on this graph. This is kind of a bizarre graph with a lot of points in it, just to confuse you. One point is one kernel invocation, and the input actually has a huge influence on the CPU as well, as you see, because what we vary here is the number of input functions, the number of values per function, and the overlap, as we call it, between different functions in terms of the number of variables shared between them. We performed more or less 400 runs with different combinations of all of those, and this is what we get. This is log scale. As you can see, the average is indeed more or less 15 gigaflops, but the interesting thing is that on this very card, if we map the best performance to the bandwidth that would be required for that performance, we get 212 gigabytes per second, which means the original bandwidth could not have been squeezed and stretched to that size. So our cache is actually doing well.

All right. So here is the graph. As you can see, there is this part of it that is not really good for the GPU, and as you might have guessed, the reason is that there is just not enough parallelism in the data to saturate the GPU. And here, as you can see, there is this boundary point of about 50 kiloflops of work in the input which makes the difference between the CPU and the GPU, and this is more or less the multiplication of two 40-by-40 matrices. So it's a pretty small input, if you think about it.

So how is our model doing on this? If we take our model and compare the expected results with those we get in practice as we change the cache size, up to 16K, what we see first of all is that the approximation is quite good. I mean, the tendency is exactly what we see in the measured curve; the absolute values are totally off, because we do not account for all the overheads of the kernel and of various other things. But what we want to get from this is an upper bound -- and this is a strong upper bound that you cannot cross -- and we want to know whether it's worthwhile to increase the cache size in order to improve the performance. We clearly see that we can do that by increasing the cache size. So we see the tendency, we see the global picture of what we're going to get, just from the model; the numbers are too optimistic. Okay. As you can see, there are several knees here, right?
So what are these knees? Recall that in our cache we decide whether to cache a whole function or not. At this stage we have zero functions in the cache. Here we get a jump because we managed to squeeze in one function; here we managed to squeeze in two, and here three. And this is the maximum we could get.

Another parameter is the number of cache pages per thread block. Recall we have this problem of cache misses every time we discard the whole cache page and switch to another thread block. So now we try to see how much improvement we get from having multiple cache pages per block. As you can see, the model actually predicts something that is beyond the ALU performance, so it's really off. But again, if you scale it down by a factor of 10, more or less, you can see that this is the boundary line where increasing the cache size gives no improvement, because basically we are not using the cache. So this is a kind of validation curve that says, okay, this is not an artifact, the improvement really is because of the cache. On this graph -- this is log-linear scale again -- we start at twenty-something and get as far as 34, 35 by changing the number of cache pages. And our theoretical model actually shows that this is what's going to happen.

And this is the comparison of having no cache, using texture -- the hardware cache that does exist, the texture cache -- and our cache. For a specific input we just keep increasing the number of summation variables, and by that we increase the complexity. And the CPU we barely see here at the bottom. Yeah?

>>: [inaudible].

>> Mark Silberstein: Yeah. Yeah. That's -- yeah. The lecturer said it's a good question because he has an answer. So this is a good question. And the reason is that, if you just have a look, there is a clear periodic pattern [inaudible] here. And the reason is that we are hitting bank conflicts, bizarre bank conflicts, because the data is aligned in a way that, as you increase the number of summation values, at some point you get to a situation where one bank is accessed by all the threads in the block. So it's kind of the worst possible bank conflict you can get. And as you can see, the texture cache exhibits exactly the same problem; it also dips here because of this, because of [inaudible].

So you may also ask what happens here, which would also be a good question. The answer is that you can see that something bad happened, and this bad thing is that we had to drop one function out of the cache. So by losing only one function we go down to exactly the level of the regular texture cache. Now, it's not exactly fair to compare us with the texture cache, because the texture cache is also shared among different thread blocks, whereas our cache is discarded across thread block boundaries. So the fact that the texture cache behaves like that here means two things. First of all, it means that we are doing a good job with the metadata, so the overhead is actually quite low, because the hardware cache behaves even worse than we do. And another thing is that the hardware has some advantages over our software-managed cache, and still we are winning, even in this worst case of bizarre benchmarks.
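Going back to the bank conflicts discussed in the answer above, here is a toy illustration (not the talk's code) of the effect: on G80/GT200-class parts, shared memory is divided into 16 banks by 32-bit word index, so when the access stride across the threads of a half-warp becomes a multiple of 16, all of them hit the same bank and the accesses serialize. The sizes and the output layout are arbitrary.

```cuda
// Toy demonstration of shared-memory bank conflicts on 16-bank hardware.
// stride == 1 is conflict-free; stride == 16 (or any multiple of 16) makes
// all 16 threads of a half-warp read the same bank, serializing the reads.
__global__ void bank_conflict_demo(float* out, int stride)
{
    __shared__ float buf[1024];
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        buf[i] = (float)i;                      // cooperatively fill shared memory
    __syncthreads();

    int idx = (threadIdx.x * stride) % 1024;    // keep the index in range
    out[threadIdx.x] = buf[idx];                // bank = idx % 16 on G80/GT200
}
```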
Now, this is the effect of loop unrolling, and as you can see, the compiler is doing a very bad job of unrolling this loop. So in my current implementation I have plenty of different kernels, unrolled for all the different combinations of summation values and numbers of functions. It turns out that there is another parameter, whether to use the texture cache for the input, and this parameter also requires different kernels. So I have 40 different kernels, generated automatically by my scripts and compiled in, because the compiler doesn't do a good job of unrolling my loops. And you can see that the benefit is huge.

Now, this is the brand-new data, more or less just generated, and it might be a little bit difficult to follow, so I'll explain. Each graph is actually an average over 100,000 runs of the kernel, over a large range of input parameters, where I go up to six functions and up to 500 summation variables. This axis is the input complexity, and this is the effective throughput in gigaflops, double precision, on the GTX 285. So I measured the average performance of the kernel for several configurations. First of all, I tried the software-managed cache with tiling and the software-managed cache without tiling -- you remember tiling, where I split the input into multiple kernel invocations if it doesn't fit the cache. I also used a combination of the software-managed cache and texture: essentially my input is mapped to texture memory, I fetch it from there and then put it into my cache; this way I wanted to squeeze the most out of it. This is the boundary line of having no cache at all, and this is no software cache, only texture.

So let me walk through it. This is the boundary, the average performance that we get when we have no cache at all, roughly 16 gigaflops. This is the case where we do not have any tiling and use only the software cache without texture. This case is the one where we use software and texture with tiling, so it's the combination of everything. This blue one is the software cache without tiling -- this is the best that we can get -- this is software and texture without tiling, and red is software and texture with tiling.

So what we see from these graphs is two things. First of all, even on average, the double precision performance is quite good; the combination of all our methods together is doing not badly, and if you look at the comparison without tiling, then it's actually impressive. Now, the reason why it sometimes works better without tiling than with it is that the overhead of tiling is predicted heuristically, and this heuristic is sometimes off.

Now, that was the average performance, and this is the best performance. As you can see, we pretty much saturate the double precision unit completely. What may make you wonder is that sometimes we saturate it above its theoretical peak, right? This is something that I realized only today. Thinking about it and checking the code, I realized that my complexity computation doesn't take into account a new improvement they have in the compiler. What the compiler does now is that whenever I write A plus-equals B, and A is statically defined above as zero, it doesn't do the addition, it just stores the value. The previous compiler didn't do that. If I take that into account, I lose more or less 15 percent. So I checked that; I just didn't have time to put it on the graphs.
But this is the performance that I'm expecting, so even 15 percent less than that is still not bad, I would say. It's around 60, so I'm not saturating it completely, but close to it.

Now, what about single precision, just to compare with the single precision on the G80? It's actually quite impressive. This is the best performance and this is the average case. So just to explain: this is the best with software and texture -- I'm sorry, let me just figure out what's going on here. So this is using only texture, the average, and this is only texture, the maximum. And this is using software and texture together, the maximum -- all our stuff combined. So as you can see, first of all, we improved from 50 gigaflops maximum to almost 125 gigaflops maximum in single precision, doing practically nothing to improve it, just putting it on a next-generation card. Another thing is that without using the cache, this is the average and this is the maximum, so the difference between the maximums is still quite large. So my conclusion from this is that the next generation of hardware will hopefully give me quite a lot of performance benefit for free. And this scalability -- this ability to just move it to another piece of hardware and get a speedup -- well, it's nice. Most of it is because of the bandwidth improvements of the GTX 285 versus the G80; as a matter of fact not only that, there are several other things, but bandwidth is really the dominating factor here.

So what you didn't ask and I didn't answer is the overheads. I considered only the performance of the kernel itself, as if all the data were already there, but remember that we have a pipeline of kernels, so we have to move data back and forth. If you consider the overheads, this graph shows that up to here, as long as our cache behaves well and all the data fits in, the kernel execution time is actually comparable to the time it takes to prepare the metadata for this cache on the CPU. So if we just went the straightforward way, and for every invocation moved the data back and forth, prepared the metadata and then invoked the kernel, all these gains would amount to nothing at the application level.

So, the conclusions of the first part -- and I will be finishing in about five minutes, and then may I take another three minutes of your time, I hope -- the use of shared memory is a good thing. In this work we tried to make the cache decisions on the CPU for the runtime behavior on the GPU. But what if we have something like ray tracing? Is there any chance of using this type of architecture for ray tracing? Well, I believe there is no way, because if you have random access and you cannot predict it, there is no way you would be able to do this on the GPU. Well, fortunately, the folks at NVIDIA realized that and designed a hardware cache, so we'll see how it works when it comes out.

Okay. So just to remind you what we were working on: an application with many different kernel invocations, where the output of each kernel is pipelined to the next invocation, in the form of a tree. This tree is traversed in a kind of reverse DFS [inaudible], so it starts here and here.
You cannot go up, because you have dependencies, so you have to satisfy those dependencies by computing this kernel, then that one, et cetera, and only at the end, when all the leaves are computed, do you go forward. So we performed pretty well on an eight-socket machine with 16 CPUs [inaudible] with the same type of approach: we do all the same logic, without actually prefetching data, but the locality properties of the computational algorithm still hold. On the GTX 285, for the complete application -- again, these are only preliminary results, we didn't have a chance to run a lot of inputs -- for this input we get a factor of 37 versus a single CPU, which is more or less a factor of two and a half versus the 16-CPU machine. And if you compare price/performance, then you definitely realize that this is not too bad.

>>: [inaudible].

>> Mark Silberstein: I'm sorry?

>>: This was the 8800 or the 200 --

>> Mark Silberstein: No, this is the new one. Yeah. This is the GTX 285.

>>: [inaudible].

>> Mark Silberstein: So the point is that we are not losing anything -- almost -- relative to the average performance that I showed you on the graphs, and the reason is twofold. One thing that we do is keep all the data on the GPU; we do not go back and forth. This of course creates some problems with memory management, because sometimes the data cannot fit in the GPU, so you have to swap it in and out; we implemented a kind of software swapping here. Another thing that we do -- you saw the graph of where the GPU is worthwhile or not; in some cases the CPU just works better -- is to decide, by a heuristic curve (we actually have a better algorithm now, but we haven't implemented it yet), whether to execute a given input on the CPU or on the GPU, depending on the input parameters. And what we also have to take into account in our algorithm -- this is the next stage -- is that there are memory transfers here, so we cannot just arbitrarily decide to execute something on the CPU rather than the GPU because the CPU performs better on it, because by doing that we may also force some memory footprint to be transferred, and that would ruin the whole benefit of using one or the other.

Another thing is that, when measuring the overheads, you may have seen that the overhead of creating all this metadata is very high for many kernels. So we run this computation in two threads: one thread manages the GPU and the other thread prepares the data for the GPU. In this way we can traverse the tree and, along the way, parallelize and utilize both the CPU and the GPU.

And the last piece of information that I want to deliver, quickly, is the use of TSUBAME. TSUBAME is a huge cluster with a very fast InfiniBand-based interconnect, tons of storage attached, and 620 Tesla GPUs plugged into its eight-socket, 16-CPU nodes. What we did on TSUBAME was a two-level parallelization. One level was to split this huge workload into small independent pieces with relatively low parallelization overhead; this made it embarrassingly parallel. It's the same type of approach we use in grids for this workload. And then, at the node level, we invoked it either on eight CPUs -- so we split the node essentially evenly between eight CPUs -- or on a GPU; there were two GPUs attached to a node, so we had these nodes with 16 CPUs and two Tesla GPUs attached each. This brings it to a nice number of thousands. So this is the idea: we had a single master.
He maintained the queue, and we split each node into four independent parts, and each GPU got its own workload. This was our way to manage the heterogeneity of these machines. So, the preliminary results -- again, this is not really something I'm proud of, because there is a lot of room for improvement -- on a single CPU something like 5.5 CPU-years would be required for this workload, whereas the speedup here was more or less 2,487. Now, this number means nothing, in the sense that what is really important is what we get from the GPUs. So essentially, if you look at what speedup you get, you get more or less that one GPU is equivalent to 16 CPUs; on this input we got more or less a factor of 16 for a GPU versus a single CPU. Now, again, this is not something I'm really proud of; there are a lot of things I can do here, and probably I will. And this work is sponsored by you guys. Thank you so much for having me here. I will take some questions if I have a few minutes. [applause].

>>: So the problem you're trying to solve seems identical to what people faced way back when, trying to do math on a machine with 16 or 64K of memory and a hard drive. The numbers are larger now, but the ratios may be similar. Is there any [inaudible]?

>> Mark Silberstein: So you're talking about out-of-core computations, like you --

>>: Computations that don't fit in core.

>> Mark Silberstein: Oh, yeah. So essentially, I should confess I'm not really familiar with that work. From what I heard -- and before reading anything I asked people who know -- it seems that the tradeoffs were really different, so the ratios were kind of different. So I cannot really say; probably some techniques can be borrowed, and this is something to work on.

>>: So assume that GPUs get L1, L2 caches. What's going to be the marginal benefit of [inaudible] software caches that are precompiled for [inaudible]?

>> Mark Silberstein: I think the benefit will be quite substantial. Because the texture cache that I'm showing here is not really a cache: it doesn't reduce latency, it just improves the bandwidth, essentially, but it doesn't reduce latency. And L1 will definitely reduce the latency of access to the data. So I expect a really significant benefit from that. However, you never know, because on the [inaudible] you will actually be able to split between 16K of L1 and 48K of scratchpad, or vice versa. And who knows, maybe in my case it would sometimes be worthwhile to go with my software-managed cache and it would perform better, because I don't know what the replacement policy will be on the L1 cache.

>>: [inaudible] key here.

>> Mark Silberstein: I'm sorry?

>>: I think that might be key here, because [inaudible] not a structure of [inaudible].

>>: [inaudible].

>>: You might be caching something for a longer term [inaudible].

>> Mark Silberstein: Yes, that's true. But on the other hand, if you look at what I'm getting from the texture cache, it seems like I'm still getting something from there, and it's clearly beneficial. So I don't know. I don't know. I'm very curious to see.

>>: The numbers that you use, the sizes of the texture caches and the [inaudible] scratchpad [inaudible]?

>> Mark Silberstein: I have no idea about the size of the texture cache. I know that the working set is supposed to be 8K. I didn't see any correlation between the amount of data that I am accessing and this 8K. I don't know how this texture cache works.
And the amount of shared memory that I was working with is exactly the same as on the G80. What is different, though, is the number of registers available per thread. This workload, with texture and my cache machinery, requires 30 -- in some cases 28 -- registers, and you would not be able to get that on the G80: you would get your registers spilled out, and that would ruin everything. So that's why I was only able to do these experiments now.

>>: [inaudible] how do you describe the bucket [inaudible]?

>> Mark Silberstein: The bucket is essentially mapped to two things. There is metadata describing which variables of the bucket are there, and there is the data itself. The variables are fetched from another large array of all the variables, so each function has a kind of array of pointers to the right variables, and there are some computations beforehand, quite nasty ones, to find the indexing. But it's not a big deal. It's just --

>>: The functions themselves, are they [inaudible] functions?

>> Mark Silberstein: The functions themselves?

>>: The functions themselves over the [inaudible], are they --

>> Mark Silberstein: They're arrays.

>>: They're just arrays?

>> Mark Silberstein: Yeah, yeah, because these are discrete variables. So they're just arrays.

>> Laurent Visconti: Let's take one more question. Okay. Thank you.

>> Mark Silberstein: Thank you very much. [applause]