
>> Laurent Visconti: Well, good afternoon. It's my pleasure to introduce Mark
Silberstein. He's a last year PhD student at Technion, Israel Institute of
Technology. And Mark's talk today will be on execution of memory-bound
workloads on GPU, via software-managed cache. Okay.
>> Mark Silberstein: Thanks a lot for the introduction and thanks for having me here.
It's a huge honor. I know you guys are very difficult to impress. You have a lot of
nice stuff going on here. But I'll try.
So this work has been done in UC Davis in early 2007 and then I continued
working on that in Haifa in the Technion. And in UC Davis it was John Owens
and Anjul Patney who helped me with that. John Owens is a professor at Davis
now. And of course my advisors Dan Geiger and Assaf Schuster at the Technion.
So briefly, so if you think of memory demanding applications on GPUs, the best
workload ever you can do on GPU is when you have a lot of operations per
single memory access. This is because of the typical -- this is because of the
memory architecture on GPUs. So this is called high arithmetic intensity.
Now, this is the best workload you can get, and with it you would be able to get
the maximum out of the GPU. A nice workload is one where you do have a memory
reuse pattern, so you probably can reuse your memory in a certain way, but this
pattern, the access pattern, is known at compile time. So typically for matrix
multiplication you do blocking in order to fit your computations to the cache. And
the block size is usually determined at compile time, so it's kind of [inaudible]
and it doesn't depend on the input.
What you typically do, you use this scratch pad memory on GPU in a clever way
so you prefetch the data that you're going to use several times in order not to get
to the global memory every time, and by prefetching it, you essentially make it
available to the computation on chip.
Now, the interesting thing is that multiple threads work collaboratively on the
prefetched data, and then they switch the set of data they are working on and go on
to another piece of data, et cetera, et cetera.
The tough workload is when you don't have this luxury of knowing memory reuse
pattern at compile time. And this is the workload that we'll be talking about.
So in a nutshell, we can think of this scratchpad memory that is available to
several threads on the GPU -- to blocks of threads on the GPU -- as a cache, a
regular cache. And although this is not particularly new, and many people do
consider this as a cache, we go one step further and try to analyze our workload
as if it were a real cache, and we also try to build machinery to allow this cache
to be configured at runtime by selecting the appropriate cache policy depending on
the input, one which is optimal for a specific input.
So in my talk today, I will describe our solution to this particular problem using
the example of sum-product computations. And I will show some old results that
we have had since 2007 on the G80 and some new results that we have on the
GTX 285. Obviously we do not have results on [inaudible], not yet. But I'm hopeful
that we will.
And I will also briefly mention the things that we did with the TSUBAME
supercomputer, which has 620 GPUs. It's a cluster with a lot of GPUs, and we
used it for running very large problems on thousands of CPUs and 120 GPUs.
So I will briefly mention what we did there.
So, in essence, we have a very, very simple performance model that we can think
of on a GPU. What we want to do is to be able to analyze our workload in
advance, without actually writing any piece of code, to know what will be the
best performance we can get if we do it in an optimal way. So essentially,
if the data supply is fast enough to keep all the ALUs busy,
well, this is a compute-bound workload and it's easy.
However, in the other case, on a GPU you actually have to ship your data, and you
saturate the bandwidth. And by saturating the bandwidth, you are limiting your
performance. So can we actually evaluate, with kind of trivial arithmetic, what is
the maximum that we're going to get? Well, of course, because basically our
expected performance is going to be more or less our arithmetic intensity -- which
is this algorithm-specific parameter, the number of operations per memory access
that we do -- multiplied by the bandwidth, right? And then we just take the minimum
between the arithmetic intensity times the bandwidth and the best ALU performance
we can get.
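Written out, the model described here is roughly the following (the notation is mine, not from the slides):

$$
\text{expected performance} \;\approx\; \min\!\left( \underbrace{\frac{\#\,\text{operations}}{\#\,\text{memory accesses}}}_{\text{arithmetic intensity}} \times \frac{\text{memory bandwidth}}{\text{bytes per value}},\;\; \text{peak ALU rate} \right)
$$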
So this is a kind of very, very simplistic model. But it turns out that when we just
take a very, very simple workload -- the trivial one that is used by NVIDIA all over
just to show a trivial kernel, which is just element-wise multiplication -- what you
see from this analysis is that the arithmetic intensity here is one third, right?
We have three memory accesses, one for the write and two for the reads, and one
operation.
So for this arithmetic intensity, the expected performance that we're going to get
is only seven gigaflops, which is only 5 percent utilization of a G80 card, and for
a more advanced card like the 285 we would get slightly more than that, but it
would still be like two percent of the available ALU capability of the GPU.
Okay. So it's 12 gigaflops, which is almost one percent, because this card
provides a teraflop of performance.
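To see roughly where those numbers come from, here is the back-of-the-envelope calculation with approximate published bandwidth figures (about 86 GB/s for the G80-based 8800 GTX and about 159 GB/s for the GTX 285; these specific specs are my addition, not from the talk):

$$
\tfrac{1}{3}\times\frac{86\ \text{GB/s}}{4\ \text{bytes}} \approx 7\ \text{GFLOP/s (G80)}, \qquad
\tfrac{1}{3}\times\frac{159\ \text{GB/s}}{4\ \text{bytes}} \approx 13\ \text{GFLOP/s (GTX 285)}
$$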
So with this simplistic model, we already know this workload is going to be very,
very GPU-unfriendly, if I may. Now, if we think about the scratchpad as a cache,
we can incorporate it into this simplistic analysis by considering not the number
of memory accesses per operation but rather the memory access cost, right? This is
what we do when we consider cache misses and cache hits, right? So a cache hit is
going to cost nothing, and a cache miss costs a memory access. Right? And if we
can formalize the cache hit model for our application, we can think of a way to
analyze the workload that does consider shared memory. Okay? So for this case of
matrix multiplication there is O(N) reuse. Without the cache our arithmetic
intensity would be only a constant, it would not depend on the input, and in this
way we would be limited to this very low utilization of the GPU. Whereas with a
cache -- well, with only compulsory misses -- you would get a workload that becomes
compute-bound, because the arithmetic intensity would depend on the input
parameters. Okay. So it's kind of perfect scaling in this sense, because the more
data we have, the better performance we're going to get.
Okay? So with this simple model we already know that for matrix multiplication
there's no way we would be able to do that without using scratchpad. Now I'll
turn to the application that we are using to evaluate our approach.
So the motivation for using sum-products is inference in very large Bayesian
probabilistic networks for parametric genetic linkage analysis. This application
allows geneticists to analyze their data and find disease-provoking genes. So
essentially this computation is exponential in some of the input parameters, and
that's why it may require a lot of computational time. So what we want to do is
try to use the GPU for computing that.
Now, our other approach is to use multiple distributed computing environments,
opportunistic grid environments, and there is actually a working system that uses
them. But that's another subject.
So what it is: you can think of sum-product as a reformulation, a more generalized
way to think of matrix multiplication. So what we have here is a product of two
matrices, and essentially this is just the formula for each one of the values of
the resulting matrix. Now, if I think of X, Y, Z as variables over finite domains,
then essentially A(X,Z) and B(Z,Y) turn out to be functions of these variables.
And these functions map from the discrete domain of X, Y, Z to the actual values
that are part of the matrix.
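In symbols, the matrix product described here becomes (standard sum-product notation, not copied from the slides):

$$
C(x,y) \;=\; \sum_{z} A(x,z)\,B(z,y), \qquad x \in \mathrm{dom}(X),\; y \in \mathrm{dom}(Y),\; z \in \mathrm{dom}(Z).
$$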
So essentially, when we represent the matrix product in this way, we can think of
A(X,Z) and B(Z,Y) as being multiplied in some way, right, and then marginalized,
or summed, over the shared dimension, right? So this is just an abstraction for
matrix multiplication. So what we do here -- and this is essential to understand,
because essentially this gives the idea of random access to the input -- is that
we have two vectors, okay, where 0,0 here and 0,0 there mean (X,Z) and (Z,Y)
respectively. So what we do is compute all the different combinations of these
two, where we fix all those variables that are not shared and go over the different
values of the shared variables here. Okay.
So essentially we get the product over all the different values of the variable Z
when we fix X,Y to be 0,0, then X,Y to be 0,1, et cetera, et cetera.
Now, this is the first part, the multiplication. And the second part is actually
to take these different products and sum them. And if you think about what we do
here, we sum the entries that refer to different values of Z. Right? So we add
this and this, this and this, this and this, and the result is exactly regular
matrix multiplication.
However, the point here is that now we can generalize to a multiplication of
multiple matrices, or a sum-product of a set of general functions that cannot be
reduced to simple matrices, summed over the shared dimensions. Say we have three
matrices like that, and we sum over W and Y. We may do it in two ways. One way is
to generate this function over four dimensions and then sum over W and Y. This
would require O(N^4) operations if the number of values for each variable is N.
However, we can instead use the distributive law, and in this way we would first
compute this product and marginalize over W, and then take the result of that and
marginalize over Y. This is much more computationally efficient, and this approach
is called bucket computation, where you essentially split your input into buckets,
first compute one bucket, and then push the output of this computation to the next
one. And this way you save quite a lot. Even in this simple computation you
already save a factor of O(N) operations. Right?
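As a concrete illustration of that distributive-law rearrangement (the exact argument lists of the three functions are not visible in the transcript, so the signatures below are illustrative):

$$
\sum_{w,y} f_1(w,x)\,f_2(w,y)\,f_3(y,z)
\;=\;
\sum_{y}\Big(\underbrace{\sum_{w} f_1(w,x)\,f_2(w,y)}_{\text{bucket of }W}\Big)\,f_3(y,z).
$$

Evaluating the left-hand side directly builds a table over four variables, $O(N^4)$ work, while the bucketed form on the right costs $O(N^3)$, which is the factor-of-$N$ saving mentioned above.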
So in the general case, you have a lot of functions, and you sum over some of the
variables. Right? Interestingly enough, this is kind of a general representation
of a general problem, where you can substitute the summation by max/min or boolean
OR, and the multiplication by summation or boolean AND, and then you get a totally
different problem from a totally different domain.
So -- but the fundamental structure of the problem remains the same. And the
fundamental task is to find the best order of splitting things into buckets.
Now, what I'm going to talk about today is how to accelerate these computations.
So let's zoom in on one bucket. What I want to compute is a single bucket, and I
want to compute it efficiently. So I split my input in a certain way -- I won't
zoom in on that particular subject, but let's assume I have some way of splitting
things into buckets.
Now I have one bucket, and I want to compute it. So one bucket is a set of
functions marginalized over a subset of the variables. Okay? I don't split it
into more buckets; this is the basic unit. So now what I want to do is take this
bucket, compute it on the GPU, and push the output of these computations to the
next GPU invocation, the next kernel, with other parameters. And essentially this
computation forms a kind of tree, so it's essentially pipelined.
Okay.
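In other words, each bucket evaluates an expression of this shape (the notation is mine):

$$
\psi(\mathbf{x}_{\mathrm{out}}) \;=\; \sum_{\mathbf{x}_{\mathrm{sum}}} \;\prod_{i=1}^{F} f_i(\mathbf{x}_i),
$$

where each input function $f_i$ depends on a subset $\mathbf{x}_i$ of the bucket's variables, $\mathbf{x}_{\mathrm{sum}}$ are the summation variables, and the output function $\psi$ is what gets pushed to the next kernel invocation in the tree.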
So now, with this in mind, if we do the analysis that we did for the simple matrix
multiply -- which, as we've seen, is a particular case of sum-product -- then in
the worst case, where there is no reuse, I'm expecting to get at most 22 divided
by one plus the number of functions gigaflops, which means that if I have 10
functions in my bucket, I'm going to get only two gigaflops from this thing. No
matter what I do, the GPU would give me only this performance. However, if I do
use the scratchpad in a certain way, I'm going to get a significant improvement in
the ALU utilization.
All right. So I hope I convinced you that without the cache I would not be able to
improve this performance. Okay. So once we have that in mind, I want to zoom in a
little bit, just to remind you how a GPU is built. So a GPU essentially is a
[inaudible] that sits across the PCI bus. Okay. These numbers may be changing,
increasing over time, but the essential basics seem likely to remain for the next
two years.
So in order to do any computation, you first have to push your data to the GPU
from the CPU. Then you invoke what's called a kernel, which runs on these SPs,
stream processors, and then the output of the kernel's execution has to be passed
back over the PCI bus to the CPU.
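A minimal sketch of that host-side flow in CUDA (illustrative names, not the talk's actual code): copy the input over the PCI bus, launch a kernel on the stream processors, copy the result back.

```cuda
#include <cuda_runtime.h>

// Stand-in for the real computation; one thread per element.
__global__ void exampleKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void runOnGpu(const float* hostIn, float* hostOut, int n) {
    float *devIn, *devOut;
    cudaMalloc(&devIn,  n * sizeof(float));
    cudaMalloc(&devOut, n * sizeof(float));
    // CPU -> GPU over the PCI bus.
    cudaMemcpy(devIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);
    // The kernel runs on the stream processors.
    exampleKernel<<<(n + 255) / 256, 256>>>(devIn, devOut, n);
    // GPU -> CPU over the PCI bus.
    cudaMemcpy(hostOut, devOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
}
```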
Now, have a look here. There is no cache, right? So this is kind of a weird thing
that we have to deal with in this simplistic representation of GPUs. So let's make
it a little bit more complex, and that's what NVIDIA did with their CUDA
architecture. They said, okay, let's split all these SPs into pieces and provide
each set of SPs with a little bit of shared memory, a little bit of on-chip memory,
which is very fast and provides huge bandwidth. Okay? However, this memory must be
managed by the software: as a part of your kernel, as a part of the computation,
you have to squeeze in the cache logic, right? So if you do that in a structured
way and use it as a scratchpad, indeed, fine; but we want to do the analysis, so we
want to be able to treat it as a real cache here. Right. So how do we do that?
Just to analyze the challenge: in a typical cache-aware algorithm, when you write
something for the CPU, you are trying to implicitly avoid slow accesses to memory,
right? So you try to be as local as you can be. And essentially you implicitly
think about the cache, how it works, and then try to lay things out exactly as the
cache expects, so you will be ready for that. Okay?
For the GPU, you must explicitly stage data into the cache and keep it in the cache
as long as you need it.
But there is this chicken-and-egg question, right? You design both your cache
algorithm and your computational algorithm, and which comes first, right? Just
from the methodological perspective, this is a question that should be answered.
So from our experience, the workflow that one should use and keep in mind when
designing these algorithms is, first of all, to devise an algorithm with temporal
locality, because no matter what your cache logic is, if you are not temporally
local, nothing will help. No cache would help in this case.
Then you devise your caching algorithm according to the cache access pattern that
you have in mind in your computational algorithm. And the key point here is to
reduce the overhead of implementing the cache in software. The next thing you do
is take a serial algorithm which is very local, optimized for locality, and
optimize the data reuse within each set of threads. Because, as we've seen, this
shared memory is shared within a subset of threads, and these threads have to work
collaboratively on this piece of memory.
And this is the main challenge, which is totally different from the one that we
have on CPUs. And then you do multiple passes of this workflow, trying to adjust
things. Now, what may make you wonder is that I'm mentioning temporal locality
here, but what about spatial locality? So essentially spatial locality seems to be
not that important, because you write the caching algorithm yourself, so you can
handle that in the algorithm of your cache. But there is a catch. If you are not
spatially local, all this prefetching that goes on will just be very, very slow,
because the GPU accesses parallel banks of memory, and if your accesses are not,
what's called, coalesced, so that you fully utilize the bandwidth, then there is no
way you will get the maximum out of the bandwidth of the global memory. Okay. So
you must also be spatially local, right?
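A hedged CUDA sketch of what staging data into the scratchpad with coalesced accesses looks like (the names and the toy reuse pattern are illustrative, not the actual kernel): consecutive threads read consecutive addresses, and the block then reuses the staged values from shared memory.

```cuda
__global__ void stagedKernel(const float* gIn, float* gOut, int chunkSize) {
    extern __shared__ float cache[];        // software-managed on-chip "cache page"
    int base = blockIdx.x * chunkSize;

    // Cooperative, coalesced prefetch: thread t loads elements t, t+blockDim.x, ...
    for (int i = threadIdx.x; i < chunkSize; i += blockDim.x)
        cache[i] = gIn[base + i];
    __syncthreads();                        // the cache page is now populated

    // Toy computation standing in for the real reuse: every thread reads the
    // whole staged chunk from fast shared memory instead of global memory.
    float acc = 0.0f;
    for (int i = 0; i < chunkSize; ++i)
        acc += cache[i];
    if (threadIdx.x == 0) gOut[blockIdx.x] = acc;
}

// Launch with the cache page size passed as dynamic shared memory, e.g.:
//   stagedKernel<<<numBlocks, 256, chunkSize * sizeof(float)>>>(gIn, gOut, chunkSize);
```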
So with this workflow in mind, what's left, right? We have locality. So what do we
need to do in our cache? What do we need to do in software? Replacement policy,
right? We want to know what we bring in, what we pin, and what we move out. Okay?
So we need to handle capacity and conflict misses here. Right? But think of
deciding this on demand in the GPU kernel. These are inherently serial decisions
to make; this would just ruin the performance. So what we can do is precompute it,
depending on the input, and then plug it into the generic mechanism that we already
implement as a part of the kernel.
So for different functions we may have different amounts of data in the cache.
And basically what we implement here is a read-only, direct-mapped cache, where
the output is not cached because it is not reused. And the input is cached in a
certain way: we know exactly what amount of data from each function we're going to
use in this particular set of threads. So we build our cache in a way where each
function is cached with a different amount of data. Right? And this set of data we
call a cache page, because this is the set of data that one thread block is going
to work on.
So here we already see that the logic of accessing functions in different places is
generic. The only thing we need to know is where each function resides. And this
logic is precomputed on the CPU before the computation and plugged into the
algorithm.
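One possible shape for that precomputed per-function metadata, purely as an illustration (the struct and field names are mine, not the paper's):

```cuda
// Filled in on the CPU, one entry per input function, and shipped to the GPU
// (for example through constant memory) before the kernel runs.
struct FunctionCacheDesc {
    int cached;        // 1 if this function's slice is placed in shared memory
    int smemOffset;    // offset (in values) of the slice inside the cache page
    int numValues;     // number of values of this function used by the block
    int globalOffset;  // where the block's slice starts in global memory
};
```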
So briefly, what do we need to cache in our cache page? So think of three
functions like that, and think of the cache page being over four variables. So we
want to cache all the data that is accessed when we go over all the assignments of
these -- I'm sorry, four variables. So the size of the cache depends on the number
of variables that we select. And the size of the cache in this particular case,
assuming that each variable has two values, will be four for this one, eight for
that one and eight for that one. Okay? These are the values we're going to cache.
And what changes between the thread blocks? These values, outside of this window.
So between the thread blocks we go over all possible values of X5, X6.
Now, as you can see, the computation of the expected cache size -- well, it is
input dependent, right? The size of the cache is dictated by the input. But what
if in this case we cross the boundaries of the available shared memory? We compute
it this way because this is dictated by the logic of the program, but now the cache
size is 16 values by four bytes. What if our memory is only 16 bytes? What do we
do then?
Okay. So these are capacity misses, right, because we have more data to cache
than we have space for. So what we could do is check, every time we access
something, whether the data is there, and if it is not there, prefetch it. But
that is very expensive; you would ruin all the benefit of having the cache in
place.
So instead we go and say, all right, we will precompute the cache misses that
we're going to get because of capacity, and we will avoid caching those functions
that cause the required cache space to be too large. We will avoid caching them
altogether. And in doing that, we want to select the set of functions that
minimizes the total expected cache miss rate. Okay?
So this can be reduced to a very, very simple binary knapsack problem, which with
easy heuristics can get you within a factor of two of the optimum. So we solve
this binary knapsack problem on the CPU and plug the result into the GPU code,
depending on the input.
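A host-side sketch of what such a selection heuristic could look like (the greedy density ordering and all names here are illustrative assumptions, not the paper's exact algorithm): order the functions by expected accesses saved per byte of scratchpad and pack them until shared memory runs out.

```cuda
#include <algorithm>
#include <vector>

struct FuncInfo {
    int    id;
    size_t bytesNeeded;      // size of this function's slice in the cache page
    double expectedAccesses; // how often the thread block is expected to touch it
};

// Greedy 0/1 knapsack heuristic; comparing the greedy packing against the single
// most valuable item is the classic way to get within a factor of two of optimal.
std::vector<int> chooseFunctionsToCache(std::vector<FuncInfo> funcs,
                                        size_t sharedMemBytes) {
    std::sort(funcs.begin(), funcs.end(), [](const FuncInfo& a, const FuncInfo& b) {
        return a.expectedAccesses / a.bytesNeeded > b.expectedAccesses / b.bytesNeeded;
    });
    std::vector<int> cached;
    size_t used = 0;
    for (const auto& f : funcs) {
        if (used + f.bytesNeeded <= sharedMemBytes) {  // fits: cache it whole
            cached.push_back(f.id);
            used += f.bytesNeeded;
        }                                              // otherwise leave it in global memory
    }
    return cached;
}
```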
So the very simple parallel compute structure is that every thread handles one
output, and in handling one output, it accesses a certain amount of data. As we
have seen, this data is cached. Okay?
Now, different threads work collaboratively on this piece of data to reuse it
efficiently and to avoid thrashing of this cache. The interesting thing here is
that we do not have any cache conflicts. Why? Because the threads discard the
cache when they cross the block boundary, and we cache all the data that will be
accessed in one thread block. So by having all the data cached in one thread block
and discarding it across the block boundaries, we make sure that there are no cache
conflicts. It's not efficient, though, because we increase the cache miss rate this
way. Okay?
And I'll show you in a minute that we can improve that.
So what is the kernel structure? It turns out that this is a fairly typical kernel
structure that one can find in many kernels, and you have definitely seen something
like it. So first you compute which chunk you're going to work on. Then you
populate your cache, and then you go from zero to the number of summation values
that you want to iterate over. And for each summation value, you iterate over the
functions that you have, and when you are accessing a function, you ask yourself
whether the function is in the cache or not; if it is, then you just get it from
the cache, and if it is not, then you get it from the main memory.
So once you get this value, you aggregate the values of the different functions in
the inner iteration, and then you sum over all the summation values. Okay?
If you think about it, this is exactly the code for matrix multiplication in a
simplistic way, okay, where you just sum over the shared dimension. But it is
general for any number of functions that you may have. And then after you've done
that, you just write the output out, and that's it. Okay.
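A schematic rendering of that structure in CUDA (this is a reconstruction from the description above, not the actual kernel; the indexing into each function is deliberately simplified, since the real code derives it from the bucket's variable layout):

```cuda
struct FunctionCacheDesc {        // same descriptor idea as sketched earlier
    int cached, smemOffset, numValues, globalOffset;
};

__global__ void sumProductKernel(const float* const* funcData,  // one array per function
                                 const FunctionCacheDesc* desc, // precomputed on the CPU
                                 float* output,
                                 int numFunctions, int numSumValues, int numOutputs) {
    extern __shared__ float cache[];
    int outIdx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output value

    // 1. Populate the cache page for this block (cooperative, coalesced loads).
    for (int f = 0; f < numFunctions; ++f) {
        if (!desc[f].cached) continue;
        for (int i = threadIdx.x; i < desc[f].numValues; i += blockDim.x)
            cache[desc[f].smemOffset + i] = funcData[f][desc[f].globalOffset + i];
    }
    __syncthreads();
    if (outIdx >= numOutputs) return;

    // 2. Sum over the summation values; for each, multiply all function values,
    //    fetching each either from the software cache or from global memory.
    float sum = 0.0f;
    for (int s = 0; s < numSumValues; ++s) {
        float prod = 1.0f;
        for (int f = 0; f < numFunctions; ++f) {
            int idx = s;  // placeholder index into function f's slice
            float v = desc[f].cached ? cache[desc[f].smemOffset + idx]
                                     : funcData[f][desc[f].globalOffset + idx];
            prod *= v;
        }
        sum += prod;
    }
    output[outIdx] = sum;
}
```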
So, as I said, we don't have cache conflicts, because we ensured that our threads
access the same piece of data and never access different cache pages. But, as I
said, there is this problem of the increased cache miss rate, because across the
block boundaries we just lose everything that we prefetched. So what we can do now
is improve this by realizing that we can handle several cache pages in the same
thread block, but then some of the functions would have to be reloaded. How do we
know which functions to reload? Well, you might have guessed: we decide that on
the CPU and plug it into the GPU.
Another thing is that if you have too many summation values, your cache is going
to explode, right? Because every function has to have all of its summation values
prefetched, and if you have too many of them, there is no way you would be able to
make it fit. So what you can do is what everyone does in the matrix world, which
is slicing, right -- tiling.
So you can tile your input across many different invocations. In the first kernel
invocation you deal with the first part, and then you invoke your kernel a second
time and aggregate with the previous kernel invocation.
So the interesting question to ask here is how to determine the tile sizes.
Because essentially, if you have this tiling and each of your tiles fits in memory
but the kernel invocation is very, very short, then you're going to miss all the
benefit of having your data cached, because the overhead of kernel invocation will
dominate this benefit.
So there is some heuristic that we apply to determine the tile size, predict the
expected kernel invocation time, and try to optimize that.
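A hedged host-side sketch of the tiling just described (the kernel name and signature are illustrative, and its body is stubbed out): each launch handles one slice of the summation dimension and accumulates its partial sums into an output buffer that stays resident on the GPU.

```cuda
#include <cuda_runtime.h>

// Stub standing in for the real tile kernel; it would add the partial sums for
// summation values [firstSumValue, firstSumValue + count) into out[].
__global__ void sumProductTile(const float* in, float* out,
                               int firstSumValue, int count, int numOutputs) {
    // body omitted -- see the kernel sketch above
}

void runTiled(const float* devIn, float* devOut,
              int numOutputs, int totalSumValues, int tileSize) {
    cudaMemset(devOut, 0, numOutputs * sizeof(float));  // partial sums accumulate here
    int threads = 256;
    int blocks  = (numOutputs + threads - 1) / threads;
    for (int start = 0; start < totalSumValues; start += tileSize) {
        int count = (totalSumValues - start < tileSize) ? (totalSumValues - start)
                                                        : tileSize;
        sumProductTile<<<blocks, threads>>>(devIn, devOut, start, count, numOutputs);
    }
}
```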
So there is one catch, though. The catch is that we are trying to plug in the data
from the CPU to be used during the cache population, right, populating the cache.
Now, when we populate the cache, where do we get the metadata from? Well, from
memory. So what that means is that we are trying to improve access to slow memory
while making it even slower, because we add more memory accesses along the way.
Too bad, right? So what we can do is use the texture and constant memories here,
because some of them are cached and some of them reduce the bandwidth load on the
main bus, like texture memory, and this is really helpful. So when we use that,
and we also reuse the same data many times from the fast memory, from the
scratchpad, we amortize the overhead of accessing the slow memory in order to push
everything into the cache.
Another thing is that we were forced to do dynamic loop unrolling. As you can
see, the loop goes from zero to the number of summation variables, right? So we
don't know how many summation variables to expect. So what we can do is dynamic
loop unrolling, or tiling, where we just look at how many are left: if there are
more than four values left we unroll by four, otherwise we unroll by two. So
essentially we do this factorization of the number of remaining iterations and do
the best we can for unrolling it.
Unrolling too aggressively ruins the performance, because some of the registers
are spilled out to global memory.
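An illustrative device-side sketch of that dynamic unrolling (the function is mine; the actual implementation, as mentioned later in the talk, generates separately compiled kernel variants): the trip count is only known at run time, so the loop is processed in unrolled chunks of four, then two, then a scalar remainder.

```cuda
__device__ float accumulateUnrolled(const float* vals, int n) {
    float sum = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4)                 // unrolled by 4 while enough values remain
        sum += vals[i] + vals[i + 1] + vals[i + 2] + vals[i + 3];
    for (; i + 2 <= n; i += 2)                 // then by 2
        sum += vals[i] + vals[i + 1];
    for (; i < n; ++i)                         // scalar remainder
        sum += vals[i];
    return sum;
}
```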
So we got to the very interesting part, right? So these are the results. A small
warning: these results are from the old implementation, but we have some brand new
results, and I wonder whether you're going to see the catch in them. I mean, they
are too good to be true. But maybe you can find the problem.
So, hardware: a single-core CPU, an Intel Core 2 with a large cache, and a G80
GPU -- this is the classic old-generation CUDA card -- with CUDA 1.0. And we
compared against our own CPU kernel, and this kernel performed quite close to
ATLAS. So if you consider that this is not just regular matrix multiplication, but
there are a lot of other overheads involved, with the index computation and all
that, getting within five percent of ATLAS is not that bad.
We had two implementations: one in linear scale, another in log scale. The log
scale is essentially because we wanted to avoid underflows -- to remind you, our
input comes from probabilistic networks, so the values are below one. So if you
multiply too many times and then sum too many times, you sometimes get zero,
although it's definitely not zero. So the trivial way to avoid underflow is to use
log scale, and this is quite often used, although it's extremely inefficient. So
what we do is replace multiplication by summation and summation by exp and log,
and then we just preprocess all the data beforehand and map it to log scale.
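A hedged sketch of what the log-scale arithmetic amounts to on the device (this is a generic log-sum-exp style rendering, not the talk's kernel; the max-shift is a standard guard against underflow):

```cuda
#include <math.h>

__device__ float logMul(float logA, float logB) {
    return logA + logB;                              // a product in log space is a sum
}

__device__ float logAdd(float logA, float logB) {
    // log(exp(a) + exp(b)), shifted by the max so the exponentials do not underflow
    float m = fmaxf(logA, logB);
    return m + logf(expf(logA - m) + expf(logB - m));
}
```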
So the interesting part -- definitely I would want to stop here and say, okay,
this is my performance. But that would be unfair. Transcendental functions on a
graphics processing unit are mapped to special function units; they are performed
in hardware. So when we got this nice speedup -- a peak number, though, but still
three orders of magnitude of speedup -- well, we thought that there was some
problem with the computation of the speedup; it could not be true. So then we ran
a lot of microbenchmarks just trying to understand where the speedup comes from.
And it turns out that in this case a factor of 25 is a direct consequence of using
the software-managed cache; without it we lose performance by a factor of 25. A
factor of two is just from using the hardware -- just using the GPU that way is
better than using the CPU. And another factor of 25 is because of the special
function units. So essentially exp is more or less six times slower than a
summation on the GPU, and 200 times slower on the CPU. Okay? So there is a factor
of more or less 35 between the speeds of this particular operation. So when you
have a lot of transcendental functions, this is what makes a huge difference.
Now, you know that the special function unit's abilities are limited -- there is
only one special function unit per set of ALUs -- and this has been dramatically
improved in the latest generation of CUDA cards, but even here it's quite amazing.
Now, on average, as you can see, the performance is not as bright as we would want
it to be. And the worst case is really devastating. Why is that? Well, as we saw,
the reuse determines the expected performance. So if our input doesn't have any
reuse, we just saturate the bandwidth. So we are out of luck; we got a bad input.
So we just should not run it on the GPU, full stop. This is a decision that we're
going to have to make before running this in a production setup. However,
obviously, in some cases it's definitely worthwhile.
Now, this is single precision performance on this card. So this is the bottom line.
Yes?
>>: [inaudible].
>> Mark Silberstein: I'm sorry?
>>: What does varying the input --
>> Mark Silberstein: Okay. So I'll show you on this graph. This is kind of a
bizarre graph with a lot of points on it, just to confuse you. So one point is one
kernel invocation, and the input actually has a huge influence -- on the CPU as
well, as you see -- because what we vary here is the number of input functions, the
number of values per function, and the overlap, as we call it, between different
functions in terms of the number of variables shared between them.
So we performed more or less 400 runs with different combinations of all those.
And this is what we get. So this is log scale, right? So as you can see, the
average is indeed more or less 15 gigaflops, but the interesting thing is that
here, on this very card, if we just map the best performance to the bandwidth that
would be required for that performance, we get 212 gigabytes per second, which
means that the original bandwidth could not have been squeezed and stretched to
this size, right?
So our cache is actually doing well. All right. But -- okay. So here is the graph.
As you can see, there is this part of it that is not really good for the GPU. And
as you might have guessed, the reason is that there is just not enough parallelism
in the data to saturate the GPU. And here, as you can see, there is this boundary
point of 50 kiloflops in the input which, you know, makes the difference between
the CPU and the GPU. And this is more or less the multiplication of two matrices
40 by 40. So it's a pretty small input if you think about it.
So how is our model working on this particular thing? So if we just take our model
and map the expected results against those that we get in practice, changing the
cache size up to 16K, okay, what we can see first of all is that our model's
approximation is quite good. I mean, the tendency is exactly what we see with the
curve, right? I mean, the absolute values are totally off, right? Because we do
not account for all these overheads of the kernel, of different things. But what
we would want to get from this is the upper bound. And this is a very strong upper
bound that you cannot cross, right? And what we want to know is whether it's
worthwhile to increase the cache size in order to improve the performance. And we
clearly see that we can do that by increasing the cache size.
So we see the tendency, we see the global picture of what we're going to get, just
from the model. The numbers are too optimistic. Okay.
As you can see, there are several knees here, right? So what are these knees?
Recall that in our cache we decide whether to cache each whole function, right? So
at this stage we have zero functions in the cache. Here we get a jump because we
managed to squeeze in one function. Here we managed to squeeze in two, and here
three. Okay? And this is the maximum that we could get.
Another parameter is the number of cache pages per thread block, right? Recall we
have this problem of cache misses every time we discard the whole cache page and
switch to another thread block. So now we try to see how much improvement we get
from having multiple cache pages per block.
So as you can see, the model actually predicts something that is beyond the ALU
performance, right, so it's really off. But again, if you scale it down by a
factor of 10, more or less, you can see that this is the boundary line where we
increase the cache size and are not getting any improvement, because basically we
are not using the cache. So this is kind of a validation curve that says, okay,
this is not an artifact; this is really due to the cache improvement.
So on this graph you can see -- this is a log/linear mixed scale again -- that we
are starting at 20-something and we get as far as 34, 35 by changing the number of
cache pages. And our theoretical model actually shows that this is what's going to
happen.
And this is the comparison of having no cache, of using texture -- you know, the
hardware cache that there is, the texture cache -- and of our cache, okay? For a
specific input we just go and increase the number of summation variables, and by
that we increase the complexity.
And the CPU we barely see here at the bottom. Yeah?
>>: [inaudible].
>> Mark Silberstein: Yeah. Yeah. That's -- yeah. The lecturer says it's a good
question when he has an answer. So this is a good question. And the reason is
that, if you just have a look, there is a clear periodic [inaudible] here. And the
reason is that we are hitting bank conflicts, bizarre bank conflicts here. Because
the data is aligned in a way such that, as you increase the number of summation
values, at some point you get all the banks accessed -- I mean, one bank accessed
by all the threads in the block. So it's kind of the worst possible bank conflict
one can get. And as you can see, the texture cache exhibits exactly the same
problem. It also dips here because of this, because of [inaudible].
So you may also ask what happens here, which would also be a good question. And
the answer is that you can see that something bad happened, and this bad thing is
that we had to drop one function out of the cache. So by reducing it by only one
function we go down to exactly the level of the regular texture cache.
Now, if you think of the texture cache -- well, it's not exactly fair to compare us
with the texture cache, because the texture cache is also shared among different
thread blocks, right? Whereas our cache is discarded across thread block
boundaries. And that's why having the texture cache behave like that here means
two things. First of all, it means that we are doing a good job with the metadata,
so the overhead is actually quite low, because the hardware cache behaves even
worse than we do. Right? And another thing is that the hardware has some
advantages over our software-managed cache, and still we come out ahead, even in
the worst case of these bizarre benchmarks.
Now, this is the effect of loop unrolling. And as you can see, the compiler is
doing a very bad job of unrolling this loop. And we actually had to -- well, in my
current implementation, I have plenty of different kernels, unrolled in all the
different combinations of summation values and numbers of matrices.
Now, it turns out that there is another parameter, whether to use the texture cache
for the input, and this parameter also requires different kernels. So I have 40
different kernels, generated automatically by my scripts and compiled in. Okay?
Because the compiler doesn't do a good job of unrolling my loops. And you see that
the benefit is huge.
Now, this is the brand new data, more or less just generated. And it might be a
little bit difficult to follow, so I'll explain. Each graph is actually an average
over 100,000 runs of the kernel, over a large range of input parameters, where I go
up to six functions and up to 500 summation variables. And this is the input
complexity, and this is the effective throughput in gigaflops, double precision, on
a GTX 285.
So, the average performance of the kernel -- I measured several combinations.
First of all, I tried the software-managed cache with tiling and the
software-managed cache without tiling. You remember tiling: I split the input into
multiple kernel invocations if it doesn't fit the cache. I also used a combination
of the software-managed cache and texture, so essentially my input is mapped to
texture memory; I fetch it from there and then put it into my cache. Right?
In this way I wanted to squeeze the most out of it. This is the boundary line of
having no cache at all, and this is no software cache but only texture. Okay? So
let me follow here. This is the boundary average performance that we get when we
have no cache at all -- okay, this is roughly 16 gigaflops. This is the case where
we do not have any tiling and we're using only the software cache without texture.
This case is the one where we are using software and texture with tiling, so it's
kind of a combination of everything. And this blue is the software cache without
tiling -- this is the best that we can get -- and this is software and texture
without tiling, okay? And red is software and texture with tiling. All right. So
what we see from these graphs is actually two things. First of all, double
precision: even on average, the double-precision performance is quite good; the
combination of all our methods together is doing not badly, and if you consider
the comparison without tiling, then it's actually impressive.
Now, the reason why it sometimes works better without tiling than with it is that
the overhead of tiling is heuristically predicted, and this heuristic is sometimes
off.
Now, that was the average performance, and this is the best performance. Okay?
So as you can see, we pretty much saturate the double-precision unit completely.
Now, what may make you wonder is that sometimes we saturate it beyond its
theoretical peak, right? And this is something that I realized only today. I was
thinking about it and checking the code, and I realized that the complexity
computation doesn't take into account a new improvement that they have in the
compiler. What the compiler does is, whenever I write A += B, and A is statically
defined above as zero, it doesn't do the addition, it just assigns the value, okay?
The previous compiler didn't do that.
Now, if I take that into account, I lose more or less 15 percent of that. I
checked that; I didn't have time to put it on the graphs. But this is the
performance that I'm expecting. So even 15 percent less than that is still not
bad, I would say. So it's 60. So I'm not saturating it all the way, but close to
that.
Now, what about single precision? Just to compare that with the single precision
on the G80. So it's actually quite impressive. There we got the best
performance -- this is the best, this is the average case. So just to explain,
right? This is the best with software and texture, this is the best with -- I'm
sorry, just let me find out what's going on here.
So this is using only texture, average, and this is only texture, maximum. And
this is using software and texture together, maximum -- all our stuff combined. So
as you can see, first of all, we improved from 50 gigaflops maximum to almost 125
gigaflops maximum, single precision, doing practically nothing to improve it --
just putting it on a next-generation card.
Another thing: without using the cache, this is the average and this is the
maximum. So the difference between the maximums is still quite large. Okay?
So my conclusion from that is that the next generation of hardware will hopefully
give me quite a lot of performance benefit for free. And this scalability -- you
know, this ability to just move it to another piece of hardware and get the
speedup -- well, it's nice. And most of it is because of the bandwidth improvement
in the GTX 285 -- and not only that, as a matter of fact; there are several other
things -- but the bandwidth is really the dominating factor here.
So what you didn't ask and I didn't answer is the overheads, right? I considered
only the performance of the kernel itself, as if all the data were already there.
But remember that we have a kernel that is part of a pipeline, right, so we have to
move data back and forth. And if you consider the overheads, this graph shows that
up to here -- as long as our cache is behaving well and all the data fits in -- the
kernel execution time is actually comparable with the time to prepare the metadata
on the CPU for this cache. So essentially, if we just go the straightforward way,
and for every invocation we move the data back and forth, prepare the metadata and
then invoke the kernel, then all these gains would be nothing at the application
level.
So, the conclusions of the first part -- and I will be finishing in about five
minutes; may I take another three minutes of your time, I hope -- the use of shared
memory is a good thing, okay. In this work we tried to make the cache decisions on
the CPU for the runtime on the GPU. But what if we have something like ray
tracing? Is there any chance of having this type of architecture for ray tracing?
Well, I believe there is no way. Because if you have random access and you cannot
predict it, there is no way you would be able to do that on the GPU.
Well, fortunately, the folks at NVIDIA realized that and designed a hardware cache.
So we'll see how it works when it comes out. Okay. So just to remind you what we
were working on: we were working on an application that has different kernel
invocations, where each kernel is pipelined to the next invocation in the form of a
tree. So this tree is traversed in a kind of reverse DFS [inaudible], so it starts
here and here. You cannot go up because you have dependencies, so you have to
satisfy these dependencies by computing this kernel, then that one, et cetera, and
only at the end, when all the leaves are computed, do you go forward.
So we performed pretty well on an eight-socket machine with 16 CPUs [inaudible]
with the same type of approach. We do all the same logic without actually
prefetching the data, but the locality properties of the computational algorithms
are still maintained.
On the GTX 285, the complete application -- again, these are only preliminary
results; we didn't have a chance to run a lot of inputs -- for this input we get a
factor of 37 versus a single CPU, which is more or less a factor of two and a half
versus the 16-CPU machine. And if you compare price-performance, then you
definitely realize that this is not too bad.
>>: [inaudible].
>> Mark Silberstein: I'm sorry?
>>: This was the 8800 or the 200 --
>> Mark Silberstein: No, this is a new one. Yeah. This is the GTX 285.
>>: [inaudible].
>> Mark Silberstein: So the point is that we are not losing anything -- almost --
relative to the average performance that I showed you on the graphs, and the reason
is twofold. One thing that we do is keep all the data on the GPU; we do not go
back and forth. This of course creates some problems with the data, with memory
management, because sometimes the amount of data cannot fit into the GPU, so you
have to swap it in and out. So we implemented some type of software swapping here.
And another thing that we do: you saw this graph of where the GPU is worthwhile or
not -- in some cases the CPU just works better, right? And what we do is determine
by some heuristic -- we actually have a better algorithm now, but we have not
implemented it yet -- whether to execute the input, depending on the input
parameters, on the CPU or on the GPU. And what we also have to take into account
in our algorithm -- this is the next stage -- is that we have memory transfers
here, so we cannot just arbitrarily decide to execute something on the CPU rather
than the GPU because the CPU performs better on it, because by doing that we may
also require some memory footprint to be transferred, and this would ruin the whole
benefit of using one or the other.
Another thing: when measuring the overheads you may have seen that the overhead of
creating all this metadata is very high for many kernels. So what we do is run
this computation in two threads: one thread is managing the GPU and another thread
is preparing the data for the GPU. In this way we can traverse this tree and also,
along the way, parallelize and utilize both the CPU and the GPU.
And the last piece of information that I want to deliver quickly is the use of
TSUBAME. So what is TSUBAME? TSUBAME is a huge cluster with a very fast
InfiniBand interconnect, tons of storage attached, and 620 GPUs plugged in --
Teslas -- on its eight-socket, 16-CPU nodes.
So what we did on TSUBAME was a two-level parallelization. One level was to split
this huge workload into small, independent pieces with relatively low
parallelization overhead. This made it embarrassingly parallel; this is the same
type of approach we use in grids for this workload.
And then on the node level, we invoked it either on eight CPUs -- so we split the
node essentially evenly among eight CPUs -- or on a GPU, right? There were two
GPUs attached to a node, so we had 16 nodes with 16 CPUs and two Tesla GPUs
attached. This brings it to a nice number in the thousands.
So this is the idea. We had a single master that maintained the queue, and we
split each node into four independent parts, and each GPU got its own workload.
And this was our way to manage the heterogeneity of these machines.
So, the preliminary results -- again, this is not really something that I'm proud
of, because there is a lot of room for improvement -- but on a single CPU something
like 5.5 CPU-years would be required for this workload, whereas the speedup here
was more or less 2,487. Now, this number means nothing, in the sense that what
really is important is what we get from the GPUs, right?
So essentially, if you think about what speedup you get, it is more or less that
one GPU is equivalent to 16 CPUs. So on this input we got more or less a factor of
16 for a GPU versus a single CPU.
Now, again, I'm saying this is not something that I'm real proud of. There is a lot
of things that I can do here and probably I will.
And this work is sponsored by you guys. And thank you so much for having me
here. And I will take some questions if I have a few minutes.
[applause].
>>: So the problem you're trying to solve seems identical to what people way back
then were trying to do -- math on a machine with 16 or 64K of memory and a hard
drive -- and the numbers are larger, but the ratios may be similar. Is there any
[inaudible]?
>> Mark Silberstein: So you're talking about out-of-core computations, like you --
>>: Computations that don't fit in core.
>> Mark Silberstein: Oh, yeah. So essentially I should confess I'm not really
familiar with that work. From what I heard -- and before reading anything I asked
people who know -- it seems that the tradeoffs were really different, so the ratios
were kind of different. So I cannot really say. Probably some techniques can be
borrowed, and this is something to work on.
>>: So assume that GPUs get L1, L2 caches. What's going to be the marginal
benefit of [inaudible] software caches that are precompiled for [inaudible]?
>> Mark Silberstein: I think the benefit will be quite substantial. Because the
texture cache that I'm showing here is not really a cache: it doesn't reduce
latency, it essentially just improves the bandwidth, but it doesn't reduce latency.
And L1 will definitely reduce the latency of access to the data.
So I expect a really significant benefit from that. However, you never know.
Because on [inaudible] you will actually be able to split between 16K of L1 and 48K
of scratchpad, or vice versa. And who knows -- maybe, you know, maybe in my case
it would be worthwhile in some cases to go to my software-managed cache, and it
would perform better, because I don't know what the replacement policy will be on
the L1 cache.
>>: [inaudible] key here.
>> Mark Silberstein: I'm sorry?
>>: I think that might be key here because [inaudible] not a structure of
[inaudible].
>>: [inaudible].
>>: You might be caching something for a longer term [inaudible].
>> Mark Silberstein: Yes. That's true. But on the other hand, if you look at what
I'm getting from texture cache it seems like I'm still getting something from there.
And it's really, you know, it's clearly beneficial. So I don't know. I don't know.
I'm very curious to see.
>>: The numbers that you use, the sizes of the texture caches and the
[inaudible] scratchpad [inaudible].
>> Mark Silberstein: I have no idea about the size of the texture cache. I know
that the working set is supposed to be 8K. I didn't see any correlation between
the amount of data that I am accessing and this 8K. I don't know how this texture
cache works. And the amount of shared memory that I was working with is exactly
the same as on the G80.
What is different, though, is the number of registers allocated per thread. So
this workload with texture and my cache stuff requires 30 -- in some cases 28 --
registers, and one would not be able to get that on the G80. Okay? Your registers
would get spilled out, and this would ruin everything. So that's why I was able to
do these experiments only now.
>>: [inaudible] how do you describe the bucket [inaudible]?
>> Mark Silberstein: The bucket is essentially mapped to two things. There is
metadata describing which variables of the bucket are there, and there is the data.
Okay? So the variables are fetched from another large array of all the
variables -- each function has kind of an array of pointers to the right
variables -- and there is some computation beforehand, quite nasty, you know, to
find the indexing. But it's not a big deal. It's just --
>>: And the functions themselves, are they [inaudible] functions?
>> Mark Silberstein: The functions themselves?
>>: The functions themselves over the [inaudible], are they --
>> Mark Silberstein: They're arrays.
>>: There's just arrays?
>> Mark Silberstein: Yeah, yeah, because it's discrete variables. So they're just
arrays.
>> Laurent Visconti: Take one more question. Okay. Thank you.
>> Mark Silberstein: Thank you very much.
[applause]