>> Kathryn McKinley: It's my pleasure to introduce Milind who came to UT with his
advisor, Keshav Pingali, and became best friends with my Ph.D. student Mike Bond so
then I had to become his best friend too.
>>: [inaudible].
>> Kathryn McKinley: And so he's been working on parallelization of graph
algorithms. As we all know, parallelization is nothing if you don't have locality, because
if you have to communicate all the time, then it doesn't do you any good to do things in
parallel.
So now he's been focusing more recently on how do you get both good locality and good
parallelism. And that's what we're going to hear about today. And he's now a CAREER
Award winner, assistant professor at Purdue, and hopefully next year or the year after an
associate.
>> Milind Kulkarni: Yes. Thanks, Kathryn. So let me just go ahead and skip past the
title since it seems pretty obvious what I'm going to be talking about, locality in irregular
programs.
Before I start with the actual talk, I want to acknowledge the students that have helped
me with this work. Youngjoon Jo, who is really the prime mover behind much of this
work. Some of you might have met him last summer when he was at MSR. And also
Michael Goldfarb. Unfortunately they're both graduating. But these guys have done a lot
for this project.
Also the government has surveillance drones these days, so I feel like I should
acknowledge any government support that I get. So from the National Science
Foundation, the Department of Energy.
Okay. So if you've ever seen one of my old talks or one of Keshav's talks about
parallelism in irregular programs, this slide is basically that slide, except with a search
and replace done from parallelism to locality.
The argument is basically that if you think about where the compiler community has been
or was decades ago, there's a lot of work that has been done and a lot of understanding of
where to find locality in what are called regular programs, programs that deal with dense
matrices, dense arrays. These are the kinds of things that show up in linear algebra codes
and in scientific computing, and it really had been the focus of high-performance
computing and a lot of the compiler community for a very long time.
And in these applications that manipulate matrices and arrays, their behavior is very
predictable. Their behavior is easy -- not easy. It's possible to analyze in a static way,
and so at compile time we can do a lot of stuff, a lot of transformations to improve
locality.
The problem is that we don't just deal with dense matrices and arrays. More and more
applications in more and more domains are what the compiler community has called
irregular. Irregular is really just in contrast with these regular applications of dense
matrices and arrays. So if it's not using a dense matrix, it's an irregular program. But, in
particular, I want to focus on programs that deal with pointer-based data structures, trees,
graphs, lists, things like that.
And we really don't understand nearly as much about what the right things to do to get
locality in these programs are.
So what's the basic problem? The basic problem is that irregular applications are
complex. If I've got dynamic pointer-based data structures, well, the layout is dynamic.
I'm allocating them at runtime, the layout is input dependent, it's dependent on what my
allocator is doing for me. So it's hard to find spatial locality, although people have done
work on figuring out clever ways of laying out pointer-based data structures to get better
spatial locality.
More importantly for the stuff that I'm going to be talking about today, the access
patterns are often highly unpredictable. The particular way that you're going to access a
graph or access a tree is hard to predict at compile time. It's input dependent, it's data
dependent, right, so as the program executes, that is what determines how I access the
data structure so the compiler really can't see a lot about those access patterns.
And that means that it can be hard to structure your computation to get good temporal
locality, to make sure that if I'm reusing data, that the uses and reuses are close together
so that I'm going to get cache hits.
And, in fact, a useful question to ask is are there even common sources of locality, are
there common patterns that arise in these irregular programs that we can exploit in some
systematic way to get good locality out of pointer-based applications.
Okay. So here's the plan, and this is really the rough outline for the talk. First of all,
irregular applications is just a vast space of things. So I'm not going to try to give you a
[inaudible], I'm not going to be everything to everyone.
So the game plan is we're going focus on a subset of irregular applications. And by
choosing a particular subset there's some hope that we can find some common patterns
that we can exploit in order to get good locality.
Once we've figured out exactly which applications we're going to look at, I'm going to
develop some models for reasoning about the localities in these applications. In the
world of dense matrices and dense arrays, there's all sorts of interesting models of
locality, interesting ways of thinking about locality in these problems, and I want to see if
we can do something similar when we're talking about irregular applications.
Once I know how to reason about locality, that gives me some hope that I can find a way
to actually transform my programs to get good locality. Now I know what it means to
have good locality, so I should be able to target it. And then I want to make sure that
whatever transformations I come up with are correct. I don't want to actually break my
program.
And, finally, I'm a compiler guy at heart. Just because I know how to do these
transformations isn't that much fun unless I can do them automatically. So I want to
figure out ways of actually automatically transforming these programs and tuning them in
order to get good performance in order to actually take advantage of locality.
And then we'll rinse and repeat. So maybe look for different subsets of applications, so
look at different kinds of irregular applications. The particular rinsing and repeating I'm
going to do here is a different set of transformations. So let me just give you the punch
line right up front so you know where we're heading.
So we're going to focus on tree traversal algorithms. These are algorithms that do a lot of
traversals of various tree data structures. I'm going to tell you about two transformations
that we've developed, point blocking and traversal splicing. And I'll explain what those
mean in a bit.
We've done automatic versions of both of these, so we've been able to automatically
transform applications in order to implement these transformations and tune them
because irregular applications are data dependent, so tuning is critical in order to get good
performance. And ultimately we've shown we've gotten significant performance
improvements out of both of these transformations.
So let's work our way through the game plan so that I can prove to you that this punch
line is more than just a purple box that I've put on the screen.
So let's start by narrowing the scope. Let's start by focusing on a subset of irregular
applications. And so the particular applications I want to talk about are tree traversal
algorithms. These are algorithms where I've built some sort of tree and then I actually
repeatedly traverse that tree to compute some object of interest.
They show up in a lot of different domains. One that many people might be familiar with
is the Barnes-Hut N-body algorithm. It's an n log n algorithm for simulating
astrophysical bodies. It shows up in graphics. So in raytracing one of the most
time-consuming parts of the algorithm is figuring out whether a ray intersects a particular
object.
And the way that you do this, or one way that you can do this, is by building something
called a bounding volume hierarchy that tells you something about where the objects are
in space, and then I can accelerate this ray object intersection by traversing this hierarchy,
which is basically a tree.
Shows up in a lot of data mining algorithms, like nearest neighbor and point correlation.
The basic pattern here is that all of these algorithms are building some sort of tree and
repeatedly traversing that tree over and over and over again. And because there's a single
tree and I'm traversing it over and over again, there's data reuse. And because there's data
reuse, there should be an opportunity to exploit locality.
>>: If your goal is to automatically do these transformations, you use the word "tree,"
how do you know something is a tree?
>> Milind Kulkarni: That's a good question. We -- so there's two answers to that. One
is that actually one of our transformations, point blocking, actually works for any
recursive data structure. So any recursive traversal of any recursive data structure can be
transformed. So you actually don't need specifically for it to be a tree. It happens to
work best for trees because there's a lot of reuse. But it can actually work on any
recursive structure.
The second answer to that question is that we're punting. There's a lot of work in shape
analysis and whatnot of looking at data structures and looking at the ways data structures
are used and proving things like acyclicity or treeness. We are basically -- we could
build on top of that. So if somebody told me this was a tree -- and that somebody could be
a programmer that just annotates a data structure and says this is a tree, or it could be a
shape analysis that analyzes the code and says this is a tree. But all we need to know is just
that: is this a tree.
Right? So I agree with you, right, that this is part of not being everything to everyone.
We're not going to be able to do something for -- these transformations aren't necessarily
going to work for general irregular applications. So you do need to -- if we're going to
focus and say that we're looking at a subset of these applications, somehow you need to
know that this is the subset that you care about.
>>: [inaudible] so are your transformations guaranteed to be semantics preserving only if
there's something called a tree that is present there, or are they semantics
preserving for arbitrary code?
>> Milind Kulkarni: Point blocking. So there's some dependence structure that you need
to be aware of, given certain restrictions on the dependence structure of your program.
And I'll talk about that later in the talk. Point blocking is guaranteed to be semantics
preserving for any recursive data structure that you apply it to. Traversal splicing, the
particular way that we implement it, is only semantics preserving if it's on a tree.
>>: Okay.
>> Milind Kulkarni: Does that --
>>: [inaudible].
>> Milind Kulkarni: For now just assume there's some oracle that says this is a tree.
Right? I'm not a shape analysis guy, so this is not -- think of us as a client of shape
analysis, if that helps. Okay.
So let me just give you a quick overview of a concrete example of one of these tree
traversal algorithms, just to help focus the mind a little bit.
So point correlation, two-point correlation, is a data mining algorithm. Its basic goal is to
tell me something about how clustered data is.
And the basic idea is I've got some set of N points in K dimensions. Here it's two
dimensions. And I've got some point, P. This should be a darker point. It might be a
little hard to tell that it's darker. But there's some point P. And I want to find how many
other points in this dataset are within some radius R of P. All right?
Now, the naive algorithm here is just the N squared algorithm. I'm going to take my
point P and I'm going to compare with every other point in the dataset and see whether or
not it's within the radius. And if it is, then I'll increase my correlation and I can get a
count. So the goal here is I want to say for P its two-point correlation is 2, there are two
other points that are in this radius.
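A minimal Java sketch of that naive N-squared computation, just to pin down the idea (the Point class and its field names are illustrative, not the talk's actual code; the corr field is reused in later sketches):

    class Point {
        double x, y;
        int corr;                          // running two-point correlation count for this point
        Point(double x, double y) { this.x = x; this.y = y; }
        double distanceTo(Point o) {
            double dx = x - o.x, dy = y - o.y;
            return Math.sqrt(dx * dx + dy * dy);
        }
    }

    class NaiveCorrelation {
        // Compare p against every other point: O(N) work per point, O(N^2) overall.
        static int count(Point p, Point[] points, double r) {
            int count = 0;
            for (Point q : points) {
                if (q != p && p.distanceTo(q) <= r) count++;
            }
            return count;
        }
    }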
Okay. So N squared, not so great. So here's one way that you can accelerate this. You
can build a data structure called a k-d tree. It's an acceleration structure over this space.
And that will let me actually accelerate the process of finding how many points are in my
radius. And it does so by letting me quickly prune off parts of the space that I know I
don't need to look at, that I know there's no way for a point to be in this space.
So let me show you what that means. Here's how a k-d tree works or here's one example
of a k-d tree. So I start with the root node of my k-d tree, and each node basically
encompasses some subspace of my overall dataset. So the root node actually
encompasses the entire space, and then I'm going to recursively subdivide this space,
So, I'm going to divide this space into two, and that gives me two new nodes, B and G, so
B is talking about the left side of the space and G is the right side of the space, and I can
just continue this process recursively, subdividing the space, until eventually I've built
this tree structure that captures something about the spatial locality, if you will, geometric
locality of these points. And I continue the process until every leaf node has just one
point in it. It's a wonderful addition.
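One plausible Java shape for the k-d tree nodes being described (the talk hides the building code, so the field names and the bounding-box representation here are assumptions):

    // A k-d tree node: each node covers an axis-aligned box of the space.
    // Inner nodes split that box between two children; leaves hold one point.
    class KdNode {
        double minX, minY, maxX, maxY;   // bounding box of this node's subspace
        KdNode left, right;              // null for leaves
        Point point;                     // non-null only for leaves

        boolean isLeaf() { return left == null && right == null; }

        // Closest possible distance from p to this node's box; 0 if p is inside it.
        // This is the "does the circle intersect the square" test: the radius-r
        // circle around p misses the box exactly when this distance exceeds r.
        double minDistanceTo(Point p) {
            double dx = Math.max(0, Math.max(minX - p.x, p.x - maxX));
            double dy = Math.max(0, Math.max(minY - p.y, p.y - maxY));
            return Math.sqrt(dx * dx + dy * dy);
        }
    }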
So how do I use this to accelerate two-point correlation? So here's my point P. I want to
figure out how many points are in this radius. So I start by looking at the root of the tree
and basically asking does this circle intersect this square. If it does, then there is a chance
that the point -- that some of the points in the subspace are in my radius, and so I need to
keep looking.
So then I'll go down to B and say does the circle intersect the square, and it doesn't, so I
can actually prune this entire subspace. I no longer need to look on the left side of the
subspace. I know that none of the points can possibly be in my radius. I'll move over to
G. There's an intersection. I move down to H.
I -- once I get to a leaf, there's a single point that I care about, so I look at that particular
point and count it. I go to J, I go to K. Right? Now, you can see that this is not
necessarily precise. K intersects the circle even though the point is not actually in the
radius. Nevertheless, what this lets me do is quickly cut out parts of my subspace and
basically I'm turning what would have been an order N process for a single point into a
log N process.
Okay. The code for point correlation is actually simple enough I can put it on a single
slide. K-d tree building code is a little more complicated, so that's hidden. But the actual
traversal is pretty straightforward. I have some set of points, I've got some radius that I
care about. For every point in the set, I call this recursive function, I pass a point, I start
at the root of the tree, pass in the radius.
The recursive function says, well, if the point is too far from the subspace I'm looking at,
I can just return. I don't need to keep traversing the tree. Otherwise, if it's a leaf, I update
my correlation if I need to. Otherwise I just keep looking my way down the tree.
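A sketch of what that traversal might look like, reusing the Point and KdNode classes from the earlier sketches (the code on the actual slide differs in its details):

    class PointCorrelation {
        // Outer loop: one recursive traversal of the tree per point.
        static void correlate(Point[] points, KdNode root, double r) {
            for (Point p : points) recurse(p, root, r);
        }

        // Truncate if the subspace is too far away, count at leaves,
        // otherwise keep recursing down the tree.
        static void recurse(Point p, KdNode node, double r) {
            if (node == null || node.minDistanceTo(p) > r) return;   // truncation condition
            if (node.isLeaf()) {
                if (p.distanceTo(node.point) <= r) p.corr++;         // update the correlation
                return;
            }
            recurse(p, node.left, r);
            recurse(p, node.right, r);
        }
    }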
>>: [inaudible].
>> Milind Kulkarni: Turns out actually that if you look at the way -- that these codes
actually often do look like that.
>>: [inaudible].
>> Milind Kulkarni: No, it actually still often gets written like this.
>>: [inaudible].
>> Milind Kulkarni: So this is basically what this code looks like. Right? Let's -- if we
simplify the code a little bit, it turns out that this basic pattern is exactly the pattern that
all these other tree traversal algorithms follow.
There's some recursion, there's some truncation conditions that tell me whether or not the
recursion needs to stop, and then there's some continuous recursive traversal. And the
basic pattern is that I'm doing a recursive traversal of a recursive data structure and I'm
doing it over and over again.
So these are the algorithms that I want to look at. This is my refined goal. I want to
improve temporal locality in repeated recursive traversals of recursive structures. If my
algorithm fits this pattern, I want to help you get better locality.
Okay. So why is it hard to reason about locality in this kind of algorithm, right? So part
of what makes this hard is that what is written recursively like this, it's pretty tricky,
right? There's some pointer chasing going on here, I'm working my way down the tree,
there's this recursion going on. It's a little bit hard to reason about what I'm accessing
when in order to figure out where my data uses are, where my data reuse might be.
So let's abstract this away. Let's pretend that -- start over. Much of this recursion, much
of the pointer chasing, much of the recursion, what it's really trying to do is figure out
what parts of this tree I'm supposed to be visiting. What parts of the tree do I visit and
what parts of the tree do I not visit.
So let's actually throw away all of the recursion and pointer chasing and just assume that
there's some oracle that tells me what parts of the tree I'm supposed to visit. So we'll
rewrite the code that looks a little bit like this. I'm basically hiding all of the recursion in
this oracle traversal function. This is just an abstract model. We're not saying implement
the code this way. But this is our abstract model.
So for every point in my point set, for each tree node, so for every node that's part of the
traversal I'm supposed to do, do some sort of interaction, visit this node. And this
actually has exactly the same set of accesses that the original code did. It just no longer
has recursion. It no longer has pointer chasing. In fact, it's a nice little doubly nested
loop.
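In code, the abstract model being described is roughly the following, where oracleTraversal and interact are stand-ins for the hidden recursion and the per-node work, not real functions:

    // Abstract model: the oracle hands back exactly the nodes this point's
    // recursive traversal would have visited, in the order it visits them.
    // oracleTraversal and interact are placeholders, not actual code from the talk.
    for (Point p : points) {
        for (KdNode n : oracleTraversal(p, root, r)) {
            interact(p, n);    // the work done at each visited node
        }
    }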
So now that I've got this nice little doubly nested loop, I can turn to my standard bag of
tricks and think about something like an iteration space where in an iteration space I've
got one axis for each loop nest or for each level of my loop nest, and the axes tell me -- in
this case, the Y axis -- which point I care about. The X axis is which node that particular
point is visiting in the tree. And I can think
about one execution of this program is looking something like this. Point 1 maybe visits
nodes A, B, C, D, E, F, and G. Point 2, its oracle traversal does something a little bit
different. It visits nodes A, B, G, H, I, J, and K. And Point 3 does something, Point 4
does something, et cetera. All I'm doing is just taking that doubly nested loop and
exploding it out into this two-dimensional space. Right?
What Point 2 is doing is actually exactly the traversal that we saw in our point correlation
example.
So now that I've just exploded this all out and I'm staring at it as a bunch of circles on a
screen, I can start thinking about things like reuse distance and stuff like that. Because I
actually see the particular uses of particular pieces of data.
So let's take, for example, node C. What I want to know is between one use of C and the
next use of C how many other unique data elements am I touching. This is reuse
distance.
So between when Point 1 accesses C and when Point 3 accesses C, I've actually touched every other
node in the tree. The reuse distance here is ten. And the nice thing about reuse distance
is it basically directly maps to what kind of locality I would expect to see given a fully
associative cache.
So if my cache could only hold eight tree nodes, if it's only big enough to hold eight tree
nodes, this second access of C is going to be a miss. If my cache could hold ten tree
nodes, this second access to C would be a hit. Right?
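To make the definition concrete, here is a small sketch that computes the reuse distance of each access from a trace of node names (this only illustrates the definition used above; it is not tooling from the talk):

    import java.util.*;

    class ReuseDistance {
        // Reuse distance of an access = number of distinct other elements touched
        // since the previous access to the same element (infinite on first use).
        // Quadratic, and only meant to illustrate the definition.
        static List<Integer> compute(List<String> trace) {
            List<Integer> out = new ArrayList<>();
            for (int i = 0; i < trace.size(); i++) {
                int prev = trace.subList(0, i).lastIndexOf(trace.get(i));
                if (prev < 0) { out.add(Integer.MAX_VALUE); continue; }   // cold access
                out.add(new HashSet<>(trace.subList(prev + 1, i)).size());
            }
            return out;
        }
    }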
So if you're looking at this and thinking about this in terms of reuse distance, the first
thing that jumps out at you is that because these traversals can go all over the place, in
general the reuse distance of a particular node in the tree is on the order of the size of the
tree. The size of the tree is on the order of number of points, so my reuse distance is
order N. Not so good.
But one thing I can think about is, hang on a second, all of these points are independent
of each other. Each point is computing its own correlation. They're just working their
way through the tree. So maybe I shouldn't do them in this particular order. What if I
reorder the points? Right?
So what if I reorder the points so that successive points have very similar traversals. So
instead of going 1, 2, 3, 4, 5, let's run the points in this order. Let's run Point 1, then Point 3, then Point 5,
then 2 and 4. Right? So we're just going to sort the points in some sense.
And now what happens to my reuse distance is the reuse distance of C between one
access to C and the next, I've only touched six tree nodes. So my eight-node cache is
now big enough to give me a cache hit. And in general what point sorting does, what
sorting these points will do, is reduce your reuse distance from order size of the tree to
order size of the traversal.
So this is a nice big win. And so people have looked at this in varied application-specific
ways. So, for example, people have looked at taking Barnes-Hut and doing something
like a space filling curve over all the points in your space and using that as the order in
which you execute points in order to get good locality.
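One common instance of such an application-specific sort is ordering the points along a Z-order (Morton) space-filling curve. A 2D sketch follows; the quantization step and the exact curve used in the Barnes-Hut work are not specified in the talk, so treat this purely as an example:

    class MortonSort {
        // Interleave the bits of quantized x/y coordinates; sorting points by this
        // code tends to place spatially nearby points next to each other.
        static long mortonCode(int x, int y) {        // x, y assumed to fit in 16 bits
            long code = 0;
            for (int i = 0; i < 16; i++) {
                code |= ((long) ((x >> i) & 1)) << (2 * i);
                code |= ((long) ((y >> i) & 1)) << (2 * i + 1);
            }
            return code;
        }
        // Usage idea: quantize each point's coordinates into [0, 2^16) and sort the
        // point array by mortonCode before running the traversals.
    }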
The problem is that the sorting is very application specific. Nevertheless, let's assume
that we can do this sorting. So now I've got this nice win. I've gone from order N to
order basically log N.
Now, the problem is that when these input sizes get big, right, log N, there's often a
pretty big constant there, and this is still bigger than my cache. So in fact if my cache
were small or if my traversals were very large, even after doing this sorting, I get
essentially worst-case behavior, right? So I'm point 1, I touch C and I bring it into the
cache, I go around. And when 3 gets to C, it's just been kicked out of cache. So I get a
miss. And then when 3 goes to D, it's just been kicked
out of my cache so I get a miss and so on. Basically getting the worst-case behavior of
my cache.
So doing the sorting helps up to a point, and then at some point the traversals get so big
that it stops helping and I still get bad behavior.
>>: How do you know how to sort?
>> Milind Kulkarni: That is a great question. It basically involves sitting down and
really understanding what any particular algorithm does and then doing the sort. So the
sort here -- so what --
>>: Looks like you could sort on the locations of the points. Is that an approximation?
>> Milind Kulkarni: That is a reasonable approximation for some algorithms, but not all
algorithms.
>>: But what algorithms --
>> Milind Kulkarni: For this particular algorithm, we're looking at that as a good
approximation.
>>: Okay.
>> Milind Kulkarni: That's right. There's other algorithms that we have looked at and
we will see results for where that's not actually a great approximation. So we can do
better. But we'll get to that in a little while.
>>: So is this dependent on the fact that this is working on a tree, you said it's a tree?
>> Milind Kulkarni: The sorting is -- being able to do the sort in a reasonable way a
priori is partly dependent on the fact that it's a tree.
>>: [inaudible].
>> Milind Kulkarni: That's right. That's right. And the fundamental obstacle here is that this
needs to happen before I actually do the traversal, right? I can't do the traversal and then
say, oh, right, the thing to have done was to do one and then three. I need to figure out
something ahead of time. Which means I need some heuristic.
So for the next 10 or 15 slides we're going to assume that the sorting has been done. I'm
going to tell you what we can do if the sorting hasn't been done later.
>>: What if you were to interchange the two loops, then you could say something like
let's do a reverse traversal of the tree, right, and so we'll have perfect access to the tree
and then we'll iterate through --
>> Milind Kulkarni: That's right. So that is coming in about four slides. Yes. That is
exactly right.
You don't want to do a perfect interchange because you often have a million or so points,
and so interchange actually still has -- so you're getting better locality in the tree but
you're getting much less locality in the points. And so we'll see what we can do. But,
yes, that's our goal.
So just to drive the point home, this is basically looking at Barnes-Hut. And as the input
sizes get bigger -- so 10,000 bodies, a hundred thousand bodies, a million bodies -- what
you see, first of all, is that traversal sizes don't scale the same way. This is our nice log N
benefit.
But what you also see is that this is after doing sorting, the miss rate really does start to
go up. As the traversals get big, the traversals start -- they get big enough that you're
outstripping cache and you start to get misses.
And in particular what you see is that here this is percentage improvement over not
having done the sorting at all, right, so percentage improvement over the naive
implementation, and basically you get diminishing returns.
>>: Are the points still laid out in an arbitrary manner in the tree --
>> Milind Kulkarni: The points are --
>>: -- locality, spatial locality?
>> Milind Kulkarni: That's a good question. The points themselves are in an array, so
they've got spatial locality. The tree is all over the place. We're not touching that, so
there's no spatial locality in the tree. That being said, the locality --
>>: There is spatial locality; you just don't know what it is.
>> Milind Kulkarni: That's right. That being said, with the points it's not a huge
problem. Because, as we'll see in just a second, in the normal implementation, it will
basically only get cold misses on points. So I bring the point into my cache, and then I just
keep hitting this point while it goes down its traversal, so it's not a huge deal. So I might
save a constant factor if I get better spatial locality on the points. But it's not -- beyond
that it's not --
>>: Well, is the spatial locality [inaudible]?
>> Milind Kulkarni: Okay. So let's get to the point where it turns out that interchange or
something is the right thing to do.
So let's quickly see how this works by -- so one thing we can do is reason about this in an
even more abstract way. Once I've sorted the points, the differences between consecutive
traversals are actually very small. They're a second-order effect. So I'm actually going to
ignore the differences and just pretend that all the traversals look exactly the same.
So instead of having this oracle traverse function, I'm just going to replace this with a
simple vector of the nodes in the traversal. So I'm going to just assume that every point
has exactly the same traversal. And now I really have a nice doubly nested loop that is in
fact exactly what compiler writers have loved working with for decades. Right?
In fact, this particular piece of code has exactly the same locality behavior as a vector
outer-product, taking two vectors and multiplying them together to get a matrix.
So to see how that works, if I just replace those vectors with arrays, then I'm interacting
with PS of I and TS of J at each iteration. That's the same reads that would happen if I
were saying A of I, J is equal to PS of I times TS of J. This is just the computation that I'm
doing for vector outer-product.
And now I can start thinking about all the games that people used to play to get good
locality in something like vector outer-product.
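Written out as loops, the analogy looks roughly like this (PS is the point array, TS the shared traversal array; interact, A, and the loop scaffolding are placeholders, not code from the talk):

    // Abstracted traversal: every point walks the same node sequence TS.
    for (int i = 0; i < PS.length; i++)
        for (int j = 0; j < TS.length; j++)
            interact(PS[i], TS[j]);       // reads PS[i] and TS[j] on each iteration

    // Vector outer product has exactly the same read pattern on PS and TS.
    for (int i = 0; i < PS.length; i++)
        for (int j = 0; j < TS.length; j++)
            A[i][j] = PS[i] * TS[j];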
But first what we can see is this immediately tells me why I get the behavior that I do. In
vector outer-product I get only cold misses on the points, because I bring them in once
per outer loop iteration and then I hold on to them for the entire inner loop. But I get
capacity misses on every access to the TS array because I look at all of TS before I come
back around to the beginning. So if TS is too big, I get capacity misses.
Just doing interchange for vector outer-product, which says let's put the TS loop on the outside,
what this is doing for us is it's just swapping the locality behavior. Now I'm going to get
capacity misses on every access to the point and cold misses on every access to the tree.
But interchange is not the only trick. It's not the only transformation in our bag of tricks.
What you can actually do to get good locality in vector outer-product is tile. What I
really want to do is tile that point loop. So if I tile PS, I take a block of PS, and then for
every element of TS I look at each point in my block.
And what that means back if we back our way out of our abstraction is let's take a block
of points and then for every tree node in our same oracle traversal, but now over the block
of points, so for every tree node that any of these points might have touched, then for
each point interact with that tree node. So this is a blocked -- point blocked loop of the
tree traversal.
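In the abstract model, the point-blocked loop looks roughly like this (blocksOf, oracleTraversal, and interact are placeholders, and B is the block size):

    // Point blocking: move a whole block of points through the tree together, so
    // each tree node is brought into cache once per block rather than once per point.
    for (List<Point> block : blocksOf(points, B)) {
        for (KdNode n : oracleTraversal(block, root, r)) {   // nodes any point in the block visits
            for (Point p : block) {
                interact(p, n);
            }
        }
    }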
So in case just staring at that code doesn't make sense, which I completely understand,
let's look at an example. So here's what happens if I were to block the same program, the
same example execution, with a block size of three. So I take a block of three points, 1,
3, and 5, and they all visit node A. Then they all go down and visit node B. They all go
down and visit node C. Some of the points need to visit node D, so the entire block goes
down to D, and points 1 and 3 do whatever interaction they need to. They visit E, F, G,
H, and so on.
If none of the points need to go to a part of the tree, then the entire block can skip that
part of that tree. So I'll skip I and J and I'll move to K. And then I can move on to the
next block of points 2 and 4 and so on. So I can block the points. And basically I'm
moving these points through the tree in lock step.
And here's what happens to our locality. So remember in the original code, in the
original implementation, we were getting a miss on every tree node every time we
accessed it and we were getting only cold misses on points. Now I'm going to get a miss
on every tree node just once per block because all of points 1, 3, and 5 can interact with
node A without A getting kicked out of cache. They can all interact with B without B
getting kicked out of cache and so on. And as long as the block is small enough to fit in
cache, I still only get cold misses on points. So I've got only cold misses on points and
have reduced my misses in the tree by a factor of the block size.
>>: So it looks to me, and maybe I'm wrong about this, but it looks to me you're taking
the -- you're basically saying here's [inaudible] but it's regular enough, right, and it can't
be -- these things can't be totally random, right, because there's no way to sort a bunch of
random points into these nice sort of subspaces that have -- that are dense enough.
Right?
>> Milind Kulkarni: That's right. So you said sort of the magic word. The reason this is
working nicely is that a lot of times that block is pretty dense and I'm doing all three
points at the same time. And the reason that works nicely is because these points are
sorted. It's easy to find the points that all fit into this block nicely.
>>: So what's the property that gives you confidence that when you sort the points you'll
get this kind of density?
>> Milind Kulkarni: That's right. There isn't one. So what we actually ultimately want
is something that does not require that sorting. Right? But, yeah, right. So I can't be
guaranteed that I get this kind of density. I can do this kind of transformation regardless
of whether I have that density, it just might not help. So what we're going to get to in
maybe ten slides is something that will -- sorry.
>>: [inaudible].
>> Milind Kulkarni: Yeah, I know.
>>: [inaudible].
>> Milind Kulkarni: Is what we can do if they're not sorted.
Okay. But this is what can happen. Right? I've reduced my miss rate significantly. So
let's quickly talk about when this is correct. So what I've done is changed the order in
which I do the computations of my program which means that I'm potentially violating
some dependences that exist in the program.
So when is it safe to apply this transformation? So let's think about which dependencies
are preserved, which dependences are violated. So here's a dependence which says
maybe when point 1 visits node B it updates something, for example, it updates its
correlation, and then when it visits node E it also updates its correlation. So there's a
dependence from this particular iteration to this particular iteration.
This dependence is preserved by blocking. The source is still happening before the
sink.
Here's another dependence that might exist. So when node 1 -- when point 1 visits node
F, maybe increments a counter inside the node, and when point 5 visits node F, it also
increments that counter or reads the counter or something. So there's a dependence
within the node.
This dependence is also preserved by point blocking. Point 1 will always get to node F
before point 5 gets to node F.
This is a somewhat weird dependence that really doesn't arise in practice. But if node 3
visits -- sorry point 3 visits F and point 5 then visits I and there's a dependence there, that
is actually also preserved by point blocking. There's only one kind of dependence that's
not preserved, which is one that sort of goes in the other direction.
So in the original code point 3 would get to node E and would do a write. And then later
point 5 would get to node C and do a write -- or do a read and I would get my
dependence. But now in the blocked code point 5 gets to node C before point 3 gets to
node E and so the dependence gets violated.
Now, these arrows are drawn up in a particular way. If you think about sort of all of the
machinery that people develop for regular programs, analyzing when things like loop
interchange or loop tiling are correct, there was this notion of direction vectors, the
direction that a dependence might go inside a loop nest, inside an iteration space. And it
turns out that these blue arrows correspond to 0+ dependences and ++ dependencies. The
red arrow corresponds to a +- dependence. It's forward in the point direction but it's
backwards in the node direction. These are exactly the same correctness criteria for tiling
in a regular program.
Now, exactly what it means to be forwards and backwards when we're talking about trees
instead of arrays is a little bit tricky, but this gives me some hope, which we haven't done
yet, but this gives me some hope that we can actually try to automatically figure out when
these things are allowed.
>>: Yeah. We've also got cases where you can simplify this away if you don't have any
point-to-point dependencies. [inaudible] that is doing something similar to this for
locality and for VGA programming. But manually [inaudible] some tree-based database,
but you can sort the points any way you want.
>> Milind Kulkarni: That's right. So if you know, for example, that sorting is allowed,
so if you've done the sort, that probably means that there's no dependence across the
points. If there's no dependence across points, that says that the +0 dependence and the
++ dependence cannot exist. And in fact the +- dependence also cannot exist. The
only dependences I might ever have are 0+ dependences, which means that this is going
to be valid.
>>: [inaudible].
>> Milind Kulkarni: [inaudible] that's right. That's right. And so in fact that's what we
get. So I haven't yet figured out how to really figure out what it means to be forward and
backward in a tree. So what we do instead is look for only 0+ dependencies.
So what we're going to do is automate this process. So point blocking is -- this is the
iteration structure we want. This is the computation structure we want. We want to
actually generate it automatically. And here's how we're going to do it. First we're going
to identify when we can apply point blocking. So we're going to look for recursive
structures, so just classes with fields of the same class.
We're going to look for recursive traversals of those recursive structures, so just recursive
methods that are invoked on the recursive field. And we're going to look for some
enclosing loop, right? So this is exactly what we're looking for. We want a repeated
recursive traversal of a recursive structure.
And then we use a sufficient condition for correctness, which is that the enclosing loop is
parallelizable. This says the only dependence that we could possibly have is 0+. A nice
side effect of this is that if all I have are 0+ dependences, it no longer matters if that
recursive structure is a tree or a [inaudible] or a graph. I can actually just
apply this transformation. So we don't need to do any kind of fancy shape analysis to
prove the treeness of that structure.
So here's what the transformed code looks like. Here's our original code. Right?
Recursive function. Here's the transformed version of the code. Instead of for each point
in the point set, we're going to say for each point block in the point set we're passing in
the point block to the recursive function. If the point block is empty, if none of the points
need to visit this node, we return.
Otherwise, for every point we're basically executing the original recursive body of the
function. So we didn't actually have to make a lot of changes. This basically can be done
with a syntactic transformation. The only difference is that wherever we would have had
a return in the body of the function, we replace it with a continue, we move on to the next
point. Wherever we would have had a recursive call, we defer the recursive call. We
add it to a next block, basically a deferred set. And once all of the points have been
processed, we make the recursive calls that we need to.
And the important point here is that this next block only contains the rest of the points
that need to make their way down the tree. So the point block, even as it's getting more
and more sparse, at least it's staying nicely compressed as you're moving down the tree.
So you're not skipping past a lot of empty slots in your block.
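Following that description, the transformed point-correlation traversal might look roughly like this (a hand-written sketch of what the generated code does; the real transformation is syntactic and more general):

    import java.util.*;

    class BlockedCorrelation {
        static void correlate(Point[] points, KdNode root, double r, int blockSize) {
            List<Point> all = Arrays.asList(points);
            for (int i = 0; i < all.size(); i += blockSize) {
                List<Point> block = new ArrayList<>(all.subList(i, Math.min(i + blockSize, all.size())));
                recurse(block, root, r);
            }
        }

        static void recurse(List<Point> block, KdNode node, double r) {
            if (node == null || block.isEmpty()) return;        // nobody needs this subtree
            List<Point> nextBlock = new ArrayList<>();          // deferred set: points that recurse further
            for (Point p : block) {
                if (node.minDistanceTo(p) > r) continue;        // was "return" in the original body
                if (node.isLeaf()) {
                    if (p.distanceTo(node.point) <= r) p.corr++;
                    continue;                                   // was "return"
                }
                nextBlock.add(p);                               // defer this point's recursive calls
            }
            recurse(nextBlock, node.left, r);                   // make the deferred calls once per child
            recurse(nextBlock, node.right, r);
        }
    }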
Okay. So when you do something like tiling in matrix multiply or any other dense
problem, one of the big problems is figuring out what that tile size is. If the tile size is
too small, you're adding a lot of overhead. If the tile size is too big, it's too big to help
your cache. Because the tile is bigger than your cache, and so it still doesn't help.
And there's basically the same principle here. If the block size is too big, that point block
doesn't fit in cache and so I'm still getting misses on the points, which I really want to
avoid. And if the block size is too small, I get a lot of unnecessary misses in the tree. I
miss once in the tree per block, so I want fewer blocks.
So here's just a quick sensitivity study. This is just block sizes versus runtime for
Barnes-Hut and point correlation. And what we see are we get these nice little U-shaped
curves. As the block size gets bigger and bigger, we're getting fewer and fewer misses in
the tree. And then at some point we get more and more misses in the point, on the points.
And so we get this nice little U-shaped curve.
Now, the important thing here is that unlike dense linear algebra, in dense linear algebra
finding the right block size for a particular application was application dependent, right,
the block size for matrix multiply might be different than the block size for inner-product
or outer-product. And it's architecture dependent, obviously depends on the size of my
cache, but it was input independent. Once I find a block size for matrix multiply, I don't
care what matrix you're giving me, I'm going to use the same block size.
That's not the same for us. Depending on the particular input you have, the tree is going
to have different structures, the way the points are moving through the tree is going to
change, and the correct block size is dependent on that.
So if we want to tune this, what we need is some sort of runtime auto-tuner. We need to
see the input first and use input to figure out what block size we want.
And so we have those nice U-shaped curves, so we can just use a gradient-descent search.
We do a little bit of random sampling from the input space, and this is because different
parts of the input might behave differently, so if you tune for the first thousand points,
you might get a different behavior than if you tune for the last thousand points in your
set.
So we do some random sampling to avoid that. And because we're doing this all at
runtime, we don't want to spend a lot of time auto-tuning. So once we consume 1 percent
of the points, we stop and just take whatever the best block size is that we found.
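Very roughly, the shape of that runtime search (a hill-climbing sketch; how the random sample is drawn and how the 1 percent budget is enforced are simplified away, and measure is a stand-in that runs a sample of points with a given block size and returns time per point):

    import java.util.function.IntToDoubleFunction;

    class BlockSizeTuner {
        // Walk up the block-size axis while the measured cost keeps dropping; the
        // U-shaped curves above are why a single local search like this suffices.
        static int tune(IntToDoubleFunction measure) {
            int blockSize = 16;                              // arbitrary starting size
            double cost = measure.applyAsDouble(blockSize);
            while (true) {
                double next = measure.applyAsDouble(blockSize * 2);
                if (next >= cost) return blockSize;          // past the bottom of the U
                blockSize *= 2;
                cost = next;
            }
        }
    }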
So we can put all of this together into a source-to-source transformation framework. This
is all done in Java. We automatically identify the potential loops for point blocking,
automatically apply the transformation and insert the tuning code that we need.
We tried this on a bunch of different benchmarks, six different benchmarks, and because
of input dependence for a bunch of the benchmarks we use multiple inputs, so there's 15
different benchmark input pairs. Everything is written in Java, JVM heap, taking the
average of five runs, low coefficient of variation. So the error bars here are under 5 percent
per bar.
Okay. Here's the upshot. Right? This is the takeaway of point blocking. So this is the sorted
baseline. I'm assuming the points have been sorted. And we see that there's a pretty wide
variance in performance improvements, so we're getting performance improvements
across the board. Sometimes up to 3X. On average the auto-tune version gets about a 50
percent improvement. If you're willing to do a post hoc exploration of the right block
size for a particular input, you can do a little bit better.
>>: So the benchmark and the input is to the pairs there?
>> Milind Kulkarni: That's right. So this is Barnes-Hut on a random input, Barnes-Hut
on a Plummer model input, point correlation on a random input, point correlation on two
real-world input datasets, nearest neighbor, k-nearest neighbor, a ball tree
implementation, and a raytracing implementation.
>>: So no benefit on the raytracing?
>> Milind Kulkarni: Yeah, the raytracing implementation, yeah, we just weren't able to
get any benefit there. If you can do a little bit of -- if you do the sort of [inaudible]
determine this -- if you determine what the right block size is, there's potential for benefit
there. But the auto-tuner, because it's not able to explore all the space, doesn't find it.
>>: So how big is each point and each tree node?
>> Milind Kulkarni: I don't have those numbers off the top of my head unfortunately. It
varies from application to application.
>>: Right.
>> Milind Kulkarni: And that's part of what determines block sizes. So if the points are
bigger, then you can fit fewer points in the block before you start to get outside of your
cache, things like that. Tree node sizes don't matter quite as much because of the
particular implementation that we use.
We're basically already assuming that every time you come back to a tree node with a
new block it's going to be a miss. So unless an individual tree node is so big that it blows
out your cache, you really only need to have one tree node in your cache at a time in
order to get the locality benefit.
But, yeah, that's definitely one of the effects here.
>>: How big are the data sets?
>> Milind Kulkarni: They're all roughly on the order of a few hundred thousand to a
million points.
>>: A million [inaudible].
>> Milind Kulkarni: Yeah. Which is actually not so big for some of these applications.
>>: [inaudible].
>> Milind Kulkarni: Yes.
>>: So potentially you identify potential places manually before you start --
>> Milind Kulkarni: You identify the potential places -- no, no, that's also [inaudible].
The loop [inaudible] to get transformed is identified automatically. By looking for those
repeated recursive -- so enclosing point loops that make recursive calls inside them.
>>: [inaudible] understanding how much spatial locality you're getting on these guys
particularly in the context of which ones do well?
>> Milind Kulkarni: I don't actually. That's something that we would want to measure.
The only place where you have any shot at getting good spatial locality is in the points.
So there is potentially spatial locality in the tree, but we're not taking control of how the
tree is allocated. So the tree can be all over the place. The points themselves because of
the way -- so one issue here, if I go back, is what's being added to the next block is
actually a pointer to the point. So the blocks themselves don't have great spatial locality.
It's whatever the original order of the points were. So as points sort of get filtered as
they're moving down the tree, they stop being next to each other in cache. So there's
probably not great spatial locality. So there's definitely opportunity for improvement
here.
>>: What about the trees, is there anything about the trees?
>> Milind Kulkarni: We haven't looked at that.
Okay. So let's do one cycle of this rinse and repeat. We're not going to go all the way to
the top of the game plan, we're going to go to this step of the game plan, designing
transformations. And one of the discussions that we've had already at this point was,
well, this sorting was kind of a weird thing, right? It has to be done ahead of time, it has
to be done by the programmer, it's application specific. What if you can't do that? What
if you can't pack points into blocks effectively?
So here's the problem. We get one miss on a tree node per block that visits that tree node.
And if the points are not effectively packed into blocks, then more blocks visit those tree
nodes. So we get more misses. What you really want is to pack all of the points that are
going to visit a particular tree node into one block if you can, because that minimizes the
number of misses you get.
So this means that you need to figure out something about the order of the points because
you have to build those blocks before you start execution.
So point blocking is pretty effective when you can presort the points, but it turns out not
to be so effective when the points aren't sorted. Here's what can happen. If we hadn't
sorted that original example, if we kept the points in 1, 2, 3, 4, 5 order, now node C is
visited both by the first block and by the second block. So the first block gets great
locality on C. There actually is reuse, and because the reuse distance is 0, C is definitely
in cache. But now I have to go through the entire tree before the second block visits C
again and now this next access to C becomes a cache miss.
Because I wasn't able to sort the points, because I wasn't able to pack them into blocks, I
get all these misses.
So what can we do? How can we avoid this particular problem? So it turns out there's
two loops in this code. We only tiled one of them. We tiled the point loop. So what
would happen if instead of saying for each block of points do something, what if we tiled
the other loop? What does that even mean?
So here's one interpretation of what it means. I'm going to take a partial traversal -- think
of this as taking a subtree, a part of the tree, and then for every point have it work inside
that partial traversal. And then for that point for every tree node inside that partial
traversal, do the work.
So this is tiling the other loop, tiling the tree instead of tiling the points. This turns out to
be much more complicated to do. Generating this code is much harder. I can't show it to
you on one slide. But it has this nice property, which is that as long as this partial
traversal fits in cache, it turns out that I get the locality I want regardless of what order
these points are in. So the points can be sorted, the points can be unsorted. If that partial
traversal fits in cache, I'll get good locality. So I no longer care about sorting the points.
So here's what this looks like in practice. And it turns out we can be a little bit more
clever than even that. So let's tile the tree. So the first thing we're going to do is figure
out where we're going to tile the tree. So we place what are called splice nodes. These
are basically places in the tree where, when a traversal gets to that point, that's where we
cut the traversal off. That's one partial traversal.
So I'm just going to color them in on the iteration space. Then we're going to execute
traversals, and whenever they get to one of those splice nodes, we're going to pause them
there. So I'm going to run this traversal until it gets to C and then pause. I want to do
everything I can in that first partial traversal first.
So I'm going to go back and grab another point and move it to C. The upshot is that the
reuse distance of C is now small. The reuse distance of C the first time is 0, and the reuse
distance of C the second time is 2 instead of 10.
>>: But is it possible that the other block of points may have gone down a different
path?
>> Milind Kulkarni: That's right. We're smart about how we grab these blocks so that
this doesn't happen. You're absolutely right. That can happen.
Here's part of how we're able to be smart about this. When I -- so what I need to do next
is resume these traversals. They need to start going past C. But what I can notice is that
three of these points made it to C and two of the points actually got truncated before they
made it to C. Right?
As the points are making their way through the tree, they get truncated at different places
in the tree. And those truncations are telling me something about the behavior of these
points as they work their way through the tree. The points that all made it to C have
something in common that make them have somewhat similar traversals and the points
that didn't make it to C have something in common that make them not go to C.
So what I'm going to do is when I resume reorder the points. And I'm going to take all
the points that made it to C and do them first and then all the points that only made it to
B. So I'm going to do some reordering. And when I resume, 1, 3, and 5 wind up in a
single block and make it to F.
And what's kind of cool about this is that's actually the original sorted order. I've
actually recovered the sorting order without actually knowing what's going on. I'm just
looking at the past behavior of these points as they work their way through the tree.
So essentially I'm filling the blocks by using this history to figure out how I need to pack
things into blocks. And then I'll release the traversals and then I'll go. And I can
actually just keep doing this process. So as the points work their way through the tree,
they'll constantly be getting reshuffled and reordered. This turns out to be pretty cheap.
And the nice thing about this is I think at the beginning of the talk we were asking --
there was a question about finding the right sorted order. It's possible that for part of a
traversal you're similar to point A and for a different part of the traversal you're similar to
point B. So any a priori ordering of the points is not going to be good. But because we're
able to constantly reorder the points based on their behavior, we can sort of change the
order as necessary based on how we're working our way through the tree.
>>: Now, when you go through -- I assume that you just have two lists or some number
of lists with the points that didn't make it, right?
>> Milind Kulkarni: So the way to think about -- it's not -- it's a little bit simpler than
that actually. So the way to think about this is that you've basically chopped the
traversals up into phases. There's the phase that goes from A down to C. There's the
phase -- the way we make this work, there's the phase that does the subphase under C.
Then there's the phase that actually gets to F. Then there's the phase that works under F
and so on. So I've chopped the traversals up into the bunch of phases. Every point goes
through the same set of phases. It might skip a phase because it doesn't have to visit any
of the nodes in that phase. But conceptually every point goes through every phase of this
traversal.
So when I'm starting the phase that starts at the bottom of C, I actually just gather all of
the points in my -- in the program. Basically all of the points that are in flight. And I
gather them based on where they were paused. So when point 2 makes -- sorry, when this
point makes it to B and pauses, it's sort of waiting in a bucket
just sitting in B. It's paused there.
>>: So it's not there in C so you don't grab it.
>> Milind Kulkarni: It's not there in C, so you don't grab it. That's right.
>>: Okay.
>> Milind Kulkarni: So in some sense there's lists at each one of these nodes up here that
sort of keep track of where things got paused.
>>: Okay.
>> Milind Kulkarni: And so the nice thing about this is that because we're not actually
doing any sorting. The order in which we grab points is basically doing the sorting for
us. So we don't actually have to do anything.
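Conceptually, the pause-and-resume machinery might be modeled like this, reusing the earlier Point and KdNode sketches (heavily simplified: the splice nodes are given as a set, each point's traversal runs on an explicit stack, and the per-point state compression the talk mentions a little later is ignored):

    import java.util.*;

    class TraversalSplicing {
        // A traversal that is in flight: the point plus the nodes it still has to visit.
        static class Paused {
            Point point;
            Deque<KdNode> stack = new ArrayDeque<>();
            Paused(Point p, KdNode root) { point = p; stack.push(root); }
        }

        // Run every point until it parks at a splice node, then drain the splice-node
        // buckets one at a time; draining bucket by bucket is what regroups points
        // with similar traversals without any a priori sort.
        static void correlate(Point[] points, KdNode root, double r, Set<KdNode> spliceNodes) {
            Map<KdNode, List<Paused>> buckets = new LinkedHashMap<>();
            for (Point p : points) advance(new Paused(p, root), r, spliceNodes, buckets);
            while (!buckets.isEmpty()) {
                KdNode splice = buckets.keySet().iterator().next();
                for (Paused t : buckets.remove(splice)) advance(t, r, spliceNodes, buckets);
            }
        }

        // Advance one traversal until its stack is empty or it pauses at a splice node.
        static void advance(Paused t, double r, Set<KdNode> spliceNodes,
                            Map<KdNode, List<Paused>> buckets) {
            while (!t.stack.isEmpty()) {
                KdNode n = t.stack.pop();
                if (n.minDistanceTo(t.point) > r) continue;                  // truncation
                if (n.isLeaf()) {
                    if (t.point.distanceTo(n.point) <= r) t.point.corr++;    // leaf interaction
                    continue;
                }
                if (n.right != null) t.stack.push(n.right);
                if (n.left != null) t.stack.push(n.left);
                if (spliceNodes.contains(n)) {                               // pause here; resume later
                    buckets.computeIfAbsent(n, k -> new ArrayList<>()).add(t);
                    return;
                }
            }
        }
    }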
>>: So there's no way to know -- to maintain any dependencies.
>> Milind Kulkarni: That's right. So this is doing a lot more damage to the iteration
space, so it has to be truly parallel. That's right. Turns out, though, for a lot of these
applications, that's what you have there. It is truly parallel. But, yeah, we're really doing
a lot of violence to the original order of these points.
>>: Doesn't your memory [inaudible] go up because you have too many things in the
[inaudible] record where you pause --
>> Milind Kulkarni: Every -- we have to -- so one way -- yeah. So we have to record
that state. And actually we have to do a little bit more than that. One way to think about
what's going on here is that each of these points is essentially a thread that's in flight. The
point has to carry around some state with it about where it is in the tree and what
intermediate data it had on its way down to this point of the tree, et cetera.
And point blocking -- so in the original execution one point is in flight at a time. A single
point does its entire traversal and then we go to the next point and so on.
In point blocking, a block of points are in flight at a time, so we have to save that much
extra space. And in traversal splicing every single point is in flight at once. So
potentially there's a big explosion in the amount of state that we have to keep track of.
What we do is -- you can think about it this way. Points 1, 3, and 5 that all made it to C,
there's obviously a lot of similarity in the state they have because they all made it to C.
So we can identify the parts of the state that are basically -- that are going to be the same
across points and compress it and share it. So we're able to really reduce the
overhead of keeping everything in flight.
So all the points have to be in flight, and in some sense I have to maintain the context for
every point, but a lot of these contexts are the same.
So there's details in -- so we had an OOPSLA 2012 paper on this. There's details in there
about how we do this. But it varies from application to application. Some applications
there's less compression that we can do so the state goes up a bit.
>>: Are there some slides on how much better performance you get?
>> Milind Kulkarni: Yes. So really quickly, as Ben pointed out, I don't just have -- so
it's a necessary and sufficient condition now. It has to be parallel in order for this to
work, so we do that.
>>: [inaudible] that's what you're doing to this space, right?
>> Milind Kulkarni: Yes. It's a full tiling of the space at this point. Implementation and
evaluation. So here I'm going to compare to four different things. The unsorted baseline
now, because our goal here was to say, well, what if we can't do sorting. I'm going to
compare to just point blocking. So this is blocking when I don't have nicely packed
points. I'm going to look at splicing and then blocking and splicing.
We also do some -- we use a heuristic to place the splice nodes instead of tuning it. It's
not an auto-tuner anymore, it's just a heuristic that we use about where the splice nodes
get placed. But it still gets done automatically. So here's what happens. So one is
the performance of the unsorted baseline. The blue line is blocking. So blocking actually
still helps, even when the points aren't sorted. In fact, blocking still gets you about a 50
percent improvement. It's just that you're still running a lot slower.
Adding splicing gets you over a 3X improvement. So if you do splicing, you're getting a
3X improvement over the unsorted baseline. You can actually combine blocking and
splicing like in the example I showed you. It doesn't always help. Sometimes actually it
hurts you a little bit because there's extra overhead associated with doing the blocking,
but sometimes it can help. And the main place this helps is if I can't place those splice
nodes such that a partial traversal fits in cache, then I'm back to sort of working with the
big tree, and so I want to do something like blocking.
>>: Where's sorting?
>> Milind Kulkarni: Sorting is not on this slide. Sorting is maybe on the next slide.
>>: Okay.
>> Milind Kulkarni: That's right. Here's the next slide. So this is -- think of this as part
two of the talk compared to part one of the talk. Okay? So part one of the talk says I
want to do application-specific sorting plus point blocking. Part two of the talk says I
want to take unsorted points and do completely automatic splicing. So no programmer
intervention required.
And what you see is kind of mixed results. So sometimes sorting is really, really, really
effective. Barnes-Hut. It turns out that doing the sort is incredibly effective, and so it's
hard for us to beat it. So we're slower. Although we're still faster than the baselines.
Other times, though, something like k-nearest neighbor, this turns out to be an application
where an a priori sort is just not a good idea. The points do really weird things as they
work their way through the tree. And so there's no real way to put them together ahead of
time in order to keep them sorted. What you really want is to be able to dynamically
move them around, which is what we do. And so we actually do better.
And the upshot is, when we look at all our 15 benchmark and input pairs, what happens? On
average we're basically the same. So if we're doing something fully automatic, it's
competitive with doing the manual sorting.
>>: [inaudible].
>> Milind Kulkarni: So these benchmarks are mostly pulled out of real applications. So
Barnes-Hut is actually the Barnes-Hut -- it is a Barnes-Hut implementation. It's from the
Lonestar benchmark suite, which is from Galois project.
Point correlation and nearest neighbor or k-nearest neighbor are all based on a k-d tree
implementation that was pulled out of a raytracer, and then use that k-d tree
implementation to do something more like nearest neighbor or k-nearest neighbor or
whatever. So they're kernels in some sense. They're not full applications, but they are
pulled out of real applications.
>>: [inaudible] the real application?
>> Milind Kulkarni: So depends on how you define real application. So, for example,
ball tree, this was pulled right out of the -- I forget the name of it. But it's pulled out of
one of the data mining code base repositories. It is truly just here is a ball tree
implementation to do nearest neighbor for you. And so we just took that code straight
out of a repository and transformed it.
>>: What about the other cases?
>> Milind Kulkarni: So Barnes-Hut I would say is a real application. So real
applications will use these as kernels, right? So I'm not just doing a nearest neighbor
computation. I'm using -- that's right. So then -- but we're not transforming that. So
then --
>> Milind Kulkarni: Right. Right. That's right.
>>: Okay.
>> Milind Kulkarni: I'm going to actually show you some numbers.
>>: Were you the one who did all the sorting?
>> Milind Kulkarni: For Barnes-Hut we were not because Barnes-Hut there is sort of an
accepted best way to do the sorting.
>>: Okay.
>> Milind Kulkarni: To my knowledge we are the first people to even say that this is
actually a generalizable technique. People have done it in various application-specific
ways, but to say what you're really doing is this, we actually -- so we did the sorting
ourselves here. We sort of did the best we could.
>>: Except on Barnes-Hut nobody had done a sorting on these algorithms?
>> Milind Kulkarni: Yeah.
>>: So just -- and then in order to do it you have to hand code it and you have to understand
that?
>> Milind Kulkarni: You have to understand the algorithm.
>>: So that state of the art isn't really state of the art, it's your state of the art.
>> Milind Kulkarni: It's our state of the art. That's a fair point, yes.
>>: Which is a good thing to have actually [inaudible].
>> Milind Kulkarni: So this is --
>>: Generalizing them is [inaudible].
>>: So you have contributions to sorting.
>> Milind Kulkarni: To sorting as well. Yes. Okay. So let's talk about some ongoing
work that we've got that we've been doing. So this is actually -- so there's this real
application question. So this here is actually a real raytracer application. This is in
C++ now. This is a real raytracer.
>>: [inaudible] merry Christmas?
>> Milind Kulkarni: Those are three different models that you can render. One is a
Christmas cabin, one is a dragon -- sorry, one is a cabin, one is like a Christmas tree, and
one is a dragon. Apparently. Okay. So here what we're trying to do is not locality
transformations, we want to do SIMDization. And if you go back and look at -- so it
turns out that blocking and splicing are really nice things to do in order to enable
SIMDization. So to see where that is, let's go back and look at the original sort of
pseudocode.
In the original pseudocode there are no loops here. It's just a bunch of recursion and stuff
like that. You look at this as somebody who is trying to vectorize code, and you say, what
am I supposed to do? But now, if we add in blocking, so all we're doing is adding in
blocking, hey, look, there's a nice dense loop for you. Right? So this is a dense loop.
This is something that can be vectorized using your standard techniques like if
conversion and whatnot. Right? So this suddenly becomes vectorizable by applying
blocking.
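A minimal sketch of that transformation, with invented function and type names rather than the talk's actual code: the recursive traversal on its own has no loop to vectorize, while the point-blocked version exposes a dense inner loop over the block that techniques like if-conversion can target.

#include <vector>

struct Node { Node* left = nullptr; Node* right = nullptr; };
struct Point { float bestDist = 1e30f; };

// Application-specific pieces, stubbed out so the sketch is self-contained.
bool shouldTruncate(const Point& p, const Node* n) { (void)p; (void)n; return false; }
void visit(Point& p, const Node* n) { (void)p; (void)n; }

// Original form: one point walks the whole tree; nothing here looks like a
// loop a vectorizer could target.
void traverse(Point& p, Node* n) {
    if (n == nullptr || shouldTruncate(p, n)) return;
    visit(p, n);
    traverse(p, n->left);
    traverse(p, n->right);
}

// Point-blocked form: a block of points moves through the tree together.
// The inner loop over 'block' is the dense loop that if-conversion and
// standard vectorization techniques can work on.
void traverseBlock(const std::vector<Point*>& block, Node* n) {
    if (n == nullptr) return;
    std::vector<Point*> next;          // points that continue below this node
    for (Point* p : block) {           // dense loop over the block
        if (!shouldTruncate(*p, n)) {
            visit(*p, n);
            next.push_back(p);
        }
    }
    if (next.empty()) return;
    traverseBlock(next, n->left);
    traverseBlock(next, n->right);
}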
And then why doesn't SIMD always work even if you have a dense loop like this? It's
because I can't keep my SIMD lanes full. I've got my four SIMD lanes, but there's only
one element in this loop, and so who cares if I can SIMDize.
But what does splicing do for you? Splicing reorders points nicely in order to make sure
that the blocks stay full. So it keeps these SIMD lanes full for us. And so this is what we
wind up being able to do.
So here's how to interpret these results. I know it's a little bit small. So one is the
baseline. So no transformations applied. The first thing we do is we just apply blocking
with a block size of four. This is saying all I want to do is turn the code into something
that can be SIMDized. This actually slows down performance a little bit because we add
the overhead of blocking, we don't actually get any other benefits. It's not a large enough
block size to get good locality.
If you then -- now that I've got SIMDizable code, let's add SIMD to it. And in order to
add SIMD to this effectively, you need to do transformations like structure of arrays and
things like that. So adding all of that stuff where now you're, for example, no longer
copying pointers between blocks but copying entire points between blocks adds enough
overhead that you actually slow down if you try to SIMDize this code.
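A sketch of the structure-of-arrays change being described, again with invented names: instead of a block of pointers to points, each field of the blocked points lives in its own contiguous array, which is what lets a vector load pull one field for several points at once -- at the cost of copying point data into and out of the block, the overhead mentioned above.

#include <vector>

struct Point { float x, y, z, result; };

// Array-of-structures block: cheap to build (just pointers), but a SIMD load
// of, say, x for four points touches four unrelated cache lines.
struct BlockAoS {
    std::vector<Point*> pts;
};

// Structure-of-arrays block: each field of the blocked points lives in its
// own contiguous array, so one vector load grabs x for several points.
// Building it means copying the point data in, which is the extra overhead.
struct BlockSoA {
    std::vector<float> x, y, z, result;

    void add(const Point& p) {
        x.push_back(p.x);
        y.push_back(p.y);
        z.push_back(p.z);
        result.push_back(p.result);
    }
};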
Okay. So now let's say, well, obviously, we don't want a block size of four.
We want bigger blocks, both because bigger blocks get me better locality and because
bigger blocks make it more likely that I'll be able to find more points that I can fill a
SIMD lane with. So I block them and that gets me back up above one. And then I add
SIMD to that. And so I get a small improvement. So I'm paying the penalty of the data
transformation, and then I get some improvement from SIMDization.
The more important thing is once I do blocking and splicing, the improvement I get from
SIMDization is much bigger. SIMD becomes basically 40 percent more effective on
average once you do blocking and splicing than if you did just basic SIMDization.
>>: And that's all related to [inaudible] all the instructions actually fire, right,
because you're getting --
>> Milind Kulkarni: That's right. Because you're able to [inaudible]. So we can
actually -- so one kind of cool thing that we can do is we can measure what percentage of
the time we spend with our SIMD lanes completely full. And we can actually come up
with a -- what's the best you can do? It turns out the best you can do, it's not the best for
performance, but the best you can do in terms of keeping SIMD lanes full is to do a full
interchange.
If I do a full interchange, I'm able to look across every single point, say do any of these
points need to work on this particular node of the tree. Right? And so of course I find as
many possible points -- I'm finding as many points as possible to keep my SIMD unit
occupied. And so when we compare, it destroys locality, it destroys a bunch of other
things. But it turns out --
>>: I have to argue. Unless you order your input data structures statically in such a way
that you get the locality you want.
>> Milind Kulkarni: Yes. That's right. But this was basically -- think of this as an
opportunity study. We want to know how good are we able to do in terms of keeping our
SIMD lanes full. And it turns out that if you look at that full utilization, sort of the best
you could possibly do, and you look at what we do with splicing, we basically hit that.
Splicing basically fills your SIMD lanes as much as possible. Okay. So this is some
ongoing work.
A little bit of other ongoing work. So we're looking at putting these on GPUs. This
seems like a natural thing also. Something interesting about this, which was somewhat
surprising to me, so the key to efficiency -- one of the keys to efficient GPU computation,
there's really two things you have to worry about. One is control divergence. I want to
make sure that all the threads in my loop are actually doing something useful, they're not
just sitting idle because of some control flow divergence.
The other is memory coalescing. I want to make sure that when a bunch of threads in the
loop are doing a load, I can pack -- I can basically use as few memory transactions as
possible to do that load. Right? It turns out that when you take these recursive functions
and put them on the GPU, because they're recursive, you actually get -- the control flow
reconverges very quickly.
And the control flow reconverges quickly in a very suboptimal way. Because it
reconverges quickly, I've gotten back to the same part of my code. But now I'm in this
part of the tree and you're in that part of the tree. So control flow has reconverged and
memory coalescing has totally been destroyed. We used to be in the same part of the
tree; now we're in different parts of the tree.
So we can apply a point-blocking-like transformation to the GPU. It's actually simpler in
the GPU because of the various GPU architectural things. But what it basically does is it
says let's make control divergence worse. Let's prevent points from reconverging for a
little bit longer in order to make sure that we get good memory coalescing.
And so this yields -- so I don't have great numbers for you yet because it's still ongoing
stuff, but it yields about one to two orders of magnitude speedup over the CPU
implementations and one to three over the naive GPU implementations.
>>: So how sensitive are the speedup results to the actual dataset?
>> Milind Kulkarni: On the GPU or in general?
>>: Well, just in general for this work. Because it seems like there's a property -- well,
there could be a property of the data, namely its inherent clustering, which gives you the
ability to get sort of dense behavior [inaudible].
>> Milind Kulkarni: It's definitely quite input dependent. You could sort of see that
here, right? So take something like -- what's a good example? Barnes-Hut. Right? With
the random -- so Plummer is much more clustered than random. So actually somewhat
counterintuitively, on the more random data we do better. And that's mostly because the
baseline gets really crappy when the data is not -- doesn't have a lot of locality, right?
>>: Supposing it has [inaudible] and you --
>> Milind Kulkarni: Are you implying that I am not properly [inaudible]?
[laughter]
>>: [inaudible] somebody else. He does an exercise like this, and he changes his code so
that it does all these locality-enhancing transformations, is the idea then that the same
code is going to work well? It doesn't matter what computer you run it on?
>> Milind Kulkarni: Yes. And the basic idea here is that there is this runtime
auto-tuning component. And this is purely an empirical auto-tuner, right? I'm just
looking at as I change the block size how is the performance changing.
And in fact our auto-tuner is smart enough to do the dumb thing, which says, no matter
what block size I've tried, I haven't been able to help you, so I'm just going to fall back
to the original code.
>>: I see.
>> Milind Kulkarni: But you do need some amount of runtime auto-tuning in order to
make this work. If you pick a bad block size, your performance will be worse than if you
didn't do the transformation at all.
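A rough sketch of that purely empirical tuning loop (a simplification with invented names, not the actual auto-tuner): time the original code and a few candidate block sizes on a small tuning sample, keep whichever is fastest, and report "no block size" if nothing beats the original.

#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

struct Node;               // the application's tree node type
struct Point { float x; }; // placeholder point type for the sketch

// Returns the chosen block size, or 0 meaning "fall back to the original code".
// 'runOriginal' runs one point; 'runBlocked' runs a block of points.
int pickBlockSize(std::vector<Point>& sample, Node* root,
                  const std::vector<int>& candidates,  // e.g. {8, 16, 32, 64}
                  const std::function<void(Point&, Node*)>& runOriginal,
                  const std::function<void(std::vector<Point*>&, Node*)>& runBlocked) {
    using clock = std::chrono::steady_clock;

    // Baseline: time the untransformed traversal on the tuning sample.
    auto t0 = clock::now();
    for (Point& p : sample) runOriginal(p, root);
    auto best = clock::now() - t0;
    int bestSize = 0;

    // Time each candidate block size on the same sample; no cache models,
    // no miss-rate counters, just wall-clock time.
    for (int bs : candidates) {
        auto t1 = clock::now();
        for (std::size_t i = 0; i < sample.size(); i += static_cast<std::size_t>(bs)) {
            std::vector<Point*> blk;
            for (std::size_t j = i; j < i + static_cast<std::size_t>(bs) && j < sample.size(); ++j)
                blk.push_back(&sample[j]);
            runBlocked(blk, root);
        }
        auto elapsed = clock::now() - t1;
        if (elapsed < best) { best = elapsed; bestSize = bs; }
    }
    return bestSize;   // 0 => blocking never beat the original, so don't transform
}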
>>: I see.
>> Milind Kulkarni: In fact, in our splicing results, we made the auto-tuner -- we added
some extra stuff -- we added a heuristic check that basically tries to predict whether
your data is sorted. So it looks at sort of pairs of points that are in the input and sees
whether or not they behave pretty similarly. And if the data is already pretty well
sorted, we don't do splicing. We just do blocking.
But so -- but you do need this kind of -- this input-sensitive tuning in order to get the
right performance.
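One plausible way to cast that sortedness check, sketched with invented names since the talk doesn't spell out the exact heuristic: sample pairs of adjacent input points, compare the first few branch decisions they would make in the tree, and skip splicing when most pairs look alike.

#include <cstddef>
#include <vector>

struct Node { Node* left; Node* right; };
struct Point { float x, y, z; };

// Application-specific: which child would this point descend into at this node?
// Stubbed out for the sketch.
bool goesLeft(const Point& p, const Node* n) { (void)p; (void)n; return true; }

// Compare the first 'depth' branch decisions of two points.
bool similarPrefix(const Point& a, const Point& b, const Node* root, int depth) {
    const Node* n = root;
    for (int d = 0; d < depth && n != nullptr; ++d) {
        bool la = goesLeft(a, n), lb = goesLeft(b, n);
        if (la != lb) return false;
        n = la ? n->left : n->right;
    }
    return true;
}

// If most sampled adjacent pairs take similar paths, treat the input as
// already sorted and do blocking only; otherwise enable splicing as well.
bool looksSorted(const std::vector<Point>& pts, const Node* root,
                 std::size_t stride = 100, int depth = 8, double threshold = 0.9) {
    std::size_t similar = 0, total = 0;
    for (std::size_t i = 0; i + 1 < pts.size(); i += stride) {
        ++total;
        if (similarPrefix(pts[i], pts[i + 1], root, depth)) ++similar;
    }
    return total > 0 && static_cast<double>(similar) / total >= threshold;
}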
>>: Can it happen, for example, that the transform -- let's call it the original program A,
the transformed program A prime, could it happen that A prime works better than A on a
machine with a certain cache size when A works better than A prime on a machine with a
different cache size?
>> Milind Kulkarni: Yes.
>>: That could also happen.
>> Milind Kulkarni: Yeah.
>>: The auto-tuner [inaudible] that issue also?
>> Milind Kulkarni: Yes. So the auto-tuner does some measurements and basically tries
to predict whether doing this blocking is going to be effective overall. So there's clearly
some hand-waving heuristics that have to go on inside the auto-tuner also. But the goal -- so here would be an example --
>>: Blocking for the L2.
>> Milind Kulkarni: We're blocking, yeah, for the L2. Or a last level.
It's actually -- so it's weird because it's a purely runtime dependent system, it's hard to say
that we're specifically blocking for one level or the other if you look at our original
numbers.
So one of the issues is these new machines have these L3 caches that are much harder to
measure because they're on core. But if you look at our original set of numbers, the
original set of numbers we were working with machines that just had L1 and L2 so we
could look at exactly what was going on inside the cache. And you could plot L1 miss
rate and L2 miss rate. And what you find is that the block size the auto-tuner hits upon
minimizes neither the L1 miss rate nor the L2 miss rate. So go figure.
>>: But both go down?
>> Milind Kulkarni: Both are down from their original place, from their original
implementation. But I could use a slightly --
>>: So performance isn't determined by either one of those miss rates, it's determined by
both [inaudible] it's doing something --
>> Milind Kulkarni: Something -- well, it's doing something really dumb that turns out
to work out okay. It's not paying attention. The auto-tuner doesn't measure miss rates,
the auto-tuner doesn't measure anything else. All the auto-tuner does is it starts a timer,
runs a set of blocks that are size 12 or size 16, sees how long it takes. Starts another
timer, runs a set of blocks that are size 32, sees how long it takes.
>>: Well, it's measuring performance directly.
>> Milind Kulkarni: That's right.
>>: Because that's [inaudible].
>>: That's what I actually want to optimize.
>>: And the miss rates just give you insight into what it's doing.
>> Milind Kulkarni: And so if you were to make the block size bigger, your L1 miss rate
would go up because now your blocks are more likely to step up [inaudible] but
your L2 miss rate goes down. If you make the block smaller, the opposite happens.
Turns out that you want somewhere in between.
One thing that can be a little bit tricky and one of the reasons why it's hard to do
something like a more analytical model of misses is that the block sizes aren't [inaudible]
as they move down the tree. As you move down the tree, some of the points get
truncated early. So the block sizes keep changing as you move through the tree. And so
it's very hard to relate the original block size to what you might think of as the average
block size. That's also very input dependent.
>>: So [inaudible] you can do this runtime estimate, right?
>> Milind Kulkarni: Um-hmm.
>>: So when you do the learning, what's the most important [inaudible] that you take
care of the machines plus if you look at [inaudible] then you have like [inaudible]
parameters [inaudible] machine is [inaudible] parameters. But for this [inaudible] the
machines so [inaudible] the runtime [inaudible] machine are running on so the runtime
just reuse distance to pick --
>> Milind Kulkarni: It doesn't even use reuse distance. The runtime literally just runs
the code at various -- it just tries different configurations of it. So if you think about the
way Atlas works -- Atlas is one of these sort of original auto-tuned BLAS libraries -- what
Atlas did -- now Atlas does all sorts of crazy stuff. But the original thing Atlas did was it
tried a bunch of different block sizes and a bunch of different [inaudible] factors and
things like that. It just tried a bunch of different parameters and saw which one ran the
best.
It wasn't making any predictions based on your cache size. It wasn't making any
predictions based on the number of registers you had or the number of your instruction
cache size. It was just trying a bunch of -- it was doing a parameter suite and picking the
best one. That's what we do.
>>: Picking the best one based on --
>> Milind Kulkarni: We picked the best one based on runtime. That's right.
>>: [inaudible] two machines have different replacement policies for the cache?
>> Milind Kulkarni: Yeah. And so the nice thing about not worrying about any of that
stuff is that I don't need to think about how these numbers change when you have a
different replacement policy or when you have a different associativity or anything like
that.
Another thing that helps here, right -- take matrix multiply. A good, efficient
implementation of matrix multiply has a half dozen or more parameters, a dozen
parameters. We have just a couple. So it's tractable in a way that doing a full parameter
sweep of matrix multiply is not.
Moreover, if we go back and look here, we don't actually do auto-tuning for the splicing.
Splicing we just use this heuristic which says essentially cut the tree in half. Because
what we found is that when you combine blocking and splicing, and this is one of the big
reasons we combine the two, you need to do splicing but it's not -- performance is not
that sensitive to the splice parameter -- to exactly how deep that splice is.
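A tiny illustration of that "cut the tree in half" heuristic, in my own rendering rather than the actual code: the splice depth is simply set to half the tree's depth, with no auto-tuning involved.

// Sketch: place splice nodes at roughly half the tree depth.
struct Node { Node* left; Node* right; };

int treeDepth(const Node* n) {
    if (n == nullptr) return 0;
    int l = treeDepth(n->left), r = treeDepth(n->right);
    return 1 + (l > r ? l : r);
}

int spliceDepth(const Node* root) {
    return treeDepth(root) / 2;   // heuristic: splice halfway down the tree
}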
>>: [inaudible].
>> Milind Kulkarni: Sorry?
>>: I mean, [inaudible] so this one you do need good luck to get in there.
>> Milind Kulkarni: I wouldn't call it good luck. It's something about -- but, yeah. You
could construct a case where this heuristic will do badly.
>>: How do you do penalties? You do get penalties.
>> Milind Kulkarni: You could get penalties.
>>: [inaudible] in practice how is [inaudible] you need some subset of the inputs and
then you allocate a certain amount of time to parameter sweeping?
>> Milind Kulkarni: That's pretty much exactly it. So basically what we say is we
reserve one percent of the input points and say that for this 1 percent of the input points
we're willing to spend that many points doing auto-tuning.
Now, there's a trick here which is that, as I said, especially if you have sorted points,
different parts of the input might behave very differently. So you can't just take the first
1 percent of your points. That actually turns out not to work so well. So you have to --
so instead you have to sample from different parts of the input space.
But there's a further trick, which is that block to block there's actually still some locality
because successive blocks, if the points are sorted, have -- do pretty similar traversals. So
you can't just take completely random blocks. You need to account for the fact that when
I'm actually running the code I have this block-to-block locality.
So the way you have to do your sampling has to take that into account. So what happens
in practice is we grab groups of points at -- so we grab a set of blocks of size 2 in one
place in the input and a set of blocks of size 2 at another place in the input and figure out
what that means in terms of a block size of 2. Then we do the same thing for block size
of 4, same thing for block size of 8. But you're taking -- you're sampling sets of blocks
out of the input, and we do this until we've consumed 1 percent of the points. Or until we've
found the knee of the curve.
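A rough sketch of that sampling scheme as described (invented names, and the exact grouping is my own guess): for each candidate block size, pull short runs of consecutive blocks from several places in the input so block-to-block locality is preserved within a run, stopping once the tuning budget of about 1 percent of the points is used.

#include <cstddef>
#include <vector>

struct Point { float x, y, z; };

// Collect up to 'budget' points for tuning block size 'blockSize'.
// Each run holds 'blocksPerRun' consecutive blocks so block-to-block locality
// inside a run matches what the real execution would see.
std::vector<const Point*> tuningSample(const std::vector<Point>& pts,
                                       std::size_t blockSize,
                                       std::size_t blocksPerRun,
                                       std::size_t budget /* ~1% of pts.size() */) {
    std::vector<const Point*> sample;
    std::size_t runLen = blockSize * blocksPerRun;
    if (runLen == 0 || pts.size() < runLen) return sample;

    // Spread the runs evenly across the input rather than taking the first 1%.
    std::size_t numRuns = budget / runLen;
    if (numRuns == 0) numRuns = 1;
    std::size_t stride = pts.size() / numRuns;

    for (std::size_t r = 0; r < numRuns && sample.size() < budget; ++r) {
        std::size_t start = r * stride;
        for (std::size_t i = 0; i < runLen && start + i < pts.size(); ++i)
            sample.push_back(&pts[start + i]);
    }
    return sample;
}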
>>: Another one [inaudible] for a programmer who writes program you would try to do
the simple and then he start to do the similar thing [inaudible] whether this could be in
[inaudible] the programmer writes simple program like [inaudible] auto translate
[inaudible] programmers' opinions about that direction they should go.
>> Milind Kulkarni: Right. Right. I mean, arguably what we're trying to do is let
programmers write the simple version and get the performance of the complex version.
And, you're right, that you could also just use this as a way of saying it turns out that
doing something like blocking might be useful so it might be worth your time to go and
implement it in the right way for your particular application.
So there's a bunch of stuff that goes on here. We have to save a lot of data in order to
keep track of point context and stuff like that, which you might be able to optimize away,
if you were doing it manually, which we don't.
So there's a lot of stuff that if you were willing to go in and do it manually, you could
probably do better than us. But as I was saying earlier, you still need some of these
runtime smarts.
>>: Well, sure, you just put some to verify [inaudible].
>>: We don't want to have to understand current architectures.
>> Milind Kulkarni: Right. I mean, this is the case for compilers. Why do we want
compilers? It's so I don't have to worry about bridging this gap myself.
Okay. So conclusions. So, yeah, irregular algorithms, they're pretty fertile ground for
locality optimizations, at least partly because people haven't spent a lot of time worrying
about locality yet. In application-specific places, people have. But doing stuff in a
more general case, there's not a huge amount of work here. So it's a nice fertile ground.
I think it's important to consider these applications at the right level of abstraction. If you
get too bogged down in the details of the particular pointer chasing that's going on or the
particular structure of the code, it can be sometimes a little bit easy to miss the forest for
the trees and miss that, hey, actually there's this locality going on and it has a pretty
simple regular structure to the accesses, right, that I can then do something about.
And so in our particular case by using this higher level of abstraction, that really sort of
informed the kind of transformations we wanted to do, it informed the correctness
criteria, informed our reasoning about the locality effects, all of these sorts of things. And
then the upshot is that we can do all of this automatically and get nice benefits. All right.
[applause].
>> Milind Kulkarni: Thanks.