>> Kathryn McKinley: It's my pleasure to introduce Milind, who came to UT with his advisor, Keshav Pingali, and became best friends with my Ph.D. student Mike Bond, so then I had to become his best friend too. >>: [inaudible]. >> Kathryn McKinley: And so he's been working on parallelization of graph algorithms. As we all know, parallelization is nothing if you don't have locality, because if you have to communicate all the time, then it doesn't do you any good to do things in parallel. So now he's been focusing more recently on how do you get both good locality and good parallelism. And that's what we're going to hear about today. And he's now a CAREER Award winner, assistant professor at Purdue, and hopefully next year or the year after an associate. >> Milind Kulkarni: Yes. Thanks, Kathryn. So let me just go ahead and skip past the title since it seems pretty obvious what I'm going to be talking about, locality in irregular programs. Before I start with the actual talk, I want to acknowledge the students that have helped me with this work. Youngjoon Jo, who is really the prime mover behind much of this work. Some of you might have met him last summer when he was at MSR. And also Michael Goldfarb. Unfortunately they're both graduating. But these guys have done a lot for this project. Also, the government has surveillance drones these days, so I feel like I should acknowledge any government support that I get. So, from the National Science Foundation, the Department of Energy. Okay. So if you've ever seen one of my old talks or one of Keshav's talks about parallelism in irregular programs, this slide is basically that slide, except with a search and replace done: parallelism for locality. 
The argument is basically that if you think about where the compiler community has been or was decades ago, there's a lot of work that has been done and a lot of understanding of where to find locality in what are called regular programs, programs that deal with dense matrices, dense arrays. These are the kinds of things that show up in linear algebra codes and in scientific computing, and it really had been the focus of high-performance computing and a lot of the compiler community for a very long time. And in these applications that manipulate matrices and arrays, their behavior is very predictable. Their behavior is easy -- not easy. It's possible to analyze in a static way, and so at compile time we can do a lot of stuff, a lot of transformations to improve locality. The problem is that we don't just deal with dense matrices and arrays. More and more applications in more and more domains are what the compiler community has called irregular. Irregular is really just in contrast with these regular applications of dense matrices and arrays. So if it's not using a dense matrix, it's an irregular program. But, in particular, I want to focus on programs that deal with pointer-based data structures, trees, graphs, lists, things like that. And we really don't understand nearly as much about what the right things to do to get locality in these programs are. So what's the basic problem? The basic problem is that irregular applications are complex. If I've got dynamic pointer-based data structures, well, the layout is dynamic. I'm allocating them at runtime, the layout is input dependent, it's dependent on what my allocator is doing for me. So it's hard to find spatial locality, although people have done work on figuring out clever ways of laying out pointer-based data structures to get better spatial locality. More importantly for the stuff that I'm going to be talking about today, the access patterns are often highly unpredictable. 
The particular way that you're going to access a graph or access a tree is hard to predict at compile time. It's input dependent, it's data dependent, right, so as the program executes, that is what determines how I access the data structure, so the compiler really can't see a lot about those access patterns. And that means that it can be hard to structure your computation to get good temporal locality, to make sure that if I'm reusing data, that the uses and reuses are close together so that I'm going to get cache hits. And, in fact, a useful question to ask is are there even common sources of locality, are there common patterns that arise in these irregular programs that we can exploit in some systematic way to get good locality out of pointer-based applications. Okay. So here's the plan, and this is really the rough outline for the talk. First of all, irregular applications is just a vast space of things. So I'm not going to try to give you a [inaudible], I'm not going to be everything to everyone. So the game plan is we're going to focus on a subset of irregular applications. And by choosing a particular subset there's some hope that we can find some common patterns that we can exploit in order to get good locality. Once we've figured out exactly which applications we're going to look at, I'm going to develop some models for reasoning about the locality in these applications. In the world of dense matrices and dense arrays, there's all sorts of interesting models of locality, interesting ways of thinking about locality in these problems, and I want to see if we can do something similar when we're talking about irregular applications. Once I know how to reason about locality, that gives me some hope that I can find a way to actually transform my programs to get good locality. Now I know what it means to have good locality, so I should be able to target it. And then I want to make sure that whatever transformations I come up with are correct. 
I don't want to actually break my program. And, finally, I'm a compiler guy at heart. Just knowing how to do these transformations isn't that much fun unless I can do them automatically. So I want to figure out ways of actually automatically transforming these programs and tuning them in order to get good performance, in order to actually take advantage of locality. And then we'll rinse and repeat. So maybe look for different subsets of applications, so look at different kinds of irregular applications. The particular rinsing and repeating I'm going to do here is a different set of transformations. So let me just give you the punch line right up front so you know where we're heading. So we're going to focus on tree traversal algorithms. These are algorithms that do a lot of traversals of various tree data structures. I'm going to tell you about two transformations that we've developed, point blocking and traversal splicing. And I'll explain what those mean in a bit. We've done automatic versions of both of these, so we've been able to automatically transform applications in order to implement these transformations and tune them, because irregular applications are data dependent, so tuning is critical in order to get good performance. And ultimately we've shown we've gotten significant performance improvements out of both of these transformations. So let's work our way through the game plan so that I can prove to you that this punch line is more than just a purple box that I've put on the screen. So let's start by narrowing the scope. Let's start by focusing on a subset of irregular applications. And so the particular applications I want to talk about are tree traversal algorithms. These are algorithms where I've built some sort of tree and then I actually repeatedly traverse that tree to compute some quantity of interest. They show up in a lot of different domains. One that many people might be familiar with is the Barnes-Hut N-body algorithm. 
It's an n log n algorithm for simulating astrophysical bodies. It shows up in graphics. So in raytracing one of the most time-consuming parts of the algorithm is figuring out whether a ray intersects a particular object. And the way that you do this, or one way that you can do this, is by building something called a bounding volume hierarchy that tells you something about where the objects are in space, and then I can accelerate this ray-object intersection by traversing this hierarchy, which is basically a tree. It shows up in a lot of data mining algorithms, like nearest neighbor and point correlation. The basic pattern here is that all of these algorithms are building some sort of tree and repeatedly traversing that tree over and over and over again. And because there's a single tree and I'm traversing it over and over again, there's data reuse. And because there's data reuse, there should be an opportunity to exploit locality. >>: If your goal is to automatically do these transformations, you use the word "tree," how do you know something is a tree? >> Milind Kulkarni: That's a good question. We -- so there's two answers to that. One is that actually one of our transformations, point blocking, actually works for any recursive data structure. So any recursive traversal of any recursive data structure can be transformed. So you actually don't need specifically for it to be a tree. It happens to work best for trees because there's a lot of reuse. But it can actually work on any recursive structure. The second answer to that question is that we're punting. There's a lot of work in shape analysis and whatnot of looking at data structures and looking at the ways data structures are used and proving things like acyclicity or treeness. We are basically -- we could build on top of that. 
So if somebody told me this was a tree, and that somebody could be a programmer that just annotates a data structure and says this is a tree, it could be a shape analysis that analyzes the code and says this is a tree. But all we need to know is just that: is this a tree. Right? So I agree with you, right, that this is part of not being everything to everyone. We're not going to be able to do something for -- these transformations aren't necessarily going to work for general irregular applications. So you do need to -- if we're going to focus and say that we're looking at a subset of these applications, somehow you need to know that this is the subset that you care about. >>: [inaudible] so are your transformations guaranteed to be semantics preserving only if there's something called a tree that is being traversed there, or are they semantics preserving for arbitrary code? >> Milind Kulkarni: Point blocking. So there's some dependence structure that you need to be aware of, given certain restrictions on the dependence structure of your program. And I'll talk about that later in the talk. Point blocking is guaranteed to be semantics preserving for any recursive data structure that you apply this to. Traversal splicing, the particular way that we implement it, is only semantics preserving if it's on a tree. >>: Okay. >> Milind Kulkarni: Does that ->>: [inaudible]. >> Milind Kulkarni: For now just assume there's some oracle that says this is a tree. Right? I'm not a shape analysis guy, so this is not -- think of us as a client of shape analysis, if that helps. Okay. So let me just give you a quick overview of a concrete example of one of these tree traversal algorithms, just to help focus the mind a little bit. So point correlation, two-point correlation, is a data mining algorithm. Its basic goal is to tell me something about how clustered data is. And the basic idea is I've got some set of N points in K dimensions. Here it's two dimensions. And I've got some point, P. 
This should be a darker point. It might be a little hard to tell that it's darker. But there's some point P. And I want to find how many other points in this dataset are within some radius R of P. All right? Now, the naive algorithm here is just the N squared algorithm. I'm going to take my point P and I'm going to compare with every other point in the dataset and see whether or not it's within the radius. And if it is, then I'll increase my correlation and I can get a count. So the goal here is I want to say for P its two-point correlation is 2, there are two other points that are in this radius. Okay. So N squared, not so great. So here's one way that you can accelerate this. You can build a data structure called a k-d tree. It's an acceleration structure over this space. And that will let me actually accelerate the process of finding how many points are in my radius. And it does so by letting me quickly prune off parts of the space that I know I don't need to look at, that I know there's no way for a point to be in this space. So let me show you what that means. Here's how a k-d tree works, or here's one example of a k-d tree. So I start with the root node of my k-d tree, and each node basically encompasses some subspace of my overall dataset. So the root node actually encompasses the entire space, and then I'm going to recursively subdivide this space. So I'm going to divide this space into two, and that gives me two new nodes, B and G, so B is talking about the left side of the space and G is the right side of the space, and I can just continue this process recursively, subdividing the space, until eventually I've built this tree structure that captures something about the spatial locality, if you will, geometric locality of these points. And I continue the process until every leaf node has just one point in it. It's a wonderful addition. So how do I use this to accelerate two-point correlation? So here's my point P. 
I want to figure out how many points are in this radius. So I start by looking at the root of the tree and basically asking does this circle intersect this square. If it does, then there is a chance that the point -- that some of the points in the subspace are in my radius, and so I need to keep looking. So then I'll go down to B and say does the circle intersect the square, and it doesn't, so I can actually prune this entire subspace. I no longer need to look on the left side of the subspace. I know that none of the points can possibly be in my radius. I'll move over to G. There's an intersection. I move down to H. I -- once I get to a leaf, there's a single point that I care about, so I look at that particular point and count it. I go to J, I go to K. Right? Now, you can see that this is not necessarily precise. K intersects the circle even though the point is not actually in the radius. Nevertheless, what this lets me do is quickly cut out parts of my subspace, and basically I'm turning what would have been an order N process for a single point into a log N process. Okay. The code for point correlation is actually simple enough I can put it on a single slide. K-d tree building code is a little more complicated, so that's hidden. But the actual traversal is pretty straightforward. I have some set of points, I've got some radius that I care about. For every point in the set, I call this recursive function, I pass a point, I start at the root of the tree, pass in the radius. The recursive function says, well, if the point is too far from the subspace I'm looking at, I can just return. I don't need to keep traversing the tree. Otherwise, if it's a leaf, I update my correlation if I need to. Otherwise I just keep working my way down the tree. >>: [inaudible]. >> Milind Kulkarni: Turns out actually that if you look at the way -- that these codes actually often do look like that. >>: [inaudible]. >> Milind Kulkarni: No, it actually still often gets written like this. 
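The pruning traversal just described can be sketched in code. This is a minimal illustration under stated assumptions -- 2-D points, axis-aligned bounding boxes, one point per leaf -- and the names (`KDNode`, `correlate`, `too_far`) are invented here, not taken from the slide:

```python
class KDNode:
    """A node of a 2-D k-d tree; each node covers a bounding box of its points."""
    def __init__(self, points, depth=0):
        self.min = [min(p[d] for p in points) for d in range(2)]
        self.max = [max(p[d] for p in points) for d in range(2)]
        if len(points) == 1:
            # Leaf: holds exactly one point.
            self.point, self.left, self.right = points[0], None, None
        else:
            axis = depth % 2                       # alternate the split axis
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.point = None
            self.left = KDNode(pts[:mid], depth + 1)
            self.right = KDNode(pts[mid:], depth + 1)

    def too_far(self, p, r):
        """Truncation test: can the circle of radius r around p miss this box?"""
        d2 = sum(max(self.min[d] - p[d], 0, p[d] - self.max[d]) ** 2
                 for d in range(2))
        return d2 > r * r

def correlate(p, node, r):
    """Count points within radius r of p, pruning whole subspaces."""
    if node.too_far(p, r):
        return 0                                   # prune: stop recursing
    if node.point is not None:                     # leaf: check the one point
        dx, dy = node.point[0] - p[0], node.point[1] - p[1]
        return 1 if dx * dx + dy * dy <= r * r else 0
    return correlate(p, node.left, r) + correlate(p, node.right, r)
```

Here `too_far` plays the role of the truncation condition: if the query circle cannot intersect a node's bounding box, the entire subspace under that node is skipped.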
>>: [inaudible]. >> Milind Kulkarni: So this is basically what this code looks like. Right? Let's -- if we simplify the code a little bit, it turns out that this basic pattern is exactly the pattern that all these other tree traversal algorithms follow. There's some recursion, there's some truncation conditions that tell me whether or not the recursion needs to stop, and then there's the continued recursive traversal. And the basic pattern is that I'm doing a recursive traversal of a recursive data structure and I'm doing it over and over again. So these are the algorithms that I want to look at. This is my refined goal. I want to improve temporal locality in repeated recursive traversals of recursive structures. If my algorithm fits this pattern, I want to help you get better locality. Okay. So why is it hard to reason about locality in this kind of algorithm, right? So part of what makes this hard is that when it's written recursively like this, it's pretty tricky, right? There's some pointer chasing going on here, I'm working my way down the tree, there's this recursion going on. It's a little bit hard to reason about what I'm accessing when, in order to figure out where my data uses are, where my data reuse might be. So let's abstract this away. Let's pretend that -- start over. Much of this recursion, much of the pointer chasing, much of the recursion, what it's really trying to do is figure out what parts of this tree I'm supposed to be visiting. What parts of the tree do I visit and what parts of the tree do I not visit. So let's actually throw away all of the recursion and pointer chasing and just assume that there's some oracle that tells me what parts of the tree I'm supposed to visit. So we'll rewrite the code so it looks a little bit like this. I'm basically hiding all of the recursion in this oracle traversal function. This is just an abstract model. We're not saying implement the code this way. But this is our abstract model. 
So for every point in my point set, for each tree node -- so for every node that's part of the traversal I'm supposed to do -- do some sort of interaction, visit this node. And this actually has exactly the same set of accesses that the original code did. It just no longer has recursion. It no longer has pointer chasing. In fact, it's a nice little doubly nested loop. So now that I've got this nice little doubly nested loop, I can turn to my standard bag of tricks and think about something like an iteration space, where in an iteration space I've got one axis for each level of my loop nest, and the axes tell me -- in this case, the Y axis -- which point I care about. The X axis is which node that particular point is visiting in the tree. And I can think about one execution of this program as looking something like this. Point 1 maybe visits nodes A, B, C, D, E, F, and G. Point 2, its oracle traversal does something a little bit different. It visits nodes A, B, G, H, I, J, and K. And Point 3 does something, Point 4 does something, et cetera. All I'm doing is just taking that doubly nested loop and exploding it out into this two-dimensional space. Right? What point 2 is doing is actually exactly the traversal that we saw in our point correlation example. So now that I've just exploded this all out and I'm staring at it as a bunch of circles on a screen, I can start thinking about things like reuse distance and stuff like that. Because I actually see the particular uses of particular pieces of data. So let's take, for example, node C. What I want to know is between one use of C and the next use of C how many other unique data elements am I touching. This is reuse distance. So between when point 1 accesses C and when point 3 accesses C, I've actually touched every other node in the tree. The reuse distance here is ten. And the nice thing about reuse distance is it basically directly maps to what kind of locality I would expect to see given a fully associative cache. 
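The reuse-distance measure being described can be computed directly from an access trace: between two uses of the same element, count the distinct other elements touched in between. A small illustrative helper (not from the talk's slides):

```python
def reuse_distances(trace):
    """For each repeated access in the trace, return the number of
    distinct other elements touched since the previous access to
    the same element (the classic reuse/stack distance)."""
    last_seen = {}          # element -> index of its most recent access
    dists = []
    for i, x in enumerate(trace):
        if x in last_seen:
            # distinct elements strictly between the two uses of x
            dists.append(len(set(trace[last_seen[x] + 1:i])))
        last_seen[x] = i
    return dists
```

With a fully associative LRU cache, the second access is a hit exactly when the cache can hold more elements than this distance, which is the mapping the talk is using.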
So if my cache could only hold eight tree nodes, if it's only big enough to hold eight tree nodes, this second access of C is going to be a miss. If my cache could hold ten tree nodes, this second access to C would be a hit. Right? So if you're looking at this and thinking about this in terms of reuse distance, the first thing that jumps out at you is that because these traversals can go all over the place, in general the reuse distance of a particular node in the tree is on the order of the size of the tree. The size of the tree is on the order of the number of points, so my reuse distance is order N. Not so good. But one thing I can think about is, hang on a second, all of these points are independent of each other. Each point is computing its own correlation. They're just working their way through the tree. So maybe I shouldn't do them in this particular order. What if I reorder the points? Right? So what if I reorder the points so that successive points have very similar traversals. So instead of going 1, 2, 3, 4, 5, let's run the points in this order. Let's run point 1, then point 3, then point 5, then 2 and 4. Right? So we're just going to sort the points in some sense. And now what happens to my reuse distance is the reuse distance of C between one access to C and the next, I've only touched six tree nodes. So my eight-node cache is now big enough to give me a cache hit. And in general what point sorting does, what sorting these points will do, is reduce your reuse distance from order size of the tree to order size of the traversal. So this is a nice big win. And so people have looked at this in various application-specific ways. So, for example, people have looked at taking Barnes-Hut and doing something like a space filling curve over all the points in your space and using that as the order in which you execute points in order to get good locality. The problem is that the sorting is very application specific. Nevertheless, let's assume that we can do this sorting. 
So now I've got this nice win. I've gone from order N to order basically log N. Now, the problem is that when these input sizes get big, right, log N, there's often a pretty big constant there, and this is still bigger than my cache. So in fact if my cache were small or if my traversals were very large, even after doing this sorting, I get essentially worst-case behavior. Right? I touch C, and I'm getting -- so I'm point 1, I touch C and I bring it into the cache, I go around. And when 3 gets to C, it's just been kicked out of cache. So I have to get a miss. And then when 3 goes to D, it's just been kicked out of my cache so I get a miss and so on. Basically getting the worst-case behavior of my cache. So doing the sorting helps up to a point, and then at some point the traversals get so big that it stops helping and I still get bad behavior. >>: How do you know how to sort? >> Milind Kulkarni: That is a great question. It basically involves sitting down and really understanding what any particular algorithm does and then doing the sort. So the sort here -- so what ->>: Looks like you could sort on the locations of the points. Is that an approximation? >> Milind Kulkarni: That is a reasonable approximation for some algorithms, but not all algorithms. >>: But what algorithms ->> Milind Kulkarni: For this particular algorithm, we're looking at that as a good approximation. >>: Okay. >> Milind Kulkarni: That's right. There's other algorithms that we have looked at, and we will see results for, where that's actually not a great approximation. So we can do better. But we'll get to that in a little while. >>: So is this dependent on the fact that this is working on a tree, you said it's a tree? >> Milind Kulkarni: The sorting is -- being able to do the sort in a reasonable way a priori is probably depending something on the -- is partly depending on the fact that it's a tree. >>: [inaudible]. >> Milind Kulkarni: That's right. That's right. 
And the fundamental obstacle here is that this needs to happen before I actually do the traversal, right? I can't do the traversal and then say, oh, right, the thing to have done was to do one and then three. I need to figure out something ahead of time. Which means I need some heuristic. So for the next 10 or 15 slides we're going to assume that the sorting has been done. I'm going to tell you what we can do if the sorting hasn't been done later. >>: What if you were to interchange the two loops, then you could say something like let's do a reverse traversal of the tree, right, and so we'll have perfect access to the tree and then we'll iterate through ->> Milind Kulkarni: That's right. So that is coming in about four slides. Yes. That is exactly right. You don't want to do a perfect interchange because you often have a million or so points, and so interchange actually still has -- so you're getting better locality in the tree but you're getting much less locality in the points. And so we'll see what we can do. But, yes, that's our goal. So just to drive the point home, this is basically looking at Barnes-Hut. And as the input sizes get bigger -- so 10,000 bodies, a hundred thousand bodies, a million bodies -- what you see, first of all, is that traversal sizes don't scale the same way. This is our nice log n benefit. But what you also see is that this is after doing sorting, the miss rate really does start to go up. As the traversals get big, the traversals start -- they get big enough that you're outstripping cache and you start to get misses. And in particular what you see is that here this is percentage improvement over not having done the sorting at all, right, so percentage improvement over the naive implementation, and basically you get diminishing returns. >>: Are the points still laid out in an arbitrary manner in the tree ->> Milind Kulkarni: The points are ->> -- locality, spatial locality? >> Milind Kulkarni: That's a good question. 
The points themselves are in an array, so they've got spatial locality. The tree is all over the place. We're not touching that, so there's no spatial locality in the tree. That being said, the locality ->>: There is spatial locality; you just don't know what it is. >> Milind Kulkarni: That's right. That being said, with the points it's not a huge problem. Because, as we'll see in just a second, in the normal implementation, you basically only get cold misses on points. So I bring the point into my cache, and then I just keep hitting this point while it goes down its traversal, so it's not a huge deal. So I might save a constant factor if I get better spatial locality on the points. But it's not -- beyond that it's not ->>: Well, is the spatial locality [inaudible]? >> Milind Kulkarni: Okay. So let's get to the point where it turns out that interchange or something is the right thing to do. So let's quickly see how this works by -- so one thing we can do is reason about this in an even more abstract way. Once I've sorted the points, the differences between consecutive traversals are actually very small. They're a second-order effect. So I'm actually going to ignore the differences and just pretend that all the traversals look exactly the same. So instead of having this oracle traverse function, I'm just going to replace this with a simple vector of the nodes in the traversal. So I'm going to just assume that every point has exactly the same traversal. And now I really have a nice doubly nested loop that is in fact exactly what compiler writers have loved working with for decades. Right? In fact, this particular piece of code has exactly the same locality behavior as a vector outer-product, taking two vectors and multiplying them together to get a matrix. So to see how that works, if I just replace those vectors with arrays, then I'm interacting with PS of I and TS of J at each iteration. 
That's the same reads that would happen if I were saying A of I,J is equal to PS of I times TS of J. This is just the computation that I'm doing for vector outer-product. And now I can start thinking about all the games that people used to play to get good locality in something like vector outer-product. But first what we can see is this immediately tells me why I get the behavior that I do. In vector outer-product I get only cold misses on the points, because I bring them in once per outer loop iteration and then I hold on to them for the entire inner loop. But I get capacity misses on every access to the TS array because I look at all of TS before I come back around to the beginning. So if TS is too big, I get capacity misses. Loop interchange for vector outer-product, which says let's put the TS loop on the outside -- what this is doing for us is it's just reversing the locality behavior. Now I'm going to get capacity misses on every access to the point and cold misses on every access to the tree. But interchange is not the only trick. It's not the only transformation in our bag of tricks. What you can actually do to get good locality in vector outer-product is tile. What I really want to do is tile that point loop. So if I tile PS, I take a block of PS, and then for every element of TS I look at each point in my block. And what that means, if we back our way out of our abstraction, is let's take a block of points, and then for every tree node in our same oracle traversal, but now over the block of points -- so for every tree node that any of these points might have touched -- then for each point, interact with that tree node. So this is a blocked -- point-blocked loop of the tree traversal. So in case just staring at that code doesn't make sense, which I completely understand, let's look at an example. So here's what happens if I were to block the same program, the same example execution, with a block size of three. 
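The tiled loop being described can be sketched on the outer-product abstraction itself. This is an illustrative fragment -- the array names `ps` and `ts` follow the slide's PS and TS, and the blocking scheme is the standard loop-tiling one, not the talk's exact code:

```python
def outer_product_blocked(ps, ts, block):
    """Compute the outer product ps x ts with the point loop tiled:
    take a block of points, then sweep all of ts once per block."""
    a = [[0.0] * len(ts) for _ in ps]
    for ib in range(0, len(ps), block):        # blocks of the point loop
        for j, t in enumerate(ts):             # one full sweep of the "tree"
            for i in range(ib, min(ib + block, len(ps))):
                a[i][j] = ps[i] * t            # one interaction per (i, j)
    return a
```

Each element of `ts` is now touched once per block of points rather than once per point, so if a block of `ps` fits in cache, the capacity misses on `ts` drop by a factor of the block size.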
So I take a block of three points, 1, 3, and 5, and they all visit node A. Then they all go down and visit node B. They all go down and visit node C. Only some of the points need to visit node D, but the entire block goes down to D, and points 1 and 3 do whatever interaction they need to. They visit E, F, G, and so on. H. If none of the points need to go to a part of the tree, then the entire block can skip that part of the tree. So I'll skip I and J and I'll move to K. And then I can move on to the next block of points, 2 and 4, and so on. So I can block the points. And basically I'm moving these points through the tree in lock step. And here's what happens to our locality. So remember in the original code, in the original implementation, we were getting a miss on every tree node every time we accessed it and we were getting only cold misses on points. Now I'm going to get a miss on every tree node just once per block, because all of points 1, 3, and 5 can interact with node A without A getting kicked out of cache. They can all interact with B without B getting kicked out of cache and so on. And as long as the block is small enough to fit in cache, I still only get cold misses on points. So I've got only cold misses on points and have reduced my misses in the tree by a factor of the block size. >>: So it looks to me, and maybe I'm wrong about this, but it looks to me you're taking the -- you're basically saying here's [inaudible] but it's regular enough, right, and it can't be -- these things can't be totally random, right, because there's no way to sort a bunch of random points into these nice sort of subspaces that are dense enough. Right? >> Milind Kulkarni: That's right. So you said sort of the magic word. The reason this is working nicely is that a lot of times that block is pretty dense and I'm doing all three points at the same time. And the reason that works nicely is because these points are sorted. 
It's easy to find the points that all fit into this block nicely. >>: So what's the property that gives you confidence that when you sort the points you'll get this kind of density? >> Milind Kulkarni: That's right. There isn't one. So what we actually ultimately want is something that does not require that sorting. Right? But, yeah, right. So I can't be guaranteed that I get this kind of density. I can do this kind of transformation regardless of whether I have that density, it just might not help. So what we're going to get to in maybe ten slides is something that will -- sorry. >>: [inaudible]. >> Milind Kulkarni: Yeah, I know. >>: [inaudible]. >> Milind Kulkarni: Is what we can do if they're not sorted. Okay. But this is what can happen. Right? I've reduced my miss rate significantly. So let's quickly talk about when this is correct. So what I've done is changed the order in which I do the computations of my program, which means that I'm potentially violating some dependences that exist in the program. So when is it safe to apply this transformation? So let's think about which dependences are preserved, which dependences are violated. So here's a dependence which says maybe when point 1 visits node B it updates something, for example, it updates its correlation, and then when it visits node E it also updates its correlation. So there's a dependence from this particular iteration to this particular iteration. This dependence is preserved by blocking. The source is still happening before the sink. Here's another dependence that might exist. So when point 1 visits node F, maybe it increments a counter inside the node, and when point 5 visits node F, it also increments that counter or reads the counter or something. So there's a dependence within the node. This dependence is also preserved by point blocking. Point 1 will always get to node F before point 5 gets to node F. This is a somewhat weird dependence that really doesn't arise in practice. 
But if point 3 visits F and point 5 then visits I and there's a dependence there, that is actually also preserved by point blocking. There's only one kind of dependence that's not preserved, which is one that sort of goes in the other direction. So in the original code point 3 would get to node E and would do a write. And then later point 5 would get to node C and do a write -- or do a read -- and I would get my dependence. But now in the blocked code point 5 gets to node C before point 3 gets to node E, and so the dependence gets violated. Now, these arrows are drawn up in a particular way. If you think about sort of all of the machinery that people developed for regular programs, for analyzing when things like loop interchange or loop tiling were correct, there was this notion of direction vectors, the direction that a dependence might go inside a loop nest, inside an iteration space. And it turns out that these blue arrows correspond to 0+ dependences and ++ dependences. The red arrow corresponds to a +- dependence. It's forward in the point direction but it's backwards in the node direction. These are exactly the same correctness criteria as for tiling in a regular program. Now, exactly what it means to be forwards and backwards when we're talking about trees instead of arrays is a little bit tricky, but this gives me some hope -- which we haven't done yet -- but this gives me some hope that we can actually try to automatically figure out when these things are allowed. >>: Yeah. We've also got cases where you can simplify this away if you don't have any point-to-point dependences. [inaudible] that is doing something similar to this for locality and for FPGA programming. But manually [inaudible] some tree-based database, but you can sort the points any way you want. >> Milind Kulkarni: That's right. So if you know, for example, that sorting is allowed, so if you've done the sort, that probably means that there's no dependence across the points.
If there's no dependence across points, that says that the +0 dependence and the ++ dependence cannot exist. And in fact the +- dependence also cannot exist. The only dependences I might ever have are 0+ dependences, which means that this is going to be valid. >>: [inaudible]. >> Milind Kulkarni: [inaudible] that's right. That's right. And so in fact that's what we get. So I haven't yet figured out how to really figure out what it means to be forward and backward in a tree. So what we do instead is look for only 0+ dependences. So what we're going to do is automate this process. So point blocking -- this is the iteration structure we want, this is the computation structure we want, and we want to actually generate it automatically. And here's how we're going to do it. First we're going to identify when we can apply point blocking. So we're going to look for recursive structures, so just classes with fields of the same class. We're going to look for recursive traversals of those recursive structures, so just recursive methods that are invoked on the recursive field. And we're going to look for some enclosing loop, right? So this is exactly what we're looking for. We want a repeated recursive traversal of a recursive structure. And then we use a sufficient condition for correctness, which is that the enclosing loop is parallelizable. This says the only dependence that we could possibly have is 0+. A nice side effect of this is that if all I have are 0+ dependences, it no longer matters if that recursive structure is a tree or a [inaudible] or a graph. I can actually just apply this transformation. So we don't need to do any kind of fancy shape analysis to prove the treeness of that structure. So here's what the transformed code looks like. Here's our original code. Right? Recursive function. Here's the transformed version of the code.
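The shape of that transformation can be sketched, since the slide itself isn't reproduced here. This is a hypothetical reconstruction on a 1-D point-correlation-style traversal; the real system is a source-to-source transformation over Java, and all names below are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the original one-point-at-a-time recursive traversal, and the
// point-blocked version the transformation would produce. Illustrative
// names only; not the actual framework output.
public class BlockedTraversal {
    static class Node {
        double split;        // internal node: values <= split go left
        double pos;          // leaf: coordinate stored here
        Node left, right;
        boolean isLeaf() { return left == null; }
    }

    static class Point {
        double pos; int corr;
        Point(double p) { pos = p; }
    }

    // Original code: one recursive traversal per point, pruning subtrees
    // that cannot contain anything within radius r.
    static void visit(Node n, Point p, double r) {
        if (n.isLeaf()) {
            if (Math.abs(n.pos - p.pos) <= r) p.corr++;
            return;                                  // return in the original body
        }
        if (p.pos - r <= n.split) visit(n.left, p, r);   // recursive calls
        if (p.pos + r > n.split) visit(n.right, p, r);
    }

    // Transformed code: the whole block visits n; "return" becomes
    // "continue", and each recursive call site becomes an append to a
    // deferred next block, recursed on after the loop over the block.
    static void visitBlock(Node n, List<Point> block, double r) {
        if (block.isEmpty()) return;                 // no point needs this subtree
        List<Point> nextL = new ArrayList<>(), nextR = new ArrayList<>();
        for (Point p : block) {                      // original body, per point
            if (n.isLeaf()) {
                if (Math.abs(n.pos - p.pos) <= r) p.corr++;
                continue;                            // was: return
            }
            if (p.pos - r <= n.split) nextL.add(p);  // was: recursive call
            if (p.pos + r > n.split) nextR.add(p);
        }
        if (!n.isLeaf()) {
            visitBlock(n.left, nextL, r);            // deferred calls, once per block
            visitBlock(n.right, nextR, r);
        }
    }

    // Build a balanced tree over sorted coordinates xs[lo..hi).
    static Node build(double[] xs, int lo, int hi) {
        Node n = new Node();
        if (hi - lo == 1) { n.pos = xs[lo]; return n; }
        int mid = (lo + hi) / 2;
        n.split = xs[mid - 1];
        n.left = build(xs, lo, mid);
        n.right = build(xs, mid, hi);
        return n;
    }
}
```

Note how little changed: the per-point body is the original body, the `return` became a `continue`, and the next blocks keep only the points still making their way down the tree, so the block stays compressed as it gets sparser.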
Instead of for each point in the point set, we're going to say for each point block in the point set, and we're passing in the point block to the recursive function. If the point block is empty, if none of the points need to visit this node, we return. Otherwise, for every point we're basically executing the original recursive body of the function. So we didn't actually have to make a lot of changes. This basically can be done with a syntactic transformation. The only difference is that wherever we would have had a return in the body of the function, we replace it with a continue; we move on to the next point. Wherever we would have had a recursive call, instead of making the recursive call immediately, we add the point to a next block, basically a deferred set. And once all of the points have been processed, we make the recursive calls that we need to. And the important point here is that this next block only contains the rest of the points that need to make their way down the tree. So the point block, even as it's getting more and more sparse, at least it's staying nicely compressed as you're moving down the tree. So you're not skipping past a lot of empty slots in your block. Okay. So when you do something like tiling in matrix multiply or any other dense problem, one of the big problems is figuring out what that tile size is. If the tile size is too small, you're adding a lot of overhead. If the tile size is too big, it's too big to help your cache -- because the tile is bigger than your cache, it still doesn't help. And there's basically the same principle here. If the block size is too big, that point block doesn't fit in cache and so I'm still getting misses on the points, which I really want to avoid. And if the block size is too small, I get a lot of unnecessary misses in the tree. I miss once in the tree per block, so I want fewer blocks. So here's just a quick sensitivity study. This is just block sizes versus runtime for Barnes-Hut and point correlation.
And what we see are these nice little U-shaped curves. As the block size gets bigger and bigger, we're getting fewer and fewer misses in the tree. And then at some point we get more and more misses on the points. And so we get this nice little U-shaped curve. Now, the important thing here is that this is unlike dense linear algebra. In dense linear algebra, finding the right block size for a particular application was application dependent, right -- the block size for matrix multiply might be different than the block size for inner-product or outer-product -- and it's architecture dependent, it obviously depends on the size of my cache, but it was input independent. Once I find a block size for matrix multiply, I don't care what matrix you're giving me, I'm going to use the same block size. That's not the same for us. Depending on the particular input you have, the tree is going to have different structure, the way the points are moving through the tree is going to change, and the correct block size is dependent on that. So if we want to tune this, what we need is some sort of runtime auto-tuner. We need to see the input first and use the input to figure out what block size we want. And we have those nice U-shaped curves, so we can just use a gradient-descent search. We do a little bit of random sampling from the input space, and this is because different parts of the input might behave differently, so if you tune for the first thousand points, you might get a different behavior than if you tune for the last thousand points in your set. So we do some random sampling to avoid that. And because we're doing this all at runtime, we don't want to spend a lot of time auto-tuning. So once we consume 1 percent of the points, we stop and just take whatever the best block size is that we found. So we can put all of this together into a source-to-source transformation framework. This is all done in Java.
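The tuning loop just described might look like the following sketch. `costPerPoint` is a stand-in for timing a few randomly sampled blocks of a given size (the real tuner stops after consuming about 1 percent of the points); all names are hypothetical, not the actual tuner's API.

```java
import java.util.function.IntToDoubleFunction;

// Sketch of the runtime auto-tuner: because block size versus runtime is
// roughly U-shaped, a simple descent over candidate sizes (powers of two)
// finds a good block size and stops once the cost turns back up.
public class BlockSizeTuner {
    static int tune(IntToDoubleFunction costPerPoint, int minSize, int maxSize) {
        int best = minSize;
        double bestCost = costPerPoint.applyAsDouble(minSize);
        for (int s = minSize * 2; s <= maxSize; s *= 2) {
            double c = costPerPoint.applyAsDouble(s);
            if (c >= bestCost) break;  // past the bottom of the U: stop searching
            best = s;
            bestCost = c;
        }
        return best;
    }

    public static void main(String[] args) {
        // Synthetic U-shaped cost whose minimum is at block size 64.
        IntToDoubleFunction u = s -> Math.abs(Math.log(s / 64.0)) + 1.0;
        System.out.println(tune(u, 4, 4096)); // picks 64
    }
}
```

Because the measured cost is input dependent, this search runs over the actual input at runtime rather than once per application, which is the key difference from tile-size tuning in dense linear algebra.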
We automatically identify the potential loops for point blocking, automatically apply the transformation, and insert the tuning code that we need. We tried this on a bunch of different benchmarks, six different benchmarks, and because of input dependence we use multiple inputs for several of the benchmarks, so there are 15 different benchmark-input pairs. Everything is written in Java, JVM heap, taking the average of five runs, with a low coefficient of variation. So the error bars here are under 5 percent per bar. Okay. Here's the upshot. Right? This is the takeaway of point blocking. So this is sort of the baseline. I'm assuming the points have been sorted. And we see that there's a pretty wide variance in performance improvements, but we're getting performance improvements across the board, sometimes up to 3X. On average the auto-tuned version gets about a 50 percent improvement. If you're willing to do a post hoc exploration of the right block size for a particular input, you can do a little bit better. >>: So the benchmark and the input -- those are the pairs there? >> Milind Kulkarni: That's right. So this is Barnes-Hut on a random input, Barnes-Hut on a Plummer model input, point correlation on a random input, point correlation on two real-world input datasets, nearest neighbor, k-nearest neighbor, a ball tree implementation, and a raytracing implementation. >>: So no benefit on the raytracing? >> Milind Kulkarni: Yeah, the raytracing implementation, yeah, we just weren't able to get any benefit there. If you do the sort of [inaudible] -- if you determine what the right block size is, there's potential for benefit there. But the auto-tuner, because it's not able to explore all the space, doesn't find it. >>: So how big is each point and each tree node? >> Milind Kulkarni: I don't have those numbers off the top of my head, unfortunately. It varies from application to application. >>: Right.
>> Milind Kulkarni: And that's part of what determines block sizes. So if the points are bigger, then you can fit fewer points in the block before you start to get outside of your cache, things like that. Tree node sizes don't matter quite as much because of the particular implementation that we use. We're basically already assuming that every time you come back to a tree node with a new block it's going to be a miss. So unless an individual tree node is so big that it blows out your cache, you really only need to have one tree node in your cache at a time in order to get the locality benefit. But, yeah, that's definitely one of the effects here. >>: How big are the data sets? >> Milind Kulkarni: They're all roughly on the order of a few hundred thousand to a million points. >>: A million [inaudible]. >> Milind Kulkarni: Yeah. Which is actually not so big for some of these applications. >>: [inaudible]. >> Milind Kulkarni: Yes. >>: So potentially you identify potential traces manually before you start -- >> Milind Kulkarni: You identify the potential places -- no, no, that's also [inaudible]. The loop that [inaudible] gets transformed is identified automatically, by looking for those repeated recursive -- so enclosing point loops that make recursive calls inside them. >>: [inaudible] understanding how much spatial locality you're getting on these guys, particularly in the context of which ones do well? >> Milind Kulkarni: I don't, actually. That's something that we would want to measure. The only place where you have any shot at getting good spatial locality is in the points. So there is potentially spatial locality in the tree, but we're not taking control of how the tree is allocated. So the tree can be all over the place. The points themselves, because of the way -- so one issue here, if I go back, is that what's being added to the next block is actually a pointer to the point. So the blocks themselves don't have great spatial locality.
It's whatever the original order of the points was. So as points sort of get filtered as they're moving down the tree, they stop being next to each other in cache. So there's probably not great spatial locality. So there's definitely opportunity for improvement here. >>: What about the trees, is there anything about the trees? >> Milind Kulkarni: We haven't looked at that. Okay. So let's do one cycle of this rinse and repeat. We're not going to go all the way to the top of the game plan, we're going to go to this step of the game plan, designing transformations. And one of the discussions that we've had already at this point was, well, this sorting was kind of a weird thing, right? It has to be done ahead of time, it has to be done by the programmer, it's application specific. What if you can't do that? What if you can't pack points into blocks effectively? So here's the problem. We get one miss on a tree node per block that visits that tree node. And if the points are not effectively packed into blocks, then more blocks visit those tree nodes. So we get more misses. What you really want is to pack all of the points that are going to visit a particular tree node into one block if you can, because that minimizes the number of misses you get. So this means that you need to figure out something about the order of the points, because you have to build those blocks before you start execution. So point blocking is pretty effective when you can presort the points, but it turns out not to be so effective when the points aren't sorted. Here's what can happen. If we hadn't sorted that original example, if we kept the points in 1, 2, 3, 4, 5 order, now node C is visited both by the first block and by the second block. So the first block gets great locality on C -- the reuse distance is 0, so C is definitely in cache.
But now I have to go through the entire tree before the second block visits C again, and now this next access to C becomes a cache miss. Because I wasn't able to sort the points, because I wasn't able to pack them into blocks, I get all these misses. So what can we do? How can we avoid this particular problem? So it turns out there's two loops in this code. We only tiled one of them. We tiled the point loop. So what would happen if instead of saying for each block of points do something, what if we tiled the other loop? What does that even mean? So here's one interpretation of what it means. I'm going to take a partial traversal -- think of this as taking a subtree, a part of the tree -- and then for every point have it work inside that partial traversal. And then for that point, for every tree node inside that partial traversal, do the work. So this is tiling the other loop, tiling the tree instead of tiling the points. This turns out to be much more complicated to do. Generating this code is much harder. I can't show it to you on one slide. But it has this nice property, which is that as long as this partial traversal fits in cache, it turns out that I get the locality I want regardless of what order these points are in. So the points can be sorted, the points can be unsorted. If that partial traversal fits in cache, I'll get good locality. So I no longer care about sorting the points. So here's what this looks like in practice. And it turns out we can be a little bit more clever than even that. So let's tile the tree. So the first thing we're going to do is figure out where we're going to tile the tree. So we place what are called splice nodes. These are basically points of the tree where, when a traversal gets to that point, that's where we cut the traversal off. That's one partial traversal. So I'm just going to color them in on the iteration space.
Then we're going to execute traversals, and whenever they get to one of those splice nodes, we're going to pause them there. So I'm going to run this traversal until it gets to C and then pause. I want to do everything I can in that first partial traversal first. So I'm going to go back and grab other points and move them to C. The upshot is that the reuse distance of C is now small. The reuse distance of C the first time is 0; the reuse distance of C the second time is 2 instead of 10. >>: But is it possible that the other block of points may have gone down a different branch of the tree? >> Milind Kulkarni: That's right. We're smart about how we grab these blocks so that this doesn't happen. You're absolutely right. That can happen. Here's part of how we're able to be smart about this. So what I need to do next is resume these traversals. They need to start going past C. But what I can notice is that three of these points made it to C and two of the points actually got truncated before they made it to C. Right? As the points are making their way through the tree, they get truncated at different places in the tree. And those truncations are telling me something about the behavior of these points as they work their way through the tree. The points that all made it to C have something in common that makes them have somewhat similar traversals, and the points that didn't make it to C have something in common that makes them not go to C. So what I'm going to do when I resume is reorder the points. And I'm going to take all the points that made it to C and do them first, and then all the points that only made it to B. So I'm going to do some reordering. And when I resume, 1, 3, and 5 wind up in a single block and make it to F. And what's kind of cool about this is that's actually the original sorted order. I've actually recovered the sorting order without actually knowing what's going on.
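The pause-and-reorder mechanism can be sketched in miniature. Here each point's traversal is abstracted as the list of nodes it would visit, which is a stand-in for the real recursive traversal; the names, node labels, and the depth-based splice rule are all illustrative, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of traversal splicing: run every traversal only down
// to the splice depth, pause it in a bucket keyed by where it stopped,
// then resume bucket by bucket. Grouping by pause point is what
// dynamically re-sorts the points.
public class SplicingSketch {
    static List<String> spliceOrder(List<String> points,
                                    Map<String, List<String>> pathOf,
                                    int spliceDepth) {
        // Phase 1: run each point down to the splice depth and pause it,
        // bucketed by its pause node (or wherever it got truncated).
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String p : points) {
            List<String> path = pathOf.get(p);
            String pauseAt = path.get(Math.min(spliceDepth, path.size() - 1));
            buckets.computeIfAbsent(pauseAt, k -> new ArrayList<>()).add(p);
        }
        // Phase 2: resume bucket by bucket -- points that paused at the
        // same node run together, recovering a good order on the fly.
        List<String> order = new ArrayList<>();
        for (List<String> b : buckets.values()) order.addAll(b);
        return order;
    }

    public static void main(String[] args) {
        // pathOf.get(p) = the sequence of tree nodes point p visits.
        Map<String, List<String>> pathOf = new HashMap<>();
        pathOf.put("1", Arrays.asList("A", "B", "C", "F"));
        pathOf.put("2", Arrays.asList("A", "B", "E"));
        pathOf.put("3", Arrays.asList("A", "B", "C", "F"));
        pathOf.put("4", Arrays.asList("A", "B", "E"));
        pathOf.put("5", Arrays.asList("A", "B", "C", "F"));
        // Program order is 1,2,3,4,5 (unsorted).
        System.out.println(
            spliceOrder(Arrays.asList("1", "2", "3", "4", "5"), pathOf, 2));
    }
}
```

With these paths, splicing at depth 2 buckets points 1, 3, and 5 at C and points 2 and 4 elsewhere, so the resumed order is [1, 3, 5, 2, 4]: the sorted order recovered purely from observed behavior, with no a priori sort.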
I'm just looking at the past behavior of these points as they work their way through the tree. So essentially I'm filling the blocks by using this history to figure out how I need to pack things into blocks. And then I'll restart the traversals and go. And I can actually just keep doing this process. So as the points work their way through the tree, they'll constantly be getting reshuffled and reordered. This turns out to be pretty cheap. And the nice thing about this is, I think at the beginning of the talk there was a question about finding the right sorted order. It's possible that for part of a traversal you're similar to point A and for a different part of the traversal you're similar to point B. So any a priori ordering of the points is not going to be good. But because we're able to constantly reorder the points based on their behavior, we can sort of change the order as necessary based on how we're working our way through the tree. >>: Now, when you go through -- I assume that you just have two lists or some number of lists with the points that didn't make it, right? >> Milind Kulkarni: So the way to think about it -- it's a little bit simpler than that, actually. So the way to think about this is that you've basically chopped the traversals up into phases. There's the phase that goes from A down to C. There's the phase -- the way we make this work, there's the phase that does the subphase under C. Then there's the phase that actually gets to F. Then there's the phase that works under F, and so on. So I've chopped the traversals up into a bunch of phases. Every point goes through the same set of phases. It might skip a phase because it doesn't have to visit any of the nodes in that phase. But conceptually every point goes through every phase of this traversal. So when I'm starting the phase that starts at the bottom of C, I actually just gather all of the points in my -- in the program.
Basically all of the points that are in flight. And I gather them based on where they were paused. So when point 2 makes -- or, sorry, when this point makes it to B and pauses, it's sort of waiting in a bucket, just sitting in B. It's paused there. >>: So it's not there in C so you don't grab it. >> Milind Kulkarni: It's not there in C, so you don't grab it. That's right. >>: Okay. >> Milind Kulkarni: So in some sense there are lists at each one of these nodes up here that sort of keep track of where things got paused. >>: Okay. >> Milind Kulkarni: And the nice thing about this is that we're not actually doing any sorting. The order in which we grab points is basically doing the sorting for us. So we don't actually have to do anything. >>: So there's no way to know -- to maintain any dependences. >> Milind Kulkarni: That's right. So this is doing a lot more damage to the iteration space, so it has to be truly parallel. That's right. It turns out, though, that for a lot of these applications, that's what you have. It is truly parallel. But, yeah, we're really doing a lot of violence to the original order of these points. >>: Doesn't your memory [inaudible] go up, because you have too many things in the [inaudible] record where you pause -- >> Milind Kulkarni: We have to -- so, yeah, we have to record that state. And actually we have to do a little bit more than that. One way to think about what's going on here is that each of these points is essentially a thread that's in flight. The point has to carry around some state with it about where it is in the tree and what intermediate data it had on its way down to this point of the tree, et cetera. And point blocking -- so in the original execution, one point is in flight at a time. A single point does its entire traversal and then we go to the next point and so on. In point blocking, a block of points is in flight at a time, so we have to save that much extra space.
And in traversal splicing, every single point is in flight at once. So potentially there's a big explosion in the amount of state that we have to keep track of. What we do is -- you can think about it this way. Points 1, 3, and 5 all made it to C, so there's obviously a lot of similarity in the state they have, because they all made it to C. So we can identify the parts of the state that are going to be the same across points and compress them so they're stored just once. So we're able to really reduce the overhead of keeping everything in flight. So all the points have to be in flight, and in some sense I have to maintain the context for every point, but a lot of these contexts are the same. So there's details in -- so we had an OOPSLA 2012 paper on this. There's details in there about how we do this. But it varies from application to application. In some applications there's less compression that we can do, so the state goes up a bit. >>: Are there some slides on how much better performance you get? >> Milind Kulkarni: Yes. So really quickly, as Ben pointed out, I don't just have -- so it's a necessary and sufficient condition now. It has to be parallel in order for this to work, so we check that. >>: [inaudible] that's what you're doing to this space, right? >> Milind Kulkarni: Yes. It's a full tiling of the space at this point. Implementation and evaluation. So here I'm going to compare four different things. The unsorted baseline now, because our goal here was to say, well, what if we can't do sorting. I'm going to compare to just point blocking. So this is blocking when I don't have nicely packed points. I'm going to look at splicing, and then blocking and splicing. We also use a heuristic to place the splice nodes instead of tuning it. It's not an auto-tuner anymore, it's just a heuristic about where the splice nodes get placed. But it still gets done automatically. So here's what happens.
So one is the performance of the unsorted baseline. The blue line is blocking. So blocking actually still helps, even when the points aren't sorted. In fact, blocking still gets you about a 50 percent improvement. It's just that overall you're still running a lot slower. Adding splicing gets you over a 3X improvement. So if you do splicing, you're getting a 3X improvement over the unsorted baseline. You can actually combine blocking and splicing, like in the example I showed you. It doesn't always help. Sometimes it actually hurts you a little bit, because there's extra overhead associated with doing the blocking, but sometimes it can help. And the main place this helps is if I can't place those splice nodes such that a partial traversal fits in cache, then I'm back to sort of working with the big tree, and so I want to do something like blocking. >>: Where's sorting? >> Milind Kulkarni: Sorting is not on this slide. Sorting is maybe on the next slide. >>: Okay. >> Milind Kulkarni: That's right. Here's the next slide. So this is -- think of this as part two of the talk compared to part one of the talk. Okay? So part one of the talk says I want to do application-specific sorting plus point blocking. Part two of the talk says I want to take unsorted points and do completely automatic splicing. So no programmer intervention required. And what you see is kind of mixed results. So sometimes sorting is really, really, really effective. Barnes-Hut. It turns out that doing the sort is incredibly effective, and so it's hard for us to beat it. So we're slower, although we're still faster than the baselines. Other times, though, for something like k-nearest neighbor, this turns out to be an application where an a priori sort is just not a good idea. The points do really weird things as they work their way through the tree. And so there's no real way to put them together ahead of time in order to keep them sorted.
What you really want is to be able to dynamically move them around, which is what we do. And so we actually do better. And the upshot is, when we look at all 15 of our benchmark-input pairs, what do we see? On average we're basically the same. So if we're doing something fully automatic, it's competitive with doing the manual sorting. >>: [inaudible]. >> Milind Kulkarni: So these benchmarks are mostly pulled out of real applications. So Barnes-Hut is actually the Barnes-Hut -- it is a Barnes-Hut implementation. It's from the Lonestar benchmark suite, which is from the Galois project. Point correlation and nearest neighbor or k-nearest neighbor are all based on a k-d tree implementation that was pulled out of a raytracer, and then we use that k-d tree implementation to do something more like nearest neighbor or k-nearest neighbor or whatever. So they're kernels in some sense. They're not full applications, but they are pulled out of real applications. >>: [inaudible] the real application? >> Milind Kulkarni: So it depends on how you define real application. So, for example, ball tree -- this was pulled right out of the -- I forget the name of it. But it's pulled out of one of the data mining code base repositories. It is truly just: here is a ball tree implementation to do nearest neighbor for you. And so we just took that code straight out of a repository and transformed it. >>: What about the other cases? >> Milind Kulkarni: So Barnes-Hut I would say is a real application. So real applications will use these as kernels, right? So I'm not just doing a nearest neighbor computation. I'm using -- that's right. So then -- but we're not transforming that. So then -- >> Milind Kulkarni: Right. Right. That's right. >>: Okay. >> Milind Kulkarni: I'm going to actually show you some numbers. >>: Were you the one who did all the sorting? >> Milind Kulkarni: For Barnes-Hut we were not, because for Barnes-Hut there is sort of an accepted best way to do the sorting. >>: Okay.
>> Milind Kulkarni: To my knowledge, we are the first people to even say that this is actually a generalizable technique. People have done it in various application-specific ways, but to say what you're really doing is this -- so we did the sorting ourselves here. We sort of did the best we could. >>: Except on Barnes-Hut, nobody had done sorting on these algorithms? >> Milind Kulkarni: Yeah. >>: So just -- and then in order to do it you have to hand code it and you have to understand that? >> Milind Kulkarni: You have to understand the algorithm. >>: So that state of the art isn't really state of the art, it's your state of the art. >> Milind Kulkarni: It's our state of the art. That's a fair point, yes. >>: Which is a good thing to have actually [inaudible]. >> Milind Kulkarni: So this is -- >>: Generalizing them is [inaudible]. >>: So you have contributions to sorting. >> Milind Kulkarni: To sorting as well. Yes. Okay. So let's talk about some ongoing work that we've been doing. So this is actually -- so there's this real application question. So these are actually from a real raytracer application. This is in C++ now. This is a real raytracer. >>: [inaudible] merry Christmas? >> Milind Kulkarni: Those are three different models that you can render. One is a Christmas cabin, one is a dragon -- sorry, one is a cabin, one is like a Christmas tree, and one is a dragon. Apparently. Okay. So here what we're trying to do is not locality transformations; we want to do SIMDization. And it turns out that blocking and splicing are really nice things to do in order to enable SIMDization. So to see why that is, let's go back and look at the original sort of pseudocode. In the original pseudocode there are no loops here. It's just a bunch of recursion and stuff like that. You look at this as somebody who is trying to vectorize code, and you say, what am I supposed to do?
But now, if we add in blocking -- so all we're doing is adding in blocking -- hey, look, there's a nice dense loop for you. Right? So this is a dense loop. This is something that can be vectorized using your standard techniques like if conversion and whatnot. Right? So this suddenly becomes vectorizable by applying blocking. And then why doesn't SIMD always work even if you have a dense loop like this? It's because I can't keep my SIMD lanes full. I've got my four SIMD lanes, but there's only one element in this loop, and so who cares if I can SIMDize. But what does splicing do for you? Splicing reorders points nicely in order to make sure that the blocks stay full. So it keeps these SIMD lanes full for us. And so this is what we wind up being able to do. So here's how to interpret these results. I know it's a little bit small. So one is the baseline. So no transformations applied. The first thing we do is we just apply blocking with a block size of four. This is saying all I want to do is turn the code into something that can be SIMDized. This actually slows down performance a little bit because we add the overhead of blocking and we don't actually get any other benefits. It's not a large enough block size to get good locality. If you then -- now that I've got SIMDizable code, let's add SIMD to it. And in order to add SIMD to this effectively, you need to do transformations like structure-of-arrays and things like that. So adding all of that stuff, where now you're, for example, no longer copying pointers between blocks but copying entire points between blocks, adds enough overhead that you actually slow down if you try to SIMDize this code. Okay. So now let's say, well, obviously, we don't want a block size of four. We want bigger blocks, both because bigger blocks get me better locality and because bigger blocks make it more likely that I'll be able to find more points that I can fill a SIMD lane with. So I block with bigger blocks and that gets me back up above one.
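The if-conversion step he mentions can be sketched on the dense per-block loop that blocking exposes. Both loops below compute the same thing; the masked form is the shape a SIMD unit executes with one point per lane. This is illustrative scalar Java, not actual vector intrinsics, and all names are made up.

```java
import java.util.Arrays;

// Sketch of if conversion on the per-node, per-block loop: the divergent
// branch becomes a per-lane predicate that feeds an unconditional update.
public class IfConversionSketch {
    // Branchy form: only points within r of the node update their counter.
    static void branchy(double nodePos, double[] pos, int[] corr, double r) {
        for (int i = 0; i < pos.length; i++)
            if (Math.abs(pos[i] - nodePos) <= r) corr[i]++;   // divergent branch
    }

    // If-converted form: compute a 0/1 mask per element and always add it.
    static void masked(double nodePos, double[] pos, int[] corr, double r) {
        for (int i = 0; i < pos.length; i++) {
            int mask = Math.abs(pos[i] - nodePos) <= r ? 1 : 0; // per-lane predicate
            corr[i] += mask;                                    // unconditional update
        }
    }

    public static void main(String[] args) {
        double[] pos = {0.5, 2.0, 3.5, 9.0};
        int[] a = new int[4], b = new int[4];
        branchy(1.0, pos, a, 1.0);
        masked(1.0, pos, b, 1.0);
        System.out.println(Arrays.equals(a, b)); // both forms agree
    }
}
```

The masked loop only pays off when most lanes are doing useful work, which is exactly what splicing provides by keeping the blocks full.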
And then I add SIMD to that. And so I get a small improvement. So I'm paying the penalty of the data transformation, and then I get some improvement from SIMDization. The more important thing is that once I do blocking and splicing, the improvement I get from SIMDization is much bigger. SIMD becomes, on average, about 40 percent more effective once you do blocking and splicing than if you did just basic SIMDization. >>: And that's all related to [inaudible] all the instructions actually fire, right, because you're getting -- >> Milind Kulkarni: That's right. Because you're able to [inaudible]. So one kind of cool thing that we can do is we can measure what percentage of the time we spend with our SIMD lanes completely full. And we can actually come up with -- what's the best you can do? It turns out that the best you can do -- it's not the best for performance, but the best you can do in terms of keeping SIMD lanes full -- is to do a full interchange. If I do a full interchange, I'm able to look across every single point and say, do any of these points need to work on this particular node of the tree? Right? And so of course I'm finding as many points as possible to keep my SIMD unit occupied. And when we compare -- it destroys locality, it destroys a bunch of other things. But it turns out -- >>: I have to argue. Unless you order your input data structures statically in such a way that you get the locality you want. >> Milind Kulkarni: Yes. That's right. But this was basically -- think of this as an opportunity study. We want to know how well we are able to do in terms of keeping our SIMD lanes full. And it turns out that if you look at that full utilization, sort of the best you could possibly do, and you look at what we do with splicing, we basically hit that. Splicing basically fills your SIMD lanes as much as possible. Okay. So this is some ongoing work. A little bit of other ongoing work.
So we're looking at putting these on GPUs. This seems like a natural thing also. Something interesting about this, which was somewhat surprising to me, so the key to efficiency -- one of the keys to efficient GPU computation, there's really two things you have to worry about. One is control divergence. I want to make sure that all the threads in my loop are actually doing something useful, they're not just sitting idle because of some control flow divergence. The other is memory coalescing. I want to make sure that when a bunch of threads in the loop are doing a load, I can pack -- I can basically use as few memory transactions as possible to do that load. Right? It turns out that when you take these recursive functions and put them on the GPU, because they're recursive, you actually get -- the control flow reconverges very quickly. And the control flow reconverges quickly in a very suboptimal way. Because it reconverges quickly, I've gotten back to the same part of my code. But now I'm in this part of the tree and you're in this part of the tree. So control flows reconverged and memory coalescing has totally been destroyed. We used to be in the same part of the tree; now we're in different parts of the tree. So we can apply a point-blocking-like transformation to the GPU. It's actually simpler in the GPU because of the various GPU architectural things. But what it basically does is it says let's make control divergence worse. Let's prevent points from reconverging for a little bit longer in order to make sure that we get good memory coalescing. And so this yields -- so I don't have great numbers for you yet because it's still ongoing stuff, but it yields about one to two orders of magnitude speedup over the CPU implementations and one to three over the naive GPU implementations. >>: So how sensitive are the speedup results to the actual dataset? >> Milind Kulkarni: On the GPU or in general? >>: Well, just in general for this work. 
Because it seems like there's a property -- well, there could be a property of the data, then, as its inherent clustering, which gives you the ability to get sort of dense behavior [inaudible]. >> Milind Kulkarni: It's definitely quite input dependent. You could sort of see that here, right? So take something like -- what's a good example? Barnes-Hut. Right? With the random -- so Plummer is much more clustered than random. So actually, somewhat counterintuitively, on the more random data we do better. And that's mostly because the baseline gets really crappy when the data is not -- doesn't have a lot of locality, right? >>: Supposing it has [inaudible] and you -- >> Milind Kulkarni: Are you implying that I am not properly [inaudible]? [laughter] >>: [inaudible] somebody else. He does an exercise like this, and he changes his code so that it does all these locality-enhancing transformations; is the idea then that the same code is going to work well? Doesn't matter what computer you run it on? >> Milind Kulkarni: Yes. And the basic idea here is that there is this runtime auto-tuning component. And this is purely an empirical auto-tuner, right? I'm just looking at, as I change the block size, how is the performance changing. And in fact our auto-tuner is smart enough to do the dumb thing, which says: no matter what block size I've tried, I haven't been able to help you, so I'm just going to fall back to the original code. >>: I see. >> Milind Kulkarni: But you do need some amount of runtime auto-tuning in order to make this work. If you pick a bad block size, your performance will be worse than if you didn't do the transformation at all. >>: I see. >> Milind Kulkarni: In fact, in our splicing results, we made the auto-tuner -- we added some extra stuff: a heuristic check that basically tries to predict whether your data is sorted. So it looks at pairs of points that are in the input and sees whether or not they behave pretty similarly.
And if the data is already pretty well sorted, we don't do splicing. We just do blocking. But you do need this kind of input-sensitive tuning in order to get the right performance. >>: Can it happen, for example, that the transformed -- let's call the original program A and the transformed program A prime -- could it happen that A prime works better than A on a machine with a certain cache size, while A works better than A prime on a machine with a different cache size? >> Milind Kulkarni: Yes. >>: That could also happen. >> Milind Kulkarni: Yeah. >>: The auto-tuner [inaudible] that issue also? >> Milind Kulkarni: Yes. So the auto-tuner does some measurements and basically tries to predict whether doing this blocking is going to be effective overall. So there's clearly some hand-waving heuristics that have to go on inside the auto-tuner also. But the goal -- so here would be an example -- >>: Blocking for the L2. >> Milind Kulkarni: We're blocking, yeah, for the L2. Or a last level. It's actually -- so it's weird, because it's a purely runtime-dependent system; it's hard to say that we're specifically blocking for one level or the other. One of the issues is that these new machines have these L3s that are much harder to measure because they're on core. But if you look at our original set of numbers, we were working with machines that just had L1 and L2, so we could look at exactly what was going on inside the cache. And you could plot L1 miss rate and L2 miss rate. And what you find is that the block size the auto-tuner hits upon minimizes neither the L1 miss rate nor the L2 miss rate. So go figure. >>: But both go down? >> Milind Kulkarni: Both are down from their original place, from the original implementation.
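The sortedness check mentioned a moment ago -- look at pairs of input points and see whether neighbors behave similarly -- might be sketched like this; the similarity measure (absolute difference between adjacent scalar points), the sample count, and the threshold are all invented for illustration:

```python
# Hedged sketch of a sortedness heuristic: sample adjacent pairs of the
# input and check whether neighbors are close (and hence likely to take
# similar traversals). If the input already looks sorted, skip splicing
# and just block. Parameters here are illustrative assumptions.

import random

def probably_sorted(points, samples=64, threshold=0.8, eps=1.0):
    """True if most sampled adjacent pairs are within eps of each other."""
    if len(points) < 2:
        return True
    idxs = random.sample(range(len(points) - 1), min(samples, len(points) - 1))
    close = sum(1 for i in idxs if abs(points[i] - points[i + 1]) <= eps)
    return close / len(idxs) >= threshold

# Tuner sketch: splice only when the input does not already look sorted.
do_splicing = not probably_sorted(list(range(1000)))
```

On an already-sorted input the check fires and splicing is skipped, matching the behavior described in the talk.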
But I could use a slightly -- >>: So performance isn't determined by either one of those miss rates; it's determined by both [inaudible] it's doing something -- >> Milind Kulkarni: Something -- well, it's doing something really dumb that turns out to work out okay. It's not paying attention. The auto-tuner doesn't measure miss rates; the auto-tuner doesn't measure anything else. All the auto-tuner does is it starts a timer, runs a set of blocks that are size 12 or size 16, sees how long it takes. Starts another timer, runs a set of blocks that are size 32, sees how long it takes. >>: Well, it's measuring performance directly. >> Milind Kulkarni: That's right. >>: Because that's [inaudible]. >>: That's what I actually want to optimize. >>: And the miss rates just give you insight into what it's doing. >> Milind Kulkarni: And so if you were to make the block size bigger, your L1 miss rate would have gone up, because now your blocks are more likely to step up [inaudible], but your L2 miss rate goes down. If you make the block smaller, the opposite happens. Turns out that you want somewhere in between. One thing that can be a little bit tricky, and one of the reasons why it's hard to do something like a more analytical model of misses, is that the block sizes aren't [inaudible] as they move down the tree. As you move down the tree, some of the points get truncated early. So the block sizes keep changing as you move through the tree. And so it's very hard to relate the original block size to what you might think of as the average block size. That's also very input dependent. >>: So [inaudible] you can do this runtime estimate, right? >> Milind Kulkarni: Um-hmm. >>: So when you do the learning, what's the most important [inaudible] that you take care of -- the machines, plus if you look at [inaudible] then you have like [inaudible] parameters [inaudible] machine is [inaudible] parameters.
But for this [inaudible] the machines so [inaudible] the runtime [inaudible] machine are running on so the runtime just reuse distance to pick -- >> Milind Kulkarni: It doesn't even use reuse distance. The runtime literally just runs the code at various -- it just tries different configurations of it. So if you think about the way ATLAS works -- ATLAS is one of the original auto-tuned BLAS libraries -- what ATLAS did -- now ATLAS does all sorts of crazy stuff, but the original thing ATLAS did was try a bunch of different block sizes and a bunch of different [inaudible] factors and things like that. It just tried a bunch of different parameters and saw which one ran the best. It wasn't making any predictions based on your cache size. It wasn't making any predictions based on the number of registers you had or your instruction cache size. It was just doing a parameter sweep and picking the best one. That's what we do. >>: Picking the best one based on -- >> Milind Kulkarni: We pick the best one based on runtime. That's right. >>: [inaudible] two machines have five different replacement for cache? >> Milind Kulkarni: Yeah. And so the nice thing about not worrying about any of that stuff is that I don't need to think about how these numbers change when you have a different replacement policy or a different associativity or anything like that. Another thing that helps here: compare with matrix multiply. A good, efficient implementation of matrix multiply has a half dozen or more parameters, maybe a dozen. We have two knobs. So it's tractable in a way that doing a full parameter sweep of matrix multiply is not. Moreover, if we go back and look here, we don't actually do auto-tuning for the splicing. For splicing we just use this heuristic which says, essentially, cut the tree in half.
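The purely empirical tuner described here -- time each candidate block size, keep the fastest, fall back to the untransformed code if nothing wins -- might look roughly like this; all names, the candidate set, and the timing harness are invented for illustration:

```python
# Hedged sketch of an empirical auto-tuner in the ATLAS style: no model
# of the cache, no miss-rate measurement -- just run each configuration
# on sample data and keep the one with the best wall-clock time.

import time

def run_timed(fn, sample):
    """Time one run of fn on the sample input."""
    start = time.perf_counter()
    fn(sample)
    return time.perf_counter() - start

def autotune(run_blocked, run_original, sample, candidates=(8, 16, 32, 64, 128)):
    """Try each candidate block size; keep the fastest. Returning None
    means no candidate beat the untransformed code, so fall back to it."""
    best_size, best_time = None, run_timed(run_original, sample)
    for size in candidates:
        t = run_timed(lambda s: run_blocked(s, size), sample)
        if t < best_time:
            best_size, best_time = size, t
    return best_size
```

The fall-back path mirrors the "dumb thing" mentioned earlier: if no block size helps, the tuner just hands back the original code.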
Because what we found is that when you combine blocking and splicing -- and this is one of the big reasons we combine the two -- you need to do splicing, but performance is not that sensitive to the splice parameter, to exactly how deep that splice is. >>: [inaudible]. >> Milind Kulkarni: Sorry? >>: I mean, [inaudible] so this one you do need good luck to get in there. >> Milind Kulkarni: I wouldn't call it good luck. It's something about -- but, yeah. You could construct a case where this heuristic will do badly. >>: How do you do penalties? You do get penalties. >> Milind Kulkarni: You could get penalties. >>: [inaudible] in practice how is [inaudible] you need some subset of the inputs and then you allocate a certain amount of time to parameter sweeping? >> Milind Kulkarni: That's pretty much exactly it. So basically what we say is we reserve 1 percent of the input points and say that we're willing to spend that 1 percent of the points doing auto-tuning. Now, there's a trick here, which is that, as I said, especially if you have sorted points, different parts of the input might behave very differently. So you can't just take the first 1 percent of your points. That actually turns out not to work so well. So instead you have to sample from different parts of the input space. But there's a further trick, which is that block to block there's actually still some locality, because successive blocks, if the points are sorted, do pretty similar traversals. So you can't just take completely random blocks. You need to account for the fact that when I'm actually running the code I have this block-to-block locality. So the way you do your sampling has to take that into account.
So what happens in practice is we grab groups of points -- so we grab a set of blocks of size 2 in one place in the input and a set of blocks of size 2 at another place in the input and figure out what that means in terms of a block size of 2. Then we do the same thing for block size 4, the same thing for block size 8. So you're sampling sets of blocks out of the input, and we do this until we've consumed 1 percent of the points, or until we've found the knee of the curve. >>: Another one [inaudible] for a programmer who writes a program -- you would try to do the simple thing and then he starts to do the similar thing [inaudible] whether this could be in [inaudible] the programmer writes a simple program like [inaudible] auto-translate [inaudible] programmers' opinions about what direction they should go. >> Milind Kulkarni: Right. Right. I mean, arguably what we're trying to do is let programmers write the simple version and get the performance of the complex version. And, you're right, you could also just use this as a way of saying, it turns out that doing something like blocking might be useful, so it might be worth your time to go and implement it in the right way for your particular application. So there's a bunch of stuff that goes on here. We have to save a lot of data in order to keep track of point context and stuff like that, which you might be able to optimize away if you were doing it manually, which we don't. So there's a lot of stuff where, if you were willing to go in and do it manually, you could probably do better than us. But as I was saying earlier, you still need some of these runtime smarts. >>: Well, sure, you just put some to verify [inaudible]. >>: We don't want to have to understand current architectures. >> Milind Kulkarni: Right. I mean, this is the case for compilers. Why do we want compilers? It's so I don't have to worry about bridging this gap myself. Okay. So conclusions.
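The sampling scheme just described -- draw runs of consecutive blocks from several places in the input, to preserve block-to-block locality, until about 1 percent of the points are consumed -- can be sketched as follows; the exact budget, run length, and seeding are illustrative assumptions:

```python
# Hedged sketch of tuning-sample selection: rather than the first 1% of
# points (biased) or fully random blocks (loses block-to-block locality),
# grab runs of *consecutive* blocks from random spots in the input, up to
# a fixed fraction of the points.

import random

def sample_runs(points, block_size, budget_frac=0.01, blocks_per_run=2, seed=0):
    """Draw runs of consecutive blocks from random positions, stopping
    once roughly budget_frac of the points have been consumed."""
    rng = random.Random(seed)
    budget = max(1, int(len(points) * budget_frac))
    run_len = block_size * blocks_per_run
    runs, used = [], 0
    while used + run_len <= budget:
        start = rng.randrange(0, max(1, len(points) - run_len))
        runs.append(points[start:start + run_len])
        used += run_len
    return runs
```

Each run keeps the points contiguous, so successive blocks within a run see the block-to-block locality the real execution would see.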
So, yeah, irregular algorithms are pretty fertile ground for locality optimizations, at least partly because people haven't spent a lot of time worrying about locality yet. In application-specific settings people do, but for the more general case there's not a huge amount of work here. So it's nice fertile ground. I think it's important to consider these applications at the right level of abstraction. If you get too bogged down in the details of the particular pointer chasing that's going on, or the particular structure of the code, it can sometimes be easy to miss the forest for the trees and miss that, hey, there's actually locality here, and the accesses have a pretty simple, regular structure that I can then do something about. And so in our particular case, using this higher level of abstraction really informed the kind of transformations we wanted to do, it informed the correctness criteria, it informed our reasoning about the locality effects, all of these sorts of things. And the upshot is that we can do all of this automatically and get nice benefits. All right. [applause]. >> Milind Kulkarni: Thanks.