>> Madan Musuvathi: Hi, everybody. I'm Madan Musuvathi from the Research in Software Engineering group, and it's my pleasure to introduce Jonathan Ragan-Kelley today to give a talk here. His expertise is in high performance graphics, but his interests span multiple disciplines: systems, architecture, graphics and compilers. It's actually very nice to see graduate students working in multiple areas and excelling in all of them. He's almost graduated from M.I.T. He did his Ph.D. with Frédo Durand and Saman Amarasinghe, and he is headed to a post-doc, and we'll very soon know where that is. So, Jonathan?

>> Jonathan Ragan-Kelley: Thank you. Thanks for coming and thanks for having me. This is going to be a kind of overview of a lot of different research I've done, but mostly focused on Halide, which is a programming language for high performance imaging that some of you may be familiar with, and which is the focus of my thesis.

So graphics and imaging are everywhere today. And not just in games, movies and cameras, where we traditionally focus in the SIGGRAPH community, but also things like 3D printing, medical imaging, mapping and satellite-based science, and even high throughput gene sequencing and cell biology. In my research, I build compiler systems and algorithms for these sorts of data intensive applications in graphics and imaging. Specifically, I've worked on rendering, high performance imaging and 3D printing. I'll touch on all of these, but I'm mostly going to discuss Halide again, which is a language and compiler [indiscernible] processing.

In this talk, I'll show that the key challenge is to reorganize or restructure computations and data to balance competing demands for parallelism, locality and avoiding redundant or wasted work. And I'll show that by changing how we program and build systems to focus on this sort of reorganization, we can enable simpler program code to run an order of magnitude faster on existing machines and to scale on the highly parallel and bandwidth-constrained architectures Moore's law is giving us. I'll show an example of how code we wrote in a single day, using 25 times less code, can double the performance of an imaging pipeline that was hand-tuned, parallelized and vectorized over three months by an expert working on PhotoShop -- roughly 20 times the performance of everyday C. And I'll show how we can generate a completely different organization on a GPU, delivering several times more performance without rewriting a single line.

But first, why do we care about performance in this domain in the first place? I'd argue that graphics and imaging applications today are still orders of magnitude from good enough. Take light field cameras. By capturing and computationally reintegrating a 4D array of light, you can do things like refocus photographs after they're taken, correct aberrations in the lens, or extract high quality depth from a single shot. But if we wanted to produce eight megapixel 4K light field video at 60 frames a second, current software would take about an hour for each second of light field video. In rendering, even using the same GPUs used to play PC games, a movie frame is still six orders of magnitude from real time. Multi-material 3D printing has huge potential to transform manufacturing, but to directly print production quality objects, we need about a trillion voxels in something the size of a tennis shoe. The printing hardware is almost ready to do this, but it's just a ton of computation and data.
So in this case, if we wanted to finish ten shoes an hour, we'd need to synthesize about four billion voxels a second just to keep the printer busy. At the same time, image sensors are going to be all around us, but we need to figure out what to do with all the data they produce. Just sending it to the cloud isn't going to be enough, since the best cellular radio we could build would need at least a thousand times more energy to send each frame than it took to capture it. And these days, most complex sensing uses imaging somewhere under the hood, from high throughput gene sequencing to automated cell biology experiments, to neural scanning and connectomics. For example, today it takes a large supercomputer just to process the scan of a single cubic millimeter of mouse brain neurons. And really, lots of what I'm going to say in the rest of this talk applies to almost any data intensive application, not just those involving pixels.

So the good news is Moore's law is continuing to give us exponentially more computational resources. As we've been told for the past decade, one of the big challenges we face will be exposing more and more parallelism for this hardware to exploit. Superficially, this seems easy, since all these applications I showed are enormously data parallel. But the real challenge here is that all these millions of data parallel computations aren't independent. They need to be able to communicate with each other. And the same hardware trends that have pushed us from big uniprocessors to lots of parallel cores have made communication and data movement dominate the cost of computation as a whole. And by communication, I clearly don't just mean over networks. I even mean inside a computer or inside a single chip. As a result of that, locality -- moving data around as little as possible -- is at least as big a problem for future programs as parallelism.

To put this in perspective: today, relative to the energy cost of doing some arithmetic operation on a piece of data, loading or storing that data in a small local SRAM like a cache can be several times more expensive. Moving the result ten millimeters across a chip is an order of magnitude more expensive. And moving it to or from off-chip RAM is three or four more orders of magnitude more expensive than computing the value in the first place. And this disparity between local computation and global communication is only getting bigger over time. Because of this, it can often be most efficient to make surprising trade-offs, doing things like redundantly recomputing values that are used in multiple different places instead of storing and reloading them from memory. And this is the first key message I want you to take away from this talk: parallelism, locality and the total amount of work we do interact in complicated ways. They often need to be traded off against each other to maximize performance and efficiency of a particular algorithm on a particular machine.

Now, the architecture community is well aware of this, and they're trying to exploit locality and parallelism to improve energy efficiency in their hardware designs. On the software side, I think we've long recognized the need to expose increasing parallelism in applications, but as a whole, these challenges are still acute in how we design software -- again, especially locality, at least as much as parallelism. So given these pressures, where does performance actually come from?
We usually think of it in terms of two factors: the hardware and the program running on it. But the competing pressures on locality, parallelism and redundant work depend heavily on how the program is mapped to the hardware. So the second key message I want you to take away from this talk is that, because of this, I think it's useful to actually think of the program as two separate concerns -- and this is motherhood and apple pie for compiler and programming language people, I think, but still, I'm going to try to emphasize it anyway. It's useful to think of the program somewhat separately as the algorithm, or the fundamental operations which need to be performed, and the organization of computation, including where and when those operations are performed and where their results get stored and loaded. My work focuses specifically on enabling and exploiting the reorganization of computation and, secondary to that, its interaction with algorithms and hardware. In this talk, I'll show that the organization of computation is key to the things we need to optimize for performance and efficiency, but that it's challenging because of the fundamental tensions between them. Ultimately, if we want to keep scaling with Moore's law, I think we need to treat the organization of computation as a first class issue.

Now, these same themes have come up throughout my research, and I'm going to touch on them in the context of six different projects I've done in grad school, but I'm mostly going to focus on Halide. Halide is a programming language and compiler for modeling and exploiting this exact set of trade-offs in image processing pipelines. This is joint work with a number of other people at M.I.T. and elsewhere, but most significantly Andrew Adams, who has been my day-to-day collaborator on this project at M.I.T. since finishing his Ph.D. at Stanford. And there are papers covering different parts of this, both at SIGGRAPH last year and then the one I just presented a couple days ago at PLDI here.

So to see what I mean by organization of computation, I think it's useful to look at an example. If we want to do a simple 3 by 3 box filter as two 3 by 1 passes, we can write simple C++ code like this. It's just a series of two sets of loops over the image, where the first computes a horizontal 3 by 1 blur and stores it in a temporary buffer, and the second computes a vertical 1 by 3 blur of that to produce the output. Here I'm assuming some operator overloading to hide the multiplication in the indexing, but these are basically just raw arrays. Now, what if we exchange the loops in each loop nest? Most people would say these are the same algorithm, and the loop reordering is easy to understand, since the operations within the loops are independent of each other. This is something most compilers can easily do today. It turns out it's also important: as trivial as it is, because of poor locality, the column-major order is actually 15 times slower than the row-major version on a modern machine. But this is a trivial reorganization, again something that most compilers can already do.

Our focus with Halide is to take this to the next level, starting from what expert programmers currently have to do by hand. And the key is going to be understanding the more complex dependence across the two sets of loops, not just within each one. So this is a hand-optimized version of that same function. It's parallelized, vectorized, tiled and fused. (For reference, a sketch of the clean two-pass version appears below.)
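As a concrete reference, here is a minimal sketch of the kind of clean two-pass C++ blur described above; the exact code from the slides isn't reproduced in the transcript, so the Image wrapper, the names and the bounds handling are illustrative.

    // Clean two-pass 3x3 box filter, written as two loop nests over raw-array-like
    // images. "Image" is a hypothetical wrapper providing (x, y) indexing.
    void box_blur_3x3(const Image &in, Image &blur_x, Image &out) {
        // Pass 1: horizontal 3x1 blur into a temporary buffer (row-major order).
        for (int y = 0; y < in.height(); y++)
            for (int x = 1; x < in.width() - 1; x++)
                blur_x(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
        // Pass 2: vertical 1x3 blur of the temporary buffer to produce the output.
        for (int y = 1; y < in.height() - 1; y++)
            for (int x = 0; x < in.width(); x++)
                out(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;
    }
    // Interchanging each loop nest to column-major (x outer, y inner) computes the
    // same values but, as noted above, runs about 15 times slower due to poor
    // locality. The hand-optimized version discussed next restructures all of this:
    // it tiles, vectorizes and parallelizes the loops and fuses the two passes.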
We had to restructure all the loops and introduce redundant computation on the tile boundaries to decouple them, and we had to change the data layout to make this most efficient. It's a complete mess, given that all we're doing is averaging together 3 by 3 pixels. But it's another order of magnitude faster -- near the peak throughput for this machine. So in this talk, I'm going to deconstruct what this optimized version actually does. To do that, first I'm going to sketch a theory of this problem and its challenges, and then I'll show how we apply this to build a model from which we can actually generate code.

Now, this kind of optimization is hard, both for people and for compilers, because getting parallelism and locality requires transforming program and data structure in very deep ways. To begin with, it's not always obvious what transformations are legal, and even when we know something is legal, it's hard to know whether it's a good idea. Again, the things we want to optimize are often at odds with each other, and the optimal balance is usually very subtle and complex. So just making a single heuristic guess, like most compiler optimizers do, we might actually wind up making our program ten times slower rather than faster. And critically, libraries of optimized code don't solve this either. Take Intel Performance Primitives, for example: it's extremely fast at individual operations, but no matter how much they tune the assembly loops in the Math Kernel Library, the individually optimized kernels still compose into inefficient pipelines, since they can't interleave stages for locality across function boundaries.

I believe the right answer is to explicitly decouple the definition of the algorithm from the organization of computation, which we're going to call the schedule. So in Halide, the algorithm defines what values are computed for each pipeline stage, while the schedule defines where and when they get computed. This makes it easier for the programmer to write algorithms, since it strips out lots of unnecessary details, and makes it easier to compose small fragments of algorithm into larger pipelines. It makes it easy for either a compiler or a programmer to specify and explore the space of possible optimizations, and it leaves the back-end compiler free to do the complex but deterministic work of generating fast code that implements a pipeline, given a defined schedule. So ultimately, that will let us do the same thing as the hand-optimized version with just these four lines of code, or just the top two if we use auto-tuning to infer a good schedule.

This example, and the work I'm going to focus on, is at the intersection of stream programs and stencil computation, but it also draws heavily on ideas from loop optimization, parallel work scheduling, region-based languages and more -- what I have here is just a tiny sample. I think relative to most prior languages and compilers, the two biggest differences are that we focus on treating the organization of computation as a first class part of the front end programming model, and that we don't just optimize parallelism and locality, but also focus on the potential of introducing redundant computation to optimize the overall organization.

For this type of data parallel application that I've focused on, I think it's often easier to understand the organization if we look at it as a pipeline. And for the simple blur example, the pipeline looks like this -- I've reduced it to 1D, but all the problems are still the same.
The input image is at the top, flowing down to the blurX and blurY stages below, corresponding to the two loop nests we had. I'm going to use this view to introduce a theory of the types of reorganization that we do in Halide. So again, what's most significant here is not what happens within each stage, but what happens across the two stages.

The first versions we looked at executed each stage breadth first, which means they compute every pixel in the first stage before computing anything in the second. That means the first stage produces lots of intermediate data before it ever gets used in the second. As a result, it has to be sent far away and written to memory, and then the next stage has to slowly read it back in. This is fundamentally because the way we've organized computation here has poor locality: once we compute a given value in the first stage, we have to compute lots of other unrelated values before we ever need to actually use it in the second. In this view, the locality is a function of the reuse distance along this red path we take through the computation.

So we can improve locality by interleaving these two stages instead. I'm going to start with a cartoon version, ignoring the dependencies between the stages, and we'll take care of the final details later. In the interleaved organization, we first compute a small piece of blurX, then immediately send the results down and compute the corresponding piece of blurY. After we're done, we can throw away the intermediate result and move on to the next piece. Then we repeat the same process to compute the two stages on the next part of the image. Repeating this interleaving over the whole image improves producer-consumer locality. It does this by reducing the reuse distance between where values get produced in the first stage and where they get consumed in the second. In practice, this means they can be kept nearby, like in registers or [indiscernible] cache, instead of having to move long distances to and from main memory. This reorganization of computation is often called fusion, because we've merged the computation of the two stages together. And in practice, this makes optimization a global problem of carefully interleaving the computation of the whole pipeline. So again, I want to emphasize that you cannot address locality just by locally tweaking operations in the inner loops.

Now, fusion is something the compiler community's done for decades, but this type of pipeline is actually more complex than most of what people traditionally do with what they call fusion. To understand how we can actually fuse a given computation, we have to look at the dependencies, and these are the details I was ignoring in the cartoon version a second ago. These dependencies are a fundamental characteristic of the algorithm we've defined. In this case, looking at one pixel in blurY, it depends on three neighboring pixels in blurX. But its neighbor depends on an overlapping set of pixels. This means that the pieces or tiles that we wanted to interleave during fusion actually depend on overlapping tiles further up the pipeline. And we usually want to decouple the execution of tiles, both to minimize reuse distance across them and to execute them in parallel. But to do this, we have to break the dependencies between them.
And splitting them up means we need to duplicate the shared work -- the shared computations along the boundary -- which introduces redundant work. This is, again, a key trade-off that experts make all the time when hand-optimizing this sort of pipeline, but which most existing compilers can't consider. So by removing that dependency, we can execute in parallel both within each tile of each stage and across the tiles. Now we can independently compute a tile of the first stage, send it down, and compute parallel tiles of the second from that. Then each of these independent computations can throw out their intermediate data, moving on to the next tile they need to process. So we not only have parallelism here, we also get good locality from the short, fixed reuse distance within each of these tile computations. But to get there, we had to make a trade-off. We had to introduce redundant work to break the dependencies in the algorithm and enable this overall reorganization. In this case, that's the right choice, but it's not always going to be.

So this blur we're looking at is just a trivial two-stage pipeline, and something more realistic looks like local Laplacian filters on the right. And even that is actually a heavily simplified cartoon of the local Laplacian filters graph. Local Laplacian filters has about a hundred stages connected in a complex graph, and locally, most of the dependencies are simple stencils, like in our blur, but globally they're extremely far-reaching and complex. Overall, this gives us lots of choices about the organization and degree of fusion at each of dozens of stages throughout the whole graph. Each strategy can mean a totally different set of dozens of kernels or loop nests, completely restructuring all the code, and the difference between two plausible strategies can easily be an order of magnitude.

To put this in concrete perspective, local Laplacian filters is used heavily in Adobe's Camera Raw pipeline, which is a major feature of both PhotoShop and Lightroom. Adobe's original version was written by one of their best developers in about 1500 lines of C++. It's manually multi-threaded and hand-coded for SSE, and it took him about three months to implement and optimize. Just like our blur example, the optimized version is about ten times faster than a clean version written in about 300 lines of C. But then last summer, I took a single day to rewrite their version of local Laplacian filters in Halide and integrate it into Lightroom. It took 60 lines of code to express the same algorithm and ran twice as fast as their version on the same [indiscernible] machine. At the same time, Adobe was starting to investigate GPU acceleration, but they were constrained by the cost of potentially rewriting tens of thousands of lines of optimized code across this [indiscernible] code base. But on that same day -- and actually this number is now 9X with a slightly different schedule -- on that same day, just by changing the Halide schedule, we also generated mixed CPU-GPU code that ran seven to, now, nine times faster than the original eight-core CPU version. And the best schedules on different architectures here optimized parallelism and locality in different ways. So this wasn't just a translation. This was a totally different organization of the computation.

>>: [indiscernible].

>> Jonathan Ragan-Kelley: It was a few hours of each and a few hours of integration.
I was actually partly drawing on some existing related stuff we'd done, so it might have taken a day and a half if I'd done it entirely from scratch.

>>: [indiscernible] on the back end? I mean, at the high level, do you see that generating code for the GPU, where now the locality is kind of different?

>> Jonathan Ragan-Kelley: So the model we have for expressing the space of choices is exactly the same, and there's a one-to-one mapping between the way those choices work out in one back end and what they mean in another. But the best choices are going to be different. And I'll get to this later, but we don't have any heuristic optimizers going on here. So far, everything I'm going to show is either handwritten schedules or stochastic search over the space. So there's no awareness of whether locality might matter more or less in different contexts. It's just purely empirical. It's based on what runs fast.

>>: Can you talk about what sort of things Halide did that the expert programmer couldn't do or didn't do?

>> Jonathan Ragan-Kelley: Yeah, I'll come back to that at the end, a little bit more at least. In something this complicated, the answer is a lot of tiny stuff, and it's actually hard to fully tease it apart. It's basically a different giant set of several dozen loops or kernels. But yeah, I will come back to it at the end.

So this brings us back to the first message that I gave you, which was that optimizing performance requires fundamental trade-offs. We saw there's a tension between locality, or reuse distance, and introducing redundant work to break long dependency chains in an algorithm. We can trade off along this axis by varying the granularity at which we interleave these stages: we can use smaller tiles for better locality, or bigger tiles to minimize the fraction of redundant computation along the tile boundaries. It turns out there's also a tension between these and parallelism, which requires independent work both within and across stages.

My second message was that these trade-offs are determined by the organization of computation. We've seen that changing the order and granularity of computation can have a huge impact on performance. Again, in this visualization we've been using, these aspects of organization are given by the red path we take through the computation, and the algorithm defines the dependencies between the individual tasks. This leads to my last major message, which is that the ways we can reorganize this graph, and the effect each choice has on parallelism, locality and the total amount of work we do, are a function of the fundamental dependencies in the algorithm. And this is also one of the places where domain specific knowledge can make the biggest difference.

So putting it all together, I think what I've been showing you is actually a very general problem formulation. We basically just have a task graph. We have pixels, or per-pixel computations, as the tasks themselves. They're linked by their dependencies, which are encoded in the algorithm, and then we can organize the computation by choosing the order in which we traverse the graph, or the task schedule. We can change the schedule either to minimize reuse distance for locality or to maximize the number of independent tasks for parallelism. And then finally, we can also break dependencies and split the graph, but this introduces redundant computation on the boundaries. So we have all these choices.
Most traditional languages deeply conflate the underlying algorithm with these sorts of choices about how to order and schedule its computation. So back to the optimized blur again. It's completely different than the original. The code uses different instructions, different data layout and deeper loop nests. It shares almost no code with the original version. But all that's happened here is the sort of reorganization I just showed, to optimize for parallelism and locality. Specifically, it's mostly doing that sort of tile-level fusion. That was the main thing I just teased apart. The parallelism here comes from distributing the work across threads and from computing in eight-wide SIMD chunks within those threads, just like you might imagine. But again, exposing that parallelism in a problem like this is actually far and away the easy part. Just as important, and often a lot harder to think about or express, is the locality through the pipeline. Without that optimization, even the most well-parallelized pipeline is still severely bottlenecked on memory bandwidth. Here, the optimized code improves locality by computing each stage in tiles, interleaving the computation of tiles across stages, and keeping the intermediate data in small local buffers that never leave the cache. It has to redundantly recompute values on the boundary between intermediate tiles to decouple them.

Now, as programmers, we often think of this as a different algorithm from the original. But just like the simple loop interchange I showed at the beginning, I'd argue it's more useful to think of them as the same algorithm, where the computation's just been reorganized. I don't expect everyone to agree exactly with my choice of the word algorithm here, but I do hope you can see that the separation of concerns is useful. Then for a given algorithm, we want to find the organization that optimizes performance and efficiency by making the best overall use of parallelism and locality while minimizing the total amount of work that we do. The compiler community has done heroic work to automatically understand and reorganize computations like this coming out of, you know, an existing language like C. But it's fundamentally a challenging problem. Again, Halide's answer here is to make the problem easier by explicitly decoupling the definition of the algorithm from the organization of computation, which we call the schedule. So the algorithm defines what values get computed for each pipeline stage and the dependencies in between them, while the schedule defines where and when they get computed.

Once we strip out the concerns of scheduling, the algorithm is defined as a series of functions from pixel coordinates to expressions giving the values at those coordinates. The functions don't have side effects, so they can be evaluated anywhere in an infinite domain, and the required region of each stage gets inferred by the compiler. The execution order and storage are all unspecified, so the points can be evaluated or re-evaluated in any order. They can be cached, duplicated, thrown away or recomputed without changing their meaning. So the resulting code looks like this for the simple 3 by 3 blur. The first stage, blurX, is defined so that at any point (x, y) it's the average of three points of the input, and then blurY at any point (x, y) is the average of three points in blurX. And notice that we don't have any bounds over which these are being computed. These are functions over an infinite range of free variables. (A sketch of roughly this code follows.)
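Here is a minimal sketch of what that two-stage blur algorithm looks like in Halide's C++-embedded syntax; the function and variable names and the input type are assumed from the description above, and "input" stands for an image parameter bound at runtime.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        ImageParam input(UInt(16), 2);            // a 2D, 16-bit input image (assumed type)
        Func blur_x("blur_x"), blur_y("blur_y");
        Var x("x"), y("y");

        // Each stage is a pure function over an unbounded (x, y) domain.
        blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
        blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

        // No loops, bounds or storage are specified here; that is the schedule's job.
        return 0;
    }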
So this is basically a field over a 2D grid. The specific scope of this programming model is what lets us have lots of flexibility in scheduling. First, we only model functions over regular grids, up to four dimensions in the current version. Second, we focus on feed-forward pipelines, which is a good fit for this domain, but not for everything. We can express recursive functions and reductions, but these have to have bounded depth at the time they're invoked. These last two points actually mean that our programming model is not technically Turing-complete, because we would need potentially infinite-sized pipelines to express arbitrarily complex computations. Also, as you saw from the little snippet of code for the blur algorithm, there aren't any bounds or sizes or regions specified, so to work well, we need to be able to accurately infer the dependence patterns between different stages. For most patterns, our analyses are both general and fairly precise. But in some cases, the programmer might need to explicitly clamp an index expression to a reasonable range to ensure we don't allocate too much intermediate buffering. Finally, our problem domain's different from most traditional work on stencil computations in scientific computing: where those usually deal with thousands of repeated reapplications of the same stencil to a single grid, our applications are usually large graphs of heterogeneous computation, often over a hundred different stages or more, with complex and varying dependence between them. And we potentially need to schedule all those stages differently. Yeah?

>>: Just curious, does the compiler let the user know when it can't infer tight bounds?

>> Jonathan Ragan-Kelley: So the only context -- we can infer good bounds through everything except for stuff that's, like, just fetching an arbitrary memory address. In that case, the bounds are just the bounds of the type. We can throw a warning there. We don't currently. Usually, you'll run into the fact that that will allocate too much memory instantly the first time you try to run it. That's actually an easy thing to check. Then we have built-in clamping expressions in the language that you'll frequently want to use anyway, at least at the very beginning. Generally, you wind up writing clamp-to-edge on your inputs (a sketch of that idiom appears after this exchange). There are a few other things like that, and then through the whole rest of the pipeline, you don't bother with thinking about boundaries, and little, you know, fringes of necessary access are added if needed, but everything kind of carries through fine.

>>: What if you do something like a matrix transposition, where the [indiscernible] for the destination but for the sources?

>> Jonathan Ragan-Kelley: Yes. So if you look at how the algorithms are written, everything is written from the perspective of the output point. One stage basically says -- you're not iterating over some domain and reading from somewhere and writing to somewhere. You're saying, at any logical point in my domain, which in that case would be the output matrix, the value at that point is given as some expression of the coordinates in other, previous functions. So in the case of transpose, that would be easy.
In the case of situations with all-to-all communication, like trying to write an FFT in this, we can do that, but it's not the context where the scheduling transformations that are the focus of most of what I'll show have the most power.

>>: Seems like the dependencies are inferable, the computation [indiscernible].

>> Jonathan Ragan-Kelley: So you mean like a data-dependent kind of warp, or --

>>: [indiscernible].

>> Jonathan Ragan-Kelley: Yeah, so what we would usually wind up doing in that context is write basically a reconstruction -- the second stage would be, you know, a nine-tap filter over the previous stage, with the actual indices being computed values based on the scale factor and the input indices. Does that make sense? Are you wondering about how we can actually comprehend what the bounds of that are, or --

>>: Might vary with the output position.

>> Jonathan Ragan-Kelley: Oh. So that's -- I mean, I think that would just wind up being an expression where the weights are a function of some other expression, if that makes sense. But you would usually still be writing it from the perspective of every output location. So, I mean, it's the same way you'd write it if you wrote a GPU shader that wanted to do that -- if you were manually filtering your textures for some non-integer-sized resampling.

>>: [inaudible] across operations. It just assumes the data is there, correct? Let's say you're doing discrete image rescaling, and you want to do it by a user-selectable amount. We're going to even ignore the filtering part now, just nearest neighbor, right? If I say in one iteration I want to rescale by 3.7, and the next iteration by 7.2, then the bounds analysis has to change, right?

>> Jonathan Ragan-Kelley: Correct. Well, the bounds analysis just has to be conservative with respect to that. So one thing that we would commonly do in that context is just say, in the schedule, which I'm going to explain now, that that entire prior stage should have been computed, if you're potentially going to be using large rescaling factors, so that you may be accessing very differing ranges of the image. So I guess I'd say two things. First, our bounds analysis doesn't infer static constant bounds. It statically infers expressions that compute the bounds. Which means that if those bounds are some kind of expression in terms of, like, a rescaling parameter and an input image size parameter or something like that, then the bounds at a given point in the computation will be computed immediately before doing it. So it can adapt in that kind of way. But if you're likely to be touching an entire image for even very small regions of an output, then it usually is the best choice to make sure that entire image has been computed to a buffer that you can access, just because that's the most efficient strategy.

>>: [indiscernible] one might be shrinking, one might be growing.

>> Jonathan Ragan-Kelley: Yeah.

>>: But across the image, the --

>> Jonathan Ragan-Kelley: Yes, but at any given -- if you think of splitting it up into tiles, and you schedule, which I'm about to explain, the input to that to be computed as needed for tiles of the output, then the bounds on what size tile of the input you need for each of those tiles are going to be computed dynamically every iteration. So we're really injecting dynamic expressions that compute those bounds as it goes along. Like, as needed.
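For reference, here is a minimal sketch of the clamp-to-edge idiom mentioned above, using Halide's built-in clamp; the stage name is illustrative, and "input" is assumed to be an ImageParam or buffer whose width and height are known at runtime.

    // Wrap the raw input in a stage whose index expressions are clamped to the
    // valid region, so bounds inference never requests pixels outside the image.
    Func clamped("clamped");
    clamped(x, y) = input(clamp(x, 0, input.width() - 1),
                          clamp(y, 0, input.height() - 1));
    // Downstream stages then read from clamped instead of input, and no further
    // boundary handling is needed through the rest of the pipeline.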
So given that model of the algorithm, the schedule defines the organization of computation by specifying two things. First, for each stage, in what order should it compute its values? Remember, we're computing whole grids of pixels here, so in what order should we compute the points in this grid for each function? Second, when should each stage compute its inputs, and at what granularity? This defines the interleaving of computation between the producers and consumers in the pipeline, which determines the amount of fusion between stages. And this is the model we used to describe the space of computational organizations I showed earlier. It can be specified explicitly by a programmer or inferred and optimized by the compiler, and different choices about order within and across stages can describe all the types of reorganization I showed earlier.

So first, the schedule defines the order we compute values in within each stage, and this is just the simple stuff you might imagine. If we have a 2D function, we can traverse it sequentially across Y and then X, giving us a simple row-major loop nest. We can compute the X dimension in four-wide vectors. We can transpose those dimensions if we want. We can distribute the scan lines across parallel threads. And we can split the X and Y dimensions to separately control the outer and inner components of each. So in this case, by reordering the split dimensions, we get a simple tile traversal.

Now, what all these choices actually do, as you might imagine, is specify a loop nest to traverse the required region of each function. So, for example, the serial Y, serial X order gives a basic row-major loop nest over the required region of the function. And to the point about dynamic bounds: I'm using y_min and y_max kinds of expressions there, and in practice there would be some computation immediately before this -- which might be hoisted if it's independent of everything in between, but might not be if it's not -- that actually computes what y_min and y_max are in terms of its whole environment. So putting everything together, we can split into two by two tiles, spread strips out over parallel threads, and compute scan lines in vectors. This synthesizes a 4D loop nest including a multi-threaded parallel loop and an inner vector loop. The specific complexities here don't really matter; I just want to give a sense that the choices here are both rich and composable. (A sketch of these scheduling calls appears at the end of this passage.) And to be clear, this stuff on the right is not code any programmer writes. It's just meant as a representation of what these choices on the left correspond to. So this is basically what the compiler is synthesizing under the covers.

>>: [inaudible].

>> Jonathan Ragan-Kelley: Yes.

>>: [inaudible].

>> Jonathan Ragan-Kelley: No, you specify this for all the stages. You can specify this for all the stages. There are sane defaults, but you specify this for every stage. So you're actually making both these order choices and the interleaving choices that I'll show in a second for basically every edge in the whole graph. It's a lot less verbose than that sounds, and it can be inferred automatically.

So then once we have the loop nest for evaluating each function's own domain, we need to decide how to interleave them with each other to compute the whole pipeline. The other thing the schedule defines is the order of evaluation between producers and consumers in the pipeline, which determines the degree of fusion between stages.
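Before getting into that producer-consumer interleaving, here is a minimal sketch of what the per-stage ordering choices described a moment ago look like as Halide scheduling calls, applied to some stage f; the split factors are illustrative, and the loop nest the compiler would synthesize is shown only as comments.

    Var xo("xo"), xi("xi"), yo("yo"), yi("yi");
    f.split(x, xo, xi, 4)       // x -> (xo, xi), with xi ranging over [0, 4)
     .split(y, yo, yi, 4)       // y -> (yo, yi)
     .reorder(xi, yi, xo, yo)   // innermost to outermost: a tiled traversal
     .parallel(yo)              // distribute rows of tiles across threads
     .vectorize(xi);            // compute the innermost dimension as 4-wide vectors

    // Roughly the loop nest this implies (synthesized by the compiler, not
    // written by the programmer):
    //   parallel for yo:
    //     for xo:
    //       for yi in [0, 4):
    //         vectorized xi in [0, 4):
    //           compute f(xo*4 + xi, yo*4 + yi)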
So for simple point-wise operations, it makes sense to fuse the pipeline by inlining the computation of each point in the first stage into the corresponding point in the second. Then, as soon as we apply the tone adjustment to one pixel, we can immediately color correct it, throw it away, and do this across the image, producing and consuming individual pixels through the pipeline. But this simple strategy only works because of the one-to-one dependence between pixels here. And that's normally what we would mean by loop fusion.

If instead we had a large blur as the second stage, then each pixel in the blur depends on a lot of pixels in the first stage, so the dependence of even far-away pixels has lots of overlap. If we fuse, we have to recompute or reload lots of the same values over and over again for each point in the blur. In this case, it makes more sense to precompute the entire producer stage before evaluating the entire consumer stage. So we can choose to sacrifice locality to avoid wasting too much work.

And finally, for stencil operations like our 3 by 1 blur, every point in the consumer stage depends on a bounded window of the previous one. In this case, full fusion would inline lots of redundant computation where the stencils overlap, while computing breadth first would completely sacrifice producer-consumer locality. So not surprisingly, the right answer is somewhere in between. We can schedule the stages to be computed within smaller tiles sized to fit in the nearby caches. If we then interleave the computation of a producer tile with the corresponding consumer tile, and do the same thing over the image, this is the same choice we saw with the optimized 3 by 3 blur example earlier. We may have to redundantly recompute values on the shared boundary of the tiles. But again, by changing the size of the tiles, we can trade off between locality and redundant computation, which spans the space of choices between full fusion and breadth-first execution.

So our model of scheduling is specifically designed to span the space of locality trade-offs and to describe them separately for every pair of stages in the pipeline. The way we actually trade this off is the most fundamental idea in Halide's design. We do it by controlling the granularity at which the storage and computation of each stage get interleaved. The granularity with which we interleave the computation determines how soon you use values after computing them. So fine interleaving provides high temporal locality, quickly switching back and forth between computing and consuming values, while coarse interleaving provides poor temporal locality, computing lots of values before using any of them. Then the storage granularity determines how long you hold on to values for potential reuse. Fine-grained storage only keeps values around for a short time before throwing them away, which potentially introduces redundant work if we need them again later, while coarse-grained storage keeps values around longer to capture more potential reuse.

So using the examples from the previous slide: full fusion interleaves the storage and computation of two stages at very fine granularity. This maximizes locality but potentially introduces lots of redundant work. Breadth-first execution avoids any redundant work, but to get that it sacrifices locality. And tile-level fusion trades off between these extremes depending on the size of the tiles that are interleaved. But now what about the whole other axis of this space?
It turns out it's also possible to maximize locality while doing no redundant work, but doing so requires constraining the order of execution, which limits parallelism. By setting the storage granularity to the coarsest level while interleaving the computation per pixel -- which was the far bottom right corner of that triangle -- we get a sliding window pattern like this. Now, after computing the first two values in the first stage, for every new value we compute, we can immediately consume it to compute the corresponding value in the second. This has excellent locality, and it computes every value only once, wasting no work. But to do this, notice that we have to walk across the two images together in a fixed order. So we've sacrificed potential data parallelism to optimize locality and redundant work.

And I want to emphasize that these examples I've been showing are not just a laundry list of scheduling options. Fundamentally, this is a parameterization of the space of trade-offs for pipelines of stencil computations like this, and our schedules directly model and span this whole space.

>>: Do you combine strategies sometimes? I can imagine doing that at the tile level.

>> Jonathan Ragan-Kelley: Yeah, absolutely. So one of the reasons why the domain order stuff matters is that it's at the dimensions of those loop nests that we can determine how to interleave storage and computation, so you can split a dimension a whole bunch of times, create a recursive tiling -- and actually, in just a second, I'll show a couple things like that. So yes, the whole point is that we can combine all of this stuff in different ways for all the different stages.

>>: The sliding window, is that like a streaming computation?

>> Jonathan Ragan-Kelley: Um-hmm.

>>: That doesn't really correspond to any of the other schedules that you mentioned, right? It's not really a fine --

>> Jonathan Ragan-Kelley: So one of the big realizations here is that you actually can think about it that way. If you think about it in terms of compute and storage granularity, you can express the sliding window, ignoring the fact that we're reusing storage. Like, if you imagine I left all the previous pixels I computed sitting around, it corresponds to the coarsest granularity of storage -- meaning I store everything for all time -- but the finest interleaving of computation. So for every new pixel of the producer I compute, I compute one of the consumer. And then the actual transformation that reduces the amount of storage to a small circular buffer, and recognizes exactly which previously computed values can be discarded, depends on the fact that we're walking along it in a fixed order. So it's basically by picking some point down here in the space, where we have a large window over which we could reuse all the values but we're quickly interleaving -- computing one new value and consuming a new value -- combined with the fact that the intervening loops are all in a known order, you know, they have a known dependence vector, that gives that pattern. Does that kind of make sense? It's a little easier to see on a whiteboard. So yeah, one of the big realizations here is that that actually does fall into this space, which was a bit surprising.

So even for a pipeline as simple as the two-stage blur, we can express an infinite range of choices. You don't have to follow what's going on in all of these, and I'm actually just going to jump through them quickly.
But the top three sit at extremes of the trade-off space, and the bottom three balance them all in different ways. Here's what they actually look like running. Some of these -- to your question, Matt -- actually are doing combinations of basically sliding window on chunks, or line buffering, or other things like that. And the bottom three, in particular, are making fairly complicated combinations of tiling and sliding windows and so on. Different strategies here are best depending on the exact algorithms you use, the structure of the pipeline in which they're composed, and how they interact with the underlying hardware. And the difference between these can be more than an order of magnitude of performance and efficiency on the same machine, even between two plausibly high quality ones.

So putting this all together to describe the optimized blur looks like this. It's the same blur algorithm from before, with the schedule equivalent to the optimized C++. First, it schedules blurY to be evaluated in 256 by 32 pixel tiles. Each of these tiles is computed in eight-wide vectors across the inner X dimension, and the scan line strips are spread across parallel threads. Then it schedules the intermediate stage blurX to be computed and stored as needed for each tile of blurY, and it also vectorizes it across X. (A sketch of roughly this schedule in code appears at the end of this passage.) These compute-at and store-at choices correspond to the positions on the two axes of that triangle, and they're given relative to, effectively, loop levels of the loop structure that we've defined for the previous function, if that makes a bit of sense. It's a little easier to see on the whiteboard. And this generates nearly identical machine code to the optimized C++. But unlike the C, it's not only simpler, the Halide version also makes it easier to add new stages, change the algorithm or change the organization. And it can compile not just to x86 but to ARM NEON code. With a few tweaks to the schedule, we get a GPU implementation -- actually, a whole range of GPU implementations. We have a bunch of other language features, which I'm actually going to skip.

Our current implementation is an embedded DSL in C++, and it uses LLVM for code generation. To compile a pipeline, we take the Halide functions and the schedule and combine those to synthesize a single loop nest and a set of allocations describing the entire fused pipeline for a given architecture. Then, after our own vectorization pass and other optimizations, we pass it to LLVM to emit vectorized x86 or ARM code, or CUDA kernels and graphs of CUDA kernels in the x86 code, including logic to launch and manage them. And in that last case, on the GPU, we can actually generate heterogeneous mixtures of vectorized multi-threaded CPU code intertwined with many different GPU kernels, depending on the structure implied by the schedule.

So to pop back up, the last question is how we can determine good schedules. So far we've worked on two ways -- I've alluded to this a couple times as it's come up. First, you can write them partly or completely by hand. When we built this model, we weren't ever expecting real people to write these by hand, but we wound up being surprised how far you can get with a teeny bit of syntactic sugar on top of it, and just how terse the schedules wind up being, even for fairly complicated programs. So it's a lot easier than traditional hand optimization, both because they're compact but also because you can't accidentally break the algorithm.
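As a concrete example of how terse these handwritten schedules are, here is roughly the schedule for the optimized blur as described above (the tile and vector sizes come from the talk; the commented-out GPU variant is only a sketch using current spellings of Halide's GPU scheduling calls, which have changed over the project's history, and its sizes are illustrative).

    Var xi("xi"), yi("yi");
    // CPU schedule: 256x32 tiles, 8-wide vectors across x, tile rows in parallel.
    blur_y.tile(x, y, xi, yi, 256, 32)
          .vectorize(xi, 8)
          .parallel(y);
    // Compute blur_x as needed per tile of blur_y, also vectorized across x.
    blur_x.compute_at(blur_y, x)
          .vectorize(x, 8);

    // A GPU variant changes only the schedule, not the algorithm -- e.g. mapping
    // tiles onto GPU blocks and threads:
    // blur_y.tile(x, y, xi, yi, 16, 16).gpu_blocks(x, y).gpu_threads(xi, yi);
    // blur_x.compute_at(blur_y, x).gpu_threads(x, y);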
The compiler synthesizes all the complex control flow and data management for you that's implied by what your schedule says. We found expert developers using this in places like Adobe and Google actually appreciate the control and predictability this gives them. Second, we can automatically search the space of schedules for a given pipeline on a given machine using auto-tuning. The dimensionality of this problem is large, so we've focused on stochastic search, but it works surprisingly well in practice.

Now, I'm going to present results from a few real computational photography and imaging applications we've built in Halide, which I chose because they represent a wide range of different program structures and levels of optimization in this domain. For all of these, I'm going to show the complexity of the best handwritten schedule, but in every case, the auto-tuner was able to match or beat that in some number of hours without any human intervention.

First, we looked at the bilateral grid, which is a fast algorithm for the bilateral filter. We implemented it in Halide and compared it to a pretty efficient but clean C++ version by the original authors. Our version uses a third of the code and runs six times faster on the same machine. Then with a different schedule, we can run on the GPU instead. There, we looked at the authors' handwritten CUDA version and started with a schedule that matched their strategy. That gave almost the same performance they had. But then we tried a few alternative schedules and found a slightly nonintuitive one that was twice as fast as their hand-tuned CUDA. So to the point earlier about why we're beating hand-tuned code: in every case, we're not doing anything you couldn't do by hand. It's just a question of making it much easier to search the space quickly. In this case, the ideal trade-off, even on a very parallel GPU, chose to sacrifice a little parallelism to improve locality during the reduction stage that computes the grid. And that made an enormous difference in the performance.

>>: Why didn't you [indiscernible] different choices [indiscernible].

>> Jonathan Ragan-Kelley: So all of the allocations in one of these pipelines are implicit in the schedule. They're basically implied by the store-at granularity, and the semantics are: when you're running on a conventional CPU-ish architecture, what we're basically doing is building a giant loop nest or set of loop nests that all run in one place, and so allocations everywhere all kind of mean the same thing. We have some optimizations for keeping stuff on the stack or doing small pool allocations if we know that something's constant-sized or whatever, but generally, memory is memory. In the case of running on the GPU, we have two separate memory spaces, so everything that's allocated externally to stuff that corresponds to a kernel launch is mirrored in both spaces and lazily copied. I mean, it's actually not eagerly mirrored. It's lazily mirrored, which you have to do if you're doing computations in both spaces. But we have semantics where allocations that correspond to granularity within the thread dimensions of your kernel launch correspond to local memory, and allocations at the block level correspond to shared memory. So you can effectively span the space of things you'd reasonably do.

>>: There are things on GPUs that are very important to optimization, like register pressure and memory bank conflicts.
>> Jonathan Ragan-Kelley: Memory bank conflicts aren't very important anymore, actually. You'd be surprised. Basically, they didn't use to be doing cached reads and writes in the general memory system. Now the cache is --

>>: [indiscernible].

>> Jonathan Ragan-Kelley: Yeah, and so actually, even on Fermi that got better. So basically, now, instead of throwing reads and writes directly at the memory system and hoping they cover all the banks, you read and write to cache lines, and the cache lines are sized to those transactions. It's the same way a CPU works.

>>: And what [indiscernible] sizes?

>> Jonathan Ragan-Kelley: So that's actually defined by the -- there's basically a mapping between what would have been, like, parallel loops when I showed you the loop synthesis, and carving off some set of those and saying those correspond to the block and thread dimensions of the GPU launch. And so the granularity at which you block corresponds exactly to the block sizes that get launched on the device.

>>: So suppose I work for a big company, and I have a problem that I run over and over and over again on thousands and thousands of machines. And maybe suppose that program is the one you just showed me. How do I know -- do you have any idea as to how -- like, I'm willing to wait, you know, have this thing run for weeks on end to find the best schedule. Do you have an idea as to how good a schedule you can get? I mean, you're searching for a needle in a haystack, so in all these problems, do you know, like, what's the throughput of the machine and how close --

>> Jonathan Ragan-Kelley: So we haven't -- for simple examples, it's relatively easy to build kind of a roofline-ish model. For the blur, that's relatively easy to figure out. Actually, being able to reason about bounds based on the fundamental structure of the program -- you know, what the intrinsic arithmetic intensity of different choices is, stuff like that -- is something that we're interested in doing but is not something we've done at all. I think this type of representation gives you a lot more view into that than just writing C that reads and writes memory willy-nilly. And that kind of view that I showed earlier, where you think about reuse distance and, you know, crossing dependencies and stuff like that, is a natural view from which to begin to get a more concrete sense of that kind of thing -- of what the fundamental limits on an algorithm are. But in practice, we haven't done anything yet.

So I'm going to skip this example; it's a helluva lot faster than even very good Matlab, not surprisingly -- not because Matlab is slow, but because of locality. And finally, in the local Laplacian filters example I showed earlier, the win here again wasn't because we emitted better code. It actually came from fusing different stages to different degrees and accepting some redundant computation in exchange for better locality. So our implementation didn't just make a few different optimizations beyond what the original developer found. It had to fuse and schedule each of the hundred stages differently to globally balance parallelism, locality and the total amount of work. And our strategy is actually substantially different, too -- we can talk offline about some of the interesting things we noticed. And then the strategy we wound up using on the GPU was completely different from that yet again.
So if you were rewriting this to explore these different choices, you would be tearing up and rewriting something on the order of 80 different imperfectly nested loops and all the allocations and boundary handling and everything along the way.

>>: So just anecdotally, kind of going back to the prior question: you get an order of magnitude range, as kind of the claims you're making, right, and you see those numbers bear out. But are most schedules around average, and a handful are better?

>> Jonathan Ragan-Kelley: So the whole space of things that we model includes a great deal of stuff that makes absolutely no sense -- introduces huge amounts of redundant computation, has crazily large intermediate allocations, other things like that. So I wouldn't say that most, in any statistical sense, are around average. The space is pretty spiky. We generally found, without putting any heuristics into the auto-tuner, it converged along kind of a smooth exponential decay kind of curve. You can make of that what you will. There may be sort of a normal distribution of performance in that space; I'm not sure. When we did this stuff by hand, we usually sampled a few -- we had a few hypotheses about points in the space, or strategies that might work, kind of tried them, twiddled some stuff around them, and usually tried about ten total things, and the variation among the best few is usually not that big. The variation overall was often an order of magnitude. But which were going to be best, we didn't know ahead of time. We were often totally wrong.

So that's our experience so far with Halide. To recap, again, the big idea is to explicitly decouple the definition of the algorithm from the organization of computation. I think the biggest thing here is that we designed this model of scheduling specifically to span the space of trade-offs fundamental to organizing computation in data parallel image processing pipelines. More than anything else, I think it's that model that's actually what lets this be possible. It's the unification and composability of that model that lets us promote it all the way to be visible to the user, unlike most compiler optimizations. It's the kind of orthogonality of that model that lets us span the whole space and walk through it automatically. And that's really the key to all this. Also, in this domain, and I think probably in others, it wound up being really important to put emphasis on the potential of redundant computation to break actual dependencies in an algorithm and minimize communication, because on modern machines, that actually pays for itself an impressive amount of the time.

So here, the end results were simpler programs that run faster than hand-tuned code while giving composable and portable performance -- composable and portable components that can scale all the way from ARM to massively parallel GPUs. We have a whole suite of these results on mobile phones, which I didn't show. Our implementation is open source under a, I think, permissive-enough-for-you-guys license -- an M.I.T. license -- and is being actively developed currently in collaboration with people at Google and Adobe. People are using it at both companies, and I can actually now tell you that every picture taken on Google Glass is processed on board by pipelines written and compiled with Halide. I have other projects to discuss, but we've spent a lot of time talking about other things.
So I think I'm going to leave it there. Thank you.

>>: Have you extended this to other things beyond just image processing?

>> Jonathan Ragan-Kelley: Yes. So I think of this not as a domain-specific language in the sense of having much to do with image processing itself, so much as the schedules in particular are, I think, a fairly complete parameterization of the ways you can interleave computation and storage for graphs of computations over rectangular, regular grids, which is basically what images are. And so I've done some, you know, fluid simulation and a few other things in this context. Like I said, the things that do all-to-all communication don't benefit from a lot of these types of reorganization. They benefit from things that, from this perspective, would actually be algorithmic reorganization, if you wanted to do an FFT or something like that. Doing that kind of algorithmic rewriting combined with schedule reorganization for approximate convolutions is actually something we're beginning to look at -- the extreme example being the GPU-efficient recursive filtering kind of idea. In this view, that's not just a change of order; it also takes advantage of associativity and linearity and so on to rewrite the algorithm, or what we would mean by the algorithm here.

And then also, among the other things I was going to touch on, we worked on applying some similar ideas outside of graphics in a different language. So this is not in Halide. We were using the PetaBricks language -- sorry, this is actually not a great slide to talk over; it was meant to be a quick summary -- which is built around autotuning to determine lots of similar sorts of choices. And in this context, we were working on mostly traditional HPC kinds of code. This is also mostly over regular grids. Their programming model is primarily focused on stuff that's data parallel in a relatively similar fashion. But the trade-offs explored here included not just the types of organization I discussed, but also algorithmic choice. So in particular, for things that naturally lend themselves to divide-and-conquer algorithms, there are often different algorithms you can use at different levels of the recursion, and in this case we were basically inferring a decision tree to decide what strategies to use, including different algorithmic choices.

>>: So does your vectorizer generate any shuffles?

>> Jonathan Ragan-Kelley: So the scheduling model doesn't express anything like that. But when I say we have our own vectorization, we're not just relying on an automatic loop vectorizer or something like that. We're directly constructing kind of SOA SIMD code, and then our back end actually does, you know, peephole-optimization-level instruction selection for --

>>: But do you, like, generate, like, intrinsic shuffling?

>> Jonathan Ragan-Kelley: Yes, we generate whatever has been necessary so far. We basically have a peephole, pattern-matching phase at the end that looks for common substructure in our vector IR and substitutes intrinsics for those. So we're not emitting machine code directly -- we're going through LLVM -- but we emit LLVM intrinsics, which basically say, I want this exact instruction to be selected here.

>>: I wonder, like -- you have some cases where the accesses are nonlinear, like two --

>> Jonathan Ragan-Kelley: Right. So we have a variety of optimizations in the different back ends.
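The peephole substitution just described might look roughly like the following toy sketch. This is illustrative only, not Halide's actual IR or pattern matcher: it rewrites min(a + b, 255) over unsigned 8-bit data into a single saturating-add node that a back end could map to one instruction.

    // Toy expression IR and a bottom-up peephole rewrite (illustration only).
    #include <memory>
    #include <string>
    #include <vector>

    struct Expr {
        std::string op;                              // "add", "min", "const", "saturating_add", ...
        std::vector<std::shared_ptr<Expr>> args;
        int value = 0;                               // used when op == "const"
    };
    using ExprPtr = std::shared_ptr<Expr>;

    // Rewrite min(add(a, b), 255) -> saturating_add(a, b), recursing bottom-up.
    ExprPtr peephole(ExprPtr e) {
        for (auto &a : e->args) a = peephole(a);
        if (e->op == "min" && e->args.size() == 2 &&
            e->args[0]->op == "add" &&
            e->args[1]->op == "const" && e->args[1]->value == 255) {
            auto fused = std::make_shared<Expr>();
            fused->op = "saturating_add";            // would map to, e.g., paddusb on x86
            fused->args = e->args[0]->args;
            return fused;
        }
        return e;
    }

    int main() {
        auto a = std::make_shared<Expr>();   a->op = "load_a";
        auto b = std::make_shared<Expr>();   b->op = "load_b";
        auto sum = std::make_shared<Expr>(); sum->op = "add"; sum->args = {a, b};
        auto c = std::make_shared<Expr>();   c->op = "const"; c->value = 255;
        auto m = std::make_shared<Expr>();   m->op = "min";   m->args = {sum, c};
        return peephole(m)->op == "saturating_add" ? 0 : 1;
    }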
This type of code is actually a large subset of exactly what both SSE and the [indiscernible] were designed for, so there are a lot of esoteric instructions that actually wind up being useful in different contexts. And dealing with strided loads, or packing and unpacking stuff, is one of the common contexts where we try to make sure that the front end and the main lowering process will always generate relatively common patterns that we can pick up in the back end and replace with specialized ops.

>>: Have you thought at all about run-time fusion and scheduling? You can think of a variety of reasons to want to do that. For example, if I write an app, I want to know that it works well on the current generation and any generation [indiscernible].

>> Jonathan Ragan-Kelley: Yes.

>>: Another example is like the Photoshop layers list, where the user is in charge of the composition and you still have [indiscernible] run time.

>> Jonathan Ragan-Kelley: I think of those as being different, actually. In the first case, I don't think you need run-time compilation. You just need what you'd think of as sort of load-time compilation. So performance portability probably requires -- well, for one thing, one of the big reasons why we have this whole idea of scheduling split from the algorithm definition is that our assumption is that the choices we're making in the organization are not going to be performance portable, that you're going to want a different organization on different machines. So you need to have some way of deciding what the best organization is on some future machine, but that compilation doesn't need to happen with amazingly low latency, because it's not in the UI loop. It happens once, when the user boots the program on a new machine or something.

The latter context is a bit harder, where you're actually respecializing programs, depending on -- well, so you're potentially rescheduling, or the more extreme case, which I worked on in the context of visual effects, is that you're actually specializing the computation based on knowledge that certain parameters are static, or other things like that, and trying to fold it in different ways. And in this context, we were basically trying to accelerate lighting preview. People are only changing a few parameters of a 3D rendering, so we automatically analyzed the program and figured out what was changing and what wasn't. The vast majority wasn't. We could auto-parallelize the stuff that was. We had all kinds of aspirations about adapting to individual parameters being edited one at a time, and it turned out that dealing with that was monstrously more complicated and much higher overhead than just doing a fairly good job specializing for the common case.

I think if you look at something like Lightroom, it's kind of halfway in between the general Photoshop layers, or, you know, compositing-graph kind of case and the totally static case. They have a bunch of fast paths that they hand-code right now, where -- you know, they have, say, a half dozen fast paths for preview or final res, with or without a few major features turned on or off, like whether you do or don't have to do the scale decomposition to do all the local tone adjustments, or whether or not you've cached the results of demosaicing. And I think picking a handful of fast-path optimizations compiled ahead of time can be extremely fruitful, because then you're just quickly choosing among a few places to jump to.
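A minimal sketch of that ahead-of-time fast-path idea, with entirely hypothetical function names standing in for precompiled pipeline variants; the point is simply that run time picks among a few places to jump to, rather than invoking a compiler in the interactive loop.

    #include <cstdint>

    // Stubs standing in for a handful of ahead-of-time compiled pipeline
    // variants (e.g. separate AOT builds); the names are hypothetical.
    void full_pipeline(const uint16_t *, uint8_t *, int, int) {}
    void preview_no_local_tone(const uint16_t *, uint8_t *, int, int) {}
    void preview_cached_demosaic(const uint16_t *, uint8_t *, int, int) {}

    struct RenderFlags {
        bool preview = false;          // interactive preview vs. final render
        bool local_tone = true;        // local tone adjustments enabled?
        bool demosaic_cached = false;  // demosaic result already cached?
    };

    // Run-time dispatch among precompiled fast paths -- no compiler in the UI loop.
    void render(const RenderFlags &f, const uint16_t *in, uint8_t *out, int w, int h) {
        if (f.preview && !f.local_tone) {
            preview_no_local_tone(in, out, w, h);
        } else if (f.preview && f.demosaic_cached) {
            preview_cached_demosaic(in, out, w, h);
        } else {
            full_pipeline(in, out, w, h);
        }
    }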
But injecting the compiler into the middle of the overall process is at least a hassle and often hard. Even if 80 percent of the time it makes things go better, having nasty hiccups 20 percent of the time is often not worth it. That's sort of how I'd summarize my experience with that kind of thing. One thing you can do with autotuning, and people have done this, is try to autotune over a large suite of tests, producing more than one answer, so you can say: I want the five schedules that best span this space, that optimize best overall for this large set of benchmarks. And if, into that set of benchmarks, you include the different kinds of patterns you might specialize against -- you know, just running quick previews with certain things off, versus having everything turned on -- that's a much easier context in which to automate that kind of thing, I think.

>>: So the trade-off we make a lot [indiscernible] code is precision versus --

>> Jonathan Ragan-Kelley: Yes.

>>: And there's a whole bunch of [indiscernible] vector instructions. Have you looked at, experimented with that at all?

>> Jonathan Ragan-Kelley: Yeah, so one of the things we thought about early on was our native types. I didn't show you any types anywhere, because everything's inferred -- it's all just inferred using basically C-style promotion rules, or you can use explicit casts, from whatever your input types are. So we do a lot of computation on, like, 8- and 16-bit fixed-point numbers in many of the pipelines we use. We did work hard on the vector-op pattern matching for those types, because that's a context in which there are a lot of specialized operations that can be really useful -- saturating adds and other weird things like that. Early on, we were thinking that we might want these programs to be parameterized over some space of choices on precision, and even be able to build in potentially higher-level knowledge of, like, fixed-point representations, to handle that kind of thing automatically. We wound up deciding to keep it simple and keep it to the semantics of basically the kind of stuff you'd write by hand in something like C today. But in the long run, I think that would be a useful higher-level front-end feature that would both give more flexibility to the compiler and make it easier to explore that space by hand.

>>: [inaudible].

>> Jonathan Ragan-Kelley: So for vectorization, the way we express it -- remember that we were synthesizing loops over the domain of a given function. We have some rectangle that we have to compute, and we're synthesizing loops that cover that. Our model of vectorization is that some of those loops, or a split subset of one of those loops, gets computed in vector strides of some constant width. So I say: I want the x dimension to be vectorized in strides of eight. If that one function includes a bunch of different types, the back end might be cleverly selecting different-width ops -- doing, you know, two sequential ops on 32-bit numbers while doing only one op on 16-bit numbers, if we have a 128-bit vector unit. Usually what we wind up doing is vectorizing to kind of the maximum width that any of the types flowing through that function have, if that makes sense. I can show you on a whiteboard. But all our general operations are agnostic of the underlying vector types.
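As an illustration of "vectorize the x dimension in strides of eight" and of the explicit-cast, narrow fixed-point style described above, here is a small sketch in current Halide syntax. The averaging pipeline itself is invented for this transcript, not an example from the talk.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        // Types are inferred from the inputs; widening is done with explicit
        // casts, C-style. Here: average two 8-bit images in 16-bit arithmetic.
        ImageParam a(UInt(8), 2), b(UInt(8), 2);
        Var x("x"), y("y"), xo("xo"), xi("xi");

        Func avg("avg");
        avg(x, y) = cast<uint8_t>((cast<uint16_t>(a(x, y)) +
                                   cast<uint16_t>(b(x, y)) + 1) / 2);

        // "Vectorize x in strides of eight": split x by 8 and map the inner
        // loop to vector lanes. The width is chosen per function, not per op;
        // the back end selects native instructions for each type.
        avg.split(x, xo, xi, 8).vectorize(xi).parallel(y);

        avg.compile_jit();
        return 0;
    }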
So if you express a 256-bit vector op, like adding two eight-wide float vectors together, and you're on SSE or NEON, that will get code-generated into a sequence of two ops. You can think about it as just unrolling the regular vector loop. Those kinds of details wind up being taken care of in the back end's instruction selection. But you're not specifying, like, per op how wide you want to do it. It's just per function.

>>: Along those lines, I'm curious: how much of the performance improvement you're getting is a function of your peephole optimizations versus [indiscernible]?

>> Jonathan Ragan-Kelley: I think the big win -- so if we throw in schedules that are equivalent to simple C, the big wins are from restructuring the computation. We're mostly relying on LLVM for register allocation, really low-level instruction selection and low-level instruction scheduling.

>>: [inaudible].

>> Jonathan Ragan-Kelley: Not until now. LLVM 3.3, which came out like two days ago, is the first time they have any significant vectorization. Well, there are multiple answers to that depending on what you mean. In terms of loop auto-vectorization or superword-level parallelism stuff, that's only out literally right now, and we've never built on any of it. Depending on the back end -- especially in the case of NEON -- they actually do a pretty good job of the same kind of peephole optimization that we're doing, where they recognize a sequence of, you know, simple add, mul, something-something and recognize that it can be fused in a magic way into a magic NEON op. So generally, the way you target weird ops in NEON is to figure out what their magic pattern is and use it. But that's really only true for ARM, in our experience. On x86, they don't make any effort to do clever instruction selection for you. If you're multiplying two single-precision float four-vectors, they'll give you a mulps, and other than that they'll just do the stupidest, simplest thing you can possibly imagine. So we have to be much more aggressive with the peephole optimization on x86. But I think that's mostly just the last factor of two, that kind of stuff -- making sure it doesn't do something stupid, that it really is vectorizing by the width you told it to all the time, and not spilling everything to the stack and, you know, packing and unpacking the vectors between every op, that kind of thing.

Thank you.
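As a footnote to the point above about an eight-wide op lowering to a sequence of two native ops on a 128-bit machine, here is the hand-written SSE-intrinsics equivalent of that unrolling -- a conceptual illustration, not literal compiler output.

    #include <xmmintrin.h>   // SSE intrinsics

    // An "eight-wide" float add expressed on 128-bit SSE: one logical vector
    // op unrolled into two 4-wide adds, the kind of lowering the back end's
    // instruction selection performs automatically.
    void add8(const float *a, const float *b, float *out) {
        __m128 lo = _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b));
        __m128 hi = _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4));
        _mm_storeu_ps(out,     lo);
        _mm_storeu_ps(out + 4, hi);
    }

    int main() {
        float a[8] = {0}, b[8] = {0}, out[8];
        add8(a, b, out);
        return 0;
    }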