>> Madan Musuvathi: Hi, everybody. I'm Madan Musuvathi from the Research in
Software Engineering group, and it's my pleasure to introduce Jonathan
Ragan-Kelley today to give a talk here. His expertise is in high performance
graphics, but his interests span multiple disciplines: systems, architecture,
graphics and compilers.
It's actually very nice to see graduate students working in multiple areas and
excelling in all of them. So he's -- he's almost graduated from M.I.T. He did
his Ph.D. with Frédo Durand and Saman Amarasinghe, and he is headed to a
post-doc and we'll very soon know where that is.
So Jonathan?
>> Jonathan Ragan-Kelley: Thank you. Thanks for coming and thanks for having
me. So this is going to be a kind of overview of a lot of different research
I've done, but most focused on Halide, which is a programming language for high
performance imaging, which some of you may be familiar with, which is the focus
of my thesis.
So graphics and imaging are everywhere today. And not just in games, movies
and cameras, where we traditionally focus in the SIGGRAPH community, but also
things like 3D printing, medical imaging, mapping and satellite-based science
and even high throughput gene sequencing and cell biology.
In my research, I build compiler systems and algorithms for these sorts of data
intensive applications in graphics and imaging. And specifically, I've worked
on rendering, high performance imaging and 3D printing. And I'll touch on all
of these, but I'm mostly going to discuss Halide again, which is a language and
compiler [indiscernible] processing.
In this talk, I'll show that the key challenge is to reorganize or restructure
computations and data, to balance competing demands for parallelism, locality
and avoiding redundant or wasted work. And I'll show that by changing how we
program and build systems to focus on this sort of reorganization, we can
enable simpler program code to run an order of magnitude faster on existing
machines and to scale on the highly parallel and bandwidth constrained
architectures Moore's law is giving us.
I'll show an example of how we can deliver 20 times the performance of everyday
C, or double the performance of a hand-tuned, parallelized, vectorized imaging
pipeline, optimized over three months by an expert working on Photoshop, all
while using 25 times less code that we wrote in a single day.
And I'll show how we can generate a completely different organization on a GPU,
delivering several times more performance without rewriting a single line.
But first, why do we care about performance in this domain in the first place?
I'd argue that graphics and imaging applications today are still orders of
magnitude from good enough. So take light field cameras. By capturing and
computationally reintegrating a 4D array of light, you can do things like
refocus photographs after they're taken, correct aberrations in the lens or
extract high quality depth from a single shot.
But if we wanted to produce eight megapixel 4K light field video at 60 frames a
second, current software would take about an hour for each second of light
field video.
In rendering, even using the same GPUs used to play PC games, the movie frame
is still six orders of magnitude from real time. Multi-material 3D printing
has huge potential to transform manufacturing, but to directly print production
quality objects, we need about a trillion voxels in something the size of a
tennis shoe. And the printing hardware is almost ready to do this, but it's
just a ton of computation and data. So in this case, if we wanted to finish
ten shoes an hour, we'd need to synthesize about four billion voxels a second
just to keep the printer busy.
At the same time, always-on image sensors are going to be all around us, but we
need to figure out what to do with all the data they produce. So just sending
it to the cloud isn't going to be enough, since the best cellular radio we
could build would need at least a thousand times more energy to send each frame
than it took to capture it.
These days, most complex sensing uses imaging somewhere under the hood, from
high throughput gene sequencing to automated cell biology experiments, to
neural scanning and connectomics. For example, today, it takes a large
supercomputer just to process the scan of a single cubic millimeter of mouse
brain neurons.
And really, lots of what I'm going to say in the rest of this talk applies to
almost any data intensive application, not just those involving pixels.
So the good news is Moore's law is continuing to give us exponentially more
computational resources. As we've been told for the past decade, one of the
big challenges we face will be exposing more and more parallelism for this
hardware to exploit. Superficially, this seems easy, since all these
applications I showed are enormously data parallel.
But the real challenge here is that all these millions of data parallel
computations aren't independent. They need to be able to communicate with each
other. And the same hardware trends that have pushed us from big uniprocessors
to lots of parallel cores have made communication and data movement
dominate the cost of computation as a whole.
And by communication, clearly I don't just mean over networks. I even mean
inside a computer or inside a single chip.
And as a result of that, locality or moving data around as little as possible
is at least as big a problem for future programs as parallelism. So to put
this in perspective, today, relative to the energy cost of doing some
arithmetic operation on a piece of data, loading or storing that data in a small
local SRAM like a cache can be several times more expensive.
Moving the result ten millimeters across a chip is an order of magnitude more
expensive. And moving it to or from off-chip RAM is three or four more orders
of magnitude more expensive than computing the value in the first place. And
this disparity between local communication -- computation and global
communication is only getting bigger over time.
So because of this, it can often be most efficient to make surprising
trade-offs, doing things like redundantly recomputing values that are used in
multiple different places instead of storing and reloading them from memory.
And this is the first key message I want you to take away from this talk, that
parallelism, locality and the total amount of work we do interact in
complicated ways. They often need to be traded off against each other to
maximize performance and efficiency of a particular algorithm on a particular
machine.
Now, the architecture community is well aware of this, and they're trying to
exploit locality and parallelism to improve energy efficiency in their hardware
designs.
On the software side, I think we've long recognized the need to expose
increasing parallelism in applications, but as a whole, these challenges are
still acute in how we design software. Again, especially locality, at least as
much as parallelism.
So given these pressures, where does performance actually come from? We
usually think of it in terms of two factors, the hardware and the program
running on it. But the competing pressures on locality, parallelism and
redundant work depend heavily on how the program is mapped to the hardware.
So the second key message I want you to take away from this talk is that
because of this, I think it's useful to actually think of the program as two
separate concerns, and this is motherhood and apple pie for compiler and
programming language people, I think. But still, I'm going to try to emphasize
it anyway.
It's useful to think of the program somewhat separately as the algorithm or the
fundamental operations which need to be performed, and the organization of
computation, including where and when those operations are performed and where
their results get stored and loaded. My work focuses specifically on enabling
and exploiting the reorganization of computation. And secondary to that, its
interaction with algorithms and hardware.
In this talk, I'll show that the organization of computation is key to the
things we need to optimize performance and efficiency, but it's challenging
because of the fundamental tensions in between them.
But ultimately, if we want to keep scaling with Moore's law, I think we need to
treat the organization of computation as a first class issue. Now, these same
themes have come up throughout my research, and I'm going to touch on them in
the context of six different projects I've done in grad school, but I'm mostly
going to focus on Halide.
So Halide is a programming language and compiler for modeling and exploiting
this exact set of trade-offs in image processing pipelines. This is joint
work with a number of other people at M.I.T. and elsewhere, but most
significantly Andrew Adams, who has been my day-to-day collaborator on this
project at M.I.T. since finishing his Ph.D. at Stanford.
And there are papers covering different parts of this, both at SIGGRAPH last
year and then the one I just presented a couple days ago at PLDI here.
So to see what I mean by organization of computation, I think it's useful to
look at an example. If you want to do a simple 3 by 3 box filter as two 3 by 1
passes, we can write simple C++ code like this. So it's just a series of two
sets of loops over the image, where the first computes a horizontal 3 by 1 blur
and stores it in a temporary buffer, and the second computes a vertical 1 by 3
blur of that to produce the output. So here, I'm assuming some slight operator
overloading to hide some of the multiplication in the indexing, but these are
basically just raw arrays.
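To make that concrete, here is a minimal sketch of the two-pass blur being
described; the Image type with operator() indexing is an assumption standing in
for the slide's overloaded array wrapper, and boundary handling is omitted.

    void box_blur_3x3(const Image &in, Image &blurx, Image &out) {
      // First pass: horizontal 3x1 blur into a temporary buffer.
      for (int y = 0; y < in.height(); y++)
        for (int x = 0; x < in.width(); x++)
          blurx(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;

      // Second pass: vertical 1x3 blur of that to produce the output.
      for (int y = 0; y < in.height(); y++)
        for (int x = 0; x < in.width(); x++)
          out(x, y) = (blurx(x, y - 1) + blurx(x, y) + blurx(x, y + 1)) / 3;
    }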
Now, what if we exchange the loops in each loop nest? Most existing compilers
would say these are the same algorithm, since the loop reordering is easy to
understand: the operations within the loops are independent of each other.
This is something most compilers can easily do today. It turns out it's also
important, as trivial as it is: because of poor locality, the column major
order is actually 15 times slower than the row major version on a modern
machine.
This is a trivial reorganization, again something that most compilers can
easily do already. But our focus with Halide is to take this to the next
level, starting from what expert programmers currently have to do by hand. And
the key thing is going to be understanding the more complex dependence across
the two sets of loops, not just within each one.
So this is a hand optimized version of that same function. It's parallelized,
vectorized, tiled and fused. We had to restructure all the loops and introduce
redundant computation on the tile boundaries to decouple them, and we had to
change the data layout to make this most efficient.
It's a complete mess, given that all we're doing is averaging together 3 by 3
pixels. But it's another order of magnitude faster. Near the peak throughput
for this machine. So in this talk, I'm going to deconstruct what this
optimized version actually does.
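For reference, here is a heavily simplified sketch of the kind of restructuring
that optimized version does, under stated assumptions: OpenMP instead of
explicit threads, no SIMD intrinsics, illustrative tile sizes, and width and
height assumed to be multiples of the tile size.

    #include <algorithm>
    #include <vector>

    void blur_tiled(const float *in, float *out, int width, int height) {
      const int TW = 256, TH = 32;                    // illustrative tile size
      #pragma omp parallel for                        // strips across threads
      for (int ty = 0; ty < height; ty += TH) {
        std::vector<float> tmp((TH + 2) * TW);        // per-tile blur_x scratch
        for (int tx = 0; tx < width; tx += TW) {
          // Redundantly compute blur_x for TH + 2 rows: the tile plus a
          // one-row halo above and below, clamped at the image border.
          for (int y = -1; y <= TH; y++) {
            int iy = std::min(std::max(ty + y, 0), height - 1);
            for (int x = 0; x < TW; x++) {
              int x0 = std::max(tx + x - 1, 0);
              int x2 = std::min(tx + x + 1, width - 1);
              tmp[(y + 1) * TW + x] =
                  (in[iy * width + x0] + in[iy * width + tx + x] +
                   in[iy * width + x2]) / 3.0f;
            }
          }
          // Immediately consume it: blur_y over the same tile, then reuse tmp.
          for (int y = 0; y < TH; y++)
            for (int x = 0; x < TW; x++)
              out[(ty + y) * width + (tx + x)] =
                  (tmp[y * TW + x] + tmp[(y + 1) * TW + x] +
                   tmp[(y + 2) * TW + x]) / 3.0f;
        }
      }
    }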
To do that, first I'm going to sketch a theory of this problem and its
challenges and then I'll show how we apply this to build a model from which we
can actually generate code. Now, this kind of optimization is hard, both for
people and for compilers. Because getting parallelism in locality requires
transforming program and data structure in very deep ways. But to begin with,
it's not always obvious what transformations are legal, and even when we know
something is legal, it's hard to know whether it's a good idea. So again, the
things we want to optimize are often at odds with each other, and the optimal
balance is usually very subtle and complex.
So just making a single heuristic guess, like most compiler optimizers do, we
might actually wind up making our program ten times slower rather than faster.
And critically, libraries of optimized code don't solve this either. Take
Intel Performance Primitives, for example: it's extremely fast at individual
operations, but no matter how much they tune the assembly loops in the Math
Kernel Library, the individually optimized kernels still compose into
inefficient pipelines, since they can't interleave stages for locality across
function boundaries.
I believe the right answer is to explicitly decouple the definition of the
algorithm from the organization of computation which we're going to call the
schedule. So in Halide, the algorithm defines what values are computed for
each pipeline stage while the schedule defines where and when they get
computed. This makes it easier for the programmer to write algorithms, since it
strips out lots of unnecessary details, and makes it easier to compose small
fragments of algorithm into larger pipelines. It makes it easy for either a
compiler or programmer to specify and explore the space of possible
optimizations, and it leaves the back-end compiler free to do the complex but
deterministic work of generating fast code that implements a pipeline, given a
defined schedule.
So ultimately, that will let us do the same thing as the hand optimized version
with just these four lines of code or just the top two if we use auto tuning to
infer a good schedule. So this example and the work I'm going to focus on is at
the intersection of stream programs and stencil computation, but it also draws
heavily on ideas from loop optimization, parallel work scheduling, region-based
languages and even what I have here is just a tiny sample.
I think relative to most prior languages and compilers, the two biggest
differences are that we focus on treating the organization of computation as a
first class part of the front end programming model, and we don't just optimize
parallelism and locality, but we also focus on the potential of introducing
redundant computation to optimize the overall organization.
So for this type of data parallel application that I've focused on, I think
it's often easier to understand the organization if we look at it as a
pipeline. And for the simple blur example, the pipeline looks like this, and
I've reduced it to 1D, but all the problems are still the same.
The input image is at the top, flowing down to the blurX and blurY stages
below. It's just corresponding to the two loops we have. I'm going to use
this view to introduce a theory of the types of reorganization that we do in
Halide first. So again, what's most significant here is not what happens
within each stage, but what happens across the two stages.
The first versions we looked at executed each stage breadth first, which means
they compute every pixel in the first stage before computing anything in the
second.
That means the first stage produces lots of intermediate data before it ever
gets used in the second. As a result, it has to be sent far away and written
to memory before computing the next stage that has to slowly read it back in.
This is fundamentally because the way we've organized computation here has poor
locality. So once we compute a given value in the first stage, we have to
compute lots of other unrelated values before we ever need to actually use it
in the second.
In this view, once we compute a value in the first stage, again, we have to
compute lots of other unrelated stuff before we ever need to use it. So the
locality is a function of the reuse distance along this red path we take
through the computation.
So we can improve locality by interleaving these two stages instead. I'm going
to start with a cartoon version, ignoring the dependencies between the stages
and we'll take care of the final details later. So in the interleaved
organization, we first compute a small piece of blurX, then immediately send
the results down and compute the corresponding piece of blurY.
After we're done, we can throw away the intermediate result and move on to the
next piece. And then we repeat the same process to compute the two stages on
the next part of the image.
Repeating this interleaving over the whole image improves producer consumer
locality. And it does this by reducing the reuse distance between where values
get produced in the first stage to where they get consumed in the second. And
in practice, this means they can be kept nearby, like in registers that are
[indiscernible] cache instead of having to move long distances to and from main
memory.
And this reorganization of computation is often called fusion, because we've
merged the computation of the two stages together. And in practice,
this makes optimization a global problem of carefully interleaving the
computation of the whole pipeline. So again, I want to emphasize that you
cannot address locality just by locally tweaking operations in the inner loops.
Now, fusion is something that the compiler community's done for decades, but
this type of pipeline is actually more complex than most of what people
traditionally do with what they call fusion. So to understand how we can
actually fuse a given computation, we have to look at the dependencies. And
these are the details I was ignoring in the cartoon version a second ago.
These dependencies are a fundamental characteristic of the algorithm we've
defined. And in this case, looking at one pixel in blurY, it depends on three
neighboring pixels in blurX. But its neighbor depends on an overlapping set of
pixels.
This means that the pieces or tiles that we wanted to interleave during fusion
actually depend on overlapping tiles further up the pipeline. And we usually
want to decouple the execution of tiles both to minimize reuse distance across
them and to execute them in parallel.
But to do this, we have to break the dependencies between them. And splitting
them up means we need to duplicate the shared work, the shared computations
along the boundary, which introduces redundant work. This is, again a key
trade-off that experts make all the time when hand optimizing this sort of
pipeline but which most existing compilers can't consider.
So by removing that dependency, we can execute in parallel both within each
tile of each stage and across the tiles. So now we can independently compute a
tile at the first, send it down, and compute parallel tiles at the second from
that. Then each of these independent computations can throw out their
intermediate data, moving on to the next tile that they need to process.
So we not only have parallelism here, we also get good locality from the short,
fixed reuse distance within each of these tile computations. But to get there,
we had to make a trade-off. We had to introduce redundant work to break the
dependencies in the algorithm and enable this overall reorganization.
In this case, that's the right choice, but it's not always going to be. So
this blur we're looking at is just a trivial two-stage pipeline, and something
more realistic looks like local Laplacian filters on the right. And actually,
even that is a heavily simplified cartoon of the local Laplacian filters graph.
Local Laplacian filters has about a hundred stages connected in a complex graph
and locally, most of the dependencies are simple stencils, like in our blur,
but globally they're extremely far-reaching and complex. And overall, this
gives us lots of choices about the organization and degree of fusion at each of
dozens of stages throughout the whole graph.
Each strategy can mean a totally different set of dozens of kernels or loop
nests, completely restructuring all the code, and the difference between two
plausible strategies can easily be an order of magnitude.
So to put this in concrete perspective, local Laplacian filters is used heavily
in Adobe's camera raw pipeline, which is a major feature of both Photoshop and
Lightroom. Adobe's original version was written by one of their best
developers in about 1500 lines of C++. It's manually multi-threaded and
hand-coded for SSE and it took him about three months to implement and
optimize.
Just like our blur example, the optimized version is about ten times faster
than a clean version written in about 300 lines of C. But then last summer, I
took a single day to rewrite their version of local Laplacian filters in Halide
and integrate it into Lightroom. It took 60 lines of code to express the same
algorithm and ran twice as fast as their version on the same [indiscernible]
machine.
At the same time, Adobe was starting to investigate GPU acceleration, but they
were constrained by the cost of potentially rewriting tens of thousands of
lines of optimized code across this [indiscernible] code base. But on that
same day, and actually this number is now 9X with a slightly different
schedule, on that same day, just by changing the Halide schedule, we also
generated mixed CPU, GPU code that ran seven to, now, nine times faster than
the original eight-core CPU version. And the best schedules on the different
architectures here optimized parallelism and locality in different ways. So
this wasn't just a translation. This was a totally different organization of
the computation.
>>: [indiscernible].
>> Jonathan Ragan-Kelley: It was a few hours of each and a few hours of
integration. I was actually partly drawing on some existing related stuff we'd
done, so it might have taken a day and a half if I'd done it entirely from
scratch.
>>: [indiscernible] on the back end? I mean, at the high level, do you see
that generating code for GPU so now the like locality is kind of different?
>> Jonathan Ragan-Kelley: So the model we have for expressing the space the
choices is exactly the same, and there's a one-to-one mapping between the way
those choices work out in one back end and the way -- kind of what they mean in
another. But the best choices are going to be different.
And I'll get to this later, but we don't have any heuristic optimizers going on
here, so there's -- so far, all of these, everything I'm going to show is
either handwritten schedules or stochastic search over the space. So there's
no awareness of whether locality might matter more or less in different
contexts. It's just purely empirical. It's based on what runs fast.
>>: Are you going to talk about what sort of things Halide did that the expert
programmer couldn't do or didn't do?
>> Jonathan Ragan-Kelley: Yeah, I'll come back to that in the end a little bit
more at least. In something this complicated, the answer is a lot of tiny
stuff and it's actually hard to fully even tease it apart. It's like basically
a different giant set of several dozen loops or kernels. But yeah, I will come
back to it in the end.
So this brings us back to the first message that I gave you, which was that
optimizing performance requires fundamental trade-offs. So we saw there's a
tension between the locality or reuse distance and introducing redundant work
to break long dependency chains in an algorithm.
We can trade off along this axis by varying the granularity by which we
interleave these stages, so we can use smaller tiles for better locality or
bigger tiles to minimize the fraction of redundant computation along the tile
boundaries.
It turns out there's also a tension between these and parallelism, which requires
independent work both within and across stages. My second message was that
these trade-offs were determined by the organization of computation. So we've
seen that changing the order and granularity of computation can have a huge
impact on performance. Again, in this visualization we've been using, these
aspects of organization are given by the red path we take through computation,
and the algorithm defines the dependencies between the individual tasks.
This leads to my last major message which is that the ways we can reorganize
this graph and the effect each choice has on parallelism, locality and the
total amount of work we do is a function of the fundamental dependencies in the
algorithm. And this is also one of the places where domain specific knowledge
can make the biggest difference.
So putting it all together, I think really what I've been showing you is
actually a very general problem formulation. We basically just have a task
graph. We have pixels or per pixel computations as the tasks themselves.
They're linked by their dependencies, which are encoded in the algorithm, and
then we can organize the computation by choosing the order in which we traverse
the graph or the task schedule. We can change the schedule either to minimize
reuse distance for locality or to maximize the number of independent tasks for
parallelism.
And then finally, we can also break dependencies and split the graph, but this
introduces redundant computation on the boundaries. So we have all these
choices. Most traditional languages deeply conflate the underlying algorithm
with these sorts of choices about how to order and schedule its computation.
So back to the optimized blur again. It's completely different than the
original. The code uses different instructions, different data layout and
deeper loop nests. It shares almost no code with the original version. But
all that's happened here is this sort of reorganization I just showed to
optimize for parallelism and locality.
And specifically, it's mostly doing that sort of tile level fusion. That was
the main thing that I just teased. So the parallelism here comes from
distributing the work across threads and from computing in eight wide SIMD
chunks within those threads, just like you might imagine.
But again, exposing that parallelism in a problem like this is actually far and
away the easy part. Just as important and often a lot harder to think about or
express is the locality through the pipeline. Without this kind of locality
optimization, even the most well parallelized pipeline is still severely
bottlenecked on memory bandwidth.
And here, the optimized code improves locality by computing each stage in
tiles, interleaving the computation of tiles across stages and keeping the
intermediate data in small local buffers that never leave the cache. It has to
redundantly recompute values on the boundary between intermediate tiles to
decouple them.
Now, as programmers, we often think of this as a different algorithm from the
original. But just like the simple loop interchange I showed at the beginning,
I'd argue it's more useful to think of them as the same algorithm, but where
the computation's just been reorganized. I don't expect everyone to agree
exactly with my choice of the word algorithm here, but I do hope that you can
see that the separation of concerns is useful.
Then for a given algorithm, we want to find the organization that optimizes
performance and efficiency by making the best overall use of parallelism and
locality while minimizing the total amount of work that we do.
The compiler community has done heroic work to automatically understand and
reorganize computations like this coming out of, you know, an existing language
like C. But it's fundamentally a challenging problem. Again, Halide's answer
here is to make the problem easier by explicitly decoupling the definition of
the algorithm from the organization of computation, which we call the schedule.
So the algorithm defines what values get computed for each pipeline stage and
the dependencies in between them, while the schedule defines where and when
they get computed.
So once we strip out the concerns of the scheduling, the algorithm is defined
as a series of functions from pixel coordinates to expressions giving the
values at those coordinates. The functions don't have side effects, so they can
be evaluated anywhere in an infinite domain, and the required region of each
stage gets inferred by the compiler.
The execution order and storage are all unspecified so the points can be
evaluated or re-evaluated in any order. They can be cached, duplicated, thrown
away or recomputed without changing their meaning. So the resulting code looks
like this for the simple 3 by 3 blur. The first stage, blurX, is just defined
so that at any point (x, y) it is the average of three points of the input, and
then blurY at any point (x, y) is the average of three points in blurX. And
notice that we don't have any bounds over which these are being computed.
These are functions over an infinite range of free variables. So this is
basically a field over a 2D grid.
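In Halide's C++-embedded syntax, that algorithm looks roughly like this (a
sketch: Func and Var are Halide types, and input stands in for an input image
or an earlier stage).

    #include "Halide.h"
    using namespace Halide;

    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y");
    // Each stage is a function over an infinite 2D grid of (x, y).
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;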
The specific scope of this programming model is what lets us have lots of
flexibility in scheduling. So first, we only model functions over regular
grids, up to four dimensions in the current version. Second, we focus on
feed-forward pipelines, which is a good fit for this domain, but not for
everything. We can express recursive functions and reductions, but these have
to have bounded depth at the time they're invoked. These last two points
actually mean that our programming model is not technically Turing complete,
because we would need potentially infinite sized pipelines to express
arbitrarily complex computations.
Also, because of the fact that as you saw from the little snippet of code for
the blur algorithm, there aren't any bounds or sizes or regions that we're
computing here. To work well, we need to be able to accurately infer the
dependence patterns between different stages. And for most patterns, our
analyses are both general and fairly precise. But in some cases, the
programmer might need to explicitly clamp an index expression to a reasonable
range to ensure we don't allocate too much intermediate buffering.
Finally, our problem domain's different than most traditional work on stencil
computations in scientific computing because where those usually deal with
thousands of repeated reapplications of the same stencil to a single grid, our
applications are often -- are usually large graphs of heterogeneous computation,
often over a hundred different stages or more, with complex and varying
dependence between them.
And we potentially need to schedule all those stages differently.
Yeah?
>>: Just curious, do you let the user know -- the compiler -- when you can't
infer tight bounds?
>> Jonathan Ragan-Kelley: So the only context -- we can infer good bounds
through everything except for stuff that's like just fetching an arbitrary
memory address. In that case, the bounds are just the bounds of the type. We
can throw a warning there. We don't currently. Usually, you'll run into the
fact that that will allocate too much memory instantly the first time you try
to run it. That's actually an easy thing to check. Then we have built-in
clamping expressions in the language that you'll frequently want to use anyway,
at least at the very beginning. Generally, you wind up writing clamp to edge
on your inputs.
There are a few other things like that, and then through the whole rest of the
pipeline, you don't bother with thinking about boundaries, and little, you
know, fringes of necessary access are added if needed, but everything kind of
carries through fine.
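As a hedged illustration of that clamp-to-edge idiom, using Halide's built-in
clamp (raw, width and height are assumed names here):

    // Wrap the raw input so any out-of-range access reads the nearest edge
    // pixel; downstream stages then never need their own boundary handling.
    Func input("input");
    input(x, y) = raw(clamp(x, 0, width - 1), clamp(y, 0, height - 1));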
>>: If you do something like a matrix transposition, then the [indiscernible]
for the destination but for the sources?
>> Jonathan Ragan-Kelley: Yes. So if you look at how the algorithms were
written, everything is written from the perspective of the output point. So
one stage basically says, you know, you're not iterating over some domain and
reading from somewhere and writing to somewhere. You're saying, at any logical
point in my domain, which in that case would be the output matrix, the value at
that point is given as some expression of, you know, the coordinates in other
previous functions.
So in the case of transpose, that would be easy. In the case of situations
where you wind up with kind of all-to-all communication, like trying to write
an FFT in this, that is not the focus of most of the scheduling transformations
that I'll show you. We can do that, but it's not the context where we can have
the most power.
>>: Seemed like the dependencies are inferable, the computation
[indiscernible].
>> Jonathan Ragan-Kelley: So you mean like a data dependent kind of warp or --
>>: [indiscernible].
>> Jonathan Ragan-Kelley: Yeah, so what we would usually wind up doing in that
context is write, basically write like a reconstruction -- the second stage
would be, you know, a nine-tap filter over the previous stage, with the actual
indices being computed based on the scale factor and the input indices. Does
that make sense? Are you wondering about how we can actually comprehend what
the bounds of that are or --
>>: Might vary with the output position.
>> Jonathan Ragan-Kelley: Oh. So that's -- I mean that would just be an
expression that -- I think that would just wind up being an expression where
the weights are a function of some other expression, if that makes sense. But
you would usually still be writing it from the perspective of every output
location. So, I mean, it's the same way you write it if you wrote a GPU shader
that wanted to do that, if that makes sense. If you were handily filtering
your textures on some non-integer sized resampling.
>>: [inaudible] across operations. It just assumes the data is there,
correct? Let's say you're doing discrete image rescaling, you want to do it by
a user selectable amount. We're going to even ignore the filtering part now,
just nearest neighbor, right? If I say in one iteration I want to rescale by
3.7, the next iteration by 7.2, then the bounds analysis has to change, right?
>> Jonathan Ragan-Kelley: Correct. Well, the bounds analysis just has to be
conservative with respect to that. So one thing that we would commonly do in
that context is just say, in the schedule, which I'm going to explain now, that
we should have computed that entire prior stage if you're potentially going to
be using large rescaling factors, so that you have -- you know, you may be
accessing very, you know, differing ranges of the image. So I guess I'd say
two things.
First, our bounds analysis doesn't infer static constant bounds. It statically
infers expressions that compute the bounds. Which means that if those bounds
are some kind of expression in terms of, like, a rescaling parameter and an
input image size parameter or something like that, then the bounds at a given
point in the computation will be computed immediately before doing it. So it
can adapt in that kind of way, but if you're likely to be, you know, touching
an entire image for even very small regions of an output, then it usually is
the best choice to make sure that entire image has been computed to a buffer
that you can access, just because that's the most efficient strategy.
>>: [indiscernible] one might be shrinking, one might be growing.
>> Jonathan Ragan-Kelley: Yeah.
>>: But across the image, the --
>> Jonathan Ragan-Kelley: Yes, but at any given -- if you think of splitting
it up into tiles, and you schedule, which I'm about to explain, the input to
that to be computed as needed for tiles of the output, then the bounds on what
size tile of the input you need for each of those tiles is going to be computed
dynamically every iteration. So we're really injecting dynamic expressions
that compute those bounds as it goes around. Like as needed.
So given that model of the algorithm, the schedule defines the organization of
computation by specifying two things. First, for each stage, in what order
should it compute its values? So remember, we're computing whole grids of
pixels here, so in what order should we compute the points in this grid for
each function?
Second, when should each stage compute its inputs, or at what granularity? And
this defines the interleaving of computation between the producers and
consumers in the pipeline, which determines the amount of fusion between
stages.
And this is the model we used to describe the space of computational
organizations I showed earlier, that can be specified explicitly by a
programmer or inferred and optimized by the compiler. But different choices
about order within and across stages can describe all the types of
reorganization I showed earlier.
So first, the schedule defines the order we compute values in within each stage
and this is just the simple stuff you might imagine. If we have a 2D function,
we can traverse it sequentially across Y and then X giving us a simple row
major loop nest. We can compute the X dimension in four wide vectors. We can
transpose those dimensions if we want. We can distribute the scan lines across
parallel threads.
And we can split the X and Y dimensions to separately control the outer and
inner components of each. So in this case, by reordering the split dimensions,
we get a simple tile traversal.
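In Halide's scheduling syntax, each of those per-stage order choices looks
something like the following; each line is an independent option for a 2D
function f, not one chained schedule, and the variable names are assumptions.

    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
    f.reorder(y, x);                       // transpose the traversal order
    f.vectorize(x, 4);                     // compute x in 4-wide vectors
    f.parallel(y);                         // scan lines across parallel threads
    f.split(x, xo, xi, 4);                 // separately control outer and inner x
    f.tile(x, y, xo, yo, xi, yi, 4, 4);    // split both and reorder: a tiled traversal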
Now, what all these choices actually do, as you might imagine, is specify a
loop nest to traverse the required region of each function. So, for example,
the serial Y, serial X order gives a basic row major loop nest over the
required region of the function. And, to the point about dynamic bounds, I'm
using Y min and Y max kind of expressions there, and in practice there would
be some computation immediately before this that might be hoisted if it's
independent of everything in between, but might not if it's not. That actually
computes what Y min and Y max are in terms of its whole environment.
So putting everything together, we can split into two by two tiles, spread the
strips out over parallel threads and compute scan lines in vectors. This
synthesizes a 4D loop nest including a multi-threaded parallel loop and an
inner vector loop. And the specific complexities here don't really matter, I
just want to give a sense that the choices here are both rich and composable.
And to be clear, this stuff on the right is not code any programmer writes.
It's just meant as a representation of what these choices on the left
correspond to.
So this is basically what the compiler is synthesizing under the covers.
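As a concrete stand-in for that slide (again, not code anyone writes), the
synthesized nest for "tile 2 by 2, strips across parallel threads, scan lines
in vectors" would look roughly like this, where the bounds and the per-point
body are assumed to come from the rest of the pipeline and the innermost loop
stands in for a SIMD loop.

    #pragma omp parallel for                      // outer strips across threads
    for (int y_outer = 0; y_outer < y_extent / 2; y_outer++)
      for (int x_outer = 0; x_outer < x_extent / 2; x_outer++)
        for (int y_inner = 0; y_inner < 2; y_inner++)
          for (int x_inner = 0; x_inner < 2; x_inner++) {  // vectorized in practice
            int x = x_min + 2 * x_outer + x_inner;
            int y = y_min + 2 * y_outer + y_inner;
            out(x, y) = evaluate_stage(x, y);              // assumed per-point body
          }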
>>: [inaudible].
>> Jonathan Ragan-Kelley: Yes.
>>: [inaudible].
>> Jonathan Ragan-Kelley: No -- you specify this for all the stages. You can
specify this for all the stages. There are sane defaults, but you specify
this for every stage. So you're actually making both these order choices and
the interleaving choices that I'll show in a second for basically every edge in
the whole graph. It's a lot more -- less verbose than that sounds, and it can
be inferred automatically.
So then once we have the loop nest for evaluating each function's own domain,
we need to decide how to interleave them with each other to compute the whole
pipeline. So the other thing the schedule defines is the order of evaluation
between producers and consumers in the pipeline, which determines the degree of
fusion between stages.
So for simple point-wise operations, it makes sense to fuse the pipeline,
inlining the computation of each point in the first stage into the
corresponding point in the second. Then as soon as we apply the tone
adjustment to one pixel, we can immediately color correct it, throw it away and
do this across the image, producing and consuming individual pixels through
the pipeline.
But this simple strategy only works because of the one-to-one dependence
between pixels here. And that's normally what we would mean by loop fusion. So if
instead we had a large blur as the second stage, then each pixel in the blur
depends on a lot of pixels in the first stage. So the dependence between even
far away pixels has lots of overlap.
And if we fuse, we have to recompute or reload lots of the same values over and
over again for each point in the blur. So in this case, it makes more sense to
precompute the entire producer stage before evaluating the entire consumer
stage. So we can choose to sacrifice locality to avoid wasting too much work.
And finally, for stencil operations like our 3 by 1 blur, every point in the
consumer stage depends on a bounded window of the previous. So in this case,
full fusion would introduce lots of redundant computation where the stencils
overlap, while computing breadth first would completely sacrifice producer
consumer locality. So not surprisingly, the right answer is somewhere in
between.
So we can schedule the stages to be computed within smaller tiles sized to fit
in the nearby caches. And if we then interleave the computation of a producer
tile with the corresponding consumer tile and do the same thing over the image,
this is the same choice we saw with the optimized 3 by 3 blur example earlier.
We may have to redundantly recompute values on the shared boundary on the
tiles. But again, by changing the size of the tiles, we can trade off between
locality and redundant computation, which is spanning the space of choices
between full fusion and breadth first execution.
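In Halide, these three points in the space are one call each; a hedged sketch
for a producer g feeding a consumer f that has been tiled with an outer tile
index x_outer (the names are assumptions):

    g.compute_inline();        // full fusion: recompute g at every use (the default)
    g.compute_root();          // breadth first: compute all of g before any of f
    g.compute_at(f, x_outer);  // tile-level fusion: compute g per tile of f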
So our model of scheduling is specifically designed to span the space of
locality trade-offs and to describe them separately for every pair of stages in
the pipeline. The way we actually trade this off is the most fundamental idea
in Halide's design. We do it by controlling the granularity at which the
storage and computation of each stage get interleaved.
So the granularity with which we interleave the computation determines how soon
you use values after computing them. So here, fine interleaving provides high
temporal locality, quickly switching back and forth between computing and
consuming values. While coarse interleaving provides poor temporal locality,
computing lots of values before using any of them.
Then the storage granularity determines how long you hold on to values for
potential reuse. So fine grain storage only keeps values around for a short
time before throwing them away, which potentially introduces redundant work if
we need them again later.
While coarse grain storage keeps values around longer to capture more potential
reuse. So using the examples from the previous slide, full fusion interleaves
the storage and computation of the two stages at very fine granularity. This
maximizes locality but potentially introduces lots of redundant work.
Breadth-first execution avoids any redundant work but to get that it sacrifices
locality. And tile level fusion trades off between these extremes depending on
the size of the tiles that are interleaved. But now what about the whole other
axis of this space?
It turns out it's also possible to maximize locality while doing no redundant
work, but doing so requires constraining the order of execution which limits
parallelism. So by setting the storage granularity to the coarsest level,
while interleaving the computation per pixel, which was the far bottom right
corner of that triangle, we get a sliding window pattern like this.
So now, after computing the first two values in the first stage, for every new
value we compute, we can immediately consume it to compute the corresponding
value in the second. This has excellent locality, and it computes every value
only once, wasting no work. But to do this, notice that we have to walk across
the two images together in a fixed order. So we've sacrificed potential data
parallelism to optimize locality and redundant work.
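In scheduling terms, that sliding-window corner of the space is roughly the
following (a sketch; g produces, f consumes, and x is f's scan line variable):

    g.store_root()         // storage at the coarsest granularity
     .compute_at(f, x);    // computation at the finest: one new value per output pixel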
And I want to emphasize that these examples I've been showing are not just a
laundry list of scheduling options. Fundamentally, it's a parameterization of
the space of trade-offs for pipelines of stencil computations like this. And
our schedules directly model and span this whole space.
>>: Do you combine strategies sometimes? I can imagine doing that at the tile
level.
>> Jonathan Ragan-Kelley: Yeah, absolutely. So one of the reasons why the
domain order stuff matters is because it's at the dimensions of those loop
nests that we can determine how to interleave storage and computation, so you
can split a dimension a whole bunch of times, create a recursive tiling, and
actually, in just a second, I'll show a couple things like that. So yes, the
whole point is that we can combine all of this stuff in different ways for all
the different stages, yes.
>>: The sliding window, would that be like a streaming computation?
>> Jonathan Ragan-Kelley: Um-hmm.
>>: Those don't really correspond to any of the other schedules that you
mentioned, right? It's not really a fine --
>> Jonathan Ragan-Kelley: So the way we actually think about it is -- one of
the big realizations here is that you can actually think about it that way. So if
you think about it in terms of compute and storage granularity, you can express
the sliding window, ignoring the fact that we're reusing storage.
Like if you imagine I left all the previous pixels I computed sitting around,
it corresponds to the coarsest granularity, meaning I store everything for all
time. The coarsest granularity of storage but the finest interleaving of
computation. So for every new pixel I compute, I compute one of the consumer.
And then the actual transformation that reduces the storage to a small circular
buffer, and recognizes exactly which previously computed values can be
discarded, is dependent on the fact that we're walking along it in a fixed
order. So it's basically by picking some point down here in the space, where
we have a large window over which we could reuse all the values but we're
quickly interleaving, computing one new value and consuming a new value,
combined with the fact that the intervening loops are all in a known order, you
know, they have a known dependence vector, that gives that pattern. Does that
kind of make sense?
It's a little easier to see on a white board.
So yeah, one of the big realizations here is that actually does fall into this
space, which was a bit surprising.
So even for a pipeline as simple as the two-stage blur, we can express an
infinite range of choices. You don't have to follow what's going on in all of
these, and I'm actually just going to jump through quickly. But the top three
sit at extremes of the trade-off space and the bottom three balance them all in
different ways.
Here's what they actually look like running. So some of these -- to your
question, Matt, actually are doing combinations of basically sliding window on
chunks or line buffering or other things like that. And the bottom three, in
particular, are making fairly complicated combinations of tiling and sliding
windows and so on.
And different strategies here are best depending on the exact algorithms you
use, the structure in which the pipeline -- the pipeline in which they're
composed and how they interact with the underlying hardware. And the
difference between these can be more than an order of magnitude of performance
and efficiency on the same machine. Even between two plausibly high quality
ones.
So putting this all together to describe the optimized blur looks like this.
It's the same blur algorithm from before with the schedule equivalent to the
optimized C++. So first, it schedules blurY to be evaluated in 256 by 32 pixel
tiles. Each of these tiles is computed in eight-wide vectors across the inner
X dimension, and the scan line strips are spread across parallel threads.
Then it schedules the intermediate stage blurX to be computed and stored as
needed for each tile of blurY. And it also vectorizes it across X. So these
compute at and store at correspond to the positions on the two axes of that
triangle. And they're given relative to effectively loop levels of the loop
structure that we've defined for the previous function, if that makes a bit of
sense. It's a little easier to see on the white board.
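Written out, that schedule is roughly the standard Halide blur example
(variable names as in that example):

    Var x("x"), y("y"), xi("xi"), yi("yi");
    // Compute blur_y in 256x32 tiles, vectorize the inner x by 8, and spread
    // strips of tiles across parallel threads.
    blur_y.tile(x, y, xi, yi, 256, 32)
          .vectorize(xi, 8)
          .parallel(y);
    // Compute and store blur_x as needed for each tile of blur_y, also vectorized.
    blur_x.compute_at(blur_y, x)
          .vectorize(x, 8);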
And this generates nearly identical machine code to the optimized C++. But
unlike the C, it's not only simpler, the Halide version makes it easier to add
new stages, change the algorithm or change the organization. And it can also
compile not just x86 but also ARM NEON code. With a few tweaks to the schedule, we
get a GPU implementation. Actually, a whole range of GPU implementations.
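A hedged sketch of the kind of tweak that means, in today's Halide spelling
(the exact API at the time of this talk may have differed, and this is just
one plausible GPU schedule, reusing the Vars from above):

    // Map 16x16 tiles of blur_y onto GPU blocks and threads; blur_x is left
    // inlined, i.e. recomputed per output pixel, which is one reasonable
    // choice on a GPU with abundant arithmetic.
    blur_y.gpu_tile(x, y, xi, yi, 16, 16);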
We have a bunch of other language features, which I'm actually going to skip.
Our current implementation is an embedded DSL in C++, and it uses LLVM for code
generation. So to compile a pipeline, we take the Halide functions and the
schedule and combine those to synthesize a single loop nest and a set of
allocations describing the entire fused pipeline for a given architecture.
Then after our own vectorization pass and other optimizations, we pass it to
LLVM to emit vectorized x86 or ARM code, or CUDA kernels and graphs of CUDA
kernels embedded in x86 code, including the logic to launch and manage them.
And in that last case, on the GPU, we can actually generate heterogenous
mixtures of both vectorized multi-threaded CPU code intertwined with many
different GPU kernels, depending on the structure implied by the schedule.
So to pop back up, the last question is how can we determine good schedules.
And so far we've worked on two ways. I've alluded to this a couple times as
it's come up. First, you can write them partly or completely by hand. We
weren't even -- when we built this model, we weren't ever expecting real people
to write these by hand, but we wound up being surprised, with a teeny bit of
syntactic sugar on top of it, how far you can get and just how terse the
schedules wind up being, even for fairly complicated programs.
So it's a lot easier than traditional hand optimization, both because they're
compact but also because you can't accidentally break the algorithm. The
compiler synthesizes all the complex control flow and data management for you
that's implied by what your schedule says.
We found expert developers using this in places like Adobe and Google actually
appreciate the control and predictability this gives them. Second, we can
automatically search the space of schedules for a given pipeline on a given
machine using auto tuning. And the dimensionality of this problem is large, so
we've focused on stochastic search, but it works surprisingly well in practice.
Now, I'm going to present results from a few real computational photography and
imaging applications we've built in Halide, which I chose because they
represent a wide range of different program structures and levels of
optimization in this domain. And for all of these, I'm going to show the
complexity of the best handwritten schedule, but in every case, the auto tuner
was able to match or beat that in some number of hours without any human
intervention.
So first, we looked at the bilateral grid, which is a fast algorithm for the
bilateral filter. We implemented it in Halide and compared it to a pretty
efficient but clean C++ version by the original authors. Our version uses a
third of the code and runs six times faster on the same machine. Then with a
different schedule, we can run on the GPU instead. There, we looked at the
authors' handwritten CUDA version and started with a schedule that matched
their strategy. And that gave almost the same performance they had. But then
we tried a few alternative schedules and found a slightly nonintuitive one that
was twice as fast as their hand-tuned CUDA. So to the point earlier about why
we're beating hand tuned code: in this case -- in every case, we're not doing
anything you couldn't do by hand. It's just a question of making it much
easier to search the space quickly.
So in this case, the ideal trade-off, even on a very parallel GPU, chose to
sacrifice a little parallelism to improve locality during the reduction stage
that computes the grid. And that made an enormous difference in the
performance.
>>: Why didn't you [indiscernible] different choices [indiscernible]?
>> Jonathan Ragan-Kelley: So that -- so all of the allocations in one of these
pipelines are implicit in the schedule. They're basically implied by the
store-at granularity, and the semantics are, when you're running on a
conventional CPU-ish architecture, what we're basically doing is building a
giant loop nest or set of loop nests that all runs in one place, and so
allocations, wherever they are, all kind of mean the same thing. We have some
optimizations for keeping stuff on the stack or doing small pool allocations
if we know that something's constant sized or whatever. But generally, memory
is memory.
In the case of running on the GPU we have two separate memory spaces so
everything that's allocated externally to stuff that corresponds to a kernel
launch is mirrored in both spaces and lazily copied. I mean, it's actually not
eagerly mirrored. It's lazily mirrored, which you have to do if you're doing
computations in both spaces.
But we have semantics where allocations that correspond to granularity within
the thread dimensions of your kernel launch correspond to local memory and
allocations at the block level correspond to shared memory. So you can
effectively span the space of things you reasonably do.
>>: There are things on GPUs that are very important to optimization, like
register pressure and memory bank conflicts.
>> Jonathan Ragan-Kelley: Memory bank conflicts aren't very important anymore,
actually. You'd be surprised. Basically, they didn't used to be doing cache
reads and writes in the general memory system. Now the cache is --
>>: [indiscernible].
>> Jonathan Ragan-Kelley: Yeah, and so actually, even that got better. So it's
basically now, instead of throwing reads and writes directly at the memory
system and hoping they cover all the banks, you read and write to cache lines,
and the cache lines are sized to those transactions. It's the same way a CPU
works.
>>: And what [indiscernible] sizes?
>> Jonathan Ragan-Kelley: So that's actually defined by the -- there's
basically a mapping between what would have been, like, parallel loops when I
showed you the loop synthesis, and carving off some set of those and saying
those correspond to the block and thread dimensions of the GPU launch. And so
the granularity at which you block corresponds exactly to the block sizes
that get launched on the device.
>>: So suppose I work for a big company, and I have a problem that I run over
and over and over again on thousands and thousands of machines. And maybe
suppose that program is the one you just showed me. How do I know -- do you
have any idea as to how -- like, I'm willing to wait, you know, have this thing
run for weeks on end to find the best schedule. Do you have an idea as to how
good a schedule you can get? I mean, you're searching for a needle in a
haystack, so in all these problems, do you know, like, what's the throughput of
the machine and how close --
>> Jonathan Ragan-Kelley: So we haven't -- for simple examples, it's relatively
easy to build kind of a roofline-ish model. So for the blur that's relatively
easy to figure out. Actually, being able to reason about bounds based on the
fundamental structure of the program, you know, what the intrinsic arithmetic
intensity of different choices is, stuff like that, is something that we're
interested in doing but is not something we've done at all. I think this type
of representation gives you a lot more view into that than just writing C that
reads and writes memory willy-nilly. And that kind of view that I showed
earlier where you think about reuse distance and, you know, like crossing
dependencies and stuff like that is a natural view from which to begin to get a
more concrete sense of that kind of thing, of what the fundamental limits on an
algorithm are.
But in practice, we haven't done anything yet. So I'm going to skip this
example, it's a helluva lot faster than even very good Matlab, not
surprisingly, not because Matlab is slow, but because of locality.
And finally, in the local Laplacian filters example I showed earlier, the win
here wasn't again because we emitted better code. It actually came from fusing
different stages to different degrees and accepting some redundant computation
in exchange for better locality. So our implementation didn't just make a few
different optimizations beyond what the original developer found. It had to
fuse and schedule each of the hundred stages differently to globally balance
parallelism, locality and the total amount of work.
And our strategy is actually substantially different -- basically, we can talk
about it offline, some of the interesting things we noticed. And then the
strategy we wound up using on the GPU was completely different from that
yet again. So it's -- if you were rewriting this to explore these different
choices, you would be tearing up and rewriting something on the order of 80
different imperfectly nested loops and all the allocations and boundary
handling and everything along the way.
>>: So just like anecdotally kind of going back to the prior question, did you
see that on average, most -- so you get an order of magnitude range as kind of
the claims you're making, right, and you see those numbers bear out. But are
most schedules around average, and a handful are better?
>> Jonathan Ragan-Kelley: So you can write -- the whole space of things that
we model includes a great deal of stuff that makes absolutely no sense.
Introduces huge amounts of redundant computation, has crazily large
intermediate allocations, other things like that.
So I wouldn't say that most in any statistical sense are around average. The
space is pretty spiky. We generally found, without putting any heuristics into
the auto tuner, it converged, you know, along kind of a smooth exponential
decay kind of curve. You can make of that what you will. There may be sort of
a normal distribution of performance in that space. I'm not sure. When we did
this stuff by hand, we usually kind of sampled a few -- we had a few hypotheses about points in the space or strategies that might work, kind of tried them, twiddled some stuff around them, and usually tried about ten total things, and the variation among the best few was usually not that big. The variation overall was often an order of magnitude.
But which were going to be best, we didn't know ahead of time. We were often totally wrong.
So that's our experience so far with Halide. To recap, again, the big idea is
to explicitly decouple the definition of the algorithm from the organization of
computation. I think the biggest thing here is that we designed this model of
scheduling specifically to span the space of trade-offs fundamental to
organizing computation and data parallel image processing pipelines.
More than anything else, I think it's that model that's actually what lets this
be possible. It's the unification and composability of that model that lets us
promote it all the way to be visible to the user, unlike most compiler
optimizations. It's, you know, the kind of orthogonality of that model that
lets us span the whole space and walk through it automatically.
And that's really the key to all this. And also in this domain and I think
probably in others, it wound up being really important to put emphasis on the
potential of redundant computation to break actual dependencies in an algorithm
and minimize computation. Because on modern machines, that actually pays for
itself an impressive amount of the time.
So here, the end results were simpler programs that run faster than hand-tuned code while giving composable and portable performance. Or composable and portable components that can scale all the way from our mobile phones, which I didn't show. We have a whole suite of these results on ARM all the way to massively parallel GPUs.
Our implementation is open source under a license that I think is permissive enough for you guys -- an M.I.T. license -- and is being actively developed, currently in
collaboration with people at Google and Adobe. People are using it at both
companies and I can actually now tell you that every picture taken on Google
Glass is processed on board by pipelines written and compiled with Halide.
I have other projects to discuss, but we've spent a lot of time talking about
other things. So I think I'm going to leave it there. Thank you.
>>: Have you extended this to other things beyond just image processing?
>> Jonathan Ragan-Kelley: Yes. So I think of this not as a domain-specific
language in the sense of having much to do with image processing itself so much
as the schedules in particular are, I think, a fairly complete parameterization
of the ways you can interleave computation and storage on computations over
graphs of computations over rectangular regular grids, which is basically what
images are.
And so we have -- I've done some, you know, fluid simulation, a few other
things in this context. Like I said, the things that do all to all
communication don't benefit from a lot of these types of reorganization. They
benefit from things that, from this perspective, would be actually algorithmic
reorganization if you wanted to do an FFT or something like that. That -- doing algorithmic reorganization combined with schedule reorganization for approximate
convolutions is actually something we're beginning to look at. The extreme
example being the GPU efficient recursive filtering kind of idea. I see that
in this view. That's not just a change of order. It also takes advantage of
associativity and linearity and so on to rewrite the algorithm or what we would
mean by the algorithm here.
And then also, actually, among the other things I was going to touch on, we
worked on applying some similar ideas outside of graphics in a different
language. So this is not in Halide. But we were using the Petabricks
language. Sorry, this is actually not a great slide to talk over. It was
meant to be a quick summary.
We're using the Petabricks language, which is built around auto tuning to
determine lots of -- similar sorts of choices. And in this context, we were
working on mostly traditional HPC kinds of code. This is also mostly over
regular grids. Their program model is primarily focused on stuff that's data
parallel in a relatively similar fashion. But the trade-offs explored here
included not just the types of organization I discussed, but also algorithmic
choice.
So in particular, for things that naturally lend themselves to divide and conquer algorithms, there are often different types of algorithms you can use at
different levels of recursion, and in this case, there was -- we were basically
inferring a decision tree to decide what strategies to use, including different
algorithmic choices.
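This isn't PetaBricks syntax, but the flavor of that kind of tuned algorithmic choice can be sketched in plain C++: a divide-and-conquer where the algorithm used at each level depends on a cutoff the autotuner would pick. The cutoff value here is a made-up placeholder.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hypothetical value an autotuner would choose per machine.
    static const std::size_t kTunedCutoff = 64;

    void tuned_sort(std::vector<int> &v, std::size_t lo, std::size_t hi) {
        if (hi - lo <= kTunedCutoff) {
            // Below the cutoff, a simple base-case algorithm wins on constants.
            std::sort(v.begin() + lo, v.begin() + hi);
            return;
        }
        // Above the cutoff, divide and conquer, making the same choice again
        // at every level of the recursion.
        std::size_t mid = lo + (hi - lo) / 2;
        tuned_sort(v, lo, mid);
        tuned_sort(v, mid, hi);
        std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
    }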
>>: So does your vectorizer generate any shuffling?
>> Jonathan Ragan-Kelley: So the scheduling model doesn't express anything
like that, but we have, when I say we have our own vectorization, we're not
just relying on an automatic loop vectorizer or something like that. We're
directly constructing kind of SOA SIMD code, and then our back end actually
does, you know, peephole-optimization kind of level instruction selection for --
>>: But do you, like, generate, like, intrinsic shuffling?
>> Jonathan Ragan-Kelley: Yes, so we generate whatever has been necessary so
far. We basically have a peephole, a pattern matching phase at the end that
looks for common sub-structure in the -- in our vector IR and substitutes
common intrinsics for those. So we're not emitting machine code directly, but we're going through -- we emit LLVM intrinsics, which basically just say, I want this exact instruction to be selected here.
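The pattern-matching phase being described is, at heart, something like the toy rewrite below over a small expression IR. This is not Halide's real IR or LLVM's API, just an illustration of the shape of a peephole rule.

    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
        std::string op;                          // e.g. "add", "mul", "fma"
        std::vector<std::shared_ptr<Node>> args;
    };
    using NodePtr = std::shared_ptr<Node>;

    NodePtr peephole(NodePtr n) {
        for (auto &a : n->args) a = peephole(a);         // rewrite children first
        // Match add(mul(a, b), c) and substitute a single fused node that a
        // backend could lower to one target intrinsic (e.g. a multiply-add).
        if (n->op == "add" && n->args.size() == 2 && n->args[0]->op == "mul") {
            auto mul = n->args[0];
            return std::make_shared<Node>(
                Node{"fma", {mul->args[0], mul->args[1], n->args[1]}});
        }
        return n;
    }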
>>: I wonder, like, you have some cases where the accesses are nonlinear, like two --
>> Jonathan Ragan-Kelley: Right. So we have a variety of optimizations in the
different back ends. This type of code is actually a large subset of exactly
what both SSE and the [indiscernible] were designed for. So there's a lot of
esoteric instructions that actually wind up being useful in different contexts.
And dealing with strided loads or things like that or packing and unpacking
stuff is one of the common contexts where we try to make sure that the front
end and main lowering process will always generate relatively common patterns
that we can pick up in the back end and replace with specialized ops.
>>: Have you thought at all about run time fusion and scheduling? You can think
of a variety of reasons to want to do that. For example if I write an app, I
want to know that it works well on current generation and any generation
[indiscernible].
>> Jonathan Ragan-Kelley: Yes.
>>: Another example is like the PhotoShop layers list, where the user is in
charge of the composition and you still have [indiscernible] run time.
>> Jonathan Ragan-Kelley: I think of those as being different, actually. In
the first case, I don't think you need run time compilation. You just need
what you think of as sort of load time compilation. So performance portability
probably requires -- well, for one thing, one of the big reasons why we have
this whole idea of splitting the schedule from the algorithm definition is that our
assumption is that the choices we're making in the organization are not going
to be performance portable, that you're going to want different organization on
different machines.
So you need to have some way of deciding what the best organization is on some
future machine, but that compilation doesn't need to be happening with amazing
low latency because it's not like in the UI loop. It's happening once when the
user boots the program on a new machine or something.
The latter context is a bit harder, where you're actually respecializing
programs, depending on -- well, and so you're potentially rescheduling or the
more extreme case, which I worked on in the context of visual effects, is that
you're actually specializing the computation based on knowledge that certain
parameters are static or other things like that and trying to fold it in
different ways.
And so in this context, we were basically trying to accelerate lighting
preview. People are only changing a few parameters of a 3D rendering, so we
automatically analyzed the program, figured out what was changing and what wasn't.
The vast majority wasn't. We could auto parallelize the stuff that was. We
had all kinds of aspirations about adapting to individual parameters being
edited one at a time, and it turned out that dealing with that was monstrously
more complicated and much higher overhead than just doing a fairly good job
specializing for the common case.
I think if you look at something like Lightroom, it's kind of halfway in between the general PhotoShop layers or, you know, compositing graph kind of
case and the totally static case. They have a bunch of fast paths that they
hand code right now, where, you know, they have, say, a half dozen fast
paths for preview or final res with or without a few major features turned on
or off, like whether you do or don't have to do the scale decomposition to do
all the local tone adjustments.
Or like whether or not you've cached the results of demosaicing. And I think
picking a handful of fast path optimizations compiled ahead of time can be
extremely fruitful, because then you're just quickly choosing among a few
places to jump to. But injecting the compiler into the middle of the overall
process is at least a hassle and often hard -- it's hard to -- even if 80
percent of the time it makes things go better, having nasty hiccups 20 percent
of the time is often not worth it. That's sort of how I'd summarize my
experience with that kind of thing.
So one thing you can do with auto tuning, and people have done this, is try to auto tune over a large suite of tests, producing more than one answer from the tuning, so you can say, I want the five schedules that best span this space -- you know, that best optimize overall for this large set of benchmarks. And if, into that set of benchmarks, you include different types
benchmarks. And if, into that set of benchmarks, you include different types
of patterns which might be specialized against, you know, just running quick
previews with certain things off, versus having everything turned on, that's a
much easier context in which to automate that kind of thing, I think.
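A minimal sketch of the "handful of precompiled fast paths" idea: ship a few ahead-of-time-compiled variants of the same pipeline and pick among them cheaply, rather than invoking a compiler in the loop. The variant signature and the load-time benchmarking below are assumptions for illustration, not anything shipped in Halide.

    #include <chrono>
    #include <vector>

    // Each variant is an ahead-of-time-compiled version of the same pipeline.
    using PipelineVariant = void (*)(const float *in, float *out, int w, int h);

    PipelineVariant pick_fastest(const std::vector<PipelineVariant> &variants,
                                 const float *in, float *out, int w, int h) {
        PipelineVariant best = variants.front();
        double best_ms = 1e300;
        for (PipelineVariant v : variants) {
            auto t0 = std::chrono::steady_clock::now();
            v(in, out, w, h);                     // run once as a benchmark
            auto t1 = std::chrono::steady_clock::now();
            double ms =
                std::chrono::duration<double, std::milli>(t1 - t0).count();
            if (ms < best_ms) { best_ms = ms; best = v; }
        }
        return best;                              // jump to this from then on
    }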
>>: So the trade-off we make a lot in [indiscernible] code is precision versus --
>> Jonathan Ragan-Kelley: Yes.
>>: And there's a whole bunch of [indiscernible] vector instructions. Do you, have you looked at, experimented with that at all?
>> Jonathan Ragan-Kelley: Yeah, so one of the things we thought about early on
was -- so our native types: I didn't show you any types anywhere, because everything's all inferred, but it's all just inferred using basically C-style promotion rules from -- or you can use explicit casts from -- whatever your input types are.
So we do a lot of computation on, like, 8 and 16 bit fixed point numbers on
many of the pipelines we use. We did work hard on the sort of vector op
pattern matching for those types, because that's a context in which there's a
lot of specialized operations that can be actually really useful. Saturating
adds and other weird things like that.
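For example, the kind of 8-bit expression meant here can be written with explicit widening and clamping so that a backend pattern matcher is able to select a saturating add for it. Exactly which patterns get recognized is version- and target-dependent; this is a sketch.

    #include "Halide.h"
    using namespace Halide;

    Func saturating_add(Func a, Func b) {
        Var x("x"), y("y");
        Func out("out");
        // Widen to 16 bits, add, clamp to 255, then narrow back to 8 bits --
        // the shape a backend can map to one saturating-add instruction
        // (e.g. PADDUSB on SSE or VQADD on NEON).
        out(x, y) = cast<uint8_t>(
            min(cast<uint16_t>(a(x, y)) + cast<uint16_t>(b(x, y)), 255));
        return out;
    }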
Early on, we were thinking that we might want to have these programs be able to be parameterized over some space of choices about precision that you might use, and even be able to build in potentially higher level knowledge of, like, fixed point representations, to handle that kind of thing automatically.
We wound up deciding to keep it simple and keep it to the semantics of
basically the kind of stuff you'd write by hand in something like C today. But
in the long run, I think that would be a useful higher level front end feature
that would both give more flexibility to the compiler and make it easier to
explore that space by hand.
>>: [inaudible].
>> Jonathan Ragan-Kelley: So the vectorization, the way we express it,
remember that we were synthesizing loops over the domain of a given function.
So we have some rectangle that we have to compute. We're synthesizing loops
that cover that. Our model of vectorization is that some of those loops, or a split subset of one of those loops, gets computed in vector strides of some constant width, so I say I want the X dimension to be vectorized in strides of eight.
If that one function includes a bunch of different types, we're not necessarily
going to be -- the back end might be cleverly selecting different width ops.
So doing, you know, two sequential ops on 32 bit numbers while doing only one op on 16 bit numbers, if we have a 128 bit vector unit. Usually what we'd wind up doing is vectorizing to kind of the maximum width that any of the types flowing through that function have, if that makes sense. I can show you on a
white board.
But all our operations, all our general operations are agnostic of the
underlying vector types. So if you express a 256 bit vector op, like adding two eight-wide float vectors together, if you're on SSE or NEON, that will get code
generated into a sequence of two ops. You can think about it as just like
unrolling the regular vector loop.
And those kinds of details wind up being taken care of in the back end
instruction selection. But you're not specifying, like, per op how wide you
want to do it. It's just per function.
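In Halide terms, that per-function, per-dimension choice looks roughly like this. The comment about how wide vectors get lowered reflects the behavior described above, not a documented guarantee.

    #include "Halide.h"
    using namespace Halide;

    Func vectorized_stage() {
        Var x("x"), y("y");
        Func f("f");
        f(x, y) = sqrt(cast<float>(x * x + y * y));
        // Vectorization is specified per function and per dimension, not per
        // op: 8 floats is a 256-bit vector, which on a 128-bit SSE or NEON
        // unit gets lowered in the backend to two ops per iteration, as if
        // the vector loop were unrolled once.
        f.vectorize(x, 8);
        return f;
    }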
>>: Along those lines, I'm curious. So how much of the performance improvement you're getting is a function of your peephole optimizations versus [indiscernible]?
>> Jonathan Ragan-Kelley: I think that the big win, so if we throw in
schedules that are equivalent to simple C, the big wins are from restructuring
the computation. So it's not -- we're mostly relying on LLVM for register
allocation, really low level instruction selection and low level instruction
scheduling.
>>: [inaudible].
>> Jonathan Ragan-Kelley: Not until now. 3.3, which came out like two days
ago, is the first time they have any significant vectorization. Well, there's
multiple answers to that depending on what you mean. In terms of loop auto
vectorization or superword parallelism stuff, that's only out literally right
now and we've never built on any of it.
Depending on the back end -- especially in the case of NEON -- they actually do a pretty good job of the same kind of peephole optimization that we're doing, where they recognize a sequence of, you know, simple add, multiply, something, something, and recognize that that can be fused in a magic way into a magic NEON op. So generally, the way you target weird ops in NEON is to figure out what their magic pattern is and use it.
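One example of "using the magic pattern": writing an 8-bit average in the widen, add one, halve form that an ARM backend can recognize and turn into a single rounding halving add (VRHADD). The exact recognized form is an assumption here and may vary by version.

    #include "Halide.h"
    using namespace Halide;

    Func rounding_average(Func a, Func b) {
        Var x("x");
        Func avg("avg");
        // (a + b + 1) / 2, computed at 16 bits and narrowed back to 8 bits --
        // the idiom a NEON-style rounding halving add implements in one op.
        avg(x) = cast<uint8_t>(
            (cast<uint16_t>(a(x)) + cast<uint16_t>(b(x)) + 1) / 2);
        return avg;
    }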
But that's really only true for ARM, in our experience. On x86, they don't
make any effort to do clever instruction selection for you. They generate, like, if you're multiplying two single precision four-float vectors, they'll give you a mulps, and other than that they'll just do the stupidest, simplest thing you can possibly imagine. So we have to be much more aggressive with the peephole optimization on x86.
But I think that's mostly just like the last factor of two, that kind of stuff.
Just making sure it doesn't do something stupid and that it really is
vectorizing by the width you told it to all the time and not spilling
everything to the stack and, you know, packing and unpacking the vectors
between every op and that kind of thing.
Thank you.