>> Juan Vargas: Welcome, again, everybody. I hope you had a very nice evening last night
after the incredibly exciting sessions we had yesterday.
So today we're going to have the UPCRC workshop. This is what Microsoft calls the third day
event after the faculty summit. And oftentimes what happens in the faculty summit is people just
have these short times to present [inaudible] directions of research. So this session is [inaudible]
deeper dive into the topics.
And we worked with Illinois [inaudible] Microsoft to get this program ready, and instead of going
[inaudible] faculty summit just on presentations by the partners, we decided to go by teams.
So the teams are programming systems, tools for parallel testing and debugging, applications,
architecture. And at the end, like yesterday, we'll have another panel that we hope is going to be
as exciting as the one we had yesterday. And the panel is going to be UPCRC: Can Industry and
Academia Collaborations Be Effective. We hope that there will be a lot of discussion and debate,
and we hope we can reach real conclusions.
So a few remarks. If you want to connect externally through wireless, there is a username and
password out there. And we have busses going back to the Hyatt Hotel starting at 5:00, and
probably last bus is going to be about 5:30. So once we finish the panel, we just have to go out
and wait for the bus.
If you have not done so already, please sign at the door so that we know who is coming, with
your names and affiliations.
And with that I'd like to introduce our first speaker, David Padua. He comes from the
University of Illinois where he has been since 1985. And when I do these presentations, it's
always difficult to omit some of the remarks because everybody who is talking today has such a
distinguished career that I struggle to find what to say during less than one minute so that we can
spend most of the time listening to them.
So David has been a professor at Illinois since 1985. He has a very distinguished career, and
probably the last book he wrote is the Encyclopedia of Parallel Computing.
And I don't want to say more because we know pretty much very well his real accomplishments,
and David is going to talk about abstraction for parallel programming.
>> David Padua: Thanks. Of course the last book I wrote I didn't write because it was an
encyclopedia, and that's...
>> Juan Vargas: [inaudible].
>> David Padua: And that was a very pleasant experience, the encyclopedia. Many people
contributed and it was very moving to see how much people were willing to participate in that
project.
Okay. So I was asked by colleagues to give you a summary of the work going on at Illinois.
That's a very difficult task. So the work going on in Illinois in the area of compilers and
languages.
So I'll try to do my best, but first I must say that what I am going to say is not comprehensive,
neither in terms of the projects nor in terms of people. There are more people doing compilers,
there are more compiler projects going on; I'll mention a few that seem more relevant to
this meeting.
So mainly I'll focus on the work of Vikram Adve, Wen-Mei Hwu, and my research group, and
there will be presentations on the compilers and the languages that are being designed.
So let me start by saying some commonplaces about what the problem is. So, as we know,
parallelism is everywhere. And if parallelism is going to succeed, it will be because
software will be developed that can take advantage of parallelism. Okay?
But when that happens, and if that happens, one of the things that obviously is going to happen is
that there will be a dip in productivity, because writing parallel programs is more complex than
writing [inaudible]; you need to deal with performance, with scalability. There are new sources of
defects and so on, so forth. Okay?
So what we face as a challenge as software people is to facilitate parallel programming, to reduce
the impact of parallelism on productivity. It's not very glamorous, but it's true. That's all we're
trying to do: reduce the impact, try to bring productivity back to the level of the sequential era.
And we find a lot of challenges.
But of course there is optimization, because performance is crucial in the parallel era; only
through the scalability of software will new machines be valuable and will people be willing to
buy them. It's the business model.
There is the issue of portability that is also tremendously difficult because while porting across
machines in the sequential era was [inaudible] because the difference was in the type of
instructions.
And perhaps if you [inaudible] in the case of parallelism, the classes of machine can differ
widely. There's a big difference between a GPU and a collection of CPUs, and there is a big
difference between vector processing and the [inaudible] memory machines and so on. So there
is a tremendous issue there. So that's a problem that has never been resolved, and we really don't
have a good answer for how to guarantee that you can write code that will be executable on a variety
of classes of machines: distributed memory, shared memory, arrays, and so on.
And the other challenge of course is correctness. There are new classes of defects.
Nondeterminism, deadlocks, and, you know, other issues related to parallelism.
So how to address this challenge? From what I have seen, there are mainly two strategies that I
believe are viable to address the problem. One is to raise the level of programming so that you
have enough information to be able to map across machines that will facilitate portability. You
have enough information through optimization that would solve the problem of getting good
performance and scalability and so on, so forth.
And that of course is a difficult problem, because at what level you present the language is an
issue: you may want to work at the level of applications and have abstractions that
represent complete applications, you can have abstractions that represent algorithms, or you can
have abstractions at lower levels. Which ones are useful and how you combine them is
not a problem that is easy to solve.
So you -- and there are a number of possibilities. You can work from the formula level or the
very general description level. So you can have domain specific language, you can have very
high-level operators, and you can have just plain parallel abstractions like parallel loops and
[inaudible] that sort of thing. So that's one.
And the other, which I think is also crucial, although many people don't seem to be fond of this
approach, is to have techniques that automate the process of optimization, automate the process
of porting and so on, so forth. So we actually need to work on this area of autotuners, compilers,
so on, so forth.
So, you know, maybe I did this a couple of days ago, but I think, you know, maybe I'm missing
something, but I think certainly these are two of the main strategies that we can use to address
the problem of productivity in the parallel world.
So let me start by going quickly over what is being done at Illinois, and I hope it's not too boring.
It will be just a short description of projects.
So three projects in the language realm: Triolet, that's Wen-Mei Hwu's group; this HTA work
that we have been doing in my group; and the Hydra thing. I'll describe this very quickly to you.
So Triolet is a library of built-in functions -- basically here is an example -- it's basically a
library that you can call from Python code. And I think this work is
similar, I guess, to the work that is going to be described in the talk after this one: basically, a
number of operators with iterators, and the idea is that the operators are implemented in parallel
and you can work in the program at the high level using the computational notation. Okay. So
the notation is Python, and I think that's all I'm able to say about this.
They are working on compilation techniques and so on, and they continue working in this area.
This is the work, as I was saying, of Wen-Mei Hwu and his student Christopher Rodrigues.
That's one project. And here, as you can see, the goal is to raise the level of abstraction with
functions that implement operations, and you encapsulate the parallelism within the functions.
The other related approach is the one in my group on hierarchically tiled arrays, and this is work
that my students, colleagues, and I have worked on for a few years.
And basically the main idea, as [inaudible] described yesterday, is that we recognize the
importance of blocking and make the tile a first-class object. If you look at algorithms for parallel
computing, or at sequential algorithms dealing with locality, the tiles appear all
[inaudible] as a constant.
So having them in the language explicitly we thought was [inaudible]. This started when we had
the HPCS program started by DARPA, and the IBM guys wanted an extension to MATLAB
for parallel programming. And by looking at MATLAB, I decided that some of its structures --
they have these, and I forgot the name, they have some classes of objects that basically are
collections of arrays and enable a hierarchy of tiles -- I thought that could be used to
represent parallelism.
The idea is relatively simple. You have arrays that are tiles and the tiles can be tiled and so on,
and then you can use the tiles at the top for distribution across machines, the tiles within the tiles
for locality at different levels, you can use the tiles at the second level for shared memory
parallelism and so on, so forth. So it's a natural thing in the hierarchical machines of today.
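To make the idea concrete, here is a minimal C++ sketch of a two-level hierarchically tiled array. The types and functions are hypothetical, for illustration only, and are not the actual HTA library API: the outer tiles would be the natural unit of distribution across nodes, the inner tiles the unit of thread-level parallelism or cache locality.

    #include <cstddef>
    #include <vector>

    // Hypothetical illustration, not the real HTA API: a two-level tiled array.
    // Outer tiles ~ distribution across nodes; inner tiles ~ cores or cache blocks.
    struct Tile {
        std::vector<double> data;
        explicit Tile(std::size_t n) : data(n, 0.0) {}
    };

    struct HTiledArray {
        std::vector<std::vector<Tile>> outer;   // outer tiles made of inner tiles
        HTiledArray(std::size_t nOuter, std::size_t nInner, std::size_t tileSize)
            : outer(nOuter, std::vector<Tile>(nInner, Tile(tileSize))) {}
    };

    // An element-wise operation written tile by tile: the outermost loop is where
    // cross-node parallelism would go, the middle loop is where threads would go,
    // and the innermost loop stays sequential and local.
    void axpy(double a, const HTiledArray& x, HTiledArray& y) {
        for (std::size_t i = 0; i < y.outer.size(); ++i)
            for (std::size_t j = 0; j < y.outer[i].size(); ++j)
                for (std::size_t k = 0; k < y.outer[i][j].data.size(); ++k)
                    y.outer[i][j].data[k] += a * x.outer[i][j].data[k];
    }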
And then we have work more recently extending this notion to irregular data structures like sets
and so on. So here the operations are on arrays and sets, and the idea is that the arrays and sets
are tiled and you can represent parallel computations by manipulating these objects. That's basically it.
I am personally still struggling with the notion of what primitive you need to go beyond linear
algebra and also struggling with the problem do we need to go beyond linear algebra. You can
do everything with linear algebra perhaps. So that's an issue that I don't think I can answer.
But what I can say is that our work on regular structures has proved to be very effective. There
is a tremendous complication in tiling irregular structures, because you need mechanisms to
decide where each component, as I said, will go. And we can make those mechanisms, but it's
not obvious how to build them from scratch. You need to think very hard.
So the end result with this notation of course is that you can get codes that are much more
compact, and, more importantly, more readable than the equivalent MPI codes. So -- and the
other advantage of this approach of having all your computation with the parallelism
encapsulated inside operations on these arrays is that you have portable programs. You can
implement your operations as a GPU [inaudible], you can implement them as threads within
MPI, you can implement them as parallel loops. And they -- in all cases we have tested they
have worked reasonably well. So that's that.
And then the other project that I just wanted to mention briefly, some other one in our group, is
called Hydra, and that is with my students Alexandre Duchateau and a colleague from Bordeaux,
Denis Barthou.
And the idea here is to start with a formula and, for example, you want to find X in that equation.
And then what we do is tile the operands and convert the equation into multiple
equations, looking for a recursion. So eventually this reduces, if you want to go all the way,
to scalar equations that are easy to solve. Okay.
So in the process of decomposing the equation, what you find is a number of operations that can be
executed in parallel, and by doing the decomposition, what you build is a graph of operations that
can be executed in parallel.
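As a small illustration of this kind of decomposition (my own example, chosen for simplicity, not necessarily one Hydra targets): take a block lower-triangular system $LX = B$. Tiling it $2 \times 2$ turns one equation into smaller equations of the same form plus an update product, and repeating the process eventually reaches scalar equations; the independent sub-equations and update products form the graph of operations that can run in parallel.

    \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
    \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} =
    \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}
    \quad\Longrightarrow\quad
    L_{11} X_1 = B_1, \qquad
    L_{22} X_2 = B_2 - L_{21} X_1 .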
Okay. The great advantage of this approach is that you can partition the components in multiple
ways, and that enables you to search the space of possible solutions and look for a good
performing version of the algorithm.
We are still at the preliminary stages, but what we have seen is that the codes we obtain are
very good and competitive with what people have written by hand. So stay tuned. There will be
more about this in the future.
Okay. Next project is about compiler evaluation. This is a topic that I think is tremendously
important, because while you see a lot of publications about new compiler techniques and
strategies and so on, what you don't see too much is evaluation of what exists out there.
And compiler technology is not just about algorithms, it's not just about ideas, it's about how
well our algorithms work when applied to real problems.
So it's a tremendously difficult problem, the evaluation of compilers, not so much for the
conceptual issues but for the labor involved. Because you really need to do manual analysis of what
is happening to really understand how well the compiler is working.
So here is a project that we did for two years at Illinois, the evaluation of autovectorization. What
happened is that in the Blue Waters project the idea is to program the
machine with MPI codes. But the threads within the MPI program have to execute as
efficiently as possible, and what they wanted was to automate the process of tuning for
vectorization -- to take advantage of the vector extensions of the microprocessor, an IBM POWER
machine at the time.
So they asked me to work with the IBM compiler people to make sure they deliver a compiler
that will produce good performance, especially in the context of taking advantage of
the vector extensions.
So this is regarding Tim Mattson's comment yesterday. In reality this is a form of
autoparallelization, and of course the issue is not whether it's possible or not, it's a matter of
degree. Because typically what people do is run their programs through the compiler, and
whenever it can find vector operations, they profit from that and
benefit from that.
And in reality it's a very important technology, so much so that all compilers autovectorize. And
the compiler groups invest a lot of time in this technology. For many reasons. One of them of
course is that you evaluate compilers and machines by how well they perform on SPEC codes. And if
the compilers do better at vectorizing, the machine does better.
So there is -- but there is also I think a real relationship of autovectorization to regular users all
over the world. Autoparallelization I think it's also true that all compilers autoparallelize. But
when you talk to the IBM compiler people, they hear very little about users of
autoparallelization, but autovectorization is widely used.
And, of course, why not? You can have computers that answer all sorts of questions. There is no
reason why your computer cannot answer the autovectorization question: can this loop be vectorized?
So let me just tell you briefly the outcome of the evaluation and briefly what we're doing going
forward. So basically we took the collection of loops that David Callahan and
[inaudible] and Ivan put together 20 years ago and ran them through three compilers: GCC, ICC,
and the IBM compiler. And what we found is that there was great variability in what
the compilers were able to do.
And despite the fact that at some point we gave the benchmarks and the results to the different
compiler groups, and they worked on their compilers [inaudible] because of that, after a
while you could still see that there were loops that were vectorized by some compilers and not by
others, with no good reason for that.
So basically I think the most important lesson I learned is that we have tremendous difficulty
figuring out how to do quality control at the compiler level. So that's -- that to me is the biggest
lesson. Because when we look at the loops, there is no reason why they cannot be vectorized.
We have the technology, we have the mechanisms to do the transformation, to do the analysis.
The only problem is that the compiler just doesn't know how to take all these tools and put them
together and do the proper thing for a particular job. It doesn't know how to --
>>: What do the numbers in the shaded boxes mean, the 2 and the 3? No, in the other -- the autovectorized? Under ICC you have a box with a 3 and then under GCC a box with a 2.
>> David Padua: I -- to tell you, I don't remember right now.
>>: Okay.
>> David Padua: There is a paper in PACT last year, and they describe all the details there. The -- the --
>>: [inaudible] compiler [inaudible].
>> David Padua: Huh?
>>: And if I look at this [inaudible].
>> David Padua: The Intel compiler -- okay. This is not the end of the story. These are only the
loops that were very simple. When we look at real applications, all the compilers behave
equally badly: 30 percent of the loops were vectorized. Performance-wise, when you did the work by
hand, as if you were a compiler, you could get factors of two or more improvement.
So there is a lot to do. But in all cases, real applications or not, when you look at the transformations
that you need to apply, they are there. We know the transformations. There is nothing new that
needs to be learned about the analysis or the transformations. What we need to learn is how to guide
the compiler, how to establish a path to take the code from the original form to a form that
performs well.
If you could just launch the compiler and let it search until tomorrow, then it would perform
[inaudible]. The problem is that compilers must be efficient, so they cannot take a long time to
compile. That's part of the reason. And, you know, the problem of optimization is a black art.
Nobody knows exactly how to search [inaudible]. Yes.
>>: [inaudible] what programming language are they in?
>> David Padua: C.
>>: And your analysis was that from language perspective they are able to vectorize?
>> David Padua: Yeah.
>>: Okay.
>> David Padua: Yeah. Of course. And, in fact, all these defects, this lack of effectiveness of the
compilers, was observed after restrict and aligned keywords were inserted by hand, so even then [inaudible].
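For illustration, here is the kind of hint being described; this is a made-up loop, not one of the benchmark loops. Without the no-alias and alignment assertions a compiler often cannot prove the loop safe to vectorize; with them it becomes a straightforward candidate. (__restrict and __builtin_assume_aligned are compiler extensions; GCC and Clang accept these spellings, and other compilers have equivalents.)

    // Illustrative only: __restrict asserts that a, b, c do not alias, and the
    // alignment hints allow aligned vector loads and stores.
    void triad(float* __restrict a,
               const float* __restrict b,
               const float* __restrict c, int n) {
        a = static_cast<float*>(__builtin_assume_aligned(a, 32));
        b = static_cast<const float*>(__builtin_assume_aligned(b, 32));
        c = static_cast<const float*>(__builtin_assume_aligned(c, 32));
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + 2.0f * c[i];   // now a simple vectorization candidate
    }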
>> Juan Vargas: I'd like to insert an editorial comment. If you are interested in vectorization,
reading that original paper by David and Jack and so on would be worthwhile, just to see what
could be vectorized and how one might test the ability of the compiler to discover.
>> David Padua: Yeah, yeah.
>>Juan Vargas: It's an interesting paper.
>> David Padua: And when you look at that, you also read this one, and one thing Saeed did is
compare the effectiveness of vectorizing compilers then and now. And they're not better now, as
you will see.
All right. So then the next -- I have 40 minutes, right, so I have --
>> Juan Vargas: No, it's 30 minutes.
>> David Padua: Ah. Okay. So let me just --
>> Juan Vargas: 30 minutes.
>> David Padua: 30 minutes. Okay. So let me very briefly describe in two minutes all the
things going on. So there is of course compiling OpenCL, and I'll tell you, one piece is a translator
developed by [inaudible] group that transforms OpenCL, depending on the target machine, into
vector, multithreaded, or sequential code, and so on. And it's very effective, as you can see here. And
they are continuing to work on that.
Then there is also work by Wen-Mei Hwu on a data layout system that reorganizes structures of
arrays and arrays of structures into tiled forms, also a very effective mechanism for reorganizing. This
is an area of great importance, to be able to restructure the data to get good performance.
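The basic idea can be sketched very simply; the actual Illinois system does a tiled variant of this, but the underlying array-of-structures to structure-of-arrays reorganization looks like the following (illustrative types and names only).

    #include <vector>

    // Array of structures: the fields of one element are contiguous, but the same
    // field across elements is strided -- bad for vector units and GPU coalescing.
    struct ParticleAoS { float x, y, z, mass; };

    // Structure of arrays: each field is contiguous across elements, so lane or
    // thread i reads x[i], x[i+1], ... with unit stride.
    struct ParticlesSoA {
        std::vector<float> x, y, z, mass;
    };

    ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
        ParticlesSoA out;
        out.x.reserve(in.size()); out.y.reserve(in.size());
        out.z.reserve(in.size()); out.mass.reserve(in.size());
        for (const ParticleAoS& p : in) {
            out.x.push_back(p.x);  out.y.push_back(p.y);
            out.z.push_back(p.z);  out.mass.push_back(p.mass);
        }
        return out;
    }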
And then finally we did work on tiling, overlapped tiling, and hierarchically overlapped tiling, which
was very important for certain classes of applications. We are now working on eight-dimensional
tiling and on using these tiles to overlap communication with computation, with some
success.
And then finally the other project, and Joseph described this yesterday, is the work on
Deterministic Parallel Java. And basically, as you know, the idea is to insert annotations in the
program and have the compiler infer what regions of the data are being modified by the different
threads in the program and determine whether or not the program will be deterministic.
Let me just say there are novel features of DPJ that were not available before. For example, the
ability to have nested regions, which is very important when traversing [inaudible], for example. And
support for array computations, which of course is crucial. And I'm going to skip this. The other
important thing they did is allow for nondeterminism, so you can have nondeterministic
programs and still check for the absence of races. So they make sure
that the data that conflict are enclosed within an atomic section, within a critical region.
Okay. So that's that. So [inaudible] there is a lot to do. The truth is that the problem has
proved more complex than we expected in all dimensions, but much progress has been made and
we'll keep working. We, of course, will never solve it completely, but the current situation is
pathetic, so we need to do much more. Thanks.
[applause].
>> Juan Vargas: The next talk is on Selective Embedded Just-in-Time Specialization, and this is the main bet that
UC Berkeley has been working on, as you know. And Shoaib has graduated from Berkeley and is
taking a postdoc at MIT. Is that right, Shoaib?
>> Shoaib Kamil: Yes. That's right.
>> Juan Vargas: And if you please repeat the questions because we are taping the presentations.
Thank you.
>> Shoaib Kamil: So I was told to make sure it's the right resolution. Okay. So I'm Shoaib
Kamil. I'm from UC Berkeley where I've been a grad student for a few years. And I'm going to
be giving some updates on SEJITS, which is Selective Embedded Just-in-Time Specialization.
This is our technology for enabling high performance from high-level languages.
And it was really good to see some of the stuff that Professor Padua talked about, because, one, it
looks like there are other people doing similar things, and, two, the motivation was all there. So
vectorizing is arguably the easiest form of autoparallelism, but -- and the technology to
autovectorize is known, but it looks like it requires somebody to guide the use of that technology.
And that's kind of the approach we've taken.
So just to remind you what Selective Embedded Just-in-Time Specialization, or SEJITS, looks
like, we want people to be able to write in these productivity languages like Python, so their
application has portions that just run on the interpreter and use the normal Python infrastructure.
But there are other portions of the program which pass through domain-specific embedded
language compilers. And these compilers use the infrastructure we've built, which is called Asp,
to intercept these calls and translate the function in question into a low-level language. That
translated function is then compiled and dynamically loaded into the interpreter which then calls
that function and returns the result to the user.
So all of this looks like from the user perspective that they're writing in Python, but what's really
happening is that portions of that Python code are being translated into C or C++ or CUDA or
other low-level parallel language compiled and run and the result returned.
We also introduce caching so that this Just-in-Time compilation doesn't have to happen all the
time, only the first time something is run. So here is an example of an actual kernel that we can
specialize using the stencil domain-specific embedded language. If you look, it looks just like
Python code. If you're familiar with Python, lots of colons and indentation. But what this
computation is basically saying is for each of the interior points in a multidimensional grid, for
each of the neighbors of that point, apply this particular function.
So we treat this kind of like a declarative specification of the computation. So even though it
looks like usual imperative programming with a for loop and so on, all this is actually doing is telling
the infrastructure what the computation is, not how to do the computation.
In particular, there are certain constructs here that are domain-specific. So the first thing is that
we inherit from a particular class. This is what tells the infrastructure that this computation is
something that needs to be specialized. So the initializer for this stencil kernel class is the thing
that, you know, looks at the source code and does all the steps that I'm going to describe in a
minute.
We also have these special iterators which are used in the program translation.
So the first step is to introspect this function and get the abstract syntax tree. This is done
mechanically, automatically using Python infrastructure. We didn't have to really implement
anything for this, right? Python can introspect itself. So you get this syntax tree that represents
the computation. If you look closely, you'll see there's a function definition, and then there's
these for loops, and, you know, lots of other nodes.
So we use that AST and do a mechanical translation to domain-specific intermediate
representation. So there's some effort required for the person writing the domain-specific
embedded language compiler to write this translation from that Python AST into this
intermediate representation.
Now, the intermediate representation looks like a tree, it has some domain-specific constructs,
but, again, what it really is representing is not how to do a computation, but actually the
declarative specification of what that computation is.
This is done in our infrastructure by writing simple tree transformation rules that work on local
pieces of the tree. It uses our infrastructure and the visitor pattern to make it as easy as possible
to define.
The point of this translation is that once you get to this point, we can run correctness checks;
because it's declarative, we can do things like make sure that the
specification is correct by construction as we're translating it. We make sure that
everything we're translating is going to result in a correct specification.
So most if not all of the checking is done in this portion of the compilers.
So from there depending on the target machine or what the actual code is running on, it is
translated into a back-end domain -- a back-end general AST. So an AST in C++ or in CUDA
with parallel constructs, et cetera, et cetera. And then that tree is then optimized. Now, this is
where most of the domain-specific knowledge is used or the expert programming knowledge is
used. So in this process, you know, you figure out, well, for this particular construct I translate it
into this kinds of C++ and I know that I can do certain optimizations on this.
As Professor Padua pointed out, there's often cases where the user who is writing the code knows
that a particular transformation is correct. However, the compiler does not have the absolute
knowledge in order to know that.
A common example is things like loop unrolling. In certain cases the compiler can prove that this
is a correct transformation, but that's not every case in which it would be correct.
In this particular example, you know, I'm an expert in stencil computations. I come from the
high performance computing world, so I know how to do this. I know how to make a stencil
computation go fast. So all of my knowledge is embedded into, you know, optimizing this tree.
We've also implemented many of the common compiler optimizations, things like loop blocking
and loop unrolling. These are things that are part of our infrastructure, so you just need to apply
them in the proper way.
So what does the code kind of look like? Well, you end up with a very -- so this is actually a
very simple example of what comes out of the code I just showed you. And this is a simple
example in that, you know, it's 2D, it only has a couple optimizations applied.
So just so you don't have to squint, there's parallelization here. There is cache blocking which
ensures that you work on a working-set-sized cache block at a time. And there is loop unrolling
or register blocking, which helps expose things like vectorizability. It also helps, you know,
expose things like common subexpression elimination and so on.
So this is the code that gets passed to the compiler. And as a bunch of the autotuning work has
shown, if you write your code in a particular way, you can get the compiler to give you much
better performance than writing your code in the most naive way.
So that's what we're doing here.
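Roughly, the generated code has the shape sketched below. This is an illustrative reconstruction of a 2D 5-point kernel with OpenMP parallelization, cache blocking, and a 2-way unrolled (register-blocked) inner loop; it is not the specializer's exact output.

    #include <algorithm>

    // Illustrative sketch of the kind of C++ the stencil specializer emits.
    void stencil_2d(const double* in, double* out, int nx, int ny) {
        const int BX = 64, BY = 64;                  // cache-blocking factors
        #pragma omp parallel for collapse(2)         // thread parallelism
        for (int jj = 1; jj < ny - 1; jj += BY)
            for (int ii = 1; ii < nx - 1; ii += BX)
                for (int j = jj; j < std::min(jj + BY, ny - 1); ++j) {
                    int i = ii;
                    int iend = std::min(ii + BX, nx - 1);
                    for (; i + 1 < iend; i += 2) {   // unrolled by 2 (register blocking)
                        out[j*nx + i]     = 0.25 * (in[j*nx + i - 1] + in[j*nx + i + 1]
                                                  + in[(j-1)*nx + i] + in[(j+1)*nx + i]);
                        out[j*nx + i + 1] = 0.25 * (in[j*nx + i]     + in[j*nx + i + 2]
                                                  + in[(j-1)*nx + i + 1] + in[(j+1)*nx + i + 1]);
                    }
                    for (; i < iend; ++i)            // remainder iterations
                        out[j*nx + i] = 0.25 * (in[j*nx + i - 1] + in[j*nx + i + 1]
                                              + in[(j-1)*nx + i] + in[(j+1)*nx + i]);
                }
    }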
I've left out a couple of the major pieces of what SEJITS does. For example, I just mentioned
autotuning. I haven't shown you any of that, but basically we generate many different versions
and on each invocation of a function, we run a different version until we've explored the space
completely, and then we use the best version for subsequent invocations.
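A minimal sketch of that explore-then-exploit policy (illustrative only, not the Asp implementation): time each generated variant once on successive calls, then keep calling the fastest one.

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Naive invocation-by-invocation autotuning: run each code variant once,
    // record its time, then always invoke the best one found.
    struct Autotuner {
        std::vector<std::function<void()>> variants;   // generated code versions
        std::vector<double> times;
        std::size_t next = 0, best = 0;

        void run() {
            if (next < variants.size()) {               // still exploring
                auto t0 = std::chrono::steady_clock::now();
                variants[next]();
                std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
                times.push_back(dt.count());
                if (dt.count() < times[best]) best = next;
                ++next;
            } else {
                variants[best]();                       // exploit the winner
            }
        }
    };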
So this is SEJITS.
For the rest of my time, I'm going to give some updates on the current domain-specific
embedded languages and libraries we have, show you some new results, some users and
upcoming releases, and finally talk a bit about some future directions in what we're doing.
So first off, this is kind of the set of implemented domain-specific embedded languages and
libraries that we have. The red ones I'll give you some more detail about, but let me talk briefly
about a few of these.
The platforms we target range from, you know, x86 plus OpenMP to MPI to CUDA to Cilk Plus.
So we have basically every back end. We even have a cloud back end at this point.
I want to just -- let me just briefly describe each one. So the first one is for stencil, or structured
grid, computations. I'll spend some time talking about some results from that. The second one
is for graph algorithms, and I'll spend some time talking about that as well.
We've also implemented a simple parallel map, domain-specific language, which basically lets
you compose these languages with map, so you can do high-level parallelism there.
One of the big success stories, I would say, is the Gaussian mixture modeling library we have,
which has been released. I'll talk a bit about the infrastructure that that's a part of. But, you
know, that's able to get performance that's 250X real-time faster than what the domain scientists
were using before, which was handwritten, hand-optimized CUDA.
We also have a matrix powers communication avoiding library, which I'll talk a bit about. And
we have three new domain-specific embedded languages that are under development right now.
So the first one is for the Bag of Little Bootstraps algorithm, which is a statistical analysis tool,
and this one is used by people who want to evaluate machine learning and other kinds of things.
We have a version that can target Cilk Plus on a local multicore machine as well as a version that
can run on top of the Spark Infrastructure in the cloud.
So that's under development.
We also are working with the people who make GraphLab, which is a C++ parallel library for
machine learning using the graph representation. We're working with them to replace their
current API. Their current API does something quite convoluted. So first they created a Java
API for their C++, then they used JPython -- or Jython to write a Python API on top of the Java
API on top of the C++ API. So we're getting rid of all of those layers and we're using our
infrastructure to replace the current API so that you can get C++-level performance while writing
in Python for your graph algorithms.
And finally we are working on a communication avoiding parallel recursive structural pattern.
So what this comes from is the insight that there's lots of algorithms that have a particular
recursive structure. And this particular recursive structure allows you to choose at any point
whether you want to run the different leaf nodes of the computation in parallel or whether you
want to run them serially, depending on the available resources.
So a preliminary implementation of this was applied to matrix multiply, just, you know, the
standard thing that has been optimized the hell out of, but we were actually able to beat MKL,
Intel's Math Kernel Libraries, using this particular recursive structure.
Now, we're beating MKL by using MKL but being more intelligent about when to do things in
parallel and when not to do things in parallel.
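A hedged sketch of that recursive structural pattern (illustrative only, not the Berkeley implementation): each level splits the problem and decides, based on a remaining parallel-depth budget, whether to run the independent sub-problems in parallel or serially. In a real version the leaf would call a tuned kernel such as a vendor BLAS; here it is a plain triple loop, and n is assumed to be a power of two for simplicity.

    #include <future>

    // C = C + A*B on n x n row-major blocks with leading dimension ld.
    void matmul(const double* A, const double* B, double* C,
                int n, int ld, int parallel_depth) {
        if (n <= 256) {                       // leaf: a tuned kernel would go here
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
            return;
        }
        int h = n / 2;
        auto sub = [&](int ai, int aj, int bi, int bj, int ci, int cj, int depth) {
            matmul(A + ai*ld + aj, B + bi*ld + bj, C + ci*ld + cj, h, ld, depth);
        };
        // One C quadrant: C(ci,cj) += A(ci,0)*B(0,cj) + A(ci,h)*B(h,cj),
        // accumulated sequentially to avoid races on the same block.
        auto quadrant = [&](int ci, int cj, int depth) {
            sub(ci, 0, 0, cj, ci, cj, depth);
            sub(ci, h, h, cj, ci, cj, depth);
        };
        if (parallel_depth > 0) {             // still have cores to spend: go parallel
            auto f1 = std::async(std::launch::async, quadrant, 0, 0, parallel_depth - 1);
            auto f2 = std::async(std::launch::async, quadrant, 0, h, parallel_depth - 1);
            auto f3 = std::async(std::launch::async, quadrant, h, 0, parallel_depth - 1);
            quadrant(h, h, parallel_depth - 1);
            f1.get(); f2.get(); f3.get();
        } else {                              // out of budget: run serially
            quadrant(0, 0, 0); quadrant(0, h, 0); quadrant(h, 0, 0); quadrant(h, h, 0);
        }
    }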
So first let me talk a bit about performance that we're getting out of the stencil domain-specific
embedded language. This is a set of benchmark kernels on the X axis. On the Y axis is the
fraction of peak performance. The last set of bars is the geometric mean, and we're comparing
against Pochoir, which is the state-of-the-art domain-specific language for stencils using Cilk
Plus. It's developed by MIT. And it gets excellent performance using the cache-oblivious
algorithm.
So we are able to get a geometric mean of 93 percent of peak performance. And this peak
performance comes from looking at the kernels and finding out that they are bound by the
memory bandwidth performance of the machine. And 93 percent of peak performance is quite a
bit better than you could get in Python. Just as an order of magnitude kind of thing, each of these
is about 2- to 3,000 times faster than what you would get by writing it in Python.
>>: So what did you [inaudible]?
>> Shoaib Kamil: So this is a single node four core i7 machine.
We compared to the autoparallelizer in GCC which uses the polyhedral framework, and we're
able to outperform that by up to 11X and about 2 1/2X faster than Pochoir. And, as I said, the
geometric mean here is 93 percent of attainable peak. Yes.
>>: This is the first time I get to see comparison of this and Pochoir on the same problem.
>> Shoaib Kamil: That's right.
>>: Please repeat the question.
>> Shoaib Kamil: Sorry? Oh. The question was this is the first comparison that Robert has
seen on the same hardware for Pochoir and our infrastructure.
So I mentioned the communication -- sorry.
>>: [inaudible] the language, right, the difference is that they generate [inaudible] algorithm and
you generate [inaudible].
>> Shoaib Kamil: Well, so the -- the question was the difference is not due to language, whether
the difference is due to language or whether it is because they're using a cache-oblivious
algorithm and we're using the cache-aware algorithm.
Well, I think the difference is -- well, it's certainly not due to language. We're both using C++.
We're both using the Intel compiler. But the difference is due perhaps somewhat to the use of
cache-oblivious algorithm versus cache-aware algorithm. But I think it's more so that because
we're doing our compilation Just-in-Time, we have available to us every parameter that you
would possibly inline into the computation. So we know the sizes which tells us what kind of
blockings are valid, and we explicitly generate all the valid blockings, and we use autotuning.
So I would say it's probably more related to our use of autotuning than it is to the algorithm
because conceptually, if you look at the algorithms, they do the same thing. However, because
we can use autotuning, we're able to get the fastest possible performance from a wide variety of
code variants.
>>: [inaudible].
>> Shoaib Kamil: So the question is they have a way to add autotuning to their process. Is that
a statement, or are you asking --
>>: [inaudible].
>> Shoaib Kamil: Oh. So I'm not aware of this functionality. When I presented this in front of
the Pochoir people, they think they can improve their performance. And, you know, where I'm
sharing the code with them, so hopefully we can figure out how to make their performance
better.
So the other thing I wanted to talk about in terms of performance results is that we've talked a bit
about our communication avoiding matrix powers library. And what this particular kernel does
is it's a building block of communication avoiding Krylov subspace methods. And what this
chart shows is the performance of three different implementations of a communication avoiding
CG solve. I'm sorry. Three different implementations of a CG solve, one of which is
communication avoiding. So the first one is kind of the standard: the blue is the standard CG
that you would get if you're writing in Python and using a good performance library. This is
scipy, which basically implements all of the MATLAB functionality in Python. It's what most
people use when they're doing this kind of computation in Python.
The red is using Intel's MKL. And the yellow is our communication avoiding CG with matrix
powers only doing one matrix multiplication at a time. So this is kind of the baseline that you
can compare with the noncommunication avoiding stuff.
And the green is the communication avoiding CG using our matrix powers kernel.
The dark bottom parts of the bars are the times spent in the matrix powers portion of the CG
solve, and the -- I'm sorry, the matrix multiplication part of the CG solve, and the light bars are
the rest of the computation.
What this really is showing is that if you look at just the matrix powers part, we're able to
outperform Intel's MKL by 2X which results in much faster solves.
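For context, the matrix powers kernel in a communication-avoiding Krylov method computes the whole basis for s steps in one pass over the matrix (this is the standard formulation, not anything specific to this implementation):

    \{\, x,\; Ax,\; A^2 x,\; \dots,\; A^s x \,\}

reading A (and the needed ghost region of x) once, instead of performing s separate sparse matrix-vector products with a synchronization between each.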
So I want to talk a bit about some new work that we've been doing in collaboration with people
from UC Santa Barbara. And we've been working with a Python infrastructure that they've built
called the Knowledge Discovery Toolbox. And the Knowledge Discovery Toolbox is basically a
high-level graph algorithms library for domain scientists. So most people will use the, you
know, KDT to write their -- to run their graph algorithms, do things like [inaudible] search or,
you know, clustering and things like that. It's built on top of the combinatorial BLAS which is a
low-level C++ with MPI library that casts graph algorithms as linear algebra operations.
And that, you know, uses MPI and can use OpenMP, Cilk, and things like that. This is not yet
implemented. That's why it's in a dotted line.
So we're adding in the SEJITS infrastructure here to bridge the performance gap between things
that are written by domain scientists and things that are written by people at the low-level
distributed combinatorial BLAS.
An example of this is a domain-specific embedded language for filtering. Now, oftentimes you
have a graph that has attributes on the edges or the vertices, but you want to perform your graph
algorithm only on a subset of this graph. So, for example, you have phone calls and texts, and
you only care about performing your graph algorithm on the phone calls.
So what we've done is we've implemented a domain-specific embedded language that allows you
to write filters, and the filters are basically functions that return true if this edge should be
included and false if it should not.
KDT itself has infrastructure for doing these filters, but in KDT the filters are written in Python.
And at each edge during the graph algorithm, it has to up call into Python and decide whether
this particular edge should be included or not. So that incurs a pretty big performance penalty.
So our domain-specific embedded language looks just -- you know, uses Python. It uses the
same kind of infrastructure to write that. Here's an example of a filter that checks, you know, if
the count is greater than zero or -- and if the time is less than some particular time you've passed
in, and these are edge attributes of the graph.
So using this we can speed up the performance quite a bit. So here's a graph that shows the mean
BFS time on a log scale on the left side, and on the X axis is filter permeability. That's the
percentage of edges that pass the filter. The performance that you should be comparing is the
green which is KDT's current implementation that, like I said, runs in Python, and the blue,
which is the one that uses SEJITS.
Now, in order to build a baseline for this, we also implemented the same thing at the low-level
C++ library just to compare. Now, this isn't something that, you know, we would expect a
domain scientist to be able to do because it involves writing low-level C++ parallel code, but that
gives a good baseline.
In any case, the performance here is 5 times faster than the current KDT, and the slowdown
versus writing the whole thing in C++ is 25 to 50 percent.
What I'm showing here is performance on a -- on 36 cores of a 40-core Xeon four-socket
machine. We also ran this in the distributed setting on Hopper at NERSC, which is a
large-scale AMD -- I'm sorry, it's a large-scale Cray XE6 using AMD processors. And we've
seen similar results there.
So I'm going to switch in the last few minutes I have here and talk a bit about what's going on
with users and so on.
So we've had over a thousand downloads of our ASP infrastructure from the Python package
index, which is kind of the mechanism for distributing Python packages. The KDT work I
showed you is going to be integrated into mainline KDT and released this year. This is going to
be, you know, kind of the first major outside users releasing something using our infrastructure.
And the goal is that the second DSEL we've implemented there, which I haven't really talked
about, is also going to be part of the release after that.
In the first two weeks after the Gaussian mixture model library was released at a -- at the ACM
multimedia conference, we had over 800 downloads of that library. And we have outside users
at Carnegie Mellon and other places that are using this in production to replace their current C++
code.
Yesterday Tim Mattson showed some numbers on an application which does protein docking.
This is an application that was supported by Henry Gabb who was at Intel at the time. And it
gets 290X speedup by running in the cloud without any change to the ported application.
And finally Tim's also developing an FFT specializer and giving us great feedback on things we
need to work on in terms of usability and in terms of making it easy for people to develop these
domain-specific embedded languages and autotuning libraries for Python.
So some future work. So right now the autotuning in the results I've shown you is kind of
the most naive autotuning you can do, which looks at all the particular versions and just runs them
in some kind of random order.
So we think by adding machine learning, and we have some evidence of this, we can get much
faster searches that converge on the optimal version.
So there was a paper by Archana Ganapathi and Kaushik Datta a couple years back
for stencils that shows that machine learning can greatly improve the speed of converging on the
best version. So we want to integrate that kind of search into our infrastructure and make it easy
for other people to use in their specializers, in their domain-specific embedded languages.
Some of the ideas from SEJITS are also finding their way into hardware prototyping. I think
Krste's going to talk about that later on today. And we're also exploring one of the big problems
in using these, which is composition. And one of the directions we've seen is that pattern-based
frameworks that are used for a particular kind of computation or a particular class of programs
such as multimedia programs or other things like that is a good testbed for bringing composition
to these domain-specific embedded languages.
So there's some work from Berkeley that's also going to be talked about later today called
PyCASP, which is building an infrastructure for doing multimedia application -- multimedia
applications in Python and get fast performance from that. So it consists of a bunch of library
components, some customizable components and structural patterns that you can put together in
different ways to build your application.
There's going to be a whole talk on that later today, so you'll hear much more about that.
So to conclude, we've seen really high performance in kernels. We've seen examples of
applications authored by people at Berkeley, and also people outside of Berkeley, that are
using this infrastructure to build real applications.
We're doing some work on composition, and we think that this infrastructure and ideas from it are
really critical for enabling agile hardware/software co-design and design space exploration.
Thank you.
[applause].
>> Shoaib Kamil: Lots of acknowledgments because there's lots of people who worked on this.
>> Juan Vargas: Do you have questions for Shoaib?
>>: There's a question. On your previous answer, that [inaudible] performance is that logic and
taking advantage of that -- do you have sensitivity studies on how much you gain by that?
>> Shoaib Kamil: Yes. And I can show you a graph offline.
>> Juan Vargas: What was question?
>> Shoaib Kamil: So the question was do you have sensitivity studies on how much
performance you gain by using autotuning. And I certainly do. It's in my thesis, so I can show
you a graph.
>>: [inaudible] what type of effort did you make [inaudible]?
>> Shoaib Kamil: That's a good question. So we're working with Koushik Sen on some
correctness infrastructure that allows you to basically prove that the generated code and the
original code are doing the same thing, or at least the intermediate results are correct.
So in terms of the stencil stuff, the correctness is more so embedded in the domain knowledge.
So we're not doing formal verification online for any of this stuff, but we have applied those
techniques that Koushik Sen and his students have come up with and actually proved that the
stencil DSEL is outputting correct code.
>>: Some kind of [inaudible] taken from [inaudible].
>> Shoaib Kamil: That's right.
>>: Just an additional comment on that. Some of the transformations that are domain-specific
use associativity and distributivity of arithmetic to do the transformations, and that changes
the floating point properties, so you don't get exactly the same answers. And sometimes they can
be very different, and a lot of domain knowledge is required to know which transformations are
okay.
>> Shoaib Kamil: So Jim pointed out that the domain-specific transformations we're doing do
assume associativity and commutativity and other properties of floating point. So they can change
the answer in terms of the floating point correctness. So that requires having domain
knowledge of what things are okay to do and what things are not okay to do.
>> Juan Vargas: The last speaker for this session is David Callahan. David is a Distinguished
Engineer at Microsoft. He has a Ph.D. from Rice University, and he spent some time at Tera
Computer and Cray Computer. Since joining Microsoft he has been working on parallelism and
concurrency, and he's going to talk about his latest story, C++ AMP.
>> David Callahan: Good morning, everyone. Thank you for coming to Redmond. I'm
delighted to have a chance to talk to you a little bit about what we've been doing in Visual Studio
in parallel programming.
So developer division has two broad products. One is Visual Studio, the other is the .NET
developer platform. And our broad mission is to make sure developers are successful targeting
Windows platforms, of which there is a great variety.
I've sort of been at Microsoft for six or seven years now helping to shepherd along the parallel
programming assets. We made a big investment in Visual Studio 2010 around multicore, and
tomorrow I think we'll have a member of my team come and talk both about that and the
extension work that we're continuing on, the next one.
Today I'm going to talk about C++ Accelerated Massive Parallelism, or C++ AMP, which is
really our offering to sort of tackle the problem of GPU programming.
So let me give you a little context here. We know Moore's law continues, but the hardware
architects have been talking about power, power, power, power, power for years now. And so
sort of a natural conclusion of that combination of cheap transistors but can't turn them all on is
that we'll see specialization of silicon to particular workloads, which is already somewhat
pervasive in a lot of places.
And one of the really important workloads that exists is the rendering workload, which today is
handled by GPUs.
So this combination of factors has led to a really fast evolution in what GPUs are and where
they're used. So part of it is in the programmability of the GPUs, and I'll start sort of six years ago
when NVIDIA introduced C for CUDA and said, hey, you don't have to work in a graphics
framework to get the power of the GPU available to you.
And once they made that transition, they were sort of put on this path of making GPUs
programmable like CPUs in ways they've never been before, subject of course to the fact that
graphics is the workload that matters, so they're still dominated by engineering for that workload.
Three years after that came out, we saw the execution model that was sort of pioneered in
CUDA become standardized both in OpenCL and in DirectX11 Compute, so the two major
client operating systems both embraced this model and went forward with it as a core capability.
Along that same time, of course, we saw the rise of mobile form factors as a huge concern. They
embrace this rich visual experience now coupled with touch and that interactivity with the visual
experience is something that's only delivered through a specialized hardware system and
integrated GPU in that solution space.
NVIDIA also built out special SKUs saying, hey, this stuff will work in a server environment as
well, and they'll tackle the HPC market as aggressively as they're attacking the mobile market.
At the last GPU technology conference they ran, they had a factoid about the growth of this at
SC where in 2007 there was one booth that talked about GPUs. But last year 80 percent of the
booths were talking about GPUs. A big impact in the HPC space.
And of course now you can go rent those GPUs by the hour at Amazon first and then other
places.
And then the last step of this trend was the embracing of moving these specialized hardware
capabilities onto the main CPU die and building a composite heterogeneous package. AMD
released their first APU last year, and there's Intel's Ivy Bridge -- Intel has had graphics accelerators
as integrated parts for a long time, but this one is a DX11-compliant part and is really a candidate
for running the compute workloads that we're interested in.
So in this context, then, we started investing in how we can make this sort of emerging
commodity platform, a heterogeneous platform, accessible to a very broad range of developers.
So that's what we built in C++ Accelerated Massive Parallelism. Focus on data parallel
algorithms, which are the part of the algorithm pattern space that is most effectively
optimized by this new generation of wide vector processors. We embrace the heterogeneous
nature of the computing, so it's not just one kind of processor but two that you must deal with. Put
this in a mainstream context. So put it into C++ and give support in a major IDE like Visual
Studio which will support not just the language but also debugging and profiling. And get tools
in the hands of users that can use this effectively for productivity and portability and
performance. So balancing these three is sort of the design challenge in the API.
So this will be now part of Visual Studio 2012. It's available in a release candidate now. Our
approach to it was to say, hey, this is just part of C++. So you build a C++ app, you include our
amp.h header, and suddenly you can be doing GPU programming. It's just sort of part of the product,
not a special bolt-on add-on. And that goes all the way to how we think about acquiring the tools
and how you deploy the resulting opportunities.
We do not think, however, of this as a Microsoft technology per se. We have a first
implementation, but we think that this is the direction that the C++ community should move.
We've created an open specification for our API design and put behind that the Microsoft
community promise which will guarantee a royalty-free license to whatever IP you need to
implement the spec, and we are actively looking to have implementations on Linux on Android
and the Mac OS.
At its core, AMP is just an STL-style library, but there are a couple of extensions to C++ that we made
to actually enable targeting the diverse range of hardware that we're interested in, and I will dig
into those. One of them is a pretty novel extension, the other is a tweak on sort of an existing
idea.
Our implementation is a layer on top of our graphics platform, DirectX11. However,
DirectX11 doesn't shine through, so it should be feasible to build an implementation on top of
OpenCL or even perhaps OpenGL. We haven't done that, but our intention is not to make
DirectX shine through except in some interop APIs.
Give you a very brief introduction to the core concepts of what is in C++ AMP, starting with
matrix multiplication. I'll do a very simple implementation, and later in the talk I'll do a more
sophisticated implementation that will give a more substantial performance advantage.
So here our starting point is a simple C++ version. The usual textbook three loops, inner product
in the innermost loop. The two outer loops are then completely parallel.
The interface starts with C++ [inaudible] vector types and some explicit bounds passed in. You
know, one of the weaknesses of C++ is it hasn't actually standardized on a multidimensional data
type, which is pretty important to do data parallel programming with.
So one of the things we added in C++ AMP, and this is an essentially complete C++ AMP
program except for a missing amp.h header and a namespace inclusion, is the notion of
an array view, which allows you to overlay onto an existing linear data structure a
multidimensional view of that data. And so we do that for the two input vectors A and B, which
are flagged as const -- we're not going to modify them -- and for the output vector C. Then we have a
library-based parallel loop construct, which is actually very similar to what we did for our task-based
parallelism as well, and so this should be familiar to users of our earlier work or Intel's
TBB.
This parallel for each takes two parameters. One of them captures the extent information of a
parallel loop nest. So the number of iterations in each sort of dimension of the problem. So in
this case that would be M and N. And then it takes a function closure which in C++ is called a
lambda. And we'll invoke that lambda once for every point in the implied iteration space and
passing in an index vector to tell you where you are in it. So that's the core data parallel
paradigm of for every point in this space do this function.
Our most significant addition to the language is this restrict keyword, which can be
applied to functions and to lambda closures, and it creates a subset of C++ for the body of those
functions. That subset defines what is legal to run on a GPU, and it is also a hook for the
compiler to understand when multiple implementations are necessary. And I'll talk a little
bit more about the nature of the restrictions later.
Note that the body of the inner loop is essentially unchanged, and we can pick off the row column
values out of the index value.
The array views are also our hook to understand if you're running a distributed environment
where the GPU has its own memory what needs to be copied over there and how. Partly that's
through what's captured in the lambda. It's also partly how we marshal the array views over.
And I'll talk a little bit more about that later as well.
Now, because we are running potentially in a separate address space, potentially
asynchronously to the CPU, there are also some API extensions here to talk about what data
needs to be moved, or doesn't. So the discard data API on C is an assertion: hey, the parallel for
each is just going to overwrite this data, so don't bother copying it over to the GPU; you can
treat the underlying data as undefined. And at the bottom there's a synchronize which sort of
does the reverse, saying, hey, this data may have been modified on a GPU, but I want to store it
back to its home location in the CPU data structures. Please do that now for me.
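Putting those pieces together, a sketch of the simple multiply he is describing might look like the following; it follows the published C++ AMP examples, so the actual slide code may differ in details. A is M x W, B is W x N, and C is M x N, all stored row-major in std::vector<float>.

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    void matmul_amp(const std::vector<float>& vA, const std::vector<float>& vB,
                    std::vector<float>& vC, int M, int N, int W) {
        array_view<const float, 2> a(M, W, vA);   // overlay 2D views on flat vectors
        array_view<const float, 2> b(W, N, vB);
        array_view<float, 2> c(M, N, vC);
        c.discard_data();                          // no need to copy C over to the GPU

        parallel_for_each(c.extent,                // one logical thread per (row, col) of C
            [=](index<2> idx) restrict(amp) {      // restrict(amp): GPU-legal subset of C++
                int row = idx[0], col = idx[1];
                float sum = 0.0f;
                for (int k = 0; k < W; ++k)
                    sum += a(row, k) * b(k, col);
                c[idx] = sum;
            });

        c.synchronize();                           // copy the result back into vC
    }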
So this is now essentially a complete C++ AMP program where most of the concepts we've
introduced are really kind of around the array abstraction, sort of the size of it. So it should be
pretty straightforward for most of our existing task-based customers to sort of understand this,
which sort of gets to sort of some of our -- I'm sorry, go ahead.
>>: So if it doesn't fit, you'll do all the blocking that's necessary to move it back and forth
between CPU and GPU memory?
>> David Callahan: So the question is if the data that is accessed in the parallel kernel doesn't
actually fit into the video memory that's available on the card, what happens.
So in our first implementation, it's the responsibility of the developer to ensure that it fits.
You'll get an exception if you exceed it. There is not enough information in general in the API to
allow us to automatically block this.
>>: Can I ask you a question?
>> David Callahan: Yes.
>>: Can you detect if the discard data was put in incorrectly, like say you [inaudible]? Do you
have any checks?
>> David Callahan: So the question is: are there any checks in the system to verify, if you put
a discard data call in incorrectly and then use the stale data, whether it will detect that. We don't do that
in the first version.
So we have this natural tension now between portability and productivity on the one hand and
performance on the other. And this will show up in how much of the machine we let shine
through in the abstractions. We know that for portability and productivity reasons you
really want to minimize the number of concepts that developers are exposed to beyond
what they already know. That's just generally true: any time a new technology
comes along, the richer the set of concepts, the more trouble people will have embracing it.
And you also need to make sure that they know exactly when it's a good idea to apply it. That
leads us to focus fairly narrowly on the patterns these data parallel constructs
apply to. You also want good composition with the host environment, which is C++, and we've
achieved that by having a strongly integrated type system that spans both classes of code.
And you want to make sure that you pay attention to the key patterns your
developers will be facing and give good support for those patterns. In that spirit we
focus on the parallel loop as the pattern we're trying to make sure we support well.
In the more general multicore space, there are lower-level tasking constructs from which
you could build these patterns, but here we focus just on the pattern itself.
On the performance side, we're faced with a lot of decisions: how much of the memory hierarchy
to expose, what memory alignment requirements to impose, what to do with the exposed concurrency
between the two kinds of processors you have, and how the hardware scheduling mechanisms might get
exposed.
These are all areas where the architecture is still considerably in flux. Even today the range of
GPUs is fairly diverse. And since we'd like some future-proofing, so that the code we write
today runs well in a few years as well, that drives us to minimize the amount of hardware we
actually expose.
So I'd like to dig down into the bottom four of these in the time I have left and cover
some details of what we're attempting to achieve.
So let me start by talking about the restrict(amp) function qualifier. Our implementation
builds on top of DirectX 11, which represented an industry standardization of capability
that is somewhat less than what CPUs provide. So there's a bunch of rules about what happens once
you're in the GPU space. You have to stay in the GPU space; you
can't call back into CPU functions. Most of the GPU architectures don't support
indirect calls, and in many cases not even direct function calls, so you have to carve off a subset of the language which
you can fully inline and map to that more limited capability. There's typically no memory
management and very restricted pointer use.
These are things that, if you look at the most modern GPUs from some of the leading companies,
have already been relaxed in the hardware, and we would expect over time to relax this list as well; we have a
strategy for that.
The other thing is that this restrict is part of the type system. We currently have two notions of
restrict: restrict(amp) for our GPU targeting, and restrict(cpu) for things that just run on the CPU.
restrict(cpu) is the default, so if you don't say anything, that's what you get.
And you can have overloads, so you can tailor the implementation of a function to the context it's
running in. You can also write functions that are guaranteed, at compile time, to be safe to
run in either context.
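As a rough illustration of how those overloads can look in source (the function names here are just examples, not product APIs):

```cpp
#include <cmath>
#include <amp.h>
#include <amp_math.h>

// Overloads on the restriction specifier: the compiler picks the body
// that matches the context of the caller.
float my_exp(float x) restrict(cpu) { return std::exp(x); }
float my_exp(float x) restrict(amp) { return concurrency::fast_math::expf(x); }

// A single definition checked by the compiler to be legal in either context.
float squared(float x) restrict(cpu, amp) { return x * x; }
```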
We anticipate there could be other uses of restrict, so we worked really hard to make sure that it
was orthogonal to the rest of the system. Maybe there are restrictions that are appropriate for CPU
vector targets. Maybe there's a notion of a pure function we'd want to carve out. So it's intended
as a general addition to C++ that we use to enable heterogeneity.
We also worked with some of our hardware partners to think about what the evolution
of this set of restrictions over time would be. If you go read the open spec linked at the bottom, there's a roadmap
there of the likely big buckets of how we might move from our current AMP, AMP 1, to an AMP 2
which would have a subset of this list.
So I've already talked a little bit about array view and its role in providing a common namespace
between the CPU host and the GPU. The real function of it, however, is to provide
copy-as-needed semantics. If my parallel computation is running on a discrete memory
system, then at the time that I launch the kernel I can observe in the lambda what needs to be
copied over. And subsequently, if there's a reference on the host, I can do the appropriate
bookkeeping to say, oh, the current copy is on the GPU, I need to bring it back.
On the other hand, we know that we're moving to an age where integrated systems will become
the norm. They'll be able to share physical memory, they'll share caches. Eventually they'll
share cache coherence protocols. And so those copies will become unnecessary. So we want to make
sure that we have a system that allows you to span that transition. You can write code that
tolerates a distributed memory environment and that will then light up and improve on an integrated
system.
We also provide a mechanism to explicitly allocate data on a GPU. So there's an array type,
kind of analogous to array view, but it is a storage container rather than just an access method.
And so even today there's a way to talk about placing memory on the GPU, because there are
scenarios in which that makes sense.
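A minimal sketch of that explicit placement, using the array container (sizes and names are illustrative):

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

void ExplicitPlacementExample() {
    std::vector<float> host(1024, 1.0f);

    // array is a storage container that lives on the accelerator, unlike
    // array_view, which is an access method over existing host storage.
    array<float, 1> gpu_data(1024, host.begin());   // allocate on the default accelerator, copy in

    // ... launch parallel_for_each kernels against gpu_data here ...

    std::vector<float> result(1024);
    copy(gpu_data, result.begin());                 // explicit copy back to the host
}
```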
A harder concept is where we actually start to say, okay, machines are not as simple as we'd
like them to be, and they kind of never have been; particularly when you get to the memory
hierarchy, there's a disconnect between how I want to think about the machine and how it really
is.
And so we chose to embrace a fairly simple extension which is, hey, these machines, they're
multicore and every core has a cache, and you have to provide some mechanism to talk about
that locality within an otherwise data parallel context.
So this is not a model we invented. We inherited it from where the graphics guys were with
DirectX 11, but it's a pretty good model, it spans a lot of interesting things, and it's a compromise because
this part of the architecture is probably pretty stable. I don't know what the next stable point in the
memory hierarchy is likely to be.
So we introduced, within our data parallel computation space, the idea that you can break that space
into tiles (blocks, as Padua called them), and these are sets of iterations that are going to be
mapped to the same physical processor. Now, these physical processors are typically vector
processors, so an iteration in our space may actually map to a lane in a vector processor, not a
thread of its own. But we will call it a thread nonetheless.
Once we have this idea that a chunk of my space can map to a physical processor, we can let
certain capabilities that tend not to scale very well shine through. In particular,
we introduce the notion of barrier coordination, so all the threads in a tile, all the threads on a
processor, reach some point before any proceed, and now we can introduce a notion of shared
storage there that is suitable both for the temporal caches we see on CPUs and for the
scratchpad memories that are common as software-managed caches on GPUs.
Let me go back to my matrix multiply. Remember that it's easy to do it in a block style: you
can take a block of columns and a block of rows and then accumulate them into a block of the
output.
So the block of the output is our tiling strategy for the data parallelism, and we can stage how
we do that row-block by column-block reduction into it. For example, when we bring in one
block of A and multiply it against the corresponding block of B and accumulate into the output block, those
subcomputations are now things that can fit in a private cache of a multiprocessor.
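In equation form, with the matrices partitioned into blocks indexed by capital letters, each output block is a sum of block products (this is just the standard blocked formulation, not slide content):

```latex
C_{IJ} = \sum_{K} A_{IK} \, B_{KJ}
```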
And so we can take this model forward into the API design: here's matrix multiply in the
simple data parallel case that has no manifest memory locality to exploit, and here's the variant
of it where we do introduce those things. We introduce a tile size. In this case it's a static
constant; a limitation of the V1 is that it must be a compile-time constant.
We have a variant of the parallel for each in which you take the extent space and tile it based
on these fixed-size blocks. So this is just an overloaded parallel for each that handles these
tiled extents.
We change the parameter to the lambda: instead of taking in an index, it takes a tiled index. A
tiled index is a slightly larger thing that has a lot more information related to what's
going on in a tiled context, but we can still pick off the global index of where you
are in the original space. And so these two variants haven't made any substantive change in
the algorithm. All we've done is tile the iteration space into blocks.
The more interesting change, then, is how we get to the locality. And that involves the
following slide, where now we introduce a tile static local variable. tile_static is our second
language extension. It's a new storage class that can only be used inside a tiled parallel for each.
And it allows us to create little arrays which will be mapped into the scratchpad memory of
GPUs or could fit into the L1 caches on a CPU.
Now we take our inner product loop and strip-mine it into chunks, so each chunk of that inner
product will load blocks from global memory and tuck them into the tile-local memory. The loading
is done cooperatively: every thread in the tile does one data motion. Then, when those loads are
all done, we do a barrier to verify that all those loads have settled, and then we can drop down
into that chunk of the inner product.
Those reads now come out of the cached copies, and so the effect here is that instead of every thread
going to memory for every array reference, there's, in this case, a 16-fold reduction in the global
memory traffic, achieved through this explicit caching strategy.
And now a second barrier protects the reads in the inner product from the writes of the next
iteration. So this is how we've taken our model of L1 caches attached to processors and
put it into the API set for the programming abstraction. Yes, Jim.
>>: So what about determinacy? Because you have a sum reduce, which could happen in any
order, and one issue, depending on how things are dynamically scheduled, is that you get the
sums in a different order and get a different answer. So is there anything in your language extensions
that addresses that?
>> David Callahan: So Jim's question is about how deterministic the programming model is in
this particular example.
This particular example is actually completely deterministic, because the output sum is done
the same way every time: it's done by one thread and it's accumulated in the same order every time.
The concurrency in this model is between different tiles, and those are completely unordered. But
between different tiles in this model there are no interactions.
A second question is whether it is possible to introduce data races into this model. And the answer is yes,
it is. For example, if we dropped the second barrier from here, we would create a data race
between some threads finishing the reads of their sum reductions and other threads starting to
overwrite the shared data that's involved.
We actually have some static analysis to detect those data races, and we can run in a mode in
which there's some dynamic analysis to detect them as well. But it's not guaranteed to be a
foolproof system.
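For reference, here is a minimal sketch of the tiled, tile_static version being discussed, again reusing a, b, c, and W from the earlier sketches and assuming the matrix dimensions are multiples of the tile size; it is illustrative, not the shipped sample:

```cpp
static const int TS = 16;

parallel_for_each(c.extent.tile<TS, TS>(),
                  [=](tiled_index<TS, TS> tidx) restrict(amp) {
    int row = tidx.global[0], col = tidx.global[1];
    float sum = 0.0f;

    // Strip-mine the inner product into chunks of TS.
    for (int k0 = 0; k0 < W; k0 += TS) {
        // tile_static: shared per-tile storage, mapped to GPU scratchpad or CPU L1.
        tile_static float locA[TS][TS];
        tile_static float locB[TS][TS];

        // Cooperative load: each thread in the tile moves one element of each block.
        locA[tidx.local[0]][tidx.local[1]] = a(row, k0 + tidx.local[1]);
        locB[tidx.local[0]][tidx.local[1]] = b(k0 + tidx.local[0], col);

        tidx.barrier.wait();   // first barrier: all loads have settled before anyone reads

        for (int k = 0; k < TS; ++k)
            sum += locA[tidx.local[0]][k] * locB[k][tidx.local[1]];

        tidx.barrier.wait();   // second barrier: protect these reads from the next chunk's writes
    }
    c[tidx.global] = sum;
});
```

With a 16 by 16 tile, each element of A and B is fetched from global memory once per tile rather than sixteen times, which is the 16-fold reduction mentioned above.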
So in deference to time, I'm not going to talk about our compilation and deployment model in depth.
We build on an existing graphics infrastructure. We ship a fat binary. We still have a final
JIT to the target, which gives us great portability across different hardware, both in the field now
and over time. It's a cumbersome model that can lead to finger pointing, but it does give that
core portability, which is pretty interesting.
So those are the two points I tried to cover today. We have a ton of developer-facing information
about this tool on the Web. There's a team blog that you can get to by searching for C++
AMP in a nutshell, or it's [inaudible] at the bottom of the slide. And of course all of this
technology is available in the release candidate, so you can go download that today, try it out
yourself, and see what you think.
Any questions? More questions? All right. Thank you.
[applause]