>> Juan Vargas: Welcome, again, everybody. I hope you had a very nice evening last night after the incredibly exciting sessions we had yesterday. So today we're going to have the UPCRC workshop. This is what Microsoft calls the third day event after the faculty summit. And oftentimes what happens in the faculty summit is people just have these short times to present [inaudible] directions of research. So this session is [inaudible] deeper dive into the topics. And we work with Illinois [inaudible] Microsoft to get this program ready, and we're going to go [inaudible] faculty summit just on presentations by the partners, we decided to go by teams. So the teams are programming systems, tools for parallel testing and debugging, applications, architecture. And at the end, like yesterday, we'll have another panel that we hope is going to be as exciting as the one we had yesterday. And the panel is going to be UPCRC: Can Industry and Academia Collaborations Be Effective. We hope that there will be a lot of discussion debate, and we hope we can have real conclusions. So a few remarks. If you want to connect externally through wireless, there is a username and password out there. And we have busses going back to the Hyatt Hotel starting at 5:00, and probably last bus is going to be about 5:30. So once we finish the panel, we just have to go out and wait for the bus. If you could please sign, you have not done that, sign at the door so that we know who is coming and the affiliations and your names. And with that I'd like to introduce our first speaker, David Padua. He comes from the University of Illinois where he has been since 1985. And when I do these presentations, it's always difficult to omit some of the remarks because everybody who is talking today has such a distinguished career that I struggle to find what to say during less than one minute so that we can spend most of the time listening to them. So David has been a professor in Illinois since 1985. He has a very distinguished career, and probably the last book he wrote is the Encyclopedia on Parallel Computing. And I don't want to say more because we know pretty much very well his real accomplishments, and David is going to talk about abstraction for parallel programming. >> David Padua: Thanks. Of course the last book I wrote I didn't write because it was an encyclopedia, and that's... >> Juan Vargas: [inaudible]. >> David Padua: And that was a very pleasant experience, the encyclopedia. Many people contributed and it was very moving to see how much people were willing to participate in that project. Okay. So I was asked by colleagues to give you a summary of the work going on at Illinois. That's a very difficult task. So the work going on in Illinois in the area of compilers and languages. So I'll try do my best, but first I must say that what I am going to say is not comprehensive, neither in terms of the projects, nor in terms of people. So there are more people doing compiler, there are more compiler projects going on. I'll tell you, mention a few that seem more relevant to this meeting. So mainly I'll focus on the work of Vikram Adve, Wen-Mei Hwu and my research group, and will be presentations on compilers and the languages that are being designed. So let me start by saying some common places about what the problem is. So, as we know, parallelism is everywhere. And if parallelism is going to succeed, it will be that because software will be developed that can take advantage of parallelism. Okay? 
But when that happens and if that happens, one of the things that obviously is going to happen is that there will be a dip in productivity, because writing parallel programs is more complex than writing [inaudible]; you need to deal with performance, with scalability. There are new sources of defects and so on, so forth. Okay? So what we face as a challenge as software people is to facilitate parallel programming, to reduce the impact of parallelism on productivity. It's not very glamorous, but it's true. That's all we're trying to do, reduce the impact, try to bring back productivity to the level of the sequential era. And we find a lot of challenges. But of course optimization, because performance is crucial in the parallel world, because only through scalability, new machines -- through the scalability of software, new machines will be valuable and people will be willing to buy them. It's a business model. There is the issue of portability that is also tremendously difficult because while porting across machines in the sequential era was [inaudible] because the difference was in the type of instructions. And perhaps if you [inaudible] in the case of parallelism, the classes of machine can differ widely. There's a big difference between a GPU and a collection of CPUs, and there is a big difference between vector processing and the [inaudible] memory machines and so on. So there is a tremendous issue there. So that's a problem that has never been resolved, and we really don't have a good answer for how to guarantee that you write code that will be executable on a variety of classes of machines, distributed memory, shared memory, array, and so on. And the other challenge of course is correctness. There are new classes of defects. Nondeterminism, deadlocks, and, you know, other issues related to parallelism. So how to address this challenge? From what I have seen, there are mainly two strategies that I believe are viable to address the problem. One is to raise the level of programming so that you have enough information to be able to map across machines, which will facilitate portability. You have enough information through optimization that would solve the problem of getting good performance and scalability and so on, so forth. And that of course is a difficult problem, because at what level you present the language is an issue: you may want to work at the level of applications and have abstractions that represent a complete application, you can have abstractions that represent algorithms, or you can have abstractions that represent goodness. Which ones are useful, how you combine them, is not a problem that is easy to solve. So you -- and there are a number of possibilities. You can work from the formula level or the very general description level. So you can have domain-specific languages, you can have very high-level operators, and you can have just plain parallel abstractions like parallel loops and [inaudible] that sort of thing. So that's one. And the other, which I think is also crucial, although many people don't seem to be fond of this approach, is to have techniques that automate the process of optimization, automate the process of porting and so on, so forth. So we actually need to work on this area of autotuners, compilers, so on, so forth. So, you know, maybe I did this a couple of days ago, but I think, you know, maybe I'm missing something, but I think certainly these are two of the main strategies that we can use to address the problem of productivity in the parallel world. 
So let me start by going quickly over what is being done at Illinois, and I hope it's not too boring. It will be just a short description of projects. So three projects in the language realm: Triolet, that's Wen-Mei Hwu's group; this HTA work that we have been doing in my group; and the Hydra thing. I'll describe these very quickly to you. So Triolet is a library of built-in functions that you can use from -- basically, here is an example; it's basically a library that you can call from Python code. And I think this work is similar I guess to the work that is going to be described in the talk after this one, basically, a number of operators with iterators, and the idea is that the operators are implemented in parallel and you can work in the program at the high level using the computational notation. Okay. So the notation is Python, and I think that's all I'm able to say about this. They are working on compilation techniques and so on, and they continue working in this area. This is the work, as I was saying, of Wen-Mei Hwu and his student Christopher Rodrigues. That's one project. And here, as you can see, the goal is to raise the level of abstraction with functions that implement operations, and you encapsulate the parallelism within the functions. The other related approach is the one in my group on hierarchical tiled arrays, and this is work that my students, colleagues, and I have worked on for a few years. And basically the main idea as [inaudible] described yesterday is that we recognize the importance of blocking and make tiles first-class objects. If you look at algorithms for parallel computing, or sequential algorithms dealing with locality, the tiles appear all [inaudible] as a constant. So having them in the language explicitly we thought was [inaudible]. This started when we had the HPCS program that was started by DARPA, and the IBM guys wanted an extension to MATLAB for parallel programming. And by looking at MATLAB, I decided that some of the structures, they have these -- and I forgot the name -- they have some classes of objects that basically are collections of arrays and enable a hierarchy of tiles, I thought that that could be used to represent parallelism. The idea is relatively simple. You have arrays that are tiled and the tiles can be tiled and so on, and then you can use the tiles at the top for distribution across machines, the tiles within the tiles for locality at different levels, you can use the tiles at the second level for shared memory parallelism and so on, so forth. So it's a natural thing on the hierarchical machines of today. And then we have work more recently extending this notion to irregular data structures like sets and so on. So here the operations are on arrays and sets, and the idea is that the arrays are tiled and you can represent parallel computations by manipulating these objects. That's basically it. I am personally still struggling with the notion of what primitives you need to go beyond linear algebra and also struggling with the question: do we need to go beyond linear algebra? You can do everything with linear algebra perhaps. So that's an issue that I don't think I can answer. But what I can say is that our work on regular structures proves to be very effective. There is a tremendous complication in tiling irregular structures, because you need mechanisms to decide where each component, as I said, will go. And we can make those mechanisms, but it's not obvious how to build them from scratch. You need to think very hard. 
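[As an illustration of the tiling idea only -- the class and function names below are made up and are not the real HTA API -- here is a minimal C++ sketch of tiles as first-class objects, with a matrix multiply written tile by tile so that the loop over output tiles is the parallelism and each tile operation is the unit that can be mapped to a node, a core, or a cache level.]

    // Illustrative sketch only: hypothetical names, not the real HTA API.
    // The point is that tiles are first-class objects; the outer tiling can be
    // mapped across nodes, and an inner tiling (not shown) across cores or caches.
    #include <cstddef>
    #include <vector>

    struct Tile {                          // one block of a larger matrix
        std::size_t rows, cols;
        std::vector<double> data;          // row-major storage for this block
    };

    struct TiledMatrix {                   // hypothetical stand-in for a tiled array
        std::size_t tileRows, tileCols;    // shape of the grid of tiles
        std::vector<Tile> tiles;           // tiles are the unit of distribution
        Tile& operator()(std::size_t i, std::size_t j) {
            return tiles[i * tileCols + j];
        }
    };

    // Serial kernel on whole tiles: c += a * b.
    void tileMultiplyAdd(const Tile& a, const Tile& b, Tile& c) {
        for (std::size_t i = 0; i < a.rows; ++i)
            for (std::size_t j = 0; j < b.cols; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < a.cols; ++k)
                    sum += a.data[i * a.cols + k] * b.data[k * b.cols + j];
                c.data[i * c.cols + j] += sum;
            }
    }

    // Matrix multiply written tile by tile: the loop over output tiles (i, j)
    // is the parallelism; each tile operation can be assigned to a node or core.
    void matmul(TiledMatrix& A, TiledMatrix& B, TiledMatrix& C) {
        for (std::size_t i = 0; i < C.tileRows; ++i)
            for (std::size_t j = 0; j < C.tileCols; ++j)
                for (std::size_t k = 0; k < A.tileCols; ++k)
                    tileMultiplyAdd(A(i, k), B(k, j), C(i, j));
    }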
So the end result with this notation of course is that you can get codes that are much more compact, and, more importantly, more readable than the equivalent MPI codes. So -- and the other advantage of this approach of having all your computation with the parallelism encapsulated inside operations on these arrays is that you have portable programs. You can implement your operations as a GPU [inaudible], you can implement them as threads within MPI, you can implement them as parallel loops. And they -- in all cases we have tested they have worked reasonably well. So that's that. And then the other project that I just wanted to mention briefly, some other one in our group, is called Hydra, and that is with my student Alexandre Duchateau and a colleague from Bordeaux, Denis Barthou. And the idea here is to start with a formula and, for example, you want to find X in that equation. And then what we do is we tile the operands and convert the equation into multiple equations, looking for a recursion. So eventually this reduces, if you want to go all the way, to a scalar equation that is easy to solve. Okay. So in the process of decomposing the equation, what you find is a number of operations that can be executed in parallel, and by doing the decomposition, what you build is a graph of operations that can be executed in parallel. Okay. The great advantage of this approach is that you can partition the components in multiple ways, and that enables you to search the space of possible solutions and look for a good-performing version of the algorithm. We are still in the preliminary stages, but what we have seen is that the codes that we produce are very good and competitive with what people have written by hand. So stay tuned. There will be more about this in the future. Okay. The next project is about compiler evaluation. This is a topic that I think is tremendously important, because while you see a lot of publications about new compiler techniques and strategies and so on, what you don't see too much is evaluation of what exists out there. And compiler technology is not just about algorithms, it's not just about ideas, it's about how well our algorithms work when applied to real problems. So it's a tremendously difficult thing to do, evaluation of compilers, not so much for the conceptual issues, but the labor involved. Because you need really to do manual analysis of what is happening to really understand how well the compiler is working. So here is a project that we did for two years at Illinois, the evaluation of autovectorization. So what happened is the Blue Waters project -- in the Blue Waters project the idea is to program the machine with MPI codes. But the thread within the MPI program has to execute as efficiently as possible, and what they want is to automate the process of tuning for vectorization -- to take advantage of the vector extensions of the microprocessor, the IBM Power machine at the time. So they asked me to work with the IBM compiler people to make sure they deliver a compiler that will produce good performance, especially in the context of taking advantage of the vector extensions. So this is regarding Tim Mattson's comment yesterday. So it's a -- in reality this is a form of autoparallelization, and of course the issue is not whether it's possible or not, it's a matter of degree. 
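[As a worked illustration of the decomposition idea only -- a generic textbook case, not Hydra's actual output -- take L X = B with L lower triangular, and tile the operands into blocks:]

        [ L11   0  ] [ X1 ]   [ B1 ]
        [ L21  L22 ] [ X2 ] = [ B2 ]

    which splits the single equation into two smaller ones of the same form:

        L11 X1 = B1
        L22 X2 = B2 - L21 X1

[X1 is found first, recursively, until the blocks reach scalar equations; the product L21 X1, the subtraction, and the solve for X2 become nodes in the graph of operations, and choosing different tilings of the operands gives different graphs whose performance can be searched.]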
Because typically what people do is run their programs through the compiler, and whenever it can find vector operations, they just profit from that and benefit from that. And in reality it's a very important technology, so much so that all compilers autovectorize. And the compiler groups invest a lot of time in this technology. For many reasons. One of them of course is that you evaluate compilers and machines by how well they perform on SPEC codes. And if the compilers do better at vectorizing, the machine does better. So there is -- but there is also I think a real relationship of autovectorization to regular users all over the world. Autoparallelization, I think it's also true that all compilers autoparallelize. But when you talk to the IBM compiler people, they hear very little about users of autoparallelization, but autovectorization is widely used. And, of course, why not? You can have computers that answer all sorts of questions. No reason why your computer cannot answer autovectorization questions: can this loop be vectorized? So let me just tell you briefly the outcome of the evaluation and briefly what we're doing going forward. So basically we look at the -- we took the collection of loops that David Callahan and [inaudible] and Levine put together 20 years ago and ran them through three compilers: GCC, ICC, and the IBM compiler. And what we found out is that the -- there was great variability in what the compilers were able to do. And despite the fact that at some point we gave the benchmarks and the results to the different compiler groups, and they worked on the compilers [inaudible] because of that, after a while you still get to see that there were loops that were vectorized by some compilers and not by the others. And with no good reason for that. So basically I think the most important lesson I learned is that we have tremendous difficulty figuring out how to do quality control at the compiler level. So that's -- that to me is the biggest lesson. Because when we look at the loops, there is no reason why they cannot be vectorized. We have the technology, we have the mechanisms to do the transformation, to do the analysis. The only problem is that the compiler just doesn't know how to take all these tools and put them together and do the proper thing for a particular job. It doesn't know how to -- >>: What do the numbers in the shaded boxes mean, the 2 and the 3? No, in the other -- the autovectorized? Under ICC you have a box with a 3 and then under GCC a box with a 2. >> David Padua: I -- to tell you, I don't remember right now. >>: Okay. >> David Padua: There is a paper in PACT last year, and they describe all the details there. The -- the ->>: [inaudible] compiler [inaudible]. >> David Padua: Huh? >>: And if I look at this [inaudible]. >> David Padua: The Intel compiler -- okay. This is not the end of the story. This is only the loops that were very simple. When we look at real applications, all the compilers behave equally badly. 30 percent of the loops were vectorized. Performance, when you did the work by hand, as if you were a compiler, you can get factors of two or more performance improvement. So there is a lot to -- but in all cases, real applications or not, when you look at the transformations that you need to apply, they are there. We know the transformations. There is nothing new that needs to be learned about the analysis or the transformations. 
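[For concreteness, a minimal example -- mine, not one of the benchmark loops -- of the kind of loop at issue; the function and its names are purely illustrative. With the no-aliasing annotation in place, of the sort mentioned just below, a vectorizer has everything it needs:]

    // Minimal illustrative example, not taken from the benchmark suite.
    // __restrict tells the compiler that x and y do not alias; with that and a
    // unit-stride access pattern, the loop is straightforwardly vectorizable.
    #include <cstddef>

    void saxpy(std::size_t n, float a, const float* __restrict x,
               float* __restrict y)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];   // one multiply-add per element, unit stride
    }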
What we need to learn is how to guide the compiler, how to establish a path to take the code from the original form to a form that performs well. If you just launch a compiler and say search, if time were not an issue, then it would perform [inaudible]. The problem is compilers must be efficient, so they cannot take a long time to compile. That's part of the reason. And, you know, the problem of optimization is a black art. Nobody knows exactly how to search [inaudible]. Yes. >>: [inaudible] what programming language are they in? >> David Padua: C. >>: And your analysis was that from a language perspective they are able to be vectorized? >> David Padua: Yeah. >>: Okay. >> David Padua: Yeah. Of course. And, in fact, all these defects or lack of effectiveness of the compiler was after restrict and aligned keywords were inserted by hand, so even then [inaudible]. >> Juan Vargas: I'd like to insert an editorial comment. If you are interested in vectorization, reading that original paper by David and Jack and so on would be worthwhile, just to see what could be vectorized and how one might test the ability of the compiler to discover it. >> David Padua: Yeah, yeah. >> Juan Vargas: It's an interesting paper. >> David Padua: And when you look at that, you should also read this one, and one thing Saeed did is compare the effectiveness of vectorizing compilers then and now. And they're not better now, as you will see. All right. So then the next -- I have 40 minutes, right, so I have -- >> Juan Vargas: No, it's 30 minutes. >> David Padua: Ah. Okay. So let me just -- >> Juan Vargas: 30 minutes. >> David Padua: 30 minutes. Okay. So let me very briefly describe in two minutes all the things going on. So there is of course compiling OpenCL, and I'll tell you one is a translator developed by [inaudible] group that transforms OpenCL depending on the target machine for vector, multithreading, sequential, and so on. And it's very effective, as you can see here. And they are continuing to work on that. Then there is work also by Wen-Mei Hwu on a data layout system that reorganizes structures of arrays and arrays of structures into a tiled form, and it's also a very effective mechanism to reorganize. This is an area of great importance, to be able to restructure the data to get good performance. And then finally we did work on tiling and overlapped tiling and hierarchically overlapped tiling that is very important for certain classes of applications. We are now working on eight-dimensional tiling and overlaps of the -- using these tiles to overlap communication with computation, with some success. And then finally the other project, and Joseph described this yesterday, is the work on Deterministic Parallel Java. And basically as you know the idea is to insert declarations in the program and have the compiler infer what regions of the data are being modified by the different threads in the program and determine whether or not the program will be deterministic. Let me just say there are novel features of DPJ not available before. For example, the ability to nest regions, which is very important when traversing [inaudible], for example. And support for array computations, which of course is crucial. And I'm going to skip this. The other important thing they did is they allow for nondeterminism, so you can have nondeterministic programs and still check for the absence of races. So they make sure that the data accesses that conflict are enclosed within an atomic, within a critical region. Okay. So that's that. 
So [inaudible] there is a lot to do. The truth is that the problem has proved more complex than we expected in all dimensions, but much progress has been made and we'll keep working. We, of course, will never solve it completely, but the current situation is pathetic, so we need to do much more. Thanks. [applause]. >> Juan Vargas: Selective Embedded Just-in-Time Specialization, and this is the main bit that UC Berkeley has been working on, as you know. And Shoaib graduated from Berkeley and is taking a postdoc at MIT. Is that right, Shoaib? >> Shoaib Kamil: Yes. That's right. >> Juan Vargas: And please repeat the questions because we are taping the presentations. Thank you. >> Shoaib Kamil: So I was told to make sure it's the right resolution. Okay. So I'm Shoaib Kamil. I'm from UC Berkeley where I've been a grad student for a few years. And I'm going to be giving some updates on SEJITS, which is Selective Embedded Just-in-Time Specialization. This is our technology for enabling high performance from high-level languages. And it was really good to see some of the stuff that Professor Padua talked about, because, one, it looks like there are other people doing similar things, and, two, the motivation was all there. So vectorizing is arguably the easiest form of autoparallelism, but -- and the technology to autovectorize is known, but it looks like it requires somebody to guide the use of that technology. And that's kind of the approach we've taken. So just to remind you what Selective Embedded Just-in-Time Specialization, or SEJITS, looks like, we want people to be able to write in these productivity languages like Python, so their application has portions that just run on the interpreter and use the normal Python infrastructure. But there are other portions of the program which pass through domain-specific embedded language compilers. And these compilers use the infrastructure we've built, which is called Asp, to intercept these calls and translate the function in question into a low-level language. That translated function is then compiled and dynamically loaded into the interpreter, which then calls that function and returns the result to the user. So from the user perspective, all of this looks like they're writing in Python, but what's really happening is that portions of that Python code are being translated into C or C++ or CUDA or another low-level parallel language, compiled and run, and the result returned. We also introduce caching so that this Just-in-Time compilation doesn't have to happen all the time, only the first time something is run. So here is an example of an actual kernel that we can specialize using the stencil domain-specific embedded language. If you look, it looks just like Python code. If you're familiar with Python, lots of colons and indentation. But what this computation is basically saying is for each of the interior points in a multidimensional grid, for each of the neighbors of that point, apply this particular function. So we treat this kind of like a declarative specification of the computation. So even though it looks like usual imperative programming with a for and so on, all this is actually doing is telling the infrastructure what the computation is, not how to do the computation. In particular, there are certain constructs here that are domain-specific. So the first thing is that we inherit from a particular class. This is what tells the infrastructure that this computation is something that needs to be specialized. 
So the initializer for this stencil kernel class is the thing that, you know, looks at the source code and does all the steps that I'm going to describe in a minute. We also have these special iterators which are used in the program translation. So the first step is to introspect this function and get the abstract syntax tree. This is done mechanically, automatically using Python infrastructure. We didn't have to really implement anything for this, right? Python can introspect itself. So you get this syntax tree that represents the computation. If you look closely, you'll see there's a function definition, and then there's these for loops, and, you know, lots of other nodes. So we use that AST and do a mechanical translation to a domain-specific intermediate representation. So there's some effort required for the person writing the domain-specific embedded language compiler to write this translation from that Python AST into this intermediate representation. Now, the intermediate representation looks like a tree, it has some domain-specific constructs, but, again, what it really is representing is not how to do a computation, but actually the declarative specification of what that computation is. This is done in our infrastructure by writing simple tree transformation rules that work on local pieces of the tree. It uses our infrastructure and the visitor pattern to make it as easy as possible to define. The point of this translation is once you get to this point, we can run correctness checks, we can ensure that, you know, because it's declarative we can do things like make sure that the specification is correct by construction as we're translating it. We make sure that everything we're translating is going to result in a correct specification. So most if not all of the checking is done in this portion of the compilers. So from there, depending on the target machine or what the actual code is running on, it is translated into a back-end domain -- a back-end general AST. So an AST in C++ or in CUDA with parallel constructs, et cetera, et cetera. And then that tree is then optimized. Now, this is where most of the domain-specific knowledge is used or the expert programming knowledge is used. So in this process, you know, you figure out, well, for this particular construct I translate it into this kind of C++ and I know that I can do certain optimizations on this. As Professor Padua pointed out, there's often cases where the user who is writing the code knows that a particular transformation is correct. However, the compiler does not have the absolute knowledge in order to know that. A common example is things like loop unrolling. In certain cases the compiler can prove that this is a correct transformation, but not in every case in which it would be correct. In this particular example, you know, I'm an expert in stencil computations. I come from the high performance computing world, so I know how to do this. I know how to make a stencil computation go fast. So all of my knowledge is embedded into, you know, optimizing this tree. We've also implemented many of the common compiler optimizations, things like loop blocking, loop unrolling. These are things that are part of our infrastructure, so you just need to apply them in the proper way. So what does the code kind of look like? Well, you end up with a very -- so this is actually a very simple example of what comes out of the code I just showed you. 
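[The generated code on the slide is not captured in the transcript. As a stand-in, here is a rough sketch of the general shape such output takes -- a 2-D OpenMP loop nest with cache blocking and a small unrolled inner kernel. This is an illustration only, not actual SEJITS output; all names and block sizes are made up.]

    // Stand-in sketch only: the general shape of specializer output for a 2-D
    // neighbor-averaging stencil (made-up names, not actual SEJITS output).
    #include <algorithm>

    void stencil_generated(const double* in, double* out, long nx, long ny)
    {
        const long BX = 64, BY = 64;                 // cache-block sizes
    #pragma omp parallel for                         // parallelization over row blocks
        for (long ii = 1; ii < nx - 1; ii += BX)
            for (long jj = 1; jj < ny - 1; jj += BY)
                for (long i = ii; i < std::min(ii + BX, nx - 1); ++i)
                    for (long j = jj; j < std::min(jj + BY, ny - 1); j += 2) {
                        // unrolled by 2 in j (register blocking)
                        out[i*ny + j] = 0.25 * (in[(i-1)*ny + j] + in[(i+1)*ny + j]
                                              + in[i*ny + j - 1] + in[i*ny + j + 1]);
                        if (j + 1 < ny - 1)
                            out[i*ny + j + 1] = 0.25 * (in[(i-1)*ny + j + 1] + in[(i+1)*ny + j + 1]
                                                      + in[i*ny + j]         + in[i*ny + j + 2]);
                    }
    }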
And this is a simple example in that, you know, it's 2D, it only has a couple optimizations applied. So just so you don't have to squint, there's parallelization here. There is cache blocking, which ensures that you work on a working-set-sized cache block at a time. And there is loop unrolling or register blocking, which helps expose things like vectorizability. It also helps, you know, expose things like common subexpression elimination and so on. So this is the code that gets passed to the compiler. And as a bunch of the autotuning work has shown, if you write your code in a particular way, you can get the compiler to give you much better performance than writing your code in the most naive way. So that's what we're doing here. I've left out a couple of the major pieces of what SEJITS does. For example, I just mentioned autotuning. I haven't shown you any of that, but basically we generate many different versions and on each invocation of a function, we run a different version until we've explored the space completely, and then we use the best version for subsequent invocations. So this is SEJITS. For the rest of my time, I'm going to give some updates on the current domain-specific embedded languages and libraries we have, show you some new results, some users and upcoming releases, and finally talk a bit about some future directions in what we're doing. So first off, this is kind of the set of implemented domain-specific embedded languages and libraries that we have. The red ones I'll give you some more detail about, but let me talk briefly about a few of these. The platforms we target range from, you know, x86 plus OpenMP to MPI to CUDA to Cilk Plus. So we have basically every back end. We even have a cloud back end at this point. I want to just -- let me just briefly describe each one. So the first one is for stencil, or structured grid, computations. I'll spend some time talking about some results from that. The second one is for graph algorithms, and I'll spend some time talking about that as well. We've also implemented a simple parallel map domain-specific language, which basically lets you compose these languages with map, so you can do high-level parallelism there. One of the big success stories, I would say, is the Gaussian mixture modeling library we have, which has been released. I'll talk a bit about the infrastructure that that's a part of. But, you know, that's able to get performance that's 250X real-time, faster than what the domain scientists were using before, which was handwritten, hand-optimized CUDA. We also have a communication-avoiding matrix powers library, which I'll talk a bit about. And we have three new domain-specific embedded languages that are under development right now. So the first one is for the Bag of Little Bootstraps algorithm, which is a statistical analysis tool, and this one is used by people who want to evaluate machine learning and other kinds of things. We have a version that can target Cilk Plus on a local multicore machine as well as a version that can run on top of the Spark infrastructure in the cloud. So that's under development. We also are working with the people who make GraphLab, which is a C++ parallel library for machine learning using the graph representation. We're working with them to replace their current API. Their current API does something quite convoluted. So first they created a Java API for their C++, then they used JPython -- or Jython -- to write a Python API on top of the Java API on top of the C++ API. 
So we're getting rid of all of those layers and we're using our infrastructure to replace the current API so that you can get C++-level performance while writing in Python for your graph algorithms. And finally we are working on a communication-avoiding parallel recursive structural pattern. So what this comes from is the insight that there's lots of algorithms that have a particular recursive structure. And this particular recursive structure allows you to choose at any point whether you want to run the different leaf nodes of the computation in parallel or whether you want to run them serially, depending on the available resources. So a preliminary implementation of this was applied to matrix multiply, just, you know, the standard thing that has been optimized the hell out of, but we were actually able to beat MKL, Intel's Math Kernel Library, using this particular recursive structure. Now, we're beating MKL by using MKL but being more intelligent about when to do things in parallel and when not to do things in parallel. So first let me talk a bit about performance that we're getting out of the stencil domain-specific embedded language. This is a set of benchmark kernels on the X axis. On the Y axis is the fraction of peak performance. The last set of bars is the geometric mean, and we're comparing against Pochoir, which is the state-of-the-art domain-specific language for stencils using Cilk Plus. It's developed at MIT. And it gets excellent performance using the cache-oblivious algorithm. So we are able to get a geometric mean of 93 percent of peak performance. And this peak performance comes from looking at the kernels and finding out that they are bound by the memory bandwidth performance of the machine. And 93 percent of peak performance is quite a bit better than you could get in Python. Just as an order of magnitude kind of thing, each of these is about 2- to 3,000 times faster than what you would get by writing it in Python. >>: So what did you [inaudible]? >> Shoaib Kamil: So this is a single-node, four-core i7 machine. We compared to the autoparallelizer in GCC, which uses the polyhedral framework, and we're able to outperform that by up to 11X, and we're about 2 1/2X faster than Pochoir. And, as I said, the geometric mean here is 93 percent of attainable peak. Yes. >>: This is the first time I get to see a comparison of this and Pochoir on the same problem. >> Shoaib Kamil: That's right. >>: Please repeat the question. >> Shoaib Kamil: Sorry? Oh. The question was this is the first comparison that Robert has seen on the same hardware for Pochoir and our infrastructure. So I mentioned the communication -- sorry. >>: [inaudible] the language, right, the difference is that they generate [inaudible] algorithm and you generate [inaudible]. >> Shoaib Kamil: Well, so the -- the question was whether the difference is due to language or whether it is because they're using a cache-oblivious algorithm and we're using a cache-aware algorithm. Well, I think the difference is -- well, it's certainly not due to language. We're both using C++. We're both using the Intel compiler. But the difference is due perhaps somewhat to the use of a cache-oblivious algorithm versus a cache-aware algorithm. But I think it's more so that because we're doing our compilation Just-in-Time, we have available to us every parameter that you could possibly inline into the computation. 
So we know the sizes which tells us what kind of blockings are valid, and we explicitly generate all the valid blockings, and we use autotuning. So I would say it's probably more related to our use of autotuning than it is to the algorithm because conceptually, if you look at the algorithms, they do the same thing. However, because we can use autotuning, we're able to get the fastest possible performance from a wide variety of code variants. >>: [inaudible]. >> Shoaib Kamil: So the question is they have a way to add autotuning to their process. Is that a statement, or are you asking ->>: [inaudible]. >> Shoaib Kamil: Oh. So I'm not aware of this functionality. When I presented this in front of the Pochoir people, they think they can improve their performance. And, you know, where I'm sharing the code with them, so hopefully we can figure out how to make their performance better. So the other thing I wanted to talk about in terms of performance results is that we've talked a bit about our communication avoiding matrix powers library. And what this particular kernel does is it's a building block of communication avoiding Krylov subspace methods. And what this chart shows is the performance of three different implementations of a communication avoiding CG solve. I'm sorry. Three different implementations of a CG solves, one of which is communication avoiding. So the first one is kind of the standard, the blue is the standard CG that you would get if you're writing in Python and using a good performance library. This is scipy, which basically implements all of the MATLAB functionality in Python. It's what most people use when they're doing this kinds of computation in Python. The red is using Intel's MKL. And the yellow is our communication avoiding CG with matrix powers only doing one matrix multiplication at a time. So this is kind of the baseline that you can compare with the noncommunication avoiding stuff. And the green is the communication avoiding CG using our matrix powers kernel. The dark bottom parts of the bars are the times spent in the matrix powers portion of the CG solve, and the -- I'm sorry, the matrix multiplication part of the CG solve, and the light bars are the rest of the computation. What this really is showing is that if you look at just the matrix powers part, we're able to outperform Intel's MKL by 2X which results in much faster solves. So I want to talk a bit about some new work that we've been doing in collaboration with people from UC Santa Barbara. And we've been working with a Python infrastructure that they've built called the Knowledge Discovery Toolbox. And the Knowledge Discovery Toolbox is basically a high-level graph algorithms library for domain scientists. So most people will use the, you know, KDT to write their -- to run their graph algorithms, do things like [inaudible] search or, you know, clustering and things like that. It's built on top of the combinatorial BLAS which is a low-level C++ with MPI library that casts graph algorithms as linear algebra operations. And that, you know, uses MPI and can use OpenMP, Cilk, and things like that. This is not yet implemented. That's why it's in a dotted line. So we're adding in the SEJITS infrastructure here to bridge the performance gap between things that are written by domain scientists and things that are written by people at the low-level distributed combinatorial BLAS. And example of this is a domain-specific embedded language for filtering. 
Now, oftentimes you have a graph that has attributes on the edges or the vertices, but you want to perform your graph algorithm only on a subset of this graph. So, for example, you have phone calls and texts, and you only care about performing your graph algorithm on the phone calls. So what we've done is we've implemented a domain-specific embedded language that allows you to write filters, and the filters are basically functions that return true if this edge should be included and false if it should not. KDT itself has infrastructure for doing these filters, but in KDT the filters are written in Python. And at each edge during the graph algorithm, it has to upcall into Python and decide whether this particular edge should be included or not. So that incurs a pretty big performance penalty. So our domain-specific embedded language looks just -- you know, uses Python. It uses the same kind of infrastructure to write that. Here's an example of a filter that checks, you know, if the count is greater than zero or -- and if the time is less than some particular time you've passed in, and these are edge attributes of the graph. So using this we can speed up the performance quite a bit. So here's a graph that shows the mean BFS time on a log scale on the left side, and on the X axis is filter permeability. That's the percentage of edges that pass the filter. The performance that you should be comparing is the green, which is KDT's current implementation that, like I said, runs in Python, and the blue, which is the one that uses SEJITS. Now, in order to build a baseline for this, we also implemented the same thing in the low-level C++ library just to compare. Now, this isn't something that, you know, we would expect a domain scientist to be able to do because it involves writing low-level C++ parallel code, but that gives a good baseline. In any case, the performance here is 5 times faster than the current KDT, and the slowdown versus writing the whole thing in C++ is 25 to 50 percent. What I'm showing here is performance on a -- on 36 cores of a 40-core Xeon four-socket machine. We also run this in the distributed setting on Hopper at NERSC, which is a large-scale AMD -- I'm sorry, it's a large-scale Cray XE6 using AMD processors. And we've seen similar results there. So I'm going to switch in the last few minutes I have here and talk a bit about what's going on with users and so on. So we've had over a thousand downloads of our Asp infrastructure from the Python package index, which is kind of the mechanism for distributing Python packages. The KDT work I showed you is going to be integrated into mainline KDT and released this year. This is going to be, you know, kind of the first major outside user releasing something using our infrastructure. And the goal is that the second DSEL we've implemented there, which I haven't really talked about, is also going to be part of the release after that. In the first two weeks after the Gaussian mixture model library was released at a -- at the ACM Multimedia conference, we had over 800 downloads of that library. And we have outside users at Carnegie Mellon and other places that are using this in production to replace their current C++ code. Yesterday Tim Mattson showed some numbers on an application which does protein docking. This is an application that was ported by Henry Gabb, who was at Intel at the time. And it gets 290X speedup by running in the cloud without any change to the ported application. 
And finally Tim's also developing an FFT specializer and giving us great feedback on things we need to work on in terms of usability and in terms of making it easy for people to develop these domain-specific embedded languages and autotuning libraries for Python. So some future work. So right now the autotuning in the results I've shown you is kind of the most naive autotuning you can do, which looks at all the particular versions and just runs them in some kind of random order. So we think by adding machine learning, and we have some evidence of this, we can get much faster searches that converge on the optimal version. So there was a paper by Archana Ganapathi and Kaushik Datta a couple years back for stencils that shows that machine learning can greatly improve the speed of converging on the best version. So we want to integrate that kind of search into our infrastructure and make it easy for other people to use in their specializers, in their domain-specific embedded languages. Some of the ideas from SEJITS are also finding their way into hardware prototyping. I think Krste's going to talk about that later on today. And we're also exploring one of the big problems in using these, which is composition. And one of the directions we've seen is that pattern-based frameworks that are used for a particular kind of computation or a particular class of programs, such as multimedia programs or other things like that, are a good testbed for bringing composition to these domain-specific embedded languages. So there's some work from Berkeley that's also going to be talked about later today called PyCASP, which is building an infrastructure for doing multimedia application -- multimedia applications in Python and getting fast performance from that. So it consists of a bunch of library components, some customizable components and structural patterns that you can put together in different ways to build your application. There's going to be a whole talk on that later today, so you'll hear much more about that. So to conclude, we've seen really high performance in kernels. We've seen examples of applications authored by people at Berkeley, and also people outside of Berkeley, that are using this infrastructure to build real applications. We're doing some work on composition and we think that this infrastructure and ideas from it are really critical for enabling agile hardware/software co-design and design space exploration. Thank you. [applause]. >> Shoaib Kamil: Lots of acknowledgments because there's lots of people who worked on this. >> Juan Vargas: Do you have questions for Shoaib? >>: There's a question. On your previous answer that [inaudible] performance is that logic and taking advantage of that, do you have sensitivity studies on how much you gain by that? >> Shoaib Kamil: Yes. And I can show you a graph offline. >> Juan Vargas: What was the question? >> Shoaib Kamil: So the question was do you have sensitivity studies on how much performance you gain by using autotuning. And I certainly do. It's in my thesis, so I can show you a graph. >>: [inaudible] what type of effort did you make [inaudible]? >> Shoaib Kamil: That's a good question. So we're working with Koushik Sen on some correctness infrastructure that allows you to basically prove that the generated code and the original code are doing the same thing, or at least the intermediate results are correct. So in terms of the stencil stuff, the correctness is more so embedded in the domain knowledge. 
So we're not doing formal verification online for any of this stuff, but we have applied those techniques that Koushik Sen and his students have come up with and actually proved that the stencil DSEL is outputting correct code. >>: Some kind of [inaudible] taken from [inaudible]. >> Shoaib Kamil: That's right. >>: Just an additional comment on that. Some of the transformations that are domain-specific use associativity and distributivity of arithmetic to do the transformations, and that changes the floating point properties, so you don't get exactly the same answers. And sometimes they can be very different, and a lot of domain knowledge is required to know which transformations are okay. >> Shoaib Kamil: So Jim pointed out that the domain-specific transformations we're doing do assume associativity and commutativity and other properties of floating point. So they can change the answer in terms of the floating point correctness. So that requires knowing -- having domain knowledge of what things are okay to do and what things are not okay to do. >> Juan Vargas: The last speaker for this session is David Callahan. David is a Distinguished Engineer at Microsoft. He has a Ph.D. from Rice University and he spent some time at Tera Computer, Cray Computer, and since joining Microsoft he has been working on parallelism and concurrency, and he's going to talk about his latest story, C++ AMP. >> David Callahan: Good morning, everyone. Thank you for coming to Redmond. I'm delighted to have a chance to talk to you a little bit about what we've been doing in Visual Studio in parallel programming. So the developer division has two broad products. One is Visual Studio, the other is the .NET developer platform. And our broad mission is to make sure developers are successful targeting Windows platforms, of which there are a great variety. I've sort of been at Microsoft for six or seven years now helping to shepherd along the parallel programming assets. We made a big investment in Visual Studio 2010 around multicore, and tomorrow I think we'll have a member of my team come and talk both about that and the extension work that we're continuing in the next one. Today I'm going to talk about C++ Accelerated Massive Parallelism, or C++ AMP, which is really our offering to sort of tackle the problem of GPU programming. So let me give you a little context here. We know Moore's law continues, but the hardware architects have been talking about power, power, power, power, power for years now. And so sort of a natural conclusion of that combination of cheap transistors but can't turn them all on is that we'll see specialization of silicon to particular workloads, which is already somewhat pervasive in a lot of places. And one of the really important workloads that exists is the rendering workload that's today sort of handled by GPUs. So this combination of factors has led to a really fast evolution of what GPUs are and where they're used. So part of it is in the programmability of the GPUs, and I'll start sort of six years ago when NVIDIA introduced C for CUDA and said, hey, you don't have to work in a graphics framework to get the power of the GPU available to you. And once they made that transition, they were sort of put on this path of making GPUs programmable like CPUs in ways they've never been before, subject of course to the fact that graphics is the workload that matters, so they're still dominated by engineering for that workload. 
Three years after that came out, we saw that the execution model that was sort of pioneered in CUDA then became standardized both in OpenCL and in DirectX 11 Compute, so the two major client operating systems both embraced this model and went forward with it as a core capability. Along that same time, of course, we saw the rise of mobile form factors as a huge concern. They embrace this rich visual experience, now coupled with touch, and that interactivity with the visual experience is something that's only delivered through a specialized hardware system, an integrated GPU, in that solution space. NVIDIA also built out special SKUs saying, hey, this stuff will work in a server environment as well, and they'll tackle the HPC market as aggressively as they're attacking the mobile market. At the last GPU Technology Conference they ran, they had a factoid about the growth of this at SC where in 2007 there was one booth that talked about GPUs. But last year 80 percent of the booths were talking about GPUs. A big impact in the HPC space. And of course now you can go rent those GPUs by the hour at Amazon first and then other places. And then the last step of this trend was the embracing of moving these specialized hardware capabilities onto the main CPU die and building a composite heterogeneous package. AMD released their first APU last year, and then there's Intel's Ivy Bridge -- Intel has had graphics accelerators as integrated parts for a long time, but this one is a DX11-compliant part and is really a candidate for running the compute workloads that we're interested in. So in this context, then, we started investigating how we can make this sort of emerging commodity platform, a heterogeneous platform, accessible to a very broad range of developers. So that's what we built in C++ Accelerated Massive Parallelism. Focus on data parallel algorithms, which are the part of the sort of algorithm pattern space that are most effectively optimized by this new generation of wide vector processors. We embrace the heterogeneous nature of the computing, so it's not just one kind of processor but two that you must deal with. Put this in a mainstream context. So put it into C++ and give support in a major IDE like Visual Studio which will support not just the language but also debugging and profiling. And get tools in the hands of users that can use this effectively for productivity and portability and performance. So balancing these three is sort of the design challenge in the API. So this will be now part of Visual Studio 2012. It's available in a release candidate now. Our approach to it was to say, hey, this is just part of C++. So you build a C++ app, you include our amp header, suddenly you can be doing GPU programming. It's just sort of part of the product, not a special bolt-on add-on. And that goes all the way to how we think about acquiring the tools and how you deploy the resulting applications. We do not think, however, of this as a Microsoft technology per se. We have a first implementation, but we think that this is the direction that the C++ community should move. We've created an open specification for our API design and put behind that the Microsoft Community Promise, which will guarantee a royalty-free license to whatever IP you need to implement the spec, and we are actively looking to have implementations on Linux, on Android, and on the Mac OS. 
At its core, AMP is just an STL-style library, but there's a couple of extensions to C++ we made to actually enable targeting of the diverse range of hardware that we're interested in, and I will dig into those. One of them is a pretty novel extension, the other is a tweak on sort of an existing idea. Our implementation is a layer on top of our graphics platform support, DirectX 11. However, DirectX 11 doesn't shine through, so it should be feasible to build an implementation on top of OpenCL or even perhaps OpenGL. We haven't done that, but our intention is not to make DirectX shine through except in some interop APIs. Let me give you a very brief introduction to the core concepts of what is in C++ AMP, starting with matrix multiplication. I'll do a very simple implementation, and later in the talk I'll do a more sophisticated implementation that will give a more substantial performance advantage. So here our starting point is a simple C++ version. The usual textbook three loops, inner product in the innermost loop. The two outer loops are then completely parallel. The interface starts with C++ [inaudible] vector types and some explicit bounds passed in. You know, one of the weaknesses of C++ is it hasn't actually standardized on a multidimensional data type, which is pretty important to do data parallel programming with. So one of the things then we added in C++ AMP, and this is an essentially complete C++ AMP program subject to a missing amp.h header and a namespace inclusion, we added a notion of an array view which allows you to overlay onto an existing linear data structure a multidimensional view of that data. And so we do that for the two input vectors A and B -- they are flagged as const, we're not going to modify them -- and the output vector C. Then we have a library-based parallel loop construct which is actually very similar to what we did for our task-based parallelism as well, and so this should be familiar to users of our earlier work or Intel's TBB. This parallel for each takes two parameters. One of them captures the extent information of a parallel loop nest. So the number of iterations in each sort of dimension of the problem. So in this case that would be M and N. And then it takes a function closure, which in C++ is called a lambda. And we'll invoke that lambda once for every point in the implied iteration space, passing in an index vector to tell you where you are in it. So that's the core data parallel paradigm of for every point in this space do this function. Our most significant addition to the language is this restrict keyword, which can be applied to functions and to lambda closures, and it defines a subset of C++ in the body of those functions. That subset is -- defines what is legal to run on a GPU, and it is also hooked to the compiler to understand when multiple implementations are necessary. And I'll talk a little bit more about the nature of the restrictions later. Note that the body of the inner loop is essentially unchanged, and we can pick off the row and column values out of the index value. The array views are also our hook to understand, if you're running in a distributed environment where the GPU has its own memory, what needs to be copied over there and how. Partly that's through what's captured in the lambda. It's also partly how we marshal the array views over. And I'll talk a little bit more about that later as well. 
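[The slide code is not captured in the transcript. Based on the features described here and just below -- array_view, parallel_for_each, restrict(amp), and the discard_data and synchronize calls discussed next -- a simple C++ AMP matrix multiply looks roughly like the following sketch; the variable names are illustrative.]

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // C = A * B, where A is M x W, B is W x N, all stored row-major in vectors.
    void MatrixMultiply(const std::vector<float>& A, const std::vector<float>& B,
                        std::vector<float>& C, int M, int N, int W)
    {
        array_view<const float, 2> a(M, W, A);   // multidimensional views overlaid
        array_view<const float, 2> b(W, N, B);   // on existing linear storage
        array_view<float, 2>       c(M, N, C);
        c.discard_data();                        // C need not be copied to the GPU

        parallel_for_each(c.extent,              // the M x N iteration space
            [=](index<2> idx) restrict(amp)      // invoked once per output element
            {
                int row = idx[0], col = idx[1];
                float sum = 0.0f;
                for (int k = 0; k < W; ++k)      // the inner product, body unchanged
                    sum += a(row, k) * b(k, col);
                c[idx] = sum;
            });

        c.synchronize();                         // copy the result back into C
    }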
Now, because we are running in a -- potentially in a separate address space, potentially asynchronously to the CPU, there are also some API extensions here to talk about what data needs to be moved, or doesn't. So the discard data API on C is an assertion: hey, the parallel for each is just going to overwrite this data, so don't bother copying it over to the GPU; you can treat the underlying data as undefined. And at the bottom there's a synchronize which sort of does the reverse, saying, hey, this data may have been modified on a GPU, but I want to store it back to its home location in the CPU data structures. Please do that now for me. So this is now essentially a complete C++ AMP program where most of the concepts we've introduced are really kind of around the array abstraction, sort of the size of it. So it should be pretty straightforward for most of our existing task-based customers to sort of understand this, which sort of gets to sort of some of our -- I'm sorry, go ahead. >>: So if it doesn't fit, you'll do all the blocking that's necessary to move it back and forth between CPU and GPU memory? >> David Callahan: So the question is if the data that is accessed in the parallel kernel doesn't actually fit into the video memory that's available on the card, what happens. So in our first implementation, it's the responsibility of the developer to ensure that happens. So you'll get an exception if you exceed it. There is not enough information in general in the API to allow us to automatically block this. >>: Can I ask you a question? >> David Callahan: Yes. >>: Can you detect if the discard data was put in incorrectly, like say you [inaudible]? Do you have any checks? >> David Callahan: So the question is are there any checks in the system to verify, if you put a discard data in incorrectly and then use the stale data, whether it will detect that. We don't do that in the first version. So we have this natural tension now between portability and productivity on the one hand and performance on the other. And this will show up sort of in how much of the machine we let shine through in the abstractions. So we know that sort of for portability and productivity reasons you really want to minimize the number of concepts that the developers are exposed to that represent an increment over what they already know. This is just general: any time a new technology comes along, the richer the set of concepts, the more trouble they'll have embracing it. And you also need to make sure that they know exactly when it's a good idea to apply that. So that leads us to focus fairly narrowly on the patterns that were applied to sort of these data parallel constructs. One goal is good composition with the host environment, which is C++, and we've achieved that by having a strong integrated type system that spans both classes of code. And you also want to sort of make sure that you pay attention to sort of the key patterns that your developer will be facing and give good application pattern support for that. So in that configuration we focus on the parallel loop as the pattern we're trying to make sure we have good support for. In sort of the more general multicore space, there's lower levels of work for tasking from which you could build the pattern stuff, but here we focus just on the pattern itself. 
On the performance side, we're faced with a lot of decisions about how much memory hierarchy to expose, what memory alignment requirements to have, what to do with the exposed concurrency between the two kinds of processors, and how the hardware scheduling mechanisms might get exposed. These are all areas where the architecture is still considerably in flux. Even today the range of GPUs is fairly diverse. And since we'd like to have future proofing, so that the code we write today runs well in a few years as well, it drives us to minimize the amount of hardware we actually expose. So I'd like to dig down into the bottom four of these in the time I have left and give some details of what we're attempting to achieve. So let me start by talking about the restrict AMP function qualifier. Our implementation builds on top of DirectX 11, which represented sort of an industry standardization of capability, and that capability is somewhat less than what CPUs provide. So there are a bunch of rules about what happens once you're in the GPU space. You have to stay in the GPU space; you can't call back into CPU functions. Most of the GPU architectures actually don't support indirect functions or even function calls, and so you have to carve off a subset of the language which you can fully inline and map to that more limited capability. There's typically no memory management and very restricted pointer use. These are things that, if you look at the most modern GPUs from some of the leading companies, have already been relaxed, and we would expect over time to relax this list, and we have a strategy for that. The other thing is that this restrict is part of the type system. So we currently have two notions of restrict: restrict AMP for our GPU targeting, and restrict CPU for things that just run on the CPU. Restrict CPU is the default, so if you don't say anything, that's what you get. And you can have overloads, so you can tailor the implementation of a function to the context that it's running in. You can also write functions that are guaranteed at design time to be safe to run in either context. We anticipate there could be other uses of restrict, so we worked really hard to make sure that it was orthogonal to the rest of the system. Maybe there are restrictions that are appropriate for CPU vector targets. Maybe there's a notion of a pure function we'd want to carve out. So it's intended as a general addition to C++ that we use to enable heterogeneity. We also worked with some of our hardware partners to think about what would be the evolution of this set of restrictions over time. If you go read the open spec at the bottom, there's a roadmap there of likely big buckets of how we might move from our current AMP, AMP 1, to an AMP 2 which would have a subset of this list. So I've already talked a little bit about array view and its role of providing a common name space between the CPU host and the GPU. The real function of it, however, is to provide copy-as-needed semantics. So here if my parallel computation is running on a discrete memory system, then at the time that I launch the kernel I can observe in the lambda what needs to be copied over. And subsequently if there's a reference on the host, I can do the appropriate bookkeeping to say, oh, the current copy is on the GPU, I need to bring it back. On the other hand, we know that we're moving to an age where integrated systems will become the norm. They'll be able to share physical memory, they'll share caches.
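Stepping back to the restrict qualifier for a moment, here is a small made-up illustration of the overloading and common-subset cases just described; the function names are hypothetical:

#include <amp.h>
#include <amp_math.h>
#include <cmath>

// Host version: restrict(cpu) is the default, written out here for clarity.
float length2d(float x, float y) restrict(cpu)
{
    return std::sqrt(x * x + y * y);
}

// GPU overload of the same function, used when called from restrict(amp) code.
float length2d(float x, float y) restrict(amp)
{
    return concurrency::fast_math::sqrtf(x * x + y * y);
}

// Restricted to the common subset, so it is safe to call from either context.
float dot2d(float ax, float ay, float bx, float by) restrict(cpu, amp)
{
    return ax * bx + ay * by;
}

The compiler picks among the overloads based on the restriction context of the caller.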
Eventually they'll share cache protocols. And so those copies will become unnecessary. And so we want to make sure that we have a system that allows you to span that transition. So you can write code to tolerate a distributed memory environment that will then light up and improve on an integrated system. We also provide a mechanism to explicitly allocate data on a GPU, so there's an array type; it's kind of analogous to array view, but it is a storage container rather than just an access method. And so even today there's a way to talk about placing memory on the GPU, because there are scenarios in which that makes sense. A harder concept is where we actually start to say, okay, machines are not as simple as we'd like them to be, and they kind of never have been, but particularly when you get to the memory hierarchy there's a disconnect between how I want to think about the machine and how it really is. And so we chose to embrace a fairly simple extension, which is, hey, these machines are multicore and every core has a cache, and you have to provide some mechanism to talk about that locality within an otherwise data parallel context. So this is not a model we invented. We inherited it from where the graphics guys were with DX11, but it's a pretty good model, it spans a lot of interesting things, and it is a compromise because this part of the architecture is probably pretty stable. I don't know what the next stable point in memory hierarchy is likely to be. So we introduced within our data parallel computation space the idea that you can break that space into tiles, or blocks, as Padua called them, and these are sets of iterations that are going to be mapped to the same physical processor. Now, these physical processors are typically vector processors, so an iteration in our space may actually map to a lane in a vector processor, not a thread of its own. But we will call it a thread nonetheless. Once we have this idea that a chunk of my space can map to a physical processor, we can allow certain capabilities that tend not to scale very well to shine through. In particular, we introduce the notion of barrier coordination, so all the threads in a tile, all the threads on a processor, reach some point before any proceed, and now we can introduce a notion of shared storage there that is suitable both for the sort of temporal caches we see on CPUs and for the scratchpad memories that are common as software-managed caches on GPUs. Let me go back to my matrix multiply. Remind you that it's easy to do it in a block style: you can take a block of columns and a block of rows and then accumulate them into a block of the output. So the block of the output is our tiling strategy for the data parallelism, and we can stage how we do that row block by column block reduction into it. For example, we guarantee that when we bring in block 3 of A, we multiply it against block 3 of B and accumulate into the output, and those subcomputations are now things that can fit in a private cache of a multiprocessor. And so this model we can then take forward into the API design: here's matrix multiply in the simple data parallel case that has no manifest memory locality to exploit, and here's the variant of it where we do introduce those things. We introduce a tile size. In this case it's a static constant; a limitation of V1 is that it must be a compile-time constant. We have a variant of the parallel for each in which you take the extent space and you tile it based on these fixed-size blocks.
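A sketch of what that tiled variant might look like before any locality is exploited, assuming the same array views as before; the wrapper name is hypothetical, TS is an illustrative tile size, and the row and column counts are assumed to be multiples of it:

#include <amp.h>
using namespace concurrency;

static const int TS = 16;   // V1 limitation: the tile size must be a compile-time constant

// Only the iteration space is tiled here; the algorithm itself is unchanged.
void MatMulTiledNoLocality(array_view<const float, 2> a, array_view<const float, 2> b,
                           array_view<float, 2> c, int W)
{
    parallel_for_each(c.extent.tile<TS, TS>(),
        [=](tiled_index<TS, TS> t_idx) restrict(amp)
    {
        // t_idx.global is the position in the original M x N space.
        int row = t_idx.global[0], col = t_idx.global[1];
        float sum = 0.0f;
        for (int k = 0; k < W; ++k)
            sum += a(row, k) * b(k, col);
        c[t_idx.global] = sum;
    });
}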
So this is just an overloaded parallel for each that handles these tiled extents. We change the parameter into the lambda: instead of taking an index, it takes a tiled index, which is a slightly larger thing that has a lot more information related to what's going on in a tiled context. But we can still pick off the global index of where you are in the original space. And so these two variants haven't made any substantive change in the algorithm. All we've done is tile the iteration space into blocks. The more interesting change, then, is how we get to the locality. And that involves the following slide, where now we introduce a tile static local variable. So tile static is our second language extension. It's a new storage class that can only be used inside a tiled parallel for each. And it allows us to create little arrays which will be mapped into the scratchpad memory of GPUs or could fit into the L1 caches on a CPU. Now we take our inner product loop and we strip-mine it into chunks, so each chunk of that inner product will load blocks from global memory and tuck them into that cache-local memory. The loads are done cooperatively, every thread in the tile does one data motion, and when those loads are all done we do a barrier to verify that all those loads have settled; then we can drop down into the set of chunks of the inner product. They now read out of that cache, and so the effect here is that instead of every thread going to memory for every array reference, there's in this case a 16-fold reduction in the global memory requirements, achieved through this explicit caching strategy. And a second barrier protects the reads in the inner product from the writes of the next iteration. So this is how we've taken our model of L1 caches attached to processors and put it into the API set for the programming abstraction. Yes, Jim. >>: So what about determinacy? Because you have a sum reduce, which could happen in any order, and one issue, depending on how things are dynamically scheduled, is that you get the sums in a different order and get a different answer. So is there anything in your language extensions that addresses that? >> David Callahan: So Jim's question is about how deterministic the programming model is in this particular example. So this particular example is actually completely deterministic, because the output sum is done the same way: it's done by one thread, and it's actually done in the same order every time. The concurrency in this model is between different tiles, and they are completely unordered. But between different tiles in this model there are no interactions. A second question is whether it is possible to introduce data races into this model. And the answer is yes, it is. For example, if we dropped the second barrier from here, we would create a data race between some threads finishing the reads for their sum reductions and other threads starting to overwrite the shared data that's involved. We actually have some static analysis to detect those data races, and we can run in a mode in which there's some dynamic analysis to detect them as well. But it's not guaranteed to be a foolproof system. So in deference to time, I'm not going to talk about our compilation and deployment model in depth. We build on an existing graphics infrastructure. We ship a fat binary. We still have a final JIT to the target that gives us great portability across different hardware, both in the field now and over time.
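And a sketch of the tile static version being walked through here, again in the spirit of the published samples, under the same assumptions about tile size and divisibility, and with a hypothetical wrapper name:

#include <amp.h>
using namespace concurrency;

static const int TS = 16;

void MatMulTiledLocal(array_view<const float, 2> a, array_view<const float, 2> b,
                      array_view<float, 2> c, int W)
{
    parallel_for_each(c.extent.tile<TS, TS>(),
        [=](tiled_index<TS, TS> t_idx) restrict(amp)
    {
        int row = t_idx.local[0], col = t_idx.local[1];
        float sum = 0.0f;

        // Strip-mine the inner product into chunks of TS.
        for (int k = 0; k < W; k += TS)
        {
            // Per-tile scratchpad storage; each thread cooperatively loads one
            // element of the A block and one element of the B block.
            tile_static float locA[TS][TS];
            tile_static float locB[TS][TS];
            locA[row][col] = a(t_idx.global[0], k + col);
            locB[row][col] = b(k + row, t_idx.global[1]);
            t_idx.barrier.wait();            // all loads settled before anyone reads

            for (int kk = 0; kk < TS; ++kk)
                sum += locA[row][kk] * locB[kk][col];
            t_idx.barrier.wait();            // protect these reads from the next chunk's writes
        }
        c[t_idx.global] = sum;
    });
}

With a tile size of 16, each thread now makes two trips to global memory per chunk instead of two per element of the inner product, which is where the 16-fold reduction comes from; dropping the second barrier wait would reintroduce exactly the race described in the answer above.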
It's a cumbersome model that can lead to finger pointing, but it does give that core portability, which is pretty interesting. So here are the two points I tried to cover today. We have a ton of developer-facing information about this tool on the Web. There's a team blog that you can get to by searching for C++ AMP in a nutshell, or it's [inaudible] at the bottom of the slide. And of course all of this technology is available in the release candidate, so you can go download that today, try it out yourself, and see what it's like. Any questions? More questions? All right. Thank you. [applause]