>> Shobana Balakrishnan: I'd like to introduce you to Yossi Levanoni. He's been at DevDiv now for about six years. By the way, my name is Shobana. I just decided to host him because I thought this talk had broader implications to MSR and, in fact, to what we're doing with FPGAs. Hopefully he'll allude to that later in the talk as well, on how we can hopefully extend this to FPGAs. Right now he's focused on GPUs. And he'll talk about what they've done in Dev 11 and hopefully even beyond that, a few ideas for the future as well. So thanks, Yossi, for being here. And I'll hand it over to you. >> Yossi Levanoni: All right. >> Shobana Balakrishnan: Take it from here. >> Yossi Levanoni: Thank you, Shobana. So how many here have heard about C++ AMP or played with it in the past? Okay. All right. So I think I've prepared the wrong talk for the audience. This is going to be an introductory talk. So probably it's going to be things that you've already seen. But you'll notice some new syntax along the way that we have just introduced in beta. So that's going to be one added value of this talk. And please tell me if you think the pace is too slow and you're familiar with what I'm showing, and then we can go faster to the Q and A section, and we can drill into questions about the future of C++ AMP and possible research directions for you guys to look into in conjunction with what we've been doing. And this is going to be FTE internal only, so we'll be free to talk about future product directions. Okay. So the agenda is first setting some context for C++ AMP, why we've done it and the opportunity. Next is the programming model. Then we'll talk about the IDE enhancements that we've added to Visual Studio. And I'll wrap up with a summary and Q and A, and hopefully that will be a big part of the talk today. Okay. So I'm going to do something that is a little bit of a no-no. I'm going to show you a visual demo over a terminal session. Terminal server. And this is an N-Body simulation. So an N-Body simulation is a simulation that demonstrates gravitational forces between N distinct bodies. Each body exerts a force on every other body, so it's kind of an N-squared computation. We simulate it in every time increment. And there we sum the forces. And that accrues to a vector of acceleration. That adds to the velocity, and that changes the location of the bodies. So this is supposed to be a visual demo. But we should focus on this number here where my cursor is showing. That's the gigaflops, basically how many computations we're doing per second. And we're starting with 10,000 bodies. Now, this is on a single CPU core. This is a Core i7 machine. And we see that it takes about, I don't know, 12 percent of the CPU compute power when we look here at task manager. Now, this has been written to take advantage of SSE4 instructions. Okay? So we're getting about four gigaflops. Now, if we go to multi-core, we have another version of the same computation. I'm going to erase the particles so it looks nicer. And we see that we got about a 4X increase in the gigaflops that we're getting. And we see that all the cores are pretty much pegged. Maybe we need to feed it a few more bodies to really drive it to a hundred percent CPU. But that's pretty good scaling. We've done that using PPL. So now, all cores are busy. And that was a straightforward addition to the code. Everybody here familiar with PPL? Okay. All right. 
So now, the next step up is going to be a straightforward C++ AMP port. And we've jumped to, I don't know, 60 gigaflops. But we need to feed it a few more bodies in order to saturate the GPU. So we have about 40,000 bodies now. And we are at about 200 gigaflops. Now, this is an NVIDIA GTX 580, a fairly high-end GPU. This is one of the reasons we're doing this through terminal server, because the machine is really gigantic and it was difficult for me to schlep it here. So sorry for being lazy. So we get about 200 gigaflops just using that one GPU. And you see that the CPU is not busy at all now, and I can use my machine for other things. So basically I get the benefit of both the GPU and CPU working on the compute. Now, if I endeavor to make it a little bit more optimized using what we call tiling, which I'll talk about more later, we are now getting 370 gigaflops. And I think if we take it all the way to 80,000 bodies, we get close to 500 gigaflops. So we got another factor of two or so by optimization. And finally we can spread the computation over multiple GPUs. So that machine has two NVIDIA GPUs, two GTX 580s, and we're getting something like twice the 400 gigaflops that we got before. So it scales pretty linearly across those two GPUs. All right. Now, the terminal server has its tax. It's kind of serializing and it has its overhead. So if I run it locally I get above 900 gigaflops. Okay. So that's basically the power of the GPU that we want to harness. And this is what C++ AMP has mostly been about in this release. So how does it happen that you can get this type of acceleration on the GPU? So we have here on the slide a juxtaposition of GPUs and CPUs. I think you can see on the graphic that the CPU has a bunch of colored areas around it. It has a lot of specialized logic. It needs to be able to execute any code, starting from your favorite game from '95 to the latest version of Office. The GPU, on the other hand, is much more regular. You see many, many identical processing elements. And even within those processing elements there is much more silicon dedicated to executing math than to scheduling. Okay? So the GPU is really kind of dumb, in a way. It knows how to crunch a lot of numbers. As long as you can give it kind of straight-line code, it does it very, very efficiently. It has a thicker pipe -- a wider pipe -- to its on-board memory. So once you get the data on the GPU, it's more efficient to actually access it, compared to RAM for the CPU. On the other hand, it has smaller caches, compared with the CPU. It doesn't have all these deep pipelines and elaborate prediction logic. Instead it dedicates that silicon to actual computing. And therefore it runs at a lower clock rate. The way the GPU maximizes the memory bandwidth is by presenting a lot of memory requests to the memory controller, and it does that through parallelism. So you have lots of threads. Each one of them can ask to access memory. There's logic to combine those requests and present kind of chunky transactions to the memory controller. And this is how the GPU gets better utilization out of the memory controller. The CPU, on the other hand, has these very deep pipelines, you know, so it looks inside the instruction stream for things to prefetch, you know, what you're going to want to access next. And this is how it tries to maximize the memory bandwidth. 
So it all boils down to the CPU being very adequate for mainstream programming, general purpose programming. And GPUs are more suited for niche, or parallel, or mathematical programming. Now, Intel and AMD and other CPU manufacturers, seeing that, you know, there's so much action on the parallel side of the hardware, are definitely not, you know, staying idle. They're also investing in wider vectors. Okay? So the AVX instruction set will go to 512 bits and probably bigger than that. On the other hand, the GPUs, they also want to become more general purpose. So they're adding capabilities that weren't there just a few years ago. So in terms of capabilities they're getting closer together. But also in terms of topology you're seeing AMD with the Fusion architecture. They're taking the GPU and the CPU and putting them together on the same die. And NVIDIA is working on the Denver architecture. Other manufacturers are doing the same thing. So we're going to be looking at hardware that has both GPUs and CPUs fairly tightly together, looking at the same underlying memory. And that's going to be fairly mainstream on your slate and phone and desktop. So we've designed C++ AMP with that in mind, even though our main target for Dev 11 has been discrete GPUs; that's what's been available for us to design to. We think that C++ AMP can evolve easily to those types of shared memory architectures. Okay. So the approach we've taken with C++ AMP has been to optimize for these three pillars: performance, productivity, and portability. So the performance you basically get through exposing the hardware. That's kind of the easy bit actually. The more difficult part is how to make it productive and portable, especially from a Microsoft perspective. So with respect to portability we have made it well integrated into -- sorry, with respect to productivity, we've made it well integrated into Visual Studio. It's a first-class citizen in the C++ dialect that we have in Visual C++. It's modern. It looks like STL. Okay? Now, in terms of portability, it builds on top of Direct3D. So you immediately get the portability that Direct3D, which is kind of a virtual machine architecture, gets you across the hardware vendors. But we've also opened up the spec and we are trying to entice compiler vendors to provide their own clean room implementations of C++ AMP. And hardware vendors are also interested in that as kind of a way to showcase their architecture. We've also kind of outlined how the spec will evolve to further versions of GPUs, such that hardware vendors like AMD and NVIDIA have the confidence to know that if they buy into this vehicle they will be able to expose future features using C++ AMP. All right. So that's been the context of why we're doing it and how we've gone about it. Now, let's drill into the programming model. Unless anybody has questions they want to ask right now. Yeah, sure? >>: Can I ask a [inaudible] question? >> Yossi Levanoni: Oh, yeah. >>: Why would I want to use this versus CUDA or OpenCL? >> Yossi Levanoni: Okay. So if you look at those approaches, each one of them comes at it along a different axis. So CUDA is becoming more modern and it has Thrust. And that allows it to look like modern C++ to a large degree. But you obviously are stuck with NVIDIA hardware if you want to use CUDA. So if that works for you, you know, then you should use it. 
We see that there are many customers who are unwilling to buy into CUDA because it's hardware vendor specific. So it's obviously stuck because of that and the size of the addressable market. >>: [inaudible] did you tie yourself to Direct3D? >> Yossi Levanoni: No. Direct3D is our implementation platform. And we really try to hide it from the bulk of the programming model. So the Direct3D-specific things in the programming model are available through an interop API that is optional in the spec. And the spec itself is open. And everybody can implement it. And we definitely didn't want to preclude the ability to implement on top of OpenCL or CUDA. And, you know, actually, if anybody here wants to prove that it's possible, that would be really awesome. Yeah. >>: What does this say about the future of HLSL? >> Yossi Levanoni: The future of HLSL -- >>: Why do we have both? Why do we need both? >> Yossi Levanoni: Yeah. >>: I would rather do this. But, you know -- >> Yossi Levanoni: Right. So actually the genesis of all these projects started from this very question. You know, we have a language that has been developed outside of the developer division, in the Windows division. And, yeah, it's kind of an interesting question, you know. Can they really support it going forward, or should it all be completely moved into the C++ domain? One of the things that the team will be looking at for Dev 12 is do we want to be able to express additional types of shaders, for example vertex shaders or pixel shaders, in C++. And also, Windows has taken the approach of exposing HLSL as just a language but not the bytecode. Okay? And that's also been very difficult for us to build on top of. We have to build on top of the textual language rather than at the VM level, at the bytecode level. So there's definitely going to be some thinking about what needs to be done and what is going to happen in the Windows 9 timeframe. There are also questions about the bytecode itself. The bytecode for DirectX today is a vector bytecode for a four-way vector. On the other hand, if you look at something like PTX, which is the bytecode from NVIDIA, it's not vector. It's scalar, but it's SIMT -- single instruction, multiple thread -- so it's implicitly vector. They're all executing scalar code, but they're executing on a vector unit. With DX you get both SIMT and vector execution. Compilers have a hard time figuring out how to map that to the hardware. So maybe that will need some revision too. And right now is the time when the discussion between DevDiv and Windows about Windows 9 is really starting to heat up about that. And if you want to be a part of that and you want to contribute to where this thing goes forward, I think that's going to be an excellent opportunity for collaboration. The person driving this from our side, from DevDiv, is David Callahan. He's the distinguished engineer for the C++ compiler. Yeah? >>: So why is this in C++ [inaudible] some of the [inaudible]. >> Yossi Levanoni: Yeah. So -- >>: [inaudible]. >> Yossi Levanoni: This basically goes to the first pillar, which is performance. So when we started, we asked ourselves exactly that: should we go C# or should we go C++, assuming we couldn't do both. There's also a question about should you put it in a scripting language, right? Should you put it into Python or something like that? And we've done customer studies. And what we have seen is there was more demand for that in C++. 
C++ being the foundational -- basically today anybody that wants to write performance-oriented code doesn't go any lower than C++. So it seemed like a foundational place to add this type of capability to talk to the hardware. Now, that didn't preclude doing C#. But it didn't seem like the top priority for Dev 11. >>: [inaudible]. >> Yossi Levanoni: [inaudible]. No, we recommend doing interop with C++. And we have a few samples showing how to do that. Okay. So the programming model. So let's start with the Hello World of data parallelism, which is adding two vectors into a third one. So we have here a function add arrays in, let's say, C. It takes the number of elements and the vectors, each of which is just an integer pointer, and it adds every element in PA and PB into the corresponding element in PC. Now, how do we make that into a C++ AMP program? So this is basically what it looks like after you have ported the code over. So if you look at the right-hand side, this is the C++ AMP code. And on the left side is the original code. I'll drill deeper into the C++ AMP code. But just on this slide, look at the top and you'll see the include of amp.h and using namespace concurrency. So this is how you get started, okay? This gives you all the classes and everything you need for simple compute scenarios. Now let's drill into the C++ AMP add arrays. So what do we have here? So the first thing to note is the parallel for each. So parallel for each represents your parallel computation. Every time you have a loop nest that is perfectly nested and has independent iterations, you can parallelize it. In that case you can replace it with a parallel for each. The number and shape of the threads that you launch are captured by an extent object. So extent is an object that describes dimensionality. It says how many dimensions there are and how far to go in each direction. So you can map any perfect loop nest into an extent object. The next thing is the lambda. Anybody here not familiar with lambdas in C++? Okay. So are you familiar with delegates in C#? Okay. So it's kind of the analog, with differences. But, you know, to a first approximation it's like delegates. So you have here basically the body of the lambda. It's kind of like a function object. And these are the parameters. This expression here says what you're capturing from the environment, what you want to make available here, and when you say equals it means I want to capture everything that this body refers to by value. And this is the bit that we have added. So restrict(amp) tells the compiler: hey, watch out that everything that I write in this function complies with the amp set of restrictions. And I'll talk more about that in a minute. Now, inside the lambda, what characterizes your iteration of the loop, your instance of the loop, basically your thread ID, is the index parameter. Now, the rank of the index is the same as the rank of the extent. So if we were doing a two-dimensional iteration, we would get an index of rank two. This is a single-dimensional index. Now, the last bit is the data. So we are using the array view class to wrap over your host data. So if you look at the top one, we have an array view of integers of rank one. So it's a single-dimensional array of integers. And it wraps over PA. So it says basically: I have this piece of memory, please make it accessible on the accelerator. 
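To make that concrete, here is a minimal sketch of the ported add-arrays function described above (the variable names are mine; the classes and the restrict(amp) syntax are the ones from the beta):

```cpp
#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int* pA, int* pB, int* pC) {
    // Wrap the existing host allocations so they can be used on the accelerator.
    array_view<int, 1> a(n, pA);
    array_view<int, 1> b(n, pB);
    array_view<int, 1> sum(n, pC);

    // One thread per element; the lambda is the kernel body.
    parallel_for_each(sum.extent, [=](index<1> idx) restrict(amp) {
        sum[idx] = a[idx] + b[idx];
    });
}
```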
Then we can access it on the accelerator, and you'll note that you can use a subscript operator supplying an index object. And that just gives us this addition. So this was basically the Hello World, okay? So let's drill into those classes. So, the index class. Extent and index, as I said, talk about dimensionality, iteration spaces, data shape, and where you are inside the shape. So on the left-hand side we have a single-dimensional space. It's represented using extent of one. The rank of the extent must be known at compile time, so it's a template parameter. But the extent is dynamic, so it's passed as a constructor parameter. So here we have a single-dimensional extent with dynamic extent six. Now, once you initialize an extent with a certain size, it's immutable. Now, on the top we see an index of the same rank, rank one, pointing into the second element in that space, counting from zero. Okay? Next, to the right in the middle, we have a two-dimensional example, where we have a three by four extent. So basically if you look at it as a matrix you have three rows and four columns. And we have an index pointing into row zero and column two. And lastly we have a three-dimensional extent. And you can talk about any number of dimensions, practically. I think we go up to 128. But nobody has ever used more than four or five. So... Now, these guys don't talk about data at all, just the shape of data and the shape of compute. Array view, on the other hand, talks about data. It basically gives you a view over data, and you give it a certain shape to say: this is what the data looks like. It's a template with two parameters. The first one is the type of the element. The second one is the rank of the array. So let's start with the example which is depicted on the bottom right. So let's say we have a vector in memory of integers. This is just a standard vector. It has 10 elements. Now, we know that we want to treat it as if it was a matrix having two rows and five columns. Okay? So how can we tell C++ AMP to treat it as such? Well, we say here's an extent, and the extent has two rows and five columns. And then we say we want an array view. And the array view has that extent, two by five, and it wraps over that vector. Okay? So that's the second argument there to the array view. So basically now, accesses to the array view are wrapping over the original vector. Now, we have a lot of shortcuts in the programming model. You could just write a(2, 5, v). You don't always have to declare those tedious extent objects when, you know, you don't need them for other things. 
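As a quick sketch of what those declarations look like in code (my own variable names, matching the shapes on the slides):

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

void shapes_example() {
    extent<2> e(3, 4);               // a 3-row by 4-column space
    index<2> i(0, 2);                // row 0, column 2 inside such a space

    std::vector<int> v(10);          // 10 integers of host data
    extent<2> shape(2, 5);           // treat them as a 2 x 5 matrix
    array_view<int, 2> a(shape, v);  // wraps over v; accesses go through to v
    array_view<int, 2> b(2, 5, v);   // the shorthand, no named extent needed
    a[i] = 42;                       // element access with an index object
}
```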
Now, once you've created this array view, you can access it both on the GPU and the CPU. And what C++ AMP gives you is coherence. We manage coherence for the array view. So if necessary we copy data back and forth, and we try to do that intelligently. >>: Can you migrate your computation between the devices? Like, could I start something on the CPU and then move it over to the GPU at [inaudible]? >> Yossi Levanoni: Yes. >>: And then come back? >> Yossi Levanoni: There are some caveats. And basically one of the caveats is what kind of performance can you expect when you access an array view? Is it the same performance as accessing the underlying vector? Now, when you access it on the GPU, it's very performant, because this is code that we generate, and we know a lot about it. So there's a specialized path there. We almost strip out anything that's not necessary to really get you to accessing the raw DX buffer underneath it. But you can also access it on the CPU elementally. And there, in this release, we have to do a coherence check basically on every access. And we don't have in Dev 11 the ability to hoist that outside of your inner loops. So in order to get good performance on the CPU when you want to work with array views, we have a performance contract -- this is the third bullet here -- that says they are dense in the least significant dimension. So it means that always, you know, when you work with array views, let's say they're representing a matrix, if I get a pointer to the first element in a row, I can run over the row using that row pointer. Okay? And then I won't be paying any cost on the CPU if I did that. But mostly these things are accessed elementally on the GPU. And on the CPU they're used kind of as bulk things that you just, you know, store to disk or load from the network or something like that. >>: Do the views have any permission type of access, like [inaudible] I'll only access this array as read only or write only -- >> Yossi Levanoni: Yes. >>: Or read and write, stuff like that? >> Yossi Levanoni: Very good question. So you can specialize it for read only. You can say I want an array view of constant -- >>: [inaudible]. >> Yossi Levanoni: -- rank two. And when we see that you've captured that and that's how you want to access it on the GPU, we know that you're not going to modify that data. We can tell that to the shader compiler. And we can tell that to our runtime, so we know we never need to copy the data back. Okay? Now, we had write only as well. But it turned out to be really confusing to people. So we no longer have write only. Instead of that we have discard. So you can basically say to the runtime: I'm going to give you this array view that I'm going to write into in my kernel. But you don't need to copy the data in, because I tell you you can discard it. Yeah. So that's what we have for write only after beta. >>: [inaudible]. >> Yossi Levanoni: Yeah? >>: Since sort of [inaudible] have on the other dimensions [inaudible] released [inaudible]. >> Yossi Levanoni: Okay. So we have, for example, a method called project. So you can take a two-dimensional array view and you can say I want to project the Nth row out of that. And that gives you an array view of rank one. So you get an array view that really represents a subset of the original data. Or you can do a section. So you have an array view representing a big contiguous matrix and you can say I want to get a window out of that. So now if you look at that window, it's no longer contiguous in memory. It's only contiguous in the least significant dimension. >>: Is it always a [inaudible]. >> Yossi Levanoni: With the API surface that we have now, there is always a multiplicative factor between each dimension. And the multiplicative factor is always one in the least significant dimension. 
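A small sketch of those access patterns; the names are mine, and I'm assuming the member names described here: discard_data, section, and the rank-reducing subscript:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

void access_patterns(std::vector<int>& input, std::vector<int>& results) {
    // Both vectors are assumed to hold 100 * 100 elements.
    array_view<int, 2> data(100, 100, input);

    // Read-only on the accelerator: the runtime never has to copy it back.
    array_view<const int, 2> readOnly = data;

    // The "write only" replacement: tell the runtime it may skip the copy in.
    array_view<int, 2> output(100, 100, results);
    output.discard_data();

    // Projection: row 5 of the matrix becomes a rank-one array view.
    array_view<int, 1> row5 = data[5];

    // Section: a 10 x 10 window starting at (20, 30); it stays contiguous
    // only in the least significant dimension.
    array_view<int, 2> window = data.section(index<2>(20, 30), extent<2>(10, 10));
}
```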
So, all right. Next up is parallel for each. I kind of talked about that briefly. This is an API call. This is how you inject parallelism. We didn't feel like we needed to add another statement to the language. The first parameter is the extent. Again, the extent that you specify for data you also specify for compute. So you can say I want a three-dimensional or two-dimensional computation. And then every index in that space gets invoked. You get an invocation per index, and you get that index as a parameter, which is depicted here in red. And the code that you write here has to be restrict(amp). So this is how we write a kernel. Now, the synchronization semantics of parallel for each are those of synchronous execution, even though it rarely executes synchronously. So underneath the covers these calls are piped to the GPU. But as far as you can tell, their side effects are observable after the call returns. And the reason for that is because we delay the check -- we delay the synchronization point -- to the time that you access the data through an array view or you copy back from the GPU. Okay? So that means that if we wanted to implement that on the CPU and take away the coherence checks, then it would really have to be synchronous. And maybe at that point we'll add an asynchronous version of parallel for each. >>: Now -- >> Yossi Levanoni: Yes? >>: [inaudible] GPU is like that data [inaudible] different some like it 32 [inaudible], some like it in 64 [inaudible]. >> Yossi Levanoni: Yes. >>: 64 tasks in each [inaudible]. >> Yossi Levanoni: Yes. >>: How do you -- do you control that? >> Yossi Levanoni: I'll get to that. >>: Okay. >> Yossi Levanoni: Hold on to that question. So restrict is the main language feature that we've added to Visual C++. We've added it such that it could be applicable to other uses besides amp. So that's another area for creativity and research. You know, people have been talking about, for example, purity or other types of restrictions you want to be able to apply to code. So now we have a language anchor for you to fill in those dots. We provide two restrictions in this release. One is cpu and the other one is amp. So amp basically is the thing that says: okay, I want to author code that conforms to V1 of the spec and can execute on an accelerator. Cpu is just the default. So if you say my function is restrict(cpu), it's as if you didn't write anything. And you can combine them. You can say restrict(cpu, amp). That means I want to be able to run this code both here and there. So the restrict(amp) restrictions are derived from what you can do pretty much using today's GPU technology. Or, to be a little bit more honest, what you can do using DX. So there are a bunch of types that are not supported. And there are some restrictions on pointers and references. DX doesn't know at all about pointers. It doesn't have a flat memory model. It has those resources that are kind of self-contained. And every kernel needs to say what resources it takes. So simulating pointers in the wild is impossible, or at least very inefficient. We've decided to provide references and pointers as long as they're local variables. So that covers a lot of ground, because people use references a lot for passing parameters between functions. So we can do that using C++ AMP. Going forward this restriction is probably going to be removed as GPU vendors are moving to flat memory model architectures. But again, it will require a change in the VM. So we will need to see how this is going to play out during Windows 9. 
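A tiny sketch of what those specifiers look like on ordinary functions (names are mine):

```cpp
// Callable only on the host; the same as writing no restriction at all.
int host_only(int x) restrict(cpu) { return x + 1; }

// Callable only from an amp context, i.e. inside a kernel.
int gpu_only(int x) restrict(amp) { return x + 1; }

// Callable from both contexts; references to locals are fine as parameters.
void scale(float& value, float factor) restrict(cpu, amp) {
    value *= factor;
}
```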
Now, today, again, in DX there's no form of real object-level linking. When you write code, it all has to kind of be combined into one executable unit. So in C++ AMP we must represent that restriction somehow. So basically all your code needs to be inlinable somehow. So one way that you could do that is write your code in header files if it's template code, or you could use link-time code gen. That works as well. So there's a laundry list of things you can't do. A lot of them have to do with no pointers to code and no pointers to data. There's also no C++ exception handling. All right. Now, one nice feature about restriction that we've added is the ability to overload on restrictions. So we had to implement a math library for the GPU. And we asked ourselves how should we do that? So let's say I see a call to cosine in your code, in C++ code that is going to be generated for the GPU. How do we know what code to generate for that? Do we just teach the compiler about all those math functions and treat them as intrinsics? That would be kind of very costly, and it's also only something that we could do as compiler vendors. We recognize that people may want to have specialization of code based on whether they're running on the CPU versus the GPU, without requiring you to completely bifurcate your code at the root of the kernel. So the way we do that is using specialization of implementations and overloaded calls for these specializations. So on the top line of code we see a normal cosine declaration. Okay? So it takes a double, returns a double. The next one is restrict(amp). Also cosine. So they only differ by their restriction. The first one is restrict(cpu), the other one is restrict(amp). And then a third one is a function that is restricted both. And then we see kernel code here. And we see a call to bar followed by a call to cosine. So the call to bar will call the single overload that we have for bar, because it's restricted both for cpu and amp, so it's callable from an amp context. The kernel function here, the lambda, is itself restrict(amp), right? So we can call bar, which is restrict(cpu, amp). And then when we call cosine, we ask: do we have an overload that satisfies that? And we see, yes, we have the second one. So we call this guy. All right. And then you can do all sorts of things with that. You could maybe trivialize some things that are not really necessary, just say they do nothing on the GPU. Logging facilities or things that are harder to implement on a GPU, you can just implement them using empty functions. Or if you know that there is a more efficient way of doing something on the CPU versus a GPU, you can do that too. 
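Roughly what that overloading looks like, using a hypothetical my_cos rather than the real library function:

```cpp
#include <amp.h>
using namespace concurrency;

double my_cos(double d) restrict(cpu) { return d; }          // CPU version (stub)
double my_cos(double d) restrict(amp) { return d; }          // GPU version (stub)
double bar(double d) restrict(cpu, amp) { return d * 2.0; }  // one body, callable from both

void kernel_example(array_view<double, 1> data) {
    parallel_for_each(data.extent, [=](index<1> idx) restrict(amp) {
        // bar has a single overload and it is amp-callable; my_cos resolves
        // to its restrict(amp) overload in this context.
        data[idx] = my_cos(bar(data[idx]));
    });
}
```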
Okay. Next up is accelerator and accelerator view. This is how we talk about where you actually execute code and where data lives. So far in the presentation you haven't seen any mention of where this thing actually happens. And that's because most of the time you can just default to a default accelerator and you don't have to worry about enumerating them and selecting one. But we have classes that allow you to explicitly do that. So you can enumerate all the accelerators that are in the system. You can set the default one. You can specify one explicitly when you allocate arrays. And you can specify one explicitly when you call parallel for each. And then you say I want to execute this parallel for each on this particular accelerator. So this is how you achieve, for example, distribution between multiple GPUs. We don't do that for you in this release. You have to kind of manage it yourself. Now, what is an accelerator? So every DX11 GPU is exposed as an accelerator. That includes two special ones, REF and WARP. REF is a reference implementation of a DX11 GPU, which basically is the spec of what a DX11 GPU is. Okay? This is what we give to hardware vendors and we tell them: this is how your GPU should behave. And so it's very slow. It's only good for testing. We've used it to enable the debugging experience. But, you know, it's not something that you would want to use in production. WARP is an implementation of a DX11 GPU on multi-core CPU. So it uses SSE, and it uses multiple threads. And it's fairly efficient. The problem with WARP is that it doesn't have all the functionality that REF has. In particular, it doesn't have double-precision floats. So it doesn't have doubles. So that's kind of painful. But, you know, if it did have doubles we would really be very happy to promote it. Because it doesn't have doubles, we tell people that, you know, it's kind of a fallback if you don't have a real GPU. And we have something called the CPU accelerator, which is pretty lame in this release, to be very honest. It doesn't really let you execute code on the CPU. It just lets you allocate data on the CPU. And this is important in some scenarios for staging data to the GPU. But this is kind of a placeholder for the next release, where we plan to use the C++ vectorizer and really generate code statically that runs on the CPU, given the kernel. Now, these are all accelerators. When you talk to a particular one of them, you use an accelerator view. An accelerator view is your context, basically, with a particular accelerator, so it's kind of a scope for memory location and quality of service. All right. Now, lastly, we have arrays. Arrays are very similar to array views except that they contain data rather than wrap over data, and that you explicitly say where you want to allocate them. So they reside on a particular accelerator and you can only access them there. We don't do coherence for arrays. You have to copy them explicitly in and out. So the way you would use it is that you always copy data into it and copy data out of it, or you generate data into it. So let's walk over the code, starting from this box here. So we have a vector with data that we interpret as an 8 by 12 matrix. And then we have an extent with the same dimensions. We obtain an accelerator from somewhere, which is not shown here. And then we allocate an array on the default view of that accelerator. So every accelerator has a default view that you can use to communicate with it, and with those dimensions, 8 by 12. So once we did that, we actually have memory allocated on that accelerator. Okay? So now we need to populate it with data. And we do it using a copy command. So we copy from the vector, using a begin and end iterator, into the array. So now it has the data that we want. And now we can do a parallel for each, which takes this array and basically increments every element by one. And after we've done that, we can copy the data back from the array into the vector. So this is the difference between array and array view. With array you're taking control both over data movement and data placement. 
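A sketch of that walkthrough in code (acc stands for whichever accelerator you obtained; how you pick it is up to you):

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

void array_walkthrough(accelerator acc) {
    // You could also enumerate them: std::vector<accelerator> all = accelerator::get_all();
    std::vector<int> v(8 * 12);            // host data, viewed as 8 x 12
    extent<2> e(8, 12);

    // Allocate memory on the accelerator's default view.
    array<int, 2> a(e, acc.default_view);

    // Explicitly copy the host data in.
    copy(v.begin(), v.end(), a);

    // Arrays are captured by reference in the kernel.
    parallel_for_each(e, [&a](index<2> idx) restrict(amp) {
        a[idx] += 1;
    });

    // Explicitly copy the results back out.
    copy(a, v.begin());
}
```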
>>: I have a question. >> Yossi Levanoni: Yeah? >>: [inaudible]. >> Yossi Levanoni: So array view is not as efficient. Yeah. Because array view needs to be able to represent a section, it always has an offset addition. So that's one cost that you pay for it. The other thing is that you may not be happy with what our runtime is doing in terms of managing coherence. So you may want to be sure sometimes that it doesn't get in your way. We tell users to use array view whenever they can, because we think that these problems are going to go away as soon as we have shared memory, which is probably going to be in the next release. So if you want to write code that doesn't do unnecessary copies, you probably want to use array views. There are also cases where you want to use an array. For example, you're doing some computation, you're doing one kernel followed by another kernel, and there's a temporary between the two. In that case, you may also want to use an array, because you don't need that data on the CPU. Yes? >>: [inaudible] is it in a different namespace? >> Yossi Levanoni: It is in a different namespace. And we've been going back and forth on whether it's okay for us to reuse class names. So we've overloaded array, and extent also exists in the std namespace. So, yeah, we just recommend users to qualify with the namespace if they need to. It just seemed like the right names to use. And that's what namespaces are for. Okay. So that's basically been the core of C++ AMP. With those restrictions, parallel for each, accelerator view, accelerator, extent, index, array and array view, you're ready to go writing simple kernels. So you can fire up VS 11 beta and write your code using those things. And that's pretty much it. So it's fairly lightweight. Now, if you want to get this 2X or 3X that I've shown you in the N-Body simulation, you need to kind of get closer to the hardware. And you do that using what we refer to as tiling. So tiling basically allows you to gain access to the GPU's programmable cache. It allows you to have stronger guarantees of scheduling between threads, such that they can cooperate with barriers and exchange data. So, you know, you can basically do more localized reductions, let's say first within tiles, before you do it over the whole data. Finally, there are also benefits that are more mundane and low level. When we have, for example, an extent like this, eight by six, and you tell us to map that to DX, DX actually is kind of regimented in the way it wants to schedule threads. And if you give us, let's say, a five-dimensional extent with, you know, primes as the numbers of threads, we have to go through a fairly sophisticated index mapping from the logical domain to the physical domain, the one that DX actually schedules. Then when we run on the GPU, we have to do the reverse mapping, give you back your logical ID. Okay. So this has a runtime cost. But in addition, if you want to generate vector code, it can completely throw the vectorizer off, because, you know, what would have looked like a vector load becomes a gather, and what would have been a vector store becomes a scatter, and things of that nature. So basically with tiles you can express more local structure for your computation. Now, on the right-hand side we see the same extent being tiled into three by four tiles. 
So now, once we've done that, we've basically told the hardware -- we request you, hardware, to always execute groups of 12 threads concurrently. And basically what the hardware is going to do is it's going to make this tile of threads resident. Because the tile is resident on the core all together, the threads can refer to fast cache memory, which is also referred to as scratch pad memory. They can put data there and they can coordinate using this cache. And they can reach barriers together. Okay. So they execute from start to finish together as kind of a gang of threads. And then when the hardware is done executing those threads, it can get a new tile and execute it. So you basically get a more intimate scheduling contract with the hardware. Now, to give you some feel as to what the real numbers are, the extents are determined dynamically, so you know you could ask to schedule a few million threads. But the tiles are always determined statically. They are specified as template parameters, as you see at the tile call. So we do e.tile<3, 4>. It has to be known to the compiler. And the maximum number of threads in a tile is 1K. Okay? And typically you do something like 256 threads in a tile. So it's kind of really different orders of magnitude. And you can choose whatever you want. Here's another example, two by two. Now, going back to your question, when you do tiling, you really want to be aware of the width of the hardware vector unit. So basically you want the least significant dimension of your tile to be at least as wide as the vector unit, because otherwise you're going to be wasting part of that vector unit. So the rule of thumb that we use is: use 64. Today hardware uses four for the CPU on SSE, then 16 on AMD. >>: No, 64 AMD -- >> Yossi Levanoni: 64 on AMD. >>: [inaudible]. >> Yossi Levanoni: Right. Right. So we think that 64 is going to stick for a little while longer. This is something that we're not really, really happy with, that you have to bind your code to a particular number. Another interesting area of research is whether we can bring into Visual Studio some form of auto tuner, or a way to instantiate that in a manner that will allow you to upgrade it in a more regular manner as the hardware does. >>: [inaudible] query, the hardware does not -- >> Yossi Levanoni: This is not part of the DX model. So if you take advantage of that -- so, for example, you cannot really rely on that. And CUDA tells you that you can take advantage of that: that if you execute, let's say, instruction one using this WARP width and then you move to instruction two, then you know that everything in your warp has already done that for the first line. So you know that the threads in a warp are executed in unison. They give that guarantee to developers. And because they do that, now the compiler is totally constrained. They cannot reorder instructions. So DX has decided not to give any guarantees like that. If you want to take advantage of that, you still can. You will need to put fences between your instructions, in DX or in C++ AMP. And your code is not going to be portable between hardware vendors. So once you've tiled your computation like that, when you do e.tile using some tile dimensions, you get a tiled extent. It's another class. It has up to three dimensions. And what you get in your lambda is a tiled index. 
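A sketch of the tiling call and the tiled index you get back (the tile dimensions have to be compile-time constants and, in this release, have to divide the extent evenly; the property names are the ones described next):

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

void tiling_example() {
    std::vector<int> v(6 * 8);
    array_view<int, 2> data(6, 8, v);

    // Carve the 6 x 8 domain into 3 x 4 tiles of 12 threads each.
    tiled_extent<3, 4> te = data.extent.tile<3, 4>();

    parallel_for_each(te, [=](tiled_index<3, 4> tidx) restrict(amp) {
        index<2> global = tidx.global;       // position in the whole domain
        index<2> local  = tidx.local;        // position within my tile
        index<2> tile   = tidx.tile;         // which tile I am in
        index<2> origin = tidx.tile_origin;  // first position of my tile
        data[global] = tile[0] * 100 + local[0] * 10 + local[1] + origin[1];
    });
}
```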
Now, a tiled index is pretty much like an index, but it has more information about where you are in both the global and the local index space. So it tells you things like: what is my global index; what is my local index, which is my index within my tile; what is my tile index -- because there is also an extent of tiles -- and what is the first position in my tile? So those are all things that are deducible. But we have them precomputed, so, you know, we provide access to them using those properties. Any questions on that so far? No. Okay. All right. So now we get to the interesting part. This is what you get. Okay? So you do all this hard work. What do you get? You get the ability to use tile static variables. So tile static variables are variables that are local to a tile. So it's memory, and it's going to be placed in the GPU programmable cache. So you know that you get very good access to it. So how could you use it? If there's data in global memory that many threads are going to access repeatedly, they could collectively load that into tile static variables, then do their computation over tile static memory, and then when they're done they can store it back to global memory. Okay? Now, this coordination that works in phases requires them to use barriers. And that's what tile barrier is about. So all the threads will do some collective thing. And when they are done with that, they're all going to reach a barrier. Then they know that the data is ready in tile static memory and they can move to the next step. Okay. So in order to show tiling I'll work with the matrix multiplication example. This example does not use tiling. So on the left-hand side we see the sequential version. And on the right-hand side we see a simple amp version. Now, the thing to note here is that we've replaced the doubly nested loop with a single parallel for each invocation. Okay? So this is a two-dimensional parallel for each. Each one of these loops is totally independent, so we can do that. And now, basically what we do in the body of that lambda is compute the value of C at position IDX. Okay? So this is kind of independent. We can do this thing. Now, each thread, or every, you know, row-column combination here, basically computes this sum of products. Okay? And in the simple computation this is the same thing. This was a sequential loop and this remains a sequential loop executed by a thread. Okay. So this is how we would do it using simple C++ AMP coding. Now, if we want to tile it, we have to do a bunch of stuff. So we have to decide what is going to be the tile size. So we decide it's going to be 16 by 16. Okay? So we take our extent that we had before, and we tile it, by 16 by 16. And now we get a tiled index that's going to represent our thread. And then we can extract from that the row and the column as we did before. Now, the interesting thing here is that we had here a single loop running over the entire input, the left-hand side matrix row and the right-hand side matrix column. What we have done basically is we've broken this loop into two loops. The first one runs over tiles. And the second one runs inside the tile. Okay? So in the outer loop we basically collectively fetch global data into tile static variables. Which are -- sorry. Oh, there they are. Right here. Okay. So we all do a single load. Each thread does a single load and loads it into these tile static variables. And then they all wait for everybody else to complete. And now we can do the partial sum of products in tile static memory. So if you do the math, that saves a factor of the tile size in global memory accesses and replaces them with scratch pad memory accesses. So this again gives a factor of two or three better solution. Okay. 
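Here is roughly what those two versions look like, a minimal sketch that assumes matrices whose dimensions divide evenly by the 16 x 16 tile size (the variable names are mine, not the slide's):

```cpp
#include <amp.h>
using namespace concurrency;

static const int TS = 16;

// Simple C++ AMP version: one thread per element of C.
void MatMul(array_view<const float, 2> a, array_view<const float, 2> b,
            array_view<float, 2> c) {
    c.discard_data();
    const int W = a.extent[1];                     // the shared dimension
    parallel_for_each(c.extent, [=](index<2> idx) restrict(amp) {
        float sum = 0.0f;
        for (int k = 0; k < W; k++)
            sum += a(idx[0], k) * b(k, idx[1]);
        c[idx] = sum;
    });
}

// Tiled version: each 16 x 16 tile stages blocks of A and B in tile_static
// memory, synchronizes on the barrier, then accumulates out of the fast cache.
void MatMulTiled(array_view<const float, 2> a, array_view<const float, 2> b,
                 array_view<float, 2> c) {
    c.discard_data();
    const int W = a.extent[1];
    parallel_for_each(c.extent.tile<TS, TS>(),
                      [=](tiled_index<TS, TS> tidx) restrict(amp) {
        const int row = tidx.global[0], col = tidx.global[1];
        const int lr  = tidx.local[0],  lc  = tidx.local[1];
        float sum = 0.0f;
        for (int i = 0; i < W; i += TS) {
            tile_static float locA[TS][TS], locB[TS][TS];
            locA[lr][lc] = a(row, i + lc);         // each thread loads one element
            locB[lr][lc] = b(i + lr, col);
            tidx.barrier.wait();                   // wait for the whole tile to load
            for (int k = 0; k < TS; k++)
                sum += locA[lr][k] * locB[k][lc];
            tidx.barrier.wait();                   // don't overwrite data still in use
        }
        c[tidx.global] = sum;
    });
}
```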
So any other questions about tiling before I move to the last section here? >>: Is there control over [inaudible] usage, how many registers does your compiler -- >> Yossi Levanoni: Yeah. Unfortunately we don't have that, so, yeah. You can play all sorts of games. And if you have an interesting application where you need that, you should probably hook up with our development team. We've worked with people trying to work around this DX limitation. But, yeah, it's very indirect, the type of influence you have over that. >>: What happens [inaudible] like in the CUDA world it just says you can't compile. >> Yossi Levanoni: In the CUDA case, I think they actually spill. >>: They will spill. >> Yossi Levanoni: They will spill. >>: [inaudible]. Yeah. The old days it just says like you're out of registers. >> Yossi Levanoni: Right. >>: So what's the equivalent [inaudible]. >> Yossi Levanoni: You also [inaudible] into similar. >>: [inaudible]. >> Yossi Levanoni: No. We will tell you you're out of registers. >>: But we won't tell you your program is going to run slow because I can't generate enough threads, because I don't have enough registers to generate as many threads as you need. >> Yossi Levanoni: We have some information on that, and we're working on exposing it. So I'll talk about the concurrency visualizer in a bit. >>: There's no way to access the [inaudible]. >> Yossi Levanoni: That's my next topic. >>: [inaudible]. >> Yossi Levanoni: Yes. So the next topic is DX integration. So basically our goal has been to allow you to interleave your computations with what you would do otherwise with DX. Many times the pattern is that you compute something, you leave it on the GPU, and then it's available for vertex shaders or pixel shaders, or directly for blitting to the screen. So we wanted to facilitate this kind of interop. And you get interop on data and on scheduling. So the array view class -- you can interop between array view and an ID3D resource. And this represents a byte address buffer in HLSL, so it's basically unstructured memory. And you can interop between accelerator view and ID3D device and ID3D device context. So you can go either way. Okay? And that also allows you to use some advanced features that are not available through the C++ AMP surface area. For example, in Windows 8 they've added the capability to control TDR, timeout detection and recovery. Because the GPU is a shared resource, but it's not fully [inaudible] like your CPU, there is a timeout mechanism for long-running shaders which appear stuck to the OS. They've done a lot of improvement on that in Windows 8. But you have to kind of buy into it. So you could do that by configuring your device context. Then you can use it as an accelerator view in C++ AMP. The N-Body simulation, for example, uses that. In every frame you execute both C++ AMP code and DX code for rendering. Now, in addition to that, while you're inside the body of your kernel, you can also use about 20-some intrinsics that we've exposed -- that we are channelling to HLSL. They have to do with min, max, bit manipulation, this type of stuff. They are not part of the core spec. We're thinking of putting some of them back into the core spec. People have asked for that. All right. 
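As a sketch, the interop entry points live in the concurrency::direct3d namespace; I'm writing the overloads from memory, so treat the exact signatures as an assumption rather than a reference:

```cpp
#include <amp.h>
#include <d3d11.h>
using namespace concurrency;

void interop_example(ID3D11Device* device, ID3D11Buffer* buffer) {
    // Wrap an existing D3D11 device as an accelerator_view, so C++ AMP kernels
    // and your rendering share the same device (and its TDR configuration).
    accelerator_view av = direct3d::create_accelerator_view(device);

    // Wrap an existing D3D buffer as a C++ AMP array of 1024 floats.
    array<float, 1> data = direct3d::make_array<float, 1>(extent<1>(1024), av, buffer);

    // Going the other way: get the D3D interfaces back out (remember to Release them).
    IUnknown* d3d_device = direct3d::get_device(av);
    IUnknown* d3d_buffer = direct3d::get_buffer(data);
}
```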
The next thing is what you can do in terms of textures. That's the main thing that we do in the amp graphics namespace. So textures allow you basically to have some sort of a one-dimensional, two-dimensional, or three-dimensional addressable container that you can sample or address at a particular integer index and get back the texel. The texel is typically a short vector; it can be either a scalar or a short vector. That one is an int_3, so it's a tuple of three integers. That's what you get once you sample the texture. So the first thing that we've done is provide our short vector types. For example, int underscore 3 is a tuple of three integers. There are also things called norm and unorm. These are float values that are clamped to, let's say, zero and one, or minus one and one. That's the difference between norm and unorm. Unorm means unsigned norm. And then you have short vectors of these up to a length of four. Okay? So this is something that is very familiar to graphics developers. Another thing that graphics developers are used to is the swizzle expression. So you can say my vector dot yzx equals this int_3. And this is not a very good example, because it just reorders the components, but it allows you to change the order in which you access the different components, or select a subset of them, these types of things. So we expose many swizzle expressions. But we don't expose all of them. The syntax in HLSL basically allows you to write any combination of characters, including repetition and the constants zero and one. Okay? So if you look at the combinatorial space for that, it totally explodes. And we were really scratching our heads how to expose that in C++. And we ended up really trimming it down to the most common combinations. But we think, looking at bodies of HLSL code, we cover everything that people are using. And if anybody's interested in language surface area investigation, that would be an interesting topic, you know, how can you add to something like C++ more, let's say, compile-time dynamic properties? Okay. So once you have those short vector types, you can represent textures. And basically the short vector types and scalars can be the element type of a texture. So if you look at the DX spec for textures, it's got lots of gotchas. For example, you can't have textures of vector length three. You can't have a texture of int_3. There are also lots of limitations on when you can read and when you can write. You can almost never read and write simultaneously. I mean, your shader either reads or writes, except for some really common but special cases. There are also lots of different encodings that are supported. Your data doesn't have to be in the same data type that the texture is. The texture provides a sort of data transformation for the underlying data type. So it's a really, really big area. However, you can interface with it. We don't cover all of the options that you can construct a texture with. We cover the most common ones. And we have an interop path where you can say: here's a texture that I created from DX, please treat it as a C++ AMP texture. 
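A small sketch of the short vector types and a texture declaration (these come from the amp graphics headers; the swizzle set is limited, as just described, and the texture constructor shown is an assumption about the most basic overload):

```cpp
#include <amp_graphics.h>
using namespace concurrency;
using namespace concurrency::graphics;

void graphics_types_example() {
    int_3 v(1, 2, 3);
    int_3 w = v.yzx;                  // swizzle: reorder the components
    float_4 color(0.5f, 0.25f, 0.0f, 1.0f);
    unorm u(0.75f);                   // clamped to [0, 1]; norm clamps to [-1, 1]

    // A 2D texture of 4-component texels. Three-component texel types are one
    // of the DX gotchas mentioned above and aren't allowed.
    texture<float_4, 2> tex(480, 640);
}
```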
Now, one thing that was really painful in this release is that we had to cut sampling. And this is really what most people like textures for. It means that you can treat the texture as if it had a mapping with real coordinates in its space. So I could index it and say: give me the texel at position 2.5 and 3.7. And that gives me a point that is not in the center of a texel but is closer to some other texels. And then it can do linear interpolation or other interpolations. And that gives you this kind of smooth texture effect that you get in games. And it's a really nice facility. But we ran out of resources and we had to cut it. So what you can do with textures today in C++ AMP and Dev 11 is access them basically as arrays using integer indices. >>: [inaudible]. >> Yossi Levanoni: But we're really -- huh? >>: [inaudible] instead of [inaudible] I'm sorry, all four -- >> Yossi Levanoni: Yeah. Yeah. We had to cut the [inaudible]. All right. So that's it, pretty much. We covered everything that's really important in the programming model. Yeah? >>: [inaudible] so one of the nice things about direct compute is you've got these shader resources that you can use in compute, use in graphics [inaudible] exactly the same model, which is very nice. It seems like you're kind of throwing that away to create more like CUDA, DX or [inaudible] separate for a [inaudible] memory here and if you do it in [inaudible] it's very painful. Is that the reality or [inaudible]. >> Yossi Levanoni: That's the reality of it, yeah. I won't dispute that. And the question is how do we move forward. So, yeah, the focus was compute. >>: [inaudible] either it's all or nothing, right? I mean, you could [inaudible] bring everything into C++, all the [inaudible] the whole entire model or -- >> Yossi Levanoni: Yeah. I think that's where we would like to be. And the question is when, and what's going to be the customer demand for that. But if the team had its desires, you know, admitted, then we would go and target at least the most common things, common shader types, such that you wouldn't have to leave C++ at all. That's basically the productivity value proposition: that you stay within C++, you stay within the same tool set. And if you look at it from the DX side, they also have an opportunity here. Because it seems like a lot of the shaders could be recast, or even some higher-level APIs could be recast, in terms of compute. So you could, for example, take like, I don't know, Direct2D and implement it on top of C++ AMP -- yeah, a modern C++ version of it, right? The question has always been about resources: how much are we going to get done in each release. Yeah. So things that I'm not going to cover are things related to the memory model, atomic operations, memory fences, some diagnostic things, the types of exceptions you get from our runtime, and the math library. The math library is really interesting. We worked on that together with AMD. They basically implemented it for us. And it's kind of a clean room implementation that is very accurate and -- yeah? >>: [inaudible]. >> Yossi Levanoni: Huh? >>: Do you have scan -- >> Yossi Levanoni: No, we don't. So one thing that the team is working on right now is looking into the higher tier of libraries. And we have Thrust, from NVIDIA, that is basically also in that space, providing the standard library-like algorithms for their CUDA programming model. So they have reductions and scans and sorts. So this is one area that we've started to invest in. And this is going to be an open source project. So we have already published it -- sorry. 
We're kind of in the last review stages of the first version. We're going to push it onto the web. And we have all sorts of customers that are interested in contributing to that. The other thing we're doing is investing in BLAS, linear algebra. And we are also investing in random number generators. If anybody here wants to chip in or give us requirements on the next set of libraries, please do. Yeah? >>: [inaudible]. >> Yossi Levanoni: FFT we actually published as a sample. That was easy to do, relatively speaking, because DX provides an FFT library. It's a little bit outdated. It originated here in MSR. And DX basically wrapped it in a Windows API. And you can use it if you have DX. So we provided a sample that shows how to use it from C++ AMP and kind of gives a nice C++ AMP facade to it. But we don't have an amp library that implements FFT from scratch. And there are some opportunities there. Naga's [phonetic] team had also created an auto-tuning framework for FFT. I think it could be great if, you know, somebody looked into bringing that back to life and maybe porting it over to C++ AMP. So generally speaking, the C++ team is not well staffed to do breadth libraries or domain-specific libraries. And, you know, breadth libraries are kind of a lower bar, and even that is very difficult to get done. So we're trying to enlist the community for that. So, yeah, if you're willing to contribute, that would be awesome. Yeah? >>: I'm just wondering what [inaudible] this for and [inaudible] is that I could [inaudible] then it seems like [inaudible] and so I'm wondering how do you think that that is going to get better where we keep [inaudible], you know, like rendering something where we don't know exactly [inaudible] and how do we see that [inaudible]. >> Yossi Levanoni: I think you -- yeah, so are you alluding to what I said, that some other shaders could be expressed in terms of compute, or are you just asking -- >>: One whole model is being [inaudible]. I don't know that it's [inaudible]. >> Yossi Levanoni: No, we think it is, necessarily. We looked into streaming, which is kind of a key component of [inaudible]-like algorithms, right? You know, algorithms that generate a variable number of outputs and maybe, you know, have kind of a loop back to feed themselves with that. Yeah. So we don't have conclusive findings on how to expose that yet. Yeah. Very interesting. Yeah? >>: So inside of a for each you're not allowed to call another [inaudible]; is that right? >> Yossi Levanoni: That's right. >>: So have you looked into maybe doing that, so you can nest it? >> Yossi Levanoni: Yes. The hardware vendors are talking about the next level of things that you should be able to do. So GPUs should be able to do quite a number of things in the next rev of major hardware. They could launch something asynchronously to themselves, kind of a continuation. They should be able to do stream-like things, you know, kind of a parallel do-while, right? A while-I-have-more-work loop, and in the loop I generate more work for myself. Maybe a little bit like tessellation. They should also be able to express synchronously nested parallel loops. But I'm not sure that really maps well to the hardware. But they could represent that as continuations, I suppose. If they have continuations then they could do something like that as well. 
>>: [inaudible] like if you have a -- let's say a [inaudible] walk down the tree [inaudible] thing to do is [inaudible] you have two things and [inaudible] you have four things but [inaudible] across the whole tree. You unfold the whole tree and flatten it. Could you do that with software instead of [inaudible] hardware? [inaudible]. >>: Yeah. I mean, there is a project in MSR Cambridge that [inaudible] take like -- you have this data type called parallel array and [inaudible] just kind of get flattened out [inaudible]. >>: So parallel [inaudible] for this type of thing. Yes. >> Yossi Levanoni: Yeah. No, we didn't look into that very carefully. We looked at it from kind of a syntactical perspective, saying, okay, if you did want to allow nested parallel for each, would you be able to do that? And that's part of our versioning strategy for AMP. So if you look at restrict(amp), we actually designed and implemented the versioning mechanism for that. So you could say restrict(amp:2), and that opens things up for you to allow more things. And -- but then all the things that are for V1 will always be kind of a better optimization target, because you know that they don't have nested parallelism, you know that they don't do a bunch of stuff. They don't have virtual function calls. They don't have pointers into the heap, right? So we think that the V1 that we're specifying is always going to be kind of a good breadth target for optimization, you know, like matrix multiplication. There's no reason you would ever want to go beyond AMP V1 to express that. And that gives the compiler a lot of knowledge. >>: [inaudible]. >> Yossi Levanoni: Yeah? >>: So can you talk a little bit about what you compile into? You say you built on top of DX. Do you use the HLSL compiler at some point? I mean, what exactly [inaudible]. >> Yossi Levanoni: Yeah [inaudible]. >>: How do you exactly get [inaudible]. >> Yossi Levanoni: Okay. So we're built into the C++ compiler. We ingest everything in the front end. And what we do in the front end is basically outline those parallel for each payloads. That creates the stub of execution on the GPU. But to the compiler at that time it just looks like a function. Then we get to the back end. We recognize those functions. They have a certain bit that says that's a GPU kernel. And at that point we apply some specific processing. For example, we take it and we do full inlining. We do pointer elimination. We invoke the right intrinsics, okay? And then we're ready for code generation. And the code generation goes to HLSL source code. But the source that we generate looks like assembly code, you know. It's basically, you know, variable equals variable, or expressions are binary expressions assigned to a temporary. So, yeah, it's -- >>: [inaudible] do they still own that compiler? >> Yossi Levanoni: The HLSL compiler? >>: Yeah. So you inherit all the bugs? >> Yossi Levanoni: We have -- we have -- they owe us -- we worked very, very closely, and we have found so many bugs that have been fixed. I mean, they've done a really Herculean effort to -- >>: [inaudible]. >> Yossi Levanoni: Say again? >>: It's graphics guys writing compilers. >>: Yes. >>: That's true. >> Yossi Levanoni: So organizationally, let's say, it wasn't the best thing possible, but I have to say that everybody did the best that they could to make it work. And going forward we would want to own code generation all the way down to the bytecode.
>> Yossi Levanoni: Oh, yeah [inaudible]. >>: [inaudible] handle all the details [inaudible]. >> Yossi Levanoni: Yeah. >>: [inaudible]. >> Yossi Levanoni: Right. No, but more than that, if you talk about C++, C++ developers are not crazy about JIT, right? And it seems like we have an opportunity to go lower level than the bytecode and allow you to generate machine code all the way from the compiler. Or even something that is not exactly machine code -- maybe something that you just need to run one single pass of finalization over, let's say to determine offsets into structures or the encoding of the instruction set. So these are directions that we want to explore in Windows 9. Yeah? >>: So what's your -- what's your update pathway and how often are you expecting new versions? >> Yossi Levanoni: The update for C++? >>: Yeah. >> Yossi Levanoni: So the -- it's just part of the compiler. And so far it's been two or three years at least. >>: So [inaudible]. >> Yossi Levanoni: Yes. Yes. But the libraries are separate. The libraries we'll be able to update all the time. And the compiler, if I understand correctly, there's a very strong desire to change -- to change the cadence with which it is delivered. >>: So it links to the studio and not to the [inaudible] 11 or 12? So we'll get new versions as there are new versions of Dev Studio, not as [inaudible]. >> Yossi Levanoni: Yes. There will be -- we have some flexibility to do things in service packs. But, you know, it's still kind of early to say exactly how Dev 12 is going to play out. >>: So one question about the compiler again. You said you did the outlining in the front end. And then are you writing that all to CIL? Is the one [inaudible]. >> Yossi Levanoni: We are writing it all to CIL, but it gets outlined. So you basically get two functions. You get the original function that contained the parallel for each; it now contains a call to a runtime trampoline. And you get the new function that is also expressed in CIL, but that gets translated into GPU code. And then at runtime you call into this runtime trampoline that sits in our DLL -- our runtime DLL. And that basically knows how to rummage through your EXE and find the bytecode that we buried there and initialize DX with that and call it. Let me show you some nice slides about what we've done in the IDE. This is the concurrency visualizer. We've added both DX and C++ AMP capabilities so you can see basically kind of the high-level dynamics of what's going on. When do I have a data transfer? When -- you know, when am I blocking? What are the overall statistics? We have prototyped more data collection that talks about divergence and spilling and that type of stuff. But it hasn't -- it's not in the beta, and we weren't able to ship it. The concurrency visualizer, in previous releases, we've been able to ship out of [inaudible], so this is one thing where we can innovate more quickly. Although we don't know how many resources we get for that going forward. Unfortunately, most of that team is no longer in DevDiv, so we're not sure that we will really be able to turn the crank on this as fast as we want to. And the other thing where we had much more investment and innovation is in the debugger. So basically everything that you're used to on the CPU works on the GPU. And you get new experiences. One of the cool things that we've added is a GPU threads window. The number of threads that you work with in a parallel for each is not like, you know, 10 or 20.
You're talking about millions of threads potentially. And you really want to treat them as data. You want to be able to tabulate them and search them and flag some of them and do filtering operations. And the GPU threads window lets you do that. And complementing that we've added a parallel watch window, which we've added both for GPUs and CPUs, and it gives you this tabular view of variables across threads. So again, you can search here, you can filter threads by values, and basically see a lot of data across lots of threads simultaneously. You can also export these types of things to Excel and do further analysis over there. And we have the parallel stacks window that we used to have on the CPU; that's also available on the GPU. Yeah? >>: [inaudible]. >> Yossi Levanoni: So we have defined a debugging API. And the hardware vendors will have to implement that. Now, NVIDIA is working on that, and they're going to be done by RTM. So we demoed this together with NVIDIA at Supercomputing last year. So they're going to have, as they do now, the best developer experience on their hardware. AMD said that they'll start working on that soon. And you also have the REF GPU emulator that you can debug with. Okay? So if you don't -- if you don't have an NVIDIA card, you can fall back to that. And you can use that on your laptop, you know, riding on the bus. And one last thing that we've added, which you can't see here, is this runtime assist for race detection. So now you're working with all these threads. You can now enable these hazard checks that will break you into the debugger if we see that you have a read and a write that are not correctly synchronized, or if you're missing a barrier where you need to have one. All right. So, yeah, that's pretty much it. The thrust of what we are doing is that it's open, it's high level, and it gives people a lower barrier to entry into the space. And it's mappable to new hardware, both on the CPU and the GPU. And it's available to anybody who's downloading VS 11. You get it in the Express SKU. So, you know, the idea is really to democratize this space and allow people to experiment with it and utilize their hardware. Thank you. [applause]
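As an aside on the barrier point above, here is a minimal sketch of the kind of pattern those hazard checks look for -- an illustrative kernel, not code from the talk, assuming the data extent is a multiple of the 256-thread tile size:

```cpp
#include <amp.h>
using namespace concurrency;

// Threads in a tile write tile_static memory, then every thread must reach
// the barrier before any thread reads another thread's slot. Removing the
// wait() below is exactly the missing-barrier race the debugger assist is
// meant to flag.
void reverse_within_tiles(array_view<int, 1> data)
{
    parallel_for_each(data.extent.tile<256>(),
        [=](tiled_index<256> tidx) restrict(amp)
    {
        tile_static int buf[256];
        buf[tidx.local[0]] = data[tidx.global];

        tidx.barrier.wait();   // make all writes visible before any reads

        data[tidx.global] = buf[255 - tidx.local[0]];
    });
}
```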