>> Shobana Balakrishnan: I'd like to introduce you to Yossi Levanoni. He's
been at DevDiv now for about six years. By the way, my name is Shobana. I
just decided to host him because I thought this talk had broader implications, not
just for MSR but, in fact, for what we're doing with FPGAs. Hopefully he'll allude
later in the talk to how we can extend this to FPGAs as well.
Right now he's focused on GPUs. And he'll talk about what they've done in Dev
11 and hopefully, even beyond that, a few ideas for the future as well. So
thanks, Yossi, for being here. And I'll hand it over to you.
>> Yossi Levanoni: All right.
>> Shobana Balakrishnan: Take it from here.
>> Yossi Levanoni: Thank you, Shobana. So how many have heard here about
C++ AMP or played with it in the past? Okay. All right. So I think I've prepared
the wrong talk for the audience. This is going to be an introductory talk. So
probably it's going to be things that you've already seen. But you'll notice kind of
new syntax along the way that we have just introduced in beta. So that's going to
be one added value of this talk.
And please tell me if you think the pace is too slow and
you're familiar with what I'm showing, and then we can kind of go faster to the
Q and A section, and we can drill into questions about the future of C++ AMP
and possible research directions for you guys to look into in conjunction with
what we've been doing.
So and this is going to be FTE internal only, so we'll be free to talk about future
product directions.
Okay. So the agenda is first setting some context for C++ AMP, why we've done
it and the opportunity. Next thing is the programming model. Then we'll talk
about the IDE enhancements that we've added to Visual Studio. And I'll wrap up
with a summary and Q and A and hopefully that will be a big part of the talk
today.
Okay. So I'm going to do something that is a little bit of a no-no. I'm going to
show you a visual demo over a Terminal Server session. And this is an
N-Body simulation.
So N-Body simulation is a simulation that demonstrates gravitational forces
between N distinct bodies. Each body exerts a force on every other body, so it's
kind of an N-squared computation. We simulate it at every time increment, and
there we compute the forces. That accrues to a vector of acceleration, which adds
to the velocity, and that changes the location of the bodies.
So this is supposed to be a visual demo. But we should focus on this number
here where my cursor is showing. That's the gigaflops, basically how many
computations we're doing per second.
And we're starting with 10,000 bodies. Now, this is on a single CPU core. This is
a Core i7 machine. And we see that it takes about, I don't know, 12 percent of
the CPU compute power when we look here at Task Manager.
Now, this has been written to take advantage of SSE4 instructions. Okay? So
we're getting about four gigaflops. Now, if we go to multi-core, we have another
version of the same computation. I'm going to erase the particle so it looks nicer.
And we see that we got about 4X increase in the gigaflops that we're getting.
And we see that all the cores are pretty much pegged. Maybe we need to feed it a
few more bodies to really drive it to a hundred percent CPU.
But that's kind of a pretty good scale. We've done that using PPL. So now all
cores are busy. And that was a straightforward addition to the code. Everybody
here familiar with PPL? Okay. All right.
So now, the next step up is going to be to a straightforward C++ AMP port. And
we've jumped to, I don't know, 60 gigaflops. But we need to feed it a little bit
more bodies in order to saturate the GPU. So we have about 40,000 bodies
now. And we are at about 200 gigaflops.
Now, this is an NVIDIA GTX 580, a fairly high-end GPU. This is one of the reasons
I'm sending this through Terminal Server: the machine is really gigantic and
it was difficult for me to schlep it here. So sorry for being lazy. So we get about
200 gigaflops just using that one GPU.
And you see that the CPU is not busy at all now, and I can use my machine for
other things. So basically I get the benefit of kind of both, the GPU and CPU
working on the compute.
Now, if I endeavor to make it a little bit more optimized using what we call tiling,
which I'll talk about more later, we are now getting 370 gigaflops. And I think
if we take it all the way to 80,000 threads, we are getting close to 500 gigaflops.
So we got another factor of two or something like that by optimization.
And finally we can now spread the computation over multiple GPUs. So that
machine has two NVIDIA GPUs, two GTX 580s, and we're getting something like
twice the 400 gigaflops that we got before. So it scales pretty linearly across
those two GPUs.
All right. Now the terminal server has its tax. It's kind of serializing and it has its
overhead. So if I run it locally I get fairly close to -- I get above 900 gigaflops.
Okay. So that's basically the power of the GPU that we want to harness. And
this is what C++ AMP was mostly about in this release.
So how does it happen that you can get this type of acceleration on the GPU?
So we have here a slide juxtaposing GPUs and CPUs. I think you can see
in the graphic that the CPU has a bunch of colored areas around it. It has a lot
of specialized logic. It needs to be able to execute any code, starting from
your favorite '95 game to the latest version of Office.
The GPU, on the other hand, is much more regular. You see many, many
identical processing elements. And even within those processing elements there
is much more silicon dedicated to executing math than to scheduling. Okay?
So the GPU is really kind of dumb, in a way. It knows how to crunch a lot of
numbers. As long as you can give it kind of straight-line code, it does it very,
very efficiently. It has a wider pipe to its on-board memory.
So once you get the data onto the GPU, it's more efficient to actually
access it, compared to the CPU going to RAM.
On the other hand, it has smaller caches compared with the CPU. It doesn't
have all these deep pipelines and elaborate prediction logic. Instead it dedicates
that silicon to actual computing. And therefore it runs at a lower clock rate.
The way the GPU maximizes the memory bandwidth is by
presenting a lot of memory requests to the memory controller unit, and it does that
through parallelism. So you have lots of threads. Each one of them can ask
to access memory. There's logic to combine those requests and present kind of
chunky transactions to the memory controller. And this is how the GPU
gets better utilization from the memory controller unit.
The CPU, on the other hand, has these very deep pipelines, you know, so it
looks inside the instruction stream for things to predict, you know, what you're
going to want to access next. And this is how it tries to maximize the memory
bandwidth.
So it all boils down to CPU being very adequate for mainstream programming,
general purpose programming. And GPUs are more suited for niche or parallel
programming or mathematical programming.
Now, Intel and AMD and other CPU manufacturers, seeing that, you know, there's
so much action on the parallel side of the hardware, are definitely not,
you know, staying idle. They're also investing in wider vectors. Okay? So the
AVX instruction set that will go to 512 bits and probably bigger
than that.
On the other hand, the GPUs also want to become more general purpose.
So they're adding capabilities that weren't there just a few years ago. So in
terms of capabilities they're getting closer together. But also in terms of
topology you're seeing AMD, with the Fusion architecture, taking the GPU
and the CPU and putting them together on the same die. And NVIDIA is working
on the Denver architecture. Other manufacturers are doing the same thing.
So we're going to be looking at hardware that has both GPUs and CPUs fairly
tightly together, looking at the same underlying memory. And that's going to
be fairly mainstream on your slate and phone and desktop.
So we've designed C++ AMP with that in mind, even though our main target for
Dev 11 has been discrete GPUs; that's what's been available for us to
design to. We think that C++ AMP can evolve easily to those types of shared
memory architectures.
Okay. So the approach we've taken with C++ AMP has been to optimize for
these three pillars: performance, productivity, and portability. So the
performance you basically get through exposing the hardware. That's kind of
the easy bit, actually. The more difficult part is how to make it productive and
portable, especially from a Microsoft perspective.
So with respect to portability we have made it well integrated into -- sorry, with
respect to productivity, we've made it well integrated into Visual Studio. It's a
first-class citizen in the C++ dialect that we have in Visual C++. It's modern. It
looks like STL. Okay?
Now, in terms of portability, it builds on top of Direct3D. So you immediately get
the portability that Direct3D, as kind of a virtual machine architecture, gets you
across the hardware vendors.
But we've also opened up the spec and we are trying to entice compiler vendors
to provide their own clean room implementations of C++ AMP. And hardware
vendors are also interested in that as kind of a way to showcase their
architecture.
We've also kind of outlined how the spec will evolve for future versions of GPUs,
such that hardware vendors like AMD and NVIDIA have the confidence to know
that if they buy into this vehicle they will be able to expose future features using
C++ AMP.
All right. So that's been the context of why we're doing it and how we've gone
about it. Now, let's drill into the programming model. Unless anybody has questions
they want to ask right now. Yeah, sure?
>>: Can I ask a [inaudible] question?
>> Yossi Levanoni: Oh, yeah.
>>: Why would I want to use this versus CUDA or OpenCL?
>> Yossi Levanoni: Okay. So if you look at those alternatives, each one of
them approaches this through a different axis.
So CUDA is becoming more modern and it has Thrust. And that allows it to look
like modern C++ to a large degree. But you obviously are stuck with NVIDIA
hardware if you want to use CUDA. So if that works for you, you know, then you
should use it.
We see that there are many customers who are unwilling to buy into CUDA
because it's hardware vendor specific. So it's obviously stuck
because of that in the size of the addressable market.
>>: [inaudible] did you tie yourself to the Direct3D?
>> Yossi Levanoni: No. Direct3D is our implementation platform. And we really
try to hide it from the bulk of the programming model. So the Direct3D specific
things in the programming model are available through an interop API that is
optional in the spec. And the spec itself is open. And everybody can implement.
And we definitely didn't want to preclude the ability to implement on top of
OpenCL or CUDA. And, you know, actually, you know, if anybody here wants to
prove that it's possible, that will be really awesome. Yeah.
>>: What does this say about the future HLSL?
>> Yossi Levanoni: The future of HLSL --
>>: Why do we have both? Why do we need both?
>> Yossi Levanoni: Yeah.
>>: I would rather do this. But, you know --
>> Yossi Levanoni: Right. So actually the genesis of all these projects started
from this very question. You know, we have a language that has been
developed outside of the developer division, in the Windows division. And, yeah,
it's kind of an interesting question, you know. Can they really support it going
forward, or should it all be completely moved into the C++ domain?
One of the things that the team will be looking at for Dev 12 is do we want to be
able to express additional types of shaders, for example vertex shaders or pixel
shaders, in C++.
And also Windows has taken the approach of exposing HLSL as just a language
but not the bytecode. Okay? And that's also been very difficult for us to build on
top of -- just build on top of the textual language rather than at the VM level, at the
bytecode level.
So there's definitely going to be some thinking about what needs to be done,
what is going to happen in the Windows 9 timeframe. There are also questions about
the bytecode itself. The bytecode for DirectX today is a vector bytecode for a
four-way vector. And on the other hand, if you look at something like PTX, which
is the bytecode from NVIDIA, it's not vector. It's scalar, but it's SIMT, so it's
implicitly vector -- single instruction, multiple threads. They're all executing
scalar code, but they're executing on a vector unit. With DX you get both SIMT
and vector execution. Compilers have a hard time kind of figuring out how to map
that to the hardware. So maybe that will need some revision too.
And right now is the time when a discussion between DevDiv and Windows is
really starting to heat up about that. And if you want to be a part of that and you
want to contribute to where this thing goes forward, I think that's -- that's going to
be an excellent opportunity for collaboration.
The person driving this from our side, from DevDiv is David Callahan. He's the
distinguished engineer for C++ compiler. Yeah?
>>: So why is this in C++ [inaudible] some of the [inaudible].
>> Yossi Levanoni: Yeah. So --
>>: [inaudible].
>> Yossi Levanoni: This basically goes to the first pillar, which is performance.
So when we started, we asked ourselves exactly that: should we go C# or
should we go C++, assuming we couldn't do both.
There's also a question about should you put it in a scripting language, right? Do
you want to -- should you put it into Python or something like that? And we've
done customer studies. And what we have seen is there was more demand for
that in C++. C++ is the foundational layer -- basically today anybody that wants to
write performance-oriented code doesn't go any lower than C++. So it seems like
a foundational place to add this type of capability to talk to the hardware.
Now, it didn't preclude doing C#. But it didn't seem like the top priority for
Dev 11.
>>: [inaudible].
>> Yossi Levanoni: [inaudible]. No we recommend doing -- we recommend
doing interop with C++. And we have a few samples showing how to do that.
Okay.
So the programming model. So let's start with the Hello World of data parallelism,
which is adding two vectors into a third one. So we have here a function, add
arrays, in, let's say, C. It takes the number of elements and then each vector, which
is just an integer pointer, and it adds every element in pA and pB into the
corresponding element in pC.
Now, how do you -- how do we make that into a C++ AMP program? So this is
basically what it looks like after you have ported the code over. So if you look at
the right-hand side, this is the C++ AMP code. And on the left side is the original
code. I'll drill deeper into the C++ AMP code. But just on this slide, look at the
top and you'll see the include amp.h and using namespace concurrency. So this
is how you get started, okay? This gives you all the classes and everything you
need for simple compute scenarios.
Now let's drill into the C++ AMP add arrays. So what do we have here? So the
first thing to note is the parallel_for_each. So parallel_for_each represents your
parallel computation. Every time you have a loop nest that is perfectly nested
and has independent iterations, you can parallelize it. In that case you
can replace it with a parallel_for_each.
The number and shape of the threads that you launch are captured by an extent
object. So extent is an object that describes dimensionality. It says how far to
go in each direction and how many dimensions there are. So you can map any
perfect loop nest into an extent object.
The next thing is the lambda. Anybody here not familiar with lambdas in C++?
Okay. So are you familiar with delegates in C#? Okay. So it's kind of the
analog, with differences. But, you know, to a first approximation it's like
delegates.
So you have here basically the body of the lambda. It's kind of like a function
object. And these are the parameters. This expression here says what you're
capturing from the environment, what you want to make available here, and
when you say equals it means I want to capture everything that this body refers
to by value.
And this is the bit that we have added. So restrict(amp) tells the compiler, hey,
watch out that everything that I write in this function complies with the amp set of
restrictions. And I'll talk more about that in a minute.
Now, inside the lambda, what characterizes your iteration of the loop, your
instance of the loop, or basically your thread ID, is the index parameter. Now, the
rank of the index is the same as the rank of the extent. So if we are doing a
two-dimensional iteration, we would get an index of rank two.
This is a single-dimensional index.
Now, the last bit is the data. So we are using the array_view class to wrap over
your host data. So if you look at the top one, we have an array_view of integers of
rank one. So it's a single-dimensional array of integers. And it wraps over pA.
So it says basically: I have this piece of memory, please make it accessible on the
accelerator. Then we can access it on the accelerator, and you'll note that you
can use a subscript operator supplying an index object. And that just gives us
this addition.
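[Putting those pieces together, the ported version reads roughly like this; the
slide's exact code isn't captured in the transcript, so this is a sketch:]

    #include <amp.h>
    using namespace concurrency;

    void AddArrays(int n, int* pA, int* pB, int* pC)
    {
        array_view<int, 1> a(n, pA);     // wrap the host pointers so the
        array_view<int, 1> b(n, pB);     // accelerator can access them
        array_view<int, 1> sum(n, pC);

        parallel_for_each(
            sum.extent,                          // launch n threads, one per element
            [=](index<1> idx) restrict(amp)      // capture the array_views by value
            {
                sum[idx] = a[idx] + b[idx];      // subscript with the index object
            });
    }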
So this was basically the Hello World, okay? So let's drill into those classes. So,
the index class. Extent and index, as I said, talk about dimensionality, iteration
spaces, data shape and where you are inside the shape.
So on the left-hand side we have a single-dimensional space. It's represented
using extent of one. The rank of the extent must be
known at compile time, so it's a template parameter. But the size is dynamic,
so it's passed as a constructor parameter.
So here we have a single-dimensional extent with dynamic size
six. Now, once you initialize an extent with a certain size, it's immutable. Now,
on the top we see an index of the same rank, rank one, pointing at the second
element in that space, counting from zero. Okay?
Next, to the right in the middle, we have a two-dimensional example, where we
have a three by four extent. So basically if you look at it as a matrix you have
three rows and four columns. And we have an index pointing at row zero
and column two. And lastly we have a three-dimensional extent. And you
can talk about practically any number of dimensions. I think we go up to 128.
But nobody has ever used more than four or five. So...
Now, these guys don't talk about data at all, just the shape of data and the
shape of compute.
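[As a sketch, the examples described above look something like this in code; the
index values are illustrative:]

    extent<1> e1(6);        // one-dimensional space with six elements
    index<1>  i1(2);        // position 2 in that space, counting from zero

    extent<2> e2(3, 4);     // three rows by four columns
    index<2>  i2(0, 2);     // row zero, column two

    extent<3> e3(2, 3, 4);  // a three-dimensional extent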
Array view, on the other hand, talks about data. It basically gives you a view
over data, and you give it a certain shape to say: this is what
the data looks like.
So it's a template with two parameters. The first one is the type
of the element. The second one is the rank of the array. So if we look here -- let's
start with the example which is depicted on the bottom right. So let's say we
have a vector in memory of integers. This is just a standard vector. It has 10
elements.
Now, we know that we want to treat it as if it was a matrix having two rows and
five columns. Okay? So how can we tell C++ AMP to treat it as such? Well, we say
here's an extent, and the extent has two rows and five columns. And then we
say we want an array view. And the array view has that extent, two by five, and it
wraps over that vector. Okay? So that's the second argument there to the array
view.
So basically, now, accesses to the array view are wrapping over the original
vector. Now, we have a lot of shortcuts in the programming model. You
could just write a(2, 5, v). You don't always have to declare those tedious extent
objects when, you know, you don't need them for other things.
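[A sketch of that example, assuming <amp.h> and <vector> are included as before:]

    std::vector<int> v(10);          // 10 integers in host memory

    extent<2> e(2, 5);               // treat them as 2 rows by 5 columns
    array_view<int, 2> a(e, v);      // wraps v; no copy happens here

    array_view<int, 2> b(2, 5, v);   // the shortcut form, without a named extent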
Now, once you've created this array, you can access it both on the GPU -- this
array view, sorry, you can access it both on the GPU and the CPU. And what
C++ AMP gives you is coherence. We manage coherence for the array view.
So if necessary we copy data back and forth and we try to do that intelligently.
>>: How do you migrate your computation between the devices? Like could I start
something on the CPU and then move it over to the GPU at [inaudible]?
>> Yossi Levanoni: Yes.
>>: And then come back?
>> Yossi Levanoni: There are some caveats. And so basically one of the
caveats is what kind of performance can you expect when you access an array
view. Is it the same performance as accessing the underlying vector?
Now, when you access it on the GPU, it's very performant, because this is code
that we generate, and we know a lot about it. So there's a specialized path there.
And basically we strip out almost anything that's not necessary to really get
you to accessing the raw DX buffer underneath it.
But you can also access it on the CPU elementally. And there, in this release, we
have to do a coherence check basically on every access. And we don't have in
Dev 11 the ability to hoist that outside of your
inner loops.
So in order to get good performance on the CPU, if you want to work
with array views, we have a performance contract -- this is the
third bullet here -- that says they are dense in the least significant dimension. So it
means, you know, when you work with array views, let's say they're
representing a matrix, if I get a pointer to the first element in a row, I can run over
the row using that row pointer. Okay? And then I won't be paying any cost on
the CPU if I did that.
But mostly these things are accessed elementally on the GPU. And on the CPU
they're used kind of as bulk things that you just, you know, store to disk or load
from network or something like that.
>>: Do the views have any particular type of access, like [inaudible] I'll only
access this array as read only or write only --
>> Yossi Levanoni: Yes.
>>: Or read and write, stuff like that?
>> Yossi Levanoni: Very good question. So you can specialize it for read only.
You can say I want an array view of constant --
>>: [inaudible].
>> Yossi Levanoni: Rank two. And therefore, when we see that you've
captured that and that's how you want to access it on the GPU, we know that
you're not going to modify that data. We can tell that to the shader compiler.
And we can tell that to our runtime, so we know we never need to copy the data
back. Okay?
Now, we had write only as well. But it turned out to be really confusing to people.
So we no longer have write only. Instead of that we have discard. So you can
basically say to the runtime: I'm going to give you this
array view that I'm going to write into in my kernel, but you don't
need to copy the data in, because I'm telling you you can discard it. Yeah. So that's
what we have for write only after beta.
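[A sketch of both, assuming hypothetical host vectors vIn and vOut of length n:]

    std::vector<float> vIn(n), vOut(n);        // hypothetical host data

    array_view<const float, 1> in(n, vIn);     // read-only on the accelerator;
                                               // never copied back to the host
    array_view<float, 1> out(n, vOut);
    out.discard_data();                        // the "write-only" replacement:
                                               // don't copy vOut's contents in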
>>: [inaudible].
>> Yossi Levanoni: Yeah?
>>: Since sort of [inaudible] have on the other dimensions [inaudible] released
[inaudible].
>> Yossi Levanoni: Okay. So we have, for example, a method called project. So
you can take a two-dimensional
array view and you can say I want to project the Nth row out of
that. And that gives you an array view of rank one. So you get an array view
that really represents a subset of the original data.
Or you can do a section. So you have an array view representing a big
contiguous matrix and you can say I want to get a window out of that. So
now if you look at that window, it's no longer contiguous in memory. It's only
contiguous in the least significant dimension.
>>: Is it always a [inaudible].
>> Yossi Levanoni: With the API surface that we have now, there
is always a multiplicative factor between each dimension.
And the multiplicative factor is always one in the least significant dimension.
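[A sketch of projection and section, assuming a hypothetical host vector vM
backing an 8 by 12 view:]

    array_view<float, 2> m(8, 12, vM);                  // 8 x 12 matrix over host data

    array_view<float, 1> row3 = m[3];                   // project out row 3; rank drops to one

    array_view<float, 2> window =
        m.section(index<2>(2, 4), extent<2>(4, 6));     // a 4 x 6 window starting at (2, 4);
                                                        // contiguous only in the last dimension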
So, all right. Next up is parallel_for_each. I kind of talked about that briefly. So
this is an API call. This is how you inject parallelism. We didn't feel
like we needed to add another statement to the language. The first parameter is
the extent. Again, like the extent that you specify for data, you also specify one for
compute. So you can say I want a three-dimensional or two-dimensional
computation. And then every index in that space gets invoked.
You get an invocation per index, and you get that as a parameter, which is
depicted here in red. And the code that you write here has to be restrict(amp).
So this is how we write a kernel.
Now, the synchronization semantics of parallel_for_each
are those of synchronous execution, even though it rarely executes
synchronously. So under the covers these calls are piped to the GPU. But
as far as you can tell, their side effects are observable after the call
returns.
And the reason for that is because we delay the check, we delay the
synchronization point to the time that you access the data through array view or
you copy back from the GPU. Okay? So that means that if we wanted to
implement that on the CPU and take away the coherence checks then it will
really have to be synchronous. And maybe at that point we'll add an
asynchronous version of parallel for each.
>>: Now --
>> Yossi Levanoni: Yes?
>>: [inaudible] GPU is like that data [inaudible] different some like it 32
[inaudible], some like it in 64 [inaudible].
>> Yossi Levanoni: Yes.
>>: 64 tasks in each [inaudible].
>> Yossi Levanoni: Yes.
>>: How do you -- do you control that?
>> Yossi Levanoni: I'll get to that.
>>: Okay.
>> Yossi Levanoni: Hold on to that question.
So restrict is the main language feature that we've added to Visual C++. We've
added it such that it could be applicable to uses other than amp. So that's
another area for creativity and research. You know, people have been talking
about, for example, purity or other types of restrictions you want to be able to
apply to code.
So now we have a language anchor for you to fill in those dots. We provide
two restrictions in this release. One is cpu. And the
other one is amp. So amp basically is the thing where you say, okay, I want to
author code that conforms to V1 of the spec and be able
to execute it on an accelerator.
Cpu is just the default. So if you say my function is restrict(cpu) it's as if you
didn't write anything. And you can combine them. You can say restrict cpu
comma amp. That means I want to be able to run this code both here and there.
So the restrict(amp) restrictions are derived from what you can do pretty much using
today's GPU technology. Or, to be a little bit more honest, what you can
do using DX.
So there are a bunch of types that are not supported. And there's some
restrictions on pointers and references. DX doesn't know at all about pointers. It
doesn't have a flat memory model. It has those resources that are kind of
self-contained. And every kernel needs to say what resource it takes.
So simulating pointers in the wild is impossible, or at least very inefficient.
We've decided to provide references and pointers as long as they're local
variables. So that covers a lot of ground. Because people use references a lot
for passing parameters between functions. So we can do that using C++ AMP.
Going forward this restriction is probably going to be removed as GPU vendors
are moving to flat memory model architectures. But again, it will require a
change in the VM. So we will need to see how this is going to play out
during Windows 9.
Now, today again in DX there's no form of really, you know, object level linking.
When you write code, it all has to kind of be combined into one executable unit.
So in C++ AMP we must represent that restriction somehow. So basically all
your code needs to be inlinable somehow.
So one way that you could do that is write your code in header files if it's template
code or you could use link time code-gen. That works as well.
So there's a laundry list of things you can't do. A lot of them have to do with no
pointers to code and no pointers to data. There's also no exception hand -- C++
exception handling.
All right. Now, one nice feature about restrict that we've added is the ability to
overload on restrictions. So we had to implement a math library for the GPU.
And we asked ourselves how should we do that? So let's say I see a call to cosine
in your code, in C++ code that is going to be generated for the GPU. How do we
know what code to generate for that? Do we just teach the compiler
about all those math functions and treat them as intrinsics?
That would have been kind of very costly, and it's also something that only we could do
as the compiler vendor. We recognized that people may want to have specialization
of code based on whether they're running on the CPU versus the GPU, without
requiring you to completely bifurcate your code at the root of the kernel.
So the way we do that is using specialization of implementations and
overloaded calls for these specializations. So on the top line
of code we see a normal cosine declaration. Okay? So it takes a
double, returns a double. The next one is restrict(amp), also cosine. So they only
differ by their restriction.
The first one is restrict(cpu), the other one is restrict(amp). And then the third one
is a function that is restricted to both.
And then we see kernel code here. And we see a call to bar followed by a
call to cosine. So the call to bar will call the single overload that we have for bar,
because it's restricted to both cpu and amp, so it's callable from an amp context --
the kernel function here, the lambda, is itself restrict(amp). Right?
So we can call bar, which is restrict(cpu, amp). And then when we call cosine,
we ask: do we have an overload that satisfies that? And we see, yes, we
have the second one. So we call this guy. All right. And then you can do all
sorts of things with that. You could maybe trivialize some things
that are not really necessary, just say they do nothing on the GPU.
Logging facilities or things that are harder to implement on the GPU, you can
just implement them as empty functions.
Or if you know that there is a more efficient way of doing something on the CPU
versus a GPU you can do that too.
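[A sketch of overloading on restriction; the names cosine and bar follow the
slide, and the bodies here are illustrative placeholders:]

    #include <amp.h>
    #include <amp_math.h>
    #include <cmath>
    using namespace concurrency;

    double cosine(double d) restrict(cpu) { return std::cos(d); }           // host overload
    double cosine(double d) restrict(amp) { return precise_math::cos(d); }  // accelerator overload
    double bar(double d) restrict(cpu, amp) { return d * 0.5; }             // callable from both

    void apply(array_view<double, 1> data)
    {
        parallel_for_each(data.extent, [=](index<1> idx) restrict(amp)
        {
            // bar has a single cpu,amp overload; cosine resolves to its amp overload here
            data[idx] = cosine(bar(data[idx]));
        });
    }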
Okay. Next up is accelerator and accelerator view. This is how we talk about
where you actually execute code and where data lives. So far in the
presentation you haven't seen any mention of where this thing actually
happens. And that's because most of the time you can just default to a default
accelerator and you don't have to worry about enumerating them and selecting
one.
But we have classes that allow you to explicitly do that. So you can enumerate
all the accelerators that are in the system. You can set the default one. You can
specify one explicitly when you allocate arrays. And you can specify one
explicitly when you call parallel for each.
And then you say I want to execute this parallel for each in this particular
accelerator. So this is how you achieve, for example, distribution between
multiple GPUs. We don't do that for you in this release. You have to kind of
manage it yourself.
Now, what is an accelerator? So every DX11 GPU is exposed as an accelerator.
That includes two special ones, REF and WARP. REF is a reference
implementation of a DX11 GPU, which is basically the spec of what a
DX11 GPU is. Okay? This is what we give to hardware vendors and we tell
them this is how your GPU should behave. And so it's very slow. It's only good
for testing. We've used it to enable the debugging experience. But, you know,
it's not something that you would want to use in production.
WARP is an implementation of a DX11 GPU on multi-core CPU. So it uses SSE,
and it uses multiple threads. And it's fairly efficient. The problem with WARP is
that it doesn't have all the functionality that REF has. In particular, it doesn't
have double precision floats. So it doesn't have doubles. So that's kind of
painful. But, you know, if it did have doubles we would really kind of be very
happy to promote it. Because it doesn't have doubles, we tell people that, you
know, it's kind of a fallback if you don't have a real GPU.
And we have something called the CPU accelerator, which is pretty
lame in this release, to be very honest. It doesn't really let you execute code on
the CPU. It just lets you allocate data on the CPU. And this is important in some
scenarios for staging data to the GPU.
But this is kind of a placeholder for the next release, where we plan to use the
C++ vectorizer and really generate code statically that runs on the CPU given the
kernel. Now, these are all accelerators. When you talk to a particular one of
them, you use an accelerator view. An accelerator view is your context, basically,
on a particular accelerator, so it's kind of a scope for memory allocation and
quality of service.
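[A sketch of the enumeration and selection APIs being described, operating on a
hypothetical host vector v:]

    #include <amp.h>
    #include <iostream>
    #include <vector>
    using namespace concurrency;

    void bump_on_some_accelerator(std::vector<int>& v)
    {
        // enumerate everything the runtime sees: real GPUs plus REF, WARP, the CPU accelerator
        for (const accelerator& a : accelerator::get_all())
            std::wcout << a.description << L"  " << a.device_path << std::endl;

        // optionally change the process-wide default, e.g. fall back to WARP
        accelerator::set_default(accelerator::direct3d_warp);

        accelerator acc;                                  // the current default accelerator
        array_view<int, 1> data(static_cast<int>(v.size()), v);

        // target that accelerator's default view explicitly in this parallel_for_each
        parallel_for_each(acc.default_view, data.extent,
            [=](index<1> idx) restrict(amp) { data[idx] += 1; });
    }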
All right. Now, lastly, we have arrays. Arrays are very similar to array views
except that they contain data rather than wrap over data, and that you explicitly
say where you want to allocate them. So they reside on a particular accelerator
and you can only access them there. We don't do coherence for arrays. You
have to copy them explicitly in and out.
So the way you would use it is that you always copy data into it and copy data
out of it, or you generate data into it. So let's walk over the code, starting from
this box here. So we have a vector of data that we interpret as an 8 by 12
matrix. And then we have an extent using the same dimensions. We obtain an
accelerator from somewhere, which is not shown here.
And then we allocate an array on the default view of that
accelerator -- every accelerator has a default view that you can use to
communicate with it -- and with those dimensions, 8 by 12.
So once we did that, we actually have memory allocated on that accelerator.
Okay? So now we need to populate it with data. And we do it using a copy
command. So we copy from the vector, using a begin and end iterator, into the
array. So now it has the data that we want.
And now we can do a parallel_for_each, which takes this array and basically increments
every element by one. And then, after we've done that, we can
copy the data back from the array into the vector.
So this is the difference between array and array view. With an array you're taking
control both over data movement and data placement.
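[A sketch of that walkthrough:]

    std::vector<int> v(8 * 12);                    // host data, viewed as an 8 x 12 matrix
    extent<2> e(8, 12);

    accelerator acc;                               // whichever accelerator we obtained
    array<int, 2> a(e, acc.default_view);          // storage lives on that accelerator

    copy(v.begin(), v.end(), a);                   // explicit copy, host -> accelerator

    parallel_for_each(e, [&a](index<2> idx) restrict(amp)
    {
        a[idx] += 1;                               // increment every element
    });

    copy(a, v.begin());                            // explicit copy, accelerator -> host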
>>: I have a question.
>> Yossi Levanoni: Yeah?
>>: [inaudible].
>> Yossi Levanoni: So array view is not as efficient. Yeah. Because array view
needs to be able to represent a section, it always has an offset addition. So that's one cost
that you pay for it.
The other thing is that you may not be happy with what our runtime is doing in terms
of managing coherence. So you may want to be sure sometimes that it doesn't get in your way.
We tell users to use array view whenever they can. Because we think that these
problems are going to go away as soon as we have shared memory, which is
probably going to be in the next release. So if you want to write code that doesn't
do unnecessary copies, you probably want to use array views.
There are also cases where you want to use an array. For example you're doing
some computation, you're doing one kernel followed by another kernel. And
there's a temporary between the two. In that case, you may also want to use an
array. Because you don't need data for that on the CPU. Yes?
>>: [inaudible] is it in a different namespace?
>> Yossi Levanoni: It is in a different namespace. And we've been going back
and forth on whether it's okay for us to reuse class names. So we've overloaded
array, and extent also exists in the std namespace.
So, yeah, we just recommend users to qualify with the namespace if they need
to. It just seemed like the right names to use. And that's what namespaces are
for.
Okay. So that's basically been the core of C++ AMP. With those restrictions,
parallel_for_each, accelerator_view, accelerator, extent, index, array and array
view, you're ready to go writing simple kernels. So you can fire up VS 11 beta
and write your code using those things. And that's pretty much it. So it's fairly
lightweight.
Now, if you want to get this 2X or 3X that I've shown you in the N-Body
simulation, you need to kind of get closer to the hardware. And you do that using
what we refer to as tiling. So tiling basically allows you to gain access to the
GPU's programmable cache. It allows you to have stronger guarantees
about scheduling between threads, such that they can cooperate with barriers and
exchange data.
So, you know, you can do basically more localized reductions, let's say first in
tiles, before you do it over the whole data.
Finally, there are also benefits that are more mundane and low level.
When we have, for example, an extent like this, eight by six, and you
tell us to map that to DX -- DX actually is kind of regimented in the
way it wants to schedule threads. And if you give us, let's say, a five-dimensional
extent with, you know, primes as the numbers of threads, we have to go through
a fairly sophisticated index mapping from the logical domain to the physical
domain, to the one that DX actually schedules.
Then when we run on the GPU, we have to do the reverse mapping, give you back
your logical ID. Okay. So this has a runtime cost. But
in addition, if you want to generate vector code, it can completely throw the
vectorizer off, because, you know, what would have looked like a vector load
becomes a gather, and what would have been a vector store becomes a scatter,
and things of that nature.
So basically with tiles you can express more local
structure for your computation. Now, on the right-hand side we see
the same extent being tiled into three by four tiles. So now, once
we've done that, we've basically told the hardware: we request you, hardware,
to always execute groups of 12 threads concurrently.
And basically what the hardware is going to do is it's going to make this tile of
threads resident. Because it's resident on the core all together, they can refer to
fast cache memory, which is also referred to as scratch pad memory. They
can put data there and they can coordinate using this cache. And they can reach
barriers together. Okay.
So they execute from start to finish together as kind of a gang of threads. And
then when the hardware is done executing those threads, it can get a new tile and
execute it. So you basically have a more intimate scheduling contract with the
hardware.
Now, to give you some feel as to what the real numbers are: the extents are
determined dynamically, so, you know, you could ask to schedule a few million
threads. But the tiles are always determined statically. They are specified as
template parameters, as you see at the tile call. So we do e.tile three by
four. It has to be known to the compiler. And the maximum number of threads in
a tile is 1K. Okay? And typically you do something like 256 threads in the tile.
So it's kind of really different orders of magnitude. And you can choose whatever
you want. Here's another example, two by two.
Now, going back to your question, when you do tiling, you really want to be
aware of the width of the hardware vector unit. So basically you want the
least significant dimension of your tile to be at least as wide as the vector unit,
because otherwise you're going to be wasting part
of that vector unit.
So the rule of thumb that we use is: use 64. Today hardware uses four for the CPU
on SSE, then 16 on AMD.
>>: No, 64 on AMD --
>> Yossi Levanoni: 64 on AMD.
>>: [inaudible].
>> Yossi Levanoni: Right. Right. So we think that 64 is going to stick for a
little while longer. This is something that we're not really, really
happy with, that you have to bind your code to a particular number.
Another interesting area of research is whether we can bring into Visual Studio
some form of auto-tuner, or a way to instantiate that in a manner that
will allow you to upgrade it in a more regular manner as the hardware does.
>>: [inaudible] query, the hardware does not --
>> Yossi Levanoni: This is not part of the DX model, so you can't really rely on
that. CUDA tells you that you can take advantage of that: if you execute,
let's say, instruction one using this warp width and then you move to
instruction two, then you know that everything in your warp has already done
the first line. So you know that those warps are executed in unison.
They give that guarantee to developers. And because they do that, now the
compiler's totally constrained. They cannot reorder instructions. So DX has
decided not to give any guarantees of that.
If you want to take advantage of that, you still can. You will need to put fences
between your instructions, in DX or in C++ AMP. And your code is not going to
be portable between hardware vendors.
So once you've tiled your computation like that, when you do e.tile using some
tile dimensions, you get a tiled extent. It's another class. It has up to
three dimensions. And what you get in your lambda is a tiled index.
Now, a tiled index is pretty much like an index, but it has more information about
where you are in both the global and local index spaces. So it tells you things like:
what is my global index, what is my local index, which is my index within my
tile, what is my tile index? Okay? Because there is also an extent of tiles.
And what is the first position in my tile? So those are all things that are
deducible. But we have them precomputed so, you know, we provide access to
them using those properties. Any questions on that so far? No. Okay.
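[A sketch of tiling; the 4 by 3 tile shape is chosen here so it divides the 8 by 6
extent evenly:]

    extent<2> e(8, 6);                              // 48 threads in total

    parallel_for_each(e.tile<4, 3>(),               // tile dimensions are template
                                                    // parameters, known at compile time
        [=](tiled_index<4, 3> tidx) restrict(amp)
    {
        index<2> g = tidx.global;       // where am I in the whole 8 x 6 space
        index<2> l = tidx.local;        // where am I inside my 4 x 3 tile
        index<2> t = tidx.tile;         // which tile am I in
        index<2> o = tidx.tile_origin;  // the first position of my tile in the global space
    });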
All right. So now we get to the interesting part. This is what you get. Okay? So
you do all this hard work. What do you get? You get the ability to use tile_static
variables. So tile_static variables are variables that are local to a tile. So it's
memory, and it's going to be placed in the GPU programmable cache. So you
know that you get very good access to it.
So how could you use it? If there's data in global memory that
many threads are going to access repeatedly, they could collectively load it
into tile_static variables, then do their computation over tile_static memory, and
then when they're done they can store it back to global memory. Okay?
Now, this coordination that works in phases requires them to use barriers. And
that's what tile barrier is about. So all the threads will do some collective thing.
And when they are done with that, they all reach a barrier. Then they
know that data is ready in tile_static memory and they can move to the next
step.
Okay. So in order to show tiling I'll work with the matrix multiplication example.
This example does not use tiling. So on the left-hand side we see the sequential
version. And on the right-hand side we see a simple amp version. Now, the
thing to note here is that we've replaced the two -- the doubly nested loop with
the single parallel for each invocation. Okay? So this is two-dimensional parallel
for each.
Each one of these loops is totally independent, so we can do that. And now,
basically what we do in the body of that lambda is compute -- compute the value
of C at position IDX. Okay? So this is kind of independent. We can do this
thing. Now, each thread, or every, you know, row-column combination
here, basically computes this sum of products. Okay? And in the simple
computation this is the same thing. This was a sequential loop and it
remains a sequential loop executed by a thread.
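[A sketch of that simple version; M, N and W are the matrix dimensions, assuming
row-major data:]

    void MatrixMultiply(std::vector<float>& vC, const std::vector<float>& vA,
                        const std::vector<float>& vB, int M, int N, int W)
    {
        array_view<const float, 2> a(M, W, vA);    // M x W input
        array_view<const float, 2> b(W, N, vB);    // W x N input
        array_view<float, 2> c(M, N, vC);          // M x N output
        c.discard_data();

        parallel_for_each(c.extent,                // one thread per element of C
            [=](index<2> idx) restrict(amp)
        {
            int row = idx[0], col = idx[1];
            float sum = 0.0f;
            for (int k = 0; k < W; k++)            // this loop stays sequential per thread
                sum += a(row, k) * b(k, col);
            c[idx] = sum;
        });
        c.synchronize();
    }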
Okay. So this is how we would do it using simple C++ AMP coding. Now, if you
want to tile it, we have to do a bunch of stuff. So we have to decide
what is going to be the tile size. So we decide it's going to be 16 by 16. Okay?
So we take our extent that we had before, and we tile it by 16 by 16. And now
we get a tiled index that's going to represent our thread. And
then we can extract from that the row and the column as we did before.
Now, the interesting thing here is that we have here a single loop running over
the entire input matrix, the left-hand side matrix row and the other -- the
right-hand side matrix column. What we have done basically is we've broken this
loop into two loops. The first one runs over tiles. And the second one runs
inside the tile. Okay?
So in the outer loop we basically collectively fetch global data into tile_static
variables. Which are -- sorry. Oh, there they are. Right here. Okay. So we all
do a single load. Each thread does a single load and loads it into these tile_static
variables. And then they all wait for everybody else to complete. And
now we can do the partial sum of products in tile_static memory.
So if you do the math, that saves a factor of the tile size in global memory
accesses and replaces them with scratch pad memory accesses. So this again
gives a factor of two or three better solution.
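[And a sketch of the tiled version being described, assuming the matrix
dimensions are multiples of the 16 by 16 tile size:]

    static const int TS = 16;

    void MatrixMultiplyTiled(std::vector<float>& vC, const std::vector<float>& vA,
                             const std::vector<float>& vB, int M, int N, int W)
    {
        array_view<const float, 2> a(M, W, vA);
        array_view<const float, 2> b(W, N, vB);
        array_view<float, 2> c(M, N, vC);
        c.discard_data();

        parallel_for_each(c.extent.tile<TS, TS>(),
            [=](tiled_index<TS, TS> tidx) restrict(amp)
        {
            int row = tidx.local[0], col = tidx.local[1];
            float sum = 0.0f;

            for (int i = 0; i < W; i += TS)              // outer loop walks over tiles
            {
                tile_static float sA[TS][TS];            // programmable cache
                tile_static float sB[TS][TS];
                sA[row][col] = a(tidx.global[0], i + col);   // each thread loads one element
                sB[row][col] = b(i + row, tidx.global[1]);

                tidx.barrier.wait();                     // wait until both tiles are loaded

                for (int k = 0; k < TS; k++)             // inner loop runs inside the tile,
                    sum += sA[row][k] * sB[k][col];      // entirely out of tile_static memory

                tidx.barrier.wait();                     // don't overwrite tiles others still read
            }
            c[tidx.global] = sum;
        });
        c.synchronize();
    }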
Okay. So any other questions about tiling before I'll move to the last section
here?
>>: There's control over [inaudible] usage, how many registers does your
compiler --
>> Yossi Levanoni: Yeah. Unfortunately we don't have that, so, yeah. You can
play all sorts of games. And if you have an interesting application where you
need that, you should probably hook up with our development team. We've worked
with people trying to work around this DX limitation. But, yeah, it's very
indirect, the type of influence you have over that.
>>: What happens [inaudible] like in the CUDA world it just says you can't
compile.
>> Yossi Levanoni: In the CUDA code, I think they actually spill.
>>: They will spill.
>> Yossi Levanoni: They will spill.
>>: [inaudible]. Yeah. The old days it just says like you're out of registers.
>> Yossi Levanoni: Right.
>>: So what's the equivalent [inaudible].
>> Yossi Levanoni: You also [inaudible] into similar.
>>: [inaudible].
>> Yossi Levanoni: No. We will tell you you're out of registers.
>>: But we won't tell you your program is going to run slow because I can't
generate enough threads, because I only have -- I don't have enough registers
to generate as many threads as you need to.
>> Yossi Levanoni: We have some information on that, and we're working on
exposing it. So I'll talk to concurrency visualizer in a bit.
>>: There's no way to access the [inaudible].
>> Yossi Levanoni: That's my next topic.
>>: [inaudible].
>> Yossi Levanoni: Yes. So the next topic is DX integration. So basically our goal
there has been to allow you to interleave your computations with what you would
otherwise do with DX. Many times the pattern is that you compute something,
you leave it on the GPU, and then it's available for vertex shaders or pixel
shaders or directly for blitting to the screen.
So we wanted to facilitate this kind of interop. And you get interop on data and
scheduling. So the array view class you can expose it -- you can interop
between array view and ID3D resource. And this represents a byte address
buffer in HLSL, so it's basically unstructured memory.
And you can interop between accelerator view and ID3D device and ID3D device
context. So you can go either way. Okay? And that also allows you to use
some advanced features that are not available through the C++ AMP surface
area. For example, in Windows 8 they've added the capability to control TDR, which is
timeout detection and recovery. Because the GPU is a shared
resource but it's not fully [inaudible] like your CPU, there is a timeout mechanism
for long-running shaders which appear stuck to the OS. They've done a lot of
improvement on that in Windows 8. But you have to kind of buy into it. So you
could do that by configuring your device context. Then you can use it as an
accelerator view in C++ AMP.
N-Body simulation, for example, uses that. In every frame you both execute C++
AMP code and DX code for rendering. Now, in addition to that, while you're
inside the body of your kernel, you can use also about 20, 20-some intrinsics that
we've exposed -- that we are channelling to HLSL. They have to do with min,
max, bit manipulation, this type of stuff.
They are not part of the core spec. We're thinking of putting some of them back
into the core spec. People have asked for that. All right. The next thing is what
you can do in terms of textures. And that's the main
thing that we do in the amp graphics namespace. So textures allow you basically
to have some sort of a one-dimensional, two-dimensional, or three-dimensional
addressable container that you can sample or address at a particular integer
index and get back the texel. The texel is typically a short vector -- it can
be either a scalar or a short vector -- for example an int_3, a tuple of three
integers. That's what you get once you sample a texture.
So the first thing that we've done is provide our short vector types. For example,
int underscore three is a tuple of three integers. There are also things called norm
and unorm. These are float variables that are clamped, to either zero to one or
minus one to one. That's the difference between norm and unorm; unorm
means unsigned norm.
And then you have short vectors of these up to a rank of four. Okay? So this is
something that is very familiar to graphics developers.
Another thing that graphics developers are used to is the swizzle expression. So
you can say my vector dot yzx equals this int_3. And this is not a
very good example because it just reorders the components, but
it allows you to change the order in which you access the different components,
or just select a subset of them -- these types of things.
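[A sketch of the short vector types and swizzles; these live in the
concurrency::graphics namespace in <amp_graphics.h>:]

    #include <amp_graphics.h>
    using namespace concurrency::graphics;

    int_3   iv(1, 2, 3);                // a tuple of three integers
    float_4 fv(0.5f, 1.0f, 2.0f, 4.0f);

    norm  n(-3.0f);                     // clamped to [-1, 1]  -> -1.0
    unorm u( 1.7f);                     // clamped to [ 0, 1]  ->  1.0

    int_3   reordered = iv.yzx;         // reorder the components: (2, 3, 1)
    float_2 xy = fv.xy;                 // or just select a subset of them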
So we expose many swizzle expressions. But we don't expose all of them. The
syntax in HLSL basically allows you to write any combination of characters,
including repetition and the constants zero and one. Okay? So if you look at the
combinatorial space for that, it totally explodes. And we were really scratching
our heads about how to expose that in C++. And we ended up really trimming it down
to the most common combinations.
But we think, looking at bodies of HLSL code, we cover everything that people
are using. And if anybody's interested in language surface area investigation,
that would be an interesting topic, you know: how can you add to
something like C++ more, let's say, compile-time dynamic properties?
Okay. So once you have those short vector types, you can represent textures.
And basically the short vector types and scalars can be the element type of a
texture. Now, if you look at the DX spec for textures, it's got lots of gotchas.
For example, you can't have textures of a vector length of three -- you can't
have a texture of int_3. There are also lots of limitations on when you can read and
when you can write. You can almost never read and write simultaneously. I
mean, your shader either reads or writes, except for some really common but
special cases.
There are also lots of different encodings that are supported. Your data doesn't
have to be in the same data type that the texture is. Texture provides a sort of
data transformation for the underlying data type.
So it's kind of a really, really big area. But you can interface with it. We
don't cover all of the options that you can construct a texture with.
We covered the most common ones. And we have an interop path where you can
say: here's a texture that I created from DX, please treat it as a C++ AMP
texture.
Now, one thing that was really painful in this release is that we had to cut
sampling. This is really what most people like textures for. It means that you
can treat the texture as if it was a mapping with a real coordinate space. So I
could index it and say: give me the texel at position 2.5
and 3.7. And that gives me a point that is not in the center of a texel but is
closer to some other texels. And then it can do linear interpolation or other
interpolations. And that gives you the kind of smooth texture effect that you get in
games.
And it's a really nice facility. But we ran out of resources and we had to cut it.
So what you can do with textures today in C++ AMP and Dev 11 is access them
basically as arrays, using integer indices.
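[A sketch of reading a texture with integer indices; the 256 by 256 size and the
names here are illustrative:]

    #include <amp.h>
    #include <amp_graphics.h>
    #include <vector>
    using namespace concurrency;
    using namespace concurrency::graphics;

    void sum_texels(const std::vector<int_4>& host, std::vector<int>& result)
    {
        // host holds 256 * 256 texels of int_4; result holds 256 * 256 ints
        texture<int_4, 2> tex(256, 256, host.begin(), host.end());
        array_view<int, 2> out(256, 256, result);
        out.discard_data();

        parallel_for_each(out.extent, [=, &tex](index<2> idx) restrict(amp)
        {
            int_4 t = tex[idx];                  // integer indexing only in Dev 11 -- no sampling
            out[idx] = t.x + t.y + t.z + t.w;
        });
        out.synchronize();
    }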
>>: [inaudible].
>> Yossi Levanoni: But we're really -- huh?
>>: [inaudible] instead of [inaudible] I'm sorry, all four --
>> Yossi Levanoni: Yeah. Yeah. We had to cut the [inaudible].
All right. So that's it pretty much. We covered everything that's really important
in the programming model. Yeah?
>>: [inaudible] so one of the nice things about direct compute is you've got these
shader resources that you can use in compute, use in graphics [inaudible]
exactly the same model, which is very nice. It seems like you're kind of throwing
that away to create more like CUDA, DX or [inaudible] separate for a [inaudible]
memory here and if you do it in [inaudible] it's very painful. Is that the reality or
[inaudible].
>> Yossi Levanoni: That's the reality of it, yeah. I won't dispute that. And the
question is how do -- how do we move forward. So, yeah, the focus was
compute.
>>: [inaudible] either it's all or nothing, right? I mean, you could [inaudible] bring
everything into C++, all the [inaudible] the whole entire model or --
>> Yossi Levanoni: Yeah. I think that's where we would like to be. And the
question is when, and what's going to be the customer demand for that. But if the
team had its way, then we would go and target at
least the most common things, common shader types, such that you won't have
to leave C++ at all.
That's basically the productivity value proposition that you stay within C++, you
stay within the same tool set. And if you look at it from the DX side, they also
have an opportunity here. Because it seems like a lot of the shaders could be
recast or even the higher -- some higher-level APIs could be recast in terms of
compute.
So you could, for example, take, I don't know, Direct2D and
implement it on top of C++ AMP -- yeah, a modern C++ version of it, right?
Yeah. The question has always been about resources. How much are we going
to get done in each release. Yeah.
So things that I'm not going to cover are things related to the memory model, atomic
operations, memory fences, some direct diagnostic things, the types of
exceptions you get from our runtime, and the math library. The math library is
really interesting. We worked on that together with AMD. They basically
implemented it for us. And it's kind of a clean room implementation that is very
accurate and -- yeah?
>>: [inaudible].
>> Yossi Levanoni: Huh?
>>: Do you have scan --
>> Yossi Levanoni: No, we don't. So one thing that the team is working on right
now is looking into the higher tier of libraries. And we have Thrust, from NVIDIA,
that is basically also in that space, providing all the standard-library-like overloads
for their CUDA programming model. So they have reductions and scans, sorts.
So this is one area that we've started to invest in. And this is going to be an open
source project.
So we have already published it -- sorry, we're kind of in the last review stages
of the first version. We're going to push it to the web. And we have all sorts of
customers that are interested in contributing to that.
The other thing we're doing is investing in BLAS, linear algebra. And we are also
investing in random number generators. If anybody here wants to chip in or give
us requirements, please do. On the next set of libraries. Yeah?
>>: [inaudible].
>> Yossi Levanoni: FFT we actually published as a sample. That was
easy to do, relatively speaking, because DX provides an FFT library. It's a
little bit outdated. It originated here in MSR. And DX basically wrapped it in a
Windows API. And you can use it if you have DX. So we provided a
sample that shows how to use it from C++ AMP and kind of gives a nice C++
AMP facade to it.
But we don't have like an amp library that implements FFT from scratch. And
there are some opportunities there. Naga's [phonetic] team had also created the
-- an auto-tuning framework for FFT. I think it could be great if, you know,
somebody looked into bringing that back to life and maybe porting it over to C++
AMP.
So generally speaking the C++ team is not well staffed to do breadth libraries or
domain specific libraries. But, you know, breadth libraries are even kind of a
lower goal, and even that is kind of very difficult to get done. So we're trying to
enlist the community for that. So, yeah, if you're willing to contribute, that would
be awesome. Yeah?
>>: I'm just wondering what [inaudible] this for and [inaudible] is that I could
[inaudible] then it seems like [inaudible] and so I'm wondering how do you think
that that is going to get better where we keep [inaudible], you know, like
rendering something where we don't know exactly [inaudible] and how do we see
that [inaudible].
>> Yossi Levanoni: I think you -- yeah, so are you alluding to what I said, that some other shaders could be expressed in terms of compute, or are you just asking --
>>: One whole model is being [inaudible]. I don't know that it's [inaudible].
>> Yossi Levanoni: No, we think it is, necessarily. We looked into streaming, which is kind of a key component of [inaudible]-like algorithms, right? You know, algorithms that generate a variable number of outputs and maybe, you know, have kind of a loop back to feed themselves with that.
Yeah. So we don't have conclusive findings on how to expose that yet. Yeah.
Very interesting. Yeah?
>>: So inside of a parallel_for_each you're not allowed to call another [inaudible]; is that right?
>> Yossi Levanoni: That's right.
>>: So have you looked into maybe doing that so you can nest it?
>> Yossi Levanoni: Yes. The hardware vendors are talking about the next level of things that you should be able to do. So GPUs should be able to do quite a number of things in the next rev of major hardware. They could launch something asynchronously to themselves, kind of a continuation. They should be able to do stream-like things, you know, kind of a parallel do-while, right? A "while I have more work" loop, and in the loop I generate more work for myself. Maybe a little bit like tessellation.
They should be able to also express synchronously nesting parallel loops. But
I'm not sure that this really is -- maps well to the hardware. But they could -- but
they could represent that as continuations, I suppose. If they have continuations
then they could do something like that as well.
>>: [inaudible] like if you have a -- let's say a [inaudible] walk down the tree
[inaudible] thing to do is [inaudible] you have two things and [inaudible] you have
four things but [inaudible] across the whole tree. You unfold the whole tree and
flatten it. Could you do that with software instead of [inaudible] hardware.
[inaudible].
>>: Yeah. I mean, this is a program in MSR Cambridge that [inaudible] take like
-- you have this data type called parallel array and [inaudible] just kind of get
flattened out [inaudible].
>>: So parallel [inaudible] for this type of things. Yes.
>> Yossi Levanoni: Yeah. No, we didn't look into that very carefully. We looked at it from kind of a syntactical perspective, saying, okay, if you did want to allow a nested parallel_for_each, would you be able to do that? And that's part of our versioning strategy for AMP.
So if you look at restrict(amp), we actually designed and implemented the versioning mechanism for that. So you could say restrict(amp:2), and that opens things up for you to allow more things. But then all the things that are for V1 will always be kind of a better optimization target, because you know that they don't have nested parallelism, you know that they don't do a bunch of stuff. They don't have virtual function calls. They don't have pointers into the heap, right?
So we think that the V1 that we're specifying is always going to be kind of a good
breadth target for optimization, you know, like matrix multiplication. There's no
reason you would ever want to go beyond amp V1 to express that. And that
gives the compiler a lot of knowledge.
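For reference, a minimal V1-style kernel looks like this; the function and sizes are just an illustration, but the restrict(amp) and parallel_for_each syntax is the shipping one:

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // The lambda is restrict(amp): no virtual calls, no function
    // pointers, no pointers into the heap -- the V1 restrictions that
    // make kernels like this a good optimization target.
    void scale(std::vector<float>& data, float factor) {
        array_view<float, 1> av(static_cast<int>(data.size()), data);
        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
            av[idx] *= factor;
        });
        av.synchronize();  // copy results back to the host vector
    }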
>>: [inaudible].
>> Yossi Levanoni: Yeah?
>>: So can you talk a little bit like what you compile into? You say you built on
top of DX. Do you use the HLSL compiler at some point? I mean, what
exactly [inaudible].
>> Yossi Levanoni: Yeah [inaudible].
>>: How do you exactly get [inaudible].
>> Yossi Levanoni: Okay. So we are built into the C++ compiler. We ingest everything in the front end. And what we do in the front end is basically outline those parallel_for_each payloads. That creates the stub of execution on the GPU. But to the compiler at that time it just looks like a function.
Then we get to the back end. We recognize those functions. They have a certain bit set that says this is a GPU kernel. And at that point we apply some specific processing. For example, we take it and we do full inlining. We do pointer elimination. We invoke the right intrinsics, okay? And then we're ready for code generation. And the code generation goes to HLSL source code. But the source that we generate looks like assembly code, you know. It's basically, you know, variable equals variable, or expressions are binary expressions assigned to a temporary. So, yeah, it's --
>>: [inaudible] do they still own that compiler?
>> Yossi Levanoni: The HLSL compiler?
>>: Yeah. So you inherit all the bugs?
>> Yossi Levanoni: We have -- we have -- they owe us -- we worked very, very closely, and we have found so many bugs that have been fixed. I mean, they've done a really Herculean effort to --
>>: [inaudible].
>> Yossi Levanoni: Say again?
>>: It's graphics guys writing compilers.
>>: Yes.
>>: That's true.
>> Yossi Levanoni: So that organizationally, let's say it wasn't the best thing
possible, but I have to kind of say that everybody did the best that they could to
make it work. And going forward we would want to own code generation all the
way down to the bytecode.
>>: I could see [inaudible] a lot of work this way, right?
>> Yossi Levanoni: Oh, yeah [inaudible].
>>: [inaudible] handle all the details [inaudible].
>> Yossi Levanoni: Yeah.
>>: [inaudible].
>> Yossi Levanoni: Right. No, but more than that, if you talk about C++, C++ developers are not crazy about JIT, right? And it seems like we have an opportunity to go lower level than the bytecode and allow you to generate machine code all the way from the compiler. Or even something that is not exactly machine code -- maybe something that you just need to run one single pass of finalization over, let's say to determine offsets into structures or something like that, or the encoding of the instruction set.
So these are directions that we want to explore in Windows 9. Yeah?
>>: So what's your -- what's your update pathway and how often are you
expecting new versions?
>> Yossi Levanoni: The update for C++?
>>: Yeah.
>> Yossi Levanoni: So the -- it's just part of the compiler. And so far it's been
two or three years at least.
>>: So [inaudible].
>> Yossi Levanoni: Yes. Yes. But the libraries are separate. The libraries will
be able to update all the time. And the compiler, if I understand correctly, there's
very strong desire to change -- to change the cadence with which it is delivered.
>>: So it links to the studio and not to the [inaudible] 11 or 12? So we'll get new
versions as there are new versions of dev studio, not as [inaudible].
>> Yossi Levanoni: Yes. There will be -- we have some flexibility to do things in service packs. But, you know, it's still kind of early to say exactly how Dev 12 is going to play out.
>>: So one question about the compiler again. You said you did the outlining in
the front end. And then are you writing that all to CIL? Is the one [inaudible].
>> Yossi Levanoni: We are writing it all to CIL, but it gets outlined. So you basically get two functions. You get the original function that contained the parallel_for_each; it now contains a call to a runtime trampoline. And you get the new function that is also expressed in CIL, but that gets translated into GPU code.
And then at runtime you call into this runtime trampoline that sits in our DLL -- the runtime DLL. And it basically knows how to rummage through your EXE and find the bytecode that we buried there and initialize DX with that and call it.
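As a rough sketch of that outlining, the trampoline and kernel names below are invented for illustration, not the real internal symbols:

    #include <amp.h>
    using namespace concurrency;

    // What the user writes:
    void saxpy(float a, array_view<const float, 1> x, array_view<float, 1> y) {
        parallel_for_each(y.extent, [=](index<1> i) restrict(amp) {
            y[i] += a * x[i];
        });
    }

    // Conceptually, the front end emits two functions in CIL:
    //
    //   saxpy(...)           -- the original body, with the parallel_for_each
    //                           replaced by a call into the runtime DLL, e.g.
    //                           __amp_trampoline(kernel_id, a, x, y)
    //                           (invented name), passing the captured values.
    //
    //   __saxpy_kernel(...)  -- the outlined lambda body, which the back end
    //                           lowers to HLSL/bytecode, embeds in the EXE,
    //                           and the runtime locates and hands to DX at
    //                           execution time.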
Let me show you some nice slides about what we've done in the IDE. This is the
concurrency visualizer. We've added both DX and C++ AMP capabilities so you
can see basically kind of the high level dynamics of what's going on. When do I
have a data transfer? When -- you know, when am I blocking? What are the
overall statistics of that?
We have prototyped more data collection that talks about divergence and spilling and that type of stuff. But it hasn't -- it's not in the beta, and we weren't able to ship it in time. The concurrency visualizer in previous releases we've been able to ship out of [inaudible], so this is one thing where we can innovate more quickly. Although we don't know how many resources we get for that going forward. Unfortunately, that team, most of it is no longer in DevDiv, so we're not sure that we will really be able to turn the crank on this as fast as we want to.
And the other thing where we had much more investment and innovation is in the debugger. So basically everything that you have on the CPU that you're used to works on the GPU. And you get new experiences. One of the cool things that we've added is a GPU threads window. The number of threads that you work with in a parallel_for_each is not like, you know, 10, 20. You're talking about millions of threads potentially. And you really want to treat them as data. You want to be able to tabulate them and search them and flag some of them and do filtering operations. And the GPU threads window lets you do that.
And complementing that we've added a watch window, which we've added both for GPUs and CPUs, and it gives you this laminated view of variables across threads. So again, you can search here, you can filter threads by values, and basically see a lot of data across lots of threads simultaneously.
You can also export these types of things to Excel and do further analysis over
there. And we have the parallel stacks window that we used to have in the CPU
that's also available on GPU. Yeah?
>>: [inaudible].
>> Yossi Levanoni: So we have defined a debugging API. And the hardware vendors will have to implement that. Now NVIDIA is working on that, and they're going to be done by RTM. So we demoed this together with NVIDIA at Supercomputing last year. So they're going to have, as they do now, the best developer experience on their hardware.
AMD said that they'll start working on that soon. And you also have the REF GPU emulator that you can debug with. Okay? So if you don't -- if you don't have an NVIDIA card, you can fall back to that. And you can use that on your laptop. You know, riding on the bus.
And one last thing that we've added, which you can't see here, is this runtime assist for race detection. So now you're working with all these threads. You can now enable these hazard checks that will break into the debugger if we see that you have a read and a write that are not correctly synchronized, or if you're missing a barrier where you need to have one. All right.
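The kind of hazard those checks catch looks roughly like this; a sketch only, where the tile size of 256 is arbitrary and the data size is assumed to be a multiple of it:

    #include <amp.h>
    using namespace concurrency;

    // Each thread stores one element into tile_static memory and then
    // reads a neighbor's slot. Without the barrier, that read races with
    // the neighbor's write -- the kind of hazard the debugger assist
    // flags. Assumes data.extent[0] is a multiple of 256.
    void neighbor_sum(array_view<float, 1> data) {
        parallel_for_each(data.extent.tile<256>(),
            [=](tiled_index<256> t) restrict(amp) {
                tile_static float local[256];
                local[t.local[0]] = data[t.global];

                t.barrier.wait();  // remove this and the race check fires

                int neighbor = (t.local[0] + 1) % 256;
                data[t.global] = local[t.local[0]] + local[neighbor];
            });
    }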
So, yeah, that's pretty much it. The thrust of what we are doing is that it's open, it's high level, and it gives people a lower barrier to entry into the space. And it's mappable to new hardware, both on the CPU and the GPU. And it's available to anybody who's downloading VS 11. You get it in the Express SKU. So, you know, the idea is really to democratize this space and allow people to experiment with it and utilize their hardware.
Thank you.
[applause]