>> Juan Vargas: We're going to have a fantastic keynote by Wen-Mei Hwu, who is the Walter
Sanders AMD Chair of Electrical and Computer Engineering at the University of Illinois. He has
a Ph.D. from the University of California at Berkeley. And he has received many,
many, many awards for his work in architecture and software for HPC.
He received the 1993 Eta Kappa Nu Outstanding Young Electrical Engineer Award, the 1994
Phillips award for faculty research, and many other awards. He's going to be talking about
high-level programming of manycores. Please help me to welcome him.
[applause]
>> Wen-Mei Hwu: Thanks, Juan, thanks for that extremely warm introduction. But now I hope I
can live up to the person you introduced.
>> Juan Vargas: We [indiscernible].
>> Wen-Mei Hwu: So this is really a lunch talk. So I promise no compiler flow analysis, no
dependence analysis, no equations. And you know anything else you want me to take out, I will.
So what I'd like to do is give you a little bit of a journey of how we're engaging people
who are actually doing manycore programming today, and what kinds of things we believe will
have to be advanced technologically in order to really help these people.
And so the work involves a lot of people. This is a picture of the UPCRC Illinois team. There are
17 faculty members. Many of them are here. So I see many, many faces here.
And there are a few aspects of the subject on which you will probably get even better treatment by
talking to them this afternoon. So instead of taking a nap after lunch, you should go to the
sessions or talk to some of my colleagues. So the vision of UPCRC Illinois is to make parallel
programming easy.
And we especially want to focus on simple programming patterns. And there will be some very,
very hard problems. And we believe that if we help people to solve the simpler problems first,
then we have a chance of attacking the harder problems.
Right now I personally believe that a lot of these problems have to do with the tools not being
able to engage the real development process and work flow and people's real needs.
So the topic says manycore. So I'm going to go over quickly what I mean. What I mean is the
kind of processors that will have limited resources, very limited resources.
Mostly, the resources will be dedicated to the compute units. So you have lots and lots of them.
And then you will have a relatively small amount of control. So in some ways it could be a CPU
that is reduced in terms of core complexity, or reduced in the amount of cache, or a GPU; today
you will see typical examples of these manycore types that I will be talking about
here.
So the first thing we learned about manycore is it only makes sense to program these manycores
if you care about performance, scalability or both. So if you don't care about either of those, you
really should save your time. And if you just write a very simple piece of code, chances are you
will run better on the CPU. So don't bother with that.
And the other lesson we learned is that performance on a single chip versus scalability across
different chip levels actually share a lot of the same techniques that people need to use. So in
many ways the first-order effect of getting performance on these chips is to make sure that your
program is scalable.
And the techniques that I'm going to be talking about that many of these developers wish we
could help them with fall into that category. And then there is some detailed performance
tuning, some fairly fine-grained tuning, that you can do. But those are in general second-order
effects.
Many people, even from some of these vendors, will tell you that it may not even be
worthwhile to do that kind of fine-tuning, even for a particular generation.
Okay. So what I would like to convey to you is today there is a major gap between what the
programmers need and what the tools are actually doing. And it's our job, people in this room, to
be able to fill that gap, and being able to empower these people.
So there are three important things that -- almost every manycore programmer would tell you,
okay, if they're successful in doing what they do, they will tell you there are three things in general
they probably did right.
The first one is you need to have massive parallelism in your application. If you don't have
parallelism, it's not going to be good. So in general people are finding most of that in the data
parallelism today. And the second one is regular computation data accesses.
You will need to be able to somehow either have some kind of regular application or regular
dataset that falls on your lap which means you're lucky, or if your application algorithm or your
data is not quite regular, then you need to work hard to regularize it. And regularization is
something that is going to be the key in this whole game. By regularizing things you also create
similar work and avoid load imbalance. And I may not even show you some data because it's a
lunch break.
But you can -- the load imbalance or nonuniform data distribution can really make a huge
difference in terms of performance of these devices.
And then they will finally say well a lot of my time is spent on two smaller things. One is DRAM
bandwidth. And you know, it may appear like a small thing but that's probably where most of the
people spend their time today, trying to optimize the data access patterns and use the on-chip
memory to cut down the amount of bandwidth demand that applications are putting on the chip
and the DRAM system. And the second one is a little more subtle: conflicting updates to memory
locations.
And a lot of the straightforward parallelizations require atomic operations, which in general limit the scalability of
these algorithms. So a lot of the work is actually in turning their algorithms inside out, changing
some of the state, replicating some of the state, so that you can avoid that kind of situation and then
get scalability and oftentimes performance.
So you say, is it really the case? Maybe a few years from now people will be building
hardware where regular data accesses will not be important. Maybe Intel will figure out a way to
build some kind of magical hardware where memory bandwidth will no longer be a problem.
And I sure hope so.
And I'm a computer architect as well, so I would like to be part of the team to build that magical
device. Okay. I really would love to. So if Intel comes to me and says: Here's a piece of
technology and fabrication technology, can you come head up this team and you will build a
magical device, I will quit my university job and join Intel today. Put it this way.
But if history makes any sense, it tells us that whenever you want to manage a massive amount
of parallelism, you do need to have some kind of regular usage pattern. Otherwise, it's not going
to work. Whether it's the military -- Army, Navy, you know, Air Force -- or manufacturing,
agriculture. The second one is close to my house.
And a banquet, you know -- have you ever been at a large banquet where the chef says: Please
let me know exactly how you want to customize your food? You're lucky if the chicken gets to
you while it's still warm, right?
That's what regularization means in large scale systems. This is a picture probably taken not too
far from this location on the left-hand side. For those of you who had to come to the airport
around 5:00, you know what that picture means.
So but you know, whenever the cars can maintain some kind of regular pattern, everyone can
progress. Whenever you start to have some people going zigzag like this, people start to hit
their brakes, and you see what happens to the traffic pattern, right? You start to see stalls.
So that's why I'm relatively pessimistic about the possibility of eliminating some of these
regularization techniques that people need to use in programming these manycore chips in the
future.
So there are a few things that you need to overcome in order to get to success. Serialization due
to conflicting updates, oversubscription of critical resources, and load imbalance.
Bad things happen when the regularization process starts to break down and you start to have
irregular, you know, congestion.
So what are the things that people need to do? I ended up doing something pretty stupid in the
past three months. I agreed to edit the first two volumes of GPU Computing Gems. So we ended
up with almost 300 submissions.
And we carefully read through you know all the proposals and then picked about 110, ones that
we believe are reasonably successful, that people have been successful in doing their
applications and they presented a case of how they reached success. Some have a recipe.
Some kind of you know, repeatable description of what they did.
Out of the 110, when I read through all the work, I started to realize that they invariably go
through pretty much the same process. They, in general, need to understand the relationship
between the domain problems and the computational model.
That's one thing -- the first thing that jumps out. You know, there are many, many different
models for achieving the same kinds of goals in the science, the engineering, the video
processing domains. And you need to understand that level.
The second one is you need to understand the strengths and limitations of your computing devices.
Very few authors who have some kind of erroneous understanding of the strengths and weaknesses
of the chip will ever be able to get performance out of it today. The third one is implementing
the models in a way that steers away from the weaknesses.
So essentially there are practical techniques needed to translate from undesirable patterns to
desirable patterns. That's all it's about. And one of my former students said this to me.
After he finished his thesis, I asked: What did you learn, John? And John said: Boy, we never
solve any problems. We never ever solve any problems. We push the problems around to
places where they matter less.
So it's all about pushing some of these things under the carpet or around the corner where
no traffic is there and they don't matter that much. And then you start to get real performance out
of these things.
So let me comment a little bit on philosophy here. This is a slide from David Patterson, where
he cited the fairly widely known seven dwarfs -- initially seven types of computation:
structured grids, unstructured grids, the fast Fourier transform and so on. Even though Dan said yesterday
that FFT is not an application, it's a very important component of every application. Dan, sorry, to
me it's still very important.
And I'm still learning something about FFT every time. So the point of using this is to say you
know what, we have a reasonable way, framework, for this particular taxonomy, to understand
the general types of applications or the general types of data structure, computation that people
do.
But what this one doesn't teach us is exactly how people are getting performance out of these
types of applications. From my experience, there's a small set of techniques that people can use
to get good performance out of all these types of applications. So this is what I call the seven
techniques in manycore programming.
And the first one is scatter-to-gather transformation. The second one is granularity coarsening and
register tiling. The third one is data access tiling for locality. The fourth one is data layout and
traversal ordering, to get better access patterns. The fifth is binning and cut-off. The sixth is bin sorting and
partitioning for nonuniform data, when the data start to behave badly. And finally, the seventh is hierarchical
queues and kernels for dealing with dynamically changing and dynamically generated data.
So after reading through all the 110 chapters, I started to say, okay, I can map pretty much what
everyone did in all those works into a subset of these seven techniques. Good. So at least we
have something here.
So what I'd like to do is to give you a little bit of sense of how these techniques are actually being
used by real programmers. But the point is I would like to get you started thinking and hopefully I
will also give you some good indication of how the future tools, how the future languages will fit
into this kind of work flow.
So the first one is scatter-to-gather transformation. Whenever you have a quadratic kind of
computation, you have lots of inputs and lots of outputs, and each output is going to
be affected by some significant number of inputs. That's how you get a large amount of computation
out of a lot of these calculations.
It is oftentimes very convenient to express your computation in terms of inputs. That is, given each
input, what are the outputs that should be affected? There's a good reason for this. In most
engineering and science computation, the output tends to be more regular than the input.
Whether it's an N-body problem where you're calculating some kind of force grid, or whether
it's a medical imaging problem where you're trying to take some scan data and generate some
kind of regularized volumetric data, the output tends to be a lot more organized than the input data.
That's why it's very natural to write a piece of code and say okay go get the next input which will
give me -- the input is going to give me some kind of coordinate and some kind of value and I'm
going to do something.
And you can calculate the extent to which the input will affect the data. But in manycore
execution, this is a disaster waiting to happen, because that input-oriented programming is going
to create these threads that will be trying to write into a range of outputs, and they start to trample
on each other. So you need to have some kind of atomic operation.
But these things tend to create long-latency waits, and the threads just have to line up. In general, the
first thing that a lot of these people do is convert the execution into an output-oriented expression.
Essentially say that every thread is going to look at one output. And it's going to figure out what
are the inputs that it's going to need in order to produce the output. Easier said than done.
Because, remember, I said the input tends to be less regular than the output. So given an output, it is in
general not very easy to find the inputs. I'm going to come back to this point. But in general, that's
the first-order transformation that these people tend to do.
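To make the contrast concrete, here is a minimal CUDA sketch, not from the talk; the kernel and array names are hypothetical. The scatter version parallelizes over inputs and must serialize conflicting writes with atomics; the gather version parallelizes over outputs, assuming a per-output list of contributing inputs has already been built (which is exactly the binning work discussed later).

    // Scatter (input-oriented): one thread per input element; threads that hit
    // the same output cell collide and must use atomics.
    __global__ void scatter_version(const float *in_val, const int *in_cell,
                                    float *out, int num_in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_in)
            atomicAdd(&out[in_cell[i]], in_val[i]);
    }

    // Gather (output-oriented): one thread per output cell; each thread reads
    // the inputs that affect its cell from a precomputed list (cell_start has
    // num_out + 1 entries) and writes its own output privately, with no atomics.
    __global__ void gather_version(const float *in_val, const int *cell_start,
                                   const int *cell_inputs, float *out, int num_out) {
        int o = blockIdx.x * blockDim.x + threadIdx.x;
        if (o >= num_out) return;
        float sum = 0.0f;
        for (int k = cell_start[o]; k < cell_start[o + 1]; ++k)
            sum += in_val[cell_inputs[k]];
        out[o] = sum;
    }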
And the second one is thread coarsening, register tiling. We all know that parallel computing in
general requires some kind of redundancy. If you truly want to have Clay do something and
you want to have Mike do something, in general you don't want them to have to communicate
with each other. So if there's some little things that each one can do locally, you don't want them
to communicate. You just want them to go ahead and do them personally.
So I'm showing a fine-grained parallelization, where each chunk is a thread. And then let's say
two redundant pieces of work followed by one unique work and then some redundant work. And
you will have -- let's say -- 12 of these threads that you can schedule into some kind of manycore
chip and then just execute them.
Oftentimes it becomes desirable to fold several threads' worth of work into one bigger, heavier
thread, so that you can calculate the redundant work once, put it into a register
somewhere, and then have the other equivalent threads folded into this thread to enjoy that piece
of work.
This essentially sacrifices some parallelism, but we all know that register access can be
extremely efficient. So at some point, if you have too much parallelism, then oftentimes the real
way of getting efficiency, and even scalability in some ways, is to create large enough threads
that can actually conserve memory accesses and calculation based on the register storage.
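Here is a minimal CUDA sketch of thread coarsening, not from the talk; the common_term function and the coarsening factor are hypothetical stand-ins for the redundant work shared by a group of outputs.

    #define COARSEN 4

    // Stand-in for some redundant work shared by a group of COARSEN outputs.
    __device__ float common_term(const float *a, int group) {
        return a[group] * 0.5f;
    }

    // Fine-grained: one output per thread; every thread in a group recomputes
    // the shared term.
    __global__ void fine_grained(const float *a, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = common_term(a, i / COARSEN) * a[i];
    }

    // Coarsened: each thread produces COARSEN adjacent outputs, so the shared
    // term is computed once and kept in a register -- less parallelism, but
    // fewer redundant memory accesses and recomputations.
    __global__ void coarsened(const float *a, float *out, int n) {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
        if (base >= n) return;
        float c = common_term(a, base / COARSEN);
        for (int k = 0; k < COARSEN && base + k < n; ++k)
            out[base + k] = c * a[base + k];
    }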
Jim's students did a matrix multiplication method which makes use of essentially this concept.
And Jim -- I'm not going to give you an opportunity to hit me with a hard question about that. But
that's my perception of one good thing that you did.
You actually created a much bigger usage of the registers to conserve the bandwidth
requirement on that chip. So the third one is data access tiling. And this particular method
essentially has to do with the following: once you've converted your computation from
input-oriented into output-oriented, then you have this gather pattern. But the gather pattern is
going to require too much memory bandwidth to get all that data in. So what you want to do is
actually start to create some execution phases where only a small chunk of the input
data will be actively used by a large number of threads.
So you will stage these data chunks into the on-chip memory, and then whenever you
have a chunk in the on-chip memory, it will be consumed by all the threads that are
waiting for it. And for dense linear algebra, this is an extremely effective method. If you can do this, don't
use anything else as far as memory bandwidth is concerned.
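The canonical example of this kind of data access tiling is a shared-memory tiled matrix multiplication; a minimal CUDA sketch follows, not taken from any particular library, assuming the matrix size is a multiple of the tile width.

    #define TILE 16

    // Each phase stages one TILE x TILE chunk of A and B into on-chip shared
    // memory; every thread in the block then consumes that chunk before the
    // next one is loaded, cutting DRAM traffic by roughly a factor of TILE.
    __global__ void tiled_matmul(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {   // assumes n is a multiple of TILE
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }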
But it gets more difficult when you start to have ghost cells for PDE solvers, when you have ghost
cells for convolution algorithms, for media processing kind of things. Then you start to -- that's
where a lot of the developers start to trip, because it actually takes some intellectual power to be
able to figure out the real benefit of tiling once you get to that level.
And that's where some of the tools will definitely begin to help. The fourth one is data layout transformation. So in
many data structures, we declare the data structure as a multi-dimensional array and each
element -- each element of the array is actually a struct. So we can just formulate that as another
dimension of the array.
So in this picture, where I'm showing a three-dimensional array, XY -- actually, only two of the
dimensions, and then I show all the elements that are in each array element -- each element in
the array.
So we can do an array-of-structures layout, which is the original layout. And this tends to be a very,
very good CPU layout. And I think one of the earlier talks alluded to this particular issue.
And then you can convert that into a structure of arrays. In general, whenever you don't have an
effective way of tiling your algorithm, such as in LBM, just by converting from
array of structs into struct of arrays, you can actually create enough memory
coalescing that you can conserve a large amount of memory bandwidth on the
current GPUs. And that will give you about four times the performance on a GTX 280 today. What
you really want to do is actually what we call the tiled array of
structs, or actually a tiled struct of arrays.
And essentially, rather than moving all the elements all the way out so that you essentially gather
all the elements together from each dimension, you actually create these little chunks of the
elements and then you essentially repeat this pattern. What you really do is take part of
the lower dimensions of the original array and move them into the lowest dimension.
So this gives you a good compromise between three things. One is coalescing needs. And the
other one is actually memory channel utilization in the GPUs. This gives you a way to actually
spread your active accesses across the six to eight memory channels in the chip today. And the
third one is, if you ever need to use the same layout for CPUs, this gives you better locality,
because you're not spreading nearby struct elements -- such as the
elements from the same original array element -- into far-away places. They're going to be much closer to
each other for CPU caching. So this tends to be the kind of compromise an expert programmer
would make for something like LBM and, to a lesser degree, many other gridded applications.
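A minimal CUDA/C++ sketch of the three layouts discussed above follows; the field names are hypothetical (LBM-style cell quantities), and the tile width is chosen only for illustration.

    #define LAYOUT_TILE 32   // tile width, e.g. matched to a warp for coalescing

    // Array of structures (AoS): good CPU layout, but consecutive GPU threads
    // reading cell[i].rho touch addresses 16 bytes apart (poor coalescing).
    struct CellAoS { float rho, ux, uy, uz; };

    // Structure of arrays (SoA): consecutive threads touch consecutive
    // addresses within each field array (coalesced).
    struct GridSoA { float *rho, *ux, *uy, *uz; };

    // Tiled layout: fields are interleaved at LAYOUT_TILE granularity, so
    // accesses stay coalesced within a tile while values for nearby cells stay
    // close together in memory, which helps CPU caching and spreads traffic
    // across the DRAM channels.
    __host__ __device__ inline
    float *tiled_addr(float *base, int field, int num_fields, int i) {
        int tile = i / LAYOUT_TILE, offset = i % LAYOUT_TILE;
        return base + (tile * num_fields + field) * LAYOUT_TILE + offset;
    }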
The fifth one is input binning. I mentioned very briefly that input data tend to be a lot less
regularized, regular than output. So the way that people deal with this problem is by presorting
this input data into some kind of bins or some kind of categories.
They can be as irregular as some kind of spatial trees, like k-d trees, quadtrees, octrees, these
different kinds of tree structures, or in some applications when the data is sufficiently uniformly
distributed you can use regular-sized spatial bins. And whenever you can get away with the
regular spatial bins, you don't want to go to the more exotic data structures, because you have
much easier access into these structures.
And this is a molecular dynamics application where the force calculation has a certain cut-off
distance. So for each grid point you will have a certain number of bins that need to be
considered. So if you sort these inputs into these bins, then you convert an irregular batch of data
into a much more regular array structure that you can easily index and identify and create a
parallel access. So that's routinely done. And if you look at a typical electrostatic force
calculation, you will see some kind of interesting tricks to avoid some of the bins by giving some
exotic formula in terms of the list that you need to calculate.
The consequence of not doing this can be severe. If you don't do the binning, you're
forced to look at everything before you can determine whether you need to process an
input. For a large dataset, the GPU can actually run much, much slower than an
efficient CPU implementation. But if you pay attention to binning, then you gain all the data
scalability.
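A minimal host-side sketch of binning into uniform, fixed-capacity spatial bins follows; it is not from the talk, and the Atom fields, capacity handling and overflow list are illustrative assumptions.

    #include <vector>

    struct Atom { float x, y, z, charge; };

    // Sort atoms into fixed-capacity uniform spatial bins; anything that does
    // not fit goes to an overflow list that can be handled separately.
    void bin_atoms(const std::vector<Atom> &atoms, float bin_size, int bins_per_dim,
                   int capacity, std::vector<Atom> &binned, std::vector<int> &count,
                   std::vector<Atom> &overflow) {
        int num_bins = bins_per_dim * bins_per_dim * bins_per_dim;
        binned.assign((size_t)num_bins * capacity, Atom{0, 0, 0, 0});
        count.assign(num_bins, 0);
        for (const Atom &a : atoms) {
            int bx = (int)(a.x / bin_size), by = (int)(a.y / bin_size), bz = (int)(a.z / bin_size);
            int b = (bz * bins_per_dim + by) * bins_per_dim + bx;
            if (count[b] < capacity)
                binned[(size_t)b * capacity + count[b]++] = a;
            else
                overflow.push_back(a);   // nonuniform excess, handled elsewhere (e.g., on the CPU)
        }
    }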
So the sixth one: bin sorting and partitioning. This has to do with non-uniform data distribution. Not all the
data in the real world are uniformly distributed. I wish they were. In fact, if they were all uniformly
distributed, my job as a tool developer would be a whole lot easier.
But when people come to me, they will first give me some semi-uniform data. And once we give
them some good performance, they will say: oh, by the way, I have spiral scan data that I need
to handle, and you look at it and you see a huge, nonuniformly distributed dataset.
So in general the most effective way that people deal with those kind of data today is by taking
the input data, sort them into some kind of ordered bins, and then also limit the bin size so that
some of the input will overflow into a CPU bin.
And by limiting the data, then you have -- you can actually do a prefix scan to identify the
boundaries of the implicit bins. They don't need to be the same size. But that's why prefix scan
is such an important calculation in data parallel computing. And then so once you define the bin
boundaries then you can load the section of the bins that are important to each thread into the on
chip memory. And, yes, you can see that the range could be dynamic.
So you may have a situation where you may not have enough space to hold all the bins
necessary for these two threads. So that's where you actually will need to fall back to the main
memory. So in general, if you look at some of these applications, they will actually first test if the
input falls in the range of the on chip memory. If not, they go to the DRAM and fetch the ones
that cannot be windowed into the on chip memory.
And this is where you cross from naturally organized -- okay, irregular -- data into
artificially regularized data.
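As a small illustration of the prefix-scan step mentioned above, here is a sketch using Thrust (an assumption about tooling, not something prescribed in the talk): an exclusive scan over per-bin counts yields the start offset of each variable-sized bin in one packed array.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Given per-bin element counts, compute the starting offset of each bin;
    // bin b then occupies [starts[b], starts[b] + counts[b]) in the packed array.
    void bin_starts(const thrust::device_vector<int> &counts,
                    thrust::device_vector<int> &starts) {
        starts.resize(counts.size());
        thrust::exclusive_scan(counts.begin(), counts.end(), starts.begin());
    }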
And this is where a lot of the tricks are. This is where a lot of the mistakes are being made, and
this is where a lot of the tools should help. None of the tools that I know of help with any of this.
So that's the reason why as a tool developer I look at this, I look at what people are doing and I
lower my head and say: Shame on me. I'm not tuned into that process. The seventh one: when things are
dynamic, like graph algorithms -- things where you dynamically identify the
next batch of dynamic data that you need to process. In general, the way to get
performance out of that kind of application today is hierarchical queues, where you use threads
to produce queued data elements at the warp level, which takes advantage of the fact that the
hardware is going to mostly have these processing elements active
at the same time. So if you provide different queues to different processing elements, then the
contention will be minimized, if not totally eliminated.
So you can provide -- let's say for a current NVIDIA GPU, you can provide eight
queues for the eight processing elements. As they do this context switching and so on, different
threads will be accumulating into the same queue, but they would be doing this in turn. However,
you need to be careful, because more and more GPUs will be executing more than one warp at
the same time. So you still need to use atomic operations. Otherwise when you move onto Fermi
you can cause bugs in your code.
Now after each thread block is done, then you can accumulate.
You can actually write all your output into the block-level queue. And after the kernel is done, then
you can take the block-level queue and then write it into the global queue.
Once you get beyond the warp level and get beyond the block level, the contention will be
minimized. So it's all about scalability of your dynamically assembled data: how do you avoid
having a bottleneck when inserting the next batch of data into that structure?
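Here is a minimal CUDA sketch of the idea, collapsing the warp level into a single block-level queue for brevity; the filter condition, capacities and names are hypothetical. Each block accumulates results in shared memory with cheap shared-memory atomics and then makes one bulk reservation in the global queue.

    #define BLOCK_QUEUE_CAP 1024

    // One shared-memory atomic per element, one global atomic per block:
    // contention on the single global counter drops from per-element to per-block.
    __global__ void produce(const int *input, int n, int *gqueue, int *gcount) {
        __shared__ int bqueue[BLOCK_QUEUE_CAP];
        __shared__ int bcount, gbase;
        if (threadIdx.x == 0) bcount = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && input[i] > 0) {                 // hypothetical filter condition
            int pos = atomicAdd(&bcount, 1);
            if (pos < BLOCK_QUEUE_CAP) bqueue[pos] = input[i];
            // (a full version would also handle block-queue overflow)
        }
        __syncthreads();

        if (threadIdx.x == 0)
            gbase = atomicAdd(gcount, min(bcount, BLOCK_QUEUE_CAP));
        __syncthreads();

        for (int k = threadIdx.x; k < min(bcount, BLOCK_QUEUE_CAP); k += blockDim.x)
            gqueue[gbase + k] = bqueue[k];
    }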
So correspondingly, people also have been doing hierarchical, you know, kernel launch for these
kind of applications. In general, let's say if you are doing a breadth-first search algorithm, you will
start with a source and you will kind of grow your frontier.
And you can grow into a large number of nodes. But then at some point you may also shrink at
some point depending on your constraint. But, in general, you'll grow more than you will shrink.
At the beginning, if you're processing only a small number of frontier nodes and you have to
launch a kernel and do a synchronization, you'd lose all your performance. So there are usually
three levels of kernels in the high-performance libraries. One is a small kernel that has only
one thread block and uses the barrier synchronization that's intrinsic in the hardware to do the
synchronization and the frontier propagation, until the frontier grows big enough; then they
go to the second-tier kernel, which I think Jim's students actually have been using.
Essentially you launch only enough thread blocks to at most equal the number of SMs
in your hardware.
And then the language allows you to do a global synchronization across the thread blocks. That
will take care of up to about 10,000 nodes in the algorithm. And eventually
you will have a big enough frontier that you just launch a kernel every time and then you
terminate the kernel for the global synchronization.
But when you have so much data to process, the launch overhead is no longer the biggest issue.
You will see this kind of pattern in a lot of the dynamic graph-oriented applications.
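The host-side dispatch might look roughly like the following sketch, which is purely illustrative: the kernels are empty stubs, and the thresholds and block size are made-up placeholders rather than values from the talk.

    // Three-level kernel hierarchy for a frontier-based algorithm such as BFS.
    __global__ void bfs_single_block(/* graph, frontier, ... */) { /* expansion omitted */ }
    __global__ void bfs_persistent(/* graph, frontier, ... */)   { /* expansion omitted */ }
    __global__ void bfs_one_level(/* graph, frontier, ... */)    { /* expansion omitted */ }

    void bfs_drive(int frontier_size, int num_sms) {
        while (frontier_size > 0) {
            if (frontier_size <= 512) {
                // Small frontier: one block, iterates internally using __syncthreads().
                bfs_single_block<<<1, 512>>>();
            } else if (frontier_size <= 10000) {
                // Medium frontier: one persistent block per SM with a global barrier.
                bfs_persistent<<<num_sms, 512>>>();
            } else {
                // Large frontier: relaunch per level; launch overhead is now negligible.
                bfs_one_level<<<(frontier_size + 511) / 512, 512>>>();
            }
            frontier_size = 0;   // placeholder: read back the new frontier size here
        }
    }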
So the reason why I talk about those three techniques here is really this: These tools need to fit
into the work flow. Those are the things on which most of the manycore programmers
spend their time. They're spending all that time trying to get those kinds of techniques done.
And so the tools should help them form efficient kernels out of these things and provide a clear
performance model. You know, there should be a very clear way for them to say: Oh, this is a
good kernel, this is not a good kernel for my data. It should be a very easy thing to do, right?
And then, you know, the tools need to support software engineering when people apply these techniques.
Portability: if someone crafts their kernel with these techniques, are these kernels portable or not
portable to different target architectures? Testing: how do people test these kernels once
they're done, and how many times do they need to test these things?
Interface integrity assumption verification. A lot of times when people apply these techniques
they actually need to make some assumptions about how the rest of the code is behaving. The
pointers, the sizes of the data structures, the shapes and so on.
And currently there's no way for the system to help them to enforce these assumptions or check
these assumptions at runtime -- to say: what you're assuming here is not really what's
happening in the rest of the code; the legacy code is doing something weird here; you're not
going to be able to do the parallelism that you want.
But it's not there. Right? So that's why we believe that these tools and applications really need
to be co-designed so that the tools are feeding into the places where people are really doing most
of their work.
Okay. So this leads to what we're proposing for the second phase of UPCRC in Illinois. So we
have three big chunks: AvaScholar, Safe Speed, and acrobatics. AvaScholar
[phonetic] is the component where they're developing some very demanding applications. Some
run only on manycores and some run on both manycores and multicores today.
But there will be continued performance requirements out of these applications. And through the
process we are feeding these developers with the safety tools and performance tuning tools. And
essentially we're asking the same question again and again. Why don't people use the tool?
Why are they spending three days here without any tool help. What's going on here? So that we
can really begin to tune into their development process.
And this is a slide that Marc Snir generated, which I really like. It's a little bit
philosophical, but I think it's a very important point. The point is that currently we have the compiler
model, where we generate different implementations of the same code.
And then we have the user model, where the user will generate different
code versions through that whole process of performance tuning.
There's nothing in between. And whenever the user cannot fit exactly into this model, they just
get completely out of this model and do all these things by hand. I really think
that one of the biggest challenges is to actually provide something that can engage the
user somewhere in the middle, so the users will not do these things totally by hand.
And then the compiler will not have to do all the heavy lifting totally by itself. It's easier
said than done; all of us who have experience with that kind of process know it is a very big
challenge.
So I have a few of these applications. For example, here for the gesture interface, Tom Horn's
students already have an implementation on the GPU that is near real time, but they still have
stability issues. So they could use another ten times the performance just to get to real stability for
sensor-free gesture detection.
The same thing for the surface reconstruction work by John Hearse's students: if
you really want to do good-quality reconstruction of the surfaces and so on for
interactive virtual presence, we still have a long way to go in terms of what we can do. Emotion
detection was mentioned some yesterday; detecting emotion now is based mostly on the
volumetric image rather than the real gridded surface reconstruction.
So we still have a long way to go in order to get more reliable detection of emotions.
So we organized these projects into refactoring paths which will tap into the higher-level
programming path that implements parts or all of the seven transformation techniques.
And we currently have only partial implementations of three of the seven. So it's not an easy
task.
But we're definitely making progress. And one of the things we're very excited about in the project
is that we all understand that we're trying to be aggressive in terms of
speed, power and so on. But I think there's an aspect of a safety net that has really been
overlooked by a lot of these tools. That is, whenever people perform this kind of big brain
surgery on some part of the code for performance, they really are making some significant
interfacing assumptions.
So when some of these assumptions are violated at runtime, there is no one
checking these things. It's not necessarily just for parallelization; even for code updates, even for
regular software updates, there are routinely these kinds of problems that incur a lot of
costs later on for software support. We're developing a piece of technology that checks and
isolates these assumptions at runtime and minimizes the debugging and support costs of these
kinds of things. So that's the Safe Speed project. I didn't spend a whole lot of time on it. But
it really provides a good level of scrutiny over the code base for parallelism. So there are some
action items for the sponsors, as usual.
But one of the things we believe would need to happen in the future is not just Intel and Microsoft.
We really need to begin to engage some more of the ISVs.
So this brings me to the final slide here. This is a very well known phenomenon of valley of
death. You have a whole bunch of research that is very nice, and then you have some start-up
company doing some kind of tools and so on.
But then you have very few things that you can actually take from research and successfully
apply to these things. Whenever you try to get some high value kind of research into the
commercial world, there's a valley of death.
So what we're saying here is, yes, the valley of death would be there. Yes, some of our tools will
be there. On the other hand, the only way that some of these tools will make
their way into the commercial world is by giving the developers what they want. And I think we
have a much better understanding, for this particular domain, of what people want.
I didn't talk about the entire domain, and I don't want to spend the rest of the afternoon
introducing these other domains, but I hope that I gave you a good sense of our philosophy, our
research techniques and how we actually -- how we feel that these tools that we're developing
will be different from the tools that people have been developing in the past 30 years.
There's a different level of engagement that we're doing with the developers. So with that, thank
you very much for your attention and I will be happy to answer a couple of questions.
[applause]
>>: So with all these transformations --
>>: Stephon.
>>: Yes, there are two extremes. One is to refactor the code somehow and make explicit those
things that you want to have tested and so forth. And the other one is to go to a level of abstraction
where the developer is expressing less about the data layout and more about the algorithms and
having a system underneath to transform them. In your experience, which of these has
been more successful, and what's your opinion?
>> Wen-Mei Hwu: Okay, intuitively you would expect the first one to be more successful and the
second one to be less successful. In practice the first one is more successful and the second one
less successful. The reason is that, because of the sorry state of what we're providing, people don't
trust it.
So they'd rather do all the things themselves, and at least they know what they're testing. So that's
the reason. But to me you are actually pointing to a very important point. Unless we start to tip
that balance, the productivity is always going to be a problem. So that's actually the key of the
problem.
When can we hit that tipping point -- the balancing point -- where they start to trust some of these tools, so
that they would start to use the testing methodology that we provide as part of the tool rather than
doing the explicit tests themselves? And that's going to be the real thing.
>>: Thank you.
>> Wen-Mei Hwu: Any other questions? Yes, please.
>>: What about the architecture implications of [indiscernible]?
>> Wen-Mei Hwu: So a couple of things that we did learn. One is we can complain about this
hardware all day long. Every time I teach a course, and every time I see some
developers, the first thing they say is: oh, you know, why do I need to do coalescing? It's such a
pain, right? I need to do code tuning and all that stuff.
And so one thing that I think hardware people need to really decide is: you know what, you really
need to communicate effectively to people what things you will not be able to do
in the future. A lot of these people are still hoping that you will be able to reverse that trend.
Maybe you will; if so, come out and say clearly that you will. But if you can't, come out and
say that this is something the tools and the programmers will really have to take care of.
Because indecision in that communication process is very harmful.
The second one is: you know what, I actually learned that for any successful manycore
processor, the part that the CPU does is incredibly important. So if you look at all the
regularization processes, most of the techniques can only be done on the CPUs today.
So there is a very important part that I think -- you could put CPU and GPU into competition mode
and say whatever block processing you will do I will do. So there's a lot of low-hanging fruit
where a lot of the regularization processes are not necessarily supported by the CPUs in the
most efficient way. That may be one of the things that the Intel architects should take a look at.
Any other questions? If not -- oh, yes, Jim.
>>: So where does auto tuning fit into the framework?
>> Wen-Mei Hwu: Yes, where does auto tuning fit into the framework? So auto tuning to me is
kind of a parameter-setting kind of thing. In general, if you look at all these seven
techniques, there are lots and lots of parameters, but each technique will end up in the code. So
that's why I didn't list auto tuning as part of the seven. To me, it's a final
implementation phase where you finalize the parameters for those techniques.
>>: If we may -- [indiscernible].
>> Wen-Mei Hwu: You mean the algorithm book.
>>: Yes.
>> Wen-Mei Hwu: As soon as I go I'll start writing. [laughter].
>>: [indiscernible].
>> Wen-Mei Hwu: In fact -- so Juan is referring to the GPU Computing Gems books. Volume 1 is
coming out in December. We have already finalized all the contents, and it is in production. It does
take about four months to come out. There will be 50 articles in 10 application areas, and the
second one will come out in June; that's going to be the next seven application areas plus the
frameworks and tools. So there will be a total of 110 articles. And that's where I extracted some
of these observations as well.
>> Juan Vargas: Thank you very much.
>> Wen-Mei Hwu: Thank you.
[applause]