>> Juan Vargas: We're going to have a fantastic keynote by Wen-Mei Hwu, who is the Walter
Sanders AMD Chair of Electrical and Computer Engineering at the University of Illinois. He has
a Ph.D. from the University of California at Berkeley. And he has received many,
many, many awards for his work in architecture and software for HPC.
He received the 1993 Eta Kappa Nu Outstanding Young Electrical Engineer Award, the 1994
Phillips award for faculty research, and many other awards. He's going to be talking about
high-level programming of manycores. Please help me to welcome him.
[applause]
>> Wen-Mei Hwu: Thanks, Juan, thanks for that extremely warm introduction. But now I hope I
can live up to the person you introduced.
>> Juan Vargas: We [indiscernible].
>> Wen-Mei Hwu: So this is really a lunch talk. So I promise no compiler flow analysis, no
dependence analysis, no equations. And you know anything else you want me to take out, I will.
So what I'd like to do is give you a little bit of a journey of how we're engaging people
who are actually doing manycore programming today, and what kinds of things we believe will
have to be advanced technologically in order to really help these people.
And so the work involves a lot of people. This is a picture of the UPCRC Illinois team. There are
17 faculty members. Many of them are here. So I see many, many faces here.
And there are a few aspects of the subject on which you will probably get even better treatment by
talking to them this afternoon. So instead of taking a nap after lunch, you should go to the
sessions or talk to some of my colleagues. So the vision of UPCRC Illinois is to make parallel
programming easy.
And we especially want to focus on simple programming patterns. And there will be some very,
very hard problems. And we believe that if we help people to solve the simpler problems first,
then we have a chance of attacking the harder problems.
Right now I personally believe that a lot of these problems have to do with the tools not being
able to engage the real development process and work flow and people's real needs.
So the topic says manycore. So I'm going to go over quickly what I mean. What I mean is the
kind of processors that will have limited resources, very limited resources.
Mostly, the resources will be dedicated to the compute units. So you have lots and lots of them.
And then you will have a relatively small amount of control. So in some ways it could be a CPU
that is reduced in terms of core complexity, or reduced in the amount of cache, or a GPU; today
you will see typical examples of these manycore types that I will be talking about
here.
So the first thing we learned about manycore is it only makes sense to program these manycores
if you care about performance, scalability or both. So if you don't care about either of those, you
really should save your time. And if you just write a very simple piece of code, chances are you
will run better on the CPU. So don't bother with that.
And the other lesson we learned is that performance on a single chip versus scalability across
different chip levels actually share a lot of the same techniques that people need to use. So in
many ways the first-order effect of getting performance on these chips is to make sure that your
program is scalable.
And the techniques that I'm going to be talking about that many of these developers wish we
could help them with fall into that category. And then there is some detailed performance
tuning, some fairly fine-grained tuning, that you can do. But those are in general second-order
effects.
Many people, even from some of these vendors, will tell you that it may not even be
worthwhile to do that kind of fine-tuning, even for a particular generation.
Okay. So what I would like to convey to you is today there is a major gap between what the
programmers need and what the tools are actually doing. And it's our job, people in this room, to
be able to fill that gap, and being able to empower these people.
So there are three important things that -- almost every manycore programmer would tell you,
okay, if they're successful in doing what they do, they will tell you there are three things in general
they probably did right.
The first one is you need to have massive parallelism in your application. If you don't have
parallelism, it's not going to be good. So in general people are finding most of that in the data
parallelism today. And the second one is regular computation data accesses.
You will need to be able to somehow either have some kind of regular application or regular
dataset that falls on your lap which means you're lucky, or if your application algorithm or your
data is not quite regular, then you need to work hard to regularize it. And regularization is
something that is going to be the key in this whole game. By regularizing things you also create
similar work and avoid load imbalance. And I may not even show you some data because it's a
lunch break.
But you can -- the load imbalance or nonuniform data distribution can really make a huge
difference in terms of performance of these devices.
And then they will finally say well a lot of my time is spent on two smaller things. One is DRAM
bandwidth. And you know, it may appear like a small thing but that's probably where most of the
people spend their time today, trying to optimize the data access patterns and use the on-chip
memory to cut down the amount of bandwidth demand that applications are putting on the chip
and the DRAM system. And the second one is a little more subtle: conflicting updates to memory
locations.
And a lot of the straightforward parallelizations require atomic operations, which in general limit the scalability of
these algorithms. So a lot of the work is actually in turning their algorithms inside out, changing
some of the state, replicating some of the state, so that you can avoid that kind of situation and then
get scalability and oftentimes performance.
So you say, is it really the case? Maybe a few years from now people will be building
hardware where regular data accesses will not be important. Maybe Intel will figure out a way to
build some kind of magical hardware where memory bandwidth will no longer be a problem.
And I sure hope so.
And I'm a computer architect as well, so I would like to be part of the team to build that magical
device. Okay. I really would love to. So if Intel comes to me and says: Here's a piece of
technology and fabrication technology, can you come head up this team and you will build a
magical device, I will quit my university job and join Intel today. Put it this way.
But if history makes any sense, it tells us that whenever you want to manage a massive amount
of parallelism, you do need to have some kind of regular usage pattern. Otherwise, it's not going
to work. Whether it's the military -- Army, Navy, you know, Air Force -- or manufacturing,
agriculture. The second one is close to my house.
And a banquet, you know -- have you ever been at a large banquet where the chef says: Please
let me know exactly how you want to customize your food? You're lucky if the chicken gets to
you while it's still warm, right?
That's what regularization means in large scale systems. This is a picture probably taken not too
far from this location on the left-hand side. For those of you who had to come to the airport
around 5:00, you know what that picture means.
So but you know, whenever the cars can maintain some kind of regular pattern, everyone can
progress. Whenever you start to have some people going zigzag like this, people start to hit
their brakes, and you see what happens to the traffic pattern, right? You start to see stalls.
So that's why I'm relatively pessimistic about the possibility of eliminating some of these
regularization techniques that people need to use in programming these manycore chips in the
future.
So there are a few things that you need to overcome in order to get to success. Serialization due
to conflicting updates, oversubscription of critical resources, and load imbalance.
Bad things happen when the regularization process starts to break down and you start to have
irregular, you know, congestion.
So what are the things that people need to do? I ended up doing something pretty stupid in the
past three months. I agreed to edit the first two volumes of GPU Computing Gems. So we ended
up with almost 300 submissions.
And we carefully read through you know all the proposals and then picked about 110, ones that
we believe are reasonably successful, that people have been successful in doing their
applications and they presented a case of how they reached success. Some have a recipe.
Some kind of you know, repeatable description of what they did.
Out of the 110, when I read through all the work, I started to realize that they invariably go
through pretty much the same process. They, in general, need to understand the relationship
between the domain problems and the computational model.
That's one thing -- the first thing that jumps out. You know, there are many, many different
models for achieving the same kinds of goals in the science, the engineering, the video
processing domains. And you need to understand that level.
The second one is you need to understand the strengths and limitations of your computing devices.
Very few authors who have some kind of erroneous understanding of the strengths and weaknesses
of the chip will ever be able to get performance out of it today. The third one is implementing
the models in a way that steers away from the weaknesses.
So essentially there are practical techniques needed to translate from undesirable patterns to
desirable patterns. That's all it's about. And one of my former students said this to me.
After he finished his thesis, I asked: What did you learn, John? And John said: Boy, we never
solve any problems. We never ever solve any problems. We push the problems around to
places where they matter less.
So it's all about pushing some of these things under the carpet or around the corner where
no traffic is there and they don't matter that much. And then you start to get real performance out
of these things.
So let me comment a little bit on philosophy here. This is a slide from David Patterson, where
he cited the fairly widely known seven dwarfs -- initially seven types of computation:
structured grids, unstructured grids, the fast Fourier transform and so on. Even though Dan said yesterday
that FFT is not an application, it's a very important component of every application. Dan, sorry, to
me it's still very important.
And I'm still learning something about FFT every time. So the point of using this is to say you
know what, we have a reasonable way, framework, for this particular taxonomy, to understand
the general types of applications or the general types of data structure, computation that people
do.
But what this one doesn't teach us is exactly how people are getting performance out of these
types of applications. From my experience, there's a small set of techniques that people can use
to get good performance out of all these types of applications. So this is what I call the seven
techniques in manycore programming.
And the first one is scatter-to-gather transformation. The second one is granularity coarsening and
register tiling. The third one is data access tiling for locality. The fourth one is data layout and
traversal ordering, to get better access patterns. The fifth is binning and cut-off. The sixth is bin sorting and
partitioning for nonuniform data, when the data start to behave badly. And finally, the seventh is hierarchical
queues and kernels for dealing with dynamically changing and dynamically generated data.
So after reading through all the 110 chapters, I started to say, okay, I can map pretty much what
everyone did in all those works into a subset of these seven techniques. Good. So at least we
have something here.
So what I'd like to do is to give you a little bit of sense of how these techniques are actually being
used by real programmers. But the point is I would like to get you started thinking and hopefully I
will also give you some good indication of how the future tools, how the future languages will fit
into this kind of work flow.
So the first one is scatter-to-gather transformation. Whenever you have a quadratic kind of
computation, you have lots of inputs and lots of outputs, and each output is going to
be affected by some significant number of inputs. That's how you get a large amount of computation
out of a lot of these calculations.
It is oftentimes very convenient to express your computation in terms of inputs. That is, given each
input, what are the outputs that should be affected? There's a good reason for this. In most
engineering and science computation, the output tends to be more regular than the input.
Whether it's an N-body problem where you're calculating some kind of force grid, or whether
it's a medical imaging problem where you're trying to take some scan data and generate some
kind of regularized volumetric data, the output tends to be a lot more organized than the input data.
That's why it's very natural to write a piece of code and say okay go get the next input which will
give me -- the input is going to give me some kind of coordinate and some kind of value and I'm
going to do something.
And you can calculate the extent to which the input will affect the data. But in manycore
execution, this is a disaster waiting to happen, because that input-oriented programming is going
to create these threads that will be trying to write into a range of outputs, and they start to trample
on each other. So you need to have some kind of atomic operation.
But these things tend to create long-latency waits, and the threads just have to line up. In general, the
first thing that a lot of these people do is convert the execution into an output-oriented expression.
Essentially say that every thread is going to look at one output. And it's going to figure out what
are the inputs that it's going to need in order to produce the output. Easier said than done.
Because, remember, I said the input tends to be less regular than the output. So given an output, it is in
general not very easy to find the inputs. I'm going to come back to this point. But in general, that's
the first-order transformation that these people tend to do.
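To make the contrast concrete, here is a minimal CUDA sketch, not from the talk; the kernel and array names are hypothetical. The scatter version parallelizes over inputs and must serialize conflicting writes with atomics; the gather version parallelizes over outputs, assuming a per-output list of contributing inputs has already been built (which is exactly the binning work discussed later).

    // Scatter (input-oriented): one thread per input element; threads that hit
    // the same output cell collide and must use atomics.
    __global__ void scatter_version(const float *in_val, const int *in_cell,
                                    float *out, int num_in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_in)
            atomicAdd(&out[in_cell[i]], in_val[i]);
    }

    // Gather (output-oriented): one thread per output cell; each thread reads
    // the inputs that affect its cell from a precomputed list (cell_start has
    // num_out + 1 entries) and writes its own output privately, with no atomics.
    __global__ void gather_version(const float *in_val, const int *cell_start,
                                   const int *cell_inputs, float *out, int num_out) {
        int o = blockIdx.x * blockDim.x + threadIdx.x;
        if (o >= num_out) return;
        float sum = 0.0f;
        for (int k = cell_start[o]; k < cell_start[o + 1]; ++k)
            sum += in_val[cell_inputs[k]];
        out[o] = sum;
    }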
And the second one is thread coarsening, register tiling. We all know that parallel computing in
general requires some kind of redundancy. If you truly want to have Clay do something and
you want to have Mike do something, in general you don't want them to have to communicate
with each other. So if there's some little things that each one can do locally, you don't want them
to communicate. You just want them to go ahead and do them personally.
So I'm showing a fine-grained parallelization, where each chunk is a thread. And then let's say
two redundant pieces of work followed by one unique work and then some redundant work. And
you will have -- let's say -- 12 of these threads that you can schedule into some kind of manycore
chip and then just execute them.
Oftentimes it becomes desirable to fold several threads' worth of work into one bigger, heavier
thread, so that you can calculate the redundant work once, put it into a register
somewhere, and then have the other equivalent threads folded into this thread to enjoy that piece
of work.
This essentially sacrifices some parallelism, but we all know that register access can be
extremely efficient. So at some point, if you have too much parallelism, then oftentimes the real
way of getting efficiency, and even scalability in some ways, is to create large enough threads
that can actually conserve memory accesses and calculation based on the register storage.
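Here is a minimal CUDA sketch of thread coarsening, not from the talk; the common_term function and the coarsening factor are hypothetical stand-ins for the redundant work shared by a group of outputs.

    #define COARSEN 4

    // Stand-in for some redundant work shared by a group of COARSEN outputs.
    __device__ float common_term(const float *a, int group) {
        return a[group] * 0.5f;
    }

    // Fine-grained: one output per thread; every thread in a group recomputes
    // the shared term.
    __global__ void fine_grained(const float *a, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = common_term(a, i / COARSEN) * a[i];
    }

    // Coarsened: each thread produces COARSEN adjacent outputs, so the shared
    // term is computed once and kept in a register -- less parallelism, but
    // fewer redundant memory accesses and recomputations.
    __global__ void coarsened(const float *a, float *out, int n) {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
        if (base >= n) return;
        float c = common_term(a, base / COARSEN);
        for (int k = 0; k < COARSEN && base + k < n; ++k)
            out[base + k] = c * a[base + k];
    }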
Jim's students did a matrix multiplication method which makes use of essentially this concept.
And Jim -- I'm not going to give you an opportunity to hit me with a hard question about that. But
that's my perception of one good thing that you did.
You actually created a much bigger usage of the registers to conserve the bandwidth
requirement on that chip. So the third one is data access tiling. And this particular method
essentially has to do with the following: once you've converted your computation from
input-oriented into output-oriented, then you have this gather pattern. But the gather pattern is
going to require too much memory bandwidth to get all that data in. So what you want to do is
actually start to create some execution phases where only a small chunk of the input
data will be actively used by a large number of threads.
So you will stage these data chunks into the on-chip memory, and then whenever you
have a chunk in the on-chip memory, it will be consumed by all the threads that are
waiting for it. And for dense linear algebra, this is an extremely effective method. If you can do this, don't
use anything else as far as memory bandwidth is concerned.
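The canonical example of this kind of data access tiling is a shared-memory tiled matrix multiplication; a minimal CUDA sketch follows, not taken from any particular library, assuming the matrix size is a multiple of the tile width.

    #define TILE 16

    // Each phase stages one TILE x TILE chunk of A and B into on-chip shared
    // memory; every thread in the block then consumes that chunk before the
    // next one is loaded, cutting DRAM traffic by roughly a factor of TILE.
    __global__ void tiled_matmul(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {   // assumes n is a multiple of TILE
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }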
But it gets more difficult when you start to have ghost cells for PDE solvers, when you have ghost
cells for convolution algorithms, for media processing kind of things. Then you start to -- that's
where a lot of the developers start to trip, because it actually takes some intellectual power to be
able to figure out the real benefit of tiling once you get to that level.
And that's where some of the tools will definitely begin to help. The fourth one is data layout transformation. So in
many data structures, we declare the data structure as a multi-dimensional array and each
element -- each element of the array is actually a struct. So we can just formulate that as another
dimension of the array.
So in this picture, where I'm showing a three-dimensional array, XY -- actually, only two of the
dimensions, and then I show all the elements that are in each array element -- each element in
the array.
So we can do an array-of-structures layout, which is the original layout. And this tends to be a very,
very good CPU layout. And I think one of the earlier talks alluded to this particular issue.
And then you can convert that into a structure of arrays. In general, whenever you don't have an
effective way of tiling your algorithm, such as in LBM, just by converting from
array of structs into struct of arrays, you can actually create enough memory
coalescing that you can conserve a large amount of memory bandwidth on the
current GPUs. And that will give you about four times the performance on a GTX 280 today. What
you really want to do is actually what we call the tiled array of
structs, or actually a tiled struct of arrays.
And essentially, rather than moving all the elements all the way out so that you essentially gather
all the elements together from each dimension, you actually create these little chunks of the
elements and then you essentially repeat this pattern. What you really do is take part of
the lower dimensions of the original array and move them into the lowest dimension.
So this gives you a good compromise between three things. One is coalescing needs. And the
other one is actually memory channel utilization in the GPUs. This gives you a way to actually
spread your active accesses across the six to eight memory channels in the chip today. And the
third one is, if you ever need to use the same layout for CPUs, this gives you better locality,
because you're not spreading nearby struct elements -- such as the
elements from the same original array element -- into far-away places. They're going to be much closer to
each other for CPU caching. So this tends to be the kind of compromise an expert programmer
would make for something like LBM and, to a lesser degree, many other gridded applications.
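A minimal CUDA/C++ sketch of the three layouts discussed above follows; the field names are hypothetical (LBM-style cell quantities), and the tile width is chosen only for illustration.

    #define LAYOUT_TILE 32   // tile width, e.g. matched to a warp for coalescing

    // Array of structures (AoS): good CPU layout, but consecutive GPU threads
    // reading cell[i].rho touch addresses 16 bytes apart (poor coalescing).
    struct CellAoS { float rho, ux, uy, uz; };

    // Structure of arrays (SoA): consecutive threads touch consecutive
    // addresses within each field array (coalesced).
    struct GridSoA { float *rho, *ux, *uy, *uz; };

    // Tiled layout: fields are interleaved at LAYOUT_TILE granularity, so
    // accesses stay coalesced within a tile while values for nearby cells stay
    // close together in memory, which helps CPU caching and spreads traffic
    // across the DRAM channels.
    __host__ __device__ inline
    float *tiled_addr(float *base, int field, int num_fields, int i) {
        int tile = i / LAYOUT_TILE, offset = i % LAYOUT_TILE;
        return base + (tile * num_fields + field) * LAYOUT_TILE + offset;
    }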
The fifth one is input binning. I mentioned very briefly that input data tend to be a lot less
regularized, regular than output. So the way that people deal with this problem is by presorting
this input data into some kind of bins or some kind of categories.
They can be as irregular as some kind of spatial trees, like k-d trees, quadtrees, octrees, these
different kinds of tree structures, or in some applications when the data is sufficiently uniformly
distributed you can use regular-sized spatial bins. And whenever you can get away with the
regular spatial bins, you don't want to go to the more exotic data structures, because you have
much easier access into these structures.
And this is a molecular dynamics application where the force calculation has a certain cut-off
distance. So for each grid point you will have a certain number of bins that need to be
considered. So if you sort these inputs into these bins, then you convert an irregular batch of data
into a much more regular array structure that you can easily index and identify and create a
parallel access. So that's routinely done. And if you look at a typical electrostatic force
calculation, you will see some kind of interesting tricks to avoid some of the bins by giving some
exotic formula in terms of the list that you need to calculate.
The consequence of not doing this can be severe. If you don't do the binning, you're
forced to look at everything before you can determine whether you need to process an
input. For a large dataset, the GPU can actually run much, much slower than an
efficient CPU implementation. But if you pay attention to binning, then you gain all the data
scalability.
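A minimal host-side sketch of binning into uniform, fixed-capacity spatial bins follows; it is not from the talk, and the Atom fields, capacity handling and overflow list are illustrative assumptions.

    #include <vector>

    struct Atom { float x, y, z, charge; };

    // Sort atoms into fixed-capacity uniform spatial bins; anything that does
    // not fit goes to an overflow list that can be handled separately.
    void bin_atoms(const std::vector<Atom> &atoms, float bin_size, int bins_per_dim,
                   int capacity, std::vector<Atom> &binned, std::vector<int> &count,
                   std::vector<Atom> &overflow) {
        int num_bins = bins_per_dim * bins_per_dim * bins_per_dim;
        binned.assign((size_t)num_bins * capacity, Atom{0, 0, 0, 0});
        count.assign(num_bins, 0);
        for (const Atom &a : atoms) {
            int bx = (int)(a.x / bin_size), by = (int)(a.y / bin_size), bz = (int)(a.z / bin_size);
            int b = (bz * bins_per_dim + by) * bins_per_dim + bx;
            if (count[b] < capacity)
                binned[(size_t)b * capacity + count[b]++] = a;
            else
                overflow.push_back(a);   // nonuniform excess, handled elsewhere (e.g., on the CPU)
        }
    }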
So the sixth one: bin sorting and partitioning. This has to do with non-uniform data distribution. Not all the
data in the real world are uniformly distributed. I wish they were. In fact, if they were all uniformly
distributed, my job as a tool developer would be a whole lot easier.
But when people come to me, they will first give me some semi-uniform data. And once we give
them some good performance, they will say: oh, by the way, I have spiral scan data that I need
to handle, and you look at it and you see a huge, nonuniformly distributed dataset.
So in general the most effective way that people deal with those kind of data today is by taking
the input data, sort them into some kind of ordered bins, and then also limit the bin size so that
some of the input will overflow into a CPU bin.
And by limiting the data, then you have -- you can actually do a prefix scan to identify the
boundaries of the implicit bins. They don't need to be the same size. But that's why prefix scan
is such an important calculation in data parallel computing. And then so once you define the bin
boundaries then you can load the section of the bins that are important to each thread into the on
chip memory. And, yes, you can see that the range could be dynamic.
So you may have a situation where you may not have enough space to hold all the bins
necessary for these two threads. So that's where you actually will need to fall back to the main
memory. So in general, if you look at some of these applications, they will actually first test if the
input falls in the range of the on chip memory. If not, they go to the DRAM and fetch the ones
that cannot be windowed into the on chip memory.
And this is where you cross from naturally organized -- okay, irregular -- data into
artificially regularized data.
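As a small illustration of the prefix-scan step mentioned above, here is a sketch using Thrust (an assumption about tooling, not something prescribed in the talk): an exclusive scan over per-bin counts yields the start offset of each variable-sized bin in one packed array.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Given per-bin element counts, compute the starting offset of each bin;
    // bin b then occupies [starts[b], starts[b] + counts[b]) in the packed array.
    void bin_starts(const thrust::device_vector<int> &counts,
                    thrust::device_vector<int> &starts) {
        starts.resize(counts.size());
        thrust::exclusive_scan(counts.begin(), counts.end(), starts.begin());
    }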
And this is where a lot of the tricks are. This is where a lot of the mistakes are being made, and
this is where a lot of the tools should help. None of the tools that I know of help with any of this.
So that's the reason why as a tool developer I look at this, I look at what people are doing and I
lower my head and say: Shame on me. I'm not tuned into that process. The seventh one: when things are
dynamic, like graph algorithms -- things where you dynamically identify the
next batch of dynamic data that you need to process. In general, the way to get
performance out of that kind of application today is hierarchical queues, where you use threads
to produce queued data elements at the warp level, which takes advantage of the fact that the
hardware is going to mostly have these processing elements active
at the same time. So if you provide different queues to different processing elements, then the
contention will be minimized, if not totally eliminated.
So you can provide -- let's say for a current NVIDIA GPU, you can provide eight
queues for the eight processing elements. As they do this context switching and so on, different
threads will be accumulating into the same queue, but they would be doing this in turn. However,
you need to be careful, because more and more GPUs will be executing more than one warp at
the same time. So you still need to use atomic operations. Otherwise when you move onto Fermi
you can cause bugs in your code.
Now after each thread block is done, then you can accumulate.
You can actually write all your output into the block-level queue. And after the kernel is done, then
you can take the block-level queue and then write it into the global queue.
Once you get beyond the warp level and get beyond the block level, the contention will be
minimized. So it's all about scalability of your dynamically assembled data: how do you avoid
having a bottleneck when inserting the next batch of data into that structure?
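Here is a minimal CUDA sketch of the idea, collapsing the warp level into a single block-level queue for brevity; the filter condition, capacities and names are hypothetical. Each block accumulates results in shared memory with cheap shared-memory atomics and then makes one bulk reservation in the global queue.

    #define BLOCK_QUEUE_CAP 1024

    // One shared-memory atomic per element, one global atomic per block:
    // contention on the single global counter drops from per-element to per-block.
    __global__ void produce(const int *input, int n, int *gqueue, int *gcount) {
        __shared__ int bqueue[BLOCK_QUEUE_CAP];
        __shared__ int bcount, gbase;
        if (threadIdx.x == 0) bcount = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && input[i] > 0) {                 // hypothetical filter condition
            int pos = atomicAdd(&bcount, 1);
            if (pos < BLOCK_QUEUE_CAP) bqueue[pos] = input[i];
            // (a full version would also handle block-queue overflow)
        }
        __syncthreads();

        if (threadIdx.x == 0)
            gbase = atomicAdd(gcount, min(bcount, BLOCK_QUEUE_CAP));
        __syncthreads();

        for (int k = threadIdx.x; k < min(bcount, BLOCK_QUEUE_CAP); k += blockDim.x)
            gqueue[gbase + k] = bqueue[k];
    }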
So correspondingly, people also have been doing hierarchical, you know, kernel launch for these
kind of applications. In general, let's say if you are doing a breadth-first search algorithm, you will
start with a source and you will kind of grow your frontier.
And you can grow into a large number of nodes. But then at some point you may also shrink at
some point depending on your constraint. But, in general, you'll grow more than you will shrink.
At the beginning, if you're processing only a small number of frontier nodes and you have to
launch a kernel and do a synchronization, you'd lose all your performance. So there are usually
three levels of kernels in the high-performance libraries. One is a small kernel that has only
one thread block and uses the barrier synchronization that's intrinsic in the hardware to do the
synchronization and the frontier propagation, until the frontier grows big enough; then they
go to the second-tier kernel, which I think Jim's students actually have been using.
Essentially you launch only enough thread blocks to at most equal the number of SMs
in your hardware.
And then the language allows you to do a global synchronization across the thread blocks. That
will take care of up to about 10,000 nodes in the algorithm. And eventually
you will have a big enough frontier that you just launch a kernel every time and then you
terminate the kernel for the global synchronization.
But when you have so much data to process, the launch overhead is no longer the biggest issue.
You will see this kind of pattern in a lot of the dynamic graph-oriented applications.
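The host-side dispatch might look roughly like the following sketch, which is purely illustrative: the kernels are empty stubs, and the thresholds and block size are made-up placeholders rather than values from the talk.

    // Three-level kernel hierarchy for a frontier-based algorithm such as BFS.
    __global__ void bfs_single_block(/* graph, frontier, ... */) { /* expansion omitted */ }
    __global__ void bfs_persistent(/* graph, frontier, ... */)   { /* expansion omitted */ }
    __global__ void bfs_one_level(/* graph, frontier, ... */)    { /* expansion omitted */ }

    void bfs_drive(int frontier_size, int num_sms) {
        while (frontier_size > 0) {
            if (frontier_size <= 512) {
                // Small frontier: one block, iterates internally using __syncthreads().
                bfs_single_block<<<1, 512>>>();
            } else if (frontier_size <= 10000) {
                // Medium frontier: one persistent block per SM with a global barrier.
                bfs_persistent<<<num_sms, 512>>>();
            } else {
                // Large frontier: relaunch per level; launch overhead is now negligible.
                bfs_one_level<<<(frontier_size + 511) / 512, 512>>>();
            }
            frontier_size = 0;   // placeholder: read back the new frontier size here
        }
    }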
So the reason why I talk about those three techniques here is really this: These tools need to fit
into the work flow. Those are the things on which most of the manycore programmers
spend their time. They're spending all that time trying to get those kinds of techniques done.
And so the tools should help them form efficient kernels out of these things and provide a clear
performance model. You know, there should be a very clear way for them to say: Oh, this is a
good kernel, this is not a good kernel for my data. It should be a very easy thing to do, right?
And then, you know, the tools need to support software engineering when people apply these techniques.
Portability: if someone crafts their kernel with these techniques, are these kernels portable or not
portable to different target architectures? Testing: how do people test these kernels once
they're done, and how many times do they need to test these things?
Interface integrity assumption verification. A lot of times when people apply these techniques
they actually need to make some assumptions about how the rest of the code is behaving. The
pointers, the sizes of the data structures, the shapes and so on.
And currently there's no way for the system to help them to enforce these assumptions or check
these assumptions at runtime -- to say: what you're assuming here is not really what's
happening in the rest of the code; the legacy code is doing something weird here; you're not
going to be able to do the parallelism that you want.
But it's not there. Right? So that's why we believe that these tools and applications really need
to be co-designed so that the tools are feeding into the places where people are really doing most
of their work.
Okay. So this leads to what we're proposing for the second phase of UPCRC in Illinois. So we
have three big chunks: AvaScholar, Safe Speed, and acrobatics. AvaScholar
[phonetic] is the component where they're developing some very demanding applications. Some
run only on manycores and some run on both manycores and multicores today.
But there will be continued performance requirements out of these applications. And through the
process we are feeding these developers with the safety tools and performance tuning tools. And
essentially we're asking the same question again and again. Why don't people use the tool?
Why are they spending three days here without any tool help. What's going on here? So that we
can really begin to tune into their development process.
And this is a slide that Marc Snir generated, which I really like. It's a little bit
philosophical, but I think it's a very important point. The point is that currently we have the compiler
model, where we generate different implementations of the same code.
And then we have the user model, where the user will generate different
code versions through that whole process of performance tuning.
There's nothing in between. And whenever the user cannot fit exactly into this model, they just
get completely out of this model and do all these things by hand. I really think
that one of the biggest challenges is to actually provide something that can engage the
user somewhere in the middle, so the users will not do these things totally by hand.
And then the compiler will not have to do all the heavy lifting totally by itself. It's easier
said than done; all of us who have experience with that kind of process know it is a very big
challenge.
So I have a few of these applications. For example, here for the gesture interface, Tom Horn's
students already have an implementation on the GPU that is near real time, but they still have
stability issues. So they could use another ten times the performance just to get to real stability for
sensor-free gesture detection.
The same thing for the surface reconstruction work by John Hearse's students: if
you really want to do good-quality reconstruction of the surfaces and so on for
interactive virtual presence, we still have a long way to go in terms of what we can do. Emotion
detection was mentioned some yesterday; detecting emotion now is based mostly on the
volumetric image rather than the real gridded surface reconstruction.
So we still have a long way to go in order to get more reliable detection of emotions.
So we organized these projects into refactoring paths which will tap into the higher-level
programming path that implements parts or all of the seven transformation techniques.
And we currently have only partial implementations of three of the seven. So it's not an easy
task.
But we're definitely making progress. And one of the things we're very excited about in the project
is that we all understand that we're trying to be aggressive in terms of
speed, power and so on. But I think there's an aspect of a safety net that has really been
overlooked by a lot of these tools. That is, whenever people perform this kind of big brain
surgery on some part of the code for performance, they really are making some significant
interfacing assumptions.
So when some of these assumptions are violated at runtime, there is no one
checking these things. It's not necessarily just for parallelization; even for code updates, even for
regular software updates, there are routinely these kinds of problems that incur a lot of
costs later on for software support. We're developing a piece of technology that checks and
isolates these assumptions at runtime and minimizes the debugging and support costs of these
kinds of things. So that's the Safe Speed project. I didn't spend a whole lot of time on it. But
it really provides a good level of scrutiny over the code base for parallelism. So there are some
action items for the sponsors, as usual.
But one of the things we believe would need to happen in the future is not just Intel and Microsoft.
We really need to begin to engage some more of the ISVs.
So this brings me to the final slide here. This is a very well known phenomenon of valley of
death. You have a whole bunch of research that is very nice, and then you have some start-up
company doing some kind of tools and so on.
But then you have very few things that you can actually take from research and successfully
apply to these things. Whenever you try to get some high value kind of research into the
commercial world, there's a valley of death.
So what we're saying here is, yes, the valley of death would be there. Yes, some of our tools will
be there. On the other hand, the only way that some of these tools will make
their way into the commercial world is by giving the developers what they want. And I think we
have a much better understanding, for this particular domain, of what people want.
I didn't talk about the entire domain, and I don't want to spend the rest of the afternoon
introducing these other domains, but I hope that I gave you a good sense of our philosophy, our
research techniques and how we actually -- how we feel that these tools that we're developing
will be different from the tools that people have been developing in the past 30 years.
There's a different level of engagement that we're doing with the developers. So with that, thank
you very much for your attention and I will be happy to answer a couple of questions.
[applause]
>>: So with all these transformations --
>>: Stephon.
>>: Yes, there are two extremes. One is to refactor the code somehow and make explicit those
things that you want to have tested and so forth. And the other one is to go to a level of abstraction
where the developer is expressing less about the data layout and more about the algorithms and
having a system underneath to transform them. In your experience, which of these has
been more successful, and what's your opinion?
>> Wen-Mei Hwu: Okay, intuitively you would expect the first one to be more successful and the
second one to be less successful. In practice the first one is more successful and the second one
less successful. The reason is that, because of the sorry state of what we're providing, people don't
trust it.
So they'd rather do all the things themselves, and at least they know what they're testing. So that's
the reason. But to me you are actually pointing to a very important point. Unless we start to tip
that balance, the productivity is always going to be a problem. So that's actually the key of the
problem.
When can we hit that tipping point -- the balancing point -- where they start to trust some of these tools, so
that they would start to use the testing methodology that we provide as part of the tool rather than
doing the explicit tests themselves? And that's going to be the real thing.
>>: Thank you.
>> Wen-Mei Hwu: Any other questions? Yes, please.
>>: What about the architecture implications of [indiscernible]?
>> Wen-Mei Hwu: So a couple of things that we did learn. One is we can complain about this
hardware all day long. Every time I teach a course, and every time I see some
developers, the first thing they say is: oh, you know, why do I need to do coalescing? It's such a
pain, right? I need to do code tuning and all that stuff.
And so one thing that I think hardware people need to really decide is: you know what, you really
need to communicate effectively to people what things you will not be able to do
in the future. A lot of these people are still hoping that you will be able to reverse that trend.
Maybe you will; if so, come out and say clearly that you will. But if you can't, come out and
say that this is something the tools and the programmers will really have to take care of.
Because indecision in that communication process is very harmful.
The second one is: you know what, I actually learned that for any successful manycore
processor, the part that the CPU does is incredibly important. So if you look at all the
regularization processes, most of the techniques can only be done on the CPUs today.
So there is a very important part that I think -- you could put CPU and GPU into competition mode
and say whatever block processing you will do I will do. So there's a lot of low-hanging fruit
where a lot of the regularization processes are not necessarily supported by the CPUs in the
most efficient way. That may be one of the things that the Intel architects should take a look at.
Any other questions? If not -- oh, yes, Jim.
>>: So where does auto tuning fit into the framework?
>> Wen-Mei Hwu: Yes, where does auto tuning fit into the framework? So auto tuning to me is
kind of a parameter-setting kind of thing. In general, if you look at all these seven
techniques, there are lots and lots of parameters, but each technique will end up in the code. So
that's why I didn't list auto tuning as part of the seven. To me, it's a final
implementation phase where you finalize the parameters for those techniques.
>>: If we may -- [indiscernible].
>> Wen-Mei Hwu: You mean the algorithm book.
>>: Yes.
>> Wen-Mei Hwu: As soon as I go I'll start writing. [laughter].
>>: [indiscernible].
>> Wen-Mei Hwu: In fact -- so Juan is referring to the GPU Computing Gems books. Volume 1 is
coming out in December. We have already finalized all the contents, and it is in production. It does
take about four months to come out. There will be 50 articles in 10 application areas, and the
second one will come out in June; that's going to be the next seven application areas plus the
frameworks and tools. So there will be a total of 110 articles. And that's where I extracted some
of these observations as well.
>> Juan Vargas: Thank you very much.
>> Wen-Mei Hwu: Thank you.
[applause]