>> John Nordlinger: Hello. My name's John Nordlinger. Thank you all very much for coming, and thanks to the folks online that are watching. I'm very pleased to present Eric Preisz from Full Sail University. Eric's been doing a lot of innovative work around optimization, in particular for games. And we're sure to benefit from his knowledge. So thank you again, and thank you, Eric. >> Eric Preisz: Thanks, John. First off, it's a real pleasure to be here today and present to you some of my ideas and things that we've learned about optimizing video games. I'm from Full Sail University, which is in Orlando, Florida. Optimization for me is a topic I'm really, really excited about. I've always been very passionate about it -- except for maybe in the beginning. One of the things that drove me to learn optimization is I worked for a company where we were a bunch of guys that got things working. We could make graphics work, but not necessarily using the API correctly all the time. But we could get things to work. And we had a gentleman who worked with us who was from the game industry, and he really taught us a lot about the ins and outs of using things correctly, really using our resources better, and it made us all a lot better in our performance because of that. Sometimes in the meetings, though, [inaudible] the tasks that we were going to work on next, and if it shifted towards him, sometimes it seemed like he would use the optimization thing as a way to get out of work -- kind of go, oh, well, you can't do that, that would be totally unoptimal for us to do, I think you should do this. And so I decided to go to some conferences and try to arm myself with some knowledge on optimization so I could, you know, stand a chance in those meetings of not getting stuck with all of the work all the time. So I told my students when they came in: if you're not really, really excited about optimization like I am, pick someone in the room that maybe you don't like and learn it in spite of them. All right. So that's what got me into it. And then after I got into it, I found that I was really, really into the ins and outs of optimizing games. So here are my goals today. We're going to talk about optimizing efficiently. It's not just good enough to say that you're optimizing something. You need to do it efficiently. It can be a very long process when you spend your time optimizing the wrong things. So we're going to spend a lot of time focusing on how do we know what exactly to optimize and where should we spend our time for maximum ROI. That's what's really important here: the ROI of your optimizations. Now, if you're listening to an optimization talk, especially a game optimization talk, I think you'd really expect to see tons of little line-of-code optimizations. I want to know -- let's get rid of all of our divides, let's get rid of all those little things that get in our way, I want to find the tightest loop of the loop of the loop of the loop with the most math and let's make that fast. We're actually not going to focus on that topic today. First off, that's actually not the thing that's helped optimize games the most. It isn't always going down and finding those little things; it's actually some of the higher-level things. So we're going to talk a lot about some of those higher-level things. Plus, you know, because of my experience personally, those aren't even the areas that I'm the best at.
And I haven't had a chance to actually get as good at those areas. Yesterday I had the pleasure, while I was up here, of getting to meet Mike Abrash. I stopped by his office and, you know, there are a lot of guys like that who have a lot more experience than I do at those little itty-bitty things, like Mike. So that's not even the area where I have the most experience, but I've also not needed it as much as things have changed in the last 10, 15 years, let's say. And then lastly, feedback. I'm really, really interested in feedback. So if, while I'm going through, you want to stop me at any time and ask a question, please do. I'm more than open to that. Really, really interested in your feedback. All right. Here's your mental map. So where am I going to take you along in this journey? We're going to start off with talking about my motivation. What are some of the things that make optimizing for games, or optimizing in general, difficult? Those are some of the things that motivate me: the difficulty of optimizing games and some of the things that you may not expect. Then we're going to talk a little bit about trends. All right. If we know where we are today in this process, I kind of want to be prepared for the future. We've seen some things changing, and so whatever this process is that we're going to discuss, we're going to have to also talk about the trends. We'll do some classification. We're going to talk about different types of optimization; we're going to categorize them. Doing so helps us come up with a strategy for how we're going to start from 300,000 lines of code and say where do we optimize. All right. So these classes that we're going to cover are going to help us do that, because their attributes lend themselves to telling us about this process. And then hopefully once we've got that base set, we can actually step through the process of what is holistic video game optimization, which as you can tell from the name is about looking at the forest, not at the trees. So let's talk about some of my motivation. I pulled this off of the DirectX [inaudible] mailing list. This was, I think, maybe two or three weeks ago. It says here: When I switch to full screen it blistered along in the hundreds as you'd expect. So what's up with my windowed mode? I've experimented with a few flags like WS_POPUP, et cetera, and none of those had any effect. The window wasn't clipped, which I know can cause problems, so I'm all out of ideas. I'm all out of ideas. This gentleman has a dictionary in his mind of optimizations that he can step through, and this case that he's come across doesn't fall within that dictionary of his known tips. So he's coming to this mailing list saying, please give me some more tips. And I've bolded his next phrase here: Clearly something's not right. And I agree with him. Clearly something's not right. And I think the part that may be the most difficult about what he's stating as a problem is he doesn't know where to go next. And I think that's clearly not right. So we're going to talk about where he could go next, as opposed to just waiting for some more optimization tips, which, by the way, I liken to stock tips. All right. Optimization tips and stock tips are probably very similar. I think it was Warren Buffett who said -- was it a million dollars and a year's worth of stock tips and you'll be broke?
You know, there are a lot of optimization tips out there that people have given us that are maybe 90 percent true -- they're true 90 percent of the time and we've accepted them as being always true, as the way to do things, as a best practice without the context. There are things that were true that are no longer true. And they still float around. And there are things that we teach that we still believe to be true that may not necessarily be true as well. I mean, you know, it only takes a year in this technology, with the changes we've had, especially in concurrency, to turn things upside down. So all right. Here's another motivation of mine. This is another part that I think makes optimizing difficult: where. Where in your code should you optimize? So I ran some statistics here. I did some code counting, and this is from the Torque Engine -- this is the demo that you get when you get the Torque Engine. They have 18 percent comments, they have 111,000 comments, some white space of course. But let's take a look at that number there, the code: 313,606 lines of code. Now, with 313,606 lines of code, the question is where do we start, where do we optimize. If you don't know the engine, that's even more difficult. So I put some numbers here. I have 313,606 times .2. That's the Pareto principle: 80 percent of the effects come from 20 percent of the causes. So maybe Pareto would suggest that we have 62,721 lines that we need to worry about, because that's where the effects come from. But Pareto was an economist, so maybe we shouldn't take that number at face value. We understand the concept of what he's trying to say, but maybe we can't translate it directly to lines of code. The other number I have here is .03. That one comes from Donald Knuth. Knuth says we should forget about small efficiencies, say, about 97 percent of the time: premature optimization is the root of all evil. So if we don't have to worry about small efficiencies 97 percent of the time -- and, by the way, this number is for small efficiencies -- that means that maybe 3 percent is what we do have to worry about. So as far as small efficiencies go, maybe for the Torque Engine here we need to be concerned with 9,408 lines of code. And, you know, that number may or may not be on target, but we still have this difficulty of which 9,408 lines they are. But that can give us maybe an idea of some scope. Well, if that doesn't make it difficult -- the fact that engines are very large -- another part that makes it difficult is we don't have source code for many of the things that we work with these days. It used to be that if you wanted to use it in your game, you wrote it and you optimized it. But now we rely on so many third-party APIs and so much middleware. So on the left-hand side of this slide you can see maybe something that was more the way we did games back in a simpler time, and now we have all these other pieces that come into it. There's a large percentage of our game that's going to run in the graphics API. There's a large percentage of our game that runs in the drivers. Maybe you're using some middleware -- physics APIs, things like that -- or fixed-function graphics; we've moved work from the CPU to the fixed-function part of our graphics hardware. All of those areas, we don't have source code for. So if you want to worry about those micro-optimizations, those little, small things, well, that works for the code that we have source for.
But if we don't have source for it, you're going to be stuck with using some other methods and really making sure that you understand not just how to use their API from a syntax point of view but how to use their API according to the assumptions that they make about how you'll use it. All right. Another difficulty for optimizing games. This one's a little bit more specific to optimizing for PCs. That's just kind of the area that we focus our class on; we talk about how to optimize for PCs. One of the difficulties is that some of the optimizations we use are very sensitive to the machines that we're on, and we have a lot of configurations. So I took this -- these are GPUs from the last 12 months, through the month of November, from Futuremark, from their Web site, where a certain demographic of gamer is going to go and determine what kind of numbers their hardware is going to give them; they also gather some information about their platform and their configuration. So if you look here, you'll see that other than the other category, which really scares me, they've got the GeForce 8800, which seemed to be a pretty large share -- but clearly there are a lot of different pieces of hardware out there that may have different specs. Well, we know they have different specs. The real scary one here is: how big is that other column? It's quite large. And who knows how many different versions of cards are coming from that other category. How do we optimize for these? And, you know, some of the best optimizations exploit specific pieces of hardware. Well, what do you do when there are thousands of configurations? So there are the GPU numbers, and here are the CPU numbers as well. And as you can see, they're just as staggering in the other column. So it's difficult to optimize on a console, and this is an extra challenge that you get when optimizing for the PC. So this is another part of my motivation that drives me to get better at what I do. So let's talk about some of the trends. What are some of the things that we see in the PC world as far as hardware and how it's evolving? The first is concurrency, obviously. This has probably been a topic of everyone's talk in this area for the last, who knows, three years, and probably for the next -- who knows. Concurrency is one of the ways that we solve problems in latency, which as we know isn't necessarily getting better, especially for things like IO or memory latency. So concurrency can make optimization really difficult, because you can be successful in your optimization and see no overall frame rate increase. All right. Now, typically you will see a small one, but to illustrate this situation, let's say that I take a deck of cards and I split it between two people, and I ask them to sort it red-black and give it back to me. Say that it takes person A a minute and 30 seconds to sort those cards, and it takes person B a minute. Well, let's say that I'm Joe Optimizer and I come along here and I want to optimize the problem, and I'm pretty sure -- because I watched this and I know a lot about person B -- I'm pretty sure that person B is who I need to optimize. So let's say I do that. I go and I tell person B how to get better at his process, and I get him down to 30 seconds. Well, so person A takes a minute 30, person B takes 30 seconds.
The overall process is still a minute and 30 seconds. We're bounded in performance, for this setup, by the slowest person. It's not really a combination of the two when they're perfectly parallel. All right. So did I optimize something? Yeah, I did. Person B, I successfully optimized him from one minute to 30 seconds, but my overall increase is nothing. So this can happen in our world as game developers too, because you have the CPU and the GPU, which we try to keep separate as much as possible, and we keep them running concurrently; and then within the GPU you have many parallel stages that run as well, [inaudible] many GPU kernels that are all running in parallel; and then also on the CPU side we have multiple CPUs. So this scenario here -- I have it playing out with two people -- plays out many times over across our machines. So that's one of the trends that we're going to see continue, this idea of more and more concurrency. All right. This gap is widening. The gap I'm referring to is our ability to do calculations versus memory IO. All right. This gap is widening. We're getting the ability to perform calculations faster and faster and faster -- we're seeing huge strides in that over time -- whereas with the latency of memory, we don't necessarily solve it so much as we spend our time hiding it, and the advances that we've made in memory latency are not on par with what we're able to do in calculations. So, again, going back to the idea of finding the loop of the loop of the loop -- the nested loops with the most calculations inside may not be the most important thing. You know, if we miss in the L2 cache -- have an L2 cache miss, and say that we don't have an L3, so we go straight to system memory -- you're giving up on the order of hundreds of clock ticks. Hundreds of clock ticks. And we're going to worry about a 23-clock-tick divide when we're not always focused on the hundreds of clock ticks that we lose from system-level cache misses? I put a quote on here. Herb Sutter has a video online that's really, really good, and he mentions Rico Mariani here at Microsoft. Herb Sutter stated that Rico thinks that a processor today is like a sports card -- sports car in Manhattan. Sorry, there's a typo there that says sports card, from my last slide of cards, I guess. Sports car in Manhattan. So picture a Ferrari in Manhattan. Their ability to do calculations is the Ferrari part; our cache misses are the stoplights. Right? Sorry. >>: [inaudible] that analogy could be expanded, though, because if somebody has a Ferrari in Manhattan [inaudible] much different goal. >> Eric Preisz: Sure. Sure. Their goal may not be to get from one side of the street to the other as fast as they can. Right. I'm sure, yeah, this analogy breaks down pretty quickly when you look at it from that perspective. But I do like it. It works really well with my students to give them this idea of, hey, listen, do you want to optimize the Ferrari or do you want to remove the stoplights? All right. So when it comes to looking at this trend, the challenges of memory IO latency aren't going away. They're physically bound. We're doing things to address them; you know, multicore helps in solving this.
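To put that stoplight idea in code terms, here's a minimal sketch -- the same additions over the same data, with only the access pattern changed (the array and stride here are illustrative):

```cpp
#include <vector>

// Same arithmetic, different memory access pattern. The sequential walk
// streams through cache lines one after another; the strided walk touches
// a new cache line on almost every access, so for a large array it stalls
// on memory (the stoplights), not on the additions (the Ferrari).
float sumSequential(const std::vector<float>& data) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i)
        sum += data[i];                  // neighbor is usually already in cache
    return sum;
}

float sumStrided(const std::vector<float>& data, std::size_t stride) {
    float sum = 0.0f;                    // e.g. stride = 16 floats = 64 bytes,
    for (std::size_t start = 0; start < stride; ++start)   // one cache line
        for (std::size_t i = start; i < data.size(); i += stride)
            sum += data[i];              // likely a fresh cache line each time
    return sum;
}
```

Both functions return the same total; the difference only shows up when you time them on a buffer much bigger than your L2 cache.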
Multicore helps, but, boy, you'd better be using those cores correctly and really understand your memory architecture, because multicore can also make it worse -- can make memory IO latencies worse -- especially if the cores are all polluting each other's L2 cache. So if you use it right, that's great. And we're making some strides in that way. But this is very much a difficult trend, and it sometimes needs to focus our optimization on the idea of optimizing for memory IO. All right. Here's another trend. I see this one as gaps that are narrowing: the bridge between the CPU and the GPU. I think the GPU has given us some different ideas on how to do calculations, and you're seeing some things starting to come from the GPU world moving towards the CPU world, or maybe moving from the CPU world towards the GPU world. Examples like Larrabee, I think, are a pretty good illustration of how having a combination of both may be the right approach. And I put up some of the things that make these processors different. And I also have these two words up here. I have on the left-hand side trucks and trains. This is the analogy I use in my class, and I'm sure we could find ways to break this one down too, but this is the way I think of which processor is right, who has the right idea about how we're doing calculations. I liken it to having trucks and having trains. We still have trucks and we still have trains in this world, and trucks are really good. They're agile; they can go down streets, turn left and turn right. Maybe it doesn't take as long to get them going; you can just kind of hit the gas pedal. And trains, right, there's a lot of setup time -- we have to get things lined up correctly, we've got to get everything into these big batches -- but once a train gets going, man, boy, it's getting a lot done really fast. So the question isn't who's right; the question is what's your algorithm. Is your algorithm more suitable for a train; is your algorithm more suitable for a truck? So we're seeing this gap kind of narrow, and you're seeing things like Larrabee, which is maybe closer to a CPU than a typical GPU but also has texture hardware for doing texture fetching and texture decompression. All right. I want to talk about some different classes of optimization. And these are really key to the concept of holistic video game optimization, these classes. So you're going to see them over and over, and I've color-coded them as much as I can through this presentation, so you'll see the system level in blue, the application level in red and micro in green. Now, these levels -- actually, the first place I've seen them would be one page in the [inaudible] where they talk about these different levels: system, application and micro. And it's a pretty short description, but in working with VTune a lot, I've really seen this trend across other tools. I've seen this trend in just how we look for optimization problems and then how we solve them. So, system level: the goal of system-level optimizations is balancing. It's a very high-level system look. You know, for example, how many cores do we have and how many of them are we using? So simple things like utilization, things like idle counts -- you know, how idle is this processor -- are really important from the system perspective. And I put down here that it's machine dependent. And this is required of us today.
This kind of dependency of knowing about our system -- chip makers are telling us it's our responsibility now. It may not have been in the past; we didn't necessarily need to know, you know, what's the number of cores. We had one core. But now it's up to you. We used to be able to increase the performance of cores with instruction-level parallelism, which was, you know, very evident to the compiler developer, but as application developers we didn't really have to know a whole lot about instruction-level parallelism to continue to see performance increase. But now, with the move to something like multicore, it is required. On the GPU side as well, you know, you could have multiple GPUs inside one system, and that's something that we -- or the vendors -- are telling developers: this is your responsibility now, you need to go figure out what you have and make the best use of it. The application level is much more focused on groups of code. So I liken it to algorithms, to class size, maybe even multiple-function size. I put here the fastest triangle. What is the fastest triangle you can draw? The fastest triangle you can draw is the triangle you don't draw. What's the fastest memory IO? It's the memory IO that you don't do; it's the way that you get around doing it at all. That's kind of a central theme of application-level optimizations: they focus on what can I not do, as opposed to taking what I am doing and making it faster. Machine independent. This is one really nice thing about application-level optimizations: if you can successfully do less work, that's something that translates well between all of those different processor configurations. You don't necessarily need to know the latency and throughput of the divide and the throughput of the multiply to take advantage of not doing them at all. That's something that translates really well from machine to machine. Lastly, micro-optimizations. And these are the parts where, like I said, I'm still trying to catch up with those greats who are really, really good at micro-level optimizations. I find myself watching more and more compiler presentations and, like I said, people like Herb Sutter who really, really understand the ins and outs of our microprocessors today, the instruction-level parallelism, the concurrency issues. So I find myself looking at them to get better at this stage -- maybe the assembly optimizer, someone who can really do that. But, again, you'd better understand the processor as well as those compiler developers do if you think that you can come in and make the code faster. There are all sorts of things at the assembly level where, if you look at the code that's generated by our compiler without knowing the architecture underneath, you may think they're crazy -- why are they doing this stuff? A really good example is how we use write-combined buffers. There was a talk at GDC -- oh, shoot, I forget exactly what year it was -- a GDC talk from Intel where they talked about using write-combined buffers. And their suggestion was, well, listen, if you touch every single part of that write-combined buffer memory, that can be faster, because then you're going to get that full 64-byte burst going across to system memory. It's going to batch it together.
If you don't touch every value -- like, let's say that you're stepping through a locked vertex buffer and only touching the X, Y, Z -- the write-combined buffer is going to send its updates in 8-byte increments. So there's a hidden opportunity for a performance increase from batching if you touch every value. Now, if you look at, say, the C++ code or the assembly code for that, you'd say, why is this person touching every single value that they're not changing? All right. And it may look slower. So these micro-optimizations can really be tricky and they can be very deceiving, and just thinking about how lean your assembly looks isn't necessarily the same as understanding how fast your assembly's running. All right. It's very much about understanding the instruction-level parallelism and taking advantage of it. These are also machine dependent, because different instructions on different processors have very different latencies and throughputs. So this is also an area that's difficult, because just when you've found something that works really well on this processor, you may actually hit a performance cliff on a different processor, and what you've now checked into your code -- I'm sure we've all had experiences where you've checked something in and it works for you and it breaks everyone else's build, you know. Oh, works on my machine. Right? We've all been there. But you can have the same kind of thing occur with these micro-optimizations, where it's like, it goes faster on my machine; how come you didn't see it, how come it wasn't faster on yours, or, who knows, maybe even slower? I put down here at the bottom a little quote that I think is kind of interesting when looking at these levels. I was looking into the Toyota Production System, TPS, which is a set of theories on how they optimize, for lack of a better word, operations in their plants. The graphics card is very much like an assembly line, and I think there are some interesting parallels we can draw between trying to optimize such a large system like that and our smaller but very complex system. So originally they were looking to optimize what they call muda, which is waste -- which I liken to maybe micro-optimizations or even application optimizations, those things. They wanted to optimize at the small level -- the individual and the group size -- first, because they didn't want to get everyone in this huge organization involved in their optimization. They wanted to see change quickly, so they kind of focused on the small things first, which is very similar to what a lot of us have been focusing on for optimizing games, these little pieces. And I thought, wow, that's interesting that they want to start from the little and make their way to the big when, you know, we've come up with these ideas of starting from the big and going down to the small. But then I found a letter that Jim Womack wrote. He is the gentleman from MIT who brought some of Toyota's philosophies over here, and we now call that lean manufacturing. And in this letter he wrote: the inevitable result is that mura creates muri that undercuts previous efforts to eliminate muda. So he was suggesting, you know what? Maybe we should start looking from the top down, because we're undercutting our efforts to remove them by not looking at the big things, all right, which I think agrees with the ordering of these levels.
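Going back to that write-combined buffer example for a second, the touch-every-value pattern looks roughly like this (the vertex layout and names are illustrative, not the Intel talk's actual code):

```cpp
#include <cstddef>

// Updating positions in a locked (write-combined) vertex buffer. Writing
// only the changed member leaves gaps in the write stream, so the write
// combiner flushes in small partial bursts; rewriting every member, even
// the unchanged ones, keeps the stream dense enough for full 64-byte bursts.
struct Vertex { float x, y, z, nx, ny, nz, u, v; };

void updatePositions(Vertex* locked, const Vertex* src,
                     std::size_t count, float dy) {
    for (std::size_t i = 0; i < count; ++i) {
        locked[i].x  = src[i].x;
        locked[i].y  = src[i].y + dy;    // the one value we actually changed...
        locked[i].z  = src[i].z;
        locked[i].nx = src[i].nx;        // ...and the untouched values, written
        locked[i].ny = src[i].ny;        // anyway purely to keep the
        locked[i].nz = src[i].nz;        // write-combined stream contiguous
        locked[i].u  = src[i].u;
        locked[i].v  = src[i].v;
    }
}
```

Read cold, the extra writes look like pure waste; measured, they can win -- which is exactly why these micro-optimizations deserve suspicion.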
So I'll give you some quick examples of the ideas of system-level optimizations -- or, sorry, of all of these levels of optimization. I put some percentages up here. We're using a hundred percent of our CPU. We're only using 20 percent of our GPU. I've seen that before. It's not really that uncommon for us to use a smaller percentage of our GPU, especially considering the CPU drives our GPU with instructions and commands; you know, it's not uncommon for us to be wasting that. I have on here Proebsting's law. Now, he's one of your folks, and he has this sort of tongue-in-cheek law on the Internet basically saying that all of our efforts in compiler development, if you look at them over the last 36 years, contribute about 4 percent of performance increase on a yearly basis. And I'll let you go and take a look at how he derives those numbers. And, again, it is a bit tongue in cheek, I realize. But I think he has a good point here. He states that on the hardware side we'd be seeing roughly a 60 percent performance increase per year coming from hardware over those same 36 years. So the system level says, listen, if this hardware is so important as far as performance goes, then the first thing we need to do is find out why we aren't using that other 80 percent of the GPU, right? If we're seeing performance increase at such a level from the hardware, then why aren't we focusing on that? So I have some examples. Hardware instancing is an opportunity for us to move work from the CPU over to the GPU, so that's an idea of balancing. If you are overutilized on your CPU and underutilized on your GPU, an example of balancing would say let's take the work from the CPU and move it over to the GPU. Same with GPGPU, right, general-purpose processing on the graphics card, and compute shaders, which I was just taking a look at the documentation on [inaudible] when that came out. So again it's another opportunity for us to -- I'm kind of focusing here on moving work from the CPU to the GPU, but it's still relevant that you may have to take work from the GPU and move it to the CPU as well, although I'd have to say currently you don't see that as often being the case. And so I have some characters up here, skinned characters. If you're doing your skinning work on the CPU, you can move that from the CPU onto the GPU. Another way that GPUs are becoming more like CPUs is the added flexibility in Shader Models 4.0 and 5.0, which are much, much more flexible for us to start doing this kind of processing on that side. All right. Application-level optimizations. You know, a big part of what makes a game engine is all these algorithms designed to reduce the amount of work or remove it entirely. So I have an example here of a ray versus a piece of terrain. If we wanted to, we could spend our time optimizing that ray versus each little collision point -- for each of those collision points, make your way through this whole mesh, and let's get rid of that square root, or let's make it as fast as we can. We could focus on it that way. But a quad tree looks at it and says, whoa, on our first pass through this quad hierarchy, we could cut out three-fourths of the work. Right? So really that's the main concept of application-level optimizations.
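Here's a bare-bones sketch of that quad tree early-out -- the node layout is made up for illustration, and the helper tests are declared but not shown:

```cpp
// Application-level optimization: reject whole quadrants of the terrain at
// once instead of making the per-triangle ray test itself any faster.
struct Aabb { float mn[3], mx[3]; };
struct Ray  { float origin[3], dir[3]; };

bool rayHitsBox(const Ray& r, const Aabb& box);        // slab test, elsewhere

struct QuadNode {
    Aabb      bounds;
    QuadNode* child[4];                                // all null at a leaf
    bool rayHitsTriangles(const Ray& r) const;         // expensive per-tri test
};

bool rayVsTerrain(const QuadNode* node, const Ray& ray) {
    if (node == nullptr || !rayHitsBox(ray, node->bounds))
        return false;                  // a whole quadrant culled in one test
    if (node->child[0] == nullptr)     // leaf: only now pay for triangle math
        return node->rayHitsTriangles(ray);
    for (int i = 0; i < 4; ++i)
        if (rayVsTerrain(node->child[i], ray))
            return true;
    return false;
}
```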
I have quad trees listed here, and early Z cull, which I actually think of as an application-level optimization on the GPU, right -- the idea there being let's remove the pixel work by using the Z buffer, initializing the Z buffer to say that that pixel is never going to be visible. So why even spend the time making that shader go faster when we can skip the work entirely? And for us, [inaudible] is another example of let's not optimize the work, let's get rid of it. All right. So, micro-optimizations. And this is the part I think everybody would be excited about that I'm going to let you down on. I have three different examples of this loop: stepping through it plainly in loop A, unrolling it in loop B, and one that I saw in a piece of text called unroll enhanced. Now, the idea of unroll enhanced, if you look at it, is we've broken the dependency on the sum variable as we move across; each of those lines is now independent of the others. And from the concept of instruction-level parallelism, it seems like that should be faster, because you've removed that chain of dependency until the very last line, when you're actually summing these things together, and with instruction-level parallelism the idea is, wow, we can do each of those lines at the exact same time. Sounds like a big win. And these are all true, but the problem with the way I have them presented here is we're thinking about them at the C++ level. Right? If you're writing C++ code like this and you're going to compile it into assembly, well, your compiler has a big role in what that loop is going to look like. And we found that loop A and loop B are the same from the assembly point of view. So from my perspective, I would rather just keep the loop looking like loop A, because the assembly that looks like loop B here may be inferior in the future, and if I keep my code looking like A, then my compiler will adjust and fix those problems for me. So as far as going into the future, I like to keep my pattern simple. Now, that last one there -- and, again, who knows exactly the details. I looked at the assembly, and I didn't quite understand why, but when I got these into VTune and did a performance test on them, I actually found loop C to be slower. And, again, I understand from the theory why this should be faster, but when we measured it, it turned out not to be. So, again, keep your eye on those micro-optimizations. It may have been that this was faster on the machine that they tested it on, and maybe the compiler optimizations were more focused towards that machine, and on the machine that I used to look at these numbers it was slightly different, maybe causing a performance cliff. Let's talk about these levels and the project lifecycle. One of my favorite interview questions is to ask somebody, when do you optimize a game? A lot of times people will go back to, well, premature optimization is the root of all evil. I love that phrase. That is powerful. Premature optimization is the root of all evil. With all the things going on out there, I wish we could just stop optimizing our code too early. But of course you get more context when you look at that whole phrase, right? But when do you optimize a game is a difficult question to ask, and right now I'd say that our approach is usually somewhere similar to how some people treat sound: oh, we'll just do it later, right? We have this idea that we can't optimize code until we're done with it. And I would disagree with that.
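For reference, the three loop variants look roughly like this -- a reconstruction of the idea, not the slide's exact code, with n assumed divisible by four to keep it short:

```cpp
// Loop A: the plain version. Modern compilers will usually unroll this on
// their own, which is why A and B measured identically in assembly.
float loopA(const float* v, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

// Loop B: manually unrolled, but every line still depends on 'sum'.
float loopB(const float* v, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i += 4) {
        sum += v[i];
        sum += v[i + 1];
        sum += v[i + 2];
        sum += v[i + 3];
    }
    return sum;
}

// Loop C: "unroll enhanced" -- four independent partial sums break the
// dependency chain, which should help instruction-level parallelism in
// theory; measured in VTune, it actually came out slower on our machine.
float loopC(const float* v, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Now, back to the question of when you optimize.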
I think that we optimize through the whole project lifecycle, and I have here a timeline from implementation start date through to done -- but we're going to do it in a bit of a different way, right? Because early on, that's when you want to do your system-level optimizations. Why? Because it's about the only time that you can. It would be great to do a system-level optimization a week before a milestone if it didn't cause bugs. You know, you can't just go, oh, our game's going to ship in a month, let's work in some multithreading, right? Let's just do that. Right? The only time that you can do that is up front. So that's when you have to really look at it from the system level: how are we going to use all the resources we have, the algorithms that we're using -- you know, maybe you can take some data to see how well the algorithms we're using balance the work across all of the hardware that we could be on, for all of those different configurations that we have to worry about. Now, once you have a little bit of a design, then you can start working on application optimizations, right, the algorithms, and that should extend for a really long time; you should be continuing to do the application-level optimizations. And once we get towards the end, from a stability perspective, you may want to start getting really choosy about how you're going to optimize, because you don't want to introduce radical change near your milestone, obviously. And so because of that, micro-optimizations can fit well at the end. But, again, you do want to be careful, because, you never know -- if it's faster on your machine, you're going to want to test it across many, many machines before you feel confident that that micro-optimization is better for all machines and not just the one you tested on. Another piece you may want to include on this, I was thinking as I was looking at it the other day: you could even extend one of these circles outside of implementation into, say, design. So maybe we should be focusing a bit on educating our game designers on the effects that their design has on our performance, giving them the tools as game designers -- and, you know, the good game designers do have the experience; over time they're going to get this anyway. But from a design point of view, maybe we really need to focus on teaching them how to make designs that take the best advantage of our trends at the system level, the application level and the micro level. All right. Again, a little more with these levels here. We've got system, application, and micro. I've listed some of the tools here -- sorry, the text is a little small. But I've listed some of the tools. You know, you're never going to get away with using just one tool to optimize a game right now, because vendors are making different tools for the different parts. And the tools usually break down into some smaller utilities within each tool. And I've tried to match up the utility with the tool, so [inaudible] at the system level, you can use the performance dashboard, which gives you a really good system-level overview of how your GPU is running and how all the individual kernels are running. And I won't go through all of these here; I'll just let you look at them in your own time. But different tools for different levels, and that's going to play a role in our process as well. All right.
So now we're on to holistic video game optimization. We spent about 40 minutes going through all of that just to set us up for this part right here. Let's talk a little bit about the optimization cycle. We're going to measure, we're going to analyze. In fact, you're probably going to have to do that a couple of rounds before you can even figure out what the problem is. So you'll measure, you'll analyze, which may lead us to measure some different things and then analyze again. At the very minimum, I think you'll have at least two rounds of this, and usually probably closer to three or four. So you measure and analyze, measure and analyze; you're going through, you're gathering data. We're trying to figure out what bottleneck and hotspot is going to get us our best ROI for performance increase. That's what we're looking for. And we'll never know that with a hundred percent certainty. Right? We're kind of like detectives here. We're gathering clues, we're zeroing in on what we think is going to give us the best performance increase. Once we believe we've found it, you implement a solution, and if you see a frame rate increase, then you start all over again at the beginning. Right? You know, code executes kind of like water flowing down a brook. If you have a big rock in the middle and you remove that rock, the water is going to change its course. So we have to start way back at the beginning now and go back through this whole process of measuring and analyzing, because we've really changed the way that code is flowing through our application. All right. Now, on to the process. I've separated these into the three levels: system at the top, application in the middle and then micro at the very bottom. Now, this is the detection process, right? We're detecting what our bottleneck and our hotspot are, and we're going to start from the system level when we do this detection. We're going to start from the system level and make our way down. You know, usually you're going to end up with a micro area for solving a problem, but that doesn't necessarily mean you have to solve it with a micro solution. Right? Once you get to the level of knowing what your problem is, you're going to have a whole list of things you can go through at different levels, different types of solutions. So let's start very much at the top here. Obviously we're going to start with a benchmark that's a reliable test. You know, you're going to set up in maybe a certain part of your game and you're going to stay there, so that if we optimize and see the frame rate increase, we know the next time you run it the frame rate increase isn't because your laptop -- you know, the fan just turned on on your GPU and you're getting more power now, or some kind of power issue. You want to have a reliable benchmark. There are lots of things that make really good benchmarks or really bad benchmarks. The problem with benchmarking games is that games kind of violate all of those traits that make really good benchmarks -- you know, repeatable: well, AI can really cause some problems there. All of those characteristics of benchmarks are very difficult. Okay. The very first thing that you want to do when you're going to optimize a game with this system is figure out how busy your GPU is and know: are we GPU bound or are we not?
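Boiled down to code, that first branch of the process looks roughly like this -- and I'll explain where the thresholds come from in a second, so treat them as starting points, not gospel:

```cpp
// First decision in the detection tree: read GPU utilization from a tool
// (PerfHUD, the performance dashboard, etc.) and pick a direction.
enum class Bound { Cpu, Gpu, NearEquilibrium };

Bound classify(double gpuBusy) {               // gpuBusy in [0.0, 1.0]
    if (gpuBusy < 0.80) return Bound::Cpu;     // GPU has headroom: CPU bound
    if (gpuBusy > 0.90) return Bound::Gpu;     // GPU saturated: GPU bound
    return Bound::NearEquilibrium;             // ~85%: either side may pay off
}
```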
Now, I have two numbers up here -- less than 80 percent and greater than 90 percent -- for the two directions that we're going to go. All right. Now, let me tell you about these numbers. I wouldn't call them authoritative at this time. We've seen this kind of trend, and we're doing some more studying to try and figure out how these numbers work exactly. We've gotten some anecdotal support from folks at NVIDIA when we've shown them this method. And so we have that, but I encourage you to look at these numbers in your own environment and help us determine exactly what these numbers mean. So let me explain. What we do is we take a very GPU-bound application -- a hundred percent GPU; you just fill it with [inaudible] work for your GPU and leave it very CPU-light. All right. We do that, we look at the GPU number, and you can confirm from the tools that you are 100 percent GPU bound -- or at least 99.9, very close. What we'll do is put some work on the CPU side that we can toggle so that we can continue to increase it, increase the work. So you started out a hundred percent GPU bound, and we toggle that work and increase the CPU work. The frame rate won't change right away, right, because you're kind of filling into that level where you're GPU bound, so you have a lot of CPU headroom left that you can use. So we'll keep filling that and filling that and filling that, at the same time watching how busy our GPU is. Now, what we've found is once we get to the point where we see the frame rate begin to drop, we believe that's the equilibrium where you're about equally CPU and GPU bound, because you can toggle the work down and the frame rate goes up, and you can toggle it up and you see it go down. So we think that's kind of the equilibrium point. From the tools that we use, when we look at that number, it tends to be around 85 percent GPU busy. So instead of just stating above or below that, we've put a little bit of a threshold around it. If you're below 80 percent, we're considering you to be more CPU bound -- and the lower that number goes, the more CPU bound we consider you to be. If you're greater than 90 percent, then you're probably more GPU bound. Now, what happens if you're in between? Well, these are two lines that are crossing, with the intersection point being around 85 percent. At that point, where you're close to the equilibrium, chances are whichever side you optimize you're going to see some type of performance increase. It may not be the one with the biggest ROI, but it will be very close to it, and that's kind of the goal. So that's what we suggest. Less than 80 percent, we're going to go optimize the CPU part of our world; greater than 90 percent, we're going to go optimize the GPU side. And if it's in between, we're going to look at both and choose the easier one. So that's how we go about it. So let's say that we -- for our assumptions right now; I'll come back to the GPU side in a bit -- let's assume that we're CPU bound. We're CPU bound, so now it's time to break out a tool, some type of profiler. In our class we teach VTune, and that's what we use for CPU applications. And once you get to that level, you can again take a look at the tool from the system level.
You can take a look at how our executable is running at the module level, and you can take a look at all the different modules and what percentages they're using. You'd be surprised how often, because of maybe a mistake in how you use the API, what percentage of your game the drivers make up, or what percentage even DirectX will take up in your game. All right? If you don't understand the key assumptions about those APIs, it's very easy for that to happen. So we want to take a look at that. If the module that's slow is the code that you wrote, well, then we can go down this process a little bit more and do some more measuring and some more analysis. If it isn't -- meaning third party -- I have listed, you know, the typical suspects: probably either the graphics API, the graphics drivers, and in some multithreading cases, you know, the OS, if you're not handling your locking appropriately. Those are areas that you want to investigate, and we'll cover those in a little bit more detail too. But if you do have the source code for it, then you have some more room for measurement and analysis. The first thing that I like to do is run VTune sampling. Sampling measures some event, and the default event that we think of when profiling is usually time: you measure the passage of time for a given amount of units and then compare that across all of your modules or functions or threads or processes, however you want to look at the view. So what we'll do is look at it from a time perspective: where are we spending our time, what is maybe the slowest function or slowest class -- the slowest area of code, really. It's hard to just break it down into our levels of C++, because this is execution. But what is our slowest area of code? And from that, we can start looking at some more specifics. So if we know where the time is being spent, then let's figure out why the time is being spent. In VTune I'll take a look at level 2 and level 1 cache misses for memory. That's very useful. If the areas with the level 1 and level 2 cache misses match the area where the time is being spent, then we want to look at the code visually, take a look at the memory and say, is this an area that has lots of memory access for one reason or another? That gives us a pretty good hint that we could be memory IO bound and that's where we should focus our time. If those areas differ, and the areas with the highest cache misses have nothing to do with the area where we're spending our time, those are probably an area for the future that we may want to go after in a bit. Next is compute bound. You can usually just do a visual inspection: you can look at the code and say, this area where the time is being spent -- if it's some collision loop and you've ruled out memory and there's lots of math, then you can feel pretty confident that it's related to compute. But there is another area that you have to look out for, which is being instruction-fetch bound, the front-end pipeline of your CPU. It is possible that you're having cache misses on instructions, right? You have that L1 data cache, and you have an L1 instruction cache. So it could very well be that your instruction cache is where you're having your problems. And again, VTune can help validate that if you go through and track the level 1 instruction cache misses. All right.
So if that area is where you're spending your time, and you've also ruled out memory, right -- you've ruled out memory and you see that we're spending our time on misses from our instruction cache -- then it's time to bring out some strategies to solve those instruction problems. And then, again, one way to get to compute bound is to rule out the other two and then assume that you're compute bound, because those seem to be the areas where we have problems. Okay. Back up the tree to the top again. Let's cover the other side real quick, the GPU. So let's say that maybe I just found an optimization -- and I'll talk about some of those solutions in a bit too -- but let's say that we are no longer CPU bound and now we believe that we're GPU bound. We came back and saw GPU busy was 98 percent, right? I'd be pretty confident that we're GPU bound at that point. Well, on any Shader Model below 4.0, we can break the graphics card into two pieces. You kind of have the side of your graphics card that deals with triangles and the side that deals with pixels. So there's a real quick test we can do up front. We can rule out a pixel problem by changing the resolution. If you lower the resolution and your frame rate stays exactly the same, then, you know, it's probably not the right time for you to optimize for pixel performance; let's move over to the vertex side and start right there. If you're on Shader Model 4.0 or higher, you're going to need some help from some tools, right? PerfHUD will actually show you, for your unified shader architecture, where you're spending your time: is the load balanced more on your vertex side or more on your pixel side? So it's a little bit different when you're below Shader Model 4.0 as opposed to on Shader Model 4.0. So let's say that we're below, so I can break it into smaller pieces still. Let's say that we change the resolution and we see our frame rate increase. Well, then we are limited by that side, the raster side of our GPU, and we're going to step backwards through the pipeline to determine these next couple of steps. All right. This comes from NVIDIA's guide to using PerfHUD. I have the reference in here to a Eurographics paper where they talk about this more. And you don't see them talk about it as much anymore; they have some other strategies now, with their frame profiler, for how to optimize. I still really like this approach. Maybe it's just because it's the way I've been working, but I still really like going back to this and then using the frame profiler to support my ideas about what I find. So we're going to start from the back of the graphics pipeline, because if you don't, and you change a stage, you may affect the stages after it, thus invalidating what you're trying to discover. So we'll start at raster operations. A test you can do for raster operations is to cut your depth buffer and frame buffer values to 16 bit. By doing this you've dramatically reduced the bandwidth. Typically, if you're raster operations bound, you're bound by the bandwidth between the raster operations and video memory. So if you're limited by that stage, our test is to cut the frame buffer and depth buffer precision in half, thus cutting the bandwidth in half.
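In Direct3D 9 terms, that test is just a format swap where you fill out your present parameters -- a sketch, with everything except the relevant fields omitted:

```cpp
#include <d3d9.h>

// ROP bandwidth test: drop the back buffer and depth buffer from 32-bit to
// 16-bit formats before creating (or resetting) the device, halving the
// bandwidth between the raster operations stage and video memory.
void setupRopTest(D3DPRESENT_PARAMETERS& pp, bool ropTest) {
    pp.BackBufferFormat       = ropTest ? D3DFMT_R5G6B5 : D3DFMT_X8R8G8B8;
    pp.EnableAutoDepthStencil = TRUE;
    pp.AutoDepthStencilFormat = ropTest ? D3DFMT_D16    : D3DFMT_D24S8;
}
```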
If your frame rate stays exactly the same, then chances are you don't really have a problem with frame buffer bandwidth, right? That's a pretty logical conclusion. We cut the work in half and nothing changed, so this isn't a limiting factor in this part of the GPU pipeline. So you rule it out and we move on to the next stage. We move to textures. For texture work, setting the mipmap level to the highest LOD will give us a very, very small -- I believe it's a 4x4 -- texture for all of our textures, and that's going to greatly reduce texture fetching work, right? And texture filtering work; there's a lot less texture filtering work, and you're going to get very good use of the texture cache if all of your textures are that small. Now, when we do this, we've always seen a little frame rate increase, right, very small -- and by little I mean around 5 percent. But I don't know that that's been enough to conclude that we're texture bound, right? But if we see a massive frame rate increase, then, guess what, it's a pretty good opportunity for you to increase performance through textures. On to the pixel test now, since we've already ruled out raster operations and we've already ruled out textures. We don't test the only other stage here, the rasterizer, because from a peak performance standpoint the vendors tell us we don't have to; they say they put a lot of hardware there. I've not really seen it be the limit. I still think it probably could happen, right, but I haven't seen it myself. So we should just be able to test this by lowering the resolution, and you may have already done that test earlier to find these two different directions. Okay. I'm going to speed up a little bit. From the geometry point of view, you're going to want to look at your vertex work, right? That's our next stage. If you've ruled out the whole pixel pipeline and now you're on to geometry, you're going to look at your vertex work. You can do a simple test, which is to just use a very simple vertex shader: take the vertex shader that you are running, replace it with one that's very simple, and watch the frame rate. If it's none of these stages -- and there are a couple of little details in there that you could also be bound by; say, vertex fetching and vertex assembly can also be an issue that you'd want to do a little bit of detection on, but I'm running short on time -- so if you've ruled out all the stages, really the only thing that's left in this whole process that we haven't talked about is the bus, which is down at the bottom, the graphics bus. So that seems like the next logical conclusion: if you haven't been able to determine what you're bound by, it's either the bus, or you've made a mistake in the process, or the process isn't 100 percent. And there can be cases like that. A lot of these are educated guesses and not necessarily always 100 percent. So overall here I have this detection process, and I won't be able to go through all of these, but following through, I go through each of them and show the different solutions that we have, and they correspond to the different levels. So if you do find that you are memory IO limited, depending on where you are in your lifecycle, it would be in your best interest to try to solve it from a system perspective first -- maybe you don't have time, but at the very least you want to look at it from that perspective -- and if you're going to pass on a system-level solution, pass on it for an application-level solution, and if you don't have time for that, fall back to a micro-level optimization.
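And back on the texture test for a moment, in Direct3D 9 it can be as simple as clamping every sampler to its smallest mip levels -- a sketch; the clamp value just needs to exceed any real mip chain:

```cpp
#include <d3d9.h>

// Texture test: force tiny mips on every sampler so texture fetch,
// filtering, and bandwidth work nearly vanish. If the frame rate jumps,
// textures are a good place to spend your optimization time.
void forceTinyMips(IDirect3DDevice9* device, DWORD samplerCount) {
    for (DWORD s = 0; s < samplerCount; ++s)
        device->SetSamplerState(s, D3DSAMP_MAXMIPLEVEL, 16);
}
```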
Again, we can look at a problem from any of these levels. Just because the problem may be a micro problem doesn't mean the solution has to be a micro solution. So I will flip through these real quick and just show some of the other topics -- if you're compute bound, some of our different options for solving that -- and give a little bit more time. At the system level, look at all this great concurrent hardware that we have. At the application level are things that reduce work. Personally, with lookup tables, you've got to be very careful, because if you're trying to reduce computation and you create a lookup table that is full of cache misses costing you maybe hundreds of instructions, then what have you saved? [inaudible] is a really good alternative to that, where you have a very small buffer of memory and you just save off the last computations that you've done -- very similar to what the post-T&L cache does for us on the GPU side. Yes, sir. >>: [inaudible] improve [inaudible]. >> Eric Preisz: Yes. Yeah. I could go into all of them, and maybe when we're done here I can go through them at a smaller level with you. Improving branch prediction: there are really two ways for us to improve branch prediction. One is to make the branch less random, because branch prediction inside [inaudible] a branch prediction table that's tracking a branch, and it will actually guess whether or not the [inaudible] for that basic case is taken. So by making the branch less random, you're giving the CPU a better opportunity to make a good guess and actually predict the correct branch. The other solution is there are some assembly ways and some SSE ways that we can actually get rid of the branch, or do things closer to predication, where you run maybe both sides, like the GPU does, and then just store the answer of the branch that's correct. We can go through them all; I just want to make sure I get them all up on the slides here so everyone can see some of these different solutions. Now, the instruction processing one, I'll have to admit, is an area that I'm still looking into a bit more. That's probably one of the later things that I've been wanting to look into, more about being instruction bound. On the graphics side, I start from raster operations, right, because we need to start from the back and go to the front. And here are some raster operation solutions: reduce your overdraw; work on better CPU or GPU occlusion culling; work on early Z cull -- in certain cases you can initialize that Z buffer and then go back through and render everything again, and actually see better performance from drawing your scene twice a frame, under certain conditions of having very high pixel work, very low vertex work, and a very low number of draw calls. Textures: I've got some more strategies here. Notice that early Z cull shows up for all of the pixel side, because it's an application-level, higher-level solution. And on the pixel side, offload work to the vertex shader if possible -- and of course shader optimization, which is a whole talk on its own, something that I find very exciting. Vertex work: deferred shading, optimizing for the post-T&L cache. Again, these slides will hopefully be available in some form for you to go through these more, and we can talk in greater detail about some of these other solutions, since I am going through them pretty quickly for you.
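On that branch prediction question, here's a minimal sketch of the predication idea in plain C++ -- compute both sides and select without a jump (illustrative code; like any micro-optimization, measure before you trust it):

```cpp
// Branchy version: cheap when the branch predictor guesses well, painful
// when the condition is effectively random.
int clampBranchy(int x) {
    return x > 255 ? 255 : x;
}

// Branchless version: build an all-ones or all-zeros mask from the
// comparison and blend the two results, the way GPU predication runs both
// sides and keeps one. Compilers often emit a conditional move for this.
int clampBranchless(int x) {
    int mask = -(x > 255);               // -1 (all ones) if x > 255, else 0
    return (255 & mask) | (x & ~mask);
}
```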
Lastly, I just want to bring up the point here that this process does tend to focus on one machine, right? You're doing this optimization on that machine, and it's telling you about the performance of that machine, but it's not necessarily telling you that you've made all games fast. So the question is: is a holistic video game optimization done on several configurations sufficient to say that we've made all PC games faster? Or the other strategy may be that we need to come up with load balancing solutions so that we can dynamically determine that we're using the systems to their fullest. This is something that we're working on. We're working on a project we call Coperf [phonetic]. Coperf is something that we're going to try to get a lot of people to run, based on certain types of tests, and we want to test the variability of optimization strategies across many, many different platforms, many, many different PCs. So hopefully if we do that, maybe we can start getting to the point where we can say, for this demographic of gamer -- the high-end gamer -- or for this demographic -- a game we want everyone to play -- what are the right optimization strategies based on that variability. So you will see that I have a white paper that's coming soon. I can give you the rough draft now; we're getting industry feedback right now at this stage of the white paper. Intel is working with us a bit on it, and we hope to get a lot of people involved. It's an open source project, so the benchmarks that we're building to help gather this data are open source. So not only will you be able to get our data on how this performance runs on many, many machines, we want you to come look at our code and make sure that we're right -- you know, these tests are hard to get 100 percent correct, and I can't know that they're correct, because it's hard to isolate certain parts of the hardware. So we want you guys to take a look at it, start picking through it, and improve our benchmarks so that there's something that's reliable. Yes. >>: Have you done any study or even discussion around what happens to a student that figures out how to optimize their code as a motivator, how this activity could be used to engage students [inaudible]? >> Eric Preisz: We've not done anything well organized. I'll have to say, from my experience, my course comes halfway through the program. A lot of times they don't find the optimization part the most interesting, and they're not necessarily enthused about learning these processes. When they get to final project, however, I see those students come back. What I do to try to encourage the students who are currently in my class is, when the final project students come to me with optimization problems, I don't solve them in my office hours. I make them come back to my classroom and we do it as a case study, usually in the first 20 minutes of class. And, depending on where we are in the class, I let my students try to solve those problems for them based on what they've learned. I figure maybe if I can get them to see it's important to the final student projects, then maybe it's important to them now. I still have plenty of students coming back. So I don't know if that answered your question or not, but, yeah, we haven't really done much of an organized study. And that's really it. I'll leave this last slide here, which says that as concurrency is growing, this responsibility of moving into these levels is also growing.
Developers are leading the way right now as far as the system-level optimizations, looking to find out what kinds of cores we have and how we're using them, how many GPUs, which particular GPU. And compilers are also moving into higher levels of optimization, into things like whole program optimization and some of these higher-level ways of optimizing at link time, getting us closer and closer to compilers actually being able to do application-level optimization. And then also runtimes reconfiguring themselves in more optimal ways. So maybe that will continue, and maybe the answer here is virtualization. I don't know. That, I think, might be a ways out, but you guys may have other information. And that concludes my talk, really. All of my references are listed here so that you can go through and take a look for yourself. And I guess do we have time for questions then? Okay. If you have some questions. >>: [inaudible] >> Eric Preisz: Yes. >>: Is there a summary of what you're describing here? >> Eric Preisz: You know, I don't have a summary in the white paper. But what I've tried to do through the lecture -- I can pass these out to you. Those are probably where you'll want the details. You'll find the details for all the other areas listed in the notes section of the slides, so you can find a lot of it there. I didn't go into all the details here necessarily. But there will be a format for going into greater detail on a mass public side within the next six months that I can't really talk about directly. I'm just not allowed to yet, but, yeah, it will be more available, so... Any other questions? All right. I appreciate it. Thank you. [applause]