>> John Nordlinger: Hello. My name's John Nordlinger. Thank you all very much for
coming and thanks for the folks online that are watching. I'm very pleased to present Eric
Preisz from Full Sail University. Eric's been doing a lot of innovative work around
optimization, in particular to games. And we're sure to benefit from his knowledge.
So thank you again and thank you, Eric.
>> Eric Preisz: Thanks, John. First off, it's a real pleasure to be here today and present
with you some of my ideas and things that we've learned about optimizing video games.
I'm from Full Sail University which is in Orlando, Florida. Optimization for me is a topic
I'm really, really excited about. I've always been very passionate about -- except for
maybe in the beginning, one of the things that drove me to learn optimization is I worked
for a company where we were a bunch of guys that got things working. We could make
graphics work, but not necessarily using the API always correctly. But we could get
things to work.
And we had a gentleman that worked with us that was from the game industry and he
really taught us a lot about the ins and the outs of using things correctly, really using our
resources better, and it made us all a lot better in our performance, kind of a lot better,
because of that.
Sometimes in the meetings, though, [inaudible] the tasks that we were going to work on
next and it would kind of -- if it shifted towards him, sometimes it seemed like he would
use the optimization thing as a way to get out of work, kind of go, oh, well, you can't do
that, that would be totally unoptimal for us to do, I think you should do this.
And so I decided to go to some conferences and try to arm myself with some knowledge
on optimization so I can, you know, stand a chance in those meetings of not getting stuck
with all of the work all the time.
So I told my students when they came in if you're not really, really excited about
optimization like I am, pick someone in the room that maybe you don't like and learn it in
spite of them. All right. So that's what got me into it. And then after I got into it, then I
found that I was really, really into the ins and outs of optimizing games.
So here are my goals today. We're going to talk about optimizing efficiently. It's not just
good enough to say that you're optimizing something. You need to do it efficiently. It's
very -- can be a very long process when you spend your time optimizing the wrong
things. So we're going to spend a lot of time focusing on how do we know what exactly
to optimize and where should we spend our time for maximum ROI. That's what's really
important here: the ROI of your optimizations.
Now, if you're listening to an optimization talk, especially a game optimization talk, I
think you'd really expect to see -- let's look at tons of little lines -- line of code
optimizations. I want to know -- let's get rid of all of our divides, let's get rid of all those
little things that get in our way, I want to find the tightest loop of the loop of the loop of
the loop with the most math and let's make that fast.
We're actually not going to focus on that topic today. First off, that's actually not the
thing that's helped optimize games the most. It isn't always about going
down and finding those little things. It's actually some of the higher-level things.
So we're going to talk a lot about some of those higher-level things. Plus, you know,
because of it, of my experience personally, those aren't even the areas that I'm the best at.
And I haven't had a chance to actually get as good at those areas.
Yesterday I had the pleasure, while I was up here, of getting to meet Mike Abrash.
I stopped by his office and, you know, there's a lot of guys like that that have
a lot more experience than I do at those little -- little itty-bitty things, like Mike. So that's
not even my area of where I get the most -- have the most experience, but I've also not
needed it as much as things have kind of changed in the last 10, 15 years, let's say.
And then so lastly, feedback. I'm really, really interested in feedback. So if while I'm
going through if you want to stop me at any time, ask a question, please do. I'm more
than open to that. Really, really interested in your feedback.
All right. Here's your mental map. So where am I going to take you along in this
journey? We're going to start off with talking about my motivation. What are some of
the things that make optimizing for games or optimizing in general difficult? And so
those are some of the things that motivate me: the difficulty of
optimizing games and some of the things that you may not expect. Then we're going to
talk a little bit about trends.
All right. If we know where we are today in this process, I kind of want to be prepared
for the future. We've seen some things changing, and so whatever this process is that
we're going to discuss, we're going to have to also talk about the trends. We'll do some
classification. We're going to talk about different types of optimization, we're going to
categorize them. Doing so helps us come up with a strategy in how we're going to start
from 300,000 lines of code and say where do we optimize.
All right. So this -- these classes that we're going to cover are going to help us do that
because their attributes lend themselves to telling us about this process.
And then hopefully once we've got that base set, then we can actually step through the
process of what holistic video game optimization is, which as you can tell from the name is
about looking at the forest, not at the trees.
So let's talk about some of my motivation. I pulled this off of the DirectX [inaudible]
mailing list. This was about I think maybe two or three weeks ago. It says here: When I
switch to full screen it blistered along in the hundreds as you'd expect. So what's up with
my Windowed mode? I've experimented with a few flags like WS_POPUP, et
cetera, and none of those had any effect. The window wasn't clipped, which I know can
cause problems, so I'm all out of ideas. I'm all out of ideas.
This gentleman has a dictionary in his mind of optimizations that he can go through and
step through and this case that he's come across doesn't fall within that dictionary of his
known tips. So he's coming to this mailing list saying, please give me some more tips.
And I've bolded his next word here, his next phrase here: Clearly something's not write.
And I agree with him. Clearly something's not right. And I think the part that may be the
most difficult about what he's stating as a problem is he doesn't know where to go next.
And I think that's clearly not right. So we're going to talk about, you know, where could
he go next or where would he go as opposed to just waiting for some more optimization
tips, which, by the way, I liken to stock tips.
All right. Optimization tips and stock tips are probably very similar. I think it was
Warren Buffett who said something like: give someone a million dollars and a year's worth
of stock tips and they'll be broke. You know, there's a lot of optimization tips out there
that people have given us that are maybe 90 percent true -- they're true 90 percent of the
time and we've accepted them as being always true, or as the best practice, without the
context.
There's things that were true that are no longer true. And they still float around. And
there's things that we teach that we still believe as being true that may not necessarily be
true as well. I mean, you know, it only takes a year in this technology with the changes
we've had, especially in concurrency, to turn things upside down.
So all right. Here's another motivation of mine. This is another part that I think makes
optimizing difficult is where; where in your code should you optimize. So I ran some
statistics here. I did some code counting, and this is from the Torque Engine, this is their
demo that you get when you get the Torque Engine. They have 18 percent comments --
111,000 lines of comments -- and some white space, of course. But let's take a look at that
number there, the code: 313,606 lines of code.
Now, with 313,606 lines of code, the question is where do we start, where do we
optimize. Even if you don't know the engine, that's even more difficult. So I put some
numbers here. I have 313,606 times 0.2. That's the Pareto principle: 80 percent of the
effects come from 20 percent of the causes. So maybe Pareto would suggest that we have
62,721 lines that we need to worry about, because that's where the effects come from.
But Pareto, he was an economist, so maybe we shouldn't take that number at face value.
We understand the concept of what he's trying to say. But maybe we can't translate it
directly to lines of code.
The other number I have here is 0.03. That one comes from Don Knuth. Knuth
says we should forget about small efficiencies, say, about 97 percent of the time:
premature optimization is the root of all evil. So if we don't have to worry about small
efficiencies -- by the way, this number is for small efficiencies; that means that maybe 3
percent is what we do have to worry about.
So as far as small efficiencies, maybe for the Torque Engine here we need to be
concerned with 900 -- or, sorry, 9,408 lines of code for small efficiencies. And, you
know, that might be off as well, but we still have this difficulty of which 9,408 lines
they are. But that can give us maybe an idea of some scope.
Well, if that doesn't make it difficult, the fact that engines are very large, another part that
makes it difficult is we don't have source code for many of the things that we work with
these days. It used to be that if you wanted to use it in your game you wrote it and you
optimized it. But now we rely on so many third-party APIs and middleware.
So I have on the left-hand side of this slide you can see, you know, maybe something that
was more the way we did games back in a simpler time, and now we have all these other
pieces that come into it. There's a large percentage of our game that's going to run in the
graphics API. There's a large percentage of our game that runs in the drivers. Maybe
you're using some middleware, physics APIs, things like that, or fixed-function graphics.
We've moved work from the CPU to the fixed-function part of our graphics. All of those areas,
we don't have source code for. So if you want to worry about those micro-optimizations,
those things, those little, small things, well, that works for the code that we have source
for. But if we don't have source for it, you're going to be stuck with using some other
methods and really making sure that you understand not just how you use their API from
a syntax point of view but how to use their API from the assumptions that they make
about how you'll use the API.
All right. Another difficulty for optimizing games. This one's a little bit more specific to
optimizing PCs. All right. That's just kind of the area that we focus our class on; we talk
about how to optimize for PCs. One of the difficulties is some of our optimizations that
we use are very sensitive to the machines that we're on, and we have a lot of
configuration.
So I took this -- these are GPU statistics from the last 12 months, as of the month of
November, from Futuremark, from their Web site. A certain demographic of gamer is
going to go out and use their Web site to determine what kind of numbers their hardware
is going to give them, but Futuremark also gathers some information about their platform
and their configuration.
So if you look here, you'll see that, other than the other category, which really scares me,
they've got the GeForce 8800, which seemed to be a pretty large share -- but clearly
there's a lot of different pieces of hardware out there that may have different specs. Well,
we know they have different specs. The real scary one here being: how big is that other
column? It's quite large. And who knows how many different versions of cards are
coming from that other category. How do we optimize for these? And, you know, some
of the best optimizations exploit specific pieces of hardware.
Well, what do you do when there's thousands of configurations. So there's the GPU
numbers, and here's the CPU numbers as well. And as you can see, they're as staggering
in the other column as well. So really difficult to -- you know, it's difficult to optimize on
a console. It's -- this is an extra challenge that you get when optimizing for the PC. So
this is another part of my motivation that drives me to get better at what I do.
So let's talk about some of the trends. What are some of the things that we see in the PC
world as far as hardware and how it's evolving? The first is concurrency, obviously. This
is probably a topic of everyone's talk in this area for the last, who knows, three years and
probably for the next -- who knows.
Concurrency is one of the ways that we solve problems with latency, which as we know
isn't necessarily getting better, especially for things like IO or memory latency. So
concurrency can make optimization really difficult because you can be successful in your
optimization and see no overall frame rate increase.
All right. Now, typically you will see a small increase, but to illustrate, in this situation
what I did here is, let's say that I take a deck of cards and I split it between two
people and I ask them to sort it red-black and give it
back to me. Well, say that it takes person A a minute and 30 seconds to sort those cards;
it takes person B a minute to sort those cards. Well, let's say that I'm Joe optimizer and I
come along here and I want to optimize the problem and I'm pretty sure because I watch
this and I know a lot about person B, I'm pretty sure that person B is what I need to
optimize.
So let's say I do that. I go and I tell person B how to get better at optimizing his process,
and I get him down to 30 seconds. Well, so person A takes a minute 30, person B takes
30 seconds. The overall process is still a minute and 30 seconds. We're bounded in
performance for this setup here by the slowest person. It's not really a combination of the
two when they're perfectly parallel.
All right. So did I optimize something? Yeah, I did. Person B, I successfully optimized
him from one minute to 30 seconds, but my overall increase is nothing. So this can happen in
our world as far as game developers too because you have the CPU and the GPU, which
we try to keep separate as much as possible, and we keep them running concurrently, and
then within the GPU, you have many parallel stages that run as well, [inaudible] many
GPU kernels that are all running in parallel, and then also on the CPU side we have
multiple CPUs.
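As a rough sketch of that card-sorting scenario (the timings are stand-ins, simulated with sleeps rather than real work):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Two "card sorters" running in parallel; each just sleeps for its sort time.
static void worker(const char* name, int seconds)
{
    std::this_thread::sleep_for(std::chrono::seconds(seconds));
    std::printf("%s done after %d seconds\n", name, seconds);
}

int main()
{
    auto start = std::chrono::steady_clock::now();

    std::thread a(worker, "A", 90);  // person A: a minute and 30 seconds
    std::thread b(worker, "B", 30);  // person B: optimized from 60s down to 30s
    a.join();
    b.join();

    auto total = std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::steady_clock::now() - start).count();
    // Prints ~90: the parallel whole is bounded by the slowest worker,
    // so optimizing B bought no overall improvement.
    std::printf("total: %lld seconds\n", static_cast<long long>(total));
    return 0;
}
```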
So we -- this scenario here -- I have it playing out with two people, but it plays out many
times across our machines. So that's one of our trends that we're going to see continue,
this idea of -- excuse me -- of more and more concurrency.
All right. This gap is widening. The gap I'm referring to is our ability to do calculations
versus memory IO. All right. This gap is widening. We're able to perform calculations
faster and faster and faster, and we're seeing huge strides in that over time, whereas
memory latency we don't necessarily solve so much as spend our time hiding, and the
advances that we've made in memory latency are not on par with what we're able to do in
calculations.
So, again, going back to the idea of finding the loop of the loop of the loop, the nested
loops with the most calculations inside, that may not be the most important thing. You
know, if we have an L2 cache miss and, say, we don't have an L3, we go straight to
system memory, and you're giving up on the order of hundreds of clock ticks. Hundreds
of clock ticks. And we're going to worry about a 23-clock-tick divide when we're not
always focused on the hundreds of clock ticks that we give up on system-level cache
misses.
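To put a number on that intuition, here's a small illustration (not a rigorous benchmark): the same arithmetic run over memory visited in order and then in a shuffled order. On typical hardware the shuffled walk, dominated by cache misses, dwarfs whatever the divide in the loop body costs:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const std::size_t n = 1u << 24;
    std::vector<int> data(n, 3);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t(0));

    auto run = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum += data[order[i]] / 3;  // the divide everyone wants to remove
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: sum=%lld, %lld ms\n", label, sum,
                    static_cast<long long>(ms));
    };

    run("sequential access");                  // cache-friendly walk
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    run("shuffled access");                    // cache misses dominate
    return 0;
}
```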
I put a quote on here. Herb Sutter has a video online that's really, really good, and he
mentions Rico Mariani here at Microsoft who likened a processor -- Herb Sutter stated
that Rico thinks that a processor today is like a sports card -- sports car in Manhattan.
Sorry. There's a typo there that says sports card, from my last slide of cards, I guess.
Sports car in Manhattan.
So picture a Ferrari in Manhattan. Their ability to do calculations is the Ferrari part; our
ability to do -- or our cache misses are the stoplights. Right? Sorry.
>>: [inaudible] that analogy could be expanded, though, because if somebody has a
Ferrari in Manhattan [inaudible] much different goal.
>> Eric Preisz: Sure. Sure. Their goal may not be to get from one side of the street to
the other as fast as they can. Right. I'm sure, yeah, this analogy breaks down pretty
quickly when you look at it from that perspective.
But I do like that. It works really well with my students to give them this idea of, hey,
listen, do you want to optimize the Ferrari or do you want to remove the stoplights. All
right. So when it comes to looking at this trend, the challenges of memory IO latency
aren't going away. They're physically bound. We're doing things to solve them, you
know, multicore helps in solving this.
But, boy, you better be using those cores correctly and really understand your memory
architecture because multicore can also make it worse, can make memory IO latencies
worse, especially if they're all polluting each other's -- the L2 cache on each other. So if
you use it right, that's great. And we're making some strides in that way. But this is very
much a difficult trend, and it needs to focus our optimization sometimes on the idea of
optimizing for memory IO.
All right. Here's another trend. I see this one as these gaps are narrowing. The bridge
between the CPU and a GPU, I think the GPU has given us some different ideas on how
to do calculations. And you're seeing some things starting to come from the GPU world
moving towards the CPU world or maybe moving from the CPU world towards the GPU
world.
Examples like Larrabee I think are a pretty good example of how really
having a combination of both may be the right approach. And I put some of the things
that make these processors different. And I also have these two words up here. I have on
the left-hand side trucks and trains. This is the analogy I use in my class, and I'm sure we
could find ways to break this down too, but this is the way I think of which processor is
right, who has the right idea about how we're doing calculations.
I liken it to having trucks and having trains. We still have trucks and we still have trains
in this world, and trucks are really good. They're agile, they can go down streets, turn left
and turn right. Maybe it doesn't take as long to get them going; you can just kind of hit
the gas pedal. And trains, right, there's a lot of setup time, we have to get things lined up
correctly, we've got to get everything into these big batches. But once a train gets going,
man, boy, it's getting a lot done really fast.
So the question isn't who's right; the question is what's your algorithm. Is your algorithm
more suitable for a train; is your algorithm more suitable for a truck. So we're seeing this
gap kind of narrow, and you're seeing things like Larrabee, which is maybe closer to a
CPU than a typical GPU, but also has texture hardware for doing texture fetching and
for doing texture decompression.
All right. I want to talk about some different classes of optimization. And these are
really key to the concept of holistic video game optimization, are these classes. So you're
going to see them over and over, and I've color-coded them as much as I can through this
presentation, so you'll see system level in blue, the application level in red and micro in
green.
Now, these levels -- actually, the first place I've seen them would be one page in
the [inaudible] where they talk about these different levels: system, application and micro.
And it's a pretty short description, but in working with VTune a lot, I've really
seen this trend across other tools. I've seen this trend in just how we look for
optimization problems and then how we solve them.
So system level, the goal of system-level optimizations, they focus on balancing. So it's a
very high-level system look. You know, for example, how many cores do we have and
how many of them are we using. So simple things like utilization, things like idle counts,
you know, how idle is this processor is really important from the system perspective.
And I put down here that it's machine dependent. And this is required of us today. This
kind of dependency of knowing about our system -- chip makers are telling us it's our
responsibility now. It may not have been in the past; we didn't necessarily need to know,
you know, what's the number of cores. We had one core. But now it's up to you.
We used to be able to increase the performance of cores by increasing performance with
instruction-level parallelism, which was, you know, very evident to the compiler
developer, but as application developers we didn't really have to know a whole lot
about instruction-level parallelism to continue to see performance increase.
But now with moving to something like multicore, it is required on the GPU side as well,
you know. You could have multiple GPUs inside one system, and that's something that
we, or the vendors, are telling developers: This is your responsibility now, you need to
go figure out what you have and make the best use of it.
The application level is much more focused on groups of code. So I liken it to the
algorithm, to the class size, maybe even multiple-function size. I put here the fastest
triangle. What is the fastest triangle you can draw. The fastest triangle you can draw is
the triangle you don't draw. What's the fastest memory IO? It's the memory IO that you
don't do, it's the way that you get around from doing it at all.
That's kind of a central theme of application-level optimizations; they focus on what can I
not do as opposed to what am I doing in making it faster.
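One minimal sketch of that "fastest work is the work you don't do" idea is a dirty flag: recompute a cached result only when its inputs change, instead of every frame. The names here are hypothetical:

```cpp
// Recompute a world matrix only when the position has changed.
struct Transform
{
    float position[3] = {0.0f, 0.0f, 0.0f};
    float cachedMatrix[16];
    bool  dirty = true;

    void setPosition(float x, float y, float z)
    {
        position[0] = x; position[1] = y; position[2] = z;
        dirty = true;                    // invalidate; don't recompute yet
    }

    const float* matrix()
    {
        if (dirty)                       // rebuild only when needed
        {
            buildMatrixFromPosition();
            dirty = false;
        }
        return cachedMatrix;             // the common case: no work at all
    }

private:
    void buildMatrixFromPosition()
    {
        // Identity with a translation (column-major), standing in for the
        // expensive computation we want to skip.
        for (int i = 0; i < 16; ++i)
            cachedMatrix[i] = (i % 5 == 0) ? 1.0f : 0.0f;
        cachedMatrix[12] = position[0];
        cachedMatrix[13] = position[1];
        cachedMatrix[14] = position[2];
    }
};
```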
Machine independent: this is one really nice thing about application-level optimizations.
If you can successfully do less work, that's something that translates well between all
of those different processor configurations. You don't necessarily need to know exactly
how your processor works -- the latency and throughput of the divide, or the
throughput of the multiply -- to take advantage of not doing that work. That's
something that translates really well from machine to machine.
Lastly, micro-optimizations, and these are the parts that, like I said, you know, it's -- I'm
still trying to catch up with those greats who really, really know
micro-level optimizations. I find myself watching more and more lectures on compiler
presentations and, like I said, people like Herb Sutter who really, really understand the
ins and outs of our microprocessors today, the instruction-level parallelism concurrency
issues. So I find myself looking at them to get better at this stage, maybe the assembly
optimizer, someone who can really do that.
But, again, you'd better really understand the processor as well as those compiler
developers do if you think that you can come in and make the code faster. There are all
sorts of things at the assembly level where, if you look at the code that's generated by our
compiler without knowing the architecture underneath and some of these trends, you may
think they're crazy -- why are they doing this stuff?
I mean, a really good example is how we use write-combined buffers. There
was a talk at GDC -- oh, shoot, I forget exactly what year it was, a GDC talk from Intel
where they talked about using write-combined buffers. And their suggestion was, well,
listen, if you touch every single value of that write-combined buffer memory, that can be
faster, because then you're going to get that full 64-byte burst going across to system
memory. It's going to batch it together.
If you don't touch every value -- like let's say that you're stepping through a locked vertex
buffer and only touching the XYZ -- then the write-combined buffer is going to update in
8-byte increments. So there's a hidden opportunity for a performance increase from
batching if you touch every value.
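A sketch of that suggestion (the vertex layout and the Lock()-style pointer are assumed for illustration; only the "touch every value" pattern is the point):

```cpp
// "vb" stands in for write-combined memory returned by locking a vertex
// buffer. Rewriting the whole vertex keeps the write-combining hardware
// filling full 64-byte bursts; skipping members breaks the stream into
// small partial writes.
struct Vertex
{
    float x, y, z;      // position
    float nx, ny, nz;   // normal
    float u, v;         // texture coordinates
};

void updatePositions(Vertex* vb, const Vertex* source, int count)
{
    for (int i = 0; i < count; ++i)
    {
        vb[i].x  = source[i].x + 1.0f;   // the only value that really changes
        vb[i].y  = source[i].y;
        vb[i].z  = source[i].z;
        vb[i].nx = source[i].nx;         // unchanged values, rewritten anyway
        vb[i].ny = source[i].ny;
        vb[i].nz = source[i].nz;
        vb[i].u  = source[i].u;
        vb[i].v  = source[i].v;
    }
}
```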
Now, if you look at, say, the C++ code or the assembly code for that, you'd say, why is
this person touching every single value that they're not changing? All right. And it may
look slower. So these micro-optimizations can really be tricky and they can be very
deceiving, and just thinking about how lean your assembly looks isn't necessarily
the same as understanding how fast your assembly's running. All right. It's very much
about understanding the instruction-level parallelism and taking advantage of that.
These are also machine dependent because different instructions and different processors
have very different latencies and throughput. So this is also an area that's difficult just
because you've found something that works really well on this processor, you may
actually have a performance cliff on a different processor, and what you've now checked
into your code -- I'm sure we've all had experiences where you've checked into your code
and it works for you and it breaks everyone else's build, you know. Oh, works on my
machine. Right? We've all been there.
But you can have the same kind of thing occur with these micro-optimizations where it's
like it goes faster on my machine, how come you didn't see it, how come it wasn't faster
on yours or, who knows, maybe even slower.
I put down here at the bottom a little quote that I think is kind of interesting when looking
at these levels. I was looking into the Toyota production system, TPS, which is a set of
theories on how they optimize, for lack of a better word, you know, operations in their
plants. The graphics card is very much like an assembly line, and I think there are some
interesting parallels we can draw between trying to optimize such a large system like
that and our smaller but very complex system.
So originally they were looking to optimize what they call muda, which is waste.
I liken that to maybe micro-optimizations or even
application optimizations, those things. They wanted to optimize from the small person,
the individual and the group size first because they didn't want to get everyone in this
huge organization involved in their optimization. They wanted to see change quickly, so
they kind of focused on the small things first, which is very similar to what a lot of us
have been focusing on for optimizing games, are these little pieces.
And I thought, wow, that's interesting that they want to start from the little and make their
way to the big when, you know, we've come up with these ideas of starting from the big
and going down to the small.
But then I found a letter that Jim Womack wrote. He is the gentleman from MIT who
brought some of Toyota's philosophies over here, and we now call that lean
manufacturing. And he in this letter wrote: The inevitable result is that mura creates
muri that undercuts previous efforts to eliminate muda.
So he was suggesting, you know what? Maybe we should start looking from the top
down, because we're undercutting our efforts to remove the small things by not looking at
the big things, all right, and I think we would tend to agree with that process according to
these levels.
So I'll give you some quick examples of the ideas of system-level optimizations -- or,
sorry, all of those optimizations. I put some percentages up here. We're using a hundred
percent of our CPU. We're only using 20 percent of our GPU. I've seen that before.
That's not really that uncommon, for us to use a smaller percentage of our GPU,
especially considering the CPU drives our GPU with instructions and commands. You
know, it's not uncommon for us to be wasting that.
I have on here Proebsting's law. Now, he's one of your folks, and he has this sort
of tongue-in-cheek law on the Internet basically saying that all of our efforts in
compiler development, if you look at them over the last 36 years, really on a yearly
basis contribute about 4 percent of performance increase. And I'll let
you go up there and take a look more at how he derives those numbers.
And, again, it is a bit tongue in cheek, I realize. But I think he has a good point here. He
states that on the hardware side we'd see, on a yearly basis, roughly about a 60
percent performance increase coming from hardware. I think he looks at it over 36 years
as well.
So the system level says, listen, if this hardware is so important as far as performance,
then the first thing we need to do is find out why we aren't using that
other 80 percent of the GPU, right? If we're seeing performance increases at
such a level on this hardware, then why aren't we focusing on that?
So I have some examples. Hardware instancing is an opportunity for us to move work
from the CPU over to the GPU, so that's an idea of balancing. If you are overutilized on
your CPU and you're underutilized on your GPU, an example of balancing would say let's
take the work from the CPU and move it over to the GPU.
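For the PC, a Direct3D 9-style sketch of hardware instancing looks roughly like this (the buffers, sizes, and counts are assumed to exist already; SetStreamSourceFreq and the stream-frequency flags are the real D3D9 mechanism):

```cpp
#include <d3d9.h>

// Draw one mesh many times in a single call: shared geometry in stream 0,
// one per-instance record (e.g. a transform) in stream 1.
void drawInstanced(IDirect3DDevice9* device,
                   IDirect3DVertexBuffer9* meshVB, UINT meshVertexSize,
                   IDirect3DVertexBuffer9* instanceVB, UINT instanceSize,
                   IDirect3DIndexBuffer9* ib,
                   UINT vertexCount, UINT triangleCount, UINT instanceCount)
{
    device->SetStreamSource(0, meshVB, 0, meshVertexSize);
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instanceCount);

    device->SetStreamSource(1, instanceVB, 0, instanceSize);
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);

    device->SetIndices(ib);
    // One draw call submits every instance: per-object CPU work moves to the GPU.
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                 vertexCount, 0, triangleCount);

    // Restore the default, non-instanced stream frequencies.
    device->SetStreamSourceFreq(0, 1);
    device->SetStreamSourceFreq(1, 1);
}
```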
Same with GPGPU, right, general-purpose processing on the graphics card, and compute
shaders, which I was just taking a look at the documentation on [inaudible] when that came out.
So again it's another opportunity for us to -- I'm kind of focusing here on moving work
from the CPU to the GPU. But it's still relevant that you may have to take work from the
GPU and move that to the CPU as well, although I'd have to say currently you don't see
that as often being the case.
And so I have some characters up here, skinned characters. If you're doing your character
skinning work on the CPU, you can move that from the CPU and onto the
GPU. Another way that GPUs are becoming more like CPUs is the added flexibility in
Shader Models 4.0 and 5.0, which are much, much more flexible and let us start doing
this processing on that side.
All right. Application-level optimizations. You know, a big part of what makes a game
engine is all these algorithms designed to reduce the amount of
work or remove it. So I have an example here of a ray versus a piece of terrain, and if
you wanted to, we could spend our time optimizing that ray versus each little
collision point. For each of those collision points make your way through this whole
mesh, and let's get rid of that square root, or let's make that as fast as we can; we could
focus on it that way. But a quad tree looks at it and says, whoa, on our first pass of this
quad hierarchy, we could cut out three-fourths of the work. Right?
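A minimal sketch of that quadtree descent (the types and the two intersection helpers are placeholders; the triangle test is stubbed out for brevity):

```cpp
#include <algorithm>
#include <vector>

struct Ray  { float origin[3]; float dir[3]; };
struct AABB { float min[3]; float max[3]; };

// Standard slab test: does the ray hit the node's bounding box?
bool rayHitsBox(const Ray& r, const AABB& b)
{
    float tmin = 0.0f, tmax = 1e30f;
    for (int i = 0; i < 3; ++i)
    {
        float inv = 1.0f / r.dir[i];
        float t0 = (b.min[i] - r.origin[i]) * inv;
        float t1 = (b.max[i] - r.origin[i]) * inv;
        if (inv < 0.0f) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
    }
    return tmin <= tmax;
}

// Placeholder: substitute a real ray/triangle intersection here.
bool rayHitsTriangle(const Ray&, const float* /*tri*/) { return false; }

struct QuadNode
{
    AABB bounds;
    QuadNode* children[4] = {nullptr, nullptr, nullptr, nullptr};
    std::vector<const float*> triangles;   // leaf payload
};

bool raycast(const Ray& ray, const QuadNode* node)
{
    if (!node || !rayHitsBox(ray, node->bounds))
        return false;   // whole subtree culled: work we never do

    for (const float* tri : node->triangles)
        if (rayHitsTriangle(ray, tri))
            return true;

    for (const QuadNode* child : node->children)
        if (raycast(ray, child))
            return true;
    return false;
}
```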
So really that's the main concept of application-level optimizations. I have quad trees
listed here, and early Z cull, which actually I think of as an application-level optimization
on the GPU, right, the idea there being: let's remove the pixel work by using the Z buffer --
initialize the Z buffer to say that that pixel is never going to be visible, so why even spend
the time making that shader go faster when we can just not run it.
And for us [inaudible] is another example of let's not optimize the work, let's get rid of it.
All right. So micro-optimizations. And this is the part I think everybody would be
excited about, where I'm going to let you down. So I have three different examples:
stepping through a loop in loop A, unrolling it in loop B, and one that I saw in a piece of
text called unroll enhanced. Now, the idea of unroll enhanced, if you look at it, is that
we've broken the dependency on the sum variable as we move across; each of those lines
is now independent of the others.
And from the concept of instruction-level parallelism, it seems like that should be faster,
because you remove that chain of dependency until the very last line, when you're actually
summing these things together; with instruction-level parallelism, the idea is, wow, we
can do each of those lines at the exact same time. Sounds like a big win. And these are
all true, but the problem with the way I have them presented here is we're thinking about
them at the C++ level. Right?
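Reconstructed from the description, the three variants look roughly like this (n is assumed divisible by 4 to keep the sketch short):

```cpp
// Loop A: the plain version -- one accumulator, one dependency chain.
float loopA(const float* v, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

// Loop B: unrolled, but still a single chain through "sum".
float loopB(const float* v, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i += 4)
    {
        sum += v[i];
        sum += v[i + 1];
        sum += v[i + 2];
        sum += v[i + 3];
    }
    return sum;
}

// Loop C: "unroll enhanced" -- four independent chains that could, in
// principle, issue in parallel; only the return joins them.
float loopC(const float* v, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4)
    {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```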
So if you're writing C++ code like this and you're going to compile it into assembly,
well, you know, our compiler has a big role in what that loop is going to look like. And
we found that loop A and loop B are the same from the assembly point of view. So from
my perspective, I would rather just keep the loop looking like loop A, because the
assembly that looks like loop B here may be inferior in the future, and if I keep my code
looking like A, then my compiler will adjust and fix those problems for me.
So as far as going into the future, I like to keep my pattern simple. Now, that last one
there -- and, again, who knows exactly the details. I looked at the assembly, and to me I
didn't quite understand why, but when I got these into VTune and did a performance test
on them, I actually found loop C to be slower. And, again, I understand from the theory
why it should be faster, but when we measured it, it turned out not to be.
So, again, keep your eye on those micro-optimizations. It may have been that this was
faster on the machine that they tested it on, and maybe the compiler optimizations were
maybe more focused towards that machine and on the machine that I used to look at these
numbers, it was slightly different, maybe causing a performance cliff.
Let's talk about these levels and the project lifecycle. One of my favorite interview
questions is to ask somebody when do you optimize a game. A lot of times people will
go back to, well, premature optimization is the root of all evil. I love that phrase. That is
powerful. Premature optimization is the root of all evil. With all the things going on out
there, if we just -- I wish we could just stop optimizing our code too early.
But of course you get more context when you look at that whole phrase, right? But when
do you optimize a game is a difficult question to ask, and of course right now I'd say that
our approach is usually somewhere similar to how some people treat sound: oh, we'll just
do it later, right?
And we have this idea that we can't optimize code until we're done with it. And I would
disagree with that. I think that we optimize through the whole project lifecycle -- I have
here a timeline from the implementation start date through to done -- but we're going to
do it in a bit of a different way, right, because early on, that's when you want to do your
system-level optimizations.
Why? Because it's about the only time that you can. It would be great to do a
system-level optimization a week before a milestone if it didn't cause bugs. You know,
you can't just, oh, our game's going to ship in a month, let's work in some multithreading,
right, let's just do that. Right? The only time that you can do that is up front. So that's
when you have to really look at it from the system level, how are we going to use all the
resources we have, the algorithms that we're using, you know, maybe you can take some
data to try to see the algorithms that we're using, how well do they balance the work
across all of the hardware that we could be on for all of our different configurations that
we have to worry about.
Now, once you have a little bit of a design, then you can start working on application
optimizations, right, the algorithms, and that should be able to extend for a really long
time; you should be continuing to do the application-level optimizations. And once
we get towards the end, from a stability perspective, you may want to start really getting
choosy about how you're going to optimize, because you don't want to introduce radical
change at your milestone, obviously.
And so because of that, micro-optimizations can fit well at the end, but, again, you do
want to be careful because, you never know: if it is faster on that machine, you're going
to want to test it across many, many machines before you feel confident about the
fact that that micro-optimization is better for all machines and not just the one you tested
on.
Another picture you may want to include on this, I was thinking as I was looking at this
the other day, is that you could even extend a circle here of a design-level optimization
outside of implementation. So maybe we should be focusing a bit on
educating our game designers on the effects that their designs have on our performance,
give them the tools as game designers -- and, you know, the good game designers do have
that experience.
Over time they're going to get this anyways. But from a design point of view, maybe we
really need to focus on teaching them how to make designs that take the best advantage
of our trends from a system level, from an application level and a micro level.
All right. Again, a little more with these levels here. We've got system, application, and
micro. I've listed some of the tools here -- sorry, the text is a little small. But I've listed
some of the tools. You know, you're never going to get away with just using one tool to
optimize a game right now, because vendors are making different tools for the different
parts.
And so I have listed here some of the tools. And, you know, the tools usually break
down into some smaller utilities within each tool. And I've tried to kind of match up the
utility that goes with the tool, so [inaudible] at the system level, you can use the
performance dashboard, which gives you a really good system-level overview of
how your GPU is running and how all the individual kernels are running. And I won't go
through all of these here; I'll just let you look at them in your own time. But different
tools for different levels, and that's going to play a role in our process as well.
All right. So now we're on to holistic video game optimization. We spent about 40
minutes there going through all that just to set us up for this part right here.
Let's talk a little bit about the optimization cycle. We're going to measure, we're going to
analyze. In fact, you're going to probably have to do that a couple rounds before you can
even figure out what the problem is. So you'll measure, you'll analyze, which may lead
us to measure some different things and then analyze. At the very minimum, I think
you'll have at least two rounds of this, and usually you have probably closer to three or
four.
So you measure and analyze, measure and analyze, you're going through, you're
gathering data. We're trying to figure out what optimization bottleneck or hotspot is -- or
what bottleneck and hotspot is going to get us our best ROI for performance increase.
That's what we're looking for. And we'll never know that with a hundred percent
certainty. Right? We're kind of like detectives here. We're gathering clues, we're
zeroing in, we're zeroing in to what we think is going to give us the best performance
increase.
Once we believe we've found that, you implement a solution, and if you see a frame rate
increase, then you start all over again at the beginning. Right? You know, code executes
kind of like water flowing down a brook. If you have a big rock in the middle and you
remove that rock, the water is going to change its course. So we have to start way at the
beginning now and go back through this whole process of measuring and analyzing,
because we've really changed the way that code is flowing through our application.
All right. On to the process now. I've separated these into the three levels: system at the
top, application in the middle and then micro at the very bottom. Now, this is the
detection process, right? We're detecting what our bottleneck and our hotspot are, and
we're going to start from the system level when we do this detection. We're going to start
from the system level and make our way down. You know, usually you're going to end
up with a micro area for the problem.
But that doesn't necessarily mean you have to solve it with a micro solution. Right?
Once you get to that level of knowing whether you're -- what your problem is, you're
going to have a whole list of things you can go through at different levels, different types
of solutions. So let's start very much at the top here.
Obviously we're going to start with something, a benchmark that's a reliable test. You're
going to set up in a certain part of your game and you're going to stay there, so that if we
optimize and we see the frame rate increase, we know the next time you run it that the
frame rate increase isn't because, you know, the fan just turned on on your laptop's GPU
and you're getting more power now, or some kind of power issue. You want to have a
reliable benchmark.
So there's lots of things that make really good benchmarks or really bad benchmarks.
The problem with benchmarking games is all of those traits that make really good
benchmarks, games kind of violate all of those, you know, repeatable, well, AI can really
cause some problems there, you know, all of those characteristics of benchmarks are very
difficult.
Okay. The very first thing that you want to do when you're going to optimize a game
with this system is figure out how busy your GPU is and know whether we are GPU
bound or not. Now, I have two numbers up here, less than 80 percent and greater than 90
percent, for the two directions that we're going to go. All right.
Now, let me tell you about these numbers. I wouldn't call them authoritative at this time.
We've seen this kind of trend and we're doing some more studying to try and figure out
how these numbers work exactly. We've gotten some anecdotal support from folks at
NVIDIA when we show them this method. And so we have that, but I encourage you
to look at these numbers under your own environment and help us to determine exactly
what these numbers mean.
So let me explain. What we do is we take a very GPU-bound application, a hundred
percent GPU: you just fill it with [inaudible] work for your GPU and leave it very
CPU-light. All right. And we do that, we look at the GPU number, and you can confirm
from the tools that you are 100 percent GPU bound -- or at least 99.9 percent. Very close.
What we'll do is we'll put some work on the CPU side that we can toggle so that we can
continue to increase it, increase the work. So you started out with a hundred percent
GPU, and we toggle that work and increase CPU work. The frame rate won't change
right away, right, because you're kind of filling into that level where you're GPU bound,
so you have a lot of room left of CPU power that you can use. So we'll keep filling that
and filling that and filling that, at the same time watching how busy our GPU is.
Now, what we've found is once we get to the point where we see the frame rate begin to
drop, we believe that that's the equilibrium where you're about equally CPU and GPU
bound, because you can toggle the work back down and the frame rate goes up, and you
can toggle it up and see it go down. So we think that's kind of the equilibrium point.
From the tools that we use, when we look at that number, it tends to be around 85 percent
GPU busy.
So instead of just stating above or below 85 percent, we've kind of put a little bit of a
threshold here. If you're below 80 percent, we're considering you to be more CPU bound.
And the farther you are down that line -- you know, closer to being 15 percent GPU
busy -- the more CPU bound we consider you to be. If you're greater than 90 percent,
then you're probably more GPU bound.
Now, what happens if you're in between? Well, this is -- you know, these are two lines
that are crossing with the intersection point being around 85 percent. At that point where
you're close to the equilibrium, chances are either side that you optimize you're going to
see some type of performance increase. And it may not be the one with the biggest ROI,
but it will be very close to it, and that's kind of the goal.
So that's what we suggest. Less than 80 percent, we're going to go optimize for the CPU
part of our world; if it's greater than 90 percent, we're going to go optimize for the GPU
side. And if it's in between, we're going to look at both and choose the easier one. So
that's kind of how we go about it.
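Written out as a sketch, the decision rule is just this (gpuBusy would come from a tool like PerfHUD or a vendor counter; the thresholds are the provisional numbers from above):

```cpp
enum class Direction { OptimizeCPU, OptimizeGPU, EitherSide };

Direction chooseDirection(float gpuBusy)   // 0.0 .. 1.0
{
    if (gpuBusy < 0.80f)
        return Direction::OptimizeCPU;     // GPU is starved by the CPU
    if (gpuBusy > 0.90f)
        return Direction::OptimizeGPU;     // GPU is the bottleneck
    return Direction::EitherSide;          // near equilibrium: take the
                                           // easier side to optimize
}
```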
So let's say that we -- for our assumptions right now, I'm going to go back down to the
GPU side in a bit. Let's assume that we're CPU bound. We're CPU bound, now it's time
to break out a tool, some type of profiler.
In our class we teach VTune, and that's what we use for CPU applications. And once you
get to that level, you can again take a look at their tools from the system level. You can
take a look at how your executable is running at the module level, and you can take a
look at all the different modules and what percentages they're using.
You'd be surprised how often, maybe because of a mistake in how you use the API, what
percentage of your game the drivers make up, or what percentage even DirectX will take
up in your game. All right? If you don't understand the key assumptions behind those
APIs, it's very easy for that to happen. So we want to take a look at that.
If the code -- the module that's slow is code that you wrote, well, then we can go
down this process a little bit more and do some more measuring and some more analysis.
If it isn't, meaning it's third party, I have listed, you know, the typical suspects: probably
either the graphics API, the graphics drivers, and in some multithreading cases, you
know, the OS, if you're not handling all your system locking appropriately. Those are
areas that you want to investigate. And we'll cover those in a little bit more detail too.
But if you do have the source code for it, then you have some more room for
measurement and analysis. The first thing that I like to do is run VTune
sampling. Sampling measures some event, and the default event that we think of when
profiling is usually time. You want to measure the passage of time for a given amount of
units and then compare that for all of your modules or functions or threads or processes,
however you want to look at the view.
So what we'll do is look at it from a time perspective: where are we spending our time,
what is maybe the slowest function or slowest class -- the slowest area of code, really.
It's hard to just break it down into our levels of C++ because this is execution. But what
is our slowest area of code? From that, we can start looking at some more specifics. So
if we know where the time is being spent, then let's figure out why the time is being
spent.
So in VTune I'll take a look at level 2 and level 1 cache misses for memory. That's very
useful. If the areas with the level 1 and level 2 cache misses match the area where the
time is being spent, then we want to look at the code visually, take a look at the
memory, and ask: is this an area that has lots of memory access for one reason or
another? That's giving us a pretty good hint that we could be memory IO
bound, and that's where we should focus our time.
If those areas differ and the areas with the highest cache misses have nothing to do with
the area where we're spending our time, those are probably an area for the future that
we're going to want to go after in a bit.
Next is compute bound. So you can just do kind of a visual inspection usually, you can
look at the code and say this area where the time is being spent, if it's some collision loop
and you've ruled out memory and it's a collision loop and there's lots of math, then you
can feel pretty confident that it's related to compute.
But there is another area that you have to look out for, which is being instruction fetching
bound, the front-end pipeline of your CPU. It is possible that you're having cache misses
on instructions, right: you have an L1 data cache and you have an L1 instruction cache.
So it could very well be that your instruction cache is where you're having your
problems. And again VTune can help validate that if you go through and track
the level 1 instruction cache misses. All right.
So if that is where you're spending your time, and you've also ruled out memory, right,
you've ruled out memory and you see that we're spending our time having instruction
cache misses from our instruction cache, then it's time to bring out some strategies to
solve those instruction problems. And then, again, one way to get to compute bound is to
rule out the other two and then assume that you're compute bound,
because those seem to be the areas that we have problems with.
Okay. Back up the tree to the top again. Let's cover the other side real quick, on the
GPU. So let's say that maybe I just found an optimization -- and I'll talk about some of
those solutions in a bit too. But let's say that we are no longer CPU bound and now we
believe that we're GPU bound. We came back and saw GPU busy was 98 percent, right?
I'd be pretty confident that we're GPU bound at that point.
Well, on Shader Model -- any Shader Model below 4.0, we can break the graphics card
into two pieces. You kind of have the side of your graphics card that deals with triangles
and the side of your graphics card that deals with pixels. So that's a real quick test we
can do up front. We can rule out the pixel problem by changing the resolution. If you
lower the resolution and your frame rate stays exactly the same, then, you know, it's
not -- probably not the right time for you to optimize for pixel performance, let's go move
over to the vertex side and start right there.
If you're in Shader Model 4.0 or higher, you're going to need some help from some tools,
right? PerfHUD will actually show you, for your unified shader architecture, where
you're spending your time: is the load mostly on your vertex side or is the load more on
your pixel side?
So it's a little bit different when you're below Shader Model 4.0 as opposed to Shader
Model 4.0 and up. So let's say that we're below, so I can break it into smaller pieces still.
Let's say that we change the resolution and we see our frame rate increase.
Well, then we are limited by that side, the raster side of our GPU, and we're going to step
backwards through the pipeline to determine these next couple steps.
All right. This comes from NVIDIA's guide to using PerfHUD. I have the reference in
here, from a Eurographics paper where they talk about this more. And you don't see
them talk about it as much anymore; they kind of have some other strategies now with
their frame profiler on how to optimize. I still really -- I still really like this approach.
Maybe it's just because it's the way I've been working.
But I still really like going back to this and then using the frame profiler for supporting
my ideas of what I find. So we're going to start from the back of the graphics pipeline
because if you don't and you change a stage, you may affect the stages after you, thus
invalidating what you're trying to discover.
So we'll start at raster operations. A test you can do for raster operations is to cut your
depth buffer and frame buffer values to 16 bit. By doing this you've dramatically reduced
the bandwidth. Typically, if you're raster operations bound, you're bound by the
bandwidth between the raster operations and the global video memory.
So if you're limited by that stage, our test is to cut the frame buffer and depth buffer in
half, thus cutting the bandwidth in half. If your frame rate stays exactly the same, then
chances are you don't really have a problem with frame buffer bandwidth, right? That's
a pretty logical conclusion. We cut the work in half and nothing changed, so this isn't a
limiting factor in this part of the GPU pipeline.
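One way to run that test in Direct3D 9 is to reset the device with 16-bit back and depth buffers (the formats and Reset call are standard D3D9; the surrounding device setup is assumed):

```cpp
#include <d3d9.h>

void applySixteenBitTest(IDirect3DDevice9* device,
                         D3DPRESENT_PARAMETERS params)
{
    params.BackBufferFormat       = D3DFMT_R5G6B5;  // 16-bit color
    params.EnableAutoDepthStencil = TRUE;
    params.AutoDepthStencilFormat = D3DFMT_D16;     // 16-bit depth
    device->Reset(&params);
    // If the frame rate is unchanged afterwards, frame buffer bandwidth
    // probably wasn't the limiting stage.
}
```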
So you rule it out and we move on to the next stage. We move to textures. For texture
work, setting the mipmap level to the highest LOD will give us a very, very small -- I
believe it's a 4x4 -- texture for all of our textures, and that's going to greatly reduce
texture fetching work, right? And texture filtering work; there's a lot less texture filtering
work. You're going to get very good use of the texture cache if all of your textures are
that small.
Now, when we do this, we've always seen a little frame rate increase, right, very small.
And by little I mean around 5 percent. But I don't know that that's been enough to
conclude that we are texture bound, right? But if we see a massive frame rate increase,
then, guess what, it's a pretty good opportunity for you to increase performance from
textures.
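A sketch of the texture test in Direct3D 9 terms: clamp every sampler to a very coarse mip so all fetches hit tiny textures (D3DSAMP_MAXMIPLEVEL picks the most detailed mip a sampler may use, so a large value forces the smallest mips; the sampler count here is assumed):

```cpp
#include <d3d9.h>

void forceTinyMips(IDirect3DDevice9* device)
{
    for (DWORD stage = 0; stage < 16; ++stage)
        device->SetSamplerState(stage, D3DSAMP_MAXMIPLEVEL, 10);
    // A large frame rate jump here suggests texture fetch and filter work
    // (or texture cache misses) was the bottleneck.
}
```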
On to the pixel test now, since we've already ruled out raster operations and we've
already ruled out textures. We don't test the only other stage here, which is the
rasterizer, because from a peak performance standpoint the vendors tell us we don't have
to; they say they put a lot of hardware there. I've not really seen it be the bottleneck. I
still think it probably could happen, right, but I haven't seen it myself.
So we should just be able to test this by lowering the resolution, which you may have
done earlier to pick between these two directions.
Okay. I'm going to speed up a little bit. From the geometry point of view, you're going
to want to look at your vertex work, right, that's our next stage. If you've ruled out the
whole pixel pipeline and now you're on to geometry, you're going to look at your vertex
work. A simple test is to just use a very simple vertex shader. Take the vertex shader
that you are running, replace it with one that's very simple, and watch the frame rate.
If it's none of these stages -- and there are a couple little details in there that you could
also be bound by, let's say vertex fetching; vertex assembly can also be an issue in there
that you would want to do a little bit of detection on, but I'm running short on time -- so
if you've ruled out all the stages, really the only thing that's left in this whole process that
we haven't talked about is the bus, which is down at the bottom, the graphics bus.
So that seems like the next logical conclusion if you haven't been able to determine what
you are, it's either the bus or you've made a mistake in the process or the process isn't 100
percent. And there can be cases like that. A lot of these are educated guesses and not
necessarily always 100 percent.
So overall here I have this detection process, and you can see -- I won't be able to go
through all of these, but following through, I go through each of these and show the
different solutions that we have, and they correspond to the different levels. So if you do
find that you are memory IO limited, depending on where you are in your lifecycle, it
would be in your best interest to try to solve it from a system perspective first -- maybe
you don't have time.
So at the very least you want to look at it from this perspective, and if you're going to
pass on a system-level solution, pass on it for an application-level solution. And if you
don't have time for that, pass on it for a micro-level optimization. But we can look at it
from these perspectives. Just because the problem may be a micro problem doesn't mean
the solution has to be a micro solution.
So I will flip through these real quick and just show some of the other topics -- if you're
compute bound, some of our different options for solving that -- and leave a little bit
more time. At the system level, look at all this great concurrent hardware that we have;
at the application level there are things that reduce work. Personally, with lookup tables,
you've got to be very careful, because if you're trying to reduce computation and you
create a lookup table that is full of cache misses that cost you maybe hundreds of
instructions, then what have you saved?
[inaudible] is a really good alternative to that, where you have a very small buffer of
memory and you just save off the last computations that you've done. Very similar to
what the post-transform cache does for us on the GPU side. Yes, sir.
>>: [inaudible] improve [inaudible].
>> Eric Preisz: Yes. Yeah. I could go into all of them and maybe when we're done here
I can go through them on a smaller level with you.
Improving branch prediction -- there are really two ways for us to improve branch prediction. One is to make the branch less random, because inside the CPU a branch prediction table is tracking each branch, and the hardware guesses whether or not the branch will be taken. So by making the branch less random, you're giving the CPU a better opportunity to make a good guess and actually predict the correct branch.
The other solution is that there are some assembly and SSE ways to actually get rid of the branch, or to do something closer to predication, where you run both sides like the GPU does and then just store the answer from the side that's correct.
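A sketch of both ideas on invented data: sorting to make a branch predictable, and SSE predication to remove the branch entirely:

```cpp
#include <algorithm>
#include <vector>
#include <emmintrin.h>   // SSE/SSE2 intrinsics

// (1) Make the branch less random: after sorting, the comparison below
// is false for a long run and then true for a long run, so the branch
// prediction table guesses right almost every iteration.
float SumAboveThreshold(std::vector<float>& v, float threshold)
{
    std::sort(v.begin(), v.end());
    float sum = 0.0f;
    for (float x : v)
        if (x > threshold) sum += x;
    return sum;
}

// (2) Predication: compute both sides, then blend by a mask, the way
// the GPU does. out[i] = (x[i] > t) ? x[i] * 2 : x[i] * 0.5.
// Assumes n is a multiple of 4.
void PredicatedScale(const float* x, float* out, int n, float t)
{
    const __m128 vt = _mm_set1_ps(t);
    for (int i = 0; i < n; i += 4) {
        __m128 vx   = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmpgt_ps(vx, vt);  // all-ones lanes where x > t
        __m128 a    = _mm_mul_ps(vx, _mm_set1_ps(2.0f));
        __m128 b    = _mm_mul_ps(vx, _mm_set1_ps(0.5f));
        __m128 r    = _mm_or_ps(_mm_and_ps(mask, a),
                                _mm_andnot_ps(mask, b));
        _mm_storeu_ps(out + i, r);
    }
}
```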
We can go through them all. I just want to make sure I get them all up on the slides here
so everyone can see some of these different solutions.
Now, the instruction processing one -- I'll have to admit that this is an area I'm still looking into a bit more; being instruction bound is probably one of the later things I've been wanting to dig into. On the graphics side, I start from raster operations, right, because we need to start from the back and go to the front. And here are some raster operation solutions: reduce your overdraw, work on better CPU or GPU occlusion culling, and work with the early-Z cull. In certain cases you can initialize the Z buffer first and then go back through and render everything again, and actually see better performance from drawing your scene twice per frame -- under certain conditions: very high pixel work, very low vertex work, and a very low number of draw calls.
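A hedged sketch of that Z-prepass idea in D3D9 terms (the device and the DrawScene helper are the application's own):

```cpp
#include <d3d9.h>

void DrawScene(bool depthOnly);   // app's own scene traversal

void RenderWithZPrepass(IDirect3DDevice9* device)
{
    // Pass 1: depth only. No color writes and minimal pixel work, so
    // this pass is cheap and fills the Z buffer for early-Z culling.
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    device->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_LESSEQUAL);
    DrawScene(/*depthOnly=*/true);

    // Pass 2: full shading. With ZFUNC at EQUAL, only the visible
    // surface at each pixel runs the expensive pixel shader.
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0x0F);  // RGBA on
    device->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_EQUAL);
    DrawScene(/*depthOnly=*/false);
}
```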
Texture -- we've got some more strategies here. Notice that early-Z cull shows up for all of the pixel-side stages, because it's an application-level, higher-level solution. On the pixel side, offload work to the vertex shader if possible, and of course there's shader optimization, which is a whole talk on its own and something I find very exciting. For vertex work: deferred shading and optimizing for the post-T&L cache. Again, these slides will hopefully be available in some form for you to go through, and we can talk in greater detail about some of these other solutions, since I'm going through them pretty quickly for you.
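Index reordering for that cache is hard to show briefly, but here is a small self-contained sketch that measures the effect it targets: it simulates a FIFO post-transform cache over a triangle-list index buffer and reports the average cache miss ratio (ACMR) per triangle. The 16-entry cache size is an assumption -- real sizes vary by GPU -- and lower ACMR is better.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Assumes `indices` describes a triangle list (size is a multiple of 3).
double Acmr(const std::vector<unsigned>& indices, std::size_t cacheSize = 16)
{
    std::deque<unsigned> fifo;       // simulated post-transform cache
    std::size_t misses = 0;
    for (unsigned idx : indices) {
        if (std::find(fifo.begin(), fifo.end(), idx) == fifo.end()) {
            ++misses;                // vertex must be re-transformed
            fifo.push_back(idx);
            if (fifo.size() > cacheSize) fifo.pop_front();
        }
    }
    return static_cast<double>(misses) / (indices.size() / 3);
}
```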
Lastly, I just want to bring up the point, though, that this process does tend to focus on one machine, right? You're doing this optimization on that machine, and it's telling you about the performance of that machine, but it's not necessarily telling you that you've made the game fast on all machines.
So what I'd want is holistic video game optimization, done on enough configurations to say that we've made the game faster on all PCs. Or the other strategy may be that we need to come up with load-balancing solutions, so we can adapt dynamically and determine that we're using each system to its fullest.
This is something that we're working on. We're working on a project we call Coperf [phonetic]. Coperf is something that we're going to try to get a lot of people to run, based on certain types of tests, and we want to test the variability of optimization strategies across many, many different platforms and many, many different PCs. So hopefully, if we do that, maybe we can start getting to the point where we can say, for this demographic of game -- the high-end gamer -- or for that demographic -- a game we want everyone to play -- what the right optimization strategies are, based on that variability.
So you will see that I have a white paper coming soon; I can give you the rough draft now. We're getting industry feedback right now at this stage of the white paper. Intel is working with us a bit on it, and we hope to get a lot of people involved. It's an open source project, so the benchmarks that we're building to help gather this data are open source.
So not only will you be able to get our data on how performance varies across many, many machines, we also want you to come look at our code and keep us honest. These tests are hard to get 100 percent right -- I can't know that they're correct, because it's hard to isolate certain parts of the hardware. So we want you to take a look at it, start picking through it, and improve our benchmarks so that there's something reliable.
Yes.
>>: Have you done any study or even discussion around what happens to a student that
figures out how to optimize their code as a motivator, how this activity could be used to
engage students [inaudible]?
>> Eric Preisz: We've not done anything well organized. I'll have to say, from my experience -- my course comes halfway through the program -- a lot of times they don't find the optimization part the most interesting, and they're not necessarily enthused about learning these processes.
When they get to final project, however, I see those students come back. What I do to try to encourage the students currently in my class is this: when the final-project students come to me with optimization problems, I don't solve them in my office hours. I make them come back to my classroom, and we do it as a case study, usually in the first 20 minutes of class. Depending on where we are in the course, I let my current students try to solve the final-project students' problems based on what they've learned.
I figure that if I can get them to see it's important to the final student projects, maybe it becomes important to them now. I still have plenty of students coming back. So I don't know if that answers your question or not, but, yeah, we haven't really done much of an organized study.
And that's really it. I'll leave this last slide up, which says that as concurrency grows, the responsibility of working at these levels is also growing. Developers are taking the lead right now on system-level optimizations: finding out what kinds of cores we have and how we're using them, how many GPUs, and which particular GPU.
Compilers are also moving into higher levels of optimization -- things like whole-program optimization and other higher-level ways of optimizing at link time -- getting us closer and closer to compilers actually being able to do application-level optimization.
And then there are runtimes reconfiguring themselves in more optimal ways. So maybe that will continue, and maybe the answer here is virtualization. I don't know; I think that might be a ways out, but you may have other information.
And that concludes my talk really. All of my references are listed here so that you can go
through and take a look for yourself. And I guess do we have time for questions then?
Okay. If you have some questions.
>>: [inaudible]
>> Eric Preisz: Yes.
>>: Is there a summary of what you're describing here?
>> Eric Preisz: You know, I don't have a summary in the white paper. But I can pass these slides out to you -- that's probably where you'll want the details. You'll find the details for all the other areas listed in the notes section of the slides, so you can find a lot of it there; I didn't necessarily go into all the details here. And there will be a format for going into greater detail on the mass-public side within the next six months, which I can't really talk about directly -- I'm just not allowed to yet. But, yeah, it will be more available, so...
Any other questions? All right. I appreciate it. Thank you.
[applause]