
CS248: Graphics Performance,
Debugging and Optimisation
Dave Oldcorn
November 13th 2007
Your Guest Instructor
• Back in the mists of time, I wrote games…
• The last ten years have all been about 3D hardware
• Since 2001 at ATI, joining forces with AMD last year
• Optimisation specialist: linking the software and the hardware
  – Tweaking games
  – Understanding the hardware
  – Driver performance
  – Shader code optimisation (I find assembly language fun)
Overview
• Three basic sections
  – GPU Architecture
  – Efficient OpenGL
  – Practical Optimisation And Debugging
• There's a lot in here
  – Broad overview of all issues
  – I've prioritised the biggest issues and the ones most likely to help with Project 3
  – More details with respect to GPU architecture are included as an appendix
GPU Architecture
Graphics hardware architecture
• Parallel computation
• All about pipelines
• The OpenGL vertex pipeline shown right will be familiar…
Graphics hardware architecture
• Extend the top of the pipeline with
some more implementation detail
• Ideally, every stage is working
simultaneously
(Diagram: Application → API → Video drivers running on the CPU, feeding the GPU pipeline)
• Could also decompose to smaller blocks
• And eventually to individual hardware pipeline stages
  – As shown last week, the hardware implementation may be considerably more complex than a linear pipeline
Command Buffers
(Diagram: draw commands → Parser → Vertex Assembly → Vertex Operations → Primitive Assembly)
• Data enters the GPU pipeline via command buffers
containing state and draw commands
• The draw command is a packet of primitives
• A draw occurs in the context of the current state
  – As set by glEnable, glBlendFunc, etc.
  – The full set of state is often referred to as a state vector
  – The driver translates API state into hardware state
  – State changes may be pipelined; different parts of the GPU pipeline may be operating with different state vectors (even down to per-vertex data such as glColor)
Pipeline performance
• The performance of a pipelined system is measured by
throughput and latency
  – Can subdivide at any level, from the full pipeline down to individual stages
• Throughput: the rate at which items enter and exit
• Latency: the time taken from entrance to exit
  – Latency is not typically a major issue for API users
  – It is a huge issue for GPU designers
  – Even GPU-local memory reads may be hundreds of cycles
  – A substantial percentage of both design effort and silicon is devoted to latency compensation
  – The system will generally run at full throughput until the latency compensation is exceeded
Pipeline throughput
• Given a particular state vector, each part of the pipeline has
its own throughput
• The throughput of a system can be no higher than the
slowest part: this is a bottleneck
  – More generally, any stage whose input is ready but whose output is not is a bottleneck
Pipeline bottlenecks
• Consider the system shown right
  – Stage 1 can run at 1 per clock and is 100% utilised
  – Stage 2 can only accept on every other clock; it is still 100% utilised
  – Stage 3 is therefore starved on half of the cycles it could be working; 50% utilised
  – Although stage 3 has the longest latency, it has no effect on the throughput of the system
(Diagram: items enter at 1 per clock → Stage 1: throughput 1/clock, latency 5 cycles → Stage 2: throughput 1 per 2 clocks, latency 10 cycles, so half throughput and a result only every alternate clock → Stage 3: throughput 1/clock, latency 15 cycles, still only alternate-clock results despite its per-clock throughput)
Pipeline bottlenecks
• A key subtlety: for this to work as shown, there must be load balancing between stages 1 and 2 (probably a FIFO)
• Once the FIFO is full, the input buffer will exert backpressure on stage 1
  – This happens after equilibrium is reached
• This pipeline therefore runs at the speed of the slowest part as soon as the FIFO fills
(Diagram: as before, but with an input buffer (FIFO) between Stage 1 and Stage 2; items enter at 1 per clock, pass into the buffer at 1 per clock and eventually queue there, while Stages 2 and 3 still produce a result only every alternate clock)
Variable throughput
• In general, throughput is data dependent
  – Example: clipping is a complex operation which often isn't required
  – Example: texture fetch cost depends on the filtering chosen, which is data dependent
• Some pipeline stages require different rates at the input and the output
  – Example: back-face culling; primitive in, no primitive out
  – Example: rasterisation of primitives to fragments; few primitives in, many fragments out
• Buffering between stages takes up the slack
Pipeline bottlenecks
• A particular state vector will tend to have a characteristic set of bottlenecks
  – The input data also has an effect
• Small changes to the state vector can make substantial changes to the bottleneck
• As a state change filters through the pipeline, and for a short period afterwards, bottlenecks shift into the new equilibrium
  – For usual loads, where the render time is much larger than the pipeline depth, this transition time can be ignored
• It can be hard to determine bottlenecks if the states in the pipe are disparate
  – A smearing effect
Pipeline bottlenecks
• There may be multiple bottlenecks if the throughput is not constant at all parts of the pipeline
  – In general it is not constant
• GPU buffering absorbs changes in load
  – Measured in tens or hundreds of cycles at best
  – The whole pipeline is thousands of cycles
• The bottleneck could be outside the GPU
  – Application, driver, memory management…
• Bottleneck analysis is key to hardware performance
  – Not easy: bottlenecks are always present
  – Separating expected and unexpected cases is the challenge
Flushes and synchronisation
• Some state cannot be pipelined; a flush occurs
  – There are various localities of flush
  – For a whole-pipeline flush, the parser waits before allowing new data into the pipe
  – The CPU can carry on building and queuing command buffers
  – Low cost: ~thousands of cycles (~5 µs?)
• Some operations can require the CPU to wait for the GPU
  – Example: the CPU wants to read memory the GPU is writing
  – This is a serialising event
  – Very expensive: wait for pipeline completion, flush all caches, plus the restart time taken to build the next command buffer
  – You can force this with glFinish: please don't!
Asynchronous system
• The process of rendering a typical game image is massively asynchronous
  – The actors below can all run asynchronously; on the timeline, the shaded areas are the same frame
• Input / physics thread: runs continuously, using input and time (fixed or delta) to update the game world – typically including the scene graph. Typical runtime 10-30 ms
• Render thread: runs continuously to convert the scene graph to rendering commands. Generally cannot start until the input/physics thread has processed the whole frame
• GPU renderer: runs on its command buffers
• DAC: loops over the display at 60-100 Hz. A command buffer operation changes the display at the end of a render; it is picked up at the start of the next DAC frame (unless vsync is off)
(Timeline diagram: Input, Render, GPU and DAC rows running in parallel, offset from one another)
Synchronisation
• GPUs aim to run just under two frames ahead
  – Block at SwapBuffers if there is another SwapBuffers in the pipe that has not yet been reached
• Reading any GPU memory on the CPU causes a sync
  – glReadPixels is one example; avoid it
• Writing to GPU memory generally does not
  – The GPU, driver and memory manager work together to do uploads without serialisation
  – No need to be unusually scared of glTexImage
• If you have to lock GPU memory, look for discard or write-only flags that will allow asynchronous access (see the sketch below)
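In OpenGL terms, the "discard" idiom for buffer objects is to re-specify the data store before writing it, so the driver can hand back fresh memory instead of stalling until the GPU has finished with the old contents. A minimal sketch under that assumption; vbo, SIZE and vertices are hypothetical names:

#include <string.h>   /* memcpy */

glBindBuffer(GL_ARRAY_BUFFER, vbo);
/* "Orphan" the old storage: the GPU keeps reading its copy while the
   driver gives us a fresh block to fill - no CPU-GPU serialisation. */
glBufferData(GL_ARRAY_BUFFER, SIZE, NULL, GL_STREAM_DRAW);
void *dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);   /* write-only mapping */
memcpy(dst, vertices, SIZE);
glUnmapBuffer(GL_ARRAY_BUFFER);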
Shaders
• Texture lookup operations are relatively expensive
  – Competition on the GPU or system bus, cost of filtering, unpredictability
  – Some of this is only a latency issue – but latency is not important…
    – … until the buffering is exceeded
    – Latency more than doubles for dependent texture operations
  – Prefer ALU math to texture until the function is complex
  – Might replace very small textures with shader constants (see the sketch below)
• The shader – typically its texture operations – is likely to be the limiting factor on performance
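For example, a small lookup texture that only encodes a simple function can often be replaced with a couple of ALU instructions. A hedged GLSL sketch; the lookup texture and its contents are hypothetical, not taken from any particular engine:

// Before (hypothetical): a 1D texture encoding a specular power curve
// float s = texture1D(specularLUT, nDotH).r;

// After: pure ALU, no fetch and no dependent read
float s = pow(max(nDotH, 0.0), 16.0);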
Shaders
• Each shader is run at a particular frequency
  – Per-vertex and per-fragment now; per-primitive also exists; per-sample seems likely in the future
  – Constants calculated on the CPU can be viewed as another frequency (per draw packet)
  – Aim to do calculations at the lowest necessary frequency
• Issues to be aware of:
  – Data passed from the vertex to the fragment shader is interpolated linearly in the space of the primitive (i.e. with perspective correction), so interpolators can only be used where this is appropriate (linear or nearly so); high tessellation can be a workaround
  – Excessive use of interpolators can itself be a bottleneck; up to two interpolators per texture fetch is a ballpark figure
Shader constants
• Shader constants are a large part of the state vector
  – Updating hundreds on each draw call will not be free
• Prefer inline constants (known at compile time) to state-vector constants
  – Gives the compiler and constant manager more information
• For the same reason, avoid parameterising for its own sake
• Don't switch shader just to change a couple of constants
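A minimal GLSL illustration of the distinction; the names are hypothetical:

// Inline constant: the compiler can fold it into the code and schedule around it
const float kAmbient = 0.15;

// State-vector constant: flexible, but managed and potentially re-uploaded per draw
uniform float ambient;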
Efficient OpenGL
Efficient OpenGL
• This is a data processing issue
• What data does the GPU need to render a scene?
  – State data, texture data, vertices / primitives
• CPU-side performance can easily be dominated by inefficient management of this data
• Of them all, vertex data is the most problematic

Type of data   Volume (per frame)   Rate of change
State          Low (~kB)            Very high
Vertex         Med-high (~MB)       Low-med
Texture        Very high (~GB)      Very low
Efficient vertex data
• Application needs to feed mesh data in somehow
• GL provides two basic methods
  – glBegin/glEnd (known as 'immediate mode')
  – Vertex arrays
• Immediate mode is easy to use but has high overheads
  – Many tiny, unaligned copies
  – Non-'v' forms imply extra copies
  – The command stream is unpredictable and irregular
glBegin(GL_TRIANGLE_FAN);
glColor4f(1, 1, 1, 1);   // + colour only
glVertex3f(0, 0, 0);     // position
glVertex3f(0, 1, 0);     // position
glColor4f(1, 0, 0, 1);   // + colour only
glVertex3f(1, 1, 0);     // position
glVertex3f(1, 0, 0);     // position
glEnd();
Vertex arrays
• Vertex arrays are an alternative
  – The application probably has its data in arrays somewhere, so let GL read them en masse
  – glVertexPointer, glColorPointer, etc. specify the arrays
  – glDrawElements issues a draw command; it takes an index list
  – Primitives are drawn using the indices into the arrays set up by the gl*Pointer commands

glVertexPointer(3, GL_FLOAT, 16, vertex_array);             // 3 floats per position, 16-byte stride
glColorPointer(4, GL_UNSIGNED_BYTE, 0, color_array);        // tightly packed colours
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices); // 12 indices = 4 triangles

• Easier for the driver and GPU to handle
  – The state vector is instantiated at the glDrawElements command
  – The GPU can process all the primitives in a single draw packet
Vertex arrays
• Did you hear a but?
• The vertex data still belongs to the application
  – Until the glDrawElements call is entered, the GPU knows nothing of the data
  – After the call completes the app can change the data
  – Therefore, the driver must copy the data on every glDrawElements call
  – Even if the data never changes – the GL can't know that
• Wouldn't it be great if we could avoid the copy?
  – We don't supply textures on every call; we just upload them to the GPU and let the driver manage them…
Buffer Objects
• This facility is provided by the Vertex Buffer Objects (VBO) extension
  – Allows the creation of buffer objects in GPU memory with access mediated by the driver
  – Data can be uploaded at any time with glBufferData
    – As with glTexImage, done through the command buffer to avoid serialisation
  – BindBuffer <-> BindTexture, BufferData <-> TexImage

// During program initialisation
glGenBuffers(1, &vbo);                        // create a buffer object name
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, 16 * 4 * sizeof(GLfloat), vertex_array, GL_STATIC_DRAW);
...
// In render loop
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexPointer(3, GL_FLOAT, 16, (void *)0);  // now a byte offset into the buffer, not a pointer
glEnableClientState(GL_VERTEX_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);
Index data
• DrawElements only needs to send the indices
• Actually we can optimise that away too: element arrays allow buffer objects to contain index data (see the sketch after this list)
  – Index data is far smaller in volume, and tends to come in larger batches if state changes are minimised, so this can be over-optimisation
• Keep batches as large as possible
  – Keep state changes to a minimum
  – Primarily use triangle lists
  – Don't mess with locality of reference
  – Strips can be marginally more efficient
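A minimal sketch of putting the indices in a buffer object too (GL_ELEMENT_ARRAY_BUFFER); ibo, indices and NUM_INDICES are hypothetical names:

// At initialisation
glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, NUM_INDICES * sizeof(GLuint), indices, GL_STATIC_DRAW);

// At draw time the final argument becomes an offset into the bound index buffer
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glDrawElements(GL_TRIANGLES, NUM_INDICES, GL_UNSIGNED_INT, (void *)0);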
Display Lists
• These offer the driver the opportunity for unlimited optimisation
• It's hard for the driver to exploit
  – The list can contain literally any GL command
• Not recommended for games or other consumer apps
• Professional GL apps do make heavy use of display lists (and immediate mode)
  – The effort required to optimise these efficiently is one reason professional GL cards are more expensive
Visibility optimisations
• It’s far more efficient not to render something at all
• Try to avoid sending primitives that can't be seen
  – Not in the view frustum
  – Obscured
• Or send them, but have them rejected at some early point in the pipeline
  – Cull primitives before rasterisation
  – Reject fragments before shading
Bounds
• Use bounding boxes or spheres to reject objects wholly outside the view frustum (a sketch follows below)
• Optimal methods for using these were covered in lecture 11
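As a minimal illustration (not the specific method from lecture 11), a bounding sphere can be tested against the six frustum planes. The Plane/Sphere types and the plane-extraction step are assumed to exist elsewhere in the application:

/* Assumed: planes stored as (a, b, c, d) with unit-length normals pointing inwards. */
typedef struct { float a, b, c, d; } Plane;
typedef struct { float x, y, z, radius; } Sphere;

/* Returns 0 if the sphere is wholly outside any frustum plane (safe to cull). */
int sphere_in_frustum(const Plane frustum[6], const Sphere *s)
{
    int i;
    for (i = 0; i < 6; ++i) {
        float dist = frustum[i].a * s->x + frustum[i].b * s->y +
                     frustum[i].c * s->z + frustum[i].d;
        if (dist < -s->radius)
            return 0;   /* completely on the wrong side of this plane */
    }
    return 1;           /* inside or intersecting: send it to the GPU */
}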
Occlusion culling
• PVS (Potentially Visible Set) culling
  – For each location in the set of locations, store which other locations might be visible
  – Precalculated before the render process starts
• If you are standing anywhere in A, you absolutely cannot see C, and vice versa
  – View frustum checks cannot solve this part of the problem; consider the position of the observer shown
  – A frustum test is still useful; if the observer was standing in B looking the same way, bounds could cull C
• Very effective on room-based games; not so useful on outdoor games
  – Fewer large-scale occluders
(Diagram: three connected rooms A, B and C, with the observer standing in A)
Other visibility methods
• Portals – as discussed in lecture 11
• BSP (Binary Space Partition) trees
  – A complex but efficient way to store large static worlds for fast frustum visibility calculations
  – Combine with PVS and portals; all need a precalculation phase
• Abrash, Graphics Programming Black Book, ch. 59-64 and 70
  – Detailed information on these and other research he and John Carmack did on visibility while developing Quake
  – Still in use today in modern FPS games (with many enhancements!)
Model LOD
• If you need to render something, render less of it
• Demonstrated two weeks ago:
  – A model close to the camera requires many triangles
  – Carry reduced-detail models and select one on each render (see the sketch below)
    – As with mipmapping, the memory cost is not prohibitive
    – Target sizes near the GPU's high-efficiency ~100-pixel region
• Visualise with wireframe
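A minimal sketch of distance-based LOD selection, assuming the application keeps an array of progressively simpler meshes per object; the names and distance thresholds are illustrative only:

/* Hypothetical: LOD 0 is the full-detail model, higher indices are coarser. */
#define NUM_LODS 3
static const float lod_far_limit[NUM_LODS] = { 20.0f, 60.0f, 1e30f };

int select_lod(float distance_to_camera)
{
    int i;
    for (i = 0; i < NUM_LODS - 1; ++i)
        if (distance_to_camera < lod_far_limit[i])
            return i;
    return NUM_LODS - 1;   /* farthest band: coarsest model */
}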
Model LOD
• Non-trivial to implement
  – Popping is a well-known issue; morphing or blending are common solutions
  – Reduced-detail models must be generated
  – Vertex data can be reused; just change the indices
• Terrain offers particular challenges
  – LOD systems are essential for really large worlds
  – Terrain tiles must match between different LODs
• Can also solve sampling issues
  – As with undersampled textures, semi-random triangles can be picked; this occurs if triangles are smaller than 1 pixel
GPU primitive culling
• Degenerate primitives (example: triangles with two indices the same) will be culled at index fetch
• A primitive with all vertices outside the same clip plane will be culled
• Back-face culling is a simple optimisation and should be used for all closed opaque models (see the snippet below)
• Zero-area triangles will be culled before rasterisation
  – This is rarely usefully exploitable
• Scissor rectangles cull large parts of primitives during rasterisation
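Enabling back-face culling for closed opaque models is a one-liner in GL. A minimal sketch, assuming the models are wound counter-clockwise:

glEnable(GL_CULL_FACE);   // reject primitives that face away from the viewer
glFrontFace(GL_CCW);      // assumption: front faces are counter-clockwise
glCullFace(GL_BACK);      // cull back faces (the default)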
GPU Z rejection
• The Z test can occur before shading
  – Reduces colour read/write load as well
• Some states inhibit the early Z test
  – Writing Z in the shader, obviously
  – Gating the Z update in the shader (pixel kill / alpha test with Z write)
    – Alpha test sounds like an optimisation, but it only saves colour read/write; use it for visual effect, not performance
    – Shader kill acts as a shader conditional
• The Z unit can reject at hundreds of pixels per clock
  – The accept rate is lower (at the very least Z has to be written) but as fast or faster than any other post-rasteriser operation
• Stencil usually rejects at Z rates
  – Having a stencil op that does something implies a stencil write
Early Z rejection
• Draw opaque geometry in roughly front-to-back order
  – Do not work too hard to make this perfect; that's what the Z buffer was created for in the first place
  – Do not draw the sky first. Please!
  – This assumes you're bottlenecked in the shader
• Consider a Z pass (see the sketch below)
  – If the fragment shaders are very expensive
  – If at any point while rendering the colour buffer you need some algorithm that requires the Z buffer
  – Disable colour writes (glColorMask) or fill the colour buffer with something cheap but useful (example: ambient lighting)
  – Invariance issues should be rare nowadays (but be aware)
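A minimal sketch of a depth-only pre-pass under those assumptions; draw_opaque_geometry() stands in for the application's own draw path:

/* Pass 1: lay down depth only. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
draw_opaque_geometry();          /* hypothetical application function */

/* Pass 2: full shading; early Z now rejects hidden fragments before the shader runs. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthFunc(GL_LEQUAL);          /* or GL_EQUAL - watch for invariance if the shaders differ */
glDepthMask(GL_FALSE);           /* depth is already correct, no need to write it again */
draw_opaque_geometry();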
Shader conditionals
• Conditionals can also reduce shader load
• Treat them with care…
• Use them mostly for high-coherency data
  – The conditional is unlikely to have per-pixel granularity
  – An if-then-else clause may have to execute both branches
• For low-coherency data, prefer conditional-move type operations (see the sketch below)
• Typically the shader compiler and optimiser can't know much about the likely coherency
  – So it guesses
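A small GLSL illustration of the two styles inside a fragment shader; the variable names are hypothetical:

// Branching form: fine when whole regions of pixels take the same path
vec4 color;
if (shadowFactor > 0.5)
    color = litColor;
else
    color = shadowColor;

// Conditional-move form of the same select - no divergent flow
vec4 color2 = mix(shadowColor, litColor, step(0.5, shadowFactor));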
Triangle sizes
• Larger triangles are more efficient than small ones
• Rules of thumb:
  – Over 1000 pixels is large
  – 100-pixel triangles are considered typical, and the GPU should be in the ballpark of its peak performance
  – Under 25 pixels is small
  – Tiny triangles are likely to cause granularity losses in the GPU
• Often the type of object and the size of its triangles are related
  – Example: world triangles tend to be larger than entities
Bump mapping
• Can trade off geometric complexity for more expensive fragment shading
  – Textures in general offer this capability
    – Light maps are an earlier example
• Having a normal map available in the fragment shader is useful for other reasons too
  – Per-pixel lighting is an obvious use
• Doom3 was an early pioneer:
  – Polygon counts are low compared with other games of the time
  – Bump mapping makes this hard to see except on silhouette edges
Practical Optimisation and Debugging
Optimising Applications
• Always profile; never assume
• Target optimisations
  – Better to get a small gain in something that takes half the time than a big gain in something that takes a couple of percent
  – Better to do easy things than hard things
  – "Low-hanging fruit"
Instrumentation for debugging
• Logging
• Visualisation: make particular rendering (more) visible
• Simple interfaces into the high-level parts of the program to make low-level testing easier
  – 'God mode'
  – Skip to level N or a subpart of the level
    – Saved games may seem to be an answer here, but minor changes during development usually break them
  – Metadata display
• Multiple monitors and remote debugging
  – Key for fullscreen applications
  – Useful to have a 'stable' dev machine and a separate debug target
Instrumentation for performance
• Feedback on what the performance actually is (see the sketch below)
  – A simple onscreen frames-per-second (FPS) and/or time-per-frame counter
  – Special benchmarking modes
• Modify the performance
  – Skip particular rendering passes
  – Add known extra load
    – Examples: new entities, particle system load, force postprocessing effects on
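A minimal sketch of a per-frame timer behind such a counter; this prints to the console, but the same numbers could feed an onscreen overlay. now_seconds() uses POSIX gettimeofday and is an assumption, not part of any particular framework:

#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Call once per frame, e.g. just after the buffer swap. */
void frame_stats(void)
{
    static double last = 0.0;
    double t = now_seconds();
    if (last != 0.0) {
        double ms = (t - last) * 1000.0;
        printf("%.2f ms/frame (%.1f FPS)\n", ms, 1000.0 / ms);
    }
    last = t;
}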
Real-world example: Doom3 engine
  – Heavily instrumented via the developer console
    – Accessed with ctrl-alt plus the console key
  – Most commands are prefixed according to their functional unit
    – r_ commands go to the renderer, s_ the sound system, sv_ the server, g_ the client (game), etc.
  – Record demos; play them back with playdemo or timedemo
  – Capture individual frames with demoshot for debugging or performance work
  – Console commands can also be sent from the command line – essential for external tools
  – Many debugging commands
    – noclip to fly anywhere on a level
    – r_showshadows 1 displays the shadow volumes
    – g_showPVS 1 shows the PVS regions at work
More Doom3 convenience features
  – PAK files are just ZIP files
  – You can look at the ARB_fragment_program shaders Doom3 uses (glprogs/ directory in the first pakfile)
  – You can also modify them: real files (e.g. under the base/glprogs directory) override the pakfiles
• Human-readable configuration files
• TAB completion on the console
  – Long commands are not a problem – plus you can find the command you want!
• Key bindings
Doom3 render: multipass process
1. Z pass: set the Z buffer for the frame
2. Lighting passes: for each light in the scene
   2A. Shadow pass: render shadow volumes into the stencil buffer
   2B. Interaction pass: accumulate the contribution from this light to the framebuffer
       – Cheap Phong algorithm (per-pixel lighting with interpolated E; Prey calculates E per pixel for better specular)
       – A vertex/fragment shader pair
3. Effects rendering: mostly blended geometry for explosions, smoke, decals, etc.
4. One or more postprocessing phases for refraction and other screen-space effects
Doom3 benchmarking tools
• Each render pass can be disabled from the console
  – r_skipinteractions, r_shadows, r_skippostprocess
  – Benchmark each pass individually
  – Worth considering render time rather than just FPS; it is a linear quantity
Rendered         FPS     Frame time (ms)   Isolated pass   Pass time (ms)   Pass load
Everything       55.8    17.9
- postproc       58.5    17.1              Postproc        0.8              4%
- interactions   104.7   9.6               Interaction     7.5              42%
- shadows        174.5   5.7               Shadows         3.9              22%
                                           The rest        5.7              32%
Case study: Doom3 interaction shader
• The shader has 7 texture lookups
  – Texture limited on most GPUs
  – One of them was a simple function texture
    – Probably originally a point of customisation, but unused
  – We tested the gain by eliminating the lookup
    – Replaced with a constant – note, not 0 or 1, which might allow the optimiser to eliminate other code
    – Provided the expected ~15% gain for the pass
  – Then replaced it with a couple of scalar ALU instructions
    – The gain was still the same, as the scalar ALU scheduled into gaps in the existing shader
• Quake4 and later games all picked up the change
Instrumenting applications
• Be wary of profiling API calls
  – Asynchronous system; SwapBuffers is probably the only point of synchronisation
  – You can't easily measure hardware performance at a finer granularity than a frame
  – Don't try to profile the cost of rendering a mesh by timing DrawElements; that only measures the time taken to validate state and fill the command buffer
    – Which isn't to say that's never useful information
Instrumenting applications
• Don't over-profile
  – QueryPerformanceCounter has a cost
  – Even RDTSC does
• Try to look at the high level and in broad terms first
  – 30% physics, 20% walking the scene graph, 30% in the driver, 20% waiting for end of frame
  – Rather than 15.26% inside DrawElements
• Aim to be GPU limited, then optimise the GPU workload
  – Don't waste time optimising CPU code if it's waiting for the GPU
  – Iterate as the GPU workload becomes more optimal
• Try to avoid compromising readability for performance
  – Rarely necessary
  – Download the Quake 3 source to see how clear really fast code can be
  – The games industry is really, incredibly, bad at this
Benchmark modes
• Timed runs on repeatable scenes
  – Two options (a sketch of the first follows below)
    – Fix the number and exact content of frames and time the run (could be one frame repeated N times)
    – Fix the run time, render frames as fast as possible, count the frames
  – The former is more repeatable; often essential if tools require multiple runs to accumulate data
  – The latter is more convenient for benchmarkers and more realistic to how games behave in the real world
  – Cynical reason for benchmarks: such applications get more attention from the press (and hence from driver developers)
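A minimal sketch of the fixed-frame-count option, reusing now_seconds() from the earlier timer sketch; render_recorded_frame() and swap_buffers() are placeholders for the application's own replay and platform swap calls:

#define BENCH_FRAMES 500

double benchmark_fixed_frames(void)
{
    int i;
    double start = now_seconds();
    for (i = 0; i < BENCH_FRAMES; ++i) {
        render_recorded_frame(i);   /* hypothetical: replays a captured frame */
        swap_buffers();             /* hypothetical platform swap call */
    }
    glFinish();   /* acceptable here: the GPU must finish before we stop the clock */
    return now_seconds() - start;   /* total seconds for BENCH_FRAMES frames */
}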
CodeAnalyst
• CodeAnalyst is an AMD tool that allows non-intrusive profiling of the application's CPU usage
  – A profiling session spawns the application under test
    – Make sure to avoid profiling startup and shutdown time
  – Can drill down to individual source lines in your code and show you the cost
  – Many examples on AMD's web site use this
  – Useful for all CPU-limited applications
CodeAnalyst hints
  – A spike inside driver components may not be driver overhead
    – The driver is probably waiting on the GPU to meet the SwapBuffers limit
    – If there's not a large spike in the driver, it's probably the application that's the limit
  – This is complicated by the fact that the driver may choose to block if the GPU is not yet ready, so time may move from the driver to being reported as the 'system idle process', PID 0, or similar
  – Vary the resolution and check how the traces change
    – If the relative time in the driver or system idle doesn't change, the application is not pixel limited
  – Multicore systems make interpreting the results harder
    – You might be best off switching a core off if you can
GPUPerfStudio
• Lets you look inside the GPU
• Hardware performance counters
  – '3D busy' is the most obvious and often the most important
  – Vertex / pixel load can also be seen
• The bad news: GL support is not in the currently downloadable version 1.1. Coming soon…
Shader Development
• AMD GPUShaderAnalyzer
  – Available to download from the AMD web site
  – Handles all GL shader types (GLSL, ARB_fp, ARB_vp)
  – Good development environment; no need to run your app to compile
  – Shows output code and statistics, including estimated cycle counts, for all AMD GPUs
Scalability
• Look to create consistent performance
  – Better to run at 30 fps continuously than to oscillate wildly between 15 fps and 100 fps
  – Target worst-case scenes
  – You will need headroom to guarantee 60 fps
• Is a particular gain useful?
  – A 4% speedup won't help anyone play your game
  – Five 4% speedups would, though
  – Gains in a lesser component allow more use of that component
Scalability
• PC environment is a huge scalability challenge
  – The matrix of CPUs, GPUs and render resolutions is huge
  – Performance is in tension with image quality
  – Adjust quality to scale for GPU power and set higher loads
    – When CPU limited, more pixels probably have no cost
  – Adjust quality in profiling
    – Resolution (or clock) scaling to test whether you are CPU or GPU limited
• Consoles have it easier: more fixed in every way
  – Still need headroom, just less of it
  – Now have resolution scaling issues – five TV resolutions in NTSC 480i, PAL 576i, 720p, 1080i/p
  – 60 Hz / 50 Hz is a headache here
Caveats on optimisation
• Windowed mode
  – GPUs can behave differently in windowed mode from fullscreen mode
  – Windowed should still be your primary development mode unless you have remote debugging
• Front-buffer rendering
  – May be useful for debugging, but can have similar performance implications
• Avoid misusing benchmarks
  – Repeat runs – make sure everything's 'warm'
Guidelines for Project 3
• Concentrate on the scene graph first, the GPU second, CPU cycle-picking last
  – Look for algorithms that cull monsters, trees and rooms rather than triangles or pixels
• Work on model or texture data on the GPU, not the CPU
  – Primarily, use the shader to do the work
  – Anywhere index data, primitive count and connectivity don't change is a candidate
  – If you have to generate a texture, consider using the GPU
Guidelines for Project 3
• Short of time to write shaders?
  – Write a few shaders that you use a lot
• Don't try to do everything in this lecture
  – Many techniques won't apply to your specific case
  – Even those that do often won't matter
  – Profile-guided optimisation!
Headline performance items
  – Scene graph optimisations: visibility culling, model LOD
  – Don't touch model data on the CPU unless the algorithm absolutely requires it
  – Use vertex arrays for complex mesh data (> 10 primitives); store static data in VBOs
  – Use mipmaps for all static textures; avoid undersampling textures without mipmaps
  – Render roughly front to back; don't kill yourself trying, but give it a go for the largest geometry; draw the sky last!
  – Use compressed textures by default; only disable them if artifacts appear
  – Disable unnecessary alpha testing; don't do kills in shaders unless you have to
  – Move work from fragment to vertex shaders where possible
  – Prefer moderate math to texture lookups, particularly if the lookups increase the dependent-fetch depth
Further reading
• Abrash, Mike: The Graphics Programming Black Book
  – Even in 1997 the asm and register-programming sections were dated
  – Much of the Quake documentation isn't
    – Clear explanation of BSP, PVS and some material on portals
  – The rest is still worth reading for the mindset
    – Skip the asm-specific bits; concentrate on the thought process
  – Chapter 1 and chapter 70 are required reading
• Stencil shadows: the Wikipedia page has many links
Samples and Tools
  – http://ati.amd.com/developer/
    – GPUPerfStudio, GPUShaderAnalyzer and the Compressonator
    – Tootle is also interesting; it optimises meshes both for the vertex cache and for 'internal' front-to-backness
    – Many other samples, documents and tools
  – http://www.amd.com/codeanalyst
Questions
• If we have time…
Appendix
Background information on more aspects of the GPU
a.k.a. “The slides I knew I didn’t have time to go through”
Texture and rendertarget tiling
• Memory interface efficiency is mostly determined by burst sizes
  – The more useful memory fetched in one go, the better
  – Avoid fetching anything that isn't then used
    – This is why mipmapping is so important: minifying a texture without mipmaps implies fetching memory that isn't then used
• Rearranging memory into tiles increases locality of reference
  – 64 bytes might contain 4x4 pixels instead of 16x1 pixels
  – The format is transparent to the application
Texture Compression
• GL_ARB_texture_compression
• The S3TC / DXTC / BC algorithm is a high-quality method for typical image textures
  – Designed so that the artifacts introduced by lossy compression tend to be smoothed out by texture filtering
  – Function textures and unusual-use textures may not reach acceptable quality
  – Rearranging components can help
  – Use high-quality compressors – The Compressonator
• Compression isn't just about memory bandwidth (see the sketch below)
  – Reduces effective latency (one fetch brings in more useful texels)
  – Effectively increases texture cache size
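A minimal sketch of requesting compression through the generic ARB_texture_compression internal formats, and checking what the driver actually did; width, height and pixels are assumed to exist:

/* Ask the GL to compress on upload (S3TC-specific enums also exist). */
glTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA,
             width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

/* Check whether the driver really stored it compressed, and in which format. */
GLint is_compressed = 0, internal_fmt = 0;
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_COMPRESSED, &is_compressed);
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_INTERNAL_FORMAT, &internal_fmt);

For production assets the slide's advice stands: compress offline with a high-quality tool such as The Compressonator rather than relying on the driver's on-the-fly compressor.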
Texture Filtering
• A bilinear-filtered sample is the common basic unit of work for a texture unit
  – Point sampling is unlikely to be any faster than bilinear; you can make this work for you in image-processing shaders (rather than point sampling and doing a constant-weighted sum yourself)
  – Each additional bilinear sample for trilinear or anisotropic filtering probably consumes additional time
• Smart algorithms ensure that only needed samples are taken
  – No need for trilinear if magnifying
  – No need for anisotropy if square-on
  – Example: walls tend to have less anisotropy than floors
• Gradient calculations may be dynamic
  – Necessary to handle dependent texture reads
  – Be wary with dependency; the gradient can be unpredictable
Render to texture
• Useful for generating extra views or for postprocessing
  – Example: a mirror in a driving game
  – Example: postprocessing for refraction
• glCopyTexImage copies the framebuffer to a texture (see the sketch below)
  – CPU-GPU serialisation is not implied; this can probably be queued into a command buffer
• Other methods exist, such as pbuffers and framebuffer extensions
  – Can be slightly more efficient
  – Can return to a rendertarget after rendering on another
  – More complex; don't use them without good reason
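A minimal sketch of the glCopyTexImage path described above; tex, width and height are assumed, and the texture is assumed to have been allocated once with glTexImage2D beforehand:

/* After rendering the extra view into the framebuffer: */
glBindTexture(GL_TEXTURE_2D, tex);
/* Copy the framebuffer into the texture. This is queued like any other
   GPU command, so no CPU-GPU serialisation is implied. */
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);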
Multisample antialiasing
• The key gain is to run the fragment shader at pixel frequency rather than sample frequency
• Also saves memory bandwidth; Z and colour can be compressed
  – The buffer may need to be resolved to an uncompressed buffer for display or if used as a texture
  – Triangle size may be worth extra consideration with MSAA; the framebuffer and Z compression rate is likely to be in roughly inverse proportion to the number of visible edges in the scene
Caching
• Many caches inside the GPU
• Different from what you might be familiar with on a CPU
  – More about memory bursts and latency compensation than reuse
  – In general you do need to hit the memory
    – Example: texture mapping the whole framebuffer at 1:1; every pixel and texel will be touched exactly once
  – Therefore, be pessimistic: assume this
  – GPUs choose to compensate for memory latency with large buffers
    – Rather than using the cache to dodge the accesses
• In a few places short-term 'reuse' is critical
  – Bilinear filtering is the most obvious case
Caching
• There can still be advantages to avoiding cycling
  – This used to be a big thing, particularly in the days of visible caching (software controlled rather than automatic)
  – It drove the sorting policy of a hard sort by material
  – Nowadays it is far less important, hence the rough sort by depth
  – In some pathological circumstances a sort by shader and depth (or a Z pass followed by a sort by shader) might be more efficient
Disclaimer & Attribution
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2007 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.