Graphics Performance: Balancing the Rendering Pipeline Cem Cebenoyan and Matthias Wloka

Introduction
At a minimum, a PC is a 2-processor system
CPU
GPU
Maximum efficiency IFF
All processors are busy
All the time
[Diagram: GPU and CPU connected by the AGP bus]
NVIDIA PROPRIETARY AND CONFIDENTIAL
Actually, It’s Worse
[Diagram: on the CPU, Application and API; a large cache between CPU and GPU; on the GPU, Vertex Processing, Triangle Setup, Fragment Shading, and Framebuffer Access]
Multi-Processor System
Conceptually, 5 processors
CPU
Vertex-processor(s)
Setup processor(s)
Fragment processor(s)
Blending processor(s)
All connected via some form of cache
To smooth data flow
To keep things humming
MP Systems Become Inefficient If…
One or more processors sync to each other
For example, frame-buffer lock
Ensures that all caches drain
Ensures that all processors idle (CPU and GPU!)
Overhead in restarting the processors
A single processor bottlenecks all others
Overview
CPU
AGP Bus
Vertex Processing
Triangle Setup
Rasterization
Memory bandwidth
Writing to and blending with video memory
Overview: For Each Stage
What are its characteristics?
How does it behave?
How to measure whether it is the bottleneck
How to influence it
CPU Characteristics
Stay within on-chip cache for maximum performance
Use CPU for
Collision detection
Physics
AI
Etc.
CPU Characteristics (cont.)
Note that graphics is capable of
20+ MTri/s (2 year old high-end)
20+ MTri/s (integrated graphics)
100+ MTri/s (current high-end)
CPU also responsible for pushing data to GPU
Cannot look at every triangle
Don’t limit graphics with CPU processing
CPU Measurement
Use VTune
Or any other profiler
Most games are CPU-limited
Little to no time in the graphics driver:
CPU is the bottleneck
Faster GPU will NOT result in faster graphics
Use VTune to track where you spend your time
Optimize those places
CPU Measurement (cont.)
But even if most time is spent in graphics driver:
CPU might still be the bottleneck
Faster GPU will NOT result in faster graphics
Use Nvidia Stats-driver (NVTune) to trace into the GPU
Timing graphics calls is pointless
Remember the large cache between CPU/GPU
Use Nvidia Stats-driver (NVTune) instead
NVTune is available from Nvidia’s registered developer site
CPU Common Problems
Small batches of geometry being sent to the GPU
100 triangles per batch should be your minimum
Would like to see ~500 triangles/batch
Up to 10,000 triangles/batch
A combination of causes kills your performance
Runtime
Driver
Hardware
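The batch-size advice above can be illustrated with a small sketch. It assumes a hypothetical `Draw` record (a render-state key plus a triangle count) and a hypothetical `merge_batches` helper; neither comes from the slides or from any graphics API. It also assumes that draws sharing a state can legally be merged, which in practice requires their geometry to live in the same vertex buffer and transform space.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Draw:
    """One hypothetical draw request: a render-state key plus a triangle count."""
    state: tuple      # e.g. (texture_id, shader_id); illustrative only
    triangles: int


def merge_batches(draws):
    """Sort draws by render state, then merge draws sharing a state into one batch."""
    ordered = sorted(draws, key=lambda d: d.state)
    return [
        Draw(state, sum(d.triangles for d in group))
        for state, group in groupby(ordered, key=lambda d: d.state)
    ]


draws = [
    Draw(("brick", "lit"), 40),
    Draw(("grass", "lit"), 60),
    Draw(("brick", "lit"), 80),
    Draw(("grass", "lit"), 350),
]
batches = merge_batches(draws)
# Batches still below the 100-triangle minimum suggested on the slide.
too_small = [b for b in batches if b.triangles < 100]
```

Here four draws, two of them under 100 triangles, collapse into two batches that both clear the suggested minimum.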
CPU: Batch Size Characteristic
[Chart: MTris/sec vs. batch size in vertices; all draw-calls use the same render-state]
CPU: Batching Solutions
Sort by render-state
Texture switches
Combine textures into one large (4kx4k) texture
Modify uv-coordinates accordingly
Tessellate geometry to overcome mirroring and wrapping
Mip-mapping works just fine
Transform switches
Pre-transform on the CPU into world-space
Replicate data into VBs (costs AGP memory)
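As a rough illustration of "modify uv-coordinates accordingly", the sketch below remaps a texture coordinate into one tile of a combined texture. It assumes the large texture is a uniform grid of equally sized square tiles and that the input coordinates already lie in [0, 1]; the function name `remap_uv` and its parameters are hypothetical.

```python
def remap_uv(u, v, tile_x, tile_y, tiles_per_side):
    """Map a [0, 1] uv pair into the sub-rectangle of tile (tile_x, tile_y).

    Assumes a square atlas split into tiles_per_side x tiles_per_side tiles.
    """
    scale = 1.0 / tiles_per_side
    return ((tile_x + u) * scale, (tile_y + v) * scale)


# Centre of the tile in column 1, row 0 of a 4 x 4 atlas.
atlas_uv = remap_uv(0.5, 0.5, tile_x=1, tile_y=0, tiles_per_side=4)
```

Coordinates outside [0, 1] would spill into a neighbouring tile, which is why the slide suggests tessellating geometry that relied on mirroring or wrapping.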
Other Common CPU Problems
Specify vertex buffers as WRITEONLY
Minimize state changes
Consider using a PURE device, iff your state management is already optimal
Do not lock and read data from GPU
Multi-processor sync!
AGP Bus Characteristics
AGP 4x supports 20+ MTri/s
Even if all vertices and indices are dynamic
BenMark5 does just that
http://developer.nvidia.com/view.asp?IO=BenMark5
Too often AGP 4x support is busted
Use BenMark5 to test for AGP 4x support
AGP Bus throughput is influenced by
Size of vertex format of dynamically written vertices
How many vertices are dynamically written
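A back-of-the-envelope estimate shows how vertex format size bounds the triangle rate when every vertex and index crosses the bus. The sketch assumes the theoretical AGP 4x peak of about 1066 MB/s (sustained rates are lower), 16-bit indices, and roughly one new vertex per triangle; all three are assumptions for illustration, not figures from the slides.

```python
AGP4X_PEAK_BYTES_PER_S = 1066e6  # theoretical peak (assumption); real throughput is lower


def bus_limited_tris_per_s(vertex_bytes, verts_per_tri=1.0, index_bytes=2,
                           bandwidth=AGP4X_PEAK_BYTES_PER_S):
    """Upper bound on triangles/s if all vertices and indices are sent every frame."""
    bytes_per_tri = verts_per_tri * vertex_bytes + 3 * index_bytes
    return bandwidth / bytes_per_tri


rate_32 = bus_limited_tris_per_s(32)  # 32-byte vertex format
rate_64 = bus_limited_tris_per_s(64)  # 64-byte vertex format
```

Under these assumptions a 32-byte format leaves headroom above 20 MTri/s, while a 64-byte format falls below it.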
AGP Bus Characteristics (cont.)
But if frame-buffer and textures exceed video-memory, AGP is also used
to transfer STATIC vertices to GPU every frame
to transfer textures to GPU every frame
Make sure you avoid partial writes
See “Fast AGP Writes for Dynamic Vertex Data” by Dean Macri for details
Always modify all vertex-data,
even if only some data changes
Pentium 3: write in 32 byte chunks
Pentium 4: write in 64 byte chunks
AGP Bus Characteristics (cont.)
GPU caches vertex fetches
Hitting this cache causes no data to cross the bus
Cache has 32-byte lines
Vertex sizes that are multiples of 32 are beneficial
See also
http://developer.nvidia.com/view.asp?IO=Vertex_Buffer_Statistics
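To see why vertex sizes that are multiples of 32 bytes help, the sketch below counts how many 32-byte cache lines a single vertex fetch touches. It assumes the vertex buffer starts on a line boundary and vertices are tightly packed; the helper names are hypothetical.

```python
LINE_BYTES = 32  # cache-line size stated on the slide


def lines_touched(vertex_index, vertex_bytes, line_bytes=LINE_BYTES):
    """Number of cache lines spanned by one vertex in a tightly packed buffer."""
    first = (vertex_index * vertex_bytes) // line_bytes
    last = (vertex_index * vertex_bytes + vertex_bytes - 1) // line_bytes
    return last - first + 1


def padded_vertex_size(vertex_bytes, line_bytes=LINE_BYTES):
    """Round a vertex size up to the next multiple of the cache-line size."""
    return -(-vertex_bytes // line_bytes) * line_bytes


straddling = lines_touched(1, 24)   # second vertex of a 24-byte format
aligned = lines_touched(1, 32)      # second vertex of a 32-byte format
```

Padding trades extra bytes per vertex for fetches that never straddle a line, so it is worth measuring both layouts rather than assuming one wins.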
AGP Bus Characteristics
[Chart: MTris/sec vs. ordered VB size in vertices (100 to 30,000), one curve per FVF size: 24, 32, 40, 48, 56, and 64 bytes]
AGP Bus Measurement
You can tell you’re bound by the bus if:
Increasing/decreasing vertex format size significantly impacts performance
Best to increase vertex format size using components not needed by the rasterizer, for example, normals
Increasing AGP Bus Performance
Make sure frame buffer and textures fit into video-memory
Decrease number of dynamic objects (vertices)
Use vertex-shaders to animate static VBs!
Decrease vertex size
Let vertex-shader generate vertex-components!
Compress components and use vertex shader to decompress
For example, use 16bit short normals
Reorder vertices in VB to be sequential in use
Can use NVTriStrip to do this
Pad to multiples of 32-bytes
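The "16bit short normals" suggestion can be sketched as a quantize/dequantize pair. The CPU-side `pack_normal` and the `unpack_normal` stand-in for the vertex-shader multiply are hypothetical names, and the scale factor of 32767 is one common choice rather than anything the slides prescribe.

```python
def pack_normal(normal):
    """Quantize each component from [-1, 1] to a signed 16-bit integer."""
    return tuple(max(-32767, min(32767, round(c * 32767))) for c in normal)


def unpack_normal(packed):
    """Stand-in for the vertex shader: one multiply by a constant per component."""
    return tuple(c / 32767.0 for c in packed)


original = (0.6, -0.8, 0.0)
packed = pack_normal(original)        # three shorts instead of three floats
restored = unpack_normal(packed)
max_error = max(abs(a - b) for a, b in zip(original, restored))
```

Three 32-bit floats shrink to three 16-bit shorts, halving the bytes this component sends across the bus at the cost of a small quantization error.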
Vertex Processing Characteristics
Each vertex is transformed and lit
Performance correlates directly to
Number of vertices processed
Length of vertex shader or
Fixed-function factors, such as
Number of active lights
Type of lights
Specular on/off
LOCALVIEWER on/off
Texgen on/off
GPU core clock frequency
Vertex Processing Characteristics
[Chart: Vertex Processing Performance; verts/s vs. instructions per vertex shader (1 to 126)]
Vertex Processing Characteristics
After processing, vertices land in post-TnL FIFO
GeForce1/2/4 MX: effectively 10 entries
GeForce3/4 Ti: effectively 18 entries
Cache-hit saves:
all TnL work!
Everything before TnL in the pipeline
Only works with indexed primitives
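The effect of the post-TnL FIFO can be estimated offline by replaying an index buffer through a simulated cache. This is a minimal sketch assuming a plain FIFO with the effective entry counts quoted above; the function name is hypothetical and real hardware behaviour may differ in detail.

```python
from collections import deque


def transformed_vertex_count(indices, cache_entries):
    """Count indices that miss a FIFO post-TnL cache and so must be transformed."""
    cache = deque(maxlen=cache_entries)
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)  # FIFO: the oldest entry drops off when full
    return misses


# Two triangles sharing an edge, drawn as an indexed list.
quad = [0, 1, 2, 2, 1, 3]
with_cache = transformed_vertex_count(quad, cache_entries=10)
tiny_cache = transformed_vertex_count(quad, cache_entries=1)
```

Dividing the miss count by the triangle count gives a rough figure for how many vertices are transformed per triangle, which is what index reordering tools try to minimize.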
Vertex Processing Performance
Do not be afraid to use triangles
Rarely the bottleneck
Even if it is, it would make us happy
A lot of vertex processing power available
6 * 6 pixel-quad with 2 tris is not vertex bound
If you can tell an object is made from triangles, you are not using enough triangles
~10k triangles/frame is off by 2 (two!) orders of magnitude
Code Creatures Demo
Grass scenes are NOT vertex-bound
In excess of 1,000,000 tris/frame for opening scene
~250k tris/frame minimum
CodeCreatures demo available from:
http://www.codecult.de/
Vertex Processing Measurement
You are bound by vertex processing if:
Increasing/decreasing vertex shader length significantly influences performance
Unnecessary added instructions may be optimized out by the driver, though
Instead, use instructions that access constant memory to add zero to a result, for example
Fixed-function TnL performance improves when
Reducing number of lights
Turning off texgen
Simplifying light types
Improving Vertex Processing
Optimize for the post-TnL vertex cache
Use indexed primitives
Access vertices mostly sequentially, revisiting only recently accessed vertices
Let NVTriStrip or ID3DXMesh do the work
Turn off unnecessary calculations
LOCALVIEWER often unnecessary for specular
Prefer cheap approximations for lighting and other math when using vertex shaders
Improving Vertex Processing (cont.)
Optimize your vertex shaders
Use swizzling/masking extensively
Question all MOV instructions
Storing lookup tables in constant memory
for example, to compute sin/cos
See “Implementation of ‘Missing’ Vertex Shader Instructions” for more ideas
http://developer.nvidia.com/view.asp?IO=Implementation_Missing_Instructions
Improving Vertex Processing (cont.)
Consider moving per-vertex work to per-pixel
Consider using ‘shader-LODing’
Do far-away objects really need 4-bone skinning?
Can always increase screen-res/use AA to NOT be vertex-processing bound!
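One way to picture ‘shader-LODing’ is a distance-based choice of skinning complexity. The thresholds and the `bones_for_distance` helper below are purely hypothetical; a real application would tune them per asset and pick a matching vertex shader for each level.

```python
def bones_for_distance(distance, thresholds=((20.0, 4), (60.0, 2))):
    """Pick how many bones per vertex to skin with, based on view distance.

    thresholds is a sequence of (max_distance, bone_count) pairs, nearest first.
    """
    for max_distance, bones in thresholds:
        if distance <= max_distance:
            return bones
    return 1  # far away: single-bone (rigid) skinning


near = bones_for_distance(10.0)
middle = bones_for_distance(50.0)
far = bones_for_distance(100.0)
```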
Triangle Setup Characteristics
Triangle setup is never the bottleneck
Except when rating the GPU
Since it is the fastest stage
Setup speed influenced by:
Number of triangles
Vertex attributes needed by rasterization
Extremely small triangles running very simple TnL
i.e., degenerate triangles!
No TnL cost, since most likely hits post-TnL cache
No fill-cost, since rejected in setup
Measuring/Improving Triangle Setup
Has never come up
Reduce ratio of degenerate triangles to real triangles
Reduce unnecessary components written out from the vertex shader
Rasterization Characteristics
Prefer the term “fragment” to “pixel”
May not correspond to any pixel in framebuffer, for example, due to z/stencil/alpha tests
May correspond to more than one pixel due to multisampling
Commonly referred to as “fill-rate”
Fill-Rate Characteristics
Fill-rate is a function of
number of fragments filled
cost of each fragment
GPU’s core clock
Parallel SIMD operation, processes
Up to 4 pixels per clock on GeForce1/2/3/4 Ti
Up to 2 pixels per clock on GeForce2 MX / 4 MX
Broken into a number of parts:
Texture fetching
Texture addressing operations
Color blending operations
Texture Fetching Characteristics
Texture fetches are
From AGP to local video-memory, only if framebuffer and textures exceed video-memory (to be avoided), then
From local video-memory to on-chip cache
Texture Fetching Characteristics (cont.)
Minimize cache-misses:
Use mip-mapping!
Avoid LOD bias to sharpen: it hurts caching and adds aliasing
Prefer anisotropic filtering for sharpening
Use DXT everywhere you can
Texture size as big as needed and no bigger
Texture format as small as possible
16 vs. 32 bit
Localize texture access
E.g., normal texture reads
Dependent texture reads are less local
Per-pixel reflection potentially really bad
Texture Fetching Characteristics (cont.)
Number of samples taken also affects performance:
Trilinear filtering cuts fill-rate in half
Anisotropic even worse
Depending on level of anisotropy
The hardware is intelligent in this regard: you only pay for the anisotropy you use
Texture Addressing Characteristics
Different texture addressing operations have wildly different performance characteristics
But texture cache hits/misses more significant
[Chart: Texture Shader Performance; pixels/s by shader program type (Passthrough, Pixel kill, Dependent AR, Dependent GB, Offset 2D (no luminance), Offset 2D (luminance), Dot product 2D, Dot product depth, Dot product cubemap, Dot product reflection) for 1D, 2D, and cubemap textures]
Texture Addressing Characteristics
Also, every two textures cuts fill-rate in half:
1 or 2 textures runs at full speed
3 or 4 textures runs at half speed (two clocks)
Color Blending Characteristics
Color blending operations also called ‘Register Combiners’
1 or 2 instructions (combiners) – full speed
3 or 4 instructions (combiners) – half speed
5 or 6 instructions (combiners) – one third speed
7 or 8 instructions (combiners) – one quarter speed
These numbers are for GF3 / 4 Ti
But if using 4 textures
Already at half-speed or less
Using up to 4 combiners is free
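The texture and combiner rules above can be folded into a simple cost model. The sketch assumes the two costs overlap, so the slower of the two sets the pace; that reading is inferred from "using up to 4 combiners is free" with 4 textures, and the function names are hypothetical. The numbers apply to GF3 / 4 Ti only, as noted on the slide.

```python
from math import ceil


def clocks_per_fragment(textures, combiners):
    """Clocks per fragment, assuming texture and combiner work overlap."""
    texture_clocks = max(1, ceil(textures / 2))    # every two textures cost a clock
    combiner_clocks = max(1, ceil(combiners / 2))  # every two combiners cost a clock
    return max(texture_clocks, combiner_clocks)


def relative_fill_rate(textures, combiners):
    """Fraction of peak fill-rate under the model above."""
    return 1.0 / clocks_per_fragment(textures, combiners)


four_tex_two_comb = clocks_per_fragment(4, 2)
four_tex_four_comb = clocks_per_fragment(4, 4)
```

Under this model, going from two to four combiners costs nothing once four textures are already bound, which matches the last bullet.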
Fill-Rate Measurement
You are bound by fill-rate, if
Reducing texture sizes
Or better turning off texturing
Increases performance significantly
Turning on / off trilinear affects performance
Increasing texture units used to 4, but not actually fetching from any textures (using pixel shader instructions like texcoord), causes you to slow down
Improving Fill-Rate
Render z-only pass first
Because z-optimizations happen before rasterization
Helps with memory bandwidth as well
Even for older chips without z-optimizations
Do everything to reduce texture cache misses
Turn on anisotropic, but turn off trilinear filtering
Mip-map transitions are less visible with anisotropic filtering on
Improving Fill-Rate (cont.)
Consider palettized normal maps for compression
Consider moving per-pixel work to per-vertex
Consider ‘shader LODing’
Turn off detail map computations in the distance
Memory Bandwidth Characteristics
Memory bandwidth is often the bottleneck
especially at high resolutions
Memory bandwidth influenced by:
Screen and render-target resolutions
Render-target color / z bit depth
FSAA
Texture sizes and formats (texture fetching)
Overdraw complexity
Alpha blending
GPU’s memory-interface width
Memory clock
Memory Bandwidth Characteristics
FSAA hits memory bandwidth exclusively
no fill-rate hit with multi-sample
Failing the z/stencil/alpha test means
Pixel color is not written
Z is not written
Measuring Memory Bandwidth
Switch frame-buffer format to 16bit
Switch all render-targets to 16bit
If performance doubles
App was 100% memory-bandwidth bound
If performance unchanged
App is not memory-bandwidth bound
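Results between the two extremes can be read with a rough interpolation. The sketch below assumes that switching from 32-bit to 16-bit halves only the bandwidth-bound share of the frame time and leaves the rest unchanged; this linear model is an assumption for illustration, not something the slides state, and the function name is hypothetical.

```python
def bandwidth_bound_fraction(fps_32bit, fps_16bit):
    """Estimate the share of frame time bound by memory bandwidth (0.0 to 1.0)."""
    if fps_32bit <= 0 or fps_16bit <= 0:
        raise ValueError("frame rates must be positive")
    # t16 = t32 * (1 - f / 2)  =>  f = 2 * (1 - t16 / t32), with t = 1 / fps
    fraction = 2.0 * (1.0 - fps_32bit / fps_16bit)
    return min(1.0, max(0.0, fraction))


fully_bound = bandwidth_bound_fraction(30.0, 60.0)   # performance doubles
not_bound = bandwidth_bound_fraction(30.0, 30.0)     # performance unchanged
partly_bound = bandwidth_bound_fraction(30.0, 40.0)
```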
Improving Memory Bandwidth
Overdraw
Reduce as much as possible
Lightly sort objects front to back
All architectures benefit, since z-test fails
Reduce blending as much as possible
Always enable alpha-test when blending
Tweak test-value as much as possible
Consider using 2-pass alpha-test/-blend technique
Always clear z/stencil (using clear())
Do not clear color if not necessary
Writing z from shader destroys early z
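"Lightly sort objects front to back" can be sketched as a coarse, stable bucket sort on view depth, so that any existing order (for example, by render-state) survives within each depth band. The object tuples, the bucket size, and the function name are hypothetical.

```python
def lightly_sort_front_to_back(objects, bucket_size):
    """Stable coarse sort of (name, view_depth) pairs into depth buckets."""
    return sorted(objects, key=lambda obj: int(obj[1] // bucket_size))


scene = [("far", 95.0), ("near_b", 12.0), ("mid", 40.0), ("near_a", 18.0)]
draw_order = [name for name, _ in lightly_sort_front_to_back(scene, bucket_size=25.0)]
```

Because Python's sort is stable, `near_b` stays ahead of `near_a` even though it is slightly closer; only the coarse band matters, which keeps the CPU cost of sorting low.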
Improving Memory Bandwidth (cont.)
Prefer FSAA over high resolution
Consider using z-only pass
Turn off z-writing for all subsequent passes
Conclusion
A lot of different performance bottlenecks
Know which one to tweak
Use suggestions here to
Make things faster without making them visibly worse
Make things prettier for free!
Questions…
?
cem@nvidia.com
mwloka@nvidia.com
http://developer.nvidia.com