Developing Efficient Graphics Software
Intent of Course
• Identify application and hardware interaction
• Quantify and optimize interaction
• Identify efficient software structure
• Balance software and hardware system component use
Outline
• 1:35 Hardware and graphics architecture and performance
• 2:05 Software and System Performance
• Break
• 2:55 Software profiling and performance analysis
• 3:20 C/C++ language issues
• 3:50 Graphics techniques and algorithms
• 4:40 Performance Hints
Speakers
• Applications Consulting Engineers for SGI
– optimizing, differentiating, graphics
• Keith Cok, Bob Kuehne, Thomas True, Alan Commike
Hardware & Graphics Architecture &
Performance
Bob Kuehne, SGI
Course Overview
Why is your application drawing so slowly?
• Could actually be the graphics
• Could be the data traversal
• Could be something entirely different
Tour Guide
Platform architecture & components
• CPU
• Memory
• Graphics
Graphics performance
• Measurements: triangle rate, fill rate, misc.
• Reproduce & maximize
Bottlenecks & Balance
Bottlenecks
• Find them
• Eliminate them (sort of - move them around)
Balance
• Understand hardware architecture
• Fully utilize hardware
Yin & Yang
• “Yin and yang are the two primal cosmic principles of the
universe”
• “The best state for everything in the universe is a state of
harmony represented by a balance of yin and yang.”
– The Skeptic’s Dictionary -- http://skepdic.com/yinyang.html
Write Once Run Everywhere?
My application ran fast on that platform! Why is
this one so slow?
• Different platforms require different tuning
• Different platforms implement hardware differently
– Macro: Architecture & features
– Micro: Storage capacities, buffers, & caches
– Effect: Bandwidth & latency
Latency & Bandwidth
Definitions:
• Latency: time required to communicate a unit of data
• Bandwidth: data transferred per unit time
Example (figure: timelines of s and t blocks, one block per unit of time):
• Latency bottleneck
• Bandwidth bottleneck
s: texture setup time
t: texture download time
Platform: Software View
(Diagram: the software view of the platform: graphics, CPU, I/O, memory, net, misc)
Platform: PCI, AGP
(Diagrams: a PCI layout and an AGP layout. In both, glue logic joins CPU and memory to the I/O devices (net, disk). Graphics sits on PCI in the first and on AGP in the second.)
Platform: UMA, Switched Hub
(Diagrams: a UMA layout and a switched-hub layout, each connecting CPU, memory, graphics, and I/O (net, disk) through glue logic)
Platform: The Points
Why learn about hardware?
• To understand how your app interacts with it
• To best utilize the hardware
• Potentially can use extra hardware features
Where?
• Platform documentation
• Talk with hardware vendor
CPU: Overview
CPU Operation
• Data transferred from main memory to registers
• CPU works on data in registers
Latency (relative to L1)
• Registers: 0 (free)
• Level-1 (L1) cache: 1
• Level-2 (L2) cache: 10x L1
• Main memory: 100x L1
(Diagram: CPU registers (R), L1, L2, main memory)
CPU, Cache, and Memory
Caches designed to exploit data locality
• Temporal locality
• Spatial locality
(Diagram: CPU, registers, L1, L2, main memory)
Memory: Cache & Logical Flow
(Flowchart: In register? In L1? In L2? A miss at each level copies the data up one level before computing: copy to register (1), copy to L1 (10), copy to L2 (100).)
Memory: Cache & Physical Flow
(Diagram: main memory (page), L2 cache, L1 cache, registers, CPU)
Memory: Allocation & Pools
• List elements are often allocated as-needed
– This leads to spatial disparity
• Mitigated by use of application memory management
– Bad: malloc, malloc, malloc, malloc, ...
– Good: pools - pool_init, pool_alloc, ...
• Graphics example:
– Vertices, normals, textures, etc.
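The pool idea above can be sketched in a few lines of C. This is only an illustration: the slide names `pool_init` and `pool_alloc` but gives no signatures, so the struct layout, the argument lists, and `pool_free` are assumptions. The sketch also assumes `elem_size` is the `sizeof` of the element type, so that every slot is suitably aligned.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size pool; every name and field here is illustrative. */
typedef struct Pool {
    unsigned char *storage;   /* one contiguous block holding every element */
    size_t elem_size;         /* size of each element in bytes */
    size_t capacity;          /* number of elements the pool can hold */
    size_t used;              /* number of elements handed out so far */
} Pool;

/* Reserve room for `capacity` elements with a single malloc call.
   Returns 1 on success, 0 if the allocation failed. */
int pool_init(Pool *p, size_t elem_size, size_t capacity)
{
    p->storage = malloc(elem_size * capacity);
    p->elem_size = elem_size;
    p->capacity = capacity;
    p->used = 0;
    return p->storage != NULL;
}

/* Hand out the next element; consecutive calls return adjacent memory. */
void *pool_alloc(Pool *p)
{
    if (p->used == p->capacity)
        return NULL;          /* pool exhausted */
    return p->storage + p->elem_size * p->used++;
}

/* Release the whole pool at once. */
void pool_free(Pool *p)
{
    free(p->storage);
    p->storage = NULL;
    p->capacity = 0;
    p->used = 0;
}
```

Because neighbouring list elements come from the same block, a traversal touches adjacent cache lines instead of addresses scattered across the heap.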
Memory: Graphics! Vertex Arrays
(Chart: Vertex Array Cache Behavior. Time to traverse versus number of array vertices, for Platform 0 and Platform 1, each interleaved and non-interleaved.)
Graphics: Pipe
Stages: xf, light, clip, rast, fx, fops (plus a FIFO)
xf: world to screen
light: apply light
clip: clip to view
rast: convert to pixels
fx: apply texture, etc.
fops: test pixel ops
Graphics: Pipe & Akeley Taxonomy
• G - Generate geometric data
• T - Traverse data structures
• X - Transform primitives world to screen
• R - Rasterize triangles to pixels
• D - Display framebuffer on output device
Graphics: Hardware
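4 types of hardware are common (stages to the right of the dash run in graphics hardware)
• G-TXRD : all hardware
• GT-XRD : transform, rasterize, and display in hardware
• GTX-RD : rasterize and display in hardware
• GTXR-D : all software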
Graphics: Performance
Benchmarks
• “Trust, but verify.” - an ex-president
Definitions
• Triangle rate: speed at which primitives are transformed (X)
• Fill rate: speed at which primitives are rasterized (R)
– Depth complexity: number of times pixel filled
Caveats
• Quantization, fastpath
Graphics: Quantization
• Frame quantization is the result of swapbuffers occurring at
the next vertical retrace.
– Necessary to avoid image artifacts such as tearing
• Example: 100Hz display refresh
(Figure: frame timelines on a 100 Hz display, where each tn is 1/100 second. Without sync the application runs at 120 Hz. With sync, frame rates quantize to 100 Hz, 50 Hz, or 33 Hz.)
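The quantization effect can be worked out with integer arithmetic. The sketch below assumes the simplest model: a frame that takes any part of a refresh interval occupies the whole interval, and the swap happens at the next retrace. The function name and the microsecond units are illustrative choices, not part of any API.

```c
/* Effective frame rate (Hz) when buffer swaps wait for vertical retrace.
   refresh_hz: display refresh rate
   render_us:  time to draw one frame, in microseconds */
double quantized_frame_rate(unsigned refresh_hz, unsigned long render_us)
{
    /* Refresh intervals occupied by one frame, rounded up:
       ceil(render_us * refresh_hz / 1,000,000). */
    unsigned long intervals =
        (render_us * refresh_hz + 999999UL) / 1000000UL;

    if (intervals == 0)
        intervals = 1;        /* a frame never finishes faster than one retrace */
    return (double)refresh_hz / (double)intervals;
}
```

On a 100 Hz display, a frame drawn in 8.3 ms (120 Hz unsynchronized) still shows at 100 Hz, while a frame drawn in 11 ms drops straight to 50 Hz. Under this model, a small increase in render time can halve the frame rate.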
Graphics: Fastpath
Definition
• Fastpath: the most optimized path through graphics
hardware
Example
• fast path: float verts, float norms, ABGR textures, z-test
• less fast path: float verts, float norms, RGBA textures, z-test
Graphics: Fastpath Example
Graphics: Fastpath Points
• Fast path is often synonymous with ideal path.
– Real usage of graphics falls on a continuum.
(Diagram: a continuum from fast path (hardware) to slow path (software), trading speed against quality)
Where is your application?
• Must quantify what hardware can do
– Quality & speed
Graphics Hardware: Testing
Duplicate performance numbers simply:
• Good: build a simple test program
• Better: glPerf - http://www.spec.org
Maximize performance in an app:
• Good: Use fast API extensions
• Better: Create an “is-fast” test, use what is verified as fast
Graphics Hardware: “Is-Fast”
Test each platform to determine fast path
• Once, per-machine, test primitives and modes
– Vertex array format, texture format, display list, etc.
• Store data in database
– Detect hardware changes or time-to-live
• Read data from database at startup
– Check database or re-generate data
Graphics Hardware: “Is-Fast”
Pseudo-code
if ( new_machine() || hardware_changed() ) {
    // no valid database entry: measure and record
    test_interesting_modes();
    store_in_database();
}
else {
    // have database entry
    get_performance_data_from_database();
}
// use the modes & primitives that are "fast" when rendering
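One way the measuring step might look in C is sketched below. Everything here is hypothetical: the `Mode` struct, `time_modes`, and `fastest_mode` are illustrative names, and each `draw` callback is assumed to render a fixed workload using one candidate mode. A real test would also call `glFinish()` before reading the clock, so that queued graphics work is counted, and would prefer a wall-clock timer over `clock()`.

```c
#include <time.h>

/* Hypothetical candidate: a label plus a function that draws with that mode. */
typedef struct Mode {
    const char *name;
    void (*draw)(void);
    double seconds;           /* filled in by time_modes() */
} Mode;

/* Time `reps` repetitions of each candidate using CPU time from clock(). */
void time_modes(Mode *modes, int count, int reps)
{
    for (int m = 0; m < count; m++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            modes[m].draw();
        modes[m].seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
}

/* Return the index of the candidate with the smallest measured time. */
int fastest_mode(const Mode *modes, int count)
{
    int best = 0;
    for (int m = 1; m < count; m++)
        if (modes[m].seconds < modes[best].seconds)
            best = m;
    return best;
}
```

The measured times are what would be stored in the database, keyed by machine and hardware configuration, so the test runs once rather than at every startup.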
Think Globally, Act Locally
Think globally
• Know the platforms & graphics hardware
• Use hardware effectively in your app
• Balance hardware utilization
Act locally
• Use in-cache data
• Understand hardware & graphics fastpaths
• Balance quality vs. performance
Software and System Performance
Thomas J. True, SGI
A Four Step Process
Quantify
System Evaluation
Graphics Analysis
Bottleneck Elimination
Quantify
Characterize
• Application Space
• Primitive Types
• Primitive Counts
• Rendering Characteristics
• Frame Rate
Quantify
Compare
• My performance vs. ideal performance
• Fill rate
• Triangle rate
Examine System Configuration
Resources
• Memory
• Disk
Setup
• Display
• Network
Graphics Analysis
Ideal Performance
• Keep graphics pipeline full.
• 100% CPU utilization running application code.
• 100% graphics utilization.
Graphics Analysis
Graphics Bound
• Graphics subsystem processes data slower than CPU can
feed it.
• Graphics subsystem issues an interrupt which causes the
CPU to stall.
• Data processing within application stops until graphics
subsystem can again accept data.
Graphics Analysis
Geometry Limited
• Limited by the rate at which vertices can be transformed and
clipped.
Fill Limited
• Limited by the rate at which transformed vertices can be
rasterized.
Graphics Analysis
CPU Bound
• CPU at 100% utilization but can’t feed graphics fast enough.
• Graphics subsystem at less than 100% utilization.
• All CPU cycles consumed by data processing.
Graphics Analysis
Determination Techniques
• Remove graphics API calls.
• Shrink graphics window.
• Reduce geometry processing requirements.
• Use system monitoring tool.
Graphics Analysis
(Flowchart: locating the bottleneck)
• Tests: remove rendering calls, remove graphics API calls, shrink graphics window, reduce geometry load, use system monitoring tool
• Each test branches on the result: frame rate increase, or no change in frame rate
• Outcomes: performance problem not graphics; graphics performance problem; graphics bound: fill limited; graphics bound: geometry limited; fallen off fast path; excessive or unexpected CPU activity
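The determination techniques can be combined into a small classifier. The sketch below is one plausible reading of the procedure, not a transcription of the flowchart: the order of the tests, the 10% threshold for counting a change as a frame rate increase, and all names are assumptions.

```c
typedef enum {
    NOT_GRAPHICS,       /* removing rendering calls did not help */
    FILL_LIMITED,       /* a smaller window raised the frame rate */
    GEOMETRY_LIMITED,   /* less geometry raised the frame rate */
    OFF_FAST_PATH       /* graphics bound, but neither experiment helped */
} Bottleneck;

/* Each argument is the frame rate (frames per second) measured in one
   experiment; `baseline` is the unmodified application. */
Bottleneck classify(double baseline, double no_render_calls,
                    double small_window, double less_geometry)
{
    const double gain = 1.10;   /* assumed: more than 10% counts as an increase */

    if (no_render_calls < baseline * gain)
        return NOT_GRAPHICS;    /* look at CPU activity with a system monitor */
    if (small_window >= baseline * gain)
        return FILL_LIMITED;
    if (less_geometry >= baseline * gain)
        return GEOMETRY_LIMITED;
    return OFF_FAST_PATH;
}
```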
Graphics Analysis
Graphics Architecture: GTXR-D
(aka Dumb Frame Buffer)
• CPU does everything.
• Typically CPU bound.
• To remedy, buy a “real” graphics board.
Graphics Analysis
Graphics Architecture: GTX-RD
• Screen space operations performed by graphics.
• Object-space to screen-space transform on host.
• Can easily become CPU bound.
“Roughly 100 single-precision floating point operations are required to
transform, light, clip test, project and map an object-space vertex to
screen-space.” - K. Akeley & T. Jermoluk
• Beware of fast-path and slow-path issues.
Graphics Analysis
Graphics Architecture: GTX-RD
• If Graphics Bound:
– Reduce per-pixel operations.
– Reduce depth complexity.
– Use native-format data.
Graphics Analysis
Graphics Architecture: GTX-RD
• If CPU Bound:
– Reduce scene complexity.
– Use more efficient graphics algorithms.
Graphics Analysis
Graphics Architecture: GT-XRD
• Transformation and rasterization performed by graphics.
• Can be CPU or graphics bound.
• Beware of fast-path and slow-path issues.
• Subject to host bandwidth limitations.
Graphics Analysis
Graphics Architecture: GT-XRD
• If Graphics Bound:
– Move lighting back to CPU.
– Use native data formats within application.
– Use display lists or vertex arrays.
– Use less expensive lighting modes.
Graphics Analysis
Graphics Architecture: GT-XRD
• If CPU Bound:
– Move lighting from CPU to graphics subsystem.
– Do matrix operations in graphics hardware.
– Profile in search of computational performance issues.
Bottleneck Elimination
Bottlenecks
• Understanding them is crucial to effective tuning.
• They will always exist; tune to balance them.
• Not always a bad thing.
Bottleneck Elimination
Graphics
• Use native graphics formats.
• Remove excessive state changes.
• Package graphics primitives efficiently.
• Use textures that fit in texture cache.
• Don’t use unnecessary rendering modes.
• Decrease depth complexity.
• Cull out excessive geometry.
Bottleneck Elimination
Memory
• Don’t allocate memory in rendering loop.
• Avoid copying and repackaging of graphics data.
• Organize graphics data.
• Avoid memory fragmentation.
Bottleneck Elimination
Memory Bandwidth and Fragmentation
Independent Triangles
9 vertices: 504 bytes
Triangle Strip
5 vertices: 280 bytes
Vertex Array
5 vertices: 280 bytes
Vertex = RGBA+XYZW+XYZ+STR = 56 bytes
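The byte counts on this slide follow directly from the vertex layout. The sketch below reproduces the arithmetic, assuming every component is a 4-byte float. The struct and function names are illustrative.

```c
#include <stddef.h>

/* Vertex layout from the slide: RGBA + XYZW + XYZ + STR, all floats. */
typedef struct Vertex {
    float color[4];     /* RGBA */
    float position[4];  /* XYZW */
    float normal[3];    /* XYZ  */
    float texcoord[3];  /* STR  */
} Vertex;

/* Bytes sent when every triangle carries its own three vertices. */
size_t independent_bytes(size_t triangles)
{
    return triangles * 3 * sizeof(Vertex);
}

/* Bytes sent for one strip: the first triangle needs three vertices,
   and each further triangle adds one. */
size_t strip_bytes(size_t triangles)
{
    return triangles == 0 ? 0 : (triangles + 2) * sizeof(Vertex);
}
```

For the three triangles on the slide this gives 9 x 56 = 504 bytes as independent triangles and 5 x 56 = 280 bytes as a strip.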
Bottleneck Elimination
Code and Language
• Use native data types.
• Avoid contention for a single shared resource.
• Avoid application bottlenecks in non-graphics code.
• Reduce API call overhead.
Bottleneck Elimination
API Call Overhead
Independent Triangles
(XYZW + RGBA + XYZ + STR) * 9 vertices: 36 function calls
Triangle Strips
(XYZW + RGBA + XYZ + STR) * 5 vertices: 20 function calls
Vertex Array
5 function calls
Display List
1 function call
Conclusion
Performance Tuning an Iterative Process
Quantify
System Evaluation
Graphics Analysis
Bottleneck Elimination
Conclusion
It’s all about balance!
Profiling and Performance Analysis
Keith Cok, SGI
Profile and Performance Analysis
• Profiling points out code areas that take up most time
• Imperative for well balanced application
• Points out code and system bottlenecks
Two Methods of Software Profiling
Basic block
• A section of code that has one entry and one exit
• Measures ideal time
Statistical sampling
• Interrupts program execution and examines current location
• Measures actual CPU cycles spent executing a line of code
How Do You Profile Code?
• Compile/link with compiler optimizations turned on
– cc foo.c -use_all_optimization_flags ....
• Instrument the code
– Unix: pixie foo.exe -> foo.exe.pixie
– Visual Studio: embedded in tool suite
• Run the application with relevant data sets
– foo.exe.pixie - args -> produces results data file
Profiling: Finding the Hot Spot
Function list, in descending order by exclusive ideal time
      excl.%  cum.%  instructions    calls  function (dso: file, line)
 [1]   10.3%  10.3%     190583064
 [2]    8.9%  19.2%     173920781
 [3]    8.2%  27.4%     145950460
 [4]    5.9%  33.3%      97798122  1975976  __sin (libm.so: sin.c, 194)
 [5]    4.1%  37.4%      82310479
 [6]    3.4%  40.8%      50786176  1204269  __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)
 [7]    3.2%  44.0%      58099072
 [8]    3.1%  47.1%      53832546   290970  R_RecursiveWorldNode (foo: gl_rsurf.c, 894)
 [9]    3.1%  50.2%      43855299   437627  R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)
[10]    2.8%  53.0%      44666700    11484  GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)
                                      3203  S_Update_ (foo: snd_dma.c, 848)
                                    338787  R_RenderBrushPoly (foo: gl_rsurf.c, 641)
                                       240  GL_LoadTexture (foo: gl_draw.c, 990)
                                     16797  R_DrawAliasModel (foo: gl_rmain.c, 232)
                                     30981  EmitWaterPolys (foo: gl_warp.c, 187)
Profiling: Fixing the Hot Spot
What do you look for?
• Common sub-expressions
• Loop invariant code
• Repeated pointer de-referencing
• Global variables and cache misses
• “Thin” loops
Profiling Example
// Code the old way
19: void old_loop() {
20:   sum = 0;
21:   for (i = 0; i < NUM; i++)
22:     sum += x[i];
23:   printf("sum = %f\n", sum);
24: }
// Code the new way
27: void new_loop() {
28:   sum = 0;
29:   ii = NUM%4;
30:   for (i = 0; i < ii; i++)
31:     sum += x[i];
32:   for (i = ii; i < NUM; i += 4) {
33:     sum += x[i];
34:     sum += x[i+1];
35:     sum += x[i+2];
36:     sum += x[i+3];
37:   }
38:   printf("sum = %f\n", sum);
39: }
Profiling Example: Profile Results
     cycles  instructions  calls  function (dso: file, line)
[1]    6160          6168      1  old_loop   (blahdso.so: blahdso.c, 19)
[2]    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)

[1]    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
[2]    4625          4891      1  new_loop   (blahdso.so: blahdso.c, 27)
Profile Example: Line Analysis
Line list, in descending order by time
-----------------------------------------------------
cycles  invocations  function  (dso: file, line)
  4096         1024  old_loop  sum += x[i];
  2061         1024  old_loop  for (i = 0;i < NUM; i++)
   978          256  new_loop  sum += x[i+3];
   968          256  new_loop  sum += x[i+2];
   968          256  new_loop  sum += x[i+1];
   968          256  new_loop  sum += x[i];
   733          256  new_loop  for (i = ii; i < NUM; i +=4)
     7            1  new_loop  ii = NUM%4;
Profile and Performance Analysis
Profile Example: Visual C++/Intel
Function   Percent of   Hit
Time(s)    Run Time     Count   Function
-----------------------------------------
  0.410       39.4        1     _old_loop
  0.249       23.9        1     _new_loop
Statistical vs. Basic Block Profile
void ijk_loop(){
  sum = 0;
  for (i=0;i<YNUM;i++)
    for (j=0;j<YNUM;j++)
      for (k=0;k<YNUM;k++)
        sum += y[i][j][k];
  printf("sum = %f\n",sum);
}
// loops kji and ikj as well
Basic Block vs. Statistical Sampling
Basic Block:
     Percent    cycles      inst  calls  function
[1]   25.3%   51141434  37101028      1  ijk_loop  foo.c, 47
[2]   25.3%   51141434  37101028      1  kji_loop  foo.c, 57
[3]   25.3%   51141434  37101028      1  ikj_loop  foo.c, 66
Statistical Sampling:
     Percent  Samples  Procedure   Function
[1]   38.0%      2700  kji_loop    foo.c, 57
[2]   23.9%      1700  setup_data  foo.c, 15
[3]   19.7%      1400  ikj_loop    foo.c, 66
[4]   18.3%      1300  ijk_loop    foo.c, 47
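The two profiles differ because the loop orders execute the same instructions but walk memory differently. The sketch below shows two of the three orders. The `YNUM` value, the `double` element type, and the function names are assumptions, since the slides do not give them.

```c
#define YNUM 32

static double y[YNUM][YNUM][YNUM];

/* Fill the array with a known pattern. */
void setup_data(void)
{
    for (int i = 0; i < YNUM; i++)
        for (int j = 0; j < YNUM; j++)
            for (int k = 0; k < YNUM; k++)
                y[i][j][k] = 1.0;
}

/* Last index varies fastest: consecutive iterations read adjacent memory. */
double ijk_sum(void)
{
    double sum = 0.0;
    for (int i = 0; i < YNUM; i++)
        for (int j = 0; j < YNUM; j++)
            for (int k = 0; k < YNUM; k++)
                sum += y[i][j][k];
    return sum;
}

/* First index varies fastest: each iteration jumps YNUM*YNUM elements. */
double kji_sum(void)
{
    double sum = 0.0;
    for (int k = 0; k < YNUM; k++)
        for (int j = 0; j < YNUM; j++)
            for (int i = 0; i < YNUM; i++)
                sum += y[i][j][k];
    return sum;
}
```

A basic-block profile counts identical work for both functions. Only sampling, which measures elapsed time, exposes the extra cache misses in the strided version.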
Now We Know About Hot Spots...
What do we do next?
• Use compilers to fine-tune code
• Use knowledge of language to optimize
• Hand-tune code
Profiling is fun, hard, and iterative and it can be
highly effective
Compiler and Language Issues
Keith Cok, SGI
Bob Kuehne, SGI
Compiler and Language Issues
Compiler Optimizations:
• Occur within a compromise of speed and memory space vs. time to
compile and link
• An iterative process to discover what does and doesn’t work
• Important to keep at it
Compiler Issues: Trade-Offs
• Trade-offs:
– Round-off vs. needed precision
– Inter-procedural analysis vs. link time
– Pointer aliasing vs. coding constraints
– Optimizing for processor architectures vs. work of multiple
binaries (support, test)
• Explore compilers other than your first choice
• Different source code - different flags
Compiler and Language Issues
Comments on 32 vs. 64 bit code
• Benefits of 64 bit code:
– Increased address space
– Higher precision
• Downsides of 64 bit code:
– Application memory footprint
– Need to port which can be difficult!
• Performance issues
Language Issues
• Data Management
• Unrolling loops
• Arrays
• Temporary variables
• Pointer aliasing
Language Issues: Data
Management
Manipulate data structures efficiently since
graphics IS data
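// Slower to search: the key sits behind the large payload
typedef struct str {
    struct str *next;
    struct str *prev;
    large_type foo;
    int key;
} str;
// Faster to search: the key sits next to the links
typedef struct str {
    struct str *next;
    struct str *prev;
    int key;
    large_type foo;
} str;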
Language Issues: Data
Management
Pack data efficiently
struct foo {
    char aa;    // 8 bits + 24 pad
    float bb;   // 32 bits
    char cc;    // 8 bits + 24 pad
    float dd;   // 32 bits
    char ee;    // 8 bits + 24 pad
} foo_t;        // 160 bits
struct foo_better {
    float bb;   // 32 bits
    char aa;    // 8 bits
    char cc;    // 8 bits
    char ee;    // 8 bits + 8 pad
    float dd;   // 32 bits
} foo_better_t; // 96 bits
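The padding figures depend on the compiler and target, so it is worth checking them rather than assuming them. The sketch below prints the actual sizes and a few offsets using `sizeof` and `offsetof`. The helper name `print_layouts` is illustrative, and the 160-bit and 96-bit totals are what a typical platform with 4-byte aligned floats would report.

```c
#include <stddef.h>
#include <stdio.h>

/* Members alternate between char and float, forcing padding after each char. */
struct foo {
    char aa;
    float bb;
    char cc;
    float dd;
    char ee;
};

/* The three chars are grouped together so they share one padded word. */
struct foo_better {
    float bb;
    char aa;
    char cc;
    char ee;
    float dd;
};

/* Print the layout the compiler actually chose. */
void print_layouts(void)
{
    printf("foo:        %zu bytes, bb at offset %zu, dd at offset %zu\n",
           sizeof(struct foo),
           offsetof(struct foo, bb), offsetof(struct foo, dd));
    printf("foo_better: %zu bytes, aa at offset %zu, dd at offset %zu\n",
           sizeof(struct foo_better),
           offsetof(struct foo_better, aa), offsetof(struct foo_better, dd));
}
```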
Language Issues: Data
Management
Examine your arrays and note their caching
behavior
• Break up large arrays into smaller sub-arrays for better
memory access patterns
• Understand the implications of data layout and cache
behavior
Language Issues: Loop Unrolling
Issues with loop unrolling:
• Code complexity
• Clutter
• Compiler may/may not do this
• Flags may affect compiler time spent optimizing
Only “thin” loops gain performance
Use application knowledge to take advantage of
loop unrolling
Language Issues: Local temporary
variables
Use local temporary variables to avoid repeatedly
de-referencing a pointer structure
Example:
x = global_ptr->record_str->a;
y = global_ptr->record_str->b;
Use:
tmp = global_ptr->record_str;
x = tmp->a;
y = tmp->b;
Language Issues: Using tmp vars
for global vars within a function
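// Before: accumulate through the output pointer
void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
{
    FLOAT *c1, *c2, *c3, *c4, *op, *np;
    int j;
    c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;
    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;
        *np  = *op++ * *c1++;
        *np += *op++ * *c2++;
        *np += *op++ * *c3++;
        *np++ += *op * *c4++;
    }
}
// After: accumulate in a local temporary, store once
void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
{
    FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
    int j;
    c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;
    for (j = 0, np = new_pt; j < 4; j++) {
        op = old_pt;
        tmp  = *op++ * *c1++;
        tmp += *op++ * *c2++;
        tmp += *op++ * *c3++;
        *np++ = tmp + (*op * *c4++);
    }
}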
Language Issues: Pointer Aliasing
• Pointers are aliases when they point to potentially
overlapping regions of memory
• If the regions never overlap, the compiler could optimize for that
case, but in general it cannot prove this
• Compiler can't tell when pointers are aliased
• Use the restrict keyword or a compiler option
Language Issues: Pointer Aliasing
(Figure: unaliased pointers, where in and out refer to separate regions, and aliased pointers, where in and out overlap)
With unaliased pointers, compilers may use:
- Parallelism
- Pipelining
Language Issues: Pointer Aliasing
void process_data( float * restrict in,
float * restrict out,
float gain) {
int i;
for (i = 0; i < NSAMPS; i++) {
out[i] = in[i] * gain;
}
}
C++: General Issues
• Language features
– RTTI, safe casts, etc.
• Use const, mutable, volatile, & inline
– hints to compilers
• Object construction
– arrays, default constructors, arguments, etc.
• Method invocation issues
– operators, overloads, conversion, etc.
C++: Virtual Functions
• Good - used to invoke child method when managing base-class handles
• Expensive - incur an additional pointer de-reference
– one, find VTBL, two, find method, invoke
– bad for caching
• Use when necessary, but not for common objects
– Good for ‘large’ methods that do lots of work
– Bad for ‘small’ methods, like a vertex query
C++: Exceptions & Templates
Exceptions
• Great for error checking
• Performance penalty
– Additional stack information required
Templates
• Great for code re-use
• Memory penalty
– Across libraries, across object files
Code & Language Issues: The End
Balance
• Know your compiler
– Features & performance
• Know your language
– Features & performance
• Know your app
– Features & performance
Idioms and Application Architectures
Alan Commike, SGI
Starting Quote
The best tuned most efficient bubble sort is still a
bubble sort. Additional tweaking won't improve
performance.
Change The Algorithm!
- Commike ‘99
Introduction
To write an efficient graphics application, one
must:
• Understand the platform
• Use graphics efficiently
• Write good code
Use efficient application structures and algorithms
Outline
• Outline
• Background
• Culling
• Level of Detail (LOD) management
• Application architectures
Application Architectures:
Rendering Path
• Application work, culling, LOD, drawing
• Pipelined rendering path
App
Cull
LOD
Draw
(Figure: Frame0, Frame1, and Frame2 each pass through App, Cull, LOD, and Draw, overlapping across time steps T0 to T5)
Application Architectures:
Target Frame Rate
A target frame rate attempts to bound the
maximum render time
• Control Culling and LOD aggressiveness
• Maintain a constant frame rate
• Achieve an acceptable interactive frame rate
Graphics Idioms
• Culling
– Removing geometry that isn't visible
• Level of Detail Management
– Reducing geometric complexity
Culling
Don’t draw what you can’t see
Culling:
Culling Types
Use one. Use all. Pipeline them together.
• View Frustum Culling
• Backface Culling
• Contribution Culling
• Occlusion Culling
Culling:
Bounding Volumes
Test against a bounding volume not individual
primitives
• Can be bounding sphere, box, oriented box, or any enclosing
volume
• Hierarchical bounding volumes to reduce cull time
• Spheres are fast, boxes are more accurate
– Use a combination of both
Culling:
View Frustum
Graphics pipeline clips data that falls outside the
View Frustum
If it will be clipped don’t bother drawing
Culling:
View Frustum Usefulness
• Improves geometry rate
– Culled vertices are not transformed, lit, and clipped
• Improves host download rate
– Less data moved from memory into graphics
• Does not change fill rate
– Triangles outside the View Frustum would not have been
drawn anyway
Culling:
View Frustum Implementation
• Transform vertices to clip coordinates (in OpenGL multiply by
Model-View and Projection matrix)
• Check each vertex against View Frustum
• Geometry is either In, Out, or Partial
• Render In and Partial
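The In, Out, or Partial classification can also be done once per bounding volume rather than per vertex. The sketch below swaps in a plane-based bounding-sphere test instead of the clip-coordinate vertex test described above. The `Plane` and `Sphere` types and the convention that plane normals are unit length and point into the frustum are assumptions.

```c
/* Plane a*x + b*y + c*z + d = 0, with unit normal (a, b, c) pointing inward. */
typedef struct { float a, b, c, d; } Plane;
typedef struct { float x, y, z, radius; } Sphere;
typedef enum { CULL_OUT, CULL_PARTIAL, CULL_IN } CullResult;

/* Classify a bounding sphere against `count` frustum planes. */
CullResult cull_sphere(const Sphere *s, const Plane *planes, int count)
{
    CullResult result = CULL_IN;

    for (int i = 0; i < count; i++) {
        /* Signed distance from the sphere centre to the plane. */
        float dist = planes[i].a * s->x + planes[i].b * s->y
                   + planes[i].c * s->z + planes[i].d;

        if (dist < -s->radius)
            return CULL_OUT;        /* entirely behind one plane: skip it */
        if (dist < s->radius)
            result = CULL_PARTIAL;  /* straddles this plane */
    }
    return result;
}
```

Objects classified `CULL_IN` and `CULL_PARTIAL` are rendered, and `CULL_OUT` objects are skipped without sending any vertices to the pipeline.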
Culling:
Skip the Clip
In software transform systems (GTX-RD) skip the
clip
• Partial and In geometry classified
– Pipe renders Partial as usual
– Pipe can render In without a View Frustum clip
• Might be a hint to render
• Can improve geometry rates if not already fill-limited
Culling:
Backface
Only half of any closed polyhedron is visible at
any one time
Don’t render what you can’t see
Culling:
Backface Usefulness
• Improves fill rate when using a native implementation
– Primitives are transformed and lit before culling
• Helps both geometry and fill with an application specific
algorithm
– More computationally expensive
– Balance graphics and CPU work
• This may not work well when you can enter closed geometry
or need two-sided lighting
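An application-specific backface test usually comes down to one dot product per polygon. The sketch below is a minimal version, assuming each polygon stores an outward-facing normal and one point on its plane. The `Vec3` type and function name are illustrative.

```c
typedef struct { float x, y, z; } Vec3;

/* Returns 1 when a polygon with outward `normal` passing through `point`
   faces away from the viewer at `eye`, and 0 otherwise. */
int is_backfacing(Vec3 normal, Vec3 point, Vec3 eye)
{
    /* Vector from the polygon to the eye. */
    float vx = eye.x - point.x;
    float vy = eye.y - point.y;
    float vz = eye.z - point.z;

    /* A non-positive dot product means the normal points away from the eye. */
    return normal.x * vx + normal.y * vy + normal.z * vz <= 0.0f;
}
```

Running this on the CPU saves transform and lighting work for the culled polygons, at the cost of extra CPU time, which is the balance the slide refers to.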
Lava. Hot!
Random Quote
Try not. Do, or do not. There is no try.
- Yoda ‘80
Culling:
Contribution
If it’s too small to make a difference
don’t render it
Culling:
Contribution Usefulness
• Improves geometry rate
– Culled vertices are not transformed, lit, and clipped
• Improves host download rate
– Less data moved from memory into graphics
• Does not change fill rate
– Screen space projection already minimal
– Removes few pixels from rasterization stage
Culling:
Contribution Implementation
Don’t render items that fall below a size threshold
• Screen space size of bounding volume
• A less computational approach
– Distance to object combined with some notion of global
object size
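The cheaper distance-based approach can be sketched as a ratio test. The function below assumes the bounding-sphere radius stands in for the "global object size" and that the caller picks `min_ratio` empirically. Both the name and the threshold convention are illustrative.

```c
/* Returns 1 when the object is too small on screen to be worth drawing.
   radius:    bounding-sphere radius of the object
   distance:  distance from the eye to the object centre
   min_ratio: smallest radius/distance ratio still considered visible */
int contribution_cull(float radius, float distance, float min_ratio)
{
    if (distance <= radius)
        return 0;   /* the eye is inside or touching the volume: always draw */
    return radius / distance < min_ratio;
}
```

With `min_ratio` set to 0.01, an object of radius 1 is dropped once it is more than 100 units from the eye.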
Culling:
Occlusion
If you can’t see it
don’t draw it
(Figure: front and side views of a scene)
Culling:
Occlusion Goals
Find the optimal set of occluders that will enable
drawing the minimal number of occludees
• Occluders: The geometry that is visible
• Occludees: The geometry that is not visible
• Use general purpose occlusion culling algorithms
• Use application specific spatial knowledge if possible
Culling:
Occlusion Culling Usefulness
• Can improve both transform-limited and fill-limited
applications
• Computationally expensive
– Beware of time trade-offs
• Possible hardware support
Culling:
General Occlusion Culling
• Used for arbitrary scenes
• Can improve both transform limited and fill limited
applications
• Computationally expensive for arbitrary scenes
Culling:
Occlusion Spatial Partitioning
“Cell and Portal” Culling
• Spatial organization leads to Cells and Portals
• Games that move from room to room
• Architectural walkthroughs
LOD:
Overview
After culling, need to draw what is left
• Still too much geometry:
– Use multiple Levels of Detail, i.e. multi-resolution objects
• Match geometric complexity to visible on-screen space
coverage
• Reduce geometric complexity to maintain target frame rate
LOD:
Issues
• Generating LODs:
– Height Fields vs 3D objects
– View-Dependent: nice, but compute intensive
– View-Independent: fast, memory intensive
• Need to decide which LOD level to use
– Not trivial!
• Need smooth transitions between levels
– Geomorphs
LOD:
Height Fields
• Generally thought of as infinite terrain
• Specialized algorithms can be used
LOD:
3D Models
• General purpose simplification algorithm
• Can use on height fields also
• Some recent real-time view-dependent algorithms
• Also used for compression
1024 Triangles
256 Triangles
64 Triangles
16 Triangles
LOD:
When to switch LOD levels
The ability to generate LOD models is not sufficient
on its own
• Need to know when to use which LOD level
– single constant hard metric: distance from eye
– Multiple heuristics: cost, benefit, rankings
• Can bias LODs to ensure frame rate targets are reached
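The single hard metric, distance from the eye, might be implemented as a lookup in a table of switch distances. In the sketch below, the table layout, the `bias` parameter, and the function name are assumptions. A bias above 1 makes objects look farther away than they are, pushing the choice toward coarser levels when the frame rate target is at risk.

```c
/* Pick a level of detail from the eye distance.
   switch_dist: ascending distances with levels - 1 entries; level i + 1
                replaces level i beyond switch_dist[i]
   levels:      number of LOD levels, level 0 being the most detailed
   bias:        multiplier on the distance; above 1 selects coarser levels */
int select_lod(float distance, const float *switch_dist, int levels, float bias)
{
    int level = 0;
    float d = distance * bias;

    while (level < levels - 1 && d > switch_dist[level])
        level++;
    return level;
}
```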
LOD:
Level determination
• Determine system rendering characteristics
• Determine cost of rendering each object
• Render objects with highest benefit while remaining under
the target frame rate
Level determination can be time consuming!
“take the time to time the time taken to reduce the
rendering time”
Going, and going, and going...
LOD:
Determining cost of rendering
Cost is affected by many factors
• Graphics hardware: published benchmarks, startup tests
• Number of vertices: primarily a function of LOD algorithm
• Rendering Quality: lighting, shading, wire frame, anti-aliasing,
etc.
• Global Factors: total texture memory, dirty internal state
LOD:
Benefit Function
Cost alone is not good enough, need benefit also
• Rendered size of object
• Error tolerance between LOD level and reference model
• Importance in scene
• Frame-to-frame coherency
LOD:
The Optimal LODs
For all Objects, at each LOD Level, rendered with
each RenderType
Maximize the total Benefit:
sum of Benefit(Object, Level, RenderType)
Subject to:
sum of Cost(Object, Level, RenderType) <= target frame time
LOD:
Optimal Optimizations
• Simulated Annealing
• Monte Carlo Simulations
• Simplex Searches
Dude,
Can you spare a few dozen CPUs?
LOD:
Trade-offs
Don’t have enough time to run full LOD
optimization problem and render the scene
• Simplify cost and benefit functions
• Simplify optimization problem into a ranking of Benefit/Cost
• Use frame-to-frame coherency
• Be sure to consider time taken to calculate LODs
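The Benefit/Cost ranking can be approximated with a greedy loop. The sketch below is one possible simplification, not the full optimization: every object starts at its coarsest level, and the upgrade with the best added benefit per added cost is applied repeatedly until nothing more fits. The `LodObject` layout and the function name are hypothetical, and the coarsest levels alone may already exceed the budget.

```c
#define MAX_LEVELS 4

typedef struct {
    int   levels;               /* usable levels, 1..MAX_LEVELS */
    float cost[MAX_LEVELS];     /* render time per level, coarsest first */
    float benefit[MAX_LEVELS];  /* benefit per level, coarsest first */
    int   chosen;               /* output: index of the selected level */
} LodObject;

/* Choose a level for each object; returns the total cost of the selection. */
float choose_lods(LodObject *objs, int count, float budget)
{
    float total = 0.0f;

    for (int i = 0; i < count; i++) {   /* start everything at the coarsest level */
        objs[i].chosen = 0;
        total += objs[i].cost[0];
    }
    for (;;) {
        int best = -1;
        float best_ratio = 0.0f;

        for (int i = 0; i < count; i++) {
            int next = objs[i].chosen + 1;
            if (next >= objs[i].levels)
                continue;               /* already at the finest level */
            float dc = objs[i].cost[next] - objs[i].cost[next - 1];
            float db = objs[i].benefit[next] - objs[i].benefit[next - 1];
            if (dc <= 0.0f || total + dc > budget)
                continue;               /* upgrade does not fit in the budget */
            if (db / dc > best_ratio) {
                best_ratio = db / dc;
                best = i;
            }
        }
        if (best < 0)
            break;                      /* no affordable upgrade remains */
        total += objs[best].cost[objs[best].chosen + 1]
               - objs[best].cost[objs[best].chosen];
        objs[best].chosen++;
    }
    return total;
}
```

Frame-to-frame coherency could be exploited by starting each frame from the previous frame's choices instead of from the coarsest levels.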
Application Architectures:
Multi-Threading
• More stages give more time to cull or generate LODs
• Each stage adds latency
(Figure: Frame0, Frame1, and Frame2 each pass through App, Cull, LOD, and Draw, overlapping across time steps T0 to T5)
Application Architectures:
Multi-Threading
• Hard part is data synchronization
• Watch out for memory bloat
Application Architectures:
Scene Graphs
A scene graph is the basic data structure holding
the description of your scene
• Cull-able, sort-able, and can contain multi-resolution objects
• Hierarchical Bounding Volumes
• Statistics gathering and timing infrastructure
• For large scenes can do memory management and database
paging
Application Architectures:
Trade-offs
• Quality
• Speed
• Memory
• Complexity
Conclusion:
Most importantly - Think about balance!
Performance Hints
Keith Cok, SGI
Performance Hints:
Pipeline Management
• Avoid round trips to graphics server
– Cache own state/attribute information
– Avoid pipeline queries (e.g., glGet*)
– Flush buffer efficiently (glFlush vs. glFinish)
• Reduce state changes. Sort by expense. For example, sort
geometry by type (triangles, quads, etc) and then by color
• Eliminate unused attributes
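Sorting by state can be done with a standard `qsort` before the draw loop. The sketch below follows the example in the last bullet, ordering by primitive type first and by color second. The `DrawItem` struct, the integer primitive codes, and the helper names are illustrative.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int primitive;    /* illustrative codes, e.g. 0 = triangles, 1 = quads */
    unsigned color;   /* packed RGBA */
} DrawItem;

/* Order by primitive type first, then by color. */
static int compare_state(const void *pa, const void *pb)
{
    const DrawItem *a = pa;
    const DrawItem *b = pb;

    if (a->primitive != b->primitive)
        return a->primitive < b->primitive ? -1 : 1;
    if (a->color != b->color)
        return a->color < b->color ? -1 : 1;
    return 0;
}

void sort_by_state(DrawItem *items, size_t count)
{
    qsort(items, count, sizeof *items, compare_state);
}

/* Count how many times a draw loop would have to change state. */
int count_state_changes(const DrawItem *items, size_t count)
{
    int changes = 0;
    for (size_t i = 1; i < count; i++)
        if (items[i].primitive != items[i - 1].primitive ||
            items[i].color != items[i - 1].color)
            changes++;
    return changes;
}
```

Counting state changes before and after the sort gives a quick measure of how much redundant state traffic the sort removes.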
Performance Hints: Debugging
Detect graphics errors:
#ifdef DEBUG
#define GLEND() glEnd(); \
    { GLenum err = glGetError(); \
      if (err != GL_NO_ERROR) \
          printf("%s\n", (const char *)gluErrorString(err)); \
      assert(err == GL_NO_ERROR); }
#else
#define GLEND() glEnd()
#endif
Performance Hints: Geometry
• Maximize data between glBegin/glEnd
– Sort geometry by type (triangle, quad, etc.) and group
them together
– Find best fit for length of glBegin/glEnd pair
• Use strip primitives (GL_TRIANGLE_STRIP...) to reduce
geometry data sent to the pipeline
• Avoid GL_POLYGON. Use specific geometric primitives
instead (GL_TRIANGLES, GL_QUADS, etc.)
• Use GL_FASTEST with glHint calls where possible
Performance Hints: Geometry
• Use flat display lists for static geometry. Deep display lists
may induce unwanted memory thrashing
• Use API matrix operations instead of your own
• Use texture to simulate complex geometry
• Use vertex arrays. Test vertex, interleaved, precompiled
arrays
Performance Hints: Geometry
• Pass one normal (not 3 or 4) per flat shaded polygon
• Use a data format suitable for quick transfer to the graphics
subsystem
• Disable unneeded operations (alpha blending, depth test, stencil,
dithering, fog, etc.)
Performance Hints: Lighting
• Reduce lighting requirements:
– Use as few lights as possible
– Use directional (infinite) lighting. Use
glLightfv(GL_LIGHTn, GL_POSITION, {x,y,z,0});
– Use positional lights rather than spot lights
– Use one-sided lighting when possible (be aware of issues
associated with normals)
– Don’t change material properties frequently
Performance Hints: Lighting
• Use normalized normal vectors
– Supply unit length vectors
– Don’t enable GL_NORMALIZE
– Don’t scale using model-view matrix
• Pre-multiply geometry, if possible
Performance Hints:
Visuals/Pixel Formats
• Pick the correct visual. Use hardware accelerated visuals
• Structure windows and contexts to maximize performance
(app may block after context swaps)
• Put GUI elements in overlay planes to avoid unwanted
graphics window refreshes
Performance Hints: Buffers
• Turn off depth buffer when possible
• Use HW accelerated off-screen buffer for backing-store
• Use stencil buffer for interactive picking and quick re-render
(see course notes for full algorithm)
• Use color/depth buffer data for interactive editing of complex
scenes (see course notes for full algorithm)
Performance Hints: Textures
• Be aware of texture sizes
– Reduce texture resolution
– Use texture LOD extension (OpenGL 1.2)
• Use texture objects. Create textures once
• Don’t swap textures frequently, if possible
– Mosaic multiple textures into one large texture
– Sort geometry by texture
Performance Hints: Textures
• Use texture as an additional data lookup to simulate more
complex data:
– Lighting, geometry, color, clipping, application-space data
• Use glTexSubImage to replace part of a texture rather than
creating a whole new texture
• Avoid expensive texture filter modes
• Use texture lookup tables instead of multi-channel textures
Conclusion
Know how your application works within the
system
• Don’t let caches, latencies, bandwidths, etc. slow you down
• Know how fast you can go
• Identify system performance characteristics
• Work your compiler
• Get all you can out of the hardware
Questions and Answers