Developing Efficient Graphics Software

Intent of Course
• Identify application and hardware interaction
• Quantify and optimize interaction
• Identify efficient software structure
• Balance software and hardware system component use

Outline
• 1:35 Hardware and graphics architecture and performance
• 2:05 Software and System Performance
• Break
• 2:55 Software profiling and performance analysis
• 3:20 C/C++ language issues
• 3:50 Graphics techniques and algorithms
• 4:40 Performance Hints

Speakers
• Applications Consulting Engineers for SGI
  – optimizing, differentiating, graphics
• Keith Cok, Bob Kuehne, Thomas True, Alan Commike

Hardware & Graphics Architecture & Performance
Bob Kuehne, SGI

Course Overview
Why is your application drawing so slowly?
• Could actually be the graphics
• Could be the data traversal
• Could be something entirely different

Tour Guide
Platform architecture & components
• CPU
• Memory
• Graphics
Graphics performance
• Measurements: triangle rate, fill rate, misc.
• Reproduce & maximize

Bottlenecks & Balance
Bottlenecks
• Find them
• Eliminate them (sort of: move them around)
Balance
• Understand hardware architecture
• Fully utilize hardware

Yin & Yang
• “Yin and yang are the two primal cosmic principles of the universe”
• “The best state for everything in the universe is a state of harmony represented by a balance of yin and yang.”
  – Skeptics Dictionary -- http://skepdic.com/yinyang.html

Write Once Run Everywhere?
My application ran fast on that platform! Why is this one so slow?
• Different platforms require different tuning
• Different platforms implement hardware differently
  – Macro: Architecture & features
  – Micro: Storage capacities, buffers, & caches
  – Effect: Bandwidth & latency

Latency & Bandwidth
Definitions:
• Latency: time required to communicate a unit of data
• Bandwidth: data transferred per unit time
Example:
• Latency bottleneck
• Bandwidth bottleneck
[Figure: timelines built from S and t blocks for each case. S: texture setup time, t: texture download time, one tick: unit of time]

Platform: Software View
[Diagram: CPU, memory, graphics, I/O, net, misc]

Platform: PCI, AGP
[Diagrams: a PCI platform and an AGP platform. Blocks: CPU, Memory, glue, Graphics, I/O, Net, Disk, PCI, AGP]

Platform: UMA, Switched Hub
[Diagrams: a UMA platform and a switched-hub platform. Blocks: CPU, Memory, glue, Graphics, I/O, Net, Disk, PCI]

Platform: The Points
Why learn about hardware?
• To understand how your app interacts with it
• To best utilize the hardware
• Potentially can use extra hardware features
Where?
• Platform documentation
• Talk with hardware vendor

CPU: Overview
CPU Operation
• Data transferred from main memory to registers
• CPU works on data in registers
Latency
• Registers: 0 (free)
• Level-1 (L1) cache: 1
• Level-2 (L2) cache: 10x L1
• Main memory: 100x L1
[Diagram: CPU, registers (R), L1, L2, main memory]

CPU, Cache, and Memory
Caches designed to exploit data locality
• Temporal locality
• Spatial locality
[Diagram: CPU, registers, L1, L2, main memory]

Memory: Cache & Logical Flow
[Flowchart: In register? In L1? In L2? Copy to L2 (100), copy to L1 (10), copy to register (1), compute]

Memory: Cache & Physical Flow
[Diagram: main memory, L2 cache, L1 cache, page, registers, CPU]

Memory: Allocation & Pools
• List elements are often allocated as-needed
  – This leads to spatial disparity
• Mitigated by use of application memory management
  – Bad: malloc, malloc, malloc, malloc, ...
  – Good: pools - pool_init, pool_alloc, ...
• Graphics example:
  – Vertices, normals, textures, etc.

Memory: Graphics!
Vertex Arrays
[Chart: Vertex Array Cache Behavior. Time to traverse vs. number of array vertices, for Platform 0 and Platform 1, interleaved and non-interleaved]

Graphics: Pipe
[Diagram: FIFO and pipeline stages xf, light, clip, rast, fx, fops]
• xf: world to screen
• light: apply light
• clip: clip to view
• rast: convert to pixels
• fx: apply texture, etc.
• fops: test pixel ops

Graphics: Pipe & Akeley Taxonomy
• G - Generate geometric data
• T - Traverse data structures
• X - Transform primitives world to screen
• R - Rasterize triangles to pixels
• D - Display framebuffer on output device

Graphics: Hardware
Four types of hardware are common
• G-TXRD: all hardware
• GT-XRD
• GTX-RD
• GTXR-D: all software

Graphics: Performance
Benchmarks
• “Trust, but verify.” - an ex-president
Definitions
• Triangle rate: speed at which primitives are transformed (X)
• Fill rate: speed at which primitives are rasterized (R)
  – Depth complexity: number of times a pixel is filled
Caveats
• Quantization, fastpath

Graphics: Quantization
• Frame quantization is the result of swapbuffers occurring at the next vertical retrace.
  – Necessary to avoid image artifacts such as tearing
• Example: 100Hz display refresh
[Figure: frame timelines across t0 to t7, where tn is 1/100 second and each block is one graphics frame. Rates shown: no-sync 120 Hz; with sync 100 Hz, 50 Hz, 50 Hz, 33 Hz]

Graphics: Fastpath
Definition
• Fastpath: the most optimized path through graphics hardware
Example
• fast path: float verts, float norms, ABGR textures, z-test
• less fast path: float verts, float norms, RGBA textures, z-test

Graphics: Fastpath Points
• Fast path is often synonymous with ideal path.
  – Real usage of graphics falls on a continuum.
[Diagram: continuum from fast path (hardware) to slow path (software), with speed and quality axes]
Where is your application?
• Must quantify what hardware can do
  – Quality & speed

Graphics Hardware: Testing
Duplicate performance numbers simply:
• Good: build a simple test program
• Better: glPerf - http://www.spec.org
Maximize performance in an app:
• Good: Use fast API extensions
• Better: Create an “is-fast” test, use what is verified as fast

Graphics Hardware: “Is-Fast”
Test each platform to determine fast path
• Once, per-machine, test primitives and modes
  – Vertex array format, texture format, display list, etc.
• Store data in database
  – Detect hardware changes or time-to-live
• Read data from database at startup
  – Check database or re-generate data

Graphics Hardware: “Is-Fast” Pseudo-code
    if (new_machine() || hardware_changed()) {
        test_interesting_modes();
        store_in_database();
    } else {
        // have database entry
        get_performance_data_from_database();
    }
    // use the modes & primitives that are "fast" when rendering

Think Globally, Act Locally
Think globally
• Know the platforms & graphics hardware
• Use hardware effectively in your app
• Balance hardware utilization
Act locally
• Use in-cache data
• Understand hardware & graphics fastpaths
• Balance quality vs. performance

Software and System Performance
Thomas J. True, SGI

A Four Step Process
• Quantify
• System Evaluation
• Graphics Analysis
• Bottleneck Elimination

Quantify
Characterize
• Application Space
• Primitive Types
• Primitive Counts
• Rendering Characteristics
• Frame Rate

Quantify
Compare
[Chart: fill rate and triangle rate, my performance vs. ideal performance]

Examine System Configuration
Resources
• Memory
• Disk
Setup
• Display
• Network

Graphics Analysis: Ideal Performance
• Keep graphics pipeline full.
• 100% CPU utilization running application code.
• 100% graphics utilization.

Graphics Analysis: Graphics Bound
[Figure: two utilization gauges, each scaled 0 to 100]
• Graphics subsystem processes data slower than CPU can feed it.
• Graphics subsystem issues an interrupt which causes the CPU to stall.
• Data processing within application stops until graphics subsystem can again accept data.

Graphics Analysis
Geometry Limited
• Limited by the rate at which vertices can be transformed and clipped.
Fill Limited
• Limited by the rate at which transformed vertices can be rasterized.

Graphics Analysis: CPU Bound
[Figure: two utilization gauges, each scaled 0 to 100]
• CPU at 100% utilization but can’t feed graphics fast enough.
• Graphics subsystem at less than 100% utilization.
• All CPU cycles consumed by data processing.

Graphics Analysis: Determination Techniques
• Remove graphics API calls.
• Shrink graphics window.
• Reduce geometry processing requirements.
• Use system monitoring tool.

Graphics Analysis
[Flowchart, starting from a performance problem. Steps: remove rendering calls, remove graphics API calls, shrink graphics window, reduce geometry load, use system monitoring tool. Outcomes: not a graphics problem; graphics performance problem; graphics bound (fill limited or geometry limited); excessive or unexpected CPU activity; fallen off fast path. Edge legend: frame rate increase vs. no change in frame rate]

Graphics Analysis: Graphics Architecture GTXR-D (aka Dumb Frame Buffer)
• CPU does everything.
• Typically CPU bound.
• To remedy, buy a “real” graphics board.

Graphics Analysis: Graphics Architecture GTX-RD
• Screen space operations performed by graphics.
• Object-space to screen-space transform on host.
• Can easily become CPU bound.
  “Roughly 100 single-precision floating point operations are required to transform, light, clip test, project and map an object-space vertex to screenspace.” - K. Akeley & T. Jermoluk
• Beware of fast-path and slow-path issues.
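
To make the host-side cost on a GTX-RD system concrete, here is a minimal sketch of only the transform step, assuming a column-major 4x4 matrix as OpenGL stores it. The type `Vec4` and the functions `xform_vertex` and `xform_vertices` are hypothetical names for illustration, not part of any API. The matrix multiply alone is 16 multiplies and 12 adds per vertex; lighting, clip testing, and projection account for the rest of the roughly 100 operations quoted above. At that rate, one million vertices per second would need on the order of 100 MFLOPS of host throughput.

```c
#include <stddef.h>

/* Hypothetical vertex type, for illustration only. */
typedef struct { float x, y, z, w; } Vec4;

/* Multiply a column-major 4x4 matrix (OpenGL layout) by one vertex.
 * Cost: 16 multiplies + 12 adds = 28 floating-point operations. */
static Vec4 xform_vertex(const float m[16], Vec4 v)
{
    Vec4 r;
    r.x = m[0] * v.x + m[4] * v.y + m[8]  * v.z + m[12] * v.w;
    r.y = m[1] * v.x + m[5] * v.y + m[9]  * v.z + m[13] * v.w;
    r.z = m[2] * v.x + m[6] * v.y + m[10] * v.z + m[14] * v.w;
    r.w = m[3] * v.x + m[7] * v.y + m[11] * v.z + m[15] * v.w;
    return r;
}

/* Transform n vertices on the host, the work a GTX-RD system
 * leaves to the CPU before handing data to the rasterizer. */
static void xform_vertices(const float m[16], const Vec4 *in,
                           Vec4 *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = xform_vertex(m, in[i]);
}
```

Timing `xform_vertices` over a representative vertex count is one way to estimate how much of a frame the host transform consumes before deciding whether the application is CPU bound.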
Graphics Analysis: Graphics Architecture GTX-RD
• If Graphics Bound:
  – Reduce per-pixel operations.
  – Reduce depth complexity.
  – Use native-format data.
• If CPU Bound:
  – Reduce scene complexity.
  – Use more efficient graphics algorithms.

Graphics Analysis: Graphics Architecture GT-XRD
• Transformation and rasterization performed by graphics.
• Can be CPU or graphics bound.
• Beware of fast-path and slow-path issues.
• Subject to host bandwidth limitations.
• If Graphics Bound:
  – Move lighting back to CPU.
  – Use native data formats within application.
  – Use display lists or vertex arrays.
  – Use less expensive lighting modes.
• If CPU Bound:
  – Move lighting from CPU to graphics subsystem.
  – Do matrix operations in graphics hardware.
  – Profile in search of computational performance issues.

Bottleneck Elimination: Bottlenecks
• Understanding them is crucial to effective tuning.
• Will always exist; tune to balance.
• Not always a bad thing.

Bottleneck Elimination: Graphics
• Use native graphics formats.
• Remove excessive state changes.
• Package graphics primitives efficiently.
• Use textures that fit in texture cache.
• Don’t use unnecessary rendering modes.
• Decrease depth complexity.
• Cull out excessive geometry.

Bottleneck Elimination: Memory
• Don’t allocate memory in rendering loop.
• Avoid copying and repackaging of graphics data.
• Organize graphics data.
• Avoid memory fragmentation.

Bottleneck Elimination: Memory Bandwidth and Fragmentation
• Independent Triangles: 9 vertices, 504 bytes
• Triangle Strip: 5 vertices, 280 bytes
• Vertex Array: 5 vertices, 280 bytes
Vertex = RGBA + XYZW + XYZ + STR = 56 bytes

Bottleneck Elimination: Code and Language
• Use native data types.
• Avoid contention for a single shared resource.
• Avoid application bottlenecks in non-graphics code.
• Reduce API call overhead.

Bottleneck Elimination: API Call Overhead
• Independent Triangles: (XYZW + RGBA + XYZ + STR) * 9 vertices: 36 function calls
• Triangle Strips: (XYZW + RGBA + XYZ + STR) * 5 vertices: 20 function calls
• Vertex Array: 5 function calls
• Display List: 1 function call

Conclusion
Performance tuning is an iterative process
[Diagram: cycle through Quantify, System Evaluation, Graphics Analysis, Bottleneck Elimination]

Conclusion
It’s all about balance!

Profiling and Performance Analysis
Keith Cok, SGI

Profile and Performance Analysis
• Profiling points out code areas that take up most time
• Imperative for well balanced application
• Points out code and system bottlenecks

Two Methods of Software Profiling
Basic block
• A section of code that has one entry and one exit
• Measures ideal time
Statistical sampling
• Interrupts program execution and examines current location
• Measures actual CPU cycles spent executing a line of code

How Do You Profile Code?
• Compile/link with compiler optimizations turned on
  – cc foo.c -use_all_optimization_flags ....
• Instrument the code
  – Unix: pixie foo.exe -> foo.exe.pixie
  – Visual Studio: embedded in tool suite
• Run the application with relevant data sets
  – foo.exe.pixie -args -> produces results data file

Profiling: Finding the Hot Spot
Function list, in descending order by exclusive ideal time

          excl.%  cum.%  instructions    calls  function (dso: file, line)
    [1]   10.3%   10.3%    190583064
    [2]    8.9%   19.2%    173920781
    [3]    8.2%   27.4%    145950460
    [4]    5.9%   33.3%     97798122  1975976  __sin (libm.so: sin.c, 194)
    [5]    4.1%   37.4%     82310479
    [6]    3.4%   40.8%     50786176  1204269  __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)
    [7]    3.2%   44.0%     58099072
    [8]    3.1%   47.1%     53832546   290970  R_RecursiveWorldNode (foo: gl_rsurf.c, 894)
    [9]    3.1%   50.2%     43855299   437627  R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)
    [10]   2.8%   53.0%     44666700    11484  GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)

Other functions in the listing (calls, function):
       3203  S_Update_ (foo: snd_dma.c, 848)
     338787  R_RenderBrushPoly (foo: gl_rsurf.c, 641)
        240  GL_LoadTexture (foo: gl_draw.c, 990)
      16797  R_DrawAliasModel (foo: gl_rmain.c, 232)
      30981  EmitWaterPolys (foo: gl_warp.c, 187)

Profiling: Fixing the Hot Spot
What do you look for?
• Common sub-expressions
• Loop invariant code
• Repeated pointer de-referencing
• Global variables and cache misses
• “Thin” loops

Profiling Example
    // Code the old way
    19: void old_loop() {
    20:     sum = 0;
    21:     for (i = 0; i < NUM; i++)
    22:         sum += x[i];
    23:     printf("sum = %f\n", sum);
    24: }

    // Code the new way
    27: void new_loop() {
    28:     sum = 0;
    29:     ii = NUM%4;                      // leftover iterations
    30:     for (i = 0; i < ii; i++)
    31:         sum += x[i];
    32:     for (i = ii; i < NUM; i += 4) {  // unrolled by four
    33:         sum += x[i];
    34:         sum += x[i+1];
    35:         sum += x[i+2];
    36:         sum += x[i+3];
    37:     }
    38:     printf("sum = %f\n", sum);
    39: }

Profiling Example: Profile Results
Old loop:
          cycles  instructions  calls  function (dso: file, line)
    [1]     6160          6168      1  old_loop (blahdso.so: blahdso.c, 19)
    [2]     4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
New loop:
          cycles  instructions  calls  function (dso: file, line)
    [1]     4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
    [2]     4625          4891      1  new_loop (blahdso.so: blahdso.c, 27)

Profile Example: Line Analysis
Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0; i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i += 4)
         7            1  new_loop  ii = NUM%4;

Profile and Performance Analysis
Profile Example: Visual C++/Intel

    Function Time (s)  Percent of Run Time  Hit Count  Function
                0.410                 39.4          1  _old_loop
                0.249                 23.9          1  _new_loop

Statistical vs. Basic Block Profile
    void ijk_loop() {
        sum = 0;
        for (i = 0; i < YNUM; i++)
            for (j = 0; j < YNUM; j++)
                for (k = 0; k < YNUM; k++)
                    sum += y[i][j][k];
        printf("sum = %f\n", sum);
    }
    // loops kji and ikj as well

Basic Block vs.
Statistical Sampling
Basic Block:
          Percent    cycles      inst  calls  function
    [1]     25.3%  51141434  37101028      1  ijk_loop (foo.c, 47)
    [2]     25.3%  51141434  37101028      1  kji_loop (foo.c, 57)
    [3]     25.3%  51141434  37101028      1  ikj_loop (foo.c, 66)
Statistical Sampling:
          Percent  Samples  Procedure
    [1]     38.0%     2700  kji_loop (foo.c, 57)
    [2]     23.9%     1700  setup_data (foo.c, 15)
    [3]     19.7%     1400  ikj_loop (foo.c, 66)
    [4]     18.3%     1300  ijk_loop (foo.c, 47)

Now We Know About Hot Spots...
What do we do next?
• Use compilers to fine-tune code
• Use knowledge of language to optimize
• Hand-tune code
Profiling is fun, hard, and iterative, and it can be highly effective

Compiler and Language Issues
Keith Cok, SGI
Bob Kuehne, SGI

Compiler and Language Issues
Compiler Optimizations:
• Occur within a compromise of speed and memory space vs. time to compile and link
• An iterative process to discover what does and doesn’t work
• Important to keep at it

Compiler Issues: Trade-Offs
• Trade-offs:
  – Round-off vs. needed precision
  – Inter-procedural analysis vs. link time
  – Pointer aliasing vs. coding constraints
  – Optimizing for processor architectures vs. work of multiple binaries (support, test)
• Explore other compilers than your first choice
• Different source code - different flags

Compiler and Language Issues
Comments on 32 vs. 64 bit code
• Benefits of 64 bit code:
  – Increased address space
  – Higher precision
• Downsides of 64 bit code:
  – Application memory footprint
  – Need to port which can be difficult!
• Performance issues

Language Issues
• Data Management
• Unrolling loops
• Arrays
• Temporary variables
• Pointer aliasing

Language Issues: Data Management
Manipulate data structures efficiently since graphics IS data
    struct str {
        struct str *next;
        struct str *prev;
        large_type  foo;
        int         key;   // key separated from the links by foo
    };

    struct str {
        struct str *next;
        struct str *prev;
        int         key;   // key adjacent to the links
        large_type  foo;
    };

Language Issues: Data Management
Pack data efficiently
    struct foo {
        char  aa;   // 8 bits + 24 pad
        float bb;   // 32 bits
        char  cc;   // 8 bits + 24 pad
        float dd;   // 32 bits
        char  ee;   // 8 bits + 24 pad
    } foo_t;        // 160 bits

    struct foo_better {
        float bb;   // 32 bits
        char  aa;   // 8 bits
        char  cc;   // 8 bits
        char  ee;   // 8 bits + 8 pad
        float dd;   // 32 bits
    } foo_better_t; // 96 bits

Language Issues: Data Management
Examine your arrays and note their caching behavior
• Break up large arrays into smaller sub-arrays for better memory access patterns
• Understand the implications of data layout and cache behavior

Language Issues: Loop Unrolling
Profiling Example
    // Code the old way
    void old_loop() {
        sum = 0;
        for (i = 0; i < NUM; i++)
            sum += x[i];
        printf("sum = %f\n", sum);
    }

    // Code the new way
    void new_loop() {
        sum = 0;
        ii = NUM%4;                      // leftover iterations
        for (i = 0; i < ii; i++)
            sum += x[i];
        for (i = ii; i < NUM; i += 4) {  // unrolled by four
            sum += x[i];
            sum += x[i+1];
            sum += x[i+2];
            sum += x[i+3];
        }
        printf("sum = %f\n", sum);
    }

Language Issues: Loop Unrolling
Profile Example: Line Analysis
Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0; i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i += 4)
         7            1  new_loop  ii = NUM%4;

Language Issues: Loop Unrolling
Issues with loop unrolling:
• Code complexity
• Clutter
• Compiler may/may not do this
• Flags may affect compiler time spent
  optimizing
Only “thin” loops gain performance
Use application knowledge to take advantage of loop unrolling

Language Issues: Local temporary variables
Use local temporary variables to avoid repeatedly de-referencing a pointer structure
Example:
    x = global_ptr->record_str->a;
    y = global_ptr->record_str->b;
Use:
    tmp = global_ptr->record_str;
    x = tmp->a;
    y = tmp->b;

Language Issues: Using tmp vars for global vars within a function
    void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
    {
        FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
        int j;
        c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;

        // Accumulating through *np: every step reads and writes memory
        for (j = 0, np = new_pt; j < 4; j++) {
            op = old_pt;
            *np    = *op++ * *c1++;
            *np   += *op++ * *c2++;
            *np   += *op++ * *c3++;
            *np++ += *op * *c4++;
        }
    }

    // Same loop accumulating in a local temporary, stored once per element
        for (j = 0, np = new_pt; j < 4; j++) {
            op = old_pt;
            tmp  = *op++ * *c1++;
            tmp += *op++ * *c2++;
            tmp += *op++ * *c3++;
            *np++ = tmp + (*op * *c4++);
        }

Language Issues: Pointer Aliasing
• Pointers are aliases when they point to potentially overlapping regions of memory
• If regions never overlap, the compiler may optimize for this case. In general, though, it cannot prove this
• Compiler can't tell when pointers are aliased
• Use the restrict keyword or a compiler option

Language Issues: Pointer Aliasing
Unaliased pointers
Compilers may use:
- Parallelism
- Pipelining
Aliased pointers
[Diagram: in and out buffers, shown separate for unaliased pointers and overlapping for aliased pointers]

Language Issues: Pointer Aliasing
    void process_data(float * restrict in,
                      float * restrict out,
                      float gain)
    {
        int i;
        for (i = 0; i < NSAMPS; i++) {
            out[i] = in[i] * gain;
        }
    }

C++: General Issues
• Language features
  – RTTI, safe casts, etc.
• Use const, mutable, volatile, & inline
  – hints to compilers
• Object construction
  – arrays, default constructors, arguments, etc.
• Method invocation issues
  – operators, overloads, conversion, etc.
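
Returning to the pointer-aliasing slides, a small sketch may help show why a compiler has to be conservative when it cannot rule out overlap. The function `add_one` below is a hypothetical helper written for this illustration. It is plain C without `restrict`, so both calls described afterwards are well defined, and they give different results depending on whether the buffers overlap.

```c
#include <stddef.h>

/* Hypothetical helper: writes src[i] + 1 into dst[i], in index order.
 * No restrict qualifier, so src and dst are allowed to overlap. */
static void add_one(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1.0f;
}
```

With separate buffers, three zeros in `src` produce three ones in `dst`. With `dst` pointing one element past `src` in the same buffer, each store feeds the next load, so a zero-filled four-element buffer ends up as 0, 1, 2, 3. The compiler must preserve that store-then-load ordering unless it is told the pointers never overlap, which is the promise `restrict` makes; calling a `restrict`-qualified version with overlapping buffers would be undefined behavior.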
C++: Virtual Functions
• Good - used to invoke child method when managing baseclass handles
• Expensive - incur an additional pointer de-reference
  – one, find VTBL, two, find method, invoke
  – bad for caching
• Use when necessary, but not for common objects
  – Good for ‘large’ methods that do lots of work
  – Bad for ‘small’ methods, like a vertex query

C++: Exceptions & Templates
Exceptions
• Great for error checking
• Performance penalty
  – Additional stack information required
Templates
• Great for code re-use
• Memory penalty
  – Across libraries, across object files

Code & Language Issues: The End
Balance
• Know your compiler
  – Features & performance
• Know your language
  – Features & performance
• Know your app
  – Features & performance

Idioms and Application Architectures
Alan Commike, SGI

Starting Quote
The best tuned most efficient bubble sort is still a bubble sort. Additional tweaking won't improve performance.
Change The Algorithm!
- Commike ‘99

Introduction
To write an efficient graphics application, one must:
• Understand the platform
• Use graphics efficiently
• Write good code
Use efficient application structures and algorithms

Outline
• Outline
• Background
• Culling
• Level of Detail (LOD) management
• Application architectures

Application Architectures: Rendering Path
• Application work, culling, LOD, drawing
• Pipelined rendering path
[Diagram: stages App, Cull, LOD, Draw; Frame0, Frame1, and Frame2 each pass through the four stages, staggered across time steps T0 to T5]

Application Architectures: Target Frame Rate
A target frame rate attempts to bound the maximum render time
• Control Culling and LOD aggressiveness
• Maintain a constant frame rate
• Achieve an acceptable interactive frame rate

Graphics Idioms
• Culling –
Removing geometry that isn't visible
• Level of Detail Management
  – Reducing geometric complexity

Culling
Don’t draw what you can’t see

Culling: Culling Types
Use one. Use all. Pipeline them together.
• View Frustum Culling
• Backface Culling
• Contribution Culling
• Occlusion Culling

Culling: Bounding Volumes
Test against a bounding volume, not individual primitives
• Can be bounding sphere, box, oriented box, or any enclosing volume
• Hierarchical bounding volumes to reduce cull time
• Spheres are fast, boxes are more accurate
  – Use a combination of both

Culling: View Frustum
Graphics pipeline clips data that falls outside the View Frustum
If it will be clipped don’t bother drawing

Culling: View Frustum Usefulness
• Improves geometry rate
  – Culled vertices are not transformed, lit, and clipped
• Improves host download rate
  – Less data moved from memory into graphics
• Does not change fill rate
  – Triangles outside the View Frustum would not have been drawn anyway

Culling: View Frustum Implementation
• Transform vertices to clip coordinates (in OpenGL multiply by Model-View and Projection matrix)
• Check each vertex against View Frustum
• Geometry is either In, Out, or Partial
• Render In and Partial

Culling: Skip the Clip
In software transform systems (GTX-RD) skip the clip
• Partial and In geometry classified
  – Pipe renders Partial as usual
  – Pipe can render In without a View Frustum clip
• Might be a hint to render
• Can improve geometry rates if not already fill-limited

Culling: Backface
Only half of any closed polyhedron is visible at any one time
Don’t render what you can’t see

Culling: Backface Usefulness
• Improves fill rate when using a native implementation
  – Primitives are transformed and lit before culling
• Helps both geometry and fill with an application specific algorithm
  – More computationally expensive
  – Balance graphics and CPU work
• This may not work well when you can enter closed geometry or need two-sided lighting

Lava. Hot!
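
As an illustration of the application-specific backface cull mentioned above, the following is a minimal sketch of a per-triangle test done on the host in object space. It assumes counter-clockwise winding for front faces and an eye position already expressed in the object's coordinate system. The type `Vec3` and every function name here are hypothetical, chosen only for this example.

```c
/* Hypothetical 3-component vector, for illustration only. */
typedef struct { float x, y, z; } Vec3;

static Vec3 vec_sub(Vec3 a, Vec3 b)
{
    Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static Vec3 vec_cross(Vec3 a, Vec3 b)
{
    Vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static float vec_dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Returns 1 when the counter-clockwise triangle (v0, v1, v2) faces
 * away from the eye, 0 when it faces toward it. Edge-on triangles
 * (dot product of zero) are treated as back-facing. */
static int is_backfacing(Vec3 v0, Vec3 v1, Vec3 v2, Vec3 eye)
{
    Vec3 normal = vec_cross(vec_sub(v1, v0), vec_sub(v2, v0));
    Vec3 to_eye = vec_sub(eye, v0);
    return vec_dot(normal, to_eye) <= 0.0f;
}
```

Because this test costs a cross product and a dot product per triangle, it only pays off when the CPU has spare cycles and the pipe is geometry or fill limited, which is the balance the slide refers to. Precomputed face normals would remove the cross product at the cost of extra memory.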
Random Quote
Try not. Do, or do not. There is no try.
- Yoda ‘80

Culling: Contribution
If it’s too small to make a difference don’t render it

Culling: Contribution Usefulness
• Improves geometry rate
  – Culled vertices are not transformed, lit, and clipped
• Improves host download rate
  – Less data moved from memory into graphics
• Does not change fill rate
  – Screen space projection already minimal
  – Removes few pixels from rasterization stage

Culling: Contribution Implementation
Don’t render items that fall below a size threshold
• Screen space size of bounding volume
• A less computational approach
  – Distance to object combined with some notion of global object size

Culling: Occlusion
If you can’t see it don’t draw it
[Diagram: front and side views]

Culling: Occlusion Goals
Find the optimal set of occluders that will enable drawing the minimal number of occludees
• Occluders: The geometry that is visible
• Occludees: The geometry that is not visible
• Use general purpose occlusion culling algorithms
• Use application specific spatial knowledge if possible

Culling: Occlusion Culling Usefulness
• Can improve both transform-limited and fill-limited applications
• Computationally expensive
  – Beware of time trade-offs
• Possible hardware support

Culling: General Occlusion Culling
• Used for arbitrary scenes
• Can improve both transform limited and fill limited applications
• Computationally expensive for arbitrary scenes

Culling: Occlusion Spatial Partitioning
“Cell and Portal” Culling
• Spatial organization leads to Cells and Portals
• Games that move from room to room
• Architectural walkthroughs

LOD: Overview
After culling, need to draw what is left
• Still too much geometry:
  – Use multiple Levels of Detail, i.e.
    multi-resolution objects
• Match geometric complexity to visible on-screen space coverage
• Reduce geometric complexity to maintain target frame rate

LOD: Issues
• Generating LODs:
  – Height Fields vs 3D objects
  – View-Dependent: nice, but compute intensive
  – View-Independent: fast, memory intensive
• Need to decide which LOD level to use
  – Not trivial!
• Need smooth transitions between levels
  – Geomorphs

LOD: Height Fields
• Generally thought of as infinite terrain
• Specialized algorithms can be used

LOD: 3D Models
• General purpose simplification algorithm
• Can use on height fields also
• Some recent real-time view-dependent algorithms
• Also used for compression
[Figure: the same model at 1024, 256, 64, and 16 triangles]

LOD: When to switch LOD levels
The ability to generate LOD models is not sufficient on its own
• Need to know when to use which LOD level
  – Single constant hard metric: distance from eye
  – Multiple heuristics: cost, benefit, rankings
• Can bias LODs to ensure frame rate targets are reached

LOD: Level determination
• Determine system rendering characteristics
• Determine cost of rendering each object
• Render objects with highest benefit while remaining under the target frame rate
Level determination can be time consuming!
“take the time to time the time taken to reduce the rendering time”
Going, and going, and going...

LOD: Determining cost of rendering
Cost is affected by many factors
• Graphics hardware: published benchmarks, startup tests
• Number of vertices: primarily a function of LOD algorithm
• Rendering Quality: lighting, shading, wire frame, anti-aliasing, etc.
• Global Factors: total texture memory, dirty internal state

LOD: Benefit Function
Cost alone is not good enough, need benefit also
• Rendered size of object
• Error tolerance between LOD level and reference model
• Importance in scene
• Frame-to-frame coherency

LOD: The Optimal LODs
For all Objects, at each LOD Level, rendered with each RenderType
Maximize the Benefit function:
    Benefit(Object, Level, RenderType)
Subject to:
    total Cost(Object, Level, RenderType) <= target frame time (1 / TargetFrameRate)

LOD: Optimal Optimizations
• Simulated Annealing
• Monte Carlo Simulations
• Simplex Searches
Dude, Can you spare a few dozen CPUs?

LOD: Trade-offs
Don’t have enough time to run full LOD optimization problem and render the scene
• Simplify cost and benefit functions
• Simplify optimization problem into a ranking of Benefit/Cost
• Use frame-to-frame coherency
• Be sure to consider time taken to calculate LODs

Application Architectures: Multi-Threading
• More stages give more time to cull or generate LODs
• Each stage adds latency
[Diagram: Frame0, Frame1, and Frame2 each pass through App, Cull, LOD, Draw, staggered across time steps T0 to T5]

Application Architectures: Multi-Threading
• Hard part is data synchronization
• Watch out for memory bloat

Application Architectures: Scene Graphs
A scene graph is the basic data structure holding the description of your scene
• Cull-able, sort-able, and can contain multi-resolution objects
• Hierarchical Bounding Volumes
• Statistics gathering and timing infrastructure
• For large scenes can do memory management and database paging

Application Architectures: Trade-offs
• Quality
• Speed
• Memory
• Complexity

Conclusion:
Most importantly - Think about balance!
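
The LOD trade-off slide above suggests collapsing the full optimization into a ranking of Benefit/Cost. One way that could look is the greedy sketch below. It is an assumption-laden simplification, not the course's algorithm: each object contributes a single candidate, benefit and cost are supplied by the application, cost is an estimated render time in milliseconds, and `LodCandidate`, `by_ratio_desc`, and `select_lods` are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical record: one object rendered at one candidate LOD level. */
typedef struct {
    int   object_id;
    float benefit;    /* application-defined estimate                */
    float cost;       /* estimated render time in ms, must be > 0    */
    int   selected;   /* set by select_lods: 1 = draw, 0 = skip      */
} LodCandidate;

/* qsort comparator: highest benefit/cost ratio first. */
static int by_ratio_desc(const void *a, const void *b)
{
    const LodCandidate *ca = a;
    const LodCandidate *cb = b;
    float ra = ca->benefit / ca->cost;
    float rb = cb->benefit / cb->cost;
    return (ra < rb) - (ra > rb);
}

/* Greedy selection: rank by benefit/cost, then accept candidates in
 * rank order while they still fit in the frame budget.
 * Returns the total estimated cost of the accepted candidates. */
static float select_lods(LodCandidate *c, size_t n, float budget_ms)
{
    float used = 0.0f;
    qsort(c, n, sizeof *c, by_ratio_desc);
    for (size_t i = 0; i < n; i++) {
        c[i].selected = (used + c[i].cost <= budget_ms);
        if (c[i].selected)
            used += c[i].cost;
    }
    return used;
}
```

A fuller version would carry several levels per object and fall back to a cheaper level instead of skipping the object, and would reuse the previous frame's ranking to exploit frame-to-frame coherency. The time spent in `select_lods` itself counts against the frame budget.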
Performance Hints
Keith Cok, SGI

Performance Hints: Pipeline Management
• Avoid round trips to graphics server
  – Cache own state/attribute information
  – Avoid pipeline queries (e.g., glGet*)
  – Flush buffer efficiently (glFlush vs. glFinish)
• Reduce state changes. Sort by expense. For example, sort geometry by type (triangles, quads, etc.) and then by color
• Eliminate unused attributes

Performance Hints: Debugging
Detect graphic errors:
    #ifdef DEBUG
    #define GLEND() glEnd(); \
        { int err; \
          err = glGetError(); \
          if (err != GL_NO_ERROR) \
              printf("%s\n", gluErrorString(err)); \
          assert(err == GL_NO_ERROR); }
    #else
    #define GLEND() glEnd()
    #endif

Performance Hints: Geometry
• Maximize data between glBegin/glEnd
  – Sort geometry by type (triangle, quad, etc.) and group them together
  – Find best fit for length of glBegin/glEnd pair
• Use strip primitives (GL_TRIANGLE_STRIP...) to reduce geometry data sent to the pipeline
• Avoid GL_POLYGON. Use specific geometric primitives instead (GL_TRIANGLES, GL_QUADS, etc.)
• Use GL_FASTEST with glHint calls where possible

Performance Hints: Geometry
• Use flat display lists for static geometry. Deep display lists may induce unwanted memory thrashing
• Use API matrix operations instead of your own
• Use texture to simulate complex geometry
• Use vertex arrays. Test vertex, interleaved, precompiled arrays

Performance Hints: Geometry
• Pass one normal (not 3 or 4) per flat shaded polygon
• Use a data format suitable for quick transfer to the graphics subsystem
• Disable unneeded operations (alpha blending, depth, stencil, blending, dithering, fog, etc.)

Performance Hints: Lighting
• Reduce lighting requirements:
  – Use as few lights as possible
  – Use directional (infinite) lighting.
    Use glLightfv(GL_LIGHTn, GL_POSITION, {x,y,z,0});
  – Use positional lights rather than spot lights
  – Use one-sided lighting when possible (be aware of issues associated with normals)
  – Don’t change material properties frequently

Performance Hints: Lighting
• Use normalized normal vectors
  – Supply unit length vectors
  – Don’t enable GL_NORMALIZE
  – Don’t scale using model-view matrix
• Pre-multiply geometry, if possible

Performance Hints: Visuals/Pixel Formats
• Pick the correct visual. Use hardware accelerated visuals
• Structure windows and contexts to maximize performance (app may block after context swaps)
• Put GUI elements in overlay planes to avoid unwanted graphics window refreshes

Performance Hints: Buffers
• Turn off depth buffer when possible
• Use HW accelerated off-screen buffer for backing-store
• Use stencil buffer for interactive picking and quick re-render (see course notes for full algorithm)
• Use color/depth buffer data for interactive editing of complex scenes (see course notes for full algorithm)

Performance Hints: Textures
• Be aware of texture sizes
  – Reduce texture resolution
  – Use texture LOD extension (OpenGL 1.2)
• Use texture objects. Create textures once
• Don’t swap textures frequently, if possible
  – Mosaic multiple textures into one large texture
  – Sort geometry by texture

Performance Hints: Textures
• Use texture as an additional data lookup to simulate more complex data:
  – Lighting, geometry, color, clipping, application-space data
• Use glTexSubImage to replace part of a texture rather than creating a whole new texture
• Avoid expensive texture filter modes
• Use texture lookup tables instead of multi-channel textures

Conclusion
Know how your application works within the system
• Don’t let caches, latencies, bandwidths, etc. slow you down
• Know how fast you can go
• Identify system performance characteristics
• Work your compiler
• Get all you can out of the hardware

Questions and Answers