CS248: Graphics Performance, Debugging and Optimisation
Dave Oldcorn, AMD
November 13th 2007

Your Guest Instructor
• Back in the mists of time, I wrote games…
• The last ten years have all been about 3D hardware
• Since 2001 at ATI, joining forces with AMD last year
• Optimisation specialist: linking the software and the hardware
  – Tweaking games
  – Understanding the hardware
  – Driver performance
  – Shader code optimisation (I find assembly language fun)

Overview
• Three basic sections
  – GPU Architecture
  – Efficient OpenGL
  – Practical Optimisation and Debugging
• There's a lot in here
  – Broad overview of all issues
  – I've prioritised the biggest issues and the ones most likely to help with Project 3
  – More detail on GPU architecture is included as an appendix

GPU Architecture

Graphics hardware architecture
• Parallel computation
• All about pipelines
• The OpenGL vertex pipeline (shown on the slide) will be familiar…

Graphics hardware architecture
• Extend the top of the pipeline with some more implementation detail: Application → API → Video drivers (on the CPU) → GPU
• Ideally, every stage is working simultaneously
• Could also decompose into smaller blocks, and eventually into individual hardware pipeline stages
• As shown last week, the hardware implementation may be considerably more complex than a linear pipeline

Command buffers
• (Slide diagram: draw commands enter the GPU front end via the parser, then vertex assembly, vertex operations and primitive assembly)
• Data enters the GPU pipeline via command buffers containing state and draw commands
• A draw command is a packet of primitives
• It executes in the context of the current state, as set by glEnable, glBlendFunc, etc. (a minimal example follows below)
• The full set of state is often referred to as a state vector
• The driver translates API state into hardware state
• State changes may be pipelined; different parts of the GPU pipeline may be operating with different state vectors (even down to per-vertex data such as glColor)
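To make the state-plus-draw split concrete, here is a minimal sketch of the API-level view described above; the mesh variables (index_count, indices) are illustrative only. The state calls update the current state vector, and it is the draw call that actually emits a packet of primitives against that state; the driver turns both into packets in a command buffer.

  /* State changes: update the current state vector. */
  glEnable(GL_DEPTH_TEST);
  glEnable(GL_BLEND);
  glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

  /* Draw command: a packet of primitives, executed in the context of
     whatever state vector is current when it reaches the hardware. */
  glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, indices);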
Pipeline performance
• The performance of a pipelined system is measured by throughput and latency
  – Can subdivide at any level, from the full pipeline down to individual stages
• Throughput: the rate at which items enter and exit
• Latency: the time taken from entrance to exit
  – Latency is not typically a major issue for API users
  – It is a huge issue for GPU designers: even GPU-local memory reads may take hundreds of cycles
  – A substantial percentage of both design effort and silicon is devoted to latency compensation
  – The system will generally run at full throughput until the latency compensation is exceeded

Pipeline throughput
• Given a particular state vector, each part of the pipeline has its own throughput
• The throughput of a system can be no higher than that of its slowest part: this is a bottleneck
  – More generally, any stage whose input is ready but whose output is not is a bottleneck

Pipeline bottlenecks
• Consider the three-stage system on the slide, with items entering at 1 per clock:
  – Stage 1 (throughput 1/clock, latency 5 cycles) can run at one item per clock and is 100% utilised
  – Stage 2 (throughput 1 per 2 clocks, latency 10 cycles) can only accept on every other clock; it is still 100% utilised
  – Stage 3 (throughput 1/clock, latency 15 cycles) is therefore starved on half of the cycles it could be working: 50% utilised, producing results only on alternate clocks
• Although stage 3 has the longest latency, it has no effect on the throughput of the system

Pipeline bottlenecks
• A key subtlety: for this to work as shown, there must be load balancing between stages 1 and 2 (probably a FIFO input buffer)
• Items pass from stage 1 at one per clock and eventually queue; once the FIFO is full, the input buffer exerts backpressure on stage 1
  – This happens after equilibrium is reached
• The pipeline therefore runs at the speed of its slowest part as soon as the FIFO fills

Variable throughput
• In general, throughput is data dependent
  – Example: clipping is a complex operation which often isn't required
  – Example: texture fetch cost depends on the filtering chosen, which is itself data dependent
• Some pipeline stages have different rates at the input and the output
  – Example: back-face culling; primitive in, no primitive out
  – Example: rasterisation of primitives to fragments; few primitives in, many fragments out
• Buffering between stages takes up the slack

Pipeline bottlenecks
• A particular state vector will tend to have a characteristic set of bottlenecks
  – The input data also has an effect
• Small changes to the state vector can make substantial changes to the bottleneck
• As a state change filters through the pipeline, and for a short period afterwards, bottlenecks shift into the new equilibrium
  – For usual loads, where the render time is much larger than the pipeline depth, this transition time can be ignored
• It can be hard to determine bottlenecks if the states in the pipe are disparate (a smearing effect)

Pipeline bottlenecks
• There may be multiple bottlenecks if the throughput is not constant at all parts of the pipeline
  – In general it is not constant
• GPU buffering absorbs changes in load
  – Measured in tens or hundreds of cycles at best; the whole pipeline is thousands of cycles
• The bottleneck could be outside the GPU: application, driver, memory management…
• Bottleneck analysis is key to hardware performance
  – Not easy: bottlenecks are always present; separating expected and unexpected cases is the challenge

Flushes and synchronisation
• Some state cannot be pipelined, so a flush occurs
  – There are various localities of flush
  – For a whole-pipeline flush, the parser waits before allowing new data into the pipe
  – The CPU can carry on building and queuing command buffers
  – Low cost: roughly thousands of cycles (~5 µs?)
• Some operations can require the CPU to wait for the GPU
  – Example: the CPU wants to read memory the GPU is writing
  – This is a serialising event
  – Very expensive: wait for pipeline completion, flush all caches, plus the restart time taken to build the next command buffer
  – You can force this with glFinish: please don't!

Asynchronous system
• The process of rendering a typical game image is massively asynchronous; the possible asynchronous actors are:
  – Input / physics thread: runs continuously, using input and time (fixed or delta) to update the game world, typically including the scene graph; typical runtime 10–30 ms
  – Render thread: runs continuously to convert the scene graph to rendering commands; generally cannot start until the input/physics thread has processed the whole frame
  – GPU: runs on its command buffers
  – DAC: loops over the display at 60–100 Hz; a command buffer operation changes the display at the end of a render and is picked up at the start of the next frame (unless vsync is off)
• (Slide timeline: the Input, Render, GPU and DAC rows each work on a different frame at the same time; the shaded areas are the same frame)

Synchronisation
• GPUs aim to run just under two frames ahead
  – They block at SwapBuffers if there is another SwapBuffers in the pipe that has not yet been reached
• Reading any GPU memory on the CPU causes a sync
  – glReadPixels is one example; avoid it
• Writing to GPU memory generally does not
  – The GPU, driver and memory manager work together to do uploads without serialisation
  – No need to be unusually scared of glTexImage
• If you have to lock GPU memory, look for discard or write-only flags that allow asynchronous access (see the sketch below)
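To make the "discard or write-only" advice concrete, here is a minimal sketch of updating a dynamic buffer each frame without forcing a CPU-GPU sync. It uses the buffer object API introduced later in this talk, and dynamic_vbo, buf_size and new_verts are illustrative names; the key idea is that orphaning the old storage and mapping write-only lets the driver hand back fresh memory instead of waiting for the GPU to finish with the old data.

  /* Sketch: stream new vertex data into a buffer object asynchronously. */
  glBindBuffer(GL_ARRAY_BUFFER, dynamic_vbo);

  /* "Discard": orphan the old storage so the GPU can keep reading it while
     the driver gives us a fresh block to fill. */
  glBufferData(GL_ARRAY_BUFFER, buf_size, NULL, GL_STREAM_DRAW);

  /* "Write-only": promise not to read, so no download or wait is needed. */
  void *dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
  if (dst) {
      memcpy(dst, new_verts, buf_size);   /* string.h */
      glUnmapBuffer(GL_ARRAY_BUFFER);
  }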
Shaders
• Texture lookup operations are relatively expensive
  – Competition for the GPU or system bus, the cost of filtering, unpredictability
  – Some of this is only a latency issue – but latency is not important…
  – … until the buffering is exceeded
  – Latency more than doubles for dependent texture operations
• Prefer ALU maths to a texture lookup until the function becomes complex
  – You might replace very small textures with shader constants
• The shader – typically its texture operations – is likely to be the limiting factor on performance

Shaders
• Each shader runs at a particular frequency
  – Per-vertex and per-fragment now; per-primitive also exists; per-sample seems likely in the future
  – Constants calculated on the CPU can be viewed as another frequency (per draw packet)
  – Aim to do calculations at the lowest necessary frequency
• Issues to be aware of:
  – Data passed from the vertex to the fragment shader is interpolated linearly in the space of the primitive (i.e. with perspective correction), so interpolators can only be used where this is appropriate (linear or nearly so); high tessellation can be a workaround
  – Excessive use of interpolators can itself be a bottleneck; as a ballpark figure, up to two interpolators per texture fetch

Shader constants
• Shader constants are a large part of the state vector
  – Updating hundreds of them on every draw call will not be free
• Prefer inline constants (known at compile time) to state-vector constants (see the sketch below)
  – This gives the compiler and the constant manager more information
• For the same reason, avoid parameterising for its own sake
• Don't switch shader just to change a couple of constants
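The sketch below illustrates both "lowest necessary frequency" and the preference for inline constants; it is an assumed example, not code from the talk, and the attenuation curve, names and the 0.01 literal are invented for illustration. The attenuation is linear in eye depth, so it is computed per-vertex and interpolated rather than recomputed per-fragment, and the scale factor is a compile-time literal rather than a uniform.

  /* GLSL sources embedded as C strings; names and maths are illustrative. */
  static const char *vs_src =
      "varying float atten;                                   \n"
      "void main() {                                          \n"
      "    vec4 eye_pos = gl_ModelViewMatrix * gl_Vertex;     \n"
      "    atten = clamp(1.0 + 0.01 * eye_pos.z, 0.0, 1.0);   \n"  /* per-vertex, linear in depth */
      "    gl_Position = ftransform();                        \n"
      "    gl_TexCoord[0] = gl_MultiTexCoord0;                \n"
      "}                                                      \n";

  static const char *fs_src =
      "uniform sampler2D base_map;                            \n"
      "varying float atten;                                   \n"
      "void main() {                                          \n"
      "    vec4 base = texture2D(base_map, gl_TexCoord[0].xy);\n"
      "    gl_FragColor = base * atten;                       \n"  /* cheap per-fragment work */
      "}                                                      \n";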
Efficient OpenGL

Efficient OpenGL
• This is a data processing issue
• What data does the GPU need to render a scene?
  – State data, texture data, vertices / primitives
• CPU-side performance can easily be dominated by inefficient management of this data
• Of the three, vertex data is the most problematic:

  Type of data   Volume (per frame)   Rate of change
  State          Low (~kB)            Very high
  Vertex         Med-high (~MB)       Low-med
  Texture        Very high (~GB)      Very low

Efficient vertex data
• The application needs to feed mesh data in somehow
• GL provides two basic methods
  – glBegin/glEnd (known as 'immediate mode')
  – Vertex arrays
• Immediate mode is easy to use but has high overheads
  – Many tiny, unaligned copies
  – Non-'v' forms imply extra copies
  – The command stream is unpredictable and irregular

  glBegin(GL_TRIANGLE_FAN);
  glColor4f(1,1,1,1);
  glVertex3f(0,0,0);   // position + colour
  glVertex3f(0,1,0);   // position only
  glColor4f(1,0,0,1);
  glVertex3f(1,1,0);   // position + colour
  glVertex3f(1,0,0);   // position only
  glEnd();

Vertex arrays
• Vertex arrays are an alternative
  – The application probably has its data in arrays somewhere, so let GL read them en masse
  – glVertexPointer, glColorPointer, etc. specify the arrays
  – glDrawElements issues a draw command; it takes an index list
  – Primitives are drawn using the indices into the arrays set up by the gl*Pointer commands

  glVertexPointer(3, GL_FLOAT, 16, vertex_array);
  glColorPointer(4, GL_UNSIGNED_BYTE, 0, color_array);
  glEnableClientState(GL_VERTEX_ARRAY);
  glEnableClientState(GL_COLOR_ARRAY);
  glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);

• Easier for the driver and GPU to handle
  – The state vector is instantiated at the glDrawElements command
  – The GPU can process all the primitives in a single draw packet
Vertex arrays
• Did you hear a "but"?
• The vertex data still belongs to the application
  – Until the glDrawElements call is entered, the GPU knows nothing of the data
  – After the call completes, the app can change the data
  – Therefore the driver must copy the data on every glDrawElements call
  – Even if the data never changes – the GL can't know
• Wouldn't it be great if we could avoid the copy?
  – We don't supply textures on every call; we just upload them to the GPU and let the driver manage them…

Buffer Objects
• This facility is provided by the Vertex Buffer Objects (VBO) extension
  – Allows the creation of buffer objects in GPU memory, with access mediated by the driver
  – Data can be uploaded at any time with glBufferData
  – As with glTexImage, this is done through the command buffer to avoid serialisation
  – BindBuffer <-> BindTexture, BufferData <-> TexImage

  // During program initialisation
  glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer);   // a buffer name from glGenBuffers
  glBufferData(GL_ARRAY_BUFFER, 16*4*sizeof(GLfloat), vertex_array, GL_STATIC_DRAW);
  ...
  // In the render loop
  glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer);
  glVertexPointer(3, GL_FLOAT, 16, 0);             // offset into the buffer, not a pointer
  glEnableClientState(GL_VERTEX_ARRAY);
  glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);

Index data
• glDrawElements now only needs to send the indices
• Actually we can optimise that away too: element arrays allow buffer objects to contain index data
  – Index data is far smaller in volume, and tends to come in larger batches if state changes are minimised, so this can be over-optimisation
• Keep batches as large as possible
  – Keep state changes to a minimum
• Primarily use triangle lists
  – Don't mess with locality of reference
  – Strips can be marginally more efficient

Display Lists
• These offer the driver the opportunity for unlimited optimisation
• It's hard for the driver to do
  – The list can contain literally any GL command
• Not recommended for games or other consumer apps
• Professional GL apps do make heavy use of display lists (and immediate mode)
  – The effort required to optimise these efficiently is one reason professional GL cards are more expensive

Visibility optimisations
• It's far more efficient not to render something at all
• Try to avoid sending primitives that can't be seen
  – Not in the view frustum
  – Obscured
• Or send them, but have them rejected at some early point in the pipeline
  – Cull primitives before rasterisation
  – Reject fragments before shading

Bounds
• Use bounding boxes or spheres to reject objects wholly outside the view frustum (see the sketch below)
• Optimal methods for using these were covered in lecture 11
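Here is a minimal sketch of the bounding-sphere version of that test. It assumes the six frustum planes have already been extracted into (normal, d) form with the normals pointing into the frustum; the Plane struct and function name are illustrative.

  typedef struct { float nx, ny, nz, d; } Plane;   /* plane: n.p + d = 0, n points inside */

  /* Returns 0 if the sphere is entirely outside some frustum plane (cull it),
     1 otherwise (draw it; it may still be only partially inside). */
  int sphere_in_frustum(const Plane planes[6], float cx, float cy, float cz, float radius)
  {
      for (int i = 0; i < 6; ++i) {
          float dist = planes[i].nx * cx + planes[i].ny * cy + planes[i].nz * cz + planes[i].d;
          if (dist < -radius)
              return 0;   /* completely on the outside of this plane */
      }
      return 1;
  }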
Occlusion culling
• PVS (Potentially Visible Set) culling
  – For each location in the set of locations, store which other locations might be visible
  – Precalculate before the render process starts
• (Slide diagram: three rooms, A, B and C) If you are standing anywhere in A, you absolutely cannot see C, and vice versa
  – View frustum checks cannot solve this part of the problem; consider the position of the observer shown
  – A frustum test is still useful: if the observer were standing in B looking the same way, bounds could cull C
• Very effective for room-based games; not so useful for outdoor games
  – Fewer large-scale occluders

Other visibility methods
• Portals – as discussed in lecture 11
• BSP – Binary Space Partition – trees
  – A complex but efficient way to store large static worlds for fast frustum visibility calculations
  – Combine with PVS and portals; all need a precalculation phase
• Abrash, Graphics Programming Black Book, ch. 59–64 and 70
  – Detailed information on these and on other research he and John Carmack did on visibility while developing Quake
  – Still in use today in modern FPS games (with many enhancements!)

Model LOD
• If you need to render something, render less of it
• Demonstrated two weeks ago: a model close to the camera requires many triangles
  – Carry reduced-detail models and select one on each render
  – Like mipmapping, the memory cost is not prohibitive
  – Target sizes near the GPU's high-efficiency ~100-pixel triangle region
• Visualise with wireframe

Model LOD
• Non-trivial to implement
  – Popping is a well-known issue; morphing or blending are common solutions
  – Must generate the reduced-detail models
  – Can reuse the vertex data and just change the indices
• Terrain offers particular challenges
  – LOD systems are essential for really large worlds
  – Terrain tiles must match up between different LODs
• LOD can also solve sampling issues
  – As with undersampled textures, semi-random triangles can be picked; this occurs when triangles are smaller than 1 pixel

GPU primitive culling
• Degenerate primitives (example: triangles with two indices the same) are culled at index fetch
• A primitive with all vertices outside the same clip plane is culled
• Back-face culling is a simple optimisation and should be used for all closed opaque models
• Zero-area triangles are culled before rasterisation
  – This is rarely usefully exploitable
• Scissor rectangles cull large parts of primitives during rasterisation

GPU Z rejection
• The Z test can occur before shading
  – This reduces colour read/write load as well
• Some states inhibit the early Z test
  – Writing Z in the shader, obviously
  – Gating the Z update in the shader (pixel kill / alpha test with Z write)
  – Alpha test sounds like an optimisation, but it only saves colour read/write; use it for visual effect, not performance
  – A shader kill acts as a shader conditional
• The Z unit can reject at hundreds of pixels per clock
  – The accept rate is lower (at the very least, Z has to be written) but as fast or faster than any other post-rasteriser operation
• Stencil usually rejects at Z rates
  – Having a stencil op that does something implies a stencil write

Early Z rejection
• Draw opaque geometry in roughly front-to-back order
  – Do not work too hard to make this perfect; that's what the Z buffer was created for in the first place
  – Do not draw the sky first. Please!
  – This assumes you're bottlenecked in the shader
• Consider a Z pass (see the sketch below)
  – If the fragment shaders are very expensive
  – If at any point while rendering the colour buffer you need some algorithm that requires the Z buffer
  – Disable colour writes (glColorMask) or fill the colour buffer with something cheap but useful (example: ambient lighting)
  – Invariance issues should be rare nowadays (but be aware)
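A minimal sketch of the Z-only pass described above; draw_opaque_geometry() is an illustrative placeholder for whatever submits the scene, and the same geometry path must be used in both passes.

  /* Pass 1: lay down depth only. Colour writes are masked off, so the only
     cost is Z and the expensive fragment shading is skipped. */
  glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
  glDepthMask(GL_TRUE);
  glDepthFunc(GL_LESS);
  draw_opaque_geometry();              /* illustrative placeholder */

  /* Pass 2: shade only the visible fragments. Depth is already correct, so
     depth writes can be off; GL_EQUAL (or GL_LEQUAL if invariance is a worry)
     rejects everything that lost the depth test in pass 1. */
  glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
  glDepthMask(GL_FALSE);
  glDepthFunc(GL_EQUAL);
  draw_opaque_geometry();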
Shader conditionals
• These can also reduce shader load
• Treat with care…
• Use them mostly for high-coherency data
  – The conditional is unlikely to have per-pixel granularity
  – An if-then-else clause can end up executing both branches
• For low-coherency data, prefer conditional-move type operations
• Typically the shader compiler and optimiser can't know much about the likely coherency
  – So it guesses

Triangle sizes
• Larger triangles are more efficient than small ones
• Rules of thumb:
  – Over 1000 pixels is large
  – 100-pixel triangles are considered typical, and there the GPU should be in the ballpark of its peak performance
  – Under 25 pixels is small
  – Tiny triangles are likely to cause granularity losses in the GPU
• Often the type of object and the size of its triangles are related
  – Example: world triangles tend to be larger than entities

Bump mapping
• Can trade geometric complexity for more expensive fragment shading
  – Textures in general offer this capability – light maps are an earlier example
• Having a normal map available in the fragment shader is useful for other reasons too
  – Per-pixel lighting is an obvious use
• Doom3 was an early pioneer: its polygon counts are low compared with other games of the time
  – Bump mapping makes this hard to see except on silhouette edges

Practical Optimisation and Debugging

Optimising Applications
• Always profile; never assume
• Target your optimisations
  – Better to get a small gain in something that takes half the time than a big gain in something that takes a couple of percent
  – Better to do easy things than hard things: "low-hanging fruit"

Instrumentation for debugging
• Logging
• Visualisation: make particular rendering (more) visible
• Simple interfaces into the high-level parts of the program to make low-level testing easier
  – 'God mode'
  – Skip to level N or to a subpart of the level
  – Saved games may seem to be an answer here, but minor changes during development usually break them
  – Metadata display
• Multiple monitors and remote debugging
  – Key for fullscreen applications
  – Useful to have a 'stable' dev machine and a separate debug target

Instrumentation for performance
• Feedback on what the performance actually is (see the sketch below)
  – A simple onscreen frames-per-second (FPS) and/or time-per-frame counter
  – Special benchmarking modes
• Modify the performance
  – Skip particular rendering passes
  – Add known extra load
  – Examples: new entities, particle system load, forcing postprocessing effects on
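A minimal sketch of such a counter, assuming a POSIX clock_gettime timer; app_running, render_frame, draw_overlay_text and swap_buffers are illustrative placeholders for whatever your framework provides. It reports milliseconds per frame as well as FPS, since frame time is the linear quantity, and measures swap-to-swap so the whole frame (CPU, GPU and present) is covered.

  #include <stdio.h>
  #include <time.h>

  static double now_ms(void)
  {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);   /* POSIX; substitute your platform's timer */
      return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
  }

  void render_loop(void)
  {
      char overlay[64] = "";
      double prev = now_ms();

      while (app_running()) {                /* placeholders: app_running, render_frame,
                                                draw_overlay_text, swap_buffers */
          render_frame();
          draw_overlay_text(overlay);        /* shows the previous frame's numbers */
          swap_buffers();                    /* the only real synchronisation point per frame */

          double now = now_ms();
          double ms  = now - prev;           /* swap-to-swap: the whole frame */
          prev = now;
          snprintf(overlay, sizeof(overlay), "%.2f ms (%.1f fps)",
                   ms, ms > 0.0 ? 1000.0 / ms : 0.0);
      }
  }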
Real-world example: the Doom3 engine
• Heavily instrumented, with a developer console accessed with ctrl-alt
• Most commands are prefixed according to their functional unit
  – r_ commands go to the renderer, s_ the sound system, sv_ the server, g_ the client (game), etc.
• Record demos; play them back with playdemo or timedemo
• Capture individual frames with demoshot for debugging or performance work
• Console commands can also be sent from the command line – essential for external tools
• Many debugging commands
  – noclip to fly anywhere on a level
  – r_showshadows 1 displays the shadow volumes
  – g_showPVS 1 shows the PVS regions at work

More Doom3 convenience features
• PAK files are just ZIP files
  – You can look at the ARB_fragment_program shaders Doom3 uses (glprogs/ directory in the first pakfile)
  – You can also modify them: real files (e.g. under the base/glprogs directory) override the pakfiles
• Human-readable configuration files
• TAB completion on the console
  – Long commands are not a problem – plus you can find the command you want!
• Key bindings

Doom3 render: a multipass process
1. Z pass: set the Z buffer for the frame
2. Lighting passes: for each light in the scene
   2A. Shadow pass: render the shadow volumes into the stencil buffer
   2B. Interaction pass: accumulate the contribution from this light into the framebuffer
       – Cheap Phong algorithm (per-pixel lighting with interpolated E; Prey calculates E per-pixel for better specular)
       – A vertex/fragment shader pair
3. Effects rendering: mostly blended geometry for explosions, smoke, decals, etc.
4. One or more postprocessing phases for refraction and other screen-space effects

Doom3 benchmarking tools
• Each render pass can be disabled from the console
  – r_skipinteractions, r_shadows, r_skippostprocess
• Benchmark each pass individually
• Worth considering render time rather than just FPS; time is the linear quantity

  Rendered         FPS     Frame time (ms)
  Everything       55.8    17.9
  - postproc       58.5    17.1
  - interactions   104.7   9.6
  - shadows        174.5   5.7

  Isolated pass    Pass time (ms)   Pass load
  Postproc         0.8              4%
  Interaction      7.5              42%
  Shadows          3.9              22%
  The rest         5.7              32%

Case study: the Doom3 interaction shader
• The shader has 7 texture lookups
  – It is texture limited on most GPUs
  – One of them was a simple function texture
  – Probably originally a point of customisation, but unused
• We tested the gain from eliminating the lookup (an illustrative sketch follows)
  – First replaced it with a constant – note: not 0 or 1, which might have allowed the optimiser to eliminate other code
  – This provided the expected ~15% gain for the pass
  – Then replaced it with a couple of scalar ALU instructions
  – The gain was still the same, as the scalar ALU scheduled into gaps in the existing shader
• Quake4 and later games all picked up the change
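The sketch below illustrates the same idea in GLSL rather than the actual Doom3 ARB fragment program: a 1D "function texture" encoding a falloff curve replaced by a little ALU maths. The names, the curve and the exponent are assumptions made for illustration only.

  /* Before: one extra texture fetch just to evaluate a simple function. */
  static const char *fs_lookup =
      "uniform sampler1D falloff_map;                       \n"
      "varying float n_dot_h;                               \n"
      "void main() {                                        \n"
      "    float spec = texture1D(falloff_map, n_dot_h).r;  \n"   /* 1 texture fetch */
      "    gl_FragColor = vec4(spec);                       \n"
      "}                                                    \n";

  /* After: a few ALU instructions, which can schedule into gaps left by
     the shader's other texture fetches. */
  static const char *fs_alu =
      "varying float n_dot_h;                               \n"
      "void main() {                                        \n"
      "    float spec = pow(max(n_dot_h, 0.0), 16.0);       \n"   /* ALU instead of a lookup */
      "    gl_FragColor = vec4(spec);                       \n"
      "}                                                    \n";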
Instrumenting applications
• Be wary of profiling API calls
  – It's an asynchronous system; SwapBuffers is probably the only point of synchronisation
  – You can't easily measure hardware performance at a finer granularity than a frame
  – Don't try to profile the cost of rendering a mesh by timing DrawElements; that only measures the time taken to validate state and fill the command buffer
  – Which isn't to say that's never useful information

Instrumenting applications
• Don't over-profile
  – QueryPerformanceCounter has a cost; even RDTSC does
• Try to look at the high level and in broad terms first
  – "30% physics, 20% walking the scene graph, 30% in the driver, 20% waiting for end of frame"
  – Rather than "15.26% inside DrawElements"
• Aim to be GPU limited, then optimise the GPU workload
  – Don't waste time optimising CPU code if it's waiting for the GPU
  – Iterate as the GPU workload becomes more optimal
• Try to avoid compromising readability for performance
  – Rarely necessary
  – Download the Quake 3 source to see how clear really fast code can be
  – The games industry is really, incredibly, bad at this

Benchmark modes
• Timed runs on repeatable scenes; two options (see the sketch below):
  – Fix the number and exact content of the frames and time the run (could be one frame repeated N times)
  – Fix the run time, render frames as fast as possible, and count the frames
• The former is more repeatable; often essential if tools require multiple runs to accumulate data
• The latter is more convenient for benchmarkers and more realistic to how games behave in the real world
• Cynical reason for benchmark modes: such applications get more attention from the press (and hence from driver developers)
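A minimal sketch of the first option, a fixed run of N scripted frames timed from the outside; render_benchmark_frame, swap_buffers and now_ms (as in the earlier counter sketch) are illustrative placeholders. The single glFinish at the very end is there only so the GPU's work on the last frame is included in the measurement.

  #include <stdio.h>

  #define BENCH_FRAMES 500

  void run_benchmark(void)
  {
      /* Warm-up frame so caches, shaders and textures are resident. */
      render_benchmark_frame(0);
      swap_buffers();

      double start = now_ms();
      for (int i = 0; i < BENCH_FRAMES; ++i) {
          render_benchmark_frame(i);   /* frame content fixed by a script, not by wall time */
          swap_buffers();
      }
      glFinish();                      /* once, at the end, so the GPU's last frame is counted */
      double total = now_ms() - start;

      printf("%d frames in %.1f ms: %.2f ms/frame (%.1f fps)\n",
             BENCH_FRAMES, total, total / BENCH_FRAMES, BENCH_FRAMES * 1000.0 / total);
  }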
CodeAnalyst
• CodeAnalyst is an AMD tool that allows non-intrusive profiling of the application's CPU usage
  – A profiling session spawns the application under test
  – Make sure to avoid profiling startup and shutdown time
  – Can drill down to individual source lines in your code and show you their cost
  – Many examples on AMD's web site use it
  – Useful for all CPU-limited applications

CodeAnalyst hints
• A spike inside driver components may not be driver overhead
  – The driver is probably waiting on the GPU to meet the SwapBuffers limit
  – If there's not a large spike in the driver, it's probably the application that's the limit
  – This is complicated by the fact that the driver may choose to block if the GPU is not yet ready, so time may move from the driver to being reported against the 'system idle process', PID 0, or similar
• Vary the resolution and check how the traces change
  – If the relative time in the driver or system idle doesn't change, the application is not pixel limited
• Multicore systems make interpreting the results harder; you might be best off switching a core off if you can

GPUPerfStudio
• Lets you look inside the GPU
• Hardware performance counters
  – "3D busy" is the most obvious and often the most important
  – Vertex / pixel load can also be seen
• The bad news: GL support is not in the currently downloadable version 1.1. Coming soon…

Shader development
• AMD GPUShaderAnalyzer
  – Available to download from the AMD web site
  – Handles all GL shader types (GLSL, ARB_fp, ARB_vp)
  – A good development environment; no need to run your app to compile
  – Shows output code and statistics, including estimated cycle counts, for all AMD GPUs

Scalability
• Aim for consistent performance
  – Better to run at 30 fps continuously than to oscillate wildly between 15 fps and 100 fps
  – Target worst-case scenes; you will need headroom to guarantee 60 fps
• Is a particular gain useful?
  – A 4% speedup won't help anyone play your game
  – Five 4% speedups would, though
  – Gains in a lesser component allow more use of that component

Scalability
• The PC environment is a huge scalability challenge
  – The matrix of CPUs, GPUs and render resolutions is huge
  – Performance is in tension with image quality
  – Adjust quality to scale with GPU power and to set higher loads – when CPU limited, more pixels probably have no cost
  – Adjust quality while profiling: resolution (or clock) scaling to test whether you are CPU or GPU limited
• Consoles have it easier: more fixed in every way
  – Still need headroom, just less of it
  – They now have resolution scaling issues too: five TV resolutions (NTSC 480i, PAL 576i, 720p, 1080i/p); 60 Hz vs 50 Hz is a headache here

Caveats on optimisation
• Windowed mode
  – GPUs can behave differently in windowed mode than in fullscreen mode
  – Windowed should still be your primary development mode unless you have remote debugging
• Front-buffer rendering
  – May be useful for debugging, but can have similar performance implications
• Avoid misusing benchmarks
  – Repeat runs – make sure everything's 'warm'

Guidelines for Project 3
• Concentrate on the scene graph first, the GPU second, CPU cycle-picking last
  – Look for algorithms that cull monsters, trees and rooms rather than triangles or pixels
• Work on model or texture data on the GPU, not the CPU
  – Primarily, use the shader to do the work
  – Anywhere the index data, primitive count and connectivity don't change is a candidate
  – If you have to generate a texture, consider using the GPU

Guidelines for Project 3
• Short of time to write shaders? Write a few shaders that you use a lot
• Don't try to do everything in this lecture
  – Many techniques won't apply to your specific case
  – Even those that do often won't matter
  – Profile-guided optimisation!

Headline performance items
• Scene graph optimisations: visibility culling, model LOD
• Don't touch model data on the CPU unless the algorithm absolutely requires it
• Use vertex arrays for complex mesh data (> 10 primitives); store static data in VBOs
• Use mipmaps for all static textures; avoid undersampling textures without mipmaps (see the texture-setup sketch below)
• Render roughly front to back; don't kill yourself trying, but give it a go for the largest geometry; draw the sky last!
• Use compressed textures by default; only disable them if artifacts appear
• Disable unnecessary alpha testing; don't do kills in shaders unless you have to
• Move work from fragment to vertex shaders where possible
• Prefer moderate maths to texture lookups, particularly if the lookups increase the dependent-fetch level
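To make the mipmap and compression items concrete, here is a minimal static-texture upload sketch for the GL 1.4/2.x era this talk targets; width, height and pixels are illustrative, and GL_GENERATE_MIPMAP plus the generic GL_COMPRESSED_RGBA internal format are one simple way to get both behaviours from the driver.

  GLuint tex;
  glGenTextures(1, &tex);
  glBindTexture(GL_TEXTURE_2D, tex);

  /* Mipmapped minification filter, so minified use doesn't undersample. */
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

  /* GL 1.4-era automatic mipmap generation at upload time. */
  glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);

  /* Generic compressed internal format (ARB_texture_compression): the driver
     picks a compressed format; fall back to GL_RGBA8 if artifacts appear. */
  glTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA,
               width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);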
Further reading
• Abrash, Mike: The Graphics Programming Black Book
  – Even in 1997, the asm and register-programming sections were dated
  – Much of the Quake documentation isn't: clear explanations of BSP, PVS and some material on portals
  – The rest is still worth reading to see the mindset; skip the asm-specific bits and concentrate on the thought process
  – Chapter 1 and chapter 70 are required reading
• Stencil shadows: the Wikipedia page has many links

Samples and tools
• http://ati.amd.com/developer/
  – GPUPerfStudio, GPUShaderAnalyzer and The Compressonator
  – Tootle is also interesting: it optimises meshes both for the vertex cache and for 'internal' front-to-backness
  – Many other samples, documents and tools
• http://www.amd.com/codeanalyst

Questions
• If we have time…

Appendix
• Background information on more aspects of the GPU
• a.k.a. "The slides I knew I didn't have time to go through"

Texture and rendertarget tiling
• Memory interface efficiency is mostly determined by burst sizes
  – The more useful memory fetched in one go, the better
  – Avoid fetching anything that isn't then used
  – This is why mipmapping is so important: minifying a texture implies fetching memory that isn't then used
• Rearranging memory into tiles increases locality of reference
  – 64 bytes might contain 4x4 pixels instead of 16x1
  – The format is transparent to the application

Texture compression
• GL_ARB_texture_compression
• The S3TC / DXTC / BC algorithm is a high-quality method for typical image textures
  – Designed such that the artifacts introduced by lossy compression tend to be smoothed out by texture filtering
  – Function textures and unusual-use textures may not reach acceptable quality
  – Rearranging components can help
  – Use high-quality compressors, e.g. The Compressonator
• Compression isn't just about memory bandwidth
  – It reduces effective latency (one fetch brings in more useful texels)
  – It effectively increases the texture cache size

Texture filtering
• A bilinear-filtered sample is the common basic unit of work for a texture unit
  – Point sampling is unlikely to be any faster than bilinear; you can make this work for you in image-processing shaders (rather than point sampling and doing a constant weighted sum yourself)
  – Each additional bilinear sample for trilinear or anisotropic filtering probably consumes additional time
• Smart algorithms ensure that only the needed samples are taken
  – No need for trilinear if magnifying
  – No need for anisotropy if square-on; example: walls tend to need less anisotropy than floors
• Gradient calculations may be dynamic
  – Necessary to handle dependent texture reads
  – Be wary with dependency; the gradients can be unpredictable

Render to texture
• Useful for generating extra views or for postprocessing
  – Example: a mirror in a driving game
  – Example: postprocessing for refraction
• glCopyTexImage copies the framebuffer to a texture (see the sketch below)
  – CPU-GPU serialisation is not implied; this can probably be queued into a command buffer
• Other methods exist, such as pbuffers and the framebuffer extensions
  – Can be slightly more efficient
  – Can return to a rendertarget after rendering to another
  – More complex; don't use them without good reason
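A minimal sketch of the glCopyTex* path for the mirror example; mirror_tex, mirror_w/h, window_w/h and the draw_* calls are illustrative, and the texture is assumed to have been created at mirror_w x mirror_h beforehand.

  /* 1. Render the mirror's view into a corner of the framebuffer. */
  glViewport(0, 0, mirror_w, mirror_h);
  draw_mirror_view();                         /* illustrative placeholder */

  /* 2. Copy that region into the texture. The copy is queued like any other
        command; it does not force a CPU-GPU sync. */
  glBindTexture(GL_TEXTURE_2D, mirror_tex);
  glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, mirror_w, mirror_h);

  /* 3. Render the main view, with the mirror surface textured from mirror_tex. */
  glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
  glViewport(0, 0, window_w, window_h);
  draw_main_view();                           /* illustrative placeholder */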
Multisample antialiasing
• The key gain is to run the fragment shader at pixel frequency rather than sample frequency
• It also saves memory bandwidth; Z and colour can be compressed
  – The buffer may need to be resolved to an uncompressed buffer for display, or if it is used as a texture
• Triangle size may be worth extra consideration with MSAA; the framebuffer and Z compression rate is likely to be roughly inversely proportional to the number of visible edges in the scene

Caching
• There are many caches inside the GPU
• They are different from what you might be familiar with on a CPU
  – More about memory bursts and latency compensation than about reuse
  – In general you do need to hit the memory
  – Example: texture mapping the whole framebuffer at 1:1; every pixel and texel will be touched exactly once
  – Therefore, be pessimistic: assume this
  – Choose to compensate memory latency with large buffers, rather than using the cache to dodge the accesses
• In a few places, short-term 'reuse' is critical
  – Bilinear filtering is the most obvious case

Caching
• There can still be advantages to avoiding cycling
  – This used to be a big thing, particularly in the days of visible caching (software controlled rather than automatic)
  – It led to the sorting policy of a hard sort by material
  – Nowadays it is far less important, hence the rough sort by depth
  – In some pathological circumstances a sort by shader and depth (or a Z pass followed by a sort by shader) might be more efficient

Disclaimer & Attribution
• DISCLAIMER
  – The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
  – AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
  – AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
• ATTRIBUTION
  – © 2007 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.