“The Slow Game Performance Of Life”
Understanding Consumer Hardware And Performance Coding
Allan Murphy
Senior Software Development Engineer
XNA Developer Connection, Microsoft

Hello
  So… who exactly am I? And what am I doing here?
  Firstly, hands up who…
  - Has heavily optimized an application
  - Hasn’t, and doesn’t care
  - Is actually here and alive
  - Is hungover and mainly hoping for the answers for the group assignment

Hello
  Duncan let me speak today because…
  - Career spent on performance hardware
  - Experience with a variety of consoles
  - Have managed teams building game engines
  - Still have those photos of Duncan… with Doug, the West Highland Terrier
  - I did my degree at Strathclyde
    - Computer architecture
    - Low level programming
    - Will Optimize For Money

Previous Experience
  It’s not all about me… except this bit
  - Strathclyde: The Game Of Life assignment
  Left Strathclyde
  - Did database analysis, hated it
  - Worked in telecoms, hated it
  - Moved to 3 person game company after Uni… “Until I could find a proper job”
  Immediately
  - Paid enormous fortune
  - Didn’t wear a suit, worked in games
  - Bought first Ferrari 3 months
  - Had more than 1 girlfriend

Previous Experience
  - 2 years PC engine development
    - 2D 640x480 bitmap graphics
    - C, C++, 80x86 (486, Pentium)
  - 3 years at Sony
    - 3 years PS1 3rd party support and game dev
    - C, C++, MIPS R3000
  - 2 years at game developer in Glasgow
    - PS1 engine development
    - C, C++, MIPS R3000

Previous Experience
  - 6 years owning own developer
    - PS1, PS2, GC, Xbox 1, PC development
    - C, C++, MIPS R4400, VU assembly, HLSL
  - 2 years at Eurocom
    - PS3, 360, PC
    - C, C++, PowerPC, SPU assembly
  - 2 years at Microsoft
    - Xbox 360, some Windows
    - C, C++, PowerPC, HLSL

Previous Experience
  - Fair amount of optimization experience
  - Part of XDC group at Microsoft
    - 3rd party developer support group
    - Visited 60+ game developers
    - Performance reviews, consultancy, sample code, bespoke coding

Previous Experience
  “All this will go away soon” (1992, multiplying by 320 in x86 assembler)
  Surely it should, because…
  - Processor power increasing
  - Processor cost reducing
  - Compilers getting better

Console Hardware
  Console hardware is about…
  - Maximum performance… for minimum cost
  Often CPUs are…
  - Cut down production processors
  - Have bespoke processing hardware added, eg vector processing units
  - Attached to cheap memory and peripherals
  Consoles are sold at a loss

80x86 PC (circa mid-90s)
  [Block diagram: Pentium Pro 200MHz with FPU + MMX, 8KB L1 and 512KB L2 cache; main memory; AGP graphics card with VRAM, out to monitor]

PS1
  [Block diagram: MIPS R3000 at 33.868MHz with I$, D$ and GTE; 2MB main memory; GPU and MDEC with 1MB VRAM, out to telly]

Xbox 1
  [Block diagram: Pentium III 733MHz with FPU + MMX + SSE, L1 and 128KB L2 cache; 64MB UMA main memory; nVidia NV2A, out to telly]

PS2
  [Block diagram: Emotion Engine, MIPS R4400 at 294MHz with I$, D$, scratchpad and FPU, plus VU0 and VU1 with their own memories fed by VIF0/VIF1; 32MB main memory; GS via GIF with 4MB VRAM, out to telly]

Xbox 360
  [Block diagram: three PowerPC cores, each with FPU + VMX and L1, sharing a 1MB L2 cache; 512MB UMA main memory; ATI Xenos, out to telly]

PS3
  [Block diagram: Cell, a PPE with L1 and L2 cache plus eight SPEs each with a local store (LS), linked by a DMAC; 256MB main memory; nVidia RSX with 256MB VRAM, out to telly]

The Sad Truth About CPU Design
  In which programmers have to do the hard work again

This Is What You Want
  [Diagram: a single CPU attached to very wide, very fast main memory]

CPUs Not Getting Faster…
  [Diagram: Core 0, Core 1, Core 2 and a question mark where main memory should attach]
Fast Memory is Expensive…
  [Diagram: Core 0, Core 1 and Core 2 sharing a cache in front of main memory]

This Is What You Get…
  [Diagram: Core 0, Core 1 and Core 2, each with its own L1, store queue, load queue and store gather, behind RC machines and NCUs, sharing an L2 cache in front of main memory]

Multicore Strategy
  - Multicore is the future of performance
  - Scenario forced on unwilling game developers
    - Not necessarily a happy marriage
  - Game systems often highly… temporally connected, intertwined
  - Game devs often from a single thread background
  - Some tasks easy to parallelize: rendering, physics, effects, animation

Multicore Strategy
  - Single threaded
    - On Xbox 360 and PS3, this is a bad plan
  - Two main threads
    - Game logic update
    - Renderer submission
  - Two main threads + fixed tasks
    - As above, plus… fixed tasks in parallel, eg streaming, effects, audio

Multicore Strategy
  - Truly multi-threaded
    - Usually a main game logic thread
    - Main tasks sliced into independent pieces: rendering, physics, collision, effects…
    - Scheduler controls task execution
      - Tasks execute when preconditions met
      - Scheduler runs task on any available unit
  - Real trick is…
    - Balancing scheduling
    - Making sure tasks truly independent

Multicore Strategy
  Problems
  - Very hard to debug a task system… especially at sub-millisecond resolution
  - Balancing tasks and scheduler can be hard
  - Slicing data and tasks into pieces is tricky
  - Many race conditions very hard to find… never mind debug
  - Side effects in code not always obvious

Game Engine Concerns

Game Engine Coding
  Main concerns:
  - Speed
  - Feature set
  - Memory usage
  - Disc space for assets
  But most importantly… speed
  - Because this dictates game content
  - Slow means fewer features

Game Engine Coding
  Speed measured in…
  - Frames per second, or equivalently ms per frame
  - 33.33ms in a frame at 30fps
  - Game must perform its update in this time
    - Update all of the game’s systems
    - Set up and submit all rendering for the frame
    - Do all of the drawing for the previous frame

Game Engine Coding
  Critical choices for engine design
  - Algorithms: sorting, searching, pruning calculations
  - Rendering policy
  - Data structuring
  - How you bend the above around hardware
  Consoles have hardware acceleration…
  - …for certain tasks
  - …for certain data
  - …for certain data layouts

Game Engine Coding
  Example: VMX instructions on Xbox 360
  - SIMD instructions, operating on vectors
  - Vector can be 8, 16, or 32 bit values; 32 bit can be float or int
  - Multiply, add, shift, pack, unpack
  Great! But…
  - No divide, sqrt, or individual bit operations
  - Only aligned loading
  - Loading individual pieces to build a vector is expensive
  - Possible to lose the improvement easily

The 360 Core
  Remember, cheap hardware
  - Cut down PowerPC core
    - Missing out of order execution hardware
    - Missing store forwarding hardware
    - Ie, this is an in-order processor
  - Attached to slow memory
    - Means loading data is painful
    - Which in turn makes data layout critical

360 Core
  Very commonly occurring penalties:
  - Load-hit-store
  - L2 cache miss
  - Expensive instructions
  - Branch mispredict

Load-Hit-Store (LHS)
  What is it?
  - Storing to a memory location… then loading from it very shortly after
  What causes LHS?
  - Type casts, changing register set, aliasing
  - Passing by value, or by reference
  Why is it a problem?
  - On PC, the bullet is usually dodged by instruction re-ordering and store forwarding hardware
  - The in-order 360 core has neither, so the load stalls until the store completes

L2 Miss
  What is it?
  - Loading from a location not already in cache
  Why is it a problem?
  - Costs ~610 cycles to load a cache line
  - You can do a lot of work in 610 cycles
  What can we do about it?
  - Hot/cold split (sketched below)
  - Reduce in-memory data size
  - Use cache coherent structures
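To make the hot/cold split concrete, here is a minimal sketch; the struct and field names are invented for illustration, not taken from any real codebase. Fields touched every frame stay in a small “hot” structure that packs densely into the 360’s 128-byte cache lines, while rarely used data moves to a parallel “cold” array that is only fetched when actually needed.

// Hypothetical hot/cold split; all names here are illustrative.

// Before: one fat object. Every update drags spawn parameters and debug
// data through the cache just to tick the health field.
struct EnemyFat
{
    float x, y, z;           // touched every frame
    float health;            // touched every frame
    float spawnParams[16];   // touched once, at spawn
    char  debugName[32];     // touched almost never
};

// After: hot data is small and contiguous, so many more enemies fit in
// each 128-byte cache line; cold data lives in a parallel array.
struct EnemyHot  { float x, y, z; float health; };
struct EnemyCold { float spawnParams[16]; char debugName[32]; };

EnemyHot  gEnemiesHot[1024];   // walked linearly every frame
EnemyCold gEnemiesCold[1024];  // indexed only on spawn and debug paths

void UpdateEnemies(float dt)
{
    for (int i = 0; i < 1024; ++i)
        gEnemiesHot[i].health -= dt;   // only hot cache lines are fetched
}

With EnemyFat, a 128-byte cache line holds barely one enemy’s hot data; with EnemyHot, it holds eight. That jump is exactly what the cache line usage figure in PIX is measuring.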
Expensive Instructions
  What is it?
  - Certain instructions are not pipelined
    - No other instructions issued ’til they complete
    - Stalls both hardware threads
    - High latency and low throughput
  What can we do about it?
  - Know when those instructions are generated
  - Avoid or code round those situations
  - But only in critical places

Branch Mispredicts
  What is it?
  - Mispredicting a branch causes…
    - …the CPU to discard instructions it predicted it needed
    - …a 23-24 cycle delay as the correct instructions are fetched
  Why is this a problem?
  - The misprediction penalty can…
    - …dominate total time in tight loops
    - …waste time fetching unneeded instructions

PIX for Xbox 360

PIX
  Performance Investigator for Xbox
  - For analysing various kinds of performance: rendering, file system, CPU
  For CPU…
  - Several different mechanisms
    - Stochastic sampling
    - High level timers and counters
    - Instruction trace

CPU Instruction Trace
  What is an instruction trace?
  - CPU core set to single step mode
  - Tools record instructions and load/store addresses
  - 400x slower than normal execution
  Trace (and code) affected by:
  - Compiler output – un-optimized / optimized
  Some statistics are simulated
  - Eg cache statistics assume the cache starts empty and no other threads run and evict data

CPU Instruction Trace
  Instruction trace contains 5 tabs:
  - Summary tab
  - Top Issues tab
  - Memory Accesses tab
  - Source tab
  - Functions tab

CPU Instruction Trace
  Summary tab
  - Instructions executed statistics
  - I-cache statistics
  - D-cache statistics
    - Very useful: cache line usage %
  - TLB statistics
    - Very useful: 4KB and 64KB page usage
    - Very useful: TLB miss rate exceeding 1024
  - Instruction type histogram

Summary Tab
  - Executed instructions – gives a notion of possible maximum speed
  - Cache line efficiency – try for 35% minimum

Top Issues Tab
  - Major CPU penalties, in cycle cost order
  - Includes links to:
    - Address of the instruction where the penalty occurs
    - Function in source view
  - L2 miss and LHS normally dominate
  - Other common penalties: branch mispredict, fcmp, expensive instructions (fdiv et al)

Top Issues Tab
  - Cache misses
    - Displays % of data used before eviction
  - Load-hit-stores
    - Displays store instruction address, last data address
    - Source / destination register types
  - Expensive instructions
    - Location of the instruction
  - Branch mispredictions
    - Conditional or branch target mispredict

Memory Accesses Tab
  - Shows all memory accesses by… page type, address, and cache line
  - For each cache line shows…
    - Symbol that touched the cache line most
    - Right click gives all symbols touching the line

Source Tab
  - Annotated source and assembly
  - Columns show ‘penalty’ counts, with hot links to more details
    - Eg clicking the load-hit-store count for a load brings up a dialog showing all the store instructions that the load hit

Functions Tab
  - Per-function values of six counters:
    - Instruction counts
    - L2 misses, LHS, fcmp, L1 D & I cache misses
  - All available as inclusive and exclusive
    - Exclusive – for this function only
    - Inclusive – this function and everything it calls

Optimization Example

Optimization Zen
  - Perspective is king
    - 90% of time spent in 10% of code
  - Optimization is expensive, slow, error prone
  - Improvement to execution speed usually comes at the cost of:
    - Generality
    - Maintainability
    - Understandability
    - Speed of development

Optimization Zen
  Ground rules for optimization
  - Have CPU budgets in place
    - Budget planning assists good performance
  - Measure twice, cut once
  - Optimize in an iterative pruning fashion
    - Remove the easiest to tackle & worst culprits first
    - Re-evaluate timing and metrics
    - Stop as soon as the budget is achieved
  - Be sure to diagnose performance issues correctly
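One transformation that shows up repeatedly in the worked example that follows is removing floating-point compare-and-branch patterns, since fcmp can flush the pipeline and the branch can mispredict. Below is a minimal sketch of the idea, assuming the 360 compiler’s __fsel intrinsic (which maps to the PowerPC fsel instruction and returns its second argument when the first is >= 0.0, otherwise its third); the wrapper, function names and the _XBOX guard are assumptions added so the snippet also compiles off-console, not something taken from this deck.

// Select-without-branching sketch; the 360 intrinsic is assumed to be
// available when _XBOX is defined (it may require PPCIntrinsics.h).
static inline float FloatSelectGE(float test, float a, float b)
{
#ifdef _XBOX
    return (float)__fsel(test, a, b);   // fsel: a if test >= 0.0, else b
#else
    return (test >= 0.0f) ? a : b;      // portable stand-in, same semantics
                                        // (may still compile to a branch off-console)
#endif
}

// Branchy version: fcmp + conditional branch, which can flush the pipeline
// when the predictor guesses wrongly.
float ClampToZeroBranchy(float intensity)
{
    if (intensity < 0.0f)
        intensity = 0.0f;
    return intensity;
}

// Branch-free version: no compare, no branch, nothing to mispredict.
// This is the shape of the __fsel change in step 5 of the walkthrough below.
float ClampToZeroSelect(float intensity)
{
    return FloatSelectGE(intensity, intensity, 0.0f);
}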
Optimization Example

class BaseParticle
{
public:
    // …
    virtual Vector& Position()         { return mPosition; }
    virtual Vector& PreviousPosition() { return mPreviousPosition; }
    float& Intensity()                 { return mIntensity; }
    float& Lifetime()                  { return mLifetime; }
    bool&  Active()                    { return mActive; }
    // …
private:
    // …
    float mIntensity;
    float mLifetime;
    bool mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    // …
};

Optimization Example

// Boring old vector class
class Vector
{
    // …
public:
    float x, y, z, w;
};

// Boring old generic linked list class
template <class T> class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode()         { return mNext; }
    T* Contents()                { return mContents; }
private:
    ListNode<T>* mNext;
    T* mContents;
};

Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }

Optimization Example

// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x     = node->Contents()->Position().x;
position.y     = node->Contents()->Position().y;
position.z     = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;

// Use VMX to do the calculations
__vector4 velocity    = __vsubfp(position, prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity, velocity);

// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);

Measure First
  PIX summary
  - 704k instructions executed
  - 40% L2 cache line usage
  Top penalties
  - L2 cache miss @ 3m cycles
  - bctr mispredicts @ 1.14m cycles
  - __fsqrt @ 696k cycles
  - 2x fcmp @ 490k cycles
  - Some 20.9m cycles of penalty overall
  Takes 7.528ms

Improving Original Example
  1) Avoid branch mispredict #1
  - Ditch the zealous use of virtual
  - Call functions just once
  - Gives 1.13x speedup
  2) Improve L2 use #1
  - Refactor the list to a contiguous array
  - Hot/cold split
  - Use a bitfield for the active flag
  - Gives 3.59x speedup

Improving Original Example
  4) Remove expensive instructions
  - Ditch __fsqrts and compare with squares
  - Gives 4.05x speedup
  5) Avoid fcmp pipeline flush
  - Insert __fsel() to select tail length
  - Gives 4.44x speedup
  - Insert a 2nd __fsel: now only the branch on the active flag remains
  - Gives 5.0x speedup

Improving Original Example
  7) Use VMX
  - Use __vsubfp and __vmsum3fp for the vector math
  - Gives 5.28x speedup
  8) Avoid branching too often
  - Unroll the loop 4x
  - Sticks at 5.28x speedup

Improving Original Example
  9) Avoid branch mispredict #2
  - Read a vector4 of tail intensities
  - Build a __vector4 mask from the active flags
  - __vsel tail lengths from existing and new
  - Write the updated vector4 of tail intensities back
  - Gives 6.01x speedup
  10) Improve L2 access #2
  - Add __dcbt on the particle array
  - Gives 16.01x speedup

Improving Original Example
  11) Improve L2 use #3
  - Move to short coordinates
  - Now loading ¼ of the data for positions
  - Gives 21.23x speedup
  12) Avoid unnecessary work
  - We are now writing tail lengths for every particle
  - Wait, we don’t care about inactive particles
  - Epiphany: don’t check the active flag at all
  - Gives 23.2x speedup
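Before the final step, it is worth spelling out the __dcbt idiom that steps 10 and 13 lean on: while processing element i, touch the cache line that will be needed a few iterations from now, so the ~610-cycle L2 miss overlaps useful work instead of stalling it. The sketch below uses the same __dcbt(offset, address) form as the loop on the next slide; the element type, array and wrapper names are illustrative, and the portable stand-in exists only so the sketch compiles off-console.

// Prefetch-ahead sketch; PrefetchLine and Element are made-up names.
static inline void PrefetchLine(int offsetBytes, const void* address)
{
#ifdef _XBOX
    __dcbt(offsetBytes, address);       // PowerPC data-cache-block-touch
#else
    (void)offsetBytes; (void)address;   // no-op stand-in off-console
#endif
}

struct Element { float value[32]; };    // 128 bytes: one cache line per element

float SumWithPrefetch(const Element* elems, int count)
{
    float total = 0.0f;
    for (int i = 0; i < count; ++i)
    {
        // Touch the line 768 bytes (six cache lines) ahead of the current
        // element, so it is already resident by the time the loop gets there.
        PrefetchLine(768, &elems[i]);

        for (int j = 0; j < 32; ++j)
            total += elems[i].value[j];
    }
    return total;
}

The 768-byte offset is the one the final loop uses; step 13’s “tweak __dcbt offsets” is this distance being tuned so the prefetch lands far enough ahead to hide the miss, but not so far that the line is evicted again before use.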
Improving Original Example
  13) Improve L2 use #4
  - Remaining L2 misses are on the output array
  - __dcbt that too
  - Tweak __dcbt offsets and pre-load
  - 39.01x speedup
  - Check it’s correct!

for (int loop = 0; loop < cParticleCount; loop += 4)
{
    __dcbt(768, &gParticles[loop]);
    __dcbt(768, &gParticleLifetime[loop]);

    __vector4 lifetimes    = *(__vector4 *)&gParticleLifetime[loop];
    __vector4 newIntensity = __vsubfp(maxLifetime, lifetimes);

    const __vector4 velocity0  = gParticles[loop].Velocity();
    __vector4       lengthSqr0 = __vmsum3fp(velocity0, velocity0);
    // …calculate remaining lengths and concatenate into one __vector4 ‘lengths’

    lengths              = __vsubfp(lengths, cLimitLengthSqrV);
    __vector4 lengthMask = __vcmpgtfp(lengths, zero);
    newIntensity         = __vmaxfp(newIntensity, zero);

    *(__vector4 *)&gParticleTailIntensity[loop] = __vsel(zero, newIntensity, lengthMask);
}

Improving Original Example
  PIX summary
  - 259k instructions executed
  - 99.4% L2 usage
  Top penalties
  - ERAT data miss @ 14k cycles
  - 1 LHS via 4KB aliasing
  - No mispredict penalties
  - 71k cycles of penalty overall
  Takes 0.193ms

Summary
  Thanks for listening
  Hopefully you gathered something about:
  - Cheap consumer hardware
  - Multicore strategies
  - What game engine programmers worry about
  - How games are profiled and optimized

Q&A
  http://www.xna.com

© 2008 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

Dawson’s Creek Figures
  - Clock rate = 3.2 GHz = 3,200,000,000 cycles per second
    - 60 fps = 53,333,333 cycles per frame
    - 30 fps = 106,666,666 cycles per frame
  - Dawson’s Law: average 0.2 IPC in a game title
  - Therefore…
    - at 60 fps, you can do 10,666,666 instructions ~= 10M
    - at 30 fps, you can do 21,333,333 instructions ~= 21M
  - Or put another way… how bad is a 1M-cycle penalty?
    - It’s approx 200K instructions of quality execution going missing
    - 1M cycles is 1/50th (2%) of a frame at 60 fps, or 1/100th (1%) of a frame at 30 fps
    - 1M cycles is ~0.32 ms
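For reference, the appendix arithmetic above drops straight out of a few lines of code; this is a throwaway sketch using only the clock rate, frame rates and 0.2 IPC figure quoted above.

#include <cstdio>

int main()
{
    const double clockHz       = 3.2e9;  // 3.2 GHz
    const double ipc           = 0.2;    // Dawson's Law: average IPC in a game title
    const double penaltyCycles = 1.0e6;  // the 1M-cycle penalty example

    const double rates[] = { 60.0, 30.0 };
    for (double fps : rates)
    {
        const double cyclesPerFrame       = clockHz / fps;
        const double instructionsPerFrame = cyclesPerFrame * ipc;
        std::printf("%2.0f fps: %.0f cycles/frame, ~%.0f instructions/frame, "
                    "1M-cycle penalty = %.1f%% of the frame\n",
                    fps, cyclesPerFrame, instructionsPerFrame,
                    100.0 * penaltyCycles / cyclesPerFrame);
    }
    std::printf("1M cycles = %.2f ms\n", 1000.0 * penaltyCycles / clockHz);
    return 0;
}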