
“The Slow Game Of Life”
Performance: Understanding Consumer Hardware And Performance Coding
Allan Murphy
Senior Software Development Engineer
XNA Developer Connection
Microsoft
Hello
So…
Who exactly am I?
And what am I doing here?
Firstly, hands up who…
Has heavily optimized an application
Hasn’t, and doesn’t care
Is actually here and alive
Is hungover and mainly hoping for the answers for the group assignment
Hello
Duncan let me speak today because…
Career spent on performance hardware
Experience with a variety of consoles
Have managed teams building game engines
Still have those photos of Duncan
With Doug, the West Highland Terrier
I did my degree at Strathclyde
Computer architecture
Low level programming
Will Optimize For Money
Previous Experience
It’s not all about me
Except this bit
Strathclyde
The Game Of Life assignment
Left Strathclyde
Immediately
Did database analysis, hated it
Worked in telecoms, hated it
Moved to first 3 person game company
…“Until I could find a proper job”
Paid enormous fortune
Didn’t wear a suit, worked in games
Bought Ferrari 3 months after Uni
Had more than 1 girlfriend
Previous Experience
2 years PC engine development
2D 640x480 bitmap graphics
C, C++, 80x86 (486, Pentium)
3 years at Sony
3 years PS1 3rd party support and game dev
C, C++, MIPS R3000
2 years at game developer in Glasgow
PS1 engine development
C, C++, MIPS R3000
Previous Experience
6 years owning own developer
PS1, PS2, GC, Xbox 1, PC development
C, C++, MIPS R4400, VU assembly, HLSL
2 years at Eurocom
PS3, 360, PC
C, C++, PowerPC, SPU assembly
2 years at Microsoft
Xbox 360, some Windows
C, C++, PowerPC, HLSL
Previous Experience
Fair amount of optimization experience
Part of XDC group at Microsoft
3rd party developer support group
Visited 60+ game developers
Performance reviews
Consultancy
Sample code
Bespoke coding
Previous Experience
“All this will go away soon”
1992
Multiplying by 320 in x86 assembler
Surely it should, because…
Processor power increasing
Processor cost reducing
Compilers getting better
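As an aside, the 1992 trick referred to above is most likely the classic shift-and-add multiply for 320-pixel-wide screen modes; a minimal sketch in C (function name hypothetical):
// Hypothetical illustration: y * 320 done as two shifts and an add,
// because 320 = 256 + 64 and shifts were far cheaper than a multiply
unsigned int PixelOffset(unsigned int x, unsigned int y)
{
    return (y << 8) + (y << 6) + x;   // y*256 + y*64 + x == y*320 + x
}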
Console Hardware
Console Hardware
Console hardware is about…
Maximum performance
…for minimum cost
Often CPUs are…
Cut down production processors
Have bespoke processing hardware added
Eg vector processing units
Attached to cheap memory and peripherals
Consoles are sold at a loss
80x86 PC (circa mid-90s)
[Block diagram] Pentium Pro @ 200MHz with 8Kb L1, 512Kb L2 cache and FPU + MMX; main memory; AGP to a graphics card with its own VRAM, out to the monitor
PS1
[Block diagram] MIPS R3000 @ 33.868MHz with I$ and D$, GTE and MDEC; 2Mb main memory; GPU with 1Mb VRAM, out to the telly
Xbox 1
[Block diagram] Pentium III @ 733MHz with L1, 128Kb L2 cache and FPU + MMX + SSE; 64Mb UMA main memory shared with the nVidia NV2A, out to the telly
PS2
[Block diagram] Emotion Engine: MIPS R4400 core @ 294MHz with I$, D$, S-Pad and FPU + MMX; VU0 and VU1 vector units with their own memory, fed via VIF0 and VIF1; GIF to the GS with 4Mb VRAM, out to the telly; 32Mb main memory
Xbox 360
[Block diagram] Three PowerPC cores, each with its own L1 and FPU + VMX, sharing a 1Mb L2 cache; 512Mb UMA main memory shared with the ATI Xenos, out to the telly
PS3
[Block diagram] Cell: PPE with L1 and L2 cache, plus eight SPEs each with its own LS (local store), connected through the DMAC; 256Mb main memory; nVidia RSX with 256Mb VRAM, out to the telly
The Sad Truth About CPU Design
In which programmers have to do the hard work again
This Is What You Want
[Diagram] A CPU attached to very wide, very fast main memory
CPUs Not Getting Faster…
[Diagram] Core 0, Core 1 and Core 2 all wanting to talk to main memory – but through what?
Fast Memory is Expensive…
[Diagram] Core 0, Core 1 and Core 2 sharing a cache in front of main memory
This Is What You Get…
[Diagram] Core 0, Core 1 and Core 2, each with its own L1, store queue, load queue and store gather; an NCU per core and RC machines in front of the shared L2 cache; then main memory
Multicore Strategy
Multicore is future of performance
Scenario forced on unwilling game developers
Not necessarily a happy marriage
Game systems often highly…
Temporally connected
Intertwined
Game devs often from single thread background
Some tasks easy to parallelize
Rendering, physics, effects, animation
Multicore Strategy
Single threaded
On Xbox360 and PS3, this is a bad plan
Two main threads
Game logic update
Renderer submission
Two main threads + fixed tasks
As above plus…
…fixed tasks in parallel
… eg streaming, effects, audio
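A minimal sketch of the two-main-threads layout above, using std::thread purely for illustration (consoles of this era used their own platform thread APIs, and all names here are hypothetical): the renderer submits frame N-1 while game logic builds frame N.
#include <thread>
#include <vector>

struct DrawCommand { /* … */ };
using CommandBuffer = std::vector<DrawCommand>;

CommandBuffer gBuffers[2];    // double buffer of render submissions

void UpdateGameLogic(CommandBuffer& out)
{
    out.clear();
    // … run game systems and record this frame's draw commands …
}

void SubmitToGPU(const CommandBuffer& commands)
{
    // … walk the commands and feed them to the graphics API …
    (void)commands;
}

void RunFrame(int frame)
{
    CommandBuffer& building  = gBuffers[frame & 1];        // written by update (frame N)
    CommandBuffer& submitted = gBuffers[(frame + 1) & 1];  // read by renderer (frame N-1)

    std::thread renderThread([&submitted] { SubmitToGPU(submitted); });
    UpdateGameLogic(building);    // game logic runs in parallel with drawing
    renderThread.join();          // sync point at the end of the frame
}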
Multicore Strategy
Truly multi-threaded
Usually a main game logic thread
Main tasks sliced into independent pieces
Rendering, physics, collision, effects…
Scheduler controls task execution
Tasks execute when preconditions met
Scheduler runs task on any available unit
Real trick is…
Balancing scheduling
Making sure tasks truly independent
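A heavily simplified sketch of the scheduler idea above, with all names hypothetical and the real balancing problems glossed over: each task lists its preconditions, and any available core claims and runs a task once those preconditions have completed.
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

// One slice of work plus the tasks that must finish before it may run
struct Task
{
    std::function<void()>      run;
    std::vector<std::size_t>   dependsOn;      // indices of prerequisite tasks
    std::atomic<bool>          claimed{false};
    std::atomic<bool>          done{false};
};

// Body executed by every worker core: claim any task whose preconditions are
// met, run it, mark it done, and repeat until everything has finished
void WorkerLoop(std::vector<Task>& tasks)
{
    for (;;)
    {
        bool allDone = true;
        for (std::size_t i = 0; i < tasks.size(); ++i)
        {
            Task& t = tasks[i];
            if (t.done.load())
                continue;
            allDone = false;

            bool ready = true;
            for (std::size_t dep : t.dependsOn)
                if (!tasks[dep].done.load())
                {
                    ready = false;
                    break;
                }

            bool expected = false;
            if (ready && t.claimed.compare_exchange_strong(expected, true))
            {
                t.run();               // e.g. one slice of physics, collision or effects
                t.done.store(true);
            }
        }
        if (allDone)
            return;
    }
}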
Multicore Strategy
Problems
Very hard to debug a task system…
…especially at sub millisecond resolution
Balancing tasks and scheduler can be hard
Slicing data and tasks into pieces tricky
Many conditions very hard to find…
…never mind debug
Side effects in code not always obvious
Game Engine Concerns
Game Engine Coding
Main concerns:
Speed
Feature set
Memory usage
Disc space for assets
But most importantly…
Speed
Because this dictates game content
Slow means fewer features
Game Engine Coding
Speed measured in…
Frames per second
Or equivalently ms per frame
33.33ms in a frame at 30fps
Game must perform update in this time
Update all of the game’s systems
Set up and submit all rendering for frame
Do all of the drawing for previous frame
Game Engine Coding
Critical choices for engine design
Algorithms
Sorting, searching, pruning calculations
Rendering policy
Data structuring
How you bend the above around hardware
Consoles have hardware acceleration…
…for certain tasks
…for certain data
…for certain data layouts
Game Engine Coding
Example: VMX instructions on Xbox360
SIMD instructions, operating on vectors
Vector can be 8, 16, 32 bit values
32 bit can be float or int
Multiply, add, shift, pack, unpack
Great! But…
No divide, sqrt, individual bit operations
Only aligned loading
Loading individual pieces to build a vector is expensive
Possible to lose improvement easily
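A small sketch of what that looks like in practice, using the same intrinsics that appear in the optimization example later: the vector math stays in VMX, and the square root drops back to the scalar FPU because VMX has none.
// Assumes two already-aligned __vector4 positions with w set to 0
float Distance(__vector4 a, __vector4 b)
{
    __vector4 delta   = __vsubfp(a, b);            // component-wise subtract
    __vector4 distSqr = __vmsum4fp(delta, delta);  // dot product, result splatted to all lanes
    return __fsqrts(distSqr.x);                    // no VMX sqrt, so finish in the FPU
}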
The 360 Core
Remember, cheap hardware
Cut down PowerPC core
Missing out of order execution hardware
Missing store forwarding hardware
Ie, this is an in-order processor
Attached to slow memory
Means loading data is painful
Which in turn makes data layout critical
360 Core
Very commonly occurring penalties:
Load Hit Store
L2 cache miss
Expensive instructions
Branch mispredict
Load-Hit-Store (LHS)
What is it?
Storing to a memory location…
…then loading from it very shortly after
What causes LHS?
Type casts, changing register set, aliasing
Passing by value, or by reference
Why is it a problem?
On PC, bullet usually dodged by…
Instruction re-ordering
Store forwarding hardware
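A tiny illustration of the kind of code that triggers it: on this core a float-to-int conversion typically goes through memory, so the store and the load that follows it collide.
// Hypothetical example: the converted value is stored to the stack by the
// floating point unit, then reloaded almost immediately into an integer
// register – a load-hit-store stall on a core with no store forwarding
int TableIndex(float position)
{
    return (int)position;
}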
L2 Miss
What is it?
Loading from a location not already in cache
Why is it a problem?
Costs ~610 cycles to load a cache line
You can do a lot of work in 610 cycles
What can we do about it?
Hot/cold split
Reduce in-memory data size
Use cache coherent structures
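A sketch of the hot/cold split idea above (all field names hypothetical): keep the per-frame data small and contiguous so more of each cache line is useful, and move rarely touched data into a separate allocation.
// Cold data: touched only by tools, debugging or spawning
struct ParticleCold
{
    char  debugName[32];
    float spawnTime;
    // …
};

// Hot data: the few bytes the per-frame update actually needs
struct ParticleHot
{
    float x, y, z;
    float lifetime;
    ParticleCold* cold;    // cold data lives elsewhere, loaded only on demand
};

// Packed into one contiguous array, so walking particles in order streams
// through the cache instead of chasing pointers
ParticleHot gHotParticles[4096];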
Expensive Instructions
What is it?
Certain instructions not pipelined
No other instructions issued ‘til they complete
Stalls both hardware threads
High latency and low throughput
What can we do about it?
Know when those instructions are generated
Avoid or code round those situations
But only in critical places
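For example (a hedged sketch, not from the original deck): a divide in a hot loop can often be hoisted out as a reciprocal multiply, so the un-pipelined instruction is issued once rather than once per element.
void NormalizeIntensities(const float* intensity, float* scale, int count, float maxIntensity)
{
    // One divide up front instead of an expensive fdiv per element
    const float invMax = 1.0f / maxIntensity;
    for (int i = 0; i < count; ++i)
        scale[i] = intensity[i] * invMax;
}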
Branch Mispredicts
What is it?
Mispredicting a branch causes…
…CPU to discard instructions it predicted it needed
…23-24 cycle delay as correct instructions fetched
Why is this a problem?
Misprediction penalty can…
…dominate total time in tight loops
…waste time fetching unneeded instructions
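One common cure, used later in the optimization example: replace a hard-to-predict float comparison with a branch-free select. __fsel(a, b, c) returns b when a >= 0 and c otherwise, so there is nothing left to mispredict.
float ClampToZero(float newIntensity)
{
    // (newIntensity >= 0.0f) ? newIntensity : 0.0f, with no branch at all
    return __fsel(newIntensity, newIntensity, 0.0f);
}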
PIX for Xbox 360
PIX
Performance Investigator for Xbox
For analysing various kinds of performance
Rendering, file system, CPU
For CPU…
Several different mechanisms
Stochastic sampling
High level timers and counters
Instruction trace
CPU Instruction Trace
What is an instruction trace?
CPU core set to single step mode
Tools record instructions and load/store addrs
400x slower than normal execution
Trace (and code) affected by:
Compiler output – un-optimized / optimized
Some statistics are simulated
Eg cache statistics assume
Cache starts empty
No other threads run and evict data
CPU Instruction Trace
Instruction trace contains 5 tabs:
Summary tab
Top Issues tab
Memory Accesses tab
Source tab
Functions tab
CPU Instruction Trace
Summary tab
Instructions executed statistics
I-cache statistics
D-cache statistics
Very useful: cache line usage %
TLB statistics
Very useful: 4Kb and 64Kb page usage
Very useful: TLB miss rate exceeding 1024
Instruction type histogram
Summary Tab
Executed instructions – gives notion of possible maximum speed
Cache line efficiency – try for 35% minimum
Top Issues Tab
Major CPU penalties, by cycle cost order
Includes link to:
Address of instruction where penalty occurs
Function in source view
L2 miss and LHS normally dominate
Other common penalties:
Branch mispredict
fcmp
Expensive instructions (fdiv et al)
Top Issues Tab
Cache misses
Displays % of data used before eviction
Load-hit-stores
Displays store instruction addr, last data addr
Source / destination register types
Expensive instructions
Location of instruction
Branch mispredictions
Conditional or branch target mispredict
Memory Accesses Tab
Shows all memory accesses by…
Page type, address, and cache line
For each cache line shows…
Symbol that touched the cache line most
Right click gives all symbols touching the line
Source Tab
Annotated source and assembly
Columns show ‘penalty’ counts
With hot links to more details
Clicking a load-hit-store count brings up a dialog showing all the store instructions that the load hit
Functions Tab
Per-function values of six counters:
Instruction counts
L2 misses, LHS, fcmp, L1 D & I cache misses
All available as inclusive and exclusive
Exclusive – for this function only
Inclusive – this function and everything it calls
Optimization Example
Optimization Zen
Perspective is king
90% of time spent in 10% of code
Optimization is expensive, slow, error prone
Improvements to execution speed usually trade against:
Generality
Maintainability
Understandability
Speed of development
Optimization Zen
Ground rules for optimization
Have CPU budgets in place
Budget planning assists good performance
Measure twice, cut once
Optimize in an iterative pruning fashion
Remove easiest to tackle & worst culprits first
Re-evaluate timing and metrics
Stop as soon as budget achieved
Be sure to diagnose performance issues correctly
Optimization Example
class BaseParticle
{
public:
    …
    virtual Vector& Position()           { return mPosition; }
    virtual Vector& PreviousPosition()   { return mPreviousPosition; }
    float& Intensity()                   { return mIntensity; }
    float& Lifetime()                    { return mLifetime; }
    bool& Active()                       { return mActive; }
    …
private:
    …
    float mIntensity;
    float mLifetime;
    bool mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};
Optimization Example
// Boring old vector class
class Vector
{
    …
public:
    float x,y,z,w;
};

// Boring old generic linked list class
template <class T> class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node)    { mNext = node; }
    ListNode* NextNode()            { return mNext; }
    T* Contents()                   { return mContents; }
private:
    ListNode<T>* mNext;
    T* mContents;
};
Optimization Example
// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }
Optimization Example
// Replacement for straight C vector work
// Build 360 friendly __vector4s (w zeroed so the 4-way dot product is valid)
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
position.w = 0.0f;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;
prevPosition.w = 0.0f;
// Use VMX to do the calculations
__vector4 velocity = __vsubfp(position, prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity, velocity);
// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);
Measure First
PIX Summary
704k instructions executed
40% L2 cache line usage
Top penalties
L2 cache miss @ 3m cycles
bctr mispredicts @ 1.14m cycles
__fsqrt @ 696k cycles
2x fcmp @ 490k cycles
Some 20.9m cycles of penalty overall
Takes 7.528ms
Improving Original Example
1) Avoid branch mispredict #1
Ditch the zealous use of virtual
Call functions just once
Gives 1.13x speedup
2) Improve L2 use #1
Refactoring list to contiguous array
Hot/cold split
Using bitfield for active flag
Gives 3.59x speedup
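A hedged sketch of what step 2's layout might look like (the deck doesn't show it, so names and sizes here are hypothetical): particles packed into a flat array, with the active flags pulled out into a separate bitfield.
// Hot particle data, contiguous in memory – no list nodes, no virtuals
struct Particle
{
    Vector position;
    Vector previousPosition;
    float  intensity;
    float  lifetime;
};

Particle gParticles[cParticleCount];    // cParticleCount as used in the final listing

// One bit of 'active' state per particle instead of a bool per object
unsigned int gActiveBits[(cParticleCount + 31) / 32];

inline bool IsActive(int i)
{
    return ((gActiveBits[i >> 5] >> (i & 31)) & 1) != 0;
}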
Improving Original Example
4) Remove expensive instructions
Ditch __fsqrts and compare with squares
Gives 4.05x speedup
5) Avoid fcmp pipeline flush
Insert __fsel() to select tail length
Gives 4.44x speedup
Insert 2nd fsel
Now only branch on active flag remains
Gives 5.0x speedup
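For step 4, the square root disappears because the comparison can be done in squared space; a minimal sketch using the names from the earlier listing:
// Before: const float length = __fsqrts(lengthSqr); if (length > cLimitLength) …
// After: same test, no __fsqrts
const float lengthSqr = (vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z);
if (lengthSqr > cLimitLength * cLimitLength)
{
    // …update the tail intensity exactly as before
}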
Improving Original Example
7) Use VMX
Use __vsubfp and __vmsum3fp for vector math
Gives 5.28x speedup
8) Avoid branching too often
Unroll the loop 4x
Sticks at 5.28x speedup
Improving Original Example
9) Avoid branch mispredict #2
Read vector4 of tail intensities
Build a __vector4 mask from active flags
__vsel tail lengths from existing and new
Write updated vector4 of tail intensities back
Gives 6.01x speedup
10) Improve L2 access #2
Add __dcbt on particle array
Gives 16.01x speedup
Improving Original Example
11) Improve L2 use #3
Move to short coordinates
Now loading ¼ the data for positions
Gives 21.23x speedup
12) Avoid unnecessary work
We are now writing tail lengths for every particle
Wait, we don’t care about inactive particles
Epiphany - don’t check active flag at all
Gives 23.2x speedup
Improving Original Example
13) Improve L2 use #4
Remaining L2 misses on output array
__dcbt that too
Tweak __dcbt offsets and pre-load
39.01x speedup
Check it’s correct!
for (int loop = 0; loop < cParticleCount; loop += 4)
{
    __dcbt(768, &gParticles[loop]);
    __dcbt(768, &gParticleLifetime[loop]);
    __vector4 lifetimes = *(__vector4 *)&gParticleLifetime[loop];
    __vector4 newIntensity = __vsubfp(maxLifetime, lifetimes);
    const __vector4 velocity0 = gParticles[loop].Velocity();
    __vector4 lengthSqr0 = __vmsum3fp(velocity0, velocity0);
    // …calculate remaining lengths and concatenate into one __vector4
    lengths = __vsubfp(lengths, cLimitLengthSqrV);
    __vector4 lengthMask = __vcmpgtfp(lengths, zero);
    newIntensity = __vmaxfp(newIntensity, zero);
    __vector4 result = __vsel(zero, newIntensity, lengthMask);
    *(__vector4 *)&gParticleTailIntensity[loop] = result;
}
Improving Original Example
PIX Summary
259k instructions executed
99.4% L2 usage
Top penalties
ERAT Data Miss @ 14k cycles
1 LHS via 4kb aliasing
No mispredict penalties
71k cycles of penalty overall
Takes 0.193ms
Summary
Summary
Thanks for listening
Hopefully you gathered something about:
Cheap consumer hardware
Multicore strategies
What game engine programmers worry about
How games are profiled and optimized
Q&A
http://www.xna.com
© 2008 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
Dawson’s Creek Figures
Clock rate = 3.2 GHz = 3,200,000,000 cycles per second
60 fps = 53,333,333 cycles per frame
30 fps = 106,666,666 cycles per frame
Dawson’s Law: average 0.2 IPC in a game title
Therefore …
at 60 fps, you can do 10,666,666 instructions ~= 10M
at 30 fps, you can do 21,333,333 instructions ~= 21M
Or put another way… how bad is a 1M-cycle penalty?
It’s approx 200K instructions of quality execution going missing.
1M cycles is 1/50th (2%) of a frame at 60 fps, or 1/100th (1%) of a frame at 30 fps
1M cycles is ~0.32 ms.