C++ on Next-Gen Consoles:
Effective Code for New
Architectures
Pete Isensee
Development Manager
Microsoft Game Technology Group
Last Year at GDC
Chris Hecker ranted
 What did he say?

Programmers: danger ahead
 Out-of-order execution: good
 In-order execution: bad
 Microsoft and Sony are going to screw you
 You are so hosed. Game over, man.


“There’s absolutely nothing you can do
about this”
Console Hardware Architectures
Optimized to do floating-point math
 Optimized for multithreaded tasks
 Optimized to run games
 Not optimized to run general purpose code
 Not optimized to do branch prediction, code
reordering, instruction pipelining or other
out-of-order magic
 Large L2 caches
 Large latencies

We’re Game Programmers.
We Love Challenges.
We will make games on these consoles
 The solution is not assembly language
 The solution is to tailor our C/C++ engines,
inner loops and bottleneck functions to the
realities of the hardware
 Remember: C++ code can make or break
your game’s performance

Not Covering
Profiling (do it)
 Multithreading (do it)
 Memory allocation (avoid in game loop)
 Compiler settings (experiment)
 Exception handling (avoid it)

Topics for Today

Thinking about L2
Optimize memory access
 Use CPU caches effectively


Thinking about in-order processing
Avoid function call overhead
 Tips for efficient math
 Avoid hidden C++ inefficiencies

Optimize Memory Access
Proverb: thou shalt treat memory as if it were
thy hard drive
 You will be memory-bound on new consoles
 Recommendations

Never read from the same place twice in a frame
 Read data sequentially
 Write data sequentially
 Use everything you read

Minimize Data Passes

Game frame loops often access data twice
Or three times
 Or more

Optimize for a single pass
 Consider less frequent operations

AI
 Physics, collision
 Networking
 Particle systems
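A minimal sketch of folding several passes into one. The Particle struct, its fields and both function names are hypothetical, chosen only for illustration; the point is that the single-pass version pulls each element through the cache once instead of three times.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical game object; the field names are illustrative only.
struct Particle {
    float pos;
    float vel;
    float age;
};

// Multiple passes: every particle travels through the cache three times.
inline void UpdateMultiPass(std::vector<Particle>& ps, float dt) {
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].vel -= 9.8f * dt;
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].pos += ps[i].vel * dt;
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].age += dt;
}

// Single pass: each particle is read once, fully used, and written once.
inline void UpdateSinglePass(std::vector<Particle>& ps, float dt) {
    for (std::size_t i = 0; i < ps.size(); ++i) {
        Particle& p = ps[i];
        p.vel -= 9.8f * dt;
        p.pos += p.vel * dt;
        p.age += dt;
    }
}
```

Both versions perform the same arithmetic in the same order per particle, so the results match; only the memory traffic differs.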

Multiple Pass
Architecture
Pointer Aliasing Explained
void init( float *a, const float *b ) {
a[0] = 1.0f - *b;
a[1] = 1.0f - *b;
}
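Nominal case: a and b point to different memory
 With *b == 0.0, the result is a[0] == 1.0 and a[1] == 1.0
Worst case: b points into a
float a[2]={0.0f};
init( a, &a[0] );
 The first store changes *b to 1.0, so a[1] == 1.0 - 1.0 == 0.0
 The compiler must therefore reload *b after every store through a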
A Solution: Restrict


Restrict keyword tells the compiler there’s no
aliasing
Restrict permits the compiler to generate much
more efficient code
void init( float* __restrict a,
const float* __restrict b ) {
a[0] = 1.0f - *b; // compiler can do
a[1] = 1.0f - *b; // the right thing
}
What to Restrict
Use restrict widely
 Function pointer parameters
 Local pointers
 Pointers in structs/classes
 But not:

Function return types
 Casts
 Global pointers (maybe)
 References (maybe)
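A short sketch of restrict on function parameters and on local pointers. ScaleInto and its parameters are hypothetical names. __restrict is a compiler extension (accepted by MSVC, GCC and Clang), not standard C++, and the caller is responsible for keeping the no-overlap promise.

```cpp
#include <cassert>
#include <cstddef>

// Promise to the compiler: dst and src never overlap.
// Breaking that promise is undefined behavior.
inline void ScaleInto(float* __restrict dst,
                      const float* __restrict src,
                      float scale,
                      std::size_t count) {
    // Local restricted pointers carry the same promise inside the function.
    float* __restrict out = dst;
    const float* __restrict in = src;
    for (std::size_t i = 0; i < count; ++i)
        out[i] = in[i] * scale;   // loads of in[] need not be repeated after stores to out[]
}
```

Calling ScaleInto with two separate arrays is fine; calling it with overlapping ranges is exactly the case restrict forbids.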

Use the CPU Caches Effectively
The L2 cache is your best friend
 Using the cache well is an art
 Ensure you have a good profiler by your side

Keep the Working Set Small
Pack commonly used data together
 Frequently used data might deserve its own
struct/class
 Keep rarely used data separate, for example texture file names
 Consider bitfields; they are extremely efficient on PowerPC
 Consider other forms of lossless compression
Inefficient Structs Are Bad Mojo
struct InefficientCar {
bool manual; // padding here
wheel wheels[8]; // 8 wheels?
bool convertible; // more pad
char engine; // 4 bits used
char file[32]; // rarely used
double maxAccel; // double?
};
sizeof(InefficientCar) = 80
Carefully Design Structures
struct EfficientCar {
wheel wheels[4]; // 4 wheels
wheel *moreWheels;
char *file; // stored elsewhere
float maxAccel; // float
unsigned engine:4; // bitfields
unsigned manual:1;
unsigned convertible:1;
};
sizeof(EfficientCar) = 32
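A compilable sketch of the same idea, leaving out the wheel type because its definition is not shown. LooseCar and PackedCar are hypothetical names, and exact sizes depend on the compiler and target, so the check below only asserts that the packed layout is smaller.

```cpp
#include <cassert>
#include <cstddef>

// Loose layout: padding after each bool, a double, and an inline file name.
struct LooseCar {
    bool   manual;
    double maxAccel;
    bool   convertible;
    char   engine;       // only 4 bits carry information
    char   file[32];     // rarely used, yet stored in every car
};

// Packed layout: the file name lives elsewhere, flags share one word.
struct PackedCar {
    const char* file;           // points at storage kept outside the hot data
    float       maxAccel;
    unsigned    engine      : 4;
    unsigned    manual      : 1;
    unsigned    convertible : 1;
};

static_assert(sizeof(PackedCar) < sizeof(LooseCar),
              "packed layout should be smaller than the loose one");
```

More PackedCar objects fit in each cache line, which is the payoff for the extra indirection on the rarely used file name.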
Choose the Right Container

Prefer contiguous containers
Or at least mostly contiguous
 Examples: array, vector, deque


Avoid node-based containers

List, set/map, binary trees, hash tables
If you must use a tree, consider a custom
allocator for memory locality
 Vector + std::sort is often faster (and
smaller) than set or map or hash tables, by
an order of magnitude
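A minimal sketch of a sorted vector standing in for a set when the data is built once and queried many times. SortedIdSet is a hypothetical class; it is not a drop-in replacement for std::set, because inserting after the build step would require re-sorting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Contiguous storage, sorted once, searched with binary search.
class SortedIdSet {
public:
    void Build(const std::vector<int>& ids) {
        ids_ = ids;
        std::sort(ids_.begin(), ids_.end());
        // Drop duplicates so the container behaves like a set.
        ids_.erase(std::unique(ids_.begin(), ids_.end()), ids_.end());
    }

    bool Contains(int id) const {
        return std::binary_search(ids_.begin(), ids_.end(), id);
    }

    std::size_t Size() const { return ids_.size(); }

private:
    std::vector<int> ids_;   // one allocation, no per-node overhead
};
```

Lookups touch a handful of nearby cache lines instead of chasing pointers from node to node.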

Avoid Function Call Overhead
Function call overhead was a surprising
cause of performance issues on Xbox
 The same is true on Xbox 360 and PS3
 Fortunately, there are lots of solutions
 Research compiler settings. On Xbox 360:

Inline “any suitable”
 Enable link-time code generation


Spend time ensuring the compiler is inlining
the right things
Avoid Virtual Functions

Weigh the limitations of virtual functions
Adds a branch instruction
 Branch is always mispredicted
 Compiler is limited in how it can optimize


Consider replacing
virtual void Draw() = 0;
With one non-virtual definition per platform source file
 Xbox360.cpp: void Draw() { ... }
 Windows.cpp: void Draw() { ... }
 PS3.cpp: void Draw() { ... }
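One way this could look, squeezed into a single file so it compiles on its own. The Renderer type and the GAME_PLATFORM_* macros are hypothetical; in a real project each definition would sit in its own .cpp file and the build would compile exactly one of them. Draw returns a backend name here only to make the sketch checkable.

```cpp
#include <cassert>
#include <cstring>

// Shared header: one non-virtual declaration for every platform.
struct Renderer {
    const char* Draw();
};

// Hypothetical platform macros select a single definition at build time,
// so the call is resolved statically and can be inlined.
#if defined(GAME_PLATFORM_XBOX360)
inline const char* Renderer::Draw() { return "Xbox360"; }
#elif defined(GAME_PLATFORM_PS3)
inline const char* Renderer::Draw() { return "PS3"; }
#else
inline const char* Renderer::Draw() { return "Windows"; }
#endif
```

Since only one platform is ever active in a given build, the run-time dispatch that a virtual function provides is not needed here.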

Maximize Leaf Functions
Leaf functions don’t call other functions, ever
 If a potential leaf function calls another
function, the high-level function:

Is much less likely to be inlined
 Must set up a stack frame
 Must set up registers


Potential solutions
Remove the inner function completely
 Inline the inner function
 Provide two versions of the outer function

Unroll Inner Loops
Compiler can’t unroll loops where n is variable
 Even unrolling from ++i to i+=4 can be a
significant gain

Eliminates three branch instructions
 Increases opportunity for code scheduling


Don’t forget to hoist invariants out, too
Example Unrolling
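// original
for( i=a.beg(); i!=a.end(); ++i )
    process(i);
// unrolled by 4, with a.end() hoisted out of the loop
// assumes random-access iterators and an element count
// that is a multiple of 4; otherwise i steps past e
e = a.end();
for( i=a.beg(); i!=e; i+=4 ) {
    process(i); process(i+1);
    process(i+2); process(i+3);
}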
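A sketch of the same unrolling when the element count is not known to be a multiple of 4: the unrolled loop handles whole blocks and a short cleanup loop handles the rest. SumUnrolled is a hypothetical function used only to show the pattern.

```cpp
#include <cassert>
#include <cstddef>

inline float SumUnrolled(const float* data, std::size_t count) {
    // Four independent accumulators give the scheduler more freedom.
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    // Hoisted invariant: the end of the last complete block of 4.
    const std::size_t blockEnd = count - (count % 4);

    std::size_t i = 0;
    for (; i < blockEnd; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    // Cleanup loop for the 0 to 3 leftover elements.
    for (; i < count; ++i)
        sum += data[i];

    return sum;
}
```

Note that splitting the sum across accumulators changes the order of floating-point additions, so results can differ in the last bits from a plain left-to-right loop.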
Pass Native Types by Value

Tradition says that “large” types are passed
by pointer or reference, but be careful


New consoles have really large registers
Native types include
64-bit int (__int64)
 VMX vector (__vector4) – 128 bits!


Pass structs by pointer or reference

One exception: pass structs consisting of bitfields
<= 64 bits by value
Know Data Type Performance
int32 and int64 have equivalent perf
 float and double have equivalent perf
 int8 and int16 are slower than int


They generate extra instructions



High bits cleared or sign-extended
Example: int32 adds 2X faster than int16 adds
Recommendations
Store as smallest type required
 Load into int32, int64 or double for calculations
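A small sketch of "store small, compute wide". SumSamples is a hypothetical function: the data stays in 16-bit storage to keep the working set small, while each value is widened once into a 32-bit local and all arithmetic happens at that width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

inline std::int32_t SumSamples(const std::int16_t* samples, std::size_t count) {
    std::int32_t total = 0;                    // calculate in the native width
    for (std::size_t i = 0; i < count; ++i) {
        const std::int32_t s = samples[i];     // widen once on load
        total += s;
    }
    return total;
}
```

The 32-bit accumulator also avoids the overflow that a 16-bit running total would hit almost immediately.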

Use Native Vector Types

In CS 101, you learned to create abstract
data types, such as matrices
typedef std::vector<float> vec;    // 4 floats, heap-allocated
typedef std::vector<vec> matrix;   // 4 vecs, each allocated separately
This code is an abomination
 At least on Xbox 360 and PS3
 Xbox 360 and PS3 have dedicated vector
math units called VMX units
 Use them!

Your Math Buddies
__vector4 (4 32-bit floats; 128-bit register)
 XMVECTOR (typedef for __vector4)
 XMMATRIX (array of 4 __vector4s)
 XMVECTOR operators (+,-,*,/)
 Hundreds of XMVECTOR and XMMATRIX
functions
 Xbox 360-specific, but similar constructs in
PS3 compilers

Avoid Floating-Point Branches

FP branches are slow
Pipeline has to be flushed
 ~10X slower than int branches

Avoid loops with float test expressions
 Eliminate altogether if possible
 It can be faster to calculate values you won’t use!
 Compare integers instead
 Replace with fsel when possible: 10-20X performance gain
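A sketch of swapping a float loop test for an integer one. Both function names are hypothetical. The second version branches on an int counter and carries the float value alongside it, with the single int-to-float conversion kept outside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Float test expression: every iteration branches on a floating-point compare.
inline std::vector<float> SampleFloatTest(float step) {
    std::vector<float> out;
    for (float t = 0.0f; t < 1.0f; t += step)
        out.push_back(t);
    return out;
}

// Integer test expression: same samples, but the loop branch compares ints.
inline std::vector<float> SampleIntTest(int steps) {
    std::vector<float> out;
    if (steps <= 0) return out;
    out.reserve(static_cast<std::size_t>(steps));
    const float step = 1.0f / static_cast<float>(steps);  // one conversion, outside the loop
    float t = 0.0f;
    for (int i = 0; i < steps; ++i, t += step)
        out.push_back(t);
    return out;
}
```

The integer version also has a trip count that is known up front, which makes it a better candidate for unrolling.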
The fsel Option in Detail

Definition of hardware implementation:
float fsel(float a, float b, float c)
{
return ( a >= 0.0f ) ? b : c;
}

You can replace expressions like

v = ( w < x ) ? y : z; // slow
With faster expressions like

v = fsel( w - x, z, y ); // turbo; y and z swap because the test is now w - x >= 0
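A portable sketch of building min, max and clamp from a select. SelectGE is a hypothetical stand-in with its own stated convention (first value when the test is non-negative, second otherwise); on a console its body would be swapped for the platform's select intrinsic, with arguments ordered as that intrinsic requires. Plain C++ gives no guarantee that the ternary compiles without a branch.

```cpp
#include <cassert>

// Hypothetical helper: returns ifNonNegative when a >= 0, otherwise ifNegative.
inline float SelectGE(float a, float ifNonNegative, float ifNegative) {
    return (a >= 0.0f) ? ifNonNegative : ifNegative;
}

// Both candidate results are already computed; the select just picks one.
inline float MaxSelect(float a, float b) { return SelectGE(a - b, a, b); }
inline float MinSelect(float a, float b) { return SelectGE(a - b, b, a); }

inline float ClampSelect(float v, float lo, float hi) {
    return MinSelect(MaxSelect(v, lo), hi);
}
```

One caveat: testing the sign of a - b is not identical to comparing a with b when the subtraction overflows or produces NaN (for example infinity minus infinity), so the rewrite suits well-behaved inputs.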
Prefer Platform-Specific Funcs
The C runtime (CRT) is not usually the best
option when performance matters
 Xbox 360 examples


Prefer CreateFile to fopen or C++ streams: options for asynchronous reads and other goodness
 Prefer XMemCpy to memcpy: 2-6X faster
 Prefer XMemSet to memset: 8-14X faster
Avoid Hidden C++ Inefficiencies
C++ rocks the house!
 C++ can bring your game to its knees!
 Consider these innocuous snippets

Quaternion q;
 s.push_back( k );
 if( (float)i > f )
 obj->Draw();
 GameObject arr[1000];
 a = b + c;
 i++;

C++ is Dangerous
With power comes responsibility
 Beware constructors


Is initialization the right thing to do?
Beware hidden allocations
 Conversion casts may have significant cost
 Use virtual functions with care
 Beware overloaded operators
 Stick to known idioms

Operator++ should be a constant-time operation.
 Really.
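A sketch of taming one of the hidden costs above: push_back can reallocate and copy the whole vector whenever it runs out of capacity. BuildIds is a hypothetical function; reserving up front leaves a single allocation, ideally made outside the game loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

inline std::vector<int> BuildIds(std::size_t count) {
    std::vector<int> ids;
    ids.reserve(count);                        // one allocation up front
    for (std::size_t i = 0; i < count; ++i)    // pre-increment: no temporary copy
        ids.push_back(static_cast<int>(i));    // never reallocates within count
    return ids;
}
```

The same habit applies to any container that grows behind your back: size it once, then fill it.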

Summary
There absolutely are many things you can
do to efficiently program next-gen consoles
 Two key issues: L2/memory and in-order
processing

Treat memory as you would a hard disk
 Watch out for those branches; use tricks like fsel


Prefer a light C++ touch
What’s Next
Our games are only as good as the weakest
member of the team
 Share what you’ve learned
 “The sharing of ideas allows us to stand on
one another’s shoulders instead of on one
another’s feet” – Jim Warren

Questions
pkisensee@msn.com
 Fill out your feedback forms
