ECE 454 Computer Systems Programming Compiler and Optimization (II) Ding Yuan ECE Dept., University of Toronto http://www.eecg.toronto.edu/~yuan Content • Compiler Basics (last lec.) • Understanding Compiler Optimization (last lec.) • Manual Optimization • Advanced Optimizations • Parallel Unrolling • Profile-Directed Feedback 2 Ding Yuan, ECE454 Manual Optimization Example: Vector Sum Function 3 Ding Yuan, ECE454 Vector Procedures vec_ptr: length data 0 1 2 length–1 • Procedures: vec_ptr new_vec(int len) • Create vector of specified length int get_vec_element(vec_ptr v, int index, int *dest) • Retrieve vector element, store at *dest • Return 0 if out of bounds, 1 if successful int *get_vec_start(vec_ptr v) • Return pointer to start of vector data 4 Ding Yuan, ECE454 Vector Sum Function: Original Code void vsum(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • What it does: • Add all the vector elements together • Store the resulting sum at destination location *dest • Performance Metric: CPU Cycles per Element (CPE) • CPE = 42.06 (Compiled -g) 5 Ding Yuan, ECE454 After Compiler Optimization void vsum(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } // same as vsum void vsum1(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • Impact of compiler optimization • vsum: • CPE = 42.06 (Compiled -g) • vsum1: (a C-code view of the resulting assembly instructions) • CPE = 31.25 (Compiled -O2, will compile –O2 from now on) • Result: better scheduling & reg alloc, lots of missed opportunities 6 Ding Yuan, ECE454 Optimization Blocker: Procedure Calls • Why didn’t compiler move vec_len out of the inner loop? • Procedure may have side effects • Function may not return same value for given arguments • Depends on other parts of global state • Why doesn’t compiler look at code for vec_len? • Linker may overload with different version • Unless declared static • Interprocedural optimization is not used extensively due to cost • Warning: • Compiler treats procedure call as a black box • Weak optimizations in and around them 7 Ding Yuan, ECE454 Manual LICM void vsum1(vec_ptr v, int *dest) { int i; *dest = 0; for (i=0; i<vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } } void vsum2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest += val; } } • CPE = 20.66 • Next Opportunity: Inlining get_vec_element()? • compiler may not have thought worth it • O2 will prioritize code size increase rather than performance 8 Ding Yuan, ECE454 Manual Inlining void vsum2(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; get_vec_element(v, i, &val); *dest += val; } } void vsum3i(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; int *data = get_vec_start(v); val = data[i]; *dest += val; } } • Inlining: replace a function call with its body • shouldn’t normally have to do this manually • could also convince the compiler to do it via directives • Inlining enables further optimizations 9 Ding Yuan, ECE454 Further LICM, val unnecessary void vsum3i(vec_ptr v, int *dest) { int i; int length = vec_length(v); *dest = 0; for (i = 0; i < length; i++) { int val; int *data = get_vec_start(v); val = data[i]; *dest += val; } } void vsum3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } } CPE = 6.00 Compiler may not always inline when it should It may not know how important a certain loop is It may be blocked by missing function definition 10 Ding Yuan, ECE454 Reduce Unnecessary Memory Refs void vsum3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } } void vsum4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } • Don’t need to store in destination until end • Can use local variable sum instead • It is more likely to be allocated to a register! • Avoids 1 memory read, 1 memory write per iteration • CPE = 2.00 • Difficult for the compiler to promote pointer de-refs to registers 11 Ding Yuan, ECE454 Why are pointers so hard for compilers? • Pointer Aliasing • Two different pointers might point to a single location • Example: int x = 5, y = 10; int *p = &x; int *q = &y; if (one_in_a_million){ q = p; } // compiler must assume q==p = &x from here on, though unlikely 12 Ding Yuan, ECE454 Minimizing the impact of Pointers • Easy to over-use pointers in C/C++ • Since allowed to do address arithmetic • Direct access to storage structures • Get in habit of introducing local variables • Eg., when accumulating within loops • Your way of telling compiler not to worry about aliasing 13 Ding Yuan, ECE454 Understanding Instruction-Level Parallelism: Unrolling and Software Pipelining 14 Ding Yuan, ECE454 Superscalar CPU Design Instruction Control Address Fetch Control Retirement Unit Register File Instruction Cache Instrs. Instruction Decode Operations Register Updates Prediction OK? Integer/ Branch General Integer FP Add FP Mult/Div Operation Results Load Addr. Store Functional Units Addr. Data Data Data Cache Execution 15 Ding Yuan, ECE454 Assumed CPU Capabilities • Multiple Instructions Can Execute in Parallel • • • • • 1 load 1 store 2 integer (one may be branch) 1 FP Addition 1 FP Multiplication or Division • Some Instructions Take > 1 Cycle, but Can be Pipelined • • • • • • • Instruction Latency Load / Store 3 Integer Multiply 4 Double/Single FP Multiply 5 Double/Single FP Add 3 Integer Divide 36 Double/Single FP Divide 38 16 Cycles/Issue 1 1 2 1 36 38 Ding Yuan, ECE454 Intel Core i7 17 Ding Yuan, ECE454 Zooming into x86 assembly void vsum4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } .L24: addl (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl .L24 # # # # # // // // // i => %edx length => %esi data => %eax sum => %ecx Loop code only: sum += data[i] // data+i*4 i++ compare i and length if i < length goto L24 18 Ding Yuan, ECE454 Note: addl includes a load! addl (%eax,%edx,4),%ecx %edx(i) 3 cycles %ecx.0 (sum) 1 cycle load incl load %ecx.i cmpl+1 cc.1 %edx.1 addl jl t.1 addl %ecx.1 19 Ding Yuan, ECE454 Visualizing the vsum Loop .L24: addl (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl .L24 load %ecx.0 (sum) Time incl cmpl+1 %ecx.i cc.1 jl t.1 addl %ecx.1 Loop code only: sum += data[i] // data+i*4 i++ compare i and length if i < length goto L24 • Operations %edx.0 (i) load # # # # # %edx.1 • Vertical position denotes time at which executed • Cannot begin operation until operands available • Height denotes latency • addl is split into a load and addl • Eg., X86 converts to micro-ops 20 Ding Yuan, ECE454 4 Iterations of vsum %edx.0 incl 1 load 2 3 4 cmpl+1 %ecx.i cc.1 jl %ecx.0 incl load t.1 addl i=0 %ecx.1 5 6 %edx.1 Iteration 1 Cycle %edx.2 cmpl+1 %ecx.i cc.2 jl incl load t.2 addl i=1 cmpl+1 %ecx.i cc.3 jl incl load t.3 %ecx.2 Iteration 2 %edx.3 addl i=2 %ecx.3 7 Iteration 3 • Unlimited Resource Analysis %edx.4 4 integer ops cmpl+1 %ecx.i cc.4 jl t.4 addl i=3 %ecx.4 Iteration 4 • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • CPE = 1.0 • But needs to be able to issue 4 integer ops per cycle • Performance on assumed CPU: • CPE = 2.0 • Number of parallel integer ops (2) is the bottleneck 21 Ding Yuan, ECE454 Loop Unrolling: vsum void vsum5(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-2; int *data = get_vec_start(v); int sum = 0; int i; for (i = 0; i < limit; i+=3){ sum += data[i]; sum += data[i+1]; sum += data[i+2]; } for ( ; i < length; i++){ sum += data[i] *dest = sum; } 22 • Optimization • Combine multiple iterations into single loop body • Amortizes loop overhead across multiple iterations • Finish extras at end • Measured CPE = 1.33 Ding Yuan, ECE454 Visualizing Unrolled Loop: vsum • Loads can pipeline, since don’t have dependences • Only one set of loop %ecx.0c control operations %edx.0 addl load %edx.1 cmpl+1 %ecx.i load jl cc.1 t.1a addl %ecx.1a load addl %ecx.1b 23 Time t.1b t.1c addl %ecx.1c Ding Yuan, ECE454 Executing with Loop Unrolling: vsum %edx.2 5 6 addl 7 8 load 10 cc.3 jl t.3a addl %ecx.3a 11 12 cmpl+1 %ecx.i load 9 %ecx.2c %edx.3 i=6 13 load addl %edx.4 t.3b addl %ecx.3b load cmpl+1 %ecx.i t.3c addl %ecx.3c %ecx.4a 14 Cycle • Predicted Performance 15 • Can complete iteration in 3 cycles • Should give CPE of 1.0 jl t.4a addl Iteration 3 cc.4 load i=9 load t.4b addl %ecx.4b t.4c addl %ecx.4c Iteration 4 24 Ding Yuan, ECE454 Why 3 times? %edx.0 addl load %edx.1 cmpl+1 %ecx.i load jl %edx.0 cc.1 addl %edx.1 t.1a addl %ecx.1a load t.1b addl %ecx.1b %ecx.0c load cmpl+1%edx.0 %ecx.i cc.1 jl addl %edx. t.1a addl %ecx.1a load cmpl+1 %ecx.i t.1b addl %ecx.1b Unroll 2 times: what is the problem? 25 %ecx.0c load jl cc.1 t.1a addl %ecx.1a t.1b addl %ecx.1b Ding Yuan, ECE454 Vector multiply function: vprod .L24: imull (%eax,%edx,4),%ecx incl %edx cmpl %esi,%edx jl .L24 # # # # # Loop: prod *= data[i] i++ i:length if < goto Loop %edx.0 incl load %ecx.0 %edx.1 cmpl jl void vprod4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int prod = 1; for (i = 0; i < length; i++) prod *= data[i]; *dest = prod; } cc.1 t.1 Time imull %ecx.1 26 Ding Yuan, ECE454 3 Iterations of vprod • Unlimited Resource Analysis %edx.0 incl 1 load 2 cmpl cc.1 jl 3 incl load cc.2 i=0 4 %edx.2 cmpl t.1 %ecx.0 5 %edx.1 jl incl load cmpl t.2 cc.3 jl imull t.3 6 %ecx.1 7 Iteration 1 8 9 10 11 12 13 14 15 • Performance on CPU: imull Cycle %edx.3 • Assume operation can start as soon as operands available • Operations for multiple iterations overlap in time • Limiting factor becomes latency of integer multiplier i=1 %ecx.2 Iteration 2 • Data hazard imull i=2 • Gives CPE of 4.0 %ecx.3 27 3 Iteration Ding Yuan, ECE454 Unrolling: Performance Summary Unrolling Degree Relevant FU Latency 1 2 3 4 8 16 vsum 1 cycle 2.00 1.50 1.33 1.50 1.25 1.06 vprod 4 cycles 4.00 vFPsum 3 cycles 3.00 vFPprod 5 cycles 5.00 • Only helps integer sum for our examples • Other cases constrained by functional unit latencies • Effect is nonlinear with degree of unrolling • Many subtle effects determine exact scheduling of operations can we do better for vprod? 28 Ding Yuan, ECE454 Visualizing vprod 1 x0 * x1 • Computation * x2 * ((((((((((((1 * x0) * x1) * x2) * x3) * x4) * x5) * x6) * x7) * x8) * x9) * x10) * x11) x3 * • Performance x4 * x5 * • N elements, D cycles/operation • N*D cycles = N * 4 x6 * x7 * x8 * x9 * x10 * x11 * 29 Ding Yuan, ECE454 Parallel Unrolling Method1: vprod void vprod5pu1(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x = 1; int i; for (i = 0; i < limit; i+=2) { x = x * (data[i] * data[i+1]); } if (i < length) // fix-up x *= data[i]; *dest = x; } • Code Version • Integer product • Optimization • Multiply pairs of elements together • And then update product • Performance • CPE = 2 Assume unlimited hardware resources from now on 30 Ding Yuan, ECE454 %edx.0 addl load %edx.1 cmpl load jl addl load cmpl load mull tmp=data[i] * data[i+1] mull mull x=x*tmp jl • Still have the data dependency btw. multiplications, but it’s one dep. every two elements mull 31 Ding Yuan, ECE454 Order of Operations Can Matter! 1 x0 * x1 * x2 * x3 * x4 /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x = (x * data[i]) * data[i+1]; } * x5 * x6 * x7 * x8 * • CPE = 4.00 • All multiplies performed in sequence /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { x = x * (data[i] * data[i+1]); } CPE = 2 Multiplies overlap x9 * Compiler can do it for you x10 * x11 * x0 x1 1 * x2 x3 * * x4 x5 * * x6 x7 * * x8 x9 * * x10 x11 * * * 32 Ding Yuan, ECE454 Parallel Unrolling Method2: vprod void vprod6pu2(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length-1; int *data = get_vec_start(v); int x0 = 1; int x1 = 1; int i; for (i = 0; i < limit; i+=2) x0 *= data[i]; x1 *= data[i+1] } if (i < length) // fix-up x0 *= data[i]; *dest = x0 * x1; } • Code Version • Integer product • Optimization • Accumulate in two different products • Can be performed simultaneously • Combine at end • Performance • CPE = 2.0 • 2X performance Compiler won’t do it for you 33 Ding Yuan, ECE454 Method2: vprod6pu2 visualized 1 x0 * 1 x0 x1 * * x2 * for (i = 0; i < limit; i+=2) x0 *= data[i]; x1 *= data[i+1] } … *dest = x0 * x1; x1 * x3 * x2 * x4 * x3 * x5 * x4 * x6 * x5 * x7 * x6 * x8 * x7 * x9 * x8 * x9 x10 * x10 * x11 * * 34 x11 * Ding Yuan, ECE454 Trade-off for loop unrolling • Benefits • Reduce the loop operations (e.g., condition check & loop variable increment) • Reason for the integer add (vsum5) to achieve a CPE of 1 • Enhance parallelism (often require manually rewrite the loop body) • Harm • Increased code size • Register pressure • Lots of Registers to hold sums/products • IA32: Only 6 usable integer registers, 8 FP registers • When not enough registers, must spill temporaries onto stack • Wipes out any performance gains 35 Ding Yuan, ECE454 Needed for Effective Parallel Unrolling: • Mathematical • Combining operation must be associative & commutative • OK for integer multiplication • Not strictly true for floating point • OK for most applications • Hardware • Pipelined functional units, superscalar execution • Lots of Registers to hold sums/products 36 Ding Yuan, ECE454 Summary 37 Ding Yuan, ECE454 Summary of Optimizations Version Optimization vsum -g vsum1 -O2 Vsum2 LICM vsum3 Applied to Manual or GCC CPE 42.06 GCC 31.25 vec_length() manual 20.66 Inlining + LICM get_vec_element() manual 6.00 vsum4 mem-ref reduction *dest manual 2.00 vsum5 unrolling (3x) for loop -funroll-loops* 1.33 vprod4 (same as vsum4) 4.00 vprod5pu1 par. unrolling 1 for loop manual 2.0 vprod5pu2 par. unrolling 2 for loop manual 2.0 38 Ding Yuan, ECE454 Punchlines • Before you start: will optimization be worth the effort? • profile to estimate potential impact • Get the most out of your compiler before going manual • trade-off: manual optimization vs readable/maintainable • Exploit compiler opts for readable/maintainable code • have lots of functions/modularity • debugging/tracing code (enabled by static flags) • Limit use of pointers • pointer-based arrays and pointer arithmetic • function pointers and virtual functions (unfortunately) • For super-performance-critical code: • zoom into assembly to look for optimization opportunities • consider the instruction-parallelism capabilities of CPU 39 Ding Yuan, ECE454 How to know what the compiler did? run: objdump –d a.out • prints a listing of instructions , with inst address and encoding void main(){ int i = 5; int x = 10; … 08048394 <main>: inst address inst encoding 8048394: 55 push %ebp 8048395: 89 e5 mov 8048397: 83 ec 10 sub 804839a: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp) 80483a1: c7 45 fc 0a 00 00 00 movl $0xa,-0x4(%ebp) 40 %esp,%ebp $0x10,%esp Ding Yuan, ECE454 Advanced Topics • Profile-Directed Feedback (PDF) • Basically feeding gprof-like measurements into compiler • Allows compiler to make smarter choices/optimizations • Eg., handle the “if (one-in-a-million)” case well • Just-in-Time Compilation and Optimization (JIT) • Performs simple version of many of the basic optimizations • Does them dynamically at run-time, exploits online PDF • Eg., Java JIT • Current hot topics in compilation (more later): • Automatic parallelization (exploiting multicores) 41 Ding Yuan, ECE454