15-213 “The course that gives CMU its Zip!”

Code Optimization II: Machine-Dependent Optimizations
Feb 18, 2003

Topics
• Machine-Dependent Optimizations
  • Pointer code
  • Unrolling
  • Enabling instruction-level parallelism
• Understanding Processor Operation
  • Translation of instructions into operations
  • Out-of-order execution of operations
  • Branches and branch prediction
• Advice

class11.ppt

Previous Best Combining Code

  void combine4(vec_ptr v, int *dest)
  {
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
      sum += data[i];
    *dest = sum;
  }

Task
• Compute sum of all elements in vector
• Vector represented by C-style abstract data type
• Achieved CPE of 2.00 (cycles per element)

–2– 15-213, S’03

General Forms of Combining

  void abstract_combine4(vec_ptr v, data_t *dest)
  {
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
      t = t OP data[i];
    *dest = t;
  }

Data Types
• Use different declarations for data_t: int, float, double

Operations
• Use different definitions of OP and IDENT
  • + / 0
  • * / 1

Machine-Independent Opt. Results

Optimizations
• Reduce function calls and memory references within loop

  Method            Integer +  Integer *  FP +     FP *
  Abstract -g         42.06      41.86    41.44   160.00
  Abstract -O2        31.25      33.25    31.25   143.00
  Move vec_length     20.66      21.25    21.15   135.00
  data access          6.00       9.00     8.00   117.00
  Accum. in temp       2.00       4.00     3.00     5.00

Performance Anomaly
• Computing FP product of all elements exceptionally slow.
• Very large speedup when accumulating in a temporary
• Caused by quirk of IA32 floating point
  • Memory uses 64-bit format, registers use 80-bit
  • Benchmark data caused overflow of 64 bits, but not 80

Pointer Code

  void combine4p(vec_ptr v, int *dest)
  {
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int sum = 0;
    while (data < dend) {
      sum += *data;
      data++;
    }
    *dest = sum;
  }

Optimization
• Use pointers rather than array references
• CPE: 3.00 (compiled -O2)
• Oops! We’re not making progress here!

Warning: some compilers do a better job optimizing array code

Pointer vs. Array Code Inner Loops

Array code:
  .L24:                        # Loop:
    addl (%eax,%edx,4),%ecx    #   sum += data[i]
    incl %edx                  #   i++
    cmpl %esi,%edx             #   i:length
    jl .L24                    #   if < goto Loop

Pointer code:
  .L30:                        # Loop:
    addl (%eax),%ecx           #   sum += *data
    addl $4,%eax               #   data++
    cmpl %edx,%eax             #   data:dend
    jb .L30                    #   if < goto Loop

Performance
• Array code: 4 instructions in 2 clock cycles
• Pointer code: almost the same 4 instructions in 3 clock cycles

Modern CPU Design
[Block diagram: the Instruction Control unit (fetch control, instruction cache, instruction decode, retirement unit, register file) sends operations to the Execution unit’s functional units — integer/branch, general integer, FP add, FP mult/div, load, store — which connect to the data cache; operation results and “prediction OK?” feedback flow back.]

CPU Capabilities of Pentium III

Multiple instructions can execute in parallel:
• 1 load
• 1 store
• 2 integer (one may be branch)
• 1 FP addition
• 1 FP multiplication or division

Some instructions take > 1 cycle, but can be pipelined:

  Instruction                Latency  Cycles/Issue
  Load / Store                  3          1
  Integer Multiply              4          1
  Integer Divide               36         36
  Double/Single FP Multiply     5          2
  Double/Single FP Add          3          1
  Double/Single FP Divide      38         38
Instruction Control
[Diagram: fetch control and the instruction cache feed the instruction decoder; decoded operations flow to execution, and the retirement unit updates the register file.]

Grabs instruction bytes from memory
• Based on current PC + predicted targets for predicted branches
• Hardware dynamically guesses whether branches are taken/not taken and (possibly) the branch target

Translates instructions into operations
• Primitive steps required to perform instruction
• Typical instruction requires 1–3 operations

Converts register references into tags
• Abstract identifier linking destination of one operation with sources of later operations

Translation Example

Version of combine4: integer data, multiply operation

  .L24:                        # Loop:
    imull (%eax,%edx,4),%ecx   #   t *= data[i]
    incl %edx                  #   i++
    cmpl %esi,%edx             #   i:length
    jl .L24                    #   if < goto Loop

Translation of first iteration:

  imull (%eax,%edx,4),%ecx  ->  load (%eax,%edx.0,4)  -> t.1
                                imull t.1, %ecx.0     -> %ecx.1
  incl %edx                 ->  incl %edx.0           -> %edx.1
  cmpl %esi,%edx            ->  cmpl %esi, %edx.1     -> cc.1
  jl .L24                   ->  jl-taken cc.1

Translation Example #1

  imull (%eax,%edx,4),%ecx  ->  load (%eax,%edx.0,4)  -> t.1
                                imull t.1, %ecx.0     -> %ecx.1

• Split into two operations
  • load reads from memory to generate temporary result t.1
  • Multiply operation just operates on registers
• Operands
  • Register %eax does not change in loop. Values will be retrieved from register file during decoding
  • Register %ecx changes on every iteration. Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, …
    • Register renaming
    • Values passed directly from producer to consumers

Translation Example #2

  incl %edx  ->  incl %edx.0  -> %edx.1

• Register %edx changes on each iteration.
Rename as %edx.0, %edx.1, %edx.2, …

Translation Example #3

  cmpl %esi,%edx  ->  cmpl %esi, %edx.1  -> cc.1

• Condition codes are treated similarly to registers
• Assign tag to define connection between producer and consumer

Translation Example #4

  jl .L24  ->  jl-taken cc.1

• Instruction control unit determines destination of jump
  • Predicts whether it will be taken, and the target
• Starts fetching instructions at predicted destination
• Execution unit simply checks whether or not prediction was OK
  • If not, it signals instruction control
  • Instruction control then “invalidates” any operations generated from misfetched instructions
  • Begins fetching and decoding instructions at correct target

Visualizing Operations

  load (%eax,%edx.0,4)  -> t.1
  imull t.1, %ecx.0     -> %ecx.1
  incl %edx.0           -> %edx.1
  cmpl %esi, %edx.1     -> cc.1
  jl-taken cc.1

[Dataflow diagram: incl, cmpl, and jl complete quickly; the load feeds the long-latency imull, which produces %ecx.1.]

Operations
• Vertical position denotes time at which executed
  • Cannot begin operation until operands available
• Height denotes latency

Operands
• Arcs shown only for operands that are passed within the execution unit

Visualizing Operations (cont.)
  load (%eax,%edx.0,4)  -> t.1
  iaddl t.1, %ecx.0     -> %ecx.1
  incl %edx.0           -> %edx.1
  cmpl %esi, %edx.1     -> cc.1
  jl-taken cc.1

[Dataflow diagram: as before, but the 1-cycle addl replaces the 4-cycle imull.]

Operations
• Same as before, except that add has latency of 1

3 Iterations of Combining Product
[Pipeline diagram, cycles 1–15: the loads, incls, cmpls, and jls of iterations 1–3 (i = 0, 1, 2) overlap in the first few cycles, but the imull operations must execute serially, each taking 4 cycles, producing %ecx.1, %ecx.2, %ecx.3 in turn.]

Unlimited resource analysis
• Assume an operation can start as soon as its operands are available
• Operations for multiple iterations overlap in time

Performance
• Limiting factor becomes latency of integer multiplier
• Gives CPE of 4.0

4 Iterations of Combining Sum
[Pipeline diagram: with 1-cycle adds, the load, addl, incl, cmpl, and jl of iterations 1–4 (i = 0 … 3) all overlap; each iteration needs 4 integer ops.]

Performance
• Can begin a new iteration on each clock cycle
• Should give CPE of 1.0
• Would require executing 4 integer operations in parallel

Combining Sum: Resource Constraints
[Pipeline diagram, cycles 6–18: iterations 4–8 (i = 3 … 7); with limited functional units, the addl chain still completes one result every 2 cycles.]

• Only have two integer functional units
• Some operations delayed even though operands available
• Set priority based on program order

Performance
• Sustain CPE of 2.0

Loop Unrolling

  void
  combine5(vec_ptr v, int *dest)
  {
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i += 3) {
      sum += data[i] + data[i+1] + data[i+2];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
      sum += data[i];
    }
    *dest = sum;
  }

Optimization
• Combine multiple iterations into single loop body
• Amortizes loop overhead across multiple iterations
• Finish extras at end
• Measured CPE = 1.33

Visualizing Unrolled Loop
• Loads can pipeline, since they don’t have dependencies
• Only one set of loop control operations

  load (%eax,%edx.0,4)   -> t.1a
  iaddl t.1a, %ecx.0c    -> %ecx.1a
  load 4(%eax,%edx.0,4)  -> t.1b
  iaddl t.1b, %ecx.1a    -> %ecx.1b
  load 8(%eax,%edx.0,4)  -> t.1c
  iaddl t.1c, %ecx.1b    -> %ecx.1c
  iaddl $3,%edx.0        -> %edx.1
  cmpl %esi, %edx.1      -> cc.1
  jl-taken cc.1

[Dataflow diagram: the three loads pipeline, but the three addl operations form a serial chain %ecx.1a -> %ecx.1b -> %ecx.1c.]

Executing with Loop Unrolling
[Pipeline diagram, cycles 5–15: iterations 3 and 4 (i = 6, 9); each iteration’s three chained addl operations complete one iteration every 4 cycles.]

• Predicted performance
  • Can complete iteration in 3 cycles
  • Should give CPE of 1.0
• Measured performance
  • CPE of 1.33
  • One iteration every 4 cycles

Effect of Unrolling

  Unrolling Degree    1     2     3     4     8     16
  Integer Sum        2.00  1.50  1.33  1.50  1.25  1.06
  Integer Product    4.00
  FP Sum             3.00
  FP Product         5.00

• Only helps integer sum for our examples
  • Other cases constrained by functional unit latencies
• Effect is nonlinear with degree of unrolling
  • Many subtle effects determine exact scheduling of operations

Serial Computation
[Diagram: a single chain of multiplies, x0 through x11 folded in one at a time.]

Computation:
  ((((((((((((1 * x0) * x1) * x2) * x3) * x4) * x5) * x6) * x7) * x8) * x9) * x10) * x11)
Performance
• N elements, D cycles/operation
• N*D cycles

Parallel Loop Unrolling

  void combine6(vec_ptr v, int *dest)
  {
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
      x0 *= data[i];
      x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
      x0 *= data[i];
    }
    *dest = x0 * x1;
  }

Code Version
• Integer product

Optimization
• Accumulate in two different products
  • Can be performed simultaneously
• Combine at end

Performance
• CPE = 2.0
• 2X performance

Dual Product Computation

Computation:
  ((((((1 * x0) * x2) * x4) * x6) * x8) * x10)
  * ((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

Performance
• N elements, D cycles/operation
• (N/2+1)*D cycles
• ~2X performance improvement

Requirements for Parallel Computation

Mathematical
• Combining operation must be associative & commutative
  • OK for integer multiplication
  • Not strictly true for floating point
    • OK for most applications

Hardware
• Pipelined functional units
• Ability to dynamically extract parallelism from code

Visualizing Parallel Loop
• Two multiplies within loop no longer have data dependency
• Allows them to pipeline

  load (%eax,%edx.0,4)   -> t.1a
  imull t.1a, %ecx.0     -> %ecx.1
  load 4(%eax,%edx.0,4)  -> t.1b
  imull t.1b, %ebx.0     -> %ebx.1
  iaddl $2,%edx.0        -> %edx.1
  cmpl %esi, %edx.1      -> cc.1
  jl-taken cc.1

Executing with Parallel Loop
[Pipeline diagram, cycles 1–16: iterations 1–3 (i = 0, 2, 4); the two independent imull chains into %ecx and %ebx overlap in the pipelined multiplier.]

• Predicted performance: gives CPE of 2.0, keeping the 4-cycle multiplier busy with two simultaneous multiplications
Optimization Results for Combining

  Method            Integer +  Integer *  FP +     FP *
  Abstract -g         42.06      41.86    41.44   160.00
  Abstract -O2        31.25      33.25    31.25   143.00
  Move vec_length     20.66      21.25    21.15   135.00
  data access          6.00       9.00     8.00   117.00
  Accum. in temp       2.00       4.00     3.00     5.00
  Pointer              3.00       4.00     3.00     5.00
  Unroll 4             1.50       4.00     3.00     5.00
  Unroll 16            1.06       4.00     3.00     5.00
  2X2                  1.50       2.00     2.00     2.50
  4X4                  1.50       2.00     1.50     2.50
  8X4                  1.25       1.25     1.50     2.00
  Theoretical Opt.     1.00       1.00     1.00     2.00
  Worst : Best        39.7       33.5     27.6     80.0

Parallel Unrolling: Method #2

  void combine6aa(vec_ptr v, int *dest)
  {
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
      x *= (data[i] * data[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
      x *= data[i];
    }
    *dest = x;
  }

Code Version
• Integer product

Optimization
• Multiply pairs of elements together
• And then update product
• “Tree height reduction”

Performance
• CPE = 2.5

Method #2 Computation

Computation:
  ((((((1 * (x0 * x1)) * (x2 * x3)) * (x4 * x5)) * (x6 * x7)) * (x8 * x9)) * (x10 * x11))

Performance
• N elements, D cycles/operation
• Should be (N/2+1)*D cycles, CPE = 2.0
• Measured CPE worse:

  Unrolling   CPE (measured)   CPE (theoretical)
  2               2.50              2.00
  3               1.67              1.33
  4               1.50              1.00
  6               1.78              1.00

Understanding Parallelism

  /* Combine 2 elements at a time */
  for (i = 0; i < limit; i += 2) {
    x = (x * data[i]) * data[i+1];
  }

• CPE = 4.00
• All multiplies performed in sequence
[Diagram: a single serial chain of multiplies through x0 … x11.]

  /* Combine 2 elements at a time */
  for (i = 0; i < limit; i += 2) {
    x = x * (data[i] * data[i+1]);
  }
• CPE = 2.50
• Multiplies overlap
[Diagram: pairwise products x0*x1, x2*x3, … feed a shorter chain into the accumulator.]

Limitations of Parallel Execution

Need Lots of Registers
• To hold sums/products
• Only 6 usable integer registers
  • Also needed for pointers, loop conditions
• 8 FP registers
• When not enough registers, must spill temporaries onto stack
  • Wipes out any performance gains
• Not helped by renaming
  • Cannot reference more operands than instruction set allows
  • Major drawback of IA32 instruction set

Register Spilling Example
• 8 X 8 integer product
• 7 local variables share 1 register
• Note that locals are stored on the stack, e.g., at -8(%ebp)

  .L165:
    imull (%eax),%ecx
    movl -4(%ebp),%edi
    imull 4(%eax),%edi
    movl %edi,-4(%ebp)
    movl -8(%ebp),%edi
    imull 8(%eax),%edi
    movl %edi,-8(%ebp)
    movl -12(%ebp),%edi
    imull 12(%eax),%edi
    movl %edi,-12(%ebp)
    movl -16(%ebp),%edi
    imull 16(%eax),%edi
    movl %edi,-16(%ebp)
    …
    addl $32,%eax
    addl $8,%edx
    cmpl -32(%ebp),%edx
    jl .L165

Summary: Results for Pentium III

  Method            Integer +  Integer *  FP +     FP *
  Abstract -g         42.06      41.86    41.44   160.00
  Abstract -O2        31.25      33.25    31.25   143.00
  Move vec_length     20.66      21.25    21.15   135.00
  data access          6.00       9.00     8.00   117.00
  Accum. in temp       2.00       4.00     3.00     5.00
  Unroll 4             1.50       4.00     3.00     5.00
  Unroll 16            1.06       4.00     3.00     5.00
  4X2                  1.50       2.00     1.50     2.50
  8X4                  1.25       1.25     1.50     2.00
  8X8                  1.88       1.88     1.75     2.00
  Worst : Best        39.7       33.5     27.6     80.0

• Biggest gain comes from doing the basic optimizations
• But the last little bit helps

Results for Alpha Processor
  Method            Integer +  Integer *  FP +     FP *
  Abstract -g         40.14      47.14    52.07    53.71
  Abstract -O2        25.08      36.05    37.37    32.02
  Move vec_length     19.19      32.18    28.73    32.73
  data access          6.26      12.52    13.26    13.01
  Accum. in temp       1.76       9.01     8.08     8.01
  Unroll 4             1.51       9.01     6.32     6.32
  Unroll 16            1.25       9.01     6.33     6.22
  4X2                  1.19       4.69     4.44     4.45
  8X4                  1.15       4.12     2.34     2.01
  8X8                  1.11       4.24     2.36     2.08
  Worst : Best        36.2       11.4     22.3     26.7

• Overall trends very similar to those for Pentium III
• Even though very different architecture and compiler

Results for Pentium 4

  Method            Integer +  Integer *  FP +     FP *
  Abstract -g         35.25      35.34    35.85    38.00
  Abstract -O2        26.52      30.26    31.55    32.00
  Move vec_length     18.00      25.71    23.36    24.25
  data access          3.39      31.56    27.50    28.35
  Accum. in temp       2.00      14.00     5.00     7.00
  Unroll 4             1.01      14.00     5.00     7.00
  Unroll 16            1.00      14.00     5.00     7.00
  4X2                  1.02       7.00     2.63     3.50
  8X4                  1.01       3.98     1.82     2.00
  8X8                  1.63       4.50     2.42     2.31
  Worst : Best        35.2        8.9     19.7     19.0

• Higher latencies (int * = 14, FP + = 5.0, FP * = 7.0)
  • Clock runs at 2.0 GHz
  • Not an improvement over 1.0 GHz P3 for integer *
• Avoids FP multiplication anomaly

What About Branches?

Challenge
• Instruction Control Unit must work well ahead of Exec.
Unit
• To generate enough operations to keep EU busy

  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx             <- executing
  80489fc: jnl   8048a25
  80489fe: movl  %esi,%esi             <- fetching & decoding
  8048a00: imull (%eax,%edx,4),%ecx

• When it encounters a conditional branch, it cannot reliably determine where to continue fetching

Branch Outcomes
• When encountering a conditional branch, cannot determine where to continue fetching
  • Branch Taken: transfer control to branch target
  • Branch Not-Taken: continue with next instruction in sequence
• Cannot resolve until outcome determined by branch/integer unit

  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25               # taken -> 8048a25
  80489fe: movl  %esi,%esi             # not-taken: fall through
  8048a00: imull (%eax,%edx,4),%ecx

  8048a25: cmpl  %edi,%edx
  8048a27: jl    8048a20
  8048a29: movl  0xc(%ebp),%eax
  8048a2c: leal  0xffffffe8(%ebp),%esp
  8048a2f: movl  %ecx,(%eax)

Branch Prediction

Idea
• Guess which way branch will go
• Begin executing instructions at predicted position
  • But don’t actually modify register or memory data
  80489f3: movl  $0x1,%ecx
  80489f8: xorl  %edx,%edx
  80489fa: cmpl  %esi,%edx
  80489fc: jnl   8048a25               # predict taken
  . . .
  8048a25: cmpl  %edi,%edx             <- begin executing here
  8048a27: jl    8048a20
  8048a29: movl  0xc(%ebp),%eax
  8048a2c: leal  0xffffffe8(%ebp),%esp
  8048a2f: movl  %ecx,(%eax)

Branch Prediction Through Loop
Assume vector length = 100

  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl   80488b1

• i = 98: predict taken (OK)
• i = 99: predict taken (oops)
• i = 100: executed — reads invalid location
• i = 101: fetched

Branch Misprediction Invalidation
Assume vector length = 100

• i = 98: predict taken (OK)
• i = 99: predict taken (oops)
• i = 100: invalidate
• i = 101: instructions fetched past the mispredicted branch are discarded

Branch Misprediction Recovery

  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx              # i = 99
  80488b9: jl   80488b1                # definitely not taken
  80488bb: leal 0xffffffe8(%ebp),%esp
  80488be: popl %ebx
  80488bf: popl %esi
  80488c0: popl %edi

Performance Cost
• Misprediction on
Pentium III wastes ~14 clock cycles
• That’s a lot of time on a high-performance processor

Avoiding Branches

On Modern Processors, Branches Are Very Expensive
• Unless prediction can be reliable
• When possible, best to avoid them altogether

Example
• Compute maximum of two values
  • 14 cycles when prediction correct
  • 29 cycles when incorrect

  int max(int x, int y)
  {
    return (x < y) ? y : x;
  }

  movl 12(%ebp),%edx   # Get y
  movl 8(%ebp),%eax    # rval=x
  cmpl %edx,%eax       # rval:y
  jge L11              # skip when >=
  movl %edx,%eax       # rval=y
  L11:

Avoiding Branches with Bit Tricks
• In style of Lab #1
• Use masking rather than conditionals

  int bmax(int x, int y)
  {
    int mask = -(x>y);
    return (mask & x) | (~mask & y);
  }

• Compiler still uses conditional
  • 16 cycles when predicting correctly
  • 32 cycles when mispredicting

  xorl %edx,%edx        # mask = 0
  movl 8(%ebp),%eax
  movl 12(%ebp),%ecx
  cmpl %ecx,%eax
  jle L13               # skip if x<=y
  movl $-1,%edx         # mask = -1
  L13:

Avoiding Branches with Bit Tricks (cont.)
• Force compiler to generate desired code

  int bvmax(int x, int y)
  {
    volatile int t = (x>y);
    int mask = -t;
    return (mask & x) | (~mask & y);
  }

  movl 8(%ebp),%ecx     # Get x
  movl 12(%ebp),%edx    # Get y
  cmpl %edx,%ecx        # x:y
  setg %al              # (x>y)
  movzbl %al,%eax       # Zero extend
  movl %eax,-4(%ebp)    # Save as t
  movl -4(%ebp),%eax    # Retrieve t

• volatile declaration forces value to be written to memory
  • Compiler must therefore generate code to compute t
  • Simplest way is setg/movzbl combination
• Not very elegant!
• A hack to get control over the compiler
• 22 clock cycles on all data
  • Better than misprediction

Conditional Move
• Added with P6 microarchitecture (PentiumPro onward)
• cmovXXl %edx, %eax
  • If condition XX holds, copy %edx to %eax
  • Doesn’t involve any branching
  • Handled as an operation within the Execution Unit

  movl 8(%ebp),%edx    # Get x
  movl 12(%ebp),%eax   # rval=y
  cmpl %edx, %eax      # rval:x
  cmovll %edx,%eax     # If <, rval=x

• Current version of GCC won’t use this instruction
  • Thinks it’s compiling for a 386
• Performance: 14 cycles on all data

Machine-Dependent Opt. Summary

Pointer Code
• Look carefully at generated code to see whether it is helpful

Loop Unrolling
• Some compilers do this automatically
• Generally not as clever as what can be achieved by hand

Exposing Instruction-Level Parallelism
• Very machine dependent

Warning:
• Benefits depend heavily on particular machine
• Best if performed by compiler
  • But GCC on IA32/Linux is not very good
• Do only for performance-critical parts of code

Role of Programmer

How should I write my programs, given that I have a good, optimizing compiler?

Don’t: Smash Code into Oblivion
• Hard to read, maintain, & assure correctness

Do:
• Select best algorithm
• Write code that’s readable & maintainable
  • Procedures, recursion, without built-in constant limits
  • Even though these factors can slow down code
• Eliminate optimization blockers
  • Allows compiler to do its job

Focus on Inner Loops
• Do detailed optimizations where code will be executed repeatedly
• Will get most performance gain here