Compiler Techniques for Exposing ILP

Instruction-Level Parallelism (ILP)
• Potential overlap among instructions
• Few possibilities within a single basic block
  – Blocks are small (6-7 instructions)
  – Instructions are dependent
• Goal: exploit ILP across multiple basic blocks
  – e.g., across the iterations of a loop:

      for (i = 1000; i > 0; i = i - 1)
          x[i] = x[i] + s;

Basic Scheduling

Sequential MIPS assembly code for the loop above (F2 holds s, R1 points
at the current element):

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         SUBI  R1, R1, #8
         BNEZ  R1, Loop

Pipelined execution, 10 cycles per iteration:

  Loop:  LD    F0, 0(R1)     1
         stall               2
         ADDD  F4, F0, F2    3
         stall               4
         stall               5
         SD    0(R1), F4     6
         SUBI  R1, R1, #8    7
         stall               8
         BNEZ  R1, Loop      9
         stall               10

Scheduled pipelined execution, 6 cycles per iteration:

  Loop:  LD    F0, 0(R1)     1
         SUBI  R1, R1, #8    2
         ADDD  F4, F0, F2    3
         stall               4
         BNEZ  R1, Loop      5
         SD    8(R1), F4     6

Loop Unrolling

Unrolled four times (intermediate exit tests still present):

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         SUBI  R1, R1, #8
         BEQZ  R1, Exit
         LD    F6, 0(R1)
         ADDD  F8, F6, F2
         SD    0(R1), F8
         SUBI  R1, R1, #8
         BEQZ  R1, Exit
         LD    F10, 0(R1)
         ADDD  F12, F10, F2
         SD    0(R1), F12
         SUBI  R1, R1, #8
         BEQZ  R1, Exit
         LD    F14, 0(R1)
         ADDD  F16, F14, F2
         SD    0(R1), F16
         SUBI  R1, R1, #8
         BNEZ  R1, Loop
  Exit:

Pros: larger basic block; more scope for scheduling and for eliminating dependences
Cons: increases code size
Comment: often a precursor step for other optimizations

Loop Transformations
• Instruction independence is the key requirement for the transformations
• Example
  – Determine that it is legal to move the SD after the SUBI and BNEZ
  – Determine that unrolling is useful (iterations are independent)
  – Use different registers to avoid unnecessary constraints
  – Eliminate the extra tests and branches
  – Determine that the LDs and SDs can be interchanged
  – Schedule the code, preserving the semantics of the code
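The unrolling shown above can also be written at the source level. Below is a minimal C sketch (the function name add_scalar_unrolled is invented here); like the slides, it assumes the trip count of 1000 is a multiple of the unroll factor, so the intermediate exit tests can be dropped:

```c
/* Unrolled-by-4 version of: for (i = 1000; i > 0; i--) x[i] += s;
   Assumes the trip count is a multiple of 4 (true for 1000), so no
   intermediate exit tests are needed -- mirroring the unrolled MIPS
   loop above after the BEQZs are removed. */
void add_scalar_unrolled(double *x, double s) {
    for (int i = 1000; i > 0; i -= 4) {
        x[i]     += s;   /* original iteration i   */
        x[i - 1] += s;   /* original iteration i-1 */
        x[i - 2] += s;   /* original iteration i-2 */
        x[i - 3] += s;   /* original iteration i-3 */
    }
}
```

The four statements in the body are independent of each other, which is what gives the scheduler room to reorder them.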
1. Eliminating Name Dependences (Register Renaming)

Unrolled loop reusing F0 and F4 in every body (name dependences):

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         LD    F0, -8(R1)
         ADDD  F4, F0, F2
         SD    -8(R1), F4
         LD    F0, -16(R1)
         ADDD  F4, F0, F2
         SD    -16(R1), F4
         LD    F0, -24(R1)
         ADDD  F4, F0, F2
         SD    -24(R1), F4
         SUBI  R1, R1, #32
         BNEZ  R1, Loop

After register renaming (each copy of the body gets fresh registers):

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         LD    F6, -8(R1)
         ADDD  F8, F6, F2
         SD    -8(R1), F8
         LD    F10, -16(R1)
         ADDD  F12, F10, F2
         SD    -16(R1), F12
         LD    F14, -24(R1)
         ADDD  F16, F14, F2
         SD    -24(R1), F16
         SUBI  R1, R1, #32
         BNEZ  R1, Loop

2. Eliminating Control Dependences

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         SUBI  R1, R1, #8
         BEQZ  R1, Exit
         LD    F6, 0(R1)
         ADDD  F8, F6, F2
         SD    0(R1), F8
         SUBI  R1, R1, #8
         BEQZ  R1, Exit
         LD    F10, 0(R1)
         ADDD  F12, F10, F2
         SD    0(R1), F12
         SUBI  R1, R1, #8
         BEQZ  R1, Exit
         LD    F14, 0(R1)
         ADDD  F16, F14, F2
         SD    0(R1), F16
         SUBI  R1, R1, #8
         BNEZ  R1, Loop
  Exit:

The intermediate BEQZ branches are never taken (the trip count is a
multiple of 4), so they can be eliminated.

3. Eliminating Data Dependences

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         SUBI  R1, R1, #8
         LD    F6, 0(R1)
         ADDD  F8, F6, F2
         SD    0(R1), F8
         SUBI  R1, R1, #8
         LD    F10, 0(R1)
         ADDD  F12, F10, F2
         SD    0(R1), F12
         SUBI  R1, R1, #8
         LD    F14, 0(R1)
         ADDD  F16, F14, F2
         SD    0(R1), F16
         SUBI  R1, R1, #8
         BNEZ  R1, Loop

• Data dependences through R1 (each SUBI feeds the following LD and SD)
  force sequential execution of the iterations
• The compiler removes this dependence by:
  – Computing the intermediate R1 values as offsets (0, -8, -16, -24)
  – Eliminating the intermediate SUBIs
  – Changing the final SUBI to subtract #32
• Data flow analysis
  – Can be done on registers
  – Cannot easily be done on memory locations: is 100(R1) the same
    location as 20(R2)?
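The last point, that data-flow analysis is easy on registers but hard on memory locations, is the aliasing problem. A minimal C sketch (the function name scale is invented here): without extra information the compiler must assume the two pointers might refer to the same locations, which is exactly the "100(R1) = 20(R2)?" question; the C99 restrict qualifier lets the programmer rule that out:

```c
/* Without 'restrict' the compiler must assume dst and src may overlap,
   so it cannot freely reorder the stores past the loads.  With
   'restrict' the programmer promises no aliasing, enabling the same
   reorderings the slides perform on registers. */
void scale(double *restrict dst, const double *restrict src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0;
}
```

Calling scale(a, a + 1, n), which violates the restrict promise, would be undefined behavior; that contract is what makes the compiler's job tractable.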
4. Alleviating Data Dependences

Unrolled loop:

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         LD    F6, -8(R1)
         ADDD  F8, F6, F2
         SD    -8(R1), F8
         LD    F10, -16(R1)
         ADDD  F12, F10, F2
         SD    -16(R1), F12
         LD    F14, -24(R1)
         ADDD  F16, F14, F2
         SD    -24(R1), F16
         SUBI  R1, R1, #32
         BNEZ  R1, Loop

Scheduled unrolled loop (loads first, then adds; the last two stores use
adjusted offsets because they follow the SUBI, and the final SD fills
the branch delay slot):

  Loop:  LD    F0, 0(R1)
         LD    F6, -8(R1)
         LD    F10, -16(R1)
         LD    F14, -24(R1)
         ADDD  F4, F0, F2
         ADDD  F8, F6, F2
         ADDD  F12, F10, F2
         ADDD  F16, F14, F2
         SD    0(R1), F4
         SD    -8(R1), F8
         SUBI  R1, R1, #32
         SD    16(R1), F12
         BNEZ  R1, Loop
         SD    8(R1), F16

Some General Comments
• Dependences are a property of programs
• Actual hazards are a property of the pipeline
• Techniques to avoid dependence limitations
  – Maintain the dependences but avoid the hazards
    • Code scheduling: in hardware or in software
  – Eliminate the dependences by code transformations
    • Complex
    • Compiler-based

Loop-Level Parallelism
• The primary focus of dependence analysis
• Determine all dependences and find cycles

No loop-carried dependence; iterations can run in parallel:

  for (i = 1; i <= 100; i = i + 1) {
      x[i] = y[i] + z[i];
      w[i] = x[i] + v[i];
  }

Loop-carried dependence; iteration i+1 needs the x[i] computed by
iteration i:

  for (i = 1; i <= 100; i = i + 1) {
      x[i+1] = x[i] + z[i];
  }

Loop-carried dependence that is not part of a cycle; the loop can be
transformed:

  for (i = 1; i <= 100; i = i + 1) {
      x[i] = x[i] + y[i];
      y[i+1] = w[i] + z[i];
  }

becomes

  x[1] = x[1] + y[1];
  for (i = 1; i <= 99; i = i + 1) {
      y[i+1] = w[i] + z[i];
      x[i+1] = x[i+1] + y[i+1];
  }
  y[101] = w[100] + z[100];

Dependence Analysis Algorithms
• Assume the array indexes are affine (a*i + b)
  – GCD test: for two affine array indexes a*i + b and c*i + d, if a
    loop-carried dependence exists, then GCD(c, a) must divide (d - b)
  – Example: x[8*i] = x[4*i + 2] + 3
    GCD(8, 4) = 4 does not divide (2 - 0) = 2, so no loop-carried
    dependence exists
• Exact dependence testing in the general case is NP-complete
• a, b, c, and d may not be known at compile time

Software Pipelining

[Figure: iterations 0-3 overlapped in time; a software-pipelined
iteration takes one instruction from each of several original
iterations, with start-up code before the loop and finish-up code
after it.]

Example -- the bodies of three consecutive iterations:

  Iteration i:    LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4
  Iteration i+1:  LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4
  Iteration i+2:  LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4

Original loop:

  Loop:  LD    F0, 0(R1)
         ADDD  F4, F0, F2
         SD    0(R1), F4
         SUBI  R1, R1, #8
         BNEZ  R1, Loop

Software-pipelined loop (each pass stores for iteration i, adds for
iteration i+1, and loads for iteration i+2):

  Loop:  SD    16(R1), F4
         ADDD  F4, F0, F2
         LD    F0, 0(R1)
         SUBI  R1, R1, #8
         BNEZ  R1, Loop
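The GCD test described under Dependence Analysis Algorithms is small enough to write out directly. A minimal C sketch (the function names gcd and may_depend are invented here):

```c
/* Euclid's algorithm; returns a non-negative GCD. */
int gcd(int a, int c) {
    while (c != 0) { int t = a % c; a = c; c = t; }
    return a < 0 ? -a : a;
}

/* GCD test for a store to x[a*i + b] and a load of x[c*i + d]:
   a loop-carried dependence can exist only if gcd(a, c) divides (d - b).
   Returns 1 if a dependence is possible, 0 if it is ruled out.
   Note the test is conservative: a return of 1 does not prove a
   dependence, it only fails to disprove one. */
int may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    return g != 0 && (d - b) % g == 0;
}
```

For the slide's example x[8*i] = x[4*i + 2] + 3, may_depend(8, 0, 4, 2) returns 0: gcd(8, 4) = 4 does not divide 2, so the iterations are independent.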
Trace (Global-Code) Scheduling
• Find ILP across conditional branches
• Two-step process
  – Trace selection
    • Find a trace (a sequence of basic blocks)
    • Use loop unrolling to generate long traces
    • Use static branch prediction for the other conditional branches
  – Trace compaction
    • Squeeze the trace into a small number of wide instructions
    • Preserve the data and control dependences

Trace Selection

Source fragment (the selected trace assumes the A[I] = 0 test succeeds,
so the fall-through path through B[I] = ... is chosen):

  A[I] = A[I] + B[I];
  if (A[I] == 0)
      B[I] = ...;
  else
      X;
  C[I] = ...;

Selected trace in assembly:

         LW    R4, 0(R1)      ; load A[I]
         LW    R5, 0(R2)      ; load B[I]
         ADD   R4, R4, R5
         SW    0(R1), R4      ; store A[I]
         BNEZ  R4, Else
         SW    0(R2), ...     ; store B[I]
         J     Join
  Else:  X
  Join:  SW    0(R3), ...     ; store C[I]

Summary of Compiler Techniques
• Try to avoid dependence stalls
• Loop unrolling
  – Reduces the loop overhead
• Software pipelining
  – Reduces the single-body dependence stalls
• Trace scheduling
  – Reduces the impact of other branches
• Compilers use a mix of all three
• All techniques depend on prediction accuracy

Food for Thought: Analyze This
• Analyze this loop for different values of X and Y
  – To evaluate different branch prediction schemes
  – For compiler scheduling purposes

         add   r1, r0, 1000   # all numbers in decimal
         add   r2, r0, a      # base address of array a
  loop:  andi  r10, r1, X
         beqz  r10, even
         lw    r11, 0(r2)
         addi  r11, r11, 1
         sw    0(r2), r11
  even:  addi  r2, r2, 4
         subi  r1, r1, Y
         bnez  r1, loop
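As a starting point for the exercise, one way to study the beqz branch is to simulate it for concrete values of X and Y. A minimal C sketch (the function name count_taken is invented here; it assumes Y divides 1000, since bnez only tests for zero and the loop would otherwise not terminate):

```c
/* Counts how often 'beqz r10, even' is taken, i.e. how often
   (r1 & X) == 0, over the whole run of the loop above.
   With X = 1, Y = 1 the branch alternates taken / not-taken --
   the worst case for a simple 1-bit branch predictor. */
int count_taken(int x, int y) {
    int taken = 0;
    for (int r1 = 1000; r1 != 0; r1 -= y)   /* bnez r1, loop */
        if ((r1 & x) == 0)                  /* andi + beqz   */
            taken++;
    return taken;
}
```

For example, count_taken(1, 1) is 500 (every even r1) and count_taken(3, 1) is 250 (every r1 that is a multiple of 4); the pattern of taken outcomes is what a branch-prediction scheme would have to learn.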