Spring 2010 Loop Optimizations Instruction Scheduling 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 2 6.035 ©MIT Fall 1998 Scheduling g Loops p • Loop p bodies are small • But, lot of time is spend in loops due to large number of iterations • Need better ways to schedule loops Saman Amarasinghe 3 6.035 ©MIT Fall 1998 Loop p Example p • Machine – One load/store unit • load 2 cycles • store 2 cycles – Two arithmetic units • add 2 cycles • branch 2 cycles • multiply 3 cycles – Both units are pipelined (initiate one op each cycle) • Source Code for i = 1 to N A[i] = A[i] * b Saman Amarasinghe 4 6.035 ©MIT Fall 1998 Loop p Example p • Source Code for i = 1 to N A[i] = A[i] * b offset ff • Assembly Code loop: mov imul mov sub jz Saman Amarasinghe base (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 5 6.035 ©MIT Fall 1998 Loopp Example p mov d=7 2 • Assembly Code loop: mov imul mov sub jz imul d=5 3 (%rdi,%rax), %r10 %r11, , %r10 %r10, (%rdi,%rax) $4, %rax loop mov 0 sub d=2 2 • Schedule (9 cycles per iteration) mov d=2 jz d=0 mov mov mov imul bge imul bge imul sub sub Saman Amarasinghe 6 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 7 6.035 ©MIT Fall 1998 Loop p Unrollingg • Unroll the loop pbod yfew times • Pros: – Create a much larger basic block for the body body – Eliminate few loop bounds checks • Cons: – Much larger program – Setup codde (# off iiterations i < unroll ll factor) f ) – beginning and end of the schedule can still have unusedd slots l t Saman Amarasinghe 8 6.035 ©MIT Fall 1998 loop: mov imul mov sub jz Saman Amarasinghe Loop p Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 9 6.035 ©MIT Fall 1998 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loop p Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 10 6.035 ©MIT Fall 1998 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Loopp Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 3 0 2 2 3 0 2 • Schedule (8 cycles per iteration) mov mov mov mov mov mov mov imul mov imul imul bge imul imul bge imul sub sub sub Saman Amarasinghe sub 11 6.035 ©MIT Fall 1998 Loop p Unrollingg • Rename registers g – Use different registers in different iterations Saman Amarasinghe 12 6.035 ©MIT Fall 1998 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 13 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d 2 d=2 jz d=0 2 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %rcx %r11, %rcx %rcx, (%rdi,%rax) $4, %rax loop 14 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d 2 d=2 jz d=0 2 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 Loop pUnrollingg • Rename reg gisters – Use different registers in different iterations • Eliminate unnecessary dependencies – again again, use more registers to eliminate true, true anti and output dependencies – eliminate dependent dependent-chains chains of calculations when when possible Saman Amarasinghe 15 6.035 ©MIT Fall 1998 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %rcx %r11, %rcx %rcx, (%rdi,%rax) $4, %rax loop 16 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 mov d=5 mul d=3 mov d=0 sub d=0 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $8, %rax (%rdi,%rbx), %rcx %r11, %rcx %rcx, (%rdi,%rbx) $8, %rbx loop 17 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 mov d=5 mul d=3 mov d=0 sub d=0 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Loopp Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $8, %rax (%rdi,%rbx), %rcx %r11, %rcx %rcx, (%rdi,%rbx) $8, %rbx loop • S Schedule h d l (4.5 (4 5 cycles l per it iteration ti mov mov mov mov mov imul imul 2 2 3 0 2 mov jz imul imul jz imul sub sub sub Saman Amarasinghe 0 mov mov imul 3 sub 18 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler loop invariiant codde motiion Induction Variable Recognition Saman Amarasinghe 19 6.035 ©MIT Fall 1998 Software Pipelining p g • Try yto overla pmulti ple iterations so that the slots will be filled • Find the steady steady-state state window so that: – all the instructions of the loop body is executed – but from different iterations Saman Amarasinghe 20 6.035 ©MIT Fall 1998 Loop p Example p • Assembly Code loop: mov imul mov sub jz j (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop p • Schedule mov mov mov mov mul jz mul jz mul sub sub Saman Amarasinghe 21 6.035 ©MIT Fall 1998 Loopp Example p • Assembly Code loop: mov imul mov sub jz j (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop p • Schedule mov mov1 mov mul Saman Amarasinghe mov2 mov mov1 mov2 1 2 mul1 mul mul1 mul sub mov3 mov1 mov4 mov2 mov mov3 3 mov1 1 mov4 4 mul2 jz mul3 jz1 mul2 jz mul3 mul1 mul2 sub1 sub2 sub sub1 22 mov5 mov2 2 mul4 jz1 mul3 mov3 ld5 jz2 mul4 mov6 mov3 3 mul5 jz2 mul4 sub3 sub2 6.035 sub3 ©MIT Fall 1998 Loopp Example p • Assembly Code loop: mov imul mov sub jz j (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop p • Schedule (2 cycles per iteration) mov mov1 mov mul Saman Amarasinghe mov2 mov 1 2 mov1 mov2 mul1 mul mul1 mul sub mov3 mov1 mov4 mov2 3 mov1 1 mov4 4 mov mov3 mul2 jz mul3 jz1 mul2 jz mul3 mul1 mul2 sub1 sub2 sub sub1 23 mov5 2 mov2 mul4 jz1 mul3 mov3 ld5 jz2 mul4 mov6 3 mov3 mul5 jz2 mul4 sub3 sub2 6.035 sub3 ©MIT Fall 1998 Loopp Example p • 4 iterations are overlapped – value of %r11 don’t change – 4 regs for (%rdi,%rax) – each addr. incremented by 4*4 – 4 regs to keep value %r10 – Same registers can be reused after 4 of these blocks ggenerate code for 4 blocks,, otherwise need to move Saman Amarasinghe 24 mov4 mov1 mul3 jz mul2 mov2 mov4 jz1 mul3 sub2 sub1 loop: mov imul mov sub jz (%rdi,%rax), %r10 %r11 %r10 %r11, %r10, (%rdi,%rax) $4, %rax loop 6.035 ©MIT Fall 1998 Software Pipelining p g • Optimal use of resources • Need N d a llot off registers i – Values in multiple iterations need to be kept • Issues in dependencies – Executing a store instruction in an iteration before branch instruction is executed for a previous iteration (writing when it should not have) – Loads and stores are issued out-of-order (need to figure-out dependencies before doing this) • Code generation issues – Generate pre-amble and post-amble code – Multiple blocks so no register copy is needed Saman Amarasinghe 25 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 26 6.035 ©MIT Fall 1998 Register Allocation and d Instruction I t ti Scheduling S h d li • If reg gister allocation is before instruction scheduling –restricts the choices for scheduling g Saman Amarasinghe 27 6.035 ©MIT Fall 1998 Example p 1: 2 2: 3: 4: mov add dd mov add Saman Amarasinghe 4(%rbp), %rax % %rax, % %rbx b 8(%rbp), %rax %rax, %rcx 28 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 1 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %rax %rax, %rcx 3 2 1 1 1 1 3 3 4 Saman Amarasinghe 29 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 1 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %rax %rax, %rcx 3 2 1 1 1 1 3 3 ALUop MEM 1 2 1 MEM 2 Saman Amarasinghe 4 4 3 1 3 30 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 1 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %rax %rax, %rcx 3 2 1 1 1 Anti-dependence How about a different register? 1 3 3 4 Saman Amarasinghe 31 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %r10 %r10, %rcx Anti-dependence How about a different register? 1 3 2 3 3 4 Saman Amarasinghe 32 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %r10 %r10, %rcx 1 3 2 3 3 ALUop MEM 1 2 1 MEM 2 Saman Amarasinghe 3 1 4 4 3 33 6.035 ©MIT Fall 1998 Register Allocation and d Instruction I t ti Scheduling S h d li • If reg gister allocation is before instruction scheduling –restricts the choices for scheduling g Saman Amarasinghe 34 6.035 ©MIT Fall 1998 Register Allocation andd Inst I tructi tion Schhed duli ling • If reg gister allocation is before instruction scheduling – restricts the choices for scheduling g • If instruction scheduling before register register allocation – Register allocation may spill registers – Will change the carefully done schedule!!! Saman Amarasinghe 35 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 36 6.035 ©MIT Fall 1998 Superscalar: Where have all the t transistors i t gone? ? • Out of order execution – If an instruction stalls, go beyond that and start executing non-dependent instructions – Pros: • Hardware scheduling • Tolerates unpredictable latencies – Cons: • Instruction window is small Saman Amarasinghe 37 6.035 ©MIT Fall 1998 Superscalar: Where have all the t transistors i t gone? ? • Reg gister renaming g – If there is an anti or output dependency of a register that stalls the pipeline, use a different hardware register – Pros: • Avoids anti and output dependencies – Cons: • Cannot do more complex transformations to eliminate dependencies Saman Amarasinghe 38 6.035 ©MIT Fall 1998 Hardware vs. Compiler p • In a superscalar, hardware and compiler scheduling can work hand-in-hand • Hardware can reduce the burden when not predictable by the compiler • Compiler can still greatly enhance the performance – Large instruction window for scheduling – Many program transformations that increase parallelism • C Compiler il is i even more critical iti l when h no hardware h d support – VLIW machines (Itanium (Itanium, DSPs) Saman Amarasinghe 39 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d ion V Inducti Variiable bl Recogniitiion loop invariant code motion Saman Amarasinghe 40 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(i) = 0 i = i - 1 Saman Amarasinghe 41 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(i) = 0 i = i - 1 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. Saman Amarasinghe 42 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(i) = 0 i = i - 1 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 43 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(201 - j) = 0 i = i - 1 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 44 6.035 ©MIT Fall 1998 Induction Variables • Example p for j = 1 to 100 a(201 - j) = 0 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 45 6.035 ©MIT Fall 1998 What are induction variables? • x is an induction variable of a loop p L if – variable changes its value every iteration of the loop – the value is a function of number of iterations of the loop • In compilers this function is normally a linear function – Example: for loop index variable j, function c*j + d Saman Amarasinghe 46 6.035 ©MIT Fall 1998 What can we do with induction variables? i bl ? • Use them to perform strength reduction • Get rid of them Saman Amarasinghe 47 6.035 ©MIT Fall 1998 Classification of induction variables • Basic induction variables – Explicitly modified by the same constant amount once during each iteration of the loop – Example: loop index variable • Dependent induction variables – Can be expressed in the form: aa*xx + b where a and be are loop invariant and x is an induction variable – Example: 202 - 2 2*j j Saman Amarasinghe 48 6.035 ©MIT Fall 1998 Classification of induction variables • Class of induction variables: All induction variables with same basic variable in their q linear equations • Basis of a class: the basic variable that determines that class Saman Amarasinghe 49 6.035 ©MIT Fall 1998 Finding g Basic Induction Variables • Look inside loop p nodes • Find variables whose only modification is of the form j = j + d where d is a loop constant Saman Amarasinghe 50 6.035 ©MIT Fall 1998 Finding g Dependent p Induction Variables • Find all the basic induction variables • Search variable k with a single assignment in the loop • Variable assignments of the form k = e op j or k = -j where j is an induction variable and e is loop invariant Saman Amarasinghe 51 6.035 ©MIT Fall 1998 Finding g Dependent p Induction Variables • Example for i = 1 to 100 j = i*c k = j+1 Saman Amarasinghe 52 6.035 ©MIT Fall 1998 A special p case t = 202 for j = 1 to 100 t = t - 2 j = t a(j) t = t - 2 (j) = t b(j) Saman Amarasinghe 53 6.035 ©MIT Fall 1998 A special p case u1 = 200 u2 = 202 for j = 1 to 100 u1 = u1 - 4 a(j) = u1 u2 = u2 - 4 b(j) = u2 t = 202 for j = 1 to 100 t = t - 2 j = t a(j) t = t - 2 (j) = t b(j) Saman Amarasinghe 54 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii Loop invariant code motion Saman Amarasinghe 55 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop Saman Amarasinghe 56 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 57 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 58 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 59 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 60 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 61 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 62 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 t2 = t1 + 10*i + x for j = 1 to N a(i,j) ( ,j) = t1 + 10*i + j + x Saman Amarasinghe 63 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 t2 = t1 + 10*i + x for j = 1 to N a(i,j) ( ,j) = t2 + j Saman Amarasinghe 64 6.035 ©MIT Fall 1998 MIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.