Loop Optimizations Spring 2010 Instruction Scheduling

Spring 2010 Loop Optimizations Instruction Scheduling 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 2 6.035 ©MIT Fall 1998 Scheduling g Loops p • Loop p bodies are small • But, lot of time is spend in loops due to large number of iterations • Need better ways to schedule loops Saman Amarasinghe 3 6.035 ©MIT Fall 1998 Loop p Example p • Machine – One load/store unit • load 2 cycles • store 2 cycles – Two arithmetic units • add 2 cycles • branch 2 cycles • multiply 3 cycles – Both units are pipelined (initiate one op each cycle) • Source Code for i = 1 to N A[i] = A[i] * b Saman Amarasinghe 4 6.035 ©MIT Fall 1998 Loop p Example p • Source Code for i = 1 to N A[i] = A[i] * b offset ff • Assembly Code loop: mov imul mov sub jz Saman Amarasinghe base (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 5 6.035 ©MIT Fall 1998 Loopp Example p mov d=7 2 • Assembly Code loop: mov imul mov sub jz imul d=5 3 (%rdi,%rax), %r10 %r11, , %r10 %r10, (%rdi,%rax) $4, %rax loop mov 0 sub d=2 2 • Schedule (9 cycles per iteration) mov d=2 jz d=0 mov mov mov imul bge imul bge imul sub sub Saman Amarasinghe 6 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 7 6.035 ©MIT Fall 1998 Loop p Unrollingg • Unroll the loop pbod yfew times • Pros: – Create a much larger basic block for the body body – Eliminate few loop bounds checks • Cons: – Much larger program – Setup codde (# off iiterations i < unroll ll factor) f ) – beginning and end of the schedule can still have unusedd slots l t Saman Amarasinghe 8 6.035 ©MIT Fall 1998 loop: mov imul mov sub jz Saman Amarasinghe Loop p Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 9 6.035 ©MIT Fall 1998 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loop p Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 10 6.035 ©MIT Fall 1998 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Loopp Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 3 0 2 2 3 0 2 • Schedule (8 cycles per iteration) mov mov mov mov mov mov mov imul mov imul imul bge imul imul bge imul sub sub sub Saman Amarasinghe sub 11 6.035 ©MIT Fall 1998 Loop p Unrollingg • Rename registers g – Use different registers in different iterations Saman Amarasinghe 12 6.035 ©MIT Fall 1998 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop 13 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d 2 d=2 jz d=0 2 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %rcx %r11, %rcx %rcx, (%rdi,%rax) $4, %rax loop 14 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d 2 d=2 jz d=0 2 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 Loop pUnrollingg • Rename reg gisters – Use different registers in different iterations • Eliminate unnecessary dependencies – again again, use more registers to eliminate true, true anti and output dependencies – eliminate dependent dependent-chains chains of calculations when when possible Saman Amarasinghe 15 6.035 ©MIT Fall 1998 mov d=14 mul d=12 mov d=9 sub d=9 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $4, %rax (%rdi,%rax), %rcx %r11, %rcx %rcx, (%rdi,%rax) $4, %rax loop 16 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 mov d=5 mul d=3 mov d=0 sub d=0 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Saman Amarasinghe Loopp Example p (%rdi,%rax), %r10 %r11, %r10 % 10 (% di % ) %r10, (%rdi,%rax) $8, %rax (%rdi,%rbx), %rcx %r11, %rcx %rcx, (%rdi,%rbx) $8, %rbx loop 17 3 0 2 2 3 0 2 6.035 ©MIT Fall 1998 mov d=5 mul d=3 mov d=0 sub d=0 mov d=7 mul d=5 mov d=2 sub d=2 jz d=0 2 loop: mov imul mov sub mov imul mov sub jz Loopp Example p (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $8, %rax (%rdi,%rbx), %rcx %r11, %rcx %rcx, (%rdi,%rbx) $8, %rbx loop • S Schedule h d l (4.5 (4 5 cycles l per it iteration ti mov mov mov mov mov imul imul 2 2 3 0 2 mov jz imul imul jz imul sub sub sub Saman Amarasinghe 0 mov mov imul 3 sub 18 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler loop invariiant codde motiion Induction Variable Recognition Saman Amarasinghe 19 6.035 ©MIT Fall 1998 Software Pipelining p g • Try yto overla pmulti ple iterations so that the slots will be filled • Find the steady steady-state state window so that: – all the instructions of the loop body is executed – but from different iterations Saman Amarasinghe 20 6.035 ©MIT Fall 1998 Loop p Example p • Assembly Code loop: mov imul mov sub jz j (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop p • Schedule mov mov mov mov mul jz mul jz mul sub sub Saman Amarasinghe 21 6.035 ©MIT Fall 1998 Loopp Example p • Assembly Code loop: mov imul mov sub jz j (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop p • Schedule mov mov1 mov mul Saman Amarasinghe mov2 mov mov1 mov2 1 2 mul1 mul mul1 mul sub mov3 mov1 mov4 mov2 mov mov3 3 mov1 1 mov4 4 mul2 jz mul3 jz1 mul2 jz mul3 mul1 mul2 sub1 sub2 sub sub1 22 mov5 mov2 2 mul4 jz1 mul3 mov3 ld5 jz2 mul4 mov6 mov3 3 mul5 jz2 mul4 sub3 sub2 6.035 sub3 ©MIT Fall 1998 Loopp Example p • Assembly Code loop: mov imul mov sub jz j (%rdi,%rax), %r10 %r11, %r10 %r10, (%rdi,%rax) $4, %rax loop p • Schedule (2 cycles per iteration) mov mov1 mov mul Saman Amarasinghe mov2 mov 1 2 mov1 mov2 mul1 mul mul1 mul sub mov3 mov1 mov4 mov2 3 mov1 1 mov4 4 mov mov3 mul2 jz mul3 jz1 mul2 jz mul3 mul1 mul2 sub1 sub2 sub sub1 23 mov5 2 mov2 mul4 jz1 mul3 mov3 ld5 jz2 mul4 mov6 3 mov3 mul5 jz2 mul4 sub3 sub2 6.035 sub3 ©MIT Fall 1998 Loopp Example p • 4 iterations are overlapped – value of %r11 don’t change – 4 regs for (%rdi,%rax) – each addr. incremented by 4*4 – 4 regs to keep value %r10 – Same registers can be reused after 4 of these blocks ggenerate code for 4 blocks,, otherwise need to move Saman Amarasinghe 24 mov4 mov1 mul3 jz mul2 mov2 mov4 jz1 mul3 sub2 sub1 loop: mov imul mov sub jz (%rdi,%rax), %r10 %r11 %r10 %r11, %r10, (%rdi,%rax) $4, %rax loop 6.035 ©MIT Fall 1998 Software Pipelining p g • Optimal use of resources • Need N d a llot off registers i – Values in multiple iterations need to be kept • Issues in dependencies – Executing a store instruction in an iteration before branch instruction is executed for a previous iteration (writing when it should not have) – Loads and stores are issued out-of-order (need to figure-out dependencies before doing this) • Code generation issues – Generate pre-amble and post-amble code – Multiple blocks so no register copy is needed Saman Amarasinghe 25 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 26 6.035 ©MIT Fall 1998 Register Allocation and d Instruction I t ti Scheduling S h d li • If reg gister allocation is before instruction scheduling –restricts the choices for scheduling g Saman Amarasinghe 27 6.035 ©MIT Fall 1998 Example p 1: 2 2: 3: 4: mov add dd mov add Saman Amarasinghe 4(%rbp), %rax % %rax, % %rbx b 8(%rbp), %rax %rax, %rcx 28 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 1 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %rax %rax, %rcx 3 2 1 1 1 1 3 3 4 Saman Amarasinghe 29 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 1 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %rax %rax, %rcx 3 2 1 1 1 1 3 3 ALUop MEM 1 2 1 MEM 2 Saman Amarasinghe 4 4 3 1 3 30 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 1 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %rax %rax, %rcx 3 2 1 1 1 Anti-dependence How about a different register? 1 3 3 4 Saman Amarasinghe 31 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %r10 %r10, %rcx Anti-dependence How about a different register? 1 3 2 3 3 4 Saman Amarasinghe 32 6.035 ©MIT Fall 1998 Example p 1: 2: 2 3: 4: mov add dd mov add 4(%rbp), %rax %rax, %rbx % % b 8(%rbp), %r10 %r10, %rcx 1 3 2 3 3 ALUop MEM 1 2 1 MEM 2 Saman Amarasinghe 3 1 4 4 3 33 6.035 ©MIT Fall 1998 Register Allocation and d Instruction I t ti Scheduling S h d li • If reg gister allocation is before instruction scheduling –restricts the choices for scheduling g Saman Amarasinghe 34 6.035 ©MIT Fall 1998 Register Allocation andd Inst I tructi tion Schhed duli ling • If reg gister allocation is before instruction scheduling – restricts the choices for scheduling g • If instruction scheduling before register register allocation – Register allocation may spill registers – Will change the carefully done schedule!!! Saman Amarasinghe 35 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii loop invariant code motion Saman Amarasinghe 36 6.035 ©MIT Fall 1998 Superscalar: Where have all the t transistors i t gone? ? • Out of order execution – If an instruction stalls, go beyond that and start executing non-dependent instructions – Pros: • Hardware scheduling • Tolerates unpredictable latencies – Cons: • Instruction window is small Saman Amarasinghe 37 6.035 ©MIT Fall 1998 Superscalar: Where have all the t transistors i t gone? ? • Reg gister renaming g – If there is an anti or output dependency of a register that stalls the pipeline, use a different hardware register – Pros: • Avoids anti and output dependencies – Cons: • Cannot do more complex transformations to eliminate dependencies Saman Amarasinghe 38 6.035 ©MIT Fall 1998 Hardware vs. Compiler p • In a superscalar, hardware and compiler scheduling can work hand-in-hand • Hardware can reduce the burden when not predictable by the compiler • Compiler can still greatly enhance the performance – Large instruction window for scheduling – Many program transformations that increase parallelism • C Compiler il is i even more critical iti l when h no hardware h d support – VLIW machines (Itanium (Itanium, DSPs) Saman Amarasinghe 39 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d ion V Inducti Variiable bl Recogniitiion loop invariant code motion Saman Amarasinghe 40 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(i) = 0 i = i - 1 Saman Amarasinghe 41 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(i) = 0 i = i - 1 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. Saman Amarasinghe 42 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(i) = 0 i = i - 1 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 43 6.035 ©MIT Fall 1998 Induction Variables • Example p i = 200 for j = 1 to 100 a(201 - j) = 0 i = i - 1 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 44 6.035 ©MIT Fall 1998 Induction Variables • Example p for j = 1 to 100 a(201 - j) = 0 Basic Induction variable: J = 1, 1 2 2, 3 3, 4 ….. 4, Index Variable i in a(i): I = 200, 199, 198, 197…. = 201 - J Saman Amarasinghe 45 6.035 ©MIT Fall 1998 What are induction variables? • x is an induction variable of a loop p L if – variable changes its value every iteration of the loop – the value is a function of number of iterations of the loop • In compilers this function is normally a linear function – Example: for loop index variable j, function c*j + d Saman Amarasinghe 46 6.035 ©MIT Fall 1998 What can we do with induction variables? i bl ? • Use them to perform strength reduction • Get rid of them Saman Amarasinghe 47 6.035 ©MIT Fall 1998 Classification of induction variables • Basic induction variables – Explicitly modified by the same constant amount once during each iteration of the loop – Example: loop index variable • Dependent induction variables – Can be expressed in the form: aa*xx + b where a and be are loop invariant and x is an induction variable – Example: 202 - 2 2*j j Saman Amarasinghe 48 6.035 ©MIT Fall 1998 Classification of induction variables • Class of induction variables: All induction variables with same basic variable in their q linear equations • Basis of a class: the basic variable that determines that class Saman Amarasinghe 49 6.035 ©MIT Fall 1998 Finding g Basic Induction Variables • Look inside loop p nodes • Find variables whose only modification is of the form j = j + d where d is a loop constant Saman Amarasinghe 50 6.035 ©MIT Fall 1998 Finding g Dependent p Induction Variables • Find all the basic induction variables • Search variable k with a single assignment in the loop • Variable assignments of the form k = e op j or k = -j where j is an induction variable and e is loop invariant Saman Amarasinghe 51 6.035 ©MIT Fall 1998 Finding g Dependent p Induction Variables • Example for i = 1 to 100 j = i*c k = j+1 Saman Amarasinghe 52 6.035 ©MIT Fall 1998 A special p case t = 202 for j = 1 to 100 t = t - 2 j = t a(j) t = t - 2 (j) = t b(j) Saman Amarasinghe 53 6.035 ©MIT Fall 1998 A special p case u1 = 200 u2 = 202 for j = 1 to 100 u1 = u1 - 4 a(j) = u1 u2 = u2 - 4 b(j) = u2 t = 202 for j = 1 to 100 t = t - 2 j = t a(j) t = t - 2 (j) = t b(j) Saman Amarasinghe 54 6.035 ©MIT Fall 1998 5 Outline • • • • • • • Scheduling for loops Loop unrolling Software pipelining Interaction with register allocation Hardware vs. Compiler I d i Variable Induction V i bl Recognition R ii Loop invariant code motion Saman Amarasinghe 55 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop Saman Amarasinghe 56 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 57 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 58 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 59 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 60 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 61 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 62 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 t2 = t1 + 10*i + x for j = 1 to N a(i,j) ( ,j) = t1 + 10*i + j + x Saman Amarasinghe 63 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 t2 = t1 + 10*i + x for j = 1 to N a(i,j) ( ,j) = t2 + j Saman Amarasinghe 64 6.035 ©MIT Fall 1998 MIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Loop Optimizations Spring 2010 Instruction Scheduling

Related documents

Products

Support

Loop Optimizations Spring 2010 Instruction Scheduling

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib