Spring 2010 More Loop Optimizations 5 Outline • • • • Strength Reduction Loop Test Replacement Loop Invariant Code Motion SIMDization with SSE Saman Amarasinghe 2 6.035 ©MIT Fall 1998 Strength g Reduction • Replace p expensive p operations p in an expression p using cheaper ones – Not a data-flow problem p – Algebraic simplification – Example: a*4 a 4 a << 2 Saman Amarasinghe 3 6.035 ©MIT Fall 1998 Strength g Reduction • In loops p reduce expensive p operations p in expressions in to cheaper ones by using the ppreviously y calculated value Saman Amarasinghe 4 6.035 ©MIT Fall 1998 Strength g Reduction t = 202 for j = 1 to 100 t = t - 2 A(j) = t Saman Amarasinghe 5 6.035 ©MIT Fall 1998 Strength g Reduction t = 202 for j = 1 to 100 t = t - 2 *( b *(abase + 4*j) = t Saman Amarasinghe 6 6.035 ©MIT Fall 1998 Strength g Reduction t = 202 for j = 1 to 100 t = t - 2 *( b *(abase + 4*j) = t Basic Induction variable: J 1 2 2, = 1, Induction variable 200 - 2*j t = 202, 200, 3 3, 4 ….. 4, 198, 196, ….. Induction variable abase abase+4*j: 4 j: abase+4*j = abase+4, abase+8, abase+12, abase+14, …. Saman Amarasinghe 7 6.035 ©MIT Fall 1998 Strength g Reduction t = 202 for j = 1 to 100 t = t - 2 *( b *(abase + 4*j) 4*j) = t Basic Induction variable: J 22, = 11, 1 1 33, Induction variable 200 - 2*j t = 202, 200, 198, 1 44, ….. 196, ….. Induction variable abase+4*j: j: abase+4*j = abase+4, abase+8, abase+12, abase+14, …. Saman Amarasinghe 8 6.035 ©MIT Fall 1998 Strength g Reduction t = 202 for j = 1 to 100 t = t - 2 *( b *(abase + 4*j) 4*j) = t Basic Induction variable: J 1 2 2, = 1, 1 1 3 3, 1 4 ….. 4, Induction variable 200 - 2*j t = 202, 200, 198, 196, ….. -2 -2 -2 Induction variable abase abase+4*j: 4 j: abase+4*j = abase+4, abase+8, abase+12, abase+14, …. Saman Amarasinghe 9 6.035 ©MIT Fall 1998 Strength g Reduction t = 202 for j = 1 to 100 t = t - 2 *( b *(abase + 4*j) = t Basic Induction variable: J 1 2 2, = 1, 1 1 3 3, 1 4 ….. 4, Induction variable 200 - 2*j t = 202, 200, 198, 196, ….. -2 -2 -2 Induction variable abase abase+4*j: 4 j: abase+4*j = abase+4, abase+8, abase+12, abase+14, …. 4 4 4 Saman Amarasinghe 10 6.035 ©MIT Fall 1998 Strength g Reduction Algorithm g • For a dependent p induction variable k = a*jj + b for j = 1 to 100 *(abase + 4*j) = j Saman Amarasinghe 11 6.035 ©MIT Fall 1998 Strength g Reduction Algorithm g • For a dep pendent induction variable k = a*jj+ b • Add a pre-header k’ = a*jinit + b t = abase + 4*1 for j = 1 to 100 *(abase + 4*j) = j Saman Amarasinghe 12 6.035 ©MIT Fall 1998 Strength g Reduction Algorithm g • For a dependent p induction variable k = a*jj + b • Add a pre-header k’ = a*jinit + b • Next to j = j + c add kk’ = kk’ + a*c t = abase for j = 1 *(abase t = t + Saman Amarasinghe + 4*1 to 100 + 4*j) = j 4 13 6.035 ©MIT Fall 1998 Strength g Reduction Algorithm g • • • • For a dependent p induction variable k = a*jj + b Add a pre-header k’ = a*jinit + b Next to j = j + c add kk’ = kk’ + a*c Use k’ instead of k t = abase + 4*1 for j = 1 to 100 *(t) = j t = t + 4 Saman Amarasinghe 14 6.035 ©MIT Fall 1998 Example p double A[256], B[256][256] j = 1 while(j>100) A[j] = B[j][j] j = j + 2 Saman Amarasinghe 15 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 while(j>100) *(&A + 4*j) = *(&B + 4*(256*j + j)) j = j + 2 Saman Amarasinghe 16 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 while(j>100) *(&A + 4*j) = *(&B + 4*(256*j + j)) j = j + 2 Base Induction Variable: Saman Amarasinghe 17 j 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 while(j>100) *(&A + 4*j) = *(&B + 4*(256*j + j)) j = j + 2 Base Induction Variable: j D Dependent d Induction I d i Variable: V i bl a = &A + 4*j Saman Amarasinghe 18 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 while(j>100) *(&A + 4*j) = *(&B + 4*(256*j + j)) j = j + 2 Base Induction Variable: j D Dependent d Induction I d i Variable: V i bl a = &A + 4*j Saman Amarasinghe 19 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 while(j>100) *(&A + 4*j) = *(&B + 4*(256*j + j)) j = j + 2 a = a + 8 Base Induction Variable: j Dependent Ind Induction ction V Variable: a iable a = &A + 4*j Saman Amarasinghe 20 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 while(j>100) *a = *(&B + 4*(256*j + j)) j = j + 2 a = a + 8 Base Induction Variable: j Dependent Ind Induction ction V Variable: a iable a = &A + 4*j Saman Amarasinghe 21 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 while(j>100) *a = *(&B + 4*(256*j + j)) j = j + 2 a = a + 8 Base IInduction B d ti V Variable: i bl j Dependent Induction Variable: b = &B + 4*257*j Saman Amarasinghe 22 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 b = &B + 1028 while(j>100) *a = *(&B + 4*(256*j + j)) j = j + 2 a = a + 8 Base IInduction B d ti V Variable: i bl j Dependent Induction Variable: b = &B + 4*257*j Saman Amarasinghe 23 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 b = &B + 1028 while(j>100) *a = *(&B + 4*(256*j + j)) j = j + 2 a = a + 8 b = b + 2056 Base Induction Variable: j Dependent Induction Variable: b = &B + 4*257*j 4 257 j Saman Amarasinghe 24 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 b = &B + 1028 while(j>100) *a = *b j = j + 2 a = a + 8 b = b + 2056 Base Induction Variable: j Dependent Induction Variable: b = &B + 4*257*j 4 257 j Saman Amarasinghe 25 6.035 ©MIT Fall 2006 Example p double A[256], B[256][256] j = 1 a = &A + 4 b = &B + 1028 while(j>100) *a = *b j = j + 2 a = a + 8 b = b + 2056 Saman Amarasinghe 26 6.035 ©MIT Fall 2006 5 Outline • • • • Strength Reduction Loop Test Replacement Loop Invariant Code Motion SIMDization with SSE Saman Amarasinghe 27 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables Saman Amarasinghe 28 6.035 ©MIT Fall 1998 Loop p Test Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] j = 1 while(j>100) A[j] = B[j][j] Saman Amarasinghe 29 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] j = 1 a = &A + 4 b = &B + 1028 while(j>100) *a = *b j = j + 2 a = a + 8 b = b + 2056 Saman Amarasinghe 30 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] j = 1 a = &A + 4 • J is only used for the loop bound b = &B + 1028 while(j>100) while(j>100) *a = *b j = j + 2 a = a + 8 b = b + 2056 Saman Amarasinghe 31 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] j = 1 a = &A + 4 • J is only used for the loop bound b = &B + 1028 • Use a dependent IV (a or b) while(j>100) *a = *b j = j + 2 a = a + 8 b = b + 2056 Saman Amarasinghe 32 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] j = 1 a = &A + 4 • J is only used for the loop bound b = &B + 1028 • Use a dependent IV (a or b) while(j>100) • Lets choose a *a = *b j = j + 2 a = a + 8 b = b + 2056 Saman Amarasinghe 33 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] j = 1 a = &A + 4 • J is only used for the loop bound b = &B + 1028 • Use a dependent IV (a or b) while(j>100) • Lets choose a *a = *b j = j + 2 j > 100 a > &A + 800 a = a + 8 b = b + 2056 Saman Amarasinghe 34 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] j = 1 a = &A + 4 • J is only used for the loop bound b = &B + 1028 while(a>&A+800) • Use a dependent IV (a or b) • Lets choose a *a = *b j = j + 2 j > 100 a > &A + 800 a = a + 8 • Replace the loop condition b = b + 2056 Saman Amarasinghe 35 6.035 ©MIT Fall 1998 Loop pTest Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] a = &A + 4 b = &B + 1028 while(a>&A+800) *a = *b • J is only used for the loop bound • Use a dependent IV (a or b) • Lets choose a j > 100 a > &A + 800 a = a + 8 • Replace the loop condition b = b + 2056 • Get rid of j Saman Amarasinghe 36 6.035 ©MIT Fall 1998 Loop p Test Replacement p • Eliminate basic induction variable used only y for calculating other induction variables double A[256], B[256][256] a = &A + 4 b = &B + 1028 while(a>&A+800) *a = *b a = a + 8 b = b + 2056 Saman Amarasinghe 37 6.035 ©MIT Fall 1998 Loop p Test Replacement p Algorithm g • If basic induction variable J is only used for calculating other induction variables • Select an induction variable k in the family of J (K = a*J + b) • Replace a comparison such as if (J > X) goto L1 by if(K’ > a*X + b) goto L1 if(K’ < a*X + b) goto L1 if a is positive if a is negative • If J is live at any exit from loop, recompute J = (K’ - b)/a Saman Amarasinghe 38 6.035 ©MIT Fall 1998 5 Outline • • • • Strength Reduction Loop Test Replacement Loop Invariant Code Motion SIMDization with SSE Saman Amarasinghe 39 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop • Same idea as with induction variables – Variables not updated in the loop are loop invariant – Expressions of loop invariant variables are loop invariant – Variables assigned only loop invariant expressions are loop invariant Saman Amarasinghe 40 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 41 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 42 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = 100*N + 10*i + j + x Saman Amarasinghe 43 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 44 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 45 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 for j = 1 to N a(i,j) = t1 + 10*i + j + x Saman Amarasinghe 46 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a comp putation produces the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 t2 = t1 + 10*i + x for j = 1 to N a(i,j) ( ,j) = t1 + 10*i + j + x Saman Amarasinghe 47 6.035 ©MIT Fall 1998 Loop p Invariant Code Motion • If a computation p produces p the same value in every loop iteration, move it out of the loop t1 = 100*N for i = 1 to N x = x + 1 t2 = t1 + 10*i + x for j = 1 to N a(i,j) ( ,j) = t2 + j Saman Amarasinghe 48 6.035 ©MIT Fall 1998 5 Outline • • • • Strength Reduction Loop Test Replacement Loop Invariant Code Motion SIMDization with SSE Saman Amarasinghe 49 6.035 ©MIT Fall 1998 SIMD Through g SSE extensions • Single g Instruction Multiple p Data – Compute multiple identical operations in a single instruction – Exploit fine grained parallelism Saman Amarasinghe 50 6.035 ©MIT Fall 1998 SSE Registers g • 16 128-bit registers: g %xmm0 to %xmm16 – Multiple interpretations for each register – Each arithmetic operation p comes in multiple p versions 128 bit Double Quadword 64 bit Quadword 32 bit Doubleword 64 bit Quadword 32 bit Doubleword 16 bit word 16 bit word 16 bit word 16 bit word Saman Amarasinghe 51 32 bit Doubleword 32 bit Doubleword 16 bit word 16 bit word 16 bit word 16 bit word 6.035 ©MIT Fall 1998 Data Transfer • Moving Data From Memory or xmm registers – MOVDQA OP1, OP2 Move aligned Double Quadword • Can read or write to memory in 128 bit chunks • If OP1 or OP2 are registers, registers they must be xmm registers • Memory locations in OP1 or OP2 must be multiples of 16 – MOVDQU OP1, OP2 Move unaligned Double Quadword • Same as MOVDQA but • memory addresses don’t have to be multiples of 16 Saman Amarasinghe 52 6.035 ©MIT Fall 1998 Data Transfer • Moving Data From 64-bit registers – MOVQ OP1, OP2 Move Double Quadword • Can move from 64 bit register to xmm register or viceversa • Writes to/Reads from the lower 64 bits of xmm register • Can also be used to read a 64-bit chunk to/from memory MOVQ %r11, %xmm1 %r11 %xmm1 128 Saman Amarasinghe 53 0 6.035 ©MIT Fall 1998 Data Reorderingg • Unpack p and Interleave – PUNPCKLDQ Low Doublewords 0 128 0 0 128 Saman Amarasinghe 128 54 6.035 ©MIT Fall 1998 Data Reorderingg • Unpack p and Interleave – PUNPCKLQDQ Low Quadwords 0 128 0 0 128 Saman Amarasinghe 128 55 6.035 ©MIT Fall 1998 Arithmetic • Arithmetic operations come in many flavors – based on the datatype of the register – specified in the instruction sufix • Example: Addition – PADDQ – PADDD – PADDW Add 64-bit Quadwords Add 32 32-bit bi Doublewords D bl d Add 16-bit words • Example: E ample: Subtraction S btraction – PSUBQ – PSUBD – PSUBW Saman Amarasinghe Subtract 64-bit Quadwords Subtract 32 32-bit -bit Doublewords Subtract 16-bit words 56 6.035 ©MIT Fall 1998 Putting g It All Together g • Source Code for i = 1 to N A[i] = A[i] * b • After Unrolling loop: mov mov imul imul mov sub mov sub jz Saman Amarasinghe (%rdi,%rax), %r10 (%rdi,%rbx), %rcx %r11, %r10 %r11, %rcx %r10, (%rdi,%rax) $8, %rax %rcx, (%rdi,%rbx) $8, %rbx loop 57 Readingg from consecutive addresses Mult by a loop invariant Writing to consecutive addresses 6.035 ©MIT Fall 1998 Putting g it all together g Original Version loop: p mov mov imul imul mov sub mov sub jz Saman Amarasinghe SSE Version (%rdi,%rax), %r10 (%rdi,%rbx), %rcx %r11, %r10 %r11, %rcx %r11 %rcx %r10, (%rdi,%rax) $8, %rax %rcx, (%rdi,%rbx) $8, %rbx loop 58 movq punpckldq loop: movdqa d pmuludq movdqa sub jz %r11, %xmm2 %xmm2, %xmm2 (% di,%rax), (%rdi % ) % %xmm0 0 %xmm2, %xmm0 %xmm0, (%rdi, %rax) $8, %rax loop 6.035 ©MIT Fall 1998 Putting g it all together g SSE Version movq punpckldq loop: movdqa pmuludq movdqa sub jz Saman Amarasinghe %r11, %xmm2 %xmm2, %xmm2 Populate xmm2 with copies of %r11 (%rdi,%rax), %xmm0 %xmm2, %xmm0 %xmm0, (%rdi, %rax) $8, % $8, %rax a Only one loop 59 index is needed 6.035 ©MIT Fall 1998 Conditions for SIMDization • Consecutive iterations reading and writing from consecutive locations • Consecutive iterations are independent of each other • The easiest thing is to pattern match at the basic block level after unrolling loops Saman Amarasinghe 60 6.035 ©MIT Fall 1998 MIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.