# Loop Optimizations Spring 2010 Instruction Scheduling

```Spring 2010
Loop Optimizations
Instruction Scheduling
5
Outline
•
•
•
•
•
•
•
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler
I d i Variable
Induction
V i bl Recognition
R
ii
loop invariant code motion
Saman Amarasinghe
2
6.035
&copy;MIT Fall 1998
Scheduling
g Loops
p
• Loop
p bodies are small
• But, lot of time is spend in loops due to large
number of iterations
• Need better ways to schedule loops
Saman Amarasinghe
3
6.035
&copy;MIT Fall 1998
Loop
p Example
p
• Machine
• store 2 cycles
– Two arithmetic units
• branch 2 cycles
• multiply 3 cycles
– Both units are pipelined (initiate one op each cycle)
• Source Code
for i = 1 to N
A[i] = A[i] * b
Saman Amarasinghe
4
6.035
&copy;MIT Fall 1998
Loop
p Example
p
• Source Code
for i = 1 to N
A[i] = A[i] * b
offset
ff
• Assembly Code
loop:
mov
imul
mov
sub
jz
Saman Amarasinghe
base
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
5
6.035
&copy;MIT Fall 1998
Loopp Example
p
mov
d=7
2
• Assembly Code
loop:
mov
imul
mov
sub
jz
imul
d=5
3
(%rdi,%rax), %r10
%r11,
, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
mov
0
sub
d=2
2
• Schedule (9 cycles per iteration)
mov
d=2
jz
d=0
mov
mov
mov
imul
bge
imul
bge
imul
sub
sub
Saman Amarasinghe
6
6.035
&copy;MIT Fall 1998
5
Outline
•
•
•
•
•
•
•
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler
I d i Variable
Induction
V i bl Recognition
R
ii
loop invariant code motion
Saman Amarasinghe
7
6.035
&copy;MIT Fall 1998
Loop
p Unrollingg
• Unroll the loop
pbod yfew times
• Pros:
– Create a much larger basic block for the body
body
– Eliminate few loop bounds checks
• Cons:
– Much larger program
– Setup codde (# off iiterations
i
&lt; unroll
ll factor)
f
)
– beginning and end of the schedule can still have
unusedd slots
l t
Saman Amarasinghe
8
6.035
&copy;MIT Fall 1998
loop:
mov
imul
mov
sub
jz
Saman Amarasinghe
Loop
p Example
p
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
9
6.035
&copy;MIT Fall 1998
loop:
mov
imul
mov
sub
mov
imul
mov
sub
jz
Saman Amarasinghe
Loop
p Example
p
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
10
6.035
&copy;MIT Fall 1998
mov
d=14
mul
d=12
mov
d=9
sub
d=9
mov
d=7
mul
d=5
mov
d=2
sub
d=2
jz
d=0
2
loop:
mov
imul
mov
sub
mov
imul
mov
sub
jz
Loopp Example
p
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
3
0
2
2
3
0
2
• Schedule (8 cycles per iteration)
mov
mov
mov
mov
mov
mov
mov
imul
mov
imul
imul
bge
imul
imul
bge
imul
sub
sub
sub
Saman Amarasinghe
sub
11
6.035
&copy;MIT Fall 1998
Loop
p Unrollingg
• Rename registers
g
– Use different registers in different iterations
Saman Amarasinghe
12
6.035
&copy;MIT Fall 1998
loop:
mov
imul
mov
sub
mov
imul
mov
sub
jz
Saman Amarasinghe
Loopp Example
p
(%rdi,%rax), %r10
%r11, %r10
%
10 (%
di %
)
%r10,
(%rdi,%rax)
\$4, %rax
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
13
mov
d=14
mul
d=12
mov
d=9
sub
d=9
mov
d=7
mul
d=5
mov
d=2
sub
d 2
d=2
jz
d=0
2
3
0
2
2
3
0
2
6.035
&copy;MIT Fall 1998
loop:
mov
imul
mov
sub
mov
imul
mov
sub
jz
Saman Amarasinghe
Loopp Example
p
(%rdi,%rax), %r10
%r11, %r10
%
10 (%
di %
)
%r10,
(%rdi,%rax)
\$4, %rax
(%rdi,%rax), %rcx
%r11, %rcx
%rcx, (%rdi,%rax)
\$4, %rax
loop
14
mov
d=14
mul
d=12
mov
d=9
sub
d=9
mov
d=7
mul
d=5
mov
d=2
sub
d 2
d=2
jz
d=0
2
3
0
2
2
3
0
2
6.035
&copy;MIT Fall 1998
Loop
pUnrollingg
• Rename reg
gisters
– Use different registers in different iterations
• Eliminate unnecessary dependencies
– again
again, use more registers to eliminate true,
true anti and
output dependencies
– eliminate dependent
dependent-chains
chains of calculations when
when
possible
Saman Amarasinghe
15
6.035
&copy;MIT Fall 1998
mov
d=14
mul
d=12
mov
d=9
sub
d=9
mov
d=7
mul
d=5
mov
d=2
sub
d=2
jz
d=0
2
loop:
mov
imul
mov
sub
mov
imul
mov
sub
jz
Saman Amarasinghe
Loopp Example
p
(%rdi,%rax), %r10
%r11, %r10
% 10 (%
di %
)
%r10,
(%rdi,%rax)
\$4, %rax
(%rdi,%rax), %rcx
%r11, %rcx
%rcx, (%rdi,%rax)
\$4, %rax
loop
16
3
0
2
2
3
0
2
6.035
&copy;MIT Fall 1998
mov
d=5
mul
d=3
mov
d=0
sub
d=0
mov
d=7
mul
d=5
mov
d=2
sub
d=2
jz
d=0
2
loop:
mov
imul
mov
sub
mov
imul
mov
sub
jz
Saman Amarasinghe
Loopp Example
p
(%rdi,%rax), %r10
%r11, %r10
% 10 (%
di %
)
%r10,
(%rdi,%rax)
\$8, %rax
(%rdi,%rbx), %rcx
%r11, %rcx
%rcx, (%rdi,%rbx)
\$8, %rbx
loop
17
3
0
2
2
3
0
2
6.035
&copy;MIT Fall 1998
mov
d=5
mul
d=3
mov
d=0
sub
d=0
mov
d=7
mul
d=5
mov
d=2
sub
d=2
jz
d=0
2
loop:
mov
imul
mov
sub
mov
imul
mov
sub
jz
Loopp Example
p
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$8, %rax
(%rdi,%rbx), %rcx
%r11, %rcx
%rcx, (%rdi,%rbx)
\$8, %rbx
loop
• S
Schedule
h d l (4.5
(4 5 cycles
l per it
iteration
ti
mov
mov
mov
mov
mov
imul
imul
2
2
3
0
2
mov
jz
imul
imul
jz
imul
sub
sub
sub
Saman Amarasinghe
0
mov
mov
imul
3
sub
18
6.035
&copy;MIT Fall 1998
5
Outline
•
•
•
•
•
•
•
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler
loop invariiant codde motiion
Induction Variable Recognition
Saman Amarasinghe
19
6.035
&copy;MIT Fall 1998
Software Pipelining
p
g
• Try
yto overla pmulti ple iterations so that the
slots will be filled
state window so that:
– all the instructions of the loop body is executed
– but from different iterations
Saman Amarasinghe
20
6.035
&copy;MIT Fall 1998
Loop
p Example
p
• Assembly Code
loop:
mov
imul
mov
sub
jz
j
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
p
• Schedule
mov
mov
mov
mov
mul
jz
mul
jz
mul
sub
sub
Saman Amarasinghe
21
6.035
&copy;MIT Fall 1998
Loopp Example
p
• Assembly Code
loop:
mov
imul
mov
sub
jz
j
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
p
• Schedule
mov
mov1
mov
mul
Saman Amarasinghe
mov2 mov
mov1
mov2
1
2
mul1
mul
mul1
mul
sub
mov3 mov1 mov4 mov2
mov mov3
3 mov1
1 mov4
4
mul2 jz
mul3 jz1
mul2 jz
mul3
mul1
mul2
sub1
sub2
sub
sub1
22
mov5
mov2
2
mul4
jz1
mul3
mov3
ld5
jz2
mul4
mov6
mov3
3
mul5
jz2
mul4
sub3
sub2
6.035
sub3
&copy;MIT Fall 1998
Loopp Example
p
• Assembly Code
loop:
mov
imul
mov
sub
jz
j
(%rdi,%rax), %r10
%r11, %r10
%r10, (%rdi,%rax)
\$4, %rax
loop
p
• Schedule (2 cycles per iteration)
mov
mov1
mov
mul
Saman Amarasinghe
mov2 mov
1
2
mov1
mov2
mul1
mul
mul1
mul
sub
mov3 mov1 mov4 mov2
3 mov1
1 mov4
4
mov mov3
mul2 jz
mul3 jz1
mul2 jz
mul3
mul1
mul2
sub1
sub2
sub
sub1
23
mov5
2
mov2
mul4
jz1
mul3
mov3
ld5
jz2
mul4
mov6
3
mov3
mul5
jz2
mul4
sub3
sub2
6.035
sub3
&copy;MIT Fall 1998
Loopp Example
p
• 4 iterations are overlapped
– value of %r11 don’t change
– 4 regs for (%rdi,%rax)
– each addr. incremented by 4*4
– 4 regs to keep value %r10
– Same registers can be reused
after 4 of these blocks
ggenerate code for 4 blocks,,
otherwise need to move
Saman Amarasinghe
24
mov4
mov1
mul3
jz
mul2
mov2
mov4
jz1
mul3
sub2
sub1
loop:
mov
imul
mov
sub
jz
(%rdi,%rax), %r10
%r11 %r10
%r11,
%r10, (%rdi,%rax)
\$4, %rax
loop
6.035
&copy;MIT Fall 1998
Software Pipelining
p
g
• Optimal use of resources
• Need
N d a llot off registers
i
– Values in multiple iterations need to be kept
• Issues in dependencies
– Executing a store instruction in an iteration before branch
instruction is executed for a previous iteration (writing when
it should not have)
– Loads and stores are issued out-of-order (need to figure-out
dependencies before doing this)
• Code generation issues
– Generate pre-amble and post-amble code
– Multiple blocks so no register copy is needed
Saman Amarasinghe
25
6.035
&copy;MIT Fall 1998
5
Outline
•
•
•
•
•
•
•
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler
I d i Variable
Induction
V i bl Recognition
R
ii
loop invariant code motion
Saman Amarasinghe
26
6.035
&copy;MIT Fall 1998
Register Allocation
and
d Instruction
I t ti Scheduling
S h d li
• If reg
gister allocation is before instruction scheduling
–restricts the choices for scheduling
g
Saman Amarasinghe
27
6.035
&copy;MIT Fall 1998
Example
p
1:
2
2:
3:
4:
mov
dd
mov
Saman Amarasinghe
4(%rbp), %rax
%
%rax,
%
%rbx
b
8(%rbp), %rax
%rax, %rcx
28
6.035
&copy;MIT Fall 1998
Example
p
1:
2:
2
3:
4:
mov
dd
mov
1
4(%rbp), %rax
%rax,
%rbx
%
%
b
8(%rbp), %rax
%rax, %rcx
3
2
1
1
1
1
3
3
4
Saman Amarasinghe
29
6.035
&copy;MIT Fall 1998
Example
p
1:
2:
2
3:
4:
mov
dd
mov
1
4(%rbp), %rax
%rax,
%rbx
%
%
b
8(%rbp), %rax
%rax, %rcx
3
2
1
1
1
1
3
3
ALUop
MEM 1
2
1
MEM 2
Saman Amarasinghe
4
4
3
1
3
30
6.035
&copy;MIT Fall 1998
Example
p
1:
2:
2
3:
4:
mov
dd
mov
1
4(%rbp), %rax
%rax,
%rbx
%
% b
8(%rbp), %rax
%rax, %rcx
3
2
1
1
1
Anti-dependence
1
3
3
4
Saman Amarasinghe
31
6.035
&copy;MIT Fall 1998
Example
p
1:
2:
2
3:
4:
mov
dd
mov
4(%rbp), %rax
%rax,
%rbx
%
%
b
8(%rbp), %r10
%r10, %rcx
Anti-dependence
1
3
2
3
3
4
Saman Amarasinghe
32
6.035
&copy;MIT Fall 1998
Example
p
1:
2:
2
3:
4:
mov
dd
mov
4(%rbp), %rax
%rax,
%rbx
%
%
b
8(%rbp), %r10
%r10, %rcx
1
3
2
3
3
ALUop
MEM 1
2
1
MEM 2
Saman Amarasinghe
3
1
4
4
3
33
6.035
&copy;MIT Fall 1998
Register Allocation
and
d Instruction
I t ti Scheduling
S h d li
• If reg
gister allocation is before instruction scheduling
–restricts the choices for scheduling
g
Saman Amarasinghe
34
6.035
&copy;MIT Fall 1998
Register Allocation andd Inst
I tructi
tion Schhed
duli
ling
• If reg
gister allocation is before instruction
scheduling
– restricts the choices for scheduling
g
• If instruction scheduling before register
register
allocation
– Register allocation may spill registers
– Will change the carefully done schedule!!!
Saman Amarasinghe
35
6.035
&copy;MIT Fall 1998
5
Outline
•
•
•
•
•
•
•
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler
I d i Variable
Induction
V i bl Recognition
R
ii
loop invariant code motion
Saman Amarasinghe
36
6.035
&copy;MIT Fall 1998
Superscalar: Where have all the
t
transistors
i t gone?
?
• Out of order execution
– If an instruction stalls, go beyond that and start
executing non-dependent instructions
– Pros: • Hardware scheduling
• Tolerates unpredictable latencies
– Cons:
• Instruction window is small
Saman Amarasinghe
37
6.035
&copy;MIT Fall 1998
Superscalar: Where have all the
t
transistors
i t gone?
?
• Reg
gister renaming
g
– If there is an anti or output dependency of a register
that stalls the pipeline, use a different hardware
register
– Pros:
• Avoids anti and output dependencies
– Cons:
• Cannot do more complex transformations to eliminate
dependencies
Saman Amarasinghe
38
6.035
&copy;MIT Fall 1998
Hardware vs. Compiler
p
• In a superscalar, hardware and compiler scheduling
can work hand-in-hand
• Hardware can reduce the burden when not predictable
by the compiler
• Compiler can still greatly enhance the performance
– Large instruction window for scheduling
– Many program transformations that increase parallelism
• C
Compiler
il is
i even more critical
iti l when
h no hardware
h d
support
– VLIW machines (Itanium
(Itanium, DSPs)
Saman Amarasinghe
39
6.035
&copy;MIT Fall 1998
5
Outline
•
•
•
•
•
•
•
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler
I d ion V
Inducti
Variiable
bl Recogniitiion
loop invariant code motion
Saman Amarasinghe
40
6.035
&copy;MIT Fall 1998
Induction Variables
• Example
p
i = 200
for j = 1 to 100
a(i) = 0
i = i - 1
Saman Amarasinghe
41
6.035
&copy;MIT Fall 1998
Induction Variables
• Example
p
i = 200
for j = 1 to 100
a(i) = 0
i = i - 1
Basic Induction variable:
J
= 1,
1
2
2,
3
3,
4 …..
4,
Index Variable i in a(i):
I
= 200, 199, 198,
197….
Saman Amarasinghe
42
6.035
&copy;MIT Fall 1998
Induction Variables
• Example
p
i = 200
for j = 1 to 100
a(i) = 0
i = i - 1
Basic Induction variable:
J
= 1,
1
2
2,
3
3,
4 …..
4,
Index Variable i in a(i):
I
= 200, 199, 198,
197…. = 201 - J
Saman Amarasinghe
43
6.035
&copy;MIT Fall 1998
Induction Variables
• Example
p
i = 200
for j = 1 to 100
a(201 - j) = 0
i = i - 1
Basic Induction variable:
J
= 1,
1
2
2,
3
3,
4 …..
4,
Index Variable i in a(i):
I
= 200, 199, 198,
197…. = 201 - J
Saman Amarasinghe
44
6.035
&copy;MIT Fall 1998
Induction Variables
• Example
p
for j = 1 to 100
a(201 - j) = 0
Basic Induction variable:
J
= 1,
1
2
2,
3
3,
4 …..
4,
Index Variable i in a(i):
I
= 200, 199, 198,
197…. = 201 - J
Saman Amarasinghe
45
6.035
&copy;MIT Fall 1998
What are induction variables?
• x is an induction variable of a loop
p L if
– variable changes its value every iteration of the loop
– the value is a function of number of iterations of
the loop
• In compilers this function is normally a linear
function
– Example: for loop index variable j, function c*j + d
Saman Amarasinghe
46
6.035
&copy;MIT Fall 1998
What can we do with induction
variables?
i bl ?
• Use them to perform strength reduction
• Get rid of them
Saman Amarasinghe
47
6.035
&copy;MIT Fall 1998
Classification of induction variables
• Basic induction variables
– Explicitly modified by the same constant amount
once during each iteration of the loop
– Example: loop index variable
• Dependent induction variables
– Can be expressed in the form: aa*xx + b where a and
be are loop invariant and x is an induction variable
– Example: 202 - 2
2*j
j
Saman Amarasinghe
48
6.035
&copy;MIT Fall 1998
Classification of induction variables
• Class of induction variables: All induction
variables with same basic variable in their
q
linear equations
• Basis of a class: the basic variable that
determines that class
Saman Amarasinghe
49
6.035
&copy;MIT Fall 1998
Finding
g Basic Induction Variables
• Look inside loop
p nodes
• Find variables whose only modification is of
the form j = j + d where d is a loop
constant
Saman Amarasinghe
50
6.035
&copy;MIT Fall 1998
Finding
g Dependent
p
Induction Variables
• Find all the basic induction variables
• Search variable k with a single assignment in the loop
• Variable assignments of the form k = e op j or
k = -j where j is an induction variable and e is loop
invariant
Saman Amarasinghe
51
6.035
&copy;MIT Fall 1998
Finding
g Dependent
p
Induction Variables
• Example
for i = 1 to 100
j = i*c
k = j+1
Saman Amarasinghe
52
6.035
&copy;MIT Fall 1998
A special
p
case
t = 202
for j = 1 to 100
t = t - 2
j = t
a(j)
t = t - 2
(j) = t
b(j)
Saman Amarasinghe
53
6.035
&copy;MIT Fall 1998
A special
p
case
u1 = 200
u2 = 202
for j = 1 to 100
u1 = u1 - 4
a(j) = u1
u2 = u2 - 4
b(j) = u2
t = 202
for j = 1 to 100
t = t - 2
j = t
a(j)
t = t - 2
(j) = t
b(j)
Saman Amarasinghe
54
6.035
&copy;MIT Fall 1998
5
Outline
•
•
•
•
•
•
•
Scheduling for loops
Loop unrolling
Software pipelining
Interaction with register allocation
Hardware vs. Compiler
I d i Variable
Induction
V i bl Recognition
R
ii
Loop invariant code motion
Saman Amarasinghe
55
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a computation
p
produces
p
the same value in
every loop iteration, move it out of the loop
Saman Amarasinghe
56
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a computation
p
produces
p
the same value in
every loop iteration, move it out of the loop
for i = 1 to N
x = x + 1
for j = 1 to N
a(i,j) = 100*N + 10*i + j + x
Saman Amarasinghe
57
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a computation
p
produces
p
the same value in
every loop iteration, move it out of the loop
for i = 1 to N
x = x + 1
for j = 1 to N
a(i,j) = 100*N + 10*i + j + x
Saman Amarasinghe
58
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a comp
putation produces the same value in every loop iteration, move it out of the loop
t1 = 100*N
for i = 1 to N
x = x + 1
for j = 1 to N
a(i,j) = 100*N + 10*i + j + x
Saman Amarasinghe
59
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a computation
p
produces
p
the same value in
every loop iteration, move it out of the loop
t1 = 100*N
for i = 1 to N
x = x + 1
for j = 1 to N
a(i,j) = t1 + 10*i + j + x
Saman Amarasinghe
60
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a computation
p
produces
p
the same value in
every loop iteration, move it out of the loop
t1 = 100*N
for i = 1 to N
x = x + 1
for j = 1 to N
a(i,j) = t1 + 10*i + j + x
Saman Amarasinghe
61
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a comp
putation produces the same value in every loop iteration, move it out of the loop
t1 = 100*N
for i = 1 to N
x = x + 1 for j = 1 to N
a(i,j) = t1 + 10*i + j + x
Saman Amarasinghe
62
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a comp
putation produces the same value in every loop iteration, move it out of the loop
t1 = 100*N
for i = 1 to N
x = x + 1 t2 = t1 + 10*i + x
for j = 1 to N
a(i,j)
( ,j) = t1 + 10*i + j + x
Saman Amarasinghe
63
6.035
&copy;MIT Fall 1998
Loop
p Invariant Code Motion
• If a computation
p
produces
p
the same value in
every loop iteration, move it out of the loop
t1 = 100*N
for i = 1 to N
x = x + 1
t2 = t1 + 10*i + x
for j = 1 to N
a(i,j)
( ,j) = t2 + j
Saman Amarasinghe
64
6.035
&copy;MIT Fall 1998
MIT OpenCourseWare
http://ocw.mit.edu
6.035 Computer Language Engineering
Spring 2010