Chapter 4 (4.1,4.2,4.4) Lecture

Compiler Techniques for Exposing ILP
Instruction Level Parallelism
• Potential overlap among instructions
• Few possibilities in a basic block
  – Blocks are small (6-7 instructions)
  – Instructions are dependent
• Goal: Exploit ILP across multiple basic blocks
  – Iterations of a loop
for (i = 1000; i > 0; i=i-1)
    x[i] = x[i] + s;
Basic Scheduling
for (i = 1000; i > 0; i=i-1)
    x[i] = x[i] + s;
Pipelined execution (issue cycle on the right):
Loop: LD    F0, 0(R1)       1
      stall                 2
      ADDD  F4, F0, F2      3
      stall                 4
      stall                 5
      SD    0(R1), F4       6
      SUBI  R1, R1, #8      7
      stall                 8
      BNEZ  R1, Loop        9
      stall                 10
Sequential MIPS Assembly Code
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Scheduled pipelined execution (issue cycle on the right):
Loop: LD    F0, 0(R1)       1
      SUBI  R1, R1, #8      2
      ADDD  F4, F0, F2      3
      stall                 4
      BNEZ  R1, Loop        5
      SD    8(R1), F4       6
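The cycle counts on the two slides above (10 unscheduled, 6 scheduled) can be reproduced with a small model. This is only a sketch: the `STALLS` table, the instruction encoding, and the function name are assumptions chosen to match the stalls shown on the slides, not a description of any real pipeline. The model reports totals only; it does not say where inside the loop body each stall falls.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Assumed values, picked to match the stalls drawn on the slides.
STALLS = {
    ("load", "fp"): 1,      # LD result used by ADDD
    ("fp", "store"): 2,     # ADDD result stored by SD
    ("int", "branch"): 1,   # SUBI result tested by BNEZ
}

def cycles_per_iteration(code):
    """code: list of (kind, dest, sources). Returns cycles for one loop pass."""
    issued = {}   # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for kind, dest, sources in code:
        cycle += 1                                   # in-order, single issue
        for reg in sources:
            if reg in issued:
                when, producer = issued[reg]
                gap = STALLS.get((producer, kind), 0)
                cycle = max(cycle, when + 1 + gap)   # wait out the stalls
        if dest is not None:
            issued[dest] = (cycle, kind)
    if code[-1][0] == "branch":
        cycle += 1                                   # unfilled branch delay slot
    return cycle

UNSCHEDULED = [
    ("load",   "F0", ["R1"]),         # LD   F0, 0(R1)
    ("fp",     "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("store",  None, ["F4", "R1"]),   # SD   0(R1), F4
    ("int",    "R1", ["R1"]),         # SUBI R1, R1, #8
    ("branch", None, ["R1"]),         # BNEZ R1, Loop
]

SCHEDULED = [
    ("load",   "F0", ["R1"]),         # LD   F0, 0(R1)
    ("int",    "R1", ["R1"]),         # SUBI R1, R1, #8
    ("fp",     "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("branch", None, ["R1"]),         # BNEZ R1, Loop
    ("store",  None, ["F4", "R1"]),   # SD   8(R1), F4 (fills the delay slot)
]

print(cycles_per_iteration(UNSCHEDULED))  # 10
print(cycles_per_iteration(SCHEDULED))    # 6
```

Moving SUBI up hides the load delay, and moving SD into the branch delay slot hides most of the ADDD latency, which is where the four saved cycles come from.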
Loop Unrolling
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:
Pros:
  Larger basic block
  More scope for scheduling and eliminating dependencies
Cons:
  Increases code size
Comment:
  Often a precursor step for other optimizations
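At source level, the unrolled assembly above corresponds roughly to the sketch below. The function names are hypothetical, and `x[0]` is left unused so that indices match the 1-based loop on the slides. Each `if i == 0: break` plays the role of one exit test (BEQZ/BNEZ), so this version is correct for any trip count of at least one.

```python
def add_scalar_rolled(x, s):
    """Original loop: for (i = n; i > 0; i = i - 1) x[i] = x[i] + s."""
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s
        i -= 1

def add_scalar_unrolled4(x, s):
    """Same loop with the body replicated four times per pass."""
    i = len(x) - 1
    if i <= 0:
        return                  # nothing to do (the assembly assumes n >= 1)
    while True:
        x[i] = x[i] + s
        i -= 1
        if i == 0:              # BEQZ R1, Exit
            break
        x[i] = x[i] + s
        i -= 1
        if i == 0:              # BEQZ R1, Exit
            break
        x[i] = x[i] + s
        i -= 1
        if i == 0:              # BEQZ R1, Exit
            break
        x[i] = x[i] + s
        i -= 1
        if i == 0:              # BNEZ R1, Loop (falls through when R1 == 0)
            break

a = [float(i) for i in range(1001)]
b = list(a)
add_scalar_rolled(a, 2.5)
add_scalar_unrolled4(b, 2.5)
print(a == b)  # True
```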
Loop Transformations
• Instruction independence is the key requirement for the transformations
• Example
  – Determine that it is legal to move SD after SUBI and BNEZ
  – Determine that unrolling is useful (iterations are independent)
  – Use different registers to avoid unnecessary constraints
  – Eliminate extra tests and branches
  – Determine that LD and SD can be interchanged
  – Schedule the code, preserving the semantics of the code
1. Eliminating Name Dependences
Register Renaming
Before:                        After:
Loop: LD    F0, 0(R1)          Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2               ADDD  F4, F0, F2
      SD    0(R1), F4                SD    0(R1), F4
      LD    F0, -8(R1)               LD    F6, -8(R1)
      ADDD  F4, F0, F2               ADDD  F8, F6, F2
      SD    -8(R1), F4               SD    -8(R1), F8
      LD    F0, -16(R1)              LD    F10, -16(R1)
      ADDD  F4, F0, F2               ADDD  F12, F10, F2
      SD    -16(R1), F4              SD    -16(R1), F12
      LD    F0, -24(R1)              LD    F14, -24(R1)
      ADDD  F4, F0, F2               ADDD  F16, F14, F2
      SD    -24(R1), F4              SD    -24(R1), F16
      SUBI  R1, R1, #32              SUBI  R1, R1, #32
      BNEZ  R1, Loop                 BNEZ  R1, Loop
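One way to picture the renaming step is a pass that gives each unrolled copy of the body its own destination registers. The sketch below is an illustration only: the tuple encoding, the function name, and the rule "take the next free even-numbered FP register, keeping F2 reserved for s" are assumptions, though that rule does reproduce the register numbers used on the slide.

```python
def rename_copies(body, copies, reserved=("F2",)):
    """body: list of (op, dest, sources) for one iteration.
    Returns `copies` replicas, each writing to fresh FP registers."""
    # FP registers written inside the body, in first-write order.
    written = []
    for _op, dest, _srcs in body:
        if dest is not None and dest not in written:
            written.append(dest)
    # Pool of even-numbered FP registers, skipping reserved ones.
    pool = (f"F{n}" for n in range(0, 32, 2) if f"F{n}" not in reserved)
    out = []
    for _ in range(copies):
        mapping = {reg: next(pool) for reg in written}
        for op, dest, srcs in body:
            new_dest = mapping.get(dest, dest)
            new_srcs = [mapping.get(r, r) for r in srcs]
            out.append((op, new_dest, new_srcs))
    return out

BODY = [
    ("LD",   "F0", ["R1"]),         # LD   F0, 0(R1)
    ("ADDD", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("SD",   None, ["F4", "R1"]),   # SD   0(R1), F4
]

for ins in rename_copies(BODY, 4):
    print(ins)
```

After renaming, the only dependences left between copies are through R1, so loads and adds from different copies can be reordered freely.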
2. Eliminating Control Dependences
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:
Intermediate BEQZ are never taken (the trip count, 1000, is a multiple of 4)
Eliminate them!
3. Eliminating Data Dependences
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
• Data dependences among SUBI, LD, and SD force sequential execution of iterations
• Compiler removes this dependence by:
  – Computing intermediate R1 values
  – Eliminating intermediate SUBI
  – Changing final SUBI
• Data flow analysis
  – Can do on registers
  – Cannot do easily on memory locations: do 100(R1) and 20(R2) refer to the same address?
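The three compiler steps listed above (compute intermediate R1 values, drop the intermediate SUBI, change the final SUBI) amount to folding each decrement into the displacement of the memory operations that follow it. A minimal sketch, assuming a simplified tuple encoding in which every LD/SD addresses memory through R1; the function name and encoding are hypothetical:

```python
def fold_subi(code):
    """Fold intermediate 'SUBI R1, R1, #k' into LD/SD displacements.

    Encoding (assumed): ("LD", reg, offset), ("SD", reg, offset),
    ("SUBI", k), ("BNEZ",); anything else passes through unchanged.
    """
    pending = 0     # total already subtracted from R1 in the original code
    out = []
    for ins in code:
        op = ins[0]
        if op == "SUBI":
            pending += ins[1]                 # drop it, remember the amount
        elif op in ("LD", "SD"):
            _, reg, offset = ins
            out.append((op, reg, offset - pending))
        elif op == "BNEZ":
            out.append(("SUBI", pending))     # single combined decrement
            out.append(ins)
        else:
            out.append(ins)
    return out

UNROLLED = []
for ld_reg, add_reg in [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]:
    UNROLLED += [
        ("LD", ld_reg, 0),
        ("ADDD", add_reg, ld_reg, "F2"),
        ("SD", add_reg, 0),
        ("SUBI", 8),
    ]
UNROLLED.append(("BNEZ",))

for ins in fold_subi(UNROLLED):
    print(ins)
```

The output uses displacements 0, -8, -16, -24 and one `SUBI` of 32 before the branch. This works because R1 is a register whose value the compiler can track exactly; the same reasoning is much harder for memory locations.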
4. Alleviating Data Dependencies
Unrolled loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop
Scheduled unrolled loop:
Loop: LD    F0, 0(R1)
      LD    F6, -8(R1)
      LD    F10, -16(R1)
      LD    F14, -24(R1)
      ADDD  F4, F0, F2
      ADDD  F8, F6, F2
      ADDD  F12, F10, F2
      ADDD  F16, F14, F2
      SD    0(R1), F4
      SD    -8(R1), F8
      SUBI  R1, R1, #32
      SD    16(R1), F12
      BNEZ  R1, Loop
      SD    8(R1), F16
Some General Comments
• Dependences are a property of programs
• Actual hazards are a property of the pipeline
• Techniques to avoid dependence limitations
  – Maintain dependences but avoid hazards
    • Code scheduling
      – hardware
      – software
  – Eliminate dependences by code transformations
    • Complex
    • Compiler-based
Loop-level Parallelism
• Primary focus of dependence analysis
• Determine all dependences and find cycles
for (i=1; i<=100; i=i+1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

for (i=1; i<=100; i=i+1) {
    x[i+1] = x[i] + z[i];
}

for (i=1; i<=100; i=i+1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}

x[1] = x[1] + y[1];
for (i=1; i<=99; i=i+1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i+1] + y[i+1];
}
y[101] = w[100] + z[100];
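The last code fragment above is the third loop rewritten so that the dependence on y no longer crosses iterations: each y[i+1] is now produced and consumed in the same iteration. A quick way to convince yourself the rewrite preserves the result is to run both versions on the same data. The function names and test data below are made up for this check, and the arrays carry an unused element 0 to keep the 1-based indexing.

```python
def original(x, y, w, z):
    # x[i] reads y[i], which the previous iteration wrote: loop-carried.
    for i in range(1, 101):
        x[i] = x[i] + y[i]
        y[i + 1] = w[i] + z[i]

def transformed(x, y, w, z):
    # Peel the first x update and the last y update; inside the loop
    # y[i+1] is written and then read within the same iteration.
    x[1] = x[1] + y[1]
    for i in range(1, 100):
        y[i + 1] = w[i] + z[i]
        x[i + 1] = x[i + 1] + y[i + 1]
    y[101] = w[100] + z[100]

def fresh_arrays():
    n = 102  # indices 0..101; index 0 is unused
    return ([float(i) for i in range(n)],
            [float(2 * i) for i in range(n)],
            [float(3 * i) for i in range(n)],
            [float(5 * i) for i in range(n)])

x1, y1, w1, z1 = fresh_arrays()
x2, y2, w2, z2 = fresh_arrays()
original(x1, y1, w1, z1)
transformed(x2, y2, w2, z2)
print(x1 == x2 and y1 == y2)  # True
```

The second loop on the slide, x[i+1] = x[i] + z[i], cannot be fixed this way: each iteration needs the value the previous one produced, so the dependence forms a cycle.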
Dependence Analysis Algorithms
• Assume array indexes are affine (ai + b)
  – GCD test:
    For two affine array indexes ai+b and ci+d:
    if a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
    Example: x[8*i] = x[4*i + 2] + 3
    a = 8, b = 0, c = 4, d = 2: GCD(4,8) = 4 does not divide (2-0) = 2, so no dependence is possible
• In general, determining whether a dependence actually exists is NP-complete
• a, b, c, and d may not be known at compile time
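The GCD test is a one-line check once a, b, c, and d are known. A minimal sketch (the function name is an assumption); note that the test is conservative: a `True` result only means a dependence cannot be ruled out, not that one exists within the loop bounds.

```python
from math import gcd

def gcd_test(a, b, c, d):
    """Write index is a*i + b, read index is c*i + d.

    Returns False if a loop-carried dependence is impossible,
    True if the test cannot rule one out.
    Assumes a and c are not both zero.
    """
    return (d - b) % gcd(c, a) == 0

# Slide example: x[8*i] = x[4*i + 2] + 3
print(gcd_test(8, 0, 4, 2))   # False: GCD(4, 8) = 4 does not divide 2

# x[i+1] = x[i] + z[i]
print(gcd_test(1, 1, 1, 0))   # True: GCD(1, 1) = 1 divides -1
```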
Software Pipelining
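[Figure: Iterations 0 to 3 overlapped in time. A software pipelined iteration takes instructions from several original iterations; start-up code precedes the steady state and finish-up code follows it.]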
Example
Iteration i:    LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4
Iteration i+1:  LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4
Iteration i+2:  LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4
Original loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Software pipelined loop:
Loop: SD    16(R1), F4
      ADDD  F4, F0, F2
      LD    F0, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
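The software pipelined loop only makes sense together with its start-up and finish-up code, which the slide leaves out. The sketch below fills them in under the assumption that each steady-state pass stores element i, adds for element i-1, and loads element i-2; `f0` and `f4` stand in for registers F0 and F4, and the function names are hypothetical.

```python
def rolled(x, s):
    """Original loop over x[n] .. x[1]; x[0] is unused."""
    for i in range(len(x) - 1, 0, -1):
        f0 = x[i]          # LD
        f4 = f0 + s        # ADDD
        x[i] = f4          # SD

def software_pipelined(x, s):
    n = len(x) - 1
    assert n >= 2, "this sketch needs at least two elements"
    # Start-up: LD and ADDD for element n, LD for element n-1.
    f0 = x[n]
    f4 = f0 + s
    f0 = x[n - 1]
    # Steady state: one SD, one ADDD, one LD per pass, each from a
    # different original iteration.
    for i in range(n, 2, -1):
        x[i] = f4          # SD:   finishes element i
        f4 = f0 + s        # ADDD: works on element i-1
        f0 = x[i - 2]      # LD:   starts element i-2
    # Finish-up: drain the two elements still in flight.
    x[2] = f4
    f4 = f0 + s
    x[1] = f4

a = [float(i) for i in range(1001)]
b = list(a)
rolled(a, 2.5)
software_pipelined(b, 2.5)
print(a == b)  # True
```

Inside the steady-state loop no instruction uses a value produced in the same pass, which is what removes the dependence stalls within a single loop body.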
Trace (Global-Code) Scheduling
• Find ILP across conditional branches
• Two-step process
  – Trace selection
    • Find a trace (sequence of basic blocks)
    • Use loop unrolling to generate long traces
    • Use static branch prediction for other conditional branches
  – Trace compaction
    • Squeeze the trace into a small number of wide instructions
    • Preserve data and control dependences
Trace Selection
[Flow graph: A[I] = A[I] + B[I]; test A[I] = 0?; T path: B[I] =; F path: X; both paths join at C[I] =]
      LW    R4, 0(R1)
      LW    R5, 0(R2)
      ADD   R4, R4, R5
      SW    0(R1), R4
      BNEZ  R4, else
      ....
      SW    0(R2), . . .
      J     join
Else: ....
      X
Join: ....
      SW    0(R3), . . .
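Trace selection can be pictured as walking the flow graph from the entry block and always following the successor that static prediction considers most likely. The sketch below is a simplified illustration, not a full trace scheduler: the block names, the probabilities, and the function name are all hypothetical, and real compilers also have to add compensation code on the off-trace paths.

```python
def select_trace(cfg, entry):
    """cfg: block -> list of (successor, predicted probability).
    Greedily follows the most likely successor until the graph ends
    or a block repeats."""
    trace = []
    block = entry
    while block is not None and block not in trace:
        trace.append(block)
        successors = cfg.get(block, [])
        if successors:
            block = max(successors, key=lambda edge: edge[1])[0]
        else:
            block = None
    return trace

# Flow graph from the slide; the 0.9 / 0.1 split is an assumed prediction.
CFG = {
    "A[I] = A[I] + B[I]": [("B[I] =", 0.9), ("X", 0.1)],
    "B[I] =": [("C[I] =", 1.0)],
    "X": [("C[I] =", 1.0)],
    "C[I] =": [],
}

print(select_trace(CFG, "A[I] = A[I] + B[I]"))
# ['A[I] = A[I] + B[I]', 'B[I] =', 'C[I] =']
```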
Summary of Compiler Techniques
• Try to avoid dependence stalls
• Loop unrolling
  – Reduce loop overhead
• Software pipelining
  – Reduce single body dependence stalls
• Trace scheduling
  – Reduce impact of other branches
• Compilers use a mix of all three
• All techniques depend on prediction accuracy
Food for thought: Analyze this
• Analyze this for different values of X and Y
  – To evaluate different branch prediction schemes
  – For compiler scheduling purposes
      add   r1, r0, 1000    # all numbers in decimal
      add   r2, r0, a       # Base address of array a
loop: andi  r10, r1, X
      beqz  r10, even
      lw    r11, 0(r2)
      addi  r11, r11, 1
      sw    0(r2), r11
even: addi  r2, r2, 4
      subi  r1, r1, Y
      bnez  r1, loop
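A starting point for the analysis is to record the outcome of both branches on every iteration for a given X and Y. The sketch below does only that; the function name and the `max_iters` guard are assumptions, added because for some values of Y the counter never reaches exactly zero. It models r1 as an unbounded Python integer, so it does not reproduce 32-bit wraparound.

```python
def branch_outcomes(x_mask, y_step, start=1000, max_iters=10_000):
    """Returns (beqz_taken, bnez_taken): one bool per loop iteration."""
    r1 = start
    beqz_taken = []
    bnez_taken = []
    for _ in range(max_iters):
        r10 = r1 & x_mask               # andi r10, r1, X
        beqz_taken.append(r10 == 0)     # beqz r10, even
        r1 -= y_step                    # subi r1, r1, Y
        bnez_taken.append(r1 != 0)      # bnez r1, loop
        if r1 == 0:
            break
    return beqz_taken, bnez_taken

beqz, bnez = branch_outcomes(x_mask=1, y_step=1)
print(len(beqz), sum(beqz), sum(bnez))  # 1000 500 999
```

With X = 1 and Y = 1 the `beqz` alternates taken / not taken on every iteration, while `bnez` is taken every time except the last. Feeding these sequences into a model of a branch predictor is left as part of the exercise.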