Lectures 14 & 15: Instruction Scheduling
6.035, Saman Amarasinghe, MIT (Fall 1998 / Fall 2005)

Simple Machine Model
• Instructions are executed in sequence
  – Fetch, decode, execute, store results
  – One instruction at a time
• For branch instructions, start fetching from a different location if needed
  – Check the branch condition
  – The next instruction may come from a new location given by the branch instruction
Simple Execution Model
• 5-stage pipeline
  [Pipeline diagram: Inst 1 and Inst 2 each pass through IF, DE, EXE, MEM, WB, with Inst 2 one cycle behind Inst 1]
• Fetch (IF): get the next instruction
• Decode (DE): figure out what that instruction is
• Execute (EXE): perform the ALU operation
  – address calculation in a memory op
• Memory (MEM): do the memory access in a memory op
• Write Back (WB): write the results back

Simple Execution Model
  [Pipeline diagram: Inst 1 through Inst 5, each starting one cycle after the previous, so up to five instructions are in flight at once]

From a Simple Machine Model to a Real Machine Model
• Many pipeline stages
  – Pentium: 5
  – Pentium Pro: 10
  – Pentium IV (130nm): 20
  – Pentium IV (90nm): 31
• Different instructions take different amounts of time to execute
• Hardware stalls the pipeline if an instruction uses a result that is not ready

Real Machine Model cont.
• Most modern processors have multiple execution units (superscalar)
  – If the instruction sequence is correct, multiple operations will happen in the same cycles
  – Even more important to have the right instruction sequence
Constraints On Scheduling
• Data dependencies
• Control dependencies
• Resource constraints

Data Dependency between Instructions
• If two instructions access the same variable, they can be dependent
• Kinds of dependencies (see the sketch below):
  – True: write → read
  – Anti: read → write
  – Output: write → write
• What if two instructions are dependent?
  – The order of execution cannot be reversed
  – This reduces the possibilities for scheduling
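The three kinds can be read directly off each instruction's read and write sets. Here is a minimal sketch (mine, not the lecture's) in Python; the function name and the set representation are illustrative assumptions:

    # Hypothetical helper: classify the dependence between two instructions,
    # given (reads, writes) register sets, with `first` earlier in program order.
    def dependence_kinds(first, second):
        reads1, writes1 = first
        reads2, writes2 = second
        kinds = []
        if writes1 & reads2:
            kinds.append("true")    # write -> read (read after write)
        if reads1 & writes2:
            kinds.append("anti")    # read -> write (write after read)
        if writes1 & writes2:
            kinds.append("output")  # write -> write (write after write)
        return kinds

    # r2 = *(r1 + 4) followed by r4 = r2 + r3: a true dependence on r2
    print(dependence_kinds(({"r1"}, {"r2"}), ({"r2", "r3"}, {"r4"})))  # ['true']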
Computing Dependencies
• For basic blocks, compute dependencies by walking through the instructions (a sketch follows the DAG examples below)
• Identifying register dependencies is simple
  – is it the same register?
• For memory accesses it is harder:
  – simple: base + offset1 ?= base + offset2
  – data dependence analysis: a[2i] ?= a[2i+1]
  – interprocedural analysis: global ?= parameter
  – pointer alias analysis: p1 ?= p2
Representing Dependencies
• Use a dependence DAG, one per basic block
• Nodes are instructions; edges represent dependencies

  1: r2 = *(r1 + 4)
  2: r3 = *(r1 + 8)
  3: r4 = r2 + r3
  4: r5 = r2 - 1

  [DAG: edges 1→3, 2→3, and 1→4, each labeled with latency 2]

• Edges are labeled with latency:
  – v(i → j) = delay required between the initiation times of i and j, minus the execution time required by i

Example
  1: r2 = *(r1 + 4)
  2: r3 = *(r2 + 4)
  3: r4 = r2 + r3
  4: r5 = r2 - 1

  [DAG: edges 1→2, 1→3, 2→3, and 1→4, each with latency 2; the loads now chain, since instruction 2 uses the r2 loaded by instruction 1]
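A minimal sketch of the dependence-DAG construction described above: walk the block in order and add an edge for each register conflict. Instructions are modeled as (reads, writes, latency) triples; giving anti and output edges an ordering-only latency of 1 is my assumption here, not the lecture's rule:

    # Hypothetical sketch: build a dependence DAG for one basic block.
    def build_dag(block):
        edges = []  # (i, j, latency): j must start >= latency cycles after i
        for j, (reads_j, writes_j, _) in enumerate(block):
            for i in range(j):
                reads_i, writes_i, latency_i = block[i]
                true_dep = writes_i & reads_j          # write -> read
                anti_dep = reads_i & writes_j          # read -> write
                out_dep = writes_i & writes_j          # write -> write
                if true_dep:
                    edges.append((i, j, latency_i))    # consumer waits for the result
                elif anti_dep or out_dep:
                    edges.append((i, j, 1))            # ordering constraint only
        return edges

    # 1: r2 = *(r1+4)   2: r3 = *(r1+8)   3: r4 = r2+r3   4: r5 = r2-1
    block = [({"r1"}, {"r2"}, 2), ({"r1"}, {"r3"}, 2),
             ({"r2", "r3"}, {"r4"}, 1), ({"r2"}, {"r5"}, 1)]
    print(build_dag(block))  # [(0, 2, 2), (1, 2, 2), (0, 3, 2)]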
Another Example
  1: r2 = *(r1 + 4)
  2: *(r1 + 4) = r3
  3: r3 = r2 + r3
  4: r5 = r2 - 1

  [DAG: true edges 1→3 and 1→4 on r2 (latency 2), an anti edge 1→2 on the memory location, and an anti edge 2→3 on r3 (latency 1)]

Control Dependencies and Resource Constraints
• For now, let's only worry about basic blocks
• For now, let's look at simple pipelines
Example
                               Results in
  1: lea  var_a, %rax          1 cycle
  2: add  $4, %rax             1 cycle
  3: inc  %r11                 1 cycle
  4: mov  4(%rsp), %r10        3 cycles
  5: add  %r10, 8(%rsp)        4 cycles
  6: and  16(%rsp), %rbx       4 cycles
  7: imul %rax, %rbx           3 cycles

  In program order the pipeline stalls (st) waiting for results:
    1  2  3  4  st st  5  6  st st st  7

List Scheduling Algorithm
• Idea
  – Do a topological sort of the dependence DAG
  – Consider when an instruction can be scheduled without causing a stall
  – Schedule the instruction if it causes no stall and all its predecessors are already scheduled
• Optimal list scheduling is NP-complete
  – Use heuristics when necessary
List Scheduling Algorithm
• Create a dependence DAG of a basic block
• Topological sort (sketched in code below):
    READY = nodes with no predecessors
    Loop until READY is empty
      Schedule each node in READY when no stalling
      Update READY

Heuristics for selection
• Heuristics for selecting from the READY list:
  – pick the node with the longest path to a leaf in the dependence graph
  – pick a node with the most immediate successors
  – pick a node that can go to a less busy pipeline (in a superscalar)
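Hedged as a sketch, the loop above can be written out as follows; `succs` maps each node to its successors with edge latencies, and `priority` can be any of the heuristics just listed (all names are illustrative):

    # Hypothetical sketch of list scheduling for one basic block.
    def list_schedule(n_nodes, succs, priority):
        preds_left = [0] * n_nodes
        for i in succs:
            for j in succs[i]:
                preds_left[j] += 1
        earliest = [0] * n_nodes        # earliest cycle allowed by dependencies
        ready = {i for i in range(n_nodes) if preds_left[i] == 0}
        schedule, cycle = [], 0
        while ready:
            # prefer nodes that issue now without stalling, then higher priority
            node = sorted(ready,
                          key=lambda i: (earliest[i] > cycle, -priority[i], i))[0]
            cycle = max(cycle, earliest[node])
            schedule.append((cycle, node))
            ready.remove(node)
            for j, latency in succs.get(node, {}).items():
                earliest[j] = max(earliest[j], cycle + latency)
                preds_left[j] -= 1
                if preds_left[j] == 0:
                    ready.add(j)
            cycle += 1                  # one instruction issues per cycle
        return schedule

    succs = {0: {2: 2, 3: 2}, 1: {2: 2}}   # the four-instruction DAG example
    print(list_schedule(4, succs, [2, 2, 0, 0]))
    # [(0, 0), (1, 1), (2, 3), (3, 2)] -- instruction 4 fills the load-delay slot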
Heuristics for selection
• Pick the node with the longest path to a leaf in the dependence graph
• Algorithm (for node x; sketched below):
  – If x has no successors: dx = 0
  – Otherwise: dx = MAX(dy + cxy) over all successors y of x
  – Visit nodes in reverse breadth-first order

Heuristics for selection
• Pick a node with the most immediate successors
• Algorithm (for node x):
  – fx = number of successors of x
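A small sketch of the dx computation; a depth-first post-order stands in for the slide's reverse breadth-first visit, which on a DAG gives the same guarantee that every successor is finalized before its predecessors:

    # Hypothetical sketch: d(x) = longest latency-weighted path from x to a leaf.
    def longest_path_to_leaf(n_nodes, succs):
        d = [0] * n_nodes
        order, seen = [], set()
        def visit(x):                    # post-order DFS: successors before x
            if x in seen:
                return
            seen.add(x)
            for y in succs.get(x, {}):
                visit(y)
            order.append(x)
        for x in range(n_nodes):
            visit(x)
        for x in order:                  # successors already final when x is seen
            if succs.get(x):
                d[x] = max(d[y] + c_xy for y, c_xy in succs[x].items())
        return d

    succs = {0: {2: 2, 3: 2}, 1: {2: 2}}   # the earlier 4-instruction example
    print(longest_path_to_leaf(4, succs))  # [2, 2, 0, 0]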
Example
  1: lea  var_a, %rax
  2: add  $4, %rax
  3: inc  %r11
  4: mov  4(%rsp), %r10
  5: add  %r10, 8(%rsp)
  6: and  16(%rsp), %rbx
  7: imul %rax, %rbx
  8: mov  %rbx, 16(%rsp)
  9: lea  var_b, %rax

  [Dependence DAG annotated per node with d (longest latency-weighted path to a leaf) and f (number of immediate successors), e.g. d=7, f=1 on the critical load chain]

  Scheduling in program order leaves the stall slots shown earlier; choosing READY nodes by the heuristics fills every slot:
    14 cycles vs 9 cycles

Resource Constraints
• Modern machines have many resource constraints
• Superscalar architectures:
  – can run a few operations in parallel
  – but have constraints
Resource Constraints of a Superscalar Processor
• Example:
  – One fully pipelined reg-to-reg unit
    • All integer operations take one cycle
  in parallel with
  – One fully pipelined memory-to/from-reg unit
    • Data loads take two cycles
    • Data stores take one cycle

List Scheduling Algorithm with resource constraints
• Represent the superscalar architecture as multiple pipelines
  – Each pipeline represents some resource
• Example:
  – One single-cycle reg-to-reg ALU unit
  – One two-cycle pipelined reg-to/from-memory unit
  [Reservation table with one row per pipeline stage: ALU, MEM 1, MEM 2]

List Scheduling Algorithm with resource constraints
• Create a dependence DAG of a basic block
• Topological sort (code sketch below):
    READY = nodes with no predecessors
    Loop until READY is empty
      Let n ∈ READY be the node with the highest priority
      Schedule n in the earliest slot that satisfies
        precedence + resource constraints
      Update READY
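A minimal sketch of this variant, assuming fully pipelined units (each can initiate one operation per cycle), so only issue slots are reserved and latency is carried on the DAG edges; all names are illustrative:

    # Hypothetical sketch: list scheduling with per-unit issue-slot reservation.
    def schedule_with_resources(n_nodes, succs, unit_of, priority):
        preds_left = [0] * n_nodes
        for i in succs:
            for j in succs[i]:
                preds_left[j] += 1
        earliest = [0] * n_nodes
        issue_busy = {"ALU": set(), "MEM": set()}   # issue slots taken per unit
        ready = {i for i in range(n_nodes) if preds_left[i] == 0}
        placed = {}
        while ready:
            node = max(ready, key=lambda i: priority[i])  # highest priority first
            t = earliest[node]
            while t in issue_busy[unit_of[node]]:
                t += 1                                    # earliest legal slot
            issue_busy[unit_of[node]].add(t)
            placed[node] = t
            ready.remove(node)
            for j, lat in succs.get(node, {}).items():
                earliest[j] = max(earliest[j], t + lat)
                preds_left[j] -= 1
                if preds_left[j] == 0:
                    ready.add(j)
        return placed

    succs = {0: {1: 1}}                     # ALU op 0 feeds ALU op 1
    print(schedule_with_resources(2, succs, ["ALU", "ALU"], [1, 0]))  # {0: 0, 1: 1}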
Example
  1: lea  var_a, %rax
  2: add  4(%rsp), %rax
  3: inc  %r11
  4: mov  4(%rsp), %r10
  5: mov  %r10, 8(%rsp)
  6: and  $0x00ff, %rbx
  7: imul %rax, %rbx
  8: lea  var_b, %rax
  9: mov  %rbx, 16(%rsp)

  [Dependence DAG, annotated with each node's d and f values:
   1: d=4 f=1,  2: d=3 f=1,  3: d=0 f=0,  4: d=2 f=1,  5: d=0 f=0,
   6: d=2 f=1,  7: d=1 f=2,  8: d=0 f=0,  9: d=0 f=0]

The schedule is filled in one node at a time: pick the READY node with the highest d value, place it in the earliest slot of its pipeline (ALUop or MEM 1/MEM 2) that satisfies precedence and resource constraints, then update READY:

  READY = { 1, 6, 4, 3 }    schedule 1 (ALUop, cycle 0)
  READY = { 2, 6, 4, 3 }    scheduling 1 makes 2 ready; schedule 2 (MEM, cycle 1)
  READY = { 6, 4, 3 }       schedule 6 (ALUop, cycle 1); 7 becomes ready
  READY = { 4, 7, 3 }       schedule 4 (MEM, the free cycle-0 slot); 5 becomes ready
  READY = { 7, 3, 5 }       schedule 7 (ALUop, cycle 3); 8 and 9 become ready
  READY = { 3, 5, 8, 9 }    schedule 3 (ALUop, cycle 2)
  READY = { 5, 8, 9 }       schedule 5 (MEM, cycle 2)
  READY = { 8, 9 }          schedule 8 and 9
  READY = { }               done

  [Final reservation table, as recoverable from the slides:
     ALUop:  1  6  3  7  8
     MEM 1:  4  2  5  .  9
     MEM 2:  .  4  2  5  .  9 ]

Scheduling across basic blocks
• The number of instructions in a basic block is small
  – Cannot keep multiple units with long pipelines busy by scheduling within a basic block alone
• Need to handle control dependence
  – Scheduling constraints across basic blocks
  – Scheduling policy
Moving across basic blocks
• Downward to an adjacent basic block
  [CFG: block A branches to blocks B and C; an instruction moves from A down into B]
  – A path to B that does not execute A?

Moving across basic blocks
• Upward to an adjacent basic block
  [CFG: blocks B and C follow A; an instruction moves from C up into A]
  – A path from C that does not reach A?
Control Dependencies
• Constraints in moving instructions across basic blocks

    if ( . . . )
        a = b op c

Control Dependencies
• Constraints in moving instructions across basic blocks (illustrated below)

    if ( valid address? )
        d = *(a1)
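A tiny illustration (mine, not the lecture's) of the second constraint: the load is only safe under its guard, so a scheduler cannot hoist it above the branch. Here None stands in for an invalid address:

    # The guarded form only dereferences when the address is valid.
    def guarded(a1):
        d = None
        if a1 is not None:       # "valid address?" guard
            d = a1[0]            # d = *(a1), safe only under the guard
        return d

    # Hoisting the load above the branch changes behavior on the other path.
    def hoisted(a1):
        d = a1[0]                # crashes when a1 is None
        if a1 is not None:
            return d
        return None

    print(guarded(None))         # None
    # hoisted(None) would raise a TypeError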
Trace Scheduling
• Find the most common trace of basic blocks
  – Use profile information
• Combine the basic blocks in the trace and schedule them as one block
• Create clean-up code if the execution goes off-trace

Trace Scheduling
  [CFG diagram: blocks A through H; the most frequently executed path through them forms the trace]
Large Basic Blocks via Code Duplication
• Create large extended basic blocks by duplication
• Schedule the larger blocks
  [Diagram: the join blocks D and E are duplicated along both paths of the branch in A, producing larger straight-line regions]

Trace Scheduling
  [Diagram: the trace is scheduled as one block, with duplicated clean-up copies of trace blocks for off-trace entries and exits]
Scheduling Loops
• Loop bodies are small
• But a lot of time is spent in loops, due to the large number of iterations
• Need better ways to schedule loops
Loop Example
• Machine:
  – One load/store unit
    • load: 2 cycles
    • store: 2 cycles
  – Two arithmetic units
    • add: 2 cycles
    • branch: 2 cycles
    • multiply: 3 cycles
  – Both units are pipelined (initiate one op each cycle)
• Source code:
    for i = 1 to N
      A[i] = A[i] * b
Loop Example
• Assembly code:
    loop:
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
  [Dependence DAG over the body: load d=7, imul d=5, store d=2, sub d=2, bge d=0]
• Schedule (9 cycles per iteration)

Loop Unrolling
• Unroll the loop body a few times
• Pros:
  – Creates a much larger basic block for the body
  – Eliminates a few loop-bound checks
• Cons:
  – Much larger program
  – Setup code (when # of iterations < unroll factor; see the sketch below)
  – The beginning and end of the schedule can still have unused slots
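A source-level sketch of the transformation and its setup-code cost, in Python for brevity; the function name and unroll factor are illustrative assumptions:

    UNROLL = 4

    def scale_array(a, b):
        n = len(a)
        i = 0
        # unrolled body: one larger "basic block", fewer bound checks
        while i + UNROLL <= n:
            a[i]     *= b
            a[i + 1] *= b
            a[i + 2] *= b
            a[i + 3] *= b
            i += UNROLL
        while i < n:          # cleanup when n is not a multiple of UNROLL
            a[i] *= b
            i += 1

    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    scale_array(data, 2.0)
    print(data)  # [2.0, 4.0, 6.0, 8.0, 10.0]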
Loop Example (unrolled twice)
    loop:
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
  [Dependence DAG: reusing %r10 and %rax chains the two copies together (d values run from 14 down to 0)]
• Schedule (8 cycles per iteration)

Loop Unrolling
• Rename registers
  – Use different registers in different iterations
Loop Example (registers renamed)
    loop:
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        mov  (%rdi,%rax), %rcx
        imul %r11, %rcx
        mov  %rcx, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
  [Dependence DAG: renaming %r10 to %rcx separates the two multiply chains, but the shared %rax address chain still links the copies]

Loop Unrolling
• Eliminate unnecessary dependencies
  – again, use more registers to eliminate true, anti and output dependencies
  – eliminate dependent chains of calculations when possible
Loop Example (independent iterations)
    loop:
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $8, %rax
        mov  (%rdi,%rbx), %rcx
        imul %r11, %rcx
        mov  %rcx, (%rdi,%rbx)
        sub  $8, %rbx
        bge  loop
  [Dependence DAG: the two copies now form independent chains]
• Schedule (4.5 cycles per iteration)

Software Pipelining
• Try to overlap multiple iterations so that the slots will be filled
• Find the steady-state window (sketched below) so that:
  – all the instructions of the loop body are executed
  – but from different iterations
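As a sketch of the steady-state idea, suppose the body is split into three stages (load, multiply, store), each able to start every cycle; printing which iteration each stage serves shows the prologue, the steady-state kernel, and the epilogue. The stage split and names are illustrative assumptions, not the lecture's schedule:

    STAGES = ["load", "mul", "store"]     # illustrative 3-stage split of the body

    def software_pipeline(n_iters):
        rows = []
        for t in range(n_iters + len(STAGES) - 1):    # time steps
            row = []
            for s, stage in enumerate(STAGES):
                it = t - s                            # iteration this stage serves
                if 0 <= it < n_iters:
                    row.append(f"{stage}[{it}]")
            rows.append(row)
        return rows

    for t, row in enumerate(software_pipeline(5)):
        kind = "kernel" if len(row) == len(STAGES) else "pro/epilogue"
        print(t, kind, row)
    # In the steady state every time step runs one load, one mul and one store,
    # each on behalf of a different iteration.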
Loop Example (software pipelined)
• Assembly code:
    loop:
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
• Register requirements:
  – the value of %r11 doesn't change
  – 4 regs for the address (%rdi,%rax), each addr. incremented by 4*4
  – 4 regs to keep the value of %r10
• Schedule:
  [Software-pipelined schedule: slots filled with mov_i, mul_i, st_i, sub_i, bge_i instances drawn from several in-flight iterations, e.g. st3 of one iteration issuing alongside mul4 and mov5 of later ones]

Loop Example (cont.)
• 4 iterations are overlapped
  – The same registers can be reused after 4 of these blocks, so generate code for 4 blocks; otherwise register moves are needed
Software Pipelining cont.
• Optimal use of resources
• Needs a lot of registers
  – Values from multiple iterations need to be kept live
• Issues with dependencies
  – A store instruction from one iteration executes before the branch instruction of a previous iteration (writing when it should not have)
  – Loads and stores are issued out of order (need to figure out the dependencies before doing this)
• Code generation issues
  – Generate pre-amble and post-amble code
  – Multiple copies of the block, so that no register copy is needed

Register Allocation and Instruction Scheduling
• If register allocation is done before instruction scheduling
  – it restricts the choices for scheduling
Example
  1: mov  4(%rbp), %rax
  2: add  %rax, %rbx
  3: mov  8(%rbp), %rax
  4: add  %rax, %rcx

  [DAG: true edges 1→2 and 3→4 (load latency 3), plus anti/output edges 1→3 and 2→3 (latency 1) caused by reusing %rax; the reuse serializes the two load/add pairs on the ALUop / MEM 1 / MEM 2 pipelines]

Example (renamed)
  1: mov  4(%rbp), %rax
  2: add  %rax, %rbx
  3: mov  8(%rbp), %r10
  4: add  %r10, %rcx

  [DAG: only the true edges 1→2 and 3→4 remain; both loads can issue back-to-back and the schedule is shorter]
Register Allocation and Instruction Scheduling
• If register allocation is done before instruction scheduling
  – it restricts the choices for scheduling
• If instruction scheduling is done before register allocation
  – register allocation may spill registers
  – spill code will change the carefully done schedule!!!
Superscalar: Where have all the transistors gone?
• Out-of-order execution
  – If an instruction stalls, go beyond it and start executing non-dependent instructions
  – Pros:
    • Hardware scheduling
    • Tolerates unpredictable latencies
  – Cons:
    • The instruction window is small

Superscalar: Where have all the transistors gone?
• Register renaming
  – If an anti or output dependency on a register stalls the pipeline, use a different hardware register
  – Pros:
    • Avoids anti and output dependencies
  – Cons:
    • Cannot do more complex transformations to eliminate dependencies
Hardware vs. Compiler
• In a superscalar, hardware and compiler scheduling can work hand-in-hand
• Hardware can reduce the burden when behavior is not predictable by the compiler
• The compiler can still greatly enhance performance
  – Large instruction window for scheduling
  – Many program transformations that increase parallelism
• The compiler is even more critical when there is no hardware support
  – VLIW machines (Itanium, DSPs)