PPT - University of Delaware

Optimizing Compilers
CISC 673
Spring 2009
Instruction Scheduling
John Cavazos
University of Delaware
UNIVERSITY OF DELAWARE • COMPUTER & INFORMATION SCIENCES DEPARTMENT
Instruction Scheduling


Reordering instructions to improve performance
Takes into account anticipated latencies
 Machine-specific
Performed late in optimization pass
Instruction-Level Parallelism (ILP)
Modern Architecture Features




Superscalar
 Multiple logic units
Multiple issue
 2 or more instructions issued per cycle
Speculative execution
 Branch predictors
 Speculative loads
Deep pipelines
Types of Instruction Scheduling


Local Scheduling
 Basic Block Scheduling
Global Scheduling
 Trace Scheduling
 Superblock Scheduling
 Software Pipelining
Scheduling for Different Computer Architectures

Out-of-order issue
 Scheduling is useful
In-order issue
 Scheduling is very important
VLIW
 Scheduling is essential!
Challenges to ILP



Structural hazards
 Insufficient resources to exploit parallelism
Data hazards
 Instruction depends on result of previous instruction still in pipeline
Control hazards
 Branches & jumps modify PC
 Affect which instructions should be in pipeline
Recall from Architecture…





IF – Instruction Fetch
ID – Instruction Decode
EX – Execute
MA – Memory access
WB – Write back

IF    ID    EX    MA    WB
      IF    ID    EX    MA    WB
            IF    ID    EX    MA    WB
Structural Hazards
Instruction latency: execute takes > 1 cycle

addf R3,R1,R2    IF    ID    EX    EX    MA    WB
addf R3,R3,R4          IF    ID    stall EX    EX    MA    WB

Assumes floating-point ops take 2 execute cycles
Data Hazards
Memory latency: data not ready

lw  R1,0(R2)     IF    ID    EX    MA    WB
add R3,R1,R4           IF    ID    stall EX    MA    WB
Control Hazards
Taken Branch         IF    ID    EX    MA    WB
Instr + 1                  IF    ---   ---   ---   ---
Branch Target                    IF    ID    EX    MA    WB
Branch Target + 1                      IF    ID    EX    MA    WB
Basic Block Scheduling

For each basic block:
 Construct directed acyclic graph (DAG) using dependences between statements
  Node = statement / instruction
  Edge (a,b) = statement a must execute before b
 Schedule instructions using the DAG
Data Dependences


If two operations access the same register and one access is a write, they are dependent

Types of data dependences

RAW = Read after Write    WAW = Write after Write    WAR = Write after Read
r1 = r2 + r3              r1 = r2 + r3               r1 = r2 + r3
r4 = r1 * 6               r1 = r4 * 6                r2 = r5 * 6

Cannot reorder two dependent instructions
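
The three dependence kinds can be detected mechanically from the registers each instruction reads and writes. The following is a minimal sketch, not the course's implementation: the `Instr` record, `dependences`, and `build_dag` are hypothetical names, immediates such as `6` are left out of the source list, and every pair is compared conservatively (so some edges may be redundant).

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str      # register written
    srcs: tuple    # registers read (immediates omitted)

def dependences(first, second):
    """Return the dependence kinds that force `first` to stay before `second`."""
    kinds = []
    if first.dest in second.srcs:
        kinds.append("RAW")   # second reads what first wrote
    if first.dest == second.dest:
        kinds.append("WAW")   # both write the same register
    if second.dest in first.srcs:
        kinds.append("WAR")   # second overwrites what first read
    return kinds

def build_dag(block):
    """Compare every ordered pair; edge (i, j) means block[i] must precede block[j]."""
    edges = {}
    for i, first in enumerate(block):
        for j in range(i + 1, len(block)):
            kinds = dependences(first, block[j])
            if kinds:
                edges[(i, j)] = kinds
    return edges

base = Instr("r1", ("r2", "r3"))                  # r1 = r2 + r3
print(dependences(base, Instr("r4", ("r1",))))    # against r4 = r1 * 6
print(dependences(base, Instr("r1", ("r4",))))    # against r1 = r4 * 6
print(dependences(base, Instr("r2", ("r5",))))    # against r2 = r5 * 6
```

Each of the three calls reports exactly one kind, matching the RAW, WAW, and WAR columns above.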
Basic Block Scheduling Example
Original Schedule
 a) lw R2, (R1)
 b) lw R3, (R1) 4
 c) R4 ← R2 + R3
 d) R5 ← R2 - 1

Dependence DAG (edge labels are latencies)
 a → c (2)
 a → d (2)
 b → c (2)

Schedule 1 (5 cycles)
 a) lw R2, (R1)
 b) lw R3, (R1) 4
 --- nop ---
 c) R4 ← R2 + R3
 d) R5 ← R2 - 1

Schedule 2 (4 cycles)
 a) lw R2, (R1)
 b) lw R3, (R1) 4
 d) R5 ← R2 - 1
 c) R4 ← R2 + R3
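
The cycle counts of the two schedules can be checked with a small simulation. This is a sketch under stated assumptions: a single-issue, in-order machine, a 2-cycle load latency on each DAG edge, and a hypothetical helper `issue_cycles` that stalls an instruction until all of its predecessors' results are available.

```python
def issue_cycles(order, edges):
    """Issue one instruction per cycle in `order`, stalling on edge latencies.

    edges maps (pred, succ) -> cycles that must separate the two issue slots.
    Returns the cycle in which each instruction issues.
    """
    issue = {}
    cycle = 0
    for instr in order:
        cycle += 1  # next free issue slot
        for (pred, succ), latency in edges.items():
            if succ == instr:
                cycle = max(cycle, issue[pred] + latency)  # wait for operand
        issue[instr] = cycle
    return issue

# Assumed DAG for the example: both consumers of a load wait 2 cycles.
edges = {("a", "c"): 2, ("a", "d"): 2, ("b", "c"): 2}

schedule1 = issue_cycles("abcd", edges)
schedule2 = issue_cycles("abdc", edges)
print(max(schedule1.values()), max(schedule2.values()))  # 5 4
```

Moving d ahead of c fills the slot that would otherwise be a nop while c waits for the second load.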
Scheduling Algorithm




Construct dependence dag on basic block
Put roots in candidate set
Use scheduling heuristics (in order) to select instruction
While candidate set not empty
 Evaluate all candidates and select best one
 Delete scheduled instruction from candidate set
 Add newly-exposed candidates
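
The loop above can be sketched as a greedy list scheduler. This is one possible reading rather than a definitive implementation: it assumes a single-issue machine, and the `list_schedule` function, its dictionary arguments, and the priority values in the usage example are all hypothetical. Ties are broken by node name so the result is deterministic.

```python
def list_schedule(succs, latency, priority):
    """Greedy list scheduling for a single-issue machine (illustrative sketch).

    succs[n]    -> nodes that depend on n
    latency[n]  -> cycles before n's result can be used
    priority[n] -> larger means schedule sooner
    """
    unscheduled_preds = {n: 0 for n in succs}
    for n in succs:
        for m in succs[n]:
            unscheduled_preds[m] += 1
    earliest = {n: 1 for n in succs}  # first cycle each node may issue
    candidates = {n for n in succs if unscheduled_preds[n] == 0}  # DAG roots
    schedule, cycle = [], 1
    while candidates:
        ready = sorted(n for n in candidates if earliest[n] <= cycle)
        if ready:
            best = max(ready, key=lambda n: priority[n])
            schedule.append((cycle, best))
            candidates.remove(best)
            for m in succs[best]:
                earliest[m] = max(earliest[m], cycle + latency[best])
                unscheduled_preds[m] -= 1
                if unscheduled_preds[m] == 0:
                    candidates.add(m)  # newly exposed candidate
        cycle += 1  # if nothing was ready, this cycle is a stall
    return schedule

# Four-instruction block: two loads (a, b) feeding c, and a feeding d.
succs = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
latency = {"a": 2, "b": 2, "c": 1, "d": 1}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}  # assumed: latency-weighted path length
print(list_schedule(succs, latency, priority))
```

With these inputs the scheduler issues a, b, d, c in cycles 1 through 4, because c is not ready in cycle 3 and d is.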
Instruction Scheduling Heuristics


Optimal scheduling is NP-complete, so we need heuristics
Bias scheduler to prefer instructions that:
 Have the earliest execution time
 Have many successors
  More flexibility in scheduling
 Progress along the critical path
 Free registers
  Reduce register pressure
Can be a combination of heuristics
Computing Priorities
Height(n) =
 exec(n), if n is a leaf
 max(Height(m)) + exec(n), for m where m is a successor of n, otherwise

Critical path(s) = path through the dependence DAG with longest latency
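
The recurrence translates directly into a memoized function. The sketch below applies it to the nine-instruction example used in these slides (memory instructions 3 cycles, mult 2, everything else 1). The edge list `SUCCS` is an assumption: it contains only the read-after-write dependences read off the code, ignoring the reuse of r2.

```python
from functools import lru_cache

# exec(n): memory instrs = 3 cycles, mult = 2, rest = 1.
EXEC = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Assumed RAW-only dependence DAG for:
#   a lw r1,w   b add r1,r1,r1   c lw r2,x   d mult r1,r1,r2   e lw r2,y
#   f mult r1,r1,r2   g lw r2,z   h mult r1,r1,r2   i sw r1,a
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}

@lru_cache(maxsize=None)
def height(n):
    """Height(n) = exec(n) for a leaf, else max over successors plus exec(n)."""
    if not SUCCS[n]:
        return EXEC[n]
    return max(height(m) for m in SUCCS[n]) + EXEC[n]

def critical_path(start):
    """Follow the tallest successor from `start` down to a leaf."""
    path = [start]
    while SUCCS[path[-1]]:
        path.append(max(SUCCS[path[-1]], key=height))
    return path

heights = {n: height(n) for n in SUCCS}
root = max(heights, key=heights.get)
print(heights)
print(critical_path(root), heights[root])
```

Under these assumptions the tallest node is a with height 13, and the path printed from it is the longest-latency chain in the DAG.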
Example – Determine Height and CP
Code
 a  lw r1, w
 b  add r1,r1,r1
 c  lw r2,x
 d  mult r1,r1,r2
 e  lw r2,y
 f  mult r1,r1,r2
 g  lw r2,z
 h  mult r1,r1,r2
 i  sw r1, a

Assume:
 memory instrs = 3 cycles
 mult = 2 cycles (to have result in register)
 rest = 1 cycle

[Figure: dependence DAG over nodes a to i, each node labeled with its latency]

Critical path: _______
Example
Code
 a  lw r1, w
 b  add r1,r1,r1
 c  lw r2,x
 d  mult r1,r1,r2
 e  lw r2,y
 f  mult r1,r1,r2
 g  lw r2,z
 h  mult r1,r1,r2
 i  sw r1, a

[Figure: the dependence DAG annotated with heights: a = 13, b = 10, c = 12, d = 9, e = 10, f = 7, g = 8, h = 5, i = 3]

Schedule: ___ cycles
Global Scheduling: Superblock

Definition:
 single trace of contiguous, frequently executed blocks
 a single entry and multiple exits
Formation algorithm:
 pick a trace of frequently executed basic blocks
 eliminate side entrances (tail duplication)
Scheduling and optimization:
 speculate operations in the superblock
 apply optimization to scope defined by superblock
Superblock Formation
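[Figure, two panels. "Select a trace": control-flow graph over blocks A to F annotated with execution counts (A 100, B 90, C 10, D 0, E 90, F 100). "Tail duplicate": the same graph with F duplicated as F' (F 90, F' 10).]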
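
Trace selection and tail duplication can be sketched on a small control-flow graph. Everything here is illustrative: the edge set and edge weights are an assumed reading of the figure (A branches to B and C, B to D and E, and C, D, E all reach F), the CFG is assumed acyclic, and `select_trace` and `tail_duplicate` are hypothetical helper names.

```python
def select_trace(succs, weight, entry):
    """Follow the most frequently taken edge from `entry` until a leaf."""
    trace = [entry]
    while succs[trace[-1]]:
        last = trace[-1]
        nxt = max(succs[last], key=lambda s: weight[(last, s)])
        if nxt in trace:  # do not follow a back edge
            break
        trace.append(nxt)
    return trace

def tail_duplicate(succs, trace):
    """Remove side entrances by redirecting them to copies of the trace tail."""
    cfg = {b: list(s) for b, s in succs.items()}

    def side_preds(i):
        # Predecessors of trace[i] other than the previous trace block.
        return [p for p in cfg if trace[i] in cfg[p] and p != trace[i - 1]]

    first = next((i for i in range(1, len(trace)) if side_preds(i)), None)
    if first is None:
        return cfg  # already a superblock
    tail = trace[first:]
    copy = {b: b + "'" for b in tail}
    for i in range(first, len(trace)):
        for p in side_preds(i):
            cfg[p] = [copy[trace[i]] if s == trace[i] else s for s in cfg[p]]
    for b in tail:
        cfg[copy[b]] = [copy.get(s, s) for s in succs[b]]  # copies chain together
    return cfg

succs = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
         "D": ["F"], "E": ["F"], "F": []}
weight = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 0, ("B", "E"): 90,
          ("C", "F"): 10, ("D", "F"): 0, ("E", "F"): 90}

trace = select_trace(succs, weight, "A")
superblock_cfg = tail_duplicate(succs, trace)
print(trace)
print(superblock_cfg)
```

On this graph the selected trace is A, B, E, F, and the side entrances from C and D are redirected to the copy F', leaving the trace with a single entry at A.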
Optimizations within Superblock

By limiting the scope of optimization to superblock:
 optimize for the frequent path
 may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion,...)

For example: CSE
[Figure: three panels showing the blocks r1 = r2*3, r2 = r2+1, and r3 = r2*3 in the original flow graph, after trace selection, and after tail duplication.]
CSE within superblock (no merge since single entry): on the trace, r3 = r2*3 becomes r3 = r1; the duplicated tail that follows r2 = r2+1 keeps r3 = r2*3.
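
The effect of the single entry on CSE can be illustrated with a small local CSE pass over straight-line code. This is a sketch, not the optimization as any particular compiler implements it: `local_cse` and the tuple encoding of instructions are hypothetical, and the pass only tracks syntactically identical expressions.

```python
def local_cse(block):
    """Replace recomputed expressions with copies, within one straight-line block.

    block is a list of (dest, expr) pairs, where expr = (op, operand, operand).
    """
    available = {}  # expression -> register currently holding its value
    out = []
    for dest, expr in block:
        if expr in available:
            out.append((dest, ("copy", available[expr])))
        else:
            out.append((dest, expr))
        # Writing dest invalidates expressions that read dest or live in dest.
        available = {e: r for e, r in available.items()
                     if dest not in e[1:] and r != dest}
        if dest not in expr[1:]:
            available.setdefault(expr, dest)
    return out

MUL = ("mul", "r2", 3)

# Frequent path inside the superblock: no side entrance between the two uses.
trace = [("r1", MUL), ("r3", MUL)]
# Infrequent path: r2 is incremented in between, so r2*3 must be recomputed.
side = [("r1", MUL), ("r2", ("add", "r2", 1)), ("r3", MUL)]

print(local_cse(trace))
print(local_cse(side))
```

On the trace the second multiply turns into a copy of r1, while on the side path the code is left unchanged because the write to r2 kills the available expression.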
Scheduling Algorithm Complexity

Time complexity: O(n^2)
 n = max number of instructions in basic block
Building dependence dag: worst-case O(n^2)
 Each instruction must be compared to every other instruction
Scheduling then requires each instruction be inspected at each step = O(n^2)
Average-case: small constant (e.g., 3)
Very Long Instruction Word (VLIW)





Compiler determines exactly what is issued every cycle (before the program is run)
Schedules also account for latencies
All hardware changes result in a compiler change
Usually embedded systems (hence simple HW)
Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)
Sample VLIW code
VLIW processor: 5 issue
 2 Add/Sub units (1 cycle)
 1 Mul/Div unit (2 cycle, unpipelined)
 1 LD/ST unit (2 cycle, pipelined)
 1 Branch unit (no delay slots)

Add/Sub   Add/Sub   Mul/Div   Ld/St        Branch
c=a+b     d=a-b     e=a*b     ld j = [x]   nop
g=c+d     h=c-d     nop       ld k = [y]   nop
nop       nop       i=j*c     ld f = [z]   br g
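
Because a VLIW compiler is responsible for every latency and resource constraint, a schedule like the one above can be checked statically. The following sketch is only illustrative: the `violations` function and the tuple encoding are hypothetical, the 1-cycle branch latency is an assumption, and each value is assumed to be defined at most once.

```python
from collections import Counter

UNITS = {"add": 2, "mul": 1, "ld": 1, "br": 1}    # functional units per kind
LATENCY = {"add": 1, "mul": 2, "ld": 2, "br": 1}  # cycles until result is usable

def violations(schedule):
    """Check a list of (cycle, unit, dest, sources) against latencies and units."""
    problems = []
    ready = {dest: cycle + LATENCY[unit]
             for cycle, unit, dest, _ in schedule if dest is not None}
    for cycle, unit, dest, srcs in schedule:
        for s in srcs:
            if ready.get(s, 0) > cycle:  # values never defined here are inputs
                problems.append(f"cycle {cycle}: {s} not ready until cycle {ready[s]}")
    per_cycle = Counter((cycle, unit) for cycle, unit, _, _ in schedule)
    for (cycle, unit), used in per_cycle.items():
        if used > UNITS[unit]:
            problems.append(f"cycle {cycle}: {used} {unit} ops, only {UNITS[unit]} units")
    # The multiplier is unpipelined: it stays busy for its full latency.
    mul_cycles = sorted(cycle for cycle, unit, _, _ in schedule if unit == "mul")
    for earlier, later in zip(mul_cycles, mul_cycles[1:]):
        if later - earlier < LATENCY["mul"]:
            problems.append(f"cycle {later}: multiplier still busy")
    return problems

sample = [
    (1, "add", "c", ("a", "b")), (1, "add", "d", ("a", "b")),
    (1, "mul", "e", ("a", "b")), (1, "ld", "j", ("x",)),
    (2, "add", "g", ("c", "d")), (2, "add", "h", ("c", "d")),
    (2, "ld", "k", ("y",)),
    (3, "mul", "i", ("j", "c")), (3, "ld", "f", ("z",)), (3, "br", None, ("g",)),
]
print(violations(sample))  # []

# Hypothetical bad variant: issue i = j*c one cycle too early.
too_early = [op if op[2] != "i" else (2,) + op[1:] for op in sample]
print(violations(too_early))
```

The sample schedule passes, which is consistent with the nop in the Mul/Div column at cycle 2: the multiplier is still busy with e, and j has not yet arrived from memory.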
Next Time

Phase-ordering