Towards a More Principled Compiler:
Progressive Backend Compiler Optimization
David Koes
8/28/2006
School of Computer Science
Performance Gains Due to Compiler (gcc)
[Figure: SPEC2000 performance improvement (0% to 50%) by date, Nov-95 to Nov-09; 2.8GHz Pentium 4, 1GB RAM, -O3 …]
The Future of Compiler Optimization
[Figure: SPEC2000 performance improvement (0% to 50%) by date, Nov-95 to Nov-09]
is this possible?
How do we exploit the existing optimization potential?
Yes!
Need a more principled compiler
10-30% improvement just from reordering compiler phases
http://www.cs.rice.edu/~keith/Adapt/
Compiler code size improvement
[Figure: code size improvement (0% to 25%) by date, Nov-95 to Nov-05]
A Principled Compiler
A compiler that
– knows right from wrong
(less optimal from more optimal)
– follows a rigorous procedure to get the desired output
Today’s Compiler
[Diagram: machine description and a sequence of separate phases producing an optimized program]
target independent: copy prop, GVN, DCE, strength reduction, SCCP, inlining, code motion, PRE, const prop, loop unroll, …
target dependent: insn sched, reg alloc, insn select, branch opt, peephole, …
Problems
– some phases not internally optimal
• purely heuristic solution
– machine description mostly ignored
– lack of integration between phases
Ideal Compiler
[Diagram: machine description and a collection of phases producing an optimized program]
Phases: copy prop, const prop, strength reduction, reg alloc, loop unroll, inline, DCE, code motion, PRE, GVN, CSE, peephole, insn select, branch opt, SCCP, …
– each phase locally optimal
– makes full use of machine description
– tight integration between phases
Absolutely no idea how to do this or if it’s even possible
Towards a More Principled Compiler
[Diagram: machine description and a collection of phases producing an optimized program]
Phases: copy prop, const prop, strength reduction, reg alloc, loop unroll, inline, DCE, code motion, PRE, GVN, CSE, peephole, insn select, branch opt, SCCP, …
– each phase locally optimal
– makes full use of machine description
– tight integration between phases
Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline
Related Work
Register Allocation Problem
[Diagram: a register allocator maps an unbounded number of program variables onto a limited number of processor registers + slow memory]
Example program:
…
v = 1
w = v + 3
x = w + v
u = v
t = u + x
print(x);
print(w);
print(t);
print(u);
…
Registers: eax, ebx, ecx, edx, esi, edi, esp, ebp
Issues: spill code optimization, rematerialization, memory operands, register preferences, live range splitting
Related Work
Register Allocation Previous Work
Methods compared on three criteria (Expressive, Fast, Optimal):
– Linear Scan
– Graph Coloring
– Integer Linear Programming
– Partitioned Boolean Quadratic Programming
Related Work
Instruction Selection Problem
[Diagram: IR → instruction selector → Assem]
IR Representation: minimum cost tiling?
movl (p),t1
leal (x,t1),t2
leal 1(y),t3
leal (t2,t3),r
Related Work
Instruction Selection Previous Work
Methods compared on four criteria (DAG Tiling, Register Allocation Aware, Fast, Optimal):
– Dynamic Programming
– Binate Covering
– Peephole Based Instruction Selection
– AVIV Code Generator
– Exhaustive Search
Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline
Completed Work
A More Principled Register Allocator
[Diagram: reg alloc phase and the machine description]
– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture
– optimal solutions
Completed Work
Multi-commodity Network Flow: An Expressive Model
Given network (directed graph) with
– cost and capacity on each edge
– sources & sinks for multiple commodities
Find lowest cost flow of commodities
NP-complete for integer flows
Example: [Diagram: network with commodities a and b; edges have unit capacity; edge costs 1 and 0]
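To make the definition concrete, here is a minimal brute-force sketch of an integer multi-commodity network flow on a toy network. The node names, costs, and helper functions are assumptions made for illustration and are not taken from the allocator. Each commodity sends one unit along a single path, and every edge has unit capacity.

```python
from itertools import product

# Hypothetical toy network: edge -> (cost, capacity). Names are illustrative only.
EDGES = {
    ("s", "u"): (0, 1),
    ("s", "v"): (1, 1),
    ("u", "t"): (0, 1),
    ("v", "t"): (0, 1),
}

def simple_paths(src, dst, seen=()):
    """Yield every simple path from src to dst as a list of edges."""
    if src == dst:
        yield []
        return
    for (a, b) in EDGES:
        if a == src and b not in seen:
            for rest in simple_paths(b, dst, seen + (src,)):
                yield [(a, b)] + rest

def solve_mcnf(commodities):
    """commodities: dict name -> (source, sink). Returns (cost, routing) or None."""
    names = list(commodities)
    choices = [list(simple_paths(*commodities[n])) for n in names]
    best = None
    for combo in product(*choices):
        # Count how many commodities use each edge.
        load = {}
        for path in combo:
            for e in path:
                load[e] = load.get(e, 0) + 1
        if any(load[e] > EDGES[e][1] for e in load):
            continue  # a shared edge capacity is violated
        cost = sum(EDGES[e][0] for path in combo for e in path)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, combo)))
    return best

cost, routing = solve_mcnf({"a": ("s", "t"), "b": ("s", "t")})
print(cost, routing)
```

Both commodities would prefer the zero-cost path, but the unit capacities force one of them onto the more expensive path. The enumeration is exponential in the number of commodities, which is why it serves only as an illustration.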
Completed Work
Register Allocation as a MCNF
Variables → Commodities
Variable Definition → Source
Variable Last Use → Sink
Nodes → Allocation Classes (Reg/Mem/Const)
Register Limits → Node Capacities
Spill Costs → Edge Costs
Allocation → Flow
[Diagram: network for variable a over nodes r0, r1, and mem, with example edge costs 1 and 3]
Completed Work
Example
Source Code
int example(int a, int b)
{
  int d = 1;
  int c = a - b;
  return c+d;
}
Pre-alloc Assembly
MOVE 1 -> d
SUB a,b -> c
ADD c,d -> c
MOVE c -> r0
[Diagram: MCNF network for this example; edge labels: load cost, insn pref cost, mem access cost]
Completed Work
Control Flow
MCNF can only represent straight-line code
– need to link together networks from basic blocks
Extend MCNF model with merge and split nodes to implement boundary constraints.
[Diagram: variable a at block boundaries, allocated to %eax or mem on each side]
details in proposal document… along with modeling persistence of values in memory
Completed Work
A Better Register Allocator
[Diagram: reg alloc phase and the machine description]
– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture: Global MCNF
– locally optimal
• NP-hard, so use progressive solution technique
Completed Work
Progressive Solution Technique
Quickly find a good allocation
Then progressively find better allocations
– until optimal allocation found
– or time limit is reached
Technique: Lagrangian relaxation directed allocators
[Figure: Allocation Quality vs. Compile Time]
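The progressive idea above can be sketched as an anytime loop: start from a fast heuristic answer, keep the best allocation seen so far, and stop when it is proven optimal or the time limit is reached. The function name and the `improve` and `lower_bound` callables are hypothetical stand-ins, not the allocator's actual interfaces.

```python
import time

def progressive_solve(initial, improve, lower_bound, time_limit_s):
    """Anytime search: return the best (cost, solution) found and the best bound.

    initial      -- (cost, solution) produced by a fast heuristic
    improve      -- callable(iteration) -> (cost, solution) candidate
    lower_bound  -- callable(iteration) -> proven lower bound on the optimal cost
    """
    best = initial
    bound = float("-inf")
    deadline = time.monotonic() + time_limit_s
    iteration = 0
    while time.monotonic() < deadline:
        bound = max(bound, lower_bound(iteration))
        candidate = improve(iteration)
        if candidate[0] < best[0]:
            best = candidate  # keep only improvements
        if best[0] <= bound:
            break  # the bound is met, so the best solution is provably optimal
        iteration += 1
    return best, bound

# Toy usage: candidates improve by one per iteration, and the bound tightens to 7.
best, bound = progressive_solve(
    (12, "heuristic"),
    lambda i: (max(7, 10 - i), f"iter{i}"),
    lambda i: min(7, 5 + i),
    5.0,
)
print(best, bound)
```

With a time limit of zero the loop body never runs and the heuristic answer is returned unchanged, which matches the "quickly find a good allocation" end of the tradeoff.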
Completed Work
Lagrangian Relaxation: Intuition
Relaxes the hard constraints
– only have to solve single commodity flow
Combines easy subproblems using a Lagrangian multiplier (price)
– an additional price on each edge
– a price on each split/merge node
Example: [Diagram: network with commodities a and b; edges have unit capacity; edge costs 1 and 0, then 1 and 0+1 once a price is added]
with price, solution to single commodity flow can be solution to multicommodity flow
Completed Work
Solution Procedure
Compute prices with iterative subgradient optimization
– guaranteed to converge to optimal prices
– optimal for linear relaxation
At each iteration, construct a feasible integer solution using current prices
– iterative allocator (in document)
– simultaneous allocator
– trace-based simultaneous allocator
[Diagram: example network for commodities a and b with priced edge costs 1 and 0+1]
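A minimal sketch of this procedure on a toy instance: two commodities compete for two unit-capacity edges, prices are updated by subgradient steps, and a greedy heuristic builds a feasible integer solution from the current prices at every iteration. The instance, the edge names `reg` and `mem`, the step size 1/(k+1), and the greedy heuristic are all assumptions for illustration; the sketch prices edges only and ignores split/merge nodes.

```python
# Toy instance (assumed): two commodities, two unit-capacity edges.
COST = {"reg": 0.0, "mem": 1.0}
CAPACITY = {"reg": 1, "mem": 1}
COMMODITIES = ["a", "b"]

def cheapest(prices, allowed):
    """Edge with the lowest cost-plus-price among the allowed edges."""
    return min(allowed, key=lambda e: COST[e] + prices[e])

def feasible_from_prices(prices):
    """Greedy heuristic: place commodities one by one, respecting capacities."""
    remaining = dict(CAPACITY)
    assignment = {}
    for k in COMMODITIES:
        e = cheapest(prices, [x for x in COST if remaining[x] > 0])
        assignment[k] = e
        remaining[e] -= 1
    return sum(COST[e] for e in assignment.values()), assignment

def subgradient(iterations):
    prices = {e: 0.0 for e in COST}
    best_bound, best = float("-inf"), None
    for it in range(iterations):
        # Relaxed problem: each commodity solves its own single-commodity flow.
        picks = {k: cheapest(prices, list(COST)) for k in COMMODITIES}
        dual = (sum(COST[e] + prices[e] for e in picks.values())
                - sum(prices[e] * CAPACITY[e] for e in COST))
        best_bound = max(best_bound, dual)  # every dual value is a lower bound
        candidate = feasible_from_prices(prices)
        if best is None or candidate[0] < best[0]:
            best = candidate
        # Raise prices on over-used edges and lower them on under-used ones.
        step = 1.0 / (it + 1)
        for e in COST:
            usage = sum(1 for p in picks.values() if p == e)
            prices[e] = max(0.0, prices[e] + step * (usage - CAPACITY[e]))
    return best_bound, best

bound, (cost, assignment) = subgradient(10)
print(bound, cost, assignment)
```

In this toy case the lower bound and the feasible cost meet, which proves the feasible solution optimal. In general the relaxation is only optimal for the linear problem, so a gap can remain.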
Completed Work
Simultaneous Allocator
[Diagram: simultaneous allocator walkthrough on the example network; edges to/from memory cost 3; current cost values shown: -2, -3, -1]
Completed Work
Trace-Based Allocation
Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace presents difficulty
addressed in proposal document
Completed Work
Evaluation
Implemented in gcc 3.4.4 targeting x86
Optimize for code size
– perfect static evaluation
– important metric in its own right
MediaBench, MiBench, Spec95, Spec2000
– over 10,000 functions
Completed Work
Progressiveness
[Figures: progressiveness results for squareEncrypt and quicksort]
Completed Work
Code Size
[Figure: average code size improvement over graph allocator (0% to 8%) for initial heuristics only, 10 iterations, 100 iterations, 1000 iterations, and the default allocator]
Completed Work
Optimality
[Figure: percent of functions (0% to 100%) by proven optimality (optimal, <=1%, <=5%, <=10%, <=25%, >25% from optimal) after 1, 10, 100, and 1000 iterations]
Completed Work
Compile Time Slowdown :-(
[Figure: factor slower than graph allocator (0 to 20) for the default, graph, and progressive allocators; components: Initialization, Initial Iterative Allocation, Initial Simultaneous Allocation, One Iteration; annotation: 9.2x slower]
Completed Work
A Better Register Allocator
[Diagram: reg alloc phase and the machine description]
– fully utilize machine description
• explicit and expressive model of costs of allocation for given architecture: Global MCNF
– locally optimal
• approach optimality using progressive solution technique: Lagrangian directed allocators
Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline
Proposed Work
A Better Better Register Allocator
Solver Improvements
– Improve initial solution
– Improve quality as prices converge
– Hope to prove approximation bounds
Model Improvements
– Improve accuracy of model
– Model simplification
– Represent uniform register sets efficiently
Proposed Work
Model Simplification
Summarize overly expressive sections of the model
Conservative simplification
does not change optimal value
Aggressive simplification
explore tradeoff between model
complexity and optimality
Proposed Work
Instruction Selection Interaction
which instruction is best
depends on the register
allocator
perform same
operation
so let register allocator decide
Proposed Work
Register Allocation Aware
Instruction SElection (RA2ISE)
Instruction selection not finalized
until register allocation
IR tiled with Register Allocation
Aware Tiles (RAATs)
A RAAT represents several
instruction sequences
– different costs
– a sequence for every possible
register allocation
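A minimal sketch of a RAAT as a lookup table, assuming a hypothetical tile for "sign-extend a 16-bit value to 32 bits" on x86. The table layout and function names are invented for illustration; the byte sizes are those of the standard encodings (`cwtl` is one byte and works only on %eax, while `movswl` between registers takes three bytes).

```python
# Hypothetical RAAT: register holding the value -> (instruction sequence, size in bytes).
SIGN_EXTEND_RAAT = {
    "eax": (["cwtl"], 1),             # short form, available only for %ax/%eax
    "ebx": (["movswl %bx,%ebx"], 3),
    "ecx": (["movswl %cx,%ecx"], 3),
    "edx": (["movswl %dx,%edx"], 3),
}

def select(raat, register):
    """Instruction sequence and cost once the register allocator has decided."""
    return raat[register]

def cheapest_allocation(raat):
    """Register the tile would prefer if nothing else competed for it."""
    return min(raat, key=lambda r: raat[r][1])

print(select(SIGN_EXTEND_RAAT, "edx"))
print(cheapest_allocation(SIGN_EXTEND_RAAT))
```

The tile itself never commits to an instruction. It only exposes the cost of each allocation, so the final sequence is chosen after register allocation.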
Proposed Work
RA2ISE
[Diagram: IR → RAAT tiling → model creation → register allocation → final instructions, e.g. cwtl %eax]
Proposed Work
Implementing RA2ISE
Add side-constraints to Global MCNF model
– implement inter-variable preferences and constraints
• “if x allocated to r1 and y allocated to r2, then save three bytes”
• “x and y must be allocated to the same register”
Implement x86 RAATs
– RAAT tables created manually
– GMCNF RAAT representation automatically generated
from RAAT table with minimum use of side constraints
Algorithms for tiling RAATs
– leverage existing algorithms
– exploit feedback between passes
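The two quoted side-constraints (a conditional saving and a same-register requirement) can be sketched as a check applied to a candidate allocation. This is only an evaluation helper under assumed data structures; it does not show how the constraints are encoded in the Global MCNF model.

```python
def allocation_cost(base_cost, allocation, savings, same_register):
    """Return the adjusted cost, or None if a hard constraint is violated.

    allocation     -- dict variable -> register name
    savings        -- list of ({variable: register, ...}, bytes_saved) preferences
    same_register  -- list of (x, y) pairs that must share a register
    """
    for x, y in same_register:
        if allocation[x] != allocation[y]:
            return None  # hard constraint broken: allocation is infeasible
    saved = sum(
        bytes_saved
        for condition, bytes_saved in savings
        if all(allocation.get(v) == r for v, r in condition.items())
    )
    return base_cost - saved

# "if x allocated to r1 and y allocated to r2, then save three bytes"
savings = [({"x": "r1", "y": "r2"}, 3)]
print(allocation_cost(10, {"x": "r1", "y": "r2"}, savings, []))
```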
Proposed Work
Tiling RAATs
[Diagram: alternative tilings of an IR with RAATs; tiles are annotated with costs for allocations such as eax, edx, and mem]
Proposed Work
Evaluation
Implement in production quality compiler (gcc)
Evaluate code size and simple code speed metric
Evaluate on three different architectures
– x86 (8 registers)
– 68k/ColdFire (16 registers)
– PPC (32 registers)
Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline
Contributions
RA2ISE
– register allocation aware tiles (RAATs) explicitly encode
effect of register allocation on instruction sequence
– algorithms for tiling RAATs
– expressive model of register allocation that operates
on RAATs and explicitly represents all important
components of register allocation
– progressive solver for this model that can quickly find
decent solution and approaches optimality as more time
is allowed for compilation
Comprehensive evaluation of RA2ISE
Thesis Statement
RA2ISE is a principled and effective system
for performing instruction selection and
register allocation.
One Step Towards a More Principled Compiler
[Diagram: machine description and a collection of phases producing an optimized program, with reg alloc and insn select drawn together]
Phases: copy prop, loop unroll, const prop, strength reduction, inline, peephole, reg alloc + insn select, DCE, code motion, PRE, GVN, CSE, branch opt, SCCP, …
Timeline
Fall 2006
– add simple speed metric option to model
– begin model simplification work
– improve model accuracy and solver performance
Winter 2006
– finish model simplification work
– add side-constraints to model
– implement existing gcc tiles as RAATs
– improve model accuracy and solver performance
Spring 2007
– finish implementation of side-constraints and gcc RAATs
– begin work on RA2ISE infrastructure
– create gcc-independent set of RAATs for x86
– improve model accuracy and solver performance
Summer 2007
– finish work on RA2ISE
– investigate and develop tiling algorithms
– improve model accuracy and solver performance
Fall 2007
– add 68k/ColdFire and PowerPC targets
– investigate uniform register set simplifications
– improve model accuracy and solver performance
Winter 2007
– begin writing thesis
– work on improving compile time performance
Spring 2008
– finish writing thesis
Andrew Richard Koes
Questions?
Processor Performance w/o Compiler Improvements
[Figure: SPEC2000 performance improvement (0% to 7000%) by date, Nov-95 to May-06; labels: Performance, Double every 18 months, Double every 24 months]
Instruction Selection & Register Allocation
[Diagram: reg alloc and insn select phases and the machine description]
– fully utilize machine description
– locally optimal
– tight integration between phases
Costs of Register Allocation
Spilling to/from memory
movl 8(%ebp), %edx
Direct memory access
addl 8(%ebp), %eax
Moving between registers
movl %edx,%ecx
Rematerialization of constant value
movl $3,%eax
Register usage preferences
imul %edx,%eax
vs.
imul %edx,%ecx
Iterative Heuristic Allocator
Allocate each variable in a heuristic priority order
Find shortest path in each block
– avoid edges that make remaining problem infeasible
Process blocks in topological order
– allocation at block entry fixed by previous blocks
Intuition:
– shortest path is minimum cost allocation for a variable
– allocate most significant variables first
Limitation:
– greedy: can’t undo poor decisions
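The allocator described above can be sketched for a single basic block as follows. Each variable, taken in priority order, gets its shortest path through a layered network (one layer per program point, one node per allocation class), and the registers on that path are then consumed. The memory edge cost of 3 follows the talk's example; `REG_MOVE_COST`, `MEM_USE_COST`, the two-register machine, and all function names are assumptions. Because memory is uncapacitated here, the feasibility check on edges is not needed and is omitted.

```python
REGS = ["r0", "r1"]
CLASSES = REGS + ["mem"]
MEM_EDGE_COST = 3   # edges to/from memory, as in the talk's example
REG_MOVE_COST = 1   # assumed cost of a register-to-register move
MEM_USE_COST = 1    # assumed cost of using a variable directly from memory

def edge_cost(src, dst):
    if src == dst:
        return 0
    return MEM_EDGE_COST if "mem" in (src, dst) else REG_MOVE_COST

def shortest_path(live, uses, free):
    """Dynamic program over points start..end; returns (cost, [class per point])."""
    start, end = live
    def node_cost(c, p):
        return MEM_USE_COST if c == "mem" and p in uses else 0
    def usable(c, p):
        return c == "mem" or free[p][c]
    best = {c: (node_cost(c, start), [c]) for c in CLASSES if usable(c, start)}
    for p in range(start + 1, end + 1):
        nxt = {}
        for c in CLASSES:
            if not usable(c, p):
                continue
            cost, path = min(
                ((pc + edge_cost(prev, c), ppath) for prev, (pc, ppath) in best.items()),
                key=lambda t: t[0],
            )
            nxt[c] = (cost + node_cost(c, p), path + [c])
        best = nxt
    return min(best.values(), key=lambda t: t[0])

def iterative_allocate(variables, n_points):
    """variables: list of (name, (start, end), uses), already in priority order."""
    free = [{r: True for r in REGS} for _ in range(n_points)]
    result = {}
    for name, live, uses in variables:
        cost, path = shortest_path(live, uses, free)
        for p, c in zip(range(live[0], live[1] + 1), path):
            if c != "mem":
                free[p][c] = False  # greedy: this register is never given back
        result[name] = (cost, path)
    return result

alloc = iterative_allocate(
    [("a", (0, 2), {0, 1, 2}), ("b", (0, 2), {0, 2}), ("c", (0, 2), {1})], 3
)
print(alloc)
```

In the usage example the first two variables take the two registers, so the third is left in memory for its whole live range. That is the greedy limitation: the decision is never revisited even if a different order would be cheaper.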
Iterative Heuristic Allocator
[Diagram: walkthrough on the example network; edges to/from memory cost 3; variables a, b, c, d; costs: 0, 4, 0, -2; total: 2]
Simultaneous Allocator
Scan each block
– maintain an allocation of all live variables
– at variable definition find cheapest allocation
• allocation with shortest path to variable’s sink or block exit
• allowed to evict (reallocate) already allocated variable
– eviction cost: shortest path to edge from current allocation to new allocation in this block
– cost of eviction added to shortest path cost
Intuition:
– minimizing cost for all variables at once
Limitation:
– path computations limited to single block
– future blocks do not change previous block allocations
Trace-Based Allocation
Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace
• update only blocks that are necessary (easy-update)
• update all affected blocks (full-update)
[Diagrams: easy-update and full-update]
Accuracy of the Model
Global MCNF model correctly predicts costs of register allocation within 2% for 72.5% of functions compiled
[Figure: percent of functions (0% to 60%) vs. percent difference between predicted and actual size, ranging from >10% larger than predicted (under-predicted) to >10% smaller than predicted (over-predicted)]
Compile Time Slowdown :-(
[Figure: factor slower than graph allocation (0 to 50) per benchmark; components: Initialization, Initial Iterative Allocation, Initial Simultaneous Allocation, One Iteration; annotation: 10x slower]
Benchmarks: geo. mean, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 164.gzip, 168.wupwise, 171.swim, 173.applu, 175.vpr, 176.gcc, 181.mcf, 183.equake, 188.ammp, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf, 301.apsi, CRC32, adpcm_d, adpcm_e, basicmath, bitcount, blowfish_d, blowfish_e, dijkstra, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, ispell, jpeg_d, jpeg_e, lame, mesa, mpeg2_d, mpeg2_e, patricia, pegwit_d, pegwit_e, pgp_d, pgp_e, qsort, rasta, sha, stringsearch, susan
Code size improvement over graph allocator
[Figure: code size improvement (0% to 8%) for initial heuristics only, 10 iterations, 100 iterations, 1000 iterations, and the default allocator; series: without traces, with easy traces, with full traces]
Code size improvement over graph allocator
[Figure: code size improvement (-5% to 20%) per benchmark; series: initial heuristics only, 10 iterations, 100 iterations, 1000 iterations]
Benchmarks: average, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 164.gzip, 168.wupwise, 171.swim, 173.applu, 175.vpr, 176.gcc, 181.mcf, 183.equake, 188.ammp, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf, 301.apsi, CRC32, adpcm, basicmath, bitcount, blowfish, dijkstra, epic, g721, gsm, ispell, jpeg, lame, mesa, mpeg2, patricia, pegwit, pgp, qsort, rasta, sha, stringsearch, susan
Code size improvement over default allocator
[Figure: code size improvement (-6% to 12%) per benchmark; series: initial heuristic only, 10 iterations, 100 iterations, 1000 iterations]
Benchmarks: average, 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, 164.gzip, 168.wupwise, 171.swim, 173.applu, 175.vpr, 176.gcc, 181.mcf, 183.equake, 188.ammp, 197.parser, 254.gap, 255.vortex, 256.bzip2, 300.twolf, 301.apsi, CRC32, adpcm, basicmath, bitcount, blowfish, dijkstra, epic, g721, gsm, ispell, jpeg, lame, mesa, mpeg2, patricia, pegwit, pgp, qsort, rasta, sha, stringsearch, susan
Code Performance
Integrating Register Allocation and Instruction Selection
int foo(int a, short b) { return a*4+b; }
First sequence (size in bytes, then instruction):
4  movl 4(%esp),%eax
3  sall $2,%eax
4  addl 8(%esp),%eax
1  cwtl
1  ret
Second sequence (size in bytes, then instruction):
5  movswl 8(%esp),%edx
4  movl 4(%esp),%eax
3  leal (%edx,%eax,4),%eax
1  ret
Another RAAT