Analyses and Optimizations

Outline
• Introduction to SSA
• Construction, Destruction
• Optimizations
• Classic analyses and optimizations on SSA representations
• Heap analyses and optimizations

Analyses and Optimizations
• Analyses in software technology support programmers
• Analyses in compiler construction allow to safely perform optimizations
• Cost model: runtime of a program
  • Statically, only conservative approximations are possible:
    • Loop iterations
    • Conditional code
  • Even for linear code the runtime is not known in advance:
    • Instruction scheduling
    • Cache access is data dependent
    • Instruction pipelining: execution time is not the sum of the individual operation costs
• Alternative cost models:
  • Memory size, power consumption
  • Same undecidability problem as for execution time
• Caution: cost of a program ≠ sum of the costs of its elements
Optimization: Implementation
• Legal transformations in SSA graphs:
  • Simplifying transformations reduce the costs of a program
  • Preparative transformations allow the application of simplifying transformations
• Using
  • Algebraic identities (e.g. associative/distributive law for certain operations)
  • Moving of operations
  • Reduction of dependencies
• Optimization is a sequence of goal-directed, legal simplifying and legal preparative transformations
• Legality proven
  • Locally by checking preconditions
  • Due to static data-flow analyses

Graph Rewrite Schema
[Figure: an SSA subgraph before the transformation is replaced by an SSA subgraph after the transformation; the precondition is established by a local check and by preparative analyses.]

Algebraic Identity: Elimination of Operations and its Inverse
[Figure: the chain x → τ → τ⁻¹ → y is replaced by using x directly as y. Precondition: no side effects in τ, τ⁻¹.]

Elimination of Memory Operation and its Inverse
[Figure: a Store of value v to address a followed by a Load from a; the Load is replaced by the stored value v. Preconditions: no side effects in between, [a] ≠ void.]
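To make the two rewrite rules above concrete, here is a minimal source-level sketch in C; the variables x, c, p, v and the assumption that nothing else writes to *p between the two statements are illustrative and not taken from the slides.

  int inverse_elimination(int x, int c) {
      int t = x + c;
      int y = t - c;      /* τ followed by τ⁻¹: y == x (assuming no overflow)   */
      return y;           /* can be rewritten to: return x;                     */
  }

  int load_after_store(int *p, int v) {
      *p = v;             /* Store v at address p                               */
      int z = *p;         /* Load from the same address, no intervening write,  */
      return z;           /* so it can be rewritten to: return v;               */
  }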
Elimination of Duplicated Memory Operations
[Figure: two Loads from the same address a are merged into a single Load. Precondition: [a] ≠ void.]

Elimination of Non-essential Dependencies
[Figure: the memory operations Op1 (uses u1, defines d1) and Op2 (uses u2, defines d2) are chained sequentially; if u2 ∩ d1 = ∅, d1 ∩ d2 = ∅ and u1 ∩ d2 = ∅, the sequential dependency is removed and both results are joined by a Sync node.]
• Op1, Op2 memory operations (store, call)
• u1, u2, d1, d2 designate may-Use/Define sets
• Computed in P2A
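A small C sketch of both situations; the names p, q and the assumption that the two pointers never alias are my own illustration.

  int duplicated_loads(int *p) {
      int b = *p;         /* first Load of [p]                                   */
      int c = *p;         /* duplicated Load: no write to *p in between,         */
      return b + c;       /* so it can reuse b: return b + b;                    */
  }

  void nonessential_dependency(int *p, int *q, int v, int w) {
      /* If the analysis shows that p and q never refer to the same cell
         (use/def sets disjoint), the two stores need not be ordered and
         only a Sync node joins their memory effects.                          */
      *p = v;
      *q = w;
  }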
Algebraic Identity: Invariant Compares
[Figure: a compare cmp of τ(x, z) with τ(y, z) is rewritten as a compare of x with y. Precondition: τ is monotone.]

Associative Law
[Figure: τ(τ(x, y), z) is regrouped as τ(x, τ(y, z)). Precondition: the operation τ is associative.]
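A minimal C illustration of both identities; the concrete operator + and the no-overflow assumption are mine, not from the slides.

  int invariant_compare(int x, int y, int z) {
      return (x + z) < (y + z);   /* + is monotone (no overflow assumed),     */
                                  /* so this can be rewritten to: x < y       */
  }

  int associative(int a, int b, int c) {
      return (a + b) + c;         /* associativity allows regrouping to       */
                                  /* a + (b + c), e.g. to expose a constant   */
                                  /* subterm b + c                            */
  }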
Distributive Law
[Figure: (x ⊗ y) ⊕ (x ⊗ z) is rewritten as x ⊗ (y ⊕ z). Precondition: ⊗ and ⊕ are distributive.]

Operator Simplification
[Figure: a mult of x with a constant y = 2^k is replaced by a shift of x by the constant k.]
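The same two rules at C level; the constants are arbitrary examples.

  int distributive(int x, int y, int z) {
      return x * y + x * z;       /* distributivity: rewrite to x * (y + z),  */
                                  /* saving one multiplication                */
  }

  int simplify_mult(int x) {
      return x * 8;               /* 8 == 2^3, so the mult becomes x << 3     */
  }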
Constant Folding
[Figure: τ applied to two constant operands x and y is replaced by the constant result.]
• Evaluation using the source algebra or the target algebra (if allowed by the source language)

Constant Folding over φ-Functions
[Figure: τ applied to φ(Const x, Const y) and Const z is replaced by φ(Const τ(x, z), Const τ(y, z)).]

General: Moving Arithmetic Operations over φ-Functions
[Figure: τ(φ(x, y), z) is replaced by φ(τ(x, z), τ(y, z)). Precondition: no side effects in τ (no call, store).]
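A source-level sketch of constant folding over a φ-function; the condition c and the constants are illustrative.

  int fold_over_phi(int c) {
      int x = c ? 3 : 5;          /* on SSA: x = φ(3, 5)                      */
      return x * 2;               /* τ = *2 moved over the φ and folded:      */
                                  /* φ(6, 10), i.e. return c ? 6 : 10         */
  }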
Optimizations
• Strength reduction:
  • Bauer & Samelson 1959
  • Replace expensive by cheap operations
    • Loops contain multiplications with the iteration variable
    • These operations can be replaced by add operations (induction analysis)
  • One of the oldest optimizations: already in the Fortran I compiler (1954/55) and the Algol 58/60 compilers
• Partial redundancy elimination (PRE):
  • Morel & Renvoise 1978
  • Eliminate partially redundant computations
    • SSA eliminates all static but not all dynamic redundancies
    • Problem on SSA: which is the best block to perform the computation in
    • Move loop-invariant computations out of loops, into conditionals
  • Subsumes a number of simpler optimizations
Example: Strength Reduction
for (i=0; i<n; i++){
  for (j=0; j<n; j++){
    b[i,j] = a[i,j];
  }
}

Original loop body (>x< denotes the address of x, <x> the memory cell at that address):
  ao  = i*n*d + j*d
  aij = >a[0,0]< + ao
  bo  = i*n*d + j*d
  bij = >b[0,0]< + bo
  <bij> = <aij>

After strength reduction (a single loop over both dimensions):
  adda = >a[0,0]<
  addb = >b[0,0]<
  d = 4
  addend = adda + n*n*d
  LOOP:
  jump (addend == adda) END
  <addb> = <adda>
  adda = adda + d
  addb = addb + d
  jump LOOP
  END:
  exit

Induction Analysis Idea
• Find the induction variables i of a loop using DFA
• i is an induction variable if the loop contains only assignments to i of the form
  • i := i + c with loop constant c, or, recursively,
  • i := c'*i' + c'' with i' an induction variable and loop constants c', c''
• c is a loop constant iff c does not change its value in the loop, i.e.
  • c is a static constant, or
  • c is computed in an enclosing loop
• Example (cont'd), consider the inner loop: for (j=0; j<n; j++){…}
  • Direct induction variable: j, implicit j = j+1 (c = 1)
  • Indirect induction variable: ao = i*n*d + j*d (c' = d, c'' = i*n*d)
  • Note that i*n*d and d are loop constants for the inner loop
Induction Analysis: Implementation
• Finding the induction variables i of a loop follows the definition
• Assume initially, optimistically: all variables are induction variables
• Iteratively until a fixed point: i is not an induction variable if it is not of the form
  • i := ix + c with loop constant c (direct induction variable)
  • i := c*ix + cx with ix an induction variable and loop constants c, cx (indirect induction variable)
  • i := φ(i1 … in) with the ik being direct induction variables
• On SSA, simplifications of that analysis are possible
  • Any loop variable corresponds to a cyclic subgraph over a φ node
  • Find the Strongly Connected Components (SCC) and check those for the induction variable condition

Induction Transformation Idea
• Transformation goal: the values of induction variables should grow linearly with the iteration; add operations replace mult operations
• Transformation:
  • Let i0 be the initialization of i, with the induction variables i := i+c and i' := c'*i + c''
  • Introduce a new variable ia initialized with ia := c'*i0 + c''
  • At the loop end insert ia := ia + c'*c
  • Replace the operands i' consistently by ia
  • Remove all assignments to i, i' and i, i' themselves if i is not used elsewhere (DFA)
• Example:
  • Before: loop ao = i*n*d + j*d … j++ end loop
  • After: aoa = i*n*d; loop … aoa = aoa + d end loop
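A C rendering of that transformation for the inner loop; the flattened index computation, the function wrapper and the placeholder comments for the use of ao/aoa are my own illustration.

  void induction_example(int i, int n, int d) {
      int j, ao, aoa;
      /* before: multiplication with the loop variable j in every iteration */
      for (j = 0; j < n; j++) {
          ao = i*n*d + j*d;       /* indirect induction variable            */
          /* ... use ao ... */
      }
      /* after: the value grows linearly; the multiplication has left the loop */
      aoa = i*n*d;                /* initialization ia := c'*i0 + c''        */
      for (j = 0; j < n; j++) {
          /* ... use aoa ... */
          aoa = aoa + d;          /* increment ia := ia + c'*c (here d*1)    */
      }
  }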
Example (cont'd): SSA Graph
sum(array[int] a){
  s = 0;
  for(i=0; i<=100; i++){
    s := a[i] + s;
  }
  return s;
};

[Figure: SSA graph of sum with blocks B1–B4. B1: Start, Consti 0, Jmp. B2: φ-functions for i, s and the memory state, Consti 1, Consti 4, Consti 100, the address computation Muli/Addi, the Load of a[i], the Addi for s and i, the compare < and the conditional jump. B3: loop back edge (Add, Jmp). B4: return of s.]

Induction Variables
[Figure: the same SSA graph; the direct induction variable cycle (φ and Addi with Consti 1 for i) and the dependent address computation (Muli with Consti 4) are marked.]

Induction Variables (Schematic)
[Figure: schematic subgraph: i = φ(0, i + 1); i * 4 + a0 is the load address; LD reads a[i]; i < 100 feeds the conditional jump; the loaded value is integrated into s.]
Move Multiplication: Move * over φ-Function
[Figure: the multiplication by 4 is moved over the φ-function of the induction variable; the φ now combines 0 and the value incremented and scaled by 4, and a compensating division /4 is inserted for the remaining uses (the compare with 100).]

Invariant Compare
[Figure: the compare of the divided value with 100 is replaced by a compare of the scaled value with 400; the /4 disappears from the compare.]

Distributive Law
[Figure: the distributive law turns the scaled increment (i + 1) * 4 into i * 4 + 4.]

Operation and its Inverse
[Figure: a multiplication *4 immediately followed by its inverse /4 is eliminated.]

Move Addition
[Figure: the addition of the base address a0 is moved over the φ-function.]
Move Addition (cont'd)
[Figure: after moving + a0 over the φ-function, the φ starts at a0 and compensating subtractions of a0 appear at the remaining uses (the compare with 400, via - a0 and - a0 + a0).]

Associative Law
[Figure: the associative law combines - a0 + a0 so that the compensation cancels.]

Change Compare
[Figure: the compare is changed so that the running address is compared directly with a0 + 400 (Consti 400 is added to the base address in B1).]

[Figure: resulting SSA graph of sum after strength reduction: B1 computes a0 and Consti 400; B2 contains the φ-functions for the running address and s, the Load, the Addi operations and the compare with a0 + 400; the multiplication has been eliminated.]
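Read back at source level, the chain of transformations has roughly this net effect; the pointer formulation, the 4-byte int assumption and the bound of 400 bytes (as in the graph) are my own sketch, not code from the slides.

  int sum_before(const int a[100]) {
      int s = 0, i;
      for (i = 0; i < 100; i++)       /* address a0 + i*4 needs a multiplication */
          s = a[i] + s;
      return s;
  }

  int sum_after(const int a[100]) {
      int s = 0;
      const char *p;
      /* the running address replaces i: advanced by 4 bytes per iteration and
         compared against a0 + 400, computed once before the loop               */
      for (p = (const char *) a; p != (const char *) a + 400; p += 4)
          s = *(const int *) p + s;
      return s;
  }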
[Figure: SSA graph of the sum example (repeated), as the starting point for the discussion of partial redundancy elimination.]

Partial Redundancy Elimination: Idea
• SSA is a representation
  • without (provable) static redundancies
  • with all dependencies explicit
• Question: which block should contain a computation, guaranteeing
  • that the result is used on all paths to the end
  • that the computation is not repeatedly performed in loops
• First idea: compute each operation earliest (as soon as all its arguments are available)
• Observation:
  • Early introduction of many live values: high register pressure
  • Many execution paths compute but do not use a certain value
• Solution is Partial Redundancy Elimination (PRE):
  • Delay computations until they are used on all paths
  • In practice: move them out of loops into conditional code
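A small C example of the effect PRE aims at; the loop-invariant operands a, b and the flag are my own illustration.

  int before_pre(int a, int b, int flag, int n) {
      int r = 0, k;
      for (k = 0; k < n; k++) {
          if (flag) r += a + b;   /* a + b recomputed in each iteration, and only on one path */
          else      r += k;
      }
      return r;
  }

  int after_pre(int a, int b, int flag, int n) {
      int r = 0, k, t = 0;
      if (flag) t = a + b;        /* computed once, outside the loop, in conditional code     */
      for (k = 0; k < n; k++) {
          if (flag) r += t;
          else      r += k;
      }
      return r;
  }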
Observations on SSA
Operations that must be executed in their original block:
1. φ-nodes
2. Computations with exceptions
3. Jumps
4. "Pinned" operations (postponed)

1. Observation:
All other nodes could be computed in other blocks as well, iff the data dependencies are obeyed.
2. Observation:
No statically redundant computations at all, i.e. one important goal of optimization immediately follows from the representation. Dynamic redundancy remains a problem.

Example: Initial Situation
[Figure: CFG with blocks 1–6; block 1 computes a=... and b=..., the expression a+b is computed in block 2 and again in blocks 5 and 6.]
Immature φ′
[Figure: block 1: a1=..., b1=...; block 2: t1: a1+b1; block 3: a2: φ′(a1, a6), b2: φ′(b1, b6), t2: a2+b2; blocks 5 and 6 use t2.]

Mature φ′ → φ
[Figure: after maturing, the φ-functions become a2: φ(a1, a2), b2: φ(b1, b2); t2: a2+b2 remains, used in blocks 5 and 6.]
Placement of Computations
[Figure: block 1: a1=..., b1=...; t1: a1+b1 in block 2; t1 is used in blocks 5 and 6.]

Placing t1 Earliest
[Figure: t1 = a1+b1 is computed already in block 1, as soon as a1 and b1 are available. t1 is not needed on many paths.]
Placing t1 "Optimally"
[Figure: placement according to Knoop & Steffen; t1 is (re-)computed in each iteration.]

Placing t1 Latest
[Figure: t1 = a1+b1 (resp. t2 = a1+b1) is computed directly before each use in blocks 5 and 6; still, t1 is (re-)computed in each iteration.]
Placing t1 Out of Loops into Conditional Code
[Figure: t1 = a1+b1 is moved out of the loop, in front of its conditional uses in blocks 5 and 6. However: there still exists a path on which t1 is not used.]

Insert Blocks
[Figure: virtually, empty blocks are inserted to capture operations that are executed only on a specific path; t1 = a1+b1 is placed in such a block.]
PRE: Discussion
• Placing earliest
  • Advantage: short code, could be fast code because of the instruction cache; no unnecessary computations in loops; short jumps …
  • Disadvantage: many paths do not need the result, high register pressure
• Placing lazily
  • Advantage: the computation is needed on all paths
  • Disadvantage: unnecessary computations in loops
• Placing out of loops into conditional code
  • Advantage: no unnecessary computations in loops
  • Disadvantage: some unnecessary computations in general, as some paths do not need the result

PRE: Implementation
• Find partially (dynamically) redundant computations
  • B contains operation τ computing t, B′ contains τ′ consuming t
  • If B′ post-dominates B (B′ ≤ B) there is no dynamic redundancy
  • Assume all operations to be partially (dynamically) redundant
  • For each operation τ computing t and the set of operations τ′1 … τ′n consuming t: if B(τ′k) ≤ B(τ) for some k in 1…n then τ is not partially redundant – all others are (!)
• Eliminate partially redundant computations
  • Compute the earliest position (all arguments available, all uses dominated) for each partially redundant operation τ
  • Move copies of τ towards the consuming operations τ′1 … τ′n along the dominator tree until there are no partial dynamic redundancies, but stop at loop heads
  • Not deterministic, but that does not matter (!)
PRE: "Pinned" Operations
• Placement is sometimes only possible if it is the last transformation on SSA
  • The same computation is now computed at several places
  • Further optimizations would recognize this wanted static redundancy (and merge the copies again)
• Solution:
  • Let t1: a1 + b1 and t2: a1 + b1 be semantically equivalent computations at different positions (blocks)
  • Replace + by a "pinned" ⊕Block
  • Thereby the ⊕ operation additionally depends on the current block as a new argument
  • The computations a1 ⊕5 b1 and a1 ⊕6 b1 are not recognized as congruent any more

Further Optimizations
• Constant evaluation (simple transformation rule)
• Constant propagation (iterative application of that rule)
• Copy propagation (on SSA construction)
• Dead code elimination (on SSA construction)
• Common subexpression elimination (on SSA construction)
• Specialization of basic blocks, procedures, i.e. cloning
• Procedure inlining
• Control flow simplifications
• Loop transformations (splitting/merging/unrolling)
• Bound check eliminations
• Cache optimizations (array access, object layout)
• Parallelization
• …
Observations
• The order of optimizations matters in theory:
  • The application of one optimization might destroy the precondition of another
  • An optimization can ruin the effects of previous ones
• The optimal order is unclear (in scientific papers there are usually statements like: "Assume my optimization is the last …")
• Simultaneous optimization is too complex
• Usually the first optimization gives 15%, the sum of the remaining ones 5%, independent of the chosen optimizations
  • This might differ in certain application domains, e.g. in numerical applications operator simplification gives a factor >2, cache optimization a factor 2–5

Outline
• Introduction to SSA
• Construction, Destruction
• Optimizations
• Classic analyses and optimizations on SSA representations
• Heap analyses and optimizations
Optimizations on Memory
• Elimination of memory accesses
• Elimination of object creations
• Elimination of non-essential dependencies
• These are normalizing transformations for further analyses
• Nothing new under the sun:
  • Define abstract values, addresses, memory
  • Define context-insensitive transfer functions for the memory-relevant SSA nodes (Load, Store, Call) (discussed already)
  • Generalization to context-sensitive analyses (discussed already)
  • Optimizations as graph transformations (discussed already)

Memory Values
• Differentiation by Name Schema (NS)
• Distinguish e.g.:
  • heap and stack
  • local arrays with different names
  • disjoint index sets in an array (odd/even etc.)
  • different types of heap objects
  • objects with the same type but statically different creation program points
  • objects with the same creation program point but with statically different paths to that creation program point (execution context, context-sensitive)
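A minimal C illustration of an allocation-site name schema; the struct, the variable names and the function wrapper are hypothetical.

  #include <stdlib.h>

  struct node { int val; };

  void name_schema_example(int n) {
      struct node *p = malloc(sizeof(struct node));  /* creation point 1: abstract object o1 */
      struct node *q = malloc(sizeof(struct node));  /* creation point 2: abstract object o2 */
      /* same type, but statically different creation program points,
         so the name schema keeps p and q apart                                 */
      int k;
      for (k = 0; k < n; k++) {
          struct node *r = malloc(sizeof(struct node));
          /* all n concrete objects share this one creation program point and
             are summarized by a single abstract object                          */
          free(r);
      }
      free(p); free(q);
  }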
Abstract Values, Addresses, Memory
• References and values
  • The allocation site lattice 2^O abstracts from the objects O
  • An arbitrary lattice X abstracts from values like Integer or Boolean
  • Abstract heap memory M:
    • O × R → O (R set of fields with reference object semantics)
    • O × V → X (V set of fields with value semantics)
• Arrays
  • Treated as objects
  • Abstract heap memory M:
    • O × R[] × I → O (R[] set of fields with type array of reference object)
    • O × V[] × I → X (V[] set of fields with type array of value)
    • I an arbitrary integer value lattice (e.g., constant or power set lattice)
• Abstract address Addr ∈ 2^(O × F × I) (F set of field names)
  • Object-field-(index) triples where the index might be ignored

Updates of Memory
• Given an abstract object-field-(index) of a store operation
  • In general, this abstract object points to more than one real memory cell
  • A store operation overwrites only one of these cells; all others keep their value
  • Hence, a store to an abstract object-field-(index) adds a new possible (abstract) value – weak update
  • Only if it is guaranteed that the abstract object-field-(index) matches one and only one concrete address, the new (abstract) value overwrites the old value – strong update
• Auxiliary function:
  • update(M, (o, n, ⊤), v) = v (if a strong update is possible)
  • update(M, (o, n, ⊤), v) = M(o, n, ⊤) ∪ v (otherwise, weak update)
Transfer Functions (insensitive, no arrays)
• Tstore(M, Addr, v) = M[a1 ↦ update(M, a1, v)] … [ak ↦ update(M, ak, v)]
  with Addr = {a1 … ak}, ai = (oi, f, ⊤)
  [Figure: Store f node with inputs M, A, v and output M′.]
• Tload(M, Addr) = (M, M(a1) ∪ … ∪ M(ak))
  [Figure: Load f node with inputs M, A and outputs M′, v.]
• Talloc(Type)(M) = (M[(oid, n1, ⊤) ↦ ⊥] … [(oid, nk, ⊤) ↦ ⊥], oid)
  with {n1 … nk} the attributes of Type
  [Figure: Alloc node with attributes Type, id, input M and outputs M′, o.]

Example: Main Loop of the Inner Product Algorithm
[Figure: class diagram]
  Vector<T>: init(int size), VecIter iter(), T times(Vector)
  VectorIter<T>: boolean hasNext(), T next()
  VectorArray<T>: VecIter iter()
  VectorArrayIter<T>: boolean hasNext(), T next()

T times(Vector v){
  i1 = iter();
  i2 = v.iter();
  for (s = 0; i1.hasNext(); )
    s = s + i1.next() * i2.next();
}
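A small worked example of these functions; the concrete address set is my own illustration. Assume a store of value v whose address set is Addr = {(o1, f, ⊤), (o2, f, ⊤)}, i.e. the analysis cannot tell which of the two objects is written. Then only weak updates are possible, and Tstore yields M′ with M′(o1, f, ⊤) = M(o1, f, ⊤) ∪ v and M′(o2, f, ⊤) = M(o2, f, ⊤) ∪ v. A subsequent Tload(M′, {(o1, f, ⊤)}) returns M(o1, f, ⊤) ∪ v, i.e. both the old and the new value. Only if Addr contained exactly one address matching exactly one concrete cell would the strong update update(M, (o1, f, ⊤), v) = v replace the old value.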
Example: SSA
[Figure: SSA graph of the times loop: after the initialization of the two iterators, each iteration Loads the iterator state, applies Inc and Stores it back; the memory state is merged by a φ-function at the loop head.]

No Provably Different Memory Addresses
[Figure: the rule "elimination of a memory operation and its inverse" (Store followed by Load, [a] ≠ void) is not applicable here: the Loads and Stores of the two iterators cannot be proven to address different memory cells.]
Initialization: Disjoint Memory Guaranteed / Actually Two Iterators?
[Figure: after analysing the initialization, the two iterators stem from two distinct Alloc nodes (each initialized with 0 via a Store), so their memory is guaranteed to be disjoint. This enables the elimination of non-essential dependencies and the elimination of reading memory accesses on the iterator state.]

Memory Objects Replaced by Values
[Figure: the iterator state is no longer kept in memory; the Loads and Stores are replaced by SSA values and φ-functions, leaving the Inc operations on plain values.]

Value Numbering Proves Equivalence
[Figure: value numbering proves that the two counters (both initialized with 0 and incremented identically) are equivalent, so a single φ/Inc cycle remains.]
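Expressed at source level, the combined effect on the inner product loop looks roughly like this; the array-based formulation and the names a, b, n are my own sketch of the result, not code from the slides.

  int inner_product(const int *a, const int *b, int n) {
      /* iterator objects and their memory traffic are gone:
         the two positions live in scalar variables ...                        */
      int k1 = 0, k2 = 0, s = 0;
      while (k1 < n) { s = s + a[k1] * b[k2]; k1++; k2++; }
      /* ... and value numbering proves k1 == k2, so one counter suffices:
         while (k < n) { s = s + a[k] * b[k]; k++; }                           */
      return s;
  }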
Example Revisited
[Figure: the optimized SSA graph of the times loop.]
The optimization is only possible due to the joint application of the single techniques:
• Global analysis
• Elimination of polymorphism
• Elimination of non-essential dependencies
• Elimination of memory operations
• Traditional optimizations