Analyses and Optimizations

Outline
Analyses in Software Technology support programmers; analyses in Compiler Construction allow to safely perform optimizations.
• Introduction to SSA: Construction, Destruction
• Optimizations
• Classic analyses and optimizations on SSA representations
• Heap analyses and optimizations

Cost model: runtime of a program
Statically only conservative approximations:
• Loop iterations
• Conditional code
Even for linear code not known in advance:
• Instruction scheduling
• Cache access is data dependent
• Instruction pipelining: execution time is not the sum of the individual operations' costs
Alternative cost models: memory size, power consumption – same undecidability problem as for execution time.
Caution: cost of a program ≠ sum of the costs of its elements.

Optimization: Implementation
Legal transformations on SSA graphs:
• Simplifying transformations reduce the costs of a program.
• Preparative transformations allow the application of simplifying transformations:
  – using algebraic identities (e.g. associative / distributive law for certain operations),
  – moving of operations,
  – reduction of dependencies.
Optimization is a sequence of goal-directed, legal simplifying and legal preparative transformations.
Legality is proven
• locally by checking preconditions,
• by static data-flow analyses.

Algebraic Identity: Elimination of Operation and its Inverse
[Figure: SSA subgraph τ⁻¹(τ(x)) rewritten to x.]
Precondition: no side effects in τ, τ⁻¹.

Graph Rewrite Schema
[Figure: SSA subgraph before transformation ⇒ SSA subgraph after transformation.]
Precondition established by local check / by preparative analyses.

Elimination of Memory Operation and its Inverse
[Figure: Store of value v to address a, followed by a Load from a ⇒ the Load is replaced by v.]
Precondition: [a] ≠ void.

Elimination of Duplicated Memory Operations
[Figure: two Loads from the same address a, feeding consumers b and c ⇒ a single Load whose result feeds both b and c.]
Precondition: [a] ≠ void.

Elimination of non-essential dependencies
Op1, Op2 are memory operations (Store, Call); u1, u2, d1, d2 designate their may Use/Define sets, computed in P2A.
[Figure: the memory dependency between Op1 and Op2 is removed; both are joined by a Sync node.]
Preconditions: u2 ∩ d1 = ∅, d1 ∩ d2 = ∅, u1 ∩ d2 = ∅.

Algebraic Identity: Invariant Compares
[Figure: cmp(τ(x), τ(z)) rewritten to cmp(x, z).]
Precondition: τ monotone.

Associative Law
[Figure: τ(τ(x, y), z) rewritten to τ(x, τ(y, z)).]
Precondition: τ associative.

Distributive Law
[Figure: (x ⊗ y) ⊕ (x ⊗ z) rewritten to x ⊗ (y ⊕ z).]
Precondition: ⊗ and ⊕ distributive.

Operator Simplification
[Figure: Mult(x, Const y) rewritten to Shift(x, Const k).]
Precondition: y = 2^k.

Constant Folding
[Figure: τ(Const x, Const y) rewritten to Const τ(x, y).]
Evaluation using the source algebra or the target algebra (if allowed by the source language).

Constant folding over φ-functions
[Figure: τ(φ(Const x, Const y), Const z) rewritten to φ(Const τ(x, z), Const τ(y, z)).]
General: moving arithmetic operations over φ-functions.

Moving operations over φ-functions
[Figure: τ(φ(x, y), z) rewritten to φ(τ(x, z), τ(y, z)).]
Precondition: no side effects in τ (no Call, Store).

Optimizations
Strength reduction: Bauer & Samelson 1959
Replace expensive by cheap operations:
• loops contain multiplications with the iteration variable,
• these operations can be replaced by add operations (induction analysis).
One of the oldest optimizations: already in the Fortran I compiler (1954/55) and the Algol 58/60 compilers.
Partial redundancy elimination (PRE): Morel & Renvoise 1978
Eliminate partially redundant computations:
• SSA eliminates all static but not dynamic redundancies,
• problem on SSA: which is the best block to perform the computation,
• move loop-invariant computations out of loops, into conditionals.
Subsumes a number of simpler optimizations.

Example: Strength reduction
for (i=0;i<n;i++){
  for (j=0;j<n;j++){
    b[i,j]=a[i,j];
  }
}
Original loop body:
LOOP: ao  = i*n*d + j*d
      aij = >a[0,0]< + ao
      bo  = i*n*d + j*d
      bij = >b[0,0]< + bo
      <bij> = <aij>
      …
END:
After strength reduction (d = 4 is the element size):
      adda   = >a[0,0]<
      addb   = >b[0,0]<
      d      = 4
      addend = adda + n*n*d
LOOP: jump(addend == adda) END
      <addb> = <adda>
      adda = adda + d
      addb = addb + d
      jump LOOP
END:  exit
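The same effect can be pictured at the source level. The following is a minimal Java sketch (not taken from the slides; the method names and the flattened index are illustrative only): the multiplications with the induction variables disappear in favour of an additively updated index.

class StrengthReductionDemo {
    // Before: every element access recomputes i*n + j, i.e. a multiplication per iteration.
    static void copyBefore(int[] a, int[] b, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                b[i * n + j] = a[i * n + j];
    }

    // After: the linear index grows by 1 per iteration; an addition replaces the multiplication.
    static void copyAfter(int[] a, int[] b, int n) {
        int idx = 0;                 // plays the role of the addresses adda/addb in the slide example
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                b[idx] = a[idx];
                idx = idx + 1;       // additive update of the induction variable
            }
    }
}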
Induction Analysis: Idea
Find the induction variables i of a loop using DFA.
i is an induction variable if in the loop there are only assignments of the form
• i := i + c with a loop constant c, or, recursively,
• i := c'*i' + c'' with i' an induction variable and loop constants c', c''.
c is a loop constant iff c does not change its value in the loop, i.e.
• c is a static constant, or
• c is computed in an enclosing loop.
Example (cont'd), consider the inner loop: for (j=0;j<n;j++){…}
Direct induction variable: j, implicit j = j+1 (c = 1).
Indirect induction variable: ao = i*n*d + j*d (c' = d, c'' = i*n*d).
Note that i*n*d and d are loop constants for the inner loop.

Induction Transformation: Idea
Transformation goal: the values of induction variables should grow linearly with the iteration; add operations replace mult operations.
Transformation: let i0 be the initialization of i, and let i := i+c and i' := c'*i + c'' be induction variables.
• Introduce a new variable ia, initialized ia := c'*i0 + c''.
• At the loop end insert ia := ia + c'*c.
• Replace the operands i' consistently by ia.
• Remove all assignments to i, i' and i, i' themselves if they are not used elsewhere (DFA).
Example:
Before: loop  ao = i*n*d + j*d  …  j++  end loop
After:  aoa = i*n*d;  loop  …  aoa = aoa + d  end loop

Induction Analysis: Implementation
Finding the induction variables i of a loop follows the definition.
Assume initially, optimistically: all variables are induction variables.
Iterate until a fixpoint: i is not an induction variable if it is not of one of the forms
• i := ix + c with loop constant c (direct induction variable),
• i := c*ix + cx with ix an induction variable and loop constants c, cx (indirect induction variable),
• i := φ(i1 … in) with the ik being direct induction variables.
On SSA, simplifications of that analysis are possible: any loop variable corresponds to a cyclic subgraph over a φ(i1 … in) node.
Find the strongly connected components (SCC) and check those for the induction variable condition.

Running example:
sum(array[int] a){ s = 0; for(i=0;i<=100;i++){ s := a[i] + s; } return s; }
[Figures: SSA graphs of sum() – blocks B1–B4 with Const, φ, Addi, Muli, Load and compare nodes – across the transformation sequence: Induction Variables, Direct Induction Variable Cycle, Induction Variables (Schematic), Conditional Jump, Integrate s, Move * over φ-function, Move Multiplication, Invariant Compare, Distributive Law, Operation and its Inverse, Move Addition, Associative Law, Change Compare. In the final graph the multiplication by 4 is gone: the loop adds 4 to an offset per iteration and compares it against the constant 400.]
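A hedged source-level rendering of what this transformation chain achieves for sum() (the offset variable and its bound are mine; Java has no pointer arithmetic, so the index a[off >> 2] stands for the machine-level load from the base address a0 + off that the final SSA graph performs):

class SumDemo {
    // Before: the address a0 + 4*i is recomputed with a multiplication in each iteration.
    static int sumBefore(int[] a) {
        int s = 0;
        for (int i = 0; i <= 100; i++)
            s = a[i] + s;
        return s;
    }

    // After: the byte offset is itself the induction variable, updated additively,
    // and the loop bound becomes the constant 400 (= 4 * 100).
    static int sumAfter(int[] a) {
        int s = 0;
        for (int off = 0; off <= 400; off += 4)
            s = a[off >> 2] + s;   // at machine level: load from a0 + off, no shift or division
        return s;
    }
}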
Partial Redundancy Elimination: Idea
SSA is a representation without (provable) static redundancies, with all dependencies explicit.
First idea: compute each operation earliest (as soon as all its arguments are available).
Observation: fast introduction of many live values: high register pressure; many execution paths compute but do not use a certain value.
Solution is Partial Redundancy Elimination (PRE): delay a computation until it is used on all paths; in practice: move computations out of loops, into conditional code.
Question: which block should contain the computation, guaranteeing
• that the result is used on all paths to the end,
• that the computation is not repeatedly performed in loops.

Observations on SSA
Operations that must be executed in their original block:
1. φ-nodes,
2. computations with exceptions,
3. jumps,
4. "pinned" operations (postponed).
1. Observation: all other nodes could be computed in other blocks as well iff the data dependencies are obeyed.
2. Observation: no statically redundant computation at all, i.e., one important goal of optimization immediately follows from the representation. Dynamic redundancy remains a problem.

Example: Initial Situation
[Figure: CFG with blocks 1–6; block 1 defines a=…, b=…; the expression a+b is computed in block 3 and again in blocks 5 and 6.]

Immature φ′ / Mature φ′ → φ
[Figures: the same example after SSA construction; t1: a1+b1, and t2: a2+b2 with a2: φ(a1, …), b2: φ(b1, …) at the loop header; t2 is used in blocks 5 and 6.]

Placement of computations
[Figure: the possible placements of t1: a1+b1 in the CFG.]
Placing t1 earliest (directly after a1, b1): t1 is not needed on many paths.
Placing t1 latest (directly before its uses): t1 is (re-)computed in each iteration.
Placing t1 "optimally" (according to Knoop, Steffen): still t1 is (re-)computed in each iteration.
Placing t1 out of loops into conditional code: however, there exist paths on which t1 is not used.

Insert Blocks
Virtually insert empty blocks to capture operations executed only on a specific path.
[Figure: t1 = a1+b1 placed in a newly inserted block in front of its uses.]

PRE: Discussion
Placing earliest
• Advantage: short code, could be fast code because of the instruction cache; no unnecessary computations in loops; short jumps …
• Disadvantage: many paths do not need the result; high register pressure.
Placing lazily
• Advantage: the computation is needed on all paths.
• Disadvantage: unnecessary computations in loops.
Placing out of loops into conditional code
• Advantage: no unnecessary computations in loops.
• Disadvantage: some unnecessary computations in general, as some paths do not need the result.

PRE: Implementation
Find partially (dynamically) redundant computations:
• B contains an operation τ computing t, B′ contains an operation τ′ consuming t.
• If B′ post-dominates B (B′ ≤ B): no dynamic redundancy.
• Assume all operations to be partially (dynamically) redundant.
• For each operation τ computing t and the set of operations τ′1 … τ′n consuming t: if B(τ′k) ≤ B(τ) for some k in 1…n, then τ is not partially redundant – all others are (!)
Eliminate partially redundant computations:
• Compute the earliest position (all arguments available, all uses dominated) for each partially redundant operation τ.
• Move copies of τ towards the consuming operations τ′1 … τ′n along the dominator tree until there are no partial dynamic redundancies, but stop at loop heads.
• Not deterministic, but that does not matter (!)
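At the source level, the placement PRE is after can be pictured as follows. This is a minimal, hedged Java sketch (the condition p, the loop shape and all names are mine, not the CFG from the slides): the loop-invariant sum a+b is moved out of the loop, but into conditional code, so paths that never need it never pay for it.

class PreDemo {
    // Before: a+b is loop invariant, but it is recomputed in every iteration
    // and only needed on the p-branch.
    static int before(int a, int b, boolean p, int n) {
        int r = 0;
        for (int k = 0; k < n; k++) {
            if (p) r += a + b;
            else   r -= 1;
        }
        return r;
    }

    // After: computed once, and only on paths that will actually use it.
    static int after(int a, int b, boolean p, int n) {
        int r = 0;
        int t = 0;
        if (p) t = a + b;          // hoisted out of the loop, into conditional code
        for (int k = 0; k < n; k++) {
            if (p) r += t;
            else   r -= 1;
        }
        return r;
    }
}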
PRE: "Pinned" Operations
Placement is sometimes only possible if it is the last transformation on SSA:
• the same computation is then computed at several places,
• further optimizations would recognize this wanted static redundancy (and remove it again).
Solution: let t1: a1 + b1 and t2: a1 + b1 be semantically equivalent computations at different positions (blocks).
Replace + by a "pinned" ⊕Block. Thereby the ⊕ operation additionally depends on the current block as a new argument; the computations a1 ⊕5 b1 and a1 ⊕6 b1 are not recognized as congruent any more.

Further Optimizations
• Constant evaluation (simple transformation rule)
• Constant propagation (iterative application of that rule)
• Copy propagation (on SSA construction)
• Dead code elimination (on SSA construction)
• Common subexpression elimination (on SSA construction)
• Specialization of basic blocks, procedures, i.e. cloning
• Procedure inlining
• Control flow simplifications
• Loop transformations (splitting/merging/unrolling)
• Bound check eliminations
• Cache optimizations (array access, object layout)
• Parallelization
• …

Observations
Order of optimizations matters in theory:
• the application of one optimization might destroy the precondition of another,
• an optimization can ruin the effects of the previous ones,
• the optimal order is unclear (scientific papers usually contain statements like "Assume my optimization is the last …"),
• simultaneous optimization is too complex.
Usually the first optimization gives 15%, the sum of the remaining ones 5%, independent of the chosen optimizations.
This might differ in certain application domains, e.g. in numerical applications operator simplification gives a factor > 2, cache optimization a factor of 2–5.

Outline
• Introduction to SSA: Construction, Destruction
• Optimizations
• Classic analyses and optimizations on SSA representations
• Heap analyses and optimizations

Optimizations on Memory
• Elimination of memory accesses.
• Elimination of object creations.
• Elimination of non-essential dependencies.
Those are normalizing transformations for further analyses.
Nothing new under the sun:
• Define abstract values, addresses, memory.
• Define context-insensitive transfer functions for the memory-relevant SSA nodes (Load, Store, Call) (discussed already).
• Generalization to context-sensitive analyses (discussed already).
• Optimizations as graph transformations (discussed already).

Memory Values: Differentiation by Name Schema (NS)
Distinguish e.g.:
• heap and stack,
• local arrays with different names,
• disjoint index sets in an array (odd/even etc.),
• different types of heap objects,
• objects with the same type but statically different creation program points,
• objects with the same creation program point but with statically different paths to that creation program point (execution context, context-sensitive).
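One way such a name schema can be carried by an analysis is as a key type for abstract addresses. The following is a hedged Java sketch (class and field names are mine; the slides only describe the schema): an abstract address is an (abstract object, field, index) triple, where the abstract object is named by its creation program point (allocation site).

import java.util.Objects;

final class AbstractAddress {
    final int allocationSiteId;  // names the abstract object by its creation program point
    final String field;          // field name; e.g. "[]" could stand for array elements
    final Integer index;         // abstract index; null stands for ⊤ ("any index")

    AbstractAddress(int allocationSiteId, String field, Integer index) {
        this.allocationSiteId = allocationSiteId;
        this.field = field;
        this.index = index;
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof AbstractAddress)) return false;
        AbstractAddress a = (AbstractAddress) o;
        return allocationSiteId == a.allocationSiteId
            && field.equals(a.field)
            && Objects.equals(index, a.index);
    }
    @Override public int hashCode() {
        return Objects.hash(allocationSiteId, field, index);
    }
}

A context-sensitive variant would additionally include (an abstraction of) the execution context leading to the allocation site.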
Abstract values, addresses, memory
References and values:
• The allocation-site lattice 2^O abstracts from the objects O.
• An arbitrary lattice X abstracts from values like Integer or Boolean.
• Abstract heap memory M:
  – O × R → O (R the set of fields with reference object semantics),
  – O × V → X (V the set of fields with value semantics).
Arrays: treated as objects.
• Abstract heap memory M:
  – O × R[] × I → O (R[] the set of fields with type array of reference objects),
  – O × V[] × I → X (V[] the set of fields with type array of values),
  – I an arbitrary integer value lattice (e.g., a constant or power-set lattice).
Abstract address: Addr ∈ 2^(O × F × I) (F the set of field names).

Updates of Memory
Object-field-(index) triples, where the index might be ignored.
Given the abstract object-field-(index) of a store operation:
• In general, this abstract object points to more than one real memory cell.
• A store operation overwrites only one of these cells; all others keep the value they had before.
• Hence, a store to an abstract object-field-(index) adds a new possible (abstract) value – weak update.
• Only if it is guaranteed that the abstract object-field-(index) matches one and only one concrete address does the new (abstract) value overwrite the old one – strong update.
Auxiliary function:
update(M, (o, n, ⊤), v) = v                  (if a strong update is possible)
update(M, (o, n, ⊤), v) = M(o, n, ⊤) ∪ v     (otherwise, weak update)

Transfer functions (insensitive, no arrays)
Tstore(M, Addr, v) = M[a1 ↦ update(M, a1, v)] … [ak ↦ update(M, ak, v)]
  with Addr = {a1 … ak}, ai = (oi, f, ⊤)
Tload(M, Addr) = (M, M(a1) ∪ … ∪ M(ak))
Talloc(type)(M) = (M[(oid, n1, ⊤) ↦ ⊥] … [(oid, nk, ⊤) ↦ ⊥], oid)
  with {n1 … nk} the attributes of the type
[Figures: the Store, Load and Alloc SSA nodes with memory input M, memory output M′, address/value operands and result v or o.]
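A hedged executable sketch of these definitions (the generic key A stands for an abstract address; abstract values are modelled as sets of allocation-site ids, a simplification of the arbitrary lattice X on the slides; all names are mine):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class AbstractMemory<A> {
    // M: abstract address -> set of abstract values (here: allocation-site ids)
    private final Map<A, Set<Integer>> m = new HashMap<>();

    // update(M, (o, n, ⊤), v): a strong update overwrites, a weak update joins with the old value
    void update(A addr, Set<Integer> v, boolean strongUpdatePossible) {
        if (strongUpdatePossible) {
            m.put(addr, new HashSet<>(v));
        } else {
            m.computeIfAbsent(addr, k -> new HashSet<>()).addAll(v);
        }
    }

    // Tstore(M, Addr, v): update every address the store may write to;
    // a strong update is only sound if the store hits exactly one concrete cell
    void store(Set<A> addrs, Set<Integer> v, boolean singleConcreteCell) {
        boolean strong = singleConcreteCell && addrs.size() == 1;
        for (A a : addrs) update(a, v, strong);
    }

    // Tload(M, Addr): the result is the join over all addresses the load may read from
    Set<Integer> load(Set<A> addrs) {
        Set<Integer> result = new HashSet<>();
        for (A a : addrs) result.addAll(m.getOrDefault(a, new HashSet<>()));
        return result;
    }
}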
Example: Inner Product Algorithm
Class hierarchy: Vector<T> (init(int size), VecIter iter(), T times(Vector)) with subclass VectorArray<T> (VecIter iter()); VectorIter<T> (boolean hasNext(), T next()) with subclass VectorArrayIter<T> (boolean hasNext(), T next()).
Main loop:
T times(Vector v){
  i1 = iter(); i2 = v.iter();
  for (s = 0; i1.hasNext(); )
    s = s + i1.next()*i2.next();
}

Example: SSA
[Figure: SSA graph of the main loop – the two iterator allocations with their initializing Stores, and the loop body with the Load, Inc and Store operations on the iterator state, all serialized on one memory dependency chain.]

No provably different memory addresses
[Figure: the Store/Load pairs of the two iterators on the shared memory chain.]
The rule "Store of v to a followed by a Load from a, with [a] ≠ void ⇒ reuse v" is not applicable!

Initialization: disjoint memory guaranteed
[Figure: after analysing the initialization, the two Allocs with their Stores of the constant 0 are known to write disjoint memory.]
This enables
• the elimination of non-essential dependencies,
• the elimination of reading memory accesses.

Actually two Iterators? / Memory objects replaced by values
[Figures: after the reading memory accesses are eliminated, each iterator position becomes a plain SSA value cycle 0 → φ → Inc; the Allocs and initializing Stores remain for the time being.]

Value numbering proves equivalence
[Figure: value numbering recognizes the two position cycles as congruent, so a single cycle 0 → φ → Inc remains.]

Example revisited
[Figure: the resulting SSA graph of the main loop.]
The optimization is only possible due to the joint application of the single techniques:
• global analysis,
• elimination of polymorphism,
• elimination of non-essential dependencies,
• elimination of memory operations,
• traditional optimizations.
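A hedged source-level impression of the end result (assuming both vectors are array-backed and the virtual calls have been devirtualized and inlined; all names and the int element type are mine): the iterator objects are gone, their position fields live on as plain values.

class TimesDemo {
    static int timesOptimized(int[] thisData, int[] vData, int n) {
        int s = 0;
        int p1 = 0, p2 = 0;                    // former iterator fields, now plain SSA values
        while (p1 < n) {                       // hasNext() inlined
            s = s + thisData[p1] * vData[p2];  // next() inlined; no Loads/Stores of iterator state
            p1++; p2++;
        }
        return s;
    }
}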