15-745 Lecture 5
Control flow analysis
Natural loops
Classical Loop Optimizations
Dependencies

Copyright © Seth Goldstein, 2008
Based on slides from Lee & Callahan
Lecture 5 15-745 © 2008

Loops are Key
• Loops are extremely important
  – the “90-10” rule
• Loop optimization involves
  – understanding control-flow structure
  – understanding data-dependence information
  – sensitivity to side-effecting operations
  – extra care in some transformations such as register spilling

Common loop optimizations
• Hoisting of loop-invariant computations
  – pre-compute before entering the loop
• Elimination of induction variables
  – change p=i*w+b to p=b,p+=w, when w,b invariant
• Loop unrolling
  – to improve scheduling of the loop body
• Software pipelining
  – to improve scheduling of the loop body
• Loop permutation
  – to improve cache memory performance
• The first two: scalar opts, DF analysis, control flow analysis
• The last three: require understanding data dependencies

Finding Loops
• To optimize loops, we need to find them!
• Could use source language loop information in the abstract syntax tree…
• BUT:
  – There are multiple source loop constructs: for, while, do-while, even goto in C
  – Want IR to support different languages
  – Ideally, we want a single concept of a loop so all have same analysis, same optimizations
  – Solution: dismantle source-level constructs, then re-find loops from fundamentals

Finding Loops
• To optimize loops, we need to find them!
• Specifically:
  – loop-header node(s)
    • nodes in a loop that have immediate predecessors not in the loop
  – back edge(s)
    • control-flow edges to previously executed nodes
  – all nodes in the loop body

Control-flow analysis
• Many languages have goto and other complex control, so loops can be hard to find in general
• Determining the control structure of a program is called control-flow analysis
• Based on the notion of dominators

Dominators
• a dom b
  – node a dominates b if every possible execution path from entry to b includes a
• a sdom b
  – a strictly dominates b if a dom b and a != b
• a idom b
  – a immediately dominates b if a sdom b, AND there is no c such that a sdom c and c sdom b

Back edges and loop headers
• A control-flow edge from node B3 to B2 is a back edge if B2 dom B3
• Furthermore, in that case node B2 is a loop header
[Figure: control-flow graph with blocks Entry, B1 to B6, and exit; statements shown include k = false, i = 1, j = 2, i <= n, j = j*2, k = true, i = i+1, ..k.., print j]

Natural loop
• Consider a back edge from node n to node h
• The natural loop of n→h is the set of nodes L such that for all x∈L:
  – h dom x and
  – there is a path from x to n not containing h

Examples
• Simple example
  (often it’s more complicated, since a FOR loop found in the source code might need an if/then guard)
[Figure: small example CFGs on nodes a, b, c, d, e; one is labeled “Try this:”]

Examples
    for (..) {
      if (..) {
        …
      } else {
        …
        if (x) {
          e;
          break;
        }
      }
    }
[Figure: CFG of this loop on nodes a, c, d, e, f]
• e is lexically in the loop, but not in the natural loop

Examples
    for (..)
    {
      if (..) {
        …
      } else {
        …
        if (x) {
          e;        // lexically in loop, but not in natural loop
          break;
        }
      }
    }
• and another reason why CFG analysis is preferred over source/AST loops

Examples
• Yes, it can happen in C
[Figure: example CFG]

Natural Loops
• One loop per header..
• What are the natural loops?
[Figure: four example CFGs on nodes 0 to 3; the answers shown are {0,1,2}; {} (none); {1,2,3}; and {1,2}, {0,1,2,3}]

Nested Loops
• Unless two natural loops have the same header, they are either disjoint or nested within each other
• If A and B are loops (sets of blocks) with headers a and b such that a ≠ b and b ∈ A
  – B ⊂ A
  – loop B is nested within A
  – B is the inner loop
• Can compute the loop-nest tree

General Loops
• A more general looping structure is a strongly connected component of the control flow graph
  – subgraph <Nscc,Escc> such that every block in Nscc is reachable from every other node using only edges in Escc
• Not a very useful definition of a loop
[Figure: CFG on nodes 0 and 1 with the scc marked]

Reducible Flow Graphs
• There is a special class of flow graphs, called reducible flow graphs, for which several code optimizations are especially easy to perform.
• In reducible flow graphs, loops are unambiguously defined and dominators can be efficiently computed.

Reducible flow graphs
• Definition: A flow graph G is reducible iff we can partition the edges into two disjoint groups, forward edges and back edges, with the following two properties.
  1. The forward edges form an acyclic graph in which every node can be reached from the initial node of G.
  2. The back edges consist only of edges whose heads dominate their tails.
• Why isn’t this reducible? [Figure: example flow graph]
  This flow graph has no back edges. Thus, it would be reducible if the entire graph were acyclic, which is not the case.
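The definitions above (dominators, back edges, natural loops, and the forward-edge/back-edge test for reducibility) can be sketched in a few lines of Python. This is a minimal illustration and not the lecture's own algorithm: the CFG is assumed to be a dict mapping each node to its successor list, every node is assumed reachable from the entry, and the names `dominators`, `back_edges`, `natural_loop`, `is_reducible`, `loop_cfg`, and `irreducible` are made up for this example. The reducibility check only tests that the non-back edges are acyclic, which is enough under the reachability assumption.

```python
def dominators(succ, entry):
    """Return (dom, pred); dom[n] is the set of nodes that dominate n."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, outs in succ.items():
        for s in outs:
            pred[s].add(n)
    dom = {n: set(nodes) for n in nodes}    # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:                           # shrink the sets to a fixed point
        changed = False
        for n in nodes - {entry}:
            preds = [dom[p] for p in pred[n]]
            common = set.intersection(*preds) if preds else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, pred


def back_edges(succ, dom):
    """Edges n -> h whose head h dominates their tail n."""
    return [(n, h) for n in sorted(succ) for h in succ[n] if h in dom[n]]


def natural_loop(n, h, pred):
    """Nodes that can reach n without passing through the header h, plus h."""
    loop = {h}
    stack = [n]
    while stack:
        x = stack.pop()
        if x not in loop:
            loop.add(x)
            stack.extend(pred[x])
    return loop


def is_reducible(succ, entry):
    """True if the edges that are not back edges form an acyclic graph."""
    dom, _ = dominators(succ, entry)
    fwd = {n: [s for s in succ[n] if s not in dom[n]] for n in succ}
    indeg = {n: 0 for n in succ}
    for n in fwd:
        for s in fwd[n]:
            indeg[s] += 1
    ready = [n for n in succ if indeg[n] == 0]
    visited = 0
    while ready:                             # Kahn-style topological sort
        n = ready.pop()
        visited += 1
        for s in fwd[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return visited == len(succ)


# Hypothetical example graphs: a simple loop 1 <-> 2, and a two-entry cycle.
loop_cfg = {0: [1], 1: [2], 2: [1, 3], 3: []}
irreducible = {0: [1, 2], 1: [2], 2: [1]}
```

In `loop_cfg` the only back edge is 2→1 and its natural loop is {1, 2}. In `irreducible` neither 1 nor 2 dominates the other, so there are no back edges, the remaining edges contain a cycle, and the check fails, which mirrors the "Why isn't this reducible?" argument.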
Alternative definition
• Definition: A flow graph G is reducible if we can repeatedly collapse (reduce) together blocks (x,y) where x is the only predecessor of y (ignoring self loops) until we are left with a single node
[Figure: a graph on nodes 0, 1, 2, 3 collapsed step by step, with merged nodes labeled 1,2, then 1,2,3, then 0,1,2,3]

Properties of Reducible Flow Graphs
• In a reducible flow graph, all loops are natural loops
• Can use DFS to find loops
• Many analyses are more efficient
  – polynomial versus exponential

Good News
• Most flow graphs are reducible
• Languages prohibit irreducibility
  – goto-free C
  – Java
• Programmers usually don’t use such constructs even if they’re available
  – >90% of old Fortran code reducible

Dealing with Irreducibility
• Don’t
• Can split nodes and duplicate code to get reducible graph
  – possible exponential blowup
• Other techniques…
[Figure: graph on nodes 0, 1, 2 before node splitting, and on nodes 0, 1, 2, 2a after]

Loop optimizations:
Hoisting of loop-invariant computations

Loop-invariant computations
• A definition t = x op y in a loop is (conservatively) loop-invariant if
  – x and y are constants, or
  – all reaching definitions of x and y are outside the loop, or
  – only one definition reaches x (or y), and that definition is loop-invariant
• so keep marking iteratively

Loop-invariant computations
• Be careful: a use reached by more than one definition is not invariant, even if each of those definitions is invariant
• Of course, not an issue in SSA

Hoisting
• In order to “hoist” a loop-invariant computation out of a loop, we need a place to put it
• We could copy it to all immediate predecessors (except along the back-edge) of the loop header...
• ...But we can avoid code duplication by inserting a new block, called the pre-header

Loop-invariant computations (example)
• Without SSA:
        t = expr;
        for () {
          s = t * 2;
          t = loop_invariant_expr;
          x = t + 2;
          …
        }
• Even though t’s two reaching definitions are each invariant, s is not invariant…
• In SSA form:
        t1 = expr;
    L1: t2 = phi(t1, t3);
        brc L2;
        s = t2 * 2;
        t3 = loop_invariant_expr;
        x1 = t3 * 2;
        ...
        jmp L1;
    L2:

Hoisting
[Figure: CFG with blocks A and B, before and after pre-headers A’ and B’ are inserted]

Hoisting conditions
• For a loop-invariant definition d: t = x op y
• we can hoist d into the loop’s pre-header only if
  1. d’s block dominates all loop exits at which t is live-out, and
  2. d is the only definition of t in the loop, and
  3. t is not live-out of the pre-header

We need to be careful...
• All hoisting conditions must be satisfied!
• OK:
    L0: t = 0
    L1: i = i + 1
        t = a * b
        M[i] = t
        if i<N goto L1
    L2: x = t
• violates 1,3:
    L0: t = 0
    L1: if i>=N goto L2
        i = i + 1
        t = a * b
        M[i] = t
        goto L1
    L2: x = t
• violates 2:
    L0: t = 0
    L1: i = i + 1
        t = a * b
        M[i] = t
        t = 0
        M[j] = t
        if i<N goto L1
    L2:

Loop optimizations:
Induction-variable
Strength reduction

The basic idea of IVE
• Suppose we have a loop variable
  – i initially 0; each iteration i = i + 1
• and a variable that linearly depends on it:
  x = i * c1 + c2
• In such cases, we can try to
  – initialize x = i0 * c1 + c2 (execute once), where i0 is the initial value of i
  – increment x by c1 each iteration

Simple Example of IVE
    for i = 0 to n
      a[i] = 0

       i <- 0
    H: if i >= n goto exit
       j <- i * 4
       k <- j + a
       M[k] <- 0
       i <- i + 1
       goto H
Clearly, j & k do not need to be computed anew each time since they are related to i and i changes linearly.
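The observation about j and k can be checked on a small model of the loop. The sketch below is only an illustration under stated assumptions: memory stores are modeled by collecting the addresses written, `a` is treated as a plain integer base address, and the function names and the running variables `j_run` and `k_run` (standing in for the primed variables j' and k') are invented for this example.

```python
def addresses_original(n, a):
    """Addresses written when j and k are recomputed from i every iteration."""
    written = []
    i = 0
    while i < n:
        j = i * 4          # multiply on every trip through the loop
        k = j + a
        written.append(k)  # models M[k] <- 0
        i = i + 1
    return written


def addresses_reduced(n, a):
    """Same loop, with j and k kept as running values updated by adds only."""
    written = []
    i = 0
    j_run = 0              # initial value of i * 4
    k_run = a              # initial value of i * 4 + a
    while i < n:
        j = j_run
        k = k_run
        written.append(k)  # models M[k] <- 0
        i = i + 1
        j_run = j_run + 4  # i grew by 1, so i * 4 grows by 4
        k_run = k_run + 4
    return written
```

Both functions visit the same addresses for any n and a, which is the property the transformation has to preserve; the second version never multiplies inside the loop.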
Simple Example of IVE
       i <- 0
       j' <- 0
       k' <- a
    H: if i >= n goto exit
       j <- j'
       k <- k'
       M[k] <- 0
       i <- i + 1
       j' <- j' + 4
       k' <- k' + 4
       goto H
But then, we don't even need j (or j')

Simple Example of IVE
       i <- 0
       k' <- a
    H: if i >= n goto exit
       k <- k'
       M[k] <- 0
       i <- i + 1
       k' <- k' + 4
       goto H
Do we need i?

Simple Example of IVE
Rewrite comparison
       k' <- a
    H: if k' >= a+(n*4) goto exit
       k <- k'
       M[k] <- 0
       k' <- k' + 4
       goto H
But, a+(n*4) is loop invariant

Simple Example of IVE
Invariant code motion on a+(n*4)
       k' <- a
       n' <- a + (n * 4)
    H: if k' >= n' goto exit
       k <- k'
       M[k] <- 0
       k' <- k' + 4
       goto H
now, we do copy propagation and eliminate k

Simple Example of IVE
Copy propagation
       k' <- a
       n' <- a + (n * 4)
    H: if k' >= n' goto exit
       M[k'] <- 0
       k' <- k' + 4
       goto H
Voila!

What we did
• identified induction variables (i,j,k)
• strength reduction (changed * into +)
• dead-code elimination (j <- j')
• useless-variable elimination (j' <- j' + 4)
  (This can also be done with ADCE)
• loop invariant identification & code-motion
• almost useless-variable elimination (i)
• copy propagation

Simple Example of IVE
Compare original and result of IVE
• Original:
       i <- 0
    H: if i >= n goto exit
       j <- i * 4
       k <- j + a
       M[k] <- 0
       i <- i + 1
       goto H
• Result:
       k' <- a
       n' <- a + (n * 4)
    H: if k' >= n' goto exit
       M[k'] <- 0
       k' <- k' + 4
       goto H
Voila!

Is it faster?
• On some hardware, adds are much faster than multiplies
• Furthermore, one fewer value is computed,
  – thus potentially saving a register
  – and decreasing the possibility of spilling

Loop preparation
• Before attempting IVE, it is best to first perform:
  – constant propagation & constant folding
  – copy propagation
  – loop-invariant hoisting

Representing IVs
• Characterize all induction variables by:
  (base-variable, offset, multiple)
  – where the offset and multiple are loop-invariant
• IOW, after an induction variable is defined it equals:
  offset + multiple * base-variable
• Note: the “How to do it” steps 1 to 3 write the triple with the multiple before the offset, so k = (i, c1, c2) there means k = i * c1 + c2

How to do it, step 1
• First, find the basic IVs
  – scan loop body for defs of the form x = x + c or x = x - c where c is loop-invariant
  – record these basic IVs as x = (x, 1, c)
  – this represents the IV: x = x * 1 + c

How to do it, step 2
• Scan for derived IVs of the form k = i * c1 + c2
  – where i is a basic IV, this is the only def of k in the loop, and c1 and c2 are loop invariant
• We say k is in the family of i
• Record as k = (i, c1, c2)

How to do it, step 3
• Iterate, looking for derived IVs of the form k = j * c1 + c2
  – where IV j = (i, a, b), and
  – this is the only def of k in the loop, and
  – there is no def of i between the def of j and the def of k
  – c1 and c2 are loop invariant
• Record as k = (i, a*c1, b*c1+c2)

Finding the IVs
• Maintain three tables: basic & maybe & other
• Find basic IVs: Scan stmts. If var ∉ maybe, and of proper form, put into basic. Otherwise, put var in other and remove from maybe.
• Find compound IVs:
  – If var defined more than once, put into other
  – For all stmts of proper form that use a basic IV … (the author marked this slide “FIX THIS SLIDE”)

Simple Example of IVE
       i <- 0
    H: if i >= n goto exit
       j <- i * 4
       k <- j + a
       M[k] <- 0
       i <- i + 1
       goto H
• Written as (base-variable, offset, multiple):
  – i: (i, 1, 1) i.e., i = 1 + 1 * i
  – j: (i, 0, 4) i.e., j = 0 + 4 * i
  – k: (i, a, 4) i.e., k = a + 4 * i
• So, j & k are in the family of i

How to do it, step 4
• This is the strength reduction step
• For an induction variable k = (i, c1, c2), i.e., k = i * c1 + c2
  – initialize k = i * c1 + c2 in the preheader
  – replace k’s def in the loop by k = k + c1 (this assumes i steps by 1; if i’s def is i = i + c, the increment is c1 * c)
  – make sure to do this after i’s def

How to do it, step 5
• This is the comparison rewriting step
• For an induction variable k = (i, ak, bk), where ak is the offset and bk the multiple
  – If k is used only in its definition and in a comparison
  – There exists another variable, j, in the same class that is not “useless”, and j = (i, aj, bj)
• Rewrite k < n as
  j < (bj/bk)(n-ak)+aj
  (for bj/bk > 0; if it is negative, the direction of the comparison flips)
• Note: since they are in the same class:
  (j-aj)/bj = (k-ak)/bk

Notes
• Are the c1, c2 constant, or just invariant?
  – if constant, then you can keep folding them: they’re always a constant even for derived IVs
  – otherwise, they can be expressions of loop-invariant variables

Is it faster? (2)
• On some hardware, adds are much faster than multiplies
• But…not always a win!
  – Constant multiplies might otherwise be reduced to shifts/adds that result in even better code than IVE
  – Scaling of addresses (i*4) might come for free on your processor’s address modes
• So maybe: only convert i*c1+c2 when c1 is loop invariant but not a constant

Notes (continued)
• But if constant, can find IVs of the type x = i/b and know that it’s legal, if b evenly divides the stride…

Common loop optimizations
• Hoisting of loop-invariant computations
  – pre-compute before entering the loop
• Elimination of induction variables
  – change p=i*w+b to p=b,p+=w, when w,b invariant
• Loop unrolling
  – to improve scheduling of the loop body
• Software pipelining
  – to improve scheduling of the loop body
• Loop permutation
  – to improve cache memory performance
• The first two: scalar opts, DF analysis, control flow analysis
• The last three: require understanding data dependencies
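As a closing illustration of the induction-variable material in this lecture, the detection steps (find basic IVs, then iterate to find derived IVs in their family) can be sketched over a toy three-address loop body. Everything about the representation is an assumption made for this example: statements are tuples `(dest, op, arg1, arg2)`, loop-invariant names are supplied with concrete values in a dict, triples are stored as `(base, multiple, offset)` meaning `base * multiple + offset`, and a basic IV is recorded as `(x, 1, 0)`. The function name `find_ivs` is hypothetical, and only `+`, `-`, and `*` with an invariant right operand are handled.

```python
def find_ivs(body, invariants):
    """Map each induction variable to (base, multiple, offset)."""
    def const(v):
        # An int literal or a known loop-invariant name; otherwise None.
        if isinstance(v, int):
            return v
        return invariants.get(v)

    ndefs, where = {}, {}
    for pos, (dest, _, _, _) in enumerate(body):
        ndefs[dest] = ndefs.get(dest, 0) + 1
        where[dest] = pos

    ivs = {}
    # Basic IVs: a single def of the form x = x + c or x = x - c.
    for dest, op, x, y in body:
        if ndefs[dest] == 1 and x == dest and op in ("+", "-") and const(y) is not None:
            ivs[dest] = (dest, 1, 0)
    basic = set(ivs)

    # Derived IVs: iterate until no new variable joins a family.
    changed = True
    while changed:
        changed = False
        for dest, op, x, y in body:
            if dest in ivs or ndefs[dest] != 1:
                continue
            c = const(y)
            if x not in ivs or c is None:
                continue
            base, mult, off = ivs[x]
            # Reject if the basic IV is redefined between the def of x and dest.
            if x not in basic and where[x] < where[base] < where[dest]:
                continue
            if op == "*":
                ivs[dest] = (base, mult * c, off * c)
            elif op == "+":
                ivs[dest] = (base, mult, off + c)
            elif op == "-":
                ivs[dest] = (base, mult, off - c)
            else:
                continue
            changed = True
    return ivs


# The loop body from the simple example: j <- i * 4; k <- j + a; i <- i + 1
example_body = [("j", "*", "i", 4), ("k", "+", "j", "a"), ("i", "+", "i", 1)]
example_ivs = find_ivs(example_body, {"a": 1000})
```

With `a` assumed to be 1000, j is found as 4 * i and k as 4 * i + 1000, both in the family of i. A variable defined twice in the body is never reported, which matches the "only def in the loop" condition.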