TDDC86 Compiler Optimizations and Code Generation
DF00100 Advanced Compiler Construction

Software Pipelining of Loops
Christoph Kessler, IDA, Linköpings universitet, 2014

Literature:
- C. Kessler: "Compiling for VLIW DSPs", 2009, Section 7.2 (handed out)
- Aho, Lam, Sethi, Ullman (ALSU), 2nd edition, Section 10.5
- Muchnick, Section 17.4
- Allan et al.: Software Pipelining. ACM Computing Surveys 27(3), 1995


Introduction

Software pipelining (modulo scheduling) overlaps instructions from different
iterations of a loop. The pipelined loop consists of a prologue (filling the
pipeline), a kernel (the steady state), and an epilogue (draining the
pipeline). The kernel length II, the initiation interval, determines the
throughput: a new iteration is started every II cycles.

Goal: faster execution of the entire loop
- better resource utilization,
- increased instruction-level parallelism,
- also in the presence of loop-carried dependences.


Lower Bound for MII: Definitions

- Initiation interval (II): the kernel length, i.e. the number of cycles
  between the starts of two consecutive iterations.
- Stage count (SC): the makespan of one iteration measured in multiples of
  the kernel length, SC = ceil( makespan / II ).
- Minimum initiation interval (MII): a lower bound on II. It depends on
  - loop-carried data dependence cycles        ->  RecMII, and
  - the resources (registers, functional units) ->  ResMII.


Calculating the Lower Bound for MII

  MII = max( ResMII, RecMII )

  ResMII = max_U ceil( N_U / P_U ), where
    N_U = number of instructions in the loop body that need resource
          (functional unit type) U,
    P_U = number of functional units of type U.
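As a concrete illustration of these two bounds, here is a minimal C sketch
(not from the lecture): it computes MII = max(ResMII, RecMII), assuming the
per-unit instruction counts and the delay/distance sums of the loop-carried
dependence cycles have already been extracted from the loop body. The
per-cycle bound ceil(delay(c)/distance(c)) used for RecMII is the standard
formulation.

#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* ResMII = max over functional-unit types U of ceil(N_U / P_U) */
int res_mii(const int *n_ops, const int *n_units, int n_types)
{
    int mii = 1;
    for (int u = 0; u < n_types; u++) {
        int c = ceil_div(n_ops[u], n_units[u]);
        if (c > mii) mii = c;
    }
    return mii;
}

/* RecMII = max over dependence cycles c of ceil(delay(c) / distance(c)),
   where delay(c) sums the latencies and distance(c) the loop-carried
   dependence distances along the cycle. */
int rec_mii(const int *delay, const int *distance, int n_cycles)
{
    int mii = 1;
    for (int c = 0; c < n_cycles; c++) {
        int r = ceil_div(delay[c], distance[c]);
        if (r > mii) mii = r;
    }
    return mii;
}

int main(void)
{
    /* Hypothetical loop body: 4 multiplies on 2 multipliers, 3 adds on
       1 adder; one dependence cycle with total delay 4 and distance 1. */
    int n_ops[]    = {4, 3};
    int n_units[]  = {2, 1};
    int delay[]    = {4};
    int distance[] = {1};
    int resmii = res_mii(n_ops, n_units, 2);
    int recmii = rec_mii(delay, distance, 1);
    int mii = resmii > recmii ? resmii : recmii;
    printf("ResMII=%d RecMII=%d MII=%d\n", resmii, recmii, mii); /* 3 4 4 */
    return 0;
}

For the HRMS example used below (7 operations on 4 ALUs), the resource bound
alone gives ResMII = ceil(7/4) = 2, matching the II = 2 that the schedules
there achieve.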
Modulo Scheduling

Modulo scheduling fills the modulo reservation table (MRT), one instruction
after another: an operation placed at cycle t occupies its resource in row
(t mod II) of the table.

Heuristic orderings, for example:
- ASAP (as soon as possible)
- ALAP (as late as possible)
- HRMS (Hypernode Reduction Modulo Scheduling)
- Swing Modulo Scheduling

Optimal modulo scheduling: integer linear programming [Eriksson'09].


Modulo Scheduling Heuristics
Example: Hypernode Reduction Modulo Scheduling (HRMS)


HRMS – Motivation

Problem with the simple ASAP or ALAP heuristics: some nodes of the DAG are
scheduled too early and some too late, which stretches live ranges.
Example: a dependence graph of seven operations A..G (values v1, v2, v4, v5,
v6), 4 ALUs, latency 2 for all operations. Both the forward (ASAP) and the
backward (ALAP) modulo schedule reach II = 2, but both need MaxLive = 7
registers.
(Figure: the example DAG and the two resulting kernels.)


Hypernode Reduction Approach

Schedule only nodes that have
- only predecessors already scheduled, or
- only successors already scheduled, or
- neither of them,
but never nodes with both predecessors and successors already scheduled.
This keeps register pressure low by scheduling nodes as close as possible to
their relatives.


HRMS – Solution: Pre-Ordering Step

Two-stage algorithm:
1. Pre-order the nodes of the DAG using a reduction algorithm: starting from
   an initial node v, the hypernode H = {v} iteratively absorbs adjacent
   nodes. Edges and nodes are removed from the DAG (= reducing the DAG) and
   added to H; in each reduction step the absorbed nodes are appended to the
   ordered list of nodes. This is similar to list scheduling / topological
   sorting, but works in both directions, forward and backward along the
   edges incident to H.
2. Schedule the operations in the order produced by step 1.

Function pre_ordering( G ):
    select initial node;
    H = { initial node };  List = < initial node >;
    while Pred(H) nonempty or Succ(H) nonempty do
        V' = Pred(H);
        V' = Search_All_Paths( V', G );
        G' = Hypernode_Reduction( V', G, H );
        L' = Sort_PALA( G' );          // ALAP order, then inverted
        List = Concatenate( List, L' );
        V' = Succ(H);
        V' = Search_All_Paths( V', G );
        G' = Hypernode_Reduction( V', G, H );
        L' = Sort_ASAP( G' );
        List = Concatenate( List, L' );
    end while
    return List;


HRMS – Example

Original dependence graph of one loop iteration with nodes
A, B, C, D, E, F, G, H, I, J. Start with the initial hypernode H = { A };
H then alternately absorbs successors and predecessors, e.g. H first
"eats" its successor node C, and so on.
Result: List = { A, C, G, H, D, J, I, E, B, F }.
Successor nodes (A, C, G, H, J) will be scheduled ASAP,
predecessor nodes (D, I, E, B, F) will be scheduled ALAP.
(Figure: the successive reduction steps.)


Pre-Ordering with Circular Dependences

Circular dependences arise from loop-carried dependences.
Solution: reduce the complete path causing the cycle into the hypernode.
How to deal with several connected cycles in the DAG?
(See the details in the paper; skipped here.)


The Scheduling Step

Places the operations in the order given by the pre-ordering step, with a
different strategy depending on the already scheduled neighbors.
If the operation has
- only predecessors in the partial schedule:  schedule it ASAP;
- only successors in the partial schedule:    schedule it ALAP;
- both predecessors and successors:           scan from its ASAP time
  towards its ALAP time.
If no free slot is found, increase II by one and reschedule.
A small sketch of this placement step follows below.
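The following is a minimal C sketch (not from the lecture) of this placement
step for a machine with a single functional-unit type. It assumes the
pre-ordering, the dependence checks, and the ASAP/ALAP bounds of each
operation are computed elsewhere; it only illustrates how the modulo
reservation table is probed.

#include <stdbool.h>
#include <stdio.h>

#define MAX_II   64
#define N_UNITS   4          /* e.g. the 4 ALUs of the HRMS example */

static int mrt[MAX_II];      /* operations already placed per MRT row */

/* Try to occupy a slot in row (t mod II) of the modulo reservation table. */
static bool try_place(int t, int II)
{
    int row = ((t % II) + II) % II;      /* safe also for negative t */
    if (mrt[row] < N_UNITS) { mrt[row]++; return true; }
    return false;
}

/* Place one operation; returns the chosen cycle, or -1 if it does not fit,
   in which case the caller increases II and reschedules the whole loop.
   asap/alap are the earliest/latest cycles allowed by the already
   scheduled predecessors/successors. */
int schedule_op(int asap, int alap, bool has_pred, bool has_succ, int II)
{
    if (has_pred && !has_succ) {            /* only predecessors placed: ASAP */
        for (int t = asap; t < asap + II; t++)
            if (try_place(t, II)) return t;
    } else if (has_succ && !has_pred) {     /* only successors placed: ALAP */
        for (int t = alap; t > alap - II; t--)
            if (try_place(t, II)) return t;
    } else {                                /* both (or neither): ASAP -> ALAP */
        for (int t = asap; t <= alap; t++)
            if (try_place(t, II)) return t;
    }
    return -1;                              /* no free slot: II++ and retry */
}

int main(void)
{
    const int II = 2;
    /* Five operations that all become ready at cycle 0 on 4 ALUs:
       four share MRT row 0, the fifth is pushed to cycle 1. */
    for (int k = 0; k < 5; k++)
        printf("op%d -> cycle %d\n", k, schedule_op(0, 0, true, false, II));
    return 0;
}

A real driver would walk over the pre-ordered list, call such a placement
routine per operation, and restart the whole loop with II+1 whenever it
returns -1.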
HRMS – Example (cont.)

Resulting HRMS schedule for the motivating example: order A, B, C, D, E, F, G,
where E is scheduled ALAP and all other operations ASAP.
The kernel again has II = 2, but now MaxLive = 6, one register less than with
plain ASAP or ALAP ordering.
(Figure: the resulting kernel.)


HRMS – Results

Evaluation on the Perfect Club benchmarks:
- 97.4 % of the loops obtained the optimal II.
- Works better than Slack scheduling and FRLC scheduling (references in the
  paper).
- About the same performance as SPILP (an optimal algorithm based on integer
  linear programming, ILP), but with lower computational complexity.
- Comparison with further algorithms in the paper.


HRMS – Conclusion

- Works well for loops with high register pressure.
- Low time complexity.
- Tested on a large benchmark suite.

Reference: J. Llosa, M. Valero, E. Ayguadé, A. González: Hypernode Reduction
Modulo Scheduling. Proc. 28th ACM/IEEE Int. Symposium on Microarchitecture,
pp. 350-360, IEEE Computer Society Press, 1995.


Modulo Scheduling with Recurrences?

Loop-carried dependence cycles (recurrences) constrain II via RecMII; see
also the handling of circular dependences in the pre-ordering step above.


Modulo Scheduling and Register Allocation

Software pipelining tends to increase register pressure:
- live ranges may span several iterations,
- which may lead to (more) register spill.

This introduces new problems:
- Should we spill or increase II?
- How to choose the variables to spill?
- Integrated software pipelining (later).


Register Allocation for Modulo-Scheduled Loops

We call a live range self-overlapping if it is longer than II. Such a live
range needs more than one physical register and is hard to handle properly
without hardware support at the target level.

- Modulo variable expansion: unroll the kernel and rename the symbolic
  registers until no self-overlapping live ranges remain (a small
  source-level sketch is given after the summary below).
- A-priori avoidance of self-overlapping live ranges by live range splitting
  (inserting copy operations before modulo scheduling)
  [Stotzer, Leiss LCTES-2009].
- Hardware support: rotating register files. An iteration control pointer
  points to a window in the cyclic loop register file and is advanced by the
  hardware loop control.


Modulo Scheduling for Loops (Example)

Given: a VLIW processor with 3 units:
- adder (latency 1),
- multiplier/MAC (latency 3),
- memory access unit (latency 3).

Example loop (dot product), with a loop-carried data dependence on s of
distance +1; loop control omitted for simplicity (alternatively ZOL,
zero-overhead loop hardware):

  s = 0.0;  t = a[0]*b[0];
  for (i = 1; i < N; i++) {
      s = s + t;
      t = a[i] * b[i];
  }
  s = s + t;

Kernel of the modulo schedule (steady state, i = 4, ..., N-1), with each
operation annotated by the iteration it belongs to:
  Adder:       s + t (i-3),  a+i (i),  b+i (i)
  Multiplier:  a[i] * b[i] (i-2)
  Memory unit: ld b[i] (i-1),  ld a[i] (i)
  =>  II = 3

Modulo scheduling can benefit from integration with instruction selection:
if the multiply and the accumulation s + t are covered by a single MAC
(multiply-accumulate) instruction, the kernel becomes
  Adder:       a+i (i),  b+i (i)
  MAC unit:    mac (i-3)
  Memory unit: ld b[i] (i-1),  ld a[i] (i)
  =>  II = 2


Summary: Software Pipelining / Modulo Scheduling

- Software pipelining: move operations across iteration boundaries.
- Simplest technique: modulo scheduling = filling a modulo reservation table.
  - Heuristics, e.g. HRMS, Swing Modulo Scheduling, ...
  - Optimal methods, e.g. integer linear programming.
  - (The problem is NP-complete, like acyclic scheduling.)
- Better resource utilization, more ILP, also in the presence of loop-carried
  data dependences.
- In general, higher register need, possibly longer code.
- Self-overlapping live ranges need special treatment.
- Loop unrolling can leverage additional optimization potential.
- Up to now applied only at target code level, hardly integrated with other
  phases (sometimes with register allocation).
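To make the modulo variable expansion idea from the register allocation
section concrete, here is a small source-level C sketch (not from the
lecture, purely illustrative): a value is loaded two kernel iterations before
it is accumulated, so its lifetime is 2*II and a single symbolic register
would overlap its own next lifetime; unrolling the kernel by
ceil(lifetime/II) = 2 and renaming the temporary into t0/t1 removes the
self-overlap.

#include <stdio.h>

#define N 8

int main(void)
{
    int a[N], s = 0;
    for (int i = 0; i < N; i++) a[i] = i + 1;

    /* Modulo variable expansion, shown at source level: the kernel is
       unrolled twice and the in-flight value gets two names, t0 and t1,
       so each name holds exactly one live value at a time. */
    int t0 = a[0], t1 = a[1];          /* prologue: two values in flight   */
    int i;
    for (i = 2; i + 1 < N; i += 2) {   /* kernel, unrolled by two          */
        s += t0;  t0 = a[i];           /* copy 1 of the original kernel    */
        s += t1;  t1 = a[i + 1];       /* copy 2, with the renamed temp    */
    }
    s += t0;  s += t1;                 /* epilogue: drain in-flight values */
    if (i < N) s += a[i];              /* leftover element if N is odd     */
    printf("s = %d\n", s);             /* sums 1..8 = 36                   */
    return 0;
}

In compiled code the renaming applies to symbolic registers in the kernel
rather than to C variables, but the effect is the same: no live range is
longer than the (unrolled) kernel, so ordinary register allocation suffices
without rotating-register-file hardware.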