DF00100 Advanced Compiler Construction Phase ordering problems TDDC86 Compiler Optimizations and Code Generation Integrated Code Generation IR gcc, lcc Phase ordering problems Instruction selection Clustered VLIW Processors: More phase ordering problems Integrated Code Generation Our research project: OPTIMIST target code Register allocation Instruction scheduling Christoph Kessler, IDA, Linköpings universitet, 2009. C. Kessler, IDA, Linköpings universitet. TDDC86 Compiler Optimizations and Code Generation 2 Phase ordering problems (1) Phase ordering problems (2) Instruction scheduling vs. register allocation Conflicts Instruction selection Scheduling / Reg. alloc. Selection first: Cost attribute of a pattern covering rule (a) Scheduling first: (b) Register allocation first: determines Live-Ranges Register need, possibly spill-code to be inserted afterwards Reuse of same register by different values introduces ”artificial” data dependences constrains scheduler is only a coarse estimate of the real cost (effect on e.g. overall time) Real cost based on currently free functional units other instructions ready to execute simultaneously pending latencies of already issued but unfinished instructions Integration with instruction scheduling desirable Mutations with different resource requirements C. Kessler, IDA, Linköpings universitet. TDDC86 Compiler Optimizations and Code Generation 3 a=2*b oder a = b << 1 oder a=b+b ? Different instructions with different register need TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 4 More phase ordering problems Clustered VLIW processor E.g., TI C62x, C64x DSP processors Register classes Parallel execution constrained by operand residence In parallel e.g.: Data bus Register File A v u + u Register File B u v Unit B1 Unit A1 load on A || load on B || move AB Mapping instructions cluster should preferably know already beforehand about free move-slots in schedule... Instruction scheduling must know mapping to generate moves where needed Heuristic [Leupers’00] add.A1 u, v C. Kessler, IDA, Linköpings universitet. add.B1 u, v 5 TDDC86 Compiler Optimizations and Code Generation iterative optimization by simulated annealing C. Kessler, IDA, Linköpings universitet. 6 TDDC86 Compiler Optimizations and Code Generation 1 Integrated Code Generation for Clustered VLIW Architectures Integrated Code Generation for non-clustered architectures IR Instruction selection Target code Register allocation Instruction scheduling C. Kessler, IDA, Linköpings universitet. 7 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 8 TDDC86 Compiler Optimizations and Code Generation Integrated Code Generation Some Approaches Example Weak integration / Phase coupling Clustered 8-issue VLIW processor TI C6201 E.g., register pressure aware scheduler [Goodman/Hsu ’88], HRMS… Heuristic Phase Interleaving Leupers’ example E.g., alternating partitioning / scheduling for clust. VLIW [Leupers’00] INDIR INDIRINDIR INDIR INDIRINDIRINDIR INDIR Integration of 2 phases Instruction selection and register allocation DP DP (Dynamic pr.) for space optimization, e.g. [Aho/Johnsson ’77] for space- or time optimization, e.g. [Fraser et al.’92] Instruction scheduling and register allocation ILP (Integer Linear Progr.) e.g. [Kästner’00] Integration of 3 and more phases ILP for simple, non-pipelined RISC processor [Wilson et al.’94] DP, for clustered VLIW: OPTIMIST [K., Bednarski’06] ILP, for clustered VLIW: Integrated software pipelining [Eriksson, K.’09] C. Kessler, IDA, Linköpings universitet. 9 TDDC86 Compiler Optimizations and Code Generation TDDC86 Compiler Optimizations and Code Generation [A. Bednarski, PhD thesis, Linköping Univ., 2006] Retargetable integrated code generator Open Source www.ida.liu.se/~chrke/optimist x x 11 10 Processor specification language xADML Our project: OPTIMIST C. Kessler, IDA, Linköpings universitet. C. Kessler, IDA, Linköpings universitet. TDDC86 Compiler Optimizations and Code Generation <architecture omega="8"> Specify reservation OPx L1 L2 X2 <registers> ... </registers> tables by <residenceclasses> ... </residenceclasses> t+2 X <funits> ... </funits> <cycle_matrix> t+1 X X <patterns> ... </patterns> ... <instruction_set> t X X </cycle_matrix> <instruction id="ADDP4" op="4407"> <target id="ADD .L1" op0="A" op1="A" op2="A" use_fu="L1"/> <target id="ADD .L2" op0="B" op1="B" op2="B" use_fu="L2"/> + ... </instruction> ... <transfer> <target id="MOVE" op0=“A" op1=“B"> <use_fu="X2"/> <use_fu="L1"/> </target> ... </transfer> </instruction_set> TDDC86 Compiler Optimizations and Code Generation </architecture> C. Kessler, IDA, Linköpings universitet. 12 2 DF00100 Advanced Compiler Construction Optimal Integrated Code Generation ... TDDC86 Compiler Optimizations and Code Generation Why ”Optimal” ? Quantitative assessment of heuristics Feedback to architecture design of an application specific processor Often feasible for small, sometimes not so small instances absolute APPENDIX Optimal Integrated Code Generation with OPTIMIST thanks to modern solver technology and problem transformations Christoph Kessler, IDA, Linköpings universitet, 2009. Optimal Method 0: Exhaustive Enumeration instead of relative comparison Scientific challenge C. Kessler, IDA, Linköpings universitet. 14 TDDC86 Compiler Optimizations and Code Generation Interleaved Exhaustive Enumeration Any ordering of these loops is possible... For all possible instruction selections For all possible schedules with each selection For all possible moves of live values ... Explores a decision tree (enumeration tree, backtracking tree) with partial solutions (= code for sub-DAGs) Example: (only) instruction scheduling = topological sorting Evaluate resulting code, keep optimum Variant: Interleaved exhaustive enumeration C. Kessler, IDA, Linköpings universitet. 15 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 16 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 17 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 18 TDDC86 Compiler Optimizations and Code Generation 3 Optimal Method I: Dynamic Programming Compression of the solution space (tree) Example: (only) instruction scheduling C. Kessler, IDA, Linköpings universitet. 19 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 20 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 21 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 22 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 23 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 24 TDDC86 Compiler Optimizations and Code Generation 4 Dynamic Programming, Regular VLIW / Superscalar Processor Compression Theorem I: Two partial solutions (codes) are equivalent, if they (a) compute the same subDAG, and (b) end up in the same pipeline state ”time profiles” [K., Bednarski LCTES 2001] C. Kessler, IDA, Linköpings universitet. 25 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 26 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 27 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 28 TDDC86 Compiler Optimizations and Code Generation Results (1) – DP Dynamic Programming, Clustered VLIW Processor (1) Regular 3-issue VLIW Processor (no MOVEs) Register classes / Residence classes Random Instruction selection constrained by operand residence DAGs Add Move-instructions speculatively more solutions... Data bus Register File A + u Register File B v u v Unit B1 Unit A1 add.A1 u, v C. Kessler, IDA, Linköpings universitet. 29 TDDC86 Compiler Optimizations and Code Generation u C. Kessler, IDA, Linköpings universitet. add.B1 u, v 30 TDDC86 Compiler Optimizations and Code Generation 5 Dynamic Programming, Clustered VLIW Processor (2) Compression Theorem II: Two partial solutions (codes) are equivalent, if they (a) compute the same subDAG and (b) end up in the same pipeline-state and (c) hold the live values at the end in the same residence classes C. Kessler, IDA, Linköpings universitet. 31 TDDC86 Compiler Optimizations and Code Generation Dynamic Programming, Clustered VLIW Processor (3) Partial solutions for (subDAG, pipeState, LiveV-Residences) C. Kessler, IDA, Linköpings universitet. 32 TDDC86 Compiler Optimizations and Code Generation Results (2) – DP Results (3) – DP Clustered 8-issue VLIW Processor TI C6201 Clustered VLIW Processor TI C62x Leupers’ TI C64x DSP Benchmarks and FreeBench example Measure for degree of compression INDIR INDIRINDIR INDIR INDIRINDIRINDIR INDIR C. Kessler, IDA, Linköpings universitet. 33 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. Results (4) – DP Heuristic Methods Single-issue vs. Single-cluster vs. Double-cluster architecture Heuristic Pruning 34 TDDC86 Compiler Optimizations and Code Generation Limit number of considered schedules, moves Genetic Programming C. Kessler, IDA, Linköpings universitet. 35 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 36 TDDC86 Compiler Optimizations and Code Generation 6 Results (5) – Heuristic Pruning Results (6) – Heuristic Pruning Regular VLIW processor TI C62x Clustered VLIW DSP C. Kessler, IDA, Linköpings universitet. 37 TDDC86 Compiler Optimizations and Code Generation See M. Eriksson, PhD thesis, Linköping U., 2011 C. Kessler, IDA, Linköpings universitet. 38 TDDC86 Compiler Optimizations and Code Generation Optimal Method II: Integer Linear Programming Model by 0/1-variables e.g. ci,p,k,t und linearen Constraints Example: covering constraint for instruction selection Supply of pattern instances No spilling Total spilling Partial spilling (Rematerialization) C. Kessler, IDA, Linköpings universitet. 39 TDDC86 Compiler Optimizations and Code Generation ILP-Model (1) C. Kessler, IDA, Linköpings universitet. for all i in VG, SMusterinstanzen p Sk in p St=1...Tmax C. Kessler, IDA, Linköpings universitet. ci,p,k,t = 1 (1) 40 TDDC86 Compiler Optimizations and Code Generation 42 TDDC86 Compiler Optimizations and Code Generation ILP-Model (2) 41 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 7 ILP-Model (3) Results (7) – DP vs. ILP (AMPL+CPLEX) Calculate makespan from the variables ci,p,k,t TI C64x DSP Benchmarks and FreeBench Minimize makespan t: min t Observations: On the average, DP can solve larger problem instances but ILP is often faster (where it terminates). www.ampl.com Problem specified in AMPL DP works better for deep dependence chains. ILP Solver (CPLEX, Gurobi, lp_solve, ...) C. Kessler, IDA, Linköpings universitet. 43 TDDC86 Compiler Optimizations and Code Generation C. Kessler, IDA, Linköpings universitet. 44 TDDC86 Compiler Optimizations and Code Generation Selected Project Publications Christoph Kessler, Andrzej Bednarski: Optimal integrated code generation for VLIW architectures. Concurrency and Computation: Practice and Experience 18: 1353-1390, 2006. Andrzej Bednarski, Christoph Kessler: Optimal Integrated VLIW Code Generation with Integer Linear Programming. Proc. Euro-Par 2006 conference, Springer LNCS 4128, pp. 461-472, Aug. 2006. Andrzej Bednarski: Optimal Integrated Code Generation for Digital Signal Processors. PhD thesis, Linköping university - Institute of Technology, Linköping, Sweden, June 2006. Christoph Kessler, Andrzej Bednarski, Mattias Eriksson: Classification and generation of schedules for VLIW processors. Concurrency and Computation: Practice and Experience 19: 2369-2389, 2007. Mattias Eriksson, Christoph Kessler: Integrated Modulo Scheduling for Clustered VLIW Architectures. Proc. HiPEAC-2009 High-Performance and Embedded Architecture and Compilers, Paphos, Cyprus, Jan. 2009. Springer LNCS 5409, pp. 65-79. Mattias Eriksson: Integrated Code Generation. PhD thesis, Linköping Univ., 2011 www.ida.liu.se/~chrke/optimist C. Kessler, IDA, Linköpings universitet. 45 TDDC86 Compiler Optimizations and Code Generation 8