Towards a More Principled Compiler: Progressive Backend Compiler Optimization David Koes 8/28/2006 School of Computer Science 50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 1 2.8Ghz Pentium 4, 1GB RAM, -O3 … Nov-09 Nov-08 Nov-07 Nov-06 Nov-05 Nov-04 Nov-03 Nov-02 Nov-01 Nov-00 Nov-99 Nov-98 Nov-97 Nov-96 0% Nov-95 SPEC2000 Performance Improvement Performance Gains Due to Compiler (gcc) School of Computer Science 50% 45% is this possible? 40% How do we exploit the existing optimization potential? 35% 30% 2 Yes! Need a more principled compiler 25% 20% 15% 10-30% improvement just from reordering compiler phases 10% 5% http://www.cs.rice.edu/~keith/Adapt/ Nov-09 Nov-08 Nov-07 Nov-06 Nov-05 Nov-04 Nov-03 Nov-02 Nov-01 Nov-00 Nov-99 Nov-98 Nov-97 Nov-96 0% Nov-95 SPEC2000 Performance Improvement The Future of Compiler Optimization School of Computer Science 3 Nov-05 May-05 Nov-04 May-04 Nov-03 May-03 Nov-02 May-02 Nov-01 May-01 Nov-00 May-00 Nov-99 May-99 Nov-98 May-98 Nov-97 May-97 Nov-96 May-96 Nov-95 Code size improvement Compiler code size improvement 25% 20% 15% 10% 5% 0% School of Computer Science A Principled Compiler A compiler that – knows right from wrong (less optimal from more optimal) – follows a rigorous procedure to get the desired output 4 School of Computer Science Today’s Compiler Problems target independent copy … GVN prop DCE strength SCCP inlinin reduct code PRE consg t motion loop unroll prop – some phases not internally optimal • purely heuristic solution – machine description mostly ignored – lack of integration between phases target dependent insn sched reg alloc insn select branc peephol h opt e … 5 machine description optimized program School of Computer Science Ideal Compiler copy prop const prop stren gth reduc t reg alloc loop unroll inline DC E code motio n PRE GV N CSE peephole insn selec t branc h opt SCC P … – each phase locally optimal – makes full use of machine description – tight integration between phases Absolutely no idea how to do this or if it’s even possible machine description optimized program 6 School of Computer Science Towards a More Principled Compiler copy prop const prop stren gth reduc t reg alloc loop unroll inline DC E code motio n PRE GV N – each phase locally optimal – makes full use of machine description – tight integration between phases CSE peephole insn selec t branc h opt SCC P … machine description optimized program 7 School of Computer Science Outline I. II. III. IV. V. 8 Motivation Related Work Completed Work Proposed Work Contributions & Timeline School of Computer Science Related Work Register Allocation Problem unbounded number of program variables …spill code optimization v = 1 w = v + 3 rematerialization x = w + v register u = v allocator t = u + x print(x); memory operands print(w); print(t); print(u); … 9 limited number of processor registers + slow memory eax register preferences ebx ecx edx esi edi esp ebp live range splitting School of Computer Science Related Work Register Allocation Previous Work Method Expressive Fast Optimal / / Linear Scan Graph Coloring Integer Linear Programming Partitioned Boolean Quadratic Programming 10 School of Computer Science Related Work Instruction Selection Problem Assem IR instruction selector IR Representation minimum cost tiling ? 11 movl leal leal leal (p),t1 (x,t1),t2 1(y),t3 (t2,t3),r School of Computer Science Related Work Instruction Selection Previous Work Method DAG Tiling Register Allocation Aware Fast Optimal Dynamic Programming Binate Covering Peephole Based Instruction Selection AVIV Code Generator Exhaustive Search 12 School of Computer Science Outline I. II. III. IV. V. 13 Motivation Related Work Completed Work Proposed Work Contributions & Timeline School of Computer Science Completed Work A More Principled Register Allocator – fully utilize machine description reg alloc • explicit and expressive model of costs of allocation for given architecture – optimal solutions machine description 14 School of Computer Science Completed Work Multi-commodity Network Flow: An Expressive Model Given network (directed graph) with – cost and capacity on each edge – sources & sinks for multiple commodities Find lowest cost flow of commodities NP-complete for integer flows a Example: b edges have unit capacity 1 a 15 0 b School of Computer Science Completed Work Register Allocation as a MCNF Variables Commodities a Variable Definition Source Variable Last Use Sink a Nodes Allocation Classes (Reg/Mem/Const) r0 r1 mem 1 Registers Limits Node Capacities Spill Costs Edge Costs r1 mem 1 mem 1 3 r0 r1 Allocation Flow 16 School of Computer Science Completed Work Example Source Code int example(int a, int b) { int d = 1; int c = a - b; return c+d; } Pre-alloc Assembly MOVE 1 -> d SUB a,b -> c ADD c,d -> c MOVE c -> r0 17 load cost insn pref cost mem access cost School of Computer Science Completed Work Control Flow MCNF can only represent straight-line code – need to link together networks from basic blocks Extend MCNF model with merge and split nodes to implement boundary constraints. a: %eax a: %eax a: %eax a: mem a: mem details in proposal document… along with modeling persistence of values in memory a: mem a: mem 18 School of Computer Science Completed Work A Better Register Allocator – fully utilize machine description reg alloc machine description 19 • explicit and expressive model of costs of allocation for given architecture: Global MCNF – locally optimal • NP-hard, so use progressive solution technique School of Computer Science Completed Work A Better Register Allocator – fully utilize machine description reg alloc machine description 20 • explicit and expressive model of costs of allocation for given architecture: Global MCNF – locally optimal • NP-hard, so use progressive solution technique School of Computer Science Completed Work Progressive Solution Technique Quickly find a good allocation Then progressively find better allocations Allocation Quality – until optimal allocation found – or time limit is reached Technique: Lagrangian relaxation directed allocators Compile Time 21 School of Computer Science Completed Work Lagrangian Relaxation: Intuition Relaxes the hard constraints – only have to solve single commodity flow Combines easy subproblems using a Lagrangian multiplier (price) – an additional price on each edge – a price on each split/merge node a Example: b a b edges have unit capacity with price, solution to single commodity flow can be solution to multicommodity flow 1 a 22 0 b 1 a 0+1 b School of Computer Science Completed Work Solution Procedure a Compute prices with iterative subgradient optimization 1 – guaranteed converge to optimal prices – optimal for linear relaxation 23 0+1 a b a b At each iteration, construct a feasible integer solution using current prices – iterative allocator in document – simultaneous allocator – trace-based simultaneous allocator b 1 a 0+1 b School of Computer Science Completed Work Simultaneous Allocator Edges to/from memory cost 3 Current cost: -2 -3 -1 X X 24 School of Computer Science Completed Work Trace-Based Allocation Decompose function into traces of basic blocks – run simultaneous allocator on each trace – control flow internal to trace presents difficulty addressed in proposal document 25 School of Computer Science Completed Work Evaluation Implemented in gcc 3.4.4 targeting x86 Optimize for code size – perfect static evaluation – important metric in its own right MediaBench, MiBench, Spec95, Spec2000 – over 10,000 functions 26 School of Computer Science Completed Work Progressiveness squareEncrypt 27 School of Computer Science Completed Work Progressiveness quicksort 28 School of Computer Science Completed Work Average code size improvement over graph allocator Code Size 8% 7% 6% 5% 4% 3% 2% 1% 0% initial heuristics only 29 10 iterations 100 iterations 1000 iterations default allocator School of Computer Science Completed Work Optimality 100% 90% Percent of functions 80% 70% Proven optimality 60% >25% from optimal <=25% from optimal <=10% from optimal <=5% from optimal <=1% from optimal optimal 50% 40% 30% 20% 10% 0% 30 1 iteration 10 iterations 100 iterations 1000 iterations School of Computer Science Completed Work Compile Time Slowdown :-( Default Allocator Graph Allocator 9.2x slower Progressive Allocator 0 5 10 15 20 Factor slower than graph allocator 31 Initialization Initial Iterative Allocation Initial Simultaneous Allocation One Iteration School of Computer Science Completed Work A Better Register Allocator – fully utilize machine description reg alloc machine description 32 • explicit and expressive model of costs of allocation for given architecture: Global MCNF – locally optimal • approach optimality using progressive solution technique: Lagrangian directed allocators School of Computer Science Outline I. II. III. IV. V. 33 Motivation Related Work Completed Work Proposed Work Contributions & Timeline School of Computer Science Proposed Work A Better Better Register Allocator Solver Improvements – Improve initial solution – Improve quality as prices converge – Hope to prove approximation bounds Model Improvements – Improve accuracy of model – Model simplification – Represent uniform register sets efficiently 34 School of Computer Science Proposed Work Model Simplification Summarize overly expressive sections of the model Conservative simplification does not change optimal value Aggressive simplification explore tradeoff between model complexity and optimality 35 School of Computer Science Proposed Work Instruction Selection Interaction which instruction is best depends on the register allocator perform same operation so let register allocator decide 36 School of Computer Science Proposed Work Register Allocation Aware Instruction SElection (RA2ISE) Instruction selection not finalized until register allocation IR tiled with Register Allocation Aware Tiles (RAATs) A RAAT represents several instruction sequences – different costs – a sequence for every possible register allocation 37 School of Computer Science Proposed Work 2 RA ISE RAAT IR tiling model creatio n cwtl %eax 38 register allocation School of Computer Science Proposed Work Implementing RA2ISE Add side-constraints to Global MCNF model – implement inter-variable preferences and constraints • “if x allocated to r1 and y allocated to r2, then save three bytes” • “x and y must be allocated to the same register” Implement x86 RAATs – RAAT tables created manually – GMCNF RAAT representation automatically generated from RAAT table with minimum use of side constraints Algorithms for tiling RAATs – leverage existing algorithms – exploit feedback between passes 39 School of Computer Science Proposed Work Tiling RAATs 3 1 4 2 2 1 4 1 2 5 3 1 1 3 1 2 4 3 4 1 edx mem mem 31 2 eax 4 3 40 School of Computer Science Proposed Work Evaluation Implement in production quality compiler (gcc) Evaluate code size and simple code speed metric Evaluate on three different architectures – x86 (8 registers) – 68k/ColdFire (16 registers) – PPC (32 registers) 41 School of Computer Science Outline I. II. III. IV. V. 42 Motivation Related Work Completed Work Proposed Work Contributions & Timeline School of Computer Science Contributions RA2ISE – register allocation aware tiles (RAATs) explicitly encode effect of register allocation on instruction sequence – algorithms for tiling RAATs – expressive model of register allocation that operates on RAATs and explicitly represents all important components of register allocation – progressive solver for this model that can quickly find decent solution and approaches optimality as more time is allowed for compilation Comprehensive evaluation of RA2ISE 43 School of Computer Science Thesis Statement RA2ISE is a principled and effective system for performing instruction selection and register allocation. 44 School of Computer Science One Step Towards a More Principled Compiler copy prop loop unroll const prop stren inline gth reduc peept hole reg reg alloc alloc insn selec t 45 DC E code motio n PRE GV N CSE branc h opt SCC P … machine description optimized program School of Computer Science Timeline 46 Fall 2006 add simple speed metric option to model begin model simplification work improve model accuracy and solver performance Winter 2006 finish model simplification work add side-constraints to model implement existing gcc tiles as RAATs improve model accuracy and solver performance Spring 2007 finish implementation of side-constraints and gcc RAATs begin work on RA2ISE infrastructure create gcc-independent set of RAATs for x86 improve model accuracy and solver performance Summer 2007 finish work on RA2ISE investigate and develop tiling algorithms improve model accuracy and solver performance Fall 2007 add 68k/ColdFire and PowerPC targets investigate uniform register set simplifications improve model accuracy and solver performance Winter 2007 begin writing thesis work on improving compile time performance Spring 2008 finish writing thesis School of Computer Science Andrew Richard Koes 47 School of Computer Science Questions? ? 48 School of Computer Science 49 7000% 6000% 5000% 4000% 3000% 2000% 1000% Performance Double every 24 months May-06 Nov-05 May-05 Nov-04 May-04 Nov-03 May-03 Nov-02 May-02 Nov-01 May-01 Nov-00 May-00 Nov-99 May-99 Nov-98 May-98 Nov-97 May-97 Nov-96 May-96 0% Nov-95 SPEC2000 Performance Improvement Processor Performance w/o Compiler Improvements Double every 18 months School of Computer Science Instruction Selection & Register Allocation reg alloc insn selec t 50 – fully utilize machine description – locally optimal – tight integration between phases machine description School of Computer Science Costs of Register Allocation Spilling to/from memory movl 8(%ebp), %edx Direct memory access addl 8(%ebp), %eax Moving between registers movl %edx,%ecx Rematerialization of constant value movl $3,%eax Register usage preferences imul %edx,%eax vs. imul %edx,%ecx 51 School of Computer Science Iterative Heuristic Allocator Allocate each variable in a heuristic priority order Find shortest path in each block – avoid edges that make remaining problem infeasible Process blocks in topological order – allocation at block entry fixed by previous blocks Intuition: – shortest path is minimum cost allocation for a variable – allocate most significant variables first Limitation: – greedy: can’t undo poor decisions 52 School of Computer Science Iterative Heuristic Allocator Allocation order: Edges to/from memory cost 3 a, b b, c c, d a Cost: 0 4 0 -2 Total: 2 53 School of Computer Science Simultaneous Allocator Scan each block – maintain an allocation of all live variables – at variable definition find cheapest allocation • allocation with shortest path to variable’s sink or block exit • allowed to evict (reallocate) already allocated variable – eviction cost shortest path to edge from current allocation to new allocation in this block – cost of eviction added to shortest path cost Intuition: – minimizing cost for all variables at once Limitation: – path computations limited to single block – future blocks do not change previous block allocations 54 School of Computer Science Trace-Based Allocation Decompose function into traces of basic blocks – run simultaneous allocator on each trace – control flow internal to trace • update only blocks that are necessary (easy-update) • update all effected blocks (full-update) easy-update full-update 55 School of Computer Science Accuracy of the Model 60% Global MCNF model correctly predicts costs of register allocation within 2% for 72.5% of functions compiled Percent of functions 50% 40% 30% 20% larger than predicted (under-predicted) 10% smaller than predicted (over-predicted) % 10 > 510 % 5% 3- 3% 2- 2% 1- 1% 0- 0% 0% 1- 1% 2- 2% 3- 3% 5- 10 -5 % > 10 % 0% Percent difference between predicted and actual size 56 School of Computer Science Initialization 57 Initial Iterative Allocation Initial Simultaneous Allocation geo. mean 099.go 124.m88ksim 129.compress 130.li 132.ijpeg 134.perl 147.vortex 164.gzip 168.wupwise 171.swim 173.applu 175.vpr 176.gcc 181.mcf 183.equake 188.ammp 197.parser 254.gap 255.vortex 256.bzip2 300.twolf 301.apsi CRC32 adpcm_d adpcm_e basicmath bitcount blowfish_d blowfish_e dijkstra epic_d epic_e g721_d g721_e gsm_d gsm_e ispell jpeg_d jpeg_e lame mesa mpeg2_d mpeg2_e patricia pegwit_d pegwit_e pgp_d pgp_e qsort rasta sha stringsearch susan Factor slower than graph allocation Compile Time Slowdown :-( 50 45 40 35 30 25 20 15 10 5 10x slower 0 One Iteration School of Computer Science Code size improvement over graph allocator Code size improvement 8% 7% 6% 5% 4% 3% 2% 1% 0% initial heuristics only 10 iterations without traces 58 100 iterations 1000 iterations default allocator with easy traces with full traces School of Computer Science 59 initial heuristics only 10 iterations 100 iterations average 099.go 124.m88ksim 129.compress 130.li 132.ijpeg 134.perl 147.vortex 164.gzip 168.wupwise 171.swim 173.applu 175.vpr 176.gcc 181.mcf 183.equake 188.ammp 197.parser 254.gap 255.vortex 256.bzip2 300.twolf 301.apsi CRC32 adpcm basicmath bitcount blowfish dijkstra epic g721 gsm ispell jpeg lame mesa mpeg2 patricia pegwit pgp qsort rasta sha stringsearch susan Code size improvement over graph allocator Code Size Improvement 20% 15% 10% 5% 0% -5% 1000 iterations School of Computer Science 60 initial heuristic only 10 iterations 100 iterations average 099.go 124.m88ksim 129.compress 130.li 132.ijpeg 134.perl 147.vortex 164.gzip 168.wupwise 171.swim 173.applu 175.vpr 176.gcc 181.mcf 183.equake 188.ammp 197.parser 254.gap 255.vortex 256.bzip2 300.twolf 301.apsi CRC32 adpcm basicmath bitcount blowfish dijkstra epic g721 gsm ispell jpeg lame mesa mpeg2 patricia pegwit pgp qsort rasta sha stringsearch susan Code size improvement over default allocator Code Size Improvement 12% 10% 8% 6% 4% 2% 0% -2% -4% -6% 1000 iterations School of Computer Science Code Performance 61 School of Computer Science Integrating Register Allocation and Instruction Selection int foo(int a, short b) { return a*4+b; } 4 3 4 1 1 62 movl 4(%esp),%eax sall $2,%eax addl 8(%esp),%eax cwtl ret 5 4 3 1 movswl 8(%esp),%edx movl 4(%esp),%eax leal (%edx,%eax,4),%eax ret School of Computer Science Another RAAT 63 School of Computer Science