Adapting Compilation Techniques to Enhance the Packing of Instructions into Registers
Stephen Hines, David Whalley and Gary Tyson
Computer Science Dept., Florida State University
October 23, 2006

Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed: multiple IRF references
  - Loosely packed: piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)

Execution of IRF Instructions
[Figure: Executing a Tightly Packed Param4c Instruction. Labels: Instruction Fetch Stage (PC, Instruction Cache, packed instruction, IF/ID); First Half of Instruction Decode Stage (IRF, IRWP, IMM, To Instruction Decoder).]

Outline
- Introduction
- Improved Promotion to the IRF
- Compiler Optimizations
  - Instruction Selection
  - Register Re-assignment
  - Instruction Scheduling
- Experimental Evaluation
- Conclusions & Future Work

Improved Promotion to the IRF
- Different classes of instructions can consume 1 to 5 slots
- More accurately model the benefits of promoting from one class of instruction to another
- Original IRF papers did not promote multiple I-type instructions with different default immediate values
  - addi $3, $3, 4 and addi $3, $3, 1 would not both reside in the IRF, no matter how frequently they occurred

Mixed Profiling
- Static profiling is best for decreasing code size
- Dynamic profiling is best for reducing energy consumption
- Can
simultaneously weight static and dynamic profile data to obtain a mixed result that has both good code compression and reduced energy consumption
- Can obtain most of the benefits of individual static/dynamic profiling

Compiler Optimizations
- Instruction Selection
  - Choose beneficial encodings for increasing redundancy
- Register Re-assignment
  - Attempts to rename registers such that instructions can be accessed via IRF
- Instruction Scheduling
  - Intra-block: focus on reordering instructions so that dense packs are formed (both tight and loose)
  - Inter-block: attempt to move instructions between blocks to fill up packs ending with branches/jumps
    - Code duplication
    - Predication

Intra-block Instruction Scheduling
[Figure: Instruction Dependence DAG, with the resulting instruction sequence shown Without Instruction Scheduling and With Instruction Scheduling.]

Code Duplication to Reduce Code Size
[Figure: blocks W, X, Y and Z with their instruction slots.]
- 6 slots is too many to fit in a single packed instruction ...
- ... but we can duplicate a single instruction ...
- ... resulting in the ability to pack the remaining 5 slots together.
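The 5-slot limit that motivates code duplication can be illustrated with a small sketch. This is not the authors' packing algorithm: the function name `greedy_tight_packs`, the tuple tags, and the single-letter instruction names are all hypothetical, and the sketch ignores parameters, loose packing, and branch handling. It only shows that a run of six IRF-resident instructions cannot fit in one tight pack, while a run of five can.

```python
MAX_SLOTS = 5  # per the slides: a tight pack holds at most 5 slots


def greedy_tight_packs(block, irf):
    """Greedily group consecutive IRF-resident instructions into packs.

    block: list of instruction strings in program order (hypothetical names).
    irf:   set of instruction strings assumed to reside in the IRF.
    Returns a list of ("pack", [insns]) and ("plain", insn) tuples.
    """
    result = []
    current = []  # IRF instructions collected for the pack being built

    def flush():
        # Emit the pack under construction, if any.
        if current:
            result.append(("pack", list(current)))
            current.clear()

    for insn in block:
        if insn in irf:
            current.append(insn)
            if len(current) == MAX_SLOTS:
                flush()  # pack is full, start a new one
        else:
            flush()  # a non-IRF instruction ends the current run
            result.append(("plain", insn))
    flush()
    return result


if __name__ == "__main__":
    six = ["a", "b", "c", "d", "e", "f"]
    for item in greedy_tight_packs(six, set(six)):
        print(item)
```

In this simplified model, six consecutive IRF instructions need two packs, whereas removing one of them from the run (for example by duplicating it elsewhere, as on the slide) lets the remaining five share a single pack.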
Predication: Forward Branches
[Figure: blocks X, Y and Z. Labels: Cond Branch, Fall-through, Branch taken path.]
- Instructions packed after forward branches will only be executed when the branch is not taken

Predication: Backward Branches
[Figure: instructions a to f around a backward branch. Labels: Branch, Branch offset.]
- Instructions packed after backward branches will only be executed when the branch is taken

Predication Advantages with IRF
- IRF facilitates a form of predication for the MIPS, a baseline architecture that traditionally does not support predication
- No need to waste instruction encoding space specifying predicate bits for most/all instructions (even ARM traded away general predication for reducing code size with Thumb and Thumb-2)
- No need to fetch, decode and possibly execute instructions that are annulled after the branch within a pack (reducing energy consumption and execution time)

Experimental Evaluation
- MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
  - Out-of-order, single-issue embedded machine with 8 KB 4-way set-associative L1 instruction and data caches and a 128-entry bimodal branch predictor
- Wattch/Cacti extensions for modeling energy consumption (inactive portions of the pipeline dissipate only 10% of normal energy when using cc3 clock gating)
- VPO (Very Portable Optimizer) targeted for SimpleScalar MIPS/PISA

Energy Consumption
[Figure: Total Energy (axis 70.0% to 100.0%) by Benchmark Category (Automotive, Consumer, Network, Office, Security, Telecomm, Average). Series: No optimizations, Promotion, Inst Selection, Reg Re-assign, Intra-sched, Inter-sched.]
Static Code Size
[Figure: Static Code Size (axis 57.5% to 97.5%) by Benchmark Category (Automotive, Consumer, Network, Office, Security, Telecomm, Average). Series: No optimizations, Promotion, Inst Selection, Reg Re-assign, Intra-sched, Inter-sched.]

IRF Promotion with Mixed Profiling
[Figure: Relative Measure (%) (axis 67.5% to 97.5%) versus Dynamic/Static Mixture (100/0 (Dynamic), 75/25, 50/50, 25/75, 0/100 (Static)). Series: Code Size, Optimized Code Size, Total Energy, Optimized Total Energy.]

Conclusions & Future Work
- Compiler optimizations targeted specifically for IRF can further reduce energy (12.2% to 15.8%), code size (16.8% to 28.8%) and execution time
- Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques

Tightly Packed Instruction Format
- New opcodes for this T-format of MISA instructions
- Supports sequential execution of up to 5 RISA instructions from the IRF
  - Unnecessary fields are padded with nop
- Supports up to 2 parameters replacing instruction slots
  - Parameters can come from 32-entry IMM
  - Each IRF entry also retains a default immediate value
  - Branches use these 5 bits for displacements
  - R-type RISA instructions can use a parameter
to replace RD field

MIPS Instruction Format Modifications
- Creating Loosely Packed instructions
  - R-type: Removed shamt field and merged with rs
  - I-type: Shortened immediate values (16-bit to 11-bit)
    - Lui now uses 21-bit immediate values, hence no loose packing
  - J-type: Unchanged
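The I-type change above has a simple arithmetic consequence: shortening the immediate from 16 to 11 bits frees 5 bits, presumably for the loosely packed IRF reference, and it narrows the range of immediates an instruction can carry. The sketch below checks whether a value still fits. The function name `fits_immediate` is hypothetical, and the signed two's-complement interpretation is an assumption, since the slides do not say how the shortened field is interpreted.

```python
ORIGINAL_IMM_BITS = 16   # standard MIPS I-type immediate width
SHORTENED_IMM_BITS = 11  # width after the modification described on the slide
FREED_BITS = ORIGINAL_IMM_BITS - SHORTENED_IMM_BITS  # bits made available


def fits_immediate(value, bits=SHORTENED_IMM_BITS, signed=True):
    """Return True if `value` is representable in `bits` bits.

    signed=True assumes two's-complement encoding (an assumption here,
    not something the slides state).
    """
    if signed:
        low = -(1 << (bits - 1))
        high = (1 << (bits - 1)) - 1
    else:
        low = 0
        high = (1 << bits) - 1
    return low <= value <= high


if __name__ == "__main__":
    for imm in (4, 1023, 1024, -1024, -1025):
        print(imm, fits_immediate(imm))
```

Under these assumptions, a compiler would need a different encoding or an extra instruction for any I-type immediate for which `fits_immediate` returns False.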