Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File
Stephen Hines, Gary Tyson, and David Whalley
Computer Science Dept., Florida State University
June 8-16, 2007

Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed – multiple IRF references
  - Loosely packed – piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)

Execution of IRF Instructions
[Pipeline diagram: a packed instruction is fetched from the instruction cache during the instruction fetch stage; in the first half of the decode stage, the IRF (indexed through the IRWP window pointer) and the IMM supply the individual instructions and immediates to the instruction decoder. The example shows execution of a tightly packed param4c instruction.]

Outline
- Introduction
- IRF and Instruction Packing Overview
- Integrating an IRF with an L0 I-Cache
- Decoupling Instruction Fetch
- Experimental Evaluation
- Related Work
- Conclusions & Future Work

MIPS+IRF Instruction Formats
- T-type: opcode (6 bits) | inst1 (5) | inst2 (5) | inst3 (5) | inst4/param (5) | s (1) | inst5/param (5)
- R-type: opcode (6 bits) | rs (5) | rt (5) | rd (5) | function (6) | inst (5, in the former shamt field)
- I-type: opcode (6 bits) | rs (5) | rt (5) | immediate (11) | inst (5)
- J-type: opcode (6 bits) | win (2) | immediate (24)

Previous Work in IRF
- Register Windowing + Loop Cache (MICRO 2005)
- Compiler Optimizations (CASES 2006)
  - Instruction Selection
  - Register Renaming
  - Instruction Scheduling
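The T-type layout above can be illustrated with a small bit-field encoder/decoder. This is my own Python sketch based only on the field widths shown on the slide, not the authors' tooling; the example field values are arbitrary.

```python
# Field widths of the tightly packed T-type word, MSB first:
# opcode(6) | inst1(5) | inst2(5) | inst3(5) | inst4/param(5) | s(1) | inst5/param(5)
FIELDS = [("opcode", 6), ("inst1", 5), ("inst2", 5), ("inst3", 5),
          ("inst4", 5), ("s", 1), ("inst5", 5)]

def pack_ttype(**vals):
    """Pack field values (MSB first) into one 32-bit T-type word."""
    word = 0
    for name, width in FIELDS:
        v = vals.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word = (word << width) | v
    return word

def unpack_ttype(word):
    """Recover the field values from a 32-bit T-type word."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out

# Round trip: three IRF slot references plus a parameterized slot (s = 1).
w = pack_ttype(opcode=0x3F, inst1=4, inst2=17, inst3=9, inst4=2, s=1, inst5=30)
assert unpack_ttype(w) == {"opcode": 0x3F, "inst1": 4, "inst2": 17,
                           "inst3": 9, "inst4": 2, "s": 1, "inst5": 30}
```

Since each inst field is 5 bits, a reference can name any of the 32 IRF entries in the current window, which is why the IRF holds 32 entries per window.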
Integrating an IRF with an L0 I-Cache
- L0 (filter) caches are small and direct-mapped
  - 256B L0 I-cache with an 8B line size [Kin97]
  - Fast hit time and low energy per access, but a higher miss rate than L1
  - Fetch energy reduced 68%, yet cycle time increased 46%!!!
- IRF reduces code size, while an L0 cache focuses only on energy reduction at the cost of performance
- IRF can alleviate the performance penalty of L0 cache misses by overlapping fetch

L0 Cache Miss Penalty
[Pipeline diagram: with an L0 I-cache alone, a miss stalls fetch, delaying each subsequent instruction (Insn1-Insn4) through the IF/ID/EX/M/WB stages.]

Overlapping Fetch with an IRF
[Pipeline diagram: a packed instruction (Pack2a/Pack2b) is fetched once but issues over two cycles; the L0 miss for Insn3 is serviced while Pack2b proceeds through the pipeline, hiding the miss penalty.]

Decoupling Instruction Fetch
- Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, ...), which artificially limits the effective design space
- Front-end throttling improves energy utilization by reducing the fetch bandwidth in areas of low ILP
- IRF can provide virtual front-end throttling: fetch fewer instructions every cycle, but allow multiple issue of packed instructions
  - Areas of high ILP are often densely packed
  - Infrequently executed sections of code have lower ILP

Out-of-order Pipeline Configurations
[Figure: the out-of-order pipeline configurations evaluated in the decoupled-fetch study.]

Experimental Evaluation
- MiBench embedded benchmark suite – six categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
- Wattch/Cacti extensions for modeling energy consumption (with cc3 clock gating, inactive portions of the pipeline dissipate only 10% of normal energy)
- VPO (Very Portable Optimizer) targeted for SimpleScalar MIPS/PISA
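The overlap shown in the pipeline diagrams above can be captured in a toy cycle model. This is my own simplification for illustration, not the SimpleScalar setup used in the evaluation: each fetch that misses in the L0 stalls the pipeline, except that a preceding pack of k instructions leaves k-1 cycles of fetch slack that can absorb a following miss.

```python
def cycles_l0(fetches, miss_penalty=1):
    """Baseline L0 I-cache: fetches is a list of (is_l0_miss, pack_size);
    every fetch takes one cycle, and a miss adds miss_penalty stall cycles."""
    cycles = 0
    for miss, _ in fetches:
        cycles += 1 + (miss_penalty if miss else 0)
    return cycles

def cycles_l0_irf(fetches, miss_penalty=1):
    """With an IRF: a fetch returning a pack of k instructions keeps the
    pipeline busy for k cycles, so up to k-1 cycles of a following L0 miss
    overlap with useful work instead of stalling."""
    cycles = 0
    slack = 0  # fetch slack left over from the previous pack
    for miss, pack in fetches:
        stall = miss_penalty if miss else 0
        stall = max(0, stall - slack)  # overlap the miss with pack issue
        cycles += pack + stall
        slack = pack - 1
    return cycles

# Four instructions; the third fetch misses in the L0.
baseline = cycles_l0([(False, 1), (False, 1), (True, 1), (False, 1)])
# Same four instructions, with the middle two packed: the miss is hidden.
with_irf = cycles_l0_irf([(False, 1), (False, 2), (True, 1)])
assert baseline == 5 and with_irf == 4
```

The model reproduces the qualitative point of the two diagrams: the IRF does not make the L0 miss cheaper, it just gives the front end something else to deliver while the miss is serviced.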
L0 Study Configuration
Parameter              | Low-Power In-order Embedded Processor
I-Fetch Queue          | 4 entries
Branch Predictor       | Bimodal, 128 entries, 3 cycle penalty
Fetch/Decode/Issue     | Single instruction
RUU Size               | 8
LSQ Size               | 8
L1 Data Cache          | 16 KB, 256 lines, 16B line, 4-way set assoc., 1 cycle hit
L1 Instruction Cache   | 16 KB, 256 lines, 16B line, 4-way set assoc., 1/2 cycle hit
L0 Instruction Cache   | 256 B, 32 lines, 8B line, direct mapped, 1 cycle hit
Memory Latency         | 32 cycles
IRF/IMM                | 4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

Execution Efficiency for L0 I-Caches
[Chart: normalized IPC (30%-130%) across the MiBench benchmarks (Basicmath, Bitcount, Qsort, Susan, Dijkstra, Patricia, Jpeg, Lame, Tiff2bw, Ispell, Rsynth, Stringsearch, Blowfish, Pgp, Rijndael, Sha, Adpcm, CRC32, FFT, Gsm) and their average, for five configurations: IRF, L0, L0+IRF, 2cycle, and 2cycle+IRF.]

Energy Efficiency for L0 I-Caches
[Chart: normalized energy (60%-125%) across the same benchmarks and configurations.]

Decoupled Fetch Configurations
Parameter                 | High-end Out-of-order Embedded Processor
I-Fetch Queue             | 4/8 entries
Branch Predictor          | Bimodal, 2048 entries, 3 cycle penalty
Fetch Width               | 1/2/4
Decode/Issue/Commit Width | 1/2/3/4
RUU Size                  | 16
LSQ Size                  | 8
L1 Data Cache             | 32 KB, 512 lines, 16B line, 4-way set assoc., 1 cycle hit
L1 Instruction Cache      | 32 KB, 512 lines, 16B line, 4-way set assoc., 1 cycle hit
Unified L2 Cache          | 256 KB, 1024 lines, 64B line, 4-way set assoc., 6 cycle hit
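The asymmetric fetch/execute configurations in the table above can be illustrated with a small steady-state throughput model. This is a hypothetical sketch of the "virtual front-end throttling" idea, not the paper's simulator: fetch width stays at one fetch per cycle, but a fetched pack can feed a wider issue stage.

```python
def steady_state_ipc(pack_sizes, issue_width):
    """pack_sizes: instructions delivered by each fetch (one fetch/cycle).
    The issue stage drains up to issue_width instructions per cycle, so
    throughput is limited by whichever stage needs more cycles."""
    fetched = sum(pack_sizes)
    fetch_cycles = len(pack_sizes)
    issue_cycles = -(-fetched // issue_width)  # ceiling division
    return fetched / max(fetch_cycles, issue_cycles)

# Without packing, fetch width 1 caps IPC at 1 even on a 2-wide backend.
assert steady_state_ipc([1, 1, 1, 1], issue_width=2) == 1.0
# Densely packed code recovers the backend's full width from the same
# single-fetch front end (a 1|2-style configuration).
assert steady_state_ipc([2, 2, 2, 2], issue_width=2) == 2.0
```

This is why, on the asymmetric-bandwidth charts that follow, the narrow-fetch + IRF configurations can approach the IPC of configurations with a physically wider front end while fetching fewer words per cycle.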
Memory Latency            | 32 cycles
IRF/IMM                   | 4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

Execution Efficiency for Asymmetric Pipeline Bandwidth
[Chart: normalized IPC (90%-250%) by fetch width | execute width configuration, from 1|1 through 4|4, each with and without an IRF.]

Energy Efficiency for Asymmetric Pipeline Bandwidth
[Chart: normalized energy (70%-115%) for the same fetch width | execute width configurations, with and without an IRF.]

Energy-Delay^2 for Asymmetric Pipeline Bandwidth
[Chart: normalized energy-delay^2 (0%-100%) for the same fetch width | execute width configurations, with and without an IRF.]

Related Work
- L-caches – subdivide the instruction cache such that one portion contains the most frequently accessed code
- Loop caches – capture simple loop behaviors and replay instructions
  - Zero Overhead Loop Buffers (ZOLB)
- Pipeline gating / front-end throttling – stall fetch in areas of low IPC

Conclusions and Future Work
- IRF can alleviate fetch bottlenecks from L0 I-cache misses or branch mispredictions
- Future topic: can we pack areas where the L0 is likely to miss?
- Future topic: IRF + encrypted or compressed I-caches
- Future topic: IRF + asymmetric frequency clustering (of pipeline back-end functional units)
- Increased the IPC of the L0 system by 6.75%
- Further decreased the energy of the L0 system by 5.78%
- Decoupling fetch provides a wider spectrum of energy/performance design points to be evaluated

The End
Questions?

Energy Consumption
[Chart: total energy (70%-100%) by MiBench benchmark category (Automotive, Consumer, Network, Office, Security, Telecomm, and the average) for six compiler configurations: no optimizations, promotion, instruction selection, register re-assignment, intra-block scheduling, and inter-block scheduling.]

Static Code Size
[Chart: static code size (57.5%-97.5%) for the same benchmark categories and compiler configurations.]

Conclusions & Future Work
- Compiler optimizations targeted specifically for the IRF can further reduce energy (12.2%-15.8%), code size (16.8%-28.8%), and execution time
- Unique transformation opportunities exist due to the IRF, such as code duplication for code size reduction and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques

Instruction Redundancy
- Profiled the largest benchmark in each of the six MiBench categories
- The most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions

Compilation Framework
[Figure: the compilation framework used for IRF instruction selection and packing.]
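The redundancy measurement above can be reproduced in spirit with a short profiling sketch. This is a toy illustration using a made-up trace and hypothetical instruction strings; the actual study profiled MiBench binaries under SimpleScalar.

```python
from collections import Counter

def coverage_of_top_n(trace, n=32):
    """Fraction of a dynamic instruction stream covered by the n most
    frequent distinct instructions (the candidates for IRF residence)."""
    counts = Counter(trace)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(trace)

# Skewed toy trace: a few hot instructions dominate execution.
trace = (["addu r2,r3,r4"] * 60 + ["lw r5,0(r6)"] * 25 +
         ["beq r2,r0,L1"] * 10 + ["j L2"] * 5)
assert coverage_of_top_n(trace, n=2) == 0.85
```

The skew in real programs is what makes a 32-entry IRF worthwhile: a register file that holds only the hottest instructions can still serve roughly two-thirds of all dynamic fetches.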