Bundled Execution of Recurring Traces for Energy-Efficient General Purpose Processing
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August
University of Michigan (Intel, Northrop Grumman, UIUC, Princeton)
MICRO-44, December 6, 2011
University of Michigan, Electrical Engineering and Computer Science

Computational Efficiency Landscape
• Energy dilemma
• More gates can fit on a die
• But power constraints limit their use
• To scale performance, need to increase efficiency
[Figure: Performance (GFLOPs) vs. Power (Watts), with lines of constant efficiency in Mops/mW and an arrow toward better power efficiency. Plotted designs: embedded processors, Pentium M, Core 2, AMD Opteron, Core i7, IBM Cell, GTX 280, GTX 295, S1070, AMD 6850. Power regimes along the power axis: ultra-portable, portable with frequent charges, wall power, dedicated power network.]

Where Does The Energy Go?
• Energy used in a single-issue RISC in-order core
• Instruction fetch and decode energy dominates
• Actual execution barely consumes 10%
Plenty of opportunities to save energy… [Dally’08]

Increasing Efficiency with Accelerators
Application regularity defines success:
1. Small dominant code segments
2. Little control flow
3. Narrow application set
4. Data parallelism
[Figure: design spectrum trading flexibility against efficiency and performance: General Purpose Processors, FPGAs, ASIPs, DSPs, SIMD, Loop Accelerators, ASICs.]
• Accelerators can improve efficiency by 10–50X

Utility Factor for Accelerators
• What fraction of the code gets accelerated?
• Most solutions fail for “irregular” or “general-purpose” code
[Figure: the flexibility vs. efficiency and performance spectrum again, with an open design point marked “???”.]
Goal: A design to target irregular codes

The BERET Architecture
• A compute engine for “hot regular regions” in irregular codes
• Key insights:
1. Exploits recurring instructions (traces) to save on redundant fetches and decodes
2. Uses a bundled execution model to save on redundant register reads/writes
BERET: Bundled Execution of REcurring Traces
[Figure: a program with its hot regions highlighted; the CPU and BERET share the L1 I$ and L1 D$, with live-ins copied from the CPU to BERET and live-outs copied back.]

Insight 1: Recurring Instructions
• How about such loops?
► Typical loops in irregular codes are large and control intensive!
• We leverage looping traces for savings
1. Straight-line code → simple hardware
2. Typically short → easy to buffer
3. Significant fetch / decode savings for buffered instructions
[Figure: basic blocks (BB 0, BB 1, ...) with hot basic blocks highlighted, branch probabilities on the edges (85%/15%, 10%/90%, 50%/50%), and side exits marked “exit?”.]
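The looping-trace idea from the Insight 1 slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the trace detector used for BERET: the function name `select_looping_trace`, the greedy "follow the likeliest successor" heuristic, and the example CFG with its edge probabilities are all hypothetical and are not taken from the slide's figure.

```python
# Hypothetical CFG: block name -> list of (successor, branch probability).
# Names and probabilities are illustrative assumptions, not the slide's data.
EXAMPLE_CFG = {
    "BB1": [("BB2", 0.85), ("BB3", 0.15)],
    "BB2": [("BB5", 0.90), ("BB4", 0.10)],
    "BB3": [("BB5", 1.0)],
    "BB4": [("BB5", 1.0)],
    "BB5": [("BB20", 0.95), ("exit", 0.05)],
    "BB20": [("BB1", 0.90), ("exit", 0.10)],
    "exit": [],
}


def select_looping_trace(cfg, header):
    """Greedily follow the most likely successor, starting at `header`.

    Returns (trace, loop_back_probability). The probability is the product
    of the branch probabilities along the trace, including the back edge.
    It is 0.0 if the greedy path never returns to the header.
    """
    trace = [header]
    visited = {header}
    probability = 1.0
    current = header
    while True:
        successors = cfg.get(current, [])
        if not successors:
            return trace, 0.0  # fell off the loop (reached an exit block)
        nxt, p = max(successors, key=lambda edge: edge[1])
        probability *= p
        if nxt == header:
            return trace, probability  # back edge closes the looping trace
        if nxt in visited:
            return trace, 0.0  # inner cycle that does not include the header
        visited.add(nxt)
        trace.append(nxt)
        current = nxt


if __name__ == "__main__":
    trace, prob = select_looping_trace(EXAMPLE_CFG, "BB1")
    print(trace, prob)
```

Every branch that leaves the selected path is a side exit; a trace is only worth offloading when its loop-back probability is high, since each side exit returns control to the main processor.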
[Figure, continued: the Control Flow Graph (CFG) alongside a looping trace extracted from it.]

Frequency of Recurring Instructions
Offload stable traces in irregular loops

Insight 2: Bundled Execution
• Traditional processors issue and execute instructions in isolation…
[Figure: one dataflow graph (LD, +, /, &, >>, <<, ST operations) executed two ways. Instruction at a time: 11 instrs, 14 reads, 10 writes. Bundled execution: 3 instrs, 6 reads, 2 writes.]

Efficiency of Bundled Execution
All results normalized to a bundle length of 1
[Figure: Normalized Perf/Power vs. bundle length (1 to 5).]
Bundled execution increases datapath efficiency by more than 2x

BERET Hardware Design
• Hardware design objectives:
► Capable of executing straight-line code in a loop (traces)
► Support for bundled execution of trace instructions
► Handle trace side-exits, and transfer control to the main processor
[Figure: BERET datapath. Components: I$, D$, index bits, MUX, Configuration RAM (CRAM) supplying config. bits, input latch, SEB 1 to SEB N, output latch, store buffer, internal register file, writeback bus. Stages: Configure SEB (1–2 cycles), Execute SEB (1–5 cycles), Writeback (1–2 cycles).]
SEB: Subgraph Execution Block

Compiler Support
1. Trace Detection
2.
Mapping traces to SEBs
[Figure: compiler flow. Labels: Program; hot traces (with high loop-back probability); data flow subgraphs; BERET with SEBs configuration (SEB 0 to SEB 3). The example trace contains LD, ST, ADD, SUB, MPY, AND, OR, SHIFT, and BR operations, with exit branches (BR) shown as asserts.]

CPU-BERET Execution Flow
[Figure: execution timeline. Copy live-ins: registers are copied to BERET. BERET execution: the program executes on BERET, repeating header and body, with register files labeled RF-0 and RF-1. Side exit: an assert is discovered and the last iteration is squashed. Copy live-outs: registers are copied back to the main processor, and execution continues on the CPU.]

Energy Savings
[Figure: energy savings on the training set and the test set.]

Performance Impact
[Figure: performance impact results.]

Concluding Remarks
• Scaling program performance in an energy-constrained environment requires improving computational efficiency
• Most accelerators exploit program regularity for savings
• BERET is a configurable engine that saves energy by:
► Exploiting hot traces to avoid redundant fetches and decodes
► Using a bundled execution model to reduce temporary variable reads and writes
Energy Saving: ~35%
Performance Enhancement / Area Overhead: ~10% / 20%

Questions
• For more
► See http://cccp.eecs.umich.edu

Fine Grain Program Phase Behavior
Traditional phases are too coarse-grained to match an accelerator
[Figure: execution timeline from 0M to 10M comparing traditional phases with fine-grain phases.]
Accelerate the pink portions

Hypothesis of This Work
Irregular programs are composed of fine-grain periods of high degrees of regularity. We can identify these periods and run them on an accelerator customized for “simple” execution.
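The register-traffic argument behind bundled execution (Insight 2 and the concluding remarks) can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not BERET's actual accounting: the `Instr` type, the two cost functions, and the six-instruction trace are hypothetical, every value is assumed to be defined exactly once, and a bundle is assumed to forward internal temporaries without touching the register file. The counts it produces are for this toy trace, not the slide's 11-instruction example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str]      # None for instructions that produce no value (e.g. ST)
    srcs: Tuple[str, ...]    # names of the values this instruction consumes


def unbundled_cost(instrs: List[Instr]) -> Tuple[int, int]:
    """Every instruction reads all its sources and writes its result."""
    reads = sum(len(i.srcs) for i in instrs)
    writes = sum(1 for i in instrs if i.dest is not None)
    return reads, writes


def bundled_cost(bundles: List[List[Instr]], live_outs: Set[str]) -> Tuple[int, int]:
    """A bundle reads only values produced outside it, and writes only
    values needed by a later bundle or live out of the trace."""
    reads = writes = 0
    for index, bundle in enumerate(bundles):
        produced = {i.dest for i in bundle if i.dest is not None}
        consumed = {s for i in bundle for s in i.srcs}
        reads += len(consumed - produced)  # temporaries inside the bundle are forwarded
        later_uses = {s for b in bundles[index + 1:] for i in b for s in i.srcs}
        writes += len(produced & (later_uses | live_outs))
    return reads, writes


# Hypothetical straight-line trace, split into two bundles of three instructions.
TRACE = [
    Instr("LD", "t1", ("a",)),
    Instr("ADD", "t2", ("t1", "b")),
    Instr("SHR", "t3", ("t2", "c")),
    Instr("LD", "t4", ("d",)),
    Instr("AND", "t5", ("t4", "t3")),
    Instr("ST", None, ("t5", "e")),
]
BUNDLES = [TRACE[:3], TRACE[3:]]

if __name__ == "__main__":
    print("unbundled (reads, writes):", unbundled_cost(TRACE))
    print("bundled   (reads, writes):", bundled_cost(BUNDLES, set()))
```

In this model the savings come entirely from temporaries that never leave a bundle, which is why longer bundles help only as long as the dataflow stays inside them.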