Composite Cores: Pushing Heterogeneity into a Core
Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke
University of Michigan
Micro 45, May 8th 2012
University of Michigan Electrical Engineering and Computer Science

High Performance Cores
(Figure: performance and energy over time for a high performance core)
• High energy yields high performance
• Low performance DOES NOT yield low energy
• High performance cores waste energy on low performance phases

Core Energy Comparison
(Figure: energy of Out-of-Order vs. In-Order cores; Dally, IEEE Computer '08; Brooks, ISCA '00)
• Out-of-Order hardware contains performance-enhancing structures
  – Not necessary for correctness
• Do we always need the extra hardware?

Previous Solution: Heterogeneous Multicore
• 2+ cores
• Same ISA, different implementations
  – High performance, but more energy
  – Energy efficient, but less performance
• Share memory at a high level
  – Shared L2 cache (Kumar '04)
  – Coherent L2 caches (ARM's big.LITTLE)
• Operating system (or programmer) maps the application to the smallest core that provides the needed performance

Current System Limitations
• Migration between cores incurs high overheads
  – 20K cycles (ARM's big.LITTLE)
• Sample-based schedulers
  – Sample different cores' performance, then decide whether to reassign the application
  – Assume stable performance within a phase
• Phases must be long to be recognized and exploited
  – 100M–500M instructions in length
• Do finer-grained phases exist? Can we exploit them?
Performance Change in GCC
(Figure: instructions/cycle on the Big and Little cores over 1M instructions of GCC)
• Average IPC over a 1M instruction window (quantum)
• Average IPC over 2K quanta

Finer Quantum
(Figure: Big vs. Little IPC over a 20K instruction window, instructions 160K–180K)
• 20K instruction window from GCC
• Average IPC over 100 instruction quanta
• What if we could map the low-IPC quanta to a Little core?

Our Approach: Composite Cores
• Hypothesis: Exploiting fine-grained phases allows more opportunities to run on a Little core
• Problems
  I. How to minimize switching overheads?
  II. When to switch cores?
• Questions
  I. How fine-grained should we go?
  II. How much energy can we save?

Problem I: State Transfer
• Large structures (iCache, dCache, iTLB, dTLB, branch predictor) hold 10s of KB of state; pipeline state (fetch, decode, RAT, register files) is <1 KB
• State transfer costs can be very high: ~20K cycles (ARM's big.LITTLE)
• Limits switching to coarse granularity: 100M instructions (Kumar '04)

Creating a Composite Core
• Big uEngine (fetch, decode, RAT, O3 execute, load/store queue) and Little uEngine (fetch, decode, in-order execute) share the iCache, dCache, TLBs, and branch predictor
• Only one uEngine active at a time
• A controller transfers the <1 KB of pipeline state on a switch

Hardware Sharing Overheads
• Big uEngine needs
  – High fetch width
  – Complex branch prediction
  – Multiple outstanding data cache misses
• Little uEngine wants
  – Low fetch width
  – Simple branch prediction
  – Single outstanding data cache miss
• Shared units must be built for the Big uEngine, so they are over-provisioned for the Little uEngine
  – Little uEngine pays ~8% energy overhead to use the over-provisioned fetch + caches
• Assume clock gating for the inactive uEngine
  – Still has static leakage energy

Problem II: When to Switch
• Goal: Maximize time on the Little uEngine subject to a maximum performance loss
  – User-configurable
• Traditional OS-based schedulers won't work
  – Decisions too frequent
  – Need to be made in hardware
• Traditional sampling-based approaches won't work
  – Performance not stable for long enough
  – Frequent switching just to sample wastes cycles

What uEngine to Pick
(Figure: Big vs. Little IPC over 1M instructions, annotated with "Run on Big" / "Run on Little" regions based on the IPC difference)
• Run on the Little uEngine when the Big/Little performance difference is below ΔCPI_threshold
• This value is hard to determine a priori; it depends on the application
• Let the user configure the target performance level
  – Use a controller to learn the appropriate threshold value over time

Reactive Online Controller
• CPI_target is set from CPI_Big and the user-selected performance level
• Threshold controller (PI): ΔCPI_threshold = K_p · CPI_error + K_i · Σ CPI_error, where CPI_error = CPI_target − CPI_observed
• Big and Little models estimate the inactive uEngine's performance (CPI_Big, CPI_Little) from the active uEngine's observed performance (CPI_actual)
• Switching controller: run on the Little uEngine when CPI_Little ≤ CPI_Big + ΔCPI_threshold

uEngine Modeling
• Collect metrics of the active uEngine
  – iL1, dL1 cache misses
  – L2 cache misses
  – Branch mispredicts
  – ILP, MLP, CPI
• Use a linear model to estimate the inactive uEngine's performance
• Example: while(flag){ foo(); flag = bar(); } — Little uEngine IPC: 1.66; Big uEngine IPC: 2.15, which must be estimated while the Little uEngine is active
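The reactive online controller above can be sketched as a small PI loop plus a comparison. This is an illustrative sketch, not the authors' implementation: the gains `k_p` and `k_i`, the default 5% allowed slowdown, and all CPI values are assumed for the example.

```python
class ReactiveController:
    """Pick a uEngine each quantum so that overall CPI stays within a
    user-selected slowdown of the all-Big-core CPI (sketch)."""

    def __init__(self, allowed_slowdown=0.05, k_p=0.5, k_i=0.1):
        self.allowed_slowdown = allowed_slowdown  # user-selected performance loss
        self.k_p = k_p          # proportional gain (illustrative value)
        self.k_i = k_i          # integral gain (illustrative value)
        self.error_sum = 0.0    # integral term: running sum of CPI error

    def pick_engine(self, cpi_big, cpi_little, cpi_observed):
        # CPI_target allows the user-selected slowdown over the Big engine.
        cpi_target = (1.0 + self.allowed_slowdown) * cpi_big
        # Threshold controller (PI): widen the threshold when running ahead
        # of the target (spend the slack on the Little engine), shrink it
        # when falling behind.
        error = cpi_target - cpi_observed
        self.error_sum += error
        delta_threshold = self.k_p * error + self.k_i * self.error_sum
        # Switching controller: use Little when its (estimated) CPI is
        # within the threshold of the Big engine's (estimated) CPI.
        return "little" if cpi_little <= cpi_big + delta_threshold else "big"
```

Each quantum, one of `cpi_big`/`cpi_little` is measured on the active engine and the other comes from the linear model of the inactive engine; the controller itself only sees the two CPI estimates and the observed CPI so far.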
Evaluation
• Big uEngine: 3-wide O3 @ 1.0 GHz, 12-stage pipeline, 128 ROB entries, 128-entry register file
• Little uEngine: 2-wide in-order @ 1.0 GHz, 8-stage pipeline, 32-entry register file
• Memory system: 32 KB L1 i/d caches (1-cycle access), 1 MB L2 cache (15-cycle access), 1 GB main memory (80-cycle access)
• Controller: 5% performance loss relative to the all-Big core

Little Engine Utilization
(Figure: Little engine utilization vs. quantum length, 100 to 10M instructions, per SPEC benchmark; fine-grained quanta vs. traditional OS-based quanta)
• 3-wide O3 (Big) vs. 2-wide in-order (Little)
• 5% performance loss relative to all-Big
• More time on the Little engine with the same performance loss

Engine Switches
(Figure: switches per million instructions vs. quantum length; ~1 switch / 306 instructions at the finest quanta vs. ~1 switch / 2800 instructions)
• Need LOTS of switching to maximize utilization

Performance Loss
(Figure: performance relative to Big vs. quantum length; Composite Cores at quantum length = 1000)
• Switching overheads negligible until ~1000 instructions

Fine-Grained vs. Coarse-Grained
• Little uEngine's average power is 8% higher
  – Due to shared hardware structures
• Fine-grained can map 41% more instructions to the Little uEngine than coarse-grained
• Results in an overall 27% decrease in average power over coarse-grained

Decision Techniques
1. Oracle: knows both uEngines' performance for all quanta
2. Perfect Past: knows both uEngines' past performance perfectly
3. Model: knows only the active uEngine's past, models the inactive uEngine using default weights
• All models target 95% of the all-Big uEngine's performance

Dynamic Instructions on Little
(Figure: Little engine utilization for Oracle, Perfect Past, and Model across SPEC benchmarks)
• Maps 25% of the dynamic instructions onto the Little uEngine
• High utilization for memory-bound applications; issue width dominates for computation-bound applications

Energy Savings
(Figure: energy savings relative to Big for Oracle, Perfect Past, and Model across SPEC benchmarks)
• 18% reduction in energy consumption
• Includes the overhead of the shared hardware structures

User-Configured Performance
(Figure: utilization, overall performance, and energy savings at 1%, 5%, 10%, and 20% allowed performance loss)
• 1% performance loss yields 4% energy savings
• 20% performance loss yields 44% energy savings

More Details in the Paper
• Estimated uEngine area overheads
• uEngine model accuracy
• Switching timing diagram
• Hardware sharing overheads analysis

Conclusions
• Even high performance applications experience fine-grained phases of low throughput
  – Map those to a more efficient core
• Composite Cores allow
  – Fine-grained migration between cores
  – Low-overhead switching
• 18% energy savings by mapping 25% of the instructions to the Little uEngine with a 5% performance loss

Back Up

The DVFS Question
• Lower voltage is useful when:
  – L2 miss (stalled on commit)
• Little uArch is useful when:
  – Stalled on L2 miss (stalled at issue)
  – Frequent branch mispredicts (shorter pipeline)
  – Dependent computation
• http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf

Sharing Overheads
(Figure: average power of the Big uEngine, Little core, and Little uEngine relative to the Big core)

Performance
(Figure: performance relative to Big for Oracle, Perfect Past, and Model across SPEC benchmarks)
• 5% performance loss

Model Accuracy
(Figure: histograms of percent deviation from actual performance for the Little→Big and Big→Little models, compared against simply using average performance)

Regression Coefficients
(Figure: relative coefficient magnitudes — L2 misses, branch mispredicts, ILP, L2 hits, MLP, active uEngine cycles, constant — for the Little→Big and Big→Little models)

Different Than Kumar et al.
• Kumar et al.: coarse-grained switching — Composite Cores: fine-grained switching
• Kumar et al.: OS managed — Composite Cores: hardware managed
• Kumar et al.: minimal shared state (L2s) — Composite Cores: maximized shared state (L2s, L1s, branch predictor, TLBs)
• Kumar et al.: requires sampling — Composite Cores: on-the-fly prediction
• Kumar et al.: 6-wide O3 vs. 8-wide O3 (has an in-order core, but never uses it!) — Composite Cores: 3-wide O3 vs. 2-wide in-order

Register File Transfer
• 3-stage pipeline:
  1. Map to the physical register in the RAT
  2. Read the physical register
  3. Write to the new register file
• If commit updates a register during the transfer, repeat

uEngine Model
• Linear model: y = a0 + Σ ai·xi
  – a0: average uEngine performance
  – xi: performance counter value
  – ai: weight of the performance counter
• Different weights for the Big and Little uEngine models
• Fixed vs. per-application weights?
  – Default weights, fixed at design time
  – Per-application weights
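The linear model y = a0 + Σ ai·xi above amounts to one multiply-accumulate per performance counter. A minimal sketch follows; the counter names, weight values, and intercept here are made up for illustration — the real weights are trained offline, with separate Little→Big and Big→Little sets.

```python
# Sketch of the linear uEngine performance model: y = a0 + sum(ai * xi).
def estimate_inactive_cpi(counters, weights, intercept):
    """Estimate the inactive uEngine's CPI from the active uEngine's
    performance counters (cache misses, branch mispredicts, ILP, MLP, ...)."""
    y = intercept  # a0: average uEngine performance
    for name, x in counters.items():
        y += weights.get(name, 0.0) * x  # ai * xi
    return y

# Hypothetical counter readings from the active (Little) uEngine:
counters = {"l2_misses": 12, "branch_mispredicts": 40, "mlp": 1.8}
# Hypothetical Little->Big weights, fixed at design time:
weights = {"l2_misses": 0.01, "branch_mispredicts": 0.002, "mlp": -0.05}
cpi_big_estimate = estimate_inactive_cpi(counters, weights, intercept=0.6)
```

The same function serves both directions: only the weight set and intercept change depending on which uEngine is active.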