IBM Research

Guarded Power Gating in a Multi-core Setting
Niti Madan, Alper Buyuktosunoglu, Pradip Bose, Murali Annavaram
USC, IBM T.J. Watson
June 2010
© 2010 IBM Corporation

Outline
– Motivation
– Queuing Model based Methodology
– Results
– Conclusions and Future Work

Power Management through Power Gating
– Use header or footer transistor to power-gate the idle circuit
– Apply “sleep” to header or footer => turn off voltage
[Figure: power-gating circuit with labels Sleep, Vdd, Virtual Vdd, Logic Block]
– Can be applied at unit-level (intra-core or small-knob)
– Can be applied at core-level (per-core or big-knob)

Predictive Power Gating
[Figure: energy over time, showing the energy overhead, the cumulative energy savings, and the break-even point between “Decide to power gate” and “Wake-up”]
Ex. break-even point = 10 cycles
  Decide to Power Gate: …10100 0000000000…
  Decide to Power Gate: …10100 001………….
  Correct prediction => save power
• Power-gating algorithms are predictive by nature
• Frequent mis-predictions can burn more power than they save
• Break-even point depends on block size and technology parameters
• Guard mechanism proposed for unit-level power gating algorithms by Lungu et al. (ISLPED’09)
• Concern for per-core power gating algorithms, as the break-even point is much higher for cores
Anita Lungu, Pradip Bose, Alper Buyuktosunoglu, Daniel Sorin, “Dynamic power gating with quality guarantees”. ISLPED ’09

Power Gating Scenarios
Exploiting the two dimensions of utilization to power-gate idle units or cores
– System Utilization (OS perspective) triggers the big-knob
– Resource Utilization (Core’s perspective) triggers the small-knob
• Do we PG cores or execution units or both?
[Figure: activity of Core 1 to Core 4 over time: (a) Baseline 4-core system, (b) Folded 2-core system]
How can we maximize power-savings opportunities provided by both the small and big knobs?
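The break-even example on the Predictive Power Gating slide can be made concrete with a small sketch. It assumes a simple linear energy model that is not stated in the slides: the gating overhead is taken to equal the savings of exactly `break_even_cycles` gated cycles. The function names and the bit-string trace format are hypothetical and used for illustration only.

```python
def idle_run_length(activity_after_decision: str) -> int:
    """Count idle cycles ('0') seen before the first busy cycle ('1')."""
    count = 0
    for bit in activity_after_decision:
        if bit != "0":
            break
        count += 1
    return count


def net_cycles_saved(idle_cycles: int, break_even_cycles: int = 10) -> int:
    """Net benefit of gating one idle period.

    The unit is the energy saved by one gated cycle. Assumption: the
    gating overhead costs as much as break_even_cycles gated cycles
    save, so the net benefit is linear in the idle length.
    """
    return idle_cycles - break_even_cycles


# The two activity traces that follow the gating decision on the slide.
for trace in ("0000000000", "001"):
    idle = idle_run_length(trace)
    print(trace, idle, net_cycles_saved(idle))
```

Under this assumption the first trace only just reaches break-even after its ten visible idle cycles, while the second trace wakes up after two idle cycles and ends up eight cycles' worth of energy worse off than if it had not been gated.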

Goals of this study
– Explore the trade-offs between unit-level (small-knob) and per-core (big-knob) power gating algorithms for a range of latencies and parameters
– Leverage analytical models for early-stage evaluation
– Make a case for a guard mechanism for per-core power gating

Queuing Theory Based Analytical Model
Sriram Vajapeyam, Pradip Bose
Representation of multi-processor workloads as a queuing system
– Cores are servers
– Processing tasks are customer requests
– Tasks are processed in FCFS order
– Queuing system tracks average customer waiting time, service time and server utilization
Evaluate our power-management policies using a C++ based queuing model simulator: “QUTE”
[Figure: Customers, Arrivals, Queue, Server(s), Departures]

Overview of QUTE Framework
Simulation of Queuing Models (G/G/N/k/inf/FCFS)
– Faster than cycle-accurate simulations
– Easy to explore design-space early on
Statistical Workload Generation Parameters:
– Task Arrival Times: Exponential Distribution
– Task Lengths: Normal/Exponential/Uniform Distributions
Evaluation Metrics:
– Performance: Average response time
– Power: Average number of cores switched on
– Other Stats: Server utilization, variance in service demand etc.

QUTE Framework
[Figure: Task arrival (arrival rate distribution using random number generator), feeding a FIFO Task Queue (service time or task length statistical distribution), feeding cores C1 C2 …….. C3 C4 (all cores queue back the task at the end of a time slice)]

Big Knob Modeling
Implemented a simple idleness-triggered heuristic:
Set Idleness Threshold (say to 0.5 msec)
Every 0.5 msec (i.e.
the idleness threshold),
– Scan all cores
– Identify cores idle for > idleness threshold
– Switch off all such cores (except, make sure there is always at least one core ON, either free or active)
When a task arrives at the head of the task queue:
– If there is no free core,
  • If there is a switched-off core, switch it ON

Small Knob Modeling
Cannot directly simulate workload phases
Each core can have N power states
– 2 states for this version: nominal power state and low power state (75% power)
Generate statistical distribution (Gaussian) of each power state duration
Each task always starts in the nominal power state
– Switch between power states in a given time-slice
Parameters: Nominal (Hi) and Low (Lo) power state means, Transition overhead

Simulation Parameters
ρ = λ / (N × µ)

System-level Parameters
  Number of cores                  32
  Mean Task Length                 5 ms
  Mean Task Inter-Arrival Time     300 µs
  Time Slice                       1 ms
  Simulation Length                10000 Tasks
Big Knob Parameters
  Core Switch-on Lat (OnLat)       500 µs
  Idleness Threshold (CT)          500 µs
Small Knob Parameters
  Hi state mean                    300 µs
  Lo state mean                    100 µs
  Transition overhead              1 µs
  Power Factor                     0.75

Outline
– Problem Background
– Methodology: Queuing Model
– Results
– Conclusions and Future Work

Big Knob Results

Experiment            Response time (µs)   Average Power (Num Cores)
Base                  5002.22              32
OnLat = 0.5 ms
  CT = 0.5 ms         5038.46              24.99
  CT = 0.3 ms         5070.12              23.33
  CT = 0.1 ms         5158.51              21.83
  CT = 10 µs          5244.43              21.68
OnLat = 10 µs
  CT = 0.5 ms         5002.93              24.82
  CT = 10 µs          5007.07              20.77

• CT controls the degree of power-savings (up to 34%)
• OnLat controls the performance loss (up to 5%)

Idle-Time Durations Histogram
[Figure: histogram of idle-time durations with CT marked; x-axis: Idle-time Duration (µs), 0 to 1500; y-axis: Number of durations, 0 to 8000]

Small Knob Results

Workload Behavior   Hi Mean (µs)   Lo Mean (µs)   Hi %   Lo %   Response Time (µs)   Avg Power (Effective NumCores)
Short phases        100            100            52     48     5050.51              28.16
                    200            200            57     43     5027.36              28.48
High ILP            300            100            79     21     5026.46              30.08
Low ILP             100            300            30     70     5028.23              26.24
Very High ILP       500            100            89     11     5013.67              31.04
Very Low ILP        100            500            21     79     5019.95              25.6

System_Power = Num_cores × (%time_in_Hi_state + F × %time_in_Lo_state) × P
where F = 0.75 for this analysis

[Figure: Performance Loss (%) vs. Transition Overhead (µs) for overheads of 0.5, 1, 5 and 10 µs; y-axis 0 to 4]
• Power savings depend on workload behavior
• Short phases increase the number of transitions and the overhead
• Transition overhead is tolerable for our assumptions

Hybrid Model Results (Big + Small Knob)

Inter-arrival Time (µs)          50    100   300    500    1000   2000
Server Utilization (measured)    1.0   1.0   0.52   0.31   0.16   0.08

[Figure: two panels, “High ILP Workload” and “Low ILP Workload”]
• High ILP workloads – Big knob is most helpful
• Low ILP workloads – Small knob helpful for even lower utilization

A Case for Guard Mechanism for Multicore Power Gating

Experiment              Response Time (µs)   Core Switching ON/OFF Frequency
Fixed Arrival Rate      5043.88              91482
Toggling Arrival Rate   5111                 226372

Depending upon workload characteristics, per-core power gating heuristics are prone to mis-predictions and to dissipating more power
Aggressive power-gating heuristics also increase the performance overhead of mis-prediction (e.g.
lower CT)

Observations
In a fully loaded system, the small knob is helpful
In a lightly loaded system, the big knob is most useful
In an intermediately loaded system, the big knob is useful to have, but the usefulness of the small knob depends upon the workload characteristics
– Lower ILP or low resource utilization workloads benefit from the small knob
Small knob is a useful feature to have regardless of system load if we can implement a power state with a lower power factor
– Current power factor is conservative (0.75)

Future Work
Improve methodology by supporting real server utilization traces
Evaluate a system with multiple P-states and DVFS
Architect guard mechanisms for the per-core power gating algorithms
Design an implementation of a hybrid PG system

Thanks and Questions!

Backup Slides

Power Factor Sensitivity Analysis for High ILP Workload
[Figure: y-axis 0 to 1.2; x-axis ticks 50, 100, 300, 500, 1000, 2000; series: Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1]

Power Factor Sensitivity Analysis for Low ILP Workload
[Figure: y-axis 0 to 1.2; x-axis ticks 50, 100, 300, 500, 1000, 2000; series: Inter, Intra_0.75, H_0.75, Intra_0.5, H_0.5, Intra_0.25, H_0.25, Intra_0.1, H_0.1]

Two Level Power Gating Algorithms (Lungu et al.
ISLPED'09)
Observations:
– Correctness requirement of power saving schemes (efficiency-wise): save power
– Single level idle prediction algorithms can behave incorrectly and waste power
Proposed Idea:
– Add second level monitor to control enabling of power gating scheme
– Improve efficiency of power wasting cases without degrading power saving of common case
[Figure: two-level scheme. Level 2 (Monitor & Control): estimate power savings from the efficiency counters; if > 0, Enable = 1, otherwise Enable = 0. Level 1 (Actuate): states On, Off_U and Off_C, gated by Enable, with counter updates Cnt1++ and Cnt2++]
Off_U: Power gated, uncompensated
Off_C: Power gated, compensated
Per-core power-gating algorithms also rely on such predictive schemes and will require guard mechanisms
– Cost of misprediction is higher in per-core power-gating
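
The two-level idea on this backup slide (a second-level monitor that estimates power savings and sets Enable accordingly) can be illustrated with a minimal sketch. This is not the algorithm from Lungu et al.; the class name, the epoch-based accounting, and the counter names are hypothetical, and energy is measured in units of one gated cycle's savings under the assumption that each gating event costs `break_even_cycles` of those units.

```python
from dataclasses import dataclass


@dataclass
class GuardedGate:
    """Illustrative two-level guard; not the ISLPED'09 implementation."""

    break_even_cycles: int = 10  # assumed cost of one gating event
    enabled: bool = True         # level-2 output: may level 1 gate?
    gated_cycles: int = 0        # idle cycles gating covered this epoch
    gate_events: int = 0         # gating decisions made this epoch

    def record_idle_period(self, idle_cycles: int) -> None:
        """Level 1 bookkeeping for one idle period.

        Counted even while gating is disabled (shadow accounting), so
        the monitor can tell when gating would become profitable again.
        """
        self.gate_events += 1
        self.gated_cycles += idle_cycles

    def estimated_savings(self) -> int:
        """Savings minus overhead for the current epoch."""
        return self.gated_cycles - self.gate_events * self.break_even_cycles

    def end_epoch(self) -> bool:
        """Level 2: enable gating only if estimated savings are > 0."""
        self.enabled = self.estimated_savings() > 0
        self.gated_cycles = 0
        self.gate_events = 0
        return self.enabled


guard = GuardedGate()
for idle in (2, 3, 1):    # short idle periods: gating wastes energy
    guard.record_idle_period(idle)
print(guard.end_epoch())
for idle in (50, 40):     # long idle periods: gating pays off
    guard.record_idle_period(idle)
print(guard.end_epoch())
```

In this sketch the first epoch disables gating and the second re-enables it. A per-core version would use a much larger `break_even_cycles`, which is why the cost of a mis-prediction is higher there.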