Coordinated Energy Management in Heterogeneous Processors

INDRANI PAUL (1,2), VIGNESH RAVI (1), SRILATHA MANNE (1), MANISH ARORA (1,3), SUDHAKAR YALAMANCHILI (2)
(1) Advanced Micro Devices, Inc.
(2) Georgia Institute of Technology
(3) University of California, San Diego
NOV 2013

GOAL & OUTLINE
Goal:
– Optimize energy efficiency under power and performance constraints in a heterogeneous processor
Outline:
– Problem
– State-of-the-Art Power Management
– HPC Application Characteristics and Frequency Sensitivity
– Run-time Coordinated Energy Management
– Results

COORDINATED ENERGY MANAGEMENT IN HETEROGENEOUS PROCESSORS | NOVEMBER, 2013

STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
[Diagram: accelerated processing unit (APU)]
• Multi-threaded CPU cores
• Graphics processing unit (GPU): 384 AMD Radeon™ cores
• Shared Northbridge access to overlapping CPU/GPU physical address spaces
• Many resources are shared between the CPU and GPU
– For example, memory hierarchy, power, and thermal capacity

PROGRAMMING MODEL
[Diagram: User Application (Host Tasks, GPU Tasks); OpenCL™ or other Software Stack; Operating System; APU Hardware (CPU, GPU)]
• Each OpenCL kernel: N-Dimensional Range
– Grid of threads, each operating over a data partition
• Coupled programming model
• Offload compute-intensive tasks to the GPU

CPU-GPU PHASE BEHAVIOR IN AN EXASCALE PROXY APPLICATION (LULESH)
• CPU-GPU coupled execution -> time-varying redistribution of compute intensity
• Energy-efficient operation -> coordinated distribution of power to CPU vs. GPU
• Coordinated power states -> sensitivity of performance to CPU and GPU power state (frequency)
– Need to characterize ROI: return (performance) on investment (power)

THE CHALLENGE: CPU-GPU COUPLING EFFECTS
[Diagram labels: Direct Performance Coupling (User Application, Host Tasks, GPU Tasks); Indirect Performance Coupling: Shared Resources; Coupling Effects; Performance Constraint; Coordinated Energy Management; Performance; Power Efficiency]
• HPC applications have uncompromising performance requirements!
• Need more efficient energy management

STATE-OF-THE-ART POWER MANAGEMENT

STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
• Chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, GPU TE
• Power management algorithm:
1. Calculate digital estimate of power consumption
2. Convert power to temperature (RC network model for heat transfer)
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance

POWER MANAGEMENT
• Performance and energy efficiency depend on effective utilization of power and thermal headroom
• Convert thermal headroom to higher performance through boost
[Figure: APU die temperature and thermal headroom over time; CPU DVFS states over time, split into HW-only boost states (Pb0, Pb1) and SW-visible states (P0, P1, P2, ..., Pmin); GPU DVFS states (High, Medium, Low); APU performance in instructions/cycle]

KEY OBSERVATIONS
• Overall application performance is a function of both the CPU and the GPU
• State of the practice: manage to thermal limits by locally boosting when power and thermal headroom are available -> utilize all of the available headroom
• Pitfall: boosting may not lead to proportional performance improvement -> energy inefficient
• Need a concept of performance sensitivity to power states

HPC APPLICATION CHARACTERISTICS

FREQUENCY SENSITIVITY OF GPU KERNELS
[Figure: % increase in run-time per miniMD kernel (Total, Force, Neighbor, Comm, Other) at GPU DVFS-high, DVFS-med, and DVFS-low]
• Some kernels are more sensitive to GPU frequency than others -> more power efficient

SENSITIVITY OF GPU KERNEL EXECUTION TO CPU FREQUENCY
[Figure: % increase in run-time per miniMD kernel (Total, Force, Neighbor, Comm, Other) at CPU DVFS states P0 to P4]
• Some kernels are more tightly coupled to the CPU's performance
• Smaller kernels such as Comm have high overheads in launching and feeding the GPU

SENSITIVITY TO SHARED RESOURCE INTERFERENCE
• Performance actually limited by GPU memory demand
[Figure y-axis: Normalized Metric]
[Figure legend: GPU_Mem_BW/Pb1, CPU_Mem_BW/Pb0; panels: Mem_BW_breakdown, CPU_DVFS_residency; workload: miniMD, Neighbor kernel]
• Power management locally boosts CPU to highest DVFS states
• Wasted energy -> power inefficient
• Need online estimates of sensitivity to interference

COMPUTATION AND CONTROL DIVERGENCE
[Figure: percentage metric for GPU_freq_sensitivity (measured) and GPU_ALUBusy%; workload: graph algorithm, BFS]
• GPU_freq_sensitivity: unit performance gain for unit frequency increase
• GPU_ALUBusy%: measured hardware compute utilization
• Control divergence -> increased thread serialization -> increased frequency sensitivity

KEY OBSERVATIONS
• HPC applications exhibit varying degrees of CPU and GPU frequency sensitivities due to
– Control divergence
– Interference at shared resources
– Performance coupling between CPU and GPU
• Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors
• Sensitivity metrics drive the coordinated setting of CPU and GPU power states

FREQUENCY SENSITIVITY AND RUN-TIME COORDINATED ENERGY MANAGEMENT

PERFORMANCE METRICS FOR APU FREQUENCY SENSITIVITY
• Linear regression model over measured performance metrics computes measures of:
– GPU Compute
– CPU Compute
– Interference
– Performance Coupling

DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT
Performance Metric Monitor -> CPU-GPU Frequency Sensitivity Computation -> CPU-GPU Power State Decision

GPU Frequency Sensitivity   CPU Frequency Sensitivity   Decision
High                        Low                         Shift power to GPU
High                        High                        Proportional power allocation
Low                         High                        Shift power to CPU
Low                         Low                         Reduce power of both CPU and GPU

• DynaCo-1levelTh: Lowest CPU DVFS-state limited to P2
• DynaCo-multilevelTh: Lowest CPU DVFS-state allowed to use up to Pmin based on degree of performance coupling

KEY OBSERVATIONS
• Coordinated CPU-GPU execution
• Linear combination of three key high-level performance metrics proposed to model APU frequency sensitivity behavior
• Run-time coordinated energy management scheme DynaCo manages CPU and GPU DVFS states dynamically based on measured frequency sensitivities

RESULTS

EXPERIMENTAL SET-UP
• Trinity A10-5800 APU: 100W TDP
• CPU: Managed by HW or SW
• GPU: Managed by sending software messages through driver layer

CPU P-state   Voltage (V)   Freq (MHz)   Control
Pb0           1             2400         HW only (boost)
Pb1           0.875         1800         HW only (boost)
P0            0.825         1600         SW visible
P1            0.812         1400         SW visible
P2            0.787         1300         SW visible
P3            0.762         1100         SW visible
P4            0.75          900          SW visible

GPU P-state   Freq (MHz)
GPU-high      800
GPU-med       633
GPU-low       304

• DynaCo implemented as a run-time software policy overlaid on top of BAPM in real hardware

BENCHMARKS
BM (Description)   Problem Size
miniMD             32 x 32 x 32 elements
miniFE             100 x 100 x 100 elements
Lulesh             100 x 100 x 100 elements
Sort               2,097,152 elements
Stencil2D          4,096 x 4,096 elements
S3D                SHOC default for integrated GPU
BFS                1,000,000 nodes

ENERGY EFFICIENCY (ED^2 PRODUCT)
[Figure: normalized ED^2 product for DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static]
• Average energy efficiency improvement of 24% and 30% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively

EXECUTION TIME IMPACT
[Figure: increase in run-time for DynaCo-1levelTh, DynaCo-multilevelTh, Ideal-static, and Baseline]
• Average performance slowdown of 0.78% and 1.61% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively

POWER SAVINGS
[Figure: power savings (%) for DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static]
• Average power savings of 24% and 31% with DynaCo-1levelTh and DynaCo-multilevelTh, respectively

CONCLUSIONS
• Demonstrated effects of shared resource interference, control divergence, and performance coupling on energy management for HPC applications
• Illustrated the importance and scope of frequency sensitivity in characterizing energy behaviors in tightly coupled heterogeneous architectures
• Proposed CPU-GPU frequency sensitivity metrics and a run-time policy for energy-efficient CPU and GPU DVFS state management
– Dynamically shifts power to only the entity that can best utilize it
• Demonstrated effectiveness of DynaCo on real hardware as a well-rounded energy management scheme for HPC and Exascale

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

BACKUP

POWER SHARING AND SHIFTING ANALYSIS
[Figure: Lulesh kernels (CalcFBHourGlass, CalcHourGlass, IntegrateStress) over time (ms); normalized metrics: GPU Utilization, Global_MemUtil; GPU DVFS residency (%) for GPU-high, GPU-med, GPU-low; annotation: phase variation within MATVEC]
• DynaCo adapts to varying compute and memory demands both at kernel granularity and even within a kernel
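
ILLUSTRATIVE SKETCH: DYNACO POWER STATE DECISION
The decision table on the DynaCo run-time system slide maps a pair of High/Low frequency sensitivities to one of four power-shifting actions. The Python sketch below encodes only that table. It is not the authors' implementation. The names Decision, DECISION_TABLE, and decide are hypothetical, and the single threshold of 0.5 separating "High" from "Low" is an assumption made for illustration, since the slides do not state the threshold values.

```python
from enum import Enum


class Decision(Enum):
    """The four actions listed in the DynaCo decision table."""
    SHIFT_TO_GPU = "Shift power to GPU"
    PROPORTIONAL = "Proportional power allocation"
    SHIFT_TO_CPU = "Shift power to CPU"
    REDUCE_BOTH = "Reduce power of both CPU and GPU"


# Keys are (GPU sensitivity is high, CPU sensitivity is high),
# in the same row order as the table on the slide.
DECISION_TABLE = {
    (True, False): Decision.SHIFT_TO_GPU,
    (True, True): Decision.PROPORTIONAL,
    (False, True): Decision.SHIFT_TO_CPU,
    (False, False): Decision.REDUCE_BOTH,
}


def decide(gpu_sensitivity: float, cpu_sensitivity: float,
           threshold: float = 0.5) -> Decision:
    """Classify each sensitivity as High or Low and look up the action.

    The threshold is an assumed placeholder, not a value from the slides.
    """
    gpu_high = gpu_sensitivity >= threshold
    cpu_high = cpu_sensitivity >= threshold
    return DECISION_TABLE[(gpu_high, cpu_high)]


if __name__ == "__main__":
    # A GPU-bound phase: high GPU sensitivity, low CPU sensitivity.
    print(decide(0.8, 0.1).value)
```

A single threshold corresponds loosely to the one-level variant. A multi-level variant would need several thresholds tied to the degree of performance coupling, which this sketch does not attempt to model.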