Coordinated Energy Management in
Heterogeneous Processors
INDRANI PAUL1,2, VIGNESH RAVI1, SRILATHA MANNE1, MANISH ARORA1,3, SUDHAKAR YALAMANCHILI2
1 Advanced Micro Devices, Inc.
2 Georgia Institute of Technology
3 University of California, San Diego
NOV 2013
GOAL & OUTLINE
• Goal:
– Optimize energy efficiency under power and performance constraints in a heterogeneous processor
• Outline:
– Problem
– State-of-the-Art Power Management
– HPC Application Characteristics and Frequency Sensitivity
– Run-time Coordinated Energy Management
– Results
COORDINATED ENERGY MANAGEMENT IN HETEROGENEOUS PROCESSORS | NOVEMBER, 2013
STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
Accelerated processing unit (APU):
– Multi-threaded CPU cores
– Graphics processing unit (GPU): 384 AMD Radeon™ cores
– Shared Northbridge -> access to overlapping CPU/GPU physical address spaces
Many resources are shared between the CPU and GPU
– For example, memory hierarchy, power, and thermal capacity
PROGRAMMING MODEL
[Figure: software stack. User Application (Host Tasks, GPU Tasks) -> OpenCL™ or other Software Stack -> Operating System -> APU Hardware (CPU, GPU)]
• Each OpenCL kernel: N-Dimensional Range, a grid of threads, each operating over a data partition
• Coupled programming model -> Offload compute intensive tasks to the GPU
CPU-GPU PHASE BEHAVIOR IN AN EXASCALE PROXY APPLICATION (LULESH)
• CPU-GPU coupled execution -> time-varying redistribution of compute intensity
• Energy efficient operation -> coordinated distribution of power to CPU vs. GPU
• Coordinated power states -> sensitivity of performance to CPU and GPU power state (frequency)
– Need to characterize ROI: Return (performance) on investment (power)
THE CHALLENGE: CPU-GPU COUPLING EFFECTS
[Figure: coupling effects between Host Tasks and GPU Tasks in a User Application]
– Direct Performance Coupling
– Indirect Performance Coupling: Shared Resources
– Coupling Effects -> Coordinated Energy Management -> Power Efficiency, subject to a Performance Constraint
• HPC applications have uncompromising
performance requirements!
• Need more efficient energy management
STATE-OF-THE-ART
POWER MANAGEMENT
STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION
POWER MANAGEMENT (BAPM)
Chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, GPU TE
• Power management algorithm
1. Calculate digital estimate of power consumption
2. Convert power to temperature
– RC network model for heat transfer
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance
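The four steps above can be sketched in a few lines. This is a minimal illustration, not AMD's BAPM implementation: it assumes a one-node RC model per thermal entity with no coupling between entities, and the names (`ThermalEntity`, `power_budget`, `pick_state`), the per-state power estimates, and all constants are hypothetical.

```python
from dataclasses import dataclass

T_AMBIENT_C = 45.0  # assumed ambient/reference temperature
T_LIMIT_C = 95.0    # assumed thermal limit
DT_S = 2.0          # assumed control interval in seconds


@dataclass
class ThermalEntity:
    name: str
    temp_c: float  # current temperature estimate (degC)
    r_th: float    # thermal resistance (degC per watt), assumed
    c_th: float    # thermal capacitance (joules per degC), assumed


def step_temperature(te: ThermalEntity, power_w: float) -> float:
    """Step 2: advance the RC model by one interval (forward Euler)."""
    steady_state_c = T_AMBIENT_C + power_w * te.r_th
    te.temp_c += DT_S * (steady_state_c - te.temp_c) / (te.r_th * te.c_th)
    return te.temp_c


def power_budget(te: ThermalEntity) -> float:
    """Step 3: power that brings the entity exactly to the limit next interval."""
    headroom_c = T_LIMIT_C - te.temp_c
    tau_s = te.r_th * te.c_th
    return (headroom_c * tau_s / DT_S + te.temp_c - T_AMBIENT_C) / te.r_th


def pick_state(states, budget_w):
    """Step 4: fastest state whose estimated power fits the budget.

    `states` holds (name, freq_mhz, est_power_w) tuples; step 1, the digital
    power estimate, is represented by the est_power_w field.
    """
    fitting = [s for s in states if s[2] <= budget_w]
    if not fitting:
        return min(states, key=lambda s: s[1])  # fall back to the slowest state
    return max(fitting, key=lambda s: s[1])


cu0 = ThermalEntity("CU0", temp_c=90.0, r_th=2.0, c_th=5.0)
budget = power_budget(cu0)
states = [("P2", 1300, 10.0), ("P0", 1600, 18.0), ("Pb0", 2400, 40.0)]
chosen = pick_state(states, budget)
```

With these made-up numbers the entity is 5 degC below the limit, so it receives a budget above its sustainable power but not enough for the top boost state. A cooler entity would receive a larger budget, which is the "convert headroom to boost" behavior the slide describes.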
POWER MANAGEMENT
Performance and energy efficiency depend on effective utilization of power and thermal headroom
• Convert thermal headroom to higher performance through boost
• CPU DVFS states: HW Only (Boost) states Pb0, Pb1; SW visible states P0, P1, P2, down to Pmin
• GPU: HW Boost states and SW visible states
[Figure: APU die temperature and thermal headroom over time; CPU and GPU DVFS state over time; instructions/cycle (High, Medium, Low) and APU performance]
KEY OBSERVATIONS
• Overall application performance is a function of both the CPU and the GPU
• State of the practice: Manage to thermal limits by locally boosting when power and thermal headroom are available -> utilize all of the available headroom
• Pitfall: boosting may not lead to proportional performance improvement -> energy inefficient
• Need a concept of performance sensitivity to power states
HPC APPLICATION
CHARACTERISTICS
FREQUENCY SENSITIVITY OF GPU KERNELS
[Figure: % increase in run-time (0% to 160%) per miniMD kernel (Total, Force, Neighbor, Comm, Other) at GPU DVFS states DVFS-high, DVFS-med, DVFS-low]
Some kernels are more sensitive to GPU frequency than others -> more power efficient
SENSITIVITY OF GPU KERNEL EXECUTION TO CPU
FREQUENCY
[Figure: % increase in run-time (0% to 50%) per miniMD kernel (Total, Force, Neighbor, Comm, Other) at CPU DVFS states P0 to P4]
• Some kernels are more tightly coupled to the CPU’s performance
• Smaller kernels such as Comm have high overheads in launching and feeding the GPU
SENSITIVITY TO SHARED RESOURCE INTERFERENCE
[Figure: miniMD Neighbor kernel. Normalized metric (0.00 to 1.00) for two stacked bars: Mem_BW_breakdown (GPU_Mem_BW vs. CPU_Mem_BW) and CPU_DVFS_residency (Pb1 vs. Pb0)]
• Power management locally boosts CPU to highest DVFS states
• Performance actually limited by GPU memory demand
• Wasted energy -> power inefficient
• Need online estimates of sensitivity to interference
COMPUTATION AND CONTROL DIVERGENCE
[Figure: Graph Algorithm, BFS. Percentage metric (0.00 to 0.80) comparing GPU_freq_sensitivity(meas) and GPU_ALUBusy%]
• GPU_freq_sensitivity: unit performance gain for unit frequency increase
• GPU_ALUBusy%: measured hardware compute utilization
• Control divergence -> increased thread serialization -> increased frequency sensitivity
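The GPU_freq_sensitivity definition above can be written as a ratio of relative changes. The sketch below assumes performance is the inverse of run-time; the function name and the run-times are illustrative, while 633 MHz and 800 MHz are the GPU-med and GPU-high frequencies listed in the experimental set-up.

```python
def frequency_sensitivity(time_low_s, time_high_s, freq_low_mhz, freq_high_mhz):
    """Relative performance gain per relative frequency increase.

    Performance is taken as 1 / run-time. A value near 1 means run-time
    scales with frequency; a value near 0 means frequency barely matters.
    """
    perf_gain = time_low_s / time_high_s - 1.0
    freq_gain = freq_high_mhz / freq_low_mhz - 1.0
    if freq_gain <= 0:
        raise ValueError("freq_high_mhz must exceed freq_low_mhz")
    return perf_gain / freq_gain


# Hypothetical run-times at GPU-med (633 MHz) and GPU-high (800 MHz).
compute_bound = frequency_sensitivity(8.00, 6.33, 633, 800)  # scales with clock
memory_bound = frequency_sensitivity(8.00, 7.80, 633, 800)   # barely improves
```

A utilization counter such as GPU_ALUBusy% can be low while this measured ratio is high, which is the mismatch the BFS figure illustrates.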
KEY OBSERVATIONS
• HPC applications exhibit varying degrees of CPU and GPU frequency sensitivities due to
– Control divergence
– Interference at shared resources
– Performance coupling between CPU and GPU
• Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors
• Sensitivity metrics drive the coordinated setting of CPU and GPU power states
FREQUENCY SENSITIVITY AND
RUN-TIME COORDINATED
ENERGY MANAGEMENT
PERFORMANCE METRICS FOR APU FREQUENCY
SENSITIVITY
[Table: performance metrics grouped by category: GPU Compute, CPU Compute, Interference, Performance Coupling]
• Linear regression model using the above metrics to compute measures of frequency sensitivity
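A linear regression of this kind can be sketched with ordinary least squares. The specific metrics and coefficients are not given in this text, so the three feature columns below (`gpu_compute`, `interference`, `coupling`) and all sample values are placeholders chosen only to show the fitting step, not the authors' model.

```python
def fit_linear(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    A leading 1.0 is added to every row so weights[0] is the intercept.
    Raises ZeroDivisionError if the features are linearly dependent.
    """
    rows = [[1.0] + list(r) for r in features]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [x - factor * p for x, p in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(weights, row):
    """Predicted sensitivity: intercept plus weighted sum of the metrics."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Synthetic training data: (gpu_compute, interference, coupling) per sample.
TRUE_WEIGHTS = [0.1, 0.8, -0.5, 0.2]  # made up, used only to generate targets
samples = [
    (0.9, 0.1, 0.2),
    (0.5, 0.6, 0.1),
    (0.2, 0.8, 0.3),
    (0.7, 0.3, 0.9),
    (0.4, 0.4, 0.5),
]
targets = [predict(TRUE_WEIGHTS, s) for s in samples]
weights = fit_linear(samples, targets)
```

Because the synthetic targets are exactly linear in the features, the fit recovers the generating weights. With real counter data the fit would be approximate, and the residual would indicate how well a linear model captures sensitivity.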
DYNACO: RUN-TIME SYSTEM FOR COORDINATED
ENERGY MANAGEMENT
Run-time flow: Performance Metric Monitor -> CPU-GPU Frequency Sensitivity Computation -> CPU-GPU Power State Decision

GPU Frequency Sensitivity   CPU Frequency Sensitivity   Decision
High                        Low                         Shift power to GPU
High                        High                        Proportional power allocation
Low                         High                        Shift power to CPU
Low                         Low                         Reduce power of both CPU and GPU

• DynaCo-1levelTh: Lowest CPU DVFS-state limited to P2
• DynaCo-multilevelTh: Lowest CPU DVFS-state allowed to go as low as Pmin, based on degree of performance coupling
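The decision table above maps directly onto a small function. The 0.5 threshold separating High from Low is an assumption for illustration; the slides do not state the thresholds DynaCo uses, and the function name is hypothetical.

```python
def dynaco_decision(gpu_sensitivity, cpu_sensitivity, threshold=0.5):
    """Map measured frequency sensitivities to a power-allocation decision.

    `threshold` is an assumed cut-off between Low and High sensitivity.
    """
    gpu_high = gpu_sensitivity >= threshold
    cpu_high = cpu_sensitivity >= threshold
    if gpu_high and not cpu_high:
        return "shift power to GPU"
    if gpu_high and cpu_high:
        return "proportional power allocation"
    if cpu_high:
        return "shift power to CPU"
    return "reduce power of both CPU and GPU"


# Example: a GPU-bound kernel that barely depends on CPU frequency.
decision = dynaco_decision(gpu_sensitivity=0.9, cpu_sensitivity=0.1)
```

The Low/Low row is where most of the savings come from: when neither side benefits from frequency, both can drop to lower DVFS states with little run-time cost.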
KEY OBSERVATIONS
• Coordinated CPU-GPU execution
• Linear combination of three key high level performance metrics proposed to model APU frequency sensitivity behavior
• Run-time coordinated energy management scheme DynaCo to manage CPU and GPU DVFS states dynamically based on measured frequency sensitivities
RESULTS
EXPERIMENTAL SET-UP
• Trinity A10-5800 APU: 100W TDP
• CPU: Managed by HW or SW
• GPU: Managed by sending software messages through driver layer
• DynaCo implemented as a run-time software policy overlaid on top of BAPM in real hardware

CPU P-state   Voltage (V)   Freq (MHz)   Control
Pb0           1             2400         HW Only (Boost)
Pb1           0.875         1800         HW Only (Boost)
P0            0.825         1600         SW Visible
P1            0.812         1400         SW Visible
P2            0.787         1300         SW Visible
P3            0.762         1100         SW Visible
P4            0.75          900          SW Visible

GPU P-state   Freq (MHz)
GPU-high      800
GPU-med       633
GPU-low       304
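To see why lower P-states save power faster than they lose frequency, the sketch below applies the common first-order approximation that dynamic power scales with V^2 * f to the CPU P-state values on this slide. Leakage and activity factor are ignored, so the ratios are rough guides rather than measured power, and the function names are illustrative.

```python
# (voltage in V, frequency in MHz) per CPU P-state, as listed on the slide.
CPU_PSTATES = {
    "Pb0": (1.0, 2400),
    "Pb1": (0.875, 1800),
    "P0": (0.825, 1600),
    "P1": (0.812, 1400),
    "P2": (0.787, 1300),
    "P3": (0.762, 1100),
    "P4": (0.75, 900),
}


def relative_dynamic_power(state, reference="Pb0"):
    """Approximate dynamic power of `state` relative to `reference` (V^2 * f)."""
    v, f = CPU_PSTATES[state]
    v_ref, f_ref = CPU_PSTATES[reference]
    return (v * v * f) / (v_ref * v_ref * f_ref)


def relative_frequency(state, reference="Pb0"):
    """Frequency of `state` relative to `reference`."""
    return CPU_PSTATES[state][1] / CPU_PSTATES[reference][1]


for name in CPU_PSTATES:
    print(f"{name}: freq x{relative_frequency(name):.2f}, "
          f"dyn. power x{relative_dynamic_power(name):.2f}")
```

Under this approximation P4 runs at 37.5% of the Pb0 frequency but at roughly 21% of its dynamic power, which is why dropping the state of a frequency-insensitive unit is attractive.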
BENCHMARKS
Benchmark    Problem Size
miniMD       32 x 32 x 32 elements
miniFE       100 x 100 x 100 elements
Lulesh       100 x 100 x 100 elements
Sort         2,097,152 elements
Stencil2D    4,096 x 4,096 elements
S3D          SHOC default for integrated GPU
BFS          1,000,000 nodes
ENERGY EFFICIENCY (ED^2 PRODUCT)
[Figure: normalized ED^2 product (0.20 to 1.00) for DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static]
Average energy efficiency improvement of 24% and 30% with DynaCo-1levelTh and DynaCo-multilevelTh respectively
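For reference, the ED^2 product multiplies energy by the square of delay, so it penalizes slowdowns more heavily than plain energy does. The sketch below uses hypothetical power and run-time numbers, not the measured results, and the function names are illustrative.

```python
def ed2(avg_power_w, delay_s):
    """Energy-delay-squared product: (power * delay) * delay^2."""
    energy_j = avg_power_w * delay_s
    return energy_j * delay_s ** 2


def normalized_ed2(power_w, delay_s, base_power_w, base_delay_s):
    """ED^2 relative to a baseline run; values below 1.0 are improvements."""
    return ed2(power_w, delay_s) / ed2(base_power_w, base_delay_s)


# Hypothetical run: 20% less average power for a 2% longer run-time.
result = normalized_ed2(power_w=40.0, delay_s=10.2,
                        base_power_w=50.0, base_delay_s=10.0)
```

Here the 20% power saving shrinks to roughly a 15% ED^2 improvement because the 2% slowdown enters cubed, which is why a scheme needs small run-time impact to score well on this metric.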
EXECUTION TIME IMPACT
[Figure: increase in run-time (0.90 to 1.06) for Baseline, DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static]
Average performance slowdown of 0.78% and 1.61% with DynaCo-1levelTh and DynaCo-multilevelTh respectively
POWER SAVINGS
[Figure: power savings (0% to 60%) for DynaCo-1levelTh, DynaCo-multilevelTh, and Ideal-static]
Average power savings of 24% and 31% with DynaCo-1levelTh and DynaCo-multilevelTh respectively
CONCLUSIONS
• Demonstrated effects of shared resource interference, control divergence and performance coupling on energy management for HPC applications
• Illustrated the importance and scope of frequency sensitivity in characterizing energy behaviors in tightly coupled heterogeneous architectures
• Proposed CPU-GPU frequency sensitivity metrics and run-time policy for energy efficient CPU and GPU DVFS state management
– Dynamically shifts power to only the entity that can best utilize it
• Demonstrated effectiveness of DynaCo on real hardware as a well-rounded energy management scheme for HPC and Exascale
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to
update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes
from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR
PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER
CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard
Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their
respective owners.
BACKUP
POWER SHARING AND SHIFTING ANALYSIS
[Figure: normalized metrics (GPU Utilization, Global_MemUtil; 0 to 1.2) and GPU DVFS residency (GPU-high, GPU-med, GPU-low; 0% to 100%) over time (0 to 210 ms). Panels: phase variation within MATVEC; Lulesh kernels CalcFBHourGlass, CalcHourGlass, IntegrateStress]
• DynaCo adapts to varying compute and memory demands both at kernel granularity and even within a kernel