Accurate Power and Energy Measurement on Kepler-based Tesla GPUs

Martin Burtscher

Department of Computer Science

Introduction

GPU-based accelerators

 Quickly spreading in PCs and even handheld devices

 Widely used in high-performance computing

Power and energy efficiency

 Heat dissipation is a problem

 Electric bill and battery life are of growing concern

 Exascale requires 50x boost in performance per watt

Important research area

 Need to develop techniques to reduce power and energy

 Have to be able to measure power/energy of programs

GPU Power Sensors

 Hardware

 High-end compute GPUs include power sensors

 For example, K20/K40 Tesla cards have built-in sensor

 These cards are the target of this talk

 Software

Can query sensor with NVIDIA Management Library http://developer.nvidia.com/nvidia-management-library-nvml
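
As an illustration of the software side, the sketch below reads the sensor from Python. It assumes the third-party `pynvml` bindings for NVML are installed and that the GPU of interest is device index 0; neither is stated in the slides. NVML reports power in milliwatts, so the helper converts to watts. The import is deferred so the conversion helper still runs on a machine without a GPU.

```python
def milliwatts_to_watts(mw):
    """NVML reports board power in milliwatts; convert to watts."""
    return mw / 1000.0


def read_gpu_power_watts(device_index=0):
    """Read the current board power draw of one GPU in watts.

    Assumes the pynvml package (Python bindings for NVML) is installed
    and that device_index refers to a GPU with a power sensor.
    """
    import pynvml  # deferred: only needed when a GPU is actually queried

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return milliwatts_to_watts(pynvml.nvmlDeviceGetPowerUsage(handle))
    finally:
        pynvml.nvmlShutdown()
```

Initializing and shutting down NVML on every read keeps the sketch short; a measurement loop would more plausibly initialize once and reuse the handle.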

Problems

 Power sensor data behaves strangely

 Running the same kernel twice yields different energy

 First launch: 114 J, second launch: 147 J (29% more energy)

 Running a kernel 2x as long more than doubles energy

 1x input: 732 J, 2x input: 1579 J (8% above doubling)

 Power sensor sampling rate varies greatly

 Ranges from 0.266 ms to 130 ms (7.7 Hz to 3760 Hz)
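
The percentages and frequencies quoted on this slide can be rechecked with a few lines of arithmetic. The helper names are illustrative only; the numbers are the ones from the slide.

```python
def percent_increase(observed, baseline):
    """Relative increase of observed over baseline, in percent."""
    return (observed / baseline - 1.0) * 100.0


def interval_ms_to_hz(interval_ms):
    """Sampling frequency in Hz for a sampling interval given in milliseconds."""
    return 1000.0 / interval_ms


second_vs_first = percent_increase(147, 114)       # second launch vs. first launch
doubling_excess = percent_increase(1579, 2 * 732)  # 2x input vs. twice the 1x energy
fastest_hz = interval_ms_to_hz(0.266)              # shortest observed interval
slowest_hz = interval_ms_to_hz(130)                # longest observed interval

print(f"{second_vs_first:.1f}% {doubling_excess:.1f}% "
      f"{fastest_hz:.0f} Hz {slowest_hz:.1f} Hz")
```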

Methodology

 Hardware

 Two K20c, two K20m, two K20X, and two K40m GPUs

 Measurement

 Query power and time in loop on “idle” CPU core

 Test code

 Compute-intensive regular n-body kernel

 Constant computation rate of over 2 TFlops on a K20c

 No data dependences; vary n to adjust kernel runtime
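
The measurement step (query power and time in a loop) might look roughly like the sketch below. The function name, the injectable `read_power` callable, and the constant 50 W stand-in reader are assumptions made so the example runs without a GPU; they are not taken from the talk.

```python
import time


def sample_power(read_power, duration_s, clock=time.perf_counter):
    """Poll read_power() in a tight loop for duration_s seconds.

    Returns a list of (timestamp_s, power_w) pairs. Timestamps come from
    clock(), so irregular sensor update intervals remain visible in the data.
    """
    samples = []
    start = clock()
    while True:
        now = clock()
        if now - start >= duration_s:
            break
        samples.append((now - start, read_power()))
    return samples


# Stand-in reader so the sketch runs without a GPU (hypothetical constant 50 W).
samples = sample_power(lambda: 50.0, duration_s=0.01)
```

On a real system, `read_power` would wrap an NVML query, and the loop would run on an otherwise idle CPU core while the kernel executes.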

Expected Power Profile

[Figure: expected power profile. Labels: Kernel starts executing; Kernel stops executing; GPU idle power; Measurement loop runtime.]

Measured Power Profile

[Figure: measured power profile. Duration markers: 5 s, 3 s, 4 s.]

Power ramps up slowly

Power ramps down slowly

Macroscopic phenomena

Switch to step shape

Idle power reached

Energy = Area Under Power Curve

Missing energy?

Unclear how big energy is

Delayed energy?

Integrate to where?

Ramp-up Behavior of 2 Short Runs

Ramp down doesn’t follow

2nd run starts higher but also follows curve

Short run same as longer run

Ramp-down Behavior of Several Runs

[Figure: ramp-down of several runs. Horizontal axis: Shifted Runtime [s], 16.2 to 23.2. Vertical axis: 0 to 160. Markers: t2, t3, t4.]

Driver lowers power level

Shape depends on power at t2

Shape always the same

Steps down every second

Power increases after kernel done

Sampling Interval Lengths

[Figure: sampling interval lengths over time. Horizontal axis: Runtime [s], 10.7 to 23.7. Two vertical axes: 0 to 160 and 0 to 80. Markers: t1, t2, t3, t4.]

Very long interval

Driver activity can prevent sampling

Wide range of intervals

Short intervals

Sampling Interval Lengths (zoomed-in)

[Figure: zoomed-in view of the sampling intervals. Horizontal axis: Runtime [s], 12.030 to 12.060. Two vertical axes: 0 to 120 and 0 to 12.]

Identical values

Sampled power only ever changes after long interval

Very long interval

Many short intervals

Correcting the Measurements

Sampling Frequency

Eliminate redundant samples

 Only sample once every 15 ms (66.7 Hz)

 Cannot accurately measure kernels under ~150 ms

Account for the variation in interval length

 Use high-resolution time stamps

Example: energy from t1 to t4

 Dotted (fixed intervals): 1205 J

 Solid (variable intervals): 1066 J

 13% discrepancy

[Figure: power profile with dotted (fixed-interval) and solid (variable-interval) curves. Horizontal axis: Runtime [s], 10.7 to 23.7. Vertical axis: 0 to 160. Markers: t1, t4.]
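
To illustrate why the time stamps matter, the sketch below integrates the same samples twice: once using the recorded time stamps and once pretending the samples are evenly spaced. The sample values are made up for the example and trapezoidal integration is an assumption; the slides do not say which integration rule was used.

```python
def energy_variable_intervals(samples):
    """Trapezoidal integration of (timestamp_s, power_w) samples, in joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def energy_fixed_intervals(samples):
    """Same integration, but pretending all samples are equally spaced."""
    duration = samples[-1][0] - samples[0][0]
    dt = duration / (len(samples) - 1)
    total = 0.0
    for (_, p0), (_, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * dt
    return total


# Made-up samples: dense sampling at 50 W, then one long gap at 150 W.
samples = [(0.0, 50.0), (0.1, 50.0), (0.2, 50.0), (0.3, 150.0), (1.0, 150.0)]
variable_j = energy_variable_intervals(samples)  # uses the real time stamps
fixed_j = energy_fixed_intervals(samples)        # assumes 0.25 s per interval
```

In this toy data the fixed-interval estimate is too low because the long gap falls in the high-power phase; the error can go either way depending on where the long intervals occur. Note also that at one useful sample per 15 ms, a 150 ms kernel spans only about ten samples.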

True Power

Sensor hardware

 Seems to asymptotically approach true power

 Reminiscent of capacitor charging

True instant power

 $P_{\text{true}}$ is a function of the slope of the power profile, $dP_{\text{sensor}}/dt$, and the power measured by the sensor, $P_{\text{sensor}}$

$$P_{\text{true}} = P_{\text{sensor}} + C \cdot \frac{dP_{\text{sensor}}}{dt}$$

“Capacitance” of sensor

 C ≈ 0.84 s on all tested K20 GPUs
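
A minimal sketch of applying the formula to recorded samples follows. The slope is approximated from neighboring samples (central difference in the interior, one-sided at the ends); the function name and the made-up ramp data are assumptions for illustration.

```python
def correct_power(samples, c=0.84):
    """Apply P_true = P_sensor + C * dP_sensor/dt to (timestamp_s, power_w) samples.

    The slope at each sample is approximated from its neighbors (central
    difference inside, one-sided at the ends). c is in seconds.
    """
    n = len(samples)
    if n < 2:
        return list(samples)  # no slope can be estimated from a single sample
    corrected = []
    for i, (t, p) in enumerate(samples):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        slope = (samples[hi][1] - samples[lo][1]) / (samples[hi][0] - samples[lo][0])
        corrected.append((t, p + c * slope))
    return corrected


# Made-up linear ramp: the sensor reading rises by 10 W per second.
ramp = [(0.0, 50.0), (1.0, 60.0), (2.0, 70.0)]
corrected = correct_power(ramp)
```

With a slope of 10 W/s and C = 0.84 s, every sample is raised by 8.4 W. The sketch assumes strictly increasing time stamps, so duplicate samples should be removed first.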

Back-calculated from Expected Profile

Minimized absolute errors to determine C

‘Capacitor’ function matches measured values perfectly
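
One way such a back-calculation could be done is sketched below. If the true power steps from an idle level to an active level, the formula $P_{\text{true}} = P_{\text{sensor}} + C \cdot dP_{\text{sensor}}/dt$ implies an exponential approach of the sensor reading to the active level. The grid search, the power levels, and the synthetic data are assumptions for illustration; the slides only state that absolute errors were minimized.

```python
import math


def sensor_step_response(t, p_idle, p_active, c):
    """Sensor reading t seconds after true power steps from p_idle to p_active."""
    return p_active - (p_active - p_idle) * math.exp(-t / c)


def fit_capacitance(samples, p_idle, p_active, candidates):
    """Pick the candidate C with the smallest sum of absolute errors."""
    def total_abs_error(c):
        return sum(abs(p - sensor_step_response(t, p_idle, p_active, c))
                   for t, p in samples)
    return min(candidates, key=total_abs_error)


# Synthetic ramp-up generated with C = 0.84 s (idle/active levels are made up).
measured = [(0.1 * i, sensor_step_response(0.1 * i, 50.0, 150.0, 0.84))
            for i in range(50)]
candidates = [k / 100 for k in range(50, 121)]  # 0.50 s .. 1.20 s in 0.01 s steps
best_c = fit_capacitance(measured, 50.0, 150.0, candidates)
```

Because the synthetic data was generated with C = 0.84 s, the search recovers that value; with real measurements the minimum would only be approximate.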

Corrected Power Profile

[Figure: corrected power profile. Horizontal axis: Time [s], 13 to 21. Vertical axis: 0 to 160. Markers: t1, t2, t3.]

Wobbles due to sampling errors

‘Active idle’ power level

Corrected profile matches expected rectangular profile

Correction of 2 Short Runs

[Figure: corrected power profile of two short runs. Horizontal axis: Time [s], 111 to 119. Vertical axis: 0 to 160. Markers: t1a, t2a, t1b, t2b, t3b.]

Corrected power profile matches expected profile

Second K20c GPU

[Figure: power profile of the second K20c. Horizontal axis: Time [s], 16.5 to 23.5. Vertical axis: 0 to 160. Markers: t1, t2, t3.]

Identical to original K20c

K20m GPU

[Figure: power profile of the K20m. Horizontal axis: Time [s], 62.7 to 69.7. Vertical axis: 0 to 180. Markers: t1, t2, t3.]

Similar profile but higher power level

K20X GPU

[Figure: power profile of the K20X. Horizontal axis: Time [s], 128 to 137. Vertical axis: 0 to 200. Markers: t1, t2, t4.]

Profile is good, no correction needed!

Huge 600 ms gap

K40m GPU

K40m again requires correction

Application to Full CUDA Program

 Implementation of Barnes Hut n-body algorithm

 Taken from LonestarGPU benchmark suite

 Contains multiple regular and irregular kernels

 Highly optimized, but still suffers from load imbalance, divergence, and uncoalesced accesses

 Main kernel is ‘regularized’ (warp-based)

Image credit: NASA/JPL-Caltech/SSC

Barnes Hut Power Profile (1 Step)

“Wave” in profile

Slow then fast drop-off

Original profile is hard to interpret

Barnes Hut Power Profile (Kernels)

“Wave” in profile

Slow then fast drop-off

Original profile is hard to interpret

Corrected Barnes Hut Power Profile

[Figure: corrected Barnes Hut power profile. Horizontal axis: Time [s], 61.7 to 68.7. Vertical axis: 0 to 160. Markers: a, b, c, d, e, f.]

Two similar irreg. kernels

One more irreg. kernel

Regularized main kernel

Decrease due to load imbal.

Very short regular kernel

Corrected profile reveals important info

K20Power Tool

 Output

 Corrected profile and corresponding ‘active’ energy

Features

 Computes instant power using ‘capacitor’ formula

 Employs high-resolution time steps

 Samples at true frequency of 66.7 Hz

Dissemination

 Open source, research license

 http://cs.txstate.edu/~burtscher/research/K20power/

Marcher System

 Tool will be part of Marcher system at Texas State

 NSF-funded green computing infrastructure

 Marcher is a power-measurable cluster system

 832 general-purpose cores

 12,000 GPU and MIC cores

 1.2 TB of DDR3 with power throttling and scaling

 50 TB of hybrid storage with hard drives and SSDs

 Component-level power measurement tools (e.g., CPU, DRAM, Disk, GPU, Xeon Phi)

Summary

 Correctly measuring K20/K40 power and energy

 Sample at 66.7 Hz and include time stamps

 Compute true power with presented formula

 Use neighboring power samples to approximate slope

 Compute true energy by integrating true power

 Over intervals where power is above ‘active idle’

 K20Power tool

 Software tool that implements this methodology

 Paper at http://cs.txstate.edu/~burtscher/papers/gpgpu14.pdf
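
The final step of the summary, integrating true power over the intervals where it exceeds the ‘active idle’ level, could be sketched as follows. The function name, the threshold value, and the sample profile are made up. The sketch integrates the full power over the qualifying intervals and does not subtract the idle level; the slides do not specify either way, so treat that as an assumption.

```python
def active_energy(samples, active_idle_w):
    """Integrate power over the intervals where it exceeds active_idle_w.

    samples are (timestamp_s, power_w) pairs of corrected power. An interval
    counts only if both of its endpoints are above the threshold, which is a
    simplification chosen for this sketch.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if p0 > active_idle_w and p1 > active_idle_w:
            total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Made-up corrected profile: 50 W active idle, 150 W while a kernel runs.
profile = [(0.0, 50.0), (1.0, 150.0), (2.0, 150.0), (3.0, 150.0), (4.0, 50.0)]
energy_j = active_energy(profile, active_idle_w=50.0)
```

Here only the two intervals between 1 s and 3 s qualify, giving 300 J. For the actual implementation, see the K20Power tool and the paper linked above.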

Acknowledgments

 Collaborators

 Ivan Zecena and Ziliang Zong

 U.S. National Science Foundation

 DUE-1141022, CNS-1217231, and CNS-1305359

 NVIDIA Corporation

 Grants and equipment donations

 Texas State University

 Research Enhancement Program
