MeterPU: A Generic Measurement Abstraction API – Enabling Energy-tuned Skeleton Backend Selection

Lu Li, Christoph Kessler
lu.li@liu.se, christoph.kessler@liu.se
Agenda
Motivation
MeterPU
SkePU and Integration with MeterPU
Experiments
Related Work
Conclusions and Future Work
Acknowledgment
EXCESS Project (2013-2016)
EU FP7 project
Holistic energy optimization for embedded and HPC systems
More info: http://excess-project.eu/
Motivation
Parallel programming is challenging:
decomposition, communication, synchronization, load balancing, ...
Parallel programming abstractions are needed.
SkePU: a state-of-the-art skeleton programming framework.
Parallel map, reduce, etc. on multiple CPUs and GPUs.
Automated selection between CPU and GPU for running skeletons,
tuned for time optimization.
Energy has become the main bottleneck for continued performance improvement in recent years.
How can a legacy empirical autotuning framework such as SkePU be made energy-tuned?
Another empirical autotuning framework: PetaBricks [1]
Main idea
Autotuning software based on empirical modeling optimizes on values, not on a specific metric:
a measurement facility gives time values ⇒ a time model is built ⇒ time optimization.
The same holds for energy values.
The autotuning logic can be reused (see the sketch below).
A natural solution: unifying the measurement interface
allows reuse of legacy empirical autotuning frameworks.
More reasons for such a unification:
Energy measurement is tricky, especially for GPUs.
MeterPU was developed in our group.
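A minimal sketch of this reuse argument, written against the MeterPU interface shown later: a tuner templated on the metric type works unchanged for time and energy. The helper name pick_best_variant and the assumption that the metric's result type converts to double are illustrative, not part of MeterPU or SkePU.

// Illustrative sketch only (not SkePU/MeterPU source code): a tuner written once
// against the generic Meter<Metric> interface can be reused for any metric.
#include <MeterPU.h>
#include <functional>
#include <vector>
#include <cstddef>

template <class Metric>
std::size_t pick_best_variant(const std::vector<std::function<void()> >& variants)
{
    using namespace MeterPU;
    std::size_t best = 0;
    double best_value = -1.0;              // assumes the metric value converts to double
    for (std::size_t i = 0; i < variants.size(); ++i) {
        Meter<Metric> meter;
        meter.start();
        variants[i]();                     // run candidate implementation
        meter.stop();
        meter.calc();
        double v = meter.get_value();
        if (best_value < 0.0 || v < best_value) { best = i; best_value = v; }
    }
    return best;                           // index of the fastest / most energy-efficient variant
}

// Usage: pick_best_variant<CPU_Time>(variants) or pick_best_variant<NVML_Energy<> >(variants);
// only the template argument changes when the optimization goal changes.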
How Does Unification Allow Reuse of a Legacy Autotuning Framework?
Figure 1: Unification allows an empirical autotuning framework to switch between multiple meter types on different hardware components.
Why is GPU Energy Measurement Tricky?
[Figure 2: Data visualization [8] for a program run, plotting GPU power (mW), temperature, and fan speed over time (s).
It illustrates the capacitor effect and the correction methods [2] for the true instant power.
Blue dashed line: program starts; red dashed line: program ends;
green dashed line: power drops back to the static power level.]
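To make the sampling-based nature of the problem concrete, here is a rough sketch of the underlying idea: sample the board power via NVML while the kernel runs and integrate it over time. This is illustration only, not MeterPU's implementation, and it omits the capacitor-effect correction of [2]; the sampling period and duration are arbitrary.

// Sketch of sampling-based GPU energy measurement (illustration only):
// read the instantaneous power with NVML and integrate it over time.
#include <nvml.h>
#include <chrono>
#include <thread>
#include <vector>
#include <utility>
#include <cstddef>
#include <cstdio>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    using clk = std::chrono::steady_clock;
    std::vector<std::pair<double, double> > samples;   // (time [s], power [W])

    auto t0 = clk::now();
    for (int i = 0; i < 1000; ++i) {                    // sample for roughly 1 s
        unsigned int mw = 0;
        nvmlDeviceGetPowerUsage(dev, &mw);              // board power in milliwatts
        double t = std::chrono::duration<double>(clk::now() - t0).count();
        samples.push_back(std::make_pair(t, mw / 1000.0));
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    double energy_j = 0.0;                              // trapezoidal integration
    for (std::size_t i = 1; i < samples.size(); ++i)
        energy_j += 0.5 * (samples[i].second + samples[i - 1].second)
                        * (samples[i].first - samples[i - 1].first);

    std::printf("Sampled energy: %f J\n", energy_j);
    nvmlShutdown();
    return 0;
}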
MeterPU
A software multimeter:
a generic measurement abstraction, hiding metric-specific details.
Four simple functions unify the measurement interface
for various metrics on different hardware components:
time and energy on CPU and GPU.
Easy to extend: FLOPS, cache misses, etc.
Built on top of native measurement libraries:
CPU time: clock_gettime()
GPU time: cudaEventRecord()
CPU and DRAM energy: Intel PCM library
GPU energy: Nvidia NVML library
MeterPU API
template <class Type>   // analogous to the switch on a real multimeter
class Meter
{
public:
    void start();       // start a measurement
    void stop();        // stop a measurement
    void calc();        // calculate a metric value of a code region
    typename Meter_Traits<Type>::ResultType const& get_value() const;
                        // get the calculated metric value
private:
    /* metric-specific logic hidden ... */
};

Listing 1: Main MeterPU API
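As one concrete illustration of how a metric can be mapped onto these four functions, below is a minimal sketch of a CPU wall-clock time meter built on clock_gettime(). The struct name and the microsecond result type are assumptions for illustration; this is not MeterPU's actual internal code.

// Minimal sketch (not MeterPU internals): one metric, CPU wall-clock time,
// expressed through the start/stop/calc/get_value interface above.
#include <ctime>

struct CpuTimeMeterSketch
{
    void start() { clock_gettime(CLOCK_MONOTONIC, &t_start); }
    void stop()  { clock_gettime(CLOCK_MONOTONIC, &t_stop); }
    void calc()
    {
        // elapsed time in microseconds
        value = (t_stop.tv_sec  - t_start.tv_sec)  * 1e6
              + (t_stop.tv_nsec - t_start.tv_nsec) * 1e-3;
    }
    double const& get_value() const { return value; }

private:
    timespec t_start, t_stop;
    double value = 0.0;
};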
A MeterPU Application: Measuring Time

#include <MeterPU.h>

int main()
{
    using namespace MeterPU;
    Meter<CPU_Time> meter;

    meter.start();                         // measurement starts
    cpu_func();                            // do something here
    meter.stop();                          // measurement stops

    meter.calc();
    BUILD_CPU_TIME_MODEL(meter.get_value());
}

Listing 2: An example application that measures CPU time
A MeterPU Application: Measuring GPU Energy

#include <MeterPU.h>

int main()
{
    using namespace MeterPU;
    // Only one line differs!
    Meter<NVML_Energy<> > meter;

    meter.start();                         // measurement starts
    cuda_func<<<..., ...>>>(...);
    cudaDeviceSynchronize();
    meter.stop();                          // measurement stops

    meter.calc();
    BUILD_GPU_ENERGY_MODEL(meter.get_value());
}

Listing 3: An example application that measures GPU energy
Measuring Combinations of CPUs and GPUs

#include <MeterPU.h>

int main()
{
    using namespace MeterPU;
    // Only one line differs!
    Meter<System_Energy<0> > meter;

    meter.start();                         // measurement starts
    cpu_func();
    cuda_func<<<..., ...>>>(...);
    wait_for_cpu_func_to_finish();
    cudaDeviceSynchronize();
    meter.stop();                          // measurement stops

    meter.calc();
    BUILD_SYSTEM_ENERGY_MODEL(meter.get_value());
}

Listing 4: An example application that measures system energy
SkePU introduction
State-of-the-art skeleton programming framework.
Supported skeletons: map, reduce, scan, maparray, etc.
Multiple backends and multi-GPU support.
Automated context-aware implementation selection.
Smart containers.
...
Some of the SkePU Skeletons
[Figure: illustration of the SkePU skeleton patterns on example vectors: (a) Map, (b) Reduce, (c) MapReduce, (d) MapOverlap, (e) MapArray, (f) Scan.]
SkePU MeterPU Integration
Trivial changes to the SkePU code,
in SkePU's empirical modeling part:
Declare a MeterPU system energy meter.
Replace the time measurement code
(previously implemented with a native library call, clock_gettime())
by the corresponding MeterPU calls.
(6 lines of code; see the sketch below)
Done!
Remarks
+ Changing the optimization goal requires changing only 1 line of code (the meter initialization).
- Data transfer is not considered for now.
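The sketch below illustrates the kind of substitution described above; it is not the actual SkePU source. The helpers run_skeleton_backend() and record_training_point() are hypothetical placeholders standing in for SkePU's own code.

// Illustration only (not SkePU source code): replacing native timing around the
// measured code region with a MeterPU meter.
#include <ctime>
#include <MeterPU.h>

// Hypothetical placeholders, declared only so the sketch compiles.
void run_skeleton_backend();
template <class V> void record_training_point(const V& value);

// Before: native timing of the region fed into the empirical model.
//   timespec t0, t1;
//   clock_gettime(CLOCK_MONOTONIC, &t0);
//   run_skeleton_backend();
//   clock_gettime(CLOCK_MONOTONIC, &t1);
//   record_training_point(elapsed_microseconds(t0, t1));

// After: the same region measured through MeterPU; changing the template
// argument (e.g. to MeterPU::System_Energy<0>) changes the optimization goal.
void measure_region_sketch()
{
    MeterPU::Meter<MeterPU::CPU_Time> meter;
    meter.start();
    run_skeleton_backend();
    meter.stop();
    meter.calc();
    record_training_point(meter.get_value());
}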
Experiment Method
Integrate SkePU with MeterPU system meters, enabling easy tuning for both time and energy.
Train SkePU's autotuning predictors on training examples selected by our smart sampling algorithm [7, 9].
We train for five skeletons: map, reduce, mapreduce, mapoverlap, and maparray.
Choose test points at exponentially growing problem sizes.
Besides the skeleton types, we test LU decomposition implemented with the maparray skeleton.
Plot the performance of each variant and of the automated selection.
Experimental Setup
LiU's GPU server
CPU: Intel(R) Xeon(R) CPU E5-2630L v2,
2 sockets (6 cores each), max frequency 2.4 GHz.
Supports the Intel PCM library.
DRAM: 64 GB over 2 sockets.
GPU: Nvidia Tesla Kepler K20 C-class, 13 SMs, 2496 cores.
Supports the Nvidia NVML library.
Experimental Setup
Skeleton type | Description                          | User function
Map           | b_i = f(a_i)                         | return a*a;
Reduce        | d = f(a_1, a_2, ..., a_n)            | return a+b;
Mapreduce     | d = g(f(a_1,b_1), ..., f(a_n,b_n))   | return a*b; // for map
              |                                      | return a+b; // for reduce
Mapoverlap    | b_i = f(a_{i-t}, ..., a_{i+t})       | return (a[-2]*4 + a[-1]*2 + a[0]*1 + a[1]*2 + a[2]*4)/5;
Maparray      | b_i = f(a_1, ..., a_n)               | int index = (int)b; return a[index];

Table 1: Setup for the different SkePU skeletons. (a, b, c: vectors; d: scalar; t: positive constant.)
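For readers unfamiliar with the patterns in Table 1, the plain C++ functions below sketch the sequential semantics of three of them. This is illustration only, not the SkePU API (which runs these patterns on CPU, OpenMP, or CUDA backends); all names and types are chosen for the example.

// Sequential reference semantics of some Table 1 skeletons (illustration only).
#include <vector>
#include <functional>
#include <cstddef>

// Map: b_i = f(a_i)
std::vector<float> map_ref(const std::vector<float>& a,
                           const std::function<float(float)>& f)
{
    std::vector<float> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) b[i] = f(a[i]);
    return b;
}

// Reduce: d = f(a_1, a_2, ..., a_n)
float reduce_ref(const std::vector<float>& a,
                 const std::function<float(float, float)>& f)
{
    float d = a.at(0);
    for (std::size_t i = 1; i < a.size(); ++i) d = f(d, a[i]);
    return d;
}

// MapOverlap: b_i = f(a_{i-t}, ..., a_{i+t}); boundary handling omitted.
std::vector<float> mapoverlap_ref(const std::vector<float>& a, std::size_t t,
                                  const std::function<float(const float*)>& f)
{
    std::vector<float> b(a.size());
    for (std::size_t i = t; i + t < a.size(); ++i)
        b[i] = f(&a[i]);   // f may index offsets -t .. +t relative to this pointer
    return b;
}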
Mapreduce Skeletons
[Figure: (a) Time tuning for MapReduce: time (µs, average of 100 runs) over problem size, for the CPU, OMP, and CUDA variants and the automated selection.
(b) Energy tuning for MapReduce: energy (mJ, average of 1000 runs) over problem size, for the same variants.]
Most of the time the selections are smart, and the speedup (max 20× in time and 10× in energy) increases with problem size.
Reduce Skeletons
[Figure: (a) Time tuning for Reduce: time (µs, average of 100 runs) over problem size, for the CPU, OMP, and CUDA variants and the automated selection.
(b) Energy tuning for Reduce: energy (mJ, average of 1000 runs) over problem size, for the same variants.]
Empirical autotuning framework for time opt. reused for energy opt.
LU decomposition
[Figure: (a) Time tuning for LU decomposition: time (µs, average of 3000 runs) over problem size.
(b) Energy tuning for LU decomposition: energy (mJ, average of 1000 runs) over problem size.
Variants: CPU, OMP, CUDA, and the automated selection.]
Easy switching between optimization goals.
MeterPU Overhead
[Figure 4: Non-observable MeterPU overhead (only one extra function call): (a) time (µs, mean of 40 runs) and (b) energy (mJ, mean of 40 runs) over the number of microseconds slept, for MeterPU vs. the native API.]
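A rough sketch of how such an overhead comparison can be set up: the same sleep region is measured once through MeterPU and once with the native call. The exact benchmark behind Figure 4 may differ; the conversion of the MeterPU result to a double is an assumption.

// Illustration only: comparing MeterPU against native timing on an idle region.
#include <MeterPU.h>
#include <ctime>
#include <unistd.h>
#include <cstdio>

int main()
{
    const useconds_t sleep_us = 1000;   // region length; vary as on the x-axis of Figure 4

    // MeterPU measurement of the region
    MeterPU::Meter<MeterPU::CPU_Time> meter;
    meter.start();
    usleep(sleep_us);
    meter.stop();
    meter.calc();
    double meterpu_us = meter.get_value();   // assumes the result type converts to double

    // Native measurement of an equivalent region
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    usleep(sleep_us);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double native_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;

    // If the two values agree, MeterPU adds no observable overhead.
    std::printf("MeterPU: %f us, native: %f us\n", meterpu_us, native_us);
    return 0;
}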
Related Work and Comparison
Other measurement abstraction software:
EML [3]: a C++ library that tries to unify measurement of time and energy; easy to extend; helps to build analytical models.
REPARA's performance and energy monitoring library [4]: supports both counter-based and hardware-based measurement methods; overhead within 5%.
MeterPU: a simpler interface while keeping generality; the overhead is only one extra function call; easy creation of aggregate meters from existing meters.
Monitoring frameworks:
take measurements in either an intrusive or a non-intrusive way and store the results in a database.
Nagios [6], GroundWork [5], etc.
MeterPU: more lightweight, lower overhead to retrieve data, tailored for use in a feedback loop for optimization purposes.
A comparison between SkePU and other skeleton frameworks is detailed in our paper.
Summary of Contributions
Hides the complexity of energy measurement, especially for GPUs.
MeterPU enables reuse of legacy empirical autotuning frameworks, such as SkePU.
With MeterPU, SkePU offers the first energy-tuned skeletons, as far as we know.
Switching the optimization goal becomes easy,
which facilitates building time and energy models conveniently for multi-objective optimization.
Some Limitations
MeterPU requires hardware support for the native libraries,
such as Intel PCM and NVML (with power sampling support).
The minimal kernel runtime for GPU energy measurement is not very small (at least 150 ms) [2].
Counter-based measurement methods may not have the same accuracy as hardware-based methods,
but hardware-based methods are difficult to deploy everywhere.
+ MeterPU is not coupled to any specific measurement method.
Future work
Build infrastructure to measure data transfers (PCIe channel).
Planned integration with other software:
TunePU: a generic autotuning framework, coming soon.
GRS: a global composition framework.
Build support for clusters and Intel Xeon Phi.
MeterPU source code download:
http://www.ida.liu.se/labs/pelab/meterpu
Bibliography

[1] Jason Ansel, Cy P. Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman P. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2009), pages 38–49. ACM, 2009.

[2] Martin Burtscher, Ivan Zecena, and Ziliang Zong. Measuring GPU power with the K20 built-in sensor. In Proc. Workshop on General Purpose Processing Using GPUs (GPGPU-7). ACM, March 2014.

[3] Alberto Cabrera, Francisco Almeida, Javier Arteaga, and Vicente Blanco. Energy Measurement Library (EML) usage and overhead analysis. In 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 554–558, March 2015.

[4] Marco Danelutto, Zoltan Herczeg, Akos Kiss, Peter Molnar, Robert Sipka, Massimo Torquati, and Laszlo Vidacs. D6.4: REPARA performance and energy monitoring library. Technical report, The REPARA Consortium, March 2015.

[5] GroundWork Inc. GroundWork: Unified Monitoring For Real. http://www.gwos.com/. Accessed: 2015-01-21.

[6] David Josephsen. Building a Monitoring Infrastructure with Nagios. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2007.

[7] Lu Li, Usman Dastgeer, and Christoph Kessler. Adaptive off-line tuning for optimized composition of components for heterogeneous many-core systems. In Michel Daydé, Osni Marques, and Kengo Nakajima, editors, High Performance Computing for Computational Science - VECPAR 2012, volume 7851 of Lecture Notes in Computer Science, pages 329–345. Springer Berlin Heidelberg, 2013.

[8] Lu Li and Christoph Kessler. Validating energy compositionality of GPU computations. In Proc. HiPEAC Workshop on Energy Efficiency with Heterogeneous Computing (EEHCO-2015), January 2015.

[9] Lu Li, Usman Dastgeer, and Christoph Kessler. Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. In Proc. Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at ICPP, 2014. To appear.