MeterPU: A Generic Measurement Abstraction API – Enabling Energy-tuned Skeleton Backend Selection
Lu Li, Christoph Kessler
lu.li@liu.se, christoph.kessler@liu.se

Agenda
- Motivation
- MeterPU
- SkePU and Integration with MeterPU
- Experiments
- Related Work
- Conclusions and Future Work

Acknowledgment
- EXCESS project (2013-2016), an EU FP7 project on holistic energy optimization for embedded and HPC systems.
- More info: http://excess-project.eu/

Motivation
- Parallel programming is challenging: decomposition, communication, synchronization, load balancing, ...
- A parallel programming abstraction is needed.
- SkePU: a state-of-the-art skeleton programming framework.
  - Parallel map, reduce, etc. on multiple CPUs and GPUs.
  - Automated selection of CPU or GPU execution of skeletons for time optimization.
- Energy has become the main bottleneck for continued performance improvement in recent years.
- How can a legacy empirical autotuning framework such as SkePU be made energy-tuned?
- Another empirical autotuning framework: PetaBricks [1].

Main Idea
- Autotuning software based on empirical modeling optimizes on values, not on a particular metric:
  a measurement facility delivers time values ⇒ a time model is built ⇒ time optimization.
- The same works for energy values, so the autotuning logic can be reused.
- A natural solution: unifying the measurement interface allows reuse of legacy empirical autotuning frameworks.
- A further reason for such a unification: energy measurement is tricky, especially for GPUs.
- MeterPU is developed in our group.

Does Unification Allow Reuse of a Legacy Autotuning Framework?
Figure 1: Unification allows an empirical autotuning framework to switch among multiple meter types on different hardware components.

Why Is GPU Energy Measurement Tricky?
[Figure 2: power (mW), temperature, and fan speed plotted over time (s) for a program run.]
Figure 2: Data visualization [8] of a program run, illustrating the capacitor effect and the correction method [2] for the true instantaneous power. Blue dashed line: program start; red dashed line: program end; green dashed line: power drops back to the static power level.

MeterPU
- A software multimeter: a generic measurement abstraction that hides metric-specific details.
- Four simple functions unify the measurement interface across metrics and hardware components.
  - Time and energy on CPU and GPU.
  - Easy to extend: FLOPS, cache misses, etc.
- Built on top of native measurement libraries:
  - CPU time: clock_gettime()
  - GPU time: cudaEventRecord()
  - CPU and DRAM energy: Intel PCM library
  - GPU energy: Nvidia NVML library

MeterPU API

    template <class Type>   // analogous to the metric switch on a real multimeter
    class Meter
    {
    public:
        void start();       // start a measurement
        void stop();        // stop a measurement
        void calc();        // calculate a metric value for a code region
        typename Meter_Traits<Type>::ResultType const &get_value() const;
                            // get the calculated metric value
    private:
        // metric-specific logic hidden
    };

Listing 1: Main MeterPU API

A MeterPU Application: Measure Time

    #include <MeterPU.h>

    int main()
    {
        using namespace MeterPU;
        Meter<CPU_Time> meter;

        meter.start();      // measurement starts
        cpu_func();         // do something here
        meter.stop();       // measurement stops

        meter.calc();
        BUILD_CPU_TIME_MODEL(meter.get_value());
    }

Listing 2: An example application that measures CPU time
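As a side remark (not on the slides): because the four calls of Listing 1 are metric-independent, measurement code can be written once and parameterized over the metric, which is what lets a legacy time-tuning framework be reused for energy (cf. the Main Idea slide). Below is a minimal sketch; the helper name measure_metric, the assumption that Meter_Traits lives in the MeterPU namespace like Meter, and the assumption that the metric's ResultType is cheaply copyable are ours, not part of MeterPU.

    #include <MeterPU.h>

    // Hypothetical generic wrapper: the same code measures any MeterPU metric.
    template <class Metric, class Callable>
    typename MeterPU::Meter_Traits<Metric>::ResultType
    measure_metric(Callable work)
    {
        MeterPU::Meter<Metric> meter;
        meter.start();      // measurement starts
        work();             // the code region to be measured
        meter.stop();       // measurement stops
        meter.calc();
        return meter.get_value();
    }

    // Usage: only the template argument changes between optimization goals, e.g.
    //   auto t = measure_metric<MeterPU::CPU_Time>(cpu_func);
    //   auto e = measure_metric<MeterPU::NVML_Energy<>>(gpu_work);  // gpu_work: hypothetical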
A MeterPU Application: Measure GPU Energy

    #include <MeterPU.h>

    int main()
    {
        using namespace MeterPU;
        // Only one line differs!
        Meter<NVML_Energy<>> meter;

        meter.start();      // measurement starts
        cuda_func<<<..., ...>>>(...);
        cudaDeviceSynchronize();
        meter.stop();       // measurement stops

        meter.calc();
        BUILD_GPU_ENERGY_MODEL(meter.get_value());
    }

Listing 3: An example application that measures GPU energy

Measure Combinations of CPUs and GPUs

    #include <MeterPU.h>

    int main()
    {
        using namespace MeterPU;
        // Only one line differs!
        Meter<System_Energy<0>> meter;

        meter.start();      // measurement starts
        cpu_func();
        cuda_func<<<..., ...>>>(...);
        wait_for_cpu_func_to_finish();
        cudaDeviceSynchronize();
        meter.stop();       // measurement stops

        meter.calc();
        BUILD_SYSTEM_ENERGY_MODEL(meter.get_value());
    }

Listing 4: An example application that measures system energy

SkePU Introduction
- State-of-the-art skeleton programming framework.
- Supported skeletons: map, reduce, scan, maparray, etc.
- Multiple back-ends and multi-GPU support.
- Automated context-aware implementation selection.
- Smart containers.
- ...

Some SkePU Skeletons
[Figure: illustration of the skeletons (a) Map, (b) Reduce, (c) MapReduce, (d) MapOverlap, (e) MapArray, (f) Scan.]

SkePU–MeterPU Integration
- Trivial changes to the SkePU code, in SkePU's empirical-modeling part:
  - Declare a MeterPU system energy meter.
  - Replace the time measurement code (previously implemented with the native clock_gettime() call) with the analogous MeterPU calls (6 lines of code).
  - Done!
- Remarks:
  - Changing the optimization goal requires changing only 1 line of code (the meter initialization); a sketch of the change follows Table 1.
  - Data transfer is not considered for now.

Experiment Method
- Integrate SkePU with MeterPU system meters, enabling easy tuning for time and energy.
- Train the SkePU autotuning predictors on training examples selected by our smart sampling algorithm [7, 9].
- We train for five skeletons: map, reduce, mapreduce, mapoverlap, and maparray.
- Choose test points with exponentially growing sizes.
- Besides the individual skeleton types, we test LU decomposition implemented with the maparray skeleton.
- Plot the performance of each variant and of the automated selection.

Experimental Setup
LIU's GPU server:
- CPU: Intel Xeon E5-2630L v2, 2 sockets (6 cores each), max frequency 2.4 GHz; supports the Intel PCM library.
- DRAM: 64 GB on 2 sockets.
- GPU: Nvidia Tesla Kepler K20 (C-class), 13 SMs, 2496 cores; supports the Nvidia NVML library.

Skeleton type | Description                          | User function
Map           | b_i = f(a_i)                         | return a*a;
Reduce        | d = f(a_1, a_2, ..., a_n)            | return a+b;
MapReduce     | d = g(f(a_1, b_1), ..., f(a_n, b_n)) | return a*b;  // for map
              |                                      | return a+b;  // for reduce
MapOverlap    | b_i = f(a_{i-t}, ..., a_{i+t})       | return (a[-2]*4 + a[-1]*2 + a[0]*1 + a[1]*2 + a[2]*4)/5;
MapArray      | b_i = f(a_1, ..., a_n)               | int index = (int)b; return a[index];

Table 1: Setup for the different SkePU skeletons (a, b, c: vectors; d: scalar; t: positive constant).
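To illustrate the change described on the SkePU–MeterPU Integration slide, the sketch below shows roughly how the measured region inside SkePU's empirical-modeling code looks after the switch. It only reuses the API of Listings 1 and 4; run_skeleton_variant and record_training_point are hypothetical placeholder names, not actual SkePU internals, and the shape of the surrounding tuning loop is an assumption.

    #include <MeterPU.h>

    // Hypothetical fragment of one empirical-modeling (training) step.
    // Only the meter type encodes the optimization goal; the tuning logic is untouched.
    void build_model_point(int variant, size_t problem_size)
    {
        using namespace MeterPU;
        // Meter<CPU_Time> meter;        // previous goal: time optimization
        Meter<System_Energy<0>> meter;   // new goal: system energy (cf. Listing 4)

        meter.start();
        run_skeleton_variant(variant, problem_size);   // hypothetical: run one backend
        meter.stop();
        meter.calc();

        record_training_point(variant, problem_size, meter.get_value());  // hypothetical
    }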
MapReduce Skeleton
[Figure: (a) time tuning and (b) energy tuning for MapReduce; time in µs (average of 100 runs) and energy in mJ (average of 1000 runs) over problem size, for the CPU, OpenMP, CUDA, and automatically selected variants.]
Most of the time the selections are smart, and the speedup (up to 20x in time and 10x in energy) increases with problem size.

Reduce Skeleton
[Figure: (a) time tuning and (b) energy tuning for Reduce; time in µs (average of 100 runs) and energy in mJ (average of 1000 runs) over problem size, for the CPU, OpenMP, CUDA, and automatically selected variants.]
The empirical autotuning framework built for time optimization is reused for energy optimization.

LU Decomposition
[Figure: (a) time tuning and (b) energy tuning for LU decomposition; time in µs (average of 3000 runs) and energy in mJ (average of 1000 runs) over problem size, for the CPU, OpenMP, CUDA, and automatically selected variants.]
Easy switching between optimization goals.

MeterPU Overhead
[Figure 4: (a) time overhead and (b) energy overhead of MeterPU compared to the native APIs, over the number of microseconds slept (mean of 40 runs).]
Figure 4: Non-observable MeterPU overhead (only one extra function call).

Related Work and Comparison
- Other measurement abstraction software:
  - EML [3]: a C++ library that unifies the measurement of time and energy, is easy to extend, and helps to build analytical models.
  - REPARA's performance and energy monitoring library [4]: supports both counter-based and hardware-based measurement methods; overhead within 5%.
  - MeterPU: a simpler interface while keeping generality; the overhead is only one extra function call; easy creation of aggregate meters from existing meters.
- Monitoring frameworks, e.g. Nagios [6] and GroundWork [5]: take measurements intrusively or non-intrusively and store the results in a database.
  - MeterPU: more lightweight, lower overhead for retrieving data, tailored for use in a feedback loop for optimization purposes.
- A comparison between SkePU and other skeleton frameworks is detailed in our paper.

Summary of Contributions
- Hides the complexity of energy measurement, which is intricate especially for GPUs.
- MeterPU enables the reuse of legacy empirical autotuning frameworks such as SkePU.
- With MeterPU, SkePU offers the first energy-tuned skeletons, as far as we know.
- Switching the optimization goal becomes easy, which makes it convenient to build both time and energy models for multi-objective optimization.

Some Limitations
- MeterPU requires hardware support for the native libraries, such as Intel PCM and NVML (with power-sampling support).
- The minimal kernel runtime for accurate GPU energy measurement is not very small (at least 150 ms) [2]; see the sketch below.
- A counter-based measurement method may not reach the accuracy of a hardware-based method, but the hardware-based method is difficult to deploy everywhere; MeterPU is not coupled to any specific measurement method.
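A common workaround for the minimal-runtime limitation (our suggestion, not from the slides) is to repeat a short kernel inside a single measured region and divide by the repetition count. In the sketch below, dummy_kernel, its launch configuration, and the assumption that the energy result is a plain arithmetic value are hypothetical.

    #include <MeterPU.h>

    // Hypothetical sketch: amortize the ~150 ms minimum measurable GPU region [2]
    // by repeating a short kernel and reporting the average energy per launch.
    double average_kernel_energy(int repetitions)  // chosen so the total runtime exceeds 150 ms
    {
        using namespace MeterPU;
        Meter<NVML_Energy<>> meter;

        meter.start();
        for (int i = 0; i < repetitions; ++i)
            dummy_kernel<<<128, 256>>>();          // hypothetical short kernel
        cudaDeviceSynchronize();                   // wait for all launches to finish
        meter.stop();
        meter.calc();

        return meter.get_value() / repetitions;    // assumes an arithmetic ResultType
    }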
Future Work
- Build infrastructure to measure data transfers (PCIe channel).
- Planned integration with other software:
  - TunePU: a generic autotuning framework, coming soon.
  - GRS: a global composition framework; build support for clusters and Intel Xeon Phi.
- MeterPU source code download: http://www.ida.liu.se/labs/pelab/meterpu

Bibliography

[1] Jason Ansel, Cy P. Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman P. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2009), pages 38–49. ACM, 2009.
[2] Martin Burtscher, Ivan Zecena, and Ziliang Zong. Measuring GPU power with the K20 built-in sensor. In Proc. Workshop on General Purpose Processing Using GPUs (GPGPU-7). ACM, March 2014.
[3] Alberto Cabrera, Francisco Almeida, Javier Arteaga, and Vicente Blanco. Energy Measurement Library (EML) usage and overhead analysis. In 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 554–558, March 2015.
[4] Marco Danelutto, Zoltan Herczeg, Akos Kiss, Peter Molnar, Robert Sipka, Massimo Torquati, and Laszlo Vidacs. D6.4: REPARA performance and energy monitoring library. Technical report, The REPARA Consortium, March 2015.
[5] GroundWork Inc. GroundWork – Unified Monitoring For Real. http://www.gwos.com/. Accessed: 2015-01-21.
[6] David Josephsen. Building a Monitoring Infrastructure with Nagios. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2007.
[7] Lu Li, Usman Dastgeer, and Christoph Kessler. Adaptive off-line tuning for optimized composition of components for heterogeneous many-core systems. In Michel Daydé, Osni Marques, and Kengo Nakajima, editors, High Performance Computing for Computational Science – VECPAR 2012, volume 7851 of Lecture Notes in Computer Science, pages 329–345. Springer Berlin Heidelberg, 2013.
[8] Lu Li and Christoph Kessler. Validating energy compositionality of GPU computations. In Proc. HiPEAC Workshop on Energy Efficiency with Heterogeneous Computing (EEHCO 2015), January 2015.
[9] Lu Li, Usman Dastgeer, and Christoph Kessler. Pruning strategies in adaptive off-line tuning for optimized composition of components on heterogeneous systems. In Proc. Seventh International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) at ICPP, 2014. To appear.