5th International Workshop on Power-aware Algorithms, Systems, and Architectures (ICPP PASA 2016)

Compiler Transformations Meet CPU Clock Modulation and Power Capping

Wei Wang, University of Delaware
Allan Porterfield, RENaissance Computing Institute
John Cavazos, University of Delaware
Sridutt Bhalachandra, University of North Carolina at Chapel Hill

August 16, 2016

Motivation
- Power consumption must be bounded to achieve exascale computing
- Various power management techniques exist for application energy control
- Impact of power management techniques on compiler transformations

CPU Clock Modulation
- Write a specific value to the IA32_CLOCK_MODULATION MSR (0x19a)
- Modify /dev/cpu/{0..15}/msr with root privilege
- Invoke wrmsr inline assembly from applications through an added system call
Figure: CPU Clock Modulation. Sample modulation with a 25% duty cycle. (Source: IA-32 Intel Architecture Software Developer’s Manual, Volume 3: System Programming Guide)

Available Frequencies

Duty Cycle Level   Binary   Decimal   Hexadecimal   Effective Frequency
 1                 10001B   17        11H             6.25%
 2                 10010B   18        12H            12.5%
 3                 10011B   19        13H            18.75%
 4                 10100B   20        14H            25%
 5                 10101B   21        15H            31.25%
 6                 10110B   22        16H            37.5%
 7                 10111B   23        17H            43.75%
 8                 11000B   24        18H            50%
 9                 11001B   25        19H            56.25%
10                 11010B   26        1AH            62.5%
11                 11011B   27        1BH            68.75%
12                 11100B   28        1CH            75%
13                 11101B   29        1DH            81.25%
14                 11110B   30        1EH            87.5%
15                 11111B   31        1FH            93.75%
16                 00000B    0        00H           100%

Power Capping and Measurement on Intel Architecture
- Based on RAPL (Running Average Power Limit)
- Power capping: write the MSR_PKG_RAPL_POWER_LIMIT MSR
- Power measurement: read the MSR_PKG_ENERGY_STATUS MSR (use RCRTool to avoid overflow)

Compiler Transformations
- Loop Tiling
- Loop Fusion

Experimental Setup
- Processor: Intel Xeon E5-2680 (dual-socket, 8-core processors with 20 MB cache, Sandy Bridge architecture)
- Linux kernel: 2.6.32
- Polyhedral compiler: PoCC v1.2
- Benchmarks: 2mm, gesummv, jacobi-2D, and fdtd-2D from PolyBench
- Compiler: Intel ICC v14.0.2 (-O3)

Compiler Transformations with CPU Clock Modulation
Figure (a): Execution time of transformed program versions under Duty Cycle Modulation settings of 16 (DC16) and 12 (DC12). x-axis: Program Variants; y-axis: Execution Time (Seconds); series: DC16-Time, DC12-Time.
Figure (b): Power consumption of transformed program versions under Duty Cycle Modulation settings of 16 (DC16) and 12 (DC12). x-axis: Program Variants; y-axis: Power (Watts); series: DC16-Power, DC12-Power.
Figure (c): Energy consumption of transformed program versions under Duty Cycle Modulation settings of 16 (DC16) and 12 (DC12). x-axis: Program Variants; y-axis: Energy (Joules); series: DC16-Energy, DC12-Energy.

Memory Access Density vs. Performance
Figure: The correlation between the Memory Access Density (MAD) metric and the performance slowdown of the jacobi-2D benchmark.
(a) The performance slowdown and the Memory Access Density of all compiler-transformed programs. x-axis: Program Variants; y-axes: Performance Slowdown, Memory Access Density (MAD).
(b) Scatter plot of performance slowdown over Memory Access Density, with the trend line. x-axis: Memory Access Density (MAD); y-axis: Performance Slowdown.
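Returning to the duty-cycle encoding on the Available Frequencies slide: the register values follow a simple pattern, sketched below. Levels 1 to 15 set the enable bit (bit 4) and place the level in the low four bits, and level 16 writes zero to turn modulation off. The function names are illustrative and not from the talk. `write_duty_cycle` is a hypothetical helper that assumes the Linux msr driver is loaded and the caller has root privilege; it is not the authors' added system call.

```python
import os
import struct

IA32_CLOCK_MODULATION = 0x19A   # MSR address from the CPU Clock Modulation slide
MODULATION_ENABLE = 1 << 4      # bit 4 turns on-demand clock modulation on


def duty_cycle_to_msr(level: int) -> int:
    """Map a duty cycle level (1..16) to the register value in the table."""
    if not 1 <= level <= 16:
        raise ValueError("duty cycle level must be in 1..16")
    if level == 16:
        return 0  # modulation disabled: the core runs at full speed
    return MODULATION_ENABLE | level


def effective_frequency_percent(level: int) -> float:
    """Each level is one sixteenth (6.25%) of the full clock rate."""
    if not 1 <= level <= 16:
        raise ValueError("duty cycle level must be in 1..16")
    return level * 100 / 16


def write_duty_cycle(cpu: int, level: int) -> None:
    """Hypothetical helper: write the value through the Linux msr driver.

    Assumes the msr kernel module is loaded and the caller is root.
    It overwrites the whole register; a read-modify-write would be safer.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        payload = struct.pack("<Q", duty_cycle_to_msr(level))  # 64-bit value
        os.pwrite(fd, payload, IA32_CLOCK_MODULATION)          # offset = MSR address
    finally:
        os.close(fd)
```

For example, `duty_cycle_to_msr(12)` yields `0x1C` and `effective_frequency_percent(12)` yields `75.0`, which is the DC12 setting used in the experiments.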
Performance Impact of Clock Modulation for Four Benchmarks
Figure: Execution time of program variants under DC16 and DC12. Panels: (a) jacobi-2D, (b) fdtd-2D, (c) 2mm, (d) gesummv. x-axis: Program Variants; y-axis: Execution Time (Seconds); series: DC16-Time, DC12-Time.

Power Impact of Clock Modulation for Four Benchmarks
Figure: Power consumption of program variants under DC16 and DC12. Panels: (a) jacobi-2D, (b) fdtd-2D, (c) 2mm, (d) gesummv. x-axis: Program Variants; y-axis: Power (Watts); series: DC16-Power, DC12-Power.

Energy Impact of Clock Modulation for Four Benchmarks
Figure: Energy consumption of program variants under DC16 and DC12. Panels: (a) jacobi-2D, (b) fdtd-2D, (c) 2mm, (d) gesummv. x-axis: Program Variants; y-axis: Energy (Joules); series: DC16-Energy, DC12-Energy.

Jacobi-2D with 60 Watts Power Cap

DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         136.65           5712.81           41.81
12         111.36           5103.09           45.82
14         102.49           4971.64           48.51
16         118.91           5602.35           47.12

Jacobi-2D without Power Cap

DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         136.50           5735.15           42.01
12         110.94           5132.13           46.26
14          98.12           4890.33           49.84
16          85.91           4776.52           55.60

Jacobi-2D (Fastest Version) with 60 Watts Power Cap

DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         21.31            1174.35           55.10
12         21.22            1173.67           55.31
14         20.92            1157.27           55.32
16         21.22            1168.4            55.07

Jacobi-2D (Fastest Version) without Power Cap

DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         5.44             647.45            118.96
12         5.41             649.21            120.06
14         5.35             750.03            140.31
16         5.18             902.20            174.02

Conclusion
- Compiler transformations are affected differently by power management techniques
- The fastest program variants were observed to consume more power
- With power capping, a non-optimal program variant can be sped up by reducing the frequency

Acknowledgment
1. U.S. Department of Defense: ATPER project
2. U.S. National Science Foundation: Award Number 1218734
3. U.S. Department of Energy: XPress project

Q&A
Thanks!
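
As a supplementary check on the Jacobi-2D tables, the sketch below recomputes average power as energy divided by time and finds the fastest duty cycle level with and without the 60 W cap. The time and energy values are copied from the slides; the dictionary layout and function names are illustrative assumptions, not part of the talk.

```python
# DC level -> (time in seconds, energy in joules), copied from the Jacobi-2D slides
CAPPED_60W = {
    10: (136.65, 5712.81),
    12: (111.36, 5103.09),
    14: (102.49, 4971.64),
    16: (118.91, 5602.35),
}
UNCAPPED = {
    10: (136.50, 5735.15),
    12: (110.94, 5132.13),
    14: (98.12, 4890.33),
    16: (85.91, 4776.52),
}


def average_power(time_s: float, energy_j: float) -> float:
    """Average power in watts is energy divided by elapsed time."""
    return energy_j / time_s


def fastest_level(table: dict) -> int:
    """Return the duty cycle level with the shortest execution time."""
    return min(table, key=lambda level: table[level][0])


def speedup(table: dict, baseline: int, level: int) -> float:
    """Ratio of baseline time to the time at the given level."""
    return table[baseline][0] / table[level][0]


if __name__ == "__main__":
    for name, table in (("60 W cap", CAPPED_60W), ("no cap", UNCAPPED)):
        best = fastest_level(table)
        print(f"{name}: fastest level is DC{best}")
        for level, (t, e) in sorted(table.items()):
            print(f"  DC{level}: {average_power(t, e):.2f} W")
    print(f"DC14 vs DC16 under the cap: {speedup(CAPPED_60W, 16, 14):.2f}x")
```

Under the 60 W cap the fastest setting is DC14, about 1.16x faster than DC16, while without the cap DC16 is fastest. This is the effect summarized in the last conclusion bullet.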