5th International Workshop on Power-aware Algorithms, Systems,
and Architectures (ICPP PASA 2016)
Compiler Transformations Meet
CPU Clock Modulation and Power Capping
Wei Wang, University of Delaware
Allan Porterfield, RENaissance Computing Institute
John Cavazos, University of Delaware
Sridutt Bhalachandra, University of North Carolina at Chapel Hill
August 16, 2016
Motivation
Bounded power consumption to achieve exascale computing
Various power management techniques exist for controlling application energy
Impact of power management techniques on compiler transformations
2 / 21
CPU Clock Modulation
Write a specific value to the IA32_CLOCK_MODULATION MSR (0x19A)
Modify /dev/cpu/{0..15}/msr with root privilege
Invoke wrmsr inline assembly from applications through an added system call
Figure: CPU clock modulation. Sample modulation with a 25% duty cycle. (Source: IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide)
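The first route above (writing through the msr device file) can be sketched as follows. This is a minimal illustration, not the code used for the experiments: it assumes the standard Linux msr driver, where the file offset selects the register, and the helper names `write_msr` and `read_msr` are hypothetical.

```python
import os
import struct

IA32_CLOCK_MODULATION = 0x19A  # MSR address given on the slide


def write_msr(msr_path, register, value):
    """Write a 64-bit value to one MSR through an msr device file.

    Assumption: msr_path is a Linux msr device such as /dev/cpu/0/msr,
    where the file offset is interpreted as the register address.
    """
    fd = os.open(msr_path, os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), register)
    finally:
        os.close(fd)


def read_msr(msr_path, register):
    """Read a 64-bit value back from one MSR."""
    fd = os.open(msr_path, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, register))[0]
    finally:
        os.close(fd)
```

On a real machine the call would look like `write_msr("/dev/cpu/0/msr", IA32_CLOCK_MODULATION, 0x1C)`, repeated for each logical CPU. It needs root privilege and a loaded msr kernel module.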
3 / 21
Available Frequencies
Duty Cycle Level   Binary   Decimal   Hexadecimal   Effective Frequency
1                  10001B   17        11H           6.25%
2                  10010B   18        12H           12.5%
3                  10011B   19        13H           18.75%
4                  10100B   20        14H           25%
5                  10101B   21        15H           31.25%
6                  10110B   22        16H           37.5%
7                  10111B   23        17H           43.75%
8                  11000B   24        18H           50%
9                  11001B   25        19H           56.25%
10                 11010B   26        1AH           62.5%
11                 11011B   27        1BH           68.75%
12                 11100B   28        1CH           75%
13                 11101B   29        1DH           81.25%
14                 11110B   30        1EH           87.5%
15                 11111B   31        1FH           93.75%
16                 00000B   0         00H           100%
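The pattern in the table can be expressed as a small encoding rule. The sketch below assumes what the table suggests: bit 4 acts as the enable flag, the low four bits hold the duty cycle level, and level 16 (no modulation) is written as 0. The function names are hypothetical.

```python
def duty_cycle_to_msr(level):
    """Return the register value for a duty cycle level from 1 to 16."""
    if not 1 <= level <= 16:
        raise ValueError("duty cycle level must be between 1 and 16")
    if level == 16:
        return 0x00  # modulation disabled, full frequency
    return 0x10 | level  # enable bit (bit 4) plus the level in bits 3:0


def effective_frequency_percent(level):
    """Effective frequency as a percentage, in steps of 1/16 (6.25%)."""
    if not 1 <= level <= 16:
        raise ValueError("duty cycle level must be between 1 and 16")
    return level * 100 / 16


# Regenerate the table rows for inspection.
for lvl in range(1, 17):
    value = duty_cycle_to_msr(lvl)
    print(lvl, format(value, "05b") + "B", value,
          format(value, "02X") + "H", effective_frequency_percent(lvl))
```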
4 / 21
Power Capping and Measurement on Intel Architecture
Based on RAPL (Running Average Power Limit)
Power Capping - Write the MSR_PKG_RAPL_POWER_LIMIT MSR
Power Measurement - Read the MSR_PKG_ENERGY_STATUS MSR (use RCRTool to avoid overflow)
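The overflow concern comes from the energy status register being a wrapping counter. The sketch below shows one common way to handle a single wraparound between two samples. It is not RCRTool's implementation; it assumes a 32-bit counter and an energy unit of 1/2^ESU joules, with ESU taken from bits 12:8 of MSR_RAPL_POWER_UNIT as described in the Intel manual. The function names are hypothetical.

```python
ENERGY_COUNTER_BITS = 32  # assumed width of the energy status counter


def energy_unit_joules(power_unit_msr_value):
    """Decode the energy unit (joules per counter tick).

    Assumption: bits 12:8 of MSR_RAPL_POWER_UNIT hold ESU, and one
    tick equals 1 / 2**ESU joules.
    """
    esu = (power_unit_msr_value >> 8) & 0x1F
    return 1.0 / (1 << esu)


def energy_delta_joules(prev_raw, curr_raw, unit_joules):
    """Energy consumed between two raw counter samples.

    The modulo handles at most one wraparound, so samples must be
    taken more often than the counter can wrap.
    """
    ticks = (curr_raw - prev_raw) % (1 << ENERGY_COUNTER_BITS)
    return ticks * unit_joules
```

Average power over an interval is then the energy delta divided by the elapsed wall-clock time between the two samples.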
5 / 21
Compiler Transformations
Loop Tiling
Loop Fusion
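For readers unfamiliar with the two transformations, here is a minimal sketch of each. These are illustrative kernels written by hand, not PoCC output; the function names and the tile size are hypothetical.

```python
def scale_add_unfused(a, b):
    """Two separate loops: the temporary array is traversed twice."""
    n = len(a)
    tmp = [0.0] * n
    for i in range(n):
        tmp[i] = 2.0 * a[i]
    out = [0.0] * n
    for i in range(n):
        out[i] = tmp[i] + b[i]
    return out


def scale_add_fused(a, b):
    """Loop fusion: both statements share one loop, no temporary array."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        out[i] = 2.0 * a[i] + b[i]
    return out


def matmul(a, b, n):
    """Reference n x n matrix multiplication."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=2):
    """Loop tiling: iterate over tile x tile x tile blocks so that each
    block's working set can stay in cache."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

Both transformed versions compute the same results as the originals; only the iteration order and memory traffic change, which is why different variants of one benchmark can differ in time, power, and energy.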
6 / 21
Experimental Setup
Processor: Intel Xeon E5-2680 (dual-socket, 8 cores per socket, 20 MB cache, Sandy Bridge architecture)
Linux Kernel: 2.6.32
Polyhedral Compiler: PoCC v1.2
Benchmarks: 2mm, gesummv, jacobi-2D, and fdtd-2D from PolyBench
Compiler: Intel ICC v14.0.2 (-O3)
7 / 21
Compiler Transformations with CPU Clock Modulation
Figure: (a) Execution time (seconds) of the transformed program versions under Duty Cycle Modulation settings of 16 (DC16) and 12 (DC12). X-axis: program variants; series: DC16-Time and DC12-Time.
8 / 21
Compiler Transformations with CPU Clock Modulation
Figure: (b) Power consumption (watts) of the transformed program versions under Duty Cycle Modulation settings of 16 (DC16) and 12 (DC12). X-axis: program variants; series: DC16-Power and DC12-Power.
9 / 21
Compiler Transformations with CPU Clock Modulation
Figure: (c) Energy consumption (joules) of the transformed program versions under Duty Cycle Modulation settings of 16 (DC16) and 12 (DC12). X-axis: program variants; series: DC16-Energy and DC12-Energy.
10 / 21
Memory Access Density vs. Performance
Figure: The correlation between the Memory Access Density (MAD) metric and the performance slowdown of the jacobi-2D benchmark. (a) The performance slowdown and the Memory Access Density of all compiler-transformed programs. (b) Scatter plot of performance slowdown over Memory Access Density, with the trend line.
11 / 21
Performance Impact of Clock Modulation for Four Benchmarks
Figure: Execution time (seconds) of the program variants under DC16 and DC12 (series: DC16-Time and DC12-Time) for (a) jacobi-2D, (b) fdtd-2D, (c) 2mm, and (d) gesummv.
12 / 21
Power Impact of Clock Modulation for Four Benchmarks
Figure: Power consumption (watts) of the program variants under DC16 and DC12 (series: DC16-Power and DC12-Power) for (a) jacobi-2D, (b) fdtd-2D, (c) 2mm, and (d) gesummv.
13 / 21
Energy Impact of Clock Modulation for Four Benchmarks
Figure: Energy consumption (joules) of the program variants under DC16 and DC12 (series: DC16-Energy and DC12-Energy) for (a) jacobi-2D, (b) fdtd-2D, (c) 2mm, and (d) gesummv.
14 / 21
Jacobi-2D with 60 Watts Power Cap
DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         136.65           5712.81           41.81
12         111.36           5103.09           45.82
14         102.49           4971.64           48.51
16         118.91           5602.35           47.12
15 / 21
Jacobi-2D without Power Cap
DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         136.50           5735.15           42.01
12         110.94           5132.13           46.26
14         98.12            4890.33           49.84
16         85.91            4776.52           55.60
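The numbers in the capped and uncapped Jacobi-2D tables can be checked with a few lines of arithmetic. The sketch below copies the values from the two tables; the helper names are hypothetical, and the consistency check simply assumes energy is roughly time multiplied by average power.

```python
# (time in seconds, energy in joules, power in watts) per duty cycle level
CAPPED_60W = {
    10: (136.65, 5712.81, 41.81),
    12: (111.36, 5103.09, 45.82),
    14: (102.49, 4971.64, 48.51),
    16: (118.91, 5602.35, 47.12),
}
UNCAPPED = {
    10: (136.50, 5735.15, 42.01),
    12: (110.94, 5132.13, 46.26),
    14: (98.12, 4890.33, 49.84),
    16: (85.91, 4776.52, 55.60),
}


def fastest_level(table):
    """Duty cycle level with the shortest execution time."""
    return min(table, key=lambda level: table[level][0])


def speedup(table, slow_level, fast_level):
    """How many times faster fast_level runs than slow_level."""
    return table[slow_level][0] / table[fast_level][0]


def max_energy_mismatch(table):
    """Largest relative gap between reported energy and time * power."""
    return max(abs(t * p - e) / e for t, e, p in table.values())


print(fastest_level(UNCAPPED), fastest_level(CAPPED_60W))
print(round(speedup(CAPPED_60W, 16, 14), 2))
```

Without the cap the full frequency (DC16) is fastest, while under the 60 W cap DC14 is fastest, running about 1.16 times faster than DC16.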
16 / 21
Jacobi-2D (Fastest Version) with 60 Watts Power Cap
DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         21.31            1174.35           55.10
12         21.22            1173.67           55.31
14         20.92            1157.27           55.32
16         21.22            1168.40           55.07
17 / 21
Jacobi-2D (Fastest Version) without Power Cap
DC Level   Time (seconds)   Energy (Joules)   Power (Watts)
10         5.44             647.45            118.96
12         5.41             649.21            120.06
14         5.35             750.03            140.31
16         5.18             902.20            174.02
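The time and energy trade-off for the fastest Jacobi-2D version can be quantified directly from the capped and uncapped tables. The sketch below copies their values; `relative_change` is a hypothetical helper.

```python
# (time in seconds, energy in joules, power in watts) per duty cycle level
FASTEST_CAPPED_60W = {
    10: (21.31, 1174.35, 55.10),
    12: (21.22, 1173.67, 55.31),
    14: (20.92, 1157.27, 55.32),
    16: (21.22, 1168.40, 55.07),
}
FASTEST_UNCAPPED = {
    10: (5.44, 647.45, 118.96),
    12: (5.41, 649.21, 120.06),
    14: (5.35, 750.03, 140.31),
    16: (5.18, 902.20, 174.02),
}


def relative_change(new, old):
    """Fractional change from old to new (0.05 means 5% more)."""
    return (new - old) / old


# Uncapped: cost and benefit of dropping from DC16 to DC10.
time_change = relative_change(FASTEST_UNCAPPED[10][0], FASTEST_UNCAPPED[16][0])
energy_change = relative_change(FASTEST_UNCAPPED[10][1], FASTEST_UNCAPPED[16][1])

# Slowdown caused by the 60 W cap at full frequency (DC16).
cap_slowdown = FASTEST_CAPPED_60W[16][0] / FASTEST_UNCAPPED[16][0]

print(round(time_change, 3), round(energy_change, 3), round(cap_slowdown, 2))
```

Without the cap, DC10 takes about 5% longer than DC16 but uses about 28% less energy. Under the 60 W cap the same version runs roughly four times slower at every duty cycle level.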
18 / 21
Conclusion
Compiler transformations are affected differently by power management techniques
The fastest program variants were observed to consume more power
With power capping, a non-optimal program variant was sped up by reducing the frequency
19 / 21
Acknowledgment
1. U.S. Department of Defense: ATPER project
2. U.S. National Science Foundation: Award Number 1218734
3. U.S. Department of Energy: XPress project
20 / 21
Q&A
Thanks!
21 / 21