
CPU Benchmarking of ARM Processors

ARM CPU BENCHMARKING
Robert Reed | University of the Witwatersrand
INTRODUCTION
• What is an ARM processor?
– Advanced RISC Machine
• Where is it being used?
– 95% of Smartphone Market
– Consumer Products
– Supercomputers
(K-Supercomputer Japan - RISC)
• Why use them?
– Power efficient
– Low capital cost
30-May-23
2 Quad Cores in the Samsung S4
INTRODUCTION
• How are we going to use ARMs?
– High-Throughput Supercomputer
• Large numbers of ARMs = many cores = parallel
[Figure labels: ARMs, x86, Solution, Problem; image: China’s Tianhe-2]
BENCHMARK SELECTION
• Characterising the ARM architecture
– The main factors to look at:
• CPU
• Cache
• RAM
• Connectivity
BENCHMARK SELECTION
• CoreMark by EEMBC
– Supported by ARM Holdings
– Uses common algorithms
– Strict submission rules
[Figure: Performance vs Iteration. Iterations/sec against Iterations for Cortex A7 and Cortex A9, with a 10 s marker]
[Figure: Flag Selection. Percentage improvement over no flags against 16 different flag combinations for Cortex A7 and Cortex A9]
RESULTS - COREMARK
[Figure: Normalised CoreMark. CoreMarks/Core/MHz for Cortex A7, Cortex A9, Cortex A9 DB, Atom N2800, Intel i7 2600 and Intel i7 3930k]
[Figure: Normalised CoreMark/Watt. CoreMarks/Core/MHz/Watt for the same processors]
1x Intel i7 3930K
1 X $630
130 W
6 Cores
14x Cortex A9
14 X $60 = $840
56 W
56 Cores
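The comparison above is simple arithmetic, and a short sketch makes the trade-off explicit. The cost, power and core figures are the ones on this slide; the per-unit CoreMark scores are taken from the COREMARK COMPARISONS backup slide. The `Option` class and its field names are illustrative only, and the aggregate score assumes throughput adds up perfectly across 14 independent boards, which a real cluster would not quite reach.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    units: int
    unit_cost_usd: float
    total_power_w: float      # total power as quoted on the slide
    total_cores: int          # total cores as quoted on the slide
    coremark_per_unit: float  # CoreMark score of one unit (backup slide)

    @property
    def total_cost_usd(self) -> float:
        return self.units * self.unit_cost_usd

    @property
    def aggregate_coremark(self) -> float:
        # Assumption: scores add linearly across independent units.
        return self.units * self.coremark_per_unit

    @property
    def coremark_per_watt(self) -> float:
        return self.aggregate_coremark / self.total_power_w

    @property
    def coremark_per_dollar(self) -> float:
        return self.aggregate_coremark / self.total_cost_usd


i7 = Option("Intel i7 3930K", 1, 630.0, 130.0, 6, 150962.39)
a9 = Option("Cortex A9", 14, 60.0, 56.0, 56, 11323.34)

for opt in (i7, a9):
    print(
        f"{opt.name}: ${opt.total_cost_usd:.0f}, {opt.total_power_w:.0f} W, "
        f"{opt.total_cores} cores, {opt.aggregate_coremark:.0f} CoreMark, "
        f"{opt.coremark_per_watt:.0f} CoreMark/W"
    )
```

Under these assumptions the 14 boards land at a similar aggregate score to the single i7 while drawing less than half the power, at a somewhat higher purchase cost.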
BENCHMARK SELECTION
• High Performance LINPACK by Jack Dongarra
– Used in TOP500 list
[Figure: Flag Selection. Percentage improvement over no flags against 16 different flag combinations for Cortex A7 and Cortex A9]
[Figure: HPL Array Size. Floating point operations per second against array size (MB), from 5 MB to 1610 MB, for Cortex A7 and Cortex A9]
UNDERSTANDING THE BENCHMARKS
• Math Libraries
– Automatically Tuned Linear Algebra Software (ATLAS)
• Improvement: Almost 30%
• Compile time: Approx 22 hours
[Figure: Generic vs Tuned. MFLOPS for the generic and tuned builds on Cortex A7 and Cortex A9]
RESULTS - HPL
Cortex A7
• 0.753 GFLOPS
• 0.264 GFLOPS/Watt
Cortex A9
• 2.382 GFLOPS
• 0.471 GFLOPS/Watt
Intel i7 3930k
• 125 GFLOPS
• 0.961 GFLOPS/Watt
[Figure: HPL Score. GFLOPS for Cortex A7, Cortex A9 and Intel i7 3930k]
[Figure: HPL Score / Watt. GFLOPS/Watt for Cortex A7, Cortex A9 and Intel i7 3930k]
RESULTS - HPL
Single vs Double Precision
Cortex A7
• 1.696 GFLOPS
• 0.595 GFLOPS/Watt
Cortex A9
• 4.428 GFLOPS
• 0.876 GFLOPS/Watt
[Figure: Single vs Double Precision. MFLOPS for Cortex A7 and Cortex A9]
CONCLUSION
• Great for high throughput
• Energy Efficient
• Need better performance
– GPU co-processing
• Problem Specific
ARM CPU BENCHMARKING
BACKUP SLIDES
LAYOUT
• Physical Layout
– A7: Cubieboard
– A9: Wandboard
– A15: ODroid
[Figure: network layout with labels INTERNET, Server, Switch, X5, X5, X1]
COREMARK FLAGS
Cortex A7 Flag Testing
Tests are numbered 0 to 16; each column has 16 entries.
Column          Flag               Entries
Floating Point  -mfloat-abi=hard   1-16
Neon AutoVec    -ffast-math        1-16
Optimisation    -O2                1-8
Optimisation    -O3                9-16
-mfpu           neon-vfpv4         1-4, 9-12
-mfpu           neon               5-8, 13-16
-march          armv7-a            varied between tests
-mtune          cortex-a7          varied between tests
-mcpu           cortex-a7          varied between tests
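A set of 16 flag combinations like the one on this slide can be generated rather than written out by hand. The sketch below is one way to do it, not the script used for these results. The fixed flags, the two optimisation levels and the two `-mfpu` values come from the slide; the four target-selection variants in `ARCH` are an assumption, because the slide text does not preserve which tests used `-march`, `-mtune` or `-mcpu`.

```python
from itertools import product

# Flags applied to every test on the slide.
BASE = ["-mfloat-abi=hard", "-ffast-math"]

# Two optimisation levels and two FPU settings, as listed on the slide.
OPT = ["-O2", "-O3"]
FPU = ["-mfpu=neon-vfpv4", "-mfpu=neon"]

# Assumption: four target-selection variants per group of four tests.
# The exact per-test assignment is not recoverable from the slide.
ARCH = [
    [],
    ["-march=armv7-a"],
    ["-march=armv7-a", "-mtune=cortex-a7"],
    ["-mcpu=cortex-a7"],
]


def flag_combinations():
    """Return the 2 x 2 x 4 = 16 flag lists, optimisation level varying slowest."""
    return [BASE + [opt, fpu] + arch for opt, fpu, arch in product(OPT, FPU, ARCH)]


if __name__ == "__main__":
    for number, flags in enumerate(flag_combinations(), start=1):
        print(number, " ".join(flags))
```

Each printed line can be passed to the compiler as the flag string for one CoreMark build, with an extra build using no flags as the baseline for the percentage comparison.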
COREMARK COMPARISONS
Our Results / Reported Results
Processor       Frequency (MHz)  Cores  Power (Watts)  CoreMark   CoreMark/Core  CoreMark/MHz  CoreMark/Core/MHz  CoreMark/Core/MHz/Watt
Cortex A7       1000             2      2.3            4859.97    2429.99        4.86          2.43               1.06
Cortex A9       996              4      3.65           11323.34   2830.84        11.37         2.84               0.78
Cortex A9 DB    996              4      3.65           10370.75   2592.69        10.41         2.60               0.71
Atom N2800      1860             2      6.5            12286.9    6143.45        6.61          3.30               0.51
Intel i7 2600   3392             4      95             99562.34   24890.585      29.35         7.34               0.08
Intel i7 3930k  3200             6      130            150962.39  25160.4        47.17         7.86               0.06
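The derived columns in this table follow from the raw score, clock frequency, core count and power. A minimal sketch of that normalisation is shown below; the function name `normalise` and the dictionary keys are illustrative, not part of CoreMark itself.

```python
def normalise(coremark, freq_mhz, cores, watts):
    """Derive the normalised CoreMark columns from one raw result."""
    per_core = coremark / cores
    per_mhz = coremark / freq_mhz
    per_core_mhz = per_core / freq_mhz
    per_core_mhz_watt = per_core_mhz / watts
    return {
        "per_core": per_core,
        "per_mhz": per_mhz,
        "per_core_mhz": per_core_mhz,
        "per_core_mhz_watt": per_core_mhz_watt,
    }


# Raw values for two rows of the table: (CoreMark, MHz, cores, Watts).
rows = {
    "Cortex A7": (4859.97, 1000, 2, 2.3),
    "Cortex A9": (11323.34, 996, 4, 3.65),
}

for name, raw in rows.items():
    result = normalise(*raw)
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Dividing by both core count and clock frequency removes the advantage of simply having more or faster cores, so the last two columns compare architectural efficiency rather than raw throughput.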
RESULTS - HPL
• Scalability – 4x A9
Raw results (rows: Nodes, columns: Cores)
Nodes  1 core   2 cores  3 cores  4 cores  Problem Size  Total Problem MB
1      509.4    1030.0   1497.0   1851.0   14592.0       1624.5
2      984.5    1915.0   2854.0   3595.0   20672.0       3260.3
3      1452.0   2905.0   4052.0   5239.0   25280.0       4875.8
4      1803.0   3908.0   5626.0   6437.0   29184.0       6498.0
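Two things can be read off these raw results: how the problem size maps to memory, and how well performance scales. The sketch below assumes rows are nodes and columns are cores per node, that the matrix holds 8-byte double-precision values, and that all scores share one unit. The efficiency figure compares each result with the single-core score multiplied by the total core count; since the problem size grows with the node count, it is only a rough indicator.

```python
# Raw HPL results: RESULTS[nodes][cores_per_node - 1]
RESULTS = {
    1: [509.4, 1030.0, 1497.0, 1851.0],
    2: [984.5, 1915.0, 2854.0, 3595.0],
    3: [1452.0, 2905.0, 4052.0, 5239.0],
    4: [1803.0, 3908.0, 5626.0, 6437.0],
}

# HPL problem size N used for each node count.
PROBLEM_SIZE = {1: 14592, 2: 20672, 3: 25280, 4: 29184}


def matrix_mib(n):
    """Memory for an N x N matrix of 8-byte doubles, in MiB."""
    return n * n * 8 / 2**20


def efficiency(nodes, cores):
    """Score relative to perfect linear scaling from one core on one node."""
    baseline = RESULTS[1][0]
    return RESULTS[nodes][cores - 1] / (baseline * nodes * cores)


if __name__ == "__main__":
    for nodes, n in PROBLEM_SIZE.items():
        print(f"{nodes} node(s): N={n}, {matrix_mib(n):.1f} MiB, "
              f"efficiency at 4 cores/node = {efficiency(nodes, 4):.2f}")
```

The computed matrix sizes reproduce the Total Problem MB values, which suggests each problem was sized to fill most of the RAM available on the nodes in use.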
RESULTS - HPL
• Block Size
Peak NB
– A7: 80
– A9: 144
UNDERSTANDING THE BENCHMARKS
• Changing Block Allocation NB
• NB test for the A9
• Dependent on array size
• Larger array = larger NB
• Dependent on whole system
RESULTS - HPL
• Multi Precision
Total RAM: 1972

Cortex A9
                    ATLAS-vfpv3d16  ATLAS-neon  % Improved
Real Single         4.184           4.602       9.990%
Real Double         2.335           2.272       2.773%
Complex Single      4.363           4.835       10.818%
Complex Double      2.470           2.502       1.296%
Old Complex Double  2.519           2.55        1.231%

Cortex A7
                    ATLAS-vfpv4d16  ATLAS-neon  ATLAS-neonvfpv4  % Improved
Real Single         3.170           3.280       3.286            3.659%
Real Double         1.315           1.439       1.398            9.430%
Complex Single      3.374           3.113       3.213            8.384%
Complex Double      1.251           1.355       1.363            8.953%
RESULTS - HPL
• Power Measurements
A9
Voltage: 5 V, Resistor: 0.01 Ohms, IDLE: 2.5 mV
Cores  mVolts  Amps  Watts  Watts/core
1      4.80    0.48  2.40   2.40
2      6.60    0.66  3.30   1.65
3      8.40    0.84  4.20   1.40
4      10.10   1.01  5.05   1.26

A7
Voltage: 5 V, Resistor: 0.01 Ohms, IDLE: 2.3 mV
Cores  mVolts  Amps  Watts  Watts/core
1      3.90    0.39  1.95   1.95
2      5.70    0.57  2.85   1.43
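The figures in these tables are consistent with a shunt-resistor measurement: the millivolt drop across the 0.01 Ohm resistor gives the current, and multiplying by the 5 V supply gives the power. The sketch below reproduces that arithmetic under this assumption; the function name `shunt_power` is illustrative.

```python
def shunt_power(supply_v, shunt_ohms, shunt_mv):
    """Return (amps, watts) from the voltage drop measured across a shunt resistor."""
    amps = (shunt_mv / 1000.0) / shunt_ohms  # Ohm's law: I = V / R
    watts = supply_v * amps                  # P = V * I at the supply voltage
    return amps, watts


# A9 readings from the slide: cores in use -> millivolts across the resistor.
a9_readings_mv = {1: 4.80, 2: 6.60, 3: 8.40, 4: 10.10}

for cores, millivolts in a9_readings_mv.items():
    amps, watts = shunt_power(5.0, 0.01, millivolts)
    print(f"{cores} core(s): {amps:.2f} A, {watts:.2f} W, {watts / cores:.2f} W/core")
```

The four-core A9 power computed this way (5.05 W) is the divisor behind the A9 GFLOPS/Watt figures on the HPL results slides, for example 2.382 GFLOPS / 5.05 W, which is about 0.47.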