ARM CPU BENCHMARKING
Robert Reed | University of the Witwatersrand
30-May-23

INTRODUCTION
• What is an ARM processor?
  – Advanced RISC Machine
• Where is it being used?
  – 95% of the smartphone market
  – Consumer products
  – Supercomputers (K supercomputer, Japan: RISC)
• Why use them?
  – Power efficient
  – Low capital cost
[Figure: 2 quad cores in the Samsung S4]

INTRODUCTION
• How are we going to use ARMs?
  – High-throughput supercomputer
    • Large numbers of ARMs = many cores = parallel
[Figure: problem/solution diagram, ARMs vs x86; China's Tianhe-2]

BENCHMARK SELECTION
• Characterising the ARM architecture
  – The main factors to look at:
    • CPU
    • Cache
    • RAM
    • Connectivity

BENCHMARK SELECTION
• CoreMark by EEMBC
  – Supported by ARM Holdings
  – Uses common algorithms
  – Strict submission rules
[Figure: Flag Selection. Percentage improvement over no flags vs. flag combination (1 to 16), Cortex A7 and Cortex A9]
[Figure: Performance vs Iteration. Iterations/sec vs. iterations, Cortex A7 and Cortex A9; marker at 10 s]

RESULTS - COREMARK
[Figure: Normalised CoreMark. CoreMarks/Core/MHz for Cortex A7, Cortex A9, Cortex A9 DB, Atom N2800, Intel i7 2600, Intel i7 3930k]
[Figure: Normalised CoreMark/Watt. CoreMarks/Core/MHz/Watt for the same processors]
• 1x Intel i7 3930K: 1 x $630, 130 W, 6 cores
• 14x Cortex A9: 14 x $60 = $840, 56 W, 56 cores

BENCHMARK SELECTION
• High Performance LINPACK by Jack Dongarra
  – Used in TOP500 list
[Figure: Flag Selection. Percentage improvement over no flags vs. flag combination (1 to 16), Cortex A7 and Cortex A9]
[Figure: HPL Array Size. Floating point operations per second vs. array size (MB), Cortex A7 and Cortex A9]
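The HPL Array Size plot gives array sizes in MB, while HPL itself is configured with a matrix order N. The sketch below shows one way to convert between the two. It assumes the matrix holds double-precision values (8 bytes each) and that "MB" means 2**20 bytes; the function names are hypothetical and not part of HPL.

```python
import math

BYTES_PER_DOUBLE = 8       # assumption: HPL matrix stored in double precision
BYTES_PER_MB = 1024 ** 2   # assumption: "MB" on the slides means 2**20 bytes


def hpl_matrix_mb(n: int) -> float:
    """Memory taken by an n x n matrix of doubles, in MB."""
    return n * n * BYTES_PER_DOUBLE / BYTES_PER_MB


def hpl_problem_size(mem_mb: float, nb: int = 1) -> int:
    """Largest matrix order N (a multiple of nb) that fits in mem_mb."""
    n = math.isqrt(int(mem_mb * BYTES_PER_MB) // BYTES_PER_DOUBLE)
    return n - n % nb


# N = 14592 is the single-node problem size listed in the backup slides.
print(hpl_matrix_mb(14592))      # 1624.5, the "Total Problem MB" listed there
print(hpl_problem_size(1624.5))  # 14592
```

In practice N is usually chosen so that the matrix fills only part of RAM (a common rule of thumb is around 80%), leaving room for the operating system.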
UNDERSTANDING THE BENCHMARKS
• Math libraries
  – Linear algebra package: ATLAS (Automatically Tuned Linear Algebra Software)
• Improvement: almost 30%
• Compile time: approx. 22 hours
[Figure: Generic vs Tuned. MFLOPS for Cortex A7 and Cortex A9, generic and tuned builds]

RESULTS - HPL
Cortex A7
• 0.753 GFLOPS
• 0.264 GFLOPS/Watt
Cortex A9
• 2.382 GFLOPS
• 0.471 GFLOPS/Watt
Intel i7 3930k
• 125 GFLOPS
• 0.961 GFLOPS/Watt
[Figure: HPL Score. GFLOPS for Cortex A7, Cortex A9, Intel i7 3930k]
[Figure: HPL Score / Watt. GFLOPS/Watt for Cortex A7, Cortex A9, Intel i7 3930k]

RESULTS - HPL
Single vs Double Precision
[Figure: Single vs Double Precision. MFLOPS for Cortex A7 and Cortex A9]
Cortex A7
• 1.696 GFLOPS
• 0.595 GFLOPS/Watt
Cortex A9
• 4.428 GFLOPS
• 0.876 GFLOPS/Watt

CONCLUSION
• Great for high throughput
• Energy efficient
• Need better performance
  – GPU co-processing
• Problem specific

BACKUP SLIDES

LAYOUT
• Physical layout
[Figure: network diagram. Internet, server, switch; A7 Cubieboard, A9 Wandboard, A15 ODroid; multiplicity labels x5, x5, x1]

COREMARK FLAGS
Cortex A7 Flag Testing

Test     Floating Point      Neon AutoVec   Optimisation   -mfpu
0        (baseline, no flags)
1-4      -mfloat-abi=hard    -ffast-math    -O2            neon-vfpv4
5-8      -mfloat-abi=hard    -ffast-math    -O2            neon
9-12     -mfloat-abi=hard    -ffast-math    -O3            neon-vfpv4
13-16    -mfloat-abi=hard    -ffast-math    -O3            neon

-march / -mtune / -mcpu: within each group of four tests the listed values are armv7-a, armv7-a, cortex-a7, cortex-a7, cortex-a7.

COREMARK COMPARISONS
Legend: Our Results / Reported Results

Processor        Frequency (MHz)   Cores   Power (W)   CoreMark    CoreMark/Core   CoreMark/MHz   CoreMark/Core/MHz   CoreMark/Core/MHz/Watt
Cortex A7        1000              2       2.3         4859.97     2429.99         4.86           2.43                1.06
Cortex A9        996               4       3.65        11323.34    2830.84         11.37          2.84                0.78
Cortex A9 DB     996               4       3.65        10370.75    2592.69         10.41          2.60                0.71
Atom N2800       1860              2       6.5         12286.9     6143.45         6.61           3.30                0.51
Intel i7 2600    3392              4       95          99562.34    24890.585       29.35          7.34                0.08
Intel i7 3930k   3200              6       130         150962.39   25160.4         47.17          7.86                0.06

RESULTS - HPL
• Scalability
  – 4x A9
Raw results (rows: nodes; columns: cores per node)

Nodes   1 core    2 cores   3 cores   4 cores   Problem Size   Total Problem (MB)
1       509.4     1030.0    1497.0    1851.0    14592.0        1624.5
2       984.5     1915.0    2854.0    3595.0    20672.0        3260.3
3       1452.0    2905.0    4052.0    5239.0    25280.0        4875.8
4       1803.0    3908.0    5626.0    6437.0    29184.0        6498.0

RESULTS - HPL
• Block size

          A7    A9
Peak NB   80    144

UNDERSTANDING THE BENCHMARKS
• Changing block allocation NB
• NB test for the A9
• Dependent on array size
• Larger array = larger NB
• Dependent on whole system

RESULTS - HPL
• Multi precision
Total RAM: 1972

Cortex A9
                     ATLAS-vfpv3d16   ATLAS-neon   % Improved
Real Single          4.184            4.602        9.990%
Real Double          2.335            2.272        2.773%
Complex Single       4.363            4.835        10.818%
Complex Double       2.470            2.502        1.296%
Old Complex Double   2.519            2.55         1.231%

Cortex A7
                     ATLAS-vfpv4d16   ATLAS-neon   ATLAS-neonvfpv4   % Improved
Real Single          3.170            3.280        3.286             3.659%
Real Double          1.315            1.439        1.398             9.430%
Complex Single       3.374            3.113        3.213             8.384%
Complex Double       1.251            1.355        1.363             8.953%

RESULTS - HPL
• Power measurements

A9 (Voltage: 5 V, Resistor: 0.01 Ohms, Idle: 2.5 mV)
Cores   mVolts   Amps   Watts   Watts/core
1       4.80     0.48   2.40    2.40
2       6.60     0.66   3.30    1.65
3       8.40     0.84   4.20    1.40
4       10.10    1.01   5.05    1.26

A7 (Voltage: 5 V, Resistor: 0.01 Ohms, Idle: 2.3 mV)
Cores   mVolts   Amps   Watts   Watts/core
1       3.90     0.39   1.95    1.95
2       5.70     0.57   2.85    1.43
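The power figures above can be reproduced from the measured millivolt drop across the series resistor. The sketch below is one way to do the arithmetic, assuming the 5 V supply and 0.01 Ohm resistor listed on the slide; the function name `shunt_power` is hypothetical.

```python
def shunt_power(drop_mv, supply_v=5.0, shunt_ohm=0.01):
    """Return (current in A, power in W) from the voltage drop across a shunt resistor."""
    current_a = (drop_mv / 1000.0) / shunt_ohm  # Ohm's law: I = V_drop / R
    power_w = supply_v * current_a              # P = V_supply * I
    return current_a, power_w


# A9 readings from the slide: millivolt drop for each number of active cores
a9_mv = {1: 4.80, 2: 6.60, 3: 8.40, 4: 10.10}

for cores, mv in a9_mv.items():
    amps, watts = shunt_power(mv)
    print(f"{cores} cores: {amps:.2f} A, {watts:.2f} W, {watts / cores:.2f} W/core")
```

Dividing an HPL score by the wattage computed this way gives the GFLOPS/Watt figures; for example, 2.382 GFLOPS at 5.05 W is about 0.471 GFLOPS/Watt for the A9.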