Programming Multi-Core Processor Based Embedded Systems
A Hands-On Experience on Cavium Octeon Based Platforms
Lab Exercises

Lab # 1: Parallel Programming and Performance Measurement using MPAC

Lab 1 - Goals

Objective
- Use MPAC benchmarks to measure the performance of different subsystems of multi-core based systems.
- Understand accurate measurement methods for multi-core based systems.
- Use MPAC to learn to develop parallel programs.

Mechanism
- The MPAC CPU and memory benchmarks exercise the processor and the memory unit by generating compute-intensive and memory-intensive workloads.

Observations
- Observe the throughput with an increasing number of threads for compute-intensive and memory-intensive workloads.
- Identify performance bottlenecks.

MPAC Fork and Join Infrastructure
- In MPAC based applications, initialization and argument handling are performed by the main thread.
- The tasks to be run in parallel are forked to worker threads.
- The worker threads join after completing their tasks.
- Final processing is done by the main thread.
- A minimal sketch of this fork-and-join pattern is given at the end of this part, after the memory benchmark results.

MPAC Code Structure

[Figure: MPAC code structure]

Compile and Run on Host System (Memory Benchmark)

host$ cd /<path-to-mpac>/mpac_1.2
host$ ./configure
host$ make clean
host$ make
host$ cd benchmarks/mem
host$ ./mpac_mem_bm -n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>

For help:
host$ ./mpac_mem_bm -h

Cross Compile for Target System (Memory Benchmark)

Go to the Cavium SDK directory and run:
host$ source env-setup <OCTEON-MODEL>
(where <OCTEON-MODEL> is the model of your target board, e.g. OCTEON_CN56XX)
host$ cd /<path-to-mpac>/mpac_1.2
host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu
host$ export CC=mips64-octeon-linux-gnu-gcc
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc

Run on Target System (Memory Benchmark)

Copy the executable "mpac_mem_bm" to the target system, then run:
target$ ./mpac_mem_bm -n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>

For help:
target$ ./mpac_mem_bm -h

Performance Measurements (Memory)

[Figure: Throughput (Mbps) vs. number of threads (1-12) for four working-set sizes, measured on a Cavium Networks EVB 5610 board:
  (a) 4 KB:   ./mpac_mem_bm -n <# of Threads> -r 100000 -s 512 -t i
  (b) 16 KB:  ./mpac_mem_bm -n <# of Threads> -r 10000 -s 2048 -t i
  (c) 1 MB:   ./mpac_mem_bm -n <# of Threads> -r 100 -s 131072 -t i
  (d) 16 MB:  ./mpac_mem_bm -n <# of Threads> -r 10 -s 2097152 -t i]

- Data sizes of 4 KB, 16 KB, 1 MB and 16 MB are used to exercise the L1 cache, the L2 cache and the main memory of the target system.
- A Cavium Octeon (MIPS64) CN5610 evaluation board is used as the System Under Test (SUT).
- With data sets larger than the L2 cache (2 MB), throughput would normally not be expected to scale linearly; the linearity observed here comes from the low-latency interconnect used in place of a conventional system bus.
- As a result, throughput scales linearly with the number of threads in all four cases.
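The fork-and-join structure described above can be illustrated with a plain POSIX-threads program. This is only a minimal sketch of the pattern, assuming pthreads; it does not use MPAC's actual API, and the worker() function and command-line handling are illustrative placeholders.

/* Minimal fork-and-join sketch (pthreads, not the MPAC API).
 * Main thread: initialization and argument handling.
 * Worker threads: run the parallel task, then join.
 * Main thread: final processing after the join. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld: doing its share of the parallel task\n", id);
    return NULL;
}

int main(int argc, char *argv[])
{
    long nthreads = (argc > 1) ? atol(argv[1]) : 1;      /* argument handling */
    pthread_t *tid = malloc(nthreads * sizeof(*tid));

    for (long i = 0; i < nthreads; i++)                  /* fork */
        pthread_create(&tid[i], NULL, worker, (void *)i);

    for (long i = 0; i < nthreads; i++)                  /* join */
        pthread_join(tid[i], NULL);

    printf("main: all %ld workers joined\n", nthreads);  /* final processing */
    free(tid);
    return 0;
}

Compile with gcc -pthread. MPAC-based applications follow the same overall flow, with MPAC providing the thread management shown here explicitly.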
Compile and Run on Host System (CPU Benchmark)

host$ cd /<path-to-mpac>/mpac_1.2
host$ ./configure
host$ make clean
host$ make
host$ cd benchmarks/cpu
host$ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>

For help:
host$ ./mpac_cpu_bm -h

Cross Compile for Target System (CPU Benchmark)

Go to the Cavium SDK directory and run:
host$ source env-setup <OCTEON-MODEL>
(where <OCTEON-MODEL> is the model of your target board, e.g. OCTEON_CN56XX)
host$ cd /<path-to-mpac>/mpac_1.2
host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu
host$ export CC=mips64-octeon-linux-gnu-gcc
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc

Run on Target System (CPU Benchmark)

Copy the executable "mpac_cpu_bm" to the target system, then run:
target$ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>

For help:
target$ ./mpac_cpu_bm -h

Performance Measurements (CPU)

[Figure: Throughput (MOPS) vs. number of threads (1-12), measured on a Cavium Networks EVB 5610 board:
  (a) Integer unit operation:  ./mpac_cpu_bm -n <# of Threads> -r 10000000 -a 1 -u i
  (b) Logical unit operation:  ./mpac_cpu_bm -n <# of Threads> -r 10000000 -a 1 -u l]

- The integer unit (summation) and the logical unit (string operation) of the processor are exercised.
- A Cavium Octeon (MIPS64) CN5610 evaluation board is used as the System Under Test (SUT).
- Throughput scales linearly with the number of threads in both cases.

Measurement of Execution Time
- Measuring the elapsed time from the start of a task to its completion is straightforward for a sequential task.
- The procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores.
- There is no guarantee that all tasks start at the same time or complete at the same time.
- The measurement is therefore imprecise due to the concurrent nature of the tasks.

- Execution time can be measured either globally or locally.
- For global measurement, the execution time is the difference of the time stamps taken at the global fork and join instants.
- For local measurement, each of the n threads measures and records its own time. After the threads join, the maximum of these individual execution times provides an estimate of the overall execution time.

Definitions
- LETE: Local Execution Time Estimation
- GETE: Global Execution Time Estimation

[Figure: timing diagram. For the sequential case, a single procedure() is bracketed by t1 = start time and t2 = stop time, so t = t2 - t1. For the concurrent case, each thread i runs procedure() bracketed by its local time stamps t_i1 = start time and t_i2 = stop time, while tg1 and tg2 are the global start and stop time stamps taken at the fork and join instants.]

GETE:  t_g = t_g2 - t_g1
LETE:  t_l = max over 1 <= i <= n of (t_i2 - t_i1)

The Problem
- Lack of precision: some tasks finish before others.
- Synchronization issues with a large number of cores.
- Results are not repeatable.

Performance Measurement Methodologies
- Sequential case: get the start time, repeat the workload for N iterations, then get the end time.
- Multithreaded case (threads 1, 2, 3, ..., K): each thread gets its start time at a barrier, repeats the workload for N iterations, and gets its end time at a barrier.

Accurate LETE Measurement Methodology
- Threads (1, 2, 3, ..., K) synchronize before each round using a barrier.
- Repeat for N rounds.
- The maximum elapsed time among the threads is taken as the execution time of the round.
- A minimal sketch of this barrier-based measurement appears below.
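The barrier-based round measurement described above can be sketched with POSIX threads and clock_gettime(). This is a minimal illustration of a single round, assuming pthreads rather than MPAC's own measurement code; workload(), NTHREADS and NITERS are placeholders.

/* One barrier-synchronized measurement round (pthreads, not the MPAC API).
 * Each thread takes its local start/stop time stamps t_i1 and t_i2;
 * the maximum local elapsed time is the LETE of the round. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define NITERS   1000

static pthread_barrier_t barrier;
static double elapsed[NTHREADS];          /* local elapsed time per thread */

static void workload(void) { /* placeholder for the task under test */ }

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *measure(void *arg)
{
    long id = (long)arg;

    pthread_barrier_wait(&barrier);       /* synchronize before the round */
    double start = now_sec();             /* t_i1 */
    for (int i = 0; i < NITERS; i++)
        workload();
    double stop = now_sec();              /* t_i2 */
    elapsed[id] = stop - start;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, measure, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    double lete = elapsed[0];             /* LETE = max of local times */
    for (int i = 1; i < NTHREADS; i++)
        if (elapsed[i] > lete)
            lete = elapsed[i];
    printf("LETE for this round: %f s\n", lete);

    pthread_barrier_destroy(&barrier);
    return 0;
}

Compile with gcc -pthread (older toolchains may also need -lrt for clock_gettime). Repeating this round N times and recording each round's maximum gives the inputs to the MINMAX selection described next.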
Measurement Observations

[Figure]

Accurate MINMAX Approach
- Repeat for N iterations.
- Store the thread-local execution time of each thread for each iteration.
- For an individual iteration, store the largest execution time among the threads.
- This gives N largest-execution-time values.
- Choose the minimum of these values as your execution time: the MINMAX value.
- A short sketch of this selection step appears at the end of this lab.

MPAC Hello World

Objective
- Write a simple "Hello World" program using MPAC.

Mechanism
- The user specifies the number of worker threads on the command line.
- Each worker thread prints "Hello World" and exits.

Compile and Run

$ cd /<path-to-mpac>/mpac_1.2/apps/hello
$ make clean
$ make
$ ./mpac_hello_app -n <# of Threads>
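The MINMAX selection described above reduces the N per-round maxima to a single reported time. The sketch below assumes round_max[] has already been filled with the largest per-thread elapsed time of each round (for example, by repeating the barrier-based round shown earlier); the values used here are hypothetical and for illustration only.

/* MINMAX selection sketch (not part of the MPAC API). */
#include <stdio.h>

#define NROUNDS 10

static double minmax_time(const double round_max[], int nrounds)
{
    double best = round_max[0];
    for (int r = 1; r < nrounds; r++)
        if (round_max[r] < best)
            best = round_max[r];          /* minimum of the per-round maxima */
    return best;
}

int main(void)
{
    /* Hypothetical per-round maxima in seconds, for illustration only. */
    double round_max[NROUNDS] = { 1.07, 1.02, 1.05, 1.01, 1.03,
                                  1.02, 1.04, 1.01, 1.06, 1.02 };
    printf("MINMAX execution time: %f s\n", minmax_time(round_max, NROUNDS));
    return 0;
}

Taking the maximum within a round accounts for the slowest thread, while taking the minimum across rounds discards rounds inflated by scheduling or other system noise, which is what makes the resulting measurement repeatable.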