COE 590 Special Topics: Parallel Architectures

Programming Multi-Core Processor-Based Embedded Systems
A Hands-On Experience on Cavium Octeon-Based Platforms
Lab Exercises

Lab 1: Parallel Programming and Performance Measurement using MPAC
Lab 1 – Goals

Objective
• Use MPAC benchmarks to measure the performance of different subsystems of multi-core based systems
• Understand an accurate measurement method for multi-core based systems
• Use MPAC to learn to develop parallel programs

Mechanism
• The MPAC CPU and memory benchmarks exercise the processor and the memory unit by generating compute-intensive and memory-intensive workloads

Observations
• Observe the throughput with an increasing number of threads for compute-intensive and memory-intensive workloads
• Identify performance bottlenecks
MPAC fork and join infrastructure

• In MPAC-based applications, initialization and argument handling are performed by the main thread.
• The tasks to be run in parallel are forked to worker threads.
• The worker threads join after completing their tasks.
• Final processing is done by the main thread.
MPAC code structure
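A minimal sketch of the fork and join structure described above, written with plain POSIX threads rather than MPAC's actual API; the function and variable names here are illustrative, not MPAC's.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative worker: in a real MPAC benchmark this would run the
   compute- or memory-intensive kernel assigned to the thread. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("Hello World from worker thread %d\n", id);
    return NULL;
}

int main(int argc, char *argv[])
{
    /* Main thread: initialization and argument handling. */
    int n = (argc > 1) ? atoi(argv[1]) : 1;
    pthread_t *threads = malloc(n * sizeof(*threads));
    int *ids = malloc(n * sizeof(*ids));

    /* Fork: hand the parallel task to n worker threads. */
    for (int i = 0; i < n; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }

    /* Join: wait for all workers to complete their task. */
    for (int i = 0; i < n; i++)
        pthread_join(threads[i], NULL);

    /* Final processing by the main thread. */
    printf("All %d worker threads have joined\n", n);
    free(threads);
    free(ids);
    return 0;
}

Compile with gcc -pthread and pass the thread count as the first argument.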
Compile and Run on Host System (Memory Benchmark)

Memory Benchmark
  host$ cd /<path-to-mpac>/mpac_1.2
  host$ ./configure
  host$ make clean
  host$ make
  host$ cd benchmarks/mem
  host$ ./mpac_mem_bm -n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>

For Help
  host$ ./mpac_mem_bm -h
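For example, the run used for panel (a) of the results that follow measures a 512-element integer array over 100000 repetitions; the thread count of 4 is an illustrative choice:

  host$ ./mpac_mem_bm -n 4 -s 512 -r 100000 -t i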
Cross Compile for Target System (Memory Benchmark)

Cross Compile on Host System
  Go to the Cavium SDK directory and run:
  host$ source env-setup <OCTEON-MODEL>
  (where <OCTEON-MODEL> is the model of your target board, e.g. OCTEON_CN56XX)

  host$ cd /<path-to-mpac>/mpac_1.2
  host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu
  host$ export CC=mips64-octeon-linux-gnu-gcc
  host$ make clean
  host$ make CC=mips64-octeon-linux-gnu-gcc
Run on Target System (Memory Benchmark)

Run on Target System
  Copy the executable "mpac_mem_bm" to the target system
  target$ ./mpac_mem_bm -n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>

For Help
  target$ ./mpac_mem_bm -h
Performance Measurements (Memory)

[Figure: Memory benchmark throughput (Mbps) vs. number of threads (1 to 12), for four working-set sizes:
  (a) 4 KB:   $ ./mpac_mem_bm -n <# of Threads> -r 100000 -s 512 -t i
  (b) 16 KB:  $ ./mpac_mem_bm -n <# of Threads> -r 10000 -s 2048 -t i
  (c) 1 MB:   $ ./mpac_mem_bm -n <# of Threads> -r 100 -s 131072 -t i
  (d) 16 MB:  $ ./mpac_mem_bm -n <# of Threads> -r 10 -s 2097152 -t i]

• Results taken on Cavium Networks EVB 5610 Board
Performance Measurements (Memory)

• Data sizes of 4 KB, 16 KB, 1 MB, and 16 MB are used to exercise the L1 cache, L2 cache, and main memory of the target system.
• A Cavium Octeon (MIPS64) CN5610 evaluation board is used as the System Under Test (SUT).
• With data larger than the L2 cache (2 MB), throughput would not normally be expected to scale linearly; the observed linearity is due to the low-latency interconnect used in place of a system bus.
• Throughput scales linearly with the number of threads in all cases.
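The mapping between the -s values and these working-set sizes is consistent with roughly 8 bytes of data per array element (an assumption; the exact footprint depends on the selected data type and the benchmark's internal buffers): 512 × 8 B = 4 KB, 2048 × 8 B = 16 KB, 131072 × 8 B = 1 MB, and 2097152 × 8 B = 16 MB.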
Compile and Run on Host System (CPU Benchmark)

CPU Benchmark
  host$ cd /<path-to-mpac>/mpac_1.2
  host$ ./configure
  host$ make clean
  host$ make
  host$ cd benchmarks/cpu
  host$ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>

For Help
  host$ ./mpac_cpu_bm -h
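For example, the integer-unit run shown in the results that follow uses 10000000 iterations; judging from those commands, -u i selects the integer-unit test and -u l the logical-unit test, and the thread count of 4 here is an illustrative choice:

  host$ ./mpac_cpu_bm -n 4 -r 10000000 -a 1 -u i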
Cross Compile for Target System (CPU Benchmark)

Cross Compile on Host System
  Go to the Cavium SDK directory and run:
  host$ source env-setup <OCTEON-MODEL>
  (where <OCTEON-MODEL> is the model of your target board, e.g. OCTEON_CN56XX)

  host$ cd /<path-to-mpac>/mpac_1.2
  host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu
  host$ export CC=mips64-octeon-linux-gnu-gcc
  host$ make clean
  host$ make CC=mips64-octeon-linux-gnu-gcc
Run on Target System (CPU Benchmark)

Run on Target System
  Copy the executable "mpac_cpu_bm" to the target system
  target$ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>

For Help
  target$ ./mpac_cpu_bm -h
Performance Measurements (CPU)

[Figure: CPU benchmark throughput (MOPS) vs. number of threads (1 to 12):
  (a) Integer Unit Operation:  $ ./mpac_cpu_bm -n <# of Threads> -r 10000000 -a 1 -u i
  (b) Logical Unit Operation:  $ ./mpac_cpu_bm -n <# of Threads> -r 10000000 -a 1 -u l]

• Results taken on Cavium Networks EVB 5610 Board
Performance Measurements (CPU)

• The Integer Unit (summation) and the Logical Unit (string operations) of the processor are exercised.
• A Cavium Octeon (MIPS64) CN5610 evaluation board is used as the System Under Test (SUT).
• Throughput scales linearly with the number of threads in both cases.
Measurement of Execution Time

• Measuring the elapsed time from the start of a task until its completion is a straightforward procedure for a sequential task.
• The procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores.
• It is not guaranteed that all tasks start at the same time or complete at the same time; the measurement is therefore imprecise due to the concurrent nature of the tasks.
Cont…

• Execution time can be measured either globally or locally.
• In the case of global measurement, the execution time is the difference of the time stamps taken at the global fork and join instants.
• Local times can be measured and recorded by each of the n threads.
• After the threads join, the maximum of these individual execution times provides an estimate of the overall execution time.
Definitions

• LETE: Local Execution Time Estimation
• GETE: Global Execution Time Estimation
Cont…

[Figure: Each of the n worker threads executes procedure() and records its own start and stop time stamps t_{i1} and t_{i2}; the global start and stop time stamps t_{g1} and t_{g2} are taken at the fork and join instants. A single sequential task simply records t_1 and t_2, giving $\Delta t = t_2 - t_1$.]

GETE:  $\Delta t_g = t_{g2} - t_{g1}$
LETE:  $\Delta t_l = \max_{1 \le i \le n} (t_{i2} - t_{i1})$
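A minimal sketch of how the two estimates can be computed once the time stamps have been collected; the array names and the use of clock_gettime time stamps are assumptions for illustration, not MPAC's internal implementation.

#include <time.h>

/* Seconds elapsed between two timespec values. */
static double elapsed(struct timespec start, struct timespec stop)
{
    return (stop.tv_sec - start.tv_sec) +
           (stop.tv_nsec - start.tv_nsec) / 1e9;
}

/* GETE: difference of the global fork/join time stamps. */
double gete(struct timespec tg1, struct timespec tg2)
{
    return elapsed(tg1, tg2);
}

/* LETE: maximum of the per-thread local execution times. */
double lete(const struct timespec t1[], const struct timespec t2[], int n)
{
    double max = 0.0;
    for (int i = 0; i < n; i++) {
        double d = elapsed(t1[i], t2[i]);
        if (d > max)
            max = d;
    }
    return max;
}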
The Problem

• Lack of precision:
  • Some tasks finish before others
  • Synchronization issues with a large number of cores
  • Results are not repeatable
Performance Measurement Methodologies

[Figure: For the sequential case, the start time is taken, the task is repeated for N iterations, and the end time is taken. For the multithreaded case, the start time is taken at a barrier, each of the K threads repeats the task for N iterations, and the end time is taken at a barrier.]
Accurate LETE Measurement Methodology

[Figure: The K threads synchronize at a barrier before each round; this is repeated for N rounds, and the maximum elapsed time among the threads is recorded for each round.]
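A sketch of the per-thread measurement loop this methodology implies, using a POSIX barrier so that all threads start each round together; the names, the fixed sizes, and the shared times[round][thread] array are illustrative assumptions, not MPAC code.

#include <pthread.h>
#include <time.h>

#define N_ROUNDS  100   /* illustrative number of rounds */
#define K_THREADS 12    /* illustrative number of worker threads */

extern pthread_barrier_t round_barrier;      /* initialized elsewhere for K_THREADS threads */
extern double times[N_ROUNDS][K_THREADS];    /* per-round, per-thread elapsed times */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Each worker thread calls this with its own thread id. */
void measure_rounds(int tid, void (*workload)(void))
{
    for (int r = 0; r < N_ROUNDS; r++) {
        pthread_barrier_wait(&round_barrier);   /* synchronize before the round */
        double start = now();
        workload();                             /* the task being measured */
        times[r][tid] = now() - start;
    }
}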
Measurement Observations
Accurate MINMAX Approach

• Repeat for N iterations.
• Store the thread-local execution time for each thread in each iteration.
• For each iteration, store the largest execution time among the threads.
• This yields N largest-execution-time values.
• Choose the minimum of these values as the execution time: the MINMAX value (see the sketch below).
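A minimal sketch of the MINMAX selection itself, operating on the per-round, per-thread times collected by a measurement loop like the one sketched above; the function and array names are illustrative assumptions.

/* For each round take the maximum time across threads, then return the
   minimum of those per-round maxima: the MINMAX execution time. */
double minmax_time(int n_rounds, int n_threads,
                   double times[n_rounds][n_threads])
{
    double best = -1.0;
    for (int r = 0; r < n_rounds; r++) {
        double round_max = 0.0;
        for (int t = 0; t < n_threads; t++)
            if (times[r][t] > round_max)
                round_max = times[r][t];
        if (best < 0.0 || round_max < best)
            best = round_max;
    }
    return best;
}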
MPAC Hello World

Objective
• Write a simple "Hello World" program using MPAC

Mechanism
• The user specifies the number of worker threads on the command line
• Each worker thread prints "Hello World" and exits
Compile and Run

  $ cd /<path-to-mpac>/mpac_1.2/apps/hello
  $ make clean
  $ make
  $ ./mpac_hello_app -n <# of Threads>