Lab Manual - Cavium University Program

LAB WORKBOOK FOR
A Short Course on
"Programming Multi-Core
Processors Based Embedded
Systems"
A Hands-On Experience with Cavium Octeon Based Platforms
2010
Rev 1209-1
© Copyright 2010 Dr Abdul Waheed for Cavium University Program.
Cavium University Program LAB WORK BOOK
LAB WORKBOOK
This workbook is written to assist students of the short course on "Programming
Multi-Core Processors Based Embedded Systems - A Hands-On Experience with Cavium
Octeon Based Platforms".
The contents of this document have been compiled from various academic
resources to expose the students to the basics of multi-core architectures in a
hands-on fashion.
For further information, please contact
Email: University@CaviumNetworks.com
TABLE OF CONTENTS
1. INTRODUCTION TO PARALLEL PROGRAMMING, ARCHITECTURE AND PERFORMANCE
   1.1. LAB OBJECTIVES
   1.2. SETUP
   1.3. INTRODUCTION TO MPAC
   1.4. UNDERSTANDING THE HARDWARE
   1.5. UNDERSTANDING PROCESSOR ARCHITECTURE AND PERFORMANCE USING MPAC FOR HOST SYSTEM
   1.6. UNDERSTANDING PROCESSOR ARCHITECTURE AND PERFORMANCE USING MPAC FOR TARGET SYSTEM
   1.7. A SIMPLE "HELLO WORLD" PROGRAM
   1.8. EXERCISE 2 - PTHREAD VERSION OF "HELLO WORLD"
   1.9. WRITING PARALLEL PROGRAM USING MPAC LIBRARY
   1.10. EXERCISE 3 - "HELLO WORLD PROGRAM USING MPAC LIBRARY"
2. PARALLEL SORTING
   2.1. LAB OBJECTIVES
   2.2. SETUP
   2.3. INTRODUCTION TO MPAC SORT
3. NETWORK PACKET SNIFFING (NPS)
   3.1. LAB OBJECTIVES
   3.2. SETUP
4. NETWORK PACKET FILTERING (NPF)
   4.1. LAB OBJECTIVES
   4.2. SETUP
5. DEEP PACKET INSPECTION (DPI)
   5.1. LAB OBJECTIVES
   5.2. SETUP
   5.3. INTRODUCTION TO DPI
1. Introduction to Parallel Programming, Architecture and Performance
1.1. Lab Objectives
The objective of this lab is to understand the underlying multi-core architecture and its
performance. For this purpose, this lab session introduces "Multi-core Processor Architecture
and Communication" (MPAC) Library and reference performance benchmarks. You will learn to
develop parallel applications using MPAC library for multi-core based systems. At the end of
this lab, you should know:
1. How to use MPAC benchmarks to understand the processor architecture and performance;
2. How to write a basic parallel program in C using the MPAC library.
1.2. Setup
The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with Cross building tools
All of the code used for this lab is provided with the description on your host system. You will
need to build, execute, and analyze the code. The MPAC library and benchmarking suite are
available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run
the code on the target embedded system (Cavium board), you will need to cross compile it on
your host system for the target system, copy the executables to the target system, and run them.
1.3. Introduction to MPAC
The MPAC library provides a framework that eases the development of parallel applications and
benchmarks for state-of-the-art multi-core processor based computing and networking
platforms. The MPAC library uses multiple threads in a fork-and-join approach, which helps
exercise multiple processor cores of a system simultaneously according to a user-specified
workload. The flexibility of the MPAC software architecture allows a user to parallelize a task
without going into the peculiar intricacies of parallelism. The MPAC library allows the user to
implement suitable experimental control and to replicate the same task across multiple
processors or cores using fork-and-join parallelism. MPAC is an open-source, C-based,
POSIX-compliant library, which is freely available under a FreeBSD-style licensing model. Fig. 1
provides an overview of MPAC's software architecture. It provides an implementation of some
common tasks, such as measurement of timer resolution, accurate interval timers, and other
statistical and experimental-design-related functions, which may be too time consuming or
complex to be written by a regular user. However, these ideas are fundamental to accurate and
repeatable measurement-based evaluation.
Figure 1. A high-level architecture of MPAC Library’s extensible benchmarking infrastructure.
Fig. 2 shows an overview of MPAC's fork-and-join execution model. In the following
subsections, we provide details about various MPAC modules and related APIs. Thread-based
parallel application development requires thread creation, execution control, and termination.
Thread usage varies depending on the task. A user may require a thread to terminate after it has
completed its task, or to wait for other threads to complete their tasks and terminate together.
The MPAC library provides a Thread Manager (TM), which handles thread activities
transparently to the end user. It offers high-level functions to manage the life cycle of a
user-specified thread pool of non-interacting workers. It is based on the fork-and-join threading
model for concurrent execution of the same workload on all processor cores. Thread Manager
functions include thread creation, thread locking, thread affinity, and thread termination.
[Figure 2 depicts the stages MPAC Initialization, Argument Handling, and Thread Creation &
Forking feeding several parallel Thread Routines, followed by Thread Joining and Output
Processing & Display.]
Figure 2. An overview of the MPAC benchmark fork-and-join infrastructure.
1.4. Understanding the Hardware
Before writing a parallel application for a specific platform, it is a good idea to identify and
understand the underlying hardware architecture.
To expose the hardware details of the system (CPU, memory, I/O devices, and network
interfaces), Linux maintains the /proc file system, which dynamically reports system hardware
resources and their performance statistics. We can read various files under /proc to
identify system hardware details:
$ cat /proc/cpuinfo
$ cat /proc/meminfo
The file "cpuinfo" gives details of all the processor cores available in the system. The main
fields to observe are "processor", "model name", "cpu MHz", "cache size", and
"cpu cores". The "processor" field is the processor id of the core, and "model name"
gives the processor type and vendor. "cpu MHz", "cache size", and "cpu cores"
give the processor frequency, the L2 cache size, and the number of cores per socket,
respectively. Another way to get the details of your processor is by issuing the following
command.
$ dmesg | grep CPU
The file "meminfo" gives details of the memory organization of the system. The main variable
to observe is "MemTotal" which represents the total size of the main memory of your system.
Another way to get the details of the system memory is by issuing the following command.
$ dmesg | grep mem
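The same details can also be pulled out of /proc non-interactively. The following commands are a Linux-specific convenience, not part of the lab procedure:

```shell
# Count the logical processors listed in /proc/cpuinfo.
grep -c '^processor' /proc/cpuinfo
# Show the total main memory reported in /proc/meminfo (in kB).
grep '^MemTotal' /proc/meminfo
```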
Fill in the following table to note your observations and measurements of the host and target
systems:
Table 1. System hardware details.

Hardware Attributes             Values (Development Host)    Values (Target Cavium Board)
Processor type
Processor GHz rating
Total number of CPUs (cores)
L1 Data Cache size
L2 Cache size
Total Main Memory size
1.5. Understanding Processor Architecture and Performance using MPAC for the Host System
In order to understand the performance of your multi-core architecture for CPU- and memory-
intensive workloads, MPAC provides CPU and memory benchmarks. After you have downloaded
and unpacked the MPAC software, run the following commands to configure and compile it on
your development host.
To go to the main directory issue the following command.
host$ cd /<path-to-mpac>/mpac_1.2
where <path-to-mpac> is the directory where mpac is located. Then issue the following
commands.
host$ ./configure
host$ make clean
host$ make
To execute the MPAC CPU benchmark, run the following commands.
host$ cd benchmarks/cpu
host$ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>
where -n sets the number of threads and -r the number of times the task is run. For additional
arguments that can be passed through the command line, issue the following command.
host$ ./mpac_cpu_bm -h
Fill in the following table to note your observations and measurements by running the MPAC
CPU benchmark of the host system:
Table 2. CPU performance in MOPS.

                                      No of Threads
Operation Type                  1     2     4     8     16    32
Integer (summation)
Logical (String Operation)
Floating Point (Sin)
To execute the MPAC memory benchmark, run the following commands.
host$ cd benchmarks/mem
host$ ./mpac_mem_bm -n <# of Threads> -s <array size>
      -r <# of repetitions> -t <data type>
For additional arguments that can be passed through the command line, issue the following
command.
host$ ./mpac_mem_bm -h
Fill in the following table to note your observations and measurements by running the MPAC
memory benchmark of the host system:
Table 3. Memory performance in Mbps for Integer data type.

                          No of Threads
Array Size          1     2     4     8     16    32
512 (4 KB)
65536 (512 KB)
1048576 (8 MB)
1.6. Understanding Processor Architecture and Performance using MPAC for the Target System
In order to understand the performance of your target embedded multi-core architecture
(Cavium board) for CPU- and memory-intensive workloads, use the MPAC CPU and memory
benchmarks. On your host system, run the following commands to configure and cross compile
the MPAC software for the target system, using the provided SDK.
To cross compile the code for the target system, set the environment variables for your specific
target system. Go to the directory where the OCTEON SDK is installed. By default it is installed
under /usr/local/Cavium_Networks/OCTEON_SDK/. Type the following command.
host$ source env-setup <OCTEON-MODEL>
where <OCTEON-MODEL> is the model of your target board, e.g., OCTEON_CN56XX.
To go to the MPAC main directory issue the following command.
host$ cd /<path-to-mpac>/mpac_1.2
where <path-to-mpac> is the directory where mpac is located. Then issue the following
commands.
host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu CC=mips64-octeon-linux-gnu-gcc
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
where "mips64-octeon-linux-gnu-gcc" is the gcc cross compiler for OCTEON based systems.
To execute the MPAC CPU benchmark on the target system, copy the executable
"mpac_cpu_bm" to the target system and run the following command.
target$ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>
where -n sets the number of threads and -r the number of times the task is run. For additional
arguments that can be passed through the command line, issue the following command.
target$ ./mpac_cpu_bm -h
Fill in the following table to note your observations and measurements by running the MPAC
CPU benchmark of the system you are using:
Table 4. CPU performance in MOPS.

                                      No of Threads
Operation Type                  1     2     4     8     16    32
Integer (summation)
Logical (String Operation)
Floating Point (Sin)
To execute the MPAC memory benchmark on the target system, copy the executable
"mpac_mem_bm" to the target system and run the following command.
target$ ./mpac_mem_bm -n <# of Threads> -s <array size>
        -r <# of repetitions> -t <data type>
For additional arguments that can be passed through the command line, issue the following
command.
target$ ./mpac_mem_bm -h
Fill in the following table to note your observations and measurements by running the MPAC
memory benchmark of the system you are using:
Table 5. Memory performance in Mbps for Integer data type.

                          No of Threads
Array Size          1     2     4     8     16    32
512 (4 KB)
65536 (512 KB)
1048576 (8 MB)
1.7. A simple "Hello World" Program
In this exercise we will compile and run a simple sequential "Hello World" program written in C
language which prints "Hello World" to the screen and exits.
#include <stdio.h>

int main(void)
{
    printf("Hello World\n");
    return 0;
}
To Compile & Run:
For Host System
host$ gcc -o outputFileName sourceFileName.c
host$ ./outputFileName
For Target System
host$ mips64-octeon-linux-gnu-gcc -o outputFileName sourceFileName.c
target$ ./outputFileName
1.8. Exercise 2 – Pthread version of "Hello World"
1.  #include <pthread.h>
2.  #include <stdio.h>
3.  #include <stdlib.h>
4.  #define MAX_WORKER 8
5.  void *PrintHello(void *threadid)
6.  {
7.      long tid = (long)threadid;
8.      printf("Hello World! It's me, thread #%ld!\n", tid);
9.      pthread_exit(NULL);
10. }
11.
12. int main(int argc, char *argv[])
13. {
14.     pthread_t threads[MAX_WORKER];
15.     long t;
16.     int num_thrs;
17.     num_thrs = (argc > 1) ? atoi(argv[1]) : MAX_WORKER;
18.     if (num_thrs < 1 || num_thrs > MAX_WORKER)
19.         num_thrs = MAX_WORKER;
20.     for (t = 0; t < num_thrs; t++) {
21.         printf("In main: creating thread %ld\n", t);
22.         pthread_create(&threads[t], NULL, PrintHello, (void *)t);
23.     }
24.
25.     for (t = 0; t < num_thrs; t++)
26.         pthread_join(threads[t], NULL);
27.     return 0;
28. }
This is the multithreaded version of the "Hello World" program using POSIX threads. On line 14 in
the main program, the thread variable is declared. On line 22, "num_thrs" threads are created,
as specified by the user through the command line. If an invalid value is given, "MAX_WORKER"
threads are created. The created threads execute the function "PrintHello" (line 5) in parallel,
as specified by the third argument of the pthread_create function, and each thread terminates
using "pthread_exit" on line 9. Meanwhile, the main thread waits for the created threads to
complete at line 26 using the "pthread_join" function; after that, control returns to the main
program and the program exits.
To Compile & Run:
For Host System
host$ gcc -o outputFileName sourceFileName.c -lpthread
host$ ./outputFileName <# of Threads>
For Target System
host$ mips64-octeon-linux-gnu-gcc -o outputFileName sourceFileName.c -lpthread
target$ ./outputFileName <# of Threads>
"-lpthread" is necessary for every pthread program; it links the pthread library
"libpthread.so" into your code. Without it, the build will fail with linking errors.
1.9. Writing a Parallel Program using the MPAC Library
A generic four-step procedure is used to develop a parallel application with the MPAC library:
(1) declarations; (2) thread routine; (3) thread creation; and (4) optional final calculations and
garbage collection. The declaration step requires declaring and initializing the user input
structure and thread data structure variables. The 'Thread Routine' step requires writing a
thread subroutine to be executed by the threads. The 'Thread Creation' phase requires creating a
joinable or detachable thread pool according to user requirements. The 'Optional final
calculations and garbage collection' step, in the case of joinable threads, performs the final
calculations, displays the output, and releases the acquired resources.
To write a parallel program using the MPAC library, four files need to be created, along with a
makefile which eases compilation of your application. The first is the header file for the
application to be developed, which includes the data structure for user input (config_t), the
data structure for passing data to threads (context_t), global variables, and function prototypes.
The second file includes all the general functions, which cover processing user input arguments,
handling default arguments, initializing the thread data structure, and help and printing functions.
The third (thread) file includes the main function of the application and invokes the thread
function. The fourth file includes the thread function that is executed by each thread.
1.10. Exercise 3 - "Hello World Program using MPAC Library"
The "hello world" example is included in MPAC under the "apps" directory. The "hello world"
example takes two arguments from the user: (1) the number of threads and (2) the processor affinity.
To Compile & Run:
To execute the "hello world" example issue the following commands:
For Host System
host$ cd /<path-to-mpac>/mpac_1.2/apps/hello
host$ make clean
host$ make
host$ ./mpac_hello_app -n <# of Threads>
For Target System
host$ cd /<path-to-mpac>/mpac_1.2/apps/hello
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_hello_app -n <# of Threads>
To write your own application using the MPAC library, simply extend the "hello world"
example by updating the data structures and general functions in the mpac_hello.h and
mpac_hello.c files. The only major changes will be in the mpac_hello_app_hw.c file, which
contains the thread function.
2. Parallel Sorting
2.1. Lab Objectives
This lab session implements parallel sort using the MPAC library and measures its performance
on the target Cavium system. The objective of this lab is to understand the concept of partitioning
a workload into parallel tasks. At the end of this lab, you should know:
1. How to partition a workload into parallel tasks;
2. How to implement parallel sort using the MPAC library;
3. How to measure and tune the performance of parallel sort.
2.2. Setup
The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with Cross building tools
All of the code used for this lab is provided with the description on your local system. You will
need to build, execute, and analyze the code. The MPAC library and benchmarking suite are
available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run
the code on the target embedded system (Cavium board), you will need to cross compile it on
your host system for the target system, copy the executables to the target system, and run them.
2.3. Introduction to MPAC Sort
Two sorting algorithms are implemented in the MPAC applications: (1) parallel quick sort; and
(2) parallel bucket sort. The parallel quick sort works as shown in the figure below.
[Figure 3 depicts an unsorted array divided into four partitions, each partition sorted by a
thread function, and the sorted partitions merged into the final sorted array.]
Figure 3. Parallel Quick Sort Implementation.
In the parallel quick sort algorithm, the data array is divided equally among the total number of
threads; in the case above there are four threads. Each thread receives a partition of the array,
sorts it using standard quick sort, and sends the sorted result back to the main thread, where
the partitions are combined and sorted again.
To Compile & Run:
To execute the parallel quick sort example issue the following commands:
host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app -n <# of Threads> -s <Array Size> -u q
Other options are -m and -l for the upper and lower limits of the random data, and -a to set the
processor affinity.
Fill in the following table to note your observations and measurements after running MPAC
parallel quick sort on the target system:
Table 6. Time taken in microseconds to sort an array of a million elements using parallel quick sort.

                      No of Threads
Array Size      1     2     4     8     16    32
1,000,000
In the parallel bucket sort algorithm, the minimum and maximum elements are identified, and
the range between the minimum and maximum is divided equally among the total number of
threads, forming buckets; there are as many buckets as there are threads. Each element of the
data array is then placed in its appropriate bucket array. The bucket arrays are passed to the
threads, which sort the data using quick sort and return the bucket arrays to the main thread,
where they are concatenated to form a sorted data array. The bucket sort algorithm is shown
in figure 4.
To Compile & Run:
To execute the parallel bucket sort example issue the following commands:
host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app -n <# of Threads> -s <Array Size> -u b
Other options are -m and -l for the upper and lower limits of the random data, and -a to set the
processor affinity.
[Figure 4 depicts bucket sort on the example array: with Min = 1, Max = 43, Difference = 42, and
Threads = 4, the bucket size is 42/4, or approximately 11, giving the bucket ranges 1-11, 12-22,
23-33, and 34-44. Elements are distributed into these buckets, each bucket is sorted by a thread
function, and the sorted buckets are concatenated into the final sorted array.]
Figure 4. Parallel Bucket Sort Implementation.
Fill in the following table to note your observations and measurements after running MPAC
parallel bucket sort on the target system:
Table 7. Time taken in microseconds to sort an array of a million elements using parallel bucket sort.

                      No of Threads
Array Size      1     2     4     8     16    32
1,000,000
3. Network Packet Sniffing (NPS)
3.1. Lab Objectives
This lab session is about network packet sniffing and also serves as a base case for the following
networking labs. Referring to the seven-layer OSI model, we will capture packets at the second
layer (i.e., the data link layer). The MPAC network benchmark will be used to generate network
traffic. We will measure the packet capturing throughput on a target Cavium system. The
objective of this lab is to understand the concept of parallel processing of a network workload.
At the end of this lab, you should know:
1. The implementation of the NPS application;
2. The usage of the NetServer and NetPerf traffic generation tools;
3. The software design of a parallel application;
4. The relationship of thread synchronization methods with the perceived and measured
performance.
3.2. Setup
The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with Cross building tools
All of the code used for this lab is provided with the description on your local system. You will
need to build, execute, and analyze the code. The testbed setup for Labs 3-5 is shown in figure 5.
By definition, the word sniffing means that you 'sniff', or pick up, something for further analysis.
In computer networking terminology, sniffing a packet means capturing a packet arriving at or
departing from a network interface. This capturing does not disturb the ongoing communication,
and it can be done using one or more systems. Figure 5 shows a typical scenario in which the
communication ends (server and client) have ongoing network flows and a sniffer captures the
packets en route. Such a scenario can also be produced on a single machine using the loopback
device.
Figure 5. Testbed setup for Labs 3-5.
Clearly there are two major portions of the work: the first is to capture the packets from the
network interface, and the second is to analyze them. The producer-consumer model seems apt
for this scenario, where one thread captures (i.e., produces) the data and another analyzes
(i.e., consumes) it. We have used raw sockets to capture the
network traffic. Now we need to see which of the two tasks (i.e., sniffing and data analysis)
needs to be parallelized. Empirical results show that sniffing network data flowing at a rate of 1
Gbps is easily manageable by a single thread. On the other hand, packet data analysis is CPU
intensive, and the analysis portion cannot keep up with the sniffing speed. This means data
consumption would be too slow to empty the shared queue, and the packet data producer would
have to wait for empty space. So the data analysis portion is the right candidate for
parallelization. Clearly, the queue is a resource shared between the producer and the consumer,
and it should be properly guarded against concurrent accesses and race conditions. There are
many ways to synchronize the competing threads. Lock-based protection mechanisms are usually
slow, and such methods do not scale as the number of competing threads increases. In this
specific lab we will use the optimized version of NPS, which does not use locking for protection.
The MPAC sniffer app folder has three versions which are built using different synchronization
techniques. You are encouraged to run all three versions to see the difference in throughput.
This specific lab has no work for the data analysis portion; consumers just throw the packets
away. Later labs will add different analysis stages.
To Compile & Run:
Before running the NPS, we need to generate traffic with the MPAC network benchmark, which
the sniffer application will capture. Copy the cross-compiled "mpac_net_bm" executable (in the
directory mpac_1.2/benchmarks/net/) to the target system and run the following commands:
target$ ./mpac_net_bm -c <Receiver> -d <duration> -l <# of threads>
target$ ./mpac_net_bm -c <Sender> -d <duration> -l <# of threads>
        -i <IP of Receiver>
In this case, the sender and receiver are on the same system, which means that the interface to be
tested is "lo". If you want to test the Ethernet interface, the sender can be executed on one target
and the receiver on the other target.
The options for running the sniffer application are -n for the number of threads, -d for the
duration of the test, -e for the execution mode (in this case 3), -f for the interface the sniffer
should sniff packets from (e.g., lo, eth0, or eth1), and -a to set the processor affinity.
To execute the parallel NPS example, copy the "mpac_sniffer_app" executable under the directory
mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following
command in another shell (alongside the shells running the sender and receiver):
target$ ./mpac_sniffer_app -n <# of Threads> -d <duration> -f <interface to sniff> -e 3
Fill in the following table after running MPAC NPS on the target system:
Table 8. Network packet sniffing throughput in Mbps, for the multiple-queue case with lock-free and optimized
enqueue/dequeue functions.

                          No of Threads
                    1     2     4     8     16    32
Throughput (Mbps)
4. Network Packet Filtering (NPF)
4.1. Lab Objectives
This lab session implements parallel NPF using the MPAC library and measures network packet
capturing and filtering throughput on a multi-core based system. The objective of this lab is to
understand the concept of parallel processing of a network workload. At the end of this lab, you
should know:
1. How to implement a parallel NPF application using the MPAC library;
2. How to process a parallel network workload;
3. How to measure the performance of parallel NPF.
4.2. Setup
The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with Cross building tools
All of the code used for this lab is provided with the description on your host system. You will
need to build, execute, and analyze the code. The MPAC library and benchmarking suite are
available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.
4.3. Introduction to NPF
In this lab there will be one dispatcher thread, which sniffs the packets, and multiple threads to
filter the sniffed traffic. To show gradual performance tuning, we implemented the packet
filtering application in MPAC using three architectures: (1) a single queue shared between
multiple threads, with lock-based enqueue/dequeue functions, as shown in figure 6; (2) multiple
queues with lock-based enqueue/dequeue functions, as shown in figure 7; and (3) multiple
queues with lock-free and optimized enqueue/dequeue functions. We have one dispatcher
thread in addition to a number of worker threads. The dispatcher gets packets from the NIC and
fills the packet queues in round-robin fashion. Packet filtering is defined as packet header
inspection at different OSI layers. In this lab, filtering is done using the source and destination IP
addresses, the IP protocol field (which is fixed to TCP for this specific lab), and the source and
destination ports. These parameters are provided by the user on the command line. The worker
threads filter the packets based on the given 5-tuple comparison. We will measure the
throughput of the worker threads to see whether they can keep up with the sniffer. You will also
experiment with and observe the improvements from increasing the number of threads and
using thread-to-core affinity.
To Compile & Run:
Before running the NPF, generate traffic with the MPAC network benchmark, which the sniffer application will capture. Copy the cross-compiled "mpac_net_bm" executable (in the directory mpac_1.2/benchmarks/net/) to the target system and run the following commands:
target$ ./mpac_net_bm –c <Receiver> –d <duration> –l <# of threads>
target$ ./mpac_net_bm –c <Sender> –d <duration> –l <# of threads> -i <IP of Receiver>
In this case, the sender and receiver are on the same system, which means that the interface to
be tested is "lo". If you want to test the Ethernet interface, the sender can be executed on one
target and the receiver on the other target. The options for running the NPF application are: –n for the number of threads, -d for the duration of the test, -e for the execution mode (in this case, 4), -f for the interface the sniffer should use to sniff packets from (e.g. lo, eth0, or eth1), -p and –P for the port numbers of the sender and receiver respectively, -i and -I for the IP addresses of the sender and receiver respectively, and –a to set the processor affinity. To execute the parallel NPF example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_1Q/ to the target system and issue the following command:
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration> -f <interface to sniff> -e 4

Figure 6: Shared queue between multiple threads architecture with lock-based design for enqueue/dequeue functions. (Diagram: the dispatcher puts packets into one shared queue; worker threads T0 ... TN-1 get packets from the other end of the queue.)
Fill in the following table to note your observations and measurements after running MPAC NPF
on the target system:
Table 9. Network packet filtering throughput in Mbps, for single shared queue case.

No of Threads     |  1  |  2  |  4  |  8  | 16  | 32
Throughput (Mbps) |     |     |     |     |     |
Figure 7: Multiple queues architecture with lock-based design for enqueue/dequeue functions. (Diagram: the dispatcher puts packets into N separate queues in turn; each worker thread T0 ... TN-1 gets packets from its own queue.)
To observe the results after performance tuning, we use the multiple-queue architecture with lock-based design for enqueue/dequeue functions. To execute this example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ/ to the target system and issue the following command.

target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration> -f <interface to sniff> -e 4
Fill in the following table after running MPAC NPF on the target system:
Table 10. Network packet filtering throughput in Mbps, for multiple queues case.

No of Threads     |  1  |  2  |  4  |  8  | 16  | 32
Throughput (Mbps) |     |     |     |     |     |
We execute our example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command.

target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration> -f <interface to sniff> -e 4
Fill in the following table after running MPAC NPF on the target system:
Table 11. Network packet filtering throughput in Mbps, for multiple queues case with lock-free and optimized enqueue/dequeue functions.

No of Threads     |  1  |  2  |  4  |  8  | 16  | 32
Throughput (Mbps) |     |     |     |     |     |
You are encouraged to compare the results of tables 9, 10, and 11 so that you can appreciate the performance of the different synchronization and design techniques.
5. Deep Packet Inspection (DPI)
5.1. Lab Objectives
This lab session implements parallel DPI using the MPAC library and measures the compute-intensive payload extraction and string matching throughput of captured network packets on a multi-core system. The objective of this lab is to understand parallel deep packet inspection of a network workload. At the end of this lab, you should know:
1. How to implement a parallel DPI application using the MPAC library;
2. How to inspect a parallel network workload;
3. How to measure the performance of parallel DPI.
5.2. Setup
The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with cross-building tools
All of the code used for this lab is provided with the description on your local system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite is available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.
5.3. Introduction to DPI
To show gradual performance tuning, MPAC uses the same three sniffer application architectures as in labs 3 and 4. In this lab, the compute-intensive payload extraction and string matching throughput of captured network packets is measured. The dispatcher sniffs the packets as before, and the workers try to find a string in the application payload.
To Compile & Run:
The options for running this application are: –n for the number of threads, -d for the duration of the test, -e for the execution mode (in this case, 5), -f for the interface the sniffer should use to sniff packets from (e.g. lo, eth0, or eth1), -p and –P for the port numbers of the sender and receiver respectively, -i and -I for the IP addresses of the sender and receiver respectively, and –a to set the processor affinity. To execute the parallel DPI example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_1Q/ to the target system and issue the following commands (as before, run the MPAC network benchmark in separate terminals):

target$ ./mpac_net_bm –c <Receiver> –d <duration> –l <# of threads>
target$ ./mpac_net_bm –c <Sender> –d <duration> –l <# of threads> -i <IP of Receiver>
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration> -f <interface to sniff> -e 5
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 12. Deep packet inspection throughput in Mbps, for single shared queue case.

No of Threads     |  1  |  2  |  4  |  8  | 16  | 32
Throughput (Mbps) |     |     |     |     |     |
To observe the results after performance tuning, we use the multiple-queue architecture with lock-based design for enqueue/dequeue functions. To execute this example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ/ to the target system and issue the following command.

target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration> -f <interface to sniff> -e 5
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 13. Deep packet inspection throughput in Mbps, for multiple queues case.

No of Threads     |  1  |  2  |  4  |  8  | 16  | 32
Throughput (Mbps) |     |     |     |     |     |
We execute our example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command.

target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration> -f <interface to sniff> -e 5
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 14. Deep packet inspection throughput in Mbps, for multiple queues case with lock-free and optimized enqueue/dequeue functions.

No of Threads     |  1  |  2  |  4  |  8  | 16  | 32
Throughput (Mbps) |     |     |     |     |     |
You are encouraged to compare the results in tables 12, 13, and 14 so that you can appreciate the high performance of the lock-free design. You are also encouraged to read the code of the three different implementations used in these networking labs.