LAB WORKBOOK FOR A Short Course on "Programming Multi-Core Processors Based Embedded Systems" - A Hands-On Experience with Cavium Octeon Based Platforms

2010 Rev 1209-1
© Copyright 2010 Dr Abdul Waheed for Cavium University Program

LAB WORKBOOK

This workbook is written to assist students of the short course "Programming Multi-Core Processors Based Embedded Systems - A Hands-On Experience with Cavium Octeon Based Platforms". The contents of this document have been compiled from various academic resources to expose students to the basics of multi-core architectures in a hands-on fashion.

For further information, please contact: University@CaviumNetworks.com

TABLE OF CONTENTS

1. Introduction to Parallel Programming, Architecture and Performance
   1.1. Lab Objectives
   1.2. Setup
   1.3. Introduction to MPAC
   1.4. Understanding the Hardware
   1.5. Understanding Processor Architecture and Performance using MPAC for Host System
   1.6. Understanding Processor Architecture and Performance using MPAC for Target System
   1.7. A Simple "Hello World" Program
   1.8. Exercise 2 - Pthread Version of "Hello World"
   1.9. Writing Parallel Program using MPAC Library
   1.10. Exercise 3 - "Hello World Program using MPAC Library"
2. Parallel Sorting
   2.1. Lab Objectives
   2.2. Setup
   2.3. Introduction to MPAC Sort
3. Network Packet Sniffing (NPS)
   3.1. Lab Objectives
   3.2. Setup
4. Network Packet Filtering (NPF)
   4.1. Lab Objectives
   4.2. Setup
   4.3. Introduction to NPF
5. Deep Packet Inspection (DPI)
   5.1. Lab Objectives
   5.2. Setup
   5.3. Introduction to DPI

1. Introduction to Parallel Programming, Architecture and Performance

1.1. Lab Objectives

The objective of this lab is to understand the underlying multi-core architecture and its performance. For this purpose, this lab session introduces the "Multi-core Processor Architecture and Communication" (MPAC) library and reference performance benchmarks. You will learn to develop parallel applications for multi-core based systems using the MPAC library. At the end of this lab, you should know:
1. How to use MPAC benchmarks to understand the processor architecture and performance;
2. How to write a basic parallel program in C using the MPAC library.

1.2. Setup

The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with cross-building tools

All of the code used for this lab is provided with the description on your host system.
You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite is available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run the code on the target embedded system (Cavium board), you will need to cross compile it on your host system for the target system, copy the executables to the target system, and run them.

1.3. Introduction to MPAC

The MPAC library provides a framework that eases the development of parallel applications and benchmarks for state-of-the-art multi-core processor based computing and networking platforms. MPAC uses multiple threads in a fork-and-join approach, which helps exercise multiple processor cores of a system simultaneously according to a user-specified workload. The flexibility of the MPAC software architecture allows a user to parallelize a task without going into the peculiar intricacies of parallelism. The MPAC library allows the user to implement suitable experimental control and to replicate the same task across multiple processors or cores using fork-and-join parallelism. MPAC is an open-source, C-based, POSIX-compliant library, freely available under a FreeBSD-style licensing model.

Fig. 1 provides an overview of MPAC's software architecture. It provides an implementation of some common tasks, such as measurement of timer resolution, accurate interval timers, and other statistical and experimental design related functions, which may be too time consuming or complex to be written by a regular user. However, these ideas are fundamental to accurate and repeatable measurement-based evaluation.

Figure 1. A high-level architecture of MPAC Library's extensible benchmarking infrastructure.

Fig. 2 shows an overview of MPAC's fork-and-join execution model. In the following subsections, we provide details about various MPAC modules and related APIs.
Threading-based parallel application development requires thread creation, execution control, and termination. Thread usage varies depending on the task. A user may require a thread to terminate after it has completed its task, or to wait for other threads to complete their tasks and terminate together. The MPAC library provides a Thread Manager (TM), which handles thread activities transparently to the end user. It offers high-level functions to manage the life cycle of a user-specified thread pool of non-interacting workers. It is based on the fork-and-join threading model for concurrent execution of the same workload on all processor cores. Thread Manager functions include thread creation, thread locking, thread affinity, and thread termination.

Figure 2. An overview of MPAC Benchmark fork and join infrastructure.

1.4. Understanding the Hardware

Before writing a parallel application for a specific platform, it is a good idea to identify and understand the underlying hardware architecture. To expose the hardware details of the system (CPU, memory, I/O devices, and network interfaces), Linux maintains a /proc file system that dynamically tracks system hardware resources and their performance statistics. We can read various files under /proc to identify system hardware details:

$ cat /proc/cpuinfo
$ cat /proc/meminfo

The file "cpuinfo" gives details of all the processor cores available in the system. The main variables to observe are "processor", "model name", "cpu MHz", "cache size", and "cpu cores". The "processor" field is the processor id of the core. "model name" gives the type and processor vendor.
"cpu MHz", "cache size", and "cpu cores", represent the processor frequency, L2 cache size and the number of cores per socket respectively. Another way to get the details of your processor is by issuing the following command. $ dmesg | grep CPU The file "meminfo" gives details of the memory organization of the system. The main variable to observe is "MemTotal" which represents the total size of the main memory of your system. Another way to get the details of the system memory is by issuing the following command. $ dmesg | grep mem Fill in the following table to note your observations and measurements of the host and target systems: Table 1. System hardware details. Hardware Attributes Values (Development Host) Values (Target Cavium Board) Processor type Processor GHz rating Total number of CPUs (cores) L1 Data Cache size L2 Cache size Total Main Memory size 1.5. Understanding Processor Architecture and Performance using MPAC for host system In order to understand the performance of your multi-core architecture for CPU and memory intensive workload, MPAC provides CPU and memory benchmarks. After you have downloaded and unpacked MPAC software run the following commands to configure and compile the MPAC software on your development host. To go to the main directory issue the following command. host$ cd /<path-to-mpac>/mpac_1.2 where <path-to-mpac> is the directory where mpac is located. Then issue the following commands. © Copyright 2010 Dr Abdul Waheed for Cavium University Program 7 Cavium University Program LAB WORK BOOK host$ ./configure host$ make clean host $ make To execute the MPAC CPU benchmark, follow the following steps. host$ cd benchmarks/cpu host$ ./mpac_cpu_bm –n <# of Threads> -r <# of Iterations> where –n is for number of threads, and -r for the number of times the task is run. For additional arguments that can be passed through command line, issue the following command. 
host$ ./mpac_cpu_bm –h

Fill in the following table to note your observations and measurements from running the MPAC CPU benchmark on the host system:

Table 2. CPU performance in MOPS.

                           |            No of Threads
Operation Type             |  1 |  2 |  4 |  8 | 16 | 32
---------------------------+----+----+----+----+----+----
Integer (summation)        |    |    |    |    |    |
Logical (String Operation) |    |    |    |    |    |
Floating Point (Sin)       |    |    |    |    |    |

To execute the MPAC memory benchmark, perform the following steps.

host$ cd benchmarks/mem
host$ ./mpac_mem_bm –n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>

For additional arguments that can be passed through the command line, issue the following command.

host$ ./mpac_mem_bm –h

Fill in the following table to note your observations and measurements from running the MPAC memory benchmark on the host system:

Table 3. Memory performance in Mbps for Integer data type.

                 |            No of Threads
Array Size       |  1 |  2 |  4 |  8 | 16 | 32
-----------------+----+----+----+----+----+----
512 (4 KB)       |    |    |    |    |    |
65536 (512 KB)   |    |    |    |    |    |
1048576 (8 MB)   |    |    |    |    |    |

1.6. Understanding Processor Architecture and Performance using MPAC for Target System

In order to understand the performance of your target embedded multi-core architecture (Cavium board) for CPU- and memory-intensive workloads, use the MPAC CPU and memory benchmarks. On your host system, run the following commands to configure and cross compile the MPAC software for the target system, using the provided SDK. To cross compile the code for the target system, set the environment variables for your specific target system. Go to the directory where the OCTEON SDK is installed; by default it is installed under /usr/local/Cavium_Networks/OCTEON_SDK/. Type the following command.

host$ source env-setup <OCTEON-MODEL>

where <OCTEON-MODEL> is the model of your target board, e.g. OCTEON_CN56XX. To go to the MPAC main directory, issue the following command.

host$ cd /<path-to-mpac>/mpac_1.2

where <path-to-mpac> is the directory where MPAC is located. Then issue the following commands.
host$ ./configure --host=i386-redhat-linux-gnu --target=mips64-octeon-linux-gnu CC=mips64-octeon-linux-gnu-gcc
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar

where "mips64-octeon-linux-gnu-gcc" is the gcc cross compiler for OCTEON based systems. To execute the MPAC CPU benchmark on the target system, copy the executable "mpac_cpu_bm" to the target system and perform the following step.

target$ ./mpac_cpu_bm –n <# of Threads> -r <# of Iterations>

where –n is the number of threads and -r is the number of times the task is run. For additional arguments that can be passed through the command line, issue the following command.

target$ ./mpac_cpu_bm –h

Fill in the following table to note your observations and measurements from running the MPAC CPU benchmark on the target system:

Table 4. CPU performance in MOPS.

                           |            No of Threads
Operation Type             |  1 |  2 |  4 |  8 | 16 | 32
---------------------------+----+----+----+----+----+----
Integer (summation)        |    |    |    |    |    |
Logical (String Operation) |    |    |    |    |    |
Floating Point (Sin)       |    |    |    |    |    |

To execute the MPAC memory benchmark on the target system, copy the executable "mpac_mem_bm" to the target system and perform the following step.

target$ ./mpac_mem_bm –n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>

For additional arguments that can be passed through the command line, issue the following command.

target$ ./mpac_mem_bm –h

Fill in the following table to note your observations and measurements from running the MPAC memory benchmark on the target system:

Table 5. Memory performance in Mbps for Integer data type.

                 |            No of Threads
Array Size       |  1 |  2 |  4 |  8 | 16 | 32
-----------------+----+----+----+----+----+----
512 (4 KB)       |    |    |    |    |    |
65536 (512 KB)   |    |    |    |    |    |
1048576 (8 MB)   |    |    |    |    |    |

1.7. A Simple "Hello World" Program

In this exercise we will compile and run a simple sequential "Hello World" program, written in C, which prints "Hello World" to the screen and exits.
#include <stdio.h>

int main(void)
{
    printf("Hello World\n");
    return 0;
}

To Compile & Run:

For Host System
host$ gcc -o outputFileName sourceFileName.c
host$ ./outputFileName

For Target System
host$ mips64-octeon-linux-gnu-gcc -o outputFileName sourceFileName.c
target$ ./outputFileName

1.8. Exercise 2 - Pthread Version of "Hello World"

 1. #include <pthread.h>
 2. #include <stdio.h>
 3. #include <stdlib.h>
 4. #define MAX_WORKER 8
 5. void *PrintHello(void *threadid)
 6. {
 7.     long tid = (long)threadid;
 8.     printf("Hello World! It's me, thread #%ld!\n", tid);
 9.     pthread_exit(NULL);
10. }
11.
12. int main(int argc, char *argv[])
13. {
14.     pthread_t threads[MAX_WORKER];
15.     long t;
16.     int num_thrs;
17.     num_thrs = (argc > 1) ? atoi(argv[1]) : MAX_WORKER;
18.     if (num_thrs < 1 || num_thrs > MAX_WORKER)
19.         num_thrs = MAX_WORKER;
20.     for (t = 0; t < num_thrs; t++) {
21.         printf("In main: creating thread %ld\n", t);
22.         pthread_create(&threads[t], NULL, PrintHello, (void *)t);
23.     }
24.
25.     for (t = 0; t < num_thrs; t++)
26.         pthread_join(threads[t], NULL);
27.     return 0;
28. }

This is the multithreaded version of the "Hello World" program using POSIX threads. On line 14 of the main program, the thread variable is declared. On line 22, "num_thrs" threads are created, as specified by the user through the command line; if an invalid value is given, "MAX_WORKER" threads are created. The created threads execute the function "PrintHello" (line 5, passed as the third argument of pthread_create) in parallel, and each thread terminates via "pthread_exit" on line 9. Meanwhile, the main thread waits for the created threads to complete at line 26 using the "pthread_join" function; once control returns to the main thread, the program exits.
To Compile & Run:

For Host System
host$ gcc -o outputFileName sourceFileName.c -lpthread
host$ ./outputFileName <# of Threads>

For Target System
host$ mips64-octeon-linux-gnu-gcc -o outputFileName sourceFileName.c -lpthread
target$ ./outputFileName <# of Threads>

"-lpthread" is necessary for every pthread program. It links the pthread library "libpthread.so" with your code; without it, the build will fail with compiling/linking errors.

1.9. Writing Parallel Program using MPAC Library

A four-step generic procedure is used to develop a parallel application with the MPAC library: (1) declarations; (2) thread routine; (3) thread creation; and (4) optional final calculations and garbage collection. The declaration step requires the declaration and initialization of the user input structure and thread data structure variables. The thread routine step requires writing a thread subroutine to be executed by the threads. The thread creation phase requires creating a joinable or detachable thread pool according to user requirements. The final step, in the case of joinable threads, performs the final calculations, displays the output, and releases the acquired resources.

To write a parallel program using the MPAC library, four files need to be created, along with the makefile that eases compilation of your application. The first is the header file for the application, which includes the data structure for user input (config_t), the data structure for passing data to threads (context_t), global variables, and function prototypes. The second file includes all the general functions: processing user input arguments, handling default arguments, initializing the thread data structure, and help and printing functions.
The third file includes the main function of the application and invokes the thread function. The fourth file includes the thread function that is executed by each thread.

1.10. Exercise 3 - "Hello World Program using MPAC Library"

The "hello world" example is included in MPAC under the "apps" directory. It takes two arguments from the user: (1) number of threads and (2) processor affinity.

To Compile & Run:
To execute the "hello world" example, issue the following commands:

For Host System
host$ cd /<path-to-mpac>/mpac_1.2/apps/hello
host$ make clean
host$ make
host$ ./mpac_hello_app –n <# of Threads>

For Target System
host$ cd /<path-to-mpac>/mpac_1.2/apps/hello
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_hello_app –n <# of Threads>

In order to write your own application using the MPAC library, simply extend the "hello world" example by updating the data structures and general functions in the mpac_hello.h and mpac_hello.c files. The only major changes you have to make will be in the mpac_hello_app_hw.c file, which contains the thread function.

2. Parallel Sorting

2.1. Lab Objectives

This lab session implements parallel sort using the MPAC library and measures its performance on the target Cavium system. The objective of this lab is to understand the concept of partitioning a workload into parallel tasks. At the end of this lab, you should know:
1. How to partition a workload into parallel tasks;
2. How to implement parallel sort using the MPAC library;
3. How to measure and tune the performance of parallel sort.

2.2. Setup

The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with cross-building tools

All of the code used for this lab is provided with the description on your local system.
You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite is available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run the code on the target embedded system (Cavium board), you will need to cross compile it on your host system for the target system, copy the executables to the target system, and run them.

2.3. Introduction to MPAC Sort

In MPAC applications, two sorting algorithms are implemented: (1) parallel quick sort; and (2) parallel bucket sort. The parallel quick sort works as shown in the figure below.

Figure 3. Parallel Quick Sort Implementation.

In the parallel quick sort algorithm, the data array is divided equally among the total number of threads; in the case shown there are four threads. Each partition of the array is passed to a thread, which sorts its partition using standard quick sort and sends the sorted result back to the main thread. In the main thread, the sorted partitions are combined and sorted again.

To Compile & Run:
To execute the parallel quick sort example, issue the following commands:

host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app –n <# of Threads> –s <Array Size> -u q

Other options are –m and –l for the upper and lower limits of the random data, and –a to set the processor affinity.

Fill in the following table to note your observations and measurements after running MPAC parallel quick sort on the target system:

Table 6. Time taken in microseconds to sort an array of one million elements using parallel quick sort.
                 |            No of Threads
Array Size       |  1 |  2 |  4 |  8 | 16 | 32
-----------------+----+----+----+----+----+----
1,000,000        |    |    |    |    |    |

In the parallel bucket sort algorithm, the minimum and maximum value elements are identified, and the range between minimum and maximum is divided equally among the total number of threads, forming buckets. There are therefore as many buckets as there are threads. Each element of the data array is then placed in its appropriate bucket array. The bucket arrays are passed to the threads, which sort them using quick sort and return them to the main thread. In the main thread, the sorted bucket arrays are concatenated to form a sorted data array. The bucket sort algorithm is shown in Figure 4.

To Compile & Run:
To execute the parallel bucket sort example, issue the following commands:

host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app –n <# of Threads> –s <Array Size> -u b

Other options are –m and –l for the upper and lower limits of the random data, and –a to set the processor affinity.

Figure 4. Parallel Bucket Sort Implementation. (In the illustrated example: Min = 1, Max = 43, Difference = 42, Threads = 4, so the bucket size is 42/4, giving bucket ranges 1-11, 12-22, 23-33, and 34-44.)

Fill in the following table to note your observations and measurements after running MPAC parallel bucket sort on the target system:

Table 7. Time taken in microseconds to sort an array of one million elements using parallel bucket sort.

                 |            No of Threads
Array Size       |  1 |  2 |  4 |  8 | 16 | 32
-----------------+----+----+----+----+----+----
1,000,000        |    |    |    |    |    |

3. Network Packet Sniffing (NPS)

3.1. Lab Objectives

This lab session is about network packet sniffing and also serves as a base case for the following networking labs. Referring to the seven-layer OSI model, we will capture packets at the second layer (i.e., the data link layer). The MPAC network benchmark will be used to generate network traffic. We will measure the packet capturing throughput on a target Cavium system. The objective of this lab is to understand the concept of parallel processing of a network workload. At the end of this lab, you should know:
1. Implementation of the NPS application;
2. Usage of the NetServer and NetPerf traffic generation tools;
3. Software design of a parallel application;
4. The relationship of thread synchronization methods with the perceived and measured performance.

3.2. Setup

The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with cross-building tools

All of the code used for this lab is provided with the description on your local system. You will need to build, execute, and analyze the code. The testbed setup for Labs 3-5 is shown in Figure 5.

By definition, the word sniffing means to 'sniff', or pick up, something for further analysis. In computer networking terminology, sniffing a packet means capturing a packet arriving at or departing from a network interface. This capturing does not disturb the ongoing communication, and it can be done using one or more systems. Figure 5 shows a typical scenario in which the communication ends (server and client) have ongoing network flows and a sniffer captures the packets en route. Such a scenario can also be produced on a single machine using the loopback device.

Figure 5: Testbed setup for Labs 3-5.

Clearly there are two major portions of the work. The first is to capture the packets from the network interface and the second is to analyze the packets.
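These two portions form a producer-consumer pair: the capture thread produces packets into a shared queue and analysis threads consume them. One way to protect that queue without locks, sketched below using C11 atomics (a hypothetical illustration, not MPAC's actual queue implementation; the names `spsc_ring_t`, `ring_put`, and `ring_get` are invented), is a single-producer/single-consumer ring buffer:

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024              /* must be a power of two */

typedef struct {
    void *slot[RING_SIZE];
    atomic_size_t head;             /* advanced only by the producer */
    atomic_size_t tail;             /* advanced only by the consumer */
} spsc_ring_t;

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_put(spsc_ring_t *r, void *pkt)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_SIZE)
        return -1;                  /* full: sniffer must wait or drop */
    r->slot[h & (RING_SIZE - 1)] = pkt;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns the next packet, or NULL if the ring is empty. */
static void *ring_get(spsc_ring_t *r)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h)
        return NULL;                /* empty */
    void *pkt = r->slot[t & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return pkt;
}
```

Because the producer writes only `head` and the consumer writes only `tail`, each index has a single writer and the release/acquire pairs are the only synchronization needed; this is the general idea behind the lock-free queue variants compared later in this lab.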
The producer-consumer model seems apt for this scenario, where one thread is capturing (i.e., producing) the data and others are analyzing it (i.e., consuming the data). We have used raw sockets to capture the network traffic. Now we need to see which of the two tasks (i.e., sniffing and data analysis) needs to be parallelized. Empirical results show that sniffing network data flowing at a rate of 1 Gbps is easily manageable by a single thread. On the other hand, packet data analysis is CPU intensive, and the analysis portion cannot keep up with the sniffing speed. This means data consumption would be too slow to empty the shared queue, and the packet data producer would have to wait for empty space. So the data analysis portion is the right candidate for parallelization.

Clearly, a queue is a shared resource between the producer and the consumers and should be properly guarded against concurrent accesses and race conditions. There are many ways to synchronize the competing threads. Locking-based protection mechanisms are usually slow, and such methods do not scale as the number of competing threads increases. In this specific lab we will use the optimized version of NPS, which does not use locking for protection. The MPAC sniffer app folder has three versions, which are built using different synchronization techniques. You are encouraged to run all three versions to see the difference in throughput. This specific lab has no work in the data analysis portion: consumers just throw the packets away. Later labs will add different analysis functions.

To Compile & Run:
Before running the NPS, we need to generate traffic with the MPAC network benchmark, which the sniffer application will capture.
Copy the cross compiled "mpac_net_bm" executable (in the directory mpac_1.2/benchmarks/net/) to the target system and run the following commands:

target$ ./mpac_net_bm –c <Receiver> –d <duration> –l <# of threads>
target$ ./mpac_net_bm –c <Sender> –d <duration> –l <# of threads> -i <IP of Receiver>

In this case, the sender and receiver are on the same system, which means that the interface to be tested is "lo". If you want to test the Ethernet interface, the sender can be executed on one target and the receiver on the other.

The options for running the sniffer application are –n for the number of threads, -d for the duration of the test, -e for the execution mode (in this case 3), -f for the interface the sniffer should sniff packets from (e.g. lo, eth0, or eth1), and –a to set the processor affinity. To execute the parallel NPS example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command in another shell (alongside the shell running the sender and receiver):

target$ ./mpac_sniffer_app –n <# of Threads> –d <duration> -f <interface to sniff> -e 3

Fill in the following table after running MPAC NPS on the target system:

Table 8. Network packet sniffing throughput in Mbps, for the multiple queues case with lock-free and optimized enqueue/dequeue functions.

                  |            No of Threads
                  |  1 |  2 |  4 |  8 | 16 | 32
------------------+----+----+----+----+----+----
Throughput (Mbps) |    |    |    |    |    |

4. Network Packet Filtering (NPF)

4.1. Lab Objectives

This lab session implements parallel NPF using the MPAC library and measures network packet capturing and filtering throughput on a multi-core based system. The objective of this lab is to understand the concept of parallel processing of a network workload. At the end of this lab, you should know:
1. How to implement a parallel NPF application using the MPAC library;
2. How to process a parallel network workload;
3. Performance measurement of parallel NPF.

4.2. Setup

The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with cross-building tools

All of the code used for this lab is provided with the description on your host system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite is available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.

4.3. Introduction to NPF

In this lab there is one dispatcher thread which sniffs the packets, and multiple threads which filter the sniffed traffic. To show gradual performance tuning, in MPAC we implemented the packet filtering application using three architectures: (1) a single queue shared between multiple threads, with lock-based enqueue/dequeue functions, as shown in Figure 6; (2) multiple queues with lock-based enqueue/dequeue functions, as shown in Figure 7; and (3) multiple queues with lock-free and optimized enqueue/dequeue functions. We have one dispatcher thread in addition to a number of worker threads. The dispatcher gets packets from the NIC and fills the packet queues in round-robin fashion.

Packet filtering is defined as packet header inspection at different OSI layers. In this lab, filtering is done using the source and destination IP addresses, the IP protocol field (fixed to TCP for this specific lab), and the source and destination ports. These parameters are provided by the user on the command line. The worker threads filter the packets based on this 5-tuple comparison. We will measure the throughput of the worker threads to see whether they can keep up with the sniffer or not. You will also experiment and observe the improvements from increasing threads and using thread-to-core affinity.
To Compile & Run:

Before running the NPF application, generate traffic with the MPAC network benchmark, which the sniffer will capture. Copy the cross-compiled "mpac_net_bm" executable (in the directory mpac_1.2/benchmarks/net/) to the target system and run the following commands:

target$ ./mpac_net_bm -c <Receiver> -d <duration> -l <# of threads>
target$ ./mpac_net_bm -c <Sender> -d <duration> -l <# of threads> -i <IP of Receiver>

In this case, the sender and receiver are on the same system, which means that the interface to be tested is "lo". If you want to test the Ethernet interface, the sender can be executed on one target and the receiver on the other. The options for running the NPF application are -n for the number of threads, -d for the duration of the test, -e for the execution mode (in this case 4), -f for the interface the sniffer should use to capture packets (e.g. lo, eth0, or eth1), -p and -P for the port numbers of the sender and receiver respectively, -i and -I for the IP addresses of the sender and receiver respectively, and -a to set the processor affinity.

To execute the parallel NPF example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_1Q/ to the target system and issue the following command:

target$ ./mpac_sniffer_app -n <# of Threads> -d <test duration> -f <interface to sniff> -e 4

Figure 6: Single queue shared between multiple threads, with a lock-based design for the enqueue/dequeue functions. [Diagram: the dispatcher enqueues packets into one shared queue; worker threads T0..TN-1 dequeue from it.]

Fill in the following table to note your observations and measurements after running MPAC NPF on the target system:

Table 9. Network packet filtering throughput in Mbps, for the single shared queue case.

No of Threads:     1    2    4    8    16    32
Throughput (Mbps): ___  ___  ___  ___  ___   ___

Figure 7: Multiple queues architecture with a lock-based design for the enqueue/dequeue functions. [Diagram: the dispatcher enqueues into per-worker queues in round-robin order; each worker thread T0..TN-1 dequeues from its own queue.]

To observe the results after performance tuning, we use the multiple queue architecture with a lock-based design for the enqueue/dequeue functions. To execute this example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ/ to the target system and issue the following command:

target$ ./mpac_sniffer_app -n <# of Threads> -d <test duration> -f <interface to sniff> -e 4

Fill in the following table after running MPAC NPF on the target system:

Table 10. Network packet filtering throughput in Mbps, for the multiple queues case.

No of Threads:     1    2    4    8    16    32
Throughput (Mbps): ___  ___  ___  ___  ___   ___

Finally, we execute the example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command:

target$ ./mpac_sniffer_app -n <# of Threads> -d <test duration> -f <interface to sniff> -e 4

Fill in the following table after running MPAC NPF on the target system:

Table 11. Network packet filtering throughput in Mbps, for the multiple queues case with lock-free and optimized enqueue/dequeue functions.

No of Threads:     1    2    4    8    16    32
Throughput (Mbps): ___  ___  ___  ___  ___   ___

You are encouraged to compare the results of tables 9, 10, and 11 so that you can appreciate the performance of the different synchronization and design techniques.

5. Deep Packet Inspection (DPI)

5.1.
Lab Objectives

This lab session implements parallel DPI using the MPAC library and measures the compute-intensive payload extraction and string matching throughput of captured network packets on a multi-core system. The objective of this lab is to understand the concept of parallel deep packet inspection of a network workload. At the end of this lab, you should know:
1. How to implement a parallel DPI application using the MPAC library;
2. How to inspect a parallel network workload;
3. Performance measurement of parallel DPI.

5.2. Setup

The required tools for this task are:
1. GNU C Compiler 4.3.0 or above
2. MPAC library and benchmarking suite
3. OCTEON Software Development Kit (SDK) with cross-building tools

All of the code used for this lab is provided with the description on your host system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite is available online at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.

5.3. Introduction to DPI

To show gradual performance tuning, MPAC uses the same three architectures for the sniffer application as in labs 3 and 4. In this lab, the compute-intensive payload extraction and string matching throughput of captured network packets is measured. The dispatcher sniffs the packets as before, and the workers search for a string in the application payload.

To Compile & Run:

The options for running this application are -n for the number of threads, -d for the duration of the test, -e for the execution mode (in this case 5), -f for the interface the sniffer should use to capture packets (e.g. lo, eth0, or eth1), -p and -P for the port numbers of the sender and receiver respectively, -i and -I for the IP addresses of the sender and receiver respectively, and -a to set the processor affinity.
To execute the parallel DPI example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_1Q/ to the target system and issue the following commands (as before, run the MPAC network benchmark in separate terminals):

target$ ./mpac_net_bm -c <Receiver> -d <duration> -l <# of threads>
target$ ./mpac_net_bm -c <Sender> -d <duration> -l <# of threads> -i <IP of Receiver>
target$ ./mpac_sniffer_app -n <# of Threads> -d <test duration> -f <interface to sniff> -e 5

Fill in the following table to note your observations and measurements after running MPAC DPI on the target system:

Table 12. Deep packet inspection throughput in Mbps, for the single shared queue case.

No of Threads:     1    2    4    8    16    32
Throughput (Mbps): ___  ___  ___  ___  ___   ___

To observe the results after performance tuning, we use the multiple queue architecture with a lock-based design for the enqueue/dequeue functions. To execute this example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ/ to the target system and issue the following command:

target$ ./mpac_sniffer_app -n <# of Threads> -d <test duration> -f <interface to sniff> -e 5

Fill in the following table to note your observations and measurements after running MPAC DPI on the target system:

Table 13. Deep packet inspection throughput in Mbps, for the multiple queues case.

No of Threads:     1    2    4    8    16    32
Throughput (Mbps): ___  ___  ___  ___  ___   ___

Finally, we execute the example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command:
target$ ./mpac_sniffer_app -n <# of Threads> -d <test duration> -f <interface to sniff> -e 5

Fill in the following table to note your observations and measurements after running MPAC DPI on the target system:

Table 14. Deep packet inspection throughput in Mbps, for the multiple queues case with lock-free and optimized enqueue/dequeue functions.

No of Threads:     1    2    4    8    16    32
Throughput (Mbps): ___  ___  ___  ___  ___   ___

You are encouraged to compare the results in tables 12, 13, and 14 so that you can appreciate the high performance of the lock-free design. You are also encouraged to read the code of the three different implementations used in these networking labs.