Introduction to Parallel Programming
Sameh Ahmed

There are three ways to improve the performance of solving any computational problem:
• Work harder
• Get help
• Work smarter

Sum the numbers from 1 to 100

With a loop:

    int sum = 0;
    for (int i = 1; i <= 100; i++)
        sum = sum + i;

With a closed-form formula:

    1 + 2 + ... + 100 = 100(100 + 1)/2 = 5050

What is Parallel Computing?

Traditionally, software has been written for serial computation:
• It runs on a single computer with a single Central Processing Unit (CPU).
• A problem is broken into a discrete series of instructions.
• Instructions are executed one after another.
• Only one instruction may execute at any moment in time.

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• It runs on multiple CPUs.
• A problem is broken into discrete parts that can be solved concurrently.
• Each part is further broken down into a series of instructions.
• Instructions from each part execute simultaneously on different CPUs.

The compute resources might be:
• A single computer with multiple processors.
• An arbitrary number of computers connected by a network.
• A combination of both.

The computational problem should be able to:
• Be broken apart into discrete pieces of work that can be solved simultaneously.
• Execute multiple program instructions at any moment in time.
• Be solved in less time with multiple compute resources than with a single compute resource.

Why Use Parallel Computing?

Main reasons:
• Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel computers can be built from cheap, commodity components.
• Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory.
• Provide concurrency: A single compute resource can only do one thing at a time. Multiple compute resources can do many things simultaneously.
• Use of non-local resources: Compute resources on a wide area network, or even the Internet, can be used when local compute resources are scarce.
• Limits to serial computing: Both physical and practical reasons pose significant constraints on building ever-faster serial computers.
Current computer architectures increasingly rely on hardware-level parallelism to improve performance:
• Multiple execution units
• Pipelined instructions
• Multi-core

Application areas in science and engineering:
• Atmosphere, Earth, Environment
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
• Bioscience, Biotechnology, Genetics
• Mathematics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science

Application areas in industry and commerce:
• Databases, data mining
• Oil exploration
• Web search engines, web-based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Management of national and multinational environments
• Financial and economic modeling
• Advanced graphics and virtual reality, particularly in the entertainment industry
• Networked video and multimedia technologies
• Collaborative work

Parallel programming, OpenMP, and C++ are behind the DreamWorks Animation movies Shrek, Kung Fu Panda, Madagascar, and Monsters vs. Aliens.

Flynn's Taxonomy

Flynn's taxonomy classifies computer systems by the number of instruction streams and data streams:
• SISD: single instruction stream, single data stream
• SIMD: single instruction stream, multiple data streams
• MISD: multiple instruction streams, single data stream
• MIMD: multiple instruction streams, multiple data streams

Models for Parallel Programming

Parallel execution styles:
• SPMD (single program, multiple data): all processors execute the same program, but each operates on a different portion of the problem data.
• Fork-join: the program spawns parallel activities dynamically at certain points called "forks", which mark the beginning of a parallel computation, and collects and terminates them at another point called a "join".

[Figure: Parallelism relevance trend in the computational field. References per year, 2002 to 2012, for MPI, OpenMP, CUDA, and OpenCL.]

Parallel Programming Methodologies

Foster's PCAM method (Partitioning, Communication, Agglomeration, Mapping)

Designing Parallel Programs

Example of a Parallelizable Problem
Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum-energy conformation.
• This problem can be solved in parallel.
• Each molecular conformation can be evaluated independently.
• Finding the minimum-energy conformation is also parallelizable.

Example of a Non-parallelizable Problem
Calculation of the Fibonacci series (0, 1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:

    F(n) = F(n-1) + F(n-2)

Each term depends on the two terms before it, so the terms cannot be computed independently of one another.

Memory Organization
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory

Partitioning Strategies

There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition.
• Domain decomposition: the data associated with the problem is decomposed.
• Functional decomposition: the system is subdivided into multiple components.

Parallel Programming Issues
• Load balancing
• Minimizing communication. Components of execution time:
    1. Computation time
    2. Idle time
    3. Communication time
• Overlapping communication and computation
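Under domain decomposition, load balancing often comes down to index arithmetic. The sketch below shows one common way to split n data items among P tasks so that chunk sizes differ by at most one item. The helper name block_range is hypothetical, and the code makes no MPI calls; it only computes which indices a given rank would own.

```cpp
#include <cassert>
#include <utility>

// Half-open index range [first, last) owned by task `rank` when n items
// are divided among `parts` tasks. Hypothetical helper, not part of MPI.
std::pair<long, long> block_range(long n, int parts, int rank) {
    assert(n >= 0 && parts > 0 && rank >= 0 && rank < parts);
    long base = n / parts;    // every task gets at least this many items
    long extra = n % parts;   // the first `extra` tasks get one more item
    long first = rank * base + (rank < extra ? rank : extra);
    long count = base + (rank < extra ? 1 : 0);
    return {first, first + count};
}
```

For n = 10 and P = 4 the ranges are [0,3), [3,6), [6,8), and [8,10), so no task holds more than one item above any other.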
Faculty of Science, Cairo University (CuSci-cluster)

    Total number of computing nodes       24
    Total number of GPU nodes             1
    Total number of CPU cores             90
    Total number of CUDA cores            2496
    Memory                                8 GB / node
    GPU floating-point performance        3.5 TFLOPS
    Cluster floating-point performance    1.8 TFLOPS

Getting Started with MPI

The Message Passing Model
A parallel computation consists of a number of processes, each working on some local data. Each process has purely local variables, and there is no mechanism for any process to directly access the memory of another.

What is MPI?
Sharing of data between processes takes place by message passing, that is, by explicitly sending and receiving data between processes.
MPI stands for "Message Passing Interface". It is a library of functions (in C) or subroutines (in Fortran) that you insert into source code to perform data communication between processes.

What is OpenMP?
OpenMP (Open Multi-Processing) is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify shared-memory parallelism in Fortran and C/C++ programs.

In this course
• C++: programs, functions, arrays, pointers, I/O functions
• MPI: introduction, send and receive communication, collective operations, matrix multiplication

Introduction to MPI

Basic Features of Message Passing Programs
Message passing programs consist of multiple instances of a serial program that communicate by library calls. These calls may be roughly divided into four classes:
1. Calls used to initialize, manage, and finally terminate communications.
2. Calls used to communicate between pairs of processors.
3. Calls that perform communication operations among groups of processors.
4. Calls used to create arbitrary data types.

MPI Programs

Initializing and Terminating MPI
• Initializing MPI: MPI_Init(&argc, &argv)
• Terminating MPI: MPI_Finalize()

Example

    #include <mpi.h>
    #include <stdio.h>   /* Also include the usual header files */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);      /* Initialise MPI */
        printf("Hello World!\n");
        MPI_Finalize();              /* Terminate MPI */
        return 0;
    }

Compiling and Running MPI Programs

Start MPI services:
    mpd &
Compile a C program:
    mpicc MPI_file_name.c -o file_name_run
Compile a C++ program:
    mpic++ MPI_file_name.cpp -o file_name_run
Run on 4 processes:
    mpiexec -n 4 ./file_name_run
Example:
    mpic++ mpihello.cpp -o h
    mpiexec -n 4 ./h

Communicators
A communicator is a handle representing a group of processors that can communicate with one another. The communicator name is required as an argument to all point-to-point and collective operations.

Getting Communicator Information: Rank and Size
• Rank: a processor can determine its rank in a communicator with a call to MPI_COMM_RANK.
• Size: a processor can also determine the size, or number of processors, of any communicator to which it belongs with a call to MPI_COMM_SIZE.

[Slide: Sample Program: Hello World!]
[Slide: Manager/Worker]

Exercises

1. Write an MPI program that prints "My master node" on the processor with rank 0, "My even worker node" on processors with an even rank, and "My odd worker node" on processors with an odd rank.

2. Write an MPI program that runs on P processors such that each processor displays five numbers, starting right after the last number of the previous processor, with processor 0 starting from 1. For example, if P = 4 (lines may appear in any order):

    Processor 1 display : 6 7 8 9 10
    Processor 0 display : 1 2 3 4 5
    Processor 2 display : 11 12 13 14 15
    Processor 3 display : 16 17 18 19 20

   Hint: you can use the following function to convert an integer to a string.

    #include <sstream>
    #include <string>
    using namespace std;

    string NumberToString(int Number)
    {
        ostringstream ss;
        ss << Number;
        return ss.str();
    }

3. Write an MPI program that takes an integer N and in which each process prints the sum of N / P numbers, where P is the number of processors.

MPI Communication

Point-to-Point Communication

Introduction
A point-to-point communication always involves exactly two processes. One process sends a message to the other. This distinguishes it from the other type of communication in MPI, collective communication, which involves a whole group of processes at one time.
To send a message, a source process makes an MPI call which specifies a destination process in terms of its rank in the appropriate communicator (e.g. MPI_COMM_WORLD). The destination process also has to make an MPI call if it is to receive the message.
Point-to-Point Communication
• Simplest form of message passing
• One process sends a message to another one
• Like a fax machine
• Different types:
    • Synchronous
    • Asynchronous (buffered)

Blocking operations
• Sending and receiving can be blocking.
• A blocking subroutine returns only after the operation has completed.

Non-blocking operations
• Non-blocking operations return immediately and allow the calling program to continue.
• At a later point, the program may test or wait for the completion of the non-blocking operation.

The envelope of an MPI message has 4 parts:
1. source - the sending process
2. destination - the receiving process
3. communicator - specifies a group of processes to which both source and destination belong
4. tag - used to classify messages

The message body has 3 parts:
1. buffer - the message data
2. datatype - the type of the message data
3. count - the number of items of type datatype in buffer

Communication Modes

Standard send
The standard send completes once the message has been sent, which may or may not imply that the message has arrived at its destination. The message may instead lie "in the communications network" for some time.
The standard send has the following form:

    MPI_Send(buf, count, datatype, dest, tag, comm)

Standard blocking receive
The format of the standard blocking receive is:

    MPI_Recv(buf, count, datatype, source, tag, comm, status)

Point-to-Point Communication: Example

    #include <iostream>
    #include <mpi.h>
    using namespace std;

    int main(int argc, char **argv)
    {
        int myrank;
        int a;
        MPI_Status status;

        MPI_Init(&argc, &argv);                     /* Initialize MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);     /* Get rank */
        if (myrank == 0) {                          /* Send a message */
            cout << "ENTER an INPUT: ";
            cin >> a;
            MPI_Send(&a, 1, MPI_INT, 1, 17, MPI_COMM_WORLD);
        } else if (myrank == 1) {                   /* Receive a message */
            MPI_Recv(&a, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();                             /* Terminate MPI */
        return 0;
    }

Point-to-Point Communication: Example (array)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int myrank;
        MPI_Status status;
        double a[100];

        MPI_Init(&argc, &argv);                     /* Initialize MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);     /* Get rank */
        if (myrank == 0) {                          /* Send a message */
            for (int i = 0; i < 100; i++)
                a[i] = i;
            MPI_Send(a, 100, MPI_DOUBLE, 1, 17, MPI_COMM_WORLD);
        } else if (myrank == 1) {                   /* Receive a message */
            MPI_Recv(a, 100, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &status);
        }
        MPI_Finalize();                             /* Terminate MPI */
        return 0;
    }

Data types

    MPI datatype    C datatype
    MPI_CHAR        signed char
    MPI_INT         signed int
    MPI_FLOAT       float
    MPI_DOUBLE      double

Requirements
For a point-to-point communication to succeed:
• The sender must specify a valid destination rank.
• The receiver must specify a valid source rank.
• The communicator must be the same.
• Message tags must match.
• Message data types must match.
• The receiver's buffer length must be >= the message length.

Wildcards
Can only be used by the destination process!
• To receive from any source: src = MPI_ANY_SOURCE
• To receive messages with any tag: tag = MPI_ANY_TAG

Example: summing the numbers from 1 to 1000 with send and receive

    #include <iostream>
    #include <mpi.h>
    using namespace std;

    #define m 1       /* lower bound of the sum */
    #define n 1000    /* upper bound of the sum */

    int main(int argc, char **argv)
    {
        int mynode, totalnodes;
        int sum, startval, endval, accum;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);   // get totalnodes
        MPI_Comm_rank(MPI_COMM_WORLD, &mynode);       // get mynode

        sum = 0;                                      // zero sum for accumulation
        startval = n * mynode / totalnodes + m;
        endval = n * (mynode + 1) / totalnodes;
        for (int i = startval; i <= endval; i = i + 1)
            sum = sum + i;

        if (mynode != 0) {
            MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        } else {
            for (int j = 1; j < totalnodes; j = j + 1) {
                MPI_Recv(&accum, 1, MPI_INT, j, 1, MPI_COMM_WORLD, &status);
                sum = sum + accum;
            }
        }

        if (mynode == 0)
            cout << "The sum is: " << sum << endl;

        MPI_Finalize();
        return 0;
    }

Collective communication

Introduction
In addition to point-to-point communications between individual pairs of processors, MPI includes routines for performing collective communications. These routines allow larger groups of processors to communicate in various ways, for example, one-to-several or several-to-one.
Collective communication involves the sending and receiving of data among processes. In general, all movement of data among processes can be accomplished using MPI send and receive routines.
Collective communication routines transmit data among all processes in a group and allow data motion among all processors or just a specified set of processors. A collective operation is an MPI function that is called by all processes belonging to a communicator.

MPI provides the following collective communication routines:
• Broadcast from one process to all other processes
• Global reduction operations such as sum, min, max, or user-defined reductions
• Barrier synchronization across all processes
• Gather data from all processes to one process
• Scatter data from one process to all processes

Broadcast
The MPI_BCAST routine enables you to copy data from the memory of the root processor to the same memory locations for other processors in the communicator. It is a one-to-many communication.

    send_count = 1;
    root = 0;
    MPI_Bcast(&a, send_count, MPI_INT, root, comm);

    int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Reduction
The MPI_REDUCE routine enables you to:
• collect data from each processor,
• reduce these data to a single value (such as a sum or max), and
• store the reduced result on the root processor.

The routine calls for this example are:

    count = 1;
    rank = 0;
    MPI_Reduce(&a, &x, count, MPI_FLOAT, MPI_SUM, rank, MPI_COMM_WORLD);   /* a and x of type float */

    MPI_Reduce(send_buffer, recv_buffer, count, data_type, reduction_operation, rank_of_receiving_process, communicator)

MPI_REDUCE combines the elements provided in the send buffer of each process, applies the specified operation (sum, min, max, ...), and returns the result to the receive buffer of the root process.

Barrier Synchronization
There are occasions when some processors cannot proceed until other processors have completed their current instructions. A common instance of this occurs when the root process reads data and then transmits these data to other processors. The other processors must wait until the I/O is completed and the data are moved.
The MPI_BARRIER routine blocks the calling process until all group processes have called the function. When MPI_BARRIER returns, all processes are synchronized at the barrier.

    MPI_Barrier(comm)

Gather
The MPI_GATHER routine is an all-to-one communication. The receive arguments are only meaningful to the root process. When MPI_GATHER is called, each process (including the root process) sends the contents of its send buffer to the root process. The root process receives the messages and stores them in rank order.
The gather could also be accomplished by each process calling MPI_SEND and the root process calling MPI_RECV N times to receive all of the messages.

In this example, data values A on each processor are gathered and moved to processor 0 into contiguous memory locations. MPI_GATHER requires that all processes, including the root, send the same amount of data, and that the data are of the same type. Thus send_count = recv_count.

MPI_ALLGATHER
In the previous example, after the data are gathered into processor 0, you could then MPI_BCAST the gathered data to all of the other processors. It is more convenient and efficient to gather and broadcast with the single MPI_ALLGATHER operation.

Scatter
The MPI_SCATTER routine is a one-to-all communication. Different data are sent from the root process to each process (in rank order). When MPI_SCATTER is called, the root process breaks up a set of contiguous memory locations into equal chunks and sends one chunk to each processor. The outcome is the same as if the root executed N MPI_SEND operations and each process executed an MPI_RECV.
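To make the data layout of scatter and gather concrete, here is a serial model in plain C++ that only mimics where the data ends up. It makes no MPI calls, and the names scatter_chunks and gather_chunks are hypothetical; chunk r stands for the receive buffer of rank r.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model of the scatter layout: the root's contiguous data is cut
// into equal chunks, and chunk r is what rank r would receive.
std::vector<std::vector<int>> scatter_chunks(const std::vector<int>& root_data,
                                             std::size_t nprocs) {
    assert(nprocs > 0 && root_data.size() % nprocs == 0);  // equal chunks only
    std::size_t chunk = root_data.size() / nprocs;
    std::vector<std::vector<int>> per_rank(nprocs);
    for (std::size_t r = 0; r < nprocs; ++r)
        per_rank[r].assign(root_data.begin() + r * chunk,
                           root_data.begin() + (r + 1) * chunk);
    return per_rank;
}

// Serial model of the gather layout: the root stores every rank's buffer
// in rank order, in contiguous memory.
std::vector<int> gather_chunks(const std::vector<std::vector<int>>& per_rank) {
    std::vector<int> root_data;
    for (const auto& buf : per_rank)
        root_data.insert(root_data.end(), buf.begin(), buf.end());
    return root_data;
}
```

Scattering {1, 2, 3, 4, 5, 6} over 3 ranks gives the chunks {1, 2}, {3, 4}, and {5, 6}; gathering those chunks restores the original array, which is why the two operations are often described as inverses.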
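As a closing check that ties the reduction slides back to the send/receive sum program, the following serial sketch computes each rank's partial sum with the same startval and endval bounds (with m = 1) and then adds the partial sums, which is what a sum reduction onto the root would deliver. The function names partial_sum and reduce_sum are hypothetical, and the code contains no MPI calls.

```cpp
#include <cassert>

// Partial sum that one rank would compute, using the same bounds as the
// send/receive sum program (startval and endval) with m = 1.
long partial_sum(long n, int rank, int nprocs) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    long startval = n * rank / nprocs + 1;
    long endval = n * (rank + 1) / nprocs;
    long sum = 0;
    for (long i = startval; i <= endval; ++i)
        sum += i;
    return sum;
}

// Serial stand-in for a sum reduction onto the root: combine the value
// contributed by every rank.
long reduce_sum(long n, int nprocs) {
    long total = 0;
    for (int rank = 0; rank < nprocs; ++rank)
        total += partial_sum(n, rank, nprocs);
    return total;
}
```

With n = 100 the combined result is 5050 for any number of ranks, matching 100(100 + 1)/2, because each rank's endval is exactly one less than the next rank's startval.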