Parallel Programming with MPI
Jianfeng Yang, Internet and Information Technology Lab, Wuhan University, yjf@whu.edu.cn

Agenda
Part I: Seeking Parallelism/Concurrency
Part II: Parallel Algorithm Design
Part III: Message-Passing Programming

Part I: Seeking Parallelism/Concurrency

Outline
1 Introduction
2 Seeking Parallelism

1 Introduction (1/6)
"Well done is quickly done." - Caesar Augustus
Fast, fast, fast is still not fast enough. How do we get higher performance? Parallel computing.

1 Introduction (2/6)
What is parallel computing?
It is the use of a parallel computer to reduce the time needed to solve a single computational problem.
It is now considered a standard way for computational scientists and engineers to solve problems in areas as diverse as galactic evolution, climate modeling, aircraft design, molecular dynamics and economic analysis.

Parallel Computing
A problem is broken down into tasks that are performed by separate workers or processes.
Processes interact by exchanging information.
What do we basically need?
The ability to start the tasks.
A way for them to communicate.

1 Introduction (3/6)
What is a parallel computer? A multi-processor computer system that supports parallel programming.
Multicomputer: a parallel computer constructed out of multiple computers and an interconnection network. The processors on different computers interact by passing messages to each other.
Centralized multiprocessor (SMP, symmetric multiprocessor): a more highly integrated system in which all CPUs share access to a single global memory. The shared memory supports communication and synchronization among the processors.

1 Introduction (4/6)
Multi-core platform: two, four or more cores are integrated into one processor. Each core has its own registers and level-1 cache; all cores share the level-2 cache, which supports communication and synchronization among the cores. All cores share access to a global memory.

1 Introduction (5/6)
What is parallel programming? Programming in a language that allows you to explicitly indicate how different portions of the computation may be executed in parallel/concurrently by different processors/cores.
Do I really need parallel programming? Yes, because although a lot of research has been invested and many experimental parallelizing compilers have been developed, there is still no commercial parallelizing compiler thus far. The alternative is to write your own parallel programs.

1 Introduction (6/6)
Why should I program using MPI and OpenMP?
MPI (Message Passing Interface) is a standard specification for message-passing libraries. It is available on virtually every parallel computer system, and it is free. If you develop programs using MPI, you will be able to reuse them when you get access to a newer, faster parallel computer.
On a multi-core platform or SMP, the cores/CPUs share a memory space. While MPI is a perfectly satisfactory way for cores/processors to communicate with each other, OpenMP is a better way for the cores within a single processor/SMP node to interact. A hybrid MPI/OpenMP program can achieve even higher performance.

2 Seeking Parallelism (1/7)
To take advantage of multiple cores/processors, programmers must be able to identify operations that may be performed in parallel. Several ways:
Data dependence graphs
Data parallelism
Functional parallelism
Pipelining
...

2 Seeking Parallelism (2/7)
Data dependence graphs
A directed graph.
Each vertex represents a task to be completed.
An edge from vertex u to vertex v means that task u must be completed before task v begins.
That is, task v is dependent on task u. If there is no path from u to v, the tasks are independent and may be performed in parallel.

2 Seeking Parallelism (3/7)
Data dependence graphs
[Figure: example graphs showing tasks (vertices), their operations, and the dependences among tasks.]

2 Seeking Parallelism (4/7)
Data parallelism: independent tasks apply the same operation to different elements of a data set. For example:

for (int i = 0; i < 99; i++) {
    a[i] = b[i] + c[i];
}

2 Seeking Parallelism (5/7)
Functional parallelism: independent tasks apply different operations to different data elements.

a = 2;
b = 3;
m = (a + b) / 2;
s = (a*a + b*b) / 2;
v = s - m*m;

The statements computing m and s may be executed in parallel (functional parallelism).

2 Seeking Parallelism (6/7)
Pipelining
A data dependence graph forming a simple path/chain admits no parallelism if only a single problem instance must be processed.
If multiple problem instances are to be processed, and the computation can be divided into several stages with the same time consumption, then pipelining can supply parallelism. E.g., an assembly line.

2 Seeking Parallelism (7/7)
Pipelining: computing the partial sums
p0 = a0
p1 = a0 + a1
p2 = a0 + a1 + a2
p3 = a0 + a1 + a2 + a3

p[0] = a[0];
for (int i = 1; i <= 3; i++) {
    p[i] = p[i-1] + a[i];
}

Unrolled:
p[0] = a[0];
p[1] = p[0] + a[1];
p[2] = p[1] + a[2];
p[3] = p[2] + a[3];

[Figure: a four-stage pipeline in which stage i adds a[i] to the partial sum received from stage i-1.]

For example: landscape maintenance, preparing a dinner, a data cluster, ...

Homework
Given a task that can be divided into m subtasks, each requiring one unit of time, how much time is needed for an m-stage pipeline to process n tasks?
Consider the data dependence graph in the figure below. Identify all sources of data parallelism; identify all sources of functional parallelism.
[Figure: data dependence graph with input task I, several tasks labeled A, tasks B, C and D, and output task O.]

Part II: Parallel Algorithm Design

Outline
1. Introduction
2. The Task/Channel Model
3. Foster's Design Methodology

1. Introduction
Foster, Ian. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Reading, MA: Addison-Wesley, 1995.
Describes the task/channel model and works through a few simple problems.

2. The Task/Channel Model
The model represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels.
A task is a program, its local memory, and a collection of I/O ports. The local memory holds the program's instructions and its private data.

2. The Task/Channel Model
A channel is a message queue that connects one task's output port with another task's input port:
A task can send local data to other tasks via output ports.
A task can receive data values from other tasks via input ports.
Data values appear at the input port in the same order in which they were placed in the output port at the other end of the channel.
Receiving data can block: receiving is synchronous.
Sending data never blocks: sending is asynchronous.
Access to local memory is faster than access to nonlocal data.

3. Foster's Design Methodology
A four-step process: partitioning, communication, agglomeration, mapping.

3. Foster's Design Methodology
Partitioning
Partitioning is the process of dividing the computation and the data into pieces. More, smaller pieces are better. There are two approaches:
Domain decomposition (the data-centric approach): first divide the data into pieces, then determine how to associate computations with the data. Focus on the largest and/or most frequently accessed data structures in the program.
Functional decomposition (the computation-centric approach): first divide the computation into pieces, then determine how to associate data with the computations.

3. Foster's Design Methodology
Domain decomposition
[Figure: 1-D, 2-D and 3-D decompositions of a data set into primitive tasks; the finest-grained (3-D) decomposition is the better choice.]

3. Foster's Design Methodology
Functional decomposition
Yields collections of tasks that achieve parallelism through pipelining. E.g., a system supporting interactive image-guided surgery:
Track position of instruments
Acquire patient images
Register images
Determine image locations
Display image

3. Foster's Design Methodology
The quality of the partition (evaluation checklist):
There are at least an order of magnitude more primitive tasks than processors in the target parallel computer. Otherwise, later design options may be too constrained.
Redundant computations and redundant data structure storage are minimized. Otherwise, the design may not work well when the size of the problem increases.
Primitive tasks are roughly the same size. Otherwise, it may be hard to balance the work among the processors/cores.
The number of tasks is an increasing function of the problem size. Otherwise, it may be impossible to use more processors/cores to solve larger problems.

3. Foster's Design Methodology
Communication
After identifying the primitive tasks, the communication between those primitive tasks must be determined. There are two kinds of communication:
Local: a task needs values from a small number of other tasks in order to perform a computation; a channel is created from each task supplying data to the task consuming the data.
Global: a significant number of the primitive tasks must contribute data in order to perform a computation, e.g., computing the sum of values held by all the primitive processes.

3. Foster's Design Methodology
Communication: evaluating the communication structure of the designed parallel algorithm:
The communication operations are balanced among the tasks.
Each task communicates with only a small number of neighbors.
Tasks can perform their communications in parallel/concurrently.
Tasks can perform their computations in parallel/concurrently.

3. Foster's Design Methodology
Agglomeration
Why do we need agglomeration? If the number of tasks exceeds the number of processors/cores by several orders of magnitude, simply creating those tasks would be a source of significant overhead. So we combine primitive tasks into larger tasks and map them onto physical processors/cores to reduce the parallel overhead.
What is agglomeration? It is the process of grouping tasks into larger tasks in order to improve performance or simplify programming. When developing MPI programs, one task per core/processor is usually best.

3. Foster's Design Methodology
Agglomeration goal 1: lower communication overhead.
Eliminate communication among tasks (increase the locality of the parallelism).
Combine groups of sending and receiving tasks.

3. Foster's Design Methodology
Agglomeration goal 2: maintain the scalability of the parallel design.
Ensure that we have not combined so many tasks that we will be unable, at some point in the future, to port the program to a computer with more processors/cores. E.g., a 3-D matrix operation of size 8 x 128 x 256.

3. Foster's Design Methodology
Agglomeration goal 3: reduce software engineering costs.
Make greater use of the existing sequential code, reducing development time and expense.

3. Foster's Design Methodology
Agglomeration evaluation checklist:
The agglomeration has increased the locality of the parallel algorithm.
Replicated computations take less time than the communications they replace.
The amount of replicated data is small enough to allow the algorithm to scale.
Agglomerated tasks have similar computational and communication costs.
The number of tasks is an increasing function of the problem size.
The number of tasks is as small as possible, yet at least as great as the number of cores/processors in the target computer.
The trade-off between the chosen agglomeration and the cost of modifications to the existing sequential code is reasonable.

3. Foster's Design Methodology
Mapping
[Figure: eight tasks A-H and their channels being mapped onto processors.]
Goals: increase processor utilization and minimize inter-processor communication.

Part III: Message-Passing Programming

Preface
[Figure: a single process loads its data, processes it and stores the result.]
[Figure: the same program prog_a runs on node 1, node 2 and node 3.]
[Figure: processes 0, 1 and 2 each load and process a part of the data; the partial results are gathered and stored.]

Hello World!

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Output with four processes (the lines may appear in any order):
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
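The slides do not show how to build and launch this program. As a hedged example, with MPICH (or a compatible implementation) a typical session looks like the following, assuming the source file is named hello.c and four processes are wanted; the exact commands depend on your MPI installation and job scheduler:

mpicc -o hello hello.c
mpiexec -n 4 ./hello

Some installations use mpirun instead of mpiexec; both start the same executable as several cooperating MPI processes.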
Outline (Part III)
Introduction
The Message-Passing Model
The Message-Passing Interface (MPI)
Communication modes
Circuit satisfiability
Point-to-point communication
Collective communication
Benchmarking parallel performance

Introduction
MPI: Message Passing Interface.
It is a library, not a parallel language: C & MPI, Fortran & MPI.
It is a standard, not a particular implementation. Implementations include MPICH, Intel MPI, MS MPI and LAM/MPI.
It is a message-passing model.

Introduction
The history of MPI: draft in 1992, MPI-1 in 1994, MPI-2 in 1997. See http://www.mpi-forum.org

Introduction
MPICH:
http://www-unix.mcs.anl.gov/mpi/mpich1/download.html
http://www-unix.mcs.anl.gov/mpi/mpich2/index.htm#download
Main features:
Open source.
Kept synchronized with the MPI standard.
Supports MPMD (Multiple Program Multiple Data) and heterogeneous clusters.
Can be combined with C/C++, Fortran 77 and Fortran 90.
Supports Unix and Windows NT platforms.
Supports multi-core, SMP, clusters and large-scale parallel computer systems.

Introduction
Intel MPI:
Conforms to the MPI-2 standard.
Latest version: 3.1.
Uses DAPL (Direct Access Programming Library).

Introduction - Intel MPI
The Intel MPI Library supports multiple hardware fabrics.

Introduction - Intel MPI
Features:
A multi-fabric message-passing library.
Implements the Message Passing Interface, version 2 (MPI-2) specification.
Provides a standard library across Intel platforms that focuses on making applications perform best on IA-based clusters, enables adoption of MPI-2 functions as customer needs dictate, and delivers best-in-class performance for enterprise, divisional, departmental and workgroup high-performance computing.

Introduction - Intel MPI
Why the Intel MPI Library?
High-performance MPI-2 implementation.
Linux and Windows CCS support.
Interconnect independence and smart fabric selection.
Easy installation.
Free runtime environment.
Close integration with Intel and third-party development tools.
Internet-based licensing and technical support.

Introduction - Intel MPI
Standards based: built on Argonne National Laboratory's MPICH-2 implementation.
Integration: it can be easily integrated with:
Platform LSF 6.1 and higher
Altair PBS Pro 7.1 and higher
OpenPBS 2.3
Torque 1.2.0 and higher
Parallelnavi NQS for Linux V2.0L10 and higher
Parallelnavi for Linux Advanced Edition V1.0L10A and higher
NetBatch 6.x and higher

Introduction - Intel MPI
System requirements (host and target systems hardware):
IA-32, Intel 64 or IA-64 architecture using the Intel Pentium 4, Intel Xeon or Intel Itanium processor families and compatible platforms.
1 GB of RAM (4 GB recommended).
Minimum 100 MB of free hard disk space (10 GB recommended).

Introduction - Intel MPI
Operating system requirements:
Microsoft Windows Compute Cluster Server 2003 (Intel 64 architecture only)
Red Hat Enterprise Linux 3.0, 4.0 or 5.0
SUSE Linux Enterprise Server 9 or 10
SUSE Linux 9.0 through 10.0 (Intel 64 architecture starting at 9.1)
HaanSoft Linux 2006 Server
Miracle Linux 4.0
Red Flag DC Server 5.0
Asianux Linux 2.0
Fedora Core 4, 5 or 6 (IA-32 and Intel 64 architectures only)
TurboLinux 10 (IA-32 and Intel 64 architectures)
Mandriva/Mandrake 10.1 (IA-32 architecture only)
SGI ProPack 4.0 (IA-64 architecture only) or 5.0 (IA-64 and Intel 64 architectures)

The Message-Passing Model
[Figure: processors, each with its own local memory, connected by an interconnection network.]

The Message-Passing Model
A task in the task/channel model becomes a process in the message-passing model.
The number of processes is specified by the user, is fixed when the program begins, and is constant throughout the execution of the program.
Each process has a unique ID number.

The Message-Passing Model
Goals of the message-passing model: allow processes to communicate with each other and to synchronize with each other.

The Message-Passing Interface (MPI)
Advantages:
Runs well on a wide variety of MPMD architectures.
Easier to debug.
Thread safe.

What is in MPI
Point-to-point message passing
Collective communication
Support for process groups
Support for communication contexts
Support for application topologies
Environmental inquiry routines
Profiling interface

Introduction to Groups and Communicators
Process model and groups
Communication scope
Communicators

Process model and groups
The fundamental computational unit is the process. Each process has an independent thread of control and a separate address space.
MPI processes execute in MIMD style, but there is no mechanism for loading code onto processors or assigning processes to processors, and no mechanism for creating or destroying processes.
MPI supports dynamic process groups: process groups can be created and destroyed, membership is static, and groups may overlap.
There is no explicit support for multithreading, but MPI is designed to be thread-safe.

Communication scope
In MPI, a process is specified by a group and a rank relative to that group.
A message label is specified by a message context and a message tag relative to that context.
Groups are used to partition the process space; contexts are used to partition the "message label space".
Groups and contexts are bound together to form a communicator object. Contexts are not visible at the application level.
A communicator defines the scope of a communication operation.

Communicators
Communicators are used to create independent "message universes".
Communicators are used to disambiguate message selection when an application calls a library routine that performs message passing. Nondeterminacy may arise if processes enter the library routine asynchronously, or if processes enter the library routine synchronously but there are outstanding communication operations.
A communicator binds together a group and a context, defines the scope of a communication operation, and is represented by an opaque object.
A communicator handle defines which processes a particular call will apply to. All MPI communication calls take a communicator handle as a parameter; it is effectively the context in which the communication takes place. MPI_Init defines a communicator called MPI_COMM_WORLD for each process that calls it.
Every communicator contains a group, which is a list of processes. The processes are ordered and numbered consecutively from 0; the number of each process is known as its rank. The rank identifies each process within the communicator. The group of MPI_COMM_WORLD is the set of all MPI processes.
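The slides use only MPI_COMM_WORLD. As an illustration of the "independent message universe" idea, here is a minimal sketch (not from the slides) using the standard MPI_Comm_split call to split the processes into two sub-communicators by even/odd world rank; the variable names are my own:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, sub_rank;
    MPI_Comm sub_comm;   /* hypothetical name for the new communicator */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes with the same "color" (even or odd world rank) end up in
       the same new communicator; ranks are renumbered from 0 inside it. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);

    printf("World rank %d has rank %d in its sub-communicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}

A message sent on sub_comm can never be matched by a receive posted on MPI_COMM_WORLD, which is exactly the scoping property described above.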
Skeleton MPI Program

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* main part of the program */
    MPI_Finalize();
    return 0;
}

Circuit satisfiability
[Figure: a combinational circuit with 16 inputs a-p and a single output.]
For which combinations of input values will the circuit output the value 1?

Circuit satisfiability
Analysis: there are 16 inputs, a through p, each taking one of the 2 values 0 or 1, so there are 2^16 = 65536 combinations to check.
Designing a parallel algorithm:
Partitioning: functional decomposition, one task per combination. There are no channels between tasks, so the tasks are independent and well suited to parallelism.
Communication: the tasks are independent, so no communication is needed.
Agglomeration and mapping: the number of tasks is fixed, but the time each task takes to complete is variable. Why? (The && and || operators short-circuit, so combinations that fail an early clause are rejected quickly.) How do we balance the computational load? Map tasks to processes in cyclic (interleaved) fashion: task i goes to process i mod p.
[Figure: 20 tasks assigned to 6 processors/cores in cyclic fashion.]

Circuit satisfiability
Each process examines its share of the combinations in turn.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    int id;   /* process rank */
    int p;    /* number of processes */
    void check_circuit(int, int);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    for (i = id; i < 65536; i += p)   /* cyclic allocation of combinations */
        check_circuit(id, i);

    printf("Process %d is done\n", id);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}

Circuit satisfiability

#define EXTRACT_BIT(n,i) ((n & (1 << i)) ? 1 : 0)

void check_circuit(int id, int z)
{
    int v[16];
    int i;

    for (i = 0; i < 16; i++)
        v[i] = EXTRACT_BIT(z, i);

    if ((v[0] || v[1]) && (!v[1] || !v[3]) && (v[2] || v[3])
        && (!v[3] || !v[4]) && (v[4] || !v[5]) && (v[5] || !v[6])
        && (v[5] || v[6]) && (v[6] || !v[15]) && (v[7] || !v[8])
        && (!v[7] || !v[13]) && (v[8] || v[9]) && (v[9] || v[11])
        && (v[10] || v[11]) && (v[12] || v[13]) && (v[13] || !v[14])
        && (v[14] || v[15])) {
        printf("%d) %d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d\n", id,
               v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7],
               v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
        fflush(stdout);
    }
}
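The benchmarking snippet near the end of these slides refers to a solutions count. A common extension (a sketch, not from the slides) is to modify check_circuit so that it returns 1 for a satisfying combination and 0 otherwise; each process then accumulates a local count, which can later be combined with MPI_Reduce once that routine has been introduced:

/* Sketch: count satisfying combinations locally.  Assumes check_circuit
   has been changed to return 1 or 0 instead of void. */
int count = 0;
for (i = id; i < 65536; i += p)
    count += check_circuit(id, i);
printf("Process %d found %d solutions\n", id, count);
/* Later: MPI_Reduce(&count, &global_count, 1, MPI_INT, MPI_SUM,
                     0, MPI_COMM_WORLD); */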
Point-to-Point Communication
Overview
Blocking behavior
Non-blocking behavior

Overview
A message is sent from a sender to a receiver. There are several variations on how the sending of a message can interact with the program.

Synchronous: the send does not complete until the message has been received, like a fax or registered mail.

Asynchronous: the send completes as soon as the message is on its way, like a postcard or e-mail.

Communication modes
The communication mode is selected with the send routine:
Synchronous mode (the "safest").
Ready mode (lowest system overhead).
Buffered mode (decouples the sender from the receiver).
Standard mode (a compromise).
Calls are also blocking or non-blocking:
A blocking call stops the program until the message buffer is safe to use.
A non-blocking call separates communication from computation.

Blocking Behavior

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

buf is the beginning of the buffer containing the data to be sent. For Fortran, this is often the name of an array in your program; for C, it is an address.
count is the number of elements to be sent (not bytes).
datatype is the type of the data.
dest is the rank of the process which is the destination for the message.
tag is an arbitrary number which can be used to distinguish among messages.
comm is the communicator.

A message and its envelope
Message: buf, count, datatype.
Message envelope: dest, tag, comm.
Why tags? Consider:
Process P: send A,32,Q ; send B,16,Q
Process Q: recv X,32,P ; recv Y,16,P
With tags, the receiver can distinguish the two messages explicitly instead of relying on the order in which they were sent:
Process P: send A,32,Q,tag1 ; send B,16,Q,tag2
Process Q: recv X,32,P,tag1 ; recv Y,16,P,tag2

Standard-mode send
When using a standard-mode send, it is up to MPI to decide whether outgoing messages will be buffered.
It completes once the message has been sent, which may or may not imply that the message has arrived at its destination.
It can be started whether or not a matching receive has been posted, and it may complete before a matching receive is posted.
It has non-local completion semantics, since successful completion of the send operation may depend on the occurrence of a matching receive.

[Figure: timeline of a blocking standard send.]

MPI_Recv

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)

buf is the beginning of the buffer where the incoming data are to be stored. For Fortran, this is often the name of an array in your program; for C, it is an address.
count is the number of elements (not bytes) in your receive buffer.
datatype is the type of the data.
source is the rank of the process from which data will be accepted (this can be the wildcard MPI_ANY_SOURCE).
tag is an arbitrary number which can be used to distinguish among messages (this can be the wildcard MPI_ANY_TAG).
comm is the communicator.
status is a structure of information that is returned. For example, if you specify a wildcard for source or tag, status will tell you the actual rank or tag of the message received.

[Figure: timeline of a blocking synchronous send.]

Blocking synchronous send
Can be started whether or not a matching receive has been posted.
Will complete successfully only if a matching receive is posted and the receive operation has started to receive the message sent by the synchronous send.
Provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication.
Has non-local completion semantics.

[Figure: timeline of a blocking ready send.]

Blocking ready send
Completes immediately.
May be started only if the matching receive has already been posted.
Otherwise has the same semantics as a standard-mode send.
Saves overhead by avoiding handshaking and buffering.

[Figure: timeline of a blocking buffered send.]

Blocking buffered send
Can be started whether or not a matching receive has been posted, and it may complete before a matching receive is posted.
Has local completion semantics: its completion does not depend on the occurrence of a matching receive.
In order to complete the operation, it may be necessary to buffer the outgoing message locally; for that purpose, buffer space is provided by the application.
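The slides state that the application provides the buffer space for buffered sends but do not show how. A minimal sketch using the standard MPI_Buffer_attach / MPI_Bsend / MPI_Buffer_detach calls might look like this; the buffer size is sized for a single 100-int message and the function name is my own:

#include <mpi.h>
#include <stdlib.h>

void buffered_send_example(int dest, MPI_Comm comm)
{
    int data[100] = {0};
    int size;
    void *buf, *oldbuf;

    /* Room for one message of 100 ints plus MPI's bookkeeping overhead. */
    MPI_Pack_size(100, MPI_INT, comm, &size);
    size += MPI_BSEND_OVERHEAD;
    buf = malloc(size);

    MPI_Buffer_attach(buf, size);
    MPI_Bsend(data, 100, MPI_INT, dest, 0, comm);   /* completes locally */
    MPI_Buffer_detach(&oldbuf, &size);   /* blocks until buffered data is delivered to MPI */
    free(oldbuf);
}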
Non-Blocking Behavior
MPI_Isend(buf, count, dtype, dest, tag, comm, request)
MPI_Wait(request, status)
request matches the request returned by MPI_Isend or MPI_Irecv.
status returns status information, equivalent to the status of MPI_Recv, when the operation is complete.
For a send, MPI_Wait blocks until the message is buffered or sent, so the message variable is free to reuse.
For a receive, MPI_Wait blocks until the message has been received and is ready to use.

Non-blocking synchronous send

int MPI_Issend(void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm, MPI_Request *request)

IN = provided by the programmer, OUT = set by the routine:
buf: starting address of the message buffer (IN)
count: number of elements in the message (IN)
datatype: type of the elements in the message (IN)
dest: rank of the destination task in communicator comm (IN)
tag: message tag (IN)
comm: communicator (IN)
request: identifies a communication event (OUT)

Non-blocking ready send

int MPI_Irsend(void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm, MPI_Request *request)

Non-blocking buffered send

int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm, MPI_Request *request)

Non-blocking standard send

int MPI_Isend(void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)

Non-blocking receive

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Request *request)

IN = provided by the programmer, OUT = set by the routine:
buf: starting address of the message buffer (OUT - the buffer contents are written)
count: number of elements in the message (IN)
datatype: type of the elements in the message (IN)
source: rank of the source task in communicator comm (IN)
tag: message tag (IN)
comm: communicator (IN)
request: identifies a communication event (OUT)

Completing non-blocking operations
request: identifies a communication event (INOUT)
status: status of a communication event (OUT)
count: number of communication events (IN)
index: index in the array of requests of the completed event (OUT)
incount: number of communication events (IN)
outcount: number of completed events (OUT)

int MPI_Wait(MPI_Request *request, MPI_Status *status)
int MPI_Waitall(int count, MPI_Request *array_of_requests,
                MPI_Status *array_of_statuses)
int MPI_Waitany(int count, MPI_Request *array_of_requests,
                int *index, MPI_Status *status)
int MPI_Waitsome(int incount, MPI_Request *array_of_requests,
                 int *outcount, int *array_of_indices,
                 MPI_Status *array_of_statuses)
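As an illustration of how these routines are typically combined (a sketch, not from the slides): two processes exchange an array by posting the receive first, starting a non-blocking standard send, overlapping some computation, and then waiting on both requests. The function and buffer names are my own.

#include <mpi.h>

void exchange(int partner, double *sendbuf, double *recvbuf, int n)
{
    MPI_Request req[2];

    /* Posting the receive before the send means the incoming message has
       somewhere to land immediately and neither side can deadlock. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);

    /* ... computation that touches neither sendbuf nor recvbuf ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    /* Both buffers are now safe to reuse. */
}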
Communication modes and their routines:

Mode         Blocking routine   Non-blocking routine
Synchronous  MPI_SSEND          MPI_ISSEND
Ready        MPI_RSEND          MPI_IRSEND
Buffered     MPI_BSEND          MPI_IBSEND
Standard     MPI_SEND           MPI_ISEND
Receive      MPI_RECV           MPI_IRECV

Advantages and disadvantages of the modes:
Synchronous: safest, and therefore most portable; SEND/RECV order is not critical; the amount of buffer space is irrelevant. Disadvantage: can incur substantial synchronization overhead.
Ready: lowest total overhead; no SEND/RECV handshake is required. Disadvantage: the RECV must precede the SEND.
Buffered: decouples SEND from RECV; no synchronization overhead on the SEND; the order of SEND/RECV is irrelevant; the programmer can control the size of the buffer space. Disadvantage: additional system overhead is incurred by the copy into the buffer.
Standard: good for many cases. Disadvantage: its implementation-dependent behavior may not suit your program.

MPI Quick Start
MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Finalize
MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce
MPI_Wtime, MPI_Wtick, MPI_Barrier
All MPI identifiers follow the naming pattern MPI_Xxxxx.

MPI Routines: MPI_Init
MPI_Init(&argc, &argv);
Initializes the MPI execution environment.
argc: pointer to the number of arguments; argv: pointer to the argument vector.
It should be the first MPI function call, allowing the system to do any setup needed to handle further calls to the MPI library.
It defines a communicator called MPI_COMM_WORLD for each process that calls it.
MPI_Init must be called before any other MPI function, with one exception: MPI_Initialized, which checks whether MPI has been initialized, may be called before MPI_Init.

MPI Routines: MPI_Comm_rank
int MPI_Comm_rank(MPI_Comm comm, int *rank)
Determines a process's ID number, returned in rank.
The communicator MPI_COMM_WORLD includes all processes created when MPI is initialized.
MPI_Comm_rank(MPI_COMM_WORLD, &id);

MPI Routines: MPI_Comm_size
int MPI_Comm_size(MPI_Comm comm, int *size)
Finds the number of processes, returned in size.
MPI_Comm_size(MPI_COMM_WORLD, &p);

MPI Routines: MPI_Send

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

The source process sends the data in its buffer to the destination process.
buf: the starting address of the data to be transmitted.
count: the number of data items.
datatype: the type of the data items (all of the data items must be of the same type).
dest: the rank of the process to receive the data.
tag: an integer "label" for the message, allowing messages serving different purposes to be identified.
comm: the communicator in which this message is being sent.
MPI_Send blocks until the message buffer is once again available.
MPI provides constants for the C data types (MPI_INT, MPI_DOUBLE, MPI_CHAR, ...).

MPI Routines: MPI_Recv

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status)

buf: the starting address where the received data is to be stored.
count: the maximum number of data items the receiving process is willing to receive.
datatype: the type of the data items.
source: the rank of the process sending this message.
tag: the desired tag value for the message.
comm: the communicator in which this message is being passed.
status: an MPI data structure in which the status of the receive is returned.

MPI Routines: MPI_Recv
Receives a message from the source process.
The datatype and tag of the received message must agree with the datatype and tag given in the MPI_Recv call.
The number of data items received must be no greater than the count given in the call; otherwise an overflow error condition occurs. If count equals zero, the message is empty.
Blocks until the message has been received, or until an error condition causes the function to return.

MPI Routines: MPI_Recv status
status->MPI_SOURCE: the rank of the process sending the message.
status->MPI_TAG: the message's tag value.
status->MPI_ERROR: the error condition.
int MPI_Abort(MPI_Comm comm, int errorcode) terminates all processes in the communicator.

MPI Routines: MPI_Finalize
Allows the system to free up resources, such as memory, that have been allocated to MPI.
Without MPI_Finalize, the result of the program is undefined.
MPI_Finalize();

Summary: the six foundational MPI functions
MPI_Init
MPI_Comm_rank
MPI_Comm_size
MPI_Send
MPI_Recv
MPI_Finalize
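A minimal sketch (not from the slides) that exercises exactly these six routines, plus the status fields described above: process 0 sends an integer to process 1, which receives it with wildcards and inspects the message envelope.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d from rank %d with tag %d\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    MPI_Finalize();
    return 0;
}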
Collective communication
A collective communication operation is one in which a group of processes works together to distribute or gather a set of one or more values.
[Figure: processes 0, 1 and 2 call the synchronization routine at different times, wait at the synchronization point, and then continue executing in parallel.]

Collective communication: MPI_Bcast
A root process broadcasts one or more data items of the same type to all other processes in a communicator.
[Figure: before the broadcast only the root holds the value A; after the broadcast every process holds A.]

int MPI_Bcast(
    void *buffer,           /* address of the first broadcast element    */
    int count,              /* number of elements to broadcast           */
    MPI_Datatype datatype,  /* type of the elements to broadcast         */
    int root,               /* ID of the process doing the broadcast     */
    MPI_Comm comm)          /* communicator                              */

Collective communication: MPI_Scatter
The root process sends a different part of its data to each process in the communicator, in rank order.
[Figure: the parts A, B, C, D, ... of the root's send buffer are scattered so that each process's receive buffer gets one part.]

int MPI_Scatter(
    void *sendbuf,          /* starting address of the send buffer (significant at the root) */
    int sendcount,          /* number of elements sent to each process   */
    MPI_Datatype sendtype,  /* type of the elements to be sent           */
    void *recvbuf,          /* starting address of the receive buffer    */
    int recvcount,          /* number of elements in the receive buffer  */
    MPI_Datatype recvtype,  /* type of the elements to be received       */
    int root,               /* ID of the root process doing the scatter  */
    MPI_Comm comm)          /* communicator                              */

Collective communication: MPI_Gather
Each process sends the data in its buffer to the root process.
[Figure: the parts A, B, C, D held by the processes are gathered into the root's receive buffer.]

int MPI_Gather(
    void *sendbuf,          /* starting address of the send buffer       */
    int sendcount,          /* number of elements sent by each process   */
    MPI_Datatype sendtype,  /* type of the elements to be sent           */
    void *recvbuf,          /* starting address of the receive buffer (significant at the root) */
    int recvcount,          /* number of elements received from each process */
    MPI_Datatype recvtype,  /* type of the elements to be received       */
    int root,               /* ID of the root process doing the gather   */
    MPI_Comm comm)          /* communicator                              */

Collective communication: MPI_Reduce
After a process has completed its share of the work, it is ready to participate in the reduction operation. MPI_Reduce performs one or more reduction operations on values submitted by all the processes in a communicator.

int MPI_Reduce(
    void *operand,          /* address of the first reduction element    */
    void *result,           /* address of the first reduction result     */
    int count,              /* number of reductions to perform           */
    MPI_Datatype type,      /* type of the elements                      */
    MPI_Op operator,        /* reduction operator                        */
    int root,               /* process getting the result(s)             */
    MPI_Comm comm)          /* communicator                              */

MPI's built-in reduction operators:
MPI_BAND    bitwise and
MPI_BOR     bitwise or
MPI_BXOR    bitwise exclusive or
MPI_LAND    logical and
MPI_LOR     logical or
MPI_LXOR    logical exclusive or
MPI_MAX     maximum
MPI_MAXLOC  maximum and location of maximum
MPI_MIN     minimum
MPI_MINLOC  minimum and location of minimum
MPI_PROD    product
MPI_SUM     sum

Summary
[Figures: summary diagrams of the collective communication operations.]

Benchmarking parallel performance
How do we measure the performance of a parallel application? By measuring the number of seconds that elapse from the time we initiate execution until the program terminates.
double MPI_Wtime(void): returns the number of seconds that have elapsed since some point of time in the past.
double MPI_Wtick(void): returns the precision of the result returned by MPI_Wtime.

Benchmarking parallel performance: MPI_Barrier
int MPI_Barrier(MPI_Comm comm)
comm indicates the communicator whose processes participate in the barrier synchronization. MPI_Barrier does not return until every process in the communicator has called it, so it can be used to start the timed section at (roughly) the same moment on every process.

double elapsed_time;
MPI_Init(&argc, &argv);
MPI_Barrier(MPI_COMM_WORLD);
elapsed_time = -MPI_Wtime();
/* ... computation ... */
MPI_Reduce(&solutions, &global_solutions, 1, MPI_INT, MPI_SUM, 0,
           MPI_COMM_WORLD);
elapsed_time += MPI_Wtime();
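Putting the collective routines together, here is a sketch (not from the slides): the root scatters an array of N*size elements, every process sums its chunk, and MPI_Reduce combines the partial sums at the root. N and the buffer names are my own choices.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4   /* elements per process, chosen arbitrarily for the sketch */

int main(int argc, char *argv[])
{
    int rank, size, i;
    int chunk[N];
    int *full = NULL;
    long local_sum = 0, global_sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {   /* only the root needs the whole array */
        full = malloc(N * size * sizeof(int));
        for (i = 0; i < N * size; i++)
            full[i] = i;
    }

    /* Each process receives N consecutive elements of the root's array. */
    MPI_Scatter(full, N, MPI_INT, chunk, N, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < N; i++)
        local_sum += chunk[i];

    /* Combine the partial sums on the root. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        printf("sum = %ld\n", global_sum);
        free(full);
    }
    MPI_Finalize();
    return 0;
}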
For example: a send and receive operation

#include "mpi.h"
#include <string.h>

int main(int argc, char *argv[])
{
    int myrank, numprocs;
    char message[20];
    int length = 20;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    if (myrank == 0) {
        strcpy(message, "Hello, process 1");
        MPI_Send(message, length, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(message, length, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize();
    return 0;
}

For example: computing pi
Since the derivative of arctan(x) is 1/(1 + x^2):
    integral from 0 to 1 of 1/(1 + x^2) dx = arctan(x) evaluated from 0 to 1
        = arctan(1) - arctan(0) = arctan(1) = pi/4
So, letting f(x) = 4/(1 + x^2), we have pi = integral from 0 to 1 of f(x) dx.
Approximating the integral with the midpoint rule on N subintervals:
    pi is approximately (1/N) * sum over i = 1..N of f((i - 0.5)/N)

For example: computing pi (the parallel kernel)

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
for (int i = myrank + 1; i <= n; i += numprocs) {
    x = h * ((double) i - 0.5);
    sum += 4.0 / (1.0 + x * x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

For example: matrix multiplication
The parallel version scatters one row of A to each process, broadcasts B, lets each process compute one row of C, and gathers the rows of C back to the root:

MPI_Scatter(&iaA[0][0], N, MPI_INT, &iaA[iRank][0], N, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&iaB[0][0], N*N, MPI_INT, 0, MPI_COMM_WORLD);
for (i = 0; i < N; i++) {
    temp = 0;
    for (j = 0; j < N; j++) {
        temp = temp + iaA[iRank][j] * iaB[j][i];
    }
    iaC[iRank][i] = temp;
}
MPI_Gather(&iaC[iRank][0], N, MPI_INT, &iaC[0][0], N, MPI_INT, 0, MPI_COMM_WORLD);

The product is defined by
    C[i][j] = sum over k = 0..l-1 of a[i][k] * b[k][j]
where A is an n x l matrix and B is an l x m matrix.

The corresponding sequential code (for square n x n matrices):

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

Summary
MPI is a library.
The six foundational functions of MPI.
Collective communication.
The MPI communication model.

Thanks!
Feel free to contact me via yjf@whu.edu.cn with any questions or suggestions. And welcome to Wuhan University!