Message-Passing Programming Neil MacDonald, Elspeth Minty, Joel Malard, Tim Harding, Simon Brown, Mario Antonioletti, David Henty, Gavin J. Pringle Contents Message Passing Concepts Basic MPI Programs Point-to-Point Communication Profiling tool: VAMPIR Modes, Tags and Communicators Non-blocking Communication Collective Communications Derived Datatypes Virtual Topologies Decomposition Techniques MPI design Motivation The MPI library is the most important piece of software in parallel programming. All the world's largest supercomputers are programmed using MPI. Writing parallel programs using MPI is fun! Message-Passing Concepts Sequential Programming Paradigm A single process, with its own memory, runs on a single processor. Message-Passing Programming Paradigm Several processes (ranks 0, 1, 2, 3, ...), each with its own private memory, exchange data across a communication network through the message-passing interface. All variables are private. Processes communicate via special subroutine calls. Typically written in a conventional sequential language. A single program is run on each processor. A Message-Passing Program Every process runs the same code but keeps its own copy of each variable: given the statement if ( me == 0 ) then X = X + 111, process A (me = 0) sees its X change from 19 to 130, while process B (me = 1) still has X = 5. What is SPMD? Single Program, Multiple Data Same program runs everywhere. Restriction on the general message-passing model. Some vendors only support SPMD parallel programs. General message-passing model can be emulated. Emulating General Message Passing Using SPMD: C int main (int argc, char **argv) { if (controller_process) { Controller( /* Arguments */ ); } else { Worker ( /* Arguments */ ); } } Emulating General Message-Passing Using SPMD: Fortran PROGRAM SPMD IF (controller_process) THEN CALL CONTROLLER ( ! Arguments ! ) ELSE CALL WORKER ( ! Arguments ! ) ENDIF END PROGRAM SPMD Messages Messages are packets of data moving between processes. The message passing system has to be told the following information: – Sending process(es) – Source location – Data type – Data length – Receiving process(es) – Destination location – Size of receive buffer(s) Access A process needs to register to the message passing interface. A message passing system is similar to: – Mail box – Phone line – Fax machine – etc. Addressing Messages need to have addresses to be sent to. Addresses are similar to: – Postal address – Phone number – Fax number – Email address – etc. Reception It is important that the receiving process is capable of dealing with messages it is sent. – Letter box – Telephone receiver – Printer – Email inbox – etc. Completion The mode of a communication determines when its constituent operations complete. The form of an operation determines when the procedure implementing that operation will return. Point-to-Point Communication Simplest form of message passing. One process sends a message to another. Different types of point-to-point communication. Synchronous Sends Provide information about the completion of the message. Asynchronous Sends Only know when the message has left. Blocking Operations Relate to when the operation has completed. Only return from the subroutine call when the operation has completed. Non-Blocking Operations Return straight away and allow the sub-program to continue to perform other work. At some later time the sub-program can test or wait for the completion of the non-blocking operation. Non-Blocking Operations All non-blocking operations should have matching wait operations. Some systems cannot free resources until wait has been called. A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation.
Non-blocking operations are not the same as sequential subroutine calls as the operation continues after the call has returned. Collective communications Collective communication routines are higher level routines involving several processes at a time. Can be built out of point-to-point communications. Barriers Synchronise processes. Broadcast A one-to-all communication. Reduction Operations Combine data from several processes to produce a single result. Summary Message-Passing is a programming model – that is implemented by MPI Essential to understand the basic concepts – private variables – explicit communications – SPMD Major difficulty in learning MPI is understanding the Message-Passing model – a very different model to sequential programming Message Passing Programming What is MPI? MPI Forum First message-passing interface standard. Sixty people from forty different organisations. Users and vendors represented, from the US and Europe. Two-year process of proposals, meetings and review. Message Passing Interface document produced. Goals and Scope of MPI MPI's prime goals are: – To provide source-code portability. – To allow efficient implementation. It also offers: – A great deal of functionality. – Support for heterogeneous parallel architectures. Header files C: #include <mpi.h> Fortran: include 'mpif.h' Fortran 90: use mpi MPI Function Format C: error = MPI_Xxxxx(parameter, ...); MPI_Xxxxx(parameter, ...); Fortran: CALL MPI_XXXXX(parameter, ..., IERROR) Handles MPI controls its own internal data structures. MPI releases "handles" to allow programmers to refer to these. C handles are of defined typedefs. Fortran handles are INTEGERs. Initialising MPI C: int MPI_Init(int *argc, char ***argv) Fortran: MPI_INIT(IERROR) INTEGER IERROR Must be the first MPI procedure called. – but multiple processes are already running before MPI_Init MPI_Init int main(int argc, char *argv[]) { ... MPI_Init(&argc, &argv); ... int main() { MPI_Init(NULL, NULL); ... program my_mpi_program integer :: ierror ... CALL MPI_INIT(IERROR) Communicators MPI_COMM_WORLD All processes belong to the pre-defined communicator MPI_COMM_WORLD, within which they are numbered 0, 1, 2, and so on. Rank How do you identify different processes in a communicator? MPI_Comm_rank(MPI_Comm comm, int *rank) MPI_COMM_RANK(COMM, RANK, IERROR) INTEGER COMM, RANK, IERROR The rank is not the physical processor number. – numbering is 0, 1, 2, .... MPI_Comm_rank int rank; ... MPI_Comm_rank(MPI_COMM_WORLD, &rank); printf("Hello from rank %d\n", rank); ... integer :: ierror integer :: rank ... CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror) write(*,*) 'Hello from rank ', rank ... Size How many processes are contained within a communicator? MPI_Comm_size(MPI_Comm comm, int *size) MPI_COMM_SIZE(COMM, SIZE, IERROR) INTEGER COMM, SIZE, IERROR Exiting MPI C: int MPI_Finalize() Fortran: MPI_FINALIZE(IERROR) INTEGER IERROR Must be the last MPI procedure called. Aborting MPI Aborting the execution from any processor (e.g. error condition) C: int MPI_Abort(MPI_Comm comm, int errorcode) Fortran: MPI_ABORT(COMM, ERRORCODE, IERROR) INTEGER COMM, ERRORCODE, IERROR Behaviour – will abort all processes even if only called by one process – this is the ONLY MPI routine that can have this effect!
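Putting the calls above together, a minimal but complete MPI program in C might look like the following sketch (illustrative only; the variable names and the printed message are arbitrary):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        /* must be the first MPI call */
        MPI_Init(&argc, &argv);

        /* my rank within, and the size of, the default communicator */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello from rank %d of %d\n", rank, size);

        /* must be the last MPI call */
        MPI_Finalize();
        return 0;
    }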
Summary Have covered basic calls – but no explicit message-passing yet Can still write useful programs – eg a task farm of independent jobs Need to compile and launch parallel jobs – procedure is not specified by MPI – next lecture gives machine-specific details Messages Messages A message contains a number of elements of some particular datatype. MPI datatypes: – Basic types. – Derived types. Derived types can be built up from basic types. C types are different from Fortran types. MPI Basic Datatypes - C MPI Datatype C datatype MPI_CHAR signed char MPI_SHORT signed short int MPI_INT signed int MPI_LONG signed long int MPI_UNSIGNED_CHAR unsigned char MPI_UNSIGNED_SHORT unsigned short int MPI_UNSIGNED unsigned int MPI_UNSIGNED_LONG unsigned long int MPI_FLOAT float MPI_DOUBLE double MPI_LONG_DOUBLE long double MPI_BYTE MPI_PACKED MPI Basic Datatypes - Fortran MPI Datatype Fortran Datatype MPI_INTEGER INTEGER MPI_REAL REAL MPI_DOUBLE_PRECISION DOUBLE PRECISION MPI_COMPLEX COMPLEX MPI_LOGICAL LOGICAL MPI_CHARACTER CHARACTER(1) MPI_BYTE MPI_PACKED Point-to-Point Communication Point-to-Point Communication 1 2 5 Destination 3 4 Source 0 Communicator Communication between two processes. Source process sends message to destination process. Communication takes place within a communicator. Destination process is identified by its rank in the communicator. Point-to-point messaging in MPI Sender calls a SEND routine – specifying the data that is to be sent – this is called the send buffer Receiver calls a RECEIVE routine – specifying where the incoming data should be stored – this is called the receive buffer Data goes into the receive buffer Metadata describing message also transferred – this is received into separate storage – this is called the status Communication modes Sender mode Notes Synchronous send Only completes when the receive has completed. Buffered send Always completes (unless an error occurs), irrespective of receiver. Standard send Either synchronous or buffered. Ready send Always completes (unless an error occurs), irrespective of whether the receive has completed. Receive Completes when a message has arrived. MPI Sender Modes OPERATION MPI CALL Standard send MPI_Send Synchronous send MPI_Ssend Buffered send MPI_Bsend Ready send MPI_Rsend Receive MPI_Recv Sending a message C: int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) Fortran: MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR) <type> BUF(*) INTEGER COUNT, DATATYPE, DEST, TAG INTEGER COMM, IERROR Send data from proc 1 to proc 3 // Array of ten integers int x[10]; ... if (rank == 1) MPI_Ssend(x, 10, MPI_INT, dest=3, tag=0, MPI_COMM_WORLD); // Integer scalar int x; ... if (rank == 1) MPI_Ssend(&x, 1, MPI_INT, dest=3, tag=0, MPI_COMM_WORLD); Send data from proc 1 to proc 3 ! Array of ten integers integer, dimension(10) :: x ... if (rank .eq. 1) CALL MPI_SSEND(x, 10, MPI_INTEGER, dest=3, tag=0, MPI_COMM_WORLD, ierr) ! Integer scalar integer :: x ... if (rank .eq. 1) CALL MPI_SSEND(x, 1, MPI_INTEGER, dest=3, tag=0, MPI_COMM_WORLD, ierr) Receiving a message C: int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) Fortran: MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR) <type> BUF(*) INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR Receive data from proc 1 on proc 3 int y[10]; MPI_Status status; ... 
if (rank == 3) MPI_Recv(y, 10, MPI_INT, src=1, tag=0, MPI_COMM_WORLD, &status); int y; ... if (rank == 3) MPI_Recv(&y, 1, MPI_INT, src=1, tag=0, MPI_COMM_WORLD, &status); Receive data from proc 1 on proc 3 integer, dimension(10) :: y integer, dimension(MPI_STATUS_SIZE) :: status ... if (rank .eq. 3) CALL MPI_RECV(y, 10, MPI_INTEGER, src=1, tag=0, MPI_COMM_WORLD, status, ierr) integer :: y ... if (rank .eq. 3) CALL MPI_RECV(y, 1, MPI_INTEGER, src=1, tag=0, MPI_COMM_WORLD, status, ierr) Synchronous Blocking Message-Passing Processes synchronise. Sender process specifies the synchronous mode. Blocking: both processes wait until the transaction has completed. For a communication to succeed: Sender must specify a valid destination rank. Receiver must specify a valid source rank. The communicator must be the same. Tags must match. Message datatypes must match. Receiver's buffer must be large enough. Wildcarding Receiver can wildcard. To receive from any source use MPI_ANY_SOURCE. To receive with any tag use MPI_ANY_TAG. Actual source and tag are returned in the receiver's status parameter. Communication Envelope A message is like a letter: the envelope carries the sender's address and, for the attention of the receiver, the destination address, while the data items travel inside. Communication Envelope Information Envelope information is returned from MPI_RECV as status. Information includes: – Source: status.MPI_SOURCE or status(MPI_SOURCE) – Tag: status.MPI_TAG or status(MPI_TAG) – Count: MPI_Get_count or MPI_GET_COUNT Received Message Count C: int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count) Fortran: MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR) INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR Message Order Preservation 1 2 5 3 4 0 Communicator Messages do not overtake each other. This is true even for non-synchronous sends. Message Passing Programming Modes, Tags and Communicators Overview Lecture will cover – explanation of MPI modes (Ssend, Bsend and Send) – meaning and use of message tags – rationale for MPI communicators These are all commonly misunderstood – essential for all programmers to understand modes – often useful to use tags – certain cases benefit from exploiting different communicators Modes MPI_Ssend (Synchronous Send) – guaranteed to be synchronous – routine will not return until message has been delivered MPI_Bsend (Buffered Send) – guaranteed to be asynchronous – routine returns before the message is delivered – system copies data into a buffer and sends it later on MPI_Send (standard Send) – may be implemented as synchronous or asynchronous send – this causes a lot of confusion (see later) MPI_Ssend Timeline: process A, running other non-MPI code, calls Ssend(x,B) and waits in Ssend until process B calls Recv(y,A); once the data transfer is under way, Ssend returns and x can be overwritten by A; when Recv returns, y can be read by B. MPI_Bsend Timeline: process A calls Bsend(x,B), which returns immediately so that x can be overwritten by A straight away, even if process B has not yet called Recv(y,A); when B's Recv returns, y can be read by B. Notes Recv is always synchronous – if process B issued Recv before the Bsend from process A, then B would wait in the Recv until Bsend was issued Where does the buffer space come from? – for Bsend, the user provides a single large block of memory – make this available to MPI using MPI_Buffer_attach If A issues another Bsend before the Recv – system tries to store message in free space in the buffer – if there is not enough space then Bsend will FAIL!
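Before looking at the problems each mode can cause, the pieces above can be tied together in a complete C sketch of a synchronous point-to-point transfer (illustrative only; it assumes the program is started on at least four processes so that ranks 1 and 3 both exist):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, i, count;
        int x[10], y[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            for (i = 0; i < 10; i++) x[i] = i;   /* fill the send buffer */

            /* synchronous send of 10 integers to rank 3 with tag 0 */
            MPI_Ssend(x, 10, MPI_INT, 3, 0, MPI_COMM_WORLD);
        }
        else if (rank == 3) {
            /* wildcard receive into a buffer that is large enough */
            MPI_Recv(y, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);

            /* the status (envelope) says who sent it, with what tag and how much data */
            MPI_Get_count(&status, MPI_INT, &count);
            printf("Received %d integers from rank %d with tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }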
Send Problems – Ssend runs the risk of deadlock – Bsend less likely to deadlock, and your code may run faster, but • the user must supply the buffer space • the routine will FAIL if this buffering is exhausted MPI_Send tries to solve these problems – buffer space is provided by the system – Send will normally be asynchronous (like Bsend) – if buffer is full, Send becomes synchronous (like Ssend) MPI_Send routine is unlikely to fail – but could cause your program to deadlock if buffering runs out MPI_Send Example: process A calls Send(x,B) then Recv(x,B), while process B calls Send(y,A) then Recv(y,A) – i.e. both processes send first and receive afterwards. This code is NOT guaranteed to work – will deadlock if Send is synchronous – is guaranteed to deadlock if you used Ssend! Solutions To avoid deadlock – either match sends and receives explicitly – eg for ping-pong • process A sends then receives • process B receives then sends For a more general solution use non-blocking communications (see later) For this course you should program with Ssend – more likely to pick up bugs such as deadlock than Send Tags Every message can have a tag – this is a non-negative integer value – not everyone uses them – many MPI programs set all tags to zero Tags can be useful in some situations – can choose to receive messages only of a given tag Most commonly used with MPI_ANY_TAG – receives the next matching message regardless of its tag – user then finds out the actual value by looking at the status Communicators All MPI communications take place within a communicator – a communicator is fundamentally a group of processes – there is only one pre-defined communicator: MPI_COMM_WORLD which contains ALL the processes A message can ONLY be received within the same communicator from which it was sent – unlike tags, it is not possible to wildcard on comm Uses of Communicators (i) Can split MPI_COMM_WORLD into pieces – each process has a new rank within each sub-communicator – guarantees messages from the different pieces do not interact • can attempt to do this using tags but there are no guarantees • eg an MPI_COMM_WORLD of size 7 might be split into comm1 of size 4 and comm2 of size 3, with every process given a new rank, starting again from 0, within its sub-communicator Uses of Communicators (ii) Can make a copy of MPI_COMM_WORLD – containing all the same processes but in a new communicator Enables processes to communicate with each other safely within a piece of code – guaranteed that messages cannot be received by other code – this is essential for people writing parallel libraries (eg a Fast Fourier Transform) to stop library messages becoming mixed up with user messages • user cannot intercept the library messages if the library keeps the identity of the new communicator a secret • not safe to simply try and reserve tag values due to wildcarding Summary (i) Question: Why bother with all these send modes? Answer – it is a little complicated, but you should make sure you understand – Ssend and Bsend are clear • map directly onto synchronous and asynchronous sends – Send can be either synchronous or asynchronous • MPI is trying to be helpful here, giving you the benefits of Bsend if there is sufficient system memory available, but not failing completely if buffer space runs out • in practice this leads to endless confusion! The amount of system buffer space is variable – programs that run on one machine may deadlock on another – you should NEVER assume that Send is asynchronous! Summary (ii) Question: What are the tags for? Answer – if you don't need them don't use them!
• perfectly acceptable to set all tags to zero – can be useful for debugging • eg always tag messages with the rank of the sender Summary (iii) Question: Can I just use MPI_COMM_WORLD? Answer – yes: many people never need to create new communicators in their MPI programs – however, it is probably bad practice to specify MPI_COMM_WORLD explicitly in your routines • using a variable will allow for greater flexibility later on, eg: MPI_Comm comm; /* or INTEGER for Fortran */ comm = MPI_COMM_WORLD; ... MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &size); .... Non-Blocking Communications Deadlock 1 2 5 3 4 Communicator 0 Non-Blocking Communications Separate communication into three phases: Initiate non-blocking communication. Do some work (perhaps involving other communications?) Wait for non-blocking communication to complete. Non-Blocking Send 1 2 5 3 4 Communicator 0 Non-Blocking Receive 1 2 5 3 4 Communicator 0 Handles used for Non-blocking Comms datatype same as for blocking (MPI_Datatype or INTEGER). communicator same as for blocking (MPI_Comm or INTEGER). request MPI_Request or INTEGER. A request handle is allocated when a communication is initiated. Non-blocking Synchronous Send C: int MPI_Issend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) int MPI_Wait(MPI_Request *request, MPI_Status *status) Fortran: MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, ierror) MPI_WAIT(request, status, ierror) Non-blocking Receive C: int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int src, int tag, MPI_Comm comm, MPI_Request *request) int MPI_Wait(MPI_Request *request, MPI_Status *status) Fortran: MPI_IRECV(buf, count, datatype, src, tag, comm, request, ierror) MPI_WAIT(request, status, ierror) Blocking and Non-Blocking Send and receive can be blocking or non-blocking. A blocking send can be used with a non-blocking receive, and vice-versa. Non-blocking sends can use any mode: synchronous, buffered, standard, or ready. Synchronous mode affects completion, not initiation. Communication Modes NON-BLOCKING OPERATION MPI CALL Standard send MPI_ISEND Synchronous send MPI_ISSEND Buffered send MPI_IBSEND Ready send MPI_IRSEND Receive MPI_IRECV Completion Waiting versus Testing. C: int MPI_Wait(MPI_Request *request, MPI_Status *status) int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status) Fortran: MPI_WAIT(handle, status, ierror) MPI_TEST(handle, flag, status, ierror) Multiple Communications Test or wait for completion of one message. Test or wait for completion of all messages. Test or wait for completion of as many messages as possible. Testing Multiple Non-Blocking Comms For example, a single process with several incoming messages outstanding can test or wait for whichever of them have completed. Derived Datatypes MPI Datatypes Basic types Derived types – vectors – structs – others Motivation Send / Recv calls need a datatype argument – pre-defined values exist for pre-defined language types – eg real <-> MPI_REAL; int <-> MPI_INT What about types defined by a program? – eg structures (in C) or user-defined types (Fortran) Send / Recv calls take a count parameter – what about data that isn't contiguous in memory? – eg subsections of 2D arrays Approach Can define new types in MPI – user calls setup routines to describe new datatype to MPI • remember, MPI is a library and NOT a compiler!
– MPI returns a new datatype handle – store this value in a variable, eg MPI_MY_NEWTYPE Derived types have same status as pre-defined – can use in any message-passing call Some care needed for reduction operations – user must also define a new MPI_Op appropriate to the new datatype to tell MPI how to combine them Defining types All derived types stored by MPI as a list of basic types and displacements (in bytes) – for a structure, types may be different – for an array subsection, types will be the same User can define new derived types in terms of both basic types and other derived types Derived Datatypes - Type Map A derived type is stored as a list of pairs: (basic datatype 0, displacement of datatype 0), (basic datatype 1, displacement of datatype 1), ..., (basic datatype n-1, displacement of datatype n-1). Contiguous Data The simplest derived datatype consists of a number of contiguous items of the same datatype. C: int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype) Fortran: MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR) INTEGER COUNT, OLDTYPE, NEWTYPE, IERROR Use of contiguous May make program clearer to read Imagine sending a block of 10 integers – use MPI_Ssend with MPI_INT / MPI_INTEGER and count = 10 Or ... – define a new contiguous type of 10 integers called BLOCK – use MPI_Ssend with type=BLOCK and count = 1 May also be useful intermediate stage in building more complicated types – ie later used in definition of another derived type Vector Datatype Example The new type consists of 2 blocks of the oldtype, 3 elements per block, with a stride of 5 elements between the starts of consecutive blocks: count = 2, stride = 5, blocklength = 3. What is a vector type? Why is a pattern with blocks and gaps useful? A vector type corresponds to a subsection of a 2D array Think about how arrays are stored in memory – unfortunately, different conventions for C and Fortran! – must use statically allocated arrays in C because dynamically allocated arrays (using malloc) have no defined storage format – In Fortran, can use either static or allocatable arrays Coordinate System (how I draw arrays) Arrays are drawn with the first index i increasing horizontally and the second index j increasing vertically: element x[i][j] in C (indices starting at 0) and element x(i,j) in Fortran (indices starting at 1). Array Layout in Memory Data is contiguous in memory, but there are different conventions for mapping 2D to 1D arrays in C and Fortran: a 1D array (C: x[16], Fortran: x(16)) occupies 16 consecutive locations, while a 2D array (C: x[4][4], Fortran: x(4,4)) is stored row by row in C (the last index varies fastest) and column by column in Fortran (the first index varies fastest). C example C: x[5][4] A 3 x 2 subsection of a 5 x 4 array – three blocks of two elements separated by gaps of two Fortran example F: x(5,4) A 3 x 2 subsection of a 5 x 4 array – two blocks of three elements separated by gaps of two Equivalent Vector Datatypes C: count = 3, blocklength = 2, stride = 4. Fortran: count = 2, blocklength = 3, stride = 5. Constructing a Vector Datatype C: int MPI_Type_vector (int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype) Fortran: MPI_TYPE_VECTOR (COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR) Sending a vector Have defined a 3x2 subsection of a 5x4 array – but not defined WHICH subsection – is it the bottom left-hand corner? top-right? Data that is sent depends on what buffer you pass to the send routines – pass the address of the first element that should be sent Vectors in send routines MPI_Ssend(&x[1][1], 1, vector3x2, ...); MPI_SSEND(x(2,2), 1, vector3x2, ...)
MPI_Ssend(&x[2][1], 1, vector3x2, ...); MPI_SSEND(x(3,2), 1, vector3x2, ...) Extent of a Datatype May be useful to find out how big a derived type is – can use these routines to compute extents of basic types too – answer is returned in bytes C: int MPI_Type_extent (MPI_Datatype datatype, MPI_Aint *extent) Fortran: MPI_TYPE_EXTENT( DATATYPE, EXTENT, IERROR) INTEGER DATATYPE, EXTENT, IERROR Structures Can define compound objects in C and Fortran struct compound { int ival; double dval[3]; }; type compound integer :: ival double precision :: dval(3) end type compound Storage format NOT defined by the language – different compilers do different things – eg insert arbitrary padding between successive elements – need to tell MPI the byte displacements of every element Constructing a Struct Datatype C: int MPI_Type_struct (int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype) Fortran: MPI_TYPE_STRUCT (COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, ARRAY_OF_TYPES, NEWTYPE, IERROR) Struct Datatype Example count = 2 array_of_blocklengths[0] = 1 array_of_types[0] = MPI_INT array_of_blocklengths[1] = 3 array_of_types[1] = MPI_DOUBLE But how do we compute the displacements? – need to create a compound variable in our program – explicitly compute memory addresses of every member – subtract addresses to get displacements from origin Address of a Variable C: int MPI_Address (void *location, MPI_Aint *address) Fortran: MPI_ADDRESS( LOCATION, ADDRESS, IERROR) <type> LOCATION (*) INTEGER ADDRESS, IERROR Committing a datatype Once a datatype has been constructed, it needs to be committed before it is used in a message-passing call This is done using MPI_TYPE_COMMIT C: int MPI_Type_commit (MPI_Datatype *datatype) Fortran: MPI_TYPE_COMMIT (DATATYPE, IERROR) INTEGER DATATYPE, IERROR Virtual Topologies Virtual Topologies Convenient process naming. Naming scheme to fit the communication pattern. Simplifies writing of code. Can allow MPI to optimise communications. How to use a Virtual Topology Creating a topology produces a new communicator. MPI provides "mapping functions". Mapping functions compute processor ranks, based on the topology naming scheme. Example A 2-dimensional Cylinder Ranks and their (row, column) coordinates: 0 (0,0) 1 (0,1) 2 (0,2) 3 (0,3) 4 (1,0) 5 (1,1) 6 (1,2) 7 (1,3) 8 (2,0) 9 (2,1) 10 (2,2) 11 (2,3) Topology types Cartesian topologies – each process is "connected" to its neighbours in a virtual grid. • boundaries can be cyclic, or not. • Optionally re-order ranks to allow MPI implementation to optimise for underlying interconnectivity. – processes are identified by cartesian coordinates.
Graph topologies – general graphs – not covered here Creating a Cartesian Virtual Topology C: int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart) Fortran: MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM_CART, IERROR) INTEGER COMM_OLD, NDIMS, DIMS(*), COMM_CART, IERROR LOGICAL PERIODS(*), REORDER Balanced Processor Distribution C: int MPI_Dims_create(int nnodes, int ndims, int *dims) Fortran: MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR) INTEGER NNODES, NDIMS, DIMS(*), IERROR MPI_Dims_create Call tries to set dimensions as close to each other as possible. For example: dims = (0, 0) before MPI_DIMS_CREATE(6, 2, dims) gives (3, 2) on return; dims = (0, 0) before MPI_DIMS_CREATE(7, 2, dims) gives (7, 1); dims = (0, 3, 0) before MPI_DIMS_CREATE(6, 3, dims) gives (2, 3, 1); dims = (0, 3, 0) with MPI_DIMS_CREATE(7, 3, dims) is an erroneous call. Non-zero values in dims set the number of processes required in that direction. – WARNING: make sure dims is set to 0 before the call! Cartesian Mapping Functions Mapping process grid coordinates to ranks C: int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank) Fortran: MPI_CART_RANK (COMM, COORDS, RANK, IERROR) INTEGER COMM, COORDS(*), RANK, IERROR Cartesian Mapping Functions Mapping ranks to process grid coordinates C: int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords) Fortran: MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR) INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR Cartesian Mapping Functions Computing ranks of my neighbouring processes Following the conventions of MPI_Sendrecv C: int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest) Fortran: MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR) INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR Cartesian Partitioning Cut a grid up into "slices". A new communicator is produced for each slice. Each slice can then perform its own collective communications. MPI_Cart_sub and MPI_CART_SUB generate new communicators for the slices. – Use array to specify which dimensions should be retained in the new communicator. Partitioning with MPI_CART_SUB C: int MPI_Cart_sub (MPI_Comm comm, int *remain_dims, MPI_Comm *newcomm) Fortran: MPI_CART_SUB (COMM, REMAIN_DIMS, NEWCOMM, IERROR) INTEGER COMM, NEWCOMM, IERROR LOGICAL REMAIN_DIMS(*) Collective Communications Collective Communication Communications involving a group of processes. Called by all processes in a communicator. Examples: – Barrier synchronisation. – Broadcast, scatter, gather. – Global sum, global maximum, etc. Characteristics of Collective Comms Collective action over a communicator. All processes must communicate. Synchronisation may or may not occur. All collective operations are blocking. No tags. Receive buffers must be exactly the right size.
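Before turning to the individual collective routines, the Cartesian topology calls above can be combined into a short C sketch (illustrative only: a 2D grid, periodic in its second dimension as in the cylinder example, with hypothetical variable names):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int size, rank;
        int dims[2]    = {0, 0};   /* 0 lets MPI_Dims_create choose both dimensions */
        int periods[2] = {0, 1};   /* non-cyclic in direction 0, cyclic in direction 1 */
        int coords[2], below, above;
        MPI_Comm comm_cart;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* balanced decomposition of all the processes over the two dimensions */
        MPI_Dims_create(size, 2, dims);

        /* create the Cartesian communicator; reorder = 1 allows rank renumbering */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_cart);

        /* my rank and grid coordinates in the new communicator */
        MPI_Comm_rank(comm_cart, &rank);
        MPI_Cart_coords(comm_cart, rank, 2, coords);

        /* ranks of my neighbours one step away in the cyclic direction */
        MPI_Cart_shift(comm_cart, 1, 1, &below, &above);

        printf("Rank %d at (%d,%d): direction-1 neighbours are %d and %d\n",
               rank, coords[0], coords[1], below, above);

        MPI_Finalize();
        return 0;
    }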
Barrier Synchronisation C: int MPI_Barrier (MPI_Comm comm) Fortran: MPI_BARRIER (COMM, IERROR) INTEGER COMM, IERROR Broadcast C: int MPI_Bcast (void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) Fortran: MPI_BCAST (BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR) <type> BUFFER(*) INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR Scatter A one-to-all operation: the root's send buffer A B C D E is split and one piece is delivered to each process (A to rank 0, B to rank 1, and so on). Scatter C: int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) Fortran: MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF, RECVBUF INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT INTEGER RECVTYPE, ROOT, COMM, IERROR Gather The inverse of scatter: each process contributes its piece and the root assembles A B C D E in rank order in its receive buffer. Gather C: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) Fortran: MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF, RECVBUF INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT INTEGER RECVTYPE, ROOT, COMM, IERROR Global Reduction Operations Used to compute a result involving data distributed over a group of processes. Examples: – global sum or product – global maximum or minimum – global user-defined operation Predefined Reduction Operations MPI Name Function MPI_MAX Maximum MPI_MIN Minimum MPI_SUM Sum MPI_PROD Product MPI_LAND Logical AND MPI_BAND Bitwise AND MPI_LOR Logical OR MPI_BOR Bitwise OR MPI_LXOR Logical exclusive OR MPI_BXOR Bitwise exclusive OR MPI_MAXLOC Maximum and location MPI_MINLOC Minimum and location MPI_Reduce C: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) Fortran: MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR) <type> SENDBUF, RECVBUF INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR MPI_REDUCE For example, with root = 1 and ranks 0-3 holding send buffers A B C D, E F G H, I J K L and M N O P, the reduction leaves the element-wise combination on the root only, e.g. AoEoIoM in the first position. Example of Global Reduction Integer global sum C: MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD) Fortran: CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, IERROR) Sum of all the x values is placed in result. The result is only placed there on processor 0. User-Defined Reduction Operators Reducing using an arbitrary operator, C - function of type MPI_User_function: void my_op (void *invec, void *inoutvec, int *len, MPI_Datatype *datatype) Fortran - external subprogram of type SUBROUTINE MY_OP(INVEC(*), INOUTVEC(*), LEN, DATATYPE) <type> INVEC(LEN), INOUTVEC(LEN) INTEGER LEN, DATATYPE Reduction Operator Functions The operator function must act as: for (i = 1 to len) inoutvec(i) = inoutvec(i) o invec(i) where o is the reduction operation. The operator need not commute but must be associative.
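For example, a user-written operator function in C following this pattern might look like the sketch below (a hypothetical element-wise maximum on doubles; the datatype argument is ignored for simplicity):

    #include <mpi.h>

    /* combine invec into inoutvec: inoutvec[i] = max(inoutvec[i], invec[i]) */
    void my_max(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
    {
        double *in    = (double *) invec;
        double *inout = (double *) inoutvec;
        int i;

        (void) datatype;   /* unused in this sketch */

        for (i = 0; i < *len; i++) {
            if (in[i] > inout[i]) inout[i] = in[i];
        }
    }

Before it can be passed as the op argument of MPI_Reduce or MPI_Allreduce, such a function must be registered with MPI_Op_create, described next.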
Registering User-Defined Operator Operator handles have type MPI_Op or INTEGER C: int MPI_Op_create(MPI_User_function *my_op, int commute, MPI_Op *op) Fortran: MPI_OP_CREATE (MY_OP, COMMUTE, OP, IERROR) EXTERNAL MY_OP LOGICAL COMMUTE INTEGER OP, IERROR Variants of MPI_REDUCE MPI_Allreduce no root process MPI_Reduce_scatter result is scattered MPI_Scan "parallel prefix" MPI_ALLREDUCE As MPI_REDUCE, but there is no root: every rank receives the combined result, e.g. AoEoIoM in the first position. MPI_ALLREDUCE Integer global sum C: int MPI_Allreduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) Fortran: MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR) MPI_SCAN A parallel prefix: with ranks 0-3 holding A B C D, E F G H, I J K L and M N O P, rank 0 receives A, rank 1 receives AoE, rank 2 receives AoEoI and rank 3 receives AoEoIoM (shown for the first element; the other elements are treated in the same way). MPI_SCAN Integer partial sum C: int MPI_Scan(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) Fortran: MPI_SCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR) Message Passing Programming Designing MPI Applications Overview Lecture will cover – MPI portability – maintenance of serial code – general design – debugging – verification MPI Portability Potential deadlock – you may be assuming that MPI_Send is asynchronous – it often is buffered for small messages • but the threshold can vary with implementation – a correct code should run if you replace all MPI_Send calls with MPI_Ssend Buffer space – cannot assume that there will be space for MPI_Bsend – Sun default buffer space is zero! – be sure to use MPI_Buffer_attach • some advice in MPI standard regarding required size Data Sizes Cannot assume anything about data sizes – eg C float / Fortran REAL are 8 bytes on Cray T3E – can be an issue when defining struct types – use MPI_Type_extent to find out the number of bytes Changing precision – when changing from, say, float to double, must change all the MPI types from MPI_FLOAT to MPI_DOUBLE as well Easiest to achieve with an include file – eg every routine includes precision.h Changing Precision: C Define a header file called, eg, precision.h – typedef float RealNumber; – #define MPI_REALNUMBER MPI_FLOAT Include in every function – #include "precision.h" – ... – RealNumber x; – MPI_Routine(&x, MPI_REALNUMBER, ...); Global change of precision now easy – edit 2 lines in one file: float -> double, MPI_FLOAT -> MPI_DOUBLE Changing Precision: Fortran Define a module called, eg, precision – integer, parameter :: REALNUMBER=kind(1.0e0) – integer, parameter :: MPI_REALNUMBER = MPI_REAL Use in every subroutine – use precision – ... – REAL(kind=REALNUMBER) :: x – call MPI_ROUTINE(x, MPI_REALNUMBER, ...) Global change of precision now easy – change 1.0e0 -> 1.0d0, MPI_REAL -> MPI_DOUBLE_PRECISION Testing Portability Run on more than one machine – assuming the implementations are different – most Beowulf systems will use MPICH • running on different Beowulf machines may not be a good test More than one implementation on same machine – very useful test, and can give interesting performance numbers Serial Code Adding MPI can destroy a code – would like to maintain a serial version – ie can compile and run identical code without an MPI library – not simply running MPI code with P=1! Need to separate off communications routines – put them all in a separate file – provide a dummy library for the serial code – no explicit reference to MPI in main code Example: Initialisation !
parallel routine subroutine par_begin(size, procid) implicit none integer :: size, procid, ierr include "mpif.h" call mpi_init(ierr) call mpi_comm_size(MPI_COMM_WORLD, size, ierr) call mpi_comm_rank(MPI_COMM_WORLD, procid, ierr) procid = procid + 1 end subroutine par_begin ! dummy routine for serial machine subroutine par_begin(size, procid) implicit none integer :: size, procid size = 1 procid = 1 end subroutine par_begin Example: Global Sum ! parallel routine subroutine par_dsum(dval) implicit none include "mpif.h" double precision :: dval, dtmp integer :: ierr call mpi_allreduce(dval, dtmp, 1, MPI_DOUBLE_PRECISION, & MPI_SUM, MPI_COMM_WORLD, ierr) dval = dtmp end subroutine par_dsum ! dummy routine for serial machine subroutine par_dsum(dval) implicit none double precision dval end subroutine par_dsum Example Makefile SEQSRC= \ demparams.f90 demrand.f90 demcoord.f90 demhalo.f90 \ demforce.f90 demlink.f90 demcell.f90 dempos.f90 demons.f90 MPISRC= \ demparallel.f90 \ demcomms.f90 FAKESRC= \ demfakepar.f90 \ demfakecomms.f90 #PARSRC=$(FAKESRC) PARSRC=$(MPISRC) Advantages of Comms Library Can compile serial program from same source – makes parallel code more readable Enables code to be ported to other libraries – more efficient but less versatile routines may exist – eg Cray-specific SHMEM library – can even choose to only port a subset of the routines Library can be optimised for different MPIs – eg choose the fastest send (Ssend, Send, Bsend?) Design Separate the communications into a library Make parallel code as similar as possible to serial – could use the same update routine in serial and parallel • only the array bounds would be different – may have a large impact on the design of your serial code Don't try and be too clever – don't agonise whether one more halo swap is really necessary – just do it for the sake of robustness General Considerations Compute everything everywhere – eg use routines such as Allreduce – perhaps the value only really needs to be known on the master • but using Allreduce makes things simpler • little or no performance implications Often easiest to make P a compile-time constant – may not seem elegant but can make coding much easier • eg definition of array bounds – put definition in an include file – a clever Makefile can reduce the need for recompilation • only recompile routines that define arrays rather than just use them • pass array bounds as arguments to all other routines Debugging Parallel debugging can be hard Don't assume it's a parallel bug! – run the serial code first – then the parallel code with P=1 – then on a small number of processes ... Writing output to separate files can be useful – eg log.00, log.01, log.02, ... for ranks 0, 1, 2, ... – need an easy way to switch this on and off Some parallel debuggers exist – Prism is available on Sun machines – Totalview is the leader across all platforms Verification: Is My Code Working? Should the output be identical for any P? – very hard to accomplish in practice due to rounding errors • may have to look hard to see differences in the last few digits – typically, results vary slightly with number of processes – need some way of quantifying the differences from serial code – and some definition of "acceptable" What about the same code for fixed P? – identical output for two runs on same number of processes?
– should be achievable with some care • not in specific cases like dynamic task farms • possible problems with global sums • MPI doesn’t force reproducibility, but some implementations can – without this, debugging is almost impossible Parallelisation Some parallel approaches may be simple – but not necessarily optimal for performance – often need to consider what is the realistic range of P Some people write incredibly complicated code – step back and ask: what do I actually want to do? – is there an existing MPI routine or collective communication? – should I reconsider my approach if it prohibits me from using existing routines, even if it is not quite so efficient? Optimisation Keep running your code – on a number of input data sets – with a range of MPI processes If scaling is poor – find out what parallel routines are the bottlenecks – again, much easier with a separate comms library If performance is poor – work on the serial code – return to parallel issues later on Conclusions Run on a variety of machines Keep it simple Maintain a serial version Don’t assume all bugs are parallel bugs Find a debugger you like (good luck to you)