Lecture 5: MPI - Non-blocking Communications

Non-Blocking Communications
Example

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, ncpus;
    int left_neighbor, right_neighbor;
    int data_received = -1;
    int tag = 101;
    MPI_Status statSend, statRecv;
    MPI_Request reqSend, reqRecv;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    left_neighbor = (my_rank-1 + ncpus)%ncpus;
    right_neighbor = (my_rank+1)%ncpus;

    MPI_Isend(&my_rank, 1, MPI_INT, left_neighbor, tag, MPI_COMM_WORLD, &reqSend);        // comm start
    MPI_Irecv(&data_received, 1, MPI_INT, right_neighbor, tag, MPI_COMM_WORLD, &reqRecv);
    // maybe do something useful here
    MPI_Wait(&reqSend, &statSend); // complete comm
    MPI_Wait(&reqRecv, &statRecv);

    printf("Among %d processes, process %d received from right neighbor: %d\n",
           ncpus, my_rank, data_received);

    // clean up
    MPI_Finalize();
    return 0;
}

Sample run:

mpirun -np 4 test_shift
Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2
Semantics etc
Purpose:
- Mechanism for overlapping communication and useful computation: communication and computation may proceed concurrently (latency hiding).
- Deadlock avoidance.
- May avoid system buffering and memory-to-memory copying, and improve performance.

Structure of non-blocking calls:
Post communication request → non-blocking call, MPI_Isend, …
… // do some useful work
Complete communication call → MPI_Wait, MPI_Test, …
Semantics etc
- Non-blocking calls: MPI_Isend, MPI_Irecv, etc.
  - Return immediately; they merely post a request to the system to initiate the communication.
  - However, the communication is not completed yet.
  - Do not tamper with the memory provided in these calls until the communication is completed by calling MPI_Wait, MPI_Test, etc.
(Timeline diagrams: non-blocking send; non-blocking receive)
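A minimal sketch of this rule (the names value, dest, tag, req, stat are illustrative): the send buffer must not be modified between MPI_Isend and the MPI_Wait that completes it.

int value = my_rank;
MPI_Request req;
MPI_Status stat;

MPI_Isend(&value, 1, MPI_INT, dest, tag, MPI_COMM_WORLD, &req); // dest, tag assumed defined
// value = 42;  // WRONG: the send buffer is still in use by MPI
MPI_Wait(&req, &stat); // communication completed
value = 42;            // now it is safe to modify the buffer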
Non-blocking Send/Recv
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISEND(BUF,COUNT,DATATYPE,DEST,TAG,COMM,REQUEST,IERROR)
<type> BUF(*)
INTEGER COUNT,DATATYPE,DEST,TAG,COMM,REQUEST, IERROR
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRECV(BUF,COUNT,DATATYPE,SOURCE,TAG,COMM,REQUEST,IERROR)
<type> BUF(*)
INTEGER COUNT,DATATYPE,SOURCE,TAG,COMM,REQUEST,IERROR
Post send/recv requests to the MPI system.
The calls return immediately; do not access the memory pointed to by buf until the communication completes.
MPI_Request request is a handle to an internal MPI object. Everything about that non-blocking communication goes through that handle. MPI_REQUEST_NULL is the null request.
MPI_Request req1, req2;
double A[10], B[5];
…
MPI_Isend(A, 10, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req1);
MPI_Irecv(B, 5, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req2);
Other Non-blocking Sends
Four communication modes, with the same semantics as the blocking sends:
MPI_ISEND – standard mode
MPI_IBSEND – buffered mode
MPI_ISSEND – synchronous mode
MPI_IRSEND – ready mode
Arguments are identical to those of MPI_Isend:
int MPI_Ibsend(void *buf,int count,MPI_Datatype datatype,int dest,
int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Issend(void *buf,int count,MPI_Datatype datatype,int dest,
int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irsend(void *buf,int count,MPI_Datatype datatype,int dest,
int tag, MPI_Comm comm, MPI_Request *request)
Completion
Use MPI_Wait or MPI_Test to complete
non-blocking communication
Semantics: after MPI_Wait returns,
- for a standard-mode send, the message data has been safely stored away and it is safe to access the send buffer;
- for a receive, the data has been received.
MPI_Wait
int MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_WAIT(REQUEST,STATUS,IERROR)
INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
*request is a handle returned from MPI_Isend, MPI_Irecv, etc.
- Blocks until the communication completes (or fails).
- If request is from MPI_Isend, MPI_Irecv, etc.:
  - deallocates the request object and sets request to MPI_REQUEST_NULL.
- Returns the status information in status.
  - For MPI_Irecv, status holds additional information.
  - For MPI_Isend, there is not much to be used.
MPI_Request req;
MPI_Status stat;
…
MPI_Irecv(…, &req);
MPI_Wait(&req, &stat);
MPI_Test
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_TEST(REQUEST,FLAG,STATUS,IERROR)
LOGICAL FLAG
INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
- request – MPI_Request object from MPI_Isend, etc.
- flag – true if the communication has completed; false if not yet.
  - If true, the request object is deallocated and set to MPI_REQUEST_NULL.
- status – contains the status information if complete.
- Does not block; returns immediately.
- Provides a mechanism for overlapping communication and computation:
  - do useful computation; periodically check the communication status; if not complete, go back to computation (see the sketch below).
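A minimal sketch of this pattern (the buffer size, source, tag and the routine do_useful_work() are illustrative), assuming a receive has been posted with MPI_Irecv:

double buf[100];
int flag = 0;
MPI_Request req;
MPI_Status stat;
MPI_Irecv(buf, 100, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &req); // source, tag assumed defined
do {
    do_useful_work();             // illustrative computation routine
    MPI_Test(&req, &flag, &stat); // returns immediately
} while(!flag);                   // keep computing until the receive has completed
// req is now MPI_REQUEST_NULL; buf is safe to use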
Properties
- Order: non-overtaking, order preserved
  - according to the execution order of the non-blocking calls that initiate the communications.
- Progress: MPI guarantees progress.
  - A receive completed by MPI_Wait will eventually return if there is a matching send.
  - A send completed by MPI_Wait will eventually return if there is a matching receive.
MPI_Comm_rank(comm, &rank);
if(rank==0) {
    MPI_Isend(A, 1, MPI_DOUBLE, 1, 99, comm, &req1);
    MPI_Isend(B, 1, MPI_DOUBLE, 1, 99, comm, &req2);
}
else if(rank==1) {
    MPI_Irecv(A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, comm, &req1);
    MPI_Irecv(B, 1, MPI_DOUBLE, 0, 99, comm, &req2);
}
MPI_Wait(&req1, &stat1);
MPI_Wait(&req2, &stat2);
// non-overtaking: the first receive (MPI_ANY_TAG) matches the first send, the second the second
MPI_Wait Variants
- Deal with arrays of MPI_Requests: MPI_Request req[4];
- MPI_Waitall:
  - MPI_Waitall(int count, MPI_Request *array_req, MPI_Status *array_stat)
  - Blocks until all active requests in the array complete; returns the statuses of all communications.
  - Deallocates the request objects and sets them to MPI_REQUEST_NULL.
- MPI_Waitany:
  - MPI_Waitany(int count, MPI_Request *array_req, int *index, MPI_Status *stat)
  - Blocks until one of the active requests in the array completes; returns its index in the array and the status of the completed communication; deallocates that request object. If the array contains no active requests, returns index=MPI_UNDEFINED.
- MPI_Waitsome (see the sketch after the snippets below):
  - MPI_Waitsome(int incount, MPI_Request *array_req, int *outcount, int *array_indices, MPI_Status *array_stat)
  - Blocks until at least one of the active communications completes; returns the indices and statuses of the completed communications; deallocates those request objects. If there are no active requests, outcount=MPI_UNDEFINED.
MPI_Request req[2];
MPI_Status stat[2];
…
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitall(2, req, stat);
MPI_Request req[2];
MPI_Status stat;
int index;
MPI_Isend(…, &req[0]);
MPI_Isend(…, &req[1]);
MPI_Waitany(2, req, &index, &stat);
…
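For MPI_Waitsome, a sketch with an array of receive requests (the request count and the routine process_message() are illustrative):

MPI_Request req[4];
MPI_Status stat[4];
int indices[4], outcount, k;
// ... post 4 non-blocking receives, one per req[k] ...
while(1) {
    MPI_Waitsome(4, req, &outcount, indices, stat);
    if(outcount == MPI_UNDEFINED) break; // no active requests remain
    for(k=0;k<outcount;k++)
        process_message(indices[k]);     // illustrative: handle each completed receive
}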
MPI_Test Variants
- MPI_Testall:
  - MPI_Testall(int count, MPI_Request *array_req, int *flag, MPI_Status *array_stat)
  - Returns flag=true if all active requests have completed; flag=false otherwise.
  - If true, deallocates the request objects and sets them to MPI_REQUEST_NULL.
- MPI_Testany:
  - MPI_Testany(int count, MPI_Request *array_req, int *index, int *flag, MPI_Status *stat)
  - If one of the active communications has completed, returns flag=true with the index and status of the completed communication; deallocates that request object.
  - Returns flag=false, index=MPI_UNDEFINED if none has completed.
  - Returns flag=true, index=MPI_UNDEFINED if there are no active requests.
- MPI_Testsome (see the sketch below):
  - MPI_Testsome(int incount, MPI_Request *array_req, int *outcount, int *array_indices, MPI_Status *array_stat)
  - Returns in outcount the number of completed active communications, together with their indices and statuses.
  - If none has completed, returns outcount=0.
  - If there are no active communications, outcount=MPI_UNDEFINED.
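A sketch of a progress loop with MPI_Testsome, doing useful work while periodically handling whatever has completed (the request count and the routines do_useful_work() and handle_completion() are illustrative):

MPI_Request req[8];
MPI_Status stat[8];
int indices[8], outcount, k, ndone = 0;
// ... post 8 non-blocking communications, one per req[k] ...
while(ndone < 8) {
    do_useful_work();                               // illustrative computation
    MPI_Testsome(8, req, &outcount, indices, stat); // returns immediately
    if(outcount != MPI_UNDEFINED && outcount > 0) { // skip if nothing completed
        for(k=0;k<outcount;k++)
            handle_completion(indices[k]);          // illustrative handler
        ndone += outcount;
    }
}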
Persistent Communication
- Structure of non-blocking calls:
  - MPI_Ixxxx allocates an MPI_Request object.
  - MPI_Wait or MPI_Test completes the communication and deallocates the request object.
- Often a communication with the same arguments is executed repeatedly,
  - e.g. every time step or every iteration.
- Can create a persistent request that will not be deallocated by MPI_Wait; this reduces overhead.

Create persistent requests → MPI_Send_init, MPI_Recv_init
Repeat:
    Start communication → MPI_Start
    …
    Complete communication → MPI_Wait, MPI_Test
Free persistent requests → MPI_Request_free
Creation
int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req)
int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source,
                  int tag, MPI_Comm comm, MPI_Request *req)
- MPI_Send_init creates a persistent request object for a standard-mode send; MPI_Recv_init for a receive.
- The request is bound to the arguments buf, count, datatype, dest/source, tag, comm; these arguments will not change in the subsequent communications.
- On creation the request is inactive – not associated with any active communication. Communication is initiated by MPI_Start.
MPI_Request req_send, req_recv;
double A[100], B[100];
int left_neighbor, right_neighbor, tag=999;
MPI_Status stat_send, stat_recv;
…
MPI_Send_init(A,100,MPI_DOUBLE,left_neighbor,tag,MPI_COMM_WORLD,&req_send);
MPI_Recv_init(B,100,MPI_DOUBLE,right_neighbor,tag,MPI_COMM_WORLD,&req_recv);
MPI_Start(&req_send);
MPI_Start(&req_recv);
… // do something else useful
MPI_Wait(&req_send, &stat_send);
MPI_Wait(&req_recv, &stat_recv);
MPI_Request_free(&req_send); MPI_Request_free(&req_recv);
Start Communication, Free Request
int MPI_Start(MPI_Request *request)
MPI_START(REQUEST, IERROR)
INTEGER REQUEST, IERROR

- request is a persistent request created by MPI_Send_init, etc.
- Starts the communication on the request object.
- The call returns immediately; it starts a non-blocking communication. Do not access the buffer after this call until the communication completes.
- Complete the communication with MPI_Wait, MPI_Test, etc.
  - MPI_Wait and MPI_Test will not deallocate the request upon completion of the communication.
- Deallocate the persistent request with MPI_Request_free at the end.

int MPI_Request_free(MPI_Request *request)
MPI_REQUEST_FREE(REQUEST, IERROR)
INTEGER REQUEST, IERROR
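The examples on the following slides also use MPI_Startall, which starts all persistent requests in an array with a single call:

int MPI_Startall(int count, MPI_Request *array_of_requests)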
Example: Matrix-Vector Multiplication
(Figure: block decomposition of AX = Y over 3 CPUs, shown for three successive steps; the X blocks are shifted upward among the CPUs at each step so that every CPU eventually multiplies its entire row block by all of X.)

AX = Y
A – NxN matrix
X, Y – vectors of dimension N

Block form (3 CPUs):
Y1 = A11*X1 + A12*X2 + A13*X3   (cpu 0)
Y2 = A21*X1 + A22*X2 + A23*X3   (cpu 1)
Y3 = A31*X1 + A32*X2 + A33*X3   (cpu 2)
Example: Matrix-Vector
Data on cpu 0: [A11 A12 A13] → N/3 x N matrix
               X1 → vector, length N/3
               Y1 → vector, length N/3
Data on cpu 1: [A21 A22 A23] → N/3 x N matrix
               X2 → vector, length N/3
               Y2 → vector, length N/3
Data on cpu 2: [A31 A32 A33] → N/3 x N matrix
               X3 → vector, length N/3
               Y3 → vector, length N/3

Need to communicate: X1, X2, X3
Upward shift. Number of shifts = ncpus-1
Assume: A[i][j] = i+j
X[i] = i
Example (non-blocking comm)

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include "dmath.h" // ignore this for now

#define DIM 1000 // logical A[DIM][DIM], X[DIM], Y[DIM]
int main(int argc, char **argv)
{
int ncpus, my_rank, left_neighbor, right_neighbor, tag=1001;
int Nx, Ny; // Ny=DIM, Nx=DIM/ncpus, on each cpu: A[Nx][Ny], X[Nx], Y[Nx]
MPI_Request req_sr[2];
MPI_Status stat_sr[2];
double **A, *X, *Y, *Xt;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &ncpus);
if(DIM%ncpus != 0) { // assume DIM divisible by ncpus
if(my_rank==0) printf("ERROR: grid size cannot be divided by ncpus!\n");
MPI_Finalize();
return -1;
}
Nx = DIM/ncpus; // again on each cpu: A[Nx][Ny] etc
Ny = DIM;
left_neighbor = (my_rank-1 + ncpus)%ncpus; // top neighbor
right_neighbor = (my_rank+1)%ncpus; // bottom neighbor
A = DMath::newD(Nx, Ny); // allocate memory, ignore DMath – my own routine
X = DMath::newD(Nx);
Xt = DMath::newD(Nx); // Xt – temporary space for receiving from neighbor
Y = DMath::newD(Nx);
int i,j;
for(i=0;i<Nx;i++) { // initialize A, X
    for(j=0;j<Ny;j++) A[i][j] = (my_rank*Nx+i) + j; // *** important ***
    X[i] = my_rank*Nx+i;
}
int count; // loop counter
int sindex, curr_block;
memset(Y, '\0', sizeof(double)*Nx); // zero out result vector Y first
for(count=0;count<ncpus;count++){
    if(count < ncpus-1) {
        MPI_Irecv(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]); // receive from bottom neighbor
        MPI_Isend(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);   // send to top neighbor
    }
    // compute on current data
    curr_block = (my_rank+count)%ncpus; // *** important ***
    sindex = curr_block*Nx; // starting index of A[i][sindex+0:sindex+Nx-1]
    for(i=0;i<Nx;i++)
        for(j=0;j<Nx;j++)
            Y[i] += A[i][sindex+j]*X[j]; // *** important ***
    // complete comm (inside the count loop)
    if(count<ncpus-1) {
        MPI_Waitall(2, req_sr, stat_sr); // data now in Xt
        memcpy(X, Xt, sizeof(double)*Nx); // copy data from Xt to X *** important ***
    }
}
// clean up, free memory
DMath::del(A); // Ignore DMath for now
DMath::del(X);
DMath::del(Xt);
DMath::del(Y);
MPI_Finalize();
return 0;
}
Example: Persistent Communication
...
MPI_Recv_init(Xt, Nx, MPI_DOUBLE,right_neighbor,tag,MPI_COMM_WORLD,&req_sr[0]);
MPI_Send_init(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);
for(count=0;count<ncpus;count++){
if(count < ncpus-1)
MPI_Startall(2, req_sr);
// compute on current data
curr_block = (my_rank+count)%ncpus;
sindex = curr_block*Nx;
for(i=0;i<Nx;i++)
for(j=0;j<Nx;j++)
Y[i] += A[i][sindex+j]*X[j];
// complete comm
if(count<ncpus-1) {
MPI_Waitall(2, req_sr, stat_sr); // data now in Xt
memcpy(X, Xt, sizeof(double)*Nx); // copy data to X
}
}
MPI_Request_free(&req_sr[0]);
MPI_Request_free(&req_sr[1]);
...
Example: Send-Recv
...
for(count=0;count<ncpus;count++){
// compute on current data
curr_block = (my_rank+count)%ncpus;
sindex = curr_block*Nx;
for(i=0;i<Nx;i++)
for(j=0;j<Nx;j++)
Y[i] += A[i][sindex+j]*X[j];
// send-recv
if(count<ncpus-1)
MPI_Sendrecv_replace(X,Nx,MPI_DOUBLE,left_neighbor,tag,
right_neighbor, tag, MPI_COMM_WORLD, &stat_sr);
}
...
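For reference, MPI_Sendrecv_replace used above is a blocking call that sends from and receives into the same buffer, replacing its contents; its signature is:

int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag, int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status)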
HWK#2: Matrix Multiplication
(Figure: column-wise decomposition of C = A*B over 3 CPUs.)

A, B, C – NxN matrices
P – number of processors
A1, A2, A3 – Nx(N/P) matrices (column blocks of A)
C1, C2, C3 – Nx(N/P) matrices (column blocks of C)
Bij – (N/P)x(N/P) blocks of B

Column-wise decomposition:
C1 = A1*B11 + A2*B21 + A3*B31   (cpu 0)
C2 = A1*B12 + A2*B22 + A3*B32   (cpu 1)
C3 = A1*B13 + A2*B23 + A3*B33   (cpu 2)

Input:
A[i][j] = 2*i + j
B[i][j] = 2*i - j
HWK #2
- Implement the above parallel matrix multiplication (column-wise data decomposition) in C, C++, or Fortran.
- Use non-blocking communication or persistent communication in MPI.
- Test your parallel implementation and make sure the result is correct.
  - The result for matrix C on P CPUs must be identical to that on 1 CPU.
- Use a matrix size of 2048x2048 (double).
- Time the "multiplication section" of your code using the MPI_Wtime() routine for wall-clock time (see the timing sketch after this list).
- Run your code on 1, 2, 4, 8, 16 CPUs and obtain the wall-clock times T1, T2, ..., T16.
- Compute the parallel speedup factors Sp = T1/Tp, e.g. S8 = T1/T8 for 8 CPUs.
- Plot Sp vs. the number of CPUs.
- Turn in:
  - Source code + compiled binary code on either hamlet or radon.
  - Table of wall-clock time vs. number of CPUs.
  - Plot of parallel speedup factors.
  - Write-up of what you have learned from the implementation and timing results.
- Due date: Oct. 11
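A minimal timing sketch with MPI_Wtime (the barrier and variable names are illustrative; a barrier before starting the timer is one common way to make the measurements comparable across processes):

double t_start, t_end, elapsed;
MPI_Barrier(MPI_COMM_WORLD); // optional: line processes up before timing
t_start = MPI_Wtime();
// ... multiplication section ...
t_end = MPI_Wtime();
elapsed = t_end - t_start;   // wall-clock seconds measured on this process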
Collective Communications
Overview
- All processes in a group participate in the communication by calling the same function with matching arguments.
- Types of collective operations:
  - Synchronization: MPI_Barrier
  - Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
  - Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
- Collective routines are blocking:
  - Completion of the call means the communication buffer can be accessed.
  - There is no indication of other processes' completion status.
  - May or may not have the effect of synchronizing the processes.
Overview
Can use the same communicators as PtP communications.
MPI guarantees that messages from collective communications will not be confused with PtP communications.
The key is the group of processes taking part in the communication.
If you want only a sub-group of processes involved in a collective communication, you need to create a sub-group / sub-communicator from MPI_COMM_WORLD (see the sketch below).
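One way to build such a sub-communicator is MPI_Comm_split; a minimal sketch (splitting even and odd ranks is only an illustration):

MPI_Comm sub_comm;
int my_rank;
double buffer[10];
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
// processes passing the same color end up in the same sub-communicator
MPI_Comm_split(MPI_COMM_WORLD, my_rank%2, my_rank, &sub_comm);
// collectives on sub_comm involve only that subgroup
MPI_Bcast(buffer, 10, MPI_DOUBLE, 0, sub_comm); // root is rank 0 within sub_comm
MPI_Comm_free(&sub_comm);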
Barrier
int MPI_Barrier(MPI_Comm comm)
MPI_BARRIER(COMM,IERROR)
integer COMM, IERROR
Blocks the calling process until all group members
have called it.
Decreases performance. Refrain from using it
explicitly.
…
MPI_Barrier(MPI_COMM_WORLD); // synchronization point
…
Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,int root,
MPI_Comm comm)
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
<type> BUFFER(*)
INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
- Broadcasts a message from the process with rank root to all processes in the group, including itself.
- comm and root must be the same in all processes.
- The amount of data sent must be equal to the amount of data received, pairwise between each process and the root.
  - For now, this means count and datatype must be the same for all processes; they may differ when generalized datatypes are involved.
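A minimal usage sketch: the root reads or computes a parameter and broadcasts it to all processes (the variable nsteps is illustrative; my_rank is assumed to hold the process rank as in the earlier examples):

int nsteps;
if(my_rank == 0) nsteps = 1000; // only the root knows the value initially
MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);
// now every process in MPI_COMM_WORLD has nsteps == 1000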