CS 179: GPU Programming
Lecture 14: Inter-process Communication

The Problem
What if we want to use GPUs across a distributed system?
(Image: GPU cluster, CSIRO)

Distributed System
- A collection of computers
- Each computer has its own local memory!
- Communication happens over a network
- Communication suddenly becomes harder! (and slower!)
- GPUs can't be trivially used across computers
- We need some way to communicate between processes...

Message Passing Interface (MPI)
- A standard for message passing
- Multiple implementations exist
- Standard functions that allow easy communication of data between processes
- Works on non-networked systems too! (equivalent to memory copying on the local system)

MPI Functions
There are seven basic functions:
- MPI_Init: initialize the MPI environment
- MPI_Finalize: terminate the MPI environment
- MPI_Comm_size: how many processes we have running
- MPI_Comm_rank: the ID of our process
- MPI_Isend: send data (non-blocking)
- MPI_Irecv: receive data (non-blocking)
- MPI_Wait: wait for a request to complete

Some additional functions:
- MPI_Barrier: wait for all processes to reach a certain point
- MPI_Bcast: send data to all other processes
- MPI_Reduce: receive data from all processes and reduce it to a single value
- MPI_Send: send data (blocking)
- MPI_Recv: receive data (blocking)

The workings of MPI
Big idea:
- We have some process ID
- We know how many processes there are
- A send request/receive request pair is necessary to communicate data!

MPI Functions – Init
int MPI_Init( int *argc, int ***argv ) is declared as int MPI_Init( int *argc, char ***argv )
- Takes in pointers to the command-line parameters (more on this later)

MPI Functions – Size, Rank
int MPI_Comm_size( MPI_Comm comm, int *size )
int MPI_Comm_rank( MPI_Comm comm, int *rank )
- Take in a "communicator"
  - For us, this will be MPI_COMM_WORLD – essentially, our communication group contains all processes
- Take in a pointer to where we will store our size (or rank)

MPI Functions – Send
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
- Non-blocking send
  - The program can proceed without waiting for the operation to finish!
  - (Can use MPI_Send for a blocking send)
- Takes in...
  - A pointer to the data we send
  - Number of elements to send
  - Type of data (special MPI designations: MPI_INT, MPI_FLOAT, etc.)
  - Destination process ID
  - A "tag" – a nonnegative integer (values 0 through 32767 are always valid; implementations may allow larger)
    - Identifies our operation – used to distinguish messages when many are sent simultaneously
    - MPI_ANY_TAG lets a receive match all possible incoming messages
  - A "communicator" (MPI_COMM_WORLD)
  - A "request" object
    - Holds information about our operation

MPI Functions – Recv
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
- Non-blocking receive (similar idea)
  - (Can use MPI_Recv for a blocking receive)
- Takes in...
  - A buffer (pointer to where the received data is placed)
  - Number of elements expected
  - Type of data (special MPI designations: MPI_INT, MPI_FLOAT, etc.)
  - Sender's process ID
  - A "tag"
    - Should be the same tag as the one used to send the message! (assuming MPI_ANY_TAG is not in use)
  - A "communicator" (MPI_COMM_WORLD)
  - A "request" object
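Before the full send/receive example later in the lecture, here is a minimal sketch that uses only the environment-management calls described above (MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize); the printed message is purely illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, numprocs;

        /* Set up the MPI environment and find out who we are. */
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every copy of the program runs this line with its own rank. */
        printf("process %d of %d\n", rank, numprocs);

        MPI_Finalize();
        return 0;
    }

Each process in the communication group prints its own rank; this is the single-program, multiple-data pattern used throughout the lecture.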
Blocking vs. Non-blocking
- MPI_Isend and MPI_Irecv are non-blocking communications
  - Calling these functions returns immediately
  - The operation may not be finished!
  - Should use MPI_Wait to make sure operations are completed
  - Special "request" objects are used to track status
- MPI_Send and MPI_Recv are blocking communications
  - The functions don't return until the operation is complete
  - Can cause deadlock! (we won't focus on these)

MPI Functions – Wait
int MPI_Wait(MPI_Request *request, MPI_Status *status)
- Takes in...
  - A "request" object corresponding to a previous operation
    - Indicates what we're waiting on
  - A "status" object
    - Basically, information about the incoming data

MPI Functions – Reduce
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
- Takes in...
  - A "send buffer" (data contributed by every process)
  - A "receive buffer" (where the final result will be placed)
  - Number of elements in the send buffer
    - Can reduce element-wise: array -> array
  - Type of data (MPI label, as before)
  - Reducing operation (special MPI labels, e.g. MPI_SUM, MPI_MIN)
  - ID of the process that obtains the result
  - MPI communicator (as before)

MPI Example
Two processes: send a number from process 0 to process 1.
Note: both processes are running this code!

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, numprocs;
        MPI_Status status;
        MPI_Request request;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int tag = 1234;
        int source = 0;
        int destination = 1;
        int count = 1;
        int send_buffer;
        int recv_buffer;

        if (rank == source) {
            send_buffer = 5678;
            MPI_Isend(&send_buffer, count, MPI_INT, destination, tag,
                      MPI_COMM_WORLD, &request);
        }

        if (rank == destination) {
            MPI_Irecv(&recv_buffer, count, MPI_INT, source, tag,
                      MPI_COMM_WORLD, &request);
        }

        MPI_Wait(&request, &status);

        if (rank == source) {
            printf("processor %d sent %d\n", rank, send_buffer);
        }
        if (rank == destination) {
            printf("processor %d got %d\n", rank, recv_buffer);
        }

        MPI_Finalize();
        return 0;
    }

Solving Problems
Goal: to use GPUs across multiple computers!
Two levels of division:
- Divide work between processes (using MPI)
- Divide work within each GPU (using CUDA, as usual)
Important note: single-program, multiple-data (SPMD) for our purposes
- i.e. running multiple copies of the same program!

Multiple Computers, Multiple GPUs
Can combine MPI, CUDA streams/device selection, and single-GPU CUDA!

MPI/CUDA Example
Suppose we have our "polynomial problem" again:
- Suppose we have n processes for n computers, each with one GPU
- (Suppose the array of data is obtained through process 0)
Every process will run the following (the reduction step is sketched in the code after this pseudocode):

    MPI_Init
    declare rank, numberOfProcesses
    set rank with MPI_Comm_rank
    set numberOfProcesses with MPI_Comm_size

    if process rank is 0:
        array <- (obtain our raw data)
        for all processes idx = 0 through numberOfProcesses - 1:
            Send fragment idx of the array to process ID idx using MPI_Isend
            (this fragment will have size (size of array) / numberOfProcesses)

    declare local_data
    Read into local_data for our process using MPI_Irecv
    MPI_Wait for our process's receive request to finish

    local_sum <- (process local_data with CUDA and obtain a sum)

    declare global_sum
    set global_sum on process 0 with MPI_Reduce

    (use global_sum somehow)

    MPI_Finalize
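The reduction step at the end of the pseudocode could look roughly like the sketch below. The CUDA part is replaced by a stand-in value for local_sum, and the choice of MPI_FLOAT and MPI_SUM is an assumption about the data, not something fixed by the lecture:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Stand-in for the per-process result; in the real program this
         * would come from processing local_data with CUDA. */
        float local_sum = (float)(rank + 1);
        float global_sum = 0.0f;

        /* Combine every process's local_sum with MPI_SUM; only the root
         * (process 0) receives the result in global_sum. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum: %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }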
MPI/CUDA – Wave Equation
Recall our calculation:
$y_{x,t+1} = 2y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2 \left(y_{x+1,t} - 2y_{x,t} + y_{x-1,t}\right)$

Big idea: divide our data array between n processes!
In reality, we have three regions of data at a time (old, current, new).

The calculation for timestep t+1 uses values from timesteps t and t-1 (the formula above).

Problem if we're at the boundary of a process!
- Where do we get $y_{x-1,t}$? (It's outside our process!)

Wave Equation – Simple Solution
After every timestep, each process gives its leftmost and rightmost pieces of "current" data to its neighbor processes!
(Diagram: pieces of data communicated between Proc0, Proc1, Proc2, Proc3, Proc4)

Can do this with MPI_Irecv, MPI_Isend, MPI_Wait (see the sketch at the end of this section).
Suppose a process has rank r:
- If we're not the rightmost process:
  - Send data to process r+1
  - Receive data from process r+1
- If we're not the leftmost process:
  - Send data to process r-1
  - Receive data from process r-1
- Wait on the requests

Boundary conditions:
- Use MPI_Comm_rank and MPI_Comm_size
- The rank 0 process will set the leftmost condition
- The rank (size-1) process will set the rightmost condition

Simple Solution – Problems
Communication can be expensive!
- It's expensive to communicate every timestep just to send 1 value!
Better solution: send m values every m timesteps!
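A hedged sketch of what the simple one-value exchange could look like in C, assuming each process stores its fragment of the "current" row in cur[1..n] with ghost cells cur[0] and cur[n+1] for the neighbors' values. The function name, array layout, and the use of MPI_Waitall (the multi-request form of MPI_Wait) are choices made here for illustration, not taken from the lecture:

    #include <mpi.h>

    /* Exchange one boundary value with each neighbor after a timestep. */
    void exchange_halos(float *cur, int n, int rank, int numprocs) {
        const int tag = 0;
        MPI_Request reqs[4];
        int nreqs = 0;

        if (rank < numprocs - 1) {   /* not the rightmost process */
            MPI_Isend(&cur[n], 1, MPI_FLOAT, rank + 1, tag,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
            MPI_Irecv(&cur[n + 1], 1, MPI_FLOAT, rank + 1, tag,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        if (rank > 0) {              /* not the leftmost process */
            MPI_Isend(&cur[1], 1, MPI_FLOAT, rank - 1, tag,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
            MPI_Irecv(&cur[0], 1, MPI_FLOAT, rank - 1, tag,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
        }

        /* Wait for all outstanding sends/receives before the next step. */
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
    }

With the "send m values every m timesteps" improvement, the same pattern applies, but each Isend/Irecv would move a block of m boundary values instead of a single float.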