CS 179: GPU Programming
Lecture 14: Inter-process Communication
The Problem
 What if we want to use GPUs across a distributed system?
(Image: GPU cluster, CSIRO)
Distributed System
 A collection of computers
 Each computer has its own local memory!
 Communication over a network
 Communication suddenly becomes harder! (and slower!)
 GPUs can’t be trivially used between computers
 We need some way to communicate between processes…
Message Passing Interface (MPI)
 A standard for message-passing
 Multiple implementations exist
 Standard functions that allow easy communication of data between processes
 Works on non-networked systems too!
 (Equivalent to memory copying on local system)
MPI Functions
 There are seven basic functions:
 MPI_Init: initialize MPI environment
 MPI_Finalize: terminate MPI environment
 MPI_Comm_size: how many processes we have running
 MPI_Comm_rank: the ID of our process
 MPI_Isend: send data (nonblocking)
 MPI_Irecv: receive data (nonblocking)
 MPI_Wait: wait for request to complete
MPI Functions
 Some additional functions:
 MPI_Barrier: wait for all processes to reach a certain point
 MPI_Bcast: send data to all other processes
 MPI_Reduce: receive data from all processes and reduce to a value
 MPI_Send: send data (blocking)
 MPI_Recv: receive data (blocking)
The workings of MPI
 Big idea:
 We have some process ID
 We know how many processes there are
 A send request/receive request pair is necessary to communicate data!
MPI Functions – Init
 int MPI_Init( int *argc, char ***argv )
 Takes in pointers to command-line parameters (more on this later)
MPI Functions – Size, Rank
 int MPI_Comm_size( MPI_Comm comm, int *size )
 int MPI_Comm_rank( MPI_Comm comm, int *rank )
 Take in a “communicator”
 For us, this will be MPI_COMM_WORLD – essentially, our communication group contains all processes
 Take in a pointer to where we will store our size (or rank)
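 A minimal sketch of how these first four functions fit together (the printed line is just illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int size, rank;

    MPI_Init(&argc, &argv);                  /* initialize the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes are running */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* the ID of this process */

    printf("process %d of %d\n", rank, size);

    MPI_Finalize();                          /* terminate the MPI environment */
    return 0;
}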
MPI Functions – Send
 int MPI_Isend(const void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm, MPI_Request
*request)
 Non-blocking send
 Program can proceed without waiting for return!
 (Can use MPI_Send for “blocking” send)
MPI Functions – Send
 int MPI_Isend(const void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm, MPI_Request
*request)
 Takes in…
 A pointer to what we send
 Size of data
 Type of data (special MPI designations MPI_INT, MPI_FLOAT, etc.)
 Destination process ID
 …
MPI Functions – Send
 int MPI_Isend(const void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm, MPI_Request
*request)
 Takes in… (continued)
 A “tag” – a nonnegative integer (0-32767 guaranteed; implementations may allow larger)
 Uniquely identifies our operation – used to tell messages apart when many are sent simultaneously
 MPI_ANY_TAG (used on the receive side) matches messages with any tag
 A “communicator” (MPI_COMM_WORLD)
 A “request” object
 Holds information about our operation
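 A short, hedged sketch of a nonblocking send. The payload, destination rank, and tag below are illustrative, and MPI_Init is assumed to have been called already:

int value = 42;      /* illustrative payload */
int dest  = 1;       /* illustrative destination process ID */
int tag   = 7;       /* illustrative tag */
MPI_Request request;

MPI_Isend(&value, 1, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);
/* ... other work can proceed here; `value` must not be modified yet ... */
MPI_Wait(&request, MPI_STATUS_IGNORE);   /* now the send buffer may be reused */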
MPI Functions - Recv
 int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
 Non-blocking receive (similar idea)
 (Can use MPI_Recv for “blocking” receive)
MPI Functions - Recv
 int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
 Takes in…
 A buffer (pointer to where received data is placed)
 Size of incoming data
 Type of data (special MPI designations MPI_INT, MPI_FLOAT, etc.)
 Sender’s process ID
 …
MPI Functions - Recv
 int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
 Takes in…
 A “tag”
 Should be the same tag as the tag used to send the message! (assuming MPI_ANY_TAG not in use)
 A “communicator” (MPI_COMM_WORLD)
 A “request” object
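 The matching receive, again a sketch with illustrative names; the tag must agree with the sender’s:

int incoming;
int source = 0;      /* illustrative sender process ID */
int tag    = 7;      /* must match the tag used by MPI_Isend (unless MPI_ANY_TAG) */
MPI_Request request;
MPI_Status  status;

MPI_Irecv(&incoming, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &request);
/* ... other work ... */
MPI_Wait(&request, &status);   /* `incoming` is only valid after the wait completes */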
Blocking vs. Non-blocking
 MPI_Isend and MPI_Irecv are non-blocking communications
 Calling these functions returns immediately
 Operation may not be finished!
 Should use MPI_Wait to make sure operations are completed
 Special “request” objects for tracking status
 MPI_Send and MPI_Recv are blocking communications
 Functions don’t return until operation is complete
 Can cause deadlock!
 (we won’t focus on these)
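 As an illustration of the deadlock risk (a sketch assuming exactly two processes and that rank was set with MPI_Comm_rank): if both processes post their blocking receive first, neither ever reaches its send:

int other = (rank == 0) ? 1 : 0;   /* the one other process */
int mine = rank, theirs;

/* Both ranks block here waiting for a message that is never sent... */
MPI_Recv(&theirs, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* ...so this send is never reached: deadlock. */
MPI_Send(&mine, 1, MPI_INT, other, 0, MPI_COMM_WORLD);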
MPI Functions - Wait
 int MPI_Wait(MPI_Request *request, MPI_Status *status)
 Takes in…
 A “request” object corresponding to a previous operation
 Indicates what we’re waiting on
 A “status” object
 Basically, information about incoming data
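 A small sketch of what the status object offers after a completed receive. MPI_Get_count is standard MPI but not covered above, so treat this as an aside; `request` is assumed to come from an earlier MPI_Irecv of MPI_INTs:

MPI_Status status;
MPI_Wait(&request, &status);                  /* block until the earlier Irecv finishes */

int received;
MPI_Get_count(&status, MPI_INT, &received);   /* how many MPI_INTs actually arrived */
printf("from rank %d, tag %d, %d ints\n",
       status.MPI_SOURCE, status.MPI_TAG, received);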
MPI Functions - Reduce
 int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
 Takes in…
 A “send buffer” (data obtained from every process)
 A “receive buffer” (where our final result will be placed)
 Number of elements in send buffer
 Can reduce element-wise array -> array
 Type of data (MPI label, as before)
…
MPI Functions - Reduce
 int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
 Takes in… (continued)
 Reducing operation (special MPI labels, e.g. MPI_SUM, MPI_MIN)
 ID of process that obtains result
 MPI communication object (as before)
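 A short sketch, assuming MPI has been initialized and rank is set: each process contributes one int, and only rank 0 receives the sum:

int local_value = rank;   /* illustrative per-process data */
int total = 0;

/* Element-wise sum of every process's local_value, delivered to rank 0. */
MPI_Reduce(&local_value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0)
    printf("sum over all processes: %d\n", total);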
MPI Example
 Two processes
 Sends a number from process 0 to process 1
 Note: Both processes are running this code!

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, numprocs;
    MPI_Status status;
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int tag = 1234;
    int source = 0;
    int destination = 1;
    int count = 1;
    int send_buffer;
    int recv_buffer;

    if (rank == source) {
        send_buffer = 5678;
        MPI_Isend(&send_buffer, count, MPI_INT, destination, tag,
                  MPI_COMM_WORLD, &request);
    }

    if (rank == destination) {
        MPI_Irecv(&recv_buffer, count, MPI_INT, source, tag,
                  MPI_COMM_WORLD, &request);
    }

    MPI_Wait(&request, &status);

    if (rank == source) {
        printf("processor %d sent %d\n", rank, send_buffer);
    }
    if (rank == destination) {
        printf("processor %d got %d\n", rank, recv_buffer);
    }

    MPI_Finalize();
    return 0;
}
Solving Problems
 Goal: To use GPUs across multiple computers!
 Two levels of division:
 Divide work between processes (using MPI)
 Divide work among a GPU (using CUDA, as usual)
 Important note:
 Single-program, multiple-data (SPMD) for our purposes, i.e. running multiple copies of the same program!
Multiple Computers, Multiple GPUs
 Can combine MPI, CUDA streams/device selection, and single-GPU CUDA!
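 One common pattern, sketched under the assumption that processes on the same node get consecutive ranks (fragment; needs <mpi.h>, <cuda_runtime.h>, and rank already set): pick a GPU from the MPI rank so each process drives a different device.

/* Assumes MPI_Init and MPI_Comm_rank have already been called. */
int deviceCount;
cudaGetDeviceCount(&deviceCount);     /* GPUs visible to this process */
cudaSetDevice(rank % deviceCount);    /* simple rank-to-GPU mapping */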
MPI/CUDA Example
 Suppose we have our “polynomial problem” again
 Suppose we have n processes for n computers
 Each with one GPU
 (Suppose array of data is obtained through process 0)
 Every process will run the following:
MPI/CUDA Example
MPI_Init
declare rank, numberOfProcesses
set rank with MPI_Comm_rank
set numberOfProcesses with MPI_Comm_size

if process rank is 0:
    array <- (Obtain our raw data)
    for all processes idx = 0 through numberOfProcesses - 1:
        Send fragment idx of the array to process ID idx using MPI_Isend
        (This fragment will have size (size of array) / numberOfProcesses)

declare local_data
Read into local_data for our process using MPI_Irecv
MPI_Wait for our process to finish the receive request

local_sum <- ... Process local_data with CUDA and obtain a sum

declare global_sum
set global_sum on process 0 with MPI_Reduce
(use global_sum somehow)

MPI_Finalize
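 Below, a hedged C sketch of this pseudocode. The CUDA step is represented by a stand-in gpu_sum_of function (a plain CPU loop here), the raw data is filled with illustrative values, error handling is omitted, and the array size is assumed to divide evenly among processes:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the CUDA step: in the real program this would copy `data` to the
   device and launch a reduction kernel; a plain CPU sum keeps the sketch runnable. */
static float gpu_sum_of(const float *data, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

int main(int argc, char **argv) {
    int rank, numberOfProcesses;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numberOfProcesses);

    int totalSize = 1 << 20;                          /* illustrative total array size */
    int fragmentSize = totalSize / numberOfProcesses; /* assumes an even split */

    float *array = NULL;
    MPI_Request *sendRequests = NULL;
    if (rank == 0) {
        array = malloc(totalSize * sizeof(float));
        for (int i = 0; i < totalSize; i++)
            array[i] = 1.0f;                          /* stand-in for obtaining raw data */
        sendRequests = malloc(numberOfProcesses * sizeof(MPI_Request));
        for (int idx = 0; idx < numberOfProcesses; idx++)
            MPI_Isend(array + idx * fragmentSize, fragmentSize, MPI_FLOAT,
                      idx, 0, MPI_COMM_WORLD, &sendRequests[idx]);
    }

    /* Every process (including rank 0) receives its own fragment. */
    float *local_data = malloc(fragmentSize * sizeof(float));
    MPI_Request recvRequest;
    MPI_Irecv(local_data, fragmentSize, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &recvRequest);
    MPI_Wait(&recvRequest, MPI_STATUS_IGNORE);
    if (rank == 0)
        MPI_Waitall(numberOfProcesses, sendRequests, MPI_STATUSES_IGNORE);

    float local_sum = gpu_sum_of(local_data, fragmentSize);

    float global_sum = 0.0f;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum: %f\n", global_sum);       /* (use global_sum somehow) */

    MPI_Finalize();
    return 0;
}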
MPI/CUDA – Wave Equation
 Recall our calculation:
y_{x,t+1} = 2 y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2 \left(y_{x+1,t} - 2 y_{x,t} + y_{x-1,t}\right)
MPI/CUDA – Wave Equation
 Big idea: Divide our data array between n processes!
MPI/CUDA – Wave Equation
 In reality: We have three regions of data at a time
(old, current, new)
MPI/CUDA – Wave Equation
 Calculation for timestep t+1 uses the following data:
y_{x,t+1} = 2 y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2 \left(y_{x+1,t} - 2 y_{x,t} + y_{x-1,t}\right)

(Diagram: the stencil across rows t-1, t, t+1 that feeds y_{x,t+1})
MPI/CUDA – Wave Equation
 Problem if we’re at the boundary of a process!
y_{x,t+1} = 2 y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2 \left(y_{x+1,t} - 2 y_{x,t} + y_{x-1,t}\right)

(Diagram: the stencil at position x on the left edge of a process's chunk, across rows t-1, t, t+1)

Where do we get y_{x-1,t}? (It's outside our process!)
Wave Equation – Simple Solution
 After every time-step, each process gives its leftmost and
rightmost piece of “current” data to neighbor processes!
(Diagram: the data array divided into chunks owned by Proc0 through Proc4)
Wave Equation – Simple Solution
 Pieces of data to communicate:
(Diagram: the leftmost and rightmost “current” values of each chunk, exchanged between neighboring processes Proc0 through Proc4)
Wave Equation – Simple Solution
 Can do this with MPI_Irecv, MPI_Isend, MPI_Wait:
 Suppose process has rank r:
 If we’re not the rightmost process:
 Send data to process r+1
 Receive data from process r+1
 If we’re not the leftmost process:
 Send data to process r-1
 Receive data from process r-1
 Wait on requests
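 A hedged sketch of that exchange for one timestep. It assumes each process stores its chunk in current with one ghost cell at each end (current[0] and current[n+1]), and that rank, size, and n are already set; names are illustrative:

/* Exchange boundary values of `current` with neighboring processes.
   current[1..n] is locally owned; current[0] and current[n+1] are ghost cells. */
MPI_Request requests[4];
int nreq = 0;

if (rank != size - 1) {   /* not the rightmost process */
    MPI_Isend(&current[n],     1, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD, &requests[nreq++]);
    MPI_Irecv(&current[n + 1], 1, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD, &requests[nreq++]);
}
if (rank != 0) {          /* not the leftmost process */
    MPI_Isend(&current[1], 1, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD, &requests[nreq++]);
    MPI_Irecv(&current[0], 1, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD, &requests[nreq++]);
}
MPI_Waitall(nreq, requests, MPI_STATUSES_IGNORE);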
Wave Equation – Simple Solution
 Boundary conditions:
 Use MPI_Comm_rank and MPI_Comm_size
 Rank 0 process will set leftmost condition
 Rank (size-1) process will set rightmost condition
Simple Solution – Problems
 Communication can be expensive!
 It’s expensive to communicate every timestep just to send one value!
 Better solution: Send some m values every m timesteps!