Week 2 Power Point Slides

Basics of Message-passing
• Mechanics of message-passing
– A means of creating separate processes on different computers
– A way to send and receive messages
• Single program multiple data (SPMD) model
– Logic for multiple processes is merged into one program
– Control statements separate the blocks of logic executed by each processor
– A copy of the compiled program is stored on each processor
– All executables are started together statically
– Example: MPI (Message Passing Interface)
• Multiple program multiple data (MPMD) model
– Each processor has a separate master program
– Master program spawns child processes dynamically
– Example: PVM (Parallel Virtual Machine)
PVM (Parallel Virtual Machine)
From Oak Ridge National Laboratory; freely distributed
• Multiple process control: a host process controls the environment; any process
can spawn others; a daemon handles message passing
• PVM System Calls
– Control: pvm_mytid(), pvm_spawn(), pvm_parent(), pvm_exit()
– Get send buffer: pvm_initsend()
– Pack for sending: pvm_pkint(), pvm_pkfloat(), pvm_pkstr()
– Blocking/non-blocking transmission: pvm_send(), pvm_recv(), pvm_nrecv()
– Unpack after receipt: pvm_upkint(), pvm_upkfloat(), pvm_upkstr()
– Group definition: pvm_joingroup()
– Collective communication: pvm_bcast(), pvm_scatter(), pvm_gather(), pvm_reduce(), pvm_mcast()
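To show how the calls listed above fit together, here is a minimal, hedged master/worker sketch using the classic PVM 3 C API; the spawned executable name "worker", the message tag, and the packed value are illustrative choices, not taken from the slides.

/* Master spawns one worker, packs an int, and sends it; the worker unpacks it. */
#include <stdio.h>
#include <pvm3.h>
int main(void)
{   int mytid = pvm_mytid();                 /* enroll in PVM and get this task's id */
    if (pvm_parent() == PvmNoParent) {       /* master: it has no parent task */
        int child, value = 42;
        pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &child);
        pvm_initsend(PvmDataDefault);        /* get a fresh send buffer */
        pvm_pkint(&value, 1, 1);             /* pack one int with stride 1 */
        pvm_send(child, 1);                  /* send with message tag 1 */
    } else {                                 /* worker: spawned by the master */
        int value;
        pvm_recv(pvm_parent(), 1);           /* block on tag 1 from the master */
        pvm_upkint(&value, 1, 1);            /* unpack the int */
        printf("Task %x received %d\n", mytid, value);
    }
    pvm_exit();                              /* leave the virtual machine */
    return 0;
}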
mpij and MpiJava
• Overview
– mpiJava is a wrapper sitting on top of MPICH or LAM/MPI
– mpij is a native Java implementation of MPI
• Documentation
– MpiJava (http://www.hpjava.org/mpiJava.html)
– mpij (uses the same API as MpiJava)
• Java Grande consortium (http://www.javagrande.org)
– Sponsors conferences & encourages Java for Parallel Programming
– Maintains Java based paradigms (mpiJava, HPJava, and mpiJ)
• Other Java based implementations
– JavaMpi is another less popular MPI Java wrapper
SPMD Computation (MPI)
int main(int argc, char *argv[])
{
    int myrank;
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        master();       /* rank 0 runs the master code */
    else
        slave();        /* all other ranks run the slave code */
    .
    .
    MPI_Finalize();
    return 0;
}
The master process executes master()
The slave processes execute slave()
A Simple MPI Program
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{   int rank, size, MAX = 100 + 1, TAG = 1;
    char data[MAX];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);  // Terminate all processes
    if (rank == 0) { sprintf(data, "Sending from %d of %d", rank, size);
        MPI_Send(data, MAX, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
    } else { MPI_Recv(data, MAX, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%s\n", data);
    }
    MPI_Finalize();
    return 0;
}
Start and Finish
• MPI_Init: Bring up program on all computers,
pass command line arguments, establish ranks.
• MPI_Comm_rank: Determine the rank of the
current process
• MPI_Comm_size: return the number of processors
that are running
• MPI_Finalize: Terminate the program normally
• MPI_Abort: Terminate with an error code when
something bad happens
Standard Send (MPI_Send)
Blocks until the message is received or the data is copied to an internal buffer
int MPI_Send(void *buf, int count, MPI_Datatype type,
int dest, int tag, MPI_Comm comm)
• Input Parameters
– buf: initial address of send buffer (choice)
– count: integer number of elements in send buffer
– type: type of each send buffer element (ex: MPI_CHAR,
MPI_INT, MPI_DOUBLE, MPI_BYTE, MPI_PACK, etc.)
– dest: rank of destination (integer)
– tag: message tag (integer)
– comm: communicator (handle)
• Note: MPI_PACK allows different data types to be sent in a single
buffer using the MPI_Pack and MPI_Unpack functions.
• Note: Google MPI_Send, MPI_Recv, etc. for more information
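To illustrate the MPI_PACK note above, here is a brief, hedged sketch that packs an int and a double into one buffer, sends it as MPI_PACKED, and unpacks it on the receiver; the buffer size and tag are arbitrary illustration values.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{   char buffer[100];               /* pack buffer; 100 bytes is an arbitrary size */
    int rank, position = 0, n = 5;
    double x = 3.14;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                /* pack an int and a double, send as MPI_PACKED */
        MPI_Pack(&n, 1, MPI_INT,    buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(&x, 1, MPI_DOUBLE, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {         /* unpack in the same order it was packed */
        MPI_Recv(buffer, 100, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Unpack(buffer, 100, &position, &n, 1, MPI_INT,    MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
        printf("Received %d and %f\n", n, x);
    }
    MPI_Finalize();
    return 0;
}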
Matching Message Tags
• Differentiates between types of messages
• The message tag is carried within the message
• Wild card codes allow receipt of any message from any source
– MPI_ANY_TAG: matches any message type
– MPI_ANY_SOURCE: matches messages from any sender
– Sends cannot use wildcards (pull operation, not push)
Diagram: Process 1 executes send(&x, 2, 5), sending message tag 5 from buffer x
to buffer y in Process 2; Process 2 executes recv(&y, 1, 5), which waits for a
message from process 1 with a tag of 5.
Status of Sends and Receives
MPI_Status status;
MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
MPI_ANY_TAG, MPI_COMM_WORLD, &status);
• status.MPI_SOURCE /* rank of sender */
• status.MPI_TAG /* type of message */
• status.MPI_ERROR /* error code */
– MPI_SUCCESS - Successful, MPI_ERR_BUFFER - Invalid buffer pointer
– MPI_ERR_COUNT - Invalid count, MPI_ERR_TYPE - Invalid data type
– MPI_ERR_TAG - Invalid tag, MPI_ERR_COMM - Invalid communicator
– MPI_ERR_RANK - Invalid rank, MPI_ERR_ARG - Invalid argument
– MPI_ERR_UNKNOWN - Unknown error, MPI_ERR_INTERN - Internal error
– MPI_ERR_TRUNCATE - Message truncated on receive
• MPI_Get_count(&status, recv_type, &count) /* number of elements received */
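A small, hedged sketch of inspecting the status fields and MPI_Get_count after a wildcard receive; the payload sizes and tags below are made up for illustration.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{   int rank, size, i, count;
    double data[4] = {1.0, 2.0, 3.0, 4.0};
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank != 0) {     /* each worker sends a different number of doubles, tag = rank */
        MPI_Send(data, rank % 4 + 1, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD);
    } else {             /* master accepts from anyone and inspects the status fields */
        for (i = 1; i < size; i++) {
            MPI_Recv(data, 4, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_DOUBLE, &count);
            printf("Got %d doubles from rank %d with tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    MPI_Finalize();
    return 0;
}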
Console Input and Output
• Input
– Console input must be initiated at the host process
if (rank==0) { printf("Enter some fraction: ");
    scanf("%lf", &value);
}
or fgets(data, sizeof data, stdin) to read a string
• Output
– Any process can produce output
– MPI uses internal library functions to route the output to the process that
initiated the program
– Output routed through these library functions before normal application
transmissions may arrive after them, or vice versa
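A hedged sketch of the usual input pattern: the host process (rank 0) reads from the console and then broadcasts the value to the other processes (MPI_Bcast is covered later in these slides).

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{   int rank;
    double value = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                       /* only the host process touches stdin */
        printf("Enter some fraction: ");
        fflush(stdout);
        scanf("%lf", &value);
    }
    /* every rank calls the broadcast; rank 0 is the root that supplies the value */
    MPI_Bcast(&value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("Rank %d has value %f\n", rank, value);
    MPI_Finalize();
    return 0;
}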
Groups and Communicators
• Group: A set of processes ordered by relative rank
• Communicator: The context required for sends and receives
• Purpose: Enable collective communication (to subgroups of processors)
• The default communicator is MPI_COMM_WORLD
– A unique rank corresponds to each executing process
– The rank is an integer from 0 to p – 1
– The number of processors executing is p
• Applications can create subset communicators
– Each processor has a unique rank in each sub-communicator
– The rank is an integer from 0 to g-1
– The number of processors in the group is g
MPI Group Communicator Functions
Typical Usage
1. Extract group from communicator: MPI_Comm_group
2. Form new group: MPI_Group_incl or MPI_Group_excl
3. Create new group communicator: MPI_Comm_create
4. Determine group rank: MPI_Comm_rank
5. Communications: MPI message passing functions
6. Destroy created communicators and groups: MPI_Comm_free and MPI_Group_free
Details
• MPI_Group_excl:
– New group without certain processes from an existing group
– int MPI_Group_excl(MPI_Group group, int n, int *ranks,
MPI_Group *newgroup);
• MPI_Group_incl:
– New group with selected processes from an existing group
– int MPI_Group_incl(MPI_Group group, int n, int *ranks,
MPI_Group *newgroup);
Creating and using a sub-group
int ranks[4] = {1, 3, 5, 7};
MPI_Group original, subgroup;
MPI_Comm slave;
MPI_Comm_group(MPI_COMM_WORLD, &original);
MPI_Group_incl(original, 4, ranks, &subgroup);
MPI_Comm_create(MPI_COMM_WORLD, subgroup, &slave);
if (slave != MPI_COMM_NULL)   /* only members of the subgroup get a valid communicator */
    MPI_Send(data, strlen(data)+1, MPI_CHAR, 0, 0, slave);
MPI_Group_free(&subgroup); MPI_Group_free(&original);
if (slave != MPI_COMM_NULL) MPI_Comm_free(&slave);
Point-to-point Communication
• Pseudo code constructs
Send(data, destination, message tag)
Receive(data, source, message tag)
• Synchronous
– Send Completes when data safely received
– Receive completes when data is available
– No copying to/from internal buffers
• Asynchronous
– Copy to internal message buffer
– Send completes when transmission begins
– Local buffers are free for application use
– Receive polls to determine if data is available
Diagram: Process 1 calls send(&x, 2) to move buffer x into buffer y of Process 2,
which calls recv(&y, 1) (generic syntax; actual formats come later)
Synchronized sends and receives
Diagram (a), send() occurs before recv(): Process 1 issues a request to send and
suspends; when Process 2 reaches its recv(), an acknowledgment comes back, the
message is transferred, and both processes continue.
Diagram (b), recv() occurs before send(): Process 2 suspends at its recv(); when
Process 1 issues the request to send, the message and acknowledgment follow, and
both processes continue.
Point to Point MPI calls
• Buffered Send (receiver gets to it when it can)
– Completes after data is copied to a user supplied buffer
– Becomes synchronous if no buffers are available
• Ready Send (guarantee transmission is successful)
– A matching receive call must precede the send
– Completion occurs when remote processor receives the data
• Standard Send (starts transmission if possible)
– If receive call is posted, completes when transmission starts
– If no receive call is posted, completes when data is buffered by MPI,
but becomes synchronous if no buffers are available
• Blocking - Return occurs when the call completes
• Non-Blocking - Return occurs immediately
– Application must periodically poll or wait for completion
– Why non-blocking? To allow more parallel processing (overlap computation with communication)
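The send variants above all take the same argument list as MPI_Send; the brief, hedged fragment below (which assumes rank has already been set inside an initialized MPI program with at least two processes) contrasts a synchronous send with a ready send whose matching receive is pre-posted.

/* Same argument list as MPI_Send; only the completion semantics differ. */
int a[10] = {0}, b[10] = {0};
if (rank == 0) {
    MPI_Ssend(a, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* synchronous: completes once the matching receive has started */
    MPI_Rsend(b, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);  /* ready: the matching receive must already be posted */
} else if (rank == 1) {
    MPI_Request req;
    MPI_Irecv(b, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &req);  /* pre-post the receive needed by MPI_Rsend */
    MPI_Recv(a, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}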
Buffered Send Example
Applications supply a data buffer area using
MPI_Buffer_attach() to hold the data during transmission
Diagram: Process 1's send() copies the message into the attached message buffer and
continues; Process 2's recv() later reads the message from that buffer.
Note: transmission is between sender/receiver MPI buffers
Note: copying in and out of buffers can be expensive
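A minimal, hedged sketch of a buffered send; the buffer is sized as the payload plus MPI_BSEND_OVERHEAD, and the single int payload is an illustrative choice.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{   int rank, x = 7, bufsize;
    char *buf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* attach a user-supplied buffer big enough for one int plus MPI overhead */
        bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        buf = (char *)malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* completes after the copy */
        MPI_Buffer_detach(&buf, &bufsize);  /* blocks until buffered data is transmitted */
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", x);
    }
    MPI_Finalize();
    return 0;
}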
Point-to-point Message Transfer
int x, myrank;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{
    MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
} else if (myrank == 1)
{
    MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}
MPI_Send(buf, count, datatype, dest, tag, comm)
– buf: address of send buffer
– count: number of items to send
– datatype: datatype of each item
– dest: rank of destination process
– tag: message tag
– comm: communicator
MPI_Recv(buf, count, datatype, src, tag, comm, status)
– buf: address of receive buffer
– count: maximum number of items to receive
– datatype: datatype of each item
– src: rank of source process
– tag: message tag
– comm: communicator
– status: status after operation
Non-blocking Point-to-point Transfer
int x, myrank;
MPI_Request io;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{
    MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
    doSomeProcessing();   /* overlap useful work with the transfer */
    MPI_Wait(&io, &stat);
} else if (myrank == 1)
{   MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}
• MPI_Isend() and MPI_Irecv() return immediately
• MPI_Rsend returns when received by the remote computer; MPI_Bsend is the
buffered send; MPI_Send is the standard send
• MPI_Wait() returns after the transmission completes; MPI_Test() sets its flag
argument non-zero after the transmission completes and zero otherwise
Message Passing Order
Diagram (a), messages received out of order: Process 0 sends to Process 1 both from
application code and from within a library routine lib(); the receives in Process 1
match the messages in the wrong order.
Diagram (b), messages received in order: each send is matched by the corresponding
receive, both inside and outside lib().
Note: Messages originating from a processor will always be received in
order. Messages from different processors can be received out of order.
Collective Communication
MPI operations on groups of processes
• Broadcast (MPI_Bcast()): Broadcast or multicast data to the processors in a group
• Scatter (MPI_Scatter()): Send parts of an array to separate processes
• Gather (MPI_Gather()): Collect array elements from separate processes
• AlltoAll (MPI_Alltoall()): A combination of gather and scatter; all
processes send, then sections of the combined data are gathered
• MPI_Reduce(): Combine values from all processes into a single value
using some operation (function call)
• MPI_Reduce_scatter(): First reduce and then scatter the result
• MPI_Scan(): Reduce values received from processors of lower rank in
the group (Note: this is a prefix reduction)
• MPI_Barrier(): Pause until all processors reach the barrier call
Advantages
• MPI can use the processor hierarchy to improve efficiency
• Although we can implement collective communication using standard
send and receive calls, collective operations require less programming
and debugging
Reduce, Broadcast, Allreduce
Diagram: an allreduce implemented as a reduce followed by a broadcast, compared
with a butterfly allreduce pattern
Predefined Collective Operations
• MPI_MAX, MPI_MIN: maximum, minimum
• MPI_MAXLOC, MPI_MINLOC:
– If the output buffer is out, then for each index i, out[i].val and
out[i].rank contain the max (or min) value and the rank of the
processor holding it
• MPI_SUM, MPI_PROD: sum, product
• MPI_LAND, MPI_LOR, MPI_LXOR: logical &, |, ^
• MPI_BAND, MPI_BOR, MPI_BXOR: bitwise &, |, ^
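A short, hedged sketch of MPI_MAXLOC using the predefined MPI_DOUBLE_INT pair type; the local value is made up for illustration.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{   int rank;
    struct { double val; int rank; } local, global;   /* layout matches MPI_DOUBLE_INT */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local.val = (double)((rank * 37) % 11);   /* made-up local value */
    local.rank = rank;                        /* tag the value with its owner */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Max value %f is on rank %d\n", global.val, global.rank);
    MPI_Finalize();
    return 0;
}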
Derived MPI Data Types
/* Goal: send items, each containing a double, an integer, and a string */
int lengths[3] = {1, 1, 100};
MPI_Datatype types[3] = {MPI_DOUBLE, MPI_INT, MPI_CHAR};
/* Displacements assume no padding between fields; offsetof() is safer in real code */
MPI_Aint displacements[3] = {0, sizeof(double), sizeof(double) + sizeof(int)};
MPI_Datatype myType;
/* Derive a data type */
MPI_Type_create_struct(3, lengths, displacements, types, &myType);
MPI_Type_commit(&myType);   /* Commit it for use */
/* count data items broadcast from source to processors in communicator */
MPI_Bcast(&data, count, myType, source, comm);
MPI_Type_free(&myType);     /* Don't need it anymore */
Note: Broadcasts can be fifty to a hundred times
faster than doing individual sends using for loops
Collective Communication Example
• Master: Allocate memory to hold all of the data, then gather items from
a group of processes
• Remotes: Fill an array with data and send them to the master
• Note: All processors execute the MPI_Gather() function
int data[10], *buf = NULL, i, myrank, grp_size;   /* data to gather from each process */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{   MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
    buf = (int *)malloc(grp_size * 10 * sizeof(int));
}
for (i = 0; i < 10; i++) data[i] = myrank;        /* every rank contributes its data */
MPI_Gather(data, 10, MPI_INT, buf, 10 /* per-process count */, MPI_INT,
           0 /* gatherer rank */, MPI_COMM_WORLD);
User Defined Collective Operation
/* User-defined function to add complex numbers (dest = src + dest) */
typedef struct { double real, imag; } Complex;   /* matches two contiguous doubles */
void compSum(Complex *src, Complex *dest, int *len, MPI_Datatype *ptr)
{   int i;
    for (i = 0; i < *len; ++i)
    {   dest->real += src->real;
        dest->imag += src->imag;
        src++; dest++;
    } }
Complex in[100], out[100];
MPI_Op operation; MPI_Datatype complexType;
MPI_Type_contiguous(2, MPI_DOUBLE, &complexType);              // Define the type
MPI_Type_commit(&complexType);                                 // Record for possible use
MPI_Op_create((MPI_User_function *)compSum, 1 /* commutative */, &operation);
MPI_Reduce(in, out, 100, complexType, operation, root, communicator);
Collective Communication Rules
• All of the processors in the communicator call the same
collective function
• The arguments must specify the same root process, input data array
length, data type, operation, and communicator
• The destination process is the only one that needs to
specify an output array
• There is no message tag. Matching is done by the calling
order and the communicator
• The input and output buffers must be different and should
not overlap
Broadcast
Broadcast - Sending the same message to all processes
Multicast - Sending the same message to a defined group of processes.
Diagram: process 0 holds the source buffer buf; all processes call bcast() (MPI form:
MPI_Bcast), and a copy of the data appears at processes 0 through p-1.
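A hedged fragment showing the MPI form (it assumes rank has been obtained with MPI_Comm_rank inside an initialized program); the table contents and root choice are illustrative.

/* rank `root` broadcasts an array; every rank calls MPI_Bcast with the same
   count, type, root, and communicator */
int table[4] = {0};
int root = 0;
if (rank == root) { table[0] = 10; table[1] = 20; table[2] = 30; table[3] = 40; }
MPI_Bcast(table, 4, MPI_INT, root, MPI_COMM_WORLD);
/* after the call, all ranks hold {10, 20, 30, 40} */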
Scatter
Distributing each element of an array to separate processes
Contents of the ith location of the array transmits to process i
Diagram: process 0 holds the source array buf; all processes call scatter() (MPI form:
MPI_Scatter), and element i of the array arrives at process i.
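A hedged fragment of MPI_Scatter (it assumes rank and size are already set inside an initialized program and that <stdio.h> and <stdlib.h> are included); the scattered values are illustrative.

/* rank 0 scatters one int to each process, including itself */
int *sendbuf = NULL, recvval, i;
if (rank == 0) {                         /* only the root needs the full send buffer */
    sendbuf = (int *)malloc(size * sizeof(int));
    for (i = 0; i < size; i++) sendbuf[i] = 100 + i;
}
MPI_Scatter(sendbuf, 1, MPI_INT,         /* one element goes to each process */
            &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("Rank %d received %d\n", rank, recvval);
if (rank == 0) free(sendbuf);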
Gather
One process collects individual values from set of processes.
Diagram: every process contributes its data buffer by calling gather() (MPI form:
MPI_Gather); process 0 collects the contributions into consecutive locations of buf.
Reduce
Perform a distributed calculation
Example: Perform addition over a distributed array
Diagram: every process contributes its data buffer by calling reduce() (MPI form:
MPI_Reduce); the values are combined with the reduction operator (here +) and the
result is placed in buf on process 0.
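A hedged fragment of a plain MPI_Reduce with the predefined MPI_SUM operation (it assumes rank is already set inside an initialized program); the local contribution is made up.

/* add one value from every process and deliver the sum to rank 0 */
double local = 1.5 * rank;   /* made-up local contribution */
double sum = 0.0;
MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("Sum over all ranks = %f\n", sum);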
Avoiding MPI Deadlocks
An MPI_Recv without a matching send will block forever
• MPI_Send doesn't always work the same way
– Can copy to a buffer and then return before the transmission is received
– Can block until the matching MPI_Recv starts
– MPI uses thresholds to switch from buffered to blocking sends
– Some implementations buffer small messages and block on large messages
• Deadlock Possibilities (MPI_Send followed by MPI_Recv)
– If all of the sends block, none of the receives can start
– Small messages may succeed, while larger messages may lead to deadlock
• Possible Solutions:
– Some processors send before receiving; others receive before sending
– Use MPI_Sendrecv or MPI_Sendrecv_replace so that MPI automatically
handles the order of calls and guarantees no deadlock (see the sketch below)
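A hedged sketch of a ring exchange in which a plain MPI_Send followed by MPI_Recv on every rank could deadlock; MPI_Sendrecv pairs the send and receive so MPI handles the ordering.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[])
{   int rank, size, right, left, sendval, recvval;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    right = (rank + 1) % size;           /* neighbor to send to */
    left  = (rank - 1 + size) % size;    /* neighbor to receive from */
    sendval = rank;
    /* every rank sends right and receives from the left in one call,
       so no ordering trick is needed to avoid deadlock */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank %d received %d from rank %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}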
Timing Parallel Programs
• What should not be timed?
– Time to type input
– Time to print or display output
• What should be timed?
– The actual algorithm's computation
– Communication blocks
• How? Answer: Either use MPI or C time.h functions
double start = MPI_Wtime();
/* Do stuff */
double time = MPI_Wtime() - start;  /* already in seconds; MPI_Wtick() gives the clock resolution */
OR C time.h (but clock() measures CPU time, so idle/wait time is not included)
clock_t start = clock();
/* Do stuff */
double time = ((double)(clock() - start)) / CLOCKS_PER_SEC;
Maximum Time over Processors
double start, localElapsed, elapsed;
// Start all processors together
MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime(); // Start time
/** do code here */
// Get this processor's elapsed time
localElapsed = MPI_Wtime() - start;
// Reduce to the maximum elapsed time over all processors
MPI_Reduce(&localElapsed, &elapsed, 1,
           MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (rank == 0) // Master processor outputs the result
    printf("Elapsed time = %f seconds\n", elapsed);
Note: Another way is to code another barrier at the
end in order to avoid needing a reduce operation