MPI COURSE (3)

1. Recap
1.1 What is MPI?
MPI = Message Passing Interface
It is a specification for a standard library that defines the syntax and
semantics of an extended message-passing model; it was designed by the
MPI Forum (SC 1992) (http://www.mpi-forum.org).
It is not:
– a language or a compiler specification
– a specific implementation
– a source of implementation details; some hints are given, but
implementors have a large degree of freedom, and two different
implementations may do the same thing in very different ways.
MPI was designed for implementing parallel computation on systems with
distributed memory, and to this end it provides routines for exchanging
messages between one or more senders and one or more receivers.
Standardized functionality:
- P2P (point-to-point) communication (see the sketch after this list)
- collective communication
- synchronization routines
- derived data types for access to non-contiguous data structures
- the ability to create subsets of processors
- the ability to create processor topologies
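As a minimal illustration of the point-to-point routines, here is a small
sketch (not part of the original course code; it assumes at least two
processes) in which rank 0 sends one integer to rank 1:
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;   /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}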
Although MPI is based on the distributed-memory system model, it can be
used on many types of systems:
- distributed-memory machines
- shared-memory machines
- clusters of SMPs (Symmetric MultiProcessing)
- networks of workstations (NOW)
- heterogeneous networks of computers
1.2 A simple example (computing PI)
An approximation of PI is computed by integrating over [0, 1] the
function:
f(x) = 4 / (1 + x^2)
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main( int argc, char *argv[] ) {
int n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
while (1) {
if (myid == 0) {
printf("Number of intervals: (0 quits) ");
scanf("%d",&n);
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0)
break;
else {
h
= 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs){
x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x*x));
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE,
MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0)
printf("pi is approximately %.16f,
Error is %.16f\n", pi, fabs(pi PI25DT));
}
}
MPI_Finalize();
return 0;
}
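With a typical MPI installation, an example like this is compiled with the
compiler wrapper (e.g. mpicc pi.c -o pi -lm) and launched with mpirun or
mpiexec (e.g. mpirun -np 4 ./pi); the exact commands depend on the
implementation you have installed.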
2. Topologies and communicators
2.1 Communicators
Why? An example:
BMR (Broadcast Multiply Roll) matrix-matrix multiplication
algorithm (s x s mesh of processes, with the matrices
distributed in a 2D block layout):
for k = 0:s-1
  1) The process in row i with A(i, (i+k) mod s) broadcasts it to all
     other processes in the same row i.
  2) Processes in row i receive A(i, (i+k) mod s) in local array T.
  3) for i = 0:s-1 and j = 0:s-1 in parallel
       C(i,j) = C(i,j) + T*B(i,j)
     end
  4) Upward circular shift each column of B by 1:
       B(i,j) <-- B((i+1) mod s, j)
end
In the BMR, note that in each row we have to broadcast a part of A across
the row. We cannot simply use the MPI_Bcast() call covered earlier on
MPI_COMM_WORLD, because it would send the message to all processes, not
just those in the same row.
Here is where communicators come in handy; we can define
multiple communicators, one for each row. Then by using
that communicator we broadcast just in the specified row.
MPI has intracommunicators and intercommunicators.
Intracommunicators are for grouping processes and allowing
them to send collective communications to each other, while
intercommunicators are for sending messages between
disjoint intracommunicator groups. What we need here are
intracommunicators, each of which consists of a group and a
context. A group is an ordered set of processes, each
assigned a unique number 0, 1, ... s-1 where the set has s
processes. Its context is an MPI-defined class that
uniquely identifies the communicator. You can have two
communicators, each consisting of the identical set of s
processes, but they will have different contexts
(otherwise, you have only one communicator!).
Here is an example code fragment for creating a group from
the second row of an s by s array of processes, as would be
needed for BMR:
/* ---------------- */
/* s = sqrt(p) here */
/* ---------------- */
MPI_Group group_world;
MPI_Group row2_group;
MPI_Comm row2_comm;
int       *row2_ranks = new int[s];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* ----------------------------------------------- */
/* Create vector of MPI_COMM_WORLD ranks for row2  */
/* processes                                       */
/* ----------------------------------------------- */
for (k = s; k < 2*s; k++)
    row2_ranks[k-s] = k;
/* ------------------------------------------------ */
/* Find the group corresponding to MPI_COMM_WORLD.  */
/* Note that MPI does the memory allocation needed. */
/* ------------------------------------------------ */
MPI_Comm_group(MPI_COMM_WORLD, &group_world);
/* --------------------- */
/* Create the row2 group */
/* --------------------- */
MPI_Group_incl(group_world, s, row2_ranks,
&row2_group);
/* ---------------------------------------------------- */
/* Create the new communicator; MPI assigns the context */
/* ---------------------------------------------------- */
MPI_Comm_create(MPI_COMM_WORLD, row2_group,
&row2_comm);
Once you have created the row2 communicator, you can
broadcast in it via operations like
if (my_rank >= s && my_rank < 2*s) {
    MPI_Comm_rank(row2_comm, &row2_rank);
    if (row2_rank == 0) {
        /* ... set up buffer msg here ... */
    }
    MPI_Bcast(&msg, m*m, MPI_DOUBLE, 0, row2_comm);
}
Warning: MPI_Comm_create() is a collective operation, so all the processes
in the old communicator must call it, even those that will not be part of
the new row2 communicator.
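A minimal sketch of that collective pattern (row2_group and row2_comm are
the variables from the fragment above; the MPI_COMM_NULL check is standard
MPI behaviour):
/* Every rank in MPI_COMM_WORLD makes the same collective call ...        */
MPI_Comm_create(MPI_COMM_WORLD, row2_group, &row2_comm);
/* ... but only the ranks listed in row2_group obtain a valid             */
/* communicator; the others receive MPI_COMM_NULL and simply skip the     */
/* row broadcast.                                                         */
if (row2_comm != MPI_COMM_NULL)
    MPI_Bcast(&msg, m*m, MPI_DOUBLE, 0, row2_comm);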
2.2 Topologies
Creating a set of s communicators for the rows, and another set of s
communicators for the columns (to do the block circular shift of the
matrix B), is clearly cumbersome. MPI lets you define an implicit virtual
topology on the set of processors, corresponding to either a grid or a
general graph. This does not magically provide new hardware interconnects
for the machine, but it does allow you to take an algorithm naturally
stated for a rectangular mesh of processors and implement it more
directly. For the BMR algorithm, this needs to be a square s x s mesh of
processes.
MPI_Comm grid_comm;
int dim_sizes[2], torus[2], coords[2], reorder, grid_rank;
dim_sizes[0] = s;
dim_sizes[1] = s;
torus[0] = TRUE; /* or FALSE, we don't care for BMR */
torus[1] = TRUE;
reorder = TRUE;
/* ----------------------------------------------- */
/* Create the Cartesian grid topology communicator */
/* ----------------------------------------------- */
MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, torus,
                reorder, &grid_comm);
/* Arguments, in order: parent comm, number of dims, size of    */
/* each dim, wrap-around flag per dim, whether MPI may           */
/* optimally (re)assign processes to processors (reorder), and   */
/* the newly created communicator.                               */
/* ----------------------------------------- */
/* Find my rank in the new grid communicator */
/* ----------------------------------------- */
MPI_Comm_rank(grid_comm, &grid_rank);
/* ------------------------------------------- */
/* Find my coords in the new grid communicator */
/* ------------------------------------------- */
MPI_Cart_coords(grid_comm, grid_rank, 2, coords);
Note that the call to MPI_Comm_rank() is needed to
find the new rank in the grid communicator, since MPI was
given permission to reorder the processes (reorder = TRUE).
Then MPI_Cart_coords() returns the coordinates of the
calling process. The inverse, finding the rank given the
coordinates, uses MPI_Cart_rank().
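For example, a small sketch (the coordinate values and variable names are
illustrative, not from the original notes):
int query_coords[2] = {2, 1};   /* row 2, column 1 of the grid */
int their_rank;
/* map grid coordinates back to a rank in grid_comm */
MPI_Cart_rank(grid_comm, query_coords, &their_rank);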
Implementing BMR will be eased with one more function:
MPI_Cart_sub() splits a grid into subgrids of lower
dimension, given a vector that specifies which dimensions
are to be free and which are to be fixed. Specifying the
row communicators takes the form:
MPI_Comm grid_comm;
MPI_Comm row_comm;
int free_coords[2];
free_coords[0] = FALSE;
free_coords[1] = TRUE;
MPI_Cart_sub(grid_comm, free_coords, &row_comm);
and the column communicators:
free_coords[0] = TRUE;
free_coords[1] = FALSE;
MPI_Cart_sub(grid_comm, free_coords, &col_comm);
Using these row_comm and col_comm's, the BMR algorithm
takes the form (in pseudo-code)
for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A_matrix, m*m, MPI_DOUBLE, sender, row_comm);
        /* ... then do local multiply with my_A_matrix */
    } else {
        MPI_Bcast(&T, m*m, MPI_DOUBLE, sender, row_comm);
        /* ... then do local multiply with temp buffer T */
    }
    MPI_Sendrecv_replace(my_B_matrix, m*m, MPI_DOUBLE, dest, 0,
                         source, 0, col_comm, &status);
}
Of course, you have to figure out who is dest and who is
source in the last call.
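One convenient way to obtain them (a sketch, not from the original notes;
it assumes col_comm keeps the 1-D Cartesian topology produced by
MPI_Cart_sub(), and the sign convention is worth double-checking against
the MPI standard) is MPI_Cart_shift():
int source, dest;
/* Displacement -1 along the single dimension of col_comm: dest is the    */
/* process one row above, source the process one row below (with          */
/* wrap-around, since the grid was created periodic), matching the        */
/* upward circular shift of B.                                            */
MPI_Cart_shift(col_comm, 0, -1, &source, &dest);
MPI_Sendrecv_replace(my_B_matrix, m*m, MPI_DOUBLE, dest, 0,
                     source, 0, col_comm, &status);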
There are other topologies possible, and using the graph
topology you can define arbitrary ones: hypercubes, rings,
etc. The grid is the most common one, however, which is why
it has been emphasized above.
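As a sketch of the graph topology (illustrative only, assuming exactly 4
processes; the variable names g_index and g_edges are not from the
original notes), a 4-process ring can be described with MPI_Graph_create():
MPI_Comm ring_comm;
/* g_index[i] = cumulative number of neighbours up to and including node i; */
/* g_edges lists each node's neighbours in order.                           */
int g_index[4] = {2, 4, 6, 8};
int g_edges[8] = {1, 3,  0, 2,  1, 3,  2, 0};
MPI_Graph_create(MPI_COMM_WORLD, 4, g_index, g_edges, 1, &ring_comm);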
3. MPI-2
3.1 Why was it needed? Differences
MPI-1 left out a lot of things that were hard to agree on for a standard.
What MPI-2 added:
- Language issues (inter-language operation)
- Dynamic process control/management
- Establishing Communication
- Single sided Communication
- Intercommunicator Collective Operations
- I/O including Parallel IO (PIO)
3.2 Dynamic process management
All characteristics of message passing are contained within communicators.
Communicators contain:
- process lists or groups
- connection/communication structures (topologies)
- system-derived message tags (envelopes) that separate messages from
  each other
All processes in an MPI-1 application belong to a global communicator
called MPI_COMM_WORLD, and all other communicators are derived from this
global communicator. Communication can only occur within a communicator,
which is what makes communication safe.
All process groups are derived from the membership of the MPI_COMM_WORLD
communicator, and this means there are no external processes => MPI-1
process membership is static, not dynamic:
- simplified consistency reasoning
- fast communication (fixed addressing), even across complex topologies
- interfaces well to the simple run-time systems found on many MPPs
Disadvantages of the static process model:
- if a process fails, all communicators it belongs to become invalid,
  i.e. no fault tolerance
- dynamic resources either cause applications to fail due to loss of
  nodes, or make applications inefficient because they cannot take
  advantage of new nodes by starting/spawning additional processes
- when using a dedicated MPP MPI implementation, you usually cannot use
  off-machine or even off-partition nodes
MPI-2 provides a spawn (or remote start) call:
- depending on the implementation you have (LAM 6.X+ does, MPICH 1.3.1
  does not)
- different vendor versions (NEC/SUN)
Two flavors:
- MPI_Comm_spawn()
  Starts new processes from a single binary and returns an
  intercommunicator to them
- MPI_Comm_spawn_multiple()
  Starts new processes from more than one binary
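A minimal sketch of the single-binary flavor (the "worker" executable name
and the process count of 4 are illustrative assumptions, not from the
original notes):
MPI_Comm intercomm;
int errcodes[4];
/* Spawn 4 copies of a hypothetical "worker" executable; errcodes reports */
/* the launch status of each child.                                       */
MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
               0, MPI_COMM_WORLD, &intercomm, errcodes);
/* The parent can now talk to the children through the intercommunicator. */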
3.3 Parallel I/O operations
3.4 One-sided communication
A normal message-passing operation needs at least two parties:
- a sender who performs a send call
- a receiver who performs a receive call
Why is this? And what does it have to do with… memory
management/protection?
- Earlier Cray MPP systems allowed processes to remotely access other
  processes' memory via shmem_put and shmem_get library calls.
- This is known as Remote Memory Access (RMA).
Remote Memory Access (RMA):
- is fast
- can allow for simple program design
- an operation specifies all the send and receive arguments together
- data (memory) in a fixed range (a window) is made available with an
  MPI_Win_create() call and freed with MPI_Win_free()
- data can then be accessed via MPI_Put(), MPI_Get(), MPI_Accumulate()
- synchronization uses MPI_Win_fence() (or the other MPI_Win
  synchronization calls)
- the communication calls (put/get/accumulate) are non-blocking
- the operation occurs sometime after the call BUT before a
  synchronization point
MPI-2 single-sided communication: RMA communication is in two classes.
- Active
  - memory is moved from one process to another
  - one process calls the move
  - both must call the synchronization (including the owner of the
    target memory)
  - like message passing
- Passive
  - memory is copied to or from the target's window by another (origin)
    process
  - only the origin calls the copy and completes (synchronizes) its
    move; the target does not need to synchronize
  - like shared memory
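A minimal sketch of the active (fence-synchronized) style; the buffer
names, the rank variable and the one-element window are illustrative
assumptions (and it assumes at least two processes):
double local = 3.14, remote = 0.0;
int    rank;
MPI_Win win;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Every process exposes its own buffer as a one-element window.          */
MPI_Win_create(&remote, sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);                    /* open the access epoch        */
if (rank == 0)                            /* only rank 0 initiates a move */
    MPI_Put(&local, 1, MPI_DOUBLE,
            1, 0, 1, MPI_DOUBLE, win);    /* target rank 1, displ. 0      */
MPI_Win_fence(0, win);                    /* close the epoch: data is in  */
MPI_Win_free(&win);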
3.5 Language issues
MPI-1 specifies both ANSI C and F77 bindings, which has disadvantages:
- there is no agreed standard for data type conversion between
  languages, even on the same architecture
- communicators cannot be passed between different-language modules due
  to their representation:
  - in F77 a communicator is an INTEGER
  - in C it is usually a pointer to a structure
- non-standard methods exist in different implementations
Partially solved: C conversion wrappers -> F2C; overloading C++ operators
called on the C++ side -> C2C++.
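MPI-2 standardizes such conversions with handle-conversion functions; a
small sketch in C (the Fortran-side code that would use f_comm is assumed
to exist elsewhere):
MPI_Fint f_comm;
MPI_Comm c_comm;
f_comm = MPI_Comm_c2f(MPI_COMM_WORLD);   /* C handle -> Fortran INTEGER   */
c_comm = MPI_Comm_f2c(f_comm);           /* Fortran INTEGER -> C handle   */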