Parallel programming
with MPI
Jianfeng Yang
Internet and Information Technology Lab
Wuhan University
yjf@whu.edu.cn
Agenda
Part Ⅰ: Seeking Parallelism/Concurrency
Part Ⅱ: Parallel Algorithm Design
Part Ⅲ: Message-Passing Programming
2
Part Ⅰ
Seeking Parallelism/Concurrency
Outline


1 Introduction
2 Seeking Parallel
4
1 Introduction(1/6)



Well done is quickly done – Caesar Augustus
Fast, Fast, Fast---is not “fast” enough.
How to get Higher Performance

Parallel Computing.
5
1 Introduction(2/6)

What is parallel computing?
is the use of a parallel computer to reduce the time
needed to solve a single computational problem.
 is now considered a standard way for computational
scientists and engineers to solve problems in areas as
diverse as galactic evolution, climate modeling, aircraft
design, molecular dynamics and economic analysis.

6
Parallel Computing



A problem is broken down into tasks, performed by
separate workers or processes
Processes interact by exchanging information
What do we basically need?
The ability to start the tasks
 A way for them to communicate

7
1 Introduction(3/6)

What’s a parallel computer?
A multi-processor computer system that supports parallel programming.
Multicomputer
  A parallel computer constructed out of multiple computers and an interconnection network.
  The processors on different computers interact by passing messages to each other.
Centralized multiprocessor (SMP: symmetric multiprocessor)
  A more highly integrated system in which all CPUs share access to a single global memory.
  The shared memory supports communication and synchronization among processors.
8
1 Introduction(4/6)

Multi-core platform
  Two, four, or more cores are integrated in one processor. Each core has its own registers and Level 1 cache; all cores share the Level 2 cache, which supports communication and synchronization among cores.
  All cores share access to a global memory.

9
1 Introduction(5/6)

What’s parallel programming?
  Programming in a language that allows you to explicitly indicate how different portions of the computation may be executed in parallel/concurrently by different processors/cores.
Do I really need parallel programming?
  YES, because:
  Although a lot of research has been invested and many experimental parallelizing compilers have been developed, there is still no commercial system thus far.
  The alternative is to write your own parallel programs.

10
1 Introduction(6/6)

Why should I program using MPI and OpenMP?
  MPI (Message Passing Interface) is a standard specification for message-passing libraries.
  It is available on virtually every parallel computer system, and it is free.
  If you develop programs using MPI, you will be able to reuse them when you get access to a newer, faster parallel computer.
  On a multi-core platform or SMP, the cores/CPUs share a memory space. While MPI is a perfectly satisfactory way for cores/processors to communicate with each other, OpenMP is a better way for the cores within a single processor/SMP to interact.
  A hybrid MPI/OpenMP program can achieve even higher performance.
11
2 Seeking Parallel(1/7)


In order to take advantage of multi-core/multiple
processors, programmers must be able to identify
operations that may be performed in parallel.
Several ways:
Data Dependence Graphs
 Data Parallelism
 Functional Parallelism
 Pipelining
 ……

12
2 Seeking Parallel(2/7)

Data Dependence Graphs



A directed graph.
Each vertex represents a task to be completed.
An edge from vertex u to vertex v means: task u must be completed before task v begins (task v is dependent on task u).
If there is no path from u to v, then the tasks are independent and may be performed in parallel.
13
2 Seeking Parallel(3/7)
Data Dependence Graphs
[Figure: example data dependence graphs; vertices represent tasks (operations), edges represent dependence among tasks.]
14
2 Seeking Parallel(4/7)

Data Parallelism
  Independent tasks applying the same operation to different elements of a data set, e.g.:

for (int i = 0; i < 99; i++)
{
    a[i] = b[i] + c[i];
}
15
2 Seeking Parallel(5/7)

Functional Parallelism
  Independent tasks applying different operations to different data elements of a data set, e.g.:

a = 2;
b = 3;
m = (a + b) / 2;
s = (a*a + b*b) / 2;
v = s - m*m;

The statements computing m and s may be executed in parallel (functional parallelism); the statement computing v depends on both.
16
2 Seeking Parallel(6/7)

Pipelining
  A data dependence graph forming a simple path/chain admits no parallelism if only a single problem instance must be processed.
  If multiple problem instances are to be processed, and the computation can be divided into several stages with roughly the same time consumption, then it can support parallelism.
  E.g., an assembly line.
17
2 Seeking Parallel(7/7)

Pipelining: computing the partial (prefix) sums
p0 = a0
p1 = a0 + a1
p2 = a0 + a1 + a2
p3 = a0 + a1 + a2 + a3

Sequential loop:
p[0] = a[0];
for (int i = 1; i <= 3; i++)
{
    p[i] = p[i-1] + a[i];
}

Unrolled, the stages form a chain that can be pipelined:
p[0] = a[0];
p[1] = p[0] + a[1];
p[2] = p[1] + a[2];
p[3] = p[2] + a[3];

[Figure: a four-stage pipeline; stage i adds a[i] to the partial sum received from stage i-1.]
18
For Example:




Landscape maintenance
Preparing dinner
Data clustering
……
19
Homework


Given a task that can be divided into m subtasks, each requiring one unit of time, how much time is needed for an m-stage pipeline to process n tasks?
Consider the data dependence graph in figure below.


identify all sources of data parallelism;
identify all sources of functional parallelism.
[Figure: data dependence graph with an input task I, intermediate tasks A, B, C, D, and an output task O.]
20
Part Ⅱ
Parallel Algorithm Design
Outline



1.Introduction
2.The Task/Channel Model
3.Foster’s Design Methodology
22
1.Introduction



Foster, Ian. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Reading, MA: Addison-Wesley, 1995.
Describes the Task/Channel Model;
A few simple problems…
23
2.The Task/Channel Model

The model represents a parallel computation as a set of tasks that may interact with each other by sending messages through channels.
Task: a program, its local memory, and a collection of I/O ports.
  The local memory holds the program's instructions and its private data.
24
2.The Task/Channel Model

Channel:
  A task can send local data values to other tasks via output ports;
  A task can receive data values from other tasks via input ports.
A channel is a message queue:
  It connects one task's output port with another task's input port.
  Data values appear at the input port in the same order in which they were placed in the output port at the other end of the channel.
  Receiving data can block: synchronous.
  Sending data never blocks: asynchronous.
Access to local memory is faster than access to nonlocal data.
25
3.Foster’s Design Methodology

Four-step process:
  Partitioning
  Communication
  Agglomeration
  Mapping
[Figure: Problem → Partitioning → Communication → Agglomeration → Mapping.]
26
3.Foster’s Design Methodology

Partitioning
  The process of dividing the computation and the data into pieces.
  More, smaller pieces are better.
How?
  Domain decomposition (data-centric approach)
    First, divide the data into pieces;
    then, determine how to associate computations with the data.
    Focus on the largest and/or most frequently accessed data structures in the program.
  Functional decomposition (function-centric approach)
27
3.Foster’s Design Methodology
Domain Decomposition
[Figure: 1-D, 2-D, and 3-D domain decompositions of a grid into primitive tasks; the higher-dimensional decomposition is better.]
28
3.Foster’s Design Methodology
Functional Decomposition


Yields collections of tasks that achieve parallelism through pipelining.
E.g., a system supporting interactive image-guided surgery, with pipeline stages such as: acquire patient images, register images, track position of instruments, determine image locations, display image.
29
3.Foster’s Design Methodology

The quality of the partition (evaluation checklist):
  There are at least an order of magnitude more primitive tasks than processors in the target parallel computer.
    Otherwise: later design options may be too constrained.
  Redundant computations and redundant data structure storage are minimized.
    Otherwise: the design may not work well when the size of the problem increases.
  Primitive tasks are roughly the same size.
    Otherwise: it may be hard to balance work among the processors/cores.
  The number of tasks is an increasing function of the problem size.
    Otherwise: it may be impossible to use more processors/cores to solve larger problems.
30
3.Foster’s Design Methodology

Communication
  After identifying the primitive tasks, the communication pattern between those primitive tasks should be determined.
  Two kinds of communication:
    Local
    Global
31
3.Foster’s Design Methodology

Communication
  Local:
    When a task needs values from a small number of other tasks in order to perform a computation, a channel is created from the tasks supplying the data to the task consuming the data.
  Global:
    When a significant number of the primitive tasks must contribute data in order to perform a computation.
    E.g., computing the sum of the values held by the primitive processes.
32
3.Foster’s Design Methodology

Communication

Evaluate the communication structure of the designed
parallel algorithm.
The communication operations are balanced among the tasks.
Each task communicates with only a small number of neighbors.
 Tasks can perform their communication in parallel/concurrently.
 Tasks can perform their computations in parallel/concurrently.

33
3.Foster’s Design Methodology

Agglomeration
  Why do we need agglomeration?
    If the number of tasks exceeds the number of processors/cores by several orders of magnitude, simply creating these tasks would be a source of significant overhead.
    So, we combine primitive tasks into larger tasks and map them onto physical processors/cores to reduce the amount of parallel overhead.
  What's agglomeration?
    The process of grouping tasks into larger tasks in order to improve performance or simplify programming.
    When developing MPI programs, ONE task per core/processor is better.
34
3.Foster’s Design Methodology

Agglomeration
Goals
  Goal 1: lower communication overhead.
    Eliminate communication among tasks;
    Increase the locality of the parallel algorithm;
    Combine groups of sending and receiving tasks.
35
3.Foster’s Design Methodology
 Agglomeration

Goal 2: maintain the scalability of the parallel design.
  Ensure that we have not combined so many tasks that we will not be able to port our program at some point in the future to a computer with more processors/cores.
  E.g., a 3-D matrix operation of size 8*128*258.

36
3.Foster’s Design Methodology
 Agglomeration

Goals 3: reduce software engineering costs.

Make greater use of the existing sequential code.


Reducing time;
Reducing expense.
37
3.Foster’s Design Methodology

Agglomeration evaluation checklist:
  The agglomeration has increased the locality of the parallel algorithm.
  Replicated computations take less time than the communications they replace.
  The amount of replicated data is small enough to allow the algorithm to scale.
  Agglomerated tasks have similar computational and communication costs.
  The number of tasks is an increasing function of the problem size.
  The number of tasks is as small as possible, yet at least as great as the number of cores/processors in the target computer.
  The trade-off between the chosen agglomeration and the cost of modifications to existing sequential code is reasonable.
38
3.Foster’s Design Methodology

Mapping
[Figure: eight tasks A–H agglomerated and mapped onto processors.]
Goals:
  Increasing processor utilization
  Minimizing inter-processor communication
39
Part Ⅲ
Message-Passing Programming
Preface
[Figure: a sequential program prog_a: load data, process, store results.]
41
[Figure: the same program prog_a running on Node 1, Node 2, and Node 3.]
42
[Figure: processes 0, 1, and 2 each load and process part of the data; the results are then gathered and stored.]
43
Hello World!
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Sample output with 4 processes (line order may vary):
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
44
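A note on building and running (a minimal sketch; the exact commands depend on your MPI installation, e.g. MPICH or Intel MPI): the program is typically compiled with the MPI compiler wrapper, mpicc hello.c -o hello, and launched with mpiexec -n 4 ./hello (or mpirun -np 4 ./hello).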
Outline








Introduction
The Message-Passing Model
The Message-Passing Interface (MPI)
Communication Mode
Circuit satisfiability
Point-to-Point Communication
Collective Communication
Benchmarking parallel performance
45
Introduction


MPI: Message Passing Interface
  A library, not a parallel language (it is used from C and Fortran: C & MPI, Fortran & MPI).
  A standard, not a particular implementation. Implementations include:
    MPICH
    Intel MPI
    MS MPI
    LAM/MPI
  A message-passing model.
46
Introduction

The history of MPI:
Draft: 1992
 MPI-1: 1994
 MPI-2:1997


http://www.mpi-forum.org
47
Introduction

MPICH:



http://www-unix.mcs.anl.gov/mpi/mpich1/download.html;
http://wwwunix.mcs.anl.gov/mpi/mpich2/index.htm#download
Main Features:






Open source;
Tracks the MPI standard;
Supports MPMD (Multiple Program Multiple Data) and heterogeneous clusters;
Supports C/C++, Fortran 77 and Fortran 90;
Supports Unix and Windows NT platforms;
Supports multi-core, SMP, cluster, and large-scale parallel computer systems.
48
Introduction

Intel MPI
Conforms to the MPI-2 standard.
Latest version: 3.1.
Uses DAPL (Direct Access Programming Library).
49
Introduction-Intel MPI

Intel® MPI
Library Supports
Multiple
Hardware Fabrics
50
Introduction-Intel MPI

Features
is a multi-fabric message passing library.
 implements the Message Passing Interface, v2 (MPI-2)
specification.
 provides a standard library across Intel® platforms that:

Focuses on making applications perform best on IA based
clusters
 Enables adoption of the MPI-2 functions as the customer
needs dictate
 Delivers best in class performance for enterprise, divisional,
departmental and workgroup high performance computing

51
Introduction-Intel MPI

Why Intel MPI Library?
High performance MPI-2 implementation
 Linux and Windows CCS support
 Interconnect independence
 Smart fabric selection
 Easy installation
 Free Runtime Environment
 Close integration with the Intel and 3rd party
development tools
 Internet based licensing and technical support

52
Introduction-Intel MPI

Standards Based


Argonne National Laboratory's MPICH-2
implementation.
Integration, can be easily integrated with:
• Platform LSF 6.1 and higher
• Altair PBS Pro* 7.1 and higher
• OpenPBS* 2.3
• Torque* 1.2.0 and higher
• Parallelnavi* NQS* for Linux V2.0L10 and higher
• Parallelnavi for Linux Advanced Edition V1.0L10A
and higher
• NetBatch* 6.x and higher
53
Introduction-Intel MPI

System Requirements:

Host and Target Systems hardware:
• IA-32, Intel® 64, or IA-64 architecture using Intel®
Pentium® 4,
Intel® Xeon® processor, Intel® Itanium processor family
and compatible platforms
• 1 GB of RAM - 4 GB recommended
• Minimum 100 MB of free hard disk space - 10GB
recommended.
54
Introduction-Intel MPI

Operating Systems Requirements:












Microsoft Windows* Compute Cluster Server 2003 (Intel® 64 architecture
only)
Red Hat Enterprise Linux* 3.0, 4.0, or 5.0
SUSE* Linux Enterprise Server 9 or 10
SUSE Linux 9.0 thru 10.0 (all except Intel® 64 architecture starts at 9.1)
HaanSoft Linux 2006 Server*
Miracle Linux* 4.0
Red Flag* DC Server 5.0
Asianux* Linux 2.0
Fedora Core 4, 5, or 6 (IA-32 and Intel 64 architectures only)
TurboLinux*10 (IA-32 and Intel® 64 architecture)
Mandriva/Mandrake* 10.1 (IA-32 architecture only)
SGI* ProPack 4.0 (IA-64 architecture only) or 5.0 (IA-64 and Intel 64
architectures)
55
The Message-Passing Model
[Figure: processors, each with its own local memory, connected by an interconnection network.]
56
The Message-Passing Model


A task in the task/channel model becomes a process in the message-passing model.
The number of processes:
  Is specified by the user;
  Is specified when the program begins;
  Is constant throughout the execution of the program.
Each process:
  Has a unique ID number (rank).
[Figure: processors with local memory connected by an interconnection network.]
57
The Message-Passing Model

Goals of Message-Passing Model:

Communication with each other;

Synchronization with each other;
58
The Message-Passing Interface
(MPI)

Advantages:

Runs well on a wide variety of MPMD architectures;
Easier to debug;
Thread safe.
59
What is in MPI







Point-to-point message passing
Collective communication
Support for process groups
Support for communication contexts
Support for application topologies
Environmental inquiry routines
Profiling interface
60
Introduction to Groups &
Communicator



Process model and groups
Communication scope
Communicators
61
Process model and groups

The fundamental computational unit is the process. Each process has:
  an independent thread of control,
  a separate address space.
MPI processes execute in MIMD style, but:
  No mechanism for loading code onto processors, or assigning processes to processors;
  No mechanism for creating or destroying processes.
MPI supports dynamic process groups:
  Process groups can be created and destroyed;
  Membership is static;
  Groups may overlap.
No explicit support for multithreading, but MPI is designed to be thread-safe.
62
Communication scope

In MPI, a process is specified by:
  a group,
  a rank relative to the group.
A message label is specified by:
  a message context,
  a message tag relative to the context.
Groups are used to partition process space.
Contexts are used to partition "message label space".
Groups and contexts are bound together to form a communicator object. Contexts are not visible at the application level.
A communicator defines the scope of a communication operation.
63
Communicators


Communicators are used to create independent ``message
universes''.
Communicators are used to disambiguate message
selection when an application calls a library routine that
performs message passing. Nondeterminacy may arise



if processes enter the library routine asynchronously,
if processes enter the library routine synchronously, but there are
outstanding communication operations.
A communicator



binds together groups and contexts
defines the scope of a communication operation
is represented by an opaque object
64



A communicator handle defines which processes a
particular command will apply to
All MPI communication calls take a communicator
handle as a parameter, which is effectively the
context in which the communication will take place
MPI_INIT defines a communicator called
MPI_COMM_WORLD for each process that calls
it
65



Every communicator contains a group which is a
list of processes
The processes are ordered and numbered
consecutively from 0.
The number of each process is known as its rank


The rank identifies each process within the
communicator
The group of MPI_COMM_WORLD is the set of
all MPI processes
66
Skeleton MPI Program
#include <mpi.h>

int main( int argc, char** argv ) {
    MPI_Init( &argc, &argv );
    /* main part of the program */
    MPI_Finalize();
    return 0;
}
67
Circuit satisfiability
What combinations of the input values will make the circuit output the value 1?
[Figure: a combinational circuit with 16 inputs, a through p.]
68
Circuit satisfiability

Analysis:
  16 inputs, a–p, each taking the value 0 or 1.
  2^16 = 65,536 combinations to check.
  Design a parallel algorithm.
Partition:
  Functional decomposition;
  No channels between tasks;
  Tasks are independent;
  Well suited to parallelism.
[Figure: 65,536 primitive tasks (1, 2, 3, …, 65536), each feeding an output task; sidebar: Partition, Communication, Agglomeration, Mapping.]
69
Circuit satisfiability

Communication:
  Tasks are independent, so no channels (no inter-task communication) are needed.
70
Circuit satisfiability

Agglomeration and Mapping



Fixed number of tasks;
The time for each task to complete is variable. WHY? (The && expression in check_circuit short-circuits, so unsatisfying combinations are rejected early.)
How to balance the computation load?
  Map tasks to processes in cyclic (round-robin) fashion: task i goes to process i mod p.
[Figure: tasks 0–19 assigned cyclically to processors/cores 0–5.]
71
Circuit satisfiability

Each process will examine its combinations of inputs in turn (cyclic allocation).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char * argv[])
{
    int i;
    int id;                          /* process rank */
    int p;                           /* number of processes */
    void check_circuit(int, int);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    for (i = id; i < 65536; i += p)  /* cyclic allocation of combinations */
        check_circuit(id, i);
    printf("process %d is done\n", id);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}
72
Circuit satisfiability
#define EXTRACT_BIT(n,i) ((n&(1<<i))?1:0)

void check_circuit(int id, int z) {
    int v[16];
    int i;
    for (i = 0; i < 16; i++) v[i] = EXTRACT_BIT(z, i);
    if ((v[0] || v[1]) && (!v[1] || !v[3]) && (v[2] || v[3])
        && (!v[3] || !v[4]) && (v[4] || !v[5])
        && (v[5] || !v[6]) && (v[5] || v[6])
        && (v[6] || !v[15]) && (v[7] || !v[8])
        && (!v[7] || !v[13]) && (v[8] || v[9])
        && (v[9] || v[11]) && (v[10] || v[11])
        && (v[12] || v[13]) && (v[13] || !v[14])
        && (v[14] || v[15]))
    {
        printf("%d) %d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d\n", id,
               v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7],
               v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
        fflush(stdout);
    }
}
73
Point-to-Point Communication



Overview
Blocking Behaviors
Non-Blocking Behaviors
74
overview
A message is sent from a sender to a receiver.
There are several variations on how the sending of a message can interact with the program.
75

Synchronous

does not complete until
the message has been
received

A FAX or registered mail
76

Asynchronous

completes as soon as the
message is on the way.

A post card or email
77
communication modes

The communication mode is selected by the send routine used:
synchronous mode ("safest")
 ready mode (lowest system overhead)
 buffered mode (decouples sender from receiver)
 standard mode (compromise)


Calls are also blocking or nonblocking.
Blocking stops the program until the message buffer is
safe to use
 Non-blocking separates communication from
computation

78
Blocking Behavior

int MPI_Send(void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm)
buf is the beginning of the buffer containing the data to be
sent. For Fortran, this is often the name of an array in your
program. For C, it is an address.
 count is the number of elements to be sent (not bytes)
 datatype is the type of data
 dest is the rank of the process which is the destination for
the message
 tag is an arbitrary number which can be used to distinguish
among messages
 comm is the communicator

79
Temporary Knowledge

Message



Message body: buf, count, datatype
Message envelope: dest, tag, comm
Tag: why do we need it?
Process P: send A,32,Q ; send B,16,Q ;
Process Q: recv X, 32, P ; recv Y, 16, P ;
Process P: send A,32,Q,tag1 ; send B,16,Q,tag2 ;
Process Q: recv X, 32, P, tag1 ; recv Y, 16, P, tag2
80
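A minimal sketch of the second version above as actual MPI calls (the element type MPI_DOUBLE, the ranks, and the tag values 1 and 2 are assumptions added for illustration); the tags let process Q state which of the two pending messages each receive should match:

/* Process P (rank 0) */
MPI_Send(A, 32, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);            /* tag 1 */
MPI_Send(B, 16, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);            /* tag 2 */

/* Process Q (rank 1) */
MPI_Recv(X, 32, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);   /* matches tag 1 */
MPI_Recv(Y, 16, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, &status);   /* matches tag 2 */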
81

When using standard-mode send
It is up to MPI to decide whether outgoing messages
will be buffered.
Completes once the message has been sent, which may or may not imply that the message has arrived at its destination.
 Can be started whether or not a matching receive has
been posted. It may complete before a matching receive
is posted.
 Has non-local completion semantics, since successful
completion of the send operation may depend on the
occurrence of a matching receive.

82
Blocking Standard Send
83
MPI_Recv

int MPI_Recv(void *buf, int count, MPI_Datatype
datatype, int source, int tag, MPI_Comm comm,
MPI_Status *status)







buf is the beginning of the buffer where the incoming data are to be
stored. For Fortran, this is often the name of an array in your program.
For C, it is an address.
count is the number of elements (not bytes) in your receive buffer
datatype is the type of data
source is the rank of the process from which data will be accepted (This
can be a wildcard, by specifying the parameter MPI_ANY_SOURCE.)
tag is an arbitrary number which can be used to distinguish among
messages (This can be a wildcard, by specifying the parameter
MPI_ANY_TAG.)
comm is the communicator
status is an array or structure of information that is returned. For
example, if you specify a wildcard for source or tag, status will tell you the
actual rank or tag for the message received
84
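A short sketch of the wildcards mentioned above (buffer and variable names are illustrative): receive from any source with any tag, then read the actual sender and tag out of the status.

MPI_Status status;
int data[100];

/* accept a message from any sender, with any tag */
MPI_Recv(data, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

printf("message from rank %d with tag %d\n",
       status.MPI_SOURCE, status.MPI_TAG);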
Blocking Synchronous Send
87
Cont.




can be started whether or not a matching receive was
posted
will complete successfully only if a matching receive
is posted, and the receive operation has started to
receive the message sent by the synchronous send.
provides synchronous communication semantics: a
communication does not complete at either end
before both processes rendezvous at the
communication.
has non-local completion semantics.
88
Blocking Ready Send
89




completes immediately
may be started only if the matching receive has
already been posted.
has the same semantics as a standard-mode
send.
saves on overhead by avoiding handshaking
and buffering
90
Blocking Buffered Send
91



Can be started whether or not a matching
receive has been posted. It may complete
before a matching receive is posted.
Has local completion semantics: its completion
does not depend on the occurrence of a
matching receive.
In order to complete the operation, it may be
necessary to buffer the outgoing message
locally. For that purpose, buffer space is
provided by the application.
92
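Because the application supplies the buffer space for buffered mode, a typical pattern (a sketch, assuming <stdlib.h> is included and a message of 100 ints) is to attach a buffer before MPI_Bsend and detach it afterwards:

int  msg[100];
int  bufsize = 100 * sizeof(int) + MPI_BSEND_OVERHEAD;
char *buffer = (char *) malloc(bufsize);

MPI_Buffer_attach(buffer, bufsize);     /* hand the buffer space to MPI */
MPI_Bsend(msg, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
MPI_Buffer_detach(&buffer, &bufsize);   /* blocks until buffered messages are delivered */
free(buffer);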
Non-Blocking Behavior

MPI_Isend
(buf,count,dtype,dest,tag,comm,request)

MPI_Wait (request,status)


request matches the request on Isend or Irecv;
status returns a status equivalent to the status for Recv when complete.
For a send, the wait blocks until the message is buffered or sent, so the message variable is free to reuse;
for a receive, it blocks until the message has been received and is ready.
(A usage sketch follows below.)
93
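A minimal sketch (the partner rank, the buffer names, the element count N, and do_independent_work() are all assumptions for illustration) of overlapping communication with computation: post the non-blocking calls, compute on data that does not touch the buffers, then wait before reusing them.

MPI_Request reqs[2];
MPI_Status  stats[2];

MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

do_independent_work();          /* computation that does not use the buffers */

MPI_Waitall(2, reqs, stats);    /* both buffers are now safe to use again */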
Non-blocking Synchronous Send


int MPI_Issend (void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm,
MPI_Request *request)
IN = provided by programmer, OUT = set by routine
buf: starting address of message buffer (IN)
count: number of elements in message (IN)
datatype: type of elements in message (IN)
dest: rank of destination task in communicator comm (IN)
tag: message tag (IN)
comm: communicator (IN)
request: identifies a communication event (OUT)
94
Non-blocking Ready Send

int MPI_Irsend (void *buf, int count,
MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
95
Non-blocking Buffered Send

int MPI_Ibsend (void *buf, int count,
MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
96
Non-blocking Standard Send

int MPI_Isend (void *buf, int count,
MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
97
Non-blocking Receive

IN = provided by programmer, OUT = set by routine
buf: starting address of message buffer (OUT: buffer contents written)
count: number of elements in message (IN)
datatype: type of elements in message (IN)
source: rank of source task in communicator
comm (IN)
tag: message tag (IN)
comm: communicator (IN)
request: identifies a communication event (OUT)
98

int MPI_Irecv (void* buf, int count, MPI_Datatype
datatype, int source, int tag, MPI_Comm comm,
MPI_Request *request)
99

request: identifies a communication event (INOUT)
status: status of communication event (OUT)
count: number of communication events (IN)
index: index in array of requests of completed
event (OUT)
incount: number of communication events (IN)
outcount: number of completed events (OUT)
100




int MPI_Wait (MPI_Request *request, MPI_Status
*status)
int MPI_Waitall (int count, MPI_Request
*array_of_requests, MPI_Status *array_of_statuses)
int MPI_Waitany (int count, MPI_Request
*array_of_requests, int *index, MPI_Status *status)
int MPI_Waitsome (int incount, MPI_Request
*array_of_requests, int *outcount, int*
array_of_indices, MPI_Status *array_of_statuses)
101
Communication Mode    Blocking Routine    Non-Blocking Routine
Synchronous           MPI_SSEND           MPI_ISSEND
Ready                 MPI_RSEND           MPI_IRSEND
Buffered              MPI_BSEND           MPI_IBSEND
Standard              MPI_SEND            MPI_ISEND
(Receive)             MPI_RECV            MPI_IRECV
102
Mode: Advantages / Disadvantages
Synchronous: Safest, and therefore most portable; SEND/RECV order not critical; amount of buffer space irrelevant. / Can incur substantial synchronization overhead.
Ready: Lowest total overhead; SEND/RECV handshake not required. / RECV must precede SEND.
Buffered: Decouples SEND from RECV; no sync overhead on SEND; order of SEND/RECV irrelevant; programmer can control size of buffer space. / Additional system overhead incurred by copy to buffer.
Standard: Good for many cases. / Your program may not be suitable.
103
MPI Quick Start
MPI_Init
MPI_BCast
MPI_Wtime
MPI_Comm_rank
MPI_Scatter
MPI_Wtick
MPI_Comm_size
MPI_Gather
MPI_Barrier
MPI_Send
MPI_Reduce
MPI_Recv
MPI_Finalize
MPI_Xxxxx
104
MPI Routines

MPI_Init







MPI_Init(&argc, &argv);
Initializes the MPI execution environment.
argc: pointer to the number of arguments
argv: pointer to the argument vector
The first MPI function call;
Allows the system to do any setup needed to handle further calls to the MPI library.
Defines a communicator called MPI_COMM_WORLD for each process that calls it.
MPI_Init must be called before any other MPI function.
  Exception: MPI_Initialized checks to see if MPI has been initialized; it may be called before MPI_Init (see the sketch below).
105
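A tiny sketch of that exception (the flag variable is illustrative); MPI_Initialized is a standard call that may be made at any time:

int flag;
MPI_Initialized(&flag);          /* flag is nonzero if MPI_Init has already been called */
if (!flag)
    MPI_Init(&argc, &argv);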
MPI Routines

MPI_Comm_rank
int MPI_Comm_rank(MPI_Comm comm, int* rank)



To determine a process’s ID number.
Return: Process’s ID by rank
Communicator:
  MPI_COMM_WORLD, defined when MPI is initialized, includes all processes.
MPI_Comm_rank(MPI_COMM_WORLD, &id);
106
MPI Routines

MPI_Comm_size
int MPI_Comm_size(MPI_Comm comm, int* size)

To find the number of processes -- size
MPI_Comm_size(MPI_COMM_WORLD, &p);
107
MPI Routines
MPI_Send
  The source process sends the data in its buffer to the destination process.

int MPI_Send(
    void* buf,
    int count,
    MPI_Datatype datatype,
    int dest,
    int tag,
    MPI_Comm comm)

buf       The starting address of the data to be transmitted.
count     The number of data items.
datatype  The type of the data items (all of the data items must be of the same type).
dest      The rank of the process to receive the data.
tag       An integer "label" for the message, allowing messages serving different purposes to be identified.
comm      Indicates the communicator in which this message is being sent.
108
MPI Routines

MPI_Send
Blocks until the message buffer is once again available.
 MPI constants for C data types.

109
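The slide above refers to a table of MPI constants for C data types; a few of the common correspondences (not an exhaustive list) are:
MPI_CHAR → char, MPI_INT → int, MPI_LONG → long, MPI_UNSIGNED → unsigned int, MPI_FLOAT → float, MPI_DOUBLE → double.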
MPI Routines

MPI_Recv

int MPI_Recv(
    void* buf,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm comm,
    MPI_Status* status)

buf       The starting address where the received data is to be stored.
count     The maximum number of data items the receiving process is willing to receive.
datatype  The type of the data items.
source    The rank of the process sending this message.
tag       The desired tag value for the message.
comm      Indicates the communicator in which this message is being passed.
status    MPI data structure that returns the status of the receive.
110
MPI Routines

MPI_Recv
  Receives a message from the source process.
  The data type and tag of the message received must match the data type and tag specified in the MPI_Recv call.
  The number of data items received must not exceed the count specified in this call; otherwise an overflow error condition occurs.
  If count equals zero, the message is empty.
  Blocks until the message has been received, or until an error condition causes the function to return.
111
MPI Routines

MPI_Recv status fields:
status->MPI_SOURCE   The rank of the process sending the message.
status->MPI_TAG      The message's tag value.
status->MPI_ERROR    The error condition.

int MPI_Abort(MPI_Comm comm, int errorcode)
112
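Because the received message may contain fewer items than the count passed to MPI_Recv, the actual number can be read from the status with MPI_Get_count (a standard call; the variable names are illustrative):

MPI_Status status;
int buffer[100], received;

MPI_Recv(buffer, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &received);   /* items actually received, at most 100 */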
MPI Routines

MPI_Finalize
  Allows the system to free up resources, such as memory, that have been allocated to MPI.
  Without MPI_Finalize, the behavior of the program is undefined.
  MPI_Finalize();
113
summary
MPI_Init
MPI_Comm_rank
MPI_Comm_size
MPI_Send
MPI_Recv
MPI_Finalize
114
Collective communication


Communication operation
A group of processes work together to distribute
or gather together a set of one or more values.
Process
Process 0
Run
Time
Process 1
Process 2
Call Syn (1)
Call Syn (2)
Wait
Call Syn (3)
Syn point
Wait
Parallel Executing
115
Collective communication

MPI_Bcast
  A root process broadcasts one or more data items of the same type to all other processes in a communicator.
[Figure: before the broadcast only the root holds the value A; after the broadcast every process holds A.]
116
Collective communication

MPI_Bcast

int MPI_Bcast(
    void* buffer,            /* addr of 1st broadcast element */
    int count,               /* #elements to be broadcast */
    MPI_Datatype datatype,   /* type of elements to be broadcast */
    int root,                /* ID of the process doing the broadcast */
    MPI_Comm comm)           /* communicator */
117
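A minimal usage sketch (variable names are illustrative): the root rank 0 obtains a value and shares it with every process in the communicator.

int n;
if (rank == 0)
    n = 100;                                    /* only the root knows n so far */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* now every process has n */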
Collective communication

MPI_Scatter
  The root process sends a different part of its data to each of the other processes.
[Figure: the root's sending buffer holds parts A, B, C, D, …; after the scatter, each process's receiving buffer holds one part.]
118
Collective communication

MPI_Scatter

int MPI_Scatter(
    void* sendbuf,           /* starting addr of sending buffer */
    int sendcount,           /* #elements sent to each process */
    MPI_Datatype sendtype,   /* type of elements to be sent */
    void* recvbuf,           /* starting addr of receiving buffer */
    int recvcount,           /* #elements in receiving buffer */
    MPI_Datatype recvtype,   /* type of elements to be received */
    int root,                /* ID of the root process doing the scatter */
    MPI_Comm comm)           /* communicator */
119
Collective communication

MPI_Gather
  Each process sends the data in its buffer to the root process.
[Figure: each process's sending buffer holds one part (A, B, C, or D); after the gather, the root's receiving buffer holds A, B, C, D, ….]
120
Collective communication

MPI_Gather

int MPI_Gather(
    void* sendbuf,           /* starting addr of sending buffer */
    int sendcount,           /* #elements sent by each process */
    MPI_Datatype sendtype,   /* type of elements to be sent */
    void* recvbuf,           /* starting addr of receiving buffer (significant at root) */
    int recvcount,           /* #elements received from each process */
    MPI_Datatype recvtype,   /* type of elements to be received */
    int root,                /* ID of the root process doing the gather */
    MPI_Comm comm)           /* communicator */
121
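A small sketch combining the two calls (the chunk size, array sizes, and the doubling loop are assumptions for illustration): the root scatters one chunk to each process, each process works on its chunk, and the results are gathered back at the root.

#define CHUNK 4
int full[CHUNK * 16];            /* root's array, assuming at most 16 processes */
int part[CHUNK];

MPI_Scatter(full, CHUNK, MPI_INT, part, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

for (int i = 0; i < CHUNK; i++)  /* each process works only on its own chunk */
    part[i] *= 2;

MPI_Gather(part, CHUNK, MPI_INT, full, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);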
Collective communication

MPI_Reduce
After a process has completed its share of the work, it is ready to participate in the reduction operation.
MPI_Reduce performs one or more reduction operations on values submitted by all the processes in a communicator.

122
Collective communication

MPI_Reduce

int MPI_Reduce(
    void* operand,       /* addr of 1st reduction element */
    void* result,        /* addr of 1st reduction result */
    int count,           /* number of reductions to perform */
    MPI_Datatype type,   /* type of elements */
    MPI_Op operator,     /* reduction operator */
    int root,            /* process getting the result(s) */
    MPI_Comm comm)       /* communicator */
123
Collective communication

MPI_Reduce
MPI's built-in reduction operators:
MPI_BAND     Bitwise and
MPI_BOR      Bitwise or
MPI_BXOR     Bitwise exclusive or
MPI_LAND     Logical and
MPI_LOR      Logical or
MPI_LXOR     Logical exclusive or
MPI_MAX      Maximum
MPI_MAXLOC   Maximum and location of maximum
MPI_MIN      Minimum
MPI_MINLOC   Minimum and location of minimum
MPI_PROD     Product
MPI_SUM      Sum
124
summary
125
Benchmarking parallel
performance


Measure the performance of a parallel application.
How?
Measuring the number of seconds that elapse from the
time we initiate execution until the program terminates.
 double MPI_Wtime(void)



Returns the number of seconds that have elapsed since some point of time in the past.
double MPI_Wtick(void)

Returns the precision of the result returned by MPI_Wtime.
129
Benchmarking parallel
performance

MPI_Barrier

int MPI_Barrier(MPI_Comm comm)

comm: indicates the communicator whose processes participate in the barrier synchronization.
MPI_Barrier blocks until every process in the communicator has called it, so it can be used to start the timing from a common point:

double elapsed_time;
MPI_Init(&argc, &argv);
MPI_Barrier(MPI_COMM_WORLD);      /* synchronize before starting the timer */
elapsed_time = -MPI_Wtime();
....
MPI_Reduce(&solutions, &global_solutions, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
elapsed_time += MPI_Wtime();
130
For example

Send and receive operation

#include "mpi.h"
int main(int argc, char * argv[])
{
    ....
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    if (myrank == 0)
    {
        MPI_Send(message, length, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    }
    else if (myrank == 1)
    {
        MPI_Recv(message, length, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize();
    return 0;
}
131
For example

Compute pi
\int_0^1 \frac{1}{1+x^2}\,dx = \arctan(x)\Big|_0^1 = \arctan(1) - \arctan(0) = \pi/4

f(x) = \frac{4}{1+x^2}, \qquad \int_0^1 f(x)\,dx = \pi
132
For example
\int_0^1 \frac{4}{1+x^2}\,dx \approx \frac{1}{N}\sum_{i=1}^{N} f\!\left(\frac{i-0.5}{N}\right)

(the midpoint rule: N rectangles of width 1/N, each evaluated at its midpoint)
133
For example

Compute pi

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
sum = 0.0;
for (int i = myrank + 1; i <= n; i += numprocs)
{
    x = h * (i - 0.5);
    sum += 4.0 / (1.0 + x * x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
134
For example

Matrix Multiplication

MPI_Scatter(&iaA[0][0], N, MPI_INT, &iaA[iRank][0], N, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&iaB[0][0], N*N, MPI_INT, 0, MPI_COMM_WORLD);
for (i = 0; i < N; i++)
{
    temp = 0;
    for (j = 0; j < N; j++)
    {
        temp = temp + iaA[iRank][j] * iaB[j][i];
    }
    iaC[iRank][i] = temp;
}
MPI_Gather(&iaC[iRank][0], N, MPI_INT, &iaC[0][0], N, MPI_INT, 0, MPI_COMM_WORLD);
135
136
l 1
Ci , j   ai ,k bk , j
k 0
where A is an n x l matrix and B is an l x m matrix.
137
for (i = 0; i < n; i++)
for (j = 0; j < n; j++) {
c[i][j] = 0;
for (k = 0; k < n; k++)
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
140
141
Summary




MPI is a library.
The six foundational functions of MPI.
Collective communication.
The MPI communication model.
142
Thanks!
Feel free to contact me via
yjf@whu.edu.cn
for any questions or suggestions.
And
Welcome to Wuhan University!