Message Passing: MPI on Origin Systems


MPI Programming Model

Compiling MPI Programs

      cc  -64 compute.c -lmpi
      f77 -64 -LANG:recursive=on compute.f -lmpi
      f90 -64 -LANG:recursive=on compute.f -lmpi
      CC  -64 compute.c -lmpi++ -lmpi

• -64 is not required, but it improves functionality and optimization
• With compiler level 7.2.1 or higher, f77/f90 can use -auto_use mpi_interface for compile-time subroutine interface checking
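For example, the interface-checking flag can be combined with the options above (a representative command line, not from the original slides):

      f90 -64 -auto_use mpi_interface -LANG:recursive=on compute.f -lmpi
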
Compiling MPI Programs

• Must use the header files from /usr/include, since the SGI libraries are built with them (do not use a public domain version)
  – Fortran: mpif.h or USE MPI
  – C: mpi.h
  – C++: mpi++.h
• The mpi_init version must match the language of the main program (if MPI is called from multiple shared memory threads, mpi_init_thread must be used)
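A minimal sketch of the threaded initialization; the requested thread level here is an illustrative choice, not from the original slides:

      program mpithreaded
      include 'mpif.h'
      integer provided, ierr
! ask for full thread support; the library reports the level it actually provides
      call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
      if (provided .lt. MPI_THREAD_MULTIPLE)
     &   print *, 'thread support limited to level', provided
      call mpi_finalize(ierr)
      end
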
Compiling MPI Programs

• MPI definitions:
  – Fortran: MPI_XXXX (not case sensitive)
  – C: MPI_Xxxx (mixed upper and lower case)
  – C++: Xxxx (part of the MPI:: namespace)
• Every entry point MPI_ in the MPI library has a "shadow" entry point PMPI_ to aid the implementation of user profiling
• Array Services (arrayd) is required to run MPI
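As an illustration of the PMPI_ shadow entry points, a user can interpose a routine of the same name that forwards to the shadow routine; this sketch (not from the original slides) simply counts calls to mpi_send, with real buf(*) standing in for the choice-type buffer argument:

      subroutine MPI_SEND(buf, count, datatype, dest, tag, comm, ierr)
      integer count, datatype, dest, tag, comm, ierr
      real buf(*)
      integer nsend
      save nsend
      data nsend /0/
! user profiling code: count the sends, then forward to the library routine
      nsend = nsend + 1
      call PMPI_SEND(buf, count, datatype, dest, tag, comm, ierr)
      end
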
Basic MPI Features

MPI Basic Calls

MPI has a large number of calls. The following are the most basic:

• every MPI program has to start and finish with these calls (the first and the last executable statements):
      mpi_init
      mpi_finalize
• essential inquiry about the environment:
      mpi_comm_size
      mpi_comm_rank
• basic communication calls:
      mpi_send
      mpi_recv
• basic synchronization calls:
      mpi_barrier

Example:

      Program mpitest
      include "mpif.h"
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
      do I=0,np-1
        if(I.eq.id) print *,'np, id',np,id
        call mpi_barrier(MPI_COMM_WORLD,ierr)
      enddo
      call mpi_finalize(ierr)
      stop
      end

Compile with:
      f77 -o mpitest -LANG:recursive=on mpitest.f -lmpi
Run with:
      mpirun -np N [-stats -prefix "%g"] mpitest
MPI send and receive Calls

      mpi_send(buf,count,datatype,dest,tag,comm,ierr)
      mpi_recv(buf,count,datatype,source,tag,comm,stat,ierr)

buf           data to be sent/received
count         number of items to send; size of buf for recv
datatype      type of the data items to send/recv
              (MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, etc.; MPI_INT, MPI_FLOAT, MPI_DOUBLE in C)
dest/source   id of the peer process (recv also accepts MPI_ANY_SOURCE)
tag           integer mark of the message (recv also accepts MPI_ANY_TAG)
comm          communicator handle (MPI_COMM_WORLD)
stat          status of the message, of MPI_STATUS type; in Fortran:
              INTEGER stat(MPI_STATUS_SIZE)
              call mpi_get_count(stat,MPI_REAL,nitems,ierr)
              where nitems can be <= count

Check for errors:
      if(ierr.ne.MPI_SUCCESS) call abort()
Using send and receive Calls

Example:

      if(mod(id,2).eq.0) then
        idst = mod(id+1,np)
        itag = 0
        call mpi_send(A,N,MPI_REAL,idst,itag,MPI_COMM_WORLD,ierr)
        if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
      else
        isrc = mod(id-1+np,np)
        itag = MPI_ANY_TAG
        call mpi_recv(B,NSIZE,MPI_REAL,isrc,itag,MPI_COMM_WORLD,stat,ierr)
        if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
        call mpi_get_count(stat,MPI_REAL,N,ierr)
      endif

Rules of use:
• mpi_send/recv are defined as blocking calls
  – the program should not assume blocking behaviour (small messages are buffered)
  – when these calls return, the buffers can be (re-)used
• the arrival order of messages sent from A and B to C is not determined;
  two messages from A to B will arrive in the order sent
• Message Passing programming models are non-deterministic
Another Simple Example

MPI send/receive: Buffering

An MPI program should not assume buffering of messages. The following program is erroneous:

      Program long_messages
      include 'mpif.h'
      real*8 h(4000)
      integer stat(MPI_STATUS_SIZE)
      call mpi_init(info)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, info)
      call mpi_comm_size(MPI_COMM_WORLD, npes, info)
      do I = 1000, 4000, 100       ! increasing size of the message
        call mpi_barrier(MPI_COMM_WORLD,info)
        print *,'mype=',mype,' before send',I
        call mpi_send(h,I,MPI_REAL8,mod(mype+1,npes),I,
     &                MPI_COMM_WORLD,info)
        call mpi_barrier(MPI_COMM_WORLD,info)
        call mpi_recv(h,I,MPI_REAL8,mod(mype-1+npes,npes),I,
     &                MPI_COMM_WORLD,stat,info)
      enddo
      call mpi_finalize(info)
      END

Running on 2 CPUs of an Origin2000, the program blocks after reaching the size I=2100, because of the buffering constraint MPI_BUFFER_MAX=16384 bytes, i.e. 2048 items of real*8.
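One way to make such an exchange correct regardless of buffering is to combine the send and the receive, e.g. with mpi_sendrecv. A minimal sketch (not from the original slides), reusing the variables of the program above and a separate receive buffer g:

      real*8 g(4000)
! combined send/receive cannot deadlock, whatever the message size
      call mpi_sendrecv(h, I, MPI_REAL8, mod(mype+1,npes),      I,
     &                  g, I, MPI_REAL8, mod(mype-1+npes,npes), I,
     &                  MPI_COMM_WORLD, stat, info)
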
MPI Asynchronous send/receive

Non-blocking send and receive calls are available:

      mpi_isend(buf,count,datatype,dest,tag,comm,req,ierr)
      mpi_irecv(buf,count,datatype,source,tag,comm,req,ierr)

buf,count,datatype      message content
dest/source,tag,comm    message envelope
req                     integer holding the request id

The asynchronous call returns the request id after registering the buffer. The request id can be used in the probe and wait calls:

      mpi_wait(req,stat,ierr)
• blocks until the MPI send or receive with request id req completes
      mpi_waitall(count,array-of-req,array-of-stat,ierr)
• waits for all given communications to complete (a blocking call)
• the (array of) stat can be probed for the items received; the data can then be retrieved with the recv call (or irecv, or any other receive variant)

NOTE: although this interface announces asynchronous communication, the actual copy of the buffers happens only at the time of the receive and wait calls.
MPI Asynchronous: Example

Buffer management with asynchronous communication:

      include 'mpif.h'
      integer stat(MPI_STATUS_SIZE,10)
      integer req(10)
      real B1(NB1,10)

      if(mype.eq.0) then
! master receives from all slaves
        do ip=1,npes-1
          call mpi_irecv(B1(1,ip),NB1,MPI_REAL,ip,MPI_ANY_TAG,
     &                   MPI_COMM_WORLD,req(ip),info)
        enddo
        nreq = npes-1
      else
! slaves send to the master
        itag = mype
        call mpi_isend(B1(1,mype),NB1,MPI_REAL,0,itag,
     &                 MPI_COMM_WORLD,req(1),info)
        nreq = 1
      endif
      …
! some unrelated calculations
      call mpi_waitall(nreq,req,stat,ierr)
      …
! data is available in B1 in the master process
! buffer B1 can be reused in the slave processes

• buffers declared in isend/irecv can be (re-)used only after the communication has actually completed
• requests should be freed (mpi_test, mpi_wait, mpi_request_free) for all the isend calls in the program, otherwise mpi_finalize might hang
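For instance, a sender that does not go through the mpi_waitall above has to release its request explicitly; a hedged sketch, reusing the variables of the example:

! one of the two is needed for every isend request, otherwise mpi_finalize may hang:
!   (a) wait for completion
      call mpi_wait(req(1), stat(1,1), info)
!   (b) or, if the completion status will never be inspected, release the request
!       call mpi_request_free(req(1), info)
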
Performance of Asynchronous Communication


MPI Functionality

MPI Most Important Functions

Synchronous communication:    mpi_send, mpi_recv, mpi_sendrecv
Asynchronous communication:   mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait/waitall
Collective communication:     mpi_barrier, mpi_bcast, mpi_gather/scatter, mpi_reduce/allreduce, mpi_alltoall
Creating communicators:       mpi_comm_dup, mpi_comm_split, mpi_comm_free, mpi_intercomm_create
Derived data types:           mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_pack, mpi_type_commit, mpi_type_free
MPI Most Important Functions

One-sided communication:      mpi_win_create, mpi_put, mpi_get, mpi_win_fence
Miscellaneous:                MPI_Wtime()
• MPI_Wtime() is based on the SGI_CYCLE clock with 0.8 microsecond resolution
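A typical timing pattern with MPI_Wtime() (a sketch, not from the original slides; it assumes mpif.h is included and id holds the rank, as in the earlier examples):

      double precision t0, t1
      t0 = MPI_Wtime()
! ... section of code to be timed ...
      t1 = MPI_Wtime()
      if (id .eq. 0) print *, 'elapsed seconds:', t1 - t0
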
MPI Run Time System on SGI

      mpirun -np N t.exe
      mpirun Host_A -np N a.out : Host_B -np M b.out

• On SGI, all MPI programs are launched with the mpirun command
  – mpirun -np N executable-name arguments is the syntax on a single host
  – multi-host execution of different executables is possible
• mpirun establishes a connection with the Array Daemon through a socket interface, passing the program name, path and environment variables
• The Array Daemon launches the MPI executable: on each host it forks the executable N times (ranks 0 to N-1); inter-host traffic can use HiPPI optimized communication
• N+1 threads are started. The additional thread is a "lazy" thread which is blocked in the mpi_init() call and terminates when all other threads call mpi_finalize()
• mpirun -cpr (or -miser) will work on a single host to avoid the socket interface to the Array Daemon (for the Checkpoint/Restart facility)

Note: start MPI programs with N < #procs
MPI Run Time on SGI

MPI Implementation on SGI

• In C, mpi_init ignores all arguments passed to it
• All MPI processes are required to call mpi_finalize at exit
• I/O streams:
  – stdin is enabled only for the master thread (the process with rank 0)
  – stdout and stderr are enabled for all the threads and are line buffered
  – output from different MPI threads can be prepended with the -prefix argument; output is sent to the mpirun process
    example: mpirun -prefix "<proc %g out of %G> " prints:
      <proc 0 out of 2> Hello World
      <proc 1 out of 2> Hello World
  – see man mpi(5) and man mpirun(1) for a complete description
• Systems with the HIPPI software installed will trigger usage of the HIPPI optimized communication (HIPPI bypass). If the hardware is not installed, it is necessary to switch the HIPPI bypass off (setenv MPI_BYPASS_OFF TRUE)
• With f77/f90, the -auto_use mpi_interface flag is available to check the consistency of MPI arguments at compile time
• With -64 compilation, the MPI run time maps out the address space so that shared memory optimizations are available to circumvent the double copy problem. In particular, communication involving static data (i.e. common blocks) can be sped up.
SGI Message-Passing Software

• SGI Message Passing Toolkit (MPT 1.5)
• MPI, SHMEM, PVM components
• Packaged with Array Services software
• MPT external web page:
  – http://www.sgi.com/software/mpt/
• MPT engineering internal web page:
  – http://wwwmn.americas.sgi.com/mpi/
SGI Message-Passing Toolkit

• Fully MPI 1.2 standard compliant (based on MPICH)
• SHMEM API for one-sided communication
• Support for selected MPI-2 features, with further enhancements as customer needs dictate:
  – MPI I/O (ROMIO version 1.0.2)
  – MPI one-sided communication
  – Thread safety
  – Fortran 90 bindings: USE MPI
  – C++ bindings
• PVM available on IRIX (public domain version)

MPT: Supported Platforms
Now
• IRIX SSI
• IRIX clusters (GSN, HIPPI, Ethernet)
• IA32 and IA64 SSI with Linux
• IA32 cluster (Myrinet, Ethernet) with Linux
Soon
• Partitioned IRIX (NUMAlink interconnect)
• IRIX clusters (Myrinet)
• Partitioned SN IA (NUMAlink interconnect)
• IA64 cluster (Myrinet, Ethernet)
Convenience Features in MPT

• MPI job management with LSF, NQE, PBS, others
• Array Services provides job control for cluster jobs
• Array Services and MPI work together to propagate user signals to all slaves
• Totalview debugger interoperability
• Fortran MPI subroutine interface checking at compile time with USE MPI
• Aborted cluster jobs are cleaned up automatically
• Use shell modules to install multiple versions of MPT on the same system
MPI Performance

• Low latency and high bandwidth
• Automatic NUMA placement
• Fetchop-assisted fast message queuing
• Fast fetchop tree barriers
• Optimized MPI collectives
• Internal MPI statistics reporting
• Integration with PCP
• Very fast MPI and SHMEM one-sided communication
• Direct send/recv transfers
• Interoperability with SHMEM
• No-impact thread safety support
• Support for SSI to 512 P
• Runtime MPI tuning
NUMAlink Implementation

• Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all
• Fetch-op variables on the Hub provide fast synchronization for flat and tree barrier methods
• The fetch-op AMO helped reduce MPI send/recv latency from 12 to 8 usec

[Figure: two CPUs attached to a Hub holding a fetch-op variable; the Hub connects to the router.]
NUMAlink-based MPI Performance

MPI performance on Origin 2000 (Origin 3000 in parentheses):

  send/recv latency        8 (5) usec
  Peak bandwidth           150 (280) Mbytes/sec
  One-sided get latency    2 (1) usec
  Barrier sync on 128 P    9 (6) usec
  Barrier sync on 484 P    26 (17) usec

Origin 3000 performance numbers subject to further verification
SHMEM Model


SHMEM API

One-Sided Communication Pattern

[Figure: timeline for processes 0, 1, 2, 3, 4, ..., N-1, each alternating COMPUTE and COMMUNICATE phases separated by barriers.]
MPI Message Exchange (on host)

[Figure: Process 0 calls MPI_Send(src,len,...) and Process 1 calls MPI_Recv(dst,len,...); the data moves through fetchop-protected message queues, message headers and data buffers in shared memory.]
MPI Message Exchange using Single Copy (on host)

[Figure: Process 0 calls MPI_Send(src,len,...) and Process 1 calls MPI_Recv(dst,len,...); only the message headers go through the fetchop-protected message queues, and the data is copied directly from src to dst through shared memory.]
Performance of Synchronous Communication

Using Single Copy send/recv

• Set MPI_BUFFER_MAX to N
• Any message larger than N bytes will be transferred by direct copy if
  – MPI semantics allow it
  – the -64 ABI is used
  – the memory region it is allocated in is a globally accessible location
• N=2000 seems to work well
  – shorter messages don't benefit from the direct copy transfer method
• Look at the stats to verify that direct copy was used
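A possible invocation (the process count is an illustrative value, not from the original slides):

      setenv MPI_BUFFER_MAX 2000
      mpirun -np 4 -stats a.out
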
Making Memory Globally Accessible for Single Copy send/recv

• The user's send buffer must reside in one of the following regions:
  – static memory (-static, common blocks, DATA, SAVE)
  – symmetric heap (allocated with SHPALLOC or shmalloc)
  – global heap (allocated with the f90 ALLOCATE statement and SMA_GLOBAL_ALLOC set; MIPSPro version 7.3.1.1m)
• When SMA_GLOBAL_ALLOC is set, it is usually necessary to increase the global heap size by setting SMA_GLOBAL_HEAP_SIZE
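A sketch of two buffer placements that satisfy these rules (sizes and names are illustrative, not from the original slides):

      real A(100000)
      common /sendbuf/ A            ! static memory: globally accessible
      real, allocatable :: B(:)
      allocate(B(100000))           ! global heap, when SMA_GLOBAL_ALLOC is set at run time
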
Global Communication Test

The ALL-to-ALL communication test (known as COMMS3 in the Parkbench suite):

[Figure: each process p0 ... pn sends blocks of size iw from its array A and receives the blocks directed to it into its array B.]
Global Communication

The ALL-to-ALL communication test:

MPI version:

! every processor sends a message to every other processor,
! then every processor receives the messages directed to it
      T0 = MPI_Wtime()
      do I = 1, NREPT
        call mpi_alltoall(A, iw, MPI_DOUBLE_PRECISION,
     &                    B, iw, MPI_DOUBLE_PRECISION,
     &                    MPI_COMM_WORLD, ier)
      enddo
      T1 = MPI_Wtime()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))   ! NP processes send NP-1 messages each

SHMEM version:

      T0 = MPI_Wtime()
      do I = 1, NREPT
        call shmem_barrier_all()
        do j = 0, NP-1
          other = MOD(my_rank+j, NP)
          call shmem_put8(B(1+iw*my_rank), A(1+iw*other), iw, other)
        enddo
      enddo
      T1 = MPI_Wtime()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))   ! NP processes send NP-1 messages each
Global Communication

Performance of the global communication test.

Actions:
• convert to SHMEM
• use single copy versions on remotely accessible variables

All-to-all bandwidth for R12K@300MHz:
  Std MPI   ~45 MB/s
  SC MPI    ~95 MB/s
  SHMEM     ~95 MB/s

The test case shows cache effects since every operation is performed 50 times.
The global communication routines already use a single copy algorithm for remotely accessible variables as of MPT 1.4.0.0.
Global Communication

[Chart: bandwidth with single copy vs. double copy.]

Conclusions: implement critical data exchange in MPI programs with SHMEM or single copy MPI on static or shmalloc/shpalloc-allocated data.
MPI get/put

• For codes that are latency sensitive, try using one-sided MPI (get/put)
• Latency over NUMAlink on Origin 3000:
  – send/recv: 5 microseconds
  – mpi_get: 0.7 microseconds
• If portability isn't an issue, use SHMEM instead
  – shmem_get latency: 0.5 microseconds (estimate by the MPT group)
  – much easier to write code
Transposition with SHMEM vs. send/recv

SHMEM version:

      call shmem_barrier_all
      do 150 kk=1,lmtot
        ktag=ksendto(kk)
        call shmem_put8( y(1+(ktag-1)*len), x(1,ksnding(kk)),
     &                   len, ipsndto(kk) )
 150  continue
      call shmem_barrier_all

send/recv version:

      ltag=0
      do 150 kk=1,lmtot
        ltag=ltag+1
        ktag=ksendto(kk)
        call mpi_isend(x(1,ksnding(kk)), len, mpireal, ipsndto(kk),
     &                 ktag, mpicomm, iss(ltag), istat)
        ltag=ltag+1
        ktag=krcving(kk)
        call mpi_irecv(y(1,krcving(kk)), len, mpireal, iprcvfr(kk),
     &                 ktag, mpicomm, iss(ltag), istat)
 150  continue
      call mpi_waitall(ltag, iss, istatm, istat)
Transposition with MPI_put

      common/buffer/ yg(length)
      integer(kind=MPI_ADDRESS_KIND) winsize, target_disp

! Setup: create a window for array yg since we will do puts into it
      call MPI_type_extent(MPI_REAL8, isizereal8, ierr)
      winsize = isizereal8*length
      call MPI_win_create(yg, winsize, isizereal8,
     &                    MPI_INFO_NULL, MPI_COMM_WORLD, iwin, ierr)
Transposition with MPI_put

      call mpi_barrier(MPI_COMM_WORLD,ierr)
      do 150 kk=1,lmtot
        ktag=ksendto(kk)
        target_disp=(1+(ktag-1)*len)-1
        call mpi_put(x(1,ksnding(kk)), len, MPI_REAL8, ipsndto(kk),
     &               target_disp, len, MPI_REAL8, iwin, ierr)
 150  continue
      call mpi_win_fence(0, iwin, ierr)
      do kk=1,len*lmtot
        y(kk)=yg(kk)
      end do

! Cleanup - destroy the window
      call mpi_barrier(MPI_COMM_WORLD,ierr)
      call mpi_win_free(iwin, ierr)
Performance of One-Sided Communication

Performance of the Message Passing Libraries

• Latency is the time it takes to pass a very short (zero-length) message
• Bandwidth is the sustained performance passing long messages

MPI-1 and SHMEM on Origin2000 (R10000@195MHz):

                      MPI-1               SHMEM
                      single   multiple   single   multiple
  Latency [usec]      8.5      13.0       0.9      1.7
  Bandwidth [MB/s]    99       80         140      180

MPI-2 on Origin3000 (R12000@400MHz):

                      send/recv   async send/recv   one-sided put+fence
  Latency [usec]      4.3         5.4               0.7
  Bandwidth [MB/s]    250         250               310

• the "single" test uses the send/recv pair; the "multiple" test uses the equivalent of the sendrecv primitive
• note that a single bcopy speed on Origin2000 is about 150 MB/s
• MPI suffers a performance disadvantage with respect to SHMEM because MPI semantics require separate address spaces between threads; the MPI implementation therefore needs a "double copy" to pass messages
• SHMEM is optimized for one-sided communication, as is done for SMP programming, and therefore shows very good latency
MPI Tips for Performance

• Use the 64-bit ABI for additional memory cross-mapping MPI optimizations
• Use cpusets for the best reproducible results in a batch environment
• Avoid over-subscription of tasks to physical CPUs in a throughput benchmark
• Use the -stats option and the MPI tuning variables
MPI Tips for Performance

• Try direct-copy send/receive (and collective calls) for memory bandwidth improvement
• Use one-sided communication for latency (and memory bandwidth) improvement
• Try setting MPI_DSM_MUSTRUN or SMA_DSM_MUSTRUN to maintain CPU / memory affinity
• Do NOT use bsend/ssend or wild cards (MPI_ANY_SOURCE, MPI_ANY_TAG) in message headers
Important Environment Variables

  MPI_DSM_MUSTRUN
  MPI_REQUEST_MAX
  MPI_GM_ON
  MPI_BAR_DISSEM
  MPI_BUFS_PER_PROC
  MPI_BUFS_PER_HOST
  MPI_BUFFER_MAX

  "-stats" mpirun option / Totalview display
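A representative way to apply such settings (the variable values and process count are illustrative assumptions, not from the original slides):

      setenv MPI_BUFS_PER_PROC 64
      setenv MPI_REQUEST_MAX 65536
      mpirun -np 16 -stats a.out
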
MPI Performance Experiments

Performance data on MPI programs can be collected with:
      mpirun -np N perfex -a -y -mp -o perfex.out prog args
• the perfex.out.#procid files will contain event counts for every MPI thread, and perfex.out will contain the aggregate for all the threads together

Profiling data on MPI programs can be collected with:
      mpirun -np N ssrun -experiment prog args
• the experiment is one of the usual experiments (pcsamp, usertime, etc.) or mpi:
  – mpirun -np N ssrun -workshop -mpi prog
    will produce N prog.mpi.f#procid files; these files can be aggregated with the ssaggregate tool and viewed interactively with the cvperf tool
  – ssaggregate -e prog.mpi.f* -o prog.mpi_all
  – cvperf prog.mpi_all   or   prof prog.mpi_all
  – the following routines are traced (see man ssrun(1)):
      MPI_Barrier(3), MPI_Bcast(3), MPI_Send(3), MPI_Bsend(3), MPI_Ssend(3), MPI_Rsend(3),
      MPI_Isend(3), MPI_Ibsend(3), MPI_Issend(3), MPI_Irsend(3), MPI_Recv(3), MPI_Irecv(3),
      MPI_Sendrecv(3), MPI_Sendrecv_replace(3), MPI_Wait(3), MPI_Waitall(3), MPI_Waitany(3),
      MPI_Waitsome(3), MPI_Test(3), MPI_Testall(3), MPI_Testany(3), MPI_Testsome(3),
      MPI_Request_free(3), MPI_Cancel(3), MPI_Pcontrol(3)
MPI versus OpenMP

SGI Message-Passing References

• "relnotes mpt" gives information about new features
• "man mpi" describes all environment variables
• "man shmem" describes the SHMEM API
• MPI reference manuals, viewable with the insight viewer:
  – "Message Passing Toolkit: MPI Programmer's Manual" (document # 007-3687-005)
• MPT web page:
  – http://www.sgi.com/software/mpt
• MPI web sites:
  – http://www.mpi-forum.org
  – http://www.mcs.anl.gov/mpi/index.html
Summary

• It is important to understand the semantics of MPI
• The send/receive calls provide data synchronization, not necessarily process synchronization
• A correct MPI program cannot depend on the buffering of messages
• For a highly optimized MPI program, it is important to use only a few optimized subroutines from the MPI library, typically the straight send/receive variants
• The SGI implementation of MPI uses N+1 processes in the parallel region, so for scalability it is better to run MPI with fewer processes than there are physical processors in the machine
• Proprietary message passing libraries (e.g. SHMEM) perform better than MPI on the Origin, because MPI's generic interface makes it much harder to optimize