Message Passing: MPI on Origin Systems

MPI Programming Model

Compiling MPI Programs

    cc  -64 compute.c -lmpi
    f77 -64 -LANG:recursive=on compute.f -lmpi
    f90 -64 -LANG:recursive=on compute.f -lmpi
    CC  -64 compute.c -lmpi++ -lmpi

The -64 flag is NOT required, but it improves functionality and optimization.
With compiler level 7.2.1 or higher, -auto_use mpi_interface can be used with f77/f90 for compile-time subroutine interface checking.

Compiling MPI Programs
• Use the header files from /usr/include, since the SGI libraries were built with them (do not use the public domain versions)
  – FORTRAN: mpif.h or USE MPI
  – C: mpi.h
  – C++: mpi++.h
• The mpi_init version must match the language of the main program (if MPI is called from multiple shared memory threads, mpi_init_thread must be used)

Compiling MPI Programs
• MPI definitions:
  – FORTRAN: MPI_XXXX (not case sensitive)
  – C: MPI_Xxxx (upper and lower case)
  – C++: Xxxx (part of the MPI:: name space)
• Every entry point MPI_ in the MPI library has a "shadow" entry point PMPI_ to aid the implementation of user profiling
• Array Services (arrayd) is required to run MPI

Basic MPI Features

MPI Basic Calls
MPI has a large number of calls. The following are the most basic:
• every MPI program has to start and finish with these calls (the first and the last executable statements): mpi_init, mpi_finalize
• essential inquiries about the environment: mpi_comm_size, mpi_comm_rank
• basic communication calls: mpi_send, mpi_recv
• basic synchronization calls: mpi_barrier

Example:

      Program mpitest
      include "mpif.h"
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
      do I=0,np-1
        if(I.eq.id) print *,'np, id',np,id
        call mpi_barrier(MPI_COMM_WORLD,ierr)
      enddo
      call mpi_finalize(ierr)
      stop
      end

Compile with:  f77 -o mpitest -LANG:recursive=on mpitest.f -lmpi
Run with:      mpirun -np N [-stats -prefix "%g"] mpitest

MPI send and receive Calls

    mpi_send(buf,count,datatype,dest,tag,comm,ierr)
    mpi_recv(buf,count,datatype,source,tag,comm,stat,ierr)

    buf        data to be sent/received
    count      number of items to send; for a receive, the size of buf (maximum number of items)
    datatype   type of the data items (MPI_INTEGER, MPI_FLOAT, MPI_DOUBLE_PRECISION, etc.)
    dest/source  rank of the peer process (MPI_ANY_SOURCE may be used as the source of a receive)
    tag        integer mark of the message (MPI_ANY_TAG)
    comm       communication handle (MPI_COMM_WORLD)
    stat       status of the message, of MPI status type; in Fortran: INTEGER stat(MPI_STATUS_SIZE)

The number of items actually received can be queried with

    call mpi_get_count(stat,MPI_REAL,nitems,ierr)

where nitems can be <= count. Check for errors with:

    if(ierr.ne.MPI_SUCCESS) call abort()

Using send and receive Calls
Example:

      if(mod(id,2).eq.0) then
        idst = mod(id+1,np)
        itag = 0
        call mpi_send(A,N,MPI_REAL,idst,itag,MPI_COMM_WORLD,ierr)
        if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
      else
        isrc = mod(id-1+np,np)
        itag = MPI_ANY_TAG
        call mpi_recv(B,NSIZE,MPI_REAL,isrc,itag,MPI_COMM_WORLD,stat,ierr)
        if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
        call mpi_get_count(stat,MPI_REAL,N,ierr)
      endif

Rules of use:
• mpi_send/mpi_recv are defined as blocking calls
  – the program should not assume blocking behaviour (small messages are buffered)
  – when these calls return, the buffers can be (re-)used
• the arrival order of messages sent from A and B to C is not determined; two messages from A to B will arrive in the order sent
• message passing programming models are therefore non-deterministic
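Because the receive above uses a wild card (MPI_ANY_TAG, and in general MPI_ANY_SOURCE), the status argument is the only way to learn which sender and tag actually matched and how much data arrived. The following self-contained sketch is not from the original slides; the buffer size, tag value, and variable names are illustrative. Run it with at least 2 processes (mpirun -np 2).

      program status_check
      include 'mpif.h'
      integer stat(MPI_STATUS_SIZE), ierr, np, id, nitems
      real buf(1000)
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
      if(id.eq.0) then
c        receive from any sender, with any tag
         call mpi_recv(buf,1000,MPI_REAL,MPI_ANY_SOURCE,MPI_ANY_TAG,
     &                 MPI_COMM_WORLD,stat,ierr)
c        the status records who sent the message and with which tag
         print *,'source=',stat(MPI_SOURCE),' tag=',stat(MPI_TAG)
c        number of items actually received (may be less than 1000)
         call mpi_get_count(stat,MPI_REAL,nitems,ierr)
         print *,'items received=',nitems
      else if(id.eq.1) then
c        send only part of the buffer, with an arbitrary tag
         call mpi_send(buf,500,MPI_REAL,0,7,MPI_COMM_WORLD,ierr)
      endif
      call mpi_finalize(ierr)
      end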
Another Simple Example

MPI send/receive: Buffering
An MPI program should not assume buffering of messages. The following program is erroneous:

      Program long_messages
      include 'mpif.h'
      real*8 h(4000)
      integer stat(MPI_STATUS_SIZE)
      call mpi_init(info)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, info)
      call mpi_comm_size(MPI_COMM_WORLD, npes, info)
      do I = 1000, 4000, 100          ! increasing size of the message
        call mpi_barrier(MPI_COMM_WORLD,info)
        print *,'mype=',mype,' before send',I
        call mpi_send(h,I,MPI_REAL8,mod(mype+1,npes),I,MPI_COMM_WORLD,info)
        call mpi_barrier(MPI_COMM_WORLD,info)
        call mpi_recv(h,I,MPI_REAL8,mod(mype-1+npes,npes),I,MPI_COMM_WORLD,stat,info)
      enddo
      call mpi_finalize(info)
      END

Running on an Origin2000 on 2 CPUs, the program blocks after reaching the size I=2100, because of the buffering constraint MPI_BUFFER_MAX=16384 bytes, i.e. 2048 items of real*8: once the messages are too large to be buffered, every process is blocked inside mpi_send and no process ever reaches the matching mpi_recv.

MPI Asynchronous send/receive
Non-blocking send and receive calls are available:

    mpi_isend(buf,count,datatype,dest,tag,comm,req,ierr)
    mpi_irecv(buf,count,datatype,source,tag,comm,req,ierr)

    buf, count, datatype     message content
    dest/source, tag, comm   message envelope
    req                      integer holding the request id

The asynchronous call returns the request id after registering the buffer. The request id can be used in the probe and wait calls:

    mpi_wait(req,stat,ierr)

• blocks until the MPI send or receive with request id req completes

    mpi_waitall(count,array-of-req,array-of-stat,ierr)

• waits for all the given communications to complete (a blocking call)
• the (array of) stat can be probed for the items received; the data itself is delivered by the matching receive (blocking, non-blocking, or any other receive variety)
NOTE: although this interface announces asynchronous communication, the actual copy of the buffers happens only at the time of the receive and wait calls.

MPI Asynchronous: Example
Buffer management with asynchronous communication:

      include 'mpif.h'
      integer stat(MPI_STATUS_SIZE,10)
      integer req(10)
      real B1(NB1,10)
      if(mype.eq.0) then              ! master receives from all slaves
        do ip=1,npes-1
          call mpi_irecv(B1(1,ip),NB1,MPI_REAL,ip,MPI_ANY_TAG,MPI_COMM_WORLD,req(ip),info)
        enddo
        nreq = npes-1
      else                            ! slaves send to the master
        call mpi_isend(B1(1,mype),NB1,MPI_REAL,0,itag,MPI_COMM_WORLD,req(1),info)
        nreq = 1
      endif
      ...                             ! some unrelated calculations
      call mpi_waitall(nreq,req,stat,ierr)
      ...                             ! data is available in B1 in the master process
                                      ! buffer B1 can be reused in the slave processes

• buffers passed to isend/irecv can be (re-)used only after the communication has actually completed
• requests should be freed (mpi_test, mpi_wait, mpi_request_free) for all the isend calls in the program, otherwise mpi_finalize might hang
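A minimal, self-contained sketch of the request handling described in the last two bullets (not from the original slides; the ring pattern and buffer size are illustrative): every process posts an mpi_isend, satisfies it with a blocking receive, and then completes or frees the request before mpi_finalize.

      program free_requests
      include 'mpif.h'
      integer req, stat(MPI_STATUS_SIZE), ierr, np, id
      logical done
      real A(100), B(100)
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
c     non-blocking send to the right neighbour in a ring
      call mpi_isend(A,100,MPI_REAL,mod(id+1,np),0,
     &               MPI_COMM_WORLD,req,ierr)
c     matching blocking receive from the left neighbour
      call mpi_recv(B,100,MPI_REAL,mod(id-1+np,np),0,
     &              MPI_COMM_WORLD,stat,ierr)
c     complete the isend request; when mpi_test returns done=.true.
c     it also releases the request
      call mpi_test(req,done,stat,ierr)
      if(.not.done) then
c        either block until the send completes ...
         call mpi_wait(req,stat,ierr)
c        ... or hand the request back to MPI without waiting:
c        call mpi_request_free(req,ierr)
      endif
      call mpi_finalize(ierr)
      end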
Performance of Asynchronous Communication

MPI Functionality

MPI Most Important Functions
Synchronous communication:
    mpi_send  mpi_recv  mpi_sendrecv
Creating communicators:
    mpi_comm_dup  mpi_comm_split  mpi_comm_free  mpi_intercomm_create
Asynchronous communication:
    mpi_isend  mpi_irecv  mpi_iprobe  mpi_wait/waitall
Collective communication (a short bcast/allreduce sketch follows the "MPI Implementation on SGI" notes below):
    mpi_barrier  mpi_bcast  mpi_gather/scatter  mpi_reduce/allreduce  mpi_alltoall
Derived data types:
    mpi_type_contiguous  mpi_type_vector  mpi_type_indexed  mpi_type_pack  mpi_type_commit  mpi_type_free

MPI Most Important Functions
One-sided communication:
    mpi_win_create  mpi_put  mpi_get  mpi_fence
Miscellaneous:
    MPI_Wtime()
• based on the SGI_CYCLE clock with 0.8 microsecond resolution

MPI Run Time System on SGI
• On SGI, all MPI programs are launched with the mpirun command, which passes on the program name, path, and environment variables:
    mpirun -np N t.exe
    mpirun Host_A -np N a.out : Host_B -np M b.out
  – mpirun -np N executable-name arguments is the syntax on a single host
  – multi-host execution of different executables is possible
• mpirun establishes a connection with the Array Daemon through the socket interface
• The Array Daemon launches the MPI executable, forking it N times (ranks 0 to N-1)
• N+1 threads are started; the additional thread is a "lazy" thread which stays blocked in the mpi_init() call and terminates when all other threads call mpi_finalize()
• mpirun -cpr (or -miser) works on a single host and avoids the socket interface to the Array Daemon (for the Checkpoint/Restart facility)
Note: start MPI programs with N < #procs

MPI Run Time on SGI
(diagram: HIPPI optimized communication)

MPI Implementation on SGI
• In C, mpi_init ignores all arguments passed to it
• All MPI processes are required to call mpi_finalize at exit
• I/O streams:
  – stdin is enabled only for the master thread (the process with rank 0)
  – stdout and stderr are enabled for all threads and are line buffered
  – output from different MPI threads can be prepended with the -prefix argument; output is sent to the mpirun process
    example: mpirun -prefix "<proc %g out of %G> " prints:
      <proc 0 out of 2> Hello World
      <proc 1 out of 2> Hello World
  – see man mpi(5) and man mpirun(1) for a complete description
• Systems with the HIPPI software installed will trigger usage of the HIPPI optimized communication (HIPPI bypass). If the hardware is not installed, it is necessary to switch the HIPPI bypass off (setenv MPI_BYPASS_OFF TRUE)
• With f77/f90, the -auto_use mpi_interface flag is available to check the consistency of MPI arguments at compile time
• With -64 compilation, the MPI run time maps out the address space such that shared memory optimizations are available to circumvent the double copy problem. In particular, communication involving static data (i.e. common blocks) can be sped up.
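As referenced from the "Collective communication" list above, here is a minimal sketch of two of the most commonly used collectives, mpi_bcast and mpi_allreduce. This example is not from the original slides; the variable names and the problem size are illustrative.

      program coll_sketch
      include 'mpif.h'
      integer ierr, np, id, n
      real*8 partial, total
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
c     rank 0 chooses a problem size and broadcasts it to everybody
      if(id.eq.0) n = 1000
      call mpi_bcast(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c     every rank computes a partial result ...
      partial = dble(id) * n
c     ... and all ranks obtain the global sum
      call mpi_allreduce(partial,total,1,MPI_REAL8,MPI_SUM,
     &                   MPI_COMM_WORLD,ierr)
      if(id.eq.0) print *,'n=',n,' sum=',total
      call mpi_finalize(ierr)
      end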
SGI Message-Passing Software
• SGI Message Passing Toolkit (MPT 1.5)
• MPI, SHMEM, and PVM components
• Packaged with the Array Services software
• MPT external web page:
  – http://www.sgi.com/software/mpt/
• MPT engineering internal web page:
  – http://wwwmn.americas.sgi.com/mpi/

SGI Message-Passing Toolkit
• Fully MPI 1.2 standard compliant (based on MPICH)
• SHMEM API for one-sided communication
• Support for selected MPI-2 features, with enhancements continuing as customer needs dictate:
  – MPI I/O (ROMIO version 1.0.2)
  – MPI one-sided communication
  – Thread safety
  – Fortran 90 bindings: USE MPI
  – C++ bindings
• PVM available on IRIX (public domain version)

MPT: Supported Platforms
Now:
• IRIX SSI
• IRIX clusters (GSN, HIPPI, Ethernet)
• IA32 and IA64 SSI with Linux
• IA32 clusters (Myrinet, Ethernet) with Linux
Soon:
• Partitioned IRIX (NUMAlink interconnect)
• IRIX clusters (Myrinet)
• Partitioned SN IA (NUMAlink interconnect)
• IA64 clusters (Myrinet, Ethernet)

Convenience Features in MPT
• MPI job management with LSF, NQE, PBS, and others
• Array Services provides job control for cluster jobs
• Totalview debugger interoperability
• Array Services and MPI work together to propagate user signals to all slaves
• Fortran MPI subroutine interface checking at compile time with USE MPI
• Aborted cluster jobs are cleaned up automatically
• Use shell modules to install multiple versions of MPT on the same system

MPI Performance
• Low latency and high bandwidth
• Automatic NUMA placement
• Fetchop-assisted fast message queuing
• Optimized MPI collectives
• Internal MPI statistics reporting
• Fast fetchop tree barriers
• Very fast MPI and SHMEM one-sided communication
• Interoperability with SHMEM
• Support for SSI up to 512 P
• Integration with PCP
• Direct send/recv transfers
• No-impact thread safety support
• Runtime MPI tuning

NUMAlink Implementation
• Used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all
• Fetch-op variables on the Hub provide fast synchronization for the flat and tree barrier methods
• The fetch-op AMO helped reduce MPI send/recv latency from 12 to 8 usec
(diagram: CPU / HUB with fetch-op variable / ROUTER)

NUMAlink-based MPI Performance
MPI performance on Origin 2000 (Origin 3000 in parentheses):
    send/recv latency          8 (5) usec
    peak bandwidth             150 (280) Mbytes/sec
    one-sided get latency      2 (1) usec
    barrier sync on 128 P      9 (6) usec
    barrier sync on 484 P      26 (17) usec
Origin 3000 performance numbers are subject to further verification.

SHMEM Model

SHMEM API

One-Sided Communication Pattern
(diagram: processes 0..N-1 alternate compute and communicate phases, separated by barriers along the time axis)

MPI Message Exchange (on host)
(diagram: MPI_Send(src,len,…) in process 0 copies src into data buffers in shared memory and posts a fetchop-protected message header; MPI_Recv(dst,len,…) in process 1 copies the data from the shared buffers into dst)

MPI Message Exchange using Single Copy (on host)
(diagram: only the message headers go through the shared-memory queues; MPI_Recv(dst,len,…) copies the data directly from src to dst)

Performance of Synchronous Communication

Using Single Copy send/recv
• Set MPI_BUFFER_MAX to N
• any message with size > N bytes will be transferred by direct copy if
  – MPI semantics allow it
  – the -64 ABI is used
  – the memory region it is allocated in is a globally accessible location
• N=2000 seems to work well
  – shorter messages don't benefit from the direct copy transfer method
• Look at the -stats output to verify that direct copy was used
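A minimal sketch of a send/recv pair that is eligible for the single copy path described above, assuming the code is compiled with -64 and run with MPI_BUFFER_MAX set to a small value such as the suggested 2000 (for example, setenv MPI_BUFFER_MAX 2000) and with at least 2 processes. The buffer is placed in a common block, i.e. in static memory, which (as the next slide details) is one of the globally accessible regions; the program and variable names are illustrative.

      program single_copy
      include 'mpif.h'
      integer stat(MPI_STATUS_SIZE), ierr, np, id
      real*8 buf(100000)
c     the send buffer lives in static memory (a common block), so a
c     message larger than MPI_BUFFER_MAX bytes can be moved with a
c     single copy when the -64 ABI is used
      common /sendbuf/ buf
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
      if(id.eq.0) then
         call mpi_send(buf,100000,MPI_REAL8,1,0,MPI_COMM_WORLD,ierr)
      else if(id.eq.1) then
         call mpi_recv(buf,100000,MPI_REAL8,0,0,MPI_COMM_WORLD,
     &                 stat,ierr)
      endif
      call mpi_finalize(ierr)
      end

Whether the direct copy path was actually taken can be checked with the mpirun -stats output mentioned above.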
Making Memory Globally Accessible for Single Copy send/recv
• The user's send buffer must reside in one of the following regions:
  – static memory (-static / common blocks / DATA / SAVE)
  – symmetric heap (allocated with SHPALLOC or shmalloc)
  – global heap (allocated with the f90 ALLOCATE statement and SMA_GLOBAL_ALLOC set; MIPSpro version 7.3.1.1m)
• When SMA_GLOBAL_ALLOC is set, it is usually necessary to increase the global heap size by setting SMA_GLOBAL_HEAP_SIZE

Global Communication Test
The ALL-to-ALL communication test (known as COMMS3 in the Parkbench suite):
(diagram: send array A and receive array B, each divided into blocks of iw words, one block per process p0..pn)

Global Communication
The ALL-to-ALL communication test:

MPI version:
C     every processor sends a message to every other processor,
C     then every processor receives the messages directed to it
      T0 = MPI_Wtime()
      Do I = 1, NREPT
        CALL mpi_alltoall(A, iw, MPI_DOUBLE_PRECISION,
     &                    B, iw, MPI_DOUBLE_PRECISION,
     &                    MPI_COMM_WORLD, ier)
      End do
      T1 = MPI_Wtime()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))   ! NP processes send NP-1 messages

SHMEM version:
      T0 = MPI_Wtime()
      Do I = 1, NREPT
        CALL shmem_barrier_all()
        Do j = 0, NP-1
          other = MOD(my_rank+j, NP)
          CALL shmem_put8(B(1+iw*my_rank), A(1+iw*other), iw, other)
        enddo
      End do
      T1 = MPI_Wtime()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))   ! NP processes send NP-1 messages

Global Communication
Performance of the global communication test. Actions:
• convert to SHMEM
• use the single copy version on remotely accessible variables
All-to-all bandwidth for R12K@300MHz:
    Standard MPI        ~45 MB/s
    Single copy MPI     ~95 MB/s
    SHMEM               ~95 MB/s
The test case shows cache effects, since every operation is performed 50 times. As of MPT 1.4.0.0, the global communication routines already use a single copy algorithm for remotely accessible variables.

Global Communication
(chart: single copy vs. double copy)
Conclusions: implement critical data exchange in MPI programs with SHMEM, or with single copy MPI on static or (shmalloc/shpalloc) allocated data.

MPI get/put
• For codes that are latency sensitive, try using one-sided MPI (get/put)
• latency over NUMAlink on the O3000:
  – send/recv: 5 microseconds
  – mpi_get: 0.7 microseconds
• if portability isn't an issue, use SHMEM instead
  – shmem_get latency: 0.5 microseconds (estimate by the MPT group)
  – much easier to write the code

Transposition with SHMEM vs. send/recv

SHMEM version:
      call shmem_barrier_all
      do 150 kk=1,lmtot
        ktag=ksendto(kk)
        call shmem_put8(y(1+(ktag-1)*len), x(1,ksnding(kk)),
     &                  len, ipsndto(kk))
 150  continue
      call shmem_barrier_all

send/recv version:
      ltag=0
      do 150 kk=1,lmtot
        ltag=ltag+1
        ktag=ksendto(kk)
        call mpi_isend(x(1,ksnding(kk)), len, mpireal, ipsndto(kk),
     &                 ktag, mpicomm, iss(ltag), istat)
        ltag=ltag+1
        ktag=krcving(kk)
        call mpi_irecv(y(1,krcving(kk)), len, mpireal, iprcvfr(kk),
     &                 ktag, mpicomm, iss(ltag), istat)
 150  continue
      call mpi_waitall(ltag, iss, istatm, istat)

Transposition with MPI_put
      common /buffer/ yg(length)
      integer(kind=MPI_ADDRESS_KIND) winsize, target_disp
      ! Setup: create a window for array yg since we will do puts into it
      call MPI_type_extent(MPI_REAL8, isizereal8, ierr)
      winsize = isizereal8*length
      call MPI_win_create(yg, winsize, isizereal8, MPI_INFO_NULL,
     &                    MPI_COMM_WORLD, iwin, ierr)

      call mpi_barrier(MPI_COMM_WORLD,ierr)
      do 150 kk=1,lmtot
        ktag=ksendto(kk)
        target_disp=(1+(ktag-1)*len)-1
        call mpi_put(x(1,ksnding(kk)), len, MPI_REAL8, ipsndto(kk),
     &               target_disp, len, MPI_REAL8, iwin, ierr)
 150  continue
      call mpi_win_fence(0, iwin, ierr)
      do kk=1,len*lmtot
        y(kk)=yg(kk)
      end do
      ! Cleanup - destroy window
      call mpi_barrier(MPI_COMM_WORLD,ierr)
      call mpi_win_free(iwin, ierr)
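The transposition above uses mpi_put; mpi_get works against the same kind of window and gives the low get latency quoted on the "MPI get/put" slide. Below is a self-contained sketch that is not from the original deck (all names are illustrative), in which every rank exposes one real*8 in a window and reads its right neighbour's value.

      program get_sketch
      include 'mpif.h'
      integer(kind=MPI_ADDRESS_KIND) winsize, disp
      integer iwin, ierr, np, id, isizereal8
      real*8 mine, his
      common /winbuf/ mine
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
      mine = dble(id)
c     expose one real*8 per process
      call mpi_type_extent(MPI_REAL8,isizereal8,ierr)
      winsize = isizereal8
      call mpi_win_create(mine,winsize,isizereal8,MPI_INFO_NULL,
     &                    MPI_COMM_WORLD,iwin,ierr)
      call mpi_win_fence(0,iwin,ierr)
c     read the neighbour's value; displacement 0 = first element
      disp = 0
      call mpi_get(his,1,MPI_REAL8,mod(id+1,np),disp,1,MPI_REAL8,
     &             iwin,ierr)
c     the fence completes the get; only then is 'his' valid
      call mpi_win_fence(0,iwin,ierr)
      print *,'rank',id,' read',his,' from rank',mod(id+1,np)
      call mpi_win_free(iwin,ierr)
      call mpi_finalize(ierr)
      end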
Performance of One-Sided Communication

Performance of the Message Passing Libraries
• Latency is the time it takes to pass a very short (zero-length) message
• Bandwidth is the sustained performance passing long messages

Origin2000 (R10000@195MHz), MPI-1 and SHMEM:

                 Latency [usec]         Bandwidth [MB/s]
                 single    multiple     single    multiple
    MPI-1        8.5       13.0         99        80
    SHMEM        0.9       1.7          140       180

Origin3000 (R12000@400MHz), MPI-2:

                          Latency [usec]    Bandwidth [MB/s]
    Send/recv             4.3               250
    Async send/recv       5.4               250
    One-sided put+fence   0.7               310

• the "single" test uses the send/recv pair; the "multiple" test uses the equivalent of the sendrecv primitive
• note that a single bcopy speed on Origin2000 is about 150 MB/s
• MPI suffers a performance disadvantage with respect to SHMEM because MPI semantics require separate address spaces between threads; the MPI implementation therefore needs a "double copy" to pass messages
• SHMEM is optimized for one-sided communication, as is done for SMP programming, and therefore shows a very good latency measurement

MPI Tips for Performance
• Use ABI 64 for additional memory crossmapping MPI optimizations
• Use cpusets for best reproducible results in a batch environment
• Avoid over-subscription of tasks to physical CPUs in a throughput benchmark
• Use the -stats option and the MPI tuning variables

MPI Tips for Performance
• Try direct-copy send/receive for memory bandwidth improvement, also for collective calls
• Use one-sided communication for latency (and memory bandwidth) improvement
• Try setting MPI_DSM_MUSTRUN or SMA_DSM_MUSTRUN to maintain CPU / memory affinity
• Do NOT use bsend/ssend, or wild cards (MPI_ANY_SOURCE, MPI_ANY_TAG) in message headers

Important Environment Variables
    MPI_DSM_MUSTRUN
    MPI_REQUEST_MAX
    MPI_GM_ON
    MPI_BAR_DISSEM
    MPI_BUFS_PER_PROC
    MPI_BUFS_PER_HOST
    MPI_BUFFER_MAX
    "-stats" mpirun option / Totalview display

MPI Performance Experiments
Performance data on MPI programs can be collected with:
    mpirun -np N perfex -a -y -mp -o perfex.out prog-args
• the -o option produces a perfex.out.#procid file with event counts for every MPI thread; perfex.out contains the aggregate for all the threads together
Profiling data on MPI programs can be collected with:
    mpirun -np N ssrun -experiment program prog-args
• the experiment is one of the usual experiments (pcsamp, usertime, etc.)
or the mpi experiment:
  – mpirun -np N ssrun -workshop -mpi prog produces N prog.mpi.f#procid files; these files can be aggregated with the ssaggregate tool and viewed interactively with the cvperf tool:
      ssaggregate -e prog.mpi.f* -o prog.mpi_all
      cvperf prog.mpi_all      (or: prof prog.mpi_all)
  – the following routines are traced (see man ssrun(1)):
      MPI_Barrier(3)  MPI_Send(3)  MPI_Isend(3)  MPI_Ibsend(3)  MPI_Sendrecv_replace(3)
      MPI_Wait(3)  MPI_Waitall(3)  MPI_Test(3)  MPI_Testall(3)  MPI_Request_free(3)
      MPI_Bsend(3)  MPI_Issend(3)  MPI_Bcast(3)  MPI_Waitany(3)  MPI_Testany(3)
      MPI_Cancel(3)  MPI_Ssend(3)  MPI_Rsend(3)  MPI_Irsend(3)  MPI_Sendrecv(3)
      MPI_Recv(3)  MPI_Irecv(3)  MPI_Waitsome(3)  MPI_Testsome(3)  MPI_Pcontrol(3)

MPI versus OpenMP

SGI Message-Passing References
• "relnotes mpt" gives information about new features
• "man mpi" tells about all environment variables
• "man shmem" tells about the SHMEM API
• MPI Reference Manuals viewable with the insight viewer
  – "Message Passing Toolkit: MPI Programmer's Manual" (document # 007-3687-005)
• MPT web page:
  – http://www.sgi.com/software/mpt
• MPI Web Sites:
  – http://www.mpi-forum.org
  – http://www.mcs.anl.gov/mpi/index.html

Summary
• It is important to understand the semantics of MPI
• The send/receive calls provide for data synchronization, not necessarily process synchronization
• A correct MPI program cannot depend on buffering of messages
• For a highly optimized MPI program, it is important to use only a few optimized subroutines from the MPI library, typically the straight send/receive variants
• The SGI implementation of MPI uses N+1 processes for the parallel region, so for scalability it is better to run MPI with fewer processes than there are physical processors in the machine
• Proprietary message passing libraries (e.g. SHMEM) perform better than MPI on the Origin, because MPI's generic interface makes it much harder to optimize