Introduction to Parallel Programming with C and MPI at MCSR, Part 1
MCSR Unix Camp

What is a Supercomputer?
Loosely speaking, it is a "large" computer with an architecture that has been optimized for solving bigger problems faster than a conventional desktop, mainframe, or server computer.
- Pipelining
- Parallelism (lots of CPUs or computers)

Supercomputers at MCSR: mimosa
- 253-CPU Intel Linux Cluster – Pentium 4
- Distributed memory – 500 MB – 1 GB per node
- Gigabit Ethernet

What is Parallel Computing?
Using more than one computer (or processor) to complete a computational problem.

How May a Problem Be Parallelized?
- Data Decomposition
- Task Decomposition

Models of Parallel Programming
• Message Passing Computing
  – Processes coordinate and communicate results via calls to message passing library routines
  – Programmers "parallelize" the algorithm and add message calls
  – At MCSR, this is via MPI programming with C or Fortran on:
    • Sweetgum – Origin 2800 Supercomputer (128 CPUs)
    • Mimosa – Beowulf Cluster with 253 Nodes
    • Redwood – Altix 3700 Supercomputer (224 CPUs)
• Shared Memory Computing
  – Processes or threads coordinate and communicate results via shared memory variables
  – Care must be taken not to modify the wrong memory areas
  – At MCSR, this is via OpenMP programming with C or Fortran on sweetgum

Message Passing Computing at MCSR
• Process Creation
• Manager and Worker Processes
• Static vs. Dynamic Work Allocation
• Compilation
• Models
• Basics
• Synchronous Message Passing
• Collective Message Passing
• Deadlocks
• Examples

Message Passing Process Creation
• Dynamic
  – One process spawns other processes and gives them work
  – PVM
  – More flexible
  – More overhead – process creation and cleanup
• Static
  – Total number of processes is determined before execution begins
  – MPI

Message Passing Processes
• Often, one process will be the manager, and the remaining processes will be the workers
• Each process has a unique rank/identifier
• Each process runs in a separate memory space and has its own copy of variables

Message Passing Work Allocation
• Manager Process
  – Does initial sequential processing
  – Initially distributes work among the workers
    • Statically or dynamically
  – Collects the intermediate results from the workers
  – Combines them into the final solution
• Worker Process
  – Receives work from, and returns results to, the manager
  – Workers may also distribute work amongst themselves (decentralized load balancing)

Message Passing Compilation
• Compile/link programs with message passing libraries using regular (sequential) compilers
• Fortran MPI example:
    include 'mpif.h'
• C MPI example:
    #include "mpi.h"

Message Passing Models
• SPMD – Single Program/Multiple Data
  – A single version of the source code is used for each process
  – Manager executes one portion of the program; workers execute another; some portions are executed by both
  – Requires one compilation per architecture type
  – MPI
• MPMD – Multiple Program/Multiple Data
  – One source code for the master; another for the slaves
  – Each must be compiled separately
  – PVM

Message Passing Basics
• Each process must first establish the message passing environment
• Fortran MPI example:
    integer ierror
    call MPI_INIT(ierror)
• C MPI example:
    MPI_Init(&argc, &argv);

Message Passing Basics
• Each process has a rank, or id number
  – 0, 1, 2, …, n-1, where there are n processes
• With SPMD, each process must determine its own rank by calling a library routine
• Fortran MPI example:
    integer comm, rank, ierror
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
• C MPI example:
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
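Putting these first calls together, here is a minimal, self-contained C sketch of the SPMD skeleton described above and on the next slide. It is not one of the workshop files; the file name and output message are only illustrative, but the MPI calls themselves (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize) are the standard ones.

    /* skeleton.c -- illustrative only, not a workshop file */
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* establish the MPI environment    */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my id number: 0 .. n-1           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* n, the total number of processes */

        printf("Process %d of %d has started\n", rank, size);

        MPI_Finalize();                          /* every process must finalize MPI  */
        return 0;
    }

The same source is compiled once and launched as several processes; each process prints its own rank.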
Message Passing Basics
• Each process has a rank, or id number
  – 0, 1, 2, …, n-1, where there are n processes
• Each process may use a library call to determine how many total processes it has to play with
• Fortran MPI example:
    integer comm, size, ierror
    call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
• C MPI example:
    MPI_Comm_size(MPI_COMM_WORLD, &size);

Message Passing Basics
• Each process has a rank, or id number
  – 0, 1, 2, …, n-1, where there are n processes
• Once a process knows the size, it also knows the ranks (id #'s) of the other processes, and can send a message to, or receive a message from, any other process.
• C example:
    MPI_Send(buf, count, datatype, dest, tag, comm);
    MPI_Recv(buf, count, datatype, source, tag, comm, &status);
  – buf, count, and datatype describe the DATA; dest (or source), tag, and comm form the ENVELOPE; status reports the result of the receive

MPI Send and Receive Arguments
• Buf – starting location of the data
• Count – number of elements
• Datatype – e.g., MPI_INTEGER, MPI_REAL, MPI_CHARACTER (Fortran) or MPI_INT, MPI_FLOAT, MPI_CHAR (C)
• Destination – rank of the process to whom the message is being sent
• Source – rank of the sender from whom the message is being received, or MPI_ANY_SOURCE
• Tag – integer chosen by the program to indicate the type of message, or MPI_ANY_TAG
• Communicator – identifies the process team, e.g., MPI_COMM_WORLD
• Status – the result of the call (such as the number of data items received)

Synchronous Message Passing
• Message calls may be blocking or nonblocking
• Blocking Send
  – Waits to return until the message has been received by the destination process
  – This synchronizes the sender with the receiver
• Nonblocking Send
  – Return is immediate, without regard for whether the message has been transferred to the receiver
  – DANGER: the sender must not change the variable containing the old message before the transfer is done
  – MPI_Isend() is nonblocking

Synchronous Message Passing
• Locally Blocking Send
  – The message is copied from the send parameter variable to an intermediate buffer in the calling process
  – Returns as soon as the local copy is complete
  – Does not wait for the receiver to transfer the message from the buffer
  – Does not synchronize
  – The sender's message variable may safely be reused immediately
  – MPI_Send() is locally blocking

Synchronous Message Passing
• Blocking Receive
  – The call waits until a message matching the given tag has been received from the specified source process
  – MPI_RECV() is blocking
• Nonblocking Receive
  – If this process has a qualifying message waiting, retrieves that message and returns
  – If no messages have been received yet, returns anyway
  – Used if the receiver has other work it can be doing while it waits
  – Status tells the receiver whether the message was received
  – MPI_Irecv() is nonblocking
  – MPI_Wait() and MPI_Test() can be used to periodically check whether the message is ready, and finally wait for it, if desired
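To make the blocking calls concrete, here is a minimal sketch of a two-process exchange: rank 0 sends one integer with MPI_Send and rank 1 receives it with MPI_Recv. It is only an illustration (the value, tag, and printed message are made up), not one of the workshop programs.

    /* send_recv.c -- illustrative two-process exchange */
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, value, tag = 7;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* locally blocking: returns once 'value' has been copied out of the variable */
            MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* blocking: waits until a message with this source and tag arrives */
            MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            printf("Process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Replacing MPI_Recv() with MPI_Irecv() followed by MPI_Wait() would let process 1 do other work while the message is in flight.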
Collective Message Passing
• Broadcast
  – Sends a message from one process to all processes in the group
• Scatter
  – Distributes each element of a data array to a different process for computation
• Gather
  – The reverse of scatter: retrieves data elements into an array from multiple processes

Collective Message Passing w/MPI
    MPI_Bcast()           Broadcasts from the root to all other processes
    MPI_Gather()          Gathers values from a group of processes
    MPI_Scatter()         Scatters a buffer in parts to a group of processes
    MPI_Alltoall()        Sends data from all processes to all processes
    MPI_Reduce()          Combines values from all processes into a single value
    MPI_Reduce_scatter()  Combines values and scatters the results

Message Passing Deadlock
• Deadlock can occur when all critical processes are waiting for messages that never come, or waiting for buffers to clear out so that their own messages can be sent
• Possible Causes
  – Program/algorithm errors
  – Message and buffer sizes
• Solutions (see the sketch below)
  – Order operations more carefully
  – Use nonblocking operations
  – Add debugging output statements to your code to find the problem
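The classic case is two processes that each try to send to the other before either posts a receive; whether it hangs depends on message and buffer sizes, as noted above. The sketch below is only an illustration (the buffer names, message length, and tag are made up): the commented-out ordering can deadlock, and the fix simply orders the operations so the two ranks do not mirror each other.

    /* exchange.c -- illustrative deadlock scenario and fix, assumes exactly 2 processes */
    #include <stdio.h>
    #include "mpi.h"

    #define N 100000                       /* large enough that system buffering may run out */

    double out[N], in[N];

    int main(int argc, char *argv[])
    {
        int rank, other, tag = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                  /* the partner: rank 0 talks to rank 1 and vice versa */

        /* Deadlock-prone: both ranks send first, so each may wait forever
         * for buffer space to clear before either one receives:
         *
         *     MPI_Send(out, N, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
         *     MPI_Recv(in,  N, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &status);
         */

        /* Fix: order the operations so that one rank receives first */
        if (rank == 0) {
            MPI_Send(out, N, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
            MPI_Recv(in,  N, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(in,  N, MPI_DOUBLE, other, tag, MPI_COMM_WORLD, &status);
            MPI_Send(out, N, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

MPI_Sendrecv(), or nonblocking MPI_Isend()/MPI_Irecv() followed by MPI_Wait(), avoids the problem without the manual ordering.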
Portable Batch System on SGI
• Sweetgum:
  – PBS Professional is installed on sweetgum.

    Queue     Max # Processors  Max # Running    Memory Limit  CPU Time Limit  Special Validation
              per User Job      Jobs per Queue   per User Job  per User Job    Required
    SM-defR   4                 40               500mb         288 hrs         No
    MM-defR   4                 20               1gb           288 hrs         No
    LM-defR   4                 2                4gb           288 hrs         Yes
    LM-XR     4                 1                4gb           672 hrs         Yes
    LM-8p     8                 1                4gb           672 hrs         Yes
    LM-16p    16                1                4gb           672 hrs         Yes

Portable Batch System on Mimosa
• Example Mimosa PBS Configuration (PBS Professional):

    Queue     Max # Nodes   Default      Default Shared  Max # Running   Special Validation
              per User Job  Memory (MB)  Memory (MB)     Jobs per Queue  Required
    MCSR-2N   2             400          256             32              No
    MCSR-4N   4             600          256             12              Yes
    MCSR-8N   8             800          256             8               Yes
    MCSR-16N  16            1000         256             4               Yes
    MCSR-32N  32            1200         256             4               Yes
    MCSR-64N  64            1200         256             2               Yes
    MCSR-CA   0             400          256             13              Yes

Sample PBS Script

mimosa% vi example.pbs

    #!/bin/bash
    #PBS -l nodes=4      # MIMOSA
    #PBS -l ncpus=4      # SWEETGUM
    #PBS -q MCSR-CA
    #PBS -N example
    cd $PWD
    rm *.pbs.[eo]*
    pgcc -o add_mpi.exe add_mpi.c -Mmpi-mpich    # mimosa
    mpirun -np 4 add_mpi.exe

mimosa% qsub example.pbs
37537.mimosa.mcsr.olemiss.edu

Sample PBS Job Status

mimosa% qstat

    Job id         Name       User      Time Use  S  Queue
    -------------  ---------  --------  --------  -  --------
    37521.mimosa   4_3.pbs    r0829     01:05:17  R  MCSR-2N
    37524.mimosa   2_4.pbs    r0829     01:00:58  R  MCSR-2N
    37525.mimosa   GC8w.pbs   lgorb     01:03:25  R  MCSR-2N
    37526.mimosa   3_6.pbs    r0829     01:01:54  R  MCSR-2N
    37528.mimosa   GCr8w.pbs  lgorb     00:59:19  R  MCSR-2N
    37530.mimosa   ATr7w.pbs  lgorb     00:55:29  R  MCSR-2N
    37537.mimosa   example    tpirim    0         Q  MCSR-16N
    37539.mimosa   try1       cs49011   00:00:00  R  MCSR-CA

• Further information about using PBS at MCSR:
  http://www.mcsr.olemiss.edu/appssubpage.php?pagename=pbs_1.inc&menu=vMBPBS.inc

For More Information
• Hello World MPI examples on Sweetgum (/usr/local/appl/mpihello) and Mimosa (/usr/local/apps/ppro/mpiworkshop):
  http://www.mcsr.olemiss.edu/appssubpage.php?pagename=MPI_Ex1.inc
  http://www.mcsr.olemiss.edu/appssubpage.php?pagename=MPI_Ex2.inc
  http://www.mcsr.olemiss.edu/appssubpage.php?pagename=MPI_Ex3.inc
• Websites
  – MPI at MCSR: http://www.mcsr.olemiss.edu/appssubpage.php?pagename=mpi.inc
  – PBS at MCSR: http://www.mcsr.olemiss.edu/appssubpage.php?pagename=pbs_1.inc&menu=vMBPBS.inc
  – Mimosa Cluster: http://www.mcsr.olemiss.edu/supercomputerssubpage.php?pagename=mimosa2.inc
  – MCSR Accounts: http://www.mcsr.olemiss.edu/supercomputerssubpage.php?pagename=accounts.inc

The MPI Programming Exercises
• Hello World
  – sequential
  – parallel (w/MPI and PBS)
• Add an Array of Numbers
  – sequential
  – parallel (w/MPI and PBS)

Log in to mimosa & get workshop files
A. Use secure shell to login to mimosa using your assigned training account:
    ssh tracct1@mimosa.mcsr.olemiss.edu
    ssh tracct2@mimosa.mcsr.olemiss.edu
   See lab instructor for the password.
B. Copy the workshop files into your home directory by running:
    /usr/local/apps/ppro/prepare_mpi_workshop

Hello World Exercise
• Examine, compile, and execute hello.c
• Examine hello_mpi.c:
  – Add a macro to include the header file for the MPI library calls
  – Add a function call to initialize the MPI environment
  – Add a function call to find out how many parallel processes there are
  – Add a function call to find out which process this is – the MPI process ID of this process
  – Add an IF structure so that the manager/boss process can do one thing, and everyone else (the workers/servants) can do something else
  – All processes, whether manager or worker, must finalize MPI operations
• Compile hello_mpi.c
  – Why won't this compile? You must link to the MPI library.
• Run hello_mpi.exe
  – On 1 CPU
  – On 2 CPUs
  – On 4 CPUs
• Examine hello_mpi.pbs
• Submit hello_mpi.pbs

Add an Array of Numbers Exercise
• Examine add.c
• Compile & execute add.c
• Edit add_mpi.c
• Compile/debug add_mpi.c
• Examine add_mpi.pbs
• Submit add_mpi.pbs
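For the last exercise, the following sketch shows one way an array sum can be parallelized using the data decomposition and collective calls described earlier: the manager fills the array, MPI_Scatter gives each process an equal slice, every process sums its slice, and MPI_Reduce combines the partial sums. This is only an illustration under simplifying assumptions (the array size is made up and is assumed to divide evenly among the processes); it is not the workshop's add_mpi.c.

    /* array_sum.c -- illustrative parallel array sum, not the workshop's add_mpi.c */
    #include <stdio.h>
    #include "mpi.h"

    #define N 1000   /* illustrative size; assumed divisible by the number of processes */

    int main(int argc, char *argv[])
    {
        int rank, size, i, chunk_size;
        double data[N], chunk[N], local_sum = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        chunk_size = N / size;

        if (rank == 0)                       /* manager fills the array             */
            for (i = 0; i < N; i++)
                data[i] = i + 1;

        /* data decomposition: give each process an equal slice of the array */
        MPI_Scatter(data, chunk_size, MPI_DOUBLE,
                    chunk, chunk_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (i = 0; i < chunk_size; i++)     /* each process sums its own slice     */
            local_sum += chunk[i];

        /* combine the partial sums on the manager */
        MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum of %d numbers is %f\n", N, total);

        MPI_Finalize();
        return 0;
    }

Such a program would be compiled and run the same way as in the Sample PBS Script above (pgcc with -Mmpi-mpich, then mpirun -np 4 inside a PBS job).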