Message Passing: MPI
May 2002
Charles Grassl
IBM Advanced Computing Technology Center

Agenda
- SP system introduction
- MPI introduction
- Compilers
- Hardware environment
- Software environment: PSSP, LoadLeveler, POE
- Point-to-point communication

pSeries Nodes
- Flat address space
- Uniform memory access times (approximately)
- Message passing works very well on-node: shared memory MPI
- SMP (e.g. OpenMP) also works very well on-node
  - but does not work between nodes

Shared Memory
- Characteristics: flat address space, single operating system
- Limitations: memory contention, bandwidth, cache coherency
- Benefits: memory size

Distributed Memory / Distributed Shared Memory
- Characteristics: multiple address spaces, multiple operating systems,
  nodes connected through a switch fabric
- Limitations: memory contention, bandwidth, local memory size
- Benefits: cache coherency (confined to each node)

NUMA (Non-Uniform Memory Access)
- Blend of distributed and shared memory
- Advantages: flat address space, large memory, cache coherency
- Disadvantages: unpredictable performance, page management,
  multiple programming models

Why Various Memory Designs?
- Shared memories allow a larger address space
- Distributed memory has more scalability
- Economics

Emphasize!
- Message passing (MPI) works VERY well on shared memory nodes
  - uses shared memory messages
- SMP (OpenMP) does not work between nodes
  - only scales up to the number of CPUs on a node

Invoking the "MPI" Compiler
- MPI usage: put 'mp' before the compiler name
    xlf  ->  mpxlf
    cc   ->  mpcc
- Sets the include path
- Links libmpi.a
- No need to specify -lmpi when using mpxlf

Fortran Compiler Invocations

  Language     Compile Command     Source Format   Storage Class
  FORTRAN 77   mpxlf, mpxlf77      Fixed           Static
  FORTRAN 90   mpxlf90             Free            Automatic
  FORTRAN 95   mpxlf95             Free            Automatic
  C            mpcc

- Source format:  mpxlf ... -q{fixed|free} ...
- Storage class:  mpxlf ... -q{save|nosave} ...

Linker Options

  Option               Description
  -qsmp                In case of using threads
  -bmaxdata:<bytes>    Maximum space to reserve for the program data segment
  -bmaxstack:<bytes>   Maximum space to reserve for the program stack segment
                       (256 Mbyte limit)
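To make the compiler and linker options above concrete, here is a minimal compile-and-link sketch; the file names, optimization level, and the -bmaxdata value are illustrative placeholders, not taken from the slides:

    mpxlf   -qfixed -O3 -c solver.f    # fixed-form FORTRAN 77 source
    mpxlf90 -qfree  -O3 -c driver.f    # free-form Fortran 90 source
    mpxlf90 -o prog driver.o solver.o -bmaxdata:0x40000000
    # The mp* scripts add the MPI include path and link libmpi.a, so no
    # explicit -lmpi is needed; -bmaxdata here reserves 1 Gbyte for the
    # program data segment.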
Addressing Modes
- Default is 32-bit: -q32 / -q64 option
- Does not affect CPU performance
- In 64-bit mode, all pointers use 8 bytes
- INTEGER*8 calculations are done in hardware
- The default data types DO NOT change: INTEGER is still 4 bytes

Operating System Environment
- AIX
- Parallel System Support Program (PSSP)
- Parallel Environment (PE): POE, MPI
- LoadLeveler

Message Passing Hardware
- Previous switch: TPMX ("The switch")
  - ~150 Mbyte/s bandwidth per message
  - ~25 microsec. latency
- Now available: "Colony"
  - ~500 Mbyte/s bandwidth per message
  - ~15 microsec. latency
- Future technology: "Federation"
  - ~1 Gbyte/s bandwidth

Parallel Environment (PE)
- Runs on: SP systems, AIX workstation clusters
- Components:
  - LoadLeveler
  - Parallel Operating Environment (POE)
  - Message passing libraries (MPL and MPI)
  - Parallel debuggers
  - Visualization Tool (VT)
- PATH=/usr/lpp/LoadL/full/bin
- RS/6000 SP Product Documentation:
  www.rs6000.ibm.com/resource/aix_resource/sp_books/PE

Parallel Operating Environment (POE)
- Takes the place of the "mpirun" command
- Also distributes the local environment:
  local environment variables are exported to the other nodes
- Examples:
    $ poe a.out -procs ...
    $ a.out -procs ...             # poe is implied
    $ poe ksh myscript.ksh ...     # runs "myscript.ksh" on the nodes
                                   # listed in host.list

POE: Control
- Control variables are prefixed with MP_

  Variable            Values       Description
  MP_NODES            1, ..., n    Nodes
  MP_TASKS_PER_NODE   1, ..., n    Tasks per node
  MP_PROCS            1, ..., n    No. of tasks (processes)
                                   = MP_NODES * MP_TASKS_PER_NODE
  MP_HOSTFILE         host.list    List of node names (default: ./host.list)
  MP_LABELIO          yes, no      Label I/O with the task number

POE: Number of Tasks (Processors)
- MP_PROCS = MP_NODES * MP_TASKS_PER_NODE
  - MP_PROCS:           total number of processes
  - MP_NODES:           number of nodes to use
  - MP_TASKS_PER_NODE:  number of processes per node
- Any two of the variables can be specified
- MP_TASKS_PER_NODE is (usually) the number of CPUs per node

POE: Tuning

  Variable           Values              Description
  MP_BUFFER_MEM      0 - 64 000 000      Buffer for early arrivals
  MP_EAGER_LIMIT     0 - 65536           Threshold for rendezvous protocol
  MP_SHARED_MEMORY   yes, no             Use shared memory on node
  MP_EUILIB          us, ip              Communication method
  MP_WAIT_MODE       poll, yield, sleep  US default: poll; IP default: yield
  MP_SINGLE_THREAD   no, yes             Multiple threads per task
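Tying the control and tuning variables together, an interactive run might look like the following sketch; the node counts, host file, and program name are placeholders, and US mode assumes switch adapter windows are available at your site:

    export MP_NODES=2
    export MP_TASKS_PER_NODE=4      # MP_PROCS = 2 * 4 = 8 is then implied
    export MP_HOSTFILE=./host.list
    export MP_LABELIO=yes           # tag each output line with its task number
    export MP_EUILIB=us             # User Space protocol over the switch
    export MP_SHARED_MEMORY=yes     # shared memory messages for on-node tasks
    poe ./a.out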
POE Tuning: MP_EUILIB
- us (User Space, US mode)
  - Low latency: ~10 microsec.
- ip (Internet Protocol, IP mode)
  - Unlimited number of tasks
  - Higher latency: ~50 - 100 microsec.
- (Diagram: US mode reaches the switch adaptor directly from user space;
  IP mode passes through the kernel.)

Flow Control
- Small message (up to MP_EAGER_LIMIT): send header and message; receive
- Large message: send header; acknowledge; send message; receive

  Default MP_EAGER_LIMIT by number of tasks:

  No. Tasks     MP_EAGER_LIMIT (default, bytes)
  1 to 16       4096
  17 to 32      2048
  33 to 64      1024
  65 to 128     512
  129 to 256    256
  256 - ...     128

- Small messages: lower latency; MPI_Send is more like MPI_Isend
- Large messages: rendezvous protocol; MPI_Send is equivalent to MPI_Ssend
- MP_EAGER_LIMIT can be set up to 65536

Unsafe Send-Receive
- Example: circular shift of data among tasks 0, 1, 2, 3, 4 in a ring
- Blocking communication calls:

    CALL MPI_SEND(sbuf, size, MPI_INTEGER, next, 0, MPI_COMM_WORLD, ierr)
    CALL MPI_RECV(rbuf, size, MPI_INTEGER, prev, 0, MPI_COMM_WORLD, status, ierr)

- Does not always work
- In rendezvous mode this deadlocks: the send does not return

Safe Send-Receive
- Nonblocking communication calls:

    CALL MPI_ISEND(sbuf, size, MPI_INTEGER, next, 0, MPI_COMM_WORLD, ireq(1), ierr)
    CALL MPI_IRECV(rbuf, size, MPI_INTEGER, prev, 0, MPI_COMM_WORLD, ireq(2), ierr)
    ... (as much computation as possible) ...
    CALL MPI_WAITALL(2, ireq, stat2, ierr)

- Always works: MPI_ISEND returns in all cases
- Best performance

Strategy
- Develop the application with MP_EAGER_LIMIT=0, so every message uses the
  rendezvous protocol and unsafe communication deadlocks immediately
- Run the application with MP_EAGER_LIMIT=65536
  (see the sketch below)
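This strategy can be captured in a short wrapper; the sketch below only restates the two settings from the slide, with a placeholder executable and task count:

    # Development: force the rendezvous protocol so unsafe patterns deadlock
    export MP_EAGER_LIMIT=0
    poe ./a.out -procs 8

    # Production: raise the eager limit so small messages skip the handshake
    export MP_EAGER_LIMIT=65536
    poe ./a.out -procs 8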
Multiple Program, Multiple Data (MPMD)
- Each task in an MPI session can be a unique program:
    export MP_PGMMODEL=<mpmd|spmd>
    export MP_CMDFILE=cmdfile
- Example command file and host file:

    cmdfile     hostfile
    a.out       node1
    b.out       node2
    c.out       node1

- Execution:
    $ export MP_PGMMODEL=mpmd
    $ export MP_CMDFILE=cmdfile
    $ poe -procs 3

MPI Performance: Colony
(Bandwidth and latency plots vs. message length, single and multiple tasks;
summary values below.)
- Shared memory MPI:     bandwidth 2200 - 2500 Mbyte/s, latency 2.7 microsec.
- US mode (User Space):  bandwidth 170 - 225 Mbyte/s,   latency 10 microsec.
- IP mode (on switch):   bandwidth 165 - 205 Mbyte/s,   latency 30 microsec.

LoadLeveler
- Batch queuing system
- RS/6000 SP Product Documentation:
  www.rs6000.ibm.com/resource/aix_resource/sp_books/
- PATH=/usr/lpp/LoadL/full/bin

LoadLeveler Commands

  Command     Description
  llq         Info. on dispatched jobs
  llclass     Info. on job classes
  llsubmit    Submit a job (script) to LoadLeveler for execution
  llcancel    Kill or delete a submitted job

Example: 32 CPUs on 4 Nodes
(Node_A, Node_B, Node_C, Node_D, each with CPUs 0-7.)

- LoadLeveler (batch):

    #!/bin/ksh
    # @ PROCS=32
    # @ TASKS_PER_NODE=4
    ...
    a.out

- POE (interactive):

    MP_PROCS=32
    HOSTFILE:
      Node_A   (4 times)
      Node_B   (4 times)
      Node_C   ...
      ...
    $ poe a.out ...

LoadLeveler Script
- Specify shell, stdout, stderr
- Notification (email)
- Environment: environment variables
- Time limit and class
- Job type: serial or parallel
- CPUs

    #!/bin/ksh
    # @ error = Error.file
    # @ output = Output.file
    # @ notification = never
    # @ environment = \
        MP_EUILIB=us;\
        MP_SHARED_MEMORY=yes;\
        MP_EAGER_LIMIT=65536
    # @ requirements = ( Pool == 32 )
    # @ job_type = parallel
    # @ node = 4
    # @ tasks_per_node = 32
    # @ network.mpi = csss,not_shared,US
    # @ wall_clock_limit = 1000
    # @ class = batch
    # @ node_usage = not_shared
    # @ queue
    export OMP_NUM_THREADS=1
    export XLFRTEOPTS=namelist=old

LoadLeveler Script: CPUs
- Nodes:           # @ node =
- Tasks per node:  # @ tasks_per_node =
- Total tasks:     # @ total_tasks =
- Task geometry:   # @ task_geometry =

    #!/bin/ksh
    ...
    # @ node = 4
    # @ tasks_per_node = 32
    ...
    # @ queue

    #!/bin/ksh
    ...
    # @ task_geometry = {(0,2,4) (1,3)}
    ...
    # @ queue
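A typical submit-and-monitor cycle with the commands listed above might look like this sketch; the script file name and the job id are placeholders:

    $ llsubmit job.cmd      # submit a LoadLeveler script such as the one above
    $ llq                   # check the state of dispatched jobs
    $ llclass               # list the available job classes
    $ llcancel <job id>     # kill or delete the submitted job if necessary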