Message Passing: MPI

May, 2002

Charles Grassl
IBM Advanced Computing Technology Center
Agenda

SP System Introduction
MPI Introduction
Compilers
Hardware Environment
Software Environment
PSSP
LoadLeveler
POE
Point to Point Communication
pSeries Nodes

Flat address space
Uniform memory access times (approximately)
Message passing works very well on-node
  Shared memory MPI
SMP (e.g. OpenMP) works very well on-node
  Does not work between nodes
Shared Memory

Characteristics:
  Flat address space
  Single operating system
Limitations:
  Memory contention
  Bandwidth
  Cache coherency
Benefits:
  Memory size

(Diagram: several CPUs sharing one memory)
Distributed Memory / Distributed Shared Memory

Characteristics:
  Multiple address spaces
  Multiple operating systems
Limitations:
  Memory contention
  Bandwidth
  Local memory size
Benefits:
  Cache coherency

(Diagram: nodes of CPUs and memory connected by a switch fabric)
NUMA: Non-Uniform Memory Access

Blend of distributed and shared memory

Benefits:
  Flat address space
  Large memory
  Cache coherency
Limitations:
  Unpredictable performance
  Page management
  Multiple programming models

(Diagram: groups of CPUs and memories combined into one NUMA system)
Why Various Memory Designs?
Shared
Shared memories
memories allow
allow larger
larger address
address
space
space
Distributed
Distributed memory
memory has
has more
more scalability
scalability
Economics
Economics
CPU
Emphasize!
Invoking the "MPI" Compiler
Message
Message Passing (MPI) works VERY well
on
on shared memory nodes
MPI
MPI usage:
usage: set
set 'mp'
'mp' before
before the
the compiler
compiler
mp
mpxlf
xlf ....
mp
cc
mpcc
Uses
Uses shared
shared memory
memory messages
messages
SMP
SMP (OpenMP)
(OpenMP) does
does not
not work
work between
between
nodes
nodes
Sets
Sets include
include path
path
Links
Links libmpi.a
libmpi.a
Only
Only scales
scales up
up to
to number
number of
of CPUs
CPUs on
on node
node
Language
FORTRAN 77
FORTRAN 90
FORTRAN 95
C
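As a minimal sketch (not from the original slides; the program and file name hello.f are illustrative), an MPI program compiled this way looks like:

      program hello
      implicit none
      include 'mpif.h'
      integer ierr, rank, nprocs

c     Initialize MPI and query task id and number of tasks
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      print *, 'Hello from task', rank, 'of', nprocs

      call MPI_FINALIZE(ierr)
      end

Compile and run, for example: $ mpxlf -o hello hello.f   then   $ poe ./hello -procs 4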
Fortran Compiler Invocations

Language     Compile Command    Source Format   Storage Class
FORTRAN 77   mpxlf, mpxlf77     Fixed           Static
FORTRAN 90   mpxlf90            Free            Automatic
FORTRAN 95   mpxlf95            Free            Automatic
C            mpcc               -               -

Overriding the defaults:
  Source format: mpxlf ... -q{fixed|free} ...
  Storage class: mpxlf ... -q{save|nosave} ...
Linker Options

Option              Description
-qsmp               In case of using threads
-bmaxdata:<bytes>   Maximum amount of space to reserve for the program data segment
-bmaxstack:<bytes>  Maximum amount of space to reserve for the program stack segment (256 Mbyte limit)

Note: -lmpi does not need to be specified when using mpxlf.
Addressing Modes

Operating system environment:
  Default is 32-bit (-q32)
  -q64 option selects 64-bit addressing
Does not affect CPU performance
  All pointers use 8 bytes
  INTEGER*8 calculations are done in hardware
The default datatypes DO NOT change
  INTEGER is still 4 bytes
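A small illustrative check (not from the slides; the program name is hypothetical): compiled either 32-bit or with -q64, the default INTEGER is still 32 bits while INTEGER*8 is 64 bits.

      program addrmode
      implicit none
      integer   i4
      integer*8 i8

c     bit_size reports the width of each integer kind in bits:
c     32 for the default INTEGER and 64 for INTEGER*8, in both
c     32-bit and 64-bit (-q64) addressing modes.
      print *, 'default INTEGER bits:', bit_size(i4)
      print *, 'INTEGER*8 bits:      ', bit_size(i8)
      end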
Parallel System Support Program (PSSP)

Software stack (diagram): the Parallel Environment (PE), with POE, MPI and LoadLeveler, runs on top of PSSP, which runs on AIX.
Message Passing Hardware

Previous switch: TPMX ("The switch")
  ~150 Mbyte/s bandwidth per message
  ~25 microsec. latency
Now available: "Colony"
  ~500 Mbyte/s bandwidth per message
  ~15 microsec. latency
Future technology: "Federation"
  ~1 Gbyte/s bandwidth
Parallel Environment (PE)

Runs on:
  SP systems
  AIX workstation clusters
Components:
  LoadLeveler
  Parallel Operating Environment (POE)
  Message passing libraries (MPL and MPI)
  Parallel debuggers
  Visualization Tool (VT)

PATH=/usr/lpp/LoadL/full/bin
RS/6000 SP Product Documentation:
  www.rs6000.ibm.com/resource/aix_resource/sp_books/PE
Parallel Operating Environment (POE)

Takes the place of the "mpirun" command
Also distributes the local environment:
  Local environment variables are exported to the other nodes

Examples:
  $ poe a.out -procs ...
or
  $ a.out -procs ...           # poe is implied
  $ poe ksh myscript.ksh ...   # runs "myscript.ksh" on the nodes listed in the host file
POE: Control

Control variables are prefaced by MP_

Variable            Values      Description
MP_NODES            1, ..., n   Number of nodes
MP_TASKS_PER_NODE   1, ..., n   Tasks per node
MP_PROCS            1, ..., n   Number of tasks (processes) = MP_NODES * MP_TASKS_PER_NODE
MP_HOSTFILE         host.list   List of node names (default: ./host.list)
MP_LABELIO          yes, no     Label I/O with task number
POE Tuning: Number of Tasks (Processes)

MP_PROCS = MP_NODES * MP_TASKS_PER_NODE

MP_PROCS:           total number of processes
MP_NODES:           number of nodes to use
MP_TASKS_PER_NODE:  number of processes per node

Any two of the three variables can be specified
MP_TASKS_PER_NODE is (usually) the number of CPUs per node
POE Tuning: Variables

Variable           Values               Description
MP_BUFFER_MEM      0 - 64 000 000       Buffer for early arrivals
MP_EAGER_LIMIT     0 - 65536            Threshold for rendezvous protocol
MP_SHARED_MEMORY   yes, no              Use shared memory on node
MP_EUILIB          us, ip               Communication method
MP_WAIT_MODE       poll, yield, sleep   US default: poll; IP default: yield
MP_SINGLE_THREAD   no, yes              Multiple threads per task
POE Tuning: MP_EUILIB

us (User Space mode):
  Messages go from user space directly to the switch adapter, bypassing the kernel
  Low latency: ~10 microsec.
ip (Internet Protocol mode):
  Messages pass through the kernel IP stack to the adapter
  Unlimited number of tasks
  Higher latency: ~50 - 100 microsec.
Flow Control

Small message (length up to MP_EAGER_LIMIT):
  Send the header and the message together (eager protocol)
Large message (rendezvous protocol):
  Send the header
  Receiver acknowledges
  Send the message

Default MP_EAGER_LIMIT by job size:

No. Tasks    MP_EAGER_LIMIT (default, bytes)
1 to 16      4096
17 to 32     2048
33 to 64     1024
65 to 128    512
129 to 256   256
256 - ...    128

For example, a 32-task job defaults to a 2048-byte eager limit, so a 3000-byte message uses the rendezvous protocol.
Unsafe Send-Receive

Example: circular shift of data among tasks 0, 1, 2, 3, 4

Blocking communication calls:

MPI_SEND(sbuf,size,MPI_INTEGER,next,0,MPI_COMM_WORLD,...)
MPI_RECV(rbuf,size,MPI_INTEGER,prev,0,MPI_COMM_WORLD,...)

Does not always work:
  Small messages: lower latency; MPI_Send is more like MPI_Isend
  Large messages: rendezvous protocol; MPI_Send is equivalent to MPI_Ssend
  Above, in rendezvous mode, every task's send waits for a matching receive: deadlock, the send doesn't return

MP_EAGER_LIMIT can be set as high as 65536
Strategy:
  Develop the application with MP_EAGER_LIMIT=0 (exposes unsafe communication)
  Run the application with MP_EAGER_LIMIT=65536
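A self-contained sketch of this unsafe pattern (illustrative, not from the slides; the message length of 100000 integers is arbitrary):

      program shift_unsafe
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      integer sbuf(n), rbuf(n)
      integer ierr, rank, nprocs, next, prev
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      next = mod(rank + 1, nprocs)
      prev = mod(rank - 1 + nprocs, nprocs)
      sbuf = rank

c     Every task sends first, then receives.  For messages above
c     MP_EAGER_LIMIT each MPI_SEND waits for the matching receive
c     (rendezvous), so all tasks block in MPI_SEND: deadlock.
      call MPI_SEND(sbuf, n, MPI_INTEGER, next, 0,
     &              MPI_COMM_WORLD, ierr)
      call MPI_RECV(rbuf, n, MPI_INTEGER, prev, 0,
     &              MPI_COMM_WORLD, status, ierr)

      call MPI_FINALIZE(ierr)
      end

With the default eager limits this message size forces the rendezvous protocol; the nonblocking version shown after the next slide avoids the deadlock.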
Safe Send-Receive

Same circular shift example, with nonblocking communication calls:

MPI_ISEND(sbuf,size,MPI_INTEGER,next,0,MPI_COMM_WORLD,ireq,...)
MPI_IRECV(rbuf,size,MPI_INTEGER,prev,0,MPI_COMM_WORLD,stat,...)
... (as much computation as possible) ...
MPI_WAITALL(ireq,stat2,ierr)

Always works: MPI_ISEND returns in all cases
Best performance
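The corresponding self-contained sketch of the safe version (again illustrative, not taken from the slides):

      program shift_safe
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      integer sbuf(n), rbuf(n)
      integer ierr, rank, nprocs, next, prev
      integer req(2), stats(MPI_STATUS_SIZE, 2)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      next = mod(rank + 1, nprocs)
      prev = mod(rank - 1 + nprocs, nprocs)
      sbuf = rank

c     Post both nonblocking calls; they return immediately, so no
c     task can block another.  Overlap computation here if possible,
c     then wait for both requests to complete.
      call MPI_ISEND(sbuf, n, MPI_INTEGER, next, 0,
     &               MPI_COMM_WORLD, req(1), ierr)
      call MPI_IRECV(rbuf, n, MPI_INTEGER, prev, 0,
     &               MPI_COMM_WORLD, req(2), ierr)
      call MPI_WAITALL(2, req, stats, ierr)

      call MPI_FINALIZE(ierr)
      end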
Multiple Program, Multiple Data (MPMD)

Each task in an MPI session can be a unique program:

  export MP_PGMMODEL=<mpmd/spmd>
  export MP_CMDFILE=cmdfile

cmdfile:
  a.out
  b.out
  c.out

hostfile:
  node1
  node2
  node1

Execution command:
  $ export MP_PGMMODEL=mpmd
  $ export MP_CMDFILE=cmdfile
  $ poe -procs 3
MPI performance measurements (figures):

US Mode MPI Bandwidth (plot of rate vs. message length, curves labeled Single and Multiple):
  Bandwidth: 170 - 225 Mbyte/s
  Latency: 10 microsec.

IP Mode (on switch) MPI Bandwidth (plot of rate vs. message length):
  Bandwidth: 165 - 205 Mbyte/s
  Latency: 30 microsec.

Shared Memory MPI Bandwidth (plot of rate vs. message length):
  Bandwidth: 2200 - 2500 Mbyte/s
  Latency: 2.7 microsec.

MPI Performance: Colony Latency (plot comparing ip, us (User Space) and Shared Memory)
MPI Performance: Colony Bandwidth (plot comparing ip, us (User Space) and Shared Memory, up to ~500 Mbyte/s)
LoadLeveler

Batch queueing system

RS/6000 SP Product Documentation:
  www.rs6000.ibm.com/resource/aix_resource/sp_books/LoadLeveler
PATH=/usr/lpp/LoadL/full/bin
LoadLeveler Commands

Command    Description
llq        Information on dispatched jobs
llclass    Information on job classes
llsubmit   Submit a job (script) to LoadLeveler for execution
llcancel   Kill or delete a submitted job
Example: 32 CPUs on 4 Nodes

Nodes Node_A, Node_B, Node_C and Node_D, each with CPUs 0-7

LoadLeveler script:
  #!/bin/ksh
  # @ PROCS=32
  # @ TASKS_PER_NODE=8
  ...
  a.out

POE (interactive):
  MP_PROCS=32
  HOSTFILE:
    Node_A ... (8 times)
    Node_B ... (8 times)
    Node_C ...
    ....
  $ poe a.out ....
LoadLeveler Script: CPUs

The script specifies:
  Shell, stdout, stderr
  Notification (email)
  Environment: environment variables
  Time limit and class
  Type: serial or parallel
  CPUs

#!/bin/ksh
# @ error = Error.file
# @ output = Output.file
# @ notification = never
# @ environment = \
MP_EUILIB=us;\
MP_SHARED_MEMORY=yes;\
MP_EAGER_LIMIT=65536
# @ requirements = ( Pool == 32 )
# @ job_type = parallel
# @ node = 4
# @ tasks_per_node = 32
# @ network.mpi = csss,not_shared,US
# @ wall_clock_limit=1000
# @ class = batch
# @ node_usage = not_shared
# @ queue
LoadLeveler Script: Specifying CPUs

Nodes and tasks per node:
  #!/bin/ksh
  ...
  # @ node = 4
  # @ tasks_per_node = 32
  ...
  # @ queue

Total tasks:
  # @ total_tasks = ...

Task geometry:
  #!/bin/ksh
  ...
  # @ task_geometry = {(0,2,4) (1,3) }
  ...
  # @ queue
Shell commands in the script body set the runtime environment before the executable runs:

export OMP_NUM_THREADS=1
export XLFRTEOPTS=namelist=old