Parallel Programming Orientation
Agenda
 Parallel jobs
» Paradigms to parallelize algorithms
» Profiling and compiler optimization
» Implementations to parallelize code
• OpenMP
• MPI
 Queuing
» Job queuing
» Integrating Parallel Programs
 Questions and Answers
Traditional “Ad-Hoc” Linux Cluster
 Full Linux install to disk, loaded into memory
» Manual & slow; 5-30 minutes
 Full set of disparate daemons, services, user/password, & host access setup
 Basic parallel shell with complex glue scripts to run jobs
 Monitoring & management added as isolated tools
(Diagram: Master Node connected to compute nodes over the Interconnection Network and to the Internet or Internal Network)
Cluster Virtualization Architecture Realized
 Minimal in-memory OS with a single daemon, rapidly deployed in seconds; no disk required
» Less than 20 seconds
 Virtual, unified process space enables intuitive single sign-on and job submission
» Effortless job migration to nodes
 Monitor & manage efficiently from the Master
» Single System Install
» Single Process Space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which is inherently more reliable
(Diagram: Master Node with optional disks, connected to compute nodes over the Interconnection Network and to the Internet or Internal Network)
Just a Primer
 Only a brief introduction is provided here. Many other
in-depth tutorials are available on the web and in
published sources.
» http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
» https://computing.llnl.gov/?set=training&page=index
Parallel Code Primer
 Paradigms for writing parallel programs depend upon the
application
» SIMD (single-instruction multiple-data)
» MIMD (multiple-instruction multiple-data)
» MISD (multiple-instruction single-data)
 SIMD will be presented here as it is a commonly used template
» A single application source is compiled to perform operations on
different sets of data (Single Program Multiple Data (SPMD) model)
» The data is read by the different threads or passed between threads via
messages (hence MPI = message passing interface)
• Contrast this with shared memory or OpenMP, where data is accessed locally via
memory
• Optimizations in the MPI implementation can perform localhost
optimization; however, the program is still written using a message
passing construct
Explicitly Parallel Programs
 Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
 Most distributed parallel programs are now written
using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, Intel
» ClusterWare comes integrated with customized versions of
MPICH and OpenMPI
Example Code
 Calculate π through numerical integration
» Compute π by integrating f(x) = 4/(1 + x**2) from 0 to 1
• Function is the derivative of arctan(x)
• See source code
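» A rough serial sketch of the integration (the actual cpi-serial.c used in the timings may differ in its details):

/* Midpoint-rule integration of 4/(1 + x^2) over [0,1]; illustrative sketch only */
#include <stdio.h>
#include <math.h>

double f(double a) { return 4.0 / (1.0 + a * a); }

int main(void)
{
    double PI25DT = 3.141592653589793238462643;
    int    n = 400000000;            /* number of intervals, as in the timing runs */
    double h = 1.0 / (double)n;      /* width of each interval */
    double sum = 0.0, x;
    int    i;

    printf("Process 0\n");
    for (i = 1; i <= n; i++) {
        x = h * ((double)i - 0.5);   /* midpoint of interval i */
        sum += f(x);
    }
    printf("pi is approximately %.16f, Error is %.16f\n",
           h * sum, fabs(h * sum - PI25DT));
    return 0;
}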
Compiling and running code
set n = 400,000,000

$ gcc -o cpi-serial cpi-serial.c
$ time ./cpi-serial
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.009s
user    0m11.007s
sys     0m0.000s

$ gcc -g -pg -o cpi-serial_prof cpi-serial.c
$ time ./cpi-serial_prof
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m11.012s
user    0m11.010s
sys     0m0.000s

$ ls -ltra
total 40
…
-rw-rw-r-- 1 jhan jhan 586 Mar  7 14:45 gmon.out
Profiling
 -g flag includes debugging information in the binary; useful for gdb tracing of an
application and for profiling
 -pg flag generates code to write profile information
$ gprof cpi-serial_prof gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
74.48      2.85      2.85                             main
23.95      3.77      0.92 400000000    2.29     2.29  f
 2.37      3.86      0.09                             frame_dummy

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.26% of 3.86 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     97.7    2.85    0.92                 main [1]
                0.92    0.00 400000000/400000000     f [2]
-----------------------------------------------
                0.92    0.00 400000000/400000000     main [1]
[2]     23.8    0.92    0.00 400000000         f [2]
-----------------------------------------------
                                                 <spontaneous>
[3]      2.3    0.09    0.00                 frame_dummy [3]
-----------------------------------------------
 -A flag to gprof shows calls next to source code
Profiling Tips
 Code should be profiled using realistic data set
» Contrast the call graphs of n=100 versus n=400,000,000
 Profiling can give tips about where to optimize the
current algorithm, but it can’t suggest alternative
(better) algorithms
» e.g. Monte Carlo algorithm to calculate π
 Amdahl’s Law
» The speedup parallelization achieves is limited by the serial
part of the code
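» As a rough illustration (numbers chosen for the example, not measured from this code): if a fraction P of the runtime can be parallelized, the speedup on N processors is roughly 1 / ((1 - P) + P/N). With P = 0.95 and N = 8 that is about 5.9x, and no number of processors can push it past 1 / (1 - P) = 20x.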
OpenMP Introduction
 Parallelization using shared memory in a single
machine
» A portion of the code is forked on the machine to parallelize it
» i.e. not distributed parallelization
 Done using pragmas in the source code. Compiler
must support OpenMP (gcc 4, Intel, etc.)
» $ gcc -fopenmp -o cpi-openmp cpi-openmp.c
» See Source Code
 Profiling can add overhead to the resulting executable
 ‘time’ can be used to measure improvement
 Runtime selection of the number of threads using
OMP_NUM_THREADS environment variable
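» A minimal sketch of how the same loop might be parallelized with an OpenMP pragma (the actual cpi-openmp.c may differ):

/* Illustrative OpenMP version: the parallel-for pragma splits the loop across
   threads and the reduction clause combines the per-thread partial sums */
#include <stdio.h>
#include <math.h>

double f(double a) { return 4.0 / (1.0 + a * a); }

int main(void)
{
    double PI25DT = 3.141592653589793238462643;
    int    n = 400000000;
    double h = 1.0 / (double)n;
    double sum = 0.0;
    int    i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += f(x);
    }
    printf("pi is approximately %.16f, Error is %.16f\n",
           h * sum, fabs(h * sum - PI25DT));
    return 0;
}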
Scaling with OpenMP
$ time OMP_NUM_THREADS=1 ./cpi-openmp
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ time OMP_NUM_THREADS=2 ./cpi-openmp
Process 0
Process 1
pi is approximately 3.1415926535900218, Error is 0.0000000000002287

real    0m5.295s
user    0m11.297s
sys     0m0.000s

$ time OMP_NUM_THREADS=4 ./cpi-openmp
…
real    0m2.650s
user    0m10.586s
sys     0m0.003s

$ time OMP_NUM_THREADS=8 ./cpi-openmp
…
real    0m1.416s
user    0m11.294s
sys     0m0.001s
Scaling with OpenMP
(Chart: OpenMP Speedup - measured speedup plotted against ideal speedup as a function of the number of processors, 1 to 8)
 Code is easy to parallelize
 Good scaling is seen up to 8 processors; a kink in the curve is expected
Role of the Compiler
 OpenMP parallelization requires compiler support (gcc 4, Intel, etc.); it is done using
pragmas in the source code
» $ gcc -fopenmp -o cpi-openmp cpi-openmp.c
 Compiler choice and optimization flags strongly affect the resulting runtime (see the
GCC versus Intel timings that follow)
 ‘time’ can be used to measure improvement
 Runtime selection of the number of threads using the
OMP_NUM_THREADS environment variable
GCC versus Intel C
$ time OMP_NUM_THREADS=1 ./cpi-openmp
…
real    0m10.583s
user    0m10.581s
sys     0m0.001s

$ gcc -O3 -fopenmp -o cpi-openmp-gcc-O3 cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-gcc-O3
Process 0
pi is approximately 3.1415926535895520, Error is 0.0000000000002411

real    0m3.154s
user    0m3.143s
sys     0m0.011s

$ time OMP_NUM_THREADS=8 ./cpi-openmp-gcc-O3
…
real    0m0.399s
user    0m3.181s
sys     0m0.001s

$ icc -openmp -o cpi-openmp-icc cpi-openmp.c
$ time OMP_NUM_THREADS=1 ./cpi-openmp-icc
Process 0
pi is approximately 3.1415926535899272, Error is 0.0000000000001341

real    0m1.618s
user    0m1.575s
sys     0m0.000s

$ time OMP_NUM_THREADS=8 ./cpi-openmp-icc
…
real    0m0.204s
user    0m1.584s
sys     0m0.001s
Compiler Timings
OpenMP Summary
 OpenMP provides a mechanism to parallelize within a
single machine
 Shared memory and variables are handled
“automatically”
 With an appropriate compiler, OpenMP can provide
significant speedups
 Coupled with large core count SMP machines, OpenMP
could be all of the parallelization required
» GPU programming is similar to the OpenMP model
Running MPI Code
 Binaries are executed simultaneously
» on the same machine or different machines
 After the binaries start running, the
MPI_COMM_WORLD communicator is established
 Any data to be transferred must be explicitly
determined by the programmer
 Hooks exist for a number of languages
» E.g. Python (https://computing.llnl.gov/code/pdf/pyMPI.pdf)
Example MPI Source
 cpi.c calculates π using MPI in C

/* cpi.c -- compute pi by integrating f(x) = 4/(1 + x**2) */
#include "mpi.h"     /* system include file that defines the MPI functions */
#include <stdio.h>
#include <math.h>

double f( double );
double f( double a )
{
    return (4.0 / (1.0 + a*a));
}

int main( int argc, char *argv[] )
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);                            /* initialize the MPI execution environment */
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);          /* number of processes in the communicator group */
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);              /* rank of the calling process in the communicator */
    MPI_Get_processor_name(processor_name,&namelen);  /* name of the processor this rank runs on */

    fprintf(stderr,"Process %d on %s\n", myid, processor_name);

    n = 0;
    while (!done)
    {
        if (myid == 0)      /* actions differ based on rank: only the "master" (rank 0) sets n */
        {
            /*
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
            */
            if (n==0) n=100; else n=0;
            startwtime = MPI_Wtime();
        }
        /* broadcast the value of n (one MPI_INT) from rank 0 to all other processes in MPI_COMM_WORLD */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else
        {
            h   = 1.0 / (double) n;
            sum = 0.0;
            /* each worker does the loop on increments based on its rank
               (versus dividing the range, which risks an off-by-one error) */
            for (i = myid + 1; i <= n; i += numprocs)
            {
                x = h * ((double)i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;
            /* sum (built-in MPI_SUM) the MPI_DOUBLE value &mypi from all processes
               of the group into the single value &pi at rank 0 */
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE,
                       MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)   /* only rank 0 outputs the value of pi and the wall clock time */
            {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
            }
        }
    }
    MPI_Finalize();          /* terminate MPI execution */
    return 0;
}
Other Common MPI Functions
 MPI_Send, MPI_Recv
» Blocking send and receive between two specific ranks
 MPI_Isend, MPI_Irecv
» Non-blocking send and receive between two specific ranks
 man pages exist for the MPI functions
 Poorly written programs can suffer from poor communication efficiency
(e.g. stair-stepping) or lose data if the system buffer fills before a blocking
send or receive is posted to match a non-blocking receive or send
 Use care when creating temporary files: multiple ranks may be running on the
same host and overwrite the same temporary file
(include the rank in the file name, in a unique temporary directory per simulation)
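» A minimal sketch (not taken from this deck's source files) of a blocking point-to-point transfer between two ranks; the value 42 is arbitrary:

/* Rank 0 sends one integer to rank 1, which blocks in MPI_Recv until it arrives.
   Run with at least 2 processes, e.g. mpirun -np 2 */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* dest=1, tag=0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* src=0, tag=0 */
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}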
Compiling MPICH programs
 mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the correct MPI
libraries from /usr/lib64/MPICH
» Environment variables can be used to set the compiler:
• CC, CPP, FC, F90
» Command line options to set the compiler:
• -cc=, -cxx=, -fc=, -f90=
» GNU, PGI, and Intel compilers are supported
Running MPICH programs
 mpirun is used to launch MPICH programs
 Dynamic allocation can be done when using the -np flag
 Mapping is also supported when using the -map flag
 If Infiniband is installed, the interconnect fabric can be
chosen using the -machine flag:
» -machine p4
» -machine vapi
Scaling with MPI
$ which mpicc
/usr/bin/mpicc
$ mpicc -show -o cpi-mpi cpi-mpi.c
gcc -L/usr/lib64/MPICH/p4/gnu -I/usr/include -o cpi-mpi cpi-mpi.c -lmpi -lbproc
$ mpicc -o cpi-mpi cpi-mpi.c

$ time mpirun -np 1 ./cpi-mpi
Process 0 on scyld.localdomain
…
real    0m11.198s
user    0m11.187s
sys     0m0.010s

$ time mpirun -np 2 ./cpi-mpi
Process 0 on scyld.localdomain
Process 1 on n0
…
real    0m6.486s
user    0m5.510s
sys     0m0.009s

$ time mpirun -map -1:-1:-1:-1:-1:-1:-1:-1:0:0:0:0:0:0:0:0 ./cpi-mpi
…
real    0m1.283s
user    0m1.381s
sys     0m0.016s
Environment Variable Options
 Additional environment variable control:
» NP — The number of processes requested, but not the number of
processors. As in the example earlier in this section, NP=4 ./a.out will run
the MPI program a.out with 4 processes.
» ALL_CPUS — Set the number of processes to the number of CPUs
available to the current user. Similar to the example above, ALL_CPUS=1
./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES — Set the number of processes to the number of nodes
available to the current user. Similar to the ALL_CPUS variable, but you get
a maximum of one CPU per node. This is useful for running a job per node
instead of per CPU.
» ALL_LOCAL — Run every process on the master node; used for debugging
purposes.
» NO_LOCAL — Don’t run any processes on the master node.
» EXCLUDE — A colon-delimited list of nodes to be avoided during node
assignment.
» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node
listed will be the first process (MPI Rank 0) and so on.
Compiling and Running OpenMPI
programs
 The env-modules package allows users to change their
environment variables according to predefined files
» module avail
» module load openmpi/gnu
» GNU, PGI, and Intel compilers are supported
 mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the correct MPI
libraries from /opt/scyld/openmpi
 mpirun is used to run code
 Interconnect can be selected at runtime
» -mca btl openib,tcp,sm,self
» -mca btl udapl,tcp,sm,self
Compiling and Running OpenMPI
programs
What env-modules does:
 Set user environment prior to compiling
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
 mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and
link in the correct MPI libraries from /opt/scyld/openmpi
» Environment variables can be used to set the compiler:
• OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_FC
 Prior to running, PATH and LD_LIBRARY_PATH should be set
» module load openmpi/gnu
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
OR:
» export PATH=/opt/scyld/openmpi/gnu/bin:${PATH}
export MANPATH=/opt/scyld/openmpi/gnu/share/man
export LD_LIBRARY_PATH=/opt/scyld/openmpi/gnu/lib:${LD_LIBRARY_PATH}
» /opt/scyld/openmpi/gnu/bin/mpirun -np 16 a.out
Scaling with MPI Implementations
(Chart: "Scaling with MPI for gcc" - speedup relative to GCC vs. number of processors, up to 16. Series: Ideal Speedup 1:1; MPICC - GCC - P4; MPICC - GCC - VAPI; OpenMPI - GCC - tcp,sm,self; OpenMPI - GCC - openib,sm,self)
 Infiniband allows wider scaling
 Performance difference between MPICH and OpenMPI is visible
 A little artificial because only two physical machines were used
Scaling with MPI Implementations
(Chart: "Compiler Comparison" - speedup relative to GCC vs. number of processors, up to 16. Series: Ideal Speedup 1:1; MPICC - GCC - P4; MPICC - GCC - VAPI; OpenMPI - GCC - tcp,sm,self; OpenMPI - GCC - openib,sm,self; Ideal Speedup 3.49:1; MPICC - GCC -O3 - P4; MPICC - GCC -O3 - VAPI; MPICC - ICC - P4; MPICC - ICC - VAPI; OpenMPI - GCC -O3 - tcp,sm,self; OpenMPI - GCC -O3 - openib,sm,self; OpenMPI - ICC - tcp,sm,self; OpenMPI - ICC - openib,sm,self)
 Larger problems would allow continued scaling
MPI Summary
 MPI provides a mechanism to parallelize in a
distributed fashion
» Localhost optimization is done on a shared memory
machine
 Shared variables are explicitly handled by the
developer
 Tradeoff between CPU and I/O can determine the
performance characteristics
 Hybrid programming models are possible
» MPI code with OpenMP sections
» MPI code with GPU calls
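» A rough sketch of the hybrid MPI + OpenMP structure (illustrative only, not from this deck's sources); it would be compiled with the MPI wrapper plus the OpenMP flag (e.g. mpicc -fopenmp):

/* Each MPI rank takes a strided share of the loop; an OpenMP pragma then splits
   that share across the cores of the node the rank runs on. No MPI calls are
   made inside the threaded region. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int    myid, numprocs, i, n = 400000000;
    double h, x, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    h = 1.0 / (double)n;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    /* combine the per-rank partial results at rank 0 */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}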
Queuing
 How are resources allocated among multiple users
and/or groups?
» Statically by using bpctl user and group permissions
» ClusterWare supports a variety of queuing packages
• TaskMaster (advanced MOAB policy-based scheduler integrated with
ClusterWare)
• Torque
• SGE
Interacting with Torque
 To submit a job:
» qsub script.sh
• Example script.sh:
#!/bin/sh
#PBS -j oe
#PBS -l nodes=4
cd $PBS_O_WORKDIR
hostname
• qsub does not accept arguments for script.sh. All executable
arguments must be included in the script itself
» Administrators can create a ‘qapp’ script that takes user
arguments, creates script.sh with the user arguments
embedded, and runs ‘qsub script.sh’
Interacting with Torque
 Other commands
» qstat – Status of queue server and jobs
» qdel – Remove a job from the queue
» qhold, qrls – Hold and release a job in the queue
» qmgr – Administrator command to configure pbs_server
» /var/spool/torque/server_name: should match hostname of the head
node
» /var/spool/torque/mom_priv/config: file to configure pbs_mom
• ‘$usecp *:/home /home’ indicates that pbs_mom should use ‘cp’
rather than ‘rcp’ or ‘scp’ to relocate the stdout and stderr files at the end of
execution
» pbsnodes – Administrator command to monitor the status of the
resources
» qalter – Administrator command to modify the parameters of a
particular job (e.g. requested time)
Other options to qsub
 Options that can be included in a script (with the #PBS
directive) or on the qsub command line
» Join output and error files: #PBS -j oe
» Request resources: #PBS -l nodes=2:ppn=2
» Request walltime: #PBS -l walltime=24:00:00
» Define a job name: #PBS -N jobname
» Send mail at job events: #PBS -m be
» Assign job to an account: #PBS -A account
» Export current environment variables: #PBS -V
 To start an interactive queue job use:
» qsub -I for Torque
» qrsh for SGE
Queue script case studies
 qapp script:
» Be careful about escaping special characters in the redirect section (\$, \', \")

#!/bin/bash
#Usage: qapp arg1 arg2
debug=0
opt1="${1}"
opt2="${2}"
if [[ "${opt2}" == "" ]] ; then
  echo "Not enough arguments"
  exit 1
fi
cat > app.sh << EOF
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd \$PBS_O_WORKDIR
app $opt1 $opt2
EOF
if [[ "${debug}" -lt 1 ]] ; then
  qsub app.sh
fi
if [[ "${debug}" -eq 0 ]] ; then
  /bin/rm -f app.sh
fi
Queue script case studies
 Using local scratch:
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/bin/mkdir -p $tmpdir
rsync -a ./ $tmpdir
cd $tmpdir
$pathto/app arg1 arg2
cd $PBS_O_WORKDIR
rsync -a $tmpdir/ .
/bin/rm -fr $tmpdir
Queue script case studies
 Using local scratch for MPICH parallel jobs:
» pbsdsh is a Torque command

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
mpirun -machine vapi $pathto/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Queue script case studies
 Using local scratch for OpenMPI parallel jobs:
» Do a 'module load openmpi/gnu' prior to running qsub
» OR explicitly include a 'module load openmpi/gnu' in the script itself

#!/bin/bash
#PBS -j oe
#PBS -l nodes=2:ppn=8
#PBS -V
cd $PBS_O_WORKDIR
tmpdir="/scratch/$USER/$PBS_JOBID"
/usr/bin/pbsdsh -u "/bin/mkdir -p $tmpdir"
/usr/bin/pbsdsh -u bash -c "cd $PBS_O_WORKDIR ; rsync -a ./ $tmpdir"
cd $tmpdir
/usr/openmpi/gnu/bin/mpirun -np `cat $PBS_NODEFILE | wc -l` -mca btl openib,sm,self $pathto/app arg1 arg2
cd $PBS_O_WORKDIR
/usr/bin/pbsdsh -u "rsync -a $tmpdir/ $PBS_O_WORKDIR"
/usr/bin/pbsdsh -u "/bin/rm -fr $tmpdir"
Other considerations
 A queue script need not be a single command
» Multiple steps can be performed from a single script
• Guaranteed resources
• Jobs should typically be a minimum of 2 minutes
» Pre-processing and post-processing can be done from the
same script using the local scratch space
» If configured, it is possible to submit additional jobs from a
running queued job
 To remove multiple jobs from the queue:
» qstat | grep " [RQ] " | awk '{print $1}' | xargs qdel
Integrating Parallel Programs
 The scheduler only keeps track of available resources
 It does not monitor how the resources are actually used
 Onus is on the user to request and use the correct
resources
» OpenMP: be sure to request multiple processors on the same
machine
• Torque: #PBS -l nodes=1:ppn=x
• SGE: Correct PE (parallel environment) submission
» MPI: be sure to use the machines that have been assigned by
the queue system
• Torque: MPICH and OpenMPI mpirun will do the correct thing.
$PBS_NODEFILE contains a list of assigned hosts
• SGE: $PE_HOSTFILE contains a list of assigned hosts. OpenMPI’s
mpirun may need to be recompiled
Integrating Parallel Programs
 Be careful about task pinning (taskset)
» Different jobs may assume the same CPU set, resulting in
oversubscription of some cores while other cores sit idle
» In a shared environment, not using task pinning can be easier
at a slight trade-off in performance
 Make sure that the same MPI implementation and
compiler combination is used to run the code as was
used to compile and link
Questions??