Job Submission on WestGrid
Feb 15 2005
on Access Grid
Introduction
• Simon Sharpe, a member of the WestGrid support team
• The best way to contact us is to email support@westgrid.ca
• This seminar tells you:
  • How to run, monitor, or cancel your jobs
  • How to select the best site for your job
  • How to adapt your job submission for different sites
  • How to get your jobs running as quickly as possible
• Feel free to interrupt if you have questions
Getting into the Queue
• HPC resources are valuable research tools
• A batch queuing system is needed to:
  • Match jobs to resources
  • Deliver maximum bang for the research buck
  • Distribute jobs and collect output across parallel CPUs
  • Ensure a fair sharing of resources
Getting into the Queue (continued)
• WestGrid compute sites use TORQUE/Moab
• Based on PBS (Portable Batch System)
• You need just a few commands, which are common to WestGrid machines
• There are important differences in job submission among sites that you need to know about
• With the diversity of WestGrid, more than one machine may be suitable for your job
A Simple Sample

This example shows how to run a serial job on Glacier, which is a good choice for serial jobs.

The qsub command tells TORQUE to run the job described in the script file serialhello.pbs.

The script file serialhello.pbs tells TORQUE how to run the C program serialhello.
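
The screenshots from this slide are not reproduced here. As a rough sketch only, a session along the following lines would match the description above. The script contents, the five-minute walltime, and the assumption that the compiled serialhello binary sits in the submission directory are illustrative and are not taken from the original slide.

```shell
# Write a minimal job script. The contents are an illustrative assumption,
# not the script shown in the original slide.
cat > serialhello.pbs <<'EOF'
#!/bin/bash
#PBS -l walltime=00:05:00
# TORQUE starts the job in the home directory, so change to the
# directory the job was submitted from before running the program.
cd "$PBS_O_WORKDIR"
./serialhello
EOF

# Submit only where qsub exists (for example, on a cluster login node).
if command -v qsub >/dev/null 2>&1; then
    qsub serialhello.pbs
else
    echo "qsub not found: run 'qsub serialhello.pbs' on the cluster"
fi
```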

When your job completes, TORQUE creates two new files in the current directory, capturing:
  • error output from the job
  • standard output from the job
End of Seminar
• Thanks for coming
• I wish it was that easy
HPC: One Size Does Not Fit All
• When the only tool you have is a hammer, every job looks like a nail
• Things that affect system selection:
  • System dictated by executable or licensing
  • MPI or OpenMP
  • Availability: How busy is the system?
  • Amount of RAM required
  • Speed or number of processors
HPC: One Size Does Not Fit All
• Things that affect system selection (continued):
  • Scalability of your application
  • Inter-processor communication requirements
  • Queue limits (walltime, number of CPUs)
  • Inertia: It is where we’ve always run it
http://www.westgrid.ca/support/System_Status
http://www.westgrid.ca/support/Facilities
http://www.westgrid.ca/support/software
Uses of WestGrid Machines
Machine                  Use                            Interconnect                CPUs
Glacier (IBM Xeon)       Serial, moderate parallel MPI  GigE, shared in node        1680; dual CPUs/node
Matrix (HP XC Opteron)   MPI parallel                   Infiniband, shared in node  256; dual CPUs/node
Lattice (HP SC Alpha)    Moderate MPI parallel, serial  Quadrics, shared in node    144, 68 (G03); quad CPUs/node
Cortex (IBM Power5)      OpenMP, MPI parallel           Shared memory               64, 64, 4
Nexus (SGI Origin MIPS)  OpenMP, MPI parallel           Shared memory               256, 64, 64, 36, 32, 32, 8
Robson (IBM Power5)      Serial, moderate MPI parallel  GigE, shared in node        56; dual CPUs/node
TORQUE and Moab Commands
qsub script
    Submit this job to the queue. Common options include:
    -l mem=1GB
    -l nodes=4:ppn=2 (or, on Nexus, -l ncpus=4)
    -l walltime=06:00:00
    -q queue-name
    -m and -M for email notifications
showq
    Show me the jobs in the queue
qstat jobid
    Show the status of the job in the queue. Common options include -a and -an
qdel jobid
    Delete this job number from the queue
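
To see how these commands fit together, here is a hedged dry-run sketch. The job ID, the script name myjob.pbs, and the email address are made-up placeholders. The run helper is hypothetical: it prints each command and executes it only where the tool is actually installed.

```shell
# Dry-run wrapper: print each command, and execute it only where the
# TORQUE/Moab tools are installed.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then "$@"; fi
}

jobid=12345   # hypothetical; qsub prints the real job ID when you submit

# Submit with resource requests and email on abort (a) and end (e).
run qsub -l nodes=4:ppn=2 -l walltime=06:00:00 -l mem=1GB \
         -m ae -M you@example.org myjob.pbs
run showq                 # all jobs in the queue
run qstat -a "$jobid"     # status of one job
run qdel "$jobid"         # remove the job from the queue
```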
Sample MPI Job on Glacier
Parallel jobs have differing degrees of parallelism
Glacier, which has a slower interconnect than other WestGrid machines, may not turn out to be the best place for your parallel job
Latency: Like the time it takes to dial and say “hello”
Bandwidth: How fast can you talk?
If your parallel job does not require intensive communication between processes, it may be worth testing on Glacier
More info on Glacier submissions at:
http://www.westgrid.ca/support/programming/glacier.php
http://guide.westgrid.ca/guide-pages/jobs.html
MPI Submission on Glacier
• We need to tell TORQUE how many processors we need

This asks for 2 nodes and 2 processors per node (4 CPUs).
The script is similar to last time, but now calls a program parallelized with MPI.
Adding the walltime estimate helps TORQUE schedule the job.
Note that we can pass directives:
  • on the command line, or
  • in the script

This time we wait in the queue.
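
The original script is not reproduced in this text, so the following is only a sketch of the two ways to pass the same request. The file name mpi_glacier.pbs, the program name mpihello, the ten-minute walltime, and the mpiexec launcher are assumptions. Only the request for 2 nodes with 2 processors each, plus a walltime, comes from the slide.

```shell
# Directives inside the script.
cat > mpi_glacier.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
# The MPI launcher and its arguments differ between sites; mpiexec is
# an assumption here, so check the Glacier pages for the real command.
mpiexec ./mpihello
EOF

# The same request given on the command line instead.
submit_cmd="qsub -l nodes=2:ppn=2 -l walltime=00:10:00 mpi_glacier.pbs"
if command -v qsub >/dev/null 2>&1; then
    $submit_cmd
else
    echo "would run: $submit_cmd"
fi
```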
Sample MPI Job on Matrix
Matrix is an HP XC cluster using AMD Opterons and an Infiniband interconnect
64-bit Linux
Not intended for serial work
A good home for parallel jobs
More info on Matrix submissions at:
http://www.westgrid.ca/support/programming/matrix.php
Running MPI Jobs on Matrix
For Matrix, use nodes and processors per node (ppn) to tell TORQUE how many CPUs your job needs.
Matrix machines have 2 CPUs per node.
A minimal TORQUE script to run a parallel MPI job on Matrix.
Standard output and error output are dropped into the directory we submitted from.
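
The nodes and ppn arithmetic can be sketched as a small helper. The function name nodes_request is hypothetical. It rounds up to whole nodes, so a request that does not divide evenly reserves more CPUs than asked for.

```shell
# Build a nodes/ppn request for a total CPU count.
# Second argument: CPUs per node (2 on Matrix, 4 on Lattice).
nodes_request() {
    total=$1
    per_node=$2
    nodes=$(( (total + per_node - 1) / per_node ))   # round up to whole nodes
    echo "nodes=${nodes}:ppn=${per_node}"
}

nodes_request 8 2   # prints nodes=4:ppn=2
```

The result can be passed to qsub with the -l option, for example qsub -l "$(nodes_request 8 2)" followed by the script name.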
Sample MPI Job on Lattice
Lattice is an HP Alpha cluster connected with Quadrics
64-bit Tru64
Intended for parallel work
Four-processor shared-memory nodes
Quadrics interconnect for more than 4 processors
MPI communicates through the interconnect or shared memory, as appropriate
Also being used for some serial work
More info on Lattice submissions at:
http://hpc.ucalgary.ca/westgrid/running.html
http://www.westgrid.ca/support/programming/lattice.php
Running MPI Jobs on Lattice
For Lattice, use nodes and processors per node to set the number of processors. Lattice has 4 processors on each node.
In this case we ask for 2 CPUs on one box and 2 on another.
A minimal TORQUE script to run a parallel MPI job on Lattice.
Standard output and error output are dropped into the directory we submitted from.
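
One way to confirm the layout a job received is to inspect the node file that TORQUE provides inside a running job. This sketch assumes standard TORQUE behaviour, where PBS_NODEFILE holds one line per allocated processor. The host names nodeA and nodeB are made up so the snippet can also be tried outside a job.

```shell
# Inside a job TORQUE sets PBS_NODEFILE; outside one, fall back to a
# made-up sample that mimics a nodes=2:ppn=2 allocation.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf '%s\n' nodeA nodeA nodeB nodeB > "$PBS_NODEFILE"   # hypothetical hosts
fi

nprocs=$(wc -l < "$PBS_NODEFILE")
nprocs=$((nprocs))               # strip any padding that wc adds
echo "processors allocated: $nprocs"
sort "$PBS_NODEFILE" | uniq -c   # processors per node
```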
Sample Serial Job on Lattice
• Lattice has a high-speed Quadrics interconnect
• If your job is serial, it does not take advantage of the Quadrics interconnect
• Glacier may be an alternative
• Having said that, many serial jobs are run on Lattice
Running Serial Jobs on Lattice
On Lattice, we tell TORQUE to run the job described in the script file serialhello.pbs.
A minimal TORQUE script to run a serial job on Lattice.
Standard output and error output are dropped into the directory we submitted from.
Sample Parallel Job on Cortex
Cortex is a machine with IBM Power5 SMP processors
Running AIX
Not for serial work
A good home for large parallel applications needing shared memory and/or fast interconnection
Good for large memory jobs
More info on Cortex submissions at:
http://www.westgrid.ca/support/cortex
http://www.westgrid.ca/support/programming/cortex.php
Running Parallel Jobs on Cortex
On Cortex, we tell TORQUE to run the job described in the script file mpihello.pbs.
The script describes how we want Cortex to run the parallel program mpihello.
The standard output file is dropped into our working directory.
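
The original mpihello.pbs is not reproduced in this text. The sketch below shows only the general shape such a script might take. The job name, the walltime, and the use of -j oe (which merges standard error into standard output, one way to end up with a single output file) are assumptions. The launch line is a deliberate placeholder, because the Cortex-specific MPI command is not given in the source.

```shell
# Illustrative job script; not the script from the original slide.
cat > mpihello.pbs <<'EOF'
#!/bin/sh
#PBS -N mpihello
#PBS -j oe
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
# Placeholder: replace with the MPI launch command documented for Cortex.
echo "launch ./mpihello here"
EOF

# Check the script for shell syntax errors before submitting it.
sh -n mpihello.pbs && echo "mpihello.pbs: syntax ok"
```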
Sample Parallel Job on Nexus
• Nexus is a collection of SGI SMP machines
• Several sizes, serviced by different queues
• Test on smaller machines, do the heavy lifting on large ones
• A good home for parallel jobs with intense communication requirements and/or large memory needs
• More information at:
  http://www.ualberta.ca/AICT/RESEARCH/PBS/index.westgrid.html
Running OpenMP Jobs on Nexus
You can try trivial OpenMP jobs from the command line. This job ran interactively on the head node.
You should not use more than 2 processors for interactive jobs.
To run jobs requiring real processing, you must submit them to TORQUE.
For Nexus, match ncpus with OMP_NUM_THREADS.
In this case we ask for 8 CPUs on the Helios machine (8-32 CPUs).
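
Keeping ncpus and OMP_NUM_THREADS matched is easier when one value drives both. This is a sketch under stated assumptions: the program name omphello and the queue name helios are guesses made for illustration, so check the Nexus pages for the real queue names.

```shell
ncpus=8   # one value drives both the CPU request and the thread count

# Unquoted heredoc: ${ncpus} is filled in now, while \$PBS_O_WORKDIR is
# left for TORQUE to set when the job runs.
cat > omphello.pbs <<EOF
#!/bin/sh
#PBS -l ncpus=${ncpus}
# Queue name is an assumption; use the queue that serves the machine you want.
#PBS -q helios
cd "\$PBS_O_WORKDIR"
OMP_NUM_THREADS=${ncpus}
export OMP_NUM_THREADS
./omphello
EOF

cat omphello.pbs   # review the generated script before submitting
```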
Sample Serial Job on Robson
• Robson is a new 56-processor Power5 system
• 64-bit Linux
• Good for serial work; may be suitable for some parallel processing
• Message passing through MPI
• More info at:
  http://www.westgrid.ca/support/robson
Running Serial Jobs on Robson
This is a minimal serial job submission script for Robson. It runs the executable “hello”.
A more elaborate script example is available at:
http://www.westgrid.ca/support/robson
Robson also runs MPI parallel jobs, as described on the above web page.
TORQUE drops the error output (zero-length in this case) and the standard output into the directory we submitted from.
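
A quick way to check whether a finished job wrote anything to its error file is sketched below. The helper name check_errors is hypothetical. The file pattern in the usage comment assumes TORQUE's default naming, where the error file is the job name followed by .e and the job number.

```shell
# Report whether each error file left by TORQUE is empty.
check_errors() {
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "$f: not found"
        elif [ -s "$f" ]; then
            echo "$f: has error output"
        else
            echo "$f: empty"
        fi
    done
}

# Typical use after a job finishes, from the submission directory:
#   check_errors hello.pbs.e*
```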
Shortening the HPC Cycle
• Try your jobs at different sites
• Test your process on small jobs
• Give realistic walltimes and memory requirements
• Apply for a larger Resource Allocation
  • http://www.westgrid.ca/manage_rac.html
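
One way to arrive at a realistic walltime is to time a small test run and pad the result. The helper below is a sketch: the name walltime_for, the 3000-second measurement, and the 20 percent margin are all made up for illustration and are not WestGrid recommendations.

```shell
# Turn a measured run time in seconds into an HH:MM:SS walltime request,
# padded by a safety margin given in percent.
walltime_for() {
    secs=$(( $1 * (100 + $2) / 100 ))
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

walltime_for 3000 20   # prints 01:00:00
```

The result can be used directly in a request such as -l walltime=01:00:00.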
Summary
• HPC jobs have differing requirements
• WestGrid provides an increasing variety of tools
• Use the system that is best for your job
• Start off simple and small
• Find out how well your job scales
• Getting help
  • Because of implementation differences, “man qsub” might not be your best source of help
  • Support pages as listed throughout this presentation
  • Email support@westgrid.ca