HPC101 - Advanced Research Computing at UM (ARC)

High Performance Computing Workshop
HPC 101
Dr. Charles J Antonelli
LSAIT ARS
February, 2014
Credits
Contributors:
Brock Palen (CAEN HPC)
Jeremy Hallum (MSIS)
Tony Markel (MSIS)
Bennet Fauber (CAEN HPC)
Mark Montague (LSAIT ARS)
Nancy Herlocher (LSAIT ARS)
LSAIT ARS
CAEN HPC
Roadmap
High Performance Computing
Flux Architecture
Flux Mechanics
Flux Batch Operations
Introduction to Scheduling
High Performance Computing
Cluster HPC
A computing cluster is a number of computing nodes connected together via special hardware and software that together can solve large problems.
A cluster is much less expensive than a single supercomputer (e.g., a mainframe).
Using clusters effectively requires support in scientific software applications (e.g., Matlab's Parallel Computing Toolbox, or R's snow package), or custom code.
Programming Models
Two basic parallel programming models:

Message-passing
The application consists of several processes running on different nodes, communicating with each other over the network.
Used when the data are too large to fit on a single node, and simple synchronization is adequate.
“Coarse parallelism”
Implemented using MPI (Message Passing Interface) libraries.

Multi-threaded
The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives.
Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable.
“Fine-grained parallelism” or “shared-memory parallelism”
Implemented using OpenMP (Open Multi-Processing) compilers and libraries.

Both models can be combined in a single application (hybrid parallelism); see the sketch below.
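A rough command-line illustration of the two models (program names are hypothetical; actual compile commands appear later in this deck):

# Message-passing: several processes, each with private memory,
# communicating via MPI; the ranks may span multiple nodes
mpirun -np 16 ./mpi_prog

# Multi-threaded: one process whose OpenMP threads share memory,
# so all 16 threads must fit on a single node
export OMP_NUM_THREADS=16
./omp_prog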
Amdahl’s Law
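(The slide’s graphic is not preserved in this text version. For reference, Amdahl’s law: if a fraction P of a program’s work can be parallelized across N cores, the overall speedup is bounded by S(N) = 1 / ((1 − P) + P/N). As N grows, S(N) approaches 1/(1 − P), so the serial fraction ultimately limits the speedup.)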
Flux Architecture
Flux
Flux is a university-wide shared computational discovery / high-performance computing service.
Provided by Advanced Research Computing at U-M
Operated by CAEN HPC
Procurement, licensing, billing by U-M ITS
Interdisciplinary since 2010
http://arc.research.umich.edu/resources-services/flux/
The Flux cluster
…
A Flux node
12-16 Intel cores
48-64 GB RAM
Local disk
Ethernet
InfiniBand
A Large Memory Flux node
32-40 Intel cores
1 TB RAM
Local disk
Ethernet
InfiniBand
Coming soon: A Flux GPU node
16 Intel cores
64 GB RAM
8 GPUs
Each GPU contains 2,688 GPU cores
Local disk
Flux software
Licensed and open software:
Abaqus, BLAST, BWA, bowtie, ANSYS, Java, Mason,
Mathematica, Matlab, R, RSEM, STATA SE, …
See http://cac.engin.umich.edu/resources
C, C++, Fortran compilers:
Intel (default), PGI, GNU toolchains
You can choose software using the module
command
Flux network
All Flux nodes are interconnected via InfiniBand and a campus-wide private Ethernet network
The Flux login nodes are also connected to the campus
backbone network
The Flux data transfer node is connected over a 10 Gbps
connection to the campus backbone network
This means
The Flux login nodes can access the Internet
The Flux compute nodes cannot
If InfiniBand is not available for a compute node, code on that node will fall back to Ethernet communications
Flux data
Lustre filesystem mounted on /scratch on all login,
compute, and transfer nodes
640 TB of short-term storage for batch jobs
Large, fast, short-term
NFS filesystems mounted on /home and /home2 on
all nodes
80 GB of storage per user for development & testing
Small, slow, long-term
Flux data
Flux does not provide large, long-term storage
Alternatives:
Value Storage (NFS)
$20.84 / TB / month (replicated, no backups)
$10.42 / TB / month (non-replicated, no backups)
LSA Large Scale Research Storage
2 TB free to researchers (replicated, no backups)
Faculty members, lecturers, postdocs, GSI/GSRA
Additional storage $30 / TB / year (replicated, no backups)
Departmental server
CAEN can mount your storage on the login nodes
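For example, storing 10 TB on replicated Value Storage would run about $2,500 per year ($20.84/TB/month × 10 TB × 12 months = $2,500.80).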
Copying data
Three ways to copy data to/from Flux
From Linux or Mac OS X, use scp:
scp localfile login@flux-xfer.engin.umich.edu:remotefile
scp login@flux-login.engin.umich.edu:remotefile localfile
scp -r localdir login@flux-xfer.engin.umich.edu:remotedir
From Windows, use WinSCP
U-M Blue Disc
http://www.itcs.umich.edu/bluedisc/
Use Globus Connect
Globus Connect
Features
High-speed data transfer, much faster than SCP or SFTP
Reliable & persistent
Minimal client software: Mac OS X, Linux, Windows
GridFTP Endpoints
Gateways through which data flow
Exist for XSEDE, OSG, …
UMich: umich#flux, umich#nyx
Add your own client endpoint!
Add your own server endpoint: contact flux-support@umich.edu
More information
http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp
Flux Mechanics
Using Flux
Three basic requirements to use Flux:
1. A Flux account
2. A Flux allocation
3. An MToken (or a Software Token)
Using Flux
1. A Flux account
Allows login to the Flux login nodes
Develop, compile, and test code
Available to members of U-M community, free
Get an account by visiting
https://www.engin.umich.edu/form/cacaccountapplication
Using Flux
2. A Flux allocation
Allows you to run jobs on the compute nodes
Some units cost-share Flux rates:
Regular Flux: $11.72/core/month
LSA, Engineering, Medical School: $6.60/core/month
Large Memory Flux: $23.82/core/month
LSA, Engineering, Medical School: $13.30/core/month
GPU Flux: $107.10 per 2 CPU cores and 1 GPU per month
LSA, Engineering, Medical School: $60 per 2 CPU cores and 1 GPU per month
Flux Operating Environment: $113.25/node/month
LSA, Engineering, Medical School: $63.50/node/month
Flux pricing at http://arc.research.umich.edu/flux/hardware-services/
Rackham grants are available for graduate students
Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/
To inquire about Flux allocations please email flux-support@umich.edu
Using Flux
3. An MToken (or a Software Token)
Required for access to the login nodes
Improves cluster security by requiring a second means of
proving your identity
You can use either an MToken or an application for your
mobile device (called a Software Token) for this
Information on obtaining and using these tokens at
http://cac.engin.umich.edu/resources/login-nodes/tfa
Logging in to Flux
ssh flux-login.engin.umich.edu
MToken (or Software Token) required
You will be randomly connected to a Flux login node
Currently flux-login1 or flux-login2
Firewalls restrict access to flux-login.
To connect successfully, either
Physically connect your ssh client platform to the U-M
campus wired or MWireless network, or
Use VPN software on your client platform, or
Use ssh to login to an ITS login node (login.itd.umich.edu),
and ssh to flux-login from there
Modules
The module command allows you to specify what versions of software
you want to use
module list          -- Show loaded modules
module load name     -- Load module name for use
module avail         -- Show all available modules
module avail name    -- Show versions of module name*
module unload name   -- Unload module name
module               -- List all options
Enter these commands at any time during your session
A configuration file allows default module commands to be
executed at login
Put module commands in file ~/privatemodules/default
Don’t put module commands in your .bashrc / .bash_profile
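A minimal sketch of such a file (the module names are just examples):

# ~/privatemodules/default -- module commands run at each login
module load R
module load matlab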
Flux environment
The Flux login nodes have the standard
GNU/Linux toolkit:
make, autoconf, awk, sed, perl,
python, java, emacs, vi, nano, …
Watch out for source code or data files
written on non-Linux systems
Use these tools to analyze and convert source
files to Linux format
file
dos2unix
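For example (file name hypothetical; output abridged):

$ file results.csv
results.csv: ASCII text, with CRLF line terminators
$ dos2unix results.csv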
Lab 1
Task: Invoke R interactively on the login node
module load R
module list
R
q()
Please run only very small computations on the Flux
login nodes, e.g., for testing
Lab 2
Task: Run R in batch mode
module load R
Copy sample code to your login directory
cd
cp ~cja/hpc-sample-code.tar.gz .
tar -zxvf hpc-sample-code.tar.gz
cd ./hpc-sample-code
Examine Rbatch.pbs and Rbatch.R
Edit Rbatch.pbs with your favorite Linux editor
Change #PBS -M email address to your own
Lab 2
Task: Run R in batch mode
Submit your job to Flux
qsub Rbatch.pbs
Watch the progress of your job
qstat -u uniqname
where uniqname is your own uniqname
When complete, look at the job’s output
less Rbatch.out
Copy your results to your local workstation (change uniqname to your own uniqname)
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rbatch.out Rbatch.out
Lab 3
Task: Use the multicore package
The multicore package allows you to use multiple cores
on the same node
module load R
cd ~/hpc-sample-code
Examine Rmulti.pbs and Rmulti.R
Edit Rmulti.pbs with your favorite Linux editor
Change #PBS -M email address to your own
Lab 3
Task: Use the multicore package
Submit your job to Flux
qsub Rmulti.pbs
Watch the progress of your job
qstat -u uniqname
where uniqname is your own uniqname
When complete, look at the job’s output
less Rmulti.out
Copy your results to your local workstation (change uniqname to your own uniqname)
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rmulti.out Rmulti.out
Compiling Code
Assuming default module settings
Use mpicc/mpiCC/mpif90 for MPI code
Use icc/icpc/ifort with -openmp for OpenMP code
Serial code, Fortran 90:
ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90
Serial code, C:
icc -O3 -ipo -no-prec-div -xHost -o prog prog.c
MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c
mpirun -np 2 ./prog
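A corresponding OpenMP build and run might look like this (a sketch; the source file name is hypothetical):

icc -O3 -openmp -o omphello helloworld_omp.c
export OMP_NUM_THREADS=4
./omphello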
Lab 4
Task: compile and execute simple programs on the Flux login node
Copy sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code
Examine, compile & execute helloworld.f90:
ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
./f90hello
Examine, compile & execute helloworld.c:
icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
./chello
Examine, compile & execute MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
mpirun -np 2 ./c_ex01
Makefiles
The make command automates your code compilation process
Uses a makefile to specify dependencies between source and object files
The sample directory contains a sample makefile

make c_ex01   To compile c_ex01
make          To compile all programs in the directory
make clean    To remove all compiled programs
make -j8      To make all the programs using 8 compiles in parallel
Flux Batch Operations
Portable Batch System
All production runs are run on the compute nodes
using the Portable Batch System (PBS)
PBS manages all aspects of cluster job execution
except job scheduling
Flux uses the Torque implementation of PBS
Flux uses the Moab scheduler for job scheduling
Torque and Moab work together to control access to
the compute nodes
PBS puts jobs into queues
Flux has a single queue, named flux
Cluster workflow
You create a batch script and submit it to PBS
PBS schedules your job, and it enters the flux queue
When its turn arrives, your job will execute the batch script
Your script has access to any applications or data stored on
the Flux cluster
When your job completes, anything it sent to standard output and error is saved and returned to you
You can check on the status of your job at any time, or
delete it if it’s not doing what you want
A short time after your job completes, it disappears
Basic batch commands
Once you have a script, submit it:
qsub scriptfile
$ qsub singlenode.pbs
6023521.nyx.engin.umich.edu      (the output is your jobid)
You can check on the job status:
qstat -u user

$ qstat -u cja

nyx.engin.umich.edu:
                                                             Req'd  Req'd   Elap
Job ID           Username Queue    Jobname  SessID NDS TSK Memory  Time  S  Time
---------------- -------- -------- -------- ------ --- --- ------ ----- - -----
6023521.nyx.engi cja      flux     hpc101i      -1   1  --         00:05 Q     -
To delete your job
qdel jobid
$ qdel 6023521
$
Loosely-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=12,pmem=1gb,walltime=01:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
#Your Code Goes Below:
cd $PBS_O_WORKDIR
mpirun ./c_ex01
Tightly-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l nodes=1:ppn=12,mem=47gb,walltime=02:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
#Your Code Goes Below:
cd $PBS_O_WORKDIR
matlab -nodisplay -r script
Lab 5
Task: Run an MPI job on 8 cores
Compile c_ex05
cd ~/cac-intro-code
make c_ex05
Edit file run with your favorite Linux editor
Change #PBS -M address to your own
I don’t want Brock to get your email!
Change #PBS -A allocation to FluxTraining_flux, or
to your own allocation, if desired
Change #PBS -l allocation to flux
Submit your job
qsub run
PBS attributes
As always, man qsub is your friend
-N : sets the job name, can’t start with a number
-V : copy shell environment to compute node
-A youralloc_flux: sets the allocation you are using
-l qos=flux: sets the quality of service parameter
-q flux: sets the queue you are submitting to
-l : requests resources, like number of cores or nodes
-M : whom to email, can be multiple addresses
-m : when to email: a=job abort, b=job begin, e=job end
-j oe: join STDOUT and STDERR to a common file
-I : allow interactive use
-X : allow X GUI use
PBS resources (1)
A resource (-l) can specify:
Request wallclock (that is, running) time
-l walltime=HH:MM:SS
Request C MB of memory per core
-l pmem=Cmb
Request T MB of memory for entire job
-l mem=Tmb
Request M cores on arbitrary node(s)
-l procs=M
Request a token to use licensed software
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
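Resource requests can be combined in a single -l option; for example (values illustrative):

qsub -l procs=8,pmem=2gb,walltime=04:00:00 myjob.pbs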
PBS resources (2)
A resource (-l) can specify:
For multithreaded code:
Request M nodes with at least N cores per node
-l nodes=M:ppn=N
Request M cores with exactly N cores per node (note the difference from ppn in both syntax and semantics!)
-l nodes=M,tpn=N
(you’ll only use this for specific algorithms)
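For example, a 24-rank MPI job spread across two 12-core nodes might request (a sketch):

#PBS -l nodes=2:ppn=12,pmem=1gb,walltime=01:00:00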
Interactive jobs
You can submit jobs interactively:
qsub -I -X -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux
This queues a job as usual
Your terminal session will be blocked until the job runs
When your job runs, you'll get an interactive shell on one of your nodes
Invoked commands will have access to all of your nodes
When you exit the shell your job is deleted
Interactive jobs allow you to
Develop and test on cluster node(s)
Execute GUI tools on a cluster node
Utilize a parallel debugger interactively
Lab 6
Task: Run an interactive job
Enter this command (all on one line):
qsub -I -V -l procs=1 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
When your job starts, you’ll get an interactive shell
Copy and paste the batch commands from the “run” file, one
at a time, into this shell
Experiment with other commands
After thirty minutes, your interactive shell will be killed
Lab 7
Task: Run Matlab interactively
module load matlab
Start an interactive PBS session
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
Run Matlab in the interactive PBS session
matlab -nodisplay
Introduction to Scheduling
The Scheduler (1/3)
Flux scheduling policies:
The job’s queue determines the set of nodes you run on
The job’s account and qos determine the allocation to be
charged
If you specify an inactive allocation, your job will never run
The job’s resource requirements help determine when the
job becomes eligible to run
If you ask for unavailable resources, your job will wait until they
become free
There is no pre-emption
The Scheduler (2/3)
Flux scheduling policies:
If there is competition for resources among eligible
jobs in the allocation or in the cluster, two things help
determine when you run:
How long you have waited for the resource
How much of the resource you have used so far
This is called “fairshare”
The scheduler will reserve nodes for a job with
sufficient priority
This is intended to prevent starving jobs with large
resource requirements
The Scheduler (3/3)
Flux scheduling policies:
If there is room for shorter jobs in the gaps of the
schedule, the scheduler will fit smaller jobs in those
gaps
This is called “backfill”
[Diagram: jobs plotted on a cores-versus-time grid, with short jobs backfilled into gaps in the schedule]
Gaining insight
There are several commands you can run to get some insight into the scheduler’s actions:
freenodes : shows the number of free nodes and cores
currently available
mdiag -a youralloc_name : shows resources defined
for your allocation and who can run against it
showq -w acct=yourallocname: shows jobs using your
allocation (running/idle/blocked)
checkjob jobid : Can show why your job might not be
starting
showstart -e all jobid : Gives you a coarse estimate of
job start time; use the smallest value returned
More advanced scheduling
Job Arrays
Dependent Scheduling
Job Arrays
• Submit copies of identical jobs
• Invoked via qsub -t:
qsub -t array-spec pbsbatch.txt
where array-spec can be
m-n
a,b,c
m-n%slotlimit
e.g. qsub -t 1-50%10
Fifty jobs, numbered 1 through 50, only ten can run simultaneously
• $PBS_ARRAYID records the array identifier (see the sketch below)
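A sketch of a batch-script body using the identifier (program and file names hypothetical):

cd $PBS_O_WORKDIR
./prog input.$PBS_ARRAYID > output.$PBS_ARRAYID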
Dependent scheduling
• Submit jobs whose execution scheduling depends on other jobs
• Invoked via qsub -W:
qsub -W depend=type:jobid[:jobid]…
where type can be
after        Schedule after jobids have started
afterok      Schedule after jobids have finished, only if no errors
afternotok   Schedule after jobids have finished, only if errors
afterany     Schedule after jobids have finished, regardless of status
before, beforeok, beforenotok, beforeany (see next slide)
Dependent scheduling
where type can be (cont’d)
before       When this job has started, jobids will be scheduled
beforeok     After this job completes without errors, jobids will be scheduled
beforenotok  After this job completes with errors, jobids will be scheduled
beforeany    After this job completes, regardless of status, jobids will be scheduled
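A usage sketch (script names hypothetical; the jobid echoes the qsub example earlier in this deck):

$ qsub first.pbs
6023521.nyx.engin.umich.edu
$ qsub -W depend=afterok:6023521 second.pbs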
Some Flux Resources
http://arc.research.umich.edu/resources-services/flux/
U-M Advanced Research Computing Flux pages
http://cac.engin.umich.edu/
CAEN HPC Flux pages
http://www.youtube.com/user/UMCoECAC
CAEN HPC YouTube channel
For assistance: flux-support@umich.edu
Read by a team of people including unit support staff
Cannot help with programming questions, but can help with
operational Flux and basic usage questions
Summary
The Flux cluster is just a collection of similar Linux
machines connected together to run your code, much
faster than your desktop can
Command-line scripts are queued by a batch system and
executed when resources become available
Some important commands are
qsub
qstat -u username
qdel jobid
checkjob
Develop and test, then submit your jobs in bulk and let
the scheduler optimize their execution
Any Questions?
Charles J. Antonelli
LSAIT Advocacy and Research Support
cja@umich.edu
http://www.umich.edu/~cja
734 763 0607