HPC105 - Advanced Research Computing at UM (ARC)

Flux for PBS Users
HPC 105
Dr. Charles J Antonelli
LSAIT ARS
August, 2013
Flux
Flux is a university-wide shared computational discovery / high-performance
computing service.
Interdisciplinary
Provided by Advanced Research Computing at U-M (ARC)
Operated by CAEN HPC
Hardware procurement, software licensing, billing support by U-M ITS
Used across campus
Collaborative since 2010
Advanced Research Computing at U-M (ARC)
College of Engineering’s IT Group (CAEN)
Information and Technology Services
Medical School
College of Literature, Science, and the Arts
School of Information
http://arc.research.umich.edu/resources-services/flux/
The Flux cluster
…
Flux node
48 GB RAM
12 Intel cores
Local disk
Ethernet
InfiniBand
Flux Large Memory node
1 TB RAM
40 Intel cores
Local disk
Ethernet
InfiniBand
Flux hardware
8,016 Intel cores
632 Flux nodes
200 Intel Large Memory cores
5 Flux Large Memory nodes
48/64 GB RAM/node
1 TB RAM/Large Memory node
4 GB RAM/core (allocated), 25 GB RAM/Large Memory core
4X InfiniBand network (interconnects all nodes)
40 Gbps, <2 us latency
Latency an order of magnitude less than Ethernet
Lustre Filesystem
Scalable, high-performance, open
Supports MPI-IO for MPI jobs
Mounted on all login and compute nodes
Flux software
Licensed software
http://cac.engin.umich.edu/resources/software/flux-software
and others
Compilers & Libraries:
Intel, PGI, GNU
OpenMP
OpenMPI
Using Flux
Three basic requirements to use Flux:
1. A Flux account
2. An MToken (or a Software Token)
3. A Flux allocation
Using Flux
1. A Flux account
Allows login to the Flux login nodes
Develop, compile, and test code
Available to members of U-M community, free
Get an account by visiting
https://www.engin.umich.edu/form/cacaccountapplication
Flux Account Policies
To qualify for a Flux account:
You must have an active institutional role
On the Ann Arbor campus
Not a Retiree or Alumni role
Your uniqname must have a strong identity type
Not a friend account
You must be able to receive email sent to
uniqname@umich.edu
You must have run a job in the last 13 months
http://cac.engin.umich.edu/resources/systems/user-accounts
Using Flux
2. An MToken (or a Software Token)
Required for access to the login nodes
Improves cluster security by requiring a second means of
proving your identity
You can use either an MToken or an application for your
mobile device (called a Software Token) for this
Information on obtaining and using these tokens at
http://cac.engin.umich.edu/resources/login-nodes/tfa
Using Flux
3. A Flux allocation
Allows you to run jobs on the compute nodes
Current rates: (through June 30, 2016)
$18 per core-month for Standard Flux
$24.35 per core-month for Large Memory Flux
$8 cost-share per core-month for LSA, Engineering, and
Medical School
Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/
To inquire about Flux allocations, please email flux-support@umich.edu
Flux Allocations
To request an allocation, send email to flux-support@umich.edu with
the type of allocation desired
Regular or Large-Memory
the number of cores needed
the start date and number of months for the allocation
the shortcode for the funding source
the list of people who should have access to the allocation
the list of people who can change the user list and
augment or end the allocation
http://arc.research.umich.edu/resources-services/flux/managing-a-flux-project/
Flux Allocations
An allocation specifies resources that are consumed
by running jobs
Explicit core count
Implicit memory usage (4 or 25 GB per core)
When any resource is fully in use, new jobs are blocked
An allocation may be ended early
On the monthly anniversary
You may have multiple active allocations
Jobs draw resources from all active allocations
lsa_flux Allocation
LSA funds a shared allocation named lsa_flux
Usable by anyone in the College
60 cores
For testing, experimentation, exploration
Not for production runs
Each user limited to 30 concurrent jobs
https://sites.google.com/a/umich.edu/fluxsupport/support-for-users/lsa_flux
Monitoring Allocations
Visit https://mreports.umich.edu/mreports/pages/Flux.aspx
Select your allocation from the list at upper left
You’ll see all allocations you can submit jobs against
Four sets of outputs
Allocation details (start & end date, cores, shortcode)
Financial overview (cores allocated vs. used, by month)
Usage summary table (core-months by user and month)
Drill down for individual job run data
Usage charts (by user)
Details & screenshots: http://arc.research.umich.edu/resources-services/flux/check-my-flux-allocation/
Storing data on Flux
Lustre filesystem mounted on /scratch on all login, compute,
and transfer nodes
640 TB of short-term storage for batch jobs
Pathname depends on your allocation and uniqname
e.g., /scratch/lsa_flux/cja
Can share through UNIX groups
Large, fast, short-term
Data deleted 60 days after allocation expires
http://cac.engin.umich.edu/resources/storage/flux-high-performance-storage-scratch
NFS filesystems mounted on /home and /home2 on all nodes
80 GB of storage per user for development & testing
Small, slow, long-term
Storing data on Flux
Flux does not provide large, long-term storage
Alternatives:
LSA Research Storage
ITS Value Storage
Departmental server
CAEN HPC can mount your storage on the login nodes
Issue the df -kh command on a login node to see what
other groups have mounted
Storing data on Flux
LSA Research Storage
2 TB of secure, replicated data storage
Available to each LSA faculty member at no cost
Additional storage available at $30/TB/yr
Turn in existing storage hardware for additional storage
Request by visiting
https://sharepoint.lsait.lsa.umich.edu/Lists/Research%20Storage%20Space/NewForm.aspx?RootFolder=
Authenticate with Kerberos login and password
Select NFS as the method for connecting to your storage
Copying data to Flux
Using the transfer host:
rsync -avz /your/cluster1/directory flux-xfer.engin.umich.edu:newdirname
rsync -avz /your/cluster1/directory flux-xfer.engin.umich.edu:/scratch/youralloc/youruniqname
Or use scp, sftp, WinSCP, Cyberduck, FileZilla
http://cac.engin.umich.edu/resources/login-nodes/transfer-hosts
Globus Online
Features
High-speed data transfer, much faster than SCP or SFTP
Reliable & persistent
Minimal client software: Mac OS X, Linux, Windows
GridFTP Endpoints
Gateways through which data flow
Exist for XSEDE, OSG, …
UMich: umich#flux, umich#nyx
Add your own server endpoint: contact flux-support@umich.edu
Add your own client endpoint!
More information
http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp
Connecting to Flux
ssh flux-login.engin.umich.edu
Login with token code, uniqname, and Kerberos password
You will be randomly connected to a Flux login node
Currently flux-login1 or flux-login2
Do not run compute- or I/O-intensive jobs here
Processes killed automatically after 30 minutes
Firewalls restrict access to flux-login.
To connect successfully, either
Physically connect your ssh client platform to the U-M campus wired
or MWireless network, or
Use VPN software on your client platform, or
Use ssh to log in to an ITS login node (login.itd.umich.edu), and ssh
to flux-login from there
Lab 1
Task: Use the multicore package
The multicore package allows you to use multiple cores on the same
node
module load R
Copy sample code to your login directory
cd
cp ~cja/hpc-sample-code.tar.gz .
tar -zxvf hpc-sample-code.tar.gz
cd ./hpc-sample-code
Examine Rmulti.pbs and Rmulti.R
Edit Rmulti.pbs with your favorite Linux editor
Change #PBS -M email address to your own
Lab 1
Task: Use the multicore package
Submit your job to Flux
qsub Rmulti.pbs
Watch the progress of your job
qstat -u uniqname
where uniqname is your own uniqname
When complete, look at the job’s output
less Rmulti.out
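
For orientation, a PBS script for a single-node multicore R job generally follows the pattern sketched below. This is an illustrative sketch, not the contents of the actual Rmulti.pbs; the job name, mail settings, allocation, and resource values are placeholder assumptions.

#!/bin/sh
# PBS preamble (illustrative values only)
#PBS -N Rmulti
#PBS -M youruniqname@umich.edu
#PBS -m abe
#PBS -A FluxTraining_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l nodes=1:ppn=4,walltime=1:00:00,pmem=1gb
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# The multicore package uses the cores of this single node
module load R
R CMD BATCH --no-save Rmulti.R Rmulti.out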
Lab 2
Task: Run an MPI job on 8 cores
Compile c_ex05
cd ~/cac-intro-code
make c_ex05
Edit file run with your favorite Linux editor
Change #PBS -M address to your own
I don’t want Brock to get your email!
Change #PBS -A allocation to FluxTraining_flux, or
to your own allocation, if desired
Change #PBS -l allocation to flux
Submit your job
qsub run
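
For reference, the preamble of a script like run might resemble the sketch below. This is not the actual file; the core count and walltime shown are assumptions.

#PBS -M youruniqname@umich.edu
#PBS -A FluxTraining_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=8,walltime=15:00
# Run from the submission directory and launch the MPI program on 8 cores
cd $PBS_O_WORKDIR
mpirun -np 8 ./c_ex05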
PBS resources (1)
A resource (-l) can specify:
Request wallclock (that is, running) time
-l walltime=HH:MM:SS
Request C MB of memory per core
-l pmem=Cmb
Request T MB of memory for entire job
-l mem=Tmb
Request M cores on arbitrary node(s)
-l procs=M
Request a token to use licensed software
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
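
As a sketch of how these combine in practice (all values are arbitrary placeholders, and myjob.pbs is a hypothetical script name), the requests can appear as #PBS lines in a batch script:

#PBS -l walltime=24:00:00
#PBS -l pmem=2gb
#PBS -l procs=12
#PBS -l gres=matlab

or equivalently on the command line:

qsub -l procs=12,walltime=24:00:00,pmem=2gb -l gres=matlab myjob.pbs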
PBS resources (2)
A resource (-l) can specify:
For multithreaded code:
Request M nodes with at least N cores per node
-l nodes=M:ppn=N
Request M cores with exactly N cores per node (note the difference
from ppn in both syntax and semantics!)
-l nodes=M,tpn=N
(you’ll only use this for specific algorithms)
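
For example (a sketch with arbitrary values; my_openmp_program is a hypothetical executable), a multithreaded job running one 12-thread process on a single node might use:

#PBS -l nodes=1:ppn=12,walltime=2:00:00
cd $PBS_O_WORKDIR
# Match the thread count to the cores requested per node
export OMP_NUM_THREADS=12
./my_openmp_program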
Interactive jobs
You can submit jobs interactively:
qsub -I -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux
This queues a job as usual
Your terminal session will be blocked until the job runs
When it runs, you will be connected to one of your nodes
Invoked serial commands will run on that node
Invoked parallel commands (e.g., via mpirun) will run on all of your nodes
When you exit the terminal session your job is deleted
Interactive jobs allow you to
Test your code on cluster node(s)
Execute GUI tools on a cluster node with output on your local platform’s X
server
Utilize a parallel debugger interactively
Lab 3
Task: compile and execute an MPI program on a compute node
Copy sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code
Start an interactive PBS session
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
On the compute node, compile & execute MPI parallel code:
cd $PBS_O_WORKDIR
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
mpirun -np 2 ./c_ex01
Lab 4
Task: Run Matlab interactively
module load matlab
Start an interactive PBS session
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
Run Matlab in the interactive PBS session
matlab -nodisplay
The Scheduler (1/3)
Flux scheduling policies:
The job’s queue determines the set of nodes you run on
flux, fluxm
The job’s account determines the allocation to be charged
If you specify an inactive allocation, your job will never run
The job’s resource requirements help determine when the
job becomes eligible to run
If you ask for unavailable resources, your job will wait until they
become free
There is no pre-emption
The Scheduler (2/3)
Flux scheduling policies:
If there is competition for resources among eligible
jobs in the allocation or in the cluster, two things help
determine when you run:
How long you have waited for the resource
How much of the resource you have used so far
This is called “fairshare”
The scheduler will reserve nodes for a job with
sufficient priority
This is intended to prevent starving jobs with large
resource requirements
The Scheduler (3/3)
Flux scheduling policies:
If there is room for shorter jobs in the gaps of the
schedule, the scheduler will fit smaller jobs in those
gaps
This is called “backfill”
[Backfill diagram: cores vs. time, with shorter jobs filling gaps in the schedule]
Job monitoring
There are several commands you can run to get some
insight into your jobs’ execution:
freenodes : shows the number of free nodes and cores
currently available
mdiag -a youralloc_name : shows resources defined
for your allocation and who can run against it
showq -w acct=yourallocname: shows jobs using your
allocation (running/idle/blocked)
checkjob jobid : Can show why your job might not be
starting
showstart -e all jobid : Gives you a coarse estimate of
job start time; use the smallest value returned
Job Arrays
• Submit copies of identical jobs
• Invoked via qsub -t:
qsub -t array-spec pbsbatch.txt
Where array-spec can be
m-n
a,b,c
m-n%slotlimit
e.g.
qsub -t 1-50%10
Fifty jobs, numbered 1 through 50;
only ten can run simultaneously
• $PBS_ARRAYID records array identifier
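
A minimal sketch of an array job script (myarray.pbs, myprogram, and the input.N file naming are assumptions for illustration):

#PBS -l procs=1,walltime=1:00:00
cd $PBS_O_WORKDIR
# Each array element processes a different input file
./myprogram input.$PBS_ARRAYID

Submitting fifty elements with at most ten running at once:

qsub -t 1-50%10 myarray.pbs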
Dependent scheduling
• Submit jobs whose execution scheduling depends on other
jobs
• Invoked via qsub -W:
qsub -W depend=type:jobid[:jobid]…
Where type can be
after: Schedule after jobids have started
afterok: Schedule after jobids have finished, only if no errors
afternotok: Schedule after jobids have finished, only if errors
afterany: Schedule after jobids have finished, regardless of status
Inverted semantics for before, beforeok, beforenotok, beforeany
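
For example (compute.pbs and postprocess.pbs are placeholder script names), to run a post-processing job only after a compute job finishes without errors:

# qsub prints the new job's identifier, so capture it
JOBID=`qsub compute.pbs`
# This job stays on hold until $JOBID completes with no errors
qsub -W depend=afterok:$JOBID postprocess.pbs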
Some Flux Resources
http://arc.research.umich.edu/resources-services/flux/
U-M Advanced Research Computing Flux pages
http://cac.engin.umich.edu/
CAEN HPC Flux pages
http://www.youtube.com/user/UMCoECAC
CAEN HPC YouTube channel
For assistance: flux-support@umich.edu
Read by a team of people including unit support staff
Cannot help with programming questions, but can help with
operational Flux and basic usage questions
Any Questions?
Charles J. Antonelli
LSAIT Advocacy and Research Support
cja@umich.edu
http://www.umich.edu/~cja
734 763 0607