High Performance Computing Workshop
HPC 101
Dr. Charles J Antonelli
LSAIT ARS
February, 2014

Credits
  Contributors:
    Brock Palen (CAEN HPC)
    Jeremy Hallum (MSIS)
    Tony Markel (MSIS)
    Bennet Fauber (CAEN HPC)
    Mark Montague (LSAIT ARS)
    Nancy Herlocher (LSAIT ARS)
  LSAIT ARS
  CAEN HPC

cja 2014 2 2/14

Roadmap
  High Performance Computing
  Flux Architecture
  Flux Mechanics
  Flux Batch Operations
  Introduction to Scheduling

High Performance Computing

Cluster HPC
  A computing cluster is a number of computing nodes connected together via special hardware and software that together can solve large problems.
  A cluster is much less expensive than a single supercomputer (e.g., a mainframe).
  Using clusters effectively requires support in scientific software applications (e.g., Matlab's Parallel Toolbox, or R's Snow library), or custom code.

Programming Models
  Two basic parallel programming models
  Message-passing
    The application consists of several processes running on different nodes and communicating with each other over the network
    Used when the data are too large to fit on a single node, and simple synchronization is adequate
    “Coarse parallelism”
    Implemented using MPI (Message Passing Interface) libraries
  Multi-threaded
    The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives
    Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable
    “Fine-grained parallelism” or “shared-memory parallelism”
    Implemented using OpenMP (Open Multi-Processing) compilers and libraries
  Both

Amdahl’s Law

Flux Architecture

Flux
  Flux is a university-wide shared computational discovery / high-performance computing service.
  Provided by Advanced Research Computing at U-M
  Operated by CAEN HPC
  Procurement, licensing, billing by U-M ITS
  Interdisciplinary since 2010
  http://arc.research.umich.edu/resources-services/flux/

The Flux cluster …

A Flux node
  48-64 GB RAM
  12-16 Intel cores
  Local disk
  Ethernet
  InfiniBand

A Large Memory Flux node
  1 TB RAM
  32-40 Intel cores
  Local disk
  Ethernet
  InfiniBand

Coming soon: A Flux GPU node
  64 GB RAM
  8 GPUs
  16 Intel cores
  Local disk
  Each GPU contains 2,688 GPU cores

Flux software
  Licensed and open software:
    Abacus, BLAST, BWA, bowtie, ANSYS, Java, Mason, Mathematica, Matlab, R, RSEM, STATA SE, …
    See http://cac.engin.umich.edu/resources
  C, C++, Fortran compilers:
    Intel (default), PGI, GNU toolchains
  You can choose software using the module command

Flux network
  All Flux nodes are interconnected via InfiniBand and a campus-wide private Ethernet network
  The Flux login nodes are also connected to the campus backbone network
  The Flux data transfer node is connected over a 10 Gbps connection to the campus backbone network
  This means
    The Flux login nodes can access the Internet
    The Flux compute nodes cannot
  If InfiniBand is not available for a compute node, code on that node will fall back to Ethernet communications

Flux data
  Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes
    640 TB of short-term storage for batch jobs
    Large, fast, short-term
  NFS filesystems mounted on /home and /home2 on all nodes
    80 GB of storage per user for development & testing
    Small, slow, long-term

Flux data
  Flux does not provide large, long-term storage
  Alternatives:
    Value Storage (NFS)
      $20.84 / TB / month (replicated, no backups)
      $10.42 / TB / month (non-replicated, no backups)
    LSA Large Scale Research Storage
      2 TB free to researchers (replicated, no backups)
        Faculty members, lecturers, postdocs, GSI/GSRA
      Additional storage $30 /
TB / year (replicated, no backups)
    Departmental server
  CAEN can mount your storage on the login nodes

Copying data
  Three ways to copy data to/from Flux
  From Linux or Mac OS X, use scp:
    scp localfile login@flux-xfer.engin.umich.edu:remotefile
    scp login@flux-login.engin.umich.edu:remotefile localfile
    scp -r localdir login@flux-xfer.engin.umich.edu:remotedir
  From Windows, use WinSCP
    U-M Blue Disc http://www.itcs.umich.edu/bluedisc/
  Use Globus Connect

Globus Connect
  Features
    High-speed data transfer, much faster than SCP or SFTP
    Reliable & persistent
    Minimal client software: Mac OS X, Linux, Windows
  GridFTP Endpoints
    Gateways through which data flow
    Exist for XSEDE, OSG, …
    UMich: umich#flux, umich#nyx
    Add your own client endpoint!
    Add your own server endpoint: contact flux-support@umich.edu
  More information
    http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp

Flux Mechanics

Using Flux
  Three basic requirements to use Flux:
    1. A Flux account
    2. A Flux allocation
    3. An MToken (or a Software Token)

Using Flux
  1. A Flux account
    Allows login to the Flux login nodes
      Develop, compile, and test code
    Available to members of U-M community, free
    Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication

Using Flux
  2.
A Flux allocation
    Allows you to run jobs on the compute nodes
    Some units cost-share Flux rates
      Regular Flux: $11.72/core/month
        LSA, Engineering, Medical School $6.60/month
      Large Memory Flux: $23.82/core/month
        LSA, Engineering, Medical School $13.30/month
      GPU Flux: $107.10/2 CPU cores and 1 GPU/month
        LSA, Engineering, Medical School $60/month
      Flux Operating Environment: $113.25/node/month
        LSA, Engineering, Medical School $63.50/month
    Flux pricing at http://arc.research.umich.edu/flux/hardware-services/
    Rackham grants are available for graduate students
      Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/
    To inquire about Flux allocations please email flux-support@umich.edu

Using Flux
  3. An MToken (or a Software Token)
    Required for access to the login nodes
    Improves cluster security by requiring a second means of proving your identity
    You can use either an MToken or an application for your mobile device (called a Software Token) for this
    Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/login-nodes/tfa

Logging in to Flux
  ssh flux-login.engin.umich.edu
    MToken (or Software Token) required
  You will be randomly connected to a Flux login node
    Currently flux-login1 or flux-login2
  Firewalls restrict access to flux-login.
  To connect successfully, either
    Physically connect your ssh client platform to the U-M campus wired or MWireless network, or
    Use VPN software on your client platform, or
    Use ssh to log in to an ITS login node (login.itd.umich.edu), and ssh to flux-login from there

Modules
  The module command allows you to specify what versions of software you want to use
    module list          -- Show loaded modules
    module load name     -- Load module name for use
    module avail         -- Show all available modules
    module avail name    -- Show versions of module name*
    module unload name   -- Unload module name
    module               -- List all options
  Enter these commands at any time during your session
  A configuration file allows default module commands to be executed at login
    Put module commands in file ~/privatemodules/default
    Don’t put module commands in your .bashrc / .bash_profile

Flux environment
  The Flux login nodes have the standard GNU/Linux toolkit:
    make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, …
  Watch out for source code or data files written on non-Linux systems
    Use these tools to analyze and convert source files to Linux format
      file
      dos2unix

Lab 1
  Task: Invoke R interactively on the login node
    module load R
    module list
    R
    q()
  Please run only very small computations on the Flux login nodes, e.g., for testing

Lab 2
  Task: Run R in batch mode
    module load R
  Copy sample code to your login directory
    cd
    cp ~cja/hpc-sample-code.tar.gz .
    tar -zxvf hpc-sample-code.tar.gz
    cd ./hpc-sample-code
  Examine Rbatch.pbs and Rbatch.R
  Edit Rbatch.pbs with your favorite Linux editor
    Change #PBS -M email address to your own

Lab 2
  Task: Run R in batch mode
  Submit your job to Flux
    qsub Rbatch.pbs
  Watch the progress of your job
    qstat -u uniqname
    where uniqname is your own uniqname
  When complete, look at the job’s output
    less Rbatch.out
  Copy your results to your local workstation (change uniqname to your own uniqname)
    scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rbatch.out Rbatch.out

Lab 3
  Task: Use the multicore package
  The multicore package allows you to use multiple cores on the same node
    module load R
    cd ~/hpc-sample-code
  Examine Rmulti.pbs and Rmulti.R
  Edit Rmulti.pbs with your favorite Linux editor
    Change #PBS -M email address to your own

Lab 3
  Task: Use the multicore package
  Submit your job to Flux
    qsub Rmulti.pbs
  Watch the progress of your job
    qstat -u uniqname
    where uniqname is your own uniqname
  When complete, look at the job’s output
    less Rmulti.out
  Copy your results to your local workstation (change uniqname to your own uniqname)
    scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rmulti.out Rmulti.out

Compiling Code
  Assuming default module settings
    Use mpicc/mpiCC/mpif90 for MPI code
    Use icc/icpc/ifort with -openmp for OpenMP code
  Serial code, Fortran 90:
    ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90
  Serial code, C:
    icc -O3 -ipo -no-prec-div -xHost -o prog prog.c
  MPI parallel code:
    mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c
    mpirun -np 2 ./prog

Lab 4
  Task: compile and execute simple programs on the Flux login node
  Copy sample code to your login directory:
    cd
    cp ~brockp/cac-intro-code.tar.gz .
    tar -xvzf cac-intro-code.tar.gz
    cd ./cac-intro-code
  Examine, compile & execute helloworld.f90:
    ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
    ./f90hello
  Examine, compile & execute helloworld.c:
    icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
    ./chello
  Examine, compile & execute MPI parallel code:
    mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
    mpirun -np 2 ./c_ex01

Makefiles
  The make command automates your code compilation process
  Uses a makefile to specify dependencies between source and object files
  The sample directory contains a sample makefile
  To compile c_ex01:
    make c_ex01
  To compile all programs in the directory
    make
  To remove all compiled programs
    make clean
  To make all the programs using 8 compiles in parallel
    make -j8

Flux Batch Operations

Portable Batch System
  All production runs are run on the compute nodes using the Portable Batch System (PBS)
  PBS manages all aspects of cluster job execution except job scheduling
    Flux uses the Torque implementation of PBS
    Flux uses the Moab scheduler for job scheduling
    Torque and Moab work together to control access to the compute nodes
  PBS puts jobs into queues
    Flux has a single queue, named flux

Cluster workflow
  You create a batch script and submit it to PBS
  PBS schedules your job, and it enters the flux queue
  When its turn arrives, your job will execute the batch script
  Your script has access to any applications or data stored on the Flux cluster
  When your job completes, anything it sent to standard output and error is saved and returned to you
  You can check on the status of your job at any time, or delete it if it’s not doing what you want
  A short time after your job completes, it disappears

Basic batch commands
  Once you have a script, submit it:
    qsub scriptfile
      $ qsub singlenode.pbs
      6023521.nyx.engin.umich.edu    (this is the jobid)
  You can check on the job status:
    qstat -u user
      $ qstat -u cja
      nyx.engin.umich.edu: Req'd
Req'd Elap
      Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
      -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ---
      6023521.nyx.engi cja flux hpc101i -1 1 -- 00:05 Q -
  To delete your job
    qdel jobid
      $ qdel 6023521
      $

Loosely-coupled batch script
    #PBS -N yourjobname
    #PBS -V
    #PBS -A youralloc_flux
    #PBS -l qos=flux
    #PBS -q flux
    #PBS -l procs=12,pmem=1gb,walltime=01:00:00
    #PBS -M youremailaddress
    #PBS -m abe
    #PBS -j oe
    #Your Code Goes Below:
    cd $PBS_O_WORKDIR
    mpirun ./c_ex01

Tightly-coupled batch script
    #PBS -N yourjobname
    #PBS -V
    #PBS -A youralloc_flux
    #PBS -l qos=flux
    #PBS -q flux
    #PBS -l nodes=1:ppn=12,mem=47gb,walltime=02:00:00
    #PBS -M youremailaddress
    #PBS -m abe
    #PBS -j oe
    #Your Code Goes Below:
    cd $PBS_O_WORKDIR
    matlab -nodisplay -r script

Lab 5
  Task: Run an MPI job on 8 cores
  Compile c_ex05
    cd ~/cac-intro-code
    make c_ex05
  Edit file run with your favorite Linux editor
    Change #PBS -M address to your own
      I don’t want Brock to get your email!
    Change #PBS -A allocation to FluxTraining_flux, or to your own allocation, if desired
    Change #PBS -l qos to flux
  Submit your job
    qsub run

PBS attributes
  As always, man qsub is your friend
    -N : sets the job name, can’t start with a number
    -V : copy shell environment to compute node
    -A youralloc_flux : sets the allocation you are using
    -l qos=flux : sets the quality of service parameter
    -q flux : sets the queue you are submitting to
    -l : requests resources, like number of cores or nodes
    -M : whom to email, can be multiple addresses
    -m : when to email: a=job abort, b=job begin, e=job end
    -j oe : join STDOUT and STDERR to a common file
    -I : allow interactive use
    -X : allow X GUI use

PBS resources (1)
  A resource (-l) can specify:
    Request wallclock (that is, running) time
      -l walltime=HH:MM:SS
    Request C MB of memory per core
      -l pmem=Cmb
    Request T MB of memory for entire job
      -l mem=Tmb
    Request M cores on arbitrary node(s)
      -l procs=M
    Request a token to use licensed software
      -l gres=stata:1
      -l gres=matlab
      -l gres=matlab%Communication_toolbox

PBS resources (2)
  A resource (-l) can specify:
    For multithreaded code:
    Request M nodes with at least N cores per node
      -l nodes=M:ppn=N
    Request M cores with exactly N cores per node
      (note the difference vis a vis ppn syntax and semantics!)
      -l nodes=M,tpn=N
      (you’ll only use this for specific algorithms)

Interactive jobs
  You can submit jobs interactively:
    qsub -I -X -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux
  This queues a job as usual
    Your terminal session will be blocked until the job runs
    When your job runs, you'll get an interactive shell on one of your nodes
    Invoked commands will have access to all of your nodes
    When you exit the shell your job is deleted
  Interactive jobs allow you to
    Develop and test on cluster node(s)
    Execute GUI tools on a cluster node
    Utilize a parallel debugger interactively

Lab 6
  Task: Run an interactive job
  Enter this command (all on one line):
    qsub -I -V -l procs=1 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
  When your job starts, you’ll get an interactive shell
  Copy and paste the batch commands from the “run” file, one at a time, into this shell
  Experiment with other commands
  After thirty minutes, your interactive shell will be killed

Lab 7
  Task: Run Matlab interactively
    module load matlab
  Start an interactive PBS session
    qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
  Run Matlab in the interactive PBS session
    matlab -nodisplay

Introduction to Scheduling

The Scheduler (1/3)
  Flux scheduling policies:
    The job’s queue determines the set of nodes you run on
    The job’s account and qos determine the allocation to be charged
      If you specify an inactive allocation, your job will never run
    The job’s resource requirements help determine when the job becomes eligible to run
      If you ask for unavailable resources, your job will wait until they become free
    There is no pre-emption

The Scheduler (2/3)
  Flux scheduling policies:
    If there is competition for resources among eligible jobs in the allocation or in the cluster, two things help determine when you run:
      How long you have waited for the resource
      How much of the
resource you have used so far
      This is called “fairshare”
    The scheduler will reserve nodes for a job with sufficient priority
      This is intended to prevent starving jobs with large resource requirements

The Scheduler (3/3)
  Flux scheduling policies:
    If there is room for shorter jobs in the gaps of the schedule, the scheduler will fit smaller jobs in those gaps
      This is called “backfill”
  [Figure: schedule diagram; axes: Cores, Time]

Gaining insight
  There are several commands you can run to get some insight into the scheduler’s actions:
    freenodes : shows the number of free nodes and cores currently available
    mdiag -a youralloc_name : shows resources defined for your allocation and who can run against it
    showq -w acct=yourallocname : shows jobs using your allocation (running/idle/blocked)
    checkjob jobid : can show why your job might not be starting
    showstart -e all jobid : gives you a coarse estimate of job start time; use the smallest value returned

More advanced scheduling
  Job Arrays
  Dependent Scheduling

Job Arrays
  • Submit copies of identical jobs
  • Invoked via qsub -t:
      qsub -t array-spec pbsbatch.txt
    Where array-spec can be
      m-n
      a,b,c
      m-n%slotlimit
    e.g.
      qsub -t 1-50%10
      Fifty jobs, numbered 1 through 50, only ten can run simultaneously
  • $PBS_ARRAYID records array identifier

Dependent scheduling
  • Submit jobs whose execution scheduling depends on other jobs
  • Invoked via qsub -W:
      qsub -W depend=type:jobid[:jobid]…
    Where type can be
      after        Schedule after jobids have started
      afterok      Schedule after jobids have finished, only if no errors
      afternotok   Schedule after jobids have finished, only if errors
      afterany     Schedule after jobids have finished, regardless of status
      before, beforeok, beforenotok, beforeany

Dependent scheduling
    Where type can be (cont’d)
      before       When this job has started, jobids will be scheduled
      beforeok     After this job completes without errors, jobids will be scheduled
      beforenotok  After this job completes with errors, jobids will be scheduled
      beforeany    After this job completes, regardless of status, jobids will be scheduled

Some Flux Resources
  http://arc.research.umich.edu/resources-services/flux/
    U-M Advanced Research Computing Flux pages
  http://cac.engin.umich.edu/
    CAEN HPC Flux pages
  http://www.youtube.com/user/UMCoECAC
    CAEN HPC YouTube channel
  For assistance: flux-support@umich.edu
    Read by a team of people including unit support staff
    Cannot help with programming questions, but can help with operational Flux and basic usage questions

Summary
  The Flux cluster is just a collection of similar Linux machines connected together to run your code, much faster than your desktop can
  Command-line scripts are queued by a batch system and executed when resources become available
  Some important commands are
    qsub
    qstat -u username
    qdel jobid
    checkjob
  Develop and test, then submit your jobs in bulk and let the scheduler optimize their execution

Any Questions?
  Charles J. Antonelli
  LSAIT Advocacy and Research Support
  cja@umich.edu
  http://www.umich.edu/~cja
  734 763 0607
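
Appendix sketch: submitting a job array

As a companion to the batch script and Job Arrays slides, the following is a minimal sketch of how the pieces might fit together. It writes a small batch script and submits it as a job array only if qsub is present. The file name array_job.pbs, the job name arraydemo, and the resource values are illustrative assumptions, not part of the workshop materials; youralloc_flux is the placeholder allocation name used in the slides and must be replaced with a real allocation.

```shell
#!/bin/sh
# Sketch only: write a Torque/PBS batch script for a job array.
# array_job.pbs and arraydemo are hypothetical names chosen for this example.

cat > array_job.pbs <<'EOF'
#PBS -N arraydemo
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=1,pmem=1gb,walltime=00:10:00
#PBS -j oe

cd $PBS_O_WORKDIR
# Each array member sees its own index in PBS_ARRAYID.
echo "array member $PBS_ARRAYID running on $(hostname)"
EOF

if command -v qsub >/dev/null 2>&1; then
    # Fifty members, at most ten running at once (array-spec from the slides).
    qsub -t 1-50%10 array_job.pbs
else
    # Off the cluster there is no qsub; just leave the script for inspection.
    echo "qsub not found; wrote array_job.pbs without submitting"
fi
```

The heredoc delimiter is quoted so that $PBS_O_WORKDIR and $PBS_ARRAYID are written literally and expanded only when the job runs on a compute node. After submission, qstat -u uniqname and checkjob jobid can be used as described in the slides to follow the array members.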