SLURM for Yorktown Bluegene/Q
© 2014 IBM Corporation

SLURM on Wat2q

• Goals
  – Set up a scheduler for the Yorktown Bluegene system to increase research utilization of the system.
  – Become familiar with the Bluegene/Q SRM (system resource manager) interfaces, as they are a model for future HPC control APIs.
• Divide the Yorktown system into multiple submidplane blocks.
• Develop scripts to allow users (optionally) to land on a specific submidplane block.
• Get slurm to run the bgas.pl script automatically based on information in the SLURM sbatch command used to queue a job.
  – This requires that jobs be limited to running on complete partitions.
  – SLURM by default will attempt to run a job on part of a submidplane partition if that partition is already booted.
  – This is accomplished with prolog scripts.

SLURM Scheduling Jobs

SLURM Allocation Vs. Task Placement

Allocation is the selection of the resources needed for the job.
  – Each job includes zero or more job steps (srun).
  – Each job step is comprised of one or more tasks.
  – Allocation is done by the "sbatch" command.
Task placement is the process of assigning a subset of the job's allocated resources (cpus) to each task.
  – This is handled by the SLURM "srun" command invoked from within the script scheduled by "sbatch".

Effectively this becomes a game of Tetris. (figure)

Slurm documentation

Slurm docs can be found here:
  – http://slurm.schedmd.com/documentation.html

Typical commands:
  sacct     display accounting data for all jobs and job steps in the SLURM job accounting log.
  sbatch    submit a batch job to SLURM.
  scancel   signal jobs or job steps that are under the control of Slurm.
  scontrol  used to view and modify Slurm configuration and state.
  sinfo     view information about SLURM nodes and partitions.
  smap      graphically view information about SLURM jobs, partitions, and configuration parameters.
  squeue    view information about jobs located in the SLURM scheduling queue.
  srun      run parallel jobs.
  sstat     display various status information of a running job/step.
  sview     graphical user interface to view and modify SLURM state.

SLURM functions

SLURMD carries out five key tasks and has five corresponding subsystems:
  – Machine Status: responds to SLURMCTLD requests for machine state information and sends asynchronous reports of state changes to help with queue control.
  – Job Status: responds to SLURMCTLD requests for job state information and sends asynchronous reports of state changes to help with queue control.
  – Remote Execution: starts, monitors, and cleans up after a set of processes (usually shared by a parallel job), as decided by SLURMCTLD (or by direct user intervention).
  – Stream Copy Service: handles all STDERR, STDIN, and STDOUT for remote tasks. This may involve redirection, and it always involves locally buffering job output to avoid blocking local tasks.
  – Job Control: propagates signals and job-termination requests to any SLURM-managed processes (often interacting with the Remote Execution subsystem).

Slurm software

SLURM daemons don't execute directly on the compute nodes. SLURM gets system state and allocates resources through the Bluegene/Q control system. This interface is entirely contained in a SLURM plugin (src/plugins/select/bluegene). The user interacts with Bluegene through the following slurm commands (a typical session is sketched below):
  – sbatch
  – srun
  – scontrol
  – squeue
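To make the day-to-day interaction concrete, a session might look like the following sketch (the job id 1234 is a placeholder, not output from this system):

  sinfo                                        # show partitions and node states
  sbatch --nodes=64 --partition=prod rj01.sh   # queue a job script; prints the new job id
  squeue -u $USER                              # watch the job in the scheduling queue
  scontrol show job 1234                       # inspect the detailed state of one job
  scancel 1234                                 # signal/cancel the job if needed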
Slurm Architecture for Bluegene/Q (figure)

Job Launch Process (figure)

Sview of BlueGene system (figure)

Slurm naming conventions

• Slurm names things with torus coordinates.
• Top-level names use 4-dimensional midplane coordinates.
• Submidplane partitions use 5-dimensional torus coordinates.

  Bgq name      Slurm name
  R00-M0        bgq0000
  R00-M1        bgq0001
  R01-M0        bgq0010
  R01-M1        bgq0011
  R00           bgq[0000x0001]
  R01           bgq[0010x0011]
  R00R01        bgq[0000x0011]

  Bgq name      Slurm name
  R00-M0-N00    bgq0000[00000x11111]
  R00-M0-N01    bgq0000[00200x11311]
  R00-M0-N02    bgq0000[00020x11131]
  R00-M0-N03    bgq0000[00220x11331]
  R00-M0-N04    bgq0000[20000x31111]
  R00-M0-N05    bgq0000[20200x31311]
  R00-M0-N06    bgq0000[20020x31131]
  R00-M0-N07    bgq0000[20220x31331]
  R00-M0-N08    bgq0000[02000x12111]
  R00-M0-N09    bgq0000[02200x13311]
  R00-M0-N10    bgq0000[02020x13131]
  R00-M0-N11    bgq0000[03330x13331]
  R00-M0-N12    bgq0000[22000x33111]
  R00-M0-N13    bgq0000[22200x33311]
  R00-M0-N14    bgq0000[22020x33131]
  R00-M0-N15    bgq0000[22220x33331]

Example larger block:
  R01-M0-N00-128  bgq0010[00000x11331]

Slurm queuing a JOB

Use the sbatch command to queue a script that will run one or more jobs. Within the script presented to the sbatch command, issue one or more "srun" commands.
  – The srun command will eventually cause a runjob command to be created.

For example, this schedules the script rj01.sh to run when a 64-node block on the partition "prod" is booted:

  sbatch --nodes=64 --partition=prod rj01.sh

Inside rj01.sh we have:

  #!/bin/bash
  srun --chdir=/bgusr/home1/bvt_scratch /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf

The srun will call runjob as follows:

  runjob --exe /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf --block RMP28Ap122959767 --cwd /bgusr/home1/bvt_scratch

Queuing a job with only one script

Using sbatch/srun to queue a job typically requires two scripts: one to queue the job (sbatch) and one to run one or more jobs (srun) once the block is allocated. One can do this with a single script using this simple boilerplate:

  #!/bin/bash
  if [ -z "$SLURM_JOBID" ]; then
      sbatch --gid=bqluan --time=5:00 --nodes=128 --ntasks-per-node=32 -O --qos=umax-128 $0
  else
      srun --chdir=/gpfs/DDNgpfs2/bqluan/mushroomP \
           --output=equilibrate-4V-21-new.out --error=equilibrate-4V-21-new.err \
           /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 equilibrate-4V-21-new.namd
  fi

The above script is a re-expression of the following (original) runjob script:

  runjob --block R01-M0-N04-128 --ranks-per-node 32 --cwd /gpfs/DDNgpfs2/bqluan/mushroomP \
         --exe /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 \
         --args equilibrate-4V-21-new.namd > equilibrate-4V-21-new.out 2> equilibrate-4V-21-new.err &

Srun/runjob decoder

  Runjob option      Srun option
  --cwd              --chdir
  --exe              (first field without an option)
  --label xx         --label=xx
  --verbose          --verbose
  --ranks-per-node   --ntasks-per-node
  All other options  --launcher-opts=

• --launcher-opts is a catch-all for all other runjob options.
• For example: --launcher-opts="--timeout=300 --strace"

Partitions (SLURM queue names)

We have set up multiple basic slurm queues (partitions):
  – prod – regular production nodes (R00-M0, R00-M1, R01-M0, R01-M1).
  – bgas – full system bgas allocation (R00-M0, R00-M1, R01-M0, R01-M1).
There are a couple of midplane-level reservations set up to run each day:
  – bgas_daily – active 3am to 3:30pm.
  – bgas_full – active 3:30pm to 6pm.
The default queue/partition is the "prod" queue. The queue/partition name is used by the prolog script to determine whether it is necessary to switch the IO nodes to either BGAS or production, along the lines of the sketch below.
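As a rough illustration only: a prolog could key the IO-node switch off the partition name as follows. The bgas.pl path and its flags are assumptions for illustration, not the production script, and it is assumed the partition name is exported to the prolog environment (SLURM_JOB_PARTITION in later SLURM releases):

  #!/bin/bash
  # Hypothetical prolog sketch: flip IO nodes based on the queue the job was submitted to.
  if [ "$SLURM_JOB_PARTITION" = "bgas" ]; then
      /bgsys/local/bin/bgas.pl --mode bgas   # assumed path and flag
  else
      /bgsys/local/bin/bgas.pl --mode prod   # assumed path and flag
  fi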
SLURM small block divisions

Block divisions as of May 2014:
  – bgq0000 (R00-M0) – divided into sixteen 32-way blocks.
  – bgq0001 (R00-M1) – divided into 32-, 64-, 128-, and 256-way (overlapping) blocks.
  – bgq0010 (R01-M0) – divided into 64-, 128-, and 256-way (overlapping) blocks.
  – bgq0011 (R01-M1) – divided into 64-, 128-, and 256-way (overlapping) blocks.
• The sbatch option "--nodes=xx", where xx is either 32, 64, 128, or 256, will cause a job to land on one of the small block partitions. Slurm will pick which small block to run it on.
• Prolog scripts ensure that partial blocks are not used (e.g. two 32-way jobs running on the same 64-way block at the same time).
• You can restrict which midplane slurm will try to select its blocks from with --nodelist=xxxx, where xxxx is bgq0000, bgq0001, bgq0010, or bgq0011.

Getting SLURM to run on a specific node card/block

To get slurm to land on a specific block we use the prolog script and the "nodelist" and "constraint" options for sbatch. For example:

  sbatch --partition=prod --nodelist=bgq0000 --nodes=32 --constraint=N00-32

NOTE:
  – The --nodes option and the constraint must agree as to the size.
  – A sub-block of that size MUST exist on the nodelist requested.

Valid constraints are:
  – Nxx-32, where xx is 00-15.
  – Nxx-64, where xx is 00, 02, 04, 06, 08, 10, 12, 14.
  – Nxx-128, where xx is 00, 04, 08, 12.
  – Nxx-256, where xx is 00, 08.

If the block is not capable of being scheduled, the job will be canceled and a message will appear in the stdout file (slurm-$jobid.out).

Using the higher-numbered Nxx cards for 64 and 32 ways is discouraged, because the system will try the lower-numbered cards first and work down each node card in turn until the job lands on the card it needs to run on.

SLURM Job order

If the user uses the --constraint parameter to select a specific node card, the order in which jobs are submitted may not be respected. This is because the prolog scripts can reject the node SLURM first selects, either because the job would run on a block larger than requested, or because of a constraint.
  – When the job is rejected on a specific node, it gets re-queued, and this causes some reordering.

If job order is required, one can use the --dependency=singleton and --job-name options as follows:

  sbatch --job-name=a --dependency=singleton -N32 --constraint=N01-32 rj01.sh

Another way to do this is with other forms of "--dependency" (a sketch follows the list):
  – after:job_id[:jobid...] – this job can begin execution after the specified jobs have begun execution.
  – afterany:job_id[:jobid...] – this job can begin execution after the specified jobs have terminated.
  – afternotok:job_id[:jobid...] – this job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc.).
  – afterok:job_id[:jobid...] – this job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
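To make the chaining concrete, a hedged sketch follows; first.sh, second.sh, and cleanup.sh are placeholder scripts, and --parsable (which makes sbatch print just the job id) is assumed to be available in the installed SLURM version:

  jid=$(sbatch --parsable -N32 first.sh)             # capture the job id of the first job
  sbatch --dependency=afterok:$jid -N32 second.sh    # starts only if first.sh exits 0
  sbatch --dependency=afterany:$jid -N32 cleanup.sh  # starts once first.sh terminates, pass or fail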
SLURM – reservations

Slurm can reserve an entire midplane for jobs by a specific reservation id. The current version can only reserve entire midplane blocks (not sub-midplane).
  – The September release of SLURM is supposed to have better sub-midplane capabilities for both node selection and reservations.

Creating a reservation:

  scontrol create reservation user=myid starttime=now nodes=bgq0001 duration=120

This will reply with a reservation id as follows:

  Reservation created: myid_5

Using the reservation:

  sbatch --reservation=myid_5 --nodes=64 my.script

This web page outlines reservations in more detail:
  https://computing.llnl.gov/linux/slurm/reservations.html

Reservation Time-limit interaction

Each job in the queue has an execution time limit imposed on it. The default normally comes from the queue name.
  – It can be overridden at various levels, such as the sbatch command line.
  – The initial default for the SLURM queues is 1 hour, so to override it use the --time parameter on sbatch as follows:

  sbatch --time=xxx nameofscript.sh

• The xxx value is in minutes; other forms of date/times can be found in the sbatch man page: "man sbatch".

The job will not run if its time limit overlaps a node reservation.
  – For example, if there is a reservation every day at 3:30 for the entire machine and the time limit associated with the job would overlap that full-system reservation, the job won't run until after the reservation is over.
  – If the time limit exceeds the queue/partition time limit, the job will be left in the pending state indefinitely.

QOS settings

QOS (quality of service) settings are used by SLURM to control limits on the amount of resources a given user/group/account/job can consume at any one time. Our initial deployment of SLURM will associate a default QOS setting limiting each user to the total number of compute nodes that they previously had as a static allocation. This keeps users from consuming all of the machine by submitting multiple sbatch commands, but still allows a user to run three 32-way jobs if their normal allocation was 128 nodes.

Each user will have a "default QOS" setting associated with their ID, as well as a list of QOS settings they are allowed to use:
  – umax-32 = user max nodes = 32
  – umax-64 = user max nodes = 64
  – umax-128 = user max nodes = 128
  – …

One can select one of the authorized QOS settings on the sbatch command line as follows:

  sbatch --qos=umax-128 --nodes=32 xx.sh

  – The above command would allow the user to run four 32-way jobs in parallel before the queue would back up his jobs behind other work.
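As an illustration of the umax-128 limit in practice, the loop below (a sketch; xx.sh is the placeholder script from the example above) submits four 32-way jobs that together reach the 128-node cap; a fifth submission would simply wait in the pending state until one of the first four finishes:

  for i in 1 2 3 4; do                       # four 32-way jobs = 128 nodes total
      sbatch --qos=umax-128 --nodes=32 xx.sh
  done
  squeue -u $USER                            # jobs beyond the QOS cap show as pending (PD)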