Introduction to IBM LoadLeveler Batch Scheduling System
J. Skovira, 5/05, v1

Agenda

- Batch Scheduling Basics
- LoadLeveler basics
- LoadLeveler configuration
- Job command files
- Basic Commands
- Job Submission
- Job cancellation
- Job monitoring
- Advanced Functions
- Questions and Answers

Who Needs a Job Scheduler?

Single machine:
- Job 1, Job 2, …, Job N
- The OS multi-tasks a single CPU: time-shared scheduling

HPC machine (parallel dimension), many machines and users:
- User 1: Job 1, Job 2, …, Job N
- User 2: Job 1, Job 2, …, Job N
- User 3: Job 1, Job 2, …, Job N
- More jobs
- In the parallel dimension, a user may impact a distant job

The scheduler runs jobs according to:
- Scheduling theory
- Site-defined policy

Scheduling Terms

- Scheduler: starts jobs on specific resources at specific times

[Diagram labels: Batch Scheduler; Resource manager; Job Queue (Job 1, Job 2, Job 3, …); HPC Cluster]

More Tasks for User?

[Diagram labels: Job Meta Data; Application Code]

- A Job Command File is a small set of job directives
- Job Command files can be “borrowed” from samples
- Simple Command files take predefined defaults
- Experienced users may enhance command files
- Once control is handed to the job, the scheduler is out of the way

LoadLeveler Components

[Diagram labels: IBM Cluster; LoadLeveler Central Manager (Negotiator daemon); High Performance Switch; Worker Nodes (Startd daemon); Schedd Machine; Schedd Machine]

Priority and Scheduling

Jobs arrive:
- from different users
- at different times
- in different job classes
- with different priorities

[Diagram labels: JobA; JobB; JobC; JobD; JobE]

LoadLeveler sorts the job queue:

    Job      Nodes   Time
    Job A    8       2
    Job B    12      1
    Job C    10      1
    Job D    4       1
    Job E    4       5

LoadLeveler schedules the jobs in queue order.
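
A minimal Python sketch of what scheduling "in queue order" means for the five-job queue above, compared with letting later jobs fill idle nodes (the backfill idea the slides turn to next). The node counts and run times are read from the job-queue table on the slides. The 16-node cluster size, the function names, and the simple rule "take the earliest slot that never delays an already placed job" are assumptions for illustration, not LoadLeveler's actual algorithm.

```python
from typing import NamedTuple


class Job(NamedTuple):
    name: str
    nodes: int   # nodes requested
    hours: int   # run time, in the time units of the slide's table


# Queue order, node counts and run times as in the job-queue table.
QUEUE = [Job("A", 8, 2), Job("B", 12, 1), Job("C", 10, 1),
         Job("D", 4, 1), Job("E", 4, 5)]
TOTAL_NODES = 16  # assumption: the slides do not state the cluster size


def nodes_in_use(placed, t):
    """Nodes held at time t by jobs that have already been given a start time."""
    return sum(job.nodes for job, start in placed
               if start <= t < start + job.hours)


def fits(placed, job, start, total_nodes):
    """True if `job` can hold its nodes for its whole run when started at `start`."""
    # Usage only rises when a placed job starts, so checking those instants is enough.
    instants = [start] + [s for _, s in placed if start < s < start + job.hours]
    return all(nodes_in_use(placed, t) + job.nodes <= total_nodes
               for t in instants)


def schedule(queue, total_nodes, backfill):
    """Return {job name: start time}; assumes every job fits on the cluster."""
    placed = []
    earliest = 0
    for job in queue:
        ends = {s + j.hours for j, s in placed if s + j.hours > earliest}
        start = next(t for t in sorted({earliest} | ends)
                     if fits(placed, job, t, total_nodes))
        placed.append((job, start))
        if not backfill:
            earliest = start  # strict queue order: never start before the job ahead
    return {job.name: start for job, start in placed}


def makespan(queue, starts):
    """Time at which the last job finishes."""
    return max(starts[job.name] + job.hours for job in queue)


strict = schedule(QUEUE, TOTAL_NODES, backfill=False)
filled = schedule(QUEUE, TOTAL_NODES, backfill=True)
print(strict, makespan(QUEUE, strict))  # {'A': 0, 'B': 2, 'C': 3, 'D': 3, 'E': 4} 9
print(filled, makespan(QUEUE, filled))  # {'A': 0, 'B': 2, 'C': 3, 'D': 0, 'E': 0} 5
```

Under these assumptions, Jobs D and E slip into the nodes Job A leaves idle without pushing back Jobs B or C, and the whole queue finishes at time 5 instead of 9.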

Reservation vs Backfill

Reservation (standard) scheduling:
- Top job waits a short time for resources to free
- Defer if not available

Backfill:
- Top job starts if it can
- If there are not enough resources, compute when they will be available and which resources the job will use
- Backfill jobs onto available nodes

Backfill is superior for parallel machines.

Backfill

Job Queue:

    Job      Nodes   Time
    Job A    8       2
    Job B    12      1
    Job C    10      1
    Job D    4       1
    Job E    4       5

Job Command File Basics

[Diagram labels: Job Command File; Application Code]

- The command file contains job “directives”
- Basic items include:
  - Shell
  - Class
  - Input/output directories
  - Notification control
  - Queue keyword
- 2 ways to specify the job executable:
  - Executable keyword
  - Script invocation after the queue keyword

Basic Job Command File

    #!/bin/ksh
    # @ class = demo
    # @ queue
    perlspin2 > /tmp

More Job Command File Keywords

- Requirements allow you to select:
  - I/O directives
  - Node requirements
  - Wallclock limit
  - Locally defined requirements
  - Etc…
- notification controls what LL sends about the job
  - From never to always
- notify_user tells LL where to send job info
  - An email address

Serial Job Command File

    #!/bin/ksh
    # @ error = ./out/job2.$(jobid).err
    # @ output = ./out/job2.$(jobid).out
    # @ wall_clock_limit = 180
    # @ class = demo
    # @ notification = complete
    # @ notify_user = skoviraj@us.ibm.com
    # @ queue
    perlspin2

Communication on the System

- Each node has a connection to the high-performance switch
- There are 2 ways to use the switch:
  - ip mode
    - "unlimited" channels
    - slower communication performance
  - User space mode
    - limited number of channels
    - faster than ip mode
- The mode can be selected in the job command file
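
The job command files shown above all follow one shape: a shell line, a run of `# @ keyword = value` directives, the `# @ queue` keyword, and then the command to run. As a rough illustration of that shape, here is a small Python sketch that renders such a file from a dictionary. The helper name `job_command_file` is hypothetical, and the sketch only reproduces the layout seen on the slides; it does not validate keywords.

```python
def job_command_file(directives, command, shell="#!/bin/ksh"):
    """Render directives, then '# @ queue', then the command run by the job step."""
    lines = [shell]
    lines += [f"# @ {key} = {value}" for key, value in directives.items()]
    lines.append("# @ queue")   # ends the directives for this job step
    lines.append(command)       # script invocation after the queue keyword
    return "\n".join(lines) + "\n"


# Same keywords and values as the serial job command file on the slide.
serial = job_command_file(
    {
        "error": "./out/job2.$(jobid).err",
        "output": "./out/job2.$(jobid).out",
        "wall_clock_limit": 180,
        "class": "demo",
        "notification": "complete",
        "notify_user": "skoviraj@us.ibm.com",
    },
    "perlspin2",
)
print(serial)
```

The resulting text could be written to a file and handed to llsubmit in the same way as any hand-written command file.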

Parallel Job Command File Keywords

- node: how many nodes your job requires
- tasks_per_node: how many tasks will run on each node
- network: how your job will communicate
- wall_clock_limit: an estimate of how long your job runs

The Network Keyword

    network.protocol = network_type, usage, mode

- protocol: MPI, LAPI, PVM
- network_type: sn_single or sn_all for switch adapter
- usage: shared or not_shared
- mode: IP, US

An example:

    # @ network.MPI = sn_single, shared, us

Parallel Job Command File

    #!/bin/ksh
    # @ job_type = parallel
    # @ node = 1
    # @ tasks_per_node = 4
    # @ error = ./out/job3.$(jobid).err
    # @ output = ./out/job3.$(jobid).out
    # @ wall_clock_limit = 05:00
    # @ class = demo
    # @ notification = complete
    # @ notify_user = skovira@tc.cornell.edu
    # @ network.MPI = sn_all,shared,us
    # @ queue
    poe perlspin2

Basic LoadLeveler Commands

- llsubmit – submits a job to LoadLeveler
- llcancel – cancels a submitted job
- llq – queries the status of jobs in the job queue
- llstatus – queries the status of machines in the cluster

llq example

    v01n08:/u/skoviraj $ llsubmit mybasic.cmd
    llsubmit: The job "v01n08.vendor.pok.ibm.com.205" has been submitted
    v01n08:/u/skoviraj $ llq
    Id                       Owner      Submitted   ST PRI Class        Running On
    ------------------------ ---------- ----------- -- --- ------------ -----------
    v01n08.204.0             skoviraj   11/11 22:29 R  50  No_Class     v01n02
    v01n08.205.0             skoviraj   11/11 22:30 R  50  No_Class     v01n02
    v01n08.203.0             skoviraj   11/11 22:28 I  50  No_class

    3 job steps in queue, 1 waiting, 0 pending, 2 running, 0 held
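
For job monitoring from a script, the llq listing above can be split into fields. The following Python sketch parses text laid out like the sample on the slide and counts job steps per state. The regular expression and the names `ROW` and `parse_llq` are assumptions built only from that sample; actual llq output may differ between LoadLeveler versions and option settings.

```python
import re
from collections import Counter

# Sample text copied from the llq example on the slide.
SAMPLE = """\
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
v01n08.204.0             skoviraj   11/11 22:29 R  50  No_Class     v01n02
v01n08.205.0             skoviraj   11/11 22:30 R  50  No_Class     v01n02
v01n08.203.0             skoviraj   11/11 22:28 I  50  No_class

3 job steps in queue, 1 waiting, 0 pending, 2 running, 0 held
"""

# A job step row starts with host.job.step; "Running On" is empty for idle steps.
ROW = re.compile(
    r"^(?P<id>\S+\.\d+\.\d+)\s+(?P<owner>\S+)\s+(?P<date>\S+)\s+(?P<time>\S+)"
    r"\s+(?P<st>\S+)\s+(?P<pri>\d+)\s+(?P<cls>\S+)(?:\s+(?P<host>\S+))?\s*$"
)


def parse_llq(text):
    """Return one dict per job step row; header, rule and summary lines are skipped."""
    return [m.groupdict() for m in map(ROW.match, text.splitlines()) if m]


steps = parse_llq(SAMPLE)
states = Counter(step["st"] for step in steps)
print(states["R"], "running,", states["I"], "idle")  # 2 running, 1 idle
```

The counts agree with the summary line in the sample: two steps in state R and one waiting in state I.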

llstatus example

    v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus v01n02
    Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
    v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43
    v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus | more
    Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
    v01n01.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
    v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43
    v01n03.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
    v01n04.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
    v01n05.vendor.pok.ibm.com Avail    0   0 Idle     0 0.02  9999 R6000 AIX43
    v01n06.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
    v01n07.vendor.pok.ibm.com Avail    1   0 Idle     0 0.06   155 R6000 AIX43
    v01n08.vendor.pok.ibm.com Avail    1   0 Idle     0 0.00    83 R6000 AIX43
    v01n09.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43

llctl Examples

    llctl -h hostname command

Useful commands:
- reconfig - Forces all daemons to reread the configuration files.
- start - Starts the LoadLeveler daemons on the specified machine.
- stop - Stops the LoadLeveler daemons on the specified machine.

Commands sometimes used:
- flush - Terminates running jobs on this machine and places the jobs in idle.
- recycle - Stops all LoadLeveler daemons and restarts them.

llctl Example

    drain [schedd|startd [classlist |allclasses]]

With no options:
(1) no more LoadLeveler jobs can begin running on this machine,
(2) no more LoadLeveler jobs can be submitted through this machine.

When you issue drain schedd, the following happens:
(1) the schedd machine accepts no more LoadLeveler jobs for submission.
(2) jobs in the Starting or Running state in the queue are allowed to continue running.
(3) jobs in the Idle state in the schedd queue are drained.

When you issue drain startd, the following happens:
(1) the startd machine accepts no more LoadLeveler jobs to be run.
(2) jobs already running on the startd machine are allowed to complete.

More LoadLeveler Commands

- llclass - returns information about available classes
- llprio - changes the user priority of a job step

llclass Example

    v60n129:/u/skoviraj $ llclass
    Name            MaxJobCPU      MaxProcCPU     Free  Max   Description
                    d+hh:mm:ss     d+hh:mm:ss     Slots Slots
    --------------- -------------- -------------- ----- ----- ---------------------
    inter_class     undefined      undefined      192   192
    X_Class         undefined      undefined      192   192
    v60n129:/u/skoviraj $ llclass -l X_Class
    =============== Class X_Class ===============
    Name: X_Class
    Priority: 0
    Exclude_Users:
    Include_Users:
    Exclude_Groups:
    Include_Groups:
    Admin:
    NQS_class: F
    NQS_submit:
    NQS_query:
    Max_processors: -1
    Maxjobs: -1
    Resource_requirement:
    Class_comment:
    Class_ckpt_dir:
    Ckpt_limit: undefined, undefined
    Wall_clock_limit: 11+13:46:39, 11+13:46:39 (999999 seconds, 999999 seconds)
    Job_cpu_limit: undefined, undefined
    …

llprio Example

    v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
    Id                       Owner      Submitted   ST PRI Class        Running On
    v01n07.137.0             skoviraj   11/11 22:51 I  50  No_class
    1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held
    v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llprio -p 100 v01n07.137.0
    llprio: Priority command has been sent to the central manager.
    v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
    Id                       Owner      Submitted   ST PRI Class        Running On
    v01n07.137.0             skoviraj   11/11 22:51 I  100 No_class
    1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held

Advanced Topics

- Job Preemption
- Job Checkpointing
- LoadLeveler APIs (data access, scheduling)
- Consumable resource control
- Workload Manager (WLM) integration
- Advance Reservation
- Submit filter
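
In the llclass -l output above, the wall clock limit appears twice: as 11+13:46:39 under the d+hh:mm:ss convention named in the llclass header, and as 999999 seconds. A short Python check of that arithmetic follows; the function name `format_limit` is hypothetical and the sketch only illustrates the conversion, not how LoadLeveler formats values internally.

```python
def format_limit(seconds):
    """Convert a number of seconds to a d+hh:mm:ss string."""
    days, rest = divmod(seconds, 86400)   # 86400 seconds per day
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}+{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_limit(999999))  # 11+13:46:39
```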

Job Suspension

[Diagram labels: 16 way job runs; 16 way job completes; 4 Node job runs; 4 Node suspended; 4 way restarts]

Job Checkpoint

[Diagram labels: 16 way job runs; 16 way job completes; 4 Node job runs; 4 Node Checkpoints and ends; 4 Node job state saved; GPFS; 4 way restarts from saved state]

Submit Filter

    use strict;
    use warnings;

    my $net_key = 0;    # True once a network keyword is seen in this job step
    while ( my $line = <STDIN> ) {
        chomp $line;
        if ( $line =~ /network/ ) {    # If we find the network keyword...
            $net_key = 1;              # remember it!
        }
        if ( $line =~ /queue/ ) {      # If at the end of LL keywords for this job step...
            if ( !$net_key ) {         # if no network keyword...
                # Add one which uses the switch
                print "# @ network.MPI = sn_all,not_shared,US\n";
            }
            $net_key = 0;              # Reset network keyword memory
        }
        print "$line\n";    # Copy a single LL cmd file line to the new cmd file
    }

Tips for Efficient Job Processing

Assumptions:
- One task per CPU
- Classes configured

Get your job to the TOP of the queue:
- Short run
- Small number of nodes
- Use ip communication over the switch
- Priority?
- Submit during low use periods (evening)

These are FREE! All of the above tips (except priority) will impact no other job.

More Tips for Efficient Job Processing

Allow your job to run as QUICKLY as possible:
- Balance node operations
- Keep data entirely in physical memory
- Use processors of similar types (system admin?)
- Use distributed data load and store
- Profile your applications for efficient compiler use

This could be an entirely new presentation!

Questions and Answers
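
To see what the submit filter from the Submit Filter slide does to a job with more than one job step, here is a line-for-line Python rendering of the same logic applied to a small two-step command file. The function name `add_default_network` and the sample input are made up for illustration. As in the slide's filter, matching is a plain substring test, so any line containing "network" or "queue" counts.

```python
def add_default_network(lines):
    """Insert a default network keyword before each queue line that lacks one."""
    out = []
    saw_network = False
    for line in lines:
        if "network" in line:        # network keyword found in this job step
            saw_network = True
        if "queue" in line:          # end of the keywords for this job step
            if not saw_network:
                out.append("# @ network.MPI = sn_all,not_shared,US")
            saw_network = False      # reset for the next job step
        out.append(line)
    return out


# Hypothetical input: step 1 names a network, step 2 does not.
job = [
    "#!/bin/ksh",
    "# @ job_type = parallel",
    "# @ network.MPI = sn_single,shared,us",
    "# @ queue",
    "# @ job_type = parallel",
    "# @ queue",
]
for line in add_default_network(job):
    print(line)
```

Only the second step gains a line: the default network keyword is printed just before its queue keyword, while the first step passes through unchanged.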