Introduction to LoadLeveler

Introduction to IBM LoadLeveler Batch Scheduling System
J. Skovira 5/05 v1
Agenda
- Batch Scheduling Basics
- LoadLeveler basics
- LoadLeveler configuration
- Job command files
- Basic Commands
  - Job Submission
  - Job cancellation
  - Job monitoring
- Advanced Functions
- Questions and Answers
Who Needs a Job Scheduler?
Single Machine
- Job 1, Job 2, …., Job N
- OS multi-tasks single CPU: time-shared scheduling
HPC Machine
- Many machines and users: User 1, User 2, and User 3 each have Job 1, Job 2, …., Job N
- More Jobs; Parallel Dimension
- User may impact a distant job
Scheduler runs jobs according to:
- Scheduling Theory
- Site-defined Policy
Scheduling Terms
Scheduler
- Starts jobs on specific resources at specific times
[Diagram labels: Batch Scheduler, Resource manager, Job Queue (Job 1, Job 2, Job 3, ….), HPC Cluster]
More Tasks for User?
Job Meta Data
Application Code
A job command file is a small set of job directives
Job command files can be “borrowed” from samples
Simple command files take predefined defaults
Experienced users may enhance command files
Once control is handed to the job, the scheduler is out of the way
LoadLeveler Components
IBM Cluster
- LoadLeveler Central Manager: Negotiator daemon
- Worker Nodes: Startd daemon
- Schedd Machines (two shown)
- High Performance Switch
Priority and Scheduling
Jobs arrive:
- from different users
- at different times
- in different job classes
- with different priorities
[Diagram labels, arriving jobs: JobD, JobA, JobE, JobB, JobC]
LoadLeveler sorts the job queue:
Job     Nodes   Time
Job A   8       2
Job B   12      1
Job C   10      1
Job D   4       1
Job E   4       5
LoadLeveler schedules the jobs in queue order
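The sorting step can be sketched in a few lines of Python. This is only an illustration: the slide does not say how priority is computed (sites configure that themselves), so the `priority` and `submitted` values below are invented, and the rule "higher priority first, earlier submission breaks ties" is an assumption rather than LoadLeveler's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedJob:
    name: str
    priority: int   # hypothetical site-computed priority; higher runs first
    submitted: int  # hypothetical arrival order (smaller means earlier)


def sort_queue(jobs):
    """Order jobs by descending priority, then by arrival."""
    return sorted(jobs, key=lambda job: (-job.priority, job.submitted))


# Jobs arrive in the order D, A, E, B, C (as in the diagram), with made-up priorities.
arrivals = [
    QueuedJob("D", 20, 0),
    QueuedJob("A", 50, 1),
    QueuedJob("E", 10, 2),
    QueuedJob("B", 40, 3),
    QueuedJob("C", 30, 4),
]
print([job.name for job in sort_queue(arrivals)])  # ['A', 'B', 'C', 'D', 'E']
```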
Reservation vs Backfill
Reservation (standard) Scheduling
- Top job waits a short time for resources to free
- The job is deferred if the resources do not become available
Backfill
- Top job starts if it can
- If there are not enough resources, compute:
  - when they will be available
  - which resources the job will use
- Backfill other jobs onto the nodes that are available in the meantime
Backfill is superior for parallel machines
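To make the difference concrete, here is a small simulation sketch. It is not LoadLeveler's scheduler: it uses a simplified, textbook-style backfill rule (reserve nodes for the blocked top job, then start later jobs only if they do not delay it) and compares it with strict queue order. The job sizes and times come from the deck's Job A to Job E table; the 16-node cluster size and all function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int
    time: int  # run-time estimate, in the table's time units


def simulate(jobs, total_nodes, backfill):
    """Return {job name: start time} for an already-sorted job queue."""
    if any(job.nodes > total_nodes for job in jobs):
        raise ValueError("a job asks for more nodes than the cluster has")
    queue = list(jobs)
    running = []  # (end_time, nodes) of each running job
    starts = {}
    now = 0
    while queue:
        # Release the nodes of every job that has finished by `now`.
        running = [(end, n) for end, n in running if end > now]
        free = total_nodes - sum(n for _, n in running)

        # Start jobs from the head of the queue while they fit.
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            starts[job.name] = now
            running.append((now + job.time, job.nodes))
            free -= job.nodes

        if backfill and queue:
            # The top job is blocked: find when enough nodes will be free
            # (its reserved start) and how many nodes are spare at that time.
            top = queue[0]
            avail, reserved_start = free, now
            for end, n in sorted(running):
                avail += n
                reserved_start = end
                if avail >= top.nodes:
                    break
            spare = avail - top.nodes
            # Start later jobs that fit now and cannot delay the top job.
            for job in list(queue[1:]):
                if job.nodes > free:
                    continue
                done_in_time = now + job.time <= reserved_start
                if done_in_time or job.nodes <= spare:
                    queue.remove(job)
                    starts[job.name] = now
                    running.append((now + job.time, job.nodes))
                    free -= job.nodes
                    if not done_in_time:
                        spare -= job.nodes

        if queue:
            now = min(end for end, _ in running)  # jump to the next completion
    return starts


def makespan(jobs, starts):
    """Time at which the last job finishes."""
    return max(starts[job.name] + job.time for job in jobs)


QUEUE = [Job("A", 8, 2), Job("B", 12, 1), Job("C", 10, 1),
         Job("D", 4, 1), Job("E", 4, 5)]

for use_backfill in (False, True):
    result = simulate(QUEUE, total_nodes=16, backfill=use_backfill)
    print(use_backfill, result, makespan(QUEUE, result))
```

On the assumed 16-node cluster, strict queue order finishes the five jobs at time 9, while the backfill variant starts Job D and Job E next to Job A and finishes at time 5, without delaying Job B or Job C.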
Backfill
Job Queue
Job     Nodes   Time
Job A   8       2
Job B   12      1
Job C   10      1
Job D   4       1
Job E   4       5
Job Command File Basics
Command file contains job “directives”
Basic items include:
Shell
Class
Input/output directories
Notification control
Queue keyword
Job Command File
Application Code
Two ways to specify the job executable:
Executable keyword
Script invocation after the queue keyword
Basic Job Command File
#!/bin/ksh
# @ class = demo
# @ queue
perlspin2 > /tmp
More Job Command File Keywords
Requirements allow you to select:
I/O directives
Node requirements
Wallclock limit
Locally defined requirements
Etc…
notification controls what LL sends about the job
From never to always
notify_user tells LL where to send job info
An email address
Serial Job Command File
#!/bin/ksh
# @ error = ./out/job2.$(jobid).err
# @ output = ./out/job2.$(jobid).out
# @ wall_clock_limit = 180
# @ class = demo
# @ notification = complete
# @ notify_user = skoviraj@us.ibm.com
# @ queue
perlspin2
Communication on the System
Each node has a connection to the high-performance switch
There are 2 ways to use the switch:
- IP mode
  - "unlimited" channels
  - slower communication performance
- User space (US) mode
  - limited number of channels
  - faster than IP mode
The mode can be selected in the job command file
Parallel Job Command File Keywords
node: How many nodes your job requires
tasks_per_node: How many tasks will run on each node
network: How your job will communicate
wall_clock_limit: An estimate of how long your job runs
The Network Keyword
network.protocol = network_type, usage, mode
protocol: MPI, LAPI, PVM
network_type: sn_single or sn_all for switch adapter
usage: shared or not_shared
mode: IP, US
An example:
# @ network.MPI = sn_single, shared, us
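As a rough illustration of how the four fields combine, the sketch below builds a network statement and rejects values that are not on this slide. The function name and the value sets are assumptions drawn only from the list above; a given LoadLeveler release may accept other values.

```python
# Allowed values as listed on the slide (a given installation may accept others).
PROTOCOLS = {"MPI", "LAPI", "PVM"}
NETWORK_TYPES = {"sn_single", "sn_all"}
USAGES = {"shared", "not_shared"}
MODES = {"IP", "US"}


def network_directive(protocol, network_type, usage, mode):
    """Build a '# @ network.<protocol> = ...' line from the four fields."""
    if protocol.upper() not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol}")
    if network_type not in NETWORK_TYPES:
        raise ValueError(f"unknown network type: {network_type}")
    if usage not in USAGES:
        raise ValueError(f"unknown usage: {usage}")
    if mode.upper() not in MODES:  # the slide writes the mode as both "US" and "us"
        raise ValueError(f"unknown mode: {mode}")
    return f"# @ network.{protocol} = {network_type},{usage},{mode}"


print(network_directive("MPI", "sn_single", "shared", "us"))
# prints: # @ network.MPI = sn_single,shared,us
```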
Parallel Job Command File
#!/bin/ksh
# @ job_type = parallel
# @ node = 1
# @ tasks_per_node = 4
# @ error = ./out/job3.$(jobid).err
# @ output = ./out/job3.$(jobid).out
# @ wall_clock_limit = 05:00
# @ class = demo
# @ notification = complete
# @ notify_user = skovira@tc.cornell.edu
# @ network.MPI = sn_all,shared,us
# @ queue
poe perlspin2
Basic LoadLeveler Commands
llsubmit – submits a job to LoadLeveler
llcancel – cancels a submitted job
llq – queries the status of jobs in the job queue
llstatus – queries the status of machines in the cluster
llq example
v01n08:/u/skoviraj $ llsubmit mybasic.cmd
llsubmit: The job "v01n08.vendor.pok.ibm.com.205" has been submitted
v01n08:/u/skoviraj $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
v01n08.204.0             skoviraj   11/11 22:29 R   50 No_Class     v01n02
v01n08.205.0             skoviraj   11/11 22:30 R   50 No_Class     v01n02
v01n08.203.0             skoviraj   11/11 22:28 I   50 No_class
3 job steps in queue, 1 waiting, 0 pending, 2 running, 0 held
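For scripted monitoring, the summary line at the end of the llq output can be picked apart with a regular expression. This is a sketch based only on the sample line shown here; the exact wording may differ between LoadLeveler versions, and the function name is made up for the example.

```python
import re

# Pattern taken from the sample summary line above; other versions may word it differently.
SUMMARY = re.compile(
    r"(\d+) job steps? in queue, (\d+) waiting, (\d+) pending, "
    r"(\d+) running, (\d+) held"
)


def parse_llq_summary(line):
    """Return the counts from an llq summary line, or None if it does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    keys = ("in_queue", "waiting", "pending", "running", "held")
    return dict(zip(keys, map(int, match.groups())))


counts = parse_llq_summary(
    "3 job steps in queue, 1 waiting, 0 pending, 2 running, 0 held"
)
print(counts["running"])  # 2
```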
llstatus example
v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus v01n02
Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43
v01n08:/u/skoviraj/suspender1.0/suspender_stuff $ llstatus | more
Name                      Schedd InQ Act Startd Run LdAvg Idle Arch  OpSys
v01n01.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
v01n02.vendor.pok.ibm.com Avail    0   0 Run      1 0.00  9999 R6000 AIX43
v01n03.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
v01n04.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
v01n05.vendor.pok.ibm.com Avail    0   0 Idle     0 0.02  9999 R6000 AIX43
v01n06.vendor.pok.ibm.com Avail    0   0 Idle     0 0.05  9999 R6000 AIX43
v01n07.vendor.pok.ibm.com Avail    1   0 Idle     0 0.06   155 R6000 AIX43
v01n08.vendor.pok.ibm.com Avail    1   0 Idle     0 0.00    83 R6000 AIX43
v01n09.vendor.pok.ibm.com Avail    0   0 Idle     0 0.00  9999 R6000 AIX43
llctl Examples
llctl -h hostname command
Useful Commands:
reconfig - Forces all daemons to reread the configuration files.
start - Starts the LoadLeveler daemons on the specified machine.
stop - Stops the LoadLeveler daemons on the specified machine.
Commands sometimes used:
flush - Terminates running jobs on this machine and places them in the Idle state.
recycle - Stops all LoadLeveler daemons and restarts them.
llctl Example
drain [schedd|startd [classlist |allclasses]]
With no options:
(1) no more LoadLeveler jobs can begin running on this machine,
(2) no more LoadLeveler jobs can be submitted through this machine.
When you issue drain schedd, the following happens:
(1) the schedd machine accepts no more LoadLeveler jobs for submission.
(2) jobs in the Starting or Running state in the queue are allowed to continue running.
(3) jobs in the Idle state in the schedd queue are drained
When you issue drain startd, the following happens:
(1) the startd machine accepts no more LoadLeveler jobs to be run
(2) jobs already running on the startd machine are allowed to complete.
More LoadLeveler Commands
llclass - returns information about available classes
llprio - changes the user priority of a job step
llclass Example
v60n129:/u/skoviraj $ llclass
Name            MaxJobCPU      MaxProcCPU     Free  Max   Description
                d+hh:mm:ss     d+hh:mm:ss     Slots Slots
--------------- -------------- -------------- ----- ----- ---------------------
inter_class     undefined      undefined        192   192
X_Class         undefined      undefined        192   192
v60n129:/u/skoviraj $ llclass -l X_Class
=============== Class X_Class ===============
Name: X_Class
Priority: 0
Exclude_Users:
Include_Users:
Exclude_Groups:
Include_Groups:
Admin:
NQS_class: F
NQS_submit:
NQS_query:
Max_processors: -1
Maxjobs: -1
Resource_requirement:
Class_comment:
Class_ckpt_dir:
Ckpt_limit: undefined, undefined
Wall_clock_limit: 11+13:46:39, 11+13:46:39 (999999 seconds, 999999 seconds)
Job_cpu_limit: undefined, undefined
…
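The d+hh:mm:ss notation in this listing can be checked by hand: 11 days, 13 hours, 46 minutes, and 39 seconds add up to the 999999 seconds shown. A small conversion sketch follows. The function name is hypothetical, and reading a two-field value such as 05:00 as minutes:seconds is an assumption about the format, not something the slides state.

```python
def limit_to_seconds(text):
    """Convert '[d+][[hh:]mm:]ss' (for example '11+13:46:39') to seconds."""
    days = 0
    if "+" in text:
        day_part, text = text.split("+", 1)
        days = int(day_part)
    seconds = 0
    for field in text.split(":"):
        # Each further field is 60 times finer than the one before it.
        seconds = seconds * 60 + int(field)
    return days * 86400 + seconds


print(limit_to_seconds("11+13:46:39"))  # 999999, matching the llclass -l output
print(limit_to_seconds("05:00"))        # 300, if 05:00 means five minutes
print(limit_to_seconds("180"))          # 180
```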
llprio Example
v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
v01n07.137.0             skoviraj   11/11 22:51 I   50 No_class
1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held
v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llprio -p 100 v01n07.137.0
llprio: Priority command has been sent to the central manager.
v01n07:/u/skoviraj/suspender1.0/suspender_stuff $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
v01n07.137.0             skoviraj   11/11 22:51 I  100 No_class
1 job steps in queue, 1 waiting, 0 pending, 0 running, 0 held
Advanced Topics
Job Preemption
Job Checkpointing
LoadLeveler APIs (data access, scheduling)
Consumable resource control
Workload Manager (WLM) integration
Advance Reservation
Submit filter
Job Suspension
[Diagram: a 4-node job runs and is then suspended; a 16-way job runs and completes; the 4-node job restarts]
Job Checkpoint
[Diagram: a 4-node job runs, checkpoints, and ends, with its state saved (GPFS is shown as the storage); a 16-way job runs and completes; the 4-node job restarts from the saved state]
Submit Filter
#!/usr/bin/perl
use strict;
use warnings;

my $net_key = 0;    # Set once a network keyword is seen in the current job step
while ( my $line = <STDIN> ) {
    chomp $line;
    if ( $line =~ /^\s*#\s*@\s*network\b/i ) {    # If we find the network keyword....
        $net_key = 1;                             # remember it!
    }
    if ( $line =~ /^\s*#\s*@\s*queue\b/i ) {      # If at the end of LL keywords for this job step...
        if ( !$net_key ) {                        # if no network keyword...
            # Add one which uses the switch
            print "# @ network.MPI = sn_all,not_shared,US\n";
        }
        $net_key = 0;                             # Reset network keyword memory
    }
    print "$line\n";    # Copy a single ll cmd file line to new cmd file
}
Tips for Efficient Job Processing
Assumptions:
- One task per CPU
- Classes configured
Get your job to the TOP of the queue:
- Short run
- Small number of nodes
- Use IP communication over the switch
- Priority?
- Submit during low-use periods (evening)
These are FREE!
All of the above tips (except priority) affect no other job
More Tips for Efficient Job Processing
Allow your job to run as QUICKLY as possible:
Balance node operations
Keep data entirely in physical memory
Use processors of similar types (system admin?)
Use distributed data load and store
Profile your applications for efficient compiler use
This could be an entirely new presentation!
Questions and Answers