Sun Grid Engine
Fred Youhanaie
fy@comlab.ox.ac.uk
Systems Manager
Oxford Supercomputing Centre & Oxford e-Science Centre
Overview
- SGE basics.
- SGE advanced features.
- SGE at OSC.
- Integration with the Globus Toolkit.
- More features.
- Conclusion.
13.02.2003
SGE Basics
- SGE is a job scheduler and load balancer.
- Users submit jobs to SGE via a script.
- The scheduler decides when and where to execute the script/job, based on the required and available resources.
- While the job is running, the scheduler may abort it if, say, the job has exceeded some limit.
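As an illustration of the submit-via-script idea, a minimal job script might look like this (the script name, command and limits are made up for this sketch):

```shell
#!/bin/sh
# hello.sh -- a minimal SGE job script (illustrative).
# Lines starting with '#$' are embedded qsub options.
#$ -cwd                # run the job from the submission directory
#$ -l h_rt=0:10:0      # request a 10-minute runtime limit
echo "Hello from $(hostname)"
```

It would be submitted with `qsub hello.sh` and monitored with `qstat`.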
SGE Basics – Types of Hosts
- There are four types of logical host:
  - Master host: in charge of everything.
  - Execution host: where the real work is done, and where the queues are defined.
  - Admin host: for issuing administrative commands.
  - Submit host: for issuing job-related commands.
- All four roles can be placed on a single physical host.
- A fifth, optional type, the Shadow host, provides high availability.
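Host roles are typically managed with qconf; a sketch, assuming an installed SGE and a made-up hostname 'node01':

```shell
qconf -ah node01    # add node01 as an administrative host
qconf -as node01    # add node01 as a submit host
qconf -sh           # show the current list of admin hosts
qconf -ss           # show the current list of submit hosts
```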
SGE Basics - Queues
- The 'pending queue' is where jobs are held until it is their turn to execute.
- A 'queue' is where a job is executed; queues can be thought of as virtual hosts.
- More than one queue may be defined for a host; however, a queue cannot span multiple hosts.
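Queues can be inspected with qconf; a sketch (the queue name is made up):

```shell
qconf -sql          # list the names of all defined queues
qconf -sq node01.q  # show the full configuration of queue 'node01.q'
```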
SGE Basics - Slots
- A 'slot' normally represents a CPU on an execution host or queue; it can be thought of as a virtual CPU.
- Slots are one of the basic units of resource allocation.
- Each sequential job requires one slot to run; an N-way parallel job requires N slots.
- Slots are defined for both hosts and queues; depending on the relative values, a host can be over- or under-utilized.
SGE Basics – Job Execution
- Three execution modes:
  - Batch: straightforward sequential programs, single-threaded.
  - Interactive: the user is given shell access to a suitable host; X Windows is catered for.
  - Parallel: more complicated; requires a Parallel Environment.
SGE Basics – Parallel Environment
- A Parallel Environment (PE) collects one or more queues under a single definition.
- A queue may belong to more than one PE.
- Examples: MPI, PVM, SHMEM.
- The PE start script sets up the environment before a parallel job starts; the PE stop script tidies up after the job has terminated.
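For illustration, a PE definition (as shown by `qconf -sp mpi`) contains fields along these lines; the paths and queue names here are made up:

```shell
pe_name           mpi
queue_list        node01.q node02.q
slots             128
start_proc_args   /usr/local/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
```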
SGE Basics - Examples
[Diagram: example mappings of queues onto hosts, illustrating full utilization, overloading, and interactive + parallel queues.]
SGE Basics – Daemons
- qmaster/schedd: runs on the master host.
- shadowd: runs on the shadow host.
- execd: runs on each execution host; starts shepherd daemons for jobs, collects load information and passes it to qmaster/schedd.
- commd: all communication is done via this daemon; each of the above daemons has one.
- shepherd: started by execd, one shepherd per job per node; starts and terminates the job script, and collects accounting information.
SGE Basics – Execution Cycle
- The qmaster gives the job to execd.
- execd starts a shepherd process for the job; from here on the shepherd does everything.
- The queue's 'prolog' script is run.
- For a parallel job, the 'pe_start' script is run.
- The user script is started.
- The shepherd may terminate the job.
- For a parallel job, the 'pe_stop' script is run.
- The queue's 'epilog' script is run.
- The shepherd sends accounting information to the qmaster and stops.
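The ordering above can be sketched as plain shell logic (illustrative pseudologic, not SGE source):

```shell
# Order in which the shepherd wraps a job with the queue and PE scripts.
# $1 is the job type: "batch" or "parallel".
run_job() {
  echo prolog                            # queue prolog
  [ "$1" = parallel ] && echo pe_start   # PE start script, parallel only
  echo job_script                        # the user's script
  [ "$1" = parallel ] && echo pe_stop    # PE stop script, parallel only
  echo epilog                            # queue epilog
}

run_job batch
run_job parallel
```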
SGE Basics – Submit Examples
- qsub j1.sh (simple batch job)
- qsub -l arch=solaris64 j2.sh (request a 64-bit Solaris host)
- qsub -pe mpi 4 j3.sh (4 slots in the mpi PE)
- qsub -pe mpi 4-10 j4.sh (between 4 and 10 slots)
- qsub -pe shm 2 -l h_rt=20:0:0 j5.sh (2 slots, 20-hour runtime limit)
- qrsh -l arch=glinux abaqus viewer (interactive remote command)
- qsh -l arch=solaris64 (interactive X session)
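The same options can also be embedded in the script itself as '#$' directives; for example, the j5.sh submission above could carry its own options (the program name is made up):

```shell
#!/bin/sh
#$ -pe shm 2         # two slots in the 'shm' parallel environment
#$ -l h_rt=20:0:0    # 20-hour runtime limit
./my_program         # illustrative
```

after which a plain `qsub j5.sh` suffices.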
SGE vs SGEEE
- SGEEE is SGE, Enterprise Edition.
- The two are largely identical.
- SGEEE has an additional policy module for dynamic resource management.
- The same set of binaries is used for both.
SGE Advanced – Policy Module
- SGEEE has a ticket system for sharing the resources.
- The total number of tickets is normally fixed, but may be changed by the administrator.
- Jobs are allocated a number of tickets by the scheduler.
- A job's ticket count varies over time.
- The higher the number of tickets, the higher the priority of the job.
SGE Advanced – Policy Module
- Jobs acquire their tickets from four sources (policies):
  - Share Tree Policy: user/group hierarchy.
  - Functional Policy: user/group/class.
  - Deadline Policy: jobs.
  - Override Policy: for manual intervention.
- All four policies can be deployed together.
- At OSC we use only the Share Tree Policy.
SGE Advanced – Share Tree
- Share-based: users are allocated percentages of resources (CPU/memory/I/O).
- SGE keeps track of per-user resource utilization.
- Ensures that users who have under-utilized the system get precedence over those who have over-utilized it.
- A half-life decay formula ensures that heavy users are not penalised too much; the half-life value is set by the administrator.
- Pre-dispatch: helps sort the pending queue.
- Post-dispatch: changes process priorities to balance per-user CPU usage.
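As a sketch of the half-life idea (standard exponential decay, not necessarily SGE's exact formula): recorded usage is scaled by 0.5^(elapsed / half-life), so old usage gradually stops counting against a user.

```shell
# Illustrative half-life decay of recorded usage.
# decayed = usage * 0.5^(elapsed / half_life)
decayed_usage() {
  awk -v u="$1" -v t="$2" -v h="$3" 'BEGIN { printf "%.0f\n", u * 0.5 ^ (t / h) }'
}

decayed_usage 1000 7 7     # one half-life (7 days) later: half remains
decayed_usage 1000 14 7    # two half-lives later: a quarter remains
```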
SGE Advanced – Share Tree
- E.g. users A, B, C and D are allocated 10%, 20%, 35% and 35% of the resources respectively.
- SGE will always try to achieve these shares.
- Starting from zero usage, C and D always get precedence over B, and all three get precedence over A.
- If over time C ends up using, say, 40% of the total resources because A did not use any, then A is given higher priority in order to bring the usage statistics back into balance.
SGE Advanced – Share Tree
- The share tree ensures fairness.
- It also allows full utilization of otherwise unused resources.
- Heavy users are not kept out of empty systems.
SGE Advanced - Cells
- A cell is a single instance of an SGE installation: essentially a collection of configuration and spool files, all under a single sub-directory.
- A host may belong to more than one cell.
- Switch between cells by changing an environment variable.
- Example uses: multi-clusters; transfer queues (e.g. for the Globus jobmanager).
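The environment variable in question is SGE_CELL; a sketch (the cell name 'globus' is made up):

```shell
SGE_CELL=default; export SGE_CELL
qstat                 # queries the default cell
SGE_CELL=globus; export SGE_CELL
qstat                 # now queries the 'globus' cell
```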
SGE at OSC – The Platforms
- Sun/Solaris:
  - 4 x SF6800, 84 CPUs, 168 GB.
  - Solaris 8.
  - Suitable for shared-memory jobs, typically OpenMP or any multithreaded application.
- IBM/Linux:
  - 64 x IBM x330, 128 CPUs, 128 GB.
  - RH7.2 Linux (kernel 2.4).
  - Suitable for distributed-memory jobs, typically MPI or PVM.
  - Myrinet switch for high-speed communication.
- A set of other hosts act as infrastructure servers and head nodes for non-Grid users.
SGE at OSC – IBM Cluster
- Each IBM node contains a single queue.
- Two parallel environments: mpichgm and pvm (TCP/IP).
- Normally each PE contains all 64 queues.
- MPI and PVM jobs can share the same queue.
- If the Myrinet card fails on a node, that node is taken out of the mpichgm PE; PVM jobs can still use it.
SGE at OSC – Sun Cluster
- One parallel queue on each host.
- An additional interactive queue on just one host.
- All hosts are overloaded, i.e. more slots than actual CPUs; this allows higher-precedence jobs to be dispatched.
- Two parallel environments: shmem and sunmpi (uses IPC shared memory).
- The interactive queue uses Solaris processor sets, and is only enabled 8am-8pm on weekdays.
SGE at OSC – Users/Groups
- Our users belong to one of 30 research groups; e-Science is one of these groups.
- The e-Science group is allocated 5% of the total share (CPU time only).
- The remaining 95% is divided equally among the rest, approx. 3.2% each.
- Some groups have asked for resource allocation within the group, i.e. their 3.2% is further subdivided among group members, unequally!
SGE at OSC – Accounting
- SGE keeps accounting records in a flat file.
- The accounting records are taken from the flat file and put into a database.
- Monthly reports are generated from the database.
- Currently, deleting a job may cause duplicate records to be generated (bug #438).
Globus Integration – Current Status
- Our gatekeeper is connected to the e-Science Grid.
- Still at GT2.0.
- Set up as another OSC submit host.
- The default jobmanager in GT2.0 requires further work:
  - it shows too much unnecessary detail;
  - it does not show everything (e.g. qstat -l arch=…).
- The plan is to deploy transfer queues.
Globus Integration - Plans
- Create a separate cell on the gatekeeper.
- The gatekeeper will be the master, execution and submit host for the new cell.
- The gatekeeper will also be a submit host for the OSC cell.
- Dummy queues will be defined for the various types of standard grid job: mpi, shmem, etc.
- The jobmanager will submit to a dummy queue.
- Each dummy queue's start/stop methods will then submit to the appropriate OSC queue.
- No interactive access is possible!
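Very roughly, a dummy queue's start method could forward the job like this sketch (the cell name, PE name and the use of a start-method script are all assumptions about the plan, not deployed code):

```shell
#!/bin/sh
# Hypothetical start method for the dummy 'mpi' transfer queue:
# re-submit the incoming job into the OSC cell.
SGE_CELL=osc; export SGE_CELL
qsub -pe mpi "$NSLOTS" "$@"    # $NSLOTS is set by SGE for PE jobs
```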
Other Features
- Consumable resources.
- Load sensors.
- Grid Engine Portal.
- Java interface package.
- Job arrays.
- Event clients (unofficial interface).
- qtcsh/qtask: grid-enabled tcsh!
- qmake: concurrent make.
- Calendars: for enabling/disabling queues.
Conclusions
- Very good email support, by Sun staff :-)
- Source available :-)
- Treated as a community project :-)
- Many HOWTOs available :-)
- Good Users' Guide :-)
- Comprehensive internals documentation :-(
- PE/processor set split :-(
- Globus jobmanager :-(
Acknowledgements
- OSC:
  - Bob McLatchie
  - Joe Pitt-Francis
  - Jon Lockley
- OeSC:
  - Jon Hillier
Questions?
Download