Sun Grid Engine
Fred Youhanaie <fy@comlab.ox.ac.uk>
Systems Manager, Oxford Supercomputing Centre & Oxford e-Science Centre
13.02.2003

Overview
- SGE basics.
- SGE advanced features.
- SGE at OSC.
- Integration with the Globus toolkit.
- More features.
- Conclusion.

SGE Basics
- SGE is a job scheduler and load balancer.
- Users submit a job to SGE via a script.
- The scheduler decides when and where to execute the script/job, based on the required resources and the available resources.
- While the job is running, the scheduler may decide to abort it if, say, the job has exceeded some limits.

SGE Basics – Types of Hosts
- There are four types of logical hosts:
  - Master host – in charge of everything.
  - Execution host – where the real work is done, and where the queues are defined.
  - Admin host – for issuing admin commands.
  - Submit host – for issuing job-related commands.
- All four can be put on a single physical host.
- Shadow host – for high availability (optional).

SGE Basics – Queues
- The 'pending queue' is where jobs are held until it is their turn to execute.
- A 'queue' is where a job is executed.
- Queues can be thought of as virtual hosts.
- More than one queue may be defined for a host; however, queues cannot span multiple hosts.

SGE Basics – Slots
- A 'slot' normally represents a CPU on an execution host or queue; it can be thought of as a virtual CPU.
- It is one of the basic units of resource allocation.
- Each sequential job requires one slot to run; an N-way parallel job requires N slots.
- Slots are defined for both hosts and queues. Depending on the relative values, one can over-utilise or under-utilise the host.

SGE Basics – Job Execution
- Three execution modes:
  - Batch – straightforward sequential programs – single thread.
  - Interactive – the user is given shell access to a suitable host – X windows is catered for.
  - Parallel – more complicated – requires parallel environments.

SGE Basics – Parallel Environment
- Collects one or more queues under a single definition.
- A queue may belong to more than one PE.
- E.g. MPI, PVM, SHMEM.
- The PE start script sets up the environment before starting a parallel job.
- The PE stop script tidies up after the job has terminated.

SGE Basics – Examples
[Diagram: example mappings of queues to hosts, illustrating full utilization, overloading, and interactive + parallel configurations.]

SGE Basics – Daemons
- qmaster/schedd – runs on the master host.
- shadowd – runs on the shadow host.
- execd – runs on each execution host; starts shepherd daemons for jobs, collects load information and passes it to qmaster/schedd.
- commd – all communication is done via this daemon; each of the three daemons above also has one of these.
- shepherd – started by execd, one shepherd per job per node; starts and terminates the job script and collects accounting information.

SGE Basics – Execution Cycle
- The qmaster gives the job to execd.
- execd starts a shepherd process for the job; from here on the shepherd does everything.
- The 'prolog' for the queue is run.
- If it is a parallel job, the 'pe_start' script is run.
- The user script is started.
- The shepherd may terminate the job.
- If it is a parallel job, the 'pe_stop' script is run.
- The 'epilog' script for the queue is run.
- The shepherd sends accounting info to the qmaster and stops.

SGE Basics – Submit Examples
    qsub j1.sh
    qsub -l arch=solaris64 j2.sh
    qsub -pe mpi 4 j3.sh
    qsub -pe mpi 4-10 j4.sh
    qsub -pe shm 2 -l h_rt=20:0:0 j5.sh
    qrsh -l arch=glinux abaqus viewer
    qsh -l arch=solaris64
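The job scripts referred to above (j1.sh and so on) are ordinary shell scripts; qsub options can also be embedded in the script itself as "#$" directives rather than given on the command line. A minimal sketch of what such a script might look like (the script contents, executable name and resource values are illustrative, not taken from the talk):

    #!/bin/sh
    # Illustrative batch script: lines starting with "#$" are read by
    # qsub as if they had been given as command-line options.
    #$ -N j1                 # job name
    #$ -cwd                  # run in the directory the job was submitted from
    #$ -l h_rt=1:0:0         # hard run-time limit (illustrative value)

    ./my_program             # hypothetical sequential executable

Submitting the same script with, say, "qsub -pe mpi 4 j1.sh" would additionally request four slots from the mpi parallel environment, as in the examples above.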
SGE vs SGEEE
- SGEEE is SGE, Enterprise Edition.
- The two are largely identical.
- SGEEE has an additional policy module for dynamic resource management.
- The same set of binaries is used for both.

SGE Advanced – Policy Module
- SGEEE has a ticket system for sharing the resources.
- The total number of tickets is normally fixed, but may be changed by the administrator.
- Jobs are allocated a number of tickets by the scheduler; a job's ticket count varies with time.
- The higher the number of tickets, the higher the priority of the job.

SGE Advanced – Policy Module
- Jobs acquire their tickets from four sources (policies):
  - Share tree policy – user/group hierarchy.
  - Functional policy – user/group/class.
  - Deadline policy – jobs.
  - Override policy – for manual intervention.
- All four policies can be deployed.
- We only use the share tree policy at OSC.

SGE Advanced – Share Tree
- Share based – users are allocated percentages of the resources (CPU/memory/IO).
- SGE keeps track of per-user resource utilisation.
- Ensures that users who have under-utilised the system get precedence over those who have over-utilised it.
- A half-life decay formula ensures that heavy users are not penalised too much; the half-life value is set by the administrator.
- Pre-dispatch – helps sort the pending queue.
- Post-dispatch – changes process priorities to balance per-user CPU usage.

SGE Advanced – Share Tree
- E.g. users A, B, C and D are allocated 10%, 20%, 35% and 35% of the resources respectively.
- SGE will always try to achieve the above shares.
- Starting from zero, C and D will always get precedence over B, and all three will get precedence over A.
- If over time C ends up using, say, 40% of the total resources because A did not use any, then A will be given higher priority in order to bring the usage statistics back into balance.

SGE Advanced – Share Tree
- The share tree ensures fairness.
- It also allows full utilisation of otherwise unused resources: heavy users are not kept out of empty systems.

SGE Advanced – Cells
- A cell is a single instance of an SGE installation.
- Basically a collection of configuration and spool files, all under a single sub-directory.
- A host may belong to more than one cell.
- Switch between cells by changing an environment variable (see the sketch below).
- E.g. multi-clusters.
- E.g. transfer queues – Globus jobmanager.
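The environment variable in question is SGE_CELL. A minimal sketch of switching cells from the same host, with illustrative cell names (the talk does not name the actual cells):

    # SGE client commands act on the cell named by SGE_CELL
    # ("default" if the variable is unset).
    export SGE_CELL=osc         # main cluster cell (name illustrative)
    qstat                       # queue status for this cell

    export SGE_CELL=transfer    # separate transfer-queue cell (name illustrative)
    qsub j1.sh                  # same command, different cell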
SGE at OSC – The Platforms
- Sun/Solaris:
  - 4 x SF6800, 84 CPUs, 168GB.
  - Solaris 8.
  - Suitable for shared-memory jobs, typically OpenMP or any multithreaded application.
- IBM/Linux:
  - 64 x IBM x330, 128 CPUs, 128GB.
  - RH7.2 Linux (kernel 2.4).
  - Suitable for distributed-memory jobs, typically MPI or PVM.
  - Myrinet switch for high-speed communication.
- A set of other hosts acting as infrastructure servers and head nodes for non-Grid users.

SGE at OSC – IBM Cluster
- Each IBM node contains a single queue.
- Two parallel environments, mpichgm and pvm (TCP/IP).
- Normally each PE contains all 64 queues.
- MPI and PVM jobs can share the same queue.
- If the Myrinet card fails on a node, that node is taken out of the mpichgm PE; PVM jobs can still continue using it.

SGE at OSC – Sun Cluster
- One parallel queue on each host.
- An additional interactive queue on just one host.
- All hosts are overloaded, i.e. have more slots than actual CPUs; this allows higher-precedence jobs to be dispatched.
- Two parallel environments, shmem and sunmpi (which uses IPC-SHMEM).
- The interactive queue uses Solaris processor sets, and is only enabled 8am-8pm on weekdays.

SGE at OSC – Users/Groups
- Our users belong to one of 30 research groups; e-Science is one of these groups.
- The e-Science group is allocated 5% of the total share (CPU time only).
- The remaining 95% is divided equally among the rest, approx. 3.2% each.
- Some groups have asked for resource allocation within the group, i.e. the 3.2% is further subdivided among group members, unequally!

SGE at OSC – Accounting
- SGE keeps accounting records in a flat file.
- Accounting records are taken from the flat file and put into a database.
- Monthly reports are generated from the DB.
- Currently, deleting a job may cause duplicate records to be generated (bug #438).

Globus Integration – Current Status
- The gatekeeper is connected to the e-Science Grid; still at GT2.0.
- It is set up as another OSC submit host.
- The default jobmanager in GT2.0 requires further work: it shows too much unnecessary detail, yet does not show everything (qstat -l arch=…).
- The plan is to deploy transfer queues.

Globus Integration – Plans
- Create a separate cell on the gatekeeper.
- The gatekeeper will be master, exec host and submit host for the new cell.
- The gatekeeper will also be a submit host for the OSC cell.
- Dummy queues will be defined for the various types of standard grid jobs: mpi, shmem, etc.
- The jobmanager will submit to the dummy queue; each queue's start/stop methods will then submit to the appropriate OSC queue.
- No interactive access possible!

Other Features
- Consumable resources.
- Load sensors.
- Grid Engine Portal.
- Java interface package.
- Job arrays.
- Event clients (unofficial interface).
- qtcsh/qtask – grid-enabled tcsh!
- qmake – concurrent make.
- Calendar – for enabling/disabling queues.

Conclusions
- Very good email support, by Sun staff :-)
- Source available :-)
- Treated as a community project :-)
- Many HOWTOs available :-)
- Good Users' Guide :-)
- Comprehensive internals documentation :-(
- PE/processor set split :-(
- Globus jobmanager :-(

Acknowledgements
- OSC: Bob McLatchie, Joe Pitt-Francis, Jon Lockley.
- OeSC: Jon Hillier.

Questions?