
Using Condor as a Local Resource Management System in NorduGrid
By Haakon Riiser
University of Oslo
The reasons for this project
- We (the Experimental Particle Physics group at the University of Oslo) were already using Condor on our own desktop computers, and the university’s central IT department also had a Condor pool in place.
- We wanted our university to contribute to the ATLAS Data Challenges.
- For Northern Europe, ATLAS DC jobs would be submitted with the NorduGrid middleware, creating the need for a bridge between Condor and NorduGrid.
The NorduGrid middleware
- The NorduGrid middleware is a complete Grid framework, built on top of a slightly modified version of the Globus Toolkit, that provides all the basic Grid services.
- It relies on an LRMS (such as PBS or Condor) on each cluster to manage resources and run jobs.
- Prior to the completion of the Condor/NorduGrid project, the only batch system for which a NorduGrid interface existed was PBS.
Initial goals for the Condor/NorduGrid interface
- A NorduGrid user should not need to know that Condor is being used: a Condor cluster must behave exactly like PBS.
- Vanilla-universe jobs only. Support for other universes could be added, but is not very interesting, since one can always use Condor directly for special needs. NorduGrid is for when you only need the basic features, but want a large number of potential execution nodes.
- Linux and x86/x86-64 support. Support for other architectures and Unixes should be easy to add later.
Overview of implementation
The Condor/NorduGrid interface consists of two distinct parts:
1. The Grid Manager (GM) interface, whose primary task is to convert the job description into Condor's format, submit the job, store some information required by the Information System and, finally, notify the GM on job completion (a sketch of the conversion step follows this list).
2. The Information System interface, which is a set of scripts called at regular intervals to poll the status of the cluster and its jobs (a sketch of such polling also follows).
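
To make part 1 concrete, here is a minimal sketch, in Python, of how a few job-description attributes could be mapped onto a Condor submit description for the vanilla universe. The attribute names and the helper itself are simplified placeholders for this illustration, not the actual NorduGrid GM backend.

    #!/usr/bin/env python3
    """Illustrative sketch only: translate a few simplified job attributes
    into a Condor submit description for the vanilla universe.  The attribute
    names and overall structure are assumptions for this example, not the
    actual NorduGrid Grid Manager backend."""

    def to_condor_submit(job):
        """Return the text of a Condor submit description for one job."""
        lines = [
            "universe   = vanilla",
            "executable = %s" % job["executable"],
            # Naive join; the argument quoting limits this runs into are
            # discussed under "Implementation challenges" below.
            "arguments  = %s" % " ".join(job.get("arguments", [])),
            "output     = %s" % job.get("stdout", "stdout"),
            "error      = %s" % job.get("stderr", "stderr"),
            "log        = condor.log",   # later used to detect completion
            "should_transfer_files   = YES",
            "when_to_transfer_output = ON_EXIT",
            "queue",
        ]
        return "\n".join(lines) + "\n"

    if __name__ == "__main__":
        demo = {
            "executable": "atlas_sim.sh",
            "arguments": ["--events", "1000"],
            "stdout": "job.out",
            "stderr": "job.err",
        }
        print(to_condor_submit(demo))

In the real interface, the generated description would then be handed to condor_submit and the resulting Condor job ID recorded so that the GM and the Information System can track the job.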
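
For part 2, the idea of periodically polling the pool can be illustrated with the standard Condor query tools. This is only a sketch under the assumption that condor_status and condor_q with their -format option are available; the real interface consists of shell scripts that publish the results as Information System attributes.

    #!/usr/bin/env python3
    """Illustrative sketch only: take a snapshot of a Condor pool the way the
    Information System scripts do conceptually, by querying condor_status
    (machines) and condor_q (the local job queue)."""
    import subprocess

    def query(cmd):
        """Run a Condor command-line tool and return its non-empty output lines."""
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return [line for line in out.splitlines() if line.strip()]

    def pool_snapshot():
        # One line per slot in the pool; -format prints just the machine name.
        slots = query(["condor_status", "-format", "%s\n", "Name"])
        # One line per job in the local queue; JobStatus 1 = idle, 2 = running.
        states = query(["condor_q", "-format", "%d\n", "JobStatus"])
        return {
            "slots": len(slots),
            "queued": states.count("1"),
            "running": states.count("2"),
        }

    if __name__ == "__main__":
        print(pool_snapshot())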
Cluster organization
[Diagram: the Grid talks to the GM (Grid Manager) front-end; behind it sits the Condor pool, consisting of a Central Manager and a number of Submit/Execute nodes.]
Implementation challenges
- Condor's job description language has some limitations, such as the inability to use whitespace in command-line arguments, no zero-length arguments, no stdout/stderr join operator, etc. (a workaround is sketched after this list).
- NorduGrid's Information System, being designed with PBS in mind, has attributes whose meaning under Condor is not perfectly clear, and careful thought had to be given so that Condor's information means approximately the same as PBS's.
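
A common way around such argument limitations, sketched here as a general technique rather than as the exact approach used in the NorduGrid backend, is to generate a small wrapper shell script that hard-codes the precise argument vector and to submit the wrapper as the executable, so Condor never has to represent the arguments itself. A stdout/stderr join can be handled the same way with a redirection inside the wrapper.

    #!/usr/bin/env python3
    """Illustrative sketch: side-step Condor's argument-list limitations by
    writing the exact argument vector into a generated wrapper script and
    submitting the wrapper instead of the real executable."""
    import os
    import shlex

    def write_wrapper(path, executable, args, join_streams=False):
        """Create an executable shell wrapper that runs `executable` with `args`
        exactly as given, including whitespace and zero-length arguments."""
        quoted = " ".join(shlex.quote(a) for a in args)
        redirect = " 2>&1" if join_streams else ""   # emulate a stdout/stderr join
        with open(path, "w") as f:
            f.write("#!/bin/sh\n")
            f.write("exec %s %s%s\n" % (shlex.quote(executable), quoted, redirect))
        os.chmod(path, 0o755)

    if __name__ == "__main__":
        # Arguments that the submit description could not express directly:
        write_wrapper("run_job.sh", "./atlas_sim.sh",
                      ["--title", "two words", "--comment", ""],
                      join_streams=True)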
The first real test of the system:
The ATLAS Data Challenges
- Huge amounts of I/O (several hundred MB per job).
- 10-30 hours of CPU time per job.
- At least 400 MB of memory.
- A runtime environment of 2.2 GB.
In summary
- Our small Condor pool of non-dedicated machines is now a surprisingly large contributor to the Data Challenges.
- We had the lowest failure rate of all clusters.
- The students using the desktop computers allocated to our pool were amazed by how little they were disturbed in their daily work. They only noticed some extra fan noise on some machines, and a few seconds’ delay on login because running jobs had to be suspended and memory swapped back in from disk.