Using Condor as a Local Resource Management System in NorduGrid
Haakon Riiser, University of Oslo

The reasons for this project

We (the Experimental Particle Physics group at the University of Oslo) were already using Condor on our own desktop computers, and the university's central IT department also had a Condor pool in place. We wanted our university to contribute to the ATLAS Data Challenges. For Northern Europe, ATLAS DC jobs would be submitted through the NorduGrid middleware, which created the need for a bridge between Condor and NorduGrid.

The NorduGrid middleware

- A complete Grid framework providing all the basic Grid services, built on top of a slightly modified version of the Globus Toolkit.
- Relies on an LRMS (such as PBS or Condor) on each cluster to manage resources and run jobs.
- Prior to the completion of the Condor/NorduGrid project, the only batch system for which a NorduGrid interface existed was PBS.

Initial goals for the Condor/NorduGrid interface

- A NorduGrid user should not need to know that Condor is being used: a Condor cluster must behave exactly like PBS.
- Vanilla universe jobs only. Support for other universes could be added, but is not very interesting; one can always use Condor directly for special needs. NorduGrid is for when you need only the basic features, but a large number of potential execution nodes.
- Linux on x86/x86-64. Support for other architectures and Unix variants should be easy to add later.

Overview of implementation

The Condor/NorduGrid interface consists of two distinct parts (both are sketched at the end of these notes):

1. The Grid Manager (GM) interface, whose task is primarily to convert the job description into Condor's format, submit the job, store some information required by the Information System and, finally, notify the GM on job completion.
2. The Information System interface, a set of scripts called at regular intervals to poll the status of the cluster and its jobs.

Cluster organization

[Diagram: the Grid reaches the cluster through the GM front-end; the Condor pool consists of a Central Manager coordinating a number of submit/execute nodes.]

Implementation challenges

- Condor's job description language has some limitations, such as the inability to use whitespace in command line arguments, no zero-length arguments, no stdout/stderr join operator, etc. (a possible workaround is sketched at the end of these notes).
- NorduGrid's Information System, being designed with PBS in mind, has attributes whose meaning under Condor is not perfectly clear, and careful thought had to be given so that Condor's information means approximately the same as PBS's.

The first real test of the system: the ATLAS Data Challenges

- Huge amounts of I/O (several hundred MB per job).
- 10-30 hours of CPU time per job.
- At least 400 MB of memory.
- A 2.2 GB runtime environment.

In summary

- Our small Condor pool of non-dedicated machines is now a surprisingly effective contributor to the Data Challenges.
- We had the lowest failure rate of all clusters.
- The students using the desktop computers allocated to our pool were amazed by how little they were disturbed in their daily work. They noticed only some extra fan noise on some machines, and a few seconds' delay at login while running jobs were suspended and memory swapped back in from disk.
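
Sketch: the GM interface

To make the GM interface concrete, the following is a minimal, hypothetical sketch of its central step: translating a simplified job description into a vanilla-universe Condor submit description and handing it to condor_submit. The job dictionary, file names and helper functions are invented for illustration and do not reflect the actual NorduGrid backend code.

    #!/usr/bin/env python
    # Hypothetical sketch of a GM-style Condor backend's core step:
    # translate a simplified job description into a vanilla-universe
    # submit description and pass it to condor_submit.  The job
    # dictionary and file layout are invented for illustration.

    import subprocess

    def write_submit_file(job, path):
        # Build a minimal Condor submit description for a vanilla job.
        lines = [
            "universe   = vanilla",
            "executable = %s" % job["executable"],
            "arguments  = %s" % " ".join(job["arguments"]),
            "output     = %s" % job["stdout"],
            "error      = %s" % job["stderr"],
            "log        = %s" % job["log"],  # user log, useful for detecting completion
            "should_transfer_files = YES",
            "when_to_transfer_output = ON_EXIT",
            "queue",
        ]
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")

    def submit(job):
        submit_path = job["session_dir"] + "/condor.submit"
        write_submit_file(job, submit_path)
        # condor_submit reports the assigned cluster id on success.
        return subprocess.check_output(["condor_submit", submit_path]).decode()

    if __name__ == "__main__":
        job = {
            "executable": "/bin/hostname",
            "arguments": [],
            "stdout": "job.out", "stderr": "job.err", "log": "job.log",
            "session_dir": ".",
        }
        print(submit(job))

In the real interface this step is followed by recording the information the Information System needs and by watching for job completion so that the GM can be notified.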
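
Sketch: the Information System interface

The Information System side can be pictured as a small polling script that maps Condor pool state onto the PBS-like quantities an information provider expects (total and free CPUs, running and queued jobs). The sketch below is again hypothetical; the choice of ClassAd attributes and command-line options is an assumption, not the shipped infoprovider scripts.

    #!/usr/bin/env python
    # Hypothetical polling sketch for an Information System provider:
    # map Condor pool state onto PBS-like notions of total/free CPUs
    # and running/queued jobs.  Attribute choices are illustrative only.

    import subprocess

    def condor_lines(cmd):
        return subprocess.check_output(cmd).decode().splitlines()

    def cluster_totals():
        # One line per slot, containing its Activity (Idle, Busy, ...).
        activities = condor_lines(["condor_status", "-format", "%s\n", "Activity"])
        total = len(activities)
        free = sum(1 for a in activities if a == "Idle")
        return total, free

    def job_counts():
        # JobStatus: 1 = idle (queued), 2 = running.
        statuses = condor_lines(["condor_q", "-format", "%d\n", "JobStatus"])
        queued = sum(1 for s in statuses if s == "1")
        running = sum(1 for s in statuses if s == "2")
        return queued, running

    if __name__ == "__main__":
        total, free = cluster_totals()
        queued, running = job_counts()
        print("cpus: %d total, %d free" % (total, free))
        print("jobs: %d running, %d queued" % (running, queued))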
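
Sketch: working around the argument limitations

One common way around the argument limitations mentioned under "Implementation challenges" (no whitespace or zero-length arguments in Condor's old arguments syntax) is to avoid passing arguments through the submit file at all: generate a small wrapper script that execs the real executable with the exact argument vector, and submit the wrapper instead. This is an illustrative workaround, not necessarily the one used in the actual interface.

    #!/usr/bin/env python
    # Hypothetical workaround for Condor's old argument syntax, which
    # could not express whitespace or zero-length arguments: write a
    # wrapper shell script that execs the real executable with the
    # exact argument vector, and submit the wrapper with an empty
    # "arguments" line.

    import os

    def shell_quote(arg):
        # Single-quote each argument, escaping embedded single quotes.
        return "'" + arg.replace("'", "'\\''") + "'"

    def write_wrapper(path, executable, args):
        with open(path, "w") as f:
            f.write("#!/bin/sh\n")
            f.write("exec %s %s\n" % (shell_quote(executable),
                                      " ".join(shell_quote(a) for a in args)))
        os.chmod(path, 0o755)

    if __name__ == "__main__":
        # Arguments containing spaces and an empty string survive intact.
        write_wrapper("run.sh", "/bin/echo", ["hello world", "", "-n"])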