Condor: The CCLRC Experience
UK Condor Week 2004, 11th October 2004
John Kewley, Grid Technology Group, e-Science Centre

Outline
o The Challenge of Condor on Personal Workstations
o The Pools: configuration and status
o Our Users

The Challenge of Condor on Personal Workstations

Under-abundance of Machines
o Windows workstations (but centrally administered)
o Linux desktops (but administered by their "owners")
o Commodity clusters (unavailable, many being decommissioned, no access to root)
o Servers for CVS, backup, external web access, Access Grid (production systems: mission-critical)
o Training machines (turned off when not in use; only 4 at present)
o HPCx (no comment!)

Security / Paranoia
o A 2-zone firewall separates machines
o No root access to server machines
o No root access to personal Linux workstations
o Personal firewalls: "Not on MY machine you're not"

Site Firewalls + Flocking
[Diagram: the Internal Pool and the External Pool, flocking across the site firewalls]

Site Firewall(s)
o 2 levels of firewall.
o Every request for a change to the site firewall needs justification and takes up to 2 working days.
o In theory, every submit node needs to be able to talk to some fixed (configurable) and ephemeral ports on every execute node, as well as on the central node.
o In addition, both UDP and TCP need to be opened.
o It would be good if we could have a more precise definition of exactly what is necessary.

Firewalls within a Condor Pool
o Some resource owners have firewalls on their personal workstations.
o Since Condor needs each submit node to be able to talk to every potential execute node, every firewall in the pool must be opened to each new submit node as it is added.
o Between the new node being added and the firewalls being updated, the firewalled nodes will be unavailable for use. Or are they? Maybe someone should tell Condor!

Adding a New Machine to the Pool
o If we add a new machine to the pool, the existing firewalls may not have anticipated this and will likely block the new machine.
o A job from the newly added machine may nevertheless be matched to a firewalled resource.
o This job will not be able to run.
o Parts of the system can jam as a result:
– condor_q on the submitting node
– subsequent parts of the submit script
– (maybe also parts of the central node)

Private Networks
o Similar "jams" occur if part of your pool (or flock of pools) is on a network that is unavailable to some of the other nodes.
o How can we permit jobs from submit nodes that can access the private network to run on these nodes, whilst preventing Condor from sending jobs from other submit nodes there?

Workaround Solution
o Mirror the firewall settings using ClassAds.
o The machine owner can update the firewall at will, as long as the settings are mirrored.
o New users can be added at any time without disruption.
For more details, see my talk in the Security WG; a minimal sketch follows.
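A minimal sketch of the mirroring idea, assuming one boolean machine attribute per submit host (the attribute names and hosts are illustrative, not the exact scheme used at CCLRC):

    # condor_config.local on a firewalled execute node:
    # one attribute per submit host whose traffic the local firewall admits
    FIREWALL_OPEN_TO_ALPHA = True
    FIREWALL_OPEN_TO_BETA  = False
    # publish the attributes in the machine ClassAd (Condor 6.x syntax)
    STARTD_EXPRS = FIREWALL_OPEN_TO_ALPHA, FIREWALL_OPEN_TO_BETA

    # submit file on host "alpha": only match machines whose firewall admits us;
    # =?= ensures machines that do not advertise the attribute fail to match
    requirements = (FIREWALL_OPEN_TO_ALPHA =?= True)

With something like this in place, a job submitted from "alpha" simply never matches a machine whose firewall would have silently dropped its traffic, so nothing jams.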
Other Problems
o Lack of root access: I had to go and grovel to each resource owner, not only for permission to install Condor, but also for them to log me in as root so I could do the installation.
o Many different Linuxes: Condor installs neatly with the RPM on Red Hat-family Linuxes. I had no trouble on the other distributions either, but the additional installation steps needed to update init.d were different in each case. I now use an updated version of the condor.root script issued with the release.

The Pools: Configuration and Status

Strategy
o A "community" approach: everyone has the right to run jobs from their own machine.
o 2 Condor pools:
– one for internal use only
– one for access by external collaborators and for testing

Internal Pool
o Comprises the central node, personal workstations and other "spare" machines.
o Inside the "thick" part of the site firewall, so no submission access from outside DL (although we expect to flock to/from other CCLRC sites).
o Build up trust by growing the pool gradually.

External Pool
o Comprises the remains of a "broken-down" cluster.
o Originally a dual-processor "head" node plus 8 workers on a private network; now the dual node plus 4 standalone nodes.
o Inside a "thin" firewall, so external access can be granted to collaborators (e.g. the ETF/OMII Distributed Build and Test project).
o Originally could be flocked to from the Internal Pool.

Configuration (1)
o Always run jobs (this may change at some point).
o The majority of machines are set up for both execute and submit (even, at present, the central node); only one node is set up as submit-only.
o Additional ClassAds (sketched after this slide):
– OS flavour and version
– mirrored firewall settings (see the Firewall "Avoidance" talk in WG2 tomorrow)
o Dual-boot nodes are configured for Condor in both of their manifestations.
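The OS-flavour ClassAd shows up in the condor_status output later; setting it looks something like this (SUSE90 is one of the flavours actually advertised in the pool, but treat the exact lines as a sketch):

    # condor_config.local on a SuSE Linux 9.0 node:
    # advertise the Linux flavour alongside the standard OpSys = "LINUX"
    OPSYS_FLAVOUR = "SUSE90"
    STARTD_EXPRS = OPSYS_FLAVOUR

    # a submit file can then pin jobs to one flavour:
    requirements = (OpSys == "LINUX") && (OPSYS_FLAVOUR == "SUSE90")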
Configuration (2)
o All machines are set up the same way (in /opt/condor).

condor.sh, for installation in /etc/profile.d:

    CONDOR_ROOT=/opt/condor
    export CONDOR_CONFIG=${CONDOR_ROOT}/etc/condor_config
    export PATH=${PATH}:${CONDOR_ROOT}/bin

condor.csh, for installation in /etc/profile.d:

    set condor_root = /opt/condor
    setenv CONDOR_CONFIG "${condor_root}/etc/condor_config"
    set path = ( ${path} ${condor_root}/bin )

o A common condor_config.local for inclusion.
o A common condor init.d script, with several enhancements over the packaged one.

Internal Pool Stats
o 11 resource "owners" at 2 sites
o 11 OS variants
o 1 submit-only node (head node of the e-HTPX cluster, Red Hat 9)
o 27 processors on 21 execution machines (including the central node):
• 6 Windows:
– 3x Windows XP Professional
– 2x Windows 2000 Professional
– 1x Windows NT 4.0 Workstation
• 21 Linux:
– 6x SuSE Linux 9.0
– 2x SuSE Linux 8.0
– 5x White Box Enterprise Linux 3.0
– 1x Red Hat Enterprise Linux 3.0
– 3x Red Hat Linux 9
– 2x Red Hat Linux 8.0
– 1x Mandrake Linux 10.0
– 1x Gentoo Linux 1.4

condor_status

    $ condor_status -f "%-6s" Arch -f "%-7s" OpSys \
        -f " %-12s" OPSYS_FLAVOUR \
        -f "\n" OpSys | sort | uniq -c
          1 INTEL LINUX   Gentoo
          1 INTEL LINUX   Mandrake10
          2 INTEL LINUX   RH80
          3 INTEL LINUX   RH9
          1 INTEL LINUX   RHEL2
          2 INTEL LINUX   SUSE80
          6 INTEL LINUX   SUSE90
          5 INTEL LINUX   WBL
          1 INTEL WINNT40
          2 INTEL WINNT50
          3 INTEL WINNT51

External Pool Stats
o 2 resource "owners"
o 2 OS variants
o Can flock to/from pools at 4 other sites
o In the process of adding GSI security
o 5 machines containing 6 Linux processors:
– 2x Red Hat Linux 7.3
– 4x White Box Enterprise Linux 3.0 (currently disabled, since they are inaccessible from outside due to firewall restrictions)

Our Users

e-HTPX
The e-HTPX project is developing a Grid-based e-science environment to allow structural biologists remote, integrated access to the web and grid technologies associated with protein crystallography.
http://clyde.dl.ac.uk/e-htpx/index.htm

e-HTPX Workflow
[Workflow diagram, Start to Finish, from Target Selection to Structure Solution:]
Stage 1 – Select protein target
Stage 2 – Crystallization of protein
Stage 3 – Data collection (X-ray diffraction images, scaling and integration)
Stage 4 – Structure solution (HPC data processing to derive a digital protein model)
Stage 5 – Submit model into public database
A single all-encompassing web interface from which users can initiate, plan, direct and document the experimental workflow, either locally or remotely from a desktop computer.

e-HTPX Structure Solution
o Given a target sequence for a protein, the Protein Data Bank (PDB) is searched for similar sequences.
o The corresponding structures are downloaded for use in a high-throughput system for determining the structure of the target protein.
o Depending on the protein structure size and matching criteria, up to several hundred structures can be downloaded. The modelling for these is carried out by submitting multiple jobs to the cluster and/or Condor pool, along the lines sketched below.
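The talk does not show the actual e-HTPX submit files; a plausible sketch, assuming a modelling executable and one PDB input file per downloaded structure (all names illustrative), is a single cluster with one process per structure:

    # sketch: one modelling job per downloaded PDB structure
    universe   = vanilla
    requirements = (OpSys == "LINUX")
    executable = model_structure
    arguments  = structure_$(Process).pdb
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files = structure_$(Process).pdb
    output     = model_$(Process).out
    error      = model_$(Process).err
    log        = model.log
    queue 200    # "several hundred"; the count is illustrative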
CCP1 / GAMESS-UK
CCP1: "The Electronic Structure of Molecules"
http://www.ccp1.ac.uk/
GAMESS-UK is a multi-method ab initio molecular electronic structure program.

CCP1 / GAMESS-UK
o GAMESS-UK is a quantum-mechanical molecular modelling program used by chemists, physicists and biologists to run molecular calculations.
o Given the nuclear coordinates of a molecule, GAMESS-UK calculates a wavefunction that describes its electronic properties.
o From the wavefunction, various molecular properties (e.g. shape, energetics and reactivity) can be calculated.
http://www.cfs.dl.ac.uk/

GAMESS-UK + Condor
The following are being investigated:
o Building GAMESS-UK and running its tests in a variety of environments (OS, compilers, libraries).
o Using the pool to build release packages of a cut-down evaluation version of the software.
o Using Condor as it is intended: submitting many jobs to ascertain molecular properties.

ETF "Build and Test" Testbed
o The external pool is part of the ETF "Build and Test" testbed.
o Software bundles are distributed to a variety of OS types around the flocked pool for building and testing.
o This type of (flocked) pool relies on heterogeneity; only small numbers of each machine type are required.
http://polaris.ecs.soton.ac.uk:65000/
http://wiki.nesc.ac.uk/read/sfct?HomePage

Other non-HTC Uses
o I want to ensure my code compiles without warnings and/or runs its basic tests:
– on as many OSs as possible
– with as many different compilers as possible
o I want to perform a release build of my product for platform X, but I only have accounts on A, B and C.
o I have several server-licensed products and many potential occasional users. How can these be made available to them more easily (within the bounds of the licence, of course!)?
A sketch of driving such a multi-platform build follows.
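One way such a multi-platform build can be driven from a single submit file, reusing the OPSYS_FLAVOUR attribute from earlier (the build script name is made up):

    # sketch: build and test once per Linux flavour
    universe   = vanilla
    executable = build_and_test.sh
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    log        = build.log

    requirements = (OpSys == "LINUX") && (OPSYS_FLAVOUR == "RH9")
    output = build.RH9.out
    error  = build.RH9.err
    queue

    requirements = (OpSys == "LINUX") && (OPSYS_FLAVOUR == "SUSE90")
    output = build.SUSE90.out
    error  = build.SUSE90.err
    queue

Settings may be changed between queue statements, so one file yields one job per platform, each matched only to machines advertising that flavour.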
In Conclusion

Summary
o 12 brave souls have offered up their personal workstations so that others can run arbitrary vanilla jobs.
o Installations have been made on 12 different operating systems.
o Both pools are now in use; provision of administrative support (web page, user guide, etc.) is under way.
o Distributed build is great!
o Firewalls are not (although I now understand firewalls a lot better)!

Final Thoughts
o Setting up a Condor pool of personal workstations requires considerable coaxing, convincing, coercion and cajoling.
o Flocking through firewalls should be easier; something needs doing, at least for flocking.
o Distributed build can be very useful, but Condor's default ClassAds could do with extending (at least to describe the OS more accurately).
o What use can be made of pools that are seriously heterogeneous?