Condor use in the Department of Computing, Imperial College
Stephen McGough, David McBride
London e-Science Centre

Computing Resources
• Dedicated 16-node Linux cluster ("thor")
• 250+ workstations in undergraduate labs
• 200+ workstations for research, PhD and support staff
  – Athlon 1.4GHz – 3.0GHz P4s, 512MB–1GB RAM
• Well-provisioned Extreme Networks infrastructure
  – 100Mbit full duplex to the desk, 1Gbit fibre backbone with two BlackDiamond core routers

Operating Environment
• Standardised, centrally managed Windows and Linux installations
  – Nearly every machine has a Linux install
  – Windows is installed only on a subset of desktops
  – Automated configuration, software installation and updates
• Shared automounted /home and /vol filesystems
  – Small number of central NFS fileservers
  – Numerous /vol areas provided for individual research groups
  – Includes /vol/condor to support Condor activity
• No firewalls deployed within the departmental netblock
  – Firewalls exist between the pool hosts and the outside world, but internal hosts have unrestricted access to one another

Original Motivation for Condor
• An experiment!
• Lots of capable workstations sit idle for substantial portions of the day
• Wanted to make better use of these resources
• Condor is an ideal framework
  – Simple to set up
  – Freely available
  – Low maintenance

Condor Configuration
• Operated in a 'cycle-stealing' mode (see the example configuration sketch at the end of these notes)
  – The only dedicated machine is an old Athlon workstation running the condor_negotiator and condor_collector daemons
• Primary concern is not to impinge upon users' main work
  – By all means use up any spare CPU cycles, but get out of the way when the user returns

Production Users
• Now have a number of high-throughput users (an example submit file appears at the end of these notes):
  – Bioinformatics: "Evaluating protein-protein interaction network evolution models"
  – Visual Information Processing: "Non-rigid registrations of 3D infant brain MR images"
  – London e-Science Centre: GENIE, the "Grid ENabled Integrated Earth system model"
  – Teaching: part of the Grid Computing course tutorial work

Recent Statistics
[Chart of pool activity over time; annotated events: overnight maintenance, nightly reboot, new desktops getting Condor switched on, start of term (main lab back online)]

Perceived Benefits
• Makes better use of otherwise idle resources
• Frees up compute time on the production cluster hardware
• Lowers the barrier to obtaining access to large quantities of CPU time

Issues
• User detection is currently not fully functional
  – Recent Linux kernel revisions don't behave as Condor expects
  – When a user logs in through X11 without opening a terminal, the session is not noticed by Condor
  – A fix is being developed
• Jobs sometimes consume disk resources to exhaustion
  – Low-tech solution: ask users not to generate large quantities of output
• Source code availability?
  – Condor is effectively already managed as an open-source project
  – Source would have been helpful when diagnosing faults (the documentation, however, is excellent)

Comparison with Sun Grid Engine
• SGE is used on LeSC's dedicated high-performance clusters
• Different fundamental design philosophy:
  – SGE uses a central, static configuration
  – Condor is designed to function well with a floating pool
• SGE has some features Condor lacks:
  – Greater control over queuing policy
  – SGE 6.0 provides advanced reservation capability
  – Source code readily available

Conclusions
• We consider the experiment to be very successful
• Condor has become essential to the work of others in the department and the College at large
• Very satisfied with the quality of the implementation and documentation
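
Example: cycle-stealing policy (sketch)
The deck does not reproduce the department's actual condor_config, so the daemon lists and policy expressions below are assumptions: a minimal sketch of the kind of "use spare cycles but get out of the way" policy described on the Condor Configuration slide, written with standard Condor policy expressions and attributes (DAEMON_LIST, START, SUSPEND, CONTINUE, PREEMPT, KILL, KeyboardIdle, LoadAvg). All thresholds are illustrative, not the real settings.

    ## Central manager (the dedicated Athlon workstation) -- local config
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

    ## Desktop execute nodes -- local config
    DAEMON_LIST = MASTER, STARTD

    # Start a job only once the console has been idle for a while and the
    # load not caused by Condor itself is low (thresholds illustrative)
    NonCondorLoad = (LoadAvg - CondorLoadAvg)
    START    = KeyboardIdle > 15 * $(MINUTE) && $(NonCondorLoad) < 0.3

    # Get out of the way as soon as the owner returns; resume after a quiet spell
    SUSPEND  = KeyboardIdle < $(MINUTE)
    CONTINUE = KeyboardIdle > 5 * $(MINUTE)

    # Evict jobs that stay suspended too long, i.e. the owner is back for good
    PREEMPT  = (Activity == "Suspended") && (CurrentTime - EnteredCurrentActivity > 10 * $(MINUTE))
    KILL     = $(PREEMPT)

With expressions along these lines, jobs start only on machines whose console has been idle, suspend the moment the owner touches the keyboard or mouse, and are evicted if the owner stays, which is the behaviour the Condor Configuration slide describes.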
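
Example: submit description file (sketch)
None of the production users' actual job scripts appear in the deck, so the executable name, input files and job count below are hypothetical; only the submit-file keywords (universe, executable, arguments, output, error, log, queue) and the $(Process) macro are standard Condor. A high-throughput sweep of the kind run by the production users or the Grid Computing tutorials might look like this:

    # Hypothetical parameter sweep: 100 independent vanilla-universe jobs
    universe    = vanilla
    executable  = analyse_network
    arguments   = model_$(Process).dat
    output      = results/out.$(Process)
    error       = results/err.$(Process)
    log         = sweep.log
    # One job per input file; $(Process) runs from 0 to 99
    queue 100

Submitting this with condor_submit queues 100 jobs, which the negotiator then matches to whatever idle desktops satisfy the START policy sketched above.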