Condor use in Department of Computing, Imperial College
Stephen McGough, David McBride
London e-Science Centre
Computing Resources
• Dedicated 16-node Linux cluster (“thor”)
• 250+ workstations in undergraduate labs
• 200+ workstations for research, PhD students and support staff
– 1.4 GHz Athlons to 3.0 GHz P4s, 512 MB to 1 GB RAM
• Well-provisioned Extreme Networks infrastructure
– 100 Mbit/s full duplex to the desk, 1 Gbit/s fibre backbone with two Black Diamond core routers
Operating Environment
• Standardized Windows and Linux managed installations
– Nearly every machine has a Linux install
– Windows only installed on a subset of desktops
– Automated configuration, software installation and updates
• Shared automounted /home and /vol filesystems
– Small number of central NFS fileservers
– Numerous /vol areas provided for individual research groups
– Includes /vol/condor to support Condor activity (layout sketch below)
• No firewalls deployed within departmental netblock
– Firewalls exist between the pool hosts and the outside world,
but hosts have unrestricted access to one another internally.
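As an illustration of how the shared area can support the pool (the exact layout is an assumption, not taken from the slides), /vol/condor might hold the Condor release and per-host configuration that each machine picks up through its condor_config:

    # Hypothetical layout: shared release plus per-host local config on /vol/condor
    RELEASE_DIR       = /vol/condor
    LOCAL_CONFIG_FILE = /vol/condor/config/$(HOSTNAME).local
    # Central manager host name is illustrative only
    CONDOR_HOST       = condor-manager.doc.ic.ac.uk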
Original Motivation for Condor
• An experiment!
• Lots of capable workstations idle for
substantial portions of the day
• Wanted to be able to make better use of
resources
• Condor an ideal framework
– Simple to set up
– Freely available
– Low maintenance
Condor Configuration
• Operated in a ‘cycle-stealing’ mode.
– Only dedicated machine is an old Athlon
workstation running condor_negotiator and
condor_collector daemons
• Primary concern is to not impinge upon
users’ main work
– By all means use up any spare CPU cycles,
but get out of the way when the user returns (see the policy sketch below).
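The "get out of the way" behaviour is governed by the startd policy expressions in the Condor configuration. A minimal sketch of a cycle-stealing policy of this kind (the thresholds are illustrative assumptions, not the department's actual settings):

    # Treat the machine as busy if its owner is active or non-Condor load is high
    NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
    MachineBusy      = ($(NonCondorLoadAvg) > 0.3) || (KeyboardIdle < 5 * $(MINUTE))

    # Only start jobs on machines idle for a while; suspend, then evict, if the owner returns
    START    = KeyboardIdle > 15 * $(MINUTE) && $(NonCondorLoadAvg) <= 0.3
    SUSPEND  = $(MachineBusy)
    CONTINUE = KeyboardIdle > 5 * $(MINUTE) && $(NonCondorLoadAvg) < 0.3
    PREEMPT  = (Activity == "Suspended") && (CurrentTime - EnteredCurrentActivity > 10 * $(MINUTE))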
Production users
• Now have a number of high-throughput users (submit-file sketch below):
– Bioinformatics
• “Evaluating protein-protein interaction network evolution models”
– Visual Information Processing
• “Non-rigid registrations of 3D infant brain MR images”
– London e-Science Centre
• GENIE: “Grid ENabled Integrated Earth system model”
– Teaching
• Part of Grid Computing course tutorial work
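For context, work of this kind typically reaches the pool as a submit description file of roughly the following shape (the executable and file names are hypothetical, not taken from any of the projects above):

    # Hypothetical parameter-sweep submission
    universe   = vanilla
    executable = run_analysis
    arguments  = $(Process)
    output     = results/run_$(Process).out
    error      = results/run_$(Process).err
    log        = sweep.log
    queue 500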
Recent statistics
[Graph of pool activity over time; annotated events: overnight maintenance, nightly reboot, new desktops getting Condor switched on, and the start of term (main lab back online)]
Perceived Benefits
• Makes better use of otherwise unused resources
• Frees up compute time on production
cluster hardware
• Reduces the barrier to entry to obtaining
access to large quantities of CPU time
Issues
• User detection currently not fully functional…!
– Recent Linux kernel revisions don’t behave as Condor expects
– When a user logs in through X11 without opening a terminal,
the session doesn’t get noticed by Condor
– Fix being developed (a common workaround is sketched after this list).
• Sometimes consuming disk resources to exhaustion
– Low-tech solution: ask users not to generate large quantities of
output.
• Source code availability?
– Condor effectively already managed as an open source project
– Source would have been helpful when diagnosing faults
(Documentation, however, is excellent.)
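On the user-detection point, a common workaround (an assumption here, not necessarily the fix the department adopted) is to tell the startd which device files count as console activity, or to run the condor_kbdd daemon so that X11 keyboard and mouse activity is reported to the startd:

    # Count activity on these devices as the user being present
    CONSOLE_DEVICES = mouse, console
    # Or run the keyboard daemon alongside the other Condor daemons
    DAEMON_LIST = MASTER, STARTD, KBDD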
Comparison with Sun Grid Engine
• SGE used on LeSC dedicated high-performance clusters
• Different fundamental design philosophy:
– SGE uses a central, static configuration
– Condor designed to function well with a floating pool
• SGE has some features Condor lacks:
– Greater control over queuing policy
– SGE 6.0 provides advanced reservation capability
– Source code readily available
Conclusions
• Consider the experiment to be very
successful
• Has become essential to the work of
others in the department and College at
large
• Very satisfied with the quality of the
implementation and documentation