Condor: The CCLRC Experience
UK Condor Week 2004
John Kewley
Grid Technology Group
e-Science Centre
Outline
o The Challenge of Condor on Personal Workstations
o The Pools: Configuration and Status
o Our Users
The Challenge of Condor on Personal Workstations
Under-abundance of Machines
o Windows workstations
(but centrally administered)
o Linux desktops
(but administered by “owners”)
o Commodity Clusters
(unavailable, many being decommissioned, no access to root)
o Servers for CVS, backup, external web access, Access Grid
(production systems – mission critical)
o Training machines
(turned off when not in use – only 4 at present)
o HPCx
(No comment!)
Security / Paranoia
o 2-zone firewall separates machines
o No root access to server machines
o No root access to personal Linux
Workstations
o Personal firewalls
“Not on MY machine you’re not”
Site Firewalls + Flocking
[Diagram: the Internal Pool and the External Pool on either side of the site firewalls, with flocking between them]
Site Firewall(s)
o 2 levels of firewall
o Every request for a change in the site firewall needs justification and takes up to 2 working days.
o In theory, every submit node needs to be able to talk to some fixed (configurable) and ephemeral ports on every execute node, as well as on the central node.
o In addition, both UDP and TCP need to be opened.
o It would be good to have a more precise definition of exactly what is necessary.
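One way to shrink the number of firewall holes is to confine Condor's otherwise-ephemeral ports to a fixed range. A minimal condor_config sketch, assuming a 9600–9700 range acceptable to the firewall administrators (the range itself is illustrative):
  # condor_config on all pool members: confine Condor's dynamic
  # ports to one fixed range, so the firewall only needs to open
  # this range (both TCP and UDP) between pool members, plus the
  # collector's default port 9618 on the central node.
  LOWPORT = 9600
  HIGHPORT = 9700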
Firewalls within a Condor Pool
o Some resource owners have firewalls on their
personal workstations
o Since Condor needs each submit node to be able to talk to every potential execute node, every firewall in the pool must be opened to each new submit node as it is added.
o Between the new node being added and the firewalls being updated, the firewalled nodes will be unavailable for use.
Or are they?
Maybe someone should tell Condor!
Adding a new machine to the pool
o If we add a new machine to the pool, the existing firewalls may not have anticipated it.
o Those firewalls will likely block the new machine.
o A job may still be matched from the newly added machine to the firewalled resource.
o This job will not be able to run.
o Parts of the system can jam as a result:
– condor_q on the submitting node
– subsequent parts of the submit script
– (maybe also parts of the central node)
Private networks
o Similar “jams” occur if part of your pool (or flock of
pools) is on a network that is unavailable to some of
the other nodes
o How can we permit jobs from submit nodes that can access the private network to run on these nodes, whilst preventing Condor from sending jobs from other submit nodes there? (One possible shape is sketched below.)
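A hedged sketch of one way to express this with a custom job attribute (the name CanReachPrivateNet is illustrative, not a Condor built-in): submit nodes with private-network access tag their jobs, and the execute nodes on the private network refuse anything untagged.
  # condor_config.local on execute nodes inside the private network:
  # only start jobs that declare they can reach this network.
  START = ($(START)) && (TARGET.CanReachPrivateNet =?= True)

  # submit description file on a submit node with private-network
  # access: tag every job with the custom attribute.
  +CanReachPrivateNet = True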
Workaround Solution
o Mirror the firewall settings using ClassAds
o The firewall can then be updated at the whim of the machine owner, as long as the settings stay mirrored in the ClassAds.
o New users can be added at any time without disruption.
For more details, see my talk in the
Security WG
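Until then, a minimal sketch of the mirroring idea (attribute names are illustrative; the real scheme is in the WG talk): each firewalled workstation advertises, per submit host, whether its firewall admits that host, and each submit host requires its own attribute, so blocked machines simply never match.
  # condor_config.local on a firewalled workstation: mirror the
  # current firewall rules as advertised ClassAd attributes.
  FIREWALL_OPEN_TO_WS1 = True
  FIREWALL_OPEN_TO_WS2 = False
  STARTD_EXPRS = $(STARTD_EXPRS), FIREWALL_OPEN_TO_WS1, FIREWALL_OPEN_TO_WS2

  # submit description file on ws2: only match machines whose
  # firewall admits ws2 (machines not advertising the attribute
  # fail the =?= test and are skipped).
  Requirements = (FIREWALL_OPEN_TO_WS2 =?= True)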
Other problems
o Lack of root access – I had to go and grovel to each resource owner, not only for permission to install Condor, but also for them to log me in as root so I could do the installation.
o Many different Linuxes – Condor installs neatly from the RPM on Red Hat family Linuxes. I had no trouble on the others, but the additional installation steps needed to update init.d differed in each case (sketched below). I now use an updated version of the condor.root script issued with the release.
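The per-distribution init.d registration steps differ roughly as follows; a hedged sketch, with the script path illustrative:
  # Red Hat family (also White Box, Mandrake): SysV registration
  cp /opt/condor/etc/init.d/condor /etc/init.d/condor   # path illustrative
  chkconfig --add condor

  # SuSE
  insserv condor

  # Gentoo
  rc-update add condor default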
The Pools: Configuration
and Status
Strategy
o “Community” approach: everyone has the right
to run jobs from their machine.
o 2 Condor Pools
– One for internal use only
– One for access by external collaborators
and testing
Internal Pool
o Comprises the central node, personal workstations and other “spare” machines.
o Inside the “thick” part of the site firewall, so no submission access from outside DL (although we expect to flock to/from other CCLRC sites)
o Build up trust by gradually growing the pool
External Pool
o Comprises the remains of a “broken-down” cluster
o Originally a dual-processor “head” node plus 8 workers on a private network; now the dual node plus 4 standalone nodes.
o Inside a “thin” firewall, so external access can be granted to collaborators (e.g. the ETF/OMII Distributed Build and Test project)
o Originally could be flocked to from the Internal Pool
Configuration (1)
o Always run jobs (this may change at some point)
o The majority of machines are set up for both execute and submit (even the central node at present). Only one node is set up as submit-only.
o Additional ClassAds:
– OS flavour and version (a sketch follows this list)
– mirrored firewall settings (see the Firewall “Avoidance” talk in WG2 tomorrow)
o Dual-boot nodes are configured for Condor in both of their manifestations
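A minimal sketch of the OS-flavour ClassAd, matching the OPSYS_FLAVOUR attribute shown by condor_status later (the value is set per machine):
  # condor_config.local on a SuSE Linux 9.0 workstation: advertise
  # the distribution, since the built-in OpSys only says "LINUX".
  OPSYS_FLAVOUR = "SUSE90"
  STARTD_EXPRS = $(STARTD_EXPRS), OPSYS_FLAVOUR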
Configuration (2)
o All machines set up the same way (in /opt/condor)
condor.sh for installation in /etc/profile.d :
  # Bourne-shell login environment for Condor
  CONDOR_ROOT=/opt/condor
  export CONDOR_CONFIG=${CONDOR_ROOT}/etc/condor_config
  export PATH=${PATH}:${CONDOR_ROOT}/bin
condor.csh for installation in /etc/profile.d :
  # C-shell login environment for Condor
  set condor_root = /opt/condor
  setenv CONDOR_CONFIG "${condor_root}/etc/condor_config"
  set path = ( ${path} ${condor_root}/bin )
o Common condor_config.local for inclusion
o Common condor init.d script with several enhancements over the packaged one
Internal Pool Stats
o 11 resource “owners” at 2 sites
o 11 OS variants
o 1 submit-only node (head node of the e-HTPX cluster – Red Hat 9)
o 27 processors on 21 execution machines (including the central node)
  • 6 Windows
    – 3x Windows XP Professional
    – 2x Windows 2000 Professional
    – 1x Windows NT 4.0 Workstation
  • 21 Linux
    – 6x SuSE Linux 9.0
    – 2x SuSE Linux 8.0
    – 5x White Box Enterprise Linux 3.0
    – 1x Red Hat Enterprise Linux 3.0
    – 3x Red Hat Linux 9
    – 2x Red Hat Linux 8.0
    – 1x Mandrake Linux 10.0
    – 1x Gentoo Linux 1.4
condor_status
$ condor_status -f "%-6s" Arch -f "%-7s" OpSys \
                -f " %-12s" OPSYS_FLAVOUR \
                -f "\n" OpSys | sort | uniq -c
      1 INTEL  LINUX
      1 INTEL  LINUX   Gentoo
      1 INTEL  LINUX   Mandrake10
      2 INTEL  LINUX   RH80
      3 INTEL  LINUX   RH9
      1 INTEL  LINUX   RHEL2
      2 INTEL  LINUX   SUSE80
      6 INTEL  LINUX   SUSE90
      5 INTEL  LINUX   WBL
      1 INTEL  WINNT40
      3 INTEL  WINNT50
      2 INTEL  WINNT51
External Pool Stats
o 2 resource “owners”
o 2 OS variants
o Can flock to/from pools at 4 other sites
o In the process of adding GSI security
o 5 machines containing 6 Linux processors:
  – 2x Red Hat Linux 7.3
  – 4x White Box Enterprise Linux 3.0 (currently disabled since inaccessible from outside due to firewall restrictions)
Our Users
e-HTPX
The e-HTPX project is developing a Grid-based
e-science environment to allow structural
biologists remote, integrated access to web
and grid technologies associated with protein
crystallography.
http://clyde.dl.ac.uk/e-htpx/index.htm
e-HTPX Workflow
[Workflow diagram, Start → Finish:]
Stage 1 – Select protein target (target selection)
Stage 2 – Crystallization of protein
Stage 3 – Data collection (X-ray diffraction images, scaling and integration)
Stage 4 – Structure solution (HPC data processing to derive a digital protein model)
Stage 5 – Submit model into public database
A single all-encompassing web interface from which users can initiate, plan, direct and document the experimental workflow, either locally or remotely from a desktop computer.
e-HTPX Structure Solution
o Given a target sequence for a protein, the Protein Data Bank (PDB) is searched for similar sequences.
o The corresponding structures are downloaded for use in a high-throughput system for determining the structure of the target protein.
o Depending on the protein structure size and matching criteria, up to several hundred structures can be downloaded. The modelling for these is carried out by submitting multiple jobs to the cluster and/or Condor pool (a sketch of such a submit file follows).
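A hedged sketch of the kind of submit description file this implies, with one vanilla-universe job per downloaded structure; the executable name and the count of 200 are illustrative, not taken from e-HTPX:
  # one modelling job per candidate structure fetched from the PDB
  universe    = vanilla
  executable  = model_structure            # illustrative name
  arguments   = structure_$(Process).pdb
  transfer_input_files = structure_$(Process).pdb
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  output      = model_$(Process).out
  error       = model_$(Process).err
  log         = modelling.log
  queue 200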
e-HTPX Structure Solution
[Figure: e-HTPX structure solution]
CCP1 / GAMESS-UK
CCP1: “The Electronic Structure of Molecules”
http://www.ccp1.ac.uk/
GAMESS-UK is a multi-method ab initio molecular
electronic structure program.
CCP1 / GAMESS-UK
o GAMESS-UK is a Quantum-Mechanical molecular
modelling program used by chemists, physicists and
biologists to run molecular calculations.
o Given the nuclear coordinates of a molecule,
GAMESS-UK calculates a wavefunction that
describes its electronic properties.
o From the wavefunction, various molecular properties
(e.g. shape, energetics and reactivity) can be
calculated.
http://www.cfs.dl.ac.uk/
GAMESS-UK + Condor
The following are being investigated:
o Building GAMESS-UK and running its tests on a variety of environments (OS, compilers, libraries) – see the sketch after this list
o Using the pool to build release packages of a cut-down evaluation version of the software.
o Using Condor as it is intended: submitting many jobs to ascertain molecular properties.
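A hedged sketch of how the extra OS ClassAds make the build-matrix idea expressible: one build-and-test job aimed at a particular Linux flavour (the wrapper script name is illustrative):
  # build and test GAMESS-UK on one specific distribution
  universe     = vanilla
  executable   = build_and_test.sh          # illustrative wrapper
  output       = build_suse90.out
  error        = build_suse90.err
  log          = build.log
  requirements = (OpSys == "LINUX") && (OPSYS_FLAVOUR =?= "SUSE90")
  queue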
ETF “Build and Test” Testbed
o The external pool is part of the ETF “Build and Test”
testbed.
o Software bundles are distributed to a variety of OS
types around the flocked pool for building and testing.
o This type of (flocked) pool relies on heterogeneity: only small numbers of each machine type are required.
http://polaris.ecs.soton.ac.uk:65000/
http://wiki.nesc.ac.uk/read/sfct?HomePage
Other non-HTC Uses
o I want to ensure my code compiles without warnings
and/or runs its basic tests on
– As many OSs as possible
– With as many different compilers as possible
o I want to perform a release build of my product for
platform X, but I only have accounts on A, B and C
o I have several server-licensed products and many potential occasional users. How can these be made available to them more easily (within the bounds of the licences, of course)?
In Conclusion
Summary
o 12 brave souls have offered up their personal
workstations so others can run arbitrary vanilla jobs.
o Installations have been made on 12 different
operating systems
o Both pools are now in use. Provision of administrative
support is underway – web page, user guide, etc
o Distributed build is great!
o Firewalls are not (although I now understand firewalls
a lot better)!
Final Thoughts
o Setting up a Condor pool of personal workstations requires considerable coaxing, convincing, coercion and cajoling.
o Flocking through firewalls should be easier: something needs doing, at least for the flocking case.
o Distributed build can be very useful, but Condor’s default ClassAds could do with extending (at least to describe the OS more accurately).
o What use can be made of pools which are seriously heterogeneous?