Dr. David Wallom
Use of Condor in our Campus
Grid and the University
September 2004
Outline
• The University of Bristol Grid (UoBGrid).
• The UoBGrid Resource Broker.
• Users & their environment.
• Problems encountered.
• Other Condor use within Bristol.
• Summary.
The UoBGrid
• Planned for 1000+ CPUs, ranging from 1.2 to 3.2 GHz, arranged in
7 clusters & 3+ Condor pools located in 4 different departments.
• Core services run on individual servers, e.g.
Resource Broker & MDS.
The UoBGrid System Layout
The UoBGrid, now
• Currently 270 CPUs in 3 clusters and 1 Windows
Condor pool.
• Central services run on 2 beige boxes in my office.
• Windows Condor pool currently limited to a single student
open-access area.
• Currently only two departments (Physics, Chemistry) are
fully engaged, though more are on their way.
• The remaining large clusters are still on legacy operating
system versions; a University-wide upgrade programme has started.
Middleware
• Virtual Data Toolkit.
– Chosen for stability.
– Platform-independent installation method.
– Widely used in other European production grid systems.
• Contains the standard Globus Toolkit version 2.4
with several enhancements.
• Also includes:
– GSI-enhanced OpenSSH.
– myProxy client & server.
• Has a defined support structure.
Condor-G Resource Broker
• Uses the Condor-G matchmaking mechanism with
Grid resources.
• Set to run as soon as a job appears.
• Custom script determines resource status & priority
(publishing sketch below).
• Integrates the Condor resource-description mechanism with the
Globus Monitoring and Discovery Service (MDS).
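
As a sketch, the status script could publish each resource's ClassAd to the
Condor-G matchmaker with condor_advertise; the file name and attribute values
here are illustrative, not the production script:

#!/bin/sh
# build a resource ad (values illustrative) and push it to the collector
ADFILE=/tmp/grendel-pbs.ad
cat > $ADFILE <<EOF
MyType = "Machine"
TargetType = "Job"
Name = "grendel.chm.bris.ac.uk-pbs"
gatekeeper_url = "grendel.chm.bris.ac.uk/jobmanager-pbs"
Requirements = (CurMatches < 5) && (TARGET.JobUniverse == 9)
WantAdRevaluate = True
UpdateSequenceNumber = `date +%s`
CurMatches = 0
EOF
# UPDATE_STARTD_AD inserts or refreshes the ad in the collector
condor_advertise UPDATE_STARTD_AD $ADFILE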
Resource Broker Operation
Information Passed into Resource ClassAd
MyType = "Machine"
TargetType = "Job"
Name and gatekeeper URLs dependant on resource system name and installed scheduler as
systems may easily have more than one jobmanager installed.
Name = "grendel.chm.bris.ac.uk-pbs“
gatekeeper_url = "grendel.chm.bris.ac.uk/jobmanager-pbs"
Make sure Globus universe, check number of nodes in cluster and set max number of matched
jobs to a particular resource.
Requirements = (CurMatches < 5) && (TARGET.JobUniverse == 9)
WantAdRevaluate = True
Time classad constructed.
UpdateSequenceNumber = 1097580300
Currently hard coded in the ClassAd
CurMatches
=0
System information retreived from Globus MDS information for head node only not worker
OpSys = "LINUX“
Arch = "INTEL"
Memory = 501
Installed software defined in resource broker file for each resource
INTEL_COMPILER=True
GCC3=True
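
For reference, a minimal Condor-G submit description that would match the ad
above might look like the following; the executable name is hypothetical, and
$$(gatekeeper_url) is the standard Condor-G idiom for letting the matchmaker
fill in the matched gatekeeper:

universe = globus
# filled in from the matched resource ad at match time
globusscheduler = $$(gatekeeper_url)
# only match resources advertising the Intel compiler
requirements = (INTEL_COMPILER == True)
executable = my_sim
output = my_sim.out
error = my_sim.err
log = my_sim.log
queue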
Possible extensions to Resource Information
• Resource state information (LoadAvg, Claimed, etc.):
– How is this defined for a cluster? Perhaps Condor-G could
introduce new states expressing how full a resource is?
• Number of CPUs and free disk space:
– How do you define these for a cluster? Is the number of CPUs
given per worker or for the whole system? The same applies to
disk space.
• Cluster performance (MIPS, KFlops):
– This is not commonly worked out for small clusters, so it would
need to be hard-wired in, but it could be very useful for ranking
resources (see the sketch below).
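
As a sketch, if each ad carried a hard-wired KFlops figure, a submit
description could rank matched resources by it; the KFlops attribute is an
assumption here, since the broker does not currently publish it:

universe = globus
globusscheduler = $$(gatekeeper_url)
# only consider resources that advertise a performance figure (assumed attribute)
requirements = (TARGET.KFlops =!= UNDEFINED)
# prefer the fastest advertised resource
rank = KFlops
executable = my_sim
queue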
Results of condor_status
Load Management
• Only reports the raw numbers of jobs running, idle & held
(for whatever reason).
• Has little measure of the relative performance of nodes within
the grid; this is currently based on:
– Head node processor type & memory.
– The MDS nodeCount value for the jobmanager (not always the same
as the real number of worker nodes; see the query sketch below).
• Currently submits only to a single queue on each resource.
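
For illustration, the nodeCount figure comes from querying the MDS GRIS on
each head node; a rough sketch with the standard grid-info-search client
follows (the host name, default base DN and port are assumptions about a
stock MDS 2.x install):

# query the GRIS on the resource's head node (MDS default port 2135)
grid-info-search -x -h grendel.chm.bris.ac.uk -p 2135 \
    -b "mds-vo-name=local, o=grid" | grep -i nodecount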
What is currently running and how do I find out?
• Simple interface to condor_q (example invocations below).
• Planning to use the Condor Job Monitor, once installed, due to
scalability issues.
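
For example, the underlying queries are plain condor_q invocations (the job
ID shown is hypothetical):

# list all jobs in the local Condor-G queue
condor_q
# show the Globus-specific status of each job
condor_q -globus
# dump the full ClassAd of one job, e.g. to see where it matched
condor_q -l 1234.0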
Display of jobs currently running
Issues with Condor-G
• The following is a list of small issues we have:
– How do you define some resource attributes for clusters?
– When using condor_q -globus, the actual hostname the job
matched to is not displayed.
– No job exit codes…
• Job exit codes will become more important as the number of
users (and problems) increases.
• Once a job has been allocated to a remote cluster, rescheduling
it elsewhere is difficult.
The Users
• BaBar:
– One resource is the Bristol BaBar farm, so Monte Carlo event
production runs in parallel with UoBGrid usage.
• GENIE:
– Installing software onto each connected system by agreement
with its owners.
• LHCb:
– Windows compiled Pythia event generation.
• Earth Sciences:
– River simulation.
• Myself…
– Undergraduate-written charge-distribution simulation code.
Usage
• Current record:
– ~10000 individual jobs in a week,
– ~2500 in one day.
Windows Condor through Globus
• Install a Linux machine as a Condor master only.
• Configure this machine to flock to the Windows Condor pool.
• Install a Globus gatekeeper.
• Edit the jobmanager.pm file so that the architecture for
submitted jobs is always WINNT51 (matches all the workers in the
pool); see the configuration sketch below.
• Appears in the Condor-G resource list as a WINNT51 resource.
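
A sketch of the key configuration on that Linux machine follows; the Windows
pool's central manager name is made up, FLOCK_TO is the standard Condor knob,
and the jobmanager edit is paraphrased rather than the exact patch:

# condor_config.local on the Linux Condor master:
# flock jobs to the central manager of the Windows pool
FLOCK_TO = winpool.bris.ac.uk

# In the Globus jobmanager.pm, force every generated submit
# description to request the Windows XP workers, e.g. by adding:
#   requirements = (OpSys == "WINNT51")

In Condor terms WINNT51 is the OpSys value for Windows XP workers; Arch
stays "INTEL".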
Windows Condor pools available through a Globus interface from a
flocked Linux pool
• There are currently three separate Windows Condor pools within
three departments, totalling approximately 200 CPUs.
• Planning to allow all student teaching resources in as many
departments as possible to have the software installed.
• This will allow a significant increase in university processing
power at little extra cost.
• When a department gives the OK, it will be added to the flocking
list on the single Linux submission machine.
• A difficulty encountered with this setup is the lack of a
Microsoft Installer (MSI) file.
– This affects the ability to use the group-policy method of
software delivery and installation, which directly affects how
some computer officers view installing it.
Evaluation of Condor against United Devices GridMP
• The Computational Chemistry group has significant links with an
industrial partner who is currently using U.D. GridMP.
• It was suggested that the CC group also use GridMP, though after
initial contact this looked to be very costly.
• e-Science group suggested that Condor would be
a better system for them to use.
• Agreement from UD to do a published function &
usage comparison between Condor & GridMP.
• Due to start this autumn.