The University of Bristol Grid, a production campus grid
David Wallom
April 2005
Outline
• Why?
• What was planned?
• Components
• Problems
• Outputs
The UoBGrid, why?
• Getting extra use from existing resources by:
– evening out load between heavily and lightly used systems,
– using resources not currently used for science (e.g. Admin department systems after hours).
• A large community of users with serial computation needs and an equally large community of parallel users currently share the same resources, which is not optimal.
• Enables experience to be gained in supporting a distributed system before joining the NGS.
The UoBGrid, what?
• Planned for ~1000 CPUs from 1.2 → 3.2GHz
arranged in 7 clusters & 4 Condor pools located in
6 different departments.
• Central services all run on individual servers in
Information Services main computer room.
– Resource Broker.
– Information & Systems Monitoring.
– Virtual Organisation & Credential Management.
– Storage Resource Broker Vault.
• Choice of software to be led by developments within other UK efforts (NGS).
The UoBGrid System Layout
Compute Cluster Installation
• The primary concern of the resource owners is their on-going operations.
• Software used must:
– be self-contained,
– be simple to install,
– not require interruption of key services,
– present simple system status information to support staff.
• Solution:
– VDT 1.2.2
• Non-WS Globus + Patches
• GSI-SSH
• EDG-Gridmapfile
• myProxy
– SRB S-commands
– Big Brother monitoring client
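• For illustration only, a VDT 1.2.2 install of this kind was typically driven by Pacman; the cache alias and package name below are assumptions, shown just to indicate the shape of the procedure rather than the actual UoBGrid commands:

    # Illustrative sketch: the cache alias "VDT" and package "Globus" are assumptions.
    pacman -get VDT:Globus    # fetch the non-WS Globus stack from the VDT Pacman cache
    source setup.sh           # put the freshly installed VDT tools into the current environment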
Resource Broker
• Uses the Condor-G job distribution
mechanism.
• Custom script for determination of resource status & priority:
– integrates Condor ClassAds with Globus MDS,
– lightweight, self-contained solution (~20 MB).
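• As a rough sketch of the underlying Condor-G mechanism (the gatekeeper hostname and jobmanager name below are assumptions, not the actual UoBGrid configuration), a job routed to a Globus gatekeeper is described to Condor-G like this:

    # Illustrative Condor-G submit description; host and jobmanager are assumptions.
    universe        = globus
    globusscheduler = gatekeeper.example.bris.ac.uk/jobmanager-pbs
    executable      = my_simulation
    output          = sim.out
    error           = sim.err
    log             = sim.log
    queue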
Resource Broker Operation
Information Services
• Central UoBGrid Globus GIIS.
• Each worker node is configured with a GRIS that publishes scheduler and node data in the GLUE schema, along with the job-manager reporters.
• May change the core server to a BDII.
• Systems allowed to register are controlled by the VOM.
– Small system hiccup when a new machine is added, as the GIIS needs restarting.
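• To sketch how such an information system is queried (the hostname and mds-vo-name below are assumptions for illustration), an MDS 2 GIIS answers standard LDAP searches:

    # Illustrative MDS query; the host and mds-vo-name are assumptions.
    ldapsearch -x -H ldap://giis.example.bris.ac.uk:2135 \
        -b "mds-vo-name=UoBGrid,o=grid" \
        "(objectClass=GlueCE)" GlueCEName GlueCEStateFreeCPUs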
UoBGrid, Monitoring
[Screenshots: 4-hourly grid systems monitoring and reporting; resource broker & job distribution status]
Virtual Organisation Management
• Two-tier system.
• For local-only users:
– Web-based system developed in-house.
– Runs completely on the server, with a push model out to the clients using GridFTP for distribution.
• For NGS-registered users:
– The EDG mkgridmap tool constructs the grid-mapfiles, based upon the use of pool accounts.
• Longer-term intention is to use only NGS-style pool accounts for simplified user management.
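• For illustration (the DNs and account names below are made up), a grid-mapfile combining a static local mapping with a pool-account mapping looks roughly like this, where the leading dot marks a pool account:

    # Illustrative grid-mapfile entries; DNs and accounts are hypothetical.
    "/C=UK/O=eScience/OU=Bristol/L=IS/CN=jane researcher" janer
    "/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=visiting user" .ngs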
Virtual Organisation Management
Resource Usage Service
• Custom changes to jobmanager scripts.
• Usage records for each job as follows:
– User
– Start-time
– End-time
– Executable
• Records usage whenever a job completes successfully.
• Publishes back to a web server on the VOM system.
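• A minimal sketch of what assembling and publishing such a record could look like (the field layout and endpoint URL are assumptions; the real implementation lives in the modified Globus jobmanager scripts):

    # Minimal sketch only: record layout and VOM endpoint are assumptions,
    # not the actual UoBGrid jobmanager modifications.
    import urllib.parse
    import urllib.request

    record = {
        "user": "janer",                         # hypothetical account name
        "start_time": "2005-04-01T09:15:00",
        "end_time": "2005-04-01T11:42:10",
        "executable": "/home/janer/bin/my_simulation",
    }

    # Publish the completed-job record back to the (hypothetical) VOM web server.
    data = urllib.parse.urlencode(record).encode()
    urllib.request.urlopen("http://vom.example.bris.ac.uk/usage", data=data)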
How to use the UoBGrid
• Uses an e-Science certificate for AA (authentication & authorisation).
• Simple command-line interface:
– What our users want,
– The most successful way of getting happy users has been
to change their usual interface/usage model as little as
possible.
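• By way of a sketch (the login host below is hypothetical, and the exact submission command is site-specific; condor_submit is shown only because the broker is Condor-G based), a typical session stays very close to the familiar command-line model:

    # Illustrative session; the login host name is an assumption.
    grid-proxy-init                          # create a proxy from the e-Science certificate
    gsissh grid-login.example.bris.ac.uk     # GSI-SSH onto a UoBGrid front end
    condor_submit my_simulation.sub          # hand the job description to the resource broker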
The Users
• Polymer & Nano:
– Run optical trapping simulations in readiness for real
experiments in the new building.
• BioChemistry:
– Protein & ligand docking simulation.
• Earth Sciences:
– River simulation.
• Comp Sci:
– Radiance.
• Myself…
– Charge distribution simulation code for system testing.
New Users/being ported
• Chemistry:
– Gaussian computational chemistry application.
• GENIE, Geographical Sciences:
– Whole Earth system modelling.
• Civil Engineering:
– EuroNEES/UKNEES.
Usage
• Current record:
– ~15000 individual jobs in a week, ~4500 in one day.
– Single submitted job containing 2000 individual sub-components.
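• As a sketch of how one submission can fan out into thousands of sub-components under Condor (illustrative only, not the user's actual submit file), the submit description simply varies on the process index:

    # Illustrative fan-out: 2000 sub-jobs from a single submit file.
    executable = my_simulation
    arguments  = input.$(Process)        # each sub-job reads its own numbered input
    output     = run.$(Process).out
    queue 2000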
Software Problems Encountered
• Some of the middleware that we have been trying to use has not been as reliable as we would have hoped.
– MDS is a prime example, where the necessity for reliability has defined our usage model.
– More software than originally intended has had to be designed/written in-house, because externally released software was nowhere near where it was advertised to be.
– Constant polling of the queue managers by the job-manager was solved by adding a sleep, reducing head-node load by ~70%!
• Some systems are running operating systems versions so old
that the middleware refused to install!
Things we worry about
• System upgrading:
– Now that it is a service, we cannot take it all down at once to upgrade because we have real academic users!
• Future directions and scalability of the certificate
mechanism.
• Future compatibility of tools such as Condor with Globus/SRB/anything else that is useful!
Results
• http://escience.bristol.ac.uk/Science_results.htm
• Rendering on Demand output: graphics!
Future Plans
• Expand UoBGrid to become SWGrid
– Incorporate resources from UWE, Exeter and other SW
institutions.
– Maintain central and cluster systems with up-to-date
middleware.
• Longer term is uncertain:
– The University has to believe the benefits are tangible in the long term; the need for lots of specialist people is bad!
– Competition from commercial solutions such as LSF and GridMP.
Further Information
• Centre for e-Research Bristol:
http://escience.bristol.ac.uk
• Email: david.wallom@bristol.ac.uk
• Telephone: +44 (0)117 928 8769
• UOBGrid: uobgrid-admin@bristol.ac.uk