David Wallom
The University of Bristol Grid, a production campus grid
April 2005

Outline
• Why?
• What was planned?
• Components
• Problems
• Outputs

The UoBGrid, why?
• Leveraging extra use from existing resources by:
  – evening out load between heavily and lightly loaded systems,
  – using resources not currently used for science (e.g. Admin department systems after hours).
• Large communities of users with serial computation needs, and an equally large community of parallel users, currently share the same resources, which is not optimal.
• Enables experience to be gained in supporting a distributed system before joining the NGS.

The UoBGrid, what?
• Planned for ~1000 CPUs of 1.2-3.2 GHz, arranged in 7 clusters & 4 Condor pools located in 6 different departments.
• Central services all run on individual servers in the Information Services main computer room:
  – Resource Broker,
  – Information & Systems Monitoring,
  – Virtual Organisation & Credential Management,
  – Storage Resource Broker Vault.
• Choice of software to be led by developments within other UK efforts (NGS).

The UoBGrid System Layout

Compute Cluster Installation
• Primary concern of compute owners is ongoing operation.
• Software used must be:
  – self-contained,
  – simple to install,
  – installable without interrupting key services,
  – able to present simple system status information to support staff.
• Solution:
  – VDT 1.2.2:
    • non-WS Globus + patches,
    • GSI-SSH,
    • EDG-Gridmapfile,
    • MyProxy.
  – SRB S-commands.
  – Big Brother monitoring client.

Resource Broker
• Uses the Condor-G job distribution mechanism.
• Custom script for determining resource status & priority:
  – integrates Condor ClassAds and Globus MDS,
  – lightweight, self-contained solution (~20 MB).

Resource Broker Operation

Information Services
• Central UoBGrid Globus GIIS.
• Each worker node configured with a GRIS to publish scheduler and node data using the GLUE schema, along with job-manager reporters.
• May change the core server to BDII.
• Systems allowed to register are controlled by the VOM.
  – Small system hiccup when a new machine is added, as the GIIS needs restarting.

UoBGrid, Monitoring
• 4-hourly grid systems monitoring and reporting.
• Resource Broker & job distribution status.

Virtual Organisation Management
• Two-tier system.
• For local-only users:
  – web-based system developed in-house,
  – runs completely on the server, with a push model out to clients using GridFTP for distribution.
• For NGS-registered users:
  – the EDG mkgridmap tool constructs grid-mapfiles based upon the use of pool accounts.
• Longer-term intention is to use only NGS-style pool accounts for simplified user management.

Virtual Organisation Management

Resource Usage Service
• Custom changes to the jobmanager scripts.
• Usage record for each job contains:
  – user,
  – start time,
  – end time,
  – executable.
• Usage is recorded whenever a job completes successfully.
• Records are published back to the webserver on the VOM system (a sketch of such a record follows the user list below).

How to use UoBGrid
• Uses an e-Science certificate for AA.
• Simple command-line interface:
  – what our users want,
  – the most successful way of getting happy users has been to change their usual interface/usage model as little as possible.

The Users
• Polymer & Nano:
  – optical trapping simulations in readiness for real experiments in the new building.
• Biochemistry:
  – protein & ligand docking simulation.
• Earth Sciences:
  – river simulation.
• Comp Sci:
  – Radiance.
• Myself…
  – charge distribution simulation code for system testing.
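
Resource Usage Service: illustrative sketch
The record fields (user, start time, end time, executable) come from the Resource Usage Service slide above; the record layout, log path and upload URL are assumptions for illustration only, not the actual UoBGrid changes, which were made in the Globus jobmanager scripts. A minimal Python sketch of writing one record and publishing it back to an (assumed) VOM webserver:

```python
# Hypothetical sketch only: field names come from the Resource Usage Service
# slide; the record layout, log path and URL are illustrative assumptions and
# not the actual UoBGrid jobmanager implementation.
import time
import urllib.request


def write_usage_record(user, start_time, end_time, executable,
                       log_path="/var/log/uobgrid-usage.log"):
    """Append one usage record for a job that completed successfully."""
    record = "|".join([
        user,
        time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(start_time)),
        time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(end_time)),
        executable,
    ])
    with open(log_path, "a") as log:
        log.write(record + "\n")
    return record


def publish_record(record, url="http://vom.example.org/usage"):
    """Push the record back to the (assumed) webserver on the VOM system."""
    request = urllib.request.Request(
        url,
        data=record.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```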
New Users / Being Ported
• Chemistry:
  – Gaussian computational chemistry application.
• GENIE, Geographical Sciences:
  – whole Earth system modelling.
• Civil Engineering:
  – EuroNEES/UKNEES.

Usage
• Current record:
  – ~15000 individual jobs in a week, ~4500 in one day,
  – single submitted job containing 2000 individual sub-components.

Software Problems Encountered
• Some of the middleware that we have been trying to use has not been as reliable as we would have hoped:
  – MDS is a prime example, where the necessity for reliability has defined our usage model,
  – more software than originally intended has had to be designed/written in-house because externally released software was nowhere near where it was advertised to be,
  – constant polling of queue managers by the jobmanager was solved by adding a sleep, reducing head-node load by ~70% (see the sketch at the end of this deck)!
• Some systems are running operating system versions so old that the middleware refused to install!

Things we worry about
• System upgrading:
  – now that it is a service, we cannot take it all down at once to upgrade it, because we have real academic users!
• Future directions and scalability of the certificate mechanism.
• Future compatibility of tools such as Condor with Globus/SRB/anything else that is useful!

Results
• http://escience.bristol.ac.uk/Science_results.htm
• Rendering on Demand output: graphics!

Future Plans
• Expand UoBGrid to become SWGrid:
  – incorporate resources from UWE, Exeter and other SW institutions,
  – maintain central and cluster systems with up-to-date middleware.
• Longer term is uncertain:
  – the University has to believe the benefits are tangible in the long term; needing lots of specialist people is bad!
  – competition from commercial solutions such as LSF and GridMP.

Further Information
• Centre for e-Research Bristol: http://escience.bristol.ac.uk
• Email: david.wallom@bristol.ac.uk
• Telephone: +44 (0)117 928 8769
• UoBGrid: uobgrid-admin@bristol.ac.uk
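
Backup: jobmanager polling sketch
The "Software Problems Encountered" slide notes that constant polling of the queue managers by the jobmanager was fixed by adding a sleep. A minimal Python sketch of that pattern follows; the real change was made in the Globus jobmanager scripts, and the qstat call, job-id handling and 30-second interval are assumptions for illustration:

```python
# Illustrative only: the pattern behind the jobmanager fix described on the
# "Software Problems Encountered" slide. The real change added a sleep in the
# Globus jobmanager scripts between polls of the batch queue manager; the
# command, job-id handling and interval here are assumptions.
import subprocess
import time

POLL_INTERVAL = 30  # seconds between polls; the real interval is an assumption


def job_still_queued(job_id):
    """Ask a (hypothetical PBS-style) queue manager whether the job is still known."""
    result = subprocess.run(["qstat", str(job_id)], capture_output=True, text=True)
    return result.returncode == 0  # qstat usually fails once the job has gone


def wait_for_job(job_id):
    """Poll the queue manager, sleeping between checks instead of busy-polling."""
    while job_still_queued(job_id):
        time.sleep(POLL_INTERVAL)  # the added sleep that cut head-node load
```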