EPCC Sun Data and Compute Grids Project
Using Sun Grid Engine and Globus to Schedule Jobs Across a Combination of Local and Remote Machines

Terry Sloan
Edinburgh Parallel Computing Centre (EPCC)
Telephone: +44 131 650 5155
Email: t.sloan@epcc.ed.ac.uk
http://www.epcc.ed.ac.uk/sungrid

Overview
- The Project
- Why do it?
- Project Scenario
- Project Goal
- How?
- Project Achievements
- The Compute Scheduler
- The Compute & Data Scheduler

The Project
- Develop a Globus-enabled compute and data scheduler
- Based on Grid Engine, Globus and a variety of data technologies

The Project (cont)
- Partners
  – Sun Microsystems
  – National e-Science Centre, represented by EPCC
- Timescales
  – 23 months
  – Start Feb 2002
  – End Dec 2003
  – Feb 2003 = Project Month 13 (PM13)

Why do it?
- Grid Engine
  – Over 20000 downloads (Nov 2002)
  – Distributed Resource Management tool
  – Schedules activities across networked resources
- Sun classifies 3 levels of Grid
  – Cluster Grid: a single team or project and their associated resources
  – Enterprise Grid: multiple teams and projects but within a single organisation, facilitating collaboration across the enterprise
  – Global Grid: linked Cluster and Enterprise grids, providing collaboration amongst organisations
- Grid Engine meets the first two levels but by itself does not meet the third

Why do it? (cont)
- Globus Toolkit
  – A Grid API for connecting distributed compute and instrument resources
- Integration with Globus allows Grid Engine to meet level 3
  – Collaboration amongst enterprises
  – Most integration efforts use Globus to submit work to Grid Engine
- This project tackles the opposite problem: to engineer Grid Engine on top of Globus

Why do it? (cont)
- Grid Engine is concerned with compute resources
  – Extend it to work with popular data and service access protocols (e.g. OGSA-DAI)

Project Scenario
Two collaborating enterprises, A and B, both have some machines
  – Both enterprises run Grid Engine to schedule jobs
  – Local demand for machines is variable
    • Sometimes it exceeds supply
    • Other times machines lie idle
[Diagram: Users (A) submit to the Grid Engine at enterprise A, which schedules machines a, b, c, d; Users (B) submit to the Grid Engine at enterprise B, which schedules machines e, f, g, h]

Project Scenario (cont)
Ideal Situation
  – If enterprises A and B could expose some of their machines to each other across the internet through Grid Engine…
    • Both A and B could enjoy throughput efficiency improvements
    • Large gains when one enterprise is busy while the other is idle
[Diagram: the Grid Engine at each enterprise now schedules across all eight machines, a to h]

Project Goal
- Final goal
  – Develop a scheduler based on Grid Engine to schedule jobs across a combination of local and remote machines
  – Enable jobs to access necessary data sources
  – Use Globus as the Grid API to provide secure communications and transfer
- Development Criteria
  – Industrial strength
  – Application of software engineering techniques
  – Use of industry standard design and analysis tools
  – Migration to OGSA-compliant Globus 3

How?
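The scenario and goal above amount to a spill-over policy: run a job locally while a machine is free, otherwise hand it to the partner enterprise. The following is a minimal sketch of that idea only, not the project's scheduler. The `Site` class, the `place` function and the job names are hypothetical, and the sketch ignores security, data staging and everything Globus would provide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Site:
    """A hypothetical enterprise with a fixed number of machines."""
    name: str
    slots: int
    running: List[str] = field(default_factory=list)

    def free(self) -> int:
        # Number of machines not currently running a job.
        return self.slots - len(self.running)


def place(job: str, local: Site, remote: Site) -> str:
    """Prefer the local site; spill over to the remote site when local is full."""
    for site in (local, remote):
        if site.free() > 0:
            site.running.append(job)
            return site.name
    return "queued"  # neither site has a free machine


a = Site("A", slots=4)  # machines a, b, c, d
b = Site("B", slots=4)  # machines e, f, g, h

# Enterprise A is busy (nine jobs) while enterprise B is idle.
placements = [place(f"job{i}", local=a, remote=b) for i in range(9)]
print(placements)
```

With B idle, the first four jobs stay on A, the next four spill over to B, and only the ninth has to wait. Without sharing, five of the nine would have queued at A.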

Workpackages
- WP 1: Analysis of existing Grid components
  ✓ WP 1.1: UML analysis of core Globus 2.0
  ✓ WP 1.2: UML analysis of Grid Engine
  ✓ WP 1.3: UML analysis of other Globus 2.0
  – WP 1.4: UML analysis of Globus 3.0
  – WP 1.5: Exploration of data technologies
✓ WP 2: Requirements Capture & Analysis
- WP 3: Prototype Compute Scheduler
- WP 4: Compute/Data Scheduler Design
- WP 5: Compute/Data Scheduler Development

The Project Team
- Project Personnel
  – Terry Sloan: Project leader
  – Geoff Cawood: Project architect
  – Ratna Abrol: Engineering
  – Thomas Seed: Engineering
  – Ali Anjomshoaa: Globus 2 Analysis
  – Paul Graham: Requirements Capture and Analysis
  – Amy Krause: Technical reviewer
- Project Review Board
  – Fritz Ferstl (Sun Microsystems GmbH)
  – John Barr (Sun Microsystems Ltd)
  – Steven Newhouse (London e-Science Centre)
  – Neil Chue Hong (EPCC)

Achievements
- Publications
  – D1.1 Analysis of Globus Toolkit V2.0
  – D1.2 Grid Engine UML Analysis
  – D2.1 Use cases and requirements
  – D2.2 Questionnaire Report
  – D3.1 Prototype Development: Requirements
- Software
  – Transfer-queue Over Globus (TOG)

Transfer-queue Over Globus (TOG): A Compute Scheduler
[Diagram: User A and User B each submit to their own Grid Engine; the Grid Engine at enterprise A (machines a to d) reaches machines at enterprise B (machines e to h) through Globus 2]
- Integrates Grid Engine and Globus 2 to access remote resources
- GE execution methods provide job submission and control
- GE job context stores job-specific information, e.g. the job handle
- Globus GSI for security
- Globus GRAM enables interaction with the remote resource
- GASS for small data transfers, GridFTP for large datasets

TOG (cont)
- Current Status
  – Secure job submission functionality implemented and tested
    • Staging of input data and executables and transfer of output
  – Secure job control functionality implemented and tested
    • Suspend, Resume, Terminate
  – Basic scheduling functionality implemented and tested
    • Schedules jobs to remote resources when local resources are full
  – Testing
    • Integrated successfully within Grid Engine test suite
    • Tested through firewalls
- TOG software available upon request
  – Contact sungrid@epcc.ed.ac.uk
- Generally available via web site soon
  – www.epcc.ed.ac.uk/sungrid

TOG (cont)
Pros
- Simple approach
- Usability: existing Grid Engine interface, users only need to be aware of Globus certificates
- Remote administrators still have full control of their resources

TOG (cont)
Cons
- Low quality scheduling decisions (?)
  – May be a time-lag in getting query results back from remote resource
  – Incorporating data transfer costs into scheduling
- Mirror queues for remote resources
- Possible set-up overhead
- Globus 2 vs. Globus 3
- Grid Engine specific solution

The Compute & Data Scheduler

Current status
Considering two possible routes
1. Extend TOG
  – Migrate to Globus 3
  – Incorporate OGSA-DAI
2. Hierarchical Scheduler
  – Overcome limitations
  – Global Grid vision

1. Extend compute scheduler
- Compute Grid
- Data Grid
[Diagram: for the Compute Grid, Grid Engine (GE) installations linked through Globus; for the Data Grid, a GE site reaching GridFTP, SRB and OGSA-DAI through Globus. OGSA-DAI hides ODBC, JDBC, XMLDB etc.]

2. Hierarchical Scheduler
- Unified Interface
  – Grid Scalability
- Query child DRMs for capabilities
- Pass Job Specification to the child
[Diagram: a tree of schedulers, each behind a Web Services Layer with the same interface. A Hierarchical Scheduler labelled Scotland sits above a Grid Engine and a Hierarchical Scheduler labelled Edinburgh, which sits above further Grid Engine installations; one node is labelled EPCC]

Conclusions
- Before proceeding
- Examine Globus 3 Analysis
- Examine Data Technologies, i.e. OGSA-DAI etc.
- Informed decision on whether to
  – Extend Compute Scheduler, or
  – Build Hierarchical Scheduler or some sub-set of this
- Delivery in December 2003
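
The Hierarchical Scheduler route rests on two operations: query the child DRMs for capabilities, and pass the job specification to a child. Because every node offers the same interface, schedulers can nest. The sketch below illustrates that idea under stated assumptions and is not a design from the project: the class names, the `capabilities` and `submit` methods, the `free_slots` and `slots` fields and the leaf names `ge-0` to `ge-2` are all hypothetical, and a plain method call stands in for the Web Services Layer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LeafDRM:
    """Stands in for one DRM, e.g. a single Grid Engine installation."""
    name: str
    free_slots: int

    def capabilities(self) -> dict:
        return {"free_slots": self.free_slots}

    def submit(self, job: dict) -> Optional[str]:
        if job["slots"] > self.free_slots:
            return None  # this DRM cannot run the job
        self.free_slots -= job["slots"]
        return self.name


@dataclass
class HierarchicalScheduler:
    """Offers the same two methods as its children, so trees can nest."""
    name: str
    children: List[object] = field(default_factory=list)

    def capabilities(self) -> dict:
        # Advertise the largest job that any single child could accept.
        sizes = (c.capabilities()["free_slots"] for c in self.children)
        return {"free_slots": max(sizes, default=0)}

    def submit(self, job: dict) -> Optional[str]:
        # Query the children, then pass the job specification down,
        # trying the child with the most free slots first.
        ranked = sorted(self.children,
                        key=lambda c: c.capabilities()["free_slots"],
                        reverse=True)
        for child in ranked:
            placed = child.submit(job)
            if placed is not None:
                return placed
        return None  # no child could take the job


edinburgh = HierarchicalScheduler(
    "Edinburgh", [LeafDRM("ge-1", 2), LeafDRM("ge-2", 8)])
scotland = HierarchicalScheduler(
    "Scotland", [LeafDRM("ge-0", 4), edinburgh])

first = scotland.submit({"slots": 6})
second = scotland.submit({"slots": 4})
third = scotland.submit({"slots": 3})
print(first, second, third)
```

A real design would also have to deal with the concerns listed under the TOG cons, such as stale query results from remote resources and data transfer costs; this sketch leaves them out.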