EPCC Sun Data and Compute Grids Project

Using Sun Grid Engine and Globus to Schedule Jobs
Across a Combination of Local and Remote Machines
Terry Sloan
Edinburgh Parallel Computing Centre (EPCC)
Telephone: +44 131 650 5155
Email: t.sloan@epcc.ed.ac.uk
http://www.epcc.ed.ac.uk/sungrid
Overview
► The Project
► Why do it?
► Project Scenario
► Project Goal
► How?
► Project Achievements
► The Compute Scheduler
► The Compute & Data Scheduler
The Project
► Develop a Globus-enabled compute and data scheduler
► Based on Grid Engine, Globus and a variety of data technologies
The Project (cont)
► Partners
– Sun Microsystems
– National e-Science Centre represented by EPCC
► Timescales
– 23 months
– Start Feb 2002
– End Dec 2003
– Feb 2003 = Project Month 13 (PM13)
Why do it?
► Grid Engine – over 20,000 downloads (Nov 2002)
– Distributed Resource Management tool
– Schedules activities across networked resources
► Sun classifies 3 levels of Grid
– Cluster Grid – a single team or project and their associated resources
– Enterprise Grid – multiple teams and projects but within a single organisation, facilitating collaboration across the enterprise
– Global Grid – linked Cluster and Enterprise grids, providing collaboration amongst organisations
► Grid Engine meets the first two levels but by itself does not meet the third
Why do it? (cont)
► Globus Toolkit
– A Grid API for connecting distributed compute and instrument resources
► Integration with Globus allows Grid Engine to meet level 3
– Collaboration amongst enterprises
– Most integration efforts use Globus to submit work to Grid Engine
► This project tackles the opposite problem: to engineer Grid Engine on top of Globus
Why do it? (cont)
► Grid Engine is concerned with compute resources
– Extend it to work with popular data and service access protocols (e.g. OGSA-DAI)
Project Scenario
Two collaborating enterprises A and B both have
some machines
– Both enterprises run Grid Engine to schedule jobs
– Local demand for machines is variable
• Sometimes it exceeds supply
• Other times machines lie idle
[Figure: enterprise A with Users (A), a Grid Engine and machines a, b, c, d; enterprise B with Users (B), a Grid Engine and machines e, f, g, h]
Project Scenario (cont)
Ideal Situation
– If enterprises A and B could expose some of their machines to each other across the internet through Grid Engine…
• Both A and B could enjoy throughput efficiency improvements
• Large gains when one enterprise is busy while the other is idle
[Figure: enterprises A and B, each with its users and a Grid Engine; A's Grid Engine now lists machines a, b, c, d and e, f, g, h, and B's Grid Engine lists e, f, g, h and a, b, c, d]
Project Goal
► Final goal
– Develop a scheduler based on Grid Engine to schedule jobs across a combination of local and remote machines
– Enable jobs to access necessary data sources
– Use Globus as the Grid API to provide secure communications and transfer
► Development Criteria
– Industrial strength
– Application of software engineering techniques
– Use of industry standard design and analysis tools
– Migration to OGSA-compliant Globus 3
How?
Workpackages
► WP 1: Analysis of existing Grid components
✓ WP 1.1: UML analysis of core Globus 2.0
✓ WP 1.2: UML analysis of Grid Engine
✓ WP 1.3: UML analysis of other Globus 2.0
– WP 1.4: UML analysis of Globus 3.0
– WP 1.5: Exploration of data technologies
✓ WP 2: Requirements Capture & Analysis
► WP 3: Prototype Compute Scheduler
► WP 4: Compute/Data Scheduler Design
► WP 5: Compute/Data Scheduler Development
The Project Team
► Project Personnel
– Terry Sloan: Project leader
– Geoff Cawood: Project architect
– Ratna Abrol: Engineering
– Thomas Seed: Engineering
– Ali Anjomshoaa: Globus 2 Analysis
– Paul Graham: Requirements Capture and Analysis
– Amy Krause: Technical reviewer
► Project Review Board
– Fritz Ferstl (Sun Microsystems GmbH)
– John Barr (Sun Microsystems Ltd)
– Steven Newhouse (London e-Science Centre)
– Neil Chue Hong (EPCC)
Achievements
► Publications
– D1.1 Analysis of Globus Toolkit V2.0
– D1.2 Grid Engine UML Analysis
– D2.1 Use cases and requirements
– D2.2 Questionnaire Report
– D3.1 Prototype Development: Requirements
► Software
– Transfer-queue Over Globus (TOG)
Transfer-queue Over Globus (TOG) - A
Compute Scheduler
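[Figure: enterprises A and B, each with a Grid Engine and its user (User A, User B), linked through Globus 2; A's Grid Engine lists machines a, b, c, d plus e, and B's lists e, f, g, h plus d]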
► Integrates Grid Engine and Globus 2 to access remote resources
► GE execution methods provide job submission and control
► GE job context stores job-specific information, e.g. job handle
► Globus GSI for security
► Globus GRAM enables interaction with remote resource
► GASS for small data transfer, GridFTP for large datasets
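
The mechanism listed above (execution methods for submission and control, a job handle kept in the job context, and a size-based choice between GASS and GridFTP) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TOG's implementation: `FakeGram`, `JobContext`, `TransferQueue`, `choose_transfer` and the 1 MB `SMALL_TRANSFER_LIMIT` are hypothetical names and values introduced here, and the in-memory `FakeGram` merely stands in for a remote Globus GRAM service.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical threshold: the slides only distinguish "small" from "large".
SMALL_TRANSFER_LIMIT = 1_000_000  # bytes


def choose_transfer(size_bytes: int) -> str:
    """Pick a transfer mechanism by payload size (assumed policy)."""
    return "GASS" if size_bytes <= SMALL_TRANSFER_LIMIT else "GridFTP"


class FakeGram:
    """In-memory stand-in for a remote GRAM service (not a real Globus API)."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.states = {}  # job handle -> current state

    def submit(self, command: str) -> str:
        handle = f"job-{next(self._ids)}"
        self.states[handle] = "RUNNING"
        return handle

    def signal(self, handle: str, new_state: str) -> None:
        if handle not in self.states:
            raise KeyError(f"unknown job handle: {handle}")
        self.states[handle] = new_state


@dataclass
class JobContext:
    """Per-job key/value store, standing in for the GE job context."""
    values: dict = field(default_factory=dict)


class TransferQueue:
    """Execution methods of a queue that forwards jobs to a remote site."""

    def __init__(self, gram: FakeGram) -> None:
        self.gram = gram

    def start(self, ctx: JobContext, command: str) -> None:
        # Remember the remote handle so later control calls can find the job.
        ctx.values["job_handle"] = self.gram.submit(command)

    def suspend(self, ctx: JobContext) -> None:
        self.gram.signal(ctx.values["job_handle"], "SUSPENDED")

    def resume(self, ctx: JobContext) -> None:
        self.gram.signal(ctx.values["job_handle"], "RUNNING")

    def terminate(self, ctx: JobContext) -> None:
        self.gram.signal(ctx.values["job_handle"], "TERMINATED")
```

The point of the sketch is the role of the job context: `start` is the only method that learns the remote handle, so suspend, resume and terminate depend on it having been stored with the job.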
TOG (cont)
► Current Status
– Secure job submission functionality implemented and tested
• Staging of input data and executables and transfer of output
– Secure job control functionality implemented and tested
• Suspend, Resume, Terminate
– Basic scheduling functionality implemented and tested
• Schedules jobs to remote resources when local resources are full
– Testing
• Integrated successfully within Grid Engine test suite
• Tested through firewalls
► TOG software available upon request
– Contact sungrid@epcc.ed.ac.uk
► Generally available via web site soon
– www.epcc.ed.ac.uk/sungrid
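
The basic scheduling behaviour described under Current Status, where jobs go to remote resources when local resources are full, can be sketched as follows. This is an illustrative model only: the `Queue` class, its `slots`/`used`/`remote` fields and the `pick_queue` and `dispatch` functions are hypothetical names introduced here, and a real Grid Engine scheduler weighs far more than free slots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    name: str
    slots: int            # total job slots
    used: int = 0         # slots currently occupied
    remote: bool = False  # True for a queue that forwards to another site

    def has_free_slot(self) -> bool:
        return self.used < self.slots


def pick_queue(queues: list) -> Optional[Queue]:
    """Prefer any local queue with a free slot; fall back to remote ones."""
    for want_remote in (False, True):
        for q in queues:
            if q.remote == want_remote and q.has_free_slot():
                return q
    return None  # everything is full, so the job stays pending


def dispatch(queues: list, job_name: str) -> Optional[str]:
    """Place one job and return the chosen queue name, or None if pending."""
    q = pick_queue(queues)
    if q is None:
        return None
    q.used += 1
    return q.name


if __name__ == "__main__":
    queues = [Queue("a", 1), Queue("b", 1), Queue("e-transfer", 2, remote=True)]
    for n in range(5):
        print(f"job{n} ->", dispatch(queues, f"job{n}"))
```

With two single-slot local queues and one two-slot remote queue, the first two jobs stay local, the next two go remote, and the fifth remains pending.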
TOG (cont)
Pros
► Simple approach
► Usability – existing Grid Engine interface, users only need to be aware of Globus certificates
► Remote administrators still have full control of their resources
TOG (cont)
Cons
► Low quality scheduling decisions (?)
– May be a time-lag in getting query results back from remote resource
– Incorporating data transfer costs into scheduling
► Mirror queues for remote resources
► Possible set-up overhead
► Globus 2 vs. Globus 3
► Grid Engine specific solution
The Compute & Data Scheduler
Current status
Considering two possible routes
1. Extend TOG
– Migrate to Globus 3
– Incorporate OGSA-DAI
2. Hierarchical Scheduler
– Overcome limitations
– Global Grid vision
1. Extend compute scheduler
► Compute Grid
[Figure: Grid Engine (GE) instances linked through Globus]
► Data Grid
[Figure: GE reaching a GridFTP site, SRB and OGSA-DAI (hides ODBC, JDBC, XMLDB etc.) through Globus]
2. Hierarchical Scheduler
► Unified Interface
– Grid Scalability
► Query child DRMs for capabilities
► Pass Job Specification to the child
[Figure: a hierarchy labelled Scotland, Edinburgh and EPCC; Hierarchical Schedulers and Grid Engine instances each sit behind a Web Services Layer that presents the same interface]
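
The two steps named on this slide, querying child DRMs for capabilities and passing the job specification to a suitable child, can be sketched with a shared interface for leaves and intermediate schedulers. This is a toy model under stated assumptions: `JobSpec`, `GridEngineStub`, `HierarchicalScheduler`, the use of free CPUs as the only capability and the site name `other-site` are hypothetical, and no web services layer is modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobSpec:
    name: str
    cpus: int


class GridEngineStub:
    """Leaf DRM stand-in; not the real Grid Engine API."""

    def __init__(self, name: str, free_cpus: int) -> None:
        self.name = name
        self.free_cpus = free_cpus

    def capabilities(self) -> int:
        """Largest job (in CPUs) this DRM could accept right now."""
        return self.free_cpus

    def submit(self, job: JobSpec) -> Optional[str]:
        if job.cpus > self.free_cpus:
            return None
        self.free_cpus -= job.cpus
        return self.name


class HierarchicalScheduler:
    """Exposes the same two methods as a leaf, so schedulers can nest."""

    def __init__(self, name: str, children: list) -> None:
        self.name = name
        self.children = children

    def capabilities(self) -> int:
        # Query every child and report the best single offer.
        return max((c.capabilities() for c in self.children), default=0)

    def submit(self, job: JobSpec) -> Optional[str]:
        # Pass the job specification to the first child that can take it.
        for child in self.children:
            if child.capabilities() >= job.cpus:
                return child.submit(job)
        return None


if __name__ == "__main__":
    edinburgh = HierarchicalScheduler("Edinburgh", [GridEngineStub("EPCC", 4)])
    scotland = HierarchicalScheduler(
        "Scotland", [GridEngineStub("other-site", 2), edinburgh]
    )
    print(scotland.submit(JobSpec("small", 2)))
    print(scotland.submit(JobSpec("big", 4)))
```

Because a parent sees a child scheduler and a child Grid Engine through the same two methods, further levels can be added without changing the parent.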
Conclusions
► Before proceeding
► Examine Globus 3 Analysis
► Examine Data Technologies, e.g. OGSA-DAI
► Informed decision on whether to
– Extend Compute Scheduler, or
– Build Hierarchical Scheduler or some subset of this
► Delivery in December 2003