EPCC Sun Data and Compute Grids Project

Using Sun Grid Engine and Globus to Schedule Jobs
Across a Combination of Local and Remote Machines
Terry Sloan
Edinburgh Parallel Computing Centre (EPCC)
Telephone: +44 131 650 5155
Email: t.sloan@epcc.ed.ac.uk
http://www.epcc.ed.ac.uk/sungrid
Overview
The Project
Why do it?
Project Scenario
Project Goal
How?
Project Achievements
The Compute Scheduler
The Compute & Data Scheduler
The Project
Develop a Globus-enabled compute and data scheduler
Based on Grid Engine, Globus and a variety of data technologies
The Project (cont)
Partners
– Sun Microsystems
– National e-Science Centre represented by EPCC
Timescales
– 23 months
– Start Feb 2002
– End Dec 2003
– Feb 2003 = Project Month 13 (PM13)
Why do it?
Grid Engine – over 20000 downloads (Nov 2002)
– Distributed Resource Management tool
– Schedules activities across networked resources
Sun classifies 3 levels of Grid
– Cluster Grid – a single team or project and their associated
resources
– Enterprise Grid – multiple teams and projects but within a single
organisation, facilitating collaboration across the enterprise
– Global Grid – linked Cluster and Enterprise grids, providing
collaboration amongst organisations
Grid Engine meets the first two levels but by itself
does not meet the third
Why do it? (cont)
Globus Toolkit
– A Grid API for connecting distributed compute and instrument
resources
Integration with Globus allows Grid Engine to
meet level 3
– Collaboration amongst enterprises
– Most integration efforts use Globus to submit work to Grid Engine
This project tackles the opposite problem: engineering
Grid Engine on top of Globus
Why do it? (cont)
Grid Engine is concerned with compute resources
– Extend it to work with popular data and service access protocols (e.g.
OGSA-DAI)
Project Scenario
Two collaborating enterprises A and B both have
some machines
– Both enterprises run Grid Engine to schedule jobs
– Local demand for machines is variable
• Sometimes it exceeds supply
• Other times machines lie idle
[Diagram: Enterprise A, where Users (A) submit to a Grid Engine over machines a to d; Enterprise B, where Users (B) submit to a Grid Engine over machines e to h.]
Project Scenario (cont)
Ideal Situation
– If enterprises A and B could expose some of their machines to
each other across the internet through Grid Engine…
• Both A and B could enjoy throughput efficiency improvements
• Large gains when one enterprise is busy while the other is idle
[Diagram: Enterprises A and B as before, but the Grid Engine at each enterprise can now schedule across all of machines a to h.]
The Project Goal
Final goal
– Develop a scheduler based on Grid Engine to schedule jobs
across a combination of local and remote machines
– Enable jobs to access necessary data sources
– Use Globus as the Grid API to provide secure communications
and transfer
Development Criteria
– Industrial strength
– Application of software engineering techniques
– Use of industry standard design and analysis tools
– Migration to OGSA-compliant Globus 3
How?
Workpackages
WP 1: Analysis of existing Grid components
– WP 1.1: UML analysis of core Globus 2.0
– WP 1.2: UML analysis of Grid Engine
– WP 1.3: UML analysis of other Globus 2.0
– WP 1.4: UML analysis of Globus 3.0
– WP 1.5: Exploration of data technologies
WP 2: Requirements Capture & Analysis
WP 3: Prototype Compute Scheduler
WP 4: Compute/Data Scheduler Design
WP 5: Compute/Data Scheduler Development
The Project Team
Project Personnel
– Terry Sloan: Project leader
– Geoff Cawood: Project architect
– Ratna Abrol: Engineering
– Thomas Seed: Engineering
– Ali Anjomshoaa: Globus 2 Analysis
– Paul Graham: Requirements Capture and Analysis
– Amy Krause: Technical reviewer
Project Review Board
– Fritz Ferstl (Sun Microsystems GmbH)
– John Barr (Sun Microsystems Ltd)
– Steven Newhouse (London e-Science Centre)
– Neil Chue Hong (EPCC)
Achievements
Publications
– D1.1 Analysis of Globus Toolkit V2.0
– D1.2 Grid Engine UML Analysis
– D2.1 Use cases and requirements
– D2.2 Questionnaire Report
– D3.1 Prototype Development: Requirements
Software
– Transfer-queue Over Globus (TOG)
Transfer-queue Over Globus (TOG) - A
Compute Scheduler
[Diagram: User A's Grid Engine at A (machines a to d, plus remote machine e) and User B's Grid Engine at B (machines e to h, plus remote machine d), connected through Globus 2.]
Integrates Grid Engine and Globus 2 to access remote resources
GE execution methods provide job submission and control
GE job context stores job-specific information, e.g. the job handle
Globus GSI for security
Globus GRAM enables interaction with the remote resource
GASS for small data transfers, GridFTP for large datasets
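The division of labour on the TOG slide can be pictured with a small model. This is a minimal sketch, not the TOG code: `FakeRemoteResource` is a hypothetical in-memory stand-in for a remote resource reached through GRAM, `TransferQueue` and its method names are invented for illustration, and the size threshold separating GASS from GridFTP is an assumption (the slides give no figure).

```python
from dataclasses import dataclass, field
from itertools import count

# Assumed cut-off between "small" and "large" transfers; not from the slides.
LARGE_TRANSFER_BYTES = 100 * 1024 * 1024


def choose_transfer(size_bytes: int) -> str:
    """Pick GASS for small files and GridFTP for large datasets."""
    return "GridFTP" if size_bytes >= LARGE_TRANSFER_BYTES else "GASS"


class FakeRemoteResource:
    """Hypothetical stand-in for a remote resource; no real Globus calls."""

    def __init__(self):
        self._ids = count(1)
        self.states = {}  # job handle -> current state

    def submit(self, command: str) -> str:
        handle = f"remote-job-{next(self._ids)}"
        self.states[handle] = "running"
        return handle

    def signal(self, handle: str, new_state: str) -> None:
        self.states[handle] = new_state


@dataclass
class Job:
    command: str
    input_bytes: int
    # Models the Grid Engine job context, which holds the remote job handle.
    context: dict = field(default_factory=dict)


class TransferQueue:
    """Illustrative execution methods that forward job control to the remote side."""

    def __init__(self, remote: FakeRemoteResource):
        self.remote = remote

    def start(self, job: Job) -> None:
        job.context["transfer"] = choose_transfer(job.input_bytes)
        job.context["job_handle"] = self.remote.submit(job.command)

    def suspend(self, job: Job) -> None:
        self.remote.signal(job.context["job_handle"], "suspended")

    def resume(self, job: Job) -> None:
        self.remote.signal(job.context["job_handle"], "running")

    def terminate(self, job: Job) -> None:
        self.remote.signal(job.context["job_handle"], "terminated")
```

The point of the sketch is that every control operation looks up the handle stored in the job context, so the local scheduler never needs to know where the job actually runs.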
TOG (cont)
Current Status
– Secure job submission functionality implemented and tested
• Staging of input data and executables and transfer of output
– Secure job control functionality implemented and tested
• Suspend, Resume, Terminate
– Basic scheduling functionality implemented and tested
• Schedules jobs to remote resources when local resources are full
– Testing
• Integrated successfully within Grid Engine test suite
• Tested through firewalls
TOG software available upon request
– Contact sungrid@epcc.ed.ac.uk
Generally available via web site soon
– www.epcc.ed.ac.uk/sungrid
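The basic scheduling rule in the status list (use remote resources only when local resources are full) can be sketched in a few lines. This is an illustrative model under stated assumptions, not Grid Engine's scheduler: the `Queue` class, the slot counts and the queue names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Queue:
    name: str
    slots: int            # how many jobs the queue can run at once (assumed model)
    remote: bool = False  # True for a queue that forwards to another site
    running: int = 0

    def has_free_slot(self) -> bool:
        return self.running < self.slots


def place_job(queues: List[Queue]) -> Optional[Queue]:
    """Place one job: try local queues first, then remote ones; None if all are full."""
    for want_remote in (False, True):
        for queue in queues:
            if queue.remote == want_remote and queue.has_free_slot():
                queue.running += 1
                return queue
    return None
```

With two single-slot local queues `a` and `b` and one single-slot remote queue, the first two jobs stay local, the third overflows to the remote queue, and a fourth has to wait.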
TOG (cont)
Pros
Simple approach
Usability – existing Grid Engine interface, users only
need to be aware of Globus certificates
Remote administrators still have full control of their
resources
TOG (cont)
Cons
Low quality scheduling decisions (?)
– May be a time-lag in getting query results back from
remote resource
– Incorporating data transfer costs into scheduling
Mirror queues for remote resources
Possible set-up overhead
Globus 2 vs. Globus 3
Grid Engine specific solution
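To see why data transfer costs matter for scheduling quality, consider a rough completion-time estimate. The formula below (queue wait plus transfer time plus run time) and all the figures are assumptions chosen for illustration; the slides do not specify how costs would be modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    queue_wait_s: float
    run_time_s: float
    data_bytes: int = 0                 # data that must be moved to this resource
    bandwidth_bytes_per_s: float = 1.0  # assumed effective transfer rate

    def estimated_finish_s(self) -> float:
        transfer_s = self.data_bytes / self.bandwidth_bytes_per_s
        return self.queue_wait_s + transfer_s + self.run_time_s


def pick(candidates: List[Candidate]) -> Candidate:
    """Choose the resource with the earliest estimated completion."""
    return min(candidates, key=Candidate.estimated_finish_s)


# A busy local machine versus an idle remote one that needs 1 GB shipped at 1 MB/s.
local = Candidate("local", queue_wait_s=600, run_time_s=300)
remote = Candidate("remote", queue_wait_s=0, run_time_s=300,
                   data_bytes=1_000_000_000, bandwidth_bytes_per_s=1_000_000)
best = pick([local, remote])
```

Here the idle remote machine loses (an estimated 1300 s against 900 s locally) because of the transfer alone; a scheduler that ignores transfer time would choose it anyway.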
The Compute & Data Scheduler
Current status
Considering two possible routes
1. Extend TOG
– Migrate to Globus 3
– Incorporate OGSA-DAI
2. Hierarchical Scheduler
– Overcome limitations
– Global Grid vision
1. Extend compute scheduler
[Diagram: a Compute Grid of Grid Engine (GE) installations with Globus, connected to a Data Grid made up of a GridFTP site, SRB and OGSA-DAI (hides ODBC, JDBC, XMLDB etc.).]
2. Hierarchical Scheduler
Unified Interface
Web Services Layer
– Grid Scalability
[Diagram: a hierarchy of schedulers labelled Scotland, Edinburgh and EPCC. Each Hierarchical Scheduler sits behind a Web Services Layer that presents the same interface, queries its child DRMs for capabilities and passes the job specification to the child. Grid Engine installations, each behind its own Web Services Layer, sit at the bottom of the hierarchy.]
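The hierarchical idea (same interface at every level, query the children for capabilities, pass the job specification down) can be sketched as a small tree. This is a minimal model under stated assumptions: the class names, the capability fields and the job dictionary are hypothetical, plain method calls stand in for the Web Services Layer, and the leaf named `other-site` is invented to make the example non-trivial.

```python
def satisfies(capability: dict, job: dict) -> bool:
    """Check one capability record against a job specification (assumed fields)."""
    return capability["free_slots"] >= job["slots"] and job["arch"] in capability["archs"]


class GridEngineLeaf:
    """Hypothetical leaf DRM exposing the shared interface."""

    def __init__(self, name, free_slots, archs):
        self.name = name
        self.free_slots = free_slots
        self.archs = frozenset(archs)

    def capabilities(self):
        return [{"free_slots": self.free_slots, "archs": self.archs}]

    def submit(self, job):
        if not satisfies(self.capabilities()[0], job):
            raise RuntimeError(f"{self.name} cannot run this job")
        self.free_slots -= job["slots"]
        return [self.name]  # path of schedulers the job passed through


class HierarchicalScheduler:
    """Same interface as a leaf, but delegates to child schedulers or DRMs."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def capabilities(self):
        # Query every child and report the combined list upward.
        return [cap for child in self.children for cap in child.capabilities()]

    def submit(self, job):
        # Pass the job specification to the first child able to run it.
        for child in self.children:
            if any(satisfies(cap, job) for cap in child.capabilities()):
                return [self.name] + child.submit(job)
        raise RuntimeError(f"no child of {self.name} can run this job")


epcc = GridEngineLeaf("EPCC", free_slots=2, archs={"solaris"})
other = GridEngineLeaf("other-site", free_slots=8, archs={"linux"})
edinburgh = HierarchicalScheduler("Edinburgh", [epcc])
scotland = HierarchicalScheduler("Scotland", [edinburgh, other])
route = scotland.submit({"slots": 1, "arch": "solaris"})
```

Because a parent and a leaf answer the same two calls, new levels can be added without changing the callers, which is the scalability argument made on the slide.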
Conclusions
Before proceeding
Examine Globus 3 Analysis
Examine data technologies, e.g. OGSA-DAI
Informed decision on whether to
– Extend Compute Scheduler, or
– Build the Hierarchical Scheduler, or some subset of it
Delivery in December 2003