EPCC Sun Data and Compute Grids NeSC Review 18 March 2004

EPCC Sun Data and Compute Grids
Geoff Cawood, Terry Sloan
Edinburgh Parallel Computing Centre (EPCC)
Telephone: +44 131 650 5155
Email: t.sloan@epcc.ed.ac.uk
NeSC Review
18 March 2004
Overview
Description and Aims
Project Status
Technical Achievements
Dissemination/Exploitation
Future Plans
Description and Aims
Project Goal
“Develop a fully Globus-enabled compute and data scheduler based around Grid Engine, Globus and a wide variety of data technologies”
Partners
– Sun Microsystems
– National e-Science Centre, represented by EPCC
Timescales
– Start Feb 2002, end Feb 2004
– 23 (+2) months duration
– Extension due to project staff involvement in ODD-Genes
Grid Engine
Open source distributed resource management (DRM) system
Globus integration enables sharing of resources amongst collaborating enterprises
Project Scenario
If enterprises A and B could expose some of their machines to each other across the internet…
Both A and B could enjoy throughput efficiency improvements
Large gains when one enterprise is busy and the other is idle
[Diagram: users at enterprises A and B submit jobs a–h to their local Grid Engines, which share the workload across both sites]
Functional Aims
What does the project goal mean in practice?
Five key functional aims were identified:
1. Job scheduling across Globus to remote Grid Engines
2. File transfer between local client site and remote jobs
3. File transfer between any site and remote jobs
4. Allow 'datagrid aware' jobs to work remotely
5. Data-aware job scheduling
Derived from questioning existing Grid Engine users during the Requirements WP
Project Status
Workpackages
WP 1: Analysis of existing Grid components
– WP 1.1: UML analysis of core Globus 2.0
– WP 1.2: UML analysis of Grid Engine
– WP 1.3: UML analysis of other Globus 2.0
– WP 1.4: Globus toolkit V3.0 Investigations
– WP 1.5: Data Technologies Investigations
WP 2: Requirements Capture & Analysis
WP 3: Prototype Development
WP 4: Hierarchical Scheduler Design
WP 5: Hierarchical Scheduler Development
Deliverables
All WPs finished
Deliverables available from the project public web site
http://www.epcc.ed.ac.uk/sungrid
or from the Grid Engine community web site (for software)
http://gridengine.sunsource.net/
WP 1: Analysis of existing Grid components (FINISHED)
– D1.1 Analysis of Globus Toolkit V2.0
– D1.2 Grid Engine UML Analysis
– D1.3 Globus Toolkit 2.0 GRAM Client API Functions
– D1.4 Globus 3.0 Features and Use
– D1.5.2 Datagrids In Practice
– D1.5.3 GridFTP
– D1.5.4 OGSA-DAI
– D1.5.5 Storage Resource Broker (SRB)
WP 2: Requirements Capture & Analysis (FINISHED)
– D2.1 Use Cases and Requirements
– D2.2 Questionnaire Report
WP 3: Prototype Development (FINISHED)
– D3.1 Prototype Development: Requirements
– D3.2 Prototype Development: Design
– D3.3 Prototype Development: Test Plan
– D3.4 Prototype Development: TOG Software
– D3.6 Prototype Development: How-To
WP 4: Hierarchical Scheduler Design (FINISHED)
– D4.1 JOSH Functional Specification
– D4.2 JOSH Systems Design
WP 5: Hierarchical Scheduler Development (FINISHED)
– JOSH User Guide
– JOSH Software
– JOSH Client Install Guide
– JOSH Server Install Guide
– JOSH Known Problems & Solutions
Technical Achievements
"From Sun's perspective, the SunDCG project has been
tremendously successful. Together, EPCC and Sun have produced
very high quality software and documents, providing real added
value to Sun's Grid Engine suite and addressing some of the key
issues in robust and usable Grid middleware."
Fritz Ferstl, Sun Microsystems
TOG (Transfer-queue Over Globus)
[Diagram: at Site A, User A's Grid Engine forwards jobs through a 'transfer queue' over Globus 2.2.x to the Grid Engine at Site B, where they run alongside User B's local jobs]
WP 3 deliverable – prototype compute scheduler
Integrates GE and Globus 2.2.x/2.4 (software library)
Supplies GE execution methods (starter method etc.) to implement a 'transfer queue' which sends jobs over Globus to a remote GE
GE complexes used for configuration
Globus GSI for security, GRAM for interaction with the remote GE
GASS for small data transfers, GridFTP for large datasets
Written in Java – Globus functionality accessed through the Java CoG Kit
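As a concrete sketch of the transfer-queue idea: a transfer queue is an ordinary Grid Engine queue whose execution methods point at TOG's scripts. The attribute names below (`starter_method` etc.) are standard Grid Engine queue attributes, but the host name and script paths are illustrative placeholders, not the actual TOG distribution layout:

```shell
# Illustrative 'transfer queue' configuration, as shown by `qconf -sq transfer_q`.
# Each execution method is replaced by a TOG-supplied script that forwards
# the corresponding operation over Globus to the remote Grid Engine.
qname            transfer_q
hostlist         gateway.site-a.example.org   # local gateway host (placeholder)
starter_method   /opt/tog/tog_starter.sh      # submits the job remotely via GRAM
suspend_method   /opt/tog/tog_suspend.sh
resume_method    /opt/tog/tog_resume.sh
terminate_method /opt/tog/tog_terminate.sh
```
Because the queue looks like any other Grid Engine queue, users keep their existing `qsub` workflow; only the queue's execution methods know the job actually runs at the remote site.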
TOG Software
Functionality present:
1. Job scheduling across Globus to remote Grid Engines
2. File transfer between local client site and remote jobs
  ● Add special comments to the job script to specify the set of files to transfer between the local site and the remote site
4. Allow 'datagrid aware' jobs to work remotely
  ● Use of Globus GRAM ensures a proxy certificate is present in the remote environment
Functionality absent:
3. File transfer between any site and remote jobs
  ● Files are transferred between the remote site and the local site only
5. Data-aware job scheduling
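The file-transfer mechanism in point 2 can be pictured with a job script sketch. The `#$` line is a normal Grid Engine directive; the `#%` lines are hypothetical stand-ins for TOG's special comments naming the files to stage (the real directive syntax is defined in the TOG How-To, D3.6):

```shell
#!/bin/sh
#$ -N remote_sort                     # ordinary Grid Engine job-name directive
#% transfer_input  data/input.dat     # hypothetical: stage this file to the remote site
#% transfer_output results/output.dat # hypothetical: stage this result back afterwards
sort data/input.dat > results/output.dat
```
The directives are comments, so the same script still runs unchanged on a plain local Grid Engine queue.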
TOG Software
Pros
– Simple approach
– Usability
  ● Existing Grid Engine interface
  ● Only addition is a Globus certificate for authentication/authorisation
– Remote administrators still have full control over their resources
Cons
– Low-quality scheduling decisions
  ● State of remote resource – is it fully loaded?
  ● Ignores data transfer costs
– Scales poorly – one local transfer queue for each remote queue
– Manual set-up
  ● Configuring the transfer queue with the same properties as the remote queue
– Java virtual machine invocation per job submission
JOSH (JOb Scheduling Hierarchically)
Developing JOSH software
– Addresses the shortcomings of TOG
– Incorporates Globus 3 and grid services
WP 5 deliverable – compute/data scheduler
– Adds a new 'hierarchical' scheduler above Grid Engine
hiersched submit_ge
– Takes a GE job script as input (embellished with data requirements)
– Queries grid services at each compute site to find the best match and submits the job
[Diagram: the user passes a job spec to the hiersched user interface; the hierarchical scheduler queries the grid service layer above each site's Grid Engine and routes data between the input data site and the output data site]
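From the user's side, the flow above reduces to a single command. `hiersched submit_ge` is the JOSH command named on this slide; the script name and option shown are illustrative (the exact invocation syntax is in the JOSH User Guide):

```shell
# Submit an (embellished) Grid Engine job script through JOSH.
# The hierarchical scheduler queries the grid service layer at each
# compute site, picks the best-matching site for the job's compute and
# data requirements, and submits the job to that site's Grid Engine.
hiersched submit_ge myjob.sh
```
This is what distinguishes JOSH from TOG: the destination is chosen per job at submission time, rather than being fixed by which transfer queue the user submits to.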
JOSH
Pros
– Satisfies the five functionality goals
– Fulfils the project goal
– Remote administrators still have full control over their GEs
– Makes use of existing GE functionality, e.g. 'can run'
Cons
– Latency in decision making
– Not so much 'scheduling' as 'choosing'
– Grid Engine-specific solution
Dissemination/Exploitation
Presentations
– Ernst & Young, WestInfo Services, Strategy & Performance Associates, SingTel Optus, Executive Briefing Centre, Curtin Business School, Curtin University of Technology, Perth, Australia, February 24th and 26th, 2004
– Curtin Business School Information Systems Seminar, Curtin University of Technology, Perth, Australia, February 20th 2004
– GlobusWORLD 2004, San Francisco, USA, January 22nd 2004
– White Rose Grid, EPCC Sun Data & Compute Grids, UCL Workshop, York University, November 11th 2003
– Sun HPC Consortium, Phoenix, USA, November 2003
– Open Issues in Grid Scheduling, National e-Science Centre, Edinburgh, UK, October 21st 2003
– 2nd Grid Engine Workshop, Regensburg, Germany, September 22nd–24th 2003
– SunLabs Europe, Edinburgh, September 1st 2003
– Sun HPC Consortium, Grid and Portal Computing SIG, Heidelberg, Germany, June 21st 2003
– Resource Management and Scheduling for the Grid, National e-Science Centre, Edinburgh, UK, February 13th 2003
– Sun HPC Consortium, Grid and Portal Computing SIG, Baltimore, USA, November 15th 2002
– EPCC Sun Data and Compute Grids / White Rose Computational Grid Meeting, EPCC, Edinburgh, UK, November 7th 2002
– Sun HPC Consortium, Grid and Portal Computing SIG, Glasgow, UK, July 18th 2002
– Grid Engine Workshop, Regensburg, Germany, April 22nd–24th 2002
Software Take-up
Transfer-queue Over Globus (TOG) take-up includes:
– ODD-Genes
  ● Uses SunDCG TOG and OGSA-DAI to demonstrate a scientific use for the grid (bioinformatics), presented at:
    – UK All Hands Meeting 2003, September 2003
    – Supercomputing 2003, November 2003, on the Sun, UK e-Science and Globus Alliance booths
    – Poster/demo at GlobusWorld 2
    – Numerous visitors to Edinburgh University
– INWA
  ● Uses SunDCG TOG, OGSA-DAI and the FirstDIG browser to demonstrate data mining of commercial bank and telco data over the grid with Curtin Business School, Perth, Australia
– Liverpool University's ULGrid
  ● Uses SunDCG TOG to enable users to access resources from various departments
– Raytheon Inc. (USA)
  ● Uses SunDCG TOG in grid evaluations
– Sun Singapore
Software Take-up
Job Scheduling Hierarchically (JOSH) known interest includes:
– White Rose Grid
– Raytheon Inc.
– Academic Technology Services at UCLA
– School of Pharmaceutical Sciences at the University of Nottingham
– Texas Advanced Computing Center
– Forecast Systems Laboratory of NOAA
Downloads
10,300 document downloads between Feb 27th 2003 and Feb 26th 2004
No specific figures on TOG/JOSH software downloads
– Software is hosted at the Grid Engine community web site, where download figures are not available
BUT from the EPCC web site:
– > 400 downloads of the TOG Requirements document
– > 400 downloads of the JOSH Functional Specification
– > 300 downloads of the JOSH Systems Design
– JOSH documents only available since *Feb 3rd 2004*
Community Scheduler Framework
– Does not have data-aware scheduling
– Platform has asked if the JOSH algorithms could be included
So LOTS of interest in JOSH
Future Plans
Effort budget ran out in February 2004
Sun will integrate TOG/JOSH into the Grid Engine source from March 2004
Open source development via the Grid Engine community web site
If funds are made available:
– WS-RF update
– Access to other DRM systems, e.g. LoadLeveler, LSF
– WS-Agreement compliance, JSDL
– Further functionality
All are straightforward due to the good design of JOSH
"I just recommended TOG and JOSH as a starting point for a
partner who wants to build Grid middleware for nuclear plants."
Fritz Ferstl, Sun Microsystems
Demo
Download