UK e-Science Grid and EGEE
Malcolm Atkinson & Richard Sinnott
www.nesc.ac.uk
9th June 2004
The Primary Requirement …
Enabling People to Work Together on Challenging Projects: Science, Engineering & Medicine
UK e-Science Budget (2001-2006)
Total: £213M, plus industrial contributions
EPSRC (£77.7M) 37%
PPARC (£57.6M) 27%
MRC (£21.1M) 10%
BBSRC (£18M) 8%
NERC (£15M) 7%
ESRC (£13.6M) 6%
CLRC (£10M) 5%
EPSRC breakdown: Applied (£35M) 45%, Core (£31.2M) 40%, HPC (£11.5M) 15%
(Staff costs; Grid resources funded separately)
Source: Science Budget 2003/4 – 2005/6, DTI(OST)
The e-Science Centres
[Diagram: the UK e-Science Centres, including CeSC (Cambridge), linked to the e-Science Institute, the Globus Alliance, the Open Middleware Infrastructure Institute, the Digital Curation Centre and the Grid Operations Centre]
The e-Science Grid: National Grid Service
[Diagram: the National Grid Service, built by the Engineering Task Force with contributions from the e-Science Centres (including CeSC, Cambridge), connected to EGEE, the Grid Support Centre / Grid Operations Centre and OGSA test grid projects. Resources shown: HPC(x), a 1600-CPU AIX system, a 512-CPU Irix system, and two Linux clusters (20 CPUs with 18TB disk; 64 CPUs with 4TB disk)]
UK e-Science Grid Status – Q3 2003
Operational and heterogeneous Level-2 Grid based on Globus Toolkit 2
Demonstrated broad set of applications running across it:
• Monte Carlo simulations of ionic diffusion through radiation-damaged crystal structures
• Integrated Earth system modelling
• BLAST on the Grid
• Nimrod/G
• DL_POLY and Portals
• Grid Enabled Optimisation: Vibrations in Space – Application to Satellite Truss Design
• RealityGrid-Lite
• Grid Integration Test Script Suite – from WP7
UK Level 2 Grid Project: The Engineering Task Force
They made it work
WP0 Project management – Rob Allan
WP1 Middleware development – Nick Hill
WP2 Grid information systems – Rob Allan
WP3 Authentication and the CA – Alistair Mills
WP4 User mgt and accounting – Steven Newhouse
WP5 Grid security – Jon Hillier
WP6 Grid platform deployment – Alistair Mills
WP7 Operational monitoring – David Baker
WP8 Applications – Simon Cox
UK e-Science Grids – the next steps
Prototype status …
– Fragility?
– Security?
– Reliability?
– Predictability?
– Dynamicity?
– Maintainability?
Explore Research Grids based on OGSA
Deliver production service – GT2 & LCG2
NGS – initially a subset of sites
Grid Operations Centre – Director Neil Geddes
Many projects continue with GT2
Others use WS-I (plus GSI in some cases)
The future is based on WSRF
Converge by mid 2005
Projects achieve results
Across all disciplines
More than 100 projects
Applications 70%, core M/W 30%
Approximately $50 million from industry collaboration
>70 companies
Using UK e-Science Grid / NGS or local Grids, or using their own “grids”
The Grid idea is thriving
It is still hard work to get started
It is still very hard to push the technical limits, e.g. TeraGyroid
Success depended on a strong rallying cry
Research Grid Experience Panel, GGF11
9 June 2004
www.eu-egee.org
EGEE & LCG2 Experience
Malcolm Atkinson & Ian Bird
EGEE is a project funded by the European Union under contract IST-2003-508833
EGEE manifesto: Enabling Grids for E-science in Europe
• Goal
– Create a wide European Grid production-quality infrastructure on top of present and future EU research network infrastructure
• Build on:
– EU and EU member states’ major investments in Grid technology
– International connections (US and AP)
– Several pioneering prototype results
– Large Grid development teams in the EU require a major EU funding effort
• Approach
– Leverage current and planned national and regional Grid programmes
– Work closely with relevant industrial Grid developers, NRENs and US-AP projects
[Diagram: applications run over the Grid infrastructure, which runs over the Géant network]
EGEE: Partners
• Leverage national resources in a more effective way for broader European benefit
• 70 leading institutions in 27 countries, federated in regional Grids
EGEE Activities
24% Joint Research
JRA1: Middleware Engineering and Integration
JRA2: Quality Assurance
JRA3: Security
JRA4: Network Services Development
28% Networking
NA1: Management
NA2: Dissemination and Outreach
NA3: User Training and Education
NA4: Application Identification and Support
NA5: Policy and International Cooperation
48% Services
SA1: Grid Operations, Support and Management
SA2: Network Resource Provision
Emphasis in EGEE is on operating a production grid and supporting the end-users
31 million Euros over 2 years
The Training Team Consortium
NeSC – Edinburgh, UK & Ireland
IHEP – Moscow, Russia
IMPB RAS – Moscow, Russia
ITEP – Moscow, Russia
JINR – Dubna, Russia
KU-NATFAK – Copenhagen, Denmark
PNPI – St. Petersburg, Russia
RRCKI – Moscow, Russia
GUP – Linz, Austria
FZK – Karlsruhe, Germany
Innsbruck, Austria
GRNET – Athens, Greece
INFN – Rome, Italy
CESNET – Prague, Czech Rep.
BUTE – Budapest, Hungary
II-SAS – Bratislava, Slovakia
ICM – Warsaw, Poland
PSNC – Poznan, Poland
ICI – Bucharest, Romania
ELUB – Budapest, Hungary
MTA SZTAKI – Budapest, Hungary
TAU – Tel Aviv, Israel
EGEE Implementation
• From day 1 (1st April 2004)
– Production grid service based on the LCG infrastructure, running LCG-2 grid middleware (SA)
– LCG-2 will be maintained until the new generation has proven itself (fallback solution)
• In parallel, develop a “next generation” grid facility (JRA)
– Produce a new set of grid services according to evolving standards (Web Services)
– Run a development service providing early access for evaluation purposes
– Will replace LCG-2 on the production facility in 2005
[Diagram: middleware lineage – VDT, EDG, AliEn and others feed into LCG-1 and LCG-2 (Globus 2 based), which lead to EGEE-1 and EGEE-2 (Web services based)]
Some observations – middleware
• LCG took close to 1 year to make existing middleware into something close to production quality
• Found existing middleware:
– Was not well tested
– Did not handle exceptions
– Assumed that networks and other services would always work
  • This is a distributed system! (see the sketch below)
– Did not address reliability
– Did not address scalability
– Did not address application-required functionality
  • Use-cases often do not describe exactly how a service will be used → underlying architecture sometimes not appropriate
– Could not easily be integrated into existing computing infrastructures
  • Often assumed full control of dedicated test-beds – this is not the real situation
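The point about exception handling deserves a concrete illustration. The sketch below is Python and purely illustrative: the endpoint, payload and function name are invented, not part of LCG-2, Globus or any real middleware API. It shows remote calls treated the way a production distributed system must treat them: every call has a timeout, failures are expected and retried with back-off, and the caller eventually gets a clear error rather than a hang.

import time
import urllib.error
import urllib.request

def submit_job(endpoint, job_description, retries=3, timeout=10.0):
    """Submit a job description to a (hypothetical) HTTP job endpoint.
    Network failures and service errors are treated as expected events:
    every remote call has a timeout and is retried with back-off."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            request = urllib.request.Request(endpoint, data=job_description.encode())
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.read().decode()      # e.g. a job identifier
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc                         # failure is the normal path, not the exception
            time.sleep(2 ** attempt)                 # back off before retrying
    raise RuntimeError(f"submission to {endpoint} failed after {retries} attempts") from last_error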
Certification, Testing and Release Cycle
[Diagram: the release pipeline. JRA1 covers development and integration with unit and functional testing (dev tag). SA1 covers certification – basic functionality tests, the certification matrix, and C&T and site test suites – producing a release candidate tag and then a certified release tag; applications software installation and deployment preparation yield a deployment release tag; testing and pre-production lead to the production tag and the production service. Application groups (HEP experiments, bio-med, others TBD) feed in via application integration.]
LCG Certification
• Significant investment in certification and testing process and team
– Skilled people capable of system-level debugging, tightly coupled to VDT, Globus, and EDG teams
– Needs significant hardware resources
– This was essential in achieving a robust service
• Making production-quality software is
– Expensive
– Time consuming
– Not glamorous!
– … and takes very skilled people with a lot of experience
• Message for developers (see the test sketch below)
– One good idea at a time – test, test, test
– Exception handling is not the exception – it is the normal mode of operation in a distributed system!
– Reference implementations are (by definition) NOT production quality
  • Expect components to be replaced by the deployers!
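To make the “test, test, test” message concrete, here is a minimal sketch of the kind of automated basic-functionality check run against every release candidate. It is Python with a stub client so it is self-contained; GatekeeperClient, its methods and the host name are hypothetical stand-ins, not the real certification suite or any Globus/LCG API.

import unittest

class GatekeeperClient:
    """Hypothetical thin client, included only to keep the test self-contained."""
    def __init__(self, host):
        self.host = host
    def ping(self):
        # A real client would contact the gatekeeper port here.
        return bool(self.host)
    def submit(self, executable):
        # A real client would stage and submit the job and return its identifier.
        return f"{self.host}:job-0001" if executable else ""

class BasicFunctionalityTests(unittest.TestCase):
    """Small, repeatable checks of the kind run for every release candidate."""
    def setUp(self):
        self.client = GatekeeperClient("ce.example.org")
    def test_gatekeeper_responds(self):
        self.assertTrue(self.client.ping())
    def test_trivial_job_is_accepted(self):
        job_id = self.client.submit("/bin/hostname")
        self.assertTrue(job_id.startswith("ce.example.org"))

if __name__ == "__main__":
    unittest.main()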
Experiences in deployment
• LCG covers many sites (~60) now – both large and small
– Large sites have existing infrastructures and need to add on grid interfaces etc.
– Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.)
– Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures
  • A lot of effort had to be invested in this area
• Richness of batch systems does not match the simple gatekeeper model (see the sketch below)
– Many queues, heterogeneous clusters, hierarchical “fair-share” scheduling and policies versus simple assumptions at the gatekeeper
• Current model of information system with a unique schema does not scale
– Does not allow for differences between sites and the full richness of available resources
• Data management systems still have a long way to go
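A small sketch of the batch-system mismatch described above, assuming invented queue names and numbers (this is not LCG-2 or Globus code): a site with several queues, walltime limits and fair-share groups is collapsed into the single “free slots” figure that a simple gatekeeper/information-system view can publish, losing exactly the information a scheduler would need.

from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    max_walltime_hours: int
    free_slots: int
    fair_share_group: str

# Invented example queues for one site.
site_queues = [
    Queue("short", 2, 40, "all"),
    Queue("long", 72, 5, "hep"),
    Queue("biomed", 24, 12, "biomed"),
]

def flatten_for_gatekeeper(queues):
    """Collapse heterogeneous queues into the single view a simple gatekeeper
    schema can publish -- walltime limits and fair-share policy are lost."""
    return {"free_slots": sum(q.free_slots for q in queues)}

print(flatten_for_gatekeeper(site_queues))   # {'free_slots': 57}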
Summary
• This was a list of problems – but in the end we have been quite successful
– System is stable and reliable
– System is used in production
– System is reasonably easy to install now – 60 sites
– Now have a basis on which to incrementally build essential functionality
• This infrastructure forms the basis of the initial EGEE production service
Sites in LCG-2/EGEE-0: June 4, 2004
Austria: U-Innsbruck
Canada: Triumf, Alberta, Carleton, Montreal, Toronto
Czech Republic: Prague-FZU, Prague-CESNET
France: CC-IN2P3, Clermont-Ferrand
Germany: FZK, Aachen, DESY, Wuppertal
Greece: HellasGrid
Hungary: Budapest
India: TIFR
Israel: Tel-Aviv, Weizmann
Italy: CNAF, Frascati, Legnaro, Milano, Napoli, Roma, Torino
Japan: Tokyo
Netherlands: NIKHEF
Pakistan: NCP
Poland: Krakow
Portugal: LIP
Russia: SINP-Moscow, JINR-Dubna
Spain: PIC, UAM, USC, UB-Barcelona, IFCA, CIEMAT, IFIC
Switzerland: CERN, CSCS
Taiwan: ASCC, NCU
UK: RAL, Birmingham, Cavendish, Glasgow, Imperial, Lancaster, Manchester, QMUL, RAL-PP, Sheffield, UCL
US: BNL, FNAL, HP, Puerto-Rico
• 22 Countries
• 58 Sites (45 Europe, 2 US, 5 Canada, 5 Asia, 1 HP)
• Coming: New Zealand, China, other HP (Brazil, Singapore)
• 3800 CPUs