Middleware Development and Deployment Status
Tony Doyle, University of Glasgow
PPE & PPT Lunchtime Talk, 9 November 2004

Contents
• What are the challenges?
• What is the scale?
• How does the Grid work?
• What is the status of (EGEE) middleware development?
• What is the deployment status?
• What is GridPP doing as part of the international effort?
  – What was GridPP1?
  – Is GridPP a Grid?
  – What is planned for GridPP2?
  – What lies ahead?
• Summary – Why? What? How? When?

Science generates data and might require a Grid?
• Astronomy • Healthcare • Earth Observation • Bioinformatics • Digital Curation • Collaborative Engineering

What are the challenges?
We must:
• share data between thousands of scientists with multiple interests
• link major (Tier-0 [Tier-1]) and minor (Tier-1 [Tier-2]) computer centres
• ensure all data are accessible anywhere, anytime
• grow rapidly, yet remain reliable for more than a decade
• cope with the different management policies of different centres
• ensure data security
• be up and running routinely by 2007

What are the challenges? The ten steps:
1. Software process  2. Software efficiency  3. Deployment planning  4. Link centres  5. Share data  6. Manage data  7. Install software  8. Analyse data  9. Accounting  10. Policies
– all centred on data management, security and sharing.

Tier-1 Scale
Step-1: Select hardware price assumptions for financial planning (the GREEN cells in the "Assumptions" worksheet).

Table-1: Moore's Law assumptions – doubling time of capacity per unit cost, in months (values as recovered from the slide; the worksheet distinguished AGGRESSIVE and STANDARD models):

  Calendar year    2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
  Disk doubling      12   12   15   15   15   18   18   18   18   18   18
  CPU doubling       24   24   24   24   24   24   24   24   24   24   24

Currently network performance doubles every year (or so) for unit cost.

Step-2: Compare to experiment (e.g. Tier-1) requirements for 2008:

  Reqts 2008        ALICE  ATLAS    CMS   LHCb    SUM
  CPU (kSI2K)        9100  16600  12600   9500  47800
  Disk (TB)          3000   9200   8700   1300  22200
  Tape (PB)           3.6      6    6.6    0.4   16.6
  Number of T1s         5     11      7      6     29

Step-3: Conclude that more than one centre is needed.
Step-4: A Grid? Compare Ian Foster / Carl Kesselman: "A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities."
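A worked sketch of what Table-1 implies (my arithmetic, not on the original slide). Capacity per unit cost under a doubling time T_d follows

    C(t) = C_0 \, 2^{\,t/T_d}

so over the 48 months from 2004 to 2008: CPU (T_d = 24 months) gains a factor 2^{48/24} = 4; disk (T_d = 15 to 18 months) a factor between 2^{48/18} \approx 6.3 and 2^{48/15} \approx 9.2; and network, doubling yearly, a factor 2^{48/12} = 16. Hence the planning assumption that disk and network capacity grow faster than CPU for the same money.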
What is the Grid? The "hour glass" picture:
I. Experiment layer (e.g. portals)
II. Application middleware (e.g. metadata)
III. Grid middleware (e.g. information services)
IV. Facilities and fabrics (e.g. storage services)

How do I start? http://www.gridpp.ac.uk/start/
• Getting started as a Grid user
• Quick start guide for LCG2: the GridPP guide to starting as a user of the Large Hadron Collider Computing Grid.
• Getting an e-Science certificate: in order to use the Grid you need a Grid certificate. This page introduces the UK e-Science Certification Authority, which issues certificates to users. You can get a certificate from here.
• Using the LHC Computing Grid (LCG): CERN's guide on the steps you need to take in order to become a user of the LCG, including contact details for support.
• LCG user scenario: describes in a practical way the steps a user has to follow to send and run jobs on LCG and to retrieve and process the output successfully.
• Currently being improved..

Job Submission (behind the scenes)
The user writes a JDL description of the job on the User Interface (UI) and submits it, together with an input "sandbox" of files, to the Resource Broker. After authorisation and authentication, the broker queries the Information Service and the Replica Catalogue (for dataset locations), expands the JDL, and hands the job to the Job Submission Service, which submits it (via Globus RSL) to a suitable Compute Element, close to a Storage Element holding the data. Job status is published to the Logging & Book-keeping service, which the UI can query; on completion the output "sandbox" is returned to the user. A sketch of what this looks like from the UI follows.
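As a concrete illustration of the UI side of this flow, here is a minimal sketch in Python. It assumes an LCG-2 User Interface machine with a valid Grid proxy and the EDG workload-management commands (edg-job-submit and friends) on the PATH; the JDL attributes shown are standard ones, but the specific job is a made-up example, not from the talk.

    # Minimal sketch of LCG-2 job submission from a User Interface node.
    # Assumes: valid Grid proxy, EDG WMS CLI (edg-job-submit etc.) installed.
    import subprocess

    # A trivial JDL: run /bin/hostname on some Compute Element and
    # bring back stdout/stderr in the output sandbox.
    jdl = '''
    Executable    = "/bin/hostname";
    StdOutput     = "std.out";
    StdError      = "std.err";
    OutputSandbox = {"std.out", "std.err"};
    '''

    with open("hello.jdl", "w") as f:
        f.write(jdl)

    # Submit; the Resource Broker matches the job to a CE and
    # returns a job identifier, stored here in jobids.txt.
    subprocess.run(["edg-job-submit", "-o", "jobids.txt", "hello.jdl"], check=True)

    # Poll status (Logging & Book-keeping) and, once done, fetch the sandbox.
    subprocess.run(["edg-job-status", "-i", "jobids.txt"], check=True)
    subprocess.run(["edg-job-get-output", "-i", "jobids.txt"], check=True)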
Enabling Grids for E-sciencE (EGEE)
Deliver a 24/7 Grid service to European science:
• build a consistent, robust and secure Grid network that will attract additional computing resources
• continuously improve and maintain the middleware in order to deliver a reliable service to users
• attract new users from industry as well as science and ensure they receive the high standard of training and support they need
100 million euros over 4 years, funded by the EU; >400 software engineers plus service support; 70 European partners.

Prototype Middleware Status & Plans (I)
(EGEE slides from the LHCC Comprehensive Review, November 2004, INFSO-RI-508833; on the originals, blue = deployed on the development testbed, red = proposed)
• Workload Management
  – AliEn TaskQueue
  – EDG WMS (plus new TaskQueue and Information Supermarket)
  – EDG L&B
• Computing Element
  – Globus Gatekeeper + LCAS/LCMAPS, dynamic accounts (from Globus)
  – CondorC
  – Interfaces to LSF/PBS (blahp)
  – "Pull components": AliEn CE, gLite CEmon (being configured)

Prototype Middleware Status & Plans (II)
• Storage Element
  – Existing SRM implementations: dCache, Castor, …; FNAL & LCG DPM
  – Simple interface defined (AliEn+BioMed)
  – gLite-I/O (re-factored AliEn-I/O)
• Catalogs
  – AliEn FileCatalog – global catalog
  – gLite Replica Catalog – local catalog
  – Catalog update (messaging)
  – FiReMan interface
  – RLS (Globus)
  – Metadata Catalog
• Information & Monitoring
  – R-GMA web service version; multi-VO support
• Data Scheduling
  – File Transfer Service (Stork+GridFTP)
  – File Placement Service
  – Data Scheduler

Prototype Middleware Status & Plans (III)
• Security
  – VOMS as attribute authority and VO management
  – myProxy as proxy store
  – GSI security and VOMS attributes as enforcement; fine-grained authorization (e.g. ACLs); Globus to provide a set-uid service on the CE
• Accounting
  – EDG DGAS (not used yet)
• User Interface
  – AliEn shell
  – CLIs and APIs
  – GAS: integrate remaining services (catalogs)
• Package manager
  – Prototype based on the AliEn backend – evolve to the final architecture agreed with the ARDA team

[GridPP organisation chart: Collaboration Board (CB) and Project Management Board (PMB); a Deployment Board covering Tier-1/Tier-2, testbeds and rollout plus the middleware areas (metadata, storage, workload, network, security, information/monitoring); a User Board for requirements and user feedback, with service specification & provision linking the experiments, application development, LCG/ARDA and EGEE.]

Middleware Development
Grid data management, network monitoring, configuration management, storage interfaces, information services, security.

Application Development
ATLAS, LHCb, CMS, BaBar (SLAC), SAMGrid (FermiLab), QCDGrid, PhenoGrid.

GridPP Deployment Status
GridPP deployment is part of LCG, currently the largest Grid in the world; the future Grid in the UK is dependent upon LCG releases. There are three Grids on a global scale in HEP, with similar functionality (GridPP's contribution in parentheses):

                   sites       CPUs
  LCG (GridPP)   90 (15)  8700 (1500)
  Grid3 [USA]         29         2800
  NorduGrid           30         3200

LCG Overview
By 2007: ~100,000 CPUs; more than 100 institutes worldwide; building on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT). The prototype went live in September 2003 in 12 countries and was extensively tested by the LHC experiments during this summer.

Deployment Status (26/10/04)
• Incremental releases: significant improvements in reliability, performance and scalability
  – within the limits of the current architecture
  – scalability is much better than expected a year ago
• Many more nodes and processors than anticipated
  – installation problems of last year overcome
  – many small sites have contributed to MC productions
• Full-scale testing as part of this year's data challenges
• GridPP: "The Grid becomes a reality" – widely reported (technology sites, British Embassy (USA), British Embassy (Russia))

Data Challenges
Ongoing.. Grid and non-Grid production, with the Grid now significant:
• ALICE – 35 CPU-years; Phase 1 done, Phase 2 ongoing (LCG)
• CMS – 75 M events and 150 TB: the first of this year's Grid data challenges
Entering Grid production phase..

ATLAS Data Challenge (DC2)
• 7.7 M GEANT4 events and 22 TB; ATLAS DC2 on LCG since 7 September
• UK ~20% of LCG; ongoing..
• Production on three Grids; ~150 CPU-years so far
• Largest total computing requirement – yet a small fraction of what ATLAS needs..
[Pie chart: ATLAS DC2 CPU usage by site – LCG 41%, NorduGrid 30%, Grid3 29%; contributing sites range from at.uibk, ca.triumf and ch.cern through the it.infn.* sites, nl.nikhef and ru.msu to uk.bham, uk.ic, uk.lancs, uk.man and uk.rl.]
Totals: ~1350 kSI2K·months, ~95,000 jobs, ~7.7 million events fully simulated (Geant4), ~22 TB.
Entering Grid production phase..
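The units above are worth decoding: kSI2K·months measure work in units of a 1,000-SPECint2000 processor running for one month (the "100,000 PCs = 100 Million SPECint2000" slide later in this talk uses the same scale). A hedged back-of-envelope conversion, assuming an average 2004 worker node rated around 0.75 kSI2K (my assumption; real nodes varied):

    # Back-of-envelope: convert the quoted ATLAS DC2 work into CPU-years.
    # ASSUMPTION: an average worker node of the era is ~0.75 kSI2K.
    work_ksi2k_months = 1350          # quoted on the slide
    node_rating_ksi2k = 0.75          # assumed average node rating

    cpu_months = work_ksi2k_months / node_rating_ksi2k
    cpu_years = cpu_months / 12
    print(f"~{cpu_years:.0f} CPU-years")  # ~150, matching the slide's figure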
LHCb Data Challenge
424 CPU-years (4,000 kSI2K·months), 186 M events; the UK's input was significant (>1/4 of the total). Entering Grid production phase..
• LCG(UK) resource:
  – Tier-1 7.7%
  – Tier-2 sites: London 3.9%, South 2.3%, North 1.4%
• DIRAC:
  – Imperial 2.0%, L'pool 3.1%, Oxford 0.1%, ScotGrid 5.1%
[Plot: 186 M produced events in Phase 1 – 3-5 × 10^6 events/day with LCG in action (LCG paused and restarted during the run) against 1.8 × 10^6/day with DIRAC alone.]

Paradigm Shift
Transition to Grid… 424 CPU-years in total, with the monthly DIRAC:LCG split moving from
  May: 89%:11% (11% of DC'04)    Jun: 80%:20% (25% of DC'04)
  Jul: 77%:23% (22% of DC'04)    Aug: 27%:73% (42% of DC'04)
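Reading those numbers as DIRAC:LCG splits within each month, weighted by each month's share of DC'04 (my interpretation of the slide, not stated on it), a short sketch gives the overall Grid fraction:

    # Overall LCG fraction of LHCb DC'04, assuming the slide's pairs are
    # DIRAC:LCG splits and the "% of DC'04" is each month's weight.
    months = [  # (month, LCG fraction within month, month's share of DC'04)
        ("May", 0.11, 0.11),
        ("Jun", 0.20, 0.25),
        ("Jul", 0.23, 0.22),
        ("Aug", 0.73, 0.42),
    ]

    overall_lcg = sum(lcg * weight for _, lcg, weight in months)
    print(f"overall LCG fraction ~ {overall_lcg:.0%}")  # ~42%

On this reading roughly 42% of the whole challenge ran via LCG, with the Grid carrying the clear majority by August – the paradigm shift of the slide title.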
More Applications
ZEUS uses LCG:
• needs the Grid to respond to increasing demand for MC production
• 5 million Geant events produced on the Grid since August 2004
QCDGrid (for UKQCD):
• currently a 4-site data grid managing a few hundred gigabytes of data
• key technologies used: Globus Toolkit 2.4, European DataGrid, the eXist XML database

Issues
First large-scale Grid production problems are being addressed… at all levels. See "LCG-2 MIDDLEWARE PROBLEMS AND REQUIREMENTS FOR LHC EXPERIMENT DATA CHALLENGES", https://edms.cern.ch/file/495809/2.2/LCG2-Limitations_and_Requirements.pdf

Is GridPP a Grid?
Against the three-point checklist of http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf:
1. Coordinates resources that are not subject to centralized control? YES – this is why development and maintenance of LCG is important.
2. …using standard, open, general-purpose protocols and interfaces? YES – VDT (Globus/Condor-G) + EDG/EGEE (gLite) ~meet this requirement.
3. …to deliver nontrivial qualities of service? YES – the LHC experiments' data challenges over the summer of 2004 (http://agenda.cern.ch/fullAgenda.php?ida=a042133).

What was GridPP1?
GridPP goal: to develop and deploy a large scale science Grid in the UK for the use of the Particle Physics community.
• A team that built a working prototype grid of significant scale: >1,500 (7,300) CPUs, >500 (6,500) TB of storage, >1,000 (6,000) simultaneous jobs
• A complex project, where 82% of the 190 tasks for the first three years were completed
[Project map: the GridPP1 work breakdown – 1 CERN/LCG (creation, applications, fabric, technology, deployment), 2 DataGrid (WP1-WP8), 3 Applications (ATLAS, ATLAS/LHCb, LHCb, CMS, BaBar, CDF/D0, UKQCD, other), 4 Infrastructure (Tier-A, Tier-1, Tier-2, testbed, rollout, data challenges), 5 Interoperability, 6 Dissemination, 7 Resources – with a per-task status key (metric OK / metric not OK / task complete / task overdue / due within 60 days / not due soon / not active), status date 1 January 2004.]

Aims for GridPP2? From Prototype to Production
• 2001: separate experiments, resources, multiple accounts – BaBarGrid, BaBar, CDF, ATLAS, ALICE, LHCb, CMS; CERN Computer Centre, RAL Computer Centre, 19 UK institutes
• 2004: prototype Grids – EGEE, SAMGrid, D0, GANGA, EDG, ARDA, LCG; CERN prototype Tier-0 centre, UK prototype Tier-1/A centre, 4 UK prototype Tier-2 centres
• 2007: 'one' production Grid – CERN Tier-0 centre, UK Tier-1/A centre, 4 UK Tier-2 centres

Planning: GridPP2 ProjectMap
GridPP2 goal: to develop and deploy a large scale production quality grid in the UK for the use of the Particle Physics community.
[Project map areas: 1 LCG (Tier-A/Tier-1, Tier-2, deployment, applications, computing fabric); 2 M/S/N (metadata, data & storage management, workload management, security, information & monitoring, network); 3 LHC apps (ATLAS, LHCb, CMS, Ganga); 4 non-LHC apps (BaBar, CDF, D0, SAMGrid, UKQCD, PhenoGrid, portal); 5 management (grid deployment, grid operations, experiment support); 6 external (dissemination, interoperability, engagement, knowledge transfer).]
Structures agreed and in place (except LCG phase-2).

What lies ahead? Some mountain climbing..
• Annual data storage: 12-14 PetaBytes per year – a CD stack with 1 year of LHC data would be ~20 km tall (Concorde cruises at 15 km; we are here, at 1 km)
• 100 Million SPECint2000 ≈ 100,000 PCs (3 GHz Pentium 4)
• Importance of step-by-step planning: pre-plan your trip, carry an ice axe and crampons and arrange for a guide…
• In production terms, we've made base camp: quantitatively, we're ~9% of the way there in CPU (9,000 of 100,000) and disk (3 of 12-14 × 3 years)…
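Spelling out that last estimate (my arithmetic on the slide's own numbers):

    \frac{9\,000}{100\,000} = 9\% \ \text{(CPU)}, \qquad
    \frac{3\ \mathrm{PB}}{(12\text{-}14\ \mathrm{PB/yr}) \times 3\ \mathrm{yr}} \approx 7\text{-}8\% \ \text{(disk)}

so on either axis the capacity deployed in late 2004 is of order a tenth of the LHC-era target, roughly consistent with the slide's ~9%.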
Summary: 1. Why? 2. What? 3. How? 4. When?
From the Particle Physics perspective, the Grid is:
1. Why? Needed to utilise large-scale computing resources efficiently and securely.
2. What? a) a working prototype running today on large testbed(s)… b) about seamless discovery of computing resources; c) using evolving standards for interoperation; d) the basis for computing in the 21st century; e) not (yet) as transparent or robust as end-users need.
3. How? See the GridPP getting-started pages (two-day EGEE training courses are available); a minimal first session is sketched below.
4. When? a) Now, at prototype level, for simple(r) applications (e.g. experiment Monte Carlo production); b) September 2007 for more complex applications (e.g. data ready for LHC analysis).

Tony Doyle, University of Glasgow – PPE & PPT Lunchtime Talk, 9 November 2004
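To make point 3 concrete, a first session on an LCG-2 User Interface might look like the following sketch. It assumes a UK e-Science certificate already installed under ~/.globus and the Globus and EDG command-line tools on the PATH; hello.jdl is the toy job description sketched earlier in this talk.

    # Sketch of a first LCG-2 session: create a proxy, check it, and ask
    # the Resource Broker which Compute Elements match a job description.
    # Assumes: certificate in ~/.globus, Globus + EDG WMS CLI installed.
    import subprocess

    # Create a short-lived proxy credential from the Grid certificate
    # (prompts for the certificate passphrase).
    subprocess.run(["grid-proxy-init"], check=True)

    # Confirm the proxy exists and see its remaining lifetime.
    subprocess.run(["grid-proxy-info"], check=True)

    # Dry run: list the Compute Elements that could accept the job,
    # without actually submitting it.
    subprocess.run(["edg-job-list-match", "hello.jdl"], check=True)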