Slide 1: The Particle Physics Computational Grid
Paul Jeffreys / CCLRC
System Managers Meeting, 21 March 2000

Slide 2: Financial Times, 7 March 2000 (press cutting)

Slide 3: Front page of the Financial Times, 7 March 2000 (press cutting)

Slide 4: LHC Computing: Different from Previous Experiment Generations
• Geographical dispersion: of people and of resources
• Complexity: of the detector and of the LHC environment
• Scale: petabytes per year of data
  (NB – for the purposes of this talk – mostly LHC-specific)
• ~5000 physicists, 250 institutes, ~50 countries
• Major challenges associated with:
  – coordinated use of distributed computing resources
  – remote software development and physics analysis
  – communication and collaboration at a distance
• R&D: a new form of distributed system – the Data Grid

Slide 5: The LHC Computing Challenge – by example
• Consider a UK group searching for the Higgs particle in an LHC experiment
  – Data flows off the detectors at 40 TB/s (30 million floppies per second)!
    • A rejection factor of c. 5×10^5 is applied online before writing to media
    • But we have to be sure we are not throwing away the physics along with the background
    • We need to simulate samples to exercise the rejection algorithms
  – Simulation samples will be created around the world; common access is required
  – After one year, a 1 PB sample of experimental events is stored on media
    • The initial analysed sample will be at CERN, in due course elsewhere
  – The UK has particular detector expertise (CMS: e–, e+, γ)
  – Apply our expertise: access the 1 PB of experimental data (located where?), re-analyse the e.m. signatures (where?) to select c. 1 in 10^4 Higgs candidates – but the S/N will be c. 1 to 20 (continuum background) – and store the results (where?)
• Also: access some simulated samples (located where?), generate additional samples (where?), store them (where?) – and do the PHYSICS (where?)
• In addition: strong competition
• Desire to implement the infrastructure in a generic way

Slide 6: Proposed Solution to the LHC Computing Challenge (?)
• A data analysis 'Grid' for High Energy Physics
• [Diagram: a tiered hierarchy with CERN at the centre, Tier 1 regional centres, Tier 2 (T2) centres, and Tier 3/4 sites below them]

Slide 7: Access Patterns
• A typical particle physics experiment in 2000–2005: one year of acquisition and analysis of data
• Data products (per year):
  – Raw data: ~1000 TB
  – Reconstructed data (Reco-V1, Reco-V2): ~1000 TB each
  – ESD (V1.1, V1.2, V2.1, V2.2): ~100 TB each
  – AOD sets: ~10 TB each
• Access rates (aggregate, average):
  – 100 Mbyte/s (2–5 physicists)
  – 500 Mbyte/s (5–10 physicists)
  – 1000 Mbyte/s (~50 physicists)
  – 2000 Mbyte/s (~150 physicists)

Slide 8: Hierarchical Data Grid
• Physical
  – Efficient network/resource use: local > regional > national > oceanic
• Human
  – University/regional computing complements the national labs, which in turn complement the accelerator site
  – Easier to leverage resources, maintain control and assert priorities at the regional/local level
  – Effective involvement of scientists and students independently of location
• The 'challenge for UK particle physics' – how do we:
  – go from today's maximum of a 200-PC99 farm to a 10000-PC99 centre?
  – connect to and participate in the European and worldwide particle physics grid?
  – write the applications needed to operate within this hierarchical grid?
  – AND ensure that other disciplines are able to work with us, that our developments and applications are made available to others, that expertise is exchanged, and that we enjoy fruitful collaboration with computer scientists and industry?
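As a quick cross-check of the figures quoted on Slides 5 and 7, the sketch below works through the arithmetic in Python. The ~10^7 seconds of live data-taking per year is an assumed, conventional figure and does not appear on the slides.

```python
# Back-of-envelope check of the numbers quoted on Slides 5 and 7.
# Assumption (not on the slides): ~1e7 seconds of live data-taking per year,
# a conventional figure for an accelerator year.

TB = 1e12   # bytes
PB = 1e15   # bytes

detector_rate = 40 * TB            # bytes/s coming off the detectors
online_rejection = 5e5             # rejection factor applied before writing to media
live_seconds_per_year = 1e7        # assumed live time per year

rate_to_media = detector_rate / online_rejection
yearly_volume = rate_to_media * live_seconds_per_year

print(f"Rate written to media: {rate_to_media / 1e6:.0f} Mbyte/s")
print(f"Stored per year:       {yearly_volume / PB:.1f} PB")
# -> roughly 80 Mbyte/s and ~0.8 PB/year, consistent with the ~1 PB
#    experimental sample quoted on Slide 5.
```

The result is also consistent with the ~1000 TB of raw data per year shown on Slide 7.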
Slide 9: Quantitative Requirements
• Start with a typical experiment's computing model
• UK Tier-1 Regional Centre specification
• Then consider the implications for a UK Particle Physics Computational Grid
  – over the years 2000, 2001, 2002, 2003
  – a Joint Infrastructure Bid has been made for resources to cover this
  – estimates of costs
• Look further ahead

[Slides 10–17: detailed material (presumably the quantitative requirements and cost estimates referred to above) not captured in this text version]

Slide 18: Steering Committee
'Help establish the Particle Physics Grid activities in the UK'
a. An interim committee is to be put in place.
b. Its immediate objectives are to prepare for the presentation to John Taylor on 27 March 2000, and to coordinate the EU 'Work Package' activities for 14 April.
c. After discharging these objectives, the membership will be reconsidered.
d. The next action of the committee will be to refine the Terms of Reference (presented to the meeting on 15 March).
e. After that, the Steering Committee will be charged with commissioning a Project Team to coordinate the Grid technical work in the UK.
f. The interim membership is:
  • Chairman: Andy Halley
  • Secretary: Paul Jeffreys
  • Tier 2 reps: Themis Bowcock, Steve Playfer
  • CDF: Todd Hoffmann
  • D0: Iain Bertram
  • CMS: David Britton
  • BaBar: Alessandra Forti
  • CNAP: Steve Lloyd
  – The 'labels' against the members are not official in any sense at this stage; the members are simply intended to cover these areas approximately!

Slide 19: UK Project Team
• Need to really get under way!
• System managers are crucial!
• PPARC needs to see genuine plans and genuine activities…
• We must coordinate our activities
• And:
  – fit in with CERN activities
  – meet the needs of the experiments (BaBar, CDF, D0, …)
• So… go through the range of options and then discuss…

Slide 20: EU Bid (1)
• A bid will be made to the EU to link national grids
  – the "process" has become more than 'just a bid'
• We have almost reached the point where one must be an active participant in the EU bid, and its associated activities, in order to access data from CERN in the future
• Decisions need to be taken today…
• Timescale:
  – 7 March: workshop at CERN to prepare the programme of work (RPM)
  – 17 March: editorial meeting to look for industrial partners
  – 30 March: outline of the paper used to obtain pre-commitment of partners
  – 17 April: finalise the 'Work Packages' – see the next slides
  – 25 April: final draft of the proposal
  – 1 May: final version of the proposal for signature
  – 7 May: submit

Slide 21: EU Bid (2)
• The bid was originally for 30 MECU, with a matching contribution from the national funding organisations
  – now scaled down, possibly to 10 MECU
  – possibly as a 'taster' before a follow-up bid?
  – EU funds for Grid activities in Framework VI are likely to be larger
• Work Packages have been defined
  – the objective is that countries (through named individuals) take responsibility for splitting up the work and defining the deliverables within each package, to generate draft content for the EU bid
  – BUT
    • without doubt the same people will be well positioned to lead the work in due course
    • … and funds split accordingly??
    • considerable manoeuvring!
  – UK: we need to establish priorities and decide where to contribute…
Slide 22: Work Packages
Middleware
  1. Grid Work Scheduling – Cristina Vistoli (INFN)
  2. Grid Data Management – Ben Segal (CERN)
  3. Grid Application Monitoring – Robin Middleton (UK)
  4. Fabric Management – Tim Smith (CERN)
  5. Mass Storage Management – Olof Barring (CERN)
Infrastructure
  6. Testbed and Demonstrators – François Etienne (IN2P3)
  7. Network Services – Christian Michau (CNRS)
Applications
  8. HEP Applications – Hans Hoffmann (4 expts)
  9. Earth Observation Applications – Luigi Fusco
  10. Biology Applications – Christian Michau
Management
  11. Project Management – Fabrizio Gagliardi (CERN)
• Robin is a 'place-holder', holding the UK's interest (explanation in the Open Session)

Slide 23: UK Participation in Work Packages
MIDDLEWARE
  1. Grid Work Scheduling
  2. Grid Data Management – TONY DOYLE, Iain Bertram?
  3. Grid Application Monitoring – ROBIN MIDDLETON, Chris Brew
  4. Fabric Management
  5. Mass Storage Management – JOHN GORDON
INFRASTRUCTURE
  6. Testbed and demonstrators
  7. Network Services – PETER CLARKE, Richard Hughes-Jones
APPLICATIONS
  8. HEP Applications

Slide 24: PPDG – the Particle Physics Data Grid
(Slide from Richard P. Mount, SLAC: DoE NGI Program PI Meeting, October 1999)
• PPDG as an NGI problem – PPDG goals: the ability to query and partially retrieve hundreds of terabytes across wide-area networks within seconds, making effective data analysis possible from ten to one hundred US universities
• PPDG is taking advantage of NGI services in three areas:
  – Differentiated services: to allow particle-physics bulk data transport to coexist with interactive and real-time remote collaboration sessions, and other network traffic
  – Distributed caching: to allow rapid data delivery in response to multiple "interleaved" requests
  – "Robustness" – matchmaking and request/resource co-scheduling: to manage workflow, use computing and network resources efficiently, and achieve high throughput

Slide 25: PPDG Resources
(Slide from Richard P. Mount, 'Data Analysis for SLAC Physics', CHEP 2000)
• Network testbeds:
  – ESNET links at up to 622 Mbit/s (e.g. LBNL–ANL)
  – other testbed links at up to 2.5 Gbit/s (e.g. Caltech–SLAC via NTON)
• Data and hardware:
  – tens of terabytes of disk-resident particle physics data (plus hundreds of terabytes of tape-resident data) at the accelerator labs
  – dedicated terabyte university disk cache
  – gigabit LANs at most sites
• Middleware developed by collaborators:
  – many components needed to meet the short-term targets (e.g. Globus, SRB, MCAT, Condor, OOFS, NetLogger, STACS, mass storage management) have already been developed by collaborators
• Existing achievements of collaborators:
  – WAN transfer at 57 Mbyte/s
  – single-site database access at 175 Mbyte/s
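To put the figures on Slide 25 in context, here is a small illustrative Python sketch of how long bulk transfers take at those rates. It assumes the links can actually be driven at the quoted figures, with no protocol, disk or tape overheads, and the 10 TB dataset size is simply borrowed from the AOD volumes on Slide 7.

```python
# Rough transfer-time estimates at the rates quoted on Slide 25 (illustrative only).

TB = 1e12  # bytes

def hours(volume_bytes, rate_bytes_per_s):
    """Time in hours to move a volume at a given sustained rate."""
    return volume_bytes / rate_bytes_per_s / 3600

rates = {
    "achieved WAN transfer (57 Mbyte/s)":      57e6,
    "622 Mbit/s ESNET link (~78 Mbyte/s)":     622e6 / 8,
    "2.5 Gbit/s testbed link (~313 Mbyte/s)":  2.5e9 / 8,
}

dataset = 10 * TB  # e.g. one of the ~10 TB AOD sets from Slide 7

for label, rate in rates.items():
    print(f"{label}: {hours(dataset, rate):.0f} h to move 10 TB")
# Even on the fastest testbed link, 10 TB takes most of a working day; at the
# achieved WAN rate it is about two days -- hence the emphasis on distributed
# caching and partial retrieval rather than bulk copying.
```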
Slide 26: PPDG First Year Milestones
(Slide from Richard P. Mount, SLAC: DoE NGI Program PI Meeting, October 1999)
• Project start – August 1999
• Decision on the existing middleware to be integrated into the first-year Data Grid – October 1999
• First demonstration of high-speed site-to-site data replication – January 2000
• First demonstration of multi-site cached file access (3 sites) – February 2000
• Deployment of high-speed site-to-site data replication in support of two particle-physics experiments – July 2000
• Deployment of multi-site cached file access in partial support of at least two particle-physics experiments – August 2000

Slide 27: PPDG First Year "System" Components
(Slide from Richard P. Mount, SLAC: DoE NGI Program PI Meeting, October 1999)
Middleware components (initial choice) – see the PPDG proposal, page 15:
  – Object- and file-based application services: Objectivity/DB (SLAC-enhanced); GC Query Object, Event Iterator, Query Monitor; FNAL SAM system
  – Resource management: start with human intervention (but begin to deploy resource discovery & management tools)
  – File access service: components of OOFS (SLAC)
  – Cache manager: GC Cache Manager (LBNL)
  – Mass storage manager: HPSS, Enstore, OSM (site-dependent)
  – Matchmaking service: Condor (U. Wisconsin)
  – File replication index: MCAT (SDSC)
  – Transfer cost estimation service: Globus (ANL)
  – File fetching service: components of OOFS
  – File mover(s): SRB (SDSC); site-specific
  – End-to-end network services: Globus tools for QoS reservation
  – Security and authentication: Globus (ANL)

Slide 28: LHCb Contribution to the EU Proposal – HEP Applications Work Package
• Grid testbed in 2001, 2002
• Production of 10^6 simulated b → D*π decays
  – create 10^8 events at the Liverpool MAP facility in 4 months
  – transfer 0.62 TB to RAL
  – RAL dispatches AOD and TAG datasets to the other sites
    • 0.02 TB to Lyon and CERN
• Then permit a study of all the various options for performing a distributed analysis in a Grid environment

Slide 29: American Activities
• Collaboration with Ian Foster
  – transatlantic collaboration using Globus
• Networking
  – QoS tests with SLAC
  – also link in with Globus?
• CDF and D0
  – a real challenge to 'export data'
  – have to implement a 4 Mbit/s connection
  – have to set up a mini-Grid
• BaBar
  – distributed Linux farms etc. in the JIF bid

Slide 30: Networking Proposal – 1
Details:
  – Single site-to-site tests at low rates to set up the technologies and gain experience; these should include, and benefit, the experiments which will be taking data:
    • Tier-0 → Tier-1 (CERN–RAL)
    • Tier-1 → Tier-1 (FNAL–RAL)
    • Tier-1 → Tier-2 (RAL–Liverpool, Glasgow/Edinburgh)
  – Multi-site file replication, cascaded file replication at modest rates
  – Transfers at Neo-GRID rates
  – Use existing monitoring tools; adapt them to function as resource predictors also
Dependencies/risks:
  – Availability of temporary PVCs on inter- and intra-national WANs – or from collaborating industries; needs negotiation now
  – Monitoring expertise/tools already available: PPNCG (UK), ICFA (worldwide)
Resources required:
  – 1.5 SY
  – 10 Mbit/s PVCs between sites in 2000–01; 50 Mbit/s PVCs between sites in 2001–02; >100 Mbit/s PVCs in 2002–03
Milestones:
  – Jan 2001: demonstration of low-rate transfers between all sites; demonstration of high-rate site-to-site file replication
  – Jan 2002: demonstration of cascaded file transfer; demonstration of sustained modest-rate transfers
  – 2003: implementation of sustained transfers of real data at rates approaching 1000 Mbit/s
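As a purely illustrative sketch of how two of the Slide 27 components – a file replication index and a transfer cost estimation service – might fit together, the Python fragment below picks the cheapest replica of a logical file. All file names, sites, sizes and rates are invented for illustration; the real PPDG components (MCAT, Globus) are not modelled here.

```python
# Toy "file replication index" plus "transfer cost estimation service"
# in the spirit of Slide 27. Everything below is hypothetical.

# Hypothetical replica index: logical file name -> {site: size in bytes}.
replica_index = {
    "aod/higgs-candidates.db": {"CERN": 2e12, "RAL": 2e12, "FNAL": 2e12},
}

# Hypothetical measured throughput (bytes/s) from each site to the requesting UK site.
rate_to_uk = {"CERN": 6e6, "RAL": 40e6, "FNAL": 3e6}

def transfer_cost(site, size_bytes):
    """Estimated transfer time in seconds: size divided by the measured rate."""
    return size_bytes / rate_to_uk[site]

def best_replica(logical_name):
    """Pick the site whose copy has the lowest estimated transfer cost."""
    copies = replica_index[logical_name]
    return min(copies, key=lambda site: transfer_cost(site, copies[site]))

site = best_replica("aod/higgs-candidates.db")
size = replica_index["aod/higgs-candidates.db"][site]
print(f"Fetch from {site}: estimated {transfer_cost(site, size) / 3600:.1f} h")
```

A real matchmaking service would of course also weigh load, queue state and policy, but the cost-driven replica choice is the core idea.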
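The LHCb exercise on Slide 28 can also be checked against the PVC rates proposed on Slide 30. The sketch below assumes, purely for illustration, that the 0.62 TB is either spread evenly over the four-month production period (taken as ~120 days) or shipped in a single transfer; neither assumption comes from the slides.

```python
# Rough check that the LHCb exercise on Slide 28 fits within the PVC rates
# proposed on Slide 30. The ~120-day production period is an assumption.

Mbit = 1e6  # bits

volume_bits = 0.62e12 * 8                # 0.62 TB to move from Liverpool to RAL
production_seconds = 120 * 24 * 3600     # ~4 months of MAP production

sustained = volume_bits / production_seconds
print(f"Sustained rate if spread over the production period: {sustained / Mbit:.1f} Mbit/s")

burst_days = volume_bits / (10 * Mbit) / 86400
print(f"Time for a single transfer on a 10 Mbit/s PVC: {burst_days:.1f} days")
# -> roughly 0.5 Mbit/s sustained, or just under 6 days as one bulk transfer,
#    so the 10 Mbit/s PVCs proposed for 2000-01 already cover this exercise.
```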
Slide 31: Networking – 2: Differentiated Services
Details:
  – Deploy some form of DiffServ on dedicated PVCs; measure high- and low-priority latency and rates as a function of strategy and load
  – Attempt to deploy end-to-end QoS across several interconnected networks
Dependencies/risks:
  – PVCs must be QoS-capable; may rely upon proprietary or technology-dependent factors in the short term
  – End-to-end QoS across the WAN depends upon expected developments by the network suppliers
Resources required:
  – 1.5 SY
  – same PVCs as in NET-1
Milestones:
  – Apr 2001: successful deployment and measurement of pilot QoS on PVCs under project control
  – Production deployment of QoS on the WAN
  – Monitoring tools [depends upon QoS developments]

Slide 32: Networking – 3: Monitoring and Metrics for Resource Prediction
Details:
  – Survey and define the monitoring requirements of the Grid
  – Adapt existing monitoring tools for the measurement and monitoring needs of the network work packages (all NET-xx) as described here
  – In particular, develop protocol-sensitive monitoring, as will be needed for QoS
  – Develop and test prediction metrics
Dependencies/risks:
  – existing PPNCG monitoring
  – existing ICFA monitoring
Resources required:
  – 0.5 SY
Milestones:
  – Dec 2000: interim report on Grid monitoring requirements
  – Jul 2001, Dec 2001: adaptation of existing tools for monitoring completed
  – Jul 2002: first prototype predictive tools deployed
  – Dec 2002: report on tests of predictive tools

Slide 33: Networking – 4: Data Flow Modelling
Details:
  – Assimilate the MONARC modelling tool set
  – Determine the requirements of a model of the UK Grid – and to what extent this factorises, or not, from the international Grid
  – Determine the scope of work needed to adapt/provide components; appraise the work needed to adapt/write the necessary components
  – Configure and run models in parallel with the transfer tests (NET-1) and QoS tests (NET-2) for calibration purposes
  – Apply the models to the determination of Grid topology and resource location
Dependencies/risks:
  – applicability of the existing tools is unknown before appraisal
Resources required:
  – 3 SY
Milestones:
  – Oct 2000: assimilate MONARC
  – Dec 2000: determine the requirements of the Grid model
  – ??: configure the initial model

Slide 34: Pulling it together…
• Networking:
  – EU work package
  – existing tests
  – integration of the ICFA studies into the Grid
• Will networking lead the non-experiment activities??
• Data storage
  – EU work package
• Grid application monitoring
  – EU work package
• CDF, D0 and BaBar
  – need to integrate these into the Grid activities
  – best approach is to centre on the experiments

Slide 35: …Pulling it all together
• Experiment-driven
  – like LHCb, meet specific objectives
• Middleware preparation
  – set up Globus?
    • QMW, RAL, DL ..?
  – authenticate
  – familiarisation
  – try moving data between sites
    • resource specification
    • collect dynamic information
    • try with international collaborators
  – learn about the alternatives to Globus
  – understand what is missing
  – exercise and measure the performance of distributed caching
• What do you think?
• Anyone like to work with Ian Foster for 3 months?!
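Slide 32 calls for adapting monitoring tools to act as resource predictors. As one hedged illustration of what the very simplest predictor could look like – not a technique proposed anywhere in the slides – the Python sketch below smooths recent throughput measurements with an exponentially weighted moving average; the sample values are invented.

```python
# A deliberately simple sketch of a throughput predictor built from monitoring
# samples, in the spirit of Slide 32 ("adapt existing tools to function as
# resource predictors"). The EWMA and the sample values are illustrative
# assumptions, not anything taken from the slides.

def predict_next(samples, alpha=0.3):
    """One-step-ahead prediction via an exponentially weighted moving average."""
    estimate = samples[0]
    for value in samples[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate

# Hypothetical hourly throughput measurements on a CERN-RAL PVC, in Mbit/s.
measurements = [8.2, 7.9, 6.5, 7.1, 7.8, 5.9, 6.4]
print(f"Predicted next-hour throughput: {predict_next(measurements):.1f} Mbit/s")
```

Anything of this kind would sit on top of the existing PPNCG and ICFA monitoring rather than replace it; the prediction metrics themselves are exactly what the NET-3 work package is meant to develop and test.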