Chinese Delegation Visit: High Performance Computer Mission
UK e-Science & The National e-Science Centre
Prof. Malcolm Atkinson, Director
www.nesc.ac.uk
22nd October 2003

Outline
- What is e-Science?
- Our Role in Enabling e-Science
- Information Grids & Data Dominates

Foundation for e-Science
- e-Science methodologies will rapidly transform science, engineering, medicine and business
- Driven by exponential growth (×1000/decade)
- Enabling a whole-system approach: computers, software, Grid, sensor nets, instruments, colleagues, shared data archives
- Diagram derived from Ian Foster's slide

e-Science and SR2002

Research Council         2001-4        2004-6
Medical                  £8M           £13.1M
Biological               £8M           £10.0M
Environmental            £7M           £8.0M
Eng & Phys               £17M          £18.0M
HPC                      £9M           £2.5M
Core Prog.               £15M + £20M   £16.2M + ?
Particle Phys & Astro    £26M          £31.6M
Economic & Social        £3M           £10.6M
Central Labs             £5M           £5.0M

NeSC in the UK
- Sites: Edinburgh (National e-Science Centre, HPC(x), Globus Alliance), Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Cambridge, Hinxton, Oxford, Cardiff, RAL, London, Southampton
- Helped build a community: Directors' Forum, Engineering Task Force, Grid Support Centre, Architecture Task Force, UK Adoption of OGSA, OGSA Grid Market, Workflow Management, Database Task Force, OGSA-DAI, GGF DAIS-WG

Events held in the 2nd Year (from 1 Aug 2002 to 31 Jul 2003)
- We have had 86 events (Year 1 figures in brackets):
  - 11 project meetings (4)
  - 11 research meetings (7)
  - 25 workshops (17 + 1)
  - 2 "summer" schools (0)
  - 15 training sessions (8)
  - 12 outreach events (3)
  - 5 international meetings (1)
  - 5 e-Science management meetings (7)
  - (though the definitions are fuzzy!)
- > 3600 participant days
- Establishing a training team
- Investing in community building, skill generation & knowledge development
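The ×1000-per-decade growth figure above is equivalent to a doubling roughly every 12 months, since 2^10 = 1024. A quick illustrative check in Python (not from the slides):

```python
# The slides cite exponential growth of x1000 per decade.
# That corresponds to an annual growth factor of 1000**(1/10) ~= 1.995,
# i.e. very close to one doubling every 12 months (2**10 = 1024).
annual_factor = 1000 ** (1 / 10)
print(round(annual_factor, 3))  # ~1.995
print(2 ** 10)                  # 1024
```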
Workshops
- Suggestions always welcome: http://www.nesc.ac.uk/events/
- Space for real work; crossing communities
- Creativity: new strategies and solutions
- Written reports
- Topics: Scientific Data Mining, Integration and Visualisation; Grid Information Systems; Portals and Portlets; Virtual Observatory as a Data Grid; Imaging, Medical Analysis and Grid Environments; Grid Scheduling; Provenance & Workflow; GeoSciences & Scottish Bioinformatics Forum

Duties of a NeSC Research Leader (slide from Dr Bob Mann: 30 September 2003)
- Encouraging the uptake of Grid technologies in Astronomy and related fields
- Encouraging visitors with whom you have a research overlap to visit Edinburgh and work with you and other local colleagues
- Organising and running research workshops
- Assisting with the development of new core Grid and scientific database technologies
- Promoting NeSC within the Universities of Edinburgh and Glasgow (for example, through personal presentations) and more widely at conferences and workshops
- …and all that in 0.5 FTE!

Three-way Alliance
- Multi-national, multi-discipline, computer-enabled consortia, cultures & societies
- Theory: models & simulations → shared data
- Experiment & advanced data collection → shared data
- Requires much computing science (engineering, systems, notations & formal foundation) and much innovation
- Changes culture: new mores, new behaviours → process & trust
- New opportunities, new results, new rewards

Global Drivers of e-Science
- Collaboration
- Data deluge
- Digital technology: ubiquity, cost reduction, performance increase
- Consequential investment: UK e-Science £240 million + 80 companies; EU e-Infrastructure; USA cyberinfrastructure; …

It's Easy to Forget How Different 2003 is From 1993
- Enormous quantities of data: petabytes
  - For an increasing number of communities the gating step is not collection but analysis
- Ubiquitous Internet: >100 million hosts
  - Collaboration & resource sharing the norm
  - Security and trust are crucial issues
- Ultra-high-speed networks: >10 Gb/s
  - Global optical networks
  - Bottlenecks: the last kilometre & firewalls
- Huge quantities of computing: >100 Top/s
  - Moore's law gives us all supercomputers; organising their effective use is the challenge
- Moore's law everywhere: instruments, detectors, sensors, scanners, …
  - Organising their effective use is the challenge
- Derived from Ian Foster's slide at SSDBM, July 2003

Global Knowledge Communities Often Driven by Data: E.g., Astronomy
- Number & sizes of data sets as of mid-2002, grouped by wavelength
- 12-waveband coverage of large areas of the sky
- Total about 200 TB of data, doubling every 12 months
- Largest catalogues near 1B objects
- Data and images courtesy of Alex Szalay, Johns Hopkins

Tera → Peta Bytes

                            Terabyte                 Petabyte
RAM time to move            15 minutes               2 months
Move time over 1 Gb/s WAN   10 hours ($1000)         14 months ($1 million)
Disk cost                   7 disks = $5000 (SCSI)   6800 disks + 490 units + 32 racks = $7 million
Disk power                  100 watts                100 kilowatts
Disk weight                 5.6 kg                   33 tonnes
Disk footprint              inside machine           60 m²

(May 2003, approximately correct. See also "Distributed Computing Economics", Jim Gray, Microsoft Research, MSR-TR-2003-24.)

Three Pillars of e-Science Research (diagram)
- Pillars: Foundations, Technology, Applications; apply known results, focus for new work, enable new science
- Contributors: EPCC, DCS (Informatics), ETF & Testbeds, Physics & Astronomy, edikt, Repositories, Computing industry
- Steering of development: Research Departments, Research Institutes, Scottish universities, Commercial customers

Projects (diagram: projects, staff, papers, funding?)
- Deployment: UK Test Grid, L2G, IBM Contract, Local Grid, OGSA Deployment Testbed?
- Research applications: FireGrid, NanoCMOS, Grid Proposals, ODD-Genes?, BioInf, Physics, AstroGrid, NeuroInf, ScotGrid, GridPP, QCDGrid, e-Health
- CS research: Engage, EQUATOR, CoAKTinG, Mobile Code, Scientific Data, Automated Synthesis, AMUSE, DIRC, Ontologies
- Middleware, etc.: MS.NETGrid, FirstDIG, BRIDGES, ODD-Genes, GridWeaver, SunDCG, OGSA-DAI, DAIT, eldas, Grid Core Programme, BinX, edikt, OMII, PGPGrid

Data Access, Integration, Publication, Annotation and Curation (theme: DAIPAC)
- Associated projects and groups, spanning research applications, CS research and infrastructure (diagram): ODD-Genes, BRIDGES, AstroGrid, ScotGrid, GridPP, OGSA-DAI & DAIT, edikt, QCDGrid, SuperCOSMOS, GTI, Mobile Code, Repositories, Mouse Atlas, Scientific Data, DCC Proposal, Linguistic Corpora, ArkDB

Architecture: Distributed Virtual Integration Architecture (layered, top to bottom)
- Data-intensive users: data-intensive applications for science X
- Simulation, analysis & integration technology for science X
- Generic Virtual Data Access and Integration Layer
- OGSA services: job submission, brokering, registry, banking, data transport, workflow, structured data integration, authorisation, resource usage, transformation, structured data access
- OGSI: interface to Grid infrastructure
- Compute, data & storage resources; structured data: relational, XML, semi-structured

Data Services
- GGF Data Access and Integration Services (DAIS): OGSI-compliant interfaces to access relational and XML databases
- Needs to be generalized to encompass other data sources (see next slide)
- Generalized DAIS becomes the foundation for:
  - Replication: data located in multiple locations
  - Federation: composition of multiple sources
  - Provenance: how was data generated?

"OGSA Data Services" (Foster, Tuecke, Unger, eds.)
- Describes a conceptual model for representing all manner of data sources as Web services: databases, filesystems, devices, programs, …
- Integrates WS-Agreement
- A data service is an OGSI-compliant Web service that implements one or more of the base data interfaces: DataDescription, DataAccess, DataFactory, DataManagement
- These would be extended and combined for specific domains (including DAIS)

OGSA-DAI Approach
- Reuse existing technologies and standards: OGSA, query languages, Java, transport
- Build portTypes and services which will enable:
  - controlled exposure of heterogeneous data resources on an OGSI-compliant grid
  - access to these resources via common interfaces, using existing underlying query mechanisms
  - (ultimately) data integration across distributed data resources
- OGSA-DAI (the software) seeks to be a reference implementation of the GGF DAIS-WG standard
  - It can't keep up with frequent standard changes, so software releases track specific drafts
- See http://www.ogsadai.org.uk/ for details

Data Access & Integration Services (diagram; interactions over SOAP/HTTP)
1a. Client requests sources of data about "x" from the Registry
1b. Registry responds with a Factory handle
2a. Client requests access to the database from the Factory
2b. Factory creates a GridDataService (GDS) to manage access
2c. Factory returns the handle of the GDS to the client
3a. Client queries the GDS with XPath, SQL, etc.
3b. GDS interacts with the database (XML or relational)
3c. Results of the query are returned to the client as XML

Third Party Delivery (diagram)
- Data sets are delivered from the requestor's client API/stub directly to a consumer's client API/stub, in four numbered steps

Future DAI Services (diagram; interactions over SOAP/HTTP)
1a. Client requests sources of data about "x" & "y" from the Data Registry
1b. Registry responds with a Factory handle
2a. Client requests access and integration from resources Sx and Sy via the Data Access & Integration master
2c. Factory returns the handle of the GDS to the client
- Context: the client is a Problem Solving Environment in which an analyst's application code and "scientific" coding are turned into scientific insights, supported by a semantic metadata network
2b. Factory creates the GridDataServices
3a. Client submits a sequence of scripts; each has a set of queries to a GDS (XPath, SQL, etc.)
3b. The GDSs (GDS1, GDS2, GDS3) interact with the underlying resources, Sx (XML database) and Sy (relational database), through GDTS components (GDTS1, GDTS2)
3c. Sequences of result sets are returned to the analyst as formatted binary, described in a standard XML notation

Take Home Message
- Data is a major source of challenges AND an enabler of new science, engineering, medicine, planning, …
- Information Grids: there are many opportunities for international collaboration
  - Support for collaboration
  - Support for computation and data grids
- Structured data is fundamental; integrated strategies & technologies are needed
- OGSA-DAI is here now: join in making better DAI services & standards
- Bioinformatics is a priority application area
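The move-time figures in the Tera → Peta Bytes comparison scale by a simple factor of 1000 (1 PB = 1000 TB), so the petabyte column follows from the terabyte column by arithmetic. A quick check of the WAN row (illustrative Python; the 10-hour figure is taken from the slide):

```python
# The Tera -> Peta table scales by 1000x (1 PB = 1000 TB).
# Given the slide's 10-hour figure for moving a terabyte over a 1 Gb/s WAN,
# a petabyte over the same link takes 1000x longer:
hours_per_tb = 10
hours_per_pb = hours_per_tb * 1000        # 10,000 hours
months_per_pb = hours_per_pb / (24 * 30)  # ~13.9, i.e. the "14 months" on the slide
print(round(months_per_pb, 1))
```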
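The Registry → Factory → GridDataService interaction (steps 1a-3c) described earlier can be sketched as plain Python objects. This is an illustrative mock of the message flow only: all class names, method names and the sample query are invented for the sketch and are not the real OGSA-DAI API, which exposes OGSI Grid services over SOAP/HTTP.

```python
# Illustrative mock of the OGSA-DAI interaction pattern (steps 1a-3c).
# Every name here is invented for the sketch; the real OGSA-DAI software
# exposes OGSI Grid services, not local Python objects.

class GridDataService:
    """Step 2b: created by the Factory to manage access to one database."""
    def __init__(self, database):
        self.database = database

    def perform(self, query):
        # Steps 3a-3c: run the query against the backing database and
        # return the result wrapped as an XML-style string.
        rows = self.database.get(query, [])
        return "<results>" + "".join(f"<row>{r}</row>" for r in rows) + "</results>"

class Factory:
    """Step 2: creates GridDataService instances and returns their handles."""
    def __init__(self, database):
        self.database = database
        self.services = {}

    def create_gds(self):
        handle = f"gds-{len(self.services)}"
        self.services[handle] = GridDataService(self.database)
        return handle  # step 2c: handle returned to the client

class Registry:
    """Step 1: maps topics ('data about x') to Factory handles."""
    def __init__(self):
        self.factories = {}

    def register(self, topic, factory):
        self.factories[topic] = factory

    def lookup(self, topic):
        return self.factories[topic]  # step 1b

# Client-side walk-through of the numbered steps:
db = {"SELECT name FROM stars": ["Vega", "Deneb"]}  # stand-in database
registry = Registry()
registry.register("x", Factory(db))

factory = registry.lookup("x")                 # 1a/1b: find a Factory for "x"
handle = factory.create_gds()                  # 2a-2c: obtain a GDS handle
gds = factory.services[handle]
print(gds.perform("SELECT name FROM stars"))   # 3a-3c: query, results as XML
```

The point of the pattern is that the client never talks to the database directly: it discovers a Factory through the Registry, receives a per-session GridDataService handle, and sends all queries through that service.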