BNCOD, Coventry
Databases and the Grid: Who Challenges Whom
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
15th July 2003

Outline
• What is e-Science?
• Grids, Collaboration, Virtual Organisations
• Structured Data at its Foundation
• Key Uses of Distributed Data Resources
• Challenges
• Scientific Digital Data Curation
• Data Access & Integration
• OGSA-DAI: Progress and Plans
• DAIS-WG
• Composition of Analysis & Interpretation

It’s Easy to Forget How Different 2003 is From 1993
• Enormous quantities of data: Petabytes
  • For an increasing number of communities, gating step is not collection but analysis
• Ubiquitous Internet: >100 million hosts
  • Collaboration & resource sharing the norm
• Ultra-high-speed networks: >10 Gb/s
  • Global optical networks
• Huge quantities of computing: >100 Top/s
  • Moore’s law gives us all supercomputers
• Ubiquitous computing
  • Moore’s law everywhere
  • Instruments, detectors, sensors, scanners, …
Derived from Ian Foster’s slide at SSDBM, July 03

Foundation for e-Science
• e-Science methodologies will rapidly transform science, engineering, medicine and business
  • driven by exponential growth (×1000/decade)
  • enabling a whole-system approach
[Diagram labels: computers, software, Grid, sensor nets, instruments, colleagues, Shared data archives]
Diagram derived from Ian Foster’s slide

e-Science and SR2002
Research Council         2004-6        2001-4
Medical                  £13.1M        (£8M)
Biological               £10.0M        (£8M)
Environmental            £8.0M         (£7M)
Eng & Phys               £18.0M        (£17M)
HPC                      £2.5M         (£9M)
Core Prog.               £16.2M + ?    (£15M) + £20M
Particle Phys & Astro    £31.6M        (£26M)
Economic & Social        £10.6M        (£3M)
Central Labs             £5.0M         (£5M)

NeSC in the UK
• National e-Science Centre
• HPC(x)
• Directors’ Forum
  • Helped build a community
• Engineering Task Force
  • Grid Support Centre
• Architecture Task Force
  • UK Adoption of OGSA
  • OGSA Grid Market
  • Workflow Management
• Database Task Force
  • OGSA-DAI
  • GGF DAIS-WG
• e-SI Programme
  • training, coordination, community building, workshops, pioneering
• GridNet
• e-Storm
[Map labels: Glasgow, Edinburgh, Newcastle, Belfast, Manchester, Daresbury Lab, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton]

www.nesc.ac.uk

UK Grid
• Currently a Level-2 Grid based on GT2
• Transition to OGSI/OGSA will prove worthwhile
• There are still issues to be resolved
  • OGSA definition / delivery
  • Hosting environments & Platforms
  • Combinations of Services supported
  • Material and grids to support adopters
• A schedule of transitions should be (approximately & provisionally) published
• Expected time line
  • Now: GT2 L2 service + GT3 M/W development & evaluation
  • Q3 & Q4 2003: GT2 L3 + GT3 L1
  • Q1 & Q2 2004: significant project transitions to GT3 L2/L3
  • Late Q4 2004: most projects have transitioned + end GT2 L3

Three-way Alliance
Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies
• Theory, Models & Simulations → Shared Data
• Experiment & Advanced Data Collection → Shared Data
• Computing Science: Systems, Notations & Formal Foundation
• Requires Much Engineering, Much Innovation
• Changes Culture, New Mores, New Behaviours → Process & Trust
• New Opportunities, New Results, New Rewards

Biochemical Pathway Simulator
(Computing Science, Bioinformatics, Beatson Cancer Research Labs)
Closing the information loop – between lab and computational model.
DTI Bioscience Beacon Project
Harnessing Genomics Programme
Slide from Muffy Calder, Glasgow

[Images: UCSF, UIUC]
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

Emergence of Global Knowledge Communities
• Teams organised around common goals
  • Communities: “Virtual organisations”
• Overlapping memberships, resources and activities
• Essential diversity is a strength & challenge
  • membership & capabilities
• Geographic and political distribution
  • No location/organisation/country possesses all required skills and resources
• Dynamic: adapt as a function of their situation
  • Adjust membership, reallocate responsibilities, renegotiate resources
Slide derived from Ian Foster’s SSDBM 03 keynote

The Emergence of Global Knowledge Communities
Slide from Ian Foster’s SSDBM 03 keynote

Global Knowledge Communities Often Driven by Data: E.g., Astronomy
No. & sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins

Wellcome Trust: Cardiovascular Functional Genomics
[Diagram labels: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands, BRIDGES, IBM, Shared data, Public curated data]

Database-mediated Communication
[Diagram labels: Experimentation Communities, Simulation Communities, Analysis & Theory Communities, Curated & Shared Databases, Data carries knowledge, Discoveries]

global in-flight engine diagnostics
• 100,000 engines × 2-5 Gbytes/flight × 5 flights/day = up to 2.5 petabytes/day
[Diagram labels: in-flight data, airline, ground station, global network eg SITA, DS&S Engine Health Center, internet, e-mail, pager, maintenance centre, data centre]
Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York

Database Growth
[Charts: Bases 39,856,567,747; PDB Content Growth]

Distributed Structured Data
• Key to Integration of Scientific Methods
• Key to Large-scale Collaboration
• Many Data Resources
  • Independently managed
  • Geographically distributed
  • Primary Data, Data Products, Meta Data, Administrative data, …
• Discovery and Decisions!
  • Extracting nuggets from multiple sources
  • Combining them using sophisticated models
  • Analysis on scales required by statistics
  • Repeated Processes

Tera → Peta Bytes
                         Terabyte                  Petabyte
RAM time to move         15 minutes                2 months
1Gb WAN move time        10 hours ($1000)          14 months ($1 million)
Disk Cost                7 disks = $5000 (SCSI)    6800 Disks + 490 units + 32 racks = $7 million
Disk Power               100 Watts                 100 Kilowatts
Disk Weight              5.6 Kg                    33 Tonnes
Disk Footprint           Inside machine            60 m²
May 2003, Approximately Correct
See also Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24

Mohammed & Mountains
• Petabytes of Data cannot be moved
  • It stays where it is produced or curated
    • Hospitals, observatories, European Bioinformatics Institute, …
• Distributed collaborating communities
  • Expertise in curation, simulation & analysis
• Distributed & diverse data collections
• Discovery depends on insights
  • ⇒ Unpredictable sophisticated application code
  • Tested by combining data from many sources
  • Using sophisticated models & algorithms
• What can you do?

Dynamically Move computation to the data
• Assumption: code size << data size
• Develop the database philosophy for this?
  • Queries are dynamically re-organised & bound
• Develop the storage architecture for this?
  • Compute closer to disk?
    • Dave Patterson, Seattle, SIGMOD 98
  • System on a Chip using free space in the on-disk controller
  • Data Cutter a step in this direction
• Develop the sensor & simulation architectures for this?
• Safe hosting of arbitrary computation
  • Proof-carrying code for data and compute intensive tasks + robust hosting environments
• Provision combined storage & compute resources
• Decomposition of applications
  • To ship behaviour-bounded sub-computations to data
• Co-scheduling & co-optimisation
  • Data & Code (movement), Code execution
• Recovery and compensation

The Story so Far
• Technology enables Grids, More Data & …
• Information Grids will dominate
• Collaboration essential
  • Combining approaches
  • Combining skills
  • Sharing resources
• (Structured) Data is the language of Collaboration
  • Data Access & Integration a Ubiquitous Requirement
• Many hard technical challenges
  • Scale, heterogeneity, distribution, dynamic variation
• Intimate combinations of data and computation
  • With unpredictable (autonomous) development of both

Outline
• What is e-Science?
• Grids, Collaboration, Virtual Organisations
• Structured Data at its Foundation
• Key Uses of Distributed Data Resources
• Challenges
• Scientific Digital Data Curation
• Data Access & Integration
• OGSA-DAI: Progress and Plans
• DAIS-WG
• Composition of Analysis & Interpretation

Science as Workflow
• Data integration = the derivation of new data from old, via coordinated computation
  • May be computationally demanding
  • The workflows used to achieve integration are often valuable artifacts in their own right
  • May be Data Access & Movement Demanding
    • Obtaining data from files and DBs, transfer between computations, deliver to DBs and File stores
• Thus we must be concerned with how we
  • Build workflows
  • Share and reuse workflows
  • Explain workflows
  • Schedule workflows
• Consider also DBs & (Autonomous) Updates
Slide derived from Ian Foster’s SSDBM 03 keynote

Sloan Digital Sky Survey Production System
Slide from Ian Foster’s SSDBM 03 keynote

OGSA-DAI
• First steps towards a generic framework for integrating data access and computation
• Using the grid to take specific classes of computation nearer to the data
• Kit of parts for building tailored access and integration applications

DAIS-WG
• Specification of Grid Data Services
• Chairs
  • Norman Paton, Manchester University
  • Dave Pearson, Oracle
• Current Spec. Draft Authors
  • Mario Antonioletti
  • Neil P Chue Hong
  • Susan Malaika
  • Simon Laws
  • Norman W Paton
  • Malcolm Atkinson
  • Amy Krause
  • Gavin McCance
  • James Magowan
  • Greg Riccardi

Draft Specification for GGF 7

Conceptual Model: External Universe
[Diagram labels: External data resource, External data resource, DBMS, DB, Data set, ResultSet]

Conceptual Model: DAI Service Classes
[Diagram labels: Data resource, Data resource, DBMS, DB, Data activity session, Data request, Data set, ResultSet]

OGSA-DAI Partners
[Map labels: IBM USA, EPCC & NeSC, Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Oxford, Oracle, Cambridge, Hinxton, RAL, Cardiff, London, IBM Hursley, Southampton]
• $5 million, 20 months, started February 2002
• Additional 24 months, starts October 2003

Infrastructure Architecture
[Layered diagram, top to bottom]
• Data Intensive X Scientists
• Data Intensive Applications for Science X
• Simulation, Analysis & Integration Technology for Science X
• Generic Virtual Data Access and Integration Layer
• OGSA: Job Submission, Brokering, Registry, Banking, Data Transport, Workflow, Structured Data Integration, Authorisation, Resource Usage, Transformation, Structured Data Access
• OGSI: Interface to Grid Infrastructure
• Compute, Data & Storage Resources
• Structured Data: Relational, XML, Semi-structured
[Side labels: Distributed, Virtual Integration Architecture]

Data Access & Integration Services
[Diagram components: Client, Registry, Factory, Grid Data Service, XML / Relational database]
[Arrow legend: SOAP/HTTP, service creation, API interactions]
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML

Future DAI Services
[Diagram components: Client, Analyst, Problem Solving Environment, “scientific” Application coding scientific insights, Application Code, Data Registry, Data Access & Integration master, Semantic Meta data, GDS, GDS1, GDS2, GDS3, GDTS, GDTS1, GDTS2, Sx (XML database), Sy (Relational database)]
[Arrow legend: SOAP/HTTP, service creation, API interactions]
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits sequence of scripts, each with a set of queries, to GDS with XPath, SQL, etc
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation

A New World
• What Architecture will Enable Data & Computation Integration?
  • Common Conceptual Models
  • Common Planning & Optimisation
  • Common Enactment of Workflows
  • Common Debugging
  • …
• What Fundamental CS is needed?
  • Trustworthy code & Trustworthy evaluators
  • Decomposition and Recomposition of Applications
  • …
• Is there an evolutionary path?

Take Home Message
• There is plenty for DB researchers to do
  • Workflow & DB integration, co-optimised
  • Distributed Queries on a global scale
  • Heterogeneity on a global scale
  • Dynamic variability
    • Authorisation, Resources, Data & Schema
    • Performance
  • Some Massive Data
  • Metadata for discovery, automation, repetition, …
• Grasp the theoretical & practical challenges
  • Working in Open & Dynamic systems
  • Incorporate all computation
  • Welcome code into your most critical places

www.ogsadai.org.uk
www.nesc.ac.uk
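
Illustrative sketch: Data Access & Integration Services, steps 1a to 3c
The Registry, Factory and Grid Data Service interaction described in the Data Access & Integration Services slide can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the class and method names (Registry, Factory, GridDataService, register, find, create_service, perform) are hypothetical and are not the OGSA-DAI API, plain method calls stand in for SOAP/HTTP, and an in-memory SQLite table with made-up rows stands in for the XML / relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def perform(self, sql, params=()):
        # Step 3b: the GDS interacts with the database for the client.
        cursor = self._connection.execute(sql, params)
        columns = [description[0] for description in cursor.description]
        # Step 3c: the results are returned to the client as XML.
        root = ET.Element("ResultSet")
        for row in cursor:
            row_element = ET.SubElement(root, "Row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory: creates one GDS per request (steps 2a to 2c)."""

    def __init__(self, open_database):
        self._open_database = open_database

    def create_service(self):
        # Step 2b: create a GDS to manage access; step 2c: return its handle.
        return GridDataService(self._open_database())


class Registry:
    """Hypothetical registry: maps a topic to a factory (steps 1a and 1b)."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        # Step 1b: respond with the factory handle for the requested topic.
        return self._factories[topic]


def open_star_catalogue():
    # Assumed sample data; a real deployment would wrap an existing database.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE stars (name TEXT, magnitude REAL)")
    connection.executemany(
        "INSERT INTO stars VALUES (?, ?)",
        [("Sirius", -1.46), ("Vega", 0.03)],
    )
    return connection


registry = Registry()
registry.register("stars", Factory(open_star_catalogue))

factory = registry.find("stars")        # steps 1a, 1b
service = factory.create_service()      # steps 2a, 2b, 2c
result_xml = service.perform(           # steps 3a, 3b, 3c
    "SELECT name FROM stars WHERE magnitude < ?", (0,)
)
print(result_xml)
```

The client only ever holds handles: it never opens the database itself, which is the point of routing access through a factory-created service.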