The Challenge of Data Integration
Data + Grid = Discovery?
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
22nd January 2003

Overview
- Essentials of e-Science
  - Collaboration
  - Resource Sharing
  - Data Sharing
  - Mutual Dependence
- Essentials of the Grid
  - Distributed Virtual Machine?
- Essentials of Data Sharing
  - Database Research did it?
  - New Challenges
- Data Access & Integration
  - Building Bricks
- Band Wagon v Research Opportunity
  - Thresholds, Visions and Questions

UK e-Science

e-Science and the Grid
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor, Director General of Research Councils, Office of Science and Technology

From presentation by Tony Hey

UK e-Science Investment
[Map of UK e-Science centres]
- National eScience Centre
- HPC(x)
- Locations: Edinburgh, Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Cambridge, Hinxton, Oxford, Cardiff, RAL, London, Southampton
- Projects: > 60 started, > 30 proposed, + EU Projects

UK e-Science Programme (2): 2003 - 2005
[Organisation chart]
- DG Research Councils
- E-Science Steering Committee
- Grid TAG
- Director
  - Director’s Awareness and Co-ordination Role
  - Director’s Management Role
- Generic Challenges: EPSRC (£15m), DTI (£15m)
- Academic Application Support Programme: Research Councils (£74m), DTI (£5m)
  - PPARC (£26m), BBSRC (£8m), MRC (£8m), NERC (£7m), ESRC (£3m), EPSRC (£17m), CLRC (£5m)
- £80m Collaborative projects
- Industrial Collaboration (£40m)

Collaboration
- Growing
  - Hard Problems, Multi-disciplinary, Expense
- Sharing
  - Ideas
  - Thought processes and Stimuli
  - Effort
  - Resources
- Requires
  - Communication
  - Common understanding & Framework
  - Mechanisms for sharing fairly
  - Organisation and Infrastructure
- Scientists have done this for Centuries

Interdependence
Science has relied on experiment and theory
Simulation, Data Mining, Analysis
[Timeline]
- Theory: Greece, 400 BC
- Experiment: Italy, 1500 AD
- Simulation: Europe, 1980 AD
For problems which are:
- too large/small
- too fast/slow
- too complex
- too expensive, unethical, ...
- Testing Understanding

Interdependence
[Diagram labels: Models, Theory, Data, Computing, Data, Experiment]

Database Growth
[Chart: PDB protein structures]

Globus Toolkit® History
[Chart: Downloads per Month from ftp.globus.org, 1997 to 2002; y-axis 0 to 30000]
Does not include downloads from: NMI, UK eScience, EU Datagrid, IBM, Platform, etc.
Chart annotations:
- Physiology of the Grid Paper Released
- Anatomy of the Grid Paper Released
- The Grid: Blueprint for a New Computing Infrastructure published
- DARPA, NSF, and DOE begin funding Grid work
- NASA begins funding Grid work, DOE adds support
- NSF & European Commission Initiate Many New Grid Projects
- Significant Commercial Interest in Grids
- Early Application Successes Reported
- GT 1.0.0 Released
- GT 2.0 Released

Encompassing Vision
[Diagram labels: software, computers, sensor nets, instruments, colleagues, data archives]

People & Industry
[Chart: Global Grid Forum attendance; y-axis 0 to 900]
- Meetings: GGF1 to GGF7
- Dates: Jul 01, Oct 01, Feb 02, Jul 02, Oct 02, Mar 03
- Attendance figures: 260, 220, 400, 900, 450, >1000
- Other events: UK All Hands AHM’02 (Sep 02), GlobusWorld 1 (Jan 03); figures shown: 450, 350
IBM
- This week: “IBM DRIVES GRID COMPUTING FOR COMMERCIAL BUSINESS WITH TEN NEW GRID OFFERINGS”
- Targets: Financial, Life Sciences, Automotive & Aerospace, Governments
- Partners: Platform, DataSynapse, Avaki, Entropia, United Devices
- Last 20 months: Leaders of OGSI, Development teams, Grid Jamboree, GGF

High-Altitude Views
- A Rallying Cry
  - Meeting a Hard Challenge requires Many Minds
  - Operating & Maintaining Infrastructure requires Many Hands & Many Companies
- Another Stab at Distributed Computing
  - Hard Challenge: Intellectually and Practically Important
  - Dependable Ubiquity over Heterogeneity & Fallibility
- An Ambitious Virtual Machine
  - Consistent large scale computational environments
- A Global Operating System
  - Collective Resources, Common Management

An Architectural View
[Layered diagram labels]
- Application Users
- Application
- Common Application Platform for Group of Applications
- Application & Platform Developers
- Monitoring, Diagnosis, Logging, Scheduling, Accounting, Authorisation
- Grid Plumbing & Security Infrastructure
- Data & Compute Resources
- Providers
- Distributed Operations Teams

Open Grid Services Infrastructure
- Confluence of Web Services & Grid
- Consistent Interface Description
  - Based on WSDL 1.2 proposal
  - Extend Properties
  - Separate Binding from Interface
  - Function Composition & Inheritance
  - Exploit WS* Investment
- Grid Features
  - Security
  - Life-Time Management
  - Service (state) Information via Data Elements
  - Discovery
  - Grouping
  - Notification
- OGSI Version 1 Proposal at GGF7 (March 03)

Open Grid Services Architecture
- Ubiquitous Building Blocks
  - Using OGSI Platform
  - Open & Extensible
  - Encourage Refactoring Experiments
- Initially The Globus 2 model
  - Except State Information now distributed
- Example New Features
  - Global Name Mapping Service
  - Replication and Caching Service
  - Data Access & Integration
  - Metering, Logging, Authorisation, Charging, …

Grid Challenge
- Balancing “Direct” Access to the “Platforms” with Abstraction & Virtualisation
- Developers often have exploitable application knowledge
- Automation necessary & helpful
  - Interface matching, operation validation, …
  - Optimisation at many scales
- There isn’t enough effort to develop
- Languages & Abstractions

Data Integration
[Diagram: Scientist with Idea, Data Resource 1, Data Resource 2]
1) Find Data
2) Extract Data
3) Transform Data
4) Combine Data
5) Interpret Data

Wellcome Trust: Cardiovascular Functional Genomics
[Diagram]
- Sites: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands
- Shared data
- Public curated data

OGSA-DAI Partners
[Map of UK e-Science centres]
- Partners: EPCC & NeSC, IBM USA, IBM UK, IBM Hursley, Oracle, Manchester e-SC, Newcastle e-SC
- Locations: Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Oxford, Cambridge, Hinxton, RAL, Cardiff, London, Southampton
- £3 million, 18 months, started February 2002

DAI Key Services
- GridDataService (GDS): Access to data & DB operations
- GridDataServiceFactory (GDSF): Makes GDS & GDSF
- GridDataServiceRegistry (GDSR): Discovery of GDS(F) & Data
- GridDataTranslationService (GDTS): Translates or Transforms Data
- GridDataTransportDepot (GDTD): Data transport with persistence
- Integrated Structured Data Transport
- Relational & XML models supported
- Role-based Authorisation
- Binary structured files (later)

DAI Architecture
[Layered diagram labels]
- Data Intensive X Scientists
- Data Intensive Applications for Science X
- Simulation, Analysis & Integration Technology for Science X
- Generic Virtual Data Access and Integration Technology
- Monitoring, Diagnosis, Scheduling, Accounting, GridFTP, Naming, Authorisation, Caching
- Data Integration Services
- Data Access Services
- Grid Infrastructure
- Compute, Data & Storage Resources
- Structured Data
- Distributed Data Integration Architecture

[Interaction diagram, single data source: Client, Registry, Factory, Grid Data Service, XML / Relational database]
[Legend: SOAP/HTTP; service creation; API interactions]
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML

[Interaction diagram, multiple data sources: Client, Consumer, Registry, Factory, several GDS, XML / Relational databases]
[Legend: SOAP/HTTP; service creation; API interactions]
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration to databases
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits set of queries to GDS with XPath, SQL, etc
3b. Tell consumer
3c. Results of queries returned to consumer as XML or binary

Biomedical (or ANY) Data
Opportunities
- Global Production of Published Data
- Volume, Diversity
- Combination, Analysis, Discovery
Opportunities
- Specialised Indexing
- Structurally varied replication
- Consistent Structured Universe of Discourse
- Data & Computation Integration
Challenges
- Data Huggers
- Meagre metadata
- Ease of Use
- Automated, optimised integration
- Traceability, Dependability
Challenges
- Approximate Matching
- Multi-scale optimisation
- Bad habits / industrial structures
- Safety and Multi-scale optimisation

Data Integration Challenges
- High-Level Languages
  - Describing the Data Extraction Recipes
  - Describing the Sources & Components
  - Metadata that drives automation & validation
- Mobility
  - Code & Data
- Integrating Existing DB technology
  - Moving the DBMS to the Grid context
- New Optimisation Challenges
  - Data & Computation & Storage & Movement
- Shared Distributed Annotation Systems
  - How to Reference
  - Provenance & Acknowledgement

Challenges
- A Programming & Development Model
- Dependability at this Scale
- Foundations for Trust
- Raising the Level of Automation
- Supporting New Forms of Collaboration
- Data
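
Sketch: the single-source interaction (steps 1a to 3c)

The registry, factory and GDS exchange in steps 1a to 3c for a single data source can be illustrated with a small in-process sketch in Python. This is not the OGSA-DAI API: the method names `find`, `register`, `create_service` and `query` are hypothetical, the class names are borrowed from the service names only for readability, SOAP/HTTP messaging is replaced by direct method calls, and an in-memory SQLite table with made-up data stands in for the XML / relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: wraps one database connection."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql):
        # Step 3b: the GDS interacts with the database.
        cursor = self._connection.execute(sql)
        columns = [description[0] for description in cursor.description]
        # Step 3c: results are returned to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class GridDataServiceFactory:
    """Hypothetical stand-in for a GDSF: creates one GDS per request."""

    def __init__(self, connect):
        # `connect` is a zero-argument callable returning a DB connection.
        self._connect = connect

    def create_service(self):
        # Step 2b: the factory creates a GDS to manage access.
        # Step 2c: the new GDS (standing in for its handle) is returned.
        return GridDataService(self._connect())


class GridDataServiceRegistry:
    """Hypothetical stand-in for a GDSR: maps topics to factories."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        # Step 1b: the registry responds with a factory handle.
        return self._factories[topic]


def open_demo_database():
    # Made-up sample data, used only to exercise the flow.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE protein (name TEXT, length INTEGER)")
    connection.execute("INSERT INTO protein VALUES ('demo', 42)")
    return connection


registry = GridDataServiceRegistry()
registry.register("x", GridDataServiceFactory(open_demo_database))

factory = registry.find("x")        # steps 1a and 1b
service = factory.create_service()  # steps 2a to 2c
result_xml = service.query("SELECT name, length FROM protein")  # steps 3a to 3c
print(result_xml)
```

The multi-source variant would differ mainly in step 2b, where the factory would create a network of such services, and in step 3c, where results would go to a separate consumer rather than back to the client.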