UCISA: How do I Grid enable my University?
Prepare for the Data Deluge
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
23rd October 2003

Outline
• Aspects of the Data Deluge
• Our approach to Data Access and Integration

Sloan Digital Sky Survey Production System
Slide from Ian Foster’s SSDBM 03 keynote

Global Knowledge Communities Often Driven by Data: E.g., Astronomy
No. & sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins

Database Growth
[Chart: database growth; Bases: 45,356,382,990]

PDB Content Growth
[Chart: PDB content growth]

It’s Easy to Forget How Different 2003 is From 1993
• Enormous quantities of data: Petabytes
  • For an increasing number of communities the gating step is not collection but analysis
• Ubiquitous Internet: >100 million hosts
  • Collaboration & resource sharing the norm
  • Security and Trust are crucial issues
• Ultra-high-speed networks: >10 Gb/s
  • Global optical networks
  • Bottlenecks: last kilometre & firewalls
• Huge quantities of computing: >100 Top/s
  • Moore’s law gives us all supercomputers
  • Organising their effective use is the challenge
• Moore’s law everywhere
  • Instruments, detectors, sensors, scanners, …
  • Organising their effective use is the challenge
Derived from Ian Foster’s slide at SSDBM, July 03

Tera → Peta Bytes
                     Terabyte                  Petabyte
RAM time to move     15 minutes                2 months
1Gb WAN move time    10 hours ($1000)          14 months ($1 million)
Disk Cost            7 disks = $5000 (SCSI)    6800 Disks + 490 units + 32 racks = $7 million
Disk Power           100 Watts                 100 Kilowatts
Disk Weight          5.6 Kg                    33 Tonnes
Disk Footprint       Inside machine            60 m²
May 2003, Approximately Correct
See also: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24

Infrastructure Architecture
• Data Intensive Users
• Data Intensive Applications for Application area X
• Simulation, Analysis & Integration
  Technology for Application area X
• Generic Virtual Data Access and Integration Layer
• OGSA: Job Submission, Brokering, Registry, Banking, Data Transport, Workflow, Structured Data Integration, Authorisation, Resource Usage, Transformation, Structured Data Access
• OGSI: Interface to Grid Infrastructure
• Compute, Data & Storage Resources
• Structured Data: Relational, XML, Semi-structured
• Distributed Virtual Integration Architecture

Integrating DBs into the Grid
• We want to build on existing DBs, not replace them.
• Could produce a Grid-enabled version of JDBC/ODBC
• Need something more for metadata-driven access to data
• Service-based approach should be better
• Provide a uniform framework for access to databases on the Grid

Data as Service: OGSA Data Access & Integration
• Service-oriented treatment of data appears to have significant advantages
  • Leverage OGSI introspection, lifetime, etc.
  • Compatibility with Web services
• Standard service interfaces being defined
  • Service data: e.g., schema
  • Derive new data services from old (views)
  • Externalize to e.g. file/database format
  • Perform queries or other operations

Data Services
• GGF Data Access and Integration Svcs (DAIS)
  • OGSI-compliant interfaces to access relational and XML databases
  • Needs to be generalized to encompass other data sources (see next slide…)
• Generalized DAIS becomes the foundation for:
  • Replication: Data located in multiple locations
  • Federation: Composition of multiple sources
  • Provenance: How was data generated?

“OGSA Data Services” (Foster, Tuecke, Unger, eds.)
• Describes conceptual model for representing all manner of data sources as Web services
  • Database, filesystems, devices, programs, …
  • Integrates WS-Agreement
• Data service is an OGSI-compliant Web service that implements one or more of the base data interfaces:
  • DataDescription, DataAccess, DataFactory, DataManagement
• These would be extended and combined for specific domains (including DAIS)

OGSA-DAI Approach
• Reuse existing technologies and standards
  • OGSA, Query languages, Java, transport
• Build portTypes and services which will enable:
  • controlled exposure of heterogeneous data resources on an OGSI-compliant grid
  • access to these resources via common interfaces using existing underlying query mechanisms
  • (ultimately) data integration across distributed data resources
• OGSA-DAI (the software) seeks to be a reference implementation of the GGF DAIS WG standard
  • Can’t keep up with frequent standard changes, so software releases track specific drafts
• See http://www.ogsadai.org.uk/ for details.

Data Access & Integration Services
Components: Client, Registry, Factory, Grid Data Service (GDS), XML / Relational database
Legend: SOAP/HTTP, service creation, API interactions
• 1a. Request to Registry for sources of data about “x”
• 1b. Registry responds with Factory handle
• 2a. Request to Factory for access to database
• 2b. Factory creates GridDataService to manage access
• 2c. Factory returns handle of GDS to client
• 3a. Client queries GDS with XPath, SQL, etc
• 3b. GDS interacts with database
• 3c. Results of query returned to client as XML

Third Party Delivery
[Diagram: a Requestor (Client, API, Stub) and a Consumer (Client, API, Stub), with Data Sets passed between them in steps 1 to 4]

Future DAI Services?
Components: Data Registry
Legend: SOAP/HTTP, service creation, API interactions
• 1a. Request to Registry for sources of data about “x” & “y”
• 1b. Registry responds with Factory handle
• 2a. Request to Factory for access and integration from resources Sx and Sy
• 2c. Factory returns handle of GDS to client
• 3b.
  Client tells analyst
• 2b. Factory creates GridDataServices network
• 3a. Client submits sequence of scripts; each has a set of queries to GDS with XPath, SQL, etc
• 3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation
Diagram labels: Analyst; Client; Problem Solving Environment; “scientific” Application coding scientific insights; Data Access & Integration master; Semantic Meta data; Application Code; GDS1, GDS2, GDS3; GDTS1, GDTS2; Sx, Sy; XML database; Relational database

Take Home Message
• Information Grids
  • Support for collaboration
  • Integrated support for computation and data grids
• Structured data fundamental
  • Relations, XML, semi-structured, files, …
• Integrated strategies & technologies needed
• OGSA-DAI is here now
  • See http://www.ogsadai.org.uk/ for details.
  • A first step: try it
  • Tell us what is needed to make it better
• Managing Scientific Data is a Major Requirement
• The Grid is 30% of the software stack needed for e-Science
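
Appendix: a sketch of the Registry, Factory and GDS interaction

The numbered steps on the Data Access & Integration Services slide (1a to 3c) can be illustrated with a small, purely local Python sketch. This is not the OGSA-DAI API: the class names Registry, Factory and GridDataService mirror the slide, but every method name (register, find_factory, create_gds, query) is a hypothetical stand-in, ordinary method calls replace SOAP/HTTP, and an in-memory SQLite table with made-up sample rows plays the part of the data resource.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql, params=()):
        # 3b. The GDS interacts with the database on the client's behalf.
        cursor = self._connection.execute(sql, params)
        columns = [description[0] for description in cursor.description]
        # 3c. Results of the query are returned to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory: creates a GDS on request (steps 2a to 2c)."""

    def __init__(self, open_connection):
        self._open_connection = open_connection

    def create_gds(self):
        # 2b. The factory creates a GridDataService to manage access.
        return GridDataService(self._open_connection())


class Registry:
    """Hypothetical registry: maps a topic to a factory (steps 1a, 1b)."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find_factory(self, topic):
        return self._factories[topic]


def open_star_catalogue():
    # Toy in-memory database standing in for a real data resource.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE stars (name TEXT, magnitude REAL)")
    connection.executemany(
        "INSERT INTO stars VALUES (?, ?)",
        [("Sirius", -1.46), ("Vega", 0.03), ("Polaris", 1.98)],
    )
    return connection


def run_example():
    registry = Registry()
    registry.register("stars", Factory(open_star_catalogue))
    factory = registry.find_factory("stars")  # 1a, 1b
    gds = factory.create_gds()                # 2a, 2b, 2c
    return gds.query(                         # 3a, 3b, 3c
        "SELECT name FROM stars WHERE magnitude < ? ORDER BY magnitude",
        (1.0,),
    )
```

Calling run_example() returns an XML string with one row element per matching star. The point of the indirection is the one made on the slides: the client only ever holds handles (first to a factory, then to a GDS), so the database behind the GDS can change without the client code changing.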