OGSA-DAI: tools for data access over web services
PRISM Forum
NeSC, 27th April 2005
Neil Chue Hong
Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957

Overview
• The difficulty with data
• The Challenge of Data Services
• The OGSA-DAI Software
• Projects using OGSA-DAI

PRISM Forum - http://www.ogsadai.org.uk

The Data Deluge
• Entering an age of data
  – Data Explosion
    – CERN: LHC will generate 1GB/s = 10PB/y
    – VLBA (NRAO) generates 1GB/s today
    – Pixar generates 100 TB/Movie
  – Storage getting cheaper
• Data stored in many different ways
  – Data resources
    – Relational databases
    – XML databases / files
    – Result files
• Need ways to facilitate
  – Data discovery
  – Data access
  – Data integration
• Empower e-Business and e-Science
  – The Grid is a vehicle for achieving this

Composing Observations in Astronomy
[Figure: number and sizes of data sets as of mid-2002, grouped by wavelength]
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins

Data Services: motives
• Key to Integration of Scientific Methods
  – Publication and sharing of results
    – Primary data from observation, simulation & experiment
    – Encourages novel uses
    – Allows validation of methods and derivatives
    – Enables discovery by combining data collected independently
• Key to Large-scale Collaboration
  – Economies: data production, publication & management
    – Sharing cost of storage, management and curation
    – Many researchers contributing increments of data
    – Pooling annotation leads to rapid incremental publication
  – Accommodates global distribution
    – Data & code travel faster and more cheaply
  – Accommodates temporal distribution
    – Researchers assemble data
    – Later (other) researchers access data

Data Services: a definition
• A Data Service is a web service which
provides published interfaces to allow:
  – access to a data resource
  – management of a data resource
  – transfer of data to and from a data resource
• A data resource here is any form of structured data, e.g. databases, spreadsheets, image files, sensor streams, records, …
• Standards allow interoperability between services
  – HTTP, SOAP, XML, …

Data Services: challenges
• Scale
  – Many sites, large collections, many uses
• Longevity
  – Research requirements outlive technical decisions
• Diversity
  – No “one size fits all” solutions will work
  – Primary Data, Data Products, Meta Data, Administrative data, …
• Many Data Resources
  – Independently owned & managed
    – No common goals
    – No common design
    – Work hard for agreements on foundation types and ontologies
    – Autonomous decisions change data, structure, policy, …
  – Geographically distributed
• and I haven’t even mentioned security yet!

The Discovery Process
• Choosing data sources
  – How do you find them?
  – How do they describe and advertise them?
  – Is the equivalent of Google possible?
• Obtaining access to that data
  – Overcoming administrative barriers
  – Overcoming technical barriers
• Understanding that data and extracting from multiple sources
  – The parts you care about for your research
• Combining them using sophisticated models
  – The picture of reality in your head
• Analysis on scales required by statistics
  – Coupling data access with computation
• Repeated Processes
  – Examining variations, covering a set of candidates
  – Monitoring the emerging details

Small problems
• Not just “Grand Challenges”!
  – Also the small problems
• For instance:
  – What happens to data when an analyst leaves a team?
  – How does a team leader point to “popular” data when a new analyst joins?
  – How do you manage your data when you start to run out of local storage space?
  – How do you get your data from one format/database to another?
  – How do I combine my data with your data without changing either?
• You need to manage your data:
  – the Grid can help, but you need to put in place the process yourself

What is a data service?
• An interface to a stored collection of data
  – e.g. Google and Amazon
  – web services
• But the data could be:
  – replicated
  – shared
  – federated
  – virtual
  – incomplete
• Don’t care about the underlying representation
  – do care about the information it represents
  – standards give us interoperability

Examples of Data Services
• Many Data Services and applications
  – Commercial databases
  – Web interfaces
  – Applications developed individually by groups and projects
• Also many places to get hold of public data
  – Publications and citation servers
  – Results servers
• OGSA-DAI is a project
  – which provides an implementation of data services
  – which provides an extensible framework to customise data services for particular applications

OGSA-DAI Project
• Develop a component library
  – Access and manipulate data in a grid
  – Serve UK and International e-Science communities
• Aims to provide
  – Common interface to data resources
  – Simple integration of distributed queries to multiple data resources
• Contribute to standardisation efforts
  – Input into GGF DAIS WG and other groups
  – Provide a reference implementation of DAIS spec
• Based on Open Grid Services Architecture (OGSA)
  – Globus Toolkit 3 (GT3) “compliant”
  – Moving to WS-RF (GT4) and WS-I+ (OMII) versions

Project Partners
Powered by ….
Funded by the Grid Core Programme
• OGSA-DAI
  – £3 million, 18 months, from Feb 2002
  – Three major releases, three interim releases
• DAIT (DAI-Two)
  – Keep the OGSA-DAI brand name
  – £1.5 million, 24 months, from Oct 2003
  – Four major releases
• GGF DAIS WG
  – Strong involvement.
  – Standardise the interfaces
  – OGSA-DAI to be a reference implementation

Web Service Architecture
[Diagram: Service Registry, Service Provider and Service Consumer, linked by Publish, Discover and Bind]

Why OGSA-DAI?
• Why use OGSA-DAI over JDBC?
  – Can embed additional functionality at the service end
    – Transformations, compressions
    – Third party delivery
    – The extensible activity framework
  – Avoiding unnecessary data movement
  – Common interface to heterogeneous data resources
    – Relational, XML databases, and files
  – Usefulness of the Registry for service discovery
    – Dynamic service binding process
    – Provision of good meta-data is necessary
  – Language independence at the client end
    – Do not need to use Java
  – Platform independence
    – Do not have to worry about connection technology, drivers, etc.

Heterogeneity
[Diagram: a Grid Data Service in front of Xindice, MySQL, Oracle and DB2]
• Data source abstraction behind GDS instance
  – plug in “data resource implementations” for different data source technologies
  – does not mandate any particular query language or data format

Location
[Diagram: Analyst, Registry (DAISGR) and Factory (GDSF), with findServiceData and registerService calls]
• Data resource publication through registry
• Data location hidden by factory
• Data resource meta data available through Service Data Elements

OGSA-DAI Services
• OGSA-DAI uses three main service types
  – DAISGR (registry) for discovery
  – GDSF (factory) to represent a data resource
  – GDS (data service) to access a data resource
[Diagram: DAISGR, GDSF, GDS and Data Resource, with arrows labelled locates, creates, represents and accesses]

GDSF and GDS
• Grid Data Service Factory (GDSF)
  – Represents a data resource
  – Persistent service
  – Currently static (no dynamic GDSFs)
    – Cannot instantiate new services to represent other/new databases
  – Exposes
capabilities and metadata
  – May register with a DAISGR
• Grid Data Service (GDS)
  – Created by a GDSF
  – Generally transient service
  – Required to access data resource
  – Holds the client session

DAISGR
• DAI Service Group Registry (DAISGR)
  – Persistent service
  – Based on OGSI ServiceGroups
  – GDSFs may register with DAISGR
  – Clients access DAISGR to discover
    – Resources
    – Services (may need specific capabilities)
      – Support a given portType or activity
  – In Release 5.0, services no longer automatically register

Current Version: Release 5.0
• Released on December 3rd 2004
  – Globus Toolkit 3.2.1
  – Platform and language independent
    – Java 1.4
    – Runs on Windows, Solaris, Linux, AIX
• Listened to major user requirements
  – Wide range of supported data resources
  – Wide range of delivery methods (e.g. GridFTP), transformations, …
  – Added indexed text file access to support the bioinformatics community
  – Added GUI installation and configuration wizard
  – Continued making improvements in robustness and usability
• Work concentrated on data access
  – Wraps data resources without hiding underlying data model
  – Provide base for higher-level services
    – Distributed Query Processing (DQP)
    – Data federation services
• Next release (May 2005) offers GT4 and OMII versions

Supported Data Resources
• Relational: DB2, Oracle, PostgreSQL, SQLServer, MySQL
• XML: Xindice, eXist
• Other: Files

OGSA-DAI Deck of Activities

Predefined Activities
DeliverFromGDT, xmlCollectionManagement, relationalResourceManager, xmlResourceManagement,
sqlBulkLoadRowset, sqlUpdateStatement, sqlStoredProcedure, sqlQueryStatement,
xQueryStatement, xUpdateStatement, xPathStatement,
DeliverToGDT, DeliverToStream, outputStream, DeliverFromGFTP, DeliverToGFTP,
DeliverToURL, DeliverFromURL, inputStream,
xslTransform, zipArchive, gzipCompression

Client Toolkit
• Why? Nobody wants to write XML!
• A programming API which makes writing applications easier
  – Now: Java
  – Next: Perl, C, C#

    // Create a query
    SQLQuery query = new SQLQuery(SQLQueryString);
    ActivityRequest request = new ActivityRequest();
    request.addActivity(query);

    // Perform the query
    Response response = gds.perform(request);

    // Display the result
    ResultSet rs = query.getResultSet();
    displayResultSet(rs, 1);

Integration Scenario
• A patient moves hospital
[Diagram: Data A, Data B and Data C feed an amalgamated patient record; the sources shown are DB2, Oracle and a CSV file]
  A: (PID, name, address, DOB)
  B: (PID, first_contact)
  C: (PID, first_name, last_name, address, first_contact, DOB)

Distributed Query Processing
• Higher level services building on OGSA-DAI
• Queries mapped to algebraic expressions for evaluation
• Parallelism represented by partitioning queries
  – Use exchange operators
[Diagram: partitioned query plan with operators table_scan (protein), table_scan termID=S92 (proteinTerm), reduce, exchange, hash_join (proteinId) and op_call (Blast); partitions labelled 1, 2 and 3,4]

OGSA-DAI Users Group
• User Group Chair
  – Prof.
Beth Plale, Indiana University
• A separate independent body to engage with users and feed back to developers in a formal way
• Held meetings in Edinburgh and Brussels in 2004
  – Presentations from projects using OGSA-DAI
  – Discussion of requirements and issues
  – Discussion of roadmap
• Next meeting is 1st June 2005 in Edinburgh
• Contact Beth Plale (plale@cs.indiana.edu) for more details

FAQ, Support, Mailing List
• Frequently Asked Questions
  – http://www.ogsadai.org.uk/support/faq.php
  – Updated as common problems become clear
• Support for OGSA-DAI releases
  – http://www.ogsadai.org.uk/support
  – support@ogsadai.org.uk
  – Use to report problems
• Discussion list
  – users@ogsadai.org.uk
  – http://www.ogsadai.org.uk/support/list.php
  – General discussion of OGSA-DAI, data and the Grid

Projects Using OGSA-DAI
• Bridges (http://www.brc.dcs.gla.ac.uk/projects/bridges/)
• N2Grid (http://www.cs.univie.ac.at/institute/index.html?project-80=80)
• BioSimGrid (http://www.biosimgrid.org/)
• AstroGrid (http://www.astrogrid.org/)
• BioGrid (http://www.biogrid.jp/)
• GEON (http://www.geongrid.org/)
• eDiaMoND (http://www.ediamond.ox.ac.uk/)
• OGSA-WebDB (http://www.gtrc.aist.go.jp/dbgrid/)
• GeneGrid (http://www.qub.ac.uk/escience/projects.php#genegrid)
• FirstDIG (http://www.epcc.ed.ac.uk/~firstdig/)
• myGrid (http://www.mygrid.org.uk/)
• INWA (http://www.epcc.ed.ac.uk/)
• ODD-Genes (http://www.epcc.ed.ac.uk/oddgenes/)
• IU RGRBench (http://www.cs.indiana.edu/~plale/projects/RGR/OGSA-DAI.html)

Project classification
[Diagram: projects arranged around OGSA-DAI by domain: Physical Sciences, Biological Sciences, Computer Sciences, Commercial Applications]
Projects shown: Bridges, BioGrid, ODD-Genes, AstroGrid, BioSimGrid, GEON, eDiaMoND, myGrid, GeneGrid, MCS, N2Grid, OGSA-WebDB, GridMiner, IU RGRBench, FirstDIG, INWA

eDiaMoND
• e-Digital MammOgraphy National Database
  – Mammogram: X-ray of the breast
• Built prototype of a national database of mammographic images
  – In support of the UK Breast screening programme
• Employed Grid technologies to facilitate process
Thanks to eDiaMoND project and the Digital Database for Screening Mammography for this image.

• Breast screening in the UK began in 1988
  – Women aged 50-64 screened every 3 years
  – Women aged 50-70 from 2004
  – 1 view/breast → 2 views by 2003
• UK has
  – Over 90 Breast screening units throughout the UK
  – Each one deals with about 45000 women on average p.a.
• Each centre sees 5000-20000 images/year
• In 2001-02 → 2002-03
  – Screened: 1.4M → 1.5M
  – Recalled for Assessment: 77911 → 79441
  – Cancers detected: 10003 → 10467
  – Lives per year Saved: 300 → 1250 (by 2010)
• Distributed team of doctors perform the analysis

[Diagram: eDiaMoND architecture at four sites (CHU, KCL, UED, UCL): data load and training applications, Core & Training APIs, Core and Training Services, OGSA-DAI, DB2 Federation, DB2 and Content Manager over the database and files]

• eDiaMoND Findings:
  – OGSA-DAI provides a flexible framework
  – Dynamically configure the system through discovery
  – Activities can operate with different levels of granularity
  – Federation can be introduced at various levels
  – Good documentation on how to extend the framework
    – Extended Activities to access IBM DB2 Content Manager
  – Changes between versions broke some things
  – Low level XML issues

FirstDIG
• Data mining
with the First Transport Group, UK
  – Example: “When buses are more than 10 minutes late there is an 82% chance that revenue drops by at least 10%”
  – “The results of this exercise will revolutionise the way we do things in the bus industry.” Darren Unwin, Divisional Manager, First South Yorkshire.
[Diagram: Client Application and Data Mining Application connected to five OGSA-DAI services]

INWA
• Innovation Node: Western Australia
  – Informing Business & Regional Policy: Grid-enabled fusion of global data and local knowledge
• Project
  – Ran from Nov 2003 to Aug 2004
  – Involved 10 partners (6 UK + 4 Australia)
• Aim
  – Data mine commercially sensitive data
    – Security an absolute MUST
  – Employ Grid technologies
    – Need access to data and computational resources
• Demonstrator using:
  – OGSA-DAI
    – Incorporate data resources
  – Sun DCG's TOG (Transfer-queue Over Globus)
    – Handle job submission to analyse micro array data

INWA
[Diagram: two sites, EPCC (UK) and Curtin (Australia), each with TOG, Grid Engine, a Data Browser and OGSA-DAI services; data resources shown are bank data, telco data, UK property data and Australian property data; users are user@edinburgh and user@australia]

ODD-Genes
• OGSA-DAI Demo for Genetics
• Collaboration between
  – EPCC
  – Scottish Centre for Genomic Technology and Informatics (GTI)
  – Human Genetics Unit (HGU)
• ODD-Genes demonstrates:
  – Perform high-speed batch analysis of microarray data on the Grid
  – Browse the results of previous analyses stored in a database
  – View data from arbitrary databases as HTML
  – Discover related databases on the Grid
  – Perform coupled queries on newly discovered databases to provide a richer analysis of gene data

ODD-Genes Actors
[Diagram: GTI with Micro Array Data (relational), Globus, GridEngine and OGSA-DAI; a DAISGR; the ODD-Genes Webapp; EPCC with GridEngine and TOG]
1. Client
2. EPCC is a computational resource.
3.
HGU is an example of a data repository.
[Diagram: HGU with Mouse Genome Information (XML) behind OGSA-DAI]

ODD-Genes Findings
• Data discovery perceived to be very important
  – Map data views: time -> spatial locations
  – Discovery of new resources
• Transparency to data access
  – @HGU had an XML database
  – @GTI had a relational database
  – Deploy OGSA-DAI and not worry about databases
• Issues
  – Registry maintenance policy
  – Semantics of the discovery process
  – Groups working in the same area but with different schemas, no generic metadata (schemas were the effective metadata)
• Provides an additional tool for researchers

GridMiner
• Test application area: medical
  – traumatic brain injury treatment
  – Predicting the outcome of seriously ill patients
  – analytical part focuses on data mining and On-Line Analytical Processing (OLAP)
• Target:
  – provide tools to discover and access relevant knowledge and information from different distributed and heterogeneous data sources
  – building on and extending OGSA-DAI
• http://www.gridminer.org/

GridMiner Scenario
• Heterogeneities:
  – Name in A is “First Last” (as the target format)
  – Name in C has to be combined
• Distribution:
  – 3 data sources

Summary
• New technology
  – Standardisation process still ongoing
  – Infrastructure still developing
• OGSA-DAI acting as an enabler
  – It builds on what you already have
  – It does not define a radically new model (not rewriting SQL)
  – It may make you think about your business process
• Some problems are not OGSA-DAI specific
  – Metadata, time zones, security, …
• Data discovery opens up a window of integration opportunity
• Try it out
  – It’s free and supported
  – Make suggestions, extend functionality, contribute to DAIS-WG

OGSA-DAI Project Webpage
• http://www.ogsadai.org.uk
  – Background
  – News & Events
  – Software Releases
  – Documentation
  – On-line Tutorials
  – Support
  – Training Courses
  – Links
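
Worked sketch: the integration scenario
The patient-record scenario earlier in the deck (sources A, B and C keyed on PID, with the name normalised to “First Last” as in the GridMiner scenario) can be sketched in plain Java as below. This is an illustration only, not OGSA-DAI or GridMiner code. The class and method names, the target layout (PID, name, address, first_contact, DOB), the "first source wins" rule for conflicting values and the sample rows are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only: plain Java, no OGSA-DAI or GridMiner API involved. */
public class PatientRecordMerge {

    /** Hypothetical target layout: (PID, name, address, first_contact, DOB). */
    static final class Patient {
        final String pid;
        String name = "", address = "", firstContact = "", dob = "";

        Patient(String pid) { this.pid = pid; }

        @Override
        public String toString() {
            return String.join("|", pid, name, address, firstContact, dob);
        }
    }

    /** Assumed conflict rule: a field keeps the first non-empty value it was given. */
    static String keepFirst(String current, String value) {
        return current.isEmpty() ? value.trim() : current;
    }

    /**
     * Merges rows keyed on PID.
     * rowsA: {PID, name, address, DOB}, name already in "First Last" form.
     * rowsB: {PID, first_contact}.
     * rowsC: {PID, first_name, last_name, address, first_contact, DOB}.
     */
    static Map<String, Patient> merge(String[][] rowsA, String[][] rowsB, String[][] rowsC) {
        Map<String, Patient> byPid = new LinkedHashMap<>();
        for (String[] a : rowsA) {
            Patient p = byPid.computeIfAbsent(a[0], Patient::new);
            p.name = keepFirst(p.name, a[1]);
            p.address = keepFirst(p.address, a[2]);
            p.dob = keepFirst(p.dob, a[3]);
        }
        for (String[] b : rowsB) {
            Patient p = byPid.computeIfAbsent(b[0], Patient::new);
            p.firstContact = keepFirst(p.firstContact, b[1]);
        }
        for (String[] c : rowsC) {
            Patient p = byPid.computeIfAbsent(c[0], Patient::new);
            // C stores the name in two columns; combine them into the "First Last" target format.
            p.name = keepFirst(p.name, c[1] + " " + c[2]);
            p.address = keepFirst(p.address, c[3]);
            p.firstContact = keepFirst(p.firstContact, c[4]);
            p.dob = keepFirst(p.dob, c[5]);
        }
        return byPid;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        String[][] a = { {"P1", "Jane Smith", "12 High St", "1970-01-01"} };
        String[][] b = { {"P1", "2004-03-01"} };
        String[][] c = { {"P2", "John", "Jones", "3 Mill Rd", "2005-02-14", "1965-06-30"} };
        merge(a, b, c).values().forEach(System.out::println);
    }
}
```

In a deployment like the one the slides describe, the three row sets would presumably come back from queries against separate data services rather than from in-memory arrays; the merge step itself stays the same either way.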