OGSA-DQP/DAI related work at AIST

Isao Kojima, Said Mirza Pahlevi, Steven Lynden, Akiyoshi Matono, Masahiro Kimoto
Information Technology Research Institute
National Institute of Advanced Industrial Science and Technology (AIST)

OGSA-DAI/DQP based work
• OGSA-WebDB
  – Accessing Web databases through Grid services
  – Optimises joins between multiple DBs
• OGSA-DAI-RDF
  – OGSA-DAI activities for accessing RDF stores
• OGSA-DQP extensions
  – Enable XML and integration with OGSA-WebDB
  – Try out with a more complex application; support the GeoGrid project (www.geogrid.org)

Research Areas
[Figure: overview of research areas. Labels: e-Science Applications; Semantic Grid/Web Infrastructures; Support RDF Databases; S-MDS; OGSA-DAI RDF; AIST-SOA; SPARQL-based Matchmaking; Service-based (OGSA) Data Grid; Data Integration; Bloom Filter Query Processing; MapReduce; Provide Scalable Database Processing; Support Web Databases; GEO Grid Service Registry; GEO Grid; Federated SPARQL; OGSA-WebDB; OGSA-DQP with XML & WebDB; Distributed Databases.]

OGSA-DQP + OGSA-WebDB

OGSA-WebDB
• Contributes activities to OGSA-DAI that support Web database access and integration
• Uses a mediator-wrapper architecture
• Wrappers exist for popular Web databases, e.g. DBLP, Citeseer, etc.
[Figure: OGSA-WebDB overview. Labels: Client; OGSA-DAI; WDB SQL activities; JDBC; Proxy Database: RDB View; MySQL; Internet; Pubmed; PDB; FDA.]

WebDB System Architecture
Example query:
    select * from rdblp where title like '%grid%' and year > 2000
[Figure: query flow between the Grid client, the Grid/OGSA-DAI data service (WebDB activity), the mediator (SQL analyzer, wrappers, results management) and the Web databases (DBLP bibliography, PDB, medical Web databases), with proxy relations (rdblp, Bibs) and user relations. Numbered steps as labelled in the figure: 1. SQL; 2. invoke mediator; 3. supported conditions; 5. Boolean query; 6. results; 7. insert results; 8. SQL; 9. results; 10. results.]

Key features of OGSA-WebDB
• Can integrate data from multiple Web DB sources declaratively
  – Parser supports SQL
• Queries are optimised at runtime
  – Statistical analysis of intermediate results can be used to determine join order
• Interface is consistent with OGSA-DAI relational resources
  – Web DB types are mapped to SQL types
  – Each WebDB looks like a relational table

Extensions to OGSA-DQP
• Support XML and OGSA-WebDB
• Motivation
  – Integration of heterogeneous data
  – Declarative service orchestration
    • Difficult to do without manipulating XML
[Figure: extended architecture. Labels: OGSA-DQP client; OGSA-DAI (three instances); OGSA-DAI + OGSA-WebDB; Analysis service; Relational DB; XML DB; Web DB.]

DQP with OGSA-WebDB
• OGSA-WebDB already behaves a lot like a relational resource
• Different language constraints and response times required changes to the optimiser
• Replication/parallelism is beneficial for joins with relational data to improve response times
Example:
    select * from employee, dblp where dblp.author=employee.name
[Figure labels: Client; OGSA-DQP; OGSA-WebDB; OGSA-DAI; employee DB.]

XML processing
• The approach is to add a set of XML functions for manipulating XML and converting to/from relational data
• The functions are based on those from various sources, including
  – The fifth revision of the SQL language (SQL 2003)
  – DBMS vendors such as Oracle 10g
  – Other projects, e.g. the Tukwila data integration system
• A mix-and-match from these sources is employed, as compliance with SQL in this area is far from universal.
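The two directions of conversion mentioned under "XML processing" can be pictured with a minimal Python sketch. This is not OGSA-DQP code: `rows_to_xml` and `xml_to_rows` are hypothetical helper names, the sample rows are illustrative, and the standard `xml.etree.ElementTree` module (with its limited XPath subset) stands in for the query engine's XML functions.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, row_tag, root_tag):
    """Relational -> XML: wrap each row (a dict) in a <row_tag> element."""
    root = ET.Element(root_tag)
    for row in rows:
        elem = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # One child element per column, named after the column.
            ET.SubElement(elem, column).text = str(value)
    return root


def xml_to_rows(root, row_path, columns):
    """XML -> relational: select row elements by path, read one child per column."""
    return [
        {column: elem.findtext(column) for column in columns}
        for elem in root.findall(row_path)
    ]


# Illustrative sample data (an assumption, not taken from a real database).
people = [{"name": "John", "age": 31}, {"name": "David", "age": 47}]

doc = rows_to_xml(people, "person", "people")
xml_text = ET.tostring(doc, encoding="unicode")

# Values come back as strings, so a typed view needs an explicit conversion.
rows = xml_to_rows(doc, ".//person", ["name", "age"])
ages = [int(row["age"]) for row in rows]
```

The round trip loses type information (ages return as text), which is one reason a function set of this kind tends to need explicit type-converting extraction functions alongside the plain XPath ones.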
Example functions

1 Construction of XML from relational data

    select XMLGen(
      '<person>
         <name>{$p.name}</name>
         <age>{$p.age}</age>
       </person>'
    ) from person as p

  Input table (person):
    name  | age
    John  | 31
    David | 47

  Result:
    <person> <name>John</name> <age>31</age> </person>
    <person> <name>David</name> <age>47</age> </person>

2 Extracting values from XML and type conversion

    select avg(XMLIntValue(p, '//age'))
    from ExtractXML(people, '//person') as p

  Result:
    avg
    39

Joining data
• Integration of WebDB, relational and XML data

  XML source (AUTHOR):
    <authors>
      <author>
        <name>Cliff Jones</name>
        <field>Dependability</field>
      </author>
      <author>
        <name>…</name>
        <field>…</field>
      </author>
      …
    </authors>

  Query over AUTHOR (via OpenXML), PUBLICATION and CITESEER:
    select xauthor.name, publication.title, citeseer.url, xauthor.field
    from publication, citeseer,
         OpenXML( author, '//author',
                  '//name/text() name, //field/text() field' ) as xauthor
    where publication.title=citeseer.title
      and xauthor.name=publication.author;

  OpenXML turns each <author> element into one row; the joined result has the columns:
    name | title | url | field
    …    | …     | …   | …

Example application
• The application is based on an e-Science project, eXSys (e-Science tools for the analysis of complex systems).
• Newcastle University 2003/2004
• This project uses graph theory-based techniques to predict target proteins for in-silico drug design.
• It later matured into a drug discovery company (from http://www.etherapeutics.co.uk/)
• Protein interaction network = nodes (proteins) and edges (binding capability, formation of complex). (Static representation)
• Analysing topology allows the extension of a protein's properties to encompass its role relative to the network in which it exists.
• Properties of protein interaction networks have been studied: structurally important implies functionally important.

Process
• Data from separate relational databases are integrated to form an undirected graph structure representing protein interactions from a specific organism.
• An analysis program is executed with the graph structure as input. Each protein is ranked according to its significance to the topology of the interaction network.
• More information is retrieved about highly ranked proteins using Web databases such as UniProt.
[Figure: data flow for the example. Annotations: locally owned interaction / species tables; construct XML input to analysis Web service; invoke Web service (topology analysis, example XML input/output); extract values from XML data; select important proteins only; convert back to a relational view; join condition with Web database; OGSA-DQP query result: protein name and gene.]

Query Plan
[Figure: query plan. Operator labels: ExtractXML; UniProt; XMLOccurs; Function call (Analyser); XMLElement; XMLAgg; XMLGen; join on interactions.Interaction_id = species.interaction_id over the species and interactions tables, wrapped using OGSA-DAI.]
• The function call is implemented by calling an external Web service.

Implementation
• Based on OGSA-DQP 3.2.1
• At least Java 1.5 required
• Installation
  – Coordinator
    • Install DQP 3.2.1; the extensions overwrite the DQP Jars deployed onto Tomcat
  – Evaluator
    • Install the extended evaluator from scratch (changes to more than just the Java code were made)

Configuration process
– XML data sources are imported in a similar manner to relational data sources.
    <xs:complexType name="ImportedXMLDataSourceType">
      <xs:sequence>
        <xs:element name="URI" type="xs:anyURI" />
        <xs:element name="CollectionName" type="xs:string" />
        <xs:element name="ResourceID" type="xs:anyURI" />
        <xs:element name="SchemaURI" type="xs:anyURI" minOccurs="0"/>
        <xs:element name="cardinality" type="xs:long" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>

– Cardinality (optional) must be specified by the client if it is to be used during optimisation; otherwise a default is used.

Web DBs

    <xs:complexType name="ImportedWebDataSourceType">
      <xs:sequence>
        <xs:element name="URI" type="xs:anyURI" minOccurs="1"/>
        <xs:element name="ResourceID" type="xs:anyURI" />
        <xs:element name="SupportsOR" type="xs:boolean" minOccurs="0"/>
        <xs:element name="table" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="name" type="xs:string" use="required"/>
            <xs:attribute name="cardinality" type="xs:long" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

• "SupportsOR": to speed up loop joins, indicates whether the Web database allows multiple bindings from input tuples to be grouped together by building a Web database query of the form
    SELECT * from table where attribute = "value1" or attribute = "value2" …
• Cardinality, if not supplied, is assumed to be very large, to reflect the fact that most Web databases are relatively large compared to most relational database tables.

Required modifications
• Variable assignment
  – Fields in the FROM and SELECT clauses can now be assigned names.
For example, the following queries are now supported:
    select p.name from person_person as p where p.id<5;
    select name as surname from person_person;
• This is a direct consequence of the XML extensions, for example:
    select xml from ExtractXML(othello,'//actor') as xml;
• Limited support has been introduced for nested queries in the FROM/WHERE clauses and for functions returning collections (as in the example above).
• Limited support for external (Web service) function calls: one statically defined function is supported that allows arbitrary XML to be passed as a parameter and arbitrary XML to be returned as a result. JAX-WS has been looked at as a potential implementation mechanism for this.

Implemented functions
• XMLForest( value1, value2, ..., valueN ) returns XML
  Converts relational columns to XML and merges them together. Each XML element is assigned a name equal to the column name. Parameter types can be any type that is not XML. For example, given a tuple with two fields, name=john, age=19, the output of XMLForest(name,age) is <name>john</name><age>19</age>.
• XMLConcat( XML value1, XML value2, ..., XML valueN ) returns XML
  Does the same thing as XMLForest, except that the values are already XML typed.
• XMLGen( String template ) returns XML
  Already seen in the example functions.
• XMLAgg( collection<XML> values ) returns XML
  Aggregate function that creates a single XML value from a collection of XML values. Limited applicability, as OGSA-DQP doesn't support GROUP BY.

Implemented functions (continued)
• ExtractXML( String collectionName, String XPathStatement ) returns collection<XML>
  Applies an XPath statement to a named XML collection residing in an XML database.
• ExtractXMLValue( XML value, String XPathStatement ) returns collection<XML>
  Applies an XPath statement to an XML value.
• ExtractXMLStringValue( XML value, String XPathStatement ) returns collection<String>
  Applies an XPath statement to an XML value and converts the results to String values. [Variants exist for other types, e.g. ExtractXMLIntValue etc.]
• XMLOccurs( XML value, String XPathStatement ) returns boolean
  Applies an XPath statement to an XML value; returns true if the statement returns any results, false otherwise. Can be used in the WHERE clause.

Status/future work
• Respond to requirements from the GeoGrid project
  – One potential issue here is support for VOMS security
• Code is not up to the standards required for OGSA-DAI contributions (extensions and OGSA-WebDB)
• OGSA-WebDB wrappers require constant updates
• More work needed on optimisation strategies when invoking functions
• Bug: processing XML data with elements named <row> or <col> may fail due to the way that OGSA-DQP encodes results internally using these tags.

OGSA-DAI-RDF
• Activities for accessing RDF data resources, e.g.:
  – DataSetManagement
    • create, remove, list
  – GraphManagement
    • insert, delete triples
  – SPARQLQuery
    • forward a query to the underlying DB
• Current work is looking at implementing the WS-DAI-RDF specifications
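
As a rough picture of how the three activity groups listed above divide the work, the following toy Python model may help. It is only a sketch under stated assumptions: `ToyRdfResource` and all of its method names are hypothetical and are not the OGSA-DAI-RDF or WS-DAI-RDF API, datasets are assumed to be named sets of triples, and single triple-pattern matching is swapped in for SPARQL (the real SPARQLQuery activity forwards the query to the underlying database instead of evaluating it).

```python
class ToyRdfResource:
    """Hypothetical in-memory stand-in for an RDF data resource."""

    def __init__(self):
        # dataset name -> set of (subject, predicate, object) triples
        self._datasets = {}

    # DataSetManagement: create, remove, list
    def create_dataset(self, name):
        self._datasets.setdefault(name, set())

    def remove_dataset(self, name):
        self._datasets.pop(name, None)

    def list_datasets(self):
        return sorted(self._datasets)

    # GraphManagement: insert, delete triples
    def insert_triples(self, name, triples):
        self._datasets[name].update(triples)

    def delete_triples(self, name, triples):
        self._datasets[name].difference_update(triples)

    # Stand-in for SPARQLQuery: match one triple pattern, None is a wildcard.
    def match(self, name, subject=None, predicate=None, obj=None):
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self._datasets[name]
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


# Illustrative usage with made-up triples.
store = ToyRdfResource()
store.create_dataset("papers")
store.insert_triples("papers", [
    ("paper1", "creator", "Cliff Jones"),
    ("paper1", "subject", "Dependability"),
    ("paper2", "creator", "Cliff Jones"),
])
by_jones = store.match("papers", predicate="creator", obj="Cliff Jones")
store.delete_triples("papers", [("paper2", "creator", "Cliff Jones")])
remaining = store.match("papers", predicate="creator")
```

Keeping dataset management, triple updates and querying as separate operations mirrors the split into separate activities; a real implementation would delegate each one to the RDF store behind the service.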