The GridMiner project

Alexander Wöhrer
University of Vienna, Institute for Software Science
email: woehrer@par.univie.ac.at
www.gridminer.org
… Intelligent Grid Solutions

Outline
  GridMiner overview
    Members, hosts
    KDD process
    Work packages
  OGSA-DAI introduction
  Grid Data Mediation Service
    Motivation: data integration scenarios
    Requirements => Principles
    Concepts => Architecture
    Example
    Current prototype
  Future work

www.gridminer.org   NeSC, 6. Sept. 04   Alexander Wöhrer

GridMiner Overview I
  Start: Jan. 2003
  Hosts:
    University of Vienna
    Vienna University of Technology
  Test application area: medical
    traumatic brain injury treatment
    predicting the outcome of seriously ill patients
  Analytical part focuses on data mining and On-Line Analytical Processing (OLAP)
  Target: provide tools to discover and access relevant knowledge and information from different distributed and heterogeneous data sources

GridMiner Overview II
  Current status:
    Prototypes
    GT3 based
    GDMS functionality implemented as an OGSA-DAI R3 activity
    Going to support WSRF when it is more stable
  Generally applicable
    Not bound to a special application area

Project members
  Project leaders:
    Prof. A Min Tjoa, Vienna University of Technology
    Prof. Peter Brezany, University of Vienna
  Visualization: Radoslav Ivanov
  Data streaming: Nguyen Manh Tho
  Data mediation: Alexander Wöhrer
  Knowledge Mgt: Ivan Janciak
  Job Control: Günter Kickinger
  Sequence Rules: Michael Rinner
  Decision rules: Christian Kloner
  GUI: Paul Panhofer
  OLAP: Bernhard Fiser, Umut Onan, Ibrahim Elsayed
  Clustering: Markus Mayer

The process to cover
  Data is distributed over the participating hospitals
  Data is accessed from different platforms (handheld, PC, …) for data generation, querying and analysis
  The process needs to access various data sources

Work packages
  GMGUI - Graphical User Interface
  Demo at SC 04 - research exhibition booth 2437

Outline
  GridMiner overview
    Members, hosts
    KDD process
    Work packages
  OGSA-DAI introduction
    Target
    Main services
    Perform/Response Document
    Overall architecture
  Grid Data Mediation Service
    Requirements => Principles
    Concepts => Architecture
    Example
    Current prototype
  Future directions

OGSA-DAI introduction I
  Target of OGSA-DAI: to incorporate data resources into the OGSA framework and make them accessible via a standard interface
  OGSA-DAI consists of 3 parts:
    Grid Data Service (GDS): primary service
      access to one particular physical data resource
      providing access via a document-oriented model
    Grid Data Service Factory (GDSF): service creation facility
    Grid Data Service Registry (GDSR): directory facility for OGSA-DAI services

OGSA-DAI introduction II (figure)

OGSA-DAI introduction III
  The engine performs the specified activities
  OGSA-DAI already solves many aspects of accessing data via the Grid well:
    Metadata
    Flexible chains of activities
    Support for new activities
    Transformation
    Delivery (to 3rd parties)
    Rights management

Data Integration I
  Single data source
  Federated data source

Data Integration II
  Heterogeneities to overcome:
    Technical: OS, hardware, ...
    Interface: access language
    Data model: OO, relational, ...
    Semantic: equal names for different concepts
    Schematic: encoding of concepts with different elements of a data model
    Structural: attributes are grouped into different tables

GDMS Requirements
  Ease of use
    Language, location and schema transparency
    SQL subset
    Virtually one homogeneous, single RDBMS (read only)
  Maintainability
    Easy installation & setup
    Maintenance (tool supported)
    Access/Authorization/Authentication
    View support
    Performance
    Semantic issues
  Security
  Extensible
    Open to new data sources
  Flexible
    User Defined Functions to solve various kinds of heterogeneities

GDMS Principles
  Tight federation: global (relational) schema
  Virtual integration: leave the data where it is
    always up-to-date data
  No proprietary solution
    inherit the aspects OGSA-DAI already solves well
  Not bound to a special architecture

GDMS Concepts I
  Mapping Schema
    describes the building process of the virtual data source for each table
    operators as building blocks
      SELECT: to query data sources
      UNION/JOIN: to combine the results
    XML document

GDMS Concepts II
  Transformation Functions
    Why: it is hard to predict all the functionality one will need
    Flexibility
    Static/dynamic Java functions
    Use logical names for parameters

GDMS Architecture (figure)

Example I - Scenario
  Heterogeneities:
    Name in A is "First Last" (as the target format)
    Name in C has to be combined
  Distribution: 3 data sources
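
The operator building blocks named in GDMS Concepts I (UNION/JOIN to combine SELECT results) can be pictured as plain functions over rows. What follows is only a minimal in-memory sketch, assuming rows are maps from column name to string value; the class name, method names and sample rows are hypothetical and are not GDMS code. The deck itself states that the prototype's operators are XQuery based.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "union all" and "inner join" over in-memory rows.
class MediationOperatorsSketch {

    // UNION ALL: concatenate both inputs and keep duplicates.
    static List<Map<String, String>> unionAll(List<Map<String, String>> a,
                                              List<Map<String, String>> b) {
        List<Map<String, String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // INNER JOIN on a single key column, written as a simple hash join.
    static List<Map<String, String>> innerJoin(List<Map<String, String>> left,
                                               List<Map<String, String>> right,
                                               String key) {
        // Index the right input by its key value; rows without a key never match.
        Map<String, List<Map<String, String>>> index = new HashMap<>();
        for (Map<String, String> r : right) {
            String k = r.get(key);
            if (k != null) {
                index.computeIfAbsent(k, x -> new ArrayList<>()).add(r);
            }
        }
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            List<Map<String, String>> matches = index.get(l.get(key));
            if (matches == null) {
                continue; // no partner row: dropped by an inner join
            }
            for (Map<String, String> r : matches) {
                Map<String, String> merged = new LinkedHashMap<>(l);
                merged.putAll(r); // columns of the right row are appended
                out.add(merged);
            }
        }
        return out;
    }

    // Helper to build a row from alternating column names and values.
    static Map<String, String> row(String... kv) {
        Map<String, String> r = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            r.put(kv[i], kv[i + 1]);
        }
        return r;
    }

    public static void main(String[] args) {
        // Made-up sample rows for three sources A, B and C.
        List<Map<String, String>> a = Arrays.asList(row("pid", "1", "p_name", "name-1"));
        List<Map<String, String>> b = Arrays.asList(row("pid", "1", "city", "city-1"),
                                                    row("pid", "2", "city", "city-2"));
        List<Map<String, String>> c = Arrays.asList(row("pid", "3", "p_name", "name-3"));
        // (A inner join B on pid) union all C
        System.out.println(unionAll(innerJoin(a, b, "pid"), c));
    }
}
```

A hash join is used here only because it is short; it says nothing about how the mediator actually evaluates joins.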

Example II - the basic mapping

    <VDSTable name="patient">
      <union kind="all">
        <join>
          <select source="xmldb:.." name="A">
            <mapSource>….</mapSource>
            <sourcePart>collectionXYZ</sourcePart>
          </select>
          <select source="jdbc://…" name="B">
            <mapSource>….</mapSource>
            <sourcePart>databaseXYZ</sourcePart>
          </select>
          <joinInfo kind="inner">
            <left keys="pid"/>
            <right keys="pid"/>
          </joinInfo>
        </join>
        <select source="file://..." name="C">
          <mapSource>….</mapSource>
          <sourcePart></sourcePart>
        </select>
      </union>
    </VDSTable>
    ………….

Example III - CSV mapSource

  CSV file line example:

    1;Woehrer;Alexander;Vienna;24/07/1980;01/01/2004

  mapSource:

    <mapSource>
      <ColSeperator>;</ColSeperator>
      <LineSeperator>\r\n</LineSeperator>
      …
      <column ref="p_name" transform="combine(fn,ln)"></column>
      <column ref="ln">
        <source>2</source>
      </column>
      <column ref="fn">
        <source>3</source>
      </column>
      …
    </mapSource>

  Transformation function for the CSV file:

    public class TestTransform {
        // Joins two name parts with a single space, e.g. first and last name.
        public static String combine(String one, String two) {
            return one + " " + two;
        }
    }

Example IV - XML mapSource

  mapSource:

    <mapSource>
      …
      <column ref="p_id">
        <source>/entry/@id</source>
      </column>
      <column ref="p_name">
        <source>/entry/name/text()</source>
      </column>
      …
    </mapSource>

  Example entry in XMLDB:

    <entry id="1">
      <name>Alexander Woehrer</name>
      <Address>Edinburgh</Address>
      <dob>24/07/1980</dob>
    </entry>

Example Query execution
  Query: SELECT p_name FROM patient WHERE id=10
  Standard to optimized
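
To make the CSV mapSource of Example III concrete, the sketch below binds the columns `ln` and `fn` to fields of the sample line and then resolves `transform="combine(fn,ln)"` by logical parameter name. It is a rough illustration under stated assumptions: `<source>` numbers are taken to be 1-based field positions (which matches the sample line), transformation functions are taken to be public static methods with `String` parameters, and the helper names `bind`, `applyTransform` and `patientName` are hypothetical, not part of GDMS.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of a CSV mapSource with a transformation function.
class CsvMapSourceSketch {

    // The transformation function from the slide.
    public static String combine(String one, String two) {
        return one + " " + two;
    }

    // Split one line on the column separator and bind logical column names
    // to field values. Assumption: positions are 1-based.
    static Map<String, String> bind(String line, String separator,
                                    Map<String, Integer> sourcePositions) {
        String[] fields = line.split(Pattern.quote(separator), -1);
        Map<String, String> row = new HashMap<>();
        for (Map.Entry<String, Integer> e : sourcePositions.entrySet()) {
            row.put(e.getKey(), fields[e.getValue() - 1]);
        }
        return row;
    }

    // Evaluate an expression of the form "name(arg1,arg2)": look up a public
    // static method with that name and pass the values of the logical names.
    static String applyTransform(String expr, Class<?> functions,
                                 Map<String, String> row) {
        int open = expr.indexOf('(');
        String name = expr.substring(0, open).trim();
        String[] argNames = expr.substring(open + 1, expr.lastIndexOf(')')).split(",");
        Class<?>[] types = new Class<?>[argNames.length];
        Object[] args = new Object[argNames.length];
        for (int i = 0; i < argNames.length; i++) {
            types[i] = String.class;
            args[i] = row.get(argNames[i].trim());
        }
        try {
            Method m = functions.getMethod(name, types);
            return (String) m.invoke(null, args); // null receiver: static method
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot apply " + expr, e);
        }
    }

    // p_name for one CSV line, following the mapSource of Example III.
    static String patientName(String line) {
        Map<String, Integer> positions = new HashMap<>();
        positions.put("ln", 2);
        positions.put("fn", 3);
        Map<String, String> row = bind(line, ";", positions);
        return applyTransform("combine(fn,ln)", CsvMapSourceSketch.class, row);
    }

    public static void main(String[] args) {
        // Prints: Alexander Woehrer
        System.out.println(patientName("1;Woehrer;Alexander;Vienna;24/07/1980;01/01/2004"));
    }
}
```

Resolving the function by reflection is one way to read "static/dynamic Java functions"; how GDMS actually loads and calls them is not shown in the slides.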

Current GDMS Prototype
  Supported data sources:
    RDBMS (via JDBC)
    XMLDB (Xindice)
    CSV files
  Operators: "union all" and "inner join"
  Centralized version inside OGSA-DAI R3
  SQL subset:
    SELECT column [, column]
    FROM table
    WHERE condition [AND|OR condition]
    ORDER BY column [, column] [ASC|DESC]
  Operators are XQuery based (using SAXON)

Future Work
  Evolve to a distributed mediator
  Investigate the use of a proxy database to increase performance
  Improve tools for
    Installation/setup
    Maintenance
    Semantic issues
    Performance issues

References
  GridMiner project page: http://www.gridminer.org
  OGSA-DAI and DQP: http://www.ogsadai.org
  SAXON XSLT/XQuery Processor: http://saxon.sf.net/
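
For the XMLDB data source, the XML mapSource of Example IV pairs each column with a path expression such as `/entry/@id`. The sketch below evaluates those two expressions against the sample entry. It uses the JDK's built-in `javax.xml.xpath` API as a stand-in, not Xindice or SAXON, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration of an XML mapSource: column name -> XPath expression.
class XmlMapSourceSketch {

    // Evaluate every column's XPath against one XML entry and return a row.
    // The entry is re-parsed per column to keep the sketch short.
    static Map<String, String> mapEntry(String xml, Map<String, String> columnPaths) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> row = new LinkedHashMap<>();
        try {
            for (Map.Entry<String, String> column : columnPaths.entrySet()) {
                String value = xpath.evaluate(column.getValue(),
                        new InputSource(new StringReader(xml)));
                row.put(column.getKey(), value);
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath or XML input", e);
        }
        return row;
    }

    public static void main(String[] args) {
        String entry = "<entry id=\"1\"><name>Alexander Woehrer</name>"
                + "<Address>Edinburgh</Address><dob>24/07/1980</dob></entry>";
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("p_id", "/entry/@id");
        paths.put("p_name", "/entry/name/text()");
        System.out.println(mapEntry(entry, paths));
    }
}
```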