SEMDIG
Providing Data Access and Data Related Monitoring Information for Data Integration on the Grid

Alexander Wöhrer and Peter Brezany
Institute of Scientific Computing, University of Vienna
{woehrer|brezany}@par.univie.ac.at

supported by: funded by:

Contents
• context of SemDIG
• starting scenario
• information needed for query optimisation and adaptive query processing (AQP)
• continuous data statistics with D³G
• overall strategy for metadata about data sources
• future work and conclusions

Context of this work
• SemDIG: Semantic Data Integration on the Grid
  – 2-year project
  – focus on:
    • Query Optimisation
      – e.g. early exclusion of data sources
      – which source to take?
    • Adaptive Query Processing
      – e.g. changes on available data source indexes
• Pilot applications:
  – ecological (via AustrianGrid)
  – GridMiner project

Starting scenario I
• ecological application
• need to query measurement data from water, air and soil
  – various replicas defined
[Figure: the data sets AIR, WATER and SOIL, each with the main sources MAIN_1 and MAIN_2 and the replicas REP_1 and REP_2]

Starting scenario II
• Questions for DAI:
  – which sources can provide data to answer a query with various conditions?
  – take main source or replica?
  – data distribution and volume (important for query optimisation)?
• "Normal" answers:
  – all main sources
  – take main source if available
  – normal distribution of the values

Starting scenario III
• An example query plan could look like this:
[Figure: query plan distributed over Host 1, Host 2 and Host 3, with two J operators above three U operators, each U combining MAIN_1 and MAIN_2]

Needed information for further DAI optimisations
• Data access related:
  – available indexes
    • provided by OGSA-DAI on request
  – connection time
    • indicator for current database workload
• Data related:
  – available histograms
  – exact data statistics (for columns often used in conditions!)
General idea: provide more information for better initial query plans and support AQP

Envisioned Solution
• independent of the actual data access technology
• supporting/using SOA features
  – e.g. subscribe to index changes
[Figure: a HOST with an RDBMS exposes data access related information (connection time, indexes) and data related information (histograms, data statistics)]

Histograms
http://www.dba-oracle.com/art_builder_histo.htm
• important for the cost-based optimiser
• available from the system tables of a DBMS

Exact Data Statistics
• expensive to query each time they are needed
• Idea:
  – gather once
  – include the effect of the delta (increment) for various database operations (insert, delete, update)
• Advantage:
  – low running costs
  – use to exclude (refute) data sources from a query plan early

D³G RDBMS-side architecture
[Figure: RDBMS side, with the elements stored procedure (init, create), tables, triggers (monitor, update) and data statistics]
All triggers are generated dynamically (according to the table structure) after the data statistics have been initialised.
• Maintenance:
  – row trigger after delete/insert/update to update the following values of a table:
    • mean, standard deviation (numerical)
    • missing and total frequency
  – statement trigger to keep min/max for columns up to date

D³G RDBMS-side performance
[Figure: performance of RDBMS-side functionality in msec]
• Setup:
  – table with 11 columns (9 numerical)
  – Oracle 10g on an AMD 1 GHz, 768 MB RAM
• init runs just once per table
• row trigger (RT) is independent of the table size
• no updates to min/max => statement trigger (ST) returns immediately

Target DAI scenario I
• The following information is available:
  – Water
    • REP_1 has an index on a column used in the query
    • MAIN_2 exposes 1 < WATER_ID < 5000
  – Soil
    • MAIN_2 has a very bad connection time
  – Air
    • MAIN_1 exposes 1 < AIR_ID < 100000
Let the query be:
select * from water, soil, air where .... WATER_ID > 10000 and AIR_ID > 150000

Target DAI scenario: Starting query plan
[Figure: the starting query plan distributed over Host 1, Host 2 and Host 3, with two J operators above three U operators, each U combining MAIN_1 and MAIN_2]

Target DAI scenario II
[Figure: resulting query plan over Host 1, Host 2 and Host 3, with two J operators and one U operator, reading from REP_1, MAIN_1, REP_2 and MAIN_2]
• refute data sources early
• histograms and information about row numbers could be used to change the operator distribution

Conclusions
• Efficient DAI needs more metadata about a data source
  – Data related
    • histograms
    • data statistics
  – Data access related
    • indexes
    • connection time
Additionally: info about main source + info about replicas = more knowledge about one source (combine it)
• D³G: promising first results
• Query optimisation as well as AQP could profit
  – QO: better initial query plans
  – AQP: react to index changes, more information used during adaptation
• More information on this and future work: http://www.par.univie.ac.at/project/semdig

References
• Jim Gray, "Distributed Computing Economics", TR, 2003
• Alexander Wöhrer, Lenka Novakova, Peter Brezany and A Min Tjoa, "D3G: Novel Approaches to Data Statistics, Understanding and Preprocessing on the Grid", accepted for IEEE AINA, Vienna, 2006
• SemDIG, http://www.par.univie.ac.at/project/semdig
• PMML, http://www.dmg.org
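
Illustration: delta-based statistics maintenance
The idea on the "Exact Data Statistics" slide (gather once, then fold the delta of each insert, delete or update into the stored values) can be illustrated outside the database. The following is a minimal Python sketch, not the D³G trigger code: the class name ColumnStats, its fields and the choice of Welford-style update formulas are assumptions made for illustration, and the population standard deviation is picked arbitrarily.

```python
import math
from typing import Optional


class ColumnStats:
    """Running statistics for one numerical column (hypothetical helper)."""

    def __init__(self) -> None:
        self.total = 0     # all rows, including rows with a missing value
        self.missing = 0   # rows where the column is NULL (None here)
        self.n = 0         # rows that carry a value
        self.mean = 0.0
        self.m2 = 0.0      # sum of squared deviations from the mean

    def insert(self, x: Optional[float]) -> None:
        """Fold one inserted row into the statistics."""
        self.total += 1
        if x is None:
            self.missing += 1
            return
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def delete(self, x: Optional[float]) -> None:
        """Remove the effect of one deleted row (inverse of insert)."""
        self.total -= 1
        if x is None:
            self.missing -= 1
            return
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        mean_without = (self.n * self.mean - x) / (self.n - 1)
        self.m2 -= (x - self.mean) * (x - mean_without)
        self.mean = mean_without
        self.n -= 1

    def update(self, old: Optional[float], new: Optional[float]) -> None:
        """An update is treated as a delete followed by an insert."""
        self.delete(old)
        self.insert(new)

    @property
    def std(self) -> float:
        """Population standard deviation of the non-missing values."""
        if self.n == 0:
            return 0.0
        return math.sqrt(max(self.m2, 0.0) / self.n)
```

Each call does a constant amount of work no matter how many rows have been seen, which fits the observation that the row trigger is independent of the table size. Min and max are left out of the sketch because a deleted extreme value cannot be corrected from the delta alone.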
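
Illustration: refuting data sources early
The early exclusion of sources in the target DAI scenario can be sketched with the value ranges and query conditions given on the "Target DAI scenario I" slide. This is a hedged Python sketch under stated assumptions: the names Source, may_match and prune are hypothetical, only conditions of the form column > value are handled, replicas are left out, and a source that exposes no range is always kept.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Source:
    """One data source of a table, optionally exposing an open value range."""
    table: str
    name: str
    column: Optional[str] = None   # column the exposed range refers to
    low: Optional[float] = None    # exposed range: low < column < high
    high: Optional[float] = None


def may_match(source: Source, greater_than: Dict[str, float]) -> bool:
    """Return False only if a 'column > bound' condition rules the source out."""
    if source.column is None or source.high is None:
        return True  # nothing exposed: keep the source to stay safe
    bound = greater_than.get(source.column)
    if bound is None:
        return True  # the query has no condition on this column
    # Every value is below source.high, so none can exceed a bound >= high.
    return source.high > bound


def prune(sources: List[Source], greater_than: Dict[str, float]) -> List[Source]:
    """Keep only the sources that can still contribute rows to the query."""
    return [s for s in sources if may_match(s, greater_than)]


# Metadata and query conditions taken from the target DAI scenario slide.
SOURCES = [
    Source("water", "MAIN_1"),
    Source("water", "MAIN_2", "WATER_ID", 1, 5000),
    Source("soil", "MAIN_1"),
    Source("soil", "MAIN_2"),
    Source("air", "MAIN_1", "AIR_ID", 1, 100000),
    Source("air", "MAIN_2"),
]
QUERY = {"WATER_ID": 10000, "AIR_ID": 150000}

if __name__ == "__main__":
    for s in prune(SOURCES, QUERY):
        print(s.table, s.name)
```

The check is deliberately conservative: a source is dropped only when its exposed upper bound proves that no row can satisfy the condition, so missing or stale metadata never removes a source that might still hold matching data.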