Providing Data Access and Data Related Monitoring Information for Data

SemDIG
Providing Data Access and Data Related Monitoring Information for Data Integration on the Grid
Alexander Wöhrer and Peter Brezany
Institute of Scientific Computing, University of Vienna
{woehrer|brezany}@par.univie.ac.at
supported by: funded by:

Contents
• context of SemDIG
• starting scenario
• information needed for query optimization and adaptive query processing (AQP)
• continuous data statistics with D³G
• overall strategy for metadata about data sources
• future work and conclusions

Context of this work
• SemDIG: Semantic Data Integration on the Grid
  – 2-year project
  – focus on:
    • Query Optimization
      – e.g. early exclusion of data sources
      – which source to take?
    • Adaptive Query Processing
      – e.g. changes to the available data source indexes
• Pilot applications:
  – ecological (via AustrianGrid)
  – GridMiner project

Starting scenario I
• ecological application
• need to query measurement data from water, air and soil
  – various replicas defined
[Figure: three data sets (AIR, WATER, SOIL), each with main sources MAIN_1 and MAIN_2 and replicas REP_1 and REP_2]

Starting scenario II
• Questions for DAI:
  – which sources can provide data to answer a query with various conditions?
  – take main source or replica?
  – data distribution and volume (important for query optimisation)?
• "Normal" answers:
  – all main sources
  – take main source if available
  – normal distribution of the values

Starting scenario III
• An example query plan could look like this:
[Figure: query plan distributed over Host 1, Host 2 and Host 3, with two J operators and three U operators, each U combining MAIN_1 and MAIN_2]

Needed information for further DAI optimisations
• Data access related:
  – available indexes
    • provided by OGSA-DAI on request
  – connection time
    • indicator for current database workload
• Data related:
  – available histograms
  – exact data statistics (for columns often used in conditions!)
General idea: provide more information for better initial query plans and support AQP

Envisioned Solution
• independent of the actual data access technology
• supporting/using SOA features
  – e.g. subscribe to index changes
[Figure: a host with an RDBMS exposing data access related information (connection time, indexes) and data related information (histograms, data statistics)]

Histograms
[Figure: histogram illustration, source: http://www.dba-oracle.com/art_builder_histo.htm]
• important for the cost-based optimiser
• available from the system tables of a DBMS

Exact Data Statistics
• expensive to query each time they are needed
• Idea:
  – gather once
  – include the effect of the delta (increment) for various database operations (insert, delete, update)
• Advantage:
  – low running costs
  – use to exclude data sources from a query plan early

D³G RDBMS-side architecture
[Figure: RDBMS-side components: data statistics, triggers, stored procedure, tables; arrow labels: update, monitor, init, create]
All triggers are generated dynamically (according to the table structure) after the data statistics have been initialized
• Maintenance:
  – row trigger after delete/insert/update to update the following values of a table:
    • mean, standard deviation (numerical)
    • missing and total frequency
  – statement trigger to keep min/max for columns up to date

D³G RDBMS-side performance
[Figure: performance of RDBMS-side functionality in msec]
• Setup:
  – table with 11 columns (9 numerical)
  – Oracle 10g on an AMD 1 GHz, 768 MB RAM
• init just once per table
• RT (row trigger) independent of the table size
• no updates to min/max => ST (statement trigger) returns immediately

Target DAI scenario I
• The following information is available:
  – Water
    • REP_1 has an index on a column used in the query
    • MAIN_2 exposes 1 < WATER_ID < 5000
  – Soil
    • MAIN_2 has a very bad connection time
  – Air
    • MAIN_1 exposes 1 < AIR_ID < 100000
Let the query be:
select * from water, soil, air where .... WATER_ID > 10000 and AIR_ID > 150000

Target DAI scenario: Starting query plan
[Figure: query plan distributed over Host 1, Host 2 and Host 3, with two J operators and three U operators, each U combining MAIN_1 and MAIN_2]

Target DAI scenario II
[Figure: revised query plan over Host 1, Host 2 and Host 3, with two J operators and one U operator, reading from REP_1, MAIN_1, REP_2 and MAIN_2]
• exclude data sources early (here: Water MAIN_2 and Air MAIN_1, whose exposed ID ranges cannot satisfy the query conditions)
• histograms and information about row numbers could be used to change the operator distribution

Conclusions
• Efficient DAI needs more metadata about a data source
  – Data related
    • histograms
    • data statistics
  – Data access related
    • indexes
    • connection time
Additionally: info about main source + info about replicas = more knowledge about one source (combine it)
• D³G shows promising first results
• Query optimisation as well as AQP could profit
  – QO: better initial query plans
  – AQP: react to index changes, more information used during adaptation
• More information on this and future work: http://www.par.univie.ac.at/project/semdig

References
• Jim Gray, "Distributed Computing Economics", TR, 2003
• Alexander Wöhrer, Lenka Novakova, Peter Brezany and A Min Tjoa, "D3G: Novel Approaches to Data Statistics, Understanding and Preprocessing on the Grid", accepted for IEEE AINA, Vienna, 2006
• SemDIG, http://www.par.univie.ac.at/project/semdig
• PMML, http://www.dmg.org
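
Sketch: delta-based statistics maintenance
The idea from the Exact Data Statistics and D³G slides (gather statistics once, then apply only the delta of each insert, delete or update) can be illustrated with a minimal Python sketch. It is not the D³G implementation, which the slides describe as triggers and a stored procedure inside the RDBMS. The class name ColumnStats, its fields and its methods are hypothetical, and the sketch assumes that keeping a running sum and sum of squares per numerical column is enough to derive the mean and the (population) standard deviation. Min/max maintenance, which the slides assign to a statement trigger, is left out.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics, updated by deltas only."""
    total: int = 0           # total frequency: all rows, including missing values
    missing: int = 0         # missing frequency: rows where the value is None (NULL)
    value_sum: float = 0.0   # running sum of the non-missing values
    square_sum: float = 0.0  # running sum of the squared non-missing values

    def insert(self, value: Optional[float]) -> None:
        """Apply the delta of one inserted row."""
        self.total += 1
        if value is None:
            self.missing += 1
        else:
            self.value_sum += value
            self.square_sum += value * value

    def delete(self, value: Optional[float]) -> None:
        """Apply the delta of one deleted row (its old value must be known)."""
        self.total -= 1
        if value is None:
            self.missing -= 1
        else:
            self.value_sum -= value
            self.square_sum -= value * value

    def update(self, old: Optional[float], new: Optional[float]) -> None:
        """An update is treated as a delete of the old value plus an insert."""
        self.delete(old)
        self.insert(new)

    def mean(self) -> Optional[float]:
        present = self.total - self.missing
        return self.value_sum / present if present else None

    def std_dev(self) -> Optional[float]:
        """Population standard deviation derived from the running sums."""
        present = self.total - self.missing
        if present == 0:
            return None
        mean = self.value_sum / present
        # Clamp at zero: rounding can make the difference slightly negative.
        variance = max(self.square_sum / present - mean * mean, 0.0)
        return math.sqrt(variance)


# Usage: initialise once from the existing rows, then apply deltas only.
stats = ColumnStats()
for measurement in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.insert(measurement)
```

Each operation costs constant time regardless of how many rows the table holds, which matches the slide's remark that the row trigger is independent of the table size. The sum-of-squares formula can lose precision for very large values; a production version would likely need a more stable update rule.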
