SEMDIG
Providing Data Access and Data Related
Monitoring Information for Data
Integration on the Grid
Alexander Wöhrer and Peter Brezany
Institute of Scientific Computing
University of Vienna
{woehrer|brezany}@par.univie.ac.at
supported by:
funded by:
Contents
• context of SemDIG
• starting scenario
• information needed for query optimization
and adaptive query processing (AQP)
• continuous data statistics with D³G
• overall strategy for metadata about data
sources
• future work and conclusions
Context of this work
• SemDIG: Semantic Data Integration on the
Grid
– 2-year project
– focus on:
• Query Optimization
– e.g. early exclusion of data sources
– which source to take?
• Adaptive Query Processing
– e.g. changes to the available data source indexes
• Pilot applications:
– ecological (via AustrianGrid)
– GridMiner project
Starting scenario I
• ecological application
• need to query measurement data
from water, air and soil
– various replicas defined
[Figure: the AIR, WATER and SOIL data sets, each with the main sources MAIN_1 and MAIN_2 and the replicas REP_1 and REP_2]
Starting scenario II
• Questions for DAI:
– which sources can provide data to answer
a query with various conditions?
– take main source or replica?
– what are the data distribution and volume
(important for query optimisation)?
• "Normal" answers without further metadata:
– all main sources
– take the main source if available
– normal distribution of the values
Starting scenario III
• An example query plan could look like this:
[Figure: query plan distributed over Host 1, Host 2 and Host 3: three U (union) operators, each over MAIN_1 and MAIN_2, combined by two J (join) operators]
Information needed for further
DAI optimisations
• Data access related:
– Available indexes
• provided by OGSA-DAI on request
– Connection time
• indicator for current database workload
• Data related:
– available histograms
– exact data statistics (for columns often
used in conditions!)
General idea: provide more information to obtain
better initial query plans and to support AQP
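To illustrate how the data access related metadata above might feed into an initial plan, the following is a minimal sketch. The record type `SourceMetadata`, the function `order_sources` and the ordering rule (prefer a source with an index on the condition column, then the lower connection time) are assumptions made for this example; they are not part of OGSA-DAI or SemDIG.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceMetadata:
    """Hypothetical per-source record of data access related metadata."""
    name: str
    indexed_columns: List[str] = field(default_factory=list)
    connection_time_ms: float = float("inf")  # unknown until measured


def order_sources(sources, condition_column):
    """Order candidates: indexed sources first, then by connection time.

    The ranking rule is an assumption for illustration only.
    """
    return sorted(
        sources,
        key=lambda s: (condition_column not in s.indexed_columns,
                       s.connection_time_ms),
    )


# Made-up values, loosely following the water example in the slides.
water_sources = [
    SourceMetadata("MAIN_1", [], 40.0),
    SourceMetadata("MAIN_2", [], 900.0),
    SourceMetadata("REP_1", ["WATER_ID"], 55.0),
]
ranked = [s.name for s in order_sources(water_sources, "WATER_ID")]
```

A real optimiser would weigh these factors in a cost model; the tuple sort key only makes the preference order explicit.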
Envisioned Solution
• independent of the actual data
access technology
• supports and uses SOA features
– e.g. subscribe to index changes
[Figure: a HOST running an RDBMS exposes data access related information (connection time, indexes) and data related information (histograms, data statistics)]
Histograms
http://www.dba-oracle.com/art_builder_histo.htm
• important for the cost-based optimiser
• available from the system tables of a
DBMS
Exact Data Statistics
• expensive to query each time they are
needed
• Idea:
– gather once
– include the effect of the delta (increment)
for various database operations (insert,
delete, update)
• Advantage:
– Low running costs
– can be used to exclude data sources from a
query plan early
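The "gather once, then apply the delta" idea can be sketched as follows. The sketch assumes that the count, the sum and the sum of squares are kept for each numerical column, from which mean and standard deviation follow. The class `ColumnStats` and its methods are hypothetical names for this illustration and are not taken from D³G.

```python
import math


class ColumnStats:
    """Running statistics for one numerical column (hypothetical helper)."""

    def __init__(self, values):
        # Gather once: a single full scan; None stands for a missing value.
        present = [v for v in values if v is not None]
        self.total = len(values)
        self.missing = self.total - len(present)
        self.s1 = sum(present)                 # sum of values
        self.s2 = sum(v * v for v in present)  # sum of squared values

    def insert(self, v):
        self.total += 1
        if v is None:
            self.missing += 1
        else:
            self.s1 += v
            self.s2 += v * v

    def delete(self, v):
        self.total -= 1
        if v is None:
            self.missing -= 1
        else:
            self.s1 -= v
            self.s2 -= v * v

    def update(self, old, new):
        # An update is the delta of removing the old and adding the new value.
        self.delete(old)
        self.insert(new)

    @property
    def mean(self):
        n = self.total - self.missing
        return self.s1 / n if n else None

    @property
    def std(self):
        # Population standard deviation from the two running sums.
        n = self.total - self.missing
        if n == 0:
            return None
        variance = self.s2 / n - (self.s1 / n) ** 2
        return math.sqrt(max(variance, 0.0))


stats = ColumnStats([1.0, 2.0, None, 4.0])
stats.insert(5.0)
stats.delete(1.0)
stats.update(2.0, 3.0)
```

Each operation costs constant time regardless of the table size. The sum-of-squares form is simple but can lose precision for large values; a production version might prefer a numerically more stable update rule.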
D³G RDBMS-side architecture
[Figure: RDBMS-side components: data statistics, triggers, a stored procedure and the tables; arrow labels: update, monitor, init, create]
• All triggers are generated dynamically (according to the table
structure) after the data statistics have been initialised
• Maintenance:
– row trigger after delete/insert/update to update the
following values of a table:
• mean, standard deviation (numerical)
• missing and total frequency
– statement trigger to keep min/max for columns up to date
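The row-trigger part of this maintenance scheme can be sketched with Python's built-in `sqlite3` module. This is not the D³G implementation: the slides use Oracle 10g, whereas SQLite is swapped in here so that the example runs standalone, and SQLite offers only row-level triggers, so the statement trigger for min/max is not shown. The table `water`, its column `ph` and the statistics table `water_ph_stats` are hypothetical, and the triggers are written by hand for one column rather than generated from the table structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing table with some data (None is stored as NULL, i.e. missing).
conn.execute("CREATE TABLE water (water_id INTEGER PRIMARY KEY, ph REAL)")
conn.executemany("INSERT INTO water VALUES (?, ?)",
                 [(1, 7.0), (2, None), (3, 8.0)])

conn.executescript("""
-- init: gather the statistics once with a full scan
CREATE TABLE water_ph_stats (total INTEGER, missing INTEGER, s1 REAL, s2 REAL);
INSERT INTO water_ph_stats
    SELECT COUNT(*), COUNT(*) - COUNT(ph),
           COALESCE(SUM(ph), 0), COALESCE(SUM(ph * ph), 0)
    FROM water;

-- create: row triggers that apply the delta of each change
CREATE TRIGGER water_ph_ins AFTER INSERT ON water
BEGIN
    UPDATE water_ph_stats SET
        total   = total + 1,
        missing = missing + (NEW.ph IS NULL),
        s1      = s1 + COALESCE(NEW.ph, 0),
        s2      = s2 + COALESCE(NEW.ph * NEW.ph, 0);
END;

CREATE TRIGGER water_ph_del AFTER DELETE ON water
BEGIN
    UPDATE water_ph_stats SET
        total   = total - 1,
        missing = missing - (OLD.ph IS NULL),
        s1      = s1 - COALESCE(OLD.ph, 0),
        s2      = s2 - COALESCE(OLD.ph * OLD.ph, 0);
END;

CREATE TRIGGER water_ph_upd AFTER UPDATE OF ph ON water
BEGIN
    UPDATE water_ph_stats SET
        missing = missing - (OLD.ph IS NULL) + (NEW.ph IS NULL),
        s1      = s1 - COALESCE(OLD.ph, 0) + COALESCE(NEW.ph, 0),
        s2      = s2 - COALESCE(OLD.ph * OLD.ph, 0)
                     + COALESCE(NEW.ph * NEW.ph, 0);
END;
""")

# Ordinary DML; the triggers keep the statistics current.
conn.execute("INSERT INTO water VALUES (4, 6.0)")
conn.execute("DELETE FROM water WHERE water_id = 1")
conn.execute("UPDATE water SET ph = 7.0 WHERE water_id = 2")

row = conn.execute(
    "SELECT total, missing, s1, s2 FROM water_ph_stats").fetchone()
```

Mean and standard deviation can then be derived from `total`, `missing`, `s1` and `s2` without touching the base table.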
D³G RDBMS-side performance
[Figure: performance of the RDBMS-side functionality in msec]
• Setup:
– table with 11 columns (9 numerical)
– Oracle 10g on an AMD 1 GHz, 768 MB RAM
• init runs just once per table
• row trigger (RT) runtime is independent of the table size
• no updates to min/max => statement trigger (ST) returns immediately
Target DAI scenario I
• The following information is available:
– Water
• REP_1 has an index on a column used
• MAIN_2 exposes 1 < WATER_ID < 5000
– Soil
• MAIN_2 has a very bad connection time
– Air
• MAIN_1 exposes 1 < AIR_ID < 100,000
Let the query be:
select * from water, soil, air
where .... WATER_ID > 10000 and AIR_ID > 150000
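With the exposed value ranges listed above, some sources can be ruled out before the query is planned. The following minimal sketch shows one way to do this for conditions of the form "column > threshold". The function `prune_sources` and the dictionary layout are assumptions for this illustration and not an existing DAI interface; sources that expose no range are kept because nothing is known about them.

```python
def prune_sources(candidates, exposed, lower_bounds):
    """Drop sources whose exposed range cannot satisfy 'column > threshold'.

    candidates:   {table: [source, ...]}
    exposed:      {(table, source): (column, low, high)}, an open interval
                  as in "1 < WATER_ID < 5000"
    lower_bounds: {column: threshold} for conditions "column > threshold"
    """
    kept = {}
    for table, sources in candidates.items():
        kept[table] = []
        for source in sources:
            info = exposed.get((table, source))
            if info is not None:
                column, _low, high = info
                threshold = lower_bounds.get(column)
                # All values are below `high`; if high <= threshold,
                # no row of this source can match the condition.
                if threshold is not None and high <= threshold:
                    continue
            kept[table].append(source)
    return kept


all_sources = ["MAIN_1", "MAIN_2", "REP_1", "REP_2"]
candidates = {"water": list(all_sources),
              "soil": list(all_sources),
              "air": list(all_sources)}
exposed = {("water", "MAIN_2"): ("WATER_ID", 1, 5000),
           ("air", "MAIN_1"): ("AIR_ID", 1, 100000)}
conditions = {"WATER_ID": 10000, "AIR_ID": 150000}

kept = prune_sources(candidates, exposed, conditions)
```

For the example query, MAIN_2 is dropped for water and MAIN_1 for air, while all soil sources remain candidates.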
Target DAI scenario:
Starting query plan
[Figure: starting query plan distributed over Host 1, Host 2 and Host 3: three U (union) operators, each over MAIN_1 and MAIN_2, combined by two J (join) operators]
Target DAI scenario II
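[Figure: optimised query plan over Host 1, Host 2 and Host 3 with two J (join) operators and one U (union) operator; remaining sources: REP_1, MAIN_1, REP_2, MAIN_2]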
• exclude data sources early
• histograms and information about row
counts could be used to change how operators
are distributed across hosts
Conclusions
• Efficient DAI needs more metadata about a data source
– Data related
• histograms
• data statistics
– Data access related
• indexes
• connection time
– Additionally: information about the main source and information
about its replicas can be combined into more knowledge about one source
• D³G shows promising first results
• Query optimisation as well as AQP could profit
– QO: better initial query plans
– AQP: react to index changes, more information used during adaptation
• More information on this and future work:
http://www.par.univie.ac.at/project/semdig
References
• Jim Gray, "Distributed Computing Economics", technical report,
2003
• Alexander Wöhrer, Lenka Novakova, Peter Brezany
and A Min Tjoa, "D3G: Novel Approaches to Data
Statistics, Understanding and Preprocessing on the
Grid", accepted for IEEE AINA, Vienna, 2006
• SemDIG, http://www.par.univie.ac.at/project/semdig
• PMML, http://www.dmg.org