Distributed Query Processing Basics

advertisement
Distributed Query Processing
Based on “The state of the art in distributed query processing”
Donald Kossman (ACM Computing Surveys, 2000)
Motivation
• Cost and scalability: network of off-shelf
machines
• Integration of different software vendors
(with own DBMS)
• Integration of legacy systems
• Applications inherently distributed, such as
workflow or collaborative-design
• State-of-the-art distributed information
technologies (e-businesses)
Part 1 : Basics
• Query Processing Basics
– centralized query processing
– distributed query processing
Problem Statement
• Input: Query such as „Biological objects in
study A referenced in a literature in journal Y“.
• Output: Answer
• Objectives:
– response time, throughput, first answers, little IO, ...
• Centralized vs. Distributed Query Processing
– same basic problem
– but, more and different parameters, such(data sites
or available machine power) and objectives
Steps in Query Processing
• Input: Declarative Query
– SQL, XQuery, ...
• Step 1: Translate Query into Algebra
– Tree of operators (query plan generation)
• Step 2: Optimize Query
– Tree of operators (logical) - also select partitions of table
– Tree of operators (physical) – also site annotations
– (Compilation)
• Step 3: Execution
– Interpretation; Query result generation
Algebra
A.d
SELECT A.d
FROM A, B
WHERE A.a = B.b
AND A.c = 35
A.a = B.b,
A.c = 35
X
A
B
– relational algebra for SQL very well understood
– algebra for XQuery mostly understood
Query Optimization
A.d
A.d
A.a = B.b,
A.c = 35
hashjoin
X
A
B.b
B
index A.c
B
– logical, e.g., push down cheap predicates
– enumerate alternative plans, apply cost model
– use search heuristics to find cheapest plan
Basic Query Optimization
• Classical Dynamic Programming algorithm
– Performs join order optimization
– Input : Join query on n relations
– Output : Best join order
The Dynamic Prog. Algorithm
for i = 1 to n do {
optPlan({Ri}) = accessPlans(Ri)
prunePlans(optPlan({Ri}))
}
for i = 2 to n do
for all S  { R1, R2 … Rn } such that |S| = i do {
optPlan(S) = 
for all O  S do {
optPlan(S) = optPlan(S) 
joinPlans(optPlan(O), optPlan(S – O))
prunePlans(optPlan(S))
}
}
return optPlan({R1, R2, … Rn})
Query Execution
John
A.d
(John, 35, CS)
hashjoin
(John, 35, CS)
(Mary, 35, EE)
index A.c
(CS)
(AS)
B.b
(Edinburgh, CS,5.0)
(Edinburgh, AS, 6.0)
B
– library of operators (hash join, merge join, ...)
– exploit indexes and clustering in database
– pipelining (iterator model)
Summary : Centralized Queries
• Basic SQL (SPJG, nesting) well understood
• Very good extensibility
– spatial joins, time series, UDF, xquery, etc.
• Current problems
– Better statistics : cost model for optimization
– Physical database design expensive & complex
• Some Trends
– interactiveness during execution
– approximate answers, top-k
– self-tuning capabilities (adaptive; robust; etc.)
Distributed Query Processing: Basics
• Idea:
Extension of centralized query processing. (System
R* et al. in 80s)
• What is different?
–
–
–
–
–
–
extend physical algebra: send&receive operators
other metrics : optimize for response time
resource vectors, network interconnect matrix
caching and replication
less predictability in cost model (adaptive algos)
heterogeneity in data formats and data models
Issues in Distributed Databases
• Plan enumeration
– The time and space complexity of traditional dynamic
programming algorithm is very large
– Iterative Dynamic Programming (heuristic for large
queries)
• Cost Models
– Classic Cost Model
– Response Time Model
– Economic Models
Distributed Query Plan
A.d
Forms
Of
Parallelism?
hashjoin
receive
receive
send
send
B.b
index A.c
B
Cost : Resource Utilization
1
Total Cost =
Sum of Cost of Ops
8
Cost = 40
1
6
1
6
2
5
10
Another Metric : Response Time
Total Cost = 40
first tuple = 25
last tuple = 33
25, 33
24, 32
Pipelined
parallelism
Independent
parallelism
0, 7
0, 24
0, 6
0, 18
0, 12
0, 5
0, 10
first tuple = 0
last tuple = 10
Query Execution Techniques for
Distributed Databases
•
•
•
•
•
•
Row Blocking
Multi-cast optimization
Multi-threaded execution
Joins with horizontal partitioning
Semi joins
Top n queries
Query Execution Techniques for DD
• Row Blocking –
– SEND and RECEIVE operators in query plan
to model communication
– Implemented by TCP/IP, UDP, etc.
– Ship tuples in block-wise fashion (batch);
smooth burstiness
Query Execution Techniques for DD
• Multi-cast Optimization
– Location of sending/receiving may affect
communication costs; forwarding versus multi-casting
• Multi-threaded execution
– Several threads for operators at the same site (intraquery parallelism)
– May be useful to enable concurrent reads for diverse
machines (while continuing query processing)
– Must consider if resources warrant concurrent operator
execution (say two sorts each needing all memory)
Query Execution Techniques for DD
• Joins with Data (horizontal) partitioning:
– Hash-based partitioning to conduct joins on independent partitions
• Semi Joins :
– Reduce communication costs; Send only “join keys” instead of
complete tuples to the site to extract relevant join partners
• Double-pipelined hash joins :
– Non-blocking join operators to deliver first results quickly; fully
exploit pipelined parallelism, and reduce overall response time
• Top n queries :
– Isloate top n tuples quickly and only perform other expensive
operations (like sort, join, etc) on those few (use “stop” operators)
Adaptive Algorithms
• Deal with unpredictable events at run time
– delays in arrival of data, burstiness of network
– autonomity of nodes, changes in policies
• Example: double pipelined hash joins
– build hash table for both input streams
– read inputs in separate threads
– good for bursty arrival of data
• Re-optimization at run time (LEO, etc.)
– monitor execution of query
– adjust estimates of cost model
– re-optimize if delta is too large
Special Techniques for Client-Server
Architectures
• Shipping techniques
– Query shipping
– Data shipping
– Hybrid shipping
• Query Optimization
– Site Selection
– Where to optimize
– Two Phase Optimization
Special Techniques for Federated
Database Systems
• Wrapper architecture
• Query optimization
– Query capabilities
– Cost estimation
• Calibration Approach
• Wrapper Cost Model
• Parameter Binding
Heterogeneity
•
•
•
•
Use Wrappers to “hide“ heterogeneity
Wrappers take care of data format, packaging
Wrappers map from local to global schema
Wrappers carry out caching
– connections, cursors, data, ...
• Wrappers map queries into local dialect
• Wrappers participate in query planning!!!
– define the subset of queries that can be handled
– give cost information, statistics
– “capability-based rewriting“
Middleware
• Two kinds of middleware
– data warehouses
– virtual integration
• Data Warehouses
–
–
–
–
good: query response times
good: materializes results of data cleaning
bad: high resource requirements in middleware
bad: staleness of data
• Virtual Integration
– the opposite
– caching possible to improve response times
Virtual Integration
Query
Middleware
(query decomposition, result composition)
wrapper
wrapper
sub
query
sub
query
DB1
DB2
IBM Data Joiner
SQL Query
Data Joiner
wrapper
wrapper
sub
query
sub
query
SQL DB1
SQL DB2
Adding XML
Query
XML Publishing
Middleware (SQL)
wrapper
wrapper
sub
query
sub
query
DB1
DB2
XML Data Integration
XML Query
Middleware (XML)
XML
query
XML
query
wrapper
wrapper
DB1
DB2
XML Data Integration
• Example: BEA Liquid Data
• Advantage
– Availability of XML wrappers for all major databases
• Problems
– XML - SQL mapping is very difficult
– XML is not always the right language
(e.g., decision support style queries)
Summary
• Middleware looks like a homogenous centralized
database
– location transparency
– data model transparency
• Middleware provides global schema
– data sources map local schemas to global schema
• Various kinds of middleware (SQL, XML)
• “Stacks“ of middleware possible
• Data cleaning requires special attention
Download