Advanced
Database Topics
Copyright © Ellis Cohen 2002-2005
Distributed Databases:
Organization &
Query Processing
These slides are licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them,
please see http://www.openlineconsult.com/db
Topics
Distributed Database Architecture
Location Transparency
Data Placement & Fragmentation
Distributed Query Processing
Distributed Databases
Distribution
Spreading data across multiple network
nodes
Partitioning & Fragmentation
Distribute tables divided into
vertical or horizontal parts
Replication
Replicating (parts of) tables across
multiple nodes
Why would we want to distribute or replicate data?
Distribution & Replication
Distribution
Integrate separate databases
Decrease network latency by locating data
near greatest demand
Locate data within
secure administrative boundaries
Parallel processing
Replication
Decreased network latency by placing
replicas near multiple high demand sites
High availability & reliability in face of failure
More parallel processing
Scalability (single copy no longer bottleneck)
Disconnected operation
Date's 12 Rules for
Integrated Distributed Systems
1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location transparency
5. Fragmentation transparency
6. Replication transparency
7. Distributed query processing
8. Distributed transaction management
9. Hardware independence
10. Operating system independence
11. Network independence
12. DBMS independence (transparent heterogeneity)
To the user, a distributed DB system
should look exactly like a non-distributed system
Design Issues for DDBMS's
Keep track of data names & locations
Decide what to fragment & replicate
Decide placement (allocation, distribution) of
objects, fragments & replicas
Devise strategies for executing transactions &
queries that access data from multiple sites
Manage distributed transactions,
including backup & recovery from
– individual site crashes
– communication link failures
Decide which copy/copies of replicated data
to access
Maintain consistency of replicated data copies
Distributed Database
Architectures
Distributed Database Architectures
Architectures
– Multi-Database Architecture
• Appears to user as separate databases
– TP Monitor / Application Server Architecture
• Separate server to handle transaction management
& other services (e.g. security)
– Federated Database Architecture
• Appears to user as a single database providing a
global schema integrating disparate DB's
– Collaborating Database Architecture
• A collection of peer databases, which interconnect
to one another, providing a global schema to users
who connect to an individual peer
Heterogeneity
Homogeneous: every site runs same type of DBMS
Heterogeneous: different sites run different DBMS's
(perhaps even non-relational ones)
Coordination
Coordination of a distributed transaction is
managed by a coordinator, which resides
at a single node
• Multi-Database Architecture
Client is the coordinator
• TP Monitor / Application Server
Architecture
TP Monitor / App Server is coordinator
• Federated Database Architecture
Federation Server is coordinator
• Collaborating Database Architecture
The peer connected to by the client is the
coordinator
Multi-Database Architecture
Client acts as coordinator
• Issues queries directly to multiple DB servers (subordinates)
• Integrates the results
• Handles distributed transaction management (as well as it can)
[Diagram: Client (coordinator) and two DB Servers (subordinates)]
Sub-query Distribution
Suppose a coordinator wants to execute the query
that lists the project managed by the highest paid
employee
SELECT * FROM Projs WHERE pmgr =
(SELECT empno FROM Emps
WHERE sal =
(SELECT max(sal) FROM Emps))
If subordinate S1 holds the Projs table, and
subordinate S2 holds the Emps table, then the
coordinator will request S2 to execute the sub-query
SELECT empno FROM Emps
WHERE sal =
(SELECT max(sal) FROM Emps)
The coordinator will get the result back (let's call it
result), and then request S1 to execute (and return
the results of) the sub-query
SELECT * FROM Projs WHERE pmgr = result
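The two-step exchange above can be sketched in Python. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for subordinates S1 and S2, the sample rows and the function name coordinator_query are invented, and exactly one employee is assumed to have the maximum salary.

```python
import sqlite3

# Two in-memory SQLite databases stand in for subordinates S1 and S2 (assumption).
s1 = sqlite3.connect(":memory:")  # holds Projs
s2 = sqlite3.connect(":memory:")  # holds Emps

s1.execute("CREATE TABLE Projs (pno INTEGER, pname TEXT, pmgr INTEGER)")
s1.executemany("INSERT INTO Projs VALUES (?, ?, ?)",
               [(1, "Apollo", 7001), (2, "Gemini", 7002)])
s2.execute("CREATE TABLE Emps (empno INTEGER, ename TEXT, sal INTEGER)")
s2.executemany("INSERT INTO Emps VALUES (?, ?, ?)",
               [(7001, "KING", 5000), (7002, "BLAKE", 2850)])

def coordinator_query():
    # Step 1: ask S2 for the empno of the highest paid employee.
    result = s2.execute(
        "SELECT empno FROM Emps WHERE sal = (SELECT max(sal) FROM Emps)"
    ).fetchone()[0]
    # Step 2: pass that value to S1 as a parameter of the second sub-query.
    return s1.execute(
        "SELECT * FROM Projs WHERE pmgr = ?", (result,)
    ).fetchall()

projects = coordinator_query()
```

Only the single empno value crosses between the two sites; neither table is shipped.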
Sub-transactions
Imagine a coordinator C has started a
transaction TC, and is executing a query as
part of TC.
– The coordinator divides the query up into
sub-queries, which it sends to various
subordinates.
– It labels each subquery with TC, the identity of the
main transaction.
When a subordinate S is passed a sub-query
– If it has not yet seen the label TC, it creates a local
transaction TS (called a sub-transaction), and
associates TS with TC.
– If it has seen TC before, it looks up the
corresponding TS.
In either case, S runs the sub-query as part of the
local sub-transaction TS
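The bookkeeping a subordinate does for TC and TS might look like the following sketch. The class name Subordinate, the label format and the in-memory log are all hypothetical; a real DB server would start an actual local transaction rather than record strings.

```python
import itertools

class Subordinate:
    """Hypothetical subordinate that maps coordinator transactions to local ones."""

    def __init__(self, name):
        self.name = name
        self._local_ids = itertools.count(1)
        self.sub_transactions = {}   # TC label -> local sub-transaction id TS
        self.log = []                # (TS, sub-query) pairs, in execution order

    def run_sub_query(self, tc, sub_query):
        # First time this subordinate sees TC: create a sub-transaction TS.
        if tc not in self.sub_transactions:
            self.sub_transactions[tc] = f"{self.name}-T{next(self._local_ids)}"
        ts = self.sub_transactions[tc]
        # In either case the sub-query runs as part of TS.
        self.log.append((ts, sub_query))
        return ts

s = Subordinate("S1")
first = s.run_sub_query("TC-42", "SELECT ...")
second = s.run_sub_query("TC-42", "UPDATE ...")
other = s.run_sub_query("TC-43", "SELECT ...")
```

Both sub-queries labelled TC-42 end up in the same local sub-transaction, while TC-43 gets its own.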
Transaction Manager
Client acts as coordinator
• Uses Transaction Manager to handle distributed transaction management
Client still
• Issues queries directly to multiple DB servers
• Integrates the results
APIs & protocols standardized by X/Open
[Diagram: Client with Transaction Manager, and sub-transactions at two DB Servers]
Distributed Transaction Management
Coordinator's transaction manager
communicates with each subordinate
(participating DB server)
Each subordinate manages its own
sub-transactions
– Reflects queries performed by that subordinate
on behalf of the parent transaction
– Enforces ACID requirements of the subordinate
– Enables independent recovery by each
subordinate
Provides distributed concurrency control to
ensure global serializability
Provides atomic commit protocol
to ensure global atomicity & durability
The Distributed Commit Problem
• A distributed transaction which executes
at multiple sites must either be
committed at all sites or aborted at all
sites
• Not acceptable for one sub-transaction to
commit and one abort.
• If the coordinator just sends a COMMIT
message to two subordinates S1 and S2
– S1 could get the COMMIT message and commit
– S2 could crash just before it gets the COMMIT
message (and before writing any local
sub-transaction state to stable storage), i.e.
S2's sub-transaction is aborted
• Obviously a more complicated protocol is
needed, which we will address later
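The failure scenario can be played out in a toy simulation. Everything here is an assumption made for illustration (the Site class, the crash flag, the state strings); it models only the naive one-shot COMMIT broadcast described above, not the more complicated protocol.

```python
class Site:
    """Hypothetical subordinate holding one sub-transaction."""

    def __init__(self, name, crashes_before_commit=False):
        self.name = name
        self.crashes_before_commit = crashes_before_commit
        self.state = "active"

    def receive_commit(self):
        if self.crashes_before_commit:
            # Nothing reached stable storage, so recovery rolls the work back.
            self.state = "aborted"
        else:
            self.state = "committed"

def naive_commit(sites):
    # One-shot COMMIT broadcast with no voting or logging phase.
    for site in sites:
        site.receive_commit()
    return {site.name: site.state for site in sites}

outcome = naive_commit([Site("S1"), Site("S2", crashes_before_commit=True)])
atomic = len(set(outcome.values())) == 1
```

Here outcome records S1 as committed and S2 as aborted, which is exactly the mixed result that global atomicity forbids.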
TP Monitor / Application Server
Client uses TP Monitor / App Server to execute transactions
Application Server may use load balancing to decide which
Application Server should coordinate the transaction
Transaction executing within App Server
• Makes direct calls to multiple DB servers
• Integrates the results
• Uses App Server's Transaction Mgr to handle distributed
transaction management
[Diagram: Clients, App Servers, and subqueries sent to three DB Servers]
Heterogeneity
Heterogeneous Databases
– Different data types
– Different SQL commands or syntax
– Different protocols
– Different embedded programming languages
– Different security mechanisms
(authentication & access control)
– Different concurrency mechanisms
Heterogeneous Data Models
– Different names
– Different values (esp. units)
– Different constraints & derived values
Heterogeneity Transparency
Non-Transparent:
Client must deal with some or all aspects of database
heterogeneity directly
Semi-Transparent:
Mapping layer hides most differences among databases
Coordinator may still be able to exploit differences
(e.g. pass-through SQL)
Transparent:
Mapping layer hides differences among databases and among
data models
[Diagram: Coordinator, Mapping Layer, DB Server]
Mapping Architecture
Coordinator may be
• Client
• App Server
• DB Server
Mapping layer may reside in
• Coordinator
• DB Server
• separate Gateway Server
[Diagram: Coordinator, Mapping Layer, DB Servers]
Federated Database Architecture
Federation Layer supports
• Transaction Management
• Heterogeneity Mapping Layer
• Global Schema supported by Distributed Query Processing
Federation Layer may be
• Software layer callable by client (i.e. extended
transaction manager)
• Provided by separate Federated DB Server (e.g.
extended TP Monitor)
• Integrated with DB server (i.e. Collaborating DB
Architecture)
[Diagram: Client / App Server sends transactions and queries to the Federation Layer, which sends sub-transactions and sub-queries to the DB Servers]
Collaborating Database Architecture
Client can connect to one of a set of DB Servers
• The client could itself be an App/DB/Gateway Server
Connecting DB Server
• Provides global schema
• May choose a different DB Server to coordinate the
transaction (e.g. based on load balancing, or the one
nearest the data)
Coordinating DB Server
• Handles distributed transaction management
• Handles distributed query management
DB Servers
• Appear homogeneous
• May themselves be Federated DBs or Gateway Servers
Collaborating DB servers generally communicate using a
private protocol
[Diagram: a Client and four DB Servers]
Location Transparency
Location Transparency
Requirements
1. DB objects must be able to reside and be
created at multiple sites in a system
2. Each DB object must be able to be uniquely
named by a transaction
3. The name for a DB object used by a transaction
must enable the object to be located efficiently
4. It must be possible to write transaction code
that will not need to be modified if either
• the transaction is executed at a different site
• the DB objects accessed are moved
Explicit Site Naming
SELECT * FROM scott.emp@hq.acme.com
If @ (as in Oracle) reflects the table's current
location, this does not support the key
transparency requirement.
However, if @ identifies the table's birth site, which
then holds the table's forwarding location (where
it is currently located, or which does further
forwarding), the transparency is retained.
Security considerations
In what security domain does the transaction run
on the remote machine?
What if the user currently running does not have
an account on the remote machine?
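The birth-site scheme described on this slide can be illustrated with a small lookup sketch. The catalogs dictionary, the "here" marker, the site names other than hq.acme.com and the locate function are hypothetical; the sketch shows only how a name that embeds the birth site can still resolve after the table moves.

```python
# Hypothetical catalog: each site records, for tables it knows about,
# either that it holds the table ("here") or the site to forward to.
catalogs = {
    "hq.acme.com":      {"scott.emp": "dallas.acme.com"},   # birth site
    "dallas.acme.com":  {"scott.emp": "chicago.acme.com"},  # moved again
    "chicago.acme.com": {"scott.emp": "here"},              # current location
}

def locate(table, birth_site, max_hops=10):
    """Follow forwarding entries from the birth site to the current site."""
    site = birth_site
    for _ in range(max_hops):
        entry = catalogs[site][table]
        if entry == "here":
            return site
        site = entry
    raise RuntimeError("forwarding chain too long or cyclic")

current = locate("scott.emp", "hq.acme.com")
```

The transaction keeps using scott.emp@hq.acme.com; only the catalog entries change when the table is moved, at the cost of extra hops during lookup.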
Synonyms
joe@boston> CREATE SYNONYM emp FOR
scott.emp@hq.acme.com
joe@boston> SELECT * FROM emp
Is emp a
– Local synonym [can only be used by joe?]
– Part of joe's schema?
dilip@boston> SELECT * FROM joe.emp
Even if synonyms are automatically
replicated on every machine,
there is no guarantee of location transparency,
because of naming conflicts
Location Transparency via
Global Directory Management
Design a global directory hierarchy
Provides a separate naming scope
for storing synonyms
joe@boston> CREATE PUBLIC GLOBAL DIRECTORY /stuff
joe@boston> CREATE PUBLIC DIRECTORY /stuff/empinfo
// invented syntax
joe@boston> CREATE PUBLIC GLOBAL SYNONYM
/stuff/empinfo/emp FOR scott.emp@hq.acme.com
sam@podunk> SELECT * FROM /stuff/empinfo/emp
Where is the global directory stored?
– Centralized directory manager (name server)
susceptible to bottlenecks and failures
– Needs to be replicated
Data Placement &
Fragmentation
Data Placement
Company HQ in Des Moines
Warehouses in SF, NY, Denver
SfCust( custid, addr )
NyCust( custid, addr )
DenverCust( custid, addr )
A. Place all 3 in Des Moines
B. Place SfCust in SF
NyCust in NY
DenverCust in Denver
C. Place SfCust in SF & Des Moines
NyCust in NY & Des Moines
DenverCust in Denver & Des Moines
How would you decide?
Data Fragmentation
Why would you choose
one or another of these
approaches?
Horizontal Fragmentation
• Each fragment is a subset of
rows
• Rows do not overlap (else
doing partial replication)
• Reconstruction by union
• Updates may require tuple
migration
Vertical Fragmentation
• Each fragment is a subset of
columns
• All fragments include
primary key columns or
share ROWIDs
• Reconstruction by join
• Updates do not require tuple
migration
Rules for Data Fragmentation
Completeness
All the data of the global relation must be
mapped to the fragments
Reconstruction
It must always be possible to reconstruct
each global relation from its fragments
Disjointness
If fragments are disjoint, then decisions
about replication of data can be made
somewhat separately from decisions
about fragmentation
Horizontal Fragmentation
Create:
CREATE TABLE emp ( … ) PARTITION (
scott.emp10@hq.acme.com WHERE deptno = 10,
scott.emp20@dallas.acme.com WHERE deptno = 20,
scott.emp30@chicago.acme.com WHERE deptno = 30,
scott.otheremp@hq.acme.com OTHERWISE)
// invented syntax loosely based on Oracle
The predicates defining all the fragments should
be complete and mutually exclusive (or else there
is replication)
Reconstruct:
SELECT * FROM scott.emp10@hq.acme.com UNION
SELECT * FROM scott.emp20@dallas.acme.com UNION
SELECT * FROM scott.emp30@chicago.acme.com UNION
SELECT * FROM scott.otheremp@hq.acme.com
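Horizontal fragmentation and reconstruction by union can be tried out locally. In this sketch a single in-memory SQLite database stands in for all the sites, the sample rows are invented, and the OTHERWISE fragment is written out as an explicit predicate. UNION ALL is used in place of UNION, which is safe only because the fragments are disjoint.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # one database stands in for all sites (assumption)
db.execute("CREATE TABLE emp (empno INTEGER PRIMARY KEY, ename TEXT, deptno INTEGER)")
db.executemany("INSERT INTO emp VALUES (?, ?, ?)",
               [(1, "ADAMS", 10), (2, "BLAKE", 20), (3, "CLARK", 30),
                (4, "DAVIS", 40), (5, "EVANS", 20)])

# Fragment name -> defining predicate; the last one plays the role of OTHERWISE.
fragments = {
    "emp10": "deptno = 10",
    "emp20": "deptno = 20",
    "emp30": "deptno = 30",
    "otheremp": "deptno IS NULL OR deptno NOT IN (10, 20, 30)",
}
for name, predicate in fragments.items():
    db.execute(f"CREATE TABLE {name} AS SELECT * FROM emp WHERE {predicate}")

# Reconstruct by union. UNION ALL is safe only because fragments are disjoint.
union_sql = " UNION ALL ".join(f"SELECT * FROM {name}" for name in fragments)
rebuilt = sorted(db.execute(union_sql).fetchall())
original = sorted(db.execute("SELECT * FROM emp").fetchall())
```

Comparing rebuilt with original checks both properties at once: a missing row would mean the predicates are incomplete, and a duplicated row would mean they overlap.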
Fragmentation Transparency
SELECT ename, job FROM emp
WHERE sal > 50000
Implement as
SELECT ename, job FROM
scott.emp10@hq.acme.com
WHERE sal > 50000
UNION
SELECT ename, job FROM
scott.emp20@dallas.acme.com
WHERE sal > 50000
UNION
SELECT ename, job FROM
scott.emp30@chicago.acme.com
WHERE sal > 50000
UNION
SELECT ename, job FROM
scott.otheremp@hq.acme.com
WHERE sal > 50000
Integrate decomposed queries via union
Fragmentation Transparency for Updates
UPDATE emp SET deptno = 30
WHERE empno = 6749;
// assumes you know deptno is currently 20;
// much more complicated otherwise
Implementing this update requires
tuple migration
Implement as
SELECT * INTO anEmp
FROM scott.emp20@dallas.acme.com
WHERE empno = 6749;
DELETE FROM scott.emp20@dallas.acme.com
WHERE empno = 6749;
INSERT INTO scott.emp30@chicago.acme.com
VALUES ( 6749, anEmp.ename, anEmp.job,
anEmp.mgr, anEmp.hiredate, anEmp.sal,
anEmp.comm, 30 );
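The migration steps can be exercised with SQLite. This is a sketch under simplifying assumptions: both fragments live in one local database (so an ordinary local transaction is enough to make the move atomic), the table has only three columns, and the helper name migrate is invented. In the distributed case the DELETE and INSERT run at different sites and need a distributed transaction.

```python
import sqlite3

db = sqlite3.connect(":memory:")
for frag in ("emp20", "emp30"):
    db.execute(f"CREATE TABLE {frag} (empno INTEGER PRIMARY KEY, ename TEXT, deptno INTEGER)")
db.execute("INSERT INTO emp20 VALUES (6749, 'SMITH', 20)")

def migrate(empno, source, target, new_deptno):
    """Move one row between fragments, atomically within this single database."""
    with db:  # commits on success, rolls back if any statement fails
        row = db.execute(
            f"SELECT empno, ename FROM {source} WHERE empno = ?", (empno,)
        ).fetchone()
        if row is None:
            raise LookupError(f"empno {empno} not found in {source}")
        db.execute(f"DELETE FROM {source} WHERE empno = ?", (empno,))
        db.execute(f"INSERT INTO {target} VALUES (?, ?, ?)",
                   (row[0], row[1], new_deptno))

migrate(6749, "emp20", "emp30", 30)
left_behind = db.execute("SELECT count(*) FROM emp20").fetchone()[0]
moved = db.execute("SELECT * FROM emp30").fetchall()
```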
Vertical Fragmentation
Create:
CREATE TABLE emp ( empno int primary key, … )
PARTITION (
ename, job, mgr, deptno AS
scott.empinfo@boston.acme.com,
hiredate AS scott.emphr@hq.acme.com,
sal, comm AS scott.empacct@hq.acme.com)
// invented syntax loosely based on Oracle
The columns defining all the fragments should be
complete and mutually exclusive. All fragments
automatically include the primary key empno
to match up rows (or use some other
mechanism, such as matching ROWIDs)
Reconstruct:
SELECT empno, i.ename, i.job, i.mgr, h.hiredate,
a.sal, a.comm, i.deptno
FROM scott.empinfo@boston.acme.com i
NATURAL JOIN scott.emphr@hq.acme.com h
NATURAL JOIN scott.empacct@hq.acme.com a
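Vertical fragmentation and reconstruction by join can be checked the same way. In this sketch one in-memory SQLite database stands in for the separate sites, the emp table is cut down to four columns, and the sample rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE emp (empno INTEGER PRIMARY KEY, ename TEXT,
              hiredate TEXT, sal INTEGER)""")
db.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)",
               [(1, "ADAMS", "2001-05-01", 3000), (2, "BLAKE", "1999-02-11", 9000)])

# Each vertical fragment keeps the primary key plus a disjoint set of columns.
db.execute("CREATE TABLE empinfo AS SELECT empno, ename FROM emp")
db.execute("CREATE TABLE emphr AS SELECT empno, hiredate FROM emp")
db.execute("CREATE TABLE empacct AS SELECT empno, sal FROM emp")

# Reconstruct by joining the fragments on the shared key.
rebuilt = db.execute("""
    SELECT empno, ename, hiredate, sal
    FROM empinfo NATURAL JOIN emphr NATURAL JOIN empacct
    ORDER BY empno
""").fetchall()
original = db.execute("SELECT * FROM emp ORDER BY empno").fetchall()
```

The natural join works here because empno is the only column the fragments share.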
Hybrid Fragmentation
CREATE TABLE emp ( empno int primary key, … )
PARTITION (
ename, job, mgr, deptno AS (
scott.emp10@hq.acme.com where deptno = 10,
scott.emp20@dallas.acme.com where deptno = 20,
scott.emp30@chicago.acme.com where deptno = 30,
scott.otheremp@hq.acme.com otherwise ),
hiredate AS scott.emphr@hq.acme.com,
sal, comm AS scott.empacct@hq.acme.com)
// invented syntax loosely based on Oracle
Data Placement Revisited
Company HQ in Des Moines
Warehouses in SF, NY, Denver
Cust( custid, addr, whse )
whse is 'SF', 'NY', or 'Denver'
A. Place Cust at Des Moines
B. Partition Cust by whse
SfCust@SF
NyCust@NY
DenverCust@Denver
C. Leave Cust at Des Moines and also
partition as SfCust@SF, NyCust@NY
& DenverCust@Denver
How would you decide?
Database Design Problem
Hard Optimization Problem
(even w/o considering replication)
– Fragmentation: How to fragment tables
– Allocation/Placement:
Where to place tables and fragments
Relative to minimizing/maximizing some
cost function - e.g.
– minimize query response time
– maximize throughput
– must be approximate, since determining actual
query plan is a separate optimization problem
Subject to constraints - e.g.
– Available storage, bandwidth, processing
power, …
– Keep 90% of response time below X
Optimization Approach
Factors to Consider
The originating site(s) of queries/updates
Which attributes are accessed together
Which attributes & combinations of selection
predicates are used from which sites, with
which frequencies
Frequencies of updates that affect combinations
of selection predicates
Data integration costs (costs of joins and unions
for fragments) vs increase in parallelism
Costs of communication, concurrency control,
security & integrity maintenance
Distributed Query
Processing
Distributed Query Processing
Query processing
Based on algorithms that analyze queries
and convert them into a series of data
manipulation operations.
The problem
Deciding a strategy for executing each
query over the network in the most
cost effective way, however the cost is
defined.
Main factors
I/O, CPU, Communication costs
Opportunity for pipelining & parallel
operations
Distributed Query Example
[Diagram: proj at S2, dept at S3, emp at S1]
Given tables
emp( empno, ename, deptno, sal, … )
at site S1 (largest)
project( pno, pname, mgr, … )
at site S2
dept( deptno, dname, loc )
at site S3 (smallest)
Sub-Querying & Shipping
Queries are executed via a combination of
computing queries and shipping data.
For example, suppose we want to execute a
query to find out the name of each
project, along with its project manager &
the name of that manager's department
SELECT pname, ename, dname
FROM project p, emp e, dept d
WHERE p.mgr = e.empno
AND e.deptno = d.deptno
Consider Cost-based Alternatives
Alternative 1
Ship dept & project to S1
Process query at S1
Alternative 2
Ship emp & project to S3
Process query at S3
Which one is better?
Evaluating Alternatives
Alternative 1
Ship dept & project to S1
Process query at S1
Alternative 2
Ship emp & project to S3
Process query at S3
In general, alternative #1 is better, because it
involves shipping less information
But to really determine the best
approach, you must consider
– Communication costs to S1 vs S3
(what if slow line between S2 & S1)
– Relative processing speeds and scheduling
algorithms at S1 vs S3
– Size of result & location of coordinator
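The "shipping less information" argument can be made concrete with some arithmetic. The byte sizes below are invented for illustration (the slides only say that emp is the largest table and dept the smallest), and the cost counts only the volume shipped, ignoring line speed, processing speed and result size.

```python
# Assumed table sizes in bytes; the slides only say emp is largest, dept smallest.
table_site = {"emp": "S1", "project": "S2", "dept": "S3"}
table_bytes = {"emp": 50_000_000, "project": 5_000_000, "dept": 100_000}

def bytes_shipped(processing_site):
    """Total size of the tables that must be shipped to the processing site."""
    return sum(size for table, size in table_bytes.items()
               if table_site[table] != processing_site)

costs = {site: bytes_shipped(site) for site in ("S1", "S2", "S3")}
best = min(costs, key=costs.get)
```

With these numbers, processing at S1 ships 5,100,000 bytes while processing at S3 ships 55,000,000, so alternative 1 wins on volume alone. A slow link into S1 could still reverse the decision, which is why the other factors matter.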
Intermixing Querying & Shipping
Rather than shipping base tables and
performing a single query, it may make
sense to
– do a query at one site
– ship the query results to another site
– do a query at that site joining the results
received with data available at that site
In general, a distributed query plan
involves a (potentially lengthy)
sequence of performing queries and
shipping data (either base tables or
query results)
Distributed Query Planning Example
For example, suppose we are only interested in
projects where the project manager makes more
than 8000/month. For those projects, we want
the name of the project, the name of the project
manager & the name of that manager's
department.
Process
SELECT pname, ename, dname
FROM project p, emp e, dept d
WHERE p.mgr = e.empno
AND e.deptno = d.deptno
AND e.sal > 8000
If there are not very many employees who make > 8000,
what's the best plan for executing this query?
Restrict before Ship
At S1, COMPUTE emplet AS
SELECT empno, ename, deptno FROM emp
WHERE sal > 8000
SHIP emplet FROM S1 TO S2
SHIP dept FROM S3 TO S2
At S2, COMPUTE
SELECT ename, dname, pname
FROM emplet e, dept d, project p
WHERE p.mgr = e.empno
AND e.deptno = d.deptno
Semijoins
A semijoin is
• a join between two (or more) tables where
• one of the tables is just used to restrict the result,
but not provide any data
Example
List the names of employees whose departments
are located in NY
SELECT e.ename FROM emp e, dept d
WHERE e.deptno = d.deptno
AND d.loc = 'NY'
All the result data comes from the emp table
The dept table is joined with emp simply to
restrict the tuples chosen from the emp table
Using Semijoins in Distributed Queries
1) Some data (generally the result of a query) is shipped
from site Sa to site Sb
2) The shipped data is used in a semijoin with the data at Sb.
This produces a subset of the data at Sb, restricted based
on the data shipped from Sa
3) The result of the semijoin is shipped back to Sa, where it
is combined with data already there
[Diagram: sites Sa (data Da) and Sb (data Db), with steps 1, 2 and 3 marked]
If S1 is the coordinator (where the results must end up),
how can semijoins be used to produce a more efficient
solution to the project manager query?
Using Semijoins
1) At S1, COMPUTE emplet AS
SELECT empno, ename, deptno
FROM emp
WHERE sal > 8000
At S1, COMPUTE empl AS
SELECT empno FROM emplet
2) SHIP empl FROM S1 TO S2
At S2, COMPUTE emproj AS
SELECT mgr AS pmgr, pname
FROM project, empl
WHERE mgr = empno
ORDER BY pmgr
SHIP emproj FROM S2 TO S1
Shipping empl to S2 limits the tuples
from project to be sent back to S1
3) At S3, COMPUTE deptlet AS
SELECT deptno, dname
FROM dept
SHIP deptlet FROM S3 TO S1
4) At S1, COMPUTE
SELECT pname, ename, dname
FROM emplet e, deptlet d,
emproj p
WHERE e.deptno = d.deptno
AND e.empno = p.pmgr
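The semijoin plan can be run end to end in a small simulation. Everything here is illustrative: three in-memory SQLite databases stand in for S1, S2 and S3, the sample rows are invented, and the site and ship helpers and the rows_shipped counter are hypothetical stand-ins for the COMPUTE and SHIP steps.

```python
import sqlite3

def site(schema, rows, table):
    """Create an in-memory database holding one table (stands in for a site)."""
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE {table} ({schema})")
    db.executemany(f"INSERT INTO {table} VALUES ({','.join('?' * len(rows[0]))})", rows)
    return db

s1 = site("empno, ename, deptno, sal", [(1, "ADAMS", 10, 9000), (2, "BLAKE", 20, 3000),
                                        (3, "CLARK", 20, 8500)], "emp")
s2 = site("pno, pname, mgr", [(100, "Apollo", 1), (101, "Gemini", 2),
                              (102, "Mercury", 3)], "project")
s3 = site("deptno, dname, loc", [(10, "SALES", "NY"), (20, "OPS", "SF")], "dept")

stats = {"rows_shipped": 0}  # running count of rows sent between sites

def ship(rows, dest, table, schema):
    """Copy rows into a new table at the destination site."""
    stats["rows_shipped"] += len(rows)
    dest.execute(f"CREATE TABLE {table} ({schema})")
    if rows:
        dest.executemany(
            f"INSERT INTO {table} VALUES ({','.join('?' * len(rows[0]))})", rows)

# 1. Restrict at S1 and keep only the join column for shipping.
s1.execute("CREATE TABLE emplet AS SELECT empno, ename, deptno FROM emp WHERE sal > 8000")
empl = s1.execute("SELECT empno FROM emplet").fetchall()
# 2. Semijoin at S2: only projects managed by a shipped empno come back.
ship(empl, s2, "empl", "empno")
emproj = s2.execute(
    "SELECT mgr, pname FROM project, empl WHERE mgr = empno").fetchall()
ship(emproj, s1, "emproj", "pmgr, pname")
# 3. Ship the projected dept rows from S3 to S1.
ship(s3.execute("SELECT deptno, dname FROM dept").fetchall(), s1, "deptlet", "deptno, dname")
# 4. Final join at S1.
result = sorted(s1.execute("""
    SELECT pname, ename, dname FROM emplet e, deptlet d, emproj p
    WHERE e.deptno = d.deptno AND e.empno = p.pmgr""").fetchall())
```

Only the project rows whose manager earns more than 8000 travel back to S1; the Gemini row never leaves S2.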
Planning Alternatives
Result-Based or Stream-Based
– Result-Based: A site waits until it
receives the entire result set shipped to
it before it can use it in a query
– Stream-Based: A query at a site will use
data streamed to it as it arrives from
another site (also called pipelining)
Sequential or Parallel
– Sequential: A site ships data to (or
requests data) from one other site at a
time
– Parallel: A site can ship data to (or
request data from) multiple sites in
parallel
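The difference between result-based and stream-based processing shows up in the order of events. In this sketch a Python generator plays the role of rows arriving from another site; the function names and the trace list are invented for illustration.

```python
def remote_rows(rows, trace):
    """Generator standing in for rows arriving one at a time from another site."""
    for row in rows:
        trace.append(f"recv {row}")
        yield row

def use(row, trace):
    """Stand-in for the local query work done on one received row."""
    trace.append(f"use {row}")
    return row * 10

def result_based(rows):
    trace = []
    received = list(remote_rows(rows, trace))   # wait for the entire result set
    return [use(r, trace) for r in received], trace

def stream_based(rows):
    trace = []
    # Each row is consumed as soon as it arrives (pipelining).
    return [use(r, trace) for r in remote_rows(rows, trace)], trace

batch_out, batch_trace = result_based([1, 2])
stream_out, stream_trace = stream_based([1, 2])
```

Both variants produce the same output, but the result-based trace shows every receive before any use, while the stream-based trace interleaves them, so the first result is available sooner.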
Streaming & Pipelining
At S3, COMPUTE deptlet AS
SELECT deptno, dname
FROM dept
SHIP deptlet FROM S3 TO S1
At S1, COMPUTE empdept AS
SELECT empno, ename, dname
FROM emp, deptlet
WHERE emp.deptno = deptlet.deptno
AND sal > 8000
ORDER BY empno
STREAM empdept FROM S1 TO S2
At S2, COMPUTE
SELECT p.pname, ed.ename, ed.dname
FROM project p, empdept ed
WHERE p.mgr = ed.empno
When would this approach be useful?
Parallelism & Streaming
At S1, COMPUTE empl AS
SELECT empno
FROM emp
WHERE sal > 8000
ORDER BY empno
At S1, COMPUTE dempl AS
SELECT DISTINCT deptno
FROM emp
WHERE sal > 8000
Do in parallel
– STREAM empl FROM S1 TO S2
At S2, COMPUTE eproj AS
SELECT mgr AS pmgr, pname
FROM project p, empl e
WHERE e.empno = p.mgr
STREAM eproj FROM S2 TO S1
– STREAM dempl FROM S1 TO S3
At S3, COMPUTE deptlet AS
SELECT d.deptno, dname
FROM dept d, dempl e
WHERE d.deptno = e.deptno
STREAM deptlet FROM S3 TO S1
At S1, COMPUTE
SELECT pname, ename, dname
FROM emp e, eproj p, deptlet d
WHERE e.deptno = d.deptno
AND e.empno = p.pmgr
What's Best
Informally, we've talked about
how query planning finds the
best way to process the query,
involving
• subqueries
• shipping/streaming
• parallel execution
But when we say "best",
what do we actually mean?
Possible Query Plan Goals
Fastest complete result
Fastest first result
Minimize resource usage
of specific resources
Combination of the above
Query Optimization
Build initial tree for query
– Build tree reflecting relational algebra
corresponding to query
– Modify tree to account for fragmentation (more
complex if distributed fragments overlap)
– Incorporate simplest ship operations into tree for
accessing remote data
Perform global query optimization
– Apply transformation operators that produce an
equivalent tree
– Account for pipelining & parallelism as well
– Use heuristic search algorithm (e.g. hill climbing,
simulated annealing, genetic algorithms) to find
best distributed query plan considering replicas
– Use cost function incorporating time taken by
I/O, CPU & communication (best if statistics on
size of relations & result sets are maintained)
Global vs Local Query Optimization
Global Optimization produces
– A set of decomposed queries to be sent
to various DB servers
– Combined with ship/stream
instructions
– All placed in a parallel/sequential
control flow graph
Local Optimization
– Each local server determines best way
to execute each decomposed query
sent to it (though global optimization
may generate preliminary plans)