Distributed Database Systems

R* Optimizer Validation and Performance
Evaluation for Local Queries
Lothar F. Mackert, Guy M. Lohman

R* Optimizer Validation and Performance
Evaluation for Distributed Queries
Lothar F. Mackert, Guy M. Lohman

R*: An Overview of the Architecture
R. Williams, D. Daniels, L. Haas, G. Lapis, B.
Lindsay, P. Ng, R. Obermarck, P. Selinger, A.
Walker, P. Wilms, and R. Yost
What is System R? R*?

System R is a database system built as
a research project at IBM San Jose
Research (now IBM Almaden Research
Center) in the 1970s. System R
introduced the SQL language and also
demonstrated that a relational system
could provide good transaction
processing performance.
R* basic facts

Each site is as autonomous as possible.


No central scheduler, no central deadlock
detection, no central catalog, etc.
R* supports snapshot data – a "copy of a
relation(s) in which the data is
consistent but not necessarily up to
date" – used to provide a static copy of
the database.
Object naming in R*


For site autonomy, no global naming
authority is required.
To keep object names unique, the site
name is incorporated into names, called
System Wide Names – SWN.

EX. USER@USER_SITE.OBJECT_NAME@BIRTH_SITE
Global Catalogs in R*



All global table names are stored at all sites.
Creating a global table involves
broadcasting the global relation name to all
sites in the network.
Catalogs at each site keep and maintain
information about objects in the database,
including any replicas or fragments, stored
at that site.
Transaction Numbering


A transaction is given a number
composed of the unique site name and
a unique sequence number from that
site; the sequence incorporates the time
of day at that site, so no synchronization
between sites is needed.
The transaction number is both unique
and ordered in the R* framework.
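The numbering scheme above can be sketched as follows. This is a hypothetical illustration, not R*'s actual code; the class and site names are invented, and a simple counter stands in for the time-of-day clock.

```python
import itertools

# Sketch of R*-style transaction numbering: each site pairs a locally
# increasing sequence number (standing in for a time-of-day value) with
# its unique site name, so IDs are globally unique and totally ordered
# without any cross-site synchronization.
class TxnNumberer:
    def __init__(self, site_name):
        self.site = site_name
        self._seq = itertools.count(1)  # placeholder for a local clock

    def next_txn(self):
        # Tuples compare lexicographically: sequence first, then site
        # name, which breaks ties between transactions at different sites.
        return (next(self._seq), self.site)

a, b = TxnNumberer("SJ"), TxnNumberer("YKT")
t1, t2, t3 = a.next_txn(), b.next_txn(), a.next_txn()
assert t1 < t3 and len({t1, t2, t3}) == 3  # unique and ordered
```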
Transaction Numbering (cont)



Numbers are used in deadlock detection.
Uniqueness identifies which transactions
hold which locks, avoiding the case where
a transaction appears to be waiting for
itself.
In case of a deadlock, R* aborts the
youngest (largest-numbered) transaction.
Transaction commit


Termination must be uniform – all sites
commit or all sites abort.
Two-phase commit protocol:

One site acts as coordinator – it makes the
commit or abort decision after all sites
involved in the transaction are known to be
recoverably prepared to commit or abort
and are awaiting the coordinator's
decision.
Transaction commit (cont)



While the non-coordinator sites await the
coordinator's decision, all locks are held –
the transaction's resources are sequestered.
Before entering the prepared state, any site
can abort the transaction – the other sites
will abort after a transaction timeout. After
entering the prepared state, a site may not
abandon the transaction.
3(N-1) messages are needed for a successful
commit, 4(N-1) messages if the transaction
must abort.
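A minimal way to see where the message counts come from is to count rounds between the coordinator and the N-1 participants. The per-round breakdown below is an assumption chosen to match the slide's totals (prepare, vote, and decision rounds for a commit, plus an extra acknowledgement round on abort), not a transcription of R*'s actual protocol code.

```python
# Sketch of two-phase commit message counting, with one coordinator
# and N-1 participant sites. Round breakdown is an assumption that
# reproduces the slide's 3(N-1) commit / 4(N-1) abort totals.
def two_phase_commit_messages(n_sites, all_vote_yes):
    participants = n_sites - 1
    msgs = participants          # coordinator -> participants: PREPARE
    msgs += participants         # participants -> coordinator: votes
    msgs += participants         # coordinator -> participants: COMMIT/ABORT
    if not all_vote_yes:
        msgs += participants     # participants -> coordinator: abort acks
    return msgs

assert two_phase_commit_messages(4, True) == 9    # 3(N-1), N = 4
assert two_phase_commit_messages(4, False) == 12  # 4(N-1), N = 4
```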
Authorization


Sites cooperate in R* voluntarily, and no
site is required to trust the others with
authorization decisions.
Each site must itself check remote
access requests, and all controls
governing access to data are stored at
the same site as that data.
Compilation, Plan Generation



R* compiles rather than interprets the
query language.
Recompilation may be needed if
objects in the database change – e.g.,
a table is deleted.
Recompilation is done at the local level,
with a commit process similar to a
transaction's.
Binding in compilation

When/where should binding occur?

Could all binding for every request be done
at a single chosen site? No – that creates a
bottleneck site.
Could all binding be done at the site where
the request originated? No – the compiling
site should not need to know physical details
about access paths at remote sites.
Could binding be done in a distributed way?
Yes – the requesting site decides the high-level
details and leaves the minor, low-level
details to the other sites.
Deadlock detection


No centralized deadlock detection.
Each site periodically performs deadlock
detection using transaction wait-for
information gathered locally or received
from other sites. Wait-for strings are sent
from one site to the next. If a site finds a
cycle, the youngest transaction is aborted.
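The local detection step can be sketched as a cycle search over a merged wait-for graph. This is an illustrative simplification (each transaction waits on at most one other here; function and variable names are invented), not R*'s actual algorithm.

```python
# Sketch of local deadlock detection over a wait-for graph. A site merges
# locally observed edges with wait-for strings received from other sites,
# then searches for a cycle; the youngest (largest-numbered) transaction
# on the cycle is chosen as the abort victim.
def find_deadlock_victim(wait_for):
    """wait_for maps a transaction number to the one it waits on."""
    for start in wait_for:
        seen, t = [], start
        while t in wait_for:
            if t in seen:                 # found a cycle
                cycle = seen[seen.index(t):]
                return max(cycle)         # youngest = largest number
            seen.append(t)
            t = wait_for[t]
    return None

# T1 waits on T2, T2 on T3, T3 back on T1 -> deadlock; abort T3.
assert find_deadlock_victim({1: 2, 2: 3, 3: 1, 4: 2}) == 3
assert find_deadlock_victim({1: 2, 2: 3}) is None  # no cycle
```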
Deadlock Example
Changes Made to R*



Explain – writes optimizer details,
such as estimated cost, to temporary tables.
Collect Counters – dumps internal
system variables to temporary tables.
Force Optimizer – orders the optimizer to
choose a particular (perhaps
suboptimal) plan.
New SQL instructions

EXPLAIN PLAN FOR
<any valid Delete, Insert, Select,
Select Into, Update, Values, or
Values Into SQL Statement>
R* Cost Structure



Cost = W_CPU · (#instructions) + W_I/O · (#I/Os)
     + W_MSG · (#messages) + W_BYT · (#bytes)

W_MSG is a constant per-message penalty for
sending a communication over the network.
W_BYT is a per-byte penalty for message length.
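The cost model transcribes directly into a weighted sum. The weight values below are arbitrary placeholders for illustration, not figures from the R* papers.

```python
# Direct transcription of the R* cost model. The default weights are
# hypothetical placeholders; a real system would calibrate them.
def rstar_cost(n_instr, n_ios, n_msgs, n_bytes,
               w_cpu=0.001, w_io=1.0, w_msg=5.0, w_byt=0.01):
    return (w_cpu * n_instr + w_io * n_ios +
            w_msg * n_msgs + w_byt * n_bytes)

# With any per-message penalty, a chatty plan (many small messages)
# costs more than one shipping the same bytes in fewer messages.
chatty = rstar_cost(0, 0, n_msgs=1000, n_bytes=10_000)
batched = rstar_cost(0, 0, n_msgs=10, n_bytes=10_000)
assert chatty > batched
```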
Is CPU cost significant?


Both the local and the distributed studies
found CPU cost to be significant.
CPU costs are high in sorts:

1. Allocating temporary disk space for partially
sorted strings
2. Encoding and decoding sorted columns
3. Quicksorting individual pages in memory
CPU costs continued


CPU costs are also high in scans
Although “CPU cost is significant . . . [it]
is not enough to affect the choice of the
optimizer”
CPU Cost Equation

CPUsort = ACQ_TEMP
        + #SORT × CODINGINST
        + #PAGES × QUICKSORTINST
        + #PASS × (ACQ_TEMP + #PAGES × IO_INST
                   + #SORT × NWAY × MERGINST)
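The equation above can be evaluated directly. The constants here (instructions per operation) are hypothetical placeholders, not measured values from the paper; only the structure of the formula comes from the slide.

```python
# The sort CPU-cost formula, transcribed term by term. All constant
# values are invented placeholders for illustration.
def cpu_sort_cost(n_sort, n_pages, n_pass, nway,
                  ACQ_TEMP=1000, CODINGINST=50,
                  QUICKSORTINST=2000, IO_INST=300, MERGINST=40):
    return (ACQ_TEMP
            + n_sort * CODINGINST          # encode/decode sort columns
            + n_pages * QUICKSORTINST      # quicksort individual pages
            + n_pass * (ACQ_TEMP           # per merge pass:
                        + n_pages * IO_INST
                        + n_sort * nway * MERGINST))

assert cpu_sort_cost(n_sort=100, n_pages=10, n_pass=2, nway=4) == 66000
```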
Improve Local Joins

Communicate the type of join:

Are there likely to be runs of pages to
prefetch?
Can the choice of LRU, MRU, or DBMIN
improve performance if the join type is known?
Optimizer performance (Local)


The optimizer has trouble modeling unclustered
indexes on smaller tables. In such cases, the
optimizer actually picks the worst plan thinking
it is the best, and ranks the best plan as the
worst.
Why?


Tuple order is unimportant to a nested-loop
join, and an index on the outer table may
clutter the buffer.
A highly selective predicate may eliminate the
need for a scan, in which case the index is
important.
Optimizer (Local) cont.


Adding an index can increase the estimated
cost – the optimizer models each table
independently, ignoring competition for buffer
space between the two tables being joined.
Nested-loop join estimates are often artificially
inflated – the optimizer pessimistically models
worst-case buffer behavior, assuming each
outer tuple starts a new inner scan when, many
times, the inner pages are already in the buffer.
Distributed Joins

Simplest kind – single-table access at a
remote site:

A process at the remote site accesses the
table and ships back the result.
When doing joins, one option is to ship the
smaller of the two tables.
Another is to ship the outer, to take
advantage of indexes on the inner.
Tuple Blocking



Faster response time can be obtained with
tuple blocking.
Tuples "stream in" from the query.
The cost is more message overhead – one
message per tuple block instead of one
message for the entire result.
Transfer Trade-offs

Option W – transfer the whole inner table.

Negatives:

No indexes can be shipped with it.
May ship inner tuples that have no matching
outer tuples.
Positives:

Predicates applied before shipping may
reduce the size.
Only one transfer is needed for the join,
which may result in lower overall network
costs.
Transfer Trade-offs Cont.

Option F – fetch matching tuples only.

Idea – send the outer tuples' join-column
values, match them against the inner tuples,
then send those matching inner tuples over
the network.
Negatives:

Multiple rounds of sending congest the
network.
May end up sending the whole inner table
anyway – in which case W is better.
Positives:

The tables may have few actual join values in
common, eliminating the need to send many
inner tuples.
Use W or F Option?


In W – Network costs only 2.9% of total
Strategy F handy when:

Cardinality of outer table <0.5 the # of messages
required to ship the inner table as a whole.


Idea behind rule – beat plan W in theory by sending few
join values from outer table that will weed out most
inner tuples
The join cardinality < inner cardinality

Idea behind 2nd rule – since most inner tuples will not be
needed, we can beat plan W by sending only outer join
values and eliminating most inner tuples.
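The two conditions above can be sketched as a decision helper. This reads the slide's two bullets as a conjunction, which is an assumption; the function name and parameters are invented stand-ins for quantities the optimizer would estimate.

```python
# Sketch of the slide's heuristic for preferring strategy F (fetch
# matches) over W (ship whole inner table). Treating the two rules as
# a conjunction is an assumption; all parameters are optimizer estimates.
def prefer_fetch_matches(outer_card, inner_card, join_card,
                         inner_msgs_to_ship_whole):
    # Rule 1: the outer is small relative to the cost of shipping
    # the whole inner table.
    few_outer = outer_card < 0.5 * inner_msgs_to_ship_whole
    # Rule 2: most inner tuples would not be needed.
    selective = join_card < inner_card
    return few_outer and selective

# A tiny outer with a selective join favors F...
assert prefer_fetch_matches(outer_card=10, inner_card=10_000,
                            join_card=50, inner_msgs_to_ship_whole=400)
# ...but a large outer falls back to W.
assert not prefer_fetch_matches(outer_card=500, inner_card=10_000,
                                join_card=50, inner_msgs_to_ship_whole=400)
```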
What might be better

Ship the outer relation to the inner relation's
site and return the results to the outer's site.

This allows use of indexes on the inner
relation in a nested-loops join.
If the outer is small, this works well.
Outer relation shipping Cont.

Shipping the outer "enjoys more simultaneity" –
i.e., in a nested-loop join:
For all outer tuples do
For all inner tuples do
If outer == inner, add the result to the answer.

The outer loop can start, and the inner loop can
iterate, with only a fraction of the outer tuples
arrived. When shipping the inner relation, the
whole relation must arrive before the loop
iterations can begin.
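The "simultaneity" point can be made concrete with a generator-based nested-loop join: it produces results as soon as the first shipped outer tuples arrive. The relation contents and helper names below are invented for illustration.

```python
# Sketch of why shipping the outer relation allows overlap: this
# nested-loop join is a generator, so it starts producing matches while
# outer tuples are still arriving over the network.
def nested_loop_join(outer_stream, inner):
    for o in outer_stream:        # outer tuples may still be in flight
        for i in inner:           # inner is local; scanned per outer tuple
            if o[0] == i[0]:      # compare join columns
                yield (o, i)

def outer_arrivals():             # stands in for tuples arriving over the net
    yield ("a", 1)
    yield ("b", 2)

inner_table = [("a", 10), ("c", 30)]
results = list(nested_loop_join(outer_arrivals(), inner_table))
assert results == [(("a", 1), ("a", 10))]
```

Had the inner relation been the one shipped, the join could not begin until the entire table had arrived and been stored locally.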
Distributed vs. Local Joins



Total resources consumed by distributed joins
are higher than by local joins.
Response time of distributed joins is less than
that of local joins.
What does this mean?

More machines do the work in a distributed
join, so they can work in parallel – more total
work is done, but since more machines share
it, the result takes less time.
Distrib vs. Local Joins Cont.

The response-time improvement for distributed
queries has two causes:

1. Parallelism.
2. Reduced contention – accesses through
unclustered indexes benefit greatly from larger
buffers, and n machines provide n times the
buffer space.
Negative of distribution – slow network speeds
make the reverse true; then local joins are
faster.
Alternative Join methods

Dynamically create a temporary index on the
inner table:

Since we cannot ship an index, we can try
to build one.
The cost may be high:

Scan the entire table and send it to site 1.
Store the table and create a temporary index
on it at site 1.
Execute the best local join plan.
Semijoin



Sort S and T on the join column,
producing S′ and T′.
Send S′'s join-column values to T's site,
match them against T′, and send the
matching tuples back to S's site.
Merge-join T′'s tuples with S′'s tuples
to get the answer.
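The three steps can be sketched end to end. This is an illustrative single-process version, with relations as lists of (join_value, payload) tuples and invented names; the final merge-join is written as a simple nested scan for brevity rather than a true merge.

```python
# Sketch of the semijoin steps for relations S and T on one join column.
def semijoin(S, T):
    S_sorted = sorted(S)                      # produce S'
    T_sorted = sorted(T)                      # produce T'
    # Ship only S''s join-column values to T's site...
    s_values = {v for v, _ in S_sorted}
    # ...match against T' and ship back just the matching T tuples...
    t_matches = [t for t in T_sorted if t[0] in s_values]
    # ...then join the shipped T tuples with S' at S's site.
    return [(s, t) for s in S_sorted for t in t_matches if s[0] == t[0]]

S = [("b", 2), ("a", 1)]
T = [("a", 10), ("z", 99)]
assert semijoin(S, T) == [(("a", 1), ("a", 10))]
```

Note the two network transfers: the join-column values outbound, and only the matching inner tuples inbound.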
Bloom join




Use a Bloom filter – a bit string somewhat like
a hash table, where each bit represents a
bucket. All bits start at 0; if a value hashes to
bit x, set bit x to 1.
Generate a Bloom filter for table S and send it
to T's site.
Hash T using the same hash function and
ship any tuples that hash to a 1 in S's Bloom
filter.
At site S, join T's shipped tuples with table S.
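The steps above can be sketched as follows. A real Bloom filter typically uses several hash functions; one is used here for brevity, and the relation contents and names are invented. False positives from hash collisions are harmless because the final join at S's site filters them out.

```python
# Sketch of a Bloom join between S (outer site) and T (remote site).
def make_bloom(values, nbits=64):
    bits = 0
    for v in values:
        bits |= 1 << (hash(v) % nbits)     # set the bucket's bit
    return bits

def bloom_join(S, T, nbits=64):
    # Built at S's site and sent to T's site (nbits bits on the wire).
    bloom = make_bloom((v for v, _ in S), nbits)
    # At T's site: ship only tuples whose join value hits a 1 bit.
    shipped = [t for t in T if bloom & (1 << (hash(t[0]) % nbits))]
    # Back at S's site: the real join removes any false positives.
    return [(s, t) for s in S for t in shipped if s[0] == t[0]]

S = [("a", 1), ("b", 2)]
T = [("a", 10), ("z", 99)]
assert bloom_join(S, T) == [(("a", 1), ("a", 10))]
```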
Comparing join methods


Bloom joins generally outperform the other
methods.
Semijoins are advantageous when both the
data and (unclustered) index pages of the
inner table fit in the buffer, so that efficient
reuse of those pages keeps semijoin
processing costs low. If not, constant paging
of the unclustered index results in poor
performance.
Why are Bloom Joins better?


The message costs of semijoins and Bloom
joins are comparable.
The semijoin incurs higher local processing
costs because it performs a "second join":
after S′'s join-column values are sent to T′'s
site and joined there, the resulting T′ tuples
must be sent back and joined again with S′.
Commercial Products