Query Optimization and Intro to Transactions R&G, Chapter 15

advertisement
Query Optimization
and Intro to
Transactions
R&G, Chapter 15
Chapter 16
Lecture 18
Administrivia
•
Homework 3 due next Tuesday, March 20 by end of class
period
•
Homework 4 available on class website
– Implement nested loops and hash join operators for (new!)
minibase
– Due date: April 10 (after Spring Break)
•
Midterm 2 is 3/22, 1 week from today
– In class, covers lectures 10-17
– Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall
•
Internships at Google this summer…
– See http://www.postgresql.org/developer/summerofcode
– Booth at the UCB TechExpo this Thursday:
– http://csba.berkeley.edu/tech_expo.html
– Contact josh@postgresql.org
Review
Query Optimization
and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
We are here
•Query plans are a tree of operators that
compute the result of a query
•Optimization is the process of picking the
best plan
•Execution is the process of executing the
plan
Example
Sid, COUNT(*) AS numbers
Select S.sid, COUNT(*) AS number
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid
AND B.color = “red”
GROUP BY S.sid
GROUPBY sid
sid=sid
Sailors:
Tuples 50 bytes long, 80 tuples/page, 500 pages
Unclustered B+ and Hash on sid (key)
40,000 sid values
10 ratings
bid=bid
Reserves
Color=red
Reserves:
Tuples 40 bytes long, 100 tuples/page, 1000
pages
Clustered B+ tree on bid (key), Unclustered B+
on sid
100 distinct bid values
Boats
Tuples 50 bytes long, 80 tuples/page, 100 pages
Unclustered B+, Clustered Hash on color, 50
distinct color values
Sailors
Boats
Pass 1: Best access method for each
relation
Sid, COUNT(*) AS numbers
• Sailors
GROUPBY sid
– No predicates, so File Scan is best
• Reserves
– No predicates so File Scan is best
– What about Index Scan with B+
index on bid for Reserves?
– Keep it in mind…tuples will
come out in join order…might
come in handy later for a join
on bid…
sid=sid
bid=bid
Sailors
Reserves
Color=red
Boats
Sailors:
Tuples 50 bytes long, 80 tuples/page, 500
pages
Unclustered B+ and Hash on sid (key)
40,000 sid values
10 ratings
Reserves:
Tuples 40 bytes long, 100 tuples/page, 1000
pages
Clustered B+ tree on bid (key), Unclustered
B+ on sid
100 distinct bid values
Pass 1: Best access method for each
relation
Always keep
• Boats
around just in case
– File Scan
Book says to keep Sid, COUNT(*) AS number
100 I/Os
it because of
– B+ Index Scan on color interesting order
GROUPBY sid
2-3 + 80*100/50 =
Cheapest sid=sid
3 + 160 I/Os = 163 I/Os
– Hash Index Scan on color
Sailors
bid=bid
1.2 + (80*100/50)/80 =
Reserves
1.2 + 2 I/Os
Color=red
Boats
Tuples 50 bytes long, 80 tuples/page, 100
pages
Unclustered B+, Clustered Hash on color,
50 distinct color values
Boats
Pass 2
• For each of the plans in pass 1:
– generate left deep plans for joins- consider
different order and join methods
bid=bid
bid=bid
Reserves
sid=sid
Reserves
Color=red
Color=red
Boats
Boats
Sailors
Reserves
sid=sid
Reserves
• Question: what about SailorsXBoats?
X
Sailors
Pass 2
• First consider which pass 1 plans to use for
access path
bid=bid
bid=bid
Reserves
Reserves
Color=red
Color=red
Boats
Boats
1. Hash index(color) Boats,
File scan Reserves
2. B+(color),
File Scan Reserves
Boats: B+ tree(color), Hash(color), File Scan
Sailors: File Scan
Reserves: File Scan
•
File scan Reserves,
File Scan Boats
Boats: B+ tree(bid), Hash(bid), File Scan
Sailors: File Scan
Reserves: File Scan
Pass 2
• First consider which pass 1 plans to use for
access path
sid=sid
Sailors
•
•
sid=sid
Reserves
Reserves
File Scan Sailors, File Scan Reserves
B+(sid), File Scan Reserves
•
•
•
Sailors
File scan Reserves, File Scan Sailors
Note: book also includes these plans:
– File Scan Sailors (outer) with Boats (inner)
– Boats hash on color with Sailors (inner)
– Boats Btree on color with Sailors (inner)
Would you agree these should be considered?
Boats: B+ tree(bid), Hash(bid), File Scan
Sailors: File Scan
Reserves: File Scan
Pass 2
• Now consider join methods and consider all access
paths for inner
bid=bid
bid=bid
Reserves
Color=red
If you replace file scan of
Reserves with B+ index
Reserves
scan, Index Nested
Color=red
Loops is a good choice!
Sort-Merge could also be
a good choice because
Tuples come out in bidorder.
Boats
1. Hash index(color) Boats,
File scan Reserves
2. B+(color),
File Scan Reserves
•
Boats
File scan Reserves,
File Scan Boats
Reserves:
Clustered B+ tree on bid (key), Unclustered B+ on sid
Boats
Unclustered B+, Clustered Hash on color, 50 distinct color values
Boats: B+ tree(bid), Hash(bid), File Scan
Sailors: File Scan
Reserves: File Scan
Pass 2
• Now consider join methods
sid=sid
Sailors
•
•
sid=sid
Reserves
Reserves
File Scan Sailors, File Scan Reserves
B+(color), File Scan Reserves
•
Sailors
File scan Reserves, File Scan Sailors
Exercise: Which plans would you keep
for this set of joins?
Pass 3 and beyond
• For each of the plans retained from Pass 2, taken
as the outer, generate plans for the next join
– For example, let’s take the sort-merge plan for BoatsxReserves and
add in Sailors:
Sort-merge
sid=sid
From pass 2
Sort-merge
Sailors
B+ index(sid) Reserves
bid=bid
Note that this plan
will produce tuples
in sid order;
interesting order
for the upcoming
GROUP BY.
Reserves
Color=red
Boats
Hash index(color)
Boats
B+ index(bid) Reserves
• GROUP BY, ORDER BY,
AGGREGATES are all
considered after the join
plans are chosen.
Nested Queries (Subqueries)
• Nested queries are parsed into their own ‘query block’ and optimized
separately.
• Two kinds: SELECT S.sname
Uncorrelated FROM Sailors S
WHERE EXISTS
(SELECT *
FROM Reserves R
WHERE R.day = ’01/31/07’)
Correlated
SELECT S.sname
FROM Sailors S
WHERE EXISTS
(SELECT *
FROM Reserves R
WHERE R.bid=103
AND R.sid=S.sid)
•
An uncorrelated subquery
can be computed once
per query.
-> Optimizer can choose a
plan for the subquery
and add temp operator
to ‘cache’ the subquery
results for use in the rest
of the query
Plan for outer
query
TEMP
Subquery plan
Nested Queries
Correlated
• Optimizer chooses a plan for subquery, and treats it as
a subroutine to be invoked once per tuple produced by
the outer query block.
SELECT S.sname
FROM Sailors S
WHERE EXISTS
(SELECT *
FROM Reserves R
WHERE R.bid=103
AND R.sid=S.sid)
Nested block to optimize:
SELECT *
FROM Reserves R
WHERE R.bid=103
AND S.sid= outer value
Plan for outer
query
Subquery plan
Invoked once per tuple
produced by outer query
block
Nested Queries
Correlated
• Sometimes correlated subqueries can be rewritten as a
non-correlated query
– When rewriting, need to be careful to preserve semantics
Returns at most 1 tuple per sailor;
need to add DISTINCT to ensure same
semantics in rewritten query
SELECT S.sname
FROM Sailors S
WHERE EXISTS
(SELECT *
FROM Reserves R
WHERE R.bid=103
AND R.sid=S.sid)
Equivalent non-nested query:
SELECT DISTINCT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid
AND R.bid=103
• Rewritten query gives the optimizer
more choices for optimization, and it
is more likely to choose a good plan
Taking it one step further…
• My area of research is data integration
This is a very common case in the world today:
 What if I could use the
query planning and
execution of a DBMS to
query over multiple,
different DBMSs, and other
data sources as well?
Query over multiple DBMS’s
Query Optimization
and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB21
DB
DB22
Oracle
Postgres
Other DBs
Files
Spreadsheets
….
Taking it one step further…
• Heterogenous Data Integration leads to all sorts of fun
with optimization …
– Compensation becomes possible…
• You can do complex SQL queries over simple data sources…even if
a spreadsheet can’t do joins, the DBMS optimizing the query can!
– Knowing what other sources can do becomes a factor
• Oracle, DB2, Postgres all support slightly different versions of SQL
• If the data source isn’t a DBMS, how do I know what operations it
can perform (e.g. can a spreadsheet do projects? Filters? Joins?)
– Where to do the work becomes a factor…
• e.g., Should I always push as much of query as possible to another
DBMS?
– Network cost becomes a factor…
• When do we ship tuples from original source to DBMS processing
the complete query?
Summary
• Query optimization is an important task in a relational
DBMS.
• Must understand optimization in order to understand
the performance impact of a given database design
(relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:
– Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
–
Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, operator implementations.
Transaction Management Overview
R & G Chapter 16
There are three side effects of acid.
Enhanced long term memory,
decreased short term memory,
and I forget the third.
- Timothy Leary
Components of a DBMS
transaction
Data Definition
query
Query Compiler
Execution Engine
Transaction Manager
Logging/Recovery
Schema Manager
Concurrency Control
Buffer Manager
LOCK TABLE
Storage Manager
BUFFERS
BUFFER POOL
DBMS: a set of cooperating software modules
Concurrency Control & Recovery
• Very valuable properties of DBMSs
– without these, DBMSs would be much less useful
• Based on concept of transactions with ACID
properties (yep…they’re baa-aack…)
Statement of Problem
• Concurrent execution of independent transactions
– utilization/throughput (“hide” waiting for I/Os.)
T2 wins, but if there
– response time
was a slight delay,
maybe T1 would win
– fairness
-> DBMS wants to
• Example:
ensure a predictable
T1:
T2:
outcome for
concurrent users
t0: tmp1 := read(X)
t1:
tmp2 := read(X)
t2: tmp1 := tmp1 – 20
t3:
tmp2 := tmp2 + 10
t4: write tmp1 into X
t5:
write tmp2 into X
Statement of problem (cont.)
• Arbitrary interleaving can lead to
– Temporary inconsistency (ok, unavoidable)
– “Permanent” inconsistency (bad!)
• Need formal correctness criteria.
Definitions
• A program may carry out many operations on the
data retrieved from the database
• However, the DBMS is only concerned about what
data is read/written from/to the database.
• transaction - a sequence of read and write
operations (read(A), write(B), …)
– DBMS’s abstract view of a user program
Correctness criteria: The ACID properties
•
•
A tomicity: All actions in the Xact happen, or none happen.
C onsistency: If each Xact is consistent, and the DB starts
consistent, it ends up consistent.
•
I solation:
Execution of one Xact is isolated from that of other
Xacts.
•
D urability:
If a Xact commits, its effects persist.
A Atomicity of Transactions
• Two possible outcomes of executing a transaction:
– Xact might commit after completing all its actions
– or it could abort (or be aborted by the DBMS) after
executing some actions.
• DBMS guarantees that Xacts are atomic.
– From user’s point of view: Xact always either executes all
its actions, or executes no actions at all.
A Mechanisms for Ensuring Atomicity
•
Main approach: LOGGING
– DBMS logs all actions so that it can undo the
actions of aborted transactions.
•
Logging used by modern systems, because of
need for audit trail and for efficiency reasons.
C Transaction Consistency
• “Consistency” - data in DBMS is accurate in modeling
real world and follows integrity constraints
• User must ensure transaction consistent by itself
– I.e., if DBMS consistent before Xact, it will be after also
•Key point:
consistent
database
S1
transaction T
consistent
database
S2
C Transaction Consistency (cont.)
• Recall: Integrity constraints
– must be true for DB to be considered consistent
– Examples:
1. FOREIGN KEY R.sid REFERENCES S
2. ACCT-BAL >= 0
• System checks ICs and if they fail, the transaction
rolls back (i.e., is aborted).
– Beyond this, DBMS does not understand the semantics of
the data.
– e.g., it does not understand how interest on a bank
account is computed
I Isolation of Transactions
• Users submit transactions, and
• Each transaction executes as if it was running by
itself.
– Concurrency is achieved by DBMS, which interleaves
actions (reads/writes of DB objects) of various
transactions.
•
Many techniques have been developed. Fall into two
basic categories:
– Pessimistic – don’t let problems arise in the first place
– Optimistic – assume conflicts are rare, deal with them
after they happen.
I Example
• Consider two transactions (Xacts):
T1:
BEGIN A=A+100, B=B-100 END
T2:
BEGIN A=1.06*A, B=1.06*B END
•
•
•
•
1st xact transfers $100 from B’s account to A’s
2nd credits both accounts with 6% interest.
Assume at first A and B each have $1000. What
are the legal outcomes of running T1 and T2???
• $2000 *1.06 = $2120
There is no guarantee that T1 will execute before
T2 or vice-versa, if both are submitted together.
But, the net effect must be equivalent to these two
transactions running serially in some order.
I
Example (Contd.)
• Legal outcomes: A=1166,B=954 or A=1160,B=960
• Consider a possible interleaved schedule:
T1:
T2:

•
A=1.06*A,
B=B-100
B=1.06*B
This is OK (same as T1;T2). But what about:
T1:
T2:
•
A=A+100,
A=A+100,
A=1.06*A, B=1.06*B
B=B-100
Result: A=1166, B=960; A+B = 2126, bank loses $6
The DBMS’s view of the second schedule:
T1:
T2:
R(A), W(A),
R(A), W(A), R(B), W(B)
R(B), W(B)
I Formal Properties of Schedules
• Serial schedule: Schedule that does not interleave the
actions of different transactions.
• Equivalent schedules: For any database state, the
effect of executing the first schedule is identical to the
effect of executing the second schedule.
• Serializable schedule: A schedule that is equivalent to
some serial execution of the transactions.
(Note: If each transaction preserves consistency, every
serializable schedule preserves consistency. )
Download