Query Optimization 3
Cost Estimation
R&G, Chapters 12, 13, 14
Lecture 15
Administrivia
• Homework 2 Due Tonight
– Remember you have 4 slip days for the course
• Homeworks 3 & 4 available later this week
– 3 is written assignment, deals with optimization,
due before midterm 2
– 4 is programming assignment, implementing query
processing, due after Spring Break
• Midterm 2 is 3/22, 2 weeks from Thursday
Review: Query Processing
• Queries start out as SQL
• Database translates SQL to one or more Relational
Algebra plans
• Plan is a tree of operations, with access path for each
• Access path is how each operator gets tuples
– If working directly on table, can use scan, index
– Some operators, like sort-merge join, or group-by, need tuples
sorted
– Often, operators are pipelined, consuming tuples output by earlier operators in the tree
• Database estimates cost for various plans, chooses least
expensive
Today
• Review costs for:
– Sorting
– Selection
– Projection
– Joins
• Re-examine Hashing for:
– Projection
– Joins
General External Merge Sort
• More than 3 buffer pages. How can we utilize them?
• To sort a file with N pages using B buffer pages:
– Pass 0: use B buffer pages. Produce ⌈N / B⌉ sorted runs of B pages each.
– Pass 1, 2, …, etc.: merge B-1 runs at a time.
[Figure: B main memory buffers — B-1 input buffers (INPUT 1 … INPUT B-1) merged into one OUTPUT buffer, streaming sorted runs from disk and writing the merged run back to disk]
Cost of External Merge Sort
• Minimum amount of memory: 3 pages
– Initial runs of 3 pages
– Then 2-way merge of sorted runs
(2 pages for inputs, one for output)
– # of passes: 1 + ⌈log2(⌈N/3⌉)⌉
• With more memory, fewer passes
– With B pages, # of passes: 1 + ⌈log_{B-1}(⌈N/B⌉)⌉
• I/O Cost = 2N * (# of passes)
External Sort Example
• E.g., with 5 buffer pages, to sort a 108-page file:
– Pass 0:
• 22 sorted runs of 5 pages each (last run is only 3 pages)
• Now, do four-way (B-1) merges
– Pass 1:
• 6 sorted runs of 20 pages each (last run is only 8 pages)
– Pass 2:
• 2 sorted runs, 80 pages and 28 pages
– Pass 3:
• Sorted file of 108 pages
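To connect the example to the cost formula, here is a minimal Python sketch (mine, not from the lecture) that counts merge passes and total I/O for external merge sort:

```python
import math

def external_sort_cost(n_pages: int, b_buffers: int):
    """Passes and total I/O (2N per pass) for external merge sort with B buffers."""
    runs = math.ceil(n_pages / b_buffers)   # Pass 0 produces ceil(N/B) sorted runs
    passes = 1
    while runs > 1:                         # each later pass merges B-1 runs at a time
        runs = math.ceil(runs / (b_buffers - 1))
        passes += 1
    return passes, 2 * n_pages * passes

# The example above: 108 pages, 5 buffers -> runs go 22, 6, 2, 1 -> 4 passes, 864 I/Os
print(external_sort_cost(108, 5))   # (4, 864)
```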
Using B+ Trees for Sorting
• Scenario:
– Table to be sorted has B+ tree index on sorting column(s).
• Idea:
– Retrieve records in order by traversing leaf pages.
• Is this a good idea? Cases to consider:
– B+ tree is clustered: Good idea!
– B+ tree is not clustered: Could be a very bad idea!
• I/O Cost
– Clustered tree: ~ 1.5N
– Unclustered tree: 1 I/O per tuple, worst case!
Selections: “age < 20”, “fname = Bob”, etc
• No index
– Do sequential scan over all tuples
– Cost: N I/Os
• Sorted data
– Do binary search
– Cost: log2(N) I/Os
• Clustered B-Tree
– Cost: 2 or 3 I/Os to find first record + 1 I/O per qualifying page
• Unclustered B-Tree
– Cost: 2 or 3 to find first RID +
~1 I/O for each qualifying tuple
• Clustered Hash Index
– Cost: ~1.2 I/Os to find bucket, all tuples inside
• Unclustered Hash Index
– Cost: ~1.2 I/Os to find bucket, +
~1 I/O for each matching tuple
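These rules of thumb can be collected into a rough estimator; the sketch below is my own summary of the costs above, with the qualifying page/tuple counts supplied as assumptions rather than derived from catalog statistics:

```python
import math

def selection_cost(access_path, n_pages, matching_pages=0, matching_tuples=0):
    """Rough I/O estimates for a selection, following the rules of thumb above."""
    if access_path == "scan":
        return n_pages                            # sequential scan of the whole table
    if access_path == "sorted":
        return math.ceil(math.log2(n_pages))      # binary search on sorted data
    if access_path == "clustered_btree":
        return 3 + matching_pages                 # ~2-3 I/Os to the leaf + qualifying pages
    if access_path == "unclustered_btree":
        return 3 + matching_tuples                # ~2-3 I/Os to first RID + ~1 I/O per tuple
    if access_path == "clustered_hash":
        return 1.2                                # bucket lookup; all matches are inside
    if access_path == "unclustered_hash":
        return 1.2 + matching_tuples              # bucket lookup + ~1 I/O per matching tuple
    raise ValueError(access_path)

# Example from the exercise below: "Age = 25" via a hash index, 20 matching tuples
print(selection_cost("unclustered_hash", n_pages=100, matching_tuples=20))  # ~21.2
```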
Selection Exercise
• Sal > 100
– Sequential Scan: 100 I/Os
– Btree on sal? Unclustered, 503 I/Os
– BTree on <age, sal>? Can’t use it
• Age = 25
– Sequential Scan: 100 I/Os
– Hash on age: 1.2 I/O to get to bucket
• 20 matching tuples, 1 I/O for each
– BTree <age,sal>: ~3 I/Os to find leaf + #matching pages
• 20 matching tuples, clustered, ~ 2 I/Os
• Age > 20
– Sequential Scan: 100 I/Os
– Hash on age? Not with range query.
– BTree <age, sal>: ~3 I/Os to find leaf + #matching pages
• (Age > 20) is 90% of pages, or ~90*1.5 = 135 I/Os
Selection Exercise (cont)
• Eid = 1000
– Sequential Scan: ~50 I/Os (avg)
– Hash on eid: ~1.2 I/Os to find bucket, 1 I/O to
get record
• Sal > 100 and age < 30
– Sequential Scan: 100 I/Os
– Btree <age, sal>: ~ 3 I/Os to find leaf, 30% of
pages match, so 30*1.5 = 45 I/Os
Projection
• Expensive when eliminating duplicates
• Can do this via:
– Sorting: cost no more than external sort
• Cheaper if you project columns in initial pass, since more
projected tuples fit in each page.
– Hashing: build a hash table, duplicates will end up
in the same bucket
An Alternative to Sorting:
Remove duplicates with Hashing!
• Idea:
– Many of the things we use sort for don’t exploit the
order of the sorted data
– e.g.: removing duplicates in DISTINCT
– e.g.: finding matches in JOIN
• Often good enough to match all tuples with equal
values
• Hashing does this!
– And may be cheaper than sorting! (Hmmm…!)
– But how to do it for data sets bigger than memory??
General Idea
• Two phases:
– Partition: use a hash function h to split tuples into
partitions on disk.
• Key property: all matches live in the same partition.
– ReHash: for each partition on disk, build a main-memory hash table using a second hash function h2
Two Phases
• Partition:
[Figure: original relation on disk → one INPUT buffer plus B-1 OUTPUT buffers, tuples routed by hash function h (B main memory buffers) → partitions 1 … B-1 on disk]
• Rehash:
[Figure: each partition on disk → hash function h2 → hash table for partition Ri (<= B pages) built in the B main memory buffers → result]
Duplicate Elimination using Hashing
[Figure: same two-phase picture — partition the original relation with h into B-1 partitions on disk, then rehash each partition Ri (<= B pages) with h2 into an in-memory hash table]
• read one bucket at a time
• for each group of identical tuples, output one
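To make the two phases concrete, here is a minimal Python sketch (not from the lecture) of duplicate elimination by partition-then-rehash; Python lists stand in for on-disk partitions and a set stands in for the h2 hash table:

```python
NUM_PARTITIONS = 4   # plays the role of the B-1 output buffers (assumed value)

def hash_distinct(tuples):
    """Two-phase hashing for DISTINCT: partition with h, then rehash each partition."""
    # Phase 1 (Partition): all duplicates of a tuple land in the same partition.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for t in tuples:
        partitions[hash(t) % NUM_PARTITIONS].append(t)      # h(t)
    # Phase 2 (Rehash): build an in-memory hash table per partition, emit one copy per group.
    for part in partitions:
        seen = set()                                         # stands in for the h2 hash table
        for t in part:
            if t not in seen:
                seen.add(t)
                yield t

# Usage: list(hash_distinct([(1, 'a'), (2, 'b'), (1, 'a')])) -> [(1, 'a'), (2, 'b')]
```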
Hashing in Detail
• Two phases, two hash functions
• First pass: partition into (B-1) buckets
• E.g., B = 5 pages, h(x) = two low-order bits of x
[Figure: input file is read into memory one page at a time; each value is routed by its two low-order bits into one of four output files]
Memory and I/O Cost Requirements
• If we can hash in two passes -> cost is 4N
• How big of a table can we hash in two passes?
– B-1 “partitions” result from Phase 0
– Each should be no more than B pages in size
– Answer: B(B-1).
Said differently: we can hash a table of size N pages in about √N space.
– Note: assumes hash function distributes
records evenly!
• Have a bigger table? Recursive partitioning!
How does this compare with
external sorting?
Memory Requirement for
External Sorting
• How big of a table can we sort in two passes?
– Each “sorted run” after Phase 0 is of size B
– Can merge up to B-1 sorted runs in Phase 1
– Answer: B(B-1).
Said differently: we can sort a table of size N pages in about √N space.
• Have a bigger table? Additional merge passes!
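As a quick check of the B(B-1) bound, a small sketch (my own) that finds the minimum number of buffers for a two-pass sort or hash of an N-page table:

```python
import math

def min_buffers_for_two_passes(n_pages: int) -> int:
    """Smallest B with B(B-1) >= N, i.e. roughly sqrt(N) buffer pages."""
    b = math.isqrt(n_pages) + 1
    while b * (b - 1) < n_pages:
        b += 1
    return b

# Example: a 108-page file needs B = 11, since 11*10 = 110 >= 108
print(min_buffers_for_two_passes(108))   # 11
```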
So which is better ??
• Based on our simple analysis:
– Same memory requirement for 2 passes
– Same IO cost
• Digging deeper …
• Sorting pros:
– Great if input already sorted (or almost sorted)
– Great if need output to be sorted anyway
– Not sensitive to “data skew” or “bad” hash functions
• Hashing pros:
– Highly parallelizable (will discuss later in semester)
– Can exploit extra memory to reduce # IOs (stay tuned…)
Nested Loops Joins
• R, with M pages, joins S, with N pages
• Nested Loops
– Simple nested loops
• Insanely inefficient: M + pR*M*N  (pR = tuples per page of R)
– Paged nested loops – only 3 pages of memory
• M + M*N
– Blocked nested loops – B pages of memory
• M + ⌈M/(B-2)⌉*N
• If M fits in memory (B-2), cost only M + N
– Index nested loops
• M + pR*M*(cost of one index lookup)
• Only good if M very small
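The variants above differ only in how often S is rescanned; this small sketch (mine) encodes the formulas, with pR (tuples per page of R) and the per-lookup index cost as assumed inputs:

```python
import math

def nested_loops_costs(m, n, p_r, b, index_cost):
    """I/O estimates for the nested-loops join variants listed above.
    m, n: pages of R and S; p_r: tuples per page of R; b: buffer pages;
    index_cost: assumed I/Os per index lookup into S."""
    return {
        "simple":  m + p_r * m * n,                   # scan S once per R tuple
        "paged":   m + m * n,                         # scan S once per R page
        "blocked": m + math.ceil(m / (b - 2)) * n,    # scan S once per (B-2)-page block of R
        "index":   m + p_r * m * index_cost,          # one index probe per R tuple
    }

# Example with assumed sizes: R = 1000 pages (100 tuples/page), S = 500 pages, B = 102
print(nested_loops_costs(1000, 500, 100, 102, index_cost=3))
```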
Sort-Merge Join
• Simple case:
– sort both tables on join column
– Merge
– Cost: external sort cost + merge cost
• 2M*(1 + log(B-1)(M/B)) + 2N*(1 + log(B-1)(N/B)) + M + N
• Optimized Case:
– If we have enough memory, do final merge and join in same
pass. This avoids final write pass from sort, and read pass from
merge
– Can we merge on 2nd pass? Only if #runs from 1st pass < B
– #runs for R is M/B; #runs for S is N/B
• Total #runs ≈ (M+N)/B
– Can merge on 2nd pass if (M+N)/B < B, i.e., M+N < B²
– Cost: 3(M+N)
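A small sketch (mine) of the two formulas above; the optimized branch assumes M + N < B² so the merge can happen on the second pass:

```python
import math

def sort_merge_join_cost(m, n, b, optimized=False):
    """I/O cost of sort-merge join: full external sorts plus a merging pass,
    or the optimized two-pass variant when M + N < B^2."""
    if optimized:
        assert m + n < b * b, "optimized variant requires M + N < B^2"
        return 3 * (m + n)              # read+write once to form runs, read once more to merge-join
    def sort_cost(pages):               # same pass-counting as external sort above
        runs, passes = math.ceil(pages / b), 1
        while runs > 1:
            runs = math.ceil(runs / (b - 1))
            passes += 1
        return 2 * pages * passes
    return sort_cost(m) + sort_cost(n) + m + n

# Example with assumed sizes: M = 1000, N = 500, B = 100
print(sort_merge_join_cost(1000, 500, 100, optimized=True))   # 4500
```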
Hash Join
[Figure: partition phase — both R and S are hashed with h into B-1 partitions on disk; matching phase — for each partition, build an in-memory hash table on Ri (<= B-2 pages) using h2, probe it with the tuples of the matching partition Si (also hashed with h2), and write join results to the output buffer]
Cost of Hash Join
• Partitioning phase: read+write both relations
⇒ 2(|R|+|S|) I/Os
• Matching phase: read both relations (result writes not counted)
⇒ |R|+|S| I/Os
• Total cost of 2-pass hash join = 3(|R|+|S|)
Q: what is cost of 2-pass merge-sort join?
Q: how much memory needed for 2-pass sort join?
Q: how much memory needed for 2-pass hash join?
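One way to sanity-check these questions against the formulas already given (3(M+N) for both two-pass joins, M+N < B² for the optimized sort-merge join, and roughly B ≈ √(pages of the smaller relation) for hash join, since only the build side must satisfy the B(B-1) bound); the page counts below are assumptions:

```python
import math

M, N = 1000, 500   # assumed page counts for R and S

two_pass_join_cost = 3 * (M + N)               # same for 2-pass hash join and 2-pass sort-merge join
sort_join_buffers  = math.isqrt(M + N) + 1     # sort-merge join: need roughly B > sqrt(M + N)
hash_join_buffers  = math.isqrt(min(M, N)) + 1 # hash join: roughly B > sqrt(smaller relation)

print(two_pass_join_cost, sort_join_buffers, hash_join_buffers)   # 4500 39 23
```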
An important optimization to hashing
• Have B memory buffers
• Want to hash relation of size N
[Plot: I/O cost vs. table size N — cost is N (1 pass) for N ≤ B and 3N (2 passes) for B < N ≤ B²]
If B < N < B², we will have unused memory …
Hybrid Hashing
• Idea: keep one of the hash buckets in memory!
[Figure: partition phase with a k-buffer in-memory hash table (hash function h3) holding the first partition, plus B-k output buffers routed by h that spill partitions 2 … B-k to disk]
Q: how do we choose the value of k?
Cost reduction due to hybrid hashing
• Now:
[Plot: I/O cost vs. table size N — with hybrid hashing, cost rises gradually from N toward 3N as N grows from B to B², instead of jumping straight to 3N]
Summary: Hashing vs. Sorting
• Sorting pros:
– Good if input already sorted, or need output sorted
– Not sensitive to data skew or bad hash functions
• Hashing pros:
– Often cheaper due to hybrid hashing
– For join: # passes depends on size of smaller relation
– Highly parallelizable
Summary
• Several alternative evaluation algorithms for each operator.
• Query evaluated by converting to a tree of operators and
evaluating the operators in the tree.
• Must understand query optimization in order to fully understand
the performance impact of a given database design (relations,
indexes) on a workload (set of queries).
• Two parts to optimizing a query:
– Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
– Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, operator implementations.