Quick Review of Apr 22 material

• Sections 13.1 through 13.3 in text
• Query Processing: take an SQL query and:
– parse/translate it into an internal representation
– optimize it (choose an efficient form for the query)
– evaluate it
• Metadata for query processing
• Operations (and their costs):
– selection for equality (one particular value)
– range selection
– projection
– all of the above, with and without indices
– complex selections
Sorting?
• What does sorting have to do with query processing?
– SQL queries can specify the output be sorted
– several relational operations (such as joins) can be implemented
very efficiently if the input data is sorted first
– as a result, query processing is often concerned with sorting
temporary (intermediate) and final results
– creating a secondary index on the active relation (“logical” sorting)
isn’t sufficient -- sequential scans through the data on secondary
indices are very inefficient. We often need to sort the data
physically into order
Sorting
• We differentiate two types of sorting:
– internal sorting: the entire relation fits in memory
– external sorting: the relation is too large to fit in memory
• Internal sorting can use any of a large range of well-established sorting algorithms (e.g., Quicksort)
• In databases, the most commonly used method for external sorting is the sort-merge algorithm (based upon Mergesort)
Sort-merge Algorithm
• Create-runs phase:
– load M consecutive blocks of the relation (M is the number of blocks that fit comfortably in main memory)
– use some internal sorting algorithm to sort the tuples in those M blocks
– write the sorted run to disk
– continue with the next M blocks, and so on, until finished
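The create-runs phase can be sketched in Python, treating the relation as a list of blocks (each block a list of comparable tuples). The function name and the in-memory "runs" list are assumptions of this sketch; a real DBMS would write each run to disk:

```python
def create_runs(blocks, M):
    """Create-runs phase of external sort-merge: sort M blocks at a
    time in memory; each sorted chunk becomes one run."""
    runs = []
    for i in range(0, len(blocks), M):
        # load M consecutive blocks' tuples into memory
        chunk = [t for block in blocks[i:i + M] for t in block]
        chunk.sort()        # internal sort (Python's Timsort here)
        runs.append(chunk)  # in a real system this run goes to disk
    return runs
```

For example, with M = 2 and three blocks of two tuples each, the first run holds the four tuples of the first two blocks in sorted order.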
• Merge-runs phase (assuming the number of runs, N, is less than M):
– load the first block of each run into memory
– take the lowest-valued tuple from among the fronts of all the runs and write it to an output buffer page
– when the last tuple of a block has been read, fetch the next block from that run
– when the output buffer page is full, write it to disk and start a new one
– continue until all runs are exhausted (all input buffer pages are empty)
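The merge-runs phase (N < M) is a k-way merge of the sorted runs: repeatedly emit the smallest front tuple across all runs. Python's standard `heapq.merge` performs exactly this selection, so a minimal sketch is:

```python
import heapq

def merge_runs(runs):
    """Merge-runs phase for N < M: k-way merge of sorted runs.
    heapq.merge lazily yields the smallest remaining tuple across
    all runs, mirroring the buffer-page scheme described above."""
    return list(heapq.merge(*runs))
```

Feeding it the two runs from the earlier example yields the fully sorted relation.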
Sort-merge Algorithm (2)
• Merge-runs phase (N > M)
– operate on M runs at a time, creating runs of length M² blocks, and continue in multiple passes of the merge operation
• Cost of sorting: b(r) is the number of blocks occupied by relation r
– runs phase does one read and one write on each block of r: cost 2b(r)
– total number of runs (N): b(r)/M
– number of passes in the merge operation: 1 if N < M; otherwise log_{M-1}(b(r)/M)
– during each pass of the merge phase we read the whole relation and write it all out again: cost 2b(r) per pass
– total cost of the sort is therefore 2b(r) (log_{M-1}(b(r)/M) + 1)
– if only one merge pass is required (N < M), the cost is 4b(r)
– if M > b(r) there is only one run (internal sorting) and the cost is b(r)
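The cost formulas above can be checked numerically. This sketch follows the slide's cases (internal sort when M > b(r), a single merge pass when N < M, multiple passes otherwise), with the pass count rounded up:

```python
import math

def sort_cost(b_r, M):
    """Total block I/O of external sort-merge per the formulas above."""
    if M > b_r:                # whole relation fits: internal sort, read once
        return b_r
    N = math.ceil(b_r / M)     # number of initial runs
    if N < M:
        passes = 1             # single merge pass suffices
    else:
        passes = math.ceil(math.log(b_r / M, M - 1))
    # runs phase (2b(r)) plus 2b(r) per merge pass
    return 2 * b_r * (passes + 1)
```

For b(r) = 1000 and M = 100 we get N = 10 < M, so one merge pass and a total of 4b(r) = 4000 block I/Os.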
Join Operation
• Join is perhaps the most important operation combining
two relations
• Algorithms computing the join efficiently are crucial to the
optimization phase of query processing
• We will examine a number of algorithms for computing
joins
• An important metric for estimating efficiency is the size of
the result: as mentioned last class, the best algorithms on
complex (multi-relation) queries is to cut down the size of
the intermediate results as quickly as possible.
Join Operation: Size Estimation
• 0 <= size <= n(r) * n(s)
(between 0 and the size of the cartesian product)
• If A = R ∩ S is a key of R, then
size <= n(s)
• If A = R ∩ S is a key of R and a foreign key of S, then
size = n(s)
• If A = R ∩ S is not a key, then each value of A in R appears no more
than n(s)/V(A,s) times in S, so the n(r) tuples of R produce:
size <= n(r) * n(s)/V(A,s)
symmetrically,
size <= n(s) * n(r)/V(A,r)
if the two values are different we use the smaller:
size <= min{ n(s) * n(r)/V(A,r), n(r) * n(s)/V(A,s) }
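The non-key estimate above is easy to evaluate. A small sketch (integer division as a simplification):

```python
def join_size_bound(n_r, n_s, v_a_r, v_a_s):
    """Upper bound on |R join S| when the join attribute A is a key of
    neither relation: the smaller of the two symmetric estimates."""
    return min(n_s * n_r // v_a_r, n_r * n_s // v_a_s)
```

With n(r) = 1000, n(s) = 200, V(A,r) = 50, V(A,s) = 100, the two estimates are 4000 and 2000, so the bound is 2000.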
Join Methods: Nested Loop
• Simplest algorithm to compute a join: nested for loops
– requires no indices
• tuple-oriented:
for each tuple t1 in r do begin
for each tuple t2 in s do begin
join t1, t2 and append the result to the output
• block-oriented:
for each block b1 in r do begin
for each block b2 in s do begin
join b1, b2 and append the result to the output
• reverse inner loop
– as above, but we alternate counting up and down in the inner loop. Why?
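The block-oriented variant can be sketched as follows, assuming each relation arrives as a list of blocks of tuples and the join attribute is selected by position (both assumptions of this sketch):

```python
def block_nested_loop_join(r_blocks, s_blocks, key_r=0, key_s=0):
    """Block-oriented nested-loop join: for each block of the outer
    relation r, scan every block of the inner relation s and emit
    the concatenation of each matching tuple pair."""
    out = []
    for b1 in r_blocks:          # outer loop: blocks of r
        for b2 in s_blocks:      # inner loop: blocks of s
            for t1 in b1:
                for t2 in b2:
                    if t1[key_r] == t2[key_s]:
                        out.append(t1 + t2)
    return out
```

Joining blocks rather than tuples means each block of s is fetched once per block of r, not once per tuple of r, which is what drives the cost formulas on the next slide.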
Cost of Nested Loop Join
• Cost depends upon the number of buffers and the replacement strategy
– pin 1 block from the outer relation, k for the inner, and use LRU
cost: b(r) + b(r)*b(s) (assuming b(s) > k)
– pin 1 block from the outer relation, k for the inner, and use MRU
cost: b(r) + b(s) + (b(s) - (k-1)) * (b(r) - 1)
= b(r)(2-k) + k - 1 + b(r)*b(s) (assuming b(s) > k)
– pin k blocks from the outer relation, 1 for the inner
• read k blocks from the outer (cost k)
• for each block of s, join it with the k outer blocks (cost b(s))
• repeat with the next k blocks of r until done (repeated b(r)/k times)
cost: (k + b(s)) * b(r)/k
= b(r) + b(r)*b(s)/k
– which relation should be the outer one?
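One way to explore the closing question is to plug both orders into the last formula, b(r) + b(r)*b(s)/k, with some assumed block counts (the numbers below are illustrative only):

```python
def nested_loop_cost(b_outer, b_inner, k):
    """Block I/O with k buffer blocks pinned for the outer relation:
    b(outer) + b(outer) * b(inner) / k, per the formula above."""
    return b_outer + b_outer * b_inner / k

# Assumed sizes: b(r) = 1000, b(s) = 100, k = 10 buffer blocks.
small_outer = nested_loop_cost(100, 1000, 10)   # s as the outer relation
large_outer = nested_loop_cost(1000, 100, 10)   # r as the outer relation
```

With these numbers the product term b(r)*b(s)/k is the same either way, so the choice is decided by the extra b(outer) term, which favors the smaller relation as the outer one.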
Join Methods: Sort-Merge Join
• Two phases:
– Sort both relations on the join attributes
– Merge the sorted relations
sort R on joining attribute
sort S on joining attribute
merge (sorted-R, sorted-S)
• cost with k buffers:
– b(r) (2 log(b(r)/k) + 1) to sort R
– b(s) (2 log(b(s)/k) + 1) to sort S
– b(r) + b(s) to merge
– total: b(r) (2 log(b(r)/k) + 1) + b(s) (2 log(b(s)/k) + 1) + b(r) + b(s)
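The merge step can be sketched as a two-pointer scan over the sorted inputs; duplicate join values in s are handled by rescanning the matching group (tuple layout and positional keys are assumptions of this sketch):

```python
def merge_join(r_sorted, s_sorted, key_r=0, key_s=0):
    """Merge phase of sort-merge join: both inputs already sorted on
    the join attribute; advance whichever side has the smaller value."""
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        kr, ks = r_sorted[i][key_r], s_sorted[j][key_s]
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # find the group of s tuples sharing this join value
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][key_s] == kr:
                j_end += 1
            # pair every r tuple with this value against the s group
            while i < len(r_sorted) and r_sorted[i][key_r] == kr:
                for t2 in s_sorted[j:j_end]:
                    out.append(r_sorted[i] + t2)
                i += 1
            j = j_end
    return out
```

Because each side is scanned once (plus local rescans of duplicate groups), the merge itself reads b(r) + b(s) blocks, matching the cost line above.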
Join Methods: Hash Join
• Two phases:
– Hash both relations into hashed partitions
– Bucket-wise join: join tuples of the same partitions only
Hash R on joining attribute into H(R) buckets
Hash S on joining attribute into H(S) buckets
nested-loop join of corresponding buckets
• cost (assuming each pair of corresponding buckets fits in the buffers)
– 2b(r) to hash R (read and write)
– 2b(s) to hash S (as above)
– b(r) + b(s) to join the corresponding buckets
– total: 3(b(r) + b(s))
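A minimal in-memory sketch of the same idea: build a hash table on the join attribute of one relation, then probe it with the other. (A real DBMS first partitions both relations to disk, as the slide describes; this build/probe version assumes one side's buckets fit in memory.)

```python
from collections import defaultdict

def hash_join(r, s, key_r=0, key_s=0):
    """Hash join sketch: build a hash table on r's join attribute,
    then probe it once per tuple of s; only tuples that land in the
    same bucket are ever compared."""
    buckets = defaultdict(list)
    for t1 in r:                        # build phase: hash R
        buckets[t1[key_r]].append(t1)
    out = []
    for t2 in s:                        # probe phase: hash S values
        for t1 in buckets.get(t2[key_s], []):
            out.append(t1 + t2)
    return out
```

Building on the smaller relation keeps the hash table small, which is the usual choice in practice.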
Join Methods: Indexed Join
• Inner relation has an index (clustering or not):
for each block br in R do begin
for each tuple t in br do begin
search the index on S with the value t.A of the joining
attribute and join with the resulting tuples of S
• cost = b(r) + n(r) * cost(select(S.A = c))
where cost(select(S.A = c)) is as described before for indexed selection
– What if R is sorted on A? (hint: use V(A,r) in the above)
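An indexed nested-loop join can be sketched with a Python dict standing in for the index on S's join attribute (the dict, mapping each join value to its list of S tuples, is an assumption of this sketch):

```python
def indexed_join(r, s_index, key_r=0):
    """Indexed nested-loop join: one index lookup per tuple of r,
    mirroring the n(r) * cost(select(S.A=c)) term in the formula."""
    out = []
    for t1 in r:
        for t2 in s_index.get(t1[key_r], []):  # index probe on S.A
            out.append(t1 + t2)
    return out
```

If R were sorted on A, consecutive tuples of R would repeat join values, so only V(A,r) distinct probes would be needed rather than n(r).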
3-way Join
Suppose we want to compute R(A,B) |X| S(B,C) |X| T(C,D)
• 1st method: pairwise.
– First compute temp(A,B,C) = R(A,B) |X| S(B,C); cost: b(r) + b(r)*b(s)
size of temp: b(temp) = n(r)*n(s) / (V(B,s) * f(r+s)), where f(r+s) is the blocking factor (tuples per block) of the joined tuples
– then compute result = temp(A,B,C) |X| T(C,D)
cost: b(t) + b(t)*b(temp)
• 2nd method: scan S and do simultaneous selections on R and T
– cost = b(s) + b(s)* (b(r ) + b(t))
– if R and T are indexed we could do the selections through the indices
cost = b(s) + n(s) * [cost(select(R.B=S.B)) + cost(select(T.C=S.C))]
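The pairwise method can be sketched with tiny illustrative relations (all names and data below are made up for the example; the helper is a naive nested-loop join on positional attributes):

```python
def join(r, s, key_r, key_s):
    """Naive nested-loop join on positional attributes (illustration)."""
    return [t1 + t2 for t1 in r for t2 in s if t1[key_r] == t2[key_s]]

# 1st method, pairwise: temp(A,B,C) = R |X| S, then temp |X| T.
R = [('a1', 'b1')]                 # R(A, B)
S = [('b1', 'c1'), ('b2', 'c2')]  # S(B, C)
T = [('c1', 'd1')]                 # T(C, D)
# drop the duplicated B column after the first join
temp = [(a, b, c) for (a, b, _b, c) in join(R, S, 1, 0)]
# drop the duplicated C column after the second join
result = [(a, b, c, d) for (a, b, c, _c, d) in join(temp, T, 2, 0)]
```

The intermediate result `temp` is materialized in full before the second join, which is why its estimated size b(temp) matters to the total cost.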