Quick Review of Apr 22 material
• Sections 13.1 through 13.3 in text
• Query Processing: take an SQL query and:
  – parse/translate it into an internal representation
  – optimize it (choose an efficient form for the query)
  – evaluate it
• Metadata for query processing
• Operations (and their costs):
  – Selection for equality (one particular value)
  – Range selection
  – Projection
  – all with and without indices
  – Complex selections

Sorting?
• What has sorting to do with query processing?
  – SQL queries can specify that the output be sorted
  – several relational operations (such as joins) can be implemented very efficiently if the input data is sorted first
  – as a result, query processing is often concerned with sorting temporary (intermediate) and final results
  – creating a secondary index on the active relation (“logical” sorting) isn’t sufficient: sequential scans through the data via a secondary index are very inefficient, so we often need to sort the data physically into order

Sorting
• We differentiate two types of sorting:
  – internal sorting: the entire relation fits in memory
  – external sorting: the relation is too large to fit in memory
• Internal sorting can use any of a large range of well-established sorting algorithms (e.g., Quicksort)
• In databases, the most commonly used method for external sorting is the sort-merge algorithm (based upon Mergesort)

Sort-merge Algorithm
• create runs phase
  – Load in M consecutive blocks of the relation (M is the number of blocks that will fit easily in main memory)
  – Use some internal sorting algorithm to sort the tuples in those M blocks
  – Write the sorted run to disk
  – Continue with the next M blocks, et cetera, until finished
• merge runs phase (assuming that the number of runs, N, is less than M)
  – load the first block of each run into memory
  – grab the lowest-valued first tuple among all the runs and write it to an output buffer page
  – when the last tuple of a block is read, grab a new block from that run
  – when the output buffer page is full, write it to disk and start a new one
  – continue until all buffer pages are empty

Sort-merge Algorithm (2)
• Merge-runs phase (N>M)
  – operate on M runs at a time, creating runs of length M^2, and continue in multiple passes of the merge operation
• Cost of sorting: b(r) is the number of blocks occupied by relation r
  – runs phase does one read, one write on each block of r: cost 2b(r)
  – total number of runs (N): b(r)/M
  – number of passes in merge operation: 1 if N<M; otherwise log_{M-1}(b(r)/M)
  – during each pass in the merge phase we read the whole relation and write it all out again: cost 2b(r) per pass
  – total cost of sorting (runs phase plus merge passes) is therefore 2b(r)(log_{M-1}(b(r)/M) + 1)
  – if only one merge pass is required (N<M) the cost is 4b(r)
  – if M>b(r) then there is only one run (internal sorting) and the cost is b(r)

Join Operation
• Join is perhaps the most important operation combining two relations
• Algorithms computing the join efficiently are crucial to the optimization phase of query processing
• We will examine a number of algorithms for computing joins
• An important metric for estimating efficiency is the size of the result: as mentioned last class, the best strategy for complex (multi-relation) queries is to cut down the size of the intermediate results as quickly as possible.
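The two phases of the sort-merge algorithm and its cost formula can be sketched in Python as follows. This is a minimal illustration, not a real external sort: in-memory lists stand in for disk blocks, and the function names (`create_runs`, `merge_runs`, `sort_cost`) are invented for this example. The merge fan-in of M-1 is an assumption taken from the log base M-1 in the cost formula (one buffer block kept for output), and the pass count is rounded up to a whole number of passes.

```python
import heapq

def create_runs(blocks, m):
    """Runs phase: sort each group of m consecutive blocks into one sorted run."""
    runs = []
    for i in range(0, len(blocks), m):
        tuples = [t for block in blocks[i:i + m] for t in block]
        runs.append(sorted(tuples))  # stands in for any internal sort
    return runs

def merge_runs(runs, fan_in):
    """Merge phase: merge up to fan_in runs at a time until one run remains."""
    if fan_in < 2:
        raise ValueError("need at least 2 input buffers to merge")
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes

def sort_cost(b_r, m):
    """Block transfers to sort b_r blocks with m buffer blocks (assumed fan-in m-1)."""
    n_runs = -(-b_r // m)              # ceil(b_r / m)
    if n_runs <= 1:
        return b_r                     # one run: internal sort, read only
    if m < 3:
        raise ValueError("need m >= 3 to merge with fan-in m-1")
    passes = 0
    while n_runs > 1:                  # whole-number version of log_{m-1}(N)
        n_runs = -(-n_runs // (m - 1))
        passes += 1
    return 2 * b_r * (passes + 1)      # runs phase + one 2*b(r) per merge pass

# Seven two-tuple blocks, three buffer blocks: three runs, two merge passes.
blocks = [[9, 3], [7, 1], [8, 2], [6, 4], [5, 0], [11, 10], [13, 12]]
result, passes = merge_runs(create_runs(blocks, m=3), fan_in=2)
```

With 100 blocks and 11 buffer blocks there are 10 runs and a single merge pass, so `sort_cost(100, 11)` gives the 4b(r) case from the slide.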
Join Operation: Size Estimation
• 0 <= size <= n(r) * n(s) (between 0 and the size of the cartesian product)
• If A = R ∩ S is a key of R, then size <= n(s)
• If A = R ∩ S is a key of R and a foreign key of S, then size = n(s)
• If A = R ∩ S is not a key, then each value of A in R matches on average n(s)/V(A,s) tuples in S (assuming values are uniformly distributed), so n(r) tuples of R produce an estimated:
    size = n(r) * n(s)/V(A,s)
  symmetrically,
    size = n(s) * n(r)/V(A,r)
  if the two estimates differ we use the lower one:
    size = min{n(s)*n(r)/V(A,r), n(r)*n(s)/V(A,s)}

Join Methods: Nested Loop
• Simplest algorithm to compute a join: nested for loops
  – requires no indices
• tuple-oriented:
    for each tuple t1 in r do begin
      for each tuple t2 in s do begin
        join t1, t2 (if they match on the join attributes) and append the result to the output
      end
    end
• block-oriented:
    for each block b1 in r do begin
      for each block b2 in s do begin
        join b1, b2 and append the result to the output
      end
    end
• reverse inner loop
  – as above, but we alternate counting up and down in the inner loop. Why?

Cost of Nested Loop Join
• Cost depends upon the number of buffers and the replacement strategy
  – pin 1 block from the outer relation, k for the inner, with LRU replacement
    cost: b(r) + b(r)*b(s) (assuming b(s)>k)
  – pin 1 block from the outer relation, k for the inner, with MRU replacement
    cost: b(r) + b(s) + (b(s) - (k-1))*(b(r)-1) = b(r)(2-k) + k - 1 + b(r)*b(s) (assuming b(s)>k)
  – pin k blocks from the outer relation, 1 for the inner
    • read k blocks from the outer (cost k)
    • for each block of s, join 1xk blocks (cost b(s))
    • repeat with the next k blocks of r until done (repeated b(r)/k times)
    cost: (k+b(s)) * b(r)/k = b(r) + b(r)*b(s)/k
  – which relation should be the outer one?
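To make the size estimate and the three buffering strategies concrete, the formulas above can be written as small Python functions and compared on sample numbers. This is only a sketch: the function names and the sample block counts are invented for illustration, and the third strategy rounds b(r)/k up to a whole number of chunks, which the slide formula does not state.

```python
import math

def join_size_estimate(n_r, n_s, v_a_r, v_a_s):
    """Lower of the two estimates n(r)*n(s)/V(A,s) and n(s)*n(r)/V(A,r)."""
    return min(n_r * n_s / v_a_s, n_s * n_r / v_a_r)

def cost_inner_lru(b_r, b_s):
    """1 outer block pinned, k inner blocks, LRU: the inner is re-read every time."""
    return b_r + b_r * b_s

def cost_inner_mru(b_r, b_s, k):
    """1 outer block pinned, k inner blocks, MRU: k-1 inner blocks are reused."""
    return b_r + b_s + (b_s - (k - 1)) * (b_r - 1)

def cost_outer_chunks(b_r, b_s, k):
    """k outer blocks pinned, 1 inner block: scan s once per chunk of k blocks."""
    chunks = math.ceil(b_r / k)    # assumption: a partial last chunk costs a full scan
    return b_r + chunks * b_s

# Hypothetical sizes: r has 100 blocks, s has 400 blocks, k = 10 buffers.
b_r, b_s, k = 100, 400, 10
costs = {
    "inner LRU": cost_inner_lru(b_r, b_s),                  # 40100
    "inner MRU": cost_inner_mru(b_r, b_s, k),               # 39209
    "outer chunks, r outer": cost_outer_chunks(b_r, b_s, k),  # 4100
    "outer chunks, s outer": cost_outer_chunks(b_s, b_r, k),  # 4400
}
```

On these sample numbers, pinning k blocks of the outer relation is roughly ten times cheaper than either inner-buffering strategy, and putting the smaller relation on the outside gives the lower cost, which suggests an answer to the question that closes the slide.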
Join Methods: Sort-Merge Join
• Two phases:
  – Sort both relations on the join attributes
  – Merge the sorted relations
    sort R on joining attribute
    sort S on joining attribute
    merge (sorted-R, sorted-S)
• cost with k buffers:
  – b(r) (2 log(b(r)/k) + 1) to sort R
  – b(s) (2 log(b(s)/k) + 1) to sort S
  – b(r) + b(s) to merge
  – total: b(r) (2 log(b(r)/k) + 1) + b(s) (2 log(b(s)/k) + 1) + b(r) + b(s)

Join Methods: Hash Join
• Two phases:
  – Hash both relations into hashed partitions
  – Bucket-wise join: join tuples of the same partitions only
    Hash R on joining attribute into H(R) buckets
    Hash S on joining attribute into H(S) buckets
    nested-loop join of corresponding buckets
• cost (assuming pairwise buckets fit in the buffers)
  – 2b(r) to hash R (read and write)
  – 2b(s) to hash S (as above)
  – b(r) + b(s) to join corresponding buckets
  – total: 3(b(r) + b(s))

Join Methods: Indexed Join
• Inner relation has an index (clustering or not):
    for each block b(r) in R do begin
      for each tuple t in b(r) do begin
        search the index on S with the value t.A of the joining attribute
        and join with the resulting tuples of S
      end
    end
• cost = b(r) + n(r) * cost(select(S.A=c))
  where cost(select(S.A=c)) is as described before for indexed selection
  – What if R is sorted on A? (hint: use V(A,r) in the above)

3-way Join
Suppose we want to compute R(A,B) |X| S(B,C) |X| T(C,D)
• 1st method: pairwise.
  – First compute temp(A,B,C) = R(A,B) |X| S(B,C)
    cost: b(r) + b(r)*b(s)
    size of temp: b(temp) = n(r)*n(s)/(V(B,s)/f(r+s))
  – then compute result = temp(A,B,C) |X| T(C,D)
    cost: b(t) + b(t)*b(temp)
• 2nd method: scan S and do simultaneous selections on R and T
  – cost = b(s) + b(s) * (b(r) + b(t))
  – if R and T are indexed we could do the selections through the indices
    cost = b(s) + n(s) * [cost(select(R.B=S.B)) + cost(select(T.C=S.C))]
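The two phases of the hash join described earlier (partition both relations, then join only matching buckets) can be sketched in Python as follows. This is an in-memory illustration under stated assumptions: tuples are plain Python tuples, both relations are hashed with the same function into the same number of buckets, and the names `partition`, `hash_join`, and `hash_join_cost` are invented for this example. The cost function simply restates the 3(b(r) + b(s)) figure from the slide.

```python
from collections import defaultdict

def partition(tuples, key_index, n_buckets):
    """Phase 1: hash each tuple on its join attribute into a bucket."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[hash(t[key_index]) % n_buckets].append(t)
    return buckets

def hash_join(r, s, r_key, s_key, n_buckets=4):
    """Phase 2: nested-loop join, but only within corresponding buckets."""
    r_parts = partition(r, r_key, n_buckets)
    s_parts = partition(s, s_key, n_buckets)
    out = []
    for b, r_bucket in r_parts.items():
        for t1 in r_bucket:
            for t2 in s_parts.get(b, []):
                if t1[r_key] == t2[s_key]:   # same bucket does not imply same value
                    out.append(t1 + t2)
    return out

def hash_join_cost(b_r, b_s):
    """Block transfers: read+write each relation to partition, then read both to join."""
    return 3 * (b_r + b_s)

# Hypothetical relations R(A, B) and S(B, C), joined on B.
r = [(1, "a"), (2, "b"), (3, "c")]
s = [("b", "x"), ("c", "y"), ("c", "z"), ("d", "w")]
joined = sorted(hash_join(r, s, r_key=1, s_key=0))
# joined == [(2, 'b', 'b', 'x'), (3, 'c', 'c', 'y'), (3, 'c', 'c', 'z')]
```

The equality check inside the bucket loop is still needed because different join values can hash to the same bucket.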