• Multiple-Key Access
– There are good and bad ways to run queries on multiple single keys
• Indices on Multiple Attributes
– Combining two keys into a single concatenated attribute
– Grid files
• crude array with one dimension and linear scale for each attribute
• more than one cell may point to a given bucket of values
• array grid may be dynamically resized during use
– Other alternatives are spatial databases: R-tree, quad-trees, k-d tree
• Bitmap Indices
– linear array of bits: bit j is set if tuple j has the attribute that this bitmap tracks (e.g.,
“Moonroof”: bit j is 1 if record j is a car with moonroof)
– Queries are answered by combining several bitmaps using and, or, not
• HW #4: due Thursday April 24 (next class)
– Questions: 12.11, 12.12, 12.13, 12.16
• No HW for next week
• Today:
– Start Chapter 13: Query Processing
• SQL is good for humans, but not as an internal (machine) representation of how to calculate a result
• Processing an SQL (or other) query requires these steps:
– parsing and translation
• turning the query into a useful internal representation in the extended relational algebra
– optimization
• manipulating the relational algebra query into the most efficient form
(one that gets results the fastest)
– evaluation
• actually computing the results of the query
1. parsing and translation
– details of parsing are covered in other places (texts and courses on compilers). We’ve already covered SQL and relational algebra; translating between the two should be relatively familiar ground
2. optimization
– This is the meat of chapter 13. How to figure out which plan, among many, is the best way to execute a query
3. evaluation
– actually computing the results of the query is mostly mechanical
(doesn’t require much cleverness) once a good plan is in place.
• Initial query: select balance from account where balance<2500
• Two different relational algebra expressions could represent this query:
– sel balance<2500
(Pro balance
(account))
– Pro balance
( sel balance<2500
(account))
• which choice is better? It depends upon metadata (data about the data) and what indices are available for use on these operations.
•
Cost parameters (some are easy to maintain; some are very hard -
- this is statistical info maintained in the system’s catalog)
– n(r ): number of tuples in relation r
– b(r ): number of disk blocks containing tuples of relation r
– s(r ): average size of a tuple of relation r
– f(r ): blocking factor of r: how many tuples fit in a disk block
– V(A,r): number of distinct values of attribute A in r.
(V(A,r)=n(r ) if
A is a candidate key)
– SC(A,r): average selectivity cardinality factor for attribute A of r.
Equivalent to n(r )/V(A,r). (1 if A is a key)
– min(A,r): minimum value of attribute A in r
– max(A,r): maximum value of attribute A in r
• Cost parameters are used in two important computations:
– I/O cost of an operation
– the size of the result
• In the following examination we’ll find it useful to differentiate three important operations:
– Selection (search) for equality (R.A1=c)
– Selection (search) for inequality (R.A1>c) (range queries)
– Projection on attribute A1
• Selection (search) for equality (R.A1=c)
– cost (sequential search on a sorted relation) = b(r )/2 b(r )/2 + SC(A1,r) -1 average unsuccessful average successful
– cost (binary search on a sorted relation) = log b(r ) average unsuccessful log b(r ) + SC(A1,r) -1 average successful
– size of the result n(select(R.A1=c)) =
SC(A1,r) = n(r )/V(A1,r)
• Selection (search) for inequality (R.A1>c)
– cost (file unsorted) = b(r )
– cost (file sorted on A1) = b(r )/2 + b(r )/2 (if we assume that half the tuples qualify) b(r ) in general
(regardless of the number of tuples that qualify. Why?)
– size of the result = depends upon the query; unpredictable
• Projection on attribute A1
– cost = b(r )
– size of the result n(Pro(R,A1)) =
V(A1,r)
Primary Index on key: cost = (height+1) unsuccessful cost = (height+1) +1 successful
Primary (clustering) Index on non-key: cost = (height+1) + SC(A1,r)/f(r ) all tuples with the same value are clustered
Secondary Index cost = (height+1) + SC(A1,r) tuples with the same value are scattered
Primary Index on key: search for first value and then pick tuples >= value cost = (height+1) +1+ size of the result
(in disk pages )
= height+2 + n(r ) * (max(A,r)c)/(max(A,r)-min(A,r))/f(r )
Primary (clustering) Index on non-key: cost as above (all tuples with the same value are clustered)
Secondary (non-clustering) Index cost = (height+1) +B-treeLeaves/2
+ size of result (in tuples )
= height+1 + B-treeLeaves/2 + n(r ) *
(max(A,r)-c)/(max(A,r)-min(A,r))
• Conjunction (select where theta1 and theta2)
(s1 = # of tuples satisfying selection condition theta1) combined SC = (s1/n(r )) * (s2/n(r )) = s1*s2/n(r ) 2 assuming independence of predicates
• Disjunction (select where theta1 or theta2) combined SC = 1 - (1 - s1/n(r )) * (1 - s2/n(r ))
= s1/n(r )) + s2/n(r ) - s1*s2/n(r ) 2
• Negation (select where not theta1) n(! Theta1) = n(r ) - n(Theta1)
GOAL: apply the most restrictive condition first and combined use of multiple indices to reduce the intermediate results as early as possible
• Why? No index will be available on intermediate results!
• Conjunctive selection using one index B:
– select using B and then apply remaining predicates on intermediate results
• Conjunctive selection using a composite key index (R.A1, R.A2):
– create a composite key or range from the query values and search directly
(range search on the first attribute (MSB of the composite key) only)
• Conjunctive selection using two indices B1 and B2:
– search each separately and intersect the tuple identifiers (TIDs)
• Disjunctive selection using two indices B1 and B2:
– search each separately and union the tuple identifiers (TIDs)