Query Processing
INLS 256-02, Database 1 Lecture
Reading: Ch. 18 (tell students to skim lightly).

Introductory Example:
[Print out the "database_records.example" file, three copies, one for each contestant group.]
Give each group one copy of the database records file. Assign a leader (the query execution manager). Have the leader assign 4 other students as processors (P1, P2, P3, P4). Each processor can perform ONE relational operation (once). Write the queries on the board (but leave out the values until the groups are ready to start). Have leaders raise a hand when done.

Q1: (field 27 = 'a')
Q2: (field 10 > 749850438) AND (field 10 < 749850542) AND (field 6 = 0)
Q3: (field 3 = 162) AND ((field 4 = 165) OR (field 9 = 2)) AND (field 8 = 'R')

Have them do Q1 first, then Q2, then Q3. Ask them what they learned about doing the queries. Hopefully, it will include:

Q1: the query processor must check the syntax and validity of the query statements. It should indicate an error if it cannot process the query.
Q2: searching an attribute, for equality or a range, is much faster if the file is ordered (sorted on that key). (It is also better to do the range in one pass instead of two separate passes.)
Q3: do the query parts that limit the solution set as early as possible.

In general a query passes through the following stages on its way to being answered:
1. Parse the query; check for syntax errors and validity.
2. Determine a plan to execute the query ("query optimization").
3. Execute the resulting code.
4. Return the result.

Query optimization means finding a good plan, not the single most optimal one. This is because you have to estimate parts of the problem (the solution is not exact), and determining the truly optimal plan can be very time consuming (which would negate what you are trying to accomplish, i.e. speedups).

What do we need to consider to optimize queries? (Let CLASS answer)
1. Physical storage (files on disk, in memory, etc.).
2. Access methods (structure of the files). Is the file of records ordered or unordered? Is it indexed, and how? Are there secondary indexes?
3. What are the individual components (relational operations) of the query?

What do we have to decide? The structure of how we'll execute the query components (the query tree), which algorithm we'll use to implement each relational operation, and which access methods to use. NOTE - LEAVE THIS ON CORNER OF BOARD.

Note that our choices of how we can optimize are affected by how we set up the database. Thus, our choices of which attributes have unique values (whether primary key or not) and which attributes/keys are indexed (and how) can have a big effect for a large database.

Work through example (from page 605 of the book): For every project located in Stafford, retrieve the project number, the controlling department number, and the department manager's last name.

SELECT P.PNUMBER, P.DNUM, E.LNAME
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND P.PLOCATION='Stafford';

A (default):
1: (PROJECT X DEPARTMENT)
2: ((1) X EMPLOYEE)
3: σ P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND P.PLOCATION='Stafford' (2)
4: π P.PNUMBER, P.DNUM, E.LNAME (3)

B (better):
1: σ P.PLOCATION='Stafford' (PROJECT)
2: (1) |X| P.DNUM=D.DNUMBER DEPARTMENT
3: (2) |X| D.MGRSSN=E.SSN EMPLOYEE
4: π P.PNUMBER, P.DNUM, E.LNAME (3)

DRAW PICTURES OF THE QUERY TREES TO THE SIDE (introduce the term query tree). Book page 605.

Explanation: In the first case we generate Cartesian products of the tables. Thus we get size(PROJECT) * size(DEPARTMENT) * size(EMPLOYEE) rows before we select out the Stafford-based projects and then project the desired attributes. In the "better" case we limit first by selecting the Stafford-located projects (which yields two entries), which are joined with DEPARTMENT (a table of size 2), then joined with EMPLOYEE (a wider table, still of size 2), and then we project.

<CLASS> What if the sizes of my tables were PROJECT=100, DEPARTMENT=20, EMPLOYEE=5000, with record sizes of 100, 50, and 150 bytes? How big a table would result from the default method when all three were Cartesian-producted together?

100 X 20 X 5000 X (100+50+150) = 10 million rows X 300 bytes = 3 GB (my computer's memory can't hold that!)

With (B), if only two projects are in Stafford, then still only two records of size 300 bytes would be required!
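To make the size arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The numbers are just the class example above (100 projects, 20 departments, 5000 employees, record sizes 100/50/150 bytes, 2 Stafford projects); the functions are illustrative, not part of a real optimizer.

    # Rough estimate of intermediate-result sizes for the two plans above.
    PROJECT_ROWS, PROJECT_BYTES = 100, 100
    DEPT_ROWS, DEPT_BYTES = 20, 50
    EMP_ROWS, EMP_BYTES = 5000, 150
    STAFFORD_PROJECTS = 2          # assumed number of projects with PLOCATION='Stafford'

    def plan_a_intermediate_bytes():
        """Plan A: Cartesian product of all three tables, then select, then project."""
        rows = PROJECT_ROWS * DEPT_ROWS * EMP_ROWS
        width = PROJECT_BYTES + DEPT_BYTES + EMP_BYTES
        return rows * width

    def plan_b_intermediate_bytes():
        """Plan B: select the Stafford projects first, then join DEPARTMENT, then EMPLOYEE."""
        # Each Stafford project joins with exactly one department and one manager,
        # so the row count stays at the number of Stafford projects.
        rows = STAFFORD_PROJECTS
        width = PROJECT_BYTES + DEPT_BYTES + EMP_BYTES
        return rows * width

    print(plan_a_intermediate_bytes())   # 3,000,000,000 bytes, about 3 GB
    print(plan_b_intermediate_bytes())   # 600 bytes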
The main thought behind query optimization: keep your intermediate results as small as possible. Smaller means faster computations and less time to store intermediate results, especially if an intermediate result is so large that it has to go to disk instead of staying in memory!

To do this, the rules of thumb (heuristics) are:
1. Do selects and projects as early as possible to limit the size of the tables.
2. Do the most restrictive operations (e.g. field 7 = 'R') as early as possible.
3. Do joins intelligently (i.e. not as Cartesian products followed by selects).
4. Do joins as late as possible.
5. Replace combinations of operations with equivalent ones that are faster to compute. The book has a comprehensive list of equivalence rules (page 611).

There is much more to it, but this gives you a good sense. (We'll come back to cost-based optimization later.) First, let's talk about how we can optimize the individual operations, i.e. we must know how efficient the relational operations are if we want to replace one relational operation with another that is faster to compute.

Select

Remember that the select command chooses rows out of a relation on the basis of a condition. For a simple select, with one condition, there are several ways of getting the result set. REMIND THEM OF <EXAMPLE.DATABASE.FILE.RANDOMIZED>: what was the difference between the group sorted on field 10 vs. the group sorted on field 1 when finding a range of values of attribute 10? This is why it's nice to have an ordered file.

Unordered File
1. Brute force linear search. This is the least preferred, and slowest, because you examine every record, and the value of the specified attribute, to see whether it matches the condition: n record accesses if there is no match, about n/2 on average if there is one.

Ordered File
2. Binary search. If the condition specifies the attribute by which the table is ordered, then you can do a binary search to find the record: O(log2 n).

Indexed File
3. Primary index or hash key. This will find a record with a unique value.
4. Primary index for multiple records. Find the indexed record, and then fetch the subsequent or previous ones, e.g. when the condition specifies a value >= some value (range query: 30K < salary < 45K).
5. Cluster index, to retrieve all records with that (non-unique) value.

Secondary Index
6. Secondary index, such as a B+-tree, on other attributes.

When there is more than one condition, joined by AND or OR, you want to minimize the amount of work that must be done, if at all possible.

Conjuncts (AND): When multiple conditions are joined by AND, try to search on an index or ordered list, not by brute force. This allows you to quickly narrow the scope without retrieving the entire database. (A short sketch follows this list.)
1. If one attribute in a condition has an access path available, use it to collect the set, then apply the rest of the conditions to that set.
2. If two or more attributes in a condition have a composite index set up, use it.
3. If the secondary indexes have record pointers for all the records (rather than just the blocks in which they occur), take the intersection of the pointers, rather than the intersection of the records.
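A minimal Python sketch of strategy 1 on Q2 from the intro example, assuming the file is kept sorted on field 10 so the range can be found with a binary search, after which the remaining condition is applied only to that candidate set. The tuple record layout and the bisect-based lookup are illustrative assumptions, not any particular DBMS.

    from bisect import bisect_left, bisect_right

    def conjunctive_select(records_sorted_on_f10, low, high, extra_condition):
        """Q2-style select: use the ordering on field 10 to grab the range,
        then apply the remaining condition only to those candidates."""
        # A real system would binary-search the file or index directly;
        # materializing the key list here just keeps the sketch short.
        keys = [r[10] for r in records_sorted_on_f10]
        start = bisect_right(keys, low)    # first record with field 10 > low
        end = bisect_left(keys, high)      # first record with field 10 >= high
        candidates = records_sorted_on_f10[start:end]
        return [r for r in candidates if extra_condition(r)]

    # Example: Q2 = (field 10 > 749850438) AND (field 10 < 749850542) AND (field 6 = 0)
    # result = conjunctive_select(records, 749850438, 749850542, lambda r: r[6] == 0)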
If more than one option is available, or you are forced to use brute force, you want to perform the select that will return the smallest set first, and then apply the remaining conditions to that set, so you handle the smallest possible number of records. In other words, you want to apply the condition with the smallest selectivity (the smallest fraction of matching records) first. For example, for the condition "gender = male" in an average employee database, you might expect about 50% of the records to fit the condition. The condition "age = n" might have a return rate of about 1-5%. The second condition is more selective, so if you were looking for all males of age n, you would apply the age condition first.

CLASS QUESTION EXAMPLE: retrieve male employees making more than 50K who are over 30 years old (salary > 50K AND gender = male AND age > 30).
a) Which should we do first? (salary > 50K from our table; this is only 1/8th of the population.)
b) How can we speed it up? (Secondary indexes on salary, age, gender.)

Disjuncts (OR): When multiple conditions are joined by OR. <ASK CLASS HOW YOU CAN OPTIMIZE> There is less that you can do with disjuncts. You cannot cull by the smallest set involved like with ANDs, since you have to combine all members of each set. You can still optimize access to the attributes; however, if any of the attributes does not have an access path, then you must use the brute force method on it - there is no way around it. If access paths exist for all the attributes, you can take the union of those sets and remove duplicates. If you can get at the record pointers, you can take the union of those, rather than of the whole records.

Join

(A sketch of approaches 1 and 4 follows below.)

1. Nested-loop join. Neither table is sorted/indexed. Brute force, using a nested loop: for each record in one table, loop through each record in the other table, and see which ones match the join condition.
2. Single-loop join. One of the two is indexed/sorted. If an index or hash key exists for ONE of the two join attributes (say attribute B of relation S), then retrieve each record t of relation R, one at a time (a single loop), and use the access structure to retrieve the matching records s from S that satisfy s[B] = t[A]. (This is why you generally index FKs: they are usually there because of a relationship, e.g. WORKS_ON, so you would expect to see join operations on that table.)
3. Sort-merge join. If the tables are sorted on the join attributes, then you can scan each in order, matching up values as you go. If there is one match, output it. If there is no match, skip it. If there is more than one match, output all of them before you move on to the next record. (Very efficient.)
4. Hash join. Use a hash function on the join attribute to hash the records of both files into the same hash file. Records with the same join value will then land in the same bucket.

Note: the above descriptions are generalities based on retrieval of records. In real life, performance is greatly affected by the actual retrieval (disk block size), memory (caching, main memory size), pipelining, multiple processors, etc.
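A minimal Python sketch of approaches 1 and 4, working on in-memory lists of tuples. Block-at-a-time I/O, which dominates real cost, is ignored; the record layouts and key positions in the usage comment are illustrative assumptions.

    def nested_loop_join(outer, inner, outer_key, inner_key):
        """Approach 1: for each outer record, scan every inner record."""
        result = []
        for t in outer:
            for s in inner:
                if t[outer_key] == s[inner_key]:
                    result.append(t + s)
        return result

    def hash_join(r, s, r_key, s_key):
        """Approach 4: hash both inputs on the join attribute; matching
        records end up in the same bucket."""
        buckets = {}
        for rec in r:                       # build phase on the first input
            buckets.setdefault(rec[r_key], []).append(rec)
        result = []
        for rec in s:                       # probe phase on the second input
            for match in buckets.get(rec[s_key], []):
                result.append(match + rec)
        return result

    # Example (illustrative schemas): department = (dnumber, mgrssn), employee = (ssn, lname)
    # nested_loop_join(departments, employees, outer_key=1, inner_key=0)
    # hash_join(departments, employees, r_key=1, s_key=0)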
Two reasonably general rules:

1. Use the smaller table as the outer loop (nested-loop joins). It matters which table is in the outer loop: the outer-loop records will be read through once, while the inner-loop records will be read through once for each outer record. So you want the smaller table to be the outer loop, and the larger table to be the inner loop, to give the smallest number of reads.

Example: Table 1 has 100 records. Table 2 has 5 records.
Record accesses = (outer-loop records read) + (outer-loop records * inner-loop records)
Table 1 as outer table: 100 + (100 * 5) = 600 record accesses
Table 2 as outer table: 5 + (5 * 100) = 505 record accesses

2. The file that has the highest likelihood of matches (the join selection factor) should be on the outside. If table A is likely to have a higher percentage of records that match table B than table B is (matching to table A), go through table A looking for matches rather than the other way around. You'll be more efficient because your searches are mostly successful: you will have fewer searches of table B with no result. In essence, you are using table A as the outer loop. [Example description in the book, p. 598, using managers and employees: managers go in the outer loop, since their join selection factor is 1 (100% of managers are employees, while only about 0.01% of employees are managers).]

Project

Project is straightforward in that you just list out part of each record (the requested attributes). It is complicated only in that you may have to eliminate duplicate tuples. If the projected attributes include the key attribute, or another unique identifier, then all you have to do is the projection. If not, then there is the possibility of producing duplicate rows, which then have to be removed.

REVIEW (FROM CORNER OF BOARD):

What do we need to consider to optimize queries?
1. Physical storage (files on disk, in memory, etc.).
2. Access methods (structure of the files). Is the file of records ordered or unordered? Is it indexed, and how? Are there secondary indexes?
3. What are the individual components (relational operations) of the query?

What do we have to decide? The structure of how we'll execute the query components (the query tree), which algorithm we'll use to implement each relational operation, and which access methods to use.

Things we choose:
Hardware capabilities: e.g. more memory, so that indexes and larger relations will fit entirely in memory.
Access methods: provide indexes on the attributes/keys commonly used in queries.
Query structure and execution: optimize query execution and structure based on
1. Rules (re-order the execution of operations, or the composition of the query, to speed it up)
2. Cost estimates (estimate cost functions and minimize them to choose the best plan)

1. Rules (Heuristics)

ASK CLASS to recall them... The rules of thumb (heuristics) are:
1. Do selects and projects as early as possible to limit the size of the tables.
2. Do the most restrictive operations (e.g. field 7 = 'R') as early as possible.
3. Do joins intelligently (i.e. not as Cartesian products followed by selects).
4. Do joins as late as possible.
5. Replace combinations of operations with equivalent ones. The book has a comprehensive list of equivalence rules (page 611).

A query tree can be used as a method for modeling query transformations. A query transformation is an alternate form of the query that will produce the same result. (Recall that you could do relational algebra operations in any order - this is a way of visualizing what is going on.) A small sketch of one such transformation follows below.

<slides of example; on web> <Have students do efficient and inefficient versions before looking at slides.>
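A minimal Python sketch of a query tree with one heuristic transformation (pushing a select below a join, i.e. "do selects as early as possible"). The node classes and the push_select_down rule are illustrative assumptions for the lecture, not a real optimizer.

    from dataclasses import dataclass

    @dataclass
    class Relation:
        name: str
        attrs: frozenset

    @dataclass
    class Select:
        cond_attr: str          # attribute the selection condition refers to
        cond: str               # printable condition, e.g. "PLOCATION='Stafford'"
        child: object

    @dataclass
    class Join:
        cond: str
        left: object
        right: object

    def attrs_of(node):
        """All attributes available below a node."""
        if isinstance(node, Relation):
            return node.attrs
        if isinstance(node, Select):
            return attrs_of(node.child)
        return attrs_of(node.left) | attrs_of(node.right)

    def push_select_down(node):
        """Move a select below a join when its condition only mentions
        attributes from one side of the join."""
        if isinstance(node, Select) and isinstance(node.child, Join):
            join = node.child
            pushed = Select(node.cond_attr, node.cond, None)
            if node.cond_attr in attrs_of(join.left):
                pushed.child = join.left
                return Join(join.cond, push_select_down(pushed), join.right)
            if node.cond_attr in attrs_of(join.right):
                pushed.child = join.right
                return Join(join.cond, join.left, push_select_down(pushed))
        return node

    # Example: a select on PLOCATION sitting above PROJECT |X| DEPARTMENT
    project = Relation("PROJECT", frozenset({"PNUMBER", "DNUM", "PLOCATION"}))
    dept = Relation("DEPARTMENT", frozenset({"DNUMBER", "MGRSSN"}))
    tree = Select("PLOCATION", "PLOCATION='Stafford'", Join("DNUM=DNUMBER", project, dept))
    print(push_select_down(tree))   # the select now applies directly to PROJECT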
2. Cost Estimates

Another way to find the best query-processing strategy is to estimate the cost of different strategies. This is best done ahead of time, since the estimates themselves take time to compute. Also remember that these are just estimates, and may not hold for all queries of a particular form.

Things to consider in calculating cost:
1. Access to secondary storage - the base files, large temporary files, large indexes.
2. Temporary storage space for intermediate tables.
3. Computation time - search, sort, merge.
4. Communication time (e.g. from memory to screen).

Frequently, only the first is considered - just count the estimated number of disk accesses. The other factors are less significant, especially as most smaller databases can be stored entirely in memory, and they are also more affected by differences in computer architecture (caching, pipelining, etc.). (A sketch of a disk-access estimate of this kind appears at the end of these notes.)

To calculate costs, you need information such as:
1. the number of levels of the indexes
2. the number of records in the files
3. the average record size
4. the number of distinct values for an attribute

Some of this information may change frequently, so you need to balance the advantages and cost of maintaining it against the advantages and costs of not having it for optimization. (You can run automated processes to calculate up-to-date values, as well as "hints", like ¼ of the genders are male instead of an expected value like ½.)

Cost estimates are used more extensively in designing distributed databases and their query strategies, since network communication has a more significant effect. The cost-estimate approach seems to be winning out for large databases (e.g. Oracle is phasing out rule-based optimization in favor of the cost-based approach).

Semantic Query Optimization

This starts approaching intelligent search - using knowledge of the world, functional dependencies, and other constraints to narrow down the search space. Mostly a research topic at present. Likely to become more common with active databases that have more constraints specified.
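As flagged in the cost-estimates section above, here is a rough Python sketch of the kind of disk-access estimate a cost-based optimizer might make for a single-attribute select: a full scan versus a B+-tree index lookup. The formulas are standard textbook approximations (uniform value distribution, each matching record on a different block), and the numbers in the usage example are made-up illustrations.

    import math

    def scan_cost(num_blocks):
        """Full (linear) scan: read every block of the file."""
        return num_blocks

    def index_equality_cost(index_levels, num_records, distinct_values, is_key=False):
        """B+-tree lookup for an equality condition: walk the index levels,
        then fetch the matching records."""
        if is_key:
            return index_levels + 1
        selection_cardinality = math.ceil(num_records / distinct_values)
        return index_levels + selection_cardinality

    # Illustrative numbers: 10,000 records in 2,000 blocks, a 2-level index
    # on a department-number attribute with 125 distinct values.
    print(scan_cost(2000))                                   # 2000 block accesses
    print(index_equality_cost(2, 10000, 125, is_key=False))  # 2 + 80 = 82 block accesses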