Query Processing
INLS 256-02, Database
1 Lecture
Reading: Ch. 18 (but tell students to skim lightly).
Introductory Example:
[Print out “database_records.example” file, three copies, one for each contestant group].
Give each group a copy of the database. Assign a leader (query execution manager). Have the
leader assign 4 other students as processors (P1, P2, P3, P4). Each processor can perform ONE
relational operation (once). Give each team 1 copy of the database records file. On the board,
write in the queries (but leave out the values until ready for them to start). Leader raises hand
when done.
Q1: (field 27 = ‘a’)
Q2: (field 10 > 749850438) AND (field 10 < 749850542) AND (field 6 = 0)
Q3: (field 3 = 162) AND ((field 4 = 165) OR (field 9 = 2)) AND (field 8 = ‘R’)
Have them do Q1 first, then Q2, then Q3. Ask them what they learned about doing the
queries. Hopefully, the answers will include:
Q1: the query processor must check the syntax and validity of the query statements. It
should indicate an error if it cannot process the query.
Q2: searching an attribute, for equality or a range, is much faster if the file is ordered (sorted
by that key). (Better to do the range in 1 pass, instead of 2 separate passes.)
Q3: do the query parts that limit the solution set as early as possible.
In general a query passes through the following stages on its way to being answered:
1. parse the query, check for syntax errors, validity.
2. determine a plan to execute the query ("query optimization")
3. execute the resulting code
4. return the result
Query Optimization == find a good solution, not necessarily the single best one. This is
because you have to estimate parts of the problem (the solution is not exact), and determining
the truly optimal solution can be very time consuming (which would negate what you’re
trying to accomplish, i.e. speedups).
What do we need to consider to optimize queries? (Let CLASS answer)
1. physical storage (files on disk, in memory, etc)
2. access methods (structure of files). Is the file of records ordered, unordered? Is it
indexed, and how? Are there secondary indexes?
3. What are the individual components (relational operations) of the query
What do we have to decide? The structure of how we’ll execute the query components
(the query tree), which algorithm we’ll use to implement each relational operation, and
which access methods to use. NOTE: LEAVE THIS ON CORNER OF BOARD
Note our choices of how we can optimize are affected by how we set up the database.
Thus, our choices of what attributes have unique values (whether primary key or not) and
what attributes/keys are indexed (and how) can have a big effect for a large database.
Work through example (from page 605 of the book): For every project located in Stafford,
retrieve the project number, the controlling department number, and the department manager’s last name.
SELECT P.PNUMBER, P.DNUM, E.LNAME
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND
      D.MGRSSN=E.SSN AND
      P.PLOCATION='Stafford';
A (default): 1: (PROJECT X DEPARTMENT)
2: ((1) X EMPLOYEE)
3: σ P.DNUM=D.DNUMBER, D.MGRSSN=E.SSN, P.PLOCATION='Stafford' (2)
4: π P.PNUMBER, P.DNUM, E.LNAME (3)
B (Better):
1: σ P.PLOCATION='Stafford' (PROJECT)
2: (1) |X| P.DNUM=D.DNUMBER DEPARTMENT
3: (2) |X| D.MGRSSN=E.SSN EMPLOYEE
4: π P.PNUMBER, P.DNUM, E.LNAME (3)
DRAW PICTURES TO SIDE, OF QUERY TREES (introduce term query tree)
Book page 605.
Explanation: In the first case we generate Cartesian products of the tables. Thus we get
size(Employee) * size(Dept) * size(Project) rows before we select out the Stafford-based
projects, and then project the desired attributes. In the “better” case we limit first by selecting
the Stafford locations (yields two entries), which are then joined with Dept (result of size
2), then joined with Employee (a wider table, still of size 2), and then we project.
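To make the contrast concrete, here is a minimal Python sketch, with invented toy rows (not the book’s data) and simplistic in-memory operators, that runs both plans and prints the size of each intermediate result:

# Toy in-memory tables (invented rows; attribute names follow the SQL above).
PROJECT = [
    {"PNUMBER": 10, "PNAME": "ProductX", "PLOCATION": "Stafford", "DNUM": 4},
    {"PNUMBER": 20, "PNAME": "ProductY", "PLOCATION": "Houston",  "DNUM": 5},
    {"PNUMBER": 30, "PNAME": "ProductZ", "PLOCATION": "Stafford", "DNUM": 4},
]
DEPARTMENT = [
    {"DNUMBER": 4, "MGRSSN": "111"},
    {"DNUMBER": 5, "MGRSSN": "222"},
]
EMPLOYEE = [
    {"SSN": "111", "LNAME": "Wallace"},
    {"SSN": "222", "LNAME": "Borg"},
    {"SSN": "333", "LNAME": "Smith"},
]

def cartesian(r, s):
    """Cartesian product: every pair of rows, with their columns merged."""
    return [{**a, **b} for a in r for b in s]

def select(rows, pred):
    """Selection (sigma): keep only rows satisfying the predicate."""
    return [row for row in rows if pred(row)]

def project(rows, attrs):
    """Projection (pi): keep only the listed attributes."""
    return [{a: row[a] for a in attrs} for row in rows]

# Plan A: Cartesian products first, then select, then project.
step1 = cartesian(PROJECT, DEPARTMENT)
step2 = cartesian(step1, EMPLOYEE)
step3 = select(step2, lambda r: r["DNUM"] == r["DNUMBER"]
                            and r["MGRSSN"] == r["SSN"]
                            and r["PLOCATION"] == "Stafford")
plan_a = project(step3, ["PNUMBER", "DNUM", "LNAME"])
print("Plan A intermediate sizes:", len(step1), len(step2), len(step3))

# Plan B: select the Stafford projects first, then join on the matching keys.
b1 = select(PROJECT, lambda r: r["PLOCATION"] == "Stafford")
b2 = select(cartesian(b1, DEPARTMENT), lambda r: r["DNUM"] == r["DNUMBER"])
b3 = select(cartesian(b2, EMPLOYEE),   lambda r: r["MGRSSN"] == r["SSN"])
plan_b = project(b3, ["PNUMBER", "DNUM", "LNAME"])
print("Plan B intermediate sizes:", len(b1), len(b2), len(b3))

assert plan_a == plan_b   # same answer, very different amounts of intermediate work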
<CLASS> what if the sizes of my tables were Proj=100, Dept=20, Employee=5000.
With record sizes of 100, 50, 150. How big of a table would result from the default
method when all three were Cartesian producted together?
100 X 20 X 5000 = 10,000,000 rows
each row is 100 + 50 + 150 = 300 bytes
10,000,000 X 300 bytes = 3 GB (my computer’s memory can’t hold that!)
With (B), if only two projects are in Stafford, then only two records of size 300 bytes
would be required!
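The same back-of-the-envelope arithmetic as a few lines of Python (the 2-project figure is the assumption stated above):

# Sizes from the class question: rows per table and bytes per record.
proj_rows, dept_rows, emp_rows = 100, 20, 5000
proj_bytes, dept_bytes, emp_bytes = 100, 50, 150

# Default plan: Cartesian product of all three tables.
rows = proj_rows * dept_rows * emp_rows               # 10,000,000 rows
bytes_per_row = proj_bytes + dept_bytes + emp_bytes   # 300 bytes
print(rows, "rows x", bytes_per_row, "bytes =", rows * bytes_per_row / 1e9, "GB")

# Plan B: select the (assumed) 2 Stafford projects first.
stafford_projects = 2
print("Plan B keeps roughly", stafford_projects * bytes_per_row, "bytes of intermediate data")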
Main thought behind query optimization:
Keep your intermediate results as small as possible (smaller means faster computations
and less time to store intermediate results, especially if an intermediate result is so
large that it has to go to disk instead of memory)!
To do this, rules of thumb (heuristics) are
1. do selects and projects as early as possible to limit size of tables
2. do the most restrictive operations (field 7 = “R”) as early as possible.
3. do joins intelligently (i.e. not as Cartesian products followed by selects)
4. do joins as late as possible
5. replace combinations of queries with equivalent ones that are faster to compute.
Book has comprehensive list of equivalence rules. (page 611).
There is much more to it, but this gives you a good sense. (We’ll come back to cost-based
optimization later.) First, let’s talk about how we can implement the individual relational
operations efficiently, i.e. we must know how efficient the relational operations are if we
want to choose to replace one with another that is faster to compute.
Select
Remember that the select command chooses rows out of a relation on the basis of a
condition. For a simple select, with one condition, there are several ways of getting the
set.
REMIND THEM OF <EXAMPLE.DATABASE.FILE.RANDOMIZED>
What was the difference between the group sorted on attribute 10 vs the group sorted on attribute 1
when finding a range of values of attribute 10? This is why it’s nice to have an ordered file.
Unordered File
1. Brute force linear search. This is the least preferred, and slowest, because you
examine every record, and the value of the specified attribute, to see if it matches
the condition. Roughly n record accesses if there is no match, about n/2 on average if there is one.
Ordered File
2. Binary search. If the condition specifies the attribute by which the table is ordered, then
you can do a binary search to find the record. O(log2 n). (A sketch contrasting linear and
binary search follows this list.)
Indexed File
3. Primary index or hash key. This will find a record with a unique value.
4. Primary index for multiple records - Find the indexed record, and then fetch the
subsequent or previous ones - e.g. when the condition specifies a value >= some value.
(range query: 30K < salary < 45K)
5. Clustering index, to retrieve all records with that (non-unique) value.
Secondary Index
6. Secondary index, such as a B+-tree, on other attributes.
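A minimal sketch contrasting brute-force linear search on an unordered file with binary search on a file ordered by the search key (invented records; a real DBMS searches disk blocks, not Python lists):

import bisect

# Invented records: (key, payload) pairs.
unordered = [(7, "a"), (3, "b"), (11, "c"), (5, "d"), (2, "e")]
ordered = sorted(unordered)          # the same file stored sorted on the key

def linear_search(records, key):
    """Brute force: examine every record until a match (about n accesses)."""
    for k, payload in records:
        if k == key:
            return payload
    return None

def binary_search(records, key):
    """Binary search on a file ordered by the key (about log2 n probes)."""
    keys = [k for k, _ in records]
    i = bisect.bisect_left(keys, key)
    if i < len(records) and records[i][0] == key:
        return records[i][1]
    return None

print(linear_search(unordered, 5))   # 'd'
print(binary_search(ordered, 5))     # 'd'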
When there is more than one condition joined by AND or OR, you want to minimize the
amount of work that must be done, if at all possible.
Conjuncts (AND): When multiple conditions are joined by AND, try to search on an index or
an ordered list, not by brute force. This allows you to quickly narrow the scope without
retrieving the entire database.
1. If one attribute in a condition has an access path available, use it to collect the set, then
apply the rest of the conditions to that set (see the sketch after this list).
2. If two or more attributes in a condition have a composite index set up, use it.
3. If secondary indexes have record pointers for all the records (rather than just the blocks
in which they occur), get the intersection of the pointers, rather than the intersection of
the records.
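A minimal sketch of strategies 1 and 3 (invented employee records; the “record pointers” here are just positions in a Python list, and the indexes are hand-built dictionaries):

# Hypothetical file of employee records; "record pointers" are list positions.
records = [
    {"name": "Ann",  "dept": 4, "salary": 52000},
    {"name": "Bob",  "dept": 5, "salary": 38000},
    {"name": "Cara", "dept": 4, "salary": 61000},
    {"name": "Dev",  "dept": 4, "salary": 29000},
]

# Hand-built secondary indexes: attribute value -> set of record pointers.
dept_index = {}
salary_index = {}
for ptr, rec in enumerate(records):
    dept_index.setdefault(rec["dept"], set()).add(ptr)
    salary_index.setdefault(rec["salary"], set()).add(ptr)

# Strategy 1: use the dept index to fetch candidates, then check the other condition.
candidates = [records[p] for p in sorted(dept_index.get(4, set()))]
result1 = [r for r in candidates if r["salary"] > 50000]

# Strategy 3: intersect the pointer sets from both indexes, then fetch records once.
high_salary_ptrs = set()
for sal, ptrs in salary_index.items():
    if sal > 50000:
        high_salary_ptrs |= ptrs
matching = dept_index.get(4, set()) & high_salary_ptrs
result2 = [records[p] for p in sorted(matching)]

print([r["name"] for r in result1])   # ['Ann', 'Cara']
print([r["name"] for r in result2])   # ['Ann', 'Cara']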
If more than one option is available, or you are forced to use brute force, you want to
perform the select that will return the smallest set first, and then apply the other conditions to
that set, so you handle the smallest number of records possible. In other words,
you want to apply the condition with the smallest selectivity first. For
example, for the condition "gender = male" in an average employee database, you might
expect about 50% of the records to fit the condition. The condition "age = n"
might have a return rate of about 1-5%. The second condition is more selective, so if you
were looking for all males of age n, you would want to apply the age condition first.
CLASS QUESTION: EXAMPLE: retrieve male employees making more than 50K
who are over 30 yrs old (>50K AND MALE AND >30 yrs old).
a) Which should we do first? (>50K from our table; this is only 1/8th of the
population.)
b) How can we speed it up? (Secondary indexes on salary, age, gender.)
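A tiny sketch of that ordering decision in Python (the 1/8 and 50% figures come from the discussion above; the age selectivity and the sample records are invented):

# (condition name, predicate, estimated selectivity = fraction of records it keeps)
conditions = [
    ("gender = male", lambda r: r["gender"] == "male", 0.50),   # ~half the records
    ("salary > 50K",  lambda r: r["salary"] > 50000,   0.125),  # 1/8th of the population
    ("age > 30",      lambda r: r["age"] > 30,         0.40),   # invented estimate
]

employees = [  # invented sample records
    {"gender": "male",   "salary": 62000, "age": 35},
    {"gender": "female", "salary": 71000, "age": 41},
    {"gender": "male",   "salary": 40000, "age": 29},
    {"gender": "male",   "salary": 55000, "age": 52},
]

# Apply the most selective condition (smallest estimated fraction) first,
# then apply the rest only to the shrinking candidate set.
result = employees
for name, pred, _ in sorted(conditions, key=lambda c: c[2]):
    result = [r for r in result if pred(r)]
    print(f"after {name}: {len(result)} candidates")
print(result)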
Disjuncts (OR): When multiple conditions are joined by OR. <ASK CLASS HOW YOU
CAN OPTIMIZE>. There is less that you can do with disjuncts. You cannot cull by the
smallest set involved like with ANDs, since you have to combine all members of each
set. You can still optimize access to the attributes; however, if any of the attributes does not have
an access path, then you must use the brute force method on it - there is no way around it.
If access paths exist for all attributes, you can take the union of those sets and remove
duplicates. If you can get at the record pointers, you can take the union of those, rather
than of the whole records.
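A minimal sketch of the OR case (same invented records as the conjunct sketch above): take the union of the record-pointer sets, which removes duplicates automatically, and fall back to a brute-force scan for any attribute without an access path:

# (dept = 5 OR salary > 50000); "pointers" are list positions, as before.
records = [
    {"name": "Ann",  "dept": 4, "salary": 52000},
    {"name": "Bob",  "dept": 5, "salary": 38000},
    {"name": "Cara", "dept": 4, "salary": 61000},
    {"name": "Dev",  "dept": 4, "salary": 29000},
]

dept_index = {}
for ptr, rec in enumerate(records):
    dept_index.setdefault(rec["dept"], set()).add(ptr)

# dept has an index; pretend salary does NOT, so that disjunct needs a brute-force scan.
dept_ptrs = dept_index.get(5, set())
salary_ptrs = {ptr for ptr, rec in enumerate(records) if rec["salary"] > 50000}

# Union of pointer sets: duplicates disappear because it is a set.
result = [records[p] for p in sorted(dept_ptrs | salary_ptrs)]
print([r["name"] for r in result])   # ['Ann', 'Bob', 'Cara']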
Join
1. Nested loop join. Neither table is sorted or indexed. Brute force, using a nested loop. For
each record in one table, loop through each record in the other table, and see which ones
match the join condition. (A sketch of this and of hash join follows this list.)
2. Single Loop Join. One of the two is indexed/sorted. If an index or hash key exists for
ONE of the TWO join attributes (attr B of rel S), then retrieve each record t in rel R, one
at a time (single loop), and use the access structure to retrieve matching records s from S
that satisfy s[B] = t[A]. (This is why you generally index FKs, i.e. since they are usually
there because of a relationship, e.g. like work_on, so you would expect to see join
operations on this table).
3. Sort-merge. If the tables are sorted on the join attributes, then you can search each in
order, matching up values as you go. If there is only one match, output it. If there is no
match, skip it. If there is more than one match, output all of them before you move on to
the next record. (very efficient)
4. Hash-join. Use a hash function on the join attribute to hash the records of both files into
the same hash file. Then records with the same join value end up in the same bucket, and only
records within a bucket need to be checked against each other.
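A minimal sketch of two of these, nested-loop join and hash join, on toy in-memory tables (invented rows, a single join attribute; a real DBMS does this block-at-a-time on disk):

# Toy tables to join on R.a = S.b (invented rows).
R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}, {"a": 3, "x": "r3"}]
S = [{"b": 2, "y": "s1"}, {"b": 3, "y": "s2"}, {"b": 3, "y": "s3"}]

def nested_loop_join(r, s):
    """Brute force: for each record of r, scan all of s for matches."""
    return [{**t, **u} for t in r for u in s if t["a"] == u["b"]]

def hash_join(r, s):
    """Hash every s record on its join attribute, then probe with each r record."""
    buckets = {}
    for u in s:
        buckets.setdefault(u["b"], []).append(u)
    out = []
    for t in r:
        for u in buckets.get(t["a"], []):   # only same-bucket records can match
            out.append({**t, **u})
    return out

print(nested_loop_join(R, S))
print(hash_join(R, S))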
Note: the above descriptions are generalities based on the retrieval of records. In real life,
performance is greatly affected by the actual retrieval costs (disk block size), by memory
(caching, main memory), and by pipelining, multiprocessors, etc.
Two reasonably general rules:
1. Smaller table as outer loop (nested-loop joins)
It matters which table is in the outer loop - the outer loop records will be read through
once, and the inner loop records will be read through once for each outer record. So, you
want the smaller table to be the outer loop, and the larger table to be the inner loop, to
give the smallest number of reads.
Example:
Table 1 has 100 records.
Table 2 has 5 records.
record accesses = (# of outer loop records) + (# of outer loop records * # of inner loop records)
Table 1 as outer table: 100 + (100 * 5) = 600 record accesses
Table 2 as outer table: 5 + (5 * 100) = 505 record accesses
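As a tiny sanity check, the same count as a function you can vary (a sketch, not how a real optimizer counts block accesses):

def nested_loop_record_accesses(outer_records, inner_records):
    # Read every outer record once, plus the whole inner table once per outer record.
    return outer_records + outer_records * inner_records

print(nested_loop_record_accesses(100, 5))   # Table 1 as outer: 600
print(nested_loop_record_accesses(5, 100))   # Table 2 as outer: 505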
2. File that has the highest likelihood of matches (Join Selection Factor) should be on the
outside. If table A is likely to have a higher percentage of records that match table B than
table B is (matching to table A), go through table A looking for matches, rather than the
other way around. You’ll be more efficient because your searches are mostly successful;
you will have fewer searches through table B with no result. In essence, you are using
table A as the outer loop. [Example description in book p. 598, using managers and
employees, i.e. managers as the outer loop since the Join Selection Factor = 1 (100% of managers
are employees, while it is about 0.01% the other way around).]
Project
Straightforward in that you just list out part of each record. Complicated only in that you may
have to eliminate duplicate tuples. If the projected attributes include the key attribute, or
another unique identifier, then all you have to do is the projection. If not, then there is the
possibility of producing duplicate rows, which then have to be removed.
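A minimal sketch of the duplicate-elimination case (invented rows; a real DBMS would typically sort or hash to find the duplicates):

employees = [
    {"ssn": "111", "lname": "Smith", "dept": 4},
    {"ssn": "222", "lname": "Wong",  "dept": 5},
    {"ssn": "333", "lname": "Smith", "dept": 4},
]

# Project on SSN (a key): no duplicates are possible, so just list the values.
print([{"ssn": e["ssn"]} for e in employees])

# Project on (lname, dept): duplicates can appear and must be removed.
seen = set()
projected = []
for e in employees:
    row = (e["lname"], e["dept"])
    if row not in seen:          # duplicate elimination
        seen.add(row)
        projected.append(row)
print(projected)                 # [('Smith', 4), ('Wong', 5)]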
REVIEW (FROM CORNER OF BOARD):
What do we need to consider to optimize queries?
1. physical storage (files on disk, in memory, etc).
2. access methods (structure of files). Is the file of records ordered, unordered? Is it
indexed, and how? Are there secondary indexes?
3. What are the individual components (relational operations) of the query
What do we have to decide? The structure of how we’ll execute the query components
(query tree), which algorithm we’ll use to implement the relational operation, and using
what access methods.
Things we choose:
Hardware capabilities: (i.e. more memory, so that indexes and larger relations will fit entirely in
memory)
Access methods: provide indexes on attributes/keys commonly used in queries.
Query structure and execution: optimize query execution and structure based on
1. Rules (re-order the execution of query parts, or their composition, to speed things up)
2. Cost estimates (estimate cost functions and minimize them to choose the best plan)
1. Rules (Heuristics)
ASK CLASS to remember….
The rules of thumb (heuristics) are:
1. do selects and projects as early as possible to limit size of tables
2. do the most restrictive operations (field 7 = “R”) as early as possible.
3. do joins intelligently (i.e. not as Cartesian products followed by selects)
4. do joins as late as possible
5. replace combinations of queries with equivalent ones. Book has comprehensive list
of equivalence rules. (page 611).
A query tree can be used as a method for modeling query transformations. A query
transformation is an alternate form of the query that will produce the same result.
(Recall that you could do relational algebra operations in different orders - this is a way of
visualizing what is going on.)
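If you want something more concrete than a picture, here is a hypothetical sketch (invented node format, hard-coded toy schema, not the book’s notation) of representing a query tree as nested tuples and applying one transformation, pushing a selection down below a Cartesian product:

# A query tree as nested tuples: ("select", condition_attrs, child),
# ("product", left, right), or ("table", name). Purely illustrative.
tree = ("select", {"PLOCATION"},
        ("product", ("table", "PROJECT"), ("table", "DEPARTMENT")))

def attrs_of(node):
    """Attributes available from a subtree (hard-coded toy schema for the sketch)."""
    schema = {"PROJECT": {"PNUMBER", "PLOCATION", "DNUM"},
              "DEPARTMENT": {"DNUMBER", "MGRSSN"}}
    if node[0] == "table":
        return schema[node[1]]
    if node[0] == "select":
        return attrs_of(node[2])
    return attrs_of(node[1]) | attrs_of(node[2])

def push_select_down(node):
    """If a select over a product only uses attributes of one side, move it to that side."""
    if node[0] == "select" and node[2][0] == "product":
        cond, (_, left, right) = node[1], node[2]
        if cond <= attrs_of(left):
            return ("product", ("select", cond, left), right)
        if cond <= attrs_of(right):
            return ("product", left, ("select", cond, right))
    return node

print(push_select_down(tree))
# ('product', ('select', {'PLOCATION'}, ('table', 'PROJECT')), ('table', 'DEPARTMENT'))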
<slides of example; on web>
<Have students do efficient and inefficient versions before looking at slides.>
2. Cost Estimates
Another way to find the best query processing strategy is to estimate the cost of different
strategies. This is best done ahead of time, since the estimates themselves take time to compute. Also
remember that these are just estimates, and may not hold for all queries of a particular
form. Things to consider in calculating cost:
1. Access to secondary storage - the base files, large temporary files, large indexes.
2. Temporary storage space for intermediate tables.
3. Computation time - search, sort, merge.
4. Communication time from memory to screen.
Frequently, only the first is considered - just count the estimated number of disk accesses.
The other factors are less significant, especially since most smaller databases can be stored
entirely in memory, and since these others are more affected by differences in computer
architecture (caching, pipelining, etc.).
To calculate costs, you need information such as:
1. number of levels of indexes
2. number of records in files
3. average record size
4. number of distinct values for an attribute
Some of this information may change frequently, so you need to balance the advantages
and cost of maintaining this information against the advantages and costs of not having it
for optimization. (You can run automated processes to keep the values up to date, and also supply
“hints”, e.g. that ¼ of the records are male instead of an expected value like ½.)
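A small sketch of the kind of arithmetic such estimates involve (all numbers invented; the block size and index levels are assumptions, and a uniform value distribution is assumed for the selectivity):

# Invented catalog statistics for one table and one equality condition.
num_records     = 10000
record_size     = 100      # bytes per record
block_size      = 4096     # bytes per disk block (assumed)
index_levels    = 2        # levels in a B+-tree index on the attribute (assumed)
distinct_values = 50       # number of distinct values of the attribute

blocks = num_records * record_size // block_size   # blocks holding the file
selectivity = 1 / distinct_values                  # assume values are uniformly distributed
matching_records = num_records * selectivity

# Estimated disk accesses for two strategies for "attribute = constant":
linear_scan_cost = blocks                          # read every block of the file
index_cost = index_levels + matching_records       # walk the index, then fetch matching rows

print(f"linear scan: ~{linear_scan_cost} block accesses")
print(f"index lookup: ~{index_cost:.0f} accesses")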
Cost estimates are used more extensively in designing distributed databases, and their
query strategies, as network communication has a more significant effect.
The cost estimate approach seems to be winning out for large databases (e.g. Oracle is phasing
out its rule-based optimizer in favor of the cost-based approach).
Semantic Query Optimization
This starts approaching intelligent search - using knowledge of the world, functional
dependencies, and other constraints to narrow down the search space. Mostly research for
now. Likely to become more common with active databases that have more constraints
specified.