Query Processing
Chapter 23 – Textbook 2
Lisa A Wharwood
Query Processing
• The activity of choosing an efficient execution strategy for processing a query.
• The aims of query processing are to:
• transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy expressed in a low-level language, and
• execute that strategy to retrieve the required data.
• An important aspect of query processing is query optimization.
• The aim of query optimization is to choose a strategy that minimizes resource usage.
Techniques
• Query optimization typically attempts to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query (Selinger et al., 1979).
• However, resource usage may also be viewed as the response time of the query, in which case we concentrate on maximizing the number of parallel operations (Valduriez and Gardarin, 1984).
• There are two main techniques:
• The first uses heuristic rules that order the operations in a query.
• The second compares different strategies based on their relative costs and selects the one that minimizes resource usage.
Note!
• Both methods of query optimization depend on database statistics to
evaluate properly the different options that are available.
• The accuracy and currency of these statistics have a significant
bearing on the efficiency of the execution strategy chosen.
Transformations
SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
(s.position = ‘Manager’ AND b.city = ‘London’);

σ(position=‘Manager’) ∧ (city=‘London’) ∧ (Staff.branchNo=Branch.branchNo)(Staff × Branch)

or

σ(position=‘Manager’) ∧ (city=‘London’)(Staff ⋈Staff.branchNo=Branch.branchNo Branch)

or

(σposition=‘Manager’(Staff)) ⋈Staff.branchNo=Branch.branchNo (σcity=‘London’(Branch))
Four Phases of QP
• Query processing can be divided into four main phases:
1. decomposition (consisting of parsing and validation)
2. optimization
3. code generation
4. execution
Types of QP
Dynamic query optimization
• Decomposition and optimization are carried out every time the query is run.
• All information required to select an optimum strategy is up to date.
• Performance is affected, as the query has to be parsed, validated, and optimized before it can be executed.
• It may be necessary to reduce the number of execution strategies to be analyzed to achieve an acceptable overhead, which may have the effect of selecting a less than optimum strategy.

Static query optimization
• The query is parsed, validated, and optimized once.
• The runtime overhead is removed.
• There is more time available to evaluate a larger number of execution strategies, thereby increasing the chances of finding a more optimum strategy.
• The execution strategy chosen as being optimal when the query is compiled may no longer be optimal when the query is run.
• Hybrid approaches can deal with this disadvantage.
First Phase: Query Decomposition
Query decomposition is the first phase of query processing.
Stages of Query Decomposition
• The typical stages of query decomposition are:
1. analysis,
2. normalization,
3. semantic analysis,
4. simplification, and
5. query restructuring.
Analysis
The query is lexically and syntactically analyzed using the techniques of programming language compilers. Consider the query:

SELECT staffNumber
FROM Staff
WHERE position > 10;

This query would be rejected on two grounds:
• In the SELECT list, the attribute staffNumber is not defined for the Staff relation (it should be staffNo).
• In the WHERE clause, the comparison ‘> 10’ is incompatible with the data type of position, which is a variable-length character string.

On completion of this stage, the high-level query has been transformed into some internal representation that is more suitable for processing.
The Tree
• The query tree is constructed as follows:
• A leaf node is created for each base relation in the query.
• A non-leaf node is created for each intermediate relation produced by a relational algebra operation.
• The root of the tree represents the result of the query.
• The sequence of operations is directed from the leaves to the root.
Normalization
• The normalization stage of query processing converts the query into a
normalized form that can be more easily manipulated.
• The predicate (in SQL, the WHERE condition), which may be arbitrarily
complex, can be converted into one of two forms by applying a few
transformation rules (Jarke and Koch, 1984):
• Conjunctive normal form
• Disjunctive normal form
Conjunctive Normal Form
• A sequence of conjuncts that are connected with the ∧ (AND) operator.
• Each conjunct contains one or more terms connected by the ∨ (OR) operator.
• A conjunctive selection contains only those tuples that satisfy all
conjuncts.
• For example
(position = ‘Manager’ ∨ salary > 20000) ∧ branchNo = ‘B003’
Disjunctive Normal Form
• A sequence of disjuncts that are connected with the ∨ (OR)
operator.
• Each disjunct contains one or more terms connected by the ∧ (AND)
operator.
• A disjunctive selection contains those tuples formed by the union of
all tuples that satisfy the disjuncts.
• For example
(position = ‘Manager’ ∧ branchNo = ‘B003’ ) ∨ (salary > 20000 ∧
branchNo = ‘B003’)
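As a small illustration of these two normal forms (a sketch only; it assumes the sympy library and uses the propositional placeholders M, S, and B for the three atomic terms in the examples above), the conversion can be done mechanically:

# Sketch: rewriting a predicate in conjunctive / disjunctive normal form.
# M, S, B stand in for position = 'Manager', salary > 20000, branchNo = 'B003'.
from sympy import symbols
from sympy.logic.boolalg import to_cnf, to_dnf

M, S, B = symbols('M S B')
predicate = (M | S) & B            # (M OR S) AND B

print(to_cnf(predicate))           # e.g. B & (M | S)        -- conjunctive normal form
print(to_dnf(predicate))           # e.g. (B & M) | (B & S)  -- disjunctive normal form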
Semantic Analysis
• Semantic analysis must reject normalized queries that are incorrectly formulated or contradictory.
• A query is incorrectly formulated if components do not contribute to the generation of the result, which may happen if some join specifications are missing.
• A query is contradictory if its predicate cannot be satisfied by any tuple.
Simplification
• The objectives are to detect redundant qualifications, eliminate common sub-expressions, and transform the query into a semantically equivalent but more easily and efficiently computed form.
• Access restrictions, view definitions, and integrity constraints are considered at this stage.
• If the user does not have the appropriate access to all the components of the query, the query must be rejected.
Query restructuring
• In this final stage, the query is restructured to provide a more efficient implementation.
Heuristical Approach to Query
Optimization
See the transformation rules on page 640 of the text (page 691 in the PDF).
Canonical Tree
Rules
a) canonical relational algebra tree;
b) relational algebra tree formed by pushing Selections down;
c) relational algebra tree formed by changing Selection/Cartesian products to
Equijoins;
d) relational algebra tree formed using associativity of Equijoins;
e) relational algebra tree formed by pushing Projections down;
f) final reduced relational algebra tree formed by pushing resulting Selections
down tree.
SELECT sname, coursename
FROM Student, Course
WHERE s-id > 0
AND year = 2007
AND Student.course-id = Course.course-id
AND duration = 4
AND intake-no = 60;
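As a worked sketch of the rules listed above (assuming s-id and year are attributes of Student, and duration and intake-no are attributes of Course), the canonical expression for this query is

π(sname, coursename)(σ(s-id>0 ∧ year=2007 ∧ Student.course-id=Course.course-id ∧ duration=4 ∧ intake-no=60)(Student × Course))

and one possible reduced form, obtained by pushing the Selections down and converting the Selection/Cartesian product into an Equijoin, is

π(sname, coursename)((σ(s-id>0 ∧ year=2007)(Student)) ⋈Student.course-id=Course.course-id (σ(duration=4 ∧ intake-no=60)(Course)))

The remaining Projections can then also be pushed down towards the leaves.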
Query Processing & Optimization
• Task:
• Find an efficient physical query plan (aka execution plan) for an SQL query
• Goal:
• Minimize the evaluation time for the query, i.e., compute query result as fast as possible
• Cost Factors:
• Disk accesses, read/write operations, [I/O, page transfer] (CPU time is typically ignored)
How does this happen?
• The Query Parser checks the validity of the query and then translates it into an internal form, usually a
relational calculus expression or something equivalent.
• The Query Optimizer examines all algebraic expressions that are equivalent to the given query and
chooses the one that is estimated to be the cheapest.
• The Code Generator or the Interpreter transforms the access plan generated by the optimizer into calls
to the query processor.
• The Query Processor actually executes the query.
Dynamic versus static optimization
• The first three phases of query processing can be referred to as compile time. Within them, we can choose the option of
• dynamically carrying out decomposition and optimization every time the query is run
• the information used will usually be up to date; however, the performance of the query is affected
• or the alternative option of
• static query optimization, where the query is parsed, validated, and optimized once, when the query is first submitted
• the removal of the runtime overhead boosts performance, but the chosen execution strategy may no longer be optimal at the eventual query run time
Query Decomposition
• The first phase of query processing.
• The typical stages of query decomposition are:
1. Analysis
• The query is lexically and syntactically analysed using the techniques of programming language compilers.
2. Normalization
• The normalization stage of query processing converts the query into a normalized form that can be more easily manipulated.
3. Semantic analysis
• This is the process of verifying that all queries are formulated properly and are not contradictory.
4. Simplification
• Identifies redundant qualifications
• Eliminates common sub-expressions
• Transforms the query to a semantically equivalent form that is more efficient
5. Query restructuring
• In the final stage of query decomposition, the query is restructured to provide a more efficient implementation.
Techniques
• There are two main techniques that are employed during query optimization.
• A technique based on heuristic rules (the heuristical approach)
• The use of transformation rules to convert one relational algebra expression into an equivalent form that is
known to be more efficient.
• Use these rules to restructure a (canonical) relational algebra tree generated during query decomposition.
• A technique involving systematic evaluation
• Cost-based query optimization
• A query optimizer does not depend solely on heuristic rules; it also estimates and compares the costs of
executing a query using different execution strategies and algorithms, and it then chooses the strategy with the
lowest cost estimate
Basic Steps in Query Processing : Optimization
• A relational algebra expression may have many equivalent expressions
• E.g., σsalary<75000(πsalary(instructor)) is equivalent to πsalary(σsalary<75000(instructor))
• Each relational algebra operation can be evaluated using one of several different algorithms
• Correspondingly, a relational-algebra expression can be evaluated in many ways.
• An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
• E.g., we can use an index on salary to find instructors with salary < 75000,
• or we can perform a complete relation scan and discard instructors with salary ≥ 75000
Query Optimization
• Amongst all equivalent evaluation plans choose the one with lowest cost.
• Cost is estimated using statistical information from the
database catalog
• e.g. number of tuples in each relation, size of tuples, etc.
• Cost is generally measured as total elapsed time for answering query
• Many factors contribute to time cost
• disk accesses, CPU, or even network communication
• Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account:
• Number of seeks * average-seek-cost
• Number of blocks read * average-block-read-cost
• Number of blocks written * average-block-write-cost
• Cost to write a block is greater than cost to read a block
• data is read back after being written to ensure that the write was successful
Measures of Query Cost (Cont.)
• For simplicity we just use the number of block transfers from disk and the number of seeks as the cost
measures
• tT – time to transfer one block
• tS – time for one seek
• Cost for b block transfers plus S seeks
b * tT + S * tS
• We ignore CPU costs for simplicity
• Real systems do take CPU cost into account
• We do not include the cost of writing output to disk in our cost formulae
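A minimal sketch of how this cost measure is applied (the timing values are hypothetical, chosen only to make the arithmetic visible):

# Sketch: estimated query cost = b * tT + S * tS (block transfers plus seeks).
def query_cost(block_transfers, seeks, t_transfer, t_seek):
    # CPU cost and the cost of writing the final output are ignored, as above.
    return block_transfers * t_transfer + seeks * t_seek

# Hypothetical device: 0.1 ms per block transfer, 4 ms per seek.
print(query_cost(block_transfers=400, seeks=10, t_transfer=0.0001, t_seek=0.004))  # 0.08 s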
Measures of Query Cost (Cont.)
• Several algorithms can reduce disk IO by using extra buffer space
• Amount of real memory available to buffer depends on other concurrent queries and OS processes,
known only during execution
• We often use worst case estimates, assuming only the minimum amount of memory needed for
the operation is available
• Required data may be buffer resident already, avoiding disk I/O
• But hard to take into account for cost estimation
Join Operation
• Several different algorithms to implement joins
• Nested-loop join
• Block nested-loop join
• Indexed nested-loop join
• Merge-join
• Hash-join
• Choice based on cost estimate
• Examples use the following information:
• Number of records of student: 5,000; of takes: 10,000
• Number of blocks of student: 100; of takes: 400
Nested-Loop Join (Cont.)
• In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is
nr * bs + br block transfers, plus nr + br seeks
• If the smaller relation fits entirely in memory, use that as the inner relation.
• Reduces cost to br + bs block transfers and 2 seeks
• Assuming worst-case memory availability, the cost estimate is (see the sketch after this list):
• with student as the outer relation:
5000 * 400 + 100 = 2,000,100 block transfers, and 5000 + 100 = 5,100 seeks
• with takes as the outer relation:
10000 * 100 + 400 = 1,000,400 block transfers, and 10000 + 400 = 10,400 seeks
• If the smaller relation (student) fits entirely in memory, the cost estimate will be 500 block transfers.
• The block nested-loops algorithm is preferable.
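The worst-case figures above can be reproduced with a short sketch (the sizes nr = 5,000 and br = 100 for student, and ns = 10,000 and bs = 400 for takes, are the ones given earlier):

# Sketch: worst-case nested-loop join cost, one buffer block per relation.
def nested_loop_cost(n_r, b_r, b_s):
    # Every tuple of the outer relation r scans all of s, plus one full read of r.
    transfers = n_r * b_s + b_r
    seeks = n_r + b_r
    return transfers, seeks

print(nested_loop_cost(n_r=5000, b_r=100, b_s=400))    # student as outer: (2000100, 5100)
print(nested_loop_cost(n_r=10000, b_r=400, b_s=100))   # takes as outer:   (1000400, 10400)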
Block Nested-Loop Join (Cont.)
• Worst case estimate: br * bs + br block transfers + 2 * br seeks
• Each block in the inner relation s is read once for each block in the outer relation
• Best case: br + bs block transfers + 2 seeks
• Improvements to nested-loop and block nested-loop algorithms (a sketch of the basic block nested-loop join follows this list):
• In block nested-loop, use M - 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
Cost = ⌈br / (M-2)⌉ * bs + br block transfers + 2 * ⌈br / (M-2)⌉ seeks
• If the equi-join attribute forms a key of the inner relation, stop the inner loop on the first match
• Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer (with LRU replacement)
• Use an index on the inner relation if available
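A minimal in-memory sketch of the block nested-loop idea (Python lists of blocks stand in for disk blocks; the relation contents are illustrative only):

# Sketch: block nested-loop equi-join; each relation is a list of blocks and
# each block is a list of (join_key, payload) tuples.
def block_nested_loop_join(outer_blocks, inner_blocks):
    result = []
    for outer_block in outer_blocks:           # each outer block is read once
        for inner_block in inner_blocks:       # inner relation scanned once per outer block
            for key_r, payload_r in outer_block:
                for key_s, payload_s in inner_block:
                    if key_r == key_s:         # the equi-join condition
                        result.append((key_r, payload_r, payload_s))
    return result

student = [[(1, 'Ann'), (2, 'Bob')], [(3, 'Cho')]]     # two blocks
takes   = [[(1, 'DB'), (3, 'OS')], [(2, 'AI')]]        # two blocks
print(block_nested_loop_join(student, takes))
# [(1, 'Ann', 'DB'), (2, 'Bob', 'AI'), (3, 'Cho', 'OS')]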
Evaluation of Expressions
• So far: we have seen algorithms for individual operations
• Alternatives for evaluating an entire expression tree
• Materialization: generate results of an expression whose inputs are relations or are already
computed, materialize (store) it on disk. Repeat.
• Pipelining: pass on tuples to parent operations even as an operation is being executed
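A rough sketch of the difference, with Python lists standing in for materialized intermediate results and generators standing in for pipelined operators (names are illustrative):

rows = [{'name': 'Ann', 'salary': 90000}, {'name': 'Bob', 'salary': 60000}]

# Materialization: the selection result is stored in full before projection runs.
selected = [r for r in rows if r['salary'] < 75000]     # intermediate result materialized
projected = [r['name'] for r in selected]

# Pipelining: each tuple flows through selection into projection as it is produced.
def select_op(source):
    for r in source:
        if r['salary'] < 75000:
            yield r

def project_op(source):
    for r in source:
        yield r['name']

print(projected)                          # ['Bob']
print(list(project_op(select_op(rows))))  # ['Bob']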
Query trees
• Tree that represents a relational algebra expression.
• Leaves = base tables.
• Internal nodes = relational algebra operators applied to the node’s children.
• The tree is executed from leaves to root.
• Example: List the last name of the employees born after 1957 who work on a project named ‘Aquarius’.
SELECT E.LNAME
FROM EMPLOYEE E, WORKS_ON W, PROJECT P
WHERE P.PNAME = ‘Aquarius’ AND P.PNUMBER = W.PNO AND W.ESSN = E.SSN AND E.BDATE > ‘1957-12-31’
Canonical query tree

SELECT attributes
FROM A, B, C
WHERE condition

Construct the canonical query tree as follows:
• Cartesian product of the FROM-tables
• Select with the WHERE-condition
• Project to the SELECT-attributes

[Figure: the canonical query tree πattributes(σcondition((A × B) × C)), with the Cartesian products of the base tables at the leaves and the projection at the root.]
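A small sketch of that construction (a plain Python structure, purely illustrative): starting from the FROM-tables, the tree πattributes(σcondition((A × B) × C)) is built bottom-up.

# Sketch: building the canonical query tree for SELECT attributes FROM A, B, C WHERE condition.
class Node:
    def __init__(self, op, children=None, arg=None):
        self.op, self.children, self.arg = op, children or [], arg

    def show(self, depth=0):
        print('  ' * depth + self.op + (f'[{self.arg}]' if self.arg else ''))
        for child in self.children:
            child.show(depth + 1)

def canonical_tree(tables, condition, attributes):
    tree = Node('relation', arg=tables[0])
    for t in tables[1:]:                        # 1. Cartesian product of the FROM-tables
        tree = Node('×', [tree, Node('relation', arg=t)])
    tree = Node('σ', [tree], arg=condition)     # 2. Select with the WHERE-condition
    return Node('π', [tree], arg=attributes)    # 3. Project to the SELECT-attributes

canonical_tree(['A', 'B', 'C'], 'condition', 'attributes').show()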
Equivalent query trees
Query processing
[Figure: users issue queries and updates about the real world; the database management system processes the queries and updates against its model of that world, returns answers, and accesses the stored data in the physical database.]
Query processing
StarsIn( movieTitle, movieYear, starName )
MovieStar( name, address, gender, birthdate )
SELECT movieTitle
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE ’%1960’);
Canonical query tree
(usually very inefficient)
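One transformation an optimizer might consider here (a sketch, not necessarily the plan chosen, and assuming name is a key of MovieStar so the join introduces no extra duplicates) is to flatten the uncorrelated IN subquery into a join:

SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name
AND birthdate LIKE '%1960';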
Query optimizer
• Compare the estimated costs of different execution plans and choose the cheapest.
• The cost estimate decomposes into the following components.
• Access cost to secondary storage.
• Depends on the access method and file organization. Leading term for large databases.
• Storage cost.
• Storing intermediate results on disk.
• Computation cost.
• In-memory searching, sorting, computation. Leading term for small databases.
• Memory usage cost.
• Memory buffers needed in the server.
• Communication cost.
• Remote connection cost, network transfer cost. Leading term for distributed databases.
• The costs above are estimated via the information in the DBMS catalog (e.g. #records, record size, #blocks, primary and secondary access methods, #distinct values, selectivity, etc.).
References
Textbook 1: Elmasri, R. and Navathe, S. B., Fundamentals of Database Systems, 6th or 7th edition. ISBN-13: 978-0133970777 (7th edition); ISBN-10: 0133970779 (7th edition).
Textbook 2: Connolly, T. M. and Begg, C. E., Database Systems: A Practical Approach to Design, Implementation, and Management, 6th edition, Global Edition.