Relational Database Performance CSCI 6442 Copyright 2013, David C. Roberts, all rights reserved “Optimization” More properly called access path selection “Optimizer” selects a strategy for processing Approaches: ◦ Cost-based: estimate total cost to process by different approaches, choose lowest estimate ◦ Heuristic: use rules to decide how to process Cost-based is typically used by all database systems today 2 RUNSTATS RUNSTATS is the name of a statisticsgathering utility first included in IBM’s DB2 It scans the database, gathers statistics used for estimating costs for access path selection DBA determines how often and when to run the utility What statistics do you think are gathered? 3 Quandary The more that RUNSTATS collects, the better job the optimizer can do of selecting efficient processing methods However, RUNSTATS uses a lot of resources, scanning every relation Use of RUNSTATS must be balanced against its cost 4 The Optimizer Selects which indexes to use Chooses the order of using indexes Chooses algorithms to use Decides when to apply predicates 5 SQL Statement Parts of Interest Simple query: SELECT ENAME, JOB FROM EMP WHERE SAL > 20 OR JOB = ‘VP’ OR JOB LIKE ‘PRES%’; We’re interested in the FROM clause (that tells us the table names) and the WHERE clause (that tells us the predicates) 6 SINGLE-TABLE QUERIES 7 Predicates WHERE clause of SQL statement is made up of predicates Each predicate is a condition Each condition references a column Conditions may be equality, inequality, range, LIKE The first three conditions use an index if one exists, scan the table if no index exists 8 Example SAL > 20 OR JOB = ‘VP’ OR JOB LIKE ‘PRES%’; For each predicate, do we use an index to retrieve rows that make it true, then examine each row for the other predicates? 9 Predicate Selectivity Selectivity: an estimate of the fraction of rows of a table that make a predicate true 10 Classes of Predicates Predicate: condition in the WHERE clause Predicates are combined using AND, OR to make WHERE clauses Classes of predicates: ◦ Sargable: search arguments that can be processed close to the data ◦ Residual: not sargable, such as complex use of nesting 11 Access Paths Five possible access paths: ◦ ◦ ◦ ◦ ◦ Table scan Non-selective index scan Selective index scan Index only access Fully qualified unique index Each of these types of scans has different cost estimates for its use 12 Predicate Selectivity Selectivity function f(p): % of rows retrieved on average by predicate p Number of rows retrieved is strongly related to the cost to carry out the operation n = number of rows in table Form of P f column = value 1/n column != value 1-1/n (nearly 1) column > value (high value - search value)/(high value low value) Column LIKE ‘value’ n p1 or p2 f(p1) + f(p2) p1 and p2 f(p1) * f(p2) 13 Single-Table Queries Find out which columns that are referenced in the WHERE clause have indexes Find out selectivity of indexes Estimate selectivity of each predicate Use most selective index-predicate combination to retrieve rows that satisfy one predicate Examine each row for other predicates 14 MULTI-TABLE QUERIES (I.E., JOINS) 15 Join Result of a join is a subset of the Cartesian product of the tables being joined Cartesian product of two tables with m and n rows is a new table of mn rows, where every row of the join consists of one row of the first table and one row of the second table 16 Example Join 17 Simple Join Processing Algorithm 1. 2. 3. Form the Cartesian product of all tables involved in the join Scan rows of the Cartesian product, testing each against all of the predicates Eliminate rows of the Cartesian product that don’t meet the predicates What’s wrong with this picture? Think about two tables of 1 million rows. Cartesian product would be 1 thousand billion rows! 18 Joining More than 2 Relations A join of more than two relations is processed 2 relations at a time Part of access path planning is to select that sequence We will talk about algorithms for joining 2 tables and then choosing the order of processing a multi-table join 19 Joins An equijoin is based on equality of an attribute of each of two relations Outer join includes all rows of both tables even if some rows did not have a matching value A semi-join can be based on inequalities as the relationship 20 Join-Processing Algorithms Nested loop join ◦ Each tuple of outer relation is compared to all rows of the inner relation Sort-merge join Hash-based join 21 Nested-Loop Join The algorithm: For efficiency, the relation with higher cardinality (R) is chosen as the inner relation Number of operations: NR+ NR* NS What if there is an index? 22 Nested-Loop 23 Join Order For JOIN queries, the “outer” table is access first, “inner” second Order for joining tables must be selected Most selective first Least costly joins first 24 Merge Join First, each relation is sorted on the join attribute Then both relations are scanned in the order of the join attributes Tuples that satisfy the join predicate are concatenated and placed in the output relation Number of operations: NR+NS (after the sort!) What is there is an index on R or S or both? 25 Merge Join Algorithm 26 Merge Join 27 Hash-Join The joins we have looked at compare tuples in the first relation with tuples in the second relation that cannot possibly be part of the join The goal of the hash join is to compare only those tuples that might be part of the join Hashing is used to identify those tuples There are many variations of hash-join 28 Simple Hash-Join Algorithm 29 Hash-Join Performance Performance of hash-join can be superior to other join algorithm Performance depends on the hashing algorithm (although note Lum’s research) Perfect hashing algorithm could find match or non-match with a single probe With hash table in RAM, processing would be very fast 30 Indexes Impact of a b-tree index on performance of these algorithms is obvious But the index must be maintained itself When an attribute that’s indexed in changed in a relation, the value in the index must also be changed And note that the changes must be synchronized (and locked together) 31 Order of Processing Joins Typically, all combinations of order of processing are considered and a cost developed for each Selectivity of predicates, selectivity of indexes, cardinality of relations all are factors in cost analysis Goal is to minimize number of intermediate results produced during processing Usually, low selectivity values are processed first (that is, highest selectivity) 32 Summary Single-table queries Multi-table queries ◦ Nested loop ◦ Sort-merge ◦ Hash Order of processing joins 33 But Note: We have left out a LOT Relations may be partitioned and joins processed by partition Many other parts of the DBMS affect performance If you are responsible for database performance, buy a book and dig in Remember not to give up on normalization to get performance 34 And Now: What You Do for Performance 35 How to Start First, don’t even consider denormalization You have many tools to get the performance you need without ruining the data model (and the applications) Performance test the applications Look for SQL operations that are taking a long time 36 EXPLAIN IBM invented the EXPLAIN utility; it explains the processing strategy for each WHERE clause Run it for operations that are taking too long Look for table scans, cartesian product joins Provide indexes to speed things up 37 EXPLAIN PLAN Tells you the execution plan an Oracle database follow for a SQL statement Inserts a row describing each step of the execution plan into a specified table Determines total cost of execution 38 39 Beyond EXPLAIN There are many indexing options, other options to control physical characteristics of the database Learn about them, learn how to control them But you will go very far with EXPLAIN and providing indexes 40 THANK YOU! 41