Relational Algebra

SQL specifies what to retrieve but not how to retrieve it. A process is needed to translate a descriptive statement into a collection of activities. This happens "behind the scenes", but it is important to know nonetheless. If a CS student doesn't know it, who will?

Levels of logic between a SQL statement and its result:
    SQL statement
    High-level logic involving table (relation) manipulation
    Mid-level logic involving loops and comparisons
    Low-level logic involving B-tree, index, and hash table access, etc.
    Result

A SQL query takes relations as input and generates a new relation. We need something that manipulates relations in order to create other relations, and it needs to do so efficiently.

Relational Algebra: a collection of operations (activities) to construct new relations from given relations.

Typical process:
    SQL -> relational algebra operations (or something similar) -> mid/low-level logic -> result

Reference: First few slides of [http://www.cs.wayne.edu/~shiyong/csc6710/slides/kiferComp_348761_ppt11.ppt], though the notation may be intimidating.

Background:
Tuple (n-tuple): an ordered collection of n things written as (a1, a2, ..., an). Typically a tuple represents an entity in a relation (a row in a table).

Set operations:
Definition: Two relations are union-compatible if they have the same degree (number of attributes) and the i-th attributes of each are defined on the same domain. The first three definitions that follow (union, difference, intersection) apply only to union-compatible relations.

Union: A ∪ B is the set of elements belonging to either A or B (or both).
(Venn diagram: A ∪ B)
SQL Server:
    SELECT * FROM S WHERE status = 20
    UNION
    SELECT * FROM S WHERE status = 50

Difference (Minus): A - B is the set of elements belonging to A but not to B.
(Venn diagram: A - B)
SQL Server uses EXCEPT instead of MINUS.
Supplier table: create a view defined by the query below (all suppliers not in Paris).
    SELECT * FROM S
    EXCEPT
    SELECT * FROM S WHERE City = 'Paris'

Intersection: A ∩ B is the set of elements belonging to both A and B.
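The three set operations just defined map directly onto Python's built-in set operators, which gives a quick way to experiment with them. The following is a minimal sketch, assuming a relation is modelled as a set of tuples. The supplier rows and the helper name `select` are hypothetical illustrations and are not taken from the notes.

```python
# Hypothetical supplier rows (S#, SName, Status, City); sample data only.
S = {
    ("S1", "Smith", 20, "London"),
    ("S2", "Jones", 10, "Paris"),
    ("S3", "Blake", 30, "Paris"),
    ("S4", "Clark", 20, "London"),
    ("S5", "Adams", 30, "Athens"),
}

STATUS, CITY = 2, 3  # column positions inside each tuple


def select(relation, predicate):
    """Row subset of a relation: keep the tuples for which predicate is true."""
    return {row for row in relation if predicate(row)}


# Union: rows with status 20 together with rows with status 30.
union = select(S, lambda r: r[STATUS] == 20) | select(S, lambda r: r[STATUS] == 30)

# Difference: all suppliers except those in Paris (the EXCEPT query shape).
not_paris = S - select(S, lambda r: r[CITY] == "Paris")

# Intersection: rows with status > 10 that also have status < 30.
A = select(S, lambda r: r[STATUS] > 10)
B = select(S, lambda r: r[STATUS] < 30)
between = A & B

# Intersection can be built from difference alone: A ∩ B = A - (A - B).
assert between == A - (A - B)

print(sorted(row[0] for row in not_paris))  # supplier numbers outside Paris
```

Because both operands of each operator are row subsets of the same relation, they are automatically union-compatible, which is the precondition stated in the definition.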
Intersection (Venn diagram: A ∩ B). SQL Server:
    SELECT * FROM S WHERE status > 20
    INTERSECT
    SELECT * FROM S WHERE status < 50

Cartesian Product: A × B is the set of all elements of the form (x, y) where x belongs to A and y belongs to B.
    A = {a, b}
    B = {1, 2, 3}
    A × B = {(a, 1), (a, 2), (a, 3), (b, 1), (b, 2), (b, 3)}
Supplier table: create a view defined by
    SELECT * FROM S CROSS JOIN SP

Selection (Restriction): R where predicate, in which the predicate is some condition that evaluates to true or false for each row. The result is a row subset of relation R.
    SELECT * FROM R WHERE predicate

Projection: Let R be a relation and X, Y, ..., Z be among R's attributes. Then R[X, Y, ..., Z] is the set of elements from R restricted to attributes X, Y, ..., Z, with duplicates removed. The result is a column subset of R.
    SELECT DISTINCT attribute-list FROM R

Join: a Cartesian product followed by a selection, (A × B) where predicate.
    SELECT * FROM S CROSS JOIN SP WHERE S.S# = SP.S#
or
    SELECT * FROM S, SP WHERE S.S# = SP.S#

Natural Join: A join B. Let Ci, ..., Cj be the attribute names common to A and B. Then A join B is
    ((A × B) where A.Ci = B.Ci and ... and A.Cj = B.Cj)[all attributes, with duplicate columns removed]

Outer Join: For example, consider S join SP. Suppose S9 is a supplier but S9 supplies no parts yet. An entry for S9 will NOT appear in S join SP. S left outer join SP includes every row from the left table (S), with NULLs filled in where there is no matching S# in the SP table.

Inner Join: the conventional join operation; the phrase is used when there is potential for confusion with an outer join.

Examples:
    SELECT * FROM S INNER JOIN SP ON S.S# = SP.S#
    SELECT * FROM S LEFT OUTER JOIN SP ON S.S# = SP.S#

Get supplier names for suppliers who supply part P2.
    ((SP where P# = 'P2') join S)[SName]
1. Find a row subset (selection) of SP where P# = 'P2'.
2. Do a natural join with S.
3. Find a column subset of the result (project on SName).

Get supplier names for suppliers who supply at least one red part.
    (((P where Color = 'Red') join SP)[S#] join S)[SName]
1. Find a row subset (selection) of P where Color = 'Red'.
2. Do a natural join with SP.
3. Project on S#.
4. Do a natural join with S.
5. Project on SName.

Get supplier names for suppliers who do not supply part P2.
    S[SName] minus ((SP where P# = 'P2') join S)[SName]
1. Project S on SName.
2. Find a row subset of SP where P# = 'P2'.
3. Join the result of 2 with S.
4. Project the result of 3 on SName.
5. Calculate the difference between the result of 1 and the result of 4.

Division: Let A be a relation of degree m+n and B be a relation of degree n. Visualize an (m+n)-tuple (entry) of A as a pair (x, y) where x is an m-tuple (the first m attributes) and y is an n-tuple (the last n attributes). Also suppose the n attributes of B are defined on the same domains as the last n attributes of A. Then C = A dividedby B is the relation consisting of the m-tuples x such that for every y in B the pair (x, y) is in A.

NOTE: only 5 operations are primitive: restriction, projection, product, union, and difference. The others can be derived from them:
    A ∩ B = A - (A - B)
    A dividedby B = A[x] - ((A[x] × B) - A)[x]

Example: SP dividedby B gives the supplier numbers that supply all parts in B.

SP:
    S#    P#
    S1    P1
    S1    P2
    S1    P3
    S1    P4
    S1    P5
    S1    P6
    S2    P1
    S2    P2
    S2    P3
    S3    P2
    S3    P3
    S4    P2
    S4    P3
    S4    P4

    B is:                SP dividedby B is:
    1) P1                S1 and S2
    2) P2 and P4         S1 and S4
    3) P1 through P6     S1

Apply A dividedby B = A[x] - ((A[x] × B) - A)[x] to the previous example, where A = SP.

B = P1:
    A[x]:                          S1 thru S4
    A[x] × B:                      (Si, P1), i = 1...4
    (A[x] × B) - A:                (S3, P1), (S4, P1)
    ((A[x] × B) - A)[x]:           S3 and S4
    A[x] - ((A[x] × B) - A)[x]:    S1 and S2

B = P2 and P4:
    A[x]:                          S1 thru S4
    A[x] × B:                      (Si, P2), i = 1...4; (Si, P4), i = 1...4
    (A[x] × B) - A:                (S2, P4), (S3, P4)
    ((A[x] × B) - A)[x]:           S2 and S3
    A[x] - ((A[x] × B) - A)[x]:    S1 and S4

B = P1 thru P6:
    A[x]:                          S1 thru S4
    A[x] × B:                      (S1, Pi), (S2, Pi), (S3, Pi), (S4, Pi), i = 1...6
    (A[x] × B) - A:                (S2, Pi), i = 4, 5, 6; (S3, Pi), i = 1, 4, 5, 6; (S4, Pi), i = 1, 5, 6
    ((A[x] × B) - A)[x]:           S2, S3, S4
    A[x] - ((A[x] × B) - A)[x]:    S1

Example: Get supplier names for suppliers who supply all parts.
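The division cases worked above can be checked mechanically. The following is a minimal Python sketch, assuming relations are modelled as sets of tuples and using the SP rows listed in the example. The function name `divide` is our own label, not a library call; it follows the identity A dividedby B = A[x] - ((A[x] × B) - A)[x] one step at a time.

```python
# SP as listed in the example: (S#, P#) pairs.
SP = {
    ("S1", "P1"), ("S1", "P2"), ("S1", "P3"),
    ("S1", "P4"), ("S1", "P5"), ("S1", "P6"),
    ("S2", "P1"), ("S2", "P2"), ("S2", "P3"),
    ("S3", "P2"), ("S3", "P3"),
    ("S4", "P2"), ("S4", "P3"), ("S4", "P4"),
}


def divide(a, b):
    """A dividedby B for a binary relation a = {(x, y)} and a set b = {y}."""
    a_x = {x for (x, _) in a}                     # A[x]
    product = {(x, y) for x in a_x for y in b}    # A[x] × B
    missing = product - a                         # (A[x] × B) - A
    return a_x - {x for (x, _) in missing}        # A[x] - (...)[x]


print(sorted(divide(SP, {"P1"})))                                   # ['S1', 'S2']
print(sorted(divide(SP, {"P2", "P4"})))                             # ['S1', 'S4']
print(sorted(divide(SP, {"P1", "P2", "P3", "P4", "P5", "P6"})))     # ['S1']
```

When B is the set of all part numbers, P[P#], the result is the set of suppliers who supply every part, which is the core of the "supply all parts" query.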
In relational algebra, the names of suppliers who supply all parts can be written as
    ((SP dividedby P[P#]) join S)[SName]
or
    ((SP[S#] - ((SP[S#] times P[P#]) - SP)[S#]) join S)[SName]
Recall from the previous SQL notes that finding ALL of something required double NOT EXISTS subqueries.

Query Optimization:
(An Introduction to Database Systems by C. J. Date, 8th ed.)
    SELECT S.name
    FROM S, SP
    WHERE S.S# = SP.S# AND SP.P# = 'P2'
Suppose there are 100 suppliers and 10,000 shipments. Consider
    (S times SP) where S.S# = SP.S# and SP.P# = 'P2'
vs
    (S times (SP where SP.P# = 'P2')) where S.S# = SP.S#
The first option creates a VERY large intermediate table (100 × 10,000 = 1,000,000 rows) before any filtering is done.

Stages of query optimization: Slide 4 of [http://www.cs.wayne.edu/~shiyong/csc6710/slides/kiferComp_348761_ppt11.ppt]

Internal representation of a relational algebra expression (e.g. a query tree). Two trees for the same query:

    ((S join SP) where P# = 'P2')[name]
        Project on name
            Where SP.P# = 'P2'
                join
                    S
                    SP

    (S join (SP where P# = 'P2'))[name]
        Project on name
            join
                S
                Where SP.P# = 'P2'
                    SP

Canonical forms: a set of queries, called C, such that for every possible query there is a query in C that is equivalent.
Reason for this? "Get suppliers who supply P2" can be expressed in 8 different ways. Ideally, the efficiency should not depend on the form in which the query is written.

Examples (each left-hand form can be transformed into the equivalent right-hand form):
    where p or (q and r)          ≡  where (p or q) and (p or r)    -- conjunctive normal form
    (A where p) where q           ≡  A where (p and q)
    (A[attributes]) where p       ≡  (A where p)[attributes]
    A.F1 > B.F2 and B.F2 = 3      ≡  A.F1 > 3 and B.F2 = 3

Consider: Get suppliers not in London or Paris.
    SELECT S#, Sname, Status, City
    FROM dbo.S
    WHERE (City <> 'Paris') OR (City <> 'London')
Does this yield the expected result? It does not: every city differs from at least one of 'Paris' and 'London', so the OR condition is true for every row with a non-NULL City. The intended condition is NOT (City = 'Paris' OR City = 'London'), which De Morgan's laws rewrite as (City <> 'Paris') AND (City <> 'London').
    NOT (p and q)  ≡  NOT p or NOT q     (De Morgan's laws)
    NOT (p or q)   ≡  NOT p and NOT q    (De Morgan's laws)

If the system knew that SP.P# is a foreign key matching the primary key P.P# in P, then
    (SP join P)[S#]  ≡  SP[S#]

Query optimizers do not truly optimize; they just try to find "reasonably good" evaluation strategies. It might take longer to find the optimal strategy than to run the brute-force method.
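The effect of applying the restriction before the product, as in the 100-supplier / 10,000-shipment comparison, can be made concrete by counting intermediate rows. The following is a rough Python sketch under stated assumptions: the data is synthetic, the share of shipments that are for P2 (every 200th one) is invented for illustration, and the function names are hypothetical.

```python
from itertools import product

# Synthetic data sized as in the notes: 100 suppliers, 10,000 shipments.
# Assumption: every 200th shipment is for part P2 (50 shipments in total).
S = [(f"S{i}", f"name{i}") for i in range(100)]
SP = [(f"S{k % 100}", "P2" if k % 200 == 0 else f"P{3 + k % 7}")
      for k in range(10_000)]


def plan_product_first(s, sp):
    """(S times SP) where S.S# = SP.S# and SP.P# = 'P2'."""
    size, rows = 0, []
    for a, b in product(s, sp):            # full Cartesian product
        size += 1
        if a[0] == b[0] and b[1] == "P2":
            rows.append((a, b))
    return size, rows


def plan_select_first(s, sp):
    """(S times (SP where SP.P# = 'P2')) where S.S# = SP.S#."""
    p2_only = [b for b in sp if b[1] == "P2"]   # restriction applied first
    size, rows = 0, []
    for a, b in product(s, p2_only):
        size += 1
        if a[0] == b[0]:
            rows.append((a, b))
    return size, rows


size1, rows1 = plan_product_first(S, SP)
size2, rows2 = plan_select_first(S, SP)
assert sorted(rows1) == sorted(rows2)   # both plans return the same answer
print(size1, size2)                     # 1000000 5000
```

Both plans produce identical results; only the amount of intermediate work differs, which is the kind of difference an optimizer looks for when it rewrites an expression.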
Reference: [http://www.cs.wayne.edu/~shiyong/csc6710/slides/kiferComp_348761_ppt11.ppt]

Summaries and References:
"MySQL's query optimization engine isn't always the most efficient when it comes to subqueries." It is often a good idea to convert a subquery to a join. [http://www.databasejournal.com/features/mysql/article.php/3813821/Five-QueryOptimizations-in-MySQL.htm]
Some additional information: [http://avid.cs.umass.edu/courses/645/s2009/lectures/Lec12-QueryOptimizerx6.pdf]

SQL Server: Query Analysis:
Select a database.
Create a query window (right-click on the database and select New Query) and enter a query.
Query -> Display Estimated Execution Plan
Query -> Include Actual Execution Plan. When the query is executed there will be an Execution Plan tab on the result pane.

Some example queries to analyze:
    select * from S
    select * from S order by S#
    select * from S order by status
    select * from SP where Qty = 200
    select * from S, SP where S.S# = SP.S# and P# = 'P2'
    select * from S where S# in (select S# from SP where P# = 'P2')
    S2 Replacement view
    Any Red Part view
    Only Red Parts1 view
Do some views from the genealogy database (see next slide for family tree).

(Family tree figure. Names shown: Mike, Tom, Jim, John, Joanne, Jason, Alice, Kathy, Sally, Jeanne, Brian, Linda, Ira, Sarah, Ellen, Joseph, Daniel, Tim.)

Intro to SQL Server Optimization: [http://www.c-sharpcorner.com/UploadFile/john_charles/QueryoptimizationinSQLServer200512112007154303PM/QueryoptimizationinSQLServer2005.aspx]

Some useful terms: [http://technet.microsoft.com/en-us/library/cc917672.aspx]

Clustered index: A clustered index is organized as a B-tree, where the nonleaf nodes are index pages and the leaf nodes are data pages. A table can have only one clustered index.

Nonclustered index: A nonclustered index is organized as a B-tree. Unlike a clustered index, a nonclustered index consists of only index pages. The leaf nodes in a nonclustered index are not data pages, but contain row locators for individual rows in the data pages.
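The difference between the two index types can be pictured with a small Python model. This is only a sketch under simplifying assumptions: a sorted list with binary search stands in for a multi-level B-tree, the row locator is just a position in the list, and the supplier rows and function names are hypothetical.

```python
import bisect

# Hypothetical supplier rows, kept in S# order the way a clustered index
# on S# keeps the data rows themselves in key order.
data_pages = sorted([
    ("S3", "Blake", "Paris"),
    ("S1", "Smith", "London"),
    ("S4", "Clark", "London"),
    ("S2", "Jones", "Paris"),
])
clustered_keys = [row[0] for row in data_pages]


def clustered_seek(s_no):
    """Search on the clustered key; the search ends on the data row itself."""
    i = bisect.bisect_left(clustered_keys, s_no)
    if i < len(data_pages) and clustered_keys[i] == s_no:
        return data_pages[i]
    return None


# Nonclustered index on City: sorted (key, row locator) entries kept
# separately from the data. The locator here is the row's list position.
city_index = sorted((row[2], pos) for pos, row in enumerate(data_pages))


def nonclustered_seek(city):
    """Search the index entries, then follow each row locator to the data."""
    start = bisect.bisect_left(city_index, (city, -1))
    hits = []
    for key, locator in city_index[start:]:
        if key != city:
            break
        hits.append(data_pages[locator])   # extra hop from index entry to row
    return hits


print(clustered_seek("S2"))
print(nonclustered_seek("London"))
```

Since the data rows can be physically ordered in only one way, this model also suggests why a table can have only one clustered index but several nonclustered ones.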
Hash match inner join: [http://blogs.msdn.com/craigfr/archive/2006/08/10/687630.aspx]
Left Semi Join: "returns rows from one table that would join with another table without performing a complete join" [http://blogs.msdn.com/craigfr/archive/2006/07/19/671712.aspx]
Left Anti Semi Join: similar to the above, but returns the rows that would NOT join.
Stream Aggregates: [http://blogs.msdn.com/craigfr/archive/2006/09/13/752728.aspx]
Index scan vs index seek: [http://blogs.msdn.com/craigfr/archive/2006/06/26/647852.aspx]
Spooling: temporary "caching" of data for use by another activity: [http://technet.microsoft.com/en-us/library/ms191221.aspx]
Summaries at [http://blogs.msdn.com/craigfr/attachment/8508493.ashx] and [http://technet.microsoft.com/en-us/library/ms191158.aspx]

Generating low level routines
Construct query plans from low-level routines and make some attempt at optimizing. True optimization is sometimes more costly than the savings.

Join of relations R and S (on attribute C): (An Introduction to Database Systems by C. J. Date)

Brute Force: the cardinality of R is m and the cardinality of S is n.
    For i = 1 to m do
        For j = 1 to n do
            If R[i].C = S[j].C then
                Add joined tuple, R[i]:S[j], to result table
Time is proportional to m*n. Worse, depending on many factors, the number of disk I/Os could also be proportional to m*n. This is potentially large if there are tens or hundreds of thousands of tuples in each table, or if tuples are very large and require a lot of disk I/Os.

Indexed method: assume an index on S.C.
    For i = 1 to m do {
        Find all records S[j] in S where S[j].C = R[i].C
            // requires a search through the index, which is
            // far less work than looking through all records in S
        Join R[i] with the corresponding tuples and add to result table
    }

Hash Table: assume (or build) a hash table H on S.C.
    For i = 1 to m do {
        k = hash(R[i].C)
        Search the hash list at H[k] looking for matches to R[i].C
        If found, add joined tuple(s) to result table
    }

Projection: R is the relation (with m tuples); A is the list of attributes projected on; R[i] is tuple i of R. Chosen is a boolean vector, one bit for each R[i], initially all false. Chosen[j] is set to true once R[j].A is found to duplicate an earlier tuple.

Nested Loop: // slow; look for an alternative
    For i = 1 to m do
        If Chosen[i] is false then {
            Add R[i].A to result table
            For j = i+1 to m do
                If R[j].A = R[i].A then set Chosen[j] = true
        }

Hashing: using a hash table H
    For i = 1 to m do {
        k = hash(R[i].A)
        Search all tuples at H[k], looking for R[i].A
        If not found, add R[i].A to the tuples at H[k]
    }
    Gather all tuples from the hash table and store them in the result table.

Find London suppliers who supply red parts with weight < 20 in quantities greater than 200.
    Create View P' As Select * from P where Color = 'Red' and Weight < 20
    Create View SP' As Select * from SP where Qty > 200
    Create View S' As Select * from S where City = 'London'
    Select name from S', SP', P'
    Where S'.S# = SP'.S# and SP'.P# = P'.P#
These three views could be created concurrently on multiprocessor or multi-core systems.
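The brute-force join, the hash-table join, and the hashing projection described in pseudocode earlier in this section translate fairly directly into Python. The following is a minimal sketch, assuming tuples are modelled as dictionaries keyed by attribute name. Python's built-in dict stands in for the hash table H and its bucket search; the function names and sample rows are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(r, s, c):
    """Brute force: compare every tuple of r with every tuple of s (m*n tests)."""
    return [{**tr, **ts} for tr in r for ts in s if tr[c] == ts[c]]


def hash_join(r, s, c):
    """Build a hash table on s[c], then probe it once per tuple of r."""
    buckets = defaultdict(list)
    for ts in s:
        buckets[ts[c]].append(ts)
    return [{**tr, **ts} for tr in r for ts in buckets.get(tr[c], [])]


def project(r, attrs):
    """Projection with duplicates removed by hashing on the projected values."""
    seen = {}
    for t in r:
        key = tuple(t[a] for a in attrs)
        if key not in seen:                 # "if not found, add"
            seen[key] = {a: t[a] for a in attrs}
    return list(seen.values())              # gather tuples from the hash table


# Hypothetical sample rows, not taken from the notes.
S = [
    {"S#": "S1", "City": "London"},
    {"S#": "S2", "City": "Paris"},
    {"S#": "S3", "City": "Paris"},
]
SP = [
    {"S#": "S1", "P#": "P1"},
    {"S#": "S1", "P#": "P2"},
    {"S#": "S2", "P#": "P2"},
]

joined = hash_join(S, SP, "S#")
assert joined == nested_loop_join(S, SP, "S#")   # same rows, far fewer comparisons
print(project(S, ["City"]))                      # [{'City': 'London'}, {'City': 'Paris'}]
```

The hash join does one pass over each input (roughly m + n steps plus the bucket lookups) instead of m*n comparisons, which is the saving the notes attribute to the indexed and hash-table methods.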