SQL What is SQL? • DML • DDL • Partially declarative • Based on the algebra via the tuple calculus, and therefore its core has an elegant set-theoretic foundations • Provides a sound foundation for mathematically precise optimization • Has very simple structure; optimized for moving large numbers of tuples at once • Set-based, retrieval-based What are relational databases? Tables and SQL, a programming language engineered for high volume data applications “must always be correct” data management apps transaction-based applications non-network, non-object based data Chapter 3: The syntax of SQL SELECT select_list [FROM table_source] [WHERE search_condition] [ORDER BY order_by_list] [LIMIT row_limit] Details of the Select clause SELECT [ALL|DISTINCT] column_specification [[AS] result_column] [, column_specification [[AS] result_column]] All columns in a base table Column name in a base table Calculation Function Details of Select, continued • Select * means all columns (attributes) • Using arithmetic • SELECT invoice_total - payment_total – credit_total AS balance_due • Using a function • SELECT CONCAT(first_name, ' ', last_name) AS full_name • Renaming a column • SELECT invoice_number AS "Invoice Number", invoice_date AS Date, invoice_total AS Total FROM invoices Operator precedence Operator * / DIV % (MOD) + - Name Order of precedence Multiplication 1 Division 1 Integer division 1 Modulo (remainder) 1 Addition 2 Subtraction 2 Calculations & concatenation & strings SELECT invoice_total, payment_total, credit_total, invoice_total - payment_total - credit_total AS balance_due FROM invoices SELECT vendor_city, vendor_state, CONCAT(vendor_city, vendor_state) FROM vendors SELECT vendor_contact_first_name,vendor_contact_last_name CONCAT(LEFT(vendor_contact_first_name, 1), LEFT(vendor_contact_last_name, 1)) AS initials FROM vendors Date function, round function SELECT invoice_date, DATE_FORMAT(invoice_date, '%m/%d/%y') AS 'MM/DD/YY', DATE_FORMAT(invoice_date, '%e-%b-%Y') AS 'DD-Mon-YYYY' FROM invoices SELECT invoice_date, invoice_total, ROUND(invoice_total) AS nearest_dollar, ROUND(invoice_total, 1) AS nearest_dime FROM invoices Where clause format WHERE [NOT] search_condition_1 {AND|OR} [NOT] search_condition_2 ... WHERE NOT (invoice_total >= 5000 OR NOT invoice_date <= '2011-08-01') WHERE invoice_total < 5000 AND invoice_date <= '2011-08-01' In phrase and nested selects WHERE test_expression [NOT] IN ({subquery|expression_1 [, expression_2]...}) WHERE vendor_id IN (SELECT vendor_id FROM invoices WHERE invoice_date = '2011-07-18') Like clause WHERE vendor_city LIKE 'SAN%' Cities that will be retrieved “San Diego”, “Santa Ana” WHERE vendor_name LIKE 'COMPU_ER%' Vendors that will be retrieved “Compuserve”, “Computerworld” Order By clause SELECT vendor_name, CONCAT(vendor_city, ', ', vendor_state, ' ', vendor_zip_code) AS address FROM vendors ORDER BY vendor_name Chapter 4: Manipulating multiple tables • Join • Equijoin • Natural join • Outer join • Self join • N-way join Explicit versus implicit joins SELECT vendor_name, invoice_number, invoice_date, line_item_amount, account_description FROM vendors v JOIN invoices i ON v.vendor_id = i.vendor_id JOIN invoice_line_items li ON i.invoice_id = li.invoice_id JOIN general_ledger_accounts gl ON li.account_number = gl.account_number WHERE invoice_total - payment_total - credit_total > 0 ORDER BY vendor_name, line_item_amount DESC SELECT invoice_number, vendor_name FROM vendors v, invoices i WHERE v.vendor_id = i.vendor_id ORDER BY invoice_number Left outer join and natural join SELECT vendor_name, invoice_number, invoice_total FROM vendors LEFT JOIN invoices ON vendors.vendor_id = invoices.vendor_id ORDER BY vendor_name SELECT invoice_number, vendor_name FROM vendors NATURAL JOIN invoices ORDER BY invoice_number Union operator SELECT 'Active' AS source, invoice_number, invoice_date, invoice_total FROM active_invoices WHERE invoice_date >= '2011-06-01' UNION SELECT 'Paid' AS source, invoice_number, invoice_date, invoice_total FROM paid_invoices WHERE invoice_date >= '2011-06-01' ORDER BY invoice_total DESC Chapter 5: Aggregates in queries AVG([ALL|DISTINCT] expression) SUM([ALL|DISTINCT] expression) MIN([ALL|DISTINCT] expression) MAX([ALL|DISTINCT] expression) COUNT([ALL|DISTINCT] expression) COUNT(*) – counts nulls Issues with aggregate data • This is when we intentionally create a new form of “object” and shift to ones that are deliberately not related to individual tuples • It’s selections and projections that often unintentionally lose a perspective on object identity Aggregate examples SELECT COUNT(*) AS number_of_invoices, SUM(invoice_total – payment_total – credit_total) AS total_due FROM invoices WHERE invoice_total – payment_total – credit_total > 0 SELECT 'After 1/1/2011' AS selection_date, COUNT(*) AS number_of_invoices, ROUND(AVG(invoice_total), 2) AS avg_invoice_amt, SUM(invoice_total) AS total_invoice_amt FROM invoices WHERE invoice_date > '2011-01-01‘ SELECT 'After 1/1/2011' AS selection_date, COUNT(*) AS number_of_invoices, MAX(invoice_total) AS highest_invoice_total, MIN(invoice_total) AS lowest_invoice_total FROM invoices WHERE invoice_date > '2011-01-01' Aggregate examples, continued SELECT MIN(vendor_name) AS first_vendor, MAX(vendor_name) AS last_vendor, COUNT(vendor_name) AS number_of_vendors FROM vendors SELECT COUNT(DISTINCT vendor_id) AS number_of_vendors, COUNT(vendor_id) AS number_of_invoices, ROUND(AVG(invoice_total), 2) AS avg_invoice_amt, SUM(invoice_total) AS total_invoice_amt FROM invoices WHERE invoice_date > '2011-01-01' Having clause (with group by) SELECT select_list FROM table_source [WHERE search_condition] [GROUP BY group_by_list] [HAVING search_condition] [ORDER BY order_by_list] Group by and Having: complexities • Group by: how to vertically group tuples • Having: which groups will be included in the final result • Note: All of this is after the Where clause is applied • Group by: based on columns or expressions that have columns in them • If there are any calculations done in the Select clause, this happens after the Group by clause; i.e., it is performed for each group that results from the Group by • Group by can be nested if you specify more than one column • Order by operators are performed after the Having Having example, group by example SELECT vendor_id, ROUND(AVG(invoice_total), 2) AS average_invoice_amount FROM invoices GROUP BY vendor_id HAVING AVG(invoice_total) > 2000 ORDER BY average_invoice_amount DESC (Note: 2 is the number of decimals in the result.) SELECT vendor_name, COUNT(*) AS invoice_qty, ROUND(AVG(invoice_total),2) AS invoice_avg FROM vendors JOIN invoices ON vendors.vendor_id = invoices.vendor_id WHERE invoice_total > 500 GROUP BY vendor_name ORDER BY invoice_qty DESC These are the same… SELECT invoice_date, COUNT(*) AS invoice_qty, SUM(invoice_total) AS invoice_sum FROM invoices GROUP BY invoice_date HAVING invoice_date BETWEEN '2011-05-01' AND '2011-05-31' AND COUNT(*) > 1 AND SUM(invoice_total) > 100 ORDER BY invoice_date DESC SELECT invoice_date, COUNT(*) AS invoice_qty, SUM(invoice_total) AS invoice_sum FROM invoices WHERE invoice_date BETWEEN '2011-05-01' AND '2011-05-31' GROUP BY invoice_date HAVING COUNT(*) > 1 AND SUM(invoice_total) > 100 ORDER BY invoice_date DESC We will shift our focus a bit… • We’ll run the actually queries that appear in chapters 5 onward • The slides will contain overview material and not the actual queries Chapter 6: subqueries • Often used to pass an aggregate value to a parent query • Often a good way to book-keep what might have been a very complex WHERE clause, with perhaps a multiway join • A good way to make a query look more readable to someone who uses it later • Subqueries can also be reused in other queries • Note: you cannot use the SELECT attributes from the embedded query in the parent query unless it directly references the appropriate table(s) in the outer FROM clause Subqueries, continued • Important tool: IN operator, which is a set “element of” • • • • operator (written with an epsilon), and you can negate the IN (not IN) Often you use SOME or ALL or ANY, if you want to return multiple values (i.e., multiple tuples, perhaps with only one attribute each) The default if you don’t use one of the operators above is to return a single value ALL is a set “for all elements of” operator (written with an upside down A) ANY is a set “there exists” operator (written with a backward E) Subqueries, continued • The default is that a subquery only executes once, but you can use a “correlated” query so that it will run once for each row processed by the parent query. • This breaks the “execute from the inside out” paradigm on an uncorrelated subquery. • EXISTS is often used with correlated queries • You can put a subquery in a HAVING, FROM, or SELECT clause as well • But such queries get messy and we will skip this for now. • It’s a good idea to write and test subqueries independently whenever possible, unless they are trivial Examples from chapter 6 • 7: passing a single value • 6: ANY • 3: IN • 8: NOT EXISTS • 5: ALL Chapter 7: changing the database state • This is when we need a transaction protocol • Updates must never overlap with each other or with read- only queries • Read-only queries can overlap • But we want to increase through put by supporting as much “concurrency” as possible • Each transaction has the potential to update the DB state 2 Phase Transactions • Each SQL program is within a begin and end transaction pair • Each transaction has its own workspace for DB items it is going • • • • • • to update Any transactions that overlap in execution time will appear to have run in some serial order This is done by transactions requesting read and write locks (also known as shared and exclusive locks) Read locks can be shared with other readers Write locks cannot be shared with readers or writers All locks held until the end of the transaction They are released and then the changes that a transaction has made are moved to the DB Serializability of transactions • The net effect is that the transactions that overlap in execution time appear to have run in some serial order • Transactions can be undone by throwing away the local store (conceptually, at least) • The write period at the end of the transaction must be atomic • The two phases: • Request read, write, upgrade locks (and wait on locks) and process • Release locks and move updates to the DB • There is a notion of “serializability” which means that the actual schedule of executed steps corresponds to some serial order of running the transactions Interesting facts on transactions… • Various legal schedules might produce different results • A crash during phase two can lead the database • • • • inconsistent The transaction manager uses a lot of overhead resources handling locks We still need to be able to roll the database back and rerun transaction logs User must control the nature of overlapping transactions or there might be very little true concurrency In a distributed database, the lock manager is a bottleneck because all components of the database must move in lockstep Updating data: changing the DB state INSERT INTO invoices VALUES (115, 97, '456789', '2011-08-01', 8344.50, 0, 0, 1, '2011-08-31', NULL); INSERT INTO invoices VALUES (116, 97, '456701', '2011-08-02', 270.50, 0, 0, 1, '2011-09-01', NULL), (117, 97, '456791', '2011-08-03', 4390.00, 0, 0, 1, '2011-09-02', NULL), (118, 97, '456792', '2011-08-03', 565.60, 0, 0, 1, '2011-09-02', NULL); Updating tables, continued USE ex; INSERT INTO color_sample (color_number) VALUES (606); INSERT INTO color_sample (color_name) VALUES ('Yellow'); INSERT INTO color_sample VALUES (DEFAULT, DEFAULT, 'Orange'); INSERT INTO color_sample VALUES (DEFAULT, 808, NULL); INSERT INTO color_sample VALUES (DEFAULT, DEFAULT, NULL); Updating tables, continued INSERT INTO invoice_archive SELECT * FROM invoices WHERE invoice_total - payment_total - credit_total = 0; INSERT INTO invoice_archive (invoice_id, vendor_id, invoice_number, invoice_total, credit_total, payment_total, terms_id, invoice_date, invoice_due_date) SELECT invoice_id, vendor_id, invoice_number, invoice_total, credit_total, payment_total, terms_id, invoice_date, invoice_due_date FROM invoices WHERE invoice_total - payment_total - credit_total = 0; Updating, continued UPDATE invoices SET payment_date = '2011-09-21', payment_total = 19351.18 WHERE invoice_number = '97/522'; UPDATE invoices SET terms_id = 1 WHERE vendor_id = 95; UPDATE invoices SET credit_total = credit_total + 100 WHERE invoice_number = '97/522'; Updating, continued DELETE FROM general_ledger_accounts WHERE account_number = 306; DELETE FROM invoice_line_items WHERE invoice_id = 78 AND invoice_sequence = 2; DELETE FROM invoice_line_items WHERE invoice_id = 12; DELETE FROM invoice_line_items WHERE invoice_id IN (SELECT invoice_id FROM invoices WHERE vendor_id = 115);