Spring 2007
Midterm 1 Review
Lectures 2-10
Cow book Chapters 1, 3, 4, 5, 8, 9, 10, 11

Administrivia
• Midterm 1: in class this Thursday!
  – Closed book examination
  – You will be allowed one 8.5” x 11” sheet of notes (double sided).
• Sample questions on class web site

Review Outline
• Relational Data Model, Algebra, Calculus and SQL
• Storage, Buffer Management and Indexes

Review: DBMS components
• Database application: talks to DBMS to manage data for a specific task
  -> e.g. app to withdraw/deposit money or provide a history of the account
• Query Optimization and Execution: figures out the best way to answer a question
  -> There is always more than 1 way to skin a cat…!
• Relational Operators: provide generic ways to combine data
  -> Do you want a list of customers and accounts or the total account balance of all customers?
• Access Methods: provide efficient ways to extract data
  -> Do you need 1 record or a bunch?
• Buffer Management: makes efficient use of RAM
  -> Think 1,000,000 simultaneous requests!
• Disk Space Management: makes efficient use of disk space
  -> Think 300,000,000 accounts!
• DB

Review: ACID properties
• A DBMS ensures a database has ACID properties:
• Atomicity: nothing is ever half baked; database changes either happen or they don’t.
• Consistency: every committed change takes the database from one valid state to another; integrity constraints always hold.
• Isolation: concurrent operations have an explainable outcome; you can’t peek at the data til it is baked, so one user’s changes aren’t visible to others until they are committed.
• Durability: what’s done is done; once a database operation completes, it remains even if the database crashes.

Review: Relational Data Model
• Most widely used data model.
• Relation: made up of 2 parts:
  – Schema: specifies name of relation, plus name and type of each column.
    • e.g.
      Students(sid: string, name: string, login: string, age: integer, gpa: real)
  – Instance: a table, with rows and columns described by the schema
• Introduced data independence
  – Data layout on disk can change without affecting applications using the data
• Keys contribute to data independence
  – Relationships are determined by field value, not physical pointers!

Review: Bank of Middle Earth

Customers
CustomerID  Name           Address      AccountID
314159      Frodo Baggins  BagEnd       112358
271828      Sam Gamgee     BagShot Row  132124
42          Bilbo Baggins  Rivendell    112358

Accounts
AccountID  Balance
112358     4500.00
132124     2000.00

Give me an example of…
• A super key for Accounts
• Good primary key choices for both
• A foreign key
• A possible check constraint
    ALTER TABLE ACCOUNTS ADD CONSTRAINT CHECK_BAL CHECK (BALANCE >= 0)

Review: Query Languages
• Query languages provide 2 key advantages:
  – Less work for user asking query
  – More opportunities for optimization
• Algebra and safe calculus are simple and powerful models for query languages for relational model
  – Have same expressive power
  – Algebra is more operational; calculus is more declarative
• SQL can express every query that is expressible in relational algebra/calculus (and more).
• Two sublanguages:
  – DDL: Data Definition Language
    • Define and modify schema (at all 3 levels)
  – DML: Data Manipulation Language
    • Queries and IUD (insert, update, delete)

Review: Basic DDL
    CREATE TABLE CUSTOMERS
      (CustomerID INTEGER NOT NULL,
       Name VARCHAR(128),
       Address VARCHAR(256),
       AccountID INTEGER,
       PRIMARY KEY (CustomerID),
       FOREIGN KEY (AccountID) REFERENCES ACCOUNTS);

    CREATE TABLE ACCOUNTS
      (AccountID INTEGER NOT NULL,
       Balance DOUBLE,
       PRIMARY KEY (AccountID));

(Customers and Accounts instances as in the Bank of Middle Earth tables above.)
• Why do we need NOT NULL?
• What would happen if I executed these commands in this order?
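One way to explore the two questions above is to run the DDL against SQLite from Python. This is only a sketch under stated assumptions, not part of the original slides: it uses Python's sqlite3 module, creates ACCOUNTS first so the foreign key has a target, folds CHECK_BAL into CREATE TABLE (SQLite's ALTER TABLE cannot add a constraint), and switches on foreign key enforcement, which SQLite leaves off by default. The helper names build_bank and rejected are made up for illustration.

```python
import sqlite3


def build_bank():
    """Create the Bank of Middle Earth tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # Assumption: SQLite enforces foreign keys only after this pragma is set.
    conn.execute("PRAGMA foreign_keys = ON")
    # ACCOUNTS is created first so that CUSTOMERS has a table to reference.
    conn.execute(
        """CREATE TABLE ACCOUNTS (
               AccountID INTEGER NOT NULL,
               Balance   DOUBLE,
               PRIMARY KEY (AccountID),
               CONSTRAINT CHECK_BAL CHECK (Balance >= 0))"""
    )
    conn.execute(
        """CREATE TABLE CUSTOMERS (
               CustomerID INTEGER NOT NULL,
               Name       VARCHAR(128),
               Address    VARCHAR(256),
               AccountID  INTEGER,
               PRIMARY KEY (CustomerID),
               FOREIGN KEY (AccountID) REFERENCES ACCOUNTS)"""
    )
    # Rows taken from the Customers and Accounts tables on the slide.
    conn.executemany(
        "INSERT INTO ACCOUNTS VALUES (?, ?)",
        [(112358, 4500.00), (132124, 2000.00)],
    )
    conn.executemany(
        "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?)",
        [
            (314159, "Frodo Baggins", "BagEnd", 112358),
            (271828, "Sam Gamgee", "BagShot Row", 132124),
            (42, "Bilbo Baggins", "Rivendell", 112358),
        ],
    )
    conn.commit()
    return conn


def rejected(conn, sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False
```

With conn = build_bank(), a negative balance such as rejected(conn, "INSERT INTO ACCOUNTS VALUES (1, -5.0)") should be refused by CHECK_BAL, and a customer row whose AccountID matches no account should be refused by the foreign key. Whether CREATE TABLE CUSTOMERS itself fails when ACCOUNTS does not exist yet depends on the DBMS, so it is worth trying the slide's order on the system used in class.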
Relational Algebra Review

Sailors
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0

Reserves
sid  bid  day
22   101  10/10/96
58   103  11/12/96

Boats
bid  bname      color
101  Interlake  Blue
102  Interlake  Red
103  Clipper    Green
104  Marine     Red

Basic operations:
• Selection ( σ ): gives a subset of rows.
• Projection ( π ): deletes unwanted columns.
• Cross-product ( × ): combines two relations.
• Set-difference ( − ): tuples in relation 1, but not 2.
• Union ( ∪ ): tuples in relation 1 together with tuples in relation 2.
Additional operations:
• Intersection ( ∩ ): tuples that appear in both relations.
• Join ( ⋈ ): like × but only keep tuples where common fields are equal.
• Division ( / ): tuples from relation 1 that have a match for every tuple in relation 2.

Relational Algebra Review
(Same Sailors, Reserves and Boats instances as above.)
Find names of sailors who’ve reserved a green boat
    π_sname ( π_sid ( π_bid ( σ_color=‘Green’ Boats ) ⋈ Reserves ) ⋈ Sailors )

Relational Algebra Review

Sailors
sid  sname  rating  age
1    Frodo  7       22
2    Bilbo  2       39
3    Sam    8       27

Reserves
sid  bid  day
1    103  9/12
2    103  9/13
3    103  9/14
3    101  9/12
1    103  9/13

Boats
bid  bname  color
101  Nina   red
103  Pinta  blue

Find names of sailors who’ve reserved all boats
• First use division and renaming to find sids of sailors who reserved all boats
    ρ ( Tempsids, ( π_sid,bid Reserves ) / ( π_bid Boats ) )
  – π_sid,bid Reserves = { (1, 103), (2, 103), (3, 101), (3, 103) }
  – π_bid Boats = { 101, 103 }
  – Tempsids = { sid = 3 }
• Then join result with Sailors and project to get their names
    π_sname ( Tempsids ⋈ Sailors )

Relational Calculus Review
• Variables
  – TRC: Variables are bound to tuples.
  – DRC: Variables are bound to domain elements (= column values)
• Constants: 7, “Foo”, 3.14159, etc.
• Comparison operators: =, <>, <, >, etc.
• Logical connectives: ¬ (not), ∧ (and), ∨ (or), ⇒ (implies), ∈ (is a member of)
• Quantifiers
  – ∀X(p(X)): For every X, p(X) must be true
  – ∃X(p(X)): There exists at least one X such that p(X) is true

Relational Calculus Review
Find names of sailors who have reserved a green boat
    { N | ∃S ∈ Sailors (S.sname = N.sname ∧
          ∃R ∈ Reserves (S.sid = R.sid ∧
          ∃B ∈ Boats (B.color = “Green” ∧ B.bid = R.bid))) }

Sailors (S ranges over these rows)
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0

Reserves (R ranges over these rows)
sid  bid  day
22   101  10/10/96
58   103  11/12/96

Boats (B ranges over these rows)
bid  bname      color
101  Interlake  Blue
102  Interlake  Red
103  Clipper    Green
104  Marine     Red

Result N
sname
rusty

Relational Calculus Review
Find names of sailors who’ve reserved all boats
    { N | ∃S ∈ Sailors (S.sname = N.sname ∧
          ∀B ∈ Boats (∃R ∈ Reserves (S.sid = R.sid ∧ B.bid = R.bid))) }

Sailors (S ranges over these rows)
sid  sname  rating  age
1    Frodo  7       22
2    Bilbo  2       39
3    Sam    8       27

Reserves (R ranges over these rows)
sid  bid  day
1    103  9/12
2    103  9/13
3    103  9/14
3    101  9/12
1    103  9/13

Boats (B ranges over these rows)
bid  bname  color
101  Nina   red
103  Pinta  blue

Result N
sname
Sam

Basic SQL Query
    SELECT [DISTINCT] target-list
    FROM relation-list
    WHERE qualification
• target-list: A list of attributes of tables in relation-list
• DISTINCT: optional keyword indicating answer should not contain duplicates. In SQL, default is that duplicates are not eliminated! (Result is called a “multiset”)
• qualification: Comparisons combined using AND, OR and NOT. Comparisons are Attr op const or Attr1 op Attr2, where op is a comparison operator such as =, <>, <, >, etc.
• relation-list: A list of relation names, possibly with a range-variable after each name

Set Operators in SQL
(Set operators are almost always used with nested queries)
• UNION
  – Returns the union of two sets (with same arity)
  – UNION ALL retains duplicates in result
• INTERSECT
  – Returns the intersection of two sets (with same arity)
• EXCEPT
  – Set difference: A EXCEPT B returns tuples in A but not B
• IN/NOT IN
  – A IN B is true if A is a member of B
• EXISTS/NOT EXISTS
  – True if expression evaluates to a set with at least one member
• UNIQUE/NOT UNIQUE
  – True if expression evaluates to a set with no duplicates
• Value <comparison op> ANY/ALL
  – Value > ANY A is true if A contains at least one member that makes the comparison true
  – Value > ALL A is true if all members of A make the comparison true

SQL Review: Nested query
Find the names of sailors who’ve reserved boat #103 at most once
(UNIQUE is also true for an empty result, so a sailor with no reservation for #103 qualifies too; add an EXISTS test to require exactly once.)
    SELECT S.sname
    FROM Sailors S
    WHERE UNIQUE (SELECT sid, bid
                  FROM Reserves R
                  WHERE R.bid=103 AND S.sid=R.sid)

Sailors (S is tried on rows 1, 2, 3 in turn)
sid  sname  rating  age
1    Frodo  7       22
2    Bilbo  2       39
3    Sam    8       27

Reserves
sid  bid  day
1    103  9/12
2    103  9/13
1    103  9/13

Aggregate Operators
(Often used with GROUP BY and HAVING clauses)
• Very powerful; enables computations over sets of tuples
• COUNT: returns a count of tuples in the set
• AVG: returns average of column values in the set
• SUM: returns sum of column values in the set
• MIN, MAX: returns min (max) value of column values in a set.
• DISTINCT can be added to COUNT, AVG, SUM to perform computation only over distinct values.
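The aggregate operators above can be tried on a small instance. The following is a sketch only, assuming Python's sqlite3 module as the engine; the function name aggregate_demo is made up, and the two extra rating-10 sailors are invented rows added so that AVG and AVG(DISTINCT ...) give different answers.

```python
import sqlite3


def aggregate_demo():
    """Run COUNT, AVG, AVG(DISTINCT) and MIN/MAX over a tiny Sailors table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Sailors "
        "(sid INTEGER PRIMARY KEY, sname TEXT, rating INTEGER, age REAL)"
    )
    conn.executemany(
        "INSERT INTO Sailors VALUES (?, ?, ?, ?)",
        [
            (22, "dustin", 7, 45.0),
            (31, "lubber", 8, 55.5),
            (58, "rusty", 10, 35.0),
            # Hypothetical rows (not on the slides): a repeated age at rating 10.
            (91, "extra_a", 10, 35.0),
            (92, "extra_b", 10, 16.0),
        ],
    )

    def one(sql):
        # Each aggregate query here returns a single row with a single column.
        return conn.execute(sql).fetchone()[0]

    return {
        "count": one("SELECT COUNT(*) FROM Sailors S"),
        "avg": one("SELECT AVG(S.age) FROM Sailors S WHERE S.rating = 10"),
        "avg_distinct": one(
            "SELECT AVG(DISTINCT S.age) FROM Sailors S WHERE S.rating = 10"
        ),
        "min_max": conn.execute(
            "SELECT MIN(S.age), MAX(S.age) FROM Sailors S"
        ).fetchone(),
    }
```

In this instance the plain average at rating 10 is taken over the ages 35.0, 35.0 and 16.0, while the DISTINCT version averages only 35.0 and 16.0, which is why the two results differ.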
    SELECT COUNT (*) FROM Sailors S
    SELECT AVG (S.age) FROM Sailors S WHERE S.rating=10
    SELECT AVG (DISTINCT S.age) FROM Sailors S WHERE S.rating=10

Sailors who have reserved all boats
    SELECT S.sname
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid
    GROUP BY S.sname, S.sid
    HAVING COUNT(DISTINCT R.bid) = (SELECT COUNT (*) FROM Boats)

Sailors
sid  sname  rating  age
1    Frodo  7       22
2    Bilbo  2       39
3    Sam    8       27

Boats
bid  bname        color
101  Nina         red
102  Pinta        blue
103  Santa Maria  red

Reserves
sid  bid  day
1    102  9/12
2    102  9/12
2    101  9/14
1    102  9/10
2    103  9/13

Step 1: join Sailors and Reserves on sid
sname  sid  bid
Frodo  1    102
Frodo  1    102
Bilbo  2    101
Bilbo  2    102
Bilbo  2    103

Step 2: group by sname, sid
sname  sid  bid
Frodo  1    102, 102
Bilbo  2    101, 102, 103

Step 3: count distinct bids per group and compare with COUNT(*) of Boats (= 3)
sname  sid  count
Frodo  1    1
Bilbo  2    3

Review: Storage
• A DBMS is like an ogre; it has layers
  – Query Optimization and Execution
  – Relational Operators
  – Files and Access Methods
  – Buffer Management
  – Disk Space Management
  – DB

Disks and Files
• DBMS stores information on disks. Why?
• To work with information, DBMS moves data to RAM.
  – READ: transfer data from disk to main memory (RAM).
  – WRITE: transfer data from RAM to disk.
• READ and WRITE are expensive. Why?
  – must be planned carefully!
  – DBMS architecture is designed to minimize both

The Storage Hierarchy
(from smaller, faster at the top to bigger, slower at the bottom)
– Main memory (RAM) for currently used data.
– Disk for the main database (secondary storage).
– Tapes for archiving older versions of the data (tertiary storage).
Source: Operating Systems Concepts 5th Edition

Components of a Disk
[Figure: disk with spindle, platters, tracks, sector, disk head, arm assembly and arm movement labeled]
• The platters spin (say, 120 rps).
• The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!).
• Only one head reads/writes at any one time.
• Block size is a multiple of sector size (which is fixed).

Disks are slow. Why?
• Time to access (read/write) a disk block:
  – seek time (moving arms to position disk head on track)
  – rotational delay (waiting for block to rotate under head)
  – transfer time (actually moving data to/from disk surface)
[Figure: disk with seek time (arm movement), rotational delay and transfer time labeled]

Disk Space Manager
• Lowest layer of DBMS software manages space on disk (using OS file system or not?).
• Higher levels call upon this layer to:
  – allocate/de-allocate a page
  – read/write a page
• Best if a request for a sequence of pages is satisfied by pages stored sequentially on disk!
  – Responsibility of disk space manager.
  – Higher levels don’t know how this is done, or how free space is managed.
  – Though they may make performance assumptions!
    • Hence disk space manager should do a decent job.

Buffer Management in a DBMS
[Figure: page requests from higher levels arrive at the BUFFER POOL in MAIN MEMORY, whose frames hold disk pages or are free; pages move between the pool and the DB on DISK; choice of frame dictated by replacement policy]
• Buffer pool information table contains: <frame#, pageid, pin_count, dirty>

Buffer Management
• Keeps a group of disk pages in memory
• Records whether each is pinned
  – What happens when all pages are pinned?
  – What happens when a page is unpinned?
• Keeps track of whether pages are dirty

Buffer Management – Replacement
• What if all frames are used, but not pinned, and a new page is requested?
• What pages are candidates for replacement?
• How is the replaced page chosen?

Replacement Policies
• Least Recently Used (LRU)
• Most Recently Used (MRU)
• Clock
• Advantages? Disadvantages?

What is in Database Pages?
• Database contains files, which are made up of…
• Pages, which are made up of…
• Records, which are made up of…
• Fields, which hold single values.

How are records organized?
• It depends on whether fields are variable or fixed length
• In Minibase, array of type/offsets, followed by data.
[Figure: record layout with an Array of Field Offsets followed by fields F1, F2, F3, F4]

How are pages organized?
• It depends on whether records are variable or fixed length.
• In Minibase, slot array at beginning of page, records compacted at end of page.
• What happens if a record is deleted?
[Figure: slotted Page i holding records Rid = (i,1), Rid = (i,2), ..., Rid = (i,N); the SLOT DIRECTORY has one entry per slot 1..N, plus # slots and a pointer to start of free space]

How are files organized?
• Unordered Heap File: chained directory pages, containing records that point to data pages.
[Figure: Header Page starting the DIRECTORY, whose entries point to Data Page 1, Data Page 2, ..., Data Page N]

Several possible file organizations
• Heap Files
• Sorted Files
• Clustered Indexes
• Unclustered Index + regular file
• What are the tradeoffs?
  – Scan
  – Sort
  – Equality Search
  – Range Search
  – Insertion/Deletion

Indexes
• Can be used to store data records (alt 1), or be an auxiliary data structure that refers to an existing file of records (alt 2, 3)
• Many types of index (B-Tree, Hash Table, R-Tree, etc.)
• How do you choose the right index?
• Difference between clustered and unclustered indexes?

Clustered vs. Unclustered Index
• Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file.
  – To build clustered index, first sort the Heap file (with some free space on each block for future inserts).
  – Overflow blocks may be needed for inserts. (Thus, order of data recs is ‘close to’, but not identical to, the sort order.)
[Figure: CLUSTERED vs. UNCLUSTERED; in both, index entries direct search for data entries (Index File), and data entries point to Data Records (Data file)]

B-Trees: a common, flexible index
• What is a B-Tree?
• What goes in an index (interior) node?
• What goes in a leaf node?
• How do insertions and deletions work?

Any Questions?
See you here on Thursday…