The Relational Model (cont’d)
Introduction to Disks and Storage

CS 186, Spring 2007, Lecture 3
Cow book Section 1.5, Chapter 3 (cont’d)
Cow book Chapter 9
Mary Roth

Administrivia
• Homework 0 due today 10 p.m.!
• Nathan and Erinaios posted their office hours on class homepage
• Homework 1 available today from class web site
  – Submit team members online
  – Read through description; we’ll talk more about it after today’s lecture
• Questions from last time?

Outline
• What we learned last time
  – Components of a DBMS
  – Relational Data Model
• New stuff
  – Storage, Disks and Files

Review: Components of a DBMS
• A DBMS is like an ogre; it has layers
    Query Optimization and Execution
    Relational Operators
    Files and Access Methods
    Buffer Management
    Disk Space Management
    DB
  Today we go here…

Review: Relational Data Model
• Most widely used data model today.
• Relations:
  – Schema: specifies name of relation, plus name and type of each column.
  – Instance: a table, with rows and columns that contain data.
• SQL is a query language for the relational data model
  – DDL: to define and modify schemas
  – DML: to query data in tables.
• Keys are a way to associate tuples in different relations

Let’s return to our bank…
• Can we apply a relational model to our bank spreadsheet?

CREATE TABLE CUSTOMERS
  (CustomerID INTEGER,
   Name VARCHAR(128),
   Address VARCHAR(256),
   AccountID INTEGER);

CREATE TABLE ACCOUNTS
  (AccountID INTEGER,
   Balance DOUBLE);

CUSTOMERS
Customer ID   Name            Address       Account ID
314159        Frodo Baggins   BagEnd        112358
271828        Sam Gamgee      BagShot Row   132124
42            Bilbo Baggins   Rivendell     112358

ACCOUNTS
Account ID   Balance
112358       4500.00
132124       2000.00

Primary Keys
• A set of fields is a superkey if:
  – No two distinct tuples can have the same values in all key fields
• A set of fields is a key for a relation if:
  – It is a superkey
  – No proper subset of the fields is a superkey
• What if >1 key for a relation?
  – One of the keys is chosen (by DBA) to be the primary key. Other keys are called candidate keys.
• E.g.
  – {sid, gpa} is an example of a superkey.
  – sid is a key for Students.
  – What about name? login?

Students
sid     name    login        age   gpa
53666   Jones   jones@cs     18    3.4
53688   Smith   smith@eecs   18    3.2
53650   Smith   smith@math   19    3.8

Primary and Candidate Keys in SQL
• Keys must be chosen and defined carefully!
• They imply semantics! What does this set of key definitions imply about students?

CREATE TABLE Enrolled
  (sid CHAR(20),
   cid CHAR(20),
   grade CHAR(2),
   PRIMARY KEY (sid),
   UNIQUE (cid, grade))

“Students can take only one course, and no two students in a course receive the same grade.”

Primary and Candidate Keys in SQL
• Better definition:

CREATE TABLE Enrolled
  (sid CHAR(20),
   cid CHAR(20),
   grade CHAR(2),
   PRIMARY KEY (sid, cid))

“For a given student and course, there is a single grade.”

Foreign Keys, Referential Integrity
• Foreign key: Set of fields in one relation that is used to ‘refer’ to a tuple in another relation.
  – Must correspond to the primary key of the other relation.
  – Like a ‘logical pointer’.
• Plays the same role as the physical pointer in IMS
• If all foreign keys in a table refer to existing tuples in the other table, referential integrity is achieved (i.e., no dangling references).

Foreign Keys in SQL
• E.g. Only students listed in the Students relation should be allowed to enroll for courses.
  – sid is a foreign key referring to Students:

CREATE TABLE Enrolled
  (sid CHAR(20), cid CHAR(20), grade CHAR(2),
   PRIMARY KEY (sid, cid),
   FOREIGN KEY (sid) REFERENCES Students)

Enrolled
sid     cid           grade
53666   Carnatic101   C
53666   Reggae203     B
53650   Topology112   A
53666   History105    B
11111   English102    A

Students
sid     name    login        age   gpa
53666   Jones   jones@cs     18    3.4
53688   Smith   smith@eecs   18    3.2
53650   Smith   smith@math   19    3.8

Let’s return to our bank…
• Can we define keys for our relations?
CREATE TABLE CUSTOMERS
  (CustomerID INTEGER NOT NULL,
   Name VARCHAR(128),
   Address VARCHAR(256),
   AccountID INTEGER,
   PRIMARY KEY (CustomerID),
   FOREIGN KEY (AccountID) REFERENCES ACCOUNTS);

CREATE TABLE ACCOUNTS
  (AccountID INTEGER NOT NULL,
   Balance DOUBLE,
   PRIMARY KEY (AccountID));

• Why do we need NOT NULL?
• What would happen if I executed these commands in this order?

Let’s return to our bank…
We’ll come back to these later…
• Write a SQL query (DML) that returns the names and account balances for all customers that have an account balance > 2500.
• Write a SQL query (DML) that withdraws $300 from Frodo’s account.

Intermission

Disks, Memory, and Files
    Query Optimization and Execution
    Relational Operators
    Files and Access Methods
    Buffer Management
    Disk Space Management
    DB
  You are here…

Disks and Files
• DBMS stores information on disks.
  – Data must be transferred between disk and RAM
  – READ: transfer data from disk to main memory (RAM).
  – WRITE: transfer data from RAM to disk.
• READ and WRITE are expensive and must be planned carefully!
  – DBMS architecture is designed to minimize both

Why Not Store Everything in Main Memory?
• Costs too much. For ~$300, PCConnection will sell you:
  – ~1 GB of RAM
  – ~30 GB of flash
  – ~1 TB of disk
• Main memory is volatile. We want data to be saved between runs. (Obviously!)

The Storage Hierarchy
(Smaller, Faster at the top; Bigger, Slower at the bottom)
  – Main memory (RAM) for currently used data.
  – Disk for the main database (secondary storage).
  – Tapes for archiving older versions of the data (tertiary storage).
Source: Operating System Concepts, 5th Edition

Jim Gray’s Storage Latency Analogy: How Far Away is the Data?

Storage level          Relative latency   Analogy             Time
Registers              1                  My Head             1 min
On Chip Cache          2                  This Room
On Board Cache         10                 This Lecture Hall   10 min
RAM                    100                Sacramento          1.5 hr
Disk                   10^6               Pluto               2 Years
Tape / Optical Robot   10^9               Andromeda           2,000 Years

Disks
• Secondary storage device of choice.
• Main advantage over tapes:
  – faster time to retrieve
  – random access vs. sequential.
• Data is stored and retrieved in units called disk blocks or pages.
• Unlike RAM, time to retrieve a disk block varies depending upon location on disk.
  – Therefore, relative placement of blocks on disk has major impact on DBMS performance!

Components of a Disk
• The platters spin (say, 120 rps).
• The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder (imaginary!).
• Only one head reads/writes at any one time.
• Block size is a multiple of sector size (which is fixed).
[Figure labels: disk head, sector, arm movement, arm assembly, spindle, tracks, platters]

Accessing a Disk Page
• Time to access (read/write) a disk block:
  – seek time (moving arms to position disk head on track)
  – rotational delay (waiting for block to rotate under head)
  – transfer time (actually moving data to/from disk surface)

Accessing a Disk Page
• Seek time and rotational delay dominate.
  – Seek time varies between about 0.3 and 10 msec
  – Rotational delay varies from 0 to 4 msec
  – Transfer time is around 0.08 msec per 8K block
• Key to lower I/O cost: reduce seek/rotation delays!

Arranging Pages on Disk
• ‘Next’ block concept:
  – blocks on same track, followed by
  – blocks on same cylinder, followed by
  – blocks on adjacent cylinder
• Blocks in a file should be arranged sequentially on disk (by ‘next’), to minimize seek and rotational delay.
• For a sequential scan, pre-fetching several pages at a time is a big win!

Summary: Disk Space Manager
• Lowest layer of DBMS software manages space on disk (using OS file system or not?).
• Higher levels call upon this layer to:
  – allocate/de-allocate a page
  – read/write a page
• Best if a request for a sequence of pages is satisfied by pages stored sequentially on disk!
  – Responsibility of disk space manager.
  – Higher levels don’t know how this is done, or how free space is managed.
  – Though they may make performance assumptions!
• Hence disk space manager should do a decent job.
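
The access-time breakdown on the disk slides (seek time + rotational delay + transfer time) can be turned into a back-of-the-envelope cost model. The Python sketch below is only illustrative: the constants are assumed values picked from inside the ranges quoted on the slides, the function names are hypothetical, and the sequential case ignores track and cylinder switches.

```python
# Rough I/O cost model for a single disk.
# All constants are ASSUMPTIONS chosen from the ranges on the slides,
# not measurements of any real drive.

AVG_SEEK_MS = 5.0       # assumed average, inside the 0.3-10 msec seek range
AVG_ROTATION_MS = 2.0   # assumed average, midpoint of the 0-4 msec range
TRANSFER_MS = 0.08      # transfer time per 8K block


def random_read_ms(num_blocks: int) -> float:
    """Blocks scattered over the disk: every block pays its own
    seek and rotational delay before it can be transferred."""
    return num_blocks * (AVG_SEEK_MS + AVG_ROTATION_MS + TRANSFER_MS)


def sequential_read_ms(num_blocks: int) -> float:
    """Blocks laid out in 'next' order: pay one seek and one rotational
    delay up front, then only transfer time per block.
    (Simplification: track and cylinder switches are ignored.)"""
    if num_blocks == 0:
        return 0.0
    return AVG_SEEK_MS + AVG_ROTATION_MS + num_blocks * TRANSFER_MS


if __name__ == "__main__":
    n = 1000
    print(f"random:     {random_read_ms(n):.1f} msec for {n} blocks")
    print(f"sequential: {sequential_read_ms(n):.1f} msec for {n} blocks")
```

Under these assumed numbers, reading 1000 scattered blocks costs on the order of seconds, while reading the same blocks sequentially costs well under a tenth of a second, which is why the disk space manager tries to satisfy a request for a sequence of pages with pages stored sequentially on disk.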
Buffer Management
    Query Optimization and Execution
    Relational Operators
    Files and Access Methods
    Buffer Management   <-- You are here…
    Disk Space Management
    DB
• Data must be in RAM for DBMS to operate on it!
• Buffer Mgr hides the fact that not all data is in RAM

Buffer Management in a DBMS
[Figure: page requests from higher levels arrive at the BUFFER POOL in MAIN MEMORY, whose frames hold disk pages or are free; pages move between the pool and the DB on DISK; choice of frame dictated by replacement policy]
• Buffer pool information table contains: <frame#, pageid, pin_count, dirty>

Requesting a page
[Figure: a higher-level DBMS component tells the Buf Mgr “I need page 3”; the buffer pool holds page 22 and free frames, so the Buf Mgr passes “I need page 3” to the Disk Mgr, and page 3 is read from DISK (pages 1, 2, 3, …, 22, …, 90) into a free frame]
• If requests can be predicted (e.g., sequential scans), pages can be pre-fetched several pages at a time!

Releasing a page
[Figure: the higher-level DBMS component tells the Buf Mgr “I read page 3 and I’m done with it”; the buffer pool holds pages 22 and 3]

Releasing a page
[Figure: the higher-level DBMS component tells the Buf Mgr “I wrote on page 3 and I’m done with it”; the buffer pool holds the modified page 3’, which is written back to DISK in place of page 3]

More on Buffer Management
• Requestor of page must eventually unpin it, and indicate whether page has been modified:
  – dirty bit is used for this.
• Page in pool may be requested many times,
  – a pin count is used.
  – To pin a page, pin_count++
  – A page is a candidate for replacement iff pin count == 0 (“unpinned”)
• Concurrency control (CC) and recovery may entail additional I/O when a frame is chosen for replacement.
  – Write-Ahead Log protocol; more later!
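
The pin/unpin protocol above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular DBMS implements it: `BufferPool`, `Frame`, and the dict standing in for the disk are all hypothetical names, the replacement policy is simply "first unpinned frame found" (the slides do not specify one), and the extra I/O for concurrency control and recovery is left out.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    data: str              # page contents held in this frame
    pin_count: int = 0     # number of requestors currently using the page
    dirty: bool = False    # True if the page was modified while in the pool


class BufferPool:
    """Toy buffer manager (hypothetical sketch).
    `disk` is a dict page_id -> contents standing in for the disk manager."""

    def __init__(self, disk: dict, num_frames: int):
        self.disk = disk
        self.num_frames = num_frames
        self.frames = {}   # page_id -> Frame

    def pin(self, page_id) -> Frame:
        """Request a page: load it if absent, then pin_count++."""
        frame = self.frames.get(page_id)
        if frame is None:
            if len(self.frames) >= self.num_frames:
                self._evict()
            frame = Frame(data=self.disk[page_id])   # READ from "disk"
            self.frames[page_id] = frame
        frame.pin_count += 1
        return frame

    def unpin(self, page_id, dirty: bool) -> None:
        """Release a page and say whether it was modified."""
        frame = self.frames[page_id]
        if frame.pin_count == 0:
            raise ValueError("page is not pinned")
        frame.pin_count -= 1
        frame.dirty = frame.dirty or dirty

    def _evict(self) -> None:
        """Assumed policy: replace the first frame with pin_count == 0,
        writing it back to "disk" first if its dirty bit is set."""
        victim = next(
            (pid for pid, f in self.frames.items() if f.pin_count == 0), None
        )
        if victim is None:
            raise RuntimeError("all frames are pinned")
        frame = self.frames.pop(victim)
        if frame.dirty:
            self.disk[victim] = frame.data           # WRITE to "disk"
```

A requestor that writes on page 3 would call `pin(3)`, change `frame.data`, then call `unpin(3, dirty=True)`; the modified page reaches the disk only when its frame is later chosen for replacement.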