541: Database Systems S. Muthu Muthukrishnan 1 Course Project Pick a dataset. Stock market data, US patent data, web data, internet traffic data. UC Irvine data repository. http://odwin.ucsd.edu/idata/ Set of conf papers: http://www.acm.org/sigs/sigmod/record/xml/ Medical, ecological, biological, text, movie database. Rutgers labs.. How to collect it? How to make it up? HW 0: Decide by 02/26. Submit a writeup of what data, how you will collect it, how much, what application you will build—what queries are important, what challenges you foresee, schedule+timeline and how you are going to divide work, etc. Midterm project review 03/25. Experiment with different indices, join methods, different ways of posing queries, schemas, etc. Project demo and project writeup due: 04/22. Check out http://paul.rutgers.edu/~eiman/cs541_fall03.html for details. 2 Homeworks HW 1:1 Write a short note about yourself: your educational background, interest in d/b, career goals, anything you’d like to bring to my attention. EX 1-2: Data procuring. Can you build a web crawler to pull data into a flat file. (Extra Credit). We looked at an overview of DBMS. Question: ER diagrams. 3 Exercise Consider Rutgers Univ data comprising information about various departments, courses and students, and determine the number of bytes needed to store this data for one year. Write a report discussing what is the data you considered, how you estimated the size of various data sets, how you estimated the number of bytes needed to store them. Size? Problems? How does it grow? Solve problem 2.5 in the book. Describe major design decisions. Puzzles. 4 Data(base) Compressions Homework: Study data compression, write about issues in database compression versus data compression and table compression in data warehouses. Use www.cs.wisc.edu/~joldst/ compressing relations and indexes. Use Table compression paper. How to compress the web data? 5 Homework II Solve problem 2.5 in the book. Describe major design decisions. 6 The Relational Model Chapter 3 Most widely used in commerical databases. 7 Relational Database: Definitions Relational database: a set of relations Relation: made up of 2 parts: Instance : a table, with rows and columns. #Rows = cardinality, #fields = degree / arity. Schema : specifies name of relation, plus name and type of each column. E.G. Students(sid: string, name: string, login: string, age: integer, gpa: real). Can think of a relation as a set of rows or tuples (i.e., all rows are distinct). 8 Example Instance of Students Relation Fields/attributes/columns Field names Records/ Tuples/ Rows sid 53666 53688 53650 name login Jones jones@cs Smith smith@eecs Smith smith@math age 18 18 19 gpa 3.4 3.2 3.8 Cardinality = 3, degree = 5, all rows distinct Do all columns in a relation instance have to be distinct? 9 Relational Query Languages A major strength of the relational model: supports simple, powerful querying of data. Queries can be written intuitively, and the DBMS is responsible for efficient evaluation. The key: precise semantics for relational queries. Allows the optimizer to extensively re-order operations, and still ensure that the answer does not change. 10 The SQL Query Language Developed by IBM (system R) in the 1970s Need for a standard since it is used by many vendors Standards: SQL-86 SQL-89 (minor revision) SQL-92 (major revision, current standard) SQL-99 (major extensions) 11 The SQL Query Language To find all 18 year old students, we can write: SELECT * FROM Students S WHERE S.age=18 sid name 53666 Jones login jones@cs age gpa 18 3.4 53688 Smith smith@ee 18 3.2 *: All fields. S: Variable over tuples. •To find just names and logins, replace the first line: SELECT S.name, S.login 12 Querying Multiple Relations What does the following query compute? SELECT S.name, E.cid FROM Students S, Enrolled E WHERE S.sid=E.sid AND E.grade=“A” Given the following instance of Enrolled: we get: sid 53831 53831 53650 53666 cid grade Carnatic101 C Reggae203 B Topology112 A History105 B S.name E.cid Smith Topology112 13 Creating Relations in SQL CREATE TABLE Students (sid: CHAR(20), Creates the Students relation. name: CHAR(20), Observe that the type (domain) login: CHAR(10), of each field is specified, and age: INTEGER, enforced by the DBMS whenever gpa: REAL) tuples are added or modified. As another example, the Enrolled table holds information about CREATE TABLE Enrolled courses that students take. (sid: CHAR(20), cid: CHAR(20), grade: CHAR(2)) 14 Destroying and Altering Relations DROP TABLE Students Destroys the relation Students. The schema information and the tuples are deleted. ALTER TABLE Students ADD COLUMN firstYear: integer The schema of Students is altered by adding a new field; every tuple in the current instance is extended with a null value in the new field. 15 Adding and Deleting Tuples Can insert a single tuple using: INSERT INTO Students (sid, name, login, age, gpa) VALUES (53688, ‘Smith’, ‘smith@ee’, 18, 3.2) Can delete all tuples satisfying some condition (e.g., name = Smith): DELETE FROM Students S WHERE S.name = ‘Smith’ What happens when the WHERE clause checks a condition that involves modified attribute? 16 Integrity Constraints (ICs) IC: condition that must be true for any instance of the database; e.g., domain constraints. ICs are specified when schema is defined. ICs are checked when relations are modified. A legal instance of a relation is one that satisfies all specified ICs. DBMS should not allow illegal instances. If the DBMS checks ICs, stored data is more faithful to realworld meaning. Avoids data entry errors, too! 17 Primary Key Constraints A set of fields is a key for a relation if : 1. No two distinct tuples can have same values in all key fields, and 2. This is not true for any subset of the key. Part 2 false? A superkey. If there’s >1 key for a relation, one of the keys is chosen (by DBA) to be the primary key. E.g., sid is a key for Students. (What about name?) The set {sid, gpa} is a superkey. Primary key can not have null value 18 Primary and Candidate Keys in SQL Possibly many candidate keys (specified using UNIQUE), one of which is chosen as the primary key. “For a given student and course, CREATE TABLE Enrolled (sid CHAR(20) there is a single grade.” vs. cid CHAR(20), “Students can take only one grade CHAR(2), course, and receive a single grade PRIMARY KEY (sid,cid) ) for that course; further, no two CREATE TABLE Enrolled students in a course receive the (sid CHAR(20) same grade.” cid CHAR(20), Used carelessly, an IC can prevent grade CHAR(2), the storage of database instances PRIMARY KEY (sid), that arise in practice! UNIQUE (cid, grade) ) 19 Foreign Keys, Referential Integrity Foreign key : Set of fields in one relation that is used to `refer’ to a tuple in another relation. (Must correspond to primary key of the second relation.) Like a `logical pointer’. E.g. sid is a foreign key referring to Students: Enrolled(sid: string, cid: string, grade: string) If all foreign key constraints are enforced, referential integrity is achieved, i.e., no dangling references. Can you name a data model w/o referential integrity? Links in HTML! 20 Foreign Keys in SQL Only students listed in the Students relation should be allowed to enroll for courses. CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY (sid,cid), FOREIGN KEY (sid) REFERENCES Students ) Enrolled sid 53666 53666 53650 53666 cid grade Carnatic101 C Reggae203 B Topology112 A History105 B Students sid 53666 53688 53650 name login Jones jones@cs Smith smith@eecs Smith smith@math age 18 18 19 gpa 3.4 3.2 3.8 21 Enforcing Referential Integrity Consider Students and Enrolled; sid in Enrolled is a foreign key that references Students. What should be done if an Enrolled tuple with a nonexistent student id is inserted? (Reject it!) What should be done if a Students tuple is deleted? Also delete all Enrolled tuples that refer to it. Disallow deletion of a Students tuple that is referred to. Set sid in Enrolled tuples that refer to it to a default sid. (In SQL, also: Set sid in Enrolled tuples that refer to it to a special value null, denoting `unknown’ or `inapplicable’.) Similar if primary key of Students tuple is updated. 22 Referential Integrity in SQL/92 SQL/92 supports all 4 options CREATE TABLE Enrolled on deletes and updates. (sid CHAR(20), cid CHAR(20), Default is NO ACTION grade CHAR(2), (delete/update is rejected) PRIMARY KEY (sid,cid), CASCADE (also delete all FOREIGN KEY (sid) tuples that refer to deleted REFERENCES Students tuple) ON DELETE CASCADE SET NULL / SET DEFAULT ON UPDATE SET DEFAULT ) (sets foreign key value of referencing tuple) 23 Where do ICs Come From? ICs are based upon the semantics of the real-world enterprise that is being described in the database relations. We can check a database instance to see if an IC is violated, but we can NEVER infer that an IC is true by looking at an instance. An IC is a statement about all possible instances! From example, we know name is not a key, but the assertion that sid is a key is given to us. Key and foreign key ICs are the most common; more general ICs supported too. 24 Views A view is just a relation, but we store a definition, rather than a set of tuples. CREATE VIEW YoungActiveStudents (name, grade) AS SELECT S.name, E.grade FROM Students S, Enrolled E WHERE S.sid = E.sid and S.age<21 Views can be dropped using the DROP VIEW command. How to handle DROP TABLE if there’s a view on the table? DROP TABLE command has options to let the user specify this. 25 Views and Security Views can be used to present necessary information (or a summary), while hiding details in underlying relation(s). Given YoungStudents, but not Students or Enrolled, we can find students s who have are enrolled, but not the cid’s of the courses they are enrolled in. 26 How to Update Views? Users need to know the difference between views and the base tables. Problem: Modifying the view leads to modifying the underlying base tables which needs lot of care: delete a row (what happens when the key is not part of the view?) insert a row. 27 Homework III.2 Pick a database of your interest. Define 2 or 3 useful views. Discuss difficulties with updating the views, giving examples. Is it worth allowing users to update views? What do commerical systems do? 28 Logical DB Design: ER to Relational Entity sets to tables. ssn name Employees lot CREATE TABLE Employees (ssn CHAR(11), name CHAR(20), lot INTEGER, PRIMARY KEY (ssn)) 29 Relationship Sets to Tables In translating a relationship set to a relation, attributes of the relation must include: Keys for each participating entity set (as foreign keys). This set of attributes forms a superkey for the relation. All descriptive attributes. CREATE TABLE Works_In( ssn CHAR(1), did INTEGER, since DATE, PRIMARY KEY (ssn, did), FOREIGN KEY (ssn) REFERENCES Employees, FOREIGN KEY (did) REFERENCES Departments) 30 Review: Key Constraints since Each dept has at most one manager, according to the key constraint on Manages. name ssn dname lot Employees did Manages budget Departments Translation to relational model? 1-to-1 1-to Many Many-to-1 Many-to-Many 31 Translating ER Diagrams with Key Constraints Map relationship to a table: Note that did is the key now! Separate tables for Employees and Departments. Since each department has a unique manager, we could instead combine Manages and Departments. CREATE TABLE Manages( ssn CHAR(11), did INTEGER, since DATE, PRIMARY KEY (did), FOREIGN KEY (ssn) REFERENCES Employees, FOREIGN KEY (did) REFERENCES Departments) CREATE TABLE Dept_Mgr( did INTEGER, dname CHAR(20), budget REAL, ssn CHAR(11), since DATE, PRIMARY KEY (did), FOREIGN KEY (ssn) REFERENCES Employees) 32 Review: Participation Constraints Does every department have a manager? If so, this is a participation constraint: the participation of Departments in Manages is said to be total (vs. partial). Every did value in Departments table must appear in a row of the Manages table (with a non-null ssn value!) since name ssn dname did lot Employees Manages budget Departments Works_In since 33 Participation Constraints in SQL We can capture participation constraints involving one entity set in a binary relationship, but little else (without resorting to CHECK constraints). CREATE TABLE Dept_Mgr( did INTEGER, dname CHAR(20), budget REAL, ssn CHAR(11) NOT NULL, since DATE, PRIMARY KEY (did), FOREIGN KEY (ssn) REFERENCES Employees, ON DELETE NO ACTION) 34 Review: Weak Entities A weak entity can be identified uniquely only by considering the primary key of another (owner) entity. Owner entity set and weak entity set must participate in a one-tomany relationship set (1 owner, many weak entities). Weak entity set must have total participation in this identifying relationship set. name ssn lot Employees cost Policy pname age Dependents 35 Translating Weak Entity Sets Weak entity set and identifying relationship set are translated into a single table. When the owner entity is deleted, all owned weak entities must also be deleted. CREATE TABLE Dep_Policy ( pname CHAR(20), age INTEGER, cost REAL, ssn CHAR(11) NOT NULL, PRIMARY KEY (pname, ssn), FOREIGN KEY (ssn) REFERENCES Employees, ON DELETE CASCADE) 36 Review: Binary vs. Ternary Relationships name ssn If each policy is owned by just 1 employee: Key constraint on Policies would mean policy can only cover 1 dependent! What are the additional constraints in the 2nd diagram? lot Employees age Dependents Covers Bad design Policies policyid cost name ssn pname pname lot age Dependents Employees Purchaser Better design policyid Beneficiary Policies cost 37 Binary vs. Ternary Relationships (Contd.) CREATE TABLE Policies ( The key constraints policyid INTEGER, allow us to combine cost REAL, Purchaser with ssn CHAR(11) NOT NULL, Policies and PRIMARY KEY (policyid). Beneficiary with FOREIGN KEY (ssn) REFERENCES Employees, Dependents. ON DELETE CASCADE) Participation CREATE TABLE Dependents ( constraints lead to pname CHAR(20), NOT NULL age INTEGER, constraints. policyid INTEGER, What if Policies is a PRIMARY KEY (pname, policyid). weak entity set? FOREIGN KEY (policyid) REFERENCES Policies, ON DELETE CASCADE) 38 Relational Model: Summary A tabular representation of data. Simple and intuitive, currently the most widely used. Integrity constraints can be specified by the DBA, based on application semantics. DBMS checks for violations. Two important ICs: primary and foreign keys In addition, we always have domain constraints. Powerful and natural query languages exist. Rules to translate ER to relational model 39 Homework III.3 Homework III.3: Solve problem 3.15. 40 Relational Algebra Chapter 4, Part A 41 Relational Query Languages Query languages: Allow manipulation and retrieval of data from a database. Relational model supports simple, powerful QLs: Strong formal foundation based on logic. Allows for much optimization. Query Languages != programming languages! QLs not expected to be “Turing complete”. QLs not intended to be used for complex calculations. QLs support easy, efficient access to large data sets. 42 Formal Relational Query Languages Two mathematical Query Languages form the basis for “real” languages (e.g. SQL), and for implementation: Relational Algebra: More operational, very useful for representing execution plans. Relational Calculus: Lets users describe what they want, rather than how to compute it. (Nonoperational, declarative.) Understanding Algebra & Calculus is key to understanding SQL, query processing! 43 Preliminaries A query is applied to relation instances, and the result of a query is also a relation instance. Schemas of input relations for a query are fixed (but query will run regardless of instance!) The schema for the result of a given query is also fixed! Determined by definition of query language constructs. Positional vs. named-field notation: Positional notation easier for formal definitions, namedfield notation more readable. Both used in SQL 44 R1 sid Example Instances “Sailors” and “Reserves” sid S1 relations for our examples. 22 We’ll use positional or named 31 field notation, assume that names of fields in query results 58 are `inherited’ from names of fields in query input relations. S2 sid 28 31 44 58 22 58 bid day 101 10/10/96 103 11/12/96 sname rating age dustin 7 45.0 lubber 8 55.5 rusty 10 35.0 sname rating age yuppy 9 35.0 lubber 8 55.5 guppy 5 35.0 rusty 10 35.0 45 Relational Algebra Basic operations: Selection ( ) Selects a subset of rows from relation. Projection ( ) Deletes unwanted columns from relation. Cross-product ( ) Allows us to combine two relations. Set-difference ( ) Tuples in reln. 1, but not in reln. 2. Union ( ) Tuples in reln. 1 and in reln. 2. Additional operations: Intersection, join, division, renaming: Not essential, but (very!) useful. Since each operation returns a relation, operations can be composed! (Algebra is “closed”.) 46 Projection Deletes attributes that are not in projection list. Schema of result contains exactly the fields in the projection list, with the same names that they had in the (only) input relation. Projection operator has to eliminate duplicates! (Why??) Note: real systems typically don’t do duplicate elimination unless the user explicitly asks for it. (Why not?) sname rating yuppy lubber guppy rusty 9 8 5 10 sname,rating(S2) age 35.0 55.5 age(S2) 47 Selection Selects rows that satisfy selection condition. No duplicates in result! (Why?) Schema of result identical to schema of (only) input relation. Result relation can be the input for another relational algebra operation! (Operator composition.) sid sname rating age 28 yuppy 9 35.0 58 rusty 10 35.0 rating 8(S2) sname rating yuppy 9 rusty 10 sname,rating( rating 8(S2)) 48 Union, Intersection, SetDifference sid sname rating All of these operations take two input relations, which must be union-compatible: Same number of fields. `Corresponding’ fields have the same type. What is the schema of result? sid sname 22 dustin rating age 7 45.0 S1 S2 22 31 58 44 28 dustin lubber rusty guppy yuppy 7 8 10 5 9 age 45.0 55.5 35.0 35.0 35.0 S1 S2 sid sname 31 lubber 58 rusty rating age 8 55.5 10 35.0 S1 S2 49 Cross-Product Each row of S1 is paired with each row of R1. Result schema has one field per field of S1 and R1, with field names `inherited’ if possible. Conflict: Both S1 and R1 have a field called sid. (sid) sname rating age (sid) bid day 22 dustin 7 45.0 22 101 10/10/96 22 dustin 7 45.0 58 103 11/12/96 31 lubber 8 55.5 22 101 10/10/96 31 lubber 8 55.5 58 103 11/12/96 58 rusty 10 35.0 22 101 10/10/96 58 rusty 10 35.0 58 103 11/12/96 Renaming operator: (C(1 sid1, 5 sid 2), S1 R1) 50 Joins R c S c ( R S) Condition Join: (sid) 22 31 sname dustin lubber rating age 7 45.0 8 55.5 (sid) 58 58 bid 103 103 day 11/12/96 11/12/96 Result schema same as that of cross-product. Fewer tuples than cross-product, might be able to compute more efficiently Sometimes called a theta-join. S1 S1. sid R1. sid R1 51 Joins Equi-Join: A special case of condition join where the condition c contains only equalities. sid 22 58 sname dustin rusty rating age 7 45.0 10 35.0 S1 bid 101 103 day 10/10/96 11/12/96 R1 sid Result schema similar to cross-product, but only one copy of fields for which equality is specified. Natural Join: Equijoin on all common fields. 52 Division Not supported as a primitive operator, but useful for expressing queries like: Find sailors who have reserved all boats. Let A have 2 fields, x and y; B have only field y: x | x, y A y B A/B = i.e., A/B contains all x tuples (sailors) such that for every y tuple (boat) in B, there is an xy tuple in A. Or: If the set of y values (boats) associated with an x value (sailor) in A contains all y values in B, the x value is in A/B. In general, x and y can be any lists of fields; y is the list of fields in B, and x y is the list of fields of A. 53 Examples of Division A/B sno s1 s1 s1 s1 s2 s2 s3 s4 s4 pno p1 p2 p3 p4 p1 p2 p2 p2 p4 A pno p2 B1 pno p2 p4 B2 pno p1 p2 p4 B3 sno s1 s2 s3 s4 sno s1 s4 sno s1 A/B1 A/B2 A/B3 54 Expressing A/B Using Basic Operators Division is not essential op; just a useful shorthand. (Also true of joins, but joins are so common that systems implement joins specially.) Idea: For A/B, compute all x values that are not `disqualified’ by some y value in B. x value is disqualified if by attaching y value from B, we obtain an xy tuple that is not in A. Disqualified x values: A/B: x ( A) x (( x ( A) B) A) all disqualified tuples 55 Find names of sailors who’ve reserved boat #103 Solution 1: Solution 2: sname(( bid 103 (Temp1, Reserves) Sailors) bid 103 Re serves) ( Temp2, Temp1 Sailors) sname (Temp2) Solution 3: sname ( bid 103 (Re serves Sailors)) 56 Find names of sailors who’ve reserved a red boat Information about boat color only available in Boats; so need an extra join: sname (( Boats) Re serves Sailors) color ' red ' A more efficient solution: sname ( (( Boats) Re s) Sailors) sid bid color ' red ' A query optimizer can find this given the first solution! 57 Find sailors who’ve reserved a red or a green boat Can identify all red or green boats, then find sailors who’ve reserved one of these boats: (Tempboats, ( color ' red ' color ' green ' Boats)) sname(Tempboats Re serves Sailors) Can also define Tempboats using union! (How?) What happens if is replaced by in this query? 58 Find sailors who’ve reserved a red and a green boat Previous approach won’t work! Must identify sailors who’ve reserved red boats, sailors who’ve reserved green boats, then find the intersection (note that sid is a key for Sailors): (Tempred, sid (Tempgreen, (( sid color ' red ' (( Boats) Re serves)) color ' green' Boats) Re serves)) sname((Tempred Tempgreen) Sailors) 59 Find the names of sailors who’ve reserved all boats Uses division; schemas of the input relations to / must be carefully chosen: (Tempsids, ( sid, bid Re serves) / ( bid Boats)) sname (Tempsids Sailors) To find sailors who’ve reserved all ‘Interlake’ boats: ..... / bid ( bname ' Interlake' Boats) 60 Summary The relational model has rigorously defined query languages that are simple and powerful. Relational algebra is more operational; useful as internal representation for query evaluation plans. Several ways of expressing a given query; a query optimizer should choose the most efficient version. 61