CS 540 Database Management Systems
Lecture 2: Relational Model

Contributions of the paper
• Concepts of data independence and declarative queries
  – Advocates higher-level, more natural modeling
  – Argues for declarative languages
• Definition of the relational model
  – Data structure based on relations
  – A set of operations (an algebra) to manipulate data
• Formal notions
  – Expressive power, redundancy, and consistency

The key idea
“Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation)”
• It started relational databases.

Data model
• An abstraction of data and how it is used
• Elements of a data model
  – Structural part: mathematical structures to describe data
  – Operations: a set of operations used to query and/or manipulate the data
• A way of thinking about information
• A successful paradigm in data management

The state of the world before the relational model
• Network/hierarchical DBMS, 1960s
  – IDS network DBMS: Bachman at GE, 1961
  – IMS hierarchical DBMS: IBM in 1968
• CODASYL approach to data management, 1960s
  – CODASYL: Conference on Data Systems Languages, set up by the US DOD to standardize software applications
  – COBOL (Common Business-Oriented Language), defined by CODASYL
    • ruled the business data processing world
• DBTG (Database Task Group, under CODASYL), 1971
  – closely aligned with COBOL
  – the DBTG Report would standardize the network model (Bachman received the Turing Award in 1973 for the network model)

Network model: DBTG report
• Network DB
  – A collection of records
    • record = collection of fields
    • similar to an entity in the ER model
  – Records connected by binary, many-to-one links
    • similar to binary relationships in ER
    • one-to-one and many-to-many links are simulated with many-to-one links
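
To make the record-and-link structure concrete, here is a minimal Python sketch (not DBTG syntax) of records connected by many-to-one links. The class names, the `enroll` helper, and the use of Python lists for the "many" side are assumptions made for illustration; the sample values echo the Johnson record used in the implementation slide. A many-to-many relationship between students and courses is simulated by an enrollment record that has a many-to-one link to each side.

```python
from dataclasses import dataclass, field

# eq=False keeps identity-based equality, which avoids recursive
# comparisons through the cyclic links between records.
@dataclass(eq=False)
class Student:
    name: str
    enrollments: list = field(default_factory=list)  # "many" side of the link

@dataclass(eq=False)
class Course:
    code: str
    enrollments: list = field(default_factory=list)  # "many" side of the link

@dataclass(eq=False)
class Enrollment:
    grade: str
    student: Student  # many-to-one link: one owner student per enrollment
    course: Course    # many-to-one link: one owner course per enrollment

def enroll(student, course, grade):
    """Hypothetical helper: create an enrollment and wire up both links."""
    e = Enrollment(grade, student, course)
    student.enrollments.append(e)
    course.enrollments.append(e)
    return e

johnson = Student("Johnson")
cs001 = Course("CS001")
cs308 = Course("CS308")
enroll(johnson, cs001, "A+")
enroll(johnson, cs308, "B")

# Navigational access: start at one record and follow its links.
grades = [(e.course.code, e.grade) for e in johnson.enrollments]
print(grades)
```

Every question asked of this structure has to be phrased as a walk along the links, which is the navigational style the following slides illustrate.
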
Network model: implementation
• Student record linked to enrollment records
  [Figure: student record "Johnson …" linked to enrollment records (CS001, A+) and (CS308, B)]
• A lot of linkage pointers
  – ring-structured pointers implement many-to-one links
• Data manipulation is thus navigational

DBTG query example
SQL:
    select name
    from student
    where dept = 'EECS'
DBTG:
    student.dept = "EECS";
    find any student using dept;
    while DB-status = 0 do
    begin
        get student;
        print (student.name);
        find duplicate student using dept;
    end

DBTG query example: predicates
SQL:
    select name
    from student
    where dept = 'EECS' and gender = 'Male'
DBTG:
    student.dept = "EECS";
    find any student using dept;
    while DB-status = 0 do
    begin
        get student;
        if student.gender = "Male" then
            print (student.name);
        find duplicate student using dept;
    end

DBTG query example: navigation
SQL:
    select E.grade
    from student S, enrollment E
    where S.name = 'Johnson' and E.id = S.id
DBTG:
    student.name = "Johnson";
    find any student using name;
    find first enrollment within StudentEnroll;
    while DB-status = 0 do
    begin
        get enrollment;
        print (enrollment.grade);
        find next enrollment within StudentEnroll;
    end

What’s wrong?
• Mixes presentation and data access
• As a result,
  – Programming is difficult and complex
  – An application can become incorrect once the data representation changes
• Just like programming in assembly language (as opposed to a high-level programming language)

Data dependence
• Ordering dependence
  – Applications may rely on a particular ordering of the stored data
  – example?
• Indexing dependence
  – Applications may rely on the availability of certain indices, but indices are semantically redundant and only necessary for “optimization”.
  – example?
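
One possible answer to the ordering "example?" prompt, as a minimal Python sketch. Everything here is hypothetical: the record layout, the names Adams and Smith, and both lookup functions are assumptions made for illustration. The first lookup silently depends on the records being stored sorted by name; the second only states which record is wanted.

```python
from bisect import bisect_left

def find_by_name_sorted(records, name):
    """Binary search; correct ONLY if records are stored sorted by name."""
    names = [r["name"] for r in records]
    i = bisect_left(names, name)
    return records[i] if i < len(names) and names[i] == name else None

def find_by_name(records, name):
    """Order-independent lookup: says what is wanted, not how it is stored."""
    return next((r for r in records if r["name"] == name), None)

# Hypothetical stored data, physically ordered by name.
sorted_storage = [
    {"name": "Adams", "dept": "EECS"},
    {"name": "Johnson", "dept": "EECS"},
    {"name": "Smith", "dept": "CS"},
]

# Storage is reorganized (here: ordered by dept), so the physical order changes.
reorganized = sorted(sorted_storage, key=lambda r: r["dept"])

print(find_by_name_sorted(sorted_storage, "Adams"))  # finds the record
print(find_by_name_sorted(reorganized, "Adams"))     # misses it: ordering dependence
print(find_by_name(reorganized, "Adams"))            # still finds the record
```

The ordering-dependent application breaks without any change to the data itself, only to how it is laid out.
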
Data dependence
• Access path dependence
  – Applications hard-code access paths to data, so they rely on the continued existence of those access paths

Levels of abstraction in DBMS
• Physical implementation
  – storage structure, indexing, access method
• Logical data model
  – conceptual data structure and manipulation
• Views
  – different portions of databases
• Who should see each level?

Relational model of data
• Relations
  – given sets S1, S2, …, Sn (not necessarily distinct)
  – relation R is a subset of the Cartesian product S1 x S2 x … x Sn
  – Sj is the jth domain of R, n is the degree of R
• Relations as tables
  – each row represents an n-tuple of R
  – ordering of rows is immaterial
  – all rows are distinct
  – ordering of columns is significant
  – label each column with the name of the corresponding domain

Relation: example
Relation name: Book
Attribute names: Title, Price, Category, Year
Tuples:
    Title           Price     Category   Year
    MySQL           $102.1    computer   2001
    Cell biology    $201.69   biology    1954
    French cinema   $53.99    art        2002
    NBA History     $63.65    sport      2010

Data manipulation
• Relational algebra
  – operations
• Relational calculus
  – semantics in terms of logic
• Essential beauty of the relational model:
  – Query is data and data is query!
  [Figure: named relations and expressible relations]

Operations on relations
• Usual set operations
  – since relations are sets of homogeneous tuples

Operations on relations: deriving relations
• Permutation
  – interchange the columns of an n-ary relation
• Projection
  – select columns and remove any duplicate rows
• Join
  – selectively combine tuples of two relations
  – defined as a “class” of new relations, each of which losslessly takes columns from both source relations
• Composition
  – join two relations and remove the join columns
• Restriction
  – filter one relation with another
• You can combine them to write queries.
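
The deriving operations listed above can be sketched in Python with relations represented as sets of tuples, so rows are distinct and unordered while column position is significant. This is a simplified illustration rather than the paper's formal definitions: the function names, the supplier/part sample data, the choice of the natural join as the one join shown, and the semi-join style reading of restriction are all assumptions.

```python
def project(rel, cols):
    """Keep the listed column positions; the set removes duplicate rows."""
    return {tuple(t[c] for c in cols) for t in rel}

def permute(rel, order):
    """Reorder columns; `order` must be a permutation of all positions."""
    return project(rel, order)

def join(r, s):
    """Natural join of binary relations r(a, b) and s(b, c) on the b column."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def compose(r, s):
    """Join r and s, then drop the join column."""
    return project(join(r, s), (0, 2))

def restrict(r, r_cols, s, s_cols):
    """Simplified restriction: keep rows of r whose r_cols values occur in s_cols of s."""
    allowed = project(s, s_cols)
    return {t for t in r if tuple(t[c] for c in r_cols) in allowed}

# Hypothetical sample data.
supplies = {(1, "bolt"), (2, "bolt"), (2, "nut")}       # (supplier, part)
used_in = {("bolt", "bridge"), ("nut", "tower")}        # (part, project)

print(project(supplies, (1,)))        # the distinct parts
print(permute(supplies, (1, 0)))      # (part, supplier)
print(join(supplies, used_in))        # (supplier, part, project)
print(compose(supplies, used_in))     # (supplier, project)
print(restrict(supplies, (1,), {("nut", "tower")}, (0,)))
```

Because every operation takes relations and returns a relation, the calls nest freely, which is what "combine them to write queries" amounts to.
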
Algebra: questions
• What is missing in this set of operators?
• Is it minimal?
• How is it different from the “current” algebra?

The relational algebra (now)
• Basic operations:
  – Selection (σ): selects a subset of rows from a relation.
  – Projection (π): deletes unwanted columns from a relation.
  – Cross-product (×): allows us to combine two relations.
  – Set-difference (−): tuples in reln. 1, but not in reln. 2.
  – Union (∪): tuples in reln. 1 or in reln. 2.
  – Renaming (ρ)

Redundancy and consistency
• Redundancy
  – something is redundant if it can be “derived” from other data
  – foundation: which operations are allowed in a “derivation”
• Consistency
  – a data snapshot must satisfy some constraints
• Started research in schema design
  – e.g., normal forms; normalization

The advantages of the relational model
• Simplicity!
• Mathematically complete data model
• Declarative query languages
  – queries can be automatically compiled, executed, and optimized without resorting to low-level programming

Does the relational model provide data independence?
• Ordering independence?
• Index independence?
• Access path independence?

Does the relational model provide access path independence?
• A schema for an HR database:
    EmployeeDept(employee, department)
    ManagerDept(manager, department)
  – Find the manager of each employee.
• Another schema for the same HR database:
    EmpManagerDept(manager, employee, department)
  – Find the manager of each employee.
  – different query!
• The relational model is not fully access path independent.

The access path independence of the relational model versus previous models
• Which one has more variations for the same data?

Solution to access path dependence
• Universal relation
  – Always join all relations into one universal relation
    • The universal relation has all attributes.
  – Write your queries against the universal relation.
  – Problems?
• Schema independence
  – Query interfaces that return the same answers for the same query over different schemas of the same data.
  – Successfully deployed for some types of queries.
  – Problems?

Unexpected benefits
• Client-server architecture
  – SQL request/response enables a high-level, compact exchange between clients and server
  – clients: input and output, application logic
  – server: data processing
• Parallel processing: relations in and out
  – pipeline: piping the output of one op into the next
  – partition: N op-clones, each processing 1/N of the input
• Graphical user interfaces
  – relations fit the spreadsheet (table) metaphor

The rise of the relational model
• Codd’s paper in 1970
  – resistance even within IBM
  – too mathematical, no system (students raised the same questions!)
• First implementations, 1973
  – System R at IBM San Jose Lab
  – INGRES at UC Berkeley
• The “Great Debate” at the 1975 SIGMOD conference

The great debate (SIGMOD 1975)
• COBOL/CODASYL camp on relational
  – too mathematical (to understand)
• Relational camp on COBOL/CODASYL
  – too complicated (to program)

Relational model/system impact
• Codd’s paper published in 1970
• First implementations, 1973 onward
  – System R at IBM San Jose Lab, 1974-1978
  – INGRES at UC Berkeley, 1973-1977
• System R influence:
  – IBM DB2
  – Oracle: started from the published spec of System R
• INGRES:
  – team members later founded Sybase
  – Microsoft SQL Server grew out of code bought from Sybase

What has changed over the years?
• Rows need not be distinct now
  – set versus bag semantics
  – SQL: “select distinct” to eliminate duplicates
• Non-simple domains, i.e. complex objects
  – before: only built-in data types allowed
  – new: object-relational DB, multimedia DB
• Generations of relations: temporal aspect
  – temporal databases
    • e.g.: query the GPA as of the end of year 2000

Problems with the relational model
• Data is often hierarchical/complex in nature
  – normalization is an unnatural decomposition of data for storage, to be reassembled by joins at query time.
• Other data models provide more natural representations in many domains.

Network/hierarchical models making a comeback!
• A great deal of graph data sets
  – The Web is a huge network database!
• XML is both navigational and hierarchical
    <student>
      <name>John Smith</name>
      <dept>CS</dept>
      <enrollments>
        <enrollment>
          <course>CS311</course>
          <grade>A+</grade>
        </enrollment>
        … <!-- more enrollments -->
      </enrollments>
    </student>

Domain-specific data models
• Scientific data is better captured by arrays.
• We create new data models for certain domains
  – while preserving the data independence principle.

Questions to think about
• How to manage current network data without losing the benefit of data independence?
• Is there a trade-off between data independence and, say, efficiency?
• How to combine the intuitive nature of the network model with the benefits of the relational model?

Take-away messages
• Raise important research questions
  – See deficiencies in the current state of the world (data dependence)
  – Propose a change to the world that would address some of the deficiencies (declarative queries)
• Leverage principled/mathematical tools (relational algebra).

What is next?
• How to carry out and present your project?
• Overview of some sample projects.