IST459: Notes: The Relational Database Model and Database Development Process

Topic: The Relational Database Model and Database Development Process

Table of Contents

Topic: The Relational Database Model and Database Development Process
  Learning Objectives
  Part 1: The Relational Database Model
    Let’s start with a little history lesson…
    In Theory... and in Practice
  Part 2: Components of the Relational Model
    Relations
    Characteristics of Relations
    Okay… But what makes a table a relation?
    Domains in theory and practice
    Implementing Logical Domain in a DBMS
    Keys
    In practice, what makes a good primary key?
    Null and Flags
    Integrity Rules
  Part 3: Database Development
    Systems Development Lifecycle
    Database Development Lifecycle
    Various Systems Development Lifecycle Strategies
    An alternative development model - Prototyping

Learning Objectives

In this learning unit we will learn about the relational database model and how it is put into practice on the modern DBMS.
We will also explore the models and intricacies of database design and implementation. Some of these objectives will be covered in this document, others in the class lecture, assigned readings, and labs.

• Describe the Relational Model.
• Define relational terms and understand the terminology in practice.
• Explain the System Development Life Cycle (SDLC).
• Explain the Database Life Cycle (DBLC).
• Explain how database development fits within the SDLC.
• Compare and contrast various database SDLC strategies.

Part 1: The Relational Database Model

The relational data model was originally conceived in a paper titled “A Relational Model for Large Shared Data Banks” by IBM’s E. F. Codd in 1970. This paper, and its popularity, started a trend towards commercial relational database management system (RDBMS) products. These applications were conceived during the “disco era” of the 1970s and early 1980s, and have grown into a $20B industry (Forrester Research, 2005), with the major players being Oracle, IBM and Microsoft. Today, the RDBMS is the cornerstone of just about every business application and is used by a wide variety of market segments.

Let’s start with a little history lesson…

The relational data model, or simply the relational model, comes to us from the work of a number of mathematicians and computer researchers. An American mathematician, D. L. Childs, started the movement with his 1968 work “Description of a Set-Theoretic Data Structure”, which used set theory as the basis for querying data. Childs’ work inspired E. F. “Ted” Codd, a researcher working at the IBM San Jose Research Laboratory, to develop what we now take for granted: the relational data model. In the 1970s Codd went on to build early database management system and query language prototypes predicated on a set-based model.
Codd’s early work is documented in a landmark paper, “A Relational Model for Large Shared Data Banks”, where he describes the benefits of the relational model and its significance in managing large data structures. Codd identifies the following advantages of relational structures over the traditional data structures used by the file access methods that were commonplace at the time.

1. Data Independence - “Provides a means of describing data with its natural structure only -- that is, without superimposing any additional structure for machine representation purposes.”
2. Data Consistency - “A further advantage of the relational view is that it forms a sound basis for treating derivability, redundancy, and consistency of relations…”
3. Ease of Use - “… the relational view permits a clearer evaluation of the scope and logical limitations of present formatted data systems, and also the relative merits (from a logical standpoint) of competing representations of data within a single system.”

In Theory... and in Practice

It’s important to differentiate between actual relational theory and what a relational database management system can accomplish in practice. Generally speaking, a DBMS can implement the elements of the relational model, but can also accomplish much, much more. Why? Flexibility. Not every task is suited to the rigidity of the relational model, and DBMS vendors are trying to make a business of offering their customers the most bang for the buck. Of course, all this flexibility comes at a price. The layperson often confuses knowing the intricacies of a software product, such as Microsoft Access, with understanding the nuances of good relational design, when in reality the two are separate skills. This flexibility allows the layperson to “shoot themselves in the foot”, so to speak, and create a poor database design containing redundant data.
So the bottom line is: if you want to design efficient databases with minimal redundancy, learning the relational model will give you far more mileage than learning a particular DBMS.

FUDGE’S “MORE MILEAGE” STATEMENT: IN THIS CLASS, WE WILL EMPHASIZE RELATIONAL DATABASE DESIGN OVER THE SPECIFICS OF A PARTICULAR DBMS APPLICATION. WE WILL ALWAYS DIFFERENTIATE BETWEEN ELEMENTS OF THE RELATIONAL MODEL AND HOW THOSE ELEMENTS ARE TYPICALLY IMPLEMENTED BY VARIOUS DBMS APPLICATIONS, SUCH AS ORACLE OR MS SQL SERVER.

Part 2: Components of the Relational Model

Relations

The relational model focuses on a logical representation of data, rather than a physical one. This means that in an implementation, the programmer can focus on what rather than on how. For example, if you were to write a program to collect information from a webpage and save it into a text file, you would be responsible for the data management tasks, such as how the data gets saved in the file. If you were placing the same form data in a relational DBMS, you would only need to tell it what to add to the database, not how to store it.

The relational model is based on set theory - you know, {c, a, b, s} ∩ {b, a, d, s} = {a, b, s}. There are three main components to the relational model:

1. A structural component - sets of relations, which are tables of data, consisting of rows (tuples) and columns (attributes).
2. A manipulative component of operations which act upon the relations.
3. A set of integrity rules for maintaining integrity within the database.

Characteristics of Relations

At the heart of the relational model is the relation, better known in its implementation as a table. Think of a table as a two-dimensional structure of common data. Each row in the table corresponds to a specific entity while each column corresponds to a particular domain. (The difference between a relation and a table will be discussed below, but for now, you may assume they are synonymous.)
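
The set-theory basis mentioned above can be tried out directly. The following is a minimal sketch in Python, not part of the course materials: it reproduces the intersection example and then models a relation as a set of tuples. The `Employee` type and its sample rows are hypothetical, made up purely for illustration.

```python
from typing import NamedTuple

# The intersection example from the text.
left = {"c", "a", "b", "s"}
right = {"b", "a", "d", "s"}
common = left & right  # set intersection


class Employee(NamedTuple):
    """One tuple (row) of a hypothetical Employee relation."""
    emp_id: int   # attribute (column)
    name: str     # attribute (column)


# A relation modeled as a set of tuples: the order of rows is irrelevant,
# and an exact duplicate row cannot exist inside a set.
employees = {
    Employee(1, "Ann Lee"),
    Employee(2, "Bo Chen"),
    Employee(1, "Ann Lee"),  # exact duplicate, collapsed by the set
}

# The same rows written in a different order form the same relation.
same_rows_other_order = {Employee(2, "Bo Chen"), Employee(1, "Ann Lee")}

if __name__ == "__main__":
    print(sorted(common))
    print(len(employees))
    print(employees == same_rows_other_order)
```

A Python set is only an analogy for a relation: it shows why row order carries no meaning and why duplicate rows do not fit the model, but it says nothing about how a DBMS stores or enforces these properties.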
All relations:

• Have a unique name among the set of all relations in the database.
• Are two-dimensional structures consisting of rows and columns of data. The specific order of the rows and columns is irrelevant.
• Give each column (or attribute) a unique name (within the scope of the relation, not the database).
• Draw each column (or attribute) from a domain, or set of possible values from which the actual values are taken.
• Have a column or set of columns which uniquely identifies each row in the table.

Okay… But what makes a table a relation?

What makes a table a relation? The last mentioned characteristic does - the fact that a column or set of columns must uniquely identify each row. In a DBMS such as MS SQL Server you can create a table where no column (or group of columns) uniquely identifies each row, but if you do, technically it won’t be a relation. Why is this a requirement and why should it matter? Remember, the relational model is based on set theory, and all operations on relations are performed over sets of data. In set theory the order of the elements in a set does not matter, so the only way to retrieve a specific row is to find something which uniquely identifies it. For example, in the figure below, you can’t delete the 2nd Emmitt Smith because there is no way to differentiate it from the 1st row containing the same data.

[Figure: a table containing two identical rows for Emmitt Smith]

So, is there some way we can force a table to be a relation, so that, for example, we would not be able to add Emmitt Smith twice? Yes, we can do this with metadata that we add to the table itself. It will be discussed in the sections below.

Domains in theory and practice

The concept of domain, or set of possible values from which the actual values are taken, can be difficult to implement in practice. Sure, for ordinal values such as GPA, the domain is easy to implement (decimals between 0.000 and 4.000, for example). But what about a column of employee first names?
Now, despite all the crazy names that are out there, most humans can easily tell what is and isn’t a name, such as when an address like “One Park Place” is inadvertently added to the name column. But how can we make the computer understand what is and isn’t a name? It’s a problem with a non-trivial solution.

For this reason, we separate domain into two categories. The definition of domain used so far is called the logical domain, or the set of acceptable values. The DBMS also implements a physical domain, representing the type of data acceptable in the columns. Physical domain has no place in relational theory, but is required in the DBMS because of how computers sort and compare data encoded in different forms. For example, “December” comes before “January” when you compare them as text strings, but when you consider them as dates, January comes before December.

To further complicate the issue, each DBMS implementation addresses physical domain, also known as data types, differently! For example, to store a whole number in IBM DB2 you would typically declare the data type INTEGER, while Microsoft SQL Server documents the equivalent type as int. Yes, they’re similar, but just different enough to make your life difficult!

Implementing Logical Domain in a DBMS

Let’s take a moment to refresh what we’ve learned so far:

• Tables are collections of similar entities (cars, for example). Tables are organized into rows and columns.
• The entities are stored in the rows of the table and represent single instances of data (for example, my car, or your car).
• An attribute describes one particular facet of that entity (my car’s color is green, for example).
• A column in the table represents the domain of data for an entity’s attribute (the car’s color, or mpg rating).
• Domain is represented as both physical and logical domain. Physical domain represents the type of data in the column; logical domain represents the acceptable values.
  (For example, the physical domain of a car’s color would be varchar(20) in the SQL Server DBMS, but the logical domain would be colors. With just physical domain you could enter “Mike” as a color, which would be unacceptable.)

To implement logical domain we use constraints. Constraints are rules and conditions that must be met before data can be added or updated. There are several techniques you can use to implement logical domain over DBMS tables. These include:

• Default Value Constraint - data used for an attribute when a value isn’t specified.
• Check Constraint - an expression which must evaluate to “true” prior to insert or update.
• Unique Constraint - a condition that forbids duplicate values in a column or group of columns.
• Lookup table - a separate table containing all of the acceptable values for a given column. The lookup table is implemented using primary key and foreign key constraints.

PHYSICAL AND LOGICAL DOMAINS ARE METADATA - THEY HELP US DESCRIBE, DEFINE, AND STRUCTURE OUR DATA, AS WELL AS HELP ENFORCE BUSINESS RULES.

Keys

If Yogi Berra were well versed in the relational model, I have no doubt he would say something like “Keys are the key to understanding relational databases. Yes, it’s true, keys are the key!”

A key is a special kind of constraint on an attribute or set of attributes which can be used to look up other attributes. Keys are used to look up a set of rows in a table, to find one specific row in a table, and to find rows in one table based on the attributes of another table. As you will see, the true power of the relational model comes from keys. Here are the types of keys found in relational databases:

• Super Key - Any combination of attributes which can be used to uniquely identify a row in the table.
• Candidate Key - A minimal super key: a set of attributes that uniquely identifies each row in a table and from which no attribute can be removed without losing that uniqueness.
  For example, in an employee table the name, address, and home phone together could be a super key, but the employee social security number would be a candidate key, because it’s minimal. In practice, single-attribute candidate keys should be enforced with a unique constraint (after all, you wouldn’t want two different employee rows with the same social security number, right?).
• Primary Key - A candidate key which has been chosen by the database designer to uniquely identify each row in the table. The primary key cannot contain null (empty) values, and is implemented as a constraint by the DBMS.
• Surrogate Key - A primary key whose values are automatically generated by the DBMS. There is no need on the user’s part to enter a value for the primary key; the system does it for you. Surrogate keys can be an auto-incremented integer, or a GUID or UUID (read the RFC). Some DBMSs, such as Postgres and Oracle, permit the database designer to create global surrogate keys, called sequences.
• Composite (primary) Key - A primary key consisting of more than one column.
• Secondary Key - An attribute or combination of attributes used for row retrieval from a table. In practice, secondary keys are index candidates in the table design.
• Foreign Key - An attribute (or combination thereof) in one table whose values either match those of the primary key in another table or are null. The purpose of the foreign key is to associate one relation with another. It is also implemented as a constraint by the DBMS.

IN PRACTICE, ONLY PRIMARY AND FOREIGN KEYS ARE IMPLEMENTED BY RDBMSS.
SOME DBMSS IMPLEMENT SURROGATE KEYS, AND SECONDARY KEYS ARE ONLY POSSIBLE AS INDEXES.

[Figure: example Courses and Classes tables. The example primary key / foreign key relationship is highlighted in yellow.]

Candidate Keys (for Classes table): Class_id, Class_number, Crs_id + Class_Section (composite)
Primary Keys (as set in the DBMS): Courses: Crs_id; Classes: Class_id
Secondary Keys: Crs_num, Time_ID, Room_ID
Surrogate Keys: Crs_id, Class_id
Foreign Keys (as set in the DBMS): Crs_id (in Classes table), Time_ID (in Classes table), Room_ID (in Classes table)

In practice, what makes a good primary key?

Another appropriate title for this section might be “How to choose a primary key and turn your table into a relation”, but that’s too long for a title.

You might think that any candidate key would make a good primary key, but it is that line of thinking which has caused database designers fits of frustration in the past. In theory, any candidate key will suffice as primary key, but in practice not every candidate key makes an ideal primary key. The ideal primary key has three important characteristics:

• Should be unique within the scope / context of the entire database model, not just the current data. A good primary key is not unique by coincidence, as is the case with a candidate key such as an employee’s date of birth. Sure, in a particular table right now no two employees have the same birth date, but could it happen?
• Should not need to change often or depend on other data for its generation. Columns which are hashes or encodings of other columns make poor choices for primary keys. Because the primary key values are going to be in other tables as foreign keys, you want to avoid having to change them.
• Should not compromise security or performance. You should never use sensitive information such as SSN for a primary key. You should never use long descriptive text such as street addresses as primary keys. Remember, if you select it as a primary key, it is probably going to be a foreign key elsewhere.
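
The constraint and key types described above can be sketched together in a few lines. The example below is a hedged illustration rather than the course’s actual schema: it uses Python’s built-in `sqlite3` module as a convenient stand-in DBMS, the table and column names loosely follow the Courses / Classes key listing, and the `credits` column, its 1 to 6 range, the section code “M001”, and the helper names `build_database` and `try_insert` are all assumptions made up for this sketch. Syntax for the same ideas differs in Oracle, SQL Server, and other products.

```python
import sqlite3

# Hypothetical schema: names are illustrative assumptions, not the course's DDL.
SCHEMA = """
CREATE TABLE courses (
    -- Surrogate primary key: SQLite generates the value when none is supplied.
    crs_id  INTEGER PRIMARY KEY,
    -- Candidate key kept unique with a unique constraint.
    crs_num TEXT NOT NULL UNIQUE,
    -- Default value plus a check constraint narrowing the logical domain.
    credits INTEGER NOT NULL DEFAULT 3 CHECK (credits BETWEEN 1 AND 6)
);

CREATE TABLE classes (
    class_id      INTEGER PRIMARY KEY,
    -- Foreign key: must match an existing courses.crs_id.
    crs_id        INTEGER NOT NULL REFERENCES courses (crs_id),
    class_section TEXT NOT NULL,
    -- Composite candidate key (course + section).
    UNIQUE (crs_id, class_section)
);
"""


def build_database() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = build_database()
    add_course = "INSERT INTO courses (crs_num) VALUES (?)"
    add_class = "INSERT INTO classes (crs_id, class_section) VALUES (?, ?)"
    print(try_insert(db, add_course, ("IST459",)))  # accepted, key generated
    print(try_insert(db, add_course, ("IST459",)))  # duplicate candidate key
    print(try_insert(db, add_class, (1, "M001")))   # matches an existing course
    print(try_insert(db, add_class, (99, "M001")))  # no such course
```

Note that the caller never supplies `crs_id` or `class_id`: the surrogate values carry no business meaning and never need to change, which is what makes them safe to repeat in other tables as foreign keys.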
THE BEST SELECTION FOR PRIMARY KEY IS THE ATTRIBUTE(S) THAT SELDOM CHANGE AND CARRY THE LEAST AMOUNT OF MEANING. IF YOU DON’T HAVE ONE IN YOUR EXISTING TABLE, USE A SURROGATE KEY! (REMEMBER, A SURROGATE KEY IS GENERATED BY THE DBMS AND IS ALWAYS UNIQUE.)

Null and Flags

In the relational model there exists an oft-misunderstood concept called null. Simply put, null means no value was entered. The problem with null is that the DBMS needs somehow to account for it in comparisons and searches. For example, if something is true then it’s true, but if it is NOT true then it could be false or null. This is called three-valued logic. To make matters worse, end-users tend to try to interpret meaning from null, even though null can only mean one thing: no value was entered (despite the fact that the DBMS has to record that absence internally in some way).

To avoid using nulls in practice, database designers use flags. Flags are special codes used in place of the absence of a value. For example, rather than risk a user leaving employee salary blank during data entry, a default value of 0 might be added to avoid having nulls in a numeric column.

Integrity Rules

There are two important integrity rules in the relational model:

• Entity Integrity - A rule which imposes a constraint that each row in a relation must be uniquely identified. In practice we do this by designating a column or group of columns in our table as the Primary Key. Since the primary key entries must be unique and cannot be null, entity integrity is enforced. Entity Integrity is what makes a table a relation.
• Referential Integrity - The foreign key must contain either null or a value matching a primary key in another table. In other words, every non-null foreign key must reference an existing primary key value. This ensures relationship integrity is preserved among the tables. For example, when you enroll in IST459 your SUID is associated with the course section.
  It would not make any sense to have SUIDs enrolled in this course which did not correspond to actual students!

ENTITY AND REFERENTIAL INTEGRITY ARE IMPLEMENTED IN PRACTICE USING PRIMARY AND FOREIGN KEY CONSTRAINTS, RESPECTIVELY.

Part 3: Database Development

Systems Development Lifecycle

There are a variety of ways to design, build and implement effective database solutions. One way is to just start building the database. Modern database tools make it very easy to just cobble something together. The problem with this approach is, of course, that without a plan you’re just trying to hit a moving target. Some key questions that should be asked of any database project are:

• What problem am I solving?
• Whose needs am I attempting to meet?
• What business opportunity am I pursuing?
• How do I know when I’m done?
• Are the steps in the approach I’m using explainable, repeatable, or verifiable?
• After I’m gone, can the database be maintained by others?
• Will my solution work today and into the future?

The foundation for building database solutions is the Systems Development Lifecycle. The figure below charts the amount of resources (time, effort, money) against time for each phase of a database project. Included in the phases are the various levels of data abstraction we learned about in the previous unit.

[Figure: resources (time, effort, money) plotted against time for each phase of a database project]

Database Development Lifecycle

The above graph offers an interesting perspective on how resources should be consumed: the majority of the effort should be spent in the design phase. The payoff idea here is: do not “just start building a database” just because the tools allow you to do so. Spend your time planning “up front” for your database solution, analyze the problem, and design a solution that works before you start any actual construction.

Think of it this way. You want to build a new house.
Do you drive over to Home Depot to buy some lumber, nails, plumbing fixtures, electrical fixtures and some tools and then start building, or do you make a list of your housing requirements, develop a project plan, then work with an architect on a building design and floor layout and create a blueprint? Creating a database solution is no different.

Think about the database development lifecycle (DBDLC) as a subset of the activities that you need for successful database development. These activities map very nicely to the broader set of activities in the SDLC.

Various Systems Development Lifecycle Strategies

There are two fundamental strategies for using the Systems Development Lifecycle, Top-down and Bottom-up:

• Top-down - Approach your database project from the perspective of the top of the organization, looking at how the data is viewed cross-functionally across each of the organization’s business units, starting with senior-level management. Your focus is on how the data is used for making decisions and how the data may be shared across the enterprise. You start at the top and work down the various layers of management.
• Bottom-up - Approach your database project from the perspective of the bottom of the organization, looking at how the data is viewed and used by the end user. Your focus is on how the data is used to perform various day-to-day operational functions and how the data “rolls up” the organization. You start at the bottom, gathering very detailed and granular data requirements, and work up the organization.

An alternative development model - Prototyping

The SDLC is a methodical and highly structured development process, with many processes and checks in place to ensure consistent and accurate results. The primary disadvantages of this approach are, of course, the time and discipline it takes to follow the cycle, as well as the long gaps between the points at which your end-users are involved in the process.
Prototyping, a rapid application development model, attempts to address these issues. Prototyping is an iterative process where requirements are quickly developed into a working model (i.e. the prototype) and feedback is solicited from end-users. The gathered feedback, in turn, is used to modify and establish new requirements, as well as to revise any design issues with the prototype. Prototypes that are demos of system functionality or proofs of concept are called throwaway prototypes. Prototypes that are actual systems which eventually become the real thing are called evolutionary prototypes. Microsoft Access is a great system for developing database prototypes.

Almost all web design and development follows the prototype development model. The end result of web prototyping is rapid site development, saving time and money. But development in this manner can cause performance problems, code maintenance issues, and development delays, all of which incur hidden costs. This is especially true in web applications because there is oftentimes too much focus on end-user needs and “getting things done” type tasks at the expense of sufficient analysis of the overall big-picture architectural design of the system.