LU02 Notes

advertisement
IST459: NOTES : THE RELATIONAL DATABASE MODEL AND DATABASE
D EVELOPMENT PROCESS
T OPIC : T HE R ELATIONAL D ATABASE M ODEL AND D ATABASE D EVELOPMENT P ROCESS
T ABLE
OF
C ONTENTS
Topic: The Relational Database Model and Database Development Process ...............................................................1
Learning Objectives ...................................................................................................................................................2
Part 1: The Relational Database Model .....................................................................................................................2
Let’s start with a little history lesson… ..................................................................................................................2
In Theory... and in Practice ....................................................................................................................................3
Part 2: Components of the Relational Model ............................................................................................................3
Relations ................................................................................................................................................................3
Characteristics of Relations ...................................................................................................................................3
Okay… But what makes a table a relation? ...........................................................................................................4
Domains in theory and practice .............................................................................................................................4
Implementing Logical Domain in a DBMS ..............................................................................................................5
Keys ........................................................................................................................................................................6
In practice, what makes a good primary key? .......................................................................................................7
Null and Flags.........................................................................................................................................................8
Integrity Rules ........................................................................................................................................................8
Part 3: Database Development ..................................................................................................................................9
Systems Development Lifecycle ............................................................................................................................9
Database Development Lifecycle ........................................................................................................................10
Various Systems Development Lifecycle Strategies ............................................................................................10
An alternative development model - Prototyping ...............................................................................................11
Page 1
L EARNING O BJECTIVES
In this learning unit we will learn the about the relational database model and how it is put into practice on the
modern DBMS. We will also explore the models and intricacies of database design and implementation. Some of
these objectives will be covered in this document, others in the class lecture, assigned readings, and labs.






Describe the Relational Model
Define relational terms and understand the terminology in practice.
Explain the System Development Life Cycle (SDLC)
Explain the Database Life Cycle (DBLC)
Explain how database development fits within the SDLC
Compare and contrast various database SDLC strategies
P ART 1: T HE R ELATIONAL D ATABASE M ODEL
The relational data model was originally conceived in a paper titled “A Relational Model for Large Shared Data
Banks” by IBM’s E. F. Codd in 1970. This paper, and its popularity, started a trend towards commercial Relational
database management systems (RDBMS) products. These applications were conceived during the “disco era” of
the 70’s and early 80’s, and have grown into a $20B industry (Forrester research, 2005) with the major players
being Oracle, IBM and Microsoft. Today, the RDBMS is the cornerstone of just about every business application
and is used by a wide variety of market segments.
L ET ’ S
START WITH A LITTL E HISTORY LESSON …
The relational data model or simply the relational model comes to us from the work of a number of
mathematicians and computer researchers. An American mathematician, D.L. Childs, started the movement with
his 1968 work the "Description of a Set-Theoretic Data Structures”, which used set theory as the basis for querying
data. Childs’ work inspired E.F. “Ted” Codd, a researcher working at the IBM San Jose Research Laboratory to
develop, what we now take for granted, the relational data model. In the 1970s Codd went on to build early
database management system and query language prototypes predicated on a set-based model. Codd’s early work
is documented in a landmark paper, “A Relational Model for Large Shared Data Banks” , where he describes the
benefits of the relational model and its significance in managing large data structures. Codd identifies the following
advantages of relational structures over the traditional data structures used by file access methods that were
commonplace at the time.
1.
2.
3.
Data Independence – “Provides a means of describing data with its natural structure only -- that is,
without superimposing any additional structure for machine representation poses.”
Data Consistency – “A further advantage of the relational view is that it forms a sound basis for treating
derivability, redundancy, and consistency of relations…”
Ease of Use – “… the relational view permits a clearer evaluation of the scope and logical limitations of
present formatted data systems, and also the relative merits (from a logical standpoint) of competing
representations of data within a single system.”
Page 2
I N T HEORY ...
AND IN
P RACTICE
It’s important to differentiate between actual relational theory and that which a relational database management
system can accomplish in practice. Generally speaking, a DBMS can implement the elements of the relational
model, but can also accomplish much, much more. Why? Flexibility. Not every task is suited to the rigidity of the
relational model, and DBMS vendors are trying to make a business offering their customers the most bang for the
buck.
Of course, all this flexibility comes at a price. The layperson often misinterprets understanding the intricacies of a
software product, such as Microsoft Access, with the nuances of good relational design when in reality the two
concepts are mutually exclusive. This allows the layperson to “shoot themselves in the foot” so to speak, and
create a poor database design containing redundant data. So the bottom line is if you want to design efficient
databases with minimal redundancy, learning the relational model will give you far more mileage than learning a
particular DBMS.
FUDGE’S “MORE MILEAGE” STATEMENT:
IN THIS CLASS, WE WILL EMPHASIZE RELATIONAL DATABASE DESIGN OVER THE SPECIFICIS OF A PARTICULAR
DBMS APPLICATION. WE WILL ALWAYS DIFFERENTIATE AMONG ELEMENTS OF THE RELATIONAL MODEL AND
HOW THOSE ELEMENTS ARE TYPICALLY IMPLEMENTED BY VARIOUS DBMS APPLICATIONS, SUCH AS ORACLE OR
MS SQL SERVER.
P ART 2: C OMPONENTS
OF THE
R ELATIONAL M ODEL
R ELATIONS
The relational model focuses on a logical representation of data, rather than a physical one. This means that in an
implementation, the programmer can focus on what rather than on how. For example, if you were to write a
program to collect information from a webpage and save it into a text file, you would be responsible for the data
management tasks, such as how the data gets saved in the file. If you were placing the same form data in a
relational DBMS; you only need to tell it what to add to the database, and how to add it.
The relational model is based on set theory - you know {c, a, b, s} ∩ {b, a, d, s} = {a, b, s}. There are three main
components to the relational model:
1.
2.
3.
A structural component - sets of relations, which are tables of data, consisting of rows (tuples) and
columns (attributes).
A manipulative component of operations which act upon the relations.
A set of integrity rules for maintaining integrity within the database.
C HARACTERISTICS
OF
R ELATIONS
At the heart of the relational model is the relation, better known in its implementation as a table. Think of a table
as a 2 dimensional structure of common data. Each row in the table corresponds to a specific entity while each
column corresponds to a particular domain. (The difference between a relation and a table will be discussed
below, but for now, you may assume they are synonymous.) All relations:

Have a unique name among the set of all relations in the database.
Page 3




Are 2 dimensional structures consisting of rows and columns of data. The specific order of the rows and
columns is irrelevant.
Each column (or attribute) must have a unique name (within the scope of the relation, not the database).
Each column (or attribute) is drawn from a domain, or set of possible values from which the actual values
are taken.
Have a column or set of columns which uniquely identify each row in the table.
O KAY … B UT
WHAT MAKES A TABLE A RELATION ?
What makes a table a relation? The last mentioned characteristic does - the fact that a column or set of columns
must uniquely identify each row. In a DBMS such as MS SQL Server you can create a table where no column (or
group of columns) uniquely identifies each row, but if you do, technically it won’t be a relation. Why is this a
requirement and why should it matter? Remember, the relational model based on set theory, and all operations
on relations are performed over sets of data. In set theory the order of the elements in a set does not matter, so
the only means you have to retrieve a specific row is if you can find something which will uniquely identify it. For
example, in the figure below, you can’t delete the 2nd Emmitt Smith because there is no way to differentiate it
from the 1st row containing the same data.
So, is there some way we can force a table to be a relation, for example, so that we would not be able to add
Emmit Smith twice? Yes, we can do this with meta-data that we add to the table itself. It will be discussed in the
sections below.
D OMAINS IN THEORY
AND PRACTICE
Page 4
The concept of domain, or set of possible values from which the actual values are taken, can be difficult to
implement in practice. Sure, for ordinal values such as GPA domain easy to implement (decimals between 0.000
and 4.000, for example). But what about a column of employee first names? Now despite all the crazy names that
are out there, most humans can easily differentiate among what is and isn’t a last name, such as when an address
like “One Park Place” is inadvertently added to the name column. But how can we make the computer understand
what is and isn’t a name? It’s a problem with a non-trivial solution.
For this reason, we separate domain into two categories. The current definition of domain is called logical domain,
or the set of acceptable values. And the DBMS also implements physical domain representing the type of data
acceptable in the columns. Physical domain has no place in relational theory, but is required in the DBMS because
of how computers sort and compare data encoded in different forms. For example “December” comes before
“January” when you compare them as text strings, but when you consider them as dates, January comes before
December. To further complicate the issue, each DBMS implementation addresses physical domain, also known as
data types! For example to store a number in the DBMS IBM DB2 you could use the data type INTEGER, but in
Microsoft SQL server it would be int. Yes, they’re similar, but just different enough to make your life difficult!
I MPLEMENTING L OGICAL D OMAIN IN A DBMS
Let’s take a moment to refresh what we’ve learned so far:





Tables are collections of similar entities. Tables are organized into rows and columns. (cars, for example)
The entities are stored in the rows of the table and represent single instances of data (for example, my
car, or your car)
An attribute describes one particular facet of that entity. (my car’s color is green, for example)
A column in the table represents the domain of data for an entity’s attribute ( the car’s color, or mpg
rating)
Domain is represented as both physical and logical domain. Physical domain represents the type of data in
the column, logical domain represents the acceptable values. (For example the physical domain of a car’s
Page 5
color would be varchar(20) (in the SQL server DBMS), but the logical domain would be colors. With just
physical domain you could enter “Mike” as a color, which would be unacceptable)
To implement logical domain we use constraints. Constraints are rules and conditions that must be met before
data can be added or updated. There are techniques you can use to implement logical domain over DBMS tables.
These include:




Default Value Constraint - data used for an attribute when one isn’t specified.
Check Constraint - an expression which must evaluate to “true” prior to insert or update
Unique Constraint - a condition that forbids duplicate values are in a column or group of columns.
Lookup table - a separate table containing all of the acceptable values for a given column. The lookup
table is implemented using primary key and foreign key constraints.
PHYSICAL AND LOGICAL DOMAINS ARE METADATA - THEY HELP US DESCRIBE, DEFINE, AND STRUCURE OUR
DATA, AS WELL AS HELP ENFORCE BUSINESS RULES.
K EYS
If Yogi Berra was well versed in the relational model, I have no doubt he would say something like “Keys are the
key to understanding relational databases. Yes, it’s true keys are the key!”
A key is a special kind of constraint on an attribute or set of attributes which can be used to lookup other
attributes. Keys are used to look-up a set of rows in a table, or find one specific row in a table, and find rows in one
table based on the attributes of another table. As you will see, the true power of the relational model comes from
keys. Here are the types of keys found in relational databases:
 Super Key – Any combination of attributes which can be used to uniquely identify a row in the table.
 Candidate Key The smallest number of attributes required to uniquely identify each row in a table. For
example, in an employee table the name, address, and home phone could be a super key, but the
employee social-security number would be a candidate key, because it’s minimal. In practice, singleattribute candidate keys should be enforced with a unique constraint (after all you wouldn’t want two
different employee rows with the same social security number, right?)
 Primary Key - A candidate key which has been chosen by the database designer to uniquely identify each
row in the table. The primary key cannot contain null (empty) values, and is implemented as a constraint
by the DBMS.
 Surrogate Key - A primary key whose values are automatically generated by the DBMS there is no need on
the user’s part to enter a value for primary key - the system does it for you. Surrogate keys can be an
auto-incremented integer, or a GUID or UUID (Read the RFC). Some DBMSs such as Postgres and Oracle
permit the database designer to create global surrogate keys, called sequences.
 Composite (primary) Key - A primary key consisting of more than one column.
 Secondary Key - An attribute or combination of attributes used for row retrieval from a table. In practice,
secondary keys are index candidates in the table design.
 Foreign Key - A foreign key is an attribute (or combination thereof) in one table whose values either
match those of the primary key in another table or are null. The purpose of the foreign key is to associate
one relation with another. It is also implemented as a constraint by the DBMS.
IN PRACTICE, ONLY PRIMARY AND FOREIGN KEYS ARE IMPLEMENTED BY RDBMSS. SOME DBMSS IMPLEMENT
SURROGATE KEYS AND SECONDARY KEYS ARE ONLY POSSIBLE AS INDEXES
Page 6






Candidate Keys (for Classes table): Class_id, Class_number, Crs_id+Class_Section (composite)
Primary Keys: (as set in the DBMS) Courses: Crs_id, Classes: Class_id
Secondary Keys: Crs_num, Time_ID, Room_ID
Surrogate Keys: Crs_id, Class_id
Foreign Keys: (as set by the DBMS) Crs_id (in Classes table), Time_ID (in Classes table), Room_ID (in
Classes table)
Example Primary Key / Foriegn Key relationship is highlighted in yellow.
I N PRACTICE ,
WHAT MAKES A GOOD PRIMARY KEY ?
Another appropriate title for this section might be: “How to choose a primary key and turn your table into a
Relation”, but that’s too long of a title. 
Page 7
You might think that any candidate key would make a good primary key, but it is that line of thinking which has
caused database designers fits of frustration in the past. In theory, any candidate key will suffice as primary key,
but in practice not every candidate key makes an ideal primary key. The ideal primary key has three important
characteristics:



Should be unique within the scope / context of the entire database model, not just the current data. A
good primary key is not unique by coincidence as is the case with a candidate key such as an employee’s
date of birth. Sure, in a particular table right now no employee has the same birth date, but could it
happen?
Should not need to change often or depend on other data for its generation. Columns which are hashes
or encodings of other columns make poor choices for primary keys. Because the primary key values are
going to be in other tables as foreign keys, you want to resist the need to change them often.
Should not compromise security or performance. You should never use sensitive information such as SSN
for a primary key. You should never use long descriptive text such as street addresses as primary keys.
Remember if you select it as a primary key, it is probably going to be a foreign key elsewhere.
THE BEST SELECTION FOR PRIMARY KEY IS THE ATTRIBUTE(S) THAT SELDOM CHANGE AND CARRY THE LEAST
AMOUNT OF MEANING. IF YOU DON’T HAVE ONE IN YOUR EXISTING TABLE, USE A SURROGATE KEY! (REMEMBER
A SURROGATE KEY IS SELECTED BY THE DBMS AND IS ALWAYS UNIQUE.)
N ULL AND F LAGS
In the relational model there exists an oft-misunderstood concept called null. Simply put, null means no value
entered. The problem with null is that the DBMS needs somehow to account for it in comparisons and searches.
For example if something is true then it’s true, but if it is NOT true then it could be false or null. This is called threevalued logic. To make matters worse, end-users tend to try to interpret meaning from null, even though null can
only mean one thing: no value was entered (Despite the fact that null is stored by computers as an actual value =
0).
To avoid using nulls in practice, database designers use flags. Flags are special codes used in place of the absence
of a value. For example, rather than risk a user performing data entry leaving employee salary blank a default value
of 0 might be added to avoid having nulls in a numeric column.
I NTEGRITY R ULES
There are two important integrity rules in the relational model:


Entity Integrity - A rule which imposes a constraint that each row in a relation must be uniquely
identified. In practice we do this by designating a column or groups of columns in our table as the Primary
Key. Since the primary key entries must be unique and cannot be null, entity integrity must be enforced.
Entity Integrity is what makes a table a relation.
Referential Integrity - The foreign key must contain either null or a value matching a primary key in
another table. In other words, every non-null foreign key must reference an existing primary key value.
This ensures relationship integrity is preserved among the tables. For example when you enroll in IST459
your SUID is associated with the course section. It would not make any sense to have SUIDs enrolled in
this course which did not correspond to actual students!
Page 8
ENTITY AND REFERENTIAL INTEGRITY ARE IMPLEMENTED IN PRACTICE USING PRIMARY AND FOREIGN KEY
CONSTRAINTS RESPECTIVELY.
P ART 3: D ATABASE D EVELOPMENT
S YSTEMS D EVELOPMENT L IFECYCLE
There are a variety of ways to design, build and implement effective database solutions. One way is to just start
building the database. The modern database tools make it very easy to just cobb something together. The problem
with this approach is, of course, without a plan you’re just trying to hit a moving target. Some key questions that
should be asked of any database project are:







What problem am I solving?
Whose needs am I attempting to meet?
What business opportunity am I pursuing?
How do I know when I’m done?
Are the steps in the approach I’m using explainable, repeatable, or verifiable?
After I’m gone, can the database be maintained by others?
Will my solution work today and into the future?
The foundation for building database solutions is the Systems Development Lifecycle. The figure below charts the
amount of resources (time, effort, money) against time for each phase of a database project. Included in the
phases are the various levels of data abstraction we learned from the previous unit.
Page 9
D ATABASE D EVELOPMENT L IFECYCLE
The above graph shows an interesting perspective for how resources should be consumed - the majority of the
resource effort should be spent in the design phase. The payoff idea here is: do not “just start building a database”
just because the tools allow you to do so; spend your time planning “up front” for your database solution, analyze
the problem, and design a solution that works before you start any actual construction. Think of it this way. You
want to build a new house. Do you drive over to Home Depot to buy some lumber, nails, plumbing fixtures,
electrical fixtures and some tools then start building or do you make a list of your housing requirements, develop a
project plan, then work with an architect on a building design, floor layout and create a blueprint? Creating a
database solution is no different.
Think about the database development lifecycle (DBDLC) as a subset of the activities that you need for successful
database development. These activities map very nicely to the broader set of activities in the SDLC.
V ARIOUS S YSTEMS D EVELOPMENT L IFECYCLE S TRATEGIES
Page 10
There are two fundamental ways or strategies for using the Systems Development Lifecycle, Top-down and
Bottom-up:


Top-down - Approach your database project perspective from the top of the organization, looking at how
the data is viewed cross-functionally across each of the organization’s business units starting with seniorlevel management. Your focus is looking at how the data is used for making decisions and how the data
may be shared across the enterprise. You start at the top and work down the various layers of
management.
Bottom-up - Approach your database project prospective from the bottom of the organization, looking at
how the data is viewed and used by the end user. Your focus is on how the data is used to perform
various day-to-day operational functions and how the data “rolls up― the organization. You start at
the bottom gathering very detail and granular data requirements and work up the organization.
A N ALTERNATIVE
DEVELOPMENT MODEL
- P ROTOTYPING
The SDLC is a methodical and highly structured development process, with many processes and checks in place to
ensure consistent and accurate results. The primary disadvantages to this approach are, of course, time and the
discipline it takes to follow the cycle, as well as the large windows of time between your end-users being involved
in the process. Prototyping, a rapid application development model, attempts to address these issues.
Prototyping is an iterative process where requirements are quickly developed into a working model (i.e. the
prototype) and feedback is solicited from end-users. The gathered feedback, in turn is used to modify and establish
new requirements, as well as revise any design issues with the prototype. Prototypes that are demos of system
functionality or proofs of concepts are called throwaway prototypes. Prototypes that are actual systems which
eventually become the real thing are called evolutionary prototypes.
Microsoft Access is a great system for developing database prototypes. Almost all web design and development
follows the prototype development model. The end result of web prototyping is rapid site development, saving
time and money. But development in this manner can cause performance problems, code maintenance issues, and
development delays all of which incur hidden costs. This is especially true in web applications because there is
oftentimes too much focus on end-user needs and getting things done type-tasks at the expense of sufficient
analysis of the overall big-picture architectural design of the system.
Page 11
Download