Relational Database Design

Bill Woolfolk
Public Health Sciences
University of Virginia
woolfolk@virginia.edu
Objectives
Understand the definition of a modern relational database
Understand and be able to apply a practical method for designing databases
Recognize and avoid common pitfalls of database design

What’s a database?

A collection of logically-related information stored in a consistent fashion
◦ Phone book
◦ Bank records (checking statements, etc.)
◦ Library card catalog
◦ Soccer team roster
The storage format typically appears to users as some kind of tabular list (table, spreadsheet)
What Does a Database Do?
Stores information in a highly organized manner
Manipulates information in various ways, some of which are not available in other applications or are easier to accomplish with a database
Models some real world process or activity through electronic means
◦ Often called modeling a business process
◦ Often replicates the process only in appearance or end result
Databases and the Systems which manage them
Modern electronic databases are created and managed by means of an RDBMS: Relational DataBase Management System
An individual data storage structure created with an RDBMS is typically called a “database”
A database together with its attendant views, reports, and procedures is called an “application”

Database Applications
Database (the actual DB with its attendant storage structure)
SQL Engine - interprets between the database and the interface/application
Interface or application – the part the user gets to see and use

Relational Database Management Systems

Low-end, proprietary, specific purpose
◦ Email: Outlook, Eudora, Mulberry
◦ Bibliographic: Ref. Mgr., EndNote, ProCite

Mid-level
◦ Microsoft Access, Lotus Approach, Borland’s Paradox
◦ More or less total control of design allows custom
builds

High-end
◦ Oracle, Microsoft SQL Server, Sybase, IBM DB2
◦ Professional level DBs: Banks, e-commerce, secure
◦ Amazon.com, Ebay.com, Yahoo.com
Problems with Bad Design
Early computers were slow and had limited storage capacity
Redundant or repeating data slowed operations and took up too much precious storage space
Poor design increased chance of data errors, lost or orphaned information

Benefits of Good Design
Computers today are faster and possess much larger storage devices
Rigid structure of modern relational databases helped codify problems and solutions
Design problems are still possible, because the DBMS software won’t protect you from poor practices
Good design still increases efficiency of data processes, reduces waste of storage, and helps eliminate data entry errors

Codd’s Rules

Edgar F. Codd
◦ Mathematician and researcher at IBM
◦ Devised the relational data model in 1970
◦ Published 12 rules in 1985 defining the ideal relational database, added 6 more in 1990
Codd, E. F. (1970). "A Relational Model of Data for Large Shared Data Banks." CACM 13(6): 377-387. (http://www.acm.org/classics/nov95/toc.html)
Codd, E. F. (1985). "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" ComputerWorld, October 14 and October 21.
Modification Anomalies
Customers_Orders_Inventory
Customer          OrderNum  ItemNum  Item
General Tool      07456     2246     Pentium Computer
General Toll      08622     3145     HP Printer
General Tool Co.  08622     3967     17” monitor
Totally Toys      06755     2246     Pentium computer
TOTALLY TOYS      08134     3145     Hewlett-Packard Printer
XYZ Inc.          09010     0446     Dot Matrix Printer

A search for “General Tool Co.” would miss “General Tool” and “General Toll”. A case-sensitive search for “Totally Toys” would miss “TOTALLY TOYS”.
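The missed matches are easy to reproduce. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table and values come from the slide, while the choice of SQLite and the helper name orders_for are assumptions made only for illustration.

```python
import sqlite3

# In-memory SQLite database holding the unnormalized table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers_Orders_Inventory "
    "(Customer TEXT, OrderNum TEXT, ItemNum TEXT, Item TEXT)"
)
rows = [
    ("General Tool", "07456", "2246", "Pentium Computer"),
    ("General Toll", "08622", "3145", "HP Printer"),
    ("General Tool Co.", "08622", "3967", '17" monitor'),
    ("Totally Toys", "06755", "2246", "Pentium computer"),
    ("TOTALLY TOYS", "08134", "3145", "Hewlett-Packard Printer"),
    ("XYZ Inc.", "09010", "0446", "Dot Matrix Printer"),
]
conn.executemany(
    "INSERT INTO Customers_Orders_Inventory VALUES (?, ?, ?, ?)", rows
)

def orders_for(customer):
    """Return the order numbers whose Customer text matches exactly."""
    cur = conn.execute(
        "SELECT OrderNum FROM Customers_Orders_Inventory WHERE Customer = ?",
        (customer,),
    )
    return [r[0] for r in cur]

# SQLite's '=' on TEXT is an exact, case-sensitive comparison by default,
# so spelling and case variants of the same customer are silently skipped.
general_tool_orders = orders_for("General Tool Co.")
totally_toys_orders = orders_for("Totally Toys")
print(general_tool_orders)  # ['08622']
print(totally_toys_orders)  # ['06755']
```

Each search finds one row, although the table holds three rows for General Tool and two for Totally Toys.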
Insertion Anomalies
Customers_Orders_Inventory
Customer          OrderNum  ItemNum  Item
General Tool      07456     2246     Pentium Computer
General Toll      08622     3145     HP Printer
General Tool Co.  08622     3967     17” monitor
Totally Toys      06755     2246     Pentium computer
TOTALLY TOYS      08134     3145     Hewlett-Packard Printer
XYZ Inc.          09010     0446     Dot Matrix Printer

How would you enter a new item into your inventory if no one had ordered it yet?
Deletion Anomalies
Customers_Orders_Inventory
Customer          OrderNum  ItemNum  Item
General Tool      07456     2246     Pentium Computer
General Toll      08622     3145     HP Printer
General Tool Co.  08622     3967     17” monitor
Totally Toys      06755     2246     Pentium computer
TOTALLY TOYS      08134     3145     Hewlett-Packard Printer
XYZ Inc.          09010     0446     Dot Matrix Printer

If you wanted to stop selling “dot matrix printer” and remove it from your inventory, you would have to delete the order and customer info for “XYZ Inc.”
The Fix

Order_Items
OrderNum  ItemNum
06755     2246
07456     2246
08134     3145
08622     3145
08622     3967
09010     0446

Orders
CustomerNum  OrderNum
7822         09010
8755         06755
8755         08134
9123         07456
9123         08622

Customers
CustomerNum  Customer
7822         XYZ Inc.
8755         Totally Toys
9123         General Tool Co.

Products
ItemNum  Item
0446     Dot Matrix Printer
2246     Pentium Computer
3145     Hewlett-Packard printer
3967     17” monitor
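A quick way to see that the four-table design removes all three anomalies is to load it into a database and try the operations that failed before. This is a sketch using Python's sqlite3 module; the key declarations are assumptions (the slide does not mark keys here), and item 5001 "Laser Printer" is a hypothetical new product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers   (CustomerNum TEXT PRIMARY KEY, Customer TEXT);
CREATE TABLE Products    (ItemNum TEXT PRIMARY KEY, Item TEXT);
CREATE TABLE Orders      (OrderNum TEXT PRIMARY KEY, CustomerNum TEXT);
CREATE TABLE Order_Items (OrderNum TEXT, ItemNum TEXT,
                          PRIMARY KEY (OrderNum, ItemNum));
""")
conn.executemany("INSERT INTO Customers VALUES (?, ?)", [
    ("7822", "XYZ Inc."), ("8755", "Totally Toys"), ("9123", "General Tool Co."),
])
conn.executemany("INSERT INTO Products VALUES (?, ?)", [
    ("0446", "Dot Matrix Printer"), ("2246", "Pentium Computer"),
    ("3145", "Hewlett-Packard printer"), ("3967", '17" monitor'),
])
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [
    ("09010", "7822"), ("06755", "8755"), ("08134", "8755"),
    ("07456", "9123"), ("08622", "9123"),
])
conn.executemany("INSERT INTO Order_Items VALUES (?, ?)", [
    ("06755", "2246"), ("07456", "2246"), ("08134", "3145"),
    ("08622", "3145"), ("08622", "3967"), ("09010", "0446"),
])

# Modification: the customer name is stored once, so one search finds every order.
general_tool_orders = [r[0] for r in conn.execute(
    "SELECT o.OrderNum FROM Orders o "
    "JOIN Customers c ON c.CustomerNum = o.CustomerNum "
    "WHERE c.Customer = 'General Tool Co.' ORDER BY o.OrderNum"
)]

# Insertion: a product can be stocked before anyone orders it (hypothetical item).
conn.execute("INSERT INTO Products VALUES ('5001', 'Laser Printer')")

# Deletion: dropping the dot matrix printer leaves XYZ Inc. and its order intact.
conn.execute("DELETE FROM Order_Items WHERE ItemNum = '0446'")
conn.execute("DELETE FROM Products WHERE ItemNum = '0446'")

customers = [r[0] for r in conn.execute(
    "SELECT Customer FROM Customers ORDER BY CustomerNum"
)]
xyz_orders = [r[0] for r in conn.execute(
    "SELECT OrderNum FROM Orders WHERE CustomerNum = '7822'"
)]
print(general_tool_orders)  # ['07456', '08622']
print(customers)            # ['XYZ Inc.', 'Totally Toys', 'General Tool Co.']
print(xyz_orders)           # ['09010']
```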
The Design Process
1) Identify the purpose of the database
2) Review existing data
3) Make a preliminary list of fields
4) Make a preliminary list of tables and enter fields
5) Identify the key fields
6) Draft the table relationships
7) Enter sample data and normalize the data/tables
8) Review and finalize the design
Database Modeling
Refers to various, more-or-less formal methods for designing a database
Some provide precision steps and tools
◦ Ex.: Entity-Relationship (E-R) Modeling
Widely used, especially by high-end database designers who can’t afford to miss things
Fairly complex process
Extremely precise
1. Identify purpose of the DB
Clients can tell you what information they want but have no idea what data they need.
◦ “We need to keep track of inventory”
◦ “We need an order entry system”
◦ “I need monthly sales reports”
◦ “We need to provide our product catalog on the Web”

Be sure to limit the scope of the database.
2. Review Existing Data

Electronic
◦ Legacy database(s)
◦ Spreadsheets
◦ Web forms

Manual
◦ Paper forms
◦ Receipts and other printed output
3. Make Preliminary Field List

Make sure fields exist to support needs
◦ Ex. if client wants monthly sales reports, you need a
date field for orders.
◦ Ex. To group employees by division, you need a
division identifier

Make sure values are atomic
◦ Ex. First and Last names stored separately
◦ Ex. Addresses broken down to Street, City, State, etc.

Do not store values that can be calculated from
other values
◦ Ex. “Age” can be calculated from “Date of Birth”
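The "Age" example can be sketched in a few lines: store only the date of birth and derive the age whenever it is needed. The function name age_on and the sample dates are illustrative assumptions.

```python
from datetime import date

def age_on(date_of_birth: date, today: date) -> int:
    """Whole years elapsed between date_of_birth and today."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

print(age_on(date(1990, 6, 15), date(2024, 6, 14)))  # 33
print(age_on(date(1990, 6, 15), date(2024, 6, 15)))  # 34
```

A stored "Age" field would be wrong the day after a birthday; the calculated value never goes stale.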
4. Make Preliminary Tables (and insert the fields into them)
Each table holds info about one subject
Don’t worry about the quantity of tables
Look for logical groupings of information
Use a consistent naming convention

Naming Conventions

Rules of thumb
◦ Table names must be unique in DB; should be plural
◦ Field names must be unique in the table(s)
◦ Clearly identify table subject or field data
◦ Be as brief as possible
◦ Avoid abbreviations and acronyms
◦ Use fewer than 30 characters
◦ Use letters, numbers, underscores (_)
◦ Do not use spaces or other special characters
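The mechanical parts of these rules (allowed characters and length) can be checked automatically. A small sketch, assuming a hypothetical helper named is_valid_name; the other rules (clarity, brevity, plural table names) still need a human reviewer.

```python
import re

# Letters, digits and underscores only; 1 to 29 characters (fewer than 30).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,29}")

def is_valid_name(name: str) -> bool:
    """True if the name uses only allowed characters and is short enough."""
    return NAME_PATTERN.fullmatch(name) is not None

print(is_valid_name("Order_Items"))  # True
print(is_valid_name("Order Items"))  # False: contains a space
print(is_valid_name("x" * 30))       # False: 30 characters is too long
```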
Naming Conventions (cont’d)

Leszynski Naming Convention (LNC)
◦ Example: tblEmployees, qryPartNum
◦ tbl, qry = tag
◦ Employees, PartNum = basename

LNC at Microsoft Developers Network
5. Identify the Key Fields

Primary Key(s)
◦ Can never be Null; must hold unique values
◦ Automatically indexed in most RDBMSs
◦ Values rarely (if ever) change
◦ Try to include as few fields as possible

Multi-field Primary Key
◦ Combination of two or more fields that uniquely identify an individual record

Candidate Key
◦ Field or fields that qualify as a primary key
◦ Important in Third and Boyce-Codd Normal Forms
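The "never Null, always unique" behavior of a multi-field primary key can be tried out directly. A sketch using Python's sqlite3 module and the Order_Items table from earlier in the deck; try_insert is a hypothetical helper. The NOT NULL clauses are spelled out because SQLite, unlike most RDBMSs, does not reject Null in a non-integer primary key on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Order_Items (
        OrderNum TEXT NOT NULL,
        ItemNum  TEXT NOT NULL,
        PRIMARY KEY (OrderNum, ItemNum)   -- multi-field primary key
    )
""")

def try_insert(order_num, item_num):
    """Insert one row; report whether the key constraints accepted it."""
    try:
        conn.execute(
            "INSERT INTO Order_Items VALUES (?, ?)", (order_num, item_num)
        )
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

first = try_insert("08622", "3145")
same_order_other_item = try_insert("08622", "3967")  # only the pair must be unique
duplicate_pair = try_insert("08622", "3145")         # same pair again
null_part = try_insert("08622", None)                # Null in a key field

print(first, same_order_other_item, duplicate_pair, null_part)
# ok ok rejected rejected
```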
6. Identify Table Relationships
Based on business rules being modeled
Examples:
◦ “each customer can place many orders”
◦ “all employees belong to a department”
◦ “each TA is assigned to one course”
Relationship Terminology

Relationship Type
◦ One-to-one: expressed as 1:1
◦ One-to-Many: expressed as 1:N or 1:M or 1:∞
◦ Many-to-Many: expressed as N:N or M:M

Primary or Parent Table
◦ Table on the left side of 1:N relationship

Related or Child Table
◦ Table on the right side of 1:N relationship

Relational Schema
◦ Diagram of table relationships in database
Relationship Terminology (cont’d)

Join
◦ Definition of how related records are returned

Join Line
◦ Visual relationship indicators in schema

Key fields
◦ Primary Key: the linking field on the one side of a 1:N
relationship
◦ Foreign Key: the primary key from one table that is
added to another table so the records can be related
◦ Non-Key Fields: any field that is not part of a primary
key, multi-field primary key, or foreign key
One-to-One (1:1)
Each record in Table A relates to one, and only one, record in Table B, and vice versa.
Either table can be considered the Primary, or Parent, Table
Can usually be combined into one table, although that may not be the most efficient design

One-to-Many (1:N)
Each record in Table A may relate to zero, one or many records in Table B, but each record in Table B relates to only one record in Table A.
The potential relationship is what’s important: there might be no related records, or only one, but there could be many.
The table on the One (or left) side of a 1:N relationship is considered the Primary Table.

Many-to-Many (N:N)
A record in Table A can relate to many records in Table B, and a record in Table B can relate to many records in Table A.
Most RDBMSs do not support N:N relationships, requiring the use of a linking (or intersection or bridge) table that breaks the N:N relationship down into two 1:N relationships with the linking table being on the Many side of both new relationships.
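A linking table can be sketched as follows, using Python's sqlite3 module. The Students, Courses and Enrollments tables and their contents are hypothetical; Enrollments sits on the Many side of both 1:N relationships and its primary key is the pair of foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Courses  (CourseNum TEXT PRIMARY KEY);

-- Linking table: one row per (student, course) pairing.
CREATE TABLE Enrollments (
    StudentID INTEGER NOT NULL REFERENCES Students(StudentID),
    CourseNum TEXT    NOT NULL REFERENCES Courses(CourseNum),
    PRIMARY KEY (StudentID, CourseNum)
);

INSERT INTO Students VALUES (1, 'Jones'), (2, 'Grayson');
INSERT INTO Courses  VALUES ('ENG101'), ('MAT350');
INSERT INTO Enrollments VALUES (1, 'ENG101'), (1, 'MAT350'), (2, 'ENG101');
""")

# One student, many courses.
jones_courses = [r[0] for r in conn.execute(
    "SELECT e.CourseNum FROM Enrollments e "
    "JOIN Students s ON s.StudentID = e.StudentID "
    "WHERE s.Name = 'Jones' ORDER BY e.CourseNum"
)]

# One course, many students.
eng101_students = [r[0] for r in conn.execute(
    "SELECT s.Name FROM Enrollments e "
    "JOIN Students s ON s.StudentID = e.StudentID "
    "WHERE e.CourseNum = 'ENG101' ORDER BY s.Name"
)]

print(jones_courses)    # ['ENG101', 'MAT350']
print(eng101_students)  # ['Grayson', 'Jones']
```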

Relational Schema
Table 1                    Table 2
Field1_1  1 ----+          Field2_1
Field1_2        +------ N  Field1_1
Field1_3                   Field2_2
Field1_4                   Field2_3
7. Normalization
Normal Forms (NF): design standards based on database design theory
Normalization is the process of applying the NFs to table design to eliminate redundancy and create a more efficient organization of DB storage.
Each successive NF applies an increasingly stringent set of rules

First Normal Form (1NF)
A table is in first normal form if there are no repeating groups.
Repeating Groups: a set of logically related fields or values that occur multiple times in one record
◦ 1: non-atomic value, or multiple values, stored in a field
◦ 2: multiple fields in the same table that hold logically similar values
Sample 1NF Violation - 1
Employee_Projects_Time
EmployeeID  Name            Project                        Time
EN1-26      Sean O’Brien    30-452-T3, 30457-T3, 32244-T3  0.25, 0.40, 0.30
EN1-33      Amy Guya        30-452-T3, 30382-TC, 32244-T3  0.05, 0.35, 0.60
EN1-35      Steven Baranco  30-452-T3, 31238-TC            0.15, 0.80
Sample 1NF Violation - 2
Employee_Projects_Time
EmpID   Last Name  First Name  Proj1      Time1  Proj2      Time2
EN1-26  O’Brien    Sean        30-452-T3  0.25   30-457-T3  0.40
EN1-33  Guya       Amy         30-452-T3  0.05   30-328-TC  0.35
Tables in 1NF
Employees
*EmployeeID  LastName  FirstName
EN1-26       O’Brien   Sean
EN1-33       Guya      Amy
EN1-35       Baranco   Steven

Employees_Projects
*ProjNum   *EmployeeID  Time
30-328-TC  EN1-33       0.35
30-452-T3  EN1-26       0.25
30-452-T3  EN1-33       0.05
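Converting the first violation sample into 1NF tables is a matter of splitting each packed field into one row per value. A sketch in plain Python, using the rows of the first 1NF violation sample as written there; the variable names are illustrative, and splitting the name on the first space is an assumption that happens to work for this sample.

```python
# Rows from the first 1NF violation sample: several project numbers
# and times are packed into single fields.
unnormalized = [
    ("EN1-26", "Sean O'Brien", "30-452-T3, 30457-T3, 32244-T3", "0.25, 0.40, 0.30"),
    ("EN1-33", "Amy Guya", "30-452-T3, 30382-TC, 32244-T3", "0.05, 0.35, 0.60"),
    ("EN1-35", "Steven Baranco", "30-452-T3, 31238-TC", "0.15, 0.80"),
]

employees = []           # (EmployeeID, LastName, FirstName)
employees_projects = []  # (ProjNum, EmployeeID, Time)

for emp_id, name, projects, times in unnormalized:
    # Make the name atomic: one field per part.
    first, last = name.split(" ", 1)
    employees.append((emp_id, last, first))
    # One row per (project, time) pair instead of a list in one field.
    for proj, time in zip(projects.split(", "), times.split(", ")):
        employees_projects.append((proj, emp_id, float(time)))

print(employees[0])             # ('EN1-26', "O'Brien", 'Sean')
print(employees_projects[0])    # ('30-452-T3', 'EN1-26', 0.25)
print(len(employees_projects))  # 8
```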
Second Normal Form (2NF)



A table is in 2NF if it is in 1NF and each non-key field is functionally dependent on the entire primary key.
Functional dependency: a relationship between fields such that the value in one field determines the one value that can be contained in the other field.
Determinant: a field in which the value determines the value in another field.
Example
Airport – City
Dulles – Washington, DC
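Whether a functional dependency holds in a set of sample rows can be tested mechanically: each determinant value must always appear with the same dependent value. A sketch, assuming a hypothetical helper named determines; the second airport row is added here for illustration and is not on the slide. Sample data can only disprove a dependency, the business rules decide whether it truly holds.

```python
def determines(rows, determinant, dependent):
    """True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[field] for field in determinant)
        value = tuple(row[field] for field in dependent)
        # setdefault stores the first value seen for this key and
        # returns it on later visits, so a mismatch means "not determined".
        if seen.setdefault(key, value) != value:
            return False
    return True

airports = [
    {"Airport": "Dulles", "City": "Washington, DC"},
    {"Airport": "Reagan National", "City": "Washington, DC"},  # added row
]

print(determines(airports, ["Airport"], ["City"]))  # True
print(determines(airports, ["City"], ["Airport"]))  # False
```

Airport is a determinant of City, but not the other way round: one city can have several airports.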
Sample 2NF Violation
Employees_Projects
*EmpID  Lname    Fname  *ProjNum   ProjTitle
EN1-25  O’Brien  Sean   30-452-T3  STAR Manual
EN1-25  O’Brien  Sean   30-457-T3  ISO Procedures
EN1-25  O’Brien  Sean   31-124-T3  Employee Handbook
EN1-33  Guya     Amy    30-452-T3  STAR Manual
EN1-33  Guya     Amy    30-482-TC  Web site
Tables in 2NF
Employees
*EmployeeID  LastName  FirstName
EN1-26       O’Brien   Sean
EN1-33       Guya      Amy

Employees_Projects
*EmployeeID  *ProjNum
EN1-26       30-452-T3
EN1-33       30-457-T3

Projects
*ProjNum   Title
30-452-T3  STAR manual
30-457-T3  ISO procedure
Third Normal Form (3NF)
A table is in 3NF when it is in 2NF and there are no transitive dependencies.
Transitive Dependency: a type of functional dependency in which the value of a non-key field is determined by the value in another non-key field and that field is not a candidate key.

Sample 3NF Violation
Projects_Managers
*ProjNum   ProjTitle       ProjMgr   Phone
30-452-T3  STAR Manual     Garrison  2756
30-457-T3  ISO Procedures  Jacanda   2954
30-482-TC  Web Site        Friedman  2846
31-124-T3  STAR prototype  Garrison  2756
35-272-TC  Order System    Jacanda   2954
Tables in 3NF
Projects
*ProjNum   ProjTitle       Manager
30-452-T3  STAR manual     Garrison
30-457-T3  ISO procedures  Jacanda

Project Managers
*Manager  Phone
Garrison  2756
Jacanda   2954
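The practical payoff of removing the transitive dependency is that a manager's phone number lives in exactly one row. A sketch using Python's sqlite3 module with the data from the Projects_Managers sample; the table name Project_Managers (with an underscore) and the new extension 2999 are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Project_Managers (Manager TEXT PRIMARY KEY, Phone TEXT);
CREATE TABLE Projects (
    ProjNum   TEXT PRIMARY KEY,
    ProjTitle TEXT,
    Manager   TEXT REFERENCES Project_Managers(Manager)
);
INSERT INTO Project_Managers VALUES
    ('Garrison', '2756'), ('Jacanda', '2954'), ('Friedman', '2846');
INSERT INTO Projects VALUES
    ('30-452-T3', 'STAR Manual',    'Garrison'),
    ('30-457-T3', 'ISO Procedures', 'Jacanda'),
    ('30-482-TC', 'Web Site',       'Friedman'),
    ('31-124-T3', 'STAR prototype', 'Garrison'),
    ('35-272-TC', 'Order System',   'Jacanda');
""")

# One UPDATE changes the phone for every project Garrison manages
# (2999 is a hypothetical new extension).
conn.execute("UPDATE Project_Managers SET Phone = '2999' WHERE Manager = 'Garrison'")

garrison_rows = conn.execute("""
    SELECT p.ProjNum, m.Phone
    FROM Projects p
    JOIN Project_Managers m ON m.Manager = p.Manager
    WHERE p.Manager = 'Garrison'
    ORDER BY p.ProjNum
""").fetchall()

print(garrison_rows)  # [('30-452-T3', '2999'), ('31-124-T3', '2999')]
```

In the single Projects_Managers table the same change would have to be made in every Garrison row, and missing one would leave two different phone numbers for the same manager.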
Boyce-Codd Normal Form (BCNF)
A table is in BCNF when it is in 3NF and all determinants are candidate keys.
Developed to cover situations that 3NF did not address.
Applies to situations where you have overlapping candidate keys.

Sample Business Rules

Business Rules:
◦ Each course can have many students
◦ Each student can take many courses
◦ Each course can have multiple teaching
assistants (TAs)
◦ Each TA is associated with only one course
◦ For each course, each student has one TA
Sample BCNF Violation
Course_Students_TAs
CourseNum  Student  TA
ENG101     Jones    Clark
ENG101     Grayson  Chen
ENG101     Samara   Chen
MAT350     Grayson  Powers
MAT350     Jones    O’Shea
MAT350     Berg     Powers
Tables in BCNF
Courses
*CourseNum  *Student
ENG101      Jones
MAT350      Grayson

Students
*Student  *TA
Jones     Clark
Grayson   Chen

TAs
*CourseNum  *TA
ENG101      Clark
ENG101      Chen
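The BCNF problem in Course_Students_TAs can be checked against the sample rows: TA determines CourseNum, yet TA is not a candidate key. The sketch below, in plain Python, also shows that splitting on that determinant loses nothing. It uses a two-table split (Student/TA and TA/CourseNum), which is a simplification chosen for this illustration rather than the three tables shown on the slide.

```python
rows = [
    ("ENG101", "Jones", "Clark"),
    ("ENG101", "Grayson", "Chen"),
    ("ENG101", "Samara", "Chen"),
    ("MAT350", "Grayson", "Powers"),
    ("MAT350", "Jones", "O'Shea"),
    ("MAT350", "Berg", "Powers"),
]

# Determinant check: does every TA appear with exactly one course?
courses_by_ta = {}
for course, _student, ta in rows:
    courses_by_ta.setdefault(ta, set()).add(course)
ta_determines_course = all(len(c) == 1 for c in courses_by_ta.values())

# Candidate-key check: a key value may not repeat across rows.
tas = [ta for _course, _student, ta in rows]
ta_is_candidate_key = len(set(tas)) == len(tas)

# Split on the determinant, then join the pieces back together on TA.
student_ta = {(student, ta) for _course, student, ta in rows}
ta_course = {(ta, course) for course, _student, ta in rows}
rejoined = {
    (course, student, ta)
    for student, ta in student_ta
    for ta2, course in ta_course
    if ta == ta2
}

print(ta_determines_course)   # True
print(ta_is_candidate_key)    # False: Chen and Powers each appear twice
print(rejoined == set(rows))  # True: the split loses no information
```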
Fourth Normal Form (4NF)
A table is in 4NF when it is in BCNF and there are no multi-valued dependencies.
Multi-valued Dependency: occurs when, for each value in field A, there is a set of values for field B and a set of values for field C, but B and C are not related.
Occurs when the table contains fields that are not logically related.

Sample 4NF Violation - 1
Movies
*Movie            *Star            *Producer
Once Upon a Time  Judy Garland     Alfred Brown
Once Upon a Time  Mickey Rooney    Alfred Brown
Once Upon a Time  Judy Garland     Muriel Hemingway
Once Upon a Time  Mickey Rooney    Muriel Hemingway
Moonlight         Humphrey Bogart  Alfred Brown
Moonlight         Judy Garland     Alfred Brown
Tables in 4NF - 1
Stars
*Movie            *Star
Once Upon a Time  Judy Garland
Once Upon a Time  Mickey Rooney
Moonlight         Humphrey Bogart
Moonlight         Judy Garland

Producers
*Movie            *Producer
Once Upon a Time  Alfred Brown
Once Upon a Time  Muriel Hemingway
Moonlight         Alfred Brown
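Because stars and producers are independent of each other, the Movies table is nothing more than every star of a movie paired with every producer of that movie. A sketch in plain Python that splits the sample into the two 4NF tables and joins them back on Movie; variable names are illustrative.

```python
movies = {
    ("Once Upon a Time", "Judy Garland", "Alfred Brown"),
    ("Once Upon a Time", "Mickey Rooney", "Alfred Brown"),
    ("Once Upon a Time", "Judy Garland", "Muriel Hemingway"),
    ("Once Upon a Time", "Mickey Rooney", "Muriel Hemingway"),
    ("Moonlight", "Humphrey Bogart", "Alfred Brown"),
    ("Moonlight", "Judy Garland", "Alfred Brown"),
}

# The two 4NF tables are projections of the original.
stars = {(movie, star) for movie, star, _producer in movies}
producers = {(movie, producer) for movie, _star, producer in movies}

# Joining them on Movie pairs every star with every producer of that movie.
rejoined = {
    (movie, star, producer)
    for movie, star in stars
    for movie2, producer in producers
    if movie == movie2
}

print(len(stars), len(producers))  # 4 3
print(rejoined == movies)          # True
```

Adding a third producer to "Once Upon a Time" takes one new row in Producers, against one new row per star in the combined table.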
Sample 4NF Violation - 2
Projects_Equipment
DeptCode  ProjNum  ProjMgrID  Equip  PropID
IS
36-272-TC
EN1-15
CD-ROM
657
VGA monitor
305
AC
Dot matrix printer
358
AC
Calculator w/tape
239
486 PC
275
Laser Printer
109
IS
AC
36-152-TC
EN1-15
TW
30-452-T3
EN1-10
TW
30-457-T3
EN1-15
TW
31-124-T3
EN1-15
Tables in 4NF - 2
Equipment
*PropID  Equip               DeptCode
657      CD-ROM              IS
305      VGA monitor         IS
358      Dot matrix printer  AC

Projects
*ProjNum   ProjMgrID  DeptCode
30-452-T3  EN1-15     IS
30-457-T3  EN1-15     AC
35-152-TC  EN1-10     TW
Fifth Normal Form (5NF)
A table is in 5NF when it is in 4NF and there are no cyclic dependencies.
Cyclic Dependency: occurs when there is a multi-field primary key with three or more fields (ex. A, B, C) and those fields are related in pairs AB, BC and AC.
Can occur only with a multi-field primary key of three or more fields

Sample 5NF Violation
BUYING
*Buyer  *Product  *Company
Chris   Jeans     Levi
Chris   Jeans     Wrangler
Chris   Shirts    Levi
Lori    Jeans     Levi
Do the math

Our sample is two buyers, two products and two companies, so the table could hold up to
2 x 2 x 2 = 8 total records

But, what if our store has 20 buyers, 50 products and 100 companies?
20 x 50 x 100 = 100,000 possible records
A Tempting Solution
Buyers
*Buyer  *Product
Chris   Jeans
Chris   Shirts
Lori    Jeans

Products
*Product  *Company
Jeans     Wrangler
Jeans     Levi
Shirts    Levi
The Correct Solution
Buyers
*Buyer  *Product
Chris   Jeans
Chris   Shirts
Lori    Jeans

Products
*Product  *Company
Jeans     Wrangler
Jeans     Levi
Shirts    Levi

Companies
*Buyer  *Company
Chris   Wrangler
Chris   Levi
Lori    Levi
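Why two tables are tempting but wrong, and three are correct, shows up as soon as the tables are joined back together. A sketch in plain Python using the BUYING sample; variable names are illustrative.

```python
buying = {
    ("Chris", "Jeans", "Levi"),
    ("Chris", "Jeans", "Wrangler"),
    ("Chris", "Shirts", "Levi"),
    ("Lori", "Jeans", "Levi"),
}

# The three pairwise tables are projections of BUYING.
buyers = {(buyer, product) for buyer, product, _company in buying}
products = {(product, company) for _buyer, product, company in buying}
companies = {(buyer, company) for buyer, _product, company in buying}

# Tempting solution: join only Buyers and Products on Product.
two_way = {
    (buyer, product, company)
    for buyer, product in buyers
    for product2, company in products
    if product == product2
}
spurious = two_way - buying

# Correct solution: also require the (Buyer, Company) pair to exist.
three_way = {row for row in two_way if (row[0], row[2]) in companies}

print(spurious)             # {('Lori', 'Jeans', 'Wrangler')}
print(three_way == buying)  # True
```

The two-table join invents a purchase that never happened (Lori buying Wrangler jeans); the third table is what filters it out.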
Check the Math, Again
What if our company has 20 buyers, 50 products and 100 companies?
Buyers = 20 x 50 = 1,000
Products = 50 x 100 = 5,000
Companies = 20 x 100 = 2,000
At most 8,000 total records instead of 100,000!
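The same arithmetic as a short Python sketch, so the counts can be re-run for other store sizes. These are upper bounds (every possible combination present); variable names are illustrative.

```python
buyers, products, companies = 20, 50, 100

# One three-field table: every buyer/product/company combination.
single_table_max = buyers * products * companies

# Three two-field tables: every pair, added up.
pair_tables_max = (
    buyers * products        # Buyers table
    + products * companies   # Products table
    + buyers * companies     # Companies table
)

print(single_table_max)  # 100000
print(pair_tables_max)   # 8000
```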
8. Finalizing the Design
Double-check to ensure good, principle-based design
Evaluate design in light of business model and determine desired deviations from design principles
◦ Process efficiency
◦ Security concerns
That’s it for Table Design
Watch for repeating values and fields
Check against the Normal Forms
Make new tables when necessary
Re-check all tables against the NFs
Remember the business rules
Use common sense, but check anyway!

Ensuring Data Integrity

Placing constraints on how and when and
where data can be entered

Done after or along with table design

Part of design process because many
constraints are established at the
database and table levels
Referential Integrity
True relational databases support Referential Integrity: every non-null foreign key value must match an existing primary key value.
In other words, every record in a related table that has a foreign key value must have a matching record in the primary table.
Preserves the validity of foreign key values.
Enforced at database level.

Cascading Updates
When a primary key value changes, Cascade Update changes the corresponding values in the related records, so no records get orphaned.
Usually only one level deep
◦ Foreign key is not usually primary key of related table (except in 1:1 relationships), hence no other tables are usually related to it
Cascade Deletes
When a primary table record is deleted, all matching records in any related table are also deleted
Can propagate through multiple tables if Cascade Delete is turned on in all relationships between those tables
Another protection against orphan records, only this time by eradicating them instead!
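Referential integrity and both cascade options can be tried in a few lines. A sketch using Python's sqlite3 module and the Customers and Orders tables from earlier in the deck; SQLite is an assumption made for illustration, and it enforces foreign keys only after PRAGMA foreign_keys = ON. The values 09999, 0000 and 8800 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE Customers (CustomerNum TEXT PRIMARY KEY, Customer TEXT);
CREATE TABLE Orders (
    OrderNum    TEXT PRIMARY KEY,
    CustomerNum TEXT REFERENCES Customers(CustomerNum)
                ON UPDATE CASCADE ON DELETE CASCADE
);
INSERT INTO Customers VALUES ('7822', 'XYZ Inc.'), ('8755', 'Totally Toys');
INSERT INTO Orders VALUES ('09010', '7822'), ('06755', '8755'), ('08134', '8755');
""")

def try_execute(sql):
    """Run one statement; report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

# Referential integrity: an order cannot point at a customer that does not exist.
orphan_result = try_execute("INSERT INTO Orders VALUES ('09999', '0000')")

# Cascade Update: renumbering a customer carries over to its orders.
conn.execute("UPDATE Customers SET CustomerNum = '8800' WHERE CustomerNum = '8755'")
moved_orders = [r[0] for r in conn.execute(
    "SELECT OrderNum FROM Orders WHERE CustomerNum = '8800' ORDER BY OrderNum"
)]

# Cascade Delete: removing a customer removes its orders too.
conn.execute("DELETE FROM Customers WHERE CustomerNum = '7822'")
remaining_orders = [r[0] for r in conn.execute(
    "SELECT OrderNum FROM Orders ORDER BY OrderNum"
)]

print(orphan_result)     # rejected
print(moved_orders)      # ['06755', '08134']
print(remaining_orders)  # ['06755', '08134']
```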

Levels of Enforcement
Referential Integrity is enforced at the database level because it affects the relationship between two tables.
Many other business rules are enforced at the field and table level to ensure data integrity.
Business rule implementation should be documented: how and where it is enforced in the design.
Some rules can’t be enforced at the table or field level; they must be enforced at the application level.

Testing of Business Rules

Always test business rule implementation
◦ What happens when rule is met?
◦ What happens when rule is violated?
Not much good as a data entry constraint if it doesn’t constrain properly
Good application or interface design will provide feedback when user violates a constraint or rule

Field Level Integrity

Constraining by use of field properties
◦ Data type: text, number, Yes/No, Date/Time
◦ Field size
◦ Formats

Entry and editing constraints
◦ Required
◦ Indexed, with or without duplicates
◦ Input masks
◦ Default value
◦ Validation Rule
Table Level Integrity

Field Comparisons
◦ Compare value in one field to value in another
◦ Comparison performed before record is saved
◦ Violations could display an error message or
force constraint of available values

Validation or Lookup Tables
◦ Store generally static set of values
◦ Stored values used to populate new records to
ensure accuracy of data entry
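Several of these field-level and table-level constraints map directly onto SQL column and table clauses. A sketch using Python's sqlite3 module; the Departments lookup table and the Status, Budget, StartDate and EndDate fields are hypothetical, and try_insert is an illustrative helper. Following the advice on testing business rules, each rule is exercised with a row that violates it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE Departments (DeptCode TEXT PRIMARY KEY);   -- lookup table
INSERT INTO Departments VALUES ('IS'), ('AC'), ('TW');

CREATE TABLE Projects (
    ProjNum   TEXT PRIMARY KEY,
    ProjTitle TEXT NOT NULL,                            -- Required
    Status    TEXT NOT NULL DEFAULT 'open',             -- Default value
    Budget    REAL CHECK (Budget >= 0),                 -- Validation rule
    StartDate TEXT NOT NULL,
    EndDate   TEXT,
    DeptCode  TEXT REFERENCES Departments(DeptCode),    -- must exist in lookup table
    CHECK (EndDate IS NULL OR EndDate >= StartDate)     -- field comparison
);
""")

def try_insert(**fields):
    """Insert one Projects row; report whether the constraints accepted it."""
    columns = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    try:
        conn.execute(
            f"INSERT INTO Projects ({columns}) VALUES ({marks})",
            tuple(fields.values()),
        )
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

valid = try_insert(ProjNum="30-452-T3", ProjTitle="STAR Manual",
                   StartDate="2024-01-15", DeptCode="TW")
no_title = try_insert(ProjNum="30-457-T3", StartDate="2024-01-15")
negative = try_insert(ProjNum="30-482-TC", ProjTitle="Web Site",
                      StartDate="2024-01-15", Budget=-1)
backwards = try_insert(ProjNum="31-124-T3", ProjTitle="STAR prototype",
                       StartDate="2024-03-01", EndDate="2024-02-01")
bad_dept = try_insert(ProjNum="35-272-TC", ProjTitle="Order System",
                      StartDate="2024-01-15", DeptCode="XX")
status = conn.execute(
    "SELECT Status FROM Projects WHERE ProjNum = '30-452-T3'"
).fetchone()[0]

print(valid, no_title, negative, backwards, bad_dept)
# ok rejected rejected rejected rejected
print(status)  # open
```

Input masks and user feedback have no SQL equivalent here; those belong to the interface or application.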
Documentation
A good design deserves good documentation
Data Dictionary for database/table design
◦ Table and field names
◦ Table and field properties
◦ Relationships, including primary and foreign keys
◦ Indexes
Provide reasons for design features, especially if they intentionally violate normal design principles