What is database design?

Class Agenda: 03/13 – 3/15
Review Database design – core concepts

Review design for ERD Scenarios #3 & #4

Review concepts of normalization.

Practice design from forms using the Replica Toys
database (ERD scenario #5).

Discuss issues in database design and normalization.
Discuss concepts of data warehouse design.

Establish environment surrounding DW design.

Contrast methods of DW design.
Goals for Transaction Database Design
Protect the integrity of the data.

Reduce data redundancy.

Prevent data anomalies.
Provide for change.

Prevent inflexible data structures.

Anticipate changes.
Provide access to complete data for decision
making.
What is normalization?
Normalization is a formal, process-oriented
approach to data modeling.
Normalization is the process of:

examining groups of data attributes;

splitting them into appropriate entities;

identifying the relationships between the entities;
and

identifying appropriate primary and foreign keys.
Two methods of applying normalization
1. Use it to help in designing a database.

Normalization starts with a single entity.

Normalization breaks that entity into a series of additional
entities.

More entities are discovered and named during the process.

Entities are linked during the process.
2. Use it to validate the design of a database.

Identify entities from the meaning of the data.

Create conceptual and logical data models.

Apply the rules of normalization to ensure a stable, nonredundant design.
Normalization Vocabulary:
Functional Dependency and Determinants
A social security number determines your name and
address.
SSN → name, address.
A vehicle id number determines the make and
model of a car.
VIN → make, model.
Name and address are “functionally dependent” on
SSN.
SSN “determines” name and address.
Functional dependency diagram format:



CrsNum → CrsDescription, CrsCredits
ZipCode → City, State (this implies that a zip code uniquely identifies a
city and state in the U.S. postal system)
PatID, TrtDateTime → TstResults, TrtID, LocID,
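A functional dependency can be tested against sample rows. The sketch below is one minimal way to do that in Python; the helper name `holds` and the course rows are invented for illustration and are not part of the course material. Sample data can only refute a dependency, never prove it, because the rule comes from the meaning of the data.

```python
def holds(rows, determinant, dependents):
    """Return True if `determinant -> dependents` is not violated by `rows`.

    rows: list of dicts; determinant and dependents: tuples of column names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependents)
        # The same determinant value must always map to the same dependents.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample rows, invented for illustration.
courses = [
    {"CrsNum": "IS301", "CrsDescription": "Database Design", "CrsCredits": 3},
    {"CrsNum": "IS301", "CrsDescription": "Database Design", "CrsCredits": 3},
    {"CrsNum": "IS410", "CrsDescription": "Data Warehousing", "CrsCredits": 3},
]

# CrsNum determines the description and credits in this sample.
print(holds(courses, ("CrsNum",), ("CrsDescription", "CrsCredits")))
# CrsCredits does not determine CrsNum: two courses share 3 credits.
print(holds(courses, ("CrsCredits",), ("CrsNum",)))
```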
Normal forms relevant to business-oriented databases
First normal form: Remove repeating groups.
Second normal form: Remove partial functional dependencies.
Third normal form: Remove transitive dependencies.
First Normal Form
First normal form: Remove repeating groups.
A repeating group is an attribute or group of
attributes that can have more than one value for an
instance of an entity.
Example of repeating groups:

StudentID → StudentName, StudentAddress,
CourseID1, DateTaken1, Grade1, CourseID2,
DateTaken2, Grade2, CourseID3, DateTaken3,
Grade3, CourseID4, DateTaken4, Grade4…
Other examples of a repeating
group
Serial# → model#, customer name, customer
address, feature 1 chosen, feature 2 chosen, feature
3 chosen…
PatientID → name, address, zip, first insurance
company, second insurance company, third
insurance company…
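Removing a repeating group means turning the numbered columns into rows of a new entity. The sketch below does this for the student example. It assumes the repeating columns follow a `Name1`, `Name2`, ... pattern, and the function name and sample values are hypothetical.

```python
# Hypothetical wide record with a repeating group (courses 1 to 3).
student = {
    "StudentID": 1001,
    "StudentName": "Pat Lee",
    "CourseID1": "IS301", "DateTaken1": "2006-01", "Grade1": "A",
    "CourseID2": "IS410", "DateTaken2": "2006-09", "Grade2": "B",
    "CourseID3": None, "DateTaken3": None, "Grade3": None,
}


def remove_repeating_group(record, group_fields, key_field):
    """Split a wide record into a parent row and one child row per
    filled-in occurrence of the repeating group."""
    def is_group_column(name):
        return any(name.startswith(f) and name[len(f):].isdigit()
                   for f in group_fields)

    parent = {k: v for k, v in record.items() if not is_group_column(k)}
    children = []
    n = 1
    while f"{group_fields[0]}{n}" in record:
        values = {f: record[f"{f}{n}"] for f in group_fields}
        if values[group_fields[0]] is not None:  # skip empty slots
            children.append({key_field: record[key_field], **values})
        n += 1
    return parent, children


parent, children = remove_repeating_group(
    student, ["CourseID", "DateTaken", "Grade"], "StudentID")
print(parent)
for child in children:
    print(child)
```

The child rows carry StudentID as a foreign key, so a student can take any number of courses without adding columns.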
To remedy a problem discovered
with normalization
To get a data model into an appropriate
normal form:




Identify the problem (repeating group, partial
functional dependency, or transitive dependency)
and place the “problem” attributes in one or more
new separate entities in the model.
Identify a primary key for the new entity. The key
may be concatenated if it is an associative entity,
rather than a strong entity.
Create relationships between existing and new
entities.
Divide m:n relationships with appropriate
intersection entities.
Second Normal Form
Second normal form: Remove partial functional dependencies.
A partial functional dependency is a situation
in which one or more non-key attributes are
functionally dependent on part, but not all, of
the primary key.

Partial functional dependencies occur only with
entities that have concatenated primary keys.
Examples of partial functional dependencies:

PatID, TrtDateTime → PatName, TstResults, TrtType,
TrtDescription, LocName, TrtID, LocID,

CourseID, StudentID → CourseTitle, Grade
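The course example can be split into two entities. The sketch below assumes that CourseTitle depends on CourseID alone (the partial dependency) and that Grade depends on the whole key. The `project` helper and the sample rows are invented for illustration.

```python
def project(rows, columns):
    """Project rows onto `columns`, dropping duplicates (order kept)."""
    seen, out = set(), []
    for row in rows:
        item = tuple(row[c] for c in columns)
        if item not in seen:
            seen.add(item)
            out.append(dict(zip(columns, item)))
    return out


# Hypothetical rows keyed by (CourseID, StudentID).
enrollment = [
    {"CourseID": "IS301", "StudentID": 1, "CourseTitle": "Database Design", "Grade": "A"},
    {"CourseID": "IS301", "StudentID": 2, "CourseTitle": "Database Design", "Grade": "B"},
    {"CourseID": "IS410", "StudentID": 1, "CourseTitle": "Data Warehousing", "Grade": "B"},
]

# CourseTitle moves to an entity keyed by CourseID alone.
course = project(enrollment, ["CourseID", "CourseTitle"])
# Grade stays with the full concatenated key.
grade = project(enrollment, ["CourseID", "StudentID", "Grade"])

print(course)
print(grade)
```

After the split, each course title is stored once instead of once per enrolled student.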
Third Normal Form
Third normal form: Remove transitive dependencies.
A transitive dependency occurs when a non-key attribute
is functionally dependent on one or more non-key
attributes.
Examples of transitive dependencies:

TrackingNumber → ShipmentDate, OrderID, ItemID,
ShipmentLocationID, LocationDescription,
QuantityShipped

PatID, TrtDateTime → TstResults, TrtType,
TrtDescription, LocName, TrtID, LocID,
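The shipment example can be brought into third normal form by moving the transitively dependent attribute into its own entity. The sketch below uses Python's built-in `sqlite3` module and assumes the transitive dependency is ShipmentLocationID → LocationDescription. The table names, column types, and sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ShipmentLocation (
    ShipmentLocationID  INTEGER PRIMARY KEY,
    LocationDescription TEXT NOT NULL
);
CREATE TABLE Shipment (
    TrackingNumber     TEXT PRIMARY KEY,
    ShipmentDate       TEXT,
    OrderID            INTEGER,
    ItemID             INTEGER,
    QuantityShipped    INTEGER,
    ShipmentLocationID INTEGER REFERENCES ShipmentLocation(ShipmentLocationID)
);
-- Hypothetical sample data.
INSERT INTO ShipmentLocation VALUES (1, 'East dock'), (2, 'West dock');
INSERT INTO Shipment VALUES
    ('T-100', '2006-03-13', 501, 9001, 4, 1),
    ('T-101', '2006-03-14', 502, 9002, 1, 1),
    ('T-102', '2006-03-15', 503, 9001, 2, 2);
""")

# Renaming a location now changes exactly one row.
updated = conn.execute(
    "UPDATE ShipmentLocation SET LocationDescription = 'East dock A' "
    "WHERE ShipmentLocationID = 1"
).rowcount

# A join recovers the original wide view when it is needed.
rows = conn.execute("""
    SELECT s.TrackingNumber, l.LocationDescription
    FROM Shipment s
    JOIN ShipmentLocation l ON l.ShipmentLocationID = s.ShipmentLocationID
    ORDER BY s.TrackingNumber
""").fetchall()
print(updated, rows)
```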
Issues in Database Design
Characteristics of business-oriented databases.

Used to store transactions.

Updated quickly and frequently, but not always accurately.

Accessed online real-time.

Support operational decision making.
Assuming that the data stored is accurate, what
is the biggest potential problem with a
transaction database in third normal form?
How do most organizations solve that problem?
What do organizations potentially lose when they
solve that problem?
Major purposes of a data warehouse
To create a data store designed to facilitate
managerial decision making.

Integrated data.

Subject-oriented.

Time-variant.

Non-volatile.
To create a data store that has better-quality,
more consistent data than existing operational
databases.
[Figure: data warehouse architecture. Labeled components: Operational Transaction and External Data Sources; Extract, Transform, Load (ETL) Processes; Reconciled Enterprise Data Warehouse (Data Warehouse Server); Extract and Load Processes; Data Mart Tier; User Departments.]
Goals of data warehouse design
Make accurate information easily accessible.
Present information consistently.
Be adaptive and flexible to change.
Provide reasonable and expected performance
for information to support decision making.
Minimize data redundancy.
Protect/secure information.
Three different data models
Transaction (operational) data model: Contains
current data required by separate and/or integrated
operational systems. Supports the transactional processing
of the organization. Is frequently used to support day-to-day
decision making. 3rd normal form.
Reconciled (enterprise data warehouse) data
model: Contains detailed, current data intended to be the
single, authoritative source for all decision support
applications. Usually in 3rd normal form.
Derived (data mart) data model: Contains data
that are selected, formatted and aggregated for end-user
decision support applications. Star schema. Probably not
normalized.
Comparison – Replica Toys
Transaction data model
Reconciled data model
Derived (data mart) data model
Reconciled and Derived Data Models
Reconciled (EDW):
Independent of specific decisions.
Centralized control; usually owned by IT.
Historical.
Not summarized.
Normalized.
Flexible.
Many data sources.
Long life.
Starts large, becomes larger.

Derived (Data Mart):
Specific decisions.
One central subject.
Usually accessed directly by users; usually decentralized into user area.
Closely defined subject area.
Detailed and/or summarized.
Usually denormalized.
Restrictive – few sources.
Short life span.
Starts small, becomes large.
Two approaches to design
Enterprise Data Warehouse (Inmon):
Focus is on enterprise subjects that will be needed to support comprehensive decision making.
Emphasis on creating design that is consistent among subject areas.
Implementation is of a data mart.
Uses ERD for modeling.
Relies on comprehensive blueprint for interrelation of data.

Interrelated Data Marts (Kimball):
Focus is on business subject area for data warehouse.
Emphasis on creating simple design that can be implemented quickly.
Implementation is of a data mart.
Uses “dimensional model” for modeling. Kind of like an ERD with UML-type aspects.
Relies on consistent interrelation of data by integration of existing data models.
Compare/Contrast Approaches
Similarities:

Both focus on subject areas for development of data model.

Both require extensive input from data warehouse stakeholders.

Both produce a subject-oriented, non-volatile, time-related data
warehouse.

Both try to quickly implement a prototype data mart.
Differences:

Inmon creates a more integrated and consistent data warehouse by
attempting to design an enterprise-wide warehouse at the beginning
of the first data warehouse project. This is called a “reconciled” DW
design.

Kimball relies on future project teams referencing existing data
warehouse models for new projects.
What do both approaches yield?
A design for a data mart.
The design for a data mart relies on the concept
of a data warehouse “cube.”
A cube is a logical construct containing a “fact”
table that is accessed on multiple “dimension”
tables.
A fact table contains values that a manager
uses to make decisions.
A dimension table is used as a reference for the
values in the fact table.
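A fact table and its dimension tables can be sketched as a small star schema. The example below uses Python's built-in `sqlite3` module. The table names (`SalesFact`, `ToyDim`, `OutletDim`), columns, and rows are hypothetical, loosely modeled on the Replica Toys scenario rather than taken from it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: reference data that describes the facts.
CREATE TABLE ToyDim    (ToyKey INTEGER PRIMARY KEY, ToyName TEXT);
CREATE TABLE OutletDim (OutletKey INTEGER PRIMARY KEY, OutletName TEXT);
-- Fact table: the measured values, keyed by the dimensions and a date.
CREATE TABLE SalesFact (
    ToyKey    INTEGER REFERENCES ToyDim(ToyKey),
    OutletKey INTEGER REFERENCES OutletDim(OutletKey),
    SaleDate  TEXT,
    Quantity  INTEGER,
    Amount    REAL
);
INSERT INTO ToyDim VALUES (1, 'Model train'), (2, 'Model plane');
INSERT INTO OutletDim VALUES (1, 'Web'), (2, 'Catalog');
INSERT INTO SalesFact VALUES
    (1, 1, '2006-03-13', 2, 40.0),
    (1, 2, '2006-03-13', 1, 20.0),
    (2, 1, '2006-03-14', 3, 45.0);
""")

# Access the facts along one dimension: totals by distribution outlet.
sales_by_outlet = conn.execute("""
    SELECT o.OutletName, SUM(f.Quantity), SUM(f.Amount)
    FROM SalesFact f
    JOIN OutletDim o ON o.OutletKey = f.OutletKey
    GROUP BY o.OutletName
    ORDER BY o.OutletName
""").fetchall()
print(sales_by_outlet)
```

Swapping `OutletDim` for `ToyDim` in the join gives the same facts viewed along a different dimension, which is the idea behind the cube.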
Steps of data warehouse design
1. Identify the stakeholders that need data to support their decisions.
2. Define and describe the data needs of those stakeholders.
3. Define the subject area.
4. Choose between an EDW with data marts and a data mart alone.
5. Select the data of interest.
6. Add element of time.
7. Add derived data.
8. Determine granularity level.
9. Summarize data.
10. Identify and attempt to solve potential performance issues.
How do you identify
those people within an
organization who require
data to support their
decision making
processes?
Define and describe the data needs
Usually termed “stakeholder analysis.”
Differing levels of decision making require differing sets of data.

Internal vs. external data.

Integrated vs. non-integrated data.

Detailed vs. summarized data.
Different stakeholders require different access mechanisms.

Online vs. reports.

Pre-formatted vs. ad-hoc availability of data.
Different stakeholders require different timing.

Online, real time vs. delay.

Relative size of delay/timeliness is always an issue.
Stakeholder Analysis Table Example – Replica Toys
Columns: Stakeholder; Decision Making Responsibilities; Existing Information?; Additional Information?; Availability of Additional Information?
Marketing
Analyst
Decide what features are
most valuable to which
customers.
No data related to
features currently
available.
Features selected by
customers.
Not in existing system
and cannot be
compiled manually.
Maybe telephone
survey? Maybe
registration system?
Determine trends in toy
purchases.
Distribution
Manager
Determine trends in use
of distribution outlets.
Customer order data
by distribution outlet.
Customer order data
by distribution outlet.
Determine distribution
outlet profitability.
Purchases by toy by
customer by
distribution outlet.
Purchase price by toy
by customer by
distribution outlet.
Quality control specialist
Evaluate comparative defects of toys within and
across product lines.
Support call data.
Product return data.
Development
engineer
Evaluate relative safety
issues with existing
product line.
Determine potential
safety issues with new
product development.
Purchases by toy by
customer.
Support call data.
Product return data.
Safety test data.
Need customer order
data with more
specific parameters.
See if available in
customer order
system.
Detailed problem
reports including date,
toy, problem, extent
of damage.
Not available in
current support call
and product return
systems. Could be
added.
Detailed problem
reports including date,
toy, problem, injury,
relative impact of
injury, potential
responsibility.
Not available in
current support call
and product return
systems. Could be
added.
Engineering safety test
data is available.
Define the subject area
Potential subject areas common to many businesses:

Customers: people and organizations who acquire and/or use the
company’s products.
Equipment: Machinery, devices, tools and their components.
Facilities: Real estate and their components.
Sales: Transactions that move a product from company to a
customer.
Suppliers: Entities that provide a company with goods and services.
Products: Goods and services that the company, or its competitors,
provide to customers.
Materials: Goods and services that the company uses to produce its
products.
Financials: Information about money that is received, retained,
expended, invested or in any way tracked by the company.
Human resources: Individuals who perform work for the company –
may be employees, contracts, or simply positions.
Select the data of interest
Use the existing transaction database model.
Identify and understand the necessary business
decisions.
Identify external data that could help support
decisions.
Use tables to help sort available attributes.

Example: Table 4.1 on pgs 104-106 of chapter 4 in
“Mastering Data Warehouse Design.”
Add element of time
Data warehouse is a historical model rather
than a current “point in time” model.
Must have a way to incorporate changes that
occur over time.
Important issues:

Fact table must include a time component.

Ranges of time vs. effective period in time

Time also relates to dimension tables

May have to deal with differing time periods. Examples
are fiscal years, “holiday rush,” billing cycle, etc.
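One common way to carry history in a dimension table is to give each row an effective period and look up the version that was valid on the date of the fact. The sketch below assumes half-open periods (start included, end excluded). The field names and the price history are invented for illustration.

```python
from datetime import date

# Hypothetical dimension history: each row is valid for [start, end).
toy_price_history = [
    {"ToyKey": 1, "ListPrice": 20.0,
     "start": date(2005, 1, 1), "end": date(2006, 1, 1)},
    {"ToyKey": 1, "ListPrice": 22.0,
     "start": date(2006, 1, 1), "end": date.max},  # still current
]


def version_as_of(history, toy_key, when):
    """Return the dimension row in effect for `toy_key` on `when`, or None."""
    for row in history:
        if row["ToyKey"] == toy_key and row["start"] <= when < row["end"]:
            return row
    return None


# A sale dated in 2005 is matched to the price in effect at that time.
print(version_as_of(toy_price_history, 1, date(2005, 6, 1))["ListPrice"])
print(version_as_of(toy_price_history, 1, date(2006, 3, 13))["ListPrice"])
```

Because old rows are closed off rather than overwritten, the warehouse stays historical instead of reflecting only the current point in time.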
Add derived data
Derived data includes any kind of calculated
field.
Examples: total sales; net sales amount; total
funds raised; total cost of products.
Issues:

Must be identified, defined and agreed upon by data
warehouse stakeholders.

Must be documented in metadata.

Must be consistent.
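Consistency is easier to keep when each derived field has exactly one agreed definition that every load process calls. The sketch below assumes, purely for illustration, that net sales amount means gross amount minus discounts and returns. The real definition is whatever the stakeholders agree on and record in the metadata.

```python
def net_sales_amount(gross_amount, discounts, returns):
    """Assumed definition for illustration: gross minus discounts and returns."""
    return gross_amount - discounts - returns


# Hypothetical order lines.
order_lines = [
    {"OrderID": 501, "gross": 100.0, "discounts": 10.0, "returns": 0.0},
    {"OrderID": 502, "gross": 50.0, "discounts": 0.0, "returns": 50.0},
]

# Every row gets the derived field from the same single definition.
for line in order_lines:
    line["NetSalesAmount"] = net_sales_amount(
        line["gross"], line["discounts"], line["returns"])

total_net_sales = sum(line["NetSalesAmount"] for line in order_lines)
print(total_net_sales)
```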
Determine granularity level
What are the benefits and drawbacks of a low
level of granularity?
What are the benefits and drawbacks of a high
level of granularity?
What factors should be considered when
determining the level of granularity in the data
warehouse?
Summarize (aggregate) data
What is summarized data?
How is data summarized?
Does summarized data save disk space?
Why summarize data?
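As a starting point for discussion, the sketch below shows one way data can be summarized: rolling daily sales up to one row per month and toy. The sample rows and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed facts: (sale date, toy, amount).
daily_sales = [
    ("2006-03-13", "Model train", 40.0),
    ("2006-03-14", "Model train", 20.0),
    ("2006-04-01", "Model train", 30.0),
    ("2006-03-14", "Model plane", 45.0),
]


def summarize_by_month(rows):
    """Aggregate daily amounts to one total per (month, toy)."""
    totals = defaultdict(float)
    for day, toy, amount in rows:
        totals[(day[:7], toy)] += amount  # "YYYY-MM" prefix is the month
    return dict(totals)


monthly_sales = summarize_by_month(daily_sales)
print(monthly_sales)
```

The monthly rows answer monthly questions with fewer rows to scan, but they cannot be broken back down into days, which is the granularity trade-off raised above.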
Identify and solve performance issues
What are the potential performance problems
that can occur with a data warehouse?
Why is performance a consideration during data
warehouse design?
What can a designer do to alleviate potential
performance problems?