RELATIONAL DATA MODELING MIS2502 Data Analytics

A Brief Review
• Data: gathering, storing, retrieving, interpreting
Will be talked about throughout the class
A Brief Review
Transactional Database
• Supports management of an organization’s data
• For everyday transactions
• This is what is commonly thought of as “database management”
Analytical Data Store
• Supports managerial decision-making
• For periodic analysis
• This is the foundation for business intelligence
The Information Architecture of an
Organization
Data entry -> Transactional Database -> Data extraction -> Analytical Data Store -> Data analysis
• Transactional Database: stores real-time transactional data. Called OLTP: online transaction processing
• Analytical Data Store: stores historical transactional and summary data. Called OLAP: online analytical processing
A Brief Review
Transactional Database | Analytical Data Store
Based on Relational paradigm | Based on Dimensional paradigm
Storage of real-time transactional data | Storage of historical transactional data
Optimized for storage efficiency and data integrity | Optimized for data retrieval and summarization
Supports day-to-day operations | Supports periodic and on-demand analysis
What is a model?
• Representation of something in the real world
Modeling a database
• A representation of the structure of the data
• Describes the data contained in the database
• Explains how the data interrelates
• A student is part of a section, which is part of a course
Why bother modeling?
• Creates a blueprint before you start building the database
• Gets the story straight: easy for non-technical people to
understand
• Minimizes having to go back and make changes in the
implementation stage
The process of analysis and design
• Systems Analysis
• Analysis of complex, large-scale
systems and the interactions within
those systems
http://en.wikipedia.org/wiki/Systems_analysis
• Systems Design
• The process of defining the hardware
and software architectures,
components, models, interfaces, and
data for a computer system to satisfy
specified requirements
http://en.wikipedia.org/wiki/Systems_design
Notice that they
are not the
same!
Basically…
In the context of database
development.
• Systems Analysis is the
process of modeling the
problem
• Requirements-oriented
• What should we do?
This is where we define
and understand the
business scenario.
• Systems Design is the
process of modeling a
solution
• Functionality-oriented
• How should we do it?
This is where we
implement that
scenario as a
database.
Start with a problem statement
• “We want a database to track orders.”
• That’s too vague to create a useful system, so we then
gather requirements to learn more
• Gather documentation
• About the business process
• About existing systems
• Conduct interviews
• Employees directly involved in the process
• Other stakeholders (e.g., customers)
• Management
Why is each of these important?
Are there others?
Start with a problem statement
• Refine the problem statement
• Getting iterative feedback from the client
• End up with a scenario like this:
• The system must track customer orders
• Multiple products can go into an order
• A customer is described by their name, address, and a unique
Customer ID number
• An order is described by the date on which it was placed, what was
bought, and how much it costs
The specification “what was bought” is a little vague,
and that will cause us a problem a little later.
But let’s leave it for now…
First lecture on data modeling stops here.
Review Questions
• What is the key difference between a transactional
database and an analytical data store?
• What is a Model?
• What is the first step to build a database?
•  Let’s build a relational database!
The Entity Relationship Diagram (ERD)
• The primary way of modeling a relational database
• Part of the “analysis” process
• Implemented as a picture with three key elements
• Entity (use rectangle): A uniquely identifiable thing (e.g., person, order)
• Relationship (use diamond): Describes how two entities relate to one another (e.g., makes)
• Attribute (use oval): A characteristic of an entity or relationship (e.g., first name, order number)
A very simple example
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip) place Order (Order number, Order Date, Product name, Price)]
The primary key
• Entities need to be uniquely identifiable
• So you can tell them apart when you retrieve them
• Use a primary key
• An attribute (or a set of attributes) that uniquely identifies an entity
Customer ID: Uniquely identifies a customer
Order number: Uniquely identifies an order
How about these as primary keys for Customer:
• First name and/or last name?
• Social security number?
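A minimal sketch of why Customer ID works as a primary key and a name does not, using Python's built-in sqlite3 module. The in-memory database and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical in-memory database; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,  -- uniquely identifies a customer
        FirstName  TEXT,
        LastName   TEXT
    )
""")
conn.execute("INSERT INTO Customer VALUES (1001, 'Greg', 'House')")
# Two different customers may share a name, so a name cannot be the key.
conn.execute("INSERT INTO Customer VALUES (1002, 'Greg', 'House')")

# Reusing an existing CustomerID violates the primary key and is rejected.
try:
    conn.execute("INSERT INTO Customer VALUES (1001, 'Lisa', 'Cuddy')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

customer_count = conn.execute("SELECT COUNT(*) FROM Customer").fetchone()[0]
print(duplicate_rejected, customer_count)
```

The database accepts two customers with the same name but refuses a second row with the same Customer ID, which is what lets you tell the rows apart later.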
One-to-many relationship (ERD)
Customer (at least one, at most one) place Order (at least one, at most many)
This is a one-to-many relationship:
One customer can have many orders
One order can only belong to one customer
Many-to-one relationship (ERD)
Order (at least one, at most many) associate Customer (at least one, at most one)
This is a many-to-one relationship:
Many orders can be associated with one customer
Many-to-many relationship (ERD)
Employee (at least one, at most many) has Office (at least one, at most many)
First read this way: from Employee to Office
Then read this way: from Office to Employee!
Crow’s Feet Notation
Customer makes Order
So called because the symbol at the “many” end of the relationship line looks something like a crow’s foot.
There are other ways of denoting cardinality, but this one is pretty standard.
There are also variations of the crow’s feet notation!
Cardinality is defined by business rules
• What would the cardinality be in these situations?
Order ? contains ? Product
Course ? has ? Section
Employee ? has ? Office
But we have a problem with our ERD
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip) makes Order (Order number, Order Date, Product name, Price)]
This assumes every order contains only one product.
So if I want two products, I have to make two orders!
The problem: Product is defined as an attribute, not an entity.
(Because we didn’t define our requirements clearly enough?)
Here’s a solution
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip) place Order (Order number, Order Date) contains (Quantity) Product (Product name, Price)]
• Now
• A customer can place multiple orders
• An order can contain multiple products
• A product can be part of multiple orders
So far for the 2nd class of ERD…
Implementing the ERD
• As a database schema
• A map of the tables and fields in the database
• This is what is implemented in the database
management system
• Part of the “design” process
• A schema actually looks a lot like the ERD
• Entities become tables
• Attributes become fields
• Relationships can become additional tables (many:many)
Structure of a database
Data element | Description
Character | Single letter or number (“A”, “Z”, “1”)
Field | Set of related characters (first name)
Record | Set of related fields (all information about a customer)
Table | Set of related records (all customers in the company)
Database | Set of related tables (all information about the company)
Database Structure
• character, field, record, table, database
The Rules
1. Create a table for every entity
2. Create table fields for every entity’s attributes
3. Implement relationships between the tables
• 1:many relationships: Primary key field of “1” table put into “many” table as foreign key field
• many:many relationships: Create new table, with 1:many relationships with original tables
• 1:1 relationships: Primary key field of one table put into other table as foreign key field
Our Order Database schema
Original 1:n relationship
Original n:n relationship
• Order-Product is a decomposed many-to-many
relationship
• Order-Product has a 1:n relationship with Order and Product
• Now an order can have multiple products, and a product can be
associated with multiple orders
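The rules above can be sketched as a schema with Python's built-in sqlite3 module. This is one possible rendering, not the course's official schema: the data types, the table name OrderProduct (for Order-Product), and the NOT NULL choices are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        FirstName TEXT, LastName TEXT, City TEXT, State TEXT, Zip TEXT
    );
    -- 1:many: the primary key of the "1" table (Customer) becomes a
    -- foreign key in the "many" table (Order). "Order" is quoted because
    -- ORDER is a reserved word in SQL.
    CREATE TABLE "Order" (
        OrderNumber INTEGER PRIMARY KEY,
        OrderDate   TEXT,
        CustomerID  INTEGER NOT NULL REFERENCES Customer(CustomerID)
    );
    CREATE TABLE Product (
        ProductID INTEGER PRIMARY KEY,
        ProductName TEXT,
        Price REAL
    );
    -- many:many: a new table with a 1:many relationship to each original table.
    CREATE TABLE OrderProduct (
        OrderProductID INTEGER PRIMARY KEY,
        OrderNumber INTEGER NOT NULL REFERENCES "Order"(OrderNumber),
        ProductID   INTEGER NOT NULL REFERENCES Product(ProductID),
        Quantity    INTEGER
    );
""")

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Each entity became a table, each attribute became a field, and the many-to-many relationship became its own table holding two foreign keys plus the Quantity attribute.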
What the Customer and Order tables look like
Customer Table
CustomerID | FirstName | LastName | City | State | Zip
1001 | Greg | House | Princeton | NJ | 09120
1002 | Lisa | Cuddy | Plainsboro | NJ | 09123
1003 | James | Wilson | Pittsgrove | NJ | 09121
1004 | Eric | Foreman | Warminster | PA | 19111
Order Table
Order Number | OrderDate | Customer ID
101 | 3-2-2011 | 1001
102 | 3-3-2011 | 1002
103 | 3-4-2011 | 1001
104 | 3-6-2011 | 1004
Note that there are no
repeating records
Every customer is unique
Every order is unique
This is an example of
normalization.
Normalization
• Organizing data to minimize redundancy (repeated data)
• This is good for two reasons
• The database takes up less space
• You have a lower chance of inconsistencies in your data
• If you want to make a change to a record, you only have
to make it in one place (but you do not change the
primary key, which is the unique identifier)
• The relationships take care of the rest
• But you will usually need to link the separate tables
together in order to retrieve information
To figure out who ordered what
• Match the Customer IDs of the two tables, starting with the
table with the foreign key (Order):
Order Table (first three columns) matched with Customer Table (remaining columns)
Order Number | OrderDate | Customer ID | Customer ID | FirstName | LastName | City | State | Zip
101 | 3-2-2011 | 1001 | 1001 | Greg | House | Princeton | NJ | 09120
102 | 3-3-2011 | 1002 | 1002 | Lisa | Cuddy | Plainsboro | NJ | 09123
103 | 3-4-2011 | 1001 | 1001 | Greg | House | Princeton | NJ | 09120
104 | 3-6-2011 | 1004 | 1004 | Eric | Foreman | Warminster | PA | 19111
• We now know which order belonged to which customer
• This is called a join
• But it’s an inefficient way to store data (redundancies)
• So we normalize
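The matching step above can be sketched as a join in Python's built-in sqlite3 module. The rows come from the Customer and Order tables on the earlier slide; the in-memory database and the shortened column list are assumptions made to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE "Order" (OrderNumber INTEGER PRIMARY KEY, OrderDate TEXT, CustomerID INTEGER);
    INSERT INTO Customer VALUES (1001, 'Greg', 'House'), (1002, 'Lisa', 'Cuddy'),
                                (1003, 'James', 'Wilson'), (1004, 'Eric', 'Foreman');
    INSERT INTO "Order" VALUES (101, '3-2-2011', 1001), (102, '3-3-2011', 1002),
                               (103, '3-4-2011', 1001), (104, '3-6-2011', 1004);
""")

# Start from the table holding the foreign key (Order) and match each
# CustomerID against the primary key in Customer.
rows = conn.execute("""
    SELECT o.OrderNumber, c.FirstName, c.LastName
    FROM "Order" AS o
    JOIN Customer AS c ON o.CustomerID = c.CustomerID
    ORDER BY o.OrderNumber
""").fetchall()

for row in rows:
    print(row)
```

Greg House shows up twice in the result because he placed two orders, while James Wilson does not appear at all because he has none.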
Now the many:many relationship
Order Table
Order Number | OrderDate | Customer ID
101 | 3-2-2011 | 1001
102 | 3-3-2011 | 1002
103 | 3-4-2011 | 1001
104 | 3-6-2011 | 1004
Order-Product Table
OrderProductID | Order number | Product ID | Quantity
1 | 101 | 2251 | 2
2 | 101 | 2282 | 3
3 | 101 | 2505 | 1
4 | 102 | 2251 | 5
5 | 102 | 2282 | 2
6 | 103 | 2505 | 3
7 | 104 | 2505 | 8
This table relates Order and Product to each other!
Product Table
ProductID | ProductName | Price
2251 | Cheerios | 3.99
2282 | Bananas | 1.29
2505 | Eggo Waffles | 2.99
To figure out what each order contains
• Match the Product IDs and Order Numbers of the tables, starting
with the table with the foreign keys (Order-Product):
Order-Product Table (first four columns), Order Table (next three columns), Product Table (last three columns)
OrderProductID | Order Number | Product ID | Quantity | Order Number | Order Date | Customer ID | Product ID | Product Name | Price
1 | 101 | 2251 | 2 | 101 | 3-2-2011 | 1001 | 2251 | Cheerios | 3.99
2 | 101 | 2282 | 3 | 101 | 3-2-2011 | 1001 | 2282 | Bananas | 1.29
3 | 101 | 2505 | 1 | 101 | 3-2-2011 | 1001 | 2505 | Eggo Waffles | 2.99
4 | 102 | 2251 | 5 | 102 | 3-3-2011 | 1002 | 2251 | Cheerios | 3.99
5 | 102 | 2282 | 2 | 102 | 3-3-2011 | 1002 | 2282 | Bananas | 1.29
6 | 103 | 2505 | 3 | 103 | 3-4-2011 | 1001 | 2505 | Eggo Waffles | 2.99
7 | 104 | 2505 | 8 | 104 | 3-6-2011 | 1004 | 2505 | Eggo Waffles | 2.99
Now there is redundant product data as a result of the join!
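A sketch of the same three-table match in Python's built-in sqlite3 module, using the rows from the slides. The table name OrderProduct and the in-memory database are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Order" (OrderNumber INTEGER PRIMARY KEY, OrderDate TEXT, CustomerID INTEGER);
    CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, ProductName TEXT, Price REAL);
    CREATE TABLE OrderProduct (OrderProductID INTEGER PRIMARY KEY,
                               OrderNumber INTEGER, ProductID INTEGER, Quantity INTEGER);
    INSERT INTO "Order" VALUES (101, '3-2-2011', 1001), (102, '3-3-2011', 1002),
                               (103, '3-4-2011', 1001), (104, '3-6-2011', 1004);
    INSERT INTO Product VALUES (2251, 'Cheerios', 3.99), (2282, 'Bananas', 1.29),
                               (2505, 'Eggo Waffles', 2.99);
    INSERT INTO OrderProduct VALUES (1, 101, 2251, 2), (2, 101, 2282, 3), (3, 101, 2505, 1),
                                    (4, 102, 2251, 5), (5, 102, 2282, 2), (6, 103, 2505, 3),
                                    (7, 104, 2505, 8);
""")

# Start from the table with the two foreign keys and match each one
# against the primary key of the table it points to.
rows = conn.execute("""
    SELECT op.OrderNumber, o.CustomerID, p.ProductName, op.Quantity
    FROM OrderProduct AS op
    JOIN "Order" AS o ON op.OrderNumber = o.OrderNumber
    JOIN Product AS p ON op.ProductID = p.ProductID
    ORDER BY op.OrderProductID
""").fetchall()

# Count how many joined rows repeat the same product name.
eggo_rows = [row for row in rows if row[2] == "Eggo Waffles"]
print(len(rows), len(eggo_rows))
```

The Product table stores “Eggo Waffles” once, but the joined result repeats it on every order line that includes it.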
Why redundant data is a big deal
Customer ID | Product ID | Product Name | Price
1001 | 2251 | Cheerios | 3.99
1001 | 2282 | Bananas | 1.29
1001 | 2505 | Eggo Waffles | 2.99
1002 | 2251 | Cheerios | 3.99
1002 | 2282 | Bananas | 1.29
1001 | 2505 | Eggo Waffles | 2.99
1004 | 2505 | Eggo Waffles | 2.99

Customer ID | Customer ID | First Name | Last Name | City | State | Zip
1001 | 1001 | Greg | House | Princeton | NJ | 09120
1002 | 1002 | Lisa | Cuddy | Plainsboro | NJ | 09123
1001 | 1001 | Greg | House | Princeton | NJ | 09120
1004 | 1004 | Eric | Foreman | Warminster | PA | 19111
The redundant data
seems harmless, but:
What if the price of
“Eggo Waffles”
changes?
And what if Greg
House changes his
address?
And if there are
1,000,000 records?
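A small sketch of the price-change question, comparing a redundant table with a normalized one in Python's built-in sqlite3 module. The table name FlatOrderLine and the new price of 3.49 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Redundant version: product name and price repeated on every row.
    CREATE TABLE FlatOrderLine (CustomerID INTEGER, ProductID INTEGER,
                                ProductName TEXT, Price REAL);
    INSERT INTO FlatOrderLine VALUES
        (1001, 2251, 'Cheerios', 3.99), (1001, 2282, 'Bananas', 1.29),
        (1001, 2505, 'Eggo Waffles', 2.99), (1002, 2251, 'Cheerios', 3.99),
        (1002, 2282, 'Bananas', 1.29), (1001, 2505, 'Eggo Waffles', 2.99),
        (1004, 2505, 'Eggo Waffles', 2.99);
    -- Normalized version: each product is stored exactly once.
    CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, ProductName TEXT, Price REAL);
    INSERT INTO Product VALUES (2251, 'Cheerios', 3.99), (2282, 'Bananas', 1.29),
                               (2505, 'Eggo Waffles', 2.99);
""")

new_price = 3.49  # hypothetical new price for Eggo Waffles

# rowcount reports how many rows each UPDATE had to touch.
flat_changed = conn.execute(
    "UPDATE FlatOrderLine SET Price = ? WHERE ProductID = 2505", (new_price,)
).rowcount
normalized_changed = conn.execute(
    "UPDATE Product SET Price = ? WHERE ProductID = 2505", (new_price,)
).rowcount
print(flat_changed, normalized_changed)
```

With redundant data every copy has to be found and changed, and any copy that is missed leaves the data inconsistent. With the normalized Product table there is only one row to change.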
Best practices for normalization
• Create new entities when
there are collections of
related attributes, especially
when they would repeat
• For example, consider a
modified Product entity
Don’t do this…
[ERD: Product (Product name, Price, Vendor Name, Vendor Address, Vendor Phone)]
…do this.
[ERD: Vendor (Vendor ID, Vendor Name, Vendor Address, Vendor Phone) sells (?) Product (Product name, Price)]
Then you won’t have to repeat vendor information for each product.
Why did we introduce VendorID?
Best practices for normalization
• Create new entities to
enforce data entry
standards
This is fine…
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip)]
…but this can be even better.
[ERD: Customer (Customer ID, First name, Last name, Zip), City (City ID, City Name), State (State ID, State Name)]
The city name is entered only once in the
City table; CityID is used in Customer
table
City and State as “lookup tables”
• Why this can be a better way of doing it
[ERD: Customer (Customer ID, First name, Last name, Zip), City (City ID, City Name), State (State ID, State Name)]
Customer
CustomerID | FirstName | LastName | CityID | StateID | Zip
1001 | Greg | House | 1 | 1 | 09120
1002 | Lisa | Cuddy | 2 | 1 | 09123
1003 | James | Wilson | 3 | 1 | 09121
1004 | Eric | Foreman | 4 | 2 | 19111
City
CityID | CityName
1 | Princeton
2 | Plainsboro
3 | Pittsgrove
4 | Warminster
State
StateID | StateName | Abbr
1 | New Jersey | NJ
2 | Pennsylvania | PA
This helps prevent inconsistent spellings (Pennsylvania is always entered as “2”)
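A sketch of how lookup tables enforce data entry standards, using Python's built-in sqlite3 module. The foreign key declarations and the rejected test row (customer 1005 with StateID 99) are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE State (StateID INTEGER PRIMARY KEY, StateName TEXT, Abbr TEXT);
    CREATE TABLE City (CityID INTEGER PRIMARY KEY, CityName TEXT);
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        FirstName TEXT, LastName TEXT,
        CityID  INTEGER REFERENCES City(CityID),
        StateID INTEGER REFERENCES State(StateID),
        Zip TEXT
    );
    INSERT INTO State VALUES (1, 'New Jersey', 'NJ'), (2, 'Pennsylvania', 'PA');
    INSERT INTO City VALUES (1, 'Princeton'), (2, 'Plainsboro'),
                            (3, 'Pittsgrove'), (4, 'Warminster');
    INSERT INTO Customer VALUES (1004, 'Eric', 'Foreman', 4, 2, '19111');
""")

# A StateID that is missing from the lookup table is rejected, so nobody
# can enter a misspelled or nonexistent state. Customer 1005 is made up.
try:
    conn.execute("INSERT INTO Customer VALUES (1005, 'Test', 'Person', 1, 99, '00000')")
    bad_state_rejected = False
except sqlite3.IntegrityError:
    bad_state_rejected = True

# The readable name is recovered by joining back to the lookup table.
state_name = conn.execute("""
    SELECT s.StateName
    FROM Customer AS c
    JOIN State AS s ON c.StateID = s.StateID
    WHERE c.CustomerID = 1004
""").fetchone()[0]
print(bad_state_rejected, state_name)
```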
The three-way relationship
• Sometimes three entities are necessary to capture what happens in a transaction
• This would be modeled as a many-to-many-to-many relationship
[ERD: Mechanic (Employee ID, Name, Salary), Repair (Repair code, Description, Charge), and Car (VIN, Make, Model), joined by the relationship Performs (Repair date)]
The many:many:many table
• The many-to-many-to-many relationship would still be
represented as a separate table
• Just with three foreign keys, instead of two
RepairID | RepairCode | EmployeeID | VIN | Repair date
1 | 101 | 9112 | 10192919201 | 2011-3-1
2 | 201 | 2313 | 19292919291 | 2011-3-2
3 | 302 | 1231 | 102010023 | 2011-3-3
4 | 223 | 2132 | 393848383 | 2011-3-4
[ERD: Mechanic (Employee ID, Name, Salary), Repair (Repair code, Description, Charge), and Car (VIN, Make, Model), joined by the relationship Performs (Repair date)]
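A sketch of the three-way relationship as a table with three foreign keys, using Python's built-in sqlite3 module. Naming the relationship table Performs, the data types, and placing Description and Charge on Repair are assumptions based on the diagram labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Mechanic (EmployeeID INTEGER PRIMARY KEY, Name TEXT, Salary REAL);
    CREATE TABLE Repair (RepairCode INTEGER PRIMARY KEY, Description TEXT, Charge REAL);
    CREATE TABLE Car (VIN TEXT PRIMARY KEY, Make TEXT, Model TEXT);
    -- The relationship table: one row per repair performed, with a
    -- foreign key to each of the three entities it ties together.
    CREATE TABLE Performs (
        RepairID   INTEGER PRIMARY KEY,
        RepairCode INTEGER REFERENCES Repair(RepairCode),
        EmployeeID INTEGER REFERENCES Mechanic(EmployeeID),
        VIN        TEXT REFERENCES Car(VIN),
        RepairDate TEXT
    );
""")

# PRAGMA foreign_key_list returns one row per foreign key; the third
# column of each row is the name of the referenced table.
fk_targets = sorted(row[2] for row in conn.execute("PRAGMA foreign_key_list(Performs)"))
print(fk_targets)
```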