Data Management

Data Management
& Data Warehouses
MIS 320
Kraig Pencil
Summer 2014
PPT Slides by Dr. Craig Tyran & Kraig Pencil
Game Plan
• Introduction
• Why use a relational database?
• Database management systems
• Data warehouses
• Data mining
• Data marts
A. Why use a relational database?
1. A database sounds great, but why don’t we just store all our data in one
big table in an Excel spreadsheet?
– Example: Can you foresee any hassles or potential difficulties associated with entering/storing order information in the following Excel table?
A. Why use a relational database?
1. Why don’t we just store our data in one spreadsheet table? (cont.)
– Potential problems
  • May have “redundant” data entry
  • Potential for data entry errors (different/wrong phone numbers)
  • Updates can be a hassle/inefficient (e.g., change phone no.)
– Solution
  • “Normalize” the data …
    → Break up the table into a set of linked tables in a database (instead of having one spreadsheet)
    – See example
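A minimal sketch of the normalization idea, using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and are not taken from the slides' Excel example. Customer details are stored once, and each order points back to its customer through a foreign key.

```python
import sqlite3

# Hypothetical schema for illustration: one row per customer, one row per order.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    phone       TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    item        TEXT NOT NULL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Pat Lee', '555-0100')")
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    [(1, "Widget"), (1, "Gadget"), (1, "Gizmo")],
)

# The phone number is stored once, so a change is a single-row update.
conn.execute("UPDATE customers SET phone = '555-0199' WHERE customer_id = 1")

# Joining the linked tables rebuilds the "one big table" view on demand.
rows = conn.execute("""
    SELECT o.order_id, c.name, c.phone, o.item
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
for row in rows:
    print(row)
```

In the single-spreadsheet version, the same phone change would have to be repeated on every order row for that customer, which is where the redundancy and data entry errors come from.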
Example: Normalized Tables
(and the advantages of a database)
Questions:
a) Any unneeded redundancy?
b) Is it now efficient to update customer info?
c) Where is the foreign key?
Example: Non-Normalized Data Table
for an Auto Shop (Rainer & Turban, Fig 4.6)
Examples of
redundancy
B. Database Management Systems
1. What is a “database management system” (DBMS)?
→ Software that allows one to create, store, organize, manage, and use data
• Example of a DBMS?
2. Key components
– Data Definition subsystem
– Data Manipulation subsystem
– Application Generation subsystem
– Data Administration subsystem
– DBMS engine
DBMS Components
Lab Tutorials 1,2
Lab Tutorials 3,5
Lab Tutorials 4,6
B. Database Management Systems
3. Examples of DBMS components in Access
Data Definition subsystem
– Data dictionary (“Design view” for a table)
Data Manipulation subsystem: Move, change, and “ask questions”
– View of a table (“Datasheet view” for a table)
– Query-by-example (QBE) tool
– Structured query language (SQL)
Application Generation subsystem: the “front end”
– Design of forms and reports
Data Administration subsystem
– Optimize query performance
– Security settings with password
B. Database Management Systems
4. What aspects of data need to be specified?
– Lots of aspects!!!
  • Recall table creation in MS Access (Tutorials 1 & 2)
– Common data properties
  • Data “type” (number, text, date, etc.)
  • Description
  • Field size
  • Required/not required
  • Etc.
– An important reference for a database system:
  → Data dictionary
  – Stores information about the data in a database
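As a rough illustration of a data dictionary outside Access, the sketch below uses Python's built-in sqlite3 module to read the column metadata that SQLite stores for a table. The `employees` table and its columns are hypothetical; SQLite's `PRAGMA table_info` is used here as a stand-in for the Access "Design view", not as an equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table definition (the data definition step).
conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        last_name   TEXT NOT NULL,
        gender      TEXT,
        hire_date   TEXT
    )
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
data_dictionary = {
    name: {"type": col_type, "required": bool(notnull), "primary_key": bool(pk)}
    for _cid, name, col_type, notnull, _default, pk
    in conn.execute("PRAGMA table_info(employees)")
}

for field, properties in data_dictionary.items():
    print(field, properties)
```

Each entry records properties like those listed above: the data type and whether the field is required.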
Access Example:
Data “type” (number, text, date, etc.)
Description
Field size
Required/not required
Information about the “Gender” field is specified in the “Field Properties” section
Access Example: Data Manipulation Subsystem
(Low Stock Products query)
QBE or SQL may be used to prepare a query. Which approach would be easier for most people?
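For comparison with the QBE grid, here is a sketch of what a "low stock products" query might look like in SQL, run through Python's built-in sqlite3 module. The `products` table, its columns, and the rule "stock below reorder level" are assumptions made for illustration; the actual Access query in the slide is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical products table with made-up rows.
conn.execute(
    "CREATE TABLE products (name TEXT, units_in_stock INTEGER, reorder_level INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("Chai", 39, 10), ("Syrup", 5, 25), ("Tofu", 0, 15)],
)

# Assumed definition of "low stock": fewer units on hand than the reorder level.
low_stock = conn.execute("""
    SELECT name, units_in_stock
    FROM products
    WHERE units_in_stock < reorder_level
    ORDER BY name
""").fetchall()

print(low_stock)
```

The SQL text states the same things a QBE user picks visually: which fields to show, which criteria to apply, and how to sort.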
Access Example: Application Generation Subsystem
(Employer Information Form)
Access Example: Data Administration
(Performance Analysis for a Database)
B. Database Management Systems (cont.)
5. DBMS: Example products
– You are very likely to work with (and possibly help develop) a database using one or more of the following:
  • Small-Midsize DBMS: Microsoft Access, dBase, Paradox
  • Mid-to-Large DBMS: Microsoft SQL Server, Oracle, DB2, Informix, IMS
C. Data Warehouses
1. Business problem:
• Difficult for larger organizations to analyze organizational data from multiple sources
• Solution: Data warehouse
2. Gather/integrate information from existing operational databases into a “warehouse”
• → Create “Business Intelligence” system
• See next figure
Create a Data Warehouse from Operational Databases
From Haag, et al., MIS for the Information Age, 2004
C. Data Warehouses (cont.)
3. Data warehouse features
• Designed to support business decision making
  • Not transactions!
  • Supports OLAP
    – On-line Analytical Processing
• Crosses functional boundaries of an organization
• Can be very large
• Note: Warehouse is “read only”
  • Why?
• Can be a significant strategic resource for a company
  → Can yield a high ROI
4. Examples
• ???
C. Data Warehouses (cont.)
5. Implementation issues
• People may be reluctant to share information
• “ETL” process is not easy
  • Extraction, transformation, load
  • Expensive
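To make the three ETL steps concrete, here is a toy sketch in Python. The two source systems, their field names, and the `sales_fact` table are all invented for illustration; a real ETL process deals with far messier data, which is part of why it is hard and expensive.

```python
import sqlite3

# Extract: rows as they might arrive from two operational systems (made-up data).
web_orders = [{"cust": "ann lee ", "amount": "19.99", "date": "2014-07-01"}]
store_orders = [{"customer": "BOB RAY", "total_cents": 2500, "day": "07/02/2014"}]


def transform_web(row):
    """Clean the name and convert the amount text to a number."""
    return (row["cust"].strip().title(), float(row["amount"]), row["date"], "web")


def transform_store(row):
    """Convert cents to dollars and MM/DD/YYYY dates to YYYY-MM-DD."""
    month, day, year = row["day"].split("/")
    return (
        row["customer"].strip().title(),
        row["total_cents"] / 100,
        f"{year}-{month}-{day}",
        "store",
    )


# Transform: bring both sources into one common format.
rows = [transform_web(r) for r in web_orders] + [transform_store(r) for r in store_orders]

# Load: write the unified rows into a (hypothetical) warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (customer TEXT, amount REAL, sale_date TEXT, source TEXT)"
)
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)

loaded = warehouse.execute("SELECT * FROM sales_fact ORDER BY sale_date").fetchall()
for row in loaded:
    print(row)
```

Even in this tiny case the sources disagree on name casing, currency units, and date format, so each one needs its own transformation rule.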
D. Data Mining
1. Provides a means to extract patterns and relationships from large amounts of data (e.g., a data warehouse)
2. Mining analogy
– Sift through raw dirt/rock to find something of value
– Large volumes of data are sifted in an attempt to find something worthwhile
3. Example: market basket analysis
– Identify products that may be attractive to a customer
  • See next slide: Amazon.com buyer suggestions
Data Mining: Example of pattern discovered via mining
D. Data mining (cont.)
4. Identify previously unknown patterns
– e.g., What are characteristics of customers likely to default on a bank loan?
“Target knows before it shows”
– How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did
– How Companies Learn Your Secrets: NYTimes
– e.g., Suppose you discovered that beer and diapers* were often found in the same purchase?
  • “Market basket analysis”
  • What could you do with that information to improve sales of one, the other, or both?
*This is a common example, not an actual case.
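A small sketch of the counting behind market basket analysis, in Python. The baskets are made-up data, and `support` and `confidence` are two common measures chosen here for illustration; the slides do not name a specific measure or algorithm.

```python
from collections import Counter
from itertools import combinations

# Made-up purchase baskets for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]

# How often each item, and each pair of items, appears across baskets.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(a, b):
    """Share of all baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)


def confidence(a, b):
    """Share of baskets containing a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]


print("support(beer, diapers)    =", support("beer", "diapers"))
print("confidence(beer, diapers) =", confidence("beer", "diapers"))
```

A pair with high support and high confidence is a candidate for actions such as shelf placement or bundled promotions, which is the kind of answer the question above is asking for.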
E. Data Marts
5. Data marts
• Warehouses can be overwhelming/difficult to implement …
  → Some organizations create “data marts”
• A subset of a data warehouse
  • Simpler, scaled-down version
  • Focuses on/integrates a specific area (e.g., Sales department)
  • Provides useful decision-making tools
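One way to picture a data mart as a subset of a warehouse is the sketch below, which uses Python's built-in sqlite3 module. The `warehouse_facts` table, its columns, and the sample rows are hypothetical; real data marts are usually built with dedicated tools rather than a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table that spans several departments.
conn.execute("CREATE TABLE warehouse_facts (department TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("Sales", "West", 120.0),
        ("Sales", "East", 80.0),
        ("HR", "West", 15.0),
        ("Sales", "West", 30.0),
    ],
)

# The "mart" keeps only the Sales rows, already summarized by region.
conn.execute("""
    CREATE TABLE sales_mart AS
    SELECT region, SUM(amount) AS total_sales
    FROM warehouse_facts
    WHERE department = 'Sales'
    GROUP BY region
""")

mart = conn.execute(
    "SELECT region, total_sales FROM sales_mart ORDER BY region"
).fetchall()
print(mart)
```

The mart is smaller and simpler than the warehouse because it covers one area and drops everything that area does not need.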
Haggen photo from: www.callhugh.com/ ferndale.php
MiniMart photo from:
http://www.ae.gatech.edu/research/controls/pictures/f020801_gtar/Mini%20Mart.JPG
Data Marts:
Subsets of Data Warehouse
Data Mining – Business Intelligence
• A few videos to watch and think about …
• http://www.youtube.com/user/SASsoftware?v=C14GVhNt7Do&feature=pyv&ad=4782573666&kw=CRM
• http://www.youtube.com/user/ibm?#p/c/13/fFdITHMuy2w
• http://www.youtube.com/user/SASsoftware?v=2677nWVNg9M&feature=pyv&ad=4782551166&kw=business%20analytics
• http://www.youtube.com/watch?v=El_lSd6G5WU
• http://www.youtube.com/watch?v=uP89kaDU40c
• http://www.youtube.com/user/SASsoftware?v=C14GVhNt7Do&feature=pyv&ad=4782573666&kw=CRM#p/u/35/ecqk0JUKvAI
Big Data
• Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. (Wikipedia)
Global Big Data:
+ 2.5 exabytes/day
• The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s.
• As of 2012, 2.5 quintillion (2.5×10^18) bytes of data were created every day.
• (Wikipedia)
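The two figures above can be unpacked with a little arithmetic. The sketch below assumes smooth exponential growth between doublings, which is a simplification, and uses the decimal definition of an exabyte (10^18 bytes).

```python
DOUBLING_MONTHS = 40  # from the slide: capacity doubles roughly every 40 months

# Equivalent yearly growth factor, assuming smooth exponential growth.
annual_growth = 2 ** (12 / DOUBLING_MONTHS)


def capacity_multiplier(years):
    """How many times larger capacity becomes after the given number of years."""
    return 2 ** (years * 12 / DOUBLING_MONTHS)


# 2.5 quintillion bytes per day, expressed in exabytes (1 EB = 10**18 bytes).
bytes_per_day = 2.5e18
exabytes_per_day = bytes_per_day / 1e18

print(f"Yearly growth factor: {annual_growth:.3f}")
print(f"Growth over 10 years: {capacity_multiplier(10):.1f}x")
print(f"Data created per day: {exabytes_per_day} EB")
```

Doubling every 40 months works out to roughly 23% growth per year, or three doublings (8x) per decade.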
Big Data
• The next frontier in data?
• http://www.eweek.com/c/a/Data-Storage/Big-Data-Analytics-Is-Just-Starting-to-Reach-Its-Potential-10-Reasons-Why457684/?kc=EWKNLEAU07102012STR1
• Some terms:
  – Hadoop (distributed file organization)
  – Distributed databases and server clusters
  – Cassandra (“Not only SQL” DBMS)
  – MapReduce (breaking computation into smaller pieces, then combining the results of each computation)
(Chart: Total World Data Storage Capacity, in CDs @ 730MB/CD; vertical axis from 500,000,000,000 to 3,000,000,000,000; horizontal axis 1993, 2000, 2007, 2014)
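The MapReduce idea (break the computation into smaller pieces, then combine the results) can be sketched in plain Python with a word count. This runs on one machine and uses no Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative labels, and the documents are made-up data.

```python
from collections import defaultdict
from functools import reduce

# Made-up input, standing in for pieces of a much larger data set.
documents = ["big data big tools", "data tools", "big clusters"]


def map_phase(doc):
    """Map: turn one piece of input into (key, value) pairs."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into one result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Each document could be mapped on a different server; here they run in sequence.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each document is mapped independently, the map step is the part that a cluster can spread across many servers.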