DEVQ400-01 Developing OLAP Business Solutions with Analysis

advertisement
Data Warehouse
and
Business Intelligence
Dr. Minder Chen
Professor of MIS
Martin V. Smith School of Business and Economics
CSU Channel Islands
Minder.Chen@CSUCI.EDU
BI
Business Intelligence (BI) is the process of gathering meaningful
information to answer questions and identify significant trends or
patterns, giving key stakeholders the ability to make better
business decisions.
“The key in business is to
know something that
nobody else knows.”
-- Aristotle Onassis
PHOTO: HULTON-DEUTSCH COLL
“To understand is to perceive patterns.”
— Sir Isaiah Berlin
"The manager asks how and when,
the leader asks what and why."
— “On Becoming a Leader” by Warren Bennis
© Minder Chen, 2004-2014
DW & BI - 2
BI Questions
• What happened?
–
What were our total sales this month?
• What’s happening?
–
Are our sales going up or down, trend analysis
• Why?
–
Why have sales gone down?
• What will happen?
–
Forecasting & “What If” Analysis
• What do I want to happen?
–
Planning & Targets
Source: Bill Baker, Microsoft
© Minder Chen, 2004-2014
DW & BI - 3
Business Valuation Models for BI
© Minder Chen, 2004-2014
DW & BI - 4
Performance Dashboards for Information Delivery
© Minder Chen, 2004-2014
DW & BI - 5
Scorecards for Information Delivery
Balanced Scorecard
© Minder Chen, 2004-2014
DW & BI - 6
© Minder Chen, 2004-2014
DW & BI - 7
Inmon's Definition of Data Warehouse – Data View
• A warehouse is a
– subject-oriented,
– integrated,
– time-variant and
– non-volatile
collection of data in support of
management's decision making process.
Source: http://www.intranetjournal.com/features/datawarehousing.html
– Bill Inmon in 1990
© Minder Chen, 2004-2014
DW & BI - 8
Inmon's Definition Explain
• Subject-oriented: They are organized around major
subjects such as customer, supplier, product, and
sales. Data warehouses focus on modeling and
analysis to support planning and management
decisions vs. operations and transaction processing.
• Integrated: Data warehouses involve an integration of
sources such as relational databases, flat files, and online transaction records. Processes such as data
cleansing and data scrubbing achieve data
consistency in naming conventions, encoding
structures, and attribute measures.
• Time-variant: Data contained in the warehouse provide
information from an historical perspective.
• Nonvolatile: Data contained in the warehouse are
physically separate from data present in the
operational environment.
© Minder Chen, 2004-2014
DW & BI - 9
Business Intelligence
Increasing potential
to support
business decisions (MIS)
Making
Decisions
Data Presentation
Visualization Techniques
End User
Business
Analyst
Data Mining
Information Discovery
Data
Analyst
Data Exploration
OLAP, MDA,
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources
(Paper, Files, Information Providers, Database Systems, OLTP)
© Minder Chen, 2004-2014
DBA
DW & BI - 10
Where is Business Intelligence applied?
Operational Efficiency
•
•
•
•
•
•
•
•
ERP Reporting
KPI Tracking
Product Profitability
Risk Management
Balanced Scorecard
Activity Based Costing
Global Sourcing
Logistics
© Minder Chen, 2004-2014
Customer Interaction
•
•
•
•
•
•
•
Sales Analysis
Sales Forecasting
Segmentation
Cross-selling
CRM Analytics
Campaign Planning
Customer Profitability
DW & BI - 11
OLTP Versus Business Intelligence: Who asks what?
OLTP Questions
• When did that order ship?
• How many units are in
inventory?
• Does this customer have
unpaid bills?
• Are any of customer X’s line
items on backorder?
© Minder Chen, 2004-2014
Analysis Questions
• What factors affect order
processing time?
• How did each product line (or
product) contribute to profit last
quarter?
• Which products have the lowest
Gross Margin?
• What is the value of items on
backorder, and is it trending up
or down
over time?
DW & BI - 12
The Data Warehouse/BI Architecture & Process
ETL: Extract, Transform, and Load
OLAP
Cubes
Data Marts
Source
Systems
ETL
ETL
ETL
Data
Warehouse
1
Design the
Data Warehouse
© Minder Chen, 2004-2014
2
Populate
Data Warehouse
3
Create
OLAP Cubes
Clients
Query Tools
Reporting
Analysis
Data Mining
4
Query
Data
DW & BI - 13
Normalized Database for OLTP
© Minder Chen, 2004-2014
DW & BI - 14
OLTP vs. OLAP
OLTP System
OLAP System
Online Transaction Processing
Online Analytical Processing
(Operational System)
(Data Warehouse)
Operational data; OLTPs are the original Consolidation data; OLAP data comes from the various
Source of data
source of the data.
OLTP Databases
Purpose of To control and run fundamental business To help with planning, problem solving, and decision
data
tasks
support
Reveals a snapshot of ongoing business Multi-dimensional views of various kinds of business
What the data
processes
activities
Inserts and
Short and fast inserts and updates
Periodic long-running batch jobs refresh the data
Updates
initiated by end users
Queries
Relatively standardized and simple
queries Returning relatively few records
Often complex queries involving aggregations
Depends on the amount of data involved; batch data
Typically very fast
refreshes and complex queries may take many hours;
query speed can be improved by creating indexes
Space
Can be relatively small if historical data is Larger due to the existence of aggregation structures
Requirements
archived
and history data; requires more indexes than OLTP
Database
Typically de-normalized with fewer tables; use of star
Highly normalized with many tables
Design
and/or snowflake schemas
Processing
Speed
Backup and
Recovery
Backup religiously; operational data is
Instead of regular backups, some environments may
critical to run the business, data loss is
consider simply reloading the OLTP data as a recovery
likely to entail significant monetary loss
method
and legal liability
Source: http://datawarehouse4u.info/OLTP-vs-OLAP.html
© Minder Chen, 2004-2014
DW & BI - 15
Measuring Performance
• Real estate consumer services and analysis firm Trulia reports
that Oct. 2013 saw only an 0.6% rise in home asking prices
comparing to Sept. 2013.
• However, the average home asking price rose by 11.7% from
Oct. 2012 to Oct. 2013.
• The year-over-year figure is the largest jump since the housing
bubble popped back in 2007-08.
Source: http://www.thestreet.com/story/12100873/1/home-sellers-price-hikes-coming-unsustainably-fast.html
© Minder Chen, 2004-2014
DW & BI - 16
compare with last period vs. year-on-year comparison
• A time series is a sequence of data points, measured
typically at successive points in time spaced at uniform
time intervals. Examples of time series are the daily
closing value of the Dow Jones Industrial Average Wikipedia “Time series”
• The year-over-year data compares a time period
(e.g., a month or a quarter) against the same time
period last year.
• You can compare a performance indicator with one
from last period (quarter, month, week, day)
• One of the advantages of year-over-year comparisons
is that it automatically negates the effect of seasonality
(e.g., seasonal effect). It is a more effective way of
looking at performance.
© Minder Chen, 2004-2014
DW & BI - 17
Identifying Measures and Dimensions
Information for Decision Making
Performance
Measures for KPI
Measures
What?
Performance
Drivers
Attribute
Type?
Dimensions
Why?
The attribute (column) varies
continuously:
The attribute is perceived as
a constant or discrete value:
•
•
•
•
•Name/Description
•Location
•Color
•Size
Unit Sold
Cost
Sales
Balance
© Minder Chen, 2004-2014
DW & BI - 18
Star Schema
Dimension
Table
Products
Multi-dimensional
Data model
Channels
Dates
Dimension
Table
Dimension
Table
Sales
Fact Table
with
measures
Dimension
Promotions
Table
© Minder Chen, 2004-2014
Customers
Dimension
Table
DW & BI - 19
Snowflake Schema
Brands
Normalized
Dimension
Table
Products
Channels
Dates
Dimension
Table
Star Schema
Sales
Dimension
Table
Fact Table
Normalized
Dimension
Table
Promotions
Customers
Dimension
Table
Customer
type
Source: http://www.diffen.com/difference/Snowflake_Schema_vs_Star_Schema
© Minder Chen, 2004-2014
DW & BI - 20
Designing Data Warehouse: Dimensional Design Process
Business
Requirements
• Select the business process to model
• Declare the grain of the business process/data in
the fact table (The grain represents the most atomic level by
which the facts may be defined. The grain of a SALES fact table might
be stated as "Sales volume by Day by Product by Store". )
• Identify the numeric facts/meaures that will
populate each fact table row
• Choose the dimensions that apply to each fact
table row
Data
Realities
Ref: http://en.wikipedia.org/wiki/Fact_table
© Minder Chen, 2004-2014
DW & BI - 21
Select a business process to model
•
•
•
•
Not business departments or business functions
Cross-functional business processes
Business events
Examples:
–
–
–
–
–
–
–
–
–
Raw materials purchasing
Order fulfillment process
Shipments
Invoicing
Inventory
General ledger
Insurance claims
Class enrollment
Airline ticket sales
© Minder Chen, 2004-2014
DW & BI - 22
Facts Table
Measurements of business events.
DateID
ProductID
Dimensions
CustomerID
Units
Dollars
Measures
The Fact Table contains keys and units of
measure
© Minder Chen, 2004-2014
DW & BI - 23
Fact Tables
Fact tables have the following characteristics:
• It contains numeric measures (metric) of the
business.
• It may contain summarized (aggregated) data.
• It almost always contains date-stamped data.
• Measures are typically additive.
• Have key value that is typically a concatenated
key composed of the primary keys of the
dimensions.
• Joined to dimension tables through foreign keys
that reference primary keys in the dimension
tables.
• Fact tables are narrow (few attributes) but many
records.
© Minder Chen, 2004-2014
DW & BI - 24
A Dimensional Model for a Grocery Store’s Sales
© Minder Chen, 2004-2014
DW & BI - 25
Creating Dimensional Model
• Identify fact tables
• Translate business measures into fact tables
• Analyze information from source systems for
additional measures
• Identify base and derived measures
• Document additivity of measures (e.g., nonadditive[price], semi-additive [quantity-on-hand is
not additive over time], or additive [quantity])
• Identify dimension tables
• Link fact tables to the dimension tables
• Create views for users
© Minder Chen, 2004-2014
DW & BI - 26
Transaction Level Order Item Fact Table
© Minder Chen, 2004-2014
DW & BI - 27
Inside a Dimension Table
• Dimension table key: Uniquely identify each row. Use
surrogate key (integer).
• Table is wide: A table may have many attributes
•
•
•
•
(columns).
Textual attributes. Descriptive attributes in string
format. No numerical values for calculation.
Attributes not directly related: E.g., product color and
product package size. No transitive dependency.
Not normalized (star schema).
Drilling down and rolling up along a dimension.
• One or more hierarchy within a dimension.
• Fewer number of records.
© Minder Chen, 2004-2014
DW & BI - 28
OLAP Solutions
•
•
•
•
•
•
Data Warehouse
Data Mart
Cubes
Dimensions
Measures
Cells
A cube is a collection of
data that’s been aggregated
to allow queries to return
data quickly.
© Minder Chen, 2004-2014
OLAP Server (e.g., Oracle ESSBase &
SQL Server’s Analysis Services)
Europe
Asia
US
Gadgets
130
135
140
142
Gizmos
205
390
350
475
Thingies
175
230
190
250
Widgets
310
340
410
450
Q1
Q2
Q3
Q4DW & BI - 29
Hierarchy
© Minder Chen, 2004-2014
DW & BI - 30
A Hierarchy in the Product Dimension
• SKU: Stock Keeping Unit
• Hierarchy:
–
Department  Category  Subcategory  Brand  Product
© Minder Chen, 2004-2014
DW & BI - 31
Multidimensional Query Techniques
Performance
Drivers
Performance
Measures
What?
Why?
Product
Time
Slicing
Geography
Why?
Dicing
Drilling
down
Drill down
Why?
Roll up
Aggregated
data
Detail data
© Minder Chen, 2004-2014
DW & BI - 32
Roll-Up and Drill-Down
Source: http://www.tutorialspoint.com/dwh/dwh_olap.htm
© Minder Chen, 2004-2014
DW & BI - 33
Slice and Dice
Source: http://www.tutorialspoint.com/dwh/dwh_olap.htm
© Minder Chen, 2004-2014
DW & BI - 34
A Visual Operation: Pivot (Rotate)
Juice
Cola
Milk
10
47
30
Cream 12
3/1 3/2 3/3 3/4 Date
© Minder Chen, 2004-2014
Product
DW & BI - 35
Operations in Multidimensional Data Model
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total
sales by city and year  total sales by region and
by year
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by
average income
• Selection (slice or dice) defines a subcube
– e.g., sales where city = Palo Alto and date = 1/15/96
• Visualization operations (e.g., Pivot)
© Minder Chen, 2004-2014
DW & BI - 36
Pivot Table in Excel
© Minder Chen, 2004-2014
DW & BI - 37
Date Dimension of the Retail Sales Model
© Minder Chen, 2004-2014
DW & BI - 38
Store Dimension
• It is not uncommon to represent multiple
hierarchies in a dimension table. Ideally, the
attribute names and values should be unique
across the multiple hierarchies.
© Minder Chen, 2004-2014
DW & BI - 39
ETL
ETL = Extract, Transform, Load.
ETL cycle includes
• Build reference data (e.g., currency codes)
• Extract (from sources)
• Validate
• Transform (clean, apply business rules, check for data
integrity, create aggregates)
• Stage (load into staging tables, if used)
• Audit reports on compliance with business rules.
• Publish/load (to target tables in the data warehouse)
• Clean up
© Minder Chen, 2004-2014
DW & BI - 40
Data Quality Issues
•
•
•
•
•
•
•
•
•
No common time basis
Different calculation algorithms
Different levels of extraction
Different levels of granularity
Different data field names
Different data field meanings
Missing information
No data correction rules
No drill-down capability
© Minder Chen, 2004-2014
DW & BI - 41
Building The Warehouse
Transforming Data
© Minder Chen, 2004-2014
DW & BI - 42
The Anomalies Nightmare
CUST #
NAME
ADDRESS
90328575
Digital Equipment
187 N. PARK St. Salem NH 01458
OEM
90328575
DEC
187 N. Pk. St. Salem NH 01458
OEM
90238475
Digital
187 N. Park St Salem NH 01458
$#%
90233479
Digital Corp
187 N. Park Ave. Salem NH 01458
Comp
90233489
Digital Consulting
15 Main Street Andover MA 02341
Consult
90234889
Digital Info Service
PO Box 9 Boston MA 02210
Mail List
90345672
Digital Integration
Park Blvd. Boston MA 04106
SYS INT
No Unique Key
Anomalies
No Standardization
TYPE
Spelling
Noise in
Blank Fields
How does one correctly identify and consolidate
anomalies from millions of records?
© Minder Chen, 2004-2014
DW & BI - 43
Data Mining & Knowledge Discovery in Database (KDD) Process
Data Mining is the practice of searching through large amounts of
computerized data to find useful patterns or trends
Data mining is the analysis
step of the "Knowledge
Discovery in Databases"
process (KDD) involving
methods such as artificial
intelligence, machine
learning, statistics,
and database systems.
Source:
© Minder http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Chen, 2004-2014
DW & BI - 44
Knowledge Discovery
• Knowledge discovery in databases is the non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data.
Data
A set of facts.
An association, dependence, clusters, etc. among facts (items)
Pattern
in the data set.
KDD is a multi-step process involving data preparation, pattern
Process searching, knowledge evaluation, and refinement with iteration
after modification.
Discovered patterns should be true on new data with some
Valid
degree of certainty. Generalize to the future (other data).
Novel
Patterns must be novel (should not be previously known).
Actionable; patterns should potentially lead to some useful
Useful
actions.
The process should lead to human insight. Patterns must be
Undermade understandable in order to facilitate a better understanding
standable
of the underlying data.
© Minder Chen, 2004-2014
http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html DW & BI - 45
Cross Industry Standard Process for Data Mining
Action
Decision
Source: http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
© Minder Chen, 2004-2014
DW & BI - 46
Data Mining Tasks and Examples
• Classification - Customer profiling into predefined
categories via supervised learning using Decision Tree or
Neural Network
• Clustering - grouping a set of objects in such a way that
objects in the same group (cluster) are more similar to each
other than to those in other groups (clusters) Market
segmentation , e.g.,
• Summarization - Credit scoring and risk analysis using
Bayesian inference. It is considered a Structured
prediction technique.
• Association - What is the likelihood that a customer will
buy a product next month, if he buys a related item today?
(sequence association)
http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/2_tasks.html
© Minder Chen, 2004-2014
DW & BI - 47
OLAP and Data Mining Address Different Types of Questions
While reporting and OLAP are informative about past facts, only data mining can
help you predict the future of your business.
OLAP
Data Mining
What was the response rate to our mailing?
What is the profile of people who are likely
to respond to future mailings?
How many units of our new product did we Which existing customers are likely to buy
sell to our existing customers?
our next new product?
Who were my 10 best customers last year?
Which 10 customers offer me the greatest
profit potential?
Which customers didn't renew their policies Which customers are likely to switch to the
last month?
competition in the next six months?
Which customers defaulted on their loans?
Is this customer likely to be a good credit
risk?
What were sales by region last quarter?
What are expected sales by region next
year?
What percentage of the parts we produced
yesterday are defective?
What can I do to improve throughput and
reduce scrap?
Source: http://www.dmreview.com/editorial/dmreview/print_action.cfm?articleId=2367
© Minder Chen, 2004-2014
DW & BI - 48
Shopping Basket Analysis
• Which items are purchased in a retail store at
the same time?
• Amazon use collaborative filtering that use
shopping basket (sales) data to make
recommendations when you select an item.
Ref: http://en.wikipedia.org/wiki/Collaborative_filtering
© Minder Chen, 2004-2014
DW & BI - 49
Issues on Interpreting Modeling Results
• Housing price: Use factors, such as location,
number of bedrooms, and square footage, to
determine the market value of a property.
• Beer and Diaper
Source: http://dssresources.com/newsletters/66.php
© Minder Chen, 2004-2014
DW & BI - 50
Veracity
Scale of Data
Analysis of
Streaming Data
Different forms
of data
Uncertainty of
data
Source:
•http://www.ibmbigdatahub.com/infographic/four-vs-big-data
•http://whatsthebigdata.com/2013/07/25/big-data-3-vs-volume-variety-velocity-infographic/
© Minder Chen, 2004-2014
DW & BI - 51
CRISP-DM Methodology
Cross Industry Standard Process for Data Mining Methodology
Source:
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
© Minder
Chen,
2004-2014
& (link)
DW & BI - 52
Data Mining Contexts
Source:
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
© Minder
Chen,
2004-2014
DW & BI - 53
Phases and Tasks
Source:
http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
© Minder Chen,
2004-2014
DW & BI - 54
• Backup Slides
© Minder Chen, 2004-2014
DW & BI - 55
© Minder Chen, 2004-2014
DW & BI - 56
Key Concepts in BI Development Lifecycle
© Minder Chen, 2004-2014
DW & BI - 57
OLTP Normalized Design
Warehouse
Ordering
Process
Chain
Retailer
Store
Retailer
Payments
Retailer
Returns
Product
POS
Process
Retail
Promo
Brand
GL
Account
Retail
Cust
Cash
Register
© Minder Chen, 2004-2014
Clerk
DW & BI - 58
© Minder Chen, 2004-2014
DW & BI - 59
Download