DIMENSIONAL MODELING MIS2502 Data Analytics

So we know…
• Relational databases are good for storing transactional data
• But they are a poor fit for analytical data
• What we can do is design an analytical data store (ADS) based on the operational data store
• That architecture gives us the advantages of both
• Relational database for operational use
• Analytical database for analysis (Online Analytical Processing)
Why have a separate ADS?
• Issue 1: Performance
• The structure is built to handle analysis
• You keep the load off the operational data store
• Issue 2: Usability
• We can structure the data in an intuitive way
• You keep the load off of your IT department
Some terminology
Data Warehouse
• Takes many forms
• Really is just a repository for data
Data Mart
• More focused
• Specially designed for analysis
Data Cube
• Organization of data as a “multidimensional matrix”
• Implementation of a Data Mart
How they all relate
• The data in the operational database…
• …is put into a data warehouse…
• …which feeds the data mart…
• …and is analyzed as a cube.
We’ll start here.
The Data Cube
• Core component of Online Analytical Processing and Multidimensional Data Analysis
• Made up of “facts” and “dimensions”
[Figure: a data cube with three dimensions. Product: M&Ms, Diet Coke, Doritos, Famous Amos. Store: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA. Time: Jan. 2011, Feb. 2011, Mar. 2011. Each cell holds quantity & total price.]
Quantity sold and total price are measured facts.
Why isn’t product price a measured fact?
The Data Cube
[Figure: the Product × Store × Time cube (Jan. to Mar. 2011) with one cell highlighted. Each cell holds quantity & total price.]
• The highlighted element represents all the M&Ms sold in Ardmore, PA in January, 2011
• A single summary record representing a business event (monthly sales).
The Data Cube
[Figure: the Product × Store × Time cube (Jan. to Mar. 2011) with a row of cells highlighted. Each cell holds quantity & total price.]
• The highlighted elements represent Famous Amos cookies sold on Temple’s Main campus from January to March, 2011
• This is called “slicing the data.”
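A minimal Python sketch of slicing, assuming the cube is held as a dictionary keyed by (product, store, month). The quantities and prices are made up for illustration (the slides give no figures), and `slice_cube` is a hypothetical helper, not part of any OLAP tool.

```python
# Each cell maps (product, store, month) -> (quantity, total_price).
# All numbers are illustrative, not taken from the slides.
cube = {
    ("Famous Amos", "Temple Main", "2011-01"): (40, 60.00),
    ("Famous Amos", "Temple Main", "2011-02"): (35, 52.50),
    ("Famous Amos", "Temple Main", "2011-03"): (50, 75.00),
    ("Doritos", "Temple Main", "2011-01"): (80, 100.00),
    ("Famous Amos", "Ardmore, PA", "2011-01"): (20, 30.00),
}


def slice_cube(cube, product=None, store=None, month=None):
    """Keep the cells that match every dimension value that was fixed (not None)."""
    wanted = (product, store, month)
    return {
        key: facts
        for key, facts in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    }


# Fix product and store, leave time free: Famous Amos at Temple Main, Jan. to Mar.
famous_amos_main = slice_cube(cube, product="Famous Amos", store="Temple Main")
total_quantity = sum(quantity for quantity, _ in famous_amos_main.values())
print(len(famous_amos_main), total_quantity)
```

Fixing all three dimensions returns a single cell, which is the "one summary record" case from the previous slide.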
The Data Cube
[Figure: the Product × Store × Time cube (Jan. to Mar. 2011) with one group of cells highlighted in blue and another in orange. Each cell holds quantity & total price.]
• What do the blue highlighted elements represent?
• What do the orange highlighted elements represent?
The n-dimensional cube
• Could you have a data mart with five dimensions?
• If so, give an example
• Then why does our cube example (and most others you will see) only have three?
Designing the Cube: The Star Schema
• We can’t reasonably store the original data as a single table
• Summarization would be too slow
• A lot of redundancy
• So it is stored as a star schema
Sales (Fact): Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price
Product (Dimension): Product_ID, Product_Name, Product_Price, Product_Weight
Store (Dimension): Store_ID, Store_Address, Store_City, Store_State, Store_Type
Time (Dimension): Time_ID, Day, Month, Year
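The star schema above could be created roughly as follows. This is a sketch using Python's built-in `sqlite3` module rather than MySQL. The column types are assumptions (the slide only names the columns), and `Quantity_Sold` and `Total_Price` are written with underscores so they are valid identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product, store, or point in time.
CREATE TABLE Product (
    Product_ID     INTEGER PRIMARY KEY,
    Product_Name   TEXT,
    Product_Price  REAL,
    Product_Weight REAL
);
CREATE TABLE Store (
    Store_ID      INTEGER PRIMARY KEY,
    Store_Address TEXT,
    Store_City    TEXT,
    Store_State   TEXT,
    Store_Type    TEXT
);
CREATE TABLE Time (
    Time_ID INTEGER PRIMARY KEY,
    Day     INTEGER,
    Month   INTEGER,
    Year    INTEGER
);
-- Fact table: keys pointing at each dimension, plus the measured facts.
CREATE TABLE Sales (
    Sales_ID      INTEGER PRIMARY KEY,
    Product_ID    INTEGER REFERENCES Product(Product_ID),
    Store_ID      INTEGER REFERENCES Store(Store_ID),
    Time_ID       INTEGER REFERENCES Time(Time_ID),
    Quantity_Sold INTEGER,
    Total_Price   REAL
);
""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The fact table sits in the middle and each dimension hangs off it by one key, which is where the "star" shape comes from.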
Revisiting Usability: Why a Cube?
• So you have your star schema, now what?
• Non-IT folks probably won’t understand data normalization and table relations in MySQL
• They won’t know how to do table JOINs
• They probably cannot work with raw table data
• Two options to produce usable data (i.e., cube)
• #1: Perform one big table JOIN from star schema
• #2: Calculate meaningful values, store in a data cube.
Option 1: One Big JOIN
Storing the entire join would generate many, many rows!
[Table: one joined row per sale (Sales ID 1000, 1001, 1002), with columns grouped as follows.]
Sales Fact: Sales ID, Qty. Sold, Total Price
Product Dimension: Prod. ID, Prod. Name, Prod. Price, Prod. Weight
Store Dimension: Store ID, Store Address, Store City, Store State, Store Type
Time Dimension: Time ID, Day, Month, Year
It adds up fast… see Sales Person ex.
1000 products × 300 stores × 365 days × 100 daily product purchases = 10,950,000,000 records per year!
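The arithmetic can be checked in a few lines of Python. The variable names are mine, and "100 daily product purchases" is read here as 100 purchases of each product, in each store, every day. The monthly cube size at the end is an added comparison that assumes the month-level cube pictured earlier.

```python
products = 1000
stores = 300
days = 365
daily_purchases_per_product = 100

# One joined row per purchase.
rows_per_year = products * stores * days * daily_purchases_per_product
print(f"{rows_per_year:,}")  # prints 10,950,000,000

# Assumption for comparison only: a monthly cube needs at most one cell
# per product, store, and month.
monthly_cube_cells = products * stores * 12
print(f"{monthly_cube_cells:,}")  # prints 3,600,000
```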
Option 2: Cube of Summary Stats
Summarize the data and store it in the cube
Retrieve only the summary, not the raw data.
Much more efficient, but we are “locked in”
[Figure: the Product × Store × Time cube. Product: M&Ms, Diet Coke, Doritos, Famous Amos. Store: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA. Time: Jan. 2011, Feb. 2011, Mar. 2011. Each cell holds quantity & total price.]
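One way to picture Option 2: collapse the raw sales rows into one summary cell per (product, store, month). The rows below are hypothetical, and `build_cube` is an illustrative helper rather than how any particular OLAP product does it.

```python
from collections import defaultdict

# Hypothetical raw sales rows: (product, store, date, quantity, total_price).
raw_sales = [
    ("M&Ms", "Ardmore, PA", "2011-01-03", 2, 2.50),
    ("M&Ms", "Ardmore, PA", "2011-01-17", 4, 5.00),
    ("M&Ms", "Ardmore, PA", "2011-02-02", 1, 1.25),
    ("Doritos", "Temple Main", "2011-01-09", 3, 4.50),
]


def build_cube(rows):
    """Sum quantity and total price for each (product, store, month)."""
    cells = defaultdict(lambda: [0, 0.0])
    for product, store, date, quantity, total_price in rows:
        month = date[:7]  # "YYYY-MM": the day is dropped here
        cell = cells[(product, store, month)]
        cell[0] += quantity
        cell[1] += total_price
    return {key: tuple(values) for key, values in cells.items()}


cube = build_cube(raw_sales)
print(cube[("M&Ms", "Ardmore, PA", "2011-01")])
```

Four raw rows become three cells. The day of each sale is gone once the cube is built, which is the "locked in" part.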
Demo – Foodmart
• A pre-created data cube that can be read in Excel
• Summaries are already created
Dimensions: Gender, Houseowner, Marital status, Media type, Product name, Sales region, Store name, Total children, Yearly income
Measured Facts: Cost, Sales
Updating the cube
• Data marts are non-volatile (i.e., they can’t be changed)
• Logically: It’s a record of what has happened
• Practically: Would require constant re-computation of the cube
• The cube is refreshed periodically from the transactional database
• Overnight
• Daily
• Weekly
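A sketch of a periodic refresh, assuming a simplified single `sales` table and a `sales_cube` summary table (both names hypothetical), using Python's built-in `sqlite3`. The cube is rebuilt in full on each refresh and is not touched in between.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, store TEXT, sale_date TEXT,
                    quantity INTEGER, total_price REAL);
INSERT INTO sales VALUES ('M&Ms', 'Ardmore, PA', '2011-01-03', 2, 2.50);
INSERT INTO sales VALUES ('M&Ms', 'Ardmore, PA', '2011-01-17', 4, 5.00);
""")


def refresh_cube(conn):
    """Rebuild the monthly summary table from the transactional table."""
    conn.executescript("""
    DROP TABLE IF EXISTS sales_cube;
    CREATE TABLE sales_cube AS
    SELECT product, store, substr(sale_date, 1, 7) AS month,
           SUM(quantity) AS quantity, SUM(total_price) AS total_price
    FROM sales
    GROUP BY product, store, substr(sale_date, 1, 7);
    """)


refresh_cube(conn)  # e.g. the overnight run

# A new sale arrives in the transactional table during the day.
conn.execute("INSERT INTO sales VALUES ('M&Ms', 'Ardmore, PA', '2011-01-20', 1, 1.25)")
stale = conn.execute("SELECT quantity FROM sales_cube").fetchone()[0]

refresh_cube(conn)  # the next scheduled refresh picks it up
fresh = conn.execute("SELECT quantity FROM sales_cube").fetchone()[0]
print(stale, fresh)
```

Between refreshes the cube lags the operational data; how much lag is acceptable decides whether the schedule is overnight, daily, or weekly.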
Designing the Star Schema
• Ralph Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008)
• Choose the business process
• Decide on the level of granularity
• Identify the dimensions
• Identify the fact
• From where do you get the data?
• Can be in house, but might come from elsewhere too…
Choose the business process
• What your data cube is “about”
• Determined by the questions you want to answer about your organization
Question → Business Process
• Who is my best customer? → Sales
• What are my highest selling products? → Sales
• Which teachers have the best student performance? → Standardized testing
• Which supplier is offering us the best deals? → Purchasing
Note that a “business process” is not always about business.
Decide on the level of granularity
• Level of detail for each event (row in the table)
• Will determine the data in the dimensions
• Example: Who is my best customer?
• The “event” is a sales transaction
• Choices for time: yearly, quarterly, monthly, daily
• Choices for store: store, city, state
How would you select the right granularity?
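One thing to keep in mind when choosing: fine-grained data can always be rolled up to a coarser level, but not the other way around. A small sketch with made-up daily quantities (`roll_up` is a hypothetical helper that groups on a prefix of an ISO date string):

```python
# Hypothetical daily quantities sold: (date, quantity).
daily_sales = [
    ("2011-01-03", 2),
    ("2011-01-17", 4),
    ("2011-02-02", 1),
    ("2011-03-15", 5),
]


def roll_up(rows, key_length):
    """Total the quantities, grouping on the first key_length characters of the date."""
    totals = {}
    for date, quantity in rows:
        key = date[:key_length]
        totals[key] = totals.get(key, 0) + quantity
    return totals


by_day = roll_up(daily_sales, 10)   # YYYY-MM-DD
by_month = roll_up(daily_sales, 7)  # YYYY-MM
by_year = roll_up(daily_sales, 4)   # YYYY

# Monthly totals can still be rolled up to yearly totals...
year_from_months = roll_up(by_month.items(), 4)
print(by_month, by_year, year_from_months)
# ...but nothing in by_month says which day in January the 6 units were sold.
```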
Identify the dimensions
• Determined by the business process
• Refined by the level of granularity
• The key elements of the process needed to answer the question
• Example: Sales transaction
• Our example schema defines a “sale” as taking place for a particular product, in a particular store, at a particular time
• Could this data mart tell you
• The best selling product?
• The best customer?
Try it for the “student performance” example.
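To make the "could this data mart tell you" question concrete, here is a sketch against a cut-down version of the example schema in `sqlite3` (only the Product dimension is created, and the rows are made up). The best selling product is one JOIN and a GROUP BY away; the best customer is not, because the Sales fact carries no customer key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (Product_ID INTEGER PRIMARY KEY, Product_Name TEXT);
CREATE TABLE Sales (
    Sales_ID INTEGER PRIMARY KEY, Product_ID INTEGER, Store_ID INTEGER,
    Time_ID INTEGER, Quantity_Sold INTEGER, Total_Price REAL
);
-- Illustrative rows only.
INSERT INTO Product VALUES (1, 'M&Ms'), (2, 'Doritos');
INSERT INTO Sales VALUES
    (1000, 1, 1, 1, 5, 6.25),
    (1001, 2, 1, 1, 3, 4.50),
    (1002, 1, 2, 2, 4, 5.00);
""")

# Best selling product: join the fact to the Product dimension and total the units.
best = conn.execute("""
    SELECT p.Product_Name, SUM(s.Quantity_Sold) AS units
    FROM Sales AS s
    JOIN Product AS p ON p.Product_ID = s.Product_ID
    GROUP BY p.Product_Name
    ORDER BY units DESC
    LIMIT 1
""").fetchone()
print(best)

# Best customer: there is nothing to group by.
sales_columns = [row[1] for row in conn.execute("PRAGMA table_info(Sales)")]
print("Customer_ID" in sales_columns)
```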
Identify the fact
• The data associated with the business event
Keys
• Unique identifier for each row
• Unique identifiers from dimensions
• Associates a combination of the dimensions to a unique business event
• Example: Sales has Product_ID, Store_ID, and Time_ID
Measured, numeric data
• Quantifiable information for each business event
• Does not describe any particular dimension
• Describes a particular combination of dimensional data
• Example: Sales has quantity_sold and total_price.
Try it for the “student performance” example.
Data cube caveats
• You have to choose your aggregations in advance
• So choose wisely!
• Consider a sales data cube with product, store, time, salesperson
• If quantity_sold and total_price are the facts, you can’t figure out the average number of people working in a store
• All people might not have sold all products and therefore wouldn’t be in the joined table
• Granularity is also an issue
• Can’t track daily sales if “date” is monthly (pre-aggregated)
• So why not include every single sale and do no aggregation beforehand?
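The salesperson caveat in miniature, with a hypothetical staff roster and hypothetical fact rows. Anyone who sold nothing never appears in the fact table, so counting people from the cube undercounts the staff.

```python
# Hypothetical staff roster, kept in the operational system, not in the cube.
roster = {"Ardmore, PA": {"Ann", "Bob", "Cal"}}

# Hypothetical fact rows:
# (product, store, month, salesperson, quantity_sold, total_price)
fact_rows = [
    ("M&Ms", "Ardmore, PA", "2011-01", "Ann", 10, 12.50),
    ("Doritos", "Ardmore, PA", "2011-01", "Ann", 4, 6.00),
    ("M&Ms", "Ardmore, PA", "2011-01", "Bob", 7, 8.75),
]

# Salespeople visible through the cube are only those with at least one sale.
sellers_in_cube = {row[3] for row in fact_rows if row[1] == "Ardmore, PA"}
missing = roster["Ardmore, PA"] - sellers_in_cube
print(len(sellers_in_cube), len(roster["Ardmore, PA"]), missing)
```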
OGE Energy (Oklahoma) Example
• Trying to reduce peak power demand
• A few strategies…
• Variable Pricing;
• Smart Meters;
• Customer Notifications (Text, e-Mail, Twitter);
• Customer Rewards
sources: OGE overview, http://energy.gov/, http://www.ogepet.com
OGE Energy (Oklahoma) Example
• Business question
• “How can we reduce peak power demand?”
• What are the relevant facts (performance measures)?
• Power consumption (kWh)
• Power outages
• What are relevant dimensions?
• Time (hours or minutes), location, weather_emergency, price, smart_meter, communications, rebates
OGE Energy Star Schema
PowerDraw (Fact): PowerDrawID, Customer_id, Comms_id, Time_id, Location_id, Consumption, Outages
Customer (Dimension): customer_ID, Attribute 1, Attribute 2, Attribute 3
Communicat. (Dimension): comms_ID, Num_texts, Num_tweets, Num_emails
Location (Dimension): _ID, Attribute 1, Attribute 2, Attribute 3
Time (Dimension): Time_ID, Day, Month, Year, Hour
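A sketch of the kind of question this schema is meant to answer: which hour of the day draws the most power? Only the PowerDraw fact and the Time dimension are created here, the column types are assumptions, and every row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (
    Time_ID INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER,
    Year INTEGER, Hour INTEGER
);
CREATE TABLE PowerDraw (
    PowerDrawID INTEGER PRIMARY KEY, Customer_id INTEGER, Comms_id INTEGER,
    Time_id INTEGER REFERENCES Time(Time_ID), Location_id INTEGER,
    Consumption REAL, Outages INTEGER
);
-- Illustrative rows only.
INSERT INTO Time VALUES (1, 1, 7, 2011, 14), (2, 1, 7, 2011, 17), (3, 2, 7, 2011, 17);
INSERT INTO PowerDraw VALUES
    (1, 10, 1, 1, 1, 2.0, 0),
    (2, 10, 1, 2, 1, 3.5, 0),
    (3, 11, 1, 3, 1, 4.0, 1);
""")

# Total consumption per hour of day, highest first.
peak = conn.execute("""
    SELECT t.Hour, SUM(d.Consumption) AS total
    FROM PowerDraw AS d
    JOIN Time AS t ON t.Time_ID = d.Time_id
    GROUP BY t.Hour
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(peak)
```

Joining through Comms_id or Customer_id in the same way would let you compare consumption with the number of texts, tweets, or emails sent.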