MIS2502
Data Analytics
• Relational databases are good for storing transactional data
• But bad for analytical data
• What we can do is design an analytical data store based on the operational data store
• That architecture gives us the advantages of both
• Relational database for operational use
• Analytical database for analysis (Online Analytical Processing)
• Issue 1: Performance
• The structure is built to handle analysis
• You keep the load off the operational data store
• Issue 2: Usability
• We can structure the data in an intuitive way
• You keep the load off of your IT department
• Takes many forms
• Really is just a repository for data
• Data may come from transactional DB
• Access Layer of DWH
• Specially designed for analysis
• Stores subsets of data from DWH
• Organization of data as a
“multidimensional matrix”
• Implementation of a Data Mart
The data in the operational database…
…is put into a data warehouse…
…which feeds the data mart…
…and is analyzed as a cube.
We’ll start here.
• Core component of
Online Analytical
Processing and
Multidimensional
Data Analysis
• Made up of “ facts ” and “ dimensions ”
M&Ms
Product
Diet
Coke
Doritos
Famous
Amos
Ardmore,
PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Temple
Main quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Cherry Hill,
NJ quantity
& total price quantity
& total price quantity
& total price quantity
& total price
King of
Prussia, PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Mar. 2011
Feb. 2011
Jan. 2011
Quantity sold and total price are measured facts.
Why isn’t product price a measured fact?
M&Ms
Product
Diet
Coke
Doritos
Famous
Amos
The highlighted element represents all the M&Ms sold in Ardmore, PA in
January, 2011
A single summary record representing a business event
(monthly sales).
Ardmore,
PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Temple
Main quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Cherry Hill,
NJ quantity
& total price quantity
& total price quantity
& total price quantity
& total price
King of
Prussia, PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Mar. 2011
Feb. 2011
Jan. 2011
The highlighted elements represent
Famous Amos cookies sold on
Temple’s Main campus from
January to March,
2011
This is called
“slicing the data.”
M&Ms
Product
Diet
Coke
Doritos
Famous
Amos
Ardmore,
PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Temple
Main quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Cherry Hill,
NJ quantity
& total price quantity
& total price quantity
& total price quantity
& total price
King of
Prussia, PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Mar. 2011
Feb. 2011
Jan. 2011
M&Ms
Product
Diet
Coke
Doritos
Famous
Amos
What do the orange highlighted elements represent?
What do the blue highlighted elements represent?
Ardmore,
PA quantity
& total price quantity
& total price
Temple
Main quantity
& total price quantity
& total price
Cherry Hill,
NJ quantity
& total price quantity
& total price
King of
Prussia, PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price quantity
& total price quantity
& total price quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Mar. 2011
Feb. 2011
Jan. 2011
• Could you have a data mart with five dimensions?
• If so, give an example
• Then why does our cube example (and most others you will see) only have three?
• What is an ideal cube?
• We can’t reasonably store the original data as a single table
• Summarization would be too slow
• A lot of redundancy
• So it is stored as a star schema
Product
Product_ID
Product_Name
Product_Price
Product_Weight
Store
Store_ID
Store_Address
Store_City
Store_State
Store_Type
Sales
Sales_ID
Product_ID
Store_ID
Time_ID
Quantity Sold
Total Price
Time
Time_ID
Day
Month
Year
Storing the entire join would generate many, many rows!
Sales
ID
1000
Qty.
Sold
Total
Price
Prod.
ID
Prod.
Name
Prod.
Price
Prod.
Weight
Store
ID
Store
Address
Store
City
Store
State
Store
Type
Time
ID
Day Month Year
1001
1002
Sales Fact Product Dimension Store Dimension Time Dimension
1000 products
300 stores
365 days
100 daily product purchases
=10,950,000,000 records per year!
Summarize the data and store it in the cube
Retrieve only the summary, not the raw data.
Much more efficient, but we are “locked in”
M&Ms
Product
Diet
Coke
Doritos
Famous
Amos
Ardmore,
PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Temple
Main quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Cherry Hill,
NJ quantity
& total price quantity
& total price quantity
& total price quantity
& total price
King of
Prussia, PA quantity
& total price quantity
& total price quantity
& total price quantity
& total price
Mar. 2011
Feb. 2011
Jan. 2011
• A pre-created data cube that can be read in Excel
• Summaries are already created
• Data marts are nonvolatile (i.e., they can’t be changed)
• Logically: It’s a record of what has happened
• Practically: Would require constant re-computation of the cube
• The cube is refreshed periodically from the transactional database
• Overnight
• Daily
• Weekly
• Kimball’s Four Step Process for Data Cube
Design
(Kimball et al., 2008)
• Choose the business process
• Decide on the level of granularity
• Identify the dimensions
• Identify the fact
• From where do you get the data?
• What your data cube is “about”
• Determined by the questions you want to answer about your organization
Question
Who is my best customer?
What are my highest selling products?
Which teachers have the best student performance?
Which supplier is offering us the best deals?
Business
Process
Sales
Sales
Standardized testing
Purchasing
Note that a “business process” is not always about business.
• Level of detail for each event (row in the table)
• Will determine the data in the dimensions
• Example: Who is my best customer?
• The “event” is a sales transaction
• Choices for time: yearly, quarterly, monthly, daily
• Choices for store: store, city, state
How would you select the right granularity?
• Determined by the business process
• Refined by the level of granularity
• The key elements of the process needed to answer to the question
• Example: Sales transaction
• Our example schema defines a “sale” as taking place for a particular product, in a particular store, at a particular time
• Could this data mart tell you
• The best selling product?
• The best customer?
Try it for the “student performance” example.
• The data associated with the business event
Keys
• Unique identifier for each row
• Unique identifiers from dimensions
• Associates a combination of the dimensions to a unique business event
• Example: Sales has
Product_ID, Store_ID, and
Time_ID
Measured, numeric data
• Quantifiable information for each business event
• Does not describe any particular dimension
• Describes a particular combination of dimensional data
• Example: Sales has quantity_sold and total_price.
Try it for the “student performance” example.
• You have to choose your aggregations in advance
• So choose wisely!
• Consider a sales data cube with product, store, time, salesperson
•
• If quantity_sold and total_price are the facts, you can’t figure out the average number of people working in a store
All people might not have sold all products and therefore wouldn’t be in the joined table
• Granularity is also an issue
• Can’t track daily sales if “date” is monthly (pre-aggregated)
• So why not include every single sale and do no aggregation beforehand?