DIMENSIONAL MODELING MIS2502 Data Analytics

advertisement

DIMENSIONAL

MODELING

MIS2502

Data Analytics

So we know…

• Relational databases are good for storing transactional data

• But bad for analytical data

• What we can do is design an analytical data store based on the operational data store

• That architecture gives us the advantages of both

• Relational database for operational use

• Analytical database for analysis (Online Analytical Processing)

Why have a separate ADS?

• Issue 1: Performance

• The structure is built to handle analysis

• You keep the load off the operational data store

• Issue 2: Usability

• We can structure the data in an intuitive way

• You keep the load off of your IT department

Some terminology

Data

Warehouse

• Takes many forms

• Really is just a repository for data

• Data may come from transactional DB

Data Mart

• Access Layer of DWH

• Specially designed for analysis

• Stores subsets of data from DWH

Data Cube

• Organization of data as a

“multidimensional matrix”

• Implementation of a Data Mart

How they all relate

The data in the operational database…

…is put into a data warehouse…

…which feeds the data mart…

…and is analyzed as a cube.

We’ll start here.

The Data Cube

• Core component of

Online Analytical

Processing and

Multidimensional

Data Analysis

• Made up of “ facts ” and “ dimensions ”

M&Ms

Product

Diet

Coke

Doritos

Famous

Amos

Ardmore,

PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Temple

Main quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Cherry Hill,

NJ quantity

& total price quantity

& total price quantity

& total price quantity

& total price

King of

Prussia, PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Mar. 2011

Feb. 2011

Jan. 2011

Quantity sold and total price are measured facts.

Why isn’t product price a measured fact?

The Data Cube

M&Ms

Product

Diet

Coke

Doritos

Famous

Amos

The highlighted element represents all the M&Ms sold in Ardmore, PA in

January, 2011

A single summary record representing a business event

(monthly sales).

Ardmore,

PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Temple

Main quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Cherry Hill,

NJ quantity

& total price quantity

& total price quantity

& total price quantity

& total price

King of

Prussia, PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Mar. 2011

Feb. 2011

Jan. 2011

The Data Cube

The highlighted elements represent

Famous Amos cookies sold on

Temple’s Main campus from

January to March,

2011

This is called

“slicing the data.”

M&Ms

Product

Diet

Coke

Doritos

Famous

Amos

Ardmore,

PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Temple

Main quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Cherry Hill,

NJ quantity

& total price quantity

& total price quantity

& total price quantity

& total price

King of

Prussia, PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Mar. 2011

Feb. 2011

Jan. 2011

The Data Cube

M&Ms

Product

Diet

Coke

Doritos

Famous

Amos

What do the orange highlighted elements represent?

What do the blue highlighted elements represent?

Ardmore,

PA quantity

& total price quantity

& total price

Temple

Main quantity

& total price quantity

& total price

Cherry Hill,

NJ quantity

& total price quantity

& total price

King of

Prussia, PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price quantity

& total price quantity

& total price quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Mar. 2011

Feb. 2011

Jan. 2011

The n-dimensional cube

• Could you have a data mart with five dimensions?

• If so, give an example

• Then why does our cube example (and most others you will see) only have three?

Designing the Cube:

The Star Schema

• What is an ideal cube?

• We can’t reasonably store the original data as a single table

• Summarization would be too slow

• A lot of redundancy

• So it is stored as a star schema

Product

Product_ID

Product_Name

Product_Price

Product_Weight

Store

Store_ID

Store_Address

Store_City

Store_State

Store_Type

Sales

Sales_ID

Product_ID

Store_ID

Time_ID

Quantity Sold

Total Price

Time

Time_ID

Day

Month

Year

A join to make the cube?

Storing the entire join would generate many, many rows!

Sales

ID

1000

Qty.

Sold

Total

Price

Prod.

ID

Prod.

Name

Prod.

Price

Prod.

Weight

Store

ID

Store

Address

Store

City

Store

State

Store

Type

Time

ID

Day Month Year

1001

1002

Sales Fact Product Dimension Store Dimension Time Dimension

It adds up fast

1000 products

300 stores

365 days

100 daily product purchases

=10,950,000,000 records per year!

From Star Schema to Cubes

Summarize the data and store it in the cube

Retrieve only the summary, not the raw data.

Much more efficient, but we are “locked in”

M&Ms

Product

Diet

Coke

Doritos

Famous

Amos

Ardmore,

PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Temple

Main quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Cherry Hill,

NJ quantity

& total price quantity

& total price quantity

& total price quantity

& total price

King of

Prussia, PA quantity

& total price quantity

& total price quantity

& total price quantity

& total price

Mar. 2011

Feb. 2011

Jan. 2011

Demo – Sales Performance

• A pre-created data cube that can be read in Excel

• Summaries are already created

Salesman

Region

Product

Time Period

# of customers

Net sales

Profit/loss

Updating the cube

• Data marts are nonvolatile (i.e., they can’t be changed)

• Logically: It’s a record of what has happened

• Practically: Would require constant re-computation of the cube

• The cube is refreshed periodically from the transactional database

• Overnight

• Daily

• Weekly

Designing the Star Schema

• Kimball’s Four Step Process for Data Cube

Design

(Kimball et al., 2008)

• Choose the business process

• Decide on the level of granularity

• Identify the dimensions

• Identify the fact

• From where do you get the data?

Choose the business process

• What your data cube is “about”

• Determined by the questions you want to answer about your organization

Question

Who is my best customer?

What are my highest selling products?

Which teachers have the best student performance?

Which supplier is offering us the best deals?

Business

Process

Sales

Sales

Standardized testing

Purchasing

Note that a “business process” is not always about business.

Decide on the level of granularity

• Level of detail for each event (row in the table)

• Will determine the data in the dimensions

• Example: Who is my best customer?

• The “event” is a sales transaction

• Choices for time: yearly, quarterly, monthly, daily

• Choices for store: store, city, state

How would you select the right granularity?

Identify the dimensions

• Determined by the business process

• Refined by the level of granularity

• The key elements of the process needed to answer to the question

• Example: Sales transaction

• Our example schema defines a “sale” as taking place for a particular product, in a particular store, at a particular time

• Could this data mart tell you

• The best selling product?

• The best customer?

Try it for the “student performance” example.

Identify the fact

• The data associated with the business event

Keys

• Unique identifier for each row

• Unique identifiers from dimensions

• Associates a combination of the dimensions to a unique business event

• Example: Sales has

Product_ID, Store_ID, and

Time_ID

Measured, numeric data

• Quantifiable information for each business event

• Does not describe any particular dimension

• Describes a particular combination of dimensional data

• Example: Sales has quantity_sold and total_price.

Try it for the “student performance” example.

Data cube caveats

• You have to choose your aggregations in advance

• So choose wisely!

• Consider a sales data cube with product, store, time, salesperson

• If quantity_sold and total_price are the facts, you can’t figure out the average number of people working in a store

All people might not have sold all products and therefore wouldn’t be in the joined table

• Granularity is also an issue

• Can’t track daily sales if “date” is monthly (pre-aggregated)

• So why not include every single sale and do no aggregation beforehand?

Download