Dimensional-Modeling-PP

Dimensional Modeling Overview
Agenda
■ Dimensional Modeling Overview
■ Dimensional Modeling Steps
■ Dimensional Modeling Framework
■ Retail Sales - Case Study
■ Dimensional Modeling Tips
Dimensional Modeling Overview
From Requirement to Data Design
■ The requirements definition completely drives the data design for the DW.
■ Data design consists of putting together the data structures.
■ A group of data elements forms a data structure.
■ Logical data design includes determining the data elements that are needed, combining those elements into data structures, and establishing relationships among the data structures.
[Figure: flow from Requirements Gathering through the Requirements Definition Document (Information Packages) to Data Design and the Dimensional Model]
Design Decisions
■ Choosing the process.
♦ Selecting the subjects from the information packages for the first set of logical structures to be designed.
■ Choosing the grain.
♦ Determining the level of detail for the data structures.
■ Identifying and conforming the dimensions.
♦ Choosing the business dimensions (such as product, market, time, etc.) to be included in the first set of structures.
■ Choosing the facts.
♦ Selecting the metrics or units of measurement (such as product sale units, dollar sales, dollar revenue, etc.) to be included in the first set of structures.
■ Choosing the duration of the database.
♦ Determining how far back in time you should go for historical data.
Dimensional Modeling Overview
■ What is Dimensional Modeling
♦ It is a logical design technique to structure the business dimensions and the metrics that are analyzed along these dimensions.
♦ It seeks to present the data in a standard framework that is intuitive and allows for high-performance access.
♦ It is inherently dimensional and adheres to a discipline that uses the relational model with some important restrictions.
♦ The fundamental idea of dimensional modeling is that nearly every type of business data can be represented as a kind of cube of data.
♦ The model has also proved to provide high performance for queries and analysis.
Dimensional Modeling Overview
■ Components of Dimensional Model
♦ Fact Table
► The fact table contains facts or measurements of the business
♦ Dimension Table
► The dimension tables contain textual attributes that describe the
facts
Dimensional Modeling Overview
Sample Report Translation
Sales Rep Performance Report
Central Region

                        Jul-00      Aug-00
                        (Dollars)   (Dollars)
Chicago District           879         878
Adams                      345         456
Brown                      564         565
Frederickson               657         768
Minneapolis District       890         789
Andersen Smith             909         978
Central Region Total      4244        4434

“Dimensions”: report, row and column headings
“Facts”: numeric report values
Dimensional Modeling Overview
■ Dimension Tables
♦ Dimension tables are the entry points into the data warehouse
♦ Dimension tables are designed especially for selection and grouping under a common head
♦ They determine the contextual background for the facts
♦ They serve as parameters for OLAP
♦ Common Dimensions
► Date
► Product
► Location/Region
► Customers
For the dimensional model to represent the information contained in the information package, the data structures must represent the metrics, the business dimensions, and the attributes of each business dimension.
Dimensional Modeling Overview
■ Dimension Tables
♦ Dimension Table Characteristics
► Serve as report labels and query constraints
• “By” words
• “Where” clauses
► Provide descriptive information
• Minimal codes
• Embedded meaning as attributes
► Represent hierarchical relationships
Let us look at the Product business dimension as an example, for when we want to analyze the facts by product. Its attributes are:
ITEM KEY, Item Description, Item Size, Package Type, Category
Dimensional Modeling Overview
Product Dimension Table with Sample Rows
Product Key   Product Desc         Brand Desc     Class    Size
0001          CHEERIOS 10 OZ       CHEERIOS       Family   10 OZ
0002          CHEERIOS 24 OZ       CHEERIOS       Family   24 OZ
0003          LUCKY CHARMS 10 OZ   LUCKY CHARMS   Kids     10 OZ
Big Dimensions
Retail_Fact: time_Key, Store_Key, Customer_key, Product_key, Sales_dollars, Units_sold
Customer Dimension: Customer_Key, Customer_ID (natural key), Customer_name, Customer_address, Date_of_birth, Age, Gender, Annual_Income, Number_of_children, Marital_status, Other_attributes...
Big Dimensions
[Figure: the big Customer Dimension split into two smaller dimensions]
Customer Dimension: Customer_key, Customer_ID (natural key), Customer_name, Customer_address, Date_of_birth, Other_attributes...
Customer Demographics Dimension: Customer_demographics_key, Age_Band, Gender, Income_Band, Number_of_children, Marital_status
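The split into a demographics dimension can be sketched in Python. This is only an illustration: the band boundaries, the `to_band` helper and the `DemographicsDimension` class are hypothetical and do not come from the slides. The idea shown is that continuous attributes (Age, Annual_Income) are replaced by bands, so that many customers share one Customer_demographics_key.

```python
from bisect import bisect_right

# Hypothetical band boundaries; the slides do not specify any.
AGE_EDGES = [18, 30, 45, 65]
AGE_LABELS = ["<18", "18-29", "30-44", "45-64", "65+"]
INCOME_EDGES = [25_000, 50_000, 100_000]
INCOME_LABELS = ["<25K", "25K-50K", "50K-100K", "100K+"]


def to_band(value, edges, labels):
    """Return the label of the band that contains value."""
    return labels[bisect_right(edges, value)]


class DemographicsDimension:
    """Issues one Customer_demographics_key per distinct combination of banded attributes."""

    def __init__(self):
        self._keys = {}

    def key_for(self, age, gender, annual_income, number_of_children, marital_status):
        combo = (
            to_band(age, AGE_EDGES, AGE_LABELS),
            gender,
            to_band(annual_income, INCOME_EDGES, INCOME_LABELS),
            number_of_children,
            marital_status,
        )
        # Reuse the key if this combination was seen before, else issue the next one.
        return self._keys.setdefault(combo, len(self._keys) + 1)

    def __len__(self):
        return len(self._keys)


demographics = DemographicsDimension()
k1 = demographics.key_for(34, "F", 62_000, 2, "Married")
k2 = demographics.key_for(41, "F", 75_000, 2, "Married")  # falls in the same bands as k1
k3 = demographics.key_for(70, "M", 20_000, 0, "Single")
```

Here the first two customers land in the same bands and share one demographics row, so the demographics dimension grows with the number of band combinations rather than with the number of customers.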
Dirty dimensions
■ A dirty dimension is one in which data quality cannot be guaranteed
♦ Data about the same customer can appear multiple times
Customer Dimension: Customer_key, Customer_ID (natural key), Customer_name, Customer_address, Date_of_birth, Age, Gender, Annual_Income, Number_of_children, Marital_status, Other_attributes...
Hierarchies in Dimensions
■ Multiple Hierarchies
♦ Dimension tables can represent multiple hierarchical roll-ups
♦ For example, the Store Dimension could have the following hierarchies
► Physical Geography
• Zip, City, County, State, Country
► Sales Organization
• District, Region, Zone
► Distribution Roll-up
• Distribution Center, Distribution Center Region
Store Dimension: Store_key, Store_description, Store_type, Zip, City, State, Sales_region, Sales_zone, Distribution Center, Distribution Center Region
Hierarchies in Dimensions
[Figure: hierarchy diagram with the levels Year, Quarter, Month, Week, Date; Region, State, District, City, with an alternate roll-up through Sales Zone and Region; Category, Brand, Product]
Dimension Tables can represent multiple hierarchical roll-ups
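A roll-up along one such hierarchy can be sketched as follows. The store rows, the sales figures and the `roll_up` helper are hypothetical and serve only to show that, because every hierarchy level is stored as an attribute of the dimension row, moving up a level is just a matter of grouping by a different attribute.

```python
from collections import defaultdict

# Hypothetical store rows: each carries every level of its geography hierarchy.
STORES = {
    1: {"City": "Chicago", "State": "IL", "Region": "Central"},
    2: {"City": "Springfield", "State": "IL", "Region": "Central"},
    3: {"City": "Minneapolis", "State": "MN", "Region": "Central"},
}

# Hypothetical fact rows: (store_key, sales_dollars).
SALES = [(1, 100.0), (1, 50.0), (2, 25.0), (3, 200.0)]


def roll_up(level):
    """Sum sales at one level of the hierarchy (City, State or Region)."""
    totals = defaultdict(float)
    for store_key, dollars in SALES:
        totals[STORES[store_key][level]] += dollars
    return dict(totals)


by_city = roll_up("City")      # most detailed level
by_state = roll_up("State")    # rolled up one level
by_region = roll_up("Region")  # rolled up again
```

Drilling down is the reverse move: the same facts are grouped by a lower-level attribute such as City.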
Dimensional Modeling Overview
■ Fact Table
♦ The fact table is where the numerical measurements of the business are stored
♦ Facts
► The detailed information in a fact table
• For example: Sales Quantity, Unit Sales Price, Sales Amount, etc.
► Key performance indicators of the business
► Numeric in nature
► Analyzed across the dimensions
♦ Multi-part key
► Foreign keys to dimension tables
► Date is always a key
Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMOTION KEY, POS TRXN#, Sales Quantity, Unit Sales Price, Sales $ Amount
Dimensional Modeling Overview
■ So far we have formed the fact table and the dimension tables.
■ How should these tables be arranged in the dimensional model?
■ What are the relationships, and how should we mark the relationships in the model?
■ The dimensional model should primarily facilitate queries and analyses. What would be the types of queries and analyses?
■ What are the requirements to meet before combining these tables in a dimensional model?
♦ The model should provide the best data access.
♦ The whole model must be query-centric.
♦ It must be optimized for queries and analysis.
♦ The model must show that the dimension tables interact with the fact table.
♦ It should be structured in such a way that every dimension can interact equally with the fact table.
♦ The model should allow drilling down or rolling up along dimension hierarchies.
♦ Given these requirements, each dimension table relates directly to the fact table in the middle.
■ Such an arrangement in the dimensional model looks like a star formation, with the fact table at the core of the star and the dimension tables along the spikes of the star.
■ The dimensional model is therefore called a STAR schema.
■ See figure:
Star-like Model
[Figure: a Fact table at the center, surrounded by Dimension 1, Dimension 2, Dimension 3, ..., Dimension n]
STAR schema for AUTO Sales
[Figure: the AUTO SALES fact table at the center, surrounded by the dimensions PRODUCT, TIME, DEALER, PAYMENT METHOD and CUSTOMER DEMOGRAPHICS]
Typical Star model
Product Dimension: Product_key, Description, Brand, Category
Date Dimension: Date_key, Day_of_week, Month, Quarter, Year, Holiday_flag
Store Dimension: Store_key, Store_name, Address, Floor_plan_type
Sales Fact: Product_key, Store_key, Date_key, Dollars_sold, Units_sold, Dollars_cost
Star Model (continued)
■ It consists of the Sales fact table in the middle of the schema diagram, with three dimension tables: Date, Product and Store.
■ The user will analyze sales using dollars sold, units sold and dollar cost.
■ The STAR schema structure intuitively answers questions such as: for a given amount of dollars, what product was sold? Who was the customer? Which store sold the product? When was the order placed?
■ Constraints and filters of queries are easily understood by looking at the star.
■ A common type of analysis is drilling down from the summary numbers to the details at the lower levels by filtering queries.
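The typical star model above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the lower-case table names, the column types and all sample rows are invented for illustration, and only the column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, brand TEXT, category TEXT);
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, day_of_week TEXT, month TEXT,
                       quarter TEXT, year INTEGER, holiday_flag TEXT);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT, address TEXT, floor_plan_type TEXT);
-- The fact table holds one foreign key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    date_key    INTEGER REFERENCES date_dim(date_key),
    dollars_sold REAL, units_sold INTEGER, dollars_cost REAL
);
-- Hypothetical sample rows, invented for illustration only.
INSERT INTO product_dim VALUES (1, 'Cereal A 10 OZ', 'Brand A', 'Cereal');
INSERT INTO product_dim VALUES (2, 'Cereal B 24 OZ', 'Brand B', 'Cereal');
INSERT INTO date_dim VALUES (1, 'Friday', 'January', 'Q1', 1999, 'Y');
INSERT INTO date_dim VALUES (2, 'Saturday', 'January', 'Q1', 1999, 'N');
INSERT INTO store_dim VALUES (1, 'Store 1', '1 Main St', 'Standard');
INSERT INTO sales_fact VALUES (1, 1, 1, 10.0, 2, 6.0);
INSERT INTO sales_fact VALUES (2, 1, 1, 4.0, 1, 3.0);
INSERT INTO sales_fact VALUES (1, 1, 2, 5.0, 1, 3.0);
""")

# Constrain on one dimension (store), group by attributes of two others (brand, month).
rows = conn.execute("""
    SELECT p.brand, d.month, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN date_dim    AS d ON d.date_key = f.date_key
    JOIN store_dim   AS s ON s.store_key = f.store_key
    WHERE s.store_name = 'Store 1'
    GROUP BY p.brand, d.month
    ORDER BY p.brand
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what makes the constraints and groupings of such a query easy to read off the star.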
Snowflake Design
[Figure: snowflake diagram with a central Fact Table and four Dim Tables]
Low cardinality redundant attributes moved to “sub dimension” tables
Snowflake Design
■ Issues
♦ Only a few tools are optimized for the snowflake schema
♦ When the tool is not optimized for snowflake design
► Presentation is more complex
► Browsing is slower
► Problems arise with multiple joins
Dimensional Modeling Steps
Dimensional Modeling Steps
■ Identify the Business Process
♦ A major operational process that is supported by some kind of legacy
system(s) from which data can be collected for the purpose of the data
warehouse
♦ Example: orders, invoices, shipments, inventory, sales
■ Identify the Grain
♦ The fundamental lowest level of data represented in a fact table for the
business process
♦ Example: individual transactions, individual daily snapshots
■ Identify the Dimensions
♦ Choose the dimensions that will apply to each fact table record
■ Identify the Facts
♦ Choose the measured facts that will populate each fact table record
Dimensional Modeling Steps
Key Inputs: Business Requirements, Data Reality
Dimensional Modeling Steps
1. Identify the Business Process
2. Identify the Grain
3. Identify the Dimensions
4. Identify the Facts
Output: Dimensional Model
Resist the temptation to model data by looking at copybooks alone.
Conformed Dimension
Shared Dimensions Must Conform
■ Option 1: Identical dimensions with the same keys, labels,
definitions and values
Sales Schema
Item Dimension: Item Key, Item Desc., Brand Desc., Category, ..
Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMO KEY
Inventory Schema
Item Dimension: Item Key, Item Desc., Brand Desc., Category, ..
Inventory Fact: DATE KEY, ITEM KEY, STORE KEY
Conformed Dimension
■ Option 2: “Subset” of base dimension
Sales Schema
Item Dimension: Item Key, Item Desc., Brand Desc., Category Desc., ..
Date Dimension: DATE KEY, Day-of-week, Week Desc, Month Desc, ..
Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMO KEY, Sales $
Sample item row:
Item Key   Item Desc       Brand Desc   Category Desc
0001       Cheerios 10oz   Cheerios     Cereal
Forecast Schema
Brand Dimension: Brand Key, Brand Desc., Category Desc., ..
Month Dimension: Month KEY, Month Desc
Forecast Fact: Month Key, Brand Key, Estimate Sales $
Sample brand row:
Brand Key   Brand Desc   Category Desc
1001        Cheerios     Cereal
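Because the brand dimension is a strict subset of the item dimension, results from the two schemas can be lined up on shared attribute values. The Python sketch below illustrates this under assumptions: the dimension rows are the sample rows above, but the fact values, the month label and the function names are hypothetical.

```python
from collections import defaultdict

# Base item dimension and its conformed brand-level subset (sample rows from the slide).
ITEM_DIM = {"0001": {"item_desc": "Cheerios 10oz", "brand_desc": "Cheerios", "category_desc": "Cereal"}}
BRAND_DIM = {1001: {"brand_desc": "Cheerios", "category_desc": "Cereal"}}

# Hypothetical fact rows; the slide gives no measure values.
SALES_FACT = [  # (item_key, month_desc, sales_dollars)
    ("0001", "Jan 1999", 120.0),
    ("0001", "Jan 1999", 80.0),
]
FORECAST_FACT = [  # (brand_key, month_desc, estimated_sales_dollars)
    (1001, "Jan 1999", 250.0),
]


def sales_by_brand_month():
    """Roll item-level sales up to the brand/month level of the forecast schema."""
    totals = defaultdict(float)
    for item_key, month, dollars in SALES_FACT:
        totals[(ITEM_DIM[item_key]["brand_desc"], month)] += dollars
    return totals


def actual_vs_forecast():
    """Match both result sets on the conformed attributes Brand Desc and Month Desc."""
    actual = sales_by_brand_month()
    report = {}
    for brand_key, month, estimate in FORECAST_FACT:
        brand = BRAND_DIM[brand_key]["brand_desc"]
        report[(brand, month)] = (actual.get((brand, month), 0.0), estimate)
    return report


report = actual_vs_forecast()
```

The match only works because "Cheerios" means the same thing and is spelled the same way in both dimensions, which is the point of conforming them.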
Slowly Changing Dimensions
■ Dimension attributes evolve over time
♦ For example, customers change their names, move, have children, adjust their incomes
■ For every dimension attribute, a “change” strategy needs to be identified
♦ A combination of strategies may be used within the same dimension table
Slowly Changing Dimensions
Type 1: Overwrite the changed attributes
Original record
Item Key   Item Desc       Dept
12345      Sim City 3000   Educational S/W
Updated record
Item Key   Item Desc       Dept
12345      Sim City 3000   Strategy S/W

Type 2: Add a New Dimension Record
Original record
Item Key   Item Desc       Dept
12345      Sim City 3000   Educational S/W
Additional record
Item Key   Item Desc       Dept
12345      Sim City 3000   Strategy S/W

Type 3: Add a “Prior” Attribute
Original record
Item Key   Item Desc       Dept
12345      Sim City 3000   Educational S/W
Updated record
Item Key   Item Desc       Dept           Prior Dept
12345      Sim City 3000   Strategy S/W   Educational S/W
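The three strategies can be compared side by side in a small Python sketch. It is an illustration rather than a reference implementation: the function names are hypothetical, rows are matched on Item Desc only because the sample has no separate natural key, and the Type 2 function assumes the added row receives a new surrogate key (12346 in the usage lines is invented), whereas the slide shows 12345 on both rows.

```python
from copy import deepcopy

ORIGINAL = [{"item_key": 12345, "item_desc": "Sim City 3000", "dept": "Educational S/W"}]


def scd_type1(rows, item_desc, new_dept):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    rows = deepcopy(rows)
    for row in rows:
        if row["item_desc"] == item_desc:
            row["dept"] = new_dept
    return rows


def scd_type2(rows, item_desc, new_dept, new_key):
    """Type 2: keep the old row and add a new row under a new surrogate key (assumption)."""
    rows = deepcopy(rows)
    current = [r for r in rows if r["item_desc"] == item_desc][-1]
    rows.append({**current, "item_key": new_key, "dept": new_dept})
    return rows


def scd_type3(rows, item_desc, new_dept):
    """Type 3: move the old value into a 'prior' column of the same row."""
    rows = deepcopy(rows)
    for row in rows:
        if row["item_desc"] == item_desc:
            row["prior_dept"] = row["dept"]
            row["dept"] = new_dept
    return rows


t1 = scd_type1(ORIGINAL, "Sim City 3000", "Strategy S/W")
t2 = scd_type2(ORIGINAL, "Sim City 3000", "Strategy S/W", new_key=12346)
t3 = scd_type3(ORIGINAL, "Sim City 3000", "Strategy S/W")
```

Type 1 keeps one row and no history, Type 2 keeps full history at the cost of extra rows, and Type 3 keeps one row with exactly one previous value.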
Slowly Changing Dimensions
■ Data Warehouse Keys, i.e., STAR schema keys
♦ All tables (facts and dimensions) should use surrogate keys generated by the data warehouse
♦ The customer numbers of discontinued customers may be reassigned to new customers. This causes a problem, because the same customer number could then relate both to the data of the newer customer and to the data of the retired customer. Therefore, do not use such keys as primary keys for dimension tables.
♦ Surrogate keys are simply system-generated sequence numbers.
♦ Each row in a dimension table is identified by a unique value of an attribute designated as the primary key of the dimension.
♦ Each dimension table is in a 1:M relationship with the central fact table, so the primary key of each dimension table must be a foreign key in the fact table.
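The reassigned customer number problem can be illustrated with a small key generator in Python. The class, its method names and the customer numbers are hypothetical. The sketch only shows that the warehouse issues its own sequence numbers, so a reused customer number never collides with the retired customer's rows.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Maps natural keys from the source system to warehouse-generated sequence numbers."""

    def __init__(self):
        self._next = count(1)
        self._current = {}  # natural key -> surrogate key of its current owner

    def key_for(self, natural_key):
        """Return the surrogate key for natural_key, issuing one on first sight."""
        if natural_key not in self._current:
            self._current[natural_key] = next(self._next)
        return self._current[natural_key]

    def reassign(self, natural_key):
        """The source reused natural_key for a new customer: issue a fresh surrogate key."""
        self._current[natural_key] = next(self._next)
        return self._current[natural_key]


keys = SurrogateKeyGenerator()
retired_customer = keys.key_for("C-100")  # original owner of customer number C-100
other_customer = keys.key_for("C-200")
same_customer = keys.key_for("C-100")     # same owner, same surrogate key
newer_customer = keys.reassign("C-100")   # C-100 handed to a different customer
```

Fact rows loaded for the retired customer keep pointing at the old surrogate key, while rows for the newer customer use the new one, even though both share the customer number C-100.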
Additive Measures
■ Additivity is the ability of a measure to be added across all dimensions of the fact table.
■ Measures can be fully additive, semi-additive or non-additive
♦ Fully Additive: aggregation is done by simple addition across any dimension. Examples: Sales Quantity, Revenue
♦ Semi-Additive: can be added across some dimensions but not others, typically not across time. Examples: Account Balance, Inventory, number of customers (measures of intensity, head counts)
♦ Non-Additive: cannot be meaningfully added across any dimension. Examples: Profit margin and other ratios and percentages (1:3, 30%, etc.)
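The three kinds of additivity can be made concrete with a few lines of Python. All figures below are hypothetical, and using the average daily balance is just one common choice for aggregating a semi-additive measure over time.

```python
from collections import defaultdict

# Hypothetical daily account balances: (day, account, balance).
BALANCES = [
    ("Mon", "A", 100.0), ("Mon", "B", 50.0),
    ("Tue", "A", 120.0), ("Tue", "B", 30.0),
]
# Hypothetical sales rows: (product, revenue, cost).
SALES = [("P1", 200.0, 150.0), ("P2", 100.0, 50.0)]

# Fully additive: revenue can simply be summed across any dimension.
total_revenue = sum(revenue for _, revenue, _ in SALES)

# Semi-additive: balances can be summed across accounts within one day ...
balance_by_day = defaultdict(float)
for day, _, balance in BALANCES:
    balance_by_day[day] += balance
# ... but across time an average (or a closing value) is used instead of a sum.
average_daily_balance = sum(balance_by_day.values()) / len(balance_by_day)

# Non-additive: a ratio is recomputed from its summed components.
total_cost = sum(cost for _, _, cost in SALES)
overall_margin = (total_revenue - total_cost) / total_revenue
# Adding the per-row ratios gives a meaningless number, shown only for contrast.
sum_of_margins = sum((revenue - cost) / revenue for _, revenue, cost in SALES)
```

With these figures the overall margin is one third, while the sum of the two row margins is 0.75, which describes nothing in the business.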
Factless Fact tables to track attendance of students
Time Dimension: time_key, time attributes...
Student Dimension: Student_key, Student attributes...
Course Dimension: Course_key, Course attributes...
Faculty Dimension: Faculty_key, Faculty attributes...
Facility Dimension: Facility_key, Facility attributes...
Student Attendance Fact: time_Key, Student_Key, Course_Key, Faculty_Key, Facility_key
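Since the attendance fact table holds only keys, analysis consists of counting rows. A minimal Python sketch, with hypothetical key values:

```python
from collections import Counter

# Hypothetical attendance rows:
# (time_key, student_key, course_key, faculty_key, facility_key).
# A row exists only when a student attended; there is no numeric measure.
ATTENDANCE_FACT = [
    (1, 101, 10, 7, 3),
    (1, 102, 10, 7, 3),
    (2, 101, 10, 7, 3),
    (2, 101, 11, 8, 4),
]

# "How many attendances per course?" is answered by counting rows per Course_Key.
attendance_by_course = Counter(course_key for _, _, course_key, _, _ in ATTENDANCE_FACT)
# "How often did each student attend?" is answered by counting rows per Student_Key.
attendance_by_student = Counter(student_key for _, student_key, _, _, _ in ATTENDANCE_FACT)
```

The presence of a row is itself the fact being recorded, which is why no measure column is needed.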
Factless Fact tables – Coverage tables
Sales Fact (revisited)
Date Dimension: Date_key, Day_of_week, Month, Quarter, Year, Holiday_flag
Product Dimension: Product_key, Description, Brand, Category
Promotion Dimension: Promotion_key, Promotion_name, Discount, Event
Store Dimension: Store_key, Store_name, Address, Floor_plan_type
Sales Fact: Date_key, Product_key, Promotion_key, Store_key, Dollars_sold, Units_sold, Dollars_cost
Factless Fact tables – Coverage tables
Date Dimension: Date_key, Day_of_week, Month, Quarter, Year, Holiday_flag
Product Dimension: Product_key, Description, Brand, Category
Promotion Dimension: Promotion_key, Promotion_name, Discount, Event
Store Dimension: Store_key, Store_name, Address, Floor_plan_type
Sales Fact (keys only, no measures): Date_key, Product_key, Promotion_key, Store_key
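A typical use of a coverage table is to find what was covered by a promotion but did not sell, which the sales fact table alone cannot show because it has no rows for unsold products. The following Python sketch assumes hypothetical key values and set names:

```python
# Hypothetical coverage rows: (date_key, product_key, promotion_key, store_key).
# One row per product that was on promotion in a store on a day, sold or not.
PROMOTION_COVERAGE = {
    (1, 1, 1, 1),
    (1, 2, 1, 1),
    (1, 3, 1, 1),
}

# Key columns of the sales fact rows for the same day (measures omitted).
SALES_KEYS = {
    (1, 1, 1, 1),
    (1, 3, 1, 1),
}

# Products on promotion that recorded no sale: coverage rows without a matching sales row.
promoted_but_not_sold = sorted(PROMOTION_COVERAGE - SALES_KEYS)
```

In this example product 2 was promoted in store 1 on day 1 but never appears in the sales rows, so it is the only entry in the result.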
Dimensional Modeling Framework
Dimensional Modeling Framework
[Figure: framework levels, ordered by level of detail]
Conceptual Level
♦ Identify Subject Area, Grain
♦ Identify Major Dimensions & Facts, Conform Dimensions across Facts
Logical Level
♦ Detail Facts with Measures
♦ Detail Dimensions with Hierarchies & Attributes
♦ Slowly Changing Dimensions Policies
♦ Source-Data Model Mapping
Physical Level
♦ Pre-calculations, Aggregates, Indexes, Data Structures, Source-Physical Model Mapping
Dimensional Modeling Framework
■ STEP1: Choose the process
♦ Choose a process or subject area from the list of subject areas
identified
♦ Examples of this could be Sales Analysis, Strategic Sourcing,
Human Resources
■ STEP2 : Choose the Grain
♦ Choose the level of detail
♦ Every data mart / warehouse should be based on the most
granular (atomic) data that can possibly be collected and stored.
Dimensional Modeling Framework
■ STEP3 : Identifying Dimensions & Dimension
Hierarchy
♦ Year, Quarter, Month, Week, Date
♦ Region, State, District, City
♦ Category, Brand, Product
Dimensional Modeling Framework
■ STEP4 : Choose the Measures
♦ Identify all the measures for the fact table
■ STEP5 : Conforming the dimensions
♦ Common dimensions across the facts / data marts have to be
exactly the same as, or a subset of, the main dimension table
Dimensional Modeling Framework
■ STEP6 : Adding Attributes to Dimension Tables
♦ Enhance the depth of analysis
♦ Examples Customer Age, Address, Profession, Product color,
flavor, product size, packaging type etc.
[Figure: the hierarchies Year, Quarter, Month, Week, Date; Region, State, District, City; Category, Brand, Product, extended with attributes such as Regional Manager, Sequence, CurrentFlag, Phone, Address, Size, Color]
Dimensional Modeling Framework
■ STEP7 : Storing Pre-calculations in the Fact table
♦ Calculated based on one or more base measures
■ STEP8 : Choosing the Duration of the Database
♦ Need for analyzing the data over a period of time
Dimensional Modeling Framework
■ STEP9 : Track Slowly Changing Dimensions
♦ Certain kinds of dimension attribute changes need to be handled
differently in Data Warehouse
► Type I: Overwrite
► Type II: History
► Type III: Add new column (example: organizational changes)
Retail Sales – Case Study
Retail Sales - Case Study
■ Background
♦ Chain consists of over 100 grocery stores in five states
♦ Stores average 60,000 SKUs in departments such as frozen
foods, dairy etc.
♦ Bar codes are scanned directly into the cash register POS
system
♦ Items are promoted via coupons, temporary price reductions,
ads and in-store promotions
■ Analytical Requirements
♦ Need to know what is selling in store each day in order to
evaluate product movement, as well as to see how sales are
impacted by promotions
♦ Need to understand the mix of items in a consumer market
basket
Retail Sales - Case Study
■ Identify the Business Process: Sales
■ Identify the Grain: Transaction Item (Daily Sales)
■ Identify the Dimensions: Date, Location, Item, Promotion
■ Identify the Measures: Quantity, Price, Amount
Retail Sales - Case Study
Resultant Sales Schema
Date Dimension: DATE KEY, Date Description, Week, Month, Quarter, Year
Store Dimension: STORE KEY, Store Name, City, District, Zone
Item Dimension: ITEM KEY, Item Description, Item Size, Package Type, Category
Promotion Dimension: PROMOTION KEY, Promotion Description, Discount
Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMOTION KEY, POS TRXN#, Sales Quantity, Unit Sales Price, Sales $ Amount
Retail Sales - Case Study
Time Dimension
Date Key  Date      Day of Week  Day Number in Month  Month    Quarter  Year  Holiday Indicator
1         1/1/1999  Friday       1                    January  Q1       1999  Holiday
2         1/2/1999  Saturday     2                    January  Q1       1999  Non-Holiday
3         1/3/1999  Sunday       3                    January  Q1       1999  Non-Holiday
4         1/4/1999  Monday       4                    January  Q1       1999  Non-Holiday

Item Dimension
Item Key  Item Description   SKU Number   Dept     Size  Package Type  Brand         Category
1         Lasagna 6 OZ       90706287103  Grocery  6 OZ  Box           Cold Gourmet  Frozen Foods
2         Beef Stew 6 OZ     16005393282  Grocery  6 OZ  Box           Cold Gourmet  Frozen Foods
3         Extra Nougat 2 OZ  46817560065  Grocery  6 OZ  Can           Chewy         Candy
Retail Sales - Case Study
Promotion Dimension
Promo Key  Promo Name             Price Reduction  Ad Type          Media Type       Promo $  Begin Date  End Date
1          Blue Ribbon Discounts  Temporary        Daily Paper      Paper            2000     1/1/1999    1/15/1999
2          Red Carpet Closeout    Markdown         Sunday Paper     Paper            1000     1/3/1999    1/10/1999
3          Ad Blitz               None             Paper and Radio  Paper and Radio  7000     1/15/1999   1/30/1999

Sales Fact
Date Key  Item Key  Store Key  Promo Key  POS Trxn #  Sales Qty  Unit Sales Price  Sales $ Amt
1         1         1          15         763457893   1          4.59              4.59
1         2         1          1          763457893   2          0.89              1.78
1         5         11         19         763457894   1          2.56              2.56
2         13        5          8          763457923   1          0.33              0.33
2         5         11         12         763457998   1          1.29              1.29
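The two analytical requirements of the case study (daily sales and the mix of items in a market basket) can be checked against the sample Sales Fact rows with a short Python sketch. The rows are taken from the table above, while the variable names and the rounding to cents are choices made for this illustration.

```python
from collections import defaultdict

# Sales Fact rows from the sample table:
# (date_key, item_key, store_key, promo_key, pos_trxn, sales_qty, unit_sales_price, sales_amt)
SALES_FACT = [
    (1, 1, 1, 15, 763457893, 1, 4.59, 4.59),
    (1, 2, 1, 1, 763457893, 2, 0.89, 1.78),
    (1, 5, 11, 19, 763457894, 1, 2.56, 2.56),
    (2, 13, 5, 8, 763457923, 1, 0.33, 0.33),
    (2, 5, 11, 12, 763457998, 1, 1.29, 1.29),
]

# Check that Sales $ Amt equals Sales Qty x Unit Sales Price on every row.
amounts_consistent = all(
    abs(qty * price - amount) < 0.005
    for _, _, _, _, _, qty, price, amount in SALES_FACT
)

sales_by_date = defaultdict(float)  # daily sales: total dollars per Date Key
baskets = defaultdict(list)         # market basket: items per POS transaction
for date_key, item_key, _, _, trxn, _, _, amount in SALES_FACT:
    sales_by_date[date_key] += amount
    baskets[trxn].append(item_key)

# Round to cents to hide floating-point noise in the sums.
daily_totals = {date_key: round(total, 2) for date_key, total in sales_by_date.items()}
```

The POS transaction number is what ties the line items of one basket together: transaction 763457893 contains items 1 and 2, and every other transaction in the sample contains a single item.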
Dimensional Modeling Tips
Dimensional Modeling Tips
■ Carefully choose the labels to identify data marts, dimension,
attributes and facts
■ An attribute can live in one and only one dimension, whereas a
fact can be repeated in multiple fact tables
■ If a single dimension appears to reside in more than one
place, several roles are probably being played. Name the roles
uniquely and treat them as separate dimensions
■ A single field in the underlying source data can have one or
more logical columns associated with it
♦ E.g., A product attribute field may translate to product code, product
short description, and product long description
■ Every fact should have a default aggregation rule (sum, min,
max, latest, semi-additive, special algorithm, and not
aggregatable)
♦ This will serve as a requirements list when evaluating query and
report-writing tools
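One way to record a default aggregation rule per fact is a simple lookup table, sketched below in Python. The fact names, the rules assigned to them and the `aggregate` helper are hypothetical and are not prescribed by the slides.

```python
# Hypothetical mapping from fact name to its default aggregation rule.
DEFAULT_AGGREGATION = {
    "sales_quantity": "sum",
    "unit_sales_price": "not aggregatable",
    "account_balance": "latest",
}

# Rules that can be applied mechanically to a list of values.
AGGREGATORS = {
    "sum": sum,
    "min": min,
    "max": max,
    "latest": lambda values: values[-1],  # assumes values are in time order
}


def aggregate(fact_name, values):
    """Apply the default aggregation rule of fact_name to values."""
    rule = DEFAULT_AGGREGATION[fact_name]
    if rule not in AGGREGATORS:
        raise ValueError(f"{fact_name} has rule '{rule}' and cannot be aggregated")
    return AGGREGATORS[rule](values)


total_quantity = aggregate("sales_quantity", [1, 2, 1])
closing_balance = aggregate("account_balance", [100.0, 120.0])
```

Keeping the rule next to the fact definition lets a query or reporting tool refuse an invalid aggregation, such as summing unit prices, instead of silently producing a wrong number.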
Thank You
■ References
♦ Ralph Kimball
► The Data Warehouse Toolkit
► The Data Warehouse Lifecycle Toolkit