Data Warehouse Toolkit Classics

Dimensional Modeling
Primer
Chapter 1
Kimball & Ross
Concepts Discussed
 Business driven goals
 Data warehouse publishing
 Major components
 Importance of dimensional modeling for the
presentation area
 Facts & dimension tables
 Myths of dimensional modeling
 Pitfalls to avoid
Different Information Worlds
 Users of operational systems turn the wheels
of an organization
 Users of data warehouse watch the wheels
of the organization turn
 Warehouse users have drastically different
needs than users of operational systems
Recurring Themes
 We have mountains of data but we cannot access it
 We need to slice the data in different ways
 Need to make it easy for business users to access
the data
 Just show me what is important
 It drives me crazy when different people present
the same metrics with different numbers
 Fact-based decision making
Goals of Data Warehouse
 Make an organization’s information easily
accessible
 Present the information in a consistent manner
 Adaptive and resilient to change
 Secures and protects information
 Serves as a foundation for improved decision
making
 Business users must accept the data warehouse if
it is to be useful
Publishing Metaphor
 Data warehouse manager is a “publisher” of
the right data
 Responsible for publishing data collected
from a variety of sources and edited for
quality and consistency
Components of a Data
Warehouse
 Operational source systems
 Data staging area
 Data presentation area
 Data access tools
Data Staging Area
 Key structural requirement is that it is off-limits
to business users and does not
provide query and presentation services.
– Correct misspellings, resolve domain conflicts,
deal with missing elements, parse into standard
formats, combine data from multiple sources.
– Normalized structures in the staging area are
sometimes called the “enterprise data warehouse”;
Kimball considers this a misnomer.
Data Staging Area
 Dominated by simple activities such as sorting and
sequential processing.
 Normalized data is acceptable, although this
is not the end goal.
Data Presentation
 Series of integrated data marts. A data mart
holds data from a single business process; it is
one wedge of the overall pie.
 Data must be presented, stored and accessed
in dimensional schemas.
Data Presentation
 Should not be in normalized form.
 They must contain detailed atomic data in
addition to data in summary form, because
the queries are ad hoc and cannot be
predicted.
 Data marts must use common facts and dimensions, which are called conformed.
Presentation Area
 If it is based on a relational database, it is
called a star schema.
 If it is a multidimensional database, or OLAP,
then the data is stored in cubes.
Data Access Tools
 Querying is the whole point of DW.
 Can be as simple as an ad hoc query tool or
as complex as a data mining or a modeling
application.
 Parameter driven analytic operations.
 80 to 90 percent of the users are served by canned
applications.
Additional Considerations
 Meta data
 Operational data store
Retail Sales
Kimball & Ross, Chapter 2
Overview
 Four-step dimensional design process
 Transaction-level fact tables
 Additive and non-additive facts
 Sample dimension table attributes
 Causal dimensions
 Degenerate dimensions
 Extending an existing dimension model
 Snowflaking dimension attributes
 Avoiding the “too many dimensions” trap
 Surrogate keys
Four-Step Dimensional Design Process
1. Select the business process to model.
– Not a business department or function
– E.g., purchasing, ordering, shipping, invoicing, inventorying
2. Declare the grain of the business process.
– Specifies what an individual fact table row represents
– E.g., individual line item on sales ticket, daily snapshot of the
inventory levels for a product
Four-Step Dimensional Design Process
3. Choose the dimensions that apply for each fact table row.
– Q: How do business people describe the data that results from
the business process?
– E.g., date, product, store, customer, transaction type
4. Identify the numeric (measured) facts that will populate each
fact table row.
– Q: What are we measuring?
– Typical facts are numeric additive figures
– E.g., quantity ordered, dollar cost amount
 In making decisions regarding the 4 steps, consider both the
user requirements as well as the realities of the source data
Retail Case Study
 Large grocery chain: 100 grocery stores over 5 regions
 Each store:
– Departments: grocery, frozen foods, dairy, meat, produce, bakery,
floral, health/beauty aids, etc.
– 60,000 products (SKUs = stock keeping units) on shelves
– 55,000 SKUs with UPCs
– 5,000 SKUs without UPCs but with assigned SKU numbers
 Data is collected:
– from cash registers into a point-of-sale (POS) system
– at back door where vendors make deliveries
Retail Case Study – Cont’d
 Management concerns
– Logistics of ordering, stocking, and selling products
– Maximizing profit
– Product pricing
– Lowering cost of acquisition and overhead
– Use of promotions to increase sales
• temporary price reductions
• newspaper ads
• grocery store displays
• coupons
Step 1. Select the Business Process
 Decide what business process to model, by combining an
understanding of the business requirements with an understanding
of data realities.
 The first dimensional model built should be the one
– with the most impact,
– that answers the most pressing business questions, and
– whose data is readily accessible for extraction.
 In retail case study: POS retail sales
 Business Question: What products are selling in which stores on
what days and under what promotional conditions?
Step 2. Declare the Grain
 What level of data detail should be made available in the
dimensional model?
 Choose the most atomic information captured by the
business process.
– Atomic data
• Most detailed, cannot be subdivided
• Facilitates ad hoc, unexpected usage and ability to drill down
to details
 Case study grain: individual line item on a POS
transaction
Step 3. Choose the Dimensions
 A careful grain statement determines the primary
dimensions.
 It is then usually possible to add additional dimensions.
 If an additional desired dimension violates the grain by
causing additional fact rows to be generated, then the
grain statement must be revised to accommodate this
dimension.
 Case study dimensions: date, product, store, promotion
Preliminary Retail Sales Schema
 POS Sales Transaction Fact
– Date Key (FK)
– Product Key (FK)
– Store Key (FK)
– Promotion Key (FK)
– POS Transaction Number
– Other facts TBD
 Date Dimension
– Date Key (PK)
– Date attributes TBD
 Product Dimension
– Product Key (PK)
– Product attributes TBD
 Store Dimension
– Store Key (PK)
– Store attributes TBD
 Promotion Dimension
– Promotion Key (PK)
– Promotion attributes TBD
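The preliminary schema can be sketched as relational tables. This is a minimal illustration using Python's built-in sqlite3 module, not the book's DDL: the table names, the single descriptive column in each dimension, and the sales_quantity fact are placeholders for the "TBD" items; only the key columns come from the slide.

```python
import sqlite3

# Star schema sketch: one fact table with a foreign key to each dimension.
# Every non-key column below is a hypothetical stand-in for "TBD".
DDL = """
CREATE TABLE date_dim      (date_key INTEGER PRIMARY KEY, date_desc TEXT);
CREATE TABLE product_dim   (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store_dim     (store_key INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
CREATE TABLE pos_sales_fact (
    date_key      INTEGER NOT NULL REFERENCES date_dim(date_key),
    product_key   INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key     INTEGER NOT NULL REFERENCES store_dim(store_key),
    promotion_key INTEGER NOT NULL REFERENCES promotion_dim(promotion_key),
    pos_transaction_number INTEGER NOT NULL,  -- no dimension table behind it
    sales_quantity INTEGER                    -- stand-in for "other facts TBD"
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# List the tables that were created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```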
Step 4. Identify the Facts
 Pick the business measurements for the fact table; they must be true to
the grain.
 Case study - Facts collected by POS system:
– Sales quantity, sales price/unit, sales $ amount, standard cost $
amount
– Gross Profit = sales $ amount - standard cost $ amount
• Recommendation: Include in fact table even though it can be
calculated. Eliminates the possibility of user error.
 For non-additive measurements such as percentages and ratios
(e.g., gross margin) store the numerator (gross profit) and
denominator ($ revenue) in the fact table. The ratio can be
calculated in a data access tool for any slice of the fact table.
Caution: Calculate the ratio of the sums, not the sum of the
ratios
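A small worked example of the caution above. The two fact rows are made-up numbers chosen only to make the difference visible; the variable names are illustrative.

```python
# Hypothetical fact rows: (gross_profit_dollars, revenue_dollars)
rows = [(10.0, 100.0), (45.0, 50.0)]

# Correct: sum numerator and denominator first, then divide.
total_profit = sum(profit for profit, _ in rows)
total_revenue = sum(revenue for _, revenue in rows)
ratio_of_sums = total_profit / total_revenue      # 55 / 150

# Incorrect: divide row by row, then add the ratios.
sum_of_ratios = sum(profit / revenue for profit, revenue in rows)  # 0.1 + 0.9

print(f"gross margin (ratio of sums): {ratio_of_sums:.3f}")
print(f"sum of ratios (meaningless):  {sum_of_ratios:.3f}")
```

Storing the additive numerator and denominator in the fact table is what makes the correct calculation possible for any slice.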
Date Dimension
 Ubiquitous in every data mart
 See Figure 2.4, p. 39
 Use verbose, self-explanatory values rather than coded values.
They are used as column headers in reports. By decoding in
the database, we ensure consistency across different
application environments.
– E.g., Holiday Indicator – use values: Holiday, Nonholiday; as
opposed to Y/N
 Date Key should be an integer rather than a date data type
 Data warehouses need an explicit date dimension table to
describe fiscal periods, seasons, holidays, weekends, and other
calendar calculations that are not supported by the SQL date
function.
 If transaction time is of interest, we may need a separate Time
Dimension table
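As a rough sketch of these points, the date dimension can be generated ahead of time with an integer key and verbose attribute values. The column names and the one-entry holiday calendar below are assumptions for illustration, not the attributes of Figure 2.4.

```python
from datetime import date, timedelta

HOLIDAYS = {date(2024, 1, 1)}  # hypothetical holiday calendar
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def build_date_dimension(start, days):
    """Return one dimension row per calendar day, starting at `start`."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append({
            "date_key": offset + 1,            # integer key, not a date type
            "full_date": d.isoformat(),
            "day_of_week": DAY_NAMES[d.weekday()],
            # Verbose values instead of Y/N flags, usable as report headers.
            "weekday_indicator": "Weekend" if d.weekday() >= 5 else "Weekday",
            "holiday_indicator": "Holiday" if d in HOLIDAYS else "Nonholiday",
        })
    return rows

date_dim = build_date_dimension(date(2024, 1, 1), 7)
for row in date_dim:
    print(row)
```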
Product Dimension
 Describes every SKU in the store
 Fill this dimension with as many descriptive attributes as possible.
 “Robust dimension attributes deliver robust analytic slicing and
dicing capabilities.”
 Hierarchies = groups of attributes
 Merchandise hierarchy
– SKUs roll up to brands to categories to departments.
– Each is a many-to-one relationship
 Although there will be redundancy, no need to normalize. Given
the relative size of the dimension (as compared to the fact table)
space saving is minimal.
Store Dimension
 The store dimension: Store Key (PK), Store
Name, Store Number (Natural Key), Store
Address, …
 Possible to represent multiple hierarchies in
a dimension table
– Store to any geographic attribute (e.g., ZIP,
county, state)
– Store to store district to region
Promotion Dimension
 Describes the promotion conditions under which a product is sold
 Called a “causal dimension” – describes factors thought to cause
a change in product sales (price reductions, ads, displays,
coupons)
 Could keep all 4 causal mechanisms in a single dimension
– They are highly correlated, so not much difference in space
requirements
– More efficient browsing for finding out how various promotions are
used together
 … or split into 4 separate dimensions
– May be more understandable to business
– Administration may be more straightforward
 To avoid null keys in the fact table (violation of referential
integrity), for line items not being promoted include a row in the
promotion dimension to indicate “No Promotion in Effect”
Factless Fact Table
 Q: Which products were under promotion but did not sell?
 Cannot answer yet. POS sales fact table has only products that
were sold
 Answer: Create Promotion Coverage Factless Fact Table
– Factless Fact Table = has no measurement metrics
– Contains date, product, store, and promotion keys
 Two-step process to answer Q:
– Query Promotion Coverage table: products under promotion on given
date
– From POS Sales Fact table: products sold
– Answer is the set difference of above
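The two-step process can be illustrated with plain Python sets. The keys below are invented sample data for a single date and store; in practice each set would come from a query against the corresponding table.

```python
# Step 1: rows from the promotion coverage factless fact table
# for one date and store, as (product_key, promotion_key) pairs.
promotion_coverage = {(101, 7), (102, 7), (103, 8)}

# Step 2: product keys found in the POS sales fact table
# for the same date and store.
products_sold = {101, 103, 250}

# Products under promotion, regardless of which promotion applied.
promoted_products = {product_key for product_key, _ in promotion_coverage}

# Set difference: promoted on that day but absent from the sales facts.
promoted_but_unsold = promoted_products - products_sold
print(promoted_but_unsold)
```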
Degenerate Dimension (DD)
 Dimension keys used in fact table without
corresponding dimension tables
 In case study: POS Transaction #
 Still useful for grouping by transaction
 Common DDs: order numbers, invoice numbers
 Fact table primary key: Product Key and POS
Transaction Number
Retail Schema Extensibility
 Original schema extends gracefully because POS
transaction data was modeled at its most granular level.
 Premature aggregation limits ability to extend if new
dimensions do not apply to higher grain
 Case study new dimensions:
– Frequent Shopper
– Clerk
– Time of Day
Schema Extensibility
 Dimensional models can handle extensions without invalidating
existing applications:
– New dimension attributes – simply add columns to dimension
table. If a new attribute is only available after a point in time,
populate old dimension records with something like “Not
Available”
– New dimensions – add foreign key fields to fact table
– New measured facts – add to fact table. If not at the same grain,
then need separate fact table
– Dimension becoming more granular – create new dimension.
May imply more granular fact table, in which case, may have to
rebuild the fact table.
– Addition of a completely new data source involving existing and
new dimensions – usually needs new fact table
Resisting Dimension
Normalization
 Snowflaking = Dimension table normalization
– Redundant attributes are removed from the denormalized dimension table
and are placed in normalized secondary dimension tables
– Fully snowflaked schema = 3NF ER diagram
 The dimension tables must not be normalized, and should remain as flat
tables.
 Numerous tables and joins usually translate into slower query
performance.
 Efforts to normalize any of the tables in a dimensional database solely in
order to save disk space are a waste of time. Disk space savings gained
by normalizing the dimension tables are typically less than one percent
of the total disk space needed for the overall schema.
 Normalized dimension tables destroy the ability to browse within a
dimension or across dimensions (e.g., list package types for each brand
in a category). SQL needed becomes too complex.
 The fact table is naturally normalized.
Too Many Dimensions
 Too many dimensions increase space requirements for
the fact table.
 A very large number of dimensions typically means that
several dimensions are not completely independent and
should be combined.
 A single hierarchy should not be captured in separate
dimensions.
Surrogate Keys
 Surrogate keys are integers assigned sequentially as needed to
populate a dimension. They serve to join dimension tables to the fact
table.
 Avoid embedding intelligence in the data warehouse keys.
 Benefits:
– Surrogate keys buffer the DW environment from operational changes.
What happens when operations decide to recycle account numbers after
some period of inactivity? Fine for operational systems, but problematic
for DW if it is using account numbers as a PK.
– Can more easily integrate data from multiple operational systems, even if
they lack consistent source keys.
– Performance advantages because small size of surrogate keys leads to
smaller fact tables
– Surrogate keys are used to support one of the primary techniques for
handling changes in dimension table attributes (Chapter 4).
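A minimal sketch of sequential surrogate key assignment, assuming a simple in-memory lookup keyed by source system and natural key. The class name, source system names, and account numbers are hypothetical; a real load process would persist the lookup and would need extra logic (not shown) to detect a recycled account number.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Assigns meaningless sequential integers to natural keys."""

    def __init__(self):
        self._next_key = count(1)
        self._lookup = {}  # (source_system, natural_key) -> surrogate key

    def key_for(self, source_system, natural_key):
        natural = (source_system, natural_key)
        if natural not in self._lookup:
            # First time this natural key is seen: hand out the next integer.
            self._lookup[natural] = next(self._next_key)
        return self._lookup[natural]

keys = SurrogateKeyGenerator()
billing_key = keys.key_for("billing", "ACCT-0042")
crm_key = keys.key_for("crm", "ACCT-0042")        # same number, other system
repeat_key = keys.key_for("billing", "ACCT-0042")  # already assigned
print(billing_key, crm_key, repeat_key)
```

Because the surrogate key carries no embedded meaning, identical source keys from different systems do not collide in the warehouse.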
Inventory
Kimball & Ross, Chapter 3
Overview
 Value chain implications
 Inventory periodic snapshot model,
transaction and accumulating snapshot
models
 Semi-additive facts
 Enhanced inventory facts
 Data Warehouse bus architecture and
matrix
 Conformed dimensions and facts
Value Chain
 The value chain identifies the natural, logical flow of an
organization’s primary activities. See Fig. 3.1
 Operational source systems produce transactions or
snapshots at each step in the value chain. They generate
interesting performance metrics along the way.
 Each business process generates one or more fact tables.
Inventory Models
 Inventory periodic snapshot
– Inventory level of each product measured daily (or weekly) –
represented as a separate row in a fact table
 Inventory transactions
– As products move through the warehouse, all transactions with
impact on inventory levels are recorded
 Inventory accumulating snapshot
– One fact table row for each product updated as the product
moves through the warehouse
Inventory Periodic Snapshot
Model
 Business need
– Analysis of daily quantity-on-hand inventory levels by product
and store
 Business process
– Retail store inventory
 Granularity
– Daily inventory by product at each store
 Dimensions
– Date, product, store
 Fact
– Quantity on hand
Inventory Periodic Snapshot Model Challenge
 Very dense (huge) fact table
– As opposed to retail sales, which was sparse because only about 10%
of products sell each day
 60,000 items in 100 stores = 6,000,000 rows per daily snapshot
 If 14 bytes per row: 84MB per day
 One-year period: 365 x 84MB = about 30GB
 Solution: Reduce snapshot frequencies over time
– Last 60 days at daily level
– Weekly snapshots for historical data
– For a 3-year period = 208 snapshots vs. 3x365 = 1095 snapshots;
reduction by a factor of about 5
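The sizing arithmetic on this slide can be checked with a few lines. The split of the 208 snapshots into 60 daily plus 148 weekly is an assumption about how the figure was derived; it is consistent with the numbers shown.

```python
products, stores, bytes_per_row = 60_000, 100, 14

rows_per_day = products * stores                  # one row per product per store
mb_per_day = rows_per_day * bytes_per_row / 1e6   # decimal megabytes
gb_per_year = 365 * mb_per_day / 1e3              # decimal gigabytes

# Reduced frequency: keep the last 60 days daily, the rest of 3 years weekly.
daily_window = 60
total_days = 3 * 365
weekly_snapshots = round((total_days - daily_window) / 7)
reduced_snapshots = daily_window + weekly_snapshots
reduction_factor = total_days / reduced_snapshots

print(rows_per_day, mb_per_day, round(gb_per_year, 2))
print(reduced_snapshots, round(reduction_factor, 2))
```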
Semiadditive Facts
 Inventory levels (quantity on hand) are additive across
products or stores, but NOT across dates = semi-additive
facts
 Compare to Retail Sales:
– once the product is sold it is not counted again
 Static level measurements (inventory, balances…) are
not additive across date dimension; to aggregate over
time use average over number of time periods.
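A short illustration of semi-additivity, using invented snapshot rows for one product in two stores over two days.

```python
# Hypothetical daily snapshot rows: (date, product, store, quantity_on_hand)
snapshots = [
    ("2024-01-01", "milk", "store_1", 10),
    ("2024-01-01", "milk", "store_2", 20),
    ("2024-01-02", "milk", "store_1", 8),
    ("2024-01-02", "milk", "store_2", 16),
]

# Adding across stores within a single date is meaningful.
on_hand_by_date = {}
for day, _product, _store, quantity in snapshots:
    on_hand_by_date[day] = on_hand_by_date.get(day, 0) + quantity

# Adding across dates is not: the same units are counted once per snapshot.
misleading_total = sum(on_hand_by_date.values())

# Across time, divide by the number of periods instead.
average_on_hand = misleading_total / len(on_hand_by_date)

print(on_hand_by_date, misleading_total, average_on_hand)
```

Note that a plain SQL AVG over the fact rows would divide by the number of rows rather than the number of dates, which is why the period count is computed explicitly here.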
Enhanced Inventory Facts
 Number of turns = total quantity sold / daily average quantity on hand
 Days’ supply = final quantity on hand / average quantity sold
 Gross profit = value at latest selling price - value at cost
 Gross margin = gross profit / value at latest selling price
 GMROI (Gross Margin Return On Inventory)
– GMROI = number of turns * gross margin
– measures effectiveness of inventory investment
– high = lot of turns and more profit, low = low turns and low profit
 Need additional facts:
– quantity sold, value at cost, value at latest selling price
 GMROI is not additive and, therefore, is not stored in enhanced fact table.
It is calculated from the constituent columns.
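The formulas on this slide translate directly into a small function. This is a sketch with made-up inputs; the function and parameter names are illustrative, and the inputs are assumed to be already aggregated for one product, store, and time span.

```python
def inventory_metrics(total_quantity_sold, avg_quantity_on_hand,
                      final_quantity_on_hand, avg_quantity_sold,
                      value_at_cost, value_at_latest_selling_price):
    """Derive the non-additive inventory measures from their components."""
    turns = total_quantity_sold / avg_quantity_on_hand
    days_supply = final_quantity_on_hand / avg_quantity_sold
    gross_profit = value_at_latest_selling_price - value_at_cost
    gross_margin = gross_profit / value_at_latest_selling_price
    return {
        "turns": turns,
        "days_supply": days_supply,
        "gross_profit": gross_profit,
        "gross_margin": gross_margin,
        "gmroi": turns * gross_margin,
    }

metrics = inventory_metrics(
    total_quantity_sold=120, avg_quantity_on_hand=40,
    final_quantity_on_hand=30, avg_quantity_sold=4,
    value_at_cost=300.0, value_at_latest_selling_price=400.0,
)
print(metrics)
```

Only the component columns are stored; the derived values are recomputed for whatever slice the query selects.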
Inventory Transactions Model
 Record every transaction that affects inventory
– Receive product
– Place product into inspection hold
– Release product from inspection hold
– Return product to vendor due to inspection failure
– Place product in bin
– Authorize product for sale
– Pick product from bin
– Package product for shipment
– Ship product to customer
– Receive product from customer
– Return product to inventory from customer return
– Remove product from inventory
Inventory Transactions Model – Cont’d
 Dimensions: date, warehouse, product, vendor, inventory
transaction type.
 The transaction-level fact table contains the most detailed
information possible about the inventory.
 It is useful for measuring the frequency and timing of specific
transaction types.
 It is impractical for broad data warehouse questions that span dates
or products.
 To give a more cumulative view of a process, some form of
snapshot table often accompanies a transaction fact table.
Inventory Accumulating Snapshot Model
 Build one record in the fact table for each product delivery to the warehouse
 Track disposition of a product until it leaves the warehouse
– Receiving
– Inspection
– Bin placement
– Authorization to sell
– Picking
– Boxing
– Shipping
 The philosophy of the inventory accumulating snapshot fact table is to provide an
updated status of the product shipment as it moves through the above milestones.
 Rarely used in long-running, continuously replenished inventory processes.
 More on this in chapter 5.
Value Chain Integration
 Both business and IT organizations are interested in value chain
integration
 Desire to look across the business to better evaluate overall
performance
 Data marts may correspond to different business processes
 Need to look consistently at dimensions shared between business
processes
 Need an integrated data warehouse architecture
 If dimension table attributes in various marts are identical, each
mart is queried separately; the results are then outer-joined based
on a common dimension attribute = drill across
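Drill across can be sketched as a full outer join of two separately computed result sets. The brand names and totals below are invented; each dictionary stands for the answer set of one query against one mart, grouped by the same conformed attribute.

```python
# Hypothetical per-mart results, both grouped by the conformed "brand" attribute.
sales_by_brand = {"Acme": 500, "Bolt": 120}       # from the retail sales mart
inventory_by_brand = {"Acme": 80, "Crest": 40}    # from the inventory mart

def drill_across(left, right):
    """Full outer join of two result sets on their shared row header."""
    return {
        key: (left.get(key), right.get(key))   # None where a mart has no row
        for key in sorted(left.keys() | right.keys())
    }

report = drill_across(sales_by_brand, inventory_by_brand)
for brand, (sales, on_hand) in report.items():
    print(brand, sales, on_hand)
```

The join only lines up correctly because both marts label the brand identically, which is the point of conforming the dimension.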
Data Warehouse Bus
Architecture
 Cannot build the enterprise data warehouse in one step.
 Building isolated pieces will defeat consistency goal.
 Need an architected incremental approach:
the data warehouse bus architecture.
 See Fig. 3.7
 By defining a standard bus interface for the data warehouse
environment, separate data marts can be implemented by different
groups at different times. The separate data marts can be plugged
together and usefully coexist if they adhere to the standard.
Data Warehouse Bus Architecture – Cont’d
 During architecture phase, team designs a
master suite of standardized dimensions and
facts that have uniform interpretation across
the enterprise.
 Separate data marts are then developed
adhering to this architecture.
Data Warehouse Bus Matrix
 See Figure 3.8
 The rows of the bus matrix correspond to business processes, and therefore to
data marts
 Separate rows should be created if:
– the sources are different,
– the processes are different, or
– a row represents more than what can be tackled in a single
implementation iteration.
 Creating the DW bus matrix is a very important up-front
deliverable of a DW implementation. The DW bus matrix is a
hybrid resource: technical design tool, project management tool,
and communication tool.
Conformed Dimensions
 Conformed dimensions are:
– identical, or
– strict mathematical subsets of the most granular, detailed dimension.
 Conformed dimensions have consistent
– Dimension keys
– Attribute column names
– Attribute definitions
– Attribute values
 If two marts have dimensions (e.g., customer, product) that are not
conformed, then they cannot be used together
Types of Dimension Conformity
 Mean same thing
– Single shared table or physical copy
– Consistent data content, data interpretation, user presentation
 Rolled-up level of granularity
– Roll-up dimensions conform to the base-level atomic dimension if
they are a strict subset of that atomic dimension. (see Fig. 3.9)
 Dimension subset at same level of granularity
– At same level but one represents only a subset of rows (see Fig. 3.10)
 Combination of above
Centralized Dimension Authority
 The major responsibility of the centralized dimension
authority is to:
– establish,
– maintain, and
– publish the conformed dimensions to all client data marts.
 Accounts for about 90% of the up-front data architecture effort
 Political challenge
Conformed Facts
 In general, fact table data is not duplicated explicitly in
multiple data marts.
 If facts live in more than one location, then their
definitions and equations must be the same and they
must have the same name.
 If it is impossible to conform a fact exactly, then
different names should be given to different
interpretations. This will make it less likely that
incompatible facts will be used in a calculation.
Acknowledgements
• Ralph Kimball & Margy Ross