Star schema (Stjernediagram)

Contents of this slideshow:
• What is a data warehouse?
• Multi-dimensional data modeling
An example of a data warehouse:
A star schema data warehouse has a central table (the fact table)
surrounded by dimension tables with one-to-many relationships towards
the fact table.
Fact table
Orderdetails
- Product#
- Order#
- Qty
- Date#
- Salesman#
Dimension
Orders
- Order#
- Ordertype
Dimension
Products
- Product#
- Product-name
- Price
Dimension
Salesmen
- Salesman#
- Salesman-name
Dimension
Time
- Date#
- Date-Name
The fixed database structure implies that application programs
(drilling functions/aggregates) can be generated automatically!
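The star schema above can be sketched as table definitions. This is a minimal sketch using Python's built-in sqlite3 module; the column names (product_no for Product#, time_dim for Time, and so on) and the column types are assumptions made for illustration, since the slide only lists attribute names.

```python
import sqlite3

# Hypothetical names: the slide's Product#, Order#, ... are written as
# product_no, order_no, ... because '#' is awkward in SQL identifiers.
DDL = """
CREATE TABLE orders   (order_no    INTEGER PRIMARY KEY, ordertype TEXT);
CREATE TABLE products (product_no  INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE time_dim (date_no     INTEGER PRIMARY KEY, date_name TEXT);
-- The fact table holds one foreign key per dimension plus the measure Qty.
CREATE TABLE orderdetails (
    product_no  INTEGER REFERENCES products(product_no),
    order_no    INTEGER REFERENCES orders(order_no),
    date_no     INTEGER REFERENCES time_dim(date_no),
    salesman_no INTEGER REFERENCES salesmen(salesman_no),
    qty         INTEGER
);
"""

def build_star_schema():
    """Create the star schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn

def foreign_key_targets(conn, table):
    """Return the sorted names of the tables that `table` references."""
    # Each row of PRAGMA foreign_key_list describes one foreign key;
    # column 2 is the referenced table.
    rows = conn.execute(f"PRAGMA foreign_key_list({table})")
    return sorted(row[2] for row in rows)

conn = build_star_schema()
print(foreign_key_targets(conn, "orderdetails"))
```

Because every dimension is reached from the fact table through exactly one foreign key, a generator can read the list of referenced tables and produce the joins for any aggregation level.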
Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many
relationships towards the fact table:
Fact table
Orderdetails
- Product#
- Order#
- Qty
- Price
Dimension hierarchy
Orders
- Order#
- Customer#
- Date
Customers
- Customer#
- Customer-name
In a dimension hierarchy it is possible to aggregate data from the fact
table to the different levels of the hierarchy.
Drill-down = “de-aggregate” = break an aggregate into its constituents.
Roll-up = aggregate along one or more dimensions.
Two different types of drilling:
• Drilling in dimension hierarchies.
• Drilling between dimensions.
(Figure: star schema with the fact table Orderdetails and the dimensions Orders, Products, Salesmen and Time.)
Which star schemas or data marts can be built by using the illustrated
integrated E-commerce/ERP data model?
Which star schema would you recommend to be implemented first?
Product
Product#
ProductName
Price
Order-Detail
Product#
Order#
Qty
Price
Timestamp
Order-DetailHistory
Inv-Item#
Order#
Seq#
State
Timestamp
InvoyceHistory
Invoice#
Timestamp
State
Notes
Product-Stock
Product#
Location#
Qty
Order
Order#
OrderDate
Balance
State
Shipping
Shipping#
ShipMethod
ShipCharge
State
ShipDate
Address
Address#
Name
Add1
Add2
City
State
Zip
Shipping
Invoice
Invoice#
CreationDate
Location
Location#
Address
Customer
Customer#
Kredit-Limit
Balance
UserSession
Session#
IPaddress
#Click
Timestamp
UserAccount
Salesman#
PassWord
Timestamp
#visits
#trans
Ttl-tr-amount
Payment
Payment#
Ammount
State
Timestamp
Billing
CreditCard
Card#
HolderName
ExpireDate
Data mart = Kimball's term for any multidimensional
database/star schema.
A galaxy is a set of multidimensional
databases with conformed (i.e. shared and mutually
aligned) dimensions:
The value chain
Fact table
SaleOrderdetails
- Product#
- Sale-order#
- Qty
- Discount
- Sale-price
- Date#
Fact table
Storage-per-product
- Product#
- Date#
- End-of-daystorage-qty
Fact table
Purchaseorderdetails
- Product#
- Purchase-order#
- Purchase-price
- Qty
- Date#
Time dimension hierarchy
Year
- yy
Month
- yy
- mm
Day
- yy
- mm
- dd
Product dimension hierarchy
Products
- Product#
- Product-name
Product groups
- Product-group#
- Product-group-name
Suppose an enterprise has a data mart for Purchase and another data mart
for Sale as illustrated above. Is it possible to calculate the revenue per
month for the last year by using such a galaxy?
Conformed dimensions = dimensions designed to
be common for different data marts in order to
make drill across operations possible.
Conformed facts = measures with common units
of measurement and granularities that make it
possible to integrate measures from different fact
tables.
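A drill-across operation over a conformed time dimension could look roughly like the following sketch. It uses Python's sqlite3 module; the table and column names, the sample rows and the simplification of ignoring Discount are all assumptions for illustration. Each fact table is first aggregated to the month level on its own, and only the aggregated results are joined on the shared (yy, mm) key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed Day dimension shared by both data marts (hypothetical names).
CREATE TABLE day (date_no INTEGER PRIMARY KEY, yy INTEGER, mm INTEGER, dd INTEGER);
CREATE TABLE sale_orderdetails
    (product_no INTEGER, date_no INTEGER, qty INTEGER, sale_price REAL);
CREATE TABLE purchase_orderdetails
    (product_no INTEGER, date_no INTEGER, qty INTEGER, purchase_price REAL);
-- Made-up sample rows.
INSERT INTO day VALUES (1, 2023, 1, 10), (2, 2023, 1, 20), (3, 2023, 2, 5);
INSERT INTO sale_orderdetails VALUES (1, 1, 10, 5.0), (1, 2, 20, 5.0), (1, 3, 10, 6.0);
INSERT INTO purchase_orderdetails VALUES (1, 1, 50, 3.0), (1, 3, 10, 4.0);
""")

# Aggregate each fact table separately, then join the two results on the
# conformed month key. Joining the raw fact tables directly would multiply rows.
DRILL_ACROSS = """
WITH sales AS (
    SELECT d.yy, d.mm, SUM(f.qty * f.sale_price) AS sales_amount
    FROM sale_orderdetails f JOIN day d ON f.date_no = d.date_no
    GROUP BY d.yy, d.mm
), purchases AS (
    SELECT d.yy, d.mm, SUM(f.qty * f.purchase_price) AS purchase_amount
    FROM purchase_orderdetails f JOIN day d ON f.date_no = d.date_no
    GROUP BY d.yy, d.mm
)
SELECT s.yy, s.mm, s.sales_amount, p.purchase_amount
FROM sales s JOIN purchases p ON s.yy = p.yy AND s.mm = p.mm
ORDER BY s.yy, s.mm
"""
monthly = conn.execute(DRILL_ACROSS).fetchall()
for row in monthly:
    print(row)
```

The inner join keeps only months that occur in both data marts; whether the two amounts can be combined into a meaningful figure depends on the facts being conformed as well.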
Is it possible to calculate the revenue per
month for the last year if the data mart for
Purchase and the data mart for Sale do
not have conformed dimensions or facts?
(Figure: the value chain galaxy with the fact tables SaleOrderdetails, Storage-per-product and Purchaseorderdetails, the Time dimension hierarchy Year, Month, Day and the Product dimension hierarchy Products, Product groups.)
Data warehouse aggregating to the product level:
SELECT p.Product#, SUM(od.Qty * p.Price) AS Turnover
FROM Orderdetails od JOIN Products p ON od.Product# = p.Product#
GROUP BY p.Product#;
(Figure: star schema with the fact table Orderdetails and the dimensions Orders, Products, Salesmen and Time.)
Drill down to the Product per Salesman level:
SELECT p.Product#, s.Salesman#, SUM(od.Qty * p.Price) AS Turnover
FROM Orderdetails od
JOIN Products p ON od.Product# = p.Product#
JOIN Salesmen s ON od.Salesman# = s.Salesman#
GROUP BY p.Product#, s.Salesman#;
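The two aggregation levels can be tried out on a small in-memory database. The following is a minimal sketch with Python's sqlite3 module; the column names, the prices and the quantities are made-up assumptions, chosen so that the drilled-down turnovers match the Smith/Jones figures used later in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical names and sample rows; '#' columns are written as *_no.
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE orderdetails (product_no INTEGER, salesman_no INTEGER, qty INTEGER);
INSERT INTO products VALUES (1, 'Screw', 10), (2, 'Bolt', 30), (3, 'Nut', 20);
INSERT INTO salesmen VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO orderdetails VALUES
    (1, 1, 1000), (2, 1, 1000), (3, 1, 3000), (1, 2, 2000), (3, 2, 2000);
""")

# Aggregation to the product level: one GROUP BY argument.
PER_PRODUCT = """
SELECT p.product_name, SUM(od.qty * p.price) AS turnover
FROM orderdetails od JOIN products p ON od.product_no = p.product_no
GROUP BY p.product_name
ORDER BY p.product_name
"""

# Drill down to product per salesman: one more join and GROUP BY argument.
PER_PRODUCT_AND_SALESMAN = """
SELECT p.product_name, s.salesman_name, SUM(od.qty * p.price) AS turnover
FROM orderdetails od
JOIN products p ON od.product_no = p.product_no
JOIN salesmen s ON od.salesman_no = s.salesman_no
GROUP BY p.product_name, s.salesman_name
ORDER BY p.product_name, s.salesman_name
"""

per_product = conn.execute(PER_PRODUCT).fetchall()
per_product_and_salesman = conn.execute(PER_PRODUCT_AND_SALESMAN).fetchall()
print(per_product)
print(per_product_and_salesman)
```

Note that the sketch multiplies by the price stored in the product dimension, so a later price change would alter the turnover of old orders.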
Where should the Price be stored?
(Figure: star schema with the fact table Orderdetails and the dimensions Orders, Products, Salesmen and Time; Price is shown in the Products dimension.)
Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards
the fact table:
Fact table
Orderdetails
- Product#
- Order#
- Qty
- Price
Dimension hierarchy
Orders
- Order#
- Customer#
- Date
Customers
- Customer#
- Customer-name
In contrast to a star schema, a snowflake schema may have
dimension hierarchies.
Describe the advantages and disadvantages of using dimension
hierarchies/snowflake schemas.
Snowflake schema with branches:
A snowflake schema may have branches in the dimension hierarchies:
Fact table
Orderdetails
- Product#
- Order#
- Qty
Dimension hierarchy
Orders
- Order#
- Customer#
- Date
Customers
- Customer#
- Customer-name
Snowflake hierarchy
Salesmen
- Salesman#
- Salesman-name
- Branch-office#
Branch offices
- Branch-office#
- Region#
Regions
- Region#
- Region-name
Dimension hierarchy
Products
- Product#
- Product-name
- Price
- Group#
Product groups
- Group#
- Group-name
- Department#
Departments
- Department#
- Department-name
Are Customers related to the regions?
Are Customers related to the regions?
The aggregation level is the argument to the GROUP BY statement.
(Figure: star schema with the fact table Orderdetails and the dimensions Orders, Products, Salesmen and Time; Salesmen also carries Branch-Office#.)
Salesman#   Productname   Turnover   Branch-office#
Smith       Screw         10,000     LA
Smith       Bolt          30,000     LA
Smith       Nut           60,000     LA
Jones       Screw         20,000     SF
Jones       Nut           40,000     SF
...
(Figure: aggregated data computed from the non-aggregated data x1, x2, …, xn.)
Drilling in dimension hierarchies:
(Figure: snowflake schema with the fact table Orderdetails, the dimension hierarchy Orders, Customers, the dimension hierarchy Products, Product groups, Departments and the snowflake hierarchy Salesmen, Branch offices.)
Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF
Branch-office#   Turnover
LA               400,000
SF               200,000
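Rolling up within a dimension hierarchy amounts to joining one level further out and grouping by that level. The sketch below uses Python's sqlite3 module and reproduces the salesman and branch-office totals shown above; the table names, the column names and the way each salesman's turnover is split into fact rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical snowflake hierarchy: facts -> salesmen -> branch_offices.
CREATE TABLE branch_offices (branch_office_no TEXT PRIMARY KEY);
CREATE TABLE salesmen (
    salesman_no      INTEGER PRIMARY KEY,
    salesman_name    TEXT,
    branch_office_no TEXT REFERENCES branch_offices(branch_office_no)
);
CREATE TABLE sales_facts (
    salesman_no INTEGER REFERENCES salesmen(salesman_no),
    turnover    INTEGER
);
INSERT INTO branch_offices VALUES ('LA'), ('SF');
INSERT INTO salesmen VALUES (1, 'Smith', 'LA'), (2, 'Jones', 'LA'), (3, 'Adams', 'SF');
-- Made-up fact rows whose sums match the totals in the slide.
INSERT INTO sales_facts VALUES (1, 40000), (1, 60000), (2, 300000), (3, 200000);
""")

# Lower level of the hierarchy: one row per salesman.
PER_SALESMAN = """
SELECT s.salesman_name, SUM(f.turnover), s.branch_office_no
FROM sales_facts f JOIN salesmen s ON f.salesman_no = s.salesman_no
GROUP BY s.salesman_no, s.salesman_name, s.branch_office_no
ORDER BY s.salesman_no
"""

# Roll up one level: join through to branch_offices and group by that key.
PER_BRANCH_OFFICE = """
SELECT b.branch_office_no, SUM(f.turnover)
FROM sales_facts f
JOIN salesmen s ON f.salesman_no = s.salesman_no
JOIN branch_offices b ON s.branch_office_no = b.branch_office_no
GROUP BY b.branch_office_no
ORDER BY b.branch_office_no
"""

per_salesman = conn.execute(PER_SALESMAN).fetchall()
per_branch_office = conn.execute(PER_BRANCH_OFFICE).fetchall()
print(per_salesman)
print(per_branch_office)
```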
Drilling between dimension hierarchies:
(Figure: snowflake schema with the fact table Orderdetails, the dimension hierarchy Orders, Customers, the Products dimension and the snowflake hierarchy Salesmen, Branch offices.)
Salesman#   Productname   Turnover   Branchoffice#
Smith       Screw         10,000     LA
Smith       Bolt          30,000     LA
Smith       Nut           60,000     LA
Jones       Screw         20,000     SF
Jones       Nut           40,000     SF
...
Salesman#   Turnover   Branchoffice#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF
Roll up to the top level:
Salesman#   Productname   Turnover   Branchoffice#
Smith       Screw         10,000     LA
Smith       Bolt          30,000     LA
Smith       Nut           60,000     LA
Jones       Screw         20,000     SF
Jones       Nut           40,000     SF
...
Roll up can be executed by removing one or more arguments from
the GROUP BY statement.
Roll up to the product level:
Productname   Turnover
Screw         100,000
Bolt          200,000
Nut           300,000
Roll up to the top level:
Turnover
600,000
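The idea that a roll-up only removes GROUP BY arguments can be sketched with a small helper that builds the query from a list of grouping columns. This uses Python's sqlite3 module; the flat sales table, its column names and the aggregate helper are hypothetical, and only the five rows listed in the slide are loaded, so the totals are smaller than the slide's, which also covers the elided rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flat table holding only the rows listed in the slide.
CREATE TABLE sales (salesman TEXT, productname TEXT, turnover INTEGER);
INSERT INTO sales VALUES
    ('Smith', 'Screw', 10000), ('Smith', 'Bolt', 30000), ('Smith', 'Nut', 60000),
    ('Jones', 'Screw', 20000), ('Jones', 'Nut', 40000);
""")

ALLOWED_LEVELS = ("salesman", "productname")

def aggregate(conn, group_by):
    """Sum turnover at the aggregation level given by the columns in group_by."""
    for column in group_by:
        if column not in ALLOWED_LEVELS:
            raise ValueError(f"unknown aggregation level: {column}")
    columns = ", ".join(group_by)
    if columns:
        sql = (f"SELECT {columns}, SUM(turnover) FROM sales "
               f"GROUP BY {columns} ORDER BY {columns}")
    else:
        # No GROUP BY arguments left: the top level, a single grand total.
        sql = "SELECT SUM(turnover) FROM sales"
    return conn.execute(sql).fetchall()

print(aggregate(conn, ["salesman", "productname"]))  # most detailed level
print(aggregate(conn, ["productname"]))              # roll up: drop salesman
print(aggregate(conn, []))                           # roll up to the top level
```

Drilling down is the reverse operation: adding a column to the list yields a finer aggregation level.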
Non-linear dimensions, e.g. the Date dimension:
• The granularity is day.
• Many different hierarchies.
• Two major problems:
– Calendar Week does not aggregate to
year.
– Type of Day distinguishes between
working day and holiday. However,
they are independent of the other
dimensions (e.g. Easter).
Levels in the hierarchies:
Fiscal Year
Calendar Year
Fiscal Quarter
Calendar Quarter
Fiscal Month
Calendar Month
Fiscal Week
Calendar Week
Type of Day
Day of Week
Day
What aggregation level would you use to calculate
the average sale on non-holiday Mondays per
month?
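The calendar-week problem can be seen by generating a row of the date dimension. The sketch below uses only Python's standard datetime module; it assumes ISO 8601 week numbering, treats weekends as non-working days, and leaves out the fiscal hierarchies, all of which are illustrative choices rather than something the slide prescribes. The function name and the attribute names are hypothetical.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def date_dimension_row(d, holidays=frozenset()):
    """Build the calendar attributes of one day-granularity dimension row."""
    iso_year, iso_week, iso_weekday = d.isocalendar()
    is_free_day = d in holidays or iso_weekday >= 6  # assumption: weekends are free
    return {
        "date": d.isoformat(),
        "calendar_year": d.year,
        "calendar_quarter": (d.month - 1) // 3 + 1,
        "calendar_month": d.month,
        # The ISO week belongs to iso_year, which can differ from d.year.
        "calendar_week": iso_week,
        "calendar_week_year": iso_year,
        "day_of_week": DAY_NAMES[iso_weekday - 1],
        "type_of_day": "holiday" if is_free_day else "working day",
    }

row = date_dimension_row(date(2021, 1, 1))
print(row["calendar_year"], row["calendar_week_year"], row["calendar_week"])
```

For 1 January 2021 the calendar year is 2021 while the ISO week is week 53 of 2020, so weeks cannot simply be summed into calendar years.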
The time dimension:
• The granularity is
minute.
• The top level is a
whole day.
Levels in the hierarchy:
Day Part
AM/PM Flag
Hour
Minute
Why do you think Kimball recommends
separating the date and time dimensions?
Degenerated dimension =
A dimension that is not created because nobody wants to aggregate data to the degenerated level.
Example:
The Order dimension should be deleted, while its Time and Customer attributes should be
created as new dimensions to which it is meaningful to aggregate data.
Orders
- Order#
- Time
- Customer#
Products
- Product#
- Product-name
- Price
Fact table
Orderdetails
- Product#
- Order#
- Qty
- Date#
- Salesman#
Salesmen
- Salesman#
- Salesman-name
Exercise:
The figure illustrates an
ER-diagram of a car rental
company like Hertz or Avis.
Design a snowflake schema,
star schema or galaxy for the
car rental company!
Customers
Branch offices
Orders
Contracts
Pick up
Reservations
Car return
Cars
Car types
Garage services
Garages
Major problems in data warehouse design:
Drilling in many-to-many relationships and tree structures.
Inconsistencies caused by “slowly changing dimensions”.
Slowly Changing Dimensions (SCD)
If the attributes of a dimension are dynamic
(i.e. they may be updated), we say that they are slowly changing.
May the Branch-size of a Branch-office change after e.g. a renovation?
May the Branch-name of a Branch-office change?
Fact table
Bank
accounts
- Account#
- Interest-last-year
- Cost-last-year
- Branch#
Dimension
Branch-offices
- Branch#
- Branch-name
- Branch-size
Exercise in SCD:
Suppose the attribute Branch-size is dynamic and aggregations are
made to the levels (Branch-size, Year) or (Branch-size, Month).
Does this aggregation make sense, and how would you solve
possible problems?
Exercise:
Is the region of the
customer a dynamic
attribute of the customer?
Does it make sense to
aggregate the rental revenue
to the region of the
customers?
(Figure: ER-diagram of the car rental company with the entities Customers, Branch offices, Orders, Contracts, Pick up, Reservations, Car return, Cars, Car types, Garage services and Garages.)
It is possible to cheat the application generator. That is,
special, very complicated data structures may function
as many-to-many or network relationships when they
are dealt with as 1-to-many relationships.
How would you recommend to design a datawarehouse
where it is possible to aggregate Sale to the Stock
locations used for the sale?
Product
Product#
ProductName
Price
Order-Detail
Product#
Order#
Qty
Price
Timestamp
Product-Stock
Product#
Location#
Qty
Order
Order#
OrderDate
Balance
State
Location
Location#
Address
Customer
Customer#
Kredit-Limit
Balance
UserSession
Session#
IPaddress
#Click
Timestamp
UserAccount
Salesman#
PassWord
Timestamp
#visits
#trans
Ttl-tr-amount
Exercise.
Design a datawarehouse for a travel agency.
Customers
Buyer
Orders
Traveler
Bookings
Reservations
Flight routes/
Room types/
Car types/
service types
Departures/
Hotel rooms/
Car rentals/
etc.
Product
owners
Design a data warehouse (or galaxy) for an ERP system
with as many meaningful dimensions as possible:
The sales module
Customers
Orders
Orderlines
Stocks per product
per location
Products
The account module offers
services to the other ERP
modules.
Account
items
Accounts
End of session
Thank you !!!
Evaluation criteria per response type:
Response 1 where dimension records are overwritten
- Is historical information preserved: No
- Aggregation performance: In the evaluation, we define this solution to have average performance
- Storage consumption: Only the current dimension record version is stored. No redundant data is stored
Response 2 where new versions are created
- Is historical information preserved: Yes
- Aggregation performance: Version records make performance slower proportional to the number of changes
- Storage consumption: All old versions of dimension records are stored, often with redundant attributes
Response 3 where only one historical version is saved
- Is historical information preserved: The current version and a single history destroying version are saved
- Aggregation performance: No performance degradation occurs if either the current or the historical version is used in a query
- Storage consumption: Normally, only a single extra attribute version is stored
Response 4 that uses the top of a dynamic dimension hierarchy as a new static dimension
- Is historical information preserved: Yes
- Aggregation performance: Better or worse depending on whether both dimension tables are used in a query
- Storage consumption: The relatively large fact table must have an extra foreign key attribute
Response 5 with dimension data as fact data
- Is historical information preserved: Yes
- Aggregation performance: Better or worse depending on whether the new fact data are used in a query
- Storage consumption: The relatively large fact table must have an extra attribute for each dynamic dimension attribute
Response 6 that uses fine granularity in combination with response 1 or 3
- Is historical information preserved: The finer the granularity, the more historical state information is preserved
- Aggregation performance: The finer the granularity, the slower the performance
- Storage consumption: The finer the granularity, the more storage consumption
Response 7 that stores dynamic dimension data as static facts in another data mart
- Is historical information preserved: Yes
- Aggregation performance: Better or worse depending on whether both fact tables are used in a drill across query
- Storage consumption: This is the most storage consuming solution as at least a new fact and foreign key are stored in the new fact table
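Response 2 can be sketched for the Branch-size example as follows. The sketch uses Python's sqlite3 module; the surrogate key branch_key, the is_current flag, the year column in the fact table and the sample values are all assumptions introduced for illustration and do not appear in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Response 2: every change creates a new version row with its own key.
CREATE TABLE branch_offices (
    branch_key  INTEGER PRIMARY KEY,  -- surrogate key (hypothetical)
    branch_no   INTEGER,              -- business key, Branch# in the slide
    branch_size TEXT,
    is_current  INTEGER
);
-- Facts point at the dimension version that was valid when they occurred.
CREATE TABLE account_facts (
    account_no INTEGER, year INTEGER, interest REAL, branch_key INTEGER
);
INSERT INTO branch_offices VALUES (1, 10, 'small', 1);
INSERT INTO account_facts VALUES (100, 2022, 50.0, 1);
""")

def change_branch_size(conn, branch_no, new_size):
    """Retire the current version and add a new one; return the new surrogate key."""
    conn.execute(
        "UPDATE branch_offices SET is_current = 0 "
        "WHERE branch_no = ? AND is_current = 1", (branch_no,))
    cursor = conn.execute(
        "INSERT INTO branch_offices (branch_no, branch_size, is_current) "
        "VALUES (?, ?, 1)", (branch_no, new_size))
    return cursor.lastrowid

# The branch grows after a renovation; later facts use the new version.
new_key = change_branch_size(conn, 10, "large")
conn.execute("INSERT INTO account_facts VALUES (100, 2023, 80.0, ?)", (new_key,))

by_size_and_year = conn.execute("""
SELECT b.branch_size, f.year, SUM(f.interest)
FROM account_facts f JOIN branch_offices b ON f.branch_key = b.branch_key
GROUP BY b.branch_size, f.year
ORDER BY f.year
""").fetchall()
print(by_size_and_year)
```

With response 1 the UPDATE would instead overwrite branch_size in place, and the 2022 interest would then be reported under 'large' as well.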
Where do the responses to SCDs store historic information?
• Response 1 does not store historic information.
• Response 2 stores historic information in a new
record version.
• Response 3 stores one historic value in a new
dimension attribute.
• Response 4 stores historic information in a new
dimension relationship.
• Response 5 stores historic information in a new fact
attribute.
• Response 6 can sometimes diminish the
aggregation error of response 1 as finer granularity